
MADRAS SCHOOL OF ECONOMICS

Lecture Notes
on courses taught by
Dr. Rakesh Nigam

Towards brevity, clarity and precision.

Edition 4

Edited by: Aditya Kedar Tata, Akash Gupta,
Ishita Gupta, Lilendar Rohidas, Rohith Krishna
Notes scribed by PGDM batch of 2019-21.

February 19, 2021


Preface

This book has been compiled from a series of lecture notes on a comprehensive
set of topics: Linear Algebra, Discrete Mathematics, Probability foundations,
Stochastic Processes, Computational Finance and Data Science. The lecture
notes on all of the above topics have been collectively written by the PGDM
(2019-21) batch at the Madras School of Economics, based on courses taught by
Dr. Rakesh Nigam.

Our objective with this edition is to create a resource of the mathematical
foundations needed to understand, and pursue further learning in, the fields of
Data Science and Finance. The mathematical foundations in this book serve the
purpose of enhancing one's understanding of the concepts, intuition and processes
underlying various Machine Learning algorithms and statistical computation methods.

Our attempt has been to make this book self-contained, in the sense that it covers
an extensive portion of the topics that form the building blocks of modern-day
Analytics and Research. The concepts related to Linear Algebra and Stochastics
have been built from the ground up; this should serve as a convenient reference
for students and professionals alike who are looking to pursue a deeper
understanding of topics in Analytics and Finance.

Editors
February 19, 2021

Features new in the 4th edition

• Linear Algebra - Linear Algebra concepts are essential for a better
understanding of analysis methods: multivariate data representations rely
heavily on Linear Algebra, code representations and implementations are based
on matrices and vectors, and various unsupervised Machine Learning methods are
built entirely on Linear Algebra computation methods. We start with the basics
of vector algebra, moving on to more complex algorithms like the Singular Value
Decomposition and Principal Component Analysis. Applications in Text Analytics
are also covered here.
• Advanced Probability - Most Machine Learning and Econometric models
are probabilistic in nature, making it essential for us to get a good grasp
of the fundamentals of probability theory. This part of the book first
builds the foundations of probability theory with concepts related to random
variables, combinatorics, expectation and variance, and subsequently moves
on to more advanced topics that include time series analysis, joint
distributions and conditional probability theorems.
• Stochastic Processes - We introduce the Markov Chain framework in this
section. This part of the book relies heavily on topics related to the
memoryless property, the Exponential and Poisson distributions, and discrete
time Markov Chains. There are sections on Queueing theory as well; queueing
models are constructed entirely using the concepts of Markov Chains and
stochastic processes.
• Stochastic Calculus and Computational Finance - In this section we continue
from where we left off in the Stochastic Processes section, into the
domain of continuous time Markov Chains and Martingales. We then move
on to advanced topics related to the Black-Scholes-Merton model for pricing
options.
• Topics in Data Science - This section is devoted to the mathematical under-
pinnings of some advanced computational frameworks that include - Natu-
ral Language Processing, Computer Vision and an introduction to Bayesian
models of statistical computation.
• Calculus and Discrete Mathematics - In our attempt to make this book as
self-contained as possible, we have included a rather comprehensive section
that covers a host of topics: Fourier Transforms, Fundamentals of Calculus,
Propositional Logic, Proof Techniques, Graph Theory and Recurrence Relations.

Contents

I Linear Algebra 15
1 Introduction to Linear Algebra 16
1.1 Introduction to Vector Addition and Subtraction . . . . . . . . . . . 16
1.1.1 Vector Addition and Scalar multiplication . . . . . . . . . . . . 16
1.2 Vector Dot product . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.1 Norm of a vector . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.2 Unit Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.3 Angles between vectors and dot products . . . . . . . . . . . . . 19
1.3 Solving system of Linear equations using Linear combinations . . . 19
1.4 Motivation from Chemistry . . . . . . . . . . . . . . . . . . . . . . 20
1.5 Application of Linear algebra in Mechanical Engineering and Economics . . . 21
1.5.1 Engineering Terminology . . . . . . . . . . . . . . . . . . . . . 21
1.5.2 Static analysis of a 3-bar truss structure . . . . . . . . . . . . . 22
1.5.3 Economic analysis of a manufacturing firm . . . . . . . . . . . . 25
1.5.4 Solving this problem for the firm . . . . . . . . . . . . . . . . . 26

2 On Solving Ax = b 28
2.1 Premise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1.1 Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Case 1: Full Rank (r = m = n) . . . . . . . . . . . . . . . . . . 29
2.3 Case 2: Full Column Rank (r = n < m) . . . . . . . . . . . . . . 30
2.3.1 Unique Solution . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.2 No solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Case 3: Full Row Rank (r = m < n) . . . . . . . . . . . . . . . . 32
2.5 Case 4: Not Full Rank (r < n, r < m) . . . . . . . . . . . . . . . 32
2.6 Applications in Economics: The Leontief Input-Output Model . . . 34
2.7 Applications in Physics: Kirchhoff’s Voltage and Current Laws . . . 35

3 Matrix Definiteness and Cramer's Rule 41


3.1 Definite and Semidefinite matrices . . . . . . . . . . . . . . . . . . 41
3.1.1 Tests for definiteness . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.2 Cramer’s rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.1 Higher order derivatives . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Gradients and Tangent planes . . . . . . . . . . . . . . . . . . . . . 44


3.5 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.1 Concavity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.2 Envelope theorem . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5.3 Comparative statics . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6 Multivariate optimization . . . . . . . . . . . . . . . . . . . . . 47
3.6.1 Multivariate comparative statics . . . . . . . . . . . . . . . . . . 48

4 Multivariate Data Analysis - Basic concepts 49


4.1 A short recap of SVD concepts . . . . . . . . . . . . . . . . . . . . . 49
4.1.1 The process of SVD . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1.2 Forms of Inverses . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Data Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.1 Centering the values . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.2 Variance and standard deviation . . . . . . . . . . . . . . . . . 53
4.2.3 Standard values . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.4 Basics on Matrix representations . . . . . . . . . . . . . . . . . 54
4.3 Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Introducing Linear Transformations . . . . . . . . . . . . . . . . . . 57
4.4.1 Derivatives as linear transformations . . . . . . . . . . . . . . . 58
4.5 Choosing a transformation matrix . . . . . . . . . . . . . . . . . . . 58
4.6 Change of Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.7 Constructing a matrix . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.8 Choosing the best bases . . . . . . . . . . . . . . . . . . . . . . . . 62

5 Singular Value Decomposition 64


5.1 Topics covered . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3 SVD as a lowest rank approximation . . . . . . . . . . . . . . . . . 65
5.4 The Symmetric R and L matrices . . . . . . . . . . . . . . . . . . . 65
5.4.1 Case 1 :- The R matrix . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.2 Case 2 : The L matrix . . . . . . . . . . . . . . . . . . . . . . . . 66
5.5 A general purpose inverse of a square matrix . . . . . . . . . . . . 67
5.6 Discovering the Left Inverse of a matrix . . . . . . . . . . . . . . . 68
5.6.1 Proof by contradiction . . . . . . . . . . . . . . . . . . . . . . . 69
5.6.2 Forming the Left Inverse . . . . . . . . . . . . . . . . . . . . . . 69
5.6.3 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

6 Principal Component Analysis 71


6.1 Changing the bases . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2.2 Diagonalize the Covariance matrix . . . . . . . . . . . . . . . . 75
6.3 Solve PCA using Eigen Decomposition . . . . . . . . . . . . . . . . 75
6.4 Using the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

7 Text Analytics using SVD 78



7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.2 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . 78
7.3 Visualizing SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.4 Information Retrieval Strategies . . . . . . . . . . . . . . . . . . . . 80
7.5 Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.6 Weighted-Zone Scoring . . . . . . . . . . . . . . . . . . . . . . . . 80
7.7 Term frequency ranking . . . . . . . . . . . . . . . . . . . . . . . . 82
7.8 Inverse document frequency ranking . . . . . . . . . . . . . . . . . 85

II Advanced Probability 87
8 Fundamentals of Probability 88
8.1 Combinatorics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.1.1 Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.1.2 Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.1.3 Binomial Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.2 Basic Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.3 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . 91
8.3.1 Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.3.2 Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.4 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.4.1 Poisson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.4.2 Geometric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
8.4.3 Negative Binomial . . . . . . . . . . . . . . . . . . . . . . . . . 93
8.5 Cumulative Frequency distributions . . . . . . . . . . . . . . . . . 94
8.6 General points about discrete variables . . . . . . . . . . . . . . . . 94
8.7 Conditions For Independence . . . . . . . . . . . . . . . . . . . . . 95
8.7.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.7.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
8.8 Axioms Of Probability . . . . . . . . . . . . . . . . . . . . . . . . . 96
8.9 Moment Generating Function . . . . . . . . . . . . . . . . . . . . . 97
8.10 Consequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.10.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.11 Properties Of Moment Generating Function . . . . . . . . . . . . . 101
8.12 Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

9 Sequences and Series 103


9.1 Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
9.1.1 General theorems and statements . . . . . . . . . . . . . . . . . 104
9.1.2 Squeeze theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.1.3 Increasing, Decreasing and Bounded . . . . . . . . . . . . . . . 105
9.2 Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
9.2.1 The Ratio and Root test . . . . . . . . . . . . . . . . . . . . . . 107

10 Basics of Convolutions 108


10.1 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

10.1.1 Sums of random variables . . . . . . . . . . . . . . . . . . . . . 108


10.1.2 Sum of two Uniform random variables . . . . . . . . . . . . . . 109

11 Limit theorems and Convergence - Part 1 111


11.1 Weak law of large numbers . . . . . . . . . . . . . . . . . . . . . . 111
11.2 Converse of Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 111
11.3 Counter Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
11.4 General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
11.5 Consequence of Weak Law of Large Number . . . . . . . . . . . . 116
11.6 WLLN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
11.7 Convergence in Mean . . . . . . . . . . . . . . . . . . . . . . . . . 116
11.8 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
11.9 Convergence of Random Variables . . . . . . . . . . . . . . . . . . 118
11.9.1 Convergence in distribution . . . . . . . . . . . . . . . . . . . . 120
11.9.2 Convergence in Probability . . . . . . . . . . . . . . . . . . . . 120
11.9.3 Convergence in Mean . . . . . . . . . . . . . . . . . . . . . . . 121

12 Limit theorems and Convergence - Part 2 123


12.1 Almost Sure Convergence . . . . . . . . . . . . . . . . . . . . . . . 123
12.1.1 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
12.1.2 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
12.2 Strong Law of Large Number . . . . . . . . . . . . . . . . . . . . . 126
12.2.1 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
12.3 Moment Generating Functions . . . . . . . . . . . . . . . . . . . . 130
12.3.1 Properties of MGF . . . . . . . . . . . . . . . . . . . . . . . . . 131
12.4 Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
12.4.1 Law of Large Number . . . . . . . . . . . . . . . . . . . . . . . 132
12.4.2 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . 134
12.5 Markov’s Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . 134
12.6 Chebyshev’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . 135
12.6.1 A proposition linked with Chebyshev’s inequality . . . . . . . . 136
12.7 The weak law of large numbers . . . . . . . . . . . . . . . . . . . . 136
12.8 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 137
12.8.1 Proving the theorem . . . . . . . . . . . . . . . . . . . . . . . . 137
12.9 The strong law of large numbers . . . . . . . . . . . . . . . . . . . 138
12.9.1 Proof of SLLN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
12.10 Convergence of sums of uniform random variables . . . . . . . . . 140

13 Borel-Cantelli Lemma 142


13.1 Fundamental guiding principles . . . . . . . . . . . . . . . . . . . . 142
13.1.1 Sample space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
13.1.2 Working with sets: part 1 . . . . . . . . . . . . . . . . . . . . . 142
13.1.3 Working with sets: part 2 . . . . . . . . . . . . . . . . . . . . . 143
13.1.4 Basic laws governing set operations . . . . . . . . . . . . . . . . 144
13.1.5 Axioms of probability . . . . . . . . . . . . . . . . . . . . . . . . 144
13.1.6 Probability of continuous sets . . . . . . . . . . . . . . . . . . . 144
13.2 Countable and Uncountable sets . . . . . . . . . . . . . . . . . . . 145

13.3 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146


13.3.1 Revisiting the background of discussion . . . . . . . . . . . . . 146
13.4 Lim Sup and Lim Inf . . . . . . . . . . . . . . . . . . . . . . . . . . 147
13.4.1 An illustrative example: 1 . . . . . . . . . . . . . . . . . . . . . 148
13.4.2 An illustrative example: 2 . . . . . . . . . . . . . . . . . . . . . 149
13.4.3 Reiterating the Definitions . . . . . . . . . . . . . . . . . . . . . 150

14 Time Series Analysis: Basics 151


14.1 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
14.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
14.3 Properties of Power spectral density Sxx (ω) : . . . . . . . . . . . . . 152
14.4 White Noise (WN): . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
14.5 Discrete Time Stochastic Processes . . . . . . . . . . . . . . . . . . 157
14.5.1 Sampling a CTSP : Continuous Time Stochastic Processes . . . . 160
14.6 Strong stationarity: . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
15.7 Cyclostationary process . . . . . . . . . . . . . . . . . . . . . . . 164
14.8 White Noise is a special Stochastic Process . . . . . . . . . . . . . . 165
14.9 Gaussian Random Process – GRP . . . . . . . . . . . . . . . . . . . 165
14.10 Summary: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

15 Time Series Analysis: Intermediate 168


15.1 Reviewing Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . 168
15.2 Fundamentals of Time Series . . . . . . . . . . . . . . . . . . . . . 169
15.2.1 Stationary processes . . . . . . . . . . . . . . . . . . . . . . . . 169
15.3 Difference Equations . . . . . . . . . . . . . . . . . . . . . . . . . . 171
15.4 Lag Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
15.4.1 First order difference equation . . . . . . . . . . . . . . . . . . 172
15.4.2 Second order difference equation . . . . . . . . . . . . . . . . . 173
15.5 White noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
15.6 Moving average process . . . . . . . . . . . . . . . . . . . . . . . . 174
15.6.1 MA(q) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
15.7 Infinite order moving average . . . . . . . . . . . . . . . . . . . . . 176
15.8 Autoregressive process . . . . . . . . . . . . . . . . . . . . . . . . . 177
15.8.1 AR(2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
15.8.2 AR(p) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
15.9 ARMA(p,q) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
15.10 Invertibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
15.11 Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
15.12 Stochastic processes . . . . . . . . . . . . . . . . . . . . . . . . . . 183
15.12.1 Stationary stochastic process . . . . . . . . . . . . . . . . . . 183
15.12.2 Nonstationary stochastic process . . . . . . . . . . . . . . . . . 184
15.12.3 Random walk with drift . . . . . . . . . . . . . . . . . . . . . . 185
15.12.4 Unit root process . . . . . . . . . . . . . . . . . . . . . . . . . 185
15.13 TS and DS processes . . . . . . . . . . . . . . . . . . . . . . . . . 185
15.14 Integrated stochastic process . . . . . . . . . . . . . . . . . . . . . 186
15.14.1 Properties of integrated series . . . . . . . . . . . . . . . . . . 187
15.15 Spurious regression . . . . . . . . . . . . . . . . . . . . . . . . . . 187

15.16 Tests of stationarity: ACF . . . . . . . . . . . . . . . . . . . . . . . 187


15.16.1 Statistical significance of ACF . . . . . . . . . . . . . . . . . . 188
15.17 The unit root test . . . . . . . . . . . . . . . . . . . . . . . . . . 189
15.17.1 Augmented DF test . . . . . . . . . . . . . . . . . . . . . . . . 190
15.18 Transformation: Difference stationary process . . . . . . . . . . . . 191
15.18.1 Trend stationary process . . . . . . . . . . . . . . . . . . . . . 191
15.19 Cointegration: Regressing unit root over unit root . . . . . . . . . . 191
15.19.1 Testing for Cointegration . . . . . . . . . . . . . . . . . . . . . 192
15.20 AR, MA and ARIMA modeling . . . . . . . . . . . . . . . . . . . . . 193
15.20.1 Moving average process . . . . . . . . . . . . . . . . . . . . . . 193
15.20.2 ARMA process . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
15.20.3 ARIMA process . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
15.21 Box Jenkins methodology . . . . . . . . . . . . . . . . . . . . . . . 194
15.22 Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
15.23 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

16 Joint Distributions 197


16.1 Joint distribution functions . . . . . . . . . . . . . . . . . . . . . . 197
16.1.1 The joint continuous case . . . . . . . . . . . . . . . . . . . . . 198
16.1.2 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
16.1.3 Summing independent random variables . . . . . . . . . . . . . 199
16.2 Conditionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
16.2.1 Conditionals: continuous . . . . . . . . . . . . . . . . . . . . . 200
16.3 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
16.4 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
16.5 Conditional expectations . . . . . . . . . . . . . . . . . . . . . . . . 201
16.5.1 More arguments on conditional expectations . . . . . . . . . . . 201
16.6 Conditional variance . . . . . . . . . . . . . . . . . . . . . . . . . . 202

17 Conditional Probability 203


17.1 Joint Probability Distributions . . . . . . . . . . . . . . . . . . . . . 203
17.2 Conditional Distribution . . . . . . . . . . . . . . . . . . . . . . . . 205
17.2.1 Conditional PMF of X given Y . . . . . . . . . . . . . . . . . . . 205
17.3 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . 205
17.4 Law of Total Probability . . . . . . . . . . . . . . . . . . . . . . . . 206
17.4.1 Law of Total Expectation . . . . . . . . . . . . . . . . . . . . . . 206
17.4.2 Conditional Expectation as a Function of Random Variable . . . 207
17.5 Iterated Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 210
17.6 Conditional Variance . . . . . . . . . . . . . . . . . . . . . . . . . . 211
17.7 Law of Total Variance . . . . . . . . . . . . . . . . . . . . . . . . . 212
17.7.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
17.7.2 The Case of Independent Random Variables . . . . . . . . . . . 213
17.7.3 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
17.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

18 Multivariate Gaussian Distribution 216


18.1 Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

18.1.1 Expected value of random vectors . . . . . . . . . . . . . . . . . 217


18.1.2 Covariance Matrix for a random vector . . . . . . . . . . . . . . 217
18.1.3 Properties of the covariance matrix . . . . . . . . . . . . . . . . 218
18.2 Bivariate Gaussian Random Vectors . . . . . . . . . . . . . . . . . . 219
18.2.1 Joint PDF of Bivariate Normal . . . . . . . . . . . . . . . . . . . 221
18.3 Multivariate Gaussian Random Vector . . . . . . . . . . . . . . . . 223
18.3.1 PDF of a Gaussian Vector . . . . . . . . . . . . . . . . . . . . . . 223

III Stochastic Processes 225


19 Introduction to Markov Chains 226
19.1 Premise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
19.2 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
19.2.1 Memoryless Property . . . . . . . . . . . . . . . . . . . . . . . . 227
19.2.2 Continuous Time Markov Chains . . . . . . . . . . . . . . . . . 227
19.3 A Betting Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
19.3.1 Modelling the Game . . . . . . . . . . . . . . . . . . . . . . . . 228
19.4 Discrete Time Markov Chains . . . . . . . . . . . . . . . . . . . . . 230
19.4.1 Graphic Representation . . . . . . . . . . . . . . . . . . . . . . 231
19.5 Multiple step Transition Probability . . . . . . . . . . . . . . . . . . 231
19.5.1 Two step transition probability . . . . . . . . . . . . . . . . . . 232
19.5.2 Multiple Step transition probability . . . . . . . . . . . . . . . . 232

20 The Gambler’s ruin Framework 235


20.1 The Gambler’s ruin . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
20.1.1 Setting up the model . . . . . . . . . . . . . . . . . . . . . . . . 235
20.1.2 Solving the model: Getting the expressions . . . . . . . . . . . 236
20.1.3 Solving the model: solving recursively . . . . . . . . . . . . . . 237
20.1.4 The model under various bias factors . . . . . . . . . . . . . . . 238
20.2 Transient and Recurrent states . . . . . . . . . . . . . . . . . . . . 239
20.2.1 An illustrative example . . . . . . . . . . . . . . . . . . . . . . . 239
20.3 Communication between states . . . . . . . . . . . . . . . . . . . . 240
20.4 Characterizing number of visits . . . . . . . . . . . . . . . . . . . . 241

21 Setting the DTMC Framework 243


21.1 Limiting properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
21.2 Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
21.3 Irreducible MC properties . . . . . . . . . . . . . . . . . . . . . . . 245
21.4 Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
21.5 DTMC matrix equations . . . . . . . . . . . . . . . . . . . . . . . . 246
21.6 Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
21.7 Example 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

22 Introducing Inhomogeneous DTMCs 250


22.1 Inhomogeneous DTMCs: Theorem . . . . . . . . . . . . . . . . . . 250
22.1.1 Proof of the above theorem . . . . . . . . . . . . . . . . . . . . 250

22.1.2 An illustrative example . . . . . . . . . . . . . . . . . . . . . . . 251

23 Random Walks and Gambler’s Ruin 254


23.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
23.2 General Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
23.3 Particular Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
23.3.1 Case 1: (p < 1/2) . . . . . . . . . . . . . . . . . . . . . . . . . . 256
23.3.2 Case 2: (p = 1/2) . . . . . . . . . . . . . . . . . . . . . . . . . . 257
23.4 Biased Random Walk . . . . . . . . . . . . . . . . . . . . . . . . . . 258

24 Memoryless Property 260


24.0.1 Memoryless property of Exponential Distribution . . . . . . . . 260
24.0.2 Proving the memoryless property . . . . . . . . . . . . . . . . . 261
24.0.3 Shifting of PDF leads to memoryless property . . . . . . . . . . 262

25 The Poisson Process 264


25.1 Revisiting Notes on PDF . . . . . . . . . . . . . . . . . . . . . . . . 264
25.2 Poisson as a Binomial Approximation . . . . . . . . . . . . . . . . . 264
25.3 Recalling the Bernoulli Process . . . . . . . . . . . . . . . . . . . . 265
25.4 The Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . 266
25.4.1 Breaking down the Poisson Process . . . . . . . . . . . . . . . . 266
25.4.2 Modeling the cumulative arrival time . . . . . . . . . . . . . . . 267
25.5 Counting Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
25.6 The Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . 269
25.6.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
25.7 PASTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
25.7.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
25.7.2 Conceptual proof of PASTA . . . . . . . . . . . . . . . . . . . . . 272
25.7.3 A PASTA Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
25.8 Poisson Process Revisited . . . . . . . . . . . . . . . . . . . . . . . 273
25.8.1 Key properties derived . . . . . . . . . . . . . . . . . . . . . . . 274
25.9 Deriving the distribution . . . . . . . . . . . . . . . . . . . . . . . . 274
25.10 Interarrival times . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
25.11 PASTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277

26 Introduction to Queues 278


26.1 Frequency Interpretation . . . . . . . . . . . . . . . . . . . . . . . 278
26.1.1 An illustrative example . . . . . . . . . . . . . . . . . . . . . . . 279
26.2 The Birth-Death Model . . . . . . . . . . . . . . . . . . . . . . . . . 280
26.3 Queues and Communication Networks . . . . . . . . . . . . . . . . 282
26.3.1 Deriving the state probabilities . . . . . . . . . . . . . . . . . . 283
26.4 The Buffer Drops Problem . . . . . . . . . . . . . . . . . . . . . . . 284
26.4.1 Results in terms of Detailed Balance . . . . . . . . . . . . . . . 286
26.5 Concurrent Arrivals and Departures . . . . . . . . . . . . . . . . . 286
26.6 Limited Queue Size . . . . . . . . . . . . . . . . . . . . . . . . . . . 288

27 Queueing models - M/M/1 290



27.1 The underlying concepts . . . . . . . . . . . . . . . . . . . . . . . . 290


27.1.1 Notes on the exponential distribution . . . . . . . . . . . . . . . 290
27.2 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
27.3 The M/M/1 Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
27.3.1 Waiting times . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
27.4 M/M/s queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
27.4.1 Prerequisites: infinitesimal rates . . . . . . . . . . . . . . . . . . 293
27.4.2 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
27.4.3 Getting the average measures . . . . . . . . . . . . . . . . . . . 295

28 Continuous Time Markov Chains 296


28.1 Motivation for CTMC . . . . . . . . . . . . . . . . . . . . . . . . . . 296
28.2 Time Discreteness and Markov Property . . . . . . . . . . . . . . . 296
28.3 Properties of CTMC . . . . . . . . . . . . . . . . . . . . . . . . . . 298
28.4 Chapman Kolmogorov Equation . . . . . . . . . . . . . . . . . . . . 299
28.5 Rate Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
28.6 Description of the CTMC . . . . . . . . . . . . . . . . . . . . . . . . 301
28.7 Forward Kolmogorov Equation . . . . . . . . . . . . . . . . . . . . 304
28.8 Backward Kolmogorov Equation . . . . . . . . . . . . . . . . . . . 305
28.9 Stationary Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 306
28.10 Mental image of CTMC . . . . . . . . . . . . . . . . . . . . . . . . 306

29 CTMC and Embedded MC 307


29.1 Transition Rate Matrices . . . . . . . . . . . . . . . . . . . . . . . . 307
29.2 Global Balance Equations . . . . . . . . . . . . . . . . . . . . . . . 308
29.3 Behavior in stationarity . . . . . . . . . . . . . . . . . . . . . . . . 309
29.4 Solving the balance equations . . . . . . . . . . . . . . . . . . . . . 310
29.5 Embedded Markov Chains . . . . . . . . . . . . . . . . . . . . . . . 311

IV Stochastic Calculus & Computational Finance 313

V Topics in Data Science 314

VI Topics in Calculus & Discrete Mathematics 315
30 Fourier Transforms 316
30.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
30.2 Fourier series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
30.3 Amplitudes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
30.4 Alternate forms of writing the series . . . . . . . . . . . . . . . . . 318
30.5 Fourier Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
30.6 Spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319

30.6.1 The autocorrelation theorem . . . . . . . . . . . . . . . . . . . 320

31 Laplace, Dirac Delta and Fourier Series 321


31.1 The Laplace Transform . . . . . . . . . . . . . . . . . . . . . . . . . 321
31.1.1 Examples and Properties . . . . . . . . . . . . . . . . . . . . . . 321
31.1.2 Expanding a little on Heaviside function . . . . . . . . . . . . . 323
31.1.3 Inverse Laplace transform . . . . . . . . . . . . . . . . . . . . . 323
31.2 The Dirac Delta Impulse function . . . . . . . . . . . . . . . . . . . 324
31.2.1 Filtering property . . . . . . . . . . . . . . . . . . . . . . . . . . 326
31.3 Fourier Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
31.3.1 Fourier series formula . . . . . . . . . . . . . . . . . . . . . . . 327

32 A Primer in Calculus 329


32.1 What are intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
32.1.1 Solving inequalities . . . . . . . . . . . . . . . . . . . . . . . . . 329
32.1.2 Absolute value . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
32.2 Rates, Limits and Derivatives . . . . . . . . . . . . . . . . . . . . . 330
32.2.1 Differential calculus . . . . . . . . . . . . . . . . . . . . . . . . 331
32.3 Basics of integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
32.3.1 Integrals by substitution . . . . . . . . . . . . . . . . . . . . . . 332
32.3.2 Integration by parts . . . . . . . . . . . . . . . . . . . . . . . . . 333
32.4 Inverse functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334

33 Transforms and the Memoryless Property 336


33.1 Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
33.2 Memoryless property of the Exponential . . . . . . . . . . . . . . . 337
33.3 Finding the expectation of a Geometric R.V . . . . . . . . . . . . . 338

34 Recurrence Relations 340


34.1 Linear Recurrence Relations . . . . . . . . . . . . . . . . . . . . . . 340
34.1.1 Solving linear recurrence relations . . . . . . . . . . . . . . . . 340
34.1.2 An illustrative example . . . . . . . . . . . . . . . . . . . . . . . 341
34.2 Linear Nonhomogeneous Recurrence Relations . . . . . . . . . . . 342
34.2.1 An illustrative example . . . . . . . . . . . . . . . . . . . . . . . 342

35 Propositional Logic 344


35.1 Introduction to Logic . . . . . . . . . . . . . . . . . . . . . . . . . . 344
35.2 Propositional Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
35.2.1 Semantics of Propositional Logic . . . . . . . . . . . . . . . . . 345
35.2.2 Truth Tables and Interpretations . . . . . . . . . . . . . . . . . . 346
35.3 The Satisfiability Problem . . . . . . . . . . . . . . . . . . . . . . . 347
35.3.1 Truth tables and the Subset problem . . . . . . . . . . . . . . . 347
35.4 Satisfiability and Validity . . . . . . . . . . . . . . . . . . . . . . . . 350
35.4.1 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
35.4.2 Logical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . 351
35.5 P-NP problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
35.5.1 The Satisfiability Problem and P vs. NP . . . . . . . . . . . . . . 352

35.5.2 Entailment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353

36 Graph Theory 356


36.1 Review of graph theory . . . . . . . . . . . . . . . . . . . . . . . . 356
36.2 Applications in Physics: Kirchhoff’s Voltage and Current Laws . . . 357
36.3 Relation between Linear Algebra and Graph Theory . . . . . . . . . 361

37 Proof Techniques 362


37.1 Introduction to Proof Techniques . . . . . . . . . . . . . . . . . . . 362
37.2 Direct Proof Technique . . . . . . . . . . . . . . . . . . . . . . . . . 362
37.3 Indirect Proof Technique . . . . . . . . . . . . . . . . . . . . . . . . 363
37.3.1 Proof by Contrapositive . . . . . . . . . . . . . . . . . . . . . . 363
37.3.2 Proof by Contradiction . . . . . . . . . . . . . . . . . . . . . . . 364
37.4 Conditionals, Proof by Contradiction . . . . . . . . . . . . . . . . . 366
37.4.1 Conditional Statements . . . . . . . . . . . . . . . . . . . . . . 366
37.4.2 Proof by Contradiction . . . . . . . . . . . . . . . . . . . . . . . 367
37.5 Principle of Induction and some Examples . . . . . . . . . . . . . . 368
37.6 Recap of some important terminology . . . . . . . . . . . . . . . . 369
37.7 Example of Proof by Contradiction . . . . . . . . . . . . . . . . . . 369
37.7.1 Prove a given statement P using contradiction . . . . . . . . . . 369
37.8 Principle of Induction . . . . . . . . . . . . . . . . . . . . . . . . . 370
37.8.1 Proof by Induction . . . . . . . . . . . . . . . . . . . . . . . . . 370
37.8.2 Example 1 of Proof by Induction . . . . . . . . . . . . . . . . . 371
37.8.3 Example 2 of Proof by Induction . . . . . . . . . . . . . . . . . 372
Part I

Linear Algebra

Chapter 1

Introduction to Linear Algebra

Topics

• Introduction to vector addition and scalar multiplication


• Vector dot product and unit vectors
• System of linear equations represented using linear combinations
• Motivation from Chemistry
• Application in Mechanical Engineering
• Application in Economics

1.1 Introduction to Vector Addition and Subtraction


A vector is defined as having a magnitude and a direction. We represent it as an
arrow in the plane or in space. The length of the arrow is the vector's magnitude
and the direction of the arrow is the vector's direction. A vector having $n$
elements or components belongs to $\mathbb{R}^n$.

1.1.1 Vector Addition and Scalar multiplication


Consider two points in the $\mathbb{R}^2$ plane, represented by the column vectors $x$ and
$y$. Addition of two vectors gives us another vector in the same plane and can be
represented as:

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \tag{1.1}$$

$$z = x + y = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \tag{1.2}$$

A vector can be multiplied by any scalar, say $c$, to produce another vector in the
same plane, and this can be represented as:

$$cx = c\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} cx_1 \\ cx_2 \end{bmatrix} \tag{1.3}$$

Figure 1.1: Representing Vector Addition

Consider an example where we take the $x$ vector as the point $(4, 2)$ and the $y$
vector as the point $(-1, 2)$; upon adding the two vectors we get the vector $z$ as
a result. This is shown geometrically in Figure 1.1 and can be represented in
vector algebraic form as follows:

$$\begin{bmatrix} 4 \\ 2 \end{bmatrix} + \begin{bmatrix} -1 \\ 2 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \tag{1.4}$$

A combination of vector addition and scalar multiplication is used to form
linear combinations of the form $cx + dy$.

A short note on linear combinations : If one vector can be expressed as a
summation of scalar multiples of other vectors, it is known as a linear
combination of those vectors.
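These operations translate directly into array arithmetic. Here is a minimal
sketch in Python with NumPy (our choice of tooling; the vectors mirror the
example above):

```python
import numpy as np

x = np.array([4, 2])    # vector x = (4, 2)
y = np.array([-1, 2])   # vector y = (-1, 2)

z = x + y               # vector addition: (3, 4)
w = 3 * x               # scalar multiplication: (12, 6)
lc = 2 * x + 5 * y      # linear combination cx + dy with c = 2, d = 5

print(z, w, lc)         # [3 4] [12  6] [ 3 14]
```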

1.2 Vector Dot product


A vector dot product or inner product of the vectors $x = (x_1, x_2)$ and $y = (y_1, y_2)$
is $x \cdot y = x_1 y_1 + x_2 y_2$. If $\theta$ is the angle between the two vectors $x$ and $y$, then
the dot product can also be represented as $x \cdot y = \|x\|\,\|y\|\cos\theta$.

1.2.1 Norm of a vector



If $u \in \mathbb{R}^n$ then the norm or magnitude of $u$, denoted $\|u\|$, is defined as the
length or magnitude of the vector and can be calculated using the formula:

$$\|u\| = \sqrt{u_1^2 + u_2^2 + u_3^2 + \cdots + u_n^2} \tag{1.5}$$

The length of a vector with two elements is the square root of the sum of each
element squared. The magnitude of a vector is sometimes called the length of a
vector, or the norm of a vector. Essentially, the norm of a vector is a measure of
distance, symbolized by a double vertical bar. Usually a vector norm refers to the
norm in Euclidean space and is also known as the Euclidean norm.

A short note on Euclidean space : Euclidean space is the fundamental space of
classical (Euclidean) geometry, a concept introduced by the ancient Greek
mathematician Euclid of Alexandria. The set of $n$-tuples of real numbers equipped
with the dot product is essentially a Euclidean space of dimension $n$.

1.2.2 Unit Vectors


The word "unit" is always indicating that some measurement equals "one". The
unit price is the price for one item. A unit cube has sides of length one. A unit
circle is a circle with radius one. A unit Vector is a vector which has a length of 1
unit. Normally, i is used to represent unit vector in the direction of the x − axis
and j is used to represent unit vector in the direction of the y − axis. Any vector in
the R2 plane can be represented using linear combinations of the two unit vectors.
For example, any vector r can be written using unit vectors. Let us suppose r to
be the coordinates (2,2), then we can represent this as follows :

$$r = x\hat{i} + y\hat{j} \tag{1.6}$$

$$\hat{i} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad \hat{j} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \tag{1.7}$$

$$2\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 2\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \end{bmatrix} \tag{1.8}$$

Some notes to remember :

• The length of a vector $x$ is given by $\|x\| = \sqrt{x \cdot x}$

• If vector $x$ is perpendicular to vector $y$, then their dot product is $x \cdot y = 0$

• The dot product of unit vectors lies between $-1$ and $1$



1.2.3 Angles between vectors and dot products


Sometimes the dot product is called the scalar product. The dot product is also
an example of an inner product, so on occasion you may hear it called an
inner product. There is a nice geometric interpretation of the dot product that
links it with the angle between the vectors: if $\theta$ is the angle between $a$ and $b$,
with $0 \le \theta \le \pi$ as shown in the figure below, then $a \cdot b = \|a\|\,\|b\|\cos\theta$, so the
angle can be recovered from the dot product.

Figure 1.2: Angle between vectors
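These identities are easy to check numerically. A minimal NumPy sketch (the
vectors are chosen for illustration; they happen to be perpendicular):

```python
import numpy as np

a = np.array([4.0, 2.0])
b = np.array([-1.0, 2.0])

dot = np.dot(a, b)                 # a . b = 4*(-1) + 2*2 = 0 -> perpendicular
norm_a = np.linalg.norm(a)         # ||a|| = sqrt(a . a)
norm_b = np.linalg.norm(b)

# angle from the identity a . b = ||a|| ||b|| cos(theta)
cos_theta = dot / (norm_a * norm_b)
theta = np.arccos(cos_theta)       # pi/2 here, since the dot product is 0

a_hat = a / norm_a                 # unit vector in the direction of a
print(dot, theta, np.linalg.norm(a_hat))  # 0.0 1.5707... 1.0
```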

1.3 Solving system of Linear equations using Linear combinations
A linear equation in the variables $x_1, \dots, x_n$ is an equation that can be written in the
form:

$$a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots + a_n x_n = b \tag{1.9}$$

where $b$ and the coefficients $a_1, \dots, a_n$ are real or complex numbers, usually known
in advance. The subscript $n$ may be any positive integer. In textbook examples
and exercises, $n$ is normally between 2 and 5 so as to make computations easier.
In real-life problems, $n$ might be 50 or 5000, or anything for that matter. We will
attempt to solve a system of linear equations using linear combinations. Consider
a set of linear equations as follows:

$$2x - y = 0 \tag{1.10}$$
$$-x + 2y - z = -1 \tag{1.11}$$
$$-3y + 4z = 4 \tag{1.12}$$
We represent this system of equations in the form $Ax = b$, where $A$ is the
coefficient matrix, $x$ is the vector of unknowns and $b$ is the result vector:

$$\begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -3 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 0 \\ -1 \\ 4 \end{bmatrix} \tag{1.13}$$

Here the rows of $A$ are denoted $r_1^T, r_2^T, r_3^T$ and the columns $C_1, C_2, C_3$.

$$x\begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix} + y\begin{bmatrix} -1 \\ 2 \\ -3 \end{bmatrix} + z\begin{bmatrix} 0 \\ -1 \\ 4 \end{bmatrix} = \begin{bmatrix} 0 \\ -1 \\ 4 \end{bmatrix} \tag{1.14}$$

Solving the system of equations, we get $x = 0$, $y = 0$ and $z = 1$. In general, we
can write the system of equations as a linear combination of the columns of the
coefficient matrix $A$, as follows:

$$xC_1 + yC_2 + zC_3 = b \tag{1.15}$$
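As a quick numerical check of this solution, here is a small sketch using NumPy's
linear solver (added for illustration):

```python
import numpy as np

A = np.array([[ 2, -1,  0],
              [-1,  2, -1],
              [ 0, -3,  4]])
b = np.array([0, -1, 4])

x = np.linalg.solve(A, b)   # solves Ax = b for a square, invertible A
print(x)                    # [0. 0. 1.]  i.e. x = 0, y = 0, z = 1
```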

1.4 Motivation from Chemistry


Now we will demonstrate how to use the $Ax = b$ form of solving a system of
equations to come up with the chemical reaction that forms water. Let us say that
we do not know how many molecules of Hydrogen and Oxygen are needed to form
water, so our equation of unknowns would be $x\mathrm{H}_2 + y\mathrm{O}_2 \rightarrow z\mathrm{H}_2\mathrm{O}$. We can now
form two equations by balancing the number of atoms of each element, as follows:

$$x\mathrm{H}_2 + y\mathrm{O}_2 \rightarrow z\mathrm{H}_2\mathrm{O} \tag{1.16}$$

$$\text{Hydrogen: } 2x + 0y = 2z \;\rightarrow\; x + 0y - z = 0 \tag{1.17}$$

$$\text{Oxygen: } 0x + 2y = z \;\rightarrow\; 0x + 2y - z = 0 \tag{1.18}$$

We will now write this system of equations in the form $Ax = b$ and attempt to
solve it using the concept of linear combinations of the columns of the coefficient
matrix $A$, as follows:

$$\begin{bmatrix} 1 & 0 & -1 \\ 0 & 2 & -1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \tag{1.19}$$

$$x\begin{bmatrix} 1 \\ 0 \end{bmatrix} + y\begin{bmatrix} 0 \\ 2 \end{bmatrix} + z\begin{bmatrix} -1 \\ -1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \tag{1.20}$$

Now, setting $y = 1$, the solution to the system of equations turns out to
be $x = 2$ and $z = 2$, whereas if we set $y = 2$, the solution turns out to be $x = 4$
and $z = 4$. So the final equation for water can now be written as:

$$2\mathrm{H}_2 + \mathrm{O}_2 \rightarrow 2\mathrm{H}_2\mathrm{O} \tag{1.21}$$

$$4\mathrm{H}_2 + 2\mathrm{O}_2 \rightarrow 4\mathrm{H}_2\mathrm{O} \tag{1.22}$$

The resulting system of equations has an infinite number of solutions: for each
arbitrary value of $y$, we get a corresponding combination of values for $x$ and $z$.
As we can see, a very basic system of chemical equations can be easily solved using
matrix manipulation, vector dot products, linear combinations and vector addition
(as explained in the above sections). The illustration with chemical equations was
chosen simply to drive home the point that these concepts can be applied to any
field and can consequently simplify computations.
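Since the balancing system is homogeneous and underdetermined, one simple
numerical approach is to fix the free variable $y$ and solve the remaining square
system; a sketch of that idea (our own framing, not the only way):

```python
import numpy as np

y = 1.0  # fix the free variable

# With y fixed, the balancing equations  x - z = 0  and  2y - z = 0
# become a square system in the remaining unknowns (x, z):
A = np.array([[1.0, -1.0],
              [0.0, -1.0]])
b = np.array([0.0, -2.0 * y])

x, z = np.linalg.solve(A, b)
print(x, z)   # 2.0 2.0  ->  2 H2 + O2 -> 2 H2O
```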

1.5 Application of Linear algebra in Mechanical Engineering and Economics
We will now begin with the first set of projects involving the concepts explained
in this section. Our first project is about an application of linear algebra in Me-
chanical Engineering and the second one is related to an application in Business
Economics. In the mechanical application section, before the project description
begins, a quick recap of concepts and definitions has been added so as to make
the reader familiar with some common Engineering terminology. We then proceed
to illustrate the Static Analysis of a 3-bar Truss Structure using a small model
and some generated data. We solve the problem using Matrix computations and
arrive at a final solution, illustrating the problem solving ability of Linear algebra
techniques in practical applications. As for the Business Economics project, we
have analysed a specific case of a manufacturer where we show how the manu-
facturer obtains an optimal quantity of products to sell, again, using linear algebra
computation methodology.

1.5.1 Engineering Terminology


Common terminology used in mechanical engineering to describe the example
below:

• Beam : A beam is a structural element that primarily resists loads applied
laterally to the beam axis.

• Truss : A truss is an assembly of beams or other elements that creates a rigid


structure.

• Pin joints : A pin jointed truss is a structure made up from separate compo-
nents by connecting them together at pinned joints or nodes, usually to form
a series of triangles.

• Hinged joints : A hinged joint is a mechanical bearing that connects two solid
objects, allowing only a limited angle of rotation between them.

• Moment of force : It is a measure of the tendency of a force to cause a body
to rotate about a specific point or axis.

• Statically determinate : It is a structure in which the sum of forces in any


direction is zero and no acceleration is experienced.

• Tensile force : It is the capacity of a material or structure to withstand
loads tending to elongate it.

• Compressive strength : It is the capacity of a structure to withstand
loads tending to reduce size.

• Reaction force : A force acting in the opposite direction.

• Static equilibrium equations : The static equilibrium of a particle is an


important concept in statics. In a rectangular coordinate system the equi-
librium equations can be represented by three scalar equations, where the
sums of forces in all three directions are equal to zero.

1.5.2 Static analysis of a 3-bar truss structure


Truss structures are commonly used in Structural Engineering and Architech-
ture due to their superior stiffness and strength for a given amount of material. A
truss structure that consists of straight members connected by means of Pin joints
and supported at both ends by means of hinged joints or rollers such that they
cannot transfer moments, if described as statically determinate. Thus, loading
subjected to a truss structure results in either tensile or compressive forces in the
members, together with reation forces at the support. As we show below, a 3-bar
planar Truss figure is treated where the static equilibium equations are setup
using free body diagrams.

• Aim :

Since the static equilibrium equations of a truss structure tell us that the
sum of the forces along the x and y directions at every joint should be equal
to 0, we will use this concept to form the equations in matrix notation and
solve for the unknown variables: the bar forces and the reaction forces
along each beam.

• Procedure :

The 3-bar truss is subjected to a vertical load $F$ of value 1000 N at Node 1,
and a static analysis of this structure can determine the unknown internal
bar forces $N_{12}$, $N_{13}$, $N_{23}$ and the reaction forces $R_{2x}$, $R_{2y}$, $R_{3y}$. Internal bar
forces are taken as positive in tension.

Figure 1.3: 3 bar truss structure with pin joints

• Forming equations and Matrix representation :

The equations are now formed and represented using matrix notation as shown
below:

Joint 1: $\sum F_x = 0:\ -\cos\alpha\, N_{12} + \cos\beta\, N_{13} = 0$
Joint 1: $\sum F_y = 0:\ -\sin\alpha\, N_{12} - \sin\beta\, N_{13} - 1000 = 0$
Joint 2: $\sum F_x = 0:\ \cos\alpha\, N_{12} + N_{23} + R_{2x} = 0$
Joint 2: $\sum F_y = 0:\ \sin\alpha\, N_{12} + R_{2y} = 0$
Joint 3: $\sum F_x = 0:\ -N_{23} - \cos\beta\, N_{13} = 0$
Joint 3: $\sum F_y = 0:\ \sin\beta\, N_{13} + R_{3y} = 0$

This is a linear system of equations with 6 unknowns, and it can be written
in $Ax = b$ matrix form as:

$$\begin{bmatrix} -\cos\alpha & 0 & \cos\beta & 0 & 0 & 0 \\ -\sin\alpha & 0 & -\sin\beta & 0 & 0 & 0 \\ \cos\alpha & 1 & 0 & 1 & 0 & 0 \\ \sin\alpha & 0 & 0 & 0 & 1 & 0 \\ 0 & -1 & -\cos\beta & 0 & 0 & 0 \\ 0 & 0 & \sin\beta & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} N_{12} \\ N_{23} \\ N_{13} \\ R_{2x} \\ R_{2y} \\ R_{3y} \end{bmatrix} = \begin{bmatrix} 0 \\ 1000 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

• Data set and Solving the system :

We now attempt to solve the system of equations for the unknown forces.
The measurements of the angles and their corresponding sine and cosine
values are as follows:

$\beta = 36.86^\circ$, $\alpha = 53.13^\circ$, $\cos\alpha = 0.6$, $\cos\beta = 0.8$, $\sin\alpha = 0.8$, $\sin\beta = 0.6$

The solution for the unknowns is as follows:

$$\begin{bmatrix} N_{12} \\ N_{23} \\ N_{13} \\ R_{2x} \\ R_{2y} \\ R_{3y} \end{bmatrix} = \begin{bmatrix} -800 \\ 480 \\ -600 \\ 0 \\ 640 \\ 360 \end{bmatrix}$$

• Conclusion :

We must note that the above approach is applicable to statically determinate
structures. Stresses in the bars can be computed as the bar forces divided by
the cross-sectional areas, and knowing the allowable stresses for the
materials used, it is possible to choose appropriate cross-sectional areas for
the bars.
The use of the matrix-based approach explained above makes it possible
to treat systems with many variables in a general, structured way. It
also helps us solve the system of equations to obtain the unknown bar
forces and reaction forces.
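As a numerical cross-check of the bar and reaction forces above, here is a minimal
NumPy sketch that assembles the $6 \times 6$ equilibrium matrix and solves it:

```python
import numpy as np

ca, sa = 0.6, 0.8   # cos(alpha), sin(alpha)
cb, sb = 0.8, 0.6   # cos(beta),  sin(beta)

# Unknowns ordered as [N12, N23, N13, R2x, R2y, R3y]
A = np.array([
    [-ca,  0.0,  cb, 0.0, 0.0, 0.0],   # Joint 1, sum Fx = 0
    [-sa,  0.0, -sb, 0.0, 0.0, 0.0],   # Joint 1, sum Fy = 0 (load on RHS)
    [ ca,  1.0, 0.0, 1.0, 0.0, 0.0],   # Joint 2, sum Fx = 0
    [ sa,  0.0, 0.0, 0.0, 1.0, 0.0],   # Joint 2, sum Fy = 0
    [0.0, -1.0, -cb, 0.0, 0.0, 0.0],   # Joint 3, sum Fx = 0
    [0.0,  0.0,  sb, 0.0, 0.0, 1.0],   # Joint 3, sum Fy = 0
])
b = np.array([0.0, 1000.0, 0.0, 0.0, 0.0, 0.0])

forces = np.linalg.solve(A, b)
print(forces)   # [-800. 480. -600. 0. 640. 360.]
```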

1.5.3 Economic analysis of a manufacturing firm


Most mathematical models used by economists ultimately involve a system of sev-
eral equations, which usually express how one or more endogenous variables de-
pend on several exogenous parameters. If these equations are all linear, the study
of such systems belongs to an area of mathematics called linear algebra. Even
if the equations are nonlinear, much may be learned from linear approximations
around the solution we are interested in—for example, how the solution changes
in response to small shocks to the exogenous parameters. Indeed, such models
lie right at the heart of the econometric techniques that form the basis of most
modern empirical economic analysis. The analysis and even the comprehension
of systems of linear equations becomes much easier if we use some key mathe-
matical concepts such as matrices, vectors, and determinants. These, as well as
their application to economic models, will be introduced in this chapter and in the
next. Actually, the usefulness of linear algebra extends far beyond its ability to
solve systems of linear equations. For instance, in the theory of differential and
difference equations, in linear and nonlinear optimization theory, in statistics and
econometrics, the methods of linear algebra are used extensively.

CASE : A firm manufactures 3 different types of chocolate bar, ‘ChocoA’, ‘ChocoB’


and ‘ChocoC’. The main ingredients in each are cocoa, milk and coffee. To pro-
duce 1000 ChocoA bars requires 5 units of cocoa, 3 units of milk and 2 units of
coffee. To produce 1000 ChocoB bars requires 5 units of cocoa, 4 of milk and 1 of
coffee, and the production of 1000 ChocoC bars requires 5 units of cocoa, 2 of milk
and 3 of coffee. The firm has supplies of 250 units of cocoa, 150 of milk and 100 of
coffee each week (and as much as it wants of the other ingredients, such as sugar).
Show that if the firm uses up its supply of cocoa, milk and coffee, then the number
of ChocoB bars produced each week equals the number of ChocoC bars produced.
How does the number of ChocoA bars produced relate to the production level of
the other two bars? Find the maximum possible weekly production of ChocoC
bars.

1.5.4 Solving this problem for the firm


Let us denote the weekly production levels of ChocoA, ChocoB and ChocoC bars
by $a, b, c$ (respectively), measured in thousands of bars. Then, since 5 units of
cocoa are needed to produce one thousand bars of each type, and since 250 units
of cocoa are used each week, we must have:

$$5a + 5b + 5c = 250 \tag{1.23}$$

By considering the distribution of the 150 available units of milk:

$$3a + 4b + 2c = 150 \tag{1.24}$$

Similarly, since 100 units of coffee are used, it must be the case that:

$$2a + b + 3c = 100 \tag{1.25}$$

In other words, we have the system of equations as follows:

$$5a + 5b + 5c = 250 \tag{1.26}$$
$$3a + 4b + 2c = 150 \tag{1.27}$$
$$2a + b + 3c = 100 \tag{1.28}$$
We solve this using elementary row operations, starting with the augmented
matrix, as follows:

$$\left[\begin{array}{ccc|c} 5 & 5 & 5 & 250 \\ 3 & 4 & 2 & 150 \\ 2 & 1 & 3 & 100 \end{array}\right] \rightarrow \left[\begin{array}{ccc|c} 1 & 1 & 1 & 50 \\ 3 & 4 & 2 & 150 \\ 2 & 1 & 3 & 100 \end{array}\right] \rightarrow \left[\begin{array}{ccc|c} 1 & 1 & 1 & 50 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right]$$
Therefore, the system is equivalent to:

$$a + b + c = 50$$
$$b - c = 0$$

from which we obtain $b = c$ and $a = 50 - b - c = 50 - 2c$. In other words, the
weekly production levels of ChocoB and ChocoC bars are equal, and the production
of ChocoA is (in thousands) $50 - 2c$, where $c$ is the production of ChocoC (and
ChocoB) bars. Clearly, none of $a, b, c$ can be negative, so the production level $c$
of ChocoC bars must be such that $a = 50 - 2c \ge 0$, that is, $c \le 25$. Therefore, the
maximum number of ChocoC bars which it is possible to manufacture in a week is
25000 (in which case the same number of ChocoB bars are produced and no ChocoA
bars will be manufactured).
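The row reduction above can also be reproduced programmatically; here is a small
sketch using SymPy's rref (exact rational arithmetic avoids rounding issues; the
library choice is ours):

```python
from sympy import Matrix

# Augmented matrix [A | b] for the cocoa / milk / coffee constraints
M = Matrix([[5, 5, 5, 250],
            [3, 4, 2, 150],
            [2, 1, 3, 100]])

R, pivots = M.rref()
print(R)        # rows: [1, 0, 2, 50], [0, 1, -1, 0], [0, 0, 0, 0]
print(pivots)   # (0, 1) -> a and b are pivot variables, c is free
```

The reduced form reads $a + 2c = 50$ and $b - c = 0$, matching the hand
computation above.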

Summary
• Introduction to Vector addition and scalar multiplication and how they
combine to form linear combinations.

• Vector dot products and use of unit vectors.

• The Ax = b form can be used to find solutions to systems of equations


such as the Chemistry examples described above.

• The Ax = b approach can also be used in applications in many fields


such as Mechanical Engineering and Economics.

• References for definitions : David Lay, Norman Biggs and Aalborg Uni-
versity Journal
Chapter 2

On Solving Ax = b

Solving linear equations


• The four different cases for the linear equations (reflected in rref, R) de-
pending on the rank r are:

– Case 1: Full Rank: r = m = n

– Case 2: Full column rank: r = n < m

– Case 3: Full row rank: r = m < n

– Case 4: Not full rank: r < n and r < m

2.1 Premise
Consider a matrix $A$ of order $m \times n$, where $m$ denotes the number of rows and
$n$ denotes the number of columns. Also consider the vector $b$ having $m$ rows and
the vector $x$ comprising $n$ variables.¹ Then a given system of $m$ linear equations in
$n$ variables can be written as:

$$Ax = b$$

The above equation can be visually represented as shown below:

$$\underset{m \times n}{A} \;\; \underset{n \times 1}{x} = \underset{m \times 1}{b} \tag{2.1}$$

Objective: We attempt to solve the above for any given set of linear equations.
This is achieved by the algorithmic procedure of elimination, which uses elementary
row operations. The key idea behind elimination is to obtain an upper triangular
matrix $U$. This matrix is then solved using back-substitution.

¹ All vectors considered here are column vectors, unless otherwise stated.

Algorithm for elimination

→ Given $Ax = b$, the first element of the first row in $A$ is pivoted.
→ The multiplier is $l_{ij} = \dfrac{\text{entry to eliminate in row } i}{\text{pivot in row } j}$.
→ We subtract $l_{ij}$ times equation $j$ from equation $i$, to make the $a_{ij}$ entry zero.
→ Pivot entries cannot be zero! If there is a zero in the pivot position, exchange
with a non-zero row below it.
→ Once $U$ is obtained, back-substitution of $Ux = c$ yields $x$.
→ When a full set of pivots cannot be obtained for every row and column,
$Ax = b$ has no solution or infinitely many solutions.
→ When the entries above the pivots are made zero, we obtain the reduced row
echelon form (rref), denoted by $R$. Here, $Rx = d$.

2.1.1 Rank
Definition 2.1.1 The rank of A is the number of pivots. This number is denoted
by r. Rank serves the purpose of determining the structure of R. There are four
cases which are described in the subsequent sections.

2.2 Case 1: Full Rank (r = m = n)


Let A be a square matrix (m = n), and let r = m = n.

• r = m =⇒ Every row is a pivot row and R has no zero rows. This signals
that each row is an independent row.

• r = n =⇒ Every column is a pivot column and R has no free columns. This


signals that there are no dependencies among columns.

• We know that,
r = m =⇒ Ax = b has a solution for every right side b ∈ Rm and,
r = n =⇒ Ax = b has only one solution (if it has any),
thus, Ax = b has a unique solution for r = m = n.

• The null space, N (A) contains only the zero vector x = 0 (because, full
column rank) and the column space C(A) = Rm (because, full row rank).

• The matrix A is invertible. Thus, x = A−1 b gives the unique solution.

• The rref for A has the structure R = I.

Consider the following system of linear equations:

x + 2y + z = 2
3x + 8y + z = 12
4y + z = 2 (2.2)

These equations can be represented in the form of a square matrix A of order


m = n = 3. We solve for Ax = b using elimination.
      
    [ 1 2 1 ] [ x ]   [  2 ]             [ 1 2 1 |  2 ]
    [ 3 8 1 ] [ y ] = [ 12 ]  =⇒ [A|b] = [ 3 8 1 | 12 ]     (2.3)
    [ 0 4 1 ] [ z ]   [  2 ]             [ 0 4 1 |  2 ]

On performing the row operations, we obtain U and R as follows:


   
             [ 1 2  1 |   2 ]             [ 1 0 0 |  2 ]
    [U |c] = [ 0 2 −2 |   6 ]  −→ [R|d] = [ 0 1 0 |  1 ]     (2.4)
             [ 0 0  5 | −10 ]             [ 0 0 1 | −2 ]

From the above, we can see that R = I and the unique solution is given by x =
2, y = 1, z = −2.
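
As a quick check of this unique solution (a minimal sketch assuming numpy is
available; np.linalg.solve is the standard routine):

    import numpy as np

    A = np.array([[1, 2, 1], [3, 8, 1], [0, 4, 1]])
    b = np.array([2, 12, 2])
    print(np.linalg.matrix_rank(A))   # 3: full rank, r = m = n
    print(np.linalg.solve(A, b))      # [ 2.  1. -2.], the unique solution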

2.3 Case 2: Full Column Rank (r = n < m)


Let A be a matrix of order m × n where r < m, and r = n.

• r = n =⇒ Every column is a pivot column. This signals that there are no


dependencies among columns. Thus, there are n pivot variables and no free
variables.

• r < m =⇒ Every row is not a pivot row and there are linear dependencies
among rows.

• r = n =⇒ Only r rows have pivots. There are (m − r) zero rows.

• Since A has full column rank, the null space contains only the zero vector,
N (A) = {0}. If b ∈ C(A), then Ax = b has a unique solution.

• If b ∉ C(A), then Ax = b does not have a solution.
• The rref for A has the structure

      R = [ I ]
          [ 0 ]

2.3.1 Unique Solution


Consider the following system of linear equations:

x + 5y = 1
2x + 4y = 2
3x + 9y = 3
8x + 10y = 8 (2.5)

These equations can be represented in the form of a matrix A of order 4 × 2. We
solve for Ax = b using elimination.

    [ 1  5 ]           [ 1 ]             [ 1  5 | 1 ]
    [ 2  4 ] [ x ]  =  [ 2 ]  =⇒ [A|b] = [ 2  4 | 2 ]     (2.6)
    [ 3  9 ] [ y ]     [ 3 ]             [ 3  9 | 3 ]
    [ 8 10 ]           [ 8 ]             [ 8 10 | 8 ]
On performing row operations, we obtain U and R as follows:

             [ 1  5 | 1 ]             [ 1 0 | 1 ]
    [U |c] = [ 0 −6 | 0 ]  −→ [R|d] = [ 0 1 | 0 ]     (2.7)
             [ 0  0 | 0 ]             [ 0 0 | 0 ]
             [ 0  0 | 0 ]             [ 0 0 | 0 ]

From the above, we can see that R = [ I ; 0 ].
It can also be seen that the system has a unique solution given by x = 1 and
y = 0. This is true because, b is a linear combination of the columns of A. Here, b
is exactly the first column of A. Therefore, b ∈ C(A) implies that there is a unique
solution for Ax = b.

2.3.2 No solution
Consider the following system of linear equations:
x + 5y = 2
2x + 4y = 1
3x + 9y = 1
8x + 10y = 4 (2.8)
These equations can be represented in the form of a matrix A of order 4 × 2. We
solve for Ax = b using elimination.

    [ 1  5 ]           [ 2 ]             [ 1  5 | 2 ]
    [ 2  4 ] [ x ]  =  [ 1 ]  =⇒ [A|b] = [ 2  4 | 1 ]     (2.9)
    [ 3  9 ] [ y ]     [ 1 ]             [ 3  9 | 1 ]
    [ 8 10 ]           [ 4 ]             [ 8 10 | 4 ]
On performing row operations, we obtain U and R as follows:

             [ 1  5 |  2 ]             [ 1 0 | −1/2 ]
    [U |c] = [ 0 −6 | −3 ]  −→ [R|d] = [ 0 1 |  1/2 ]     (2.10)
             [ 0  0 | −2 ]             [ 0 0 |  −2  ]
             [ 0  0 |  3 ]             [ 0 0 |   3  ]

From the above, we can see that R = [ I ; 0 ].
Clearly, 0x + 0y ≠ −2 or 3. Hence, Ax = b is not solvable as b ∉ C(A).

2.4 Case 3: Full Row Rank (r = m < n)


Let A be a matrix of order m × n, and let r = m < n.
• r = m =⇒ Every row is a pivot row and R has no zero rows. This signals
that each row is an independent row.

• r < n =⇒ There are r pivot columns and (n − r) free columns.

• r = m =⇒ Infinite solutions exist for Ax = b because b ∈ C(A) ∀ b ∈ Rm .

• Say, v ∈ N (A) such that v 6= 0 =⇒ Av = 0. For any constant α,

A(x + αv) = b
Ax + αAv = b
Ax = b (∵ Av = 0) (2.11)

There are infinite choices for α. Hence, infinite solutions exist.

• The rref for A has the structure R = [ I | F ]


Consider the following system of linear equations:

x + 3y + 4z + 3w = 2
2x + 5y + 8z + 11w = 3 (2.12)

These equations can be represented in the form of a matrix A of order 2 × 4. We


solve for Ax = b using elimination.
 
ñ ô x ñ ô ñ ô
1 3 4 3 y  2 1 3 4 3 2
 = =⇒ [A|b] = (2.13)
2 5 8 11  z  3 2 5 8 11 3
w

On performing the row operations, we obtain U and R as follows:

    [U |c] = [ 1  3 4 3 |  2 ]  −→ [R|d] = [ 1 0 4 18 | −1 ]     (2.14)
             [ 0 −1 0 5 | −1 ]             [ 0 1 0 −5 |  1 ]

From the above, we can see that R = [ I | F ]. b is obtained using linear combina-
tions of the columns of A. Thus infinite solutions are obtained when b ∈ C(A).

2.5 Case 4: Not Full Rank (r < n, r < m)


Let A be a matrix of order m × n where r < m and r < n.
• r < m =⇒ Every row is not a pivot row and there are linear dependen-
cies among rows. This gives a non-trivial null space, which brings out the
inconsistency and therefore, may give no solution if b 6∈ C(A).

• r < n =⇒ There are r pivot columns and (n − r) free columns. This signals
that there are infinite solutions if b ∈ C(A).
• The rref for A has the structure

      R = [ Ir  F ]
          [  0  0 ]

Consider the following system of linear equations:

1x + 2y + 2z + 2w = 1
2x + 4y + 6z + 8w = 5
3x + 6y + 8z + 10w = 6 (2.15)

These equations can be represented in the form of a matrix A of order 3 × 4. We
solve for Ax = b using elimination.

                     [ x ]
    [ 1 2 2  2 ]     [ y ]   [ 1 ]             [ 1 2 2  2 | 1 ]
    [ 2 4 6  8 ]     [ z ] = [ 5 ]  =⇒ [A|b] = [ 2 4 6  8 | 5 ]     (2.16)
    [ 3 6 8 10 ]     [ w ]   [ 6 ]             [ 3 6 8 10 | 6 ]

On performing the row operations, we obtain U and R as follows:

             [ 1 2 2 2 | 1 ]             [ 1 2 0 −2 | −2  ]
    [U |c] = [ 0 0 2 4 | 3 ]  −→ [R|d] = [ 0 0 1  2 | 1.5 ]     (2.17)
             [ 0 0 0 0 | 0 ]             [ 0 0 0  0 |  0  ]

In order to obtain the rref structure, columns c2 and c3 are exchanged.

            [ 1 0 2 −2 | −2  ]
    [R|d] = [ 0 1 0  2 | 1.5 ]     (2.18)
            [ 0 0 0  0 |  0  ]

The null space matrix N comprises the null space solutions for the set of equa-
tions. In order to obtain the null space solution, set Rx = 0 and solve for x. We
further observe that,

        [ −2  2 ]
    N = [  0 −2 ]  =  [ −F ]     (2.19)
        [  1  0 ]     [  I ]
        [  0  1 ]

The null space solution for R (with the columns in the exchanged order) also
turns out to be the null space solution for A. Hence, AN = O, that is,

         [ 1 2 2  2 ]   [ −2  2 ]   [ 0 0 ]
    AN = [ 2 6 4  8 ]   [  0 −2 ] = [ 0 0 ] = O     (2.20)
         [ 3 8 6 10 ]   [  1  0 ]   [ 0 0 ]
                        [  0  1 ]
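
These computations are easy to reproduce symbolically. A minimal sketch with
sympy (assumed available; rref() and nullspace() are standard sympy methods):

    from sympy import Matrix

    A = Matrix([[1, 2, 2, 2], [2, 4, 6, 8], [3, 6, 8, 10]])
    b = Matrix([1, 5, 6])
    R, pivots = Matrix.hstack(A, b).rref()
    print(pivots)         # (0, 2): pivot columns are columns 1 and 3, so r = 2
    print(A.nullspace())  # two special solutions spanning N(A)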

2.6 Applications in Economics: The Leontief Input-


Output Model
The Leontief Input-Output Model discusses the interdependence of industries on
each other. The model assumes that in an open economy, any firm has two kinds
of demands, namely, internal demand - demand placed by the industry itself as
well as other manufacturing industries to be used as inputs, and external demand
- demand placed by consumers outside the production sector i.e. to be used for
final consumption only. The total production of any firm is a summation of both
of these demands.

Let x be the production vector that lists the output for each sector and let d be
the final demand of the non-productive consumption sector. A will represent the
intermediate demand matrix resulting from the interdependencies among the in-
dustries. This can be represented as follows:

     
    ( amount produced ) = ( intermediate demand ) + ( final demand )
           (x)                      (Ax)                   (d)

Elaborating the above, we get,


       
    [ x1 ]   [ a11 a12 · · · a1n ] [ x1 ]   [ d1 ]
    [ x2 ]   [ a21 a22 · · · a2n ] [ x2 ]   [ d2 ]
    [ ·· ] = [  ··  ··         ·· ] [ ·· ] + [ ·· ]
    [ xn ]   [ an1 an2 · · · ann ] [ xn ]   [ dn ]

Here, aij represents the output produced by industry i that will be used in the
production of industry j while di represents the final external demand for industry
i. Then x will be computed as:

x = (I − A)−1 d (2.21)

Let us understand the above using an example.

Example: Consider an open economy with three industries, namely electricity


generation, services and automobile manufacturing plant. The industries are in-
terdependent on each other and also face some external demand from the govern-
ment. The extent of interdependency is described below:
• To produce one unit of electricity, the industry must consume 0.1 units of its
own production, 0.3 units of services and 0.1 units of the automobile industry.

• To produce one unit of services, the industry uses 0.25 units of electricity,
0.4 units of its own production and 0.15 units of automobile.

• To produce one unit of automobile, the industry consumes 0.2 units of elec-
tricity, 0.5 units of services and 0.1 units of its own production.

• The final demand for electricity, services and automobile was 50,000 units,
75,000 units and 1,25,000 units respectively.
How much should each industry produce in order to fulfil the total demand?

Solution: The above information can be represented in the form of the
input-output matrix (A); the demand vector d is:

        [ 0.1 0.25 0.2 ]             [   50,000 ]
    A = [ 0.3 0.40 0.5 ]   and   d = [   75,000 ]
        [ 0.1 0.15 0.1 ]             [ 1,25,000 ]

Here, (I − A) is given as:

              [  0.9 −0.25 −0.2 ]
    (I − A) = [ −0.3  0.60 −0.5 ]
              [ −0.1 −0.15  0.9 ]
Using Gaussian Elimination, the inverse matrix will be:

                [ 1.464 0.803 0.771 ]
    (I − A)−1 = [ 1.007 2.488 1.606 ]
                [ 0.330 0.503 1.464 ]
Clearly, the rank of the above matrix is equal to the order of the matrix, i.e. r =
m = n. Therefore, this system will have a unique solution in x which is given by

        [ 1.464 0.803 0.771 ]   [  50000 ]   [ 229921.59 ]
    x = [ 1.007 2.488 1.606 ] · [  75000 ] = [ 437795.27 ]
        [ 0.330 0.503 1.464 ]   [ 125000 ]   [ 237401.57 ]

Thus, the electricity generation industry will produce 229921.59 units of elec-
tricity, the service industry will produce 437795.27 units and the automobile
industry will produce 237401.57 units to cater to the combined internal and
external demand.
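
A quick numerical check of this example (a sketch assuming numpy; solving
(I − A)x = d directly avoids rounding the inverse, so the figures differ slightly
in the last digits):

    import numpy as np

    A = np.array([[0.10, 0.25, 0.20],
                  [0.30, 0.40, 0.50],
                  [0.10, 0.15, 0.10]])
    d = np.array([50_000, 75_000, 125_000])
    x = np.linalg.solve(np.eye(3) - A, d)   # x = (I - A)^(-1) d
    print(x.round(2))   # approx [229921.26  437795.28  237401.57]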

2.7 Applications in Physics: Kirchhoff’s Voltage and


Current Laws
The previous application dealt with the interdependencies of three industries with
respect to the demand for their products in the economy. One of the above in-
volved an electricity generation industry. It can be asserted that electricity is one of

the key components demanded for production in any industry. The supply of elec-
tricity to industries far and wide happens through a series of lines and networks.
The subsequent application in physics throws light on a very fundamental
law concerning networks, loops and junctions. These are the Kirchhoff's Voltage
and Current Laws.

The Kirchhoff's laws are conservation laws that form the foundation of all of cir-
cuit theory. Kirchhoff's voltage law states that the directed sum of potential drops
around any closed loop is zero. The Kirchhoff's current law states that the alge-
braic sum of currents at any node (junction) equals zero. Using the idea of networks
and graphs one can construct basic circuits, which can then be represented in the
matrix form. Further, we show how this is linked to the Ax = b form; It is also
shown that the Kirchhoff’s laws directly embody this form and the idea of rank
helps in extracting meaningful properties which are of relevance to physics.

[Figure 2.1: Circuit with 4 nodes x1 , x2 , x3 , x4 joined by edges b1 , . . . , b6 ]

Consider the above circuit with four nodes labelled 1, 2, 3 and 4. Let xi be the
electric potential at the ith node. The potential difference on the edge b1 is given
by x2 − x1 . Thus the vector b can be constructed as follows.
   
        [ b1 ]   [ x2 − x1 ]
        [ b2 ]   [ x3 − x2 ]    ( vector comprising the
    b = [ b3 ] = [ x1 − x3 ]      potential differences     (2.22)
        [ b4 ]   [ x4 − x1 ]      between the nodes )
        [ b5 ]   [ x4 − x2 ]
        [ b6 ]   [ x4 − x3 ]

The above equation can now be rewritten in order to bring about the Ax = b form
as shown below. The matrix A is of the order 6 × 4, with 6 edges and 4 nodes.

    −1x1 + 1x2 + 0x3 + 0x4 = b1
     0x1 − 1x2 + 1x3 + 0x4 = b2
     1x1 + 0x2 − 1x3 + 0x4 = b3
    −1x1 + 0x2 + 0x3 + 1x4 = b4
     0x1 − 1x2 + 0x3 + 1x4 = b5
     0x1 + 0x2 − 1x3 + 1x4 = b6

         [ −1  1  0  0 ]          [ b1 ]
         [  0 −1  1  0 ] [ x1 ]   [ b2 ]
    Ax = [  1  0 −1  0 ] [ x2 ] = [ b3 ] = b
         [ −1  0  0  1 ] [ x3 ]   [ b4 ]
         [  0 −1  0  1 ] [ x4 ]   [ b5 ]
         [  0  0 −1  1 ]          [ b6 ]

It can therefore be seen that Ax = b captures the essence of Kirchhoff’s Voltage


law. The matrix A6×4 gives information about the potential differences between
the nodes. This matrix A is known as the incidence matrix. We can solve Ax = b
using Gauss’ elimination method. The rref matrix R gives information about the
rank and free variable matrix F .
     
    [ −1  1  0  0 ]      [ −1  1  0  0 ]      [ 1 0 0 −1 ]
    [  0 −1  1  0 ]      [  0 −1  1  0 ]      [ 0 1 0 −1 ]     [ I3  F ]
    [  1  0 −1  0 ]  −→  [  0  0 −1  1 ]  −→  [ 0 0 1 −1 ]  =  [ 0   0 ]
    [ −1  0  0  1 ]      [  0  0  0  0 ]      [ 0 0 0  0 ]
    [  0 −1  0  1 ]      [  0  0  0  0 ]      [ 0 0 0  0 ]
    [  0  0 −1  1 ]      [  0  0  0  0 ]      [ 0 0 0  0 ]
    incidence matrix A   upper triangular U   rref matrix R

(rows are exchanged during elimination so that the three pivot rows come first,
and the pivots are scaled to 1 with the entries above them cleared)

We know that the null space matrix contains the vectors in N (A). The infor-
mation about the null space matrix N is found in the free-column block F of
the rref. Thus, the null space solution of Ax = 0 is:

        [ x1 ]       [ 1 ]
    N = [ x2 ] = α   [ 1 ]
        [ x3 ]       [ 1 ]
        [ x4 ]       [ 1 ]

The above result is significant in physics. It means that if the potential differ-
ences between the nodes are zero, b = 0 then the nodes are at the same electric
potential. In other words, these nodes are equipotential. It can therefore be seen

that Ax = b encapsulates the Kirchhoff’s Voltage Law.

Using Ohm’s law and the information about conductance of each edge, one could
create a conductance matrix, G. This could then be used to obtain the currents on
each edge yi . The vector denoted by y has entries comprising the currents in all the
edges. Thus Gb = y, where G is the 6 × 6 conductance matrix; b is the vector of
potential difference and y is the vector of currents.

[Figure 2.2: Circuit with 4 nodes, showing the currents y1 , . . . , y6 on the edges]

The currents along with their flow directions are shown in the above figure. Now
the Kirchhoff's current law can be written as AT y = 0. Thus, vectors in the null
space of AT correspond to collections of currents in the loops that satisfy Kirchhoff's
laws. This is summarized in the following schematic:

    • x = {x1 , x2 , x3 , x4 } : the potentials at the nodes
    • Ax = b : x2 − x1 , etc. give the potential differences b
    • Gb = y : Ohm's law gives the currents y = {y1 , . . . , y6 } on the edges
    • AT y = 0 : Kirchhoff's Current Law

The AT y = 0 equation is written out and, using the method of elimination, the
reduced row echelon form (rref) is obtained; the rank r is evident from the

number of pivots.
 
                              [ y1 ]
    [ −1  0  1 −1  0  0 ]     [ y2 ]   [ 0 ]
    [  1 −1  0  0 −1  0 ]     [ y3 ]   [ 0 ]
    [  0  1 −1  0  0 −1 ]     [ y4 ] = [ 0 ]
    [  0  0  0  1  1  1 ]     [ y5 ]   [ 0 ]
                              [ y6 ]

                [ −1  0  1 −1  0  0 | 0 ]
    [AT | 0] =  [  1 −1  0  0 −1  0 | 0 ]
                [  0  1 −1  0  0 −1 | 0 ]
                [  0  0  0  1  1  1 | 0 ]

    [ −1  0  1 −1  0  0 ]      [ −1  0  1 −1  0  0 ]
    [  1 −1  0  0 −1  0 ]  −→  [  0 −1  1 −1 −1  0 ]
    [  0  1 −1  0  0 −1 ]      [  0  0  0 −1 −1 −1 ]
    [  0  0  0  1  1  1 ]      [  0  0  0  0  0  0 ]
    AT (currents through nodes)   upper triangular matrix U

Scaling the pivots to 1, clearing the entries above them, and exchanging
columns 3 and 4 to collect the pivot columns gives

    [ 1 0 0 −1 −1 −1 ]
    [ 0 1 0 −1  0 −1 ]  =  [ I3  F ]
    [ 0 0 1  0  1  1 ]     [ 0   0 ]
    [ 0 0 0  0  0  0 ]
    rref matrix R

The null space matrix contains the vectors in N (AT ). The information about
the null space matrix N is found in the free-column block F of the rref. With the
variables in the exchanged order (y1 , y2 , y4 , y3 , y5 , y6 ) we have N = [ −F ; I ];
restoring the original order gives the null space solutions of AT y = 0:

        [ y1 ]   [ 1  1  1 ]            [ y1 ]   [ 1  1  1 ]
        [ y2 ]   [ 1  0  1 ]            [ y2 ]   [ 1  0  1 ]
    N = [ y4 ] = [ 0 −1 −1 ]   =⇒   y = [ y3 ] = [ 1  0  0 ]
        [ y3 ]   [ 1  0  0 ]            [ y4 ]   [ 0 −1 −1 ]
        [ y5 ]   [ 0  1  0 ]            [ y5 ]   [ 0  1  0 ]
        [ y6 ]   [ 0  0  1 ]            [ y6 ]   [ 0  0  1 ]

We observe that the null space of AT , which is given by AT y = 0, encap-
sulates the Kirchhoff's Current Law: each solution for y gives the current
flow around a particular loop, and linear combinations of these solutions
give the other loops. Since the rank is r = 3, the null space of AT has dimension
n − r = 6 − 3 = 3, which is the number of independent loops in the network.
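
Both laws can be verified computationally. A small sketch with sympy (assumed
available):

    from sympy import Matrix

    # Incidence matrix of Figure 2.1: rows are edges b1..b6, columns are nodes x1..x4
    A = Matrix([[-1, 1, 0, 0], [0, -1, 1, 0], [1, 0, -1, 0],
                [-1, 0, 0, 1], [0, -1, 0, 1], [0, 0, -1, 1]])
    print(A.rank())         # 3
    print(A.nullspace())    # [(1, 1, 1, 1)^T]: equipotential nodes (KVL)
    print(A.T.nullspace())  # three loop-current vectors satisfying A^T y = 0 (KCL)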

Summary

• The four different cases for the linear equations (reflected in rref, R) de-
  pending on the rank r are:

  – Case 1: Full rank: r = m = n. Here x = A−1 b and R = I

  – Case 2: Full column rank: r = n < m. In this case, R = [ I ; 0 ]

  – Case 3: Full row rank: r = m < n. The structure of R is: R = [ I | F ]

  – Case 4: Not full rank: r < n and r < m. Here, R = [ Ir F ; 0 0 ]
Chapter 3

Matrix Definiteness and Cramer's
Rule

Topics covered - Definiteness, Cramer’s rule, Fundamentals of Calculus, Gradients


and Envelope theorem

3.1 Definite and Semidefinite matrices


Let us suppose that we have a square symmetric matrix A. If we premultiply
it by the transpose of a vector x and postmultiply it by the vector x itself, then
what we have is a quadratic form. It is given as:

    [ x1 x2 ] [ a11 a12 ] [ x1 ] = a11 x1² + (a12 + a21 )x1 x2 + a22 x2²     (3.1)
              [ a21 a22 ] [ x2 ]

Here are some important definitions regarding the matrix A:

• A is positive definite if xT Ax > 0, ∀x ≠ 0.

• A is negative definite if xT Ax < 0, ∀x ≠ 0.

• A is positive semidefinite if xT Ax ≥ 0, ∀x.

• A is negative semidefinite if xT Ax ≤ 0, ∀x.

Note that there are certain situations wherein we do not require a specific sign on
the quadratic form for all x, but only on a restricted set of vectors. In this situation,
we say that A is positive definite subject to the constraint bx = 0 if xT Ax > 0 for
all x ≠ 0 satisfying bx = 0.

3.1.1 Tests for definiteness


If a matrix is positive semidefinite then it has nonnegative diagonal terms:
taking x = (1, 0, · · · , 0) gives xT Ax = a11 ≥ 0. We can also define


necessary and sufficient conditions for definiteness criteria. First we note the def-
inition of minor matrices - these are submatrices of A formed by eliminating k
rows and the same numbered k columns. We say that the naturally ordered or
nested principal minor matrices of A are the minor matrices given by:
ñ ô a11 a12 a13 
a11 a12 
a11 , , a21 a22 a23  (3.2)
a21 a22
a31 a32 a33
Minor determinants are simply the determinants of the minors. Note now that if
we are given a square matrix A and a vector b then we can border A by b in the
following manner. Note that what this gives us is essentially a bordered matrix.
 
    [ 0  b1  b2  · · · bn  ]
    [ b1 a11 a12 · · · a1n ]
    [ ··  ··  ··  · · · ·· ]     (3.3)
    [ bn an1 an2 · · · ann ]
Now it turns out that the border preserving principal minor matrices are the
principal minors of this matrix that include the border elements as well; they are
typically used to test definiteness under the constraint conditions. We say that the
square matrix A is:
• Positive definite if and only if the principal minor determinants are all pos-
itive.
• Negative definite if and only if the principal minor determinants of order k
have a sign as (−1)k for k = 1, 2, · · · , n.
• Positive definite subject to constraint bx = 0 if and only if the border pre-
serving principal minors are all negative. The condition is actually (−1)m
where m is the number of constraints.
• Negative definite subject to constraint bx = 0 if and only if the border
preserving principal minors have sign (−1)k for k = 1, 2, · · · , n.

3.1.2 Cramer’s rule


This rule provides us a fairly convenient method for finding the solution of a sys-
tem of linear equations of the form Ax = b.
    
    [ a11 · · · a1n ] [ x1 ]   [ b1 ]
    [  ··  ··   ··  ] [ ·· ] = [ ·· ]     (3.4)
    [ an1 · · · ann ] [ xn ]   [ bn ]
Now Cramer's rule states that in order to find a component of the solution
vector, say xi , all we have to do is replace the ith column of matrix A with
the vector b to form the matrix denoted as Abi and then compute:
    xi = |Abi | / |A|     (3.5)
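
A direct implementation of this rule (a sketch assuming numpy; practical only
for small systems, since it computes n + 1 determinants):

    import numpy as np

    def cramer(A, b):
        # Solve Ax = b by Cramer's rule.
        detA = np.linalg.det(A)
        x = np.empty(len(b), dtype=float)
        for i in range(len(b)):
            Abi = A.astype(float).copy()
            Abi[:, i] = b            # replace the i-th column of A with b
            x[i] = np.linalg.det(Abi) / detA
        return x

    print(cramer(np.array([[1., 2.], [3., 4.]]), np.array([5., 6.])))  # [-4.  4.5]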

3.2 Analysis
Given a vector x in the space Rn and a positive real number e, we can define an
open ball of radius e at x as follows:

Be (x) = {y ∈ Rn : |y − x| < e} (3.6)

We note that a set of points A is an open set if for every x in A there exists some
Be (x) which is contained in A. Further we note that if x is in an arbitrary set and
there exists some e > 0 such that Be (x) is in A, then x is said to be in the interior
of A.

The complement of a set A in Rn consists of all points in Rn that are not in


A. This is denoted as Rn \A or Ac . Note an important point that A would be a
closed set if Rn \A is an open set. A set A is said to be bounded if there is some
x in A and some e > 0 such that A is contained in Be (x). Note that if a non empty
set in Rn is closed and bounded, it is called compact.

An infinite sequence in Rn depicted as {xi } = (x1 , x2 , · · · ) is simply an infinite


set of points. The sequence {xi } is said to converge to a point x∗ if for every e > 0
there is an integer m such that for all i > m, xi is in Be (x∗ ). This
statement is akin to saying that x∗ is the limit of the sequence {xi } and can be
written as:
    lim_{i→∞} xi = x∗     (3.7)

Note that if the sequence converges to a point it is called a convergent sequence.


Lastly, in this section we note that a function f (x) is continuous at x∗ if for every
sequence {xi } that converges to x∗ we have a corresponding sequence {f (xi )} that
converges to f (x∗ ). A continuous function is a function that is continuous at every
point in its domain.

3.3 Calculus
Calculus gives us the tools to approximate certain function with linear functions.
Given a function f : R → R we can define its derivative at point x∗ as follows:

    df (x∗ )/dx = lim_{t→0} [ f (x∗ + t) − f (x∗ ) ] / t     (3.8)

provided the limit exists. Consequently we say that if the derivative of f exists at
x∗ then f is differentiable at x∗ . Now consider a linear function F (t) defined by:

F (t) = f (x∗ ) + f 0 (x∗ )t (3.9)

This would be a pretty good approximation to f near x∗ since:

    lim_{t→0} [ f (x∗ + t) − F (t) ] / t = lim_{t→0} [ f (x∗ + t) − f (x∗ ) − f ′ (x∗ )t ] / t = 0     (3.10)

Therefore if we have an arbitrary function f : Rn → Rm , then we can define its


derivative at x∗ , Df (x∗ ) as being a linear map from Rn to Rm that approximates
f close to x∗ in the sense that:
    lim_{|t|→0} | f (x∗ + t) − f (x∗ ) − Df (x∗ )t | / |t| = 0     (3.11)
Now given a function f : Rn → R we can also define partial derivatives of f with
respect to xi evaluated at x∗ . To achieve this, we hold all components fixed except
for the ith component such that f is now a function only with respect to xi and
then evaluate the first order derivative. This is denoted as:
    ∂f (x∗ )/∂xi     (3.12)
Now since the total derivative Df (x∗ ) is nothing but a collection of linear trans-
formations via the partial derivatives, we can represent it as a matrix. Note here
that we are representing the partial derivatives of many functions here.

               [ ∂f1 (x∗ )/∂x1  · · ·  ∂f1 (x∗ )/∂xn ]
    Df (x∗ ) = [       ··        ··          ··      ]     (3.13)
               [ ∂fm (x∗ )/∂x1  · · ·  ∂fm (x∗ )/∂xn ]

3.3.1 Higher order derivatives


Now if we have a function f : Rn → R then the Hessian matrix of that function
is the matrix of mixed partial derivatives:
    D²f (x∗ ) = ( ∂²f (x∗ ) / ∂xi ∂xj )     (3.14)

Further note that the Hessian is a symmetric matrix.

3.4 Gradients and Tangent planes


Consider a function f : Rn → R. We define the gradient of f at x∗ as a vector
whose coordinates are the partial derivatives of f at x∗ :
    Df (x∗ ) = ( ∂f (x∗ )/∂x1 , · · · , ∂f (x∗ )/∂xn )     (3.15)

Note that an important interpretation of the gradient is that it points in the di-
rection that the function f increases most rapidly. To denote this, we first let h

be a vector of norm 1, that is, a unit vector. Now the derivative of f in the
direction of h at x∗ is simply Df (x∗ )h. We can apply the inner product formula

direction of h at x∗ is simply Df (x∗ )h. We can apply the inner product formula
here to obtain:
Df (x∗ )h = |Df (x∗ )| cos θ (3.16)
Clearly the above expression would be maximized when θ = 0 and would in turn
imply collinearity among Df (x∗ ) and h. Now we note that the level set of a
function is the set of all x such that the function is constant: Q(a) = {x : f (x) =
a}. This level set of a function f : Rn → R would generally be a n − 1 dimensional
surface. Note now that the upper contour set of a function f : Rn → R is the set
of all x such that f (x) is at least as large as some number: U (a) = {x ∈ Rn : f (x) ≥
a}. With this defined, we can now find the formula for a tangent hyperplane to
the level set at some point x∗ . Now we know from the Taylor expansion formula
and the general method of finding linear approximations of a function around a
certain point x∗ is given by: f (x∗ ) + Df (x∗ )(x − x∗ ). With this we can say that
the best linear approximation to {x : f (x) = a} should be given by:

H(a) = {x : f (x∗ ) + Df (x∗ )(x − x∗ ) = a} (3.17)

Now note that since f (x∗ ) = a then we have the following formula for the tangent
hyperplane:
H(a) = {x : Df (x∗ )(x − x∗ ) = 0} (3.18)
With this we can state that if x is a vector in the tangent hyperplane, then x − x∗
is orthogonal to the gradient of f at x∗ . In other words, the tangent hyperplane
at a point x∗ on the level set is the set of points x such that x − x∗ is orthogonal
to the gradient of f evaluated at x∗ .

3.5 Optimization
Let f : R → R be a function and we say that this function achieves its maximum
at x∗ if f (x∗ ) > f (x) for all x. On similar lines, if we say that x∗ is a minimum
point if f (x∗ ) < f (x) for all x. Suppose now that a function achieves a maximum
at x∗ . Now we know from calculus conditions that at this point the first derivative
would be equal to 0 and the second derivative would be less than or equal to 0 at
x∗ . This can be depicted as:

    f ′ (x∗ ) = 0     (3.19)
    f ′′ (x∗ ) ≤ 0     (3.20)

Similarly for minimization the second order condition would become:

    f ′′ (x∗ ) ≥ 0     (3.21)

3.5.1 Concavity
A function is said to be concave if it satisfies the following condition for all x, y and all t ∈ [0, 1]:

f (tx + (1 − t)y) ≥ tf (x) + (1 − t)f (y) (3.22)



A twice-differentiable concave function has its second derivative less than or equal
to 0, and any critical point is a maximum. Now a convex function is one that
satisfies the following property:

    f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y)     (3.23)

A convex function has a minimum at any critical point, and its second derivative is
greater than or equal to 0.

3.5.2 Envelope theorem


The envelope theorem deals with the situation wherein we have a function to
optimize of the form f (x, a) where x is the choice variable and a is exogenous.
However since this optimal choice of x might vary depending on a we model it as
x(a). Then the optimal value function takes the form:
M (a) = f (x(a), a) (3.24)
Now very often we are interested in seeing how the optimal value changes as we
change the parameter a. For this we differentiate both sides to get:
    dM (a)/da = [ ∂f (x(a), a)/∂x ] · [ dx(a)/da ] + ∂f (x(a), a)/∂a     (3.25)
As per our optimizing conditions, since x(a) is the optimal choice then we would
naturally have:
    ∂f (x(a), a)/∂x = 0     (3.26)
With this we finally have:
    dM (a)/da = ∂f (x(a), a)/∂a = ∂f (x, a)/∂a evaluated at x = x(a)     (3.27)
What this means is that the derivative is taken by holding x fixed at the optimal
choice x(a). Basically the total derivative with respect to the parameter is nothing
but the partial derivative evaluated at optimal choice.
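
A symbolic check of this result (a sketch assuming sympy; the objective
f = −x² + 2ax is a hypothetical example of ours, chosen so that the maximizer
is x(a) = a):

    from sympy import symbols, diff, solve, simplify

    x, a = symbols('x a')
    f = -x**2 + 2*a*x
    xa = solve(diff(f, x), x)[0]     # optimal choice x(a) = a
    M = f.subs(x, xa)                # value function M(a) = a**2
    # envelope theorem: dM/da equals the partial of f w.r.t. a at x = x(a)
    print(simplify(diff(M, a) - diff(f, a).subs(x, xa)))   # 0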

3.5.3 Comparative statics

Note that in this situation we typically want to see how the optimal choice
changes as the parameter changes. First we note that the optimal choice satisfies:
    ∂f (x(a), a)/∂x = 0     (3.28)
Further differentiating both sides we would get:
    [ ∂²f (x(a), a)/∂x² ] · [ dx(a)/da ] + ∂²f (x(a), a)/∂x∂a = 0     (3.29)
Now we would essentially solve for:

    dx(a)/da = − [ ∂²f (x(a), a)/∂x∂a ] / [ ∂²f (x(a), a)/∂x² ]     (3.30)

Note that since the denominator is always negative (by the second order condition),
the sign of the derivative of the optimal choice depends on the second cross partial
derivative with respect to x and a.

3.6 Multivariate optimization


Consider optimizing a function that has two choice variables x1 and x2 . Treat
these two in a vector format as x = (x1 , x2 ). The maximization problem can then
be written as:
    max_{x1 ,x2 } f (x1 , x2 ) = max_x f (x)     (3.31)

Now the two separate first order conditions are given by:

    ∂f (x1 , x2 )/∂x1 = 0     (3.32)
    ∂f (x1 , x2 )/∂x2 = 0     (3.33)
Generally for n choice variables we define the gradient vector as:
    Df (x) = ( ∂f /∂x1 , · · · , ∂f /∂xn )     (3.34)

Therefore the n first order conditions can be conveniently written as:

Df (x∗ ) = 0 (3.35)

Now the second order conditions of this problem can be expressed in the form of
a Hessian matrix and is given by:
ñ ô
f11 f12
H= (3.36)
f21 f22

Here fij stands for the mixed partial derivative:

    fij = ∂²f / ∂xi ∂xj     (3.37)

Now from calculus we can say that for a maximization problem, at the optimal
choice x∗ the Hessian matrix must be negative semidefinite. Therefore for any
vector (h1 , h2 ) it must satisfy:

    [ h1 h2 ] [ f11 f12 ] [ h1 ] ≤ 0     (3.38)
              [ f21 f22 ] [ h2 ]

This can also be written as hHhT ≤ 0, and note that the corresponding condi-
tion for minimization would require the same matrix to be of a positive semidefi-
nite form.

3.6.1 Multivariate comparative statics


We can see in the multivariate scenario as to how the optimal choice varies with
parameter a. Note the first order conditions:

    ∂f (x1 (a), x2 (a), a)/∂x1 = 0     (3.39)
    ∂f (x1 (a), x2 (a), a)/∂x2 = 0     (3.40)
Now differentiating with respect to a we get:
    f11 (∂x1 /∂a) + f12 (∂x2 /∂a) + f13 = 0     (3.41)
    f21 (∂x1 /∂a) + f22 (∂x2 /∂a) + f23 = 0     (3.42)
This can then be written in a matrix form as follows:
    [ f11 f12 ] [ x1′ (a) ]   [ −f13 ]
    [ f21 f22 ] [ x2′ (a) ] = [ −f23 ]     (3.43)

Now we can simply use Cramer's rule to find the respective rates of change
of the optimal choices:

    ∂x1 /∂a = det[ −f13 f12 ; −f23 f22 ] / det[ f11 f12 ; f21 f22 ]     (3.44)

Chapter 4

Multivariate Data Analysis - Basic


concepts

4.1 A short recap of SVD concepts


We begin with a recap of essential equations and concepts before starting out with
linear transformations. Recall that the singular value decomposition of any m × n
matrix is given by :-
    Am×n = Um×m Σm×n (Vn×n )T     (4.1)
Where the matrices V and U are orthonormal matrices satisfying the following
conditions :-
U T U = Im (4.2)
V T V = In (4.3)
And Σ is a diagonal matrix containing singular values that are arranged in de-
creasing order such that σ1 ≥ σ2 ≥ · · · ≥ σr > 0. The singular values after the first
r values are effectively 0, where r is the rank of the original matrix A. [2] In order
to write down what the columns of these orthonormal matrices represent, we will
present their illustration as follows :-

    V = [v1 , v2 , · · · , vr | vr+1 , · · · , vn ]     (4.4)

    U = [u1 , u2 , · · · , ur | ur+1 , · · · , um ]     (4.5)


In equation 4 the first r columns of the matrix represent the r basis vectors of
the Row space of A and in equation 5 the first r columns of the matrix represent
the basis vectors of the Column space of A. And the n − r and m − r columns
respectively represent the basis vectors for the Null space and Left Null space of
A. V captures the space Rn and U captures the space Rm .

4.1.1 The process of SVD


These matrices of the decomposition are computed as follows :-


• Find the Eigenvalues of E = AT A. Take their positive square roots to get the
  singular values:

      σA1 = √λE1 ,   σA2 = √λE2     (4.6)

• On the basis of those Eigenvalues, compute the Normalized Eigenvectors of
  matrix E:

      XE = x / ||x||2     (4.7)

• Using these normalized eigenvectors form the V orthonormal matrix which


is composed of these eigenvectors as its columns :-
      V = [ XE1  XE2 ]     (4.8)

• Compute the diagonal Σ matrix as well :-


      Σ = [ σ1  0 ]     (4.9)
          [  0 σ2 ]

• Find matrix U by using the following modified relationship of the SVD equa-
tion :-
AV = U Σ (4.10)
Av1 = u1 σ1 (4.11)
    u1 = (1/σ1 ) Av1     (4.12)
• Present the reduced dimensional form of the matrix by assembling A as the
  sum of r rank 1 matrices, each formed as the outer product of a pair of
  singular vectors:

      A = Σ_{i=1}^{r} σi ui viT     (4.13)

• Finally the truncated form of the SVD, which is essentially the reduced rank
  approximation of A, is given as:

      A = Ur Σr VrT     (4.14)
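
The whole procedure is available off the shelf. A minimal sketch with numpy
(np.linalg.svd is the standard routine; the matrix here is a random example of
ours):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))  # rank <= 4
    U, s, Vt = np.linalg.svd(A)          # A = U diag(s) V^T, s in decreasing order
    r = 2
    A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # truncated rank-r approximation
    print(np.linalg.matrix_rank(A_r))             # 2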

4.1.2 Forms of Inverses


In the case when m > n = r the Null space is trivial, which means that
it only contains the 0 vector. In this case the two-sided inverse of A does not
exist, so we would compute the Left inverse.[2] The left inverse is formed
as follows :-

    Ln = AT A     (4.15)
    Ln−1 Ln = In     (4.16)
    A−1_L = (AT A)−1 AT     (4.17)
    A−1_L = Vn Σ−1 UmT     (4.18)

In a similar way we can compute the Right inverse of a matrix in the special case
of n > m = r. Here the Left null space is trivial and we can compute the
right inverse as follows :-

    Rm = AAT     (4.19)
    Rm−1 Rm = Im     (4.20)
    A−1_R = AT (AAT )−1     (4.21)
    A−1_R = Vn Σ−1 UmT     (4.22)
In the case when r < m and r < n, neither the left nor the right inverse exists.
In this case we compute the Pseudoinverse, which is given by :-

    A+ = Vr Σ+ UrT     (4.23)
    A+ ui = (1/σi ) vi     (4.24)
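
These inverses are easy to verify numerically. A sketch with numpy
(np.linalg.pinv computes the pseudoinverse; for a full column rank matrix it
coincides with the left inverse):

    import numpy as np

    A = np.array([[1., 5.], [2., 4.], [3., 9.], [8., 10.]])   # m > n = r
    L = np.linalg.inv(A.T @ A) @ A.T          # left inverse (A^T A)^(-1) A^T
    print(np.allclose(L @ A, np.eye(2)))      # True: L A = I_n
    print(np.allclose(L, np.linalg.pinv(A)))  # True: agrees with the pseudoinverse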

4.2 Data Matrices


A multivariate data matrix refers to a table in which values are arranged such that
the columns represent the attributes or the variables for which measurements are
made and the rows refer to the individuals or records which refer to the various
observations taken. It is represented as a (n − individuals) × (p − variable) matrix.
[1]

        [ x11 · · · x1j · · · x1p ]
        [  ··       ··       ··  ]
    X = [ xi1 · · · xij · · · xip ] = [x1 , · · · , xj , · · · , xp ]     (4.25)
        [  ··       ··       ··  ]
        [ xn1 · · · xnj · · · xnp ]

Note that j stands for the j th column or the j th variable.


 
    xj = [ x1j , · · · , xnj ]T     (4.26)

The Average statistic for a variable j or the jth column of X can be computed as
follows :-

    x̄j = (1/n)(x1j + · · · + xnj ) = (1/n) Σ_{i=1}^{n} xij     (4.27)
Now in order to express such a sum in vector terms, we will use a special vector
called i, which is essentially a vector of 1s. It is an n × 1 vector of 1s given as :

in = [1, 1, 1, · · · , 1]. The sum of all the x components of the j th column is thus
represented as :-  
x1j
î ó x2j 
iTn xj = 1 1 · · · 1 
 
.
 . 
 (4.28)
 . 
xnj
And finally the average can be expressed as :-

    x̄j = (1/n) inT xj     (4.29)

4.2.1 Centering the values


Centering the values of the data matrix merely involves finding out the deviation
of those values from the mean so in effect it means subtracting out the mean from
each vector component. So to get the centered value vector for the j th variable,
denoted as yj we will get :-
         
    yj = [ y1j , · · · , ynj ]T = [ x1j − x̄j , · · · , xnj − x̄j ]T = xj − [ x̄j , · · · , x̄j ]T     (4.30)

Now we note that the last vector has all of its entries equal to the mean.
So we can write that expression as :-

    [ x̄j , · · · , x̄j ]T = x̄j in = in x̄j     (4.31)

We can further rewrite this form as :-

    [ x̄j , · · · , x̄j ]T = in × ( (1/n) inT xj ) = (1/n) in inT xj     (4.32)

We can write the final form as :-

    yj = xj − [ x̄j , · · · , x̄j ]T = In xj − (1/n) in inT xj     (4.33)

    In xj − (1/n) in inT xj = ( In − (1/n) in inT ) xj = J xj     (4.34)

    J = In − (1/n) in inT     (4.35)

J is called the centering matrix and has the properties of a symmetric matrix:
J = JT , J² = J and inT J = 0. The last condition means that the sum
of the deviations is 0. We must additionally remember some key properties of
matrices of the form sJA (s is some scalar) under premultiplication by iT and J :-

    iT (sJA) = siT JA = 0     (4.36)
    J(sJA) = sJJA = sJA     (4.37)

Remember an important point: the average of the centered values in the new
column is 0. [1] This can be checked by computation as well. Suppose
there is a centered column of the j variable in the data matrix called yj = Jxj .
When we take the average we essentially compute :-

    iT yj = iT (Jxj ) = 0     (4.38)

4.2.2 Variance and standard deviation


Variance is a statistic that gives us a measure of spread and is defined by using sum
of squared distances of values from their average. So the variance for variable j
would be given by :-
    vjj = (1/n)[ (x1j − x̄j )² + · · · + (xnj − x̄j )² ] = (1/n) Σ_{i=1}^{n} (xij − x̄j )²     (4.39)

In vector form the same equation as above can be obtained by writing :-

    vjj = (1/n) [ x1j − x̄j · · · xnj − x̄j ] [ x1j − x̄j , · · · , xnj − x̄j ]T     (4.40)

    [ x1j − x̄j , · · · , xnj − x̄j ]T = Jxj     (4.41)

Now we can write the variance in a simplified form as :-


    vjj = (1/n)(Jxj )T (Jxj ) = (1/n) xjT JT Jxj = (1/n) xjT Jxj     (4.42)
Since we know that the deviation centered form of xj is given as yj = Jxj we can
write the variance formula as follows as well :-
    vjj = (1/n) xjT Jxj = (1/n) xjT JT Jxj = (1/n) yjT yj = (1/n) ||yj ||²     (4.43)
Note that the variance of the centered values is the same as the variance of the
raw values.

4.2.3 Standard values


Now when we divide the centered values which are essentially the raw values
with the vector average subtracted, by the standard deviation of the vector of
observations, we get the standard values or z-scores. The standard vector of
observations is zj = [z1j , · · · , znj ] and it is computed as :-
 √   
(x1j − x̄j )/ vjj x1j − x̄j
..  = √1  ..  = √1 Jxj = √1 yj
   
zj =  . . (4.44)


 vjj   vjj vjj
(xnj − x̄j )/ vjj xnj − x̄j

This process of transforming raw values into standard values is called standard-
ization and in this case the mean of the standard vector is 0 and variance is 1.

4.2.4 Basics on Matrix representations


A general property of matrices is given by :-

[Ab1 , · · · , Abk ] = A[b1 , · · · , bk ] (4.45)

Where A is an n × m matrix and bi are m × 1 vectors. This principle can be


extended to representing the matrix of standardized values Z = [z1 , · · · , zp ] where
Z is an n × p matrix. This is given by :-

    Z = [ (1/√v11 ) y1 , · · · , (1/√vpp ) yp ] = [ y1 , · · · , yp ] D−1 = Y D−1     (4.46)

        [ √v11             ]
    D = [        ···       ]     (4.47)
        [            √vpp  ]
Here D is a p × p diagonal matrix containing the standard deviations of the p
columns of the data matrix. Finally the standardized Z matrix can be written as :-

Z = JXD−1 (4.48)
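
A compact numerical sketch of these constructions (assuming numpy; the data
matrix is a small example of ours):

    import numpy as np

    X = np.array([[1., 2.], [3., 4.], [5., 9.]])   # n = 3 individuals, p = 2 variables
    n = X.shape[0]
    i_n = np.ones((n, 1))
    J = np.eye(n) - i_n @ i_n.T / n                # centering matrix J
    Y = J @ X                                      # centered values
    D = np.diag(np.sqrt(np.diag(Y.T @ Y / n)))     # standard deviations on the diagonal
    Z = J @ X @ np.linalg.inv(D)                   # Z = J X D^(-1)
    print(Z.mean(axis=0).round(12))                # [0. 0.]: standardized means are 0
    print((Z ** 2).mean(axis=0))                   # [1. 1.]: variances are 1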

4.3 Correlations
We shall now consider an n × p data matrix wherein n refers to the number of
observations or individuals and p refers to the number of attributes or factors of

measurements. [1]The relationship between two variables j and k called a corre-


lation is best described by their scatterplot - which involves plotting the values of
individual/observation i for the j and k variables and see if there is any discernible
pattern. In this scatterplot n individuals are plotted as points with coordinates
(xij , xik ). Here xij refers to the value of j variable for individual i and similarly for
k as well. Now the correlation between two variables j and k is described by their
covariance which is denoted as follows :-
    vjk = (1/n) Σ_{i=1}^{n} (xij − x̄j )(xik − x̄k )     (4.49)

This term is computed using the centered values of the variable vectors and is
positive for positive correlation, negative for negative correlation and 0 for no
correlation. We can represent the same formula in vector form as :-
 
    vjk = (1/n) [ x1j − x̄j · · · xnj − x̄j ] [ x1k − x̄k , · · · , xnk − x̄k ]T     (4.50)

When we write this relation in terms of the centering matrix we get :-


    vjk = (1/n)(Jxj )T (Jxk ) = (1/n) xjT JT Jxk = (1/n) xjT Jxk = (1/n) yjT yk     (4.51)
The terms vjk are the entries of a p × p matrix called the covariance
matrix, which contains variances in the diagonal elements and covariances
in the off-diagonal elements.

4.3.1 Correlation coefficient


A limitation of covariance is that it does not give us any indication of the degree of
correlation between two variables. This is where the correlation coefficient comes in.
It is computed by dividing the covariance term with standard deviations of both
variables. It is given by :-

    rjk = vjk / (√vjj √vkk ) = xjT Jxk / ( √(xjT Jxj ) √(xkT Jxk ) )     (4.52)

Note that we have directly written formula here without the 1/n terms since they
simply cancel out. This equation can be further simplified to give :-

    rjk = (Jxj )T (Jxk ) / ( √((Jxj )T (Jxj )) √((Jxk )T (Jxk )) ) = yjT yk / ( ||yj || ||yk || )     (4.53)

Note that the formula for deriving the cosine of the angle between two vectors is
also given by the same formula as the correlation coefficient. So the angle between
the two vectors is essentially a measure of their correlation as well. Additionally

note that the correlation coefficient between pairs of standardized variables and
raw variables is the same. Now, the p × p covariance matrix of the data matrix X
is given by :-

    V = (1/n) [ xjT JT Jxk ]_{j,k=1,··· ,p}     (4.54)

which can be assembled from the columns of X as

    V = (1/n) [ Jx1 , · · · , Jxp ]T [ Jx1 , · · · , Jxp ]     (4.55)

    V = (1/n) [ x1 , · · · , xp ]T JT J [ x1 , · · · , xp ]     (4.56)

    V = (1/n) XT JT JX = (1/n) XT JX = (1/n) Y T Y     (4.57)
Similarly we can derive the correlation matrix. Since covariances among pairs
of the standardized variables are nothing but the correlations between the raw
variables, we can denote the correlation matrix in terms of z as follows :-

    R = (1/n) [ zjT JT Jzk ]_{j,k=1,··· ,p}     (4.58)

    R = (1/n) [ zjT zk ]_{j,k=1,··· ,p}     (4.59)

The second form is obtained with the property Jz = z. Finally we can write the
correlation matrix as :-

    R = (1/n) ZT Z     (4.60)
Now recall that previously we had shown that Z = JXD−1 . We can substitute this
into equation 4.60 to obtain :-

    R = (1/n) D−1 XT JT JXD−1 = D−1 V D−1     (4.61)

An important point to note is that statistically we obtain the unbiased covariances
by dividing not by n but by n − 1. So effectively our V matrix would be :-

    V cv = (1/(n − 1)) XT JX = (n/(n − 1)) V     (4.62)
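
A numerical check of these identities (a sketch assuming numpy; np.cov and
np.corrcoef are the standard routines):

    import numpy as np

    X = np.array([[1., 2.], [3., 4.], [5., 9.]])
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    V = X.T @ J @ X / n                              # covariance matrix (1/n) X^T J X
    D = np.diag(np.sqrt(np.diag(V)))
    R = np.linalg.inv(D) @ V @ np.linalg.inv(D)      # R = D^(-1) V D^(-1)
    print(np.allclose(R, np.corrcoef(X, rowvar=False)))           # True
    print(np.allclose(n / (n - 1) * V, np.cov(X, rowvar=False)))  # True: unbiased V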

4.4 Introducing Linear Transformations


When we compute the expression Av we can think of matrix A as trans-
forming vector v into another vector Av. We can think of such a transformation
as a function : a vector is inputted and the function or transformation gives an
output. The notation is as follows :-

T (v) = Av (4.63)

We will gradually move towards more complicated transformation wherein we


transform a whole space V by using the transformation operator A. A trans-
formation T assigns each output T (v) to each input vector v in V .[3] The
following conditions are met by transformations :-

T (v + w) = T (v) + T (w) (4.64)

T (cv) = cT (v) (4.65)


T (cv + dw) = cT (v) + dT (w) (4.66)
A key point to note is that a transformation that adds a constant value to an input
is not considered linear. Also note a special case of a transformation wherein the
input space (V ) is the same as the output space (W ) - this is known as the Identity
transformation given by :-
T (v) = v (4.67)
We will now show an example of a linear transformation involving dot products.
Suppose there is a fixed vector or transformation operator called a = (1, 2, 3)T .
We can choose any vector in 3-D space called v and compute its transformation
as follows :-

    a · v = [ 1 2 3 ] [ v1 , v2 , v3 ]T = v1 + 2v2 + 3v3     (4.68)
We note that the result is a scalar. So in essence, the transformation function T
which applies vector a as an inner product to any input vector v transforms the
input vector v ∈ R3 to T (v) ∈ R1 . An important condition to note is :-

u = c1 v1 + c2 v2 + · · · + cn vn (4.69)

T (u) = c1 T (v1 ) + c2 T (v2 ) + · · · + cn T (vn ) (4.70)


This is a rather important result and it states that :- If every vector u is a linear
combination of the basis vectors v then the transformation of u is the same as the
linear transformation of all the basis vectors that make up that vector.

4.4.1 Derivatives as linear transformations


Suppose we have to compute the derivative of the expression :-

u = 6 − 4x + 3x2 (4.71)

We would approach this problem by first computing the transformations or deriva-


tives of the basis vectors : 1, x, x2 .[2] After that we will apply the previous result
of linearity of combinations of transformations to get the transformation or the
derivative of u as follows :-
    du/dx = 6 (derivative of 1) − 4 (derivative of x) + 3 (derivative of x²) = −4 + 6x     (4.72)
Now we know that the transformation function is the derivative operator, that is :-
    T = d/dx     (4.73)
The basis vectors are given by : v1 , v2 , v3 = 1, x, x2 and this transformation acts on
the basis vectors as follows :-
    dv1 /dx = 0,   dv2 /dx = 1 = v1 ,   dv3 /dx = 2x = 2v2     (4.74)
So basically the 3 − D input space V has been transformed into a 2 − D output
space W . Hence we can form the transformation matrix as :-
    A = [ 0 1 0 ]     (4.75)
        [ 0 0 2 ]

When we operate this matrix on the basis vectors v1 , v2 , v3 we get the result
of the derivative transformation of the basis vectors respectively as 0, v1 , 2v2 . We
can write the input function also as a + bx + cx² and the corresponding derivative
output as b + 2cx. Hence we can write this operation as :-

    [ 0 1 0 ] [ a ]   [ b  ]
    [ 0 0 2 ] [ b ] = [ 2c ]     (4.76)
              [ c ]

Note here that in the input scenario, our basis vectors were 1, x, x² and their coor-
dinates a, b, c gave us the u vector on which the transformation was performed. But
now in the output our basis vectors have changed to 1, x and their coordinates
b, 2c now describe the transformed u vector in a lower dimensional space.
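
This derivative matrix is directly usable in code. A tiny sketch (assuming numpy):

    import numpy as np

    A = np.array([[0, 1, 0], [0, 0, 2]])   # d/dx acting on the basis (1, x, x^2)
    u = np.array([6, -4, 3])               # coordinates of u = 6 - 4x + 3x^2
    print(A @ u)                           # [-4  6]: the derivative -4 + 6x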

4.5 Choosing a transformation matrix


We can actually form a matrix for every linear transformation we wish to perform.
Before we start to compute such a matrix, remember some key points. Input
vectors v belong to the input space V ∈ Rn and output vectors T (v) belong
to the output space W ∈ Rm , and in such a case A is an m × n matrix. And
ultimately our choice of basis vectors that define V and W will also decide what
our transformation matrix is. Also note that the standard basis vectors for the
spaces Rn and Rm is usually the columns of identity matrix I. This is a simple form
of transformation T (v) = Av.[3] However we must note that these vector spaces
can also be defined by other basis vectors and in that case the transformation
matrix would change. Our goal is to choose bases such that we get the best T
matrix - possibly a diagonal matrix. Note the important point that when Input and
output bases are different our transformation matrix is essentially a change
of bases matrix.
• Suppose that we know what T (v) is for basis vectors v1 , · · · , vn .

• Columns 1 to n of this transformation matrix will contains the output basis


vectors T (v1 ), · · · , T (vn ).

• Let c be the vector of coordinates or linear combination scalar weights that


describe a given vector in the space using these basis vectors. Then Ac is the
combination of all those n columns.

• Then the transformed vector of coordinates as per new bases will be :-

    Ac = c1 T (v1 ) + · · · + cn T (vn )     (4.77)

NOTE :- Any vector v is a linear combination of input basis vectors with coordi-
nates as the scalar weights: c1 v1 + · · · + cn vn . In the same manner T (v) will
be the linear combination of output basis vectors with the same coordinates as
the scalar weights: c1 T (v1 ) + · · · + cn T (vn ).

Consider an example wherein T transforms standard basis vectors in R2 to other


basis vectors in R3 . We go from v1 = (1, 0) to T (v1 ) = (2, 3, 4) and from v2 = (0, 1)
to T (v2 ) = (5, 5, 5). So we are essentially transforming a vector in 2 − D described
by standard basis vectors to a vector in 3 − D described by the new basis vectors
by applying transformation matrix A to it. So obviously our A matrix will be of the
form 3 × 2. And the output basis vectors T (v1 ) and T (v2 ) will form the columns
of A. With input coordinates c1 = 1 and c2 = 1, the transformation is given as :-

T (v) = c1 T (v1 ) + c2 T (v2 ) (4.78)


   
2 5 ñ ô 7
3 5 1 = 8 (4.79)
1
4 5 9

4.6 Change of Bases


Consider now a situation wherein the input space is also the output space that is
V = R2 and W = R2 . Also suppose that T (v) = v is the identity transformation.
However this only happens when the bases vectors are the same. For our next

examples we will consider different bases to illustrate how the matrix is formed.
For our case T (v) = v we will change the basis from v to w where each vector v
is a linear combination of w1 and w2 .[2] The collection of input and output bases
vector in a matrix form are given below :-
    [v1 , v2 ] = [ 3 6 ]     (4.80)
                 [ 3 8 ]

    [w1 , w2 ] = [ 3 0 ]     (4.81)
                 [ 1 2 ]
Notice the change of bases by the following relation :-

v1 = 1w1 + 1w2 (4.82)

v2 = 2w1 + 3w2 (4.83)


Now the catch here is this: notice that we have written the input basis as a
function of the output basis. Let this not confuse you. In this case we are applying
the identity transformation on each basis vector such that T (v1 ) = v1 and T (v2 ) =
v2 , and only after this transformation do we express the resultant v vectors in
the form of the output basis w1 and w2 . So finally we can have a change of basis
matrix transformation B such that :-

WB = V (4.84)

B = W −1 V (4.85)
[w1 , w2 ][B] = [v1 , v2 ] (4.86)
    [ 3 0 ] [ 1 2 ]   [ 3 6 ]
    [ 1 2 ] [ 1 3 ] = [ 3 8 ]     (4.87)
So we note a rather important point here: when the input basis is in V and the
output basis is in W , then the identity transformation T = I is actually doing the
operation B = W −1 V . Another interpretation of this is as follows :- Consider a
vector u that can be written using both these bases :-

u = c1 v1 + · · · + cn vn (4.88)

u = d1 w1 + · · · + dn wn     (4.89)
Then we have the relation :-
V c = Wd (4.90)
So we notice that the coordinates of the new basis or the coefficients can be ex-
pressed as :-
d = W −1 V c (4.91)
d = Bc (4.92)
B = W −1 V (4.93)

An important result for the transformation of standard basis is therefore :- When


the standard basis V = I is changed to another basis W then the transforma-
tion matrix B is not W but it is W −1 . If we want to transform a vector given by
standard basis then we premultiply that with the inverse of the traditionally given
transformation matrix to obtain the coefficients of the new basis :-
    [ x ]   →   [ w1 w2 ]−1 [ x ]     (4.94)
    [ y ]                   [ y ]
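
A numerical sketch of the change of basis above (assuming numpy):

    import numpy as np

    V = np.array([[3, 6], [3, 8]])    # input basis vectors as columns
    W = np.array([[3, 0], [1, 2]])    # output basis vectors as columns
    B = np.linalg.inv(W) @ V          # change of basis matrix B = W^(-1) V
    print(B)                          # [[1. 2.] [1. 3.]]
    print(np.allclose(W @ B, V))      # True: W B = V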

4.7 Constructing a matrix


Before we begin let us lay out some conditions. We have an input space V ∈ Rn
and output space W ∈ Rm defined by their respective bases that we have chosen
as vi and wi respectively. Transformation matrix A will be m × n and the first
column of A will have the output of T applied to v1 and the resultant output from
T (v1 ) will be in W . Note that T (v1 ) is a combination of the output bases :-

T (v1 ) → a11 w1 + · · · + am1 wm (4.95)

Therefore the elements a11 , · · · , am1 form the first column of A. As an example
when the transformation T is the derivative operator and the first basis vector is
1 then T (v1 ) = 0 that is the first column of A is the zero vector. Suppose the
input basis vectors vi are : 1, x, x2 , x3 . The output basis vectors wi are : 1, x, x2 .
Transformation T is the derivative operator and A is the derivative matrix. We
define the various processes as follows :-

    v = c1 + c2 x + c3 x² + c4 x³     (4.96)

    dv/dx = 1c2 + 2c3 x + 3c4 x²     (4.97)

         [ 0 1 0 0 ] [ c1 ]   [ c2  ]
    Ac = [ 0 0 2 0 ] [ c2 ] = [ 2c3 ]     (4.98)
         [ 0 0 0 3 ] [ c3 ]   [ 3c4 ]
                     [ c4 ]

Note that the j th column of transformation matrix A is found by applying T to the


j th basis vector vj . And T (vj ) is a combination of the output bases :-

T (vj ) = a1j w1 + · · · + amj wm (4.99)

Note additionally that every v is a combination of the input basis vectors and every
T (v) is a combination of the output basis vectors. Also note that the result of Ac is
the coefficients or the coordinates of the transformed vector as per the new basis.

4.8 Choosing the best bases


What if we choose our bases in such a way that our transformation matrix is a di-
agonal matrix? Remember that our transformation matrix depends on the bases
we choose. If we are dealing with the standard bases then we are quite unlikely
to get a diagonal matrix. So we see that if there are n independent eigenvectors
then we choose them as our input and output bases, and the result of this is that
we get our transformation matrix as the diagonal eigenvalue matrix Λ.[3] Remember
that in our case T transforms Rn to Rn , thereby making our input and output bases
the same.

• Consider a projection matrix T that projects every vector v = (x, y) in R2
  onto the line y = −x. If we follow the convention of the standard basis,
  v1 = (1, 0) will get projected onto the point

      T (v1 ) = ( 1/2 , −1/2 )     (4.100)

• Similarly we will end up projecting v2 = (0, 1) as follows :-

      T (v2 ) = ( −1/2 , 1/2 )     (4.101)

• So our projection transformation matrix becomes :-

      A = [  1/2  −1/2 ]     (4.102)
          [ −1/2   1/2 ]

• Now we will see that when we choose our bases as the eigenvectors of the
  projection matrix we get the bases as :-

      v1 = w1 = (1, −1)     (4.103)
      v2 = w2 = (1, 1)     (4.104)

  We notice that the vector in equation 4.103 projects to T (v1 ) = v1 , and there-
  fore λ1 = 1. Also the vector in equation 4.104 projects to T (v2 ) = 0, and
  therefore λ2 = 0.

• The new matrix therefore is :-

      [ 1 0 ]  =  [ λ1  0 ]     (4.105)
      [ 0 0 ]     [  0 λ2 ]

Remember that when we have an identity transformation, or when our input and
output bases are the same, we essentially change our bases from standard to new
bases by using B −1 where B = W −1 V . Since in the standard basis our projection
or transformation matrix is A, we can change the bases to the eigenvector bases
using the change of bases transformation matrix B in order to get a new projection
matrix that is similar to the earlier one :-

    Ab = B −1 Astd B     (4.106)

Here b refers to the new basis consisting of the eigenvectors and std refers to the
standard basis. In case our input and output bases are different then we change
our bases to singular basis vectors which are the result of an SVD on the T matrix.
We can transform our original transformation matrix A by :-
    Bout−1 A Bin = U −1 A V = Σ     (4.107)

Here the matrix V contains the input basis vectors and U contains the output basis
vectors.
Chapter 5

Singular Value Decomposition

5.1 Topics covered


This lecture starts off with a recap of the fundamental concepts behind Singular
Value Decomposition (SVD), moving on to explore the effects of SVD in various
sizes (and rank) of matrices. Subsequently we have a detailed look at the concept
of Left and Right inverse of a matrix and in what situations they are formed. We
see that these concepts are intimately tied to the existence of Null spaces.

5.2 Introduction
We decompose matrices in order to make our computation easier and to reveal
relationships among attributes. Decomposing a matrix with the method of Diag-
onalization is only possible if we have a square matrix, however an SVD can be
applied to any matrix of any size. An SVD happens to be one of the most popular
decompositions since it reveals the four fundamental subspaces that make up the
entire vector space, namely :- The row space, the column space, the null space
and the null space of the transposed matrix. Once we decompose any matrix to
reveal the elements of these fundamental subspaces, it is possible to easily identify
the relevant and most important properties of that matrix. We are instantly able
to see the relationships between rows and columns as well as identify correlations
or dependencies among various attributes. In addition, we find that the Singular
values in the decomposed matrix reveal important information about the effective
rank of a matrix. As we go along exploring this method of decomposition, we
will see how the SVD results in a dimensionality reduction of a matrix. In essence
this means that any large matrix can be decomposed or compressed into a ma-
trix with a very small effective rank and as a result our computations can become
much faster and more efficient. There are wide-ranging applications of the SVD, as we
will see later on. It is used extensively in disciplines like :- Economics, Finance, En-
gineering, Physics, Social science, Data analysis, Statistical modelling and many
others. When it comes to applying this concept to big data, it is especially of great
help. We will now develop the mathematics behind the Left and Right Inverse of
a matrix using SVD, after a small recap.

5.3 SVD as a low rank approximation


The standard representation of the SVD is given by : A_{m×n} = U_{m×m} Σ_{m×n} V_n^T. We
can represent this standard form in an expanded manner, where we can see the
elements of the four subspaces clearly :

  
A = [u_1  u_2  ···  u_r  u_{r+1}  ···  u_m]  Σ  [v_1^T ; ··· ; v_r^T ; ··· ; v_n^T]    (5.1)

where Σ is the m × n rectangular diagonal matrix diag(σ_1, σ_2, ··· , σ_r, σ_{r+1}, ··· , σ_n).

In this representation the vectors u1 · · · ur ∈ Um×m represent the column space


of matrix A and the vectors ur+1 · · · um ∈ Um×m represent the null space of AT .
Similarly, the vectors v_1^T ··· v_r^T ∈ V_{n×n} represent the row space of A and the vectors
v_{r+1}^T ··· v_n^T ∈ V_{n×n} represent the null space of A. In the Σ matrix as well, we
see that the first r singular values are the dominant non-zero singular values and
the rest are relatively quite small and hence can be ignored (set to 0). We must
note here that U and V are orthonormal matrices. When we represent the main
matrix using a rank r approximation by essentially setting all the non dominant
singular values to zero, we obtain a truncated SVD of matrix A. This is given by :

A_{m×n} = U_{m×r} Σ_{r×r} V_{r×n}^T    (5.2)

A = ∑_{i=1}^{r} σ_i u_i v_i^T    (5.3)

Equation 5.2 shows the truncated form of the SVD, i.e. the rank r approximation of A,
and equation 5.3 shows how matrix A is assembled from a summation of r rank-1
matrices.
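A minimal numpy sketch of this truncation (the 8 × 5 random matrix below is an arbitrary stand-in for a data matrix):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))

# Thin SVD; singular values are returned in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2                                        # keep the r dominant singular values
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]  # rank-r truncated SVD

# By the Eckart-Young theorem, A_r is the best rank-r approximation of A;
# the spectral-norm error equals the first discarded singular value.
print(np.linalg.norm(A - A_r, 2), s[r])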

5.4 The Symmetric R and L matrices


5.4.1 Case 1 :- The R matrix
As we go along this lecture we will discover the Left and Right inverses of a
matrix, which exist under certain special conditions. But before we develop those
theorems, we need a more fundamental building block: the R matrix, formed by
the product AA^T. The R matrix is a symmetric matrix, with R = R^T.
For complete understanding, here is how it will be represented in an SVD form :

R = AA^T = U_{m×m} Σ_{m×n} (V_{n×n}^T V_{n×n}) Σ_{n×m}^T U_{m×m}^T    (5.4)

R = U_{m×m} Σ_{m×n} Σ_{n×m}^T U_{m×m}^T    (5.5)

U_{m×m} diag(σ_1^2, σ_2^2, ··· , σ_r^2, σ_{r+1}^2, ··· , σ_m^2) U_{m×m}^T    (5.6)
NOTE: In equation 5.4, the expression (V_{n×n}^T V_{n×n}) resolves to the identity
matrix I_{n×n}.

What are singular values? (A recap): The diagonal entries of the matrix Σ
are called the singular values of a matrix and are computed as the positive
square roots of the eigenvalues of AA^T. They are typically arranged in descending
order in the matrix, with:

σ1 > σ2 > σ3 · · · > σr > 0

Where r denotes the effective rank of the matrix and is also an indicator of the
number of non-zero singular values present in the diagonal matrix.

5.4.2 Case 2 : The L matrix


Now we come to another fundamental building block necessary to explain
the formation of the Left inverse, which we shall look at shortly. The L matrix is
formed by the multiplication A^T A, and it is also a symmetric matrix with L = L^T.
Again, here we will represent this matrix in its SVD form for a better understanding:

L = A^T A = V_{n×n} Σ_{n×m}^T (U_{m×m}^T U_{m×m}) Σ_{m×n} V_{n×n}^T    (5.7)

L = V_{n×n} Σ_{n×m}^T Σ_{m×n} V_{n×n}^T    (5.8)

V_{n×n} diag(σ_1^2, σ_2^2, ··· , σ_r^2, σ_{r+1}^2, ··· , σ_n^2) V_{n×n}^T    (5.9)
Chapter 5. Singular Value Decomposition 67

NOTE: In equation 5.7, the expression (U_{m×m}^T U_{m×m}) resolves to the identity
matrix I_{m×m}.

Now, having developed a foundation using the R and L matrices, we will move on
to explaining the formation of Left and Right inverses using certain special cases
involving the four subspaces of a matrix, namely :- The column space, the row
space, the null space and the left null space.

5.5 A general purpose inverse of a square matrix


Before we start building the concept of the Left inverse, we need to remind our-
selves of the general inverse of a matrix, given by A^{-1}, and see why we even
need it. We primarily require the inverse of A to be able to solve a general
system of equations of the form :
Am×n xn×1 = bm×1 (5.10)
xn×1 = A−1 bm×1 (5.11)
This is a special scenario wherein m = n which essentially implies that we are
dealing with a square matrix. So we can conclude that for a square matrix with
full rank : m = n = r and the matrix does not have any linear dependencies
among columns or rows. It is only in this special scenario that a general purpose
inverse of A exists, using which we can compute the vector of unknown values :
x. We should also note that in this scenario, there exists a unique A−1 and hence
a unique solution to the system of equations exists. A geometric depiction of this
scenario is shown below :

Figure 5.1: case 1: m = n = r

As we can see from this figure, the matrix A is mapping the vector of unknowns
x ∈ Rn which belongs to the Row space of A, to b ∈ Rm which belongs to the
Column space of A. Notice here that since all the rows and columns of matrix
A are independent, the N (A) and N (AT ) do not exist. This is a very important
result. Since the null space and the left null space do not exist, that essentially
implies that both the null spaces contain only the zero vector. It is due to this
reason that the general purpose inverse of A exists and there is a unique solution
for the system of equations.
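A one-line numpy sketch of this case (the 2 × 2 full-rank system below is an arbitrary example):

import numpy as np

# Square, full-rank system: m = n = r, so a unique solution x = A^{-1} b exists.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)        # numerically preferable to forming inv(A)
print(x, np.allclose(A @ x, b))  # the unique solution, and a residual check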

5.6 Discovering the Left Inverse of a matrix


We will now consider a case wherein m > n and n = r, which essentially means
that matrix A has full column rank. There are linear dependencies among the
rows of A but all the columns are linearly independent. In this case we notice that
the null space of A contains only the zero vector: N(A) = 0. This is a unique case
in the sense that, due to this reason, N(A) does not exist; however, the null
space of A^T does exist - there are nonzero solutions to the system A^T y = 0. So in
this case N(A^T) ∈ R^m does in fact exist. Now this actually creates some problems
for us, in the sense that a general purpose A^{-1} does not exist and we cannot solve
the system of equations.

So in order to be able to solve the system of equations, we will construct the


Left inverse of A. In the diagram below we show the subspaces of matrix A, along
with the components of vector b along C(A) and N(A^T). A geometric repre-
sentation of this case is shown below:

Figure 5.2: case 2: m > n = r

Now within this scenario, we can have two cases as well: one involving the situation
where the vector b belongs to C(A), and one where it does not. They are explained below :

• case 1 : b_{m×1} ∈ C(A) : In this case the vector b lies in C(A), and hence
there will be exactly one solution to the system of equations.

• case 2 : bm×1 ∈ / C(A) : In this case we will project the vector b onto the
column space of A. This in turn means that we are solving a Least squares
solution vector. The equations are given by :

A x_{LS} = b'' ∈ C(A)    (5.12)

A^T A x_{LS} = A^T b    (5.13)

x_{LS} = (A^T A)^{-1} A^T b

b'' = P b

P = A (A^T A)^{-1} A^T    (5.14)

P = A L_n^{-1} A^T    (5.15)

5.6.1 Proof by contradiction


We will now prove by contradiction that there is no null space existing as a sub-
space of A. This is listed out stepwise below :

• Assume : z_n ≠ 0_n and A z_n = 0_m

• Multiplying both sides by A^T we get : A^T (A z_n) = A^T 0_m = 0_n

• Multiplying on the left by z_n^T we get : (A z_n)^T (A z_n) = 0

• With this we get ||A z_n||^2 = 0, so A z_n = 0_m; since the columns of A are
independent, this forces z_n = 0_n.

• This last step leads us to a contradiction, since our initial assumption z_n ≠ 0_n is
violated. Hence we conclude the following.

• L_n z_n = A^T A z_n = 0_n can never happen for a nonzero z_n, and hence the null space
does not exist (L_n is invertible).

5.6.2 Forming the Left Inverse


We will now finally construct the left inverse of matrix A using the symmetric L
matrix that we developed at the start of this lecture. Note that in the following
equations, A_L^{-1} denotes the left inverse of A :

L_n = A^T A    (5.16)

L_n^{-1} L_n = I_n

(A^T A)^{-1} (A^T A) = I_n

(A^T A)^{-1} A^T = A_L^{-1}    (5.17)

A_L^{-1} A = I_n

A_L^{-1} = L_n^{-1} A^T    (5.18)
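A small numpy sketch of equations 5.16-5.18 (the tall matrix below is an arbitrary full-column-rank example):

import numpy as np

# Tall matrix with full column rank: m = 3 > n = r = 2.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

L_n = A.T @ A                        # the symmetric L matrix, invertible here
A_left = np.linalg.inv(L_n) @ A.T    # A_L^{-1} = L_n^{-1} A^T = (A^T A)^{-1} A^T

print(np.allclose(A_left @ A, np.eye(2)))  # A_L^{-1} A = I_n
P = A @ A_left                             # P = A A_L^{-1}, projection onto C(A)
print(np.allclose(P @ P, P))               # a projection matrix is idempotent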

5.6.3 Notes
Finally we note that if there is a vector in N (AT ) then it will be mapped to the
zero vector and if there is a vector in the C(A) then it will stay there. This can be
further validated using the left inverse in the following equations :

((A^T A)^{-1} A^T) A = I_n    (5.19)

A (A^T A)^{-1} A^T = A A_L^{-1}    (5.20)

Here equation 5.19 suggests that there is no null space for A, and equation 5.20 sug-
gests that A^T has a null space. From here on, in the next lecture, we will develop
the theory behind the Right inverse in a similar fashion.
Chapter 6

Principal Component Analysis

The setting behind the subsequent explanations relies on an example that considers a 3-D
space wherein we have attached a mass to the end of a fixed, frictionless spring. We aim
to measure its oscillations or frequency along the x axis. So we
essentially place three viewing angles A, B, C at three axes (not knowing which axis
is in fact the x axis). After this we record measurements of the position of the mass
over a certain amount of time. Consider that it moves with a frequency of 120 Hz.

The primary function of Principal component analysis is to express the original
dataset in the form of optimally chosen basis vectors that are different from the
standard basis vectors. We expect this new basis to filter out noise from our dataset
and reveal the most important variables in the system. This process helps us to
identify, in the sea of variables and data, which are the most important
variables and which are the noisy variables. Now note that each time sample or
experimental trial is treated as one sample in our dataset, where we are recording
the position of the mass over 10 minutes.[4] The positions are recorded by all three
cameras with respect to their 2-D image from their viewing angle : in essence
they are recording a projection of the mass position from 3-D space to 2-D
space. So the three 2-D measurements of the three cameras are collapsed into
one vector X. Note that recording the ball position in an experimental trial for
10 minutes involves taking 10 × 60 × 120 = 72000 observations. Our observation
vector can be expressed as :-
X = (x_A, y_A, x_B, y_B, x_C, y_C)^T    (6.1)
We can here make some important statements like :- Sample vector X is made
up of m measurements or sample vector X is an m dimensional vector in a
space that is spanned by some set of orthonormal basis vectors. Also every
measurement vector in this space is essentially a linear combination of these
orthonormal unit basis vectors. Note yet another important point: since we
have measured our data in the standard basis, our coordinate vectors of sample


measurements would naturally be formed by linear combinations of these basis


vectors only. For an m dimensional case, we will end up forming an m × m matrix
of these basis vectors :-

    [ b_1 ]   [ 1  0  ···  0 ]
B = [  ⋮  ] = [ ⋮    ⋱       ] = I    (6.2)
    [ b_m ]   [ 0   ···    1 ]

Here each row is an orthonormal basis vector bi with m components.

6.1 Changing the bases


Is there another set of basis vectors which is essentially a linear combination of the
previous basis vectors that can re-express our dataset ?

Let X be our original dataset where each column represents a recording of the
position of the mass at a certain point of time and essentially it is a 6 × 72000
matrix.[4] Now let Y be another dataset that is related to the original dataset by
linear transformation P . Y is our new representation of our dataset and is given
by :-
PX = Y (6.3)
Now we will define a few variables for further interpretation :-
• p_i are the rows of P
• x_i are the columns of X
• y_i are the columns of Y
• Also remember that P is a matrix that transforms X into Y
• P can be treated as a matrix that rotates and then stretches X in the process
of transformation
• The rows of P : [p_1, ··· , p_m] are the new basis vectors that express the
coordinates of the X columns
We can see how this transformation plays out in matrices :-
   
     [ p_1 ]                  [ p_1 · x_1  ···  p_1 · x_n ]
PX = [  ⋮  ] [x_1 ··· x_n] =  [     ⋮       ⋱       ⋮     ]    (6.4)
     [ p_m ]                  [ p_m · x_1  ···  p_m · x_n ]

Now note that each column of Y is given by the form :-

y_i = (p_1 · x_i, ··· , p_m · x_i)^T    (6.5)

An important point to note here is that each coefficient or coordinate of y_i is the
inner product of x_i with the corresponding row of P. Note that the j-th coefficient
of y_i is simply the projection of x_i on to the j-th row of P. Each column of Y is
a projection on to the basis vectors of P; the rows of P are the new basis vectors
representing the columns of X. So we note finally that the row vectors of P, which
are the new basis vectors, are also known as the principal components of X. Now
the questions are :- What is the best way to re-express X, what is a good choice
of P, and what features would we like Y to have ?

6.2 Variance
We realise at this point that we need to keep the noise in our measurements low so as
to get the most information out of our signal. Also, all noise is measured relative
to the signal. [4] With this we get the Signal-to-noise ratio, which is a ratio
of variances :-

SNR = σ_signal^2 / σ_noise^2    (6.6)
A high SNR would indicate a precise measurement, while a low SNR indicates more
noise in the data. Here is a figure to give an idea of how signal and noise appear
in a series of measurements :-

Figure 6.1: SNR

Each camera's motion recording should be expected to trace a straight line, so any
deviation in positioning resulting from the measurements is noise. Note that the
small line depicts the variance of
noise and the big line depicts variance of the signal. We assume that Directions
with largest variances in the measurement space must contain the dynamics
of interest. Also we can see that the largest variance or the most dominant di-
rection of measurement is neither along the x basis nor the y basis but along the
slanted axis of our measurement points. That is why we have to look for other
bases: our dominant direction of highest variance does not correspond to
the directions of the standard basis vectors.

Note here that maximising the variance or our SN R is equivalent to appropri-


ately rotating the standard basis vectors so as to adjust one of them along the best
fit line to our data which also has the largest variance. Hence, rotating the stan-
dard basis vector along a line parallel to the dominant direction of our best fit line

would reveal the direction of motion of our mass. An additional point to note is
that measuring multiple variables causes redundancy or confounding. Two mea-
surements in this space may seem to be correlated.[4] As a result we can say that
essentially measuring one variable is enough to express the data in a concise manner.
This is the fundamental idea behind dimensionality reduction.

6.2.1 Covariance
In the case of 2 variables it is somewhat easy to expose confounding by fitting a
best-fit line to the data and assessing the quality of the fit. We will now generalize
this notion to any dimension. Consider two sets of measurements or samples with
zero means :-

A = [a_1, ··· , a_n],  B = [b_1, ··· , b_n]    (6.7)

The variances of A and B are given as follows :-

σ_A^2 = (1/n) ∑_{i=1}^{n} a_i^2,   σ_B^2 = (1/n) ∑_{i=1}^{n} b_i^2    (6.8)

The covariance is a measure of the degree of linear relationship between two


variables. The covariance between A and B is given by :-
σ_{AB}^2 = (1/n) ∑_{i=1}^{n} a_i b_i    (6.9)

If A and B in the above example are considered to be row vectors we can write
the covariance formula in matrix form as follows :-
σ_{ab}^2 = (1/n) a b^T    (6.10)
Further expanding this idea, if we consider the matrix X as containing many such
row vectors of the form a and b then we can simultaneously find the covariance
between each pair of these measurement vectors by computing the covariance
matrix in the following way :-
C_X = (1/n) X X^T    (6.11)
Remember that this is a square m × m matrix with variance of each sample mea-
surement vector along the diagonals of the matrix. [5]The off diagonal terms
are the covariance terms between various pairs of sample measurement vectors.
Remember two crucial points :-
• The high values in diagonal elements reflect the interesting structure that
explains the significant variance of the measurement that interests us.

• The off diagonal elements reflect the noise in the system.


We ideally want to manipulate this CX covariance matrix to obtain CY which has
some interesting optimal features.

6.2.2 Diagonalize the Covariance matrix


Remember that our ultimate aim is to minimize redundancy, which is measured by
the off-diagonal terms in the covariance matrix, and we wish to maximize the signal,
which is measured by the diagonal variance terms. We will now see what this optimized C_Y
would look like. Remember two main points regarding the optimal form :-

• The off diagonal terms should be made 0 and in essence, the effect of the y
measurements that cause confounding is decoupled from our measurements.

• Each successive dimension should be rank ordered as per variance values.

PCA comes in at this point: it is essentially a method of transforming
the covariance matrix by diagonalizing it in the easiest way possible. That
is, by premultiplying by the projection matrix P, which is an orthonormal matrix
since its rows, which represent the new basis vectors, are orthonormal vectors
: [p_1, ··· , p_m]. The way PCA works is that P essentially rotates the existing basis
vectors so that they align with the dominant direction of the best fit line of maximum
variance. [4] The algorithm is as follows :-

• Select a normalized direction in the m dimensional space in which the vari-


ance in X is maximum. Select this direction as vector p1 .

• Now find another direction along which the variance is maximised, but this time
include only those directions that are orthogonal to the directions found earlier.
This is due to the orthonormality constraints. Keep saving these vectors
as p_i.

• Repeat this procedure until m vectors are selected.

Note that the resultant ordered set of vectors p_i are precisely what are known as
principal components. The importance of rank ordering these direction vectors is
that we can easily obtain the importance of each direction.

6.3 Solve PCA using Eigen Decomposition


Our goal for the original dataset X is as follows :- Find an orthonormal matrix P
in the relation PX = Y such that the covariance matrix C_Y = (1/n) Y Y^T is a
diagonal matrix, and such that the rows of P are the principal components or
new basis vectors of X. Here is how we find C_Y :-

C_Y = (1/n) Y Y^T    (6.12)

C_Y = (1/n) (PX)(PX)^T    (6.13)

C_Y = (1/n) P X X^T P^T    (6.14)

C_Y = P ((1/n) X X^T) P^T    (6.15)

C_Y = P C_X P^T    (6.16)
Additionally we note that any square symmetric matrix is in fact diagonalized by
an orthogonal matrix of its eigenvectors. This relation of similarity transformation
or diagonalization is given by :-

A = EDE T (6.17)

Where A is a symmetric matrix and D is a diagonal matrix containing the eigenval-


ues on its diagonals. E is a matrix of eigenvectors where the eigenvectors are the
orthogonal columns of matrix E. Now we select a matrix P such that its rows
p_i are the eigenvectors of the covariance matrix of X, given by :-

(1/n) X X^T    (6.18)

Recall that for an orthonormal matrix P^{-1} = P^T and P P^{-1} = I, and that we are
choosing P = E^T. Now we can re-evaluate C_Y as :-

CY = P C X P T (6.19)

CY = P (EDE T )P T (6.20)
CY = P (P T DP )P T (6.21)
CY = (P P T )D(P P T ) (6.22)
CY = (P P −1 )D(P P −1 ) (6.23)
CY = D (6.24)
And hence we find that with that particular choice of P our CY has been diago-
nalized. Our results can be summarized as follows :-
• The principal components of X are the eigenvectors of C_X = (1/n) X X^T.
• The ith diagonal value of CY is the variance of X along the principal compo-
nent pi .
• The process involves first centering the values of X and then finding the
eigenvectors of CX .
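The summary above translates directly into a few lines of numpy; a minimal sketch, with an arbitrary random 6 × 1000 data matrix standing in for the camera recordings:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 1000))      # m = 6 measurements, n = 1000 samples

X = X - X.mean(axis=1, keepdims=True)   # center each measurement (row)
C_X = (X @ X.T) / X.shape[1]            # covariance matrix, m x m

eigvals, E = np.linalg.eigh(C_X)        # eigendecomposition of a symmetric matrix
order = np.argsort(eigvals)[::-1]       # rank-order by decreasing variance
P = E[:, order].T                       # rows of P are the principal components

Y = P @ X
C_Y = (Y @ Y.T) / Y.shape[1]
print(np.round(C_Y, 6))                 # diagonal: variances along each component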

6.4 Using the SVD


Let X be an n × m matrix and X^T X be an m × m square symmetric matrix of rank r.
We then note the following :-
• [v1 , · · · , vr ] is the set of orthonormal eigenvectors of the X T X matrix with
corresponding eigenvalues as λ1 · · · , λr . We can write the characteristic
equation as :-
(X T X)vi = λi vi (6.25)

• σ_i = √λ_i are the positive singular values

• [u_1, ··· , u_r] are vectors such that :-

u_i = (1/σ_i) X v_i    (6.26)

• u_i · u_j = 1 if i = j, and 0 if i ≠ j.

• Xvi = σi ui

Note that the sets of eigenvectors v and the vectors u are orthonormal vectors in r
dimensional space. Also Σ is a diagonal matrix as previously shown with diagonal
elements containing the singular values in a rank ordered manner. Finally when
we fill up all the v and u vectors into matrices we get the decomposition of X as
follows :-
X = U ΣV T (6.27)
This essentially means that the orthonormal matrix V^T first rotates the coordinates,
the Σ matrix then stretches them, and the U matrix rotates them again. In the equation
XV = UΣ we can think of the columns of V as input vectors and the columns of U as
output vectors, wherein these vectors span the input and output spaces respectively.
Now we present this manipulation :-

X = U ΣV T (6.28)

U T X = ΣV T (6.29)
UT X = Z (6.30)
Where Z = ΣV T . We note that in this last equation U T can be essentially seen as
a change of basis matrix such that X is re-expressed as Z. Note that the columns
of V in this case are the principal components of X.
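The equivalence of the eigendecomposition and SVD routes can be checked numerically; a minimal sketch, using the measurements × samples layout of Section 6.3:

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 1000))
X = X - X.mean(axis=1, keepdims=True)

# Route 1: eigenvalues of the covariance matrix, in descending order
eigvals = np.linalg.eigvalsh((X @ X.T) / X.shape[1])[::-1]

# Route 2: squared singular values of X, scaled by 1/n, give the same variances
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(eigvals, (s ** 2) / X.shape[1]))  # True
# In this layout the columns of U give the principal directions.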

References
[1] Kohei Adachi : Matrix-based approach to multivariate data analysis

[2] Gilbert Strang : Introduction to Linear Algebra

[3] David Lay : Linear Algebra and its applications

[4] Jonathon Schlens (Google Research) : A tutorial on PCA

[5] Johnston & DiNardo : Econometric Methods (Matrix section appendices)

[6] Serge Lang : Introduction to Linear Algebra


Chapter 7

Text Analytics using SVD

Scribes: Ishita Gupta, Rohith Krishna

7.1 Introduction
Text Analytics is a discipline of computer science that is used to draw meaning-
ful inferences from unstructured text documents. Some popular applications of
text analytics include sentiment analysis, email spam filters, plagiarism detection,
stock market predictions based on sentiments of web users and also in dream
content analysis in psychology. It uses techniques of matrix decomposition, such
as eigendecomposition and singular value decomposition, to construct a low-rank ap-
proximation to the term-document matrix.

7.2 Singular Value Decomposition


We review singular value decomposition (SVD) and its features. Consider an m × n
matrix C with rank r. The SVD of C is given by:

C = U ΣV T (7.1)
where, U and V are orthonormal matrices provided C contains only real entries.
The matrix U has dimensions m × m and contains the orthogonal eigenvectors
of CC T while the matrix V has dimensions n × n and contains the orthogonal
eigenvectors of C^T C. The matrix Σ is diagonal (in the general case, a rectangular
diagonal matrix) and contains entries σ_i's called
singular values. The matrix Σ has dimensions m × n and the number of non-zero
singular values in it is min(m, n). Let p = min(m, n).

Note. The eigenvalues λ_1, λ_2, ··· , λ_r of the square matrix CC^T are the same as
those of the matrix C^T C. Further, the singular values of C are given by σ_i = √λ_i,
where 1 ≤ i ≤ r. The λ_i's are arranged in descending order, with the most
dominant eigenvalue on the first row's diagonal, the next on the second diagonal,
and so on. As a consequence, the singular values, which are the square roots of the
eigenvalues are also arranged in this order. A key assumption here is that the
eigenvalues are all non-negative λi ≥ 0 ∀ i ∈ 1, 2, · · · , p and thus the singular
values are all real, σi ∈ R+ . The matrix Σ has entries such that Σii = σi for
1 ≤ i ≤ r and zero otherwise. Therefore:

σ1 > σ2 > · · · > σr


σr+1 ' σr+2 ' · · · ' σp ' 0 (7.2)

7.3 Visualizing SVD


In order to visualize the singular value decomposition of a rectangular matrix,
consider the following: A rectangular matrix C with dimension m × n where m >
n. The number of rows exceeds the number of columns and hence C is a thin matrix. In such a
case we can see that the number of singular values is p = min(m, n) = n here. Of
these n singular values r are dominant singular values and the remaining (n − r)
singular values fall to zero. Thus one can see:

Figure 7.1: Visualizing SVD: A thin matrix (C_{m×n} = U_{m×m} Σ_{m×n} V_n^T).

Now consider a long rectangular matrix C with dimension m × n where m < n.


The number of columns exceeds the number of rows and hence C is a long matrix.
In such a case we can see that the number of singular values is p = min(m, n) = m
here. Of these m singular values r are dominant singular values and the remaining
(m − r) singular values fall to zero. Thus one can see:

Figure 7.2: Visualizing SVD: A long matrix (C_{m×n} = U_{m×m} Σ_{m×n} V_n^T).



7.4 Information Retrieval Strategies


The study of text analytics encompasses the domain of information retrieval,
which deals with finding relevant documents. The idea of relevance of a document
is to be solidified before further progress. We understand that retrieval strategies
involve measuring the similarity or closeness between a query and a document.
A document is considered relevant to a query if more terms appear in common
in both. There are several information retrieval strategies: probabilistic retrieval,
Language Model, Inference networks, Neural networks, Boolean Indexing, Latent Se-
mantic Indexing, Vector Space Model, Genetic Algorithms, Fuzzy Set retrieval etc.
In this lecture we shall be focussing on Latent Semantic Indexing and the Vector
Space model.

7.5 Query
A query is a mini-document that a user wishes to search for in a collection of
documents. This section expounds different treatments of a query. A Boolean
query is one which returns the documents in the collection that match the
given query. However, in real life, one often has to sift through a large number of
documents - either on the internet or a private intranet collection of documents.
Therefore, it is incumbent on the search engine that it ranks the various documents
on the basis of relevance to the query and displays matching documents in the
rank-order. In order to accomplish this, we develop various scoring techniques.

7.6 Weighted-Zone Scoring


Consider a Boolean query q and a single document d in a collection of documents.
We define a zone as an arbitrary unbounded amount of text in a part of a docu-
ment. That is, every document can be divided into parts called zones. The purpose
of defining zones is to break up the document into smaller chunks each with in-
dividual weights of importance in the document. When the document is scored,
these zones and their weights play an important role in ranking and hence the
relevance of the search results.

For instance, a document can have title, abstract, author, body and references as
zones. These zones are chosen so because of their natural and logical separation
in actual documents such as research papers, however, it should be noted that
zones can be arbitrarily chosen. Suppose the query is to search for all works by
the author Strogatz, that is q = “Strogatz”, then a greater weight must be assigned
on the author zone than the others. However, if the query is to search for the
works that deal with chaos, q = “chaos”, then weights must be larger on the title,
abstract and body. The distinction made here implies that the search engine has to
weigh different zones (such as abstract, author etc.) differently according to the
query being made.

A Boolean score assigns the value 1 or 0, depending on whether a match
for the query is found in a particular zone or not. Based on this Boolean

for the query is found in a particular zone or not found. Based on this Boolean
score and the zone weights, the process of weighted zone scoring, assigns a score
between 0 and 1 for the query-document pair. The WZ scoring is performed by
considering linear combinations of zone scores. With these WZ scores, documents
can be ranked for a particular query.

(q, d) ──(WZS assigns a score based on zone weights)──→ [0, 1]

Now consider a query-document pair (q, d). Our objective is to weight score the
document for the given query q. Let the document be divided into l zones. It is
important that if this document is taken from a collection, then every document
in the collection has l zones each. Let gi be the weight on the ith zone. Further, let
g1 , g2 , · · · , gi , · · · , gl ∈ [0, 1] such that,
∑_{i=1}^{l} g_i = 1    (7.3)

For 1 ≤ i ≤ l, si is the Boolean score denoting a match between the query q and
the ith zone. Suppose all the query q terms occur in the zone i, then the Boolean
score for the ith zone is 1. That is:
Boolean score: s_i = 1 if query q matches zone i; s_i = 0 if query q does NOT match zone i.

The weighted zone score is defined as the summation of all Boolean scores weighted
by their individual zone weight in a document. That is,
WZS = ∑_{i=1}^{l} g_i s_i    (7.4)

Example. Consider the following document d divided into 4 zones - title, author,
abstract and body as shown below. Each of these zones are assigned weights:
g1 = 0.2, g2 = 0.1, g3 = 0.2, and, g4 = 0.5. The search query is chaos. For this
particular query-document pair, determine the WZS.
For the given (q, d) pair, with q = chaos, we know the zonal weights listed below. We
can also observe that the sum of zonal weights for the document equals one: Σg_i = 1.

Title     g_1   0.2
Author    g_2   0.1
Abstract  g_3   0.2
Body      g_4   0.5

For this particular query, the word chaos pertains to the subject matter in the doc-
ument. Thus it is reflected in the given weights that the zone author takes the
least weight. However, were this a search engine indexing based on the author,
the opposite would be true.

Figure 7.3: The l zones of the document d

Since the term chaos appears in d in the zones title, abstract and body, the
corresponding Boolean scores are s_i = 1. Hence the WZS is:

WZS = ∑_{i=1}^{4} g_i s_i = (0.2)(1) + (0.1)(0) + (0.2)(1) + (0.5)(1) = 0.9
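A small Python sketch of this scoring (the weights and Boolean matches below simply mirror the worked example for q = chaos):

# Zone weights g_i and Boolean scores s_i for the query q = "chaos"
weights = {"title": 0.2, "author": 0.1, "abstract": 0.2, "body": 0.5}
matches = {"title": 1, "author": 0, "abstract": 1, "body": 1}

# Weighted zone score: sum of Boolean scores weighted by zone weights
wzs = sum(weights[z] * matches[z] for z in weights)
print(wzs)   # 0.9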

Note. In case the query comprises two terms, there are two methods of
going forward with Boolean scores. The two-term query can be expressed as q = q_1 · q_2.
Here the · denotes Boolean AND; thus if the independent Boolean scores of both
return 1 then we obtain a match with the query q. Another way is to express the
two-term query as q = q1 + q2 , where the + denotes the Boolean OR. In this case
a match is obtained if any of the two terms are present in a particular zone.

Note. For multiple-term queries, we take the “bag of words” model here, which
means that we ignore syntax and treat any combinations of the words in the given
query as identical to the original. Thus if q1 = A red bag full of apples. and if q2 =
A bag full of red apples, then q1 ' q2 , although their meanings are different.

7.7 Term frequency ranking


Section 7.5 introduced the idea of a query. Queries could have a single term
or multiple terms. The query in which the entry searched is a single term without
any connectives (such as Boolean operators) is called freeform text query. The
idea of dividing a document into zones and searching the query for match within

these zones, and thereby obtaining the WZS, has been discussed in Section 7.6.

Figure 7.4: The document d, sourced from [3], divided into four zones (title, author,
abstract and body) with Σg_i = 1.

Now, we shall consider a collection of documents c in this section. We measure


the frequency of a query term within a document in order to obtain the term
frequency. The weighting scheme term frequency (tf) is simply defined as the
number of times a term t in a query q appears in d. Further, we develop a method
that uses term frequency in order to weigh and rank documents. Note that each term in
a document has a weight depending on its occurrence frequency in the document.

Then, in order to determine the relevance of a document, one has to take into
account the adverse effect of common words - words that occur too often. These
effects have to be attenuated for meaningful and relevant search results. In due
course we will make use of SVD to carry this out.

We use raw term frequency to characterize queries, where all terms are equally
important. This, however, is not useful to us, for the following reason. Consider
the previous example in 7.6. If the collection of documents is the Journal on Non-
linear Dynamics and Chaos, then the query chaos has little to no significance in
rank-ordering documents for all documents are likely to contain this term.

Figure 7.5: Collection c of individual documents d_i.

We define collection frequency as the number of occurrences of a term t in the


entire collection c. One approach to get around the problem of raw term
frequency in weighing documents is to divide the tf of a term in a document by
its collection frequency cf. Note that the term frequency is defined for a term in a
particular document and hence is often denoted with the subscript tf_{t,d}.

A better method for weighing terms uses the document frequency (df ) defined as
the total number of documents in the collection that contain the term t. It makes
sense to use df because our purpose here is to discriminate between documents
for the purpose of scoring and ranking them.

Example. Consider the following table. For the words equilibrium and there, the
collection frequencies are almost identical. However, since there is a common word,
its df is higher. While scoring, one has to search for equilibrium in only 102
documents, making the process efficient and retrieval speedy.

Word          cf    df
equilibrium   235   102
there         259   195
chaos         240   161
(number of documents = 200)

7.8 Inverse document frequency ranking


Consider a collection c with a total of N documents; thus Σd_i = N. From the
example in Section 7.7, we know that common words have a high df. Intuitively we
understand that words with a higher df must receive a lower score, for they are
common words irrelevant to ranking documents. In order to attenuate their effects
in scoring, we propose the use of the inverse document frequency (idf), defined as:

idf_t = log(N / df_t)    (7.5)

It can be observed that the idf of a rare term is high while that of a common term
is low. For the example in Section 7.7, we can see that idf_equilibrium > idf_there.

In order to incorporate the merits of using both tf and idf in rank ordering, we
combine them into tf -idf composite weighting. This weight for a term t in a
document d is defined as:
tf -idft,d = tft,d × idft (7.6)
This method assigns weight to a term t in a document d such that tf -idf is:

• highest when t occurs many times in a small number of documents.


• lower when t occurs fewer times in a document or when t occurs in many
documents.
• lowest when t occurs in almost all documents.

Example.
Consider the following table with terms from a collection of the Journal on Non-linear
Dynamics and Chaos. Here, equilibrium is an important term not found in all documents
and has a relatively high idf. A common word there is found in virtually all documents
and has the least idf. However, a rare word like dexterity has the highest idf.

Term          df    idf
equilibrium   102   0.894
there         695   0.061
chaos         490   0.212
dexterity     20    1.602
(number of documents = 800)

A document is considered a vector, with one component corresponding to each
term in the collection. If a term in the collection does not appear in the document,
then its weight is zero. Using the definition in 7.6, we can define the overlap score
measure, which is the score of a document d, as the sum of the tf-idf weights of
each query term in d. That is,

Score(q, d) = ∑_{t∈q} tf-idf_{t,d}    (7.7)
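A minimal Python sketch of this overlap scoring, following equations 7.5-7.7 (the three-document toy corpus below is invented purely for illustration):

import math

docs = [
    "equilibrium of the dynamical system",
    "chaos in the dynamical system",
    "there is chaos and there is equilibrium",
]
N = len(docs)
tokenized = [d.split() for d in docs]

def idf(term):
    # inverse document frequency: log(N / df_t)
    df = sum(term in doc for doc in tokenized)
    return math.log10(N / df) if df else 0.0

def score(query, doc):
    # overlap score: sum of tf-idf weights of the query terms in the document
    return sum(doc.count(t) * idf(t) for t in query.split())

for i, doc in enumerate(tokenized):
    print(i, round(score("chaos equilibrium", doc), 3))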

In the forthcoming section, we shall expand on this definition in a more rigorous


manner using the vector-space model. We shall define the term-document ma-
trix and apply SVD on this matrix for ranking the documents and yielding rele-
vant/close search results.

References
[1] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman,
R. (1990). Indexing by latent semantic analysis. Journal of the American
society for information science, 41(6), 391-407.

[2] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to


information retrieval. Cambridge University Press. https://nlp.stanford.
edu/IR-book/

[3] David Stockman, (2012), Li-Yorke Chaos in Models with Back-


ward Dynamics. https://cpb-us-w2.wpmucdn.com/sites.udel.edu/dist/
2/425/files/2012/03/lyc-backward.pdf

[4] Grossman, D. A., & Frieder, O. (2012). Information retrieval: Algorithms


and heuristics (Vol. 15). Springer Science & Business Media.

[5] Strang, G., (1993). Introduction to linear algebra (Vol. 3). Wellesley, MA:
Wellesley-Cambridge Press.

[6] Golub, G. H., & Van Loan, C. F. (2012). Matrix computations (Vol. 3). JHU
press.
Part II

Advanced Probability

Chapter 8

Fundamentals of Probability

8.1 Combinatorics
Starting with the fundamental principle of counting, we can assume that experi-
ment 1 results in any of m possible outcomes and experiment 2 results in any of
n possible outcomes. Then, if these two experiments are performed in succession,
we would observe that there are a total of mn outcomes possible. Note that the
below matrix lists out all the possible pairs of outcomes from experiment 1 and
2. The item (i, j) corresponds to the pair in which i was obtained in experiment 1
and j was obtained in experiment 2.
 
[ (1, 1)  (1, 2)  ···  (1, n) ]
[   ⋮       ⋮      ⋱     ⋮    ]    (8.1)
[ (m, 1)  (m, 2)  ···  (m, n) ]
(m, 1) (m, 2) · · · (m, n)

As a general rule, remember that if there are a total of r experiments to be per-
formed, and the first has n_1 possible outcomes, the second experiment has n_2
possibilities, and the r-th experiment has n_r possibilities, then the total number of
possibilities from all the experiments together is:

n_1 × n_2 × ··· × n_r    (8.2)

8.1.1 Permutations
Ordered arrangements of elements are called permutations. For example if we
have letters a, b, c then the permutations of these letters (elements) is given as:
abc, acb, bac, bca, cab, cba (8.3)
Each such arrangement is called a permutation. Note that as a general rule, for n
objects there are n! permutations:
n(n − 1)(n − 2) · · · 3.2.1 (8.4)
Things become a bit more involved when we are permuting elements in which
there are some objects that are alike. For example if we want to find different


arrangements of the word PEPPER then obviously we will have a total of 6! per-
mutations possible, since there are six letters in the word. But, what if we simply
interchange the alike elements in the word ? For example, if we simply inter-
change the two middle P’s, then it wouldn’t really change our permutation. For
this reason we calculate the total number of permutations of PEPPER by adjusting
for the permutations among the alike elements as well. So the final number of
permutations becomes:
6! / (3! 2!)    (8.5)
Note that 3! refers to the number of permutations among the P’s (which are three
in number) and 2! refers to the number of permutations among the E’s. As a
general rule we can say:
n! / (n_1! n_2! ··· n_r!)    (8.6)
Where there are n1 alike elements of type 1, n2 alike elements of type 2 and so on.

8.1.2 Combinations
If we want to form groups of r objects from a total of n objects, where the order
of arrangement is now irrelevant, then we call these groups combinations. The
number of such combinations is given by:
C(n, r) = n! / ((n − r)! r!)    (8.7)
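These counts are easy to verify in Python; a small sketch using the PEPPER example from above:

from itertools import permutations
from math import comb, factorial

# Distinct arrangements of PEPPER: 6! / (3! 2!) = 60
print(factorial(6) // (factorial(3) * factorial(2)))   # 60
print(len(set(permutations("PEPPER"))))                # 60, by brute force

# Combinations: groups of r objects out of n, order irrelevant
print(comb(5, 2))                                      # 10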

8.1.3 Binomial Theorem


The Binomial theorem is a general rule that applies to polynomial expansion of a
sum of two variables. It is given by:
(x + y)^n = ∑_{k=0}^{n} C(n, k) x^k y^{n−k}    (8.8)

As a simple example, consider finding the expansion of (x + y)3 . We can use the
binomial theorem to expand this expression:
(x + y)^3 = C(3,0) x^0 y^3 + C(3,1) x^1 y^2 + C(3,2) x^2 y^1 + C(3,3) x^3 y^0    (8.9)

= y^3 + 3xy^2 + 3x^2 y + x^3    (8.10)

8.2 Basic Sets


The union of many events E_1, E_2, ··· , E_n can be expressed as:

∪_{n=1}^{∞} E_n    (8.11)

This union consists of all outcomes in at least one of the E_i events. In a
similar manner, the event consisting of outcomes in all of the E_i events is given by
a continuous intersection of these sets:

∩_{n=1}^{∞} E_n    (8.12)

Note the all-important De Morgan's laws, given by the following expressions. Also
note that the superscript c refers to the complement of a set (the set of elements
not in the set).

(A ∪ B)^c = A^c ∩ B^c  →  (∪_{i=1}^{n} E_i)^c = ∩_{i=1}^{n} E_i^c    (8.13)

(A ∩ B)^c = A^c ∪ B^c  →  (∩_{i=1}^{n} E_i)^c = ∪_{i=1}^{n} E_i^c    (8.14)

Note an important point that events are nothing but sets of outcomes and hence
we can denote events as sets and perform set manipulation on them. For example,
we can denote the concept of mutually exclusive events using set notation as
follows:
E_1 ∩ E_2 = E_1 E_2 = φ    (8.15)

The above equation basically means that if the intersection of two sets is the empty
set then they are, effectively, mutually exclusive sets or events. We can also
compute the probability of the union of many mutually exclusive events as follows:

P(E_1 ∪ E_2) = P(E_1) + P(E_2)  →  P(∪_{i=1}^{∞} E_i) = ∑_{i=1}^{∞} P(E_i)    (8.16)

Moving on, we can state the basic expansion of a union of sets that are not
mutually exclusive as:
E ∪ F = E + F − EF (8.17)
Expanding the union of three sets:
E ∪ F ∪ G = E + F + G − EF − EG − F G + EF G (8.18)
Applying the probability operator, we simply get:
P (E∪F ∪G) = P (E)+P (F )+P (G)−P (EF )−P (EG)−P (F G)+P (EF G) (8.19)
Look hard enough and you will see that a pattern emerges in terms of signs in the
above summation. The combined union is the sum of all (positive sign) sets taken
one at a time, all (negative sign) sets taken two at a time and all (positive sign)
sets taken three at a time. We can generalize this to the union of n sets as follows
in terms of probability:

P(E_1 ∪ E_2 ∪ ··· ∪ E_n) = ∑_{i=1}^{n} P(E_i) − ∑_{i_1<i_2} P(E_{i_1} E_{i_2}) + ··· + (−1)^{n+1} P(E_1 E_2 ··· E_n)    (8.20)

8.3 Conditional probability


This means finding the probability of an event E occurring given the fact that
event F has occurred. Since F has already occurred we can say that this is now
our new sample space, instead of the entire sample space. So essentially, we want
to find the probability that E and F both occur simultaneously given that F has
already occurred. It is given by:
P(E|F) = P(EF) / P(F)    (8.21)
Note that the above expression can alternatively be written as:
P (EF ) = P (E|F )P (F ) (8.22)
Now note an important point. Suppose that there are two sets or events called
E and F . Now we know that when only these two sets exist in our world, then
the set E can be defined as - the union of the intersection of E with F and the
intersection of E with the complement of F .
E = EF ∪ EF c (8.23)
Applying the probability operator to the above sets we get:
P (E) = P (EF ) + P (EF c ) = P (E|F )P (F ) + P (E|F c )P (F c ) (8.24)
Therefore, from the above expression we can say that the total probability of event
E is the weighted average of the conditional probability of E given that F has occurred
and the conditional probability of E given that F^c has occurred, i.e. F has not oc-
curred.

8.3.1 Bayes
We will introduce the concept of Bayes' theorem with the help of a common
example. Suppose that D is the event that a person has a disease and E is the
event that, upon testing for the disease, the test comes out positive (note that
there can be a false positive test also - the test may come out positive even if the
person does not have the disease). Now we want to find the probability that the
person has the disease given that the result is positive:

P(D|E) = P(DE) / P(E)    (8.25)

       = P(E|D) P(D) / [P(E|D) P(D) + P(E|D^c) P(D^c)]    (8.26)
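A small numeric sketch of this formula (the prevalence, sensitivity and false-positive rate below are invented numbers for illustration):

# Hypothetical numbers: prevalence P(D) = 1%, sensitivity P(E|D) = 0.95,
# false-positive rate P(E|D^c) = 0.05.
p_d = 0.01
p_e_given_d = 0.95
p_e_given_dc = 0.05

# Total probability of a positive test, then Bayes' theorem
p_e = p_e_given_d * p_d + p_e_given_dc * (1 - p_d)
p_d_given_e = p_e_given_d * p_d / p_e
print(round(p_d_given_e, 4))   # ~0.161: a positive test is far from conclusive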

8.3.2 Odds
As a quick note, odds are defined as the ratio of probability of occurrence of an
event to the probability of the non-occurrence of the event. It is given as:
P(A) / P(A^c) = P(A) / (1 − P(A))    (8.27)

8.4 Distributions
Starting with the Bernoulli random variable, we define this random variable as
the outcome of a single trial when the outcomes are of only two types - success
and failure, encoded as 1 and 0 respectively.

p(0) = P (X = 0) = 1 − p (8.28)

p(1) = P (X = 1) = p (8.29)
Extending the same concept a little further: suppose we have n independent
trials, each of which is associated with a probability of success p and probability
of failure (1 − p), and we define the random variable X as the Number of suc-
cesses in n trials. Then what we have is essentially a binomial random variable.
p(i) = C(n, i) p^i (1 − p)^{n−i}    (8.30)

Some general properties:

E[X] = np (8.31)
V AR[X] = npq = np(1 − p) (8.32)
P(X ≤ i) = ∑_{k=0}^{i} C(n, k) p^k (1 − p)^{n−k}    (8.33)

8.4.1 Poisson
A random variable X taking on values 0, 1, 2, ··· is called a Poisson random vari-
able with parameter λ if:

p(i) = P(X = i) = e^{−λ} λ^i / i!    (8.34)
Note that this is the approximation of a binomial variable when n is very large and
p is small. Some general properties:

E[X] = λ (8.35)
V AR[X] = λ (8.36)
The derivation, for optional reading, is presented below in a stepwise manner:

• First, we assume a binomial random variable with parameters n and p such


that n is very large and p quite small. Using the binomial distribution prob-
ability mass function, we can get the probability of i successes in n trials
as:
P(X = i) = [n! / ((n − i)! i!)] p^i (1 − p)^{n−i}    (8.37)

• Now we basically let λ = np, or p = λ/n, and with this we can rewrite the
previous formula in terms of λ as follows:

P(X = i) = [n! / ((n − i)! i!)] (λ/n)^i (1 − λ/n)^{n−i}    (8.38)

         = [n(n − 1) ··· (n − i + 1) / n^i] · (λ^i / i!) · (1 − λ/n)^n / (1 − λ/n)^i    (8.39)

• Now we note the following approximations:

(1 − λ/n)^n ≈ e^{−λ}    (8.40)

n(n − 1) ··· (n − i + 1) / n^i ≈ 1    (8.41)

(1 − λ/n)^i ≈ 1    (8.42)
• And finally we end up with:

P(X = i) = e^{−λ} λ^i / i!    (8.43)
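The quality of this approximation is easy to see numerically; a minimal sketch, with arbitrary values n = 1000 and p = 0.003:

from math import comb, exp, factorial

n, p = 1000, 0.003
lam = n * p   # lambda = np

# Binomial(n, p) pmf versus its Poisson(lambda) approximation
for i in range(6):
    binom = comb(n, i) * p**i * (1 - p)**(n - i)
    poisson = exp(-lam) * lam**i / factorial(i)
    print(i, round(binom, 5), round(poisson, 5))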

8.4.2 Geometric
Suppose now that there are many independent trials, each having a probability
of success as p, such that these trials are performed until a success occurs. Our
random variable X primarily defines the number of trials required until the first
success is encountered.
P (X = n) = (1 − p)n−1 p (8.44)
Some key points:
E[X] = 1/p    (8.45)

VAR[X] = (1 − p)/p^2 = q/p^2    (8.46)

8.4.3 Negative Binomial


Now suppose that we perform many independent trials, with each trial having
the same probability of success as p and we perform trials until we accumulate r
successes. Here let the primary random variable X denote the number of trials
required to accumulate r successes.
P(X = n) = C(n − 1, r − 1) p^r (1 − p)^{n−r}    (8.47)

The main logic is that for us to stop conducting the trials, the r-th success has
to happen at the n-th trial, and therefore we count the combinations of the r − 1
successes that must have occurred in the first n − 1 trials. Some key points :

E[X] = r/p    (8.48)

VAR[X] = r(1 − p)/p^2 = rq/p^2    (8.49)

8.5 Cumulative distribution functions

The cumulative distribution function F for a random variable X is given by:

F(x) = P(X ≤ x)    (8.50)

Note that for a distribution function F, F(b) denotes the probability that the
random variable takes on values less than or equal to b. Some properties of
CDFs are:

• F is non-decreasing, which essentially means that for a < b we have F(a) ≤ F(b).

• The following can be defined as a limiting case:

lim_{b→∞} F(b) = 1    (8.51)

• The following can be defined as a limiting case:

lim_{b→−∞} F(b) = 0    (8.52)

• The CDF function is right continuous.

8.6 General points about discrete variables


Here is a quick list of some general pointers regarding the expectations and variances
of discrete random variables.
E[X] = ∑_x x p(x)    (8.53)

E[g(X)] = ∑_x g(x) p(x)    (8.54)

V AR[X] = E[(X − E[X])2 ] = E[X 2 ] − (E[X])2 (8.55)


E[aX + b] = aE[X] + b (8.56)
V AR[aX + b] = a2 V AR[X] (8.57)

8.7 Conditions For Independence


Consider three events A_1, A_2, A_3 in a sample space S. Pairwise independence
between the events is defined by the following conditions:

P (A1 A2 ) = P (A1 )P (A2 ) (8.58)


P (A1 A3 ) = P (A1 )P (A3 ) (8.59)
P (A2 A3 ) = P (A2 )P (A3 ) (8.60)
Mutual independence between the events can be defined by the following condi-
tions
P (A1 A2 ) = P (A1 )P (A2 ) (8.61)
P (A1 A3 ) = P (A1 )P (A3 ) (8.62)
P (A2 A3 ) = P (A2 )P (A3 ) (8.63)
P (A1 A2 A3 ) = P (A1 )P (A2 )P (A3 ) (8.64)
Mutual independence implies pairwise independence, but pairwise independence
does not imply mutual independence. All of the mutual independence conditions need
to be satisfied for the events to be called independent.

8.7.1 Example 1
Let us consider a fair coin is tossed twice. Let sample space be denoted as S and
A1 , A2 , A3 are the three events in sample space S
S = {HH, HT, T H, T T }
A1 = {HH, HT }, A2 = {HH, T H}, A3 = {HH, T T }
Let us see whether independence exists between the events:
P (A1 A2 ) = P {HH} = 1/4
P (A1 ) = 2/4
P (A2 ) = 2/4
P (A1 )P (A2 ) = 1/4
P (A1 A2 ) = P (A1 )P (A2 )
Let us check for other conditions
P (A1 A2 A3 ) = P (A1 )P (A2 )p(A3 ) (8.65)
P (A1 A2 A3 ) = P {HH} = 1/4
P (A1 ) = 1/2
P (A2 ) = 1/2
P (A3 ) = 1/2
P (A1 )P (A2 )P (A3 ) = 1/8
Equation 8.65 is not satisfied. Therefore mutual independence does not exist between
the events.

8.7.2 Example 2
Let us consider a fair die being tossed.

S = {1, 2, 3, 4, 5, 6}

A1 = {1, 2, 3, 4}
A2 = {4, 5, 6}
A3 = {4, 5, 6}
Let us check for the following condition

P (A1 A2 A3 ) = P (A1 )P (A2 )p(A3 )

P (A1 A2 A3 ) = P {4} = 1/6


P (A1 ) = 4/6
P (A2 ) = 3/6
P (A3 ) = 3/6
P (A1 )P (A2 )P (A3 ) = 1/6
Let us check for other conditions

P (A1 A2 ) = P (A1 )P (A2 )

P (A1 A2 ) = 1/6
P (A1 ) = 4/6
P (A2 ) = 1/2
P (A1 )P (A2 ) = 1/3
P(A_1 A_2) is not equal to P(A_1)P(A_2). Therefore independence does not exist be-
tween the events.

8.8 Axioms Of Probability


If S is the sample space, then a probability function P assigns to any subset A of S
a value P(A). That is,

P : P(S) → R

where P(S) denotes the collection of subsets of S and R is the set of real numbers.
If the sample space is infinite, then there are some restrictions on the subsets A for
which P is defined.

8.9 Moment Generating Function


Let us consider a function f(x):

f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + ... + a_n x^n + ...

Differentiating f(x), we get

f'(x) = a_1 + 2a_2 x + 3a_3 x^2 + ... + n a_n x^{n−1} + ...

Differentiating the above equation, we get

f''(x) = 2a_2 + 3·2 a_3 x + ... + n(n − 1) a_n x^{n−2} + ...

Putting x = 0:

f(0) = a_0
f'(0) = a_1
f''(0) = 2a_2
f'''(0) = 6a_3
f^{(n)}(0) = n! a_n
a_n = (1/n!) f^{(n)}(0)

The power series expansion for f(x) is unique:

f(x) = f(0) + f'(0) x + (f''(0) x^2)/2! + (f'''(0) x^3)/3! + ...

8.10 Consequences
Let us consider X and Y discrete, with joint probability mass function p(x, y).

1. E(X + Y) = E(X) + E(Y):

E(X + Y) = ∑_x ∑_y (x + y) p(x, y) = ∑_x ∑_y x p(x, y) + ∑_x ∑_y y p(x, y)

Since E(X) = ∑_x ∑_y x p(x, y) and E(Y) = ∑_x ∑_y y p(x, y), we get

E(X + Y) = E(X) + E(Y)

The same result holds in the continuous case: if X and Y are jointly continuous, then
E(X + Y) = E(X) + E(Y).

In an analogous way, for a finite number of random variables X_1, X_2, ··· , X_n:

E[∑_{i=1}^{n} X_i] = ∑_{i=1}^{n} E[X_i]

For an infinite number of random variables we have to worry about convergence.


2. If X and Y are independent, then E(XY) = E(X)E(Y).

If X and Y are jointly discrete with pmf p(x, y):

E(XY) = ∑_x ∑_y xy p(x, y)

But X and Y are independent, so p(x, y) = p_X(x) p_Y(y). Therefore

E(XY) = ∑_{x,y} xy p_X(x) p_Y(y) = [∑_x x p_X(x)] [∑_y y p_Y(y)]

and since E(X) = ∑_x x p_X(x) and E(Y) = ∑_y y p_Y(y), we get

E(XY) = E(X)E(Y)

The same result is applicable in the continuous case.
3. var(X + Y) = var(X) + var(Y) + 2cov(X, Y):

var(X + Y) = E[(X + Y)^2] − [E(X + Y)]^2
           = E[X^2 + Y^2 + 2XY] − [E(X) + E(Y)]^2
           = E[X^2] + E[Y^2] + E[2XY] − [E(X)]^2 − [E(Y)]^2 − 2E(X)E(Y)
           = (E[X^2] − [E(X)]^2) + (E[Y^2] − [E(Y)]^2) + (E[2XY] − 2E(X)E(Y))

We know var(X) = E[X^2] − [E(X)]^2, var(Y) = E[Y^2] − [E(Y)]^2, and
2cov(X, Y) = E[2XY] − 2E(X)E(Y). Therefore

var(X + Y) = var(X) + var(Y) + 2cov(X, Y)

Note: If X and Y are independent, then cov(X, Y) = 0 and
var(X + Y) = var(X) + var(Y)
Let us see an example: n independent identical trials, each with probability p of
success. Let X be the number of successes, and define the indicator variables

X_j = 1 if success on trial j, 0 otherwise

for j = 1, 2, ..., n. Then

E(X_j) = p, var(X_j) = pq

Since X is the sum of the X_j over the n independent identical trials,

E(X) = np, var(X) = npq

Note:

cov(∑_i X_i, ∑_j Y_j) = ∑_i ∑_j cov(X_i, Y_j)

Definition: If X is a continuous or discrete random variable, then

M_X(t) = E(e^{tX})

is the moment generating function of X (the MGF of random variable X).

E(X) = 1st moment of X = mean
E(X^2) = 2nd moment of X
E(X^k) = k-th moment of X

For X a discrete random variable:

M_X(t) = ∑_x e^{tx} P(X = x)

For X a continuous random variable:

M_X(t) = ∫_{−∞}^{∞} e^{tx} f_X(x) dx

M_X(t) = E(e^{tX})

e^x = ∑_{n=0}^{∞} x^n / n!

e^{tx} = 1 + (tx) + (tx)^2/2! + ... + (tx)^k/k! + ...

E(e^{tX}) = E(1) + tE(X) + (t^2/2!)E(X^2) + ... + (t^k/k!)E(X^k) + ...

E(1) = 1, and E(e^{tX}) is the moment generating function.

8.10.1 Example 1
X is binomial with parameters n and p. What is M_X(t)?

M_X(t) = \sum_{k=0}^{n} e^{tk}\, P(X = k)

M_X(t) = \sum_{k=0}^{n} e^{tk} \binom{n}{k} p^k q^{n-k}

M_X(t) = \sum_{k=0}^{n} \binom{n}{k} (e^t p)^k q^{n-k}

M_X(t) = (e^t p + q)^n

The k-th derivative at 0 gives the k-th moment, M_X^{(k)}(0) = E(X^k). Recall that for f(x) = a_0 + a_1 x + \dots + a_n x^n + \dots, we have a_n = f^{(n)}(0)/n!.

M_X(t) = (e^t p + q)^n
M_X'(t) = n(e^t p + q)^{n-1} p e^t
M_X''(t) = n(n-1)(e^t p + q)^{n-2}(e^t p)\, p e^t + n(e^t p + q)^{n-1} p e^t
M_X'(0) = n(p e^0 + q)^{n-1} p e^0 = np = E(X) = mean of the binomial
M_X''(0) = n(n-1)p^2 + np = n^2 p^2 - np^2 + np = E(X^2)

Let us calculate the variance now:

σ^2 = E(X^2) - [E(X)]^2 = n^2 p^2 - np^2 + np - n^2 p^2

σ^2 = npq
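A short symbolic check of this computation (a sketch using sympy; not part of the original derivation):

```python
# Sketch: verify the binomial MGF derivatives with sympy (symbols n, p, with q = 1 - p)
import sympy as sp

t, n, p = sp.symbols('t n p', positive=True)
q = 1 - p
M = (sp.exp(t) * p + q) ** n  # MGF of Binomial(n, p)

m1 = sp.diff(M, t).subs(t, 0)      # first moment
m2 = sp.diff(M, t, 2).subs(t, 0)   # second moment
var = sp.simplify(m2 - m1 ** 2)

print(sp.simplify(m1))  # n*p
print(var)              # n*p*(1 - p), i.e. npq (up to sympy's preferred form)
```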
Fact:

Let X and Y be random variables. If M_X(t) = M_Y(t) for all t, then X and Y have the same distribution.

Note:

X and Y may be two different functions (different random variables) that nevertheless have the same probability mass function or cumulative distribution function, i.e. the same distribution.

Mx (t) = My (t)

E(X) = E(Y )
This implies the means are the same

E(X 2 ) = E(Y 2 )

This implies the variances are the same

X and Y are discrete random variables

E(etx ) = p1 etx1 + ..... + pn etxn = Mx (t)

E(ety ) = q1 ety1 + ..... + qm etym = My (t)


Let us see what happens as t becomes ∞. For large t the largest term dominates, so (taking x_n and y_m to be the largest outcomes)

M_X(t) = E(e^{tX}) \approx p_n e^{t x_n}

M_Y(t) = E(e^{tY}) \approx q_m e^{t y_m}

For M_X(t) = M_Y(t) we need p_n e^{t x_n} = q_m e^{t y_m} (as t → ∞), which forces x_n = y_m and p_n = q_m. Subtract off the last terms and apply the argument again; repeating this gives n = m and p_i = q_i, x_i = y_i for all i.
So X and Y have the same distribution.

8.11 Properties Of Moment Generating Function


1. M_{aX}(t) = E[e^{t(aX)}] = E[e^{taX}] = M_X(at)

2. If X and Y are independent random variables,

M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E(e^{tX})\, E(e^{tY}) = M_X(t)\, M_Y(t)

since, by the independence of X and Y, e^{tX} and e^{tY} are independent random variables (each being a function, a power series, of X or Y alone).

3. If c is a constant,

M_{X+c}(t) = E[e^{t(X+c)}] = E[e^{tX} e^{tc}]

M_{X+c}(t) = e^{tc}\, E(e^{tX}) = e^{tc} M_X(t)


A random variable X has the normal probability density function (pdf) with parameters µ and σ:

f_X(x) = \frac{1}{σ\sqrt{2π}} e^{-\frac{1}{2}\left(\frac{x-µ}{σ}\right)^2}

What is M_X(t)?
Z has pdf f_Z(x) = \frac{1}{\sqrt{2π}} e^{-x^2/2}, the normal pdf with parameters µ = 0, σ^2 = 1, and

M_Z(t) = e^{t^2/2}

Then X = σZ + µ has the normal pdf with parameters µ and σ^2. To see this, consider the distribution function:

F(x) = P(σZ + µ ≤ x) = P(Z ≤ (x − µ)/σ)

F(x) = \frac{1}{\sqrt{2π}} \int_{-\infty}^{(x-µ)/σ} e^{-t^2/2}\, dt

F'(x) is the pdf of σZ + µ; differentiating,

F'(x) = \frac{1}{σ\sqrt{2π}} e^{-\frac{1}{2}\left(\frac{x-µ}{σ}\right)^2}

which is the normal pdf with parameters µ, σ. We want M_X(t), where X = σZ + µ:

M_{σZ+µ}(t) = e^{µt} M_{σZ}(t) = e^{µt} M_Z(σt)

M_Z(t) = e^{t^2/2}

M_X(t) = M_{σZ+µ}(t) = e^{µt} e^{(σt)^2/2}

M_X(t) = e^{σ^2 t^2/2 + µt}

M_X'(0) = µ
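A small symbolic sketch (using sympy) that differentiates this MGF and reads off the mean and variance:

```python
# Sketch: differentiate the normal MGF and read off the first two moments
import sympy as sp

t, mu = sp.symbols('t mu', real=True)
sigma = sp.symbols('sigma', positive=True)

M = sp.exp(sigma**2 * t**2 / 2 + mu * t)  # MGF of N(mu, sigma^2) derived above

m1 = sp.diff(M, t).subs(t, 0)       # E[X]
m2 = sp.diff(M, t, 2).subs(t, 0)    # E[X^2]
print(m1)                            # mu
print(sp.simplify(m2 - m1**2))       # sigma**2, the variance
```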

8.12 Theorem
If X and Y are independent random variables with

X ∼ N(µ_x, σ_x^2)

Y ∼ N(µ_y, σ_y^2)

then X + Y is normal, N(µ_x + µ_y, σ_x^2 + σ_y^2). The moment generating function completely determines the distribution; since X and Y are independent random variables,

M_{X+Y}(t) = M_X(t)\, M_Y(t)

M_{X+Y}(t) = e^{(σ_x^2 + σ_y^2)t^2/2 + (µ_x + µ_y)t}

which is the moment generating function of a normal random variable with mean
(µ_x + µ_y) and variance (σ_x^2 + σ_y^2).
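A quick simulation sketch of the theorem (numpy assumed; the parameter values are arbitrary):

```python
# Sketch: simulate X ~ N(1, 2^2), Y ~ N(-3, 1^2) independent, and check X+Y ~ N(-2, 5)
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(1.0, 2.0, size=500_000)
Y = rng.normal(-3.0, 1.0, size=500_000)
S = X + Y

print(S.mean())  # close to 1 + (-3) = -2
print(S.var())   # close to 4 + 1 = 5
```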
Chapter 9

Sequences and Series

9.1 Sequences
A sequence is nothing but a list of numbers written in a specific order. Sequences
can be infinite or finite. A general form of sequences can be shown:

a1 - first term
a2 - second term
an - nth term

Some of the common ways we can denote sequences are as follows:

{a_1, a_2, \dots, a_n, a_{n+1}, \dots} \quad (9.1)

{a_n} \quad (9.2)

{a_n}_{n=1}^{∞} \quad (9.3)

To illustrate with an example, here is how we would write the first few terms of a sequence:

\left\{\frac{n+1}{n^2}\right\}_{n=1}^{∞} = \left\{2, \frac{3}{4}, \frac{4}{9}, \frac{5}{16}, \dots\right\} \quad (9.4)
n=2 n=3 n=4

An interesting way to think about sequences is as functions that map index values
to the value that the particular sequence might take. For example consider the
same sequence as above written as a function and its values written in a tuple of
the format (n, f (n)).
f(n) = \frac{n+1}{n^2} \quad (9.5)

values → (1, 2), (2, 3/4), (3, 4/9), (4, 5/16) \quad (9.6)
We do this because in this situation we can essentially plot out the values and
obtain a graphical representation of a sequence.


[Figure: plot of the sequence terms f(n) = (n+1)/n^2 against n, decreasing toward 0.]

We can observe from this graph that as n increases, the value of the sequence terms gets closer and closer to zero. Hence we can say that the limiting value of this sequence is zero:

\lim_{n→∞} a_n = \lim_{n→∞} \frac{n+1}{n^2} = 0 \quad (9.7)
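A tiny computational sketch of this limit (plain Python; the sampled values of n are arbitrary):

```python
# Sketch: compute the first terms of a_n = (n+1)/n^2 and watch them shrink toward 0
def a(n: int) -> float:
    return (n + 1) / n**2

for n in [1, 2, 3, 4, 10, 100, 1000]:
    print(n, a(n))  # 2.0, 0.75, 0.444..., 0.3125, 0.11, 0.0101, 0.001001
```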

9.1.1 General theorems and statements


• We say that lim_{n→∞} a_n = L if we can make a_n arbitrarily close to L by taking n sufficiently large; that is, the values of a_n approach L as n approaches infinity:

lim_{n→∞} a_n = L

As a more precise definition, we can say that lim_{n→∞} a_n = L if for every number ε > 0 there exists an integer N such that:

|a_n − L| < ε, when n > N

• We can say that limn→∞ an = ∞ if for every number M > 0 there is an


integer N such that:

an > M when: n > N

• We can say that limn→∞ an = −∞ if for every number M < 0 there exists a
number N such that:

an < M when: n > N

• The key insight for us is that for a limit to exist and have a finite value, all the sequence terms must get closer and closer to that finite value as n approaches infinity.

• If lim_{n→∞} a_n exists and is finite, we say that the sequence is convergent; if lim_{n→∞} a_n does not exist or is infinite, we say that the sequence is divergent.

• Given a sequence {an } if we have a function f (x) such that f (n) = an and
that limx→∞ f (x) = L then we can say that:

limn→∞ an = L

9.1.2 Squeeze theorem


We can state the squeeze theorem for sequences as follows:
if: an ≤ cn ≤ bn for all n > N for some N
and if: limn→∞ an = limn→∞ bn = L
Then we can say that: limn→∞ cn = L
This theorem is particularly useful when we are trying to compute the limits of sequences that alternate in sign, which is where the modulus function enters. Another important theorem, which we shall prove using the squeeze theorem, is:
if: limn→∞ |an | = 0 then: limn→∞ an = 0
Additionally we note that for this theorem to work, the limit has to be zero. Now
to prove this using the squeeze theorem:
We can first of all note that: −|an | ≤ an ≤ |an |
Then we note that: limn→∞ (−|an |) = − limn→∞ |an | = 0
Therefore now that we have: limn→∞ (−|an |) = limn→∞ |an | = 0
then by squeeze theorem we have: limn→∞ an = 0
As an additional, closely related convergence theorem, we can state:
The sequence {r^n}_{n=0}^{∞} converges if −1 < r ≤ 1 and diverges for all other values of r.
Mathematically this can also be stated as:

\lim_{n→∞} r^n = \begin{cases} 0, & \text{if } −1 < r < 1 \\ 1, & \text{if } r = 1 \\ \text{divergent}, & \text{otherwise} \end{cases} \quad (9.8)

9.1.3 Increasing, Decreasing and Bounded


Given a sequence {an } we have the following important definitions that explain
key concepts about the nature of the sequence.
• A sequence is increasing if: an < an+1 for every n.

• A sequence is decreasing if: an > an+1 for every n.



• If an is an increasing or decreasing sequence it is known to be monotonic.


Note that a monotonic sequence always either increases or decreases, not
both.

• If there exists a number m such that m ≤ an for every n then we say that
the sequence is bounded below and m is called the lower bound of the
sequence.

• If there exists a number M such that an ≤ M for every n then we say that
the sequence is bounded above and M is called the upper bound of the
sequence.

• Finally we can say that if {an } is bounded and monotonic then {an } is con-
vergent.

9.2 Series
To begin defining an infinite series we first start with a sequence {an }. Note that
a sequence is just a sequence of numbers whereas a series represents some kind of
operation on those sequence of numbers. We can define a basic series as:

s_1 = a_1
s_2 = a_1 + a_2
s_3 = a_1 + a_2 + a_3
s_n = \sum_{i=1}^{n} a_i

We can further note that the successive values of the series itself form a sequence of numbers, which can be represented as {s_n}_{n=1}^{∞}. This is a sequence of partial sums. Now we can compute the limiting value of this sequence of partial sums as:

\lim_{n→∞} s_n = \lim_{n→∞} \sum_{i=1}^{n} a_i = \sum_{i=1}^{∞} a_i \quad (9.9)

Note that as in the case of sequences before, if the sequence of series values has a
finite limit, then the series is said to be convergent and if the limit does not exist
then it is divergent. Now we will prove the following theorem :
if \sum a_n converges, then \lim_{n→∞} a_n = 0

• Step 1: We can write the following two partial sums for the given series:

s_{n-1} = \sum_{i=1}^{n-1} a_i = a_1 + a_2 + \dots + a_{n-1}
s_n = \sum_{i=1}^{n} a_i = a_1 + a_2 + \dots + a_n

• Step 2: Subtracting the two partial sums we get:

a_n = s_n − s_{n-1}
• Step 3: If \sum a_n is convergent, then the sequence of partial sums converges to some finite value s. Note that the same holds for the partial sums indexed by n and by (n − 1):

{s_n}_{n=1}^{∞}: \lim_{n→∞} s_n = s \quad \text{and} \quad \lim_{n→∞} s_{n-1} = s

• Step 4: Finally we can write:

limn→∞ an = limn→∞ (sn − sn−1 ) = s − s = 0

9.2.1 The Ratio and Root test


The ratio test can be applied to check for convergence of a series. Suppose we have a series given by:

\sum a_n \quad (9.10)

Then we can define:

L = \lim_{n→∞} \left|\frac{a_{n+1}}{a_n}\right| \quad (9.11)
Now the following conditions would hold:

• If L < 1 the series is convergent.

• If L > 1 the series is divergent.

• If L = 1 the series may be divergent or convergent.

Now to present the root test, suppose we have the series defined by:

\sum a_n \quad (9.12)

Then we can define:

L = \lim_{n→∞} \sqrt[n]{|a_n|} = \lim_{n→∞} |a_n|^{1/n} \quad (9.13)

Now the following conditions would hold:

• If L < 1 the series is convergent.

• If L > 1 the series is divergent.

• If L = 1 the series may be divergent or convergent.
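A small numerical sketch of the ratio test (plain Python; the example series \sum 1/n! is our choice, and the finite values of n only approximate the limit):

```python
# Sketch: approximate the ratio-test limit L for a_n = 1/n! (the series for e, known to converge)
import math

def a(n: int) -> float:
    return 1.0 / math.factorial(n)

for n in [5, 10, 20]:
    ratio = abs(a(n + 1) / a(n))  # equals 1/(n+1), so it tends to 0
    print(n, ratio)
# The ratios shrink toward L = 0 < 1, so the ratio test says the series converges.
```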


Chapter 10

Basics of Convolutions

10.1 Random variables


The random variables X and Y are said to be independent if for any two sets of
real numbers A and B, which essentially represent the set of outcomes for the
respective random variables, we have:
P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B) (10.1)
We can write the above relation in terms of cumulative distribution functions as well:

P(X ≤ a, Y ≤ b) = P(X ≤ a)P(Y ≤ b) \quad (10.2)
Further, we can also write the independence condition of two random variables in
the CDF notation as follows:
F (a, b) = FX (a)FY (b), ∀(a, b) (10.3)

10.1.1 Sums of random variables


We will now attempt to find the distribution of X + Y from the distributions of in-
dependent random variables X and Y . Assume that these are continuous random
variables with probability density functions respectively given as fX and fY . The
cumulative distribution function of X + Y can then be given as:
F_{X+Y}(a) = P(X + Y ≤ a) \quad (10.4)

= \iint_{x+y \le a} f_X(x) f_Y(y)\, dx\, dy \quad (10.5)

Now our integration limits are defined over the X and Y values such that X + Y ≤ a. We can break this into the following integral limits: we let y take on any value between minus infinity and infinity, and we ensure the inequality is satisfied by letting x ≤ a − y. Therefore Y takes on any value, whereas X runs from minus infinity up to the upper bound a − y.
F_{X+Y}(a) = \int_{-\infty}^{\infty} \int_{-\infty}^{a-y} f_X(x) f_Y(y)\, dx\, dy \quad (10.6)

= \int_{-\infty}^{\infty} \left(\int_{-\infty}^{a-y} f_X(x)\, dx\right) f_Y(y)\, dy \quad (10.7)

= \int_{-\infty}^{\infty} F_X(a-y) f_Y(y)\, dy \quad (10.8)

The above final expression is derived from the following definition of the cumulative distribution function of a continuous random variable:

F_X(a) = \int_{-\infty}^{a} f_X(x)\, dx \quad (10.9)

Hence the cumulative distribution function of the sum of two independent random variables, F_{X+Y}, is known as the convolution of the distributions F_X and F_Y (the individual cumulative distributions):

F_{X+Y}(a) = \int_{-\infty}^{\infty} F_X(a-y) f_Y(y)\, dy \quad (10.10)
−∞

Now we know that if we differentiate a cumulative distribution function, we obtain the probability density function of a continuous random variable. Using this fact, we can find the pdf of the convolution distribution F_{X+Y} as:

f_{X+Y}(a) = \frac{d}{da} \int_{-\infty}^{\infty} F_X(a-y) f_Y(y)\, dy \quad (10.11)

= \int_{-\infty}^{\infty} \frac{d}{da} F_X(a-y) f_Y(y)\, dy \quad (10.12)

= \int_{-\infty}^{\infty} f_X(a-y) f_Y(y)\, dy \quad (10.13)
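A minimal numerical sketch of this convolution formula (numpy assumed; the grid spacing is an arbitrary choice), using the Uniform(0, 1) density that is worked out analytically in the next subsection:

```python
# Sketch: approximate the convolution pdf f_{X+Y}(a) = ∫ f_X(a-y) f_Y(y) dy on a grid
import numpy as np

def f_uniform(x):
    # pdf of Uniform(0, 1)
    return np.where((x > 0) & (x < 1), 1.0, 0.0)

dy = 0.001
y = np.arange(-2, 3, dy)

def f_sum(a):
    # Riemann-sum approximation of the convolution integral
    return np.sum(f_uniform(a - y) * f_uniform(y)) * dy

for a in [0.5, 1.0, 1.5]:
    print(a, f_sum(a))  # ~0.5, ~1.0, ~0.5: the triangular density derived below
```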

10.1.2 Sum of two Uniform random variables


Let X and Y be two independent uniform random variables distributed uniformly
as per (0, 1) bounds. We will now calculate the probability density of X + Y .
f_X(a) = f_Y(a) = \begin{cases} 1, & \text{if } 0 < a < 1 \\ 0, & \text{otherwise} \end{cases} \quad (10.14)

Therefore we can get the convolution probability density, referring to the general form in equation (10.13), as given below. Note that we are essentially putting f_Y(y) equal to 1 because the function takes that value on the interval (0, 1).
f_{X+Y}(a) = \int_{0}^{1} f_X(a − y)\, dy \quad (10.15)

Now we note from the definition of fX that this function will take on a value of 1
in the interval:
0<a−y <1 (10.16)

Subtracting a from both sides we get the interval as:

− a < −y < 1 − a (10.17)

Multiplying the inequality with −1 we get the final interval as:

a−1<y <a (10.18)

The above interval for y runs from a − 1 to a, but y itself must lie between 0 and 1 (where f_Y is nonzero). We reconcile the two by splitting the range of a into two cases, which together give the correct limits for y. First, for the split 0 < a < 1 we would have the interval defined as:

0 < y < a \quad (10.19)

Note that we have 0 on the left side of this inequality because when a lies between 0 and 1, the expression a − 1 is always negative, hence we collapse the lower limit to 0, since y cannot take on negative values. Therefore we
would have:

f_{X+Y}(a) = \int_{0}^{a} dy = a \quad (10.20)

Second, for the split 1 ≤ a ≤ 2 the main integral limits become:

a − 1 < y < 1 \quad (10.21)

We collapse the right-hand limit to 1 because y itself cannot exceed 1 (the density f_Y(y) is zero there). Therefore we can resolve the distribution as:

f_{X+Y}(a) = \int_{a-1}^{1} dy = 1 − (a − 1) = 2 − a \quad (10.22)

Therefore, when we draw the density function, combining equation (10.20) for 0 < a < 1 with equation (10.22) for 1 ≤ a ≤ 2, we get a line rising linearly up to a = 1 and a line falling linearly from a = 1 to a = 2.

[Figure: the triangular pdf of X + Y on [0, 2], rising from 0 to a peak of 1 at a = 1 and falling back to 0 at a = 2.]
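A simulation sketch of this triangular density (numpy assumed; bin count and sample size are arbitrary):

```python
# Sketch: simulate X + Y for independent Uniform(0,1) X, Y and compare with the triangular pdf
import numpy as np

rng = np.random.default_rng(2)
s = rng.uniform(size=1_000_000) + rng.uniform(size=1_000_000)

# Empirical density on a few bins vs. the derived pdf: a on [0,1], 2-a on [1,2]
hist, edges = np.histogram(s, bins=20, range=(0, 2), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
theory = np.where(centers < 1, centers, 2 - centers)

for c, h, th in zip(centers[::4], hist[::4], theory[::4]):
    print(f"a={c:.2f}  empirical={h:.3f}  theoretical={th:.3f}")
```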
Chapter 11

Limit theorems and Convergence -


Part 1

11.1 Weak law of large numbers

Consider X_1, X_2, \dots, X_n independent and identically distributed (i.i.d.) random variables with finite mean E[X_i] = µ < ∞, and let

\tilde{S}_n = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad (11.1)

Then the sample mean converges in probability to the population mean µ:

\tilde{S}_n \xrightarrow[n→∞]{P} µ \quad (11.2)

In the previous lecture, we studied the proof of the theorem that convergence in probability implies convergence in distribution:

If X_n \xrightarrow[n→∞]{P} X, then X_n \xrightarrow[n→∞]{D} X \quad (11.3)

Convergence in probability is the stronger form of convergence. But in this lecture we will learn that the converse of the above theorem is not true.

11.2 Converse of Theorem

The theorem says: convergence in distribution does not imply convergence in probability.

If X_n \xrightarrow[n→∞]{D} X, it does not follow that X_n \xrightarrow[n→∞]{P} X \quad (11.4)

That is, if convergence occurs in distribution, it does not imply that convergence will also occur in probability. Let us verify the statement using a counterexample.


Figure 11.1: Figure 1

11.3 Counter Example

Let X be a standard normal variable with mean zero and standard deviation one, written X ∼ N(0, 1). Let us also consider X_n = −X for n = 1, 2, 3, 4, .... By the symmetry of the standard normal, X_n is also standard normal, X_n ∼ N(0, 1).

Here we can see that X_n and X have the same cumulative distribution function (cdf) for all values of n, i.e. ∀n. This is because we have defined X_n = −X, and −X has the same cdf as X.

Now, since they have the same distribution, X_n converges to X in distribution; we want to check whether convergence in probability holds as well. To check this, consider any ε > 0. We have

P[|X_n − X| > ε] = P[|−X − X| > ε] = P[|2X| > ε] \quad (11.5)

In the above equation we have substituted X_n = −X. Simplifying further, we get

P[|2X| > ε] = P[|X| > ε/2] \quad (11.6)

Removing the modulus from equation (11.6) and expanding, we get

P[|X| > ε/2] = P[X > ε/2] + P[X < −ε/2] ≠ 0 \quad (11.7)

So we are looking at the two tails of the normal density, each of which carries positive probability, as depicted in Figure 11.1.

Now we explain how we remove the modulus from the event:

Event [|X| > ε/2] = \begin{cases} X > ε/2, & \text{if } X ≥ 0 \\ X < −ε/2, & \text{if } X < 0 \end{cases} \quad (11.8)

The above event can be written as the union of the two cases obtained after removing the modulus from X:

Event [|X| > ε/2] = (X > ε/2) ∪ (X < −ε/2) \quad (11.9)

So we see that X_n does not converge to X in probability, since

P[|X_n − X| > ε] > 0, \quad \text{for } ε > 0 \quad (11.10)

11.4 General Case

Let us look at the general case of convergence.

Statement 11.4.1 If X_n converges to X in probability, then X_n converges to X in distribution. This implies that convergence in probability is stronger than convergence in distribution.

In mathematical form, we can write the above statement as follows:

If X_n \xrightarrow[n→∞]{P} X, then X_n \xrightarrow[n→∞]{D} X \quad (11.11)

Proof. To prove the above statement, consider the distribution function

F_{X_n}(x) = P[X_n ≤ x] \quad (11.12)

Using the law of total probability, this can be written as the sum of two terms:

P[X_n ≤ x] = P[X_n ≤ x, X ≤ (x + ε)] + P[X_n ≤ x, X > (x + ε)] \quad (11.13)

Simplifying further by factoring the first term into a conditional probability (conditioning on X), we can write

F_{X_n}(x) = P[X_n ≤ x \mid X ≤ (x + ε)]\, P[X ≤ (x + ε)] + P[X_n ≤ x, X − ε > x] \quad (11.14)

A conditional probability is at most one, hence

P[X_n ≤ x \mid X ≤ (x + ε)]\, P[X ≤ (x + ε)] ≤ P[X ≤ (x + ε)] \quad (11.15)



Similarly, for the second term of equation (11.14): on the event {X_n ≤ x, X − ε > x} we have X_n ≤ x < X − ε, so

P[X_n ≤ x, X − ε > x] ≤ P[X_n < X − ε] \quad (11.16)

Therefore, substituting (11.15) and (11.16) into (11.14), we get

F_{X_n}(x) ≤ P[X ≤ (x + ε)] + P[X_n < X − ε] \quad (11.17)

F_{X_n}(x) ≤ F_X(x + ε) + P[|X_n − X| > ε] \quad (11.18)

since X_n < X − ε implies |X_n − X| > ε. The event

E = [|X_n − X| > ε] = \begin{cases} (X_n − X) > ε, & \text{if } X_n ≥ X \quad (\text{event } E_1) \\ −(X_n − X) > ε, & \text{if } X_n < X \quad (\text{event } E_2) \end{cases} \quad (11.19)

is the union of the two events E_1 and E_2. Also,

F_X(x − ε) = P[X ≤ (x − ε)] \quad (11.20)

P[X ≤ (x − ε)] = P[X ≤ (x − ε), X_n ≤ x] + P[X ≤ (x − ε), X_n > x] \quad (11.21)

F_X(x − ε) = P[X ≤ (x − ε) \mid X_n ≤ x]\, P[X_n ≤ x] + P[X ≤ (x − ε), X_n > x] \quad (11.22)

Now, since a conditional probability is at most one,

P[X ≤ (x − ε) \mid X_n ≤ x]\, P[X_n ≤ x] ≤ P[X_n ≤ x] \quad (11.23)

Let us simplify the second term: on the event {X ≤ (x − ε), X_n > x} we have

X ≤ (x − ε), X_n > x \implies X ≤ (x − ε) < (X_n − ε) \quad (11.24, 11.25)

\implies X < (X_n − ε) \quad (11.26)

Using this, we can bound the second term of equation (11.22) as

P[X ≤ (x − ε), X_n > x] ≤ P[X < X_n − ε] \quad (11.27)

Therefore, combining equations (11.23) and (11.27), equation (11.22) becomes

F_X(x − ε) ≤ P[X_n ≤ x] + P[X < X_n − ε] \quad (11.28)

F_X(x − ε) ≤ F_{X_n}(x) + P[X < X_n − ε] \quad (11.29)

F_X(x − ε) ≤ F_{X_n}(x) + P[|X_n − X| > ε] \quad (11.30)

where the event

F = [|X_n − X| > ε] = \begin{cases} (X_n − X) > ε, & \text{if } X_n ≥ X \quad (\text{event } F_1) \\ −(X_n − X) > ε, & \text{if } X_n < X \quad (\text{event } F_2) \end{cases} \quad (11.31)

F_1 = [X < X_n − ε] = [ε < (X_n − X)] \quad (11.32)

Similarly, we can rewrite the event F_2 as

F_2 = [−(X_n − X) > ε] = [(X_n − X) < −ε] \quad (11.33)

Now, F is the union of F_1 and F_2:

F = F_1 ∪ F_2 \quad (11.34)

Taking probabilities on both sides (the two events are disjoint), we get

P(F) = P(F_1) + P(F_2) \quad (11.35)

We can see that P(F_1) ≤ P(F), since P(F_2) ≥ 0. Now compare the two bounds obtained above:

F_{X_n}(x) ≤ F_X(x + ε) + P[|X_n − X| > ε] \quad (11.36)

F_X(x − ε) ≤ F_{X_n}(x) + P[|X_n − X| > ε] \quad (11.37)

We can rewrite equation (11.37) as

F_X(x − ε) − P[|X_n − X| > ε] ≤ F_{X_n}(x) \quad (11.38)

Combining equations (11.36) and (11.38), we get

F_X(x − ε) − P[|X_n − X| > ε] ≤ F_{X_n}(x) ≤ F_X(x + ε) + P[|X_n − X| > ε] \quad (11.39)


Now take the limit n → ∞. Since X_n \xrightarrow[n→∞]{P} X means that for every ε > 0,

P[|X_n − X| > ε] \xrightarrow[n→∞]{} 0 \quad (11.40)

equation (11.39) implies that

F_X(x − ε) ≤ \lim_{n→∞} F_{X_n}(x) ≤ F_X(x + ε), \quad ∀ε > 0 \quad (11.41)

\implies \lim_{n→∞} F_{X_n}(x) = F_X(x) \text{ at every continuity point } x \text{ of } F_X \quad (11.42)

Thus the above implies that

X_n \xrightarrow[n→∞]{D} X \quad (11.43)

11.5 Consequence of Weak Law of Large Numbers

Note. The assumption of finite variance Var(X_i) = σ^2 < ∞ is not required. Consider the sample mean S̄_n, which converges in probability to the population mean µ as n → ∞:

\bar{S}_n \xrightarrow[n→∞]{P} µ = E(X_i) \quad (11.44)

Let us consider the indicator variables

X_i(ω) = \begin{cases} 1, & \text{if } ω ∈ A \\ 0, & \text{if } ω ∉ A \end{cases} \quad (11.45)

Using the above condition, we can write

E[X_i] = P(ω ∈ A) = P(A) \quad (11.46)

The sample mean \bar{S}_n is then

\bar{S}_n = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i = \text{fraction of times } ω ∈ A \quad (11.47)

11.6 WLLN
\bar{S}_n, the fraction of times the outcome ω ∈ Ω falls in a given set A, converges in probability to E(X_i) = µ = P(A), the probability of the set (event) A. Here E(X_i) = µ and var(X_i) = σ^2.

11.7 Convergence in Mean

Convergence in mean is one of the strongest forms of convergence. When we write

X_n(ω) \xrightarrow[n→∞]{} X(ω) \quad (11.48)

in mean, this implies that

X_n \xrightarrow[n→∞]{} X \quad (11.49)

in the sense that the distance between X_n and X tends to zero, i.e. d(X_n, X) → 0.

Definition 11.7.1 Let r ≥ 1 be a constant. The sequence of random variables (RVs) X_1, X_2, \dots, X_n converges in the r-th mean, or in the L^r norm, to the random variable X.

In mathematical form, we can write the above definition as follows:

X_n \xrightarrow[n→∞]{L^r} X \quad (11.50)

if

\lim_{n→∞} d(X_n, X) = \lim_{n→∞} E[|X_n − X|^r] = 0 \quad (11.51)

When r = 2, this gives mean square convergence (m.s.):

X_n \xrightarrow[n→∞]{m.s.} X \quad (11.52)

Now let us illustrate this definition with an example. Consider X_n uniformly distributed on [0, 1/n]. We need to prove that

X_n \xrightarrow{L^r} X = 0, \quad ∀ r ≥ 1 \quad (11.53)

where the pdf of X_n is

f_{X_n}(x) = \begin{cases} n, & 0 ≤ x ≤ \frac{1}{n} \\ 0, & \text{otherwise} \end{cases} \quad (11.54)

To prove this, note that

E[|X_n − X|^r] = E[|X_n|^r], \quad \text{as } X = 0 \quad (11.55)

E[|X_n|^r] = \int_{-\infty}^{\infty} x^r f_{X_n}(x)\, dx = \int_{0}^{1/n} x^r n\, dx = n \int_{0}^{1/n} x^r\, dx \quad (11.56)

Simplifying the above equation:

E[|X_n|^r] = n\left[\frac{x^{r+1}}{r+1}\right]_0^{1/n} \quad (11.57)

After substituting the limits, we get

E[|X_n|^r] = \frac{n}{r+1}\left(\frac{1}{n}\right)^{r+1} = \frac{n}{(r+1)n^{r+1}} = \frac{1}{(r+1)n^r} \quad (11.58)

E[|X_n|^r] = \frac{1}{(r+1)n^r} \xrightarrow[n→∞]{} 0, \quad ∀ r ≥ 1 \quad (11.59)

Thus we have proved that the sequence converges to X = 0 in the r-th mean:

X_n \xrightarrow{L^r} X = 0 \quad \text{(proved)} \quad (11.60)
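A Monte Carlo sketch of this computation (numpy assumed; r = 2, i.e. mean-square convergence, and the values of n are arbitrary choices):

```python
# Sketch: check E|X_n|^r = 1/((r+1) n^r) for X_n ~ Uniform(0, 1/n)
import numpy as np

rng = np.random.default_rng(3)
r = 2

for n in [1, 10, 100]:
    samples = rng.uniform(0, 1 / n, size=1_000_000)
    print(n, (samples ** r).mean(), 1 / ((r + 1) * n ** r))
# Both columns shrink toward 0 as n grows: convergence in the L^r (here mean-square) sense.
```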

Theorem 11.7.1 Convergence in mean is stronger than convergence in probability.

Mathematically, we can write the above theorem as follows:

X_n \xrightarrow[n→∞]{L^r} X \implies X_n \xrightarrow[n→∞]{P} X \quad (11.61)

Proof. Consider, for any ε > 0, the tail probability

P[|X_n − X| ≥ ε] = P[|X_n − X|^r ≥ ε^r], \quad \text{since } r ≥ 1 \quad (11.62)

Now, using the Markov inequality (applicable because |X_n − X|^r is a non-negative random variable and ε^r > 0), we can write

P[|X_n − X|^r ≥ ε^r] ≤ \frac{E[|X_n − X|^r]}{ε^r} \xrightarrow[n→∞]{} 0 \quad (11.63)

Using the above bound, we have

P[|X_n − X| ≥ ε] \xrightarrow[n→∞]{} 0 \quad (11.64)

Thus we have proved that

X_n \xrightarrow[n→∞]{P} X \quad (11.65)

whenever

X_n \xrightarrow[n→∞]{L^r} X, \quad \text{i.e. } E[|X_n − X|^r] \xrightarrow[n→∞]{} 0 \quad (11.66)

Note. The converse of the above theorem is not true: there are sequences of random variables X_n that converge in probability but do not converge in mean.

Mathematically, X_n \xrightarrow[n→∞]{P} X does not imply X_n \xrightarrow[n→∞]{L^r} X \quad (11.67)

11.8 Recap
So far we have introduced the topics of convergence and moment generating functions, and we have looked at the limit theorems in detail. In this lecture we will discuss almost sure convergence and the strong law of large numbers in detail. In the coming lectures we will build on this and introduce stochastic processes.

11.9 Convergence of Random Variables


Random variables are those variables whose values depend on the outcome of a random experiment. Suppose we have a sequence of random variables X_1, X_2, \dots, X_n converging to a random variable X; that is, X_n gets closer and closer to X as n increases. Suppose we want to observe the value of the random variable X, but we cannot observe it directly. So what we do is come up with some estimation technique to measure X, obtaining an estimate X_1. We then estimate again and update the estimate to X_2, and so on, continuing this process to get X_1, X_2, .... As we increase n, our estimate gets better and better, so we hope that the value X_n converges to X.

There are different ways in which a sequence can converge. Some of these modes of convergence are stronger than others: if a sequence converges in a stronger mode, it also converges in every weaker mode. A sequence can converge in the following ways:

• Convergence in distribution

• Convergence in probability

• Convergence in mean

• Almost sure convergence

For example, using the figure, we conclude that if a sequence of random variables
converges in probability to a random variable X, then the sequence converges in
distribution to X as well.

Figure 11.2: Relations between different types of convergence



11.9.1 Convergence in distribution

With this type of convergence, the next outcome in a sequence of random experiments becomes better and better modeled by a given probability distribution. Convergence in distribution is the weakest form of convergence. However, this form of convergence is widely used; most often it arises from application of the central limit theorem.

A sequence X_1, X_2, \dots, X_n of random variables is said to converge in distribution, or converge weakly, to a random variable X if

\lim_{n→∞} F_n(x) = F(x) \quad (11.68)

for every number x ∈ R at which F is continuous. Here, F_n and F are the CDFs of the random variables X_n and X, respectively.

Examples of convergence in distribution

• Tossing coins
Let X_n be the fraction of tails after tossing an unbiased coin n times. X_1 has the Bernoulli distribution with expected value µ = 0.5 and variance σ^2 = 0.25, and the subsequent random variables X_2, X_3, ... are (scaled) binomially distributed.
As we increase the number of tosses n, the distribution starts converging to a normal distribution. This is explained by the central limit theorem: as n grows, the sample mean is approximately normally distributed around µ.

• Dice problem
Suppose that in a dice-making factory the first batches of dice produced come out biased or defective, so the outcome of throwing such a die follows a distribution markedly different from the uniform distribution.
As the production process improves and the dice become less and less defective, the outcome of throwing a die follows the uniform distribution more and more closely.

11.9.2 Convergence in Probability

As the sequence progresses, the probability of an unusual outcome (a deviation of X_n from X) becomes smaller and smaller.

Convergence in probability is also the type of convergence established by the weak law of large numbers.

A sequence X_1, X_2, \dots, X_n of random variables converges in probability towards the random variable X if, for all ε > 0,

\lim_{n→∞} P(|X_n − X| > ε) = 0 \quad (11.69)

Figure 11.3: Use case of central limit theorem

Examples of Convergence in Probability


This example should not be taken literally. Consider the following experiment.
First, pick a random person in the street. Let X be his/her height, which is ex
ante a random variable. Then ask other people to estimate this height by eye. Let
Xn be the average of the first n responses. Then (provided there is no systematic
error) by the law of large numbers, the sequence Xn will converge in probability
to the random variable X.

• Note

1. Convergence in probability implies convergence in distribution.


2. Convergence in probability does not imply almost sure convergence.

11.9.3 Convergence in Mean

For interpretation: when we say that the sequence X_n converges to X, it means the distance between X_n and X gets smaller and smaller. For example, if we define the distance between X_n and X as P(|X_n − X| > ε), we get convergence in probability. Another way to represent the distance between X_n and X is

E(|X_n − X|^r), \quad (11.70)



Figure 11.4: Convergence in Probability

where r ≥ 1 is a fixed number. This defines the convergence in mean.

Definition: A sequence of random variables X_1, X_2, \dots, X_n converges in the r-th mean, or in the L^r norm, to a random variable X if

\lim_{n→∞} E(|X_n − X|^r) = 0 \quad (11.71)

This is shown by

X_n \xrightarrow{L^r} X \quad (11.72)
Chapter 12

Limit theorems and Convergence -


Part 2

12.1 Almost Sure Convergence

Almost sure convergence is one of the important discoveries in probability and statistics; it leads to the establishment of the strong law of large numbers. It is also called 'convergence with probability one' (w.p.1). When we say that the probability of an event is zero (0), the event does not occur at all; when it is one (1), the event occurs all the time.
For example, if both sides of a coin are 'heads' (a biased coin), then the probability of 'tails' is zero and the probability of 'heads' is one. An event with probability one is a sure event, i.e. the degenerate case.

Almost sure means occurring almost everywhere: there may be some places where it does not occur. This is pointwise convergence on the same sample space, for s ∈ S. We have a sequence of random variables X_1, X_2, \dots, X_n defined on an underlying sample space, and we assume S is a finite set, |S| < ∞.

Each X_n (and X) is a function mapping S to the real numbers: X_n : S → R and X : S → R. The limiting random variable is also defined on the same sample space.

S = {s_1, s_2, \dots, s_i, \dots, s_k}, \quad |S| = k < ∞

where s_i is the i-th outcome. Take a random variable X_n on the sample space: X_n(s_i) = x_{ni}, the i-th real-number outcome, for i = 1, 2, \dots, k and n = 1, 2, \dots

After a random experiment is performed (for example, a coin is tossed), one of the s_i will occur (say 'H' occurs). Since it is the outcome of the experiment, the values of the X_n are then known, i.e. the x_{ni} are known.

If s_i is the outcome of the experiment, s_i ∈ S, the random variables are realised.


The sequence x_{1i}, x_{2i}, \dots, x_{ni}, \dots of real numbers is then observed, and we can discuss the convergence of this sequence of real numbers.

Almost sure convergence is defined in terms of the convergence of these sequences: once the outcome is fixed, a sequence of real numbers emerges, and that is where almost sure convergence comes into the picture.

12.1.1 Example 1
Suppose we do a random experiment of tossing a coin, the outcomes being head or tail, so |S| = 2 < ∞.
Consider a sequence of random variables X_1, X_2, \dots, X_n with

X_n = \begin{cases} \frac{n}{n+1}, & \text{if } s = H \\ (−1)^n, & \text{otherwise} \end{cases} \quad (12.1)

So consider each of the outcomes H or T and determine whether the sequence of real numbers converges or not.

If s = H, then X_n(H) = n/(n+1). Fixing the outcome at heads,

X_1(H) = 1/2, \; X_2(H) = 2/3, \; X_3(H) = 3/4, \dots

We see here a sequence of real numbers, and as we increase n the sequence converges to 1. Hence the sequence converges to 1 as n → ∞ if s = H (the outcome is fixed at H).

Now if s = T (the outcome is tails), X_n(T) = (−1)^n. So X_1(T) = −1, X_2(T) = +1, ..., continuing indefinitely.

We see that this sequence does not converge, since it oscillates between −1 and +1 as n becomes larger and larger. Let us define an event

E_∞ = \{s ∈ S : \lim_{n→∞} X_n(s) = 1\} \quad (12.2)

The probability of this event is

P[E_∞] = P[\{s ∈ S : \lim_{n→∞} X_n(s) = 1\}] \quad (12.3)

The sequence converges (to 1) only when the outcome is heads (s = H). So the probability of the event E_∞, i.e. the probability of heads, is 1/2, since it is a single toss of a fair coin.

NOTE: In this example the sequence X_n(s) = x_{ns} converged to 1 when s = H, with probability P(H) = 1/2, and the sequence did not converge when s = T. If the probability that the sequence X_n(s) converges to X(s) is equal to 1, then X_n converges to X almost surely (with probability 1):

X_n \xrightarrow[n→∞]{a.s.} X \quad \text{or} \quad X_n \xrightarrow[n→∞]{w.p.1} X \quad (12.4)

• Definition: A sequence of random variables X_1, X_2, \dots, X_n converges almost surely to a random variable X if, for the event E_∞ = \{s ∈ S : \lim_{n→∞} X_n(s) = X(s)\},

X_n \xrightarrow[n→∞]{a.s.} X \quad \text{if } P(E_∞) = 1 \quad (12.5)

12.1.2 Example 2
Suppose the sample space is S = [0, 1] with the uniform probability measure on S:

P([a, b]) = (b − a), \quad 0 ≤ a ≤ b ≤ 1 \quad (12.6)

The probability of an interval is simply its length. Consider a sequence of random variables X_1, X_2, \dots, X_n with

X_n(s) = \begin{cases} 1, & \text{if } 0 ≤ s < \frac{n+1}{2n} \\ 0, & \text{otherwise} \end{cases} \quad (12.7)

Now define a random variable X on the sample space S:

X(s) = \begin{cases} 1, & \text{if } 0 ≤ s < \frac{1}{2} \\ 0, & \text{otherwise} \end{cases} \quad (12.8)

Note that [0, (n+1)/(2n)) is the bigger interval and [0, 1/2) is the smaller one. Now we have to show that

X_n \xrightarrow[n→∞]{a.s.} X \quad (12.9)

Putting in different values of n, the thresholds (n+1)/(2n) take the values 1, 3/4, 2/3, \dots, so the intervals [0, (n+1)/(2n)) are shrinking toward [0, 1/2).

Consider the event E_∞ = \{s ∈ S : \lim_{n→∞} X_n(s) = X(s)\}, the set of outcomes where \lim_{n→∞} X_n(s) = X(s).

• Case (i) (outcomes in the interval [0, 1/2)): For s ∈ [0, 1/2), X(s) = 1 (from the definition), and X_n(s) = 1 for every n, since s < 1/2 < (n+1)/(2n). So [0, 1/2) ⊂ E_∞; every such s belongs to E_∞.

• Case (ii) For s > 1/2, X(s) = 0. Since s > 1/2, we have 2s > 1, i.e. (2s − 1) > 0. Recall

X_n(s) = \begin{cases} 1, & \text{if } 0 ≤ s < \frac{n+1}{2n} \\ 0, & \text{otherwise} \end{cases} \quad (12.10)

Since (2s − 1) > 0, we have X_n(s) = 0 whenever (n+1)/(2n) ≤ s, i.e. whenever n > 1/(2s − 1). So if we choose n larger and larger, eventually n exceeds 1/(2s − 1), and then

\lim_{n→∞} X_n(s) = X(s) = 0, \quad ∀ s > 1/2

That implies s ∈ E_∞, and hence (1/2, 1] ⊂ E_∞.

Now we can write the event as E_∞ = [0, 1/2) ∪ (1/2, 1]. Applying probability, we get

P[E_∞] = P\{[0, 1/2)\} + P\{(1/2, 1]\} \quad (12.11)

From the axioms of probability these are disjoint events, and both are measured by the uniform measure, so each has probability one half (1/2):

P[E_∞] = 1/2 + 1/2 = 1

So, from the cases above, we have shown that the sequence of random variables X_1, X_2, \dots, X_n converges almost surely to X as the sample size increases, i.e.

X_n \xrightarrow[n→∞]{a.s.} X

12.2 Strong Law of Large Numbers

Consider X_1, X_2, \dots, X_n independent identically distributed random variables with finite mean µ = E[X_i] < ∞, and let

\tilde{S}_n = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad (12.12)

Then the sample mean converges almost surely to the population mean µ:

\tilde{S}_n \xrightarrow[n→∞]{a.s.} µ \quad (12.13)

This is the stronger form of convergence, because it implies the weaker forms: the strong law implies the weak law. The expected value of the sample mean is E(\tilde{S}_n) = µ (the limit is the degenerate random variable µ), and var(\tilde{S}_n) goes to zero as n → ∞.

12.2.1 Example 3
Consider the sample space S = [0, 1], 0 ≤ s ≤ 1, with the uniform probability distribution. We define a sequence of random variables

X_n(s) = s + s^n \quad \text{and} \quad X(s) = s, \quad ∀ s ∈ [0, 1]

For 0 ≤ s < 1 we have s^n → 0 as n → ∞, so X_n(s) becomes closer and closer to s. This implies \lim_{n→∞} X_n(s) → s, since s^n → 0.

So we have to show that

\lim_{n→∞} X_n(s) → X(s) \quad (12.14)

Now consider s = 1, which gives X_n(1) = 2 for all n, while X(1) = 1. Hence \lim_{n→∞} X_n(1) = 2 ≠ X(1) = 1.

As n → ∞, \lim_{n→∞} X_n(1) is not equal to X(1) at s = 1, but the probability of that single point is P[{1}] = 0, while \lim_{n→∞} X_n(s) → X(s) for s ∈ [0, 1).
So the event becomes E_∞ = \{s ∈ S : \lim_{n→∞} X_n(s) = X(s)\} = [0, 1), and the probability of this event is given by

P(E_∞) = P\{[0, 1)\} = (1 − 0) = 1 \quad (12.15)

It is a uniform probability measure, which implies X_n \xrightarrow[n→∞]{a.s.} X.

NOTE: Almost sure convergence is similar to pointwise convergence of a sequence of functions, except that the convergence need not occur on a set "D" of probability zero (0):

D = \{s ∈ S : \lim_{n→∞} X_n(s) ≠ X(s)\} = \{1\} \quad (12.16)

and the probability of D is zero, P(D) = 0, since the uniform distribution is a continuous distribution, so the probability of a single point is zero.

We have already seen that a random variable X is a function on a probability space, mapping the sample space Ω to the real numbers (the range of values of X). Composing with another function h gives another real value; here h is a function of X, and the composition Z = h ∘ X is again a random variable:

Z = h(X(ω)), \quad ω ∈ Ω \quad (12.17)

Ω \xrightarrow{X} R \xrightarrow{h} R \quad (12.18)

We have proved the weak law of large numbers for the finite-variance case, where

var(X_i) = σ^2 < ∞, \quad (12.19)



by using Chebyshev's inequality.

That means the sample mean of the random variables converges in probability, as n tends to infinity, to the degenerate random variable µ, where µ is a deterministic constant.
The mean is also finite:
E(X_i) < ∞ \quad (12.20)
So

E(\bar{S}_n) = µ, \quad var(\bar{S}_n) = \frac{σ^2}{n} \xrightarrow[n→∞]{} 0 \quad (12.21)
NOTE: Let us remove the finite-variance assumption and check what happens.
Let the X_i be independent identically distributed random variables with a well-defined moment generating function, so that M_{X_i}(0) = 1 and M'_{X_i}(0) = µ = E(X_i).

By definition,

M_{X_i}(t) = E[e^{tX_i}] \quad (12.22)

So now the moment generating function of \bar{S}_n is

M_{\bar{S}_n}(t) = E[e^{t\bar{S}_n}] \quad (12.23)

= E[e^{(t/n)\sum X_i}] \quad (12.24, 12.25)

where

\bar{S}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \quad (12.26)

M_{\bar{S}_n}(t) = E[e^{(t/n)[X_1 + X_2 + \dots + X_n]}] \quad (12.27)

= E[e^{(t/n)X_1} e^{(t/n)X_2} \cdots e^{(t/n)X_n}] \quad (12.28)

Breaking down the equation: as the X_i are independent, the expectation factors,

M_{\bar{S}_n}(t) = E[e^{(t/n)X_1}]\, E[e^{(t/n)X_2}] \cdots E[e^{(t/n)X_n}] \quad (12.29)

So what we get is a product of moment generating functions:

M_{\bar{S}_n}(t) = M_{X_1}(t/n)\, M_{X_2}(t/n)\, M_{X_3}(t/n) \cdots M_{X_n}(t/n) \quad (12.30)

Since all the moment generating functions are identical (each X_i is identically distributed as X), we have

M_{\bar{S}_n}(t) = \left[M_X\left(\frac{t}{n}\right)\right]^n \quad (12.31)
Let us see this through the Taylor series expansion. The Taylor series of f about 0, to first order, is

f(h) = f(0) + h f'(0) + \dots, \quad (12.32)

where h = t/n; as we increase n, h goes to zero. Expanding M_X to first order about t = 0, with M_X(0) = 1 and M'_X(0) = µ by definition,

M_X\left(\frac{t}{n}\right) = M_X(0) + \frac{t}{n} M_X'(0) + o\left(\frac{t}{n}\right) = 1 + \frac{t}{n}µ + o\left(\frac{t}{n}\right) \quad (12.33)

\implies M_{\bar{S}_n}(t) = \left[1 + \frac{t}{n}µ + o\left(\frac{t}{n}\right)\right]^n \quad (12.34)

So, as n tends to infinity,

\left[1 + \frac{t}{n}µ\right]^n \xrightarrow[n→∞]{} e^{tµ} \quad (12.35)
Recall the corresponding limit from calculus:

\lim_{x→0} (1 + rx)^{1/x} = \lim_{n→∞} \left(1 + \frac{r}{n}\right)^n = e^r \quad (12.36)

which is exactly what we have used here. So

M_{\bar{S}_n}(t) \xrightarrow[n→∞]{} e^{tµ} = M_µ(t) \quad (12.37)
The moment generating function of \bar{S}_n goes to e^{µt} as n → ∞, which is the moment generating function of the degenerate random variable µ.

Hence the distribution of \bar{S}_n converges weakly to the distribution of the degenerate random variable Y = µ.

Recall: we saw that convergence in distribution to a constant µ implies convergence in probability; hence we have the weak law of large numbers:

[X_n \xrightarrow[n→∞]{D} µ] \implies [X_n \xrightarrow[n→∞]{P} µ] \quad (12.38)

where µ is a constant.

NOTE: Thus, using the moment generating function, we can prove the weak law of large numbers without the finite-variance assumption.

12.3 Moment Generating Functions

One of the important reasons for using moment generating functions is to determine the distribution of sums of random variables. Moment generating functions are a simple way to find moments like the mean (µ) and variance (σ^2). Through the MGF we can represent a probability distribution by a simple one-variable function. Each probability distribution has a unique MGF, which makes them especially useful for solving problems like finding the distribution of sums of random variables. But before defining the moment generating function, let us define what moments are.

• Moments
The n-th moment of a random variable is the expected value of its n-th power.
Definition: Let X be a random variable and n ∈ N. If the expected value

µ_X(n) = E[X^n] \quad (12.39)

exists and is finite, then X is said to possess a finite n-th moment, and µ_X(n) is called the n-th moment.

For example, the first moment gives the expected value E[X], and the second central moment is the variance of X. Like the mean and variance, other moments give useful information about random variables.

The moment generating function M_X(s) of a random variable X is defined as

M_X(s) = E[e^{sX}]

We say that the MGF of X exists if there is a positive constant a such that M_X(s) is finite for all s ∈ [−a, a].

One question that can be raised is: why are moment generating functions useful? There are two reasons. First, the moment generating function of a random variable X gives us all the moments of X; that is why it is called the moment generating function.

Second, the MGF uniquely determines the distribution: if two random variables have the same MGF, then they must have the same distribution. Thus, if you find the MGF of a random variable, you have indeed determined its distribution.
• Finding Moments from the MGF:
Remember the Taylor series for e^x: for all x ∈ R, we have

e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots = \sum_{k=0}^{∞} \frac{x^k}{k!} \quad (12.40)

Now we can write

e^{sX} = \sum_{k=0}^{∞} \frac{(sX)^k}{k!} = \sum_{k=0}^{∞} \frac{X^k s^k}{k!} \quad (12.41)

Thus the equation becomes

M_X(s) = E[e^{sX}] = \sum_{k=0}^{∞} E[X^k] \frac{s^k}{k!} \quad (12.42)

We conclude that the k-th moment of X is the coefficient of \frac{s^k}{k!} in the Taylor series of M_X(s). Thus, if we have the Taylor series of M_X(s), we can obtain all the moments of X.

12.3.1 Properties of MGF


Some of the properties that can be said about the moment generating functions
are:

• Property 1

Let two random variable have the same moment generating function, then the
random variables have the same distribution. Suppose, X and Y are two random
variable and have the same moment generating function MX (s), then X & Y are
distributed in the same way(same CDF etc.).

So we can say the moment generating function determines the distribution of a random variable, and this can come in handy when dealing with an unknown random variable.

• Property 2

While dealing with sums of random variables, moment generating functions make things easier to handle. If there are two independent random variables X and Y and we want the moment generating function of X + Y, we simply multiply the separate, individual moment generating functions of X and Y.

So, if X and Y are independent and X has moment generating function MX (s)
and Y has moment generating function MY (s) , then the moment generating func-
tion of X + Y is just MX (s) MY (s) , or the product of the two moment generating
functions.

Example 4
If Y ∼ Uniform(0, 1), find E[Y^k] using M_Y(s).

The MGF of the Uniform(0, 1) distribution is

M_Y(s) = \frac{e^s − 1}{s} \quad (12.43)

So, expanding e^s in its power series and simplifying, we get

M_Y(s) = \frac{1}{s}\left(\sum_{k=0}^{∞} \frac{s^k}{k!} − 1\right) \quad (12.44)

= \frac{1}{s} \sum_{k=1}^{∞} \frac{s^k}{k!} \quad (12.45)

= \sum_{k=1}^{∞} \frac{s^{k−1}}{k!} \quad (12.46)

= \sum_{k=0}^{∞} \frac{1}{k+1} \frac{s^k}{k!} \quad (12.47)

Thus the coefficient of \frac{s^k}{k!} in the Taylor series of M_Y(s) is \frac{1}{k+1}, so

E[Y^k] = \frac{1}{k+1} \quad (12.48)
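A quick Monte Carlo check of this result (numpy assumed; sample size arbitrary):

```python
# Sketch: check E[Y^k] = 1/(k+1) for Y ~ Uniform(0, 1)
import numpy as np

rng = np.random.default_rng(4)
Y = rng.uniform(size=1_000_000)

for k in range(1, 5):
    print(k, (Y ** k).mean(), 1 / (k + 1))  # sample moment vs. 1/(k+1)
```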

12.4 Limit Theorems


Limit theorems are very important and extremely useful in applied statistics. The first limit theorem is the law of large numbers.

12.4.1 Law of Large Numbers

Limit theorems help us deal with random variables as we take limits. The first is the law of large numbers, which essentially states that the sample mean will eventually approach the population mean of a random variable as we increase the number of draws to infinity.

• Definition:

Consider i.i.d. random variables X1 , X2 , .........., Xn . Let the mean of each random
variable be µ. We define the sample mean as,
\bar{X}_n = \frac{X_1 + X_2 + \dots + X_n}{n} \quad (12.49)
It is to note that the sample mean X̄n is random itself.
It makes sense that this sample mean will fluctuate, because the components that
make it up (the X terms) are themselves random.

Figure 12.1: Law of Large Number

Now, based on this concept, we have two different types of law of large numbers.

Strong Law of Large Numbers

The strong law of large numbers states that as n tends to ∞ the sample mean \bar{X}_n goes to the population mean µ with probability 1. This is a formal way of saying that the sample mean will definitely approach the true mean.
The strong law of large numbers is based on almost sure convergence.

The strong law of large numbers and almost sure convergence were discussed thoroughly in the previous sections of this lecture.

\bar{X}_n \xrightarrow[n→∞]{a.s.} µ \quad (12.50)

Weak Law of Large Numbers

As the sample size n grows to infinity (∞), the probability that the sample mean differs from the population mean by more than some small amount ε goes to zero.
In simple words, the weak law of large numbers, also known as Bernoulli's theorem, states that if you have a sample of independent and identically distributed random variables, then as the sample size grows larger, the sample mean will tend toward the population mean.

• Definition

Let X_1, X_2, \dots, X_n be i.i.d. random variables with a finite expected value E[X_i] = µ < ∞. Then for any ε > 0,

\lim_{n→∞} P[|\bar{X}_n − µ| ≥ ε] = 0 \quad (12.51)

12.4.2 Central Limit Theorem

The central limit theorem is the second limit theorem; it is equally important when dealing with random variables, and it also concerns the long-run behavior of the sample mean as n grows.

The central limit theorem states that if we choose a sufficiently large random sample from a population with mean µ and variance σ^2, then the sample mean will be approximately normally distributed with mean µ.

This is an extremely powerful result, because it holds no matter what the distribution of the underlying random variables (i.e., the X's) is.

\bar{X}_n \xrightarrow{D} N\left(µ, \frac{σ^2}{n}\right) \quad (12.52)

where \xrightarrow{D} means 'converges in distribution'; it is implied here that this convergence takes place as n, the number of underlying random variables, grows.

• NOTE:

The LLN states that the mean of a large number of i.i.d. random variables converges to the expected value.
The CLT states that, under similar conditions, the sum of a large sample of random variables has an approximately normal distribution.

Z_n = \frac{\bar{X}_n − µ}{σ/\sqrt{n}} \quad (12.53)

The central limit theorem states that the CDF of Z_n converges to the standard normal CDF.

12.5 Markov's Inequality

Before delving into the proofs and code for the limit theorems, two fundamental building blocks, Markov's inequality and Chebyshev's inequality, need to be stated. These propositions help us find bounds on probabilities in cases where we only know the mean and variance of a probability distribution. First we prove Markov's inequality. Suppose that X is a random variable taking on non-negative values; then for any a > 0 we have

P(X ≥ a) ≤ \frac{E[X]}{a}

We define an indicator random variable over a > 0 as follows:

I = \begin{cases} 1, & \text{if } X ≥ a \\ 0, & \text{otherwise} \end{cases}

For X ≥ 0 the indicator variable satisfies

I ≤ \frac{X}{a}

Taking expectations on both sides we get

E[I] ≤ \frac{E[X]}{a}

While computing E[I] we can put it in the form of a weighted average with probabilities as the weights:

E[I] = 1 × P(X ≥ a) + 0 × P(X < a) = P(X ≥ a)

Now, because we have shown E[I] = P(X ≥ a), we have justified Markov's inequality:

P(X ≥ a) ≤ \frac{E[X]}{a}
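A small numerical sketch of Markov's bound (numpy assumed; the exponential distribution and the values of a are our choices):

```python
# Sketch: check Markov's bound P(X >= a) <= E[X]/a for X ~ Exponential(mean 1)
import numpy as np

rng = np.random.default_rng(5)
X = rng.exponential(scale=1.0, size=1_000_000)  # non-negative, E[X] = 1

for a in [1, 2, 5]:
    tail = (X >= a).mean()   # empirical P(X >= a), equal to e^{-a} in theory
    bound = X.mean() / a     # Markov bound E[X]/a
    print(a, tail, bound)    # the tail never exceeds the bound
```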

12.6 Chebyshev's inequality

This proposition can be stated as follows: if X is a random variable with finite mean µ and variance σ^2, then for any value of k > 0 we have the following result. Note that the event in the probability shown below represents the tail probabilities of X, and this inequality gives us a bound on these tail probabilities:

P(|X − µ| ≥ k) ≤ \frac{σ^2}{k^2}

Now, (X − µ)^2 is a non-negative random variable, so applying Markov's inequality with a = k^2 gives

P((X − µ)^2 ≥ k^2) ≤ \frac{E[(X − µ)^2]}{k^2}

We also know that (X − µ)^2 ≥ k^2 and |X − µ| ≥ k are equivalent conditions. Therefore we can rewrite the above equation as

P(|X − µ| ≥ k) ≤ \frac{σ^2}{k^2}

12.6.1 A proposition linked with Chebyshev's inequality

This important proposition states that a random variable with zero variance is constant with probability 1. Basically, if Var(X) = 0, then

P(X = E[X]) = 1

For a zero-variance random variable, Chebyshev's inequality for any n ≥ 1 gives

P\left(|X − µ| > \frac{1}{n}\right) ≤ 0

Since we know that probability cannot be negative, this inequality resolves to an equation:

P\left(|X − µ| > \frac{1}{n}\right) = 0

Now we can let n → ∞ and use the continuity property of probability to get

0 = \lim_{n→∞} P\left(|X − µ| > \frac{1}{n}\right) = P\left(\lim_{n→∞}\left\{|X − µ| > \frac{1}{n}\right\}\right) = P(X ≠ µ)

On the right-hand side, if we resolve the modulus operator within the limit, we get events which combine as follows:

(X − µ) > 0 → X > µ
(−X + µ) > 0 → X < µ

whose union is X ≠ µ.

12.7 The weak law of large numbers

Let X_1, X_2, \dots, X_n be a sequence of independently and identically distributed random variables, each having finite mean E[X_i] = µ. Then we say that for any ε > 0 the tail probabilities of the distribution of the sample mean go to 0 and the probability mass centers around the population mean:

P\left(\left|\frac{X_1 + X_2 + \dots + X_n}{n} − µ\right| > ε\right) \xrightarrow[n→∞]{} 0

Before moving on with the proof, we assume that the random variables have finite variance σ^2. We can now calculate the mean and variance associated with the sample-mean statistic:

E\left(\frac{X_1 + X_2 + \dots + X_n}{n}\right) = µ

Var\left(\frac{X_1 + X_2 + \dots + X_n}{n}\right) = \frac{σ^2}{n}

From Chebyshev's inequality we can state that

P\left(\left|\frac{X_1 + X_2 + \dots + X_n}{n} − µ\right| ≥ ε\right) ≤ \frac{σ^2}{nε^2} → 0, \quad \text{as } n → ∞
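A simulation sketch of this tail-probability statement (numpy assumed; ε, the repetition count and the Uniform(0, 1) population are arbitrary choices):

```python
# Sketch: watch the tail probability P(|sample mean - mu| > eps) shrink as n grows
import numpy as np

rng = np.random.default_rng(6)
mu, eps, reps = 0.5, 0.05, 1000  # Uniform(0,1) has mean 0.5; eps and reps chosen arbitrarily

for n in [10, 100, 1000, 10000]:
    means = rng.uniform(size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(means - mu) > eps))  # empirical tail probability -> 0
```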

12.8 Central Limit Theorem

The statement of the CLT is as follows. Let X_1, X_2, \dots, X_n be a sequence of independently and identically distributed random variables with finite mean µ and finite variance σ^2. Then the distribution of the standardized sum

\frac{X_1 + X_2 + \dots + X_n − nµ}{σ\sqrt{n}}

tends to the standard normal distribution as n → ∞. We can say that for −∞ < a < ∞,

P\left(\frac{X_1 + X_2 + \dots + X_n − nµ}{σ\sqrt{n}} ≤ a\right) \xrightarrow[n→∞]{} \frac{1}{\sqrt{2π}}\int_{−∞}^{a} e^{−x^2/2}\, dx

Now, before delving into the proof of the CLT, we look at a rather important lemma, stated as follows. Let Z_1, Z_2, \dots be a sequence of random variables having distribution functions F_{Z_n} and moment generating functions M_{Z_n}; further, let Z be a random variable with distribution function F_Z and moment generating function M_Z. If M_{Z_n}(t) → M_Z(t) for all t, then F_{Z_n}(t) → F_Z(t) for all those t at which F_Z(t) is continuous.
By this lemma, if Z is a standard normal variable, then M_Z(t) = e^{t^2/2}, and it follows that if M_{Z_n}(t) \xrightarrow[n→∞]{} e^{t^2/2}, then F_{Z_n}(t) \xrightarrow[n→∞]{} Φ(t).

12.8.1 Proving the theorem

We first assume that the mean and variance of the random variable X are µ = 0 and σ^2 = 1. The moment generating function of the independent and identically distributed random variable X_i/\sqrt{n} is given by

E\left[\exp\left(\frac{tX_i}{\sqrt{n}}\right)\right] = M\left(\frac{t}{\sqrt{n}}\right)

Now the moment generating function of \sum_{i=1}^{n} X_i/\sqrt{n} would be given by

\left[M\left(\frac{t}{\sqrt{n}}\right)\right]^n

This is due to the logic that a summed power of exponents resolves to a multiplication,

e^{x_1 + x_2} = e^{x_1} × e^{x_2}

and since the n random variables in the sequence are independent and identical, the expectation factors into n identical terms, giving the n-th power.
Now we let L(t) = \log M(t) = \log E[\exp(tX_i)]. We can note that

L(0) = \log M(0) = \log E[1] = 0

L'(0) = \frac{M'(0)}{M(0)} = µ = 0

L''(0) = \frac{M(0)M''(0) − [M'(0)]^2}{[M(0)]^2} = E[X^2] = 1

With the above results in mind, we need to prove that [M(t/\sqrt{n})]^n \xrightarrow[n→∞]{} e^{t^2/2}, or equivalently that nL(t/\sqrt{n}) \xrightarrow[n→∞]{} t^2/2. To show this, we note that (differentiating with respect to n and using L'Hôpital's rule)

\lim_{n→∞} \frac{L(t/\sqrt{n})}{n^{−1}} = \lim_{n→∞} \frac{−L'(t/\sqrt{n})\, n^{−3/2}\, t/2}{−n^{−2}}

= \lim_{n→∞} \frac{L'(t/\sqrt{n})\, t}{2n^{−1/2}}

= \lim_{n→∞} \frac{−L''(t/\sqrt{n})\, n^{−3/2}\, t^2/2}{−n^{−3/2}} \quad \text{(L'Hôpital's rule again)}

= \lim_{n→∞} L''\left(\frac{t}{\sqrt{n}}\right)\frac{t^2}{2} = \frac{t^2}{2}, \quad \text{since } L''(0) = 1

12.9 The strong law of large numbers

We state this theorem as follows. Let X_1, X_2, \dots, X_n be a sequence of independently and identically distributed random variables, each having a finite mean E[X_i] = µ. Then, with probability equal to 1,

\frac{X_1 + X_2 + \dots + X_n}{n} \xrightarrow[n→∞]{} µ

To demonstrate, let us assume that a sequence of independent trials of some experiment is performed. Further, we let E be a fixed event of the experiment and P(E) the corresponding probability of the occurrence of that event in a particular trial. Now we define an indicator random variable as

X_i = \begin{cases} 1, & \text{if } E \text{ occurs in the } i\text{-th trial} \\ 0, & \text{if } E \text{ does not occur in the } i\text{-th trial} \end{cases}

Then, by the strong law of large numbers, we state that with probability 1,

\frac{X_1 + X_2 + \dots + X_n}{n} → E[X] = 1 × P(E) + 0 × P(\bar{E}) = P(E)

We note that X1 + X2 + · · · + Xn denotes the number of times the event E occurs


in n successive trials of the experiment. This law states that with probability 1,
the limiting proportion of times the event occurs is the true probability of the
occurrence of that event, that is P (E). Before delving into the proof, we will
assume that the random variables have a finite fourth moment, E[X_i^4] = K < ∞.

12.9.1 Proof of the SLLN

We begin by assuming that the mean of X_i is equal to 0. We let S_n = \sum_{i=1}^{n} X_i. Then we have

E[S_n^4] = E[(X_1 + \dots + X_n)(X_1 + \dots + X_n)(X_1 + \dots + X_n)(X_1 + \dots + X_n)]

Upon expanding the right-hand side multiplications, we will get terms of the form

X_i^4, \quad X_i^3 X_j, \quad X_i^2 X_j^2, \quad X_i^2 X_j X_k, \quad X_i X_j X_k X_l

Now, because all the X_i are independent and have mean 0, we have

E[X_i^3 X_j] = E[X_i^3]E[X_j] = 0

E[X_i^2 X_j X_k] = E[X_i^2]E[X_j]E[X_k] = 0

E[X_i X_j X_k X_l] = 0

Further, we note that for each pair {i, j} there will be \binom{4}{2} = 6 terms of the form X_i^2 X_j^2, since we are essentially selecting which 2 of the 4 factors contribute X_i. Now, expanding the product, resolving the zero terms and taking expectations, we get

E[S_n^4] = \binom{n}{1} E[X_i^4] + 6\binom{n}{2} E[X_i^2 X_j^2]

= nK + 3n(n − 1)E[X_i^2]E[X_j^2]


Now we will make use of the condition that these variables are independent and
write:

0 ≤ V ar(Xi2 ) = E[Xi4 ] − (E[Xi2 ])2


(E[Xi2 ]) ≤ E[Xi4 ] = K
From this we will obtain:

E[Sn4 ] ≤ nK + 3n(n − 1)K


Dividing throughout by n4 we would get:
Ç å
Sn4 K 3K
E 4
≤ 3+ 2
n n n

Note that we have dropped the negative term −3K/n^3 arising from the expansion; doing so only enlarges the right-hand side, so it still gives an upper bound. Finally, we take an infinite sum on both sides to get

E\left(\sum_{n=1}^{∞} \frac{S_n^4}{n^4}\right) = \sum_{n=1}^{∞} E\left(\frac{S_n^4}{n^4}\right) < ∞

We can write this because the right-hand side is a series that converges to a finite sum. We note now that if there were a set of positive probability on which the infinite sum is infinite, then its expected value would also be infinite. But this is not the case here, since the expected value is finite. Therefore we can say, with probability 1, that

\sum_{n=1}^{∞} \frac{S_n^4}{n^4} < ∞
We also know that the terms of a convergent infinite series must go to 0, and hence we can say with probability 1 that:
$$\lim_{n\to\infty} \frac{S_n^4}{n^4} = 0$$
Now, since $S_n^4/n^4 = (S_n/n)^4$ goes to 0, we can also conclude with probability 1 that:
$$\frac{S_n}{n} \to 0 \quad \text{as } n \to \infty$$
Finally, in this case we took our mean to be 0, but in the more general case we can make the statement that with probability 1:
$$\lim_{n\to\infty} \sum_{i=1}^{n} \frac{X_i - \mu}{n} = 0$$
$$\lim_{n\to\infty} \sum_{i=1}^{n} \frac{X_i}{n} = \mu$$
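A minimal simulation sketch of this statement (the event probability 0.3 and the number of trials below are arbitrary choices for illustration): we run independent Bernoulli trials with $P(E) = 0.3$ and track the running proportion of occurrences, which settles near $P(E)$.

import numpy as np

rng = np.random.default_rng(0)
p_event = 0.3                        # P(E), an arbitrary choice for the demo
n = 100_000
x = rng.random(n) < p_event          # X_i = 1 if E occurs on trial i, else 0
running_mean = np.cumsum(x) / np.arange(1, n + 1)
print(running_mean[[99, 9_999, 99_999]])   # running proportions approach 0.3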

12.10 Convergence of sums of uniform random variables

Now we will look at a case wherein we take independent and identically distributed random variables from the uniform distribution, characterised as:
$$X_i \sim \text{unif}(0, 1)$$
Further, the general formulae for the mean and variance are given by:
$$\mu = \frac{a+b}{2}, \quad \sigma^2 = \frac{(b-a)^2}{12}$$

The mean for this distribution is 1/2 and the variance is 1/12. The general form for this probability density function is given by:
$$f_{X_i}(x) = \begin{cases} \frac{1}{b-a}, & x \in (a, b) \\ 0, & x \in (a, b)^c \end{cases}$$
Now what we will do is essentially this: take a varying number of random samples from this distribution, compute the standardized Z statistic (which standardizes the distribution of the sum of the random variables) for each sample size, and check the distribution of this sum. The general formulation for this is:
$$Z_n = \frac{(X_1 + X_2 + \cdots + X_n) - n\mu}{\sqrt{n\sigma^2}}$$

If we sample once we get:
$$Z_1 = \frac{X_1 - 1/2}{\sqrt{1/12}}$$
Now if we sample twice we get:
$$Z_2 = \frac{(X_1 + X_2) - 2/2}{\sqrt{2/12}}$$
Similarly if we sample say 50 times we would get:
$$Z_{50} = \frac{(X_1 + X_2 + \cdots + X_{50}) - 50/2}{\sqrt{50/12}}$$
We would then observe that the PDF of this sum of a large number of uniform random variables converges to the Standard Normal Distribution.
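The procedure just described is straightforward to carry out numerically. The sketch below (one possible implementation) draws many samples of each size from unif(0, 1), forms $Z_n$, and checks that its sample mean and variance approach the standard normal values 0 and 1; a histogram of z would show the bell shape emerging as n grows.

import numpy as np

rng = np.random.default_rng(42)
mu, var = 0.5, 1.0 / 12.0            # mean and variance of unif(0, 1)
for n in [1, 2, 50]:
    x = rng.random((100_000, n))     # 100000 samples, each of size n
    z = (x.sum(axis=1) - n * mu) / np.sqrt(n * var)   # standardized Z_n
    print(n, z.mean(), z.var())      # approximately 0 and 1 for every n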
Chapter 13

Borel-Cantelli Lemma

13.1 Fundamental guiding principles


Before delving into the rather involved concepts of Stochastic Processes, it is ab-
solutely essential for us to have a firm grasp on the fundamentals of Probability
theory, Set theory, Sequences and Limit theorems. In this prelude of sorts, we will
revisit some fundamental principles regarding some of those concepts and then
move on to discussing stochastic processes.

13.1.1 Sample space


Suppose a random experiment is conducted. Then, the set of all possible outcomes of the experiment is called the sample space, denoted by Ω or S. Suppose our experiment is about tossing two coins. Then the sample space could be given by:

S = {(H, H), (H, T ), (T, H), (T, T )} (13.1)

Now note that any subset E of the sample space is called an event. This is typi-
cally a set that contains various outcomes of the experiment and we say that if a
particular outcome is contained within E, then event E has occurred. For example
if we define our event to be - E is the event that heads appears on the first coin toss
- then our associated set for this event would be:

E = {(H, H), (H, T )} (13.2)

13.1.2 Working with sets: part 1


For two events E and F belonging to some sample space, we say that the union of
those events is the event that consists of outcomes that are contained in either E
or F or both. Consider E to be defined as in the previous section and we further
define F as {(T, H)}. Then the union set would be:

E ∪ F = {(H, H), (H, T ), (T, H)} (13.3)

Now if we consider an event such that it contains all the outcomes contained
in both E and F then that event would be the intersection of the two events


and is shown as follows. Assume that E = {(H, H), (H, T ), (T, H)} and F =
{(H, T ), (T, H), (T, T )}.
E ∩ F = EF = {(H, T ), (T, H)} (13.4)
Now let us consider another example of two events obtained from rolling two dice, where each outcome tuple records the results of the two rolls. Suppose E = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)} is the event that the sum of the two die rolls is 7 and let F = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)} be the event that the sum of
two die rolls is 6. Look carefully and you might notice that the two events have
nothing in common. There are no outcomes that are contained in both sets and
hence we say that such an event simply could not occur. Such an event is known
as a null event and is denoted as EF = φ. In this case we say that events E and
F are mutually exclusive.

13.1.3 Working with sets: part 2


Just like we defined unions and intersections of sets in the above section, we can define unions and intersections over $n$ events. Suppose we have many events given by $(E_1, E_2, \cdots)$, then their union is typically given by:
$$\bigcup_{n=1}^{\infty} E_n \quad (13.5)$$
In a similar manner, the intersection event of many events can be defined as follows:
$$\bigcap_{n=1}^{\infty} E_n \quad (13.6)$$
Now if we want to define an event such that it contains all those outcomes in
the sample space S that are not in event E, then such an event is known as the
complement of E denoted as E c . Note that the complement of the sample space
is the null set (S c = φ). Further, for any two events E and F if all the outcomes in
E are also present in F then we say that E is a subset of F and consequently, F is
the superset of E. This is denoted as:
E⊂F (13.7)
Note that the condition for equality of two sets is that they are both subsets of each other. That is:
$$E = F \iff E \subset F \text{ and } F \subset E \quad (13.8)$$
Some of these concepts seem quite intuitive when viewed in the form of Venn diagrams.
[Venn diagrams illustrating the union and intersection of two events A and B]

13.1.4 Basic laws governing set operations


Some of the basic algebraic rules that govern set operations are mentioned below:
• Commutative law: $E \cup F = F \cup E$ and $EF = FE$.

• Associative law: $(E \cup F) \cup G = E \cup (F \cup G)$ and $(EF)G = E(FG)$.

• Distributive law: $(E \cup F)G = EG \cup FG$ and $EF \cup G = (E \cup G)(F \cup G)$.

• De Morgan's laws: These laws give us meaningful relations between unions, intersections and complements of sets.
$$\left(\bigcup_{i=1}^{n} E_i\right)^c = \bigcap_{i=1}^{n} E_i^c \quad (13.9)$$
$$\left(\bigcap_{i=1}^{n} E_i\right)^c = \bigcup_{i=1}^{n} E_i^c \quad (13.10)$$

13.1.5 Axioms of probability


Consider an experiment with sample space S. For each event E of the sample space we assume that a function P(E) is defined and satisfies the following three axioms:
• $0 \le P(E) \le 1$

• $P(S) = 1$.

• For a sequence of mutually exclusive events $E_1, E_2, \cdots$ (that is, $E_i E_j = \phi$ for $i \ne j$) we have:
$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i) \quad (13.11)$$

13.1.6 Probability of continuous sets


We consider a sequence of events $\{E_n\}$ to be an increasing sequence if the following is true:
$$E_1 \subset E_2 \subset \cdots \subset E_n \subset E_{n+1} \quad (13.12)$$
Similarly, we define a sequence of events to be decreasing if the following holds:
$$E_1 \supset E_2 \supset \cdots \supset E_n \supset E_{n+1} \quad (13.13)$$
Now for an increasing sequence of events we can essentially define a limiting event in the form of:
$$\lim_{n\to\infty} E_n = \bigcup_{i=1}^{\infty} E_i \quad (13.14)$$
Similarly we can define the limiting event for a decreasing sequence of events as:
$$\lim_{n\to\infty} E_n = \bigcap_{i=1}^{\infty} E_i \quad (13.15)$$
Additionally we note an important proposition that lays down the probability of the limiting event for an increasing or decreasing sequence of events as:
$$\lim_{n\to\infty} P(E_n) = P\left(\lim_{n\to\infty} E_n\right) \quad (13.16)$$

13.2 Countable and Uncountable sets


We will first illustrate the confusion inherent in computing the probabilities of a uniform random variable given by $X \sim U(0,1)$. We know that for all $X = x$, $P(X = x) = 0$. But at the same time it is also true that $P(0 \le X \le 1) = 1$. Now this is where the confusion comes in:
$$P(0 \le X \le 1) = \sum_{x \in [0,1]} P(X = x) \quad (13.17)$$
Going by the initial statement, if we add up the individual terms $P(X = x) = 0$ we are bound to get 0 in the overall sum instead of 1. Now this confusion can be cleared by stating that a summation over an interval of real numbers is not sensible. We would fall into a trap while deciding which values of $x$ to use in our sum. Suppose we take the first value as 0, but then what would be the second value in the sum?

With this in mind, we tend to define infinite sums as a limiting case of finite sums, denoted as:
$$\sum_{i=1}^{\infty} x_i = \lim_{n\to\infty} \sum_{i=1}^{n} x_i \quad (13.18)$$

Therefore we further state that in order to even have an infinite sum, it has to be possible to arrange the terms in a sequence. Note now that if an infinite set of terms can be arranged in a sequence it is called countable, and otherwise it is uncountable. Positive rationals are said to be countable since we can list them in an order as ratios of integers:
$$\frac{1}{2}, \frac{2}{3}, \frac{3}{4}, \cdots \quad (13.19)$$
However we must note that the real numbers between 0 and 1 are not countable. Suppose we try to arrange these real numbers into a sequence $x_1, x_2, \cdots$. Also note that we choose to express these as:
$$x_j = \sum_{i=1}^{\infty} d_{ij} 10^{-i} \quad (13.20)$$

where $d_{ij} \in \{0, 1, 2, \cdots, 9\}$ is the $i$th digit after the decimal place of the $j$th number in the sequence. What is happening here is that we assume that any given $x$ can be written as a long sequence of decimal expansions. We randomly assign decimal expansions by allowing any given $i$th digit after the decimal point to take on any value between 0 and 9, as embodied by the set of $d_{ij}$. We can see this illustrated as follows: we can write $x_1 = \sum_{i=1}^{\infty} d_{i1} 10^{-i}$ as some decimal expansion such as $0.2\cdots74131$, and similarly $x_2 = \sum_{i=1}^{\infty} d_{i2} 10^{-i}$ as, say, $0.1\cdots90572$.
We assume that with the above sequence we can essentially list out the entire set between 0 and 1. Therefore, if we now consider an indicator random variable such that $I(A) = 1$ if condition A is true and $I(A) = 0$ if condition A is false, then we can define a new number by:
$$y = \sum_{i=1}^{\infty} (1 + I\{d_{ii} = 1\}) 10^{-i} \quad (13.21)$$

Now look closely at this number: it is basically saying that if the diagonal element of the array of the $x_j$ sequences, that is $d_{ii}$, is equal to 1, then the $i$th digit of $y$ takes on the value 2 (and otherwise the value 1). Finally, when we get the decimal expansion of $y$, we would have the case that the first decimal digit of $y$ is different from the $d_{11}$ element of $x_1$, the second decimal digit of $y$ is different from the $d_{22}$ element of $x_2$, and so on. With this we have essentially proven that every $x_j$ in the sequence differs in at least one digit from the newly defined number $y$. Therefore we note that while $y$ does in fact belong to the set $(0,1)$, it is not equal to any of the $x_j$ in the sequence. Hence we can say that the elements between 0 and 1 cannot be arranged or explicitly listed out.

13.3 Recap
The infimum of a subset $S$ of a partially ordered set $T$, denoted by $\inf S$, is the greatest element in $T$ that is less than or equal to all elements in $S$: it is the greatest lower bound. The supremum of a subset $S$ of a partially ordered set $T$, denoted by $\sup S$, is the least element in $T$ that is greater than or equal to all elements in $S$: it is the least upper bound.

13.3.1 Revisiting the background of discussion


In the earlier lecture, we saw that if we have an infinite sequence of 0 and 1 and
that our event of interest is a particular ek in that sequence turning out to be 1,
then we found that there are infinitely many such events possible. No matter how
far we choose the cut off point (n) in such a sequence, we would still end up
with infinitely many occurrences of our event of interest. Then we laid down the
condition that - for all cut off points n there exists a k ≥ n such that ek = 1 which
is given as:
∀n∈N ∃k≥n ek = 1 (13.22)

Then we translated this broad condition into events. Basically, we are interested in the event that $E_k$ happens infinitely many times. Before moving further, note some basic principles:

$$\forall\, n \in \mathbb{N} \to \bigcap_{n=1}^{\infty}, \quad \text{intersections} \quad (13.23)$$
$$\exists\, k \ge n \to \bigcup_{k=n}^{\infty}, \quad \text{unions} \quad (13.24)$$
With this we can write our condition of $E_k$ happening infinitely many times as:
$$\bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} E_k \quad (13.25)$$

After this we saw that if the series of event probabilities is divergent then, under the assumption that the $E_k$ events are disjoint, the tail sum (starting at cutoff point $n$) diverges:
$$\lim_{n\to\infty} \sum_{k=n}^{\infty} P(E_k) \to \infty \quad (13.26)$$
And conversely, if the series is convergent, then the tail of the series goes to 0 as $n$ tends to infinity:
$$\lim_{n\to\infty} \sum_{k=n}^{\infty} P(E_k) \to 0 \quad (13.27)$$
Finally, with these fundamental properties laid out, we formulated the Borel-Cantelli Lemma, which states:
$$\text{if } \sum_{n=1}^{\infty} P(E_n) < \infty \;\Rightarrow\; P\left(\bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} E_k\right) = 0 \quad (13.28)$$
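As a Monte Carlo illustration of the lemma (a sketch with arbitrarily chosen probabilities): take independent events with $P(E_n) = 1/n^2$, a convergent series, versus $P(E_n) = 1/n$, a divergent one, and count how many events occur past a cutoff in one realization.

import numpy as np

rng = np.random.default_rng(1)
N = 100_000
n = np.arange(1, N + 1)

# Independent events E_n with illustrative probabilities:
occ_conv = rng.random(N) < 1.0 / n ** 2   # sum of P(E_n) converges
occ_div = rng.random(N) < 1.0 / n         # sum of P(E_n) diverges

cutoff = 1_000
print(occ_conv[cutoff:].sum())   # almost always 0: E_n eventually stops occurring
print(occ_div[cutoff:].sum())    # typically positive, and grows as N grows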

13.4 Lim Sup and Lim Inf


For an event $A_n$ that occurs infinitely often, we define the Lim Sup as follows:
$$\limsup A_n = \lim_{n\to\infty} \sup A_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k \quad (13.29)$$
For an event $A_n$ that occurs all but finitely often, we define the Lim Inf as follows:
$$\liminf A_n = \lim_{n\to\infty} \inf A_n = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k \quad (13.30)$$

Now let us consider Ω to be our sample space and in that we consider a sample
point or an outcome ω ∈ Ω. Then the following condition can be defined:
ω ∈ [limn→∞ sup An ] ⇐⇒ ω lies in infinitely many of the individual sets An
We can define a similar condition for the Infimum as well:
ω ∈ [limn→∞ inf An ] ⇐⇒ ω lies in all but a finite number of sets

13.4.1 An illustrative example: 1


Consider a set X defined as X = {0, 1}. Now we consider a sequence of subsets of
this main set as follows:

{Xn } = {(X1 = {0}), (X2 = {1}), (X3 = {0}), (X4 = {1}) · · · } (13.31)

From this set of subsets, we can clearly notice that the odd indexed elements are all $\{0\}$, whereas all the even indexed elements are $\{1\}$. With these we define two new series of event classes that separately contain the odd and even indexed elements:
$$\{Y_n\} = \{\{0\}, \{0\}, \cdots\} \quad (13.32)$$
$$\{Z_n\} = \{\{1\}, \{1\}, \cdots\} \quad (13.33)$$
Now consider the original series $\{X_n\}$. In order to evaluate its Lim Sup we need to first compute the successive unions of all the subsets in the series, which are of the form $\{0\} \cup \{1\}$. Note that in every union iteration we get the same set, $\{0, 1\}$. Now the LimSup follows as:
$$\limsup X_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} X_k = \bigcap_{n=1}^{\infty} \{0, 1\} = \{0, 1\} \quad (13.34)$$
Now similarly, if we want to compute the Lim Inf of this series, then we need to first take successive intersections of all the events, which in this case would be iteratively found as follows: $\{0\} \cap \{1\} = \phi$, $\phi \cap \{0\} = \phi$, and so on. In the end we will get only the null set $\phi$. With this the LimInf could be defined as follows:
$$\liminf X_n = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} X_k = \bigcup_{n=1}^{\infty} \phi = \phi \quad (13.35)$$

We can notice from equations (13.34) and (13.35) that the LimSup and LimInf of this particular sequence are not equal. When this is the case we say that the limit of the sequence does not exist. Now recall from equation (13.32) the sequence $Y_n$. We will now compute the LimSup and LimInf for this series and check if its limit exists, which is essentially the same condition as the LimSup and LimInf being equal.
$$\limsup Y_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} Y_k = \bigcap_{n=1}^{\infty} \{0\} = \{0\} \quad (13.36)$$
$$\liminf Y_n = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} Y_k = \bigcup_{n=1}^{\infty} \{0\} = \{0\} \quad (13.37)$$

Hence from the above two equations we see clearly that the limit of the series
would exist and be given as:

lim sup Yn = lim inf Yn = lim Yn = {0} (13.38)



13.4.2 An illustrative example: 2


We will now, through an example, show that the limiting behaviour of a series
does not depend on transients, rather it depends on the long term pattern shown
by the tail of the sequence. Transients are events that occur finitely often, whereas
we are more interested in finding the events that tend to happen infinitely often.
The idea is that these finitely occurring transients do not have any effect on the
long term limiting behaviour of the series. Let us suppose a sequence given by:

$$\{B_n\} = \{\underbrace{(B_1 = \{50\}), (B_2 = \{20\}), (B_3 = \{35\}), (B_4 = \{-15\})}_{\text{transients}}, \underbrace{\{0\}, \{1\}, \{0\}, \{1\}, \cdots}_{\text{tail pattern}}\} \quad (13.39)$$
Now for each value of the cut-off $n$, specifying the start of the tail of the sequence, we call the union set of all events after the cut-off point $D_n$. With this the LimSup can be specified as:
$$\limsup B_n = \bigcap_{n=1}^{\infty} \underbrace{\bigcup_{k=n}^{\infty} B_k}_{D_n} = D_1 \cap D_2 \cap D_3 \cap \cdots = \{0, 1\} \quad (13.40)$$

For all values of the cut-off point, we see that the event $\{0,1\}$ tends to happen infinitely often, since the unions for all cut-off points $n$ resolve to that set. To explicitly break down this process, we can list the contents of those $D_n$ sets and see how they evolve:
$$D_1 = \bigcup_{k=1}^{\infty} B_k = \{50, 20, 35, -15, 0, 1\}$$
$$D_2 = \{20, 35, -15, 0, 1\}$$
$$D_3 = \{35, -15, 0, 1\}$$
$$D_4 = \{-15, 0, 1\}$$
$$D_5 = \{0, 1\}$$
$$D_6 = \{0, 1\}$$

Now we will compute the LimInf of this series. Note that the successive intersections of the sets in this series are simply the null set. Let us see the workings of this:
$$\liminf B_n = \bigcup_{n=1}^{\infty} \underbrace{\bigcap_{k=n}^{\infty} B_k}_{E_n} = E_1 \cup E_2 \cup E_3 \cdots = \phi \cup \phi \cdots = \phi \quad (13.41)$$
We note that each $E_n$ is the successive intersection of the $B_k$ events from the cut-off onwards. These successive intersections boil down to the null set since each successive element in the series is different:
$$E_1 = B_1 \cap B_2 \cap B_3 \cap \cdots = \phi \quad (13.42)$$
We now see that the limiting supremum and infimum are not at all affected by the transient, finitely occurring events in the series. Hence with this example, it is shown that transients do not affect the LimSup and LimInf, and that the limiting behaviour of the sequence is defined by its tail.
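These set computations are mechanical enough to script. The sketch below (illustrative only; it uses finite truncations of the sequences from the two examples, so the infinite tail operations are approximated with a finite cutoff) computes $\bigcap_n \bigcup_{k \ge n} X_k$ and $\bigcup_n \bigcap_{k \ge n} X_k$ directly with Python sets.

from functools import reduce

def lim_sup(seq, n_max=None):
    # finite approximation of the intersection over n of union_{k >= n} X_k
    n_max = n_max or len(seq) // 2
    tails = [reduce(set.union, seq[n:]) for n in range(n_max)]
    return reduce(set.intersection, tails)

def lim_inf(seq, n_max=None):
    # finite approximation of the union over n of intersection_{k >= n} X_k
    n_max = n_max or len(seq) // 2
    tails = [reduce(set.intersection, seq[n:]) for n in range(n_max)]
    return reduce(set.union, tails)

Xn = [{0}, {1}] * 10                              # example 1
Bn = [{50}, {20}, {35}, {-15}] + [{0}, {1}] * 10  # example 2, with transients
print(lim_sup(Xn), lim_inf(Xn))   # {0, 1} and set()
print(lim_sup(Bn), lim_inf(Bn))   # {0, 1} and set(): transients drop out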

13.4.3 Reiterating the Definitions


To reiterate some of the subtleties in the concepts relating to LimSup and LimInf, here we attempt to further strengthen the conceptual understanding by restating key observations and definitions. First, suppose that $\{X_n\}$ is a sequence of subsets of some large set $X$. Now we note that the LimSup of $X_n$ is the set that consists of the elements of $X$ which belong to $X_n$ for infinitely many $n$, where the series is countably infinite. To state this even more precisely we can say:
$x \in \limsup X_n$ iff there exists a subsequence $\{X_{n_k}\}$ of $\{X_n\}$ such that $x \in X_{n_k}\ \forall\, k$
A similar definition for the LimInf can also be stated. The LimInf of $X_n$ is the set that consists of the elements of $X$ that belong to $X_n$ for all except finitely many $n$. We can state this more precisely as follows:
$x \in \liminf X_n$ iff there exists some $m > 0$ such that $x \in X_n$ for all $n > m$
Note that the sequence of infimums of a series is an increasing sequence. We can show this as follows:
$$I_n = \inf_{m \ge n} X_m = \bigcap_{m=n}^{\infty} X_m = X_n \cap X_{n+1} \cap X_{n+2} \cdots \quad (13.43)$$
We see that the sequence $\{I_n\}$ is in fact an increasing sequence with $I_n \subset I_{n+1}$. Why is this an increasing sequence? Because in each successive iteration of forming $I_n$ we are taking fewer intersections among the $X_n$ events, and fewer intersections naturally correspond to bigger sets. Hence the $(n+1)$th set will be bigger than the $n$th set. Now we say that the least upper bound on this sequence of infimums $(I_n)$ is the LimInf, given by:
$$\liminf_{n\to\infty} X_n = \sup_n \{\inf\{X_m \mid m \in (n, n+1, \cdots)\}\} = \bigcup_{n=1}^{\infty} \left[\bigcap_{m=n}^{\infty} X_m\right] \quad (13.44)$$

Similarly, the LimSup is the greatest lower bound on a decreasing sequence of supremums (unions of sets). Why are successive unions of sets a decreasing sequence? Because as we increase the cut-off point, we take fewer unions each time, which naturally implies a smaller set. This condition is represented as:
$$J_n = \sup\{X_m \mid m \in (n, n+1, \cdots)\} = \bigcup_{m=n}^{\infty} X_m = X_n \cup X_{n+1} \cdots \quad (13.45)$$
Chapter 14

Time Series Analysis: Basics

14.1 Recap
We have discussed what random (stochastic) processes are. A stochastic process is a process in which some value changes randomly over time. At its simplest, it involves a variable changing at a random rate through time. There are various types of stochastic processes, mainly classified into discrete time stochastic processes and continuous time stochastic processes.

14.2 Introduction
For $FT[y(t)]$ to exist, $y(t)$ must be absolutely integrable:
$$\int_{-\infty}^{\infty} |y(t)|\, dt < \infty \quad (14.1)$$
Recall: for a Weakly Stationary Stochastic Process $X(t)$,
$$|R_{xx}(\tau)| \le R_{xx}(0) = E[x^2(t)] \quad (14.2)$$
so the autocorrelation function $R_{xx}(\tau)$ is bounded. Thus, instead of working directly with $X(t)$, we deal with the autocorrelation function $R_{xx}(\tau)$, which is bounded and hence absolutely integrable.

Consider a Weakly Stationary Stochastic Process (WSSP) $X(t)$. The Wiener-Khinchine Theorem plays a central role in stochastic series analysis, since it relates the Fourier transform of $x(t)$ to the autocorrelation function (ACF):
$$FT[R_{xx}(\tau)] = S_{xx}(\omega) = \int_{-\infty}^{\infty} R_{xx}(\tau)\, e^{-j\omega\tau}\, d\tau \quad (14.3)$$
$$R_{xx}(\tau) = FT^{-1}[S_{xx}(\omega)] = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{xx}(\omega)\, e^{j\omega\tau}\, d\omega \quad (14.4)$$
where $R_{xx}(\tau)$ is the autocorrelation function, $S_{xx}(\omega)$ is the spectral density, $FT[R_{xx}(\tau)]$ is the Fourier transform and $FT^{-1}[S_{xx}(\omega)]$ is the inverse Fourier transform.


$$E(x^2(t)) = R_{xx}(0) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{xx}(\omega)\, d\omega \quad (14.5)$$
where $E(x^2(t))$, the mean square value of the stochastic process $X(t)$, is the average power of $X(t)$.

14.3 Properties of Power spectral density $S_{xx}(\omega)$:

The Power Spectral Density (PSD) is the frequency response of a random or periodic signal. It tells us how the average power is distributed as a function of frequency. The PSD is deterministic, and for certain types of random signals it is independent of time. This is useful because the Fourier transform of a random time signal is itself random, and therefore of little use in calculating transfer relationships (i.e., finding the output of a filter when the input is random). The signal has to be stationary, which means that the statistics (mean, variance, covariance) do not change as a function of time.

1. $S_{xx}(\omega)$ is a non-negative function of $\omega$:
$$S_{xx}(\omega) \ge 0 \quad (14.6)$$

2. $S_{xx}(\omega)$ is an even function:
$$S_{xx}(-\omega) = S_{xx}(\omega) \quad (14.7)$$
Consider
$$S_{xx}(-\omega) = \int_{\tau=-\infty}^{\tau=+\infty} R_{xx}(\tau)\, e^{-j(-\omega)\tau}\, d\tau = \int_{-\infty}^{\infty} R_{xx}(\tau)\, e^{+j\omega\tau}\, d\tau$$
Let us take $r = -\tau$, so $d\tau = -dr$, and use the evenness of the ACF, $R_{xx}(-r) = R_{xx}(r)$:
$$S_{xx}(-\omega) = \int_{r=-\infty}^{r=\infty} R_{xx}(r)\, e^{-j\omega r}\, dr = S_{xx}(\omega)$$
Hence $S_{xx}(\omega)$ is even.

3. The power spectral density $S_{xx}(\omega)$ is a real function of $\omega$ when $X(t)$ is real:
$$S_{xx}(\omega) = \int_{-\infty}^{+\infty} R_{xx}(\tau)\, e^{-j\omega\tau}\, d\tau = \int_{-\infty}^{+\infty} R_{xx}(\tau)\left[\cos(\omega\tau) - j\sin(\omega\tau)\right] d\tau$$
$$S_{xx}(\omega) = \int_{-\infty}^{+\infty} R_{xx}(\tau)\cos(\omega\tau)\, d\tau - j\int_{-\infty}^{+\infty} R_{xx}(\tau)\sin(\omega\tau)\, d\tau$$
Since $R_{xx}(\tau)$ is even, the integrand $R_{xx}(\tau)\sin(\omega\tau)$ is odd and its integral vanishes, so:
$$S_{xx}(\omega) = \int_{-\infty}^{+\infty} R_{xx}(\tau)\cos(\omega\tau)\, d\tau = 2\int_{0}^{+\infty} R_{xx}(\tau)\cos(\omega\tau)\, d\tau$$

4. The average power of $X(t)$ is $E(x^2(t))$:
$$E(x^2(t)) = R_{xx}(0) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_{xx}(\omega)\, d\omega \quad (14.8)$$

5. $S_{xx}(\omega)$ is a real valued function:
$$S_{xx}^*(\omega) = S_{xx}(\omega) \quad (14.9)$$
where $S_{xx}^*(\omega)$ is the complex conjugate of $S_{xx}(\omega)$.
Recall: if $a = x + jy$ then $a^* = x - jy$, and $a = a^* \Rightarrow x + jy = x - jy \Rightarrow 2y = 0 \Rightarrow y = 0 \Rightarrow a = x$, which is real valued.
6. If $\int_{-\infty}^{+\infty} |R_{xx}(\tau)|\, d\tau < \infty$ then $S_{xx}(\omega)$ is a continuous function of $\omega$:
$$S_{xx}(\omega) = \int_{-\infty}^{+\infty} R_{xx}(\tau)\, e^{-j\omega\tau}\, d\tau = FT[R_{xx}(\tau)] \quad (14.10)$$
Note: since the power spectral density $S_{xx}(\omega)$ must be an even, non-negative, real function, a function $R_{xx}(\tau)$ whose Fourier transform violates these properties cannot be the autocorrelation function of a WSSP $X(t)$.

For example, $e^{-\alpha\tau}$, $\tau e^{-\alpha\tau}$ or $\sin(\omega_0\tau)$ cannot be autocorrelation functions of a WSSP, since the Fourier transform of each is complex.

Definition: for two stochastic processes $X(t)$ and $Y(t)$ that are jointly WSSP, the Fourier transform of the cross-correlation $R_{xy}(\tau)$ is the cross power spectral density $S_{xy}(\omega)$:
$$S_{xy}(\omega) = FT[R_{xy}(\tau)] = \int_{-\infty}^{\infty} R_{xy}(\tau)\, e^{-j\omega\tau}\, d\tau \quad (14.11)$$
where $S_{xy}(\omega)$ is in general a complex function, even when $X(t)$ and $Y(t)$ are real stochastic processes.

Recall: $R_{xy}(\tau) = R_{yx}(-\tau)$, so
$$R_{yx}(\tau) = R_{xy}(-\tau) \quad (14.12)$$

[Figure 12.1: time axis marking the instants $(t - \tau)$, $t$ and $(t + \tau)$]


$$S_{yx}(\omega) = S_{xy}(-\omega) = S_{xy}^*(\omega) \quad (14.13)$$
By definition $R_{xy}(\tau) = E[X(t)Y(t+\tau)]$, so $R_{yx}(\tau) = E[Y(t)X(t+\tau)]$ and
$$R_{yx}(-\tau) = E[Y(t)X(t-\tau)] = E[X(t')Y(t'+\tau)] = R_{xy}(\tau)$$
Then,
$$S_{xy}(\omega) = \int_{-\infty}^{\infty} R_{xy}(\tau)\, e^{-j\omega\tau}\, d\tau = FT[R_{xy}(\tau)]$$
$$S_{xy}^*(\omega) = \int_{-\infty}^{\infty} R_{xy}(\tau)\, e^{j\omega\tau}\, d\tau = \int_{-\infty}^{\infty} R_{yx}(-\tau)\, e^{j\omega\tau}\, d\tau$$
Substituting $\tau_1 = -\tau$ (so $d\tau = -d\tau_1$):
$$S_{xy}^*(\omega) = \int_{-\infty}^{\infty} R_{yx}(\tau_1)\, e^{-j\omega\tau_1}\, d\tau_1 = FT[R_{yx}(\tau)] = S_{yx}(\omega)$$
$$S_{xy}^*(\omega) = S_{yx}(\omega) \quad (14.14)$$
Example: find the autocorrelation function $R_{xx}(\tau)$ of the stochastic process with power spectral density
$$S_{xx}(\omega) = \begin{cases} S_0, & |\omega| < \omega_0 \\ 0, & \text{otherwise} \end{cases}$$
Sol: $|\omega| < \omega_0$ means $\omega < \omega_0$ when $\omega > 0$, and $\omega > -\omega_0$ when $\omega < 0$.

[Figure 12.2: rectangular power spectral density $S_{XX}(\omega)$ of height $S_0$ on the interval $(-\omega_0, +\omega_0)$]

$$e^{jx} = \cos x + j\sin x \quad (14.15)$$
$$e^{-jx} = \cos x - j\sin x \quad (14.16)$$
Adding equations (14.15) and (14.16):
$$\cos x = \frac{e^{jx} + e^{-jx}}{2} \quad (14.17)$$
Subtracting equation (14.16) from (14.15):
$$\sin x = \frac{e^{jx} - e^{-jx}}{2j} \quad (14.18)$$
$$R_{xx}(\tau) = FT^{-1}[S_{xx}(\omega)] = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_{xx}(\omega)\, e^{j\omega\tau}\, d\omega = \frac{S_0}{2\pi}\int_{-\omega_0}^{\omega_0} e^{j\omega\tau}\, d\omega \quad (14.19)$$
$$\Rightarrow R_{xx}(\tau) = \frac{S_0}{2\pi j\tau}\left[e^{j\omega\tau}\right]_{-\omega_0}^{\omega_0} = \frac{S_0}{2\pi j\tau}\left[e^{j\omega_0\tau} - e^{-j\omega_0\tau}\right] = \frac{S_0}{\pi\tau}\left[\frac{e^{j\omega_0\tau} - e^{-j\omega_0\tau}}{2j}\right]$$
$$\Rightarrow R_{xx}(\tau) = \frac{S_0}{\pi\tau}\sin(\omega_0\tau) \quad (14.20)$$
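The closed form can be checked numerically (a sketch; the values of $S_0$ and $\omega_0$ are arbitrary): approximate the inverse Fourier transform integral (14.19) with a trapezoid rule on a frequency grid and compare it with $\frac{S_0}{\pi\tau}\sin(\omega_0\tau)$.

import numpy as np

S0, w0 = 2.0, 5.0                    # arbitrary PSD height and cutoff frequency
w = np.linspace(-w0, w0, 20_001)     # frequency grid over the PSD's support
dw = w[1] - w[0]
for tau in [0.3, 1.0, 2.7]:
    f = S0 * np.exp(1j * w * tau)    # integrand of the inverse transform
    numeric = ((f.sum() - 0.5 * (f[0] + f[-1])) * dw).real / (2 * np.pi)
    closed = S0 * np.sin(w0 * tau) / (np.pi * tau)   # closed form (14.20)
    print(tau, numeric, closed)      # the two columns agree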

14.4 White Noise (WN):


White noise is a random signal having equal intensity at different frequencies, giv-
ing it a constant power spectral density. Even a binary signal which can only take
on the values 1 or 0 will be white if the sequence is statistically uncorrelated. Noise
having a continuous distribution, such as a normal distribution, can of course be
white.

In statistics and econometrics one often assumes that an observed series of data values is the sum of values generated by a deterministic linear process, depending on certain independent/explanatory variables, and a series of random noise values. If there is non-zero correlation between the noise values underlying different observations, then the estimated model parameters are still unbiased, but estimates of their uncertainties, such as confidence intervals, will be biased.

In Time series analysis there are often no explanatory variables other than the past
values of the variable being modeled i.e. the dependent variable. In this case the
noise process is often modeled as a moving average process, in which the current
value of the dependent variable depends on current and past values of a sequential
white noise process.
In what follows, N(t) denotes white noise.

Definition: white noise is a random function $N(t)$ whose power spectral density $S_{nn}(\omega)$ is constant for all frequencies $\omega$:
$$S_{nn}(\omega) = \frac{N_0}{2} \text{ is constant } \forall\, \omega \quad (14.21)$$
where $N_0$ is a real positive constant.
The autocorrelation of white noise is
$$R_{nn}(\tau) = FT^{-1}[S_{nn}(\omega)] = FT^{-1}\left(\frac{N_0}{2}\right) = \left(\frac{N_0}{2}\right) FT^{-1}[1] = \left(\frac{N_0}{2}\right)\delta(\tau)$$
since
$$FT[\delta(\tau)] = \int_{-\infty}^{\infty} \delta(\tau)\, e^{-j\omega\tau}\, d\tau = e^{-j\omega(0)} = e^0 = 1 \;\Rightarrow\; FT^{-1}[1] = \delta(\tau)$$
where the Dirac delta function is
$$\delta(\tau) = \begin{cases} \infty & \text{if } \tau = 0 \\ 0 & \text{if } \tau \ne 0 \end{cases} \quad (14.22)$$
[Figure 12.3: Fourier transform pair for white noise - the flat power spectral density $S_{NN}(\omega) = N_0/2$ and the autocorrelation function $R_{NN}(\tau) = (N_0/2)\,\delta(\tau)$]

Example: let $Y(t) = X(t) + N(t)$ be a weakly stationary process, with $X(t)$ the actual (noise-free) signal and $N(t)$ a zero mean noise process with variance $\sigma_N^2$ and $\mu_N = 0$. Find the power spectral density of $Y(t)$, i.e. $S_{yy}(\omega)$.
Sol:
$$S_{yy}(\omega) = FT[R_{yy}(\tau)] = \int_{-\infty}^{\infty} R_{yy}(\tau)\, e^{-j\omega\tau}\, d\tau$$
where $R_{yy}(\tau)$ is the autocorrelation function of the WSSP $Y(t)$:
$$R_{yy}(\tau) = E[Y(t)Y(t+\tau)] \quad \text{and} \quad Y(t) = X(t) + N(t)$$
where $X(t)$ and $N(t)$ are independent stochastic processes with
$$\mu_N = 0 \text{ and } Var(N(t)) = \sigma_N^2 \quad (14.23)$$
$$R_{yy}(\tau) = E[Y(t)Y(t+\tau)] = E\left[(X(t) + N(t))(X(t+\tau) + N(t+\tau))\right]$$
$$R_{yy}(\tau) = E\left[X(t)X(t+\tau) + X(t)N(t+\tau) + N(t)X(t+\tau) + N(t)N(t+\tau)\right]$$
where $E[X(t)N(t+\tau)] = 0$ and $E[N(t)X(t+\tau)] = 0$ by independence and the zero mean of the noise. Hence
$$R_{yy}(\tau) = E[X(t)X(t+\tau)] + E[N(t)N(t+\tau)]$$
$$R_{yy}(\tau) = R_{xx}(\tau) + R_{nn}(\tau) = R_{xx}(\tau) + \sigma_N^2\,\delta(\tau)$$
 
Recall: $R_{nn}(0) = E[N^2(t)]$, and since $\mu_N = 0$,
$$\sigma_N^2 = Var[N(t)] = E[N^2(t)] - \left(E[N(t)]\right)^2 = E[N^2(t)] - \mu_N^2 = E[N^2(t)] = R_{nn}(0)$$
$$\Rightarrow R_{nn}(\tau) = \sigma_N^2\,\delta(\tau)$$
$$R_{yy}(\tau) = R_{xx}(\tau) + \sigma_N^2\,\delta(\tau)$$
Now solving for the power spectral density of $Y(t)$:
$$S_{yy}(\omega) = FT[R_{yy}(\tau)] = FT[R_{xx}(\tau) + \sigma_N^2\delta(\tau)] = FT[R_{xx}(\tau)] + FT[\sigma_N^2\delta(\tau)]$$
Since $FT[\delta(\tau)] = 1$,
$$S_{yy}(\omega) = S_{xx}(\omega) + \sigma_N^2$$
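As an illustration of these white-noise properties (a sketch in discrete time; the variance and sample size are arbitrary choices), the sample autocorrelation of simulated white noise is essentially $\sigma^2\delta(m)$: a spike at lag 0 and approximately zero at every other lag.

import numpy as np

rng = np.random.default_rng(7)
sigma2, n = 4.0, 200_000
noise = rng.normal(0.0, np.sqrt(sigma2), n)   # zero-mean white noise samples

# sample autocorrelation R_nn(m) = E[N(t) N(t+m)] at a few lags
for m in [0, 1, 2, 10]:
    r = np.mean(noise[:n - m] * noise[m:])
    print(m, r)   # about 4.0 at lag 0, about 0 elsewhere: R_nn(m) ~ sigma2 * delta(m)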
Till now we have discussed continuous time stochastic processes (CTSP). Let us look at discrete time stochastic processes in the next section.

14.5 Discrete Time Stochastic Processes


A discrete-time stochastic process is essentially a random vector with components indexed by time, and a time series observed in an economic application is one realization of this random vector. We exclusively consider processes in discrete time, i.e. processes which are observed at equally spaced points of time t = 0, 1, 2, .... In other words, a discrete process is considered to be an approximation of its continuous counterpart.

When interpreted as time, if the index set of a stochastic process has a finite or
countable number of elements, such as a finite set of numbers, the set of integers,
or the natural numbers, then the stochastic process is said to be in discrete time.

Let $X_n = X[n]$, $n = 0, 1, 2, \ldots$ be a random sequence.

A DTSP $X_n = X[n]$ is obtained by sampling a continuous time stochastic process, with sampling interval $T_s$.

[Figure 12.4: sampling instants $n = 0, 1, 2, \ldots$ corresponding to times $0, T_s, 2T_s, \ldots$]

$X[n] = X[nT_s]$ where $n = 0, \pm 1, \pm 2, \pm 3, \ldots$
The mean of $X[n]$ is $\mu_x[n] = E[X(n)]$.
The autocorrelation function of $X(n)$ is $R_{xx}(n, n+m) = E[X(n)X(n+m)]$.
Definition: the autocovariance function of $X(n)$ is $C_{xx}(n_1, n_2)$, measured between the samples $X(n_1)$ and $X(n_2)$:
$$C_{xx}(n_1, n_2) = E\left\{\left[X(n_1) - \mu_x(n_1)\right]\left[X(n_2) - \mu_x(n_2)\right]\right\}$$
$$C_{xx}(n_1, n_2) = E[X(n_1)X(n_2)] - \mu_x(n_1)\mu_x(n_2)$$

If $X(n_1)$ and $X(n_2)$ are independent then $R_{xx}(n_1, n_2) = \mu_x(n_1)\mu_x(n_2)$
$\Rightarrow C_{xx}(n_1, n_2) = 0$, i.e. $X(n_1)$ and $X(n_2)$ are uncorrelated random variables.
Definition: a Discrete Time Stochastic Process (DTSP) is called white noise if the random variables $X(n_k)$ are uncorrelated.
Note: if the white noise is a Gaussian WSSP then $X(n)$ consists of a sequence of IID RVs with variance $\sigma^2$.
The autocorrelation function of Gaussian white noise is $R_{xx}(m) = \sigma^2\delta(m)$, where
$$\delta(m) = \begin{cases} 1 & \text{if } m = 0 \\ 0 & \text{if } m \ne 0 \end{cases}$$
Definition: the power spectral density of $X(n)$ is
$$S_{xx}(\Omega) = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\, e^{-j\Omega m} = DFT[R_{xx}(m)] \quad (14.24)$$
where $R_{xx}(m)$ is the discrete autocorrelation function of $X(m)$ and $DFT[R_{xx}(m)]$ is the Discrete Fourier transform.
$$e^{-j(\Omega + 2\pi)n} = e^{-j\Omega n}\, e^{-j2\pi n} = e^{-j\Omega n}$$
Hence $e^{-j\Omega n}$ is periodic with period $2\pi$, so $S_{xx}(\Omega)$ is periodic with period $2\pi$.
Therefore, it is sufficient to define $S_{xx}(\Omega)$ in the range $\Omega \in (-\pi, \pi)$.
[Figure 12.5: one period of $S_{xx}(\Omega)$ on the interval $(-\pi, \pi)$]

$\Rightarrow$ The autocorrelation function of $X(n)$ is $R_{xx}(m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\Omega)\, e^{j\Omega m}\, d\Omega$.
Properties of the Power Spectral Density $S_{xx}(\Omega)$:

1. $S_{xx}(\Omega)$ is periodic with period $2\pi$:
$$S_{xx}(\Omega + 2\pi) = S_{xx}(\Omega) \quad (14.25)$$

2. $S_{xx}(\Omega)$ is an even function in $\Omega$:
$$S_{xx}(-\Omega) = S_{xx}(\Omega) \quad (14.26)$$

3. $S_{xx}(\Omega)$ is real:
$$S_{xx}(\Omega) = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\, e^{-j\Omega m} \quad (14.27)$$
$$e^{-j\Omega m} = \cos(\Omega m) - j\sin(\Omega m)$$
$$S_{xx}(\Omega) = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\left[\cos(\Omega m) - j\sin(\Omega m)\right]$$
$$S_{xx}(\Omega) = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\cos(\Omega m) - j\sum_{m=-\infty}^{+\infty} R_{xx}(m)\sin(\Omega m) \quad (14.28)$$
Since $R_{xx}(m)$ is even in $m$ while $\sin(\Omega m)$ is odd, the sine sum vanishes; $\cos(\Omega m)$ is even, so $S_{xx}(\Omega)$ is real (and even).

4. $E[X^2(n)]$ is the average power of the DTSP $X(n)$:
$$E[X^2(n)] = R_{xx}(0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\Omega)\, e^{j\Omega \cdot 0}\, d\Omega = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\Omega)\, d\Omega \quad (14.29)$$

Example: assume $X(n)$ is a real SP, so that $R_{xx}(-m) = R_{xx}(m)$. Find the power spectral density of $X(n)$, i.e. $S_{xx}(\Omega)$.
Sol:
$$S_{xx}(\Omega) = DFT[R_{xx}(m)] = \sum_{m=-\infty}^{+\infty} R_{xx}(m)\, e^{-j\Omega m}$$
$$S_{xx}(\Omega) = \sum_{m=-\infty}^{-1} R_{xx}(m)\, e^{-j\Omega m} + \sum_{m=0}^{+\infty} R_{xx}(m)\, e^{-j\Omega m}$$
Introducing a dummy index $k = -m$ in the first sum:
$$S_{xx}(\Omega) = \sum_{k=1}^{\infty} R_{xx}(-k)\, e^{j\Omega k} + R_{xx}(0) + \sum_{k=1}^{\infty} R_{xx}(k)\, e^{-j\Omega k}$$
We know $R_{xx}(-k) = R_{xx}(k)$ is an even function, so
$$S_{xx}(\Omega) = \sum_{k=1}^{\infty} R_{xx}(k)\left[e^{j\Omega k} + e^{-j\Omega k}\right] + R_{xx}(0)$$
$$\Rightarrow S_{xx}(\Omega) = 2\sum_{k=1}^{\infty} R_{xx}(k)\cos(k\Omega) + R_{xx}(0)$$
Proving $R_{xx}(-m) = R_{xx}(m)$:
$$R_{xx}(m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\Omega)\, e^{j\Omega m}\, d\Omega$$
$$R_{xx}(-m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\Omega)\, e^{-j\Omega m}\, d\Omega$$
Let $\alpha = -\Omega$:
$$R_{xx}(-m) = -\frac{1}{2\pi}\int_{\pi}^{-\pi} S_{xx}(-\alpha)\, e^{j\alpha m}\, d\alpha = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(-\alpha)\, e^{j\alpha m}\, d\alpha$$
$S_{xx}(\alpha)$ is an even function by property 2, so
$$R_{xx}(-m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_{xx}(\alpha)\, e^{j\alpha m}\, d\alpha = R_{xx}(m)$$
$$R_{xx}(-m) = R_{xx}(m) \quad (14.30)$$

14.5.1 Sampling a CTSP: Continuous Time Stochastic Processes

A stochastic process with the property that almost all sample paths are continuous is called a continuous process. If the index set is some interval of the real line, then time is said to be continuous. An example of a continuous-time stochastic process for which the sample paths are not continuous is a Poisson process.

A Discrete Time Stochastic Process (DTSP) is obtained by sampling a CTSP $X(t)$. If the CTSP $X(t)$ is sampled at constant intervals of $T_s$ time units, i.e. $T_s$ is the sampling period, then the samples form a DTSP defined by $X(n)$.

[Figure 12.6: sampling instants $n = \ldots, -2, -1, 0, 1, 2, \ldots$ at times $\ldots, -2T_s, -T_s, 0, T_s, 2T_s, \ldots$, spaced $T_s$ apart]

Let $X(n) = X(nT_s)$ for $n = 0, \pm 1, \pm 2, \ldots$
If the CTSP $X(t)$ has mean $\mu_x(t)$ and autocorrelation $R_{xx}(t_1, t_2)$, then for the DTSP $X(n)$ we have
$\mu_x(n) = \mu_x(nT_s)$ and $R_{xx}(n_1, n_2) = R_{xx}(n_1 T_s, n_2 T_s)$,
i.e. $\mu_x(nT_s)$ is the continuous time mean sampled at $nT_s$, and $R_{xx}(n_1 T_s, n_2 T_s)$ is the continuous time autocorrelation function sampled at $n_1 T_s$ and $n_2 T_s$.

Note: If X(t) is a WSSP in continuous time then X(n) is also WSSP in discrete
time with µx (n) = µx = Constant and Rxx (m) = Rxx (mTs ) .

[Figure 12.7: one realization (sample path) $X(\omega_1, t)$ of a CTSP $X(t)$ in continuous time, and the DTSP $X(n)$, $n = 0, 1, 2, \ldots$ obtained from it by sampling]

For a CTSP $\{X(t), t \in T\}$ and $t_0 \in T$, $X(t_0)$ is a random variable, so its CDF is
$$F_{X(t_0)}(x) = P\left(X(t_0) \le x\right)$$
For $t_1 \in T$ and $t_2 \in T$, the joint CDF of $X(t_1)$ and $X(t_2)$ is
$$F_{X(t_1), X(t_2)}(x_1, x_2) = P\left(X(t_1) \le x_1,\; X(t_2) \le x_2\right)$$

14.6 Strong stationarity:

In mathematics and statistics, a strictly stationary process or strongly stationary process is a stochastic process whose unconditional joint probability distribution does not change when shifted in time. For many applications strict-sense stationarity is too restrictive. A strong form of stationarity is when the distribution of a time series is exactly the same through time:
$$F_{X(t)}(x) = F_{X(t+\Delta)}(x) \quad \forall\, t \in T \text{ and } (t + \Delta) \in T$$
The joint CDF of $X(t_1)$ and $X(t_2)$ is the same as the joint distribution of $X(t_1 + \Delta)$ and $X(t_2 + \Delta)$, i.e. a time shift of $\Delta$ does not change the stationarity properties.
Definition: a CTSP $\{X(t), t \in T\}$ is an SSSP if $\forall\, t_1, t_2, \ldots, t_n \in \mathbb{R}$ and all $\Delta \in \mathbb{R}$, the joint CDF of $X(t_1), X(t_2), \ldots, X(t_n)$ satisfies, for all real numbers $x_1, x_2, \ldots, x_n$,
$$F_{X(t_1), X(t_2), \ldots, X(t_n)}(x_1, x_2, \ldots, x_n) = F_{X(t_1+\Delta), X(t_2+\Delta), \ldots, X(t_n+\Delta)}(x_1, x_2, \ldots, x_n)$$

Definition: for a DTSP $X(n)$, $n \in \mathbb{Z}$ (integer set $= \{\ldots, -2, -1, 0, 1, 2, \ldots\}$), $\forall\, n_1, n_2, \ldots, n_k \in \mathbb{Z}$ and all $\Delta \in \mathbb{Z}$, the joint CDF of $X(n_1), X(n_2), \ldots, X(n_k)$ is the same as the joint CDF of $X(n_1+\Delta), X(n_2+\Delta), \ldots, X(n_k+\Delta)$, i.e. for all real numbers $x_1, x_2, \ldots, x_k$,
$$F_{X(n_1), X(n_2), \ldots, X(n_k)}(x_1, x_2, \ldots, x_k) = F_{X(n_1+\Delta), X(n_2+\Delta), \ldots, X(n_k+\Delta)}(x_1, x_2, \ldots, x_k)$$

Weak stationarity (WSSP): for $t_1, t_2, \ldots, t_n \in \mathbb{R}$ and all $\tau \in \mathbb{R}$:

1. The mean function does not change due to shifts in time and is independent of time: $E[X(t_1)] = E[X(t_2)]$, i.e. $\mu_x(t_1) = \mu_x(t_2) = $ constant.

2. The autocorrelation function does not change with shifts in time and is independent of time: $E[X(t_1)X(t_2)] = E[X(t_1+\tau)X(t_2+\tau)]$.
Definition: a CTSP is WSSP if, with $\tau = (t_1 - t_2)$,

1. $\mu_x(t) = \mu_x \;\forall\, t \in \mathbb{R}$

2. $R_{xx}(t_1, t_2) = R_{xx}(t_1 - t_2) = R_{xx}(\tau)$

Definition: a DTSP $\{X(n), n \in \mathbb{Z}\}$ is WSSP if

1. $\mu_x(n) = \mu_x \;\forall\, n \in \mathbb{Z}$

2. $R_{xx}(n_1, n_2) = R_{xx}(n_1 - n_2) \;\forall\, n_1, n_2 \in \mathbb{Z}$

For Weakly Stationary Stochastic Processes,
$$R_{xx}(\tau) = E[X(t)X(t+\tau)] = E[X(t+\tau)X(t)]$$
$$R_{xx}(0) = E[X^2(t)]$$
For a WSSP, $E[X^2(t)]$ is not a function of time:
$$E[X^2(t)] = R_{xx}(0)$$
Since $X^2(t) \ge 0 \Rightarrow E[X^2(t)] \ge 0 \Rightarrow R_{xx}(0) \ge 0$.
$$R_{xx}(-\tau) = E[X(t)X(t-\tau)] = E[X(t'+\tau)X(t')] = R_{xx}(\tau)$$
so $R_{xx}(\tau)$ is an even function for all $\tau \in \mathbb{R}$.

Note: $R_{xx}(\tau)$ takes its maximum value at $\tau = 0$, that is, $X(t+\tau)$ and $X(t)$ have the highest correlation at $\tau = 0$.

Theorem: $|R_{xx}(\tau)| \le R_{xx}(0) \;\forall\, \tau \in \mathbb{R}$
Proof: by the Cauchy-Schwarz inequality, $|E[XY]| \le \sqrt{E(X^2)E(Y^2)}$, with equality iff $X = \alpha Y$ for some constant $\alpha \in \mathbb{R}$. Taking $X = X(t)$ and $Y = X(t-\tau)$:
$$|E[X(t)X(t-\tau)]| \le \sqrt{E[X(t)^2]\, E[X(t-\tau)^2]} \quad (14.31)$$
$$|R_{xx}(\tau)| \le \sqrt{R_{xx}(0)\, R_{xx}(0)} = R_{xx}(0) \quad (14.32)$$
$$\Rightarrow |R_{xx}(\tau)| \le R_{xx}(0) \quad (14.33)$$


[Figure 12.8: the autocorrelation function $R_{XX}(\tau)$ attains its maximum value $R_{XX}(0)$ at $\tau = 0$]

14.7 Cyclo-stationary process

A cyclostationary process is a signal having statistical properties that vary cyclically with time. We may probabilistically view the measurements as an instance of a stochastic process or, in an alternative deterministic view, treat the measurements as a single time series, from which a probability distribution for some event associated with the time series can be defined as the fraction of time that event occurs over the lifetime of the time series. In both views, the process or time series is said to be cyclostationary if and only if its associated probability distributions vary periodically with time.

A signal that is just a function of time, and not a sample path of a stochastic process, can exhibit cyclostationary properties in the framework of the fraction-of-time point of view. If the signal is furthermore ergodic, all sample paths exhibit the same time average.

This process has a periodic structure: the statistical properties are repeated every $T_p$ units of time. That is, if the random variables $X(t_1), X(t_2), \ldots, X(t_n)$ have the same joint CDF as the RVs $X(t_1 + T_p), X(t_2 + T_p), \ldots, X(t_n + T_p)$, then the RVs are cyclo-stationary.
For example: $X(t) = A\cos(\omega t) \Rightarrow X\left(t + \frac{2\pi}{\omega}\right) = A\cos\left(\omega\left(t + \frac{2\pi}{\omega}\right)\right) = A\cos(\omega t + 2\pi) = A\cos(\omega t) = X(t)$
$X(t)$ is periodic with period $T_p = \frac{2\pi}{\omega}$, and the statistical properties of $X(t)$ do not change when time is shifted by $T_p$ units.
Note: in the above definition $\tau = T_p$ or $\Delta = T_p$.

Definition: a DTSP $X(n)$, $n \in \mathbb{Z}$ is cyclo-stationary if $\exists\, M \in \mathbb{N} = \{0, 1, 2, \ldots\}$ such that

1. $\mu_x(n + M) = \mu_x(n) \;\forall\, n \in \mathbb{Z}$

2. $R_{xx}(n_1 + M, n_2 + M) = R_{xx}(n_1, n_2) \;\forall\, n_1, n_2 \in \mathbb{Z}$


Definition: let $X(t)$ be a CTSP. $X(t)$ is mean square continuous at time $t$ if
$$\lim_{\delta \to 0} E\left[|X(t+\delta) - X(t)|^2\right] = 0$$
i.e. the difference $|X(t+\delta) - X(t)|$ is small on average.

Note: mean square continuity does not mean that every possible realization of $X(t)$ is a continuous function.

14.8 White Noise is a special Stochastic Process


A very commonly-used random process is white noise. White noise is a random
signal having equal intensity at different frequencies, giving it a constant power
spectral density.
Definition: $N(t)$ is called white noise if $S_{nn}(\omega) = \frac{N_0}{2} \;\forall\, \omega$. Its average power is infinite:
$$E[N^2(t)] = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_{nn}(\omega)\, d\omega = \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{N_0}{2}\, d\omega = \infty$$
with the autocorrelation function
$$R_{nn}(\tau) = FT^{-1}[S_{nn}(\omega)] = FT^{-1}\left[\frac{N_0}{2}\right] = \frac{N_0}{2}\,\delta(\tau)$$
where $\delta(\tau)$ is the Dirac delta function
$$\delta(\tau) = \begin{cases} \infty & \text{if } \tau = 0 \\ 0 & \text{if } \tau \ne 0 \end{cases}$$
$E[N^2(t)] = R_{nn}(0) \Rightarrow$ a white noise stochastic process has infinite power.
Also, $R_{nn}(\tau) = 0$ for any $\tau \ne 0$
$\Rightarrow N(t_1)$ and $N(t_2)$ are uncorrelated for any $t_1 \ne t_2$
$\Rightarrow$ white Gaussian noise values $GN(t_1)$ and $GN(t_2)$ are independent for any $t_1 \ne t_2$, since for Gaussian RVs independence $\Leftrightarrow$ uncorrelatedness.

14.9 Gaussian Random Process – GRP


A Gaussian process is a stochastic process i.e. a collection of random variables
indexed by time or space, such that every finite collection of those random vari-
ables has a multivariate normal distribution, i.e. every finite linear combination
of them is normally distributed. If a random process is modelled as a Gaussian
process, the distributions of various derived quantities can be obtained explicitly.

Such quantities include the average value of the process over a range of times and
the error in estimating the average using sample values at a small set of times.

The Gaussian random variable is clearly the most commonly used and of most im-
portance. For continuous variables, possible values are distributed on a continuous
scale and the probability density function links every possible value with a given
probability intensity which we can think of as the probability to find the value of
the variable around every possible value. A theoretical frequency distribution for a
random variable, characterized by a bell-shaped curve symmetrical about its mean.
$\vec{X} = [X_1, X_2, \ldots, X_n]^T$ is a random vector and $\vec{a} = [a_1, a_2, \ldots, a_n]^T \in \mathbb{R}^n$.
The RVs $X_1, X_2, \ldots, X_n$ are jointly normal if, for all $a_i \in \mathbb{R}$, the scalar random variable $Y = \vec{a}^T\vec{X} = a_1X_1 + a_2X_2 + \cdots + a_nX_n$ is normally distributed.
Jointly Gaussian random variables can thus be characterized by the property that every scalar linear combination of such variables is Gaussian. An important property of jointly normal random variables is that their joint PDF is completely determined by their mean vector and covariance matrix.

$\vec{X}$ is a Gaussian vector if the RVs $X_1, X_2, \ldots, X_n$ are jointly normal.

Note: the joint PDF of $X_1, X_2, \ldots, X_n$ is completely determined by the mean vector $\vec{m}$ and the covariance matrix $C$:
$$\vec{m} = E[\vec{X}], \quad \vec{X} = [X_1, X_2, \ldots, X_n]^T$$
$$C = E\left[(\vec{X} - \vec{m})(\vec{X} - \vec{m})^T\right], \quad |C| = \det(C)$$
The 1-D Gaussian PDF of $X$ is
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \quad (14.34)$$
The n-D Gaussian PDF of $\vec{X}$ is
$$f_{\vec{X}}(\vec{x}) = \frac{1}{(2\pi)^{n/2}\sqrt{|C|}}\, e^{-\frac{1}{2}(\vec{x}-\vec{m})^T C^{-1} (\vec{x}-\vec{m})} \quad (14.35)$$

Definition: an SP $\{X(t), t \in \mathbb{R}\}$ is a Gaussian (Normal) random process if $\forall\, t_1, t_2, \ldots, t_n \in \mathbb{R}$ the $n$ RVs $X(t_1), X(t_2), \ldots, X(t_n)$ are jointly normal.

Note: if two jointly normal random processes $X(t)$ and $Y(t)$ are uncorrelated, that is $C_{xy}(t_1, t_2) = 0 \;\forall\, t_1, t_2$, then $X(t)$ and $Y(t)$ are two independent SPs.

Note: for a Gaussian SP, weak stationarity and strong stationarity (SSSP) are equivalent.

Theorem: for a Gaussian SP $\{X(t), t \in T\}$, if $X(t)$ is WSSP then $X(t)$ is SSSP.
Definition: two SPs $\{X(t)\}$ and $\{Y(t)\}$ are jointly Gaussian if for all $t_1, t_2, \ldots, t_m \in R_x$ and $s_1, s_2, \ldots, s_n \in R_y$ the RVs $X(t_1), \ldots, X(t_m), Y(s_1), \ldots, Y(s_n)$ are jointly normal.
Proof (of the theorem):
We need to show that $\forall\, t_1, t_2, \ldots, t_k \in \mathbb{R}$, the variables $X(t_1), X(t_2), \ldots, X(t_k)$ have the same joint CDF as the RVs $X(t_1+\tau), X(t_2+\tau), \ldots, X(t_k+\tau)$. Since these RVs are jointly Gaussian, it suffices to show that the mean vectors and covariance matrices are the same.
If $X(t)$ is WSSP $\Rightarrow \mu_x(t_i) = \mu_x(t_j) = \mu_x = $ constant $\forall\, i, j$,
and $C_{xx}(t_i + \tau, t_j + \tau) = C_{xx}(t_i, t_j) = C_{xx}(t_i - t_j) \;\forall\, i, j$
$\Rightarrow$ the mean vector and covariance matrix of $X(t_1), X(t_2), \ldots, X(t_k)$ are the same as the mean vector and covariance matrix of $X(t_1+\tau), X(t_2+\tau), \ldots, X(t_k+\tau)$.

14.10 Summary:
The power spectral density $S_{xx}(\omega)$ is a non-negative, even, real and continuous function of $\omega$; consequently, a function $R_{xx}(\tau)$ whose transform violates these properties cannot be the autocorrelation function of a WSSP $X(t)$. White noise is a random function $N(t)$ whose power spectral density $S_{nn}(\omega)$ is constant for all frequencies $\omega$: $S_{nn}(\omega) = \frac{N_0}{2}$ is constant $\forall\, \omega$. A Discrete Time Stochastic Process (DTSP) is called white noise if the random variables $X(n_k)$ are uncorrelated. The discrete-time PSD $S_{xx}(\Omega)$ is periodic with period $2\pi$, even in $\Omega$, and real. A DTSP is obtained by sampling a CTSP $X(t)$: if the CTSP $X(t)$ is sampled at constant intervals of $T_s$ time units ($T_s$ being the sampling period), the samples define the DTSP $X(n)$.
If $X(t)$ is a WSSP in continuous time then $X(n)$ is also WSSP in discrete time, with $\mu_x(n) = \mu_x = $ constant and $R_{xx}(m) = R_{xx}(mT_s)$.
For weak stationarity, the mean function does not change with shifts in time and is independent of time, $E[X(t_1)] = E[X(t_2)]$, i.e. $\mu_x(t_1) = \mu_x(t_2) = $ constant; and the autocorrelation function does not change with shifts in time and is independent of time, $E[X(t_1)X(t_2)] = E[X(t_1+\tau)X(t_2+\tau)]$.
$R_{xx}(\tau)$ takes its maximum value at $\tau = 0$, that is, $X(t+\tau)$ and $X(t)$ have the highest correlation at $\tau = 0$. The cyclo-stationary process has a periodic structure: the statistical properties are repeated every $T_p$ units of time, i.e. the random variables $X(t_1), X(t_2), \ldots, X(t_n)$ have the same joint CDF as the RVs $X(t_1 + T_p), X(t_2 + T_p), \ldots, X(t_n + T_p)$. For a Gaussian SP, weak stationarity and strong stationarity (SSSP) are equivalent.
Chapter 15

Time Series Analysis: Intermediate

15.1 Reviewing Limit Theorems


In this section we shall consider the limiting behaviour of a sequence of random
variables given by {zn }. We can discuss the various modes of convergence in a
step wise manner.
• A sequence of random variables $\{z_n\}$ converges in probability to a constant $\alpha$ if for any $\epsilon > 0$ we have:
$$\lim_{n\to\infty} Prob(|z_n - \alpha| > \epsilon) = 0 \quad (15.1)$$
Note that the constant $\alpha$ is called the probability limit of $z_n$, which can also be written in the following notations:
$$plim_{n\to\infty}\, z_n = \alpha \quad (15.2)$$
$$z_n \xrightarrow{P} \alpha \quad (15.3)$$
• A sequence of random scalars $\{z_n\}$ converges almost surely to a constant $\alpha$ if we have:
$$Prob\left(\lim_{n\to\infty} z_n = \alpha\right) = 1 \quad (15.4)$$

This is a stronger condition than the convergence in probability. That is, if a


sequence of random variables converges almost surely, then it converges in
probability as well.
• A sequence of random variables $\{z_n\}$ converges in mean square to $\alpha$ if we have:
$$\lim_{n\to\infty} E[(z_n - \alpha)^2] = 0 \quad (15.5)$$

• Now in all of the above scenarios our sequence was converging to a constant value; however, convergence holds for a target random variable as well. We can say that a sequence of random variables $\{z_n\}$ converges to a random variable $z$ if:
$$\{z_n - z\} \xrightarrow{P} 0 \quad (15.6)$$
$$z_n \xrightarrow{P} z \quad (15.7)$$


• Let $\{z_n\}$ be a sequence of random variables and let $F_n$ be the cumulative distribution function of $z_n$. We can say that the sequence $\{z_n\}$ converges in distribution to a random variable $z$ if the CDF $F_n$ of $z_n$ converges to the CDF $F$ of $z$ at every continuity point of $F$. This condition can be written as:
$$z_n \xrightarrow{D} z \quad (15.8)$$
Note additionally that $F$ is known as the asymptotic distribution of $z_n$.

15.2 Fundamentals of Time Series


In time series analysis we will mostly be dealing with stochastic processes. A
Stochastic process is basically a sequence of random variables. Now if the index
for the random variables is interpreted as representing time, then what we have
is essentially a time series. Further we note that if {zi }(i = 1, 2, · · · ) is a stochas-
tic process, its realization is an assignment for each i of a possible value of zi .
Therefore a realization of $\{z_i\}$ is essentially a sequence of real numbers. A fundamental point to note about time series is that we observe the realization of the stochastic process underlying the time series only once.
As an example, consider the annual inflation rate of some country between 1950
and 2000. This would essentially be a list of 50 numbers or values which would
form one possible outcome for the stochastic process underlying the inflation rate
variable. If history took a different course, we would have had a different sample
of realizations of the same stochastic process. Now if we could observe historical
values many times, we could assemble many samples, each containing a different
list of 50 inflation rate values. Note that in this case the mean inflation rate for say
1950 would be the mean inflation rate for 1950 across all the historical samples.
This kind of a population mean is called the ensemble mean and is defined as the
average across all possible different states of nature at a given time period.
While it is obviously not possible to observe alternate histories, if we make the
assumption that the distribution of the inflation remains unchanged, that is the
set of 50 values we observe are all assumed to have come from the same distribu-
tion, then we are essentially making a stationarity assumption. Further we state
ergodicity as the level of persistence in the process, that is the extent to which
each element will contain some information not available in other elements.

15.2.1 Stationary processes


A stochastic process {zi }(i = 1, 2, · · · ) is said to be strictly stationary if for any
given finite integer r and for a set of subscripts: i1 , i2 , · · · , ir , the joint distribution
of (zi , zi1 , zi2 , · · · , zir ) depends only on: (i1 − i, i2 − i, · · · , ir − i) and not on i.
What this basically means is that the length of the time period lag is what defines the distributional features, and not the start or end of the lag. For example, the distribution of $(z_1, z_5)$ is the same as that of $(z_{12}, z_{16})$. The distribution of $z_i$ does not depend on the absolute position of $i$ but rather on its relative position. We can infer

from this statement that the mean, variance and other higher moments remain the
same across all i. Now we note some important definitions within this framework.
• A sequence of independently and identically distributed random variables
is a stationary process that exhibits no serial dependence.
• There are many aggregate time series such as GDP that are not stationary
because they exhibit time trends. Further we note that many time trends can
be reduced to stationary processes. A process is called trend stationary if it
becomes stationary after subtracting from it a linear function of time. Also,
if a process is non stationary but its first difference zi −zi−1 is stationary, then
the sequence {zi } is called difference stationary.
• A stochastic process is said to be weakly (covariance) stationary if:

$E(z_i)$ does not depend on $i$, and

$Cov(z_i, z_{i-j})$ depends only on the index $j$ and not on $i$.

The $j$th order autocovariance, denoted by $\Gamma_j$, is defined as:
$$\Gamma_j = Cov(z_i, z_{i-j}) \quad (15.9)$$
Further we note that $\Gamma_j$ does not depend on $i$ due to covariance stationarity. Another condition thus satisfied is:
$$\Gamma_j = \Gamma_{-j} \quad (15.10)$$
We can say that the 0th order autocovariance is nothing but the variance, given by:
$$\Gamma_0 = Var(z_i) \quad (15.11)$$
The corresponding notation for scalar quantities is:
$$\gamma_j = \gamma_{-j} \quad (15.12)$$

If we take a string of $n$ successive values of the stochastic process $(z_i, z_{i+1}, \cdots, z_{i+n-1})$, then by covariance stationarity the $(n \times n)$ covariance matrix is the same as that of $(z_1, z_2, \cdots, z_n)$ and is given by:
$$Var(z_i, z_{i+1}, \cdots, z_{i+n-1}) = \begin{bmatrix} \gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{n-1} \\ \gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{n-2} \\ \vdots & \cdots & \cdots & \ddots & \vdots \\ \gamma_{n-2} & \cdots & \gamma_1 & \gamma_0 & \gamma_1 \\ \gamma_{n-1} & \cdots & \gamma_2 & \gamma_1 & \gamma_0 \end{bmatrix} \quad (15.13)$$

Finally, the $j$th order autocorrelation coefficient is given by:
$$\rho_j = \frac{\gamma_j}{\gamma_0} = \frac{Cov(z_i, z_{i-j})}{Var(z_i)} \quad (15.14)$$
The plot of $\{\rho_j\}$ against the time index $j$ is called a correlogram (a short computational sketch follows this list).

• Another important class of weakly stationary processes is the white noise process, which is a process with zero mean and no serial correlation.

A covariance stationary process $\{z_i\}$ is white noise if
$$E(z_i) = 0 \quad \text{and} \quad Cov(z_i, z_{i-j}) = 0 \text{ for } j \ne 0$$
Additionally we note that an independently and identically distributed sequence with zero mean and finite variance is called an independent white noise process.
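As a computational aside (a sketch, not part of the original notes; the AR(1) series below is an arbitrary example whose theoretical autocorrelations are $0.7^j$): the sample autocorrelations $\hat{\rho}_j$ can be computed directly from definition (15.14), and plotting them against $j$ gives the correlogram.

import numpy as np

def sample_autocorr(z, max_lag):
    # sample version of rho_j = gamma_j / gamma_0 for j = 0, ..., max_lag
    z = np.asarray(z, dtype=float) - np.mean(z)
    gamma0 = np.mean(z * z)
    return np.array([np.mean(z[:len(z) - j] * z[j:]) / gamma0
                     for j in range(max_lag + 1)])

rng = np.random.default_rng(0)
eps = rng.normal(size=5_000)
y = np.zeros(5_000)
for t in range(1, 5_000):            # AR(1): y_t = 0.7 y_{t-1} + eps_t
    y[t] = 0.7 * y[t - 1] + eps[t]
print(sample_autocorr(y, 5))         # roughly 1, 0.7, 0.49, 0.34, ...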

15.3 Difference Equations

Time series analysis is concerned with the dynamic consequences of events over time. We will first consider a dynamic equation that relates the value of a variable $y$ on date $t$ to its value on the previous date and some other value $w_t$. This gives us a linear first order difference equation:
$$y_t = \phi y_{t-1} + w_t \quad (15.15)$$

A difference equation is nothing but an expression that relates a variable yt to its


previous values and the above equation is first order because only the first lag of
the variable (yt−1 ) appears in the equation. Now we will try to find out that in this
dynamic system, how does y change as we change the values of w. Now before
solving a difference equation we will assume that the same equation stands for all
dates t. Hence we can now write down the difference equation for each date.
Date 0: y0 = φy−1 + w0
Date 1: y1 = φy0 + w1
Date 2: y2 = φy1 + w2
Date t: yt = φyt−1 + wt
Now, if we are aware of the starting value of $y$, that is $y_{-1}$, and we are also aware of all the $t$ values of $w$, then it is possible for us to simulate this system to obtain all values of $y$. Getting the value of $y_1$ as follows:
$$y_1 = \phi y_0 + w_1 = \phi(\phi y_{-1} + w_0) + w_1 \quad (15.16)$$
$$y_1 = \phi^2 y_{-1} + \phi w_0 + w_1 \quad (15.17)$$


Similarly we can calculate for y2 as follows:

y2 = φy1 + w2 (15.18)

y2 = φ(φ2 y−1 + φw0 + w1 ) + w2 (15.19)


y2 = φ3 y−1 + φ2 w0 + φw1 + w2 (15.20)
Continuing in this manner of recursive substitution, we get the value of $y_t$ in terms of the initial value of $y$ and the history of values that $w$ takes on:
$$y_t = \phi^{t+1} y_{-1} + \phi^t w_0 + \phi^{t-1} w_1 + \cdots + \phi w_{t-1} + w_t \quad (15.21)$$
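This recursion is straightforward to simulate (a sketch; the value of $\phi$, the initial value and the shocks below are arbitrary): computing $y$ by direct recursion and by the closed-form expression (15.21) gives the same path.

import numpy as np

phi, y_init = 0.6, 2.0                    # arbitrary phi and y_{-1}
w = np.array([0.5, -1.0, 0.3, 0.0, 1.2])  # arbitrary shock history w_0 ... w_4
T = len(w)

# direct recursion: y_t = phi * y_{t-1} + w_t
y = np.empty(T)
prev = y_init
for t in range(T):
    y[t] = phi * prev + w[t]
    prev = y[t]

# closed form: y_t = phi^{t+1} y_{-1} + sum_{k=0}^{t} phi^{t-k} w_k
closed = [phi ** (t + 1) * y_init + sum(phi ** (t - k) * w[k] for k in range(t + 1))
          for t in range(T)]
print(y)
print(np.array(closed))   # identical to the recursion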



15.4 Lag Operators


A time series is a collection of observations indexed by the date of the observations.
We can write this collection of random variables as:

(y1 , y2 , · · · , yT ) (15.22)

Now we note that a time series operator typically transforms one type of series into
another type of time series and one such popular operator is the lag operator. This
lag operator basically gives the previous values of a variable at a particular date.
It is represented by L and its operation is shown as:

Lxt = xt−1 (15.23)

L(Lxt ) = L(xt−1 ) = L2 (xt ) = xt−2 (15.24)


As a general rule we can define the lag operator's functionality as:
$$L^k x_t = x_{t-k} \quad (15.25)$$

15.4.1 First order difference equation


Let us first start with a first order difference equation given by:

yt = φyt−1 + wt (15.26)

This can be written using the lag operator as:

yt = φLyt + wt (15.27)

yt − φLyt = wt → (1 − φL)yt = wt (15.28)


Now we will multiply both sides of this equation with the following operator:

(1 + φL + φ2 L2 + · · · + φt Lt ) (15.29)

This would give the main equation as:

(1 + φL + φ2 L2 + · · · + φt Lt )(1 − φL)yt = (1 + φL + φ2 L2 + · · · + φt Lt )wt (15.30)

Consider only the LHS compound operator and expand the operator in brackets to
get:
(1 + φL + φ2 L2 + · · · + φt Lt )(1 − φL) (15.31)
= (1 + φL + φ2 L2 + · · · + φt Lt ) − (1 + φL + φ2 L2 + · · · + φt Lt )φL (15.32)
2 2 t t 2 2 3 3 t+1 t+1
= (1 + φL + φ L + · · · + φ L ) − (φL + φ L + φ L + · · · + φ L ) (15.33)
= (1 − φt+1 Lt+1 ) (15.34)
Now we can substitute this compound operator back into our main equation (15.30) and obtain:
$$(1 - \phi^{t+1} L^{t+1})\, y_t = (1 + \phi L + \phi^2 L^2 + \cdots + \phi^t L^t)\, w_t \quad (15.35)$$
$$y_t - \phi^{t+1} y_{-1} = w_t + \phi w_{t-1} + \phi^2 w_{t-2} + \cdots + \phi^t w_0 \quad (15.36)$$
$$y_t = \phi^{t+1} y_{-1} + w_t + \phi w_{t-1} + \cdots + \phi^t w_0 \quad (15.37)$$
An interesting point to notice is the behaviour of this compound operator as $t$ becomes large. We already saw by expanding out the LHS:
$$(1 + \phi L + \phi^2 L^2 + \cdots + \phi^t L^t)(1 - \phi L)\, y_t = y_t - \phi^{t+1} y_{-1} \quad (15.38)$$
Now note that if $t$ becomes very large and if $|\phi| < 1$, then the expression $\phi^{t+1} y_{-1}$ tends to 0. Hence we can think of the operator $(1 + \phi L + \phi^2 L^2 + \cdots + \phi^t L^t)$ as an approximation for the inverse of $(1 - \phi L)$, which would in turn satisfy the following condition:
$$(1 - \phi L)^{-1}(1 - \phi L) = 1 \quad (15.39)$$
With this kind of an operation over our difference equation we can essentially
write it in the form:
yt = (1 − φL)−1 wt (15.40)
yt = wt + φwt−1 + φ2 wt−2 + · · · (15.41)

15.4.2 Second order difference equation


Now consider a second order difference equation of the form:

yt = φ1 yt−1 + φ2 yt−2 + wt (15.42)

Writing this equation in the lag operator format we get:

(1 − φ1 L − φ2 L2 )yt = wt (15.43)

For a moment, let us only consider the lag operator polynomial in the LHS. We can
essentially factorize this polynomial by selecting numbers λ1 and λ2 such that:

(1 − φ1 L − φ2 L²) = (1 − λ1 L)(1 − λ2 L) = (1 − [λ1 + λ2]L + λ1 λ2 L²) (15.44)

Clearly in our search for these λ values we look to satisfy the following properties:

λ1 + λ2 = φ1 (15.45)

λ1 λ2 = −φ2 (15.46)
Now we need to ensure that the left hand side of the above equation equals the right hand side. With this, we can write out a corresponding polynomial in a scalar variable z that fulfils the same factorization.

(1 − φ1 z − φ2 z 2 ) = (1 − λ1 z)(1 − λ2 z) (15.47)

We must understand that replacing the lag operator with a scalar z allows us to find the values of z that set the RHS to 0. This makes our job easy, since any value of z that sets the RHS to 0 also sets the LHS to 0. The solutions are z1 = λ1^(−1) and z2 = λ2^(−1). Therefore we now have:
(1 − φ1 z − φ2 z 2 ) = 0 (15.48)
Finally, by the quadratic formula we can find the values of z that solve the above equation:

z1 = (φ1 − √(φ1² + 4φ2)) / (−2φ2) (15.49)

z2 = (φ1 + √(φ1² + 4φ2)) / (−2φ2) (15.50)
Note that with this result we make the statement that the difference equation
would be stable if the roots of the lag polynomial lie outside the unit circle.
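A short sketch (Python with numpy assumed; illustrative only) of how this stability condition can be checked numerically for given φ1, φ2:

```python
import numpy as np

# Roots of 1 - phi1*z - phi2*z^2 = 0; np.roots takes coefficients from
# the highest power down, so we pass [-phi2, -phi1, 1].
phi1, phi2 = 0.5, 0.3
roots = np.roots([-phi2, -phi1, 1.0])
print(roots)                       # approximately 1.17 and -2.84
print(np.all(np.abs(roots) > 1))   # True => roots outside unit circle => stable
```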

15.5 White noise


The fundamental building block of various time series processes is the white noise process, which is a sequence {εt}, t = −∞, . . . , ∞, whose elements have zero mean and finite variance:

E(εt) = 0 (15.51)

E(εt²) = σ² (15.52)

Also the successive values are uncorrelated across time:

E(εt ετ) = 0 for t ≠ τ (15.53)

A white noise process whose elements are independent and follow a normal distribution is called a Gaussian white noise process:

εt ∼ N(0, σ²) (15.54)

15.6 Moving average process


Let {εt} be a white noise sequence; then we can define a process as:

Yt = µ + εt + θεt−1 (15.55)

This is known as a first order moving average process, denoted MA(1). We call this a moving average because Y is constructed from a weighted average of the two most recent white noise terms. The expectation of Y is given by:

E(Yt) = E(µ + εt + θεt−1) = µ (15.56)

We note that the constant term included in the process equation is actually the
mean. The variance of Y is given by:

E(Yt − µ)² = E(εt + θεt−1)² (15.57)

= E(εt² + θ²εt−1² + 2θεt εt−1) (15.58)

= (1 + θ²)σ² (15.59)
The first autocovariance is given by:

E(Yt − µ)(Yt−1 − µ) = E(εt + θεt−1)(εt−1 + θεt−2) (15.60)

= E(εt εt−1 + θεt εt−2 + θεt−1² + θ² εt−1 εt−2) (15.61)

= θσ² (15.62)
We further note that the higher autocovariances are all zero. Note that since the
mean and variance of M A(1) are not dependent on time, they are considered to be
covariance stationary. The j th autocorrelation of a covariance stationary process
is thus given by:
ρj = γj / γ0 (15.63)

Corr(Yt, Yt−j) = Cov(Yt, Yt−j) / (√Var(Yt) √Var(Yt−j)) = γj / (√γ0 √γ0) = ρj (15.64)

We can write the first autocorrelation of M A(1) as follows:

ρ1 = θσ² / ((1 + θ²)σ²) = θ / (1 + θ²) (15.65)

Note that all the higher autocorrelations will be zero in this case.
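The following minimal simulation (Python/numpy assumed; sample_acf is our own helper, not from the original notes) checks ρ1 = θ/(1 + θ²) and ρj ≈ 0 for j ≥ 2:

```python
import numpy as np

# Simulate an MA(1) process Y_t = eps_t + theta*eps_{t-1} (with mu = 0)
# and compare the sample first autocorrelation with theta/(1 + theta^2).
rng = np.random.default_rng(1)
theta, T = 0.5, 200_000
eps = rng.normal(size=T + 1)
y = eps[1:] + theta * eps[:-1]

def sample_acf(x, k):
    xbar = x.mean()
    g0 = np.mean((x - xbar) ** 2)
    return np.mean((x[:-k] - xbar) * (x[k:] - xbar)) / g0

print(sample_acf(y, 1), theta / (1 + theta**2))   # both close to 0.4
print(sample_acf(y, 2))                           # close to 0
```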

15.6.1 MA(q)
The q th order moving average process can be described as:

Yt = µ + εt + θ1 εt−1 + θ2 εt−2 + · · · + θq εt−q (15.66)

Mean of this process is given by:

E(Yt) = E(µ) + E(εt) + θ1 E(εt−1) + θ2 E(εt−2) + · · · + θq E(εt−q) = µ (15.67)

Variance of the M A(q) process is given by:

γ0 = E(Yt − µ)² = E(εt + θ1 εt−1 + · · · + θq εt−q)² (15.68)

Now since the white noise terms are uncorrelated, we can write the variance as:

γ0 = σ² + θ1²σ² + θ2²σ² + · · · + θq²σ² = (1 + θ1² + θ2² + · · · + θq²)σ² (15.69)

Coming to the computation of the j th lag covariance and dropping out the cross
product terms of the white noise terms (since they are uncorrelated and will re-
solve to zero), we get:

γj = E[(εt + θ1 εt−1 + · · · + θq εt−q)(εt−j + θ1 εt−j−1 + · · · + θq εt−j−q)] (15.70)

= E(θj εt−j² + θj+1 θ1 εt−j−1² + · · · + θq θq−j εt−q²) (15.71)


Therefore finally we can get the covariance function as:
γj = [θj + θj+1 θ1 + · · · + θq θq−j]σ² if j ≤ q, and γj = 0 if j > q. (15.72)
We can note the corresponding values for the M A(2) process:
γ0 = [1 + θ1² + θ2²]σ² (15.73)

γ1 = [θ1 + θ2 θ1]σ² (15.74)

γ2 = θ2 σ² (15.75)

γ3 = γ4 = · · · = 0 (15.76)
In this case, the autocorrelation function is 0 after q lags.

15.7 Infinite order moving average


The previously defined M A(q) process can be written as:
Yt = µ + Σ_{j=0}^q θj εt−j (15.77)

with θ0 = 1. Now we consider the process that results when q → ∞. This can be shown as:

Yt = µ + Σ_{j=0}^∞ ψj εt−j = µ + ψ0 εt + ψ1 εt−1 + · · · (15.78)

This is essentially an M A(∞) process. We further note that this infinite sequence
ensures a covariance stationary process provided that:
Σ_{j=0}^∞ |ψj| < ∞ (15.79)

Note that a sequence of numbers satisfying the above condition is said to be ab-
solutely summable. We can now calculate the mean and autocovariances of an
M A(∞) process with absolutely summable coefficients.
E(Yt) = lim_{T→∞} E(µ + ψ0 εt + ψ1 εt−1 + · · · + ψT εt−T) = µ (15.80)

γ0 = E(Yt − µ)² = lim_{T→∞} E(ψ0 εt + ψ1 εt−1 + · · · + ψT εt−T)² (15.81)

= lim_{T→∞} (ψ0² + ψ1² + · · · + ψT²)σ² (15.82)

γj = E(Yt − µ)(Yt−j − µ) (15.83)

= σ²(ψj ψ0 + ψj+1 ψ1 + · · · ) (15.84)
An important point to note is that absolutely summable MA(∞) coefficients imply absolutely summable autocovariances as well:

Σ_{j=0}^∞ |γj| < ∞ (15.85)

15.8 Autoregressive process


A first order autoregression AR(1) is defined by the following equation:

Yt = c + φYt−1 + εt (15.86)

where again {εt} is a white noise process. In earlier sections, when we looked at the analysis of difference equations, we learnt that if |φ| ≥ 1 then the effects of the ε terms on Y accumulate rather than die out over time, in which case a covariance stationary process would not exist. If however |φ| < 1, then we have a covariance stationary process characterized by the following stable equation. Note that this is the same equation we obtained after recursively substituting a general difference equation (here wt = c + εt):

Yt = (c + εt) + φ(c + εt−1) + φ²(c + εt−2) + · · · (15.87)

= [c/(1 − φ)] + εt + φεt−1 + φ²εt−2 + · · · (15.88)


Look closely and we would notice that this is actually an M A(∞) process with
ψj = φj . Now when we incorporate the stationarity condition that |φ| < 1 then we
would satisfy:
Σ_{j=0}^∞ |ψj| = Σ_{j=0}^∞ |φ|^j (15.89)

This would end up being equal to [1/(1 − |φ|)] since it would essentially be a
geometric series of partial sums. Mean of the AR(1) process can be represented
as:
E(Yt) = c / (1 − φ) (15.90)
The variance can be represented as:

γ0 = E(Yt − µ)² (15.91)

= E(εt + φεt−1 + φ²εt−2 + · · · )² (15.92)

= (1 + φ² + φ⁴ + · · · )σ² = σ² / (1 − φ²) (15.93)
The j th autocovariance can be shown by:

γj = E(Yt − µ)(Yt−j − µ) (15.94)

= E(εt + φεt−1 + · · · + φ^j εt−j + φ^(j+1) εt−j−1 + · · · )(εt−j + φεt−j−1 + · · · ) (15.95)

= [φ^j + φ^(j+2) + φ^(j+4) + · · · ]σ² (15.96)

= φ^j (1 + φ² + φ⁴ + · · · )σ² (15.97)

= φ^j σ² / (1 − φ²) (15.98)
Therefore we can now write the autocorrelation function as:
ρj = γj / γ0 = φ^j (15.99)

This autocorrelation function follows a pattern of geometric decay. We can make interpretations like: the impact of a unit increase in εt on Yt+j is equal to the correlation between Yt and Yt+j. Note that, as seen before, we have derived the moments of the AR(1) process by viewing it as an MA(∞) process; however, we can also find those moments directly.
E(Yt) = c + φE(Yt−1) + E(εt) (15.100)

Note that since we have assumed covariance stationarity, we would also have:

E(Yt) = E(Yt−1) = µ (15.101)

Therefore:

µ = c / (1 − φ) (15.102)
In a similar manner we can obtain the second moment in the following way:
Yt = µ(1 − φ) + φYt−1 + εt (15.103)

Yt − µ = φ(Yt−1 − µ) + εt (15.104)
Now we can square both sides and take expectations:

E(Yt − µ)² = φ²E(Yt−1 − µ)² + E(εt²) + 2φE[εt(Yt−1 − µ)] (15.105)

Note an interesting point: (Yt−1 − µ) is essentially a function of εt−1 and earlier shocks.

Yt−1 − µ = εt−1 + φεt−2 + · · · (15.106)

Since εt is not correlated with those earlier values of ε, it is certainly not correlated with (Yt−1 − µ) either. Hence we can safely say that:

E[(Yt−1 − µ)εt] = 0 (15.107)

Also, assuming covariance stationarity, we can say that:

E(Yt − µ)² = E(Yt−1 − µ)² = γ0 (15.108)
Our main equation would then resolve to:
γ0 = φ²γ0 + 0 + σ² (15.109)

γ0 = σ² / (1 − φ²) (15.110)
To obtain the j-th order autocovariance we multiply equation (15.104) by (Yt−j − µ) and take expectations:

E[(Yt − µ)(Yt−j − µ)] = φE[(Yt−1 − µ)(Yt−j − µ)] + E[εt(Yt−j − µ)] (15.111)

Again by the previous logic, the term (Yt−j − µ) is a function of εt−j and earlier shocks, and hence is uncorrelated with εt. We can also write the following:

E[(Yt−1 − µ)(Yt−j − µ)] = E[(Yt−1 − µ)(Y(t−1)−(j−1) − µ)] = γj−1 (15.112)

Therefore our main equation becomes:

γj = φγj−1 (15.113)

γj = φ^j γ0 (15.114)
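A small simulation sketch (Python/numpy assumed; sample_autocov is our own helper) verifying γ0 = σ²/(1 − φ²) and γj = φ^j γ0:

```python
import numpy as np

# Simulate a stationary AR(1) and check the theoretical autocovariances.
rng = np.random.default_rng(2)
phi, sigma, T = 0.7, 1.0, 200_000
eps = rng.normal(scale=sigma, size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + eps[t]

def sample_autocov(x, k):
    xbar = x.mean()
    if k == 0:
        return np.mean((x - xbar) ** 2)
    return np.mean((x[:-k] - xbar) * (x[k:] - xbar))

g0 = sigma**2 / (1 - phi**2)
for j in range(4):
    print(j, round(sample_autocov(y, j), 3), round(phi**j * g0, 3))
```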

15.8.1 AR(2)
The second order autoregressive function can be written as:

Yt = c + φ1 Yt−1 + φ2 Yt−2 + εt (15.115)

We can write this equation in the lag format as follows:

(1 − φ1 L − φ2 L²)Yt = c + εt (15.116)

From our earlier discussions regarding difference equations and their stability, we
can say that this equation is stable if the roots of the characteristic polynomial lie
outside the unit circle. It is only when this condition is satisfied that we can say
that the AR(2) process is covariance stationary.

(1 − φ1 z − φ2 z 2 ) = 0 (15.117)

Note that the inverse of this autoregressive operator can be written as:

ψ(L) = (1 − φ1 L − φ2 L2 )−1 = ψ0 + ψ1 L + ψ2 L2 + · · · (15.118)

Now multiplying both sides of our main equation with this function we get:

Yt = ψ(L)c + ψ(L)εt (15.119)

Also since c is a constant the operator premultiplied with c would simply become:
ψ(L)c = c / (1 − φ1 − φ2) (15.120)
Additionally we also have the condition that:

Σ_{j=0}^∞ |ψj| < ∞ (15.121)

Now, since we have effectively resolved our AR(2) process into an MA(∞) process, as is evident from (15.119), we can state the mean of the process as:

µ = c / (1 − φ1 − φ2) (15.122)
Now we can find the second moment by rewriting the main equation as:

Yt = µ(1 − φ1 − φ2) + φ1 Yt−1 + φ2 Yt−2 + εt (15.123)

Yt − µ = φ1(Yt−1 − µ) + φ2(Yt−2 − µ) + εt (15.124)


Multiplying both sides by (Yt−j − µ) and taking expectations we would get:

γj = φ1 γj−1 + φ2 γj−2 (15.125)

Consequently we can find the autocorrelations by dividing throughout with γ0 and


obtain:
ρj = φ1 ρj−1 + φ2 ρj−2 (15.126)

Now for j = 1 we would get:

ρ1 = φ1 + φ2 ρ1 (15.127)

ρ1 = φ1 / (1 − φ2) (15.128)

For j = 2 we would get:

ρ2 = φ1 ρ1 + φ2 (15.129)
To get the variance, we can multiply equation (15.124) by (Yt − µ) and take expectations:

E[Yt − µ]² = φ1 E[(Yt − µ)(Yt−1 − µ)] + φ2 E[(Yt − µ)(Yt−2 − µ)] + E[εt(Yt − µ)] (15.130)

Note what the last term in this equation resolves to:

E[εt(Yt − µ)] = E[εt(φ1(Yt−1 − µ) + φ2(Yt−2 − µ) + εt)] = σ² (15.131)

The final equation then becomes:

γ0 = φ1 ρ1 γ0 + φ2 ρ2 γ0 + σ² (15.132)

Finally we get the variance as follows:

γ0 = (1 − φ2)σ² / [(1 + φ2){(1 − φ2)² − φ1²}] (15.133)
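As a hedged numerical check (Python/numpy assumed; illustrative parameter values) of this variance formula:

```python
import numpy as np

# Simulate a stationary AR(2) and compare the sample variance with
# gamma_0 = (1 - phi2)*sigma^2 / ((1 + phi2)*((1 - phi2)^2 - phi1^2)).
rng = np.random.default_rng(3)
phi1, phi2, sigma, T = 0.5, 0.3, 1.0, 200_000
eps = rng.normal(scale=sigma, size=T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]

g0 = (1 - phi2) * sigma**2 / ((1 + phi2) * ((1 - phi2) ** 2 - phi1**2))
print(np.var(y[1000:]), g0)   # both approximately 2.24 (burn-in dropped)
```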

15.8.2 AR(p)
An autoregressive process of the p-th order can be written as:

Yt = c + φ1 Yt−1 + φ2 Yt−2 + · · · + φp Yt−p + εt (15.134)

To ensure stationarity, the roots of the characteristic polynomial must lie outside the unit circle:

1 − φ1 z − φ2 z² − · · · − φp z^p = 0 (15.135)
After applying the inverse of the characteristic lag operator polynomial, we can obtain the covariance stationary representation of this process as follows:

Yt = µ + ψ(L)εt (15.136)

ψ(L) = (1 − φ1 L − φ2 L² − · · · − φp L^p)^(−1) (15.137)


Note that the stationarity condition is as follows:

Σ_{j=0}^∞ |ψj| < ∞ (15.138)

We can take expectations on the main equation to get the mean as follows:

µ = c + φ1 µ + φ2 µ + · · · + φp µ (15.139)

µ = c / (1 − φ1 − φ2 − · · · − φp) (15.140)
Writing the main autoregressive equation in mean deviation form we obtain:

Yt − µ = φ1(Yt−1 − µ) + φ2(Yt−2 − µ) + · · · + φp(Yt−p − µ) + εt (15.141)

Now if we multiply both sides by Yt−j − µ we would essentially obtain the autoco-
variance functions as:

γj = φ1 γj−1 + φ2 γj−2 + · · · + φp γj−p (15.142)

Note that for j = 0 we will essentially get the variance as:

γ0 = φ1 γ1 + φ2 γ2 + · · · + φp γp + σ 2 (15.143)

15.9 ARMA(p,q)
First we note that ARM A(p, q) is a process that includes both autoregressive and
moving average terms.

Yt = c + φ1 Yt−1 + φ2 Yt−2 + · · · + φp Yt−p + εt + θ1 εt−1 + · · · + θq εt−q (15.144)

We can write the equation in lag operator notation as:

(1 − φ1 L − φ2 L² − · · · − φp L^p)Yt = c + (1 + θ1 L + θ2 L² + · · · + θq L^q)εt (15.145)

Our precondition for stationarity in the ARM A process is essentially the same
condition as the AR process and its stationarity essentially depends on the AR
parameters.
(1 − φ1 z − φ2 z 2 − · · · − φp z p ) = 0 (15.146)
The above equation should ideally have roots that lie outside the unit circle for the
equation system to be stable and for stationarity to exist. We would now divide
both sides of the main equation by (1 − φ1 L − φ2 L2 − · · · − φp Lp ) to get :

Yt = µ + ψ(L)εt (15.147)

where we have:

ψ(L) = (1 + θ1 L + θ2 L² + · · · + θq L^q) / (1 − φ1 L − φ2 L² − · · · − φp L^p) (15.148)
We would ultimately obtain the process mean as:
µ = c / (1 − φ1 − φ2 − · · · − φp) (15.149)

The mean is the same as for the AR(p) process. Now we can write the ARM A in
terms of mean deviations to get:

Yt − µ = φ1(Yt−1 − µ) + φ2(Yt−2 − µ) + · · · + φp(Yt−p − µ) + εt + θ1 εt−1 + · · · + θq εt−q (15.150)



Premultiplying this equation with (Yt−j − µ) we would get the covariance function
as:
γj = φ1 γj−1 + φ2 γj−2 + · · · + φp γj−p (15.151)
A word of caution here: the above covariances hold only for j > q. It is only after q lags that the autocovariance function of the ARMA process follows the same pattern as the AR(p) process. The autocovariance function does not hold for j ≤ q because of the correlation between θj εt−j and Yt−j.

15.10 Invertibility
We will now define invertibility for an MA(1) process. Consider the MA(1) as follows:

Yt − µ = (1 + θL)εt (15.152)

We note that the white noise terms are uncorrelated and have constant variance σ². Provided that |θ| < 1, we can multiply both sides by (1 + θL)^(−1) to obtain:

(1 − θL + θ²L² − θ³L³ + · · · )(Yt − µ) = εt (15.153)
We note that the above equation can essentially be viewed as an AR(∞) repre-
sentation. If a moving average process can be written as an infinite autoregressive
process by simply inverting the moving average operator (1+θL), then the moving
average process is said to be invertible. In a similar manner we can also define
invertibility for an M A(q) process as well.
Yt − µ = (1 + θ1 L + θ2 L² + · · · + θq L^q)εt (15.154)
Now provided that the roots of the characteristic polynomial lie outside the unit
circle, the invertibility condition would be valid.
1 + θ1 z + θ2 z² + · · · + θq z^q = 0 (15.155)
Therefore we can now invert the moving average operator :
(1 + η1 L + η2 L2 + · · · ) = (1 + θ1 L + θ2 L2 + · · · + θq Lq )−1 (15.156)
Multiplying on both sides we would get:
(1 + η1 L + η2 L² + · · · )(Yt − µ) = εt (15.157)
The above equation is essentially an AR(∞) process.

15.11 Time Series


Consider that we have collected GDP data for some period as our primary sample data. As a first step, the natural log transformations of the GDP values are plotted along time to see the general trend of this variable. Further suppose that we see an upward trend, with some fluctuations along the way. The key insight is that we might want to predict the trajectory of this curve beyond our sample period, and for that we would need to know the statistical or stochastic mechanism, the data generating process, that generated these curves.

15.12 Stochastic processes


A random or stochastic process is a collection of random variables indexed with
time. In terms of notation, we refer to a continuous random variable as Y (t) and
a discrete random variable as Yt . Letting Y represent GDP, our data would com-
prise the following collection of random variables: (Y1 , Y2 , · · · Y243 , Y244 ). Here the
subscript 1 refers to the first observation that is the GDP of the first quarter of the
first year of measurement (say 1950) and the subscript 244 refers to the last obser-
vation in our data. However an explanation is needed as to how exactly we can
think of this GDP measure as a random variable - Suppose that in one particular
instance the GDP for the first quarter of 1970 was $3, 758 billion. Now in theory we
say that at that particular time this GDP figure could have taken on many possible
values depending on various factors of states of nature. Therefore we say that the
figure $3758 is one particular realization of all such possibilities. Hence we can
say that the random variable representing GDP is actually a stochastic process and
the data we are looking at are the realizations of that stochastic process during the
particular time period.

15.12.1 Stationary stochastic process


A stochastic process is said to be stationary if its mean and variance are constant
over time and the covariance between two time periods depends only on the lag
between the time periods and not the actual value of time for which the covariance
may be required. This is a classic definition of a weakly stationary or covariance
stationary process. Let Yt be a stochastic time series with the following proper-
ties:
E(Yt ) = µ (15.158)
V ar(Yt ) = E(Yt − µ)2 = σ 2 (15.159)
γk = E[(Yt − µ)(Yt−k − µ)] (15.160)
We say that γk is the autocovariance of lag k, that is the covariance between Y
values k periods apart. Note that if k = 0 then we simply get the variance of the
process, for k = 1 we get γ1 which is the covariance between adjacent values.
Now an important point to note is that if we shift the origin of our GDP time series
from Yt to Yt+m (supposing this shift represents shifting the first quarter of 1950
to say the first quarter of 1960), then we will find that if Yt is stationary then the
mean, variance and autocovariances of Yt+m would be the same as that for Yt .
We can therefore state that if a time series is stationary then its mean, variance
and autocovariances at various lags are the same, regardless of what point in
time we measure them - these measures are time invariant.

Another important concept is that of mean reversion which essentially means


that a stationary time series will always fluctuate around its mean and tend to
return to it with a constant finite variance. The speed with which it reverts to
the mean value depends on the strength of autocovariances. Another special type

of stochastic process to make note of is the white noise process which is char-
acterized typically by having a zero mean, constant finite variance and serially
uncorrelated terms.

15.12.2 Nonstationary stochastic process


A classic example of a nonstationary process is the random walk model. Suppose
we have a white noise process characterized by ut having 0 mean and variance σ 2 .
In this case the series Yt would be a random walk if:

Yt = Yt−1 + ut (15.161)

Thus in this model the value of Yt is equal to the value of Yt−1 plus the effects of a
random shock. Using continuous substitution we can write:

Y1 = Y0 + u1 (15.162)

Y2 = Y1 + u2 = Y0 + u1 + u2 (15.163)
If the process started at some initial period t = 0 then we can write the time series as:

Yt = Y0 + Σ ut (15.164)

We can compute the mean and variance as follows:


E(Yt) = E(Y0 + Σ ut) = Y0 (15.165)

V ar(Yt ) = tσ 2 (15.166)
We can clearly notice that the variance of this series depends on time and hence
violates the stationarity condition. Further we note that a RWM with no intercept
is essentially a model without drift and here Y0 = 0, therefore E(Yt ) = 0. An
interesting feature of the RWM is the persistence of random shocks as is even
evident from equation (15.164). We can see here that Yt is the sum of the initial value Y0
plus the sum of various random shocks. We can say that the impact of a particular
shock does not die out. For example if we encounter u2 = 2 then every value of Yt
from Y2 onwards will be 2 units higher persistently hence the effect of this shock
does not die out. For this reason a RWM is said to have infinite memory. We note that the quantity Σ ut is known as a stochastic trend. Now if we write the RWM equation
as:
Yt − Yt−1 = ∆Yt = ut (15.167)
With ∆ being the first difference operator, we notice that while the series Yt is nonstationary, the series of its first differences is in fact stationary.
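A minimal illustration (Python/numpy assumed) of both points: the random walk's variance grows like tσ², and its first difference is just the white noise shocks:

```python
import numpy as np

# Many random-walk paths: the cross-path variance at date t grows like
# t*sigma^2, while differencing recovers the stationary shocks.
rng = np.random.default_rng(4)
sigma, T, n_paths = 1.0, 200, 5000
u = rng.normal(scale=sigma, size=(n_paths, T))
Y = np.cumsum(u, axis=1)                 # Y_t = u_1 + ... + u_t with Y_0 = 0

print(Y[:, 49].var(), 50 * sigma**2)     # variance at t = 50 is about t*sigma^2
dY = np.diff(Y, axis=1)
print(np.allclose(dY, u[:, 1:]))         # True: Delta Y_t = u_t is white noise
```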

15.12.3 Random walk with drift


We can write the RWM equation as:

Yt = δ + Yt−1 + ut (15.168)

Here δ is called the drift parameter. We can rewrite the equation as:

Yt − Yt−1 = ∆Yt = δ + ut (15.169)

This basically shows that Yt shifts upwards or downwards depending on the value
of δ. Computing the mean and variance for this model:

E(Yt ) = Y0 + tδ (15.170)

V ar(Yt ) = tσ 2 (15.171)
Even here we can see that the variance, as well as the mean, depends on time and hence violates the stationarity conditions. The RWM with and without drift are both nonstationary series.

15.12.4 Unit root process


We can write the RWM as:
Yt = ρYt−1 + ut (15.172)
Note that this model resembles an AR(1) model; with ρ = 1 we basically have a RWM without drift. In the case that ρ = 1 we face what is known as a unit root problem, which makes the series nonstationary. However if we have |ρ| < 1 then the series would be stationary.

15.13 TS and DS processes


Firstly we note that if the time trend (evolution of series with time) is a deter-
ministic function of time, then it is a deterministic trend and predictable and
otherwise it known as a stochastic trend. Consider the following time series:

Yt = β1 + β2 t + β3 Yt−1 + ut (15.173)

Where ut is a white noise process and t is time measured in chronological order.


With this we have the following possibilities:
• Pure random walk - If β1 = β2 = 0 and β3 = 1 we will get:

Yt = Yt−1 + ut (15.174)

This is essentially a nonstationary RWM without drift. But we note that:

∆Yt = Yt − Yt−1 = ut (15.175)

This is stationary. Hence we can say that a RWM without drift is a difference
stationary process (DSP).

• Random walk with drift - If we set β1 ≠ 0, β2 = 0 and β3 = 1 we would get:


Yt = β1 + Yt−1 + ut (15.176)
This is essentially a RWM with drift and is nonstationary. We can write this
as:
Yt − Yt−1 = ∆Yt = β1 + ut (15.177)
This is an example of a stochastic trend. This is also a DSP process since
the nonstationarity in Y has been effectively eliminated by taking first dif-
ferences.
• Deterministic trend - If we set β1 ≠ 0, β2 ≠ 0 and β3 = 0 we would get:
Yt = β1 + β2 t + ut (15.178)
This is an example of a trend stationary process (TSP). The mean of Yt is
not constant (β1 + β2 t) but the variance is constant σ 2 . If we subtract the
mean from Yt we would basically get a stationary series and hence this is
called trend stationary. This process of removing the deterministic trend is called detrending.
• Random walk with drift and deterministic trend - If β1 ≠ 0, β2 ≠ 0 and β3 = 1 we get:
Yt = β1 + β2 t + Yt−1 + ut (15.179)
Here we have a random walk with drift and a deterministic trend which is
further evident if we write the same equation as:
∆Yt = β1 + β2 t + ut (15.180)

• Deterministic trend with stationary AR(1) component - If β1 ≠ 0, β2 ≠ 0 and β3 < 1 then we have:
Yt = β1 + β2 t + β3 Yt−1 + ut (15.181)
This process is stationary around its deterministic trend.

15.14 Integrated stochastic process


We note that the RWM is a special case of a more general class of stochastic pro-
cesses called integrated processes. For a RWM without drift that is nonstationary
but whose first difference is stationary, we call it a RWM without drift, integrated of order 1, denoted I(1). In this manner, if we take the first difference of the first difference to make a process stationary, the original series is integrated of order 2. Note the following for I(2) (producing a stationary process) before moving further:

∆∆(Yt) = ∆(Yt − Yt−1) = ∆Yt − ∆Yt−1 = Yt − 2Yt−1 + Yt−2 (15.182)
In general, a time series integrated of order d is represented as Yt ∼ I(d). Note
that if a series is stationary it is effectively integrated with order 0 and is shown as
Yt ∼ I(0).

15.14.1 Properties of integrated series


Here are some important properties of integrated time series. We let Xt , Yt , Zt
represent three time series.
• If Xt ∼ I(0) and Yt ∼ I(1) then Zt = (Xt + Yt ) ∼ I(1).

• If Xt ∼ I(d) then Zt = (a + bXt ) ∼ I(d).

• If Xt ∼ I(d1 ) and Yt ∼ I(d2 ) then Zt = (aXt + bYt ) ∼ I(d2 ) where d1 < d2 .

• If Xt ∼ I(d) and Yt ∼ I(d) then Zt = (aXt + bYt ) ∼ I(d∗). Note that d∗ can
be equal to d or even less than it.

15.15 Spurious regression


Consider the following random walk models:

Yt = Yt−1 + ut (15.183)

Xt = Xt−1 + vt (15.184)
Suppose we generated 500 observations from ut ∼ N(0, 1) and 500 observations from vt ∼ N(0, 1). Assume that the initial values of both Xt and Yt are zero. Further assume that ut and vt are serially and mutually uncorrelated.
these series are nonstationary, are I(1) and exhibit stochastic trends. Now if we
regress Yt on Xt we should expect an R2 that tends to be 0 and no relation since
they are fundamentally uncorrelated processes. But if by chance we obtain a result
that gives a statistically significant coefficient, then that would be termed as a
spurious regression. A good rule of thumb to identify spurious regressions is if
R2 > d where d is the Durbin Watson statistic.
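A sketch of this experiment (Python/numpy assumed; the OLS is done by hand with lstsq) showing the R² > d rule of thumb in action:

```python
import numpy as np

# Two independent random walks, then OLS of Y on X: the regression is
# spurious, and R^2 exceeding the Durbin-Watson d flags it.
rng = np.random.default_rng(5)
T = 500
Y = np.cumsum(rng.normal(size=T))
X = np.cumsum(rng.normal(size=T))

A = np.column_stack([np.ones(T), X])                 # intercept + regressor
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
resid = Y - A @ beta
r2 = 1 - resid.var() / Y.var()
d = np.sum(np.diff(resid) ** 2) / np.sum(resid**2)   # Durbin-Watson statistic
print(round(r2, 3), round(d, 3), r2 > d)
```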

15.16 Tests of stationarity: ACF


A popular testing procedure for stationarity concerns the autocorrelation func-
tion. The ACF at lag k is given as:
γk
ρk = (15.185)
γ0
Now note that if we plot ρk against k we obtain a population correlogram. Since
we are effectively always dealing with realizations or samples of stochastic pro-
cesses we tend to compute the sample autocorrelation function denoted as ρ̂k . In
order to compute this, we first calculate the sample autocovariance γ̂k and the sample variance γ̂0:

γ̂k = Σ(Yt − Ȳ)(Yt+k − Ȳ) / n (15.186)

γ̂0 = Σ(Yt − Ȳ)² / n (15.187)

Here n is the sample size. Finally we can get the sample ACF at k lag as:
ρ̂k = γ̂k / γ̂0 (15.188)

A plot of sample ACF with k is called the sample correlogram. If the correlogram
of a time series hovers around zero and resembles that of a purely white noise cor-
relogram, it is probably a stationary series. On the other hand the correlogram of
a nonstationary random walk series will exhibit strong correlations upto large lag
lengths. In this case the autocorrelation coefficient starts at a very high value and
slowly declines towards 0 as the lag length increases. Next we address some pertinent questions: how do we choose the lag length over which to observe ACF patterns, and how do we decide whether the correlation coefficient at a particular lag is statistically significant?
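Before that, here is a minimal sketch (Python/numpy assumed; sample_acf is our own helper) of the sample correlogram for a white noise series versus a random walk, matching the patterns just described:

```python
import numpy as np

def sample_acf(x, max_lag):
    # rho_hat_k = gamma_hat_k / gamma_hat_0, with n in both denominators.
    n, xbar = len(x), x.mean()
    g0 = np.sum((x - xbar) ** 2) / n
    return np.array([np.sum((x[:n - k] - xbar) * (x[k:] - xbar)) / n / g0
                     for k in range(1, max_lag + 1)])

rng = np.random.default_rng(6)
wn = rng.normal(size=500)                 # white noise
rw = np.cumsum(rng.normal(size=500))      # random walk
print(np.round(sample_acf(wn, 5), 2))     # hovers around zero
print(np.round(sample_acf(rw, 5), 2))     # starts near 1, declines slowly
```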

15.16.1 Statistical significance of ACF


We typically judge the statistical significance of ρ̂ by its standard error. If a time
series is purely random then its sample autocorrelation function is approximately
distributed as:
ρ̂k ∼ N (0, 1/n) (15.189)
Basically in large samples the sample autocorrelation coefficients are normally
distributed with mean 0 and variance equal to the inverse of the sample size.
Suppose we are given the standard deviation to be 0.0640 then we can obtain the
95% population confidence interval for ρk as:

ρ̂k ± 1.96(0.0640) (15.190)

Prob(ρ̂k − 1.96(0.0640) < ρk < ρ̂k + 1.96(0.0640)) = 0.95 (15.191)


We essentially reject the null hypothesis that the true autocorrelation ρk is 0 if 0 is not contained in the confidence interval. Now instead of testing individual lag
autocorrelations, we can simultaneously test the joint hypothesis that all the ρk
values upto a certain lag length are zero. We can do this using the Q statistic test
of Box and Pierce:

Q = n Σ_{k=1}^m ρ̂k² (15.192)

where n is the sample size and m is the lag length. This test is usually used to
check if a given series is a white noise series or not. In large samples this Q test
statistic is approximately distributed as a chi square variable with m degrees of
freedom. Therefore if the computed Q exceeds the critical Q from the chi square
distribution at the chosen level of significance, then we reject the null hypothesis
that all the ρk are simultaneously zero. A variation of this test is the Ljung Box
test statistic which is given as:
LB = n(n + 2) Σ_{k=1}^m [ρ̂k² / (n − k)] ∼ χ²(m) (15.193)

15.17 The unit root test


The unit root test is basically a test of stationarity. Let us start with the following
series:
Yt = ρYt−1 + ut (15.194)
We know that if ρ = 1 then the above model would become a RWM without drift
which we know is a nonstationary process. Now the central idea behind the unit
root test is that we can essentially regress Yt on its lagged value Yt−1 and check the
statistical significance of the coefficient ρ. If it is statistically equal to 1 then we
conclude that the process is unit root nonstationary. Before actually performing
OLS on this, we will manipulate the equation as:

Yt − Yt−1 = ρYt−1 − Yt−1 + ut = Yt−1 (ρ − 1) + ut (15.195)

Now setting (ρ − 1) = δ we can write the above equation as:

∆Yt = δYt−1 + ut (15.196)

With this manipulation we are effectively testing the hypothesis that δ = 0 against
the alternative hypothesis that δ < 0. If δ = 0 therefore ρ = 1 we conclude that
our process is infact unit root nonstationary. Before proceeding another interesting
point to note is that if δ actually turns out to be 0 then our model would effectively
collapse to:
∆Yt = (Yt − Yt−1 ) = ut (15.197)
which implies our earlier point that the first differences of a RWM is a station-
ary random process. Now reiterating the problem - we will first regress the first
differences of Yt on Yt−1 and check if the estimated slope coefficient δ̂ is statistically equal to 0 or not. If it is zero we conclude the series to be nonstationary, and if it is negative we conclude it to be stationary. Note that we cannot use a t test
for the null hypothesis δ = 0 here because even in large samples the coefficient
does not asymptotically resolve to a normal distribution. Now Dickey and Fuller
showed that under the null hypothesis, t value of the coefficient of Yt−1 follows a
tau statistic. This Tau statistic test is known as the DF test. The various possible
scenarios under which this test might be applied are:

• Yt is a random walk: ∆Yt = δYt−1 + ut .

• Yt is a random walk with drift: ∆Yt = β1 + δYt−1 + ut .

• Yt is a RWM with drift around a deterministic trend: ∆Yt = β1 + β2 t + δYt−1 + ut .

• In all cases the null hypothesis is given by: H0 : δ = 0 - there is a unit root;
time series is nonstationary; has a stochastic trend.

• Alternative hypothesis is given by: H1 : δ < 0 - time series is stationary,


possibly around a deterministic trend.

Now we note that if the null hypothesis is rejected it could mean that (1) Yt is stationary with zero mean, in the first case, or (2) Yt is stationary with nonzero mean, in the second case. We must note that the critical values of the Tau statistic would be different
for different model specifications and hence we must try to ensure that we are not
committing a specification error in modeling. Now we estimate as follows:

• Estimate either of the specified models by OLS.

• Divide the estimated coefficient of Yt−1 by its standard error to get the Tau
statistic and refer to the DF table.

• If the computed Tau value |τ| exceeds the critical DF value then we reject the null hypothesis and conclude stationarity of the process. Alternatively, if the computed value is smaller than the critical value, we do not reject the null hypothesis.
As a demonstration the regression of the above three model specifications is
given below:

• ∆GDP̂t = 0.000968GDPt−1. Specifications: t = (12.9), R² = 0.0147, d = 1.31.

• ∆GDP̂t = 0.0221 − 0.00165GDPt−1. Specifications: t = (2.432)(−1.56), R² = 0.0096, d = 1.34.

• ∆GDP̂t = 0.2092 + 0.0002t − 0.0269GDPt−1. Specifications: t = (1.89)(1.70)(−1.89), R² = 0.0215, d = 1.33.

• After this we find from the DF table that the 5% significance level critical values for the three models are: −1.95 (no intercept, no trend), −2.88 (intercept, no trend), −3.43 (intercept and trend).

• Now, first of all, we should immediately rule out model 1 because its coefficient is positive: δ > 0 would imply ρ > 1, which would make the model explosive.

• In the other two models we can accordingly compute the Tau stat and com-
pare with the critical values.

15.17.1 Augmented DF test


In the above DF test we assumed that the error terms ut are uncorrelated. In case
the error terms are correlated we use the augmented Dickey Fuller test. In this
test we essentially augment the above model equations with lagged values of the dependent variable ∆Yt. In the ADF test we compute the following regression:

∆Yt = β1 + β2 t + δYt−1 + Σ_{i=1}^m αi ∆Yt−i + εt (15.198)

Here εt is a pure white noise error term and ∆Yt−1 = (Yt−1 − Yt−2), ∆Yt−2 = (Yt−2 − Yt−3), and so on. The point is to specifically include the lagged terms so

as to ensure that the error terms are not correlated, thereby giving us unbiased
estimates of δ. Note that lag length is usually selected based on information
criteria.
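As an illustrative sketch, assuming the statsmodels library is available, its adfuller routine implements this test with information-criterion lag selection:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# ADF test on a simulated random walk: the tau statistic should be well
# above (less negative than) the critical values, so we fail to reject
# the unit-root null.
rng = np.random.default_rng(7)
rw = np.cumsum(rng.normal(size=500))
stat, pvalue, usedlag, nobs, crit, icbest = adfuller(rw, autolag="AIC")
print(round(stat, 3), round(pvalue, 3), crit)
```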

15.18 Transformation: Difference stationary process


If we have a time series with a unit root, then the first differences of such a series would be stationary. Therefore taking first differences will be our go-to transformation. Consider the previous GDP data example. We can take the first differences
of this nonstationary GDP series to get Dt = ∆GDPt = (GDPt − GDPt−1 ). We can
now consider the following regression:
∆D̂t = 0.00557 − 0.6781Dt−1 (15.199)

t = (7.14)(−11.0204), R² = 0.33, d = 2.05 (15.200)


Further we note that the 1% critical DF Tau value for this model is −3.45, and since the computed Tau value −11.02 is significantly less than the critical value, we safely reject the null hypothesis and conclude that the first differenced series is in fact stationary, that is, I(0).

15.18.1 Trend stationary process


A TSP is stationary around its trend line. The simplest way to make a trending time series stationary is to regress the series on time; the residuals of this regression will then be stationary.

Yt = β1 + β2 t + ut (15.201)

ût = (Yt − β̂1 − β̂2 t) (15.202)


This series will essentially be stationary and ût is known as the detrended time
series.
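A minimal detrending sketch (Python/numpy assumed; the parameter values are our own) on a simulated trend stationary series:

```python
import numpy as np

# Simulate Y_t = b1 + b2*t + u_t, fit the trend by OLS on time, and keep
# the residuals u_hat_t = Y_t - b1_hat - b2_hat*t as the detrended series.
rng = np.random.default_rng(8)
T = 300
t = np.arange(T)
Y = 2.0 + 0.05 * t + rng.normal(size=T)

b2_hat, b1_hat = np.polyfit(t, Y, 1)       # slope first, then intercept
u_hat = Y - (b1_hat + b2_hat * t)          # detrended time series
print(round(b1_hat, 2), round(b2_hat, 3), round(u_hat.mean(), 3))
```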

15.19 Cointegration: Regressing unit root over unit root
In earlier sections we have seen that if we regress a nonstationary time series on
another nonstationary series we might obtain spurious regression results. Con-
sider two series, LPCE and LDPI; we find that both are unit root I(1) and both contain a stochastic trend. If the two series share the same common stochastic trend, then the regression results will not be spurious. Consider running the
following regression:
LP CEt = β1 + β2 LDP It + ut (15.203)

Note that the L denotes natural logarithm and β2 represents the elasticity of real
personal consumption expenditure to real disposable personal income. We can
denote it as consumption elasticity.
ut = LPCEt − β1 − β2 LDPIt (15.204)
Now suppose we subject ut to a unit root test and find that it is stationary I(0).
The interesting point here is that even though LPCE and LDPI are individually I(1)
having stochastic trends, we find that their linear combination is I(0). Basically we
can say that the linear combination has the effect of cancelling out the two stochas-
tic trends in the series. Therefore in this scenario this regression of consumption
on income would make sense. We say that the two variables are cointegrated if
their linear combination gives a stationary series I(0). Note that typically variables
are cointegrated if they have a long term relationship between them. In short, to
test cointegration we see if the residuals of the regression function are I(0).

15.19.1 Testing for Cointegration


The central process in testing for cointegration is to take a typical regression of
two nonstationary processes like:
LPCEt = β1 + β2 LDPIt + ut (15.205)
After estimating the regression, we will find the residuals and then use the DF and
ADF tests of unit root testing. We note that since the estimated residuals of ut are
based on the estimated cointegrating parameter β2 , the DF and ADF critical values
are not appropriate. This is because the estimated parameter is found by the
original regression function and not by the OLS procedure of the residual based
function. Therefore in this context the DF and ADF tests are referred to as Engle-Granger (EG) and augmented Engle-Granger (AEG) tests. First we run the first
regression and find the coefficient estimates:
LPCÊt = −0.1942 + 1.0114LDPIt (15.206)
t = (−8.23)(348.94), R2 = 0.998, d = 0.1558 (15.207)
Initially it might seem that this is a spurious regression result since both the series
are individually nonstationary. However we will now conduct a unit root test on
the residuals to ascertain that.
∆ût = −0.0764ût−1 (15.208)
t = (−3.04), R2 = 0.0369, d = 2.53 (15.209)
Now we note that the EG asymptotic 5% and 10% critical values for the test statistic
are −3.34 and −3.04 respectively. Therefore we can say that the residuals are
nonstationary at the 5% level. We might be able to do better. Let us try modeling
the same regression problem by including the trend component and then see if the
residuals are stationary or not.
LPCÊt = 2.813 + 0.0037t + 0.5844LDPIt (15.210)

t = (21.34)(22.93)(31.27), R2 = 0.99, d = 0.29 (15.211)


Now we conduct the unit root test on the residuals to get:

∆ût = −0.1498ût−1 (15.212)

t = (−4.45), R2 = 0.075, d = 2.39 (15.213)


Now the corresponding DF tests show that the residuals are stationary. The key catch here is that the residuals are stationary, I(0), but they are stationary around a deterministic linear trend: the residuals are I(0) plus a linear trend.
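A hedged sketch of the two-step procedure (Python with numpy and statsmodels assumed; as noted above, in practice the EG rather than the DF critical values should be consulted for the residual test):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Build two I(1) series sharing one stochastic trend, regress one on the
# other, and unit-root-test the residuals (step two of Engle-Granger).
rng = np.random.default_rng(9)
T = 500
trend = np.cumsum(rng.normal(size=T))          # common stochastic trend
x = trend + rng.normal(size=T)
y = 1.0 + 2.0 * trend + rng.normal(size=T)     # cointegrated with x

A = np.column_stack([np.ones(T), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
print(adfuller(resid)[0])   # strongly negative tau suggests resid is I(0)
```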

15.20 AR, MA and ARIMA modeling


We can choose to model our GDP variable series as follows:

(Yt − δ) = α1 (Yt−1 − δ) + ut (15.214)

Further we note that here δ is the mean of Yt and ut is a white noise process with
mean 0 and variance σ 2 . We can say that Yt follows a first order autoregressive
process or is AR(1). The value of Y at time t depends on its previous first lag
value and a random shock. Note that the Y values are typically expressed as their
mean deviations and the model is basically stating that the current value of Y is
a proportion of its previous value and an error term. In a more general way, an
AR(p) process can be expressed as:

(Yt − δ) = α1 (Yt−1 − δ) + α2 (Yt−2 − δ) + · · · + αp (Yt−p − δ) + ut (15.215)

15.20.1 Moving average process


Suppose we choose to model Y as:

Yt = µ + β0 ut + β1 ut−1 (15.216)

Here µ is a constant and u is a stochastic error term modeled as white noise. We


say that Y at time t is equal to a constant plus a moving average of the white noise
error terms. The above is an M A(1) process. Similarly, an M A(q) process can be
represented as:
Yt = µ + β0 ut + β1 ut−1 + · · · + βq ut−q (15.217)

15.20.2 ARMA process


If Y has characteristics of both M A and AR then it can be modeled as an ARM A
process. Here is an example of ARM A(1, 1):

Yt = θ + α1 Yt−1 + β0 ut + β1 ut−1 (15.218)



15.20.3 ARIMA process


We know from earlier discussions that for a series to be weakly stationary they
have constant mean and variance and their covariances are time invariant. Fur-
ther we also learned that some nonstationary series are integrated with order I(1)
then their first difference is stationary I(0). Generally, if we take a d order dif-
ference of an integrated I(d) series then we would get a stationary I(0) series.
Now if we have a series for which we have to take d differences and then we ap-
ply a stationary ARM A(p, q) model to it, then we say that the original series is
ARIM A(p, d, q) or an autoregressive integrated moving average series. There-
fore an ARIM A(2, 1, 2) series can essentially be differenced once and then be
modeled as an ARM A(2, 2) process. We further note that an ARIM A(p, 0, q) =
ARM A(p, q), ARIM A(p, 0, 0) = AR(p) and ARIM A(0, 0, q) = M A(q).

15.21 Box Jenkins methodology


The key question for us to answer is that if we are looking at some time series data
like the GDP example we have mentioned so frequently, then how would we know
if it is a purely AR(p) process, a purely MA(q) process, an ARMA(p, q) process or an ARIMA(p, d, q), and correspondingly how we can figure out the p, d and q values appropriate for modeling the data. For this we use the Box-Jenkins method.

• Identification: To find the appropriate values of p, d and q, we will see how a correlogram and partial correlogram can help.

• Estimation: After figuring out the appropriate p and q values to form the
model, we need to estimate the parameters included in the model which are
the AR and M A terms.

• Diagnostic checking: Having chosen a particular ARIMA model, we then check whether it fits the data reasonably well. If not, we might attempt to find another ARIMA model. A simple test is to see if the residuals of the model are white noise or not. If they are, our model is fine; if not, we need to respecify it.

• Forecasting: We will then use the model to forecast values.

15.22 Identification
The main tools used in the identification process are the ACF, PACF and their cor-
responding correlograms which are nothing but plots of ACF/PACF with the corre-
sponding lag lengths. Now the concept of partial autocorrelation is analogous to
the concept of a partial regression coefficient in multivariate regression analysis.
For example the k th coefficient βk measures the rate change in the mean value of
the regressand for a unit increase in the value of the k th regressor, while holding

the effect of all other regressors constant. Similarly, the partial autocorrelation
ρkk measures the correlation between observations that are k time periods apart
after controlling for autocorrelations in the intermediate time lags (less than k).
For example the correlation exhibited between Yt and Yt−k might be due to the
correlation they exhibit with intermediate lags Yt−1 , Yt−2 , · · · . The partial correla-
tion ρkk basically removes their influence.

From the standard figures of ACF and PACF correlograms in the GDP series we
will notice that the ACF declines very slowly and the ACF values upto 22 lag val-
ues are statistically significantly different from zero, that is they all fall outside the
95% confidence interval bounds. On the other hand, the PACF falls dramatically
after the second lag and PACF values after lag 2 are statistically insignificant. Now
we must note that this GDP series is nonstationary and before applying the Box
Jenkins approach we must convert it to a stationary series. What we have seen
before as well is that if we take the first differences in GDP we do not observe any
time trends and also after conducting DF unit root tests we were able to ascertain
that the first differenced series is stationary.

We see that after this the ACF and PACF plots look very different. In this par-
ticular example we have the ACF at lags 1, 2, 5 that are statistically significant and
all others are insignificant. In the PACF, we have only the lags 1, 12 that are sta-
tistically significant. Now to see if our data can be modeled as an ARM A, M A
or AR process, we need to compare the ACF and PACF plots of standard models
of AR(1), M A(1), ARM A(1, 1), M A(2), ARIM A(2, 2) and so on. All these series
have typically characteristic patterns in the ACF and PACF and if our data ACF
and PACF patterns resemble any of those, we will select that as our modeling approach. We will then apply diagnostic checks to see if the chosen ARMA model is accurate or not. To do this we keep some ground rules handy.

• AR(p) - ACF decays exponentially or with a damped sine wave, or both. PACF has significant spikes through lag p.

• MA(q) - ACF has significant spikes through lag q. PACF declines exponentially.

• ARM A(p, q) - ACF has exponential decay. PACF has exponential decay.

15.23 Estimation
Let Y*t denote the first differenced logged GDP figures, and let us assume that our data exhibits an MA(2) pattern, so we model it that way:

Y*t = µ + β1 ut−1 + β2 ut−2 (15.219)

We might obtain the following estimates from the data:

Ŷ*t = 0.0082 + 0.291ut−1 + 0.202ut−2 (15.220)



t = (9.32)(4.61)(3.20), R2 = 0.12, d = 1.97 (15.221)


However, upon doing some diagnostic checking, including the ACF and PACF plots (for individual significance) and the Ljung-Box statistic (for collective significance of the autocorrelation terms), we may find that the residuals are not white noise, in which case we need to look for another ARIMA model. Note that to forecast the level value rather than the first differenced value, we can write the model as:

Yt − Yt−1 = µ + β1 ut−1 + β2 ut−2 (15.222)


Chapter 16

Joint Distributions

16.1 Joint distribution functions


In this section we look at probabilistic statements about two or more random
variables. For two random variables X and Y we can define the joint cumulative
distribution function of X and Y by:

F (a, b) = P (X ≤ a, Y ≤ b), where − ∞ < a, b < ∞ (16.1)

We can obtain the marginal distribution of X from the joint distribution of X


and Y as follows:
FX(a) = P(X ≤ a) (16.2)

= P(X ≤ a, Y ≤ ∞) = P(lim_{b→∞} {X ≤ a, Y ≤ b}) (16.3)

= lim_{b→∞} P(X ≤ a, Y ≤ b) (16.4)

= lim_{b→∞} F(a, b) = F(a, ∞) (16.5)

By using a similar argument, we can compute the marginal distribution of Y as


follows:
FY (b) = P (Y ≤ b) = F (∞, b) (16.6)
Now suppose we wish to compute the joint probability that X is greater than some
value a and Y is greater than some value b. We can compute this as follows:

P (X > a, Y > b) = 1 − P ({X > a, Y > b}C ) (16.7)

= 1 − P ({X > a}C ∪ {Y > b}C ) (16.8)


= 1 − P ({X ≤ a} ∪ {Y ≤ b}) (16.9)
= 1 − [P ({X ≤ a}) + P ({Y ≤ b}) − P ({X ≤ a, Y ≤ b})] (16.10)
= 1 − FX (a) − FY (b) + F (a, b) (16.11)
The general rule for computing these probabilities says the following:

P (a1 < X < a2 , b1 < Y < b2 ) = F (a2 , b2 )+F (a1 , b1 )−F (a1 , b2 )−F (a2 , b1 ) (16.12)


In case random variables X and Y happen to be discrete we can define the joint
probability mass function of X and Y as follows:
p(x, y) = P (X = x, Y = y) (16.13)
The associated marginals would be given as follows:

pX(x) = P(X = x) = Σy p(x, y) (16.14)

pY(y) = P(Y = y) = Σx p(x, y) (16.15)

16.1.1 The joint continuous case


We note that X and Y are jointly continuous if there exists a function f (x, y)
defined for all real x and y having the property that for every set C of pairs of real
numbers we have:
P{(X, Y) ∈ C} = ∫∫_{(x,y)∈C} f(x, y) dx dy (16.16)

The function f (x, y) is known as the joint PDF of X and Y . Suppose that A and
B are sets of real numbers and we define C such that C = {(x, y) : x ∈ A, y ∈ B}
then we can write:
P{X ∈ A, Y ∈ B} = ∫_B ∫_A f(x, y) dx dy (16.17)
Also note that:
F(a, b) = P{X ∈ (−∞, a), Y ∈ (−∞, b)} = ∫_{−∞}^b ∫_{−∞}^a f(x, y) dx dy (16.18)

If we partially differentiate the above function by both a and b we would get back
the PDF as follows:
f(a, b) = ∂²F(a, b) / ∂a∂b (16.19)
We note a rather interesting interpretation of the joint density function below:

P{a < X < a + da, b < Y < b + db} = ∫_a^{a+da} ∫_b^{b+db} f(x, y) dx dy ≈ f(a, b) da db (16.20)
Where da and db are very small quantities. Now the marginals for the individual
continuous random variables can be given by:
P{X ∈ A} = P{X ∈ A, Y ∈ (−∞, ∞)} (16.21)

= ∫_A ∫_{−∞}^∞ f(x, y) dy dx = ∫_A fX(x) dx (16.22)

Note the following points additionally:

fX(x) = ∫_{−∞}^∞ f(x, y) dy (16.23)

fY(y) = ∫_{−∞}^∞ f(x, y) dx (16.24)

16.1.2 Independence
The random variables X and Y are said to be independent if for two sets of real
numbers A and B we can write:
P {X ∈ A, Y ∈ B} = P {X ∈ A}P {Y ∈ B} (16.25)
Basically we say that the above two random variables are independent if, for all possible A and B, the events {X ∈ A} and {Y ∈ B} are independent.
What follows from this rule are the following points:
F (a, b) = FX (a)FY (b) (16.26)
p(x, y) = pX (x)pY (y) (16.27)

16.1.3 Summing independent random variables


We will now discuss the situation wherein the distribution of X + Y might be
required from the distributions of X and Y given that they are independent. We
can get the CDF of X + Y as follows:
FX+Y(a) = P{X + Y ≤ a} (16.28)

= ∫∫_{x+y≤a} fX(x)fY(y) dx dy (16.29)

= ∫_{−∞}^∞ ∫_{−∞}^{a−y} fX(x)fY(y) dx dy (16.30)

= ∫_{−∞}^∞ (∫_{−∞}^{a−y} fX(x) dx) fY(y) dy (16.31)

= ∫_{−∞}^∞ FX(a − y) fY(y) dy (16.32)

This cumulative distribution of X + Y is called the convolution of the distributions of X and Y. If we differentiate this we will get the PDF of the convolution as:

fX+Y(a) = (d/da) ∫_{−∞}^∞ FX(a − y) fY(y) dy = ∫_{−∞}^∞ fX(a − y) fY(y) dy (16.33)
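A small numerical sketch (Python/numpy assumed) of this convolution formula for two independent Uniform(0, 1) variables, whose sum has the triangular density on [0, 2]:

```python
import numpy as np

# f_{X+Y}(a) = integral of f_X(a - y) f_Y(y) dy, approximated on a grid.
ys = np.linspace(0.0, 1.0, 2001)
dy = ys[1] - ys[0]
fY = np.ones_like(ys)                                   # Uniform(0,1) density

def f_sum(a):
    fX = ((a - ys >= 0) & (a - ys <= 1)).astype(float)  # f_X(a - y)
    return np.sum(fX * fY) * dy                         # Riemann approximation

for a in (0.5, 1.0, 1.5):
    print(a, round(f_sum(a), 2))   # approximately 0.5, 1.0, 0.5
```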

16.2 Conditionals
For two events, the conditional probability of E given F is as follows:

P(E|F) = P(EF) / P(F) (16.34)

For discrete variables we define the conditional probability mass function of X given Y = y as:

pX|Y(x|y) = P{X = x|Y = y} = P{X = x, Y = y} / P{Y = y} = p(x, y) / pY(y) (16.35)

In a similar manner we can define the conditional CDF of X given Y = y as follows:

FX|Y(x|y) = P{X ≤ x|Y = y} = Σ_{a≤x} pX|Y(a|y) (16.36)

A crucial point to note is that if X and Y are independent then the conditional dis-
tribution of X with respect to Y is just the same as the unconditional distribution
of X.
pX|Y (x|y) = P (X = x) (16.37)

16.2.1 Conditionals: continuous


Suppose we have two random variables that have a joint PDF denoted as f (x, y)
then the conditional probability density function of X given that Y = y is given by:

fX|Y(x|y) = f(x, y) / fY(y) (16.38)
Similarly, we can note the conditional cumulative distribution function of X given
Y = y as follows:
FX|Y(a|y) = P{X ≤ a|Y = y} = ∫_{−∞}^a fX|Y(x|y) dx (16.39)

16.3 Expectations
We start off by recalling the expected value of a random variable X, given in the discrete and continuous cases by:

E[X] = Σx x p(x) (16.40)

E[X] = ∫_{−∞}^∞ x f(x) dx (16.41)

We say that E[X] is a weighted average of all possible values of X with the asso-
ciated probabilities as the weights. Consider two random variables X and Y and
g as a function of these two variables. Then we have:
E[g(X, Y)] = Σy Σx g(x, y) p(x, y) (16.42)

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f(x, y) dx dy (16.43)

Lastly, we recall that if X and Y are independent then we have:

E[g(X)h(Y )] = E[g(X)]E[h(Y )] (16.44)



16.4 Covariance
The covariance is an expression that captures the extent of linear relationship
between two random variables X and Y . It is defined as:

Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ] (16.45)

Note that if X and Y are independent then the covariance would be 0. Note some
properties about the covariance:
• Cov(aX, Y) = a Cov(X, Y)

• Cov(Σ_{i=1}^n Xi, Σ_{j=1}^m Yj) = Σ_{i=1}^n Σ_{j=1}^m Cov(Xi, Yj)

16.5 Conditional expectations


We can define the conditional expectation of X given Y = y such that:
E[X|Y = y] = Σx x P{X = x|Y = y} = Σx x pX|Y(x|y) (16.46)

E[X|Y = y] = ∫_{−∞}^∞ x fX|Y(x|y) dx (16.47)

E[g(X)|Y = y] = Σx g(x) pX|Y(x|y) in the discrete case, and ∫_{−∞}^∞ g(x) fX|Y(x|y) dx in the continuous case (16.48)

Further we note the following rule:

E[Σ_{i=1}^n Xi | Y = y] = Σ_{i=1}^n E[Xi | Y = y] (16.49)

16.5.1 More arguments on conditional expectations


We will now think of E[X|Y ] as a function of random variable Y which obtains a
value of E[X|Y = y] at Y = y. We note that E[X|Y ] is also a random variable.
Note the following crucial property regarding expectations:

E[X] = E[E[X|Y ]] (16.50)

For the discrete case this can be written as follows:

E[X] = Σy E[X|Y = y] P{Y = y} (16.51)

And for the continuous case we can write:

E[X] = ∫_{−∞}^∞ E[X|Y = y] fY(y) dy (16.52)

We can actually obtain probabilities apart from just expectations of a random vari-
able by conditioning it on some other random variable. Let E be an arbitrary event
and let X denote an indicator random variable as follows:
X = 1 if E occurs, and X = 0 if E does not occur (16.53)

We know that the expectation of an indicator random variable is given by:

E[X] = P (E) (16.54)

Now for any random variable Y we have:

E[X|Y = y] = P (E|Y = y) (16.55)

We can now apply the law of total expectation in terms of probabilities as follows:

P(E) = Σy P(E|Y = y) P(Y = y) (16.56)

P(E) = ∫ P(E|Y = y) fY(y) dy (16.57)

16.6 Conditional variance


The conditional variance of X given that Y = y is denoted as follows:

V ar(X|Y ) = E[(X − E[X|Y ])2 |Y ] (16.58)

We say that conditional variance is equal to the conditional expected square of


the difference between X and its conditional mean with the given value of Y .
Noting the fact that the traditional formula for variance is given by: V ar(X) =
E[X 2 ] − (E[X])2 we can write for the conditional case:

V ar(X|Y ) = E[X 2 |Y ] − (E[X|Y ])2 (16.59)

Taking expectations on both sides we get:

E[V ar(X|Y )] = E[E[X 2 |Y ]] − E[(E[X|Y ])2 ] = E[X 2 ] − E[(E[X|Y ])2 ] (16.60)

Since we have the fact that E[E[X|Y]] = E[X], we can write:

V ar(E[X|Y ]) = E((E[X|Y ])2 ) − (E[X])2 (16.61)

Now we can add equations (16.60) and (16.61) to get the variance of X. With this we can present the conditional variance decomposition as follows:

V ar(X) = E[V ar(X|Y )] + V ar(E[X|Y ]) (16.62)
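A Monte Carlo sketch (Python/numpy assumed; the distributions chosen are our own illustration) checking this decomposition for a simple binary Y:

```python
import numpy as np

# Check Var(X) = E[Var(X|Y)] + Var(E[X|Y]) for Y in {0, 1} with
# P(Y=0) = P(Y=1) = 1/2, so simple averages over k are valid weights.
rng = np.random.default_rng(10)
n = 1_000_000
Y = rng.integers(0, 2, size=n)
X = np.where(Y == 1,
             rng.normal(3.0, 2.0, n),    # X | Y=1 ~ N(3, 4)
             rng.normal(0.0, 1.0, n))    # X | Y=0 ~ N(0, 1)

within = np.mean([X[Y == k].var() for k in (0, 1)])    # E[Var(X|Y)] ~ 2.5
between = np.var([X[Y == k].mean() for k in (0, 1)])   # Var(E[X|Y]) ~ 2.25
print(round(X.var(), 3), round(within + between, 3))   # both about 4.75
```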


Chapter 17

Conditional Probability

17.1 Joint Probability Distributions


Joint Probability Distributions studies the joint effect of two or more random vari-
ables. We use joint distributions when the random variables are related to each
other. The random variables involved could be either discrete or continuous.
When we study the joint probability distributions of two random variables, it is
called a bivariate distribution whereas when we study more than two random
variables, it is called a multivariate distribution. We now extend our definitions of
probability mass functions and probability distribution functions to joint probabil-
ity distributions.

Definition 17.1.1 Joint Probability Function. When X and Y are discrete random
variables, then we refer to f (x, y) as the joint probability mass function of X and
Y iff the following is satisfied.

f (x, y) = P (X = x, Y = y) (17.1)

Definition 17.1.2 Joint Probability Density Function. When X and Y are con-
tinuous random variables, then we refer to f (x, y) as the joint probability density
function iff the following is satisfied.
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx (17.2)

When we are given the joint probability distribution functions (pmf or pdf), we can
compute the marginal probability distribution functions of the random component
variables using the following definitions.

Definition 17.1.3 Marginal probability mass functions. The marginal pmf of X


and Y , given that both X and Y are discrete is given by
fX(x) = Σ_{all y} f(x, y) (17.3)

fY(y) = Σ_{all x} f(x, y) (17.4)


Definition 17.1.4 Marginal probability distribution function. The marginal pdf of


two continuous random variables X and Y are given by:
fX(x) = ∫_{y=−∞}^{y=+∞} f(x, y) dy (17.5)

fY(y) = ∫_{x=−∞}^{x=+∞} f(x, y) dx (17.6)

Definition 17.1.5 Joint Cumulative Distribution Function. The joint cdf of two
random variables X and Y is given by:
FXY (x, y) = P (X ≤ x, Y ≤ y) (17.7)
The above definition of the joint cdf is applicable to discrete, continuous and mixed random
variables. We can also write this as:

FXY(x, y) = P(X ≤ x and Y ≤ y) = P({X ≤ x} ∩ {Y ≤ y})          (17.8)


Using the joint cdf we can find the marginal cdfs FX (x) and FY (y). Specifically,
for any x, y ∈ R:

FXY(x, ∞) = P(X ≤ x, Y ≤ ∞) = P(X ≤ x) = FX(x) = lim_{y→∞} FXY(x, y)          (17.9)

FXY(∞, y) = P(X ≤ ∞, Y ≤ y) = P(Y ≤ y) = FY(y) = lim_{x→∞} FXY(x, y)          (17.10)

Also note that we must have


FXY (∞, ∞) = 1
FXY (−∞, y) = 0 ∀ y ∈ R
FXY (x, −∞) = 0 ∀ x ∈ R
For any two random variables X and Y and real numbers x1 ≤ x2 and y1 ≤ y2,
we can say:

P[x1 < X ≤ x2, y1 < Y ≤ y2] = FXY(x2, y2) − FXY(x1, y2) − FXY(x2, y1) + FXY(x1, y1)

From the joint cumulative distribution at (x2, y2), we subtract the cdfs at (x2, y1)
and (x1, y2) and add back the cdf at (x1, y1) to get the probability of the rectangular
region with corners (x1, y1), (x2, y1), (x1, y2) and (x2, y2).

[Figure: the rectangle (x1, x2] × (y1, y2] in the (x, y) plane, with its four corner points marked.]

17.2 Conditional Distribution


Definition 17.2.1 Conditional Function. For a discrete random variable X and
event A, the conditional PMF of X given A is defined as

PX|A(xi) = P[X = xi | A] = P(X = xi and A) / P(A)

The conditional CDF of X given A, denoted FX|A(x), is defined by

FX|A(x) = P[X ≤ x | A]

17.2.1 Conditional PMF of X given Y


In this case we condition the occurrence of X on another random variable Y .
Thus, we find the conditional PMF of X given Y .

PX|Y(xi | yj) = P[X = xi | Y = yj] = P[X = xi, Y = yj] / P[Y = yj] = PXY(xi, yj) / PY(yj)

When two random variables X and Y are independent, then the occurrence of X
does not depend on the occurrence of Y and vice-versa. Thus, we will modify the
pmf and the cdf accordingly.

PXY (x, y) = PX (x).PY (y) ∀ x, y (17.11)


FXY (x, y) = FX (x).FY (y) ∀ x, y (17.12)

This occurs because when X and Y are independent, PX|Y(xi | yj) = PX(xi), since
Y does not provide any information about X.

17.3 Conditional Expectation


We cumulatively sum up all the values that X can take given that an event A has
occurred to get the conditional expectation of X given A.
E[X | A] = Σ_{xi∈RX} xi · PX|A(xi)

where PX | A (xi ) is the conditional PMF of X given A.

Similarly, when we condition the occurrence of X on another random variable


Y and find the expectation, it will be given as
E[X | Y = y] = Σ_{xi∈RX} xi · PX|Y(xi | y)          (17.13)

where PX | Y (xi ) is the conditional PMF of X given Y .



17.4 Law of Total Probability


The total probability rule also called the law of total probability breaks up the
sample space Ω into partitions. The sum of the probabilities of all these individual
partitions is what forms the overall probability of the event.
Definition 17.4.1 Law of Total Probability. If B1 , B2 , B3 ... are partitions of the
sample space Ω then for any event A, we have
P(A) = Σ_i P(A ∩ Bi) = Σ_i P(A | Bi) · P(Bi)

A sample space Ω has partitions B1 , B2 , B3 ... and so on. Also an event A is seen to
occur within the same sample space Ω.
The total probability of A can then be a cumulative sum of all the probabilities of
intersections of A and the partitions.
[Figure: a sample space Ω partitioned into B1, B2, . . . , with an event A overlapping the partitions; the pieces A ∩ B1, A ∩ B2, . . . together make up A.]

If you want to find P(A), you can look at a partition of Ω and add the amount of
probability of A that falls in each partition.

17.4.1 Law of Total Expectation


If Y is a discrete random variable with RY = {y1, y2, y3, . . .}, the events {Y = y1},
{Y = y2}, . . . form a partition of the sample space Ω. In this case, the pmf of X
will be given as

PX(x) = Σ_{yj∈RY} PXY(x, yj) = Σ_{yj∈RY} PX|Y(x | yj) PY(yj)

For any set A, we compute the probability


P[X ∈ A] = Σ_{yj∈RY} P[X ∈ A | Y = yj] · PY(yj)          (17.14)

Also, we know that,

P(A) = Σ_i P(A | Bi) · P(Bi)          (Probability of A)          (17.15)

We compute the probability of X using the partitions of the discrete random vari-
able Y
PX(x) = Σ_{yj∈RY} PX|Y(x | yj) PY(yj)          (Probability of X using partitions of Y)          (17.16)

The expectation of X and the conditional expectation of X given Bi using the Law
of Total Probability are given by:
E[X] = Σ_{x∈RX} x · P[X = x]          (Expectation of X)          (17.17)

E[X | Bi] = Σ_{x∈RX} x · P[X = x | Bi]          (Conditional expectation of X in each partition)          (17.18)

Conditioning X on Y and taking the expectation, we get,


E[X | Y = yj] = Σ_{x∈RX} x · P[X = x | Y = yj]          (17.19)

Thus, substituting equation (17.16) into (17.17) to compute the total expectation of X,
we get

Σ_{x∈RX} x · PX(x) = Σ_{yj∈RY} [ Σ_{x∈RX} x · P[X = x | Y = yj] ] PY(yj)          (17.20)

As we can clearly see, the LHS in equation (17.20) is simply the expectation of X, and
the bracketed part of the RHS is E[X | Y = yj].
The above results can be summarised to obtain the Law of Total Expectation which
is given by
E[X] = Σ_{yj∈RY} E[X | Y = yj] PY(yj)          (17.21)

Basically, the law of total expectation tells us how to compute the expectation E(X)
using knowledge of the conditional expectation of X given Y = yj. It is directly
derived from the law of total probability, since the conditional expectations are
summed against the marginal probabilities of the partitions of Y.

17.4.2 Conditional Expectation as a Function of Random Vari-


able
As we saw previously in equation (17.13), the conditional expectation of X given Y,
where Y is a random variable, is given as

E[X | Y = y] = Σ_{xi∈RX} xi · PX|Y(xi | y)

whose value depends on y and is therefore a function of y. This can be written as
follows:

g(y) = E[X | Y = y] = Σ_{xi∈RX} xi · PX|Y(xi | y)

where g(y) is a function of the random variable Y . Thus, conditional expectation


of X can be written as a function of the random variable Y

g(Y ) = E[X | Y ]

since Y is a random variable with space Ry = {y1 , y2 . . .}

Example 1: X = (aY + b)

E[X | Y = y] = E[(aY + b) | Y = y] = ay + b

where g(y) = ay + b is a function of y. Hence

E[X | Y] = aY + b

Therefore, E[X | Y] is a function of the random variable Y.

Example 2: Let us assume two random variables X and Y with joint pmf and let
Z = E[X | Y ]

Y =0 Y =1
X=0 1/5 2/5 P (X = 0) = 1/5 + 2/5 = 3/5
X=1 2/5 0 P (X = 1) = 2/5 + 0 = 2/5
P (Y = 0) = 1/5 + 2/5 = 3/5 P (Y = 1) = 2/5 + 0 = 2/5

In the given expression, the randomness of Z comes from the randomness of Y.
As per the table, we can infer that X and Y each follow a Bernoulli distribution with
p = 2/5.

PX|Y(0 | 0) = PX,Y(0, 0) / PY(0) = (1/5) / (3/5) = 1/3

PX|Y(1 | 0) = PX,Y(1, 0) / PY(0) = (2/5) / (3/5) = 2/3 = 1 − PX|Y(0 | 0)

Therefore, X conditional on Y = 0 follows a Bernoulli distribution with p = 2/3

(X | Y = 0) ∼ Bern(2/3)

Conditional on Y = 1, X takes the value 0 with certainty:

PX|Y(0 | 1) = PX,Y(0, 1) / PY(1) = (2/5) / (2/5) = 1

PX|Y(1 | 1) = PX,Y(1, 1) / PY(1) = 0 / (2/5) = 0
∴ we can say with certainty that whenever Y = 1, X will take the value 0.
(X | Y = 1) ∼ Bern(0)
We know that if X and Y are independent,
PX,Y (x, y) = PX (x).PY (y) ∀ x, y
In the above example,
PX,Y(0, 0) = 1/5 ≠ PX(0) · PY(0) = (3/5) · (3/5) = 9/25
Thus, X and Y are not independent random variables.
Z = E[X | Y] = { E[X | Y = 0] with probability P(Y = 0) = 3/5
              { E[X | Y = 1] with probability P(Y = 1) = 2/5

Z = E[X | Y] = { 2/3 with probability 3/5
              { 0   with probability 2/5
Therefore, PMF of Z can be written as

PZ(z) = { 3/5 if z = 2/3
        { 2/5 if z = 0
        { 0   otherwise

Thus, the expectation of Z will be given as


E[Z] = (3/5) · (2/3) + (2/5) · 0 = 2/5
We also know that E[X] = 2/5 since X ∼ Bern( 25 ). Thus, using the law of total
expectation, we can say that

E[X] = E[Z] = E[E[X | Y ]] = 2/5


We will now look at the variance of Z.
Var(Z) = E(Z²) − (E[Z])² = E(Z²) − (2/5)²

Z² = { (2/3)² = 4/9 with probability 3/5
    { 0             with probability 2/5

E[Z²] = (4/9) · (3/5) + (2/5) · 0 = 4/15

Var(Z) = E(Z²) − (2/5)² = 4/15 − 4/25 = (20 − 12)/75 = 8/75

Note: For given random variables X and Y and known functions g(X) and
h(Y), we can compute the conditional expectation as

E[g(X).h(Y ) | X] = g(X).E[h(Y ) | X]

17.5 Iterated Expectation


The Law of Iterated Expectation states that the expected value of a random variable
equals the expected value of its conditional expectation given a second random
variable. For given random variables X and Y and known functions g(X) and h(Y),
we have

g(Y ) = E[X | Y ] and g(y) = E(X | Y = y)


The expectation of X will be given as
E[X] = Σ_{yj∈RY} E[X | Y = yj] · PY(yj) = Σ_{yj∈RY} g(yj) · PY(yj) = E[g(Y)]

Also, using the law of total expectation, we have

E[X] = E[E[X | Y]],   where E[X | Y] = g(Y)

If X and Y are independent random variables, PX|Y(x | y) = PX(x), since the
occurrence of X does not depend on Y and Y offers no additional information
about X. Also, from our previous results, we know that if X and Y are independent
random variables then
E[X | Y = y] = Σ_{xi∈RX} xi · PX|Y(xi | y) = Σ_x x · PX(x) = E[X]

E[X | Y] = E[X]
E[g(X) | Y] = E[g(X)]
PXY(x, y) = PX(x) · PY(y)
E[XY] = E[X] · E[Y]

The covariance between independent random variables is zero. However, covariance
being zero does not imply independence of the random variables. Thus independence
is a stronger condition than zero covariance:

Independence ⇒ Cov(X, Y) = 0
but
Cov(X, Y) = 0 ⇏ Independence
Also, if X and Y are independent random variables

E[g(X).h(Y )] = E[g(X)].E[h(Y )]

17.6 Conditional Variance


A conditional variance is calculated much like a variance, except that the probability
mass function is replaced with a conditional probability mass function. We
will now look at the variance of X when it is conditioned on the value of the
random variable Y. We know that

µX | Y (y) = E[X | Y = y] = µX | Y =y (17.22)

where µX | Y (y) is the conditional mean of X given Y = y. Denoting variance as σ 2 ,


we have the conditional variance as
σ²_{X|Y=y} = E[(X − µ_{X|Y=y})² | Y = y]
           = Σ_{xi∈RX} (xi − µ_{X|Y=y})² · PX|Y(X = xi | Y = y)

σ²_{X|Y=y} = E[X² | Y = y] − (µ_{X|Y=y})²

Recall,

σ²_X = E[X²] − (µX)²

Note: For given random variables X and Y, σ²_{X|Y=y} = Var(X | Y = y) is a
deterministic function of y, whereas Var(X | Y) is a random variable.

Example 3. Continuing from the previous example, let Z = E[X | Y ] and V =


V ar(X | Y ). Find the pmf of V and E[V ].

In example 2, we noted that X and Y ∼ Bern(2/5). Also, (X |Y = 0) ∼ Bern(2/3)


and P (X = 0 | Y = 1) = 1. We know that the variance of a Bernoulli distribution
is given as pq. Thus, V can be given as
V = Var(X | Y) = { Var(X | Y = 0) = (2/3) · (1/3) = 2/9 with probability P(Y = 0) = 3/5
                 { Var(X | Y = 1) = 0                    with probability P(Y = 1) = 2/5

Also, P (X = 1 | Y = 1) = 0 ⇒ (X | Y = 1) ∼ Bern(0). The pmf of random variable


V will be given as

pmf of V: PV(v) = { 3/5 if v = 2/9
                  { 2/5 if v = 0
                  { 0   otherwise

The expectation of V will be computed as

E(V ) = (3/5).(2/9) + 0 = 2/15



Also, given that X ∼ Bern(2/5) and the variance of a Bernoulli distribution is pq,

Var(X) = (2/5) · (1 − 2/5) = (2/5) · (3/5) = 6/25
or
Var(X) = E(V) + Var(Z) = 2/15 + 8/75 = 6/25

where V = Var(X | Y) and Z = E[X | Y]. This brings us to the Law of Total
Variance.
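
The quantities appearing in Examples 2 and 3 can be computed mechanically from the joint table. A short Python sketch (array rows index x, columns index y, following the table of Example 2):

    import numpy as np

    joint = np.array([[1/5, 2/5],      # P(X=0, Y=0), P(X=0, Y=1)
                      [2/5, 0.0]])     # P(X=1, Y=0), P(X=1, Y=1)
    x_vals = np.array([0.0, 1.0])

    p_Y = joint.sum(axis=0)                               # [3/5, 2/5]
    cond = joint / p_Y                                    # P(X = x | Y = y), by column
    Z_vals = (x_vals[:, None] * cond).sum(axis=0)         # E[X | Y = y] -> [2/3, 0]
    V_vals = (x_vals[:, None]**2 * cond).sum(axis=0) - Z_vals**2   # Var(X | Y = y)

    E_V = (V_vals * p_Y).sum()                            # E[V] = 2/15
    var_Z = (Z_vals**2 * p_Y).sum() - ((Z_vals * p_Y).sum())**2    # Var(Z) = 8/75
    print(E_V + var_Z)                                    # 0.24 = 6/25 = Var(X)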

17.7 Law of Total Variance


The law of total variance breaks down the variability of X based on the values of
another random variable Y. The conditional variance Var(X | Y) is itself a random
variable: each realization corresponds to first drawing Y from its marginal distribution
and then sampling X from its conditional distribution given Y = y. The law is given as
Var(X) = E[Var(X | Y)] + Var[E(X | Y)]          (17.23)
The first term is the expected variance of X as we average over all values of Y,
whereas the second term is the variance of the conditional expectation of X given Y.
The first part is a within-group phenomenon while the second part is a between-group
phenomenon. This will be further clarified after we derive the law and observe
some examples.

17.7.1 Derivation
Assuming Z = E[X | Y ],
V = Var(X | Y) = E(X² | Y) − (E[X | Y])² = E[X² | Y] − Z²          (17.24)

Using the law of total expectation, the expectation of V will be given as

E[V] = E[E[X² | Y]] − E[Z²] = E[X²] − E[Z²]          (17.25)

We know that the variance of Z is given as

Var(Z) = E(Z²) − (E[Z])² = E(Z²) − (E(X))²          (17.26)

since Z = E[X | Y] ⇒ E(Z) = E[E[X | Y]] = E[X].
Adding (17.25) and (17.26), we get

E[V] + Var(Z) = E[X²] − E[Z²] + E(Z²) − (E(X))²
             = E[X²] − (E(X))²
             = Var(X)

∴ Var(X) = E[V] + Var(Z) = E[Var(X | Y)] + Var(E[X | Y])
All the terms in the law are non-negative since variance ≥ 0.

17.7.2 The Case of Independent Random Variables


When X and Y are independent random variables, E(X | Y) = E(X) is a constant.
Hence, Var(E(X | Y)) = 0. In such a case, the law of total variance reduces to:

V ar(X) = E[V ar(X | Y )]

Thus, an important result to remember is

V ar(X) ≥ E[V ar(X | Y )] (17.27)

The variance captures the uncertainty in a variable. When we condition X on Y ,


on an average the variance of X reduces. The random variable Y gives us some
extra information on X so knowing Y reduces the uncertainty of X on an average.

If Var(X) = 0, then X is almost surely constant: there is no uncertainty in the random variable X.

17.7.3 Intuition
Let us demonstrate the intuition behind the law of total variance with some exam-
ples.

Example 4. Suppose we are analysing the variation of heights of different people


across n countries. We pick a person at random and measure their height, which is
given by X. The country is denoted by Y, which is also picked randomly; thus
Y ∈ {1, 2, 3, . . . , n}. Instead of taking every person, we take a sample from each of
these n countries.

• Variance of X in Country i will be given by V ar(X | Y = i).

• The average height in Country i will be given by E[X | Y = i].

Thus, applying the law of total variance, we will get

Var(X) = E[Var(X | Y)] + Var[E(X | Y)]

Here, the variance decomposition has been done in a way that the LHS gives us
the total variance within the population. The first term on the RHS gives us the
variability of height within the country i.e. the average of variances of heights
across countries. The second term is the inter-sample phenomenon as we men-
tioned earlier i.e. it will tell us the variability of height across countries by giving
us the variance of average height across countries.

Example 5. Suppose we are modelling the number of customers that visit a


store. It is given that,

• N = number of customers that visit a store in a given day

• µN = E(N )

• σ²_N = Var(N)

• Xi = amount in Rs. that the ith customer spends on average.

• Xi are independent and identically distributed trials (iid).

Assume that Xi are independent of each other which implies that the amount spent
by every customer is independent of the amount spent by any other customer.
Also, assume that Xi is independent of N i.e. the amount spent by each customer
is independent of the number of customers that visit the store on that day.

µ_X = E(Xi) = E(X)          (17.28)

σ²_X = Var(Xi) = Var(X)          (17.29)

Let Y be the total store sales in a given day. Thus Y = Σ_{i=1}^N Xi.

E(Y) = E[Σ_{i=1}^N Xi] ≠ Σ_{i=1}^N E(Xi)

This is because N is a random variable and therefore, linearity of expectation does


not hold here.

E(Y) = E[E(Y | N)] = E[ E[Σ_{i=1}^N Xi | N] ] = E[ Σ_{i=1}^N E[Xi | N] ]

where E[Xi | N] = E(Xi), since Xi and N are independent, so that linearity of
expectation can be applied given N.

E(Y) = E( Σ_{i=1}^N E(Xi) ) = E( Σ_{i=1}^N E(X) ) = E[N · E[X]]

∴ E(Y) = E(N) · E(X)

Applying the law of total variance

Var(Y) = E[Var(Y | N)] + Var[E(Y | N)],   where E(Y | N) = N · E[X]

Var(Y | N) = Var[Σ_{i=1}^N Xi | N]          (since Xi and N are independent)
           = Σ_{i=1}^N Var(Xi | N)
           = Σ_{i=1}^N Var(Xi)          (Xi are iid)
           = Σ_{i=1}^N Var(X)

Var(Y | N) = N · Var(X)

Therefore the total variance in Y can be given as

Var(Y) = E[N · Var(X)] + Var[N · E(X)] = Var(X) · E(N) + (E(X))² · Var(N)          (17.30)

where Var(X) and E(X) are constants that can be pulled out of the expectation and
variance respectively.
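
Both identities for the random sum can be verified by simulation. A sketch with arbitrary choices N ∼ Poisson(20) (so that E(N) = Var(N) = 20) and Xi exponential with mean 50:

    import numpy as np

    rng = np.random.default_rng(1)
    trials, lam, mean_x = 50_000, 20.0, 50.0

    N = rng.poisson(lam, size=trials)
    # total sales per day: sum of N iid spends
    Y = np.array([rng.exponential(mean_x, size=n).sum() for n in N])

    print(Y.mean(), lam * mean_x)       # E(Y) = E(N) E(X) = 1000
    # Var(Y) = Var(X) E(N) + E(X)^2 Var(N); Var(X) = mean_x^2 for the exponential
    print(Y.var(), mean_x**2 * lam + mean_x**2 * lam)   # both approximately 100000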

17.8 References
• Ramachandran, K. M., & Tsokos, C. P. (2020). Mathematical Statistics with
Applications in R. Academic Press.

• Ross, Sheldon. A First Course in Probability. Pearson.


Chapter 18

Multivariate Gaussian Distribution

18.1 Random Vectors


The idea of random vectors turns out to be extremely handy when dealing with a
number of random variables. In this section we define random vectors of order n,
its expectation value and other entities such as the correlation and the covariance
matrix for a random vector.
Definition. [Random Variable].Consider a given probability space (Ω, F, P), where
Ω is the sample space, F refers to σ-algebra, the event-space which is a collection
of all possible events and P is the probability measure. A random variable, X is a
function with domain Ω and counterdomain the real line. The random variable X
is defined such that the set Ar defined by Ar = {ω : X(ω) ≤ r} belongs to F for
every real number r.
The random variable function therefore maps every element in the sample space
to the real number line. One can think of a random experiment where Ω is the
totality of all outcomes of the random experiment. The random variable (RV)
X takes every outcome of the experiment to a real number. The reason why we
mention the event space F in the definition is that we are usually interested in events defined
by random variables. This is why we require the collection of ω's for which X(ω) ≤ r
to form an event; we use the idea of a random variable to describe events.
Definition. [Random Vector] We define a random vector X as a column vector
with a collection of n random variables, say, {X1 , X2 , . . . , Xn } as its components.
That is, X T = [X1 X2 . . . Xn ].
Let x be a realization of the random vector X. We define the cumulative distribu-
tion function (CDF) of X as:
FX(x) = FX1,X2,...,Xn(x1, x2, . . . , xn)

FX(x) = P[X1 ≤ x1, X2 ≤ x2, . . . , Xn ≤ xn]          (18.1)
Now, if the Xi ’s are jointly continuous, then the probability density function (PDF)
of X, denoted by fX (x) is the joint PDF of the Xi random variables.
fX (x) = fX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) (18.2)


18.1.1 Expected value of random vectors


The expected value of a random vector is the vector with its entries as the expected
values of component random variables. Thus, for a random vector X with n
random variables, we have,
E[X^T] = [ E[X1] E[X2] . . . E[Xn] ] = µX^T          (18.3)

In extension, the expected value for a sum of random vectors, is the sum of indi-
vidual expected values of each random vector.
E[X1^T + X2^T + · · · + Xk^T] = E[X1^T] + E[X2^T] + · · · + E[Xk^T]          (18.4)

Written out componentwise, each E[Xj] is the column vector whose entries are
E[Xj1], E[Xj2], . . . , E[Xjn], and the vectors on the right-hand side are added entry
by entry.

18.1.2 Covariance Matrix for a random vector


Consider a series of random variables X1 , X2 , . . . Xn . The realizations of all the
random variables forms our sample data. Here, we consider each variable as a
column and every realization adds a row to our sample dataset. Note that for
every column Xi we can determine the variance - this is the V ar(Xi ), where each
value of random variable Xi is a realization. Likewise, between two columns, Xi
and Xj one can determine the covariance, Cov(Xi, Xj). Now we can construct the
n × n covariance matrix CX, such that it contains the variance terms along the
diagonal and the covariances between all pairs Xi and Xj off the diagonal.

The set of random variables X1 , X2 , . . . Xn thought of as a random vector X can


have a correlation and covariance matrix defined as follows. The correlation matrix
for a random vector is defined as RX = E[XX^T]:

        | E[X1²]     E[X1X2]   . . .  E[X1Xn] |
RX =    | E[X2X1]    E[X2²]    . . .  E[X2Xn] |          (18.5)
        |   ...        ...      ...     ...   |
        | E[XnX1]    E[XnX2]   . . .  E[Xn²]  |

We define variance of a random variable Xi as V ar(Xi ) = E[(Xi − µXi )2 ] and the


covariance between two random variables Xi and Xj as E[(Xi − µXi )(Xj − µXj )].

In a similar manner, we define the covariance matrix CX for a random vector X
as CX = E[(X − µX)(X − µX)^T]:

        | Var(X1)        Cov(X1, X2)   . . .  Cov(X1, Xn) |
CX =    | Cov(X2, X1)    Var(X2)       . . .  Cov(X2, Xn) |          (18.6)
        |   ...            ...          ...     ...       |
        | Cov(Xn, X1)    Cov(Xn, X2)   . . .  Var(Xn)     |
In a similar manner, we can also define for any two random vectors X and Y ,
cross-correlation (RXY ) and cross-covariance (CXY ) matrices.
RXY = E[XY^T]

CXY = E[(X − µX)(Y − µY)^T]          (18.7)

Note. Here, µX = E[X] is the expected value of random vector X and µY = E[Y ]
is the expected value of random vector Y .
     
µX = E[X] = [E[X1], E[X2], . . . , E[Xn]]^T = [µ1, µ2, . . . , µn]^T          (18.8)
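
In practice CX is estimated from realizations of the random vector. A brief Python sketch with a synthetic three-dimensional dataset; it also checks the symmetry and positive semidefiniteness properties established in the next subsection:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(10_000, 3))          # 10000 realizations of a 3-dim vector
    X[:, 2] = 0.8 * X[:, 0] + 0.2 * X[:, 2]   # induce correlation between components

    C = np.cov(X, rowvar=False)               # n x n sample covariance matrix
    print(np.allclose(C, C.T))                # True: C is symmetric
    print(np.linalg.eigvalsh(C).min() > -1e-12)   # True up to rounding: C is PSD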

18.1.3 Properties of the covariance matrix


The covariance matrix generalizes the variance of a random variable to that of a
random vector. In essence the covariance matrix for a random vector is equivalent
to the covariance matrix generated by considering n random variables. This matrix
exhibits certain interesting properties which we shall make use of in subsequent
sections. These properties are first stated and subsequently proved here.
P1. The covariance matrix is symmetric: CX^T = CX. Consequently, CX can be
diagonalized and all the eigenvalues of CX are real.

P2. If X is a real random vector, X ∈ Rⁿ (that is, Xi ∈ R ∀ i), then CX is positive
semidefinite (PSD). This means that ∀ b ≠ 0, b^T CX b ≥ 0.

P3. For a random vector X, the covariance matrix CX is positive definite (PD) iff
all eigenvalues of CX are > 0. Equivalently, CX is PD iff det(CX) > 0.
Theorem 18.1.1 The covariance matrix is symmetric: CX^T = CX. Consequently,
CX can be diagonalized and all the eigenvalues of CX are real.

Proof. [CX]ij = Cov(Xi, Xj) = Cov(Xj, Xi) = [CX]ji, so CX^T = CX. The covariance
between two variables does not change when their positions are interchanged. As
a consequence of the symmetric nature of the covariance matrix, we have the corollary
that all the eigenvalues of CX are real and its eigenvectors are orthogonal.

Note. We say that a matrix A is positive semidefinite (PSD) if ∀ b 6= 0, bT Ab ≥ 0.


Likewise, A is positive definite (PD) if ∀ b 6= 0, bT Ab > 0.

Theorem 18.1.2 If X is a real random vector X ∈ Rn , that is (Xi ∈ R, ∀ i), then


CX is positive semidefinite (PSD).

Proof. Let b ∈ Rⁿ be a fixed vector with n components. We define a random
variable Y = b^T (X − E[X]). Note that Y here is a random variable and not a
random vector. Further, the random variable Y² is non-negative, i.e., E[Y²] ≥ 0.

E[Y²] ≥ 0 ⟹ E[Y Y] ≥ 0 ⟹ E[Y Y^T] ≥ 0          (18.9)

Note here that Y^T is not a vector but a number, so Y^T = Y. Substituting the
value of Y in the above expression,

E[Y Y^T] ≥ 0 ⟹ E[ b^T (X − E[X]) (b^T (X − E[X]))^T ] ≥ 0
           ⟹ E[ b^T (X − E[X]) (X − E[X])^T b ] ≥ 0
           ⟹ b^T E[ (X − E[X]) (X − E[X])^T ] b ≥ 0,   where the middle factor is CX
           ⟹ b^T CX b ≥ 0 ⟹ CX is PSD.          (18.10)

18.2 Bivariate Gaussian Random Vectors


Definition. [Bivariate Gaussian] Two random variables X and Y are bivariate
Gaussian or jointly normal if the variable aX + bY has a normal distribution for all
a ∈ R and b ∈ R .

Note. If X and Y are Bivariate Gaussian and


- if a = b = 0 =⇒ aX + bY = 0 =⇒ Constant zero is a normal random
variable with mean 0 and variance 0.
- if a = 1, b = 0 =⇒ aX + bY = X =⇒ X is a normal random variable.
- if a = 0, b = 1 =⇒ aX + bY = Y =⇒ Y is a normal random variable.

Theorem 18.2.1 If X ∼ N(µX, σX²) and Y ∼ N(µY, σY²) are jointly normal random
variables, then setting a = b = 1 gives aX + bY = X + Y, and the resultant random
variable X + Y ∼ N(µX + µY, σX² + σY² + 2ρXY σX σY). Essentially, a linear
combination of two jointly normal variables X and Y is again normal.

Example. Let Z1 ∼ N(0, 1) and Z2 ∼ N(0, 1) be two independent random variables.
Define X = Z1 and Y = ρZ1 + √(1 − ρ²) Z2, where ρ ∈ (−1, +1). Show that
X and Y are bivariate normal.

We know from the above theorem that since Z1 and Z2 are independent normal
distributions, they are also jointly normal. We can therefore attempt to obtain the
joint PDF of Z1 and Z2 . Because, these are independent, we have,

fZ1,Z2(z1, z2) = fZ1(z1) · fZ2(z2)          (because Z1 and Z2 are independent RVs)

fZ1,Z2(z1, z2) = (1/√(2π)) e^(−z1²/2) · (1/√(2π)) e^(−z2²/2)

fZ1,Z2(z1, z2) = (1/2π) e^(−(z1² + z2²)/2)          (18.11)

In order to prove that X and Y are bivariate normal, we need to show first that
aX + bY is normal ∀ a, b ∈ R.

aX + bY = aZ1 + b[ρZ1 + √(1 − ρ²) Z2]

aX + bY = (a + bρ) Z1 + b√(1 − ρ²) Z2          (18.12)

where (a + bρ) and b√(1 − ρ²) are constant coefficients.

From the above, we see that aX + bY is expressed as a linear combination of


standard normals Z1 and Z2 . Thus it must be the case that aX + bY is normal.
Example. Let Z1 ∼ N(0, 1) and Z2 ∼ N(0, 1) be two independent random variables.
Define X = Z1 and Y = ρZ1 + √(1 − ρ²) Z2, where ρ ∈ (−1, +1). Obtain the
expected value and variance of X and Y.

Var(X) = Var(Z1) = 1

Var(Y) = Var(ρZ1) + Var(√(1 − ρ²) Z2) = ρ² Var(Z1) + (1 − ρ²) Var(Z2) = ρ² + (1 − ρ²) = 1

E[X] = E[Z1] = 0

E[Y] = ρ E[Z1] + √(1 − ρ²) E[Z2] = 0,   since E[Z1] = E[Z2] = 0

Recall the following properties about covariance of two variables:



• Cov(X, Z + W ) = Cov(X, Z) + Cov(X, W )


• Cov(Z, aW ) = aCov(Z, W )

Since Var(X) = Var(Y) = 1, the correlation equals the covariance:

ρ(X, Y) = Cov(X, Y) = Cov(Z1, ρZ1 + √(1 − ρ²) Z2)

ρ(X, Y) = Cov(Z1, ρZ1) + Cov(Z1, √(1 − ρ²) Z2)

ρ(X, Y) = ρ Cov(Z1, Z1) + √(1 − ρ²) Cov(Z1, Z2) = ρ · 1 + √(1 − ρ²) · 0

ρ(X, Y) = ρ

where Cov(Z1, Z1) = Var(Z1) = 1 and Cov(Z1, Z2) = 0 since Z1 and Z2 are independent.
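
This construction is easy to verify by simulation. A sketch, with ρ = 0.7 chosen arbitrarily:

    import numpy as np

    rng = np.random.default_rng(3)
    rho, n = 0.7, 1_000_000
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    x = z1
    y = rho * z1 + np.sqrt(1 - rho**2) * z2

    print(x.mean(), y.mean())         # both approximately 0
    print(x.var(), y.var())           # both approximately 1
    print(np.corrcoef(x, y)[0, 1])    # approximately rho = 0.7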

18.2.1 Joint PDF of Bivariate Normal


Let Z1 ∼ N(0, 1) and Z2 ∼ N(0, 1) be two independent random variables. Define
random variables X = Z1 and Y = ρZ1 + √(1 − ρ²) Z2, where ρ ∈ (−1, +1). In this
case we have, V ar(X) = V ar(Y ) = 1 and E[X] = E[Y ] = 0. In such a case, X
and Y have the bivariate normal distribution with correlation coefficient ρ if their
joint PDF fX,Y (x, y) is:
fX,Y(x, y) = [1 / (2π√(1 − ρ²))] · exp( −(x² − 2ρxy + y²) / (2(1 − ρ²)) )          (18.13)

In order to prove a very important result about the PDF of a bivariate normal
2
distribution, we first construct two normal random variables X ∼ N(µX , σX ) and
2
Y ∼ N(µY , σY ), such that ρ(X, Y ) = ρ. Let Z1 ∼ N(0, 1) and Z2 ∼ N(0, 1) be
independent random variables.

X = σX Z1 + µX
Y = σY [ρZ1 + √(1 − ρ²) Z2] + µY          (18.14)

X and Y have a bivariate normal distribution if the joint PDF


fX,Y(x, y) = [1 / (2πσX σY √(1 − ρ²))] e^(−u²)          (18.15)

u² = [1 / (2(1 − ρ²))] [ ((x − µX)/σX)² + ((y − µY)/σY)² − 2ρ(x − µX)(y − µY)/(σX σY) ]          (18.16)

where, µx , µY ∈ R and σX > 0, σY > 0 and ρ ∈ (−1, 1) are all parameters of the
PDF of the joint normal distribution fX,Y (x, y).

Theorem 18.2.2 Let X and Y be two bivariate normal random variable with joint
PDF fX,Y (x, y) given by:
fX,Y(x, y) = [1 / (2πσX σY √(1 − ρ²))] exp( −[1/(2(1 − ρ²))] [ ((x − µX)/σX)² + ((y − µY)/σY)² − 2ρ(x − µX)(y − µY)/(σX σY) ] )          (18.17)

Then, there exist independent standard normal variables Z1 and Z2 such that X
and Y are given by:

X = σX Z1 + µX
Y = σY [ρZ1 + √(1 − ρ²) Z2] + µY          (18.18)

Note. We can generate X and Y from the standard normal Z1 and Z2 . Using this
method, we can generate samples from the bivariate normal distribution.

On Independence of RVs

If X and Y are independent random variables, then Cov(X, Y) = 0, that is, X
and Y are uncorrelated. But the converse is not true in general.

If X and Y are jointly normal random variables and they are uncorrelated,
Cov(X, Y) = 0, then X and Y are independent random variables.

Theorem 18.2.3 If X and Y are bivariate normal and are uncorrelated Cov(X, Y ) =
0, then they are independent random variables.

Proof. If ρ = ρ(X, Y) = Cov(X, Y) = 0 and (X, Y) ∼ bivariate normal, then from
(18.16) and (18.15) we have

u² = (1/2) [ ((x − µX)/σX)² + ((y − µY)/σY)² ]

fX,Y(x, y) = [1/(2πσX σY)] exp( −(1/2) [ ((x − µX)/σX)² + ((y − µY)/σY)² ] )

fX,Y(x, y) = [1/(√(2π) σX)] e^(−(1/2)((x − µX)/σX)²) · [1/(√(2π) σY)] e^(−(1/2)((y − µY)/σY)²)

fX,Y(x, y) = fX(x) fY(y)   ∀ x, y ∈ R          (18.19)

where the two factors are precisely the marginal distributions fX(x) and fY(y).

Thus, X and Y are independent random variables.

In the subsequent sections, we extend the bivariate normal distribution to a case


with n-jointly distributed random variables. We shall see precisely how the idea of
random vectors aids us in this context. Here, we will also make use of functions of
random vectors to showcase the same.

18.3 Multivariate Gaussian Random Vector


Definition 18.3.1 (Multivariate Gaussian) Random variables {X1, X2, . . . , Xn}
are said to be multivariate Gaussian (or jointly normal) if ∀ a1, a2, . . . , an ∈ R, the
random variable W (a linear combination of all n random variables Xi) is a normal
random variable. Here,

W = a1X1 + a2X2 + · · · + anXn          (18.20)

Expressed in terms of vector notation, we have,

W = aT X (18.21)

where, aT = [a1 a2 . . . an ] is a constant coefficient vector and X T = [X1 X2 . . . Xn ]


is the random vector. Note that in the trivial case if a = 0 =⇒ W ∼ N(0, 0).
Definition. [Gaussian Vector] A random vector X is a Gaussian (or normal) random
vector if the n random variables {X1, X2, . . . , Xn} that make up its components
are jointly normal.

18.3.1 PDF of a Gaussian Vector


Consider the standard normal vector Z comprised of n identical but independent
standard normal random variables Zi . That is Zi ∼ N(0, 1) and Zi ’s are indepen-
dent. This implies,

fZ(z) = fZ1,Z2,...,Zn(z1, z2, . . . , zn)

fZ(z) = fZ1(z1) · fZ2(z2) . . . fZn(zn)          (since each Zi is independent of the others Zj, i ≠ j)

fZ(z) = ∏_{i=1}^n fZi(zi) = ∏_{i=1}^n (1/√(2π)) e^(−zi²/2)

fZ(z) = [1/(2π)^(n/2)] e^(−(1/2) Σ_{i=1}^n zi²) = [1/(2π)^(n/2)] e^(−(1/2) z^T z)          (18.22)

where the product becomes a sum inside the exponent, written compactly in vector form.

The above equation (18.22) is the PDF of a standard Gaussian vector. We may
now extend this PDF to a general normal random vector X with mean vector m
and covariance matrix CX.

Conversion from standard normal to general normal

Let Z be a standard normal random variable, Z ∼ N(0, 1), and let X
be a general normal random variable with finite mean µ and variance σ². We can
obtain X from Z using the transformation:

X = σZ + µ

Likewise, one can extend this to the case of random vector X with mean vector
m, where m = E[X] and covariance matrix CX such that
CX = E[(X − m)(X − m)^T]

Our objective now is to find such a transformation. It turns out that if CX can
be decomposed into CX = AAT , then

X = AZ + m

The matrix A can be thought of as a kind of standard deviation matrix, containing


the roots of variances and covariances.
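
One standard choice of such a factor is the Cholesky decomposition CX = AA^T, with A lower triangular. A minimal Python sketch; the mean vector and covariance matrix below are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(4)
    m = np.array([1.0, -2.0])
    C = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

    A = np.linalg.cholesky(C)              # C = A A^T, A lower triangular
    Z = rng.standard_normal((100_000, 2))  # rows are standard normal vectors
    X = Z @ A.T + m                        # each row is a sample X = AZ + m

    print(X.mean(axis=0))                  # approximately m
    print(np.cov(X, rowvar=False))         # approximately C

This is the standard recipe for generating samples from a general multivariate normal distribution.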

References
[1] Stochastic Processes course notes of Prof. Rakesh Nigam.

[2] Ross, Sheldon. A First Course in Probability. Pearson, 2014.

[3] Karlin, Samuel, and Howard E. Taylor. A Second Course in Stochastic Processes. Elsevier, 1981.

[4] Karlin, S., and H. M. Taylor. A First Course in Stochastic Processes. Academic Press, New York, 1966.
Part III

Stochastic Processes

Chapter 19

Introduction to Markov Chains

19.1 Premise
In probability theory, a stochastic process is a set of random variables indexed by
time or space. It refers to a system that has observations at various points of time
and the outcome or the observed value for each observation is actually a random
variable.

We find applications of stochastic processes in many disciplines including physics,


biology, image processing, signal processing, computer science, cryptography, fi-
nance and many others. In this chapter, we will study a specific kind of a stochastic
process known as Markov Chains.

19.2 Markov Chains


Modern probability theory studies the likelihood that the knowledge of previous
outcomes influences the predictions for the outcomes of a future experiment. A
markov chain is one such stochastic process in which the probability of each event
depends only on the state observed in the previous event. In other words, the
outcome of an experiment at a given time can affect the outcome of the next ex-
periment only.

In a Markov Chain, we have a set of states S = {1, 2, 3, . . .}. A process starts


at a given state and subsequently moves to the next state with each observation.
Each move is called a step. At time n, the chain will be in a state Xn . Suppose
the chain is currently in a state i. The probability of the chain moving to another
state j in the next time period is pij . These probabilities are called as transition
probabilities.


19.2.1 Memoryless Property


In a Markov Chain, the probability of the next state X_{n+1} depends only on the
current state X_n and not on the past states X_{n−1}, X_{n−2}, X_{n−3}, . . .

Suppose we model the mood of a person as a state. A person can be in two states
- a happy state H or a sad state S. A Markov chain on this model will imply that
the mood of the person tomorrow will only be affected by the mood of the person
today and not by her mood yesterday or any previous day. The possible states are
S = {H, S} and the occurrences are indexed by time intervals n, n + 1.

[Figure: two-state transition diagram — H stays at H with probability 0.8 and moves to S with probability 0.2; S moves to H with probability 0.7 and stays at S with probability 0.3.]

The random variable Xn can be written as


Xn = { 0  Happy state (H)
     { 1  Sad state (S)

In such a case, the transition probability matrix will be given as,

        H     S
H      0.8   0.2
S      0.7   0.3

The transitions between states occur at discrete times. In this example we can say
that P (Hn+1 | Hn ) = 0.8 i.e. if the person is happy today, the probability that she
will be happy tomorrow is 0.8 and the probability that she is sad tomorrow given
that she was happy today is P (Sn+1 | Hn ) = 0.2.
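
The n-step behaviour of this two-state chain can be computed directly by powering the transition matrix, as the Chapman-Kolmogorov result of Section 19.5 will justify. A minimal Python sketch:

    import numpy as np

    P = np.array([[0.8, 0.2],    # row H: P(H -> H), P(H -> S)
                  [0.7, 0.3]])   # row S: P(S -> H), P(S -> S)

    pi0 = np.array([1.0, 0.0])   # start in the Happy state
    # mood distribution after 10 days; converges to about [0.778, 0.222]
    print(pi0 @ np.linalg.matrix_power(P, 10))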

19.2.2 Continuous Time Markov Chains


In the previous case, the state space and time were both discrete in nature. A
continuous time Markov chain is characterised by a discrete state space, S =
{1, 2, 3, . . .}, and a continuous time index, t. Transitions between states can happen
at any time; there is no fixed time step as before, and there is a small probability of
changing state in any infinitesimal time interval dt.

19.3 A Betting Game


Let us try to understand discrete time Markov Chains with an example. Suppose
you are playing a betting game in a casino. You place Rs. b bets. The possible
outcomes for each bet are

(a) You gain Rs. b with probability p.

(b) You lose the bet with probability (1 − p) = q.

The rules for playing the game are:

• Play until you go broke (lose all your money).

• Keep playing forever.

If you start with an initial wealth W0, the question we need to address is: given the
above setup, should you play this game or not?

19.3.1 Modelling the Game


The number of bets will be indexed by time t and the outcome of bet at t is given
by X(t) which is a Bernoulli random variable.
Xt = { 1  if the bet is won, with probability p
     { 0  if the bet is lost, with probability 1 − p = q

The player’s wealth at time t is given as Wt . Initially at t = 0, W (0) = W0 . At time


t > 0 the wealth Wt depends on past wins and losses.

W (t) = W (t − 1) + b when bet is won with probability p

W (t) = W (t − 1) − b when bet is lost with probability q


If the initial wealth W0 = 20, bet b = 1 and win probability = p = 0.55, should you
play?

According to the previous cases, the outcome of the bet can be modelled as a
random variable.
W (t) = W (t − 1) + Y (t)
where Yt is given as,
Yt = { +b  if the bet is won, with probability p
     { −b  if the bet is lost, with probability 1 − p = q

In the above case, the expectation of Y(t) is given as

E[Y(t)] = bp − b(1 − p) = bp − b + bp = 2pb − b = b(2p − 1)



The wealth can be recursively written as follows:

W (1) = W (0) + Y (1)


W (2) = W (1) + Y (2)
= W (0) + [Y (1) + Y (2)]
W (t) = W (0) + [Y (1) + Y (2) + . . . + Y (t)]
W(t) = W0 + Σ_{i=1}^t Y(i)

Applying linearity of expectation to the last equation,

E[W(t)] = E[W0] + Σ_{i=1}^t E[Y(i)]
E[W(t)] = W(0) + E[Y(i)] · t
E[W(t)] = W(0) + (2p − 1)bt

When W0 = 20, b = 1 and p = 0.55,

2p − 1 = 1.1 − 1 = 0.1

∴ E[W(t)] = W(0) + 0.1t = 20 + 0.1t


[Figure: possible trajectories of W(t); all realisations of wealth hover around the trend line 20 + 0.1t.]

To find the average tendencies, we assume W (t − 1) > 0

W (t) = W (t − 1) + Y (t)
E[W (t) | W (t − 1)] = E[W (t − 1) | W (t − 1)] + E[Y (t) | W (t − 1)]
= W (t − 1) + E[Y (t) | W (t − 1)]
since Y(t) is independent of previous wealth
= W (t − 1) + E[Y (t)]
= W (t − 1) + (2p − 1)b

Now conditioning on W (t − 2)
W (t) = W (t − 1) + Y (t)
W (t − 1) = W (t − 2) + Y (t − 1)
W (t) = W (t − 2) + Y (t − 1) + Y (t)
E[W (t) | W (t − 2)] = W (t − 2) + (2p − 1)b + (2p − 1)b

Proceeding recursively t times gives us

E[W (t) | W0 ] = W0 + t(2p − 1)b


This analysis is not entirely correct as W (t) might go to zero at which point the
game would come to an end.
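
A simulation makes the effect of the absorbing barrier at zero visible. The Python sketch below uses the parameters of the example; the horizon t_max is an arbitrary cutoff:

    import numpy as np

    rng = np.random.default_rng(5)

    def play(w0=20, b=1, p=0.55, t_max=1_000):
        """Simulate W(t) for t_max bets, stopping early on ruin."""
        w = w0
        for _ in range(t_max):
            w += b if rng.random() < p else -b
            if w <= 0:
                return 0
        return w

    runs = [play() for _ in range(10_000)]
    print(np.mean(runs))                     # somewhat below 20 + 0.1 * 1000 = 120
    print(np.mean([r == 0 for r in runs]))   # fraction of ruined trajectories

The average realised wealth falls slightly below the trend line precisely because some trajectories are absorbed at zero.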

19.4 Discrete Time Markov Chains


Consider a stochastic process characterised by discrete time indices n = 0, 1, 2, . . .
and time dependent random state Xn . The state Xn takes values in a countable
number of discrete states, S = {0, 1, 2, 3, . . . , i, i + 1, . . .}. The history of the stochastic
process, Xn, Xn−1, Xn−2, . . . , X0, is deterministic in nature as it is known to the
observer. The stochastic process Xn is a Markov Chain if

P[Xn+1 = j | Xn = i, Xn−1, . . . , X0] = P[Xn+1 = j | Xn = i] = pij
• Future state Xn+1 depends only on current state Xn and the history of the
process Xn−1 is irrelevant for future evolution of the process.
• Probabilities pij are constant for all times (time invariant)
By definition for an arbitrary m,
P [Xn+m | Xn , Xn−1 ] = P [Xn+m | Xn ]
• Xn+m depends only on Xn+m−1 which only depends on Xn+m−2 , . . . which
depends only on Xn .
• The pij's are probabilities such that pij ≥ 0 and Σ_{j=1}^∞ pij = 1, and they are the
entries of the transition probability matrix.
The transition probability matrix P is given by

        | p00   p01   p02   . . . |   (from state 0)
        | p10   p11   p12   . . . |   (from state 1)
P =     | p20   p21   p22   . . . |   (from state 2)
        |  ..    ..    ..         |
        | pi0   pi1   pi2   . . . |   (from state i)
        |  ..    ..    ..         |

where the columns are indexed by the destination states 0, 1, 2, . . .

19.4.1 Graphic Representation


A Markov chain with an infinite number of states can be represented graphically as follows:

• The system can enter a given state either from the previous state with prob-
ability pi−1,i or from the next state with probability pi+1,i or it can stay in the
same state with probability pii .

• The system can exit a given state either to the previous state with probability
pi,i−1 or to the next state with probability pi,i+1 .

19.5 Multiple step Transition Probability


We will now study the case beyond one step transitions and look at the model
when there can be two or more step transitions in time.
The transition probability over two time slots is P_ij^(2) = P[X_{m+2} = j | X_m = i].

Note: Here the superscript (2) does not denote a square of the matrix entry; it
denotes the number of time steps over which the transition probability is computed.

The probability of X_{m+n} given X_m defines the n-step transition probability

P_ij^(n) = P[X_{m+n} = j | X_m = i]

The relation between the n-step, m-step and (m + n)-step transition probabilities is given
by the famous Chapman-Kolmogorov equation, proven later in this chapter:

p_ij^(m+n) = Σ_k p_ik^(m) · p_kj^(n),   i.e.,   P^(m+n) = P^(m) · P^(n)



19.5.1 Two step transition probability


Let us look at the proof for the two step transition probability which will be later
extended to m + n steps.

p_ij^(2) = P[X_{n+2} = j | X_n = i]

Using the law of total probability,

p_ij^(2) = Σ_{k=1}^∞ P[X_{n+2} = j, X_{n+1} = k | X_n = i]

By the rules of conditional probability, we can say that

p_ij^(2) = Σ_{k=1}^∞ P[X_{n+2} = j | X_{n+1} = k, X_n = i] · P[X_{n+1} = k | X_n = i]

Finally, by the memoryless property of Markov chains, we can let go of the past
states of the system, since the future depends only on the immediate past. This
gives us

p_ij^(2) = Σ_{k=1}^∞ P[X_{n+2} = j | X_{n+1} = k] · P[X_{n+1} = k | X_n = i]

∴ p_ij^(2) = Σ_{k=1}^∞ p_kj · p_ik = Σ_{k=1}^∞ p_ik · p_kj

Thus, we can say

P^(2) = P · P

In terms of matrix multiplication, p_ij^(2) is the dot product of the i-th row of P with
the j-th column of P; for fixed i and j it is a single matrix element.

19.5.2 Multiple Step transition probability


We will now look at the proof for the Chapman Kolmogorov equations by extend-
ing our previous arguments of law of total probability, conditional probability and
memoryless property of Markov Chains.

p_ij^(m+n) = P[X_{m+n} = j | X_0 = i]

p_ij^(m+n) = Σ_{k=1}^∞ P[X_{m+n} = j, X_m = k | X_0 = i]

p_ij^(m+n) = Σ_{k=1}^∞ P[X_{m+n} = j | X_m = k, X_0 = i] · P[X_m = k | X_0 = i]

Removing the past X_0 = i by the Markov property,

p_ij^(m+n) = Σ_{k=1}^∞ P[X_{m+n} = j | X_m = k] · P[X_m = k | X_0 = i]

p_ij^(m+n) = Σ_{k=1}^∞ p_kj^(n) · p_ik^(m) = Σ_{k=1}^∞ p_ik^(m) · p_kj^(n)

This gives us the Chapman-Kolmogorov equation in matrix form:

P^(m+n) = P^(m) · P^(n)
• p_ik^(m) is the probability of going from X_0 = i to X_m = k, and p_kj^(n) is the probability
of going from X_m = k to X_{m+n} = j.

• As proved earlier, the product of p_ik^(m) and p_kj^(n) gives the probability of
going from X_0 = i to X_{m+n} = j while passing through X_m = k at time m.

• Since any state k might have occurred at time m, we sum over all k.

The Chapman Kolmogorov equation is given by,

P^(m+n) = P^(m) · P^(n)

where P^(m+n) is the (m + n)-step transition matrix, P^(m) is the m-step transition
matrix and P^(n) is the n-step transition matrix.

This allows us to make the following arguments,

P^(2) = P · P = P²
P^(n) = P^(n−1) · P = P^(n−2) · P · P = . . . = P^n

Theorem 19.5.1 The matrix of n-step transition probabilities P^(n) is given by the
n-th power of the transition probability matrix P: P^(n) = P^n.
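
The theorem is easy to confirm numerically for any stochastic matrix. A quick sketch, reusing the two-state mood chain from earlier in this chapter:

    import numpy as np

    P = np.array([[0.8, 0.2],
                  [0.7, 0.3]])
    m, n = 3, 4
    lhs = np.linalg.matrix_power(P, m + n)
    rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)
    print(np.allclose(lhs, rhs))   # True: P^(m+n) = P^m P^n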
Chapter 20

The Gambler’s ruin Framework

20.1 The Gambler’s ruin


We will now describe, with the help of an example, how a Markov process can be
modeled. In the gambler's ruin problem, here are some basic rules to keep in
mind.

• Think of yourself as a Gambler who places $b bets at each time step.

• One possible outcome is that you gain $b with probability p and you lose $b
with probability (1 − p) = q.

• You start playing the game with an initial wealth $ib.

• There exists a bias factor which is the ratio of the probability of losing to
the probability of winning and is denoted as:
α = q/p          (20.1)

• Concerning the bias factor we have three broad situations - first is that if
α > 1 or q > p then you are more likely to lose than win the bet. Second
is when α < 1 or q < p you are more likely to win rather than lose the bet.
Third is when α = 1 or p = q = 0.5 which means that the gamble is a fair
game.

• There are two boundary conditions that state that - You keep playing until
you go broke or you reach a maximum possible wealth of $bN .

• Given all these rules of the game, we are interested in answering the question
- What is the probability Si of reaching $bN before going broke given that our
initial wealth is $ib ?

20.1.1 Setting up the model


The above problem sets us up on fertile ground to use a Markov modeling ap-
proach. We can set up the model in the following steps:


• We first set up the state space - this would obviously consist of positive and
negative increments of bet amount $b. However we choose to normalize the
state space by dividing each value in the set by $b to ultimately get:
S = {0, 1, · · · , i − 1, i, i + 1, · · · , N } (20.2)

• We have constant transition probabilities for single up steps and down steps:
p_{i,i+1} = p is the probability of going from state i to state i + 1 in the next
time step, and p_{i,i−1} = q is the probability of going from state i to state
i − 1 in the next time step.
• Furthermore, we have p00 = pN N = 1 which states that the probability of us
transitioning from state 0 to state 0 and from state N to state N is 1. This is a
special case wherein the states 0 and N are said to be absorbing states. The
gambler will eventually end up in one of these states. The remaining states
are said to be transient.
• We denote X0 = i as our initial state (an initial wealth of $ib). Lastly, a diagrammatic
representation of the model is shown below:

Figure 20.1: Markov chain

20.1.2 Solving the model: Getting the expressions


Before we get to solving the model, let us recall the expression that we are looking
for - We want to know Si , the probability of a successful betting run of reaching $bN
before going broke given an initial wealth level of $ib.
• We write down our target probability, conditional on the initial wealth state:
Si = P [Xn = N |X0 = i] (20.3)

• Now we apply the law of total probability by defining an intermediate step
and summing over all possible states X1 of that intermediate step.
Si = Σ_{k∈{i+1, i−1}} P[Xn = N, X1 = k | X0 = i]          (20.4)

   = P[Xn = N, X1 = i + 1 | X0 = i] + P[Xn = N, X1 = i − 1 | X0 = i]          (20.5)

   = P[Xn = N | X1 = i + 1, X0 = i] · P[X1 = i + 1 | X0 = i]
     + P[Xn = N | X1 = i − 1, X0 = i] · P[X1 = i − 1 | X0 = i]          (20.6)

where P[X1 = i + 1 | X0 = i] = p and P[X1 = i − 1 | X0 = i] = q.
Chapter 20. The Gambler’s ruin Framework 237

• In the conditional terms above we apply the Markov property: we essentially
throw away the past and keep only the present state.

• After applying the Markov property, we get the main expression as:

Si = P[Xn = N | X1 = i + 1] · p + P[Xn = N | X1 = i − 1] · q          (20.7)

where P[Xn = N | X1 = i + 1] = S_{i+1} and P[Xn = N | X1 = i − 1] = S_{i−1}, so that

Si = S_{i+1} p + S_{i−1} q          (20.8)

• Noting that p + q = 1 we can rewrite the above expression as:

Si (p + q) = Si+1 p + Si−1 q (20.9)

Si p + Si q = Si+1 p + Si−1 q (20.10)


p[Si+1 − Si ] = q[Si − Si−1 ] (20.11)

• Recall that the bias factor is given by α = q/p. We can further rewrite the
above equation in terms of the bias factor as:

(Si+1 − Si ) = α(Si − Si−1 ) (20.12)

20.1.3 Solving the model: solving recursively


Continuing from the previous section, we must note that if the current state is
i = 0, then we have S0 = 0 - that is the probability of winning the game is 0 since
we will always be in state 0. With this, we now apply recursive definitions to get
an analytic expression for Si .
• Continuing from equation (20.12), we first set i = 1.

(S2 − S1 ) = α(S1 − S0 ) = αS1 (20.13)

• Similarly, for i = 2 we get:

(S3 − S2 ) = α(S2 − S1 ) = α2 S1 (20.14)

• Indeed we can see a pattern emerging here. We can then generalize the
above expression as:

(Si − Si−1 ) = α(Si−1 − Si−2 ) = αi−1 S1 (20.15)

• Now if we sum all the above expressions, the middle terms like - S2 , S3 , · · · , Si−1
will simply cancel out and we will be left with:

Si − S1 = S1 [α + α2 + · · · + αi−1 ] (20.16)

Si = S1 [1 + α + α2 + · · · + αi−1 ] (20.17)

• Note that in the above equation, the term in brackets is a geometric series. Now
assuming that α ≠ 1, we can write the general form of the above expression as:

Si = S1 · (1 − α^i)/(1 − α)          (20.18)

• We will now apply the first boundary condition, SN = 1, which states that
when i = N the probability of winning is 1. We can then substitute SN for Si
and write:

1 = SN = S1 · (1 − α^N)/(1 − α)          (20.19)

S1 = (1 − α)/(1 − α^N)          (20.20)
• Now that we have an analytic expression for S1 , we can substitute this into
the general formula of Si to obtain:
Si = (1 − α^i)/(1 − α^N)          (20.21)
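
The closed form can be cross-checked against a direct simulation of the chain. A sketch with arbitrarily chosen parameters p = 0.45, N = 10, i = 5:

    import numpy as np

    rng = np.random.default_rng(6)

    def success_prob(i, N, p, trials=20_000):
        """Monte Carlo estimate of S_i: reach N before 0, starting from i."""
        wins = 0
        for _ in range(trials):
            s = i
            while 0 < s < N:
                s += 1 if rng.random() < p else -1
            wins += (s == N)
        return wins / trials

    p, N, i = 0.45, 10, 5
    alpha = (1 - p) / p
    print(success_prob(i, N, p))              # approximately 0.27
    print((1 - alpha**i) / (1 - alpha**N))    # exact S_i from (20.21)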

20.1.4 The model under various bias factors


We will now see what happens to the above expression with different values of the
bias factor. Further we will also check what happens when the wealth levels tend
towards infinity.
• Taking the first case wherein α = 1, we have:
Si = S1 [1 + 1 + · · · + 1] = iS1 (20.22)
Further, if i = N then we would have SN = N S1 = 1 and consequently,
S1 = (1/N ). Finally the expression for Si would become:
Si = i/N          (20.23)
• Now we see what happens when α > 1. In this case the general expression
for Si would become:

Si = (α^i − 1)/(α^N − 1)          (20.24)

Now as N → ∞ we would have α^N → ∞, since the probability of losing is
greater than the probability of winning. Therefore in this case you will almost
surely lose all your money. The final expression would be:

Si ≈ (α^i − 1)/α^N → 0 as N → ∞          (20.25)
• In the third case we see what happens when α < 1. As N → ∞, the
expression evaluates to:

Si = (1 − α^i)/(1 − α^N) → (1 − α^i) as N → ∞          (20.26)

20.2 Transient and Recurrent states


Transient states are states that might be visited in the beginning, but eventually
the visits stop. What we are essentially saying is that the random process does
not take on transient state values after some time. This could be stated as:
almost surely, Xn ≠ i as n → ∞. On the other hand, visits to recurrent states keep
happening infinitely often, so for any arbitrary m we say: almost surely, Xn = i
for some n ≥ m. Here are some important conditions and definitions for transient
and recurrent states.

• We define fi as the probability that starting at state i (X0 = i), the DTMC
ever reenters state i. This can be written as:
"∞ #  
[ [∞
fi = P Xn = i|X0 = i = P  Xn = i|Xm = i (20.27)
n=1 n=m+1

• We say that state i is recurrent if fi = 1. That is, the DTMC reenters state i
again and again almost surely, infinitely often.

• We say that state i is transient if fi < 1. This essentially means that there is
a positive probability (1 − fi ) of never coming back to transient state i.

20.2.1 An illustrative example


Concepts from the previous section can be explained in better detail by taking an
example presenting the visual representation of the same. Consider the model
shown below.

Figure 20.2: Markov model with transient and recurrent states

• Firstly, we see that state R3 is recurrent because the following condition


holds:
P [X1 = R3 |X0 = R3 ] = 1 (20.28)

• Now we will show that state R1 is also recurrent. At the first time step
(n = 0) we have:
P [X1 = R1 |X0 = R1 ] = 0.3 (20.29)
• Now at the second time step (n = 1) we can go from R1 to R2 and then from
R2 to R1 .
P[X2 = R1, X1 ≠ R1 | X0 = R1] = (0.7)(0.6)          (20.30)
• Similarly we can calculate the probability that the random process would
take on the value of state R1 after 3 time steps (n = 3). So we go from R1 to
R2 , loop within R2 for one time step and return to R1 .
P[X3 = R1, X2 ≠ R1, X1 ≠ R1 | X0 = R1] = (0.7)(0.4)(0.6)          (20.31)

• In a similar manner, we can get the probability of coming back to state R1


after n time steps as follows:
P[X_{n+1} = R1, Xn ≠ R1, · · · , X1 ≠ R1 | X0 = R1] = (0.7)(0.4)^(n−1)(0.6)          (20.32)

• Since we are checking for the process returning to state R1 in any number of
time steps, we sum across all possible paths with different values of n.

fi = Σ_{n=0}^∞ P[X_{n+1} = R1, Xn ≠ R1, · · · , X1 ≠ R1 | X0 = R1]          (20.33)

fi = P[X1 = R1 | X0 = R1] + Σ_{n=1}^∞ (0.7)(0.4)^(n−1)(0.6),   where P[X1 = R1 | X0 = R1] = 0.3          (20.34)

fi = 0.3 + 0.7 [ Σ_{n=1}^∞ (0.4)^(n−1)(0.6) ]          (20.35)

fi = 0.3 + 0.7 = 1          (20.36)
• Finally we note that since, as per our definition, fi = 1, we can say that R1 is,
in fact, a recurrent state.

20.3 Communication between states


When we say that state i and j communicate with each other, it means that state
i is accessible from j and j is accessible from i (i ⇐⇒ j). We say that state j
is accessible from state i if there is a nonzero nth step transition probability from
i to j. We can make a similar statement for the other way round as well. These
conditions can be expressed as follows:
(i → j):   p_ij^(n) > 0          (20.37)

(j → i):   p_ji^(m) > 0          (20.38)
The above two conditions, taken together imply that the states communicate
(i ⇐⇒ j). Here are some key properties about this equivalence relation:

• Reflexive: The state i is accessible from itself; i ⟺ i holds since p_ii^(0) = 1.

• Symmetry: This states that if i ⟺ j then j ⟺ i. The relation i ⟺ j holds
precisely because both transition probabilities are positive (p_ij^(n) > 0 and
p_ji^(m) > 0).

• Transitivity: This states that if i ⟺ j and j ⟺ k, then i ⟺ k.
This property can be shown using the Chapman-Kolmogorov relation as
follows:

p_ik^(n+m) = Σ_j p_ij^(n) p_jk^(m) ≥ p_ij^(n) p_jk^(m) > 0          (20.39)

p_ki^(l+r) = Σ_j p_kj^(l) p_ji^(r) ≥ p_kj^(l) p_ji^(r) > 0          (20.40)

From the above two equations we gather that (i → k) and (k → i); hence
it follows that i ⟺ k.

• Lastly we note that the equivalence relation partitions the set of states
into disjoint classes.

20.4 Characterizing number of visits


We will now look at another random variable within this framework - the number
of visits to state i given that we start in state i (X0 = i). This random variable is
denoted as follows:

Ni = Σ_{n=1}^∞ I{Xn = i | X0 = i}          (20.41)

The I in the above relation denotes an indicator random variable that can be
defined as follows:
I{Xn = i | X0 = i} = { 1, if Xn = i given X0 = i
                     { 0, if Xn ≠ i given X0 = i          (20.42)

We now go a bit further within this new framework and define the probability of
revisiting state i exactly n times. This is given below:

P[Ni = n] = f_i^n (1 − f_i)          (20.43)

Note that the (1 − fi ) expression is crucial here since it denotes the probability of
escaping from state i. This is because we only want our random variable to visit
the state i exactly n times. So, at the (n + 1)th step we escape from this state.
The above expression indeed looks familiar - it is the definition of a geometric
random variable. We can then say that:

Ni ∼ geometric(1 − fi ) (20.44)

Since we have now effectively characterized a distribution for the random variable,
we can compute the expected number of visits to state i given X0 = i. This is
shown as:

E[Ni] = Σ_{n=1}^∞ n f_i^n (1 − f_i) = f_i / (1 − f_i)          (20.45)

Note that if i is a recurrent state, then fi = 1 and consequently the expected
number of visits to state i is infinite (indeed, the number of visits is almost surely
infinite). Here is an explicit computation for the expected number of visits using
the linearity of expectations rule:

E[Ni] = Σ_{n=1}^∞ E[I{Xn = i | X0 = i}] = Σ_{n=1}^∞ p_ii^(n)          (20.46)

We can now state the first part of this theorem: state i is transient iff fi < 1, in
which case

E[Ni] = f_i/(1 − f_i) < ∞ ⟺ Σ_{n=1}^∞ p_ii^(n) < ∞          (20.47)

The second part of the theorem states that state i is recurrent iff fi = 1, in which
case

E[Ni] = ∞ ⟺ Σ_{n=1}^∞ p_ii^(n) = ∞          (20.48)
Chapter 21

Setting the DTMC Framework

21.1 Limiting properties


• Markov chains have one step memory and they eventually forget the initial
state X0 .
πj = lim_{n→∞} P[Xn = j | X0 = i] = lim_{n→∞} p_ij^(n)          (21.1)

• It is implicitly assumed that the limit is independent of the initial state X0 =


i.

• For the probability transition matrix below, the matrix power converges, so the
probabilities become independent of the time n as n → ∞. All rows of the limit
are equal, which means the probabilities are also independent of the initial
condition.

P^30 = | 0.6  0.4 |
       | 0.6  0.4 |          (21.2)

• The period of a state i is denoted di. We say that a state i is periodic with
period di iff p_ii^(n) ≠ 0 only when n is a multiple of di, and di is the largest
number with this property. Alternatively stated: there is a positive probability
of returning to state i only every di time steps.

• We note that if period di = 1 then the state i is aperiodic.

• We say that state i is recurrent if the MC returns to state i with probability
1. The recurrence condition is written mathematically as:

Σ_{n=1}^∞ p_ii^(n) = ∞          (21.3)

• State i is said to be positive recurrent if the expected return time to state i is
finite:

E[return time to i] = Σ_{n=1}^∞ n p_ii^(n) ∏_{m=0}^{n−1} (1 − p_ii^(m)) < ∞          (21.4)


• State i is said to be null recurrent if i is recurrent and the expected return


time to i is infinite:
E[return time to i] = ∞ (21.5)

• Positive and null recurrence are class properties and recurrent states in a
finite MC are positive recurrent.

• Ergodic states are those that are positive recurrent and aperiodic. An irre-
ducible MC with ergodic states is said to be an Ergodic MC.

21.2 Example 1
• We look at the various probabilities of return times to state 0 as given below:
P[return time = 2] = 1 · (1/2) = 1/2    (21.6)

P[return time = 3] = 1 · (1/2) · (1/3) = 1/(2 · 3)    (21.7)

P[return time = n] = 1/((n − 1)n)    (21.8)

• Now, using the induction hypothesis, we can show the following general result
  to hold:

Σ_{m=2}^{k+1} 1/((m − 1)m) = k/(k + 1)    (21.9)

• Using the above result, we can show that state 0 is recurrent by proving that
  the probability of returning to 0 is 1:

Σ_{m=2}^{n} P[return time = m] = Σ_{m=2}^{n} 1/((m − 1)m)    (21.10)

= 1/(1·2) + 1/(2·3) + 1/(3·4) + ···    (21.11)

= (n − 1)/n  →  1  as n → ∞    (21.12)
• Now we look to characterize the expected return time to state 0:

E[return time to 0] = Σ_{n=2}^{∞} n P[return time = n] = Σ_{n=2}^{∞} n · 1/((n − 1)n)    (21.13)

= Σ_{n=2}^{∞} 1/(n − 1)  →  ∞  as n → ∞    (21.14)

  The above result holds true due to the divergence property of the Harmonic
  series. Because of this result we can say that state 0 is null recurrent.
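A minimal numeric check of the two series above (plain Python, assuming nothing beyond the formulas just derived): the probability partial sums approach 1, while the expected-return-time partial sums grow without bound.

# Partial sums for Example 1, where P[return time = n] = 1/((n-1)n).
N = 10_000
prob_mass = sum(1.0 / ((m - 1) * m) for m in range(2, N + 1))
expected  = sum(n * (1.0 / ((n - 1) * n)) for n in range(2, N + 1))

print(prob_mass)  # -> 0.9999 = (N-1)/N, tending to 1: state 0 is recurrent
print(expected)   # -> ~9.79, grows like log N: expected return time diverges (null recurrent)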

21.3 Irreducible MC properties


• We say that for an irreducible ergodic DTMC, the limit lim_{n→∞} p_{ij}^{(n)} exists and
  is independent of the initial state i:

π_j = lim_{n→∞} p_{ij}^{(n)}    (21.15)

• The steady state probabilities given as π_j ≥ 0 are the unique non-negative
  solutions for the following system of linear equations:

π_j = Σ_{i=0}^{∞} π_i p_{ij}    (21.16)

Σ_{j=0}^{∞} π_j = 1    (21.17)

• The limiting probabilities given above are independent of the initial state
  X_0 = i, and the above algebraic equations can be solved to find the π_j.

• We note an important point - no periodic states, no transient states,
  no null recurrent states and no multiple classes should be present for limit
  probabilities to exist. Consider the following result:

P[X_{n+1} = j] = Σ_{i=1}^{∞} P[X_{n+1} = j, X_n = i]    (21.18)

= Σ_{i=1}^{∞} P[X_{n+1} = j | X_n = i] P[X_n = i]    (21.19)

= Σ_{i=1}^{∞} p_{ij} P[X_n = i]    (21.20)

  If the limit exists as n → ∞, we can say that:

P[X_{n+1} = j] = P[X_n = j] = π_j    (21.21)

21.4 Example 2
• We now look at an example, characterizing stationary distribution properties
using a geometric random variable X ∼ geom(p).

P[X = k] = P[T_1, T_2, ···, T_{k−1}, H_k] = q^{k−1} p    (21.22)

• We know that the expected value of a geometric random variable can be
  expressed as:

E[X] = 1/p    (21.23)

• An alternative characterization of the expected value is given below:

E[X] = Σ_{k=1}^{∞} k P[X = k] = Σ_{k=1}^{∞} k q^{k−1} p    (21.24)

= p Σ_{k=1}^{∞} k q^{k−1} = 1/p    (21.25)

=⇒ Σ_{k=1}^{∞} k q^{k−1} = 1/p²    (21.26)

=⇒ (1/q) Σ_{k=1}^{∞} k q^{k} = 1/p²    (21.27)

• Now consider a MC with two states (with self loops and equal probabilities),
  using which we denote the following return probabilities for state 1:

P[T_1 = 1] = 1/2 = q = p  =⇒  1/p = 2    (21.28)

P[T_1 = 2] = (1/2)²    (21.29)

P[T_1 = k] = (1/2)^k    (21.30)

• We now compute the expected return time to state 1 as follows:

E[T_1] = Σ_{k=1}^{∞} k P[T_1 = k] = Σ_{k=1}^{∞} k (1/2)^k = Σ_{k=1}^{∞} k q^k = 2    (21.31)

• The fact that the above sum is equal to 2 can be proved using the result from
  equation (21.27) as follows:

(1/q) Σ_{k=1}^{∞} k q^k = 1/p²  =⇒  Σ_{k=1}^{∞} k q^k = q/p² = (1/2)/(1/4) = 2    (21.32)

• We can see that since the expected time to re-enter state 1 is finite, we say
  that π_1 = π_2 = 1/2 is the stationary distribution. Also, we say that state 1 is
  positive recurrent due to the finite expected return time.

21.5 DTMC matrix equations


• We write the limiting distribution in matrix form (for a DTMC with |S| = J
  states) as follows:

lim_{n→∞} P^n = [ π_1  π_2  ···  π_J
                  π_1  π_2  ···  π_J
                   ⋮    ⋮   ···   ⋮
                  π_1  π_2  ···  π_J ]    (21.33)

We can clearly see above that as before in the limiting case, all the rows have
same probabilities which in turn implies that the probabilities are indepen-
dent of the initial state X0 = i.
• So we can write the probability distribution vector after the nth step as fol-
  lows:

lim_{n→∞} P^{(n)} = lim_{n→∞} [P^T]^n P^{(0)} = [π_1, π_2, ···, π_J]^T    (21.34)

=⇒ P^{(n)T} = P^{(0)T} [P]^n    (21.35)


• The vector format of the stationary distribution can be represented as follows:
π = [π_1, π_2, ···, π_J]    (21.36)

• Given a vector of 1s denoted as 1, we can write the limiting condition system
  of equations as:

π = P^T π    (21.37)

π^T · 1 = 1    (21.38)
• The above system of equations is solved by using eigenvectors: the principal
  eigenvector corresponds to the largest eigenvalue, which equals 1, and all
  other eigenvectors are associated with eigenvalues of modulus less than 1.
  If that were not the case, then P^n would diverge. A small sketch of this
  computation follows.
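To make the eigenvector remark concrete, here is a minimal numpy sketch (the helper name is ours) that recovers π as the eigenvector of P^T associated with the eigenvalue 1. The two-state matrix used below is the same chain that reappears in the worked example of section 26.1.1.

import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P, taken as the
    eigenvector of P^T whose eigenvalue is 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(eigvals - 1.0))   # locate the eigenvalue closest to 1
    v = np.real(eigvecs[:, k])
    return v / v.sum()                     # normalize so that pi^T . 1 = 1

P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
print(stationary_distribution(P))          # -> [0.2857..., 0.7142...] = (2/7, 5/7)
# The other eigenvalue of P is 0.3, with modulus < 1, which is why P^n converges.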

21.6 Example 3
• We consider the example of the following aperiodic, irreducible MC, charac-
terized by the following transition probability matrix:
 
P = [ 0    0.3  0.7
      0.1  0.5  0.4
      0.1  0.2  0.7 ]    (21.39)

• Since the MC contains loops, it is an aperiodic MC with one recurrent class


R1 = {1, 2, 3}. The number of states are also finite, so it is a positive recurrent
class.
• Due to the above properties we say that π_1, π_2, π_3 exist and do not depend on
  the initial state. The following general result holds true:

π_j = lim_{n→∞} p_{ij}^{(n)}    (21.40)

• The system can then be written as follows:

π_j = Σ_{i=1}^{3} π_i p_{ij}    (21.41)

Σ_{i=1}^{3} π_i = 1    (21.42)

• The above two equations can be put into a system of equations in matrix
  format as follows:

[ π_1 ]   [ 0    0.1  0.1 ] [ π_1 ]
[ π_2 ] = [ 0.3  0.5  0.2 ] [ π_2 ]    (21.43)
[ π_3 ]   [ 0.7  0.4  0.7 ] [ π_3 ]
[ 1   ]   [ 1    1    1   ]

  where the 4 × 3 coefficient matrix is P^T with the normalization row of 1s
  appended.

• Since we have three unknown variables and 4 equations, we would imme-
  diately say that this system is overdetermined - however, because of the pres-
  ence of one dependent row, the system actually has 3 linearly independent
  equations.
• Denote the initial distribution as P[X_0 = i] = π_i^{(0)}. With this we can write
  the following:

P[X_1 = j] = Σ_{i=1}^{∞} P[X_1 = j, X_0 = i] = Σ_{i=1}^{∞} P[X_1 = j | X_0 = i] P[X_0 = i]    (21.44)

  where the first factor is p_{ij} and the second is π_i^{(0)}, so that:

=⇒ π_j^{(1)} = Σ_i p_{ij} π_i^{(0)}    (21.45)

• From this we say that if the probability distribution is unchanged for all n,
  then:

P(X_n = i) = π_i^{(n)} = π_i    (21.46)

• We say that the DTMC is stationary in a probabilistic sense - the states might
  change, but the probabilities don't. A numeric check of this stationary system
  is sketched below.
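Here is a minimal numpy sketch (our own, assuming nothing beyond equations (21.39) and (21.43)) that solves the stacked system by least squares; the printed values are our computation of the stationary vector for this chain.

import numpy as np

# Transition matrix from equation (21.39).
P = np.array([[0.0, 0.3, 0.7],
              [0.1, 0.5, 0.4],
              [0.1, 0.2, 0.7]])

# Stack (P^T - I) with the normalization row, as in equation (21.43):
# (P^T - I) pi = 0  together with  1^T pi = 1.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])

pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)       # -> approx [0.0909, 0.2987, 0.6104] = (1/11, 23/77, 47/77)
print(pi @ P)   # pi is unchanged by one more step: pi P = pi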

21.7 Example 4
• We let T_i^{(n)} denote the fraction of time spent in the ith state up to time n.
  This can be expressed with the help of an indicator random variable as:

T_i^{(n)} = (1/n) Σ_{m=1}^{n} I{X_m = i}    (21.47)

• With the above characterization, we can compute the expected fraction of
  time spent in state i as follows:

E[T_i^{(n)}] = (1/n) Σ_{m=1}^{n} E[I{X_m = i}] = (1/n) Σ_{m=1}^{n} P[X_m = i]    (21.48)

• The above indicator random variable can be characterized as follows:

I{X_m = i} = { 1, with probability P[X_m = i] ;  0, with probability 1 − P[X_m = i] }    (21.49)

• Now taking the limiting case, under the condition that the probability of being
  in state i is the same regardless of n, we get the following result:

E[T_i^{(n)}] = (1/n) Σ_{m=1}^{n} P[X_m = i]  →  (1/n) Σ_{m=1}^{n} π_i  as n → ∞    (21.50)

= (1/n)(π_i + π_i + ··· + π_i) = π_i    (21.51)

P[X_m = i] → π_i  as m → ∞  =⇒  E[T_i^{(n)}] → π_i  as n → ∞    (21.52)

• We note that for an ergodic DTMC, as n tends to infinity, the time average
  is equal to the ensemble average. This can be written as follows:

lim_{n→∞} t_i^{(n)} = lim_{n→∞} (1/n) Σ_{m=1}^{n} I{X_m = i} = E[T_i^{(n)}] = π_i    (21.53)

  where t_i^{(n)} is the time average of the trajectory.

• We note that for an ergodic DTMC, the same condition is true without the
  expectation as well:

lim_{n→∞} T_i^{(n)} = lim_{n→∞} (1/n) Σ_{m=1}^{n} I{X_m = i} = π_i    (21.54)
Chapter 22

Introducing Inhomogeneous DTMCs

22.1 Inhomogeneous DTMCs: Theorem


• First, we let χ = {X_n : n ≥ 0} be a time inhomogeneous Markov Chain,
  such that n ∈ N_0 = {0, 1, 2, ···, N}. We define this MC on a state space
  denoted as E.
• We also note that there exists a time homogeneous Markov Chain as well, de-
  noted by χ′ = {X′_n : n ≥ 0}. The state space over which this Markov Chain is
  defined is given by E × N_0 - the cross product set of the elements of E and
  the elements of N_0.
• Further, we say that the state of the inhomogeneous MC is the first order
  projection of the state of the associated homogeneous MC. This condition
  can be denoted as X_n = Pr_1(X′_n) for all n ∈ N_0. This denotes the projection
  in the first dimension.

22.1.1 Proof of the above theorem


• First we let χ denote the time inhomogeneous DTMC characterized by tran-
sition probabilities of the form:
Pn;(ij) = P [Xn+1 = j|Xn = i] (22.1)
Note that these probabilities depend on the specific time instant n.
• Now we define a tuple of 2 dimensional random variables given by X′_n =
  (X_n, n), defined for all values of n ∈ N_0.
• The resulting distribution of this MC, denoted as χ′ = {X′_n : n ∈ N_0}, is given
  as P′.
• By definition we can obtain X_n = Pr_1(X′_n) for all n ∈ N_0. Before moving
  ahead, we must first define the delta function given as follows:

δ_{k,0} = { 1, if k = 0 ;  0, if k ≠ 0 }    (22.2)


• With this framework in place, we can relate the time homogeneous MC
  transition probabilities of the 2 dimensional tuple random variable to the
  time inhomogeneous MC of the time dependent random variable. This is
  represented as follows:

P′[X′_0 = (i, k)] = δ_{k,0} P[X_0 = i]    (22.3)

• In the above equation we note that the general 2 dimensional tuple random
  variable given by X′_n = (X_n, n) is specified in terms of:

X′_0 = (X_0, 0) = (i, k)  =⇒  X_0 = i    (22.4)

• Now the projection condition that fetches us the original time inhomoge-
  neous random variable from the first dimension projection of the time ho-
  mogeneous random variable is given as:

X_n = Pr_1(X′_n)    (22.5)

X_0 = Pr_1(X′_0) = Pr_1[(X_0, 0)]    (22.6)

• The crucial point to grasp is that, given the fact that P[X_0 = i] holds for
  i ∈ E, we might express the following transition probabilities without the
  time index dependence:

P′_{(i,k),(j,l)} = P′[X′_{k+1} = (j, l) | X′_k = (i, k)] = δ_{l,(k+1)} P_{k;(ij)}    (22.7)

• We note that because of the above relation - it suffices to discuss the time
inhomogeneous MC in terms of an associated time homogeneous MC.

22.1.2 An illustrative example


• Let us assume a time inhomogeneous MC with only two possible states in
its state space - essentially the length of the state space is |S| = 2 - where the
two possible states are a and b. Their initial limiting distribution is given by:

π_0 = (0.25, 0.75), with 0.25 on state a and 0.75 on state b    (22.8)

• The associated time homogeneous MC with the 2 dimensional tuple state
  space could then be represented as follows, with the components indexed by
  (a,0), (b,0), (a,1), (b,1), (a,2), (b,2):

π′_0 = (0.25, 0.75, 0, 0, 0, 0)    (22.9)

• At this point it is instructive to recall the general equation relating the limit-
  ing probabilities to the transition matrix as follows (with π^{(n)} of size 1 × m
  and P of size m × m):

π^{(n+1)} = π^{(n)} P    (22.10)

• We must also note the various phases of transition across states for this pro-
  cess - the limiting distributions and the associated transition probability ma-
  trices:

X_0(π_0) →^{P_0} X_1(π_1) →^{P_1} X_2(π_2) →^{P_2} X_3(π_3)    (22.11)

• Now, with different transition probability matrices at different time steps, we
  can write the comprehensive partitioned transition probability matrix of the
  associated time homogeneous MC with 2 dimensional tuple states. Its rows
  are indexed by (a,0),(b,0),(a,1),(b,1),(a,2),(b,2) and its columns by
  (a,1),(b,1),(a,2),(b,2),(a,3),(b,3):

[ P_0  0    0
  0    P_1  0
  0    0    P_2 ]

• Therefore we can then write the following probability transition matrix as
  essentially a 3 × 3 partitioned matrix:

P = [ P_0  0    0
      0    P_1  0
      0    0    P_2 ]    (22.12)

• We can then write the following equations that relate the limiting distri-
  butions across different time steps to the individual probability transition
  matrices:

π_1 = π_0 P_0    (22.13)

π_2 = π_1 P_1    (22.14)

π_3 = π_2 P_2    (22.15)
• Further, the time homogeneous limiting distribution vectors can be ex-
  pressed as follows:

π(0) = [π_0  0  0]    (22.16)

π(1) = [0  π_1  0]    (22.17)

π(2) = [0  0  π_2]    (22.18)

• The computation of transitioning across various limiting distributions can be
  shown below:

π(1) = π(0) P = [π_0  0  0] [ P_0 0 0 ; 0 P_1 0 ; 0 0 P_2 ] = [π_1  0  0]    (22.19)

• We will make a slight notational correction in the above result by placing
  the non zero value of the distribution vector in the middle element of
  the array - then we shall continue the process as follows:

π(2) = π(1) P = [0  π_1  0] [ P_0 0 0 ; 0 P_1 0 ; 0 0 P_2 ]  =⇒  [0  0  π_2]    (22.20)

• Again, the final vector output in the previous equation is a result of the correc-
  tion we applied after equation (22.19). Finally, we do the following computation
  to get the third distribution as:

π(3) = [0  0  π_2] [ P_0 0 0 ; 0 P_1 0 ; 0 0 P_2 ] = [0  0  π_3]    (22.21)

• If we give specific values to our P_0, P_1, P_2 matrices and then compile the
  actual partitioned matrix, we could have a matrix like this (rows indexed by
  (a,0),(b,0),(a,1),(b,1),(a,2),(b,2); columns by (a,1),(b,1),(a,2),(b,2),(a,3),(b,3)):

[ 0.7  0.3  0    0    0    0
  0.4  0.6  0    0    0    0
  0    0    0.8  0.2  0    0
  0    0    0.5  0.5  0    0
  0    0    0    0    0.7  0.3
  0    0    0    0    0.4  0.6 ]

• The above matrix essentially specifies the partitioned transition matrix P
  that we have been demonstrating in the previous equations.

• With the help of such a framework, we can essentially transform our


computation of time inhomogeneous Markov Chains to the space of time
homogeneous Markov Chains.
Chapter 23

Random Walks and Gambler’s Ruin

We model the Gambler's Ruin problem as a random walk. The objective is to
calculate the probability that the player wins a particular desired amount, and the
time taken (number of steps) to win this amount. The desired amount could be a
positive number m greater than his current wealth n. It could even be 0, in which
case we are interested in finding the probability that the player goes broke, and
the number of time-steps taken to go broke.

Problem. (Gambler's Ruin) The player starts the game with an initial wealth of
amount n. A roulette wheel is spun and in each bet the player either wins $1
(event W) or loses an amount of $1 (event L). Let p be the probability of winning
and let q = 1 − p be the probability of losing the bet. The player keeps playing
until an amount of m is won or until they go broke.

A random walk system is used to model the behavior of the player’s gains and
losses. Through each time step, the player’s outcomes are recorded as a string of
W’s and L’s. For instance, {WLLLLLWLL...} is a sample string. For every W, the
random walker moves up a step from the initial position of n and for every L, they
move a step downward, relative to the previous position.

Note that the probability of winning the ith bet is mutually independent of that
of winning any past (or future) bets. Thus the random walk is a memoryless,
independent system. A random walk whose expected position after a move equals
its current position - which here requires the fair case p = 1/2 - is called a
martingale; not all random walks are martingales. If p ≠ 1/2
then the random walk is said to be biased and if p = 1/2 then the random walk is
unbiased. A random walk may or may not have boundaries. In our case though,
the random walk has boundaries at 0 and n + m.

Define W ∗ as the event that the RW hits the upper bound amount T = n + m,
before it hits 0, the bottom boundary. T here denotes the top boundary. Define D
as the event that the RW starts with amount n.


23.1 Problem
Our objective is to compute the probability that the player reaches the upper bound
T before going broke, denoted by X_n = P[W*|D = n]. The standard tree method
of computing the probability is complicated here, as the size of the sample space |S| is
infinite. It is theoretically possible to have an infinite alternating sequence of W and L
steps such that the player never wins amount m in finite time. One such sample path is
given by {WLWLWL...} and so on.

If the player starts off broke, then he would never be able to enter the game. Thus
the probability that he would gain wealth T is 0. Likewise, a player with initial
wealth of T, by construction of the game, is not interested in entering it. But for
a player entering the game with initial wealth in the range (0, T), it can be
proved that the probability of reaching a level of wealth T before going broke is
given by X_n = pX_{n+1} + (1 − p)X_{n−1}.

Proposition. For the aforementioned game, the probability of winning an amount T
before going broke is given by,

X_n = { 0, if n = 0
        1, if n = T
        pX_{n+1} + qX_{n−1}, if n ∈ (0, T) }

23.2 General Solution


The first two cases are fairly straight-forward. Given the initial wealth D, X_0 =
P[W*|D = 0] = 0. Likewise, X_T = P[W*|D = T] = 1. Now in order to prove the
last segment, we have two further cases based on the first bet. Let E be the event
that the player wins the first bet and Ē be the event that they lose the first bet.
When n lies in the range (0, T) we can write the required probability as,

X_n = P[W* | D = n]
    = P[W* ∩ E | D = n] + P[W* ∩ Ē | D = n]    (Law of Total Probability)    (23.1)

Using the definition of conditional probability, we have

X_n = P[E | D = n] · P[W* | E ∧ (D = n)] + P[Ē | D = n] · P[W* | Ē ∧ (D = n)]

X_n = p P[W* | D = n + 1] + (1 − p) P[W* | D = n − 1]

(due to mutual independence, conditioning on E ∧ (D = n) is equivalent to starting
over with D = n + 1, and similarly for Ē)

X_n = pX_{n+1} + (1 − p)X_{n−1}    (23.2)



Solution. The above equation is a linear homogeneous recurrence relation and can
be solved in the standard way of solving difference equations. We first assume
a trial solution and obtain a general solution; further, by applying the boundary
conditions, the arbitrary constants take fixed values and we obtain the particular
solution. We can rearrange the terms in the recurrence relation,

pX_{n+1} − X_n + (1 − p)X_{n−1} = 0    (23.3)


The boundary conditions are X_0 = 0 and X_T = 1. Let X_n = r^n be a trial solution.
Substituting in the above equation,

pX_{n+1} − X_n + (1 − p)X_{n−1} = 0
p r^{n+1} − r^n + (1 − p) r^{n−1} = 0
p r − 1 + (1 − p) r^{−1} = 0
p r² − r + (1 − p) = 0    (23.4)

The above equation is called a characteristic equation, the solutions to which are
plugged in to obtain the general solution of the linear homogeneous difference
equation.

r = (1 ± √(1 − 4p(1 − p))) / (2p) = (1 ± √(1 − 4p + 4p²)) / (2p)

r = (1 ± √((1 − 2p)²)) / (2p) = (1 ± (1 − 2p)) / (2p)

r = (2 − 2p)/(2p)  or  (2p)/(2p)

r = (1 − p)/p  or  1    (23.5)

The solutions to the characteristic equation are r1 = (1 − p)/p and r2 = 1. When


p = 0.5 then r1 = r2 = 1 and the game becomes an unbiased or fair game.

23.3 Particular Solution


23.3.1 Case 1: (p < 1/2)
In case p ≠ 0.5 (which is true in the case of a casino game), we obtain the general
solution as,

X_n = A r_1^n + B r_2^n

X_n = A ((1 − p)/p)^n + B(1)^n

X_n = A ((1 − p)/p)^n + B    (23.6)

On applying the boundary conditions we obtain the particular solution. The
first condition X_0 = 0 implies that A + B = 0 =⇒ A = −B. Likewise, the second
condition X_T = 1 implies that

X_T = A ((1 − p)/p)^T − A = 1

A = [((1 − p)/p)^T − 1]^{−1}  and

B = −[((1 − p)/p)^T − 1]^{−1}    (23.7)

Thus, the probability of winning an amount T before going broke, X_n, is solved in
terms of the initial wealth n and the probability of winning a single bet p:

X_n = (((1 − p)/p)^n − 1) / (((1 − p)/p)^T − 1)    (23.8)

In the case of a roulette, p < 1/2. Then (1 − p)/p > 1. And T = n + m. We have,

X_n ≤ ((1 − p)/p)^{n−T} = (p/(1 − p))^{T−n} = (p/(1 − p))^{m}    (23.9)

Theorem. If p < 1/2 (more likely to lose a bet), the probability of winning an
amount m before losing n is less than (p/(1 − p))^m.

P[win m before losing n] ≤ (p/(1 − p))^{m}    (23.10)

23.3.2 Case 2: (p = 1/2)


In the case of a fair game, p = 1/2. Thus, (1 − p)/p = 1. The general solution for
the recurrence relation is obtained as:

X_n = [An + B](1)^n = An + B    (23.11)

On applying the boundary conditions, we have,

0 = X_0 = B  =⇒  B = 0    (23.12)

1 = X_T = AT + B  =⇒  A = 1/T    (23.13)

Thus, X_n = n/T = n/(n + m)    (23.14)

Theorem. If p = 1/2 (fair game), the probability of winning an amount m before
losing n is n/(n + m).

P[win m before losing n] = n/(n + m)    (23.15)

23.4 Biased Random Walk


In an unbiased game, the random swings are the same in the upward as well as
downward transitions. In a biased game, p < 1/2, the drift outweighs the random
swings. The downward deterministic drift is given by (1 − 2p). This is the amount
we expect to lose on a steady basis. In other words,

E[loss on every bet] = (1 − 2p)    (23.16)

After x bets (or steps of the RW), we are drifted down by (1 − 2p)x. This is the
expected loss in x consecutive bets, whereas the random swings are only of the
order O(√x).

E[loss on x bets] = (1 − 2p)x > O(√x)    (23.17)

Since the √x term grows more slowly than the linear drift, the swings cannot save
the walk: it drifts downward and crashes almost surely. The probability of winning
m before going broke in an unfair game is computed in the previous section.
The only way in which the random walker avoids hitting a boundary is by playing
forever. The probability of playing forever is zero, so the two boundary events
account for all of the probability mass:

probability of playing forever = 0

prob. of going broke + prob. of winning = 1    (23.18)

There are sample points WLWLWLW... going on forever. By measure theory, when
we add up these sample points we get a probability of zero:

P[WLW...] + P[LW...] + ··· = 0    (23.19)
This motivates the question: How long does it take for one to go broke (or win)?
and is answered in the subsequent sections.
Define S to be the number of steps until the random walker hits the boundary. We
have En = E[S|D = n]. We claim that,


0,

 if n = 0, already broke.
En = 0, if n = T, already wealthy. (23.20)

1 + pEn+1 + (1 − p)En−1 , if 0 < n < T.

Let us focus on the last case, and take up the recurrence relation,

E_n = 1 + pE_{n+1} + (1 − p)E_{n−1}    (23.21)

where the 1 accounts for the first bet (and makes the recurrence inhomogeneous),
the pE_{n+1} term covers winning the first bet and starting over with n + 1, and the
(1 − p)E_{n−1} term covers losing it and starting over with n − 1.

This recurrence relation (an inhomogeneous linear recurrence) can be re-expressed
as follows. The boundary conditions are E_0 = E_T = 0.

Recurrence relation:  pE_{n+1} − E_n + (1 − p)E_{n−1} = −1    (23.22)

Homogeneous solution:  E_n = A ((1 − p)/p)^n + B,  for p ≠ 1/2    (23.23)

The particular solution can be derived by trying simple forms: a constant trial
solves only the homogeneous equation, so we try a linear form. Note that the
general solution is the sum of the homogeneous solution and the particular
solution.

Trial:  E_n = a (constant) - fails    (23.24)

Trial:  E_n = an + b    (23.25)

Substituting gives  a(2p − 1) = −1  =⇒  a = 1/(1 − 2p)    (23.26)

Set b = 0    (23.27)

Thus we have the solution given by:

General solution = Homogeneous Solution + Particular Solution    (23.28)

E_n = n/(1 − 2p) − (T/(1 − 2p)) · (((1 − p)/p)^n − 1) / (((1 − p)/p)^T − 1)    (23.29)

• Since we expect to lose (1 − 2p) on every bet and we start with n, as m → ∞, we
  have,

E_n ≈ n/(1 − 2p)
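And a similar sketch for the expected duration (23.29) (again, helper names our own); the last line prints the m → ∞ approximation n/(1 − 2p) for comparison.

import random

def expected_duration(n, T, p):
    """Expected number of bets until ruin or win, from (23.29); requires p != 1/2."""
    r = (1 - p) / p
    return n / (1 - 2*p) - (T / (1 - 2*p)) * (r**n - 1) / (r**T - 1)

def simulate_duration(n, T, p, trials=20_000, seed=0):
    """Monte Carlo estimate of E[S | D = n] for comparison."""
    rng, total = random.Random(seed), 0
    for _ in range(trials):
        x, steps = n, 0
        while 0 < x < T:
            x += 1 if rng.random() < p else -1
            steps += 1
        total += steps
    return total / trials

print(expected_duration(10, 20, 18/38))   # closed form, approx 91.8
print(simulate_duration(10, 20, 18/38))   # should be close
print(10 / (1 - 2*18/38))                 # the m -> infinity approximation n/(1-2p)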
Chapter 24

Memoryless Property

The memoryless property is a unique property exhibited by the exponential distri-
bution and its discrete analog, the geometric distribution. We shall observe in this
section how the distribution is forgetful of its past: at any given moment, a
process modelled by this distribution behaves as though it is at time zero.

24.0.1 Memoryless property of Exponential Distribution


Consider a random variable X such that it follows an exponential distribution with
λ as the parameter, X ∼ Exp(λ). We know that the PDF and CDF of the exponen-
tial distribution are given by:

f_X(x) = λe^{−λx} u(x)    (24.1)

F_X(x) = (1 − e^{−λx}) u(x)    (24.2)

where u(x) is the unit step function which is given by:


u(x) = { 1, if x ≥ 0 ;  0, if x < 0 }    (24.3)

We also note that the expected value and variance of the exponential distribution
are,

E[X] = 1/λ;   E[X²] = 2/λ²;   Var(X) = 1/λ²    (24.4)
Consider x to represent a time period. Starting from any time zero on the
timeline, there exists a positive number a on the timeline, and after a duration x
has passed the process reaches point x + a on the timeline. This is shown in Figure
24.1.
We can consider the event that the random variable X (modelling time in this
case) has a value X > x + a, given that the event X > a has occurred. The
memoryless property is stated as,

Memoryless Property:  P[X > (x + a) | (X > a)] = P[X > x]    (24.5)


Figure 24.1: Timeline, marking time zero (0), time a, and time (a + x).

Given that the event X > a has already occurred, the conditional probability
of X > (x + a) is the same as P[X > x]; that is, the process behaves as if the event
X > a had not even occurred, just as when it started at time zero.

Note that the right-hand side of the expression, P[X > x], is the same as
P[X > x | X > 0], and the event X > 0 refers to the entire sample space;
naturally P(E) = P(E|Ω), which likewise implies E(X) = E(X|Ω).

24.0.2 Proving the memoryless property


Consider the following scenario where one has to model whether a customer has
arrived or not, or whether a car engine has failed. Let X be the random variable
that denotes the time to failure of a car. The event E = (X > t) is the event that
the car engine has not failed by time t. Given that this event has already occurred
we ask if there is any change in the probability that the car engine would fail after
a further period s. This is shown in the Figure 24.2.

Figure 24.2: Time till the car engine fails, modelled by random variable X; the
event P[(X ≤ t + s) ∩ (X > t)] lives on the interval between t and t + s.

The required probability is: given the event E = (X > t), what is the conditional
probability of the event X ≤ (t + s)? This is given by the probability of the
intersection of the two events over the probability of E:

P[(X ≤ t + s) | (X > t)] = P[(X ≤ t + s) ∩ (X > t)] / P[X > t]    (24.6)

Since we know the CDF, we can calculate the tail probability as,

F_X(x) = P[X ≤ x] = (1 − e^{−λx}) u(x)

P[X > x] = 1 − F_X(x) = e^{−λx}  for x ≥ 0    (24.7)

From figure 24.2 the probability of the event P[(X ≤ t + s) ∩ (X > t)] is written
as,

P[(X ≤ t + s) | (X > t)] = P[(X ≤ t + s) ∩ (X > t)] / P[X > t]

= P[t < X ≤ (t + s)] / P[X > t]

= (F_X(t + s) − F_X(t)) / (1 − F_X(t)) = ((1 − e^{−λ(t+s)}) − (1 − e^{−λt})) / (1 − (1 − e^{−λt}))

= (e^{−λt} − e^{−λ(t+s)}) / e^{−λt} = 1 − e^{−λs} = F_X(s)

=⇒ P[(X ≤ t + s) | (X > t)] = P[X ≤ s]    (24.8)
Thus, the random process remembers only the present and not the past - hence
the name, memoryless property. A quick numeric check of (24.5) is sketched below.
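A quick empirical check of (24.5) (numpy assumed; the values of λ, a and x are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
lam, a, x = 2.0, 0.5, 1.0
X = rng.exponential(scale=1/lam, size=1_000_000)

# Empirical check of (24.5): P[X > x + a | X > a] = P[X > x].
lhs = np.mean(X[X > a] > x + a)
rhs = np.mean(X > x)
print(lhs, rhs, np.exp(-lam * x))   # all three should be close to e^{-lambda x}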

24.0.3 Shifting of PDF leads to memoryless property


Consider the conditional CDF and PDF of the event that X ≤ x given X > t (write
s = x − t for the elapsed time beyond t),

F_{X|X>t}(x | x > t) = P[(X ≤ x) | (X > t)] = 1 − e^{−λs} = 1 − e^{−λ(x−t)}    (24.9)

f_{X|X>t}(x | x > t) = d/dx [1 − e^{−λ(x−t)}] = λe^{−λ(x−t)}    (24.10)

Recall that the PDF of the original exponential distribution is of the same form. That
is,

f_X(x) = λe^{−λx}

f_{X|X>t}(x | x > t) = λe^{−λ(x−t)}    (24.11)

Thus, as depicted in Figure 24.3, the shifting property of the exponential distribu-
tion's PDF causes the memoryless property. We see from the figure that the PDF
is merely shifted by t in the exponent, whereas its shape suffers no change.


Figure 24.3: Memoryless Property of the exponential distribution arises from the
shifting nature of its PDF.
Chapter 25

The Poisson Process

25.1 Revisiting Notes on PDF


• Probability distribution of a continuous random variable is characterized by
its probability density function (PDF). Unlike the probability mass as in
the case of a discrete random variable, here we are dealing with a density
of probability - Probability per unit length of an interval.

• The probability density at a point x is given by the following (note that Δ
  denotes the length of the interval):

f_X(x) = lim_{Δ→0} P[x < X < x + Δ] / Δ    (25.1)

• The expression given in the numerator of the RHS of the above equation
could be written as a difference of CDF values as:

P [x < X < x + ∆] = FX (x + ∆) − FX (x) (25.2)

• Substituting the above relation, we can rewrite equation (25.1) in the following
  form (note that the following relation holds true if the CDF is differentiable
  at x):

f_X(x) = lim_{Δ→0} (F_X(x + Δ) − F_X(x)) / Δ = dF_X(x)/dx = F′_X(x)    (25.3)

• We can say that for a small δ value, the following relation would hold true:

P [x < X < x + δ] ≈ fX (x)δ (25.4)

25.2 Poisson as a Binomial Approximation


• The Poisson distribution turns out to be an approximation of a random vari-
able following a binomial distribution with parameters (n, p) such that n is


very large and p is very small. Suppose X follows a binomial distribution
with parameters n and p and let λ = np. We then have:

P{X = i} = (n! / ((n − i)! i!)) p^i (1 − p)^{n−i}    (25.5)

P{X = i} = (n! / ((n − i)! i!)) (λ/n)^i (1 − λ/n)^{n−i}    (25.6)

P{X = i} = (n(n − 1)···(n − i + 1) / n^i) · (λ^i / i!) · ((1 − λ/n)^n / (1 − λ/n)^i)    (25.7)
• Now we know that for a moderately sized λ and large n the following results
  would hold:

result 1:  (1 − λ/n)^n ≈ e^{−λ}    (25.8)

result 2:  n(n − 1)···(n − i + 1) / n^i ≈ 1    (25.9)

result 3:  (1 − λ/n)^i ≈ 1    (25.10)
• Using the above three results and plugging them into equation (25.7), we
  would get the Poisson distribution as:

P{X = i} = e^{−λ} λ^i / i!    (25.11)
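A minimal numeric illustration of this approximation (plain Python; λ and n are arbitrary choices of ours): for large n and p = λ/n, the binomial PMF is nearly indistinguishable from the Poisson PMF.

from math import comb, exp, factorial

lam, n = 3.0, 10_000          # large n, small p = lam/n
p = lam / n

for i in range(6):
    binom = comb(n, i) * p**i * (1 - p)**(n - i)
    poisson = exp(-lam) * lam**i / factorial(i)
    print(i, round(binom, 6), round(poisson, 6))   # the two columns nearly agree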

25.3 Recalling the Bernoulli Process


• In this random process, time is perceived in terms of discrete time slots
where Bernoulli trials occur.

• In each individual discrete time slot, the probability of a success (arrival) oc-
  curring is taken as p.

• Naturally then, if we consider n time slots, then the probability of a certain


number of arrivals in those n discrete time slots would follow a binomial
distribution.

• Here, the interarrival times would follow a geometric PMF. Recall that the
interarrival time characterizes the time until first arrival.

• Furthermore, the time until k arrivals happen follows a Pascal distribution. We
  define Y_k = T_1 + T_2 + ··· + T_k, where T_i is the interarrival time from the
  (i − 1)th arrival to the ith arrival. Each T_i is independent and follows
  a geometric distribution such that T_i ∼ geom(p). The Pascal PMF is then
  given as:

P(Y_k = t) = C(t − 1, k − 1) p^k (1 − p)^{t−k}    (25.12)

25.4 The Poisson Process


• The Poisson process is nothing but a continuous time version of the previ-
ously described Bernoulli process. Here intervals of the same length have
the same probabilistic behavior.

• The probability of k arrivals in a fixed time duration τ is given by P(k, τ),
  and all possible values of this probability function over various values of k
  sum to 1.

• The property of time homogeneity prevails in this process. This essentially


means that the probability P (k, τ ) depends on the lag length τ and not on
the exact location of the time interval. Just like the Bernoulli process has the
same probability of success p for each discrete time slot, the Poisson process has
the same probability of k arrivals in each time interval. Here also, disjoint time
intervals are independent.

• The number of arrivals in disjoint time intervals are independent. We further


note that λ denotes the intensity of the arrival process - It is the arrival
rate. For a very small interval of time δ we then have the following (note that
o(δ 2 ) denotes the higher order terms that are an extremely small, negligible
quantity): 
P(k, δ) = { (1 − λδ) + o(δ²), if k = 0
            λδ + o(δ²), if k = 1
            0 + o(δ²), if k > 1 }    (25.13)

• The precise mathematical characterization of the arrival rate λ is given as
  follows:

lim_{δ→0} P(1, δ) / δ = λ    (25.14)
• The expected number of arrivals in the interval 0 to δ can be given as a
weighted sum of the probability of 1 arrival and the probability of 0 arrivals.
Since only these two possible outcomes can happen in a small interval δ.

E[Number of arrivals in [0, δ]] = λδ(1) + (0)(1 − λδ) = λδ (25.15)

• Now using the above result, we can alternatively define the arrival rate as
  the expected number of arrivals per unit time:

E[Number of arrivals in [0, δ]] / δ = λ    (25.16)

25.4.1 Breaking down the Poisson Process


• Suppose we have a big interval of length called τ . Now we can actually
break this big interval into small chunks where each individual chunk is of

length δ. Now the length of the big interval is nothing but the sum of the
lengths of the small intervals. If we have n such intervals, then we have:

n = τ/δ    (25.17)

• Further we note that each individual δ time interval could have either 1 or
0 arrivals. So in each trial, the probability of having 1 arrival is just like the
probability of success in a discrete time slot and is given as p = λδ.

• With this framework, we can actually think about this now in terms of a
  Bernoulli process where we have the parameters n = τ/δ and p = λδ. We now
  write the probability of having k arrivals in n slots:

P[k arrivals in n slots] = C(n, k) p^k (1 − p)^{n−k}    (25.18)

• Now since we can effectively write δ as τ/n, we can rewrite the above equa-
  tion as follows:

P[k arrivals in n slots] = C(n, k) (λτ/n)^k (1 − λτ/n)^{n−k}    (25.19)

• Now if we follow the steps laid out in section 25.2 on the Poisson as a binomial
  approximation, we can let δ tend to 0 and n to ∞ to obtain the following
  Poisson distribution:

P(k, τ) = (λτ)^k e^{−λτ} / k!    (25.20)
• More generally, we could model the probability of N arrivals in time t as a
  random variable N_t ∼ Pois(λt), where the expected value and variance are
  equal and are given by λt.

• We can relate the Bernoulli process expected value and variance with the
Poisson process as follows (as n → ∞, δ → 0 and p → 0):

E[number of arrivals] = np =⇒ λτ (25.21)

V ar[number of arrivals] = np(1 − p) ≈ np =⇒ λτ (25.22)

25.4.2 Modeling the cumulative arrival time


• First of all we note that the time until the k th arrival is a continuous random
variable and hence it has a PDF.

• The way we think about these arrivals is this - There are k − 1 arrivals in the
interval of time [0, t] and then in the last small interval [t, t + δ] we have one
arrival, totalling to k arrivals. We also have the last interval of time such that
δ → 0.

• This random variable can be denoted as Y_k and its probability can be given
  by using the result from section 25.1 at the start of this chapter:

f_{Y_k}(t)δ = P[t ≤ Y_k ≤ t + δ]    (25.23)

= P[(k − 1) arrivals in [0, t]] · P[1 arrival in [t, t + δ]]    (25.24)

• Note that we are multiplying the probabilities in the above equation since
  the time intervals are disjoint and hence independent. We can then rewrite
  the above equation as:

f_{Y_k}(t)δ = P[(k − 1) arrivals in [0, t]] (λδ)    (25.25)

• Now note that Y_k is nothing but the random variable encoding the time until
  the kth arrival. Furthermore, the left term of the RHS in the above equation
  could easily be modeled as a Poisson distribution with parameter (λt), and
  we also know that (λδ) is the probability of 1 arrival in the last time interval.
  With this we can write:

f_{Y_k}(t)δ = [((λt)^{k−1} / (k − 1)!) e^{−λt}] (λδ)    (25.26)

• The above expression can be simplified by cancelling δ; the resulting Erlang
  distribution is then represented as follows (note that the PDF is parameterized
  by k and λ):

f_{Y_k}(t) = λ^k t^{k−1} e^{−λt} / (k − 1)!    (25.27)
• We additionally note that the time until first arrival or the interarrival time
in the Poisson process follows an exponential distribution - Ti ∼ exp(λ).
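The construction above suggests a direct way to simulate a Poisson process: generate exponential interarrival times, accumulate them into arrival epochs Y_k, and count how many land in [0, t]. A minimal numpy sketch (parameters are illustrative choices of ours):

import numpy as np

rng = np.random.default_rng(0)
lam, t, runs = 2.0, 5.0, 100_000

# Each row is one path: cumulative sums of exp(lam) interarrival times give the
# arrival epochs Y_1, Y_2, ...; 40 arrivals is far more than lam*t = 10 on average.
arrivals = rng.exponential(1/lam, size=(runs, 40)).cumsum(axis=1)
counts = (arrivals <= t).sum(axis=1)

print(counts.mean(), lam * t)   # E[N(t)] = lam*t = 10
print(counts.var(), lam * t)    # Var[N(t)] = lam*t too, as a Poisson count should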

25.5 Counting Process


• The counting process is a Stochastic process denoted by N (t) such that
N (t) ≥ 0. N (t) represents a family of random variables that takes on non-
negative integer values. It encodes the count of number of events occurred
upto time t.
• This counting process is said to be non-decreasing - in the sense that if s < t
  then we have N(s) ≤ N(t).
• This process has the property of independent increments - which essentially
means that the number of events happening in disjoint time intervals are
independent.
• If we consider the times s_1 < t_1 < s_2 < t_2, such that we have time intervals
  of the form (s_1, t_1) and (s_2, t_2), then the increments (number of events in
  these intervals) are given by:

(s_1, t_1)  =⇒  [N(t_1) − N(s_1)]    (25.28)

(s_2, t_2)  =⇒  [N(t_2) − N(s_2)]    (25.29)

• Equations (25.28) and (25.29) refer to the increment random variables, which
  encode the number of events happening within those time intervals. We can
  write their independence property in the following format:

P([N(t_1) − N(s_1)] = k, [N(t_2) − N(s_2)] = l)    (25.30)

= P([N(t_1) − N(s_1)] = k) P([N(t_2) − N(s_2)] = l)    (25.31)

• NOTE - These independent increments do not imply that N (t) is independent


of N (s).

• The process also exhibits the property of stationary increments - which


states that the probability distribution of the number of events depends only
on the length of the time interval and not on the origin of the intervals start
and end points.

• If we consider the time intervals (s, t) and (s, s + t), then stationary incre-
  ments imply the following:

P([N(s + t) − N(s)] = k) = P[N(t) = k]    (25.32)

25.6 The Poisson Process


• A counting process is said to classify as a Poisson process if it has certain
properties - these properties are enumerated below.

• Property 1 - The process has stationary and independent increments.

• Property 2 - The number of events in time interval (0, t) has a Poisson dis-
  tribution with mean (λt). We say that N_t ∼ pois(λt) such that:

P[N(t) = n] = e^{−λt} (λt)^n / n!    (25.33)

• Equivalent Property 1 - The probability of 1 event happening in an infinites-
  imal time h is given by:

P[N(h) = 1] = λh + o(h)    (25.34)

• Equivalent Property 2 - The probability of more than one event happening in
  a time interval h is given by:

P[N(h) > 1] = o(h) ≈ 0    (25.35)



25.6.1 An Example
• We let {N(t), t ∈ (0, ∞)} be a Poisson process with rate λ and we let X_1
  denote its first arrival time. We will now attempt to show that, given the fact
  that N(t) = 1, X_1 is uniformly distributed in the time interval (0, t) -
  instead of being distributed exponentially, as could be mistakenly perceived.

• In short, we need to show that [X_1 | N(t) = 1] ∼ unif(0, t), i.e., that the
  following property holds:

P[X_1 ≤ x | N(t) = 1] = x/t,  0 ≤ x ≤ t    (25.36)
• We can then write the following expression for the above conditional prob-
  ability:

P[X_1 ≤ x | N(t) = 1] = P[X_1 ≤ x, N(t) = 1] / P(N(t) = 1)    (25.37)

• Now we recall the fact that N_t ∼ pois(λt) and its probability is given by:

P[N(t) = k] = e^{−λt} (λt)^k / k!    (25.38)

P[N(t) = 1] = λt e^{−λt}    (25.39)

• Using the above facts and expressions, we might state that the probability
  P[X_1 ≤ x, N(t) = 1] is actually associated with the event of one arrival
  in the interval (0, x) and no arrival in the interval (x, t). This probability is
  given as:

P[X_1 ≤ x, N(t) = 1] = [e^{−λx} (λx)^1 / 1!] [e^{−λ(t−x)}]    (25.40)

= λx e^{−λx} e^{−λt} e^{λx} = (λx) e^{−λt}    (25.41)

P[X_1 ≤ x | N(t) = 1] = λx e^{−λt} / (λt e^{−λt}) = x/t    (25.42)
• The above expression essentially defines the probability as defined by the
uniform distribution. Hence we have proved our initial argument.
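A small simulation sketch of this result (numpy assumed, parameters our own): conditioning on exactly one arrival in [0, t], the first arrival time behaves like a Unif(0, t) sample.

import numpy as np

rng = np.random.default_rng(0)
lam, t = 1.5, 4.0

# Many paths on [0, t]; keep the first arrival time X1 of exactly those paths
# that saw N(t) = 1, and compare its distribution against Unif(0, t).
arrivals = rng.exponential(1/lam, size=(200_000, 20)).cumsum(axis=1)
counts = (arrivals <= t).sum(axis=1)
x1 = arrivals[counts == 1, 0]

print(x1.mean(), t / 2)              # Unif(0, t) has mean t/2
print(np.mean(x1 <= t / 4), 0.25)    # and P[X1 <= t/4] = 1/4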

25.7 PASTA
• PASTA stands for - Poisson arrivals see time averages. We build this con-
cept by considering a queueing system - which can be thought of as a restau-
rant in which customers arrive at a certain rate. The number of customers in
the restaurant at a particular time is said to be the state of the system.

• We say that customers arrive at a rate of λ into the restaurant and we say
that the state of the system is characterized by Ej . The system spends its
time in different states across various time intervals.

• Customer arrivals to the system (restaurant) constitute a Poisson process
  with rate (intensity) λ. These arrivals in turn induce state transitions
  in the system - which means that if there are currently 10 people in the
  system, and x more people arrive after a certain time, then
  the state has transitioned from E_j = 10 to E_j = 10 + x.

• In equilibrium, we tend to associate two probabilities with each state Ej .


These are enumerated below.

• First - There is supposed to be a gatekeeper or an observer who keeps track


of who is entering, staying and leaving the system. We define the probability
of the system being in state Ej as seen by the outside observer as follows:

πj = Prob that system is in state Ej at a random instant (25.43)

πj = P (Ej ) (25.44)

• Second - This is from the perspective of the arriving customer. We define


the probability of the system being in state Ej as perceived by the arriving
customer as:

πj∗ = Prob that system is in state Ej just before a randomly chosen arrival
(25.45)

• We also note that in general the inequality π_j ≠ π_j* is observed.

25.7.1 An Example
• Let us look at ’My PC (computer)’ - which is a system of one customer and
one server.

• The above system can only be in two possible states. The PC can be free (E0 )
or the PC can be occupied (E1 ).

• Now note that since ’I am’ the only customer - the PC will always be free
just when ’I need it’. Hence the following customer perspective probability
would be:

π0∗ = Prob that PC is in state E0 (PC is free) just when I need it = 1 (25.46)

π1∗ = Prob that PC is in state E1 (PC occupied) just when I need it = 0


(25.47)

• From the observer’s point of view, we have the following probability defini-
tions:
π0 = P (E0 ) = Proportion of time PC is free < 1 (25.48)
π1 = P (E1 ) = Proportion of time PC is occupied > 0 (25.49)

• Now we note the following inequality conditions:

π_0 < 1, π_0* = 1  =⇒  π_0 ≠ π_0*    (25.50)

π_1 > 0, π_1* = 0  =⇒  π_1 ≠ π_1*    (25.51)

• NOTE - In this example we must note that the arrival process is not Pois-
son. This means that when an arrival has occurred (you have started to work
on your PC) - then for a while it is quite unlikely that another arrival process
would occur. That is, you have essentially stopped the previous session and
started a new one - hence the arrivals at different times are not independent.

• However we also note that for a Poisson arrival, the PASTA property is satis-
fied - that is that πj = πj∗ for state Ej .

25.7.2 Conceptual proof of PASTA


• We note that the arrival history before the instant of consideration, ir-
respective of whether we are considering a random instant or an arrival
instant - are stochastically the same.

• The above point is true because the sequence of arrivals have exponen-
tially distributed interarrival times - this comes from the memoryless
property of the exponential distribution.

• With this, we say that the remaining time to the next arrival has the same
exponential distribution irrespective of the time that has already elapsed
since the previous arrival.

• Since the stochastic characterization of the arrival process before the instant
  of consideration is the same, irrespective of how the instant has been chosen,
  the state distributions of the system (induced by past arrival processes) at
  the instant of consideration must then be the same in both cases.

25.7.3 A PASTA Queue


• Let N (t) denote the number of customers in the queue at time t. So, the
probability that the queue has k customers at time t is P [N (t) = k].

• The steady state probability of the queue having k customers is given by the
following expression:
Pk = lim P [N (t) = k] (25.52)
t→∞

• Then, we define the probability that the process monitor (outside observer)
  sees k customers in the queue just before an arrival as:

A_k = lim_{t→∞} P[N(t) = k | an arrival happens at t⁺]    (25.53)

A_k = lim_{t→∞} P[N(t) = k, an arrival happens in (t, t + Δt)] / P[an arrival happens in (t, t + Δt)]    (25.54)

A_k = lim_{t→∞} lim_{Δt→0} P[N(t) = k, an arrival happens in (t, t + Δt)] / P[an arrival happens in (t, t + Δt)]    (25.55)
• We now apply the Bayes rule to convert the numerator in the above expres-
  sion to a conditional probability format:

A_k = lim_{t→∞} lim_{Δt→0} P[N(t) = k] P[arrival occurs in (t, t + Δt) | N(t) = k] / P[arrival occurs in (t, t + Δt)]    (25.56)

• Now we note that since Poisson arrivals are essentially independent of the
queue size, we can write the following equality:

P [arrival occurs in (t, t + ∆t)|N (t) = k] = P [arrival occurs in (t, t + ∆t)]


(25.57)
• With this in place, we can rewrite equation (25.56) in the following manner:

A_k = lim_{t→∞} lim_{Δt→0} P[N(t) = k] P[arrival occurs in (t, t + Δt)] / P[arrival occurs in (t, t + Δt)]    (25.58)

• The right term of the numerator cancels with the denominator and we are
  then left with the following:

A_k = lim_{t→∞} lim_{Δt→0} P[N(t) = k] = lim_{t→∞} P[N(t) = k] = P_k    (25.59)

25.8 Poisson Process Revisited


• The Poisson process describes the statistical properties of a sequence of
events - the event could be thought of as customer arrivals or instances of
credit defaults. The following time indexed random variable family is the
center of focus in this process:

N (t) = number of events that occur in the time interval [0, t] (25.60)

• We note an important definition - a function f(h) is said to be o(h) if f(h)
  tends to 0 faster than h in the limiting case. This condition is denoted as:

lim_{h→0} o(h)/h = 0    (25.61)

• Definition: The sequence of random variables denoted as {N (t); t ≥ 0} is


said to be a Poisson Process with rate λ > 0 if the following few conditions
hold.
• First condition: The initial condition holds that at time 0 no events have
occurred:
N (0) = 0 (25.62)

• Second condition: The number of events that occur in non-overlapping


time intervals are said to be independent. We essentially treat disjoint time
regions as statistically independent. The number of events that happen in
a certain time period have no influence on the number of events that happen
in a different time period.

• Third condition: The distribution of the number of events that occur in a


given time period depends only on the length of the period and not on its
location. We say that the Poisson Process is stable over time - its statistical
properties don’t change as time advances. So essentially, all time periods
of same length are statistically identical regardless of where in time this
period actually takes place.

• Fourth condition: Behavior of the Poisson Process in a small window of time


- probability of 1 event happening.

P [N (h) = 1] = λh + o(h) (25.63)

• Fifth condition: Behavior of the Poisson Process in a small window of time
  - the probability of 2 or more events happening:

P[N(h) ≥ 2] = o(h)    (25.64)

• Sixth condition: Combining the above two conditions, we can get the prob-
  ability of 0 events happening as:

P[N(h) = 0] = 1 − P[N(h) ≥ 1] = 1 − λh − o(h)    (25.65)

25.8.1 Key properties derived


• The number of events in a certain time period is Poisson.

• Events in a Poisson Process occur with a constant rate λ.

• The time between events is exponentially distributed with parameter λ, de-
  noted exp(λ).

• The number of events N(t) generated by a Poisson Process over a period of
  length t follows a Poisson distribution.

25.9 Deriving the distribution


• Let us consider an example such that there are 0 events till time t and 0
events even till time (t + h) - which also implies that there are no events
between times t and (t + h) as well. So essentially we have the following
measures:
N (0) = 0 (25.66)

N (t) = 0 (25.67)
N (t + h) = 0 (25.68)
{N (t + h) − N (t)} = 0 (25.69)

• Now the bigger event of no events in time (t + h) can be written as a con-


junction of two events over smaller, intermediate time periods - that is a
combination of conditions that combine - no events happening till time t and
no events happening between times t and (t + h).

P[N(t + h) = 0] = P[N(t) = 0 and {N(t + h) − N(t)} = 0]    (25.70)

independent increments  =⇒  P[N(t) = 0] P[{N(t + h) − N(t)} = 0]    (25.71)

stationary increments  =⇒  P[N(t) = 0] P[N(h) = 0]    (25.72)

using the sixth condition  =⇒  P[N(t) = 0][1 − λh − o(h)]    (25.73)

• We denote the probability of 0 events happening till time t as P0 (t) and


the probability of 0 events happening till time (t + h) as P0 (t + h). These
probabilities are given as:

P0 (t) = P [N (t) = 0] (25.74)

P0 (t + h) = P [N (t + h) = 0] (25.75)
P0 (t + h) = P0 (t)[1 − λh − o(h)] (25.76)

• Now we rearrange both sides of the above equation and divide throughout
  by h to obtain the following:

(P_0(t + h) − P_0(t)) / h = −λP_0(t) − P_0(t) o(h)/h    (25.77)

lim_{h→0} (P_0(t + h) − P_0(t)) / h = −λP_0(t) − lim_{h→0} P_0(t) o(h)/h, where the last limit is 0    (25.78)

=⇒ dP_0(t)/dt = −λP_0(t)    (25.79)
• The above equation is essentially a differential equation which has the fol-
lowing general form and initial conditions:

P0 (t) = Ae−λt (25.80)

initial condition: P [N (0) = 0] = 1 =⇒ P0 (0) = A = 1 (25.81)


=⇒ P0 (t) = e−λt (25.82)

• The above result can then be extended to show the general case - the prob-
  ability of getting k events during a period of length t follows a Poisson
  distribution with parameter (λt):

P[N(t) = k] = e^{−λt} (λt)^k / k!    (25.83)

expected number of events in a period:  E[N(t)] = λt    (25.84)

• We had stated earlier that events occur at a constant rate λ. This can be
  further expanded as:

rate of arrivals(t) = (expected number of arrivals in t) / (length of period t) = λt/t = λ    (25.85)

• Further, we state that the expected time between arrivals is given as follows:

E[time between arrivals] = 1/λ    (25.86)

25.10 Interarrival times


• The time between events is said to be distributed exponentially. We denote
T1 as the time until the first event and T2 as the time between 1st and 2nd
event.

• Consider time T_1 as ending at time instant t on a continuous time scale.
  Further, we let the time instant (t + s) lie somewhere between t and the
  end of time period T_2; essentially the difference between these two time
  instants is s.

• Now we want to find the probability that time T_2 is actually greater than s,
  given that the first event occurs at time t. We can write this as follows:

P[T_2 > s | T_1 = t] = P[no events in [t, t + s] | T_1 = t]    (25.87)

independent increments  =⇒  P[no events in [t, t + s]]    (25.88)

stationary increments  =⇒  P[no events in period of length s]    (25.89)

=⇒  P[N(s) = 0] = e^{−λs} = P[T_2 > s]    (25.90)

• The above equation is nothing but the complementary CDF of the exponential
  distribution - hence we now know that T_2 ∼ exp(λ).

25.11 PASTA
• PASTA (Poisson Arrivals See Time Averages) property refers to the expected
state of a queueing system as seen by an arrival from a Poisson Process. An
arrival from a Poisson Process observes the system as if it was arriving at a
random moment in time. Therefore the expected value of any parameter
of the queue at the instant of a Poisson arrival is the long run average value
of that parameter.

• Consider that till time a1 from time 0 - we have 0 arrivals. Then in the
interval between a1 and (a1 + b) - that is in the interval of size b - there is one
arrival. Finally, there are also 0 arrivals in the interval between (a1 + b) and
(a1 + b + a2 ) = t - that is in the interval of size a2 .

N (a1 ) = 0 (25.91)

N (b) = 1 (25.92)
N (a2 ) = 0 (25.93)
N (t) = 1 (25.94)

• We will now show that the probability that the Poisson Process produces 1
arrival in the interval of length b is the same as the probability of a randomly
chosen point being in the interval b.

P[one arrival in [0, t] occurs in b] = P[0 in a_1 and 1 in b and 0 in a_2] / P[1 arrival in t]    (25.95)

P[N(b) = 1 | N(t) = 1] = P[N(a_1) = 0, N(b) = 1, N(a_2) = 0] / P[N(t) = 1]    (25.96)

ind. increments  =⇒  P[N(a_1) = 0] P[N(b) = 1] P[N(a_2) = 0] / P[N(t) = 1]    (25.97)

stationary increments  =⇒  (e^{−λa_1})(λb e^{−λb})(e^{−λa_2}) / (λt e^{−λt}) = b/t    (25.98)

• From the above result we can state that - arriving from a Poisson Process
is statistically identical to arriving at a random moment in time.
Chapter 26

Introduction to Queues

26.1 Frequency Interpretation


• We now look at the framework wherein probabilities are interpreted as fre-
quencies of occurrence of states. With this perspective we say that πj is the
long run frequency of visiting state j.

• We might also say that πj is the steady state probability of being in state j or
the long term probability of being in state j. The global balance equation
for the state j can then be given as follows:

π_j = Σ_{k=1}^{m} π_k p_{kj}    (26.1)

• If we run a long trajectory of a Markov Chain, which visits many states with
different frequencies over time - then the long run frequency of being in
state j is given by πj .

• Further we denote the frequency of transition from state k to state j as fol-


lows:
(k → j) =⇒ (πk pkj ) (26.2)

• The total frequency of transitioning from all possible values of state k to
  state j happens to be the frequency with which we are in state j. This is
  derived by summing over all possible states k that reach j as follows:

π_j = Σ_{k=1}^{m} π_k p_{kj}    (26.3)

• The diagram below tells us how the probabilities of transition from all pos-
sible states sum up to get the probability of being in state j. In the figure
below we note that π1 is the fraction of time we are in state 1. Similarly,
π1 p1j denotes the transition from state 1 to state j. This is shown in the figure
from all m states. Note that p1j represents the fraction of transitions from
state 1 to state j whenever we find ourselves in state 1.


• In the next section we look at a simple example demonstrating these con-


cepts.

Figure 26.1: state j limiting probability

26.1.1 An illustrative example


• Consider a MC with only two possible states given by 1 and 2. This MC
is characterized by no periodicity since it has self loops. Further, it has a
single recurrent class - which in turn implies that the initial condition does
not matter. Below is a pictorial representation:

Figure 26.2: example 1

• Here we have j = 1, 2 as the possible states and π1 and π2 are the steady
state probabilities of being in state 1 and 2 respectively. We can then write
down the steady state balance equation as follows for a general state j:
π_j = Σ_{k=1}^{2} π_k p_{kj}    (26.4)

• From the above equation, we can obtain the steady state probabilities for
state 1 and 2 as follows:
π1 = π1 p11 + π2 p21 = 0.5π1 + 0.2π2 (26.5)

π2 = π1 p12 + π2 p22 = 0.5π1 + 0.8π2 (26.6)


• By putting in the constraint that π_1 + π_2 = 1 and simultaneously solving the
  two equations, we get:

π_1 = 2/7,  π_2 = 5/7    (26.7)

26.2 The Birth-Death Model


• The birth and death model corresponds to arrival and departure processes
in a system. We explain this concept using a supermarket model where
customers arrive and depart after obtaining service. This system is assumed
to have m possible states and associated transition probabilities. This is vi-
sualized as follows:

Figure 26.3: Birth Death process

• Further we assume that the arrival rate is constant at each time slot. This
means that we have pi = p and qi = q for all possible values of i. Additionally,
we define the load factor of the system as follows:
ρ = p/q    (26.8)
• With this framework in place, we can write the detailed balance equation
  for some state i - by equating the frequency of upward transitions from
  state i to state (i + 1) with the frequency of downward transitions from
  state (i + 1) to state i. These two rates should be in equilibrium. Note that
  π_i corresponds to the fraction of time we are in state i and p_i corresponds
  to the frequency of transition from state i to state (i + 1). The detailed
  balance is given by:
πi pi = πi+1 qi+1 (26.9)

• Note that with the above detailed balance equation - we essentially have a
recursion equation - since πi+1 can be computed in terms of πi .
• Further, we have a total of (m + 1) unknowns in the form of (π_0, ···, π_m).
  We then have m detailed balance equations and a normalization condition of
  the form:

π_i p_i = π_{i+1} q_{i+1}    (26.10)

Σ_{i=0}^{m} π_i = 1    (26.11)

• Our attempt then is to solve the above recurrence equation in terms of π0


after we fix π0 . Recall before we ahead the folowing parameters and terms:

pi = p =⇒ probability of moving up (26.12)

qi = q =⇒ probability of moving down (26.13)


ρ = p/q =⇒ load factor of the system (26.14)
• We further lay down certain conditions. First - if p = q then ρ = 1 and the system is said to be balanced. Second - if p > q then ρ > 1 and the system has a rightward drift. Third - if p < q then ρ < 1 and the system has a leftward drift.

• From the detailed balance equation, we can then write the limiting proba-
bilities of states in terms of the load factor as follows:

πi p = πi+1 q (26.15)

=⇒ πi+1 = (p/q) πi = ρ πi (26.16)

=⇒ πi = π0 ρ^i , by repeated recursion (26.17)

• The normalization condition can then be expressed as follows:


Σ_{i=0}^{m} πi = π0 + π1 + · · · + πm = 1 (26.18)

=⇒ Σ_{i=0}^{m} π0 ρ^i = 1 (26.19)

=⇒ π0 = 1 / Σ_{i=0}^{m} ρ^i (26.20)

• Now we note that if ρ = 1 then πi = π0 for all i. This means that all the
steady state probabilities are equal. We can say that every state i is equally
likely to occur in the long run. We can then write the following relations:
π0 = 1 / Σ_{i=0}^{m} ρ^i = 1 / Σ_{i=0}^{m} 1 = 1/(m + 1) (26.21)

πi = π0 = 1/(m + 1) (26.22)
• Now we can consider the case when p < q which implies that ρ < 1. This
means that our system is stable and that there is a tendency of customers to

be served faster than they arrive. The drift is said to be leftward. Further,
If we take the limiting case of m tending to ∞ then we have the following:
π0 = 1 / Σ_{i=0}^{∞} ρ^i , where the denominator is a geometric series (26.23)

=⇒ π0 = 1 / [1/(1 − ρ)] = (1 − ρ) (26.24)

• With this result we can obtain the limiting distribution of state i as a geo-
metric distribution as follows:

πi = π0 ρ^i = (1 − ρ)ρ^i (26.25)

• After characterizing the general distribution, we can get the expected num-
ber of customers in the queueing system as follows:
E[Xn ] = ρ/(1 − ρ) (26.26)
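As a quick sanity check, the following sketch (assuming Python with numpy; the load factor value is arbitrary) evaluates the geometric limiting distribution of equation 26.25 on a truncated state space and verifies the expected queue length of equation 26.26:

```python
import numpy as np

rho = 0.8                          # load factor p/q < 1, so the queue is stable
i = np.arange(0, 200)              # truncate the infinite state space for illustration
pi = (1 - rho) * rho**i            # geometric limiting distribution (26.25)

print(pi.sum())                    # ~1.0: the probabilities sum to one
print((i * pi).sum())              # ~ rho/(1-rho) = 4.0, matching (26.26)
```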

26.3 Queues and Communication Networks


• We might define the goal of a communication system as - moving packets
from generating sources to intended destinations.

• Now in between arrivals and departures - we hold packets in a memory


buffer. Our aim is to design the buffers efficiently.

• We further characterize a large period of time as being divided into equally


sized time slots of duration ∆t. This is done such that - the time slot between
time instant 1 and time instant 2 is the first slot - and the time slot between
time instant n and time instant (n + 1) is the nth time slot.

• We then assume that during each individual time slot - a packet can arrive
with probability λ. Then, the packet arrival rate can be given by:

arrival rate = (expected number of arrivals)/(unit of time ∆t) = λ/∆t (26.27)

• Another assumption is that during each individual time slot a packet departs the system with probability µ. The departure rate can then be given as follows:

departure rate = (expected number of departures)/(unit of time ∆t) = µ/∆t (26.28)

• In this model we assume that there are no simultaneous arrivals and de-
partures in a given time slot. Note the following variable definitions:

qn = number of packets in queue in the nth time slot (queue length) (26.29)

An = number of packet arrivals during the nth time slot (26.30)


Dn = number of packet departures during the nth time slot (26.31)
Dn = 0 =⇒ there are no departures in the nth time slot (26.32)
An = 1 with probability λ, 0 with probability (1 − λ) =⇒ E[An ] = λ (26.33)

Dn = 1 with probability µ, 0 with probability (1 − µ) =⇒ E[Dn ] = µ (26.34)

26.3.1 Deriving the state probabilities


• First we note - given that the queue length till time n is 0 or qn = 0 then we
cannot have any departures since queue length cannot be negative. So we
would have Dn = 0 in this case. Then the queue length at (n + 1) can be
given by:
qn+1 = qn + An (26.35)

• If on the other hand we have qn > 0 then the queue length at time (n + 1)
would have elements of both arrivals and departures and would be given by
(note that the positive sign above the brackets indicates non-negativity of
the queue length):
qn+1 = [qn + An − Dn ]+ (26.36)

• Recall that in this framework, either An = 1 or Dn = 1 - both can’t simulta-


neously happen. The arrival and departure probabilities are given by:

P [An = 1] = λ (26.37)

P [Dn = 1] = µ (26.38)

• After making the Markov assumption that future queue lengths depend only
on the current queue length, we get the probability of queue length in-
creasing and probability of queue length decreasing (on the condition that
i > 0) as follows:

P [qn+1 = i + 1|qn = i] = P [An = 1] = λ (26.39)

P [qn+1 = i − 1|qn = i] = P [Dn = 1] = µ (26.40)



• We can say that the probability of either an arrival or a departure happening is simply the sum of the individual probabilities - since they are disjoint events. This probability is (λ + µ). The complement of this disjunction - the probability of neither happening - is the probability of the queue length remaining the same. This is given as:

P [qn+1 = i|qn = i] = 1 − λ − µ (26.41)

• The probability of queue length not changing given that currently the queue
length is 0 - essentially removes the possibility of a departure happening -
such a probability measure is then given by:

P [qn+1 = 0|qn = 0] = 1 − λ (26.42)

• Now for an infinite state space MC - we can specify the various transition
probabilities as given below:

p0,0 = 1 − λ (26.43)

pi,(i+1) = λ (26.44)
pi,(i−1) = µ (26.45)
pi,i = 1 − λ − µ, ∀i ≠ 0 (26.46)
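This infinite chain is easy to simulate. Below is a minimal sketch (assuming Python with numpy; λ and µ are arbitrary values with λ < µ) of the slotted queue evolution just described; the empirical state frequencies can be compared with the closed-form limiting distribution derived in the next section:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, mu = 0.3, 0.5            # per-slot arrival and departure probabilities
n_slots = 200_000
q = 0
counts = {}

for _ in range(n_slots):
    u = rng.random()
    if u < lam:                       # arrival with probability lambda
        q += 1
    elif u < lam + mu and q > 0:      # departure with probability mu, only if queue non-empty
        q -= 1                        # otherwise the queue length stays the same
    counts[q] = counts.get(q, 0) + 1

for i in range(5):                    # compare with pi_i = (1 - lam/mu)(lam/mu)^i
    print(i, counts.get(i, 0) / n_slots, (1 - lam / mu) * (lam / mu) ** i)
```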

26.4 The Buffer Drops Problem


• Referring to the previous description of packets arriving in discrete time slots, we lay down the following conditions:

A packet arrives with probability λ, giving arrival rate λ/∆t (26.47)

A packet departs with probability µ, giving departure rate µ/∆t (26.48)

No concurrence - no simultaneous arrivals and departures in a time slot (26.49)

• Now we can use the global balance equations to find the limiting probability distribution by computing the eigenvector of the transposed transition matrix P^T associated with its largest eigenvalue, which is 1. These relations are specified as follows:

[P^T]^n p(0) = p(n) =⇒ P^T π = π (26.50)

π = lim_{n→∞} p(n) (26.51)

• Further note that the resulting probability distribution will have an exponential form. View the figure below for a pictorial specification of the model we have described.

Figure 26.4: Birth and Death - Buffer Drops

• We shall now write down the limiting distribution equations for states 0 and i respectively in the following manner:
π0 = (1 − λ)π0 + µπ1 (26.52)
πi = λπi−1 + (1 − λ − µ)πi + µπi+1 (26.53)
• Now we would formulate a general exponential expression for the limiting
distribution with parameters α and c - this can be written as follows:
πi = cα^i =⇒ π0 = cα^0 = c (26.54)

• We can substitute the above expression into equations 26.52 and 26.53 to obtain the following results:

cα^0 = (1 − λ)cα^0 + µcα^1 (26.55)

=⇒ 1 = (1 − λ) + µα =⇒ α = λ/µ (26.56)

cα^i = λcα^{i−1} + (1 − λ − µ)cα^i + µcα^{i+1} (26.57)

dividing throughout by cα^{i−1} =⇒ µα^2 − (λ + µ)α + λ = 0 (26.58)
• The above quadratic equation is satisfied by α = (λ/µ). After this, we proceed to find the value of the parameter c by solving the constraint equation given by:

Σ_{i=0}^{∞} πi = 1 =⇒ c Σ_{i=0}^{∞} (λ/µ)^i = 1 (26.59)

=⇒ c / [1 − (λ/µ)] = 1 (26.60)

=⇒ c = [1 − (λ/µ)] (26.61)
• Now that we have found out the parameters α and c - we can then completely
specify the limiting distribution as follows:
πi = cα^i = [1 − (λ/µ)] (λ/µ)^i (26.62)

• Further note that the ratio (µ/λ) is known as the queue stability margin -
A larger value of this ratio implies fewer packets in queue.
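The eigenvector characterization in equation 26.50 can be checked numerically. Below is a rough sketch (assuming Python with numpy; the rates and the truncation level N are arbitrary choices) that builds a truncated version of the infinite chain and compares the eigenvector of P^T for eigenvalue 1 with the closed form of equation 26.62:

```python
import numpy as np

lam, mu, N = 0.3, 0.5, 60          # arbitrary rates; truncate the chain at N states
P = np.zeros((N, N))
P[0, 0], P[0, 1] = 1 - lam, lam
for i in range(1, N - 1):
    P[i, i - 1], P[i, i], P[i, i + 1] = mu, 1 - lam - mu, lam
P[N - 1, N - 2], P[N - 1, N - 1] = mu, 1 - mu   # reflecting boundary at the truncation

w, v = np.linalg.eig(P.T)           # pi is the eigenvector of P^T for eigenvalue 1
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()                      # normalize into a probability vector

c = 1 - lam / mu                    # closed form (26.62)
print(pi[:4])
print([c * (lam / mu) ** i for i in range(4)])
```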

26.4.1 Results in terms of Detailed Balance


• Start by recalling the limiting distribution equation for state 0 and then re-
arranging the terms to obtain the following:

π0 = (1 − λ)π0 + µπ1 (26.63)

=⇒ π0 = π0 − λπ0 + µπ1 (26.64)


=⇒ λπ0 = µπ1 (26.65)

• Now recall the limiting distribution equation for state i and then rearrange
the terms to obtain the following result:

πi = λπi−1 + (1 − λ − µ)πi + µπi+1 (26.66)

=⇒ πi = λπi−1 + πi − (λ + µ)πi + µπi+1 (26.67)

=⇒ (λ + µ)πi = λπi−1 + µπi+1 (26.68)

where the left hand side is the rate at which the queue leaves state i and the right hand side is the rate at which the queue enters state i.

• The above two results tell us one basic result - that the average rate at
which the queue leaves state 0 (λπ0 ) is equal to the average rate at which
the queue enters state 0 (µπ1 ).

26.5 Concurrent Arrivals and Departures


• The most basic assumption that characterizes this model is that - there is
concurrence - packets may arrive and depart in the same time slot. However
we must note that even despite this fact, the queue evolution equations
remain the same as before - they are general equations upholding the model.

qn+1 = qn + An , if qn = 0 (so that Dn = 0) (26.69)

qn+1 = [qn + An − Dn ]+ (26.70)

• Even though the evolution equations remain the same - the queue probabilities will now change. Note the following probabilities:

queue length increasing =⇒ P [qn+1 = i + 1|qn = i] = λ(1 − µ) (26.71)

queue length decreasing =⇒ P [qn+1 = i − 1|qn = i] = µ(1 − λ) (26.72)


queue length staying same =⇒ P [qn+1 = i|qn = i] = λµ + (1 − λ)(1 − µ) (26.73, 26.74)

where the term λµ accounts for 1 arrival and 1 departure, and (1 − λ)(1 − µ) for no arrival and no departure.

same length with i = 0 =⇒ P [qn+1 = 0|qn = 0] = (1 − λ) + λµ (26.75)

• A pictorial representation of this model is provided below. The probability


given by equation 74 is represented in the figure as V .

Figure 26.5: Concurrent arrivals and departures

• Given that in equilibrium - the rate at which the queue leaves a state equals
the rate at which the queue enters the state, we can write the following
queue balance equations:

λ(1 − µ)π0 = µ(1 − λ)π1 (26.76)

[λ(1 − µ) + µ(1 − λ)]πi = λ(1 − µ)πi−1 + µ(1 − λ)πi+1 (26.77)

• As in the previous case, we will now attempt to find the parameters of the
limiting distribution by specifying its general exponential form:

πi = cαi (26.78)

• First, we substitute the above expression in equation 76 to obtain a value for the parameter α as follows:

λ(1 − µ)cα^0 = µ(1 − λ)cα^1 =⇒ α = λ(1 − µ) / [µ(1 − λ)] (26.79)

• We now substitute the general form distribution into equation 77 and obtain
the quadratic form that can be satisfied by the above expression for α.

[λ(1 − µ) + µ(1 − λ)]cα^i = λ(1 − µ)cα^{i−1} + µ(1 − λ)cα^{i+1} (26.80)

=⇒ µ(1 − λ)α^2 − [λ(1 − µ) + µ(1 − λ)]α + λ(1 − µ) = 0 (26.81)

• Now before moving ahead in simplifying the above expression, we can es-
sentially rearrange the α expression given by equation 79 to obtain the fol-
lowing:
λ(1 − µ) = αµ(1 − λ) (26.82)

• Now in equation 81 we use the above relation to substitute the RHS of the above equation in place of the LHS expression. With this we find that the quadratic equation is satisfied:

µ(1 − λ)α^2 − [αµ(1 − λ) + µ(1 − λ)]α + αµ(1 − λ) = 0 (26.83)



• Now our next step is to find the constant c by applying the constraint equa-
tion of the limiting probabilities as follows:

Σ_{i=0}^{∞} πi = Σ_{i=0}^{∞} cα^i = 1 (26.84)

c Σ_{i=0}^{∞} α^i = 1 =⇒ c/(1 − α) = 1 (26.85)

=⇒ c = (1 − α) (26.86)

• As a final step, with our specified parameters - we can write the limiting
distribution in the following form:

lim_{n→∞} P [qn = i] = πi = cα^i = (1 − α)α^i (26.87)

26.6 Limited Queue Size


• In the earlier problems we were looking at queues of infinite length. However, when we limit the queue size, packets are dropped if there are too many packets already in the queue.

• We specify state J as the state of full capacity queue length. Note that there cannot be any arrivals in this state - if there were an arrival then it would lead to buffer overflow.

• Note that since even this system is essentially a concurrent arrival-departure


system as laid out in the previous section - the queue balance equations
for state 0 and general state i are the same. However, here we have to add
another equation that specifies queue balance for state J. It is given by:

µ(1 − λ)πJ = λ(1 − µ)πJ−1 (26.88)

• As before the general exponential form of the limiting distribution remains


the same. The general form for the J th state probability is then:

πJ = cα^J (26.89)

• We substitute the above expression in the queue balance equation of state J


to obtain an expression for α - which is the same as in the infinite case.

µ(1 − λ)cα^J = λ(1 − µ)cα^{J−1} (26.90)

=⇒ α = λ(1 − µ) / [µ(1 − λ)] (26.91)

• Now we try to find the constant c by taking into account the constraint
equations as follows:
Σ_{i=0}^{J} πi = Σ_{i=0}^{J} cα^i = 1 (26.92)

=⇒ Σ_{i=0}^{J} cα^i = c[1 − α^{J+1}] / [1 − α] = 1 (26.93)

=⇒ c = (1 − α) / [1 − α^{J+1}] (26.94)

• If we take J to infinity we will get the same distribution as specified in the


previous section.
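A small sketch of these finite-buffer formulas (assuming Python with numpy; λ, µ and J are arbitrary illustrative values) computes the limiting distribution and the full-buffer probability πJ, which is the fraction of time arriving packets are dropped:

```python
import numpy as np

lam, mu, J = 0.4, 0.5, 10                      # arbitrary slot probabilities, buffer size J
alpha = lam * (1 - mu) / (mu * (1 - lam))      # equation 26.91
c = (1 - alpha) / (1 - alpha ** (J + 1))       # equation 26.94
pi = c * alpha ** np.arange(J + 1)

print(pi.sum())    # 1.0: the J+1 probabilities are normalized
print(pi[J])       # probability of a full buffer, i.e. of dropping arriving packets
```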
Chapter 27

Queueing models - M/M/1

This chapter focusses on developing the fundamental concepts of Queueing theory involving
- Exponential distributions, Counting processes, Markov chains and transition rates. Conse-
quently a quick overview of the M/M/1 model is given before delving into the M/M/s model
since it essentially serves as the connecting link.

27.1 The underlying concepts


Queueing models revolve around the concept of a Counting process which can be completely characterised by a Poisson distribution and an Exponential distribution. The counting process is denoted by {N (t), t ≥ 0} where N (t) denotes the total number of events that have occurred up to time t. Note that the occurrence of an event essentially means - customers arriving in the system. Note that we assume something called independent increments in this situation, which means that the number of events occurring by time s, that is N (s), is independent of the number of events occurring between times s and t + s, denoted as (N (t + s) − N (s)). We note a critical point here: the number of events occurring in an interval of length t follows a Poisson Distribution with mean λt.

P [N (t) = n] = e^{−λt} (λt)^n / n! (27.1)
Now we let the sequence T1 , T2 , · · · represent the sequence of times between events (or
between arrivals of customers). T1 is the time until the first event (arrival), T2 is the time
between the second and first event and so on. Now we notice the following equation:

P [T1 > t] = P [N (t) = 0] = e−λt (27.2)

Therefore we can state that in queueing models, interarrival times and service times follow an exponential distribution, and the number of customer arrivals follows a Poisson distribution.

27.1.1 Notes on the exponential distribution


Exponential random variables are used to model the time elapsed between occurrence of
random events. We say that T ∼ exp(λ) is an exponential random variable with parameter
λ and its pdf is given by:
fT (t) = λe−λt , ∀t ≥ 0 (27.3)


The CDF, denoted as P [T ≤ t] and its complement, denoted as P [T ≥ t] are given below:

FT (t) = 1 − e−λt (27.4)

P (T ≥ t) = 1 − FT (t) = e−λt (27.5)


The expected value of interarrival times is given by:

E[T ] = ∫_0^∞ t λe^{−λt} dt = 1/λ (27.6)

Note that λ represents the rate at which events occur in time interval T and the expected
value E[T ] is the mean interarrival time between events. Note that the Variance of this
distribution is given by:
var[T ] = 1/λ^2 (27.7)
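The Poisson–exponential connection in equations 27.1 and 27.2 can be illustrated numerically. The following is a minimal sketch (assuming Python with numpy; the rate, horizon and run count are arbitrary) that generates exponential interarrival times and checks that the resulting counts have the Poisson mean and variance λt:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, t, n_runs = 2.0, 5.0, 20_000

# Draw exponential interarrival times and count events that land in [0, t].
counts = []
for _ in range(n_runs):
    arrivals = np.cumsum(rng.exponential(1 / lam, size=int(4 * lam * t)))
    counts.append(np.sum(arrivals <= t))

print(np.mean(counts))   # ~ lam * t = 10, the Poisson mean of N(t)
print(np.var(counts))    # ~ lam * t as well, as expected for a Poisson count
```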

27.2 Markov chains


The Markov chains concept is at the heart of queueing models. To start with the very
basics, we first say that every stochastic process goes through various transitions across
states. We define the state of a stochastic process as the value taken on by some random
variable X(t) at a particular time. Now when dealing with queueing models, our main random variable is N (t) which captures the number of customers arriving in the system at certain time intervals. Let us say that in the first t time interval, i customers arrive and then after another s time interval, there are j customers in the system. This can be thought of as the random variable making a transition from state i to state j. Further, the Markov property states that the probability of the random variable taking on some state in the future is independent of the past and should only be conditioned on the current state of the random variable.

P [X(t + s) = j|X(s) = i, X(u) = u, u < s] = P [X(t + s) = j|X(s) = i] (27.8)

We also note the transition times, denoted Ti , meaning the time the random variable takes to transition out of state i. This represents the time it takes to move from one state to another and is essentially the interarrival time.
the probability that we transit out of state i to state j is Pij . Now in most cases we are
often dealing with the limiting transition probabilities or the steady state transition
probabilities. This is represented as:

lim_{t→∞} Pij (t) = πij (27.9)

27.3 The M/M/1 Queue


As per this model, arrivals occur according to a Poisson process with rate λ, and as per the counting process described above, the number of customers N (t) arriving during the time interval (0, t) follows the probability mass function:

P [N (t) = j] = e^{−λt} (λt)^j / j! (27.10)

And interarrival times follow an exponential distribution with PDF:


p(x) = λe−λx (27.11)
Additionally we also note that service times follow an exponential distribution with rate
µ given by:
p(x) = µe−µx (27.12)
We now note the mean interarrival and service times given by the first moments:

E[interarrival time] = 1/λ = 1/(arrival rate) (27.13)

E[service time] = 1/µ = 1/(service rate) (27.14)
Next we define traffic intensity also known as the utilization factor denoted as ρ as the
ratio of arrival and service rates. This utilization factor is the probability that the server
is busy when the system is in equilibrium - it gives us a measure of expected number of
customers in service.
ρ = (arrival rate)/(service rate) = λ/µ (27.15)
Now to note some important definitions:
• Q(t) = Number of customers in the system.
• Qq = Number of customers in queue (excluding those already in service).
• We define L = E(Q) and Lq = E(Qq ).
• We have the following results about L and Lq as follows:
L = ρ/(1 − ρ) = λ/(µ − λ) (27.16)

Lq = ρ^2/(1 − ρ) = λ^2/[µ(µ − λ)] (27.17)

27.3.1 Waiting times


Now we further note that Tq and T denote respectively, the amount of time a customer
spends in the queue and in the system. We say that the total time spent in the system is the
sum of the waiting time (time in queue) and the time in service. Note that for n customers
in the system and service times following an exponential distribution with parameter µ,
the total service time for n customers is distributed as a Gamma or an Erlang distribution
given by:
fn (x) = e^{−µx} µ^n x^{n−1} / (n − 1)! (27.18)
With some simplifications, we can get the expected values of these times denoted as:
E[Tq ] = Wq and E[T ] = W . These are given by:
Wq = E[Tq ] = ρ/[µ(1 − ρ)] = λ/[µ(µ − λ)] (27.19)

W = E[T ] = 1/(µ − λ) (27.20)
Finally we note Little's law that connects these values for us:
L = λW and Lq = λWq (27.21)
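Putting the M/M/1 formulas together, here is a small illustrative sketch in Python (the rates are arbitrary choices) that computes the four summary measures and confirms Little's law numerically:

```python
# Minimal M/M/1 calculator based on equations 27.15-27.21 (illustrative rates).
lam, mu = 3.0, 5.0                 # arrival and service rates, lam < mu for stability
rho = lam / mu                     # utilization (27.15)

L  = rho / (1 - rho)               # mean number in system (27.16)
Lq = rho**2 / (1 - rho)            # mean number in queue (27.17)
W  = 1 / (mu - lam)                # mean time in system (27.20)
Wq = lam / (mu * (mu - lam))       # mean time in queue (27.19)

print(L, lam * W)                  # Little's law: L = lam * W -> 1.5 == 1.5
print(Lq, lam * Wq)                # Lq = lam * Wq -> 0.9 == 0.9
```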

27.4 M/M/s queue


This model is used in situations wherein there are multiple servers. As usual, the arrival
of customers follows a Poisson distribution and service times follow an exponential distri-
bution. The number of servers is s and they provide service independent of each other. A
central assumption in this case is that the arriving customers enter into a single queue and
the customer at the start of the waiting line would enter into service if the server is free.
Again as usual we have λ and µ as the arrival and service rates respectively.

27.4.1 Prerequisites: infinitesimal rates


In the previous section we discussed about transition probabilities and the Markov prop-
erty. Now in the situation when time is continuous and our stochastic process takes on
discrete values then we model the transition probabilities using limiting cases. For our
random variable to change its state from i to j in an infinitesimal time interval ∆t, the transition probability is given by:

lim_{∆t→0} Pij (∆t)/∆t = λij (27.22)
In a typical counting process, the probability for one event occurring in a small time inter-
val ∆t is given by:
λ∆t + o(∆t) (27.23)
Where λ is the rate of customer arrivals and o(∆t) is a function defined such that:

lim_{∆t→0} o(∆t)/∆t = 0 (27.24)

Which essentially means that o(∆t) is so small that it vanishes faster than ∆t as the small time interval tends to 0. Further we note that in the customer arrival
counting process, the probability of no customers arriving in the time interval is given by:

1 − λ∆t + o(∆t) (27.25)

Further we note what the steady state probability p0 actually means - it is the long-run probability of finding the system in state 0, that is, a state where no customers are in the system. Similarly pn stands for the steady state probability of finding the system in state n.

27.4.2 Derivation
Referring to the Birth and Death model, we write the infinitesimal transition rates as λn
and µn for n customers in the system. We note that the arrival rate is constant that is it
does not change depending on the number of customers already in the system. We now
determine if µn changes or not. Assuming n busy servers as of time t, then we say that the
probability of a server completing service during the interval (t, t + ∆t) is given by:

µ∆t + o(∆t) (27.26)

Given that out of n servers, the probability that 1 server completes service is:
C(n, 1) [µ∆t + o(∆t)]^1 [1 − µ∆t + o(∆t)]^{n−1} = nµ∆t + o(∆t) (27.27)

Now given that n servers are busy, the probability that r servers complete service in the
given interval is given by:
C(n, r) [µ∆t + o(∆t)]^r [1 − µ∆t + o(∆t)]^{n−r} = o(∆t), for r ≥ 2 (27.28)

We see clearly that the only case wherein we have nonnegligible probability of service
completion is the case of completion of one service. Therefore the service rate would be
given by nµ and since we have s servers in all, it is given by sµ. Also since arrival rate
is constant we have λn = λ. Recall that we defined the concept of steady state transition
probabilities in the previous section. With this we can write the Chapman-Kolmogorov
equations in terms of steady state probabilities as follows:

λp0 = µp1 (27.29)

(λ + nµ)pn = λpn−1 + (n + 1)µpn+1 , 0 < n < s (27.30)


(λ + sµ)pn = λpn−1 + sµpn+1 (27.31)
Recursively solving these equations would give us the following relation between transi-
tion steady state probabilities:
pn = (1/n!) (λ/µ)^n p0 (27.32)

pn = (1/n!) (sρ)^n p0 (27.33)
We then use the normalization condition that the steady state probabilities over all possible states sum to 1, that is Σ_{n=0}^{∞} pn = 1, and hence obtain the expressions for p0 and pn as follows:

p0 = [ Σ_{r=0}^{s−1} (sρ)^r/r! + (sρ)^s/(s!(1 − ρ)) ]^{−1} (27.34)

pn = ρ^{n−s} ps , n ≥ s (27.35)
Where ρ again represents the traffic intensity and since the maximum service rate is sµ,
in this case it is given by:
ρ = λ/(sµ) (27.36)

A key point to note is that if the number of customers in the system is more than s then the system effectively behaves like an M/M/1 system with service rate sµ. Some other notations that we need for further proofs are:
• Expected number of busy servers: α = λ/µ

• Alternative expression 1: α/s = ρ

• Alternative expression 2: α = sρ

27.4.3 Getting the average measures


Noting the steady state probability for the system transitioning to state 0 is given by:

p0 = [ Σ_{r=0}^{s−1} α^r/r! + (α^s/s!)(1 − α/s)^{−1} ]^{−1} (27.37)

With this we get the mean number of customers in the system as:
L = α + ρα^s p0 / [s!(1 − ρ)^2] (27.38)

Consequently, the mean number of customers in the queue is given as:

Lq = ρα^s p0 / [s!(1 − ρ)^2] (27.39)

Getting to the waiting time, we get the mean waiting time in queue as follows:
Wq = α^s p0 / [s! sµ (1 − ρ)^2] (27.40)
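To tie the M/M/s formulas together, below is a rough calculator sketch (Python standard library only; the rates and server count are arbitrary illustrative values):

```python
from math import factorial

# Illustrative M/M/s calculator following equations 27.37-27.40 (arbitrary inputs).
lam, mu, s = 8.0, 3.0, 4        # arrival rate, per-server service rate, servers
alpha = lam / mu                # expected number of busy servers
rho = alpha / s                 # traffic intensity, must be < 1 for stability

p0 = 1.0 / (sum(alpha**r / factorial(r) for r in range(s))
            + (alpha**s / factorial(s)) / (1 - rho))

Lq = rho * alpha**s * p0 / (factorial(s) * (1 - rho)**2)   # (27.39)
L  = alpha + Lq                                            # (27.38)
Wq = Lq / lam                                              # Little's law, matching (27.40)

print(p0, Lq, L, Wq)
```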



Chapter 28

Continuous Time Markov Chains

28.1 Motivation for CTMC


Most dynamical systems are asynchronous in nature. That is events or measure-
ments or both do not occur based on a global clock. An asynchronous system does
not depend on strict arrival times. As a consequence, in such systems,

1. Events, measurements or durations are irregularly spaced.


2. Rates vary by several orders of magnitude.
3. Durations of continuous measurement need to be expressed explicitly.

In a Discrete Time Markov Chain (DTMC), computations proceed one time-step


at a time. However for uneventful times, this is computationally expensive. In
Continuous Time Markov Chains (CTMC), there is no natural time step. Hence
these models jump over uneventful time periods.

28.2 Time Discreteness and Markov Property


In this section we show that sampling a DTMC at sub-intervals (or equivalently,
sampling with a greater sampling rate) in order to extend them to CTMC does not
work. We also determine those exact cases where this yields a meaningful solution
and those cases where it does not.

Proposition. A DTMC cannot be converted to a CTMC by sampling at sub-intervals,


or by increasing rate of sampling.

Example. Consider the following 2-state DTMC with transition probability matrix T1 . Let us denote the states by S = {1, 2}.

T1 = [ 0.75  0.25 ]
     [ 0.5   0.5  ]

Each entry in T1 denotes the conditional probability that the system moves to state j at time t given that it is in state i at time (t − 1). That is, P [Xt = j|Xt−1 = i].

If T1 were to describe a continuous-time system sampled at a period of 1 time unit, then T_{1/2} describes the same system sampled at a period of 1/2 time unit (or equivalently twice the sampling rate). We can compute T_{1/2} by matrix factorisation of T1 , for the following reason.

Figure 28.1: Sampling in half time-steps

At time t the system is at state j and at time (t − 1) it is at state i. Let k be the state of the system at time (t − 1/2). The state k can be either 1 or 2. Sampling at half time intervals would mean that the prior transition matrix T1 now represents 2-step transition probabilities, while retaining the same values. The two-step transition probability can now be decomposed using the discrete Chapman-Kolmogorov equation.

P (Xt = j|Xt−1 = i) = Σ_k P (X_{t−1/2} = k|Xt−1 = i) P (Xt = j|X_{t−1/2} = k) (28.1)

T1 (i, j) = Σ_k T_{1/2}(i, k) T_{1/2}(k, j) (28.2)

T1 = T_{1/2} T_{1/2} = [T_{1/2}]^2 =⇒ T_{1/2} = [T1]^{1/2} (28.3)

The matrix T_{1/2} is the matrix square root of T1 . By decomposing the matrix T1 , we get the value of T_{1/2} :

T_{1/2} = [ 0.8334  0.1667 ]
          [ 0.3334  0.6667 ]

We observe that T_{1/2} is a stochastic matrix, with rows summing to 1. Although this appears to be a reasonable method to convert a DTMC to a CTMC by sub-interval sampling, it is actually not so, because it works only in the case when T1 is positive definite (has all positive eigenvalues). In the following example, we consider a transition matrix with at least one negative eigenvalue and show that it does not yield a real-valued stochastic matrix.

Example. Consider the DTMC with the following transition matrix T1 .


T1 = [ 0.1  0.9 ]
     [ 0.9  0.1 ]

The eigenvalues of T1 are 1 and −0.8, so it is not positive definite. As in the previous example, decomposing T1 using the matrix square root gives the transition matrix T_{1/2} for half sub-interval sampling of the Markov Chain.

T_{1/2} = [ 0.5 + 0.447i  0.5 − 0.447i ]
          [ 0.5 − 0.447i  0.5 + 0.447i ]

Thus we see that there is no real valued stochastic matrix describing the same process as T1 but at half the sampling periodicity. Put differently, there is no 2-state CTMC which, when sampled at a rate of 1 time unit, produces a Markov Chain with matrix T1 . The problem in generating T_{1/2} arises because T1 has negative eigenvalues.
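Both computations can be reproduced in a few lines; this is an illustrative sketch assuming Python with numpy and scipy (scipy.linalg.sqrtm computes the principal matrix square root):

```python
import numpy as np
from scipy.linalg import sqrtm  # assumes scipy is available

T1_good = np.array([[0.75, 0.25], [0.5, 0.5]])   # eigenvalues 1 and 0.25, both positive
T1_bad  = np.array([[0.1, 0.9], [0.9, 0.1]])     # eigenvalues 1 and -0.8

T_half_good = sqrtm(T1_good)
T_half_bad  = sqrtm(T1_bad)

print(T_half_good)               # real stochastic matrix
print(T_half_good.sum(axis=1))   # rows sum to 1
print(T_half_bad)                # complex entries: no valid half-step chain exists
```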

Proposition. Only stochastic transition matrices with all positive eigenvalues cor-
respond to a CTMC process sampled at a given periodicity. This means that:

1. The set of CTMC is smaller than the set of DTMC.

2. These processes are Markovian only when sampled at a particular periodicity


and the only method of extension to points of time outside the periodicity
would be to construct non-Markovian and non-stationary processes.

3. Many systems do not have a natural sampling rate. The rate is chosen for
computational or measurement convenience.

28.3 Properties of CTMC


Definition. A Continuous Time Markov Chain is a stochastic process X(t) that
evolves in continuous time (t ≥ 0) on discrete state space. It obeys the Markov
property and time homogeneity.

Markov Property The Markov property states that the conditional probability
of the process to be in a future state j depends only on the current state and is
independent of the past path taken by the process.
P [X(t) = j|X(t1 ) = i1 , . . . , X(tn ) = in ] = P [X(t) = j|X(tn ) = in ] (28.4)

Time Homogeneity This refers to the property that the conditional probability
of the process being in a future state j given a current state i remains the same
so long as the time interval between transition is the same. That is, for example
2-period transitional probability remains the same no matter what the time point
of the initial state - transition from i to j in interval t = 1 to t = 3 is same as that
in the interval t = 5 to t = 7.
P [X(t) = j|X(s) = i] = P [X(t − s) = j|X(0) = i] (28.5)

Notation Define the notation pij (s, t + s) as the following.

pij (s, t + s) = P [X(t + s) = j|X(s) = i] (28.6)

pij (0, t) = pij (t) = P [X(t) = j|X(0) = i] (28.7)

Thus we have the transition probability matrix given by,


 

P (t) = [ pij (t) ] , t ≥ 0 (28.8)

For a transition at a single instant t = 0, note that,

pij (0) = P [X(0) = j|X(0) = i] = 1 if i = j ; 0 if i ≠ j (28.9)

P (t) = [ pij (t) ] , t ≥ 0 =⇒ P (0) = I (28.10)

This implies that the probability for same-instant transitions is 1 if the process remains in the same state and 0 if it moves to another state. Thus single-instant transitions from
one state to a different state are not allowed. Consequently, the transition matrix
P (0) is an Identity matrix.

28.4 Chapman Kolmogorov Equation


Recall that the discrete time Chapman-Kolmogorov equation establishes the rela-
tionship between multi-step transitions as a product of sub-step transition matri-
ces. That is,

P^(n+m) = P^(n) . P^(m) (28.11)

P^(n+m)_ij = Σ_{k=0}^{∞} P^(n)_ik . P^(m)_kj , ∀i, j ∈ S and n, m ≥ 0 (28.12)

The continuous time analogue of the Chapman-Kolmogorov equation is given by

Pij (0, s + t) = Σ_k Pik (0, s) . Pkj (s, s + t), 0 < s < t (28.13)

P (s + t) = P (s) . P (t) (28.14)

Note that Pij (t) is a continuous and differentiable function of t.

28.5 Rate Matrix


The CTMC does not have a natural sampling period. Arrivals in a CTMC process
may occur at irregularly spaced intervals. There is thus a difficulty in specifying the
CTMC with transition probability matrix P . The time-step of P could potentially
be a real value, given no periodicity in sampling. Therefore we resort to the use
of a rate matrix Q to specify the CTMC. We had noted that the Pij (0, t) matrix is
continuous and differentiable with respect to t. Obtaining the right derivative of
pij (t) we have,


qij = dpij (t)/dt |_{t=0} (28.15)

qij = lim_{h→0} [pij (t + h) − pij (t)]/h |_{t=0} (28.16)

qij = lim_{h→0} [pij (h) − pij (0)]/h (28.17)

Proposition. The elements of the transition probability matrix pij are related to
those of the rate matrix by the following equation, where qii ≤ 0, qij ≥ 0 and h
takes small non-negative values.
pij (h) = 1 + h qii + o(h) if i = j ; pij (h) = h qij + o(h) if i ≠ j , as h → 0 (28.18)
Proof. We first note that the transition matrix for pij (0) = δij , which is the Kro-
necker Delta function taking the value of 1 when i = j and 0 otherwise. Differen-
tiating pii (t) with respect to t,

dpii (t)/dt |_{t=0} = lim_{h→0} [pii (h) − pii (0)]/h = lim_{h→0} [1 + h qii + o(h) − 1]/h (28.19)

= lim_{h→0} [qii + o(h)/h] = qii + 0 = qii (28.20)

Likewise for the case i ≠ j, differentiating with respect to t,

dpij (t)/dt |_{t=0} = lim_{h→0} [pij (h) − pij (0)]/h = lim_{h→0} [h qij + o(h) − 0]/h (28.21)

= lim_{h→0} [qij + o(h)/h] = qij + 0 = qij (28.22)

Proposition. The row sum of the rate matrix is zero.


Proof. Let Q denote the rate matrix such that Q = {qij }. Also, Σj pij (t) = 1, ∀i,
since the row sum of stochastic matrix P (t) is 1. Differentiating this with respect
to t, the RHS turns out to be 0.

d/dt (Σj pij (t)) |_{t=0} = d(1)/dt |_{t=0} (28.23)

Σj dpij (t)/dt |_{t=0} = 0 =⇒ Σj qij = 0 (28.24)
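These rate matrix properties are easy to check numerically. The sketch below (assuming Python with numpy/scipy, and an arbitrary made-up Q) uses the fact, developed later in this chapter, that P(t) = e^{Qt}:

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential, assumes scipy is available

# A hypothetical 3-state rate matrix: off-diagonals >= 0, every row sums to zero.
Q = np.array([[-2.0,  1.0,  1.0],
              [ 0.5, -1.0,  0.5],
              [ 1.0,  2.0, -3.0]])

P1 = expm(Q * 1.0)          # transition probabilities over one time unit
print(P1.sum(axis=1))       # rows of P(t) sum to 1 for every t
print(expm(Q * 0.0))        # P(0) = I, as required
```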

28.6 Description of the CTMC


In the case of a DTMC, the transition probability matrix answers all questions
about the description of the process. Given the single time-step transition matrix,
we know the probability that it jumps from one state to another state. Further,
using the Chapman-Kolmogorov equation, multiple time-step transitions can be
computed by simply applying the matrix products. However, the same cannot be
said for the CTMC. In this section, we attempt to answer two main questions:

1. How long does the CTMC stay at a particular state before jumping to the
next state?

2. With what probability does the CTMC jump to the given next state?

Let Ti denote the time spent by the CTMC in state i before moving to another state. The probability that the time spent in state i is greater than an arbitrary value t is the probability that the process stays in i at every instant s between 0 and t, conditioned on the initial state being i.

P [Ti > t] = P [X(s) = i, 0 ≤ s ≤ t | X(0) = i] (28.25)

Applying the Markov chain rule, the probability of being in state i for sub-intervals
(fractions of total time) depends only on the current state information and not the
past.

P [Ti > t] = P [X(s) = i, 0 ≤ s ≤ t/n | X(0) = i]
× P [X(s) = i, t/n ≤ s ≤ 2t/n | X(t/n) = i]
× · · · × P [X(s) = i, (n−1)t/n ≤ s ≤ t | X((n−1)t/n) = i] (28.26)

Using the time homogeneity property, the conditional probability is the same in each of these n intervals because they all have the same size t/n.

P [Ti > t] = ( P [X(s) = i, 0 ≤ s ≤ t/n | X(0) = i] )^n ∀n

= lim_{n→∞} ( P [X(s) = i, 0 ≤ s ≤ t/n | X(0) = i] )^n

= lim_{n→∞} ( pii (t/n) )^n = lim_{n→∞} ( 1 + qii t/n + o(t/n) )^n = e^{t qii} (28.27)

where the last step is the standard exponential limit.

The probability that the time spent by the CTMC in state i is greater than t follows an exponential distribution, with the rate matrix diagonal element −qii as the parameter. Also note that qii ≤ 0. Thus,

P [Ti > t] = e^{t qii} , P [Ti ≤ t] = 1 − e^{t qii}

=⇒ Ti ∼ Exp(−qii ) , E[Ti ] = −1/qii (28.28)

Suppose the CTMC changes state at time t, from state i to state j. The probability of this jump is given by lim_{h→0} P [X(t + h) = j | X(t) = i, X(t + h) ≠ i]. Expanding the term within the limit, we obtain,

P [X(t + h) = j | X(t) = i, X(t + h) ≠ i] (28.29)

= P [X(t + h) = j , X(t) = i, X(t + h) ≠ i] / P [X(t) = i, X(t + h) ≠ i]

= P [X(t + h) = j , X(t) = i] / P [X(t + h) ≠ i , X(t) = i] , j ≠ i (28.30)

Also note that,

P [X(t + h) = j | X(t) = i] = P [X(t + h) = j , X(t) = i] / P [X(t) = i]

P [X(t + h) ≠ i | X(t) = i] = P [X(t + h) ≠ i , X(t) = i] / P [X(t) = i]

=⇒ P [X(t + h) = j | X(t) = i] / P [X(t + h) ≠ i | X(t) = i]
= P [X(t + h) = j , X(t) = i] / P [X(t + h) ≠ i , X(t) = i] , j ≠ i (28.31)

Substituting the above equations into the expression for the jump probability,

lim_{h→0} P [X(t + h) = j | X(t) = i, X(t + h) ≠ i]

= lim_{h→0} P [X(t + h) = j , X(t) = i] / P [X(t + h) ≠ i , X(t) = i] , j ≠ i

= lim_{h→0} pij (h) / Σ_{k≠i} pik (h) = lim_{h→0} h qij / Σ_{k≠i} h qik

= qij / Σ_{k≠i} qik = qij / (−qii ) (28.32)

Since we know that the row sum of the Q matrix is zero, Σk qik = 0, and that qii ≤ 0, it follows that qii = −Σ_{k≠i} qik , the negative sum of the off-diagonal entries.

We have now set up the premise to answer the two questions initially raised. The CTMC remains in state i for a period Ti , such that Ti ∼ Exp(−qii ) with mean E[Ti ] = −1/qii . Then it jumps to another state j ≠ i, with probability qij /(−qii ). Thus we have shown that the CTMC process depends only on the rate matrix Q rather than on the transition probability matrix P . If the CTMC process is only observed at jumps, then a Markov Chain is obtained with transition matrix P . This MC with P as the transition matrix is called the Embedded Markov Chain.

P = [ pij ] , pij = qij / (−qii )

The states of the CTMC are defined to be recurrent or transient in accordance


with their properties in the Embedded Markov Chain. The exception to this is
periodicity, which is not applicable to a continuous process.

28.7 Forward Kolmogorov Equation


The Forward Kolmogorov equation is a first order differential equation that de-
scribes the dynamics of the CTMC. In order to determine its dynamics, we ask
the question, given Q, how do we get P (t) for any t ≥ 0 ? From the continu-
ous time analogue of the Chapman-Kolmogorov equation P (t + h) = P (t)P (h) for
t ≥ 0, h ≥ 0, we have,

[P (t + h) − P (t)]/h = [P (t)P (h) − P (t)]/h

= P (t)[P (h) − I]/h

= P (t)[P (h) − P (0)]/h (28.33)
Taking the derivative of P (t) with respect to time t,

dP (t)/dt = lim_{h→0} [P (t + h) − P (t)]/h

= lim_{h→0} P (t)[P (h) − P (0)]/h

= P (t) lim_{h→0} [P (h) − P (0)]/h

= P (t) dP (t)/dt |_{t=0} = P (t)Q (28.34)

This results in the famous Forward Kolmogorov Equation for determining the tran-
sition probabilities of a CTMC given the rate matrix. In the matrix form, we have,

dP (t)/dt = P (t)Q (28.35)

dpij (t)/dt = Σ_k pik (t) qkj ∀ i, j (28.36)

The FKE can also be expressed in its element-wise form. Express the Chapman-Kolmogorov equation in element-wise form:

pij (s, t + h) = Σ_{k∈S} pik (s, t) . pkj (t, t + h)

Taking the derivative with respect to t gives (for s < t):

∂pij (s, t)/∂t = lim_{h→0} [pij (s, t + h) − pij (s, t)]/h

=⇒ ∂pij (s, t)/∂t = Σ_{k∈S} pik (s, t) qkj (t) (28.37)

28.8 Backward Kolmogorov Equation


The Backward Kolmogorov Equation (BKE) can be derived much in the same manner as the forward, this time factoring the Chapman-Kolmogorov equation as P (t + h) = P (h)P (t) for t ≥ 0, h ≥ 0. We have,

[P (t + h) − P (t)]/h = [P (h)P (t) − P (t)]/h = [P (h) − P (0)]P (t)/h (28.38)

Taking the limit as h → 0,

lim_{h→0} [P (t + h) − P (t)]/h = lim_{h→0} [P (h) − P (0)]/h P (t)

=⇒ dP (t)/dt = dP (t)/dt |_{t=0} P (t) = QP (t) (28.39)

Thus, we have the BKE given by:

dP (t)/dt = QP (t) (28.40)

dpij (t)/dt = Σ_k qik pkj (t) ∀ i, j (28.41)

In Finance, we use the BKE in pricing financial products such as options, futures
and other derivatives. Suppose we have a payoff from a derivative at maturity
period t = T . We wish to compute the initial price of the derivative at t = 0, which can be done by modelling the dynamics of its underlying asset using the Backward Kolmogorov equation.
The forward and backward Kolmogorov equations give the dynamics of the system
P (t). We know that P (0) = I and P (t) = eQt . For a finite state CTMC the stationary
solution of both equations are the same.

FKE: dP (t)/dt = P (t)Q =⇒ P Q = 0 gives the stationary solution. (28.42)

BKE: dP (t)/dt = QP (t) =⇒ QP = 0 gives the stationary solution. (28.43)

Both the equations result in the same stationary solution for P (t).

28.9 Stationary Distribution


The stationary distribution of the CTMC process is given by the vector π, such
that π = πP (t), with Σj πj = 1, πj ≥ 0, ∀t. Since π denotes the stationary or
equilibrium distribution, we have,
P [X(0) = i] = πi then P [X(t) = i] = πi ∀i, ∀t (28.44)

=⇒ πP (t) = π ∀t (28.45)

Taking the derivative on both sides of the equation,

d(πP (t))/dt = d(π)/dt =⇒ π dP (t)/dt = 0 ∀t (28.46)

π dP (t)/dt |_{t=0} = 0 =⇒ πQ = 0 (28.47)

Thus to obtain π, solve equations πQ = 0 and Σj πj = 1.


Theorem. For an irreducible process the stationary distribution π is unique if
it exists. And if π exists the process is positive recurrent and all rows of P (t)
converge to π.
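The theorem can be checked numerically. Here is a small sketch (Python with numpy/scipy, reusing the made-up Q from the earlier example) that solves πQ = 0 with the normalization constraint and then confirms that the rows of P(t) = e^{Qt} converge to π:

```python
import numpy as np
from scipy.linalg import expm

# The same hypothetical 3-state rate matrix as before (rows sum to zero).
Q = np.array([[-2.0,  1.0,  1.0],
              [ 0.5, -1.0,  0.5],
              [ 1.0,  2.0, -3.0]])

# Solve pi Q = 0 together with sum(pi) = 1 via least squares on the stacked system.
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]

print(pi)
print(expm(Q * 50.0))   # for large t, every row of P(t) approaches pi
```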

28.10 Mental image of CTMC


The mental image of a CTMC is a DTMC in which transitions can happen at any time (because time is continuous). Let S denote the discrete state space. It can be finite or countably infinite. The Markov property means that the jump times are exponentially distributed. We denote all information pertaining to the history of the process X up to time s by the filtration FX(s) . A family of σ-fields Ft is defined to be a filtration if Ft1 ⊂ Ft2 whenever t1 ≤ t2 .

FX(s) : all information pertaining to the history of X up to time s. (28.48)

Let state j ∈ S and let s ≤ t, then the Markov Property can be stated as

P [X(t) = j | FX(s) ] = P [X(t) = j | X(s)] (28.49)

Since we also want the process to be time homogeneous,

P [X(t) = j | X(s)] = P [X(t − s) = j | X(0)] (28.50)

Any process satisfying the above two equations is said to be a Time Homogeneous Continuous Time Markov Chain. Equivalently, the CTMC can also be defined in terms of the transition rate matrix, as we have seen previously.
Chapter 29

CTMC and Embedded MC

29.1 Transition Rate Matrices


• In the framework of Continuous Time Markov Chains (CTMCs) - the process is described by the Transition Rate matrix or the Q matrix - where the diagonal elements are negative - and the row sum is 0.

• A basic intuition about this rate matrix is that - probability mass flowing out of one state will go to another state - essentially it is conserved.

• Below is a representation of such a rate matrix for a CTMC process:

Q = [ q00 q01 · · · ]   [ −q0 q01 · · · ]
    [ q10 q11 · · · ] = [ q10 −q1 · · · ]   (29.1)
    [  ·   ·  · · · ]   [  ·   ·  · · · ]

• We denote - the probability of being in state i by time t - as follows:

αi (t) = P [Xt = i] (29.2)

αi (t + ∆t) = P [Xt+∆t = i] (29.3)

• If we know that the process takes on state i by time t, then the probability of
the process jumping to state j in a small time ∆t is given by:

P [Xt+∆t = j] = P [Xt+∆t = j|Xt = i]P [Xt = i] (29.4)

=⇒ αj (t + ∆t) = αi (t)Pij (∆t) (29.5)

• If we consider the above result by considering all possible states i and j then
we can write the same in vector form as follows:

α(t + ∆t) = α(t)P (∆t) (29.6)


• We now consider the limiting case - of the probability of the continuous time process being in a certain state:

lim_{∆t→0} [α(t + ∆t) − α(t)]/∆t = lim_{∆t→0} [α(t)P (∆t) − α(t)]/∆t (29.7)

=⇒ dα/dt = α lim_{∆t→0} [P (∆t) − I]/∆t (29.8)

NOTE: lim_{∆t→0} [P (∆t) − P (0)]/∆t = dP (t)/dt |_{t=0} = P ′(0) = Q (29.9)

=⇒ dα(t)/dt = α(t)Q (29.10)

• We can now expand the above expression to derive the transition probability matrix over the time interval ∆t as follows:

dα(t)/dt = α(t)Q (29.11)

=⇒ [α(t + ∆t) − α(t)]/∆t = α(t)Q (29.12)

=⇒ α(t + ∆t) = α(t) + α(t)Q∆t + o(∆t) (29.13)

=⇒ α(t + ∆t) = α(t)[I + Q∆t] + o(∆t), where [I + Q∆t] approximates P (∆t) (29.14)

NOTE: P (∆t) ≈ [I + Q∆t] (29.15)

NOTE: lim_{∆t→0} P (∆t) = P (0) = I (29.16)

=⇒ α(t) = α(0) e^{Qt} , where e^{Qt} is the matrix exponential function (29.17)

29.2 Global Balance Equations


• We know that the stationary distribution is independent of time and is given
as follows:
π = lim_{t→∞} α(t) (29.18)

• Since this is independent of time, its derivative with respect to time would
be zero. Then we have the following relation:
dα(t)/dt = α(t)Q =⇒ πQ = 0 (29.19)

• The above stated balance equation essentially characterizes the balance of probability flows across states. We take the specific case of the j th row as follows:

qj πj = Σ_{i≠j} πi qij , where qj = Σ_{i≠j} qji (29.20)

=⇒ πj Σ_{i≠j} qji = Σ_{i≠j} πi qij (29.21)

• The above equation states that the total outflow of probability from state j and the inflow into state j is the same. For reversible processes (such as birth-death chains), we can remove the summation term by term and obtain the detailed balance equation as follows:

πj qji = πi qij (29.22)

NOTE: πi qij = probability flow from state i to j (transition frequency) (29.23)

• We note that these equations are linearly dependent - that is, any given equation is automatically satisfied if the others are satisfied - due to conservation of probability. The solution is unique up to a constant factor and is uniquely determined by adding a normalization condition as follows:

π^T e = 1 =⇒ Σ_j πj = 1 (29.24)

• Further note that the stationary distribution is nothing but the left eigen-
vector corresponding to the eigenvalue of 0 - after solving the following
equation:
πT Q = 0 (29.25)

29.3 Behavior in stationarity


• Let us say that we have a state space denoted as Ω = {A, A^C } - the two sets A and A^C constitute a partition of the state space, and in stationarity the probability flows between them must balance.

Figure 29.1:

• Given the above setup, we can write the balance equations over these two sets as follows:

Σ_{i∈A, j∈A^C} πi qij = Σ_{j∈A^C, i∈A} πj qji (29.26)

=⇒ π^T Q = 0 (29.27)

=⇒ [ πA  πA^C ] [ Q_AA     Q_AA^C    ] = [ 0A  0A^C ] (29.28)
                [ Q_A^C A  Q_A^C A^C ]

[ πA Q_AA + πA^C Q_A^C A    πA Q_AA^C + πA^C Q_A^C A^C ] = [ 0A  0A^C ] (29.29)

(1) =⇒ πA Q_AA + πA^C Q_A^C A = 0A (29.30)

(2) =⇒ πA Q_AA^C + πA^C Q_A^C A^C = 0A^C (29.31)

Post-multiplying (1) by the all-ones vector and using the zero row-sum property Q_AA e_A = −Q_AA^C e_A^C , the diagonal-block terms cancel and we obtain the cut balance equation (29.32, 29.33):

πA Q_AA^C e_A^C = πA^C Q_A^C A e_A

which is the vector form of (29.26).

29.4 Solving the balance equations


• Our attempt is to solve the homogeneous balance equations along with the normalization condition given as follows:

Solve: π^T Q = 0 (29.34)

subject to: e^T π = 1 (29.35)

• The e vector and the E matrix of all ones help us to frame the normalization in a vector-matrix format - these are given by:

e^T = [1 1 · · · 1] , E (n×n) = the matrix of all ones, whose every row is e^T (29.36)

• Essentially then, we have n copies of the normalization condition given by the following equation:

π^T e = 1 , where π^T is 1×n and e is n×1 (29.37)

• Putting together this entire framework we get the following equation:

Eπ = e , i.e., the n×n matrix of ones times (π1 , . . . , πn )^T equals (1, . . . , 1)^T (29.38)

• We can now note the following equations and associated computations that help us arrive at the solution for the stationary distribution vector.

(1) =⇒ π^T Q = 0^T (29.39)

(2) =⇒ π^T E^T = e^T (29.40)

adding the above two =⇒ π^T [Q + E] = 0^T + e^T = e^T (29.41)

=⇒ [Q + E]^T π = e =⇒ [Q^T + E]π = e (29.42)

=⇒ π = [Q^T + E]^{−1} e (29.43)
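This solution formula translates directly into code. A minimal sketch (Python with numpy, reusing the illustrative made-up Q from the chapter 28 examples) is:

```python
import numpy as np

# A hypothetical 3-state rate matrix (rows sum to zero).
Q = np.array([[-2.0,  1.0,  1.0],
              [ 0.5, -1.0,  0.5],
              [ 1.0,  2.0, -3.0]])

n = Q.shape[0]
E = np.ones((n, n))
e = np.ones(n)

pi = np.linalg.solve(Q.T + E, e)   # equation 29.43: pi = [Q^T + E]^{-1} e
print(pi, pi.sum())                # a probability vector summing to 1
print(pi @ Q)                      # ~ [0 0 0], so pi Q = 0 holds
```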

29.5 Embedded Markov Chains


• With every CTMC process we associate a corresponding DTMC process - this is known as an Embedded Markov Chain - or alternatively, a jump chain, denoted as X_n^{(e)}.

• Here, the focus is on the transition of Xt - whenever the transitions occur -


we say that the focus is on the sequence of different states visited by Xt .

• Let us consider the state transitions of the process Xt occurring at time in-
stances - t0 , t1 , · · · .
• Note the definition: X_n^{(e)} = value of Xt immediately after the transition at time tn - that is, at time instant tn^+ - or simply, the value of Xt in the interval (tn , tn+1 ).

• Since the process Xt is a Markov Process, we say that X_n^{(e)} is a DTMC and the following condition holds:

X_n^{(e)} = X_{tn^+} (29.44)

Figure 29.2: EMC

• The transition probability of the embedded markov chain is denoted as pij if the transition is from state Xt = i to state Xt+∆t = j. We can then write the following relations:

pij = lim_{∆t→0} P [Xt+∆t = j|Xt+∆t ≠ i, Xt = i] (29.45)

=⇒ lim_{∆t→0} P_{Xt=i} [Xt+∆t = j|Xt+∆t ≠ i] (29.46)

by conditional probability =⇒ lim_{∆t→0} P_{Xt=i} [Xt+∆t = j, Xt+∆t ≠ i] / P_{Xt=i} [Xt+∆t ≠ i] (29.47)

=⇒ lim_{∆t→0} P [Xt+∆t = j, Xt+∆t ≠ i|Xt = i] / P [Xt+∆t ≠ i|Xt = i] (29.48)

• We note that when Xi ∼ exp(λi ) are independent, the following holds:

P [min(X1 , · · · , Xn ) = Xi ] = λi / (λ1 + · · · + λn ) (29.49)

• Finally we have the transition probability of the embedded markov chain given by:

pij = qij / Σ_{k≠i} qik = qij / (−qii ) for i ≠ j , and pij = 0 for i = j (29.50)
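Extracting the embedded chain from a rate matrix is mechanical. Here is a brief sketch (Python with numpy, using the same illustrative Q as before) implementing equation 29.50 together with the mean holding times:

```python
import numpy as np

# Hypothetical rate matrix again: rows sum to zero, diagonal negative.
Q = np.array([[-2.0,  1.0,  1.0],
              [ 0.5, -1.0,  0.5],
              [ 1.0,  2.0, -3.0]])

rates = -np.diag(Q)                 # holding-time rates: T_i ~ Exp(-q_ii)
P = Q / rates[:, None]              # divide each row by -q_ii ...
np.fill_diagonal(P, 0.0)            # ... and zero the diagonal, per (29.50)

print(P)                # embedded (jump) chain transition matrix, rows sum to 1
print(1.0 / rates)      # mean holding time in each state, E[T_i] = -1/q_ii
```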

Part IV

Stochastic Calculus &


Computational Finance

Part V

Topics in Data Science

Part VI

Topics in Calculus &


Discrete Mathematics

Chapter 30

Fourier Transforms

30.1 Introduction
To provide some context for our discussion about Fourier transforms and series, let
us imagine a scenario. Since Fourier analysis is concerned with signals and waves,
we imagine a musician playing a steady note on a trumpet. Further there is a mi-
crophone in front of the trumpet that is essentially capturing the sound produced.
The mic typically has a diaphram which undergoes pressure due to the sound
waves from the trumpet and this pressure then translates into voltage, which is
proportional to the instantaneous air pressure. Now if measure this with an os-
cillopscope we will get a graph of pressure against time F (t) which would turn
out to be periodic. Note that it is the reciprocal of the period which is termed as
the frequency of the note being on the trumpet. The typical relationship between
frequency and time period is:
1
ν= (30.1)
T
Let us say that the fundamental frequency of this one note sound is 256Hz. Now
in reality one sine wave of the said frequency is not produced, rather multiple
overtones are produced which are multiples of the fundamental frequency with
various amplitudes and phases. Phase basically determines where in the cycle
the signal would start repeating. Technically, we can analyse the wave by finding
a list of the amplitudes and phases of the various sine waves that comprise the
complex signal. We can plot a graph of amplitudes against frequency denoted by
A(ν). Now since we are effectively bringing the function from the time domain
to the frequency domain we say that A(ν) is the Fourier transform of F (t).

30.2 Fourier series


Continuing with our previous example, we can say that a steady note sound signal
can be described completely by the fundamental frequency, its amplitude and the
amplitudes of its overtones or harmonics. For this we can use a discrete sum:

F (t) = a0 + a1 cos(2πν0 t) + b1 sin(2πν0 t) + a2 cos(4πν0 t) + b2 sin(4πν0 t) + · · · (30.2)


Here ν0 represents the fundamental frequency. The various sine and cosine func-
tions in the series denote the various phases of the signal that are not in step with
the fundamental signal. We can rewrite the previous formula as:

F (t) = Σ_{n=−∞}^{∞} [an cos(2πnν0 t) + bn sin(2πnν0 t)] (30.3)

Note that this process of constructing a waveform by adding together the funda-
mental frequency and its overtones of various amplitudes is called Fourier syn-
thesis. Given that cos(−x) = cos(x) and sin(x) = − sin(−x) we can rewrite the
above expression as:

F (t) = A0 /2 + Σ_{n=1}^{∞} [An cos(2πnν0 t) + Bn sin(2πnν0 t)] (30.4)

Where An = a−n + an and Bn = bn − b−n .

30.3 Amplitudes
Now note that the opposite process of extracting the frequencies and amplitudes
from the original signal is called Fourier analysis. We are interested in trying to
find the amplitudes Am and Bm for various instances of m. Now before moving
ahead, we note the utilisation of the orthogonality property of trigonometric
functions - the central idea is that if we take a sine and a cosine, or two sines or two
cosines (as multiples of the fundamental frequency), then take their product and
integrate this product over the period of fundamental frequency, then the result is
zero. Noting that 1 period is denoted as the inverse of frequency: P = 1/ν0 , we
have: Z P
cos(2πnν0 t) cos(2πmν0 t)dt = 0 (30.5)
t=0
Z P
sin(2πnν0 t) sin(2πmν0 t)dt = 0 (30.6)
t=0
Z P
sin(2πnν0 t) cos(2πmν0 t)dt = 0 (30.7)
t=0
Note that in case m = n then the first two integrals would resolve to 1/2ν0 . Now
we note some general expressions for the coefficient values:
Bm = (2/P) ∫_{t=0}^{P} F (t) sin(2πmν0 t) dt (30.8)

Am = (2/P) ∫_{t=0}^{P} F (t) cos(2πmν0 t) dt (30.9)
An alternate way of writing the Fourier series is shown below. Note that this
expression comes about as a result of taking Am = Rm cos φm and Bm = Rm sin φm .

F (t) = A0 /2 + Σ_{m=1}^{∞} Rm cos(2πmν0 t + φm ) (30.10)
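The analysis integrals in equations 30.8 and 30.9 are easy to approximate numerically. Below is an illustrative sketch (assuming Python with numpy; the square-wave test signal is our own choice) recovering the well known 4/(mπ) harmonic amplitudes of a square wave:

```python
import numpy as np

# Numerically estimate the Fourier coefficients (30.8)-(30.9) for a square wave.
P = 1.0                                  # period, so nu0 = 1/P = 1 Hz
t = np.linspace(0.0, P, 4096, endpoint=False)
F = np.sign(np.sin(2 * np.pi * t / P))   # square wave at the fundamental frequency

for m in range(1, 6):
    Am = 2 / P * np.trapz(F * np.cos(2 * np.pi * m * t / P), t)
    Bm = 2 / P * np.trapz(F * np.sin(2 * np.pi * m * t / P), t)
    print(m, round(Am, 4), round(Bm, 4))   # Bm ~ 4/(m*pi) for odd m, 0 for even m
```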

Where Rm is the mth harmonic amplitude and φm is the corresponding phase


value. Note as an additional point that two waves are said to be in phase if their
crests arrive together at a certain point. If however at some point, one wave has a
trough and the other has a crest, then they are said to be completely out of phase
- in this case they are said to have a 180 degrees phase difference.

30.4 Alternate forms of writing the series


We can actually write the Fourier series in the form of complex exponentials
instead of trigonometric functions. First as a reference we note the DeMoivre’s
theorem:
(cos x + i sin x)n = cos(nx) + i sin(nx) = eixn (30.11)
Now we denote the Fourier series using this notation:

X
F (t) = Cm e2πimν0 t (30.12)
−∞

Now the coefficients Cm are in general complex numbers. Typically, without going into the derivations, we use inversion formulae to get the coefficient values:

Am = 2ν0 ∫_0^{1/ν0} F (t) cos(2πmν0 t) dt (30.13)

Bm = 2ν0 ∫_0^{1/ν0} F (t) sin(2πmν0 t) dt (30.14)

Cm = ν0 ∫_0^{1/ν0} F (t) e^{−2πimν0 t} dt (30.15)
The above expressions can be rewritten in slightly different notation if we let ν0 = ω0 /2π. The expressions become:

Am = (ω0 /π) ∫_0^{2π/ω0} F (t) cos(mω0 t) dt (30.16)

Bm = (ω0 /π) ∫_0^{2π/ω0} F (t) sin(mω0 t) dt (30.17)

Cm = (ω0 /2π) ∫_0^{2π/ω0} F (t) e^{−imω0 t} dt (30.18)
An easy way to remember these formulae is by writing them as:

Am = (2/period) ∫_{one period} F (t) cos(2πmt/period) dt (30.19)

Bm = (2/period) ∫_{one period} F (t) sin(2πmt/period) dt (30.20)

30.5 Fourier Transforms


We note that whether F (t) is periodic or not, we can give a complete description
of the function using combinations of sines and cosines. Note that a non periodic
function can be thought of as the limiting case of a periodic function wherein
the period tends to infinity and the fundamental frequency tends to zero. Also
in this case the harmonics would be closely spaced and there would be a contin-
uum of them, with each such harmonic having an infinitesimal amplitude given
as: a(ν)dν. Now integrating throughout all these amplitudes to synthesize the
function we get:
F (t) = ∫_{−∞}^{∞} a(ν) cos(2πνt) dν + ∫_{−∞}^{∞} b(ν) sin(2πνt) dν (30.21)

Writing this in terms of amplitude and phase values we have:


F (t) = ∫_{−∞}^{∞} r(ν) cos(2πνt + φ(ν)) dν (30.22)

This can also, alternatively be written as:


F (t) = ∫_{−∞}^{∞} Φ(ν) e^{2πiνt} dν (30.23)

Note that if F (t) is real then a(ν) and b(ν) are real as well; however, if the function F (t) is asymmetrical, that is if F (t) ≠ F (−t), then Φ(ν) takes complex values. In certain cases F (t) is symmetrical, which in turn implies that Φ(ν) is real and F (t) consists only of cosines. Our Fourier series would then become:

F (t) = ∫_{−∞}^{∞} Φ(ν) cos(2πνt) dν (30.24)

Now comes the interesting bit. We can actually recover the function that contains
information about the frequencies Φ(ν) from F (t) by way of inversion.
Φ(ν) = ∫_{−∞}^{∞} F (t) cos(2πνt) dt (30.25)

Finally we say that Φ(ν) which is a function in the frequency domain, is the Fourier
transform of F (t) which is in the time domain. Another general formulation of
this is given by:

Φ(ν) = ∫_{−∞}^{∞} F (t) e^{−2πiνt} dt (30.26)

30.6 Spectrum
Note first that the square of the amplitude of oscillation of a wave, gives a measure
of power contained in each harmonic of the wave. In case the fourier transform
of F (t) that is Φ(ν) is complex, then if we take the product of this and its complex
conjugate Φ∗ (ν) then we would get the power spectrum or the spectral power
density of F (t).
S(ν) = Φ(ν)Φ∗ (ν) (30.27)

30.6.1 The autocorrelation theorem


The autocorrelation function of a function F (t) can be defined as:

A(t) = ∫_{−∞}^{∞} F (t′)F (t + t′) dt′ (30.28)

We note that the process of autocorrelation can be thought of as multiplying every point of a function by another point that is a distance t′ away, and then summing all those products. Now let us take the Fourier transform of both sides of the equation:

Γ(ν) = ∫_{−∞}^{∞} A(t) e^{2πiνt} dt = ∫_{−∞}^{∞} ∫_{−∞}^{∞} F (t′)F (t + t′) e^{2πiνt} dt′ dt (30.29)

After solving this we get the following expression:

Γ(ν) = Φ∗ (ν)Φ(ν) (30.30)

Note that Φ is nothing but the Fourier transform of F(t), and the multiplication
of the Fourier transform with its complex conjugate gives us the power spectral
density of F(t). Finally we note that the power spectral density is the Fourier
transform of the autocorrelation function - a result known as the Wiener-Khinchin
(autocorrelation) theorem.
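The autocorrelation theorem is easy to verify numerically. The following minimal sketch (an added illustration, not part of the original notes; the test signal is an arbitrary choice) uses the discrete, circular analogue of the definitions above: the FFT of the circular autocorrelation of a sampled signal coincides with the product of its FFT and the complex conjugate, i.e. the power spectrum.

```python
import numpy as np

# A noisy sinusoid as an arbitrary illustrative test signal
rng = np.random.default_rng(0)
n = 256
f = np.sin(2 * np.pi * 5 * np.arange(n) / n) + 0.3 * rng.standard_normal(n)

# Power spectral density: Phi(nu) * conj(Phi(nu))
phi = np.fft.fft(f)
psd = phi * np.conj(phi)

# Circular autocorrelation A[k] = sum_t f[t] f[t + k], computed directly
acf = np.array([np.sum(f * np.roll(f, -k)) for k in range(n)])

# FFT of the autocorrelation should match the power spectrum
print(np.allclose(np.fft.fft(acf), psd))   # True
```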

References
[1] A Student's Guide to Fourier Transforms - J. F. James
Chapter 31

Laplace, Dirac Delta and Fourier Series

31.1 The Laplace Transform


The Laplace transform is primarily a mapping of points in the t domain of a function
F(t) to points in the s domain. We must note that the Laplace transform exists for
functions that are of exponential order, are bounded and have converging
infinite integrals. The mathematical definition of the Laplace transform is as
follows:
L{F(t)} = f(s) = ∫_0^∞ F(t) e^{−st} dt (31.1)

A popular property of the Laplace transform is that of linearity, which can be stated
as:
L{aF1(t) + bF2(t)} = aL{F1(t)} + bL{F2(t)} (31.2)
Yet another important theorem associated with this transform is called the first
shift theorem and can be defined as follows:

L{e−bt F (t)} = f (s + b) (31.3)

The proof of this theorem is pretty straightforward.


L{e^{−bt} F(t)} = lim_{T→∞} ∫_0^T e^{−st} e^{−bt} F(t) dt (31.4)
= ∫_0^∞ e^{−st} e^{−bt} F(t) dt = ∫_0^∞ e^{−(s+b)t} F(t) dt = f(s + b) (31.5)

31.1.1 Examples and Properties


To demonstrate how this transform works, we will show a simple example of trans-
forming the function F (t) = t. Note that integration by parts is used.
L(t) = lim_{T→∞} ∫_0^T t e^{−st} dt (31.6)


→ ∫_0^T t e^{−st} dt = [−(t/s) e^{−st}]_0^T + (1/s) ∫_0^T e^{−st} dt (31.7)
= −(T/s) e^{−sT} + [−(1/s²) e^{−st}]_0^T (31.8)
= −(T/s) e^{−sT} − (1/s²) e^{−sT} + 1/s² → 1/s² as T → ∞ (31.9)
s s s s
Therefore the Laplace transform of the function F (t) = t is given by f (s) = 1/s2 .
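As a quick sanity check, symbolic algebra reproduces this result. The sketch below is an added illustration (not from the original notes) using SymPy's laplace_transform:

```python
import sympy as sp

t, s = sp.symbols('t s', positive=True)

# Laplace transform of F(t) = t; SymPy returns (transform, abscissa, conditions)
result = sp.laplace_transform(t, t, s)
print(result[0])   # 1/s**2, matching the derivation above
```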
Now we note some general formulae regarding various Laplace transforms. The
derivation of these expressions is omitted in this section.
• L(t^n) = n!/s^{n+1}
• L{t e^{at}} = 1/(s − a)²
• Before the next formula we must recall Euler's formula, which gives us an
expression for the polar coordinates of complex numbers:
e^{it} = cos(t) + i sin(t) (31.10)
Now we note that due to the linearity property, the Laplace transform of e^{it}
is given by:
L(e^{it}) = L(cos(t)) + iL(sin(t)) (31.11)
where the Laplace transforms of the individual trigonometric functions are:
L(cos(t)) = s/(s² + 1) (31.12)
L(sin(t)) = 1/(s² + 1) (31.13)
• L{tF(t)} = −(d/ds) f(s)
• A popular function whose Laplace transform is immensely useful is the Heav-
iside’s unit step function which is given by:
H(t) = { 0, if t < 0; 1, if t ≥ 0 } (31.14)

Consequently its Laplace transform is given by:
L{H(t)} = L(1) = 1/s (31.15)

• The Laplace transform of a first order differentiable function can be written as:
L{F′(t)} = ∫_0^∞ e^{−st} F′(t) dt = −F(0) + sf(s) (31.16)

• Laplace transform of a second order differentiable function is given as:

L{F 00 (t)} = s2 f (s) − sF (0) − F 0 (0) (31.17)

In a similar manner, we can generalize the above two points to write the
Laplace transform of an n times differentiable function as:

L{F (n) (t)} = sn f (s) − sn−1 F (0) − sn−2 F 0 (0) − · · · − F (n−1) (0) (31.18)

• L{sin(wt)} = w/(s² + w²)
• L{cos(wt)} = s/(s² + w²)

31.1.2 Expanding a little on Heaviside function


We earlier mentioned that the Laplace transform of the Heaviside function is given
by L{H(t)} = 1/s. However we are usually more interested in finding out the
transform of H(t − t0) where t0 > 0. Applying the Laplace transform definition to
this we get:
L{H(t − t0)} = ∫_0^∞ H(t − t0) e^{−st} dt (31.19)

Note that as per the way this function is defined, for t < t0 we have H(t − t0) = 0,
so the transform evaluates as follows, taking only those t such that t > t0, where
the function evaluates to H(t − t0) = 1:
L{H(t − t0)} = ∫_{t0}^∞ e^{−st} dt = [−e^{−st}/s]_{t0}^∞ = e^{−st0}/s (31.20)

This function assumes special relevance when it is multiplied with another func-
tion and that action of multiplying this Heaviside function is analogous to ’switch-
ing on’ the other function. With this intuition we can state the second shift theo-
rem defined as:
L{H(t − t0 )F (t − t0 )} = e−st0 f (s) (31.21)
Note that with this we can find the Laplace transform of a function that is switched
on at t = t0 .

31.1.3 Inverse Laplace transform


Computing the inverse of a Laplace transform usually involves a bit of solving using
the partial fractions decomposition method. The standard definition for the
inverse transform is given as:

if L{F (t)} = f (s) (31.22)

then L−1 (f (s)) = F (t) (31.23)



Now we take a simple example wherein the inverse transform is determined using
partial fractions:
→ L^{−1}( a/(s² − a²) ) (31.24)
Solving for the undetermined coefficients using partial fractions we get:
a/(s² − a²) = (1/2)[ 1/(s − a) − 1/(s + a) ] (31.25)
Now we can simply apply the linearity property of the inverse transform operator
to get:
L^{−1}[ a/(s² − a²) ] = (1/2)(e^{at} − e^{−at}) (31.26)
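This partial-fractions computation can be cross-checked symbolically. A small added sketch (not from the original notes; a is taken positive so SymPy can simplify):

```python
import sympy as sp

t, s = sp.symbols('t s')
a = sp.symbols('a', positive=True)

# Inverse Laplace transform of a/(s^2 - a^2)
expr = a / (s**2 - a**2)
F = sp.inverse_laplace_transform(expr, s, t)
# Expected: sinh(a*t) for t > 0, i.e. (exp(a*t) - exp(-a*t))/2
print(sp.simplify(F))
```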

31.2 The Dirac Delta Impulse function


It is observed that there exist certain functions which might not classify as func-
tions in the true sense. In order to classify as a function, an expression needs to be
defined for all values of the variable in the specified range. Note that if this is not
the case, then the expression would not be a function since it would cease to be
well defined. We are usually not interested in such expressions, however we note
that even if some of these expressions might not be well defined, if they have some
desirable global properties, then such expressions indeed turn out to be rather
useful. One such function is Dirac’s δ function. The definition is as follows:

δ(t) = 0, ∀t, t ≠ 0 (31.27)


∫_{−∞}^{∞} h(t) δ(t) dt = h(0) (31.28)

The above is defined for any function h(t) that is continuous in the interval (−∞, ∞).
The Dirac-δ function can be thought of as the limiting case of a top hat function
with unit area as it becomes infinitesimally thin and tall. First we define a function

Figure 31.1: Top hat function



as follows:
Tp(t) = { 0, if t ≤ −1/T; 0.5T, if −1/T < t < 1/T; 0, if t ≥ 1/T } (31.29)
The Dirac Delta function then models the limiting behaviour of this function and
can be written as:
δ(t) = lim_{T→∞} Tp(t) (31.30)

The integral definition can be written as follows:


∫_{−∞}^{∞} h(t) lim_{T→∞} Tp(t) dt = lim_{T→∞} ∫_{−∞}^{∞} h(t) Tp(t) dt (31.31)

The value of the integral within the limits indicates the area under the curve
h(t)Tp(t), and we say that this area approaches the value h(0) as T → ∞.
Further, for very large values of T the interval [−1/T, 1/T] will be small
enough for the value of h(t) not to differ appreciably from its value at the origin.
With this we can express h in the form h(t) = h(0) + ε(t), where the term ε(t) tends
to 0 as T goes to infinity. Therefore we can say that h(t) tends to h(0) for extremely
large values of T. Note that δ(t) is not a true function since it is not defined for
t = 0; therefore δ(0) has no value. Writing out the left and right side limits we get:
∫_{0−}^{∞} h(t) δ(t) dt = h(0) (31.32)
∫_{−∞}^{0+} h(t) δ(t) dt = h(0) (31.33)

As a limiting case of the top hat function, the Dirac Delta function then looks like this:
We note an important property: as the interval gets smaller and smaller due

Figure 31.2: Dirac Delta function: limiting case of a top hat

to T becoming large, the area under the top hat function would always be unity.

Hence in the limiting case, the length of the arrow (which happens to represent
the Dirac-δ function) is 1. Therefore we have with h = 1:
∫_{−∞}^{∞} δ(t) dt = 1 (31.34)

The Laplace transform of the Dirac Delta function is given by:

L{δ(t)} = 1 (31.35)

This essentially means that we are reducing the width of the top hat function
such that it lies between 0 and 1/T (because in the Laplace transform the
limits of integration start from 0), and that we are increasing
the height from T/2 to T so as to preserve the unit area.

31.2.1 Filtering property


Going further with the Dirac-δ function we say that the function δ(t−t0 ) represents
an impulse that is centered at time t0 . This can be thought of as a transient signal
and the limiting case of a function K(t) which is the displaced version of the top
hat function:
K(t) = { 0, if t ≤ t0 − 1/T; 0.5T, if t0 − 1/T < t < t0 + 1/T; 0, if t ≥ t0 + 1/T } (31.36)

Now as T → ∞ and utilising the definition of the Dirac-δ function we get:


∫_{−∞}^{∞} h(t) δ(t − t0) dt = h(t0) (31.37)

We can get the Laplace transform of this Dirac Delta function, provided that t0 > 0,
as:
L{δ(t − t0)} = e^{−st0} (31.38)
This has been called the filtering property since we can see clearly from the defi-
nition that the Dirac-δ function helps us pick out a particular value of a function.
∫_{−∞}^{∞} h(t) δ(t − t0) dt = h(t0) (31.39)
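The filtering property can be seen numerically by replacing δ(t − t0) with the displaced top hat K(t) for a large T. A minimal added sketch (the choices of h, t0 and the values of T are illustrative assumptions):

```python
import numpy as np

def h(t):
    return np.cos(t) + t**2   # an arbitrary smooth test function

t0 = 1.5
for T in [10.0, 100.0, 1000.0]:
    # integrate h(t) * K(t) over the support of the displaced top hat,
    # where K has height T/2 on (t0 - 1/T, t0 + 1/T) and unit area
    t = np.linspace(t0 - 1/T, t0 + 1/T, 10001)
    val = np.trapz(h(t) * 0.5 * T, t)
    print(T, val)   # approaches h(t0) = cos(1.5) + 1.5**2 as T grows
```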

31.3 Fourier Series


The central idea behind a Fourier series is that a given periodic function can be ex-
pressed as a series of sine and cosine functions. Here we will be dealing mostly
with periodic and piecewise continuous functions. Let us first start with functions
defined on the closed interval [−π, π] which possess one sided limits at −π and π.
We have a function that maps values such that f : [−π, π] → C. We can now state
the Dirichlet theorem as follows: If f is a member of the space of piecewise con-
tinuous functions which are 2π periodic on the closed interval [−π, π], having both

left and right derivatives at the end points, then we say that for each x ∈ [−π, π]
the Fourier series of f converges to:
( f(x−) + f(x+) ) / 2 (31.40)
And at both the end points (x = ±π) the series converges to:
( f(π−) + f((−π)+) ) / 2 (31.41)
That is, at points of discontinuity the Fourier series of f takes the mean of the
one sided limits of f as its value at the discontinuous point.

31.3.1 Fourier series formula


Remember that the whole point of a Fourier series is to express a periodic function
as a series of sine and cosine functions. We see the components of such a series
are typically periodic functions of period 2π given as:

1, cos(x), sin(x), cos(2x), sin(2x), · · · , cos(nx), sin(nx) (31.42)

These terms, together form a trigonometric system and the resulting series so
obtained is called the trigonometric series:

a0 + a1 cos(x) + b1 sin(x) + a2 cos(2x) + b2 sin(2x) + · · · (31.43)



= a0 + Σ_{n=1}^{∞} (an cos(nx) + bn sin(nx)) (31.44)
Here the a and b terms are the coefficients of the series, and we say that if the
coefficients are such that the series converges, then its sum will also have the
same period as the individual components, that is 2π. Now if we have a function
f(x) of period 2π that can be represented by a convergent series of the form in
equation (31.44), then we say that the Fourier series of f(x) is:
f(x) = a0 + Σ_{n=1}^{∞} (an cos(nx) + bn sin(nx)) (31.45)

Consequently, the Fourier coefficients can be found using the following equations:
a0 = (1/2π) ∫_{−π}^{π} f(x) dx (31.46)
an = (1/π) ∫_{−π}^{π} f(x) cos(nx) dx (31.47)
bn = (1/π) ∫_{−π}^{π} f(x) sin(nx) dx (31.48)

A crucial point to note is that the underlying concept behind this Fourier series is
the orthogonality of the trigonometric system - which means that every term in
the trigonometric series is orthogonal to each other, or that their inner product is
zero. In terms of integrals we can write this condition as:
∫_{−π}^{π} cos(nx) sin(mx) dx = 0 (31.49)
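Both the coefficient formulae and the orthogonality condition are easy to check numerically. The sketch below is an added illustration (not part of the original notes); it verifies orthogonality for a few (n, m) pairs and computes coefficients of the square wave f(x) = sign(x), for which bn = 4/(nπ) for odd n and 0 for even n:

```python
import numpy as np
from scipy.integrate import quad

# Orthogonality of the trigonometric system on [-pi, pi]
for n in range(1, 4):
    for m in range(1, 4):
        val, _ = quad(lambda x: np.cos(n * x) * np.sin(m * x), -np.pi, np.pi)
        assert abs(val) < 1e-8   # cos(nx) and sin(mx) are orthogonal

# Fourier sine coefficients of the square wave f(x) = sign(x)
for n in range(1, 6):
    bn, _ = quad(lambda x: np.sign(x) * np.sin(n * x), -np.pi, np.pi, points=[0])
    bn /= np.pi
    print(n, bn)   # 4/(n*pi) for odd n, 0 for even n
```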
Chapter 32

A Primer in Calculus

32.1 What are intervals


First note that an interval is nothing but a subset of the real line if it contains at
least two numbers and all real numbers lying between any two of its elements. A
typical example of an interval is as follows: The set of all real numbers x such that
−2 ≤ x ≤ 5. Note that a finite interval is said to be closed if it contains both
the end points, half open if it contains one end point and open if it contains no
endpoints. The endpoints are known as boundary points whereas all the other
points are called interior points. Some typical ways in which these intervals are
written:

• (a, b) −→ {x|a < x < b}

• [a, b] −→ {x|a ≤ x ≤ b}

• (a, ∞) −→ {x|x > a}

• (−∞, b) −→ {x|x < b}

32.1.1 Solving inequalities


Solving an inequality is defined as the process of finding an interval of numbers
that satisfy an inequality in x. A typical example is presented below:

−→ 2x − 1 < x + 3 (32.1)

2x < x + 4 (32.2)
x<4 (32.3)
−→ (−∞, 4) (32.4)


32.1.2 Absolute value


The absolute value of a number x is denoted as |x| and is defined by:
|x| = { x, if x ≥ 0; −x, if x < 0 } (32.5)

Noting some common properties about absolute value:


• | − a| = |a|

• |ab| = |a||b|

• |a + b| ≤ |a| + |b|
The problem statement of solving an inequality with absolute values can be stated
as: The inequality |a| < D states that the distance from a to 0 is less than D or we
can say that a lies between −D and D. This is traditionally denoted as:

|a| < D −→ −D < a < D (32.6)

To give a clear demonstration, we will compute the solution for the inequality
|2x − 3| ≤ 1 and |2x − 3| ≥ 1 given as follows:

Part 1: |2x − 3| ≤ 1 (32.7)

→ −1 ≤ 2x − 3 ≤ 1, add 3 to all sides (32.8)


−→ 2 ≤ 2x ≤ 4, now divide all sides by 2 (32.9)
−→ 1 ≤ x ≤ 2, Solution set is: → [1, 2] (32.10)
Part 2: |2x − 3| ≥ 1 (32.11)
−→ 2x − 3 ≥ 1, or, −(2x − 3) ≥ 1 (32.12)
−→ 2x − 3 ≥ 1, or, 2x − 3 ≤ −1 (32.13)
−→ x ≥ 2, or, x ≤ 1, Solution: → (−∞, 1] ∪ [2, ∞) (32.14)

32.2 Rates, Limits and Derivatives


Before even getting to limits we note the definition of average rate of change of
a function: Given an arbitrary function y = f (x) we calculate the average rate of
change of y with respect to x over an interval [x1 , x2 ] by dividing the change in y
given by ∆y = f (x2 ) − f (x1 ) by the change in x denoted by ∆x = x2 − x1 = h.
This is given as:
Δy/Δx = ( f(x2) − f(x1) ) / ( x2 − x1 ) = ( f(x1 + h) − f(x1) ) / h (32.15)
Now getting to one definition of the limit of a function, we say that for a function
f (x) defined on an open interval about x0 , except at x0 itself, if f (x) gets arbitrarily

close to a number L for all x sufficiently close to x0 we say that f approaches the
limit L as x approaches x0 . This is typically expressed as:

lim_{x→x0} f(x) = L (32.16)

Noting the Sandwich theorem we say that if a function f is sandwiched between


the values of two other functions g and h, which have the same limit L at a point
c, then the value of f must also approach L at this point. Suppose that g(x) ≤
f (x) ≤ h(x) for all x in some open interval containing c, except x = c itself and
also suppose the following condition:

lim_{x→c} g(x) = lim_{x→c} h(x) = L (32.17)

Then we say that the following also holds:

lim_{x→c} f(x) = L (32.18)

Now we note the formal definition of a limit: Let a function f(x) be defined on
an open interval about x0, except possibly at x0 itself. We say that f(x) approaches
the limit L as x approaches x0:
lim_{x→x0} f(x) = L (32.19)
if for every small number ε > 0 there exists a corresponding number δ > 0 such
that for all x:
0 < |x − x0| < δ −→ |f(x) − L| < ε (32.20)
Now we shall note the definition of a continuous function. We say that a function
f (x) is continuous at an interior point x = c of its domain if the following holds:

lim_{x→c} f(x) = f(c) (32.21)

Further, we say that a function is continuous if the continuity property shown


above holds everywhere in its domain.

32.2.1 Differential calculus


The definition of a derivative of a function is linked to the idea of the slope of the
average rate of change of a function f at point x = x0 . The derivative of a function
f with respect to x is said to be another function f 0 whose value at x is given by:

f′(x) = lim_{h→0} ( f(x + h) − f(x) ) / h (32.22)
We say that the domain of f 0 (x) is the set of all points in the domain of f for which
the limit exists. If f 0 (x) exists we say that f is differentiable at x. Finally, the
process of calculating the derivative of a function is called differentiation. Note
some common differentiation rules:

• The product rule is given by:
d(uv)/dx = u (dv/dx) + v (du/dx) (32.23)
• The quotient rule is given by:
d(u/v)/dx = [ v (du/dx) − u (dv/dx) ] / v² (32.24)

• Second order derivatives are denoted as:
y′′ = d/dx( dy/dx ) = dy′/dx = d²y/dx² (32.25)

Next we discuss the chain rule. This is a rule that is used to compute the derivatives
of composite functions. If f(u) is differentiable at the point u = g(x) and in turn
g(x) is differentiable at x, then the composite function satisfies (f ◦ g)′(x) = f′(g(x)) ·
g′(x). Letting y = f(u) and u = g(x) we can write:
dy/dx = (dy/du)(du/dx) (32.26)
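A quick symbolic check of the chain rule (an added illustration; the functions chosen are arbitrary):

```python
import sympy as sp

x = sp.symbols('x')
g = x**2 + 1                      # inner function u = g(x)

composite = sp.sin(g)             # (f o g)(x) = sin(x**2 + 1)
lhs = sp.diff(composite, x)       # derivative of the composite
rhs = sp.cos(g) * sp.diff(g, x)   # f'(g(x)) * g'(x)
print(sp.simplify(lhs - rhs))     # 0, confirming the chain rule
```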

32.3 Basics of integrals


Integration is the process of determining the original function from its derivative
f . In this regard, the first step is to find all possible functions that could have f as
their derivatives - these functions are collectively called antiderivatives of f and
the procedure for getting them is called indefinite integration. The second step
is to hone in on one particular function corresponding to the derivative by looking
at a known function value. We note that a function F (x) is the antiderivative of
f (x) if:
F 0 (x) = f (x) (32.27)
The set of all antiderivatives of f is given by the indefinite integral of f:
∫ f(x) dx = F(x) + C (32.28)

Where C is the constant of integration.

32.3.1 Integrals by substitution


Often a change of variable can simplify the integration we are trying to evaluate.
This idea is demonstrated quite well with an example:
Solve: ∫ (x + 2)⁵ dx (32.29)

→ Let u = x + 2, such that du = d(x + 2) → du = dx (32.30)
→ ∫ (x + 2)⁵ dx = ∫ u⁵ du = u⁶/6 + C (32.31)
= (x + 2)⁶/6 + C (32.32)
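SymPy confirms the substitution result (an added sketch, not from the notes; SymPy returns an expanded antiderivative, which differs from ours only by the constant of integration):

```python
import sympy as sp

x = sp.symbols('x')
F = sp.integrate((x + 2)**5, x)               # expanded polynomial antiderivative
print(sp.simplify(F - (x + 2)**6 / 6))        # a pure constant
print(sp.simplify(sp.diff(F, x) - (x + 2)**5))  # 0: F is a valid antiderivative
```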
Next we come to the definition of definite integrals. Let f (x) be a function de-
fined on a closed interval [a, b]. We can think of the area under the curve to be
approximated by the integral:
∫_a^b f(x) dx (32.33)
A key point to note is that the value of the definite integral over an interval
depends on the function and the interval, and not on the letter we choose to
represent the independent variable - we can use t, u, x or any other symbol.
We say that the variable of integration is a dummy variable. Note some
common properties of definite integrals:
• ∫_a^a f(x) dx = 0
• ∫_b^a f(x) dx = − ∫_a^b f(x) dx
Lastly we note the mean value theorem, which gives the average value of
an integrable function f at some point c in the interval [a, b] as:
f(c) = ( 1/(b − a) ) ∫_a^b f(x) dx (32.34)
We will now note the first part of the fundamental theorem of calculus. It
states that if f is continuous on [a, b] then F(x) = ∫_a^x f(t) dt has a derivative at
every point of [a, b], which is given by:
dF/dx = d/dx ∫_a^x f(t) dt = f(x) (32.35)
Now, the second part of the fundamental theorem of calculus states that if f is
continuous at every point of [a, b] and F is an antiderivative of f on [a, b], then:
∫_a^b f(x) dx = F(b) − F(a) (32.36)

32.3.2 Integration by parts


Integration by parts is a technique which helps us simplify integrals of the form:
∫ f(x) g(x) dx (32.37)

in which we say that f can be differentiated repeatedly and g can be integrated
repeatedly without any difficulty. For example, look at the following integral:
∫ x e^x dx (32.38)

In this, f(x) = x can be differentiated twice to reach 0 and g(x) = e^x can be integrated
multiple times without any problem. The formula for integration by parts is given by:
∫ u dv = uv − ∫ v du (32.39)
Demonstrating this with a simple example:
find ∫ x cos x dx (32.40)
u = x, dv = cos x dx −→ du = dx, v = sin x (32.41)
∫ x cos x dx = x sin x − ∫ sin x dx = x sin x + cos x + C (32.42)
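Again, symbolic integration confirms the result (added sketch):

```python
import sympy as sp

x = sp.symbols('x')
print(sp.integrate(x * sp.cos(x), x))   # x*sin(x) + cos(x)
```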

32.4 Inverse functions


Recall that a function is a rule that assigns a value from its range to each point in
its domain. Some functions have a one-to-one correspondence, whereas others
assign the same value to more than one point. A function that has distinct values
at distinct points is called one-to-one. Now note that since each output of a
one-to-one function comes from only one point, this function can effectively be
reversed to map back to the input from the output. Therefore, the function
obtained by reversing a one-to-one function is called the inverse of the function. It
is denoted as f⁻¹. Note that if we compose a function and its inverse, we get the
identity function, which is the function that assigns each number to itself. We say
that functions f and g are an inverse pair if:
f(g(x)) = x, and, g(f(x)) = x (32.43)
Where f = g −1 and g = f −1 . Note the steps for finding the inverse of a function
with an example:
• Question: Find the inverse of:
1
y = x+1 (32.44)
2
• Step 1: Solve for x in terms of y:
x = 2y − 2 (32.45)

• Step 2: Interchange x and y:


y = 2x − 2 (32.46)

• Therefore the inverse of the function f (x) = (1/2)x + 1 is given by:


f −1 (x) = 2x − 2 (32.47)

• Lastly, we note the derivative rule for inverse functions as given by:
(f⁻¹)′ = 1/f′ (32.48)

References
[1] An Introduction to Laplace Transforms and Fourier Series - P. P. G. Dyke

[2] Advanced Engineering Mathematics - Kreyszig

[3] Time Series Analysis - Hamilton

[4] Basic Econometrics - D. Gujarati

[5] A First Course in Probability Theory - Sheldon Ross

[6] Econometrics - Fumio Hayashi


Chapter 33

Transforms and the Memoryless Property

33.1 Transforms
• The Laplace transform of a function f : [0, ∞) → C is defined as follows:
F(s) = L(f(t)) = ∫_0^∞ e^{−st} f(t) dt, ∀s ∈ C (33.1)

Note that the Laplace transform of a function f(t) exists only if it is of exponential
order. Now just like taking the Laplace transform of a function f(t)
in the time domain, we can take the inverse Laplace transform of the
corresponding function F(s) in the frequency domain (or s domain) to get back the
original function in the time domain.

L−1 (F (s)) = f (t) (33.2)

• When f (t) and g(t) are two piecewise continuous functions defined over
t > 0, then their convolution is represented as follows:
(f ∗ g)(t) = ∫_0^t f(t − u) g(u) du, ∀ 0 ≤ t < ∞ (33.3)

We now note that the Laplace transform of the convolution of two functions
is the product of the Laplace transforms of the individual functions:

L((f ∗ g)(t)) = F (s)G(s) (33.4)

• We denote the impulse function fk(t) for some positive k as follows:
fk(t) = { 1/k, 0 ≤ t ≤ k; 0, t > k } (33.5)

The area under the graph of this function is always 1. If we consider the
limiting case of this impulse function, as k → 0, we will have infinite height


at 0. This limiting case is known as the Dirac Delta function denoted as


δ(t). The Laplace transform of this function is given as:

L(δ(t)) = 1 (33.6)

The Laplace transform of δ(t − T ) is given as:

L(δ(t − T )) = e−sT (33.7)

This function can be used to sample a continuous function at various points.

• Now we get to the Z-transform. Suppose that f(t) is a continuous function
in time and we sample this function at regular intervals of time T. With this
we would obtain:
f(0), f(T), f(2T), · · · , f(nT) (33.8)
Now we note that the impulse response of a function at t = T is captured by
δ(t − T). The sampled function f*(t) then becomes:
f*(t) = f(0)δ(t) + f(T)δ(t − T) + f(2T)δ(t − 2T) + · · · (33.9)
= Σ_{n=0}^{∞} f(nT) δ(t − nT) (33.10)

Now we take the Laplace transform of the above function to get:
F*(s) = L(f*(t)) = Σ_{n=0}^{∞} f(nT) L(δ(t − nT)) (33.11)
= Σ_{n=0}^{∞} f(nT) e^{−nTs} (33.12)
Now we take z = e^{sT} and substitute into the above equation to get the Z-transform as:
F(z) = Σ_{n=0}^{∞} f(nT) z^{−n} (33.13)

The function F (z) is called the Z-transform of the discrete time signal func-
tion f (nT ).

F(z) = Z(f(nT)) = Σ_{n=0}^{∞} f(nT) z^{−n} (33.14)
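As a concrete illustration (added here; the sampled signal and parameter values are arbitrary choices): for f(t) = e^{−at} the samples are f(nT) = e^{−anT}, and the Z-transform is a geometric series summing to 1/(1 − e^{−aT} z^{−1}) for |z| > e^{−aT}.

```python
import numpy as np

a, T, z = 0.5, 0.1, 1.2 + 0.3j   # illustrative values with |z| > exp(-a*T)

# Partial sum of F(z) = sum_n f(nT) z^{-n} with f(t) = exp(-a t)
n = np.arange(2000)
partial = np.sum(np.exp(-a * n * T) * z**(-n))

closed_form = 1 / (1 - np.exp(-a * T) / z)
print(partial, closed_form)   # the two agree to machine precision
```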

33.2 Memoryless property of the Exponential


Consider an exponentially distributed random variable X ∼ exp(λ). This distri-
bution is defined for all positive values of x. The PDF and CDF of this distribution
are characterized as follows:
fX (x) = λe−λx (33.15)

FX (x) = 1 − e−λx (33.16)


The expected value and variance are given as follows:
E[X] = 1/λ (33.17)
Var(X) = 1/λ² (33.18)
The memoryless property of the exponential distribution is stated as follows:

P [X > (x + a)|X > a] = P [X > x] (33.19)

We can interpret this rule as follows - Suppose we are seeing how customers arrive
at a shop. If we do not observe a customer arrival until time a has elapsed, the
distribution of waiting time from time a until the next customer arrives is the
same as when we again start our waiting time from 0. Consider the conditional
distribution that denotes - the probability that X is within the range of time t and
time t + s, given that time t has already passed - which is the same as the process
starting from zero and X lying below s.

P [X ≤ (t + s)|X > t] = P [X ≤ s] (33.20)

We can define x = (s + t) and rewrite the above equation as follows:

P [X ≤ x|X > t] = P [X ≤ s] (33.21)

Note now that in the above equation the RHS is nothing but the CDF. Further, if
we denote s = (x − t) we get:

P [X ≤ x|X > t] = 1 − e−λs = 1 − e−λ(x−t) (33.22)

With this formulation, the conditional CDF can be written as follows:

P [X ≤ x|X > t] = 1 − e−λ(x−t) = FX|X>t (x|X > t) (33.23)

This essentially denotes a shift in the PDF. The new expected value is given by:
E[X|X > t] = t + 1/λ = t + E[X] (33.24)
λ
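The memoryless property is easy to confirm by simulation. The sketch below (an added illustration with arbitrary λ and a) compares the residual waiting time X − a among samples with X > a against a fresh exponential:

```python
import numpy as np

rng = np.random.default_rng(42)
lam, a = 2.0, 0.7                     # arbitrary rate and elapsed time
x = rng.exponential(1 / lam, size=1_000_000)

# residual waiting time, conditioned on having already waited past a
residual = x[x > a] - a

print(residual.mean(), 1 / lam)       # both close to E[X] = 1/lambda
print((residual > 0.5).mean(),        # P[X > a + 0.5 | X > a]
      np.exp(-lam * 0.5))             # equals P[X > 0.5]
```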

33.3 Finding the expectation of a Geometric R.V


Consider a random variable X following a geometric distribution characterized
as X ∼ geom(p). We will now attempt to find the expectation of this random
variable by using the Law of Total Expectations (LTE) and conditioning on an
indicator random variable. Let the indicator random variable Y be defined such
that:
Y = { 0 = T, with probability q; 1 = H, with probability p } (33.25)

This indicator variable says that we get Y = 0 or tails with a probability of q and
we get Y = 1 or heads with a probability of p. Writing the LTE formulation as:

E[X] = EY [E(X|Y )] (33.26)

= E(X|Y = 0)P (Y = 0) + E(X|Y = 1)P (Y = 1) (33.27)


Note that the random variable X is defined as the waiting time until the first heads.
If Y = 0 on the first toss, a time length of 1 has elapsed - the distribution
then shifts and we can write the following:
E(X|Y = 0) = 1 + E[X] (33.28)
We also know that if we have observed a head on the first time step, then the
expected time to heads is exactly 1. So we have:
E(X|Y = 1) = 1 (33.29)

With this, we can rewrite the unconditional expectation of X as follows:

E[X] = (1 + E[X])q + p (33.30)

E[X] = 1 + qE[X] (33.31)


E[X] = 1/(1 − q) (33.32)
E[X] = 1/p (33.33)
Hence we showed that with the help of an indicator random variable - and the
memoryless shifting property of the geometric distribution - the unconditional
expectation of the geometric distribution can be found.
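A short simulation corroborates E[X] = 1/p (added illustration; the value of p is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
p = 0.3

# number of tosses until (and including) the first heads
samples = rng.geometric(p, size=1_000_000)
print(samples.mean(), 1 / p)   # both approximately 3.33
```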
Chapter 34

Recurrence Relations

34.1 Linear Recurrence Relations


• A recursive definition of a sequence specifies one or more initial terms or
boundary conditions and a rule for determining subsequent terms from the
preceding terms. This "rule" is called a recurrence relation - it is basically
an equation that relates subsequent terms to previous terms in a sequence.

• A particular sequence of numbers is a solution of the recurrence relation if


it satisfies the recurrence relation equation. Here is an example of a typical
recurrence relation:

an = 2an−1 , initial condition: a0 = 5 (34.1)

• A linear homogeneous recurrence relation of degree k with constant coefficients
is of the general form (note that the coefficient ck should not be
zero):
an = c1 an−1 + c2 an−2 + · · · + ck an−k (34.2)

• The above recurrence relation is said to be homogeneous because there are


no terms in the equations that are not multiples of some aj . We further
note that in such a relation, coefficients are constant and are not dependent
on n. Lastly, the degree is said to be k because an is expressed as a linear
combination of the previous k lagged terms.

• The initial conditions could be of the form: a0 = C0 , a1 = C1 , · · · .

34.1.1 Solving linear recurrence relations


• The first step in solving a linear homogeneous recurrence relation is
to look for general solutions of the form (note that r is some constant):
an = r^n (34.3)


• Then, r^n would qualify as a solution to the recurrence relation given by
equation (34.2) if and only if it satisfies the following equation:
r^n = c1 r^{n−1} + c2 r^{n−2} + · · · + ck r^{n−k} (34.4)

• Next, take all the RHS terms to the left hand side (so as to get 0 on the RHS)
and also divide each term by r^{n−k} to get:
r^k − c1 r^{k−1} − c2 r^{k−2} − · · · − c_{k−1} r − ck = 0 (34.5)

• We say that the sequence {an} with an = r^n is a solution to the recurrence rela-


tion if it satisfies the above equation. The above equation is called the char-
acteristic equation of the recurrence relation and the solutions are called
the characteristic roots of the recurrence relation.
• Theorem: Let c1 and c2 be real numbers and suppose the quadratic relation
r² − c1 r − c2 = 0 has two distinct roots r1 and r2. Then we say that the
sequence {an} is a solution of the recurrence relation given by an = c1 an−1 +
c2 an−2 if and only if the following general solution holds (note that α1 and
α2 are two constants - akin to the A and B taken in class):
an = α1 r1^n + α2 r2^n (34.6)

34.1.2 An illustrative example


• We will try to find the solution to the recurrence relation:

an = an−1 + 2an−2 (34.7)

• The initial conditions are: a0 = 2 and a1 = 7.


• Step 1: Write the characteristic equation:

r^n = r^{n−1} + 2r^{n−2} (34.8)

• Step 2: Since k = 2 here (because an is related to the past 2 lags) we take the
RHS terms to the left and divide throughout by r^{n−k}, or in this case r^{n−2}, to
obtain:
r² − r − 2 = 0 (34.9)

• Step 3: We then find the roots of the above equation as r1 = 2 and r2 = −1.
Hence we say that the sequence {an } is a solution to the given recurrence
relation if and only if:
an = α1 2^n + α2 (−1)^n (34.10)

• Step 4: Put the initial conditions (for n = 0 and n = 1) in to get:

a0 = 2 = α1 + α2 (34.11)
a1 = 7 = 2α1 − α2 (34.12)

• Step 5: We then solve the above two equations to get the value of α1 = 3
and α2 = −1. Finally, substituting all the values we get the general solution
to the recurrence relation of the form:

an = 3 · 2^n − (−1)^n (34.13)
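The closed form can be checked against a direct iteration of the recurrence (an added sketch):

```python
def a_recursive(n):
    # iterate a_n = a_{n-1} + 2 a_{n-2} with a_0 = 2, a_1 = 7
    a, b = 2, 7
    for _ in range(n):
        a, b = b, b + 2 * a
    return a

for n in range(10):
    assert a_recursive(n) == 3 * 2**n - (-1)**n
print("closed form matches the recurrence")
```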

34.2 Linear Nonhomogeneous Recurrence Relations


• The recurrence relation an = 3an−1 + 2n is an example of a linear
nonhomogeneous recurrence relation with constant coefficients. It is of
the general form (note that F(n) is some function of n - such a term was
missing in the homogeneous equation):
an = c1 an−1 + c2 an−2 + · · · + ck an−k + F(n) (34.14)

• Now we note that the following equation is termed the associated homogeneous
recurrence relation - it is the equation we get by simply
ignoring the F(n) term:
an = c1 an−1 + c2 an−2 + · · · + ck an−k (34.15)

• We say that the solution of a linear nonhomogeneous recurrence relation is


the sum of a particular solution and a solution of the associated linear
homogeneous recurrence relation

34.2.1 An illustrative example


• We want to find the solutions to the nonhomogeneous recurrence relation:

an = 3an−1 + 2n (34.16)

With the initial condition that a1 = 3.

• Step 1: we first try to find a general solution for the associated homoge-
neous linear recurrence relation given by (just write the equation after
ignoring that 2n term):
an = 3an−1 (34.17)
r=3 (34.18)

• Step 2: From the above characteristic equation we get the root as 3. The
general solution could now be written of the form:

a_n^{(h)} = α 3^n (34.19)

Note that the superscript (h) only denotes that this is an associated homo-
geneous solution.

• Step 3: Now we find a particular solution by finding a solution for F(n) =
2n. Since this is a linear function of n, a general form for this
solution could look like:
pn = cn + d (34.20)
where c and d are some constants. Now we substitute this into the nonhomogeneous
recurrence relation to get:
cn + d = 3(c(n − 1) + d) + 2n (34.21)
=⇒ (2 + 2c)n + (2d − 3c) = 0 (34.22)
So we say that cn + d is a solution to the nonhomogeneous recurrence relation
if and only if (2 + 2c) = 0 and (2d − 3c) = 0. This would further imply
that cn + d is a solution only if:
c = −1 and d = −3/2 (34.23)
Finally, writing the particular solution of the form cn + d by substituting the
above values of c and d, we get:
a_n^{(p)} = −n − 3/2 (34.24)

• Step 4: Finally, we sum the general and particular solutions to get the complete
solution as:
an = a_n^{(h)} + a_n^{(p)} (34.25)
an = −n − 3/2 + α 3^n (34.26)
2
• Step 5: As an absolute last step, we put the initial condition a1 = 3 into the
complete solution, which gives α = 11/6. With this the proper solution is:
an = −n − 3/2 + (11/6) · 3^n (34.27)
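Checking this solution directly against the recurrence (added sketch, using exact fractions to avoid rounding):

```python
from fractions import Fraction

def closed(n):
    # a_n = -n - 3/2 + (11/6) * 3^n
    return -n - Fraction(3, 2) + Fraction(11, 6) * 3**n

assert closed(1) == 3                            # initial condition a_1 = 3
for n in range(2, 12):
    assert closed(n) == 3 * closed(n - 1) + 2 * n
print("solution satisfies a_n = 3 a_{n-1} + 2n")
```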
2 6

References
[1] Discrete Mathematics and its Applications - K. H. Rosen
Chapter 35

Propositional Logic

35.1 Introduction to Logic


Logic is the basis of all mathematical reasoning and automated reasoning. The
rules of logic define the meaning of mathematical statements and give an objec-
tive conclusion to the same. Logic has numerous practical applications in the fields
of mathematics as well as computer science.

Logic is mainly categorised into four types: Propositional Logic, Predicate Logic,
Modal Logic and Temporal Logic.

Propositional logic is the branch of logic that studies ways of combining or alter-
ing statements or propositions to form complicated statements/propositions. It
studies the way in which statements interact with each other. Predicate logic is an
extension of propositional logic. It adds the ideas of predicates and quantifiers
which in essence captures meaning in statements. Modal logic is an extension to
propositional and predicate logic which includes operators expressing modality.
Temporal logic is any system of rules and symbolism for representing and reason-
ing about propositions qualified in terms of time. We will discuss propositional
logic in the following sections.

35.2 Propositional Logic


Let us introduce the basic building blocks of logic - propositions. A proposition is a
declarative statement that can either be true or false, but not both. A proposition
is built from atoms.
Definition 35.2.1 Atom: An atom is the most basic form of proposition. It is
denoted by capital letters (e.g. P, Q, R, etc.).


Definition 35.2.2 Proposition: A proposition is a statement formed by connect-


ing primitive atoms. It takes the Boolean values of True (T ) or False (F ), but not
both.

Example 1: It is raining in Chennai or I have an umbrella. In the given proposi-


tion, the atoms can be defined as follows:
P: It is raining in Chennai.
Q: I have an umbrella.
The proposition is of the form P or Q where ’or’ is the connector. Various connec-
tors used in propositional logic (in order of precedence) are as given below:

Symbol Definition
¬ Not
∧ And
∨ Or
=⇒ Implies
⇐⇒ Iff (If and only if)

Table 35.1: Symbols in propositional logic

Definition 35.2.3 Formula: A formula is made up of atoms and connectors.
Formulas are defined recursively. Every atom is a formula. Thus, if φ and ψ are
formulas then ¬φ, φ ∧ ψ, φ ∨ ψ, φ =⇒ ψ and φ ⇐⇒ ψ are all valid formulas. Building
formulae from primitive propositions is like combining simple sentences to form a
complicated one. This is explained in the following example.

Example 2: If it is raining in Chennai and I have an umbrella then I open my


umbrella. Here, the atoms can be defined as follows:
P: It is raining in Chennai.
Q: I have an umbrella.
R: I open my umbrella.
The above proposition can be represented as a formula: P ∧ Q =⇒ R

35.2.1 Semantics of Propositional Logic


Interpretation (I) of propositional logic is specified by a truth assignment i.e. an
atom can either be True (T) or False (F). In Example 2, an interpretation will spec-
ify whether it is raining in Chennai or not i.e if P= T then it is raining in Chennai.
However, if P= F then it is not raining in Chennai. The simple determination of
whether the given proposition or formula is true or false is called an interpreta-
tion.
We will now learn how to form truth tables based on the Boolean value of these
interpretations.

35.2.2 Truth Tables and Interpretations


Suppose, φ and ψ are two formulae, then the truth table can be formulated as
below:
Note: For the implication formula, only when φ = T , we proceed to check for ψ.

I    φ      ψ      ¬φ (not)   φ∧ψ (and)   φ∨ψ (or)   φ =⇒ ψ (implies)   φ ⇐⇒ ψ (equivalent)
I1   T = 1  T = 1  F          T           T          T                  T
I2   T = 1  F = 0  F          F           T          F                  F
I3   F = 0  T = 1  T          F           T          T                  F
I4   F = 0  F = 0  T          F           F          T                  T

Table 35.2: Truth tables for different formulae

Otherwise, we assume the formula to be true (T ). We relate the values of T and


F to the bit strings used in computer applications. Therefore we assign a value of
1 to T and 0 to F . Interpretation I for φ is denoted by φ[I].

The truth values of the atoms evaluated under these interpretations are: φ[I1] = T,
ψ[I1] = T and ψ[I2] = F. The formula F[I1] = φ[I1] ∧ ψ[I1] = T ∧ T = T.
Similarly, the formula F[I2] = φ[I2] ∧ ψ[I2] = T ∧ F = F.

Definition 35.2.4 φ[I]: The truth value of formula φ evaluated for an interpreta-
tion I.

(i) If φ is an atom ρ then φ[I] is the truth value assigned to ρ in the interpreta-
tion.
(ii) If φ = ¬ρ then φ[I]= ¬ρ[I].
(iii) If φ = ρ1 ∧ ρ2 then φ[I] = ρ1 [I] ∧ ρ2 [I].
(iv) If φ = ρ1 ∨ ρ2 then φ[I] = ρ1 [I] ∨ ρ2 [I].

Mapping an interpretation I using a given formula φ gives us the truth value of φ:
I −→ φ[I], which is T or F.

Examples:

• (P ∧ ¬P ) is F under all interpretations.


• (P ∨ ¬P ) is T under all interpretations.
• φ ↔ ψ implies that formulas φ and ψ are equivalent.

Note: A formula evaluates to T or F. If a formula has n atoms, the
number of interpretations for the n atoms is given by 2^n.
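The 2^n count suggests a direct brute-force evaluator. A small added sketch (illustrative, not part of the original notes) enumerates all interpretations of a formula over its atoms:

```python
from itertools import product

def interpretations(atoms):
    # all 2^n truth assignments over the given atoms
    for values in product([True, False], repeat=len(atoms)):
        yield dict(zip(atoms, values))

# formula from Example 2: (P and Q) implies R, written as (not (P and Q)) or R
formula = lambda I: (not (I['P'] and I['Q'])) or I['R']

for I in interpretations(['P', 'Q', 'R']):
    print(I, formula(I))   # 2^3 = 8 rows of the truth table
```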

35.3 The Satisfiability Problem


35.3.1 Truth tables and the Subset problem
This section brings out the connection between the truth tables formed by n atoms
and the Subset Problem. One can ask the following question: How many truth ta-
bles can be constructed over n atoms, P1 , P2 , P3 , . . . , Pn ? It can be seen that the
answer to the above is related to the subset problem. The subset problem is ex-
pressed as the following question: How many subsets of elements can be created
from a set of size n?

I     P1      P2      P3      ...   Pn
I1    T = 1   T = 1   T = 1   ...   T = 1
I2    T = 1   T = 1   F = 0   ...   F = 0
I3    T = 1   F = 0   F = 0   ...   F = 0
...   ...     ...     ...     ...   ...
Im    F = 0   F = 0   F = 0   ...   F = 0

Table 35.3: n atoms and m interpretations

Consider the case where n = 2. The number of interpretations will be 2^n = 2² = 4.
The truth table for the same can be formed as follows:

I    P1      P2
I1   T = 1   T = 1
I2   T = 1   F = 0
I3   F = 0   T = 1
I4   F = 0   F = 0

Table 35.4: Truth table with 4 interpretations for 2 atoms.

Each interpretation can

have only 2 values (T or F). Therefore, the total number of possible truth tables for 2
atoms will be 2^(2^n) = 2^(2²) = 2⁴ = 16.
The set of all interpretations of a formula is given by
S = {I1, I2, . . . , I_{2^n}} = {I1, I2, I3, I4}

Figure 35.1: Tree diagram enumerating the 2⁴ = 16 subsets of {I1, I2, I3, I4}, showing the number of truth tables formed from 2 atoms.

De Morgan’s Laws:
The De Morgan’s Laws allow the expression of conjunctions and disjunctions purely
in terms of each other via negation. The laws can be used to simplify computation
of truth tables. Consider two propositions φ and ψ. The De Morgan’s Laws are
given as follows:

¬(φ ∧ ψ) = (¬φ ∨ ¬ψ) (35.1)


¬(φ ∨ ψ) = (¬φ ∧ ¬ψ) (35.2)

The above statements translate to:

not (A or B) = not A and not B; and


not (A and B) = not A or not B

Example: We can eliminate ∧ by performing the following simplification:
(φ ∧ ψ) = ¬(¬φ ∨ ¬ψ)
Thus, we do not need both OR (∨) and AND (∧) to capture truth tables.
Applications of De Morgan's Laws are often found in Artificial Intelligence (AI).

Propositional Satisfiability
Definition 35.3.1 Satisfiability: A propositional formula φ is satisfiable if there
exists a truth assignment I such that φ[I] = T for that assignment. This is denoted
as: I ⊨ φ when φ[I] = T for the given I.

Propositional Validity
Definition 35.3.2 Validity: A propositional formula φ is valid (also called a
tautology) if φ[I] = T for all truth assignments I. Hence we have I ⊨ φ, ∀ I.

Summary
• Logic is categorized into four types - Propositional, Predicate, Modal and
Temporal.

• Atom: An atom is the most basic form of proposition. It is denoted by capital
letters (e.g. P, Q, R, etc.).

• Proposition: A proposition is a statement formed by connecting primitive


atoms. It takes the Boolean values of True (T ) or False (F ), but not both.

• Formula: A formula is made up of atoms and connectors.

• A formula evaluates to T or F. For a formula with n atoms, the number of
interpretations is given by 2^n. Further, the total number of truth
tables formed by n atoms is given by 2^(2^n).

35.4 Satisfiability and Validity


Topics

• Review of satisfiability and validity.


• The satisfiability problem and its connection to P vs N P .
• Logical Inference - arriving at conclusions based on facts and premises.
• Entailment and an algorithm to check for entailment.
• Theorem connecting entailment and validity.

35.4.1 Review
Definition 35.4.1 Satisfiability: A propositional formula φ is satisfiable if there
exists a truth assignment I such that φ[I] = T for that assignment. This is denoted
as: I ⊨ φ when φ[I] = T for the given I.

Definition 35.4.2 Validity: A propositional formula φ is valid (also called a
tautology) if φ[I] = T for all truth assignments I. Hence we have I ⊨ φ, ∀ I.

Definition 35.4.3 Contradiction: A propositional formula φ is a contradiction
(unsatisfiable) if φ[I] = F for all truth assignments I. Hence we have
φ[I] = F, ∀ I.
Example 1. Is the formula φ = P ∧ Q satisfiable? valid?
Solution. The truth table for the given formula, φ = P ∧ Q, is first constructed
for all possible truth assignments I.

I    P   Q   φ = P ∧ Q
I1   T   T   T
I2   T   F   F
I3   F   T   F
I4   F   F   F

From the truth table, it is evident that φ[I1] = T, thus the given formula is
satisfiable for the truth assignment I1. However, since the formula yields F
for all other assignments I2, I3 and I4, φ is not valid. (φ[I] = T does not hold
∀ I, hence not valid.)

Example 2. Is the formula φ = P ∨ Q satisfiable? valid?
Solution. The truth table for the given formula, φ = P ∨ Q, is first constructed
for all possible truth assignments I.

I    P   Q   φ = P ∨ Q
I1   T   T   T
I2   T   F   T
I3   F   T   T
I4   F   F   F

From the truth table, it is evident that φ[I1] = T, thus the given formula is
satisfiable for the truth assignment I1. However, since the formula yields F
for the assignment I4, φ is not valid.

Example 3. Is the formula φ = P ∨ ¬P a tautology (valid)?
Solution. The truth table for the given formula, φ = P ∨ ¬P, is first constructed
for all possible truth assignments I.

I    P   ¬P   φ = P ∨ ¬P
I1   T   F    T
I2   F   T    T

Since ∀ I, φ[I] = T, it can be said that φ = P ∨ ¬P is a valid formula.

Example 4. Is the formula φ = P ∧ ¬P a contradiction?
Solution. The truth table for the given formula, φ = P ∧ ¬P, is first constructed
for all possible truth assignments I.

I    P   ¬P   φ = P ∧ ¬P
I1   T   F    F
I2   F   T    F

Since ∀ I, φ[I] = F, it can be said that φ = P ∧ ¬P is a contradiction.

Note:
• A formula φ is valid ⇐⇒ ¬φ is unsatisfiable.
• If φ is valid, then φ[I] = T, ∀ I =⇒ ¬φ[I] = F, ∀ I =⇒ ¬φ is unsatisfiable.

35.4.2 Logical Inference


People use language not only as a medium of communication, but also to logically
infer conclusions based on certain facts and premises proposed to them by others.
Any language is inbuilt with certain rules and semantics. Thus, it is possible to
reduce the statements in a sentence to propositional logic atoms, and test for
validity. An argument is a sequence of statements that lead to a conclusion. An
argument is valid only if and only if all it is impossible for all of its premises to be
true and the conclusion to be false. The following example elucidates the method
of logical inference.

Example 5. Can we conclude that Gaurav is not carrying an umbrella?, if we


know that:

Fact 1 F1: Gaurav carries an umbrella if it is cloudy and the forecast calls for rain.
Fact 2 F2: It is not cloudy.

Solution. The statements in the given facts can be broken down into the following
atoms. Further, the facts and premises can be condensed on the basis of the atoms.
Atoms:
Atom P: It is cloudy.
Atom Q: The forecast calls for rain.
Atom R: Gaurav carries an umbrella.
Facts:
F1: P ∧ Q → R
F2: ¬P
Premises:
P1: P ∧ Q
P2: ¬P
Conclusion: Is ¬R true? (¬R = T?)

Combining the premises P1 and P2, we have P1 ∧ P2 = (P ∧ Q) ∧ ¬P = F always.
This is because whenever (P ∧ Q) = T we have ¬P = F, and whenever (P ∧ Q) = F
the conjunction is F regardless of ¬P, thereby making (P ∧ Q) ∧ ¬P = F always.
So the antecedent of the implication in F1 is always F. From the definition of
implication, F → R is T whatever the truth value of R, so the facts place no
constraint on R at all. We therefore cannot conclude ¬R from the facts. Thus the
conclusion that Gaurav is not carrying an umbrella cannot be inferred from the facts.

35.5 P-NP problem


35.5.1 The Satisfiability Problem and P vs. NP
Intuitive statement: In propositional logic, we know that for n atoms there are
2^n interpretations. Each of these interpretations could either be true or false. This
implies that we have 2^(2^n) truth tables. Therefore, for large n, checking for validity
and satisfiability of a formula becomes an extremely difficult problem. Now, the
satisfiability problem addresses precisely this and is posed as the following question
- how does one check if a given formula is satisfiable, over all interpretations? -
for checking the satisfiability of each interpretation is a tedious process.

Polynomial time: An algorithm is said to run in polynomial time if the
number of steps required to complete the algorithm for a given input is O(n^k) for
some non-negative integer k, where n is the input size in units of bits needed to
represent the input.

P vs N P problem: There are two major classes of problems that theoretical


computer scientists have identified according to their complexity. Those prob-
lems which can be solved using some algorithm in deterministic polynomial time
fall under the complexity class P . For example, multiplication or sorting are class
P problems. For some problems, there is no known way to solve it quickly, but if
one is provided with information showing the answer, it is possible to verify the
answer quickly. The class of problems which can be verified in polynomial time,
but cannot be solved in polynomial time are called N P problems, short for non-
deterministic polynomial time.

One can quickly assert that all problems that can be solved in polynomial time
can also be checked for, in polynomial time. Thus it can be said that all P prob-
lems form a subset of N P problems.

N P completeness: A problem is N P complete when it is in N P and every other
N P problem can be reduced to it in polynomial time. Thus informally it can be
said that an N P complete problem is at least as tough as any other N P problem.
Further, solving one N P complete problem in polynomial time would imply solving
all N P problems in polynomial time.

Cook-Levin theorem: This theorem states that the satisfiability problem is N P



complete. This means that any problem in N P can be reduced in polynomial


time by a deterministic Turing machine to the problem of determining whether a
Boolean formula is satisfiable.

If P = N P , then that would mean that all problems which can be verified in poly-
nomial time, can also be solved in polynomial time. This has widespread ramifi-
cations to the way of life as we know it. Many problems in operations research
such as integer linear programming and the travelling salesman problem can be
efficiently solved, leading to widespread impact on logistics. Further, solving the
N P complete protein folding problem would lead to significant advances in life
sciences and medicine.

Negative consequences would also arise, for N P complete problems are funda-
mental to several disciplines. For instance cryptography relies on certain problems
being hard to solve. Finding successful and efficient ways to solve N P complete
problems would lead to breakage of many cryptosystems, thereby throwing indi-
viduals’ privacy and security at risk.

The importance of the P vs N P problem is articulated by the renowned theo-


retical computer scientist Scott Aaronson as follows:

If P = N P , then the world would be a profoundly different place than


we usually assume it to be. There would be no special value in creative
leaps, no fundamental gap between solving a problem and recognizing
the solution once it’s found. Everyone who could appreciate a symphony
would be Mozart; everyone who could follow a step-by-step argument
would be Gauss; everyone who could recognize a good investment strategy
would be Warren Buffett.

35.5.2 Entailment
Definition 35.5.1 A set of n formulae ψ1, ψ2, . . . , ψn entails a single formula ψ
if, for every truth assignment I that satisfies all of ψ1, ψ2, . . . , ψn, I satisfies ψ.
The entailment of ψ1, ψ2, . . . , ψn in a single formula ψ is denoted by:
ψ1, ψ2, . . . , ψn ⊨ ψ.

Note: When {ψ1 , ψ2 , . . . , ψn } entails ψ, we consider ψ as a logical consequence


of {ψ1 , ψ2 , . . . , ψn }.

Example 6. Check if P ∨ Q ⊨ P. Also check if P ⊨ P ∨ Q.
Solution. The following truth table is constructed, where φ = P ∨ Q and ψ = P.
The columns are repeated for ease of visualization, since order matters in entailment:
A ⊨ B need not imply B ⊨ A.
It can be seen that, for truth assignment I3, formula φ is satisfied but not ψ. Thus
φ ⊭ ψ, that is, φ does not entail ψ. However, for all truth assignments where ψ is

I    P   Q   φ = P ∨ Q   ψ = P   ¬φ   ¬φ ∨ ψ   ψ   φ   ¬ψ   ¬ψ ∨ φ
I1   T   T   T           T       F    T        T   T   F    T
I2   T   F   T           T       F    T        T   T   F    T
I3   F   T   T           F       F    F        F   T   T    T
I4   F   F   F           F       T    T        F   F   T    T

T, it can be seen that φ is also T. Thus ψ ⊨ φ. In short, P ⊨ P ∨ Q and P ∨ Q ⊭ P.
The process of checking for entailment involves going through the entire truth
table and checking for satisfiability of a formula, and hence is tedious. There is an
easier way to go about this. In order to check if φ ⊨ ψ, create a truth table column
¬φ ∨ ψ. If this column is T ∀ I, then φ entails ψ. In the above example, the column
¬ψ ∨ φ is all T; thus ψ ⊨ φ.
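This truth-table procedure mechanizes directly. A brute-force added sketch checks φ ⊨ ψ by testing that ¬φ ∨ ψ holds under every interpretation:

```python
from itertools import product

def entails(phi, psi, atoms):
    # phi |= psi iff (not phi) or psi is T under every truth assignment
    for values in product([True, False], repeat=len(atoms)):
        I = dict(zip(atoms, values))
        if phi(I) and not psi(I):
            return False
    return True

P_or_Q = lambda I: I['P'] or I['Q']
P = lambda I: I['P']

print(entails(P, P_or_Q, ['P', 'Q']))   # True:  P entails P v Q
print(entails(P_or_Q, P, ['P', 'Q']))   # False: P v Q does not entail P
```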

Example 7. Check if P ∧ Q ⊨ P. Also check if P ⊨ P ∧ Q.
Solution. The following truth table is constructed, where φ = P ∧ Q and ψ = P.

I    P   Q   φ = P ∧ Q   ψ = P   ¬φ   ¬φ ∨ ψ   ψ   φ   ¬ψ   ¬ψ ∨ φ
I1   T   T   T           T       F    T        T   T   F    T
I2   T   F   F           T       T    T        T   F   F    F
I3   F   T   F           F       T    T        F   F   T    T
I4   F   F   F           F       T    T        F   F   T    T

From the truth table, ψ is T in all cases when φ is T (the sole case being truth
assignment I1). Further, ¬φ ∨ ψ = T ∀ I, which means that φ entails ψ, or
P ∧ Q ⊨ P. However, in truth assignment I2 one can observe that φ is F when ψ
is T. In addition, ¬ψ ∨ φ is not T for all I, because for I2 we have ¬ψ ∨ φ = F.
Hence ψ does not entail φ, making P ⊭ P ∧ Q.

Theorem 1. ψ1, ψ2, . . . , ψn entails ψ ⇐⇒ [ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→ ψ is valid.
Proof. This theorem has to be proved in both directions. Thus the theorem is
expressed in the two statements below, and both statements are individually proved.
Only if case: ψ1, ψ2, . . . , ψn entails ψ =⇒ [ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→ ψ is valid.
If case: [ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→ ψ is valid =⇒ ψ1, ψ2, . . . , ψn entails ψ.
Only if case (=⇒):
Suppose ψ1, ψ2, . . . , ψn entails ψ. We wish to show that φ = {[ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→
ψ} is valid. In order to show that φ is valid, we have to prove that for every truth
assignment I, φ[I] = T.

Consider any truth assignment I. The left side of the implication can either be T or F.
This makes two cases - one where (ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] = F and the other where
(ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] = T.
Case 1: (ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] = F
Since φ1 −→ φ2 is defined as ¬φ1 ∨ φ2, we have
φ[I] = ¬(ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] ∨ ψ[I] = ¬F ∨ ψ[I] = T ∨ ψ[I] = T =⇒ φ[I] is T.

Case 2: (ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] = T
This means that each ψi is T under I. We have assumed that ψ1, ψ2, . . . , ψn entails ψ,
so by the definition of entailment ψ[I] is T. Thus
φ[I] = (T −→ T) = T =⇒ φ[I] is T.

If case (⇐=):
Suppose φ = {[ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→ ψ} is valid, so that φ[I] = T for all truth
assignments I. We wish to show that ψ1, ψ2, . . . , ψn entails ψ. Consider any truth
assignment I that satisfies all of ψ1, ψ2, . . . , ψn; for such an I we have
(ψ1 ∧ ψ2 ∧ . . . ∧ ψn)[I] = T. Since φ[I] = {T −→ ψ[I]} = T, the definition of
implication forces ψ[I] = T. Hence ψ1, ψ2, . . . , ψn entails ψ.

Summary
• A formula φ is satisfiable if there exists a truth assignment I such that φ[I] = T.
• A formula φ is valid if φ[I] = T ∀ I.
• A formula φ is a contradiction if φ[I] = F ∀ I.
• A formula φ is valid ⇐⇒ ¬φ is unsatisfiable.
• Checking if a formula is satisfiable over all truth interpretations is called the
satisfiability problem. This falls under the general class of N P problems,
which can be verified, but are not known to be solvable, in polynomial time.
• Logical inference is the process of reducing statements into atoms and
checking the validity of conclusions drawn from facts and premises.
• A set of n formulae ψ1, ψ2, . . . , ψn entails a single formula ψ if, for every truth
assignment I that satisfies all of ψ1, ψ2, . . . , ψn, I satisfies ψ. This is denoted
by: ψ1, ψ2, . . . , ψn ⊨ ψ.
• Theorem: ψ1, ψ2, . . . , ψn entails ψ ⇐⇒ [ψ1 ∧ ψ2 ∧ . . . ∧ ψn] −→ ψ is valid.
Chapter 36

Graph Theory

36.1 Review of graph theory


Definition 36.1.1 A graph G is a set of vertices V and edges E =⇒ G = {V, E}.
Here, the set of vertices is given by V = {v1, v2, . . . , vn} where |V| = n, and the set
of edges is given by E = {e1, e2, . . . , em} where |E| = m.
Definition 36.1.2 The degree of a vertex v ∈ V in a graph is the number of edges
incident on v.

Example 1. Consider an undirected graph with vertices v1, v2, v3 and edges
e1 = {v1, v2} and e2 = {v1, v3}. The degrees of the vertices of this graph are:
d1 = Deg{v1} = 2
d2 = Deg{v2} = 1
d3 = Deg{v3} = 1

Example 2. Consider a directed graph on the vertices v1, v2, v3 with edges
e1, e2, e3. There are two types of degrees for the vertices of directed graphs,
namely dout and din. For this graph:

       v1   v2   v3
din    2    1    0
dout   0    1    2

The sum of degrees for a graph can be generalised as:
d^T · 1 = Σ_{i=1}^{n} di = 2m = 2|E| (36.1)
This shows that the sum of degrees of a graph is given by double counting the
edges, as every edge is counted once from each of its two end vertices.
Definition 36.1.3 A tree T is a connected acyclic graph.


Example 3. Consider two trees: one with n = 3 vertices and m = 2 edges, and
one with n = 6 vertices and m = 5 edges. We observe an interesting relationship
between the number of vertices and edges: m = n − 1, or |E| = |V| − 1.
Therefore, the sum of degrees for a tree can be generalised as:
Σ_{i=1}^{n} d(vi) = 2m = 2(n − 1) (36.2)

36.2 Applications in Physics: Kirchhoff's Voltage and Current Laws
Kirchhoff's laws in physics are conservation laws that form the foundation of all
of circuit theory. Kirchhoff's voltage law states that the directed sum of potential
drops around any closed loop is zero. Kirchhoff's current law states that the
algebraic sum of currents at any node (junction) equals zero. Using the idea of
networks and graphs one can construct basic circuits, which can then be represented
in matrix form. Further, we show how this is linked to the Ax = b form; the
Kirchhoff laws directly embody this form, and the idea of rank helps in extracting
meaningful properties which are of relevance to physics.

Figure 36.1: Circuit with 4 nodes (potentials x1, x2, x3, x4) and 2 independent loops; the five edges carry the potential differences b1, . . . , b5.

Consider the above circuit with four nodes labelled 1, 2, 3 and 4. Let xi be the electric potential at the ith node. The potential difference on the edge b1 is given by x2 − x1. Thus the vector b can be constructed as follows.


   
$$b = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{bmatrix} = \begin{bmatrix} x_2 - x_1 \\ x_3 - x_2 \\ x_3 - x_1 \\ x_4 - x_1 \\ x_4 - x_3 \end{bmatrix} = \text{vector comprising the potential differences between the nodes} \qquad (36.3)$$

The above equation can now be rewritten in order to bring about the Ax = b form
as shown below. The matrix A is of the order 5 × 4, with 5 edges and 4 nodes.

$$\begin{aligned} -x_1 + x_2 &= b_1 \\ -x_2 + x_3 &= b_2 \\ -x_1 + x_3 &= b_3 \\ -x_1 + x_4 &= b_4 \\ -x_3 + x_4 &= b_5 \end{aligned} \qquad\Longleftrightarrow\qquad Ax = \begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ -1 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \\ 0 & 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{bmatrix} = b$$

It can therefore be seen that Ax = b captures the essence of Kirchhoff’s Voltage Law. The matrix A, of order 5 × 4 (5 edges and 4 nodes), gives information about the potential differences between the nodes; this matrix A is known as the incidence matrix. We can solve Ax = b using Gaussian elimination. The rref matrix R gives information about the rank and the free-variable matrix F.
     
$$\underbrace{\begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ -1 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \\ 0 & 0 & -1 & 1 \end{bmatrix}}_{\text{incidence matrix } A} \longrightarrow \underbrace{\begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}}_{\text{upper triangular matrix } U} \longrightarrow \begin{bmatrix} -1 & 0 & 0 & 1 \\ 0 & -1 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

Swapping rows 3 and 4 (to bring the zero rows to the bottom) and scaling the pivot rows by −1 gives the rref:

$$\begin{bmatrix} -1 & 0 & 0 & 1 \\ 0 & -1 & 0 & 1 \\ 0 & 0 & -1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \longrightarrow \underbrace{\begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}}_{\text{rref matrix } R} = \begin{bmatrix} I_3 & F \\ 0 & 0 \end{bmatrix}$$

We know that the null space matrix contains the vectors in N(A). The information needed is in the matrix F, which records the free column after elimination: setting the free variable x4 = 1 gives the pivot variables as −F = (1, 1, 1)^T. Thus, the null space solution of Ax = 0 is:

$$N = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \alpha \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$
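The elimination above can be reproduced in a few lines. The following is a sketch, assuming the sympy library is available; Matrix.rref() returns the rref together with the pivot columns, and Matrix.nullspace() returns a basis of N(A).

from sympy import Matrix

A = Matrix([[-1,  1,  0, 0],
            [ 0, -1,  1, 0],
            [-1,  0,  1, 0],
            [-1,  0,  0, 1],
            [ 0,  0, -1, 1]])
R, pivots = A.rref()
print(R)               # rows [1,0,0,-1], [0,1,0,-1], [0,0,1,-1], then zeros
print(A.rank())        # 3 = |V| - 1
print(A.nullspace())   # one basis vector, a multiple of (1, 1, 1, 1)^T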

The above result is significant in physics. It means that if the potential differences between the nodes are zero, i.e. b = 0, then the nodes are at the same electric potential. In other words, these nodes are equipotential. It can therefore be seen that Ax = b encapsulates Kirchhoff’s Voltage Law.

Using Ohm’s law and the information about the conductance of each edge, one could create a conductance matrix G. This could then be used to obtain the currents yi on each edge. The vector denoted by y has entries comprising the currents in all the edges. Thus Gb = y, where G is the 5 × 5 conductance matrix, b is the vector of potential differences and y is the vector of currents.

Figure 36.2: Circuit with 4 nodes and 2 independent loops; the currents y1, . . . , y5 flow along the edges in the directions indicated.

The currents along with their flow directions are shown in the above figure. Now Kirchhoff’s current law can be written as A^T y = 0. Thus, vectors in the null space of A^T correspond to collections of currents in loops that satisfy Kirchhoff’s laws. This is summarized in the following schematic.

x = {x1, x2, x3, x4} (potentials at nodes) −→ Ax = b (x2 − x1, etc. give the potential differences) −→ Gb = y (Ohm’s law) −→ y = {y1, y2, y3, y4, y5} (currents on edges) −→ A^T y = 0 (Kirchhoff’s Current Law)

The A^T y = 0 equation is written out and, using the method of elimination, the reduced row echelon form (rref) is obtained; the rank r is evident from the number of pivots.
$$A^{T}y = \begin{bmatrix} -1 & 0 & -1 & -1 & 0 \\ 1 & -1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & -1 \\ 0 & 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \implies [A^{T} \mid 0] = \begin{bmatrix} -1 & 0 & -1 & -1 & 0 & 0 \\ 1 & -1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 \end{bmatrix}$$

$$\underbrace{\begin{bmatrix} -1 & 0 & -1 & -1 & 0 \\ 1 & -1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & -1 \\ 0 & 0 & 0 & 1 & 1 \end{bmatrix}}_{A^{T}\text{: currents through nodes}} \longrightarrow \underbrace{\begin{bmatrix} -1 & 0 & -1 & -1 & 0 \\ 0 & -1 & -1 & -1 & 0 \\ 0 & 0 & 0 & -1 & -1 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}}_{\text{upper triangular matrix } U} \longrightarrow \begin{bmatrix} -1 & 0 & -1 & 0 & 1 \\ 0 & -1 & -1 & 0 & 1 \\ 0 & 0 & 0 & -1 & -1 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

Swapping columns 3 and 4 (so that the pivot columns come first) and scaling the pivot rows by −1:

$$\begin{bmatrix} -1 & 0 & 0 & -1 & 1 \\ 0 & -1 & 0 & -1 & 1 \\ 0 & 0 & -1 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} \longrightarrow \underbrace{\begin{bmatrix} 1 & 0 & 0 & 1 & -1 \\ 0 & 1 & 0 & 1 & -1 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}}_{\text{rref matrix } R} = \begin{bmatrix} I_3 & F \\ 0 & 0 \end{bmatrix}$$

The null space matrix contains the vectors in N(A^T). The information needed is again in the matrix F obtained from elimination: setting each free variable (y3 and y5, after the column swap) to 1 in turn gives the special solutions, with pivot variables equal to −F. Thus, the null space solution of A^T y = 0 is:
       
$$N = \begin{bmatrix} y_1 \\ y_2 \\ y_4 \\ y_3 \\ y_5 \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -1 & 1 \\ 0 & -1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \implies y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -1 & 1 \\ 1 & 0 \\ 0 & -1 \\ 0 & 1 \end{bmatrix}$$

(the rows on the left are in the column-swapped order y1, y2, y4, y3, y5; undoing the swap gives y).

One can quickly notice that the null space of A^T, i.e. the set of solutions of A^T y = 0, encapsulates Kirchhoff’s Current Law: each column of y gives the current flow around a particular loop.
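As a check, both columns of y found above should be annihilated by A^T. A minimal sketch, assuming numpy (the loop identifications in the comments are inferred from the circuit figure):

import numpy as np

A = np.array([[-1,  1,  0, 0],
              [ 0, -1,  1, 0],
              [-1,  0,  1, 0],
              [-1,  0,  0, 1],
              [ 0,  0, -1, 1]])
y1 = np.array([-1, -1, 1, 0, 0])    # circulation around the loop on edges b1, b2, b3
y2 = np.array([ 1,  1, 0, -1, 1])   # circulation around the loop on edges b1, b2, b5, b4
print(A.T @ y1)   # [0 0 0 0] -- Kirchhoff's current law holds at every node
print(A.T @ y2)   # [0 0 0 0]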

36.3 Relation between Linear Algebra and Graph Theory
In the previous chapters, the dimension of the null space of a matrix and its relation to rank have been discussed. Recall that,

$$\text{Dim}[N(A^{T})] = m - r \qquad (36.4)$$

where m is the number of edges and r is the rank, which equals the number of nodes minus one. Dim[N(A^T)] gives the number of independent loops L in the graph. The number of rows in the incidence matrix A is equal to the number of edges |E| in the graph. The rank of the incidence matrix r (which is invariant under the transpose operation) is one less than the number of nodes, i.e. |V | − 1. Therefore,

$$L = m - r = |E| - (|V| - 1) \qquad (36.5)$$

$$|V| - |E| + L = 1 \qquad (36.6)$$

The above formula can be interpreted as a variant of Euler’s formula in topology or the Gibbs phase rule in thermodynamics. Note that the first term represents zero-dimensional vertices, the second denotes one-dimensional lines and the third denotes two-dimensional loops.
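For the circuit graph above, this bookkeeping is easy to verify numerically. The sketch below (assuming numpy; not part of the original notes) recovers r = 3 and L = 2, and hence |V| − |E| + L = 1.

import numpy as np

A = np.array([[-1,  1,  0, 0],
              [ 0, -1,  1, 0],
              [-1,  0,  1, 0],
              [-1,  0,  0, 1],
              [ 0,  0, -1, 1]])
E, V = A.shape                     # 5 edges, 4 vertices
r = np.linalg.matrix_rank(A)       # rank = |V| - 1 = 3
L = E - r                          # number of independent loops = 2
print(r, L, V - E + L)             # 3 2 1, i.e. |V| - |E| + L = 1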

Summary
• A graph G is a set of vertices V and edges E, i.e. G = {V, E}. Here, the set of vertices is given by V = {v1 , v2 , . . . , vn } where |V | = n, and the set of edges is given by E = {e1 , e2 , . . . , em } where |E| = m.
• The degree of a vertex v in a graph is the number of edges incident on v ∈ V.
• A tree T is a connected acyclic graph.
• The sum of degrees for a tree can be generalised as:

$$\sum_{i=1}^{n} d(v_i) = 2m = 2(n-1)$$

• It can be seen that Ax = b encapsulates Kirchhoff’s Voltage Law, while A^T y = 0 encapsulates Kirchhoff’s Current Law, where each solution y gives the current flow around a particular loop.
• Euler’s formula in topology, given by |V | − |E| + L = 1, is linked to graph theory. Here |V | is the number of vertices, |E| is the number of edges and L is the number of independent loops.
Chapter 37

Proof Techniques

Topics
• Introduction to Proof Techniques.
• Direct Proofs.
• Indirect Proofs - Proof by Contrapositive and Proof by Contradiction.

37.1 Introduction to Proof Techniques


In this section, we formally introduce the idea of proofs and the methods used
in the construction of proofs. Proofs are essential in satisfying the human need
for validation of statements. Proofs are required to establish and check for con-
sistency among myriads of statements and data. Furthermore, in an increasingly
automated world, the logical inferences made through proof techniques are meth-
ods of communication between humans and machines.
Definition 37.1.1 A proof is a valid argument that establishes the truth of a math-
ematical statement.

The starting point of a proof technique is the hypothesis of a theorem or an axiom. Proof techniques make use of this, together with the method of logical inference, in establishing the truth of a given mathematical statement.

37.2 Direct Proof Technique


The objective here is that, given a premise P , we have to show that the conclusion
Q holds. The direct proof of a conditional statement P ⇒ Q involves the following
steps:

Step 1: Assume that the proposition P is T .


Step 2: Construct subsequent steps using logical inferences showing
that Q must also be T .


The direct proof technique shows that if P is T then Q must also be T , and there-
fore, the case that P is T and Q is F never arises.

Example 1: Let n be an integer, n ∈ Z. Prove:

i. If n is even, then n² is even.
ii. If n is odd, then n² is odd.

We can intuitively validate the above statements by considering specific values of n. Say, if n = 2, which is even, then we know that n² = 4, which is also even. Similarly, if n = 11, which is odd, then we know that n² = 121, which is also odd. We shall now use the direct proof technique and show that the above statements are true for all n ∈ Z.

Proof. Assume P is T, i.e. n is even. Then n = 2k for some k ∈ Z.
∴ n² = 4k² = 2(2k²) = 2j, where j = 2k² ∈ Z. We can see that n² is even. Therefore, Q is T when P is T.

Now assume P is T where n is odd. Then n = 2k + 1 for some k ∈ Z.
∴ n² = (2k + 1)² = 4k² + 4k + 1 = 2(2k² + 2k) + 1 = 2l + 1, where l = 2k² + 2k ∈ Z. We can see that n² is odd. Therefore, Q is T when P is T. ∎
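The direct proof covers every n ∈ Z at once; a finite check is no substitute, but it is a useful sanity test. A small sketch in Python (an illustration, not part of the original notes):

# Brute-force check (not a proof) of both parity claims over a finite range.
for n in range(-100, 101):
    if n % 2 == 0:
        assert (n * n) % 2 == 0   # even n gives even n^2
    else:
        assert (n * n) % 2 == 1   # odd n gives odd n^2
print("Both claims hold for -100 <= n <= 100.")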

37.3 Indirect Proof Technique


While direct proofs start with the premise and end with the conclusion, indirect proofs do not follow this pattern. Direct proofs can sometimes be hard to construct because the series of logical steps may not always be obvious. Thus we resort to indirect proof techniques such as proof by contrapositive and proof by contradiction.

Proofs are thus classified as direct or indirect; indirect proofs are further divided into proof by contrapositive and proof by contradiction.

37.3.1 Proof by Contrapositive


We know that P ⇒ Q is equivalent to ¬Q ⇒ ¬P . In contraposition, we make use
of this statement and prove that ¬Q ⇒ ¬P is T which implies that P ⇒ Q is T .
Proof by contraposition involves the following steps:

Step 1: Take ¬Q as the premise.

Step 2: Construct subsequent steps using logical inferences showing that ¬P must also hold, thereby establishing ¬Q ⇒ ¬P.

Example 2: (Proof by contrapositive) For any given integer n ∈ Z, prove that if n² is even, then n is even.

Proof. Consider the following propositions:

P: n² is even.
Q: n is even.

In the method of proof by contrapositive, we need to show that ¬Q ⇒ ¬P holds in order to prove P ⇒ Q. We begin by assuming that ¬Q is T, which means that n is not even. Therefore, n is odd. It follows from Example 1 that if n is odd, then n² is also odd. This means that n² is not even. Therefore, ¬P is T. Hence, ¬Q ⇒ ¬P. ∎

37.3.2 Proof by Contradiction


We want to prove that the statement P is T . In order to do this, we first assume
that the statement is F , i.e. P is F . Then we show that this assumption leads to
something nonsensical or in other words, we have a contradiction. We conclude
that our original assumption that P is F is not true. Hence, statement P is T .
Proof by contradiction involves the following steps:

Step 1: Assume that the statement P is F .


Step 2: Construct subsequent steps using logical inferences showing
that this assumption leads to a contradiction of the form: A ∧ ¬A is T .
Step 3: The above contradiction proves that the original assumption
must not be true.
Step 4: Hence, the original statement P must be T .

Example 3: (Proof by contradiction) Let n be an integer, n ∈ Z. Prove that if n² is even then n is even.

Proof. Consider the following propositions:

P: n² is even.
Q: n is even.

In the method of proof by contradiction, we assume that the statement P ⇒ Q is F. This holds only when P is T and Q is F.

Suppose P is T and Q is F.
−→ n² is even and n is not even, i.e. n is odd.
−→ n² is even and n² is odd (from Example 1).
−→ But n² cannot be both even and odd, which gives us a contradiction (as A ∧ ¬A is always F). Hence the assumption that P ⇒ Q is F must be wrong, and so P ⇒ Q is T. ∎

Schematically: take the premise P (assume P is T) and the conclusion C (assume C is F); the logical steps then yield a statement of the form P ∧ ¬P, which can never be T. This is the contradiction [A ∧ ¬A] that completes the proof.


Example: To prove that √2 is irrational.

We will prove the above by using the technique of proof by contradiction.
P: √2 is irrational.
Let us assume that P is not true (¬P), i.e. √2 is rational.
We know that a rational number can be written in the form p/q where q ≠ 0 and the greatest common divisor of p and q is 1, i.e. p and q are relatively prime. — (A)

Then √2 can be written as:
√2 = p/q
p = √2 q
p² = 2q²
p² is even. Therefore p is even [as proved in Example 2]. ⇒ p = 2l where l ∈ Z.
Since p and q are relatively prime and p is even,
⇒ q is odd. — (B)

Substituting p = 2l into p² = 2q²:
(2l)² = 2q²
4l² = 2q²
q² = 2l²
q² is even. Therefore q is even. — (C)
(B) and (C) give a contradiction. Also, if p and q are both even then 2 is a common factor of p and q. — (D)
However, from (A) we know that the only common factor of p and q is 1. Thus, (A) and (D) also give us a contradiction.
Thus, we can say that ¬P is F, i.e. P is true. ∎
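No finite search can replace the proof, but it can illustrate it. The sketch below (an illustrative check, assuming Python 3.8+ for math.isqrt) confirms that p² = 2q² has no integer solutions for q up to 10⁴, consistent with √2 being irrational.

from math import isqrt

# For each q, the only integer candidate for p is floor(sqrt(2) * q).
found = [(p, q) for q in range(1, 10**4)
         for p in (isqrt(2 * q * q),)
         if p * p == 2 * q * q]
print(found)   # [] -- no fraction p/q in this range squares to 2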

37.4 Conditionals, Proof by Contradiction


37.4.1 Conditional Statements
Implication (P ⇒ Q)
Implication is a statement of the form: if P is T, then Q is T. The statement “P is T” is called the antecedent or the hypothesis, while the statement “Q is T” is called the consequence or the conclusion. Whenever P is T, Q is T as well. Also, whenever Q is not T (¬Q), P is not T as well (¬P).

P ⇒ Q ≡ ¬Q ⇒ ¬P

Viewed as sets of truth assignments, P ⇒ Q corresponds to P ⊆ Q: whenever P is T, Q is T as well. Equivalently, ¬Q ⊆ ¬P: whenever ¬Q is T, ¬P is T as well.

The Contrapositive of the Implication.


The contrapositive of the implication (P ⇒ Q) is the following statement: if Q is F then P is F, i.e. ¬Q ⇒ ¬P. In order to prove P ⇒ Q, one can prove the equivalent contrapositive statement. That is, proving the statement “if P is T then Q is T” is equivalent to proving “if Q is F then P is F”, which is useful in case the latter is easier to prove.

Biconditionals (P ⇔ Q)
The biconditional statement P ⇔ Q is the proposition “P if and only if Q”. This is the same as stating (P ⇒ Q) ∧ (Q ⇒ P). It is true only when P and Q have the same truth values, and is false otherwise. It also means that P is necessary and sufficient for Q. For example, “any integer n is even iff n² is even” is a biconditional statement, i.e. ∀ n ∈ Z, n is even ⇔ n² is even.

In a biconditional, both P =⇒ Q and Q =⇒ P hold, i.e. (P =⇒ Q) ∧ (Q =⇒ P). In set terms, P ⊆ Q and Q ⊆ P simultaneously, so the two sets coincide: the boundary between P and Q is lost in biconditionals.

37.4.2 Proof by Contradiction


Proof by contradiction shows that some statement P is T by showing that it cannot be F. In order to prove P is T by contradiction, we do the following. Make the assumption that P is F. Then use logical reasoning to conclude something that is a contradiction, that is, of the form A ∧ ¬A = T. From this contradiction, we conclude that P cannot be F, and hence that P is T. A proof by contradiction is thus a proof that a statement P is T which works by showing that P cannot be F.

Recap: (Proof by Contradiction)


Suppose we wish to prove that P is T . We,
a. Assume that P is F , i.e, we assume ¬P
b. Derive through logical reasoning, something that we know is a contradiction.
c. Conclude that P is ¬F = T .

Note. In propositional logic one can observe the following chain of equivalences:

(¬P =⇒ F) ≡ ¬¬P ∨ F ≡ P ∨ F ≡ P

Thus, we observe that P is hidden in the implication statement (¬P =⇒ F): showing that ¬P leads to a falsehood amounts to showing P.

Example: If x + y = 16 then either x is greater than or equal to 8, or y is greater than or equal to 8.

x + y = 16 ⇒ [(x ≥ 8) ∨ (y ≥ 8)]

We will first prove the above by using its contrapositive:

¬[(x ≥ 8) ∨ (y ≥ 8)] ⇒ ¬[x + y = 16]

[¬(x ≥ 8) ∧ ¬(y ≥ 8)] ⇒ x + y ≠ 16

(x < 8) ∧ (y < 8) ⇒ x + y ≠ 16

If (x < 8) and (y < 8) then x + y ≠ 16. Let x and y be arbitrary numbers such that x < 8 and y < 8. Then

x + y < 8 + y < 8 + 8 = 16    [x < 8 and y < 8]

∴ x + y < 16, so x + y ≠ 16. This proves the contrapositive, and hence, whenever x + y = 16, the conclusion (x ≥ 8) ∨ (y ≥ 8) is true.
We will now prove the same using the technique of proof by contradiction.

P: x + y = 16 ⇒ [(x ≥ 8) ∨ (y ≥ 8)]
If x + y = 16 then x ≥ 8 or y ≥ 8.
Let us assume that P is not true.

¬P: [x + y = 16] ∧ ¬[(x ≥ 8) ∨ (y ≥ 8)]

Using De Morgan’s laws,

¬P: [x + y = 16] ∧ [¬(x ≥ 8) ∧ ¬(y ≥ 8)]

¬P: [x + y = 16] ∧ [(x < 8) ∧ (y < 8)]

But [x + y = 16] ∧ (x < 8) ∧ (y < 8) is contradictory, as shown above.
Hence, P is T.

37.5 Principle of Induction and some Examples


Topics

• Example of Proof by contradiction


• Principle of Induction
• Examples of Induction

37.6 Recap of some important terminology


We will revise some of the fundamental terminology related to theorums, proof
methods and other terms before beginning this lecture about principles of induc-
tion and proof by contradiction.

• Theorem: It is a statement that can be shown to be true. It can also be referred to as a result.

• Lemma: A less important theorem that is helpful in proving other results is called a lemma.

• Conjecture : A conjecture is a statement that is being proposed to be a true


statement, based on some partial evidence.

• Proof by contradiction : Since the statement r ∧ ¬r is a contradiction when-


ever r is a proposition, we can prove that p is true if we can show that
¬p → (r ∧ ¬r) is true for some proposition r. Proofs of this type are called
proofs by contradiction and they are generally known as indirect proofs.

37.7 Example of Proof by Contradiction


In a proof by contradiction, or indirect proof, you show that if a proposition were
false, then some false fact would be true. Since a false fact by definition can’t be
true, the proposition must be true. Steps followed in proof by contradiction :

• Given a statement P , assume that P has a truth value of False

• Using logical reasoning, deduce a contradiction : Where something known


is False

• Conclude that P is ¬F = T

37.7.1 Prove a given statement P using contradiction


We have to prove the given statement using the method of Proof by contradiction.
The statement is as follows :

(x + y) = 16 → (x ≥ 8) ∨ (y ≥ 8) (37.1)

• Step 1 : Start by stating the P statement :

P : [(x + y) = 16 → (x ≥ 8) ∨ (y ≥ 8)] (37.2)



• Step 2 : Within statement P , the antecedent is A = [(x + y) = 16] and the


consequent is B = [(x ≥ 8) ∨ (y ≥ 8)]. With this, we can write P : A → B
in the logically equivalent form of P : ¬A ∨ B. Original statement written in
this form is :

P : [¬[(x + y) = 16]] ∨ [(x ≥ 8) ∨ (y ≥ 8)] (37.3)

• Step 3 : Now we assume the statement P as False, or we assume ¬P to be


True. ¬P can be written in the form ¬[¬A∨B] which is equivalent to A∧¬B.
This statement written in proper notation is given as :

¬P : [[(x + y) = 16] ∧ ¬[(x ≥ 8) ∨ (y ≥ 8)]] (37.4)

¬P : [[(x + y) = 16] ∧ [(x < 8) ∧ (y < 8)]] (37.5)


¬P : [[(x + y) = 16] ∧ [(x < 8)] ∧ [(y < 8)]] (37.6)

• Step 4 : The previous step leads to a contradiction. Since, if x is less than


8 and y is also less than 8, then x + y can never be equal to 16. This is a
contradiction.

• Step 5 : Since assuming ¬P to be True has led us to a contradiction, there-


fore, we can conclude that ¬P is False and consequently, P is True. Hence
we have proved the original statement P to be True.
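A quick finite check of the statement just proved (illustrative only; restricting x to integers is an assumption made for concreteness):

# Whenever x + y = 16, at least one of x, y must be >= 8.
for x in range(-50, 67):
    y = 16 - x
    assert x >= 8 or y >= 8   # no counterexample exists
print("Verified for integer x in [-50, 66].")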

37.8 Principle of Induction


Induction is a powerful method for showing a property is true for all nonnega-
tive integers. Induction plays a central role in discrete mathematics and computer
science. In fact, its use is a defining characteristic of discrete, as opposed to continuous, mathematics. Proof by induction is a method which allows us to prove that a given, explicitly stated property is true for all positive or non-negative integers.

37.8.1 Proof by Induction


Let A(n) be an assertion or property concerning an integer n. We want to prove that A(n) holds for all positive integers n. The steps are as follows:

• Base case: Show that the assertion A(1) holds.

• Induction hypothesis: Assume that A(n) holds for n = k, that is, A(k) is True.

• Induction step: We prove that A(k + 1) is True using the induction hypothesis. Together, the steps can be shown logically as follows:

A(1) ∧ [A(k) → A(k + 1)] =⇒ A(n) for all n (37.7)

37.8.2 Example 1 of Proof by Induction

For all positive integers, A(n) = 1 + 2 + 3 + · · · + n = n(n + 1)/2. Applying the steps of induction:

• Base case: The formula gives 1(1 + 1)/2 = 1, and A(1) = 1. Thus, the condition holds for the base case.

• Induction hypothesis: A(k) = 1 + 2 + 3 + · · · + k. Assume this is equal to k(k + 1)/2.

• Induction step: Prove that A(k + 1) = (k + 1)[(k + 1) + 1]/2. Proof as follows:

A(k + 1) = 1 + 2 + 3 + · · · + k + (k + 1) (37.8)
A(k) = 1 + 2 + 3 + · · · + k = k(k + 1)/2 (37.9)
A(k + 1) = A(k) + (k + 1) (37.10)
A(k + 1) = k(k + 1)/2 + (k + 1) (37.11)
A(k + 1) = (k + 1)[k/2 + 1] (37.12)
A(k + 1) = (k + 1)(k + 2)/2 (37.13)
→ A(k + 1) = (k + 1)[(k + 1) + 1]/2 (37.14)

Hence, with the last step of proof by induction, we have proved that A(n) = 1 + 2 + 3 + · · · + n = n(n + 1)/2 for all positive integers.
2

37.8.3 Example 2 of Proof by Induction

For all positive integers, we have to prove that A(n) = 1 + 3 + 5 + · · · + (2n − 1) = n². Following the steps of induction, we have:

• Base case: A(1) = 1, which is equal to (1)² = 1. Hence the base case is True.

• Induction hypothesis: A(k) = 1 + 3 + 5 + · · · + (2k − 1) = k². This is assumed to be True.

• Induction step: To show that A(k + 1) = 1 + 3 + 5 + · · · + (2k − 1) + [2(k + 1) − 1] = (k + 1)². We prove this as follows:

A(k + 1) = A(k) + [2(k + 1) − 1] (37.15)
A(k + 1) = k² + 2k + 1 (37.16)
A(k + 1) = (k + 1)² (37.17)

Hence, we have proved that A(n) = 1 + 3 + 5 + · · · + (2n − 1) = n² is True for all positive integers.
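Both identities can be machine-checked over a finite range; the sketch below (an illustration, not a substitute for the induction proofs) verifies Examples 1 and 2 for n up to 1000.

# Finite check of 1 + 2 + ... + n = n(n+1)/2 and 1 + 3 + ... + (2n-1) = n^2.
for n in range(1, 1001):
    assert sum(range(1, n + 1)) == n * (n + 1) // 2
    assert sum(2 * i - 1 for i in range(1, n + 1)) == n * n
print("Both identities hold for 1 <= n <= 1000.")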
