
Quantitative Trading Series

Quantitative and Qualitative Treatments


to Capital Markets
and related bodies of knowledge

By

HangukQuant

Private Notes,

Quantitative Research

2022∼

DISCLAIMER: the contents of this work are not intended as investment, legal, tax or any other
advice, and are for informational purposes only. It is illegal to make unauthorized copies, to forward
this article to an unauthorized user, or to post it electronically without the express written consent of HangukQuant.
Abstract
This book is designed to be a practical handbook for all finance professionals, practitioner or academic.
It is an organization of the various knowledge domains, with a focus on drawing links in the intricate
web between the theory and practice of finance that market participants seek to unfold. It contains
discussions of trading anomalies, premia and inefficiencies, as well as discussions of discretionary
and quantitative trading. The discussion stretches across theoretical work, such as statistical methods,
linear algebra and financial mathematics, and applied work in equity research, quantitative research and
macroeconomic theory.
This work is attributed to the brilliant writers, academics, scientists and traders before me. Although
we have tried to credit the referenced work where relevant, giving a complete reference for every source
is impossible. The work has been organized and compiled from various texts, lecture notes, journals,
blogs, personal communications and even scraps of scribbled notes from the author’s time in college.
These contain notes from blogs referencing journals, journals referencing blogs, blogs referencing blogs
referencing journals - you name it. We apologise if we have failed to credit your work. Other work
is original. Readers may reach us at hangukquant@gmail.com. The updated notes are released at
hangukquant.substack.com.
Faith is to believe without seeing. This work is dedicated to those who placed their faith in
me. To Jeong(s), Choi, Julian and my dearest friends who have shaped my world view and colored it
rainbow.

Keywords:
Linear Algebra
Calculus Methods
Computer Methods
Global Macro Trading
Quantitative Research
Statistics & Probability Theory
Risk Premia and Market Inefficiencies
Equities Trading and Other Asset Classes
Table of Contents

Title i
Abstract ii
1 Introduction 1
1.1 Guidelines for Reviewing Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Ordinary Calculus 2
3 Linear Algebra 3
3.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.2 Computational Methods in the Euclidean Space . . . . . . . . . . . . . . . . . . . . . . . . 3
3.2.1 Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.2.1.1 Elementary Row Operations (EROs) . . . . . . . . . . . . . . . . . . . . . 5
3.2.1.2 Row-Echelon Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2.1.3 Gaussian Elimination Methods . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2.1.4 Homogeneous Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.2 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.2.1 Operations on Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.2.2 Invertibility of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.2.3 Elementary Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.2.4 Matrix Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.3 Euclidean Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.3.1 Finite Euclidean Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.3.2 Linear Spans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.3.3 Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.3.4 Linear Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.3.5 Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.3.6 Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.3.7 Transition Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.4 Matrix Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.4.1 Row, Column Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.4.2 Ranks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.4.3 Nullspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.5 Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.5.1 Orthogonal Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2.5.2 Best Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2.5.3 Orthogonal Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2.6 Diagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.6.1 Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.6.2 Diagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2.6.3 Orthogonal Diagonalization . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.2.6.4 Quadratic Forms and Conic Sections . . . . . . . . . . . . . . . . . . . . . 70
3.2.7 Linear Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2.7.1 Ranges and Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3 Abstract Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.3.1 Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.3.2 Finite Dimensional Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.3.2.1 Vector Spans and Linear Independence . . . . . . . . . . . . . . . . . . . 88
3.3.2.2 Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.3.2.3 Dimensions of a Vector Space . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.3.3 Linear Maps/Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.3.1 Vector Space of Linear Maps . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.3.2 Vector Space Associated with Linear Maps . . . . . . . . . . . . . . . . . 96
3.3.3.3 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.3.3.4 Isomorphisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.3.3.5 Products, Quotients of Vector Spaces . . . . . . . . . . . . . . . . . . . . 108

3.3.3.6 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.3.4 Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.3.5 Eigenvectors and Invariant Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . 123
3.3.5.1 Invariant Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
3.3.5.2 Eigenvectors and Upper Triangular Matrices . . . . . . . . . . . . . . . . 127
3.3.5.3 Eigenspaces, Diagonal Matrices . . . . . . . . . . . . . . . . . . . . . . . . 131
3.3.6 Inner Product Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.3.6.1 Inner Products and Norms . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.3.6.2 Orthonormal Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
3.3.6.3 Orthogonal Complements and Minimization Problems . . . . . . . . . . . 143
3.3.7 Operators on Inner Product Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
3.3.7.1 Self-Adjoint and Normal Operators . . . . . . . . . . . . . . . . . . . . . 147
3.3.7.2 Spectral Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
3.3.7.3 Positive Operator and Isometries . . . . . . . . . . . . . . . . . . . . . . . 156
3.3.7.4 Polar and Singular Value Decomposition . . . . . . . . . . . . . . . . . . 160
3.3.8 Operators on Complex Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . 166
3.3.8.1 Generalized Eigenvectors and Nilpotency . . . . . . . . . . . . . . . . . . 166
3.3.8.2 Decomposition of Operator . . . . . . . . . . . . . . . . . . . . . . . . . . 170
3.3.8.3 Characteristic and Minimal Polynomial . . . . . . . . . . . . . . . . . . . 173
3.3.8.4 Jordan Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
3.3.9 Operators on Real Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
3.3.9.1 Complexification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
3.3.9.2 Operators on Real Inner Product Spaces . . . . . . . . . . . . . . . . . . 183
3.3.10 Trace and Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
3.3.10.1 Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
3.3.10.2 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
3.4 More Concepts in Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
3.4.1 Rayleigh Quotients, Matrix Norms and Characterization of Singular Values . . . . 195
3.4.2 Schur Complement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
3.4.3 Schur Complement and Positive (Semi)definite Matrices . . . . . . . . . . . . . . . 200
3.5 Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
4 Convex Optimization 209
4.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
4.2 Mathematical Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
4.3 General Overview of Problem Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.4 Convex Sets and Preservation of Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4.4.1 Convex Sets and Other Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
4.4.2 Operations Preserving Convexity of Sets . . . . . . . . . . . . . . . . . . . . . . . . 218
4.4.3 Proper Cones and Generalized Inequalities . . . . . . . . . . . . . . . . . . . . . . 219
4.4.4 Separating and Supporting Hyperplanes . . . . . . . . . . . . . . . . . . . . . . . . 221
4.4.5 Dual Cones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
4.5 Convex Functions and Preservation of Convexity . . . . . . . . . . . . . . . . . . . . . . . 226
4.5.1 Checking for Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
4.5.2 Operations that Preserve Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . 233
4.5.3 Conjugate Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
4.5.4 Quasiconvex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
4.5.5 Preservation of Quasiconvexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
4.5.6 Log-concave/convex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
4.5.7 Extension of Convexity to Generalized Equalities . . . . . . . . . . . . . . . . . . . 249
4.6 Convex Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
4.6.1 Quasiconvex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
4.6.2 Linear Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
4.6.3 Quadratic Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
4.6.4 Second Order Cone Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
4.6.5 Geometric Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
4.6.6 Optimization Problems with Generalized Inequalities . . . . . . . . . . . . . . . . . 269
4.6.7 Vector Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
4.7 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
4.7.1 Lagrangian and Dual Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
4.7.2 Dual and Conjugate Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
4.7.3 Lagrange Dual Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
4.7.4 Optimality Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
4.7.5 Perturbation and Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 289
4.7.6 Theorem of Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
4.7.7 Duality and Generalized Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . 297
4.8 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
4.8.1 Norm Approximation Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
4.8.2 Least-Norm Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
4.8.3 Regularized Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304

4.8.4 Robust Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
4.8.5 Function Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
4.8.6 Parametric and Non Parametric Estimations . . . . . . . . . . . . . . . . . . . . . 312
4.9 Algorithms for Unconstrained Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . 316
4.9.1 Descent Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
4.9.2 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
4.9.3 Steepest Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
4.9.4 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
4.9.5 Self-Concordance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
4.10 Equality Constrained Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
4.10.1 Equality Constrained Convex Quadratic Problems . . . . . . . . . . . . . . . . . . 336
4.10.2 Elimination of Equality Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . 337
4.10.3 Solving Equality Constrained Problem via the Dual Problem . . . . . . . . . . . . 338
4.10.4 Feasible point Newton’s method with equality constraints . . . . . . . . . . . . . . 339
4.10.5 Infeasible point Newton’s method with equality constraints . . . . . . . . . . . . . 342
4.11 Interior Point Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
4.11.1 Barrier Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
4.11.2 Phase I Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
4.11.3 Primal Dual Interior Point Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 355
5 Set Theory 358
5.1 Algebra of Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
6 Probability and Statistical Models 359
6.1 Probability Spaces and Probability Measures . . . . . . . . . . . . . . . . . . . . . . . . . 359
6.2 Counting and Combinatorics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
6.3 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
6.3.1 Expectations of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
6.3.2 Riemann and Lebesgue Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
6.3.3 Convergence of Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
6.3.4 Computing Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
6.3.5 Change of Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
6.3.6 Random Variable Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
6.3.7 Random Variable Co-Movements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
6.4 Moment Generating Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
6.5 Information and Joint Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
6.5.1 Independence and Joint Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
6.5.2 Conditioning Probabilities, Densities and Expectations . . . . . . . . . . . . . . . . 398
6.5.3 Transformation and Combination of Random Variables . . . . . . . . . . . . . . . 407
6.6 Inferential Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
6.7 Laws and Basics of Probability Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
6.8 Sampling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
6.8.1 Simple Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
6.8.2 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
6.9 Estimation of Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
6.9.0.1 Estimation of Paired Samples . . . . . . . . . . . . . . . . . . . . . . . . . 414
6.10 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
6.10.1 Consistency of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
6.11 Monte Carlo Bootstraps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
6.12 Method of Maximum Likelihoods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
6.12.1 Confidence Intervals of Maximum Likelihood Estimates . . . . . . . . . . . . . . . 420
6.12.2 Large Sample Theory of Maximum Likelihoods . . . . . . . . . . . . . . . . . . . . 420
6.12.3 Least Squares and Maximum Likelihoods . . . . . . . . . . . . . . . . . . . . . . . 421
6.13 Fisher Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
6.14 Efficiency of Estimators and Asymptotic Relative Efficiencies . . . . . . . . . . . . . . . . 424
6.14.1 Comparing Asymptotic Relative Efficiency . . . . . . . . . . . . . . . . . . . . . . 425
6.15 Sufficient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
6.16 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
6.16.1 Conditional Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
6.17 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
6.17.1 Bernoulli Trials and Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . 429
6.17.2 Hypergeometric Random Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 430
6.17.3 Exponential Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
6.17.4 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
6.17.5 Poisson Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
6.17.6 Gaussian/Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
6.17.6.1 The Multivariate Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
6.17.6.2 Other Gaussian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
6.17.7 Chi-Squared Distribution χ2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
6.17.8 Log-Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
6.17.9 t Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
6.17.10 F Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
6.17.11 Gamma Distribution Γ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438

6.17.12 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
6.17.13 Logistic Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
6.17.14 Rayleigh Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
6.17.15 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
6.18 Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
6.18.1 Large Sample Theory on Order Statistics . . . . . . . . . . . . . . . . . . . . . . . 443
6.19 Methods in Robust Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
7 Hypothesis Testing, Interval Estimation and Other Tests 445
7.1 Theory of Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
7.2 Generalized Likelihood Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
7.3 Tests on Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
7.3.1 QQ-Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
7.4 Tests on Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
7.4.1 One-Sample T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
7.4.2 One-Sample Sign Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
7.4.2.1 Normal Approximation to Sign Test and Continuity Corrections . . . . . 451
7.4.3 Interval Estimation of the Median . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
7.4.4 One-Sample (Wilcoxon) Signed Rank Test . . . . . . . . . . . . . . . . . . . . . . . 452
7.4.4.1 Distribution of the Wilcoxon Signed Rank Test Statistic . . . . . . . . . . 453
7.4.4.2 Relating the Wilcoxon Signed Rank Test to the T-Test . . . . . . . . . . 454
7.4.5 Hodges-Lehmann Estimates for Median and Tukey’s Method for Confidence Intervals 454
7.4.6 Paired-Sample T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
7.4.7 Paired Sample Signed Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
7.4.8 Two Sample T-Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
7.4.9 Two-Sample Mann Whitney (Wilcoxon Rank Sum) Test . . . . . . . . . . . . . . . 458
7.4.10 Distribution of the Wilcoxon Rank Sum Test Statistic . . . . . . . . . . . . . . . . 459
7.5 Location Shift Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
7.6 Parametric and Nonparametric Analysis of Variance, ANOVA/F-Test, Kruskal Wallis Test
(to be reviewed) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
7.6.1 F Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
7.6.1.1 Groups of Different Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
7.6.2 Kruskal Wallis Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
7.6.2.1 Distribution of the Kruskal Wallis Test Statistic . . . . . . . . . . . . . . 465
7.6.3 Bonferroni’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
7.6.4 Tukey’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
7.6.5 Two-Way ANOVA Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
7.7 Correlation Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
7.7.1 Parametric Correlation Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
7.7.2 Nonparametric Spearman Correlation Test . . . . . . . . . . . . . . . . . . . . . . 467
7.8 Goodness of Fit Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
7.8.1 Kolmogorov Smirnov Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
8 Statistical Learning 471
8.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
8.2 Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
8.2.1 Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
8.2.2 Least Squares vs Nearest Neighbours . . . . . . . . . . . . . . . . . . . . . . . . . . 472
8.2.3 Regression Functions, Classifiers and Prediction Errors . . . . . . . . . . . . . . . . 473
8.3 Local Methods in High Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
8.4 Bias, Variance and Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
8.5 Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
8.5.1 Supervised Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
8.6 Classes of Restricted Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
8.6.1 Roughness Penalty and Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . 477
8.6.2 Kernel Methods and Local Regression . . . . . . . . . . . . . . . . . . . . . . . . . 477
8.6.3 Basis Functions and Dictionary Methods . . . . . . . . . . . . . . . . . . . . . . . . 478
8.7 Model Selection, Bias-Variance Tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
8.8 Least Squares Regression Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
8.8.1 Simple Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
8.8.1.1 Assumptions of the Simple Linear Equation . . . . . . . . . . . . . . . . . 480
8.8.1.2 Model Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
8.8.1.3 Model Properties and Variance of Estimates . . . . . . . . . . . . . . . . 481
8.8.1.4 Assumptions of the Analysis of Model on Simple Linear Equations . . . . 482
8.8.1.5 Test of Significance on Regression Coefficients . . . . . . . . . . . . . . . 482
8.8.1.6 Test of Significance on Regression Model and ANOVA Methods . . . . . 482
8.8.1.7 Confidence Intervals on Parameters and Variance Estimates . . . . . . . . 483
8.8.1.8 Confidence Intervals and Prediction Intervals on Response . . . . . . . . 484
8.8.1.9 No Intercept Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
8.8.1.10 Coefficient of Determination, R2 . . . . . . . . . . . . . . . . . . . . . . . 485
8.8.1.11 Maximum Likelihood Estimators vs Simple Least Squares . . . . . . . . . 485

8.8.2 Multiple Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
8.8.2.1 Interaction Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
8.8.2.2 Assumptions and Model Notations . . . . . . . . . . . . . . . . . . . . . . 486
8.8.2.3 Model Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
8.8.2.4 Model Properties and Variance of Estimates . . . . . . . . . . . . . . . . 487
8.8.2.5 Assumptions for Analysis of Multiple Least Squares Regression . . . . . . 488
8.8.2.6 Significance Tests for Regression Coefficients by t-tests and Partial Sum
of Squares Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
8.8.2.7 Confidence Interval for Regression Coefficient Estimates . . . . . . . . . . 489
8.8.2.8 Confidence Interval and Prediction Intervals of Estimates on Mean Re-
sponse and Response Variables . . . . . . . . . . . . . . . . . . . . . . . . 490
8.8.2.9 Significance Tests for Regression Model . . . . . . . . . . . . . . . . . . . 490
8.8.2.10 Coefficients of Determination and Adjustments . . . . . . . . . . . . . . . 491
8.8.2.11 Interpretation of Model and Coefficients . . . . . . . . . . . . . . . . . . . 491
8.8.2.12 Regressor Variable Hull and Extrapolation of the Input Space . . . . . . 491
8.8.2.13 Standardization of Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
8.8.2.14 Indicator Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
8.8.3 Adequacy of The Least Squares Method . . . . . . . . . . . . . . . . . . . . . . . . 493
8.8.3.1 Residual Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
8.8.3.1.1 Leverage Values, Influential Values and the Variance of Residuals 493
8.8.3.2 Standardization of Residuals . . . . . . . . . . . . . . . . . . . . . . . . . 494
8.8.3.3 Checking Normality Assumptions . . . . . . . . . . . . . . . . . . . . . . 494
8.8.3.4 Note on Time Series Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
8.8.3.5 Outliers and Influential Data . . . . . . . . . . . . . . . . . . . . . . . . . 495
8.8.3.6 Lack of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
8.8.3.7 Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
8.8.3.7.1 Detection of multicollinearity - Plots, Variance Inflation Factors
and Eigensystem Analysis . . . . . . . . . . . . . . . . . . . . . . 497
8.8.4 Correction of Inadequacies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
8.8.4.1 Transformation of Response-Regressor . . . . . . . . . . . . . . . . . . . . 498
8.8.4.1.1 Box-Cox Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
8.8.4.1.2 Box-Tidwell Method . . . . . . . . . . . . . . . . . . . . . . . . . 498
8.8.4.2 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
8.8.4.3 Principal Component Regression . . . . . . . . . . . . . . . . . . . . . . . 499
8.8.5 Weighted Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
8.8.6 Generalized Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
8.8.7 Variable Selection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
8.8.7.1 Selection Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
8.8.7.1.1 Akaike Information Criteria . . . . . . . . . . . . . . . . . . . . . 502
8.8.7.1.2 Bayesian Information Criteria . . . . . . . . . . . . . . . . . . . 502
8.8.7.2 Computational Methods - Brute Force and Stepwise Greedy Solutions . . 502
8.8.7.2.1 Brute Force Method . . . . . . . . . . . . . . . . . . . . . . . . . 502
8.8.7.2.2 Forward Selection Method . . . . . . . . . . . . . . . . . . . . . 502
8.8.7.2.3 Backward Elimination Method . . . . . . . . . . . . . . . . . . . 503
8.8.7.2.4 Stepwise Regression Method . . . . . . . . . . . . . . . . . . . . 503
8.9 Rank Methods for Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
8.10 Nonparametric Density Curve Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
8.10.1 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
8.10.2 Naive Kernel Density Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
8.10.3 Kernel Density Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
8.11 Kernel Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
8.11.1 Nadaraya–Watson Kernel Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 508
9 Randomization and Simulation 509
9.1 Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
10 Utility Theory 511
10.1 Utility Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
10.1.1 CRRA Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
11 Statistical Finance 512
11.1 Simulation and Resampling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
11.1.1 Destruction of Patterns in Prices and Bars by Permutation . . . . . . . . . . . . . 512
11.1.1.1 Permuting Price Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
11.1.1.2 Permuting Bar (OHLCV) Data . . . . . . . . . . . . . . . . . . . . . . . . 513
11.1.1.2.1 Cautionary Note on the Walkforward Data Shuffle . . . . . . . 517
11.2 Probabilistic Analysis of Trading Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
11.2.1 Parametric and Non-Parametric Methods for Tests of Location . . . . . . . . . . . 518
11.2.2 Monte Carlo Permutation Methods for Arbitrary Performance Criterion . . . . . . 519
11.2.2.1 p-value in in-sample and overfit detection by data shuffle . . . . . . . . . 520
11.2.2.2 p-value of asset timing in oos by decision shuffle . . . . . . . . . . . . . . 521

11.2.2.3 p-value of asset picking in oos by decision shuffle . . . . . . . . . . . . . . 521
11.2.2.4 p-value of trader skill in oos by data shuffle . . . . . . . . . . . . . . . . . 522
11.2.2.5 p-value of signal families in oos by data shuffle and bias adjustments . . . 522
11.2.2.5.1 p-value of 1-best signal . . . . . . . . . . . . . . . . . . . . . . . 523
11.2.2.5.2 p-value of k-combined signals . . . . . . . . . . . . . . . . . . . . 523
11.2.2.5.3 p-value of k-marginal signals by bounding . . . . . . . . . . . . . 524
11.2.2.5.4 p-value of k-marginal signals by greedy selection and FER control 524
11.2.3 Monte Carlo Bootstraps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
12 Portfolio Management 527
12.1 Introduction and Problem Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
12.1.1 Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
12.1.2 Risk and Standard Deviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
12.1.3 VaR, Conditional VaR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
12.1.4 Alternative Measures of Portfolio Risk . . . . . . . . . . . . . . . . . . . . . . . . . 530
12.1.5 Risk-Adjusted Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
12.1.6 A Basket of Assets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
12.1.7 Computations for the Portfolio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
12.2 Dynamic Signal Weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
12.2.1 Elegant Mathematics, Poor Economics - MPT . . . . . . . . . . . . . . . . . . . . 534
12.2.2 Risk-Free Rates and the Capital Market Line . . . . . . . . . . . . . . . . . . . . . 537
12.3 Utility functions and Indifference Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
12.4 Factor Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
12.4.1 Time Series Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
12.4.2 Cross Sectional Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
12.4.3 Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
12.4.4 Simplified Markowitz Formulations under the Factor Model Analysis . . . . . . . . 546
12.4.5 Factor Hedging and Attribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
12.4.5.1 Single/Multiple Factor Hedging . . . . . . . . . . . . . . . . . . . . . . . 547
12.5 Extension of the Classical Mean-Variance Framework for Various Constraints and Utilities 549
12.5.1 Practical Constraints and Modifications to Objectives . . . . . . . . . . . . . . . . 549
12.5.2 Extension of the Utility Function for Higher Moments of Portfolio Returns . . . . 552
12.5.2.1 Polynomial Goal Programming, Fabozzi et al. [7] . . . . . . . . . . . . . . 553
12.5.3 Numerical Solving of the Extension to N Asset Mean-Variance Portfolios with
Long Constraints and Linear Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
12.5.4 Numerical Solving of the Extension to N Multi-Asset Alpha Mean-Variance Port-
folios with Long Constraints and Linear Costs . . . . . . . . . . . . . . . . . . . . 557
12.6 Robust Estimation Methods and Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 559
12.6.1 Robust Estimators for Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
12.6.2 Robust Estimators for Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
12.6.3 Robust Estimation of Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
12.6.4 Robust Estimation of High Dimensional Covariance and Correlation Matrices . . . 563
12.6.4.1 Difficulty in Estimation of the Covariance Matrix . . . . . . . . . . . . . 564
12.6.4.2 Random Matrix Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
12.6.4.3 Change in Centre of Mass . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
12.6.4.4 Ridge Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
12.6.4.5 Ledoit Wolf Constant Correlation . . . . . . . . . . . . . . . . . . . . . . 565
12.6.4.6 EPO Shrinkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
12.6.5 Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
13 Stochastic Calculus 566
13.1 Brownian Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
13.1.1 Random Walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
13.1.2 Limiting Distributions and Derivation of Log-Normality . . . . . . . . . . . . . . . 567
13.1.3 Brownian Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
13.1.3.1 Brownian Motion Filtration . . . . . . . . . . . . . . . . . . . . . . . . . . 572
13.1.3.2 Variation of the Brownian Motion . . . . . . . . . . . . . . . . . . . . . . 573
13.1.3.3 Volatility of an Exponentiated Brownian Motion . . . . . . . . . . . . . . 576
13.1.4 Brownian Motion as Markov Process . . . . . . . . . . . . . . . . . . . . . . . . . . 577
13.1.5 First Passage Time Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
13.1.6 Reflection Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
13.1.7 Brownian Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
13.2 Stochastic Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
13.2.1 Ito Integral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
13.2.2 Ito Doeblin Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
13.2.3 Black Scholes Merton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
13.2.3.1 Evolution of Portfolio Value . . . . . . . . . . . . . . . . . . . . . . . . . 604
13.2.3.2 Evolution of Option Value . . . . . . . . . . . . . . . . . . . . . . . . . . 605
13.2.3.3 Deriving the BSM PDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
13.2.4 Verifying the Black Scholes Merton Solution . . . . . . . . . . . . . . . . . . . . . . 606
13.2.5 Option Greeks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607

13.2.6 Put-Call Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
13.2.7 Multivariable Stochastic Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616
13.2.8 Levy’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619
13.2.9 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
13.2.10 Brownian Bridge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
13.2.10.1 Joint Distributions of the Brownian Bridge . . . . . . . . . . . . . . . . . 632
13.2.11 Brownian Bridge as Conditioned Brownian Motion . . . . . . . . . . . . . . . . . . 635
13.3 Risk-Neutrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636
13.3.1 Risk Neutral Measure, Generalized Geometric Brownian Motion . . . . . . . . . . 637
13.3.2 Risk Neutral Measure, Value of Portfolio Process . . . . . . . . . . . . . . . . . . . 638
13.3.3 Risk Neutral Measure, Derivative Pricing . . . . . . . . . . . . . . . . . . . . . . . 639
13.3.4 Risk Neutral Measure, Obtaining the Black-Scholes-Merton Form . . . . . . . . . . 639
13.3.5 Martingale Representation Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 643
13.3.5.1 One Brownian Motion Martingale Representation . . . . . . . . . . . . . 643
13.3.5.2 Single Stock Hedging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
13.3.6 Fundamental Theorem of Asset Pricing . . . . . . . . . . . . . . . . . . . . . . . . 645
13.3.6.1 Multidimensional Market Model . . . . . . . . . . . . . . . . . . . . . . . 646
13.3.6.2 Existence of Risk-Neutral Measure . . . . . . . . . . . . . . . . . . . . . . 648
13.3.6.3 Uniqueness of Risk-Neutral Measures . . . . . . . . . . . . . . . . . . . . 653
13.3.7 Dividend Paying Stocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
13.3.7.1 Continuous Dividends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
13.3.7.2 European Call Pricing for Continuous Dividends . . . . . . . . . . . . . . 657
13.3.7.3 Lump Sum Dividends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662
13.3.7.4 European Call Pricing for Lump Sum Dividends . . . . . . . . . . . . . . 662
13.3.8 Forwards and Futures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
13.3.8.1 Forwards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
13.3.8.2 Futures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
13.3.9 Difference in Valuations Between Forwards and Futures . . . . . . . . . . . . . . . 666
13.4 Partial Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
13.4.1 Stochastic Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
13.4.2 Markov Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
13.4.3 Feynman-Kac Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
13.4.4 Applications to Interest Rate Models . . . . . . . . . . . . . . . . . . . . . . . . . . 677
13.4.5 Multidimensional Feynman-Kac Theorem . . . . . . . . . . . . . . . . . . . . . . . 685
13.5 Exotic Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
13.5.1 Up-and-Out Call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
13.5.2 Lookback Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694
13.5.3 Asian Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
13.6 American Derivative Securities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706
13.6.1 Perpetual American Put . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
13.6.2 Finite Expiration American Put . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
13.6.3 Finite Expiration American Call . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720
13.7 Change of Numeraire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
13.7.1 Foreign, Domestic Risk-Neutral Measures . . . . . . . . . . . . . . . . . . . . . . . 731
13.7.1.1 Domestic Risk-Neutral Measure . . . . . . . . . . . . . . . . . . . . . . . 732
13.7.1.2 Foreign Risk-Neutral Measure . . . . . . . . . . . . . . . . . . . . . . . . 733
13.7.2 Forward Exchange Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
13.7.3 Garman-Kohlhagen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737
13.7.4 Exchange Rate Put-Call Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
13.7.5 Forward Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
13.7.5.1 Zero-Coupon Bond Numeraire . . . . . . . . . . . . . . . . . . . . . . . . 739
13.7.6 Stochastic Rate Option Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 739
13.8 Term Structure Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
13.8.1 Affine-Yield Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745
13.8.1.1 Two-Factor Vasicek Model . . . . . . . . . . . . . . . . . . . . . . . . . . 745
13.8.1.2 Two-Factor CIR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 756
13.8.1.3 Mixed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
13.8.2 Heath-Jarrow-Morton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
13.8.2.1 Dynamics of Forward Rates and Bond Prices . . . . . . . . . . . . . . . . 759
13.8.2.2 Heath-Jarrow-Morton under Risk-Neutrality . . . . . . . . . . . . . . . . 761
13.8.2.3 Relation to the Affine-Yield Models . . . . . . . . . . . . . . . . . . . . . 762
13.8.2.4 Implementation of HJM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
13.8.3 Forward LIBOR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
13.8.3.1 (Forward) LIBOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
13.8.3.2 Backset LIBOR Contract . . . . . . . . . . . . . . . . . . . . . . . . . . . 765
13.8.3.3 Black Caplet Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766
13.8.3.4 Relating Forward LIBOR and ZCB Volatility . . . . . . . . . . . . . . . . 767
13.8.3.5 Forward LIBOR Term Structure Model . . . . . . . . . . . . . . . . . . . 768
13.9 Calculus for Jump Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 771
13.9.1 Compensated and Compound Poisson Processes . . . . . . . . . . . . . . . . . . . 771
13.9.2 Moment Generating Function for the Compound Poisson Process . . . . . . . . . . 773
13.9.3 Integrals and Differentials of Jump Processes . . . . . . . . . . . . . . . . . . . . . 775

13.9.4 Change of Measure for Jump Processes . . . . . . . . . . . . . . . . . . . . . . . . 785
13.9.5 Option Pricing under the Jump Model of S(t) driven by N (t) . . . . . . . . . . . . 794
13.9.6 Option Pricing under the Jump Model of S(t) driven by W (t) and Q(t) . . . . . . 797

14 Volatility Trading 803


14.1 Arbitrage Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803
14.2 Pricing Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
14.3 Option Greeks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 811
14.4 Volatility Measurement and Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
14.4.1 Measurements of Volatility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
15 Risk Premiums 817
15.1 Formalism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 817
15.1.1 Microeconomic Foundations of Risk Premiums . . . . . . . . . . . . . . . . . . . . 819
15.1.1.1 One-Period Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
15.1.2 Idiosyncratic Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
15.1.2.1 Continuous Time Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
15.1.3 Interest Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 823
15.1.4 Expected Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824
15.1.4.1 Central Pricing Equation and the Capital Asset Pricing Model . . . . . . 825
16 Alternative Data 826
17 Programming 827
17.1 Python Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 827
17.1.1 Computers and Python as a Language . . . . . . . . . . . . . . . . . . . . . . . . . 828
17.1.2 Code and Computer Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 829
17.1.2.1 time.time(), function decorators, IPython and OS utilities . . . . . . . . . 830
17.1.2.2 Function Profiling with cProfile . . . . . . . . . . . . . . . . . . . . . . . 833
17.1.2.3 Line Profiling with line profiler . . . . . . . . . . . . . . . . . . . . . . . . 837
17.1.2.4 Bytecode Profiling with dis . . . . . . . . . . . . . . . . . . . . . . . . . . 839
17.1.3 Asynchronous Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 841
18 Conclusions 862
A Russian Doll Tester 1
B Market Historians 1
B.1 1960 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
B.2 1961 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
B.3 1962 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
B.4 1963 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
B.5 1964 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
B.6 1965 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
B.7 1966 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
B.8 1967 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
B.9 1968 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
B.10 1969 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
B.11 1970 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
B.12 1971 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
B.13 1972 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
B.14 1973 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
B.15 1974 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
B.16 1975 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
B.17 1976 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
B.18 1977 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
B.19 1978 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
B.20 1979 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
B.21 1980 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
B.22 1981 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
B.23 1982 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
B.24 1983 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
B.25 1984 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
B.26 1985 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
B.27 1986 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
B.28 1987 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
B.29 1988 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
B.30 1989 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
B.31 1990 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
B.32 1991 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
B.33 1992 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
B.34 1993 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
B.35 1994 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
B.36 1995 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
B.37 1996 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
B.38 1997 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
B.39 1998 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

B.40 1999 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
B.41 2000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
B.42 2001 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
B.43 2002 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
B.44 2003 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
B.45 2004 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
B.46 2005 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
B.47 2006 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
B.48 2007 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
B.49 2008 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
B.50 2009 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
B.51 2010 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
B.52 2011 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
B.53 2012 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
B.54 2013 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
B.55 2014 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
B.56 2015 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
B.57 2016 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
B.58 2017 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
B.59 2018 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
B.60 2019 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
B.61 2020 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
B.62 2021 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
B.63 2022 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
B.64 2023 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
C CODE References 1

Chapter 1

Introduction

1.1 Guidelines for Reviewing Work


The following are the stages of alpha formulations.

Idea 1 (This means to further explore the idea creatively. This is a precursor to a Test.).

Test 1 (This refers to a parameterized research idea that is to be verified as a Strategy.).

Strategy 1 (This explores the implementation and characteristics of a Test.).

The following are the stages of theoretical formulations.

Definition 1 (Standard conventions and formal nomenclature are introduced.).

Problem 1 (A formalization of the problem statement is provided).

Exercise 1 (An example or working problem to demonstrate concepts discussed).

The following are the stages of theoretical derivations.

Lemma 1 (An important result used as is or for other derivations.).

Corollary 1 (An important aside of the theoretical work.).

Theorem 1 (A central result with derivations).

Result 1 (A central result without proof.).

The following are for declarative statements.

Proposition 1 (An opinion of sorts.).

Fact 1 (A statement of (almost) undeniable truth.).

Chapter 2

Ordinary Calculus

Theorem 2 (Integration By Parts). The integration by parts formula takes the form

$\int u \, dv = uv - \int v \, du$ (1)

Theorem 3 (L'Hopital's Rule). Suppose f, g are differentiable near a with g'(x) ≠ 0 near a (except possibly at a), and lim_{x→a} f(x) = lim_{x→a} g(x) = 0 (or both limits are ±∞). Then lim_{x→a} f(x)/g(x) = lim_{x→a} f'(x)/g'(x), provided the latter limit exists.
Chapter 3

Linear Algebra

Here we discuss concepts in linear algebra. Notably, the literature on this subject is divided into two schools. The first introduces linear algebra as the mathematics and computation of systems of linear equations; the focus is on teaching linear algebra as a tool for manipulation and computation in multi-dimensional spaces, determinants are introduced early on, and emphasis is placed on matrix operations. The second approach treats matrices as abstract objects, with the focus on the structure of linear operators and vector spaces; determinants and matrices are only introduced later. Here we provide both - the first treatment focuses on the algebraic manipulation of matrices on finite-dimensional, Euclidean spaces, while the second focuses on the underlying mathematics of the structure of linear operators and their properties, including the mathematics of infinite-dimensional vector spaces and of vector spaces over complex fields. Some of these treatments and notes on Linear Algebra herein are adapted from the texts of Ma et al. [13], Axler [3] and Roman [17].

3.1 Notations
The notations are mostly standard; all vectors, unless otherwise specified, are column vectors (whenever it matters). Some non-standard notations in the chapter are outlined here. Given a vector/list of numbers denoted v, the object diag(v) is the square matrix whose diagonal consists of the elements of v. Conversely, given a matrix M , the object diag(M ) represents the vector consisting of the diagonal entries of M . Other notations are defined as they are introduced.

3.2 Computational Methods in the Euclidean Space


3.2.1 Linear Systems
Definition 2 (Linear Equation). A linear equation is one in which, for variables {x1 , · · · , xn }, the equation takes the form

$\sum_{i=1}^{n} a_i x_i = b$ (2)

where ai ∈ R, i ∈ [n] and b ∈ R.

Definition 3 (Zero Equation). A zero equation is a linear equation (see Definition 2) where, for all i ∈ [n], ai = 0, and b = 0. That is,

0x1 + 0x2 + · · · + 0xn = 0. (3)

The variables xi , i ∈ [n] in Definition 2 are not known and it is our task to solve for them. The number of variables defines the dimensionality of our problem setting. For instance, the equation ax + by + cz = d specifies a constraint on variables in the three-dimensional space (x, y, z) ∈ R3 ; the linear equation z = 0 specifies the xy-plane inside the xyz-space.

Definition 4 (Solution and Solution Sets to a Linear Equation). A solution to a linear equation (see Definition 2) is a set of numbers {x1 = s1 , x2 = s2 , · · · , xn = sn } that satisfies the linear equation, s.t.

$\sum_{i=1}^{n} a_i s_i = b.$ (4)

The set of all such solutions is called the solution set of the equation. When the solution set is expressed by expressions that represent exactly the solutions in the solution set, this set of expressions is known as the general solution.

For instance, in the xy-space, solutions to the equation x + y = 1 are points taking the form (x, y) = (1−s, s) where s ∈ R. In the xyz-space, the solutions to the same equation are points (x, y, z) = (1−s, s, t) where s, t ∈ R; the solution set forms a plane. The solution set of the zero equation (see Definition 3) is the entire space Rn , where n is the number of variables in the linear equation. The solution set of $\sum_{i=1}^{n} 0 x_i = b$ with b ≠ 0 is ∅.

Definition 5 (Linear System). A finite set of m equations in n variables x1 , · · · xn is called a linear


system and may be represented

ai1 x1 + ai2 x2 + · · · + ain xn = bi , i ∈ [m] (5)

where aij , i ∈ [m], j ∈ [n] ∈ R.

Definition 6 (Zero System). A zero system is a linear system (see Definition 5) where all the constants
aji , bj , i ∈ [n], j ∈ [m] are zero.

Definition 7 (Solution and Solution Sets to a Linear System). A solution to a linear system (see Definition 5) is a set of numbers {x1 = s1 , x2 = s2 , · · · , xn = sn } that satisfies all the linear equations simultaneously, i.e.

$\sum_{i=1}^{n} a_{ji} s_i = b_j, \quad j \in [m].$ (6)

The set of all such solutions is called the solution set of the system. When the solution set is expressed by expressions that represent exactly the solutions in the solution set, this set of expressions is known as the general solution.

Definition 8 (Consistency of Systems). A system of linear equations whose solution set is non-empty (≠ ∅) is said to be consistent. Otherwise it is inconsistent.

Every system of linear equations will either be consistent or inconsistent. Consistent systems have
either a unique solution or infinitely many solutions.

Exercise 2. Show that a linear system Ax = b has either no solution, only one solution or infinitely
many.

Proof. If the linear system is not consistent then it must have no solution. Otherwise, it may have a
unique solution, or more than one solution. Suppose there are two solutions u ≠ v and Au = Av = b.
Then we may write

A(tu + (1 − t)v) = tAu + (1 − t)Av = tb + (1 − t)b = tb + b − tb = b. (7)

This is valid for all t ∈ R, and so we have infinitely many solutions.

For example, a system of two linear equations in two-dimensional space each representing a line has
infinite solutions if they are the same line, no solution if they are parallel but different lines, and exactly
one solution otherwise.

Exercise 3. In the xyz-space, the two equations

a1 x + b1 y + c1 z = d1 , (E1 ) (8)
a2 x + b2 y + c2 z = d2 , (E2 ) (9)

where ∃a1 , b1 , c1 ≠ 0 ∧ ∃a2 , b2 , c2 ≠ 0, represent two planes. The solution to the system is the intersection between the two planes. Reason that there is either no solution (E1 // E2 , parallel and distinct) or an infinite number of solutions ((E1 = E2 ) ∨ (E1 intersects E2 in a line)).

3.2.1.1 Elementary Row Operations (EROs)

Definition 9 (Augmented Matrix Representation of Linear Systems). See that the system of linear equations (Definition 5) given by

$\sum_{i=1}^{n} a_{ji} x_i = b_j, \quad \forall j \in [m]$ (10)

may be represented by the rectangular array of numbers


 
a11 a12 ··· a1n b1
 
 a21 a22 ··· a2n b2 

 ···
 (11)
 ··· ··· ··· 

am1 am2 ··· amn bm

and we call this the augmented matrix of the system. We denote this (A|b). Sometimes, we omit this
representation and just assign a single letter, say M , to represent the entire matrix.

Definition 10 (Elementary Row Operations). When we solve for a linear system, we implicitly or
explicitly perform the following operations; i) multiply equation by some non-zero k ∈ R, (ii) interchange
two equations, (iii) add a multiple of one equation to another. In the augmented matrix (see Definition
9), these operations correspond to multiplying a row by a non-zero constant, swapping two rows and
adding a multiple of one row to another row respectively. These three operations are collectively known
as the elementary row operations. We adopt the following notations

1. kRi ≡ multiply row i by k.

2. Ri ↔ Rj ≡ swap rows i, j.

3. Rj + kRi ≡ add k times of row i to row j.

Definition 11 (Row Equivalent Matrices). Two matrices A, B are said to be row equivalent if one may
be obtained by another from a series of EROs. We denote this by
A ≡_R B. (12)

Theorem 4 (Solution Sets of Row Equivalent Augmented Matrix Represented Linear Systems). Two
linear systems (Definition 5) with augmented matrix representations (A|b), (C|d) have the same solution
set if (A|b) ≡_R (C|d).

Proof. See proof in Exercise 14 using block matrix notations.

3.2.1.2 Row-Echelon Forms

Definition 12 (Leading Entry). The first non-zero number in a row of the matrix is said to be the
leading entry of the row.

Definition 13 (Zero Row). Let the row representing a zero equation (see Definition 3) be called the zero
row.

Definition 14 (Zero Column). Let the column representing all zero coefficients in the representative
linear system for some variable (see Definition 6) be called the zero column. That is, the column has all
zeros.

Definition 15 (Row-Echelon Form (REF)). A matrix is said to be row-echelon if the following properties
hold:

1. Zero rows (Definition 13) are grouped at the bottom of the matrix.

2. If any two successive rows are non-zero rows, then the higher row has a leading entry (Definition
12) occurring at a column that is to the left of the lower row.

For matrix A, we denote its matrix REF as REF (A).

Definition 16 (Pivot Points/Columns). The leading entries (Definition 12) of row-echelon matrices
(Definition 15) are called pivot points. The column of a row-echelon form containing a pivot point is
called a pivot column, and is otherwise a non-pivot column.

Definition 17 (Reduced Row-Echelon Form (RREF)). A reduced row-echelon-form matrix is a row-


echelon-form matrix that has

1. All leading entries of non-zero row equal to one. (Definitions 12 and 13)

2. In each pivot column, all entries other than the pivot point is equal to zero. (Definition 16)

For matrix A, we denote its matrix RREF as RREF (A).

Note that a zero system is an REF (and also an RREF) by the Definitions 15 and 17. We show that
obtaining the REF and RREF gives us an easy way to obtain the solution set to a linear system.

Exercise 4 (Finding solutions to REF, RREF Representations of Linear Systems; Back-Substitution


Method). Find the solution set to the linear systems represented by the following augmented matrices.
(see Definitions 9, 5 and 4)

1.
 
1 0 0 1
 0 1 0 2  (13)
 

0 0 1 3

2.
 
0 2 2 1 −2 2
 0 0 1 1 1 3  (14)
 

0 0 0 0 2 4

3.
 
1 −1 0 3 −2
 0 0 1 2 5  (15)
 

0 0 0 0 0

4.
" #
0 0 0 0
(16)
0 0 0 0

5.
 
3 1 4
 0 2 1  (17)
 

0 0 1

Proof. 1. It is easy to see that x1 = 1, x2 = 2, x3 = 3 is the unique solution to this linear system.

2. Since this represents the linear system

2x2 + 2x3 + x4 − 2x5 = 2, (18)


x3 + x4 + x5 = 3, (19)
2x5 = 4, (20)

we solve. We let the variables of non-pivot columns be arbitrary; that is, x1 ∈ R. The third equation says x5 = 2. Substituting into the second equation, we get

x3 + x4 + 2 = 3, (21)

so x3 = 1 − x4 . Substituting into the first equation,

2x2 + 2(1 − x4 ) + x4 − 2 · 2 = 2, (22)

so x2 = 2 + x4 /2. So there are two free parameters, and letting x1 = s, x4 = t we arrive at the general solution (x1 , x2 , x3 , x4 , x5 ) = (s, 2 + t/2, 1 − t, t, 2), where s, t ∈ R. This technique is known as the back-substitution method.

3. By the same back-substitution method, arrive at the general solution (x1 , x2 , x3 , x4 ) = (−2 + s −
3t, s, 5 − 2t, t) where s, t ∈ R.

4. The solution set is {(r, s, t) | r, s, t ∈ R} = R3 .

5. This system is inconsistent! (Definition 8)

3.2.1.3 Gaussian Elimination Methods
Let A ≡_R R. If R is in (R)REF, R is said to be a (reduced) row-echelon form of A, and A is said to have (R)REF form R.

Theorem 5 (Gaussian Elimination/Row Reduction and Gauss-Jordan Elimination). We outline the


algorithm to reduce a matrix A to its REF B.

1. Locate the leftmost non-zero column (see Definition 14).

2. If the top entry of this column is non-zero, then continue. Else, swap the top row with a row whose leading entry (Definition 12) lies in the column located in the previous step.

3. For each row below the top row, add a suitable multiple of the top row so that every entry below the leading entry of the top row equals zero.

4. From the second row onwards, repeat the algorithm from step 1 applied to the remaining submatrix, until an REF is obtained.

To further get a RREF from REF obtained,

5. Multiply a suitable constant to each row so that all the leading entries become one.

6. From the bottom row onwards, add suitable multiples of each row to the rows above it, so that all entries above the pivot points in the pivot columns (Definition 16) become zero.

Steps 1 − 4 are known as Gaussian Elimination. Obtaining the RREF via Steps 1 − 6 is known as
Gauss-Jordan elimination.
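Steps 1-6 translate directly into a short routine. The following is a minimal sketch (ours, not from the text) in Python/NumPy; the helper name rref, the use of floating-point arithmetic and the tolerance are our own illustrative choices.

import numpy as np

def rref(M, tol=1e-12):
    # Reduced row-echelon form of M via Gauss-Jordan elimination (Theorem 5).
    A = np.array(M, dtype=float)
    rows, cols = A.shape
    pivot_row = 0
    for j in range(cols):                               # step 1: scan columns left to right
        candidates = np.where(np.abs(A[pivot_row:, j]) > tol)[0]
        if candidates.size == 0:                        # no usable entry at or below pivot_row
            continue
        i = pivot_row + candidates[0]
        if i != pivot_row:                              # step 2: ERO  R_pivot <-> R_i
            A[[pivot_row, i]] = A[[i, pivot_row]]
        A[pivot_row] = A[pivot_row] / A[pivot_row, j]   # step 5: scale so the leading entry is 1
        for r in range(rows):                           # steps 3 and 6: clear the pivot column
            if r != pivot_row:
                A[r] = A[r] - A[r, j] * A[pivot_row]    # ERO  R_r - k R_pivot
        pivot_row += 1
        if pivot_row == rows:
            break
    return A

Calling rref on the augmented matrix of Exercise 5 below reproduces the matrix in (29).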

Exercise 5. Obtain the RREF of the following augmented matrix


 
0 0 2 4 2 8
 1 2 4 5 3 −9  (23)
 

−2 −4 −5 −4 3 6

via Gauss-Jordan Elimination (see Theorem 5).

Proof. Recall the notations for EROs (see Definition 10). We perform the following steps:

[  1  2  4  5  3  −9 ]
[  0  0  2  4  2   8 ]     R1 ↔ R2,                    (24)
[ −2 −4 −5 −4  3   6 ]

[  1  2  4  5  3  −9 ]
[  0  0  2  4  2   8 ]     R3 + 2 · R1,                (25)
[  0  0  3  6  9 −12 ]

[  1  2  4  5  3  −9 ]
[  0  0  2  4  2   8 ]     R3 − (3/2) · R2,            (26)
[  0  0  0  0  6 −24 ]

[  1  2  4  5  3  −9 ]
[  0  0  1  2  1   4 ]     (1/2) R2, (1/6) R3,         (27)
[  0  0  0  0  1  −4 ]

[  1  2  4  5  0   3 ]
[  0  0  1  2  0   8 ]     R2 − 1 · R3, R1 − 3 · R3,   (28)
[  0  0  0  0  1  −4 ]

[  1  2  0 −3  0 −29 ]
[  0  0  1  2  0   8 ]     R1 − 4 · R2.                (29)
[  0  0  0  0  1  −4 ]
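As a sanity check on the arithmetic above (a sketch assuming SymPy is installed; this is not part of the original text), one may ask a computer algebra system for the RREF directly, in exact rational arithmetic:

from sympy import Matrix

M = Matrix([[0, 0, 2, 4, 2, 8],
            [1, 2, 4, 5, 3, -9],
            [-2, -4, -5, -4, 3, 6]])
R, pivots = M.rref()
print(R)        # reproduces the final matrix in (29)
print(pivots)   # (0, 2, 4): the first, third and fifth columns are pivot columns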

Result 2 (REF and their Interpretations for Solution Sets). Consider the augmented matrix (A|b) in REF form (see Definition 9). Note that every matrix has a unique RREF but can have many different REFs. If a linear system is not consistent (Definition 8), then the last column of the REF of the augmented matrix is a pivot column. In other words, there will be a row representing an equation $\sum_{i=1}^{n} 0 x_i = c$ with c ≠ 0; there is no solution to such a linear system. A consistent linear system has a unique solution if every column except the last one (b) is a pivot column; such a system has as many variables as there are non-zero rows in the REF. If there exists a non-pivot column in the REF other than the last one (b), then the consistent linear system has infinitely many solutions; the number of variables then exceeds the number of non-zero rows in the REF.

Note that when solving linear systems whose entries involve unknown constants, we need to be careful about performing illegal row operations. For instance, consider the augmented matrix

[ a 1 0 a ]
[ 1 1 1 1 ]     (30)
[ 0 1 a b ]

and in order to make the leading entry of the second row zero, we would perhaps like to perform R2 − (1/a)R1 . However, we do not know that a ≠ 0. In this case, we can consider either first swapping the first and second rows and then progressing, or proceeding by cases.
1 we thank reader Irena for the correction of errata in the Gaussian Elimination workings.

3.2.1.4 Homogeneous Linear Systems

Definition 18 (Homogeneous Linear Systems). A linear system (Definition 5) is homogeneous (HLS) if it has augmented matrix representation (A|b) with b = 0, where the coefficients satisfy aij ∈ R, ∀i ∈ [m], ∀j ∈ [n].

See that the HLS is always satisfied by xi = 0, i ∈ [n] and we call this the trivial (sometimes, zero)
solution. A non-trivial solution is any other solution that is not trivial.

Exercise 6. See that in the xy-plane, the equations

a1 x + b1 y = 0, (31)
a2 x + b2 y = 0 (32)

where a1 , b1 are not both zero and a2 , b2 are not both zero, each represent straight lines through the origin. The
system has only the trivial solution when the two equations are not the same line; otherwise it has
infinitely many solutions. In the xyz-space, a system of two such linear equations passing through the
origin always has infinitely many (non-trivial) solutions in addition to the trivial one, since they are
either the same plane or intersect at a line passing through the origin at (0, 0, 0).

Lemma 2. A HLS (Definition 18) has either only the trivial solution or infinitely many solutions in
addition to the trivial solution. A HLS with more unknowns than equations has infinitely many solutions.

Proof. The first assertion is trivial since the zero solution satisfies it. The second assertion follows
from considering the REF of the augmented matrix representation of a HLS with more unknowns than
equations, then apply Result 2.

Exercise 7. For a HLS Ax = 0 (Definition 18) with non-zero solution, show that the system Ax = b
has either no solution or infinitely many solutions.

Proof. By Exercise 2, the system Ax = b has either no solution, one solution, or infinitely many solutions. Suppose there is some solution u s.t. Au = b. Let v be a non-zero solution of the HLS, i.e. Av = 0, v ≠ 0. Then A(u + v) = Au + Av = b + 0 = b, so u + v is a solution and u + v ≠ u. Since Ax = b then has more than one solution, by Exercise 2 it has infinitely many. It follows that Ax = b has either no solution or infinitely many solutions whenever the HLS has a non-zero solution.
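The situation in Exercise 7 is easy to inspect numerically. As a hedged illustration (the matrices below are our own, chosen only for demonstration), SymPy's nullspace returns a basis for the solution set of Ax = 0; a non-empty basis signals infinitely many solutions:

from sympy import Matrix

A = Matrix([[1, 2, 3],
            [2, 4, 6]])          # 2 equations, 3 unknowns: more unknowns than equations
print(A.nullspace())             # two basis vectors -> infinitely many solutions of Ax = 0

b = Matrix([1, 2])
print(A.row_join(b).rref())      # consistent system, so Ax = b also has infinitely many solutions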

3.2.2 Matrices
We formally defined augmented matrices in Definition 9. In the earlier theorems, we also referred to
generalized matrix representations of numbers. We provide formal definition here.

Definition 19 (Matrix). A matrix is a rectangular array (or array of arrays) of numbers. The numbers are called entries. The size of a matrix is given by the rectangle's sides, and we say a matrix A is m × n if it has m rows and n columns. We denote the entry at the i-th row and j-th column by writing A(i,j) = aij . This is often represented
 
a11 a12 · · · a1n
 
 a21 a22 · · · a2n 
A=
···
, (33)
 ······ ··· 

am1 am2 · · · amn

and for brevity we also denote this A = (aij )m×n , and sometimes we drop the size all together and write
A = (aij ).

Definition 20. For brevity, given a matrix A (Definition 19) we refer to its size by using the notation
nrows(A) and ncols(A) to indicate the number of rows in A and number of columns in A respectively.
That is, A is a matrix size nrows(A) × ncols(A).

Definition 21 (Column, Row Matrices/Vectors). A column matrix (vector) is a matrix with only a
single column. A row matrix (vector) is a matrix with only one row.

Definition 22 (Square Matrix). A square matrix is a matrix (Definition 19) that is square (the number of rows is equal to the number of columns). We say a square matrix An×n is of order n.

Definition 23 (Diagonal Matrix). A square matrix A of order n (Definition 22) is diagonal matrix if
all entries that are not along the diagonal are zero. That is,

aij = 0 when i ≠ j. (34)

Definition 24 (Scalar Matrix). A diagonal matrix (Definition 23) is a scalar matrix if all diagonal entries are the same, that is

aij = 0 if i ≠ j, and aij = c if i = j, (35)

for some constant c ∈ R.

Definition 25 (Identity Matrix). Scalar matrix (Definition 24) is identity matrix if the diagonals are
all one, that is c = 1. We often denote this as 1. If the size needs to be specified, we add subscript 1n
to indicate order n.

Definition 26 (Zero Matrix). Arbitrary matrix m × n is zero matrix if all entries are zero.

Definition 27 (Symmetric Matrix). A square matrix A (Definition 22) is symmetric if aij = aji for all
i, j ∈ [n].

Definition 28 (Triangular Matrix). A square matrix A (Definition 22) is upper triangular if aij = 0
whenever i > j, and is lower triangular if aij = 0 whenever i < j.

3.2.2.1 Operations on Matrices

Definition 29 (Matrix Addition, Subtraction and Scalars). The following are defined for operations on
matrices:

1. Scalar Multiplication: cA = (caij ).

2. Matrix addition: A + B = (aij + bij ).


3. Matrix subtraction: A − B = (aij − bij ). We denote −A = −1 · A.

Definition 30 (Matrix Equality). To show that two matrices A, B are equal, we have to show that their size is the same, and that their entries satisfy aij = bij ∀i, ∀j.

Theorem 6 (Properties of Matrix Operators). Define matrices A, B, C of the same size and let c, d ∈ R.
Then the following properties hold:
2 Note that the matrix subtraction can be defined as the addition of a matrix A with a matrix B that has first been operated on by scalar multiplication with c = −1.

1. Commutativity: A + B = B + A.

2. Associativity: A + (B + C) = (A + B) + C.

3. Linearity: c(A + B) = cA + cB.

4. Linearity: (c + d)A = cA + dA.

5. c(dA) = (cd)A = d(cA).

6. A + 0 = 0 + A = A.

7. A − A = 0.

8. 0A = 0.

Proof. To show equality of matrices, we have to show their size is the same and that their corresponding
entries match (see Definition 30). The proofs for the above theorems are rather trivial, and we show
the associativity law (other proofs are of the same stripe). Proof of associativity: Let A = (aij ), B = (bij ), C = (cij ), then

A + (B + C) = (aij ) + (B + C) (36)
= (aij ) + (bij + cij ) (37)
= (aij + bij ) + (cij ) (38)
= (A + B) + (cij ) (39)
= (A + B) + C. (40)

That is, we rely on the associativity on addition of real numbers to prove the associativity on addition
of matrices. Finally, see that their sizes match.

Definition 31 (Matrix Multiplication). For matrices A = (aij )m×p , B = (bij )p×n , the matrix product AB is defined to be the m × n matrix s.t.

$C = A \times B = (c_{ij})_{m \times n}, \quad c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}.$ (41)

The matrix multiplication AB is only possible when ncols(A) = nrows(B).
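Equation (41) can be coded verbatim. The following sketch (our own, for illustration only) computes the product entrywise and compares it against NumPy's built-in operator:

import numpy as np

def matmul_from_definition(A, B):
    m, p = A.shape
    p2, n = B.shape
    assert p == p2, "ncols(A) must equal nrows(B)"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            # c_ij = sum_k a_ik b_kj
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(p))
    return C

A = np.array([[-1.0, 0.0], [2.0, 3.0]])
B = np.array([[1.0, 2.0], [3.0, 0.0]])
print(np.allclose(matmul_from_definition(A, B), A @ B))   # True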

Exercise 8. Show that matrix multiplication (Definition 31) is not commutative.

Proof. Prove by counterexample. For matrices

A = [ −1 0 ; 2 3 ],  B = [ 1 2 ; 3 0 ], (42)

see that

AB = [ −1 −2 ; 11 4 ] ≠ [ 3 6 ; −3 0 ] = BA. (43)

Since the matrix multiplication is not commutative, when describing in words, we say that AB is the
pre-multiplication of A to B and BA as the post-multiplication of A to B to prevent ambiguity.

Theorem 7 (Properties of Matrix Multiplication). Matrix multiplication (Definition 31) satisfies the
following properties (we assume trivially that the size of the matrices are appropriate such that the matrix
multiplication is legitimate) :

1. Associativity: A(BC) = (AB)C.

2. Distributivity: A(B1 + B2 ) = AB1 + AB2 .

3. c(AB) = (cA)B = A(cB).

4. A0 = 0, and 0A = 0.

5. For identity matrix (Definition 25) of appropriate size ,A1 = 1A = A.

Proof. Proofs of the asserted statements follow directly from the definitions of matrices and matrix multiplication (Definitions 19, 31), by computing the resulting entries componentwise via the laws of algebra on real numbers (additionally, we also have to show that the sizes on the LHS and RHS match).

Definition 32 (Powers of Square Matrices). For square matrix A and natural number n ≥ 0, the power of A is defined as

An = 1 if n = 0, and An = AA · · · A (n factors) if n ≥ 1. (44)

By associativity, Am An = Am+n . By non-commutativity, in general (AB)n ≠ An B n . See Theorem 7 for statements on the properties of matrix multiplication.

Exercise 9. Show that if AB = BA, then (AB)k = Ak B k .

Proof. We prove by induction. The base case is k = 1, where (AB)1 = AB = A1 B 1 trivially. Now assume (AB)j = Aj B j for some j ≥ 1. Then (AB)j+1 = (AB)j AB = Aj B j AB. Since AB = BA, we may move the factor A leftward past each of the j factors of B one swap at a time, so B j A = AB j . Then Aj B j AB = Aj (AB j )B = Aj+1 B j+1 , and by induction we are done.

We may express rows, columns and even submatrices of a matrix by grouping together different entries. Here we show some examples.

Exercise 10 (Expressing Matrices as Block Matrices of Rows and Columns). For matrices A = [ 1 2 3 ; 4 5 6 ] and B = [ 1 1 ; 2 3 ; −1 2 ], we may write

A = [ a1 ; a2 ],  B = [ b1  b2 ], (45)

a1 = [ 1 2 3 ],  a2 = [ 4 5 6 ], (46)

b1 = ( 1, 2, −1 )0 ,  b2 = ( 1, 3, 2 )0 . (47)

See that the following relationships hold by direct computation:

AB = [ Ab1  Ab2 ] = [ a1 B ; a2 B ]. (48)

Exercise 11 (Block Matrix Operations). Let A be m × n matrix, and B1 , B2 be n × p, n × q matrices,


C1 , C2 be r × m matrices, and D1 , D2 be s × m, t × m matrices respectively. See which of the following
block operations are valid:
   
1. A B1 B2 = AB1 AB2 .
   
2. C1 C2 A = C1 A C2 A .
! !
D1 D1 A
3. A= .
D2 D2 A

Proof. Refer to Exercise 10 for operations on matrix blocks written as rows and columns.
   
1. If we write B1 = b1 · · · bp , B2 = c1 · · · cq . Then
   
A B1 B2 = Ab1 ··· Abp Ac1 ··· Acq (49)

and the relation is valid.

2. The matrix sizes do not permit a valid matrix operation.


   
d1 f1
3. If we let D1 = · · · , D2 = · · · , then
   

ds ft
 
d1
 
· · · 
!  
D1 d 
 s
=  . (50)
D2  f1 
 
· · · 
 
ft

Then we have
 
d1 A
 
 ··· 
!  
D1 d A
 s 
A=  (51)
D2 f1 A
 
 ··· 
 
ft A

and the relation is valid.

Recall the augmented matrix representation of linear systems (see Definition 9). We may define an
equivalent form.

Definition 33 (Matrix Representation of Linear System). For system of linear equations

∀j ∈ [m], aj1 x1 + aj2 x2 + · · · ajn xn = bj , (52)

we may represent the linear system by matrix multiplication


    
[ a11 a12 · · · a1n ] [ x1 ]   [ b1 ]
[ a21 a22 · · · a2n ] [ x2 ] = [ b2 ]      (53)
[ · · ·  · · ·  · · ] [ ·· ]   [ ·· ]
[ am1 am2 · · · amn ] [ xn ]   [ bm ]

Then we say that A is the coefficient matrix, x is the variable matrix and that b is the constant matrix
for the linear system specified. A solution to the linear system is a n × 1 column matrix
 
u1
 
 u2 
u= 
· · · (54)
 
un
 
where Au = b. If we treat A = ( c1 c2 · · · cn ) where cj represents the j-th column of A, then we may write

$c_1 x_1 + c_2 x_2 + \cdots + c_n x_n = \sum_{j=1}^{n} c_j x_j = b.$ (55)

That is, the constant matrix is a linear combination of the columns of the coefficient matrix, where the
weights are determined via the variable matrix.
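For a concrete instance of Definition 33 (a sketch with an invertible coefficient matrix of our own choosing), NumPy solves Ax = b directly, and the solution weights indeed recombine the columns of A into b:

import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([3.0, 5.0, 9.0])

x = np.linalg.solve(A, b)                     # unique solution since A is invertible
print(np.allclose(A @ x, b))                  # True
print(np.allclose(sum(x[j] * A[:, j] for j in range(3)), b))   # b as a combination of the columns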

Definition 34 (Matrix Transpose). For matrix A = (aij )m×n , the matrix transpose of A is written
A0 = (a0ij )n×m where the entry a0ij = aji .

See that the rows of A are the columns of A0 and vice versa. See that a square matrix A is symmetric
(Definition 27) iff A = A0 .

Theorem 8 (Properties of the Matrix Transpose). The matrix transpose follows the following properties

1. (A0 )0 = A.

2. (A + B)0 = A0 + B 0 .

3. For c ∈ R, (cA)0 = cA0 .

4. (AB)0 = B 0 A0 .

Proof. The proof of the first three parts is fairly straightforward by direct computation from the definitions and the algebraic properties of real numbers. We show the last assertion. Denote the size of matrix A to be m × n and that of B to be n × p so that the matrix multiplications (Definition 31) are defined. Then AB has size m × p, so its transpose has size p × m; B 0 has size p × n and A0 has size n × m, so B 0 A0 has size p × m. We show they are componentwise equivalent. Since $(AB)_{ij} = \sum_{l=1}^{n} a_{il} b_{lj}$, we have $(AB)'_{ij} = (AB)_{ji} = \sum_{l=1}^{n} a_{jl} b_{li}$. On the other hand, $A'_{ij} = a_{ji}$ and $B'_{ij} = b_{ji}$, so $(B'A')_{ij} = \sum_{l=1}^{n} b'_{il} a'_{lj} = \sum_{l=1}^{n} b_{li} a_{jl}$. The corresponding entries are therefore the same.
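A quick numerical spot-check of property 4 (illustrative only; the matrices are random and of our own choosing):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
print(np.allclose((A @ B).T, B.T @ A.T))   # True: (AB)' = B'A'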

3.2.2.2 Invertibility of Matrices

Definition 35 (Invertibility of Square Matrix). Let A be square matrix of order n (Definition 22), then
we say that A is invertible if ∃ a square matrix B of order n s.t. AB = 1n = BA. The matrix B is said to be the inverse of A. We denote this A−1 . There is no ambiguity; we shall see that the inverse of a matrix is unique (Theorem 9).

Definition 36 (Singularity of Square Matrix). A matrix that does not have an inverse (Definition 35)
is said to be singular.
!
1 0
Exercise 12. Show that the matrix A = is singular.
1 0
!
a b
Proof. Suppose not. Then let the inverse be B = . Then by Definition 35, we have
c d
! ! ! !
1 0 a b 1 0 a+b 0
BA = 1 = = = . (56)
0 1 c d 1 0 c+d 0

Then 1 = 0. Contradiction.

Theorem 9 (Uniqueness of Inverses). If B, C are inverses of square matrix A, then B = C.

Proof. Write

AB = 1 =⇒ CAB = C1 = C =⇒ (CA)B = C =⇒ 1B = C =⇒ B = C, (57)

where we used CA = 1, which holds since C is an inverse of A.

Exercise 13 (Conditions for Invertibility of Square Matrix Order Two). In the case of a square matrix A of order two, denote

A = [ a b ; c d ]. (58)

State the conditions for invertibility and find the matrix inverse.

Proof. Define B = (1/(ad − bc)) [ d −b ; −c a ], which is defined only if ad − bc ≠ 0. We may compute directly that AB = BA = 1 (we show how to explicitly compute matrix inverses such as B later on).

Theorem 10 (Properties of Matrix Inverse). Let A, B be two invertible matrices (Definition 35), and let c ≠ 0, c ∈ R. Then the following properties hold

1. cA is invertible, in particular (cA)−1 = (1/c) A−1 .

2. A0 is invertible, and (A0 )−1 = (A−1 )0 .

3. A−1 is invertible and (A−1 )−1 = A.

4. AB is invertible and (AB)−1 = B −1 A−1 .

Proof. -

1. We can write

(cA) ((1/c) A−1 ) = (c · (1/c)) AA−1 = 1, (59)
((1/c) A−1 ) (cA) = ((1/c) · c) A−1 A = 1, (60)

and the result immediately follows.

2. We show this by verifying that (A−1 )0 is the inverse of A0 , which confirms the assertion that A0 is
invertible. In particular, by properties of matrix transpose (Theorem 8), write

A0 (A−1 )0 = (A−1 A)0 = 10 = 1, (61)
(A−1 )0 A0 = (AA−1 )0 = 10 = 1. (62)

Then A0 is invertible, and the inverse is (A−1 )0 .

3. See that A−1 A = 1, AA−1 = 1 and by definition of inverse (Definition 35), the result follows.

4. Since A, B invertible, write

(AB)(B −1 A−1 ) = ABB −1 A−1 = A1A−1 = AA−1 = 1. (63)

Also

(B −1 A−1 )(AB) = 1 (64)

by similar reasoning.

Definition 37 (Negative Powers of a Square Matrix). For an invertible matrix A, we may define negative
powers for a square matrix given n ∈ Z+ as the matrix power (Definition 32) of the inverse. That is,

A−n = (A−1 )n . (65)

See that if An is invertible, then (An )−1 = A−n for any n ∈ Z.

3.2.2.3 Elementary Matrices

One may notice that the elementary row operations (see Definition 10) may be considered as the pre-
multiplication of some matrix to the matrix being operated on. For instance, see that
   
1 0 2 3 1 0 2 3
 2R
A = 2 −1 3 6 →2 B = 4 −2 6 12 , (66)
  

1 4 4 0 1 4 4 0

and see that


 
1 0 0
0 2 0 A = B. (67)
 

0 0 1
| {z }
E1

In particular, the ERO kRi (Definition 10) may be performed by the pre-multiplication of the matrix Ek , where Ek is a diagonal matrix (Definition 23) of order nrows(A), with all entries along the diagonal equal to one except for the i-th diagonal entry, which is k. If k ≠ 0, then since performing kRi followed by (1/k)Ri gives us back the same matrix, Ek is invertible and Ek−1 is the diagonal matrix with all ones along the diagonal except for a (1/k) entry in the i-th row.
Next, observe the ERO Ri ↔ Rj (see Definition 10) on the following instance:
   
1 0 2 3 1 0 2 3
 R2 ↔R3
A = 2 −1 3 6 → B = 1 4 4 0 , (68)
  

1 4 4 0 2 −1 3 6

and see that


 
1 00
0 0 1 A = B. (69)
 

0 1 0
| {z }
E2

In particular, the ERO Ri ↔ Rj (Definition 10) may be performed by the pre-multiplication of matrix
Es , where Es is a matrix that began with an identity matrix (Definition 25) of order nrows(A) and has
gone through precisely the row swap Ri ↔ Rj . See that swapping rows i and j and then swapping again
rows i and j gives us back the original matrix. Then Es = Es−1 .
Last but not least, observe the ERO Ri + kRj (see Definition 10) on the following instance:
   
1 0 2 3 1 0 2 3
 R3 +2R1
A = 2 −1 3 6 → B = 1 4 4 0 , (70)
  

1 4 4 0 3 4 8 6

and see that


 
1 00
0 1 0 A = B. (71)
 

2 0 1
| {z }
E3

In particular, the ERO Ri + kRj (Definition 10) may be performed by the pre-multiplication of the matrix El , where El is a matrix that began as an identity matrix (Definition 25) of order nrows(A) and has gone through precisely the row addition Ri + kRj . As before, the (triangular, Definition 28) matrix El is invertible, and El−1 represents the row operation Ri − kRj .

Definition 38 (Elementary Matrix). A square matrix (Definition 22) that can be obtained from an identity matrix (Definition 25) by a single elementary row operation (Definition 10) is called an elementary matrix.

We saw that all elementary matrices (Definition 38) are invertible, and their inverses are also elemen-
tary matrices. The discussions thus far allow us to arrive at the following result:

Lemma 3. The EROs (Definition 10) performed on arbitrary matrices correspond precisely to the pre-
multiplication of an elementary matrix (Definition 38) obtained from performing the ERO on the identity
matrix (Definition 25).

For a series of EROs O1 , O2 , · · · , Ok (Definition 10) applied in sequence on A, s.t.

$A \xrightarrow{O_1} \cdot \xrightarrow{O_2} \cdots \xrightarrow{O_k} B,$ (72)

and their corresponding elementary matrices E1 , · · · , Ek , see that the relation

Ek Ek−1 · · · E1 A = B (73)

must hold. By the invertibility, we have the relation

A = E1−1 E2−1 · · · Ek−1 B. (74)
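The correspondence between EROs and pre-multiplication by elementary matrices is easy to verify in code. A minimal sketch (our own construction, using the matrix A from the worked examples above):

import numpy as np

A = np.array([[1.0, 0, 2, 3],
              [2, -1, 3, 6],
              [1, 4, 4, 0]])

E_k = np.eye(3); E_k[1, 1] = 2            # 2 R2 : scale row 2 by 2
E_s = np.eye(3)[[0, 2, 1]]                # R2 <-> R3 : identity with rows 2 and 3 swapped
E_l = np.eye(3); E_l[2, 0] = 2            # R3 + 2 R1 : add 2 x row 1 to row 3

print(E_k @ A)            # matches (66)-(67)
print(E_s @ A)            # matches (68)-(69)
print(E_l @ A)            # matches (70)-(71)
print(np.allclose(np.linalg.inv(E_k), np.diag([1, 0.5, 1])))   # inverse ERO is (1/2) R2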

Exercise 14. Prove the solution-set equivalency asserted in Theorem 4.

Proof. We show that if there are two row equivalent (Definition 11) augmented matrices (Definition 9)
(A|b), (C|d), then the linear systems Ax = b, Cx = d share solution set. By Lemma 3, see that ∃E s.t.

(C|d) = E(A|b) = (EA|Eb), (75)

which is valid by Exercise 11. Then if Au = b (that is if u is solution), then

Au = b =⇒ EAu = Eb =⇒ Cu = d. (76)

On the other hand, if Cv = d, then

Cv = d =⇒ EAv = Eb =⇒ E −1 EAv = E −1 Eb =⇒ IAv = Ib =⇒ Av = b. (77)

They share solution set.

Theorem 11 (Invertibility of Square Matrices, 1). If A is square matrix order n, then the following
statements are equivalent:

1. A is invertible.

2. Ax = 0 has only the trivial solution.

3. RREF of A is identity 1 matrix.


4. A can be expressed as a product of elementary matrices, A = E1 E2 · · · Ek .

Proof. It turns out that this theorem also yields an easy way to compute the inverse of an invertible matrix A (see the discussion preceding Theorem 13). We show the chain of implications:

(i) 1 =⇒ 2: if Ax = 0, then

x = 1x = A−1 Ax = A−1 0 = 0, (78)

where the last step follows from Theorem 7.

(ii) 2 =⇒ 3: Ax = 0 has only the trivial solution. Since A is square, nrows(A) = ncols(A); if the RREF of A had a zero row, the system would have more unknowns than non-zero equations and, by Lemma 2, infinitely many solutions. So the RREF of A has no zero rows, and by the definition of the RREF (Definition 17), an n × n RREF with n pivot columns is the identity (Definition 25).

(iii) 3 =⇒ 4: Since RREF of A is 1, by Lemma 3, ∃Ei , i ∈ [k] s.t.

Ek Ek−1 · · · E1 A = 1. (79)

Then A = (Ek · · · E1 )−1 1, and by inverse properties, Theorem 10, we have

A = E1−1 · · · Ek−1 . (80)

(iv) 4 =⇒ 1: Since A is product of invertible elementary matrices, A is invertible by Theorem 10.

Theorem 12 (Cancellation Law). Let A be an invertible matrix (Definition 35) of order m, then the
following properties hold:

1. AB1 = AB2 =⇒ B1 = B2 .

2. C1 A = C2 A =⇒ C1 = C2 .

This does not hold in general for a matrix A that is singular.

Proof. For first the part,

AB1 = AB2 =⇒ AB1 − AB2 = 0 =⇒ A(B1 − B2 ) = 0. (81)

Then since A is invertible, the HLS has only trivial solution by Theorem 11, so B1 − B2 = 0 and it
follows that B1 = B2 . For part 2, write

(C1 − C2 )A = 0 =⇒ (C1 − C2 )AA−1 = 0 =⇒ (C1 − C2 )1 = 0, (82)

and the result follows.

We may use the discussions in Theorem 11 to compute the matrix inverse. For A satisfying Ek · · · E1 A =
1, see that Ek · · · E1 = A−1 by the post multiplication of A−1 to both the RHS and LHS. Recall this is
valid, since we are guaranteed the invertibility of A. Furthermore, this is unique (Theorem 9). Consider
the n × 2n matrix (A|1n ). Then

Ek · · · E1 (A|1) = (Ek · · · E1 A|Ek · · · E1 1) (83)


= (1|A −1
). (84)

That is, to the augmented matrix (A|1), if we perform Gauss-Jordan elimination (see Theorem 5) and
get RREF 1 on the LHS of |, the RHS is A−1 . Otherwise, A is singular and does not have an inverse.
The following theorem shows us that given square matrices A, B - when we are to verify A−1 = B, we
are only required to check one of AB = 1 or BA = 1.

Theorem 13. Let A, B be square matrix order n. If AB = 1, then A, B are both invertible and

A−1 = B, B −1 = A, BA = 1. (85)

Proof. Consider HLS (Definition 18) Bx = 0. If Bu = 0, then

ABu = Iu =⇒ A0 = u =⇒ 0 = u. (86)

Then Bx = 0 only has the trivial solution. By Theorem 11, B is invertible. Since B is invertible:

AB = 1 =⇒ ABB −1 = 1B −1 =⇒ A1 = B −1 =⇒ A = B −1 . (87)

So A is invertible by Theorem 11 and A−1 = (B −1 )−1 = B, BA = BB −1 = 1.
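The (A|1) recipe above is one hedged sketch away from code: row-reduce the augmented block and read off the right half. Assuming SymPy is available (matrix chosen for illustration; not from the text):

from sympy import Matrix, eye

A = Matrix([[2, 1], [5, 3]])
aug = A.row_join(eye(2))          # form (A | 1)
R, _ = aug.rref()                 # Gauss-Jordan elimination
A_inv = R[:, 2:]                  # right half of the reduced block
print(A_inv)                      # Matrix([[3, -1], [-5, 2]])
print(A * A_inv == eye(2))        # True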

Exercise 15. For square matrix A, given A2 − 3A − 61 = 0, show that A is invertible.

Proof. Since we may write

A(A − 3 · 1) = A2 − 3A = 6 · 1, (88)

then A · ((1/6)(A − 3 · 1)) = 1, and it follows that A is invertible from Theorem 13.

Theorem 14 (Singularity of Matrix Products). Let A, B be two square matrices of order n. Then if A is singular, AB, BA are both singular (see Definition 36).

Proof. Suppose not. Then AB is invertible, and let C = (AB)−1 . Then we may write

ABC = 1, (89)

then A is invertible since A−1 = BC by Theorem 13. This is contradiction.

Theorem 15 (Elementary Column Operations). See from Lemma 3 that the pre-multiplication of an
elementary matrix to matrix A is equivalent to doing an ERO on Ap×m matrix. Let Ek , Es , El be
elementary matrices corresponding to kRi , Ri ↔ Rj , Ri + kRj respectively (see Definition 10). Then, the
post multiplication of the matrices Ek , Es , El correspond to

1. Multiplying the i-th column of A by k.

2. Swap columns i, j in A.

3. Add k times the i-th column of A to the j-th column of A,

respectively, and let these be known collectively as elementary column operations (ECOs). They shall be denoted kCi , Ci ↔ Cj , Cj + kCi .

3.2.2.4 Matrix Determinants

It turns out that whether a square matrix is invertible (Definition 35) depends on a quantity of the
matrix known as the determinant. We define this recursively.

Definition 39 (Determinants and Cofactors). For square matrix A of order n, let Mij be the (n−1)×(n−1) square matrix obtained from A by deleting the i-th row and the j-th column. Then the determinant of A is defined as

det(A) = a11 if n = 1, and det(A) = a11 A11 + a12 A12 + · · · + a1n A1n if n > 1, (90)

where Aij = (−1)i+j det(Mij ). The number Aij is known as the (i, j)-cofactor of A. This method of recursively computing matrix determinants is known as cofactor expansion. Often, we adopt the equivalent notation for the determinant of A:

det(A) = | a11 a12 · · · a1n ; a21 a22 · · · a2n ; · · · ; an1 an2 · · · ann |. (91)

Exercise 16 (Cofactor Expansion Examples). Here we show some instances of cofactor expansion. When the matrix is 2 × 2, we have the general form

A = [ a b ; c d ]. (92)

Then see that the determinant by cofactor expansion is

a · (−1)1+1 det(d) + b · (−1)1+2 det(c) = ad − bc. (93)

For larger matrices, we may use these sub-results. For instance, the determinant of B = [ −3 −2 4 ; 4 3 1 ; 0 2 4 ] via cofactor expansion is obtained as

det(B) = (−3) · | 3 1 ; 2 4 | − (−2) · | 4 1 ; 0 4 | + 4 · | 4 3 ; 0 2 | = −3(3 · 4 − 1 · 2) + 2(4 · 4 − 1 · 0) + 4(4 · 2 − 3 · 0) = 34. (94)
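Definition 39 is directly recursive, so a short (and deliberately naive, O(n!)) implementation is possible; we verify it against NumPy on the matrix B above. This is a sketch with helper names of our own choosing:

import numpy as np

def det_cofactor(A):
    # Determinant by cofactor expansion along the first row (Definition 39).
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        M_1j = np.delete(np.delete(A, 0, axis=0), j, axis=1)   # delete row 1 and column j+1
        total += (-1) ** j * A[0, j] * det_cofactor(M_1j)       # (-1)^(1+j) a_1j det(M_1j)
    return total

B = np.array([[-3.0, -2, 4], [4, 3, 1], [0, 2, 4]])
print(det_cofactor(B))          # 34.0
print(np.linalg.det(B))         # ~34.0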

Result 3 (Cofactor Expansion Invariance). For square matrix A order n, det(A) (Definition 39) may
be found via cofactor expansion along any row or any column.

Theorem 16 (Cofactor Expansion of Triangular Matrices). For triangular matrix A, the determinant
A is equal to the product of diagonal entries of A.

Proof. By the definition of triangular matrices (Definition 28), both the upper triangular and the lower triangular matrix have a row that is all zeros except for possibly a single entry (the diagonal entry itself). That is, an upper
triangular takes general form
 
a11 a12 ··· a1n
 
 0 a22 ··· a2n 
A=
· · ·
 (95)
 ··· ··· ···
0 0 ··· ann
Additionally, since matrix is square, cofactor expansion along the last row, last entry has the term
(−1)i+i = 1. By Result 3, see that if we apply recursively the cofactor expansion along the last row,
we obtain just the product of the diagonal entries. A similar reasoning is applied if the matrix is lower
triangular.

See that the determinant of 1 is one by Theorem 16.


Theorem 17 (Determinant of Matrix Transpose). For square matrix A of order n, det(A) = det(A0 ).

Proof. We prove by induction. The base case is for a matrix containing a single scalar value. This is
trivially true, since the transpose of a matrix 1 × 1 is itself. Next, assume det(A) = det(A0 ) for any
square matrix A order k. We show this holds for (k + 1) × (k + 1) matrix. In particular, by cofactor
expansion along the first row of A, obtain
$\det(A) = \sum_{i=1}^{n} (-1)^{1+i} a_{1i} \det(M_{1i}).$ (96)

Next, perform cofactor expansion along the first column of A0 ; then

$\det(A') = \sum_{i=1}^{n} (-1)^{1+i} a_{1i} \det(M'_{1i}).$ (97)

By induction, det(A) = det(A0 ), since det(M1i ) = det(M'1i ) for each i; the corresponding minors are transposes of one another and have order k.

Theorem 18 (Determinant of Repeated Row/Column Matrix). The determinant of a square matrix
with two identical rows is zero. The determinant of a square matrix with two identical columns is zero.
Proof. We prove by induction. The base case is for a matrix A of size 2 × 2. For A = [ a b ; a b ], by Exercise 16 we have det(A) = ab − ab = 0. Assume that for k < n, the determinant of any k × k matrix with a repeated row is zero. Then consider a (k + 1) × (k + 1) matrix with row i identical to row j, i ≠ j. Then by cofactor expansion along some row m that is neither i nor j, we have

det(A) = am1 Am1 + · · · + am,k+1 Am,k+1 (98)

Amr is the cofactor (−1)m+r det(Mmr ), which has identical rows and by inductive assumption has de-
terminant zero. Then det(A) = 0 and we are done. Since det(A) = det(A0 ), a square matrix with two
identical columns has transpose with two identical rows and the result follows.

Theorem 19. Recall the notations for EROs (Definition 10) and correspondence to their elementary
matrices (Lemma 3). Let A be square matrix, and

(i) B be a square matrix obtained by some ERO kRi . Then, det(B) = kdet(A).

(ii) B be a square matrix obtained by some ERO Ri ↔ Rj . Then, det(B) = −det(A).

(iii) B be a square matrix obtained by some ERO Ri + kRj . Then, det(B) = det(A).

(iv) E be some elementary matrix with size nrows(A) × nrows(A). Then det(EA) = det(E)det(A).

It turns out that this is quite useful because the determinants of elementary matrices are fairly easy to
compute. Only the elementary matrix corresponding to the swap operation is a non-triangular matrix
(Definition 28), but even the swap operation has corresponding elementary matrix where each sub-square
matrix has row/column with only a single scalar entry of one and the rest zero.

Proof. We do not prove this theorem but this may be obtained via the rather mechanical cofactor
expansion and definition of matrix determinants (Definition 39).

Theorem 20. Recall the notations for ECOs (Theorem 15) and their correspondence to elementary matrices. Let A be a square matrix, and

(i) B be a square matrix obtained by some ECO kCi . Then, det(B) = kdet(A).

(ii) B be a square matrix obtained by some ECO Ci ↔ Cj . Then, det(B) = −det(A).

(iii) B be a square matrix obtained by some ECO Ci + kCj . Then, det(B) = det(A).

(iv) E be some elementary matrix with size nrows(A) × nrows(A). Then det(AE) = det(E)det(A).

Theorem 21 (Determinants and Invertibility). Square matrix A is invertible iff det(A) ≠ 0.

Proof. For square matrix A we may write B = Ek · · · E1 A, where each Ei is an elementary matrix and B is the RREF of A. By Theorem 19, det(B) = det(A) Πki=1 det(Ei ). By Theorem 11, if A is invertible then B = 1, and det(B) = 1. Then det(A) ≠ 0, since there is no i s.t. det(Ei ) = 0. If A is singular, then B has a zero row (Definition 13). By cofactor expansion (Result 3) along the zero row, det(B) = 0, then det(A) = 0 since, again, there is no i s.t. det(Ei ) = 0.

Theorem 22. For square matrix A, B order n and c ∈ R, the following hold:

1. det(cA) = cn det(A),

2. det(AB) = det(A)det(B),

3. If A is invertible, then det(A−1 ) = 1/det(A).

Proof. -

1. This follows from Theorem 19 and seeing that cA is multiplying each of the n rows by c.

2. If A is singular, then AB is singular by Theorem 14. Then det(AB) = det(A)det(B) = 0. Other-


wise, matrix A may be represented by product of elementary matrices s.t.

det(AB) = det(E1 · · · Ek B) = det(B)Πki=1 det(Ei ) = det(B)det(A). (99)

3. Follows since det(A)det(A−1 ) = det(AA−1 ) = det(1) = 1. The first equality follows from part 2.

Definition 40 (Classical Adjoint). Let A be square matrix order n. Then the (classical) adjoint of A is
n × n matrix
 0
A11 A12 ··· A1n
 
 A21 A22 ··· A2n 
adj(A) = 
···
 , (100)
 ··· ··· ··· 

An1 An2 ··· Ann
where Aij is (i,j) cofactor of A (Definition 39).

Theorem 23 (Inverse by Adjoint). Let A be square matrix, then if A is invertible, we have


A−1 = (1/det(A)) · adj(A). (101)
Proof. Let B = A · adj(A), then

bij = ai1 A01j + ai2 A02j + · · · ain A0nj (102)


= ai1 Aj1 + ai2 Aj2 + · · · ain Ajn . (103)

By the definition of cofactor expansion (see Definition 39 and Result 3), see that

det(A) = bii . (104)

By Equation 103, see that when i 6= j, then bij is the cofactor expansion along the row j of matrix A
where the entries of rows i and j are both ai1 , ai2 , · · · , ain . Then by Theorem 18, bij = 0 if i ≠ j. Then

A · adj(A) = det(A) 1 =⇒ A · ((1/det(A)) adj(A)) = 1. (105)
Then the result follows.
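The adjoint formula offers another (inefficient but instructive) route to the inverse. A sketch in the same spirit as the cofactor helper above (names are ours; not part of the text):

import numpy as np

def adjugate(A):
    # Classical adjoint: transpose of the matrix of cofactors (Definition 40).
    n = A.shape[0]
    C = np.zeros_like(A, dtype=float)
    for i in range(n):
        for j in range(n):
            M_ij = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(M_ij)     # cofactor A_ij
    return C.T

A = np.array([[2.0, 1, 0], [1, 3, 1], [0, 1, 4]])
A_inv = adjugate(A) / np.linalg.det(A)         # Theorem 23
print(np.allclose(A_inv, np.linalg.inv(A)))    # True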

Theorem 24 (Cramer’s Rule). Suppose Ax = b is linear system (Definition 5), where A is square matrix
order n. Then if Ai is the matrix obtained from replacing i-th column of A by b, and if A is invertible,
then the system has unique solution
 
x = (1/det(A)) · ( det(A1 ), det(A2 ), · · · , det(An ) )0 . (106)

Since

Ax = b ⟺ x = A−1 b = (1/det(A)) adj(A) · b, (107)

then

xi = (b1 A1i + b2 A2i + · · · + bn Ani ) / det(A) = det(Ai ) / det(A). (108)

Exercise 17. For Am×n , Bn×p matrices, if Bx = 0 has infinitely many solutions, how many solutions
does ABx = 0 have? What about if Bx = 0 has only the trivial solution?

Proof. Suppose Bx = 0 has infinitely many solutions, and let this solution space be S. See that ∀s ∈ S, ABs = A0 = 0. There are at least as many solutions to ABx = 0 as to Bx = 0, and this is in fact infinitely many. On the other hand, we cannot make general comments about the solutions to ABx = 0 when Bx = 0 has only the trivial solution. For instance, if B = 12 , the cases A = 12 and A = [ 1 0 ; 0 0 ] give rise to a linear system ABx = 0 with only the trivial solution and with infinitely many solutions respectively.

Definition 41 (Trace). For square matrix A of order n, the matrix trace denoted tr(A) is the sum of
entries along the diagonals of A. For A, B square matrix both of order n, Cm×n , Dn×m , we have

1. that

tr(A + B) = tr(A) + tr(B). (109)

2. that tr(cA) = ctr(A).

3. that tr(CD) = tr(DC).

4. that there do not exist A, B s.t. AB − BA = 1.

Proof. The first two properties are easy to prove from the definitions of the trace and of matrix operations. For the third statement, see that

$(CD)_{ii} = \sum_{j=1}^{n} c_{ij} d_{ji},$ (110)

$tr(CD) = \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} d_{ji}$ (111)

$= \sum_{j=1}^{n} \sum_{i=1}^{m} d_{ji} c_{ij}.$ (112)

See that the RHS is precisely tr(DC). Lastly, since tr(AB − BA) = tr(AB) − tr(BA) = tr(AB) −
tr(AB) = 0 by the earlier parts and tr(1n ) = n, it cannot be that AB − BA = 1.
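A quick numerical spot-check of property 3 (random rectangular matrices of our own choosing, illustrative only):

import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((3, 5))
D = rng.standard_normal((5, 3))
print(np.isclose(np.trace(C @ D), np.trace(D @ C)))   # True: tr(CD) = tr(DC)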

Exercise 18 (Orthogonal Matrices). A square matrix is an orthogonal matrix if

AA0 = 1 = A0 A. (113)

Suppose A, B is square matrix order n and orthogonal, then show AB is orthogonal.

Proof. See that (by Theorems 7 and 8)

AB(AB)0 = ABB 0 A0 = A1A0 = AA0 = 1, (114)

and that

(AB)0 AB = B 0 A0 AB = B 0 1B = B 0 B = 1. (115)

Orthogonal matrices are treated in Section 3.2.5.3.

Exercise 19 (Nilpotent Matrices). A square matrix is a nilpotent matrix if ∃k ∈ Z+ s.t. Ak = 0. Let A, B be square matrices of order n with AB = BA, and let A be nilpotent. Show that AB is nilpotent. Show that we require the condition AB = BA.

Proof. Let k be some constant s.t. Ak = 0. Then by Exercise 9 we have

(AB)k = Ak B k = 0B k = 0, (116)

so AB is nilpotent. To see that AB = BA is required, we may take the simple counterexample A = [ 0 1 ; 0 0 ], B = [ 0 0 ; 1 0 ].

Exercise 20. Show that for diagonal matrix A, the power of the diagonal matrix Ak is diagonal matrix
with entry akii , for i ∈ [nrows(A)].

Proof. Obtain this by simply writing out the mathematical induction proof.

Exercise 21. Prove or disprove the following:

1. If A, B are diagonal matrices of the same size, AB = BA.

2. If A is square matrix, and A2 = 0, then A = 0.

3. If A is matrix s.t. AA0 = 0, A = 0.

4. A, B invertible =⇒ A + B invertible.

5. A, B singular =⇒ A + B singular.

Proof. -

1. This statement is true. See that (AB)ij = (BA)ij = aii bii when i = j, and zero otherwise.

2. This statement is false by counterexample A = [ 0 1 ; 0 0 ].
3. This statement is true. For matrix A of size m × n, AA0 is a square matrix of size m × m, and $(AA')_{ii} = \sum_{j=1}^{n} a_{ij} a'_{ji} = \sum_{j=1}^{n} a_{ij}^2$. This implies that if AA0 = 0, then aij = 0 for all values i, j, so A must be the zero matrix.

4. This statement is false by counterexample:

A = 12 , B = −12 . (117)

5. This statement is false by counterexample:

A = [ 1 0 ; 0 0 ], B = [ 0 0 ; 0 1 ]. (118)

Exercise 22. Let A be square matrix. Then

1. Show that if A2 = 0, then 1 − A is invertible. Find the inverse.


2. Show that if A3 = 0, then 1 − A is invertible. Find the inverse.
3. Find the relation at higher order powers.

Proof. -

1. Since

(1 − A)(1 + A) = 1 − A2 = 1, (119)

then 1 − A is invertible with inverse 1 + A.


2. See that

(1 − A)(1 + A + A2 ) = 1 − A3 = 1, (120)

so the inverse of 1 − A is 1 + A + A2 .
3. As in the previous parts, the general form of the matrix inverse of 1 − A where An = 0 is

$\sum_{j=0}^{n-1} A^j .$ (121)

Exercise 23. Suppose A, B is invertible square matrix order n, and that A + B is invertible. Then show
that A−1 + B −1 is invertible and find (A + B)−1 .

Proof. If A + B is invertible, then the matrix (A(A + B)−1 B) must be invertible. Consider the inverse
of this matrix, by Theorem 10 we have

(A(A + B)−1 B)−1 = B −1 (A + B)A−1 = (B −1 A + 1)A−1 = B −1 + A−1 . (122)

We have effectively shown that the inverse of A−1 + B −1 exists and is (A(A + B)−1 B). Then we may
write

A(A + B)−1 B = (A−1 + B −1 )−1 (123)


A−1 A(A + B)−1 BB −1 = A−1 (A−1 + B −1 )−1 B −1 = (A + B)−1 (124)

and we are done.

Exercise 24. Let A, P, D be square matrices s.t.

A = P DP −1 . (125)

Show that Ak = P Dk P −1 for all k ∈ Z+ .

Proof. See that Ak = (P DP −1 )(P DP −1 ) · · · (P DP −1 ), with k factors. All the adjacent P −1 P products are the identity, and we arrive at P Dk P −1 .
Exercise 25. For matrices Am×n , Bn×m with A ≡_R REF (A), where REF (A) has some zero row, show that AB is singular.

Proof. If A ≡_R REF (A) with REF (A) having a zero row, then A = Ek · · · E1 REF (A) for elementary matrices Ei , i ∈ [k], and AB = Ek · · · E1 REF (A)B. It follows that AB ≡_R REF (A)B, and since REF (A) has a zero row, by block matrix multiplication (Exercise 11) REF (A)B also has a zero row, so AB has an REF with a zero row. This can never be row equivalent to 1, and by Theorem 11, AB is singular.

Exercise 26. For matrix Am×n and m > n, see if it is possible for AB to be invertible where B is
matrix size n × m.

Proof. AB will always be singular. The REF of A has at most n non-zero rows, and since m > n, REF
form of A has zero row. Then by the proof in Exercise 25, AB must be singular.

Exercise 27. Let A be some 2 × 2 orthogonal matrix (Exercise 18). Prove that

1. det(A) = ±1,

2. A = [ cos(θ) −sin(θ) ; sin(θ) cos(θ) ] for some θ ∈ R if det(A) = 1,

3. and otherwise A = [ cos(θ) sin(θ) ; sin(θ) −cos(θ) ].

Proof. -

1. det(1) = det(AA0 ) = det(A)det(A0 ) = det(A)2 = 1, so det(A) = ±1.

2. For matrix A = [ a b ; c d ], if A is orthogonal, A−1 = A0 . Then using invertibility by adjoint (Theorem 23), we can write

[ a c ; b d ] = (1/det(A)) [ d −b ; −c a ] = [ d −b ; −c a ]. (126)

So a = d, b = −c, and since det(A) = 1 we have a2 + c2 = ad − bc = 1. Letting a = cos(θ), c = sin(θ), the result follows.

3. Follow part 2 with det(A) = −1, which gives a = −d and b = c, so that A = [ cos(θ) sin(θ) ; sin(θ) −cos(θ) ].
Exercise 28. Let A be invertible square matrix order n. Then

1. Show that adj(A) is invertible.

2. Find det(adj(A)), adj(A)−1 .

3. Show det(A) = 1 =⇒ adj(adj(A)) = A.

Proof. -

1. By Theorem 23, we have

A ((1/det(A)) adj(A)) = 1, (127)

so by Theorem 13 the matrix (1/det(A)) adj(A) is invertible (with inverse A), and hence adj(A) is invertible.

2. By Theorem 22, since

det(1) = det(adj(A)) · det(A) · (1/det(A))n = 1, (128)

then det(adj(A)) = det(A)n−1 and adj(A)−1 = (1/det(A)) A.
h i
1
3. From the general form A det(A) adj(A) = 1, we can write
 
1
adj(A) adj(adj(A)) = 1. (129)
det(adj(A))
Then by part 2, we have
1
adj(adj(A)) = det(adj(A))adj(A)−1 = det(adj(A)) A = det(A)n−1 det(A)−1 A = det(A)n−2 A.
det(A)

If det(A) = 1, then it follows that

adj(adj(A)) = A. (130)

Exercise 29. Prove or disprove the following statements.

1. A, B square matrices of order n satisfies det(A + B) = det(A) + det(B).

2. If A is square matrix, det(A + 1) = det(A0 + 1).

3. A, B square matrices of order n and A = P BP −1 for some invertible P satisfies det(A) = det(B).

4. A, B, C square matrices of order n with det(A) = det(B) satisfy det(A + C) = det(B + C).

Proof. -

1. This is false by counterexample:

A = 12 , B = −12 . (131)

2. This is true, since det(A + 1) = det((A + 1)0 ) = det(A0 + 1).

3. This is true, since

det(A) = det(P BP −1 ) = det(P )det(B)det(P −1 ) = det(B)det(P )det(P −1 ) = det(B) · 1. (132)

4. This is false by counterexample:

A = −12 , B = 12 , C = 12 . (133)

3.2.3 Euclidean Vector Spaces
3.2.3.1 Finite Euclidean Spaces

A vector may be specified by a direction (the arrow) and a length (its magnitude). Two vectors are equal if they share direction and magnitude. If we denote the length of a vector u by ‖u‖, then the length of a scaled vector cu must be |c| ‖u‖. The geometrical interpretations for vectors are somewhat elusive past three-dimensional spaces; however, it should be noted that the theorems constructed in spaces of dimension at most three may be extended to higher finite dimensions, even if they may not be visualized.

Definition 42 (Vector and Coordinates). A n-vector or ordered n-tuple of real numbers takes form

(u1 , u2 , · · · , un ) (134)

where ui ∈ R, i ∈ [n]. The i-th component or coordinate of a vector is the entry ui .

Definition 43 (Vector Terminologies). Two n-vectors u, v are equal if ∀i ∈ [n], ui = vi . The vector
w = u + v is s.t ∀i ∈ [n], wi = ui + vi . Scalar multiple of vector is the operation for some c ∈ R, w = cu
s.t. ∀i ∈ [n], wi = cui . The negative of vector u is the scalar multiple of vector where c = −1. The
subtraction of vector v from u is the addition of vector u to negative of vector v. A zero vector is one in
which ∀i ∈ [n], ui = 0.

See that we may identify vectors as special cases of matrices, that is either the row vector or column
vector (Definition 21).

Theorem 25 (Vector Operations). For n-vector u, v, w, the following hold:

1. u + v = v + u,

2. u + (v + w) = (u + v) + w,

3. u + 0 = u = 0 + u,

4. u + (−u) = 0,

5. c(du) = (cd)u,

6. c(u + v) = cu + cv,

7. (c + d)u = cu + du,

8. 1u = u.

Proof. These properties follow from their definitions. Otherwise, see that vectors are matrices, and use
the same result on matrices (i.e. Theorem 7, Definition 29 and Definition 31).

We give formal definitions for Euclidean spaces.

Definition 44 (Euclidean Space). A Euclidean space is the set of all n-vectors of real numbers. This is
denoted Rn . When n = 1, we usually just write R. For any element u ∈ Rn , u is n-vector.

See that the solution set of a linear system (Definition 5) must be a subset of the Euclidean space.

Exercise 30 (Expressions for Geometric Objects in the Euclidean Space). We show implicit and explicit
expressions for objects in low dimensional spaces.

1. See that a line in R2 may be represented (implicitly) by the set notation

{(x, y)|ax + by = c} , (135)

where a, b, c ∈ R, and it is not the case that both a, b are zero. This may (explicitly) also be written
as
  
{ ( (c − bt)/a , t ) | t ∈ R } if a ≠ 0, or equivalently (136)

{ ( t , (c − at)/b ) | t ∈ R } if b ≠ 0. (137)

2. A plane in R3 may be expressed

{(x, y, z)|ax + by + cz = d} (138)

where a, b, c ∈ R not all zero and d ∈ R. We may also write explicitly as any of the equivalent
forms
  
d − bs − ct
, s, t |s, t ∈ R a 6= 0, (139)
a
  
d − as − ct
s, , t |s, t ∈ R b 6= 0, (140)
b
  
d − as − bt
s, t, |s, t ∈ R c 6= 0. (141)
c

3. A line in R3 may be represented by the explicit set notation

{(a0 + at, b0 + bt, c0 + ct) | t ∈ R} = {(a0 , b0 , c0 ) + t(a, b, c) | t ∈ R} , (142)

where a, b, c, a0 , b0 , c0 ∈ R, and not all a, b, c are zero.

Definition 45 (Set Cardinality). For finite set S, the number of elements in the set (cardinality) is
denoted |S|.

3.2.3.2 Linear Spans

Definition 46 (Linear Combination). Let ui , i ∈ [k] be vectors in Rn . Then for any ci ∈ R, i ∈ [k], the vector

$\sum_{i=1}^{k} c_i u_i$ (143)

is said to be a linear combination of the vectors ui , i ∈ [k].

Definition 47 (ei ). Denote by ei ∈ Rn the vector with 1 in the i-th entry and zero everywhere else. That is,

ei = (0, · · · , 0, 1, 0, · · · , 0), with the 1 in the i-th position. (144)

See that for u ∈ Rn , we can write $u = \sum_{i=1}^{n} u_i e_i$.

Definition 48 (Linear Span). Let S = {ui , i ∈ [k]} be a set of vectors in Rn . Then the set of all linear combinations of ui , i ∈ [k], that is

{ $\sum_{i=1}^{k} c_i u_i$ | ci ∈ R, ∀i ∈ [k] } (145)

is called the linear span of the set S and is denoted span(S) or span{u1 , · · · , uk }.

See that we may express spans in different ways. For instance, a set V = {(2a + b, a, 3b − a) | a, b ∈ R}
can be written as span{(2, 1, −1), (1, 0, 3)}.

Exercise 31. Show that

V = span{(1, 0, 1), (1, 1, 0), (0, 1, 1)} = R3 . (146)

Proof. V = R3 if we may write arbitrary vector (x, y, z) as a linear combination of elements in the
spanning set of V (we formally define this later, but treat this for now to be the three vectors given).
That is, ∃a, b, c s.t.

a(1, 0, 1) + b(1, 1, 0) + c(0, 1, 1) = (x, y, z), (147)

and this corresponds to the augmented matrix system

[ 1 1 0 | x ]                         [ 1 1 0 | x         ]
[ 0 1 1 | y ]   → (GE, Theorem 5)     [ 0 1 1 | y         ]      (148)
[ 1 0 1 | z ]                         [ 0 0 2 | z − x + y ].

This system is consistent regardless of the values of x, y, z. On the other hand, suppose we had performed Gaussian Elimination and obtained a zero row on the LHS (that is, in the coefficient matrix). Then it would be possible for the last column to be a pivot column and for the system to be inconsistent (Result 2).

We may generalize Exercise 31 to a more general question of whether a set of vectors span the entire
Euclidean space Rn .

Corollary 2. For set S = {ui , i ∈ [k]} ⊆ Rn , S spans Rn iff for arbitrary vector v ∈ Rn , the linear system
represented by the augmented matrix (A|v) (Definition 9) is consistent, where A is the coefficient matrix
created from horizontally stacking the column vectors ui , i ∈ [k]. This is immediately made obvious if we
consider the discussion inside the matrix representation for linear systems in Definition 33. By Theorem
2, if REF (A) has no zero row, then the linear system is always consistent. Otherwise, the system is not
always consistent and span(S) ≠ Rn .

Theorem 26 (Cardinality of a Set and Its Spanning Limitations). Let S = {ui , i ∈ [k]} be a set of
vectors in Rn . If k < n, then S cannot span Rn .

Proof. The coefficient matrix obtained from stacking the k columns is of size n × k, so its REF has at
most k < n pivot rows and must contain a zero row; by Corollary 2, S does not span Rn .

Theorem 27 (Zero Vector and Span Closure). Let S = {ui , i ∈ [k]} ⊆ Rn . Then,

1. 0 ∈ span(S).
2. For any vi ∈ span(S) and ci ∈ R, i ∈ [r], Σ_{i=1}^{r} ci vi ∈ span(S).

Proof. -
1. See that 0 = Σ_{i=1}^{k} 0ui ∈ span(S).

2. For each v ∈ span(S), they are linear combination of ui , i ∈ [k]. Then we may express

v1 = a11 u1 + · · · + a1k uk , (149)


v2 = a21 u1 + · · · + a2k uk , (150)
··· (151)
vr = ar1 u1 + · · · + ark uk , (152)

so that for linear combination

c 1 v1 + · · · + c r vr = (c1 a11 + c2 a21 + · · · + cr ar1 )u1 (154)


+(c1 a12 + c2 a22 + · · · + cr ar2 )u2 (155)
+··· (156)
+(c1 a1k + c2 a2k + · · · + cr ark )uk . (157)

See this is in span(S).

Theorem 28 (Spanning Set of a Set Span). For S1 = {ui , i ∈ [k]}, S2 = {vj , j ∈ [m]} ⊆ Rn , span(S1 ) ⊆
span(S2 ) iff for all i ∈ [k], ui is a linear combination of vj , j ∈ [m].

Proof. →: Assume span(S1 ) ⊆ span(S2 ), then since S1 ⊆ span(S1 ) ⊆ span(S2 ), each ui is linear
combination of v’s.
←: Assume ∀i ∈ [k], ui is linear combination of v’s. Then, ui ∈ span(S2 ), ∀i ∈ [k]. By Theorem 27, any
w that is a linear combination of these u’s can be rewritten as a linear combination of the v’s, and is
therefore itself in span(S2 ). Then we are done.

Exercise 32. Discuss how one may check, for sets S1 , S2 , whether span(S1 ) ⊆ span(S2 ).

Proof. Let the vectors in S1 be denoted ui , i ∈ [k] and those in S2 be denoted vj , j ∈ [m]. In order to see if
each ui may be represented as a linear combination of the vj ’s, we may simultaneously solve for multiple
linear systems. These linear systems may be represented by an augmented matrix (V |u1 |u2 · · · |uk ), and
by Gaussian Elimination we are able to check if any of the systems (V |ui ), i ∈ [k] are not consistent. V
here is obtained by horizontally stacking the column vectors vj , j ∈ [m]. This follows from the discussion
made in Definition 33 on the constant matrix as a linear combination of the columns in the coefficient matrix.
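A hedged numpy sketch of this procedure (function and example vectors are illustrative): span(S1) ⊆ span(S2) iff appending each ui to the columns of V leaves the rank unchanged.

import numpy as np

def span_contained(S1, S2):
    """Check whether span(S1) is contained in span(S2) for lists of vectors in R^n."""
    V = np.array(S2, dtype=float).T            # columns are the v_j's
    r = np.linalg.matrix_rank(V)
    # u is a linear combination of the v_j's iff rank((V|u)) == rank(V).
    return all(np.linalg.matrix_rank(np.column_stack([V, np.asarray(u, dtype=float)])) == r
               for u in S1)

print(span_contained([(2, 1, -1), (1, 0, 3)], [(1, 0, 0), (0, 1, 0), (0, 0, 1)]))  # True
print(span_contained([(1, 0, 0)], [(0, 1, 0), (0, 0, 1)]))                         # False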

Theorem 29 (Redundant Vectors). Let S = {ui , i ∈ [k]} ⊆ Rn , and if ∃j ∈ [k] s.t. uj is linear
combination of vectors in S\uj , then span(S) = span(S\uj ).

Proof. The proof follows directly from applying Theorem 28.

Let u, v be two nonzero vectors. Then span{u, v} = {su + tv | s, t ∈ R}. If u is not parallel to v,
then span{u, v} is a plane containing the origin. In R2 , this span is just the entire space. In R3 , the
span can be written

span{u, v} = {su + tv|s, t, ∈ R} = {(x, y, z) | ax + by + cz = 0}, (158)

where (a, b, c) is solution to the system of two linear equations u1 a + u2 b + u3 c = 0, v1 a + v2 b + v3 c = 0
for u = (u1 , u2 , u3 ), v = (v1 , v2 , v3 ).
For a line in R2 , R3 , see that any point on the line may be represented by a point x plus some vector
u that is scaled. That is, the line may be written by some

L = {x + tu | t ∈ R} (159)
= {x + w | w ∈ span(u)}. (160)

On the other hand, for some plane in R3 , and u non-parallel to v, we may represent plane

P = {x + su + tv | s, t ∈ R} (161)
= {x + w | w ∈ span{u, v}}. (162)

A generalization of this statement can be made in Rn . That is,

1. for x, u ∈ Rn , u ≠ 0, the set

L = {x + w | w ∈ span{u}} (163)

is a line in Rn .

2. For x, u, v ∈ Rn with u, v ≠ 0 and u ≠ kv for every k ∈ R, the set

P = {x + w | w ∈ span{u, v}} (164)

is plane in Rn .

3. Take x, u1 , u2 , · · · ur , ∈ Rn the set

Q = {x + w | w ∈ span{u1 , · · · , ur }} (165)

is a k-plane in Rn where k is the dimension of the span{u1 , · · · , ur }. Dimensions of vector spaces


are introduced in Section 3.2.3.6.

3.2.3.3 Subspaces

Definition 49 (Subspace). For V ⊆ Rn , V is subspace of Rn if V = span(S), S = {u1 , · · · , uk } for


some vectors ui∈[k] ∈ Rn . We say that V is the subspace spanned by S. We say that S spans V . We say
that u1 , u2 , · · · , uk span V . We say that S is the spanning set for V .

Definition 50 (Zero Space). From Definition 49 and Theorem 27, see that 0 ∈ Rn spans the subspace
that contains itself, that is span{0} = {0}. This is known as the zero space.

Recall the vectors ei ’s defined as in (Definition 47). For vectors ei , i ∈ [n] ∈ Rn , see that for all
u = (u1 , · · · , un ) ∈ Rn , we may write u = Σ_{i=1}^{n} ui ei , so it follows that Rn = span({e1 , · · · , en }). Trivially,
Rn is subspace of itself. In abstract linear algebra texts, the definition of subspace is relaxed to permit
abstract objects and are usually provided as follows (see that Theorem 27 holds under this definition):

Definition 51 (Subspace). Let V be non-empty subset of Rn . Then V is subspace of Rn iff

∀u, v ∈ V, ∀c, d ∈ R, cu + dv ∈ V. (166)

Theorem 30 (HLS Solution Space). The solution set of a HLS (Definition 18) in n variables is subspace
of Rn . We call this the solution space of the HLS.

Proof. Let the matrix representation of the HLS be Ax = 0. If the HLS only has the trivial solution, then
the solution set is {0}, the zero space. Next, if it has a non-trivial solution, then it has infinitely many
solutions (see Lemma 2). Writing the general solution in terms of its free parameters (as in the example
below), every solution is a linear combination of a fixed collection of vectors in Rn , one per non-pivot
column; the solution set is the span of these vectors and is therefore a subspace of Rn . Equivalently, for
any solutions u, v and c, d ∈ R, A(cu + dv) = cAu + dAv = 0, so the solution set satisfies Definition 51.

If we solve some linear system and arrive at the general solution, it is easy to find the spanning
vectors. For instance, let the general solution be

(x, y, z) = (2s − 3t, s, t) = s(2, 1, 0) + t(−3, 0, 1).    (167)

The solution space is therefore {(2s − 3t, s, t) | s, t ∈ R} = span{(2, 1, 0), (−3, 0, 1)}.

3.2.3.4 Linear Independence

We saw the concept of vector redundancy in a spanning set in Theorem 29. Here, we give formal
treatment to such vectors with the concept of linear independence.
Definition 52 (Linear (In)Dependence). For set S = {ui , i ∈ [k]} ⊆ Rn , consider Σ_{i=1}^{k} ci ui = 0, for
ci ∈ R, i ∈ [k]. This has a HLS representation (Definition 18) where the coefficient matrix U = (u1 · · · uk )
is obtained from stacking the vectors horizontally as columns, and c is the variable matrix with entries
c1 , · · · , ck . Then see that the zero solution always satisfies the system. The set S is said to be linearly
independent and u1 , · · · , uk are said to be linearly independent if the HLS only has the trivial solution.
Otherwise, ∃ai , i ∈ [k], not all zero, with Σ_{i=1}^{k} ai ui = 0; a non-trivial solution exists. Then S is a
linearly dependent set and u1 , · · · , uk are said to be linearly dependent vectors. For brevity, we use the notations

LIN D(S) = LIN D{u1 , u2 , · · · , uk } (168)

to indicate linear independence and

¬LIN D(S) = ¬LIN D{u1 , u2 , · · · , uk } (169)

to indicate linear dependence.

Let S = {u} be a subset of Rn , then S is linearly dependent iff u = 0. For S = {u, v} ⊂ Rn , S
is linearly dependent iff one of u, v is a scalar multiple of the other. If 0 ∈ S for arbitrary S ⊆ Rn , then S
must be linearly dependent.
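Definition 52 reduces checking linear (in)dependence to a rank computation on the column-stacked matrix U: the HLS has only the trivial solution iff rank(U) equals the number of vectors. A small Python sketch with illustrative inputs:

import numpy as np

def is_linearly_independent(vectors):
    """True iff the given vectors in R^n are linearly independent (Definition 52)."""
    U = np.array(vectors, dtype=float).T       # stack the vectors as columns
    return np.linalg.matrix_rank(U) == len(vectors)

print(is_linearly_independent([(1, 0, 1), (1, 1, 0), (0, 1, 1)]))   # True
print(is_linearly_independent([(1, 2), (2, 4)]))                    # False: (2, 4) = 2(1, 2)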

Theorem 31 (No Redundancy of Linearly Independent Set). Let S = {ui , i ∈ [k]} ⊂ Rn where k ≥ 2.
Then, S is linearly dependent iff ∃i ∈ [k] s.t. ui is a linear combination of vectors in S\ui . Equivalent
statement by the iff condition is that S is linearly independent iff no vector in S may be written as linear
combination of the other vectors.

Proof. →: If ¬LIN D(S), then Σ_{i=1}^{k} ai ui = 0 has a non-trivial solution by Definition 52. Without loss of
generality, let ai ≠ 0, then

ui = −(a1 /ai )u1 − (a2 /ai )u2 − · · · − (ai−1 /ai )ui−1 − (ai+1 /ai )ui+1 − · · · − (ak /ai )uk .    (170)

We have shown directly that ui is a l.c. of the other vectors. ←: If ∃i s.t. ui = Σ_{j≠i} aj uj for some real
numbers aj , j ∈ [k], j ≠ i, then let ai = −1, for which we have

a1 u1 + · · · + ai−1 ui−1 + ai ui + ai+1 ui+1 + · · · + ak uk (171)


= ui − ui (172)
= 0. (173)

So we have found a non-trivial solution, and hence by definition, S must be linearly dependent.

Recall Theorem 26 on the minimum size of a spanning set required for Rn . Here we give statements
that allow us to determine the maximum size of a linearly independent set of vectors in Rn .

Theorem 32. Let S = {ui , i ∈ [k]} ⊆ Rn . If k > n, then S is linearly dependent.

Proof. The proof follows immediately by seeing that the HLS representation by stacking columns of u
has non-trivial solutions by Lemma 2. S is linearly dependent by Definition 52 as a result.

Theorem 33 (No Redundancy of Non-Linearly Combinable Element). Let ui , i ∈ [k] be linearly inde-
pendent vectors in Rn . If uk+1 ∈ Rn , and it is not l.c. of ui , i ∈ [k], then {ui , i ∈ [k]} ∪ {uk+1 } is linearly
independent.

Proof. We show that the vector equation


Σ_{i=1}^{k+1} ci ui = 0    (174)

has only the trivial solution. We first show that ck+1 = 0. Suppose not, then we may write

uk+1 = − Σ_{i=1}^{k} (ci /ck+1 ) ui ,    (175)

and this is a contradiction since we assumed uk+1 is not a l.c. of ui , i ∈ [k]. So, ck+1 must be zero. The
equation then reduces to Σ_{i=1}^{k} ci ui = 0, and by the assumed linear independence of ui , i ∈ [k]
(Definition 52), each ci , i ∈ [k] must be zero. Therefore, the HLS represented for ui , i ∈ [k + 1] has only
the trivial solution.

3.2.3.5 Bases

Definition 53 (Vector Spaces and Subspaces of Vector Space). A set V is vector space if either V = Rn
or V is subspace (Definition 49, 51) of Rn for some n ∈ Z+ . For some vector space W , the set S is
subspace of W if S is a vector space contained inside W .

We may be interested in finding the smallest set possible s.t. all vector in some vector space V may
be represented as a linear combination of the elements in the set.

Definition 54 (Basis). Let S = {u1 , u2 , · · · , uk } be subset of a vector space V (Definition 53). Then we
say that S is a basis for V if (i) S is linearly independent (Definition 52) and (ii) S spans V (Definition
48). When V = {0}, the zero space, set ∅ to be the basis.

That is, a basis for a vector space V must contain the smallest possible number of elements that
can span V , since it must have no redundant vectors. Recall from Theorem 28 that for vector space V
spanned by some set S, if all elements in S may be represented by some linear combination of vectors in
S̃, and S̃ is linearly independent, then S̃ must be basis for span(S) = V by definition of basis (Definition
54).

Theorem 34 (Unique Representation of Elements on Basis). If S = {ui , i ∈ [k]} is basis for vector
space V , then ∀v ∈ V , v has unique representation v = Σ_{i=1}^{k} ci ui .

Proof. Suppose ∃ci∈[k] , dj∈[k] s.t. v = Σ_{i=1}^{k} ci ui = Σ_{j=1}^{k} dj uj , then by subtracting the two equations,
get

(c1 − d1 )u1 + (c2 − d2 )u2 + · · · + (ck − dk )uk = 0. (176)

But since S is linearly independent (it is basis), the only solution is the trivial solution, so ∀i ∈ [k], ci =
di .

By Theorem 34, we should be able to specify an arbitrary vector in some vector space w.r.t. the
coefficients of its l.c. on a basis.

Definition 55 (Basis Coordinates). Let S = {ui , i ∈ [k]} be basis for a vector space V and v ∈ V , then
since v may be uniquely expressed by some ci , i ∈ [k] (by Theorem 34) as v = Σ_{i=1}^{k} ci ui , we say that the
coefficients ci are coordinates of v relative to basis S and call the vector (v)S = (c1 , c2 , · · · , ck ) ∈ Rk the
coordinate vector of v relative to basis S.

To find the coordinate vector of some v relative to some basis S, we may simply solve for the linear
system S̃x = v, where S̃ is coefficient matrix obtained by stacking the column vectors of elements of S.
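A minimal numpy sketch of this recipe (the basis below is an arbitrary choice of ours): stack the basis vectors as columns of S̃ and solve S̃x = v to obtain the coordinate vector (v)S .

import numpy as np

# Basis S = {(1, 0, 1), (1, 1, 0), (0, 1, 1)} of R^3, stacked as columns.
S_tilde = np.array([[1., 1., 0.],
                    [0., 1., 1.],
                    [1., 0., 1.]])
v = np.array([2., 3., 1.])

coords = np.linalg.solve(S_tilde, v)           # the coordinate vector (v)_S
print(coords, np.allclose(S_tilde @ coords, v))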
We give formal definition for a collection of vectors that we denoted ei (Definition 47).

Definition 56 (Standard Basis). Let E = {ei , i ∈ [n]} where ei is the vector of all zeros, except for a
single entry of one in the ith-coordinate. Then it is easy to see that E spans Rn , and that LIN D(E). E
is basis for Rn . In particular, we call this the standard basis, and see that

(u)E = (u1 , · · · , un ) = u. (177)

Corollary 3. By Definition 55, for basis S of V , ∀u, v ∈ V , u = v iff (u)S = (v)S . Additionally, by
Definition 55, ∀vi∈[r] ∈ V , ci∈[r] ∈ R, see that

(c1 v1 + c2 v2 + · · · + cr vr )S = c1 (v1 )S + c2 (v2 )S + · · · + cr (vr )S . (178)

Theorem 35 (Linear Dependence Duality). Let S be basis for vector space V (Definition 54, 53), and
|S| = k. Let vi ∈ V, i ∈ [r], then

1. LIN D({vi , i ∈ [r]}) ↔ LIN D({(vi )S , i ∈ [r]}) for vectors (vi )S ∈ Rk .

2. span{vi , i ∈ [r]} = V iff span{(vi )S , i ∈ [r]} = Rk .

Proof. -
1. By Corollary 3, we can write Σ_{i=1}^{r} ci vi = 0 ↔ (Σ_{i=1}^{r} ci vi )S = (0)S ↔ Σ_{i=1}^{r} ci (vi )S = (0)S , where
(0)S ∈ Rk . The first equality has a non-trivial solution iff the last equality has a non-trivial
solution, and we are done.

2. Assume S = {ui , i ∈ [k]}. →: Assume span{vi , i ∈ [r]} = V . Then by closure (Theorem 27) and
basis definitions (Definition 54), we may write

∀a = (a1 , · · · , ak ) ∈ Rk , w := Σ_{i=1}^{k} ai ui ∈ V, and w = Σ_{j=1}^{r} cj vj    (179)

for some constants cj , j ∈ [r]. By basis coordinates (Definition 55) and Corollary 3, we may write

a = (w)S = (c1 v1 + · · · + cr vr )S = c1 (v1 )S + · · · + cr (vr )S . (180)

Then it follows that (vi )S , i ∈ [r] spans Rk . ←: On the other hand, suppose span{(vi )S , i ∈ [r]} =
Rk . See that ∀w ∈ V, (w)S ∈ Rk so ∃ci , i ∈ [r] s.t.
(w)S = Σ_{i=1}^{r} ci (vi )S = (Σ_{i=1}^{r} ci vi )S ,    (181)

and therefore w = Σ_{i=1}^{r} ci vi by Corollary 3. Since we picked arbitrary w, we are done.

3.2.3.6 Dimensions

Theorems 26 and 32 give statements of the number of elements required for a basis for a vector space
that is Rk - here we use the duality given by Theorem 35 to make comments on arbitrary real vector
space V .

Theorem 36 (Vector space has fixed size basis). Let V be vector space with basis S, |S| = k. Then

1. Any subset of V with more than k vectors is always linearly dependent, and

2. Any subset of V with less than k vectors cannot span V .

Proof. -

1. Let T = {vi , i ∈ [r]} ⊂ V , and r > k. Then their coordinate vectors (vi )S are set of r vectors in
Rk , and since r > k, by Theorem 32, (vi )S , i ∈ [r] is linearly dependent, then by duality (Theorem
35) it follows that ¬LIN D(T ).

2. Let Q = {vi , i ∈ [t]}, ⊂ V and t < k, then (vi )S , i ∈ [t] may not span Rk (Theorem 26) and Q
cannot span V by duality (Theorem 35).

Theorem 36 gives us a metric for the ‘size’ of a vector space. We formalize this with dimensions.

Definition 57 (Dimensions, dim). The dimension of a vector space V , denoted dim(V ) is the number
of vectors in any basis for V . Since the zero space has basis ∅ (Definition 54), dim({0}) = 0.

We can see that the dimension of a vector space captures the concept of degrees of freedom. Consider
the subspace W = {(x, y, z)|y = z}. We may write ∀w ∈ W, w := (x, y, y) = x(1, 0, 0) + y(0, 1, 1), s.t.
W = span{(1, 0, 0), (0, 1, 1)}. Additionally, (1, 0, 0), (0, 1, 1) are linearly independent and so they form
basis. dim(W ) = 2.

Exercise 33 (Finding the Nullity and Basis of a HLS Solution Space). By considering the (R)REF of an
HLS (Definition 18), it is easy to see that the dimension of the solution space is the number of non-pivot
columns (Definition 16) in the (R)REF form. To see this, suppose that the RREF representation of some
HLS in variables (v, w, x, y, z) may be written to be
 
    [ 1 1 0 0 1 | 0 ]
    [ 0 0 1 0 1 | 0 ]    (182)
    [ 0 0 0 1 0 | 0 ]
    [ 0 0 0 0 0 | 0 ]

then by back substitution (Exercise 4), see that the linear system may have general solution

(v, w, x, y, z) = (−s − t, s, −t, 0, t) = s(−1, 1, 0, 0, 0) + t(−1, 0, −1, 0, 1)    (183)

for s, t ∈ R. Then see that the dimension of the solution space is 2, and in fact we found the basis for the
solution space {(−1, 1, 0, 0, 0), (−1, 0, −1, 0, 1)}. This solution space is known as the nullspace, and we
have found the basis of the nullspace. The cardinality of this basis is known as the nullity. The nullspace,
basis, and nullity are discussed later in Definition 64, Definition 54 and Definition 65 respectively.
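The back substitution of Exercise 33 can be reproduced symbolically. A sympy sketch on the coefficient matrix of the HLS above (the calls are standard sympy, the matrix is the one from the exercise): nullspace() returns a basis of the solution space, and its length is the nullity.

import sympy as sp

# Coefficient matrix of the HLS in (v, w, x, y, z) from Exercise 33.
A = sp.Matrix([[1, 1, 0, 0, 1],
               [0, 0, 1, 0, 1],
               [0, 0, 0, 1, 0],
               [0, 0, 0, 0, 0]])

basis = A.nullspace()                          # basis of the solution (null) space
print([list(b) for b in basis])                # [[-1, 1, 0, 0, 0], [-1, 0, -1, 0, 1]]
print(len(basis), A.rank())                    # nullity = 2, rank = 3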

Theorem 37. Let V be vector space, dimension k (Definition 57) and S ⊂ V . The statements are
equivalent for:

1. S is basis for V .

2. LIN D(S) ∧ |S| = k.

3. S spans V and |S| = k.

That is, if we know |S| = k, we only need to check if span(S) = V or LIN D(S) to show it is basis for
V.

Proof. The statements for 1 → 2, 1 → 3 follow from Theorem 36. Additionally, to show 2 → 1, assume
S is linearly independent and |S| = k. Suppose it is not basis for V , then take the vector u ∈ V with u ∉
span(S). Then by Theorem 33, S' = S ∪ {u} is a set of k + 1 linearly independent vectors, and Theorem
36 asserts the contradiction. To show 3 → 1, assume S spans V , |S| = k and suppose S is not basis.
Then ∃v ∈ S s.t. v = Σ_{si ∈S\{v}} ci si for some constants ci ∈ R, and S̃ := S\{v} is a set of k − 1 vectors
where span(S̃) = span(S) = V by Theorem 29. Theorem 36 asserts the contradiction.

Theorem 38 (Dimension of a Subspace). Let U be subspace (Definition 49) of vector space V . Then
dim(U ) ≤ dim(V ). In particular, U 6= V =⇒ dim(U ) < dim(V ).

Proof. Let S be basis for U , so S ⊆ U ⊆ V and since it is basis, S is linearly independent subset of V .
By part 1, Theorem 36, since S is linearly independent, it must not have more than k = dim(V ) vectors,
that is dim(U ) = |S| ≤ dim(V ). On the other hand, assume |S| = dim(U ) = dim(V ), then Theorem
37 asserts that the linear independence of S and set cardinality makes V = span(S) = U . So we have
shown that

dim(U ) = dim(V ) =⇒ U = V (184)

Since (dim(U ) ≤ dim(V )) ∧ (dim(U ) ≥ dim(V )) ↔ dim(U ) = dim(V ), we have effectively showed the
contrapositive of the statement, and by logical equivalency we are done.

Theorem 39 (Invertibility of Square Matrices, 2). If A is square matrix order n, then the following
statements are equivalent:

1. A is invertible.

2. Ax = 0 has only the trivial solution.

3. RREF of A is identity 1 matrix.


4. A can be expressed as Πni Ei , where Ei are elementary matrices.

5. det(A) 6= 0.

6. Rows of A form basis for Rn .

7. Columns of A form basis for Rn .

Proof. See proof in Theorem 11 for the iff conditions for statement 1 ↔ 4. 1 ↔ 5 is proved by Theorem
21. 6 ↔ 7 by Theorem 10 - rows of A are columns of A0 and A invertible iff A0 is invertible. We are done
if we show any i ∈ [5] ↔ 7. We show 2 ↔ 7. If Ax = 0 only has trivial solution, then the columns are
linearly independent by the statements given in Definition 52. There are n columns. Then by Theorem
37, {a1 , a2 , · · · an } where ai is i-th column of A is basis of Rn .

3.2.3.7 Transition Matrices

Definition 58 (Row/Column Vector Representation of Basis Coordinates). Recall that for basis S =
{ui , i ∈ [k]} of vector space V and v ∈ V , v has unique coordinate vector representation (Definition 55,
Theorem 34) written

(v)S = (c1 , · · · , ck ) (185)

and we also write this as the column vector [v]S whose entries are c1 , c2 , · · · , ck .    (186)

It is trivial that bases are not unique. For two bases S, T spanning vector space V , we may be
interested in the relation [w]S ∼ [w]T . This relation is captured by the transition matrix. In particular,

let S = {ui , i ∈ [k]}, T = {vi , i ∈ [k]} and some w ∈ V be written w = Σ_{i=1}^{k} ci ui , s.t. [w]S is the column
vector with entries c1 , · · · , ck . Then since each of the ui ’s may be represented by the vectors in T , suppose

∀i ∈ [k], ui = a1i v1 + a2i v2 + · · · + aki vk . (187)

 
That is, each ui , i ∈ [k] has T -basis coordinate representation [ui ]T , the column vector with entries
a1i , a2i , · · · , aki , and see that

w = Σ_{j=1}^{k} (c1 aj1 + c2 aj2 + · · · + ck ajk )vj .    (188)

That is, the j-th coordinate of [w]T is c1 aj1 + c2 aj2 + · · · + ck ajk , so that

[w]T = ( [u1 ]T [u2 ]T · · · [uk ]T ) [w]S .    (189)

Define P = ( [u1 ]T [u2 ]T · · · [uk ]T ), then [w]T = P [w]S for all w ∈ V and we call P the transition
matrix.

Definition 59 (Transition Matrix). Let S = {u1 , · · · , uk } and T be two bases for vector space. Then
P = ([u1 ]T · · · [uk ]T ) is said to be transition matrix from S to T , and [w]T = P [w]S holds for all w ∈ V .

We may find the transition matrix by the Gaussian Elimination (or Gauss Jordan) algorithm discussed
in Theorem 5 and using the interpretations for linear systems as in Definition 33. For two bases S =
{ui , i ∈ [k]}, T = {vi , i ∈ [k]} respectively, we solve for the system with augmented matrix representation
(Definition 9) (T |u1 |u2 · · · |uk ), where T is coefficient matrix obtained from stacking column vectors vi ,
i ∈ [k]. Then the column vectors on the RHS of the RREF augmented matrix are the weights for the
linearly combined columns of T . In fact, the RHS of the augmented matrix from the first | onwards is
precisely the transition matrix P : [w]S → [w]T .
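A sympy sketch of this computation (the two bases of R2 below are our own illustration): row-reduce (T |u1 | · · · |uk ) and read the transition matrix off the right-hand block.

import sympy as sp

# Two bases of R^2: S = {u1, u2}, T = {v1, v2}.
u1, u2 = sp.Matrix([1, 1]), sp.Matrix([1, -1])
v1, v2 = sp.Matrix([1, 0]), sp.Matrix([1, 1])

T = sp.Matrix.hstack(v1, v2)                   # coefficient matrix built from the T basis
rref, _ = sp.Matrix.hstack(T, u1, u2).rref()   # row-reduce (T | u1 | u2)
P = rref[:, 2:]                                # transition matrix from S to T
print(P)                                       # columns are [u1]_T and [u2]_T

w_S = sp.Matrix([3, 2])                        # a coordinate vector relative to S
print(P * w_S)                                 # the same vector's coordinates relative to T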

Theorem 40 (Properties of the Transition Matrix). Let S, T be two bases of vector space V and P be
transition matrix from S → T , then

1. P is invertible and

2. P −1 is the transition matrix from T → S.

Proof. This is easy both to reason about and to prove. Note that for S = {ui , i ∈ [k]}, the
vectors [ui ]S , i ∈ [k] form the standard basis (Definition 56) in Rk . Let Q be the transition matrix from T to S.
Then see that for i ∈ [k], the i-th column of QP is written QP [ui ]S = Q[ui ]T = [ui ]S . Then stacking the
columns [ui ]S , i ∈ [k] gives us QP = 1k , so P is invertible with P −1 = Q, the transition matrix from T → S.
Exercise 34. Discuss if the following is true:

1. If S1 , S2 are finite subsets of Rn , then span(S1 ∩ S2 ) = span(S1 ) ∩ span(S2 ).

2. If S1 , S2 are finite subsets of Rn , then span(S1 ∪ S2 ) = span(S1 ) ∪ span(S2 ).

Proof. -

1. False, consider the sets S1 = {(1, 0), (0, 1)}, S2 = {(1, 0), (0, 2)}.

2. False, consider the sets S1 = {(1, 0)}, S2 = {(0, 1)}.

Exercise 35 (Coset). Let W be subspace of Rn and v ∈ Rn , then

W + v = {u + v | u ∈ W } (190)

is said to be coset of W containing v. Give geometric interpretations for the coset W + v when

1. W = {(x, y)|x + y = 0}, v = (1, 1).

2. W = {(c, c, c)|c ∈ R}, v = (0, 0, 1).

3. W = {(x, y, z)|x + y + z = 0}, v = (2, 0, −1).

Proof. -

1. The line x + y = 2 in R2 .

2. The line {(0, 0, 1) + c(1, 1, 1)|c ∈ R} in R3 .

3. The plane x + y + z = 1 in R3 .

The union of subspaces are rarely a subspace; we define sums of subspaces.

Definition 60 (Subset Addition). Suppose Ui , i ∈ [m] are subsets of V , then we denote the sum of
subsets U1 + · · · + Um to be the set of all possible sums of elements of Ui , i ∈ [m], that is

Σ_{i=1}^{m} Ui = { Σ_{i=1}^{m} ui | ∀i ∈ [m], ui ∈ Ui } .    (191)

Exercise 36. Let V, W be subspaces of Rn , then show that V + W (see Definition 60) is subspace of Rn .

Proof. Let V = span{vi , i ∈ [m]}, W = span{wi , i ∈ [n]}, then

V + W = {v + w | v ∈ V, w ∈ W }    (192)
      = { Σ_{i=1}^{m} ai vi + Σ_{j=1}^{n} bj wj | ai , bj ∈ R, ∀i ∈ [m], ∀j ∈ [n]}    (193)
      = span{v1 , · · · , vm , w1 , · · · , wn },    (194)

so V + W is subspace of Rn .

Exercise 37. Let A be m × n matrix, and VA = {Au|u ∈ Rn }. Show that VA is subspace of Rm . Let A
be square matrix order n. Show WA := {u ∈ Rn |Au = u} is subspace of Rn .
Proof. Let A = (c1 · · · cn ) be the column-stacked representation of A, then ∀u ∈ Rn , see that Au = Σ_{i=1}^{n} ui ci ,
so VA = span{ci , i ∈ [n]} is subspace of Rn . Next, see that

Au = u ↔ (A − 1)u = 0, (195)

and since WA is the solution set of (A − 1)u = 0, WA must be subspace of Rn . In fact, VA is an instance
of a column space and WA is an instance of a nullspace. These are discussed in (Definition 61 and
Definition 64).

Exercise 38. Let V, W be subspaces of Rn . Show that V ∩ W is subspace of Rn . Show V ∪ W is subspace
of Rn iff V ⊆ W or W ⊆ V .

Proof. Both V, W contain zero, so V ∩ W is nonempty. Let u, v be vectors in V ∩ W and a, b ∈ R, then


au + bv is in V by span closure (Theorem 27) and au + bv is also in W by the span closure. So it must
be in V ∩ W and we are done with the first part.
We show the second statement. ←: suppose V ⊆ W , then V ∪ W = W and this is subspace of Rn .
Also, if W ⊆ V , then W ∪ V = V is also subspace of Rn . →: Suppose V ∪ W is subspace, and suppose
that V ⊄ W . Then ∃y ∈ V s.t. y ∉ W , and since V ∪ W is subspace, for arbitrary x ∈ W , see
x + y ∈ V ∪ W . It follows that either x + y ∈ V or x + y ∈ W . Assume it is in W ; since x ∈ W , also −x ∈ W ,
and writing (x + y) − x = y ∈ W gives a contradiction. So x + y ∈ V . Since V is subspace, −y ∈ V ,
x = (x + y) − y ∈ V and therefore W ⊆ V .

Exercise 39. Let ui , i ∈ [k] be vectors ∈ Rn , and P be some square matrix order n.

1. Show that P ui , i ∈ [k] linearly independent implies ui , i ∈ [k] linearly independent.

2. Show that for linearly independent ui , i ∈ [k], the linear independence of P ui , i ∈ [k] is guaranteed
if P is invertible, but may fail when P is not.

Proof. -

1. See that

Σ_{i=1}^{k} ci ui = 0 =⇒ P (Σ_{i=1}^{k} ci ui ) = P 0 =⇒ Σ_{i=1}^{k} ci P ui = 0,    (196)

and by the linear independence of the P ui ’s, each ci must be zero; the ui ’s are therefore linearly independent.

2. See that

Σ_{i=1}^{k} ci P ui = 0 =⇒ P (Σ_{i=1}^{k} ci ui ) = 0 =⇒ Σ_{i=1}^{k} ci ui = P −1 0 = 0.    (197)

Each ci must be zero by linear independence (Definition 52). To see that invertibility cannot be
dropped, consider the counterexample

u1 = (1, 0, 0), u2 = (0, 1, 0), and P the square matrix with rows (1, 1, 0), (1, 1, 0), (0, 0, 0),    (198)

where P u1 = P u2 , so that P u1 , P u2 are linearly dependent although u1 , u2 are not.

Exercise 40. Let V ⊆ Rn , V ≠ ∅. Show V is subspace of Rn iff ∀u, v ∈ V, c, d ∈ R, cu + dv ∈ V . Show
that a largest set of linearly independent vectors in V must span V .

Proof. By Theorem 27, if V is subspace, ∀u, v ∈ V , c, d ∈ R, cu + dv ∈ V . On the other hand, suppose


∀u, v ∈ V , cu + dv ∈ V for c, d ∈ R. We can use this recursively to show that ∀ui , i ∈ [k] ∈ V ,
span{ui , i ∈ [k]} ⊆ V . We can show the last statement by contradiction. Suppose not, then ∃v ∈ V
but it is not l.c. of some largest set S of linearly independent vectors. Then S ∪ {v} is set of linearly
independent vector (Theorem 31) and we obtained a larger linearly independent set, a contradiction.

Exercise 41. Let V be vector space.

1. Suppose S ⊆ V s.t. span(S) = V . Show ∃S 0 ⊆ S s.t S 0 is basis for V .

2. Suppose T ⊆ V s.t. T is linearly independent, then show ∃T 0 basis for V s.t T ⊆ T 0 .

Proof. We show both parts by presenting an algorithm that allows us to obtain precisely the sets specified.

1. Let n := dim(V ). span(S) = V , so |S| ≥ n (Theorem 36). If |S| = n, then we are done by taking S' = S
(Theorem 37). Else, S is linearly dependent (Theorem 36) and ∃v ∈ S that is a l.c. of the remaining
vectors; S' = S\{v} satisfies span(S') = span(S) (Theorem 29). Repeat until |S'| = n.

2. LIN D(T ), so |T | ≤ n by Theorem 36. If |T | = n then we are done by Theorem 37. Otherwise,
∃v ∈ V with v ∉ span(T ). Let T' = T ∪ {v}, where T' is linearly independent by Theorem 33.
Repeat until |T'| = n and we have a basis as specified.

Exercise 42. Let V be vector space with dim(V ) = n, then show ∃ui , i ∈ [n + 1] s.t. ∀v ∈ V , v may be
expressed as l.c of ui ’s with non-negative coefficients.
Proof. Take basis {ui | i ∈ [n]} spanning V , and write un+1 = −u1 − u2 − · · · − un . Then ∀v ∈ V , v = Σ_{i=1}^{n} ai ui
for some ai ∈ R, i ∈ [n]. Define a := min{0, a1 , · · · , an }, s.t.

v = (a1 − a)u1 + (a2 − a)u2 + · · · + (an − a)un + (−a)un+1    (199)
  = Σ_{i=1}^{n} ai ui + a Σ_{i=1}^{n} (−1)ui + (−a)un+1 .    (200)

Each ai − a ≥ 0, −a ≥ 0 and we are done.

Theorem 41 (Subset Addition Bound). Let V, W be subspaces of Rn . Then,

dim(V + W ) = dim(V ) + dim(W ) − dim(V ∩ W ). (201)

Proof. Let {ui , i ∈ [k]} be basis spanning V ∩ W , then by Exercise 41, ∃vi ∈ V, i ∈ [m] s.t.
{u1 , · · · , uk , v1 , · · · , vm } is basis for V . Also, ∃wi ∈ W, i ∈ [n] s.t. {u1 , · · · , uk , w1 , · · · , wn } is basis for W .
Then see that

V + W = span{ui , vj , wl , i ∈ [k], j ∈ [m], l ∈ [n]}. (202)

Consider the vector equation

Σ_{i=1}^{k} ai ui + Σ_{j=1}^{m} bj vj + Σ_{l=1}^{n} cl wl = 0.    (203)

Since Σ_{l=1}^{n} cl wl = −(Σ_{i=1}^{k} ai ui + Σ_{j=1}^{m} bj vj ) ∈ V ∩ W , then ∃di ∈ R, i ∈ [k] s.t. Σ_{l=1}^{n} cl wl = Σ_{i=1}^{k} di ui . Then

c1 w1 + · · · + cn wn − d1 u1 − · · · − dk uk = 0, (204)

and since the LIN D{ui , wl , i ∈ [k], l ∈ [n]} it follows that constants c’s and d’s are all equal to zero.
Substitute c’s = 0 into the vector equation to get

a1 u1 + · · · ak uk + b1 v1 + · · · + bm vm = 0, (205)

by LIN D{ui , vj , i ∈ [k], j ∈ [m]} it follows that constants a’s and b’s are all zero. So the vector equation
has only the trivial solution; the u, v, w’s are all linearly independent. See that we get the relation

dim(V + W ) = k + m + n = (k + m) + (k + n) − k = dim(V ) + dim(W ) − dim(V ∩ W ). (206)

Exercise 43. Determine which of these are true.

1. If S1 , S2 are bases for V, W respectively, where V, W are subspaces of a vector space, then S1 ∩ S2
is basis for V ∩ W .

2. If S1 , S2 are bases for V, W respectively, where V, W are subspaces of a vector space, then S1 ∪ S2
is basis for V + W .

3. If V, W are subspaces of vector space, then ∃ basis S1 for V and basis S2 for W s.t. S1 ∩ S2 is basis
for V ∩ W .

4. If V, W are subspaces of vector space, then ∃ basis S1 for V and basis S2 for W s.t. S1 ∪ S2 is basis
for V + W .

Proof. -

1. False. Consider S1 = {(1, 0), (0, 1)}, S2 = {(1, 0), (0, 2)}.

2. False. Consider span{(1)}, span{(2)} = R.

3. True. These bases are found and reasoned with in the proof for Theorem 41.

4. True. These bases are found and reasoned with in the proof for Theorem 41.

3.2.4 Matrix Vector Spaces


Matrices and vector spaces were defined (Definition 19, Definition 53) and we would like to study the
vector spaces that are associated with a matrix. In particular, we are interested in the row space, column
space and the nullspace of some matrix.

3.2.4.1 Row, Column Spaces

Definition 61 (Row Spaces and Column Spaces). Let A = (aij ) be m × n matrix with columns denoted
c1 , · · · , cn and rows denoted r1 , · · · , rm . Then the row space is the subspace of Rn spanned by
ri , i ∈ [m], that is span{ri , i ∈ [m]} ⊆ Rn . The column space is the subspace of Rm spanned by ci , i ∈ [n],
that is span{ci , i ∈ [n]} ⊆ Rm . For brevity, we denote the row space and column space associated with
matrix A as rowSpace(A), colSpace(A) respectively.

It is easy to see that row space A and column space A0 are identical, and that column space A and
row space A0 are identical by definition of transpose (Definition 34).
We have discussed methods to check if some set of vectors are linearly dependent by considering its
HLS solution (Definition 52). We want to obtain the basis for row spaces and column spaces respectively.
Observe that for matrices A, B with RREF (A) = RREF (B), we have A ≡R B.

Theorem 42 (Row Space Invariance Over EROs). Let A ≡R B, then the row spaces of A, B are identical.
That is, the EROs (Definition 10) preserve the row space of a matrix.

 
Proof. Let r1 , r2 , · · · , rm denote the m rows of matrix A. The proofs can be obtained by performing the
EROs on elements of the set S = {ri , i ∈ [m]} to obtain S̃ and observing that span(S) = span(S̃). For instance,
for the ERO kRi , picking some ri ∈ S and setting S̃ = (S\{ri }) ∪ {kri } preserves the span. We omit the proofs
for the other EROs, but they should not be difficult to obtain or reason with.

Recall column space A is row space A0 , and so column space A has basis formed by taking the non-
zero rows in REF (A0 ). We may employ other methods, however. Note that EROs do not preserve the
column space of a matrix; consider the simple example of A ≡R B where A has rows (0, 0), (1, 0) and B has
rows (1, 0), (0, 0), and observe they do not share column space.
Theorem 43. Let A ≡R B, then prove that
1. Set of columns in A is linearly independent iff set of corresponding columns in B is linearly inde-
pendent.

2. Set of columns in A form basis for colSpace(A) iff set of corresponding columns in B is basis for
colSpace(B).
Proof. -
1. Let A = (a1 · · · an ) be the column stacked representation, with A and B both m × n matrices. Assume
A ≡R B s.t.

B = Ek · · · E1 A.    (207)

Define P = Ek · · · E1 , then B = P A = (P a1 · · · P an ), and by Theorem 39, P is invertible. By parts
1 and 2 of Exercise 39, any subset of the columns aj is linearly independent iff the corresponding P aj ’s
are linearly independent.

2. →: Suppose some subset of columns S1 is a basis for colSpace(A). The first part asserts the cor-
responding columns (say S2 , where S2 = {P s|s ∈ S1 }) in B are linearly independent. Clearly,
span(S2 ) ⊆ colSpace(B). So we just need to show that colSpace(B) ⊆ span(S2 ). Take u ∈
colSpace(B); for some real ci , i ∈ [n], we have u = Σ_{i=1}^{n} ci P ai . Since span(S1 ) = colSpace(A), then
a1 , · · · , an ∈ span(S1 ) and so P a1 , · · · , P an ∈ span(S2 ), since the elements of S2 were obtained by
applying P to each element of S1 . We are done. The ← direction follows by symmetry, since A = P −1 B
and P −1 is also invertible.

Theorem 44 (Linear Independence at Pivot Rows and Columns of REF Form). The nonzero
rows and pivot columns of a REF matrix are linearly independent.
Proof. That is, we want to prove that pivot rows and pivot columns are linearly independent. First,
we show that the pivot rows (nonzero rows) of REF (A) for some matrix Am×n are linearly independent.
Consider REF (A) with its j non-zero rows denoted r1 , r2 , · · · , rj , followed by any zero rows.    (208)
Each non-zero row has a leading entry (Definition 12). Denote the access operator
[·], s.t. r[l] refers to the l-th coordinate of vector r. Then by Definition of REFs (Definition 15), see that
each ri has leading entry to the left of rj , when j > i. For each non-zero row, denote the leading entry
for row ri to be at coordinate li , then see that ri [li ] ≠ 0, lj > li when j > i, and rj [li ] = 0 when
i < j. Consider the equation Σ_{i=1}^{j} ci ri = 0. Suppose c1 ≠ 0, then see that

(Σ_{i=1}^{j} ci ri )[l1 ] = Σ_{i=1}^{j} ci (ri [l1 ]) = c1 r1 [l1 ] + c2 r2 [l1 ] + · · · + cj rj [l1 ] = c1 r1 [l1 ] + 0 ≠ 0[l1 ].    (209)

Therefore, c1 must be zero. Suppose c2 ≠ 0, then

(Σ_{i=1}^{j} ci ri )[l2 ] = Σ_{i=1}^{j} ci (ri [l2 ]) = c1 r1 [l2 ] + c2 r2 [l2 ] + · · · + cj rj [l2 ] = 0 + c2 r2 [l2 ] + 0 ≠ 0[l2 ].    (210)

Repeating this, all ci , i ∈ [j] must be equal to zero and we obtain linear independence by Definition 52.
It is easier to prove the linear independence of pivot columns. By Theorem 43, since REF (A) ≡R
RREF (A), if we show linear independence of the RREF (A) pivot columns, we are done. By definition of
Gauss-Jordan elimination (Theorem 5), each pivot column at the RREF (A) is only non-zero at leading
entry and the set of pivot columns have non-zero entry at different coordinates. It is trivial to see that
no pivot column can be represented by a linear combination of the other pivot columns, so by Theorem
31, the set of pivot columns are linearly independent.

Theorem 45 (Basis for Row Space in the Row-Echelon Form). Let A be some matrix, then the non-zero
rows in REF (A) is basis for row space A.

Proof. This follows directly from row space invariance over EROs (Theorem 42) and linear independence
of the pivot rows (Theorem 44).
By Theorem 44 and Theorem 43, since matrix A ≡R REF (A), a basis for the column space of A may be
obtained by taking the columns of A corresponding to the pivot columns (Definition 16) in REF (A). If
we would like to find the basis containing the original vectors in the set provided, see that we would first
stack the column vectors horizontally, use Gaussian Elimination (Theorem 5) and pick the relevant
columns from the original matrix. If we row-stacked and took non-zero vectors from the REF, we would
obtain a basis, but they might be linear transformations of the original vectors. If we are asked to
extend some linearly independent S = {si , i ∈ [k]} to a basis for some Rn , n > k, we may take the REF of
the matrix whose rows are s1 , s2 , · · · , sk and add elements of the standard basis ei ∈ Rn (Definition 56),
for i corresponding to the non-pivot columns. Choices other than ei exist; we just need to ensure that the
leading entry of each new vector corresponds to the i-th coordinate.
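A sympy sketch of both recipes (the matrices are chosen by us for illustration): the pivot positions returned by rref() identify which original columns form a basis for the column space, and the non-pivot positions of a row-stacked REF tell us which standard basis vectors to append when extending a linearly independent set.

import sympy as sp

A = sp.Matrix([[1, 2, 0],
               [0, 0, 1],
               [1, 2, 1]])
_, pivots = A.rref()
col_basis = [A[:, j] for j in pivots]          # columns of A at the pivot positions
print(pivots, [list(c) for c in col_basis])    # (0, 2) and the corresponding columns of A

# Extend the linearly independent set S = {(1, 0, 2)} to a basis of R^3.
S = sp.Matrix([[1, 0, 2]])
_, piv = S.rref()
extension = [sp.eye(3)[:, j] for j in range(3) if j not in piv]
print([list(e) for e in extension])            # append e_2 and e_3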

Theorem 46 (Representations of the Column Space). For m × n matrix A,

colSpace(A) = {Au | u ∈ Rn }. (211)


Proof. Let A = (c1 c2 · · · cn ), where ci is column i of A, then ∀u ∈ Rn , see that Au = Σ_{i=1}^{n} ui ci ∈
span{c1 , c2 , · · · , cn }, so {Au|u ∈ Rn } ⊆ colSpace(A). On the other hand, suppose some b ∈ colSpace(A),
then ∃ui ∈ R, i ∈ [n] s.t. b = Σ_{i=1}^{n} ui ci = Au. Then colSpace(A) ⊆ {Au | u ∈ Rn }. It follows that
colSpace(A) = {Au|u ∈ Rn }.

Theorem 47 (Constant Matrix is Member of the Column Space). A system of linear equations Ax = b
is consistent iff b ∈ colSpace(A).

Proof. The proof immediately follows from Theorem 46; a system of linear equations Ax = b must be
consistent iff ∃u ∈ Rn s.t. Au = b.

3.2.4.2 Ranks

Theorem 48 (Dimension Equality in Row and Column Spaces).

dim(rowSpace(A)) = dim(colSpace(A)). (212)

Proof. This follows immediately from Theorem 44 - the dim(rowSpace(A)) = number of non-zero rows
in REF of arbitrary matrix = number of pivot columns = dim(colSpace(A)).

Definition 62 (Matrix Rank). The rank of a matrix is the dimension of its row space (or column space,
Theorem 48). We denote the rank of some matrix A by rank(A).

Definition 63 (Full Rank). It is trivial to see that for m × n matrix A, rank(A) ≤ min{m, n}. If
rank(A) = min{m, n}, we say that matrix A is full rank.

A square matrix A is full rank iff it is invertible.

Theorem 49 (Rank of Matrix Transpose). Since row space A is column space A0 , rank(A) = rank(A0 ).

Corollary 4 (Linear System Consistency and Rank of Augmented Matrix). A linear system (Definition
5) Ax = b is consistent iff rank(A) = rank((A|b)). That is, when the column of b is not a pivot column in the REF of (A|b).

Theorem 50 (Rank Bound of Matrix Product). Let A, B be m × n, n × p matrices respectively. Then,

rank(AB) ≤ min{rank(A), rank(B)}. (213)

Proof. Let A = (a1 a2 · · · an ), B = (b1 b2 · · · bp ) be their columnwise representations. Then by block


multiplication (see Exercise 11), we may write

AB = (Ab1 Ab2 · · · Abp ) (214)

as the columnwise representation for their matrix product. By Theorem 46, see that ∀i ∈ [p], Abi ∈
colSpace(A) and since colSpace(AB) = span{Abi , i ∈ [p]} and each Abi may be written as a linear
combination of the columns in A, then by Theorem 28, colSpace(AB) ⊆ colSpace(A). It follows by
Theorem 38 that

rank(AB) ≤ rank(A) (215)

Equation 215 asserts that rank(B 0 A0 ) ≤ rank(B 0 ). We may write rank(AB) = rank((AB)0 ) =
rank(B 0 A0 ) ≤ rank(B 0 ) = rank(B). We have proven the equivalent statement (rank(AB) ≤ rank(A)) ∧
(rank(AB) ≤ rank(B)).

3.2.4.3 Nullspaces

Definition 64 (Nullspace). Let A be m × n matrix, then the solution space (Definition 30) of linear
system Ax = 0 is the nullspace of A. We refer to this subspace as nullSpace(A).

Definition 65 (Nullity). Define the nullity of m × n matrix A as nullity(A) := dim(nullSpace(A)).


See that since the solution vector ∈ Rn , then nullity(A) ≤ n.

We have already seen how to find the basis and nullity of the solution space in Exercise 33, where
the nullity was two.

Theorem 51 (Rank-Nullity Theorem / Dimension Theorem for Matrices). Let A be matrix size m × n.
Then rank(A) + nullity(A) = n.

Proof. Consider the REF form for (A|0). Reason that rank(A) corresponds to the number of pivot
columns in REF (A) and nullity(A) corresponds to the number of non-pivot columns in REF (A). See
Definition 16 and Exercise 33 for intuition.
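A quick symbolic check of the rank-nullity relation (the matrix below is arbitrary and ours):

import sympy as sp

A = sp.Matrix([[1, 2, 3, 4],
               [2, 4, 6, 8],
               [0, 1, 0, 1]])
rank, nullity = A.rank(), len(A.nullspace())
print(rank, nullity, rank + nullity == A.cols)   # 2, 2, True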

Theorem 52 (Representations of the Solution Space). See Theorem 46 for representations of the column
space of some matrix A. Suppose Ax = b has solution v. Then the solution set of the system may be
written

M = {u + v | u ∈ nullSpace(A)}. (216)

Proof. Suppose v is a solution s.t. Av = b, and let w be any solution s.t. Aw = b. For u := w − v,
we may write

Au = A(w − v) = Aw − Av = b − b = 0,    (217)

so u ∈ nullSpace(A). It follows that u + v = w ∈ M and the solution space ⊆ M . On the other hand,
∀w ∈ M , ∃u ∈ nullSpace(A) s.t. w = u + v by assumption. See that

Aw = A(u + v) = Au + Av = 0 + b = b, (218)

so w is solution. M ⊆ the solution space and we are done.

Theorem 52 asserts that a consistent system Ax = b has a unique solution iff nullSpace(A) is the zero
space, that is, when (A|0) only has the trivial solution.

Exercise 44. Prove that the row space of a matrix is orthogonal to its nullspace.
 
Proof. If x ∈ nullSpace(A), then Ax = 0, and therefore ai · x = 0 for every row ai of A; clearly the list of rows
spanning the row space of A is orthogonal to x, and hence x ∈ rowSpace(A)⊥ . On the other hand, if
x ∈ rowSpace(A)⊥ , then the dot product of any row with x gives zero, and we get Ax = 0. So
nullSpace(A) = rowSpace(A)⊥ , and the row space is orthogonal to the nullspace.

Exercise 45. Let V = span{(1, 1, 0, 0), (−1, 0, 1, 0)} and W = span{(−1, 2, 3, 0), (2, −1, 2, −1)}, and find a basis
for V + W .

Proof. We may find the basis for V + W by stacking v1 , v2 , w1 , w2 into rows of a matrix and taking the
non-zero rows of its REF form.

Exercise 46. Let A be square matrix order 3 and describe geometrically the solution set of the HLS
Ax = 0 when rank(A) is zero, one, two and three respectively.

Proof. By rank-nullity theorem, see that nullity(A) = 3 − rank(A). So when rank(A) = 0 the
3
nullspace(A) = R , when rank(A) = 1 the nullSpace(A) is plane through origin, and when rank(A) = 2
the nullSpace(A) is line through origin. Finally, when rank(A) = 3 then nullSpace(A) is the zero
space.

Theorem 53. Let B be m × n matrix. If ∃n × m matrix C s.t. BC = 1, then we say that C is the right
inverse of B. Show that m × n matrix B has right inverse iff rank(B) = m.

Proof. By definition, rank(B) = dim(colSpace(B)) ≤ m (see Definition 63). Let {ei , i ∈ [m]} be
standard basis for Rm , then B has right inverse iff B(u1 · · · um ) = (e1 · · · em ) for some ui , i ∈ [m] ∈
Rn iff systems Bx = ei , i ∈ [m] are consistent for all i ∈ [m] iff ei , i ∈ [m] ∈ colSpace(B) iff m ≤
dim(colSpace(B)) ≤ m iff rank(B) = m.

Exercise 47. Suppose A, B are two matrices and AB = 0, then show that colSpace(B) ⊆ nullSpace(A).

Proof. Define B = (b1 · · · bn ) to be the column stacked representation for B, and see that

AB = 0 =⇒ (Ab1 · · · Abn ) = 0 =⇒ ∀j ∈ [n], Abj = 0. (219)

The result follows.

Exercise 48. Prove that no matrix has row space and nullspace that contain the same nonzero vector.

Proof. We show that the only vector that is both in row space and column space must be the zero
a1
vector. Let A = · · · be row-stacked representation for A, and let u ∈ nullSpace(A). Then see that
 

an
Pn
Au = 0 =⇒ ai u = 0 for all i. Suppose u ∈ rowSpace(A), then u = i ci ai for some constants
ci , i ∈ [n]. Then
n
X n
X
u0 u = ci ai u = 0 = u2i = 0 ↔ ∀i ∈ [n] ui = 0. (220)
i i

Theorem 54. Let A, P be m × n matrix and m × m matrix respectively. If P is invertible, then show
that rank(P A) = rank(A). If rank(P A) = rank(A), does this imply P is invertible?

Proof. Since P is invertible, by Theorem 39, P is a product of elementary matrices, say P = Ek · · · E1 . Then
P A = Ek · · · E1 A, so P A ≡R A, and by Theorem 42, they share row space. Then

rank(P A) = dim(rowSpace(P A)) = dim(rowSpace(A)) = rank(A). (221)


The converse does not hold; consider P = A, the 2 × 2 matrix with rows (1, 0) and (0, 0).

Theorem 55. Prove rank(A + B) ≤ rank(A) + rank(B).

Proof. Let A, B have column stacked representations (a1 · · · an ), (b1 · · · bn ) respectively, and let S1 ⊆
{ai , i ∈ [n]} be a basis for colSpace(A) and S2 ⊆ {bi , i ∈ [n]} be a basis for colSpace(B). Then
colSpace(A + B) = span{ai + bi |i ∈ [n]} ⊆ span(S1 ∪ S2 ), so rank(A + B) = dim(colSpace(A + B)) ≤
dim(span(S1 ∪ S2 )) ≤ |S1 | + |S2 | = rank(A) + rank(B).

Exercise 49. Let A be m × n matrix, then show that

Ax = b consistent for all b ∈ Rm =⇒ A0 y = 0 has only trivial solution. (222)

Proof. By rank nullity (Theorem 51), we have

nullity(A0 ) = m − rank(A0 ) = m − rank(A) = 0, (223)

since rank(A) = m by Theorem 47.

Exercise 50. Let A be m × n matrix.

1. Show that nullSpace(A) = nullSpace(A0 A).

2. Show that nullity(A) = nullity(A0 A) and that rank(A) = rank(A0 A).

Determine if the following are true:

3. nullity(A) = nullity(AA0 ).

4. rank(A) = rank(AA0 ).

Proof. -

1. For u ∈ nullSpace(A), see that A0 (Au) = A0 0 = 0 and so u ∈ nullSpace(A0 A). Then nullSpace(A) ⊆
nullSpace(A0 A). On the other hand, for v ∈ nullSpace(A0 A), let Av = (b1 , b2 , · · · , bm ), then

(Av)0 (Av) = v 0 A0 Av = v 0 0 = 0 =⇒ Σ_{i=1}^{m} b_i^2 = 0 =⇒ ∀i ∈ [m], bi = 0.    (224)

Then Av = 0, so v ∈ nullSpace(A) and nullSpace(A0 A) ⊆ nullSpace(A).

2. First part asserts that nullity(A) = nullity(A0 A). Rank nullity (Theorem 51) asserts that

rank(A) = n − nullity(A) = n − nullity(A0 A) = rank(A0 A). (225)


3. False, by counterexample A, the 2 × 3 matrix with rows (1, 0, 0) and (0, 1, 0), where AA0 = 12 .

4. True, since rank(A) = rank(A0 ) = rank((A0 )0 A0 ) = rank(AA0 ). We used Theorem 49.

Exercise 51 (Conditions for and finding Left and Right Inverses). Show that a matrix has left inverse
iff it has full column rank. Show that a matrix has right inverse iff it has full row rank. Show a left and
right inverse respectively.

Proof. Theorem 53 showed the condition for right inverses. Here we prove both statements again
and in fact find instances of the left and right inverse. Suppose A is m × n matrix. If A has full column
rank, then A0 A is invertible, since rank(A0 A) = rank(A) = n (Exercise 50). Then we can write

(A0 A)−1 A0 A = 1,    (226)

and see that (A0 A)−1 A0 is a left inverse. Conversely, if BA = 1n for some B, then n = rank(1n ) =
rank(BA) ≤ rank(A) ≤ n (Theorem 50), so A has full column rank. For the right inverse, if A has full row
rank, then AA0 is invertible, since rank(AA0 ) = rank(A) = m (Exercise 50). Then we can write

AA0 (AA0 )−1 = 1,    (227)

and see that A0 (AA0 )−1 is a right inverse. Conversely, if AC = 1m for some C, then m = rank(1m ) =
rank(AC) ≤ rank(A) ≤ m, so A has full row rank.
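A numpy sketch of the explicit inverses constructed above (the example matrix is ours): for full column rank, (A0 A)−1 A0 is a left inverse, and for full row rank, A0 (AA0 )−1 is a right inverse.

import numpy as np

A = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])                        # full column rank (rank 2)
left_inv = np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(left_inv @ A, np.eye(2)))     # True: a left inverse of A

B = A.T                                         # full row rank (rank 2)
right_inv = B.T @ np.linalg.inv(B @ B.T)
print(np.allclose(B @ right_inv, np.eye(2)))    # True: a right inverse of B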

Exercise 52. Determine which of these are true:


1. If A ≡R B, then rowSpace(A0 ) = rowSpace(B 0 ).

2. If A ≡R B, then colSpace(A0 ) = colSpace(B 0 ).

3. If A ≡R B, then nullSpace(A0 ) = nullSpace(B 0 ).

4. If A, B are matrices of same size, then rank(A + B) = rank(A) + rank(B).

5. If A, B are matrices of same size, then nullity(A + B) = nullity(A) + nullity(B).

6. If A is n × m matrix and B is m × n matrix, then rank(AB) = rank(BA).

7. If A is n × m matrix and B is m × n matrix, then nullity(AB) = nullity(BA).

Proof. -
1. False, by counterexample A with rows (1, 0), (0, 0) and B with rows (0, 0), (1, 0).

2. True, since rowSpace(A) = rowSpace(B) by invariance (Theorem 42) and rowSpace(A) = colSpace(A0 ).

3. False, by counterexample A with rows (1, 0), (0, 0) and B with rows (0, 0), (1, 0).

4. False, by counterexample A = B = 12 . See Theorem 55 for the bound relation.

5. False, by counterexample A = B = 02×2 .

6. False, by counterexample A with rows (0, 1), (0, 0) and B with rows (0, 0), (0, 1), where AB has rows
(0, 1), (0, 0) and BA = 02×2 .

7. False by counterexample using the same matrices A, B defined in part 6.

3.2.5 Orthogonality
Definition 66 (⊥). For two objects a, b, a ⊥ b means a is orthogonal to b.

Definition 67 (Vector p-norm, `p ). Define the p-norm of a vector, for real p ≥ 1, also called the `p norm
of a vector, as

kxkp := ( Σ_{i=1}^{n} |xi |^p )^{1/p} .    (228)

When written without the subscript p, we take p = 2, the Euclidean norm (Definition 68). The vector norm
is also said to be the length of a vector.

Definition 68 (Euclidean Norm). The Euclidean norm for x is the vector norm kxk2 = ( Σ_{i} x_i^2 )^{1/2} .

Definition 69 (Vector Distance). For two vectors u, v ∈ Rn defined, we say that their distance is ku−vk2
and denote this d(u, v).

Consider a triangle with sides of length a, b, c respectively. Let the angle between the edges of
side lengths a, b be θ. The cosine rule states that c2 = a2 + b2 − 2ab cos θ. Now consider their vector
analogues, a → u, b → v, c → (u − v), s.t.

ku − vk2 = kuk2 + kvk2 − 2kukkvk cos θ,    (229)

so

θ = arccos( (kuk2 + kvk2 − ku − vk2 ) / (2kukkvk) ).    (230)

If the triangle was inscribed in a two dimensional surface with coordinates u = (u1 , u2 ), v = (v1 , v2 ),
then
d(u, v) = ( (u1 − v1 )^2 + (u2 − v2 )^2 )^{1/2} ,    (231)

and

θ = arccos( (u1^2 + u2^2 + v1^2 + v2^2 − (u1 − v1 )^2 − (u2 − v2 )^2 ) / (2kukkvk) )    (232)
  = arccos( (u1 v1 + u2 v2 ) / (kukkvk) ).    (233)

Definition 70 (Dot/Inner Product). For two vectors u, v ∈ Rn , the dot product of u, v is denoted

u · v := Σ_{i=1}^{n} ui vi .    (234)

Definition 71 (Unit Vectors). A unit vector is vector v for which kvk = 1.

Definition 72 (Angle). The angle between two vectors u, v ∈ Rn is defined as

arccos( (u · v) / (kukkvk) ).    (235)

We denote the angle between two vectors u, v by ∠(u, v).

From Definition 70, see that we may express the angle in the two dimensional problem (Equation
233) as arccos( (u · v) / (kukkvk) ), which is consistent with the generalized statement in Definition 72.
Note that for column vectors u, v, the dot product u · v = u0 v.

Theorem 56 (Properties of the Dot Product). For vectors u, v, w ∈ Rn , c ∈ R, the following hold:

1. u · v = v · u.

2. (u + v) · w = u · w + v · w, w · (u + v) = w · u + w · v,

3. (cu) · v = u · (cv) = c(u · v),

4. kcuk = |c|kuk and

5. u · u ≥ 0 and u · u = 0 ↔ u = 0.

Proof. The proofs for these statements should follow directly from their Definitions.

It is easy to see that u · u = kuk2 for arbitrary u ∈ Rn .

3.2.5.1 Orthogonal Basis

Definition 73 (Orthogonality). If two vectors u, v ∈ Rn satisfy u · v = 0, we say the two vectors are
orthogonal. Additionally, for a set S ⊆ Rn , if ∀si , sj ∈ S with si ≠ sj , si and sj are orthogonal vectors, then we
say that S is an orthogonal set. In addition, if all the vectors in an orthogonal set S are unit vectors (Definition
71), then we say S is an orthonormal set.

Given two vectors u, v ∈ Rn , if u is orthogonal to v, then their angle (Definition 72) is given by
arccos(0) = π/2,    (236)
which in R2 , R3 is the concept of perpendicularity.
Definition 74 (Normalization of Vectors and Sets). For arbitrary nonzero vector vi , see that ṽi = (1/kvi k) vi has
norm one, since

kṽi k = k(1/kvi k) vi k = (1/kvi k) kvi k = 1.    (237)

This is called normalizing a vector. See that for vi , vj that are orthogonal, normalization preserves orthog-
onality, since

ṽi · ṽj = ((1/kvi k) vi ) · ((1/kvj k) vj ) = (1/(kvi kkvj k)) (vi · vj ) = 0.    (238)

The process of converting an orthogonal set to orthonormal set by dividing each element by its norm is
called normalizing a set.

See that standard basis ei ’s (Definition 56) is orthonormal.

Theorem 57 (Orthogonal Set S is LIN D). If S is orthogonal set (Definition 73) of nonzero vectors in
vector space, then LIN D(S).

Proof. Let S = {ui , i ∈ [k]}. Consider the HLS Σ_{i=1}^{k} ci ui = 0. Since S is orthogonal, we may write
(Σ_{i=1}^{k} ci ui ) · ui = ci (ui · ui ). Then since

∀i ∈ [k], ci (ui · ui ) = (Σ_{i=1}^{k} ci ui ) · ui = 0 · ui = 0,    (239)

but ui ≠ 0, ci must be zero and the HLS must have only the trivial solution. S is linearly independent
by Definition 52.

Corollary 5. By equivalent statements for basis (Theorem 37) and linear independence of orthogonal
sets, to see if some set S in vector space dim k is orthogonal basis, we only need to check orthogonality
of S and |S| = k.

Theorem 58. If S = {ui , i ∈ [k]} is an orthogonal basis for vector space V , then ∀w ∈ V , we may express

w = ((w · u1 )/(u1 · u1 )) u1 + · · · + ((w · uk )/(uk · uk )) uk .    (240)

That is, (w)S = ( (w · u1 )/(u1 · u1 ), · · · , (w · uk )/(uk · uk ) ). In particular, if S is an orthonormal basis, then
(w)S = (w · u1 , · · · , w · uk ).
Proof. Let w = Σ_{i=1}^{k} ci ui , then for i ∈ [k], see that

w · ui = (Σ_{j=1}^{k} cj uj ) · ui = ci (ui · ui ),    (241)

and therefore ci = (w · ui )/(ui · ui ). The last assertion follows from observing ui · ui = kui k2 = 1 for all i ∈ [k] under
orthonormality.

Definition 75 (Orthogonality to Vector Space). Let V be subspace of Rn , then we say that u ∈ Rn is


orthogonal to V if ∀v ∈ V, u is orthogonal to v.

Definition 76 (Normal Vector). Let V be some subspace of Rn . If n ∈ Rn is s.t. ∀u ∈ V , n · u = 0, then n is
orthogonal to V (Definition 75) and we call n a normal vector of V .

For instance, if V is a plane in R3 s.t.

V = {(x, y, z) ∈ R3 | ax + by + cz = 0}, (242)

where normal vector n = (a, b, c), then V = {u ∈ R3 |n · u = 0}.


Given a vector space V spanned by S = {ui , i ∈ [k]}, to find all vectors orthogonal to V , we shall
solve for the linear systems v · ui = 0, i ∈ [k]for arbitrary  v ∈ V . That is, solve for the HLS where
 vector
u1 u1 v
   
 u2  u2 v 
the vectors in S are row stacked, which is   v = 
    = 0 (see Exercise 11). Formally:
· · ·  ···

uk uk v

Theorem 59. For V = span{ui , i ∈ [k]} subspace of Rn , vector v ∈ Rn is orthogonal to V iff v · ui = 0


for all i ∈ [k].

Definition 77 (Orthogonal Projection). Let V be subspace Rn , then every vector u ∈ Rn may be written
uniquely as form

u = p + n, (243)

where n is orthogonal to V and p ∈ V . We call p the projection of u onto V .

Theorem 60 (Projections with Basis). Let V be subspace of Rn , w be vector in Rn , then if S = {ui , i ∈
[k]} is orthogonal basis for V , we have
p := Σ_{i=1}^{k} ((w · ui )/(ui · ui )) ui ,    (244)

where p is projection of w onto V (Definition 77). Additionally, if S is orthonormal basis, then

p = Σ_{i=1}^{k} (w · ui )ui .    (245)

Proof. Define p := Σ_{i=1}^{k} ((w · ui )/(ui · ui )) ui and n := w − p, then ∀i ∈ [k], see that

n · ui = w · ui − p · ui    (246)
       = w · ui − Σ_{j=1}^{k} ((w · uj )/(uj · uj )) (uj · ui )    (247)
       = w · ui − ((w · ui )/(ui · ui )) (ui · ui )    (248)
       = 0.    (249)

Since p ∈ span(S) = V and n = w − p is orthogonal to every ui and hence to V (Theorem 59), p is indeed
the projection of w onto V . The last assertion follows from ui · ui = 1 for all i ∈ [k] under orthonormality.

See that Theorem 58 is consistent with Theorem 60 by w → p, n → 0.

Theorem 61 (Gram-Schmidt Process). Let S = {ui , i ∈ [k]} be basis for vector space V , let
∀i ∈ [k], vi := ui − Σ_{j=1}^{i−1} ((ui · vj )/(vj · vj )) vj .    (250)

Then {vi , i ∈ [k]} is orthogonal basis for V . Divide each element vi by kvi k in orthogonal basis to get
orthonormal basis (see Definition 74).

Proof. First, see from the algorithm that each of the vi ’s are linear combinations of the ui ’s, which span
V with dimension k. By span closure, each of the vi ’s ∈ V . Additionally, there are a total of k such vi ’s, so
by Corollary 5, we only need to show orthogonality of the vi ’s. When we have {v1 }, this set is vacuously
an orthogonal set. Suppose the set {vt , t ∈ [i − 1]} is orthogonal. Then consider, for any l < i,

vi · vl = (ui − Σ_{j=1}^{i−1} ((ui · vj )/(vj · vj )) vj ) · vl    (251)
       = ui · vl − ((ui · vl )/(vl · vl )) (vl · vl )    (252)
       = 0,    (253)

where the cross terms vanish since vj · vl = 0 for j ≠ l, j < i. So vi is orthogonal to every element of
{vt , t ∈ [i − 1]}, which by the inductive assumption is orthogonal. That is, the addition of vi keeps
orthogonality invariant. Then by induction {vi , i ∈ [k]} are orthogonal and we are done.

Exercise 53 (Gram-Schmidt Process Run). Apply Gram-Schmidt (Theorem 61) to transform {u1 , u2 , u3 }
for R3 into orthogonal basis, where u1 = (1, −1, 2), u2 = (2, 1, 0), u3 = (0, 0, 1).

Proof. Work through these iteratively:

v1 = u1 ,    (254)
v2 = u2 − ((u2 · v1 )/(v1 · v1 )) v1 ,    (255)
v3 = u3 − ((u3 · v1 )/(v1 · v1 )) v1 − ((u3 · v2 )/(v2 · v2 )) v2    (256)

to obtain the orthogonal vectors.
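A minimal Python sketch of the Gram-Schmidt iteration of Theorem 61, run on the vectors of Exercise 53 (function and variable names are ours):

import numpy as np

def gram_schmidt(vectors):
    """Return an orthogonal basis for span(vectors), following Theorem 61."""
    ortho = []
    for u in vectors:
        u = np.asarray(u, dtype=float)
        v = u.copy()
        for w in ortho:
            v -= ((u @ w) / (w @ w)) * w        # subtract the projection of u onto each earlier w
        ortho.append(v)
    return ortho

vs = gram_schmidt([(1, -1, 2), (2, 1, 0), (0, 0, 1)])
print(vs)
print([round(float(vs[i] @ vs[j]), 12) for i in range(3) for j in range(i + 1, 3)])  # all ~0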

3.2.5.2 Best Approximations

Theorem 62 (Best Approximation Theorem). Let V be subspace of Rn . If u ∈ Rn and p is projection


of u onto V (Definition 77), then

∀v ∈ V, d(u, p) ≤ d(u, v). (257)

p is the best approximation for vector u that is in vector space V .

Proof. For arbitrary v ∈ V , let

n := u − p, w := p − v, x := u − v. (258)

Then see x = n + w and n · w = 0. So

kxk2 = x · x = (n + w) · (n + w) = n · n + w · w = knk2 + kwk2 . (259)

Therefore kxk2 ≥ knk2 and

d(u, p) = ku − pk = knk ≤ kxk = ku − vk = d(u, v). (260)

To find the shortest distance of some vector u to a vector space V , find the projection p of u onto V
and compute d(u, p).

Exercise 54 (Least-Squares Method). Suppose the random variables for r, s, t are related

t = cr + ds + e, (261)

for constants c, d, e. Suppose we have observations for (r, s, t)i , i ∈ [6], and we would like to estimate the
(beta) coefficients for c, d, e so we have a better understanding of the relationships between the random
variables. Defining A as the 6 × 3 matrix with rows (ri , si , 1), i ∈ [6], x the column of unknowns (c, d, e),
and b the column of observations (t1 , · · · , t6 ), we would like to solve for
Ax = b. However, it turns out that due to the presence of random errors, Ax = b is almost always
ˆ ê) for (c, d, e). The least squares
inconsistent. Instead, we would like to find the best fit estimates (ĉ, d,
method minimizes the sum of squared errors proposed by the model; it solves for the x that minimizes
kb − Axk2 . This statement is equivalent to the form in Equation 3027 in our discussion on multiple
least-squares method. For m × n matrix A, the least-squares solution is the vector u ∈ Rn that satisfies

∀v ∈ Rn , kb − Auk ≤ kb − Avk. (262)

See from Theorem 46 that we may express colSpace(A) = {Av|v ∈ Rn }. It turns out that for the least squares
solution u, Au is the best approximation of b in colSpace(A), i.e. the projection of b onto colSpace(A).

Theorem 63. Let Ax = b be linear system for m × n matrix, and p be projection of b onto colSpace(A).
Then

∀v ∈ Rn , kb − pk ≤ kb − Avk. (263)

u is least-square solution to Ax = b iff Au = p.

Proof. By Best Approximation Theorem 62, see that

∀w ∈ colSpace(A), kb − pk = d(b, p) ≤ d(b, w) = kb − wk, (264)

and since colSpace(A) = {Av|v ∈ Rn }, the result follows.

In Equation 3032, we obtained the least-squares solution via matrix calculus. Here we derive the
same solution using the linear algebraic theorems.

Theorem 64 (Obtaining the Least Squares Solution). Let Ax = b be linear system. Then u is least
squares solution iff u solves A0 Ax = A0 b.
 
Proof. Let A = a1 a2 · · · an where ai is column i. Let V = colSpace(A), then u is least squares
solution to Ax = b iff Au is projection of b onto V iff (b − Au) is orthogonal to V iff (b − Au) is orthogonal
to vectors in the span of V , that is {ai , i ∈ [n]}. This is linear system

∀i ∈ [n], ai · (b − Au) = 0 (265)

which is A0 (b − Au) = 0 by block matrix representations and therefore A0 Au = A0 b. We used the best
approximation Theorem 62, definitions for vector space orthogonality (Definition 75), column space
representations (Theorem 46) and block matrix operations (Exercise 11).

That is, we do not need to explicitly solve for projection p - we may instead solve for the linear
system (A0 A)x = A0 b. In the form for Equation 3032, we have assumed that the matrix A is full rank,
st. A0 A is invertible (verify this) and a unique least squares solution exists. Here, we have shown a more
generalized problem without assuming a unique least squares solution. If the linear system A0 Ax = A0 b
has infinitely many solutions, pick some vector u from the solution space and compute Au := p as the
projection of b onto V .
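A rough numerical illustration of Theorem 64 (synthetic data; the variable names are ours, not the text's): solve the normal equations A'Ax = A'b directly and compare against numpy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
r, s = rng.normal(size=6), rng.normal(size=6)
t = 1.5 * r - 0.7 * s + 2.0 + 0.1 * rng.normal(size=6)   # t = cr + ds + e + noise

A = np.column_stack([r, s, np.ones(6)])
b = t

x_normal = np.linalg.solve(A.T @ A, A.T @ b)              # solve A'Ax = A'b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)           # reference least-squares solution
print(x_normal, x_lstsq)                                  # both approximate (c, d, e)
```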

3.2.5.3 Orthogonal Matrices

Recall that for S = {ui , i ∈ [k]}, T = {vi , i ∈ [k]} bases for vector space V , the transition matrix P from
S → T (Definition 59) is written
 
P = [u1 ]T [u2 ]T ··· [uk ]T (266)

and [w]T = P [w]S is satisfied for w ∈ V .

Definition 78 (Orthogonal Matrix). A square matrix (Definition 22) is orthogonal if A−1 = A0 .

Theorem 65. A square matrix A is orthogonal iff AA0 = 1.

Proof. Proof follows directly from Theorem 13.

Theorem 66 (Equivalent Statements for Matrix Orthogonality). Let A be square matrix order n, then
the following statements are equivalent:

1. A is orthogonal,

2. Rows of A form orthonormal basis for Rn .

3. Columns of A form orthonormal basis for Rn .


 
Proof. Let A be written in row-stacked form with rows a_1, a_2, · · · , a_n. By Corollary 5, 1 ↔ 2 can be
proved if we show that A is orthogonal iff {a_i, i ∈ [n]} is orthonormal. See that

AA0 = (ai a0j )n×n = (ai · aj )n×n , (267)

so A orthogonal iff AA' = 1 iff ∀i, j, a_i · a_j = δ_ij, where δ_ij = 1{i = j} is the Kronecker delta. The
last statement holds iff a_1, · · · , a_n is orthonormal. The proof for 1 ↔ 3 is similar.

Theorem 67. Let S, T be two orthonormal bases for vector space, P be transition matrix S → T . Then,
P is orthogonal and P P 0 = 1. P 0 is transition matrix from T → S.

Proof. Let S = {ui , i ∈ [k]}, T = {vi , i ∈ [k]} be two orthonormal bases given. Then by orthonormality
we may express (Theorem 58)
∀i ∈ [k],  u_i = Σ_{j=1}^{k} (u_i · v_j) v_j.   (268)

Then transition matrix P from S → T is expressed


 
P = [ u_1 · v_1   u_2 · v_1   · · ·   u_k · v_1
      u_1 · v_2   u_2 · v_2   · · ·   u_k · v_2
      · · ·       · · ·       · · ·   · · ·
      u_1 · v_k   u_2 · v_k   · · ·   u_k · v_k ].   (269)

We may repeat the same exercise and verify that the transition matrix Q from T → S is s.t. Q0 = P .
The final assertion follows from Theorem 40.

Exercise 55 (Rotation of Coordinates). Let E = {e1 , e2 } be standard basis (Definition 56) for R2 . We
may obtain a rotation in the coordinate system by angle θ. See that if we let

(u1 )E = (cos(θ), sin(θ)), (270)


(u2 )E = (− sin(θ), cos(θ)), (271)

then S = {u1 , u2 } is orthonormal basis for R2 and the transition matrix from S → E is
P = [ cos(θ)   −sin(θ)
      sin(θ)    cos(θ) ].   (272)

For arbitrary v = (x, y) ∈ R2, (v)_S = (x', y') is obtained via the relation

(x', y')' = P' (x, y)'   (273)

s.t.

x' = x cos(θ) + y sin(θ)     (274)
y' = −x sin(θ) + y cos(θ),   (275)

which is the rotation by θ to a new coordinate system.
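A small numerical check of Exercise 55 (sketch; the angle θ = π/6 is an arbitrary choice):

```python
import numpy as np

theta = np.pi / 6
P = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # transition matrix S -> E
v = np.array([2.0, 1.0])                           # coordinates (x, y) relative to E
v_S = P.T @ v                                      # (x', y') = P'(x, y)'
print(v_S)
print(np.allclose(P @ v_S, v))                     # rotating back recovers v
```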

Theorem 68 (Cauchy-Schwarz Inequality). For vectors u, v ∈ Rn , prove

|u · v| ≤ kukkvk. (276)

Proof. If u = 0, then |0 · v| = 0 ≤ ‖0‖‖v‖. Else, if u ≠ 0, then denote

a = u · u, b = 2u · v, c = v · v,   (277)

and ∀t ∈ R, see that

0 ≤ (tu + v) · (tu + v) = t²(u · u) + 2t(u · v) + v · v = at² + bt + c.   (278)

Since this is nonnegative for all t, the discriminant b² − 4ac ≤ 0, so

4(u · v)² ≤ 4(u · u)(v · v) =⇒ (u · v)² ≤ (u · u)(v · v) =⇒ |u · v| ≤ √(u · u) √(v · v) = ‖u‖‖v‖.   (279)

Theorem 69 (Triangle Inequality). For vectors u, v ∈ Rn , prove

ku + vk ≤ kuk + kvk. (280)

Proof. We can write

‖u + v‖² = (u + v) · (u + v)                        (281)
         = u · u + v · v + 2u · v                   (282)
         ≤ ‖u‖² + ‖v‖² + 2‖u‖‖v‖     (Theorem 68)   (283)
         = (‖u‖ + ‖v‖)².                            (284)

The result follows.

Exercise 56. Prove that for u, v, w ∈ Rn with u = v + w,

1.

d(u, w) ≤ d(u, v) + d(v, w). (285)

2.

ku + vk2 + ku − vk2 = 2kuk2 + 2kvk2 . (286)

3.
u · v = (1/4)‖u + v‖² − (1/4)‖u − v‖².   (287)

Proof. -

1. Use the Triangle Inequality (Theorem 69) with u → u − v, v → v − w, and so

ku − wk ≤ ku − vk + kv − wk ↔ d(u, w) ≤ d(u, v) + d(v, w). (288)

2. See that

‖u + v‖² + ‖u − v‖² = (u + v) · (u + v) + (u − v) · (u − v)   (289)
                    = 2(u · u) + 2(v · v) + 2u · v − 2u · v    (290)
                    = 2‖u‖² + 2‖v‖².                            (291)

This part shows that for a parallelogram with sides u, v, the sum of squares of the four sides equals
the sum of squares of the two diagonals.

3. See that
(1/4)(u + v) · (u + v) − (1/4)(u − v) · (u − v) = (1/4)(2u · v + 2u · v) = u · v.   (292)

Exercise 57 (Orthogonal Space is Subspace). Let W be subspace of Rn , and define

W ⊥ = {u ∈ Rn | u orthogonal to W }. (293)

Show W ⊥ is subspace of Rn .

Proof. Let {w_i, i ∈ [k]} be a basis for W, then see that

u ∈ W⊥ ↔ ∀i ∈ [k], w_i · u = 0 ↔ [ w_1 ; · · · ; w_k ] u = 0,   (294)

where [ w_1 ; · · · ; w_k ] is the matrix with rows w_i.

Therefore W ⊥ is a nullspace.

Exercise 58. Let {u1 , · · · , un } be set of orthogonal vectors, then show


‖ Σ_{i=1}^{n} u_i ‖² = Σ_{i=1}^{n} ‖u_i‖².   (295)

Proof. Write
‖ Σ_i u_i ‖² = (Σ_i u_i) · (Σ_j u_j) = Σ_i (u_i · u_i) = Σ_i ‖u_i‖²,   (296)

where the cross terms vanish by orthogonality.

 
Exercise 59 (QR Factorization Example). Let

A = [ 1  1  0
      1  1  0
      1  1  1
      0  1  1 ]

and u_1 = (1, 1, 1, 0)', u_2 = (1, 1, 1, 1)', u_3 = (0, 0, 1, 1)' (the columns of A).   (297)

Use Gram-Schmidt process to transform {u1 , u2 , u3 } into orthonormal basis {w1 , w2 , w3 } for colSpace(A).
Then write each of the ui ’s as linear combination of wi ’s. Then write A = QR, where Q is 4 × 3 matrix
where the columns are orthonormal, and R is 3 × 3 upper triangular with positive entries along the
diagonal.

Proof. Apply Gram-Schmidt (Theorem 61) to obtain the orthonormal basis

w_1 = (1/√3) (1, 1, 1, 0)',   w_2 = (0, 0, 0, 1)',   w_3 = (1/√6) (−1, −1, 2, 0)'.   (298)

See that

u_1 = √3 w_1,   u_2 = √3 w_1 + w_2,   u_3 = (1/√3) w_1 + w_2 + √(2/3) w_3.   (299)

Then let the matrices

A = (u_1 u_2 u_3) = (w_1 w_2 w_3) [ √3   √3   1/√3
                                    0    1    1
                                    0    0    √(2/3) ],   (300)

Q = (w_1 w_2 w_3) = [ 1/√3   0   −1/√6
                      1/√3   0   −1/√6
                      1/√3   0    2/√6
                      0      1    0     ],   R = [ √3   √3   1/√3
                                                   0    1    1
                                                   0    0    √(2/3) ]   (301)

satisfy A = QR.
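The factorization can be sanity-checked numerically (a sketch); note library routines may flip signs of columns of Q and rows of R, so agreement with the hand computation is only up to sign:

```python
import numpy as np

A = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
Q, R = np.linalg.qr(A)          # reduced QR: Q is 4x3 with orthonormal columns, R is 3x3
print(np.allclose(Q @ R, A))    # True
print(np.allclose(Q.T @ Q, np.eye(3)))
print(R)                        # upper triangular; diagonal signs may differ from the text
```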

Theorem 70 (Unique Projection). Let V be subspace of Rn and u ∈ Rn. Show that the decomposition
u = n + p, where n is orthogonal to V and p is the projection of u onto V, is unique.

Proof. Let u = n1 + p1 = n2 + p2 . We show that the two representations must be the same. Since for
i = 1, 2 we have ni · pj = 0, and n1 + p1 = n2 + p2 =⇒ n1 − n2 = p2 − p1 , then

kn1 − n2 k2 = (n1 − n2 ) · (n1 − n2 ) = (n1 − n2 ) · (p2 − p1 ) = n1 · p2 − n1 · p1 − n2 · p2 + n2 · p1 = 0.(302)

Therefore, n1 − n2 = 0 and so n1 = n2 . We also have p2 − p1 = n1 − n2 = 0, so p1 = p2 .

Exercise 60. Let A be square matrix order n, and A2 = A, A0 = A. Then

1. Show that ∀u, v ∈ Rn , (Au) · v = u · (Av).

2. Show that ∀w ∈ Rn , Aw is projection of w onto subspace V = {u ∈ Rn |Au = u} of Rn .

Proof. -

1. (Au) · v = (Au)0 v = u0 A0 v = u0 Av = u · (Av).

2. Since A(Aw) = A2 w = Aw ∈ V , then for v := w − Aw, see that for all u ∈ V (applying part 1 and
using the property Au = u of elements in subspace V ),

u · v = u · w − u · Aw = u · w − Au · w = u · w − u · w = 0. (303)

Since w = Aw + v, Aw ∈ V and v ⊥ V , Aw is projection w onto V .

Exercise 61. Discuss which of these are true:

1. kuk = kvk =⇒ ku + wk = kv + wk.

2. kuk = kvk and w orthogonal to u, v =⇒ ku + wk = kv + wk.

3. u orthogonal to v, w =⇒ u orthogonal to v + w.

4. u, v orthogonal and v, w orthogonal =⇒ u, w orthogonal.

Proof. -

1. False, see counterexample u, v, w = (1, 0), (0, 1), (2, 0) respectively.


2. True, since ‖u + w‖ = √(‖u‖² + ‖w‖²) and ‖v + w‖ = √(‖v‖² + ‖w‖²).

3. True, u · (v + w) = u · v + u · w = 0.

4. False, see counterexample u, v, w = (1, 0), (0, 1), (2, 0) respectively.

Exercise 62. Suppose a linear system Ax = b is consistent, then show that the solution space of Ax = b
is the solution space of A0 Ax = A0 b.

Proof. If Av = b, then since A'Av = A'b, v is a solution for A'Ax = A'b. Then the solution space for
Ax = b is written (Theorem 52, Exercise 50 part 1)

{u + v|u ∈ nullSpace(A)} = {u + v|u ∈ nullSpace(A0 A)}, (304)

which is the solution space for A0 Ax = A0 b.

Exercise 63. Let A be orthogonal matrix order n and u, v ∈ Rn . Show that

1. kuk = kAuk.

2. d(u, v) = d(Au, Av).

3. ∠(u, v) = ∠(Au, Av).

Proof. -

1. kAuk2 = (Au)0 (Au) = u0 A0 Au = u0 u = kuk2 .

2. d(Au, Av) = kAu − Avk = kA(u − v)k = ku − vk = d(u, v) by part 1.


   
3. (Au) · (Av) = u'A'Av = u'v = u · v =⇒ ∠(u, v) = arccos[(u · v)/(‖u‖‖v‖)] = arccos[((Au) · (Av))/(‖Au‖‖Av‖)] = ∠(Au, Av) by part 1.

Exercise 64. Let A be orthogonal matrix order n and S = {ui , i ∈ [n]} be basis for Rn .

1. Show that T = {Aui , i ∈ [n]} is basis for Rn .

2. Show that S orthogonal =⇒ T orthogonal.

3. Show that S orthonormal =⇒ T orthonormal. (Orthogonal (unitary) matrices preserve vector
norms).

Proof. -

1. Since A−1 = A0 then T is linearly independent by Exercise 39. So T is basis by Theorem 37.

2. Follows immediately from Exercise 63, since (Au) · (Av) = u · v.

3. Part 2 asserts that T is an orthogonal set. Then to show orthonormality, see Exercise 63, part 1:
vector norms are preserved under transformations by orthogonal (more generally, unitary) matrices, so
each Au_i remains a unit vector.

Exercise 65. Determine which of these are true:

1. If A = (c1 · · · ck ) is n × k matrix and ci , i ∈ [k] orthonormal then A0 A = 1k .

2. If A = (c1 · · · ck ) is n × k matrix and ci , i ∈ [k] orthonormal then AA0 = 1n .

3. If A, B orthogonal matrices, then A + B is orthogonal.

4. If A, B orthogonal matrices, then AB is orthogonal.

Proof. -
 
1. True, since A'A = [ c_1' ; · · · ; c_k' ] (c_1 · · · c_k) = (c_i · c_j)_{k×k} = 1_k.

2. False by counterexample the 3 × 2 matrix A = [ 1 0 ; 0 1 ; 0 0 ]. Also, recall that A has a right
inverse iff rank(A) = n, but here rank(A) = 2 < 3 = n (Theorem 53).

3. False by counterexample A = 12 , B = −12 .

4. True, since (AB)0 (AB) = B 0 A0 AB = 1.

3.2.6 Diagonalization
All vectors here are expressed column unless otherwise stated.

3.2.6.1 Eigenvalues

Definition 79 (Eigenvalues and Eigenvectors). Let A be square matrix order n, then nonzero u ∈ Rn
is eigenvector of A if Au = λu for some constant λ. λ is said to be eigenvalue of A, and u is said to be
eigenvector of A associated with eigenvalue λ.

Definition 80 (Characteristic Polynomials). Let A be square matrix order n. Then the equation

det(λ1 − A) = 0 (305)

is said to be a characteristic equation of A with characteristic polynomial det(λ1 − A).

Theorem 71 (Eigenvalue solves the characteristic polynomial). Let A be square matrix order n, then λ
is eigenvalue of A iff det(λ1 − A) = 0.

Proof. λ is eigenvalue of A iff Au = λu for some nonzero u ∈ Rn iff λu − Au = 0 iff (λ1 − A)u = 0
iff (λ1 − A)x = 0 has non-trivial solutions iff det(λ1 − A) = 0, by Theorem 39. When expanded
det(λ1 − A) = 0 turns out to be polynomial in λ of degree n. (verify this)
 
Exercise 66. For matrix C = [ 0 −1 0 ; 0 0 2 ; 1 1 1 ], the characteristic polynomial is

det(λ1 − C) = det [ λ  1  0 ; 0  λ  −2 ; −1  −1  λ−1 ] = λ³ − λ² − 2λ + 2 = (λ − 1)(λ² − 2),   (306)

so det(λ1 − C) = 0 iff λ ∈ {1, √2, −√2}, which are the eigenvalues of C.
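Numerically (a sketch), both the characteristic polynomial coefficients and the eigenvalues of C can be recovered with numpy:

```python
import numpy as np

C = np.array([[0., -1., 0.],
              [0.,  0., 2.],
              [1.,  1., 1.]])
print(np.poly(C))            # coefficients of det(lambda*1 - C): [1, -1, -2, 2]
print(np.linalg.eigvals(C))  # approximately 1, sqrt(2), -sqrt(2)
```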

Theorem 72 (Invertibility of Square Matrices, 3). If A is square matrix order n, then the following
statements are equivalent:

1. A is invertible.

2. Ax = 0 has only the trivial solution.

3. RREF of A is identity 1 matrix.


4. A can be expressed as Πni Ei , where Ei are elementary matrices.

5. det(A) 6= 0.

6. Rows of A form basis for Rn .

7. Columns of A form basis for Rn .

8. rank(A) = n.

9. 0 is not eigenvalue of A.

Proof. See proof in Theorem 39 for the iff conditions for statement 1 ↔ 7. Statement 6, 7 ↔ 8 is trivial
by definition of rank (Definition 62). We just need to show any of statements 1 ∼ 8 iff statement 9.
By Theorem 71, λ is eigenvalue of A iff det(λ1 − A) = 0, so 0 is not eigenvalue of A iff det(0 − A) =
det(−A) = (−1)n det(A) 6= 0 (last step follows from Theorem 22), which is iff det(A) 6= 0. Then we are
done.

Theorem 73. If A is triangular matrix (Definition 28), then the eigenvalues of A are diagonal entries
of A.

Proof. Suppose A = (aij ) order n is triangular, then consider λ1 − A. This is triangular matrix with
diagonals λ − aii , i ∈ [n], so by Theorem 16, see that

det(λ1 − A) = Πni (λ − aii ). (307)

It follows that the diagonal entries aii are the eigenvalues of A.

Definition 81 (Eigenspace). Let A be square matrix order n and λ be eigenvalue, then solution space of
(λ1 − A)x = 0 is called the eigenspace of A associated with eigenvalue λ, and we denote this Eλ . See that
this is a nullspace. If nonzero u ∈ Eλ , then u must be an eigenvector of A associated with λ; Au = λu.

We know how to obtain the eigenvalues of a matrix A. See Exercise 66 on solving characteristic
polynomials. Once we obtain some set of eigenvalues, then each eigenvalue has an associated eigenspace,
which can be obtained by solving some HLS. We know how to obtain the spanning basis (Definition 54)
for nullspaces (Definition 64). See Exercise 33 for a walk-through.

3.2.6.2 Diagonalization

Definition 82 (Diagonalizable Matrix). Let A be square matrix order n, then we say that it is diagonal-
izable if ∃P that is invertible s.t. P −1 AP = D and D is diagonal matrix. Then P is said to diagonalize
matrix A.

Theorem 74. Let A is square matrix order n, then A is diagonalizable iff A has n linearly independent
eigenvectors.

Proof. →: Suppose A is diagonalizable, then let P be invertible matrix satisfying P −1 AP = D where


D_ij = { λ_i if i = j,  0 if i ≠ j }.   (308)

For P = (u1 u2 · · · un ), since AP = P D, then

A(u1 u2 · · · un ) = (u1 u2 · · · un )D =⇒ (Au1 Au2 · · · Aun ) = (λ1 u1 λ2 u2 · · · λn un ) (309)

so that Au_i = λ_i u_i for all i. That is, u_1, · · · , u_n are eigenvectors of A, and since P is invertible, by
equivalent statements (Theorem 72), it follows that {u_i, i ∈ [n]} is an Rn basis; they are linearly independent.
←: Suppose A has n linearly independent eigenvectors u_i, i ∈ [n]. Let these be associated with the
eigenvalues λ_i, i ∈ [n], then by equivalent statements for basis (Theorem 37), it follows that {u_i, i ∈ [n]}
is basis for Rn . Then define P = (u1 u2 · · · un ), and see that

AP = (Au1 Au2 · · · Aun ) = (λ1 u1 λ2 u2 · · · λn un ) = P D, (310)

where
D_ij = { λ_i if i = j,  0 if i ≠ j }.   (311)

By the equivalence relations asserted by Theorem 72, P is invertible and P −1 AP = D.

Exercise 67. Given a square matrix A order n, discuss how one may determine if A is diagonalizable,
and if it is so, outline how to find invertible P s.t. P −1 AP = D for some diagonal matrix D.

Proof. 1. First, find all distinct eigenvalues λi , i ∈ [k] by solving the characteristic equation det(λ1 −
A) = 0.

2. For each i ∈ [k], find basis Sλi for eigenspace Eλi by solving the associated HLS.

3. Let S = ∪_{i=1}^{k} S_{λ_i}; if |S| < n, then A is not diagonalizable, and otherwise it is diagonalizable. Suppose
S = {u_1, · · · , u_n}, then the matrix P = (u_1 u_2 · · · u_n) is an invertible matrix diagonalizing A.

The case when matrix A has non-real eigenvalues when solving the characteristic polynomial is
discussed in the section on abstract linear algebra techniques over complex fields.
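A numerical rendition of this recipe (sketch, on the matrix C of Exercise 66; numpy returns unit-norm eigenvectors rather than the hand-picked basis vectors, but the factorization is the same):

```python
import numpy as np

A = np.array([[0., -1., 0.],
              [0.,  0., 2.],
              [1.,  1., 1.]])            # matrix C from Exercise 66
eigvals, P = np.linalg.eig(A)            # columns of P are eigenvectors
D = np.diag(eigvals)
print(np.allclose(np.linalg.inv(P) @ A @ P, D))   # True: P^{-1} A P = D
print(np.allclose(P @ D @ np.linalg.inv(P), A))   # reconstructs A
```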

Result 4. Suppose the characteristic polynomial of matrix A is factorized to det(λ1 − A) = Πki (λ − λi )ri ,
then for each eigenvalue λi , dim(Eλi ) ≤ ri . Furthermore, A is diagonalizable iff in step 2 outlined in
algorithm for Exercise 67, we obtain ∀i ∈ [k], dim(Eλi ) = ri .

Exercise 68. -

1. Let C = [ 0 −1 0 ; 0 0 2 ; 1 1 1 ]. Solving the characteristic polynomial (see Exercise 66), the
eigenvalues are 1, √2, −√2. Solving the linear system for λ = 1, (λ1 − C)x = 0,

[ 1 1 0 ; 0 1 −2 ; −1 −1 0 ] (x, y, z)' = (0, 0, 0)',   (312)

we get the general solution (x, y, z)' = t (−2, 2, 1)'. Then E_1 = span{(−2, 2, 1)'}. We may repeat the
same exercise to get

E_√2 = span{(−1, √2, 1)'},   E_−√2 = span{(−1, −√2, 1)'}.   (313)

Then let P = [ −2 −1 −1 ; 2 √2 −√2 ; 1 1 1 ] and P⁻¹CP = diag(1, √2, −√2).

2. Let A = [ 1 0 0 ; 1 2 0 ; −3 5 2 ]. Then either by solving the characteristic polynomial or observing
that this is triangular (Theorem 73), the eigenvalues are 1, 2. Solving the linear system for λ = 1,
(λ1 − A)x = 0, with general solution (x, y, z)' = t (1, −1, 8)', see that E_1 = span{(1, −1, 8)'}. Next,
solving the linear system for λ = 2, (λ1 − A)x = 0, with general solution (x, y, z)' = t (0, 0, 1)', see that
E_2 = span{(0, 0, 1)'}. We only have two linearly independent eigenvectors, therefore A is not
diagonalizable by Theorem 74.

Exercise 69. Let A be square matrix order n, then suppose we have m < n linearly independent eigenvec-
tors ui , i ∈ [m], where Aui = λi ui , and λi ’s are not necessarily distinct. For new eigenvalue µ 6= λi∈[m] ,
and linearly independent vectors {vj , j ∈ [p]} ⊆ Eµ , prove {ui , i ∈ [m]} ∪ {vj , j ∈ [p]} is linearly inde-
pendent.
Proof. Consider Σ_{i=1}^{m} a_i u_i + Σ_{j=1}^{p} b_j v_j = 0, then multiply A to both sides to get

Σ_{i=1}^{m} a_i λ_i u_i + Σ_{j=1}^{p} b_j µ v_j = 0,   (314)

and subtract the two equations Σ_i a_i λ_i u_i + Σ_j b_j µ v_j = 0 and µ · (Σ_i a_i u_i + Σ_j b_j v_j) = 0 to get

Σ_{i=1}^{m} a_i (λ_i − µ) u_i = 0,   (315)

which implies a_i (λ_i − µ) = 0 for each i by independence of the u_i's. But λ_i ≠ µ, so each a_i = 0.
Substituting this into the original vector equation gives Σ_{j=1}^{p} b_j v_j = 0, which by the linear
independence of the v_j's implies each b_j = 0.

Exercise 70. Prove that eigenvectors belonging to distinct eigenspaces are linearly independent.

Proof. Let T v1 = λ1 v1 , T v2 = λ2 v2 and λ1 6= λ2 . Then consider α1 v1 + α2 v2 = 0, then

0 = T (0) = T (α1 v1 + α2 v2 ) = α1 λ1 v1 + α2 λ2 v2 (316)

by linear transformation properties (Theorem 76). Multiplying α_1 v_1 + α_2 v_2 = 0 by λ_1 gives
0 = λ_1 α_1 v_1 + λ_1 α_2 v_2. Subtracting this from (316),

0 = 0 v_1 + α_2 (λ_2 − λ_1) v_2,   (317)

so α_2 = 0 since λ_1 ≠ λ_2 by assumption and v_2 ≠ 0 by definition of
eigenvectors (Definition 79). Then 0 = α_1 v_1 + 0 v_2 =⇒ α_1 = 0 since v_1 ≠ 0 by definition of eigenvector,
and α1 = α2 = 0. For two eigenvectors, consider them already independent and belonging to the same
eigenspace, or if they belong to distinct eigenspaces, we have shown they must be linearly independent.
Then the induction proof is shown in Exercise 69 and the result generalizes to any set of eigenvectors
each from distinct eigenspaces.

Corollary 6. Let A be square matrix order n. If A has n distinct eigenvalues, then A is diagonalizable.

Proof. This is trivial to see, since for each eigenvalue, there is at least one eigenvector associated with it.
We have n eigenvectors. The eigenvectors are linearly independent by Exercise 70, hence by Theorem
74, A is diagonalizable.

See that for diagonalizable matrix A of square matrix order n and invertible P satisfying

P −1 AP = D, (318)

where D is diagonal matrix with diagonal entry λi at Dii , i ∈ [n], we have

1. for m ∈ Z+, A^m = P D^m P⁻¹, where D^m is the diagonal matrix with diagonal entry λ_i^m at the (i, i) entry,

2. and if we are further given that A⁻¹ exists, then λ_i ≠ 0 for all i by Theorem 72 and λ_i⁻¹ is valid
for all i ∈ [n]. In fact

A⁻¹ = P D̃ P⁻¹   (319)

where D̃ is the diagonal matrix with (i, i) entry λ_i⁻¹. We may also obtain A^{−m} as we did in part 1 by
making the substitution A⁻¹ → A, D̃ → D.

Exercise 71. Find a closed form solution for the Fibonacci sequence.

Proof. The Fib-sequence may be written as (a0 , a1 , · · · ) s.t. a0 = 0, a1 = 1 and an = an−1 + an−2 for all
n ≥ 2. Then see that we may write

an = an (320)
an+1 = an−1 + an , (321)

with matrix representation

(a_n, a_{n+1})' = [ 0 1 ; 1 1 ] (a_{n−1}, a_n)'.   (322)

Define x_n = (a_n, a_{n+1})', A = [ 0 1 ; 1 1 ], s.t. x_n = A x_{n−1} = A² x_{n−2} = · · · = A^n x_0. Then obtain
an invertible P as in Exercise 67, and get P = [ 1 1 ; (1+√5)/2 (1−√5)/2 ]. Compute P⁻¹AP = D =
diag((1+√5)/2, (1−√5)/2). Then we may write

(a_n, a_{n+1})' = x_n = A^n x_0   (323)
               = P diag( ((1+√5)/2)^n, ((1−√5)/2)^n ) P⁻¹ (0, 1)'   (324)
               = ( (1/√5)[((1+√5)/2)^n − ((1−√5)/2)^n],  (1/√5)[((1+√5)/2)^{n+1} − ((1−√5)/2)^{n+1}] )'.   (325)
We have found the n-th Fibonacci sequence number, an .
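A quick check of the closed form against direct iteration (sketch):

```python
import numpy as np

def fib_closed(n):
    # Closed-form a_n from the diagonalization above (Binet's formula).
    phi, psi = (1 + np.sqrt(5)) / 2, (1 - np.sqrt(5)) / 2
    return (phi ** n - psi ** n) / np.sqrt(5)

a = [0, 1]
for _ in range(10):
    a.append(a[-1] + a[-2])
print([int(round(fib_closed(n))) for n in range(12)])   # matches the iterated sequence
print(a)
```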

3.2.6.3 Orthogonal Diagonalization

If we have obtained linearly independent eigenvectors (see Exercise 67), we may obtain an orthonor-
mal basis for the span of these eigenvectors (recall we may use Gram-Schmidt procedure to obtain an
orthonormal set from a basis (Definition 61)).

Definition 83 (Orthogonally Diagonalization/Spectral Decomposition/Eigen Decomposition). Square


matrix A is orthogonally diagonalizable if there exists orthogonal matrix P s.t P 0 AP = D, where D is
some diagonal matrix. P is said to orthogonally diagonalize A. For orthogonal diagonalization written
A = P DP 0 , we may write
 0 
h i h i P·,1 n n
 X X
A = λ1 P·,1 · · · λn P·,n P 0 = λ1 P·,1 · · · λn P·,n  · · ·  = 0
λi P·,i P·,i = λi vi vi0 , (326)

0 i i
P·,n

where vi are the eigenvectors associated with λi , which form orthogonal columns in P . The form
Pn 0
i λi vi vi is said to be a spectral decomposition of the matrix A.

Theorem 75. Square matrix A order n is orthogonally diagonalizable iff A0 = A (it is symmetric).

Proof. We only prove →. Suppose A is orthogonally diagonalizable, then for some P , P 0 AP = D and
P 0 = P −1 with D being diagonal matrix. We may write

A = (P 0 )−1 DP −1 = P DP 0 . (327)

Since D0 = D, we have

A0 = (P DP 0 )0 = P 00 D0 P 0 = P DP 0 = A. (328)

Verify this theorem for ←.

Exercise 72. Given symmetric matrix A order n, discuss how to find an orthogonal matrix P s.t
P 0 AP = D for some diagonal matrix D.

Proof. -

1. First, find all the distinct eigenvalues, λi , i ∈ [k]

2. For each λi , find basis Sλi spanning Eλi and use Gram-Schmidt process to obtain orthonormal
basis Tλi .

3. Let T = ∪ki Tλi := {v1 , · · · vn }. Then P = (v1 v2 · · · vn ) is orthogonal matrix that diagonalizes A.

When the matrix is symmetric, it turns out that the eigenvalues are always real (verify this). By
Result 4, let the characteristic polynomial be expressed

det(λ1 − A) = Πki (λ − λi )ri , (329)

then dim(Eλi ) = ri and |Sλi | = |Tλi | = ri .
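Numerically, for a symmetric matrix numpy's eigh returns an orthonormal eigenbasis directly, so the orthogonal diagonalization and the spectral decomposition of Definition 83 can be verified as follows (a sketch on an arbitrary symmetric matrix):

```python
import numpy as np

A = np.array([[ 2., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  2.]])                  # symmetric, so orthogonally diagonalizable
eigvals, P = np.linalg.eigh(A)                   # P orthogonal, columns are eigenvectors
print(np.allclose(P.T @ A @ P, np.diag(eigvals)))
# spectral decomposition: A = sum_i lambda_i v_i v_i'
A_rebuilt = sum(lam * np.outer(v, v) for lam, v in zip(eigvals, P.T))
print(np.allclose(A_rebuilt, A))
```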

3.2.6.4 Quadratic Forms and Conic Sections

Definition 84 (Quadratic Form). The general form


Q(x_1, · · · , x_n) = Σ_{i=1}^{n} Σ_{j=i}^{n} q_ij x_i x_j,   (330)

where q_ij ∈ R, is said to be a quadratic form in n variables x_i, i ∈ [n]. If we define symmetric matrix
A = (a_ij) where

a_ij = { q_ii if i = j,   (1/2) q_ij if i < j,   (1/2) q_ji if i > j },   (331)

then see that we may express

Q((x_i)_{i∈[n]}) = (x_1 x_2 · · · x_n) [ q_11       (1/2)q_12  · · ·  (1/2)q_1n
                                        (1/2)q_12  q_22       · · ·  (1/2)q_2n
                                        · · ·      · · ·      · · ·  · · ·
                                        (1/2)q_1n  (1/2)q_2n  · · ·  q_nn     ] (x_1, x_2, · · · , x_n)' = x'Ax.   (332)

Then we may write Q : Rn → R, where Q(x) = x0 Ax for all x ∈ Rn .

The quadratic form takes quite a common occurrence in practical applications. For instance, see
multivariate normal density (Equation 2701), factor hedging objectives (Equation 3344) and mean-
variance portfolios (Equation 637).

Exercise 73. Consider the quadratic form Q_2(x, y, z) = x² + 2y² + z² + 2xz, see that we may write

Q_2(x, y, z) = (x y z) [ 1 0 1 ; 0 2 0 ; 1 0 1 ] (x, y, z)'.   (333)

Exercise 74 (Simplification of Quadratic Forms). Let Q(x) = x0 Ax be a quadratic form for x0 =


(x1 · · · xn ), and n × n symmetric matrix A. We would like to simplify the quadratic form. Since A is
symmetric, apply algorithm in Exercise 72 to obtain orthogonal P s.t P 0 AP = D, where D is diagonal
matrix with (i, i) entry λi , i ∈ [n]. Next, define new variables yi , i ∈ [n] s.t. y = P 0 x = P −1 x. Then
x = P y and we may write
Q(x) = Q(Py) = (Py)'A(Py) = y'P'APy = y'Dy = Σ_{i=1}^{n} λ_i y_i².   (334)

Exercise 75. Consider again the quadratic form Q_2(x, y, z) = x² + 2y² + z² + 2xz as in Exercise 73;
we perform simplification of this quadratic form as suggested in Exercise 74. By the algorithm presented
in Exercise 72, obtain orthogonal matrix

P = [ 1/√2  0  −1/√2
      0     1   0
      1/√2  0   1/√2 ].

Then P' [ 1 0 1 ; 0 2 0 ; 1 0 1 ] P = [ 2 0 0 ; 0 2 0 ; 0 0 0 ]. Defining the variables

(x', y', z')' = P' (x, y, z)' = ( (x + z)/√2,  y,  (−x + z)/√2 )',   (335)

we may write the quadratic form

Q_2(x, y, z) = (x' y' z') [ 2 0 0 ; 0 2 0 ; 0 0 0 ] (x', y', z')' = 2x'² + 2y'² = (x + z)² + 2y².   (336)
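The simplification can be reproduced numerically (sketch): diagonalize the symmetric matrix of the quadratic form and evaluate both expressions at a random point.

```python
import numpy as np

A = np.array([[1., 0., 1.],
              [0., 2., 0.],
              [1., 0., 1.]])                 # matrix of Q2(x, y, z) = x^2 + 2y^2 + z^2 + 2xz
eigvals, P = np.linalg.eigh(A)
x = np.random.default_rng(1).normal(size=3)
y = P.T @ x                                  # new variables y = P'x
print(x @ A @ x, np.sum(eigvals * y ** 2))   # the two values agree
```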

Definition 85 (Quadratic Equation and Associated Quadratic Forms). A quadratic equation in two
variables x, y takes form

ax2 + bxy + cy 2 + dx + ey = f, (337)

where a, b, c, d, e, f ∈ R and at least one of a, b, c is nonzero. We may express

(x y) [ a  (1/2)b ; (1/2)b  c ] (x, y)' + (d e) (x, y)' = f.   (338)

Denote

x = (x, y)',   A = [ a  (1/2)b ; (1/2)b  c ],   b = (d, e)',   (339)

so that the quadratic equation is written x0 Ax + b0 x = f . The x0 Ax term (expanded, ax2 + bxy + cy 2 ) is
called a quadratic form associated with the quadratic equation.

A quadratic equation (Definition 85) represents graphically a conic section; a conic section is degen-
erated if it is empty set, point, line, pair of lines, and non-degenerated if it is circle, ellipse, hyperbola or

parabola. A non-degenerated conic section is said to be standard form if it takes one of form in Table
3.1.

Table 3.1: Standard Forms for Conic Section

N-D Form         Equation                  Quadratic Form
Circle/Ellipse   x²/α² + y²/β² = 1         (x y) [ 1/α² 0 ; 0 1/β² ] (x, y)' = 1
Hyperbola        x²/α² − y²/β² = 1         (x y) [ 1/α² 0 ; 0 −1/β² ] (x, y)' = 1
Hyperbola        −x²/α² + y²/β² = 1        (x y) [ −1/α² 0 ; 0 1/β² ] (x, y)' = 1
Parabola         x² = αy                   (x y) [ 1 0 ; 0 0 ] (x, y)' + (0 −α) (x, y)' = 0
Parabola         y² = αx                   (x y) [ 0 0 ; 0 1 ] (x, y)' + (−α 0) (x, y)' = 0

Exercise 76. Consider the quadratic equation 2x2 + 24xy + 9y 2 + 20x − 6y = 5. Show this can be written
as standard form of hyperbola.

Proof. The quadratic equation may be written


(x y) [ 2 12 ; 12 9 ] (x, y)' + (20 −6) (x, y)' = 5.   (340)

Obtain orthogonal matrix P = [ 3/5 −4/5 ; 4/5 3/5 ] s.t. P' [ 2 12 ; 12 9 ] P = [ 18 0 ; 0 −7 ] = D.
Define (x', y')' = P' (x, y)', then the quadratic equation becomes

(x' y') D (x', y')' + (20 −6) [ 3/5 −4/5 ; 4/5 3/5 ] (x', y')' = 5.   (341)

Then

18x'² − 7y'² + (36/5)x' − (98/5)y' = 5   (342)
18(x' + 1/5)² − 7(y' + 7/5)² = −8   (343)
−(x' + 1/5)²/(4/9) + (y' + 7/5)²/(8/7) = 1.   (344)

Exercise 77. Let A be square matrix order 2, and assume characteristic polynomial λ2 + mλ + n. Then
show that m = −tr(A) (Definition 41), n = det(A).
Proof. Define arbitrary matrix A = [ a b ; c d ], then

det(λ1 − A) = det [ λ−a  −b ; −c  λ−d ] = (λ − a)(λ − d) − (−b)(−c) = λ² + (−a − d)λ + (ad − bc).   (345)

Then m = −a − d = −tr(A), the negative sum of diagonals in A and n = det(A).

Exercise 78. Let λ be eigenvalue of square matrix A, then

1. show λn is eigenvalue of An , where n ∈ Z+ ,

2. if A is invertible, show 1/λ is eigenvalue of A⁻¹.

3. show λ is eigenvalue of A0 .

Proof. -

1. We prove by induction. For j = 1, Aj x = λx. Assume for j < n, that Aj x = λj x. Then for j + 1,
see Aj+1 x = AAj x = Aλj x = λj+1 x. By induction we are done.

2. Let x be eigenvector associated with λ, then Ax = λx =⇒ x = A⁻¹(λx) = λA⁻¹x, which implies
(1/λ)x = A⁻¹x.

3. We prove using the transpose-determinant relation. λ is eigenvalue of A if it satisfies characteristic


equation det(λ1 − A) = 0. See det(λ1 − A) = det((λ1 − A)0 ) = det(λ1 − A0 ) = 0, so λ is eigenvalue
of A0 .

Exercise 79. Let A be square matrix s.t A2 = A, then

1. show that if A has eigenvalue λ, it must be either zero or one.

2. find the matrix size 2 × 2 (possibly many) A with eigenvalues zero and one.

Proof. -

1. Let x be eigenvector associated with λ, then A2 = A =⇒ A2 x = Ax =⇒ λ2 x = λx =⇒


λ(λ − 1)x = 0.
2. Since A has two distinct eigenvalues, ∃P = [ a b ; c d ] s.t. P⁻¹AP = [ 1 0 ; 0 0 ], and using the
classical adjoint, write (Theorem 23)

P⁻¹ = (1/det(P)) [ d −b ; −c a ]   (346)

s.t.

A = P [ 1 0 ; 0 0 ] P⁻¹ = (1/det(P)) [ a 0 ; c 0 ] [ d −b ; −c a ] = (1/det(P)) [ ad −ab ; cd −cb ].   (347)

We require that det(P) = ad − bc ≠ 0.

Exercise 80. Let A be square matrix order n, A2 = 0 but A 6= 0. Then

1. show that the only possible eigenvalue is 0,

2. argue if A is diagonalizable or not,

3. for u ∈ Rn , Au 6= 0, prove (u, Au) linearly independent,

4. for n = 2, ∃ invertible P satisfying P⁻¹AP = [ 0 0 ; 1 0 ].
Proof. -

1. For x 6= 0, Ax = λx, see that A2 = 0 =⇒ A2 x = 0x =⇒ A(λx) = 0 =⇒ λ2 x = 0 iff λ = 0.

2. Not diagonalizable, since if it is, ∃P s.t. P −1 AP = 0, A = P 0P −1 = 0 but A 6= 0.

3. Consider au + bAu = 0 =⇒ A(au + bAu) = A0 =⇒ aAu + bA²u = 0 =⇒ aAu = 0, but Au ≠ 0
so a = 0. Then bAu = 0 but Au ≠ 0 so b = 0. So a, b = 0 and we are done.
 
4. We show by construction. Let P = (u  Au), which is invertible by Theorem 72 since its columns are
linearly independent (part 3), and see AP = (Au  A²u) = (Au  0). Also, P [ 0 0 ; 1 0 ] = (0u + 1·Au   0u + 0·Au) = (Au  0).
Then AP = P [ 0 0 ; 1 0 ] and the result follows.

Exercise 81. Let {ui , i ∈ [n]} be basis spanning Rn , and A be square matrix order n satisfying Aui =
ui+1 for i ∈ [n − 1], Aun = 0. Show the only possible eigenvalue of A is zero, and find all the associated
eigenvectors.

Proof.

∀i ∈ [n],  A^n u_i = A^{n−1} u_{i+1} = · · · = A^i u_n = 0.   (348)

For v ∈ Rn where Av = λv and v = Σ_{i=1}^{n} c_i u_i, we can write

A^n v = Σ_{i=1}^{n} c_i A^n u_i = 0.   (349)

Since A^n v = λ^n v (see Exercise 78) but v ≠ 0, λ = 0 and the eigenvalue must be zero. To get all the
eigenvectors, write

0 = Av = Σ_{i=1}^{n} c_i A u_i = Σ_{i=1}^{n−1} c_i u_{i+1} + c_n 0.   (350)

All u2 , · · · , un are linearly independent, so ci ’s are zero for i ∈ [n − 1] and v = cn un . The eigenvectors
are just vectors in span{un }.
Exercise 82. Determine the values of a, b s.t. [ a 1 ; 0 b ] is diagonalizable.

Proof. Consider the characteristic equation

det [ λ−a  −1 ; 0  λ−b ] = 0 ↔ (λ − a)(λ − b) = 0.   (351)

Then the eigenvalues are a and b. If a = b, then consider the HLS [ λ−a  −1 ; 0  λ−a ] x = 0 with λ = a,
which would clearly have nullspace spanned by a single vector. Then the matrix is not diagonalizable.
Otherwise, we have two distinct eigenvalues, and by Corollary 6, the matrix is diagonalizable.

Exercise 83. Square matrices A, B are similar if ∃P s.t. P −1 AP = B. If A, B similar, then prove the
following statements hold true.

1. An , B n similar ∀n ∈ Z+ .

2. If A invertible, B invertible and A−1 , B −1 similar.

3. If A diagonalizable, B diagonalizable.

Proof. -

1. B n = (P −1 AP )(P −1 AP ) · · · (P −1 AP ) = P −1 An P .

2. B −1 = (P −1 AP )−1 = P −1 A−1 P .

3. If ∃Q s.t. Q−1 AQ = D, define R = P −1 Q, then R is invertible (Theorem 14) and R−1 BR =


Q−1 P BP −1 Q = Q−1 AQ = D.

Exercise 84. A square matrix A order n is stochastic matrix if all entries are ≥ 0 and the sum of
entries in each column is one. Show that 1 is eigenvalue of a stochastic matrix and for any eigenvalue
λ, |λ| ≤ 1.
Proof. See that for stochastic matrix A, for all i ∈ [n], Σ_{j=1}^{n} a_ji = 1. Then A'(1, 1, · · · , 1)' =
(1, 1, · · · , 1)'. Recall that one is eigenvalue of A' iff it is eigenvalue of A (Exercise 78). For the
second assertion, suppose A'x = λx for some x ≠ 0. For k = arg max_j |x_j|, j ∈ [n], see that |x_k| > 0 since
x ≠ 0. Define the access operator [·] s.t. a[i] accesses the value at the i-th coordinate of a. Then

(A'x)[k] = Σ_{i=1}^{n} a_ik x_i = λx_k =⇒ |λ||x_k| = |Σ_i a_ik x_i| ≤ Σ_i |a_ik x_i| ≤ Σ_i a_ik |x_i| ≤ (Σ_i a_ik)|x_k| = |x_k|.

We used the triangle inequality (Theorem 69) and the property a_ij ≥ 0. The statement implies |λ| ≤ 1.

Exercise 85 (Matrix Exponentiation). Let A be square matrix, then the exponential of A is the matrix

exp(A) = 1 + A + (1/2!)A² + · · · = Σ_{n=0}^{∞} (1/n!) A^n.   (352)

Compute exp(A) for A = [ 3 0 ; 8 −1 ].

Proof. Obtain P⁻¹AP = [ 3 0 ; 0 −1 ] for P = [ 1 0 ; 2 1 ]. Then A^n = P [ 3^n 0 ; 0 (−1)^n ] P⁻¹ and

exp(A) = P [ 1 + 3/1! + 3²/2! + · · ·   0 ; 0   1 + (−1)/1! + (−1)²/2! + · · · ] P⁻¹ = P [ e³ 0 ; 0 e⁻¹ ] P⁻¹   (353)

using the Taylor expansion exp(x) = Σ_{n=0}^{∞} x^n/n!.
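The computation can be checked against scipy's matrix exponential (a sketch, assuming scipy is available; P and the eigenvalues are as in the proof above):

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[3., 0.],
              [8., -1.]])
P = np.array([[1., 0.],
              [2., 1.]])                         # eigenvectors for eigenvalues 3 and -1
D_exp = np.diag(np.exp([3., -1.]))
print(np.allclose(P @ D_exp @ np.linalg.inv(P), expm(A)))   # True
```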

Exercise 86. Determine which are true:

1. A diagonalizable implies A0 diagonalizable.

2. A, B diagonalizable implies A + B diagonalizable.

3. A, B diagonalizable implies AB diagonalizable.

Proof. -

1. True, since for P −1 AP = D we can write

D = D0 = (P −1 AP )0 = P 0 A0 (P −1 )0 = P 0 A0 (P 0 )−1 . (354)
2. False by counterexample A = [ 2 0 ; 0 0 ], B = [ 0 0 ; 1 2 ].

3. False by counterexample A = [ 2 0 ; 0 1 ], B = [ 1 0 ; 1 2 ].

Exercise 87. Let u be some column matrix, then show that 1 − uu0 is orthogonally diagonalizable.
Proof. (uu0 )0 = uu0 =⇒ (1 − uu0 ) = (1 − uu0 )0 (it is symmetric).

Exercise 88. Let A be symmetric matrix, if Au = λu, Av = µv, λ 6= µ, show that u · v = 0.

Proof. v'Au = v'(λu) = λv'u = λ(v · u) and by symmetry v'Au = v'A'u = (Av)'u = (µv)'u = µv'u =
µ(v · u), which implies λ(v · u) = µ(v · u) =⇒ (λ − µ)(v · u) = 0 =⇒ v · u = 0 since λ ≠ µ.

Exercise 89. Determine which are true: If A, B orthogonally diagonalizable,

1. then A + B orthogonally diagonalizable.

2. then AB orthogonally diagonalizable.

Proof. -

1. True, since A, B orthogonally diagonalizable iff A = A0 , B = B 0 implies A+B = A0 +B 0 = (A+B)0 .


2. False by counterexample A = [ 1 0 ; 0 0 ], B = [ 0 1 ; 1 0 ].

Exercise 90. Let there be real constants λ1 ≤ λ2 ≤ λ3 . Then


1. Show λ_1 is the minimum value of Σ_{i=1}^{3} λ_i x_i² over all real numbers x_1, x_2, x_3 satisfying Σ_{i=1}^{3} x_i² = 1.

2. Show λ_3 is the maximum value under the conditions in part 1.

3. Find the minimum and maximum values of u'Qu over all vectors u in R3 where Q = [ 2 −1 0 ; −1 2 −1 ; 0 −1 2 ]
with constraint ‖u‖ = 1.

Proof. -
1. See that if we define (x_1, x_2, x_3) = (1, 0, 0), then Σ_i x_i² = 1 and Σ_i λ_i x_i² = λ_1. The minimum
value must be ≤ λ_1. On the other hand, if Σ_i x_i² = 1, then Σ_i λ_i x_i² ≥ Σ_i λ_1 x_i² = λ_1 (Σ_i x_i²) = λ_1.
The result follows.

2. The second part follows by the same technique as in part 1.

3. Solve for eigenvalues of Q to obtain 2 − √2, 2, 2 + √2. Then ∃P s.t. P'QP = D where D is the diagonal
matrix with diagonal entries 2 − √2, 2, 2 + √2. Define P'u = (x_1, x_2, x_3)', then

u'Qu = u'(PP')Q(PP')u = (P'u)'(P'QP)(P'u) = (2 − √2)x_1² + 2x_2² + (2 + √2)x_3²   (355)

and u'u = u'(PP')u = (P'u)'(P'u) = Σ_i x_i². By parts 1 and 2, the minimum value is 2 − √2 and the
maximum value is 2 + √2.

Exercise 91. Name the conic section and write the standard form represented by a non-degenerated
conic section satisfying

(x y) A (x, y)' = 8,   (356)

where A is symmetric matrix order 2 with eigenvalues 1, 4.

Proof. There exists orthogonal P s.t. P'AP = D with diagonal entries 1, 4. Define (x', y')' = P' (x, y)',
then

(x y) A (x, y)' = 8 ↔ (x' y') P'AP (x', y')' = 8 ↔ x'²/8 + y'²/2 = 1.   (357)

3.2.7 Linear Transformations


Definition 86. A linear transformation is a mapping T : Rn → Rm of form

T((x_1, x_2, · · · , x_n)') = [ a_11  a_12  · · ·  a_1n
                               a_21  · · ·  · · ·  a_2n
                               · · ·  · · ·  · · ·  · · ·
                               a_m1  a_m2  · · ·  a_mn ] (x_1, x_2, · · · , x_n)' = A (x_1, x_2, · · · , x_n)',   (358)

where aij ∈ R for i ∈ [m], j ∈ [n]. A is called the standard matrix. When T is mapping from Rn → Rn ,
then it is called a linear operator. For brevity, if we wish to indicate the dimension of the domain and
images under T , we notate

T is linear transformation from Rn → Rm ≡ Tn(m) . (359)

In abstract linear algebra, we define T : V → W to be a mapping from vector space V to W , and say
it is linear transformation if ∀u, v ∈ V, c, d ∈ R, T (cu + dv) = cT (u) + dT (v) is satisfied.
An example of a linear transformation is the identity transformation I : Rn → Rn s.t. I(u) = u for
all u ∈ Rn . The standard matrix is 1. Another is the zero transformation O : Rn → Rm s.t. O(u) = 0
for all u ∈ Rn . The standard matrix is 0m×n .
Suppose we have linear transformation T : R2 → R3 s.t.

T((x, y)') = (x + y, 2x, −3y)' = [ 1 1 ; 2 0 ; 0 −3 ] (x, y)'   ∀(x, y)' ∈ R2.   (360)

See that the standard matrix is [ 1 1 ; 2 0 ; 0 −3 ].

Theorem 76. Let T = Rn → Rm be linear transformation,

1. T (0) = 0,

2. ∀u_i ∈ Rn, c_i ∈ R,

T(Σ_{i=1}^{k} c_i u_i) = Σ_{i=1}^{k} c_i T(u_i).   (361)

Proof. Let A be standard matrix for T, then T(u) = Au by definition for u ∈ Rn and we may apply the
properties of matrix operations. In particular, T(0) = A0 = 0 and T(Σ c_i u_i) = A(Σ c_i u_i) = Σ c_i A u_i =
Σ c_i T(u_i) by Theorem 6.

We may use Theorem 76 to check if a function given is a linear transformation or not.


Exercise 92. Show that T_n^(m) is linear transformation iff T(cu + dv) = cT(u) + dT(v) for all u, v ∈ Rn,
c, d ∈ R.

Proof. →: follows immediately from Theorem 76. ←: suppose ∀u, v ∈ Rn and c, d ∈ R, we have
T(cu + dv) = cT(u) + dT(v). Let {e_i, i ∈ [n]} be standard basis for Rn and A = (T(e_1) · · · T(e_n)). For
arbitrary u ∈ Rn, we may express u = Σ_{i=1}^{n} u_i e_i, see that

T(u) = Σ_{i=1}^{n} u_i T(e_i) = (T(e_1) · · · T(e_n)) (u_1, u_2, · · · , u_n)' = Au,   (362)

so T is linear transformation.

Let {u_i, i ∈ [n]} be basis spanning Rn, then ∀v ∈ Rn, see that we may write v = Σ_{i=1}^{n} c_i u_i, and by
Theorem 76, T(v) = Σ_{i=1}^{n} c_i T(u_i). It follows that the image T(v) of v is determined completely by the
images T(u_i)'s of basis vectors u_i.

Exercise 93. Let T : R3 → R2 be linear transformation with

T((1, 1, 1)') = (1, 3)',   T((0, 1, 1)') = (−1, 2)',   T((2, 0, −1)') = (4, −1)'.   (363)

Then

1. find the image of the vector (−1, 4, 6)' under T, and

2. find a formula representing T.

Proof. The vectors {(1, 1, 1)', (0, 1, 1)', (2, 0, −1)'} are basis for R3. Writing (−1, 4, 6)' as a linear
combination of the elements in the basis, solve the linear system

(−1, 4, 6)' = c_1 (1, 1, 1)' + c_2 (0, 1, 1)' + c_3 (2, 0, −1)'.   (364)

The solution yields c_1 = 3, c_2 = 1, c_3 = −2, and therefore the image is

T((−1, 4, 6)') = T(3 (1, 1, 1)' + (0, 1, 1)' − 2 (2, 0, −1)')   (365)
              = 3 T((1, 1, 1)') + T((0, 1, 1)') − 2 T((2, 0, −1)') = 3 (1, 3)' + (−1, 2)' − 2 (4, −1)' = (−6, 13)'.

We repeat the steps in part 1, except on an arbitrary vector in R3. Solve (x, y, z) = c_1 (1, 1, 1) + c_2 (0, 1, 1) +
c_3 (2, 0, −1) to get solution c_1 = x − 2y + 2z, c_2 = −x + 3y − 2z and c_3 = y − z. Then the general formula
is

T((x, y, z)') = (x − 2y + 2z) (1, 3)' + (−x + 3y − 2z) (−1, 2)' + (y − z) (4, −1)' = (2x − y, x − y + 3z)'.   (366)

From the previous exercise, it follows that for T : Rn → Rm , standard basis {ei , i ∈ [n]} (Definition
56), we have T (ei ) = Aei as column i of the standard matrix. The images T (ei ) for i ∈ [n] completely
define T .
Exercise 94 (Obtaining standard matrix via Gauss Jordan Elimination). Consider the linear transfor-
mation in Exercise 93 - here we obtain the standard matrix directly via GJE (Theorem 5). Take
   
[ 1 0 2 | 1 0 0 ; 1 1 0 | 0 1 0 ; 1 1 −1 | 0 0 1 ]  →  [ 1 0 0 | 1 −2 2 ; 0 1 0 | −1 3 −2 ; 0 0 1 | 0 1 −1 ].   (367)

Then each of the standard basis elements is written as a linear combination

(1, 0, 0)' = (1, 1, 1)' − (0, 1, 1)',
(0, 1, 0)' = −2 (1, 1, 1)' + 3 (0, 1, 1)' + (2, 0, −1)',
(0, 0, 1)' = 2 (1, 1, 1)' − 2 (0, 1, 1)' − (2, 0, −1)'.   (368)

Then T((1, 0, 0)') = T((1, 1, 1)') − T((0, 1, 1)') = (1, 3)' − (−1, 2)' = (2, 1)' and so on, and the standard
matrix is just (T(e_1) T(e_2) T(e_3)).
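Equivalently, the standard matrix can be computed in one shot (sketch): stack the given basis vectors and their images as columns and solve T(U) = AU for A.

```python
import numpy as np

U = np.array([[1., 0., 2.],
              [1., 1., 0.],
              [1., 1., -1.]])          # columns are u1, u2, u3
TU = np.array([[1., -1., 4.],
               [3., 2., -1.]])         # columns are T(u1), T(u2), T(u3)
A = TU @ np.linalg.inv(U)              # standard matrix, since T(U) = A U
print(A)                               # [[2, -1, 0], [1, -1, 3]]
```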
Definition 87. Let S : Rn → Rm , T : Rm → Rk be linear transformations. Then define the composition
of T with S, T ◦ S as mapping Rn → Rk that satisfies

(T ◦ S)(u) = T (S(u)) u ∈ Rn . (369)


Theorem 77. If we have S_n^(m), T_m^(k), then (T ◦ S)_n^(k); that is, T ◦ S is linear transformation Rn → Rk. If
A, B are the standard matrices for S_n^(m), T_m^(k) respectively, then the standard matrix for (T ◦ S)_n^(k) is BA.
Proof. For u ∈ Rn , then (T ◦ S)(u) = T (S(u)) = T (Au) = BAu. T ◦ S is linear transformation with
standard matrix BA.

3.2.7.1 Ranges and Kernels
Definition 88. For T_n^(m), denote R(T) as the range of T, and this is the set of images of T,

R(T) = {T(u) | u ∈ Rn} ⊆ Rm.   (370)

Theorem 78. Let T_n^(m) and {u_i, i ∈ [n]} be basis for Rn. Then recall that T(v := Σ_{i=1}^{n} c_i u_i) ∈
span{T(u_i), i ∈ [n]} for all v ∈ Rn by Theorem 76. It follows that R(T) ⊆ span{T(u_i), i ∈ [n]}. On the
other hand, see that for any c_i ∈ R, Σ_i c_i T(u_i) = T(Σ_i c_i u_i) ∈ R(T) by Theorem 76, so
R(T) ⊇ span{T(u_i), i ∈ [n]}. Then R(T) = span{T(u_i), i ∈ [n]}.

Theorem 79. Let A be standard matrix for T_n^(m), then R(T) = colSpace(A).

Proof. Let {ei , i ∈ [n]} be standard basis for Rn , and since column i of A is T (ei ), then by Theorem 78,
R(T ) = span{T (ei ), i ∈ [n]} = colSpace(A).

Definition 89. Let T be linear transformation. Then dim(R(T )) is called rank of T , and we denote this
rank(T ). See that by Theorem 79, for standard matrix A of T , rank(A) = rank(T ).
Definition 90. Let T_n^(m), then the kernel of T is denoted ker(T) and is the set of vectors in Rn that
map to the zero vector in Rm,

ker(T ) = {u|T (u) = 0} ⊆ Rn . (371)


Theorem 80. For T_n^(m) with standard matrix A, ker(T) = nullSpace(A).

Proof. See that ker(T ) = {u|T (u) = 0} = {u|Au = 0}, which by definition is the nullspace (Definition
64).

Definition 91. Let T be linear transformation, then dim(ker(T )) is called nullity of T and we denote
this nullity(T ). See that for standard matrix A for T , nullity(T ) = nullity(A).

Theorem 81 (Rank-Nullity Theorem, Linear Transformations). It is trivial to see from the reasoning in
Theorem 51 that rank(T) + nullity(T) = n for T_n^(m).

Exercise 95. For T, T_1, T_2 linear transformations from Rn → Rm with standard matrices A, A, B respec-
tively, define (T_1 + T_2)_n^(m) s.t. (T_1 + T_2)(u) = T_1(u) + T_2(u) for all u ∈ Rn. Additionally, define (λT)_n^(m)
s.t. (λT)(u) = λT(u) for all u ∈ Rn. Show that (T_1 + T_2), λT are both linear transformations and
find their standard matrices.

Proof. It is easy to see by the duality between a linear transformation and its standard matrix both
results. That is, (T1 + T2 )(u) = T1 (u) + T2 (u) = Au + Bu = (A + B)u, and (λT )(u) = λT (u) = λAu =
(λA)u.

Exercise 96. Let T_n^(n) be linear operator, and if ∃S_n^(n) s.t. S ◦ T = 1, the identity transformation, then
T is said to be invertible with inverse S. For invertible T with standard matrix A, find the standard matrix
for the inverse of T.

Proof. S(T(u)) = S(Au) = u for all u ∈ Rn iff the standard matrix B of S satisfies BAu = u for all u,
i.e., BA = 1 and B = A⁻¹. So the standard matrix of S is A⁻¹.
Exercise 97. Let n be unit vector in Rn, and define P_n^(n) s.t. P(x) = x − (n · x)n for all x ∈ Rn. Then
show that P is linear transformation, find its standard matrix and prove that P ◦ P = P .

Proof. Note that the term n · x = n0 x is ‘commutative’ since it is scalar.

∀x ∈ Rn , P (x) = x − (n · x)n = 1x − nn0 x = (1 − nn0 )x, (372)

so P is linear transformation with standard matrix 1 − nn0 . Next, write

(P ◦ P)(x) = P(P(x)) = P(x − (n · x)n)                        (373)
           = x − (n · x)n − (n · (x − (n · x)n))n             (374)
           = x − (n · x)n − ((n · x) − (n · x)(n · n))n       (375)
           = x − (n · x)n                                     (376)
           = P(x),                                            (377)

where the second term in (375) vanishes since n · n = 1.

(n)
Exercise 98. Let Tn be linear transformation and T ◦ T = T .

1. If T is not zero transformation, show ∃u 6= 0 ∈ Rn s.t. T (u) = u.

2. If T is not identity transformation, show ∃v 6=∈ Rn s.t. T (v) = 0.


(n)
3. Find all linear transformations Tn satisfying T ◦ T = T .

Proof. -

1. Since T is not zero transformation, then ∃x ∈ Rn s.t T (x) 6= 0, and defining u = T (x), we have
T (u) = T (T (x)) = (T ◦ T )(x) = T (x) = u.

2. If T is not identity, then there exists y ∈ Rn s.t. T (y) 6= y. Then for v = T (y) − y,

T (v) = T (T (y) − y) = T (y) − T (y) = 0. (378)

3. See Exercise 79.

Exercise 99. Let n be unit vector ∈ Rn, and F_n^(n) s.t. F(x) = x − 2(n · x)n for all x ∈ Rn, then

1. show that F is linear transformation, and find its standard matrix.

2. prove F ◦ F = 1, the identity transformation.

3. show that the standard matrix for F is orthogonal matrix.

Proof. -

1.

∀x ∈ Rn , F (x) = x − 2(n · x)n = 1x − 2nn0 x = (1 − 2nn0 )x. (379)

The standard matrix is 1 − 2nn0 .


2. Use the standard matrix and compute (1 − 2nn0 )(1 − 2nn0 ) = 1 − 2nn0 − 2nn0 + 4n(n0 n)n0 = 1.

3. See that (1 − 2nn0 ) is symmetric, and by part 2 we get (1 − 2nn0 )0 (1 − 2nn0 ) = 1.

Exercise 100. A linear operator T_n^(n) is an isometry if ‖T(u)‖ = ‖u‖ for all u ∈ Rn.

1. If T is isometry on Rn , then show T (u) · T (v) = u · v for all u, v ∈ Rn .


2. Let A be standard matrix for linear operator T_n^(n). Show T is isometry iff A is orthogonal matrix.

Proof. 1.

T (u + v) · T (u + v) = (T (u) + T (v)) · (T (u) + T (v)) (380)


= T (u) · T (u) + T (v) · T (v) + 2(T (u) · T (v)) (381)
2 2
= kT (u)k + kT (v)k + 2(T (u) · T (v)) (382)
= kuk2 + kvk2 + 2(T (u) · T (v)), (383)

and

T (u + v) · T (u + v) = kT (u + v)k2 = ku + vk2 = (u + v) · (u + v) (384)


= u · u + v · v + 2(u · v) (385)
2 2
= kuk + kvk + 2(u · v). (386)

2. kT (u)k = kAuk = kuk (see Exercise 64). On the other hand, if T is isometry and {ei , i ∈ [n]}
is standard basis (Definition 56), then (Aei ) · (Aej ) = (Aei )0 Aej = e0i A0 Aej = (A0 A)ij , but
(Aei )0 Aej = T (ei ) · T (ej ) = ei · ej = δij , so A0 A = 1.

Exercise 101. Find the nullity of T given the following information respectively:

1. T_4^(6), rank(T) = 2.

2. R(T_6^(4)) = R4.

3. The RREF of the standard matrix of T_6^(6) has four nonzero rows.

Proof. The nullity of T for part 1) is 2, part 2) is 2, and part 3) is 2.


Exercise 102. Let T_n^(n) be linear operator T(v) = 2v, then find the kernel and range of T.

Proof. The kernel is the zero space and range is Rn .


Exercise 103. Let V be subspace of Rn, and define P_n^(n) s.t. ∀u ∈ Rn, P(u) is the projection of u onto V
(see Definition 77). Then show P is linear transformation. If n = 3 and V is the plane ax + by + cz = 0,
with at least one of a, b, c nonzero, find ker(P), R(P).

Proof. Let {v_i, i ∈ [k]} be orthonormal basis spanning V, then by Theorem 58, we may express P(u) =
Σ_{i=1}^{k} (u · v_i) v_i = (Σ_{i=1}^{k} v_i v_i') u. ker(P) = span{(a, b, c)}, R(P) = V.

Exercise 104. Show for T_n^(m), ker(T) = {0} iff T is one-to-one.

Proof. If ker(T ) = {0}, then for u, v satisfying T (u) = T (v), we have T (u − v) = T (u) − T (v) = 0 =⇒
u − v = 0, so u = v. On the other hand, if T is one-to-one, then since T (0) = 0, only 0 maps to image 0
under T , so the kernel of T must only contain 0 - then ker(T ) = {0}.

Exercise 105. For S_n^(m), T_m^(k), show

1. ker(S) ⊆ ker(T ◦ S),

2. R(T ◦ S) ⊆ R(T ).

Proof. -

1. u ∈ ker(S) =⇒ S(u) = 0 =⇒ (T ◦ S)(u) = T(S(u)) = T(0) = 0 =⇒ u ∈ ker(T ◦ S), hence
ker(S) ⊆ ker(T ◦ S).

2. v ∈ R(T ◦ S) =⇒ ∃u s.t v = (T ◦ S)(u) =⇒ v = T (S(u)) = T (w) for w := S(u) =⇒ v ∈


R(T ) =⇒ R(T ◦ S) ⊆ R(T ).

3.3 Abstract Linear Algebra


Section 3.2 makes arguments for linear algebraic computations in the Euclidean space. Matrices were the
foremost theme. (Standard) Matrices and their relation to linear transformations/maps were discussed
in Section 3.2.7. In this section, the foremost subject are linear maps and operators. The focus lies on the
appreciation for the structure of linear operators. Euclidean spaces are swapped out for abstract vector
spaces. Links and applicable proofs from the treatments in Euclidean spaces are drawn. In many cases,
the proofs are similar and require trivial modifications from R → F. Some of the proofs referenced rely
on sub-proofs defined on the Euclidean space. When the modifications are easy to derive, we reference
the proofs without highlighting the necessary changes. Proofs that require non-trivial modifications for
arbitrary vector spaces are constructed again. To keep in line with the singleton principle, as much as
possible, repetition of similar results are not re-written. Section 3.2 is hence required material.

3.3.1 Vector Spaces


Euclidean vector spaces and subspaces were given treatment in Section 3.2.3. Generalization to complex
vector spaces and abstract spaces are given treatment here.

Definition 92 (Complex Numbers). A complex number is an ordered pair (a, b), a ∈ R, b ∈ R and is
often written a + bi. ‘a’ is said to be the real component and ‘b’ is said to be the imaginary component.
The set of complex numbers is usually denoted C = {a + bi | a, b ∈ R}. The addition and multiplication
of complex numbers in C are defined by the following rules

(a + bi) + (c + di) = (a + c) + (b + d)i,

(a + bi)(c + di) = (ac − bd) + (ad + bc)i.



See that a + 0i is just a real number, so we can think of R ⊂ C. i used to denote the value −1. It
2
follows i = −1. The rules of complex arithmetic apply as we would expect when using real numbers,
and applying the definition of addition and multiplication of complex numbers defined in Definition 92.
In particular,

Theorem 82 (Complex Arithmetic). For complex numbers α, β, λ ∈ C,

1. α + β = β + α,

2. (α + β) + λ = α + (β + λ),

3. (αβ)(λ) = α(βλ),

4. λ + 0 = λ, λ1 = λ,

5. ∀α, ∃!β s.t. α + β = 0, and β is given the name additive inverse,

6. ∀α ≠ 0, ∃!β s.t. αβ = 1, and β is given the name multiplicative inverse,

7. λ(α + β) = λα + λβ.

Vectors/lists were introduced in Definition 42. A vector is different from a set in the sense that sets
are unordered, and identical members of a set are considered as redundant. That is, {α, α} = {α}. Lists
are ordered. A generalization of the Euclidean space (Definition 44) is given by

Definition 93. Fn = {x = (x_1, · · · , x_n) : x_j ∈ F, ∀j ∈ [n]}. The j-th coordinate of (x_1, · · · , x_n) ∈ Fn is


xj .

When F = R then Definition 93 is Euclidean space (Definition 44). F is an arbitrary field. A field
is a set that contains at least two distinct elements, with operations of addition and multiplication
satisfying the complex arithmetic rules (Theorem 82). Addition is done componentwise, so that in Fn ,
(x + y)i = (xi + yi ). Addition is commutative, the zero vector is denoted 0, the additive inverse of x is
(−x), and scalar multiplication λx has i-th component λxi . 0 is also the additive identity.

Exercise 106. For a, b, c, d ∈ R and a, b ≠ 0, find real numbers c, d s.t.

1/(a + bi) = c + di.   (387)

Proof.

1/(a + bi) = (a − bi)/((a + bi)(a − bi)) = (a − bi)/(a² + b²).   (388)

It follows that c = a/(a² + b²), d = −b/(a² + b²).

See Definition 51 and 53. Addition on a set V is a function that assigns u + v ∈ V to pair of elements
u, v ∈ V . Scalar multiplication on set V is function that assigns λv ∈ V to λ ∈ F and v ∈ V . We consider
abstract vector spaces.

Definition 94. A vector space is a set V along with addition on V and scalar multiplication on V where
the following holds for all u, v, w ∈ V and a, b ∈ F:

1. u + v = v + u,

2. (u + v) + w = u + (v + w), (ab)v = a(bv),

3. ∃0 ∈ V , s.t. v + 0 = v,

4. ∃w ∈ V s.t. v + w = 0,

5. 1v = v,

6. a(u + v) = au + av, (a + b)v = av + bv.

Elements of a vector space are vectors/points. The scalar multiplication of λ ∈ F depends on F.
Therefore, when the field needs to be specified, we say V is vector space over F. The real vector space is
vector space over R; complex vector space is vector space over C. An infinite dimensional vector space
can be defined as follows:

Definition 95. F∞ over F is defined to be

F∞ = {(x1 , x2 , · · · ) | xj ∈ F, ∀j = 1, 2, · · · }. (389)

The addition and scalar multiplication on set F∞ is defined componentwise as in the finite dimensional
case, and the additive identity is the sequence of all zeros.

The study of linear algebra is concerned mostly with the behavior of finite dimensional vector spaces.

Definition 96 (Function Space, FS ). If S is a set, then FS is the set of functions from S → F. That
is, each element in FS is a function that maps members of the set S to some element in the field F. For
f, g ∈ FS , the sum f + g ∈ FS is the function defined by

(f + g)(x) = f (x) + g(x), ∀x ∈ S. (390)

For λ ∈ F, f ∈ FS , the product λf ∈ FS is the function defined by (λf )(x) = λf (x) for all x ∈ S.

For instance, the notation R[0,1] specifies the set of real valued functions on the interval [0, 1]. If
S 6= {}, then FS is a vector space over field F. The additive identity of FS , the zero function is defined
via 0(x) = 0, and the additive inverse of f ∈ FS is denoted −f : S → F given by (−f )(x) = −f (x)
for all x ∈ S. In FS , the elements are functions. In Rn , the elements are lists of real numbers. Vector
spaces can contain arbitrary objects, insofar as they satisfy the right properties (Definition 94). Fn can
be thought to be a special case of FS - if we write F{1,2,··· ,n} .
A vector space has unique additive identity and for each element, unique additive inverses. The
proofs are trivial. For instance, write v + w = 0 and v + w0 = 0 implies that w = w + 0 = w + (v + w0 ) =
(w + v) + w' = 0 + w' = w'. Since they are unique, denote −v to be the additive inverse of v without
ambiguity. Let V be a vector space (Definition 94) in all discussions from now on unless otherwise
stated.

Exercise 107. Argue that the empty set is not a vector space.

Proof. All vector spaces have some additive identity element.

Exercise 108. Argue that the additive inverse condition can be replaced with the condition 0v = 0 for
all v ∈ V for vector spaces (Definition 94).

Proof. Assume additive inverse, then ∀v ∈ V , 0v = (0 + 0)v = 0v + 0v. Add additive inverse to LHS and
RHS, so 0 = 0v. OTOH, assume 0v = 0 for all v ∈ V , then v +(−1)v = 1v +((−1)v) = (1−1)v = 0v = 0,
so (−1)v is additive inverse.

See subspace Definition 51. For abstract vector space,

Definition 97. Subset U of V is subspace of V if U is vector space (Definition 94). Similar conditions
hold as in the Euclidean space. That is, U is subspace of V iff

1. 0 ∈ U ,

2. u, w ∈ U =⇒ u + w ∈ U ,

3. a ∈ F, u ∈ U =⇒ au ∈ U .

See Euclidean space analogue on span closure (Theorem 27).

{0}, V are respectively the smallest possible and largest possible subspace of V. The set of continuous,
real-valued functions on [0, 1] is subspace of R^[0,1], since the sum of two continuous functions is
continuous. The set of differentiable real-valued functions on R is a subspace of R^R, since sums of
differentiable functions are differentiable and (cf)'(x) = cf'(x).
See subset addition defined in Definition 60. Subset addition on subspaces is subspace. See proof,
Theorem 36. A stronger statement can be made. The sum of subspaces is the smallest containing
subspace.

Theorem 83 (Sum of Subspace is Smallest Containing Subspace). Suppose U_i, i ∈ [m] are subspaces of
V, then Σ_{i=1}^{m} U_i is the smallest subspace of V containing the U_i's, i ∈ [m].

Proof. 0 ∈ Σ_{i=1}^{m} U_i is trivial. We know that subset addition is subspace (Theorem 36). The proof requires
the definition of linear spans (Definition 48) and span closure (Theorem 27), but it is also easy to see by the
definition for subset addition (Definition 60). So Σ_{i=1}^{m} U_i is subspace of V. We just need to show it is
the smallest. Obviously each U_i is contained in Σ_{i=1}^{m} U_i. Every subspace of V containing U_1, U_2, · · · , U_m
contains Σ_{i=1}^{m} U_i.
Pm Pm
We know that every element in i Ui , where Ui ’s are subspaces, can be written in the form i ui ,
where ui ∈ Ui , i ∈ [m]. We are interested when this is a unique representation.
Pm
Definition 98 (Direct Sum). Let Ui , i ∈ [m] be subspaces of V . Then the sum i Ui is said to be direct
Pm Pm
sum if each element in i Ui has unique representation i ui , i ∈ [m] where ui ∈ Ui . Additionally, if
Pm m
i Ui is direct sum, then we denote this as U1 ⊕ U2 · · · ⊕ Um = ⊕i Ui .

Consider the standard basis (Definition 56) ei , i ∈ [n] and Euclidean space Rn . Define each subspace
Ui = span(ei ), i ∈ [n], then see that Rn = ⊕ni Ui .
Theorem 84 (Condition for Direct Sum). Let U_i, i ∈ [m] be subspaces of V, then Σ_{i=1}^{m} U_i is ⊕_{i=1}^{m} U_i iff
Σ_{i=1}^{m} u_i = 0, u_i ∈ U_i, has the unique representation u_i = 0 for all i ∈ [m].

Proof. Suppose we have direct sum; if we let all u_i's be zero vectors, then Σ_{i=1}^{m} u_i = 0 and by definition of
direct sum, this is the unique representation. On the other hand, suppose the only way to express the zero vector
is as the summation of zero vectors. We show every representation must be unique. Take v = Σ_{i=1}^{m} u_i = Σ_{i=1}^{m} v_i,
where u_i, v_i ∈ U_i; since v − v = 0 = Σ_{i=1}^{m} (u_i − v_i) and u_i − v_i ∈ U_i, u_i = v_i and the two representations
are actually the same.

Theorem 85. If U, W are subspaces of V , then U + W is direct sum iff U ∩ W = {0}.

Proof. Suppose U + W is direct sum, then for v ∈ U ∩ W, v + (−v) = 0 and by unique representation,
v = 0. On the other hand, suppose U ∩ W = {0}. Let u ∈ U, w ∈ W and 0 = u + w. Since w is unique
additive inverse w = (−u) ∈ U , so w ∈ U ∩ W and w must be 0. Then u = 0.
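For subspaces handed to us as column spans of matrices, the criterion of Theorem 85 (and its many-summand analogue via the dimension count in (401) and Exercise 129 below) can be checked numerically. A minimal numpy sketch follows; the matrices and the helper is_direct_sum are my own illustration, not from the text.

# Minimal numpy sketch (assumption: each subspace is given as the column span of a
# matrix). For column spans, dim(U + W) = rank([B_U | B_W]), so the intersection is
# trivial iff rank([B_U | B_W]) = rank(B_U) + rank(B_W), cf. Theorem 85 and (401).
import numpy as np

def is_direct_sum(*bases):
    """Each argument is a matrix whose columns span one subspace of R^n."""
    total = sum(np.linalg.matrix_rank(B) for B in bases)
    return np.linalg.matrix_rank(np.hstack(bases)) == total

U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # span(e1, e2) in R^3
W = np.array([[0.0], [0.0], [1.0]])                  # span(e3)
W_bad = np.array([[1.0], [1.0], [0.0]])              # meets U non-trivially

print(is_direct_sum(U, W))      # True:  R^3 = U + W is a direct sum
print(is_direct_sum(U, W_bad))  # False: U and W_bad intersect non-trivially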

Exercise 109. Show that the set of differentiable real-valued functions f on (0, 3) s.t. f 0 (2) = b is
subspace of R(0,3) iff b = 0.

Proof. Let this set of functions be denoted by V . Clearly, 00 (2) = b iff b = 0. For f, g ∈ V , see that
(f + g)0 (2) = (f 0 + g 0 )(2) = f 0 (2) + g 0 (2) = b + b = b iff b = 0. Finally, (λf )0 (2) = λf 0 (2) = λb = b iff
b = 0.

Exercise 110. Show that the set of differentiable real-valued functions f on (−4, 4) s.t. f′(−1) = 3f(2) is a subspace of R^(−4,4).

Proof. Let this set of functions be denoted by V . Clearly, f = 0 satisfies the condition 00 (−1) = 3 · 0(2).
For f, g, ∈ V , see that (f + g)0 (−1) = f 0 (−1) + g 0 (−1) = 3f (2) + 3g(2) = 3(f (2) + g(2)) = 3(f + g)(2).
Finally, (λf )0 (−1) = λf 0 (−1) = λ3f (2) = 3(λf )(2).

Exercise 111. Argue if R2 is subspace of C2 .

Proof. No, for subspace U of V , U uses the same addition and scalar multiplication as on V . So if
R2 ⊂ C2 is subspace, then i(1, 1) = (i, i) ∈ R2 but this is false.

Exercise 112. Real valued function f over R is periodic if f (x) = f (x + p) for some p for all x. Is the
set of periodic functions R → R subspace of RR ?

Proof. No. Take f(x) = sin x (period 2π) and g(x) = sin πx (period 2); we claim h = f + g is not periodic. If h had period p > 0, then so would h″, hence so would h″ + π²h = (π² − 1) sin x and h″ + h = (1 − π²) sin πx. Thus p would have to be a positive integer multiple of 2π and of 2 simultaneously, forcing π to be rational - contradiction. (Note the more obvious candidate sin 2x + cos x does not work as a counterexample, since it is 2π-periodic.) Hence the set of periodic functions is not closed under addition and is not a subspace of R^R.

Exercise 113. Prove that the intersection of subspaces U1 , U2 of V is subspace. Prove that the union of
subspaces U1 , U2 of V is subspace of V iff either of the U ’s are contained in the other.

Proof. See Exercise 38, with R → F.

Exercise 114. Argue for or against the following statements, where U, M, W are subspaces of V :

1. U + W = W + U ?

2. (U + M ) + W = U + (M + W )?

3. Does the addition of subspaces of V have additive identity and additive inverses?

Proof. 1. Yes, the proof is trivial (show subset relation in both directions).

2. Yes, the proof is trivial (show subset relation in both directions).

3. Yes, the additive identity is {0} (prove using the definition of subset addition, Definition 60). Only
the subspace {0} has additive inverse.

Exercise 115. U1 , U2 , W are subspaces of V . U1 + W = U2 + W =⇒ U1 = U2 ?

Proof. False, by the counterexample U1 = {0}, U2 = V, W = V (with V ≠ {0}): then U1 + W = U2 + W = V but U1 ≠ U2.

Exercise 116. U1 , U2 , W are subspaces of V . V = U1 ⊕ W = U2 ⊕ W =⇒ U1 = U2 ?

Proof. False by counterexample V = R2 , U1 = span((1, 0)), U2 = span((0, 1)), W = span((1, 1)) where
V = U1 ⊕ W = U2 ⊕ W . Refer to definition of spans (Definition 48). Prove by the direct sum check
(Theorem 85) or using dimensions of vector spaces discussed later (dimV = dimUi +dimW, i = 1, 2).

3.3.2 Finite Dimensional Vector Spaces
3.3.2.1 Vector Spans and Linear Independence

Linear spans were introduced in Definition 48. Replacing R → F,

Definition 99 (Vector Span).

span(v1, v2, · · · , vm) = { Σ_{i=1}^m ai vi | ai ∈ F, ∀i ∈ [m] }.

The span of {} is {0}.

Theorem 86. The span of a list of vectors in V is the smallest subspace of V containing all vectors in
the list.

Proof. Let vi, i ∈ [m] be vectors in V; we show span((vi)_{i∈[m]}) is a subspace of V. Clearly 0 is in the span, and the span is closed under linear combinations (see Theorem 27). Moreover, each vj, j ∈ [m] is trivially a linear combination of the vi's, so for a list S, span(S) contains each s ∈ S. Conversely, since subspaces are closed under addition and scalar multiplication, every subspace of V containing each of the vj's contains span(v1, · · · , vm). Therefore the span is the smallest subspace of V containing the spanning list.

Definition 100 (Finite-Dimensional Vector Space). A finite-dimensional vector space is a vector space
that is spanned by some list (by definition, a list is of finite length).

Definition 101 (Polynomial). A function p : F → F is a polynomial with coefficients in F if ∃ ai ∈ F, i ∈ [0, m], s.t.

p(z) = a0 + a1 z + a2 z^2 + · · · + am z^m (391)

for all z ∈ F. Denote the set of all polynomial functions with coefficients in F to be P(F). P(F) is vector
space (Definition 94). P(F) is subspace of FF , a function space (Definition 96).

Definition 102 (Identically-Zero Polynomial). A polynomial p ∈ P(F) is said to be identically zero if


it has all zero coefficients.

The coefficients of a polynomial are uniquely determined by the polynomial.

Definition 103 (Degree of Polynomial). Let p ∈ P(F) be a polynomial (Definition 101). p is said to have degree m if ∃ a0, · · · , am ∈ F, am ≠ 0, s.t.

p(z) = Σ_{i=0}^m ai z^i, (392)

and we write this as deg p = m. A polynomial that is identically zero has degree −∞.

Definition 104. For m non-negative integer, Pm (F) is the set of all polynomials (Definition 101) with
coefficients in F, and ∀p ∈ Pm (F), deg p ≤ m.

See that Pm (F) = span(1, z, · · · , z m ). P(F) is infinite-dimensional. For any list of elements of P(F),
let m be the highest degree of all the polynomials in that list, then z m+1 is not spanned by this set of
vectors and no list spans P(F). Linear independence is defined in Definition 52. The same definition
applies when R → F. The empty list is vacuously linearly independent. For a set of linearly independent

vectors, a vector belonging to the set spanned by that set of vectors has unique representation. See
from Theorem 31 that for a set of linearly dependent vectors in V , minimally one of the vectors may be
removed without affecting the span. This vector may be written as a linear combination of the other
vectors. The proof holds.
The contrapositive of Theorem 32 argues that the length of every linearly independent list of vectors
must be less than or equal to the length of the spanning list. Here, an alternative proof by construction
and more general statement is given.

Theorem 87. In a finite-dimensional vector space, the length of every linearly independent list of vectors
is less than or equal to the length of every spanning list of vectors.

Proof. Let ui , i ∈ [m] be linearly independent vectors in V and wi , i ∈ [n] span V . If we add u1 to wi ’s,
then u1 is a l.c. of vectors wi , i ∈ [n], and the new list is linearly dependent. Theorem 31 argues we
may remove one of the wi ’s, and let this removed vector be w. This new list u1 ∪ (W \w) spans V by
Theorem 29. We may add u2 and repeat the same algorithm, but since u1 , u2 is linearly independent,
we will remove another of the w’s, and let this be w0 . We can keep doing this until we are done with all
of the u’s. Each time we add one of the u’s, we remove one of the w’s, so there are at least as many w’s
as u’s.

The contrapositive logical equivalent statement of Theorem 87 is that a spanning list has length at
least as great as some linearly independent list.

Theorem 88. Every subspace of a finite dimensional vector space is finite-dimensional.

Proof. Suppose that V is finite dimensional, and that U is subspace of V . Since V is finite dimensional,
V is spanned by some finite set by definition. U is subset relation (Definition 97). So each vector in U
is in V , which can be written in terms of the finite set. Then U is spanned by the finite set.

Exercise 117. Show that

1. if we consider C as vector space over R, then (1 + i, 1 − i) is linearly independent, but

2. if we consider C as vector space over C, then (1 + i, 1 − i) is linearly dependent.

Proof. -

1. Let x, y ∈ R, and suppose x(1 + i) + y(1 − i) = 0. Then it follows that x + y = 0 and x − y = 0, which holds iff x = y = 0. Therefore, the list is linearly independent over R.

2. OTOH, see that i(1 + i) + 1(1 − i) = 0, so the list is linearly dependent over C.

Exercise 118. If {vi , i ∈ [m]} is linearly independent set, then is {λvi , i ∈ [m]}, λ 6= 0, λ ∈ F linearly
independent?
Proof. This is a trivial exercise: writing Σ_{i=1}^m ai (λvi) = Σ_{i=1}^m (λai) vi = 0, linear independence of the vi's gives λai = 0 for all i ∈ [m], and since λ ≠ 0 each ai = 0.

Exercise 119. If vi ∈ [m] is linearly independent in V , and w ∈ V , prove that if {vi + w | i ∈ [m]}
linearly dependent, then w ∈ span({vi , i ∈ [m]}).

Proof. Consider Σ_{i=1}^m ai (vi + w) = 0, so Σ_{i=1}^m ai vi + (Σ_{i=1}^m ai) w = 0. By linear dependence, some ai is non-zero. Assume Σ_{i=1}^m ai = 0; then Σ_{i=1}^m ai vi = 0, and by linear independence of the vi's all ai's are zero - contradiction - so Σ_{i=1}^m ai ≠ 0. We may then write w = −(1 / Σ_{i=1}^m ai) Σ_{i=1}^m ai vi ∈ span({vi, i ∈ [m]}).

Theorem 33 asserts that the addition of vector to a linearly independent set keeps the linear inde-
pendence invariant when the vector is not linear combination of the elements in the set.

Exercise 120. Suppose pi , i ∈ [0, m] are m + 1 polynomials in Pm (F) s.t. pj (2) = 0 for each j. Then
argue that p0 , p1 , · · · , pm are not linearly independent in P(F).

Proof. There are m + 1 vectors; assume them to be linearly independent. Since dim Pm(F) = m + 1 (established in the next subsections), m + 1 linearly independent vectors in Pm(F) form a basis (Theorem 93) and hence span Pm(F). But the constant polynomial 1 satisfies 1(2) = 1 ≠ 0, while every linear combination of the pj's vanishes at z = 2, so 1 is not in their span - contradiction.
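A concrete numeric illustration (my own sketch, with m = 3 and four arbitrarily chosen polynomials vanishing at 2): the 4 × 4 matrix of their coordinates w.r.t. 1, x, x^2, x^3 has rank at most 3, so the list is linearly dependent.

# sympy sketch (my own illustration, not from the text)
import sympy as sp
x = sp.symbols('x')
ps = [(x - 2), (x - 2)**2, (x - 2)**3, (x - 2)*(x + 1)]   # each p satisfies p(2) = 0

def coords(p):
    # coordinates of p w.r.t. the standard basis 1, x, x^2, x^3 of P3(R)
    c = sp.Poly(p, x).all_coeffs()[::-1]
    return c + [0] * (4 - len(c))

M = sp.Matrix([coords(p) for p in ps])
print(M.rank())   # 3 < 4: the four polynomials are linearly dependent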

3.3.2.2 Bases

Definition of basis for vector spaces were introduced in Definition 54. In particular basis set spans the
vector space V and is linearly independent. Standard basis was introduced in Definition 56. In Fn , the
same standard basis (0, · · · , 1, · · · , 0) is used. The list 1, z, · · · , z m is basis of Pm (F). Every element in
the vector space can be written as unique linear combination of the vectors in its basis. Theorem 34
asserts this. Proof follows, where the ci ’s, di ’s ∈ F in the general case. We show that this is in fact an
iff relationship.

Theorem 89. A list vi, i ∈ [n] of vectors in V is a basis of V iff ∀v ∈ V, v has the unique representation

v = Σ_{i=1}^n ai vi, ai ∈ F ∀i. (393)

Proof. As argued, the forward direction (that a basis implies unique representation of vector space elements) is proved by Theorem 34. On the other hand, suppose that every v ∈ V has a unique representation. Since every vector has at least some representation, clearly vi, i ∈ [n] spans V. We need to assert linear independence. Suppose ai ∈ F, i ∈ [n] and 0 = Σ_{i=1}^n ai vi. Trivially, ai = 0, i ∈ [n] satisfies the equation, and by uniqueness of representation it is the only solution. So we are done.

Theorem 90. Every spanning list in a vector space can be reduced to a basis of the vector space. Every
linearly independent list of vectors in a finite dimensional vector space can be extended to a basis of the
vector space.

Proof. This theorem simply formalizes the statements asserted in Exercise 41, where we prove them by construction via the algorithm stated there.

Corollary 7. Every finite dimensional vector space has basis.

Proof. By definition, a finite-dimensional vector space (Definition 100) has spanning list, and each span-
ning list can be reduced to basis (Theorem 90).

Theorem 91. For every subspace U of finite-dimensional vector space V , there is a subspace W of V
s.t. V = U ⊕ W .

Proof. Axler [3]. V is finite-dimensional, so every subspace of it is finite-dimensional (Theorem 88), so U is finite-dimensional. Let ui, i ∈ [m] be a basis of U; it may be extended to a basis of V (Theorem 90). Let this extension be the vectors wj, j ∈ [n], s.t. {ui | i ∈ [m]} ∪ {wj | j ∈ [n]} is a basis of V. Let W = span({wj, j ∈ [n]}). By the definition of direct sum (Definition 98) and the condition for direct sums (Theorem 85), we need to show that V = U + W and U ∩ W = {0}. Suppose v ∈ V; then ∃ ai ∈ F, i ∈ [m] and bj ∈ F, j ∈ [n] s.t.

v = Σ_{i=1}^m ai ui + Σ_{j=1}^n bj wj. (394)

Then v = u + w where u ∈ U, w ∈ W, which implies v ∈ U + W. It follows that V = U + W. Next, suppose v ∈ U ∩ W; then ∃ ai ∈ F, i ∈ [m] and bj ∈ F, j ∈ [n] s.t.

v = Σ_{i=1}^m ai ui = Σ_{j=1}^n bj wj, (395)

so that Σ_{i=1}^m ai ui − Σ_{j=1}^n bj wj = 0. Since the combined list of ui's and wj's is linearly independent, each of the ai's and bj's is zero, so v = 0 and we are done.

Exercise 121. Let U be the subspace of C^5 defined by

U = {(z1, · · · , z5) ∈ C^5 | 6z1 = z2, z3 + 2z4 + 3z5 = 0}. (396)

Find a basis of U, and extend this basis to a basis of C^5. Find a subspace W of C^5 s.t. C^5 = U ⊕ W.

Proof. Gaussian elimination and the back-substitution method hold (Exercise 4). Solve the HLS with coefficient matrix

[ 6 −1 0 0 0 ]
[ 0  0 1 2 3 ]. (397)

U is the nullspace of this HLS; solve to get U = span(v1, v2, v3) where v1 = (1/6, 1, 0, 0, 0), v2 = (0, 0, −2, 1, 0), v3 = (0, 0, −3, 0, 1). Extend this basis to a basis of C^5 with the vectors e1, e3 from the standard basis (Definition 56, over C^5) and set W = span(e1, e3).
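A small sympy sketch of this computation (my own illustration; the variable names are not from the text): compute the nullspace basis of the coefficient matrix in (397) and confirm that appending e1 and e3 gives five linearly independent vectors, hence a basis of C^5.

# sympy sketch (my own illustration)
import sympy as sp

A = sp.Matrix([[6, -1, 0, 0, 0],
               [0,  0, 1, 2, 3]])
basis_U = A.nullspace()                  # three vectors spanning U
print([list(v) for v in basis_U])        # matches v1, v2, v3 above

e1 = sp.Matrix([1, 0, 0, 0, 0])
e3 = sp.Matrix([0, 0, 1, 0, 0])
M = sp.Matrix.hstack(*basis_U, e1, e3)
print(M.rank())                          # 5: the five vectors form a basis of C^5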

Exercise 122. Prove or disprove the statement: there exists some basis p0 , p1 , p2 , p3 ∈ P3 (F) s.t. none
of the polynomials pi , i ∈ [0, 3] have degree of 2.

Proof. Clearly, 1, x, x^3 is a sublist of the standard basis of P3(F) and is linearly independent. Clearly, x^2 + x^3 ∉ span(1, x, x^3), so Theorem 33 asserts 1, x, x^2 + x^3, x^3 is linearly independent. Moreover x^2 = (x^2 + x^3) − x^3, so this list spans P3(F). It is therefore a basis of P3(F) in which no polynomial has degree two.

Exercise 123. Suppose vi , i ∈ [1, 4] is basis of V , then prove that

v1 + v2 , v2 + v3 , v3 + v4 , v4 (398)

is also basis of V .

Proof. First, linear independence: suppose a(v1 + v2) + b(v2 + v3) + c(v3 + v4) + d v4 = 0. Collecting coefficients w.r.t. the basis v1, · · · , v4 gives a v1 + (a + b) v2 + (b + c) v3 + (c + d) v4 = 0, and by unique basis representation (Theorem 89) a = 0, a + b = 0, b + c = 0, c + d = 0, so a = b = c = d = 0. Second, the list spans V: v4 is in its span, hence v3 = (v3 + v4) − v4 is, hence v2 = (v2 + v3) − v3 is, hence v1 = (v1 + v2) − v2 is, so span(v1 + v2, v2 + v3, v3 + v4, v4) ⊇ span(v1, v2, v3, v4) = V. A linearly independent spanning list is a basis.

Exercise 124. Suppose that V = U ⊕ W . Suppose ui , i ∈ [m] is basis for U , and wi , i ∈ [n] is basis for
W . Then prove that u1 , · · · um , w1 , · · · wn is basis of V .
Proof. First, we show linear independence. Consider Σ_{i=1}^m ai ui + Σ_{j=1}^n bj wj = 0, so Σ_{i=1}^m ai ui = −Σ_{j=1}^n bj wj. This common value lies in U ∩ W; since the sum is direct, U ∩ W = {0} (Theorem 85), so Σ_{i=1}^m ai ui = Σ_{j=1}^n bj wj = 0, and by linear independence of each basis all ai's and bj's are zero. Hence u1, · · · , um, w1, · · · , wn is linearly independent. Finally, by definition of direct sum, ∀v ∈ V ∃ u ∈ U, w ∈ W s.t. v = u + w, so u1, · · · , um, w1, · · · , wn spans V.

3.3.2.3 Dimensions of a Vector Space

Theorem 36 argues that the length of a basis is fixed. A more general proof that does not involve Euclidean spaces is stated here.

Theorem 92. Any two bases of a finite dimensional vector space have the same length.

Proof. If V is finite-dimensional, and B1 , B2 are bases of V , then by definition of basis, B1 is linearly


independent and has size at most the length of the spanning list B2 by Theorem 87. Invoke symmetry.

Dimensions of a vector space were defined as the length of the spanning basis (Definition 57). We
denote this dim(V) = dim V and so on. See that Pm(F) has dimension m + 1. Every finite dimensional vector space has a finite spanning list. Every subspace of a finite dimensional vector space is finite-dimensional. Hence the dimension is well defined on all of these objects.
If V is finite-dimensional and U is subspace of V , then dim U ≤ dim V . Additionally, equality holds
iff they are the same space, that is U = V . This bound on subspace dimension is proven in Theorem 38.
Statements asserted in Theorem 36 are shown without the dependence on Euclidean spaces.

Theorem 93. Suppose V is finite dimensional. Then every linearly independent list L satisfying |L| = dim(V) is a basis of V.

Proof. If dim V = n and v1, · · · , vn are linearly independent, then vi, i ∈ [n] may be extended to a basis of V by Theorem 90. Any basis of V has length n (Theorem 92), so the extension adds no vectors; hence the original list is already a basis.

Exercise 125. Show that 1, (x − 5)^2, (x − 5)^3 is a basis of the subspace U of P3(R) defined by

U = {p ∈ P3 (R) | p0 (5) = 0}. (399)

Proof. It is trivial to see that 1, (x − 5)2 , (x − 5)3 ∈ U . Let a, b, c ∈ R and suppose that for all x,

a + b(x − 5)2 + c(x − 5)3 = 0. (400)

Matching coefficients, the x^3 term gives c = 0, then the x^2 term gives b = 0, and then a = 0. The list is linearly independent in U, so dim U ≥ 3. But dim U ≠ 4: we can pick a vector (say x) in P3(R) but not in U, so U is a proper subspace and dim U ≤ 3 (Theorem 38). It follows that dim U = 3, and three linearly independent vectors in a three-dimensional space form a basis (Theorem 93).

In the Euclidean space, the equivalency relations for basis, spans and linear independence are given
by Theorem 37. A generalized statement in the abstract space is stated:

Theorem 94. Suppose V is finite dimensional. Then every spanning list of vectors in V of length dim V is a basis of V.

Proof. If dim V = n and v1, · · · , vn span V, then the list may be reduced to a basis by Theorem 90. Since every basis must have length n (Theorem 92), the reduction removes nothing, so we started with a basis.

If U1 , U2 are subspaces of finite dimensional vector space, then Theorem 41 asserts that

dim(U1 + U2 ) = dim U1 + dim U2 − dim(U1 ∩ U2 ). (401)

Exercise 126. Suppose vi , i ∈ [m] is linearly independent in V , w ∈ V , then prove that

dim span(v1 + w, · · · , vm + w) ≥ m − 1. (402)

Proof. The span of m vectors trivially has dimension at most m; here we show that at most 'one dimension is destroyed'. See that v2 − v1 = (v2 + w) − (v1 + w), so v2 − v1 ∈ span(v1 + w, · · · , vm + w), and similarly vi − v1 ∈ span(v1 + w, · · · , vm + w) for each i. The vectors v2 − v1, · · · , vm − v1 are linearly independent: if Σ_{i=2}^m ai (vi − v1) = 0 then Σ_{i=2}^m ai vi − (Σ_{i=2}^m ai) v1 = 0, and linear independence of the vi's forces all ai = 0. Hence the dimension is ≥ m − 1.

Exercise 127. If U, W are subspaces of R8 , Show that

(dim U = 3) ∧ (dim W = 5) ∧ (U + W = R8 ) =⇒ R8 = U ⊕ W. (403)

Proof.

dim(U ∩ W ) = dim U + dim W − dim(U + W ) = 8 − dim R8 = 0 (404)

iff U ∩ W = {0} iff R8 = U ⊕ W .

Exercise 128. Suppose dim U = dim W = 4 and U, W are subspaces in C6 . Prove that there exists two
vectors in U ∩ W s.t neither of these vectors is a scalar multiple of the other.

Proof.

dim(U ∩ W) = dim U + dim W − dim(U + W) = 8 − dim(U + W) ≥ 8 − dim C^6 = 2. (405)

Since dim(U ∩ W) ≥ 2, a basis of U ∩ W contains two linearly independent vectors, and neither of these is a scalar multiple of the other.

Exercise 129. If Ui, i ∈ [m] are subspaces of finite dimensional V and Σ_{i=1}^m Ui is a direct sum, then

dim(⊕_{i=1}^m Ui) = Σ_{i=1}^m dim(Ui). (406)

Proof. Recursively apply the inference from Exercise 124.

Exercise 130. Prove or disprove the inclusion-exclusion relation

dim(U1 + U2 + U3 ) = dim U1 + dim U2 + dim U3 (407)


−dim(U1 ∩ U2 ) − dim(U1 ∩ U3 ) − dim(U2 ∩ U3 ) + dim(U1 ∩ U2 ∩ U3 ).

Proof. Proof by counterexample [1]. Let V = R^2, and define U1 = {(x, 0) : x ∈ R}, U2 = {(0, y) : y ∈ R}, U3 = {(x, x) : x ∈ R}. Then the left side is dim(U1 + U2 + U3) = dim R^2 = 2, while every pairwise (and the triple) intersection is {0}, so the right side is 1 + 1 + 1 − 0 − 0 − 0 + 0 = 3, and the proposed relation fails.
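The failing identity can be checked numerically by computing each dimension as a matrix rank. A numpy sketch (my own illustration):

# numpy sketch (my own illustration)
import numpy as np

U1 = np.array([[1.0], [0.0]])           # {(x, 0)}
U2 = np.array([[0.0], [1.0]])           # {(0, y)}
U3 = np.array([[1.0], [1.0]])           # {(x, x)}

lhs = np.linalg.matrix_rank(np.hstack([U1, U2, U3]))   # dim(U1 + U2 + U3) = 2
rhs = 1 + 1 + 1 - 0 - 0 - 0 + 0                        # all intersections are {0}
print(lhs, rhs)   # 2 3 -> the proposed inclusion-exclusion identity fails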

3.3.3 Linear Maps/Transformations
3.3.3.1 Vector Space of Linear Maps

Linear maps were first introduced in Section 3.2.7. A linear map is a mapping from one vector space V to another W. Let V, W be vector spaces over some field F for the remainder of this section. Linear
maps are defined in Definition 86. Equivalent statement(s) for linear maps are given in Exercise 92. For
the abstract vector space, we adopt the equivalency conditions formalized here:

Definition 105. A linear map V → W is a function T : V → W s.t. the conditions of (i) additivity, T(u + v) = Tu + Tv for all u, v ∈ V, and (ii) homogeneity, T(λv) = λ(Tv) for all λ ∈ F and all v ∈ V, hold.

Definition 106. The set of all linear maps from V → W is denoted L(V, W ).

The zero map 0 is a linear map that takes each element of a vector space to the additive identity of
another vector space. The identity map 1 is a function on some vector space that maps an element to
itself. We state some other useful linear maps:

Exercise 131. 1. Differentiation is a linear map D ∈ L(P(R), P(R)) denoted by Dp = p′. By the properties of linear maps, (f + g)′ = f′ + g′ and (λf)′ = λf′ when f, g are differentiable functions and λ ∈ R.

2. An integration linear map on the unit interval can be written T ∈ L(P(R), R), denoted by Tp = ∫_0^1 p(x) dx.

3. Multiplication by x2 is a linear map T ∈ L(P(R), P(R)) denoted (T p)(x) = x2 p(x) for all x ∈ R.

4. The backward shift linear map can be denoted T ∈ L(F∞ , F∞ ) that satisfies

T (x1 , x2 , x3 , · · · ) = (x2 , x3 , · · · ). (408)

5. Every linear map T ∈ L(F^n, F^m), where n, m ∈ Z+, can be written in the generalized form

T(x1, · · · , xn) = (A1,1 x1 + · · · + A1,n xn, · · · , Am,1 x1 + · · · + Am,n xn), (409)

where Aj,k ∈ F for j ∈ [m], k ∈ [n]. Then we say that A is the standard matrix of T (see the analogous Definition 86 and the sketch following this list).
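The generalized form (409) is exactly multiplication by the standard matrix. A minimal numpy sketch (my own illustration; A and x are arbitrary choices) showing that the component formula and the matrix-vector product agree:

# numpy sketch (my own illustration)
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])          # a linear map T in L(R^3, R^2)

def T(x):
    # component form of (409): j-th output is sum_k A[j, k] * x[k]
    return np.array([sum(A[j, k] * x[k] for k in range(A.shape[1]))
                     for j in range(A.shape[0])])

x = np.array([3.0, -1.0, 2.0])
print(T(x))        # [ 1. -3.]
print(A @ x)       # identical: the standard matrix encodes the map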

Recall that we argued in Exercise 93 that the images T (ei ) for i ∈ [n] completely define the structure
of T . In particular, the transformation T on any basis determine the behavior of T . Furthermore, we
can find a linear map that takes on whatever values we wish on vectors in some basis. We make clear
this statement with the following theorem:

Theorem 95. Suppose vi , i ∈ [n] form basis of V , and that wi ∈ W, i ∈ [n]. Then there exists a unique
linear map T : V → W s.t. T vj = wj for each j ∈ [n].

Proof. Axler [3]. Existence is shown first. Define T : V → W s.t.

T(Σ_{i=1}^n ci vi) = Σ_{i=1}^n ci wi, (410)

where the ci are arbitrary elements of F. Since the vi's are a basis, this is a valid function T from V → W. For each j, take cj = 1 and ci = 0 for i ≠ j, and we have T vj = wj. If u, v ∈ V with u = Σ_{i=1}^n ai vi and v = Σ_{i=1}^n ci vi, we may write

T(u + v) = T((a1 + c1) v1 + · · · + (an + cn) vn) (411)
= (a1 + c1) w1 + · · · + (an + cn) wn (412)
= (Σ_{i=1}^n ai wi) + (Σ_{i=1}^n ci wi) (413)
= Tu + Tv. (414)

Similarly, see that for λ ∈ F and v = Σ_{i=1}^n ci vi, we can write T(λv) = T(Σ_{i=1}^n λci vi) = λ(Σ_{i=1}^n ci wi) = λTv. Then T is a valid linear map from V → W. To show uniqueness, suppose that T ∈ L(V, W) and T vj = wj for j ∈ [n]. If ci ∈ F, i ∈ [n], then the homogeneity property implies T(cj vj) = cj wj, and additivity implies T(Σ_{i=1}^n ci vi) = Σ_{i=1}^n ci wi. Then T is uniquely determined on span(v1, · · · , vn), and since the vi's are a basis, T is uniquely determined on V.
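When V = F^n and the chosen vi happen to form a basis of F^n, the unique map of Theorem 95 can be written down explicitly: stacking the vi as the columns of an invertible matrix and the targets wi as columns of another matrix, the standard matrix A = Wmat · Vmat^{-1} satisfies A vj = wj. A short numpy sketch under this assumption (my own illustration; all names below are hypothetical):

# numpy sketch (my own illustration)
import numpy as np

Vmat = np.array([[1.0, 1.0], [0.0, 1.0]])               # basis v1=(1,0), v2=(1,1) of R^2
Wmat = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])   # prescribed images w1, w2 in R^3

A = Wmat @ np.linalg.inv(Vmat)                           # standard matrix of the unique T
print(np.allclose(A @ Vmat[:, 0], Wmat[:, 0]))           # True: T v1 = w1
print(np.allclose(A @ Vmat[:, 1], Wmat[:, 1]))           # True: T v2 = w2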

We can define addition and scalar multiplication on L(V, W ). For S, T ∈ L(V, W ), λ ∈ F, the sum
S + T is said to be function satisfying (S + T )(v) = Sv + T v and the product λT is said to be function
satisfying (λT )(v) = λ(T v) for all v ∈ V . See that L(V, W ) is a valid vector space with these properties
defined (Definition 94). Although in general, there is no meaning for the product of two elements of some
abstract vector space, in the case of linear maps, the product implies a function composition. From here
on, let U, V, W be vector spaces.

Definition 107. Let T ∈ L(U, V ), S ∈ L(V, W ) then the product ST ∈ L(U, W ) is the function satisfying
(ST )(u) = S(T (u)) for all u ∈ U . Note that this definition requires that T maps into the domain of S.

Result 5. Define linear maps S1 , S2 , T, T1 , T2 , T3 , identity map 1. Assuming the linear mappings are
all valid, then linear maps satisfy associativity, identity and distributive properties. In other words,
(T1 T2 )T3 = T1 (T2 T3 ), T 1 = 1T = T and (S1 + S2 )T = S1 T + S2 T, S(T1 + T2 ) = ST1 + ST2 .

Linear maps take 0 → 0. For any linear map T , T (0) = 0. See this by writing T (0) = T (0 + 0) =
T (0) + T (0) =⇒ T (0) = 0.

Exercise 132. Suppose T ∈ L(Fn , Fm ), then show that T has a standard matrix A (Exercise 131).

Proof. Define the systems T (ei ) = A·,i for i ∈ [n] and invoke linearity in the computation of T (x1 , · · · , xn ).
See proof of Exercise 92.

Exercise 133. Let T ∈ L(V, W ) and vi , i ∈ [m] in V . If T vi , i ∈ [m] independent in W , prove that vi ’s
are linearly independent.

Proof. This is proved directly in Exercise 39. Take P to be a linear map rather than some matrix (part
1. does not use the square property).

Exercise 134. Suppose U is a subspace of V, U ≠ V. If S ∈ L(U, W), S ≠ 0, and T : V → W is defined by

Tv = Sv for v ∈ U, and Tv = 0 for v ∈ V, v ∉ U. (415)

Prove T is not linear map on V .

Proof. Since S is non-zero function, ∃u ∈ U s.t. Su 6= 0. Let there be v ∈ V, v 6∈ U . Since we may
express v = (u + v) − u, and v is not in U , then (u + v) must not be in U . Then T (u + v) = 0 by
definition, and suppose T is linear. We get 0 = T (u + v) = T u + T v = Su 6= 0. This is contradiction.

Exercise 135. Suppose V is finite-dimensional. Prove that every linear map on subspace of V can be
extended to a linear map on V . In particular, if U is subspace of V , S ∈ L(U, W ), then ∃T ∈ L(V, W )
s.t. T u = Su for all u ∈ U .

Proof. Let ui , i ∈ [m] be basis of subspace U , and extend this to basis {ui : i ∈ [m]} ∪ {vj : j ∈ [m + 1, n]}
of V . Theorem 95 asserts that the linear map T ui = Sui , T vj = 0 for i ∈ [m], j ∈ [m + 1, n] is well
defined. Clearly this satisfies T u = Su for u ∈ U .

Exercise 136. If V is finite dimensional and dim V > 0, and W is infinite dimensional, then L(V, W )
is infinite dimensional.

Proof. [1] We use the result that a vector space is infinite dimensional iff there is some infinite sequence
of vectors in the vector space s.t. for all positive integer m, the sequence up to m-th index is linearly
independent. Hence, for infinite dimensional W , there exists some sequence w1 , w2 , · · · s.t. wi , i ∈ [m]
are linearly independent for m ∈ Z+ . Let Ti ∈ L(V, W ), and if we show that Ti , i ∈ [m] is linearly
independent for arbitrary positive integer m, then (by the same assertion), L(V, W ) must be infinite
dimensional. Since V is finite dimensional, let its basis be vi, i ∈ [n]. Define Ti by Ti(v1) = wi and, say, Ti(vj) = 0 for j ∈ [2, n] (this is well defined by Theorem 95), and consider Σ_{i=1}^m ai Ti = 0. Since we may write (Σ_{i=1}^m ai Ti)(v1) = 0 = Σ_{i=1}^m ai wi, and the wi's are linearly independent, the ai's are all zero, so the Ti's are linearly independent and we are done.

3.3.3.2 Vector Space Associated with Linear Maps

Column spaces and nullspaces were introduced in Definition 61 and Definition 64. We saw they were
related to ranges (Definition 88) and kernels (Definition 90). While we used some terms to refer to vector
spaces associated with matrices, and some to linear transformations, Exercise 131 shows that they refer
to the same vector space. We shall use them interchangeably. In particular,

Definition 108. For T ∈ L(V, W ), the nullspace of T , denoted null T is defined

null T = {v ∈ V : T v = 0}.

For D ∈ L(P(R), P(R)), the differentiation map defined by Dp = p0 has null space equal to the set
of constant functions.
We know that the nullspace of the standard matrix representation of a linear transformation is a
subspace. A generalized statement with abstract spaces goes:

Theorem 96. Suppose T ∈ L(V, W ), then null T is subspace of V .

Proof. T (0) = 0 trivially, so 0 ∈ null T . Additionally, see that for u, v ∈ null T , T (u+v) = T (u)+T (v) =
0 + 0 = 0 since T is linear map and hence u + v ∈ null T . Similar proof works for scalar multiplication
of λ ∈ F. Then the proof follows by definition of subspace (Definition 97).

Definition 109. A function T : V → W is injective (one-to-one) if T u = T v implies that u = v. The


contrapositive that is often useful states that u 6= v =⇒ T u 6= T v.

Theorem 97 (Equivalent Statements for Injectivity). Linear map T ∈ L(V, W ) is injective iff null T =
{0}. We write this as inj(T ) if T is injective.

Proof. Suppose that T is injective, we know {0} ⊆ null T since it is subspace. OTOH, if v ∈ null T , see
T (v) = 0 = T (0) and by injectivity, v = 0 so null T ⊆ {0}. Now suppose null T = {0}, then let u, v ∈ V
map to the same output. That is,

T u = T v =⇒ 0 = T u − T v = T (u − v), (416)

and since u − v ∈ null T , u − v = 0 and u = v. T is injective.

Definition 88 stated for arbitrary vector spaces specifies the definition of range as follows:

Definition 110. For linear map T ∈ L(V, W ), range of T is written

range T = {T v | v ∈ V }. (417)

Suppose D ∈ L(P(R), P(R)) is the differentiation map Dp = p0 , then since ∀q ∈ P(R), ∃p ∈ P(R)
s.t. p0 = q, so range D = P(R).
We know that range is column space of its standard matrix Am×n (Theorem 79) - it is subspace of
m
R . We work with abstract spaces.

Theorem 98. If T ∈ L(V, W ), then range T is subspace of W .

Proof. Clearly T (0) = 0 ∈ rangeT . If T vi = wi , then T (v1 +v2 ) = T v1 +T v2 = w1 +w2 , rangeT is closed
under addition. Write a similar proof for scalar multiplication by λ ∈ F, that is T (λv) = λT v = λw.
Invoke Definition 97.

Definition 111. Function T : V → W is surjective (onto) if it has range equals W . We write this as
sur(T ) if T is surjective.

The differentiation map D ∈ L(P5 (R), P5 (R)) is not surjective but D ∈ L(P5 (R), P4 (R)) is surjective.
Rank-nullity Theorem 81 is restated in terms of abstract vector spaces:

Theorem 99. If V is finite-dimensional and T ∈ L(V, W ), then range T is finite dimensional and

dim V = dim(null T ) + dim(range T ). (418)

Proof. Let ui, i ∈ [m] be a basis of null T; then dim(null T) = m, and the linearly independent list ui, i ∈ [m] may be extended to a basis {ui | i ∈ [m]} ∪ {vi | i ∈ [n]} of V by Theorem 90. Then dim V = m + n. If we show dim(range T) = n, we are done. We show that T vi, i ∈ [n] is a basis for range T. Let v ∈ V; then there exists a representation v = Σ_{i=1}^m ai ui + Σ_{i=1}^n bi vi, and applying T we get Tv = 0 + Σ_{i=1}^n bi T vi. So T vi, i ∈ [n] spans range T. We show linear independence. Consider Σ_{i=1}^n ci T vi = 0 =⇒ T(Σ_{i=1}^n ci vi) = 0 =⇒ Σ_{i=1}^n ci vi ∈ null T, and since the ui's span null T, there exists a representation

Σ_{i=1}^n ci vi = Σ_{i=1}^m di ui, (419)

and by linear independence of the vi's and ui's together, the coefficients ci are all zero.
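Rank-nullity is easy to sanity-check numerically for a map given by a matrix: the rank is dim(range T) and the number of nullspace basis vectors is dim(null T). A sympy sketch (my own illustration, arbitrary matrix):

# sympy sketch (my own illustration)
import sympy as sp

A = sp.Matrix([[1, 2, 3, 4],
               [2, 4, 6, 8],
               [0, 1, 1, 0]])
nullity = len(A.nullspace())            # dim(null T)
rank = A.rank()                         # dim(range T)
print(nullity, rank, nullity + rank == A.cols)   # 2 2 True, with dim V = 4 columns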

Theorem 100. Suppose V, W are finite dimensional vector spaces s.t. dim V > dim W. Then ∄ T ∈ L(V, W) s.t. T is injective.

Proof. dim(null T ) = dim V − dim(range T ) ≥ dim V − dim W > 0, but inj(T ) =⇒ null T = {0}
(Theorem 97).

Theorem 101. Suppose V, W are finite dimensional vector spaces s.t. dim V < dim W. Then ∄ T ∈ L(V, W) s.t. T is surjective.

Proof. dim(range T ) = dim V − dim(null T ) ≤ dimV < dimW , but sur(T ) =⇒ range T = W by
Definition 111.

A HLS (Definition 18) with more variables than equations have non-zero solutions. We know this
from Lemma 2. Consider the linear map T represented by the coefficient matrix of the HLS, with relation
given by Exercise 131. If it has non-zero solutions, then null T is bigger than {0}. In other words, it
is not injective. This is consistent with Theorem 100, since T ∈ L(Fn , Fm ) and n > m. On the other
hand, an inhomogeneous system of linear equations with more equations than variables has no solution
for some choice of the constant matrix. Here n < m, and T is not surjective.

Exercise 137. Suppose V is vector space and S, T ∈ L(V, V ) s.t. range S ⊂ null T . Prove that
(ST )2 = 0.

Proof. (ST )2 (v) = ST S(T v) = ST (Sw) = S0 = 0 for v, w ∈ V and T v = w.

Exercise 138. Give example of linear map T : R4 → R4 s.t range T = null T .

Proof. Let ei , i ∈ [4] be the standard basis, and define T by

T e1 = e3 , T e2 = e4 , T e3 = T e4 = 0. (420)
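With respect to the standard basis, this map has the standard matrix below, and one can check numerically that its column space and null space coincide. A sympy sketch (my own illustration):

# sympy sketch (my own illustration)
import sympy as sp

A = sp.Matrix([[0, 0, 0, 0],
               [0, 0, 0, 0],
               [1, 0, 0, 0],
               [0, 1, 0, 0]])   # T e1 = e3, T e2 = e4, T e3 = T e4 = 0
col = A.columnspace()
nul = A.nullspace()
print([list(v) for v in col])   # [[0, 0, 1, 0], [0, 0, 0, 1]]
print([list(v) for v in nul])   # [[0, 0, 1, 0], [0, 0, 0, 1]]  -> range T = null T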

Exercise 139. Prove ∄ T : R^5 → R^5 s.t. range T = null T.

Proof. Rank-nullity asserts that any linear map T on R^5 must satisfy dim(R^5) = dim null T + dim range T = 5. If range T = null T, the left side would equal 2 dim null T, which is even - impossible since 5 is odd.
Exercise 140. Fix vi, i ∈ [m] in V. Prove that for T ∈ L(F^m, V) s.t. T(z) = Σ_{i=1}^m zi vi, if vi, i ∈ [m] is linearly independent, then inj(T).

Proof. If the vi's are linearly independent, then T(z) = Σ_{i=1}^m zi vi = 0 implies zi = 0 for all i ∈ [m], which implies z = 0 and null T = {0}.

Exercise 141. Suppose V, W are finite dimensional with 2 ≤ dim V ≤ dim W . Show that

{T ∈ L(V, W ) | ¬inj(T )} (421)

is not subspace of L(V, W ).

Proof. Let vi, i ∈ [n] be a basis for V and wj, j ∈ [m] be a basis for W, and define T1, T2 ∈ L(V, W) by T1 v1 = 0, T1 v2 = w2, T1 vi = (1/2) wi for i ∈ [3, n], and T2 v1 = w1, T2 v2 = 0, T2 vi = (1/2) wi for i ∈ [3, n]. Clearly ¬inj(T1) ∧ ¬inj(T2), but (T1 + T2) vi = wi for i ∈ [n], which is injective since the wi's are linearly independent. See this by writing (T1 + T2)(Σ_{i=1}^n ai vi) = Σ_{i=1}^n ai wi = 0; use linear independence of the w's to get ai = 0, and hence Σ_{i=1}^n ai vi = 0.

Exercise 142. Prove that if T ∈ L(V, W ), inj(T ) and vi , i ∈ [n] linearly independent vectors in V , then
T vi , i ∈ [n] are linearly independent in W (compare this with Exercise 133).
Proof. Suppose T is injective, and consider Σ_{i=1}^n ai T vi = 0. Then T(Σ_{i=1}^n ai vi) = 0, so Σ_{i=1}^n ai vi ∈ null T = {0}, i.e. Σ_{i=1}^n ai vi = 0. By linear independence of the vi's, the ai's are zero and the result follows.

Exercise 143. If vi , i ∈ [n] span V and T ∈ L(V, W ), show that T vi ’s span range T .
Proof. Since the vi's span V, any v ∈ V may be written v = Σ_{i=1}^n ai vi, with transformation Tv = T(Σ_{i=1}^n ai vi) = Σ_{i=1}^n ai T vi, so range T ⊂ span((T vi)_{i∈[n]}). On the other hand, since T vi ∈ range T for all i ∈ [n] and range T is a subspace closed under linear combinations, span((T vi)_{i∈[n]}) ⊂ range T and we are done.

Exercise 144. Show that inj(Si ), ∀i ∈ [n] =⇒ inj(S1 S2 · · · Sn ).

Proof. We show null{S1 S2 · · · Sn } = {0}. Since S1 injective, then S1 (S2 · · · Sn v) = 0 implies S2 · · · Sn v =


0, which implies S3 · · · Sn v = 0 by injectivity of S2 , so on and so forth.

Exercise 145. If V is finite-dimensional and T ∈ L(V, W ), then prove ∃U subspace of V satisfying

U ∩ null T = {0}, range T = {T u : u ∈ U }. (422)

Proof. We know that null T is subspace, so ∃U s.t. V = null T ⊕ U by Theorem 91. Then Theorem
85 asserts null T ∩ U = {0}. Obviously {T u : u ∈ U } ⊂ range T . Next, let t ∈ range T , then let v be
the vector mapping to t s.t. T v = t, and by direct sum decomposition we can write T (v := n + u) =
T n + T u = t ∈ {T u : u ∈ U }, where n ∈ null T, u ∈ U , so range T ⊂ {T u : u ∈ U }.

Exercise 146. Prove that if there exists a linear map T on V whose null space and range are both finite dimensional, then V must be finite dimensional.

Proof. Let ui, i ∈ [m] span null T and wj, j ∈ [n] span range T. Note that for each wj ∈ range T, ∃ vj ∈ V s.t. wj = T vj. See that ∀v ∈ V, Tv ∈ range T implies that for some constants bj, Tv = Σ_{j=1}^n bj wj = Σ_{j=1}^n bj T vj = T(Σ_{j=1}^n bj vj). By linearity, T(v − Σ_{j=1}^n bj vj) = 0, so v − Σ_{j=1}^n bj vj ∈ null T, and for some choice of constants ki, v − Σ_{j=1}^n bj vj = Σ_{i=1}^m ki ui. Then move the summation term to the RHS and see that v is spanned by the vj's and ui's, so V is finite dimensional.

Exercise 147. Suppose V, W finite dimensional, then prove ∃T ∈ L(V, W ) and inj(T ) iff dimV ≤
dimW .

Proof. Suppose inj(T); then rank-nullity asserts dim V = dim null T + dim range T = dim range T ≤ dim W. On the other hand, suppose dim V ≤ dim W; let vi, i ∈ [m] be a basis of V and wj, j ∈ [n] a basis of W, so m ≤ n. For all i ∈ [m], let T vi = wi (well-defined by Theorem 95). Since the wi's are linearly independent, writing T(Σ_{i=1}^m ai vi) = Σ_{i=1}^m ai wi = 0 forces ai = 0 for all i ∈ [m], hence Σ_{i=1}^m ai vi = 0 and T is injective.

Exercise 148. Suppose V, W finite dimensional, then prove ∃T ∈ L(V, W ) and sur(T ) iff dimV ≥
dimW .

Proof. Mirror proof in Exercise 147. Suppose sur(T ), dim V = dim null T + dim range T = dim null T +
dim W ≥ dim W . Suppose dim V ≥ dim W , then let vj , j ∈ [m] be basis of V , wi , i ∈ [n] be basis of W
and we have m ≥ n. For all i ∈ [n], let T vi = wi , and see range T = W .

Exercise 149. Suppose V, W finite dimensional, and that U is subspace of V . Prove ∃T ∈ L(V, W ) s.t.
null T = U iff dim U ≥ dim V − dim W .

Proof. dim V = dim null T + dim range T. Suppose T exists s.t. null T = U. Then dim U = dim V − dim range T ≥ dim V − dim W. OTOH, suppose dim U ≥ dim V − dim W, and let ui, i ∈ [m] be a basis of U. Extend this to a basis {ui : i ∈ [m]} ∪ {vj : j ∈ [n]} of V, and let wk, k ∈ [p] be a basis of W. See that p ≥ n, since n = dim V − dim U ≤ dim W = p. Define the linear map T by T ui = 0, T vj = wj for all i ∈ [m], j ∈ [n] (Theorem 95). Then null T = U and we are done.

Exercise 150. Suppose W finite dimensional and T ∈ L(V, W ), then prove that inj(T ) iff ∃S ∈ L(W, V )
s.t. ST = 1 ∈ L(V ).

Proof. Axler [2]. Suppose inj(T). Define S′ : range T → V satisfying S′(Tv) = v. By injectivity, each element in range T is represented in the form Tv for a unique v, so S′ is well defined. S′ may be extended to a linear map S ∈ L(W, V) (Exercise 135, with W finite dimensional). Then for all v ∈ V, STv = S′Tv = v and ST is 1. OTOH, if ∃S ∈ L(W, V) s.t. ST = 1 ∈ L(V), then for u, v ∈ V with Tu = Tv, see that u = STu = STv = v, so inj(T).

Exercise 151. Suppose V finite dimensional and T ∈ L(V, W ), then prove that sur(T ) iff ∃S ∈ L(W, V )
s.t. T S = 1 ∈ L(W, W ).

Proof. Axler [2]. Suppose sur(T); then W = range T is finite dimensional, and let wi, i ∈ [n] be a basis of W. For each j, ∃ vj ∈ V satisfying wj = T vj. Define S ∈ L(W, V) by S(Σ_{i=1}^n ai wi) = Σ_{i=1}^n ai vi. Then (TS)(Σ_{i=1}^n ai wi) = T(Σ_{i=1}^n ai vi) = Σ_{i=1}^n ai wi, so we have obtained the identity map. OTOH, if ∃S ∈ L(W, V) s.t. TS = 1, then for all w ∈ W, w = TSw =⇒ w ∈ range T, so sur(T).

Exercise 152. Suppose U, V finite dimensional and S ∈ L(V, W ) and T ∈ L(U, V ). Prove or disprove

dim(null ST ) ≤ dim(null S) + dim(null T ). (423)

Prove or disprove

dim(range ST ) ≤ min{dim(range S), dim(range T )}. (424)

Proof. For the first inequality, restrict T to null ST, giving T|_{null ST} ∈ L(null ST, V). For u ∈ null ST we have S(Tu) = 0, so Tu ∈ null S; hence range(T|_{null ST}) ⊂ null S, while null(T|_{null ST}) ⊂ null T. Rank-nullity (Theorem 99) then gives dim(null ST) = dim null(T|_{null ST}) + dim range(T|_{null ST}) ≤ dim(null T) + dim(null S). For the second, recall that we related ranges of linear transformations to the column space of (standard) matrices (Definition 79), that the rank of a matrix is the dimension of its column space (Definition 62), and the rank bound of matrix products (Theorem 50); this proves the second part. Alternatively, see the direct proof: clearly range ST ⊂ range S =⇒ dim range ST ≤ dim range S. For a basis ui, i ∈ [m] of range T, range ST = span((Sui)_{i∈[m]}), so dim range ST ≤ m = dim range T. Then the inequality follows.

Exercise 153. Suppose dim(W ) < ∞ and T1 , T2 ∈ L(V, W ), then prove that null(T1 ) ⊂ null(T2 ) iff
∃S ∈ L(W, W ) s.t. T2 = ST1 .

Proof. Clearly, if ∃S s.t. T2 = ST1, then T2 v = ST1 v = S0 = 0 for all v ∈ null T1, so null T1 ⊂ null T2. On the other hand, suppose null T1 ⊂ null T2. Recall from Exercise 145 that we may write V = K ⊕ null T1 for some subspace K. Let ki, i ∈ [n] be a basis for K. The vectors T1 ki, i ∈ [n] are linearly independent in W: if Σ_{i=1}^n ai T1 ki = 0, then Σ_{i=1}^n ai ki ∈ null T1 ∩ K = {0}, so all ai = 0. Extend T1 ki, i ∈ [n] to a basis of W, and define S ∈ L(W, W) on this basis by S(T1 ki) = T2 ki for i ∈ [n] and S = 0 on the extension vectors (well defined by Theorem 95). For v ∈ V write v = u + Σ_{i=1}^n ai ki with u ∈ null T1 ⊂ null T2; then ST1 v = S(Σ_{i=1}^n ai T1 ki) = Σ_{i=1}^n ai T2 ki = T2(Σ_{i=1}^n ai ki) = T2(v − u) = T2 v, so T2 = ST1.

Exercise 154. Suppose dim(V ) < ∞ and T1 , T2 ∈ L(V, W ), then prove that range(T1 ) ⊂ range(T2 ) iff
∃S ∈ L(V, V ) s.t. T1 = T2 S.

Proof. Obviously, if ∃S satisfying T1 = T2 S, then range T1 ⊂ range T2 . Suppose range T1 ⊂ range T2 ,


and let ui , i ∈ [m] be basis of V . By the assumption, ∀i ∈ [m], ∃vi satisfying T1 ui = T2 vi . Define S by
Sui = vi , and we have T1 ui = T2 vi = T2 Sui and we are done.

Exercise 155. Suppose T ∈ L(V, W), and {wi, i ∈ [m]} is a basis for range T. Then prove ∃ ϕi, i ∈ [m] in L(V, F) satisfying Tv = Σ_{i=1}^m ϕi(v) wi for all v ∈ V.

Proof. Unique basis representation (Theorem 89, applied in range T) asserts that ∀v ∈ V, ∃ unique ai ∈ F, i ∈ [m] s.t. Tv = Σ_{i=1}^m ai wi; define ϕi(v) = ai for all i ∈ [m]. We show each ϕi is a valid linear functional. We have Tu = Σ_{i=1}^m ϕi(u) wi, Tv = Σ_{i=1}^m ϕi(v) wi and T(u + v) = Σ_{i=1}^m ϕi(u + v) wi. By linearity, T(u + v) = Tu + Tv = Σ_{i=1}^m ϕi(u) wi + Σ_{i=1}^m ϕi(v) wi = Σ_{i=1}^m (ϕi(u) + ϕi(v)) wi, so by uniqueness ϕi(u) + ϕi(v) = ϕi(u + v) and additivity is proved. Adopt a similar proof for homogeneity.

Exercise 156. Suppose ϕ ∈ L(V, F), then argue that

∃u ∈ V ∧ u 6∈ null ϕ =⇒ V = null(ϕ) ⊕ {au : a ∈ F}. (425)

Proof. Since u ∉ null ϕ, ϕ(u) ≠ 0. For any v ∈ V write v = (v − (ϕ(v)/ϕ(u)) u) + (ϕ(v)/ϕ(u)) u; the first summand lies in null ϕ and the second in {au : a ∈ F}, so V = null ϕ + {au : a ∈ F}. If au ∈ null ϕ, then a ϕ(u) = 0 forces a = 0, so the intersection is {0} and the sum is direct (Theorem 85). Note that this argument does not require V to be finite dimensional, as a rank-nullity route would.

Exercise 157. Suppose ϕ1 , ϕ2 ∈ L(V, F) and null(ϕ1 ) = null(ϕ2 ). Then show ∃c ∈ F satisfying
ϕ1 = cϕ2 .

Proof. Assume nullϕ1 = nullϕ2 = nullϕ. If V = nullϕ then ϕ1 , ϕ2 are zero linear functionals and we are
done. Otherwise, ∃u ∈ V not in null ϕ and Exercise 156 asserts we can write V = null ϕ ⊕ {au : a ∈ F},
so ∀v ∈ V there exists an expression v = w + au for some a ∈ F, w ∈ null ϕ. Then, noting ϕ2(u) ≠ 0,

(ϕ1(u)/ϕ2(u)) ϕ2(v) = (ϕ1(u)/ϕ2(u)) ϕ2(w + au) = (ϕ1(u)/ϕ2(u)) a ϕ2(u) = a ϕ1(u) = ϕ1(w + au) = ϕ1(v). (426)

See that ϕ1 = (ϕ1(u)/ϕ2(u)) ϕ2.

3.3.3.3 Matrices

In the Euclidean space algebra we began with matrices (Definition 19) and then related them to linear
maps. In fact, linear maps can be represented by the standard matrix (Definition 86, Exercise 131).
Here, we give a more explicit, and generalized relation between the two objects.
Recall that given basis {vi , i ∈ [n]} of V and linear map T : V → W , then the images T vi ’s determine
the values of T on arbitrary vectors. Matrices are an ‘efficient method of recording the values of T vi ’s
in terms of the basis of W (Axler [3]).’

Definition 112 (Matrix). Let m, n ∈ Z+. Then an m × n matrix A is a rectangular array of elements of F with m rows and n columns, representing

[ A1,1 · · · A1,n ]
[  · · ·   · · ·   · · ·  ]          (427)
[ Am,1 · · · Am,n ].

Aj,k is the row-j column-k entry of A. When used without both indices (i.e. Aj,· , A·,k ), the entire
row/column is referenced.

Previously, our standard matrix representations discussed in Section 3.2.7 were from the standard
basis of the Euclidean space. This is generalized to arbitrary basis of abstract vector space.

Definition 113 (Matrix of a Linear Map). Suppose T ∈ L(V, W), vi, i ∈ [n] is a basis of V and wi, i ∈ [m] is a basis of W. Then the matrix of T w.r.t. these bases is the m × n matrix M(T) with entries Aj,k defined by

T vk = Σ_{i=1}^m Ai,k wi. (428)

If we are required to specify the bases, then we write the matrix as M(T, (vi)_{i∈[n]}, (wj)_{j∈[m]}). When we refer to a basis S = {vi : i ∈ [n]} of some vector space and say that M(T) is the matrix of the linear map T w.r.t. the basis S, clearly the matrix M(T) depends on the order of the elements in S. We often refer to the basis as a set rather than a list, which makes the previous statement ambiguous, since M(T, (v1, · · · , vn), (·)) ≠ M(T, (vn, · · · , v1), (·)) in general. To avoid ambiguity, and at the risk of slight abuse of notation, when we refer to a matrix representation with reference to a basis that is defined as a set, we adopt the natural order, in that we assume the set S is first arranged into an ordered list in increasing index i ∈ [n].

Refer to Definition 113. We may think of vk as the column selector in the matrix M(T ). The
coefficients in column A·,k are the linear weights that when combined with the basis vectors of the vector
space W that is mapped into, give the linear transformation applied to the selector, giving T vk . If T
is a linear map Fn → Fm , then assume the basis is the standard basis (Definition 56) unless otherwise
specified. When working with Pm (F), then assume the standard basis is 1, x, · · · , xm .

Exercise 158. If D ∈ L(P3 (R), P2 (R)) is the differentiation map, then find the matrix D w.r.t the
standard basis.

Proof. Since D(x^n) = n x^{n−1}, then w.r.t. the bases 1, x, x^2, x^3 of P3(R) and 1, x, x^2 of P2(R),

M(D) = [ 0 1 0 0 ]
       [ 0 0 2 0 ]          (429)
       [ 0 0 0 3 ].
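As a quick check, the matrix above can be applied to coordinate vectors and compared with symbolic differentiation. A sympy sketch (my own illustration; the polynomial p is an arbitrary choice):

# sympy sketch (my own illustration)
import sympy as sp

x = sp.symbols('x')
MD = sp.Matrix([[0, 1, 0, 0],
                [0, 0, 2, 0],
                [0, 0, 0, 3]])

p = 7 + 5*x - 4*x**2 + 2*x**3
coords_p = sp.Matrix([7, 5, -4, 2])          # coordinates of p w.r.t. 1, x, x^2, x^3
coords_dp = MD * coords_p                    # coordinates of p' w.r.t. 1, x, x^2
print(list(coords_dp))                       # [5, -8, 6]
print(sp.expand(sp.diff(p, x)))              # 6*x**2 - 8*x + 5, consistent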

Matrix additions, matrix subtractions and matrix scalar multiples follow from Definition 29. Prop-
erties of matrix operators follow from Theorem 6 with R → F. Matrix multiplication follows from
Definition 31 and their properties follow from Theorem 7. We state without proof (it would be rather
trivial to work through the definitions) the following results (it is assumed that the bases are already
determined and used consistently for the linear maps S, T, S + T, λT ):

Theorem 102. If S, T ∈ L(V, W) then M(S + T) = M(S) + M(T). If λ ∈ F, T ∈ L(V, W) then M(λT) = λM(T). Additionally, when T ∈ L(U, V), S ∈ L(V, W), then M(ST) = M(S)M(T).

With the operations of addition and scalar multiplication defined, we can construct vector spaces
(along with the other properties in Definition 94).

Definition 114 (Set of all Matrices of Fixed Size). Let the set of all m × n matrices with entries in F be denoted by Fm,n. Here m, n ∈ Z+.

Theorem 103 (Fm,n (Definition 114) is a vector space). Fm,n (Definition 114) is a vector space with
dimension mn.

Proof. The properties of vector spaces (Definition 94) are satisfied by the properties of matrix operators
(Theorem 6). The additive identity is the zero matrix (Definition 26). The list of m × n matrices that
have 0 in all entries except at single (j, k) entry is a basis of Fm,n and there are mn such matrices in the
basis, hence the dimension.

Recall that there are many ways to think of matrix multiplications, particularly in the block form
(Exercise 11). Here we restate the results and add more.

Result 6. We suppose that all the matrix operations defined here are valid (in terms of their sizes).
Then see that

1. (AC)j,k = Aj,· C·,k ,

2. (AC)·,k = AC·,k ,
3. if A is an m × n matrix and c = (c1 c2 · · · cn)′ is a column vector, then Ac = Σ_{i=1}^n ci A·,i,

4. if a = (a1 · · · an) is a row vector, then aC = Σ_{i=1}^n ai Ci,·.

(A numeric spot-check of these identities follows.)
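A numeric spot-check of the four block views with numpy (my own illustration; A, C, a, c are arbitrary choices):

# numpy sketch (my own illustration)
import numpy as np

A = np.arange(6.0).reshape(2, 3)       # 2 x 3
C = np.arange(12.0).reshape(3, 4)      # 3 x 4
c = np.array([1.0, -2.0, 3.0])         # column vector with n = 3 entries
a = np.array([2.0, 0.0, -1.0])         # row vector with n = 3 entries

print(np.isclose((A @ C)[1, 2], A[1, :] @ C[:, 2]))                 # item 1
print(np.allclose((A @ C)[:, 2], A @ C[:, 2]))                      # item 2
print(np.allclose(A @ c, sum(c[i] * A[:, i] for i in range(3))))    # item 3
print(np.allclose(a @ C, sum(a[i] * C[i, :] for i in range(3))))    # item 4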

Exercise 159. If V, W finite dimensional and T ∈ L(V, W ), then show that M(T ) has at least dim(rangeT )
non-zero entries.

Proof. Let vi , i ∈ [n] be basis of V , wj , j ∈ [m] be basis of W , and suppose M(T ) has ≤ (dimrangeT )−1
non-zero entries. There are at most (dim range T ) − 1 non-zero vectors in the list (T vi , i ∈ [n]), and since
rangeT = span((T vi )i∈[n] ) (see Exercise 143), then dimrangeT ≤ (dimrangeT )−1. Contradiction.

Exercise 160. Let V, W be finite dimensional and T ∈ L(V, W). Show that there exist bases of V and W s.t., w.r.t. these bases, all entries of M(T) are zero except the entries M(T)jj = 1 for j ∈ [dim(range T)].

Proof. Let ui , i ∈ [m] be basis for null T , and extend this to basis of V using vi , i ∈ [n] s.t. we have
{vj : j ∈ [n]} ∪ {ui : i ∈ [m]}. Then (see Theorem 99) T vi , i ∈ [n] is basis for range T . Extend linearly
independent T vi , i ∈ [n] to basis of W , and suppose we get {T vi : i ∈ [n]} ∪ {µj : j ∈ [s]}. With respect
to this basis of V , and this basis of W , the matrix M(T ) satisfies the requirements.

Note that for a square matrix A of order n and j, k ∈ [n], the (j, k) entry of A^3 is equal to

Σ_{p=1}^n Σ_{r=1}^n Aj,p Ap,r Ar,k. (430)

3.3.3.4 Isomorphisms

Matrix inverses were defined (see Definition 35).

Definition 115 (Invertibility of Linear Maps). A linear map T ∈ L(V, W) is invertible if ∃S ∈ L(W, V) s.t. ST = 1 is the identity map on V and TS = 1 is the identity map on W. Then S is said to be the inverse of T. An invertible linear map is said to be an isomorphism. We write this as inv(T). A

Matrix inverses were unique (Theorem 9). Linear map inverses are unique. The adaptation of proof
is trivial - since it is unique, it is not ambiguous to write T −1 to be the inverse of T .

Theorem 104 (Invertibility iff Injective and Surjective). Linear map T is invertible iff inj(T ) and
sur(T ).

Proof. Let T ∈ L(V, W ). Suppose T is invertible, then suppose u, v ∈ V and T u = T v, then

u = T −1 T (u) = T −1 T (v) = 1v = v (431)

implies injectivity. Let w ∈ W , then since w = T T −1 w, w ∈ range T and range T = W implies
surjectivity. Now, suppose we have injectivity and surjectivity assumed. For all w ∈ W, define Sw to be
unique element of V s.t. T Sw = w. Clearly, T S is the identity map. Let v ∈ V , then see that

T (ST )v = T S(T v) = 1T v = T v, (432)

so ST v = v. ST is also the identity map. Finally, we need to show linearity conditions for S. We can
do this by writing

T (Sw1 + Sw2 ) = T Sw1 + T Sw2 = w1 + w2 = T S(w1 + w2 ), (433)

and since Sw1 +Sw2 is the unique element of V mapping to w1 +w2 under T , then S(w1 +w2 ) = Sw1 +Sw2
and S is additive. Similarly, T (λSw) = λT (Sw) = λw = T S(λw) and S is homogeneous (λSw is the
unique element in V mapping to λw under T ).

Definition 116. Two vector spaces are isomorphic if there is an isomorphism from one vector space to
the other (see Definition 115).

Theorem 105. Two finite dimensional vector spaces over F are isomorphic iff they share dimension.

Proof. Suppose T is isomorphism from V onto W , then since T is invertible, null T = {0}, range T =
W (Theorem 97, Definition 111), then rank-nullity (Theorem 99) asserts dim V = dim(null T ) +
dim(range T ) = dim W . On the other hand, if vi∈[n] , wi∈[n] is basis of V, W respectively, then de-
fine
T(Σ_{i=1}^n ci vi) = Σ_{i=1}^n ci wi, (434)

where T vi = wi (this is well defined since the vi's are a basis, and the existence of such a linear map is asserted by Theorem 95). Since the wi's span W, sur(T). Next, see that if T(Σ_{i=1}^n ci vi) = Σ_{i=1}^n ci wi = 0, then the ci's are zero by linear independence of the wi's and Σ_{i=1}^n ci vi = 0 - that is, null T = {0}. Then inj(T). Since sur(T) ∧ inj(T), T is invertible (Theorem 104), hence an isomorphism, and we are done.

Theorem 105 asserts that every vector space V with dim V = n is isomorphic to Fn. If vi, i ∈ [n] is a basis for V and wj, j ∈ [m] is a basis for W, then for each T ∈ L(V, W) we have its corresponding matrix representation M(T) ∈ Fm,n. It turns out that M is a linear map, and this linear map is invertible. Every linear map is taken to a unique matrix representation under M.

Theorem 106 (Axler [3]). Suppose vi , i ∈ [n] is basis of V , and wj , j ∈ [m] is basis of W . Then M is
isomorphism between L(V, W ) and Fm,n .

Proof. M is linear by Theorem 102. We need to show inj(M) ∧ sur(M). If T ∈ L(V, W) and M(T) = 0, then T vk = 0 for all k ∈ [n]; since the vi's are a basis, T = 0, and M is injective since null M = {0}, where 0 is the additive identity function. Let A ∈ Fm,n, and let T be the linear map from V → W s.t. T vk = Σ_{j=1}^m Aj,k wj for k ∈ [n]. Then M(T) = A (Definition 113) and range M = Fm,n. We are done.

Theorem 107 (Dimension of Vector Space of Linear Maps). Since L(V, W) is isomorphic to Fm,n, where V has basis size n and W has basis size m, and since dim Fm,n = mn, we get dim L(V, W) = dim V · dim W by Theorem 105.

The matrix of a linear map is defined (Definition 113). The matrix of a vector is defined as follows.

Definition 117 (Matrix of a vector). Let v ∈ V and let V have basis vi, i ∈ [n]. Then the matrix of v w.r.t. this basis is the column vector

M(v) = (c1, c2, · · · , cn)′, (435)

where the ci satisfy v = Σ_{i=1}^n ci vi. In other words, the matrix M(v) contains the coordinates of the vector v w.r.t. the specified basis (see Definition 55).

Once a basis vi , i ∈ [n] is chosen, the function M is an isomorphism of V onto Fn,1 that takes v ∈ V
to M(v).

Theorem 108. Suppose T ∈ L(V, W ) and vi , i ∈ [n] is basis of V , wj , j ∈ [m] is basis of W . Let k ∈ [n],
then M(T )·,k = M(T vk ).

Proof. Follows from the definitions (see Definition 113, Definition 117).

Theorem 109. Let T ∈ L(V, W ) and v ∈ V , suppose vi , i ∈ [n] is basis for V and wj , j ∈ [m] is basis
for W , then M(T v) = M(T )M(v).
Proof. v may be written with some ci ∈ F as v = Σ_{i=1}^n ci vi, so Tv = Σ_{i=1}^n ci T vi and

M(Tv) = Σ_{i=1}^n ci M(T vi) = Σ_{i=1}^n ci M(T)·,i = M(T)M(v), (436)

using linearity of M, Theorem 108 and Result 6.
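Continuing the differentiation example of Exercise 158, Theorem 109 says the coordinate vector of Dp is M(D) times the coordinate vector of p. A sympy sketch (my own illustration; p is an arbitrary cubic whose derivative has full degree, so its coordinate vector w.r.t. 1, x, x^2 has length three):

# sympy sketch (my own illustration)
import sympy as sp

x = sp.symbols('x')
MD = sp.Matrix([[0, 1, 0, 0], [0, 0, 2, 0], [0, 0, 0, 3]])   # M(T) w.r.t. the standard bases

p = 1 - x + 4*x**3
Mv = sp.Matrix([1, -1, 0, 4])                                 # M(v), coordinates of p
MTv = sp.Matrix(sp.Poly(sp.diff(p, x), x).all_coeffs()[::-1]) # coordinates of Dp
print(list(MD * Mv), list(MTv))                               # both [-1, 0, 12]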

Every linear map can be thought of as a matrix multiplication map after suitable relabeling via the
isomorphism given by M.

Definition 118. A linear map from a vector space onto itself T ∈ L(V, V ) is said to be a linear operator.
We make shorthand notation L(V ) = L(V, V ) for convenience.

It turns out that when we are presented with linear operators, say T ∈ L(V, V ), inj(T ) ↔ sur(T ).

Theorem 110. Suppose V is finite dimensional and T ∈ L(V ), then inv(T ) ↔ inj(T ) ↔ sur(T ).

Proof. Theorem 104 asserts that inv(T ) =⇒ inj(T ) and inv(T ) =⇒ sur(T ). Suppose T injective, then
null T = {0} and rank-nullity asserts dim(range T ) = dim V − dim(null T ) = dim V , so range T = V ,
hence sur(T ). Suppose T surjective, then rank-nullity asserts dim(null T ) = dim V − dim(range T ) = 0.
Then null T = {0}, hence inj(T ).

Exercise 161. Show ∀q ∈ P(R), ∃p ∈ P(R) s.t.

((x^2 + 5x + 7)p)′′ = q. (437)

Proof. Fix m and define T : Pm(R) → Pm(R) by Tp = ((x^2 + 5x + 7)p)′′; note deg((x^2 + 5x + 7)p) ≤ m + 2, so deg Tp ≤ m and T is a linear operator on Pm(R). If Tp = 0, then (x^2 + 5x + 7)p has second derivative zero, so it has the general form ax + b with a, b ∈ R; but for p ≠ 0 the product (x^2 + 5x + 7)p has degree ≥ 2, so p = 0 and null T = {0}, therefore inj(T). Since T is a linear operator on a finite dimensional space, T is surjective (Theorem 110). Surjectivity gives, for each q ∈ Pm(R), a polynomial p ∈ Pm(R) satisfying ((x^2 + 5x + 7)p)′′ = q. Since m was arbitrary, this holds for every q ∈ P(R).

Exercise 162. Suppose T ∈ L(U, V ) and S ∈ L(V, W ) and both are invertible, then prove that ST ∈
L(U, W ) is invertible and that (ST )−1 = T −1 S −1 .

Proof. Adapt proof of Theorem 10.

Exercise 163. Suppose V is finite dimensional and dim V > 1, then prove that the set of singular linear
operators on V is not a subspace of L(V ).

Proof. Proof by counterexample. Suppose {vi : i ∈ [n]} is basis of V , and define T1 v1 = 0, T1 vi = vi for
i ∈ [2, n]. Define T2 v1 = v1 , T2 vj = 0 for j ∈ [2, n]. Then T1 , T2 singular but T1 + T2 is invertible (in
particular, it is equal to 1).

Exercise 164. Suppose V is finite dimensional and U is subspace of V . Let linear map S ∈ L(U, V ),
then prove ∃ invertible T ∈ L(V ) s.t. T u = Su for all u ∈ U iff inj(S).

Proof. Suppose T invertible, and T u = Su, then clearly S is injective since invertible implies injectivity.
OTOH, suppose S is injective and T u = Su. Let ui , i ∈ [n] be basis of U and extend this to basis
of V , such that we have {ui : i ∈ [n]} ∪ {vj : j ∈ [m]}. Since S is injective, Sui , i ∈ [n] is linearly
independent (Exercise 142), then extend this to basis of V to get {Sui , i ∈ [n]} ∪ {wj : j ∈ [m]}. Define
T ui = Sui , T vj = wj for i ∈ [n], j ∈ [m] and see this is well defined by Theorem 95. Since Sui ’s, wj ’s
span V , T is surjective. Then T is invertible.

Exercise 165. Suppose W is finite dimensional and T1 , T2 ∈ L(V, W ). Prove null T1 = null T2 iff ∃
invertible S ∈ L(W ) s.t. T1 = ST2 .

Proof. [1]. Suppose null T1 = null T2. Since range T2 is finite dimensional, let wi, i ∈ [n] be a basis for range T2, and pick vi, i ∈ [n] satisfying T2 vi = wi. For v ∈ V, since we may write T2 v = Σ_{i=1}^n ai wi for some constants ai, we get T2(v − Σ_{i=1}^n ai vi) = 0, and since v = (v − Σ_{i=1}^n ai vi) + Σ_{i=1}^n ai vi, we have V = null T2 + span((vi)_{i∈[n]}). Now if Σ_{i=1}^n ai vi ∈ null T2, then T2(Σ_{i=1}^n ai vi) = Σ_{i=1}^n ai wi = 0, and by linear independence of the wi's the ai's are all zero and Σ_{i=1}^n ai vi = 0. That is, null T2 ∩ span((vi)_{i∈[n]}) = {0}, and therefore V = null T2 ⊕ span((vi)_{i∈[n]}). A similar argument asserts that T1 vi, i ∈ [n] are linearly independent: if Σ_{i=1}^n ai T1 vi = T1(Σ_{i=1}^n ai vi) = 0, then Σ_{i=1}^n ai vi ∈ null T1 = null T2, and as above the ai's are all zero. Extend wi, i ∈ [n] to a basis of W as wi, i ∈ [n], ej, j ∈ [m], and extend T1 vi, i ∈ [n] to a basis of W as T1 vi, i ∈ [n], fj, j ∈ [m]. We may define S ∈ L(W) to satisfy

S wi = T1 vi, S ej = fj, i ∈ [n], j ∈ [m]. (438)

Since any v ∈ V may be written v = u + Σ_{i=1}^n ai vi, where u ∈ null T2 = null T1, we can write

S T2(v) = S T2(u + Σ_{i=1}^n ai vi) = S T2(Σ_{i=1}^n ai vi) = S(Σ_{i=1}^n ai wi) = Σ_{i=1}^n ai T1 vi (439)
= T1(Σ_{i=1}^n ai vi) = T1(u + Σ_{i=1}^n ai vi) = T1(v). (440)

S maps the basis wi, ej onto the basis T1 vi, fj, so sur(S), and since S is an operator on the finite dimensional W, inv(S). On the other hand, suppose inv(S) with ST2 = T1. For u ∈ null T1 we may write ST2 u = T1 u = 0, and by invertibility T2 u = 0, so u ∈ null T2 and null T1 ⊂ null T2. Follow the symmetrical argument with T2 = S^{−1} T1 to show null T2 ⊂ null T1 and we are done.

Exercise 166. Suppose V is finite dimensional and T : V → W and sur(T ), then prove ∃U subspace of
V s.t. T|U is isomorphism of U onto W . T|U is to denote the restriction of domain to U , that is ∀u ∈ U ,
T|U (u) = T u.

Proof. Let wi, i ∈ [n] be a basis of W; by surjectivity, ∃ vi, i ∈ [n] s.t. T vi = wi. Since the T vi's are linearly independent, the vi's are linearly independent (Exercise 133); let U = span((vi)_{i∈[n]}). Then T|U is surjective onto W (its images T vi = wi span W), and dim U = n = dim W, so rank-nullity forces null(T|U) = {0}; hence T|U is injective and surjective, i.e. an isomorphism of U onto W (Theorem 104).

Exercise 167. Suppose V is finite dimensional and S, T ∈ L(V ), then prove that ST invertible iff S, T
both invertible.

Proof. See Exercise 14. This proof is generalized to abstract vector space. Suppose ST invertible, then
∃R = (ST )−1 satisfying RST = ST R = 1. If v ∈ null T , then v = 1v = RST v = 0 and hence
null T = {0}. inj(T ), T is operator, so inv(T ). Next, if u ∈ V , then u = 1u = ST Ru, so u ∈ range S.
Then range S = V , sur(S) and since S is operator, S is invertible. OTOH, suppose S, T invertible, then
write

(ST )(T −1 S −1 ) = S 1S −1 = 1, (441)

and

(T −1 S −1 )(ST ) = T −1 1T = 1, (442)

so ST is invertible.

Exercise 168. Suppose V is finite dimensional and S, T ∈ L(V ), then prove that ST = 1 ↔ T S = 1.

Proof. See Theorem 13.

Exercise 169. Suppose V is finite dimensional, S, T, U ∈ L(V ) and ST U = 1. Then show T is
invertible and T −1 = U S.

Proof. Since ST U = 1 is invertible, Exercise 167 (applied twice) asserts that S, T, U are all invertible, and we can write T U = S −1 → T = S −1 U −1 →
T −1 = (S −1 U −1 )−1 = U S.

Exercise 170. Suppose V is finite dimensional, R, S, T ∈ L(V ) and sur(RST ). Then prove inj(S).

Proof. RST is linear operator, and so sur(RST ) =⇒ inv(RST ) =⇒ inv(S) =⇒ inj(S). (see
Exercise 167).

Exercise 171. Suppose V finite dimensional and T ∈ L(V ), then prove that T = λ1 for some λ ∈ F iff
ST = T S for all S ∈ L(V ).

Proof. Suppose T = λ1, then ST = Sλ1 = λS = λ1S = T S. OTOH [2], suppose ST = T S for all
S ∈ L(V ). We begin by showing ∀v ∈ V , (v, T v) is linearly dependent. Suppose not, then we may
extend to basis v, T v, u1 , · · · , un of V . Define S ∈ L(V ) s.t.

S(av + bT v + ∑_{i=1}^n ci ui ) = bv, (443)

then ST v = v, Sv = 0. By assumption, v = ST v = T Sv = T 0 = 0, which is contradiction since (v, T v) cannot be
independent if v = 0. Then ∀ non-zero v ∈ V , we
may write T v = av v, and we are done if we can show av = a for all v ∈ V and some constant a ∈ F.
Consider non-zero v, w ∈ V , and suppose v, w linearly dependent, then ∃b ∈ F s.t. w = bv, and we can
write aw w = T w = T bv = bT v = bav v = av w, so av = aw . Otherwise, they are linearly independent,
and we can write av+w (v + w) = T (v + w) = T v + T w = av v + aw w, so (av+w − av )v + (av+w − aw )w = 0,
and by linear independence, av+w = av , av+w = aw , and so av = aw .

Exercise 172. Suppose T ∈ L(P(R)) s.t. inj(T ), deg T p ≤ deg p for all non-zero p ∈ P(R), then show
that sur(T ) and deg T p = deg p.

Proof. Since deg T p ≤ deg p for all p ∈ P(R), ∀n ∈ N+ , T|Pn (R) : Pn (R) → Pn (R). Since this is a linear
operator on the finite dimensional space Pn (R) and T is injective, T|Pn (R) is injective and hence surjective. Since this holds
for all n ∈ N+ , T is surjective. [1] For the second assertion, when deg p = 0, clearly deg T p = 0 since deg T p ≤ deg p and T p ≠ 0. Suppose for contradiction that
∃p ∈ P(R) with deg p = n + 1 for some nonnegative n but deg T p < n + 1, i.e. T p ∈ Pn (R). By the surjectivity of
T|Pn (R) : Pn (R) → Pn (R), ∃q ∈ Pn (R) satisfying T q = T p. Then T p = T q but deg p ≠ deg q, so p ≠ q, which
contradicts inj(T ). We require that deg T p = deg p.

Exercise 173. Show that a HLS (Definition 18) has only the trivial solution iff for all choice of c for
Ax = c, there exists some solution u satisfying Au = c (the system is consistent).

Proof. Note this is a different problem from Theorem 52. Consider the linear operator T that is obtained
from the isomorphism M−1 with A. Then the second statement asserts that T is surjective, which is iff
linear operator T is injective, which is iff only the trivial solution exists.
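A quick numerical sketch of the square-system case of Exercise 173 follows; it assumes Python with numpy, and the matrix, seed and variable names are illustrative only, not taken from the text. Both conditions coincide with the matrix having full rank.

    # Illustration (numpy assumed): only the trivial solution to Ax = 0
    # <=> A has full rank <=> Ax = c is consistent for every c.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))          # generic square matrix, almost surely full rank

    full_rank = np.linalg.matrix_rank(A) == A.shape[0]
    x_homog = np.linalg.lstsq(A, np.zeros(4), rcond=None)[0]   # homogeneous solution
    c = rng.standard_normal(4)
    x_part = np.linalg.solve(A, c)           # raises LinAlgError iff A is singular

    print(full_rank, np.allclose(x_homog, 0), np.allclose(A @ x_part, c))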

3.3.3.5 Products, Quotients of Vector Spaces

Definition 119 (Product of Vector Spaces). Let Vi , i ∈ [m] be vector spaces over F, then the product of
the vector spaces is defined by

×_{i=1}^m Vi = V1 × V2 × · · · × Vm = {(v1 , · · · , vm ) : ∀i ∈ [m], vi ∈ Vi }. (444)

Addition of this product space is defined by

(u1 , · · · , um ) + (v1 , · · · , vm ) = (u1 + v1 , · · · , um + vm ), (445)

and scalar multiplication on this product space is defined by

λ(v1 , · · · , vm ) = (λv1 , · · · , λvm ). (446)

Elements of P2 (R) × R3 are lists of length two, and one such element is

(5 − 6x + 4x2 , (3, 8, 7)) ∈ P2 (R) × R3 . (447)

The product of vector spaces (Definition 119) is a vector space. All the vector spaces should be over
the same field, and the resulting product is also over that field. For instance, the additive identity for
the space that Equation 447 belongs to is (0, 0), where the first zero is an identically zero polynomial
(Definition 102) and the second zero is the vector (0, 0, 0). The additive inverse of (v1 , · · · , vm ) is
(−v1 , · · · , −vm ).
The linear map that that takes a vector ((x1 , x2 ), (x3 , x4 , x5 )) ∈ R2 × R3 → (x1 , x2 , x3 , x4 , x5 ) is
clearly injective and surjective, so the two vector spaces R2 × R3 , R5 is isomorphic.

Theorem 111. Let Vi , i ∈ [m] be finite dimensional vector spaces, then the vector space product satisfies
the condition dim(×_{i=1}^m Vi ) = ∑_{i=1}^m dim Vi .

Proof. We can easily prove this by construction, combining the standard basis vectors of each component
space, padded with the additive identity in all other entries. The explicit proof is not stated.

For example, a possible basis of P2 (R) × R2 can be written

(1, (0, 0)), (x, (0, 0)), (x2 , (0, 0)), (0, (1, 0)), (0, (0, 1)). (448)

Theorem 112. Suppose Ui , i ∈ [m] are subspaces of V , and define a linear map Γ : U1 × · · · × Um →
U1 + · · · + Um by

Γ(u1 , · · · , um ) = ∑_{i=1}^m ui . (449)

Then ∑_{i=1}^m Ui is direct sum (Definition 98) iff Γ is injective/invertible. By definition of subspace addition
∑_{i=1}^m Ui (Definition 60), Γ is surjective, so the iff condition is equivalent for injectivity and invertibility.

Proof. Γ is injective iff the only way to write 0 as ∑_{i=1}^m ui with ui ∈ Ui is with ui = 0 for all i ∈ [m]. Then Γ is
injective iff ∑_{i=1}^m Ui is direct sum.
Theorem 113. Suppose V is finite dimensional and Ui , i ∈ [m] are subspaces of V , then ∑_{i=1}^m Ui is direct
sum iff dim(∑_{i=1}^m Ui ) = ∑_{i=1}^m dim Ui .

Proof. We know sur(Γ) (in Theorem 112), so rank-nullity asserts that (Theorem 99)

dim(×_{i=1}^m Ui ) = dim(null Γ) + dim(range Γ) = dim(null Γ) + dim(∑_{i=1}^m Ui ). (450)

Γ is injective iff

dim(null Γ) = dim({0}) = 0 ↔ dim(∑_{i=1}^m Ui ) = dim(×_{i=1}^m Ui ) = ∑_{i=1}^m dim Ui . (451)

Theorem 111, 112 were used.
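The dimension criterion is easy to check numerically for subspaces of a Euclidean space. The sketch below assumes Python with numpy; the particular spanning rows are made up for illustration and are not from the text.

    # Illustration (numpy assumed): the sum of two subspaces of R^4 is direct iff dimensions add.
    import numpy as np

    U1 = np.array([[1., 0., 0., 0.],
                   [0., 1., 0., 0.]])                 # rows span U1, dim U1 = 2
    U2_direct  = np.array([[0., 0., 1., 0.]])         # dim 1, U1 + U2 is direct
    U2_overlap = np.array([[1., 1., 0., 0.]])         # dim 1, contained in U1

    def dim_sum(A, B):
        # dimension of the subspace spanned by the rows of A together with the rows of B
        return np.linalg.matrix_rank(np.vstack([A, B]))

    print(dim_sum(U1, U2_direct)  == 2 + 1)   # True : direct sum
    print(dim_sum(U1, U2_overlap) == 2 + 1)   # False: not a direct sum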

Definition 120. Let v ∈ V , U be subspace of V , then the sum of v + U is the subset of V defined by

v + U = {v + u | u ∈ U }. (452)

Definition 121. An affine subset of V is subset of V of the form v + U for some v ∈ V and subspace
U of V . This affine subset is said to be parallel to U .

Definition 122 (Quotient of Vector Space). Let U be subspace of V , then the quotient space V /U is set
of all affine subsets of V parallel to U . This may be expressed

V /U = {v + U : v ∈ V }. (453)

Theorem 114. Suppose U is subspace of V , and let v, w ∈ V . Then the following are equivalent
statements:

1. v − w ∈ U ,

2. v + U = w + U ,

3. (v + U ) ∩ (w + U ) 6= ∅.

Proof. If 1. holds, then v − w ∈ U , then v + u = w + ((v − w) + u) ∈ w + U . Therefore v + U ⊂ w + U .


By symmetry, w + U ⊂ v + U , and w + U = v + U . It is trivial to see that 2. implies 3. Suppose 3. is
true, then let u1 , u2 ∈ U s.t. v + u1 = w + u2 , then v − w = u2 − u1 . Then v − w ∈ U and 3. implies
1.

Definition 123 (Operations on the Quotient Space). Let U be subspace of V , then addition and scalar
multiplication on quotient space V /U is defined by

(v + U ) + (w + U ) = (v + w) + U,

λ(v + U ) = (λv) + U,

where v, w ∈ V and λ ∈ F.

Theorem 115 (Quotient space is vector space). Suppose U is subspace of V , then the quotient space
V /U with operations defined as in Definition 123 is valid vector space (Definition 94).

Proof. Suppose v, w, v̂, ŵ ∈ V s.t. v + U = v̂ + U, w + U = ŵ + U , then want to show that (v + w) +


U = (v̂ + ŵ) + U . Theorem 114 asserts that v − v̂ ∈ U, w − ŵ ∈ U , and since U is subspace, then
(v − v̂) + (w − ŵ) = (v + w) − (v̂ + ŵ) ∈ U . Again Theorem 114 asserts that (v + w) + U = (v̂ + ŵ) + U .
The addition operation makes sense. The subspace property also asserts that λ(v − v̂) = λv − λv̂ ∈ U , so
(λv) + U = (λv̂) + U . The additive identity in the quotient space is 0 + U = U , and the additive inverse
of v + U is −v + U .

Definition 124 (Quotient Map). Let U be subspace of V , then the quotient map π is the linear map
π : V → V /U satisfying

π(v) = v + U, ∀v ∈ V. (454)

Theorem 116. Let U be subspace of finite dimensional V , then dim(V /U ) = dim V − dim U .

Proof. Additive identity is 0 + U = U . For u, 0 ∈ U , u − 0 ∈ U so u + U = 0 + U = U (Theorem 114),


hence null π = U . See that range π = V /U . Rank-nullity asserts that dim V = dim U + dim(V /U ).

Definition 125. Suppose T ∈ L(V, W ), and define T̃ : V /(null T ) → W by

T̃ (v + null T ) = T v. (455)

Definition 125 is well defined since ∀u, v ∈ V satisfying u + null T = v + null T , then u − v ∈ null T
(Theorem 114) and T (u − v) = 0.

Theorem 117. Suppose T ∈ L(V, W ), then

1. T̃ is linear map from V /(null T ) → W ,

2. inj(T̃ ),

3. range T̃ = range T ,

4. V /(null T ) is isomorphic to range T .

Proof. 1. -

2. Let v ∈ V, T̃ (v+nullT ) = 0, then T v = 0 and v ∈ nullT . Since 0 ∈ nullT , then Theorem 114 asserts
that v + null T = 0 + null T . Then null T̃ = {0} (this zero identity represents 0 + null T = null T ).

3. This follows from the using the definition T̃ (see Definition 125).

4. T̃ is the isomorphism from V /(null T ) → range T by parts 2,3.

Exercise 174. Suppose T is function from V → W , then the graph of T is said to be the subset of
V × W defined

{(v, T v) ∈ V × W | v ∈ V }. (456)

Prove that T is linear map iff graph of T is subspace of V × W .

Proof. Suppose T is linear, and let (v, T v), (w, T w) be elements in graph T , then (v, T v) + (w, T w) =
((v + w), T (v + w)) and see this belongs to graph T . Additivity is proved. Homogeneity uses similar
proof. Then it is subspace of V × W . Suppose graph T is subspace of V × W , then (v, T v) + (w, T w) =
(v + w, T (v + w) = T v + T w) ∈ graph T , where the equality follows by definition of graph. We showed
additivity. Similarly, for λ ∈ F, we get λ(v, T v) = (λv, λT v) ∈ graph T and we get λT v = T λv. T is
linear map by definition (Definition 105).

Exercise 175. Let Vi , i ∈ [m] be vector spaces and suppose ×_{i=1}^m Vi is finite-dimensional, then prove that
Vj is finite dimensional for each j ∈ [m].

Proof. For each j ∈ [m], the map sending v ∈ Vj to the tuple with v in the j-th slot and the additive identity elsewhere is an injective
linear map into ×_{i=1}^m Vi . Then Vj is isomorphic to a subspace of a finite-dimensional space and is itself finite dimensional.

Exercise 176. Prove that L(×_{i=1}^m Vi , W ) and ×_{i=1}^m L(Vi , W ) are isomorphic. Prove that L(V, ×_{i=1}^m Wi ) and
×_{i=1}^m L(V, Wi ) are isomorphic.

Proof. Theorem 107, 111 assert that

dim L(×_{i=1}^m Vi , W ) = dim(×_{i=1}^m Vi ) · dim W = (∑_{i=1}^m dim Vi ) · dim W (457)
= ∑_{i=1}^m dim Vi dim W = ∑_{i=1}^m dim L(Vi , W ) = dim(×_{i=1}^m L(Vi , W )). (458)

Proof of the second statement is similar.

Exercise 177. Let n ∈ Z+ and prove that V n := ×_{i=1}^n V and L(Fn , V ) are isomorphic.

Proof. Theorem 107, 111 assert that

dim V n = dim(×_{i=1}^n V ) = ∑_{i=1}^n dim V = n dim V = dim Fn · dim V = dim(L(Fn , V )). (459)

Exercise 178. Let v, x ∈ V . Let U, W be subspaces of V and v + U = x + W . Then prove that U = W .

Proof. If v + U = x + W, then ∃w1 ∈ W s.t. v = x + w1 . Then v − x ∈ W , and ∀u ∈ U , since


v + U = x + W , ∃w2 ∈ W s.t. v + u = x + w2 . Then

u = (x − v) + w2 = −w1 + w2 ∈ W, (460)

and U ⊂ W . By symmetry, W ⊂ U and we are done.

Exercise 179. Prove that nonempty A ⊂ V is an affine subset of V iff λv +(1−λ)w ∈ A for all v, w ∈ A
and all λ ∈ F.

Proof. [1] Suppose A is affine subset of V , then ∃a ∈ V , subspace U of V s.t. A = a+U . Then ∀v, w, ∈ A,
we may express v = a + u1 , w = a + u2 for some u1 , u2 ∈ U . Then

λv + (1 − λ)w = λ(a + u1 ) + (1 − λ)(a + u2 ) = a + [λu1 + (1 − λ)u2 ] ∈ a + U = A. (461)

On the other hand, assume that λv + (1 − λ)w ∈ A for all v, w ∈ A, λ ∈ F. Since A ≠ ∅, let a ∈ A;
then we show A − a = {x − a : x ∈ A} is subspace of V . For x − a ∈ A − a with x ∈ A and λ ∈ F, we have

λx + (1 − λ)a ∈ A =⇒ λ(x − a) = λx + (1 − λ)a − a ∈ A − a, (462)

so this set is closed under scalar multiplication. Additionally, for x − a, y − a ∈ A − a with x, y ∈ A, we
may write

(x − a) + (y − a) = 2(x/2 + y/2 − a) ∈ A − a, (463)

since x/2 + y/2 = (1/2)x + (1 − 1/2)y ∈ A and the membership then follows by closure under scalar multiplication. So the set is
closed under addition. Then A − a is valid subspace of V . Since A = a + (A − a), A is affine
subset.

Exercise 180. Suppose A1 , A2 are affine subsets of V , and prove that A1 ∩ A2 is either affine subset of
V or {}.

Proof. If A1 ∩ A2 ≠ ∅, then ∀x, y ∈ A1 ∩ A2 and λ ∈ F, Exercise 179 applied to the affine subsets A1 , A2 gives
λx + (1 − λ)y = λ(x − y) + y ∈ A1 and λx + (1 − λ)y = λ(x − y) + y ∈ A2 . Then λx + (1 − λ)y ∈ A1 ∩ A2 ,
and Exercise 179 again asserts that A1 ∩ A2 is affine subset of V . In fact, for Ai , i ∈ [m] a collection of affine subsets
in V , the same argument shows that ∩_{i=1}^m Ai is affine subset of V or ∅.

Exercise 181. Suppose U is subspace of V s.t. V /U is finite dimensional. Prove that V is isomorphic
to U × (V /U ).

Proof. See Theorem 105, Theorem 116 and Theorem 111. If V is finite dimensional, then dim V = dim U + dim V /U =
dim(U × (V /U )) and the spaces are isomorphic. (In general, choose a subspace W of V with V = U ⊕ W and dim W = dim V /U ,
as in Exercise 184; each coset then has a unique representative w ∈ W , and the map (u, w + U ) → u + w is an isomorphism from U × (V /U ) onto V .)

Exercise 182. Suppose U is subspace of V and vi + U, i ∈ [m] is basis of V /U and ui , i ∈ [n] is basis of
U . Prove that vi , i ∈ [m], uj , j ∈ [n] is basis of V .

Proof. [1]. For all v ∈ V , since vi + U, i ∈ [m] is basis of V /U , ∃λi , i ∈ [m] s.t. v + U =
∑_{i=1}^m λi (vi + U ). Then v − ∑_{i=1}^m λi vi ∈ U , so ∃ηj , j ∈ [n] s.t. v − ∑_{i=1}^m λi vi = ∑_{j=1}^n ηj uj . It follows that
v = ∑_{i=1}^m λi vi + ∑_{j=1}^n ηj uj , and V = span(v1 , · · · , vm , u1 , · · · , un ). Since dim V = dim U + dim V /U , the
cardinality of basis for V should exactly be equal to the cardinality of basis for U plus the cardinality of
basis for V /U . So the spanning set we obtained from combining the two lists has the right size, and it
is basis (Theorem 94).

Exercise 183. Suppose ϕ ∈ L(V, F) and ϕ 6= 0, then prove that dim(V /null ϕ) = 1.

Proof. ϕ is not zero function, so ∃v ∈ V s.t. ϕ(v) = a, a ≠ 0. Then dim(range ϕ) ≥ 1, and since range ϕ is
subspace of F with dim F = 1, dim(range ϕ) ≤ 1, so dim(range ϕ) = 1. Since V /null ϕ is isomorphic to range ϕ (Theorem 117),
dim(V /null ϕ) = dim(range ϕ) = 1.

Exercise 184. Suppose U is subspace of V s.t. V /U is finite dimensional and prove ∃W subspace of V
satisfying dim W = dim V /U , V = U ⊕ W .

Proof. Let vi + U, i ∈ [m] be basis of V /U and set W = span((vi )i∈[m] ). The argument in the proof of Exercise 182 shows V = U + W , and if ∑_{i=1}^m λi vi ∈ U then ∑_{i=1}^m λi (vi + U ) = 0 + U , forcing all λi = 0 by linear independence of the vi + U ; hence U ∩ W = {0}, V = U ⊕ W and dim W = m = dim V /U .

Exercise 185. Suppose T ∈ L(V, W ) and U is subspace of V . Let π be quotient map V → V /U . Prove
∃S ∈ L(V /U, W ) s.t. T = Sπ iff U ⊂ null T .

Proof. Suppose ∃S ∈ L(V /U, W ) s.t. T = Sπ. Since ∀u ∈ U , πu = 0, then T u = Sπu = S0 = 0,


so u ∈ null T and U ⊂ null T . On the other hand, if U ⊂ null T , then define S : V /U → W s.t.
S(v + U ) = T v. This is well defined since ∀v1 , v2 ∈ V , v1 + U = v2 + U =⇒ v1 − v2 ∈ U . By assumption,
T (v1 − v2 ) = 0 since U ⊂ null T and T v1 = T v2 . Since T ∈ L(V, W ), see that S ∈ L(V /U, W ) and by
definition, Sπ(v) = S(v + U ) = T v for all v ∈ V , so T = Sπ.

Exercise 186. Suppose U is subspace of V , and let Γ : L(V /U, W ) → L(V, W ) by Γ(S) = Sπ. Then

1. Γ is linear map,

2. inj(Γ) and that

3. range Γ = {T ∈ L(V, W ) | T u = 0 ∀u ∈ U }.

Proof. -

1.

Γ(c1 S1 + c2 S2 ) = (c1 S1 + c2 S2 )π = c1 S1 π + c2 S2 π = c1 Γ(S1 ) + c2 Γ(S2 ). (464)

2. Γ(S) = 0 =⇒ Sπ = 0 =⇒ ∀v ∈ V, (Sπ)(v) = S(v + U ) = 0; since every element of V /U is of the form v + U , S = 0.

3. Exercise 185 asserts that

T ∈ range Γ ↔ ∃S ∈ L(V /U, W ) s.t T = Sπ ↔ U ⊂ null T ↔ T ∈ L(V, W ) s.t T u = 0 ∀u ∈ U.

3.3.3.6 Duality

Definition 126 (Linear Functional). A linear functional on V is a linear map belonging to L(V, F).

Definition 127 (Dual Space). The dual space of V , denoted V 0 is the vector space containing all linear
functional on V . That is, V 0 = L(V, F).

The definitions (Definitions 126, 127) make it natural for us to say ‘let x be the linear functional
belonging to the dual space of V ’.

Theorem 118. Let V be finite-dimensional, then V 0 is finite dimensional and dim V 0 = dim V .

Proof. The result follows from dim V 0 = dim L(V, F) = dim V · dim F = dim V (Theorem 107). A finite dimensional vector space is therefore isomorphic to its dual space.

Definition 128 (Dual Basis). If vi , i ∈ [n] is basis of V , then the dual basis of vi , i ∈ [n] is the list
ϕi , i ∈ [n] elements of V 0 , where each ϕj is the linear functional satisfying

ϕj (vk ) = 1{k = j}. (465)

Exercise 187. Find the dual basis of the standard basis {ei , i ∈ [n]} (Definition 56) of Fn .

Proof. For j ∈ [n], define ϕj to be the linear functional on Fn selecting the j-th coordinate of vector in
Fn . That is, ϕj (x1 , · · · , xn ) = xj . Then ϕj (ek ) = 1{k = j} and we are done.
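A concrete coordinate picture helps here: viewing functionals on F^n as row vectors acting by multiplication, the dual basis of the standard basis is simply the rows of the identity matrix. The sketch assumes Python with numpy and the vectors are illustrative only.

    # Illustration (numpy assumed): dual basis of the standard basis of F^n as coordinate functionals.
    import numpy as np

    n = 4
    E = np.eye(n)                   # columns e_k: standard basis
    Phi = np.eye(n)                 # row j is the functional phi_j, with phi_j(x) = x_j

    print(np.allclose(Phi @ E, np.eye(n)))   # phi_j(e_k) = 1{k = j}
    x = np.array([2., -1., 5., 0.])
    print(Phi[2] @ x)                        # phi_3 selects the third coordinate: 5.0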

Theorem 119 (Dual basis is basis). Let V be finite dimensional, then the dual basis of basis of V is the
basis for the dual space V 0 .
Proof. Let vi , i ∈ [n] be basis of V and ϕi , i ∈ [n] be the dual basis. Consider ∑_{i=1}^n ai ϕi = 0, then see
that (∑_{i=1}^n ai ϕi )(vj ) = aj for j ∈ [n]. It follows that ∀i ∈ [n], ai = 0. Then the ϕi ’s are linearly independent.
There are n such elements, and dim V 0 = dim V , so we have basis.

Definition 129 (Dual Map). If T ∈ L(V, W ), then the dual map of T is the linear map T 0 ∈ L(W 0 , V 0 )
defined by T 0 (ϕ) = ϕ ◦ T for all ϕ ∈ W 0 .

That is, T 0 (ϕ) is the composition of the linear maps ϕ, T . Then T 0 (ϕ) is linear map from V → F,
and this belongs to the dual space (i.e. T 0 (ϕ) ∈ V 0 ). See that for all ϕ, ψ ∈ W 0 , then T 0 (ϕ + ψ) =
(ϕ + ψ) ◦ T = ϕ ◦ T + ψ ◦ T = T 0 (ϕ) + T 0 (ψ). Additionally, when λ ∈ F, ϕ ∈ W 0 , then T 0 (λϕ) =
(λϕ) ◦ T = λ(ϕ ◦ T ) = λT 0 (ϕ).

Exercise 188. Define D : P(R) → P(R) by Dp = p0 , the differentiation map. Suppose ϕ is the linear
functional on P(R) defined by ϕ(p) = p(3), then D0 (ϕ) is the linear functional on P(R) given by

(D0 (ϕ))(p) = (ϕ ◦ D)(p) = ϕ(Dp) = ϕ(p0 ) = p0 (3). (466)


If we let ϕ be the linear functional on P(R) s.t. ϕ(p) = ∫_0^1 p, then D0 (ϕ) is the linear functional on
P(R) given by

(D0 (ϕ))(p) = (ϕ ◦ D)(p) = ϕ(Dp) = ϕ(p0 ) = ∫_0^1 p0 = p(1) − p(0). (467)

Theorem 120 (Properties of the Dual Map). The following properties are satisfied by dual maps:

1. (S + T )0 = S 0 + T 0 for all S, T ∈ L(V, W ),

2. (λT )0 = λT 0 for all λ ∈ F, T ∈ L(V, W ),

3. (ST )0 = T 0 S 0 for all T ∈ L(U, V ) and for all S ∈ L(V, W )

Proof. The additivity and homogeneity conditions follow as usual and are not proved. For part 3, let
ϕ ∈ W 0 , then

(ST )0 (ϕ) = ϕ ◦ (ST ) = (ϕ ◦ S) ◦ T = T 0 (ϕ ◦ S) = T 0 (S 0 (ϕ)) = (T 0 S 0 )(ϕ). (468)

It follows that the function taking T to T 0 is a linear map from L(V, W ) → L(W 0 , V 0 ).

Definition 130 (Annihilator). For U ⊂ V , the annihilator of U is denoted U 0 and has definition

U 0 = {ϕ ∈ V 0 | ϕ(u) = 0 ∀u ∈ U }. (469)

Let U be the subspace of P(R) consisting of all polynomial multiples of x2 , and if ϕ is the linear functional
on P(R) defined by ϕ(p) = p0 (0), then ϕ ∈ U 0 .

Exercise 189. Let ei , i ∈ [5] be standard basis in R5 , and ϕi , i ∈ [5] be the dual basis of the standard basis in (R5 )0 . If U =
span{e1 , e2 } then show U 0 = span{ϕ3 , ϕ4 , ϕ5 }.

Proof. Recall the dual basis of standard basis (Exercise 187) - we have ϕj (x1 , · · · , x5 ) = xj . Let ϕ ∈
span{ϕi : i = 3, 4, 5}, then ∃c3 , c4 , c5 s.t. ϕ = ∑_{i=3}^5 ci ϕi . If (x1 , x2 , 0, 0, 0) ∈ U , then

ϕ(x1 , x2 , 0, 0, 0) = (∑_{i=3}^5 ci ϕi )(x1 , x2 , 0, 0, 0) = 0. (470)

It follows that ϕ ∈ U 0 . OTOH, if ϕ ∈ U 0 , then since dual basis is basis (Theorem 119), we can write
ϕ = ∑_{i=1}^5 ci ϕi . For i = 1, 2, ei ∈ U and ϕ ∈ U 0 , so we may write 0 = ϕ(ei ) = (∑_{i=1}^5 ci ϕi )(ei ) = ci . Then
ϕ = ∑_{i=3}^5 ci ϕi and it belongs to span{ϕ3 , ϕ4 , ϕ5 }.

Theorem 121 (Annihilator is subspace). Let U ⊂ V , then U 0 is subspace of V 0 .

Proof. The zero linear functional 0 ∈ U 0 . If ϕ, ψ ∈ U 0 , then ϕ, ψ ∈ V 0 and by definition ϕ(u) = ψ(u) = 0
for all u ∈ U . Write (ϕ + ψ)(u) = ϕ(u) + ψ(u) = 0 + 0 = 0 and ϕ + ψ ∈ U 0 (closed). The scalarization
proof is similar.

Theorem 122. Let V be finite dimensional and U be subspace of V , then dim U + dim U 0 = dim V .

Proof. Let i ∈ L(U, V ) be the inclusion map i(u) = u for u ∈ U , so that the dual map i0 ∈ L(V 0 , U 0 ) (here U 0 denotes the dual space of U , not the annihilator). Rank-nullity asserts that
dim(range i0 ) + dim(null i0 ) = dim V 0 = dim V (Theorem 118). Now null i0 = {ψ ∈ V 0 | ψ ◦ i = 0}, which is precisely the annihilator of U inside V 0 . If ϕ is a linear functional on U ,
then ϕ may be extended to linear functional ψ on V (Exercise 135) and by definition of i0 , we get i0 (ψ) = ψ ◦ i = ϕ. Then ϕ ∈ range i0 , and clearly range i0 is contained in the dual space of U .
It follows that dim(range i0 ) = dim U , and therefore dim U plus the dimension of the annihilator of U equals dim V .

Theorem 123. Suppose V, W are finite dimensional and T ∈ L(V, W ), then

1. null T 0 = (range T )0 ,

2. dim(null T 0 ) = dim(null T ) + dim W − dim V

Proof. 1. Suppose ϕ ∈ null T 0 , then 0 = T 0 (ϕ) = ϕ ◦ T and 0 = (ϕ ◦ T )(v) = ϕ(T v) for all v ∈ V .
Then ϕ ∈ (range T )0 , and null T 0 ⊂ (range T )0 . OTOH, let ϕ ∈ (range T )0 , then ϕ(T v) = 0 and
0 = ϕ ◦ T = T 0 (ϕ), so ϕ ∈ null T 0 and (range T )0 ⊂ null T 0 ,

2. Write (use Theorem 122)

dim(null T 0 ) = dim((range T )0 ) = dim W − dim(range T ) = dim W − (dim V − dim(null T )). (471)

Theorem 124. For T ∈ L(V, W ), sur(T ) ↔ inj(T 0 ).

Proof. See that sur(T ) iff range T = W iff dim((range T )0 ) = dim W − dim(range T ) = 0 iff (range T )0 =
{0} iff null T 0 = {0} iff inj(T 0 ). Theorem 122 and Theorem 123 were used.

Theorem 125. Let T ∈ L(V, W ) for finite dimensional V, W . Then

1. dim(range T 0 ) = dim(range T ), and

2. range T 0 = (null T )0 .

Proof. 1. Write (using dual space isomorphism, Theorem 122 and Theorem 123)

dim(range T 0 ) = dim W 0 − dim(null T 0 ) = dim W − dim(range T )0 = dim(range T ). (472)

2. Suppose ϕ ∈ range T 0 , then ∃ψ ∈ W 0 satisfying ϕ = T 0 (ψ), Then v ∈ null T =⇒

ϕ(v) = (T 0 (ψ))v = (ψ ◦ T )(v) = ψ(T v) = ψ(0) = 0. (473)

It follows that ϕ ∈ (null T )0 and range T 0 ⊂ (null T )0 . Finally, write

dim(range T 0 ) = dim(range T ) = dim V − dim(null T ) = dim(null T )0 . (474)

Theorem 126. For T ∈ L(V, W ), inj(T ) ↔ sur(T 0 ).

Proof. See that (by Theorem 123) inj(T ) iff null T = {0} iff (null T )0 = V 0 iff range T 0 = V 0 iff
sur(T 0 ).

The matrix transpose was defined in Definition 34. Our discussion of dual spaces gives motivation for
transposes, as we shall see. Properties of the matrix transpose were discussed in Theorem 8. Suppose we
have basis vi , i ∈ [n] for V and dual basis ϕi , i ∈ [n] for V 0 , as well as basis wi , i ∈ [m] for W along with
dual basis ψi , i ∈ [m] for W 0 . The standard notation writes M(T ) w.r.t. the v, w bases and M(T 0 )
w.r.t. the ϕ, ψ dual bases.

Theorem 127. For T ∈ L(V, W ), M(T 0 ) = (M(T ))0 .

Proof. We may write A = M(T ), C = M(T 0 ). Let j ∈ [m] and k ∈ [n]. By definition of M(T 0 ), we
can write (ψj ◦ T ) = T 0 (ψj ) = ∑_{r=1}^n Cr,j ϕr . Then (ψj ◦ T )(vk ) = ∑_{r=1}^n Cr,j ϕr (vk ) = Ck,j . Furthermore,
(ψj ◦ T )(vk ) = ψj (T vk ) = ψj (∑_{r=1}^m Ar,k wr ) = ∑_{r=1}^m Ar,k ψj (wr ) = Aj,k . Then Ck,j = Aj,k and by definition
of transpose we are done.
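In coordinates this says the dual map acts by the transposed matrix. A minimal check follows, assuming Python with numpy; functionals are written as row vectors and the matrix and seed are illustrative.

    # Illustration (numpy assumed): for T x = A x, the dual map sends the functional psi to psi A,
    # i.e. it acts by A transpose on coordinate vectors of functionals.
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((3, 5))          # T : F^5 -> F^3
    psi = rng.standard_normal(3)             # a functional on F^3 (row vector)
    v = rng.standard_normal(5)

    lhs = (psi @ A) @ v                      # (T'(psi))(v)
    rhs = psi @ (A @ v)                      # psi(T v)
    print(np.isclose(lhs, rhs), np.allclose(A.T @ psi, psi @ A))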

Row spaces and column spaces were discussed in Definition 61. We found they have the same
dimension, and are hence isomorphic. We prove the results in the general F case using the results from
abstract linear algebra.

Definition 131 (Rank). Let A be m × n matrix over F, then row rank of A is the dimension of span of
rows in A, and column rank of A is the dimension of the span of columns in A. The rank of the matrix
is its column rank.

Theorem 128. Let T ∈ L(V, W ) for finite dimensional V, W . Then dim(range T ) = column rank of
M(T ).

Proof. Let vi , i ∈ [n] be basis for V and wj , j ∈ [m] be basis for W . Then the function that takes
w ∈ span(T vi , i ∈ [n]) to M(w) is an isomorphism from span (T vi , i ∈ [n]) to span(M(T vi ), i ∈ [n]).
Therefore the dimensions match, and the RHS is the column rank of M(T ) by definition. Also, rangeT =
span(T vi , i ∈ [n]), and therefore dim(range T ) = dim(span(T vi , i ∈ [n])) and the RHS is column rank
M(T ).

We know that the column space and row space share dimension (Theorem 62), so the rank of matrix
A is the same as its column rank/row rank.

Theorem 129. Suppose A ∈ Fm,n , then row rank of A is column rank of A and we write this as rank A.

Proof. Let T ∈ L(Fn,1 , Fm,1 ) be defined by T x = Ax, then M(T ) = A w.r.t. the standard bases and we can write

colRank A = colRank(M(T )) = dim(range T ) = dim(range T 0 ) = colRank M(T 0 ) = colRank A0 = rowRank A.
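A numerical sanity check of rank A = rank A0 is immediate; the sketch assumes Python with numpy and a made-up rank-deficient matrix.

    # Illustration (numpy assumed): row rank equals column rank for a rectangular matrix.
    import numpy as np

    rng = np.random.default_rng(2)
    B = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 8))   # a 6x8 matrix of rank <= 3
    print(np.linalg.matrix_rank(B), np.linalg.matrix_rank(B.T))     # both report 3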

Exercise 190. Explain why every linear functional (Definition 126) must be surjective or the zero map.

Proof. By definition, linear functional ϕ ∈ L(V, F), and since range ϕ is subspace of F, which has dim F = 1,
dim(range ϕ) ≤ 1. If dim(range ϕ) = 0 then ϕ is zero map, else range ϕ is spanned by a single non-zero vector and
range ϕ = F.

Exercise 191. Suppose V is finite dimensional, v ∈ V and v 6= 0, then ∃ϕ ∈ V 0 s.t. ϕ(v) = 1.

Proof. Since v is non-zero, extend this to basis v, v2 , · · · , vn . Then consider dual basis w.r.t to this basis
as ϕ, ϕ2 , · · · , ϕn . Then ϕ(v) = 1.

Exercise 192. Suppose V is finite dimensional and U is subspace of V s.t. U 6= V . Prove ∃ϕ ∈ V 0 s.t.
ϕ(u) = 0 for all u ∈ U but ϕ 6= 0.

Proof. Let ui , i ∈ [m] be basis of U and extend this to ui , i ∈ [m], uj , j ∈ [m + 1, k] basis of V . Then
define ϕ ∈ V 0 by ϕ(ui ) = 1{i = m + 1}. Then ϕ is non-zero linear functional satisfying requirements.
Exercise 193. Let Vi , i ∈ [m] be vector spaces, and prove that (×_{i=1}^m Vi )0 and ×_{i=1}^m Vi0 are isomorphic.

Proof. See Theorem 118. Then

dim (×_{i=1}^m Vi )0 = dim (×_{i=1}^m Vi ) = ∑_{i=1}^m dim Vi = ∑_{i=1}^m dim Vi0 = dim (×_{i=1}^m Vi0 ). (475)

Exercise 194. Suppose V is finite dimensional and vi , i ∈ [m] be elements in V , then define Γ : V 0 → Fm
by Γ(ϕ) = (ϕ(v1 ), · · · , ϕ(vm )). Then prove that vi , i ∈ [m] spans V iff inj(Γ), and that vi∈[m] is linearly
independent iff sur(Γ).

Proof. Suppose vi , i ∈ [m] spans V , then Γ(ϕ) = 0 implies ∀i ∈ [m], ϕ(vi ) = 0. Then ϕ is zero linear
functional since vi ’s span V . Then inj(Γ). On the other hand, suppose inj(Γ). Suppose span((vi )i∈[m] ) 6=
V , then ∃ϕ ∈ V 0 (Exercise 192) satisfying ϕ(v) = 0 for all v ∈ span((vi )i∈[m] ) where ϕ 6= 0, but this is
contradiction since we assumed inj(Γ). So the vi ’s must span V .
For the second part, suppose vi , i ∈ [m] are linearly independent, then we may extend this to basis
vi , i ∈ [m], vj , j ∈ [m + 1, n] of V . Then for any (f1 , · · · , fm ) ∈ Fm , Theorem 95 lets us define ϕ ∈ V 0 s.t. ϕ(vi ) = fi , i ∈ [m]
(and, say, ϕ(vj ) = 0 for j ∈ [m + 1, n]), so Γ(ϕ) = (f1 , · · · , fm ). Then sur(Γ). On the other hand, suppose Γ is surjective [1] but that vi , i ∈ [m]
are linearly dependent. Then ∃ki ∈ F, not all zero, s.t. ∑_{i=1}^m ki vi = 0; say ki ≠ 0, so that vi may
be written as linear combination of the other vj ’s, j ∈ [m] \ {i}. We argue ei ∉ range Γ. Otherwise, ∃ϕ ∈ V 0 s.t.
Γ(ϕ) = ei , i.e. ϕ(vj ) = 0 for j ≠ i and ϕ(vi ) = 1; but applying ϕ to the linear combination expressing vi in terms of the other vj ’s gives
ϕ(vi ) = 0. This is contradiction, so Γ would not be surjective. It follows that vi , i ∈ [m] must be linearly independent.

Let m ∈ Z+ , then see by direct computation that the dual basis of the standard basis of Pm (R) is
given by ϕj , j = 0, 1, · · · , m where ϕj (p) = p(j) (0)/j!.

Exercise 195. Let vi , i ∈ [n] be basis of V and ϕi , i ∈ [n] be the dual basis of V 0 . Then suppose ψ ∈ V 0 ,
and prove that ψ = ∑_{i=1}^n ψ(vi )ϕi .

Proof. See that (∑_{i=1}^n ψ(vi )ϕi )(vj ) = ψ(vj ) for each j ∈ [n]; since the two functionals agree on a basis of V , ψ = ∑_{i=1}^n ψ(vi )ϕi .

Exercise 196. Show that the dual map of the identity map on V is identity map on V 0 .

Proof. Let vi , i ∈ [n] be basis of V and let ϕi , i ∈ [n] be the dual basis of V 0 . The identity map takes
T vi = vi . Consider the dual map T 0 ∈ L(V 0 ), and see that T 0 (∑_{i=1}^n ai ϕi ) = ∑_{i=1}^n ai T 0 (ϕi ) = ∑_{i=1}^n ai (ϕi ◦ 1) = ∑_{i=1}^n ai ϕi , so we are done.

Exercise 197. Let W be finite dimensional and T ∈ L(V, W ), then argue that T 0 = 0 iff T = 0.

Proof. If T = 0, then ∀w ∈ W 0 , v ∈ V , write (T 0 w)v = wT v = w0 = 0. So T 0 = 0. On the other hand, if


T 0 = 0, and suppose T 6= 0, then ∃v ∈ V s.t. 0 6= T v ∈ W , and Exercise 191 asserts (substitute T v → v)
ϕ ∈ W 0 satisfying ϕ(T v) = T 0 ϕv = 1, and T 0 6= 0. This is contradiction. So T = 0.

Exercise 198. Let V, W be finite dimensional and prove that the map taking T → T 0 , where T ∈
L(V, W ), T 0 ∈ L(W 0 , V 0 ) is isomorphism.

Proof. Let Γ be the linear map Γ(T ) = T 0 , and since dim L(V, W ) = dim V · dim W = dim V 0 · dim W 0 =
dim L(W 0 , V 0 ), inj(Γ) =⇒ sur(Γ) so inj(Γ) =⇒ inv(Γ). Suppose Γ(S) = S 0 = 0, then S = 0 by
Exercise 197.

Exercise 199. Suppose U ⊂ V , and prove that U 0 = {ϕ ∈ V 0 | U ⊂ null ϕ}.

Proof. ∀u ∈ U, ϕ(u) = 0 ↔ U ⊂ null ϕ, so U 0 = {ϕ ∈ V 0 | ϕ(u) = 0 ∀u ∈ U } is Definition 130.

Exercise 200. Prove that if U ⊂ W ⊂ V , then W 0 ⊂ U 0 .

Proof. For all ϕ ∈ W 0 , ϕ(w) = 0 for w ∈ W , but ∀u ∈ U ⊂ W , ϕ(u) = 0 and ϕ ∈ U 0 .

Exercise 201. Suppose U, W are subspaces of V and show that (U + W )0 = U 0 ∩ W 0 .

Proof. [1]. Note U ⊂ U +W, W ⊂ U +W , and Exercise 200 asserts that (U +W )0 ⊂ U 0 , (U +W )0 ⊂ W 0 .


It follows that (U + W )0 ⊂ (U 0 ∩ W 0 ). OTOH, if f ∈ U 0 ∩ W 0 , then f (u) = f (w) = 0∀u ∈ U, ∀w ∈ W .
Then f (u + w) = f (u) + f (w) = 0 and ∀x := u + w ∈ U + W , f (x) = 0, So f ∈ (U + W )0 =⇒
(U 0 ∩ W 0 ) ⊂ (U + W )0 .

Exercise 202. Suppose V is finite dimensional and U, W are subspaces of V satisfying W 0 ⊂ U 0 . Prove
that U ⊂ W .

Proof. Exercise 201 asserts that (U + W )0 = U 0 ∩ W 0 = W 0 , where the last equality holds by assumption
W 0 ⊂ U 0 . Since dim(U + W )0 = dim V − dim(U + W ) and dim W 0 = dim V − dim W , we get dim(U + W ) =
dim W . Together with the statement W ⊂ U + W , we get U + W = W =⇒ U ⊂ W .

Exercise 203. Suppose V is finite dimensional and U, W are subspaces of V , then prove (U ∩ W )0 =
U 0 + W 0.

Proof. U ∩ W ⊂ U, U ∩ W ⊂ W =⇒ U 0 ⊂ (U ∩ W )0 , W 0 ⊂ (U ∩ W )0 by Exercise 200, and


therefore U 0 + W 0 ⊂ (U ∩ W )0 . OTOH, recall that dim V = dim U + dim U 0 (Theorem 122). Recall
also that for U subspace, W subspace, U ∩ W is subspace of W (adapt Exercise 38). Recall that
dim (U + W ) = dim U + dim W − dim (U ∩ W ) (Theorem 41). Furthermore, recall that (U ∩ W )0 is
subspace (Theorem 121) of V . Finally, Exercise 201 asserts that (U + W )0 = U 0 ∩ W 0 . Then,

dim(U 0 + W 0 ) = dim U 0 + dim W 0 − dim(U 0 ∩ W 0 ) (476)


= (dim V − dim U ) + (dim V − dim W ) − (dim V − dim(U + W )) (477)
= dim V − dim U − dim W + (dim U + dim W − dim(U ∩ W )) (478)
= dim V − dim(U ∩ W ) (479)
0
= dim(U ∩ W ) . (480)

dim(U 0 + W 0 ) = dim (U ∩ W )0 , (U 0 + W 0 ) ⊂ (U ∩ W )0 =⇒ (U 0 + W 0 ) = (U ∩ W )0 .

3.3.4 Polynomials
Complex numbers were introduced in Definition 92. For complex number z = a + bi, we refer to the real
and imaginary part of z as Re z = a and Im z = b respectively.

Definition 132 (Complex conjugates and absolute values). Let z ∈ C, then the complex conjugate of z,
is written z and is defined

z = Re z − (Im z)i. (481)

Absolute value of complex number is written |z| and is defined

|z| = √((Re z)2 + (Im z)2 ). (482)

Some complex arithmetic were discussed in Theorem 82. We extend some of the results:

Theorem 130. Suppose w, z ∈ C, then

1. z + z = 2Re z,

2. z − z = 2(Im z)i

3. zz = |z|2 ,

4. (w + z) = w + z and wz = w̄z̄,

5. z = z,

6. |Re z| ≤ |z|, |Im z| ≤ |z|,

7. |z| = |z|,

8. |wz| = |w||z|,

9. |w + z| ≤ |w| + |z|.

Proof. We prove the triangle inequality statement |w + z| ≤ |w| + |z|. We may write

|w + z|2 = (w + z)(w + z) (483)


= ww + zz + wz + zw (484)
= |w|2 + |z|2 + wz + wz (485)
2 2
= |w| + |z| + 2Re wz (486)
≤ |w|2 + |z|2 + 2|wz| (487)
2 2
= |w| + |z| + 2|w||z| (488)
= (|w| + |z|)2 . (489)

The triangle inequality for vectors refers to Theorem 69, and states that ku + vk ≤ kuk + kvk. If we
think of complex numbers z = a + bi as vectors with two coordinates (a, b), then the last statement in
Theorem 130 is to the same effect - see that the absolute value of a complex number, √(a2 + b2 ), is the
same as the Euclidean norm (Definition 68) of its vector representation.

Theorem 131. A function p : F → F is polynomial with coefficients in F if ∃a0 , · · · , am ∈ F satisfying
p(z) = ∑_{i=0}^m ai z i for all z ∈ F. If ∀z ∈ F, we have p(z) = 0, then ai = 0 for all i ∈ [0, m]. The coefficients
of a polynomial are uniquely determined by the polynomial.

Proof. Axler [3]. The contrapositive equivalent is shown. Suppose ∃ai ≠ 0; without loss of generality, let
am ≠ 0 and let z = (∑_{i=0}^{m−1} |ai |)/|am | + 1. Then z ≥ 1, z j ≤ z m−1 for j = 0, 1, · · · , m − 1 and by triangle
inequality (Theorem 82) we have

|∑_{i=0}^{m−1} ai z i | ≤ (∑_{i=0}^{m−1} |ai |) z m−1 < |am z m |. (490)

Then the LHS cannot equal the RHS in magnitude and it follows that ∑_{i=0}^m ai z i ≠ 0, so ∃z ∈ F s.t.
p(z) ≠ 0 (p must be non-zero polynomial).

Theorem 132 (Division Algorithm for Polynomials). Suppose p, s ∈ P(F) and s 6= 0, then !∃q, r ∈ P(F)
s.t. p = sq + r and deg r < deg s.

Proof. Let n = deg p, m = deg s. If n < m, let q = 0, r = p and we are done. Otherwise n ≥ m
and define T : (Pn−m (F) × Pm−1 (F)) → Pn (F) by T (q, r) = sq + r. Then T is valid linear map, and
(q, r) ∈ null T =⇒ sq + r = 0 =⇒ q = 0, r = 0 (since deg s = m, deg sq ≥ m). Then dim(null T ) = 0
and this is unique. Next, see that

dim (Pn−m (F) × Pm−1 (F)) = (n − m + 1) + (m − 1 + 1) = n + 1. (491)

Rank nullity asserts that dim(range T ) = n + 1 = dimPn (F). Then range T = Pn (F) and ∃q ∈ Pn−m (F),
r ∈ Pm−1 (F) satisfying p = T (q, r) = sq + r.
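The division algorithm is also easy to exercise numerically. A small sketch, assuming Python with numpy and its polynomial helpers; the polynomials are illustrative only, with coefficients listed in increasing degree order.

    # Illustration (numpy assumed): p = s q + r with deg r < deg s.
    import numpy as np
    from numpy.polynomial import polynomial as P

    p = np.array([1., 2., 0., 1.])     # p(z) = 1 + 2z + z^3
    s = np.array([1., 0., 1.])         # s(z) = 1 + z^2
    q, r = P.polydiv(p, s)
    print(q, r)                        # q(z) = z, r(z) = 1 + z
    print(np.allclose(P.polyadd(P.polymul(s, q), r), p))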

Definition 133 (Polynomial Zero/Root). The zero/root of a polynomial p ∈ P(F) is number λ ∈ F


satisfying p(λ) = 0.

Definition 134 (Polynomial Factor). For polynomials s, p ∈ P(F), s is said to be factor of p if ∃q s.t.
p = sq.

Theorem 133. Suppose p ∈ P(F) and λ ∈ F, then p(λ) = 0 iff ∃q ∈ P(F) s.t. p(z) = (z − λ)q(z) holds
for all z ∈ F.

Proof. Suppose q ∈ P(F) and p(z) = (z − λ)q(z) ∀z ∈ F, then clearly λ is root. OTOH, suppose λ is
root, see that deg (z − λ) = 1. Since a polynomial with degree less than one is constant, the Division
Algorithm (Theorem 132) argues that ∃q ∈ P(F), r ∈ F s.t. p(z) = (z − λ)q(z) + r for all z ∈ F. Since
p(λ) = 0, then r = 0 and p(z) = (z − λ)q(z) for all z ∈ F.

Theorem 134. A (non-zero) polynomial has at most as many zeros as its degree.

Proof. Let p ∈ P(F) and deg p = m. If m = 0, then p(z) = a0 ≠ 0 and p has no zeros. Suppose m = 1,
then p(z) = a0 + a1 z, a1 ≠ 0, then p has zero −a0 /a1 and no other zeros. Otherwise, assume every polynomial
of degree m − 1 has at most m − 1 distinct zeros; if p has no zeros, we are done. Otherwise p
has a zero λ and we can write p(z) = (z − λ)q(z), where deg q = m − 1 and q has at most m − 1 distinct zeros.
If p(z) = 0, then z = λ or q(z) = 0. Then p has at most m zeros.

We state the next crucial result in algebra without proof.

Result 7 (Fundamental Theorem of Algebra). Every nonconstant polynomial with complex coefficients
has a zero.

Theorem 135. If p ∈ P(C) is nonconstant polynomial, then p has unique factorization of the form

p(z) = c(z − λ1 ) · · · (z − λm ), c, λi∈[m] ∈ C. (492)

Proof. Let p ∈ P(C), m = deg p. We prove by induction. The base case for when m = 1 is trivial:
p(z) = a0 + a1 z, a1 ≠ 0, then p(z) = a1 (z + a0 /a1 ). Assume unique factorization of the form asserted exists
for polynomials of degree m − 1. Fundamental Theorem of Algebra asserts that p has some zero λ, and so
we can write p(z) = (z − λ)q(z) for all z ∈ C. Since deg q = m − 1, inductively q has unique factorization,
and so p has factorization. To show this is unique, see that c is uniquely determined as the coefficient of
z m of p. Suppose

∏_{i=1}^m (z − λi ) = ∏_{i=1}^m (z − τi ) (493)

for all z ∈ C, then since the LHS equals 0 when z = λ1 , it follows that one of the τi ’s equates to λ1 .
Without loss of generality, assume that τ1 = λ1 , and for z ≠ λ1 , we get

∏_{i=2}^m (z − λi ) = ∏_{i=2}^m (z − τi ); (494)

two polynomials that agree for all z ≠ λ1 agree everywhere, so this holds for all z ∈ C. By the inductive assumption, this is unique polynomial factorization (except possibly the order) and we
are done.

Theorem 136. Polynomials with real coefficients have zeros in conjugate pairs. That is, if λ ∈ C is root, then λ̄
is root.

Proof. Let p(z) = ∑_{i=0}^m ai z i , where ai , i ∈ [0, m] are real. Then suppose λ ∈ C is zero of p, so ∑_{i=0}^m ai λi = 0,
and taking the complex conjugate on both sides, we get ∑_{i=0}^m ai λ̄i = 0. Therefore λ̄ is root.
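A numerical illustration of the conjugate-pair behaviour, assuming Python with numpy; the polynomial below is made up for the example.

    # Illustration (numpy assumed): non-real zeros of a real polynomial occur in conjugate pairs.
    import numpy as np

    coeffs = [1., -2., 1., -2.]        # x^3 - 2x^2 + x - 2 = (x - 2)(x^2 + 1), highest degree first
    roots = np.roots(coeffs)
    print(np.sort_complex(roots))      # approximately [-1j, 1j, 2]
    # the multiset of roots is closed under conjugation:
    print(np.allclose(np.sort_complex(roots), np.sort_complex(np.conj(roots))))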

Theorem 137 (Quadratic Polynomial Factorization). Suppose b, c ∈ R, then there exists polynomial
factorization

x2 + bx + c = (x − λ1 )(x − λ2 ) (495)

where λ1 , λ2 ∈ R iff b2 ≥ 4c.

Proof. Completing the squares, get

x2 + bx + c = (x + b/2)2 + (c − b2 /4). (496)

If b2 < 4c, then the RHS is positive for all real x. Then the polynomial has no real roots and may not
be factored. Otherwise, b2 ≥ 4c, then ∃d ∈ R s.t. d2 = b2 /4 − c and we can write

x2 + bx + c = (x + b/2)2 − d2 = (x + b/2 + d)(x + b/2 − d). (497)

Theorem 138. Suppose p ∈ P(R) is nonconstant polynomial, then p has unique factorization of form
(here m, M is possibly zero):

p(x) = c(x − λ1 ) · · · (x − λm )(x2 + b1 x + c1 ) · · · (x2 + bM x + cM ), (498)

where c, λi∈[m] , bi∈[M ] , ci∈[M ] ∈ R and b2j < 4cj for all j ∈ [M ].

Proof. See that p ∈ P(R) ⊂ P(C); we may think of p as a complex polynomial. If all complex zeros
of p are real, then we are done. Otherwise, it has a non-real complex zero λ, and λ̄ is also a zero (Theorem 136). We may write

p(x) = (x − λ)(x − λ̄)q(x) = (x2 − 2(Re λ)x + |λ|2 )q(x) (499)

for some polynomial q ∈ P(C) with deg q = (deg p) − 2. If we can prove q has all real coefficients, then
employing induction on deg p gives the asserted factorization, with (x − λ) appearing in the factorization of p an equal number of times as
(x − λ̄). To prove this, consider q(x) = p(x)/(x2 − 2(Re λ)x + |λ|2 ) for all x ∈ R; both numerator and
denominator are real (and the denominator has no real zeros since λ is non-real), so q(x) ∈ R. We may write

q(x) = ∑_{i=0}^{n−2} ai xi , (500)

where n = deg p and ai ∈ C for i ∈ [0, n − 2]. Then since q(x) is real with zero imaginary component, see that

0 = Im(q(x)) = ∑_{i=0}^{n−2} (Im ai )xi , x ∈ R. (501)

Since this is satisfied for all x, we have a zero polynomial. The zero polynomial has all zero coefficients.
Then the imaginary components Im ai are all zero, and q has all real coefficients and we are done.
To show uniqueness, each factor x2 + bj x + cj with bj^2 < 4cj has unique factorization (x − λj )(x − λ̄j ) for
λj ∈ C. Two different factorizations of p would imply, considered as a member of the set of complex
polynomials, two different factorizations of p ∈ P(C) - this is contradiction of the unique
factorization Theorem 135.

Exercise 204. Suppose m, n ∈ Z+ and m ≤ n, and let λi ∈ F for i ∈ [m]. Then prove ∃p ∈ P(F) s.t.
deg p = n and 0 = p(λi ) for i ∈ [m] and such that p has no other zeros.

Proof. Polynomial of the form p(z) = (∏_{i=1}^m (z − λi ))(z − λ1 )n−m satisfies the problem.

Exercise 205. Suppose m is nonnegative integer and zi , i ∈ [m + 1] are distinct elements of F, and
wi , i ∈ [m + 1] are in F. Then prove !∃p ∈ Pm (F) s.t. p(zj ) = wj for j ∈ [m + 1].

Proof. Define the map T : Pm (F) → Fm+1 by T p = (p(z1 ), · · · , p(zm+1 )). Clearly this is linear map. If
we show that T is isomorphism, then inj(T ) gives uniqueness of p and sur(T ) gives existence of such a p. For p ∈ null T ,
T p = (0, · · · , 0), so p(zi ) = 0 for i ∈ [m + 1] and p has m + 1 distinct roots. Theorem 134 asserts
a non-zero polynomial of degree at most m can have at most m distinct roots, so p can only be zero. Then null T = {0} and inj(T ).
Rank nullity asserts (Theorem 99) that dim range T = dim Pm (F) − dim null T = m + 1 = dim Fm+1 , so sur(T ).
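The existence-and-uniqueness statement is the familiar polynomial interpolation fact, and an exact degree-m fit through m + 1 points recovers the interpolant numerically. A sketch, assuming Python with numpy; the nodes and values are illustrative.

    # Illustration (numpy assumed): unique polynomial of degree <= m through m+1 points.
    import numpy as np

    z = np.array([0., 1., 2., 3.])          # m + 1 = 4 distinct nodes
    w = np.array([1., 3., -2., 0.])         # prescribed values
    coeffs = np.polyfit(z, w, deg=3)        # the unique interpolant of degree <= 3
    print(np.allclose(np.polyval(coeffs, z), w))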

Exercise 206. Suppose p ∈ P(C) and deg p = m, then show that p has m distinct zeros iff p, p0 has no
zeros in common.

Proof. Axler [2]. If p has m distinct roots, then we may write p(z) = c(z − λ1 ) · · · (z − λm ), with
λi ≠ λj when i ≠ j. We apply the product rule at some fixed λj and show that no λj is a zero of p0 . Write
p(z) = (z − λj )q(z) for some non-zero polynomial q(z) containing the remaining factors and see that
p0 (z) = (z − λj )q 0 (z) + q(z), so p0 (λj ) = q(λj ) ≠ 0. Since every zero of p is some λj , p, p0 have no zeros in common. OTOH, we want to show that p, p0 having no common
zeros implies that p has m distinct zeros. We show the contrapositive. Suppose p has < m distinct zeros,
then we can write p(z) = (z − λ)n q(z) for some n ≥ 2, and see that p0 (z) = (z − λ)n q 0 (z) + n(z − λ)n−1 q(z),
and clearly p0 (λ) = 0. We have obtained a common root.

Exercise 207. Suppose p ∈ P(C) and the function q : C → C is defined by

q(z) = p(z) · \overline{p(z̄)}. (502)

Then prove q is polynomial with all real coefficients.


Pn i
Proof. Let deg p = n, then p(z) = ∑_{i=0}^n ai z i . Since \overline{ab} = āb̄ (Theorem 130), we may write \overline{p(z̄)} =
∑_{i=0}^n āi z i . This is polynomial, so q(z) is product of polynomials and is polynomial. To see q ∈ P(R),
see that deg q = 2n and write q(z) = ∑_{i=0}^{2n} qi z i . But \overline{q(z̄)} = \overline{p(z̄)} · p(z) = q(z), so
∑_{i=0}^{2n} q̄i z i = ∑_{i=0}^{2n} qi z i , and qi = q̄i implies q(z) is polynomial with coefficients in R.
i=0 q̄i z = i=0 qi z , and qi = q̄i implies q(z) is polynomial with coefficients in R.

Exercise 208. Suppose non-zero polynomial p ∈ P(F), and let U = {pq : q ∈ P(F)}. Then show that
dim(P(F)/U ) = deg p and find the basis of the quotient space P(F)/U .

Proof. Division algorithm for polynomials (Theorem 132) argues that ∀f ∈ P(F), ∃!q, ∃!r satisfying
f = pq + r, deg r < deg p. Uniqueness implies that P(F) = U ⊕ Pdeg p−1 (F). The quotient map π restricted to Pdeg p−1 (F)
is injective (by the direct sum) and surjective onto P(F)/U , so dim P(F)/U = dim Pdeg p−1 (F) = deg p. A basis for P(F)/U is
1 + U, x + U, · · · , xdeg p−1 + U .

3.3.5 Eigenvectors and Invariant Subspaces


3.3.5.1 Invariant Subspaces

Definition 135. Suppose T ∈ L(V ), then a subspace U of V is invariant subspace under T if
∀u ∈ U, T u ∈ U .

The trivial invariant subspaces w.r.t. all operators T ∈ L(V ) are {0} and V itself. We can also show
that the nullspace and range are invariant under T .

Exercise 209. Show that null T, range T are both invariant subspaces of V .

Proof. If u ∈ null T, then T u = 0 and T (T u) = 0, so T u ∈ null T . If u ∈ range T, then T u ∈ range T ,


so range T is invariant under T .

Definition 136. Suppose T ∈ L(V ), then λ ∈ F is said to be eigenvalue (characteristic value) of T if


∃v ∈ V s.t. v 6= 0, T v = λv.

Therefore, T has a one-dimensional invariant subspace iff T has eigenvalue.

Theorem 139. Suppose V is finite dimensional, T ∈ L(V ) and λ ∈ F, then the following statements are
equivalent:

1. λ is eigenvalue of T ,

2. T − λ1 is not injective,

3. T − λ1 is not surjective,

4. T − λ1 is not invertible.

Proof. Since T is linear operator, parts 2,3,4 are equivalent statements. If T v = λv then T v − λv =
(T − λ1)v = 0 and part 1 iff part 2.

Definition 137. For T ∈ L(V ), eigenvalue λ ∈ F, vector v ∈ V is eigenvector of T corresponding to
eigenvalue λ if v 6= 0, T v = λv.

It follows that non-zero v ∈ V is eigenvector of T iff v ∈ null(T − λ1). Exercise 70 asserts the
result that eigenvectors vi , i ∈ [m] corresponding to distinct eigenvalues λi , i ∈ [m] are linearly independent.
Suppose λi , i ∈ [m] are distinct eigenvalues of linear operator T ; since the corresponding eigenvectors are
linearly independent, there can be at most dim V distinct eigenvalues.
If T ∈ L(V ), U is subspace of V invariant under T , then U determines two other operators, namely
T |U , T /U .

Definition 138. Let T ∈ L(V ) and U be T -invariant subspace of V , then define

1. the restriction operator T |U (u) ∈ L(U ) to be

∀u ∈ U, T |U (u) = T u, (503)

2. the quotient operator T /U ∈ L(V /U ) to be

∀v ∈ V, (T /U )(v + U ) = T v + U. (504)

It is left to the reader to verify that the restriction and quotient operators are valid linear maps. For the
quotient operator, verify that if v + U = w + U , then T v + U = T w + U . To see this, write by equivalence
v + U = w + U =⇒ (v − w) ∈ U (Theorem 114), and by invariance T (v − w) ∈ U , and by linearity
T v − T w ∈ U , and by equivalence T v + U = T w + U .

Exercise 210. Let T ∈ L(F2 ) be T (x, y) = (y, 0) and U = {(x, 0) : x ∈ F}. Then show

1. U is T -invariant, T |U is zero operator on U .

2. there is no subspace W of F2 that is T -invariant and s.t. F2 = U ⊕ W .

3. T /U is the zero operator on F2 /U .

Proof. -

1. For (x, 0) ∈ U , see T (x, 0) = (0, 0) ∈ U , so U is T -invariant and T |U is the zero operator on U .

2. Let W be subspace of V , and F2 = U ⊕ W , then dim W = dimF2 − dim U = 1, and if W were


invariant, nonzero w ∈ W must be eigenvector. However, clearly 0 is the eigenvalue of T , and all
eigenvectors of T are in U . W must not be invariant.

3. For (x, y) ∈ F2 , write

(T /U )((x, y) + U ) = T (x, y) + U = (y, 0) + U = 0 + U, (505)

where Theorem 114 asserts the last equality. T /U is zero operator.

When there is no ambiguity about the linear map at hand, we just say that a subspace is invariant
without specifying the linear operator it is associated with.

Exercise 211. Suppose T ∈ L(V ), and U is subspace of V , then prove that U ⊂ null T =⇒ U is
T -invariant. Prove that if range T ⊂ U =⇒ U is T -invariant.

Proof. ∀u ∈ U , u ∈ null T so T u = 0, and 0 ∈ U so U is invariant.
∀u ∈ U , T u ∈ range T =⇒ T u ∈ U so U is invariant.

Exercise 212. Let S, T ∈ L(V ) and ST = T S, then prove null S is invariant under T .

Proof. For all v ∈ null S, ST v = T Sv = 0, so T v ∈ null T .

Exercise 213. Let S, T ∈ L(V ) and ST = T S, then prove range S is invariant under T .

Proof. ∀u ∈ range S, ∃v ∈ V s.t. Sv = u, so T u = T Sv = ST v ∈ range S.


Exercise 214. Suppose T ∈ L(V ), Ui , i ∈ [m] are invariant subspaces. Then prove ∑_{i=1}^m Ui is invariant.

Proof. For all u ∈ ∑_{i=1}^m Ui we can write u = ∑_{i=1}^m ui where ui ∈ Ui , i ∈ [m], and T u = ∑_{i=1}^m T ui , where by
invariance T ui ∈ Ui for i ∈ [m] and hence T u ∈ ∑_{i=1}^m Ui .

Exercise 215. Suppose T ∈ L(V ), and Ui , i ∈ [m] are invariant subspaces. Then prove ∩m
i Ui is
invariant.

Proof. For all u ∈ ∩m


i Ui , u ∈ U1 ∧u ∈ U2 ∧· · ·∧u ∈ Um , and by invariance T u ∈ U1 ∧T u ∈ U2 ∧· · · T u ∈ Um
and so T u ∈ ∩m
i Ui and we are done.

Exercise 216. If V finite dimensional, U is invariant subspace under every operator T ∈ L(V ), then
prove that U = {0} or U = V .

Proof. Prove contrapositive. Suppose U 6= {0} and U 6= V , and choose non-zero u ∈ U , and w ∈ V, 6∈ U .
u may be extended to basis u, v1 , · · · , vn of V , and Theorem 95 asserts that there is some unique linear
map T (au + b1 v1 + · · · bn vn ) = aw. Then T u = w, and U is not invariant under T .

Exercise 217. Find all eigenvalues,eigenvectors of T (w, z) = (z, w) for w, z ∈ F.

Proof. Since T (w, z) = λ(z, w), then we obtain linear system z = λw, w = λz. Then w = λ2 w, λ2 = 1
and λ = ±1. Then the set of eigenvectors w.r.t eigenvalue 1 (eigenspace) is span((1, 1)), and eigenspace
w.rt. eigenvalue −1 is span((1, −1)).

Exercise 218. Define D : P(R) → P(R) by Dp = p0 , then find all eigenvalues and eigenvectors of D.

Proof. If λ is eigenvalue, then Dq = λq = q 0 . Reason that the differentiation map satisfies this only for
λ = 0, and eigenvectors q belonging to the set of nonzero constants as polynomial functions.

Exercise 219. Define T ∈ L(V ) for finite dimensional V , and λ ∈ F, then prove ∃α ∈ F s.t. |α − λ| <
1/1000 and T − α1 is invertible.

Proof. [1]. Let αi ∈ F s.t. |αi − λ| = 1/(1000 + i) for i ∈ [dim V + 1]. Since T has at most dim V distinct
eigenvalues, ∃αi that is not eigenvalue of T and T − αi 1 is invertible (Theorem 139).

Exercise 220. Suppose V = U ⊕ W , where U, W are non-zero subspaces, and define P ∈ L(V ) by P (u + w) = u
for all u ∈ U, w ∈ W . Then find the eigenvectors and eigenvalues of P .

Proof. Write P v = P (u + w) = u = λv = λ(u + w) = λu + λw for some v = u + w, where u ∈ U, w ∈ W by
the direct sum relation. Then u = λu + λw, and so (λ − 1)u + λw = 0. By direct sum, (λ − 1)u = λw = 0.
If u ≠ 0, then λ = 1, w = 0 and the eigenvectors for eigenvalue 1 are the nonzero u ∈ U . Otherwise, u = 0, and if w ≠ 0,
then λ = 0 and the eigenvectors for eigenvalue 0 are the nonzero w ∈ W .

Exercise 221. Let T ∈ L(V ) and let there be some invertible S ∈ L(V ). Then prove that T, S −1 T S
share eigenvalues and discuss the relationship between their eigenvectors.

Proof. For all v ∈ V, λ ∈ F satisfying T v = λv, see that

S −1 T S(S −1 v) = S −1 T v = S −1 T v = S −1 λv = λS −1 v. (506)

Since S −1 is invertible and hence injective, S −1 v is non-zero. Pre-multiply S to both RHS and LHS to
see that they share precisely the same eigenvalues. Then v is eigenvector of T iff S −1 v is eigenvector of
S −1 T S.

Exercise 222. Suppose V is complex vector space and T ∈ L(V ), M(T ) w.r.t arbitrary V -basis contain
all real entries. Then show that if λ is eigenvalue, so is λ.

Proof. Although the question is asking about matrices, we can just work with coordinates w.r.t. the
specified basis vj , j ∈ [n], writing T vj = ∑_{i=1}^n ai,j vi with every ai,j real. Suppose T v = λv for non-zero
v = ∑_{i=1}^n ci vi . Comparing the coefficient of vi on both sides of T (∑_{j=1}^n cj vj ) = λ ∑_{i=1}^n ci vi gives

λci = ∑_{j=1}^n ai,j cj , i ∈ [n]. (507)

Taking complex conjugates and using that the ai,j are real, λ̄c̄i = ∑_{j=1}^n ai,j c̄j for all i ∈ [n], so the non-zero vector
w = ∑_{i=1}^n c̄i vi satisfies T w = λ̄w. Then λ̄ is eigenvalue and we are done.


Exercise 223. Suppose T ∈ L(V ) is invertible, and λ ≠ 0. Show that λ is eigenvalue of T iff 1/λ is eigenvalue
of T −1 . Prove they have the same eigenvectors.

Proof. See Exercise 78.

Exercise 224. For finite dimensional V , S, T ∈ L(V ), prove ST, T S have same eigenvalues.

Proof. Suppose λ is eigenvalue of ST , then ∃v 6= 0 s.t. ST v = λv, and consider T S(T v) = T λv = λT v, so


T v is the eigenvector associated with eigenvalue λ if T v 6= 0. Otherwise T v = 0, then ST v = λv = 0, v 6= 0
implies λ = 0, and also that T is not injective. Then T S cannot be invertible, and λ = 0 is eigenvalue
of T S. Invoke symmetry.
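The matrix version of this statement is easy to verify numerically. The sketch assumes Python with numpy; matrices and seed are illustrative, and the comparison tolerates small floating point error.

    # Illustration (numpy assumed): ST and TS have the same eigenvalues.
    import numpy as np

    rng = np.random.default_rng(4)
    S = rng.standard_normal((5, 5))
    T = rng.standard_normal((5, 5))

    ev_ST = np.sort_complex(np.linalg.eigvals(S @ T))
    ev_TS = np.sort_complex(np.linalg.eigvals(T @ S))
    print(np.allclose(ev_ST, ev_TS))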

Exercise 225. Let An×n have entries in F, and T ∈ L(Fn ) be s.t. T x = Ax. If the sum of entries in
each column of A is one, then prove that one is eigenvalue of T . If the sum of the entries in each row of
A equal one, prove that one is eigenvalue of T .

Proof. See Exercise 84 for both assertions. The proof for the first assertion holds for Ai,j ∈ F.

Result 8. If T ∈ L(V ) have u, v as eigenvectors and u + v is eigenvector of T , then u, v correspond to


the same eigenvalue.

This result becomes trivial when we talk about the eigenspace (Definition 141).

Exercise 226. Suppose T ∈ L(V ), and ∀v ∈ V , ∃λ ∈ F s.t. T v = λv. Prove T = c1 for some c ∈ F.

Proof. [2]. ∀v ∈ V , let av ∈ F that satisfies T v = av v. If T = c1, then av would not depend on v.
Consider v, w ∈ V , and suppose (v, w) linearly dependent, then ∃b ∈ F s.t. w = bv. Then

aw w = T w = T bv = bT v = bav v = av w, (508)

so aw = av . Otherwise, they are linearly independent and av+w (v + w) = T (v + w) = T v + T w =


av v + aw w, so (av+w − av )v + (av+w − aw )w = 0. By linear independence, av+w = av = aw .

Exercise 227. Suppose V is finite dimensional, T ∈ L(V ) and every subspace of V , dimension dimV −1
is invariant under T . Then prove T = c1 for some c ∈ F.

Proof. [2]. Prove the contrapositive. Exercise 226 asserts ∃u ∈ V s.t. it is not an eigenvector. Then u, T u
is linearly independent and extend this to basis of V as u, T u, v1 , · · · , vn . Let U = span(u, v1 , · · · , vn ),
and see that this is subspace of V and dim U = dim V − 1, but is not invariant since T u 6∈ U .

Exercise 228. Suppose T ∈ L(V ), dimrangeT = k. Prove that T has at most k +1 distinct eigenvalues.

Proof. Let λi , i ∈ [m] be distinct eigenvalues, then ∃vi , i ∈ [m] non-zero s.t. T vi = λi vi . At most one λi
equals zero, so at least m − 1 of the eigenvalues are non-zero; for those, vi = T (vi /λi ) ∈ range T , and these vi are
linearly independent (Exercise 70). Then m − 1 ≤ dim range T = k and m ≤ k + 1.

Exercise 229. Suppose V is finite dimensional and vi , i ∈ [m] is list of vectors in V , and prove these
are linearly independent iff ∃T ∈ L(V ) s.t vi , i ∈ [m] are eigenvectors of T corresponding to distinct
eigenvalues.

Proof. That eigenvectors corresponding to distinct eigenvalues are linearly independent are known (Ex-
ercise 70). On the other hand, let vi , i ∈ [m] be linearly independent, then extend this to basis vi , i ∈
[m], vj , j ∈ [m + 1, n] and Theorem 95 asserts that a linear map T ∈ L(V ) defined by T vi = ivi , i ∈ [n]
is well defined, and such T must exist.

Exercise 230. Suppose λi , i ∈ [n] are distinct real numbers and prove that the list exp(λi x), i ∈ [n] is
linearly independent in the vector space of real valued functions on R.

Proof. For vector space V = span((exp(λi x), i ∈ [n]), define operator T ∈ L(V ) by T f = f 0 , then see
that T exp(λi x) = λi exp(λi x), and since λi ’s are distinct we are done by Exercise 70.

Exercise 231. Suppose T ∈ L(V ) and prove that T /range T = 0.

Proof. For all x ∈ V, x+rangeT ∈ V /rangeT . Then T /(rangeT )(x+rangeT ) = T x+rangeT = rangeT ,
and since this holds for all x ∈ V , T /(range T ) is the zero function.

Exercise 232. Suppose T ∈ L(V ), and prove that T /null T is injective iff null T ∩ range T = {0}.

Proof. Since we may write T /(null T )(x + null T ) = T x + null T . RHS implies T /(null T ) injective iff
T x ∈ null T ↔ x ∈ null T . This statement is equivalent to null T ∩ range T = {0}. OTOH, suppose
null T ∩ range T = {0}. Then ∀v ∈ null T ∩ range T , ∃u ∈ V s.t. T u = v, and T u ∈ null T implies
u ∈ null T and we get v = T u = 0.

3.3.5.2 Eigenvectors and Upper Triangular Matrices

The concept of applying a polynomial to an operator is introduced here. Powers of matrices are intro-
duced in Definition 32. Powers of matrix inverses (negative powers) were introduced Definition 37.

Definition 139 (Polynomial applied to linear operators). Suppose T ∈ L(V ) and p ∈ P(F) is a polynomial
given by

p(z) = ∑_{i=0}^m ai z i , z ∈ F. (509)

Then define the polynomial for operators to be

p(T ) = ∑_{i=0}^m ai T i . (510)

Recall that T 0 = 1.

Definition 140 (Product of Polynomials). If p, q ∈ P(F), then pq ∈ P(F) is the polynomial defined
(pq)(z) = p(z)q(z) for z ∈ F.

Theorem 140 (Properties of Polynomial Products). If p, q ∈ P(F), T ∈ L(V ), then (pq)(T ) = p(T )q(T )
and p(T )q(T ) = q(T )p(T ).
Proof. Suppose p(z) = ∑_{j=0}^m aj z j and q(z) = ∑_{k=0}^n bk z k for z ∈ F, then

(pq)(z) = ∑_{j=0}^m ∑_{k=0}^n aj bk z j+k , (511)

so

(pq)(T ) = ∑_{j=0}^m ∑_{k=0}^n aj bk T j+k = (∑_{j=0}^m aj T j )(∑_{k=0}^n bk T k ) = p(T )q(T ). (512)

See that p(T )q(T ) = (pq)(T ) = (qp)(T ) = q(T )p(T ).
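Since matrices of operators realize this concretely, commutativity of polynomials in a fixed operator is easy to check numerically. A sketch, assuming Python with numpy; the polynomials p(z) = 1 + z^2 and q(z) = 2z - z^3 and the matrix are made up for illustration.

    # Illustration (numpy assumed): polynomials in the same operator commute.
    import numpy as np

    rng = np.random.default_rng(3)
    T = rng.standard_normal((4, 4))
    I = np.eye(4)

    p_T = I + np.linalg.matrix_power(T, 2)              # p(T) for p(z) = 1 + z^2
    q_T = 2 * T - np.linalg.matrix_power(T, 3)          # q(T) for q(z) = 2z - z^3
    print(np.allclose(p_T @ q_T, q_T @ p_T))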

Theorem 141. Every operator on finite-dimensional,nonzero complex vector space has eigenvalue.

Proof. Let V be complex vector space with dim V = n > 0 and T ∈ L(V ). Suppose non-zero v ∈ V ,
then v, T v, · · · , T n v is a list of n + 1 vectors and cannot be linearly independent. Then ∃ai , i ∈ [0, n], not all zero, satisfying
0 = ∑_{i=0}^n ai T i v. At least one of a1 , · · · , an ≠ 0, otherwise a0 v = 0 =⇒ a0 = 0. Let m ≥ 1 be the largest index with am ≠ 0. Theorem 135
asserts that the factorization

∑_{i=0}^m ai z i = c(z − λ1 ) · · · (z − λm ) (513)

exists where c is a nonzero complex number and λj∈[m] ∈ C. Write

0 = ∑_{i=0}^m ai T i v = c(T − λ1 1) · · · (T − λm 1)v. (514)

Then T − λj 1 is not injective for some j and T has eigenvalue.

Since the operator maps a vector space to itself, we may use the same basis. Accordingly, the matrix
representation of a linear operator M(T ) for T ∈ L(V, V ) = L(V ) may employ only one basis vi , i ∈ [n]
of V . We can then write M(T ) = M(T, (vi )i∈[n] ) without ambiguity. The matrices of operators are
therefore square (Definition 22). Given an operator T ∈ L(V ), we want to find a basis of V s.t. M(T ) is
as simple as possible.
We know that a linear operator over a finite dimensional complex vector space (Theorem 141) has at least one eigenvalue,
say λ. Let v1 be a corresponding eigenvector, and extend this to basis of V as v1 , v2 , · · · , vn . Then the matrix M(T )
w.r.t. this basis must have λ in the (1, 1) entry and zeros in the rest of the first column. We know this by definition
(Definition 113), since T v1 = ∑_{i=1}^n Ai,1 vi = λv1 , so A1,1 = λ and Ai,1 = 0 for i > 1. In fact, we may choose a basis of V such
that M(T ) has more zeros. Diagonal matrices, upper and lower triangular matrices were introduced in
Definition 28.

Theorem 142. Suppose T ∈ L(V ) and vi∈[n] be basis of V , then the following are equivalent statements:

1. M(T, (vi )i∈[n] ) is upper triangular,

2. T vj ∈ span{vi , i ∈ [j]} for all j ∈ [n],

3. span{vi , i ∈ [j]} is invariant under T , for j ∈ [n].

Proof. Part 1. iff part 2. is trivial, and so is part 3. =⇒ part 2. To show part 2. =⇒ part 3, see that
for all j ∈ [n],

∀i ∈ [j], T vi ∈ span(v1 , · · · , vi ) ⊂ span(v1 , · · · , vj ). (515)

Then if v ∈ span{vi , i ∈ [j]}, T v ∈ span{vi , i ∈ [j]}, so invariance follows.

Theorem 143. Over the complex space C, every operator has an upper triangular. That is, if V is
finite dimensional complex vector space, T ∈ L(V ), then ∃{vi , i ∈ [n]} basis of V s.t. M(T ) is upper
triangular.

Proof. The base case when dim V = 1 is trivial. Assume the result holds for all vector spaces with
dimension < dim V . We know an eigenvalue exists (Theorem 141); let this be λ and let U = range(T −
λ1). Since T −λ1 is not surjective, it follows that dimU < dimV . See T u = (T −λ1)u+λu. By definition
of U , (T − λ1)u ∈ U , trivially λu ∈ U , therefore T u ∈ U - U is invariant. Then T |U is operator on U ,
and by induction, ∃ui , i ∈ [m] basis of U where T |U has upper triangular matrix representation. Then
∀j, write T uj = (T |U )(uj ) ∈ span{ui , i ∈ [j]}. Extend ui , i ∈ [m] to basis of V and let this extension be
vj , j ∈ [n]. For each k, we may write T vk = (T − λ1)vk + λvk . Again, by definition, (T − λ1)vk ∈ U =
span((ui )i∈[m] ), λvk ∈ span(v1 , · · · , vk ) and therefore T vk ∈ span(u1 , · · · , um , v1 , · · · , vk ). By Theorem
142, T has upper triangular w.r.t ui , vj where i ∈ [m] and j ∈ [n].

Theorem 144. Let T ∈ L(V ) have an upper triangular, then T is invertible iff all diagonal entries of
that upper triangular matrix are non-zero.

Proof. Let vi , i ∈ [n] be basis of V w.r.t. which T has an upper triangular matrix containing constants λi , i ∈ [n] as
diagonals. Suppose ∀i ∈ [n], λi ≠ 0. The upper triangular matrix implies that T v1 = λ1 v1 . Because
λ1 ≠ 0, we have T (v1 /λ1 ) = v1 , so v1 ∈ range T . Consider T (v2 /λ2 ) = av1 + v2 for some a ∈ F; then
since T (v2 /λ2 ), av1 ∈ range T , it follows that v2 ∈ range T . Similarly, see that T (v3 /λ3 ) = bv1 + cv2 + v3 for
some b, c ∈ F. The LHS and bv1 , cv2 being in range T imply that v3 ∈ range T . We can continue this algorithm
to show that vi ∈ range T ∀i ∈ [n]. Since the vi ’s are basis and range T is a subspace, range T = V , so sur(T ) by
definition. So for operator T , inv(T ). Conversely, suppose T is invertible. Then λ1 ≠ 0, since otherwise
T v1 = 0 with v1 ≠ 0. Let 1 < j ≤ n and suppose λj = 0, then T maps span(v1 , · · · , vj ) → span(v1 , · · · , vj−1 ). This is
a mapping into a space of smaller dimension, so T restricted to span(v1 , · · · , vj ) cannot be injective, hence ¬inj(T ) =⇒ ¬inv(T ), and this is
contradiction of the assumption. So λj ≠ 0.

In the Euclidean space treatments, we saw that the determinants of singular matrices vanish (Theorem
72), and that upper triangular matrices have determinants equal to the product of diagonal entries (Theorem 16),
and the result follows. The proof in Theorem 144 is determinant-free.
Although no method exists for exactly computing the eigenvalues of an operator from a matrix in general (there
are no algebraic solutions to the characteristic polynomial (Definition 80) for square matrices of size
m > 4), if we can find a basis w.r.t. which the matrix of the operator is upper triangular, then we can read off the
eigenvalues easily.

Theorem 145 (Eigenvalues of Upper Triangular Matrix). Suppose T ∈ L(V ) has upper triangular, then
eigenvalues of T are precisely the entries on the diagonal of the upper triangular.

Proof. Let vi , i ∈ [n] be basis of V w.r.t. which T has upper triangular matrix M(T ) with diagonals
λi , i ∈ [n]. Let λ ∈ F, and see that M(T − λ1) is upper triangular matrix with diagonals λi − λ. By Theorem 144, T − λ1
is not invertible iff ∃i s.t. λi − λ = 0. Then λ is eigenvalue of T iff λ = λi for some i ∈ [n].
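The computational analogue of Theorem 145 is easy to check. The following is a minimal numerical sketch (assuming NumPy is available; the matrix and seed are arbitrary choices) confirming that the eigenvalues of an upper triangular matrix coincide with its diagonal entries.

import numpy as np

# A random 5x5 upper triangular matrix.
rng = np.random.default_rng(0)
A = np.triu(rng.standard_normal((5, 5)))

# Theorem 145: the eigenvalues of an upper triangular matrix are its diagonal entries.
eigs = np.sort_complex(np.linalg.eigvals(A))
diag = np.sort_complex(np.diag(A))
print(np.allclose(eigs, diag))  # expected: True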

Exercise 233. Let T ∈ L(V ) and n ∈ Z+ s.t T n = 0. Prove that 1 − T is invertible and that
(1 − T )−1 = 1 + T + · · · + T n−1 . (516)

Proof. See Exercise 22.

Exercise 234. Suppose P ∈ L(V ) s.t. P 2 = P , then show that V = null P ⊕ range P .

Proof. For all v ∈ V , we may write v = P v + (v − P v), where P v ∈ range P . Furthermore, see that
P (v − P v) = P v − P 2 v = P v − P v = 0, so v − P v ∈ null P . Then V = range P + null P . For
v ∈ null P ∩ range P , ∃u ∈ V s.t. P u = v, P v = 0, so we may write 0 = P v = P P u = P u = v, so v = 0
and null P ∩ range P = {0}. Theorem 85 asserts V = null P ⊕ range P .

Exercise 235. Suppose S, T ∈ L(V ) and inv(S), then prove that for p ∈ P(F), we have

p(ST S −1 ) = Sp(T )S −1 . (517)

Proof. For arbitrary m ∈ Z+ , we have

(ST S −1 )m = ST S −1 ST S −1 · · · ST S −1 = ST m S −1 (518)

and so the equation holds for p(z) = z^m . Since both sides of (517) are linear in p and the equation holds for every element of the monomial basis of Pn (F),
for each positive integer n, the result holds for all p ∈ P(F).

Exercise 236. Let T ∈ L(V ) and prove that 9 is eigenvalue of T 2 iff 3 or −3 is eigenvalue of T .

Proof. If ±3 is eigenvalue associated with eigenvector v, then clearly T T v = T (±3)v = (±3)2 v = 9v holds
and 9 is eigenvalue of T 2 . OTOH, if 9 is eigenvalue of T 2 , then T 2 − 91 is not injective, (T − 31)(T + 31)
is not injective, and at least one of the factors is not injective. Then 3 or −3 is eigenvalue of T .

Exercise 237. Suppose V is finite dimensional, T ∈ L(V ) and non-zero v ∈ V . If p is nonzero
polynomial of smallest degree satisfying p(T )v = 0, show that every zero of p is eigenvalue of T .

Proof. If p(λ) = 0, then p(z) = (z − λ)q(z) for some q(z) ∈ P(F), and suppose λ is not eigenvalue of T .
Then T − λ1 is injective, and so 0 = p(T )v = (T − λ1)q(T )v implies that q(T )v = 0. But q is smaller
degree nonzero polynomial than p, and we have found a contradiction. Then λ must be eigenvalue of
T.

Exercise 238. For T ∈ L(V ), eigenvector v of T associated with eigenvalue λ, p ∈ P(F), prove that
p(T )v = p(λ)v.
Proof. By Exercise 78, T v = λv =⇒ T^n v = λ^n v. For polynomial p := ∑_{i=0}^{k} ai x^i ∈ P(F), we have

p(T )v = (∑_{i=0}^{k} ai T^i )v = ∑_{i=0}^{k} ai T^i v = ∑_{i=0}^{k} ai λ^i v = p(λ)v.   (519)

Exercise 239. For complex vector space V , T ∈ L(V ), p ∈ P(C), α ∈ C, prove that α is eigenvalue of
p(T ) iff α = p(λ) for some eigenvalue λ of T .

Proof. Suppose α is eigenvalue of p(T ), then p(T ) − α1 is not injective, and by polynomial factorization
(Theorem 135) we may write p(z) − α = c(z − λ1 ) · · · (z − λm ), where c, λi∈[m] ∈ C. If c 6= 0, then
p(T ) − α1 = c(T − λ1 1) · · · (T − λm 1). Since the LHS is not injective, some term on the RHS is not
injective, and λj must be eigenvalue for some j; then p(λj ) − α = 0, i.e. α = p(λj ). OTOH, if α = p(λ) for some eigenvalue λ of
T , then ∃v ∈ V, v ≠ 0 satisfying T v = λv. Exercise 238 asserts that p(T )v = p(λ)v = αv, so α is eigenvalue of p(T ).

Exercise 240. For complex vector space W , linear map T ∈ L(W ) with no eigenvalues, prove that every
subspace of W invariant under T is either {0} or infinite dimensional.

Proof. Suppose some invariant subspace U ≠ {0} is finite dimensional. Then T |U is an operator on a nonzero, finite-dimensional complex vector space and has an eigenvalue (Theorem 141), whose eigenvector in U is also an eigenvector of T , a contradiction. This is essentially the contrapositive of Theorem 141.

Exercise 241. For finite dimensional complex vector space V , linear map T ∈ L(V ) and f : C → R given by

f (λ) = dim range(T − λ1). (520)

Prove f is not a continuous function.

Proof. T has eigenvalue (Theorem 141), say λ. Then T − λ1 is not surjective (Theorem 139), so
dim range(T − λ1) < dim V . On the other hand, for any non-eigenvalue Λ, dim range(T − Λ1) = dim V . T has finite
number of eigenvalues (Theorem 6), so we can define some sequence λ1 , λ2 , · · · of non-eigenvalues s.t. limn→∞ λn = λ.
Then dim range(T − λn 1) = dim V for all n, and we have found f (λ) ≠ limn→∞ f (λn ) where
λn → λ, so f is not continuous at λ.

3.3.5.3 Eigenspaces, Diagonal Matrices

Diagonal matrices were defined (see Definition 23). Eigenspaces were defined (see Definition 81). To
make clear the association with a linear operator and in the generalization of fields, we define:

Definition 141. Let T ∈ L(V ) and λ ∈ F, then eigenspace of T corresponding to λ is written E(λ, T ) =
null(T − λ1). E(λ, T ) is the set of all eigenvectors of T corresponding to λ (along with 0).

Since eigenspace is nullspace, which is vector space, then λ is eigenvalue iff E(λ, T ) 6= {0}. If λ is
eigenvalue of T ∈ L(V ), then T |E(λ,T ) is the operator corresponding to multiplication by λ.

Theorem 146. Suppose V is finite dimensional and T ∈ L(V ), and suppose λi , i ∈ [m] are distinct
eigenvalues. Then the sum E(λ1 , T ) + · · · + E(λm , T ) = ⊕_{i=1}^{m} E(λi , T ) is a direct sum.

Proof. Consider ∑_{i=1}^{m} ui = 0 where each ui ∈ E(λi , T ). Nonzero eigenvectors belonging to distinct eigenvalues
are linearly independent (Exercise 70), so each ui must be zero. By the characterization of direct sum
(Definition 85) we are done.
Note that Theorem 146 asserts that ∑_{i=1}^{m} dim E(λi , T ) ≤ dim V , since the dimension of a direct sum of
subspaces equals the sum of the dimensions of the component subspaces (Theorem 113).

Definition 142 (Diagonalizable Operator). An operator T ∈ L(V ) is said to be diagonalizable if M(T )
is diagonal matrix for some basis of V .

Theorem 147. Let V be finite dimensional and T ∈ L(V ), and let λi , i ∈ [m] be distinct eigenvalues of
T . Then the following statements are equivalent:

1. T is diagonalizable,

2. V has basis of eigenvectors of T ,

3. ∃ one-dimensional subspaces Ui , i ∈ [n], each invariant under T , s.t. V = ⊕_{i=1}^{n} Ui ,

4. V = ⊕_{i=1}^{m} E(λi , T ),

5. dim V = ∑_{i=1}^{m} dim E(λi , T ).

Proof. Part 1. iff part 2., since T ∈ L(V ) has diagonal matrix M(T )ii = λi w.r.t. vi , i ∈ [n] exactly when
T vi = λi vi for each i. Assume part 2. holds, so V has a basis vi , i ∈ [n] of eigenvectors, and write Ui = span(vi ), i ∈ [n].
Each Ui is one-dimensional and invariant, and by unique representation of elements in a basis, each vector in V is written uniquely as a sum
of elements of the Ui ’s, so V = ⊕_{i=1}^{n} Ui . So part 2. implies part 3. If part 3. holds, ∀i ∈ [n] let ui be a nonzero vector in Ui ;
since Ui is one-dimensional and invariant, T ui ∈ span(ui ) and ui is an eigenvector. Since each v ∈ V may be written
uniquely as a sum of elements of the Ui ’s, hence as a linear combination of the ui ’s, the ui ’s are basis of V
(Theorem 89). Part 3. implies part 2. So 1 ↔ 2 ↔ 3. We show 2 → 4 → 5 → 2. If V has basis of
eigenvectors, then each v ∈ V can be expressed by the eigenvectors, so V = ∑_{i=1}^{m} E(λi , T ) = ⊕_{i=1}^{m} E(λi , T )
by Theorem 146, and 4 holds. Then 5 holds by the dimension relation for direct sums. If 5 holds,
that is dim V = ∑_{i=1}^{m} dim E(λi , T ), then choose a basis of each E(λj , T ) and combine the bases to get
eigenvectors vi , i ∈ [n]. Recursively apply the results of Exercise 69 to assert their linear independence;
since there are dim V of them, they form a basis of eigenvectors and we are done.

Theorem 148. If T ∈ L(V ) has dim V distinct eigenvalues, then T is diagonalizable.

Proof. Theorem 74 asserts a linear operator is diagonalizable iff it has n = dim V linearly independent
eigenvectors, and eigenvectors belonging to distinct eigenvalues are linearly independent (Theorem 70).
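As a quick numerical illustration of Theorem 148 (a sketch assuming NumPy; the seed and size are arbitrary): a randomly drawn matrix almost surely has distinct eigenvalues, and stacking its eigenvectors as the columns of a matrix P gives P −1 AP diagonal.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

# Eigen-decomposition; for a random matrix the eigenvalues are a.s. distinct.
eigvals, P = np.linalg.eig(A)   # columns of P are eigenvectors

D = np.linalg.inv(P) @ A @ P    # matrix of the operator w.r.t. the eigenvector basis
print(np.allclose(D, np.diag(eigvals)))  # expected: True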

Exercise 242. Suppose V is finite dimensional vector space, and let T ∈ L(V ) be diagonalizable and
prove that V = null T ⊕ range T .

Proof. [1]. If inv(T ), then clearly null T = {0}, range T = V and the result holds. Otherwise ¬inv(T ),
so T − 0 · 1 is not invertible and λ0 = 0 is eigenvalue. Also, V has basis of eigenvectors of T . Let 0 = λ0 , λ1 , · · · , λm be
the distinct eigenvalues of T ; then Theorem 147 asserts V = ⊕_{i=0}^{m} E(λi , T ). See that null T = E(0, T ). For i ≥ 1 and eigenvector
vi ∈ E(λi , T ), see that T (vi /λi ) = vi , so E(λi , T ) ⊂ range T . Then ⊕_{i=1}^{m} E(λi , T ) ⊂ range T . Conversely, for
v ∈ range T , write v = T u and u = ∑_{i=0}^{m} ui with ui ∈ E(λi , T ); then v = T u = ∑_{i=1}^{m} λi ui ∈ ⊕_{i=1}^{m} E(λi , T ), so
range T ⊂ ⊕_{i=1}^{m} E(λi , T ). Then range T = ⊕_{i=1}^{m} E(λi , T ), and V = E(0, T ) ⊕ range T = null T ⊕ range T .

Exercise 243. Suppose V is finite dimensional and T ∈ L(V ), then prove that

1. V = null T ⊕ range T ,

2. V = null T + range T ,

3. null T ∩ range T = {0}.

are equivalent statements.

Proof. Clearly, 1 implies 2 and 1 implies 3. Now suppose 2 holds; then dim V = dim(null T + range T ) = dim null T + dim range T −
dim(null T ∩ range T ) (see Theorem 41). Rank nullity (Theorem 99) asserts dim V = dim null T + dim range T , so the last term must be
zero, and hence 3 holds. If 3 holds, then since dim(null T ∩ range T ) = 0, we have dim(null T + range T ) =
dim null T + dim range T − dim(null T ∩ range T ) = dim V − 0 = dim V .
Then null T + range T = V , and by the direct sum condition (Theorem 85), 1 holds.

Exercise 244. Suppose V is finite dimensional, T ∈ L(V ) has dim V distinct eigenvalues, and S ∈ L(V )
shares the same eigenvectors with T (with possibly different eigenvalues). Then prove that ST = T S.

Proof. T has eigenvector basis (Theorem 147), say vi , i ∈ [n] where n = dim V . These are also
eigenvectors of S. Suppose T vi = λ_{t,i} vi , Svi = λ_{s,i} vi for i ∈ [n], and see that ST vi = λ_{s,i} λ_{t,i} vi = λ_{t,i} λ_{s,i} vi =
T Svi , so the operators commute on a basis - they commute on V .

If T ∈ L(V ) is invertible, then Exercise 78 reasons that E(λ, T ) = E(1/λ, T −1 ) for λ ∈ F, λ ≠ 0.

Exercise 245. For finite dimensional V , T ∈ L(V ), λi , i ∈ [m] distinct, non-zero eigenvalues of T ,
prove that ∑_{i=1}^{m} dim E(λi , T ) ≤ dim range T .

Proof. Note that eigenvectors corresponding to distinct eigenvalues are linearly independent. Let E(λi , T ) have basis
Vi , and see that ∪_{i=1}^{m} Vi is linearly independent (see Exercise 69). Also for vi ∈ Vi , T vi = λi vi =⇒
T (vi /λi ) = vi , so each vi ∈ range T . Then ∪_{i=1}^{m} Vi (using the direct sum result (Theorem 146)) contains
∑_{i=1}^{m} dim E(λi , T ) linearly independent vectors in range T . There can be at most dim range T linearly
independent vectors in range T , and the result follows.

3.3.6 Inner Product Spaces


3.3.6.1 Inner Products and Norms

The vector p-norm was defined (Definition 67). The dot product of a vector was defined (Definition 70).
If we think of vectors as points, the vector (Euclidean) norm is its distance from the origin. Properties
of dot product were studied in Theorem 56. For fixed y ∈ Rn , the map from Rn → R that sends x ∈ Rn
to x · y is linear. The inner product is a generalization of the dot product. To extend these concepts
over complex fields, we introduce some additional definitions. For z = (z1 , · · · , zn ) ∈ Cn , the norm of z
is given by

kzk = √(|z1 |^2 + · · · + |zn |^2 ).   (521)

Recall that for z = a + bi ∈ C, the absolute value of z is |z| = √(a^2 + b^2 ). See that kzk^2 = ∑_{i=1}^{n} zi z̄i ,
since |zi |^2 = zi z̄i . On the other hand, the inner product of w with z, where w = (w1 , · · · , wn ) ∈ Cn , is
the value

w1 z̄1 + · · · + wn z̄n .   (522)

The inner product of w, z equals the complex conjugate of the inner product of z, w. Then, an inner
product can be defined on V , regardless of whether V is real or complex vector space. For the remainder
of the section, when λ ∈ C, λ ≥ 0 is used to indicate that λ ∈ R and is nonnegative. The inner product
is written in ‘bra-ket’ notation, and the inner product of u, v is written hu|vi.

Definition 143 (Inner Product). An inner product on V is a function that takes each ordered pair (u, v)
of elements in V to a number hu|vi ∈ F. The following properties of inner product are satisfied:

1. Positivity: hv|vi ≥ 0 for all v ∈ V .

2. Definiteness: hv|vi = 0 iff v = 0.

3. Additivity in first slot: hu + v|wi = hu|wi + hv|wi for all u, v, w ∈ V .

4. Homogeneity in first slot: hλu|vi = λhu|vi for λ ∈ F, u, v ∈ V .

5. Conjugate Symmetry: hu|vi = conj(hv|ui) for all u, v ∈ V , where conj(·) denotes complex conjugation.

See Definition 143. Over real spaces, every real number is its complex conjugate. Then hv|wi = hw|vi.

Definition 144 (Euclidean Inner Product). Let w = (w1 , · · · , wn ), z = (z1 , · · · , zn ) ∈ Fn . The Euclidean
inner product on Fn is the function specified by
hw|zi = ∑_{i=1}^{n} wi z̄i .   (523)

When the inner product is not specified, then assume the inner product in question is Euclidean
(Definition 144).
Other inner products can be specified. For instance, let ci > 0 for i ∈ [n]; then an inner product on
Fn can be written

hw|zi = ∑_{i=1}^{n} ci wi z̄i .   (524)

An inner product may be defined on the vector space of continuous real-valued functions on [−1, 1] by

hf |gi = ∫_{−1}^{1} f (x)g(x) dx.   (525)

Another inner product may be defined on the polynomial space P(R) as

hp|qi = ∫_{0}^{∞} p(x)q(x) exp(−x) dx.   (526)

Definition 145. When a vector space V and inner product on V is specified, we call V an inner product
space.

So Fn is the inner product space associated with the Euclidean inner product. Assume that for this
section, V is some inner product space over field F.
Inner products over real space were explored in Theorem 56. Here we study the properties of inner
products over arbitrary fields.

Theorem 149. 1. For fixed u ∈ V , the function taking v → hv|ui is linear map from V → F.

2. h0|ui = 0 for all u ∈ V ,

3. hu|0i = 0 for all u ∈ V ,

4. hu|v + wi = hu|vi + hu|wi for all u, v, w ∈ V ,

5. hu|λvi = λ̄hu|vi for λ ∈ F and u, v ∈ V .

Proof. Most of the proofs follow from applying Definition 143. For instance, we may prove part 4. as
follows:

hu|v + wi = conj(hv + w|ui) = conj(hv|ui + hw|ui) = conj(hv|ui) + conj(hw|ui) = hu|vi + hu|wi.   (527)

Part 4. shows that additivity also applies in the second slot.

The Euclidean norm was specified in Definition 68. This definition suffices when the inner product
in question is an Euclidean inner product. We generalize the vector norm for arbitrary inner products.
We shall see that the inner product determines the vector norm.

Definition 146 (Norm). For v ∈ V , the norm of v, written kvk, is defined

kvk = √(hv|vi).   (528)

See that Definition 146 is consistent with the Euclidean inner product over Fn and our previous
definition of the Euclidean norm (Definition 68). In the inner product space of continuous real-valued
functions on [−1, 1] with inner product hf |gi = ∫_{−1}^{1} f g dx, the vector norm is kf k = √(∫_{−1}^{1} (f (x))^2 dx).

Theorem 150 (Properties of Norm (Definition 146)). Let v ∈ V , then kvk = 0 iff v = 0 and kλvk =
|λ|kvk for all λ ∈ F.

Proof. The first part is trivial; for the second part, write

kλvk^2 = hλv|λvi = λhv|λvi = λλ̄hv|vi = |λ|^2 kvk^2 ,   (529)

and take square roots.

Orthogonality was defined for real, Euclidean inner product spaces in Definition 73. We extend the
definition for arbitrary fields and arbitrary inner product:

Definition 147 (Orthogonal). Two vectors u, v ∈ V are orthogonal if hu|vi = 0. We denote this u ⊥ v.

Theorem 151 (Zero and Orthogonality). 0 is orthogonal to every vector in V , and 0 is the only vector
in V that is orthogonal to itself.

Theorem 152 (Pythagoras Theorem). Let u, v ∈ V and u ⊥ v then ku + vk2 = kuk2 + kvk2 .

Proof. ku + vk2 = hu + v|u + vi = hu|ui + hv|vi + hv|ui + hu|vi = kuk2 + kvk2 .

Theorem 153. Suppose u, v ∈ V with v ≠ 0, and we want to write u = cv + w, where c ∈ F and hw|vi = 0
(w, v orthogonal). Write u = cv + (u − cv), and choose c satisfying 0 = hu − cv|vi = hu|vi − ckvk^2 ,
so c = hu|vi/kvk^2 . See that a valid orthogonal decomposition would be written

u = cv + w,   (530)

where c = hu|vi/kvk^2 and w = u − (hu|vi/kvk^2 )v.

See orthogonal decomposition (Theorem 153) is a subroutine in the Gram-Schmidt procedure (Theo-
rem 61). We have proven the Cauchy-Schwarz inequality (Theorem 68). The Cauchy-Schwarz holds for
non-Euclidean inner products - we prove this result:

Theorem 154 (Cauchy-Schwarz Inequality). Suppose u, v ∈ V , then

|hu|vi| ≤ kukkvk. (531)

Equality holds iff ∃λ ∈ F s.t. u = λv.

Proof. If v = 0, then the proof is trivial since 0 = 0. Otherwise, write the orthogonal decomposition
(Theorem 153)

u = (hu|vi/kvk^2 )v + w,   (532)

and by Pythagoras Theorem 152, write

kuk^2 = k(hu|vi/kvk^2 )vk^2 + kwk^2   (533)
      = |hu|vi|^2 /kvk^2 + kwk^2   (534)
      ≥ |hu|vi|^2 /kvk^2 .   (535)

Multiplying through by kvk^2 and taking square roots gives the inequality. The equality holds iff kwk = 0 iff w = 0 iff u = λv for some constant λ (see argument in Theorem
153).

If f, g are continuous functions on [−1, 1], Theorem 154 asserts that

(∫_{−1}^{1} f (x)g(x) dx)^2 ≤ (∫_{−1}^{1} (f (x))^2 dx)(∫_{−1}^{1} (g(x))^2 dx).
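A quick numerical sanity check of this integral form (a sketch assuming NumPy; the test functions below are arbitrary choices):

import numpy as np

# Discretize [-1, 1] and approximate the integrals by Riemann sums.
x = np.linspace(-1.0, 1.0, 200_001)
dx = x[1] - x[0]

f = np.exp(x)          # arbitrary continuous test functions
g = np.cos(3.0 * x)

lhs = (np.sum(f * g) * dx) ** 2
rhs = (np.sum(f * f) * dx) * (np.sum(g * g) * dx)
print(lhs <= rhs)  # expected: True, by Cauchy-Schwarz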

Triangle inequality was proved (Theorem 69). Again, the extension of dot product to inner product
requires a modification in the proof:

Theorem 155 (Triangle Inequality). Let u, v ∈ V , then ku + vk ≤ kuk + kvk. Equality holds iff u = λv
for some λ ≥ 0.

Proof. Write

ku + vk^2 = hu + v|u + vi   (536)
          = hu|ui + hv|vi + hu|vi + hv|ui   (537)
          = kuk^2 + kvk^2 + hu|vi + conj(hu|vi)   (538)
          = kuk^2 + kvk^2 + 2 Re hu|vi   (539)
          ≤ kuk^2 + kvk^2 + 2|hu|vi|   (540)
          ≤ kuk^2 + kvk^2 + 2kukkvk   (541)
          = (kuk + kvk)^2 .   (542)

Equality holds iff hu|vi = kukkvk. If v = λu then (recall λ > 0 =⇒ λ ∈ R)

hu|vi = hu|λui = λkuk2 = kukλkuk. (543)

Then kukλkuk = kuk|λ|kuk = kukkλuk = kukkvk since λ ≥ 0.

The parallelogram equality is proved in Exercise 56, and the natural modification to the proof is left
out. The equality asserts that ∀u, v ∈ V, ku + vk2 + ku − vk2 = 2(kuk2 + kvk2 ).

Exercise 246. Suppose V is real inner product space, and show that ∀u, v ∈ V , we have hu + v|u − vi =
kuk^2 − kvk^2 . Show that kuk = kvk =⇒ hu + v|u − vi = 0.

Proof.

hu + v|u − vi = hu|ui − hv|vi − hu|vi + hv|ui = hu|ui − hv|vi = kuk^2 − kvk^2 .   (544)

Here hu|vi = conj(hv|ui) = hv|ui since V is real inner product space, so the cross terms cancel. The second statement clearly follows as a
result of the first statement.


Exercise 247. Prove that for T ∈ L(V ) with kT vk ≤ kvk for all v ∈ V , T − √2·1 must be invertible.

Proof. We prove the contrapositive. Suppose ¬inv(T − √2·1), then √2 is eigenvalue (Theorem 139), and
∃ non-zero v ∈ V s.t. T v = √2 v; for this v, kT vk = √2 kvk > kvk.

Exercise 248. For u, v ∈ V , prove that hu|vi = 0 iff kuk ≤ ku + avk for all a ∈ F.

Proof. Suppose hu|vi = 0, then

ku + avk2 = hu + av|u + avi = kuk2 + kavk2 + hu|avi + hav|ui = kuk2 + kavk2 ≥ kuk2 . (545)

Take square roots. OTOH, suppose kuk ≤ ku + avk, then

ku + avk^2 − kuk^2 = |a|^2 kvk^2 + hav|ui + hu|avi ≥ 0.   (546)

If v = 0, then clearly hu|vi = 0. Suppose v ≠ 0, then (see previous equation for expansion)

|a|^2 kvk^2 + hav|ui + hu|avi ≥ 0 ↔ |a|^2 kvk^2 + ahv|ui + āhu|vi ≥ 0.   (547)

Since this holds for all a ∈ F, use a → −hu|vi/kvk^2 (so ā = −hv|ui/kvk^2 ), then

hu|vihv|ui/kvk^2 − hu|vihv|ui/kvk^2 − hv|uihu|vi/kvk^2 = −|hu|vi|^2 /kvk^2 ≥ 0,   (548)

so −|hu|vi|^2 /kvk^2 ≥ 0 implies hu|vi = 0.

Exercise 249. For u, v ∈ V , prove that kau + bvk = kbu + avk for all a, b ∈ R iff kuk = kvk.

Proof. See that

kau + bvk = kbu + avk for all a, b ∈ R =⇒ k1u + 0vk = k0u + 1vk, i.e. kuk = kvk.   (549)

Conversely, assume kuk = kvk, then since a, b ∈ R, we have

kau + bvk^2 = a^2 kuk^2 + b^2 kvk^2 + ab(hu|vi + hv|ui),   (550)
kbu + avk^2 = b^2 kuk^2 + a^2 kvk^2 + ab(hu|vi + hv|ui).   (551)

Since kuk = kvk, the right-hand sides are equal and the result follows.

Exercise 250. If u, v ∈ V and kuk = kvk = 1, hu|vi = 1, prove that u = v.

Proof. See that

hu − v|u − vi = kuk2 + kvk2 − hu|vi − hv|ui = 1 + 1 − 1 − 1 = 0 (552)

which by definiteness (Definition 143) asserts u = v.

Exercise 251. Prove that if u, v ∈ V and kuk, kvk ≤ 1, then

√(1 − kuk^2 ) √(1 − kvk^2 ) ≤ 1 − |hu|vi|.   (553)
Proof. Cauchy-Schwarz (Theorem 154) asserts that |hu|vi| ≤ kukkvk. Then see that 0 ≤ 1 − kukkvk ≤
1 − |hu|vi|. So it suffices to show √(1 − kuk^2 ) √(1 − kvk^2 ) ≤ 1 − kukkvk, which implies √(1 − kuk^2 ) √(1 − kvk^2 ) ≤ 1 − |hu|vi|; consider
the iff statements

(1 − kuk^2 )(1 − kvk^2 ) ≤ (1 − kukkvk)^2   (554)
1 − kuk^2 − kvk^2 + kuk^2 kvk^2 ≤ 1 + kuk^2 kvk^2 − 2kukkvk   (555)
0 ≤ kuk^2 + kvk^2 − 2kukkvk   (556)
0 ≤ (kuk − kvk)^2   (557)

and we are done.


Exercise 252. Prove that 16 ≤ (a + b + c + d)(1/a + 1/b + 1/c + 1/d) for all a, b, c, d > 0.
Proof. Cauchy-Schwarz (Theorem 154) asserts that for xi , yi , i ∈ [n], we have |∑_{i=1}^{n} xi yi |^2 ≤ (∑_{i=1}^{n} xi^2 )(∑_{i=1}^{n} yi^2 ).
Then see that

(a + b + c + d)(1/a + 1/b + 1/c + 1/d) ≥ (√a·√(1/a) + √b·√(1/b) + √c·√(1/c) + √d·√(1/d))^2 = 4^2 .   (558)

Exercise 253. Prove that (∑_{i=1}^{n} xi )^2 ≤ n ∑_{i=1}^{n} xi^2 .

Proof. Cauchy-Schwarz (Theorem 154) asserts that for xi , yi , i ∈ [n], |∑_{i=1}^{n} xi yi |^2 ≤ (∑_{i=1}^{n} xi^2 )(∑_{i=1}^{n} yi^2 ).
Make the substitution yi → 1.
Exercise 254. Prove that (∑_{j=1}^{n} aj bj )^2 ≤ (∑_{j=1}^{n} j aj^2 )(∑_{j=1}^{n} bj^2 /j) for all real aj , bj ’s.

Proof. This is an immediate application of the Cauchy-Schwarz inequality after writing

(∑_{j=1}^{n} aj bj )^2 = (∑_{j=1}^{n} (√j · aj ) · (bj /√j))^2 .   (559)

Exercise 255. Suppose V is real inner product space, then prove that

∀u, v ∈ V, hu|vi = (ku + vk^2 − ku − vk^2 )/4.   (560)

Proof.

(1/4)(ku + vk^2 − ku − vk^2 ) = (1/4)(kuk^2 + kvk^2 + 2hu|vi) − (1/4)(kuk^2 + kvk^2 − 2hu|vi) = (1/4)(4hu|vi) = hu|vi.   (561)

Exercise 256. Suppose V is complex inner product space, then prove that

∀u, v ∈ V, hu|vi = (ku + vk^2 − ku − vk^2 + ku + ivk^2 i − ku − ivk^2 i)/4.   (562)

Proof. We write each of the individual terms separately and combine:

1. ku + vk^2 = kuk^2 + kvk^2 + hu|vi + hv|ui,   (563)

2. −ku − vk^2 = −kuk^2 − kvk^2 + hu|vi + hv|ui,   (564)

3. ku + ivk^2 i = i(kuk^2 + kvk^2 + hu|ivi + hiv|ui) = kuk^2 i + kvk^2 i + hu|vi − hv|ui,   (565)

4. −ku − ivk^2 i = −i(kuk^2 + kvk^2 − hu|ivi − hiv|ui) = −kuk^2 i − kvk^2 i + hu|vi − hv|ui,   (566)

and the result follows by summing the four terms and dividing by 4.


Exercise 257. Show that (1/n^2 )(a1 + · · · + an )^2 ≤ (1/n)(a1^2 + · · · + an^2 ) for ai ∈ R, i ∈ [n].

Proof. Divide the LHS and RHS of the inequality in Exercise 253 by n^2 .

Exercise 258. Suppose S ∈ L(V ), inj(S), and define h·|·i1 by

hu|vi1 = hSu|Svi (567)

for u, v ∈ V . Show h·|·i1 is inner product on V . Show that the injective property is a necessary condition.

Proof. Verify the properties of inner products are satisfied (Definition 143). For instance, for injective
S, we get

hu|ui1 = hSu|Sui = 0 ↔ Su = 0 ↔ u = 0. (568)

The proof of positivity, first-slot additivity, first-slot homogeneity, and conjugate symmetry may be
obtained similarly. If S is not injective, ∃v ≠ 0 satisfying Sv = 0, but then hv|vi1 = hSv|Svi = 0 contradicts the
definiteness property of h·|·i1 .

Exercise 259. Suppose u, v, w ∈ V , and prove that

kw − (1/2)(u + v)k^2 = (kw − uk^2 + kw − vk^2 )/2 − ku − vk^2 /4.   (569)

Proof. [1]. Consider the parallelogram equality (Theorem 56): for a, b we have

ka − bk^2 + ka + bk^2 = 2kak^2 + 2kbk^2 .   (570)

Then make the substitutions a → (w − u)/2, b → (w − v)/2 to get

k(w − u)/2 − (w − v)/2k^2 + k(w − u)/2 + (w − v)/2k^2 = 2k(w − u)/2k^2 + 2k(w − v)/2k^2   (571)

k(v − u)/2k^2 + k(2w − u − v)/2k^2 = 2k(w − u)/2k^2 + 2k(w − v)/2k^2 ,   (572)

shift the terms around and simplify - the relation holds. Note that

hv − u|v − ui = kvk^2 + kuk^2 − hv|ui − hu|vi = hu − v|u − vi,   (573)

so ku − vk^2 /4 = kv − uk^2 /4.

Exercise 260. Let C ⊂ V satisfying ∀u, v ∈ C, (1/2)(u + v) ∈ C. For w ∈ V , prove that there is at most
one point in C closest to w, that is, there is at most one u ∈ C s.t.

kw − uk ≤ kw − vk, ∀v ∈ C.   (574)

Proof. Suppose not; then there are some distinct v1 , v2 ∈ C closest to w, and we must have kw − v1 k ≤
kw − v2 k, kw − v1 k ≥ kw − v2 k, therefore kw − v1 k = kw − v2 k. Exercise 259 asserts that

kw − (1/2)(v1 + v2 )k^2 = (kw − v1 k^2 + kw − v2 k^2 )/2 − kv1 − v2 k^2 /4 < kw − v1 k^2 ,   (575)

where the strict inequality uses v1 ≠ v2 . Since (1/2)(v1 + v2 ) ∈ C, this is a contradiction of the closest assumption. Then v1 , v2 must be non-distinct.

3.3.6.2 Orthonormal Bases

Orthogonal bases were discussed (Section 3.2.5.1). Orthonormal vectors ei , i ∈ [m] in vector space
V are said to have the property hej |ek i = 1{j = k} - the extension to general inner products is natural. We
know orthogonal/orthonormal vectors have nice computations. Recall that an orthonormal (more generally,
orthogonal and nonzero) list of vectors is linearly independent (Theorem 57 with trivial modifications).
If ei , i ∈ [m] are orthonormal in V , then k∑_{i=1}^{m} ai ei k^2 = ∑_{i=1}^{m} |ai |^2 for all ai ∈ F. This result is a direct
application of the Pythagoras Theorem 152. A list of orthonormal vectors in V is orthonormal basis if
application of the Pythagoras Theorem 152. A list of orthonormal vectors in V is orthonormal basis if
the list is basis. Corollary 5 holds; any orthonormal list of vectors in V with length dim V is orthonormal
basis of V . Theorem 58 asserts that given orthonormal basis, each vector in V has unique and simple
basis-coordinate representations. The extension to general inner-product spaces is natural, and we write
it here:
Theorem 156. Let ei , i ∈ [n] be orthonormal basis of V , then ∀v ∈ V , we may write v = ∑_{i=1}^{n} hv|ei iei ,
and kvk^2 = ∑_{i=1}^{n} |hv|ei i|^2 .

Proof. Since the ei ’s are basis, there exists representation v = ∑_{i=1}^{n} ai ei ; taking the inner product with each
element in the basis, we get hv|ej i = h∑_{i=1}^{n} ai ei |ej i = aj . The second result follows from Pythagoras Theorem
152.

The Gram-Schmidt procedure was stated in Theorem 61. The statement is made again, but the
generalization of the proof is trivial and omitted.

Theorem 157 (Gram-Schmidt Procedure). Let vi , i ∈ [m] be linearly independent list of vectors in V ,
and e1 = v1 /kv1 k. Then for j ∈ [2, m], define ej by

ej = (vj − hvj |e1 ie1 − · · · − hvj |ej−1 iej−1 ) / kvj − hvj |e1 ie1 − · · · − hvj |ej−1 iej−1 k.   (576)

Then ei , i ∈ [m] is orthonormal list of vectors in V and span(v1 , · · · , vj ) = span(e1 , · · · , ej ) for all
j ∈ [m].
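A minimal numerical sketch of the procedure (assuming NumPy and the Euclidean inner product; gs is a hypothetical helper name and the input vectors are arbitrary):

import numpy as np

def gs(vectors):
    # Gram-Schmidt (Theorem 157): turn a linearly independent list into an orthonormal list.
    es = []
    for v in vectors:
        w = v.astype(float)
        for e in es:
            w = w - np.vdot(e, w) * e   # subtract the projection of w onto e
        es.append(w / np.linalg.norm(w))
    return es

vs = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
E = np.column_stack(gs(vs))
print(np.allclose(E.T @ E, np.eye(3)))  # expected: True (orthonormal columns)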
Exercise 261 (Axler [3]). Find orthonormal basis of P2 (R), where the inner product is hp|qi = ∫_{−1}^{1} p(x)q(x) dx.

Proof. The standard basis is {1, x, x^2 }. To begin with, consider k1k^2 = ∫_{−1}^{1} 1^2 dx = 2, so k1k = √2 and
we may write e1 = √(1/2). Write

x − hx|e1 ie1 = x − (∫_{−1}^{1} x √(1/2) dx) √(1/2) = x,   (577)

and kxk^2 = ∫_{−1}^{1} x^2 dx = 2/3. Then kxk = √(2/3) and e2 = √(3/2) x. Finally, write

x^2 − hx^2 |e1 ie1 − hx^2 |e2 ie2   (578)
= x^2 − (∫_{−1}^{1} x^2 √(1/2) dx) √(1/2) − (∫_{−1}^{1} x^2 √(3/2) x dx) √(3/2) x   (579)
= x^2 − 1/3.   (580)

Then kx^2 − 1/3k^2 = ∫_{−1}^{1} (x^4 − (2/3)x^2 + 1/9) dx = 8/45, and kx^2 − 1/3k = √(8/45). Then e3 = √(45/8) (x^2 − 1/3).
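These three polynomials can be verified numerically (a sketch assuming NumPy; the grid size is arbitrary). Up to normalization they are the first three Legendre polynomials.

import numpy as np

x = np.linspace(-1.0, 1.0, 200_001)
dx = x[1] - x[0]

e1 = np.sqrt(0.5) * np.ones_like(x)
e2 = np.sqrt(1.5) * x
e3 = np.sqrt(45.0 / 8.0) * (x**2 - 1.0 / 3.0)

# Gram matrix of the inner products over [-1, 1], approximated by a Riemann sum
E = np.stack([e1, e2, e3])
G = E @ E.T * dx
print(np.allclose(G, np.eye(3), atol=1e-3))  # expected: True (orthonormal)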

Every finite dimensional (inner product) space has a spanning list. Each spanning list may be reduced
to a basis. Then the basis may be turned into an orthonormal basis by Gram-Schmidt. Hence every finite dimensional
inner product space has orthonormal basis. Trivially, every orthonormal list may also be extended to an
orthonormal basis. We saw that every finite dimensional complex vector space V has a basis w.r.t. which the
matrix of a given operator is upper triangular (Theorem 143). It turns out that this in turn
implies the existence of an orthonormal basis with the same property.

Theorem 158. Let T ∈ L(V ). If T has upper triangular matrix w.r.t to some basis of V , then T has
upper triangular matrix w.r.t. some orthonormal basis of V .

Proof. Let T be upper triangular w.r.t basis vi , i ∈ [n], then span(v1 , · · · , vj ) is invariant under T for each
j ∈ [n] (Theorem 142). By Gram-Schmidt (Theorem 157), we can obtain orthonormal basis ei , i ∈ [n]
where span(e1 , · · · , ej ) = span(v1 , · · · , vj ) for all j ∈ [n]. Then the span(e1 , · · · , ej ) for the different j’s
are T -invariant and T has upper triangular w.r.t to this basis ei , i ∈ [n] (Theorem 142).

Theorem 159 (Schur’s Theorem). Suppose V is finite dimensional complex vector space and T ∈ L(V ),
then T has upper triangular w.r.t to orthonormal basis of V .

Proof. T has upper triangular w.r.t some basis (Theorem 143). Existence of basis with upper triangular
implies orthonormal basis to which the matrix is upper triangular (Theorem 158).

For fixed u ∈ V , the map that sends v → hv|ui is a linear functional (Definition 126) on V . It turns
out that every linear functional is of this form.

Theorem 160 (Riesz Representation Theorem). Let V be finite dimensional and ϕ be linear functional
on V , then ∃! u ∈ V s.t. ϕ(v) = hv|ui for all v ∈ V .

Proof. Existence is shown first. Let ei , i ∈ [n] be orthonormal basis of V (the existence is guaranteed as we
argued), then by Theorem 156 we may write

ϕ(v) = ϕ(∑_{i=1}^{n} hv|ei iei ) = ∑_{i=1}^{n} hv|ei iϕ(ei ) = hv| ∑_{i=1}^{n} conj(ϕ(ei ))ei i   (581)

for all v ∈ V . Then let u = ∑_{i=1}^{n} conj(ϕ(ei ))ei , and ϕ(v) = hv|ui. We prove uniqueness. Suppose ∃u1 , u2 s.t.
ϕ(v) = hv|u1 i = hv|u2 i, iff hv|u1 i − hv|u2 i = 0, iff hv|u1 − u2 i = 0. Since this is satisfied for all v, if we let
v = u1 − u2 , then hv|vi = 0 asserts that v = u1 − u2 = 0, so u1 = u2 .
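As a finite-dimensional illustration (a sketch assuming NumPy and the Euclidean inner product on R^4; the functional phi is an arbitrary choice), the representing vector u of a functional ϕ can be read off by evaluating ϕ on an orthonormal basis, exactly as in the proof above.

import numpy as np

n = 4
phi = lambda v: 2.0 * v[0] - v[2] + 0.5 * v[3]   # an arbitrary linear functional on R^4

# The standard basis is orthonormal under the Euclidean inner product;
# over R the conjugation is trivial, so u_i = phi(e_i).
e = np.eye(n)
u = np.array([phi(e[i]) for i in range(n)])

v = np.array([1.0, -2.0, 3.0, 4.0])
print(np.isclose(phi(v), np.dot(v, u)))  # expected: True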

Exercise 262. Let ei , i ∈ [m] be orthonormal list of vectors in V , and for v ∈ V , prove that kvk^2 =
∑_{i=1}^{m} |hv|ei i|^2 iff v ∈ span((ei )i∈[m] ).

Proof. Let ei , i ∈ [m] be orthonormal list of vectors in V ; it may be extended to an orthonormal basis
ei , i ∈ [m], ej , j ∈ [m + 1, n]. If v ∈ span((ei )i∈[m] ), then we may write v = ∑_{i=1}^{m} ai ei , and taking
hv|ej i = h∑_{i=1}^{m} ai ei |ej i = aj , so aj = hv|ej i. Then v = ∑_{i=1}^{m} hv|ei iei , and kvk^2 = ∑_{i=1}^{m} |hv|ei i|^2 . On the other
hand, since ei , i ∈ [m], ej , j ∈ [m + 1, n] is orthonormal basis, we can write

kvk^2 = |hv|e1 i|^2 + |hv|e2 i|^2 + · · · + |hv|em i|^2 + · · · + |hv|en i|^2 .   (582)

By the assumption kvk^2 = ∑_{i=1}^{m} |hv|ei i|^2 , we get ∑_{i=m+1}^{n} |hv|ei i|^2 = 0, then v = ∑_{i=1}^{m} hv|ei iei + ∑_{i=m+1}^{n} 0·ei and
the result follows.

Exercise 263. If V is real inner product space, vi , i ∈ [m] are linearly independent in V , prove there are
2^m orthonormal lists ei , i ∈ [m] of vectors in V satisfying

∀j ∈ [m], span(v1 , · · · , vj ) = span(e1 , · · · , ej ). (583)

Proof. For any linearly independent list vi , i ∈ [m], the Gram-Schmidt procedure (Theorem 157) gives
us orthonormal list ei , i ∈ [m] vectors satisfying

∀j ∈ [m], span(v1 , · · · , vj ) = span(e1 , · · · , ej ). (584)

Moreover, since ∀i, j ∈ [m], hei |ej i = 0 =⇒ h−ei |ej i = 0, for each i we are free to take either the positive
or the negative of each vector in the orthonormal list, so we have 2^m choices.

Exercise 264. Let u ∈ V , and Φu be a linear functional on V taking v → hv|ui for all v ∈ V . Then

1. Show that if F = R, then Φ is linear map V → V 0 .

2. Show that if F = C, and V 6= {0}, then Φ is not linear map.

3. Show inj(Φ).

4. Show that if F = R, and V is finite dimensional, then Φ isomorphism from V → V 0 .

Proof. -

1.

∀u1 , u2 ∈ V, v ∈ V, Φ(u1 + u2 )(v) = hv|u1 + u2 i = hv|u1 i + hv|u2 i = Φ(u1 )v + Φ(u2 )v,


∀λ ∈ R, u ∈ V, Φ(λu)v = hv|λui = λ̄hv|ui = λhv|ui = λΦ(u)v.   (585)

2. For λ ∉ R and hv|ui ≠ 0, Φ(λu)v = hv|λui = λ̄hv|ui ≠ λhv|ui = λΦ(u)v, so Φ is not homogeneous, hence not a linear map (such u, v exist since V ≠ {0}).

3.

Φ(u1 ) = Φ(u2 ) =⇒ 0 = (Φ(u1 ) − Φ(u2 )) = (Φ(u1 − u2 ))(v) = hv|u1 − u2 i (586)

for all v ∈ V , and by definiteness, letting v → u1 − u2 we conclude u1 − u2 = 0, so u1 = u2 .

4. inj(Φ) and Theorem 118 asserts by rank-nullity (Theorem 99)

dim V = dim V 0 = dim null Φ + dim range Φ = dim range Φ. (587)

Then sur(Φ), inj(Φ) =⇒ inv(Φ).

3.3.6.3 Orthogonal Complements and Minimization Problems

Orthogonal sets were defined in Definition 73. The orthogonal complement of a set is the largest set
in the vector space that is orthogonal to it.

Definition 148 (Orthogonal Complement). If U ⊂ V , then the orthogonal complement of U is written U ⊥
and is the set of all vectors in V that are orthogonal to every vector of U , that is

U ⊥ = {v ∈ V | hv|ui = 0 ∀u ∈ U }.   (588)

Since zero is orthogonal to everything (Theorem 151), every orthogonal complement shall contain
zero. We explore more properties.

Theorem 161. Let U ⊂ V , then

1. U ⊥ is subspace,

2. {0}⊥ = V ,

3. V ⊥ = {0},

4. U ∩ U ⊥ = {0},

5. U, W ⊂ V and U ⊂ W implies W ⊥ ⊂ U ⊥ .

Proof. -

1. Obviously, zero is contained. Next, let v, w ∈ U ⊥ , then hv + w|ui = hv|ui + hw|ui = 0, and let
λ ∈ F, v ∈ U ⊥ , then hλv|ui = λhv|ui = 0.

2. Theorem 151.

3. Theorem 151.

4. Let v ∈ U ∩ U ⊥ then hv|vi = 0 iff v = 0, so U ∩ U ⊥ = {0}.

5. If v ∈ W ⊥ , then ∀u ∈ U ⊂ W , hv|ui = 0, so hv|ui = 0. Then v ∈ U ⊥ and W ⊥ ⊂ U ⊥ .

Theorem 162. If U is finite dimensional subspace of V , then V = U ⊕ U ⊥ .

Proof. Let v ∈ V , and ei , i ∈ [m] be orthonormal basis of U (U is a finite dimensional subspace, so
existence is guaranteed). Since v = ∑_{i=1}^{m} hv|ei iei + (v − ∑_{i=1}^{m} hv|ei iei ), let u = ∑_{i=1}^{m} hv|ei iei ∈ U .
Rearranging, we can write v = u + (v − u) = u + w, where w = v − u. See

∀j ∈ [m], hw|ej i = hv − ∑_{i=1}^{m} hv|ei iei |ej i = hv|ej i − hv|ej i = 0 =⇒ w ∈ U ⊥ .   (589)

Then, since v = u + w, where u ∈ U, w ∈ U ⊥ , we get V = U + U ⊥ . Since U ∩ U ⊥ = {0} (Theorem 161),
then V = U ⊕ U ⊥ by the check for direct sum (Theorem 85).

Then Theorem 162, 113 gives us dimension of U ⊥ by the relation dim V = dim U + dim U ⊥ .

Theorem 163. If U is finite dimensional subspace of V , then U = (U ⊥ )⊥ .

Proof. Let u ∈ U , then for all v ∈ U ⊥ , clearly hu|vi = 0 so u ∈ (U ⊥ )⊥ . Let v ∈ (U ⊥ )⊥ , then by
decomposition (Theorem 162) we may write v = u + w, where u ∈ U, w ∈ U ⊥ , then v − u ∈ U ⊥ , and
v ∈ (U ⊥ )⊥ , u ∈ (U ⊥ )⊥ , it follows that v − u ∈ U ⊥ ∩ (U ⊥ )⊥ , then hv − u|v − ui = 0, then v − u = 0, then
v = u, so v ∈ U , so (U ⊥ )⊥ ⊂ U .

Orthogonal projection was introduced in Definition 77, and projections with basis were constructed
in Theorem 60. This projection is unique (Theorem 70). Minimal distances and Best Approximations
were studied in Theorem 62. A more extensive study is conducted here. Let U be finite dimensional
subspace of V , then denote the orthogonal projection of V onto U as the operator PU ∈ L(V ) satisfying

∀v ∈ V, v = PU v + w, (590)

where PU v ∈ U and w ∈ U ⊥ . Since V = U ⊕ U ⊥ , this is consistent with the unique representation.

Theorem 164 (Properties of Orthogonal Projections). Let U be finite dimensional subspace of V and
v ∈ V , then

1. PU ∈ L(V ),

2. PU u = u for all u ∈ U ,

3. PU w = 0 for all w ∈ U ⊥ ,

4. range PU = U ,

5. null PU = U ⊥ ,

6. v − PU v ∈ U ⊥ ,

7. PU2 = PU ,

8. kPU vk ≤ kvk,
Pm
9. for orthonormal basis ei , i ∈ [m], PU v = i hv|ei iei .

Proof. Some proofs are rather trivial and hence omitted.

1. Let v1 , v2 ∈ V , and suppose v1 = u1 + w1 , v2 = u2 + w2 , where u1 , u2 ∈ U, w1 , w2 ∈ U ⊥ , then


PU v1 = u1 , PU v2 = u2 , and since v1 + v2 = u1 + u2 + w1 + w2 then PU (v1 + v2 ) = u1 + u2 =
PU v1 + PU v2 . Homogeneity proof is similar.

2. -

3. -

4. -

5. -

6. -

7. Let v = u + w, u ∈ U, w ∈ U ⊥ , then PU (PU v) = PU u = u = PU v by part 2.

8. Let v = u + w, u ∈ U, w ∈ U ⊥ , then kPU vk2 = kuk2 ≤ kuk2 + kwk2 = kvk2 (Pythagoras Theorem
152). Restatement of Best Approximation Theorem with u → 0 (Theorem 62).

9. See computation for u in proof of Theorem 162.
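A numerical sketch of these properties (assuming NumPy; the subspace U is spanned by the columns of an arbitrary random matrix, and PU is built from an orthonormal basis of U as in part 9.):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))          # columns span a 3-dimensional subspace U of R^6
Q, _ = np.linalg.qr(A)                   # orthonormal basis of U (Gram-Schmidt)
P = Q @ Q.T                              # P_U v = sum_i <v|e_i> e_i

v = rng.standard_normal(6)
print(np.allclose(P @ P, P))                                 # P_U^2 = P_U
print(np.allclose(P @ Q[:, 0], Q[:, 0]))                     # P_U u = u for u in U
print(np.linalg.norm(P @ v) <= np.linalg.norm(v) + 1e-12)    # ||P_U v|| <= ||v||
print(np.allclose(Q.T @ (v - P @ v), 0))                     # v - P_U v lies in the orthogonal complement of U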

Discussions in Theorem 62 give us the basis for the result: if U is finite dimensional subspace of V ,
then kv − PU vk ≤ kv − uk for all u ∈ U . Equality holds iff u = PU v. The adaptation of the proof for
arbitrary inner product spaces are trivial.

Exercise 265. Find the polynomial u ∈ P5 (R) that approximates sin x as well as possible on the interval
[−π, π] under the loss function

∫_{−π}^{π} | sin x − u(x)|^2 dx.   (591)

Proof. Let CR be the real inner product space of continuous functions on [−π, π] with inner product
hf |gi = ∫_{−π}^{π} f (x)g(x) dx. Let v ∈ CR be defined by v(x) = sin(x), and let U = P5 (R) be a subspace of CR .
Then we want to find arg min_{u∈U} kv − uk. To do this, we may use the standard basis of P5 (R) and obtain
orthonormal basis ei , i ∈ [6] of U via the Gram-Schmidt procedure (Theorem 157), with the given inner
product. Then the Best Approximation Theorem gives us u(x) = PU v = ∑_{i=1}^{6} hv|ei iei (x). This is the best
approximation of sin x by a polynomial of degree at most five on [−π, π].
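A numerical sketch of this projection (assuming NumPy; the grid and tolerances are arbitrary). The coefficients hv|ei i are computed with a discretized inner product on [−π, π]; the resulting degree-5 polynomial is far more accurate across the whole interval than the degree-5 Taylor polynomial x − x^3/6 + x^5/120.

import numpy as np

x = np.linspace(-np.pi, np.pi, 200_001)
dx = x[1] - x[0]
inner = lambda f, g: np.sum(f * g) * dx          # <f|g> = integral of f*g, via a Riemann sum

# Gram-Schmidt on the standard basis 1, x, ..., x^5 of P5(R)
es = []
for k in range(6):
    w = x ** k
    for e in es:
        w = w - inner(w, e) * e
    es.append(w / np.sqrt(inner(w, w)))

v = np.sin(x)
u = sum(inner(v, e) * e for e in es)             # P_U v, the best degree-5 approximation

taylor = x - x**3 / 6 + x**5 / 120
print(np.max(np.abs(v - u)))       # roughly 1e-3: small over all of [-pi, pi]
print(np.max(np.abs(v - taylor)))  # roughly 0.5: the Taylor polynomial degrades near +-pi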

Exercise 266. Prove that for vi , i ∈ [m] in V , {vi : i ∈ [m]}⊥ = (span((vi )i∈[m] ))⊥ .

Proof. For w ∈ {vi : i ∈ [m]}⊥ and v := ∑_{i=1}^{m} ai vi ∈ span((vi )i∈[m] ), hw|vi = hw| ∑_{i=1}^{m} ai vi i = ∑_{i=1}^{m} āi hw|vi i = 0,
so {vi : i ∈ [m]}⊥ ⊂ (span((vi )i∈[m] ))⊥ . OTOH, if v ∈ (span((vi )i∈[m] ))⊥ , then clearly v is orthogonal to
each vj , j ∈ [m], and {vi : i ∈ [m]}⊥ ⊃ (span((vi )i∈[m] ))⊥ .

Exercise 267. If U is finite dimensional subspace of V , then prove that U ⊥ = {0} iff U = V .

Proof. If U ⊥ = {0}, then U = (U ⊥ )⊥ = {0}⊥ = V (Theorem 163). If U = V , then U ⊥ = V ⊥ = {0}. See Theorem 161.

Exercise 268. If U is subspace of V , U has basis ui , i ∈ [m], and u1 , · · · , um , wm+1 , · · · , wn is
basis of V , then prove that if the Gram-Schmidt procedure is applied to this basis of V , the produced lists
ei , i ∈ [m] and fj , j ∈ [m + 1, n] are orthonormal bases of U and of U ⊥ respectively.

Proof. We know that the Gram-Schmidt procedure produces orthonormal basis of V ; let the vectors obtained be
ei , i ∈ [m], fj , j ∈ [m + 1, n]. In particular the algorithm (Theorem 157) keeps
span(ui , i ∈ [j]) = span(ei , i ∈ [j]) for j ∈ [m], so the ei ’s form an orthonormal list in U of the same size as
a basis of U , and must be orthonormal basis of U . Additionally, ∀i, j, hei |fj i = 0, so span((ei )i∈[m] ) ⊥ span((fj )j∈[m+1,n] ).
Since V = U ⊕ U ⊥ and dim V = dim U + dim U ⊥ , the fj ’s are of the right number, form an orthonormal list, and are orthogonal to U , so they are orthonormal basis of U ⊥ .

Exercise 269. If V finite dimensional, U is subspace of V , then show that PU ⊥ = 1 − PU .

Proof. Write V = U ⊕U ⊥ , then ∀v ∈ V, ∃u ∈ U, u⊥ ∈ U ⊥ s.t. v = u+u⊥ , and since PU (v) = u, PU ⊥ (v) =


u⊥ , then

(1 − PU )v = v − PU (v) = v − u = u⊥ = PU⊥ (v). (592)

Exercise 270. If U, W finite dimensional subspaces of V , then prove that PU PW = 0 ↔ hu|wi = 0 for
all u ∈ U, w ∈ W .

Proof. [1]. If PU PW = 0,then for all w ∈ W , PU PW w = PU w = 0, so w ∈ U ⊥ . Then for all u ∈
U, hw|ui = 0. OTOH, if hu|wi = 0, then for all u ∈ U, w ∈ W , clearly W ⊂ U ⊥ . Then PU w = 0, and
PU PW = 0.

Exercise 271. If V is finite dimensional and P ∈ L(V ) s.t P 2 = P , and hu|vi = 0 for all u ∈ null P, v ∈
range P . Then prove that there exists subspace U of V s.t. P = PU .

Proof. [1]. Let U = range P , then for u ∈ U , ∃v ∈ V s.t. P v = u and so u = P v = P 2 v = P P v = P u, so


P u = u. Let v ∈ V , then since V = null P ⊕ range P for P 2 = P (Exercise 234), we may write v = u + w,
for u ∈ U, w ∈ null P . Since null P ⊂ U ⊥ , then P v = P (u + w) = P u = u = PU (u + w) = PU v,
P = PU .

Exercise 272. If V is finite dimensional and P ∈ L(V ) s.t. P 2 = P , and kP vk ≤ kvk for all v ∈ V ,
then prove there is subspace U of V s.t. P = PU .

Proof. [1]. Exercise 234 asserts that V = null P ⊕ range P . For any u ∈ null P , w ∈ range P and λ ∈ F,
we have kwk = kP (w + λu)k ≤ kw + λuk, and Exercise 248 asserts that
hw|ui = 0. It follows that null P ⊥ range P ; let U = range P , and by Exercise 271 it follows that P = PU .

Exercise 273. If T ∈ L(V ), U is finite dimensional subspace of V , then prove U is invariant under T
iff PU T PU = T PU .

Proof. Suppose U invariant, since V = U ⊕ U ⊥ , for all v ∈ V , we may write v = u + u⊥ where


u ∈ U, u⊥ ∈ U ⊥ , and let T u = ũ, then see that

PU T PU v = PU T PU (u + u⊥ ) = PU T PU u = PU T u = PU ũ = ũ = T u = T PU (u + u⊥ ) = T PU v. (593)

OTOH, referring to the same equation, see that PU T u = T u implies T u ∈ U , so T invariant.

Exercise 274. If V is finite dimensional T ∈ L(V ), and U is subspace of V , then prove that U, U ⊥ is
invariant under T iff PU T = T PU .

Proof. Assume U, U ⊥ invariant. Since V = U ⊕ U ⊥ , for all v ∈ V , we can write v = u + u⊥ . Then we


may write
PU T (u + u⊥ ) = PU (ut + u⊥
t ) = ut ,

and
T PU (u + u⊥ ) = T (u + 0) = ut .

On the other hand, suppose PU T = T PU . Then may write

PU T PU = T PU PU = T PU , (594)

and Exercise 273 asserts that U is T invariant. By Exercise 269, we also have

PU⊥ T PU⊥ = (1 − PU )T PU ⊥ = (T − PU T )PU ⊥ = (T − T PU )PU ⊥ = T (1 − PU )PU ⊥ = T PU ⊥ PU ⊥ = T PU ⊥ .

Then Exercise 273 asserts that U ⊥ is T invariant.

3.3.7 Operators on Inner Product Spaces
3.3.7.1 Self-Adjoint and Normal Operators

Definition 149 (Adjoint). Let T ∈ L(V, W ), then the adjoint of T is function T ∗ : W → V satisfying

hT v|wi = hv|T ∗ wi (595)

∀v ∈ V, ∀w ∈ W .

Recall that inner products are linear functionals (Definition 126). The linear functional taking v →
hT v|wi has some unique vector (Theorem 160) T ∗ w s.t hT v|wi = hv|T ∗ wi for all v ∈ V .

Exercise 275. Let u ∈ V , x ∈ W , and define T ∈ L(V, W ) by T v = hv|uix for all v ∈ V . Find the
adjoint of T .

Proof. For fixed w ∈ W , write

hv|T ∗ wi = hT v|wi = hhv|uix|wi = hv|uihx|wi = hv|hw|xiui. (596)

Hence T ∗ w = hw|xiu.

Theorem 165. For T ∈ L(V, W ), T ∗ ∈ L(W, V ).

Proof. Let λ ∈ F, w, w1 , w2 ∈ W and v ∈ V , then

hv|T ∗ (w1 + w2 )i = hT v|w1 + w2 i = hT v|w1 i + hT v|w2 i = hv|T ∗ w1 i + hv|T ∗ w2 i = hv|T ∗ w1 + T ∗ w2 i(597)

and

hv|T ∗ λwi = hT v|λwi = λ̄hT v|wi = λ̄hv|T ∗ wi = hv|λT ∗ wi. (598)

Theorem 166 (Adjoint Properties). Let S, T be linear operators in L(V, W ) and λ ∈ F. Then

1. (S + T )∗ = S ∗ + T ∗ ,

2. (λT )∗ = λ̄T ∗ ,

3. (T ∗ )∗ = T ,

4. 1∗ = 1 ,
5. (ST )∗ = T ∗ S ∗ for T ∈ L(V, W ), S ∈ L(W, U ).

Proof. -

1.

hv|(S + T )∗ wi = h(S + T )v|wi = hSv|wi + hT v|wi = hv|S ∗ wi + hv|T ∗ wi = hv|S ∗ w + T ∗ wi. (599)

2.

hv|(λT )∗ wi = hλT v|wi = λhT v|wi = λhv|T ∗ wi = hv|λ̄T ∗ wi. (600)

3.

hw|(T ∗ )∗ vi = hT ∗ w|vi = conj(hv|T ∗ wi) = conj(hT v|wi) = hw|T vi. (601)

4.

hv|1∗ ui = h1v|ui = hv|ui (602)

5. Let T ∈ L(V, W ), S ∈ L(W, U ), then if v ∈ V , u ∈ U , write

hv|(ST )∗ ui = hST v|ui = hT v|S ∗ ui = hv|T ∗ (S ∗ u)i. (603)

Theorem 167. If T ∈ L(V, W ) then

1. null T ∗ = (range T )⊥ ,

2. range T ∗ = (null T )⊥ ,

3. null T = (range T ∗ )⊥ ,

4. range T = (null T ∗ )⊥ .

Proof. For ∀w ∈ W , ∀v ∈ V , w ∈ null T ∗ iff T ∗ w = 0 iff hv|T ∗ wi = 0 iff hT v|wi = 0 iff w ∈ (range T )⊥ .
This proves part 1. The other parts follow by taking orthogonal complements and substituting T ∗ →
T.

Definition 150 (Conjugate Transpose). The conjugate transpose of a matrix A, written A† , is the
matrix obtained by transposing the matrix A and taking the conjugates of each entry.
Theorem 168. Let T ∈ L(V, W ) and ei , i ∈ [n] be orthonormal basis and fj , j ∈ [m] be orthonormal
basis of W , then

M(T ∗ , (fj )j∈[m] , (ei )i∈[n] ) = M(T, (ei )i∈[n] , (fj )j∈[m] )† . (604)

Proof. By unique basis representation (Theorem 89) and the definition of the matrix of a linear map (Definition
113), we may write

T ek = ∑_{j=1}^{m} hT ek |fj ifj ,   (605)

where the coefficients form M(T )·,k . The (j, k) entry of M(T ) is hT ek |fj i. The same argument
yields the (j, k) entry of M(T ∗ ) to be hT ∗ fk |ej i, and since this equals hfk |T ej i = conj(hT ej |fk i), i.e. the conjugate of the (k, j) entry of M(T ), the argument
follows.
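With the Euclidean inner product and the standard (orthonormal) bases, Theorem 168 says the matrix of the adjoint is the conjugate transpose. A quick numerical check (a sketch assuming NumPy; sizes and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))   # T : C^4 -> C^3
A_adj = A.conj().T                                                    # candidate matrix of T*

v = rng.standard_normal(4) + 1j * rng.standard_normal(4)
w = rng.standard_normal(3) + 1j * rng.standard_normal(3)

# <Tv|w> = <v|T*w>, with <x|y> = sum_i x_i * conj(y_i)
lhs = np.sum((A @ v) * w.conj())
rhs = np.sum(v * (A_adj @ w).conj())
print(np.isclose(lhs, rhs))  # expected: True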

Definition 151 (Self-adjoint). An operator T ∈ L(V ) is said to be self-adjoint if T = T ∗ . Then ∀v, w ∈ V ,
hT v|wi = hv|T wi. The term self-adjoint is used interchangeably with ‘Hermitian’.

The sum of two self-adjoint operators are self-adjoint. The scalarization of self-adjoint operator is
self-adjoint. Other properties are studied.
Note that M(T ∗ ) is not necessarily M(T )† when the basis is not orthonormal.

Theorem 169. Every eigenvalue of self-adjoint operator is real.

Proof. Let T be self-adjoint, and λ be eigenvalue associated with eigenvector v ≠ 0, then

λkvk^2 = hλv|vi = hT v|vi = hv|T vi = hv|λvi = λ̄kvk^2 .   (606)

Since kvk ≠ 0, λ = λ̄ and λ is real.

Theorem 170. Over complex field C, if V is complex inner product space and T ∈ L(V ), then hT v|vi = 0
for all v ∈ V implies T = 0.

Proof. Axler [3]. Write

hT u|wi = (hT (u + w)|u + wi − hT (u − w)|u − wi)/4 + (hT (u + iw)|u + iwi − hT (u − iw)|u − iwi)/4 · i

for all u, w ∈ V . Since this holds for all u, w, and the RHS consists only of terms of the form hT v|vi = 0, let w = T u;
then hT u|T ui = 0 iff T u = 0 for all u iff T is the zero operator.

Theorem 171. Over complex field C, if V is complex inner product space and T ∈ L(V ), then T self
adjoint iff hT v|vi ∈ R for all v ∈ V .

Proof. Write

hT v|vi − conj(hT v|vi) = hT v|vi − hv|T vi = hT v|vi − hT ∗ v|vi = h(T − T ∗ )v|vi.   (607)

If T is self adjoint then the RHS is zero for every v, so the LHS is zero, which holds iff hT v|vi ∈ R. OTOH, if
hT v|vi is always real then the LHS is zero for every v, so h(T − T ∗ )v|vi = 0 for all v, and Theorem 170 asserts T − T ∗ = 0, i.e. T = T ∗ .

Theorem 170 asserts the result over complex spaces. The statement holds for real spaces if we assume
T is self-adjoint.

Theorem 172. If T is self adjoint on V where T ∈ L(V ) then hT v|vi = 0 for all v ∈ V implies T = 0.

Proof. We may assume a real inner product space, since the complex case is already shown to hold (Theorem 170). Using the self-adjoint property, write

hT u|wi = (hT (u + w)|u + wi − hT (u − w)|u − wi)/4.

The reasoning is then similar to the proof of Theorem 170.

Definition 152. An operator on inner product space is said to be normal if T T ∗ = T ∗ T (adjoint-commutativity).

Clearly self-adjoint operators are normal.

Theorem 173. An operator T ∈ L(V ) is normal iff kT vk = kT ∗ vk for all v ∈ V .

Proof. For all v ∈ V : since T ∗ T − T T ∗ is self-adjoint, using Theorem 172 write T ∗ T − T T ∗ = 0 iff
h(T ∗ T − T T ∗ )v|vi = 0 for all v, iff hT ∗ T v|vi = hT T ∗ v|vi, iff kT vk^2 = kT ∗ vk^2 .

Theorem 174 (Normal operators share eigenvectors with adjoints). Suppose T ∈ L(V ) is normal and T v = λv. Then T ∗ v = λ̄v.

Proof. The identity 1 commutes with the adjoint, so if T is normal, then (T − λ1) is normal, then

0 = k(T − λ1)vk = k(T − λ1)∗ vk = k(T ∗ − λ̄1)vk.   (608)

The eigenvector v is shared; the associated eigenvalue of T ∗ is the complex conjugate.

We know that eigenvectors associated with distinct eigenvalues are linearly independent (Exercise
70). We know that orthogonality implies linear independence (Theorem 57). When the operator is
normal, the eigenvectors are orthogonal.

Theorem 175. Let T ∈ L(V ) be normal, then eigenvectors of T corresponding to distinct eigenvalues
are orthogonal.

Proof. Let α, β be distinct eigenvalues with eigenvectors u, v, then by Theorem 174, write

(α − β)hu|vi = hαu|vi − hu|β̄vi = hT u|vi − hu|T ∗ vi = 0, (609)

where we used Theorem 174 to equate T ∗ v = β̄v. Then hu|vi = 0.

Exercise 276. If T ∈ L(V ), λ ∈ F, prove that λ is eigenvalue of T iff λ̄ is eigenvalue of T ∗ .

Proof. λ is eigenvalue of T iff T − λ1 is not invertible, iff (T − λ1)∗ = T ∗ − λ̄1 is not invertible (by Theorem 167,
an operator is invertible iff its adjoint is), iff λ̄ is eigenvalue of T ∗ .
Compare Theorem 174 - there, for normal T , the eigenvector is also shared; here only the (conjugate) eigenvalue
is shared, and no statement can be made about the associated eigenvector.

Exercise 277. Suppose T ∈ L(V ), U is subspace of V . Then show that U invariant under T iff U ⊥
invariant under T ∗ .

Proof. Note V = U ⊕ U ⊥ , and let u ∈ U, v ∈ U ⊥ . Suppose U is T -invariant, T u = ũ then

hT u|vi = hũ|vi = 0 = hu|T ∗ vi (611)

so T ∗ v ∈ U ⊥ . Invoke symmetry.

Exercise 278. Suppose T ∈ L(V, W ), then prove that T injective iff T ∗ surjective, and T surjective iff
T ∗ injective.

Proof. See Theorem 167. See that inj(T ) iff null T = {0} iff (range T ∗ )⊥ = {0} iff range T ∗ = V iff
sur(T ∗ ). Then repeat the argument with T → T ∗ .

Exercise 279. Prove

dim null T ∗ = dim null T + dim W − dim V, (612)

and that

dim range T ∗ = dim range T (613)

for all T ∈ L(V, W ).

Proof. Let T ∈ L(V, W ), then T ∗ ∈ L(W, V ), then

dim V = dim null T + dim range T, (614)


dim W = dim null T ∗ + dim range T ∗ , (615)
∗ ∗
dim W − dim V = dim null T − dim null T + dim range T − dim range T. (616)

The first statement holds iff the second statement holds. See that

dim range T ∗ + dim null T = dim null T ⊥ + dim null T = dim V = dim range T + dim null T. (617)

Exercise 280. For P2 (R) with inner product defined hp|qi = ∫_{0}^{1} p(x)q(x) dx, define T ∈ L(P2 (R)) by
T (a0 + a1 x + a2 x^2 ) = a1 x. Then

1. Show that T is not self adjoint,

2. that the matrix M(T ) w.r.t standard basis of P2 (R) is a zero matrix except for an entry of one in
the (2, 2) entry. Since this matrix must equal its conjugate transpose, but T 6= T ∗ , then argue why
this is not a violation of the rules of linear algebra.

Proof. If T self adjoint, then hT p|qi = hp|T qi but see that for p(x) = a0 +a1 x+a2 x2 , q(x) = b0 +b1 x+b2 x2 ,
we get hT p|qi 6= hp|T qi (work through the integrals). This is not a violation since the basis chosen are
standard, but the standard bases for polynomial function spaces are not orthonormal.

Exercise 281. Let S, T ∈ L(V ) be self-adjoint, then prove that ST self adjoint iff ST = T S.

Proof. See (ST )∗ = T ∗ S ∗ . Then ST self adjoint iff T S = T ∗ S ∗ = (ST )∗ = ST by the self-adjoint of
S, T .

Exercise 282. Suppose V is real inner product space, then show that the set of self-adjoint operators on
V is subspace of L(V ).

Proof. Let S, T be self adjoint, then

h(S + T )v|ui = hSv + T v|ui = hSv|ui + hT v|ui = hv|Sui + hv|T ui = hv|Su + T ui = hv|(S + T )ui (618)

and

hλT v|ui = λhT v|ui = λhv|T ui = hv|λT ui (619)

under the real-inner product space assumptions.

Exercise 283. Suppose V is complex inner product space, V 6= {0}. Show that the set of self-adjoint
operators on V is not subspace of L(V ).

Proof. See that the homogeneity proof in Exercise 282 does not hold under complex inner product space assumptions.
In particular, the equality statement should be changed to λhv|T ui ≠ hv|λT ui = λ̄hv|T ui for λ ∉ R (when hv|T ui ≠ 0), so the set is not closed under complex scalar multiplication.

Exercise 284. If P ∈ L(V ) and P 2 = P , then prove there exists subspace U of V s.t. P = PU iff P self
adjoint.

Proof. [2]. This problem asks us to prove that P is an orthogonal projection iff P is self adjoint, for P satisfying
P^2 = P . First suppose P = PU for some subspace U . Since V = U ⊕ U ⊥ , for all v ∈ V we may write v = uv + wv ,
where uv ∈ U and wv ∈ U ⊥ . See that (since wv ⊥ uw and uv ⊥ ww ), ∀v, w ∈ V ,

hPU v|wi = huv |uw + ww i = huv |uw i = huv + wv |uw i = hv|PU wi.   (620)

So PU is self adjoint. On the other hand, if P self-adjoint, then for v ∈ V , since we may write P (v−P v) =
P v − P 2 v = 0, then v − P v ∈ null P = (range P ∗ )⊥ = (range P )⊥ . We may express v = P v + (v − P v),
and since P v ∈ range P, (v − P v) ∈ (range P )⊥ , then P v = Prange P v and P = Prange P . P is orthogonal
projection.

Exercise 285. If T ∈ L(V ) is normal, and has eigenvalues 3, 4, then prove ∃v ∈ V s.t. kvk = √2, kT vk = 5.

Proof. ∃v1 , v2 s.t. T v1 = 3v1 , T v2 = 4v2 . Let ṽ1 = v1 /kv1 k, ṽ2 = v2 /kv2 k be the eigenvectors normalized to
unit norm, so T ṽ1 = 3ṽ1 , T ṽ2 = 4ṽ2 . Theorem 175 asserts hv1 |v2 i = 0 = hṽ1 |ṽ2 i. Then write v = ṽ1 + ṽ2 and see hT v|T vi =
h3ṽ1 + 4ṽ2 |3ṽ1 + 4ṽ2 i = 9kṽ1 k^2 + 16kṽ2 k^2 = 25, so kT vk = √25 = 5, and kvk^2 = hṽ1 + ṽ2 |ṽ1 + ṽ2 i =
hṽ1 |ṽ1 i + hṽ2 |ṽ2 i = 2, so kvk = √2.

Exercise 286. For fixed u, x ∈ V , define T ∈ L(V ) by T v = hv|uix for all v ∈ V , and

1. prove that if F = R, then T self adjoint iff u, x is linearly dependent,

2. prove that T normal iff u, x linearly dependent.

Proof. [1]. See that for w1 , w2 ∈ V , we may write hw1 |T ∗ w2 i = hT w1 |w2 i = hhw1 |uix|w2 i = hw1 |uihx|w2 i =
hw1 |hw2 |xiui. Then T ∗ v = hv|xiu.

1. If T self adjoint, then hv|uix − hv|xiu = T v − T ∗ v = 0, and take v = u, then hu|uix − hu|xiu = 0
and x, u must be linearly dependent. If u, x linearly dependent, write u = cx for some c ∈ R (the degenerate cases u = 0 or x = 0 are trivial), and
T v = hv|uix = hv|cxi(1/c)u = hv|xiu = T ∗ v.

2. Write hhv|uix|xiu = T ∗ (hv|uix) = T ∗ T v = T T ∗ v = T (hv|xiu) = hhv|xiu|uix. Take v = u, since


hhu|uix|xiu = hhu|xiu|uix, so u, x linearly dependent. OTOH, if u = cx for c ∈ F, then T T ∗ v =
T (hv|xiu) = hhv|xiu|uix = |c|^2 kxk^2 hv|xix = hhv|uix|xiu = T ∗ (hv|uix) = T ∗ T v.

Exercise 287. Prove that T normal =⇒ null T = null T ∗ .

Proof. This follows immediately from kT vk = kT ∗ vk iff T normal and definiteness.

Exercise 288. Prove

T is normal =⇒ range T = range T ∗ . (621)

Proof.

range T = (null T ∗ )⊥ = (null T )⊥ = range T ∗ . (622)

See Exercise 287 and Theorem 167 for assertion of these equalities. Note that we already saw dim range T =
dim range T ∗ - here we see that the subspaces are the same when T is normal.

Exercise 289. Prove

T is normal =⇒ (null T k = null T ) ∧ (range T k = range T ) (623)

for all k ∈ Z+ .

Proof. [2]. When k = 1 the result is trivial. Suppose T ∈ L(V ) normal, k ≥ 2 ∈ Z+ . Clearly null T ⊂
null T k . For v ∈ null T k , see that hT ∗ T k−1 v|T ∗ T k−1 vi = hT T ∗ T k−1 v|T k−1 vi = hT ∗ T k v|T k−1 vi =
h0|T k−1 vi = 0. Then T ∗ T k−1 v = 0, and so 0 = hT ∗ T k−1 v|T k−2 vi = hT k−1 v|T k−1 vi, hence T k−1 v = 0
and v ∈ null T k−1 . Repeat the argument to show v ∈ null T k−2 and so on until v ∈ null T 1 , so
null T k ⊂ null T . On the other hand, if v ∈ range T k , clearly v ∈ range T . Since we may write

dim range T k = dim V − dim null T k = dim V − dim null T = dim range T, (624)

and rangeT, rangeT k are both subspaces of V , rangeT k ⊂ rangeT with equal dimension, so rangeT k =
range T .

3.3.7.2 Spectral Theorem

An operator on V has diagonal matrix w.r.t. basis iff some basis consists precisely of the eigenvectors
(Theorem 147). We are interested in the cases when these bases can be orthonormal. It turns out that
this corresponds to the case when operators are normal in C and self-adjoint in the case of R.

Theorem 176 (Complex Spectral Theorem). Let F = C, and T ∈ L(V ), then the statements are
equivalent for:

1. T is normal,

2. V has orthonormal basis consisting of eigenvectors of T ,

3. T has diagonal matrix w.r.t some orthonormal basis of V .

Proof. The statements 2 iff 3 is proved similar to the one written in Theorem 147. Suppose 3 holds,
then by Theorem 168, M(T ∗ ) = M(T )† (note there is ‘swapping’ of basis) and the product of diagonal
matrices are commutative. So T is normal. OTOH, suppose T is normal, then Schur’s Theorem 159
(without the normal assumptions) asserts ∃ei , i ∈ [n] orthonormal basis of V s.t. M(T, (ei )i∈[n] ) = A
is upper-triangular. Then see that kT e1 k2 = |A1,1 |2 by matrix representation of linear map (Definition
113). Taking the conjugate transpose of A (absolute values are the same for conjugates), then

kT ∗ e1 k^2 = ∑_{i=1}^{n} |A1,i |^2 .   (625)

But since T is normal, kT e1 k = kT ∗ e1 k (Theorem 173) implies that all the entries in A1,· are zero
except possibly A1,1 . Repeating the argument, kT e2 k^2 = |A2,2 |^2 = kT ∗ e2 k^2 = ∑_{i=2}^{n} |A2,i |^2 gives all
entries of A2,· equal to zero except possibly A2,2 , and so on. Then A is diagonal.
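Numerically, the complex spectral theorem corresponds to unitary diagonalizability. A sketch (assuming NumPy; the construction below simply manufactures a normal, non-Hermitian matrix from a unitary U and a complex diagonal D):

import numpy as np

rng = np.random.default_rng(4)

X = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
U, _ = np.linalg.qr(X)                                   # a unitary matrix
D = np.diag(rng.standard_normal(4) + 1j * rng.standard_normal(4))
A = U @ D @ U.conj().T                                   # normal by construction

print(np.allclose(A @ A.conj().T, A.conj().T @ A))       # A A* = A* A, so A is normal
print(np.allclose(U.conj().T @ A @ U, D))                # columns of U are an orthonormal eigenbasis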

Theorem 177 (Invertible Quadratic Expressions). Let T ∈ L(V ) be self adjoint and b, c ∈ R. If b2 < 4c,
then T 2 + bT + c1 is invertible.

Proof. For non-zero v ∈ V ,

h(T^2 + bT + c1)v|vi = hT^2 v|vi + bhT v|vi + chv|vi   (626)
                     = hT v|T vi + bhT v|vi + ckvk^2   (627)
                     ≥ kT vk^2 − |b|kT vkkvk + ckvk^2   (note −|b|, use Theorem 154)   (628)
                     = (kT vk − |b|kvk/2)^2 + (c − b^2 /4)kvk^2   (629)
                     > 0,   (630)

where the second line uses self-adjointness of T . So (T^2 + bT + c1)v ≠ 0 for v ≠ 0 and null(T^2 + bT + c1) = {0}. Then inj(T^2 + bT + c1), and since T^2 + bT + c1 ∈ L(V ) is an operator, inj =⇒ inv(T^2 + bT + c1).

Every operator on complex vector space has eigenvalue (Theorem 141). The result holds on real
vector spaces if T is self-adjoint.

Theorem 178. Suppose V 6= {0} and T ∈ L(V ) is self-adjoint, then T has eigenvalue.

Proof. If V is complex vector space, employ Theorem 141. Assume V is real inner product space. Let
dim V = n; then for some non-zero v ∈ V , the vectors T^i v, i = 0, · · · , n cannot be linearly independent, and ∃ai ,
not all zero, giving a representation 0 = ∑_{i=0}^{n} ai T^i v. By polynomial factorization (Theorem 138), we may write

∑_{i=0}^{n} ai x^i = c(x^2 + b1 x + c1 ) · · · (x^2 + bM x + cM )(x − λ1 ) · · · (x − λm ),   (631)

where c ≠ 0, bj^2 < 4cj . Similarly, 0 = ∑_{i=0}^{n} ai T^i v can be written

0 = ∑_{i=0}^{n} ai T^i v = (∑_{i=0}^{n} ai T^i )v   (632)
  = c(T^2 + b1 T + c1 1) · · · (T^2 + bM T + cM 1)(T − λ1 1) · · · (T − λm 1)v.   (633)

Each of the quadratic expression terms is invertible (Theorem 177), and c ≠ 0, so m ≥ 1 and ∃j ∈ [m] s.t. T − λj 1 is not injective (otherwise the RHS applied to v ≠ 0 could not vanish), so T has eigenvalue.

Theorem 179. If T ∈ L(V ) is self-adjoint and U is T -invariant subspace of V , then

1. U ⊥ is T -invariant,

2. T|U ∈ L(U ) is self-adjoint and

3. T|U ⊥ ∈ L(U ⊥ ) is self-adjoint.

Proof. -

1. For v ∈ U ⊥ , u ∈ U , see that

hT v|ui = hv|T ui = hv|u0 i = 0, (634)

where T u = u0 ∈ U , so T v ∈ U ⊥ .

2. For u, v ∈ U ,

hT|U u|vi = hT u|vi = hu|T vi = hu|T|U vi. (635)

3. Similar to 2.

Theorem 180. Let F = R, and T ∈ L(V ), then the statements are equivalent for:

1. T is self-adjoint,

2. V has orthonormal basis consisting of eigenvectors of T ,

3. T has diagonal matrix w.r.t some orthonormal basis of V .

Proof. The statements 2 iff 3 is proved similar to the one written in Theorem 147. If 3 holds, then T has
diagonal matrix, diagonal matrix equals (conjugate) transpose for F = R, so T = T ∗ and T is self-adjoint.
We prove that part 1 implies part 2 by induction. Self-adjoint T implies existence of eigenvalue (Theorem
178). If dim V = 1, then the matrix M(T ) w.r.t to the unit normalized, associated eigenvector is size
1 × 1 diagonal matrix with orthonormal basis. But since part 3 iff part 2, then part 1 implies part 2 for
dimension one. Assume dim V > 1, and let u be eigenvector of T , kuk = 1, and let U = span(u). Then
U is one-dimensional T -invariant subspace of V and Theorem 179 asserts T|U ⊥ ∈ L(U ⊥ ) is self-adjoint.
Since dim U⊥ = dim V − dim U < dim V, by the inductive hypothesis ∃ an orthonormal basis of U⊥ consisting of eigenvectors of T|U⊥, say some set S_{U⊥}. Then S_{U⊥} ∪ {u} gives part 2.
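As a small numerical illustration of the real spectral theorem (a sketch in Python/NumPy, not part of the formal development; variable names are ours): numpy.linalg.eigh takes a real symmetric matrix and returns real eigenvalues together with an orthonormal basis of eigenvectors, so the matrix is diagonal w.r.t that basis.

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                      # a real symmetric (self-adjoint) matrix

lam, Q = np.linalg.eigh(A)             # real eigenvalues and orthonormal eigenvectors
assert np.allclose(Q.T @ Q, np.eye(4))             # columns of Q are an orthonormal basis
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)      # diagonal matrix w.r.t that basis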

Exercise 290. Suppose F = C, T ∈ L(V). Prove that T is normal iff all pairs of eigenvectors corresponding to distinct eigenvalues of T are orthogonal and V = ⊕ᵢᵐ E(λᵢ, T), where the λᵢ's are distinct.

Proof. The forward proof is asserted by Theorem 175, Theorem 176 and Theorem 147. The backward
proof is asserted by Theorem 124, Theorem 147 and Theorem 176.

Exercise 291. Suppose F = R, T ∈ L(V). Prove that T is self-adjoint iff all pairs of eigenvectors corresponding to distinct eigenvalues of T are orthogonal and V = ⊕ᵢᵐ E(λᵢ, T), where the λᵢ's are distinct.

Proof. The forward proof is asserted by Theorem 175, Theorem 180 and Theorem 147. The backward
proof is asserted by Theorem 124 and Theorem 147 and Theorem 180.

Exercise 292. Prove that normal operator on a complex inner product space is self adjoint iff all its
eigenvalues are real.

Proof. The forward proof is asserted by Theorem 169. Suppose all eigenvalues are real, then complex
spectral theorem 176 asserts there is orthonormal basis ei , i ∈ [n] of V of eigenvectors s.t. T ej = λj ej .
Since (w.r.t ei ’s, the diagonal matrix) M(T ) = M(T )† , then T must be self-adjoint.

Exercise 293. Suppose V is a complex inner product space, T ∈ L(V) is normal and T⁹ = T⁸. Prove that T is self-adjoint and T² = T.

Proof. [2]. Complex spectral theorem 176 asserts ∃ei , i ∈ [n] orthonormal, eigenvector basis of V , then
let T ej = λj ej for j ∈ [n], and see that T 9 ej = λ9j ej , T 8 ej = λ8j ej , λ9j = λ8j =⇒ λj ∈ {0, 1}. Then
eigenvalues of T are real and T is self-adjoint (Exercise 292). Specifically, if we write T²eⱼ = λⱼ²eⱼ = λⱼeⱼ = Teⱼ, then T² = T.

Exercise 294. Suppose V is complex inner product space, then prove that ∀ normal operator T on V ,
there exists square root operator S satisfying S 2 = T .

Proof. Suppose T is normal, then the complex spectral theorem 176 asserts the existence of some orthonormal basis eᵢ ∈ V, i ∈ [n] consisting of eigenvectors; let the eigenvalues be λᵢ s.t. Teᵢ = λᵢeᵢ, i ∈ [n]. If we define Seᵢ = λᵢ^{1/2}eᵢ for i ∈ [n], then S² = T. Note that here, λᵢ^{1/2} is the complex square root.

Exercise 295. Prove or disprove: every self-adjoint operator T on V has a cube root (T = S³).

Proof. See the proof in Exercise 294, replacing λᵢ^{1/2} with λᵢ^{1/3}, and see this is well defined since λᵢ ∈ R by self-adjointness (Exercise 292).

Exercise 296. Suppose T ∈ L(V) is self-adjoint, λ ∈ F, ε > 0. If ∃v ∈ V s.t. ‖v‖ = 1 and ‖Tv − λv‖ < ε, then prove that T has an eigenvalue λ′ satisfying |λ′ − λ| < ε.

Proof. [2]. The spectral theorems 176, 180 assert there is an orthonormal basis eᵢ, i ∈ [n] of eigenvectors of T; let it be that Teᵢ = λᵢeᵢ, i ∈ [n]. We may write v = Σ_{i}^{n} ⟨v|eᵢ⟩eᵢ, so Tv = Σ_{i}^{n} λᵢ⟨v|eᵢ⟩eᵢ, and write

ε² > ‖Tv − λv‖² (636)
= ‖Σ_{i}^{n} (λᵢ − λ)⟨v|eᵢ⟩eᵢ‖² (637)
= Σ_{i}^{n} |λᵢ − λ|²|⟨v|eᵢ⟩|² (638)
≥ (min{|λᵢ − λ|², i ∈ [n]})(Σ_{i}^{n} |⟨v|eᵢ⟩|²) (639)
= min{|λᵢ − λ|², i ∈ [n]}. (640)

Then ∃j s.t. ε > |λⱼ − λ|.

 
Exercise 297. Define A as the 3 × 3 matrix with rows (1, 1, 0), (0, 1, 1), (1, 0, ?) and find the entry that makes A normal.

Proof. Find that ? = 1 by writing out A′A = AA′ and comparing the entries.
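A quick numerical check of the exercise (a sketch assuming NumPy; the unknown entry is set to 1):

import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
# A real matrix is normal iff A'A = AA'.
print(np.allclose(A.T @ A, A @ A.T))   # True with the (3,3) entry equal to 1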

3.3.7.3 Positive Operator and Isometries

Definition 153 (Positive Operator). An operator T ∈ L(V ) is positive (also called positive semidefinite)
if T is self-adjoint and hT v|vi ≥ 0 for all v ∈ V .

Note that Definition 153 is same as saying Re (M(T )v)† v = Re v † M(T )v ≥ 0, or in the case of real
vector spaces, simply v 0 M(T )v ≥ 0.
If U is a subspace of V, the orthogonal projection P_U is positive. Since V = U ⊕ U⊥, for all v ∈ V we may write v = u_v + u_v^⊥, where u_v ∈ U, u_v^⊥ ∈ U⊥. See that (since u_v and u_w^⊥ are orthogonal) for all v, w ∈ V,

⟨P_Uv|w⟩ = ⟨u_v|u_w + u_w^⊥⟩ = ⟨u_v|u_w⟩ + ⟨u_v|u_w^⊥⟩ = ⟨u_v|u_w⟩ = ⟨v|P_Uw⟩. (641)

So P_U is self-adjoint. Furthermore, see

⟨P_Uv|v⟩ = ⟨u_v|u_v + u_v^⊥⟩ = ‖u_v‖² ≥ 0 (642)

for arbitrary v; P_U is positive. If T ∈ L(V) is self-adjoint, b, c ∈ R and b² < 4c, then T² + bT + c1 is positive.

Definition 154 (Square Root of an Operator). An operator R is a square root of an operator T if R² = T. We also denote √T = R.

Theorem 181. Let T ∈ L(V ), then the following statements are equivalent for:

1. T is positive,

2. T is self-adjoint and all eigenvalues of T are ≥ 0,

3. T has positive square root,

4. T has self-adjoint square root,

5. ∃R ∈ L(V ) s.t T = R∗ R.

Proof. If T positive, then T self adjoint, and see that hT v|vi = hλv|vi = λhv|vi ≥ 0 =⇒ λ ≥ 0.
Now suppose T self-adjoint with non-negative eigenvalues. Spectral theorem (Theorem 176, 180) asserts
existence of orthonormal basis ei , i ∈ [n] in V of eigenvectors of T , associated with eigenvalues λi , i ∈ [n].
Then define R ∈ L(V) s.t. Reⱼ = √λⱼ eⱼ for j ∈ [n], which is well defined by Theorem 95. Furthermore,
R is positive, and R2 ej = T ej , hence R2 = T . Since T has positive square root, then it must have
self-adjoint square root. This means we may write T = R2 = R∗ R where R∗ = R by self-adjoint. If
∃R ∈ L(V ) where T = R∗ R, then T ∗ = (R∗ R)∗ = T , and see that hT v|vi = hR∗ Rv|vi = hRv|Rvi ≥ 0 so
T must be positive.

Theorem 182. Each positive operator on V has unique positive square root.

Proof. See Theorem 181. T has positive square root. To argue uniqueness, see that the behavior of R
on the eigenvectors of T are uniquely determined, and each T is uniquely determined on the eigenvector
basis by Theorem 95.

Theorem 182 does not say anything about the number of non positive square roots of a positive
operator.
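Numerically, the unique positive square root can be formed from the spectral decomposition, mirroring the construction used in Theorem 181; a minimal sketch assuming NumPy (the helper name positive_sqrt is ours):

import numpy as np

def positive_sqrt(A):
    # Q diag(sqrt(lambda)) Q' from the spectral decomposition of a
    # symmetric positive semidefinite matrix A.
    lam, Q = np.linalg.eigh(A)
    lam = np.clip(lam, 0.0, None)      # guard against tiny negative round-off
    return Q @ np.diag(np.sqrt(lam)) @ Q.T

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B.T @ B                            # positive semidefinite (cf. Exercise 301)
R = positive_sqrt(A)
assert np.allclose(R @ R, A)           # R is a square root of A
assert np.allclose(R, R.T)             # and R is itself self-adjoint (positive)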

Definition 155 (Isometry). An operator S ∈ L(V ) is isometry if kSvk = kvk for all v ∈ V . An operator
is isometry if the norm is preserved. An isometry on a real inner product space is often said to be an
orthogonal operator, and on a complex inner product space is often said to be a unitary operator.

Exercise 298. Suppose λi , i ∈ [n] are scalars s.t. ∀i, |λi | = 1, S ∈ L(V ) satisfies Sei = λi ei for some
orthonormal basis ei , i ∈ [n] of V . Then argue that S is isometry.
Proof. Since eᵢ, i ∈ [n] is an orthonormal basis, ∀v ∈ V, v = Σ_{i}^{n} ⟨v|eᵢ⟩eᵢ, and ‖v‖² = Σ_{i}^{n} |⟨v|eᵢ⟩|². Applying S to both the LHS and RHS, see that Sv = Σ_{i}^{n} ⟨v|eᵢ⟩Seᵢ = Σ_{i}^{n} λᵢ⟨v|eᵢ⟩eᵢ, and since by assumption |λᵢ| = 1, then ‖Sv‖² = Σ_{i}^{n} |λᵢ|²|⟨v|eᵢ⟩|² = Σ_{i}^{n} |⟨v|eᵢ⟩|² = ‖v‖².

Theorem 183. If S ∈ L(V ), then the following are equivalent statements:

1. S is isometry,

2. hSu|Svi = hu|vi for all u, v ∈ V ,

3. ∀ei ∈ V, i ∈ [n] orthonormal =⇒ Sei , i ∈ [n] orthonormal,

4. ∃ei ∈ V, i ∈ [n] s.t. Sei , i ∈ [n] is orthonormal,

5. S ∗ S = 1,

6. SS ∗ = 1,

7. S ∗ is isometry,

8. inv(S), S −1 = S ∗ .

Proof. Suppose S is an isometry; then by Exercise 256 and Exercise 255, inner products can be written in terms of norms, and since an isometry preserves norms, it preserves the inner product and ⟨Su|Sv⟩ = ⟨u|v⟩. Clearly, if ⟨Su|Sv⟩ = ⟨u|v⟩, then ⟨Seᵢ|Seⱼ⟩ = ⟨eᵢ|eⱼ⟩ for any orthonormal list of vectors eᵢ, i ∈ [n], which by the existence of an orthonormal basis of any finite-dimensional inner product space implies there must exist some orthonormal basis eᵢ, i ∈ [n] where Seᵢ, i ∈ [n] are orthonormal. Then consider u := Σ_{i}^{n} aᵢeᵢ, v := Σ_{i}^{n} cᵢeᵢ ∈ V, and see that ⟨S*Su|v⟩ = ⟨S*S Σ_{i}^{n} aᵢeᵢ | Σ_{i}^{n} cᵢeᵢ⟩ = Σ_{i}^{n}Σ_{j}^{n} aᵢcⱼ⟨S*Seᵢ|eⱼ⟩ = Σ_{i}^{n}Σ_{j}^{n} aᵢcⱼ⟨Seᵢ|Seⱼ⟩ = Σ_{i}^{n} aᵢcᵢ⟨Seᵢ|Seᵢ⟩ = Σ_{i}^{n} aᵢcᵢ⟨eᵢ|eᵢ⟩ = Σ_{i}^{n} ⟨aᵢeᵢ|cᵢeᵢ⟩ = ⟨u|v⟩. Then S*S = 1, and Theorem 9 asserts S*S = 1 ↔ SS* = 1. See that if SS* = 1, then ‖S*v‖² = ⟨S*v|S*v⟩ = ⟨SS*v|v⟩ = ⟨v|v⟩ = ‖v‖², and S* is an isometry. Clearly parts 5, 6 assert part 8, and if 8 holds, then ⟨Sv|Sv⟩ = ⟨S*Sv|v⟩ = ⟨v|v⟩ and S preserves the norm, so it is an isometry (Definition 155).

Part 5,6 of Theorem 183 assert that isometries are normal.
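A numerical sanity check of these equivalences (a sketch assuming NumPy): an orthogonal matrix obtained from a QR factorization is a real isometry, preserves norms, and satisfies S'S = SS' = 1.

import numpy as np

rng = np.random.default_rng(2)
S, _ = np.linalg.qr(rng.standard_normal((5, 5)))     # orthogonal matrix (real isometry)
v = rng.standard_normal(5)

assert np.isclose(np.linalg.norm(S @ v), np.linalg.norm(v))   # norms preserved
assert np.allclose(S.T @ S, np.eye(5))                        # S*S = 1
assert np.allclose(S @ S.T, np.eye(5))                        # SS* = 1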

Theorem 184. When V is complex inner product space, S ∈ L(V ), then S is isometry iff ∃ some
orthonormal basis of V consisting of eigenvectors of S, where each associated eigenvalue has absolute
value of one.

Proof. Suppose S is isometry, then S is normal; Complex Spectral Theorem 176 asserts there is or-
thonormal basis ei , i ∈ [n] in V of eigenvectors of S. For j ∈ [n], define λj to be eigenvalue associated
with eigenvector ej , and see that |λj | = kλj ej k = kSej k = kej k = 1. See Exercise 298 for the other
direction.

Exercise 299. If T is positive operator in L(V ), and T v = w, T w = v for v, w ∈ V , then prove v = w.

Proof. See that

0 ≤ hT (v − w)|v − wi = hT v − T w|v − wi = hw − v|v − wi = −hv − w|v − wi ≤ 0 (643)


=⇒ v − w = 0 =⇒ v = w. (644)

Exercise 300. Suppose T is positive operator on V and U is T invariant subspace. Prove that T|U ∈
L(U ) is positive operator.

Proof.

∀u ∈ U, hT|U u|ui = hT u|ui = hu|T ui = hu|T|U ui, (645)

since positive operator implies self-adjoint operator. Then T|U self adjoint. Similarly, hT|U u|ui =
hT u|ui ≥ 0.

Exercise 301. If T ∈ L(V, W ), prove that T ∗ T is positive operator on V , and that T T ∗ is positive
operator on W .

Proof. (T ∗ T )∗ = T ∗ T ∗∗ = T ∗ T , so T ∗ T is self-adjoint, and since for all v ∈ V , we may write

hT ∗ T v|vi = hT v|T vi ≥ 0 (646)

by definiteness (Theorem 143), T ∗ T positive. Similar proof to show T T ∗ positive on W .

That M(T )0 M(T ) is positive semidefinite will turn out to be very useful in applications.

Exercise 302. Prove sum of two positive operators on V is positive. Prove that the positive integer
power of positive operator is positive.

Proof. Suppose T, S positive on V , then

(T + S)∗ = T ∗ + S ∗ = (T + S), (647)

so self-adjoint and

h(T + S)v|vi = hT v|vi + hSv|vi ≥ 0 (648)

so positive. Positive operators are self-adjoint, so spectral theorems (Theorem 176, 180) assert existence
of diagonal matrix w.r.t basis of eigenvectors to which the diagonals are the eigenvalues λi ≥ 0. Taking
powers of diagonal matrices is equal to the componentwise power of entries, and since these are nonneg-
ative, the matrix power is also diagonal matrix with nonnegative eigenvalues w.r.t eigenvector basis, and
Theorem 181 asserts that the positive integer power of a positive operator is also positive.

Exercise 303. Prove that if A is positive, then null A = {v : v 0 Av = 0}.

Proof. Clearly, if v ∈ null A, then v 0 Av = v 0 (0) = 0. On the other hand, suppose v 0 Av = 0. Then
Theorem 182 asserts we may write A = R∗ R, and we can write (Definition 143)

v 0 R∗ Rv = hRv|Rvi = 0 ⇐⇒ Rv = 0, (649)

so Av = R∗ Rv = R∗ 0 = 0 and v ∈ null A.

Exercise 304. Prove that for X, Y ∈ L(V ) positive, then

dim range X + Y ≥ min{dim range X, dim range Y }. (650)

Proof. For all v ∈ null(X + Y), see that (since v′Xv ≥ 0 and v′Yv ≥ 0, their sum can vanish only if both vanish)

v′(X + Y)v = 0 ⟺ v′Xv = 0, v′Yv = 0, (651)

so v ∈ null X, null Y (see Exercise 303). It follows that null(X + Y) = null X ∩ null Y. Then by rank-nullity

dim range(X + Y) = dim V − dim null(X + Y) (652)
= dim V − dim(null X ∩ null Y) (653)
≥ dim V − dim null X (654)
= dim range X. (655)

Invoke symmetry.

Exercise 305. Suppose T is positive operator on V , and prove that T invertible iff hT v|vi > 0 for all
non-zero v ∈ V .

Proof. Suppose T invertible, then since T positive, by Theorem 182, ∃ self-adjoint S s.t. S 2 = T . Since
T injective, then ∀v ∈ V where v 6= 0, Sv 6= 0. See that hT v|vi = hS 2 v|vi = hSv|Svi > 0.
Assume ⟨Tv|v⟩ > 0 for all non-zero v. If Tv = 0, then ⟨Tv|v⟩ = 0, so v must be zero. Then {0} = null T, and since T is an operator, injectivity implies T is invertible.

Exercise 306. Suppose T ∈ L(V ), and for u, v ∈ V , define

hu|viT = hT u|vi. (656)

Prove that h·|·iT is inner product on V iff T is invertible, positive operator.

Proof. Suppose h·|·iT is inner product, then ∀v ∈ V , hT v|vi = hv|viT ≥ 0, so T positive. If v 6=


0, hv|viT > 0, Exercise 305 asserts that T invertible. OTOH, if T is invertible and positive, then see that
the inner product properties hold. For example, for positivity condition, see that hv|viT = hT v|vi ≥ 0.
The other proofs are similar.

Exercise 307. Suppose S ∈ L(V ), then prove the statements are equivalent for:

1. S is isometry,

2. hS ∗ u|S ∗ vi = hu|vi for all u, v ∈ V ,

3. S ∗ ei , i ∈ [m] is orthonormal list for every orthonormal list ei , i ∈ [m] in V ,

4. S ∗ ei , i ∈ [n] is orthonormal basis for some orthonormal basis ei , i ∈ [n] in V .

Proof. S is isometry iff S ∗ is isometry (Theorem 183). Then apply Theorem 183 with S ∗ → S for all
the other equivalency statements.

3.3.7.4 Polar and Singular Value Decomposition

Recall Theorem 183. When T*T = 1, T is an isometry. On complex numbers, this is a parallel condition to z̄z = 1. See that each complex number may be decomposed into z = (z/|z|)·|z| = (z/|z|)·√(z̄z). The first term is an element of the unit circle. Think of this as an isometry.

Theorem 185 (Polar Decomposition). Suppose T ∈ L(V), then ∃ an isometry S ∈ L(V) s.t. T = S√(T*T).
Proof. Axler [3]. For v ∈ V, see that ‖Tv‖² = ⟨Tv|Tv⟩ = ⟨T*Tv|v⟩ = ⟨√(T*T)√(T*T)v|v⟩ = ⟨√(T*T)v|√(T*T)v⟩ = ‖√(T*T)v‖². Then ‖Tv‖ = ‖√(T*T)v‖. Define the linear map S₁ : range √(T*T) → range T by

S₁(√(T*T)v) = Tv, (657)

and we check this is well defined. For v₁, v₂ ∈ V with √(T*T)v₁ = √(T*T)v₂, we show that Tv₁ = Tv₂. Note that

‖Tv₁ − Tv₂‖ = ‖T(v₁ − v₂)‖ = ‖√(T*T)(v₁ − v₂)‖ = ‖√(T*T)v₁ − √(T*T)v₂‖ = 0. (658)

Further testing for additivity and homogeneity should confirm that S₁ is a linear map. Since S₁ : range √(T*T) → range T, then ‖S₁u‖ = ‖u‖ for u ∈ range √(T*T). Therefore, S₁ must be injective, and rank-nullity (Theorem 51) asserts that dim range √(T*T) = dim range T, and dim(range √(T*T))⊥ = dim(range T)⊥. Choose some orthonormal basis eᵢ, i ∈ [m] for (range √(T*T))⊥, and an orthonormal basis fᵢ, i ∈ [m] of (range T)⊥. Then Theorem 95 asserts that the map S₂ : (range √(T*T))⊥ → (range T)⊥ defined by

S₂(Σ_{i}^{m} aᵢeᵢ) = Σ_{i}^{m} aᵢfᵢ (659)

is well defined. For w ∈ (range √(T*T))⊥, we have ‖S₂w‖² = Σ_{i}^{m} |aᵢ|² = ‖w‖². Finally, let S be the operator

Sv = S₁v if v ∈ range √(T*T), Sv = S₂v if v ∈ (range √(T*T))⊥. (660)

That is, for the orthogonal decomposition v = u + w ∈ V = range √(T*T) ⊕ (range √(T*T))⊥, Sv takes on Sv = S₁u + S₂w. Then for v ∈ V, see that (note that T*T is positive, so √(T*T) is positive by Theorem 181, so √(T*T) is self-adjoint and √(T*T) = √(T*T)*, which applying Theorem 167 asserts that (range √(T*T))⊥ = null √(T*T)* = null √(T*T)), so

S√(T*T)v = S₁√(T*T)v = Tv, (661)

hence T = S√(T*T). See also that

‖Sv‖² = ‖S₁u + S₂w‖² = ‖S₁u‖² + ‖S₂w‖² = ‖u‖² + ‖w‖² = ‖v‖² (662)

by application of Pythagoras Theorem 152.

Polar Decomposition Theorem 185 asserts that every operator may be written as product of isometry
and positive operator. The isometry is characterized in Theorem 183 and positive operators are char-
acterized in Theorem 181. Note that although the spectral theorem (see Theorem 176) asserts that for
T = S√(T*T), there are orthonormal bases for both S and √(T*T) that cause the matrix representations to be
diagonal, these orthonormal bases need not be the same.
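Numerically, a polar decomposition of a real square matrix can be read off its SVD: if T = UΣV′ then S = UV′ is an isometry and √(T′T) = VΣV′, so T = S√(T′T). A minimal sketch assuming NumPy (scipy.linalg.polar returns the same factors, where SciPy is available):

import numpy as np

rng = np.random.default_rng(3)
T = rng.standard_normal((4, 4))

U, s, Vt = np.linalg.svd(T)            # T = U diag(s) V'
S = U @ Vt                             # isometry (orthogonal) factor
P = Vt.T @ np.diag(s) @ Vt             # sqrt(T'T), the positive factor

assert np.allclose(S.T @ S, np.eye(4)) # S is an isometry
assert np.allclose(S @ P, T)           # T = S sqrt(T'T)
assert np.allclose(P, P.T)             # P is self-adjoint (and positive semidefinite)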

Definition 156 (Singular Values). Suppose T ∈ L(V), then the singular values of T are said to be the (nonnegative) eigenvalues of the positive operator √(T*T), where each eigenvalue λ is repeated dim E(λ, √(T*T)) times.

√(T*T) is positive, hence self-adjoint, and the spectral theorems (Theorem 176, 180) assert that every operator T ∈ L(V) has dim V singular values.

Theorem 186 (Singular Value Decomposition (SVD)). Suppose T ∈ L(V) has singular values given by sᵢ, i ∈ [n], then ∃eᵢ, i ∈ [n], fⱼ, j ∈ [n] orthonormal bases of V satisfying Tv = Σ_{i}^{n} sᵢ⟨v|eᵢ⟩fᵢ for all v ∈ V.

Proof. The spectral theorems assert the positive operator √(T*T) has an orthonormal basis eᵢ, i ∈ [n] in V of eigenvectors s.t. √(T*T)eⱼ = sⱼeⱼ, j ∈ [n]. Since v = Σ_{i}^{n} ⟨v|eᵢ⟩eᵢ, applying √(T*T) we get √(T*T)v = Σ_{i}^{n} sᵢ⟨v|eᵢ⟩eᵢ. Polar decomposition (Theorem 185) asserts some T = S√(T*T), then we must have

Tv = S√(T*T)v = Σ_{i}^{n} sᵢ⟨v|eᵢ⟩Seᵢ = Σ_{i}^{n} sᵢ⟨v|eᵢ⟩fᵢ, ∀v ∈ V. (663)

The last equality holds and fⱼ := Seⱼ, j ∈ [n] is an orthonormal basis since S is an isometry (Theorem 183) under the polar decomposition (Theorem 185).

Although we often used one set of basis vectors when referring to linear operators, Theorem 186 gives us the ability to use two different bases. In particular, the SVD writes Teₖ = Σ_{i}^{n} sᵢ⟨eₖ|eᵢ⟩fᵢ = sₖfₖ, and this asserts that

M(T, (ei )i∈[n] , (fj )j∈[n] ) = diag((si )i∈[n] ) (664)

is a diagonal matrix with the singular values on the diagonal. The result is that every(!) operator on
V has a diagonal matrix w.r.t some orthonormal bases ei ’s and fj ’s. The numerical approximation to
the singular values of T may be obtained by first approximately computing the eigenvalues of T ∗ T and
taking the non-negative square roots.
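A small numerical illustration of that remark (a sketch assuming NumPy): the singular values returned by an SVD routine agree with the nonnegative square roots of the eigenvalues of T′T.

import numpy as np

rng = np.random.default_rng(4)
T = rng.standard_normal((5, 5))

sv = np.linalg.svd(T, compute_uv=False)            # singular values, descending
ev = np.linalg.eigvalsh(T.T @ T)                   # eigenvalues of T'T, ascending
assert np.allclose(sv, np.sqrt(np.clip(ev, 0.0, None))[::-1])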

Theorem 187. Suppose T ∈ L(V ), then the singular values of T are the nonnegative square roots of
the eigenvalues of T ∗ T , with eigenvalue λ repeated dim E(λ, T ∗ T ) number of times.

Proof. The spectral theorems (Theorem 176, 180) imply existence of orthonormal basis ei , i ∈ [n] with

λᵢ ≥ 0, i ∈ [n] eigenvalues (recall T*T is positive, Exercise 301) s.t. T*Teⱼ = λⱼeⱼ. Clearly, √(T*T)eⱼ = √λⱼ eⱼ for j ∈ [n].

Exercise 308. For fixed u, x ∈ V, u ≠ 0, and T ∈ L(V) defined by Tv = ⟨v|u⟩x, prove that ∀v ∈ V,

√(T*T)v = (‖x‖/‖u‖)⟨v|u⟩u. (665)

Proof. See T*v = ⟨v|x⟩u (Exercise 275). Then we have

T*Tv = T*(⟨v|u⟩x) = ⟨⟨v|u⟩x|x⟩u = ⟨v|u⟩‖x‖²u. (666)

If we define R ∈ L(V) s.t. Rv = (‖x‖/‖u‖)⟨v|u⟩u, then see that

R²v = R((‖x‖/‖u‖)⟨v|u⟩u) = (‖x‖/‖u‖)⟨(‖x‖/‖u‖)⟨v|u⟩u|u⟩u = (‖x‖²/‖u‖²)⟨v|u⟩‖u‖²u = T*Tv. (667)

So R = √(T*T). To prove this is positive, check that ⟨Rv|v⟩ ≥ 0, and that it is self-adjoint. Check this by defining some orthonormal basis eᵢ, i ∈ [n] and checking the corresponding matrix is equal to its conjugate transpose.


Exercise 309. Let T ∈ L(V) and prove ∃ an isometry S ∈ L(V) s.t. T = √(TT*)S.

Proof. By the Polar Decomposition Theorem 185 applied to T* (i.e. using T → T*), we may write T* = S√(T**T*) = S√(TT*). Then take adjoints to obtain

T = (S√(TT*))* = √(TT*)*S* = √(TT*)S*. (668)

The equalities follow from the positive (and hence self-adjoint) property of √(TT*) (Exercise 301, Theorem 181) and the statement that S* is an isometry iff S is an isometry (Theorem 183).

Exercise 310. Let T ∈ L(V ) and s be singular value of T , then prove ∃v ∈ V s.t. kvk = 1, kT vk = s.

Proof. Let v be s.t. √(T*T)v = sv and ‖v‖ = 1, and see from the polar decomposition (Theorem 185) and isometry (Definition 155) that

‖Tv‖ = ‖S√(T*T)v‖ = ‖√(T*T)v‖ = ‖sv‖ = s (669)

since the eigenvalues of √(T*T) are ≥ 0.

Exercise 311. Find the singular values of the differentiation operator D on P₂(R).

Proof. [1]. We explain the general steps required. First, verify that

√(1/2), √(3/2) x, √(45/8) (x² − 1/3) (670)

is an orthonormal basis of P₂(R). Then write the matrix M(D) w.r.t this basis and compute M(D)′M(D), which turns out to be a diagonal matrix with 0, 3, 15 on the diagonal. The singular values are 0, √3, √15 respectively.

Exercise 312. Let T ∈ L(V), and let S ∈ L(V) be an isometry. Let R ∈ L(V) be a positive operator s.t. T = SR. Prove R = √(T*T).

Proof. (corrected from) [1]. Let T = S′√(T*T) under the polar decomposition. Then

0 = ‖Tv‖² − ‖Tv‖² = ‖S*Tv‖² − ‖S′*Tv‖² = ‖Rv‖² − ‖√(T*T)v‖² (671)
= ⟨R*Rv|v⟩ − ⟨T*Tv|v⟩ = ⟨(R*R − T*T)v|v⟩ = ⟨(R² − T*T)v|v⟩. (672)

Since R² − T*T is self-adjoint and this holds for all v ∈ V, R² = T*T, and uniqueness of the positive square root (Theorem 182) gives R = √(T*T).


Exercise 313. Let T ∈ L(V), then prove inv(T) iff ∃! isometry S ∈ L(V) that satisfies T = S√(T*T).

Proof. If T is invertible, then √(T*T) is invertible, and T T⁻¹ = S√(T*T)T⁻¹ together with uniqueness of inverses (Theorem 9) shows that √(T*T)T⁻¹ is the inverse of S; hence S = T(√(T*T))⁻¹ is uniquely determined. OTOH, we prove the contrapositive. If T is not invertible, then zero is an eigenvalue of √(T*T) (Theorem 72), and we may write T = S√(T*T) = S′√(T*T), where S′ agrees with S on eigenvectors of non-zero eigenvalues and differs from S (e.g. by a sign flip) on eigenvectors of the zero eigenvalue; S′ is still an isometry and the factorization is unaffected since √(T*T) annihilates those eigenvectors. Then S ≠ S′ give two different polar decompositions.

Exercise 314. Let T ∈ L(V ) self-adjoint, then prove that the singular values of T equal the absolute
values of the eigenvalues of T .

Proof. This is an immediate result of Theorem 187 and self-adjoint. In particular,

T ∗ T v = T 2 v = λ2 v = |λ|2 v. (673)

Exercise 315. Let T ∈ L(V ), prove that T, T ∗ share singular values.

Proof. See that for the polar decomposition T = S√(T*T), we have

TT* = S√(T*T)(S√(T*T))* = S(T*T)S*, (674)

but S* = S⁻¹ by Theorem 183, so Exercise 221 asserts that T*T and S(T*T)S⁻¹ = TT* share eigenvalues (and multiplicities, discussed later in Definition 160). Then apply Theorem 187 with T → T and T → T*.

Exercise 316. Let T ∈ L(V ), prove inv(T ) iff zero is not singular value of T .

Proof. 0 is not a singular value of T iff 0 is not an eigenvalue of T*T iff T*T is invertible iff T*, T are invertible. See Theorem 72 and Theorem 167.

Exercise 317. Let S ∈ L(V ), prove S isometry iff all singular values of S is one.

Proof. If S is isometry, then S ∗ S = 1 and all the singular values must be one. Otherwise assume all
the singular values are one, then S ∗ Sv = v and S ∗ S = 1 so S must be isometry by equivalency (Theorem
183).

Exercise 318. Let T1 , T2 ∈ L(V ), prove T1 , T2 share singular values iff ∃S1 , S2 ∈ L(V ) isometries s.t.
T1 = S1 T2 S2 .

Proof. Let T₁, T₂ have polar decompositions T₁ = S₁√(T₁*T₁), T₂ = S₂√(T₂*T₂). Further let eᵢ, i ∈ [n] be the eigenvector basis in V for √(T₁*T₁), and fⱼ, j ∈ [n] be the eigenvector basis in V for √(T₂*T₂), which exist by the spectral theorems (Theorem 176, 180), and let the corresponding eigenvalues be shared and equal to sᵢ, i ∈ [n]. Define the linear map Seⱼ = fⱼ for j ∈ [n], which is well defined under Theorem 95, and Theorem 183 asserts S is an isometry. Then

√(T₁*T₁)eⱼ = sⱼeⱼ = S*(sⱼfⱼ) = S*√(T₂*T₂)fⱼ = S*√(T₂*T₂)Seⱼ. (675)

It follows that √(T₁*T₁) = S*√(T₂*T₂)S, so

T₁ = S₁√(T₁*T₁) = S₁S*√(T₂*T₂)S = S₁S*S₂*T₂S (676)

and the result follows since a product of isometries is also an isometry (the norm is preserved on each transformation). Conversely, if T₁ = S₁T₂S₂ for isometries S₁, S₂, then T₁*T₁ = S₂*T₂*S₁*S₁T₂S₂ = S₂*(T₂*T₂)S₂, so T₁*T₁ and T₂*T₂ share eigenvalues and multiplicities (Exercise 221, with S₂* = S₂⁻¹), hence T₁, T₂ share singular values (Theorem 187).
Exercise 319. Suppose T ∈ L(V) has SVD given by Tv = Σ_{i}^{n} sᵢ⟨v|eᵢ⟩fᵢ, where the eᵢ's, fⱼ's are orthonormal bases and the sᵢ's are singular values. Then for v ∈ V

1. prove that T*v = Σ_{i}^{n} sᵢ⟨v|fᵢ⟩eᵢ,

2. prove that T*Tv = Σ_{i}^{n} sᵢ²⟨v|eᵢ⟩eᵢ,

3. prove that √(T*T)v = Σ_{i}^{n} sᵢ⟨v|eᵢ⟩eᵢ,

4. prove that if T is invertible, then T⁻¹v = Σ_{i}^{n} ⟨v|fᵢ⟩eᵢ/sᵢ.

Proof. Since Teⱼ = sⱼfⱼ, consider the matrix representation and conjugate transpose asserted by Theorem 168. Then part 1 holds. Part 2 holds by direct computation using part 1. Part 3 holds since applying the claimed formula for √(T*T) twice recovers part 2. Part 4 holds by using the formulas and doing direct computation to check that TT⁻¹v = v for all v ∈ V.

Exercise 320. Let T ∈ L(V ), and ŝ denote the smallest singular value of T , s denote the largest singular
value of T . Prove that ŝkvk ≤ kT vk ≤ skvk for all v ∈ V . If λ is eigenvalue of T , prove ŝ ≤ |λ| ≤ s.

Proof. For T ∈ L(V) with singular values sᵢ, i ∈ [n], we may write

Tv = Σ_{i}^{n} sᵢ⟨v|eᵢ⟩fᵢ (677)

for all v ∈ V, where the eᵢ's, fᵢ's are orthonormal bases respectively. Then

‖Tv‖² = Σ_{i}^{n} |sᵢ⟨v|eᵢ⟩|² = Σ_{i}^{n} |sᵢ|²|⟨v|eᵢ⟩|² ≤ (maxᵢ |sᵢ|²) Σ_{i}^{n} |⟨v|eᵢ⟩|² = s²‖v‖². (678)

Then ‖Tv‖ ≤ s‖v‖, and a similar proof works for ŝ‖v‖ ≤ ‖Tv‖. For a unit eigenvector v (‖v‖ = 1), we may write

|λ| = |λ|‖v‖ = ‖Tv‖ ≤ s‖v‖ = s (679)

and a similar approach gets ŝ ≤ |λ|.

Exercise 321. Prove that the number of nonzero singular values of Am×n ∈ Rm×n is rank(A).

Proof. Recall that rank(A) = rank(A′A) (see Exercise 50). Clearly A′A is positive since ⟨A′Ax|x⟩ = ⟨Ax|Ax⟩ ≥ 0. Then the real spectral theorem 180 asserts an orthonormal basis of eigenvectors for A′A (suppose A′A → M(T ∈ L(V))), and rank-nullity asserts that dim V = dim null T + dim range T, where null T is spanned by the eigenvectors corresponding to zero eigenvalues. Then dim range T equals the number of nonzero eigenvalues of A′A (counted with multiplicity), which is the number of nonzero singular values of A, which is rank(A′A), which is rank(A).

Exercise 322 (Singular Value Decomposition of Rectangular Matrix). See SVD on operators (Theorem
186), but the theorem can be generalized to other linear maps. Here the problem of doing a SVD on
arbitrary, rectangular matrices with entries in R are discussed. It is asserted that any m × n matrix A
factors into

A = U ΣV 0 , (680)

where Um×m , Vn×n are orthogonal matrices (Definition 155) and Σm×n is diagonal matrix (here the
definition of diagonal is relaxed to non-square matrices with the condition Σij = 0 if i 6= j, for possibly
non-square Σ).

Proof. We are interested in finding U, Σ, V′. First consider

AA′ = UΣV′VΣ′U′ = U(ΣΣ′)U′, (681)
A′A = VΣ′U′UΣV′ = V(Σ′Σ)V′; (682)

clearly AA′, A′A are symmetric and hence orthogonally diagonalizable (Theorem 75), with eigenvalues on the diagonals of ΣΣ′ and Σ′Σ, and where U, V are orthogonal matrices whose columns are the corresponding orthonormal eigenvectors. Recall Theorem 187 - see that the diagonals of Σ′Σ are the squares of the singular values of A, with the number of nonzero singular values equal to rank(A) by Exercise 321.

Exercise 323. For matrix A ∈ Rm×n , prove that if x 6= y, x, y ∈ rowspace(A), then Ax 6= Ay.

Proof. We prove the contrapositive: suppose Ax = Ay, then A(x − y) = 0, so (x − y) ∈ nullspace(A) and (x − y) ∈ rowspace(A), and Exercise 48 asserts that x − y = 0, i.e. x = y.

We know that rank(A) = rank(A0 ) (Theorem 61), so the dimension of row space and dimension
of column spaces match. Theorem 105 asserts that the two vector spaces are isomorphic. Really, an
arbitrary, rectangular matrix A acts as the invertible mapping from the row space to column space of
matrix A. This motivates us to find a pseudo-inverse A+ s.t. for y ∈ rowspace(A), y = A+ Ay.

Exercise 324 (Pseudo-Inverse). When A is a square m × n matrix with rank(A) = r and r = m = n, there exists a two-sided inverse AA⁻¹ = A⁻¹A = 1 (full rank). Suppose the matrix is nonsquare. When the matrix A has full column rank, s.t. r = n, then there is a left inverse, and nullspace(A) = {0}, with zero or one (unique) solution to Ax = b (Theorem 52). In particular, rank(A′A) = rank(A), so A′A is full rank, and (A′A)⁻¹A′ · A = 1 shows us that a left inverse A_l⁻¹ = (A′A)⁻¹A′ exists (others exist). On the other hand, when the matrix A has full row rank, s.t. r = m < n, then nullspace(A′) = {0}, and the existence of n − m free variables gives us an infinite number of solutions to the linear system Ax = b. Here A · A′(AA′)⁻¹ = 1 shows us that a right inverse A_r⁻¹ = A′(AA′)⁻¹ exists (others exist). Additionally,

AA_l⁻¹ = A(A′A)⁻¹A′ (683)

is the projection onto the column space of A, and

A_r⁻¹A = A′(AA′)⁻¹A (684)

is the projection onto the row space of A. (see Exercise 51)
is the projection onto the row space of A. (see Exercise 51)


In the most general case, the matrix is neither full row rank nor full column rank, and we want to find a (pseudo, or Moore-Penrose) inverse motivated by our previous discussion. Consider the SVD of A given by A = UΣV′ (Exercise 322), with rank(A) = r, where Σ_{m×n} has σ₁, ⋯, σ_r on its first r diagonal entries and zeros everywhere else:

Σ_{m×n} = diag_{m×n}(σ₁, ⋯, σ_r, 0, ⋯, 0). (685)

The pseudo-inverse of Σ is the n × m matrix with the reciprocals of the nonzero singular values on its first r diagonal entries and zeros everywhere else, written

Σ⁺_{n×m} = diag_{n×m}(1/σ₁, ⋯, 1/σ_r, 0, ⋯, 0), (686)

and see that ΣΣ⁺ is an m × m matrix with ones on the first r diagonals and zero everywhere else, while Σ⁺Σ is an n × n matrix with ones on the first r diagonals and zero everywhere else. The pseudo-inverse of A is written

A⁺ = VΣ⁺U′. (687)

As argued in the discussion leading up to the example, if any vector x were in nullspace(A), then A⁺Ax = 0, and if x were in rowspace(A) (i.e. colspace(A′)), then A⁺Ax = x. This generalizes the two-sided, left and right inverses. If nullspace(A) = {0} and A is square, then A⁺ = A⁻¹.
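A numerical sketch of this construction (assuming NumPy; the rank tolerance choice is ours): build A⁺ = VΣ⁺U′ from the SVD, compare with the built-in np.linalg.pinv, and check that A⁺A acts as the identity on the row space.

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # 5x4, rank <= 3

U, s, Vt = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(float).eps * s.max()
r = int(np.sum(s > tol))                                        # numerical rank

Sigma_plus = np.zeros((A.shape[1], A.shape[0]))                 # n x m
Sigma_plus[:r, :r] = np.diag(1.0 / s[:r])                       # reciprocals of nonzero sigmas

A_plus = Vt.T @ Sigma_plus @ U.T
assert np.allclose(A_plus, np.linalg.pinv(A))

x = A.T @ rng.standard_normal(5)        # x lies in rowspace(A) = colspace(A')
assert np.allclose(A_plus @ A @ x, x)   # A+ A projects onto the row space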

3.3.8 Operators on Complex Vector Spaces


3.3.8.1 Generalized Eigenvectors and Nilpotency

Nilpotent matrices were briefly encountered in Exercise 19, 22. We take a deeper look at these operators.
Recall that powers of a matrix were defined in Theorem 32. The same applies to operators as one would
expect.

Theorem 188. Suppose T ∈ L(V ), then

{0} = null T 0 ⊂ null T 1 ⊂ · · · ⊂ null T k ⊂ null T k+1 . (688)

Proof. v ∈ null T k =⇒ T k+1 v = T T k v = T 0 = 0, so v ∈ null T k+1 .

Theorem 189. Suppose T ∈ L(V ), and m is nonnegative integer. If null T m = null T m+1 , then
null T m = null T m+i for i = 1, 2 · · · .

Proof. Let k ∈ Z+ , we already know null T m+k ⊂ null T m+k+1 . Suppose v ∈ null T m+k+1 , then
T m+1 T k v = T m+k+1 v = 0, then T k v ∈ nullT m+1 = nullT m . Then T m+k v = T m T k v = 0, v ∈ nullT m+k
and null T m+k ⊃ null T m+k+1 .

We know null spaces keep growing, and once they stop, they don’t start (Theorem 188, 189). Do
they stop?

Theorem 190. Suppose T ∈ L(V ), and n = dim V , then null T n = null T n+i for i = 1, 2, · · · .

Proof. We prove null T n = null T n+1 , the other cases are asserted by Theorem 189. Suppose not, then

{0} = null T 0 ( null T 1 ( · · · ( null T n ( null T n+1 . (689)

Since each nullspace is contained in but not equal to the next, then dim null T n+1 ≥ n + 1, this is
contradiction.

Theorem 191. Suppose T ∈ L(V ), n = dim V , then V = null T n ⊕ range T n .

Proof. Suppose v ∈ null T n ∩ range T n , then T n v = 0, ∃u ∈ V s.t v = T n u. Then 0 = T n v = T 2n u, but


since nullspaces do not grow past n = dim V , T n u = 0, Then v = T n u = 0 and so null T n , range T n is
direct sum. To show the direct sum equals V , then

dim(null T n ⊕ range T n ) = dim null T n + dim range T n = dim V (690)

since T n ∈ L(V ) and using Exercise 113 and Rank-Nullity Theorem 99.

For T ∈ L(V), we want to describe T by finding a neat subspace decomposition V = ⊕ᵢᵐ Uᵢ, where each Uᵢ is T-invariant. In the case of normal operators on complex inner product spaces, and self-adjoint operators on real inner product spaces, the spectral theorems 176, 180 assert that there is an orthonormal basis of eigenvectors. See Theorem 147. The simplest subspace decomposition V = ⊕ᵢᵐ Uᵢ, where each dim Uᵢ = 1 and Uᵢ is T-invariant, is possible iff V has a basis of eigenvectors of T, which is when the eigenspace decomposition V = ⊕ᵢᵏ E(λᵢ, T) exists. So although such a description for T is available for normal and self-adjoint operators under complex and real inner product vector space settings respectively, in general no such statement can be made. The study of generalized eigenvectors addresses this problem.

Definition 157 (Generalized Eigenvector). Let T ∈ L(V ), λ be eigenvalue of T , then v ∈ V is generalized


eigenvector of T corresponding to λ if v 6= 0, (T − λ1)j v = 0 for some j ∈ Z+ .

Definition 158 (Generalized Eigenspace). Let T ∈ L(V ), λ ∈ F, then the generalized eigenspace of T
corresponding to λ is written G(λ, T ) and is the set of all generalized eigenvectors associated with λ for
T , along with the zero vector.

For eigenspace (Definition 141), it is trivial to see that E(λ, T ) ⊂ G(λ, T ).

Theorem 192. Let T ∈ L(V ), λ ∈ F, then G(λ, T ) = null (T − λ1)dim V . G(λ, T ) is subspace of V .

Proof. Reason this by writing v ∈ null (T − λ1)dim V =⇒ v ∈ G(λ, T ) by definition of generalized


eigenspace (Definition 158), and v ∈ G(λ, T ) =⇒ ∃j ∈ Z+ s.t v ∈ null (T − λ1)j ⊂ null (T − λ1)dim V
by Theorem 188,190.

We know that eigenvectors corresponding to distinct eigenvalues are linearly independent (Theorem
70). Turns out the same holds for generalized eigenvectors.

Theorem 193. Let T ∈ L(V ), suppose λi , i ∈ [m] are distinct eigenvalues of T , and vi , i ∈ [m] are
corresponding generalized eigenvectors, then vi , i ∈ [m] are linearly independent.

Proof. Axler [3]. Consider Σ_{i}^{m} aᵢvᵢ = 0, and let k be the largest nonnegative integer s.t. (T − λ₁1)ᵏv₁ ≠ 0. Define w = (T − λ₁1)ᵏv₁, then (T − λ₁1)w = (T − λ₁1)^{k+1}v₁ = 0, and hence Tw = λ₁w. It follows that (T − λ1)w = (λ₁ − λ)w for all λ ∈ F, and see that (T − λ1)ⁿw = (λ₁ − λ)ⁿw, where n = dim V. Applying (T − λ₁1)ᵏ(T − λ₂1)ⁿ ⋯ (T − λₘ1)ⁿ to both sides of Σ_{i}^{m} aᵢvᵢ = 0, using Theorem 192, we get

0 = a₁(T − λ₁1)ᵏ(T − λ₂1)ⁿ ⋯ (T − λₘ1)ⁿv₁ (691)
= a₁(T − λ₂1)ⁿ ⋯ (T − λₘ1)ⁿw (692)
= a₁(λ₁ − λ₂)ⁿ ⋯ (λ₁ − λₘ)ⁿw (693)

Then a1 = 0. We can repeat the argument to get aj = 0 for all j.

Definition 159. An operator T ∈ L(V ) is nilpotent if ∃j ∈ Z+ s.t. T j = 0.

Theorem 194. Let N ∈ L(V ) be nilpotent, then N dim V = 0.

Proof. N nilpotent =⇒ G(0, N ) = V , then Theorem 192 asserts null N dim V = V .

Theorem 195. Suppose N is nilpotent on V , then there exists some basis of V s.t. M(N ) is upper
triangular, and furthermore, it has zero diagonals.

Proof. For some basis of null N , extend this to basis of null N 2 , null N 3 and so on until a basis of V is
obtained (since null N dim V = V ). In particular, let v1,· be the basis of null N 1 , v1,· ∪ v2,· be the basis
of null N 2 and so on. Let these bases vectors be enumerated in increasing order of i for vi,· and let the
corresponding basis be vi , i ∈ [n], where n = dim V . We argue this basis fits the criterion. Note that
null N¹ ⊂ null N^{1+i} for any non-negative integer i. When we apply N to any vector in v₁,·, we get zero, so the resulting vector has zero coefficient on every basis vector including and coming after itself. When we apply N to any vector in v₂,·, the result lies in null N¹, so Nv₂,ⱼ must be spanned by the previous basis vectors v₁,·. That is, when we apply N to a vector, in the basis order specified, we get a vector that is a linear combination of only the previous basis vectors. The result follows.

Exercise 325. Suppose T ∈ L(V ), α, β ∈ F, α 6= β, then prove that G(α, T ) ∩ G(β, T ) = {0}.

Proof. Suppose not; then some nonzero v ∈ G(α, T) ∩ G(β, T) is a generalized eigenvector for both α and β, so the list v, v is linearly independent by Theorem 193. But a list with a repeated vector is never linearly independent, so v can only be zero.

Exercise 326. Suppose T ∈ L(V ), m ∈ Z+ , v ∈ V satisfy T m−1 v 6= 0 but T m v = 0, then prove that
v, T v, · · · T m−1 v must be linearly independent.
Proof. Consider the equation Σ_{i=0}^{m−1} aᵢTⁱv = 0. If we multiply this equation by T^{m−1}, we get a₀T^{m−1}v = 0, so a₀ = 0; if we multiply by T^{m−2}, we get a₁T^{m−1}v = 0 (using a₀ = 0), so a₁ = 0; and so on for all aᵢ, i ∈ [0, m − 1].

Exercise 327. Suppose N ∈ L(V ) is nilpotent, then prove that 0 is the only eigenvalue of N .

Proof. Suppose not, then λ 6= 0 is eigenvalue, and for associated eigenvector v, N k v = λk v 6= 0 for all
k ∈ Z+ and no k satisfies N k = 0.

Exercise 328. Suppose S, T ∈ L(V ) and ST is nilpotent, is T S nilpotent?

Proof. [1]. See that

null (TS)^{dim V} = null (TS)^{dim V + 1} = null TS(TS)^{dim V} = null T(ST)^{dim V}S = V (694)
⟹ (TS)^{dim V} = 0. (695)

Exercise 329. Suppose T ∈ L(V ) not nilpotent, and n = dimV , then show V = nullT n−1 ⊕rangeT n−1 .

Proof. T is not nilpotent, so T n 6= 0, and dim null T n < n and null T n−1 = null T n by Theorem 190.
Then rank-nullity asserts that

V = null T n + range T n = null T n−1 + range T n . (696)

It is easy to see that range T n ⊂ range T n−1 , and it follows that

V = null T n−1 + range T n−1 . (697)

We have

dim V = dim(null T^{n−1} + range T^{n−1}) = dim null T^{n−1} + dim range T^{n−1} − dim(null T^{n−1} ∩ range T^{n−1}) (698)

and rank nullity asserts that null T n−1 ∩ range T n−1 = {0} and we have direct sum by check for direct
sum (Theorem 85).

Exercise 330. Suppose N ∈ L(V ), and ∃ basis of V s.t. M(N ) is upper triangular matrix with zero
diagonals. Then prove N nilpotent.

Proof. See proof of Theorem 195. In particular, the basis vi , i ∈ [n] to which M(N ) is upper triangular
satisfies N vj ∈ span(v<j ) and see that N j vj = 0 for all j ∈ [n], and hence N n = 0.

Exercise 331. Suppose V is inner product space, N ∈ L(V ) normal and nilpotent, then prove that N
must be zero operator.

Proof. Let k = dim V . It is easy to see that N k−1 is normal, and that N 2(k−1) is zero since N dim V =
0, 2(k − 1) ≥ dim V . Theorem 173 asserts that

kN (k−1)∗ N (k−1) vk2 = kN (k−1) N (k−1) vk2 = 0, (699)

so N (k−1)∗ N (k−1) = 0. Then

kN (k−1) vk2 = hN (k−1) v|N (k−1) vi = hv|N (k−1)∗ N (k−1) vi = 0, (700)

and N (k−1) = 0. Repeat the argument to see that N (k−2) = 0 and so on until N 1 = 0.

Exercise 332. Suppose V is inner product space, N ∈ L(V ) is nilpotent, then prove ∃ orthonormal basis
of V s.t. M(N ) is upper triangular matrix.

Proof. Upper triangular w.r.t any basis implies the existence of an upper triangular w.r.t to an orthonor-
mal basis by Theorem 158. Then see Theorem 195 and the basis there satisfies.

Exercise 333. Suppose T ∈ L(V ), and show that

V = range T 0 ⊃ range T 1 ⊃ · · · ⊃ range T k ⊃ range T k+1 . (701)

Proof. See T⁰ = 1, and 1 is invertible and hence surjective, so range T⁰ = V. For all v ∈ range T^{k+1}, ∃u s.t. T^{k+1}u = Tᵏ(Tu) = v, so v ∈ range Tᵏ, and the other subset relations must hold.

Exercise 334. Suppose T ∈ L(V ), m nonnegative integer and range T m = range T m+1 , then prove that
range T m = range T m+i for i = 1, 2, · · · .

Proof. Rank nullity asserts that

dim V = dim null T m + dim range T m = dim null T m+1 + dim range T m+1 . (702)

If dim range Tᵐ = dim range T^{m+1}, then dim null Tᵐ = dim null T^{m+1}; since null Tᵐ ⊂ null T^{m+1}, they are equal, and Theorem 189 gives null Tᵐ = null T^{m+i}, hence dim null Tᵐ = dim null T^{m+i}, for all nonnegative integers i. Again rank nullity asserts that

dim range T m = dim range T m+i . (703)

Together with the subset relation in Exercise 333, it follows that range T m+i = range T m for any
0 ≤ i ∈ Z.

See Theorem 188, 190. Here Exercise 334 shows that ranges decrease, and once they stop decreasing, they do not start again.

Exercise 335. Suppose T ∈ L(V ), n = dim V , then prove that range T n = range T n+i for i = 1, 2, · · · .

Proof. Follows from the reasoning in proof of Exercise 334 and using V = null T n ⊕ range T n .

The above Exercises 334, 335 assert that for T ∈ L(V ), 0 ≤ m ∈ Z

null T m = null T m+1 ↔ range T m = range T m+1 . (704)

3.3.8.2 Decomposition of Operator

We know that null T, range T are invariant under T . Here, the result is extended for polynomial of linear
operators.

Theorem 196. Suppose T ∈ L(V ), p ∈ P(F), then null p(T ), range p(T ) are invariant under T .

Proof. Let v ∈ null p(T ), then see that p(T )v = 0, p(T )(T v) = T p(T )v = T (0) = 0, and T v ∈ null p(T ).
Also, for v ∈ range p(T ), ∃u ∈ V s.t. v = p(T )u, so T v = T p(T )u = p(T )(T u) and T v ∈ range p(T ).

Theorem 197. Suppose V is complex vector space, T ∈ L(V ), and let λi , i ∈ [m] be distinct eigenvalues
of T . Then

1. V = ⊕ᵢᵐ G(λᵢ, T),

2. G(λj , T ) is invariant under T ,

3. (T − λj 1)|G(λj ,T ) is nilpotent.

Proof. Axler [3]. Since G(λj , T ) = null(T −λj 1)n for each j ∈ [m], where n = dimV , Theorem 196 (using
p(z) = (z − λj )n ) assert 2. See 3 holds by definition. For 1, we prove by induction, complex vector space
operator has eigenvalue, and the proof is trivial when n = 1. Assume the result holds inductively on V
with dimension less than n. Since V is complex vector space, T has eigenvalue. Theorem 191 asserts we
have V = null(T − λ1 1)n ⊕ range(T − λ1 1)n = G(λ1 , T ) ⊕ range(T − λ1 1)n . Let U = range(T − λ1 1)n ,
and see U is T invariant. dim G(λ1 , T ) 6= 0 =⇒ dim U < n, the inductive hypothesis applies
to T|U. Each eigenvalue of T|U ∈ {λⱼ, j ∈ [2, m]}. By the inductive hypothesis, U = ⊕_{j=2}^{m} G(λⱼ, T|U). For fixed k ∈ [2, m], clearly G(λₖ, T|U) ⊂ G(λₖ, T). OTOH, suppose v ∈ G(λₖ, T); then since we may write v = v₁ + u, v₁ ∈ G(λ₁, T), u ∈ U, the inductive hypothesis reasons that u = Σ_{i=2}^{m} vᵢ, where vⱼ ∈ G(λⱼ, T|U) for j ∈ [2, m]. Then v = Σ_{i=1}^{m} vᵢ. By linear independence of the generalized eigenvectors of distinct eigenvalues, vᵢ = 0 for i ≠ k when v ∈ G(λₖ, T). In particular, v₁ = 0, v = u ∈ U and hence v ∈ G(λₖ, T|U); then G(λₖ, T|U) = G(λₖ, T). Using the induction hypothesis and the equation V = G(λ₁, T) ⊕ (⊕_{j=2}^{m} G(λⱼ, T|U)) = G(λ₁, T) ⊕ (⊕_{j=2}^{m} G(λⱼ, T)), we are done.

Although an operator on complex vector space may not have enough eigenvectors to form basis, it
has enough generalized eigenvectors.

Theorem 198. Suppose V is complex vector space, T ∈ L(V ), then ∃ basis of V of generalized eigen-
vectors of T .

Proof. Choose basis of G(λj , T ) of the generalized eigenspace decomposition found in Theorem 197 and
combine them (Exercise 124).

Definition 160 (Eigenvalue Multiplicity). Suppose T ∈ L(V ), then the multiplicity of an eigenvalue λ
of T is dim G(λ, T ) = dim null (T − λ1)dim V .

Theorem 199. Suppose V is complex vector space, T ∈ L(V ), then the sum of multiplicities of eigen-
values of T equals dim V .

Proof. The proof follows directly from Theorem 197, Theorem 113 and Definition 160.

Often, the term algebraic multiplicity is used to refer to the dimension of the generalized eigenspace, and the term geometric multiplicity to refer to the dimension of the eigenspace.

170
Definition 161 (Block Diagonal Matrix). A block diagonal matrix is a square matrix of the form

diag(A₁, A₂, ⋯, Aₘ), i.e. with A₁, A₂, ⋯, Aₘ along the diagonal and zero blocks everywhere else, (705)

where the Aᵢ's are themselves square matrices.

Theorem 200. Suppose V is complex vector space, T ∈ L(V ), and let λi , i ∈ [m] be distinct eigenvalues
of T with multiplicities di , i ∈ [m]. Then there exists basis of V s.t. M(T ) is block-diagonal matrix,
where each block Aⱼ is a dⱼ × dⱼ upper triangular matrix with λⱼ on the diagonal.

Proof. For all j ∈ [m], (T − λj 1)|G(λj ,T ) is nilpotent, and we may choose basis length dj s.t. M((T −
λj 1)|G(λj ,T ) ) with form as in Theorem 195. Then T|G(λj ,T ) = (T − λj 1)|G(λj ,T ) + λj 1|G(λj ,T ) has the
matrix representation for each block Aj , and combining these bases of G(λj , T ) give the basis of V with
the upper triangular blocks.

Theorem 201. Suppose N ∈ L(V ) is nilpotent, then 1 + N has square root.



Proof. Axler [3]. Consider the Taylor expansion √(1 + x) = 1 + a₁x + a₂x² + ⋯. The nilpotent property asserts that ∃m ∈ Z⁺ s.t. Nᵐ = 0; then using this in the Taylor expansion we get

√(1 + N) = 1 + a₁N + a₂N² + ⋯ + a_{m−1}N^{m−1}. (706)

Since

(1 + a₁N + a₂N² + ⋯ + a_{m−1}N^{m−1})² (707)
= 1 + 2a₁N + (2a₂ + a₁²)N² + (2a₃ + 2a₁a₂)N³ + ⋯ + (2a_{m−1} + φ(a₁, ⋯, a_{m−2}))N^{m−1}, (708)

making 2a₁ = 1, 2a₂ + a₁² = 0 and so on shows us there is some choice of the aⱼ's such that the equation holds.

Theorem 202. Let V be complex vector space and T ∈ L(V ) be invertible, then T has square root.

Proof. Let λᵢ, i ∈ [m] be the distinct eigenvalues of T. ∀j ∈ [m], ∃Nⱼ ∈ L(G(λⱼ, T)) nilpotent s.t. T|G(λⱼ,T) − λⱼ1 = Nⱼ. Since T is invertible, λⱼ ≠ 0, and T|G(λⱼ,T) = λⱼ(1 + Nⱼ/λⱼ). Nⱼ/λⱼ is nilpotent, so 1 + Nⱼ/λⱼ has a square root (Theorem 201). Multiplying a square root of λⱼ by a square root of 1 + Nⱼ/λⱼ obtains a square root Rⱼ of T|G(λⱼ,T). Since V has the generalized eigenspace direct-sum decomposition, v ∈ V has the unique representation v = Σ_{i}^{m} uᵢ where uᵢ ∈ G(λᵢ, T); define R ∈ L(V) by Rv = Σ_{i}^{m} Rᵢuᵢ. Then R is a square root of T.

Exercise 336. Suppose V is complex vector space, N ∈ L(V ) with zero as the only eigenvalue. Then
prove that N nilpotent.

Proof. Theorem 197 asserts that V = G(0, N ) = null(N −0)dim V , so N dim V = 0 and N is nilpotent.

Exercise 337. Prove that for complex vector space V , N ∈ L(V ) is nilpotent iff zero is the only
eigenvalue.

Proof. Forward, backward proof follows from Exercise 327,336.

Exercise 338. Suppose T ∈ L(V ). If S ∈ L(V ) is invertible, then prove that T, S −1 T S share eigenvalues
and multiplicities.

Proof. Exercise 221 asserts that the eigenvalues are shared. Additionally, it is also shown there that v
is eigenvector of T iff S −1 v is eigenvector of S −1 T S. So for any v ∈ G(λj , T ) where G(λj , T ) has basis
v1 , v2 , · · · vn , we have shown that S −1 v1 , S −1 v2 · · · S −1 vn must be basis for G(λj , S −1 T S) by Theorem
142. They must share multiplicity.

Exercise 339. Suppose V is complex vector space with dimV = n. Let T ∈ L(V ), nullT n−2 6= nullT n−1 ,
then prove that T has ≤ 2 distinct eigenvalues.

Proof. Suppose not. Then T has > 2 distinct eigenvalues, so at least two of them are nonzero. Since null T^{n−2} ≠ null T^{n−1}, Theorem 189 (contrapositive) asserts that every inclusion in {0} = null T⁰ ⊂ null T¹ ⊂ ⋯ ⊂ null T^{n−1} is strict, so dim null T^{n−1} ≥ n − 1. Then rank-nullity asserts

dim V = dim null T^{n−1} + dim range T^{n−1}, (709)
dim range T^{n−1} ≤ 1. (710)

But an eigenvector v of a nonzero eigenvalue λ satisfies v = T^{n−1}(v/λ^{n−1}) ∈ range T^{n−1}, and eigenvectors of distinct eigenvalues are linearly independent, so the two nonzero eigenvalues give dim range T^{n−1} ≥ 2, a contradiction.

Exercise 340. Suppose V is complex vector space, and T ∈ L(V ), then prove that V has basis of
eigenvectors of T iff every generalized eigenvector of T is eigenvector.

Proof. If V has a basis of eigenvectors of T, then by Theorem 147, 197, we can write ⊕ᵢᵐ G(λᵢ, T) = V = ⊕ᵢᵐ E(λᵢ, T); since E(λᵢ, T) ⊂ G(λᵢ, T), comparing dimensions gives G(λᵢ, T) = E(λᵢ, T), and the forward direction holds. OTOH, if every generalized eigenvector of T is an eigenvector, G(λᵢ, T) ⊂ E(λᵢ, T). G(λᵢ, T) ⊃ E(λᵢ, T) is trivial. Then G(λᵢ, T) = E(λᵢ, T) and V = ⊕ᵢᵐ E(λᵢ, T), and V must have a basis of eigenvectors by Theorem 147.

Exercise 341. Show that every invertible operator on complex vector space V has cube root.

Proof. Use the same techniques as in Theorem 201 and Theorem 202.
   
Exercise 342. If A, B are block diagonal matrices where A = diag(A₁, ⋯, Aₘ), B = diag(B₁, ⋯, Bₘ), and the size of Aᵢ equals the size of Bᵢ for i ∈ [m], then find the block diagonal matrix of AB.

Proof. See Exercise 11. In particular (let dᵢ be the size of the square matrices Aᵢ, Bᵢ), writing B column-by-column as B = (B₁.₁ B₁.₂ ⋯ B₁.d₁ B₂.₁ ⋯ Bₘ.dₘ),

AB = A(B₁.₁ B₁.₂ ⋯ B₁.d₁ B₂.₁ ⋯ Bₘ.dₘ) = (AB₁.₁ AB₁.₂ ⋯ AB₁.d₁ AB₂.₁ ⋯ ABₘ.dₘ). (711)

Each column Bⱼ.ₖ is zero outside the j-th block of rows, so multiplying it by the block diagonal A only engages the block Aⱼ, and the product column is AⱼBⱼ.ₖ in that block of rows and zero elsewhere. (712)

Collecting the columns block by block,

AB = diag(A₁B₁, ⋯, AₘBₘ). (713)
3.3.8.3 Characteristic and Minimal Polynomial

Characteristic equations and polynomials were introduced in Definition 80. This definition depends on
the determinant and is obtained via the cofactor expansion (Definition 39). But here we have not defined
determinants yet, and we want to give formulations for F = C and for F = R respectively.

Definition 162 (Characteristic Polynomial). Suppose V is a complex vector space, T ∈ L(V), and let λᵢ, i ∈ [m] be the distinct eigenvalues of T (note dim V ≥ 1), with multiplicities (Definition 160) given by dᵢ, i ∈ [m]. Then the characteristic polynomial of T is the polynomial

Π_{i}^{m} (z − λᵢ)^{dᵢ}. (714)

Theorem 203. Suppose V is complex vector space, T ∈ L(V ), then

1. characteristic polynomial of T has degree dim V ,

2. zeros of the characteristic polynomial of T are eigenvalues of T .

Proof. The sum of multiplicities equals dim V (Theorem 199), so part 1 follows. Part 2 follows by definition, since Π_{i}^{m} (z − λᵢ)^{dᵢ} = 0 iff z = λᵢ for some i.

Later it is shown that the determinant based characteristic polynomial and the one defined here
are identical. From a computational perspective, the cofactor expansion aims to find the eigenvalues
through the characteristic polynomial. Definition 162 given here assumes that the distinct eigenvalues
and multiplicities are obtained.

Theorem 204 (Cayley Hamilton Theorem). Suppose V is complex vector space, T ∈ L(V ), and q be
the characteristic polynomial of T . Then q(T ) = 0.

Proof. By definition, the multiplicities dᵢ are the dimensions of the generalized eigenspaces G(λᵢ, T), i ∈ [m]. Theorem 197 asserts (T − λⱼ1)|G(λⱼ,T) is nilpotent. Specifically, since by Definition 160, dⱼ = dim G(λⱼ, T), Theorem 194 applied on G(λⱼ, T) gives ((T − λⱼ1)|G(λⱼ,T))^{dⱼ} = 0. Every vector in V can be written as a sum of vectors in G(λᵢ, T), i ∈ [m] by the generalized eigenspace decomposition. If we show q(T)|G(λⱼ,T) = 0 for every j, the result follows. For fixed j ∈ [m], since q(T) = Π_{i}^{m} (T − λᵢ1)^{dᵢ} and the terms on the RHS commute, move (T − λⱼ1)^{dⱼ} to be the rightmost factor, so that it acts first on vectors in G(λⱼ, T). Since ((T − λⱼ1)|G(λⱼ,T))^{dⱼ} = 0, q(T)|G(λⱼ,T) = 0.
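A quick numerical check of the Cayley-Hamilton statement (a sketch assuming NumPy; np.poly builds the characteristic polynomial coefficients from the eigenvalues of the matrix):

import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))

coeffs = np.poly(A)                     # characteristic polynomial, highest degree first
qA = np.zeros_like(A)
for c in coeffs:                        # evaluate q(A) by Horner's rule on matrices
    qA = qA @ A + c * np.eye(4)

assert np.allclose(qA, np.zeros((4, 4)), atol=1e-8)   # q(A) = 0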

Definition 163 (Monic Polynomial). A monic polynomial is a polynomial whose highest-degree coefficient equals one.

Theorem 205 (Minimal Polynomial). Suppose T ∈ L(V); then ∃! monic polynomial p of smallest degree s.t. p(T) = 0. This p is said to be the minimal polynomial.

Proof. Axler [3]. Let n = dim V; then Tⁱ, i ∈ [0, n²] is not linearly independent (dim L(V) = dim V · dim V = n²). Let m be the smallest positive integer s.t. Tⁱ, i ∈ [0, m] is linearly dependent. Then one of the operators is a linear combination of the others. By the choice of m, Tᵐ ∈ span(Tⁱ, i ∈ [0, m − 1]) and ∃aᵢ ∈ F, i ∈ [0, m − 1] s.t.

(Σ_{i=0}^{m−1} aᵢTⁱ) + Tᵐ = 0. (715)

Define the monic polynomial p ∈ P(F) s.t. p(z) = (Σ_{i=0}^{m−1} aᵢzⁱ) + zᵐ. Then p(T) = 0 by Equation 715. Again, the choice of m implies ∄ non-zero monic polynomial q ∈ P(F) with degree smaller than m that satisfies q(T) = 0. Suppose q is a monic polynomial, q(T) = 0 and deg q = m; then (p − q)(T) = 0 and deg(p − q) < m, and hence (by the minimality of m, after normalizing the leading coefficient) p − q = 0, i.e. q = p.

In particular, the degree of the minimal polynomial of an operator on V is ≤ (dim V)². The Cayley-Hamilton Theorem 204 asserts that if V is a complex vector space, then the minimal polynomial p satisfies deg p ≤ dim V. Given a matrix M(T), the minimal polynomial is obtained by finding the minimum m s.t. the system of linear equations Σ_{i=0}^{m−1} aᵢM(T)ⁱ = −M(T)ᵐ has solutions for the aᵢ's. Then the scalars aᵢ, i ∈ [0, m − 1], 1 are the coefficients of the minimal polynomial.
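A numerical sketch of that procedure (assuming NumPy; the helper name and tolerance are ours): search over increasing m and solve the vectorized linear system in the powers of the matrix.

import numpy as np

def minimal_polynomial_coeffs(M, tol=1e-10):
    # Returns [a_0, ..., a_{m-1}, 1] for the smallest monic polynomial annihilating M.
    n = M.shape[0]
    powers = [np.eye(n)]
    for _ in range(n):                  # degree at most n by Cayley-Hamilton
        powers.append(powers[-1] @ M)
    for m in range(1, n + 1):
        X = np.column_stack([P.ravel() for P in powers[:m]])
        y = -powers[m].ravel()
        a = np.linalg.lstsq(X, y, rcond=None)[0]
        if np.linalg.norm(X @ a - y) < tol * max(1.0, np.linalg.norm(y)):
            return np.append(a, 1.0)
    return None

P = np.diag([1.0, 1.0, 0.0])            # P^2 = P, so the minimal polynomial is z^2 - z
print(minimal_polynomial_coeffs(P))     # approximately [ 0., -1., 1.]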

Theorem 206. Suppose T ∈ L(V ), q ∈ P(F), then q(T ) = 0 iff q is polynomial multiple of minimal
polynomial of T .

Proof. Let p be the minimal polynomial. If q is a polynomial multiple of p, then ∃s ∈ P(F) s.t. q = ps, and q(T) = p(T)s(T) = 0s(T) = 0. OTOH, if q(T) = 0, the division algorithm (see Theorem 132) asserts ∃s, r ∈ P(F) s.t. q = ps + r, where deg r < deg p. Then

0 = q(T) = p(T)s(T) + r(T) = r(T). (716)

Suppose r ≠ 0; then dividing r by its highest-degree coefficient gives a monic polynomial of smaller degree than the minimal polynomial that annihilates T - this is a contradiction. Then r = 0 and we are done.

Theorem 207. Suppose F = C, T ∈ L(V ), then the characteristic polynomial of T is polynomial multiple
of minimal polynomial of T .

Proof. See Cayley-Hamilton Theorem 204. Apply Theorem 206.

Eigenvalues are the zeros of the characteristic polynomial (see the previous determinant-based arguments, Theorem 71). We shall see that they are also the zeros of the minimal polynomial.

Theorem 208. Let T ∈ L(V ), then the zeros of the minimal polynomial of T are precisely the eigenvalues
of T .
Proof. Axler [3]. Let p(z) = Σ_{i=0}^{m−1} aᵢzⁱ + zᵐ be the minimal polynomial, and suppose λ ∈ F is a zero of p. Then p(z) = (z − λ)q(z) by Theorem 133. See q(z) is a monic polynomial, and since p(T) = 0, 0 = (T − λ1)(q(T)v) for all v ∈ V. Since deg q < deg p, ∃v ∈ V s.t. q(T)v ≠ 0. Then λ must be an eigenvalue of T. OTOH, suppose λ ∈ F is an eigenvalue; then ∃v ∈ V, v ≠ 0 s.t. Tʲv = λʲv for j ∈ Z⁺. Then

p(T)v = (Σ_{i=0}^{m−1} aᵢTⁱ + Tᵐ)v = (Σ_{i=0}^{m−1} aᵢλⁱ + λᵐ)v = p(λ)v. (717)

Since v 6= 0, p(λ) = 0.

Exercise 343. Find some operator on C4 with characteristic polynomial (z − 7)2 (z − 8)2 .

Proof. The operator T (z1 , z2 , z3 , z4 ) = (7z1 , 7z2 , 8z3 , 8z4 ) satisfies.

Exercise 344. Suppose V is a complex vector space, and P ∈ L(V) s.t. P² = P, then prove that the characteristic polynomial of P is zᵐ(z − 1)ⁿ, where m = dim null P, n = dim range P.

Proof. [1]. See Exercise 234. In particular, V = null P ⊕ range P. If v ∈ range P, then v = Pu for some u ∈ V, so Pv = P²u = Pu = v, and thus

null P ⊂ G(0, P), range P ⊂ G(1, P). (718)

But V = null P ⊕ range P asserts that null P = G(0, P), range P = G(1, P). The result follows.

Exercise 345. Suppose T ∈ L(V ), then prove that T invertible iff constant term in minimal polynomial
is nonzero.

Proof. If p is minimal polynomial, then see that T invertible iff zero is not eigenvalue (Theorem 139) iff
p(0) 6= 0 (Theorem 208) iff the constant term in the minimal polynomial p is not zero.

Exercise 346. Suppose T ∈ L(V ) is invertible, then prove ∃p ∈ P(F) satisfying T −1 = p(T ).
Proof. [1]. Let Σ_{i=0}^{m−1} aᵢzⁱ + zᵐ be the minimal polynomial. Then

a₀1 = −Tᵐ − ⋯ − a₁T. (719)

Since T is invertible, a₀ ≠ 0 (Exercise 345); multiply the equation by (1/a₀)T⁻¹ and the result follows.

Exercise 347. If V is complex vector space, T ∈ L(V ), then argue that V has basis of eigenvectors of
T iff minimal polynomial of T has no repeated zeros.

Proof. Suppose V = ⊕ᵢᵐ E(λᵢ, T); then see that (T − λ₁1)⋯(T − λₘ1) = 0. Then Theorem 206 asserts that (z − λ₁)⋯(z − λₘ) is a polynomial multiple of the minimal polynomial of T, but since our polynomial has no repeated zeros, neither does the minimal polynomial. OTOH, if T has minimal polynomial p with no repeated zeros, consider V = ⊕ᵢᵐ G(λᵢ, T) as in Theorem 197. Then let v ∈ G(λᵢ, T) for some i ∈ [m]; since λᵢ is a zero of p (Theorem 208) and p has no repeated zeros, we may write p(z) = (z − λᵢ)q(z) by Theorem 133, with q(λᵢ) ≠ 0. We have

0 = p(T)v = q(T)(T − λᵢ1)v. (720)

Now (T − λᵢ1)v ∈ G(λᵢ, T), since G(λᵢ, T) is invariant under any polynomial of T (Theorem 196), and q(T) is injective on G(λᵢ, T): its factors (T − μ1) with μ ≠ λᵢ equal (λᵢ − μ)1 plus a nilpotent there, hence are invertible on G(λᵢ, T). So (T − λᵢ1)v = 0 and v must be an eigenvector of T. The basis of V of generalized eigenvectors of T is then also an eigenvector basis.

Exercise 348. Suppose V is inner product space, T ∈ L(V ) is normal, then prove that minimal poly-
nomial of T has no repeated zeros.

Proof. This holds immediately by considering the Complex Spectral Theorem and Exercise 347.
Exercise 349. Suppose V is an inner product space, T ∈ L(V), and that Σ_{i=0}^{m−1} aᵢzⁱ + zᵐ is the minimal polynomial of T. Then argue that Σ_{i=0}^{m−1} āᵢzⁱ + zᵐ is the minimal polynomial of T*.

Proof. By the given minimal polynomial, Σ_{i=0}^{m−1} aᵢTⁱ + Tᵐ = 0. Taking adjoints of the equation gives Σ_{i=0}^{m−1} āᵢ(T*)ⁱ + (T*)ᵐ = 0, and the minimal polynomial of T* must be Σ_{i=0}^{m−1} āᵢzⁱ + zᵐ. Otherwise, the minimal polynomial p of T* has degree < m, and taking adjoints (conjugating the coefficients) gives a polynomial of the same degree annihilating T, and we have obtained a smaller minimal polynomial for T, a contradiction.
Exercise 350. Let F = C, T ∈ L(V ), then suppose the minimal polynomial of T has degree dim V , then
prove that characteristic and minimal polynomials match.

Proof. The characteristic polynomial has degree dim V . It is polynomial multiple of minimal polynomial
(q = ps) for some characteristic q, minimal p and polynomial s, where q, p is of degree dim V . So s is
constant and the monic property asserts that the coefficient of degree dim V must be the same. Then
they must be the same.

A subspace U of vector space V is said to be proper if it is neither {0} nor V .

Exercise 351. Suppose V is a complex vector space, Vᵢ, i ∈ [m] are proper subspaces of V s.t. V = ⊕ᵢᵐ Vᵢ. If T ∈ L(V) s.t. Vᵢ, i ∈ [m] is T-invariant, prove that the characteristic polynomial of T equals Π_{i}^{m} pᵢ, where pᵢ is the characteristic polynomial of T|Vᵢ.

Proof. Sketch of proof; consider the definition of characteristic polynomials given in Definition 162.
Clearly the relation between an operator and the characteristic polynomial is given by the generalized
eigenspace decomposition as in Theorem 197. Then since V = ⊕Vi , where Vi are invariant subspaces,
and each Vi has generalized eigenspace decomposition by Theorem 197, the characteristic polynomial
of T must be the product of the characteristic polynomials of component restriction operators. This
formulation preserves zeros of T|Vj and the degree of the characteristic polynomial for T equal to dim V .

3.3.8.4 Jordan Form

Theorem 143 asserts that an operator on a complex vector space has an upper triangular matrix w.r.t some basis. Schur's Theorem 159 asserts that this may be w.r.t an orthonormal basis. It turns out we can find a matrix of T that is zero everywhere except possibly on the diagonal and the line directly above it.

Theorem 209. Let N ∈ L(V ) be nilpotent, then ∃vi ∈ V , i ∈ [n] and nonnegative integers mi , i ∈ [n]
s.t.

1. N m1 v1 , · · · N v1 , v1 , · · · N mn vn , · · · , N vn , vn is basis of V ,

2. N m1 +1 v1 = · · · = N mn +1 vn = 0.

Proof. Axler [3]. Prove by induction on n = dim V . When n = 1, the result is trivial: N = 0, and any v_1 ≠ 0
with m_1 = 0 satisfies the claim. Assume inductively that the result holds for all dimensions < n. Since N is
nilpotent, ∃k ∈ Z+, v ∈ V s.t. N^k v ≠ 0, N^{k+1} v = N(N^k v) = 0, so N is not injective, hence not surjective.
range N is a subspace of V with dim range N < dim V . Then by the induction assumption applied to N|_{range N},
∃v_i ∈ range N, i ∈ [n], and nonnegative integers m_i, i ∈ [n] s.t.

N^{m_1} v_1, · · · , N v_1, v_1, · · · , N^{m_n} v_n, · · · , N v_n, v_n    (721)

is basis of range N and N^{m_1+1} v_1 = · · · = N^{m_n+1} v_n = 0. Since v_i ∈ range N, ∃u_i ∈ V s.t. N u_i = v_i,
and then N^{k+1} u_i = N^k v_i for each i and nonnegative integer k. We show

N^{m_1+1} u_1, · · · , N u_1, u_1, · · · , N^{m_n+1} u_n, · · · , N u_n, u_n    (722)

to be linearly independent. If a l.c. of Equation 722 equals zero, then applying N to the l.c. gives some l.c. of
Equation 721; since Equation 721 is linearly independent, all the coefficients of the original l.c. must equal zero
except possibly those of

N^{m_1+1} u_1, · · · , N^{m_n+1} u_n,    (723)

which equal N^{m_1} v_1, · · · , N^{m_n} v_n. But since Equation 721 is linearly independent, these coefficients also
equal zero and Equation 722 is linearly independent. Extend Equation 722 to a basis

N^{m_1+1} u_1, · · · , N u_1, u_1, · · · , N^{m_n+1} u_n, · · · , N u_n, u_n, w_1, · · · , w_p.    (724)

Each N w_j ∈ range N = span(Equation 721). Each vector in Equation 721 is equal to N applied to some
vector in Equation 722, and so ∃x_j ∈ span(Equation 722) s.t. N w_j = N x_j. Let u_{n+j} = w_j − x_j, so that
N u_{n+j} = 0. Furthermore,

N^{m_1+1} u_1, · · · , N u_1, u_1, · · · , N^{m_n+1} u_n, · · · , N u_n, u_n, u_{n+1}, · · · , u_{n+p}    (725)

spans V , since its span contains each x_j and u_{n+j} (and hence each w_j). This spanning list has the right
length, so it must be a basis, and it satisfies the conditions outlined.

Definition 164 (Jordan Basis). Let T ∈ L(V ), then basis of V is Jordan basis if M(T ) is block
diagonal matrix, where each block diagonal Aj is upper triangular matrix of form Aj = (λj ) or has λj
on the diagonals, 1 in the line directly above it and zero everywhere else.
Definition 165 (Jordan Form). Suppose V is complex vector space, if T ∈ L(V ), then ∃ basis of V that
is Jordan basis for T .
Proof. Consider nilpotent N ∈ L(V ) and v_i, i ∈ [n] as in Theorem 209. N sends the first vector in the list
N^{m_j} v_j, · · · , N v_j, v_j to 0 and every vector other than the first to the previous vector. W.r.t. the basis found
in Theorem 209, N is a block diagonal matrix. Each diagonal block is upper triangular with ones directly above
the diagonal and zeros everywhere else (including the diagonal). Then for T ∈ L(V ), consider the generalized
eigenspace decomposition (Theorem 197) given by V = ⊕_{i=1}^{m} G(λ_i, T ). Each
(T − λ_j 1)|_{G(λ_j,T)} is nilpotent, and putting their Jordan bases together, we get a Jordan basis.
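For concrete matrices the Jordan form can be computed symbolically; the following is a minimal sketch assuming
sympy is available, and the matrix below is an arbitrary illustrative choice, not one taken from the text.

import sympy as sp

# Arbitrary 3x3 matrix with a repeated eigenvalue, chosen for illustration.
A = sp.Matrix([[2, 1, 0],
               [0, 2, 0],
               [0, 0, 3]])

# jordan_form returns P, J with A == P * J * P**-1; J is block diagonal, each block
# carrying an eigenvalue on its diagonal and ones directly above it.
P, J = A.jordan_form()
assert A == P * J * P.inv()
print(J)   # a 2x2 block for eigenvalue 2 and a 1x1 block for eigenvalue 3 (order may vary)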

Exercise 352. Suppose N ∈ L(V ) is nilpotent, then prove that the minimal polynomial of N is z m+1 ,
where m is the length of the longest consecutive string of ones that appear on the line directly above the
diagonal in the matrix of N w.r.t any Jordan basis.
Proof. [1]. Since N is nilpotent, zero is the only eigenvalue, and each upper triangular block A_j in the block
diagonal M(N ) of the Jordan form has diag(A_j) = 0. For A_j a square block of size n, it has (n − 1) ones
directly above the diagonal (and zeros everywhere else), hence A_j^{n−1} ≠ 0 = A_j^n. Let n be the size of the
largest block and set m = n − 1, the length of the longest string of ones; then N^m ≠ 0 = N^{m+1}, therefore
the minimal polynomial must be z^{m+1}.
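A quick numerical sanity check of this statement, assuming numpy and scipy are available; the block sizes below
are arbitrary choices.

import numpy as np
from scipy.linalg import block_diag

# Nilpotent Jordan blocks of sizes 3 and 2: the longest run of superdiagonal ones is m = 2.
N = block_diag(np.diag(np.ones(2), k=1), np.diag(np.ones(1), k=1))

m = 2
assert np.any(np.linalg.matrix_power(N, m) != 0)       # N^m != 0
assert np.all(np.linalg.matrix_power(N, m + 1) == 0)   # N^(m+1) == 0, so the minimal polynomial is z^(m+1)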

Exercise 353. Suppose T ∈ L(V ), vi , i ∈ [n] is basis of V , and is Jordan basis for T . Then describe
the matrix w.r.t vn , vn−1 , · · · , v1 .
Proof. See the explanation given for Jordan Forms (Definition 165). All the arguments hold, except that with
the order of the basis reversed, each listed vector gets sent to the next vector in the list rather than the previous
one. The line of ones then appears directly below the diagonal: the matrix is block diagonal with each block
having λ_j on the diagonal and ones directly beneath it.

Exercise 354. Suppose T ∈ L(V ), v_i, i ∈ [n] is basis of V , and is Jordan basis for T . Then describe the
matrix of T^2 w.r.t. the same basis.

Proof. Write each Jordan block as A_j = λ_j 1 + S, where S has ones on the line directly above the diagonal
and zeros elsewhere. Then

A_j^2 = λ_j^2 1 + 2λ_j S + S^2,    (726)

so the matrix of T^2 w.r.t. the same basis is block diagonal with blocks A_j^2: each block is upper triangular
with λ_j^2 on the diagonal, 2λ_j on the line directly above it, and ones on the second superdiagonal.

Exercise 355. Suppose N ∈ L(V ) is nilpotent, v_i, i ∈ [n], m_j, j ∈ [n] as in Theorem 209. Then argue
that N^{m_i} v_i, i ∈ [n] is basis of null N .

Proof. Consider the basis N^{m_1} v_1, · · · , N v_1, v_1, · · · , N^{m_n} v_n, · · · , N v_n, v_n of V . Recall that N applied to
a basis spans range N . Each basis vector gets mapped to the vector before it in its chain, except the vectors
N^{m_i} v_i, i ∈ [n], which get mapped to zero. The images of the remaining basis vectors are therefore the
vectors N^{k} v_i with 1 ≤ k ≤ m_i, which are linearly independent at the outset and hence form a basis of
range N , so dim range N = dim V − n. The vectors N^{m_i} v_i, i ∈ [n] lie in null N , are linearly independent,
and by rank-nullity (Theorem 51) dim null N = n, so they form a basis of null N .

Exercise 356. Suppose V is complex vector space, T ∈ L(V ). Prove that there does not exist any direct sum
decomposition of V into two (proper, non-trivial) subspaces invariant under T iff the minimal polynomial of T
takes the form (z − λ)^{dim V} for some λ ∈ C.

Proof. Adapted from [1]. If no direct sum decomposition of V into two invariant subspaces exists, the block
diagonal of T w.r.t. a Jordan basis has a single block and T has only one eigenvalue λ. A single block of size
dim V has dim V − 1 ones above the diagonal, so (z − λ)^{(dim V −1)+1} = (z − λ)^{dim V} is the minimal
polynomial (see explanation in Exercise 352).
OTOH, we prove the contrapositive. Suppose V has a decomposition into two proper invariant subspaces. If T
has more than one eigenvalue, then the form (z − λ)^{dim V} cannot be the minimal polynomial, since every
eigenvalue is a zero of the minimal polynomial (Theorem 208). If T has exactly one eigenvalue λ, then let
V = G(λ, T ) = U_1 ⊕ U_2 where each U_i is invariant under T and non-trivial. Since (z − λ)^{dim U_i} is the
characteristic polynomial of T|_{U_i} for i = 1, 2, we have (T − λ1)^{dim U_i} = 0 on U_i, so
(T − λ1)^{max{dim U_1, dim U_2}} = 0 and the minimal polynomial has degree < dim V .

3.3.9 Operators on Real Vector Spaces


3.3.9.1 Complexification

Definition 166 (Complexification of V ). Suppose V is real vector space, then the complexification of V ,
written VC , is the product vector space V × V . Elements of VC are ordered pairs (u, v), where u, v ∈ V , and
we often write this as u + iv. Addition and (complex) scalar multiplication on VC are defined as if the
members (u, v) = u + iv were complex numbers (Definition 92).

V may be thought of as a subset of VC by identifying u ∈ V with u + i0 ∈ VC . If V is real vector space, then
the definitions of addition and scalar multiplication in Definition 166 make VC a complex vector space. The
additive identity of VC is 0 + i0.

Theorem 210. Suppose V is real vector space, then

1. If vi , i ∈ [n] is basis of V (as real vector space), then vi , i ∈ [n] is basis of VC (as complex vector
space).

2. dim VC = dim V , where VC is taken as complex vector space and V as real vector space.

Proof. Suppose v_i, i ∈ [n] is basis of real vector space V . Then span(v_i, i ∈ [n]) in the complex vector space
VC contains all of v_j, j ∈ [n] and iv_j, j ∈ [n], so v_i, i ∈ [n] spans the complex vector space VC . To show
that the v_i, i ∈ [n] are linearly independent, for λ_i ∈ C, i ∈ [n], consider ∑_{i=1}^{n} λ_i v_i = 0. Then we get

∑_{i=1}^{n} (Re λ_i) v_i = 0,   ∑_{i=1}^{n} (Im λ_i) v_i = 0,    (727)

and since the v_i's are linearly independent in V , Re λ_i = 0, Im λ_i = 0 for all i ∈ [n]. It follows that
λ_i = 0, i ∈ [n] and v_i, i ∈ [n] are linearly independent in VC .

The parenthetical qualifiers in Theorem 210 are critical in avoiding a misconception; otherwise, we might
mistake the theorem to violate the rule for computing dimensions of products of vector spaces (Theorem 111).
For Theorem 111 to hold, the vector spaces on the LHS and RHS must both be complex or both be real. In the
definition of a vector space, the scalars come from a specified field. If VC is taken to be a real vector space, then
we may only scale by real numbers; the basis (v_i, i ∈ [n]) of V then yields the basis (v_j, j ∈ [n], iv_j, j ∈ [n])
of VC , the dimension of VC taken as a real vector space is 2 dim V , and Theorem 111 holds. Taken as a
complex vector space, we may scale by complex numbers such as i, and the dimension is dim V .

Definition 167 (Complexification of Operator). Suppose V is real vector space, T ∈ L(V ), then the
complexification of T , written TC is the operator TC ∈ L(VC ) and defined

TC (u + iv) = T u + iT v, (728)

for u, v ∈ V .

It can be shown that TC (λ(u + iv)) = λTC (u + iv) for u, v ∈ V, λ ∈ C - TC is linear map.

Theorem 211. Suppose V is real vector space with basis vi , i ∈ [n], and T ∈ L(V ), then M(T ) =
M(TC ).

Every operator on complex vector space has eigenvalue (Theorem 141), hence some one-dimensional
invariant subspace. When working with real vector spaces, we may get no eigenvalue. However, an
invariant subspace of dimension one or two is guaranteed.

Theorem 212. Every operator on a nonzero finite dimensional vector space has invariant subspace of
dimension one or two.

Proof. If the operator is on complex vector space, apply Theorem 141. We can assume V is real vector
space. Suppose T ∈ L(V ), then the complexification TC has eigenvalue a + bi, where a, b ∈ R, and
∃u, v ∈ V , not both zero, such that TC (u + iv) = (a + bi)(u + iv) = (au − bv) + (av + bu)i = T u + iT v.
Then T u = au − bv, T v = av + bu. Let U = span(u, v); then U is a subspace of V of dimension one or two,
and since T u, T v ∈ U , U is invariant under T .

Suppose V is real vector space, T ∈ L(V ), then see that

(TC )n (u + iv) = T n u + iT n v, (729)

for all n ∈ Z+ , u, v ∈ V . Note that (TC )n = (T n )C

Theorem 213. Suppose V is real vector space, T ∈ L(V ), then the minimal polynomial of TC equals
minimal polynomial of T .

Proof. Axler [3]. Suppose p ∈ P(R) is the minimal polynomial of T . Equation 729 asserts that p(TC ) =
(p(T ))C = 0. Now suppose q ∈ P(C) is a monic polynomial satisfying q(TC ) = 0; then ∀u ∈ V, q(TC )(u + i0) = 0.
Let r be the polynomial whose j-th coefficient is the real part of the j-th coefficient of q; then r is a monic
polynomial with r(T ) = 0, so deg q = deg r ≥ deg p. Since p(TC ) = 0 and every monic polynomial annihilating
TC has degree ≥ deg p, p is the minimal polynomial of TC .

By extension of Theorem 213, the minimal polynomial of operator complexification has real coeffi-
cients.

Theorem 214. Suppose V is real vector space, T ∈ L(V ), λ ∈ R, then λ is eigenvalue of TC iff λ is
eigenvalue of T .

Proof. Real eigenvalues of T are real zeros of the minimal polynomial of T , the real eigenvalues of TC
are real zeros of the minimal polynomial of TC , and the two minimal polynomials are equal.

Theorem 215. Suppose V is real vector space, T ∈ L(V ), λ ∈ C. If j is some nonnegative integer,
u, v ∈ V , then

(TC − λ1)^j (u + iv) = 0 ↔ (TC − λ̄1)^j (u − iv) = 0.    (730)

Proof. Prove by induction on j. If j = 0, (TC − λ1)^j = 1 and the result is trivial. Assume the result holds for
j − 1, and suppose (TC − λ1)^j (u + iv) = 0, i.e.

(TC − λ1)^{j−1} (TC − λ1)(u + iv) = 0.    (731)

Then for λ := a + bi,

(TC − λ1)(u + iv) = (T u − au + bv) + i(T v − av − bu),    (732)

(TC − λ̄1)(u − iv) = (T u − au + bv) − i(T v − av − bu).    (733)

The induction assumption applied to Equations 731, 732 asserts that
(TC − λ̄1)^{j−1} ((T u − au + bv) − i(T v − av − bu)) = 0, which by Equation 733 is
(TC − λ̄1)^{j−1} (TC − λ̄1)(u − iv) = 0, so (TC − λ̄1)^j (u − iv) = 0. The other direction follows by letting
λ → λ̄, v → −v.

Theorem 216. Suppose V is real vector space, T ∈ L(V ), λ ∈ C, then λ is eigenvalue of TC iff λ̄ is
eigenvalue of TC .

Proof. Follows from Theorem 215.

It turns out that the multiplicity (Definition 160) of an eigenvalue of an operator complexification
equals the multiplicity of its complex conjugate.

Theorem 217. If V is real vector space, T ∈ L(V ), λ ∈ C is eigenvalue of TC , then the multiplicity of λ
as eigenvalue of TC equals the multiplicity of λ̄ as eigenvalue of TC .

Proof. Suppose u_j + iv_j , j ∈ [m] is basis of the generalized eigenspace G(λ, TC ), where u_j, v_j ∈ V . Then
Theorem 215 asserts this holds iff u_j − iv_j , j ∈ [m] is basis of G(λ̄, TC ). The result follows.

Theorem 218. Every operator on odd dimensional real vector space has an eigenvalue.

Proof. If V is real vector space with odd dimension, since TC has complex eigenvalues in pairs with equal
multiplicity (Theorem 217), and sum of multiplicities equal dimension, then the sum of multiplicities
corresponding to complex eigenvalues is even. Then TC must have at least one real eigenvalue. Every
real eigenvalue of TC is eigenvalue of T (Theorem 214).

The characteristic polynomial of an operator on a finite-dimensional complex vector space was defined in
Definition 162.

Theorem 219. Suppose V is real vector space, T ∈ L(V ), then the coefficients of characteristic polyno-
mial of TC are real.

Proof. Suppose λ is a nonreal eigenvalue of TC with multiplicity m. Then λ̄ is an eigenvalue of TC with
multiplicity m (Theorem 217). See that

(z − λ)^m (z − λ̄)^m = (z^2 − 2(Re λ)z + |λ|^2)^m ,    (734)

and the RHS has only real coefficients. The characteristic polynomial of TC is the product of such terms
together with factors (z − t)^d , where t is a real eigenvalue of TC of multiplicity d.

The characteristic polynomial of operator on complex vector space was defined in Definition 162.

Definition 168 (Characteristic Polynomial). If V is real vector space, T ∈ L(V ), then define the
characteristic polynomial of T to be the characteristic polynomial of TC .

See that the arguments prior lead naturally to the following results:

Theorem 220. Suppose V is real vector space, T ∈ L(V ), then

1. coefficients of characteristic polynomial of T ∈ R,

2. characteristic polynomial of T has degree = dim V ,

3. eigenvalues of T are exactly the real zeros of characteristic polynomial of T .

The Cayley-Hamilton Theorem 204 can now be stated without constraining V to be complex vector
space:

Theorem 221. If T ∈ L(V ), q is characteristic polynomial of T , then q(T ) = 0.

Proof. If V is complex vector space, the result holds by Theorem 204. We may assume V is real vector
space. Then q(TC ) = 0 by the complex case, and since q(TC ) = (q(T ))C , we get q(T ) = 0.

Theorem 222. Suppose T ∈ L(V ), then

1. the degree of the minimal polynomial of T ≤ dim V ,

2. the characteristic polynomial of T is polynomial multiple of the minimal polynomial of T .

Proof. The earlier proofs apply verbatim once the general form of the Cayley-Hamilton Theorem 221 and
Theorem 206 are used.

Result 9. Suppose V is real vector space, T ∈ L(V ), then the following statements are equivalent:

1. All eigenvalues of TC are real,

2. ∃ basis of V s.t. M(T ) is upper triangular,

3. ∃ basis of V consisting of generalized eigenvectors of T .

Exercise 357. Verify that if V is real vector space, T ∈ L(V ), then TC ∈ L(VC ).

Proof. Using the definition of operator complexification (Definition 167), show that

TC ((u1 + iv1 ) + (u2 + iv2 )) = TC (u1 + iv1 ) + TC (u2 + iv2 ) (735)

and that

TC ((a + bi)(u + iv)) = (a + bi)TC (u + iv) (736)

for a, b ∈ R, u_1, u_2, v_1, v_2, u, v ∈ V . The steps are omitted.

Exercise 358. Suppose V is real vector space, vi ∈ V , i ∈ [m]. Then prove that vi , i ∈ [m] is linearly
independent in VC iff vi , i ∈ [m] is linearly independent in V .

Proof. See Theorem 210, which says that basis of V is basis of VC . Since any linearly independent list
may be extended to basis, and basis are linearly independent, then the result holds for arbitrary linearly
independent list.

Exercise 359. Suppose V is real vector space, vi ∈ V , i ∈ [m]. Then prove that vi , i ∈ [m] span VC iff
it spans V .

Proof. Suppose v_i, i ∈ [m] spans VC . For v ∈ V , v + i0 ∈ VC , so

v + i0 = ∑_{i=1}^{m} λ_i v_i = ∑_{i=1}^{m} (Re λ_i) v_i + i ∑_{i=1}^{m} (Im λ_i) v_i ,    (737)

where λ_i ∈ C, i ∈ [m]. Then ∑_{i=1}^{m} (Re λ_i) v_i = v, and v ∈ span(v_i, i ∈ [m]). OTOH, if v_i, i ∈ [m]
spans V , then this list may be reduced to a basis of V , which is a basis of VC (Theorem 210), so the original
list spans VC .

Exercise 360. Suppose V is real vector space, T ∈ L(V ). Then prove that TC invertible iff T invertible.
Proof. TC invertible iff zero is not eigenvalue of TC iff zero is not eigenvalue T iff T is invertible. See
Theorem 139 and Theorem 214.

Exercise 361. Suppose V is real vector space, N ∈ L(V ). Then prove that NC nilpotent iff N nilpotent.
Proof. See powers of operator complexification (Equation 729). If NC nilpotent, let n = dim V and see
that ∀v ∈ V , we have

0 + i0 = NCn (v + i0) = N n v + i0, (738)

so N n v = 0. N is nilpotent. The other direction follows from definitions and powers of operator
complexification.

Exercise 362. Prove that there is no T ∈ L(R^7 ) s.t. T^2 + T + 1 is nilpotent.

Proof. [1]. Suppose not. Since T^2 + T + 1 is nilpotent, zero is its only eigenvalue, so its minimal polynomial
takes the form z^j for some j ∈ Z+ (since the zeros of the minimal polynomial are precisely the eigenvalues),
and hence (T^2 + T + 1)^j = 0. See that

z^2 + z + 1 = (z − (−1 + i√3)/2)(z − (−1 − i√3)/2).    (739)

Then the polynomial p(z) = (z^2 + z + 1)^j has no real roots but p(TC ) = 0. This is a contradiction, since p
must be a polynomial multiple of the minimal polynomial of TC , which must have at least one real root because
R^7 has odd dimension.

Exercise 363. Suppose V is real vector space, T ∈ L(V ). If ∃b, c ∈ R s.t. T 2 + bT + c1 = 0, then show
that T has eigenvalue iff b2 ≥ 4c.
Proof. Consider

TC2 + bTC + c1 = 0. (740)

This polynomial in TC corresponds to z^2 + bz + c, which has either two real roots or two nonreal conjugate
roots since b, c ∈ R. T has an eigenvalue iff TC has a real eigenvalue (Theorem 214), iff z^2 + bz + c has a real
root, iff b^2 ≥ 4c.

Exercise 364. Suppose V is real vector space, T ∈ L(V ) has no eigenvalues, then prove that every
subspace of V invariant under T has even dimension.
Proof. If a subspace U of V is invariant under T , and T has no eigenvalue, then T|U has no eigenvalue, so U
has even dimension, since every operator on an odd-dimensional real vector space has an eigenvalue (Theorem 218).

Exercise 365. Suppose V is real vector space, then prove ∃T ∈ L(V ) s.t. T 2 = −1 iff V has even
dimension.
Proof. Suppose T^2 = −1. If λ were a (real) eigenvalue with eigenvector v, then −v = T^2 v = λ^2 v would force
λ^2 = −1, which is impossible; so T has no eigenvalue, and since every operator on an odd-dimensional real
vector space has an eigenvalue (Theorem 218), dim V must be even. Conversely, suppose V has even dimension,
and let the basis of V be (v_i, i ∈ [n], u_i, i ∈ [n]). Then T ∈ L(V ) defined by T v_j = −u_j , T u_j = v_j
satisfies T^2 = −1 and is well defined by Theorem 95.

3.3.9.2 Operators on Real Inner Product Spaces

The complex spectral theorem 176 describes normal operators on complex inner product spaces; the real spectral
theorem 180 describes self-adjoint operators on real inner product spaces. A description of normal operators on
real inner product spaces is surveyed here.

Theorem 223. Suppose V is two dimensional real inner product space, T ∈ L(V ), then the following
are equivalent statements.

1. T is normal, but not self-adjoint,

2. M(T ) w.r.t. every orthonormal basis of V has the form

[ a  −b ]
[ b   a ],

with b ≠ 0,

3. M(T ) w.r.t. some orthonormal basis of V has the form

[ a  −b ]
[ b   a ],

with b > 0.

Proof. Suppose T is normal but not self-adjoint, let e_1, e_2 be an orthonormal basis, and let

M(T, (e_1, e_2)) =
[ a  c ]
[ b  d ].    (741)

Theorem 173 asserts that ‖T e_1‖^2 = a^2 + b^2 = ‖T^* e_1‖^2 = a^2 + c^2 . Then b^2 = c^2 , so c = ±b. But
if b = c, T would be self-adjoint, so c = −b, and

M(T, (e_1, e_2)) =
[ a  −b ]
[ b   d ].

Since T is normal, compute M(T T^*), M(T^* T ) and compare to see that we need bd = ab. If b = 0, then T
is self-adjoint, so b ≠ 0, hence d = a and part 2 holds.
Suppose 2 holds; then for any orthonormal basis e_1, e_2 of V , M(T ) has b ≠ 0. If b > 0, then part 3 holds.
If b < 0, then

M(T, (e_1, −e_2)) =
[ a   b ]
[ −b  a ],

where −b > 0, so again part 3 holds.
Suppose 3 holds, so M(T ) w.r.t. some orthonormal basis takes the form written in 3 with b > 0. Clearly this
matrix is not equal to its conjugate transpose, so T is not self-adjoint, but by direct computation it commutes
with its adjoint: T T^* = T^* T .

Theorem 224. Suppose V is inner product space, T ∈ L(V ) normal, U is subspace of V invariant under
T . Then

1. U ⊥ is invariant under T ,

2. U is invariant under T ∗ ,

3. (T|U )∗ = (T ∗ )|U ,

4. (T|U ) ∈ L(U ), T|U ⊥ ∈ L(U ⊥ ) are normal.

Proof. Let e_i, i ∈ [m] be an orthonormal basis of U . Note that V = U ⊕ U^⊥, so we may extend to an
orthonormal basis of V as e_i, i ∈ [m], f_j, j ∈ [n], where f_j, j ∈ [n] is an orthonormal basis for U^⊥. Since U
is invariant, T e_j ∈ span(e_i, i ∈ [m]) and M(T, (e_i, i ∈ [m], f_j, j ∈ [n])) is an upper triangular block matrix

[ A  B ]
[ 0  C ]

where A is m × m, 0 is n × m, B is m × n and C is n × n. For all j ∈ [m], ‖T e_j‖^2 is the sum of squares of
the absolute values of the entries in A_{·,j}, so that ∑_{j=1}^{m} ‖T e_j‖^2 is the sum of squares of absolute values
of the entries in A. Furthermore, for all j ∈ [m], ‖T^* e_j‖^2 equals the sum of squares of absolute values of the
entries in A_{j,·}, B_{j,·}, so that ∑_{j=1}^{m} ‖T^* e_j‖^2 is the sum of squares of absolute values of the entries in
A and B. Since T is normal, ‖T e_j‖ = ‖T^* e_j‖, so ∑_{j=1}^{m} ‖T e_j‖^2 = ∑_{j=1}^{m} ‖T^* e_j‖^2, and it follows
that B must be the zero matrix. In particular, M(T, (e_i, i ∈ [m], f_j, j ∈ [n])) is the block diagonal matrix

[ A  0 ]
[ 0  C ].

Then T f_k ∈ span(f_i, i ∈ [n]) for all k ∈ [n]. Since f_i, i ∈ [n] is a basis of U^⊥, v ∈ U^⊥ =⇒ T v ∈ U^⊥
and U^⊥ is invariant under T . Part 1 holds.
With reference to the conjugate transpose of the block diagonal matrix found for M(T ), consider M(T^*). In
particular, T^* e_j ∈ span(e_i, i ∈ [m]), so U is invariant under T^*. Part 2 holds.
Let S = T|_U ∈ L(U ). For all u ∈ U and fixed v ∈ U , see that ⟨Su|v⟩ = ⟨T u|v⟩ = ⟨u|T^* v⟩; since T^* v ∈ U ,
we have S^* v = T^* v, so (T|_U )^* = (T^*)|_U . Part 3 holds.
Since T is normal, it commutes with its adjoint; as (T|_U )^* = (T^*)|_U , T|_U commutes with its adjoint and is
normal. Replace U by U^⊥ to show that T|_{U^⊥} is normal. Part 4 holds.

Theorem 225. Suppose V is real inner product space, T ∈ L(V ), then the following statements are
equivalent:

1. T is normal,

2. ∃ orthonormal basis of V s.t. M(T ) is a block diagonal matrix with each block A_j being a 1 × 1 matrix or
a 2 × 2 matrix of the form

[ a  −b ]
[ b   a ]

with b > 0.
Proof. We start with part 2 - suppose it holds, then clearly T T ∗ = T ∗ T and T is normal.
If T is normal, then we prove part 2 by induction on dim V . The base cases dim V = 1 and dim V = 2 hold by
the real spectral theorem 180 (for self-adjoint operators) and Theorem 223 (for normal, not self-adjoint
operators). Assume the result holds
inductively. Let U be subspace of V , dim U = 1 or dim U = 2 (at least one exists) where U invariant
under T . If dim U = 1, then any unit vector in U is orthonormal basis of U , and T|U ∈ L(U ) has 1 × 1
matrix. If dim U = 2, then T|U ∈ L(U ) is normal but not self-adjoint (see Theorem 224, otherwise T|U
has eigenvector by Theorem 178 and this would fall under the dim U = 1 case). Choose an orthonormal
basis of U as in Theorem 223. Theorem 224 assert that U ⊥ is T invariant, T|U ⊥ ∈ L(U ⊥ ) is normal. By
induction, ∃ orthonormal basis of U ⊥ s.t. M(T|U ⊥ ) has the desired form. Adjoin this basis to basis of
U , and the resulting basis of V is orthonormal with M(T ) as desired.

For θ ∈ R, the (counterclockwise) rotation operator by θ is an isometry, and the matrix of this operator
w.r.t. the standard basis is

[ cos θ  −sin θ ]
[ sin θ   cos θ ].

Clearly, if θ ≠ kπ for every integer k, then this operator has no eigenvalues. It turns out that every isometry
on a real inner product space is composed of pieces that are rotations on two-dimensional subspaces, pieces
that equal the identity and pieces that equal multiplication by −1.
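A small numerical illustration of the rotation block, assuming numpy is available; the angle is an arbitrary choice.

import numpy as np

theta = 0.7   # arbitrary angle, not an integer multiple of pi
S = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.allclose(S.T @ S, np.eye(2))                    # isometry: S'S = 1
assert np.all(np.abs(np.linalg.eigvals(S).imag) > 1e-12)  # no real eigenvalues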

Theorem 226. Suppose V is real inner product space, S ∈ L(V ), then the following statements are
equivalent:

1. S is isometry,

2. ∃ orthonormal basis of V s.t. M(S) is a block diagonal matrix s.t. each block on the diagonal is 1 × 1
with entry ±1 or 2 × 2 of the form

[ cos θ  −sin θ ]
[ sin θ   cos θ ]

with θ ∈ (0, π).

Proof. Suppose S is isometry; then S is normal, and ∃ orthonormal basis of V where M(S) is a block diagonal
matrix and each block is 1 × 1 or 2 × 2 of the form

[ a  −b ]
[ b   a ],   b > 0

(Theorem 225). If λ is the entry of a 1 × 1 block, then ∃e_j s.t. S e_j = λ e_j. Isometry implies |λ| = 1, so
λ = ±1. On the other hand, for a 2 × 2 block with basis vectors e_j, e_{j+1}, we are given S e_j = a e_j + b e_{j+1},
and hence 1 = ‖e_j‖^2 = ‖S e_j‖^2 = a^2 + b^2. Since b > 0, ∃θ ∈ (0, π) s.t. a = cos θ, b = sin θ. Then part 2
holds.
If part 2 holds, then ∃ orthonormal basis of V s.t. M(S) has the form stated, so there is a direct sum
decomposition V = ⊕_{i=1}^{m} U_i, where each U_i is a subspace of V of dimension one or two. Additionally, for
v_1 ∈ U_i, v_2 ∈ U_{j≠i}, ⟨v_1|v_2⟩ = 0, and S|_{U_j} is an isometry U_j → U_j. For v := ∑_{i=1}^{m} u_i ∈ V where
u_i ∈ U_i, apply S and take norms to get

‖Sv‖^2 = ‖∑_{i=1}^{m} S u_i‖^2 = ∑_{i=1}^{m} ‖S u_i‖^2 = ∑_{i=1}^{m} ‖u_i‖^2 = ‖v‖^2,    (742)

where we used the norm preservation of each S|_{U_i} and the fact that the U_i are pairwise orthogonal. So
part 1 holds.

Exercise 366. Suppose S ∈ L(R^3 ) is isometry, then prove ∃ x ≠ 0 in R^3 s.t. S^2 x = x.

Proof. Recall that every isometry on a real inner product space is composed of pieces that either rotate or scale
by ±1; see Theorem 226. Since dim R^3 = 3 is odd, the 2 × 2 rotation blocks cannot fill the diagonal, so M(S)
has at least one 1 × 1 block with entry ±1. For the basis vector x corresponding to that block, Sx = ±x, so
S^2 x = x.

Exercise 367. Prove every isometry on an odd-dimensional real inner product space has 1 or −1 as
eigenvalue.

Proof. See the proof of Exercise 366. The same argument yields a vector x with Sx = ±x, i.e. 1 or −1 is an
eigenvalue of S.

Exercise 368. Suppose V is real inner product space, then show that

hu + iv|x + iyi = hu|xi + hv|yi + (hv|xi − hu|yi)i (743)

for u, v, x, y ∈ V defines complex inner product on VC .

Proof. One should verify (with lengthy expansions) that for u1 , u2 , v1 , v2 ∈ V, a, b ∈ R that (given the
inner product defined above)

h(u1 + iv1 ) + (u2 + iv2 )|x + iyi = hu1 + iv1 |x + iyi + hu2 + iv2 |x + iyi, (744)

and that

h(a + bi)(u + iv)|x + iyi = (a + bi)hu + iv|x + iyi (745)

and that conjugate symmetry holds:

⟨u + iv|x + iy⟩ = conj(⟨x + iy|u + iv⟩).    (746)

Exercise 369. Suppose V is real inner product space, T ∈ L(V ) self-adjoint. Then show that TC is
self-adjoint under the inner product space VC .

Proof. For any choice of orthonormal basis for V , this is basis of VC , and see that

M(TC ) = M(T ) = M(T ∗ ) = M((T ∗ )C ) = M((TC )∗ ). (747)

3.3.10 Trace and Determinants
3.3.10.1 Trace

See matrices of linear maps in Definition 113. Theorem 102 asserts that matrix products are well defined,
that is M(ST ) = M(S)M(T ). In particular, for bases ui , i ∈ [n], vi , i ∈ [n], wi , i ∈ [n], for S, T ∈ L(V )
we have

M(ST, (ui )i∈[n] , (wi )i∈[n] ) = M(S, (vi )i∈[n] , (wi )i∈[n] )M(T, (ui )i∈[n] , (vi )i∈[n] ). (748)

Theorem 227. Suppose ui , i ∈ [n], vi , i ∈ [n] are bases of V , then M(1, (ui )i∈[n] , (vi )i∈[n] ) and M(1, (vi )i∈[n] , (ui )i∈[n] )
are invertible and are inverses of each other.

Proof. See Equation 748. Substituting w_j → u_j , S → 1, T → 1 gives

1 = M(1, (vi )i∈[n] , (ui )i∈[n] )M(1, (ui )i∈[n] , (vi )i∈[n] ). (749)

By symmetry,

1 = M(1, (ui )i∈[n] , (vi )i∈[n] )M(1, (vi )i∈[n] , (ui )i∈[n] ). (750)

Theorem 228 (Change of Basis). Suppose T ∈ L(V ), and let ui , i ∈ [n], vj , j ∈ [n] be bases of V . Let
A = M(1, (ui )i∈[n] , (vj )j∈[n] ), then

M(T, (ui )i∈[n] ) = A−1 M(T, (vj )j∈[n] )A. (751)

Proof. Make wj → uj , S → 1 in Equation 748 to get M(T, u, u) = M(1, v, u)M(T, u, v) so

M(T, (ui )i∈[n] ) = A−1 M(T, (ui )i∈[n] , (vj )j∈[n] ) (752)

by Theorem 227. Make wj → vj , T → 1, S → T in Equation 748 to get M(T, u, v) = M(T, v, v)M(1, u, v)


so

M(T, (ui )i∈[n] , (vj )j∈[n] ) = M(T, (vj )j∈[n] )A. (753)

The result follows.

Matrix trace was defined in Definition 41 as the sum of diagonals. Here we define the trace of an
operator.

Definition 169 (Trace of Operator). Suppose T ∈ L(V ). Trace of operator T is written tr(T ). If F = C,
then tr(T ) is the sum of eigenvalues of T , with each eigenvalue repeated according to its multiplicity.
If F = R, then tr(T ) is the sum of eigenvalues of TC , with each eigenvalue repeated according to its
multiplicity.

Theorem 229. Suppose T ∈ L(V ), dim V = n, then tr(T ) is the negative of the coefficient of z n−1 in
the characteristic polynomial of T .

Proof. If λi , i ∈ [n] are the eigenvalues of T , (or TC if F = R), with eigenvalue repeated to multiplicity,
then by definition (see Definition 162, 168) the characteristic polynomial of T is (z − λ1 ) · · · (z − λn ).
Expanding, we can write

z n − (λ1 + · · · + λn )z n−1 + · · · + (−1)n (λ1 · · · λn ). (754)

Defined as a sum of the diagonal entries in square matrix (Definition 41), it turns out that the trace of
an operator is consistent with this definition; tr(T ) = tr(M(T, (vi )i∈[n] )) for arbitrary basis vi ’s. Recall
Definition 41, there it is shown that tr(AB) = tr(BA).

Theorem 230. For T ∈ L(V ), bases ui , i ∈ [n], vi , i ∈ [n] of V ,

tr(M(T, (ui )i∈[n] )) = tr(M(T, (vi )i∈[n] )). (755)

Proof. Let A = M(1, (ui )i∈[n] , (vi )i∈[n] ), then

tr(M(T, (ui )i∈[n] )) = tr(A−1 M(T, (vi )i∈[n] )A) = tr(M(T, (vi )i∈[n] )AA−1 ) = tr(M(T, (vi )i∈[n] )). (756)

Theorem 231. For T ∈ L(V ), tr(T ) = tr(M(T )).

Proof. Theorem 230 says we can choose any basis. If V is complex vector space, then choose the basis
as in Theorem 200 (we can also just choose an upper triangular, Theorem 143). If V is real vector space,
then apply the complex case to the complexification TC and we are done.

If an operator on a complex vector space has a given matrix, we can find the sum of all its eigenvalues without
knowing any of them: it is the sum of the diagonal entries (see Definition 41). Theorem 231 together with the
additivity of the matrix trace gives tr(S + T ) = tr(S) + tr(T ). It also follows that no S, T ∈ L(V ) exist s.t.
ST − T S = 1, since the LHS has trace zero while tr(1) = dim V ≠ 0.
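A numerical illustration of these facts, assuming numpy is available; the matrix and the change-of-basis matrix
are arbitrary random choices.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))    # matrix of some operator T w.r.t. a basis
P = rng.standard_normal((5, 5))    # change-of-basis matrix (almost surely invertible)

# Sum of eigenvalues (with multiplicity) equals the sum of the diagonal entries,
# and the trace is unchanged by a change of basis A -> P^-1 A P.
eig_sum = np.sum(np.linalg.eigvals(A))
assert np.isclose(eig_sum.real, np.trace(A)) and abs(eig_sum.imag) < 1e-8
assert np.isclose(np.trace(np.linalg.inv(P) @ A @ P), np.trace(A))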

Exercise 370. Suppose T has the same matrix w.r.t every basis of V , then prove T = λ1 for some
λ ∈ F.

Proof. [2]. Suppose T has same matrix w.r.t every basis of V . We show ∀v ∈ V , that v, T v is linearly
dependent. If v, T v linearly independent, then it may extended to basis v, T v, u1 , · · · , un of V . Then
M(T )·,1 w.r.t to this basis is all zeros except for one in the second row. 2v, T v, u1 , · · · , un is also basis of
V , and M(T )·,1 w.r.t to this basis is all zeros except for two in the second row. The matrix is different
w.r.t to the two bases, and this is contradiction. Then v, T v is linearly dependent. Then all v ∈ V is
eigenvector of T . The result follows by Exercise 226.

Exercise 371. Suppose ui , i ∈ [n], vi , i ∈ [n] are two bases of V . Let T ∈ L(V ) be s.t. T vk = uk for
k ∈ [n], then prove that M(T, (vi )i∈[n] ) = M(1, (ui )i∈[n] , (vi )i∈[n] ).
Proof. For k ∈ [n], write u_k = ∑_{i=1}^{n} a_{i,k} v_i. Since T v_k = u_k = ∑_{i=1}^{n} a_{i,k} v_i, the k-th column of
M(T, (v_i)_{i∈[n]}) is (a_{1,k}, · · · , a_{n,k}), which by definition is also the k-th column of
M(1, (u_i)_{i∈[n]}, (v_i)_{i∈[n]}). The result follows.

Exercise 372. Suppose B is square matrix order n with complex entries. Prove ∃ invertible A square
matrix order n s.t. A−1 BA is upper triangular.

Proof. For some basis ei , i ∈ [n], let T ∈ L(V ) be the operator that satisfies M(T, e) = B. Theorem 143
asserts that there is some basis vi , i ∈ [n] s.t. M(T, v) is upper triangular. Let A = M(1, v, e), then see
that

A−1 BA = A−1 M(T, e)A = M(T, v). (757)

Exercise 373. Suppose V is real vector space, T ∈ L(V ), and V has basis of eigenvectors of T . Then
prove that tr(T 2 ) ≥ 0.

Proof. If V has a basis of eigenvectors, then Theorem 147 asserts there is some M(T ) that is a diagonal matrix
with the (real) eigenvalues λ_i on the diagonal. By powers of a diagonal matrix (Exercise 20), M(T^2) is
diagonal with entries λ_i^2 ≥ 0, so tr(T^2) = ∑_i λ_i^2 ≥ 0.

Exercise 374. Suppose V is inner product space and v, w ∈ V . Define T ∈ L(V ) by T u = ⟨u|v⟩w and
find tr(T ).

Proof. [2]. Suppose v ≠ 0, and extend v/‖v‖ to an orthonormal basis of V as v/‖v‖, e_1, · · · , e_n. See that
T e_j = 0, since ⟨e_j|v⟩ = 0. Then the sum of the diagonal entries of M(T ) w.r.t. this basis is

tr(T ) = ⟨T (v/‖v‖) | v/‖v‖⟩ + ∑_{i=1}^{n} ⟨T e_i|e_i⟩ = ⟨⟨v/‖v‖|v⟩w | v/‖v‖⟩ = ⟨w|v⟩.

If v = 0, the relation tr(T ) = ⟨w|v⟩ = 0 still holds.

Exercise 375. Suppose P ∈ L(V ), P 2 = P , then prove that tr(P ) = dim range P .

Proof. Exercise 234 asserts V = null P ⊕ range P . Let the set of vectors nj , j ∈ [n] be basis for null P ,
and the set of vectors ri , i ∈ [m] be basis for range P , and their union is basis of V . Then clearly P ni = 0
and P ri = ri , so the only eigenvalues are zero or one. The matrix of P w.r.t to this basis is diagonals
with exactly n zeros and m ones. Result follows.

Exercise 376. Suppose V is inner product space, T ∈ L(V ), then prove that tr(T^*) equals the complex
conjugate of tr(T ).

Proof. Let e_i, i ∈ [n] be an orthonormal basis for V ; then M(T^*, e) is the conjugate transpose of M(T, e),
and since transposition does not change the diagonal,

tr(T^*) = tr(M(T^*, e)) = tr(conj(M(T, e))') = conj(tr(M(T, e))) = conj(tr(T )).

Exercise 377. Suppose V is inner product space, T ∈ L(V ) is positive operator with tr(T ) = 0, then
prove that T = 0.

Proof. T is positive, hence self-adjoint, so the spectral theorems (Theorem 176, 180) assert the existence of a
diagonal M(T ) w.r.t. some orthonormal basis, with the (nonnegative) eigenvalues on the diagonal. Since
tr(T ) = 0 and all eigenvalues are nonnegative, every eigenvalue is zero, so M(T ) is the zero matrix and T = 0.

Exercise 378. Suppose V is inner product space, P, Q ∈ L(V ) are orthogonal projections. Then prove
that tr(P Q) ≥ 0.

Proof. Recall that orthogonal projections are self-adjoint. Let λ be an eigenvalue of P Q with associated
eigenvector x, so P Qx = λx. Applying P and using P^2 = P gives P Qx = λP x, so λx = λP x, and λ = 0 or
x = P x. If x = P x, it follows that P QP x = λx. Recall that eigenvalues of a self-adjoint operator are all real.
Since Q^2 = Q for an orthogonal projection, P QP = P QQP is a positive operator (see that
⟨P QQP x|x⟩ = ⟨QP x|QP x⟩ ≥ 0), so all its eigenvalues are nonnegative; in particular λ ≥ 0. Since every
eigenvalue of P Q is nonnegative, tr(P Q) ≥ 0. (Adapted from
https://math.stackexchange.com/questions/2699099/if-p-q-are-two-orthogonal-projections-then-all-eigenvalues-of-q-circ-p.)

Exercise 379. Suppose T ∈ L(V ), tr(ST ) = 0 for all S ∈ L(V ), then prove that T = 0.

Proof. [1]. Let dim V = n. Suppose T ≠ 0; then null T ≠ V . Let v_{m+1}, · · · , v_n be a basis of null T , and
extend it to a basis of V as v_1, · · · , v_m, v_{m+1}, · · · , v_n. Recall that T v_i, i ∈ [m] are then linearly
independent vectors in V . Extend T v_i, i ∈ [m] to a basis of V as T v_1, · · · , T v_m, w_{m+1}, · · · , w_n. Then
∃S ∈ L(V ) s.t. S(T v_1) = v_1, S(T v_i) = 0 for i = 2, · · · , m and S w_j = 0 for j = m + 1, · · · , n. We have

ST v_1 = v_1 ,   ST v_i = 0 for i ∈ [2, m],   ST v_j = 0 for j ∈ [m + 1, n].    (759)

The matrix of ST w.r.t. the basis v_1, · · · , v_n therefore has a single 1 on the diagonal and zeros elsewhere, so
tr(ST ) = 1 > 0. This contradicts tr(ST ) = 0 for all S, so T = 0.

Exercise 380. Suppose V is inner product space with orthonormal basis e_i, i ∈ [n], and T ∈ L(V ). Prove
that tr(T^* T ) = ∑_{i=1}^{n} ‖T e_i‖^2 .

Proof. Since the trace is the sum of the diagonal entries of M(T^* T, e), we have

tr(T^* T ) = ∑_{i=1}^{n} ⟨T^* T e_i|e_i⟩ = ∑_{i=1}^{n} ⟨T e_i|T e_i⟩ = ∑_{i=1}^{n} ‖T e_i‖^2 .    (760)

Exercise 381. Suppose V is inner product space, then prove that hS|T i = tr(ST ∗ ) defines an inner
product on L(V ).

Proof. See that hT |T i = tr(T T ∗ ) is the trace of a positive operator and is ≥ 0. In fact it is zero iff T = 0,
since all eigenvalues are zero for the positive operator of zero trace. One should also take the steps to
verify that hR + S|T i = hR|T i + hS|T i, hcS|T i = chS|T i. hS|T i = hT |Si.

Note that in literature, often the default standard inner product defined for matrices A, B is given

hA|Bi = tr(AB ∗ ). (761)

When the inner product space is real, then tr(AB ∗ ) = tr((AB ∗ )∗ ) = tr(BA∗ ) = tr(A∗ B) = hA|Bi.

Exercise 382. Suppose V is complex inner product space, T ∈ L(V ). Let λ_i, i ∈ [n] be the eigenvalues of T
repeated to multiplicity and suppose A = M(T ) w.r.t. some orthonormal basis. Then prove that

∑_{i=1}^{n} |λ_i|^2 ≤ ∑_{k=1}^{n} ∑_{j=1}^{n} |A_{j,k}|^2 .    (762)

Proof. Let e_i, i ∈ [n] be the orthonormal basis giving A = M(T ). For each k ∈ [n], where n = dim V , we have
T e_k = ∑_{j=1}^{n} A_{j,k} e_j, so ‖T e_k‖^2 = ∑_{j=1}^{n} |A_{j,k}|^2. It follows that
∑_{k=1}^{n} ‖T e_k‖^2 = ∑_{k=1}^{n} ∑_{j=1}^{n} |A_{j,k}|^2 = tr(T^* T ) (Exercise 380). We want to prove that
∑_{i=1}^{n} |λ_i|^2 ≤ tr(T^* T ). Since V is a complex inner product space, Schur's Theorem 159 gives an
orthonormal basis f_k, k ∈ [n] w.r.t. which M(T ) is upper triangular with the eigenvalues λ_k on the diagonal.
Then |λ_k|^2 ≤ ‖T f_k‖^2 for each k, and summing gives ∑_{k} |λ_k|^2 ≤ ∑_{k} ‖T f_k‖^2 = tr(T^* T ). The
inequality follows since tr(T^* T ) does not depend on the choice of orthonormal basis.

Exercise 383. Suppose V is inner product space, T ∈ L(V ). Prove that

∀v ∈ V, ‖T^* v‖ ≤ ‖T v‖ =⇒ T is normal.    (763)

Proof. [2]. Suppose ‖T^* v‖ ≤ ‖T v‖ for all v ∈ V . For u ∈ V with ‖u‖ = 1, extend u to an orthonormal basis
(u, e_1, · · · , e_{n−1}) of V . By Exercise 380 (applied to T^* and to T ) and tr(T T^*) = tr(T^* T ),

tr(T T^*) = ‖T^* u‖^2 + ∑_{i=1}^{n−1} ‖T^* e_i‖^2 ≤ ‖T u‖^2 + ∑_{i=1}^{n−1} ‖T e_i‖^2 = tr(T^* T ) = tr(T T^*).    (764)

Since the two ends are equal and each summand on the LHS is ≤ the corresponding summand on the RHS,
every summand must be equal; in particular ‖T^* u‖ = ‖T u‖. As u was an arbitrary unit vector,
‖T^* v‖ = ‖T v‖ for all v ∈ V , which holds iff T is normal (Theorem 173).

3.3.10.2 Determinant

Determinants of matrices were outlined in Definition 39. Here we define the determinant of an operator.

Definition 170 (Determinant of Operator). Let T ∈ L(V ), let the determinant of T be written det(T ).
If F = C, then det(T ) is the product of the eigenvalues of T , with each eigenvalue repeated to multiplicity.
If F = R, then det(T ) is the product of the eigenvalues of TC , with each eigenvalue repeated to multiplicity.

Theorem 232. Let T ∈ L(V ), n = dim V , then det(T ) equals (−1)^n times the constant term of the
characteristic polynomial of T .

Proof. See expansion in Theorem 229.

Theorem 233. Suppose T ∈ L(V ), then the characteristic polynomial of T may be written

z^n − tr(T ) z^{n−1} + · · · + (−1)^n det(T ).    (765)
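These coefficient identities can be checked numerically; a sketch assuming numpy, where np.poly returns the
characteristic polynomial coefficients of a square matrix (highest degree first) and the test matrix is an arbitrary
random choice.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
n = A.shape[0]

coeffs = np.poly(A)    # coefficients of det(z*1 - A), leading coefficient 1
assert np.isclose(coeffs[1], -np.trace(A))                    # coefficient of z^(n-1) is -tr(A)
assert np.isclose(coeffs[-1], (-1) ** n * np.linalg.det(A))   # constant term is (-1)^n det(A)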

Now we can relate invertibility condition (Theorem 72) to determinants as we did before.

Theorem 234. Invertibility iff determinant is nonzero.

Proof. If V is complex vector space, T ∈ L(V ), then T is invertible iff zero is not an eigenvalue of T , which
holds iff the product of the eigenvalues of T is nonzero, i.e. iff det(T ) ≠ 0. If V is real vector space, T ∈ L(V ),
then T is invertible iff zero is not an eigenvalue of T , iff zero is not an eigenvalue of TC (Theorem 214), iff the
product of the eigenvalues of TC is nonzero, i.e. iff det(T ) ≠ 0.

Theorem 235. Suppose T ∈ L(V ), then the characteristic polynomial of T equals det(z 1 − T ).

Proof. Suppose V is complex vector space. For λ, z ∈ C, λ is eigenvalue of T iff z − λ is eigenvalue of


z 1 − T by writing −(T − λ1) = (z 1 − T ) − (z − λ)1. Raise LHS, RHS to dim V power and take nullspace
to show that dim G(λ, T ) = dim G(z − λ, z 1 − T ). Let λi , i ∈ [n] denote eigenvalues of T repeated to
multiplicity. Then for z ∈ C, the eigenvalues of z 1 − T are z − λi repeated to multiplicity. Follows that
det(z 1 − T ) = (z − λ1 ) · · · (z − λn ), the characteristic polynomial. Apply the argument to TC when V is
real vector space.

The determinants obtained from cofactor expansion were outlined in Definition 39. Here a definition that does
not rely on recursive arguments is given.

Definition 171. Denote perm n to be the set of all permutations of the list 1, 2, · · · , n. Each element
of perm n is said to be a permutation.

Definition 172. The sign of a permutation (m1 , · · · , mn ) is one if the number of integer pairs (j, k),
1 ≤ j < k ≤ n s.t. j appears after k in the list (m1 , · · · , mn ) is even and negative one otherwise. In
other words, the sign is one if the natural order has been flipped an even number of times and negative
one when flipped an odd number of times.

Theorem 236. Interchanging two entries in a permutation multiplies the sign of the permutation by
−1.

Proof. Axler [3]. If the interchanged entries were in their natural order, they no longer would be, and
vice-versa, so a net change of ±1 occurs in the number of pairs flipped in natural order. For each entry
between the two interchanges entries, if an intermediate entry was originally in their natural order w.r.t

both the changed entries, it is in the natural order w.r.t to neither changed entries. If an intermediate
entry was in the natural order w.r.t only one of the two swapped entries, it is still only in the natural
order w.r.t to just one of the swapped entries. That is, for each intermediate entry, the net change of the
flipped natural order count is 0, ±2. For every other entry, there is no change. The result follows.

Definition 173. Suppose A is an n × n matrix with entries A_{i,j}, then det(A) is defined by

det(A) = ∑_{(m_1,··· ,m_n) ∈ perm n} sign(m_1, · · · , m_n) ∏_{i=1}^{n} A_{m_i,i} .    (766)

Note that the summation term is over n! elements. We know that the triangular matrix has determi-
nant equal to product of diagonals via the cofactor expansion argument (Theorem 16). Here the same
result is proved without cofactors.

Exercise 384. Find the determinant of an upper triangular matrix A.

Proof. The identity permutation (1, 2, · · · , n) has sign 1 and contributes A_{1,1} A_{2,2} · · · A_{n,n} to the
determinant computation in Equation 766. Any other permutation (m_1, · · · , m_n) has some entry m_j > j,
contributing A_{m_j,j} = 0 to the product term (A is upper triangular), so the entire summand for that
permutation vanishes. Hence det(A) = A_{1,1} A_{2,2} · · · A_{n,n}, the product of the diagonal entries.
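Definition 173 can be implemented directly for small matrices; a minimal sketch assuming numpy, with the sign
computed by counting inversions as in Definition 172. The factorial number of permutations makes this
impractical beyond very small n, which is why elimination-based methods are used in practice.

import itertools
import numpy as np

def perm_sign(p):
    # (-1) to the number of pairs appearing out of their natural order (Definition 172).
    inversions = sum(1 for j in range(len(p)) for k in range(j + 1, len(p)) if p[j] > p[k])
    return -1 if inversions % 2 else 1

def det_by_permutations(A):
    n = A.shape[0]
    total = 0.0
    for perm in itertools.permutations(range(n)):
        term = perm_sign(perm)
        for i in range(n):
            term *= A[perm[i], i]    # entry A_{m_i, i} as in Equation 766
        total += term
    return total

A = np.random.default_rng(2).standard_normal((4, 4))
assert np.isclose(det_by_permutations(A), np.linalg.det(A))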

The computation of determinants of matrices, their transformations and products are all discussed in
Section 3.2.2.4. In particular, AB invertible iff A, B invertible, so det(AB) = det(A)det(B) if any of A, B
has zero determinant. Otherwise, both are invertible and may be expressed as product of elementary
matrices (Theorem 72), the determinants of which we may compute rather easily. Then the determinants
of the matrices obtained from elementary row operations (Definition 10) may be determined from the
product of the determinants of elementary matrices, and since the determinant of a matrix is equal to
its transpose (Theorem 17), we can also find the determinants of matrices obtained from elementary
column operations (Definition 15). A permutation based argument using Definition 173 can be used to
prove the same arguments. We leave it to the reader to verify these statements, each using preceding
results in the list as sub-proofs:

1. if A, B square matrix and B obtained from A from swapping two columns, then det(A) = −det(B),

2. if A square matrix has two identical columns then det(A) = 0,


3. if A = [A_{·,1} · · · A_{·,n}] is an n × n matrix and (m_1, · · · , m_n) ∈ perm n, then

det([A_{·,m_1} · · · A_{·,m_n}]) = sign(m_1, · · · , m_n) det(A).    (767)

In addition, for fixed 1 ≤ k ≤ n, the function taking

A_{·,k} → det([A_{·,1} · · · A_{·,k} · · · A_{·,n}])    (768)

(with all columns other than the k-th held fixed) is a linear map, where the linearity follows from seeing that
each term in the sum in Equation 766 contains precisely one entry from A_{·,k}.
We restate the proof that det(AB) = det(A)det(B).

Theorem 237. If A, B square matrix, then det(AB) = det(BA) = det(A)det(B).

Proof. See Result 6. Additionally, for the standard basis e_k, k ∈ [n] (Definition 56), note that A e_k = A_{·,k}
and B e_k = B_{·,k} = ∑_{m=1}^{n} B_{m,k} e_m. Then

det(AB) = det([AB_{·,1} · · · AB_{·,n}])    (769)
        = det([A(∑_{m_1=1}^{n} B_{m_1,1} e_{m_1}) · · · A(∑_{m_n=1}^{n} B_{m_n,n} e_{m_n})])    (770)
        = det([∑_{m_1=1}^{n} B_{m_1,1} A e_{m_1} · · · ∑_{m_n=1}^{n} B_{m_n,n} A e_{m_n}])    (771)
        = ∑_{m_1=1}^{n} · · · ∑_{m_n=1}^{n} B_{m_1,1} · · · B_{m_n,n} det([A e_{m_1} · · · A e_{m_n}])    (772)
        = ∑_{(m_1,··· ,m_n) ∈ perm n} B_{m_1,1} · · · B_{m_n,n} det([A e_{m_1} · · · A e_{m_n}])    (773)
        = ∑_{(m_1,··· ,m_n) ∈ perm n} B_{m_1,1} · · · B_{m_n,n} sign(m_1, · · · , m_n) det(A)    (774)
        = det(A) ∑_{(m_1,··· ,m_n) ∈ perm n} B_{m_1,1} · · · B_{m_n,n} sign(m_1, · · · , m_n)    (775)
        = det(A) det(B).    (776)

We used the linearity of the determinant in each column with the others fixed, the vanishing of determinants of
matrices with repeated columns (to drop the summands where (m_1, · · · , m_n) is not a permutation), and the
determinant of column-permuted matrices in the equality statements above. Swap A → B, B → A to get
det(BA) = det(B)det(A).

Theorem 238. Operator determinant does not depend on basis. Specifically, for T ∈ L(V ), ui , i ∈
[n], vj , j ∈ [n] bases of V , then

det(M(T, (ui )i∈[n] )) = det(M(T, (vi )i∈[n] )). (777)

Proof. Define A = M(1, (u_i)_{i∈[n]}, (v_j)_{j∈[n]}); then by the change of basis formula (Theorem 228) and
Theorem 237,

det(M(T, u)) = det(A^{−1} M(T, v) A) = det(A^{−1}) det(M(T, v)) det(A) = det(M(T, v)) det(A A^{−1}) = det(M(T, v)).

Theorem 239 (Operator Determinant and Matrix Determinant). If T ∈ L(V ), det(T ) = det(M(T )).

Proof. If V is complex vector space, then choose basis of V as in Theorem 200, otherwise V is real vector
space and apply complex case to complexification TC . Result follows from the operator determinant
definition (see Definition 170) and applying Theorem 238.

Theorem 240. Suppose S, T ∈ L(V ), then det(T S) = det(ST ) = det(S)det(T ).

Result 10. A block upper triangular matrix A with block Ai , i ∈ [m] has determinant given by det(A) =
Πm
i det(Ai ). Note that Ai only need be square.

Theorem 241. If V is inner product space, S ∈ L(V ) is isometry, then |det(S)| = 1.

Proof. See Theorem 226. Each block has determinant ±1 by direct computation, and this (trivially upper
triangular) block diagonal matrix has determinant, by Result 10, equal to the product of the block
determinants, hence ±1.

If T is self-adjoint on a real vector space, then the real spectral theorem 180 asserts that there is an associated
diagonal matrix, with the eigenvalues appearing on the diagonal of M(T ) repeated to multiplicity.
If V is an inner product space, then recall the polar decomposition T = S√(T^* T). In particular, T^* T is
positive, so √(T^* T) is positive (and hence self-adjoint), and all its eigenvalues are ≥ 0, so det(√(T^* T)) ≥ 0.
Assume T ∈ L(V ) is invertible, and see that for an isometry S ∈ L(V ), det(S) = ±1. Additionally,
{v ∈ V : Sv = −v} is the eigenspace E(−1, S), the subspace on which S reverses direction; det(S) = 1 if this
eigenspace has even dimension and det(S) = −1 otherwise (see the proof of Theorem 241). For invertible
T = S√(T^* T), det(T ) = det(S) det(√(T^* T)), but since the second determinant term is strictly positive, the
sign of det(T ) depends solely on det(S), which depends on whether the subspace on which S reverses direction
has even or odd dimension, which depends on whether T reverses vectors an even or odd number of times.

Theorem 242. If V is inner product space and T ∈ L(V ), then |det(T )| = det(√(T^* T)).

Proof. Apply the Polar Decomposition (Theorem 185) to get T = S√(T^* T), and see that

|det(T )| = |det(S)| det(√(T^* T)) = det(√(T^* T)).    (778)
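A quick numerical check of Theorem 242, assuming numpy and scipy are available; the matrix is an arbitrary
random choice.

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(3)
T = rng.standard_normal((4, 4))

P = sqrtm(T.T @ T)    # the positive square root of T'T
assert np.isclose(np.linalg.det(P).real, abs(np.linalg.det(T)))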

We want to assign to each subset Ω ⊆ R^n its n-dimensional volume (or area when n = 2).

Definition 174 (Box). A box in R^n is a set of the form

{(y_1, · · · , y_n) ∈ R^n : x_j < y_j < x_j + r_j ∀j ∈ [n]},    (779)

where r_i ∈ R+ are said to be the side lengths of the box and x_i ∈ R.

Definition 175 (Box Volume). For a box (Definition 174), its volume is the product of the side lengths
r_1, · · · , r_n. We write this vol(B).

To define the volume of an arbitrary Ω ⊆ R^n rather than a box, we cover Ω by a union of many small boxes.

Definition 176 (Volume). Suppose Ω ⊆ R^n , then the volume of Ω, written vol(Ω), is

inf (vol(B_1) + vol(B_2) + · · · )    (780)

where the infimum is taken over all sequences B_1, B_2, · · · of boxes in R^n whose union contains Ω.

Definition 177. For function T defined on set Ω, denote T (Ω) to be

T (Ω) = {T x : x ∈ Ω}. (781)

Theorem 243 (Axler [3]). For T ∈ L(Rn ) positive operator and Ω ⊂ Rn , then

vol(T (Ω)) = det(T )vol(Ω). (782)

Proof. [3]. Consider the case where λ_i, i ∈ [n] are positive numbers and T ∈ L(R^n ) is defined by
T (x1 , · · · , xn ) = (λ1 x1 , · · · , λn xn ). This operator stretches the j-th standard basis by λj . If B ∈ Rn
is box with side lengths ri , i ∈ [n], T (B) is box side lengths λi ri , i ∈ [n], and clearly vol(T (B)) =
λ1 · · · λn r1 · · · rn = det(T )vol(B). Since volume of Ω is approximated as the infimum of sum volume of
boxes, the result applies to Ω. If T was arbitrary positive operator, then by real spectral theorem 180,
∃ei , i ∈ [n] orthonormal bases s.t. T ej = λj ej , λj ≥ 0. The operator also stretches the j-th basis vector
in an orthonormal basis by λj , and the same intuition applies.

Theorem 244. Suppose S ∈ L(R^n ) is isometry, Ω ⊆ R^n , then

vol(S(Ω)) = vol(Ω). (783)

Proof. For x, y ∈ Rn , kSx − Syk = kS(x − y)k = k(x − y)k, so isometry does not change the distance
between points, and it does not change the side lengths of the component boxes either.

Theorem 245. Suppose T ∈ L(Rn ), Ω ⊂ Rn , then

vol(T (Ω)) = |det(T )|vol(Ω). (784)



Proof. The polar decomposition (Theorem 185) asserts we can write T = S√(T^* T), and so

vol(T (Ω)) = vol(S√(T^* T)(Ω)) = vol(√(T^* T)(Ω)) = det(√(T^* T)) vol(Ω) = |det(T )| vol(Ω).    (785)

When we do multivariable integration, often a change of variables is employed, and the resulting integral
exhibits a determinant term. If Ω ⊆ R^n , f : Ω → R, then the integral of f over Ω, denoted ∫_Ω f (x) dx, is
defined by breaking Ω into small pieces and assuming that f is constant over each piece. This constant is
multiplied by the volume of the piece and summed as an approximation to the integral.

Definition 178. Suppose Ω is an open subset of R^n , and σ : Ω → R^n . For x ∈ Ω, σ is differentiable at x
if ∃T ∈ L(R^n ) s.t.

lim_{y→0} ‖σ(x + y) − σ(x) − T y‖ / ‖y‖ = 0.    (786)

If σ is differentiable at x, then the unique operator T ∈ L(R^n ) satisfying the above equation is said to be the
derivative of σ at x, and we write it σ'(x).

For fixed x and small ‖y‖, we have

σ(x + y) ≈ σ(x) + σ'(x)y.    (787)

If Ω ⊆ R^n is an open subset and σ : Ω → R^n , we write σ(x) = (σ_1(x), · · · , σ_n(x)), where σ_j : Ω → R.
Denote the partial derivative of σ_j w.r.t. the k-th coordinate as D_k σ_j ; evaluating said partial derivative at
x ∈ Ω gives D_k σ_j(x). If σ is differentiable at x, then M(σ'(x)) w.r.t. the standard basis of R^n is the n × n
matrix with entries M(σ'(x))_{ij} = D_j σ_i(x).

Theorem 246 (Integral Change of Variables). Suppose Ω ⊆ R^n is an open subset and σ : Ω → R^n is
differentiable at every point x ∈ Ω. If f : σ(Ω) → R, then

∫_{σ(Ω)} f (y) dy = ∫_{Ω} f (σ(x)) |det(σ'(x))| dx.    (788)

Proof. Let x ∈ Ω and let Γ be a small subset of Ω containing x, s.t. f is approximately constant, equal to
f (σ(x)), on σ(Γ). By Equation 787, σ(Γ) is approximately the set σ(x) + σ'(x)(Γ). Since adding a fixed vector
(here σ(x)) to each element of a set produces a new set with the same volume, and by Theorem 245,

vol(σ(Γ)) ≈ vol(σ(x) + σ'(x)(Γ)) = vol(σ'(x)(Γ)) = |det(σ'(x))| vol(Γ).    (789)

Hence

vol(σ(Γ)) ≈ |det(σ'(x))| vol(Γ).    (790)

Define the change of variable y = σ(x); multiplying the LHS by f (y) and the RHS by f (σ(x)) (which agree
approximately on the piece) gives

f (y) vol(σ(Γ)) ≈ f (σ(x)) |det(σ'(x))| vol(Γ).    (791)

Breaking Ω into many small pieces and adding the corresponding approximations gives the desired result.

Exercise 385. Define σ : R^2 → R^2 by σ(r, θ) = (r cos θ, r sin θ). Then the partial derivative matrix is

M(σ'(r, θ)) =
[ cos θ  −r sin θ ]
[ sin θ   r cos θ ],

whose determinant is r, hence we get

∫_{−1}^{1} ∫_{−√(1−x^2)}^{√(1−x^2)} f (x, y) dy dx = ∫_{0}^{2π} ∫_{0}^{1} f (r cos θ, r sin θ) r dr dθ    (792)

in the change of variable integral of f over the unit disk in R^2 .
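A numerical check of Equation 792 for one concrete integrand, assuming scipy is available; the choice
f(x, y) = x^2 + y is arbitrary.

import numpy as np
from scipy.integrate import dblquad

f = lambda x, y: x**2 + y    # arbitrary integrand

# Cartesian integral over the unit disk: x in [-1, 1], y in [-sqrt(1-x^2), sqrt(1-x^2)].
cart, _ = dblquad(lambda y, x: f(x, y), -1, 1,
                  lambda x: -np.sqrt(1 - x**2), lambda x: np.sqrt(1 - x**2))

# Polar integral with the Jacobian factor r: theta in [0, 2*pi], r in [0, 1].
polar, _ = dblquad(lambda r, theta: f(r * np.cos(theta), r * np.sin(theta)) * r,
                   0, 2 * np.pi, lambda theta: 0.0, lambda theta: 1.0)

assert np.isclose(cart, polar)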

Exercise 386. Suppose V is real vector space, T ∈ L(V ) has no eigenvalues. Then prove that det(T ) > 0.

Proof. T has no eigenvalues, so TC has no real eigenvalues, and its (nonreal) eigenvalues come in conjugate
pairs λ, λ̄ with matching multiplicity (Theorem 217). Grouping the eigenvalues into such pairs,
det(T ) = det(TC ) = ∏_{pairs} (λ λ̄)^{d} = ∏_{pairs} |λ|^{2d} for the corresponding multiplicities d, and this is
positive since each λ ≠ 0.

Exercise 387. Suppose V is real vector space, even dimension. T ∈ L(V ) and det(T ) < 0, then prove
that T has at least two distinct eigenvalues.

Proof. The contrapositive of Exercise 386 asserts that T has some (real) eigenvalue; call it α. Suppose α were
the only eigenvalue of T . Since V has even dimension and the nonreal eigenvalues of TC come in conjugate
pairs with matched multiplicity, the multiplicity of α must also be even. But then
det(T ) = det(TC ) = α^{even} · (product of |λ|^2 terms) ≥ 0, contradicting det(T ) < 0. There must be at least
two distinct eigenvalues.

Result 11. If V is inner product space, T ∈ L(V ), then det(T^*) = conj(det(T )).

3.4 More Concepts in Linear Algebra


3.4.1 Rayleigh Quotients, Matrix Norms and Characterization of Singular
Values
The spectral theorems 176, 180 assert that every symmetric matrix has an orthonormal basis of eigenvectors,
so that we may write the matrix M in terms of its spectral decomposition (see Definition 83) as

M = ∑_{i=1}^{n} λ_i v_i v_i' = V Λ V',    (793)

where V contains the orthonormal columns of eigenvectors v_i and diag(Λ) = (λ_i)_{i∈[n]}. The utility of this is
that we can express seemingly complicated functions of symmetric matrices quickly in terms of the eigenvector
decomposition. For instance, the spectral decomposition given by Equation 793 gives powers of M as
M^k = ∑_{i=1}^{n} λ_i^k v_i v_i'.
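A numerical illustration of the spectral decomposition and its use for matrix powers, assuming numpy is
available; the symmetric matrix is an arbitrary random choice.

import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
M = (B + B.T) / 2                # arbitrary symmetric matrix

lam, V = np.linalg.eigh(M)       # M = V diag(lam) V' with orthonormal columns in V
k = 3
M_k = V @ np.diag(lam**k) @ V.T  # M^k via the spectral decomposition
assert np.allclose(M_k, np.linalg.matrix_power(M, k))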

Theorem 247 (Rayleigh Quotient). Let A be a symmetric matrix; the Rayleigh quotient expresses the
eigenvalues as solutions to optimization problems. In particular, let λ_1 ≥ λ_2 ≥ · · · ≥ λ_n be the ordered
eigenvalues of A with corresponding orthonormal eigenvectors v_1, · · · , v_n; then

λ_i(A) = max { x'Ax : ⟨x|v_j⟩ = 0 ∀ j < i, ‖x‖ = 1 }.    (794)

In particular, prove that the largest eigenvalue solves

λ_1(A) = max_{‖x‖_2 = 1} x'Ax = max_{x ≠ 0} (x'Ax)/(x'x),    (795)

and that the smallest eigenvalue solves

λ_n(A) = min_{‖x‖_2 = 1} x'Ax.    (796)

Proof. Write A in terms of its eigendecomposition A = ∑_{i=1}^{n} λ_i v_i v_i', and define the quadratic function
f : Ω → R as

f (x) = x'Ax = x' ( ∑_{i=1}^{n} λ_i v_i v_i' ) x = ∑_{i=1}^{n} λ_i x'v_i v_i'x = ∑_{i=1}^{n} λ_i ⟨x|v_i⟩^2 ,    (797)

with Ω = {x : ‖x‖ = 1}. Additionally, we may write x = ∑_{i=1}^{n} ⟨x|v_i⟩ v_i, so ‖x‖^2 = ∑_{i=1}^{n} |⟨x|v_i⟩|^2 = 1
(see Theorem 156). Since f (x) is a weighted sum of the λ_i with nonnegative weights ⟨x|v_i⟩^2 summing to one,
and λ_1 ≥ λ_{j≠1}, Equation 797 is maximized when ⟨x|v_1⟩^2 = 1 and all other terms in the summation are
zero. It follows that x = ±v_1 and f (x) evaluates to the largest eigenvalue λ_1. A similar proof holds for
λ_n(A), and for λ_i(A) by restricting to x orthogonal to v_1, · · · , v_{i−1}.
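A numerical illustration of Equations 795, 796, assuming numpy is available; the symmetric matrix and the
sampled vectors are arbitrary random choices.

import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2                              # arbitrary symmetric matrix

lam = np.linalg.eigvalsh(A)                    # eigenvalues in increasing order
X = rng.standard_normal((6, 10000))            # random directions
rq = np.einsum('ij,ij->j', X, A @ X) / np.einsum('ij,ij->j', X, X)   # Rayleigh quotients x'Ax / x'x

assert np.all(rq <= lam[-1] + 1e-9) and np.all(rq >= lam[0] - 1e-9)
v_max = np.linalg.eigh(A)[1][:, -1]            # eigenvector of the largest eigenvalue
assert np.isclose(v_max @ A @ v_max, lam[-1])  # the maximum is attained at v_max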

Theorem 247 states that the largest and smallest eigenvalues satisfy

λ_max(A) = sup_{x ≠ 0} (x'Ax)/(x'x),   λ_min(A) = inf_{x ≠ 0} (x'Ax)/(x'x).    (798)

In particular, for any x, the inequality

λ_min(A) x'x ≤ x'Ax ≤ λ_max(A) x'x    (799)

is satisfied, with equality attained for some choice of x. The inequality in Equation 799 implies that a
symmetric matrix A is positive definite iff all its eigenvalues are positive, and that A is positive semidefinite iff
all its eigenvalues are nonnegative.
See that the singular values are also characterized naturally as a result. Using the SVD (Theorem 322), the
singular values σ_i are the square roots of the eigenvalues of A'A (equivalently AA'), given by
σ_i = √(λ_i(A'A)), with rank(A) nonzero values. Now recall the defining properties of norms (Definition 150)
and inner products (Definition 143). We may define norms on matrices.

Definition 179 (Frobenius Norm). The Frobenius norm for any matrix A ∈ R^{n×n} is defined to be

‖A‖_F = |tr(AA')|^{1/2} = ( ∑_{i=1}^{n} ∑_{j=1}^{n} |a_{ij}|^2 )^{1/2} .    (800)

See that

tr(AA') = ∑_{i=1}^{n} λ_i(AA') = ∑_{i=1}^{n} σ_i(A)^2 ,    (801)

so ‖A‖_F = ( ∑_{i=1}^{n} σ_i(A)^2 )^{1/2}. In the general case, if A is symmetric, then this is
( ∑_{i=1}^{n} |λ_i|^2 )^{1/2}; if A is further real, this is ( ∑_{i=1}^{n} λ_i^2 )^{1/2}.
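A numerical check of Equations 800, 801, assuming numpy is available; the matrix is an arbitrary random choice.

import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))

fro = np.linalg.norm(A, 'fro')
assert np.isclose(fro, np.sqrt(np.trace(A @ A.T)))                                # |tr(AA')|^(1/2)
assert np.isclose(fro, np.sqrt(np.sum(np.linalg.svd(A, compute_uv=False) ** 2)))  # sqrt of the sum of squared singular values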

Theorem 248 (Norm as Supremum of Inner Product). Let V be an inner product space, then show that for
v ∈ V , we have

‖v‖ = sup{⟨v|w⟩ : w ∈ V, ‖w‖ = 1}.    (802)

Proof. If v = 0, then the equality is trivial, and we may assume v ≠ 0. For w := v/‖v‖, see that ‖w‖ = 1 and

‖v‖^2 = ⟨v|v⟩ = ‖v‖ ⟨v | v/‖v‖⟩ = ‖v‖ ⟨v|w⟩,    (803)

and it follows that

‖v‖ ≤ sup{⟨v|w⟩ : w ∈ V, ‖w‖ = 1}.    (804)

On the other hand, the Cauchy-Schwarz inequality (Theorem 154) asserts that

⟨v|w⟩ ≤ |⟨v|w⟩| ≤ ‖v‖ ‖w‖   ∀v, w ∈ V,    (805)

so for ‖w‖ = 1,

‖v‖ ≥ sup{⟨v|w⟩ : w ∈ V, ‖w‖ = 1}.    (806)

The result follows.

Definition 180 (Spectral/Operator Norm). The operator norm ‖·‖_2 is defined as

‖A‖_2 = sup_{‖x‖=1} ‖Ax‖ = sup_{x ≠ 0} ‖Ax‖/‖x‖.    (807)

This is equal to the maximum singular value, by writing

sup_x ‖Ax‖/‖x‖ = sup_x √(‖Ax‖^2/‖x‖^2) = sup_x √((x'A'Ax)/(x'x)) = √(λ_max(A'A)) = σ_max(A),    (808)

where the penultimate equality follows by the Rayleigh quotient (Theorem 247). This can be further expressed as

σ_max(A) = sup_{v,u ≠ 0} (v'Au)/(‖v‖_2 ‖u‖_2).    (809)

To see this, since σ_max^2 is the maximum eigenvalue of the positive operator A^*A, by Theorem 247 we may
express (proof by Yiorgos S. Smyrlis on
https://math.stackexchange.com/questions/656017/largest-singular-value-of-non-square-matrix)

σ_max^2 = sup_{‖v‖_2=1} ⟨v|A^*Av⟩ = sup_{‖v‖_2=1} ⟨Av|Av⟩ = sup_{‖v‖_2=1} ‖Av‖^2 .    (810)

Then σ_max = sup_{‖v‖_2=1} ‖Av‖, and Theorem 248 asserts that

σ_max = sup_{‖v‖_2=1} ‖Av‖ = sup_{‖v‖_2=‖u‖_2=1} ⟨Av|u⟩.    (811)

On the other hand, the minimum singular value can be characterized as σ_min(A) = σ_r(A) where rank(A) = r,
which is nonzero iff the matrix is full rank. If A is symmetric then ‖A‖_2 = sup_{i∈[n]} |λ_i| = max{λ_1, −λ_n}.
The singular values of a symmetric matrix are the absolute values of its nonzero eigenvalues, while the singular
values of a symmetric, positive definite matrix are the same as its (nonzero) eigenvalues.
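A numerical check that the operator norm equals the largest singular value, assuming numpy is available; the
matrix and the sampled unit vectors are arbitrary random choices.

import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 3))

sigma_max = np.linalg.svd(A, compute_uv=False)[0]
assert np.isclose(np.linalg.norm(A, 2), sigma_max)     # spectral norm = largest singular value

X = rng.standard_normal((3, 20000))
X /= np.linalg.norm(X, axis=0)                         # random unit vectors
assert np.max(np.linalg.norm(A @ X, axis=0)) <= sigma_max + 1e-12   # sup ||Ax|| over the unit sphere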

3.4.2 Schur Complement
Consider the matrix

M =
[ A_{p×p}  B_{p×q} ]
[ C_{q×p}  D_{q×q} ]_{n×n}    (812)

presented as an n × n block matrix with p + q = n, and we are interested in solving the linear system
M (x; y) = (c; d), where (x; y) and (c; d) denote the stacked vectors. The system of linear equations can be
written as

Ax + By = c,    (813)
Cx + Dy = d.    (814)

Definition 181 (Schur Complement of D in M ). Consider the square matrix M written in Equation 812.
Solving the system given by Equation 814 for y and eliminating it (as we would in any other system), and
further assuming D^{−1} exists, we get

y = D^{−1}(d − Cx),    (815)

Ax + B(D^{−1}(d − Cx)) = c,    (816)
Ax + BD^{−1}d − BD^{−1}Cx = c,    (817)
(A − BD^{−1}C)x = c − BD^{−1}d,    (818)

and if (A − BD^{−1}C) is invertible then

x = (A − BD^{−1}C)^{−1}(c − BD^{−1}d),    (819)
y = D^{−1}d − D^{−1}C[(A − BD^{−1}C)^{−1}(c − BD^{−1}d)]    (820)

is the solution to the system. We say that the Schur complement of D in M is given by

S^{(D)} = (A − BD^{−1}C).    (821)

Definition 182 (Schur Complement of A in M ). Consider the square matrix M written in Equation 812.
Solving the system given by Equation 813 for x and eliminating it (as we would in any other system), and
further assuming A^{−1} exists, we get

x = A^{−1}(c − By),    (822)

C(A^{−1}(c − By)) + Dy = d,    (823)
CA^{−1}c − CA^{−1}By + Dy = d,    (824)
(−CA^{−1}B + D)y = d − CA^{−1}c,    (825)
y = (D − CA^{−1}B)^{−1}(d − CA^{−1}c),    (826)
x = A^{−1}(c − B[(D − CA^{−1}B)^{−1}(d − CA^{−1}c)])    (827)

is the solution to the system. We say that the Schur complement of A in M is given by

S^{(A)} = (D − CA^{−1}B).    (828)

Continuing from Equation 820, we get

    x = (S(D))−1c − (S(D))−1BD−1d,    (829)

    y = D−1d − D−1C[(S(D))−1(c − BD−1d)]    (830)
      = D−1d − D−1C(S(D))−1c + D−1C(S(D))−1BD−1d    (831)
      = −D−1C(S(D))−1c + (D−1 + D−1C(S(D))−1BD−1)d,    (832)

which is now a system of equations in c, d, and we may write


" #−1 " #
A B (S (D) )−1 −(S (D) )−1 BD−1
= , (833)
C D −D−1 C(S (D) )−1 D−1 + D−1 C(S (D) )−1 BD−1
1 −BD−1
" #" #
(S (D) )−1 0
= (834)
−D−1 C(S (D) )−1 D−1 0 1
1 1 −BD−1
" #" #" #
0 (S (D) )−1 0
= , (835)
−D−1 C 1 0 D−1 0 1

the inverse of which is easy to compute and verify as (with S (D) expanded)

    M = [ A, B ; C, D ] = [ 1, BD−1 ; 0, 1 ] [ A − BD−1C, 0 ; 0, D ] [ 1, 0 ; D−1C, 1 ].    (836)

The factorization given by Equation 836 assumes the invertibility of D.


If we start from Equation 827 and use S(A) (Definition 182), we arrive in the same fashion at (assuming invertibility of D − CA−1B)

    [ A, B ; C, D ]^{-1} = [ A−1 + A−1B(S(A))−1CA−1, −A−1B(S(A))−1 ; −(S(A))−1CA−1, (S(A))−1 ].    (837)

and

    M = [ A, B ; C, D ] = [ 1, 0 ; CA−1, 1 ] [ A, 0 ; 0, D − CA−1B ] [ 1, A−1B ; 0, 1 ].    (838)
Comparing the (1, 1) entry blocks in Equations 835, 837, we get

(S (D) )−1 = A−1 + A−1 B(S (A) )−1 CA−1 , (839)


(A − BD−1 C)−1 = A−1 + A−1 B(D − CA−1 B)−1 CA−1 , (840)

which assume the invertibility of A, D, S(A), S(D). Then we can get another form using both Schur complements:

    [ A, B ; C, D ]^{-1} = [ (S(D))−1, −A−1B(S(A))−1 ; −(S(A))−1CA−1, (S(A))−1 ].    (841)

Theorem 249 (Matrix Inversion Lemma).

(A + BC)−1 = A−1 − A−1 B(1 + CA−1 B)−1 CA−1 . (842)

Proof. Follows immediately from Equation 840 with B → −B, D → 1.
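A quick numerical sanity check of the block inverse in Equation 833 and the matrix inversion lemma of Theorem 249 is sketched below (numpy assumed; block sizes and the shift added to D are arbitrary choices made only to keep the random blocks comfortably invertible).

import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 4
A = rng.standard_normal((p, p))
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, p))
D = rng.standard_normal((q, q)) + 5 * np.eye(q)     # keep D comfortably invertible

M = np.block([[A, B], [C, D]])
Dinv = np.linalg.inv(D)
S_D_inv = np.linalg.inv(A - B @ Dinv @ C)           # inverse Schur complement of D in M

M_inv = np.block([[S_D_inv, -S_D_inv @ B @ Dinv],
                  [-Dinv @ C @ S_D_inv, Dinv + Dinv @ C @ S_D_inv @ B @ Dinv]])
assert np.allclose(M_inv, np.linalg.inv(M))         # Equation 833

# Theorem 249: (A + BC)^{-1} = A^{-1} - A^{-1}B(1 + CA^{-1}B)^{-1}CA^{-1}
Ainv = np.linalg.inv(A)
woodbury = Ainv - Ainv @ B @ np.linalg.inv(np.eye(q) + C @ Ainv @ B) @ C @ Ainv
assert np.allclose(np.linalg.inv(A + B @ C), woodbury)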

3.4.3 Schur Complement and Positive (Semi)definite Matrices
Let A  0, A < 0 denote matrix A as positive definite, positive semidefinite matrix respectively (see
Definition 153).
With motivation from Gallier [9]. Recall pseudo inverses of arbitrary, rectangular matrices (Definition
324). In particular, for any matrix Mn×n given square, there is SVD (Definition 322) s.t.

M = U ΣV 0 , (843)

where U, V are orthogonal (Definition 3.2.5.3) and Σ is n order square matrix with
diag(Σ) = (σ1 , · · · , σr , (σr+j∈[n−r] = 0)) for rank(M ) = r and σ1 ≥ · · · ≥ σr ≥ σr+j∈[n−r] = 0. The
σi are singular values of M , and are positive square roots of nonzero eigenvalues of M M 0 , M 0 M , and
the columns of V, U are the eigenvectors of M 0 M, M M 0 respectively. The pseudo-inverse is given by
M + = V Σ+ U 0 , where diag(Σ+ ) = (σ1−1 , · · · , σr−1 , (σr+j∈[n−r] = 0)). See that

    M M+ = U ΣV0 V Σ+U0 = U ΣΣ+U0 = U [ 1r, 0 ; 0, 0n−r ] U0,
    M+M = V Σ+U0 U ΣV0 = V Σ+ΣV0 = V [ 1r, 0 ; 0, 0n−r ] V0,    (844)

so it is easy to see that

M M + M = M, M +M M + = M +, (M M + )2 = M M + , (M + M )2 = M + M, (845)

so they are projections (last two equalities). But Equation 844 shows each is a product of two isometries and one diagonal matrix with entries ≤ 1, which implies that kM M+vk ≤ kvk for all v ∈ V , where V is the underlying vector space. Exercise 272 reasons that V = null M M+ ⊕ range M M+ with null M M+ ⊥ range M M+, so the projection is orthogonal. The same holds for M+M .

Theorem 250. M M + is orthogonal projection onto range M , and M + M is orthogonal projection onto
null(M )⊥ .

Proof. Clearly range M M + ⊆ range M . For any y ∈ range M , ∃x s.t. y = M x. Then

M M + y = M M + M x = M x = y, (846)

and range M ⊆ range M M + and we have range M = range M M + and M M + is orthogonal projection
onto range M .
Clearly, null M ⊆ null M + M , and since M M + M = M , it follows that null M + M ⊆ null M M + M =
null M and we have null M = null M + M . Since M + M is self-adjoint/Hermitian, Theorem 167 asserts
that

range M + M = (null M + M )⊥ = (null M )⊥ (847)

and M + M is orthogonal projection onto null(M )⊥ .


" #
+ n 0 z
Theorem 251. range(M ) = range(M M ) consists of all vectors y ∈ R s.t. U y = , z ∈ Rr , where
0
r = rank(M ).

Proof. If y ∈ range M , then ∃x s.t M x = y and
" # " #
Σr 0 z
U 0 y = U 0 M x = U 0 U ΣV 0 x = ΣV 0 x = V 0x = , (848)
0 0n−r 0
" # " # " #
0 z r z 0 z
so U y = for some z ∈ R . OTOH, if U y = , then y = U , so
0 0 0

1
" #
0
MM y +
= U U 0y (849)
0 0n×r
1
" # " #
0 0 z
= U UU (850)
0 0n×r 0
" #
z
= U (851)
0
= y, (852)

so y ∈ range M M + = range M .
" #
+ ⊥ n 0 z
Theorem 252. range M M = (null M ) consists of all vectors y ∈ R s.t. V y = , z ∈ Rr , where
0
r = rank(M ).

Proof. If y ∈ range M + M , then ∃u s.t. M + M u = y, and

    y = M+M u = V [ 1r, 0 ; 0, 0n−r ] V0u = V [ z ; 0 ],    (853)

so V0y = [ z ; 0 ] for some z ∈ Rr. OTOH, if V0y = [ z ; 0 ], then y = V [ z ; 0 ], so

    M+M y = M+M V [ z ; 0 ]    (854)
          = V [ 1r, 0 ; 0, 0n−r ] V0V [ z ; 0 ]    (855)
          = V [ z ; 0 ]    (856)
          = y,    (857)
so y ∈ range M + M = (null M )⊥ .
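The projection statements of Theorems 250-252 can be illustrated numerically. A minimal sketch assuming numpy follows (the rank-2 construction is an arbitrary way to obtain a singular M ).

import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))   # rank 2, 5x5, singular
Mplus = np.linalg.pinv(M)

P = M @ Mplus                           # claimed orthogonal projection onto range(M)
Q = Mplus @ M                           # claimed orthogonal projection onto null(M)^perp
for T in (P, Q):
    assert np.allclose(T @ T, T)        # idempotent
    assert np.allclose(T, T.T)          # self-adjoint, hence orthogonal projection
assert np.allclose(P @ M, M)            # P acts as the identity on range(M)

x = rng.standard_normal(5)
assert np.allclose(M @ (Q @ x), M @ x)  # dropping the null(M) component leaves Mx unchanged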

Recall that if M is symmetric, then M is orthogonally diagonalizable, s.t. M = P DP 0 (Theorem 75).


If M < 0, then the eigenvalues on diag(D) are nonnegative; the nonzero eigenvalues of M are precisely
the singular values of M and there is SVD of M s.t. M = U ΣU 0 .
Recall the block-matrix representation of a square matrix M (Equation 812) - if it is symmetric we may write

    M = [ A, B ; B0, D ],    (858)

where A = A0, D = D0 and C → B0. Equation 836 asserts that we can write

    M = [ A, B ; B0, D ] = [ 1, BD−1 ; 0, 1 ] [ A − BD−1B0, 0 ; 0, D ] [ 1, 0 ; D−1B0, 1 ]    (859)
                         = [ 1, BD−1 ; 0, 1 ] [ A − BD−1B0, 0 ; 0, D ] [ 1, BD−1 ; 0, 1 ]0.    (860)
Result 12. For any symmetric matrix T , invertible N , the matrix T  0 iff N T N 0  0.

Theorem 253. A block diagonal matrix is positive definite iff each diagonal block is positive definite.

Proof. Consider the block diagonal matrix A = diag(A1, A2, · · · , Am), with block Ai being square of order ni. This matrix An×n is positive definite iff z0Az > 0 for all nonzero z ∈ Rn by definition. Clearly, the vector z = (0, · · · , z̃1, · · · , z̃nj, 0, · · · , 0), whose only (possibly) nonzero entries z̃ sit in the positions i ∈ ( Σ_{k=1}^{j−1} nk, nj + Σ_{k=1}^{j−1} nk ] for j ∈ [m], satisfies z0Az = z̃0Aj z̃ and the reasoning follows.
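A tiny numerical illustration of Theorem 253 (numpy assumed; the two blocks below are arbitrary, one constructed positive definite and one symmetric but generically indefinite):

import numpy as np

rng = np.random.default_rng(3)
is_pd = lambda S: np.linalg.eigvalsh(S).min() > 0

A1 = rng.standard_normal((2, 2)); A1 = A1 @ A1.T + 0.1 * np.eye(2)   # positive definite block
A2 = rng.standard_normal((3, 3)); A2 = 0.5 * (A2 + A2.T)             # symmetric, generically indefinite

def block_diag(X, Y):
    Z = np.zeros((X.shape[0] + Y.shape[0], X.shape[1] + Y.shape[1]))
    Z[:X.shape[0], :X.shape[1]] = X
    Z[X.shape[0]:, X.shape[1]:] = Y
    return Z

assert is_pd(block_diag(A1, A1))                               # both blocks PD => PD
assert is_pd(block_diag(A1, A2)) == (is_pd(A1) and is_pd(A2))  # fails exactly when a block fails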

Theorem 254. For a symmetric matrix M presented


" #
A B
M= , (861)
B0 D

if D invertible, then

1. M  0 iff D  0, (A − BD−1 B 0 )  0,

2. If D  0, then M < 0 iff (A − BD−1 B 0 ) < 0.

Proof. See the factorization given by Equation 860. [ 1, BD−1 ; 0, 1 ] is invertible since [ 1, BD−1 ; 0, 1 ][ 1, −BD−1 ; 0, 1 ] = 1, so the result for both cases follows by Result 12 and Theorem 253.
If we use S(A) instead of S(D) (see Definition 182, 181), then the result that follows is an analogue of Theorem 254.

Theorem 255. For a symmetric matrix M presented


" #
A B
M= , (862)
B0 D

if A invertible, then

1. M  0 iff A  0, (D − B 0 A−1 B)  0,

2. If A  0, then M < 0 iff (D − B 0 A−1 B) < 0.

Exercise 388 (Encoding of Nonlinear Quadratic Constraints into Matrix Inequality). Suppose we are
given the quadratic constraint

(Ax + b)0 (Ax + b) ≤ c0 x + d, (863)

then if we arrange these terms in a block matrix presented

    M = [ 1, Ax + b ; (Ax + b)0, c0x + d ],    (864)

we have 1  0, so Theorem 255 asserts that M < 0 iff c0x + d − (Ax + b)0(Ax + b) < 0 iff (Ax + b)0(Ax + b) ≤ c0x + d. The statement is equivalent to the constraint M < 0.
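The equivalence in Exercise 388 can also be checked numerically. A rough sketch assuming numpy (A, b, c, d and the tolerances are arbitrary illustrative choices, and samples falling essentially on the constraint boundary are skipped):

import numpy as np

rng = np.random.default_rng(4)
n, m = 4, 3
A = rng.standard_normal((m, n)); b = rng.standard_normal(m)
c = rng.standard_normal(n); d = 2.0

def lmi_psd(x):
    r = A @ x + b
    M = np.block([[np.eye(m), r[:, None]],
                  [r[None, :], np.array([[c @ x + d]])]])
    return np.linalg.eigvalsh(M).min() >= -1e-9          # M positive semidefinite

for _ in range(200):
    x = rng.standard_normal(n)
    slack = c @ x + d - (A @ x + b) @ (A @ x + b)        # quadratic constraint slack
    if abs(slack) < 1e-6:
        continue                                         # skip numerically borderline samples
    assert lmi_psd(x) == (slack > 0)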

When A or D is singular, then the Schur complements (Definition 181, 182) involve the pseudo-inverse (Definition 324) as we shall see. Generally, the quadratic optimization problem min. f (x) = (1/2) x0Px + x0b, where P = P0, is not convex and has no solution, unless some conditions on P, b are satisfied.

Theorem 256 (Minimum of Quadratic Equation with Invertible Symmetric Matrix). If P is invertible and symmetric, then

    f (x) = (1/2) x0Px + x0b    (865)

has minimum value iff P < 0. If the minimum value exists, it occurs at x∗ = −P−1b. The minimum evaluated at this point is f (x∗) = −(1/2) b0P−1b.

Proof. Gallier [9]. See that we may express


    f (x) = (1/2) x0Px + x0b    (866)
          = (1/2)(x + P−1b)0P (x + P−1b) − (1/2) b0P−1b.    (867)

Suppose P has a negative eigenvalue, s.t. P u = −λu for λ > 0. Then for any α 6= 0, x = αu − P−1b, we get

    f (x) = (1/2)(αu)0P (αu) − (1/2) b0P−1b    (868)
          = −(1/2) α^2 λ kuk^2 − (1/2) b0P−1b.    (869)
This is unbounded from below as α → ∞, and the minimum does not exist. Then P has no negative
eigenvalue, is symmetric and hence is positive (Theorem 181). Clearly the minimum occurs when x +
P −1 b = 0.

Theorem 257 (Minimum of Quadratic Equation with Symmetric Matrix). If P is symmetric matrix, then

    f (x) = (1/2) x0Px + x0b    (870)

has minimum value iff P < 0 and (1 − P P+)b = 0. If the minimum value exists, it occurs at any x∗ ∈ Rn taking form

    x∗ = −P+b + U0 [ 0 ; z ], ∀z ∈ Rn−r,    (871)

where r = rank(P ) and P = U0ΣU is the SVD/orthogonal diagonalization of P . The minimum evaluated at any of the satisfying points is f (x∗) = −(1/2) b0P+b.

Proof. Gallier [9]. When P is invertible, set z → 0 and Theorem 256 holds, so we may assume P is singular. If P is singular, rank(P ) = r < n, and the orthogonal diagonalization of P (Theorem 72) produces P = U0 [ Σr, 0 ; 0, 0 ] U , where U is orthogonal matrix and Σr is the diagonal of nonzero singular values of P P0, P0P (and hence Σr is invertible). We may write
" #
1 0 0 Σr 0
f (x) = x (U U )x + x0 (U 0 U )b (872)
2 0 0
" #
1 0 Σr 0
= (U x) U x + (U x)0 U b. (873)
2 0 0
" # " #
y c
Let U x = , Ub = , with y, c ∈ Rr , z, d ∈ Rn−r to get
z d
" #" # " #
1h 0 i Σ 0 y h i c
0 r 0 0
f (x) = y z + y z (874)
2 0 0 z d
1 0
= y Σr y + y 0 c + z 0 d. (875)
2
If y ="0,#then f (x) = z 0 d and f is unbounded from below unless d = 0. To have minimum, d = 0, so
c
Ub = , and Theorem 251 asserts that b ∈ range P = range P P + , and Theorem 250 asserts that for
0
b ∈ range P , we have P P + b = b, and it follows that (1 − P P + )b = 0. If d =
" 0,
1 0 0
# then f (x) = 2 y Σ"r y#+ y c,
y y
and define g : Rr → R as g(y) = 12 y 0 Σr y + y 0 c. Since by definition U x = , we have x = U 0 , and
z z
since U 0 is orthogonal (invertible), as x ranges over Rn , y ranges over Rr . It follows that f has minimum
iff g has minimum. As argued, Σr is invertible, then Theorem 256 asserts that g has minimum iff Σr < 0,
which is iff P < 0 by Result 12. It follows that if f has minimum, then (1 − P P + )b = 0, P < 0.
On the other"hand,
# suppose (1 − P P + )b = 0, P < 0. Then P P + b = b, b ∈ range P P + , Theorem 251
c
asserts U b = , and as in the forward proof we get
0

1 0
f (x) = g(y) = y Σr y + y 0 c, (876)
2
and Σr < 0 since P < 0 by assumption and applying Result 12. Theorem 256 asserts f, g have minimum.
We try to find the minimum value points satisfied by x∗ and evaluate f (x∗). First, see that P = U0 [ Σr, 0 ; 0, 0 ] U , P+ = U0 [ Σr−1, 0 ; 0, 0 ] U . Furthermore, f (x) = g(y) share minimum; Theorem 256 asserts the minimum of g is evaluated at y∗ = −Σr−1c. Since f (x) does not depend on z, without loss of generality, let z → 0; since d = 0, then U x∗ = [ −Σr−1c ; 0 ]. Together with U b = [ c ; 0 ], it is easy to see now that

    x∗ = U0 [ −Σr−1c ; 0 ] = U0 [ −Σr−1, 0 ; 0, 0 ] [ c ; 0 ] = U0 [ −Σr−1, 0 ; 0, 0 ] U b = −P+b.    (877)

Since P+ is symmetric, P+P P+ = P+, then the value of f evaluated at this minimum point is

    f (x∗) = (1/2)(−P+b)0P (−P+b) + b0(−P+b) = (1/2) b0P+P P+b − b0P+b = −(1/2) b0P+b.    (878)

Finally, since f (x) does not depend on z, any x of the form, using Equation 877,

    x = −P+b + U0 [ 0 ; z ] = U0 [ −Σr−1c ; 0 ] + U0 [ 0 ; z ] = U0 [ −Σr−1c ; z ]    (879)

attains the same minimal value.
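Theorem 257 lends itself to a simple numerical check. A sketch assuming numpy (the rank and dimensions are arbitrary; b is deliberately constructed to lie in range(P ) so that the minimum exists):

import numpy as np

rng = np.random.default_rng(5)
n, r = 5, 3
Q = rng.standard_normal((n, r))
P = Q @ Q.T                                   # symmetric PSD with rank r < n
b = P @ rng.standard_normal(n)                # force b into range(P)

Pplus = np.linalg.pinv(P)
assert np.allclose((np.eye(n) - P @ Pplus) @ b, 0)      # existence condition (1 - PP+)b = 0

f = lambda x: 0.5 * x @ P @ x + x @ b
x_star = -Pplus @ b
assert np.isclose(f(x_star), -0.5 * b @ Pplus @ b)      # minimum value of Theorem 257

for _ in range(100):                                    # no perturbation decreases f
    assert f(x_star + 0.1 * rng.standard_normal(n)) >= f(x_star) - 1e-9
null_dir = (np.eye(n) - Pplus @ P) @ rng.standard_normal(n)
assert np.isclose(f(x_star + null_dir), f(x_star))      # null(P) directions leave f unchanged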

Theorem 258 (Equivalency Conditions for Positive Semidefinite Matrices using Schur Complement). For symmetric matrix M = [ A, B ; B0, D ], the following statements are equivalent:
1.

M < 0, (880)

2.

A < 0, (1 − AA+ )B = 0, D − B 0 A+ B < 0, (881)

3.

D < 0, (1 − DD+ )B 0 = 0, A − BD+ B 0 < 0. (882)

Proof. Theorem 257 asserts that M < 0 iff

    f (x, y) = [ x0 y0 ] [ A, B ; B0, D ] [ x ; y ] = x0Ax + 2x0By + y0Dy    (883)

has minimum w.r.t both x, y. Holding y constant, Theorem 257 again asserts that f (x, y) has minimum
iff A < 0, (1 − AA+ )By = 0, with the minimum value evaluated at

f (x∗ , y) = −y 0 B 0 A+ By + y 0 Dy = y 0 (D − B 0 A+ B)y. (884)

Since the minimum over x must exist for every y, (1 − AA+)By = 0 for all y, so (1 − AA+)B = 0. Applying the theorem once more
states that f (x∗ , y) has minimum iff (D − B 0 A+ B) < 0. Then f (x, y) has minimum over all x, y iff

A < 0, (1 − AA+ )B = 0, D − B 0 A+ B < 0. (885)

Then part 1 iff part 2. We can show part 1 iff part 3 with the same approach, except we hold x constant
first and use the other Schur complement.

Result 13. If M < 0, then verify the factorizations for M using Schur complements hold:

    [ A, B ; B0, D ] = [ 1, BD+ ; 0, 1 ] [ A − BD+B0, 0 ; 0, D ] [ 1, 0 ; D+B0, 1 ],    (886)
    [ A, B ; B0, D ] = [ 1, 0 ; B0A+, 1 ] [ A, 0 ; 0, D − B0A+B ] [ 1, A+B ; 0, 1 ],    (887)
which use A+ AA+ = A+ , D+ DD+ = D+ and do not assume invertibility of any of the blocks.

3.5 Matrix Calculus
Recall the definition of derivatives laid out in Definition 178 with operators. The topic of matrix calculus
is further explored here, more precisely, with motivation from Boyd and Vandenberghe [5].

Definition 183 (Derivatives). Suppose f : Rn → Rm, x ∈ int dom f . Then f is differentiable at x if ∃Df (x) ∈ Rm×n satisfying

    lim_{z∈dom f, z6=x, z→x}  kf (z) − f (x) − Df (x)(z − x)k2 / kz − xk2 = 0.    (888)
The matrix Df (x) is said to be the derivative, or Jacobian of f at x. The function f is differentiable if
dom f is open set and it is differentiable at every point in the domain.

By definition the interior of a set (Definition 194) is open and the Definitions 178, 183 are consistent.
The first order approximation of f centred at x is given by

f (z) = f (x) + Df (x)(z − x), (889)

as in typical Taylor expansions. This first order approximation is affine, and is perfect approximation at
point z = x. The Jacobian is found by taking the partial derivatives
    Df (x)ij = δfi(x) / δxj,  i ∈ [m], j ∈ [n].    (890)
Definition 184 (Gradient). When f : Rn → R, the derivative Df (x) is an n-row vector and its transpose
is said to be the gradient of f (x), and we denote this

∇f (x) = Df (x)0 . (891)


Clearly, ∇f (x) is a column vector and ∇f (x)i = δf (x)/δxi, i ∈ [n].

See that f (x) + Df (x)(z − x) = f (x) + ∇f (x)0 (z − x) is equivalent formulation of the first order
approximation (Equation 889).

Theorem 259 (Chain Rule). Suppose f : Rn → Rm is differentiable at x ∈ int dom f and g : Rm → Rp


is differentiable at f (x) ∈ int dom g. Define the composition h : Rn → Rp by h(z) = g(f (z)), then h is
differentiable at x and the chain rule states that the derivative at this point is

Dh(x) = Dg(f (x))Df (x). (892)

When f : Rn → R, g : R → R, and h = g ◦ f , then

Dh(x)0 = g 0 (f (x))∇f (x). (893)

Exercise 389 (Composition with Affine Function). Suppose f : Rn → Rm , A ∈ Rn×p , b ∈ Rn , g : Rp →


Rm s.t g(x) = f (Ax + b). Let dom g = {x : Ax + b ∈ dom f }, then the derivative of g under chain rule is

Dg(x) = Df (Ax + b)D(Ax + b) = Df (Ax + b)A. (894)

In particular, when f is real-valued function then the formula for the gradient of a composition of arbitrary
function with affine transformations can be written ∇g(x) = A0 ∇f (Ax + b). As example, if f : Rn → R,
and x, v ∈ Rn , then

f˜ : R → R, f˜(t) = f (x + tv) (895)

has derivative equal to Df˜(t) = f˜0 (t) = ∇f (x + tv)0 v.

Definition 185 (Second Derivative/Hessian). The second derivative of a real valued function f : Rn → R
at x ∈ int dom f is said to be the Hessian matrix of f , and is denoted ∇2 f (x) = Hx with entries given
by

    ∇2f (x)ij = δ2f (x) / (δxi δxj),  i ∈ [n], j ∈ [n].    (896)

It is assumed that f is twice differentiable.

The second order approximation of f centred at x is the quadratic function of z defined by

    fˆ(z) = f (x) + ∇f (x)0(z − x) + (1/2)(z − x)0∇2f (x)(z − x),    (897)

and satisfies

    lim_{z∈dom f, z6=x, z→x}  |f (z) − fˆ(z)| / kz − xk2^2 = 0.    (898)

This may be interpreted as the derivative of the first derivative. If f is differentiable, then the gradient
mapping is the function ∇f : Rn → Rn , where dom ∇f = dom f and the derivative of this mapping
satisfies D∇f (x) = ∇2 f (x).

Theorem 260 (Chain Rule for Second Derivatives of Function Composition with Scalar Functions).
For f : Rn → R, g : R → R, h(x) = g(f (x)). We have

    ∇h(x) = g0(f (x))∇f (x),    (899)
    ∇2h(x) = g0(f (x))∇2f (x) + g00(f (x))∇f (x)∇f (x)0.    (900)

Theorem 261 (Chain Rule for Second Derivatives of Function Composition with Affine Functions). For
f : Rn → R, A ∈ Rn×m , b ∈ Rn , define the composition g : Rm → R defined by g(x) = f (Ax + b). We
have

    ∇g(x) = A0∇f (Ax + b),    (901)
    ∇2g(x) = A0∇2f (Ax + b)A.    (902)
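Theorem 261 is easy to spot-check with finite differences. A sketch assuming numpy, using the illustrative (not from the text) choice f (y) = log Σ exp(yi), whose gradient is the softmax vector p and whose Hessian is diag(p) − pp0:

import numpy as np

def lse(y):
    m = y.max(); w = np.exp(y - m); p = w / w.sum()
    return m + np.log(w.sum()), p, np.diag(p) - np.outer(p, p)   # value, gradient, Hessian

rng = np.random.default_rng(6)
n, m = 4, 3
A = rng.standard_normal((n, m)); b = rng.standard_normal(n); x = rng.standard_normal(m)

_, gf, Hf = lse(A @ x + b)
grad_g, hess_g = A.T @ gf, A.T @ Hf @ A                   # Equations 901, 902

g = lambda x: lse(A @ x + b)[0]
eps = 1e-5
fd_grad = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps) for e in np.eye(m)])
assert np.allclose(fd_grad, grad_g, atol=1e-6)

grad_fn = lambda x: A.T @ lse(A @ x + b)[1]
fd_hess = np.column_stack([(grad_fn(x + eps * e) - grad_fn(x - eps * e)) / (2 * eps) for e in np.eye(m)])
assert np.allclose(fd_hess, hess_g, atol=1e-5)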

n
Exercise 390. Compute the second order approximation of f (X) = log det X, dom f = S++ = {X < 0}.
n n
Proof. Let X ∈ S++ . For Z ∈ S++ close to X, let ∆X = Z − X, and see that

    log det Z = log det(X + ∆X)    (903)
              = log det( X^{1/2} (1 + X^{-1/2}∆XX^{-1/2}) X^{1/2} )    (904)
              = log( det(X^{1/2}X^{1/2}) det(1 + X^{-1/2}∆XX^{-1/2}) )    (905)
              = log det X + log det(1 + X^{-1/2}∆XX^{-1/2})    (906)
              = log det X + Σ_{i=1}^n log(1 + λi),    (907)

where λi are the eigenvalues of X^{-1/2}∆XX^{-1/2}. The equalities follow since X^{-1/2}∆XX^{-1/2} is symmetric and hence orthogonally diagonalizable. Furthermore, we chose ∆X to be small, so the λi must be small, and since log(1 + x) ≈ x for small x, we may write (recall trace Definition 41)
    log det Z ≈ log det X + Σ_{i=1}^n λi    (908)
              = log det X + tr(X^{-1/2}∆XX^{-1/2})    (909)
              = log det X + tr(X^{-1/2}X^{-1/2}∆X)    (910)
              = log det X + tr(X^{-1}(Z − X))    (911)
              = log det X + tr((X^{-1})0(Z − X)).    (912)

It follows from Equation 889 that this is the first order approximation of f centred at X, and that
∇f (X) = X −1 is the gradient of f evaluated at X. To get the second order approximation, we require
the Hessian. We can do this with a first order approximation of ∇f (X) = X −1 . Write

Z −1 = (X + ∆X)−1 (913)
 1  1 −1
X 2 1 + X − 2 ∆XX − 2 X 2
1 1
= (914)
 −1
= X − 2 1 + X − 2 ∆XX − 2
1 1 1 1
X− 2 (915)
 
≈ X − 2 1 − X − 2 ∆XX − 2 X − 2
1 1 1 1
(916)
= X −1 − X −1 ∆XX −1 , (917)

with the use of first order approximation (1 + A)−1 ≈ 1 − A for small A. We can think of the term
−1 −1 2 0
−X ∆XX as ∇ f (X) ∆X, the change in gradient as X → Z. Compare to Equation 897 - the
−1
1 1
∆XX −1 ∆X), so that
2

second-th order term in the Taylor expansion is 2 ∆X∇ f (X)∆X = − 2 tr(X
the second order approximation of f near X evaluates to

f (Z) = f (X + ∆X) (918)


1
≈ f (X) + tr(X −1 ∆X) − tr(X −1 ∆XX −1 ∆X) (919)
2
1
= f (X) + tr(X −1 (Z − X)) − tr(X −1 (Z − X)X −1 (Z − X)). (920)
2
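The first and second order approximations derived in Exercise 390 are straightforward to verify numerically. A small sketch assuming numpy (X and the perturbation scale are arbitrary choices that keep X positive definite and ∆X small):

import numpy as np

rng = np.random.default_rng(7)
n = 4
G = rng.standard_normal((n, n))
X = G @ G.T + n * np.eye(n)                    # positive definite
dX = rng.standard_normal((n, n)); dX = 1e-3 * (dX + dX.T)

Xinv = np.linalg.inv(X)
logdet = lambda S: np.linalg.slogdet(S)[1]

exact = logdet(X + dX) - logdet(X)
first = np.trace(Xinv @ dX)                                     # Equation 912
second = first - 0.5 * np.trace(Xinv @ dX @ Xinv @ dX)          # Equation 920

assert abs(exact - second) < abs(exact - first)   # second order approximation is tighter
assert np.isclose(exact, second, atol=1e-8)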

Chapter 4

Convex Optimization

Knowledge of linear algebra (Chapter 3) is assumed. The primary reference text is Boyd and Vanden-
berghe [5]. For a more complete exposition with figures for intuition, refer to the primary text. This
chapter is a more detailed (in terms of intermediate workings and cross-references) but reduced (in terms
of content) treatment of the topic.

4.1 Notations
When referring to the linear algebra proofs from Section 3.3, we may assume F = R, the set of real
numbers, unless otherwise stated. R+ is the set of nonnegative real numbers, R++ is the set of strictly
positive real numbers. Rn is the Euclidean space (Definition 44), Rm×n is the set of all m × n matrices
with entries in R. All vectors presented are column. 1 is identity matrix, 1 is vector of ones. S k is
k
the set of symmetric matrices order k, S+ is set of symmetric positive semidefinite operators/matrices
k
(Definition 153), S++ is set of symmetric positive definite operators/matrices. , < are generalized
inequalities - on vectors it is componentwise inequality, on matrices it is matrix inequality (A , < B
k k
means (A − B) ∈ S+ , S++ respectively). When there is a subscript K to the inequality, there is an
associated cone K.
f : Rp → Rq is the real valued function from Ω ⊆ Rp to Rq . We write Ω = dom f , where f (x) is defined
for all points x in dom f . For instance log : R → R, and dom log = R++. Pm(R) is the polynomial function
space with max degree m (Definition 101, 103). Given matrices A, B ∈ Rm×n , hA|Bi = A ◦ B = tr(A0 B)
(Definition 41). We denote nullspaces/kernels (Definition 64, 90) as N and ranges as R.

4.2 Mathematical Optimization


A mathematical optimization problem has some form

min. f0 (x), s.t. fi (x) ≤ bi , i ∈ [m]. (921)

f0 (x) : Rn → R is said to be the objective function, the functions fi : Rn → R are constraint functions,
bi are said to be limits/bounds for i ∈ [m]. The optimization variable is x = (x1, · · · , xn). Note that this admits equality constraints, since fi(x) ≤ bi, −fi(x) ≤ −bi ⇐⇒ fi(x) = bi. An optimal solution to the problem is the vector x∗ that minimizes f0(x) subject to the specified conditions. Formally, ∀z, fi(z) ≤ bi, i ∈ [m] implies f0(x∗) ≤ f0(z). If fi, i ∈ [0, m] satisfy linearity (Definition 105), that is

fi (αx + βy) = αfi (x) + βfi (y) for α, β ∈ R, then we have a linear program. If instead we are guaranteed
only that they are convex, i.e.

fi (αx + βy) ≤ αfi (x) + βfi (y), α, β ∈ R+ , α + β = 1. (922)

then we have a convex optimization problem. See that convex optimization problems contain linear programs. A problem is said to be sparse if each constraint function depends only on a small number of the variables. The ability/time required for computer algorithms to solve our optimization problems varies widely with the algorithm, the form of the objective/constraint functions determining the instance of the problem class, the number of variables, sparsity and so on.

4.3 General Overview of Problem Classes


The least squares problem is an optimization problem without constraints, and the objective function is
the sum of squares. Formally, we solve for
    min. f0(x) = kAx − bk2^2 = Σ_{i=1}^k (a0i x − bi)^2,    (923)

where A ∈ Rk×n is the matrix whose rows are the a0i, and x is the n-vector optimization variable. Solving this reduces to
solving for a set of linear equations, (A0 A)x = A0 b with analytical solution x = (A0 A)−1 A0 b. Recognizing
an optimization problem as a least-squares problem amounts to verifying that the objective is a quadratic
function, and whether the associated quadratic form is positive semidefinite. See we may write

    kAx − bk2^2 = (Ax − b)0(Ax − b)    (924)
                = (x0A0 − b0)(Ax − b)    (925)
                = x0A0Ax − x0A0b − b0Ax + b0b    (926)
                = x0A0Ax − 2x0A0b + b0b := φ(x).    (927)

The solution is obtained via matrix calculus, in particular δφ/δx = 2A0Ax − 2A0b = 0, so A0Ax = A0b and x = (A0A)−1A0b. Clearly hA0Ax|xi = hAx|Axi ≥ 0, so the objective function is quadratic and the quadratic form is positive semidefinite (Definition 153). To increase flexibility in applications, a number of additions to the least-squares literature have been made. In weighted least-squares problems, the modified objective function is Σ_{i=1}^k wi (a0i x − bi)^2, where Σ_{i=1}^k wi = 1, wi ≥ 0. The weights can be adjusted to determine the influence of the different terms/points in the overall objective function, such as when there are unequal variances in the data. Another technique uses regularization, such as the addition of a penalty term, giving (for instance) objective function Σ_{i=1}^k (a0i x − bi)^2 + ρ Σ_{i=1}^n xi^2 where ρ > 0. Clearly large values of x increase the objective function to be minimized and this enforces a tradeoff.
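A compact numerical illustration (numpy assumed; the data are random placeholders) of the normal equations above and of the shrinkage effect of the regularization penalty:

import numpy as np

rng = np.random.default_rng(8)
k, n = 50, 4
A = rng.standard_normal((k, n)); b = rng.standard_normal(k)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)          # (A'A)x = A'b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_normal, x_lstsq)

rho = 10.0                                            # penalty rho * ||x||^2
x_reg = np.linalg.solve(A.T @ A + rho * np.eye(n), A.T @ b)
assert np.linalg.norm(x_reg) < np.linalg.norm(x_normal)   # penalty shrinks the solution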

The linear programming problems are instances where all the objective and constraint functions are
linear (additive and homogeneous), giving general form:

min c0 x s.t. a0i x ≤ bi , i ∈ [m]. (928)

Here c, ai∈[m] ∈ Rn , bi∈[m] ∈ R. Although in many cases the problem naturally inherits the standard
form, in other cases the problem needs to be transformed to look like a linear program. For instance, the

Chebyshev approximation problem given

min. maxi∈[k] |a0i x − bi | (929)

is non-linear and certainly non-differentiable, but we can linearize this form to

min t (930)
s.t. a0i x − t ≤ bi , i ∈ [k], (931)
−a0i x − t ≤ −bi , i ∈ [k]. (932)
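The linearization in Equations 930-932 maps directly onto a standard LP solver. A sketch assuming numpy and scipy are available (A, b are random placeholders; the decision variable is the concatenation (x, t)):

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(9)
k, n = 30, 3
A = rng.standard_normal((k, n)); b = rng.standard_normal(k)

c = np.r_[np.zeros(n), 1.0]                           # minimize t
A_ub = np.block([[A, -np.ones((k, 1))],               #  a_i'x - t <= b_i
                 [-A, -np.ones((k, 1))]])             # -a_i'x - t <= -b_i
b_ub = np.r_[b, -b]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + 1))

x_star, t_star = res.x[:n], res.x[n]
assert res.success
assert np.isclose(t_star, np.abs(A @ x_star - b).max(), atol=1e-6)   # t equals the Chebyshev error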

Definition 186 (Convex Function). A function f is said to be convex if

∀x, y ∈ dom f, α, β ∈ R+ , α + β = 1, f (αx + βy) ≤ αf (x) + βf (y). (933)

The convex optimization problems takes form

min. f0 (x) s.t. fi (x) ≤ bi , i ∈ [1, m] (934)

where fi∈[0,m] : Rn → R are convex functions (Definition 186). For a problem formulated (or translated
to be) convex, there are efficient numerical tools to arrive at the solution. The difficulty lies in recognising
the convexity in the first place.

The nonlinear optimization problem is used to describe problems when the objective/constraint are
not linear, but it is also not known to be convex. There are no effective tools for solving these in the
general case. Local optimization seeks to find a point that is locally optimal, the point that minimizes the
objective function among the set of feasible points nearby. Differentiability of the objective and constraint
functions is the only criterion in local optimization, but these methods can fail to find the global optima, require
initial guesses, are sensitive to parameters/initialization and often require domain knowledge to arrive
at a good solution. As opposed to this, global optimization aims to find the true solution, often at a
tradeoff with efficiency. This is particularly important in applications such as system verification, where
the cost of failure is high but time is not a limiting factor. Additionally, convex optimization is not just
useful in convex settings - it can be used to approximate solutions for local optimization problems and
offer heuristics/bounds as subroutines in nonconvex settings.

4.4 Convex Sets and Preservation of Convexity


4.4.1 Convex Sets and Other Examples
Definition 187 (Lines and Line Segments). If x1 , x2 ∈ Rn and x1 6= x2 , then the points y = θx1 + (1 −
θ)x2 where θ ∈ R form a line through x1 , x2 . If we restrict θ ∈ [0, 1], then the points form a line segment.
If we rewrite y = x2 + θ(x1 − x2 ), we can reinterpret y as the sum of base point x2 and distance scaled
by θ of some direction vector (x1 − x2 ).

Definition 188 (Affine Set). A set C ⊆ Rn is said to be affine if the line through any two distinct points
in C lies in C. Formally,

∀x1 , x2 ∈ C, θ ∈ R, θx1 + (1 − θ)x2 ∈ C. (935)

Definition 189 (Affine Combination). An affine combination is a generalization of the line segment between two points to multiple points. For Σ_{i=1}^k θi = 1, the form Σ_{i=1}^k θi xi is said to be an affine combination.

An affine set must contain every affine combination of its points. That is, if C is affine,

    x1, · · · , xk ∈ C, Σ_{i=1}^k θi = 1 =⇒ Σ_{i=1}^k θi xi ∈ C.

Note that nowhere does it state that θ < 0, only that 10 θ = 1. We have encountered affine subsets in
Definition 121. If C is affine set, x0 ∈ C, then the set

V = C − x0 = {x − x0 : x ∈ C} (936)

is a subspace as can be seen under the techniques discussed there. The affine set C was written there as
C = V + x0 = {v + x0 : v ∈ V }.

Definition 190 (Dimension of Affine Set). We define the dimension of an affine set C as the
dimension of the subspace V = C − x0 , where x0 is arbitrary element in C.

Exercise 391 (Solution Set of Linear System). Consider the linear system Ax = b. The solution set to
this linear system can be written C = {x : Ax = b}, and see this has representations as in Theorem 52,
that is C = {u + v : u ∈ N (A)} and v is particular solution to Ax = b. It is affine because

∀x1 , x2 ∈ C, A(θx1 + (1 − θ)x2 ) = θAx1 + (1 − θ)Ax2 = θb + (1 − θ)b = b. (937)

As per above discussion, the associated subspace is N (A), since ∀u ∈ N (A), we have A(v + λu) =
Av + λAu = Av + 0 = b where v is particular solution.

Definition 191 (Affine Hull). The set of all affine combinations of points in some set C ⊆ Rn is said
to be the affine hull, and is written

    aff C = { Σ_{i=1}^k θi xi : xi ∈ C, i ∈ [k], Σ_{i=1}^k θi = 1 }.    (938)

The affine hull is the smallest affine set that contains some set C. Formally,

S is affine set, C ⊆ S =⇒ aff C ⊆ S. (939)

The affine hull of an affine set is itself. The affine dimension of a set C is the dimension of its affine
hull, which is trivially an affine set, which has an associated vector space. For instance, the affine hull
of a unit circle in R2 , that is {x ∈ R2 : x2 + y 2 = 1} is equal to R2 and has dimension two. If the affine
dimension of a set C ⊆ Rn is less than n, then aff C 6= Rn .

Definition 192 (Ball). A ball radius r, center x is written

B(x, r) = {y : ky − xk ≤ r}. (940)

Here, the norm k · k may be any norm.

Definition 193 (Interior Point). An element x ∈ C ⊆ Rn is said to be interior point of C if ∃ε > 0 s.t.

    {y : ky − xk2 ≤ ε} ⊆ C.    (941)

That is, there is some ball (Definition 192) centered at x that lies in entirely in C.

Definition 194 (Interior). The set of all interior points (Definition 193) to C is the interior of C and
is denoted as int C. Note that in Rn , all norms are equivalent to the Euclidean norm, and hence all
norms generate the same int C.

Definition 195 (Open). A set C is open if int C = C, that is if every point in C is an interior point
to C.

Definition 196 (Closed). A set C ⊆ Rn is closed if its complement, that is Rn \C = {x ∈ Rn : x 6∈ C}


is open (Definition 195). Defined in terms of the limits, a set C is closed iff it contains the limit point
of every convergent sequence on it. That is, if the sequence x1 , x2 · · · → x, xi ∈ C =⇒ x ∈ C.

Definition 197 (Closure). A closure of a set C is defined to be

cl C = Rn \int(Rn \C). (942)

That is, the closure of a set is the complement of the interior of the complement of C. A point x is in
the closure of C iff ∀ε > 0, ∃y ∈ C s.t. kx − yk2 ≤ ε. Defined in terms of the limits, the closure of C is the set of all limit points of convergent sequences in C.

Definition 198 (Boundary). The boundary of a set C is defined as bd C = cl C\int C. If x ∈ bd C,


then ∀ε > 0, ∃y ∈ C, z 6∈ C s.t. ky − xk2 ≤ ε, kz − xk2 ≤ ε. That is, there exist arbitrarily close points in C and also arbitrarily close points not in C.

A set C is closed (Definition 196) if it contains its boundary, that is if bd C ⊆ C. It is open if it


contains no boundary points, that is if C ∩ bd C = ∅. It is possible for a set (e.g. the empty set) to be
both open and closed, and it is also possible for a set to be neither open nor closed.

Definition 199 (Relative Interior). The relative interior of a set C is the interior relative to aff C, and
is written

relint C = {x ∈ C : B(x, r) ∩ aff C ⊆ C for some r > 0}, (943)

where B(x, r) is norm-ball (Definition 192).

Definition 200 (Relative Boundary). The relative boundary of a set C is cl C \relint C, where cl C is
the closure of set C. We denote this relbd C.

The interior of a point in one-dimension space is empty, but the relative interior is itself. The interior
of a line segment in R≥2 is empty, but its relative interior is the line segment without the endpoints.
The interior of a disk in R≥3 is empty, but its relative interior is the disk without its wireframe.

Exercise 392. Consider a square in R3 defined to be

C = {x ∈ R3 : x1 ∈ [−1, 1], x2 ∈ [−1, 1], x3 = 0}. (944)

The affine hull is the entire (x1 , x2 ) plane given aff C = {x ∈ R3 : x3 = 0}, int C = ∅, and

relint C = {x ∈ R3 : x1 ∈ (−1, 1), x2 ∈ (−1, 1), x3 = 0}. (945)

Additionally we have bd C = C, and relbd C = {x ∈ R3 : max{|x1 |, |x2 |} = 1, x3 = 0}.

Convex functions (Definition 186) were defined. We can also define convexity of a set.

Definition 201 (Convex Set). A set C is said to be convex if any line segment between two points in C
lies in C. That is,

∀x1 , x2 ∈ C, θ ∈ [0, 1], θx1 + (1 − θ)x2 ∈ C. (946)

As with affine combinations (Definition 189) and affine hulls (Definition 191) we can also define convex
combinations and convex hulls.
Definition 202 (Convex Combination). A point of the form Σ_{i=1}^k θi xi, where Σ_{i=1}^k θi = 1, θi ≥ 0, is said to be a convex combination. A set is convex iff it contains every convex combination of its points.

Definition 203 (Convex Hull). The convex hull of a set C, is the set of all convex combinations of the
points in C, written
    conv C = { Σ_{i=1}^k θi xi : xi ∈ C, θi ≥ 0, i ∈ [k], Σ_{i=1}^k θi = 1 }.    (947)

If a convex set were to be ‘drawn’, every point in the set can be ‘seen’ by every other point along an
unobstructed line segment in the set. Affine sets are convex, since it contains any line through two distinct
points, it must also contain line segments. As in the case of affine sets and affine combinations, the convex
hull is the smallest convex set that contains C, in other words, if B convex and C ⊆ B =⇒ conv C ⊆ B.
The convex hull of a convex set is itself. In the definition of convex combinations (Definition 202), we
can let k → ∞, such that the result holds for infinite sums, integrals and by extension any probability distribution. That is, given θi ≥ 0, Σ_{i=1}^∞ θi = 1 and xi ∈ C, where C is convex set, then we are guaranteed that Σ_{i=1}^∞ θi xi ∈ C if the series converges. If p is a probability density function satisfying p(x) ≥ 0, ∫_C p(x)dx = 1 over convex set C, then ∫_C x p(x)dx ∈ C, assuming this integral is well defined. These statements make it clear that if C ⊆ Rn is convex and x is random vector, then Ex ∈ C.

Definition 204 (Cone). A set C is said to be cone, or nonnegative homogeneous if for every x ∈ C,
θ ≥ 0, θx ∈ C. The set C is convex cone if it is both convex and a cone, which asserts that

∀x1 , x2 ∈ C, θ1 , θ2 ≥ 0, θ1 x1 + θ2 x2 ∈ C. (948)

See Definition 204. Set θ → 0, trivially the cone shall contain zero. Note that here θi ’s need not
sum to one, and the combination of points is different from the convex combination (Definition 202).
Naturally, a definition for conic combination is in order.
Definition 205 (Conic Combination). A point of the form Σ_{i=1}^k θi xi, where θi ≥ 0 and xi ∈ C for i ∈ [k]
is said to be a conic combination. The set C is a convex cone iff it contains all conic combination of its
elements.

Definition 206 (Conic Hull). A conic hull of a set C is the set of all conic combinations of points in
C, that is
    { Σ_{i=1}^k θi xi : xi ∈ C, θi ≥ 0, i ∈ [k] }.    (949)

Again, the conic hull is the smallest convex cone that contains C, and the conic hull of a convex cone is
itself.
Some examples of affine (and hence convex) sets are ∅, {x0 }, Rn . Any line is affine. If the line passes
through zero, it is a subspace. Any subspace V is affine, and since 0 ∈ V and has closed span (therefore
contains any conic combination), subspaces are convex cones. A line segment is convex, but not affine
unless it reduces to a singleton. A ray {x0 + θv : θ ≥ 0} for nonzero v is convex, and is furthermore a
convex cone if x0 = 0.

Definition 207 (Hyperplane). A hyperplane is a set of the form {x : a0 x = b}, where a ∈ Rn , a 6= 0, b ∈
R. The vector a ∈ Rn is said to be the normal vector.

The hyperplane can be treated to be the solution set of a linear equation, and is hence an affine set
(see Exercise 391). The hyperplane consists of the points with an inner product to fixed vector a equal to
constant b. This constant b is the offset of the hyperplane from the origin. To see the ‘offset’ argument,
we may rewrite the hyperplane as {x : a0 (x − x0 ) = 0}, where a0 x0 = b.

Exercise 393. See that a hyperplane can be written as

{x : a0 (x − x0 ) = 0} = x0 + a⊥ , (950)

where a⊥ is the orthogonal complement of a defined a⊥ = {v : a0 v = 0}. For a(⊥) ∈ a⊥ , simply make
the substitution a0 ((x0 + a(⊥) ) − x0 ) = a0 · a(⊥) = 0 to see this characterization holds. x0 is a vector,
a⊥ is orthogonal complement, orthogonal complements are subspaces, so this should also affirm that the
hyperplane is an affine set. The hyperplane consists of an offset x0 , plus all the vectors that are orthogonal
to the normal a.

Definition 208 (Halfspaces). A hyperplane (Definition 207) splits Rn into two halfspaces. A (closed)
halfspace is a set taking form {x : a0 x ≤ b}, where a 6= 0. This is the solution set of a linear inequality.

Halfspaces are convex. The halfspace determined by a0 x ≥ b is the halfspace that extends in the
direction of normal vector a, while the halfspace determined by a0 x ≤ b extends in the direction of −a.
Note that halfspace {x : a0 x ≤ b} = {x : a0 (x − x0 ) ≤ 0}, where x0 is arbitrary point along the
hyperplane satisfying a0 x0 = b. See that a0 (x − x0 ) ≤ 0 can be taken as ha|(x − x0 )i ≤ 0, which is when
π
](a, (x − x0 )) ≥ 2 (see Definition 72). The halfspace consists of x0 plus any vector that makes an obtuse
or perpendicular angle with the normal vector a. The boundary of the halfspace is the hyperplane itself.
The set {x : a0 x < b} is the interior of the halfspace {x : a0 x ≤ b}, and is said to be the open halfspace.
Recall the definition of balls (Definition 192). We are interested in a particular ball:

Definition 209 (Euclidean Ball). A Euclidean ball in Rn is said to take form

B(xc , r) = {x : kx − xc k2 ≤ r} = {x : (x − xc )0 (x − xc ) ≤ r2 } = {xc + ru : kuk ≤ 1}, (951)

where r > 0.

A Euclidean ball is convex. To see this, see that if kx1 − xc k ≤ r, kx2 − xc k ≤ r, θ ∈ (0, 1), we have

kθx1 + (1 − θ)x2 − xc k2 = kθ(x1 − xc ) + (1 − θ)(x2 − xc )k2 (952)


≤ θkx1 − xc k2 + (1 − θ)kx2 − xc k2 (953)
≤ r, (954)

where the first inequality is result of triangle inequality (Theorem 155).

Definition 210 (Ellipsoid). An ellipsoid takes form

    ξ = {x : (x − xc)0P−1(x − xc) ≤ 1} = {xc + Au : kuk2 ≤ 1},    (955)

where P ∈ S^n_{++}. Here xc is the center of the ellipsoid, and the lengths of the semi-axes of ξ are given by √λi, where λi are the eigenvalues of P . In the second representation for ξ, A is square, invertible matrix and A ∈ S^n_{++}. We can see this by taking A → P^{1/2}, then see that

    {xc + P^{1/2}u : kuk2 ≤ 1} = {x : (xc + P^{1/2}u − xc)0P−1(xc + P^{1/2}u − xc) ≤ 1}    (956)
                               = {x : u0P^{1/2}P−1P^{1/2}u = u0u = kuk2^2 ≤ 1}.    (957)
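The second representation in Definition 210 can be sanity-checked numerically. A sketch assuming numpy (P and xc are arbitrary; the symmetric square root is built from the eigendecomposition):

import numpy as np

rng = np.random.default_rng(10)
n = 3
G = rng.standard_normal((n, n))
P = G @ G.T + np.eye(n)                        # symmetric positive definite
xc = rng.standard_normal(n)

lam, V = np.linalg.eigh(P)
P_half = V @ np.diag(np.sqrt(lam)) @ V.T       # symmetric square root P^(1/2)
P_inv = np.linalg.inv(P)

for _ in range(200):                           # points xc + P^(1/2) u with ||u|| <= 1
    u = rng.standard_normal(n)
    u *= rng.uniform(0, 1) / np.linalg.norm(u)
    x = xc + P_half @ u
    assert (x - xc) @ P_inv @ (x - xc) <= 1 + 1e-9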

An ellipsoid generalizes the ball with P = r2 1. Ellipsoids are convex. Recall the definition of a (norm)
ball (Definition 192). All balls are convex, and this follows from the properties of norms (Theorem 149).
When not specified, assume that the norm k · k is defined on the inner product space Rn .

Definition 211 (Norm Cone). A norm cone associated with the norm k · k is the set

V = {(x, t) : kxk ≤ t} ⊆ Rn+1 . (958)

The norm cone is a convex cone (Definition 204).

Definition 212 (Second Order Cone). The second-order (quadratic, Lorentz, ice-cream) cone is the
norm cone where the associated norm is Euclidean. That is,

    C = {(x, t) ∈ Rn+1 : kxk2 ≤ t}    (959)
      = { [ x ; t ] : [ x ; t ]0 [ 1, 0 ; 0, −1 ] [ x ; t ] ≤ 0, t ≥ 0 }.    (960)

Definition 213 (Polyhedron). A polyhedron is defined to be the solution set for a finite number of linear
equalities and inequalities, characterized

P = {x : a0j x ≤ bj , j ∈ [m], c0j x = dj , j ∈ [p]} = {x : Ax 4 b, Cx = d}, (961)

where Am×n , Cp×n . It is therefore the intersection between a finite number of halfspaces and hyperplanes.

Affine sets, rays, line segments, halfspaces are examples of polyhedra. Polyhedra are convex sets.

Definition 214 (Nonnegative Orthant). A nonnegative orthant is the set of points with nonnegative
components, that is

Rn+ = {x ∈ Rn : x < 0}. (962)

The nonnegative orthant (Definition 214) is a polyhedron of inequalities xi ≥ 0, i ∈ [n], and is also a
cone, as it should be easy to check.

Definition 215 (Affinely Independent). A set of k + 1 points vi , i ∈ [0, k] are said to be affinely inde-
pendent if the set of k points vi − v0 , i ∈ [1, k] are linearly independent (Definition 52).

Definition 216 (Simplexes). Simplexes are a family of polyhedra (Definition 213). Given affinely inde-
pendent vi , i ∈ [0, k] in Rn , the simplex is defined by
    C = conv {v0, · · · , vk} = { Σ_{i=0}^k θi vi : θ < 0, 10 θ = 1 }.    (963)

The affine dimension (Definition 191) of this simplex is k, as it should be easy to derive by definition of
affine independence.

To see that a simplex (Definition 216) is a polyhedron (Definition 213), first see that x ∈ C iff ∃θ < 0, 10 θ = 1 s.t. x = Σ_{i=0}^k θi vi, which can be written

    x = Σ_{i=0}^k θi vi = (1 − Σ_{i=1}^k θi) v0 + Σ_{i=1}^k θi vi = v0 + Σ_{i=1}^k θi (vi − v0).    (964)

If we write y = (θ1, · · · , θk), then

    B = [ v1 − v0  · · ·  vk − v0 ] ∈ Rn×k    (965)

and x ∈ C iff x = v0 + By for some y < 0, 10 y ≤ 1. Since the vi's are affinely independent (Definition 215), rank(B) = k (Definition 62), and Theorem 51 asserts that B has left inverse A1 s.t. we may write

    AB = [ A1 ; A2 ] B = [ 1 ; 0 ]    (966)

and A = [ A1 ; A2 ] is invertible square matrix of order n. Since x = v0 + By, then Ax = Av0 + ABy so

    A1x = A1v0 + y,   A2x = A2v0.    (967)

Hence x ∈ C iff A2x = A2v0, and y = A1x − A1v0 satisfies y < 0, 10 y ≤ 1. Then we have x ∈ C iff

    A2x = A2v0,   A1x < A1v0,   10A1x ≤ 1 + 10A1v0,    (968)

which defines a set of linear equalities and inequalities in x - by definition we have a polyhedron.

Definition 217 (Unit Simplex). The unit simplex is the n dimensional simplex determined by the zero
vector and unit vectors, that is 0, ei , i ∈ [n] in Rn (see Definition 56). It may be expressed as the set of
vectors that satisfy

x < 0, 10 x ≤ 1. (969)

Definition 218 (Probability Simplex). The probability simplex is the (n − 1) dimensional simplex de-
termined by the unit vectors ei , i ∈ [n] in Rn (see Definition 56). It is the set of vectors that satisfy

x < 0, 10 x = 1. (970)

Vectors in the probability simplex are probability distributions on a set with n elements.

Definition 219 (Infinity-Norm). The infinity norm of kxk∞ is the max norm maxi∈[n] {|xi |}.

It matters how we define a polyhedron. Consider the unit ball C in the `∞ norm, written C = {x : |xi| ≤ 1, i ∈ [n]}. Then set C can be defined by 2n linear inequalities of the form ±e0i x ≤ 1, where ei are unit vectors (see Definition 56), while describing it as a convex hull requires 2^n points via C = conv {v1, · · · , v_{2^n}}, where the vi's represent the 2^n possible vectors with entries ±1.

The set of symmetric matrices, symmetric positive semidefinite, symmetric positive definite matrices are defined S^n = {X ∈ Rn×n : X = X0}, S^n_+ = {X ∈ Rn×n : X < 0}, S^n_{++} = {X ∈ Rn×n : X  0}. The set S^n_+ is convex cone, to prove this write

    x0(θ1A + θ2B)x = θ1x0Ax + θ2x0Bx ≥ 0,    (971)

for A, B < 0, θ1, θ2 ≥ 0 and for all x ∈ Rn. The dimension of the vector space S^n is (n^2 − n)/2 + n = n(n + 1)/2, as should be clear by thinking of the basis consisting of the entries in the upper half and diagonal of the matrix.

4.4.2 Operations Preserving Convexity of Sets
Theorem 262 (Intersections). If S1 , S2 are convex, then S1 ∩ S2 convex. This extends to any finite
number and infinite number of sets.

Polyhedra (Definition 213) are intersection of halfspaces and hyperplanes, all of which are convex,
and are hence themselves convex.
Exercise 394. The positive semidefinite cone S^n_+ may be expressed as ∩_{z6=0} {X ∈ S^n : z0Xz ≥ 0}. For each z 6= 0, z0Xz is linear in X, so the sets {X ∈ S^n : z0Xz ≥ 0} are halfspaces in S^n. The positive semidefinite cone is the intersection of an infinite number of halfspaces.

It turns out that a closed convex set S is actually just the intersection of all the halfspaces that
contain it, formally S = ∩{H : S ⊆ H, where H is a halfspace}.

Definition 220 (Affine Functions). A function f : Rn → Rm is affine if it can be written in the form
f (x) = Ax + b, where A ∈ Rm×n , b ∈ Rm .

Theorem 263 (Affine Preservation). If S ⊆ Rn is convex and f : Rn → Rm is affine function (Definition


220), then the image of S under f , written

f (S) = {f (x) : x ∈ S} (972)

is convex set. Additionally, if f : Rk → Rn is affine function, the inverse image of S under f , written

f −1 (S) = {x : f (x) ∈ S} (973)

is convex set.

A consequence of Theorem 263 is that for S ⊆ Rn convex, α ∈ R, a ∈ Rn , the sets αS, a + S are
convex sets. Additionally, the projection of a convex set onto a subset of coordinates is convex. That
is, given S ⊂ Rm × Rn convex, the set T = {x1 ∈ Rm : (x1 , x2 ) ∈ S for some x2 ∈ Rn } is convex. The
sum of two convex sets, S1 + S2 = {x + y : x ∈ S1 , y ∈ S2 } is convex, their direct/Cartesian product
S1 × S2 = {(x, y) : x ∈ S1 , y ∈ S2 } is convex, their partial sums S = {(x, y1 + y2 ) : (x, y1 ) ∈ S1 , (x, y2 ) ∈
S2 } where x ∈ Rn , yi ∈ Rm , i = 1, 2 is convex.

Exercise 395. Recall that a polyhedron (Definition 213) may be written as {x : Ax 4 b, Cx = d}.
This is the inverse image of the Cartesian product of Rm
+ and the origin under the affine function
f (x) = (b − Ax, d − Cx) and may be expressed {x : f (x) ∈ Rm
+ × {0}}.
Exercise 396 (Linear Matrix Inequalities (LMIs)). Consider the condition A(x) = Σ_{i=1}^n xi Ai 4 B, where B, Ai ∈ S^m are symmetric matrices. This condition is said to be a linear matrix inequality in x. The solution set of a LMI, {x : A(x) 4 B} is convex, and may be expressed as the inverse image of the positive semidefinite cone under the affine function f : Rn → S^m where f (x) = B − A(x) and may be expressed {x : f (x) ∈ S^m_+}.

Exercise 397. The set {x : x0Px ≤ (c0x)^2, c0x ≥ 0} where P ∈ S^n_+, c ∈ Rn is convex, and can be expressed as the inverse image of the second order cone (Definition 212), written {(z, t) : z0z ≤ t^2, t ≥ 0}, under the affine function f (x) = (P^{1/2}x, c0x).

Exercise 398. The ellipsoid (Definition 210) ξ = {x : (x − xc)0P−1(x − xc) ≤ 1}, where P ∈ S^n_{++}, is the image of the unit Euclidean ball {u : kuk ≤ 1} (Definition 209) under the affine mapping f (u) = P^{1/2}u + xc. It is also the inverse image of the unit ball under the affine mapping g(x) = P^{-1/2}(x − xc).

Definition 221 (Perspective Function). The perspective function P : Rn+1 → Rn, dom P = Rn × R++, is the function P (z, t) = z/t where z ∈ Rn, t > 0.

For perspective function P , if C ⊆ dom P is convex, then image P (C) = {P (x) : x ∈ C} is convex.
The proof of this statement is done by showing that line segments are mapped to line segments under
P . Consider x = (x̃, xn+1 ), y = (ỹ, yn+1 ), where x, y ∈ Rn+1 and xn+1 , yn+1 > 0. Then for θ ∈ [0, 1], we
have
    P (θx + (1 − θ)y) = (θx̃ + (1 − θ)ỹ) / (θxn+1 + (1 − θ)yn+1)    (974)
                      = [θxn+1/(θxn+1 + (1 − θ)yn+1)] (x̃/xn+1) + [(1 − θ)yn+1/(θxn+1 + (1 − θ)yn+1)] (ỹ/yn+1)    (975)
                      = µP (x) + (1 − µ)P (y),    (976)

where µ = θxn+1/(θxn+1 + (1 − θ)yn+1) ∈ [0, 1]. See that as θ : 0 → 1, µ : 0 → 1 hence P ([x, y]) = [P (x), P (y)]. For
C convex, C ⊆ dom P , the line segment [P (x), P (y)] ⊂ P (C) since this is the image of the line segment
[x, y] under P . The inverse image of the perspective function is also convex. That is, for C ⊆ Rn , the
set P−1(C) = {(x, t) ∈ Rn+1 : x/t ∈ C, t > 0} is convex set. To see this, for (x, t) ∈ P−1(C), (y, s) ∈ P−1(C), θ ∈ [0, 1], we need to show that

    θ(x, t) + (1 − θ)(y, s) ∈ P−1(C).    (977)

This amounts to showing (θx + (1 − θ)y)/(θt + (1 − θ)s) ∈ C, which follows from

    (θx + (1 − θ)y)/(θt + (1 − θ)s) = [θt/(θt + (1 − θ)s)](x/t) + [(1 − θ)s/(θt + (1 − θ)s)](y/s) = µ(x/t) + (1 − µ)(y/s),    (978)

where µ = θt/(θt + (1 − θ)s) ∈ [0, 1].

Definition 222 (Linear Fractional Function). A linear fractional (projective) function f = P ◦ g is the composition of the perspective function P and an affine function g : Rn → Rm+1 given g(x) = [ A ; c0 ] x + [ b ; d ], where A ∈ Rm×n, b ∈ Rm, c ∈ Rn, d ∈ R. Then f : Rn → Rm is the function f (x) = (Ax + b)/(c0x + d), with dom f = {x : c0x + d > 0}. The transformations can be thought of as first applying the matrix Q = [ A, b ; c0, d ] ∈ R(m+1)×(n+1) on (x, 1) to get (Ax + b, c0x + d), and then scaling by the last component to yield (f (x), 1).

If C ⊆ dom f is convex, then f (C) is convex, and if C ⊆ Rm is convex, then f −1 (C) is convex.

Exercise 399 (Conditional Probabilities). If u, v are random variables that take discrete values in
{ui , i ∈ [n]}, {vj , j ∈ [m]} respectively, then given the joint distributions pij = P (u = ui , v = vj ), the
conditional distribution fi|j is given by
    fi|j = P (u = ui | v = vj) = pij / Σ_{k=1}^n pkj,

and see that f is obtained by a linear fractional mapping from p. If C is some convex set of joint
probabilities on (u, v), the set of conditional probabilities of u given v is a convex set.

4.4.3 Proper Cones and Generalized Inequalities


Recall the definition for cones (Definition 204).

Definition 223 (Proper Cone). A cone K ⊆ Rn is said to be proper cone if the following are satisfied:

1. K is convex,

2. K is closed,

3. K is solid, formally int K 6= ∅,

4. K is pointed, formally x ∈ K, −x ∈ K =⇒ x = 0.

Definition 224 (Generalized Inequality). A proper cone K induces a generalized inequality, which we denote K, <K, ≺K, 4K. This is a partial ordering in Rn defined by

    x 4K y ↔ y − x ∈ K,   x ≺K y ↔ y − x ∈ int K.    (979)

When K = R+ , the partial ordering 4K is the usual ordering ≤ on R (and the same holds for other
generalized inequalities on R+ ).
The nonnegative orthant (Definition 214) is a proper cone (Definition 223). The associated generalized inequality <K is the componentwise inequality in Rn written < as is understood between vectors. The positive semidefinite cone S^n_+ is also proper cone in S^n. The associated generalized inequality <K is the matrix inequality outlined in the notations, Section 4.1. In particular, X 4K Y means that Y − X is positive semidefinite. int S^n_+ is the set of positive definite matrices, and so the strict generalized inequality X ≺K Y means that Y − X is positive definite.

Exercise 400. The cone of polynomials nonnegative on [0, 1] with degree n − 1 can be defined by K = {c ∈ Rn : Σ_{i=1}^n ci t^{i−1} ≥ 0 for t ∈ [0, 1]}. Here K is proper cone, int K is the set of coefficients of polynomials that are positive on [0, 1]. Two vectors c, d ∈ Rn satisfy c 4K d iff Σ_{i=1}^n ci t^{i−1} ≤ Σ_{i=1}^n di t^{i−1} for all t ∈ [0, 1].

Theorem 264 (Properties of Generalized Inequalities). Generalized inequalities 4K satisfies

1. preservation under addition: x 4K y, u 4K v =⇒ x + u 4K y + v,

2. transitivity: x 4K y, y 4K z =⇒ x 4K z,

3. preservation under nonnegative scaling: α ≥ 0, x 4K y =⇒ αx 4K αy,

4. reflexivity: x 4K x,

5. antisymmetry: x 4K y, y 4K x =⇒ x = y,

6. preservation under limits: xi 4K yi , i = 1, 2, · · · , xi → x, yi → y =⇒ x 4K y.

The strict generalized inequality ≺K satisfies

1. x ≺K y =⇒ x 4K y,

2. x ≺K y, u 4K v =⇒ x + u ≺K y + v,

3. x ≺K y, α > 0 =⇒ αx ≺K αy,

4. x 6≺K x,

5. x ≺K y, =⇒ for small enough u, v, we have x + u ≺K y + v.

There are important differences between the generalized inequality (Definition 224) and the ordinary
inequality on R. In fact, the ordinary inequality is an instance of the generalized inequality, as the word
generalized should suggest. For one, the ordinary inequality is a total/linear order, meaning any two
points are comparable. The generalized inequality is partial order.

Definition 225 (Minimum/Maximum w.r.t 4K ). The point x ∈ S is said to be the minimum of S if for
all y ∈ S, x 4K y. It is said to be maximum of S if for all y ∈ S, x <K y.

The minimum (maximum) as in Definition 225 is unique, if it exists.

Definition 226 (Minimal/Maximal w.r.t 4K ). The point x ∈ S is said to be minimal of S if for any
y ∈ S, y 4K x =⇒ y = x. It is said to be maximal of S if for any y ∈ S, y <K x =⇒ y = x.

A set can have non unique minimal and maximal elements (Definition 226). Alternative formulations
can be given. In particular, a point x ∈ S is minimum element of S iff S ⊆ x + K, where x + K
denotes all the points that are comparable to x and at least as great as x. A point x ∈ S is minimal iff
(x − K) ∩ S = {x}, where x − K denotes all the points that are comparable to x and at most x.
Exercise 401. Consider A ∈ S^n_{++} and an ellipsoid centered at the origin given by ξA = {x : x0A−1x ≤ 1}. We have A 4 B iff ξA ⊆ ξB.

4.4.4 Separating and Supporting Hyperplanes


Theorem 265 (Separating Hyperplane Theorem). Suppose C, D are two convex sets and C ∩ D = ∅.
Then ∃a 6= 0, b s.t.

∀x ∈ C, a0 x ≤ b, ∀x ∈ D, a0 x ≥ b. (980)

The affine function a0 x − b is nonpositive on C and nonnegative on D. The hyperplane given by {x :


a0 x = b} is said to be a separating hyperplane for C, D, and is said to separate the two sets. This need not be
unique.

Proof. [5]. Proof on the real inner product space with Euclidean norm:
First, define dist(C, D) = inf{ku − vk2 : u ∈ C, v ∈ D} as the distance between two sets. Assume this
distance is positive, and that there exists points c ∈ C, d ∈ D satisfying kc − dk2 = dist(C, D). Define
a = d − c, b = (kdk2^2 − kck2^2)/2. The proof is done by showing that the affine function

    f (x) = a0x − b = (d − c)0(x − (1/2)(d + c))    (981)
is nonpositive on C, nonnegative on D, s.t. {x : a0x = b} is the separating hyperplane passing through the midpoint between points c, d. Suppose ∃u ∈ D on which f is negative, that is

    f (u) = (d − c)0(u − (1/2)(d + c))    (982)
          = (d − c)0(u − d + (1/2)(d − c))    (983)
          = (d − c)0(u − d) + (1/2)kd − ck2^2    (984)
          < 0,    (985)

then this implies that (d − c)0(u − d) < 0, and see that

    d/dt|_{t=0} kd + t(u − d) − ck2^2 = d/dt|_{t=0} hd + t(u − d) − c | d + t(u − d) − ci    (986)
                                      = d/dt|_{t=0} ((d − c)0 + t(u − d)0)((d − c) + t(u − d))    (987)
                                      = d/dt|_{t=0} ( kd − ck2^2 + (d − c)0 t(u − d) + t(u − d)0(d − c) + t^2 (u − d)0(u − d) )    (988)
                                      = (u − d)0(d − c) + (u − d)0(d − c)    (989)
                                      = 2(u − d)0(d − c),    (990)

and this we know to be < 0. Since the norm decreases in the (positive) neighbourhood of t = 0, it follows
that for small t > 0, t ≤ 1, we have

kd + t(u − d) − ck2 < kd − ck2 . (991)

The point d + t(u − d) is closer to c than d, but since d, u ∈ D and D is convex, this is a contradiction.
The proof that f is nonpositive on C follows similarly.

Exercise 402 (Separation of Affine and Convex Set). [5]. Suppose C is convex, D is affine. We may
write D = {F u + g : u ∈ Rm }, where F ∈ Rn×m . If C ∩ D = ∅, separating hyperplane exists (Theorem
265), and ∃a 6= 0, b s.t a0 x ≤ b for all x ∈ C, a0 x ≥ b for all x ∈ D. Then a0 x ≥ b on D implies
a0 (F u + g) ≥ b =⇒ a0 F u ≥ b − a0 g for all u ∈ Rm . Since the linear function is bounded from below on
Rm only if it is zero, then a0 F = 0, and we have condition 0 ≥ b − a0 g. Then there is some a 6= 0 s.t
F 0 a = 0, a0 x ≤ a0 g for all x ∈ C.

A separating hyperplane is said to be strictly separating if a0 x < b for all x ∈ C, and a0 x > b for all
x ∈ D. Other variants of the hyperplane theorems exist; strict separation can be shown for some convex
sets, and in some cases the converse of the separating hyperplane theorem applies.

Exercise 403 (Strict Separation of a Point and Closed Convex Set). If C is closed convex set, x0 6∈ C,
prove there is a strictly separating hyperplane that separates x0 from C.

Proof. [5]. C is closed (Definition 196), and therefore contains its boundary, at which there are arbitrarily close points in C, as well as arbitrarily close points not in C (Definition 198). Since x0 6∈ C, for some ε > 0, there is some ball B(x0, ε) that does not intersect with C. The separating hyperplane theorem (Theorem 265) asserts that ∃a 6= 0, b s.t. a0x ≤ b for x ∈ C, and a0x ≥ b for x ∈ B(x0, ε) = {x0 + u : kuk ≤ ε}. We may write the condition

    a0x ≥ b as a0(x0 + u) ≥ b, ∀kuk ≤ ε.    (992)

Then the u that minimizes the LHS is u = −ε a/kak2 (verify this), and at this point we have

    a0x0 − ε a0a/kak2 = a0x0 − kak2 ε ≥ b.    (993)

Since this is the minimal value of the LHS, the affine function

    f (x) = a0x − b − ε kak2/2    (994)

must be positive at x0. Additionally, −ε kak2/2 < 0, so clearly it is negative on C.

Theorem 266. A closed convex set is the intersection of all halfspaces that contain it.

Proof. Let C be closed convex set, and let S be the intersection of all halfspaces containing C. Clearly,
x ∈ C =⇒ x ∈ S by definition. OTOH, suppose ∃x ∈ S but x 6∈ C, then Exercise 403 asserts there
is some hyperplane that strictly separates x and C. Then there is some halfspace containing C, but it
does not contain x, hence x 6∈ S. This is contradiction, so C = S.

Exercise 404 (Converse of the Separating Hyperplane Theorem). The converse of the separating hyper-
plane theorem 265 holds when some conditions are satisfied on the two convex sets, C and D. Suppose
C, D are convex sets, at least one of which is open; then C, D are disjoint iff there is a separating hyperplane
between them.

Proof. Without loss of generality, assume that C is open, and that there is affine function f that is
nonpositive on C and nonnegative on D. Since C is open (Definition 195), every point in C is interior
to C, and every point in C has a ball with some positive radius ε centred at itself that lies in the set C. It follows
that f must be negative on C, else if f were zero at any point, f would take positive values near the
point and this is contradiction of the nonpositive assumption of f on C. Then C, D would be disjoint,
since f is negative on C and f is nonnegative on D. With the other direction proved by the separating
hyperplane theorem (Theorem 265), the result follows.

Exercise 405 (Theorem of alternatives for strict linear inequalities). [5]. Consider the strict linear
inequality characterized by Ax ≺ b. These inequalities are infeasible iff the convex sets given by

C = {b − Ax : x ∈ Rn }, D = Rm++ = {y ∈ Rm : y ≻ 0} (995)

do not intersect. Since C is affine and D is open, Exercise 404 asserts that C, D disjoint iff there is a
separating hyperplane λ 6= 0, µ s.t. λ0 y ≤ µ for all y ∈ C and λ0 y ≥ µ for all y ∈ D. Then the first
condition may be written

λ0 y ≤ µ as λ0 (b − Ax) = λ0 b − λ0 Ax ≤ µ, (996)

and recall as in Exercise 402, the affine function is bounded only when λ0 A = 0, and it follows that
A0 λ = 0, λ0 b ≤ µ. The second condition, written λ0 y ≥ µ for all y ≻ 0, implies that µ ≤ 0, λ < 0, λ 6= 0.
Then the system of strict inequalities is infeasible iff ∃λ ∈ Rm s.t.

λ 6= 0, λ < 0, A0 λ = 0, λ0 b ≤ 0, (997)

a system of equalities and inequalities in λ. The two systems (the strict inequality Ax ≺ b and Equation
997) are said to be alternatives, in that for any data A, b, exactly one of them is solvable; a small
numerical check follows.
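A minimal numerical check of this pair of alternatives (a sketch only, assuming a Python environment with numpy and scipy; the helper names are illustrative, not part of the text): the strict system Ax ≺ b is feasible iff max{t : Ax + t1 4 b} has a positive optimum, and the alternative system is feasible iff min{λ0 b : A0 λ = 0, 10 λ = 1, λ < 0} is at most zero.

import numpy as np
from scipy.optimize import linprog

def strict_system_feasible(A, b):
    """Check whether Ax < b (strictly, componentwise) has a solution,
    by maximizing t subject to Ax + t*1 <= b; feasible iff optimal t > 0."""
    m, n = A.shape
    c = np.zeros(n + 1)
    c[-1] = -1.0                                    # minimize -t  <=>  maximize t
    A_ub = np.hstack([A, np.ones((m, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=b, bounds=[(None, None)] * (n + 1))
    if res.status == 3:                             # unbounded t => certainly strictly feasible
        return True
    return res.status == 0 and -res.fun > 1e-9

def alternative_feasible(A, b):
    """Check whether lambda != 0, lambda >= 0, A'lambda = 0, b'lambda <= 0 is solvable;
    normalize 1'lambda = 1 and minimize b'lambda over the resulting polytope."""
    m, n = A.shape
    A_eq = np.vstack([A.T, np.ones((1, m))])
    b_eq = np.concatenate([np.zeros(n), [1.0]])
    res = linprog(b, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
    return res.status == 0 and res.fun <= 1e-9

A = np.array([[1.0], [-1.0]])
print(strict_system_feasible(A, np.array([0.0, 0.0])),   # False: x < 0 and -x < 0 impossible
      alternative_feasible(A, np.array([0.0, 0.0])))     # True:  lambda = (1/2, 1/2)
print(strict_system_feasible(A, np.array([1.0, 1.0])),   # True:  x = 0 works
      alternative_feasible(A, np.array([1.0, 1.0])))     # False

In both rows exactly one of the two systems is solvable, as the theorem of alternatives asserts.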

Definition 227 (Supporting Hyperplane). For any (not necessarily convex) C ⊆ Rn , and x0 ∈ bd C =
cl C\int C, if ∃a 6= 0 s.t. a0 x ≤ a0 x0 for all x ∈ C, then the hyperplane {x : a0 x = a0 x0 } = {x :
a0 (x − x0 ) = 0} is said to be a supporting hyperplane to C at x0 . Equivalently, the point x0 and the set
C are separated by the hyperplane {x : a0 x = a0 x0 }. We say that a supporting hyperplane to S at x is
strict if the hyperplane intersects S only at the point x.

We can think of the halfspace {x : a0 x ≤ a0 x0 } created by the supporting hyperplane to contain C.

Theorem 267 (Supporting Hyperplane Theorem). For any nonempty convex set C, and any x0 ∈ bd C,
∃ a supporting hyperplane (Definition 227) to C at x0 .

Proof. If int C 6= ∅, then the result follows by applying the separating hyperplane theorem (Theorem
265) to int C, {x0 }. If int C = ∅, then C lies in an affine set with dimension < n - any hyperplane
containing that affine set contains both C, x0 , and is a trivial supporting hyperplane.

There are also variants of the supporting hyperplane theorem (Theorem 267). In particular, if a set C is
closed (Definition 196), has nonempty interior (Definition 194), and has a supporting hyperplane at every
x0 ∈ bd C, then C is convex.

4.4.5 Dual Cones


Definition 228 (Dual Cone). For (not necessarily proper, or even convex) cone K, the set

K ∗ = {y : x0 y ≥ 0 ∀x ∈ K} (998)

is said to be the dual cone of K. K ∗ is a cone and is always convex, even if K is non-convex. y ∈ K ∗ iff
−y is the normal of a hyperplane that supports K at the origin.

Exercise 406 (Subspace Dual). The dual cone of a subspace (Definition 228, 49) V ⊆ Rn is the
orthogonal complement of V , written V ⊥ (Definition 148). In particular, V ⊥ = {y : y 0 v = 0, ∀v ∈ V }.

See that the nonnegative orthant (Definition 214) Rn+ is its own dual, since y 0 x ≥ 0 for all x < 0 iff
y < 0. A cone that is equal to its own dual is said to be self-dual.

Exercise 407. Consider the set of symmetric matrices S n , and the standard inner product tr(XY ) =
Σi,j Xij Yij . Prove that the positive semidefinite cone Sn+ is self-dual.

Proof. We are required to show that for X, Y ∈ S n , tr(XY ) ≥ 0 for all X < 0 iff Y < 0. The
contrapositive is shown for both directions of the proof. If Y 6< 0, then ∃q ∈ Rn s.t. q 0 Y q = tr(qq 0 Y ) < 0, so
X = qq 0 is a positive semidefinite matrix satisfying tr(XY ) < 0. On the other hand, suppose ∃X < 0 s.t.
tr(XY ) < 0; then expressing X in terms of its eigenvector expansion (Definition 83) as X = Σi λi qi qi0 ,
where λi ≥ 0 for i ∈ [n], we get

tr(Y X) = tr( Y Σi λi qi qi0 ) = Σi λi qi0 Y qi < 0, (999)

so Y 6< 0.
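A quick numerical illustration of Exercise 407 (a sketch only, assuming numpy is available): for random PSD X, Y the inner product tr(XY ) is nonnegative, while for a non-PSD Y the proof's witness X = qq 0 produces tr(XY ) < 0.

import numpy as np

rng = np.random.default_rng(0)

def random_psd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T                                   # AA' is always positive semidefinite

n = 4
# For PSD X, Y the inner product tr(XY) is nonnegative:
print(min(np.trace(random_psd(n) @ random_psd(n)) for _ in range(1000)))

# For a Y that is not PSD, the proof's witness X = qq' gives tr(XY) = q'Yq < 0:
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthogonal basis
Y = Q @ np.diag([3.0, 1.0, 0.5, -2.0]) @ Q.T         # symmetric, one negative eigenvalue
q = Q[:, 3]                                          # eigenvector of the negative eigenvalue
X = np.outer(q, q)
print(np.trace(X @ Y))                               # = -2.0 (up to rounding)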

Definition 229 (Dual Norm). Let k · k be a norm on Rn , then the dual of the associated cone K =
{(x, t) ∈ Rn+1 : kxk ≤ t} is dual cone

K ∗ = {(u, v) ∈ Rn+1 : kuk∗ ≤ v}, (1000)

where kuk∗ is the dual norm given by kuk∗ = sup{u0 x : kxk ≤ 1}.

Proof. We need to show that x0 u + tv ≥ 0 whenever kxk ≤ t iff kuk∗ ≤ v. Suppose kuk∗ ≤ v, kxk ≤ t for
some t > 0 (the result holds trivially when t = 0); by definition of the dual norm, using k − x/tk ≤ 1, we
have

u0 (−x/t) ≤ kuk∗ ≤ v. (1001)

It follows that u0 x + vt ≥ 0. On the other hand, if we have kuk∗ > v, then by definition of the dual norm
∃x, kxk ≤ 1, x0 u > v, so (−x)0 u + v < 0. Taking the point (−x, 1) ∈ K, this is the contrapositive assertion.

Theorem 268 (Properties of Dual Cone). If K ∗ is dual cone of K, then

1. K ∗ is closed and convex,

2. K1 ⊆ K2 =⇒ K2∗ ⊆ K1∗ ,

3. int K 6= ∅ =⇒ K ∗ is pointed,

4. cl K pointed =⇒ int K ∗ 6= ∅,

5. K ∗∗ = cl(conv K). If K is convex and closed, then K ∗∗ = K.

See Theorem 268. In particular, if K is a proper cone, then its dual K ∗ is a proper cone and K ∗∗ = K.
Recall that a proper cone (Definition 223) induces a generalized inequality. The dual cone K ∗ of a proper
cone K is proper, so there is an associated dual generalized inequality <K ∗ associated with <K .

Theorem 269 (Generalized Inequality and its Dual). These properties are satisfied:

1. x 4K y iff λ0 x ≤ λ0 y for all λ <K ∗ 0,

2. x ≺K y iff λ0 x < λ0 y for all λ <K ∗ 0 and λ 6= 0.

Since K = K ∗∗ , it is trivial to see that Theorem 269 holds when we swap the roles of the generalized
inequality and dual since the dual generalized inequality of <K ∗ is <K . For instance, λ 4K ∗ µ iff
λ0 x ≤ µ0 x for all x <K 0.
See Exercise 405. We may generalize this to any proper cone:

Exercise 408 (Theorem of alternatives for strict linear inequalities w.r.t proper cone). Adapted from
[5]. Let K ⊆ Rm be a proper cone (Definition 223). For x ∈ Rn let

Ax ≺K b,

w.r.t to the generalized inequality induced by the cone. This inequality is infeasible iff the affine set
{b − Ax : x ∈ Rn } does not intersect with open, convex set int K. Suppose it is infeasible, then the
separating hyperplane theorem (Theorem 265) asserts there is λ 6= 0, µ s.t λ0 (b − Ax) ≤ µ for all x ∈ Rn
and λ0 y ≥ µ for all y ∈ int K. For the affine inequality to hold in the first condition, we require that
A0 λ = 0 (as in Exercise 402, 405) and so λ0 b ≤ µ. The second condition implies that λ0 y ≥ µ for all
y ∈ int K, which is iff λ <K ∗ 0 and µ ≤ 0. Putting these conditions together, we get

λ 6= 0, λ <K ∗ 0, A0 λ = 0, λ0 b ≤ 0.

On the other hand, suppose both systems hold; then λ0 (b − Ax) > 0, since λ 6= 0, λ <K ∗ 0, b − Ax ≻K 0.
But since A0 λ = 0, then λ0 (b − Ax) = λ0 b > 0 and this is a contradiction. The iff condition holds. The
conditions are said to be alternatives, in that for any data A, b, exactly one of them is solvable.

Recall the definition of minimal, maximal, maximum and minimum elements of an arbitrary set
S ⊆ Rm (see Definition 225, 226).

Theorem 270 (Dual Characterization of Minimum Elements). x is the minimum element of S w.r.t
4K iff for all λ ≻K ∗ 0, x is the unique minimizer of λ0 z over z ∈ S. For any λ ≻K ∗ 0, the hyperplane
{z : λ0 (z − x) = 0} is a strict supporting hyperplane (Definition 227) to S at x.

Proof. Suppose x is the minimum element of S. For λ ≻K ∗ 0, let z ∈ S, z 6= x; since x is minimum,
z − x <K 0 (Definition 225). Since λ ≻K ∗ 0, z − x <K 0, z − x 6= 0, then λ0 (z − x) > 0. Then x must
be the unique minimizer of λ0 z over z ∈ S. On the other hand, if for all λ ≻K ∗ 0, x is the unique minimizer
of λ0 z over z ∈ S but x is not minimum, then ∃z ∈ S with z 6<K x, so z − x 6<K 0 and ∃λ̃ <K ∗ 0 s.t.
λ̃0 (z − x) < 0. It follows that λ0 (z − x) < 0 for λ ≻K ∗ 0 nearby λ̃, and x cannot be the unique minimizer
- this is a contradiction.

Theorem 271 (Dual Characterization of Minimal Elements). If λ ≻K ∗ 0, and x minimizes λ0 z over
z ∈ S, then x is a minimal element.

Proof. Suppose λ ≻K ∗ 0, x minimizes λ0 z over z ∈ S but is not minimal; then ∃z ∈ S, z 6= x, s.t. z 4K x.
Then λ0 (x − z) > 0, so λ0 z < λ0 x, and this is a contradiction.

The converse of Theorem 271 is not true in general.

Theorem 272. For convex set S, minimal element x, ∃λ <K ∗ 0 that is nonzero s.t. x minimizes λ0 z
over z ∈ S.

Proof. Suppose x is minimal, then ((x − K)\{x}) ∩ S = ∅ (see Definition 226). By separating hyperplane
theorem (Theorem 265), ∃λ 6= 0, µ s.t.

∀y ∈ K, λ0 (x − y) ≤ µ, ∀z ∈ S, λ0 z ≥ µ. (1002)

The first inequality asserts that λ <K ∗ 0: if λ0 v < 0 for some v ∈ K, then taking y = tv with t → ∞ makes
λ0 (x − y) unbounded above, violating the bound µ. Since x ∈ x − K and 0 ∈ K, the first inequality asserts
λ0 (x − 0) ≤ µ, and since x ∈ S, the second inequality asserts that λ0 x ≥ µ; then λ0 x = µ, so the second
inequality asserts that µ is the minimum value of λ0 z over S and x must be a minimizer of λ0 z over S.

For those familiar with Pareto dominance and Pareto efficient/optimality, an efficient frontier can be
characterized by the minimal points of a set. Let x ∈ Rn be the resource vector, such as the consumption
of portfolio risk, (negative) returns, amount of lumber required, et cetera, and the production set P ⊆ Rn
be the set of all resource vectors x corresponding to a production method, such as the implementation
(holding) of a basket of securities. Production methods/portfolios with resource vectors that are minimal
elements are said to be Pareto optimal. The set of minimal elements of P forms the efficient frontier. A
point a is better than another point b if a 4 b, a 6= b. The efficient frontier can be traced out by minimizing
λ0 x over the set P of production vectors for any λ ≻ 0, where λi indicates the price of resource xi ; a small
numerical sketch follows.
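The following minimal sketch (assuming numpy; the data are synthetic and purely illustrative) traces part of an efficient frontier of a finite production set by scalarization, i.e. by sweeping λ ≻ 0 and collecting the minimizers of λ0 x, each of which is a minimal (Pareto optimal) point by Theorem 271.

import numpy as np

rng = np.random.default_rng(1)
P = rng.uniform(0.0, 1.0, size=(200, 2))           # 200 candidate (risk, -return) vectors

frontier = set()
for w in np.linspace(0.01, 0.99, 99):              # lambda = (w, 1 - w), strictly positive
    lam = np.array([w, 1.0 - w])
    frontier.add(int(np.argmin(P @ lam)))          # minimizer of lambda'x over P is Pareto optimal

for i in sorted(frontier):
    print(P[i])                                    # points on (part of) the efficient frontier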

4.5 Convex Functions and Preservation of Convexity


Recall definition of convex functions (Definition 186). That is, for θ ∈ [0, 1] and x, y ∈ dom f and dom f
is convex, f (θx + (1 − θ)y) ≤ θf (x) + (1 − θ)f (y). It is strictly convex if the inequality is a strict one
(obviously here θ ∈ (0, 1)), it is concave if −f is convex, and strictly concave if −f is strictly convex.
Affine functions (Definition 220) are trivially both convex and concave, and in fact any function that is
both convex and concave is affine.

Theorem 273. A function is convex iff it is convex when restricted to any line intersecting its domain.
Formally, f convex iff ∀x ∈ dom f, ∀v, g(t) = f (x + tv) is convex.

Definition 230 (Extended-Value Extensions). When x 6∈ dom f , and we invoke f (x), we get an unde-
fined output. To make it convenient, while preserving convexity, we want to extend the convex function
to have domain Rn . The extended-value extension of f is written f˜ : Rn → R ∪ {∞} characterized by
(
f (x), x ∈ dom f,
f˜(x) = (1003)
∞, x 6∈ dom f.

For extended-value (Definition 230) f˜ of convex f , we may write

∀θ ∈ [0, 1], f˜(θx + (1 − θ)y) ≤ θf˜(x) + (1 − θ)f˜(y). (1004)

When referring to two convex functions f1 , f2 , the pointwise sum f = f1 + f2 is the function with
dom f = dom f1 ∩ dom f2 , and f (x) = f1 (x) + f2 (x) for x ∈ dom f . When x is not in at least one of
dom f1 , dom f2 , then the extended value extension f˜(x) = f˜1 (x) + f˜2 (x) works. All convex functions are
implicitly extended, and hence may be written without the attendant˜.

Exercise 409. Let C ⊆ Rn be convex set and define an indicator function 1C , where dom 1C = C with
value 1C (x) = 0 for all x ∈ C. The extended value extension is written
(
0, x ∈ C,
1̃C (x) = (1005)
∞, x 6∈ C.

Both the function 1C , 1̃C are convex.


To preserve concavity on extended-value functions (Definition 230), for concave f , we let f˜(x) = −∞
whenever x 6∈ dom f .

4.5.1 Checking for Convexity


Theorem 274 (First Order Condition for Convexity). If f is differentiable, then f is convex iff dom f
is convex and

f (y) ≥ f (x) + ∇f (x)0 (y − x) (1006)

holds on all x, y ∈ dom f . For concave functions, f is concave iff dom f is convex and

f (y) ≤ f (x) + ∇f (x)0 (y − x) (1007)

for all x, y ∈ dom f . Strict convexity/concavity takes ≥→>.

Proof. Consider Rn , n = 1. If f is convex, x, y ∈ dom f , then since dom f is convex, we have
f (x + t(y − x)) ≤ (1 − t)f (x) + tf (y). Rearranging and dividing by t, we obtain
f (y) ≥ f (x) + (f (x + t(y − x)) − f (x))/t. Take lim t → 0 and
we get f (y) ≥ f (x) + f 0 (x)(y − x). On the other hand, if ∀x, y, f (y) ≥ f (x) + f 0 (x)(y − x), then define
z = θx + (1 − θ)y for θ ∈ [0, 1], and see that

f (x) ≥ f (z) + f 0 (z)(x − z), f (y) ≥ f (z) + f 0 (z)(y − z). (1008)

It follows that a linear weight (θ, 1 − θ) of the two equations gives us

θf (x) + (1 − θ)f (y) ≥ f (z) + θf 0 (z)(x − z) + (1 − θ)f 0 (z)(y − z) (1009)


= f (z) + f 0 (z)(θx − θz + (1 − θ)y − (1 − θ)z) (1010)
0
= f (z) + f (z)(z − z) (1011)
= f (z). (1012)

For the general case when n > 1, let x, y ∈ Rn , consider the function f restricted to a line passing
through the points, which is defined as

g(t) = f (ty + (1 − t)x). (1013)

Then g 0 (t) = ∇f (ty + (1 − t)x)0 (y − x) by the composition rule (Exercise 389). If f convex, then g convex
and see that g(1) ≥ g(0) + g 0 (0)(1 − 0). It follows that

f (y) ≥ f (x) + ∇f (x)0 (y − x). (1014)

If this holds on all x, y ∈ dom f, then we have

f (ty + (1 − t)x) ≥ f (t̃y + (1 − t̃)x) + ∇f (t̃y + (1 − t̃)x)0 (ty + (1 − t)x − t̃y − (1 − t̃)x) (1015)
= f (t̃y + (1 − t̃)x) + ∇f (t̃y + (1 − t̃)x)0 (yt − xt − y t̃ + xt̃) (1016)
0
= f (t̃y + (1 − t̃)x) + ∇f (t̃y + (1 − t̃)x) (y − x)(t − t̃). (1017)

This is equivalent to g(t) ≥ g(t̃) + g 0 (t̃)(t − t̃).

The RHS of Equation 1006 is the first-order Taylor approximation of f at x, an affine function of y. The
inequality states that a function is convex iff its first-order Taylor approximation is everywhere a global
underestimator of the function. Convexity allows us to gain global information from a local gradient. In
particular, if at some x, ∇f (x) = 0, then for all y ∈ dom f, f (y) ≥ f (x) - meaning x is a global minimum
of f . A small numerical sketch follows.
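A quick numerical illustration of the first-order condition (a sketch assuming numpy; the choice f (x) = log(1 + exp(x)) is only an example of a smooth convex function): the tangent at any point should be a global underestimator.

import numpy as np

f  = lambda x: np.logaddexp(0.0, x)            # log(1 + e^x), convex and smooth
df = lambda x: 1.0 / (1.0 + np.exp(-x))        # its derivative (the logistic function)

x0 = 0.7
xs = np.linspace(-10, 10, 2001)
gap = f(xs) - (f(x0) + df(x0) * (xs - x0))     # f(y) - [f(x0) + f'(x0)(y - x0)]
print(gap.min() >= -1e-12)                     # True: the tangent is a global underestimator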

Theorem 275 (Second Order Condition for Convexity). For a twice differentiable function f , where its
Hessian (Definition 185) H = ∇2 f exists at each point in open set dom f , f is convex iff dom f is convex
and ∀x ∈ dom f , ∇2 f (x) < 0. f is concave iff dom f is convex and ∇2 f (x) 4 0 for all x ∈ dom f .

See Theorem 275. If ∇2 f (x) ≻ 0 for all x ∈ dom f, then f is strictly convex. However, the converse
need not be true.
See the quadratic equation in two variables (Definition 85). The generalization to n variables is as
follows:

Definition 231 (Quadratic Function). A quadratic function f : Rn → R, dom f = Rn may be written


f (x) = (1/2) x0 P x + q 0 x + r, (1018)

where P ∈ S n , q ∈ Rn , r ∈ R. Clearly the quadratic form has Hessian (Definition 185) equal to P on
dom f . Then f is convex iff P < 0, and strictly convex iff P ≻ 0. Also, f is (strictly) concave if P
(≺) 4 0. Note that the iff statement made for strict convexity cannot be generalized to arbitrary forms
that are non-quadratic.

Linear and affine functions are convex. Quadratic functions with symmetric positive semidefinite
matrix in the quadratic form are convex. More convex and concave functions are studied.

Result 14 (Exponential). The function exp(αx) is convex on R, α ∈ R.

Result 15 (Power). The function xα is convex on R++ , when α ≥ 1, α ≤ 0, and is concave for α ∈ [0, 1].

Result 16 (Powers of Absolutes). |x|p , for p ≥ 1 is convex on R.

Result 17 (Log). log x is concave on R++ .

Result 18 (Negative Entropy). The negative entropy function, x log x is convex on R++ .

Exercise 410 (Norms). Every norm

kxkp = ( Σi |xi |^p )^(1/p) (1019)

on Rn is convex.

Proof. Follows directly from the homogeneous property of norms (Theorem 150) and triangle inequality
(Theorem 155).

Note that the p-norms require p ≥ 1 by definition. When p < 1, it is not actually a norm. In fact,
for 0 < p ≤ 1, the function h(z) = ( Σi^k zi^p )^(1/p) is concave on Rk+ .

Exercise 411 (Max). The max function f (x) = max{x1 , · · · , xn } is convex on Rn .

Proof. If f (x) = maxi xi , then

f (θx + (1 − θ)y) = max(θxi + (1 − θ)yi ) ≤ max θxi + max(1 − θ)yi = θf (x) + (1 − θ)f (y). (1020)
i i i

Exercise 412 (Quadratic Over Linear). The function f (x, y) = x²/y, dom f = R × R++ = {(x, y) ∈ R2 :
y > 0} is convex.

Proof. See that

δf /δx = 2x/y, δf /δy = −x²/y², (1021)
δ²f /δx² = 2/y, δ²f /δy² = 2yx²/y⁴ = 2x²/y³, (1022)
δ²f /δyδx = −2x/y². (1023)

So the Hessian is

∇²f (x, y) = (2/y³) [[y², −xy], [−xy, x²]] = (2/y³) (y, −x)(y, −x)0 < 0 (1024)

for y > 0.

Exercise 413. Compute the Hessian for f : Rn → R, dom f = Rn , defined

f (x) = log Σi∈[m] exp(a0i x + bi ), (1025)

where ai ∈ Rn , i ∈ [m], and bi ∈ R, i ∈ [m]. Note that this is the composition of an affine function Ax + b,
where A is the matrix with rows a01 , · · · , a0m , and the function g : Rm → R defined g(y) = log( Σi∈[m] exp(yi ) ).
Since δg/δyi = exp(yi )/Σj exp(yj ), the gradient is

∇g(y) = ( 1/Σi exp(yi ) ) (exp(y1 ), · · · , exp(ym ))0 , (1026)

and by the affine composition (Exercise 389) we may write

∇f (x) = (1/(10 z)) A0 z, zi = exp(a0i x + bi ), i ∈ [m]. (1027)

Furthermore, compute

δ²g/δyj δyi = − exp(yj ) exp(yi ) / ( Σk exp(yk ) )², i 6= j, (1028)
δ²g/δyi² = exp(yi )/Σk exp(yk ) − (exp(yi ))²/( Σk exp(yk ) )², (1029)

and using the second derivative affine composition rule (Exercise 261)

∇²f (x) = A0 ( (1/(10 z)) diag(z) − (1/(10 z)²) zz0 ) A, zi = exp(a0i x + bi ), i ∈ [m]. (1030)
Exercise 414 (Log-Sum Exp). The function f (x) = log( Σi∈[n] exp(xi ) ) is convex on Rn . Note that
f (x) ∈ [maxi {xi }, maxi {xi } + log n]. Show that it is convex.

Proof. Write the Hessian (with A → 1 in Equation 1030)

∇²f (x) = (1/(10 z)²) ( (10 z) diag(z) − zz0 ), (1031)

where z = (exp(x1 ), · · · , exp(xn )). For all v, write

v0 ∇²f (x)v = (1/(10 z)²) ( (Σi zi vi²)(Σi zi ) − (Σi vi zi )² ) ≥ 0, (1032)

with inequality asserted by Cauchy-Schwarz (Theorem 154).
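The Hessian formula above can be spot-checked numerically (a sketch assuming numpy): at random points the matrix (1/(10 z)²)((10 z)diag(z) − zz0 ) should have no negative eigenvalues.

import numpy as np

rng = np.random.default_rng(7)
for _ in range(5):
    x = rng.standard_normal(6)
    z = np.exp(x)
    s = z.sum()
    H = (s * np.diag(z) - np.outer(z, z)) / s**2
    print(np.linalg.eigvalsh(H).min() >= -1e-12)    # True: all eigenvalues nonnegative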


Exercise 415 (Geometric Mean). The geometric mean f (x) = (Πi=1^n xi )^(1/n) is concave on dom f = Rn++ .

Proof. Compute the Hessian. First,

δf /δxl = (1/n)(Πi xi )^(1/n − 1) Πi6=l xi , (1033)

and for k 6= l,

δ²f (x)/δxk δxl = (1/n)( ((1 − n)/n)(Πi xi )^(1/n − 2) (Πi6=l xi )(Πi6=k xi ) + (Πi xi )^(1/n − 1) Πi6=l,k xi )
= (1/n)( ((1 − n)/n)(Πi xi )^(1/n) (1/(xl xk )) + (Πi xi )^(1/n) (1/(xl xk )) )
= ( (Πi xi )^(1/n) /(n xk xl ) ) ( (1 − n)/n + 1 ) (1034)
= (Πi xi )^(1/n) /(n² xk xl ), (1035)

while on the diagonal,

δ²f (x)/δxk² = (1/n)(1/n − 1)(Πi xi )^(1/n − 2) (Πj6=k xj )² (1036)
= (1/n)(1/n − 1)(Πi xi )^(1/n) (1/xk²) (1037)
= −(n − 1)(Πi xi )^(1/n) /(n² xk²), (1038)

or equivalently

∇²f (x) = −( (Πi xi )^(1/n) /n² ) ( n diag((1/xi²)i∈[n] ) − qq0 ),

where qi = 1/xi . To see concavity, see

v0 ∇²f (x)v = −( (Πi xi )^(1/n) /n² ) ( n Σi vi²/xi² − (Σi vi /xi )² ) ≤ 0 (1039)

for all v by Cauchy-Schwarz (Theorem 154) with Exercise 253 directly applied.

Exercise 416 (Log-determinant). The function f (X) = log det(X) is concave on dom f = Sn++ (Definition 766).

Proof. For an arbitrary line X = Z + tV in symmetric matrix space, where Z ∈ Sn++ , V ∈ S n , define
g(t) = f (Z + tV ). Applying Theorem 273, write

g(t) = log det(Z + tV ) (1040)
= log det( Z^(1/2) (1 + tZ^(−1/2) V Z^(−1/2) ) Z^(1/2) ) (1041)
= log( det(1 + tZ^(−1/2) V Z^(−1/2) ) det(Z) ) (1042)
= log Πi (1 + tλi ) + log det(Z) (1043)
= Σi log(1 + tλi ) + log det(Z), (1044)

where λi , i ∈ [n] are the eigenvalues of Z^(−1/2) V Z^(−1/2) . The results follow since det(AB) = det(A)det(B) =
det(B)det(A), Z^(−1/2) V Z^(−1/2) is symmetric and hence has an orthogonal diagonalization with the λi as
diagonals, and determinants are the products of eigenvalues. Then

g0 (t) = Σi λi /(1 + tλi ), g00 (t) = − Σi λi²/(1 + tλi )². (1045)

It follows that g, and hence f , is concave.
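A numerical sketch of this concavity (assuming numpy; Z and V are randomly generated only for illustration): along a line X = Z + tV that stays in Sn++ , the second differences of g(t) = log det(Z + tV ) should be nonpositive.

import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n)); Z = A @ A.T + n * np.eye(n)    # Z strictly positive definite
B = rng.standard_normal((n, n)); V = (B + B.T) / 2.0            # arbitrary symmetric direction

ts = np.linspace(-0.1, 0.1, 201)                                # stay near t = 0 so Z + tV stays PD
g = np.array([np.linalg.slogdet(Z + t * V)[1] for t in ts])
second_diff = g[:-2] - 2 * g[1:-1] + g[2:]
print(second_diff.max() <= 1e-10)                               # True: concave along the line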

Definition 232 (α-(sub/super) level set). The α-sublevel set of f : Rn → R is defined as

Cα = {x ∈ dom f : f (x) ≤ α}. (1046)

The α-superlevel set of f : Rn → R is defined as

Cα = {x ∈ dom f : f (x) ≥ α}. (1047)

The α-level set of f : Rn → R is defined as

Cα = {x ∈ dom f : f (x) = α}. (1048)

Theorem 276. The sublevel (Definition 232) of a convex function f is convex for all α ∈ R. The
superlevel (Definition 232) of a concave function f is convex for all α ∈ R.

Proof. If x, y ∈ Cα , then f (x), f (y) ≤ α and f (θx + (1 − θ)y) ≤ α for θ ∈ [0, 1]. It follows that
θx + (1 − θ)y ∈ Cα , and the sublevel set is convex. The converse is not necessarily true. The proof for
superlevel sets follow similarly.

Exercise 417. Consider the geometric and arithmetic means for x ∈ Rn+ characterized by

G(x) = (Πi xi )^(1/n) , A(x) = (1/n) Σi xi (1049)

respectively. There is an associated arithmetic-geometric mean inequality given by G(x) ≤ A(x) for all
x ∈ Rn . If 0 ≤ α ≤ 1, the set given by {x ∈ Rn+ : G(x) ≥ αA(x)} is the set of vectors with geometric
mean at least as large as α times the arithmetic mean. This set is convex, since it is a superlevel set of
concave G(x) − αA(x).

Exercise 418. A generalization of the arithmetic-geometric inequality (Exercise 417) can be written.
Prove that

a^θ b^(1−θ) ≤ θa + (1 − θ)b, a, b ≥ 0, θ ∈ [0, 1]. (1050)

Proof. Since − log x is convex, we may write

− log(θa + (1 − θ)b) ≤ −θ log a − (1 − θ) log b (1051)


log(θa + (1 − θ)b) ≥ θ log a + (1 − θ) log b (1052)
log(θa + (1 − θ)b) ≥ log aθ b1−θ (1053)

and the result follows.

Definition 233 (Epigraph). For a graph defined by f : Rn → R defined as {(x, f (x)) : x ∈ dom f }, the
epigraph of f : Rn → R is written to be

epi f = {(x, t) : x ∈ dom f, f (x) ≤ t}. (1054)

A function is convex iff its epigraph is a convex set.

Definition 234 (Hypograph). For a graph defined by f : Rn → R defined as {(x, f (x)) : x ∈ dom f },
the hypograph of f : Rn → R is written to be

hypo f = {(x, t) : x ∈ dom f, t ≤ f (x)}. (1055)

A function is concave iff its hypograph is a convex set.

Exercise 419. A generalization of the quadratic over linear (Exercise 412) function is the matrix frac-
tional function f : Rn × S n → R, defined

f (x, Y ) = x0 Y −1 x. (1056)

The function is convex on dom f = Rn × Sn++ . To see this is convex, consider its epigraph

epi f = {(x, Y, t) : Y ≻ 0, x0 Y −1 x ≤ t} (1057)
= {(x, Y, t) : [[Y, x], [x0 , t]] < 0, Y ≻ 0}, (1058)

by the Schur complement for positive semidefiniteness of block matrices (see Theorem 255.2). This is
LMI (Definition 396) in (x, Y, t) and hence epi f is convex, so f is convex.

Exercise 420. Consider the first-order convexity condition (Theorem 274) given f (y) ≥ f (x)+∇f (x)0 (y−
x), where f convex and x, y ∈ dom f . Then in epigraph terms we may write

(y, t) ∈ epi f =⇒ t ≥ f (y) ≥ f (x) + ∇f (x)0 (y − x) (1059)


" #0 " # " #!
∇f (x) y x
⇐⇒ − ≤ 0. (1060)
−1 t f (x)

The hyperplane defined by the normal (∇f (x), −1) is a supporting hyperplane (Definition 227) to epi f
at boundary (x, f (x)).

The Equation 933 characterizing convex functions is also known as Jensen's inequality, and can be
extended to any convex combination (Definition 202), including infinite sums and integrals: for p(x) ≥ 0
on S ⊆ dom f with ∫S p(x)dx = 1, we get f ( ∫S p(x)x dx ) ≤ ∫S f (x)p(x)dx, and in particular
f (Ex) ≤ Ef (x), assuming the integrals are well defined.
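A quick Monte Carlo illustration of Jensen's inequality (a sketch assuming numpy; the distribution and the convex f chosen are arbitrary examples):

import numpy as np

rng = np.random.default_rng(11)
x = rng.exponential(scale=2.0, size=100_000)
f = lambda t: np.maximum(t, 1.0) ** 2          # convex: square of a nonnegative convex function
print(f(x.mean()) <= f(x).mean())              # True: f(E x) <= E f(x)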

4.5.2 Operations that Preserve Convexity
Previously, the convexity-preserving operations on sets were studied. Here, convexity-preserving opera-
tions on functions are studied.
Theorem 277 (Non-negative sums). If f is convex, α ≥ 0, then αf is convex. If f1 , f2 convex, f1 + f2 is
convex. By extension, the set of convex functions forms a convex cone (Definition 204): for any w < 0,
we have f = Σi wi fi convex if each of the fi 's are convex. Similarly, a nonnegative weighted sum of
concave functions is concave. The same statements hold for strict inequalities in the case of both convex
and concave functions. The sums extend to infinite sums and integrals, such that for f (x, y) convex in
x for each y ∈ A, and w(y) ≥ 0 for y ∈ A, we are guaranteed that g(x) = ∫A w(y)f (x, y)dy is convex in
x, assuming the integral is well defined.

The statement asserted in Theorem 277 can be understood in terms of the epigraphs (Definition 233).
For w > 0, f convex, we have

epi wf = {(x, wt) : x ∈ dom f, f (x) ≤ t} = [[1, 0], [0, w]] epi f, (1061)

which is convex since we have applied a linear mapping to the convex set epi f .
Theorem 278 (Affine Mapping Composition). For f : Rn → R, A ∈ Rn×m , b ∈ Rn , define g : Rm → R
by

g(x) = f (Ax + b), dom g = {x : Ax + b ∈ dom f }. (1062)

Then f convex =⇒ g convex and f concave =⇒ g concave.


Theorem 279 (Pointwise maximum/supremum). If f1 , f2 convex, then pointwise maximum f (x) =
max{f1 (x), f2 (x)} for all x ∈ dom f = dom f1 ∩ dom f2 is convex. See that

f (θx + (1 − θ)y) = max {f1 (θx + (1 − θ)y), f2 (θx + (1 − θ)y)} (1063)


≤ max {θf1 (x) + (1 − θ)f1 (y), θf2 (x) + (1 − θ)f2 (y)} (1064)
≤ θ max {f1 (x), f2 (x)} + (1 − θ) max{f1 (y), f2 (y)} (1065)
= θf (x) + (1 − θ)f (y). (1066)

Of course this extends to any fi , i ∈ [m] set of convex functions, as well as infinite sets. Given a set
of convex functions, if ∀y ∈ A, f (x, y) convex in x, then g(x) = supy∈A f (x, y) is convex in x, where
dom g = {x : ∀y ∈ A, (x, y) ∈ dom f and supy∈A f (x, y) < ∞}. The epigraph of pointwise supremum of
functions corresponds to the intersection of the component epigraphs, particularly

epi g = ∩y∈A epi f (·, y), (1067)

which agrees with the convexity preserving property of intersection of sets (Theorem 262).
As we might expect from Theorem 279, the pointwise infimum of a set of concave functions is concave.
Theorem 279 allows us to establish convexity for non-differentiable functions where it would not
be possible to use other methods, such as through the computation of a Hessian (Theorem 275). For
instance, for a vector x ∈ Rn , the sum of the largest r order statistics in x given by f (x) = Σ_{i=0}^{r−1} x(n−i)
is convex, as can be seen by writing

f (x) = Σ_{i=0}^{r−1} x(n−i) = max{ Σ_{j=1}^{r} xij : 1 ≤ i1 < i2 < · · · < ir ≤ n}, (1068)

the maximum of all possible r-choice sums. It is the pointwise maximum over (n choose r) linear functions.

Definition 235 (Support Function). Let C ⊆ Rn , C 6= ∅, then the support function SC of C is defined

SC (x) = sup{x0 y : y ∈ C}, dom SC = {x : sup_{y∈C} x0 y < ∞}. (1069)

For each y ∈ C, x0 y linear in x, so SC is pointwise supremum of linear functions, therefore it is convex.

Exercise 421 (Piecewise Linear). A piecewise function defined as the pointwise maximum over L affine
functions, specifically

f (x) = maxi {a0i x + bi : i ∈ [L]} (1070)

is convex.

Exercise 422. Let C ⊆ Rn , then the distance of point x to the farthest point in set C is defined as
f (x) = supy∈C kx − yk. This is obvious, since for each fixed y, the norm kx − yk is convex in x and the
pointwise supremum over convex functions is convex.

Exercise 423. Let ai ∈ Rm , i ∈ [n]. In the weighted-least squares problem, the objective function is

min_x Σi wi (a0i x − bi )² (1071)

over x ∈ Rm . Suppose we allow wi to take any values in R; then the optimal weighted least squares cost
is

g(w) = inf_x Σi wi (a0i x − bi )², (1072)

with dom g = {w : inf_x Σi wi (a0i x − bi )² > −∞}, and since g is the infimum of linear (concave) functions of
w, it is concave.

Exercise 424. The function f (X) = λmax (X) that computes the maximum eigenvalue of symmetric
matrix X ∈ S m can be written using the Rayleigh Quotients (Theorem 247) as

f (X) = sup{y 0 Xy : kyk2 = 1} (1073)

and is clearly convex as supremum over linear functions in X.

Exercise 425. The spectral norm (Definition 180) f (X) = kXk2 , dom f = Rp×q , denotes the maximum
singular value, and may be expressed

kXk2 = sup_{v6=0} kXvk2 /kvk2 , (1074)

so this may be expressed

kXk2 = sup{kXvk2 : kvk2 = 1} (1075)
= sup{u0 Xv : kuk2 = 1, kvk2 = 1}, (1076)

where the equality is discussed in Definition 180. It is hence a supremum of linear functions in X and
is convex.

Exercise 426 (Pointwise supremum of affine functions). Most convex functions can be expressed as the
pointwise supremum of a family of affine functions. Take f : Rn → R to be convex, dom f = Rn , then

f (x) = sup{g(x) : g affine, g(z) ≤ f (z) ∀z}. (1077)

f is the pointwise supremum of the set of all affine global underestimators.

Proof. Suppose f is convex, domf = Rn , then since g is affine underestimator of f , then the LHS ≥ RHS
is clear in Equation 1077. To show equality holds, it is shown ∀x ∈ Rn , ∃ affine global underestimator g
s.t g(x) = f (x). Define the supporting hyperplane (Definition 227) to epi f at (x, f (x)) by
" #0 " #
a x−z
≤0 (1078)
b f (x) − t

for all (z, t) ∈ epi f , a ∈ Rn , b ∈ R, (a, b) 6= 0. By definition of epigraphs (Definition 233), t = f (z) + s
for some s ≥ 0. Then

a0 (x − z) + b(f (x) − f (z) − s) ≤ 0, ∀z ∈ dom f = Rn . (1079)

For this to hold for all s ≥ 0, we require that b ≥ 0. If b = 0, then a0 (x − z) ≤ 0
∀z ∈ Rn , so a = 0, and this is a contradiction since (a, b) 6= 0. It follows that b > 0; take s → 0 and
divide by b to see that

(a/b)0 (x − z) + f (x) ≤ f (z), (1080)
g(z) := f (x) + (a/b)0 (x − z) ≤ f (z), ∀z. (1081)
g is affine underestimator of f , and clearly agrees with f at x.

Definition 236 (Function Composition). For h : Rk → R, g : Rn → Rk , the function composition of h


and g is written as f = h ◦ g : Rn → R that outputs f (x) = h(g(x)), where dom f = {x ∈ dom g : g(x) ∈
dom h}.

Theorem 280 (Conditions for Scalar Function Composition and Convex Preservation). An example of
function composition is the scalar composition, which takes h : R → R, g : Rn → R (see Definition 236).
Assume that h, g are twice differentiable and dom g = dom h = R, see that for f (x) = h(g(x)), we have
by chain rule

f 0 (x) = h0 (g(x))g 0 (x), (1082)


f 00 (x) = h00 (g(x))g 0 (x)g 0 (x) + g 00 (x)h0 (g(x)) (1083)
= h00 (g(x)) g 0 (x)² + h0 (g(x))g 00 (x). (1084)

Using these rules, the following statements (extended-value extensions are required only in generalization
to dimensions n > 1, see Definition 230), and may be reasoned using the signs convex (concave) h00 (x) →
+ (−) and nondecreasing (nonincreasing) g 0 (x) → + (−); the first derivative of g is irrelevant:

1. If h convex, h̃ nondecreasing and g is convex, then f convex,

2. If h convex, h̃ nonincreasing and g is concave then f convex,

3. If h concave, h̃ nondecreasing and g is concave then f concave,

4. If h concave, h̃ nonincreasing and g is convex then f concave.

Proof. The proof of the statements does not require differentiability assumptions. The first of the statements
is proved. Assume x, y ∈ dom f , θ ∈ [0, 1]. Since x, y ∈ dom f then x, y ∈ dom g and g(x), g(y) ∈ dom h.
Since dom g is convex, θx + (1 − θ)y ∈ dom g, and g(θx + (1 − θ)y) ≤ θg(x) + (1 − θ)g(y), and since
dom h is convex, the RHS is in dom h. Since h̃ is nondecreasing, the domain of h extends infinitely in the
negative direction. Since the RHS is in dom h, by the previous statement, the LHS is in dom h, which
again asserts that θx + (1 − θ)y ∈ dom f , and dom f must be convex. Now it follows from the nondecreasing
property of h̃ and convexity of g, h that

h(g(θx + (1 − θ)y)) ≤ h(θg(x) + (1 − θ)g(y)) ≤ θh(g(x)) + (1 − θ)h(g(y)) (1085)

and we are done.

In Theorem 280, to say that h̃ is nondecreasing means that for any x, y ∈ R, with x < y, we have
h̃(x) ≤ h̃(y). If h is assumed convex and y is finite, then clearly h̃(x) must not be ∞, and this means
that if y ∈ dom h, then x ∈ dom h, which is to say that the domain of h extends infinitely in the
negative direction. By the same reasoning, when h assumed convex and h̃ is assumed nonincreasing,
dom h extends infinitely in the positive direction.

Theorem 281 (Conditions for Vector Function Composition and Convex Preservation). Let h : Rk → R,
gi : Rn → R and let function composition f be (see Definition 236) given by

f (x) = h(g(x)) = h(g1 (x), · · · , gk (x)). (1086)

Without loss of generality, assume n = 1. When k = 1, assuming twice differentiability of h, gi , i ∈ [k]


and dom g = R, dom h = Rk , then Equation 1084 generalizes to

f 00 (x) = g 0 (x)0 ∇2 h(g(x))g 0 (x) + ∇h(g(x))0 g 00 (x), (1087)

(note that here g 0 (x) is the derivative of g w.r.t x and g 0 (x)0 is the transpose of the derivative vector).
Then the following statements hold:

1. If h convex, h nondecreasing in each argument, gi , i ∈ [k] convex then f convex,

2. If h convex, h nonincreasing in each argument, gi , i ∈ [k] concave then f convex,

3. If h concave, h nondecreasing in each argument, gi , i ∈ [k] concave then f concave,

4. If h concave, h nonincreasing in each argument, gi , i ∈ [k] convex then f concave,

and the results hold in dimension n > 1, including without the assumption of differentiable h or g, with
the addition of monotonicity condition on h̃.

The monotonicity condition states that when u 4 v, h convex and h̃ nondecreasing, that h̃(u) ≤ h̃(v).
Then v ∈ dom h =⇒ u ∈ dom h; the domain of h extends infinitely in the −Rk+ direction, a fact denoted
dom h − Rk+ = dom h.
Consider the example of Equation 1068: if we let h(z) = Σ_{i=0}^{r−1} z(n−i) , then Theorem 281 asserts
that h ◦ g, the pointwise sum of the r largest gi 's, is a convex function if the gi 's are convex. The log-sum
exponential function (Exercise 414) defined h(z) = log( Σi exp(zi ) ) is convex as we know, so the
function composition

log( Σi exp(gi ) ) (1088)

is convex if the gi 's are convex.

Theorem 282 (Minimization and Infimum Convexity). If f is jointly convex in (x, y) and C is convex
nonempty set, then
g(x) = inf f (x, y)
y∈C

is convex in x if g(x) > −∞ for all x with

dom g = {x : (x, y) ∈ dom f for some y ∈ C}. (1089)

Proof. ∀ε > 0, there ∃y1 , y2 ∈ C s.t. f (xi , yi ) ≤ g(xi ) + ε for i = 1, 2. For θ ∈ [0, 1], see that

g(θx1 + (1 − θ)x2 ) = inf_{y∈C} f (θx1 + (1 − θ)x2 , y) (1090)
≤ f (θx1 + (1 − θ)x2 , θy1 + (1 − θ)y2 ) (1091)
≤ θf (x1 , y1 ) + (1 − θ)f (x2 , y2 ) (1092)
≤ θg(x1 ) + (1 − θ)g(x2 ) + ε. (1093)

Let ε → 0+ and see that

g(θx1 + (1 − θ)x2 ) ≤ θg(x1 ) + (1 − θ)g(x2 ). (1094)

Writing this in terms of the epigraphs, the set

epi g = {(x, t) : (x, y, t) ∈ epi f for some y ∈ C} (1095)

is an affine preservation (Theorem 263) of the convex set epi f .

Exercise 427. If the quadratic function

f (x, y) = x0 Ax + 2x0 By + y 0 Cy, A, C ∈ S n (1096)

is jointly convex in (x, y), then


" #
A B
< 0. (1097)
B0 C

We may express this as a minimization problem, g(x) = inf y f (x, y) = x0 (A−BC + B 0 )x (direct application
of Equation 883 with D → C and using Schur complement of A instead), where C + is the pseudo-inverse.
Since g is convex by Theorem 282, (A − BC + B 0 ) is positive semidefinite.

Exercise 428. The distance of point x to set S given inf y∈S kx − yk is convex in x if S is convex since
the norm is jointly convex on (x, y).

Exercise 429. For arbitrary convex h, let g(x) = inf{h(y) : Ay = x}. We may write
(
h(y) if Ay = x,
f (x, y) = , (1098)
∞ else

and since this is convex in (x, y), and g is the minimum of f over y, then g is convex.

Recall perspective functions (Definition 221). On the other hand, the perspective of a function is
defined:

Definition 237 (Perspective of Functions). The perspective of function f : Rn → R is the function
g : Rn+1 → R defined by
g(x, t) = tf (x/t), dom g = {(x, t) : x/t ∈ dom f, t > 0}. (1099)
The function perspective (Definition 237) preserves convexity. If f convex (concave), then g convex
(concave).

Proof. Proof with epigraphs: for t > 0, obtain

(x, t, s) ∈ epi g ⇐⇒ tf (x/t) ≤ s ⇐⇒ f (x/t) ≤ s/t ⇐⇒ (x/t, s/t) ∈ epi f. (1100)

Then epi g is the inverse image of epi f under the perspective mapping (Definition 221, on the second coordinate).
The result follows since epi f is convex.

Exercise 430. The perspective of convex function f (x) = hx|xi on Rn is

g(x, t) = t(x/t)0 (x/t) = x0 x/t, (1101)
and this is convex in (x, t) for any t > 0.

Exercise 431 (Relative Entropy). The function perspective of f (x) = − log x on R++ is given by

g(x, t) = −t log(x/t) = t log(t/x) = t log t − t log x, (1102)

which is convex on R2++ . g is said to be the relative entropy of t and x, and is a generalization of the
negative entropy function (see this by taking x → 1).

Related to the relative entropy function, the relative entropy of two vectors u, v ∈ Rn++ is defined
as Σi ui log(ui /vi ), which is clearly jointly convex in (u, v) as the sum of relative entropies of (ui , vi ).
Additionally, the function

Dkl (u, v) = Σi ( ui log(ui /vi ) − ui + vi ) (1103)

is known as the Kullback-Leibler divergence between u, v, which measures the deviation between two
positive vectors, and is convex as the positive weighted sum of relative entropies and a linear function of
(u, v). If we take vi → 10 u and negate the relative entropy function, we obtain the concave function of
u ∈ Rn++ given by

Σi ui log(10 u/ui ) = (10 u) Σi zi log(1/zi ), z = u/(10 u), (1104)

and this is called the normalized entropy function. The vector z = u/(10 u) is a probability vector.
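A midpoint-convexity spot check of the Kullback-Leibler divergence (a sketch assuming numpy; tolerances and sample sizes are arbitrary): joint convexity implies Dkl ((u1 + u2 )/2, (v1 + v2 )/2) ≤ (Dkl (u1 , v1 ) + Dkl (u2 , v2 ))/2.

import numpy as np

def dkl(u, v):
    return np.sum(u * np.log(u / v) - u + v)

rng = np.random.default_rng(5)
ok = True
for _ in range(1000):
    u1, u2, v1, v2 = rng.uniform(0.1, 5.0, size=(4, 8))
    lhs = dkl((u1 + u2) / 2, (v1 + v2) / 2)
    rhs = (dkl(u1, v1) + dkl(u2, v2)) / 2
    ok &= lhs <= rhs + 1e-12
print(ok)   # True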

Exercise 432. For f : Rm → R convex and A ∈ Rm×n , b ∈ Rm , c ∈ Rn , d ∈ R, the function

g(x) = (c0 x + d) f ( (Ax + b)/(c0 x + d) ), dom g = {x : c0 x + d > 0, (Ax + b)/(c0 x + d) ∈ dom f } (1105)

is convex.

4.5.3 Conjugate Functions
Definition 238 (Conjugate Function). Let f : Rn → R, and let f ∗ : Rn → R defined as

f ∗ (y) = sup_{x∈dom f} (y 0 x − f (x)), dom f ∗ = {y ∈ Rn : sup_{x∈dom f} (y 0 x − f (x)) < ∞} (1106)

be known as the conjugate of the function f . f ∗ is convex by Theorem 279.
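The conjugate can be approximated numerically by maximizing y 0 x − f (x) over a grid (a rough sketch assuming numpy; the example f (x) = exp(x) anticipates Exercise 433 below, where the closed form f ∗ (y) = y log y − y is derived):

import numpy as np

xs = np.linspace(-20.0, 20.0, 400_001)           # crude grid over dom f
f = np.exp(xs)

for y in (0.5, 1.0, 2.0, 5.0):
    approx = np.max(y * xs - f)                  # sup_x (yx - f(x)) approximated on the grid
    exact = y * np.log(y) - y
    print(y, approx, exact)                      # the two columns should agree closely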

Exercise 433 (Examples of Conjugates).

1. Consider the affine function f (x) = ax + b; then as a function of x, f˜ = yx − ax − b has derivative
df˜/dx = y − a, and is bounded iff y = a. Then dom f ∗ = {a} and f ∗ (a) = −b.

2. The negative log f (x) = − log x with domain R++ has f˜ = xy + log x with derivative df˜/dx = y + 1/x,
unbounded above if y ≥ 0, with maximum at x = −1/y otherwise. We have f ∗ (y) = − log(−y) − 1,
dom f ∗ = {y : y < 0}.

3. The exponential function f (x) = exp(x) has f˜ = xy − exp(x); df˜/dx = y − exp(x), and f˜ is unbounded
when y < 0. For y > 0, xy − exp(x) attains its maximum at x = log y, hence f ∗ (y) = y log y − y. When
y = 0, f ∗ (y) = sup_x (− exp(x)) = 0. So f ∗ (y) = y log y − y, dom f ∗ = R+ (let 0 log 0 = 0).

4. The negative entropy function f (x) = x log x has dom f = R+ (let 0 log 0 = 0). The function
f˜ = xy − x log x, with derivative df˜/dx = y − (1 + log x), is bounded from above on R+ for all y. It
has maximum at x = exp(y − 1), hence f ∗ (y) = exp(y − 1)y − exp(y − 1) log(exp(y − 1)) =
exp(y − 1)(y − y + 1) = exp(y − 1), dom f ∗ = R.

5. The inverse function f (x) = 1/x on R++ . f˜ = yx − 1/x has derivative df˜/dx = y + 1/x²; it is unbounded
above when y > 0 and has maximum at x = 1/√(−y) otherwise, so f ∗ (y) = y(−y)^(−1/2) − (−y)^(1/2) =
−(−y)(−y)^(−1/2) − (−y)^(1/2) = −2(−y)^(1/2) with dom f ∗ = −R+ .

6. The quadratic form f (x) = (1/2)x0 Qx, where Q ∈ Sn++ , has f˜ = y 0 x − (1/2)x0 Qx with derivative
df˜/dx = y − Qx, is bounded from above as a function of x for all y, and has maximum at x = Q−1 y.
Then f ∗ (y) = y 0 Q−1 y − (1/2)y 0 Q−1 QQ−1 y = (1/2)y 0 Q−1 y.

7. The log determinant function f (X) = log det X −1 , X ∈ Sn++ , has conjugate function defined

f ∗ (Y ) = sup_{X≻0} { tr(Y X) + log det X }, (1107)

since tr(Y X) = hY |Xi and det(XX −1 ) = det(X)det(X −1 ) = 1, so det(X −1 ) = (det X)−1 and
− log det X = log(det X)−1 = log det X −1 . tr(Y X) + log det X is unbounded from above unless Y ≺ 0.
To show this, suppose Y 6≺ 0; then Y has a nonnegative eigenvalue λ ≥ 0 with associated eigenvector v,
kvk2 = 1. Let X = 1 + tvv 0 , which is clearly in dom f , and see that

tr(Y X) + log det X = tr(Y (1 + tvv 0 )) + log det(1 + tvv 0 ) (1108)
= tr Y + t · tr(Y vv 0 ) + log det(1 + tvv 0 ) (1109)
= tr Y + t · tr(λvv 0 ) + log(1 + t) (1110)
= tr Y + tλ · tr(vv 0 ) + log(1 + t) (1111)
= tr Y + tλ · tr(v 0 v) + log(1 + t) (1112)
= tr Y + tλ + log(1 + t), (1113)

which is unbounded as t → ∞. Now suppose Y ≺ 0; then the maximizing value of X is found by setting

∇X (tr(Y X) + log det X) = Y + X −1 = 0, (1114)

where the gradient follows from Equation 912, so X = −Y −1 ≻ 0 and the conjugate is given by

f ∗ (Y ) = −n + log det(−Y )−1 , dom f ∗ = −Sn++ . (1115)

8. The indicator function 1S of arbitrary set S ⊆ Rn , defined


1S (x) = 0, dom 1S = S (1116)

has conjugate defined by

1∗S (y) = sup y0 x, (1117)


x∈S

same as the support function (Definition 235) for S.


9. The log-sum exponential function given by f (x) = log( Σi exp(xi ) ). The function y 0 x − f (x) is
bounded above only if y < 0, 10 y = 1. To see this, first see that df /dxi = exp(xi )/Σj exp(xj ) =: yi for
i ∈ [n]. Substituting into y 0 x − f (x), we get

Σi yi xi − log Σi exp(xi ) = Σi ( exp(xi )/Σj exp(xj ) ) xi − log Σi exp(xi ) (1118)
= Σi ( exp(xi )/Σj exp(xj ) ) ( log exp(xi ) − log Σk exp(xk ) ) (1119)
= Σi ( exp(xi )/Σj exp(xj ) ) log( exp(xi )/Σk exp(xk ) ) (1120)
= Σi yi log yi (1121)
= f ∗ (y), where 0 log 0 = 0. (1122)

To see dom f ∗ = {y < 0 : 10 y = 1}, suppose yk < 0 for some k-th component; then let xk = −t, xi = 0
when i 6= k, take t → ∞, and see that y 0 x − f (x) is unbounded from above. So we require y < 0.
Further suppose 10 y 6= 1; then choose x = t1, so that

y 0 x − f (x) = t10 y − log(n exp t) = t10 y − t − log n, (1123)

which is unbounded from above as t → ∞ when 10 y > 1 and as t → −∞ when 10 y < 1. Then the
conjugate function is

f ∗ (y) = Σi yi log yi if y < 0 and 10 y = 1, and f ∗ (y) = ∞ otherwise. (1124)

So it is the negative entropy function restricted to the probability simplex (Definition 218).

Theorem 283 (Proof of z 0 x ≤ kxkkzk∗ on Rn ).

Proof. For a norm k · k on Rn , the dual norm (Definition 229) is defined kzk∗ = sup{z 0 x : kxk ≤ 1}. For
x 6= 0 (the case x = 0 is trivial), see that

z 0 x = hz|xi = kxk hz| x/kxk i ≤ kxk sup_{kyk=1} hz|yi ≤ kxk sup_{kyk≤1} hz|yi = kxkkzk∗ . (1125)

Theorem 284. For norm k · k defined on the inner product space Rn , the conjugate of f (x) = kxk is
the indicator function of the dual norm k · k∗ unit ball, written
(
0 kyk∗ ≤ 1,
f ∗ (y) = (1126)
∞ else.

Note that the inequality is tight, in that for any x, ∃z s.t. equality holds, and for any z, ∃x s.t. equality
holds.

Proof. Recall definition of dual norms (Definition 229 kuk∗ = sup{u0 x : kxk ≤ 1}). Suppose kyk∗ > 1,
then ∃z ∈ Rn , kzk ≤ 1 s.t. y 0 z > 1. Take x = tz, t → ∞ and see that

y 0 x − kxk = t(y 0 z − kzk) (1127)

goes to infinity as t → ∞. On the other hand, Theorem 283 asserts that y 0 x ≤ kxkkyk∗ ≤ kxk for all x
when kyk∗ ≤ 1, and since the equality holds only when x = 0, we have supx (y 0 x − kxk) = 0.

Exercise 434. Prove that the function f (x) = (1/2)kxk² has conjugate function given by

f ∗ (y) = (1/2)kyk∗², (1128)

Proof. Since y 0 x ≤ kyk∗ kxk (see Theorem 283), then we may write

f ∗ (y) = sup_x ( y 0 x − (1/2)kxk² ) ≤ sup_x ( kyk∗ kxk − (1/2)kxk² ). (1129)

The RHS is quadratic in kxk, so taking the derivative d(RHS)/dkxk = kyk∗ − kxk, the maximum is
attained at kxk = kyk∗ with value kyk∗ kyk∗ − (1/2)kyk∗² = (1/2)kyk∗², hence f ∗ (y) ≤ (1/2)kyk∗². To
show that equality holds: since the inequality of Theorem 283 is tight, there is some vector x s.t.
y 0 x = kyk∗ kxk; scale it such that kxk = kyk∗ , then we have

y 0 x − (1/2)kxk² = (1/2)kyk∗², (1130)

so f ∗ (y) = sup_x ( y 0 x − (1/2)kxk² ) ≥ (1/2)kyk∗² and we are done.


To see how conjugate functions come up in optimization, consider a manufacturing process with
resource consumption vector r and price vector p, and let the sales be measured by S(r). Then the profit
derived is S(r) − p0 r, and the management pursues optimization M (p) = supr {S(r) − p0 r}. See that we
may write M (p) as

(−S)∗ (−p) = sup_r { (−p)0 r − (−S)(r) }. (1131)

Theorem 285 (Fenchel's Inequality). Since f ∗ (y) = sup_x {y 0 x − f (x)}, the Fenchel inequality (also
known as Young's inequality) is written

f (x) + f ∗ (y) ≥ x0 y, ∀x, ∀y. (1132)

For instance, with reference to Exercise 433, when Q ≻ 0, for f (x) = (1/2)x0 Qx, we may write

x0 y ≤ (1/2)x0 Qx + (1/2)y 0 Q−1 y. (1133)
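A numerical spot check of Fenchel's inequality for this quadratic example (a sketch assuming numpy; the matrix Q is randomly generated only for illustration):

import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
Q = A @ A.T + np.eye(n)                        # strictly positive definite
Qinv = np.linalg.inv(Q)

worst = np.inf
for _ in range(10_000):
    x, y = rng.standard_normal((2, n))
    worst = min(worst, 0.5 * x @ Q @ x + 0.5 * y @ Qinv @ y - x @ y)
print(worst >= -1e-12)                         # True: f(x) + f*(y) >= x'y always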
Result 19. If f is convex, and epi f is closed, then f ∗∗ = f .

Exercise 435 (Legendre Transform). The conjugate of a differentiable function f is also called the
Legendre transform of f . If f is convex and differentiable, with dom f = Rn , then any maximizer x̃ of
y 0 x − f (x) satisfies y = ∇f (x̃) (by simple calculus), and if x̃ satisfies y = ∇f (x̃), it is a maximizer of
y 0 x − f (x). Let y → ∇f (x̃), and we can write

f ∗ (y) = sup{x0 y − f (x)} = x̃0 ∇f (x̃) − f (x̃), (1134)


x

or more simply for z ∈ Rn , y = ∇f (z), we get f ∗ (y) = z 0 ∇f (z) − f (z).

Result 20. For a > 0, b ∈ R, conjugate of g(x) = af (x) + b is given by


g ∗ (y) = af ∗ (y/a) − b. (1135)
Result 21. If f (u, v) = f1 (u) + f2 (v), and f1 , f2 are convex, then

f ∗ (w, z) = f1∗ (w) + f2∗ (z). (1136)

4.5.4 Quasiconvex Functions


Definition 239 (Quasiconvex/concave functions). A function f : Rn → R is said to be quasiconvex if
its domain and (see sublevel sets, Definition 232, not to be confused with hypographs, Definition 234)

∀α ∈ R, Sα = {x ∈ dom f : f (x) ≤ α} (1137)

are convex. f is said to be quasiconcave if −f is quasiconvex, that is when for every value of α, the
superlevel set {x : f (x) ≥ α} is convex.

Definition 240 (Quasilinear Functions). A function that is both quasiconvex and quasiconcave (Defi-
nition 239) is said to be quasilinear. For quasilinear function f , its dom f is convex and every level set
{x : f (x) = α} is convex.

Quasiconvex function on R requires that each sublevel set is (possibly infinite) interval. Convex
functions have convex sublevel sets (Theorem 276), and are therefore quasiconvex, but the converse is
not true.

Exercise 436. Some examples of quasiconvex functions on R,

1. The log function f (x) = log x, dom f = R++ is concave, quasiconvex and quasiconcave.

2. The ceiling function, f (x) = inf{z ∈ Z : z ≥ x} is quasiconvex and quasiconcave.

and other examples on Rn ,

1. The largest index of nonzero component in vector, f (x) = max{i : xi 6= 0} is quasiconvex. The
sublevel sets are subspaces, since f (x) ≤ α ⇐⇒ xi = 0 for i = bαc + 1, · · · n and we can verify
that for ∀α, ∀x1 , x2 ∈ Sα , that λx1 ∈ Sα and x1 + x2 ∈ Sα . The sublevel sets are subspaces.

2. The function f : R2 → R, dom f = R2+ , defined f (x1 , x2 ) = x1 x2 , has Hessian matrix ∇²f (x) =
[[0, 1], [1, 0]], so the eigenvalues solve det([[λ, −1], [−1, λ]]) = λ² − 1 = 0, which has roots ±1. The function is
neither convex nor concave, but it is quasiconcave since the superlevel set {x ∈ R2+ : x1 x2 ≥ α} is
convex for any α.
3. The linear fractional (Definition 222) function f (x) = (a0 x + b)/(c0 x + d), dom f = {x : c0 x + d > 0}, is
quasiconvex and quasiconcave. To show quasiconvexity, see that

Sα = {x : c0 x + d > 0, a0 x + b ≤ α(c0 x + d)} (1138)

is the intersection of an open halfspace and a closed halfspace, where halfspaces are convex (Definition 208)
and intersection preserves convexity (Theorem 262). Quasiconcavity is shown in a similar way.
4. The ratio of Euclidean distances to fixed points a, b ∈ Rn , written f (x) = kx − ak2 /kx − bk2 , is quasiconvex on
the halfspace {x : kx − ak2 ≤ kx − bk2 }. Clearly, on this halfspace, f (x) ≤ 1, and see that for any
α ≤ 1,

Sα = {x ∈ dom f : kx − ak2 ≤ αkx − bk2 }. (1139)

Squaring both sides of

kx − ak2 ≤ αkx − bk2 (1140)

gives

(x − a)0 (x − a) ≤ α²(x − b)0 (x − b) (1141)
(x0 − a0 )(x − a) ≤ α²(x0 − b0 )(x − b) (1142)
x0 x + a0 a − 2a0 x ≤ α²(x0 x + b0 b − 2b0 x) (1143)

so

(1 − α²)x0 x − 2(a − α²b)0 x + a0 a − α²b0 b ≤ 0. (1144)

See that the Euclidean ball (Definition 209) {x : (x − xc )0 (x − xc ) ≤ r²} may be expressed

(x0 − x0c )(x − xc ) ≤ r² (1145)
x0 x − 2x0c x + x0c xc − r² ≤ 0, (1146)

so Equation 1144 represents a Euclidean ball, which is convex.

5. The internal rate of return: x = (x0 , · · · , xn ) is the cash flow over n periods, where xi ∈ R represents
an inflow/outflow of cash for positive/negative values of xi . The present/discounted value of the
cash flow (assume r ≥ 0) is the value

P V (x, r) = Σ_{i=0}^{n} (1 + r)^{−i} xi . (1147)

Let x0 < 0 and Σ_{i=0}^{n} xi > 0 be the cash flow received from an investment of −x0 in some asset at
time zero. Then the cash flow satisfies P V (x, 0) > 0 and P V (x, r) → x0 < 0 as r → ∞, so there ∃r ≥ 0 s.t.
P V (x, r) = 0. The smallest r ≥ 0 s.t. the present value is zero is known as the internal rate of return
(IRR), denoted

IRR(x) = inf{r ≥ 0 : P V (x, r) = 0}. (1148)

IRR is a quasiconcave function, given x0 < 0, Σ_{i=1}^{n} xi > 0. See this by writing

IRR(x) ≥ R ⇐⇒ P V (x, r) > 0, ∀r ∈ [0, R), (1149)

where the LHS describes the R-superlevel set of IRR, and the RHS is the intersection of {x : P V (x, r) > 0}
indexed by r on [0, R), the intersection of an infinite number of open halfspaces (convex). A small
numerical sketch follows.
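(A sketch assuming numpy; the cash flows are invented for illustration. IRR is computed by bisection on P V , which is valid here because P V (x, r) is decreasing in r for these cash flows, and the quasiconcavity inequality of Equation 1152 is then spot-checked at a midpoint.)

import numpy as np

def pv(x, r):
    return sum(xi / (1.0 + r) ** i for i, xi in enumerate(x))

def irr(x, hi=1e6, tol=1e-10):
    lo = 0.0                                  # assumes pv(x, 0) > 0 > pv(x, hi), x0 < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if pv(x, mid) > 0 else (lo, mid)
    return lo

x = np.array([-1.0, 0.3, 0.4, 0.5])           # invest 1, receive 0.3, 0.4, 0.5
y = np.array([-1.0, 0.9, 0.2, 0.1])
print(irr(x), irr(y), irr((x + y) / 2) >= min(irr(x), irr(y)))   # last entry: True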

Theorem 286 (Jensen’s Inequality on Quasiconvex Functions). A function f is quasiconvex iff dom f
is convex and

∀x, y ∈ dom f, θ ∈ [0, 1], f (θx + (1 − θ)y) ≤ max{f (x), f (y)}. (1150)

A function f is quasiconcave iff dom f is convex and ∀x, y ∈ dom f, θ ∈ [0, 1],

−f (θx + (1 − θ)y) ≤ max{−f (x), −f (y)} = − min{f (x), f (y)} (1151)


⇐⇒ f (θx + (1 − θ)y) ≥ min{f (x), f (y)}. (1152)

As with convexity, a function is quasiconvex iff its restriction to any line intersecting its domain is
quasiconvex. Formally, f is quasiconvex iff ∀x ∈ dom f, ∀v, g(t) = f (x + tv) is quasiconvex.

Exercise 437. Let the cardinality of a vector x ∈ Rn be equal to the number of non-zero components
denoted |x|. Then | · | is quasiconcave on Rn+ , by Equation 1152 since

|θx + (1 − θ)y| ≥ min{|x|, |y|} (1153)

for all x, y < 0.


n
Exercise 438. For X, Y ∈ S+ , the rank function is quasiconcave, since Exercise 304 asserts that
n
rank(X + Y ) ≥ min{rank(X), rank(Y )}, ∀X, Y ∈ S+ . (1154)

Theorem 287 (Quasiconvex (continuous) function on R). A continuous function f : R → R is quasiconvex
iff any of the following statements hold: a) f is nondecreasing, b) f is nonincreasing, c) ∃c ∈ dom f s.t.
f is nonincreasing on t ≤ c and nondecreasing on t ≥ c.

Theorem 288 (First Order conditions for Quasiconvexity). If f : Rn → R is differentiable, then f


quasiconvex iff dom f is convex and

∀x, y ∈ dom f, f (y) ≤ f (x) =⇒ ∇f (x)0 (y − x) ≤ 0. (1155)

When ∇f (x) 6= 0, then ∇f (x) is the normal defining a supporting hyperplane to sublevel set {y : f (y) ≤
f (x)} at x.

Theorem 289 (Second Order conditions for Quasiconvexity). If f : Rn → R is twice differentiable, then
if f is quasiconvex, then for all x ∈ dom f, y ∈ Rn ,

y 0 ∇f (x) = 0 =⇒ y 0 ∇2 f (x)y ≥ 0. (1156)

In the case when n = 1, we have f 0 (x) = 0 =⇒ f 00 (x) ≥ 0. Whenever ∇f (x) 6= 0 and is orthogonal to
some vector y, ∇2 f (x) is positive semidefinite. This means ∇2 f (x) is positive semidefinite on (n − 1)
dimensional subspace ∇f (x)⊥ . Since Rn = span(∇f (x)) ⊕ ∇f (x)⊥ , ∇2 f (x) has at most one negative
eigenvalue. The partial converse states that if f satisfies y 0 ∇f (x) = 0 =⇒ y 0 ∇2 f (x)y > 0 (note the
inequality is strict) for all x ∈ dom f, y ∈ Rn , then f is quasiconvex function.

Proof. Considering the restriction to a line, we only look at the case f : R → R. Suppose f : R → R is
quasiconvex on an open interval (a, b), and that f 0 (c) = 0. Suppose it is not the case that f 00 (c) ≥ 0, that is
f 00 (c) < 0; then f 0 (c+ ) < 0, f 0 (c− ) > 0, and it follows that ∃ small ε s.t. f (c − ε) < f (c), f (c + ε) < f (c) (the
graph is an inverted-‘U' at c). Then the sublevel set {x : f (x) ≤ f (c) − δ}, for sufficiently small δ > 0, is
disconnected near c and clearly non-convex, but f is assumed quasiconvex and we have a contradiction. It
follows that f 00 (c) ≥ 0. For the partial converse statement, suppose ∀c ∈ (a, b), f 0 (c) = 0 =⇒ f 00 (c) > 0.
Then f 0 (c+ ) > 0, f 0 (c− ) < 0. If f 0 does not cross zero, then f is either nonincreasing or nondecreasing on
the interval (a, b); otherwise, if it crosses zero, then f 0 (t) ≤ 0 on (a, c], f 0 (t) ≥ 0 on [c, b), and Theorem 287
asserts that f is quasiconvex.

4.5.5 Preservation of Quasiconvexity


Theorem 290 (Nonnegative Weighted Maximums). f = maxi∈[m] {wi fi } is quasiconvex when wi ≥
0, fi quasiconvex, and this extends to f (x) = sup_{y∈C} w(y)g(x, y) given w(y) ≥ 0 and g(x, y) quasiconvex in x
for each y. We can see this by writing

f (x) ≤ α ⇐⇒ w(y)g(x, y) ≤ α ∀y ∈ C. (1157)

That is, α-sublevel set of f is just the intersection of the α-sublevel sets of the functions w(y)g(x, y) in
x.

Result 22 (Composition). If g : Rn → R is quasiconvex and h : R → R is nondecreasing, then f = h ◦ g
is quasiconvex. Suppose f is quasiconvex. The composition of a quasiconvex function with an affine function,
g(x) = f (Ax + b), is quasiconvex. The composition of a quasiconvex function with a linear fractional function
(Definition 222), g̃(x) = f ( (Ax + b)/(c0 x + d) ), dom g̃ = {x : c0 x + d > 0, (Ax + b)/(c0 x + d) ∈ dom f }, is quasiconvex.

Minimization preservation of convexity was shown in Theorem 282.

Theorem 291 (Minimization and Infimum Quasiconvexity). If f (x, y) is jointly quasiconvex in (x, y)
and C is convex, then g(x) = inf y∈C f (x, y) is quasiconvex.

Proof. Clearly

g(x) ≤ α ⇐⇒ ∀ε > 0, ∃y ∈ C s.t. g(x) = inf_{ỹ∈C} f (x, ỹ) ≤ f (x, y) ≤ α + ε. (1158)

Let x1 , x2 be in the α-sublevel set of g; then

∀ε > 0, ∃yi ∈ C, i = 1, 2, s.t. f (x1 , y1 ) ≤ α + ε, f (x2 , y2 ) ≤ α + ε. (1159)

Since f is quasiconvex, by Theorem 286 we have

f (θx1 + (1 − θ)x2 , θy1 + (1 − θ)y2 ) ≤ max{f (x1 , y1 ), f (x2 , y2 )} ≤ α + ε, (1160)

so g(θx1 + (1 − θ)x2 ) ≤ α + ε for every ε > 0, hence g(θx1 + (1 − θ)x2 ) ≤ α, and it follows that the sublevel
set {x : g(x) ≤ α} is convex; g is quasiconvex.

Sublevel sets of quasiconvex functions can be represented as inequalities of convex functions. It is


always possible to find a family of convex functions φt : Rn → R, s.t. f (x) ≤ t ⇐⇒ φt (x) ≤ 0. Clearly,
for all x ∈ Rn , the inequality suggests that φt satisfies

φt (x) ≤ 0 =⇒ ∀s ≥ t, φs (x) ≤ 0, (1161)

since φt (x) ≤ 0 =⇒ f (x) ≤ t ≤ s =⇒ φs (x) ≤ 0. This is satisfied for any φ s.t. for each x, φt (x) is
nonincreasing function of t, s.t. s ≥ t =⇒ φs (x) ≤ φt (x). Such an example is
(
0 f (x) ≤ t,
φt (x) = (1162)
∞ else.

Other representations exist, and we are interested in those where φt exhibit nice properties, such as
differentiability.

Exercise 439. Suppose p is convex, q is concave and p(x) ≥ 0, q(x) > 0 on convex set C. Then
f (x) = p(x)/q(x) is quasiconvex on C, since we may write

f (x) ≤ t ⇐⇒ p(x) − tq(x) ≤ 0 (1163)

and φt (x) = p(x) + t(−q)(x) is sum of two convex functions for t ≥ 0 decreasing in t. The case when
t < 0 is never evaluated since p(x) ≥ 0, q(x) > 0 on C by definition.

4.5.6 Log-concave/convex functions


Definition 241 (Log-concave/convex). A function f : Rn → R is said to be log concave if ∀x ∈ dom f ,
f (x) > 0 and log f is concave. It is log-convex if log f is convex. See that f is log-convex iff − log f =
log f −1 = log(1/f ) is concave, iff 1/f is log-concave. For convenience, often f is allowed to take on the
value zero, which we assign log f (x) = log 0 = −∞. Then f is log-concave if the extended value log f
is concave. Since exp(h) is convex if h is convex (Result 14), so log-convex functions are convex. A
nonnegative, concave function is log-concave (Result 17). Log-convex functions are convex, so they are
quasiconvex. For log-concave function f (x), the superlevel sets of log f (x) are convex, and since log is
monotone, the superlevel sets of f (x) are convex and log-concave functions are quasiconcave (Definition
239).

Theorem 292 (Log-concavity inequality). For f : Rn → R, with dom f convex and f (x) > 0 on dom f ,
f is log-concave iff ∀x, y ∈ dom f, θ ∈ [0, 1],

f (θx + (1 − θ)y) ≥ f (x)θ f (y)1−θ . (1164)

The value of a log-concave function at the mixture of two points is at least the geometric mean of the
values of the two points.

Proof. If log f is concave, then

log f (θx + (1 − θ)y) ≥ θ log{f (x)} + (1 − θ) log{f (y)} (1165)


θ (1−θ)
log f (θx + (1 − θ)y) ≥ log{f (x) } + log{f (y) } (1166)

and take exponential to get the result.

Exercise 440. -

1. The affine function f (x) = a0 x + b is log-concave on {x : a0 x + b > 0}.

2. For f (x) = xα on R++ , f is log-convex when α ≤ 0 and log-concave when α ≥ 0.

3. f (x) = exp(αx) is both log-convex and log-concave (log-linear).

4. The (standard) Gaussian c.d.f.

Φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−u²/2) du (1167)

is log-concave.

5. The Gamma function

Γ(x) = ∫_0^∞ u^{x−1} exp(−u) du (1168)

is log-convex for x ≥ 1.

6. det X is log-concave on Sn++ .

7. det X/ tr(X) is log-concave on Sn++ .

The multivariate normal distribution, exponential distribution, uniform distribution and Wishart
distribution over convex sets all have log-concave density functions (verify this).

Theorem 293. If f is twice differentiable with convex domain, then by chain rule, we have (see Equation
900)
∇² log f (x) = (1/f (x)) ∇²f (x) − (1/f (x)²) ∇f (x)∇f (x)0 (1169)
and therefore f is log-convex iff for all x ∈ dom f , f (x)∇2 f (x) < ∇f (x)∇f (x)0 , and log-concave iff for
all x ∈ dom f , f (x)∇2 f (x) 4 ∇f (x)∇f (x)0 .

Theorem 294 (Preservation of log-concavity and log-convexity). Log-convexity and log-concavity are
closed under multiplication and positive scaling. If f, g log concave, then the pointwise product h = f · g
evaluating h(x) = f (x)g(x) is log-concave. We can see this easily, since log f, log g is concave, so
log f + log g = log f · g is concave, so f · g is log-concave. The sum of log-concave functions is not in
general log-concave. The sum of log-convex functions are log-convex, and this is seen by writing

log(f + g) = log(exp log f + exp log g) (1170)

and seeing this is composition matched by Equation 1088. This is extended to integrals, such that if
f (x, y) is log-convex in x for each y ∈ C, then
g(x) = ∫_C f (x, y) dy (1171)

is log-convex.

Exercise 441. Suppose p : Rn → R and p(x) ≥ 0 for all x. The Laplace transform of p, defined

P (z) = ∫ p(x) exp(−z′x) dx, dom P = {z : P (z) < ∞}, (1172)

is log-convex on Rn. If p is the density function of a random variable v, then ∫ p(x) dx = 1 and the function
M (z) = P (−z) = ∫ p(x) exp(z′x) dx = E exp(z′v) is said to be the moment generating function, and we
have

∇M (0) = Ev, ∇²M (0) = Evv′. (1173)

The function log M (z) is convex, and is said to be the cumulant generating function for p. The derivatives
of the cumulant generating function give the cumulants of the density, with (verify this)

∇ log M (0) = Ev, ∇² log M (0) = E(v − Ev)(v − Ev)′. (1174)
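The cumulant relations in Equation 1174 can be checked numerically in one dimension. The sketch below is an assumption-laden illustration, not part of the text: it takes v ~ Exp(1), whose moment generating function M(z) = 1/(1 − z) for z < 1 is known in closed form, and approximates the derivatives of log M at zero by finite differences. Both outputs should be ≈ 1, matching Ev = 1 and Var v = 1.

import numpy as np

M = lambda z: 1.0 / (1.0 - z)          # MGF of an Exponential(1) random variable, z < 1
logM = lambda z: np.log(M(z))          # cumulant generating function

h = 1e-5
first_cumulant = (logM(h) - logM(-h)) / (2 * h)                   # ≈ Ev = 1
second_cumulant = (logM(h) - 2 * logM(0.0) + logM(-h)) / h ** 2   # ≈ Var(v) = 1
print(first_cumulant, second_cumulant)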

Result 23. If f : Rn × Rm → R is log-concave, then

g(x) = ∫ f (x, y) dy (1175)

is a log-concave function of x on Rn (the integral being over y ∈ Rm).

Theorem 295. Marginal distributions of log-concave probability distributions are log-concave.

Proof. This follows directly from Result 23.

Theorem 296. For W = X + Y with X, Y independent, let f, g be log-concave density functions for X, Y
respectively on Rn. Then their convolution (Theorem 359)

q_W (x) = ∫ f (x − y)g(y) dy (1176)

is log-concave: f (x − y) and g(y) are log-concave in (x, y), so f (x − y)g(y) is log-concave (product of
log-concave functions), and so is its integral over y by Result 23.
Exercise 442. Suppose C ⊆ Rn is a convex set and w is a random vector in Rn with density function p. If p
is log-concave, then

f (x) = P{x + w ∈ C} (1177)

is log-concave in x, since we may write

f (x) = E[1_C (x + w)] = ∫ p(w) 1_C (x + w) dw, (1178)

where the indicator

1_C (u) = 1 if u ∈ C, 0 if u ∉ C (1179)

is log-concave (C is convex), so the integrand is log-concave in (x, w); applying Result 23 gives the claim.
Exercise 443. For a density function f of a random variable w, the c.d.f. F : Rn → R defined

F (x) = P(w ⪯ x) = ∫_{−∞}^{x_n} · · · ∫_{−∞}^{x_1} f (z) dz_1 · · · dz_n (1180)

is log-concave if f (z) is log-concave.


Exercise 444. The yield of a manufacturing process is given by

y(x) = P(x + w ∈ S), (1181)

where x is the nominal parameter specification, w is a zero-mean random vector of manufacturing errors and
S is the set of products satisfying specification requirements. If the manufacturing error w ∼ f is modelled
by a log-concave density and S is convex, then y is log-concave by Exercise 442. The α-yield region is
defined to be the superlevel set of the yield, and since we can write

{x : y(x) ≥ α} = {x : log y(x) ≥ log α}, (1182)

the α-yield region must be convex (the RHS is a superlevel set (Definition 232) of the concave function log y).
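A small Monte Carlo sketch of the yield function y(x) = P{x + w ∈ S} of Exercise 444: the box S, the Gaussian error scale and the grid of nominal points below are arbitrary illustrative choices, not values from the text. Since the Gaussian density is log-concave and S is convex, the estimated yield should behave quasiconcavely (no dip-and-recover) along a line of nominal values.

import numpy as np

rng = np.random.default_rng(1)
lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])   # S = {u : lo <= u <= hi}, convex

def yield_estimate(x, n=200_000, sigma=0.5):
    w = sigma * rng.standard_normal((n, 2))              # zero-mean, log-concave errors
    inside = np.all((x + w >= lo) & (x + w <= hi), axis=1)
    return inside.mean()

for t in np.linspace(-1.5, 1.5, 7):                      # sweep the nominal point along a line
    print(round(t, 2), yield_estimate(np.array([t, 0.0])))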
Exercise 445. Let A ∈ Rm×n, and consider the polyhedron (Definition 213) defined P_u = {x ∈ Rn : Ax ⪯ u}.
Its volume, as a function of u, vol P_u, is log-concave. See this by defining the indicator function

Φ(x, u) = 1 if Ax ⪯ u, 0 else, (1183)

which is log-concave in (x, u) (it is the indicator of a convex set). Then Result 23 asserts that
vol P_u = ∫ Φ(x, u) dx is log-concave in u.

4.5.7 Extension of Convexity to Generalized Inequalities
Definition 242 (Monotonicity under Generalized Inequality). Let K ⊆ Rn be a proper cone (Definition
223). Then, w.r.t. the generalized inequality (Definition 224) induced by the proper cone, a function
f : Rn → R is said to be K-nondecreasing if

x 4K y =⇒ f (x) ≤ f (y). (1184)

It is said to be K-increasing if

x 4K y, x 6= y =⇒ f (x) < f (y). (1185)

It is said to be K-nonincreasing if

x 4K y =⇒ f (x) ≥ f (y). (1186)

It is said to be K-decreasing if

x 4K y, x 6= y =⇒ f (x) > f (y). (1187)

Exercise 446 (Monotone Vector Function). A function f : Rn → R is said to be nondecreasing w.r.t


the nonnegative orthant Rn+ iff

x 4 y =⇒ f (x) ≤ f (y). (1188)

Exercise 447 (Monotone Matrix Function). A function f : S n → R is said to be a matrix monotone
function if it is monotone w.r.t. S^n_+. Let X ∈ S n, then

1. tr(W X), W ∈ S n, is matrix nondecreasing if W ⪰ 0, matrix increasing if W ≻ 0, matrix nonincreasing
if W ⪯ 0 and matrix decreasing if W ≺ 0.

2. tr(X^{−1}) is matrix decreasing on S^n_{++}.

3. det X is matrix increasing on S^n_{++} and matrix nondecreasing on S^n_+.

A differentiable function f : R → R with dom f convex is nondecreasing iff f ′(x) ≥ 0 on dom f , and
increasing if f ′(x) > 0 on dom f . These statements can be extended to generalized inequalities. The
'iff' does not hold for the strictly increasing/decreasing assertions; there the derivative conditions are only sufficient.

Theorem 297. A differentiable function f , dom f convex, is K-nondecreasing iff ∇f (x) ⪰K∗ 0
for all x ∈ dom f . If ∇f (x) ≻K∗ 0 for all x ∈ dom f , then f is K-increasing.

Proof. Suppose ∇f (x) ⪰K∗ 0 for all x, but ∃x, y s.t. x ⪯K y and f (y) < f (x). Then ∃t ∈ [0, 1] s.t.
(d/dt) f (x + t(y − x)) < 0, and by the chain rule the LHS is ∇f (x + t(y − x))′(y − x). Since x ⪯K y, y − x ∈ K
(Definition 224), and ∇f (x + t(y − x))′(y − x) < 0 implies ∇f (x + t(y − x)) ∉ K∗ by Definition 228. But
since we have asserted ∇f (x) ⪰K∗ 0 for all x, this is a contradiction and f must be K-nondecreasing. A
similar proof shows the strict generalized inequality implies f is K-increasing.
On the other hand, suppose that f is differentiable, dom f is convex and f is K-nondecreasing. Now
suppose ∃z s.t. ∇f (z) ⋡K∗ 0; then by Equation 979, ∇f (z) − 0 = ∇f (z) ∉ K∗. By definition of the dual
cone, ∃v ∈ K s.t. ∇f (z)′v < 0. Then

h(t) = f (z + tv) (1189)

has derivative h′(t) = ∇f (z + tv)′v, which evaluates to h′(0) = ∇f (z)′v < 0, so ∃t > 0 s.t. h(t) = f (z + tv) <
h(0) = f (z), and f cannot be K-nondecreasing. This is a contradiction, so ∇f (z) ⪰K∗ 0 for all z.

Definition 243 (Jensen’s Generalized Inequality). Let K ⊆ Rm be a proper cone (Definition 223), then
f : Rn → Rm is said to be K-convex if ∀x, y ∈ dom f , θ ∈ [0, 1],

f (θx + (1 − θ)y) ⪯K θf (x) + (1 − θ)f (y). (1190)

It is said to be strictly K-convex if the inequality is replaced with the strict generalized inequality ≺K , for all
x ≠ y, θ ∈ (0, 1).

Exercise 448. Here are some examples of the Jensen’s Generalized Inequality (Definition 243) w.r.t the
generalized inequality induced by a cone K.

1. f : Rn → Rm is convex w.r.t Rm
+ iff for all x, y, θ ∈ [0, 1], we have

f (θx + (1 − θ)y) 4 θf (x) + (1 − θ)f (y). (1191)

2. f : Rn → S m is said to be matrix convex if

f (θx + (1 − θ)y) ⪯_{S^m_+} θf (x) + (1 − θ)f (y). (1192)

An equivalent definition is that g(x) = z′f (x)z is convex (in x) for every fixed z. Some examples (verify this).

(a) f (X) = XX′, where X ∈ Rn×m, is matrix convex, since ∀z, z′XX′z = ‖X′z‖²₂ is a convex
quadratic function of X. f (X) = X² is matrix convex on S n by similar reasoning.
(b) f (X) = X^p is matrix convex on S^n_{++} for p ∈ [−1, 0] ∪ [1, 2], and matrix concave for p ∈ [0, 1].
(c) f (X) = exp(X) is not matrix convex on S n for n ≥ 2.

As with convex functions, f is K-convex iff its restriction to any line intersecting its domain is K-convex;
see Theorem 269. We also have a dual characterization: f is K-convex iff for all w ⪰K∗ 0, the real-valued
function w′f is convex; f is strictly K-convex iff for all w ⪰K∗ 0, w ≠ 0, the function w′f is strictly convex.
A differentiable function f is K-convex iff its domain is convex and for all x, y ∈ dom f , we have

f (y) ⪰K f (x) + Df (x)(y − x). (1193)

See Equation 1014 for comparison and Equation 889 for the matrix Taylor expansion. The function f is
strictly K-convex iff for all x, y ∈ dom f , x ≠ y, we have

f (y) ≻K f (x) + Df (x)(y − x). (1194)

See Theorem 280 and Theorem 281. Many of the composition results also extend: for K-convex g : Rn → Rp
and convex h : Rp → R whose extended-value extension h̃ is K-nondecreasing, the composition h ◦ g is convex.

Exercise 449. The quadratic matrix function g : Rm×n → S n defined

g(X) = X′AX + B′X + X′B + C, A ∈ S m, B ∈ Rm×n, C ∈ S n (1195)

is matrix convex if A ⪰ 0. The function h(Y ) = − log det(−Y ) is convex and increasing on dom h = −S^n_{++}
(Exercise 416). The composition rule asserts that

f (X) = − log det(−(X′AX + B′X + X′B + C)), dom f = {X ∈ Rm×n : X′AX + B′X + X′B + C ≺ 0},

is convex.

4.6 Convex Optimization Problems
A high-level overview of mathematical optimization and its convex instances was given in Sections 4.2 and
4.3. A more formal treatment is given here.

Definition 244 (Mathematical Optimization Problem). A mathematical optimization problem can be
written

minimize f0 (x) (1196)
subject to fi (x) ≤ 0, i ∈ [m], (1197)
hi (x) = 0, i ∈ [p]. (1198)

Here, x ∈ Rn is the optimization variable, f0 : Rn → R is the objective/cost function, and we have m inequality
constraints and p equality constraints. When m = p = 0, the problem is said to be an unconstrained
problem. The domain D of the optimization problem is defined implicitly by the functions, in particular
D = ∩_{i=0}^{m} dom fi ∩ ∩_{i=1}^{p} dom hi.

Definition 245 (Terminologies Related to Mathematical Optimization). For an optimization problem formulated
as in Definition 244 with domain D, a point x ∈ D is said to be feasible if all the constraints are satisfied.
The set of all feasible points is said to be feasible set, and we write this

F = {x ∈ D : fi (x) ≤ 0, i ∈ [m], hi (x) = 0, i ∈ [p]}. (1199)

The problem is feasible if F ≠ ∅. The optimal value of the problem is

p∗ = inf{f0 (x) : x ∈ F} if F ≠ ∅, and p∗ = ∞ if F = ∅. (1200)

If p∗ is −∞, that is when there are feasible points xk s.t. f0 (xk ) → −∞ as k → ∞, we say that the
problem is unbounded below. If x∗ ∈ F and f0 (x∗ ) = p∗ , then we call this an optimal point, and the set
of optimal points is written

Xopt = {x : x ∈ F ∧ f0 (x) = p∗ }. (1201)

If there exists an optimal point for the optimization problem, then we say that the optimal value is
attained/achieved, and the problem is said to be solvable. If Xopt = ∅, then we say that the optimal value
is not attained/achieved. Trivially, if F = ∅, the optimal value is not achieved since Xopt is empty. The same
is true when the problem is unbounded below.
A feasible point x satisfying f0 (x) ≤ p∗ + ε is said to be ε-suboptimal, and the set of all such points is the
ε-suboptimal set. A feasible point is locally optimal if ∃R > 0 s.t.

f0 (x) = inf{f0 (z) : z ∈ F, kz − xk2 ≤ R}, (1202)

and this solves the optimization problem

minimize f0 (z) (1203)


subject to fi (z) ≤ 0, i ∈ [m] (1204)
hi (z) = 0, i ∈ [p] (1205)
kz − xk2 ≤ R. (1206)

If x ∈ F, fi (x) = 0, the i-th inequality constraint fi (x) ≤ 0 is said to be active at x, and if fi (x) < 0, it
is said to be inactive. Trivially, the equality constraint functions hi (x) are active at all feasible points.
A constraint is redundant if removing it does not change the feasible set.
When the objective function is identically zero, the optimal value is either zero when F 6= ∅ or infinity
when F = ∅. Then the objective function statement is often changed from minimize f0 (z) → find x, and
we say this is a feasibility problem.

Consider dom f0 = R++. An example of a problem where the optimal value is not achieved (see
Definition 245) is f0 (x) = 1/x, with p∗ = 0. An example where the problem is unbounded below is
f0 (x) = − log x, with p∗ = −∞.
The formulation given in Definition 244 is said to be the standard form. Note that this requires the RHS
of the constraint functions to be zero. Clearly, general inequality and equality statements that do
not have a zero RHS may be transformed to adhere to the standard form by moving all terms to the LHS. Note also the direction of the
inequality (≤ for the inequality constraints).

Exercise 450. The box constraints/variable bounds often appear in optimization problems. For instance,
the optimization problem

minimize f0 (x) (1207)


subject to li ≤ xi ≤ ui , i ∈ [n] (1208)

may be written in standard form by writing

minimize f0 (x) (1209)


subject to li − xi ≤ 0, i = 1, · · · , n (1210)
xi − ui ≤ 0 i = 1, · · · , n (1211)

with 2n inequality constraint functions.

Clearly, the maximization problem can be solved by solving the minimization problem where the
objective function is negated (i.e. −f0) subject to the same constraints. The optimal value is then
p∗ = − inf{−f0 (x) : x ∈ F} = sup{f0 (x) : x ∈ F}, where F is the feasible set. A feasible point x is said to be ε-suboptimal if
f0 (x) ≥ p∗ − ε, as we might expect. The objective function is often said to be the utility function in the
maximisation setting.

Definition 246 (Equivalent Problem). Two problems (Definition 244) are said to be equivalent if the solution
to one problem can be derived from solving the other problem, and vice versa. An optimal point is found
for one problem iff an optimal point is found for the other problem.

Exercise 451 (Examples of Equivalent Problems). A number of transformations on the objective and
constraint functions to a problem result in equivalent problems. Some common examples are studied. It
should be verified that in each case, the feasible set is unchanged (through an invertible map), and that
the solution to one problem is necessary and sufficient for the other.

1. The problem

minimize f˜(x) = α0 f0 (x) (1212)


subject to f˜i (x) = αi fi (x) ≤ 0, i = 1, · · · , m (1213)
h̃i (x) = βi hi (x) = 0, i = 1, · · · , p, (1214)

where αi > 0, i ∈ [0, m], βi 6= 0, i ∈ [p] is an equivalent problem.

2. Suppose φ : Rn → Rn is injective and that φ(dom φ) ⊇ D. Let f̃i (z) = fi (φ(z)) for i ∈ [0, m] and
h̃i (z) = hi (φ(z)) for i ∈ [p], then the problem

minimize f̃0 (z) (1215)
subject to f̃i (z) ≤ 0, i = 1, · · · , m, (1216)
h̃i (z) = 0, i = 1, · · · , p (1217)

is an equivalent problem; the two are related by the change of variable x = φ(z).

3. If ψ0 : R → R is monotone increasing and ψi : R → R, i ∈ [1, m] satisfy ψi (u) ≤ 0 ⇐⇒ u ≤ 0,


and φj : R → R, j ∈ [1, p] satisfy φj (u) = 0 ⇐⇒ u = 0, then for f˜i (x) = ψi (fi (x)), i ∈ [0, m], and
h̃i (x) = φi (hi (x)), i ∈ [1, p], the problem defined

minimize f˜0 (x) (1218)


subject to f˜i (x) ≤ 0 i = 1, · · · m, (1219)
h̃i (x) = 0 i = 1, · · · p. (1220)

is equivalent to standard form.

4. Note fi (x) ≤ 0 ⇐⇒ ∃si ≥ 0 s.t. fi (x) + si = 0. Then consider the problem

minimize f0 (x) (1221)
subject to si ≥ 0, i = 1, · · · , m, (1222)
fi (x) + si = 0, i = 1, · · · , m, (1223)
hi (x) = 0, i = 1, · · · , p, (1224)

where the variables are x ∈ Rn, s ∈ Rm. This transformed problem is equivalent, and is an optimization
problem in n + m variables, with m inequality constraints and m + p equality constraints. The variable si
is said to be the slack variable associated with the inequality constraint fi (x) ≤ 0. A point (x, s)
is feasible/optimal for this problem iff x is feasible/optimal for the standard form and si = −fi (x).

5. If all solutions of the equality constraints hi (x) = 0, i ∈ [p], can be characterized by some
parameter z ∈ Rk, then the equality constraints may be eliminated. Suppose that for φ : Rk → Rn, x
satisfies all equality constraints hi (x) = 0 iff ∃z ∈ Rk s.t. x = φ(z). Then the problem stated

minimize f˜0 (z) = f0 (φ(z)) (1225)


subject to f˜i (z) = fi (φ(z)) ≤ 0, i = 1, · · · m (1226)

is equivalent to the standard form, and is a problem in z with m inequality constraints and zero
equality constraints. z is optimal for the transformed problem iff x = φ(z) is optimal for the original.
Note that if x is optimal for the standard form, then ∃z s.t. x = φ(z) where z is optimal for the transformed
problem, but such z is not necessarily unique.

6. When the equality constraints are all linear (Ax = b), they may be eliminated. In particular, if
Ax = b is inconsistent, then b ∉ range A and F = ∅. Otherwise, define F ∈ Rn×k to be
a matrix with range F = null A, so that the solution set of Ax = b is {F z + x0 : z ∈ Rk}, where x0 is
a particular solution (Theorem 52). If we choose F to have full column rank, then k = n − rank A
by rank-nullity (Theorem 51). Then the problem

minimize f0 (F z + x0) (1227)
subject to fi (F z + x0) ≤ 0, i = 1, · · · m, (1228)

is an equivalent problem in z: it has no equality constraints and rank A fewer variables.

7. The problem

minimize f0 (A0 x + b0 ) (1229)


subject to fi (Ai x + bi ) ≤ 0, i = 1, · · · m, (1230)
hi (x) = 0, i = 1, · · · p (1231)

with x ∈ Rn , Ai ∈ Rki ×n , fi : Rki → R can be transformed by introducing new variables yi ∈ Rki


and writing the problem as an equivalent problem

minimize f0 (y0) (1232)
subject to fi (yi) ≤ 0, i = 1, · · · m, (1233)
yi = Ai x + bi , i = 0, · · · m, (1234)
hi (x) = 0, i = 1, · · · p. (1235)

The transformation adds Σ_{i=0}^{m} ki variables yi ∈ Rki for i ∈ [0, m], and Σ_{i=0}^{m} ki new equality
constraints. Here, the objective and inequality constraint functions involve different optimization variables;
they are said to have been made independent.

8. Joint minimization and stepwise minimization are equivalent problems. That is,

inf_{x,y} f (x, y) = inf_x (inf_y f (x, y)). (1236)

As an example, consider the problem

minimize f0 (x1 , x2 ) (1237)


subject to fi (x1 ) ≤ 0 i = 1, · · · m1 , (1238)
f˜i (x2 ) ≤ 0, i = 1, · · · m2 (1239)

is equivalent to

minimize f˜0 (x1 ) (1240)


subject to fi (x1 ) ≤ 0, i = 1, · · · m1 , (1241)

where f˜0 (x1 ) = inf{f0 (x1 , z) : f˜i (z) ≤ 0, i = 1, · · · m2 }.

9. The standard form problem may be stated in terms of the epigraphs (Definition 233) as

minimize t (1242)
subject to f0 (x) − t ≤ 0 (1243)
fi (x) ≤ 0, i = 1, · · · m (1244)
hi (x) = 0, i = 1, · · · p, (1245)

with variables x ∈ Rn and t ∈ R. It is easy to see that this is equivalent to the standard form,
since we are minimizing t subject to f0 (x) ≤ t, a point (x, t) here is optimal iff x is optimal for the
standard form, with t = f0 (x).

10. Constraints that are explicit or implicit in the domain (see Definition 244) can be brought in/out.
For instance, clearly the standard form is equivalent to the unconstrained problem minimizing F (x),
where we let

F (x) = f0 (x) if x ∈ F, and F (x) = ∞ otherwise. (1246)

On the other hand, the unconstrained problem (for instance)

minimize f (x) = x′x if Ax = b, and f (x) = ∞ otherwise, (1247)
can be rewritten as a constrained optimization problem

minimize x0 x (1248)
subject to Ax = b. (1249)

The unconstrained formulation has non-differentiable objective function, but the constrained formu-
lation has differentiable objective functions (as well as an additional, differentiable, affine equality
constraint).

One of the most common pairs of equivalent problems appearing in the literature involves the objective functions
‖Ax − b‖2 and ‖Ax − b‖²₂ = (Ax − b)′(Ax − b). The optimal points are the same, but only the latter
objective is differentiable for all x.

Exercise 452. A convex optimization problem with quadratic objective, written

minimize x1′P11 x1 + 2x1′P12 x2 + x2′P22 x2 (1250)
subject to fi (x1) ≤ 0, i = 1, · · · m, (1251)

with P11, P22 symmetric blocks of the positive semidefinite matrix M below, can first be minimized analytically
over x2. Write the quadratic form in block matrix form

M = [ P11  P12
      P12′ P22 ], (1252)

and use the Schur complement of P11 in M as in Equation 884 to get

inf_{x2} (x1′P11 x1 + 2x1′P12 x2 + x2′P22 x2) = x1′(P11 − P12 P22^+ P12′) x1, (1253)

and therefore the problem

minimize x1′(P11 − P12 P22^+ P12′) x1 (1254)
subject to fi (x1) ≤ 0, i = 1, · · · m (1255)

is an equivalent problem by Exercise 451.8.
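A numerical sanity check of Equation 1253, under the assumption that M is positive semidefinite (here a randomly generated PSD block matrix): the sketch verifies that minimizing the quadratic form over x2 in closed form reproduces the Schur-complement expression.

import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 3, 4
R = rng.standard_normal((n1 + n2, n1 + n2))
M = R @ R.T                                   # random positive semidefinite block matrix
P11, P12, P22 = M[:n1, :n1], M[:n1, n1:], M[n1:, n1:]

x1 = rng.standard_normal(n1)
x2_star = -np.linalg.pinv(P22) @ P12.T @ x1   # first-order condition: P22 x2 + P12' x1 = 0
direct = x1 @ P11 @ x1 + 2 * x1 @ P12 @ x2_star + x2_star @ P22 @ x2_star
schur = x1 @ (P11 - P12 @ np.linalg.pinv(P22) @ P12.T) @ x1
print(np.isclose(direct, schur))              # True: the two expressions agree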

The specific problem to be solved depends on the nature of fi , hi ’s in the standard form (Definition
244), and the parameterizations of those functions. When we are not working with specific instances
of these functions, we often describe the objective and constraint functions as oracles; they are
treated as black-box models that, when queried at any x ∈ dom f, return f (x). Sometimes the
oracle also returns additional information, such as (f (x), ∇f (x)), and/or guarantees some properties (such
as differentiability/convexity/bounds) of the functions. In some cases, the time complexity and
cost of queries are also known.
Convex optimization problems were raised in Section 4.3. It would be useful to define the standard
form for a convex optimization problem.

Definition 247 (Convex Optimization Standard Form). We define the standard form for convex opti-
mization problem to be of

minimize f0 (x) (1256)


subject to fi (x) ≤ 0, i = 1, · · · , m (1257)
a0i x = bi , i = 1, · · · , p, (1258)

where f0 , · · · fm are convex. That is, convex optimization problems are the general optimization stan-
dard form (Definition 244) restricted to convex objective and inequality constraints, and affine equality
constraints.

Theorem 298. The feasible set of a convex optimization problem (Definition 247) is convex.

Proof. Firstly, the domain D = ∩_{i=0}^{m} dom fi is convex, since it is the intersection of the m + 1 convex
domains of the convex objective and convex inequality constraint functions (Definition 244). Furthermore,
the feasible set (Definition 245) is a subset of D constructed as the intersection of the convex sublevel sets
{x ∈ D : fi (x) ≤ 0} for i ∈ [m] and the (convex) hyperplanes {x ∈ D : ai′x − bi = 0} for i ∈ [p]; an
intersection of convex sets is convex.

When the objective function f0 (x) is quasiconvex (Definition 239) instead of convex in Definition
247, the problem is a standard form quasiconvex optimization problem. The feasible set of a
quasiconvex optimization problem is also convex by the same reasoning. Since the sublevel sets of both
convex and quasiconvex functions are convex (one by proof and the other by definition), the ε-suboptimal sets are
convex. The optimal set is convex (take ε → 0). If the objective is strictly convex, then the optimal set
contains at most one point. The maximization problem for a concave objective f0 (x) subject to convex
constraints can readily be solved by minimizing the negation of the concave function, which is convex.
The maximization problem of a quasiconcave f0 is likewise the minimization of the quasiconvex −f0.
An aside should be noted: that a problem is equivalent to a convex one does not mean it is a convex
optimization problem, even if its feasible set is convex. For instance, the (sole) equality constraint
h1 (x) = (x1 + x2)² = 0 describes a convex feasible set {x : x1 + x2 = 0}, but h1 is not affine. The task
of optimizing a convex function over a convex set is sometimes called an abstract convex optimization
problem, but here only problems in the explicit standard form (Definition 247) are said to
be convex optimization problems. The equivalent standard form problem would feature the constraint
h1 (x) = x1 + x2 = 0 instead.
Recall the definition of local optimality (Definition 245) written ∃R > 0 s.t.

f0 (x) = inf{f0 (z) : z ∈ F, kz − xk2 ≤ R}. (1259)

Theorem 299. Any locally optimal point in a convex optimization problem is globally optimal.

Proof. Suppose the point x is locally optimal but not globally optimal, that is,

∃x, y ∈ F s.t. f0 (x) = inf{f0 (z) : z ∈ F, ‖z − x‖2 ≤ R} and f0 (y) < f0 (x). (1260)

Clearly ‖y − x‖2 > R (otherwise local optimality would give f0 (x) ≤ f0 (y)). Define

z = (1 − θ)x + θy, θ = R / (2‖y − x‖2). (1261)

See that z − x = (1 − θ)x + θy − x = θ(y − x) = (R / (2‖y − x‖2)) (y − x), so

‖z − x‖2 = ‖ (R / (2‖y − x‖2)) (y − x) ‖2 = R/2, (1262)

so z lies within R of x (and z is feasible by convexity of F). But f0 is convex, and

f0 (z) ≤ (1 − θ)f0 (x) + θf0 (y) < f0 (x), (1263)

which contradicts local optimality. So ∄y ∈ F s.t. f0 (y) < f0 (x).

Theorem 300. If f0 is objective function of a convex optimization problem, is differentiable, then we


have (see Equation 1014)

∀x, y ∈ dom f0 , f0 (y) ≥ f0 (x) + ∇f0 (x)0 (y − x). (1264)

The point

x is optimal iff x ∈ F and ∇f0 (x)0 (y − x) ≥ 0 for all y ∈ F. (1265)

That is, if ∇f0 (x) 6= 0, then the normal −∇f0 (x) describes a supporting hyperplane to a feasible set at
x. The feasible set lies in the halfspace induced by normal −∇f0 (x).

Proof. If x ∈ F and ∇f0 (x)0 (y − x) ≥ 0 for all y ∈ F, then clearly Equation 1264 asserts that x is
optimal. Suppose x is optimal, and assume ∃y s.t. ∇f0 (x)0 (y − x) < 0. Consider the point defined by
z(t) = ty + (1 − t)x for some t ∈ [0, 1] on the line segment, which must be feasible by convexity of feasible
set (Theorem 298). Compute

d
f0 (z(t)) = ∇f0 (z(t))0 (y − x), (1266)
dt
evaluated at t = 0 gives ∇f0 (x)0 (y − x) < 0. Therefore for some small t > 0, f0 (z(t)) < f0 (x) and x is
not optimal point, which is contradiction.

Theorem 301. If f0 is objective function of a convex optimization problem (Definition 247), is dif-
ferentiable, and has no constraints, then the condition asserted by Equation 1265 reduces to the simple
condition

∇f0 (x) = 0. (1267)

Clearly if ∇f0 (x) = 0, then ∀y ∈ F, ∇f0 (x)0 (y − x) = 0 ≥ 0. On the other hand, suppose that ∀y ∈
F, ∇f0 (x)0 (y − x) ≥ 0 (x ∈ dom f0 is optimal). Since f0 is differentiable (see Definition 183), dom f0 is
open set (Definition 195), so all y sufficiently close to x is inside the feasible set. Define y := x−t∇f0 (x),
t ∈ R, and for small t > 0, ∃ feasible y s.t. ∇f0 (x)0 (y − x) = −t∇f0 (x)0 ∇f0 (x) = −tk∇f0 (x)k22 ≥ 0, and
this asserts that ∇f0 (x) = 0.

Theorem 302. If f0 is objective function of a convex optimization problem (Definition 247), is differ-
entiable, and the problem only has (affine) equality constraints, then if x is optimal point, x satisfies

∇f0 (x) ⊥ N (A), (1268)

where A is the matrix involved in equality constraint Ax = b.

Proof. Consider the convex optimization problem

minimize f0 (x) (1269)


subject to Ax = b. (1270)

The feasible set is affine. Suppose x is an optimal point, then it is feasible by definition (Definition 245)
and is a particular solution. For any y feasible, y is in the solution space and ∃v ∈ N (A) s.t. y = x + v
and the optimality condition (Equation 1265) is rewritten

∇f0 (x)0 (y − x) = ∇f0 (x)0 v ≥ 0, v ∈ N (A). (1271)

See that for a linear function to be nonnegative on a subspace, it must be zero on the subspace. That
means ∇f0 (x)0 v = 0 for v ∈ N (A), or equivalently they are orthogonal, written ∇f0 (x) ⊥ N (A). Since
R(A0 ) = N (A)⊥ (see Exercise 44), then ∇f0 (x) ∈ R(A0 ) and ∃ν ∈ Rp s.t. ∇f0 (x) + A0 ν = 0. This is
known as the Lagrange multiplier optimality condition.

Theorem 303. If f0 is objective function of a convex optimization problem (Definition 247), is differ-
entiable, and the problem is only constrained over the nonnegative orthant, i.e. written

minimize f0 (x) (1272)


subject to x < 0, (1273)

then the optimality condition given by Equation 1265 can be expressed as

x < 0, ∇f0 (x) < 0, xi (∇f0 (x))i = 0, i ∈ [1, n]. (1274)

Note that the last of the three conditions is known as complementarity.

Proof. With reference to the constraints and the optimality condition given by Equation 1265, the optimality
condition is

x ⪰ 0, ∇f0 (x)′(y − x) ≥ 0, ∀y ⪰ 0. (1275)

The term ∇f0 (x)′y is linear in y and is unbounded below over y ⪰ 0 unless ∇f0 (x) ⪰ 0, so we must have
∇f0 (x) ⪰ 0. Taking y = 0 gives the condition −∇f0 (x)′x ≥ 0. But since x ⪰ 0 and ∇f0 (x) ⪰ 0, we also have
∇f0 (x)′x ≥ 0, so it must be that ∇f0 (x)′x = Σ_i ∇f0 (x)_i x_i = 0; since each summand is nonnegative, each must be
zero, i.e. x_i and (∇f0 (x))_i cannot both be nonzero. The result follows.

Exercise 453. Consider the unconstrained minimization of the quadratic function

f0 (x) = (1/2) x′P x + q′x + r, P ∈ S^n_+. (1276)

Since P is positive semidefinite, the problem is convex, and since it is unconstrained, the optimality condition
(see Theorem 301) is ∇f0 (x) = P x + q = 0, so P x = −q. a) If q ∉ R(P ), then there is no solution
and f0 is unbounded below; b) if P is invertible/positive definite (equivalent, since P is invertible iff zero is
not an eigenvalue, Theorem 72), then the optimal solution occurs uniquely at x∗ = −P^{−1}q (see Theorem
256); and c) if P is singular and q ∈ R(P ), then the optimal set is the affine set Xopt = −P^+ q + N (P ) (see
Theorem 257).
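The three cases of Exercise 453 can be illustrated numerically; the following sketch (with arbitrary, made-up P and q) checks the optimality condition Px + q = 0 in the positive definite case, and the affine optimal set in the singular case with q ∈ R(P).

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
P = A.T @ A + np.eye(3)                       # positive definite
q = rng.standard_normal(3)
x_star = -np.linalg.solve(P, q)               # unique minimizer x* = -P^{-1} q
print(np.allclose(P @ x_star + q, 0.0))       # optimality condition holds: True

P_sing = np.diag([1.0, 2.0, 0.0])             # singular, positive semidefinite
q_in = np.array([1.0, -2.0, 0.0])             # q lies in R(P_sing)
f = lambda x: 0.5 * x @ P_sing @ x + q_in @ x
x0 = -np.linalg.pinv(P_sing) @ q_in           # one optimal point, x0 = -P^+ q
print(np.isclose(f(x0), f(x0 + np.array([0.0, 0.0, 7.0]))))   # optimal set is affine: True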

Exercise 454. Consider the unconstrained minimization of the convex function f0 : Rn → R given by

f0 (x) = − Σ_{i=1}^{m} log(bi − ai′x), dom f0 = {x : Ax ≺ b}, (1277)

where ai′ is the i-th row of A. Since f0 is differentiable and the problem is unconstrained, Theorem 301 asserts
that the optimality conditions are

Ax ≺ b, ∇f0 (x) = Σ_{i=1}^{m} ai / (bi − ai′x) = 0. (1278)

If Ax ≺ b is infeasible, then F = ∅. Otherwise it is feasible, and a) there are no solutions, hence no
optimal points, occurring iff f0 is unbounded below; b) the solutions are non-unique and form an affine set; or c) there
is a unique solution, occurring iff the open polyhedron {x : Ax ≺ b} is nonempty and bounded (verify this).

Exercise 455. A number of examples of equivalent optimization problems were studied in Exercise 451.
Here we look at the settings in which the problem are known to be convex.

1. We may eliminate equality constraints. A standard form convex problem (Definition 247) has
linear equality constraints Ax = b. These may be eliminated by finding a particular solution x0 s.t.
Ax0 = b, some matrix F s.t R(F ) = N (A) and solving the problem

minimize f0 (F z + x0 ) (1279)
subject to fi (F z + x0 ) ≤ 0, i ∈ [1, m]. (1280)

The optimization variable is z. The composition of a convex function with an affine function is convex, so this
formulation is convex. It should be noted that if the original problem has some useful structure such
as sparsity, then the elimination of equality constraints can destroy sparsity and make the problem
more difficult to solve.

2. As opposed to eliminating equality constraints, we can also add equality constraints. For instance,
given objective/constraint function fi (Ai x + bi ), we can replace this with fi (yi ), yi = Ai x + bi .

3. If a particular convex constraint function fi (x) ≤ 0 is specifically linear, then we can replace this
with fi (x) + si = 0, si ≥ 0. Note we require fi to be linear since the standard form requires equality
constraint functions to be affine.

4. We can write the problem in epigraph form, namely

minimize t (1281)
subject to f0 (x) − t ≤ 0, (1282)
fi (x) ≤ 0, i ∈ [m], (1283)
a0i x = bi , i ∈ [p]. (1284)

Since this can be written for any convex optimization problem, the linear objective is said to be
universal for convex optimization.

5. Minimizing a convex function over a subset of the variables preserves convexity of the problem, as
in Exercise 451.

4.6.1 Quasiconvex Optimization
Quasiconvex optimization problems take the standard form as in Definition 247, except that f0 is qua-
siconvex instead of convex. Sublevel sets of quasiconvex functions can be represented as inequalities of
convex functions (see Equation 1161 and discussion). Therefore the quasiconvex constraint functions can
be replaced with convex constraint functions. It turns out solving a quasiconvex optimization problem
reduces to solving a sequence of convex optimization problems.
Unlike for convex problems, local optimality does not imply global optimality. Recall that for convex
optimization problems (see Equation 1265), x ∈ F is optimal iff ∇f0 (x)′(y − x) ≥ 0 for all y ∈ F. Recall
also the first order condition for quasiconvexity (Theorem 288), which states that f is quasiconvex iff dom f
is convex and ∀x, y ∈ dom f , f (y) ≤ f (x) =⇒ ∇f (x)′(y − x) ≤ 0. Then, for a quasiconvex optimization problem,
x is optimal if x ∈ F and ∇f0 (x)′(y − x) > 0 for all y ∈ F \ {x}. Note that in this case the condition is only
sufficient for optimality; an optimal point need not satisfy it.

Theorem 304 (Quasiconvex optimization as convex feasibility problem). The quasiconvex optimization
problem can be solved as a sequence of convex feasibility problems. Recall that sublevel sets of quasiconvex
functions are convex. Let φt : Rn → R be a family of convex functions indexed by t that satisfy
f0 (x) ≤ t ⇐⇒ φt (x) ≤ 0, and suppose also that for each x, φt (x) is a nonincreasing function of t, i.e.
φs (x) ≤ φt (x) if s ≥ t. If p∗ is the optimal value of the quasiconvex optimization problem, then the
convex feasibility problem

find x (1285)
subject to φt (x) ≤ 0, (1286)
fi (x) ≤ 0, i ∈ [1, m], (1287)
Ax = b (1288)

is feasible iff p∗ ≤ t. This gives rise to the bisection method for solving quasiconvex problems. Assuming
the problem is feasible and that lower and upper bounds l ≤ p∗ ≤ u are available, we solve the convex feasibility
problem at t = (l + u)/2, update the bracket accordingly, and repeat until the search interval has width
u − l ≤ ε, where ε is some specified additive error. The length of the interval after k iterations is
(u − l)/2^k, so the number of iterations required is ⌈log2((u − l)/ε)⌉ (since we need (u − l)/2^k ≤ ε).
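A minimal sketch of the bisection scheme in Theorem 304. The function feasible(t) stands in for the convex feasibility problem (1285)-(1288) at level t; here it is faked with a grid evaluation of a known quasiconvex toy objective purely for illustration, which is not how one would solve a real instance.

import numpy as np

def bisect_quasiconvex(feasible, l, u, eps=1e-6):
    # invariant: p* lies in [l, u]; the bracket halves at every iteration
    while u - l > eps:
        t = 0.5 * (l + u)
        if feasible(t):      # p* <= t
            u = t
        else:                # p* > t
            l = t
    return u

# toy illustration: minimize the quasiconvex f0(x) = |x - 2| / (1 + x) over x in [0, 5]
xs = np.linspace(0.0, 5.0, 10_001)
f0 = np.abs(xs - 2.0) / (1.0 + xs)
feasible = lambda t: bool(np.any(f0 <= t))    # stand-in for the convex feasibility oracle
print(bisect_quasiconvex(feasible, 0.0, 2.0)) # ≈ 0, attained near x = 2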

4.6.2 Linear Programs


Recall linear programs (Section 4.3). In particular, the objective and constraint functions are all affine,
and we may write the general form as

minimize c0 x + d, (1289)
subject to Gx 4 h (1290)
Ax = b, (1291)

where G ∈ Rm×n, A ∈ Rp×n specify the m inequality and p equality constraints. Note that the constant d
affects neither the feasible set nor the optimal point; it may be omitted. The maximization problem
of an affine objective function is the minimization of its negation. Hence, both the minimization and
maximization problems are said to be linear programs when presented with affine objective and
constraint functions. Clearly, the feasible set F is an instance of a polyhedron (Definition 213); we
minimize the affine objective over this polyhedron.

For convenience, we can also define a standard form for linear programs (LPs). In the standard form,
the only inequality constraint is that the optimization variable is over the nonnegative orthant, s.t. we
may write

minimize c0 x, (1292)
subject to Ax = b, (1293)
x < 0. (1294)

When an LP has no equality constraints, s.t. the only constraint is given by Ax 4 b, then we say that
this an inequality form LP.

Exercise 456. Prove that the standard form LP is universal for LPs.

Proof. Consider the general form for LPs (Equation 1291). Then using slack variable si we may write

minimize c0 x + d, (1295)
subject to Gx + s = h, (1296)
Ax = b, (1297)
s < 0. (1298)

To constrain the optimization variable to the nonnegative orthant, define x+ = max(x, 0), x− =
max(−x, 0) (elementwise); then x = x+ − x− and x+, x− ⪰ 0. We then obtain the equivalent standard form LP

minimize c0 x+ − c0 x− + d, (1299)
subject to Gx+ − Gx− + s = h, (1300)
+ −
Ax − Ax = b, (1301)
x+ < 0, x− < 0, s < 0. (1302)

Problems that are transformable to LPs are also loosely referred to as LPs.

Exercise 457. Some LPs and transformable LPs are explored.

1. A diet specifies minimum quantities b of m different nutrients. There are n food types on the menu,
with their nutrient contents given by matrix A, where Aij is amount of i-th nutrient contained
in item j. If the cost of menu items are given by c, the minimal cost diet satisfying nutrient
requirements is linear program given by minimization of c0 x subject to Ax < b, x < 0.

2. Chebyshev Center of Polyhedron. Finding the center of the largest Euclidean ball (Definition 209)
lying inside polyhedron parameterized by inequalities P = {x ∈ Rn : a0i x ≤ bi , i ∈ [m]} is said to
be the problem of finding the Chebyshev center. That is, we want to maximize r, the radius of a
Euclidean ball {xc + u : kuk2 ≤ r} lying inside the polyhedron. For a single inequality, we have the
constraint

‖u‖2 ≤ r =⇒ ai′(xc + u) ≤ bi . (1303)

Theorem 248 asserts that

rkai k2 = r sup{a0i u : kuk2 = 1} (1304)


= sup{a0i u : kuk2 = r}, (1305)

so, for ‖u‖2 ≤ r, we may write

ai′(xc + u) = ai′xc + ai′u (1306)
≤ ai′xc + sup{ai′u : ‖u‖2 ≤ r} (1307)
= ai′xc + r‖ai‖2 (1308)
≤ bi (required), (1309)

and, requiring the worst case to satisfy the bound, we arrive at the constraint ai′xc + r‖ai‖2 ≤ bi, which
is a linear constraint in (xc, r). So the Chebyshev centering problem is an LP (a numerical sketch is
given after this exercise list), namely

maximize r (1310)
subject to a0i xc + rkai k2 ≤ bi , i ∈ [1, m]. (1311)

3. The allocation of goods to economic activities is to be planned over N periods. The activity at
time t ∈ [N ] in sector j is given by xj (t). The production matrix is given by A, where Aij specifies
the amount of good i produced per unit of activity j. The consumption matrix is given by B, where
Bij specifies the amount of good i consumed per unit of activity j. At time t, Ax(t) is produced and
Bx(t) is consumed. In any period, we may not consume more than the previous period's production.
We start off with initial goods g0. The problem is to maximise the discounted total value of excess
goods, given by c′s(0) + γc′s(1) + · · · + γ^N c′s(N ), where c is the value of the goods and γ is a specified
discount factor determining the tradeoff between current and future consumption. Then we solve
the LP
maximize Σ_{i=0}^{N} γ^i c′s(i), (1312)
x(t) < 0, t ∈ [1, N ], (1313)
s(t) < 0, t ∈ [0, N ], (1314)
s(0) = g0 − Bx(1), (1315)
s(t) = Ax(t) − Bx(t + 1), t = [1, N − 1], (1316)
s(N ) = Ax(N ). (1317)

Here the variables are x(1), · · · x(N ), s(0), · · · s(N ). The slack variables s(t) are associated with the
constraint Bx(t + 1) 4 Ax(t).

4. Chebyshev Inequalities. Let x be a discrete random variable taking values in {ui , i ∈ [n]} ⊆ R.
Let the distribution of x be described by the probability vector p satisfying pi = P{x = ui}. If f is
a function of x, we know that Ef = Σ_{i=1}^{n} pi f (ui), a linear function of p. If S ⊆ R, we may write
P{x ∈ S} = Σ_{ui ∈ S} pi. Further suppose we are given guarantees on a number of functions of x,
expressed as a set of linear inequalities

αi ≤ ai′p ≤ βi , i ∈ [m]. (1318)

Our objective is to bound the expectation Ef0 (x) = a0′p, where f0 is an arbitrary function of x. A
lower bound may be found by solving the LP

minimize a00 p (1319)


subject to p ⪰ 0, (1320)
1′p = 1, (1321)
αi ≤ a0i p ≤ βi , i ∈ [m]. (1322)

The solution to this LP is the lower bound for Ef0 (X) that is consistent with the prior. The upper
bound is found similarly.

5. The minimization of a piecewise linear convex function f (x) = maxi (a0i x + bi ) has equivalent
epigraph form

minimize t (1323)
subject to max_i (ai′x + bi) ≤ t. (1324)

See that the constraint maxi (a0i x + bi ) ≤ t is simply a set of m inequalities a0i x + bi ≤ t, i ∈ [m].
The variables are x, t.
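The Chebyshev centering LP of item 2 above can be solved with an off-the-shelf LP solver. The following sketch uses scipy.optimize.linprog on the unit square (an arbitrary illustrative polyhedron); the decision vector stacks (xc, r) and we minimize −r.

import numpy as np
from scipy.optimize import linprog

# polyhedron Ax <= b: the unit square [0, 1]^2 (illustrative choice)
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])

norms = np.linalg.norm(A, axis=1)
A_ub = np.hstack([A, norms[:, None]])        # constraints a_i' x_c + r ||a_i||_2 <= b_i
c = np.array([0.0, 0.0, -1.0])               # maximize r  <=>  minimize -r
res = linprog(c, A_ub=A_ub, b_ub=b,
              bounds=[(None, None), (None, None), (0, None)])
x_c, r = res.x[:2], res.x[2]
print(x_c, r)                                 # center ≈ (0.5, 0.5), radius ≈ 0.5

Note that the bounds argument is specified explicitly, since linprog otherwise defaults to nonnegative variables.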

When the objective function is a ratio of affine functions, with linear inequalities and
linear equalities as constraints, we say that the problem is a linear fractional program. In particular, we have

minimize f0 (x) = (c′x + d)/(e′x + f), dom f0 = {x : e′x + f > 0}, (1325)
subject to Gx ⪯ h, (1326)
Ax = b. (1327)

Theorem 305. The linear fractional program as in Equation 1327 has a quasiconvex objective function
and is transformably an LP.

Proof. If the feasible set F is nonempty, then we claim that the LP

minimize c′y + dz, (1328)
subject to Gy − hz ⪯ 0, (1329)
Ay − bz = 0, (1330)
e′y + f z = 1, (1331)
z ≥ 0, (1332)

with variables y and z, is equivalent. It is easy to see that x feasible for Equation 1327 implies that (y, z)
with y = x/(e′x + f ), z = 1/(e′x + f ) is feasible for Equation 1332 with the same objective value
c′y + dz = f0 (x); hence the optimal value of Equation 1332 is less than or equal to that of Equation 1327.
On the other hand, if (y, z) with z ≠ 0 is feasible for Equation 1332, then x = y/z is feasible for Equation 1327
with the same objective value. If instead (y, 0) is feasible for Equation 1332 and x0 is feasible for
Equation 1327, then we have

Gy ⪯ 0, Ay = 0, e′y = 1, Gx0 ⪯ h, Ax0 = b, e′x0 + f > 0, (1333)

and see that any x = x0 + ty is feasible for t ≥ 0. Furthermore, lim_{t→∞} f0 (x0 + ty) = c′y + dz (verify this),
so Equation 1327 has feasible points with objective values arbitrarily close to the objective
value of (y, 0) for Equation 1332. So the two problems are equivalent.

We can further generalize the linear fractional program (Equation 1327) to the pointwise maximum
of r quasiconvex ratios, namely

f0 (x) = max_i (ci′x + di)/(ei′x + fi), dom f0 = {x : ei′x + fi > 0, i ∈ [1, r]}, (1334)

and this is said to be a generalized linear fractional programming problem. Recall that the pointwise
maximum of quasiconvex functions is quasiconvex.

4.6.3 Quadratic Optimization Problem
The convex optimization standard form (Definition 247) is a quadratic program (QP) if f0 (x) is convex quadratic
and the constraint functions are affine. In particular, we have

minimize (1/2) x′P x + q′x + r (1335)
subject to Gx ⪯ h, (1336)
Ax = b, (1337)

where P is positive semidefinite. The convex quadratic objective is minimized over the polyhedron specified
by the equality and inequality constraints. If the inequality constraint functions are themselves convex
quadratic, that is

minimize (1/2) x′P0 x + q0′x + r0, (1338)
subject to (1/2) x′Pi x + qi′x + ri ≤ 0, i ∈ [m], (1339)
Ax = b, (1340)

with Pi ∈ S^n_+, then we say that the problem is a quadratically constrained quadratic program (QCQP).
In a QCQP the objective is minimized over an intersection of ellipsoids determined by the Pi. Note that when
Pi = 0 for all i ∈ [0, m], the QCQP is an LP, and when Pi = 0 only for i ∈ [1, m], the QCQP is a QP.
Exercise 458. Recall that the least-squares problem (see Section 4.3), written as the unconstrained minimization
of the convex quadratic function ‖Ax − b‖²₂ = x′A′Ax − 2b′Ax + b′b, is a QP. Assuming A has full column
rank, so that (A′A)^{−1} exists, x = (A′A)^{−1}A′b is the optimal solution. More generally, the optimal
solution is given by x = A^+ b, where A^+ is the pseudo-inverse discussed in Exercise 324. This is also
known as the least squares regression problem. When we add constraints, we say it is a constrained
least squares problem. Suppose the constraints are bounds on x, written

minimize ‖Ax − b‖²₂, (1341)
subject to li ≤ xi ≤ ui , i ∈ [1, n]. (1342)

Then a closed form solution no longer exists, but we may solve it as a QP.
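For the box-constrained case, one convenient off-the-shelf route is scipy.optimize.lsq_linear, which solves min ‖Ax − b‖²₂ subject to elementwise bounds. A minimal sketch with made-up data:

import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

x_unconstrained = np.linalg.pinv(A) @ b         # x = A^+ b, the ordinary least squares solution
res = lsq_linear(A, b, bounds=(-0.1, 0.1))      # box constraint -0.1 <= x_i <= 0.1
print(x_unconstrained)
print(res.x)                                     # entries pushed onto the box where binding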
Exercise 459. Define the polyhedral distance between P1 = {x : A1 x ⪯ b1} and P2 = {x : A2 x ⪯ b2} to
be inf{‖x1 − x2‖2 : x1 ∈ P1 , x2 ∈ P2}. The polyhedral distance may be found by solving the QP

minimize ‖x1 − x2‖²₂ (1343)
subject to A1 x1 ⪯ b1 , A2 x2 ⪯ b2 . (1344)

Exercise 460. The Chebyshev mean-bounding problem given by Equation 1320 was an LP. We can also
provide a variance bound under the same setting using a QP. See that the variance is given by

Var f = Ef² − (Ef)² = Σ_{i=1}^{n} fi² pi − (Σ_{i=1}^{n} fi pi)². (1345)

This is a concave quadratic function in p. The upper bound for the variance consistent with the prior is the
QP written

maximize Σ_{i=1}^{n} fi² pi − (Σ_{i=1}^{n} fi pi)², (1346)
subject to p ⪰ 0, 1′p = 1, (1347)
αi ≤ ai′p ≤ βi , i ∈ [1, m]. (1348)

Exercise 461. Consider the Markowitz portfolio discussed in Exercise 637. Here, some extensions
of this problem in terms of constraints are studied in the context of convex optimization. Let x be a
unit dollar investment vector in n assets, where positive and negative entries correspond to long/short
positions. Let p be the net-return vector (see Section 12.1.1), which is taken to be random with mean p̄
and covariance Σ. The portfolio x then corresponds to expected return p̄′x and variance x′Σx. The
Markowitz minimum variance portfolio subject to a minimum return rmin and long-only constraints is
given by the QP

minimize x0 Σx, (1349)


subject to p̄0 x ≥ rmin (1350)
0
1 x = 1, (1351)
x < 0. (1352)

Other extensions allow for short positions, or a bounded fraction of short positions, which we may intro-
duce with the constraints

xl < 0, xs < 0, x = xl − xs , 10 xs ≤ η10 xl . (1353)

Another extension is a linear transaction cost, as in Equation 3397. For instance, we may let the
portfolio x be constrained by

x = xinit + ub − us , ub < 0, us < 0. (1354)

The budget constraint 10 x = 1 is replaced with the condition that the transactions post fees involve zero
net cash, namely

(1 − fs )10 us = (1 + fb )10 ub , (1355)

where fs , fb are the fee rates to sell and buy respectively on the nominal transaction volume. The exten-
sions are all QPs.
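The baseline long-only problem (Equations 1349-1352) is easy to state in a modelling language such as cvxpy; the sketch below uses made-up expected returns and covariance rather than data from the text, and assumes cvxpy with its default QP-capable solver is installed.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n = 5
p_bar = rng.uniform(0.02, 0.10, size=n)          # illustrative expected net returns
F = rng.standard_normal((n, n))
Sigma = F @ F.T / n + 0.01 * np.eye(n)           # positive definite covariance
r_min = 0.05

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.quad_form(x, Sigma)),
                  [p_bar @ x >= r_min, cp.sum(x) == 1, x >= 0])
prob.solve()
print(x.value, prob.value)                        # long-only minimum-variance weights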

4.6.4 Second Order Cone Programs


The second order cone program (SOCP) is given by the general form

minimize f 0 x, (1356)
subject to kAi x + bi k2 ≤ c0i x + di , i ∈ [m], (1357)
F x = g. (1358)

Here x ∈ Rn is the optimization variable, and we have Ai ∈ Rni×n, F ∈ Rp×n. A constraint of the form
‖Ax + b‖2 ≤ c′x + d is said to be a second order cone constraint, as it is equivalent to saying that the
affine function (Ax + b, c′x + d) lies in the second order cone (see Definition 212). Note that SOCPs
contain QCQPs (setting ci = 0 gives ‖Ai x + bi‖2 ≤ di, which squares to a convex quadratic constraint of
the form x′Ai′Ai x + 2bi′Ai x + bi′bi ≤ di²). By extension, SOCPs contain QPs and LPs.

Exercise 462. Consider a LP defined

minimize c0 x, (1359)
subject to a0i x ≤ bi , i ∈ [m]. (1360)

Suppose the ai’s are random, but are known to be contained in ellipsoids s.t.

ai ∈ ξi = {āi + Pi u : ‖u‖2 ≤ 1}, (1361)

where Pi ∈ Rn×n. If Pi is singular, the ellipsoid is degenerate (flat); if Pi = 0, there is no uncertainty in ai.
The robust version of the linear program minimizes the objective function subject to
the satisfaction of the constraints for all possible values of ai. That is, we solve the problem

minimize c′x, (1362)
subject to ai′x ≤ bi , ∀ai ∈ ξi , i ∈ [m]. (1363)

For each i, the robust linear constraint may be written as sup{ai′x : ai ∈ ξi} ≤ bi. This can be further
written as

sup{ai′x : ai ∈ ξi} = sup{(āi + Pi u)′x : ‖u‖2 ≤ 1} (1364)
= āi′x + sup{u′Pi′x : ‖u‖2 ≤ 1} (1365)
= āi′x + ‖Pi′x‖2 , (1366)

where the last equality follows from Theorem 248. Then the robust LP can be written as the SOCP

minimize c′x (1367)
subject to āi′x + ‖Pi′x‖2 ≤ bi , i ∈ [1, m], (1368)

which differs from the original form by the regularization norm term, which penalizes large values of x in
directions of large uncertainty in the parameters ai.

Exercise 463. Let the ai’s be parameters of a statistical estimation problem, modelled as independent Gaussian
random vectors. Each random vector ai has mean āi and covariance Σi. We have a prior, probabilistic
constraint given by P{ai′x ≤ bi} ≥ η, where η ≥ 0.5 is some confidence level. This can be represented as
a second order cone constraint. Let u = ai′x with mean ū and Var(u) = σ², then we have

P( (u − ū)/σ ≤ (bi − ū)/σ ) ≥ η. (1369)

Here (u − ū)/σ is a standard normal random variable (Definition 6.17.6), so we can write

(bi − ū)/σ ≥ Φ^{−1}(η), (1370)

where Φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−t²/2) dt. So ū + Φ^{−1}(η)σ ≤ bi, and since ū = āi′x and
σ = (x′Σi x)^{1/2} = (x′Σi^{1/2}Σi^{1/2}x)^{1/2} = ‖Σi^{1/2}x‖2, we obtain the inequality

āi′x + Φ^{−1}(η)‖Σi^{1/2}x‖2 ≤ bi. (1371)

Furthermore, by assumption η ≥ 1/2, so Φ^{−1}(η) ≥ 0 and this is a second order cone constraint (the
inequality points in the right direction). Then the optimization problem

minimize c′x (1372)
subject to P{ai′x ≤ bi} ≥ η, i ∈ [m] (1373)

can be written as the SOCP

minimize c′x (1374)
subject to āi′x + Φ^{−1}(η)‖Σi^{1/2}x‖2 ≤ bi , i ∈ [m]. (1375)
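A sketch of the chance constraint (1371) in cvxpy with a single synthetic constraint; āi, Σi, bi, η and the extra norm bound on x (added only to keep the toy problem bounded) are all illustrative assumptions, not from the text.

import numpy as np
import cvxpy as cp
from scipy.stats import norm
from scipy.linalg import sqrtm

rng = np.random.default_rng(6)
n = 4
a_bar = rng.standard_normal(n)
G = rng.standard_normal((n, n))
Sigma = G @ G.T + np.eye(n)
b_i, eta = 1.0, 0.95
c = rng.standard_normal(n)
Sigma_half = np.real(sqrtm(Sigma))                # Σ^{1/2}

x = cp.Variable(n)
soc = a_bar @ x + norm.ppf(eta) * cp.norm(Sigma_half @ x, 2) <= b_i   # Equation 1371
prob = cp.Problem(cp.Minimize(c @ x),
                  [soc, cp.norm(x, 2) <= 1.0])    # norm bound keeps the toy problem bounded
prob.solve()
print(x.value)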

Exercise 464. Consider the classical Markowitz portfolio with the extensions explored in Exercise 461. The
extensions there were QPs; here we explore a SOCP extension. As before, we have a net return
vector p and a dollar investment vector x. Assume now that p is a Gaussian vector with mean p̄ and
covariance Σ. Then the portfolio dollar return r is Gaussian with expectation r̄ = p̄′x and variance
σr² = x′Σx. We want to bound the probability of a loss beyond threshold α by β; in particular,
we want P{r ≤ α} ≤ β, and as in the discussion above we may write this as the second order cone constraint

p̄′x + Φ^{−1}(β)‖Σ^{1/2}x‖2 ≥ α. (1376)

Note that we require β ≤ 1/2, otherwise this constraint is nonconvex in x. Then return maximization
subject to this constraint is a SOCP.

Exercise 465. The surface area of a differentiable function f : R² → R with domain C is given by

A = ∫_C √(1 + ‖∇f (x)‖²₂) dx = ∫_C √(⟨(∇f (x), 1)|(∇f (x), 1)⟩) dx = ∫_C ‖(∇f (x), 1)‖2 dx. (1377)

The minimal surface area problem finds the function f that minimizes A subject to specified constraints.
Assume that C = [0, 1] × [0, 1], and let fij be the function f evaluated at (i/K, j/K). Here the domain is
discretized, s.t. i, j ∈ [0, K] are indices that range over C. The gradient approximated by the forward
differencing method can be written as

∇f (x) ≈ K (f_{i+1,j} − f_{i,j} , f_{i,j+1} − f_{i,j}). (1378)

Then the integral is approximated by

A ≈ Ã = (1/K²) Σ_{i,j=0}^{K−1} ‖(K(f_{i+1,j} − f_{i,j}), K(f_{i,j+1} − f_{i,j}), 1)‖2. (1379)

See that Ã is convex in the fij, and suppose we want to place constraints on the edges of the square,
f_{0,j} = lj, j ∈ [0, K], and f_{K,i} = ri, i ∈ [0, K]. The problem is the SOCP

minimize (1/K²) Σ_{i,j=0}^{K−1} tij (1380)
subject to ‖(K(f_{i+1,j} − f_{i,j}), K(f_{i,j+1} − f_{i,j}), 1)‖2 ≤ tij , i, j ∈ [0, K − 1], (1381)
f_{0,j} = lj , j ∈ [0, K], (1382)
f_{K,i} = ri , i ∈ [0, K]. (1383)

4.6.5 Geometric Programming

Geometric programs are transformably convex, via a change of variables and a transformation of the
objective and constraint functions.

Definition 248 (Monomial). A function f : Rn → R, dom f = Rn_{++}, given by

f (x) = c x1^{a1} · · · xn^{an} , c > 0, ai ∈ R (1384)

is said to be a monomial function.

Definition 249 (Posynomial). A function f : Rn → R, dom f = Rn_{++}, given by

f (x) = Σ_{k=1}^{K} ck x1^{a1k} · · · xn^{ank} , ck > 0, aik ∈ R (1385)

is said to be a posynomial function.

It is easy to see that monomials are closed under multiplication and division, while posynomials
are closed under addition, multiplication, and nonnegative scaling. The product of a monomial and a
posynomial is a posynomial, and the quotient of a posynomial by a monomial is a posynomial.

Definition 250 (Geometric Program). An optimization problem formulated as

minimize f0 (x) (1386)


subject to fi (x) ≤ 1, i = 1, · · · m, (1387)
hi (x) = 1, i = 1, · · · p, (1388)

where fi , i ∈ [0, m] are posynomials and hj , j ∈ [p] are monomials is said to be geometric program (GP).
The implicit domain of this problem is given by D = Rn++ as is clear by the domain of posynomials and
monomials (Definitions 248, 249).

Some constraints not in the standard GP form of Definition 250 are readily transformable to GP form.
For instance, given a posynomial f and a monomial h, the constraint f (x) ≤ h(x) is simply f (x)/h(x) ≤ 1,
where the LHS is a posynomial. If h1, h2 are nonzero monomials, then h1 (x) = h2 (x) has the equivalent
monomial constraint h1 (x)/h2 (x) = 1.

Exercise 466. Show that the problem

maximize x/y, (1389)
subject to 2 ≤ x ≤ 3, (1390)
x² + 3y/z ≤ √y, (1391)
x/y = z², (1392)

with variables x, y, z > 0 is a GP.

Proof. It is not difficult to see that we can write this in GP form as

minimize x^{−1}y (1393)
subject to 2x^{−1} ≤ 1, (1394)
(1/3)x ≤ 1, (1395)
x²y^{−1/2} + 3y^{1/2}z^{−1} ≤ 1, (1396)
xy^{−1}z^{−2} = 1. (1397)

When the problem is easily transformable into GP, we refer to the problem also as GP.
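The problem of Exercise 466 can be handed directly to cvxpy's geometric-programming mode, which performs the log transformation of Theorem 306 internally; a minimal sketch, assuming a cvxpy installation with DGP support (solve(gp=True)).

import cvxpy as cp

x = cp.Variable(pos=True)
y = cp.Variable(pos=True)
z = cp.Variable(pos=True)

constraints = [x >= 2, x <= 3,
               x ** 2 + 3 * y / z <= cp.sqrt(y),
               x / y == z ** 2]
prob = cp.Problem(cp.Maximize(x / y), constraints)
prob.solve(gp=True)                 # geometric-programming mode
print(x.value, y.value, z.value, prob.value)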

Theorem 306 (GPs are transformably convex). Consider the change of variables yi = log xi. Then the
monomial function f (x) can be written as

f (x) = c x1^{a1} · · · xn^{an} (1398)
= f (exp(y)) (1399)
= c (exp(y1))^{a1} · · · (exp(yn))^{an} (1400)
= exp(a′y + b), b = log c. (1401)

Similarly, the posynomial form given in Definition 249 can be written as

f (x) = Σ_{k=1}^{K} exp(ak′y + bk), bk = log ck. (1402)

The monomial equality constraints hi (x) = 1 become exp(gi′y + ui) = 1, where gi contains the exponents of the
monomial hi and ui is the log of its coefficient. Logarithms are monotonically
increasing, so we can take logs of the transformed objective/constraint functions and solve the problem

minimize f̃0 (y) = log( Σ_{k=1}^{K0} exp(a0k′y + b0k) ), (1403)
subject to f̃i (y) = log( Σ_{k=1}^{Ki} exp(aik′y + bik) ) ≤ 0, i ∈ [m], (1404)
h̃i (y) = gi′y + ui = 0, i ∈ [p]. (1405)

The f̃'s are log-sum-exp compositions with affine functions and are convex (see Equation 1088); the h̃'s are affine.
The problem is in convex optimization standard form (Definition 247).

The convex problem form formulated in Theorem 306 is said to be the geometric program in convex
form, as opposed to the geometric program in the posynomial form. Note that when the posynomials
f ’s degenerate to monomials, the problem is an LP.

Exercise 467 (Minimization of Frobenius Norm). A square matrix M of order n is associated with the linear
function y = M u. If we scale the coordinates by a diagonal matrix D, then ũ = Du and ỹ = Dy = DM u, so in
the new coordinates ỹ = DM D^{−1}ũ. Assume that d = diag(D) ≻ 0. The problem is to choose a scaling s.t.
DM D^{−1} is small. Using the Frobenius norm (Definition 179) as the measure of matrix size, we may write

‖DM D^{−1}‖²_F = tr((DM D^{−1})′(DM D^{−1})) (1406)
= Σ_{i,j=1}^{n} (DM D^{−1})²_{ij} (1407)
= Σ_{i,j=1}^{n} M²_{ij} d²_i d^{−2}_j . (1408)

Then our problem is the unconstrained GP with posynomial objective Σ_{i,j=1}^{n} M²_{ij} d²_i d^{−2}_j, minimized over d ≻ 0.

4.6.6 Optimization Problems with Generalized Inequalities


The inequality constraint may be generalized w.r.t proper cones (Definition 223) using generalized in-
equalities. The standard form convex optimization problem with generalized inequality constraints is

said to take form

minimize f0 (x) (1409)


subject to fi (x) 4Ki 0, i ∈ [1, m], (1410)
Ax = b. (1411)

Here f0 : Rn → R and the Ki ⊆ Rki are proper cones. The constraint functions fi : Rn → Rki are required to
be Ki-convex. When working with optimization problems in generalized inequality form, many of the earlier
properties extend: the feasible set, the sublevel sets (see Definition 232) and the optimal set are convex; local
optimality implies global optimality; and the optimality condition for a differentiable objective function f0 (x)
applies (see Equation 1265).

Definition 251. A conic form problem, or cone program is said to take form

minimize c0 x, (1412)
subject to F x + g 4K 0, (1413)
Ax = b. (1414)

The cone program (Definition 251) in standard form is:

minimize c0 x, (1415)
subject to x <K 0, (1416)
Ax = b. (1417)

When there are only inequality constraints, then

minimize c0 x (1418)
subject to F x + g 4K 0, (1419)

is a cone program in inequality form. See Exercise 456 for the technique for showing that the standard
form cone program is universal for cone programs.

When the associated cone in the cone program (see Definition 251) is the positive semidefinite cone,
that is, when K = S^k_+, we say that the problem is a semidefinite program (SDP). It has the general form

minimize c′x, (1420)
subject to Σ_{i=1}^{n} xi Fi + G ⪯ 0, (1421)
Ax = b. (1422)

Here A ∈ Rp×n is arbitrary and the matrices G, F1, · · · , Fn ∈ S k are symmetric. The inequality constraint is an
LMI (see Exercise 396). If G and the Fi, i ∈ [n], are all diagonal, then the SDP is an LP. The SDP (Equation
1422) also has a standard form, with linear equality constraints and a matrix nonnegativity constraint on X, written
as

minimize tr(CX) (1423)


subject to tr(Ai X) = bi , i ∈ [p], (1424)
X < 0. (1425)

Here C, A1, · · · , Ap ∈ S n. An inequality form SDP has no equality constraints and just one LMI:

minimize c′x, (1426)
subject to Σ_{i=1}^{n} xi Ai ⪯ B. (1427)

Here B, Ai ∈ S k and the optimization variable is x. Note that the extension to several LMI constraints,
given by

minimize c′x, (1428)
subject to F^{(i)}(x) = Σ_{j=1}^{n} xj Fj^{(i)} + G^{(i)} ⪯ 0, i ∈ [K], (1429)
Gx ⪯ h, (1430)
Ax = b, (1431)

can readily be shown to be an SDP by arranging a large block diagonal LMI and specifying the problem
instance

minimize c0 x, (1432)
subject to diag(Gx − h, F (1) (x), · · · , F (K) (x)) 4 0, (1433)
Ax = b. (1434)

The SOCP formulation in Equation 1358 is a cone program. See this by writing

minimize c0 x (1435)
subject to −(Ai x + bi , c0i x + di ) 4Ki 0, i ∈ [1, m] (1436)
F x = g, (1437)

where Ki = {(y, t) ∈ Rni +1 : kyk2 ≤ t} is the second-order cone.


Exercise 468. Recall the spectral norm of a matrix (Definition 180). Define A(x) := A0 + x1 A1 +
· · · xn An , where Ai ∈ Rp×q . The unconstrained problem of minimizing the spectral norm kA(x)k2 is
convex. Note that kAk2 ≤ s ⇐⇒ A0 A 4 s2 1 for s ≥ 0, and we may express the problem as

minimize s (1438)
subject to A(x)0 A(x) 4 s1. (1439)

The variables are x and s. Note that A(x)0 A(x) − s1 is matrix convex in (x, s), so this is convex
optimization problem in generalized inequality constraints.
t1
" #
A
Alternatively, given t > 0, Theorem 255.2 asserts that for M = , since t1  0, we have
A0 t1
M  0 iff t1 − A0 1t 1A < 0, which may be written 1t (t2 1 − A0 A) < 0, so

t1 A
" #
A A 4 t 1 ⇐⇒
0 2
<0 (1440)
A0 t1
and we have an SDP

minimize t (1441)
t1
" #
A(x)
subject to <0 (1442)
A(x) 0
t1
with variables in x, t.

Exercise 469. Let t be a scalar random variable, and suppose the moments Et^k are well defined. Let
xk = Et^k for k ∈ [0, 2n], with x0 = 1. Then the matrix

H := H(x0, · · · , x2n) =
[ x0      x1      · · ·  x_{n−1}  x_n
  x1      x2      · · ·  x_n      x_{n+1}
  · · ·
  x_{n−1} x_n     · · ·  x_{2n−2} x_{2n−1}
  x_n     x_{n+1} · · ·  x_{2n−1} x_{2n}  ]   (1443)

is said to be a Hankel matrix. H ⪰ 0, since for xi = Et^i, i ∈ [0, 2n], and y ∈ Rn+1 we have

y′Hy = Σ_{i,j=0}^{n} yi yj Et^{i+j} = E(y0 + y1 t + · · · + yn t^n)² ≥ 0. (1444)

On the other hand, if x0 = 1 and H(x) ≻ 0, then ∃ some distribution on R s.t. xi = Et^i, i ∈ [0, 2n]
(verify this). The condition that x0, · · · , x2n are the moments of some distribution on R can be expressed
as an LMI asserting that H(x0, · · · , x2n) ⪰ 0, together with x0 = 1. Now consider the problem where
we are given some bounds on the moments of the random t in the form

μk^{(l)} ≤ Et^k ≤ μk^{(u)}, k = 1, · · · , 2n. (1445)

We are to find bounds on the expected value of a polynomial p(t) = Σ_{i=0}^{2n} ci t^i in t. The lower bound
is given by

minimize Ep(t) = Σ_{i=0}^{2n} ci Et^i = Σ_{i=0}^{2n} ci xi, (1446)
subject to μk^{(l)} ≤ Et^k ≤ μk^{(u)}, k = 1, · · · , 2n. (1447)

As per the discussion, we do this (up to the additive constant c0, since x0 = 1 is fixed) by solving the SDP

minimize Σ_{i=1}^{2n} ci xi, (1448)
subject to μk^{(l)} ≤ xk ≤ μk^{(u)}, k = 1, · · · , 2n, (1449)
H(1, x1, · · · , x2n) ⪰ 0. (1450)

Exercise 470 (Portfolio Risk Bounding with Incomplete Covariance Matrix). Consider the portfolio
problem of n assets with dollar investment vector x, net return vector p and return covariance Σ as
in Exercise 461. For p̄ = Ep, the expected change in portfolio value is p̄′x, and the standard deviation of
the portfolio's dollar return is σ = (x′Σx)^{1/2}. As opposed to letting x be the optimization variable, in the risk
bounding problem assume x is known, but we do not know Σ, except for a number of priors Lij ≤ Σij ≤
Uij for i, j ∈ [n]. We want to give an upper bound for the portfolio risk that is consistent with these
priors, which is to say we want to find

σ∗² = sup{x′Σx : Lij ≤ Σij ≤ Uij , i, j ∈ [n], Σ ⪰ 0}. (1451)

The problem is the SDP (in the variable Σ)

maximize x′Σx (1452)
subject to Lij ≤ Σij ≤ Uij , i, j ∈ [n], (1453)
Σ ⪰ 0. (1454)

We can place other priors on Σ, as long as they are convex. For instance, suppose Σ has an estimate
from an estimation process, giving Σ̂. We may also be given some information about this estimate, such
as a confidence ellipsoid given by the inequality C(Σ − Σ̂) ≤ α, where C ≻ 0 and α is the confidence level.
This is a convex constraint. Another common scenario is when the returns are modelled by a factor model
(Section 12.4); then we are given Σ = F ΣF F' + D, where F are the factor loadings, ΣF is the factor
covariance matrix and D is diagonal, containing idiosyncratic noise. Then p = F z + d, and since Σ is a
linear function of ΣF and D, we may impose any convex constraints on them.
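A minimal sketch of the risk-bounding SDP, assuming numpy and cvxpy; the holdings x and the elementwise bounds L, U are hypothetical placeholders built from a reference covariance.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
n = 4
x = rng.standard_normal(n)                       # known dollar holdings
B = rng.standard_normal((n, n))
Sigma_ref = B @ B.T / n                          # a reference covariance
L, U = Sigma_ref - 0.05, Sigma_ref + 0.05        # priors L_ij <= Sigma_ij <= U_ij

Sigma = cp.Variable((n, n), PSD=True)            # PSD variable is symmetric by construction
risk = x @ Sigma @ x                             # linear in Sigma since x is fixed
prob = cp.Problem(cp.Maximize(risk), [Sigma >= L, Sigma <= U])
prob.solve()
print(np.sqrt(prob.value))                       # worst-case portfolio sigma consistent with the priors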

Exercise 471. Let G be an undirected graph with n nodes and a set of edges ξ ⊆ {1, · · · , n} × {1, · · · , n},
where the pair (i, j) indicates that there is an edge between node i and node j. The graph is symmetric,
(i, j) ∈ ξ ⟺ (j, i) ∈ ξ, and it is possible to have (i, i) ∈ ξ for i ∈ [n]. A Markov chain on the graph has
state X(t) at step t. The transition matrix is given by P = P', where Pij is the probability of traversing
the edge from node i to node j. At each node i, the probabilities Pij must sum to one; we have
1'P = 1', P1 = 1. The uniform distribution (1/n)1 is an equilibrium distribution for the Markov
chain. It turns out that the convergence of the Markov state X(t) to (1/n)1 (in distribution) is determined
by the second largest eigenvalue modulus of P, characterized by r = max{λ2, −λn}, where the eigenvalues
of P are given by λ(P): 1 = λ1 ≥ λ2 ≥ · · · ≥ λn. The value r is said to be the mixing rate of the Markov
chain. If r < 1, the distribution of X(t) approaches the uniform distribution asymptotically in the number
of steps. The fastest mixing Markov chain problem is to find P that minimizes r. Recall that P1 = 1, so P
has eigenvalue one. The mixing rate can be expressed as a function of P given by (verify this) r = ‖QPQ‖₂,
where Q = 1 − (1/n)11' is the orthogonal projection onto 1⊥. We may write

    r = ‖QPQ‖₂ = ‖(1 − (1/n)11')P(1 − (1/n)11')‖₂ = ‖P − (1/n)11'‖₂.       (1455)
So we solve the convex problem

    minimize    ‖P − (1/n)11'‖₂                               (1456)
    subject to  P1 = 1,                                       (1457)
                Pij ≥ 0,  i, j ∈ [n],                         (1458)
                Pij = 0,  (i, j) ∉ ξ,                         (1459)

which can be written as the SDP

    minimize    t                                             (1460)
    subject to  −t1 ⪯ P − (1/n)11' ⪯ t1,                      (1461)
                P1 = 1,                                       (1462)
                Pij ≥ 0,  i, j ∈ [n],                         (1463)
                Pij = 0,  (i, j) ∉ ξ.                         (1464)
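A minimal sketch of the fastest mixing Markov chain problem on a small path graph, assuming numpy and cvxpy; the graph below is a hypothetical example.

import numpy as np
import cvxpy as cp

n = 5
edges = {(i, i) for i in range(n)}
edges |= {(i, i + 1) for i in range(n - 1)} | {(i + 1, i) for i in range(n - 1)}

P = cp.Variable((n, n), symmetric=True)
J = np.ones((n, n)) / n
cons = [P >= 0, cp.sum(P, axis=1) == 1]
cons += [P[i, j] == 0 for i in range(n) for j in range(n) if (i, j) not in edges]
prob = cp.Problem(cp.Minimize(cp.sigma_max(P - J)), cons)
prob.solve()
print(prob.value)      # the optimal mixing rate r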

4.6.7 Vector Optimization


The objective function may be extended to be a vector instead of a scalar. The general vector optimization
problem takes form

minimize f0 (x) (1465)


subject to fi (x) ≤ 0, i ∈ [m], (1466)
hi (x) = 0, i ∈ [p]. (1467)

Here the minimization is w.r.t. a proper cone K ⊆ Rq and f0 : Rn → Rq. This vector optimization
problem (Equations 1465–1467) is further said to be a convex vector optimization problem if f0 is K-convex,
fi, i ∈ [m] are convex and hj, j ∈ [p] are affine. Recall the interpretations for generalized inequalities
(Definition 224), and minimal and minimum elements (Definitions 226 and 225).
Consider the feasible set F, then let the set of feasible/achievable objective values take on the set
O = {f0 (x) : x ∈ F}. If the set has a minimum element, then ∃x ∈ F s.t. f0 (x) 4K f0 (y) for all y ∈ F.
This x is optimal and we denote this as x∗ . Note that this is comparable to every other point in the
feasible set, and the relation O ⊆ f0 (x∗ ) + K is satisfied (see discussion in referenced sections), where
the RHS are the set of points that are at least worse than f0 (x∗ ).

Exercise 472. Let y = Ax + v, where v ∈ Rm is a random vector and y is a vector of observations.
We are tasked with estimating x ∈ Rn. Suppose that A has rank n (it is full column rank), and that we are
given Ev = 0, Var v = Evv' = 1. A linear estimator is any estimator of x of the form x̂ = Fy.
Furthermore, the estimator is unbiased if Ex̂ = x, which clearly holds when FA = 1. We can also compute
the covariance of an unbiased estimator as E(x̂ − x)(x̂ − x)' = E(Fv)(Fv)' = F(Evv')F' = FF'. Our goal
is to find the best linear unbiased estimator (BLUE), in the sense that this estimator has the smallest
covariance matrix in the matrix inequality sense. That is, given two unbiased linear estimators x̂1, x̂2, we
say that x̂1 is at least as good as x̂2 (F1F1' ⪯ F2F2') iff ∀c, E(c'x̂1 − c'x)² ≤ E(c'x̂2 − c'x)². We solve the
optimization problem

    minimize    FF',                                          (1468)
    subject to  FA = 1.                                       (1469)

Here FF' is S₊ⁿ-convex. As we know, when A is full column rank, the optimal solution is characterized by
the pseudo-inverse of A, given by

    F∗ = A⁺ = (A'A)⁻¹A'.                                      (1470)

See Exercise 324.
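A short numerical check of this claim (a sketch, assuming numpy, with random placeholder data): the pseudo-inverse estimator is unbiased, and any other unbiased linear estimator has a covariance matrix that is larger in the PSD order.

import numpy as np

rng = np.random.default_rng(3)
m, n = 8, 3
A = rng.standard_normal((m, n))                  # full column rank with probability one
F_blue = np.linalg.pinv(A)                       # (A'A)^{-1} A'
print(np.allclose(F_blue @ A, np.eye(n)))        # unbiased: FA = 1

# any other unbiased estimator has the form F = A^+ + D with DA = 0
D = rng.standard_normal((n, m)) @ (np.eye(m) - A @ F_blue)
F_other = F_blue + D
gap = F_other @ F_other.T - F_blue @ F_blue.T    # equals DD', hence PSD
print(np.linalg.eigvalsh(gap).min() >= -1e-10)   # no negative eigenvalues (up to rounding)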

We can also have no optimal point; there is no minimum element. Instead, we look for the minimal
elements, which we also say to be Pareto optimal/efficient points. In particular, some x ∈ F is said to
be Pareto optimal if f0 (x) is minimal element in O. The relation f0 (y) 4K f (x) =⇒ f0 (y) = f0 (x) is
satisfied for all y ∈ F. See Definition 226. An equivalent condition given there asserts that

(f0 (x) − K) ∩ O = {f0 (x)}. (1471)

The set of Pareto optimal values P satisfy P ⊆ O ∩ bd O. Every Pareto optimal value is achievable value
on the boundary of the set of achievable values.
Scalarization is a standard technique for finding Pareto optimal points using dual generalized inequalities
(see Definition 269). Let λ ≻_{K∗} 0. Then consider the scalar optimization problem given by

minimize λ0 f0 (x), (1472)


subject to fi (x) ≤ 0, i ∈ [1, m], (1473)
hi (x) = 0, i ∈ [1, p]. (1474)

If x∗ is an optimal point of the scalarized problem (Equations 1472–1474), then x∗ is Pareto optimal for the vector optimization problem. This
follows directly from Theorem 271. By the scalarization method, we may find Pareto optimal points for

any vector optimization problem using a scalar optimization problem. Different values of λ ≻_{K∗} 0 often
give rise to different points in P. However, it is possible to have Pareto optimal points that cannot be
obtained via scalarization without assuming convexity. The optimal point x in the scalarization problem
is s.t. {u : λ'(u − f0(x)) = 0} is a supporting hyperplane to O at f0(x), and

    {u : λ'(u − f0(x)) < 0} ∩ O = ∅.                          (1475)

Now further suppose that the vector optimization problem is convex. Then the scalarized optimization
problem is also convex, and we can find Pareto optimal points of a convex vector optimization problem by
solving a convex scalar optimization problem. Furthermore, for every Pareto optimal point x^(p),
∃λ ⪰_{K∗} 0, λ ≠ 0 s.t. x^(p) is an optimal solution to the convex scalarized problem. Scalarization yields
the entire set P in the convex case. To see this, consider the set A = O + K = {t ∈ Rq : f0(x) ⪯_K t for
some x ∈ F}. Given f0 is K-convex, A is convex regardless of O, and the minimal elements of A are the
minimal elements of O (verify this). Then any minimal element of A minimizes λ'z over A for some nonzero
λ ⪰_{K∗} 0 (Theorem 271). Any Pareto optimal point for the vector optimization problem is therefore
optimal for some scalarized problem. Note that it is not the case that every solution of the scalarized
problem with nonzero λ ⪰_{K∗} 0 is Pareto optimal, although every solution to the scalarized problem with
λ ≻_{K∗} 0 is Pareto optimal.

Exercise 473. Consider the convex vector optimization problem

    minimize    X,                                            (1476)
    subject to  X ⪰ Ai,  i ∈ [m]                              (1477)

for Ai ∈ S^n, where the associated proper cone is S₊ⁿ. A Pareto optimal point may be found by choosing
any W ∈ S₊₊ⁿ and solving the problem

    minimize    tr(WX),                                       (1478)
    subject to  X ⪰ Ai,  i ∈ [m].                             (1479)

This is an SDP. For each matrix A ∈ S₊₊ⁿ, there is an associated ellipsoid ξ_A = {u : u'A⁻¹u ≤ 1} such
that A ⪯ B iff ξ_A ⊆ ξ_B. Finding a Pareto optimal X then corresponds to finding a minimal ellipsoid
containing the ellipsoids ξ_{Ai}.

If the cone K is the nonnegative orthant R₊^q, we say that the problem is a multicriterion/multi-objective
optimization problem (MOO). Each component of f0(x) can be treated as a separate scalar objective, and
we denote them Fi, i ∈ [q]. We say that the MOO is a convex MOO if fi, i ∈ [m] are convex constraints
and hj, j ∈ [p] are affine constraints. Furthermore, the objectives Fi, i ∈ [q] must be convex. The theory
of vector optimization (Section 4.6.7) applies directly to MOO problems, for obvious reasons. The
interpretations that can be made are also natural. We say that the solution x dominates solution y if
f0(x) ⪯ f0(y) but f0(x) ≠ f0(y).
If we have an optimal x∗, we say that the objectives are noncompeting, since that point is
optimal jointly in each of the q objectives. On the other hand, a Pareto optimal point x(p) is feasible
and there is no (unanimously) better feasible point. We are then interested in the tradeoff between the q
objectives. These may be studied under the framework of utility theory, by comparing the marginal rate
of substitution (MRS) across different objectives. Here, a light survey of tradeoff analysis is discussed.

Suppose x, y are Pareto optimal satisfying

Fi (x) < Fi (y) i ∈ A, (1480)


Fi (x) = Fi (y) i ∈ B, (1481)
Fi (x) > Fi (y) i ∈ C, (1482)

with A ∪ B ∪ C = [q]. Since x and y are both Pareto optimal, either B = [q], or B ≠ [q] with A ≠ ∅ and
C ≠ ∅. We say there is a tradeoff between the objectives in A and those in C. In the two-objective problem
with F1(x), F2(x), if a large increase in F2 must be accepted to realize a small decrease in F1, then we say
that there is a strong tradeoff. The other end of the spectrum would, of course, be called a weak tradeoff.
The set of objective values of the Pareto optimal set
is said to be the optimal tradeoff surface/curve. Of course, when the problem possesses an optimal point,
this curve is a singleton. The tangents to the curve show the local optimal tradeoff between objectives,
and a point of large curvature is one where small decreases in one objective can cause large increase in
the other. This is said to be the knee of the tradeoff curve.
When a MOO is scalarized using the weighted sum objective

    λ'f0(x) = ∑_{i=1}^{q} λi Fi(x),   λ ≻ 0,                  (1483)

λi indicates the relative weight we place on the i-th objective, where large values force Fi to be small.
The ratio λi/λj can be thought of as the relative importance, or as an exchange rate, between objectives
Fi and Fj. At any point where the optimal tradeoff surface is smooth, λ gives the inward normal to the
surface at the associated Pareto optimal point. Minimizing the weighted sum of objectives, and adjusting
the weights to obtain a suitable solution tradeoff, is the essence of duality.

Exercise 474 (Regularized Least Squares). Suppose we are given A ∈ Rm×n , b ∈ Rm and we want to
choose x ∈ Rn under two objectives, namely minimizing F1 (x) = kAx − bk22 and F2 (x) = kxk22 . We can
write this in terms of the vector optimization problem

minimize f0 (x) = (F1 (x), F2 (x)), (1484)

and we can solve the scalarization with λ = (λ1, λ2) ≻ 0, writing

    λ'f0(x) = λ1 F1(x) + λ2 F2(x)                             (1485)
            = x'(λ1 A'A + λ2 1)x − 2λ1 b'Ax + λ1 b'b.         (1486)

The solution to this problem is given by setting

    ∂(λ'f0(x))/∂x = 2(λ1 A'A + λ2 1)x − 2λ1 A'b = 0,          (1487)

so

    x(µ) = (λ1 A'A + λ2 1)⁻¹ λ1 A'b = (A'A + µ1)⁻¹ A'b,   µ = λ2/λ1.      (1488)

This scalarization gives all Pareto optimal points except at the extremes µ → ∞ and µ → 0. In the
first case, we have x = 0. The other extreme is just the least-squares optimal solution given by A⁺b.
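A quick way to trace the optimal tradeoff curve numerically (a sketch, assuming numpy; A and b are random placeholders) is to sweep µ and evaluate both objectives at x(µ) from Equation 1488.

import numpy as np

rng = np.random.default_rng(4)
m, n = 20, 5
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
for mu in [1e-3, 1e-1, 1.0, 10.0, 1e3]:
    x = np.linalg.solve(A.T @ A + mu * np.eye(n), A.T @ b)   # x(mu) in Equation 1488
    F1, F2 = np.linalg.norm(A @ x - b) ** 2, np.linalg.norm(x) ** 2
    print(f"mu={mu:g}  F1={F1:.4f}  F2={F2:.4f}")            # F1 increases as F2 decreases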

4.7 Duality
4.7.1 Lagrangian and Dual Functions
Definition 252 (Lagrangian). Recall the standard mathematical optimization problem (Definition 244),
that is not necessarily convex. Assume D = (∩_{i=0}^{m} dom fi) ∩ (∩_{j=1}^{p} dom hj) is not empty, and
let the optimal value be p∗. The Lagrangian is defined to be L : Rn × Rm × Rp → R given by

    L(x, λ, ν) = f0(x) + ∑_{i=1}^{m} λi fi(x) + ∑_{j=1}^{p} νj hj(x),   dom L = D × Rm × Rp.      (1489)

The variables λi , νj are said to be Lagrange multipliers associated with the i-th inequality constraint and
j-th equality constraint respectively. In particular, λ, ν are said to be dual variables or Lagrange multiplier
vectors.

Definition 253 (Lagrange Dual Function). The Lagrange dual function is defined to be the minimum
value of the Lagrangian over x. We may write it as g : Rm × Rp → R s.t.

    g(λ, ν) = inf_{x∈D} L(x, λ, ν) = inf_{x∈D} ( f0(x) + ∑_{i=1}^{m} λi fi(x) + ∑_{j=1}^{p} νj hj(x) ),   λ ∈ Rm, ν ∈ Rp.      (1490)

If the Lagrangian is unbounded below in x, then the Lagrange dual function evaluates to −∞. See the
comment after Theorem 279. The Lagrangian is affine in (λ, ν). The dual function is the pointwise infimum
of a family of affine (and hence concave) functions of (λ, ν), and is therefore concave. This does not depend
on the nature of the fi's and hj's.

Theorem 307 (Dual Function (Definition 253) is a Lower Bound on p∗). Given an optimal solution x∗ with
optimal value p∗ = f0(x∗), for any λ ⪰ 0, ν ∈ Rp, we have

    g(λ, ν) ≤ p∗.                                             (1491)

Proof. Define the feasible set F. For x̃ ∈ F, we have fi(x̃) ≤ 0, hi(x̃) = 0. Since λ ⪰ 0, we have

    ∑_{i=1}^{m} λi fi(x̃) + ∑_{j=1}^{p} νj hj(x̃) ≤ 0.         (1492)

It should be trivial to see that

    g(λ, ν) = inf_x L(x, λ, ν)                                (1493)
            ≤ L(x̃, λ, ν)                                      (1494)
            = f0(x̃) + ∑_{i=1}^{m} λi fi(x̃) + ∑_{j=1}^{p} νj hj(x̃)          (1495)
            ≤ f0(x̃).                                          (1496)

Since this holds for every feasible x̃, taking x̃ = x∗ gives the inequality.

The dual function gives lower bounds (Theorem 307). When g(λ, ν) = −∞, the inequality is trivial.
It is a nontrivial lower bound on p∗ only when λ ⪰ 0 and (λ, ν) ∈ dom g, i.e. g(λ, ν) > −∞. We say that a
pair (λ, ν) satisfying these conditions is dual feasible.

The standard form optimization problem can be written in an unconstrained way by sufficiently
penalizing infeasible points. We may write the form in Definition 244 as

    minimize  f0(x) + ∑_{i=1}^{m} I₋(fi(x)) + ∑_{j=1}^{p} I₀(hj(x)),        (1497)

where I₋(u) is 0 when u ≤ 0 and ∞ otherwise, and I₀(u) is 0 when u = 0 and ∞ otherwise. The values
of the indicator terms measure our displeasure at constraints being violated. We can replace this with a
'softer' measure of displeasure that grows with the degree of constraint violation, by specifying for some
λ ⪰ 0, ν the function f0(x) + λ'f(x) + ν'h(x) (we let f(x) be the vector of the fi(x) and h(x) the vector of
the hj(x)). See that this is precisely the Lagrangian L(x, λ, ν). Minimizing this Lagrangian over x in turn
yields the dual function. See also that when the constraints have margin from the constraint boundaries
(when fi(x) < 0), the corresponding terms may be negative, so it should be obvious from this that the
dual function is a lower bound on p∗.

Exercise 475. The least squares problem subject to affine constraints is given the formulation

minimize kxk22 , (1498)


subject to Ax = b. (1499)

Then the Lagrangian is characterized by L(x, ν) = x'x + ν'(Ax − b). To find g(ν) = inf_x L(x, ν): since
L(x, ν) is convex quadratic in x, setting ∇_x L(x, ν) = 2x + A'ν = 0, we arrive at x = −(1/2)A'ν. The
dual function is given by g(ν) = (−(1/2)A'ν)'(−(1/2)A'ν) + ν'(A(−(1/2)A'ν) − b) = −(1/4)ν'AA'ν − b'ν.
This is a lower bound for p∗ = inf{x'x : Ax = b}.

Exercise 476. The standard form LP (Definition 1294) given by

minimize c0 x, (1500)
subject to Ax = b, (1501)
x<0 (1502)

has Lagrangian

    L(x, λ, ν) = c'x − ∑_{i=1}^{n} λi xi + ν'(Ax − b) = −b'ν + (c + A'ν − λ)'x     (1503)

with dual function g(λ, ν) = −b'ν + inf_x (c + A'ν − λ)'x. A linear function is bounded below only when it
is identically zero, so

    g(λ, ν) = { −b'ν    if A'ν − λ + c = 0,
                −∞      otherwise.                            (1504)

g is finite only on a proper affine subset of Rm × Rp, characterized by A'ν = λ − c.

Exercise 477. Consider the minimization problem

minimize x0 W x, (1505)
subject to x2i = 1, i ∈ [1, n], (1506)

with W ∈ S n . The problem is not necessarily convex, W is not guaranteed to be positive semidefinite.
The dual function is given by

    g(ν) = inf_x L(x, ν) = inf_x ( x'Wx + ∑_{i=1}^{n} νi(xi² − 1) ) = inf_x x'(W + diag(ν))x − 1'ν.     (1507)

The infimum over x is finite only if the quadratic form is positive semidefinite (see Theorem 257), so we
have an analytical expression for the dual function:

    g(ν) = { −1'ν    if W + diag(ν) ⪰ 0,
             −∞      otherwise.                               (1508)

Take ν = −λmin(W)1. Then W + diag(ν) = W − λmin(W)1 ⪰ 0, so ν is dual feasible and g(ν) is a lower
bound on p∗. In particular, p∗ ≥ −1'ν = nλmin(W).
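The bound can be checked by brute force on a small instance (a sketch, assuming numpy; W is a random placeholder and the feasible set x ∈ {−1, 1}ⁿ is enumerated directly).

import numpy as np
from itertools import product

rng = np.random.default_rng(5)
n = 10
W = rng.standard_normal((n, n)); W = (W + W.T) / 2
lower = n * np.linalg.eigvalsh(W).min()                      # p* >= n * lambda_min(W)
p_star = min(np.array(s) @ W @ np.array(s) for s in product([-1.0, 1.0], repeat=n))
print(lower, p_star)                                         # lower <= p_star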

4.7.2 Dual and Conjugate Functions


Recall conjugate functions (Definition 238) given by f ∗ (y) = supx∈dom f (y 0 x − f (x)). It turns out
conjugate functions and dual functions are closely related.

Theorem 308. The dual function of an optimization problem with affine constraints may be written in
terms of the conjugate function of the objective. Define the problem

    minimize    f0(x),                                        (1509)
    subject to  Ax ⪯ b,                                       (1510)
                Cx = d.                                       (1511)

Then we have

    g(λ, ν) = inf_x { f0(x) + λ'(Ax − b) + ν'(Cx − d) }       (1512)
            = −b'λ − d'ν + inf_x { f0(x) + (A'λ + C'ν)'x }    (1513)
            = −b'λ − d'ν − sup_x { −(A'λ + C'ν)'x − f0(x) }   (1514)
            = −b'λ − d'ν − f0∗(−A'λ − C'ν),   dom g = {(λ, ν) : −A'λ − C'ν ∈ dom f0∗}.     (1515)

Exercise 478. Consider the problem

minimize kxk, (1516)


subject to Ax = b. (1517)

The conjugate of the objective is given by

    f0∗(y) = { 0    if ‖y‖∗ ≤ 1,
               ∞    otherwise                                 (1518)

by Exercise 284. Equation 1515 asserts that the dual function is

    g(ν) = −b'ν − f0∗(−A'ν) = { −b'ν    if ‖A'ν‖∗ ≤ 1,
                                −∞      otherwise.            (1519)

Exercise 479. Consider the problem

    minimize    f0(x) = ∑_{i=1}^{n} xi log xi,   dom f0 = R₊₊ⁿ,             (1520)
    subject to  Ax ⪯ b,                                       (1521)
                1'x = 1.                                      (1522)

Then the conjugate function (see Exercise 433.4) is given by

    f0∗(y) = ∑_{i=1}^{n} exp(yi − 1),   dom f0∗ = Rn.         (1523)

Using Equation 1515 we get the dual function

    g(λ, ν) = −b'λ − ν − ∑_{i=1}^{n} exp(−ai'λ − ν − 1).      (1524)

Exercise 480. Given X ∈ S^n, consider the problem

    minimize    f0(X) = log det X⁻¹,   dom f0 = S₊₊ⁿ,         (1525)
    subject to  ai'Xai ≤ 1,  i ∈ [1, m].                      (1526)

Note that the volume of the ellipsoid ξ_X = {z : z'Xz ≤ 1} is proportional to (det X⁻¹)^(1/2), so the
problem is to find the minimum volume ellipsoid of this form containing the m points ai. Exercise 433.7
asserts that the conjugate is given by

    f0∗(Y) = −n + log det(−Y)⁻¹,   dom f0∗ = −S₊₊ⁿ.           (1527)

Furthermore, the constraint ai'Xai ≤ 1 is in fact linear in X, and we may rewrite it as tr((ai ai')X) ≤ 1.
Then Equation 1515 asserts that the dual function is given by

    g(λ) = { −1'λ + log det(∑_{i=1}^{m} λi ai ai') + n    if ∑_{i=1}^{m} λi ai ai' ≻ 0,
             −∞                                           otherwise.       (1528)
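A minimal sketch of this minimum-volume-ellipsoid problem, assuming numpy and cvxpy (whose log_det atom and SDP-capable default solver are assumptions); the points ai are random placeholders, and the constraint duals returned by the solver correspond to the λi appearing in Equation 1528.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(6)
n, m = 2, 30
pts = rng.standard_normal((m, n))                   # the points a_i
X = cp.Variable((n, n), PSD=True)
cons = [a @ X @ a <= 1 for a in pts]                # a_i' X a_i <= 1
prob = cp.Problem(cp.Maximize(cp.log_det(X)), cons) # = minimize log det X^{-1}
prob.solve()
lam = np.array([c.dual_value for c in cons])        # optimal Lagrange multipliers
print(prob.value, int((lam > 1e-6).sum()))          # only points touching the ellipsoid are active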

4.7.3 Lagrange Dual Problem


We know that the Lagrange dual function g(λ, ν) is lower bound to p∗ (Theorem 307), and finding the
tightest bound would amount to solving the problem

maximize g(λ, ν), (1529)


subject to λ < 0. (1530)

The problem of maximizing the Lagrange dual function (Equation 1530) is said to be the Lagrange dual
problem. The associated, original problem is said to be the primal problem. We said that (λ < 0, ν) is
dual feasible when g(λ, ν) > −∞, in that it is feasible for the dual problem. The solution to the Lagrange
dual problem is said to be the dual optimal or optimal Lagrange multipliers, and we denote this (λ∗ , ν ∗ ).
See discussion after Definition 253 for Lagrange dual functions. It says that g(λ, ν) is concave. The dual
problem is maximization of concave function, so the dual problem is equivalently convex. The constraint
is clearly convex. This is regardless of the primal problem.
Often times there are implicit constraints in the dual problem that defines dual feasible points. We
may make them explicit.

Exercise 481. Consider the standard form LP (Definition 1294) given by

minimize c0 x, (1531)
subject to Ax = b, (1532)
x<0 (1533)

with dual function

    g(λ, ν) = { −b'ν    if A'ν − λ + c = 0,
                −∞      otherwise,                            (1534)

as in Exercise 476 with λ < 0. Clearly the equivalent problem to the dual problem defined by making the
finiteness constraints explicit can be written

maximize −b0 ν, (1535)


0
subject to A ν − λ + c = 0, (1536)
λ < 0. (1537)

We can write this as

maximize −b0 ν, (1538)


0
subject to A ν + c < 0. (1539)

This is an LP in inequality form, and we loosely refer to all of these equivalent forms involving λ to be
the Lagrange dual of the standard form LP.

Exercise 482. Consider the LP in inequality form given by

minimize c0 x, (1540)
subject to Ax 4 b. (1541)

The Lagrange dual function is given by (see Theorem 308) g(λ) = −b'λ + inf_x (A'λ + c)'x. The linear
component in x is bounded below only when it is identically zero, so

    g(λ) = { −b'λ    if A'λ + c = 0,
             −∞      otherwise.                               (1542)

If λ < 0 and A0 λ + c = 0 then λ is dual feasible. The Lagrange dual problem of the inequality form LP is

maximize −b0 λ, (1543)


subject to A0 λ + c = 0, (1544)
λ < 0. (1545)

This LP is in standard form.

It is interesting that the dual problem of the standard form LP is the inequality form and vice versa.
In the LP case the Lagrange dual problem and primal problem are equivalent.
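A small numerical check of this LP duality (a sketch, assuming numpy and scipy; the data are random but constructed so that both the primal and the dual are feasible, hence strong duality holds).

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
m, n = 3, 6
A = rng.standard_normal((m, n))
b = A @ rng.random(n)                   # some x >= 0 is feasible
c = rng.random(n) + 0.1                 # c > 0 keeps the primal bounded below

# primal: minimize c'x  s.t.  Ax = b, x >= 0
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * n)
# dual:  maximize -b'nu  s.t.  A'nu + c >= 0,  i.e.  minimize b'nu  s.t.  -A'nu <= c
dual = linprog(b, A_ub=-A.T, b_ub=c, bounds=[(None, None)] * m)
print(primal.fun, -dual.fun)            # equal up to solver tolerance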
In the general case, the optimal value of the Lagrange dual problem, d∗ , is the tightest bound to p∗
obtained from the Lagrange dual function. The inequality

d∗ ≤ p∗ (1546)

holds and is said to be weak duality. The difference p∗ − d∗ is said to be the optimal duality gap. The
dual problem is always convex, even if the primal problem is not. Therefore, dual problems give us
information about the primal problem in the form of weak duality in the general case, and may be solved
efficiently. If the optimal duality gap is zero, then d∗ = p∗, and we say that strong duality holds. We now
analyze conditions for strong duality.

Theorem 309. If the primal problem is convex (i.e. has the standard form of Definition 247), strong duality
usually, but not always, holds. Conditions that guarantee strong duality are known as constraint
qualifications. One such constraint qualification is Slater's condition, which states that ∃x ∈ relint D s.t.
fi(x) < 0 for i ∈ [1, m] and Ax = b (a strictly feasible point). Slater's theorem asserts that convexity
together with Slater's condition implies strong duality. A weak form of Slater's condition applies when
some of the inequality constraints fi are affine. If the first k constraint functions fj, j ∈ [1, k] are affine,
and ∃x ∈ relint D s.t. fi(x) ≤ 0, i ∈ [1, k], fi(x) < 0, i ∈ [k + 1, m] and Ax = b, then strong duality holds.
That is to say, strict inequality is not required for the affine constraints. When all the constraints are
linear and dom f0 is open, Slater's condition holds whenever the feasible set F is non-empty, as it should
be easy to see.

Proof. We prove that Slater's condition on convex problems implies strong duality. Let the primal problem
be given by

    minimize    f0(x),                                        (1547)
    subject to  fi(x) ≤ 0, i ∈ [m],  Ax = b,                  (1548)

where fj, j ∈ [0, m] are convex. Furthermore assume Slater's condition holds, that is ∃x̃ ∈ D s.t.
fi(x̃) < 0, i ∈ [m] and Ax̃ = b. For simplicity assume that D has nonempty interior, so relint D = int D,
and that A is full row rank, so rank A = p. Further assume that p∗ is finite. Define the two convex sets

    A = {(u, v, t) : ∃x ∈ D, fi(x) ≤ ui, i ∈ [m], hi(x) = vi, i ∈ [p], f0(x) ≤ t},      (1549)
    B = {(0, 0, s) ∈ Rm × Rp × R : s < p∗}.                   (1550)

It should be easy to see that the two sets do not intersect, since for (u, v, t) ∈ A ∩ B we would have
f0(x) ≤ t < p∗ for some feasible x, a contradiction since p∗ is the optimal value of the primal problem.
The separating hyperplane theorem (Theorem 265) asserts that ∃(λ̃, ν̃, µ) ≠ 0 and α s.t.

    (u, v, t) ∈ A =⇒ λ̃'u + ν̃'v + µt ≥ α,                     (1551)
    (u, v, t) ∈ B =⇒ λ̃'u + ν̃'v + µt ≤ α.                     (1552)

For Equation 1551, λ̃'u + µt needs to be bounded below over A, so we must have λ̃ ⪰ 0 and µ ≥ 0.
Equation 1552 asserts that ∀t < p∗, we have µt ≤ α. In particular we have µp∗ ≤ α (verify this). Then for
any x ∈ D, we have

    ∑_{i=1}^{m} λ̃i fi(x) + ν̃'(Ax − b) + µf0(x) ≥ α ≥ µp∗.     (1553)

We consider cases. Suppose µ > 0; then dividing by µ we obtain

    ∀x ∈ D,  L(x, λ̃/µ, ν̃/µ) ≥ p∗,                            (1554)

so g(λ, ν) ≥ p∗ for λ = λ̃/µ, ν = ν̃/µ. Weak duality asserts that g(λ, ν) ≤ p∗, so we have g(λ, ν) = p∗ and
strong duality holds. Now if µ = 0, then we have

    ∑_{i=1}^{m} λ̃i fi(x) + ν̃'(Ax − b) ≥ 0   ∀x ∈ D.          (1555)

Applying this at the Slater point x̃ gives ∑_{i=1}^{m} λ̃i fi(x̃) ≥ 0. Furthermore, since fi(x̃) < 0 and
λ̃i ≥ 0, we must have λ̃ = 0. Since (λ̃, ν̃, µ) ≠ 0 and λ̃ = 0, µ = 0, it follows that ν̃ ≠ 0. But for all
x ∈ D, with λ̃ = 0, the inequality ν̃'(Ax − b) ≥ 0 must hold, and at x̃, ν̃'(Ax̃ − b) = 0. Since x̃ ∈ int D,
there exists x ∈ D s.t. ν̃'(Ax − b) < 0 unless A'ν̃ = 0; but since A is full row rank this is a contradiction,
so µ cannot be zero. The hyperplane separating A, B defines a supporting hyperplane to A at (0, 0, p∗).
Slater's condition establishes that this hyperplane is nonvertical, i.e. that the normal takes the form
(λ∗, ν∗, 1).

The least-squares dual problem to the primal problem in Exercise 475 is to maximize −(1/4)ν'AA'ν − b'ν.
Clearly the constraints are all affine. The comments in Theorem 309 assert that Slater's condition holds
when the primal problem is feasible, which is when b ∈ R(A). The Lagrange duals of the LP in standard
form and inequality form (Exercises 481, 482) exhibit strong duality iff both the primal and dual problems
are feasible, as should be clear.
Consider the QCQP form (Equation 1340) with inequality constraints only, given by

    minimize    (1/2)x'P0 x + q0'x + r0,                      (1556)
    subject to  (1/2)x'Pi x + qi'x + ri ≤ 0,  i ∈ [m],        (1557)

where P0 ∈ S₊₊ⁿ and Pi ∈ S₊ⁿ, i ∈ [1, m]. The Lagrangian is characterized by

    L(x, λ) = (1/2)x'P(λ)x + q(λ)'x + r(λ),                   (1558)

where P(λ) = P0 + ∑_{i=1}^{m} λi Pi, q(λ) = q0 + ∑_{i=1}^{m} λi qi, r(λ) = r0 + ∑_{i=1}^{m} λi ri. If
P(λ) ≻ 0 (which holds when λ ⪰ 0), Theorem 256 asserts that g(λ) = inf_x L(x, λ) =
−(1/2)q(λ)'P(λ)⁻¹q(λ) + r(λ), and the dual problem is written

    maximize    −(1/2)q(λ)'P(λ)⁻¹q(λ) + r(λ),                 (1559)
    subject to  λ ⪰ 0.                                        (1560)

The Slater condition holds when the quadratic constraints can be made strict, that is when ∃x with
(1/2)x'Pi x + qi'x + ri < 0, i ∈ [m]. If the Slater condition holds, then strong duality holds since the
problem is convex. Recall the entropy maximization problem (Exercise 479). Its dual problem can be
written using the dual function found there as

    maximize    −b'λ − ν − exp(−ν − 1) ∑_{i=1}^{n} exp(−ai'λ),              (1561)
    subject to  λ ⪰ 0.                                        (1562)

The constraints in that primal problem are all affine, so strong duality holds if the feasible set of the primal
problem is non-empty (i.e. ∃x ≻ 0, Ax ⪯ b, 1'x = 1). We can actually simplify the problem by analytically
maximizing over ν first. Setting the ν-derivative −1 + exp(−ν − 1) ∑_i exp(−ai'λ) = 0 gives
ν = log ∑_i exp(−ai'λ) − 1. Substituting this into the objective of the dual problem gives

    −b'λ + 1 − log ∑_i exp(−ai'λ) − exp(−log ∑_i exp(−ai'λ)) · ∑_i exp(−ai'λ) = −b'λ − log ∑_i exp(−ai'λ),

and we arrive at the equivalent dual problem

    maximize    −b'λ − log ∑_{i=1}^{n} exp(−ai'λ),            (1563)
    subject to  λ ⪰ 0.                                        (1564)

This is a GP (see Definition 250) in convex form (see Theorem 306).
The minimal volume ellipsoid problem in Exercise 480 has linear constraints in X, and furthermore there
is always a strictly feasible X ∈ S₊₊ⁿ, so the Slater conditions hold and strong duality holds w.r.t. the dual
problem there.
In some cases, it can be shown that strong duality is obtained even for a nonconvex primal problem
(verify this).

Exercise 483. Consider a game with two players, P1 , P2 . P1 makes a random choice in k ∈ [n] and P2
makes a random choice in l ∈ [m], and the outcome of the game is that P1 makes a payout of Pkl to P2 ,
where P ∈ Rn×m is the payoff matrix. Their decisions are modelled by a distribution that is independent
of one another. In particular we have

P(k = i) = ui , i ∈ [n], P(l = i) = vi , i ∈ [m]. (1565)


Then the expected payout is given by ∑_{i=1}^{n} ∑_{j=1}^{m} ui vj Pij = u'Pv. Suppose u was known to P2; then rationally
she solves the optimization problem supv {u0 P v : v < 0, 10 v = 1} = maxi (P 0 u)i . The best P1 can do,
assuming P2 knows u (that is when P1 goes first), is to then solve this ’minimax’ problem by minimizing
the worst case payoff by solving

minimize max (P 0 u)i , (1566)


i∈[m]

subject to u < 0, 10 u = 1. (1567)

This is a piecewise linear convex optimization problem. Let this have optimal value p∗1. Now suppose the
situation is turned around, the best P2 can do, assuming P1 knows v (that is when P2 goes first), by
similar reasoning, is to solve

maximize min (P v)i , (1568)


i∈[n]

subject to v < 0, 10 v = 1. (1569)

This is also piecewise linear convex optimization problem. Let this have optimal value p∗2 . Naturally,
since P2 tries to maximize the payoff, and p∗1 is computed assuming P2 gets to act later, then p∗1 ≥ p∗2 ,
and their difference should be construed as the advantage of getting to act later. We can write the former
minimization problem assuming P1 goes first as an LP written

minimize t, (1570)
0 0
subject to u < 0, 1 u = 1, P u 4 t1. (1571)

The Lagrangian for this problem can be written

t + λ0 (P 0 u − t1) − µ0 u + ν(1 − 10 u) = ν + (1 − 10 λ)t + (P λ − ν1 − µ)0 u. (1572)

The components that are linear in t and u must be identically zero in order for the dual function to be
finite, so we have
(
ν 1λ = 1, P λ − ν1 = µ,
g(λ, µ, ν) = (1573)
−∞ else.
So the dual problem is

maximize ν, (1574)
subject to λ < 0, 10 λ = 1, P λ < ν1. (1575)

But this is equivalent to the latter problem that P2 solves assuming she goes first. The two problems are
equivalent, and the LPs are feasible, strong duality holds and the optimal values p∗1 = p∗2 . It turns out
that there is in fact no advantage in knowing the opponent’s strategy (assuming rationality).
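A small numerical check of this equivalence (a sketch, assuming numpy and scipy; the payoff matrix is a random placeholder): both players' LPs are solved and the optimal values coincide, as strong duality asserts.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(8)
n, m = 4, 5
P = rng.standard_normal((n, m))           # payoff from player 1 to player 2

# player 1 first: minimize t  s.t.  P'u <= t*1, 1'u = 1, u >= 0  (variables z = [u; t])
c1 = np.r_[np.zeros(n), 1.0]
r1 = linprog(c1, A_ub=np.c_[P.T, -np.ones(m)], b_ub=np.zeros(m),
             A_eq=np.r_[np.ones(n), 0.0].reshape(1, -1), b_eq=[1.0],
             bounds=[(0, None)] * n + [(None, None)])

# player 2 first: maximize s  s.t.  P v >= s*1, 1'v = 1, v >= 0  (variables z = [v; s])
c2 = np.r_[np.zeros(m), -1.0]
r2 = linprog(c2, A_ub=np.c_[-P, np.ones(n)], b_ub=np.zeros(n),
             A_eq=np.r_[np.ones(m), 0.0].reshape(1, -1), b_eq=[1.0],
             bounds=[(0, None)] * m + [(None, None)])

print(r1.fun, -r2.fun)                    # p1* == p2* by LP strong duality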

Consider the set

G = {(f1 (x), · · · fm (x), h1 (x), · · · , hp (x), f0 (x)) ∈ Rm × Rp × R : x ∈ D}. (1576)

Clearly p∗ = inf{t : (u, v, t) ∈ G, u 4 0, v = 0}. The dual function minimizes the inner product of (λ, ν, 1)
over G, in particular we have

g(λ, ν) = inf{(λ, ν, 1)0 (u, v, t) : (u, v, t) ∈ G}. (1577)

If dual function is finite then (λ, ν, 1)0 (u, v, t) ≥ g(λ, ν) by definition, and the inequality defines a sup-
porting hyperplane to G. This is said to be nonvertical supporting hyperplane. We can derive weak
duality from this. Clearly, λ < 0, u 4 0, v = 0 =⇒ t ≥ (λ, ν, 1)0 (u, v, t) so

p∗ = inf{t : (u, v, t) ∈ G, u 4 0, v = 0} (1578)


0
≥ inf{(λ, ν, 1) (u, v, t) : (u, v, t) ∈ G, u 4 0, v = 0} (1579)
0
≥ inf{(λ, ν, 1) (u, v, t) : (u, v, t) ∈ G} (1580)
= g(λ, ν) (1581)

for λ < 0 and weak duality is obtained.


Define the set A ⊆ Rm × Rp × R s.t.

    A = G + (R₊^m × {0} × R₊)                                 (1582)
= {(u, v, t) : ∃x ∈ D, fi (x) ≤ ui , i ∈ [m], hi (x) = vi , i ∈ [p], f0 (x) ≤ t}. (1583)

Then p∗ = inf{t : (0, 0, t) ∈ A} and for λ < 0, the dual function is given by

g(λ, ν) = inf{(λ, ν, 1)0 (u, v, t) : (u, v, t) ∈ A}. (1584)

Again if g(λ, ν) is finite then (λ, ν, 1)0 (u, v, t) ≥ g(λ, ν) is nonvertical supporting hyperplane to A. In
particular, (0, 0, p∗ ) ∈ bd A, and p∗ = (λ, ν, 1)0 (0, 0, p∗ ) ≥ g(λ, ν) is the weak duality assertion. Strong
duality holds iff there is nonvertical supporting hyperplane to A at boundary point (0, 0, p∗ ).

Exercise 484 (Connection between Lagrange duality for problem without equality constraints and
scalarization of unconstrained multicriterion problems). Consider the problem

minimize f0 (x), subject to fi (x) ≤ 0, i ∈ [m], (1585)

as well as the scalarization of multicriterion problem given by

minimize F (x) = (f1 (x), · · · , fm (x), f0 (x)). (1586)

We may scale λ by any positive constant without affecting the minimizers, so without loss of generality,
take λ̃ = (λ, 1). Then λ̃'F(x) = f0(x) + ∑_{i=1}^{m} λi fi(x). It is easy to see this is the Lagrangian for the
problem without equality constraints.

Exercise 485 (Max-min characterizations of weak and strong duality). Consider the problem without
equality constraints. First note that
    sup_{λ⪰0} L(x, λ) = sup_{λ⪰0} ( f0(x) + ∑_{i=1}^{m} λi fi(x) )          (1587)
                      = { f0(x)    if fi(x) ≤ 0, i ∈ [m],
                          ∞        otherwise.                 (1588)

Note that if x is feasible (all fi(x) ≤ 0), then the optimal choice of λ is the zero vector, and we have
sup_{λ⪰0} L(x, λ) = f0(x). We may then express the optimal value of the primal problem as
p∗ = inf_x sup_{λ⪰0} L(x, λ).
On the other hand, the dual problem solves for the optimal value d∗ = supλ<0 inf x L(x, λ), where
inf x L(x, λ) is the dual function. Then weak duality is simply supλ<0 inf x L(x, λ) ≤ inf x supλ<0 L(x, λ)
and strong duality is when equality holds. Strong duality has the simple interpretation that the order of
minimization over x and maximization over λ < 0 does not matter. In general, for any f : Rn × Rm → R
and arbitrary W ⊆ Rn , Z ⊆ Rm , it is true that

sup inf f (w, z) ≤ inf sup f (w, z). (1589)


z∈Z w∈W w∈W z∈Z

This is said to be the max-min inequality. When equality holds, we say that f satisfies strong max-min
property, or the saddle point property.

Exercise 486 (Saddle point characterizations of weak and strong duality). The point (w̃, z̃) ∈ (W, Z)
is said to be saddle point for f if f (w̃, z) ≤ f (w̃, z̃) ≤ f (w, z̃) for all w ∈ W, z ∈ Z. This holds when
f (w̃, z̃) = inf w∈W f (w, z̃) = supz∈Z f (w̃, z). Then the strong form of max-min equality (Equation 1589)
holds. Then x∗ , λ∗ are primal and dual optimal points for a problem iff they form saddle point for the
Lagrangian.

4.7.4 Optimality Conditions


Suppose we have a dual feasible (λ, ν) s.t. p∗ ≥ g(λ, ν). We say this is a proof, or certificate, that
p∗ ≥ g(λ, ν). If we have strong duality, clearly the certificate would be the best possible one. For any
primal feasible x, we know that f0 (x) − p∗ ≤ f0 (x) − g(λ, ν). For any feasible x, we can establish that
it is -suboptimal, where  = f0 (x) − g(λ, ν). This  level is said to be the duality gap associated with
any primal feasible x and dual feasible (λ, ν). In fact, we know that p∗ , d∗ ∈ [g(λ, ν), f0 (x)] - the primal
optimal and dual optimal values fall in an interval equal to the width of the duality gap. This duality gap
is zero when strong duality holds, and the points in question are the primal optimal and dual optimal
points. In these scenarios, we may think of (λ, ν) as a certificate that x is optimal, and vice versa.
These are useful in acting as stopping criterions in iterative algorithms. Suppose an algorithm produces
a sequence of primal feasible x(k) , (λ(k) , ν (k) ), for k = 1, 2 · · · , and we want to obtain a solution up to
ε_abs > 0 absolute accuracy. Then the stopping criterion given by

    f0(x^(k)) − p∗ ≤ f0(x^(k)) − g(λ^(k), ν^(k)) ≤ ε_abs      (1590)

guarantees that the value of x^(k) is ε_abs-suboptimal. We say that this is proved by the certificate
(λ^(k), ν^(k)). We can also provide similar conditions for relative accuracy ε_rel > 0. If

    g(λ^(k), ν^(k)) > 0,   (f0(x^(k)) − g(λ^(k), ν^(k))) / g(λ^(k), ν^(k)) ≤ ε_rel,        (1591)

or

    f0(x^(k)) < 0,   (f0(x^(k)) − g(λ^(k), ν^(k))) / (−f0(x^(k))) ≤ ε_rel,                 (1592)

then the relative error is bounded s.t.

    (f0(x^(k)) − p∗) / |p∗| ≤ ε_rel.                          (1593)
Suppose we have primal optimal p∗ = f0 (x∗ ), and let (λ∗ , ν ∗ ) be dual optimal point. Further suppose
strong duality is obtained. We have

    f0(x∗) = g(λ∗, ν∗)                                        (1594)
           = inf_x ( f0(x) + ∑_{i=1}^{m} λ∗i fi(x) + ∑_{i=1}^{p} ν∗i hi(x) )               (1595)
           ≤ f0(x∗) + ∑_{i=1}^{m} λ∗i fi(x∗) + ∑_{i=1}^{p} ν∗i hi(x∗)                      (1596)
           ≤ f0(x∗).                                          (1597)

Clearly the two inequalities are equalities. x∗ minimizes L(x, λ∗, ν∗) over x. Since λ∗ ⪰ 0 and fi(x∗) ≤ 0,
and we can see that ∑_{i=1}^{m} λ∗i fi(x∗) = 0, we must have

    λ∗i fi(x∗) = 0,  i ∈ [1, m]                               (1598)

and this condition is to be known as the complementary slackness. The complementary slackness con-
dition holds for any primal optimal x∗ , dual optimal (λ∗ , ν ∗ ) where strong duality is obtained. At least
one of the components in each sum term must be zero. The i-th optimal Lagrange multiplier is zero
unless the i-th constraint is active at the optimum.

Theorem 310 (KKT conditions for nonconvex problems). Let x∗ , (λ∗ , ν ∗ ) be primal optimal, dual op-
timal points respectively where strong duality holds. Recall that x∗ minimizes L(x, λ∗ , ν ∗ ) over x. It
follows that we have

    ∇f0(x∗) + ∑_{i=1}^{m} λ∗i ∇fi(x∗) + ∑_{i=1}^{p} ν∗i ∇hi(x∗) = 0.        (1599)

It follows that we must have the conditions

    fi(x∗) ≤ 0,  i ∈ [1, m],                                  (1600)
    hi(x∗) = 0,  i ∈ [1, p],                                  (1601)
    λ∗i ≥ 0,  i ∈ [1, m],                                     (1602)
    λ∗i fi(x∗) = 0,  i ∈ [1, m],                              (1603)
    ∇f0(x∗) + ∑_{i=1}^{m} λ∗i ∇fi(x∗) + ∑_{i=1}^{p} ν∗i ∇hi(x∗) = 0.        (1604)

These conditions are known as Karush-Kuhn-Tucker (KKT) conditions. For any optimization problem
with differentiable objective and constraint functions for which strong duality obtains, a pair of primal
and dual optimal points implies the KKT conditions are satisfied.

Theorem 311 (KKT conditions for convex problems). If the primal problem is convex in standard form,
that is fi, i ∈ [0, m] are convex and hi, i ∈ [p] are affine, then x̃, λ̃, ν̃ satisfy the KKT conditions
(Equations 1600–1604) iff x̃, (λ̃, ν̃) are primal optimal and dual optimal with strong duality.

Proof. Clearly the first two lines in the KKT equations assert that x̃ is primal feasible. Furthermore, since
λ̃i ≥ 0, L(x, λ̃, ν̃) is convex in x, and the last line asserts that x̃ is a minimizer of L(x, λ̃, ν̃) over x, so

    g(λ̃, ν̃) = L(x̃, λ̃, ν̃) = f0(x̃) + ∑_{i=1}^{m} λ̃i fi(x̃) + ∑_{i=1}^{p} ν̃i hi(x̃) = f0(x̃).     (1605)

Then x̃, (λ̃, ν̃) are primal and dual optimal points with strong duality. For any convex optimization
problem with differentiable objective and constraint functions, the KKT conditions thus provide iff
conditions for primal and dual optimality. Furthermore, for convex optimization problems, if the Slater
conditions hold, then the optimal duality gap is zero and the dual optimum is attained, so x is optimal iff
there exist (λ, ν) such that together they satisfy the KKT conditions.

In some cases the KKT conditions may be solved analytically, here we take a look at some of these
scenarios.
Exercise 487. Consider the convex problem
1
minimize x0 P x + q 0 x + r, subject to Ax = b, (1606)
2
where P < 0. Then the KKT conditions are Ax∗ = b, P x∗ + q + A0 v ∗ = 0. This is a linear system in
m + n equations with m + n variables x∗ , v ∗ . Solving these gives us optimal primal and dual variables.
Exercise 488. Consider the convex problem

    minimize    −∑_{i=1}^{n} log(αi + xi),   αi > 0,          (1607)
    subject to  x ⪰ 0,  1'x = 1.                              (1608)

The KKT conditions for this problem are

    x∗ ⪰ 0,                                                   (1609)
    1'x∗ = 1,                                                 (1610)
    λ∗ ⪰ 0,                                                   (1611)
    λ∗i x∗i = 0,  i ∈ [n],                                    (1612)
    −1/(αi + x∗i) − λ∗i + ν∗ = 0,  i ∈ [1, n],                (1613)

which can be written as

    x∗ ⪰ 0,                                                   (1614)
    1'x∗ = 1,                                                 (1615)
    x∗i (ν∗ − 1/(αi + x∗i)) = 0,  i ∈ [1, n],                 (1616)
    ν∗ ≥ 1/(αi + x∗i),  i ∈ [n].                              (1617)

If ν∗ < 1/αi, then by the last condition we must have x∗i > 0, and the third condition asserts that
ν∗ = 1/(αi + x∗i), which requires x∗i = 1/ν∗ − αi. On the other hand, if ν∗ ≥ 1/αi, then by the same
reasoning, since x∗ ⪰ 0, we require that x∗i = 0. Therefore we have

    x∗i = { 1/ν∗ − αi    if ν∗ < 1/αi,
            0            if ν∗ ≥ 1/αi                         (1618)
         = max{0, 1/ν∗ − αi}.                                 (1619)

Then the second condition implies that ∑_{i=1}^{n} max{0, 1/ν∗ − αi} = 1.
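The scalar equation in ν∗ is easily solved numerically; below is a minimal sketch assuming numpy, with hypothetical αi, using bisection on 1/ν∗ (over which the left-hand side is nondecreasing).

import numpy as np

rng = np.random.default_rng(9)
n = 6
alpha = rng.random(n) + 0.5

def total(inv_nu):                       # sum_i max(0, 1/nu - alpha_i)
    return np.maximum(0.0, inv_nu - alpha).sum()

lo, hi = alpha.min(), alpha.max() + 1.0 / n + 1.0   # total(lo) = 0 < 1 < total(hi)
for _ in range(100):                     # bisection on 1/nu
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if total(mid) < 1.0 else (lo, mid)
x = np.maximum(0.0, 0.5 * (lo + hi) - alpha)
print(x.sum(), (x >= 0).all())           # ~ 1.0, True: the KKT solution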

Suppose strong duality is obtained and we have a dual optimal (λ∗ , ν ∗ ), and further suppose that the
Pm Pp
minimizer of L(x, λ∗ , ν ∗ ) = f0 (x) + i=1 λ∗i fi (x) + i=1 νi∗ hi (x) is a unique one. Then if this minimizer
is primal feasible, it must be primal optimal, and if it is not primal feasible, then no primal optimal
point exists.

Exercise 489. Recall the entropy maximization problem (Exercise 479) and its dual (Equation 1562).
Assume the Slater conditions hold (Theorem 309), that is ∃x ≻ 0, Ax ⪯ b, 1'x = 1. Then strong duality is
obtained and a dual optimal solution exists. Suppose we have already solved the dual problem to get
(λ∗, ν∗). Then the Lagrangian for the primal problem at the optimal dual point is

    L(x, λ∗, ν∗) = ∑_{i=1}^{n} xi log xi + λ∗'(Ax − b) + ν∗(1'x − 1).       (1620)

This is strictly convex on D, and bounded below, so its minimizer x∗ is unique and we can find it by setting

    ∂L/∂xi = log xi + 1 + ai'λ∗ + ν∗ = 0,                     (1621)

so x∗i = exp(−1 − ai'λ∗ − ν∗) for i ∈ [n]. If this x∗ is primal feasible, then it must be the optimal solution
of the primal problem. If x∗ is not primal feasible, then the primal optimum is not attained.

4.7.5 Perturbation and Sensitivity Analysis


Consider the perturbation/modification of the general optimization problem given by

minimize f0 (x), (1622)


subject to fi (x) ≤ ui , i ∈ [1, m], (1623)
hi (x) = vi , i ∈ [1, p]. (1624)

When ui > 0, we say that the i-th inequality constraint is relaxed, if ui < 0 we say it is tightened. The
right hand side of the equality constraint is changed to vi from zero. Denote

p∗ (u, v) = inf{f0 (x) : ∃x ∈ D, fi (x) ≤ ui , i ∈ [m], hi (x) = vi , i ∈ [p]} (1625)

to be the optimal value of the perturbed problem. Obviously p∗ (0, 0) = p∗ . When the original problem
is convex, the function p∗ : Rm × Rp → R is convex in u, v.
Theorem 312. Assume that strong duality holds and we have a dual optimal point (λ∗, ν∗) for the
unperturbed problem. Then

    ∀u, v,   p∗(u, v) ≥ p∗(0, 0) − λ∗'u − ν∗'v.               (1626)

Proof. Let x be a feasible point for the perturbed problem; then fi(x) ≤ ui for i ∈ [m] and hi(x) = vi for
i ∈ [p]. By the strong duality assumption it follows that

    p∗(0, 0) = g(λ∗, ν∗) ≤ f0(x) + ∑_{i=1}^{m} λ∗i fi(x) + ∑_{i=1}^{p} ν∗i hi(x)           (1627)
             ≤ f0(x) + λ∗'u + ν∗'v.                           (1628)

So

    f0(x) ≥ p∗(0, 0) − λ∗'u − ν∗'v,                           (1629)

and the result holds since this works for any feasible point of the perturbed problem.

The inequality asserted by Theorem 312 gives lower bounds on the optimal value of the perturbed
problem, and the size of λ∗ , ν ∗ tells us the sensitivity of the optimal value in relation to the constraint
perturbations.

Theorem 313. Suppose p∗(u, v) is differentiable at (0, 0). Then, assuming strong duality holds, the optimal
dual variables λ∗, ν∗ satisfy

    λ∗i = −∂p∗(0, 0)/∂ui,   ν∗i = −∂p∗(0, 0)/∂vi.             (1630)

The optimal Lagrange multipliers are precisely the local sensitivities of the optimal value w.r.t. constraint
perturbations.

Proof. Consider the perturbation u = tei, v = 0, where ei is the i-th unit vector. Then we have

    lim_{t→0} (p∗(tei, 0) − p∗)/t = ∂p∗(0, 0)/∂ui.            (1631)

Theorem 312 asserts that p∗(u, v) ≥ p∗(0, 0) − λ∗'u − ν∗'v. It follows that here we have

    (p∗(tei, 0) − p∗(0, 0))/t ≥ −λ∗i,   t > 0,                (1632)
    (p∗(tei, 0) − p∗(0, 0))/t ≤ −λ∗i,   t < 0.                (1633)

Taking the limit as t → 0⁺ gives ∂p∗(0, 0)/∂ui ≥ −λ∗i and the limit as t → 0⁻ gives ∂p∗(0, 0)/∂ui ≤ −λ∗i,
so ∂p∗(0, 0)/∂ui = −λ∗i. We can use the same method to show that ∂p∗(0, 0)/∂vi = −ν∗i.

Theorem 313 gives us a measure of how active a constraint is at x∗. In particular, if fi(x∗) < 0, then
the inequality constraint is inactive, and the optimal value is unaffected by small loosening or tightening
of that constraint. Furthermore, complementary slackness asserts that the corresponding optimal Lagrange
multiplier λ∗i = 0. If fi(x∗) = 0, then the i-th constraint is active and the size of λ∗i is the local sensitivity
determining the effect of modifications to the constraint on the optimal value.
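The shadow-price interpretation is easy to verify numerically. Below is a minimal sketch assuming numpy and cvxpy, on a hypothetical two-dimensional problem: the optimal dual variable of the single inequality matches the finite-difference slope −∂p∗/∂u.

import numpy as np
import cvxpy as cp

a = np.array([1.0, 2.0])

def solve(u):                                   # minimize ||x||^2  s.t.  a'x <= -1 + u
    x = cp.Variable(2)
    con = [a @ x <= -1 + u]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(x)), con)
    prob.solve()
    return prob.value, con[0].dual_value

p0, lam = solve(0.0)
eps = 1e-4
p_plus, _ = solve(eps)
print(lam, (p0 - p_plus) / eps)   # both approximately lambda* = 2/||a||^2 = 0.4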
Equivalent problems can have very different dual forms. We first consider those forms for which the
linear constraints are made explicit.

Exercise 490. Consider the unconstrained problem

minimize f0 (Ax + b). (1634)

There are no constraints and the Lagrange dual function would not be meaningful. Strong duality is
trivially obtained. Consider the equivalent problem

minimize f0 (y), (1635)


subject to Ax + b = y. (1636)

The Lagrangian is written L(x, y, ν) = f0 (y) + ν 0 (Ax + b − y) and the dual function is unbounded below
unless the linear component A0 ν = 0. We may write

g(ν) = b0 ν + inf {f0 (y) − ν 0 y} = b0 ν − sup{ν 0 y − f0 (y)} = b0 ν − f0∗ (ν). (1637)


y y

We then arrive at the dual problem

    maximize    b'ν − f0∗(ν),                                 (1638)
    subject to  A'ν = 0.                                      (1639)

Exercise 491. Consider the convex form GP (Definition 306) given by
m
!
X
minimize log exp(a0i x + bi ) . (1640)
i=1

This has equivalent problem


m
X
minimize f0 (y) = log( exp(yi )), (1641)
i=1
subject to Ax + b = y. (1642)

We know that the conjugate to log-sum exp is given by (see Exercise 433.9)
    f0∗(y) = { ∑_{i=1}^{m} yi log yi    if y ⪰ 0, 1'y = 1,
               ∞                         otherwise.           (1643)

We can write the dual of this (using Equation 1637) as


m
X
maximize b0 ν − νi log νi , (1644)
i=1
subject to 10 ν = 1, (1645)
0
A ν = 0, (1646)
v < 0. (1647)

Exercise 492. Consider the problem minimize kAx − bk, which has equivalent problem in the form

minimize kyk, (1648)


subject to Ax − b = y. (1649)

This has Lagrangian L(x, y, ν) = kyk+ν 0 (Ax−b−y) which is unbounded below unless the linear component
A0 ν = 0. We can then write the dual function (similar to Equation 1637)

g(ν) = −b0 ν + inf (kyk − ν 0 y) = −b0 ν − sup(ν 0 y − kyk) = −b0 ν − f0∗ (ν). (1650)
y y

Furthermore the conjugate of the norm is given by (see Theorem 284)


(

0 kyk∗ ≤ 1,
f (y) = (1651)
∞ else,

so we may write the Lagrange dual as

maximize −b0 ν, (1652)


subject to kvk∗ ≤ 1, (1653)
A0 ν = 0. (1654)

Exercise 493. Consider the problem

minimize f0 (A0 x + b0 ), (1655)


subject to fi (Ai x + bi ) ≤ 0, i ∈ [1, m]. (1656)

Assume the fi , i ∈ [0, m] are convex. Then an equivalent problem is

minimize f0 (y), (1657)


subject to fi (yi ) ≤ 0, i ∈ [1, m], (1658)
Ai x + bi = y i , i ∈ [0, m]. (1659)
Pm Pm
The Lagrangian is L(x, y, λ, ν) = f0 (y0 ) + i=1 λi fi (yi ) + i=0 νi0 (Ai x + bi − yi ). Again to be bounded
Pm
below we require i=0 A0i νi = 0, so the dual function is written (for λ  0)
m m m
!
X X X
0 0
g(λ, ν) = νi bi + inf f0 (y0 ) + λi fi (yi ) − νi yi (1660)
y
i=0 i=1 i=0
m  m 
X X νi
= νi0 bi + inf (f0 (y0 ) − ν00 y0 ) +λi inf fi (yi ) − ( )0 yi (1661)
i=0
y0
i=1
yi λi
m m  
X X νi
= νi0 bi − f0∗ (ν0 ) − λi fi∗ . (1662)
i=0 i=1
λ i

Since the last term is perspective of conjugate function, with negative weights, the dual function is concave
in (λ, ν). If λ < 0 but ∃i s.t. λi = 0, then if νi 6= 0, the dual function evaluates to −∞, otherwise
νi = 0 and
 all the terms involving yi , νi , λi are all zero and we can use the same expression by letting
νi
λi fi∗ λi = ∞ when νi 6= 0 and 0 when νi = 0 for λi = 0. We can then write the dual problem as

m m  
X X νi
maximize νi0 bi − f0∗ (ν0 ) − λi fi∗ , (1663)
i=0 i=1
λi
subject to λ < 0, (1664)
Xm
A0i νi = 0. (1665)
i=0

Exercise 494. Consider the convex problem given by


K0
!
X
minimize log exp(a00k x + b0k ) , (1666)
k=1
Ki
!
X
subject to log exp(a0ik x + bik ) ≤ 0, i ∈ [1, m]. (1667)
k=1

Using Exercise 493 and the conjugate for log-sum exp given by Exercise 433.9, we may write the dual
problem as
K0 m Ki
!
X X X νik
maximize b00 ν0 − v0k log v0k + b0i νi − νik log , (1668)
i=1
λi
k=1 k=1
subject to v0 < 0, 10 ν0 = 1, (1669)
νi < 0, 10 νi = λi , i ∈ [1, m], (1670)
λi ≥ 0, i ∈ [1, m], (1671)
Xm
A0i νi = 0. (1672)
i=0

We can also consider those for which the objectives are transformed.

Exercise 495. Consider the problem to minimize kAx − bk, which has clearly equivalent form
1
minimize kyk2 , (1673)
2
subject to Ax − b = y. (1674)

See Exercise 492. We may write the dual problem to be


1
maximize − kνk2∗ − b0 ν, (1675)
2
subject to A0 ν = 0. (1676)

We can also make the explicit constraints implicit.

Exercise 496. Consider the LP given by

minimize c0 x, (1677)
subject to Ax = b, (1678)
l 4 x 4 u. (1679)

The Lagrangian has form L(x, λ1 , λ2 , ν) = c0 x + ν(Ax − b) + λ1 (x − u) − λ2 (x − l), and we want the linear
component in x to be zero, so the dual problem is to solve

maximze −b0 ν − λ01 u + λ02 l, (1680)


0
subject to A ν + λ1 − λ2 + c = 0, (1681)
λ1 < 0, λ2 < 0. (1682)

The box constraints in the primal problem can also be made implicit by writing
( 0
c x l 4 x 4 u,
minimize f0 (x) = , (1683)
∞ else,
subject to Ax = b. (1684)

The dual function of this problem is instead

g(ν) = inf (c0 x + ν 0 (Ax − b)) = −b0 ν − u0 (A0 ν + c)− + l0 (A0 ν + c)+ , (1685)
l4x4u

where y + , y − are the positive and negative components of y respectively. The dual problem of this is the
unconstrained problem
maximize g(ν).

4.7.6 Theorem of Alternatives


Theorem 314. Consider a system of inequalities and equalities given by

fi (x) ≤ 0, i ∈ [m], hi (x) = 0, i ∈ [p]. (1686)


p
Assume D = (∩m
i dom fi ) ∩ (∩i dom hi ) 6= ∅. We can think of this to be the feasibility problem given by

minimize 0, (1687)
subject to fi (x) ≤ 0, i ∈ [m], (1688)
hi (x) = 0, i ∈ [p] (1689)

with optimal value zero if we can solve the system and ∞ otherwise. Then the dual function can be
written
p
m
!
X X
g(λ, ν) = inf λi fi (x) + νi hi (x) . (1690)
x∈D
i=1 i

See that the dual function is positive homogenous; for α > 0, we have g(αλ, αν) = αg(λ, ν). The dual
problem is

maximize g(λ, ν), subject to λ < 0, (1691)

and has optimal value given by


(

∞ λ < 0, g(λ, ν) > 0 is feasible,
d = (1692)
0 λ < 0, g(λ, ν) > 0 is not feasible

by the homogeneity property. Furthermore, weak duality applies, so d∗ ≤ p∗ , and it is easy to see that
d∗ = ∞ =⇒ p∗ = ∞, so that any solution (λ, ν) that is feasible and satisfying

λ < 0, g(λ, ν) > 0 (1693)

is proof or certificate of the infeasibility of the original system. The contrapositive states that if the
original system was feasible, then the system λ < 0, g(λ, ν) > 0 is infeasible. An x satisfying the original
system is certificate of the infeasibility of the latter system. We say these are weak alternatives, since at
most one of the two is feasible. Note that it is possible for both systems to be infeasible. Furthermore,
regardless of the nature of fi , hi , the inequality system (Equation 1693) is always convex, since g is
concave and λ is linear.
We can also consider the feasibility of the strict inequality system given by

fi (x) < 0, i ∈ [m], hi (x) = 0, i ∈ [p]. (1694)

As in similar reasoning to the previous discussion, in this case the alternative equality system is given by

λ < 0, λ 6= 0, g(λ, ν) ≥ 0. (1695)

These are also weak alternatives. Suppose the system given in Equation 1694 is feasible, and let x̃ be
feasible point, then

∀λ < 0, λ 6= 0, ν, g(λ, ν) = inf λ0 f (x) + ν 0 h(x) ≤ λ0 f (x̃) + ν 0 h(x̃) < 0. (1696)


x

Then x̃ is not in the feasible set for Equation 1695 and hence the system must not be feasible. The
contrapositive asserts the other direction.

Theorem 315. A stronger form of alternatives than Theorem 314 exists when the system of inequalities
are convex, which says that exactly one of the two alternatives hold. Again we have a system

fi (x) ≤ 0, i ∈ [m], hi (x) = 0, i ∈ [p] (1697)

but here fi ’s are guaranteed convex and hi ’s are affine so that we can write the set of equality constraints
as Ax = b. Consider first the case when the inequality system is strict, so

fi (x) < 0, i ∈ [m], Ax = b (1698)

with alternative

λ < 0, λ 6= 0, g(λ, ν) ≥ 0. (1699)

Further suppose ∃x ∈ relint D s.t. Ax = b. Then strong alternatives hold. To see this, consider the
optimization problem given by

minimize s, (1700)
subject to fi (x) − s ≤ 0, i ∈ [m], (1701)
Ax = b. (1702)

The variables are (x, s) ∈ (D, R) and the optimal value to this p∗ < 0 ⇐⇒ ∃x s.t. x satisfies the strict
inequality system. The Lagrange dual can be written as
m

X
λi fi (x) + ν 0 (Ax − b) = g(λ, ν) 10 λ = 1,
m
! 
X  inf

0 x∈D,s
inf s+ λi (fi (x) − s) + ν (Ax − b) = i=1 (1703)
x∈D,s 
i=1 
−∞ else.

Then the dual problem is given by

maximize g(λ, ν) (1704)


0
subject to λ < 0, 1 λ = 1. (1705)

For the problem defined in Equation 1702, since we assumed x̃ ∈ relintD s.t. Ax̃ = b, any s̃ > maxi fi (x̃)
gives point (x̃, s̃) strictly feasible in Equation 1702 and the Slater’s condition (Theorem 309) holds. Thus
strong duality is obtained, and d∗ = p∗ , and there is some dual optimal (λ∗ , ν ∗ ) satisfying

g(λ∗ , ν ∗ ) = p∗ , λ∗ < 0, 10 λ∗ = 1. (1706)

Suppose the strict inequality system is infeasible, then as argued the optimal value p∗ to problem in
Equation 1702 must be ≥ 0, then clearly g(λ∗ , ν ∗ ) = p∗ ≥ 0, and this dual optimal pair satisfies alternative
inequality system Equation 1699. If we suppose that Equation 1699 is feasible, then p∗ = d∗ ≥ 0 and as
asserted there is no x satisfying the strict inequality system in Equation 1698. The iff conditions hold.
When the inequalities are not strict, that is when

fi (x) ≤ 0, i ∈ [1, m], Ax = b (1707)

with alternative given by

λ < 0, g(λ, ν) > 0. (1708)

We require condition that ∃x ∈ relint D s.t. Ax = b and we further require that the optimal value of p∗
is obtained for Equation 1702. These assumptions assert that strong duality holds and that the primal
optimal value is obtained (verify this) with p∗ = d∗ .

Exercise 497. Consider the system of linear inequalities given by Ax 4 b, then the dual function is
( 0
0
−b λ A0 λ = 0,
g(λ) = inf λ (Ax − b) = (1709)
x
−∞ else.

Theorem 315 asserts that the alternative inequality system is given by

λ < 0, A0 λ = 0, b0 λ < 0. (1710)

When the inequalities are strict, that is Ax ≺ b, then the (strong) alternative inequality system is given
by

λ < 0, λ 6= 0, A0 λ = 0, b0 λ ≤ 0. (1711)

Exercise 498. Consider m ellipsoids defined by ξi = {x : fi (x) ≤ 0}, i ∈ [m], where each fi is defined
by

fi (x) = x0 Ai x + 2b0i x + ci , n
i ∈ [1, m], Ai ∈ S++ . (1712)

The problem is to find the condition for which the intersection of the m ellipsoids have nonempty interior.
We can write this as a feasibility problem given

fi (x) = x0 Ai x + 2b0i x + ci < 0, i ∈ [1, m]. (1713)

The dual function of this is written

g(λ) = inf (x0 A(λ)x + 2b(λ)0 x + c(λ)) (1714)


x

−b(λ)0 A(λ)+ b(λ) + c(λ),


(
A(λ) < 0, b(λ) ∈ R(A(λ)), (Theorem 257)
= (1715)
−∞ else,
where A(λ) = ∑_{i=1}^{m} λi Ai, b(λ) = ∑_{i=1}^{m} λi bi, c(λ) = ∑_{i=1}^{m} λi ci. If λ ⪰ 0, λ ≠ 0, then A(λ) ≻ 0, and the
dual function simplifies to (Theorem 256)

g(λ) = −b(λ)0 A(λ)−1 b(λ) + c(λ). (1716)

The strong alternative of system is therefore written

λ < 0, λ 6= 0, −b(λ)0 A(λ)−1 b(λ) + c(λ) ≥ 0. (1717)

Theorem 316 (Farkas’ Lemma). The system of inequalities given by

Ax 4 0, c0 x < 0, (1718)

and

A0 y + c = 0, y<0 (1719)

are strong alternatives. This strong alternative is known as the Farkas’ lemma.

Proof. Consider the LP given by

minimize c0 x, (1720)
subject to Ax 4 0. (1721)

We have dual problem (Exercise 482) given by

maximize 0, (1722)
subject to A0 y + c = 0, (1723)
y < 0. (1724)

The primal LP given by Equation 1721 is homogenous, so if Equation 1718 is not feasible then the
optimal value is 0 and otherwise −∞. The dual LP formulated has optimal value 0 if there is a feasible
point for Equation 1719 and otherwise has optimal value −∞. It is easy to see that x = 0 is feasible
for the primal problem; Slater’s condition holds, so strong duality holds and p∗ = d∗ . This asserts the
strong alternative between the two inequality systems.

Exercise 499. Consider n assets with price vector p, and the future price is modelled by random vector v.
Let x be the portfolio investment in each asset in units of contracts. Then the cost of initial investment is
p'x and the future value of the investment is v'x. Suppose v takes discrete values v^(i), i ∈ [m], for m
possible scenarios in the model economy. If there is any investment x s.t. p'x < 0 but ∀i ∈ [m],
v^(i)'x ≥ 0, we say that an arbitrage portfolio exists. Asset pricing typically assumes no-arbitrage prices,
which is to say
the inequality system

V x < 0, p0 x < 0 (1725)


(i)
should be infeasible, where V is the matrix encoding Vij = vj . Farkas’ lemma (Theorem 316) asserts
that no arbitrage holds iff ∃y s.t.

−V 0 y + p = 0, y < 0. (1726)

4.7.7 Duality and Generalized Inequalities


We examine how Lagrange duality extends to problems where the inequality is w.r.t to proper cone(s).
Consider the general problem

minimize f0 (x), (1727)


subject to fi (x) 4Ki 0, i ∈ [1, m], (1728)
hi (x) = 0, i ∈ [1, p]. (1729)
p
Here Ki ⊆ Rki are proper cones, and we have D = ∩m
i dom fi ∩ ∩i dom hi . With each generalized
inequality, we assign a Lagrange multiplier vector λi , and define the Lagrangian
m
X p
X
L(x, λ, ν) = f0 (x) + λ0i fi (x) + νi hi (x). (1730)
i i

The dual function follows naturally;


p
m
!
X X
g(λ, ν) = inf L(x, λ, ν) = inf f0 (x) + λ0i fi (x) + νi hi (x) . (1731)
x∈D x∈D
i i

As before, the dual function is concave. Also, the dual function gives lower bounds on p∗ . The nonneg-
ative constraint λ < 0 is replaced with

λi <Ki∗ 0, i ∈ [1, m]. (1732)

See Theorem 269. λi <Ki∗ 0 and fi (x̃) 4Ki 0 asserts that λ0i fi (x̃) ≤ 0, so it should be easy to follow that
for any primal feasible x̃, we have
m
X p
X
g(λ, ν) ≤ f0 (x̃) + λ0i fi (x̃) + νi hi (x̃) ≤ f0 (x̃), (1733)
i i

so clearly g(λ, ν) ≤ p∗ and the Lagrange dual problem is given by

maximize g(λ, ν), (1734)


subject to λi <Ki∗ 0, i ∈ [1, m]. (1735)

Weak duality holds, d∗ ≤ p∗ irrespective of the structure in the primal problem. Strong duality may also
be extended to generalized constraint problem under some constraint qualifications for convex primal

problems. If f0 is convex and fi ’s are Ki -convex, and ∃x ∈ relint D s.t. Ax = b and fi (x) ≺Ki 0 for
i ∈ [m], the generalized Slater conditions are said to hold. Then strong duality holds and dual optimum
is attained.
Exercise 500. Consider the SDP in inequality form (see Equation 1427) given by

minimize c0 x, (1736)
subject to x1 F1 + · · · + xn Fn + G 4 0, (1737)

where Fi , G ∈ S k . The associated proper cone is S+^k and the constraint function is affine, and for some
Z < 0, write the Lagrangian

L(x, Z) = c0 x + tr((x1 F1 + · · · + xn Fn + G)Z) = x1 (c1 + tr(F1 Z)) + · · · + xn (cn + tr(Fn Z)) + tr(GZ). (1738)

The dual function is given by


g(Z) = inf_x L(x, Z) = { tr(GZ),   tr(Fi Z) + ci = 0, i ∈ [1, n];   −∞, otherwise. (1739)
The dual problem is written

maximize tr(GZ), (1740)


subject to tr(Fi Z) + ci = 0, i ∈ [1, n], (1741)
Z < 0. (1742)

Strong duality holds if the problem is strictly feasible, i.e. ∃x s.t. x1 F1 + · · · + xn Fn + G ≺ 0.


Assume that primal, dual optimal values are equal and obtained at x∗ , (λ∗ , ν ∗ ). See that we have

f0 (x∗ ) = g(λ∗ , ν ∗ ) (1743)


≤ f0 (x∗ ) + Σ_{i=1}^m (λ∗i )0 fi (x∗ ) + Σ_{i=1}^p νi∗ hi (x∗ ) (1744)
≤ f0 (x∗ ). (1745)
It follows that Σ_{i=1}^m (λ∗i )0 fi (x∗ ) = 0, and in particular ∀i ∈ [1, m], (λ∗i )0 fi (x∗ ) = 0. This is the generalization
of the complementary slackness conditions. It states that

λ∗i ≻Ki∗ 0 =⇒ fi (x∗ ) = 0, fi (x∗ ) ≺Ki 0 =⇒ λ∗i = 0. (1746)

Note that this differs from the scalar inequality problem: here we can have both λ∗i 6= 0 and fi (x∗ ) 6= 0. If fi , hi are
differentiable, then we can derive the KKT conditions for the generalized constraint problem. As before,
x∗ is a minimizer for L(x, λ∗ , ν ∗ ) so
∇f0 (x∗ ) + Σ_{i=1}^m ∇fi (x∗ )λ∗i + Σ_{i=1}^p νi∗ ∇hi (x∗ ) = 0. (1747)

If strong duality holds, then any primal optimal and dual optimal point must satisfy KKT conditions
given by

fi (x∗ ) 4Ki 0, i ∈ [1, m], (1748)
hi (x∗ ) = 0, i ∈ [1, p], (1749)
λ∗i <Ki∗ 0, i ∈ [1, m], (1750)
(λ∗i )0 fi (x∗ ) = 0, i ∈ [1, m], (1751)
∇f0 (x∗ ) + Σ_{i=1}^m ∇fi (x∗ )λ∗i + Σ_{i=1}^p νi∗ ∇hi (x∗ ) = 0. (1752)

If the primal problem is convex, then the KKT conditions are necessary and sufficient for optimality.
We can also perform sensitivity analysis as we did in the scalar inequality problem. Consider the
perturbation of the convex problem written as

minimize f0 (x), (1753)


subject to fi (x) 4Ki ui , i ∈ [1, m], (1754)
hi (x) = vi , i ∈ [1, p]. (1755)

p∗ is convex in (u, v). Let (λ∗ , ν ∗ ) be the dual optimal point of the unperturbed problem, and assume
that strong duality holds. Then
∀u, v, p∗ (u, v) ≥ p∗ − Σ_{i=1}^m (λ∗i )0 ui − (ν ∗ )0 v. (1756)

Furthermore, if p∗ (u, v) is differentiable at (0, 0), then the optimal dual variable λ∗i satisfies

λ∗i = −∇ui p∗ (0, 0). (1757)

Exercise 501. Recall Exercise 500 and the primal and dual problem given there. If x∗ , Z ∗ are primal
and dual optimal points with strong optimality, then complementary slackness holds; tr(F (x∗ )Z ∗ ) = 0.
Since F (x∗ ) 4 0, Z ∗ < 0, it holds that F (x∗ )Z ∗ = 0, so R(F (x∗ )) ⊥ R(Z ∗ ) - they are orthogonal
subspaces. Let p∗ (U ) be the optimal value of the SDP where the RHS is replaced with U , then

∀U, p∗ (U ) ≥ p∗ − tr(Z ∗ U ). (1758)

If p∗ (U ) differentiable at U = 0, then ∇p∗ (0) = −Z ∗ .

We can also find strong and weak alternatives w.r.t. systems of equalities and inequalities as before.
Consider the system

fi (x) 4Ki 0, i ∈ [1, m], hi (x) = 0, i ∈ [1, p], (1759)

as well as

fi (x) ≺Ki 0, i ∈ [1, m], hi (x) = 0, i ∈ [1, p]. (1760)

Assume the intersection of their domains are nonempty, so that formulated as feasibility problem, D is
not empty. Now write the dual function
g(λ, ν) = inf_{x∈D} ( Σ_{i=1}^m λ0i fi (x) + Σ_{i=1}^p νi hi (x) ). (1761)

The system

λi <Ki∗ 0, i ∈ [1, m], g(λ, ν) > 0 (1762)

is weak alternative to the system in Equation 1759. Suppose not; if both systems were feasible, then we
could find points satisfying

0 < g(λ, ν) ≤ Σ_{i=1}^m λ0i fi (x) + Σ_{i=1}^p νi hi (x) ≤ 0 (1763)

and this is a contradiction. We can also show that the system

λi <Ki∗ 0, i ∈ [1, m], λ 6= 0, g(λ, ν) ≥ 0 (1764)

is weak alternative to system in Equation 1760. Now further suppose that fi ’s are Ki convex and hi ’s
are affine, s.t we have

fi (x) ≺Ki 0, i ∈ [1, m], Ax = b. (1765)

The alternative

λi <Ki∗ 0, i ∈ [1, m], λ 6= 0, g(λ, ν) ≥ 0 (1766)

is strong alternative, provided ∃x̃ ∈ relint D s.t. Ax̃ = b. To show this, let ei ≻Ki 0 and consider the related
primal problem

minimize s, (1767)
subject to fi (x) 4Ki sei , i ∈ [m], (1768)
Ax = b. (1769)

Provided s̃ is large enough, (x̃, s̃) satisfy strict inequalities and Slater’s condition holds. Then strong
duality is obtained. The dual problem is

maximize g(λ, ν), (1770)


subject to λi <Ki∗ 0, i ∈ [1, m], (1771)
Σ_{i=1}^m e0i λi = 1. (1772)

The reasoning is similar as in the scalar inequality problem. The derivation for the rest of the steps are
identical as before in the scalar inequality problem. In the case of systems of nonstrict inequalities given

fi (x) 4Ki 0, i ∈ [1, m], Ax = b (1773)

and

λi <Ki∗ 0, i ∈ [1, m], g(λ, ν) > 0 (1774)

are strong alternatives provided that the optimal value is attained for Equation 1769, and that ∃x ∈
relint D. s.t Ax = b.

Exercise 502. The system

F (x) = x1 F1 + · · · + xn Fn + G ≺ 0, Fi , G ∈ S k (1775)

and

Z < 0, Z 6= 0, tr(GZ) ≥ 0, tr(Fi Z) = 0, i ∈ [1, n] (1776)

are strong alternatives. This follows from writing (see Exercise 500)
g(Z) = inf_x {tr(F (x)Z)} = { tr(GZ),   tr(Fi Z) = 0, i ∈ [1, n];   −∞, otherwise. (1777)

4.8 Applications
4.8.1 Norm Approximation Problems
The unconstrained norm approximation problem takes the general form

minimize kAx − bk. (1778)

where A ∈ Rm×n and A, b are known problem data. The optimization variable is x and k · k is some
norm on Rm . The solution is often said to be the approximate solution of Ax ≈ b, r = Ax − b is said
to be the residual vector, and its components are said to be the individual residuals. The problem is convex,
and p∗ = 0 ⇐⇒ b ∈ R(A). So the problem is only interesting when b 6∈ R(A). Assume that the columns
of A are independent and that m ≥ n, so A is full column rank. When m = n, then A is invertible and
x∗ = A−1 b. Assume m > n. Recall we may express Ax = Σ_{i=1}^n xi ai (Definition 33) where ai is the i-th column
in A; we want to approximate b by a weighted combination of the columns of A, which is geometrically the projection
of b onto the column space of A (see Theorem 63). In the regression problem, the columns ai are said
to be the regressors, and Ax∗ is said to be the regression of b. Suppose a random variable of interest is
assumed to take the linear measurement model

y = Ax + v, (1779)

where y, A are known and x is to be estimated, with random error v presumed small, the optimal solution
is given by

x∗ = arg min kAz − yk. (1780)


z

When the problem data presents heteroskedasticity/unequal variance, we often consider normalizing the
residuals with a weighting matrix. This extension is said to be weighted norm approximation problem,
given

minimize kW (Ax − b)k, (1781)

where W ∈ Rm×m is said to be weighting matrix, and is part of the problem data or found as part
of the estimation process. W is often diagonal with diagonals componentwise inversely proportional to
the variance in underlying variable. The weighting matrix affects the relative emphasis in components
of r = Ax − b. This extension can also be treated as the basic norm approximation problem with
A → W A, b → W b.
In particular when the associated norm on Rm is Euclidean `2 , we say that the norm approximation
problem is, after squaring, equivalent to the least-squares approximation problem given
minimize kAx − bk22 = Σ_{i=1}^m ri2 . (1782)

We have analytical solution (recall A is assumed full column rank) given by x = (A0 A)−1 A0 b (see Exercise
64, Section 4.3, Section 8.8.2 for different approaches to the same problem).
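
As a quick numerical sketch (using numpy and made-up data, not data from the text), the normal-equations solution can be checked against a generic least-squares routine:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))               # m > n, full column rank with probability one
b = rng.standard_normal(50)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)   # x = (A'A)^{-1} A'b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_normal, x_lstsq))          # both minimize ||Ax - b||_2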
When the associated norm on Rm is infinity norm `∞ (Definition 219), we say the norm approximation
problem is the Chebyshev approximtion problem given by

minimize kAx − bk∞ = max{|r1 |, · · · , |rm |}. (1783)

We may write this as the LP

minimize t, (1784)
subject to −t1 4 Ax − b 4 t1, (1785)

with variables in x ∈ Rn , t ∈ R.
When the associated norm on Rm is `1 , the norm approximation problem becomes
m
X
minimize kAx − bk1 = |ri |. (1786)
i

This is said to be a robust estimation problem, for reasons we will discuss. This can also be cast as LP

minimize 10 t, (1787)
subject to −t 4 Ax − b 4 t, (1788)

with variables in x ∈ Rn , t ∈ Rm .
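
As a sketch (toy data, not from the text), the ℓ1 problem above can be handed to a generic LP solver by stacking the variables (x, t) as in Equations 1787-1788:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m, n = 30, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# variables z = (x, t); minimize 1't  subject to  -t <= Ax - b <= t
c = np.r_[np.zeros(n), np.ones(m)]
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.r_[b, -b]
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None)] * m, method="highs")
x_l1 = res.x[:n]
print("l1 residual:", np.abs(A @ x_l1 - b).sum())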
The general `p norm approximation problem gives the optimization formulation
minimize (|r1 |p + · · · + |rm |p )^{1/p} , p ∈ [1, ∞). (1789)

Since p ≥ 1 we can consider the equivalent problem with objective function Σ_{i=1}^m |ri |p , which is a separable
and symmetric function of the residuals. We can generalize the `p norm approximation, using the form
minimize Σ_{i=1}^m φ(ri ), (1790)
subject to r = Ax − b, (1791)

where φ : R → R is the penalty function. This problem is said to be the penalty function approximation
problem. The objective function depends only on the amplitude distribution of the residuals. If φ is
convex, then the penalty function approximation problem is convex optimization problem. Assume φ is
convex in the remainder of our discussion.

Exercise 503 (Examples of common penalty functions). -

1. When φ(u) = |u|p , p ≥ 1, we trivially have the `p norm approximation problem.

2. The deadzone linear penalty function is given by


φ(u) = { 0, |u| ≤ a;   |u| − a, |u| > a. (1792)

3. The log barrier penalty function is given by

φ(u) = { −a2 log(1 − (u/a)2 ), |u| < a;   ∞, |u| ≥ a. (1793)

The shape of the penalty function shapes the preference for different amplitudes in the residuals.
Consider the `1 , `2 penalties given by φ1 (u) = |u|, φ2 (u) = u2 respectively. When 0 < |u| < 1, clearly φ1 (u) >
φ2 (u), but φ1 (u) < φ2 (u) for |u| > 1; thus the `2 approximation objective penalizes small residuals relatively less,
and large residuals relatively more, than the `1 norm approximation objective. The amplitude distribution of the optimal residual in the
`2 norm approximation problem will tend to have fewer large residuals. For the `1 norm, many of the
residuals would either be zero or very small, with a few relatively large ones. The `2 norm approximation
on the other hand would have many modest residuals, and fewer large ones.
An outlier in the regression problem is a measurement yi = a0i x + vi , where the vi is relatively large.
Usually, an outlier leads to large residuals. To keep the estimation process more robust to potential
outliers, we can adjust the penalty function to reduce their impact on the model estimation, either by
downweighting or using designated penalty functions. An example is given by
φ(u) = { u2 , |u| ≤ M ;   M 2 , |u| > M. (1794)

However, penalty functions such as this one are not convex, and hence the penalty function ap-
proximation problem is no longer convex; it becomes a difficult combinatorial problem to
solve. Often, we instead restrict the choice of penalty functions to convex functions. The ones that are
least sensitive to outliers are those for which φ(u) grows linearly. Penalty functions with this property are said to be
robust, since the associated penalty function approximation methods are less sensitive to outliers. An
example would be φ(u) = |u|, corresponding to the `1 norm approximation problem. A famous example
is the robust least squares given by the Huber penalty function defined
φhub (u) = { u2 , |u| ≤ M ;   M (2|u| − M ), |u| > M, (1795)

which is convex in u. For this reason we often call the `1 norm approximation problem to be the robust
estimation or robust regression problem. As mentioned, penalty functions such as the least squares
approximation or penalty approximation problems using penalties such as the deadzone linear penalty
functions put zero to little weight on small residuals. Hence the optimal residuals would be small, but
not extremely small, since there is no motivation to make small residuals even smaller. Penalty functions
that put relatively large weight on small residuals such as φ(u) = |u| tend to produce optimal residuals
that are either very small or even zero. To the norm approximation problem, we may add constraints.
Some examples have already been discussed, and are reviewed.
Consider the objective

minimize kAx − bk. (1796)

The constraints x < 0 finds projection of a vector b onto the cone generated by the columns of A. The
constraints l 4 x 4 u finds projection of a vector b onto the image of a box under the linear mapping
induced by A. The constraints x < 0, 10 x = 1 approximates b by convex combination of the columns in
A. The constraints kx − x0 k ≤ d restricts the search space to x inside the norm ball.

4.8.2 Least-Norm Problems


The least-norm problem can be written as

minimize kxk, (1797)


subject to Ax = b. (1798)

Here A ∈ Rm×n , b ∈ Rm and the variable is x ∈ Rn . A solution exists only when b ∈ R(A). Assume that
A is full row rank. If m = n, then x = A−1 b is the unique feasible, and hence optimal solution. We can

assume that n > m. We can eliminate the equality constraints in Equation 1798 by explictly encoding
the solution space to Ax = b (see Exercise 52). Let x0 be particular solution and Z ∈ Rn×k consisting
of the columns that form basis for nullspace of A. Then the problem

minimize kx0 + Zuk (1799)

is clearly equivalent problem, and is a norm approximation one. Geometrically, we are finding the point
in the affine set that is minimum distance to the origin; it is the projection of the point zero onto the
affine set {x : Ax = b}. When the norm is Euclidean, then clearly the squared objective problem

minimize kxk22 , (1800)


subject to Ax = b (1801)

is equivalent. The optimal solution is said to be the least squares solution. We have Lagrangian L(x, ν) =
x0 x + ν 0 (Ax − b), so the optimality condition (see Equation 1604) is given

2x∗ + A0 ν = 0, Ax∗ = b, (1802)

which is a simple linear system to solve. Then x∗ = −(1/2)A0 ν ∗ and −(1/2)AA0 ν ∗ = b, and we obtain

ν ∗ = −2(AA0 )−1 b, x∗ = A0 (AA0 )−1 b. (1803)
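
A quick numerical check (toy data) of the least-norm formula x∗ = A0 (AA0 )−1 b:

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 8))            # wide matrix, full row rank
b = rng.standard_normal(3)

x_ln = A.T @ np.linalg.solve(A @ A.T, b)   # x* = A'(AA')^{-1} b
print(np.allclose(A @ x_ln, b))                       # feasible
print(np.allclose(x_ln, np.linalg.pinv(A) @ b))       # agrees with the pseudoinverse solution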


We can also generalize the problem by alteration of the objective function to take Φ(x) = Σ_{i=1}^n φ(xi ),
where φ are convex penalty functions as before. Recall that `1 norm approximation problems give rise
to solutions with many small or zero entries. If we want to find a sparse solution to the least-norm
problem in Equation 1798, a good heuristic is to set the associated norm to `1 . Often, we get m nonzero
components only. This heuristic approach provides an efficient approximation for finding a sparse solution
to Ax = b, which would otherwise require solving (n choose m) linear systems to find the smallest x with ≤ m
nonzero entries.

4.8.3 Regularized Approximation


The regularized approximation problem has a dual objective of finding x that is small with small residuals
Ax − b. We can write this as a vector optimization problem (see Section 4.6.7) given

minimize (w.r.t R2+ ) (kAx − bk(1) , kxk(2) ). (1804)

It is not necessary for the two norms k · k(1) , k · k(2) to be the same. The techniques discussed, such as
scalarization, for vector optimization apply. Regularization is one of the common scalarization methods
used in solving bi-criterion problems. One such form is to formulate

minimize kAx − bk + γkxk, γ > 0. (1805)

The different values of γ ∈ (0, ∞) traces out the optimal trade off curve. Another approach for regular-
ization would be to formulate

minimize kAx − bk2 + δkxk2 , δ > 0. (1806)

Exercise 504 (Tikhonov Regularization). The Tikhonov regularization problem is given by the convex
quadratic optimization problem

minimize kAx − bk22 + δkxk22 = x0 (A0 A + δ 1)x − 2b0 Ax + b0 b. (1807)

We arrive at analytical solution x = (A0 A + δ 1)−1 A0 b. Since A0 A + δ 1 ≻ 0 for any δ > 0, there are no
assumptions on rank A.
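
A minimal sketch of the Tikhonov solution, assuming the made-up data below:

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
delta = 0.1

x_tik = np.linalg.solve(A.T @ A + delta * np.eye(10), A.T @ b)   # (A'A + delta*I)^{-1} A'b
# larger delta shrinks the solution toward zero at the cost of a larger residual
print(np.linalg.norm(x_tik), np.linalg.norm(A @ x_tik - b))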

Exercise 505 (Smoothing Regularization). We can extend the regularization terms to impose other
constraints or desirable properties on x. Suppose that the vector x ∈ Rn represents measurements of a continuous
signal sampled at n points across the interval [0, 1]. Approximations of the first derivative
and second derivative near the point i/n would be

n(xi+1 − xi ), n(n(xi+1 − xi ) − n(xi − xi−1 )) = n2 (xi+1 − 2xi + xi−1 ). (1808)

Then the matrix encoding


 
         [ 1 −2  1  0 · · ·  0  0  0 ]
         [ 0  1 −2  1 · · ·  0  0  0 ]
∆ = n2   [ · · ·              · · ·  ]  (1809)
         [ 0  0  0  0 · · · −2  1  0 ]
         [ 0  0  0  0 · · ·  1 −2  1 ]

is the approximation of the second derivative of x at the interior points, and hence the norm k∆xk22
approximates the curvature of the signal over the interval [0, 1]. Then the regularized problem

minimize kAx − bk22 + δk∆xk22 (1810)

trades off residual size for the smoothness in x. Of course, we can combine both smoothness and size
considerations with objective kAx − bk22 + δk∆xk22 + ηkxk22 .

As asserted, the regularized approximation problem (Equation 1804) does not require that the norms
match. If we want a sparse solution, we may formulate the objective with an `1 norm on the size of x,
to get

minimize kAx − bk2 + γkxk1 . (1811)

Varying γ > 0 explores the optimal trade off curve between kAx − bk2 and kxk1 .

Exercise 506 (Regressor Selection). See the variable selection problem in Section 8.8.7. We are given
matrix A ∈ Rm×n , composed of n columns of regressors and measurements b. We want to choose a
subset k regressors to be used, and we may express this to be the problem

minimize kAx − bk2 , (1812)


subject to card(x) ≤ k, (1813)

where card(u) is the cardinality function that counts the number of nonzero entries in u. The global
search problem would require us to consider (n choose k) different regressions, and is a difficult combinatorial
problem as n, k increase in size. A good heuristic would be to first solve Equation 1811 for different
γ > 0. For instance, we can find the smallest value of γ s.t. card(x) = k, and then use this sparsity
pattern to solve the least-squares problem for kAx − bk2 .

Exercise 507. There is a signal given by x, where the entries xi are sampled at regularly spaced
intervals. We assume the signals do not vary widely from one step to another, so xi ≈ xi+1 . The
measurements are noisy, and there is an additive noise v s.t. xc = x + v is the corrupted signal. We
assume noise is random, small and rapidly decaying. Given xc , we want to estimate x with estimator x̂.
We can write the problem

minimize (kx̂ − xc k2 , φ(x̂)), (1814)

where φ is a convex smoothing objective increasing in roughness in x̂. By scalarization (see Section 4.6.7)
we arrive at a convex optimization problem. Consider the quadratic smoothing function

φ(x) = φq (x) = Σ_{i=1}^{n−1} (xi+1 − xi )2 = kDxk22 , (1815)

where D is the (n − 1) × n first-difference matrix

     [ −1  1  0 · · ·  0  0  0 ]
     [  0 −1  1 · · ·  0  0  0 ]
D =  [ · · ·             · · · ]
     [  0  0  0 · · · −1  1  0 ]
     [  0  0  0 · · ·  0 −1  1 ]

Then we solve for

kx̂ − xc k22 + δkDx̂k22 , δ > 0. (1816)

The solution has analytic form (verify this) x̂ = (1 + δD0 D)−1 xc .
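
A sketch of the quadratic smoothing estimator x̂ = (1 + δD0 D)−1 xc on a made-up noisy signal:

import numpy as np

n = 200
t = np.linspace(0, 1, n)
x_true = np.sin(2 * np.pi * t)
x_corr = x_true + 0.2 * np.random.default_rng(4).standard_normal(n)   # corrupted signal

D = np.diff(np.eye(n), axis=0)             # first-difference matrix, shape (n-1, n)
delta = 50.0
x_hat = np.linalg.solve(np.eye(n) + delta * D.T @ D, x_corr)          # (I + delta D'D)^{-1} x_corr
print("noisy MSE:", np.mean((x_corr - x_true) ** 2),
      "smoothed MSE:", np.mean((x_hat - x_true) ** 2))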


Now assume there are occasionally rapid variations in the signal x. Then we can choose another smoothing
function

φt (x̂) = Σ_{i=1}^{n−1} |x̂i+1 − x̂i | = kDx̂k1 , (1817)

which computes the total variation in x. This assigns relatively less penalty to large values of |xi+1 − xi |
in comparison to the quadratic smoothing function.

4.8.4 Robust Approximation


Exercise 508 (Stochastic Robust Approximation). In the stochastic robust approximation problem, the
objective kAx − bk has stochastic components. Assume that the data matrix A ∈ Rm×n is random with
mean Ā and zero-mean additive error U , s.t. A = Ā + U . Then we want to solve for

minimize EkAx − bk. (1818)

This is clearly convex, but usually is not tractable since it is difficult to evaluate the objectives or its
derivatives. However, in some cases we may solve the problem. Assume A takes discrete values given by
P(A = Ai ) = pi for i ∈ [1, k]. Here p satisfies 10 p = 1, p < 0. Then Equation 1818 can be written
minimize Σ_{i=1}^k pi kAi x − bk. (1819)

For obvious reasons this is said to be sum of norms problem, which is equivalent to

minimize p0 t, (1820)
subject to kAi x − bk ≤ ti , i ∈ [1, k]. (1821)

If the associated norm is `2 then we have SOCP (Equation 1358), if the associated norm is `1 , `∞ , then
we have LP (Equation 1294). Suppose we have `2 norm, then the problem objective takes on

EkAx − bk22 = E(Āx − b + U x)0 (Āx − b + U x) (1822)


= (Āx − b)0 (Āx − b) + Ex0 U 0 U x (1823)
= kĀx − bk22 + x0 P x, P = EU 0 U. (1824)
= x0 Ā0 Āx − 2b0 Āx + b0 b + x0 P x := ψ(x). (1825)

So we solve for
minimize kĀx − bk22 + kP 1/2 xk22 . (1826)

It is easy to see ∂ψ(x)/∂x = 2Ā0 Āx − 2Ā0 b + 2P x = 0, so our solution takes analytical form x = (Ā0 Ā +
P )−1 Ā0 b. When the matrix A is subject to variation, the vector Ax has larger variation the larger x is.
Jensen’s inequality asserts that the variation in Ax increases the mean value of kAx − bk2 . Our approach
then balances making Ax − b small with making x small. It is illuminating to note that the solution of
the Tikhonov regularized least squares problem (Exercise 504) minimizes Ek(A + U )x − bk2 , where Uij is
zero mean with variance δ/m, where m is the number of rows in A.

Exercise 509 (Worse Case Robust Approximation). Consider again the uncertainty in matrix A. In
particular, A can take on values in A ⊆ Rm×n . The worst-case error of a candidate approximate solution
x ∈ Rn is expressed ew (x) = sup{kAx − bk : A ∈ A}. Clearly this is convex in x, then the worst-case
robust approximation problem is to solve for

minimize ew (x). (1827)

If A is singleton, then clearly the problem reduces to norm approximation problem in Equation 1778.
If A is given to be finite set {A1 , · · · , Ak }, then we have to solve for

minimize max_{i∈[1,k]} kAi x − bk. (1828)

We can write this in epigraph form as

minimize t, (1829)
subject to kAi x − bk ≤ t, i ∈ [1, k]. (1830)

If the associated norm is `2 then we have SOCP (Equation 1358), if the associated norm is `1 , `∞ , then
we have LP (Equation 1294).
If A is represented by norm ball, say A = {Ā + U : kU k ≤ a}, then we have

ew (x) = sup{kĀx − b + U xk : kU k ≤ a}. (1831)

Let the norm on vector be Euclidean norm on Rn and the matrix norm be spectral norm (Definition 180).
Then the supremum is obtained for (verify this)

U = auv 0 , u = (Āx − b)/kĀx − bk2 , v = x/kxk2 . (1832)

Then ew (x) = kĀx − bk2 + akxk2 . We solve for

minimize kĀx − bk2 + akxk2 . (1833)

See this takes the form of a regularized norm problem (Exercise 1804) and is solvable as SOCP

minimize t1 + at2 , (1834)


subject to kĀx − bk2 ≤ t1 , kxk2 ≤ t2 . (1835)
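
As a sketch (assuming the cvxpy modelling package and toy data), the regularized form kĀx − bk2 + akxk2 can be solved directly; the solver performs the SOCP reformulation internally.

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(5)
A_bar = rng.standard_normal((30, 6))
b = rng.standard_normal(30)
a = 0.5                                   # bound on the spectral norm of the perturbation U

x = cp.Variable(6)
prob = cp.Problem(cp.Minimize(cp.norm(A_bar @ x - b, 2) + a * cp.norm(x, 2)))
prob.solve()
print("worst-case objective:", prob.value)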

If A describes an ellipsoid of possible values in each row, that is A = {[a1 , · · · , am ]0 : ai ∈ ξi , i ∈ [1, m]},
where ξi = {āi + Pi u : kuk2 ≤ 1}, the worst case magnitude of each residual is written

sup_{ai ∈ξi } |a0i x − bi | = sup{|ā0i x − bi + (Pi u)0 x| : kuk2 ≤ 1} (1836)
= |ā0i x − bi | + kPi0 xk2 . (1837)

Refer to Theorem 248 to see the equalities hold. Then the robust `2 norm approximation problem given

minimize ew (x) = sup{kAx − bk2 : ai ∈ ξi , i ∈ [1, m]} (1838)

has explicit expression for the objective function in



ew (x) = ( Σ_{i=1}^m ( sup_{ai ∈ξi } |a0i x − bi | )2 )^{1/2} = ( Σ_{i=1}^m (|ā0i x − bi | + kPi0 xk2 )2 )^{1/2} . (1839)

Then our problem is equivalent to SOCP

minimize ktk2 , (1840)


subject to ā0i x − bi + kPi0 xk2 ≤ ti , i ∈ [m], (1841)
−ā0i x + bi + kPi0 xk2 ≤ ti , i ∈ [m]. (1842)

Now suppose A is described as the image of norm ball under an affine transformation given by
A = {Ā + Σ_{i=1}^p ui Ai : kuk ≤ 1}. (1843)

Define P (x) = [ A1 x  A2 x  · · ·  Ap x ], q(x) = Āx − b, then we can write

ew (x) = sup_{kuk≤1} k(Ā + Σ_{i=1}^p ui Ai )x − bk = sup_{kuk≤1} kP (x)u + q(x)k. (1844)

The robust Chebyshev approximation problem for this is


minimize ew (x) = sup_{kuk∞ ≤1} k(Ā + Σ_{i=1}^p ui Ai )x − bk∞ . (1845)

Since we can write

ew (x) = sup_{kuk∞ ≤1} kP (x)u + q(x)k∞ (1846)
       = max_{i∈[1,m]} sup_{kuk∞ ≤1} |pi (x)0 u + qi (x)| (1847)
       = max_{i∈[1,m]} (kpi (x)k1 + |qi (x)|), Theorem 248. (1848)

We thus solve for the LP

minimize t, (1849)
subject to −y0 4 Āx − b 4 y0 , (1850)
−yk 4 Ak x 4 yk , k = 1, · · · p, (1851)
y0 + Σ_{k=1}^p yk 4 t1. (1852)

4.8.5 Function Fitting
The function fitting problem is the problem of finding a best fit function on given data from a finite
dimensional subspace of functions. Let f1 , · · · , fn : Rk → R be a family of functions with common
domain D. For each x ∈ Rn , define f : Rk → R s.t. f (u) = Σ_{i=1}^n xi fi (u). The family {f1 , · · · fn } is said
to be basis functions (not necessarily in the same notion as Definition 54, they do not even need to be
independent). Here x is optimization variable and is said to be coefficient vector. The basis functions
span subspace F of functions on D. An example would be Pn−1 (R) (Definition 104). The standard basis
for this polynomial space is given by fi (t) = ti−1 , i ∈ [1, n]. Let f1 , · · · fn be an orthonormal basis for
Pn−1 (R) w.r.t. a measure φ : R → R+ , given by

∫ fi (t)fj (t)φ(t)dt = { 1, i = j;   0, i 6= j. (1853)

Another common basis for this polynomial space is the Lagrange basis f1 , · · · fn associated with distinct
points t1 , · · · tn satisfying
fi (tj ) = { 1, i = j;   0, i 6= j. (1854)

Definition 254. Consider a domain given by D ⊆ Rk . The triangularization of D is defined to be a set


of mesh or grid points given by g1 · · · gn ∈ Rk and partition of D into set of simplexes (Definition 216)
s.t.

D = ∪_{i=1}^m Si , int(Si ∩ Sj ) = ∅ for i 6= j. (1855)

Each simplex is the convex hull of k + 1 grid points, and each grid point is a vertex of any simplex it lies
in.

Given a triangularization (Definition 254), a piecewise-affine function f can be constructed by letting
f (gi ) = xi at the grid points, and then extending the function affinely on each simplex. Then f is written
f (u) = Σ_{i=1}^n xi fi (u) with

fi (gj ) = { 1, i = j;   0, i 6= j. (1856)

We can define constraints that f should satisfy.

Exercise 510 (Function value interpolation and inequality constraints). Let v ∈ D for which f (v) =
Σ_{i=1}^n xi fi (v). We say that the conditions given by

f (vj ) = zj , j ∈ [1, m] (1857)

are interpolation conditions on f . The condition l ≤ f (v) ≤ u are box constraints on the output of f at
points v and are linear inequalities in x. For a set of finite points v1 , · · · vN , the condition

|f (vj ) − f (vk )| ≤ Lkvj − vk k, j, k ∈ [1, N ] (1858)

are said to be Lipschitz constraints. All these constraints are convex in x. We can also impose inequalities
on f at a infinite number of points, such as the simple nonnegative constraint f (u) ≥ 0 for all u ∈ D.

Exercise 511 (Derivative constraints). Let the basis functions fi , i ∈ [n] be differentiable at point v ∈ D.
The gradient ∇f (v) = Σ_{i=1}^n xi ∇fi (v) is clearly linear in x. We can define constraints on the norms of the
derivatives of various orders, such as the convex constraint in x given by

k∇f (v)k = kΣ_{i=1}^n xi ∇fi (v)k ≤ M (1859)

and

l1 4 ∇2 f (v) 4 u1. (1860)

Exercise 512 (Integral Constraints). Let l be a linear functional (see Definition 126) on the finite
dimensional subspace of functions; such a functional is linear in x. An example is given by l(f ) = ∫_D φ(u)f (u)du = c0 x, where
φ : Rk → R and ci = ∫_D φ(u)fi (u)du. Then any constraint of the form l(f ) = a is a linear constraint in
x. We can impose moment constraints such as ∫_D tm f (t)dt = a.

We worked with polynomial approximation problems in Exercise 265. We may add different con-
straints such as linear inequalities, constraints on derivatives, monotonicity constraints or moment con-
straints on polynomial approximation problems under the convex optimization framework.

Exercise 513. We consider the problem of fitting a polynomial function p of maximum degree n − 1 to
observed data. Suppose we are given training data of m samples given by the pairs (ui , vi ), i ∈ [1, m],
where ui ∈ D and vi ∈ R are the outputs. We want to find the best fitting p(u) = x1 + x2 u + · · · + xn un−1 .
Then the errors are given by
e = [ p(u1 ) − v1  · · ·  p(um ) − vm ]. (1861)

The minimum norm polynomial function fitting problem solves for

minimize kek = kAx − vk, Aij = uij−1 , i ∈ [1, m], j ∈ [1, n]. (1862)
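
A sketch of the ℓ2-norm polynomial fit via the Vandermonde matrix A with Aij = (ui )^{j−1}, on made-up data:

import numpy as np

rng = np.random.default_rng(6)
u = np.sort(rng.uniform(-1, 1, 40))
v = np.cos(3 * u) + 0.05 * rng.standard_normal(40)   # noisy samples of a smooth function

n = 6                                                # polynomial of degree n - 1
A = np.vander(u, N=n, increasing=True)               # A[i, j] = u_i ** j
x, *_ = np.linalg.lstsq(A, v, rcond=None)
print("coefficients:", x, "residual:", np.linalg.norm(A @ x - v))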

Consider the function fitting problem to data (ui , yi ), i ∈ [m], where ui ∈ D and yi ∈ R. When
m >> n, then the number of data points is larger than the dimension of the function subspace. There
is no requirement to do explicit smoothing. When we have fewer data points that the dimension of
the subspace of functions, we may require that the function satisfies interpolation condition f (ui ) = yi ,
i ∈ [m]. The problem that seeks the smoothest f subject to these interpolation conditions lead to least-
norm problems. When we used the fitted fˆ to evaluate the point that belongs in the convex hull of the
fitted points (i.e. v ∈ conv{u1 , · · · , un }), we say that we are interpolating. Otherwise, our evaluation is
an extrapolation.
Recall the regressor selection problem (Exercise 506). In basis pursuit, there is a large number of
basis functions, and our objective is to find a good fit for given data as linear combination of a small
subset of the given basis functions. Here, the function family is not linearly independent, and we say it
is an over-complete basis or dictionary. We seek a function f = Σ_{i=1}^n xi fi = Σ_{i∈B} xi fi where f (ui ) ≈ yi and
card(x) << n (x is sparse), where B = {i : xi 6= 0}. An application of basis pursuit is data compression.
Suppose the sender and receiver of a data package both have knowledge of dictionary. The sender
begins by finding sparse representation of the signal, and then sending only the nonzero coefficients. The
receiver may look up the dictionary to reconstruct the appproximation of the original signal. As before
in Exercise 506, we may first use `1 norm regularization and solve for convex problem
minimize Σ_{i=1}^m (f (ui ) − yi )2 + γkxk1 , γ > 0. (1863)

We can use the solution either directly or refine by solving the least-squares problem
m
X
minimize (f (ui ) − yi )2 (1864)
i=1

with the sparsity pattern found in the `1 approximation step.


We consider the special case of interpolating with f restricted to convex functions. In particular, we
solve for interpolation problems involving infinite dimensional set of functions, using finite dimensional
convex optimization.

Theorem 317. There exists a convex function f : Rk → R satisfying the interpolation conditions
f (ui ) = yi at points ui for i ∈ [1, m] iff ∃gi , i ∈ [m] s.t.

yj ≥ yi + gi0 (uj − ui ), i, j ∈ [1, m]. (1865)

Proof. Suppose that f convex, dom f = Rk . Further suppose f (ui ) = yi . If f is differentiable, then there
exists vector gi = ∇f (ui ) s.t. (see Theorem 274)

∀z, f (z) ≥ f (ui ) + gi0 (z − ui ). (1866)

In the general case, gi may be constructed by finding a supporting hyperplane to epi f at (ui , yi ). Then
uj → z and the forward proof is done. On the other hand, suppose ∃gi , i ∈ [m] s.t.

yj ≥ yi + gi0 (uj − ui ), i, j ∈ [1, m]. (1867)

Define f (z) = maxi∈[1,m] (yi + gi0 (z − ui )) for all z ∈ Rk . This is piecewise convex and clearly f (ui ) = yi
for i ∈ [1, m] and we are done.

Suppose we wish to find the least-squares fit of a convex function to given training data (ui , yi ), i ∈ [m],
then we solve for

minimize Σ_{i=1}^m (yi − f (ui ))2 , (1868)
subject to f : Rk → R convex and dom f = Rk . (1869)

As per above discussion in Theorem 317 we may write


minimize Σ_{i=1}^m (yi − ŷi )2 , (1870)
subject to ŷj ≥ ŷi + gi0 (uj − ui ), i, j ∈ [1, m]. (1871)

This is QP with variables in ŷ and gi , i ∈ [m]. The optimal value is zero iff the given data may be
interpolated by some convex function. On the same data, suppose it is already known that there is an
interpolating convex function on it. We want to find the range of possible values of a new data point u0
s.t. f remains interpolating convex. The smallest possible value is given by the LP

minimize y0 , (1872)
subject to yj ≥ yi + gi0 (uj − ui ), i, j ∈ [1, m]. (1873)

The maximization problem gives the range. We may impose further constraints on the convex interpo-
lating function. Suppose we are now further interested in a convex, monotone nondecreasing function.

Result 24. There exists a convex function f : Rk → R with dom f = Rk satisfying interpolating
conditions

f (ui ) = yi , i ∈ [1, m] (1874)

and is monotone nondecreasing (u < v =⇒ f (u) ≥ f (v)) iff there exists

gi < 0, i ∈ [1, m], yj ≥ yi + gi0 (uj − ui ), i, j ∈ [1, m]. (1875)

Compare this to Theorem 317.

Exercise 514. Suppose a basket of goods is described by an allocation vector 0 4 x 4 1. Given basket
x, the utility derived by some consumer for x is determined by the utility function u : Rn → R with utility
u(x). u is monotone nondecreasing and concave, reflecting the law of diminishing marginal utilities. We
are given consumer preference data, but not the utility function u. The priors are given by a set of strong
and weak preference relations:

P = {(i, j) : u(ai ) > u(aj ) for (i, j) ⊆ [1, m] × [1, m]} (1876)
Pw = {(i, j) : u(ai ) ≥ u(aj ) for (i, j) ⊆ [1, m] × [1, m]}. (1877)

Our goal is to determine if the data is consistent. We want to solve for the feasibility problem of a
concave, nondecreasing utility function that does not violate the known consumer preferences. We solve
the problem

find u (1878)
n
subject to u : R → R, nondecreasing , (1879)
u(ai ) > u(aj ), (i, j) ∈ P, (1880)
u(ai ) ≥ u(aj ), (i, j) ∈ Pw . (1881)

The constraints are homogenous, so we can write in equivalent form

find u (1882)
subject to u : Rn → R, nondecreasing , (1883)
u(ai ) ≥ u(aj ) + 1, (i, j) ∈ P, (1884)
u(ai ) ≥ u(aj ), (i, j) ∈ Pw . (1885)

Then Theorem 317 asserts we can solve for the optimization problem

find u1 , · · · um , g1 , · · · gm , (1886)
subject to gi < 0, i ∈ [1, m], (1887)
uj ≤ ui + gi0 (aj − ai ), i, j ∈ [1, m], (1888)
ui ≥ uj + 1, (i, j) ∈ P, (1889)
ui ≥ uj , (i, j) ∈ Pw . (1890)

4.8.6 Parametric and Non Parametric Estimations


Recall the maximum likelihood estimation problem (Definition 334). In particular, we have a family of
probability distributions indexed by θ ∈ Θ. The likelihood function is given by pθ (y), and log-likelihood

is given by l(θ) = log pθ (y). We want to estimate θ based on some sample y from the (unknown)
distribution. The MLE problem is to estimate θ by maximizing the likelihood function for the observed
y, so we search for θ̂ = arg maxθ pθ (y) = arg maxθ l(θ). We can also add constraints, such as θ ∈ C for
some constraint set C. We solve for

maximize l(θ) = log pθ (y), (1891)
subject to θ ∈ C. (1892)

If the log-likelihood function l is concave in θ, and the set C can be described by affine constraints and
convex inequality constraints, then the MLE problem as formulated is in convex optimization standard form (Definition
247). For instance, suppose we assume data generated by a linear measurement model

yi = a0i x + vi , i ∈ [1, m]. (1893)

Here x are the optimization variables to find, yi are measurements and vi are noise inputs. Assume that
vi ∼IID p; then the likelihood and log-likelihood functions are given by

px (y) = Π_{i=1}^m p(yi − a0i x), l(x) = log px (y) = Σ_{i=1}^m log p(yi − a0i x). (1894)

The MLE is the optimal solution to the optimization problem that maximizes l(x). If p is log-concave
(see Definition 241) then we have convex optimization problem.

Exercise 515 (MLE for common parametric families). -

1. When vi are Gaussian given vi ∼ Φ(0, σ 2 ), then the log-likelihood function is given by (see Exercise
603)
l(x) = −(m/2) log(2πσ 2 ) − (1/(2σ 2 ))kAx − yk22 , (1895)
where A is matrix encoding rows a0i . Then the MLE is given by x̂ = arg minx kAx − yk22 .

2. When vi are Laplacian with density p(z) = (1/(2a)) exp(−|z|/a), a > 0, then the MLE is given by
x̂ = arg minx kAx − yk1 .

3. When vi is uniformly distributed with vi ∼ U (−a, a), then p(z) = 1/(2a) on [−a, a] and the MLE is any x that
satisfies kAx − yk∞ ≤ a.

Other MLE solutions for Poisson distributions, Gamma distributions, multinomial distributions are
explored in Section 6.12. Interestingly, we may treat penalty function approximation problems (Equation
1791) as MLE problems with noise density

p(z) = exp(−φ(z)) / ∫ exp(−φ(u))du (1896)
and measurements b. Then we have
l(x) = log px (y) = Σ_{i=1}^m log p(bi − a0i x) = − Σ_{i=1}^m φ(bi − a0i x) − m log ∫ exp(−φ(u))du. (1897)

Then the maximization of this log-likelihood function is equivalent to the problem of taking

minimize Σ_{i=1}^m φ(bi − a0i x). (1898)

The penalty function approximation problem has statistical interpretations; if φ grows rapidly in the residual
r, then the noise density p has very small tails and the MLE avoids estimates with large residuals.
See that we may interpret the `1 norm approximation problem as MLE with Laplacian noise density
(Exercise 515.2), and the `2 norm approximation problem as MLE with Gaussian noise density (Exercise
515.1). The Laplacian density has larger tails than the Gaussian, and the MLE using Laplacian noise
densities tends to tolerate some larger residuals in the fit.

Exercise 516 (Logistic Regression). Consider a random variable y distributed Bernoulli(p) (Section
6.17.1). Assume p depends on a vector of explanatory variables u under the logistic model given by

p = exp(a0 u + b) / (1 + exp(a0 u + b)), (1899)

where a, b are the model parameters. We are given training data D = {(ui , yi ) ∈ Rn × {0, 1} : i ∈ [1, m]}.
Logistic regression is the problem of finding the MLE for (a, b) ∈ Rn+1 . Let us re-order the data such
that for u1 , · · · uq , y = 1 and is otherwise zero. We have
l(a, b) = log( Π_{i=1}^q pi · Π_{i=q+1}^m (1 − pi ) ) = Σ_{i=1}^q log pi + Σ_{i=q+1}^m log(1 − pi ) (1900)
        = Σ_{i=1}^q log( exp(a0 ui + b) / (1 + exp(a0 ui + b)) ) + Σ_{i=q+1}^m log( 1 / (1 + exp(a0 ui + b)) ) (1901)
        = Σ_{i=1}^q (a0 ui + b) − Σ_{i=1}^m log(1 + exp(a0 ui + b)). (1902)

l is concave in (a, b) and we have convex optimization problem.
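
A sketch of the logistic regression MLE obtained by minimizing the negative of the log-likelihood (1902) with a general-purpose solver; the data below are simulated, not from the text.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
m, n = 200, 3
U = rng.standard_normal((m, n))
a_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3
y = (rng.uniform(size=m) < 1 / (1 + np.exp(-(U @ a_true + b_true)))).astype(float)

def neg_loglik(theta):
    a, b = theta[:n], theta[n]
    z = U @ a + b
    # -l(a, b) = sum_i log(1 + exp(z_i)) - sum_{y_i = 1} z_i, computed stably via logaddexp
    return np.logaddexp(0.0, z).sum() - z[y == 1].sum()

res = minimize(neg_loglik, np.zeros(n + 1), method="BFGS")
print("estimated (a, b):", res.x)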

Exercise 517 (Covariance Estimation for Gaussian Variables). Let y be a multivariate Gaussian random
vector, y ∼ Φ(0, R), s.t. R = Eyy 0 , with density function

pR (y) = (2π)^{−n/2} det(R)^{−1/2} exp(−y 0 R−1 y/2), Equation 2701. (1903)
We are given independently sampled y1 , · · · , yN ∈ Rn , then the log-likelihood function is given by

l(R) = log pR (y1 , · · · yN ) (1904)
     = −(N n/2) log(2π) − (N/2) log(det R) − (1/2) Σ_{k=1}^N yk0 R−1 yk (1905)
     = −(N n/2) log(2π) − (N/2) log(det R) − (N/2) tr(R−1 Y ), Y = (1/N ) Σ_{k=1}^N yk yk0 . (1906)

See that Y is the sample covariance matrix for the observed yi , i ∈ [N ]. Unfortunately, this is not concave in
R, but we can employ a change of variables to get a concave equivalent problem. Let S = R−1 . Clearly we may
write

l(S) = −(N n/2) log(2π) + (N/2) log det S − (N/2) tr(SY ), (1907)
and this is concave in S. We solve for the optimization problem

maximize log det S − tr(SY ) (1908)


subject to S ∈ C, (1909)

where C is our constraint set encoding our prior knowledge on S = R−1 . If there are no priors, then we
have the implicit constraint C = {R  0}. In this special case, the derivative of the objective function
is (see Exercise 390) given by S −1 − Y , so the optimal S satisfies S −1 = Y - without prior assumptions
the MLE of the covariance matrix is simply the sample covariance matrix. We can impose a condition
number constraint on R, say λmax (R)/λmin (R) ≤ κmax , which may be written as

λmax (S) ≤ κmax λmin (S). (1910)

This is equivalent to specifying the existence of u > 0 s.t. u1 4 S 4 κmax u1. Condition numbers
indicate the sensitivity of perturbations in input to a function output. Recall that they can be used to
study multicollinearity issues (see Section 8.8.3.7). Then we solve the constrained MLE problem

maximize log det S − tr(SY ) (1911)


subject to u1 ≤ S ≤ κmax u1. (1912)
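
A sketch (assuming the cvxpy modelling package and simulated data) of the condition-number-constrained covariance MLE (1911)-(1912):

import cvxpy as cp
import numpy as np

rng = np.random.default_rng(8)
n, N = 4, 100
R_true = np.diag([4.0, 2.0, 1.0, 0.5])
ys = rng.multivariate_normal(np.zeros(n), R_true, size=N)
Y = ys.T @ ys / N                           # sample covariance (zero-mean model)

kappa_max = 3.0
S = cp.Variable((n, n), symmetric=True)     # S = R^{-1}
u = cp.Variable(nonneg=True)
constraints = [S - u * np.eye(n) >> 0, kappa_max * u * np.eye(n) - S >> 0]
prob = cp.Problem(cp.Maximize(cp.log_det(S) - cp.trace(S @ Y)), constraints)
prob.solve()
R_hat = np.linalg.inv(S.value)
print("condition number of estimate:", np.linalg.cond(R_hat))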

The other school of statistical estimation is said to be the maximum a posteriori (MAP) estimation
problem. This is a Bayesian approach, as opposed to the frequentist approach in the MLE problem.
Here we assume that we have priors on distribution parameters x, and that there is joint density p(x, y),
where y is the observed data. The prior density is given by px (x) = ∫ p(x, y)dy, py (y) = ∫ p(x, y)dx. The
posterior density, after observing y, is given by px|y . The MAP estimation problem solves for

x̃ = arg maxx px|y (x, y) = arg maxx ( py|x (x, y)px (x) / py (y) ) = arg maxx p(x, y). (1913)

Clearly if px is uniform prior, then the MAP and MLE estimation methods give the same estimates for

x. We can also write x̃ = arg maxx log py|x (x, y) + log px (x) . So the MAP estimate solves for the MLE
problem, with the adjustment of log px (x) as reward. To any MLE with concave log-likelihood function,
we can add a prior density for x that is log-concave, and the resulting problem remains convex.

Exercise 518 (MAP on Linear Measurement Model). Suppose that y is vector of measurments given by

yi = a0i x + vi , i ∈ [1, m]. (1914)


We assume that vi ∼IID pv and we have prior density px . Then we have

p(x, y) = px (x) Π_{i=1}^m pv (yi − a0i x), (1915)

and we solve for the MAP estimation problem


maximize log px (x) + Σ_{i=1}^m log pv (yi − a0i x). (1916)

If px , pv are log-concave, the problem is convex. Assume that vi ∼ U (−a, a), and that the prior on x is Gaussian
with mean x̄, covariance Σ. Then the MAP problem is the QP

minimize (x − x̄)0 Σ−1 (x − x̄) (1917)


subject to kAx − yk∞ ≤ a. (1918)

Now consider a random variable X taking on values in the finite set {α1 , · · · αn } ⊆ R. The distribution
of X is given by probability vector p ∈ Rn satisfying p < 0, 10 p = 1. Many priors take linear forms.
If f : R → R, then Ef (X) = Σ_{i=1}^n pi f (αi ) is linear in p. If C ⊆ R, then P(X ∈ C) = E1C (X) = c0 p,
where ci is one if αi ∈ C and zero otherwise. The constraint EX = α can be expressed Σ_{i=1}^n αi pi = α, the
constraint EX 2 = β as Σ_{i=1}^n αi2 pi = β, and the constraint P(X ≥ 0) ≤ δ as Σ_{αi ≥0} pi ≤ δ. These are
linear in p. The constraint Var(X) = EX 2 − (EX)2 = Σ_{i=1}^n αi2 pi − (Σ_{i=1}^n αi pi )2 is concave in p. Any lower
bound constraint on the variance of X is a convex quadratic inequality in p. The entropy of
X, − Σ_{i=1}^n pi log pi , is concave in p, and we can impose lower bounds on the entropy as a convex
inequality. If q is another probability vector, q < 0, 10 q = 1, then the Kullback-Leibler divergence (see
Equation 1103) between the two distributions p, q, given by Σ_{i=1}^n pi log(pi /qi ), is convex in p. Upper bounds
on the Kullback-Leibler divergence between probability distributions are therefore convex inequalities.
Of course, these convex constraints can also be the objectives in convex optimization problems. Let
the priors be encoded in a set P, which is a set of linear equalities and convex inequalities on p. Further
suppose that P contains the constraints p  0, 10 p = 1. The lower bound on Ef (X) is convex problem
minimize Σ_{i=1}^n f (αi )pi , (1919)
subject to p ∈ P. (1920)

The minimum Kullback-Leibler divergence solves for


minimize Σ_{i=1}^n pi log(pi /qi ) (1921)
subject to p ∈ P. (1922)

Note that when the prior is uniform, that is when q = (1/n)1, then the minimum Kullback-Leibler divergence
problem is simply the maximum entropy problem, which solves for the most random distribution that is
consistent with P.

Exercise 519. Let X, Y be two random variables representing asset return taking values in {α1 , · · · αn }
and {β1 , · · · βm } respectively. Their joint distribution pij is not known, but their marginal distributions

Σ_{j=1}^m pij = ri , i ∈ [n], Σ_{i=1}^n pij = qj , j ∈ [m] (1923)

are known. Our objective is to bound the probability of loss on the two asset portfolio, given by P(X + Y <
γ). This can be written as LP
maximize Σ {pij : αi + βj < γ, (i, j) ∈ [n] × [m]}, (1924)
subject to pij ≥ 0, i ∈ [n], j ∈ [m], (1925)
Σ_{j=1}^m pij = ri , i ∈ [n], Σ_{i=1}^n pij = qj , j ∈ [m]. (1926)

The optimal solution p∗ is the distribution that maximizes the probability of loss, subject to consistency
with the known marginal distributions.
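
A sketch of the worst-case loss-probability LP (1924)-(1926) with a small made-up discrete model:

import numpy as np
from scipy.optimize import linprog

alpha = np.array([-0.1, 0.0, 0.2])         # possible returns of X, marginal r
beta = np.array([-0.2, 0.1])               # possible returns of Y, marginal q
r = np.array([0.3, 0.4, 0.3])
q = np.array([0.5, 0.5])
gamma = 0.0                                # loss threshold

n, m = len(alpha), len(beta)
loss = (alpha[:, None] + beta[None, :] < gamma).ravel()   # indicator of the loss event per (i, j)

# maximize sum of p_ij over the loss set, subject to the marginal constraints
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0       # sum_j p_ij = r_i
for j in range(m):
    A_eq[n + j, j::m] = 1.0                # sum_i p_ij = q_j
b_eq = np.r_[r, q]
res = linprog(-loss.astype(float), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (n * m), method="highs")
print("worst-case P(X + Y < gamma):", -res.fun)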

4.9 Algorithms for Unconstrained Minimization


We solve for the convex, unconstrained optimization problem

minimize f (x). (1927)

Here f : Rn → R is convex and twice continously differentiable. By definition, dom f is open set and we
assume that the problem may be solved. We let p∗ = f (x∗ ) = inf x f (x). Then x is optimal iff ∇f (x) = 0
(Theorem 301), which is a set of n equations in n variables. In general, there are no analytic solutions,
and may be solved by an iterative algorithm that generates a sequence of points x(0) , x(1) · · · ∈ domf with
f (x(k) ) → p∗ as k → ∞. Then we say that this sequence of points is minimizing sequence for the problem.
We can define a stopping criterion, for instance the ε-suboptimality criterion, given f (x(k) ) − p∗ ≤ ε,
where ε is the error tolerance. Assume that we begin with a starting guess x(0) ∈ dom f . Further assume
the sublevel set

S = {x ∈ dom f : f (x) ≤ f (x(0) )} (1928)

is closed. This condition is satisfied for all x(0) in dom f if f is closed. Some classes of closed functions
are continuous functions with dom f = Rn , as well as continuous functions with open domains, for which
f (x) → ∞ as x → bd dom f .
Some cases of unconstrained convex minimization problems are well known. The convex quadratic
minimization problem has objective (1/2)x0 P x + q 0 x + r, with P < 0. When P ≻ 0 then the solution is unique
at x∗ = −P −1 q. Otherwise any solution satisfying P x∗ = −q is optimal, and if no such solution exists
then the system is unbounded below (Theorem 257). A special and common case is the least-squares
problem, that minimizes kAx − bk2 with optimality conditions satisfied by any x∗ solving A0 Ax = A0 b
(Section 4.3). This system of equations referred to as the normal equations in the least-squares problem.
A class of unconstrained, convex optimization problem we have already encountered so far is the
unconstrained geometric program. Consider the GP in convex form (Theorem 306) given by
minimize f (x) = log Σ_{i=1}^m exp(a0i x + bi ). (1929)

Then the optimality condition is


∇f (x∗ ) = ( 1 / Σ_{j=1}^m exp(a0j x + bj ) ) Σ_{i=1}^m exp(a0i x + bi )ai = 0. (1930)

In general no analytical solution may be computed, but we may rely on iterative methods.

Exercise 520 (Analytic Center). Consider the optimization problem


minimize f (x) = − Σ_{i=1}^m log(bi − a0i x), dom f = {x : a0i x < bi , i ∈ [1, m]}. (1931)

The function f is said to be the logarithmic barrier for the inequalities a0i x ≤ bi . The solution of the optimization
problem, if it exists, is said to be the analytic center of the inequalities.
We can also consider the problem

minimize f (x) = log det F (x)−1 , dom f = {x : F (x) ≻ 0}, (1932)

where F (x) = F0 + x1 F1 + · · · + xn Fn , with Fi ∈ S p . Then F : Rn → S p . Here f is said to be the logarithmic
barrier for the LMI F (x) < 0, and the solution, if it exists, is said to be the analytic center of the LMI.

Definition 255 (Strong Convexity). A function f is strongly convex on a set S if ∃m > 0 s.t.

∀x ∈ S, ∇2 f (x) < m1. (1933)

Note that the notion of strong convexity (Definition 255) is not the same as strict convexity, which
refers to the situation where the inequality in Jensen's inequality is strict. We assume, unless
otherwise specified in this section, that the objective function f in Equation 1927 is strongly convex on the
sublevel set in Equation 1928.

Result 25. For x, y ∈ S, ∃z ∈ [x, y] s.t.


f (y) = f (x) + ∇f (x)0 (y − x) + (1/2)(y − x)0 ∇2 f (z)(y − x). (1934)
Theorem 318 (Tighter Bounds on Strongly Convex Functions). Assume f is strongly convex (Definition
255). Let S be the sublevel set defined in Equation 1928. By the strong convexity assumption and applying
Result 25, we have for all x, y ∈ S

f (y) = f (x) + ∇f (x)0 (y − x) + (1/2)(y − x)0 ∇2 f (z)(y − x) (1935)
     ≥ f (x) + ∇f (x)0 (y − x) + (m/2)ky − xk22 . (1936)

When m = 0, then we have the first order condition for convexity (Theorem 274). When m > 0, strong
convexity asserts we can get a tighter bound on f (y).

Theorem 319 (Suboptimality Bound for Strongly Convex Function). For a strongly convex function
satisfying Definition 255, with sublevel set S as in Equation 1928, we have

f (x) − p∗ ≤ (1/(2m))k∇f (x)k22 . (1937)

Note that if k∇f (x)k2 ≤ (2mε)^{1/2} , then f (x) − p∗ ≤ ε. The size of the gradient evaluated at a point
indicates nearness of optimality of that point. Furthermore, the distance between point x and the optimal
point x∗ is bounded by

kx − x∗ k2 ≤ (2/m)k∇f (x)k2 . (1938)
This also asserts that the optimal point x∗ must be unique.

Proof. Taking the derivative of the RHS in Equation 1936 w.r.t. y, we get

∇f (x) + (m/2)(2y − 2x) = 0, (1939)

so ỹ = x − (1/m)∇f (x) is the minimizer of the RHS. Then for all y ∈ S,

f (y) ≥ f (x) + ∇f (x)0 (y − x) + (m/2)ky − xk22 (1940)
     ≥ f (x) + ∇f (x)0 (ỹ − x) + (m/2)kỹ − xk22 (1941)
     = f (x) − (1/m)k∇f (x)k22 + (m/2)(1/m2 )k∇f (x)k22 (1942)
     = f (x) − (1/(2m))k∇f (x)k22 . (1943)
It follows that p∗ ≥ f (x) − (1/(2m))k∇f (x)k22 , and the suboptimality bound in Equation 1937 follows.
Use Equation 1936 with y → x∗ to obtain

p∗ = f (x∗ ) ≥ f (x) + ∇f (x)0 (x∗ − x) + (m/2)kx∗ − xk22 (1944)
           ≥ f (x) − |h∇f (x), (x∗ − x)i| + (m/2)kx∗ − xk22 (1945)
           ≥ f (x) − k∇f (x)k2 kx∗ − xk2 + (m/2)kx∗ − xk22 , Theorem 154. (1946)

But since p∗ ≤ f (x), then
−k∇f (x)k2 kx∗ − xk2 + (m/2)kx∗ − xk22 (1947)
= kx∗ − xk2 ( −k∇f (x)k2 + (m/2)kx∗ − xk2 ) ≤ 0, (1948)
and the result follows.

The sublevel set S in Equation 1928 is bounded (verify this), and hence the maximum eigenvalue of
∇2 f (x) is continuous function of x on S and bounded from above on S. That is, ∃M s.t.

∀x ∈ S, ∇2 f (x) 4 M 1. (1949)

Theorem 320 ((Lower) Suboptimality Bound for Strongly Convex Function). Assume f is strongly
convex (Definition 255). Let S be the sublevel set defined in Equation 1928. Using Equation 1949, we
can write
∀x, y ∈ S, p∗ ≤ f (y) ≤ f (x) + ∇f (x)0 (y − x) + (M/2)ky − xk22 . (1950)

As before we can take the derivative of the RHS w.r.t. y to get

∇f (x) + (M/2)(2y − 2x) = 0, (1951)
ỹ = x − (1/M )∇f (x). (1952)

Then ỹ is the minimizer for the RHS and after the substitutions we get

f (x) − (1/M )k∇f (x)k22 + (M/2)(1/M )2 k∇f (x)k22 . (1953)

Then p∗ ≤ f (x) − (1/(2M ))k∇f (x)k22 , and

f (x) − p∗ ≥ (1/(2M ))k∇f (x)k22 . (1954)
By strong convexity assumption (Definition 255) and bounded sublevel set giving Equation 1949,
we have the Hessian inequality m1 4 ∇2 f (x) 4 M 1 for all x ∈ S. Then the condition number of the
Hessian has upper bound
cond(∇2 f (x)) ≤ M/m = κ. (1955)
Definition 256. For a given convex set C ⊆ Rn , the width of the convex set in direction q satisfying
kqk2 = 1 is given by

W (C, q) = sup_{z∈C} q 0 z − inf_{z∈C} q 0 z. (1956)

The minimum and maximum width of C are given by

Wmin = inf_{kqk2 =1} W (C, q), Wmax = sup_{kqk2 =1} W (C, q). (1957)

The condition number of C is defined cond(C) = Wmax^2 / Wmin^2 . This number is said to give a measure of
anisotropy, or eccentricity. If the condition number is small, then the set has approximately the same width
in all directions.

Exercise 521. Let an ellipsoid be defined ξ = {x : (x − x0 )0 A−1 (x − x0 ) ≤ 1}, A ≻ 0. See we may write
sup_{z∈ξ} q 0 z = sup{q 0 (x0 + A^{1/2} x) : kxk2 ≤ 1} = kA^{1/2} qk2 + q 0 x0 (see Definition 210 and Theorem 248).
The width of ξ may be written explicitly

sup_{z∈ξ} q 0 z − inf_{z∈ξ} q 0 z = (kA^{1/2} qk2 + q 0 x0 ) − (−kA^{1/2} qk2 + q 0 x0 ) = 2kA^{1/2} qk2 . (1958)

It follows that the minimum and maximum width are given by

Wmin = 2λmin (A)^{1/2} , Wmax = 2λmax (A)^{1/2} . (1959)

The condition number of the ellipsoid is expressed

cond(ξ) = λmax (A)/λmin (A) = cond(A). (1960)
Suppose f is strongly convex function satisfying the Hessian bounds

m1 4 ∇2 f (x) 4 M 1. (1961)

Consider the α-sublevel set Cα = {x : f (x) ≤ α}, where p∗ < α ≤ f (x(0) ). Recall Equation 1950 and
Equation 1936. Taking x → x∗ we have
p∗ + (m/2)ky − x∗ k22 ≤ f (y) ≤ p∗ + (M/2)ky − x∗ k22 . (1962)

Then (verify this) we have Bi ⊆ Cα ⊆ Bo , where

Bi = {y : ky − x∗ k2 ≤ (2(α − p∗ )/M )^{1/2} }, (1963)
Bo = {y : ky − x∗ k2 ≤ (2(α − p∗ )/m)^{1/2} }. (1964)

The two balls have radii (2(α − p∗ )/M )^{1/2} and (2(α − p∗ )/m)^{1/2} respectively. The condition number of Cα is given
by

cond(Cα ) ≤ M/m. (1965)
It turns out that the condition number of the sublevel sets of f has strong effect on the efficiency of
the algorithms used in unconstrained minimization. However, note that this is not practically useful for
defining stopping conditions; in most cases, the constants m, M are not known. It should be noted that when
convergence proofs are provided for algorithms that use these constants, we should be aware that
these assertions are conceptual.

4.9.1 Descent Methods


We discuss iterative algorithms that compute minimizing sequences

x(k+1) = x(k) + t(k) ∆x(k) , t(k) > 0. (1966)

∆x(k) is said to be step/search direction and t(k) is said to be step size/length. Often the updating
step is replaced with the equivalent form x+ = x + t∆x or x := x + t∆x. The methods are said to be
descending, or is descent method, because f (x(k+1) ) < f (x(k) ). For all k, we have x(k) ∈ S = {x ∈
dom f : f (x) ≤ f (x(0) )} (Equation 1928). Since Theorem 274 asserts that

f (y) ≥ f (x(k) ) + ∇f (x(k) )0 (y − x(k) ), (1967)

Algorithm 1 Outline of a Descent Algorithm
x := x0 ∈ domf
while x does not satisfy the stopping criterion do
find ∆x.
choose or find t > 0 (line search).
x := x + t∆x
end while

then ∇f (x(k) )0 (y − x(k) ) ≥ 0 =⇒ f (y) ≥ f (x(k) ). It follows that the search direction should satisfy
∇f (x(k) )0 ∆x(k) < 0, or −∇f (x(k) )0 ∆x(k) > 0. The search direction makes an acute angle with the
negative gradient. This direction is said to be a descent direction.
Choosing t is said to be a line search, since the selection of t determines our next exploration along the
ray {x + t∆x : t ≥ 0} we iterate to. Since near optimality is bounded by factor of k∇f (x)k22 (Equation
1937), a common stopping criterion would take form k∇f (x)k2 ≤ η. From an efficiency standpoint, if
we are able to solve for

t = arg min_{s≥0} f (x + s∆x) (1968)

much cheaper than the cost of determining ∆x, we set the variable t by solving for the optimal solution
here, and say that the line search is an exact line search.

Definition 257. Most line searches for the determination of t are non-exact, in that t is not the
optimal solution to Equation 1968. An inexact, yet effective, line search method is called the backtracking
line search. For some choice of constants 0 < α < 0.5, 0 < β < 1, it solves for t using the algorithm

Algorithm 2 Backtracking Line Search


Input: 0 < α < 0.5, 0 < β < 1. Suppose we have ∆x, x ∈ dom f .
t := 1
while f (x + t∆x) > f (x) + αt∇f (x)0 ∆x do
t := βt.
end while

Let t0 be the point such that f (x)+αt0 ∇f (x)0 ∆x = f (x+t0 ∆x). The backtracking search starts with
step size of one and reduces it by a factor of β until stopping condition f (x + t∆x) ≤ f (x) + αt∇f (x)0 ∆x
holds. Recall that ∆x is chosen to be descent direction, and ∇f (x)0 ∆x < 0, and since the first-order
approximation

f (x + t∆x) ≈ f (x) + t∇f (x)0 ∆x < f (x) + αt∇f (x)0 ∆x (1969)

is accurate for small t and α > 0, the algorithm is guaranteed to terminate. Let t0 satisfy f (x) +
αt0 ∇f (x)0 ∆x = f (x + t0 ∆x). Then t takes value one (if stopping condition holds at initiation) or
t ∈ (βt0 , t0 ]. To see this, note that the exit condition holds for every t ≤ t0 , so the last rejected step size must
exceed t0 ; one further backtracking update then gives a value greater than βt0 . So the variable t satisfies

t ≥ min{1, βt0 }. (1970)

Note that we assume here that x + t∆x ∈ dom f - in implementations, we will need to take t := β k
until x + t∆x ∈ dom f for increasing k before entering the loop. Typical values are α ∈ [0.01, 0.3] and
β ∈ [0.1, 0.8].
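
A direct sketch of Algorithm 2 in code (a generic helper; f and its gradient are assumed to be supplied by the caller, and x + t∆x is assumed to stay in dom f):

import numpy as np

def backtracking(f, grad_f, x, dx, alpha=0.2, beta=0.5):
    # Shrink t by beta until the sufficient-decrease (exit) condition holds.
    t = 1.0
    fx, gx = f(x), grad_f(x)
    while f(x + t * dx) > fx + alpha * t * gx @ dx:
        t *= beta
    return t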

4.9.2 Gradient Descent
The gradient descent method is a particular instance of the descent algorithm (Algorithm 1) that sets
the search direction ∆x = −∇f (x). The inequality in Equation 1937 asserts that a common stopping
criterion takes form k∇f (x)k2 ≤ η, where η is small positive number.

Theorem 321 (Convergence Properties of Exact Line Search Gradient Descent). Assume that f is
strong convex on S (Definition 255, Definition 1928), then ∃m, M > 0 s.t. m1 ≺ ∇2 f (x) ≺ M 1 for all
x ∈ S. Define f˜ : R → R as f˜(t) = f (x − t∇f (x)). The search direction is ∆x = −∇f (x) by definition.
Assume that x − t∇f (x) ∈ S. Equation 1950 asserts that for y = x − t∇f (x) we have
f˜(t) ≤ f (x) − tk∇f (x)k22 + (M t2 /2)k∇f (x)k22 . (1971)
Exact line search chooses the minimizing t, so on the LHS we have f˜(te ) and taking the RHS derivative
we have
−∇f (x)0 ∇f (x) + (2M t/2)∇f (x)0 ∇f (x) = (−1 + M t)∇f (x)0 ∇f (x) = 0, (1972)

so t = 1/M and we obtain

f (x+ ) = f˜(te ) ≤ f (x) − (1/(2M ))k∇f (x)k22 . (1973)
Then
f (x+ ) − p∗ ≤ f (x) − p∗ − (1/(2M ))k∇f (x)k22 (1974)

and since k∇f (x)k22 ≥ 2m(f (x) − p∗ ) (Equation 1937) then

f (x+ ) − p∗ ≤ (1 − m/M )(f (x) − p∗ ). (1975)
Then by iteration k we have
m
f (x(k) ) − p∗ ≤ ck (f (x(0) ) − p∗ ), c=1− < 1, (1976)
M
so f (x(k) ) → p∗ as k → ∞. Furthermore, to guarantee  suboptimality we compute

ck (f (x(0) ) − p∗ ) = , (1977)
k (0) ∗
log c + log(f (x )−p ) = log , (1978)
k log c = log  − log(f (x(0) ) − p∗ ), (1979)
(0) ∗ (0) ∗
log  − log(f (x ) − p ) log(f (x ) − p ) − log 
k = = , (1980)
log c − log c
log(f (x(0) ) − p∗ )/
k = (1981)
log(1/c)
The numerator is the logarithmic ratio between initial distance from the optimal value and the error
tolerance. This is multiplied by (log(1/c))−1 = (− log(1− M
m −1 m −1
)) ≈ ( M ) = M
m, where the approximation
M
holds large condition number bound m. Recall that this is the upper bound on the condition number for
2
the Hessian ∇ f (x) on S, or the sublevel-sets Cα (see Equation 1955, Equation 1965). When condition
number bound is large, the bound on the number of iterations required increases approximately linear w.r.t
to the bound. The gradient method does in fact require a large number of iterations when ∇2 f (x) has
large condition number for x near x∗ . Since f (x(k) ) − p∗ ≤ ck (f (x(0) ) − p∗ ), we say that our algorithm
output value converges to zero at least as fast as a geometric series, and say that the numerical method
is linearly convergent.

322
Theorem 322 (Convergence Properties of Backtracking Line Search Gradient Descent). Assume that
f is strong convex on S (Definition 255, Definition 1928), then ∃m, M > 0 s.t. m1 ≺ ∇2 f (x) ≺ M 1
for all x ∈ S. Define f˜ : R → R as f˜(t) = f (x − t∇f (x)). The search direction is ∆x = −∇f (x) by
definition. Assume that x − t∇f (x) ∈ S. We prove that the backtracking exit condition

f˜(t) ≤ f (x) − αtk∇f (x)k22 (1982)


2
1 1
is satisfied whenever 0 ≤ t ≤ M . It is easy to see that for t satisfying 0 ≤ t ≤ M, we have −t+ M2t ≤ − 2t .
Starting with the same equation as we did in Theorem 321, we have

M t2
f˜(t) ≤ f (x) − tk∇f (x)k22 + k∇f (x)k22 (1983)
2
t
≤ f (x) − k∇f (x)k22 (1984)
2
1
≤ f (x) − αtk∇f (x)k22 , α ∈ (0, ) by choice in Algorithm 2. (1985)
2
β
Then t either terminates with t = 1, or t > M , since the minimum value is given by failing the stopping
1 + β +
criterion at f = M and then updating to M . In either case we can write

βα
f (x+ ) ≤ f (x) − min{α, }k∇f (x)k22 . (1986)
M
Then
βα
f (x+ ) − p∗ ≤ f (x) − p∗ − min{α, }k∇f (x)k22 , (1987)
M
and since k∇f (x)k22 ≥ 2m(f (x) − p∗ ) (Equation 1937) then

βαm
f (x+ ) − p∗ ≤ (1 − min{2mα, 2 })(f (x) − p∗ ). (1988)
M
At iteration k we have
βαm
f (x(k) ) − p∗ ≤ ck (f (x(0) ) − p∗ ), c = (1 − min{2mα, 2 }) < 1, (1989)
M
and f (x(k) ) converges to p∗ at least as fast as geometric series with exponent depending on M
m, with at
least linear convergence.

Exercise 522. Consider the quadratic objective function given by f (x) = 21 (x"21 + γx#22 ), γ > 0. It is easy
2 2 2 1 0
to compute the Hessian, since δδxf2 = 1, δδxf2 = γ, δxδ1 δx
f
2
= 0, so ∇2 f (x) = , which is diagonal
1 2 0 γ
matrix with eigenvalues 1, γ. The condition numbers can be computed exactly in this scenario, with
M
M = max{1, γ}, m = min{1, γ} and m = max{γ, γ1 }.

It should be noted that in practical implementations, convergence can be slow, even for problems
that are moderately conditioned, with condition number in the hundreds. When the condition number
exceeds a thousand, the covergence is so slow that the gradient descent becomes useless.

4.9.3 Steepest Descent


Given the first order Taylor approximation of f (x + v) around x

f (x + v) ≈ f (x) + ∇f (x)0 v, (1990)

323
we say that ∇f (x)0 v is directional derivative of f at x towards v, and v is said to be descent direction if
∇f (x)0 v is negative. Let k · k be arbitrary norm on Rn , then the normalized steepest descent direction
is given by

∆xn = arg min{∇f (x)0 v : kvk = 1} = arg min{∇f (x)0 v : kvk ≤ 1}. (1991)
v v

∆xn gives the step of unit norm leading to largest decrease in the linear approximation of f , and is the
direction in the unit norm ball extending farthest in the direction of −∇f (x). We consider a unnormalized
steepest descent step given by ∆xs , defined

∆xs = k∇f (x)k∗ ∆xn . (1992)

Note that

∆f (x)0 ∆xs = k∇f (x)k∗ ∇f (x)0 ∆xn = −k∇f (x)k2∗ . (1993)

To see this we may write (see dual norm, Definition 229)

−k∇f (x)k∗ = − sup{∇f (x)0 v : kvk ≤ 1} = inf{∇f (x)0 v : kvk ≤ 1} = ∇f (x)0 ∆xn . (1994)

The steepest descent method is the instance of the descent algorithm (Algorithm 1) that sets ∆x = ∆xs .
Note that when we use exact line search, the use of ∆n , ∆s is irrelevant since they are scaled by optimal
t anyway.

Exercise 523. When the associated norm is Euclidean (k·k2 ), then ∆xs = −∇f (x) and steepest descent
is simply gradient descent.

Exercise 524. When the associated norm is quadratic, given by


1 1
kzkP = (z 0 P z) 2 = kP 2 zk2 , P  0. (1995)
1 1
Then (verify this) ∆xn = −(∇f (x)0 P −1 ∇f (x))− 2 P −1 ∇f (x), and kzk∗ = kP − 2 zk2 . We have
1
n 1
o
∆xs = k∇f (x)k∗ ∆xn = kP − 2 ∇f (x)k2 −(∇f (x)0 P −1 ∇f (x))− 2 P −1 ∇f (x) (1996)
= −P −1 ∇f (x). (1997)

Exercise 525. When the associated norm is `1 , then ∆xn is easy to formulate. Let i be the index s.t.
k∇f (x)k∞ = |∇f (x)i |. The dual norm of `1 is the `∞ norm (verify this). We have
δf (x)
∆xn = −sign( )ei , (1998)
δxi
where ei is the ith standard basis vector, and
δf (x)
∆xs = ∆xn k∇f (x)k∞ = − ei . (1999)
δxi
At each iteration, the component i of ∇f (x) with largest absolute value is chosen, and xi is increased
or decreased according to the sign of ∇f (x)i . The algorithm is also known to be a coordinate descent
algorithm.

Exercise 526. Consider the Frobenius norm minimization (see Exercise 467) problem given
n
X
2 d2i
minimize Mij , (2000)
i,j=1
d2j

324
and using change of variables xi = 2 log di , we can express this GP in equivalent convex form (see
Theorem 306) as
 
n
X
2
minimize f (x) = log  Mij exp(xi − xj ) (2001)
i,j=1
X X X
2 2 2 2
= log(Mkk + Mij exp(xi − xj ) + exp(−xk ) Mik exp(xi ) + exp(xk ) Mkj exp(−xj )).
i,j6=k i6=k j6=k

Then
f (x) = log(αk + βk exp(−xk ) + γk exp(xk ))
and
X X X
2 2 2 2
αk = Mkk + Mij exp(xi − xj ), βk = Mik exp(xi ), γk = Mkj exp(−xj ). (2002)
i,j6=k i6=k j6=k

See
δf −βk exp(−xk ) + γk exp(xk ) !
= = 0, (2003)
δxk αk + βk exp(−xk ) + γk exp(xk )

so exp(−xk )(−βk + γk exp(2xk )) and xk = 1


2 log( βγkk ). Then the `1 steepest descent algorithm with exact
line search iterates over the steps consisting of finding
−βi exp(−xi ) + γi exp(xi )
max |∇f (x)i | = , i ∈ [1, n] (2004)
αi + βi exp(−xi ) + γi exp(xi )

and minimizing f over xk by setting xk = 1


2 log( βγkk ).

Result 26. For arbitrary norm k · k, dual norm k · k∗ , ∃γ, γ̄ ∈ (0, 1] s.t.

kxk ≥ γkxk2 , kxk∗ ≥ γ̄kxk2 . (2005)

Theorem 323 (Convergence Properties of Backtracking Line Search Steepest Descent). Assume that f
is strong convex (Definition 255) so ∃m, M s.t. m1 4 ∇2 f (x) 4 M 1. Recall Equation 1950
M
∀x, y ∈ S, p∗ ≤ f (y) ≤ f (x) + ∇f (x)0 (y − x) + ky − xk22 . (2006)
2
For y = x + t∆xs we have
M k∆xs k22 2
f (x + t∆xs ) ≤ f (x) + t∇f (x)0 ∆xs + t (2007)
2
M k∆xs k2 2
≤ f (x) + t∇f (x)0 ∆xs + t , Result 26 (2008)
2γ 2
M
= f (x) − tk∇f (x)k2∗ + 2 t2 k∇f (x)k2∗ . (2009)

The equality follow from Equation 1993, and k∆xs k2 = ∆x0n ∆xn k∇f (x)k2∗ = k∇f (x)k2∗ . Minimizing the
RHS, obtain
2M t
−k∇f (x)k2∗ + k∇f (x)k2∗ , (2010)
2γ 2
γ2
so setting the step size t̂ = M, using α < 1
2 and ∇f (x)0 ∆xs = −k∇f (x)k2∗ , see that the exit condition is
satisfied
γ2 αγ 2
f (x + t̂∆xs ) ≤ f (x) − k∇f (x)k2∗ ≤ f (x) + ∇f (x)0 ∆xs . (2011)
2M M

325
2
The line search returns t ≥ min{1, βγ
M } and so

βγ 2
f (x+ ) = f (x + t∆xs ) ≤ f (x) − α min{1, }k∇f (x)k2∗ (2012)
M
βγ 2
≤ f (x) − αγ̄ 2 min{1, }k∇f (x)k22 , Result 26. (2013)
M
It follows that (using Equation 1937)

βγ 2
f (x+ ) − p∗ ≤ c(f (x) − p∗ ), c = 1 − 2mαγ̄ 2 min{1, } < 1, (2014)
M
so

f (x(k) ) − p∗ ≤ ck (f (x(0) ) − p∗ ), (2015)

and the algorithm is proved linearly convergent.

4.9.4 Newton’s Method


Definition 258 (Newton step). For x ∈ dom f , the vector

∆xw = −∇2 f (x)−1 ∇f (x) (2016)

is said to be Newton step for f at x.

We know that the Hessian matrix ∇2 f (x) < 0, so

∇f (x)0 ∆xw = −∇f (x)0 ∇2 f (x)−1 ∇f (x) ≤ 0. (2017)

The Newton step ∆xw is a descent direction. Consider the second order Taylor approximation of f at x
given by
1
f (x + v) ≈ f (x) + ∇f (x)0 v + v 0 ∇2 f (x)v, (2018)
2
and the minimizer of the RHS w.r.t v satisfies ∇f (x) + ∇2 f (x)v = 0, so v = ∆xw = −∇2 f (x)−1 ∇f (x).
∆xw minimizes the second order approximation of f at x. If f was quadratic, then x + ∆xw is the exact
minimizer of f , and if f is near quadratic, then x + ∆xw is good estimate of minimizer of f . This is very
accurate when x is close to x∗ .
The Newton step can be treated as the steepest descent direction at x w.r.t to the quadratic norm
(Equation 1995) induced by the Hessian ∇2 f (x) (see Equation 1997). Consider the first order Taylor
approximation of ∇f at x given by

∇f (x + v) ≈ ∇f (x) + ∇2 f (x)v. (2019)

The optimality condition is for this to be zero, and the solution to this is v = ∆xw . The Newton step
is what must be added to x so that the linearized, approximate optimality conditions hold. When x is
near x∗ , the update x + ∆xw is very good approximation of x∗ .
The Newton step ∆xw is said to be affine invariant; it is independent of linear changes in coordinates.
Let T ∈ Rn×n be invertible, and define f¯(y) = f (T y). Then

∇f¯(y) = T 0 ∇f (T y) = T 0 ∇f (x), ∇2 f¯(y) = T 0 ∇2 f (T y)T = T 0 ∇2 f (x)T, (2020)

326
under change of coordinates x = T y. The Newton step for f¯ at y is given by

∆yw = −(T 0 ∇2 f (x)T )−1 (T 0 ∇f (x)) (2021)


= −T −1 ∇2 f (x)−1 ∇f (x) (2022)
−1
= T ∆xw . (2023)

See that

x + ∆xw = T (y + ∆yw ). (2024)

Definition 259 (Newton decrement). For x ∈ dom f , the scalar


 21
λ(x) = ∇f (x)0 ∇2 f (x)−1 ∇f (x) (2025)

is said to be Newton decrement at x.

Let fˆ(y) be the second-order approximation of f at x, then


1
fˆ(y) = f (x) + ∇f (x)0 (y − x) + (y − x)0 ∇2 f (x)(y − x) (2026)
2
and we have

f (x) − inf fˆ(y) = f (x) − fˆ(x + ∆xw ) (2027)


y
1
= −∇f (x)0 ∆xw − ∆x0w ∇2 f (x)∆xw (2028)
2
1
= −∇f (x)0 (−∇2 f (x)−1 ∇f (x)) − (−∇2 f (x)−1 ∇f (x))0 ∇2 f (x)(−∇2 f (x)−1 ∇f (x))
2
1 0 2 −1
= ∇f (x) ∇ f (x) ∇f (x) (2029)
2
1
= λ(x)2 . (2030)
2
λ2
Then 2 is an estimate of f (x) − p∗ based on the quadratic approximation. Also, the Newton decrement
 12
λ(x) = ∇f (x)0 ∇2 f (x)−1 ∇f (x) (2031)
1
= ((−∇2 f (x)−1 ∇f (x))0 ∇2 f (x)(−∇2 f (x)−1 ∇f (x))) 2 (2032)
= k∆xw k∇2 f (x) . (2033)

This also comes up in backtracking line search (see Equation 1993), and may be interpreted as the
directional derivative of f at x in the direction of the Newton step
 
0 2 δ
∇f (x) ∆xw = −λ(x) = f (x + ∆xw t) . (2034)
δt t=0

Like the Newton step, the Newton decrement is affine invaraint. The Newton decrement of f¯(y) = f (T y)
at y for invertible T is the same as the Newton decrement of f at x = T y.

Theorem 324 (Convergence Properties of Backtracking Newton’s Method). Assume that f is twice
continously differentiable and is strong convex on S (Definition 255, 1928), so ∃m, M > 0 s.t.

∀x ∈ S, m1 4 ∆2 f (x) 4 M 1. (2035)

327
Algorithm 3 Newton’s method

x := x0 ∈ domf,  > 0.
while TRUE do
compute ∆xw := −∇2 f (x)−1 ∇f (x)
compute λ2 := ∇f (x)0 ∇2 f (x)−1 ∇f (x).
λ2
if 2 ≤ , break.
t = BACKT RACK() (Algorithm 2)
x := x + t∆xw .
end while

Further assume that ∇2 f satisfies

∀x, y ∈ S, k∇2 f (x) − ∇2 f (y)k2 ≤ Lkx − yk2 . (2036)

We say tht f is Lipschitz continuous on S with constant L. L can be interpreted as the bound on the
third derivative of f , and is a measure of how well the quadratic model approximates f . We can then
expect the smaller the Lipschitz constant L, we may provide better guarantees on the performance of the
m2
Newton’s method. We show that ∃0 < η ≤ L ,γ > 0 satisfying

k∇f (x(k) )k2 ≥ η =⇒ f (x(k+1) ) − f (x(k) ) ≤ −γ, (2037)


 2
L L
k∇f (x(k) )k2 < η =⇒ t(k) = 1, k∇f (x(k+1) )k2 ≤ k∇f (x(k) )k2 . (2038)
2m2 2m2

Assume inductively that the k∇f (k) k2 < η holds at iteration k, then (verify this) k∇f (x(k+1) )k2 < η and
all iterations k + i, i = 0, 1, · · · satisfy the second condition. For all iteration number l ≥ k, the algorithm
is said to take a full Newton step, t = 1, and this portion of the algorithm is said to pure Newton method.
Furthermore, see that at iteration l ≥ k, we have
 2l−k
L L 1 l−k
k∇f (x(l+1) )k2 ≤ k∇f (x(k) )k2 ≤ ( )2 . (2039)
2m2 2m2 2

Then letting x → x(l) in Equation 1937 we get


 2l−k+1
(l) 1 ∗ 2m3 1
f (x ) − p ≤ k∇f (x(l) )k22 ≤ 2 . (2040)
2m L 2

The convergence is extremely fast once the second condition is satisfied, and it is said to be quadratic
convergent. The second condition, as we called the pure Newton phase, is also said to be the quadraticlly
convergent stage. The first stage is referred to as the damped Newton phase. The entire algorithm is
often referred to as the damped/guarded Newton’s method. In the damped Newton phase, f decreases by
at least γ in each iteration, so the number of iterations inside the damped Newton phase is at most

f (x(0) ) − p∗
. (2041)
γ
Since Equation 2040 asserts
 2l−k+1
2m3 1
f (x(l) ) − p∗ ≤ , (2042)
L2 2

328
then for some defined additive error tolerance ,
 2l−k+1
L2 1
 3 = , (2043)
2m 2
L2 l−k+1
 3 = 2−2 , (2044)
2m
L2
− log2 { 3 } = 2l−k+1 , (2045)
2m
2m3
log2 { 2 } = 2l−k+1 , (2046)
L
0 2m3
log2 log2 { } = l − k + 1, 0 = . (2047)
 L2
So the total number of iterations up to additive error tolerance  is given by

f (x(0) ) − p∗ 0
+ log2 log2 ( ). (2048)
γ 
In the quadratically convergent phase, the number of iterations grow extremely slowly in the desired
accuracy . Now we show the inequality in the damped Newton phase

k∇f (x(k) )k2 ≥ η =⇒ f (x(k+1) ) − f (x(k) ) ≤ −γ. (2049)

Suppose k∇f (x)k2 ≥ η, and using Equation 1950 with y → x + t∆xw , we have

M
f (x + t∆xw ) ≤ f (x) + ∇f (x)0 (t∆xw ) + kt∆xw k22 (2050)
2
M k∆xw k2 2
= f (x) + t∇f (x)0 ∆xw + t (2051)
2
M 2
≤ f (x) − tλ(x)2 + t λ(x)2 . (2052)
2m
The last inequality follows from Equation 2034 for the second term and

λ(x)2 = ∆x0w ∇2 f (x)∆xw ≥ mk∆xw k22 , (2053)

for the third term, with equality relation coming from Equation 2033 and inequality from the strong
m
convex assumptions (Definition 255). Let the step size t̂ = M, and the exit condition of the line search
(Algorithm 2) is satisfied since
m 1
f (x + t̂∆xw ) ≤ f (x) − λ(x)2 ≤ f (x) − αt̂λ(x)2 , α ∈ (0, ). (2054)
2M 2
βm
So the line search returns step size t ≥ M , and so

f (x+ ) − f (x) ≤ −αtλ(x)2 (2055)


m
≤ −αβ λ(x)2 (2056)
M
m
≤ −αβ 2 k∇f (x)k22 (2057)
M
m
≤ −αβη 2 2 . (2058)
M
We used the fact that
1
λ(x)2 = ∇f (x)0 ∇2 f (x)−1 ∇f (x) ≥ k∇f (x)k22 . (2059)
M

329
m
So γ = αβη 2 M 2 satisfies Equation 2037.

Now the inequality correspoding to the condition for the pure Newton phase, Equation 2038 is discussed.
Assume k∇f (x)k2 < η. By Lipschitz continuity assumptions (Equation 2036), for t ≥ 0, we have

k∇2 f (x + t∆xw ) − ∇2 f (x)k2 ≤ tLk∆xw k2 . (2060)

Multiply both sides by ∆x0w ∆xw , and we get

|∆x0w (∇2 f (x + t∆xw ) − ∇2 f (x))∆xw | ≤ tLk∆xw k32 . (2061)

Now define f˜(t) = f (x + t∆xw ), and so

f˜(2) (t) = ∆x0w ∇2 f (x + t∆xw )∆xw . (2062)

So the we can rewrite the LHS of the last inequality as |f˜(2) (t) − f˜(2) (0)|, so
L
f (2) (t) ≤ f (2) (0) + tLk∆xw k32 ≤ λ(x)2 + t 3 λ(x)3 , (2063)
m2
where f˜(2) (0) = λ(x)2 by Equation 2062 and λ(x)2 ≥ mk∆xw k22 by Equation 2053. Integrate this
inequality to obtain
L
f˜(1) (t) ≤ f˜(1) (0) + tλ(x)2 + t2 3 λ(x)
3
(2064)
2m 2
L
= −λ(x)2 + tλ(x)2 + t2 3
3 λ(x) , (2065)
2m 2
where f˜(1) (t) = ∆x0w ∇f (x + t∆xw ), so f˜(1) (0) = ∆x0w ∇f (x) = −λ(x)2 (Equation 2034). Integrate again
to get
1 L
f˜(t) ≤ f˜(0) − tλ(x)2 + t2 λ(x)2 + t3 3
3 λ(x) , (2066)
2 6m 2
and set t → 1 to get
1 L
f (x + ∆xw ) ≤ f (x) − λ(x)2 + 3
3 λ(x) . (2067)
2 6m 2

2
Provided η ≤ 3(1 − 2α) mL , we show that the backtracking line search selects unit steps. Suppose
2
k∇f (x)k2 ≤ η ≤ 3(1 − 2α) mL , using the strong convexity to see (like in Equation 2059)
1
λ(x)2 ≤ k∇f (x)k22 , (2068)
m
we have

m0.5 λ(x) ≤ k∇f (x)k2 , (2069)

so
3
λ(x) ≤ 3(1 − 2α)m 2 /L. (2070)

Using Equation 2067 we now have


 
2 1 Lλ(x)
f (x + ∆xw ) ≤ f (x) − λ(x) − 3 (2071)
2 6m 2
≤ f (x) − αλ(x)2 (2072)
0
= f (x) + α∇f (x) ∆xw , (2073)

330
so the exit condition is satisfied and unit step t = 1 is taken. By Lipschitz condition, we have (verify
this)

k∇f (x+ )k2 = k∇f (x + ∆xw ) − ∇f (x) − ∇2 f (x)∆xw k2 (2074)


Z 1
= k (∇2 f (x + t∆xw ) − ∇2 f (x))∆xw dtk2 (2075)
0
L
≤ k∆xw k22 (2076)
2
L 2
= k∇ f (x)−1 ∇f (x)k22 (2077)
2
L
≤ k∇f (x)k22 , (2078)
2m2
2
so inequality in Equation 2038 holds for η = min{1, 3(1 − 2α)} mL .

4.9.5 Self-Concordance
Despite the conceptually useful analysis of the Newton’s method in Theorem 324, in practice, the con-
stants m, M, L are in most cases not known. Secondly, unlike the affine invariance of the Newton’s
method, the analysis of the Newton’s method is not; the constants m, M, L are affected by change in
coordinates. We want to come up with an analysis that does not depend on these coordinate systems.
We then want a different set of assumptions from strong convexity (m1 4 ∇2 f (x) 4 M 1) and Lipschitz
conditions (k∇2 f (x) − ∇2 f (y)k2 ≤ Lkx − yk2 ).
This new condition is self concordance.

Definition 260 (Self-concordance). A convex function f : R → R is said to be self-concordant if


3
∀x ∈ dom f, |f (3) (x)| ≤ 2f (2) (x) 2 . (2079)

A function f : Rn → R is said to be self-concordant if it is self-concordant along every line in its domain,


that is f˜(t) = f (x + tv) is self-concordant function of t for all x ∈ dom f , for all v.

Theorem 325 (Affine-invariance of self-concordant functions). Self-concordance (Definition 260) is


affine-invariant - if we apply linear transformation to any of the input variables of self-concordant func-
tions, the result is self-concordant.

Proof. Define f˜(y) = f (ay + b), a 6= 0 and we show f˜ self-concordant iff f self-concordant. It should be
easy to see

f˜(2) (y) = a2 f (2) (x), f˜(3) (y) = a3 f (3) (x), x = ay + b. (2080)

Then
3 3
|a3 f (3) (x)| ≤ 2(a2 f (2) (x)) 2 = 2a3 f (2) (x) 2 (2081)

and the result follows.

It is easy to verify using the Definition 260 that the function f (x) = − log x, f (x) = x log x − log x is
self-concordant.
We study the preservation of self-concordant functions. Self concordance is preserved under scaling
by factor a ≥ 1. Self concordance is preserved by pointwise sum f1 + f2 , if f1 , f2 is self-concordant. To
see this, using
3 3 2
(u 2 + v 2 ) 3 ≤ u + v (2082)

331
for u, v ≥ 0 and triangle inequality (Theorem 155) we can write
(3) (3) (3) (3)
|f1 (x) + f2 (x)| ≤ |f1 (x)| + |f2 (x)| (2083)
(2) 3 (2) 3
≤ 2(f1 (x) 2 + f2 (x) 2 ) (2084)
(2) (2) 3
≤ 2(f1 (x) + f2 (x)) 2 . (2085)

If f is self-concordant, then f (Ax + b) is self-concordant.

Exercise 527. Consider the log barrier for linear inequalities (Exercise 520)
m
X
f (x) = − log(bi − a0i x), dom f {x : a0i x < bi , i ∈ [1, m]}. (2086)
i

This is sum of terms, where each term is composition of negative log and affine transformation y =
bi − a0i xi and is therefore self-concordant.

Exercise 528. Consider the negative log-determinant function, recall (Exercise 416) we may write
n
X
f˜(t) = f (X + tV ) = − log det X − log(1 + tλi ), X  0, V ∈ S n (2087)
i

1 1
where λi are the eigenvalues of X − 2 V X − 2 . Each term − log(1 + tλi ) is self-concordant, and hence f˜, f
are self-concordant.

Exercise 529. Consider the negative log concave quadratic function given by

f (x) = − log(x0 P x + q 0 x + r), n


P ∈ −S+ . (2088)

f is self-concordant on dom f = {x : x0 P x + q 0 x + r > 0}. Without loss of generality, we show the case
when n = 1, then

f (x) = − log(px2 + qx + r) = − log(−p(x − a)(b − x)), (2089)

where a, b are the roots of the quadratic px2 + qx + r. Then

f (x) = − log(−p) − log(x − a) − log(b − x) (2090)

and therefore f is self-concordant.

Result 27. Let g : R → R be convex function, dom g = R++ , and

g (2) (x)
∀x, |g (3) (x)| ≤ 3 . (2091)
x
Then

f (x) = − log(−g(x)) − log x (2092)

is self-concordant on {x : x > 0, g(x) < 0}. Note that this condition on g holds on any quadratic functions
of the form ax2 + bx + c, where a ≥ 0. Then the function g(x) + ax2 + bx + c also satisfies the same
condition on g for a ≥ 0. Some examples of g that satisfies Equation 2091 are enumerated

1. g(x) = −xp for 0 < p ≤ 1,

2. g(x) = − log x,

332
3. g(x) = x log x,

4. g(x) = xp for −1 ≤ p ≤ 0,
(ax+b)2
5. g(x) = x .

Result 28. Some examples of self-concordant functions are listed

1. f (x, y) = − log(y 2 − x0 x) on {(x, y) : kxk2 < y},


2
2. f (x, y) = −2 log y − log(y p − x2 ) with p ≥ 1 on {(x, y) ∈ R2 : |x|p < y},

3. f (x, y) = − log y − log(log y − x) on {(x, y) : exp(x) < y}.

Recall the Newton decerement. Some of the important and equivalent expressions are repeated here
1
λ(x) = ∇f (x)0 ∇2 f (x)−1 ∇f (x) 2 Definition 259, (2093)
1
λ(x) = k∆xw k∇2 f (x) = (∆x0w ∇2 f (x)∆w ) 2 Equation 2033, (2094)
 
δ
∇f (x)0 ∆xw = −λ(x)2 = f (x + ∆xw t) Equation 2034. (2095)
δt t=0

We also note that this may be expressed (verify this)

−v 0 ∇f (x) −v 0 ∇f (x)
λ(x) = sup 1 ≥ 1 for any v 6= 0. (2096)
v6=0 (v 0 ∇2 f (x)v) 2 (v 0 ∇2 f (x)v) 2

See that equality holds for v = ∆xw , since

−∆x0w ∇f (x) λ(x)2


1 = = λ(x). (2097)
(∆x0w ∇2 f (x)∆xw ) 2 λ(x)

For f : R → R defined to be strictly convex self-concordant (Definition 260), we are guaranteed that

δ  (2) − 1  1 3
∀t ∈ dom f, f (t) 2 = − f (2) (t)− 2 f (3) (t) ≤ 1. (2098)
δt 2

Assuming t ≥ 0, and that [0, t] ⊆ dom f , we may integrate to get


Z t
δ  (2) − 1 
−t ≤ f (τ ) 2 dτ ≤ t, (2099)
0 δτ
so
1 1
−t ≤ f (2) (t)− 2 − f (2) (0)− 2 ≤ t. (2100)

Then
1 1
f (2) (0)− 2 − t ≤ f (2) (t)− 2 , (2101)
h i
− 12 1 1
f (2)
(0) 1 − tf (2) (0) 2 ≤ f (2) (t)− 2 , (2102)
f (2) (0)
f (2) (t) ≤ 1 . (2103)
(1 − tf (2) (0) 2 )2

We can repeat the argument on the other side of the inequality to obtain bounds

f (2) (0) f (2) (0)


1 ≤ f (2) (t) ≤ 1 . (2104)
(1 + tf (2) (0) 2 )2 (1 − tf (2) (0) 2 )2

333
1
The lower bound is valid for all t ≥ 0 in dom f , while the upper bound is valid for all 0 ≤ t < f (2) (0)− 2
in dom f .
As before, assume that f : Rn → R is strictly convex self-concordant, and v is a descent direction s.t.
v ∇f (x) < 0. Define f˜ : R → R as f˜(t) = f (x + tv), which is self-concordant by definition (see Definition
0

260). We may integrate the lower bound in Equation 2104 to obtain


1
1 f˜(2) (0) 2
f˜(1) (t) ≥ f˜(1) (0) + f˜(2) (0) 2 − 1 . (2105)
1 + tf˜(2) (0) 2
Integrate once more to get
1 1
f˜(t) ≥ f˜(0) + tf˜(1) (0) + tf˜(2) (0) 2 − log(1 + tf˜(2) (0) 2 ). (2106)

The minimizer in t for the RHS of the above inequality is computed by taking the derivative w.r.t t:
1
1 f˜(2) (0) 2 !
f˜(1) (0) + f˜(2) (0) 2 − 1 = 0, (2107)
1 + tf˜(2) (0) 2
1 1 1
(f˜(1) (0) + f˜(2) (0) 2 )(1 + tf˜(2) (0) 2 ) = f˜(2) (0) 2 , (2108)
1 1 1
f ˜(1) ˜(1) ˜(2)
(0) + tf (0)f (0) + f (0) + tf (0)2 ˜(2) 2 ˜(2) = f ˜(2) (0) ,
2 (2109)
 1
 1 1
f˜(1) (0) + t f˜(1) (0)f˜(2) (0) 2 + f˜(1) (0) + f˜(2) (0) 2 = f˜(2) (0) . 2 (2110)

−f˜(1) (0)
So the minimizer is t̄ = 1 , and so we may write
f (0)+f˜(2) (0) 2 +f˜(1) (0)
˜(2)

1 1
inf f˜(t) ≥ f˜(0) + t̄f˜(1) (0) + t̄f˜(2) (0) 2 − log(1 + t̄f˜(2) (0) 2 ) (2111)
t≥0
1 1
= f˜(0) − f˜(1) (0)f˜(2) (0)− 2 + log(1 + f˜(1) (0)f˜(2) (0)− 2 ). (2112)

Since f˜(t) = f (x + tv), see that f˜(1) (0) = v 0 ∇f (x) and that f˜(2) (0) = v 0 ∇f (x)v. Then we may write
Equation 2096 as
1
λ(x) ≥ −f˜(1) (0)f˜(2) (0)− 2 . (2113)

Furthermore, u + log(1 − u) is monotonically decreasing in u, so we have

inf f˜(t) ≥ f˜(0) + λ(x) + log(1 − λ(x)). (2114)


t≥0

Since this holds for any descent direction v, in particular we have

p∗ ≥ f (x) + λ(x) + log(1 − λ(x)), λ(x) < 1. (2115)


λ2
For small values of λ, −(λ + log(1 − λ)) ≈ 2 and for λ ≤ 0.68 we have −(λ + log(1 − λ)) ≤ λ2 . Then

p∗ (x) ≥ f (x) − λ(x)2 , λ ≤ 0.68. (2116)


λ(x)2
Recall that 2 is an estimate for f (x)−p∗ . In the case of convex self-concordant functions, the stopping
2
criterion λ(x) ≤  guarantees the output is additive -suboptimal.

Theorem 326 (Convergence Properties of Backtracking Newton’s Method for Strictly Convex Self–
Concordant Function). Assume f is strictly convex self-concordant. Assume that x(0) is known and that
S = {x : f (x) ≤ f (x(0) )} is closed. Further assume f is bounded below, so that ∃x∗ minimizer of f .
1
We show there ∃0 < η ≤ 4 and γ > 0 depending only on α, β parameters of the backtracking line search
algorithm (Algorithm 2) s.t. the following hold:

334
1. if λ(x(k) ) > η, then

f (x(k+1) ) − f (x(k) ) ≤ −γ, (2117)

2. and if λ(x(k) ) ≤ η, then backtracking line search outputs t = 1 with


 2
2λ(x(k+1) ) ≤ 2λ(x(k) ) . (2118)

Once the second condition holds, then it holds for all future iterations and so ∀l ≥ k, λ(x(l) ) ≤ η and

2l−k  2l−k
(l)

(k) 2l−k 1
2λ(x ) ≤ 2λ(x ) ≤ (2η) ≤ . (2119)
2

By Equation 2116 we have


 2l−k+1  2l−k+1
1 1 1
f (x(l) ) − p∗ ≤ λ(x(l) )2 ≤ ≤ . (2120)
4 2 2

It follows that f (x(l) ) − p∗ ≤  if l − k ≥ log2 log2 ( 1 ). The first condition asserts that the damped Newton
f (x(0) −p∗ )
phase has at most γ steps, so the total number of iterations required to obtain  additive error is
upper bounded by

f (x(0) ) − p∗ 1
+ log2 log2 ( ). (2121)
γ 

Note that for f˜(t) = f (x + t∆xw ), f˜(1) (0) = ∆xw ∇f (x) = −λ(x)2 and f˜(2) (0) = ∆xw ∇2 f (x)∆xw =
λ(x)2 (see Equation 2034 and Equation 2033). Integrating the upper bound in Equation 2104 twice, we
obtain
1
 1

f˜(t) ≤ f˜(0) + tf˜(1) (0) = tf˜(2) (0) 2 − log 1 − tf˜(2) (0) 2 (2122)
= f˜(0) + tλ(x) − tλ(x) − log(1 − tλ(x)).
2
(2123)

1
We first consider the damped Newton phase. First see that the point t̂ = 1+λ(x) always satisfies the exit
condition in the backtracking line search (Algorithm 2), since

f˜(t̂) ≤ f˜(0) − t̂λ(x)2 − t̂λ(x) − log(1 − t̂λ(x)) (2124)


= f˜(0) − λ(x) + log(1 + λ(x)) (2125)
λ(x)2
≤ f˜(0) − α (2126)
1 + λ(x)
˜
= f (0) − αλ(x)2 t̂, (2127)

where we use the result


x2
−x + log(1 + x) + ≤0 (2128)
2(1 + x)
β
and α ∈ (0, 21 ). It follows that the backtracking line search always results in step size t ≥ 1+λ(x) . Then

λ(x)2
f˜(t) − f˜(0) ≤ −αβ (2129)
1 + λ(x)

335
2
η
and Equation 2117 holds with γ = αβ 1+η . In the pure/quadratically convergent Newton phase, we show
1−2α 1−2α
that η = 4 satisfies. Then λ(x(k) ) ≤ η ≤ 4 and Equation 2118 holds. Using Equation 2123 and
1−2α
λ(x) ≤ 2 , we have

f˜(1) ≤ f˜(0) + λ(x)2 − λ(x) − log(1 − λ(x)) (2130)


1
≤ f˜(0) − λ(x)2 + λ(x)3 , (2131)
2
≤ f˜(0) − αλ(x)2 , (2132)

where we use the result


1 2
−x − log(1 − x) ≤ x + x3 , 0 ≤ x ≤ 0.81. (2133)
2
Furthermore, using the result (verify this)
λ(x)2
λ(x) < 1, x+ = x − ∇2 f (x)−1 ∇f (x) =⇒ λ(x+ ) ≤ , (2134)
(1 − λ(x))2
1
it follows that λ(x+ ) ≤ 2λ(x)2 for λ(x) ≤ 4 and Equation 2118 holds when λ(x(k) ) ≤ η.

4.10 Equality Constrained Minimization


A convex optimization problem subject to equality constraints are solved. The problem takes form

minimize f (x), (2135)


subject to Ax = b. (2136)

We assume that f is convex and twice continuously differentiable. Furthermore, A ∈ Rp×n is assumed
to be rank A = p < n, such that A is full row rank (constraints are independent) and that there are
fewer equality constraints then there are variables. We assume the optimal solution x∗ exists, and denote
p∗ = inf{Ax = b} = f (x∗ ). Recall (see Theorem 302) that x∗ ∈ dom f is optimal iff ∃ν ∗ ∈ Rp s.t.

Ax∗ = b, ∇f (x∗ ) + A0 ν ∗ = 0. (2137)

The first equation is said to be the primal feasibility equations, and the second is said to be the dual
feasibility equations. We say the equality constrained optimization problem is equivalent to solving the
KKT conditions (Equation 1604), a set of n + p equations in n + p variables. Recall also that equality
constrained minimization problems may be reduced to solving an equivalent, unconstrained minimization
problem (see Exercise 451), although important properties of the problem structure such as sparisity may
be destroyed. Another approach is to solve the dual optimal problem (see Equation 1530) for which the
Slater’s condition holds (Definition 309).

4.10.1 Equality Constrained Convex Quadratic Problems


Suppose we have the problem
1 0
minimize f (x) = x P x + q 0 x + r, n
P ∈ S+ (2138)
2
subject to Ax = b, A ∈ Rp×n . (2139)

The optimality conditions are given by the KKT equations (Equation 1604)

Ax∗ = b, P x∗ + q + A0 v ∗ = 0. (2140)

336
We may write this as
" #" # " #
P A0 x∗ −q
= . (2141)
A 0 ν∗ b

This system of equations is said to be the KKT system, and the associated coefficient matrix is said to
be KKT matrix. When the KKT matrix is invertible, then there is a unique, optimal primal-dual pair
(x∗ , ν ∗ ). If the KKT matrix is singular and solvable, then any solution yields the optimal primal-dual
pair. If the system is not solvable, then the quadratic optimization problem is either not feasible or
unbounded below.

Result 29. For KKT matrix given


" #
P A0
K= (2142)
A 0

with P < 0, rank A = p < n, the following statements are equivalent:

1. K is invertible,

2. N (P ) ∩ N (A) = {0},

3. Ax = 0, x 6= 0 =⇒ x0 P x > 0, P is positive semidefinite on the nullspace of A,

4. F 0 P F  0, where F ∈ Rn×(n−p) is matrix for which R(F ) = N (A).

4.10.2 Elimination of Equality Constraints


Eliminating the equality constraints (see Exercise 451) gives us equivalent problems. In particular, we
may specify the matrix F ∈ Rn×(n−p) s.t. R(F ) = N (A), and the vector x̂ ∈ Rn satisfying Ax̂ = b
parameterizes the affine feasible set given by

{x : Ax = b} = {F z + x̂ : z ∈ Rn−p }. (2143)

Then we can solve the unconstrained minimization problem

minimize fˆ(z) = f (F z + x̂). (2144)

The optimality conditions satisfy Equation 2137, and we can see that the optimal dual variable ν ∗ defined

ν ∗ = −(AA0 )−1 A∇f (x∗ ) (2145)

satisfies. The dual feasibility equation condition is given by

∇f (x∗ ) + A0 (−(AA0 )−1 A∇f (x∗ )) = 0. (2146)

For f˜(z) = f (F z + x̂), we have ∇f˜(z ∗ ) = F 0 ∇f (x∗ ) = 0, and AF = 0 since R(F ) = N (A). Then
" #
F0
∇f (x∗ ) − A0 (AA0 )−1 A∇f (x∗ ) = 0.

(2147)
A
" #
F0
The matrix is invertible and hence injective (Definition 109), so Equation 2146 holds.
A

337
Exercise 530. Consider the problem given by
n
X
minimize fi (xi ), (2148)
i
n
X
subject to xi = b. (2149)
i

Here fi : R → R are assumed to be convex and twice differentiable. The degrees of freedom for x is n − 1,
so we can eliminate xn (or any other xi ) by setting
n−1
X
xn = b − xi , (2150)
i=1

1
" #
and this corresponds to the choice of x̂ = ben , F = ∈ Rn×(n−1) , since
−10
 
    x1
x1 0  
    x2 
F  · · ·  + · · · =   = x. (2151)

 ··· 
xn−1 b
 
−x1 − x2 − · · · + b

Then the equivalent problem is


n−1
X n−1
X
minimize fn (b − xi ) + fi (xi ). (2152)
i=1 i=1

4.10.3 Solving Equality Constrained Problem via the Dual Problem


See Equation 1515. The dual function to the problem in Equation 2136 is given by

g(ν) = −b0 ν − f ∗ (−A0 ν). (2153)

We solve for the dual problem

maximize − b0 ν − f ∗ (−A0 ν). (2154)

Slater conditions hold (Theorem 309), the optimal point is assumed to exist, so the problem is strictly
feasible and strong duality holds. ∃ν ∗ s.t. g(ν ∗ ) = p∗ . If the dual function is twice differentiable, then
clearly we may used the unconstrained minimization techniques.

Exercise 531. Consider the problem in Exercise 520 given by


n
X
minimize f (x) = − log xi , (2155)
i=1
subject to Ax = b. (2156)

Then (see Exercise 433.2) we have


n
X n
X
f ∗ (y) = (−1 − log(−yi )) = −n − log(−yi ), dom f ∗ = −Rn++ . (2157)
i=1 i=1

We may solve for the dual problem


n
X
0
maximize g(ν) = −b ν + n + log(A0 ν)i , (2158)
i=1

338
with optimality condition written
h i
∇f (x) + A0 ν = − 1
x1 ··· 1
xn
+ A0 ν = 0. (2159)

1
Then xi (ν) = A0 ν i .

4.10.4 Feasible point Newton’s method with equality constraints


The Newton’s method with equality constraints adopts the general form as in the unconstrained Newton’s
method (Algorithm 3), with the obvious difference now that we constrain the Newton step ∆xw to be in
a feasible direction. We begin with the assumption that the initial point x(0) is feasible, so x(0) ∈ dom f
and Ax(0) = b. To say a direction is feasible, we meant that A∆xw = 0. Then given Ax = b, we have
A(x + t∆xw ) = Ax + tA∆xw = b.
We want to replace the problem in Equation 2136 with the second order Taylor approximation near
x. We formulate the problem
1
minimize fˆ(x + v) = f (x) + ∇f (x)0 v + v 0 ∇2 f (x)v (2160)
2
subject to A(x + v) = b. (2161)

This is convex quadratic minimization problem of the type in Section 4.10.1. The Newton step ∆xw at
x is defined the be the solution of this convex quadratic problem, where we assume the associated KKT
matrix is invertible (see Equation 2141). The associated KKT system is then
" #" # " #
∇2 f (x) A0 ∆xw −∇f (x)
= , (2162)
A 0 w 0

where w is the associated dual optimal dual variable. When f is quadratic, then the Newton update
x + ∆xw gives the exact solution, and in other cases where f is near quadratic, the Newton update is
good approximate of x∗ and w is good approximate of ν ∗ .
We may interpret ∆xw , w as the solution of linearized approximations to the linearity conditions
given by

Ax∗ = b, ∇f (x∗ ) + A0 ν ∗ = 0. (2163)

Using x∗ → x + ∆xw , ν ∗ → w we obtain

A(x + ∆xw ) = b, ∇f (x + ∆xw ) + A0 w ≈ ∇f (x) + ∇2 f (x)∆xw + A0 w = 0. (2164)

Since Ax = b, the conditions are

A∆xw = 0, ∇2 f (x)∆xw + A0 w = −∇f (x), (2165)

which is essentially the aforementioned KKT system. For equality constrained Newton problem, we
define the Newton decrement in the same way (Definition 259):
 21
λ(x) = ∆x0w ∇2 f (x)∆xw . (2166)

The related forms in Equation 2034, 2033, 2096 hold as before. Let the second order approximation of
f at x be given by fˆ, so
1
fˆ(x + v) = f (x) + ∇f (x)0 v + v 0 ∇2 f (x)v. (2167)
2

339
Then (verify this)
λ(x)2
f (x) − inf{fˆ(x + v) : A(x + v) = b} = , (2168)
2
which again gives estimate of f (x) − p∗ . As before, we say that v is feasible direction if Av = 0, and
furthermore a descent direction if f (x + tv) < f (x) for small positive t. The Newton step in the feasible
Newton problem is always descent direction by Equation 2034. The Newton step, Newton decrement
here are also affine invariant. For T invertible, f¯(y) = f (T y), it is easy to see

∇f¯(y) = T 0 ∇f (T y), ∇2 f¯(y) = T 0 ∇2 f (T y)T. (2169)

The constraint Ax = b is replaced with AT y = b. The KKT system in Equation 2141 becomes
" #" # " #
T 0 ∇2 f (T y)T T 0 A0 ∆yw −T 0 ∇f (T y)
= (2170)
AT 0 w̄ 0

and T ∆yw = ∆xw .


The Newton’s algorithm (Algorithm 3) exactly applies, except we require that x := x0 ∈ dom f to
further satisfy Ax0 = b, and that the computation of ∆xw is further solution to the KKT system (Equa-
tion 2141). The Newton decrement is unchanged. These modifications lead to what we call a feasible
descent method, since all iterations involve feasible x(k) , and f (x(k+1) ) < f (x(k) ). The assumption is
that the KKT matrix is always invertible at x.

Exercise 532. We show that the feasible descent Newton’s method is the same as the Newton’s method
applied to the unconstrained optimization problem with equality constraints eliminated (Equation 2144).
As before, assume that R(F ) = N (A) and that rank F = n − p (F is full rank). Suppose x̂ satisfies
Ax̂ = b, and define fˆ(z) = f (F z + x̂). We have

∇f˜(z) = F 0 ∇f (F z + x̂), ∇2 f˜(z) = F 0 ∇2 f (F z + x̂)F. (2171)

Then the Newton step for the equality constrained problem is defined iff the KKT matrix
" #
∇2 f (x) A0
(2172)
A 0

is invertible, which is iff ∇2 f˜(z) is invertible (see Result 29). The Newton step in the reduced (without
linear constraints) problem is given by
−1
∆zw = −∇2 f˜(z)−1 ∇f˜(z) = −F 0 ∇2 f (x)F F 0 ∇f (x), x := F z + x̂. (2173)

The search direction is given by


−1
F ∆zw = −F F 0 ∇2 f (x)F F 0 ∇f (x) (2174)

for the equality constrained problem. To show this is the same as ∆xw , take ∆xw = F ∆zw , and define
w := −(AA0 )−1 A(∇f (x) + ∇2 f (x)∆xw ). Then the KKT system

∇2 f (x)∆xw + A0 w + ∇f (x) = 0, A∆xw = 0 (2175)

should hold. The second equation holds since A∆xw = AF ∆zw = 0zw = 0, and for the first equation,
we have
" # " #
F0 h 2 i F 0 2
∇ f (x)∆xw + F 0 0
A w + F 0
∇f (x)
∇ f (x)∆xw + A0 w + ∇f (x) = = 0. (2176)
A A∇2 f (x)∆xw + AA0 w + A∇f (x)

340
" #
F0
Again the matrix is invertible and hence injective, and we are done. The Newton decrement λ̃(z)
A
˜
of f at z can be written

λ̃(z)2 = 0
∆zw ∇2 f˜(z)∆zw (2177)
0
= ∆zw F 0 ∇2 f (x)F ∆zw (2178)
= ∆x0w ∇2 f (x)∆xw (2179)
2
= λ(x) Equation 2033. (2180)

Theorem 327 (Convergence Properties of Backtracking Feasible Descent Newton’s Method). Assume
the following hold:

1. The sublevel set S = {x : x ∈ dom f, f (x) ≤ f (x(0) ), Ax = b} is closed set, and that x(0) ∈
dom f, Ax(0) = b.

2. On the set S, assume that ∇2 f (x) 4 M 1, and that


" #
∇2 f (x) A0
≤ K. (2181)
A 0
2

That is, the inverse of the KKT matrix (Equation 2141) is bounded on S. This is to say that the
eigenvalues, n of which are positive, and p of which are negative (since invertibility is assumed, 0
is not an eigenvalue (Theorem 72)) , are bounded away from zero.

3. For x, x̃ ∈ S, the Lipschitz condition

∇2 f (x) − ∇2 f (x̃) 2
≤ L kx − x̃k2 (2182)

holds. Exercise 532 asserts that the feasible descent Newton’s method and unconstrained Newton’s
method with linear inequalities eliminated are identical. The assumptions made make the function
objective f˜, initial feasible point z (0) , with x(0) = x̂ + F z (0) satisfy the assumptions in Theorem
324 (verify this). One of the implications, that bounded inverse KKT, ∇2 f (x) 4 M 1 implies
∇2 f˜(z) < m1 for some m > 0 is shown. In particular, we show it holds for m = σmin2(F ) > 0 since
2

K M
F is full rank (eigenvalues are non-zero). Suppose not. Then F 0 HF 6< m1, where H = ∇2 f (x)
and ∃u, kuk2 = 1 s.t.

u0 F 0 HF u < m. (2183)
1 1
Then kH 2 F uk2 < m 2 . We can write
" #" # " #
H A0 F u HF u
= , (2184)
A 0 0 0
| {z } | {z } | {z }
α β γ

since AF u = 0 for any u. Then

β = α−1 γ (2185)
−1 −1
kβk = |hα |γi| ≤ kα kkγk (2186)
kβk
kα−1 k ≥ , (2187)
kγk

341
where the inequality comes from Cauchy-Schwarz (Theorem 154). So
" #
Fu
" #−1
0 0
H A kF uk2
≥ " #2 = . (2188)
A 0 HF u kHF uk2
2
0
2

Using kF uk2 ≥ σmin (F ) and


1 1 1 1
kHF uk2 ≤ kH 2 k2 kH 2 F uk2 < M M 2 m 2 . (2189)

It follows that
" #−1
H A0 kF uk2 σmin (F )
≥ > 1 1 = K. (2190)
A 0 kHF uk2 M 2 m2
2

The strict inequality is contradiction of our assumption, and the result follows. Note that if f is
self-concordant, then we can provide performance guarantees in terms of known α, β, the parameters
of the backtracking line search algorithm (Algorithm 2).

4.10.5 Infeasible point Newton’s method with equality constraints


Here we do not assume that we are working with a feasible x(0) . We consider the optimality conditions
for the equality constrained minimzation problem as before

Ax∗ = b, ∇f (x∗ ) + A0 ν ∗ = 0. (2191)

However, it is assumed that x(0) ∈ dom f . We want to find a step ∆x s.t. x + ∆x approximates the
optimality conditions, and is ‘headed for’ feasbility. Again, we use x∗ → x + ∆xw and ν ∗ → w in the
first order approximation

∇f (x + ∆x) ≈ ∇f (x) + ∇2 f (x)∆x (2192)

to get

A(x + ∆xw ) = b, ∇f (x) + ∇2 f (x)∆xw + A0 w = 0. (2193)

We want to solve for


" #" # " #
∇2 f (x) A0 ∆xw ∇f (x)
=− . (2194)
A 0 w Ax − b

Note that this is simply the convex quadratic KKT system (Equation 2162) when x is feasible. We can
interpret Equation 2194 in terms of a primal-dual method for the equality constrained problem, which
updates both the primal and dual variables x, v to approximate optimality conditions. Let the optimality
conditions be denoted by r(x∗ , ν ∗ ) = 0, where r : Rn → Rp given by

r(x, ν) = (rdual (x, ν), rpri (x, ν)), (2195)

where

rdual (x, ν) = ∇f (x) + A0 ν, rpri (x, ν) = Ax − b (2196)

342
are the dual residuals and primal residuals w.r.t to the dual feasibility and primal feasibility equations
respectively (see Equation 2137). The first-order Taylor approximation of r near estimate y is given by

r(y + z) ≈ r(y) + Dr(y)z, (2197)

where Dr(y) ∈ Rn+p×n+p is the derivative matrix of r at y, given by


" # " #
δrd δrp
δx δx ∇2 f (x) A0
D = δrd δrp = . (2198)
δν δν A 0

The primal-dual Newton step is denoted ∆ypd and is defined to be the step z for which the Taylor
approximation vanishes, which is when −r(y) = Dr(y)∆ypd . With ∆ypd = (∆xpd , ∆νpd ), we can write
" #" # " # " #
∇2 f (x) A0 ∆xpd rdual ∇f (x) + A0 ν
=− =− . (2199)
A 0 ∆νpd rpri Ax − b

If we let ν + ∆νpd = ν + , then


" #" # " #
∇2 f (x) A0 ∆xpd ∇f (x)
=− . (2200)
A 0 ν+ Ax − b

This is just the system of equations in Equation 2194, and we have

∆xw = ∆xpd , w = ν + = ν + ∆νpd . (2201)

The infeasible Newton step is the same as the primal part of the primal-dual step. The dual variable w
is the updated primal-dual variable.
Note that the infeasible Newton step is not guranteed to be a descent direction. From Equation 2194,
we have

∇2 f (x)∆xw + A0 w = −∇f (x), A∆xw = −(Ax − b) (2202)

so
 
δ
f (x + t∆xw ) = ∇f (x)0 ∆x (2203)
δt t=0
= −∆x0 (∇2 f (x)∆x + A0 w) (2204)
= −∆x0 ∇2 f (x)∆x + (Ax − b)0 w. (2205)

This is not necessarily negative. It is the norm of the primal-dual residuals that decrease in the Newton
direction, where
 
δ 2
kr(y + t∆ypd )k2 = 2r(y)0 Dr(y)∆ypd = −2r(y)0 r(y) (2206)
δt
 t=0
δ
kr(y + t∆ypd )k2 = −kr(y)k2 . (2207)
δt t=0

Note that when we take a full Newton step (t = 1), then we are guaranteed that A(x + t∆xw ) = b. All
future iterations can be analyzed under the feasible descent framework. In general, we may take damped
steps with t ∈ [0, 1], so that we have

+
rpri = A(x + ∆xw t) − b = (1 − t)(Ax − b) = (1 − t)rpri . (2208)

343
Algorithm 4 Infeasible start Newton method
x := x0 ∈ domf,  > 0, α ∈ (0, 12 ), β ∈ (0, 1).
while Ax 6= b OR kr(x, ν)k2 >  do
compute ∆xw , ∆νw , the primal and dual Newton steps.
perform backtracking line search on krk2 . set t := 1.
while kr(x + t∆xw , ν + t∆νw )k2 > (1 − αt)kr(x, ν)k2 do
t := βt
end while
x := x + t∆xw
ν := ν + t∆νw .
end while

(i)
Given iterations x(i+1) = x(i) + t(i) ∆xw for i = 0, · · · k − 1. Then
 
r(k) = Πk−1
i=0 (1 − t (i)
) r(0) , (2209)

where r(i) = Ax(i) − b. Using Equation 2194 and Equation 2201 to denote ∆νw = w − ν, we outline the
infeasible start Newton method.
The line search exit condition in Algorithm 4 is guaranteed to terminate in a finite number of steps,
since it is satisfied for small enough t (see Equation 2207). Note that once feasibility is attained, one
may simply revert back to the feasibility Newton method, which by then only differs in the computation
of t and exit condition. An issue with the infeasibility start Newton method is that there is no clear way
to detect if there exists no strictly feasible point. These issues are explored again later.

Theorem 328 (Convergence Properties of Backtracking Infeasible Descent Newton’sMethod). Assume


the following hold:

1. The sublevel set S = {(x, ν) : x ∈ dom f, kr(x, ν)k2 ≤ kr(x(0) , ν (0) )k2 } is closed set.

2. On the set S, assume that


" #−1
∇2 f (x) A0
kDr(x, ν)−1 k2 = ≤K (2210)
A 0
2

for some K.

3. For (x, ν), (x̃, ν̃) ∈ S, Dr satisfies Lipschitz condition

kDr(x, ν) − Dr(x̃, ν̃)k2 ≤ Lk(x, ν) − (x̃, ν̃)k2 . (2211)

6 ∅ and that there exists an optimal point (x∗ , ν ∗ ).


These assumptions imply that dom f ∩ {z : Az = b} =
Let y = (x, ν) ∈ S, kr(y)k2 6= 0 and ∆yw = (∆xw , ∆νw ) be the Newton step at y. Define

tmax = inf{t > 0 : y + t∆yw 6∈ S}. (2212)

If no such t exists, then we set tmax = ∞. By construction, ∀t ∈ [0, tmax ], we have y + t∆yw ∈ S. We
begin by showing that

K 2L
 
∀t ∈ [0, min{1, tmax }], kr(y + t∆yw )k2 ≤ (1 − t)kr(y)k2 + t2 kr(y)k22 . (2213)
2

344
See
Z 1
r(y + t∆yw ) = r(y) + Dr(y + τ t∆yw )t∆yw dτ (2214)
0
Z 1
= r(y) + tDr(y)∆yw + (Dr(y + τ t∆yw ) − Dr(y))t∆yw dτ (2215)
0
Z 1
= r(y) + tDr(y)∆yw + e e= (Dr(y + τ t∆yw ) − Dr(y))t∆yw dτ,(2216)
0
= (1 − t)r(y) + e. (2217)

where we use Dr(y)∆yw = −r(y). For 0 ≤ t ≤ tmax , y + τ t∆yw ∈ S for τ ∈ [0, 1] and we may write
Z 1
kek2 ≤ kt∆yw k2 kDr(y + τ t∆yw ) − Dr(y)k2 dτ (2218)
0
Z 1
≤ kt∆yw k2 Lkτ t∆yw k2 dτ (2219)
0
Z 1
= |t|k∆yw k2 L|t||τ |k∆yw k2 dτ (2220)
0
Z 1
δ L 2
= |t|2 k∆yw k2 |τ | k∆yw k2 dτ (2221)
0 δτ 2
L
= |t|2 k∆yw k22 (2222)
2
2L
= t kDr(y)−1 r(y)k22 (2223)
2
2
K L
≤ t2 kr(y)k22 . (2224)
2
We used the Lipschitz condition and the assumption kDr(y)−1 k2 ≤ K. Then for t ∈ [0, min{1, tmax }],
we have

kr(y + t∆yw )k2 = k(1 − t)r(y) + ek2 (2225)


≤ (1 − t)kr(y)k2 + kek2 (2226)
K 2L 2
≤ (1 − t)kr(y)k2 + t kr(y)k22 , (2227)
2
where the first inequality is by the Triangle inequality (Theorem 155). The damped Newton phase is
1
considered first. Assume kr(y)k2 > K2L . The RHS of the inequality in Equation 2213 is quadratic in t,
1
and monotically decreasing between t ∈ [0, t̄], where t̄ is the minimizer given by t̄ = K 2 Lkr(y)k2 < 1. By
construction tmax > t̄, and taking t → t̄ in Equation 2213 we have
1
kr(y + t̄∆yw )k2 ≤ kr(y)k2 − (2228)
2K 2 L
α
≤ kr(y)k2 − 2 (2229)
K L
= (1 − αt̄)kr(y)k2 , (2230)

so t̄ satisfies exit condition in Algorithm 4. Then the backtracking line search returns t ≥ β t̄, so

kr(y + ∆yw )k2 ≤ (1 − αt)kr(y)k2 (2231)


≤ (1 − αβ t̄)kr(y)k2 (2232)
 
αβ
= 1− 2 kr(y)k2 (2233)
K Lkr(y)k2
αβ
= kr(y)k2 − 2 . (2234)
K L

345
1 αβ
It follows that kr(y)k2 > K2L implies that the decrease in krk2 is at least K2L per iteration in the damped
Newton phase. A maximum of
kr(y (0) )k2 K 2 L
(2235)
αβ
iterations are taken before kr(y (k) )k2 ≤ K12 L . The quadratically convergent phase is considered. Assume
that kr(y)k2 ≤ K12 L By Equation 2213 asserts that
1
∀t ∈ [1, tmax ], kr(y + t∆yw )k2 ≤ (1 − t + t2 )kr(y)k2 . (2236)
2
By definition, we have tmax > 1, otherwise kr(y + tmax ∆yw )k2 < kr(y)k2 . In particular, when t = 1, we
have
1
kr(y + ∆yw )k2 ≤ kr(y)k2 ≤ (1 − α)kr(y)k2 (2237)
2
since α ∈ (0, 21 ). The backtracking line search condition is specified for t = 1, and the full step is taken,
as well as for future iterations. Equation 2213 with t = 1 can be written
2
K 2 Lkr(y + )k2
 2
K Lkr(y)k2
≤ , y + = y + ∆yw . (2238)
2 2
1
Let r(y +k ) be the residual k steps after the iteration in which kr(y)k2 ≤ K2L is satisfied; we are guaranteed
that
2k  2k
K 2 Lkr(y +k )k2 K 2 Lkr(y)k2

1
≤ ≤ . (2239)
2 2 2
We have the quadratic convergence result for the residual norm. Suppose that in some iteration we have
1
kr(y)k2 ≤ K2L , then we have full step (t = 1) and

ky +k − yk2 ≤ ky +k − y +(k−1) k2 + · · · + ky + − yk2 (2240)


= kDr(y +(k−1) )−1 r(y +(k−1) )k2 + · · · + kDr(y)−1 r(y)k2 (2241)
 
≤ K kr(y +(k−1) )k2 + · · · + kr(y)k2 (2242)
k−1 2i −1
K 2 Lkr(y)k2
X
≤ Kkr(y)k2 (2243)
i=0
2
i
k−1
X  1 2 −1
≤ Kkr(y)k2 (2244)
i=0
2
≤ 2Kkr(y)k2 . (2245)

Since kr(y (k) )k2 converges to zero, y (k) is a Cauchy sequence (verify this) and by continuity of r, the
limit point y ∗ satisfies r(y ∗ ) = 0, which asserts that the optimal point (x∗ , ν ∗ ) exists.

4.11 Interior Point Methods


Interior point methods are used for solving convex optimization problems with inequality constraints.
Consider the problem given by

minimize f0 (x), (2246)


subject to fi (x) ≤ 0, i ∈ [1, m], (2247)
Ax = b. (2248)

346
Here fi : Rn → R, i ∈ [0, m] are assumed convex and twice continuously differentiable, and that A ∈
Rp×n , rank A = p < n. In other words A is full row rank, so the constraints are independent and there
are more variables that there are constraints. Assume that the problem is solvable, so ∃x∗ optimal point
and denote f0 (x∗ ) = p∗ . Further assume there is a strictly feasible point, then ∃x ∈ D s.t. Ax = b,
fi (x) < 0 for i ∈ [m]. This implies the Slater’s constraint qualifictions hold (Theorem 309), so strong
duality holds, and ∃(λ∗ , ν ∗ ) ∈ Rm × Rp dual optimal point. The KKT conditions (Equation 1604) are
satisfied, given here

fi (x∗ ) 4 0, i ∈ [1, m] (2249)


Ax∗ = b, (2250)

λ < 0, (2251)
λ∗i fi (x∗ ) = 0, i ∈ [1, m], (2252)
m
X
∇f0 (x∗ ) + λ∗i ∇fi (x∗ ) + A0 ν ∗ = 0. (2253)
i=1

Interior point methods are said to solve for these KKT conditions in Equation 2253 by a series of equality
constrained problems. The barrier method in the class of interior-point algorithms are explored. Recall
that linear equality constrained convex quadratic problems may be via analytical solving of KKT condi-
tions, which amounted to solving a linear system (see Section 4.10.1). In the Newton’s method (feasible
and infeasible), assuming the objective function is twice differentiable, and the problem was subject
to linear constraints, it turns out that the second order Taylor approximation is convex quadratic (see
Equation 2160), and hence may be solved as a series of linear equality constrained quadratic prob-
lems. Interior-point methods build upon these algorithms - solving a problem with linear equality and
inequality constraints by solving a sequence of linear equality constrained problems.
We want to approximately formulate the inequality constrained problem in Equation 2248 with the
inequality constraints implicit in the objective:
m
1− (fi (x)),
X
minimize f0 (x) + (2254)
i=1
subject to Ax = b, (2255)

where
(
0 u ≤ 0,
1− (u) = (2256)
∞ u > 0.

Although the problem is convex, we do not have twice differentiability, so we instead approximate 1− (u)
with
1
1̂− (u) = − log(−u), dom 1̂− = −R++ . (2257)
t
1̂− is convex and nondecreasing, with extended value ∞ for u > 0, as well as continously differentiable,
and ∞ as u → 0. The approximation of 1̂− for 1− improves for larger t, as may be verified by plotting
the graphs. Then our approximation to the approximation for Equaton 2248 is now given (after scaling
by t > 0 for convenience)

minimize tf0 (x) + φ(x), (2258)


subject to Ax = b, (2259)

347
where
m
X
φ(x) = − log(−fi (x)), dom φ = {x ∈ Rn : fi (x) < 0, i ∈ [1, m]} (2260)
i=1

is said to be the logarithmic barrier for Equation 2248. Note that (Theorem 260)
m
X 1
∇φ(x) = ∇fi (x), (2261)
i=1
−fi (x)
m m
X 1 0
X 1
∇2 φ(x) = ∇f i (x)∇fi (x) + ∇2 fi (x). (2262)
i=1
fi (x)2 i=1
−fi (x)

The problem is convex and now differentiable, which allows us to use Newton’s method. This approx-
imation of Equation 2248 improves as the paramater t grows, but it turns out that as t increases, the
function tf0 + φ becomes difficult to minimize via the Newton’s method, as the Hessian varies rapidly
near the boundary of the feasible set. By solving a sequence of the problems of the form in Equation
2259, it turns out that we may circumvent this problem by increasing t in each iteration.
Assume that the problem in Equation 2259 may be solved via the Newton’s method with a unique
solution for each t > 0 denoted x∗ (t). We define a central path associated with problem in Equation
2248 to be the set of central points x∗ (t), t > 0. By the optimality conditions, a point is central point iff

Ax∗ (t) = b, fi (x∗ (t)) < 0, i ∈ [1, m], (2263)


∃v̂ ∈ Rp , 0 = t∇f0 (x∗ (t)) + ∇φ(x∗ (t)) + A0 ν̂ (2264)
m
X 1
= t∇f0 (x∗ (t)) + ∗ (t))
∇fi (x∗ (t)) + A0 ν̂. (2265)
i=1
−fi (x

Equation 2263 and Equation 2265 is said to be centrality condition.


We claim that every central point yields dual feasible point, which by extension yields lower bounds
on the optimal value p∗ (see Equation 1546). In particular, we prove that

ν̂ 1
ν ∗ (t) = , λ∗i (t) = − , i ∈ [1, m] (2266)
t tfi (x∗ (t))

forms dual feasible pair. By strict feasibility, it is easy to see λ∗ (t)  0, since fi (x∗ (t)) < 0. We can
multiply the centrality condition by t and get
m
X

∇f0 (x (t)) + λ∗i (t)∇fi (x∗ (t)) + A0 ν ∗ (t) = 0. (2267)
i=1

So the central point x∗ (t) minimizes the Lagrangian


m
X
L(x, λ, ν) = f0 (x) + λi fi (x) + ν 0 (Ax − b), (2268)
i=1

for λ = λ∗ (t), ν = ν ∗ (t). It follows that the pair is dual feasible, and that the dual function is finite.
The dual function is given by
m
X
g(λ∗ (t), ν ∗ (t)) = f0 (x∗ (t)) + λ∗i (t)fi (x∗ (t)) + ν ∗ (t)0 (Ax∗ (t) − b) (2269)
i=1
m
= f0 (x∗ (t)) − , (2270)
t

348
where last equality follows from Equation 2266. It follows that
m
f0 (x∗ ) − p∗ ≤ f0 (x∗ ) − g(λ∗ (t), ν ∗ (t)) = , (2271)
t
which proves that x∗ (t) is -suboptimal for  = m
t . By Equation 2266, see that −λi (t)fi (x∗ (t)) = 1t , for i ∈
[1, m]. The complementarity condition in the KKT equation 1604 is replaced with near-complementarity
when t is large.

4.11.1 Barrier Method


So far, we know that the central point x∗ (t) is m
t -suboptiml,
with certificate given by the dual feasible
∗ ∗
pair (λ (t), ν (t)). If we take the problem in Equation 2259 and take t → m  , the problem is solvable as
linearly constrained optimization problem. However, in practise, the problem only works well for small
problems. The barrier method extends this approach by computing x∗ (t) for sequence of increasing
values in t, and using x∗ (t) as the starting point for the next iteration. The exit condition is given by
m
t≥  , which guarantees the output is -suboptimal.

Algorithm 5 Barrier Method; Interior Point Algorithms


x := x0 ∈ domf , where Ax0 = b, fi (x0 ) < 0 for i ∈ [1, m]. (strict feasibility assumptions)
t := t(0) > 0, µ > 1,  > 0.
while TRUE do
centering step. Compute x∗ (t) by solving problem in Equation 2259.
x := x∗ (t).
m
if t <  then
Break
end if
t := µt.
end while

Algorithm 5 computes a -suboptimal cenral point x∗ (t), with certificate (λ∗ (t), ν ∗ (t)) and its dual
-suboptimal point. The centering step is said to be an outer iteration. Assume that Equation 2259
is solved via the Newton’s method. Then the Newton iterations are referred to as inner iterations. At
each inner step, we have a primal feasible point. The dual feasible point is only available at the end of
each outer step. The choice of µ is a tradeoff in the number of inner and outer iterations required. For
small µ, t only increases by a small factor in each outer iteration, and subsequent central points are close
together, so that the number of Newton steps in the following iteration is small. The opposite can be
said for large values of µ. The same idea is applied to our choice of t(0) . If t(0) is chosen to large, then
the first outer iteration will require too many inner iterations. A small choice of t(0) , on the other hand,
will require more outer iterations. One possibility uses the centrality condition (Equation 2265) to solve

inf t∇f0 (x(0) ) + ∇φ(x(0) ) + A0 ν , (2272)


ν 2

which may be treated as a measure of deviation of x(0) from x∗ (t). If the barrier method is initialized
with point x(0) ∈ dom f0 , fi (x(0) ) < 0 for i ∈ [1, m] but Ax(0) 6= b, then we can use the infeasible start
Newton method as centering step. Note that if there is a feasible point, then the full Newton step is
taken somewhere during the first centering and future iterations work with primal feasible points.

349
Theorem 329 (Convergence Properties of Barrier Method). Assuming that tf0 + φ is minimized by
Newton’s method for sequence of t values given t(0) , µt(0) , · · · µk t(0) , the duality gap after the initial
m m
centering plus k outer iterations are given by µk t(0)
:= , as argued. Then t(0)
and the number of
(centering) iterations for -suboptimality is given by

log tm
 
(0)
1+ . (2273)
log µ

It follows that the barrier method works, provided the centering problem is solvable by the Newton’s
method for t ≥ t(0) . The assumptions required for bounding the number of Newton steps place conditions
on tf0 + φ and have been discussed in previous sections, and will not be raised here. The question of how
difficult (in the number of inner iterations) it is to solve the centering problem for increasing t is still of
concern. It turns out that the centering problem appears to require a near-constant number of Newton
steps to solve, even as t increases.

Exercise 533. Consider the problem given by

minimize c0 x (2274)
subject to Ax 4 b. (2275)

Then the log-barrier function is given by


m
X
φ(x) = − log(bi − a0i x), dom φ = {x : Ax ≺ b}, (2276)
i=1

where a0i is i-th row of A. We may compute


m m
X 1 X 1
∇φ(x) = ai = A0 d, ∇2 φ(x) = ai a0i = A0 diag(d)2 A (2277)
i=1
bi − a0i x i=1
(bi − a0i x)2

1
where di = bi −a0i x . The centrality condition holds if

m
X 1
tc + ai = tc + A0 d = 0. (2278)
i=1
bi − a0i x

The dual function is given by c0 x + (Ax − b)0 λ, which is unbounded unless the linear components in x
goes to zero, so the dual problem is characterized by

maximize −b0 λ, (2279)


subject to A0 λ + c = 0, (2280)
λ < 0. (2281)

We let
1
λ∗i (t) = , i ∈ [1, m] (2282)
t(bi − a0i x∗ (t))

which is dual feasible with dual objective


m
−b0 λ∗ (t) = c0 x∗ (t) + (Ax∗ (t) − b)0 λ∗ (t) = c0 x∗ (t) − . (2283)
t

350
Theorem 330 (Newton Steps as Solving for Modified KKT Equations). The associated KKT system
(Equation 2141) in the context of barrier methods are given by
" #" # " #
t∇2 f0 (x) + ∇2 φ(x) A0 ∆xw t∇f0 (x) + ∇φ(x)
=− . (2284)
A 0 νw 0

It is shown that the Newton steps in the centering problem solve for the modified KKT equations
    ∇f0(x) + Σ_{i=1}^m λi ∇fi(x) + A'ν = 0,    (2285)
    −λi fi(x) = 1/t,    i ∈ [1, m],    (2286)
    Ax = b.    (2287)

We begin by eliminating λi by taking λi → −1/(t fi(x)). We solve for


    ∇f0(x) + Σ_{i=1}^m (1/(−t fi(x))) ∇fi(x) + A'ν = 0,    (2288)
    Ax = b.    (2289)

This is a set of (nonlinear) n + p equations in n + p variables. Beginning with the Taylor approximation
for the nonlinear component, we have
    ∇f0(x + v) + Σ_{i=1}^m (1/(−t fi(x + v))) ∇fi(x + v)
      ≈ ∇f0(x) + Σ_{i=1}^m (1/(−t fi(x))) ∇fi(x) + ∇²f0(x)v + Σ_{i=1}^m (1/(−t fi(x))) ∇²fi(x)v + Σ_{i=1}^m (1/(t fi(x)²)) ∇fi(x)∇fi(x)'v.    (2290)

We then have

Hv + A0 ν = −g, Av = 0, (2291)

where
    H = ∇²f0(x) + Σ_{i=1}^m (1/(−t fi(x))) ∇²fi(x) + Σ_{i=1}^m (1/(t fi(x)²)) ∇fi(x)∇fi(x)',    (2292)
    g = ∇f0(x) + Σ_{i=1}^m (1/(−t fi(x))) ∇fi(x).    (2293)

The second condition is used to assert that the Newton direction is feasible. Comparing with Equation
2284, we have

tH∆xw + A0 νw = −tg, A∆xw = 0, (2294)

and by comparing this to Equation 2291 we have


    v = ∆x_w,    ν = (1/t) ν_w.    (2295)

4.11.2 Phase I Method


In the input to the barrier method, we assumed a strictly feasible starting point x(0) . This can be
unreasonable in practice. It turns out that finding such a feasible point is itself solvable by the barrier

method, and we say this is a phase I method. The barrier method using this strictly feasible x(0) will
then be referred to as the phase II stage.
We begin with the set of constraints for x ∈ Rn ,

fi (x) ≤ 0, i ∈ [1, m], Ax = b. (2296)

Here fi is assumed convex, with continuous second derivatives. Assume we are given some point
x(0) ∈ ∩_{i=1}^m dom fi with Ax(0) = b. The objective is to find a strictly feasible solution or determine
that none exists, by solving the problem

minimize s (2297)
subject to fi (x) ≤ s, i ∈ [1, m], (2298)
Ax = b. (2299)

Clearly if s < 0 for some feasible (x, s) then we are done. This problem is always strictly feasible, since
x(0) is provided and any s > max_{i∈[1,m]} fi(x(0)) works. We can then apply the barrier method. Let the
problem in Equation 2299 take optimal value p̄*. If p̄* < 0, then we have a strictly feasible solution,
and we can in fact terminate as soon as any (x, s) with s < 0 is found. If p̄* > 0, then no strictly feasible
point exists. We can terminate when a dual feasible point is found with positive dual objective, which is a
certificate that p̄* > 0. If p̄* = 0 and the minimum is attained at x*, s* = 0, then the set of inequalities is
feasible, but not strictly feasible. If p̄* = 0 and the minimum is not attained, then the inequalities are
infeasible. This is said to be the basic phase I method. We study some alternatives for this.

Exercise 534 (Sum of infeasibilities method). We may instead solve for the problem

    minimize    1's    (2300)
    subject to  fi(x) ≤ si,    i ∈ [1, m],    (2301)
                Ax = b,    (2302)
                s ⪰ 0.    (2303)

The optimal value of this problem is zero and attained iff the set of equalities and inequalities is feasible.
Note, however, that a strictly feasible point is not required. When the system of equalities and inequalities is
infeasible, the optimal point found in this problem often violates only a small subset of the inequalities. The
dual variables associated with the strictly satisfied inequalities are zero, and we obtain a proof of infeasibility
of a subset of the inequalities.

Exercise 535 (Termination near the phase II central path). When there is a strictly feasible point, the
central path for the phase I problem intersects the central path for the original optimization problem.
Assume ∃x(0) ∈ D = ∩_{i=0}^m dom fi with Ax(0) = b. Let the phase I optimization problem take the form

minimize s (2304)
subject to fi (x) ≤ s, i ∈ [1, m], (2305)
f0 (x) ≤ M, (2306)
Ax = b. (2307)

Here M is a constant chosen s.t. M ≥ max{f0(x(0)), p*}. Assuming that the original problem has a strictly
feasible point, the optimal value p̄* of this phase I problem satisfies p̄* < 0. The centrality condition
(Equation 2265) is given by

    (1/(M − f0(x))) ∇f0(x) + Σ_{i=1}^m (1/(s − fi(x))) ∇fi(x) + A'ν = 0,    Σ_{i=1}^m 1/(s − fi(x)) = t̄,    (2308)

where t̄ is the parameter (verify this). If (x, s) is a central point with s = 0, then

    t∇f0(x) + Σ_{i=1}^m (1/(−fi(x))) ∇fi(x) + A'ν = 0,    t = 1/(M − f0(x)).    (2309)

Then x is on the central path of the original problem with duality gap

    m/t = m(M − f0(x)) ≤ m(M − p*).    (2310)
If we do not know of a feasible point x for Ax = b in D, we can express the problem

minimize f0 (x) (2311)


subject to fi (x) ≤ 0, i ∈ [1, m] (2312)
Ax = b (2313)

as problem

minimize f0 (x) (2314)


subject to fi (x) ≤ s, i ∈ [1, m] (2315)
Ax = b, (2316)
s = 0. (2317)

The barrier method begins with the initial centering step that solves for
    minimize    t(0) f0(x) − Σ_{i=1}^m log(s − fi(x))    (2318)
    subject to  Ax = b,  s = 0,    (2319)

where any x ∈ D, s > maxi fi (x) suffices. If we further do not know D, then we solve for
    minimize    t(0) f0(x + z0) − Σ_{i=1}^m log(s − fi(x + zi))    (2320)
    subject to  Ax = b,  s = 0,  z0 = z1 = · · · = zm = 0.    (2321)

Any starting point x + zi ∈ dom fi suffices.


Theorem 329 gave a complexity analysis in terms of the number of outer iterations required. The complexity of
Newton's method for different values of t was not explored. We can give more precise statements if
the objective is self-concordant (Definition 260).

Theorem 331 (Convergence Properties of Barrier Method for Self Concordant Objective). Assume that
the function tf0 + φ is closed and self-concordant for all t ≥ t(0) . Further assume that the sublevel sets
of the original problem

minimize f0 (x) (2322)


    subject to  fi(x) ≤ 0,  i ∈ [1, m],  Ax = b    (2323)

are bounded. These assumptions (verify this) imply that the centering problem is solvable, and the Hessian
of tf0 + φ is positive definite everywhere. Note that the self-concordance assumption holds for a variety of
problems, including all linear and quadratic problems. If each fi is linear or quadratic, then tf0 − Σ_{i=1}^m log(−fi)
is self-concordant for all t ≥ 0. The analysis applies to all LPs, QPs, and QCQPs as a result. Many
problems can be reformulated in a manner that satisfies self-concordance. Theorem 326 asserts that the
number of Newton iterations required to minimize a closed, strictly convex self-concordant f is upper
bounded by

    (f(x(0)) − p*)/γ + log₂ log₂(1/ε),    (2324)
where γ is a deterministic function of α, β, the parameters of the backtracking line search. We may use this
to derive a bound on the number of Newton steps required for one outer iteration of the barrier method.
Let x denote x*(t), let x+ denote the minimizer x*(µt) in the next iteration, and let
λ, ν represent λ*(t), ν*(t) respectively. Equation 2324 asserts that we have

    (µt f0(x) + φ(x) − µt f0(x+) − φ(x+))/γ + c,    (2325)

but since x+ is not known, this is not very helpful. Instead, it may be shown that

µtf0 (x) + φ(x) − µtf0 (x+ ) − φ(x+ ) ≤ m(µ − 1 − log µ). (2326)

Then the number of Newton iterations per outer iteration of the barrier method is bounded by

    m(µ − 1 − log µ)/γ + c.    (2327)
Interestingly, the number of Newton iterations is bounded by a quantity that depends mostly on µ, the
factor by which t is scaled, and m, the number of inequality constraints. Neither the number of equality
constraints p nor the dimension of the variable n affects the bound on the number of Newton iterations
required when self-concordance holds. The bound also does not depend on t, so the centering problems do
not become more difficult as we require higher approximation accuracy. Excluding the initial centering
step, Theorem 329 asserts that the total number of Newton iterations throughout the barrier method is
bounded by
    ⌈ log(m/(t(0)ε)) / log µ ⌉ ( m(µ − 1 − log µ)/γ + c )    (2328)
when self-concordance holds. When µ and m are fixed, the bound is proportional to log(m/(t(0)ε)), the
logarithm of the ratio of the initial duality gap m/t(0) to the required accuracy ε. We say that the barrier
method is at least linearly convergent, since the number of steps required grows logarithmically with the
inverse of the desired precision. It turns out that there always exists a choice of µ s.t. the number of Newton
steps grows only with the square root of m, the number of inequality constraints, as opposed to linearly in m.
Indeed, for fixed µ, Equation 2328 grows linearly with m. Suppose instead that we choose µ = 1 + 1/√m; then

    µ − 1 − log µ = 1/√m − log(1 + 1/√m)    (2329)
                  ≤ 1/√m − 1/√m + 1/(2m)    (2330)
                  = 1/(2m),    (2331)

where we use the result −log(1 + a) ≤ −a + a²/2 for a ≥ 0. Furthermore, we have

    log µ = log(1 + 1/√m) ≥ (log 2)/√m.    (2332)
Then the total number of Newton steps is bounded by
    ⌈ log(m/(t(0)ε)) / log µ ⌉ ( m(µ − 1 − log µ)/γ + c )    (2333)
      ≤ ⌈ √m log(m/(t(0)ε)) / log 2 ⌉ ( 1/(2γ) + c )    (2334)
      = ⌈ √m log₂(m/(t(0)ε)) ⌉ ( 1/(2γ) + c ),    (2335)

which grows only with the square root of m. In practice, however, this choice of µ is often not used,
since it is too small.
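As a rough numerical illustration of these bounds, the short Python snippet below evaluates Equation 2328 for a fixed µ and for µ = 1 + 1/√m. The values of γ and the constant c are placeholders (they depend on the backtracking parameters and are not specified here), so only the relative growth in m is meaningful.

    from math import ceil, log, sqrt

    def barrier_newton_bound(m, t0, eps, mu, gamma=1e-3, c=6):
        """Bound of Equation 2328 on total Newton steps (excluding initial centering).

        gamma and c are placeholder constants from the backtracking analysis.
        """
        outer = ceil(log(m / (t0 * eps)) / log(mu))
        per_outer = m * (mu - 1 - log(mu)) / gamma + c
        return outer * per_outer

    m, t0, eps = 100, 1.0, 1e-8
    print(barrier_newton_bound(m, t0, eps, mu=10.0))            # fixed mu: grows linearly in m
    print(barrier_newton_bound(m, t0, eps, mu=1 + 1 / sqrt(m))) # mu = 1 + 1/sqrt(m): ~sqrt(m) growth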

Although SOCPs and SDPs do not take the form as in the problem in Equation 2248, extensions of
the interior-point methods to handle problems with generalized inequalities w.r.t proper cones exist to
solve these convex optimization problems. They are not studied here.

4.11.3 Primal Dual Interior Point Methods


The primal dual interior point method is similar to the barrier method, except there are no nested loops.
The search directions in a primal dual interior point method are obtained from the Newton’s method,
applied to modified set of KKT equations. In the primal-dual interior point method, the primal and
dual iterates are not required to be feasible. In addition, they are often more efficient than the barrier
methods, and can work even in the case when there is no strictly feasible point. The modified KKT
conditions are expressed as rt(x, λ, ν) = 0, where

    rt(x, λ, ν) = [ ∇f0(x) + Df(x)'λ + A'ν ;  −diag(λ)f(x) − (1/t)1 ;  Ax − b ].    (2336)

Here t > 0, f : Rⁿ → Rᵐ, and f and its derivative (Jacobian) matrix are given by

    f(x) = [ f1(x), · · · , fm(x) ]',    Df(x) = [ ∇f1(x)' ; · · · ; ∇fm(x)' ].    (2337)

If x, λ, ν satisfy rt(x, λ, ν) = 0 with fi(x) < 0, then x = x*(t) is primal feasible and λ = λ*(t), ν = ν*(t)
are dual feasible points with duality gap m/t. The first block component, rdual = ∇f0(x) + Df(x)'λ + A'ν,
is said to be the dual residual, and the last block component rpri = Ax − b is said to be the primal
residual, as we have seen before. Finally, the middle block rcent = −diag(λ)f(x) − (1/t)1 is said to be
the centrality residual; it is the residual for the modified (near) complementarity condition to hold.
Suppose a point (x, λ, ν) satisfies f(x) ≺ 0, λ ≻ 0, and let the current point and the Newton
step be denoted

y = (x, λ, ν), ∆y = (∆x, ∆λ, ∆ν). (2338)

respectively. The Newton step ∆y is taken so that the approximate residual is zero, so

rt (y + ∆y) ≈ rt (y) + Drt (y)∆y = 0. (2339)

It follows that ∆y = −Drt(y)⁻¹ rt(y). The primal-dual search direction is the vector ∆y_pd = (∆x_pd, ∆λ_pd, ∆ν_pd)
that solves the system

    [ ∇²f0(x) + Σ_{i=1}^m λi ∇²fi(x)   Df(x)'        A' ] [ ∆x ]     [ rdual ]
    [ −diag(λ)Df(x)                    −diag(f(x))   0  ] [ ∆λ ] = − [ rcent ].    (2340)
    [ A                                0             0  ] [ ∆ν ]     [ rpri  ]

The primal-dual search direction is closely related to, but distinct from, the search direction in the barrier
method. We eliminate the variable ∆λ_pd in Equation 2340 using

∆λpd = −diag(f (x))−1 diag(λ)Df (x)∆xpd + diag(f (x))−1 rcent . (2341)

Then

    [ Hpd   A' ] [ ∆x_pd ]     [ rdual + Df(x)' diag(f(x))⁻¹ rcent ]
    [ A     0  ] [ ∆ν_pd ] = − [ rpri                              ]    (2342)

                           = − [ ∇f0(x) + (1/t) Σ_{i=1}^m (1/(−fi(x))) ∇fi(x) + A'ν ]
                               [ rpri                                               ],    (2343)

where

    Hpd = ∇²f0(x) + Σ_{i=1}^m λi ∇²fi(x) + Σ_{i=1}^m (λi/(−fi(x))) ∇fi(x)∇fi(x)'.    (2344)

Comparing with Equation 2284, we can write

    [ Hbar  A' ] [ ∆x_bar ]     [ t∇f0(x) + ∇φ(x) ]     [ t∇f0(x) + Σ_{i=1}^m (1/(−fi(x))) ∇fi(x) ]
    [ A     0  ] [ ν_bar  ] = − [ rpri             ] = − [ rpri                                    ],    (2345)

where

    Hbar = t∇²f0(x) + Σ_{i=1}^m (1/(−fi(x))) ∇²fi(x) + Σ_{i=1}^m (1/fi(x)²) ∇fi(x)∇fi(x)'.    (2346)

It should be noted that in the primal-dual interior point method, the iterates x(k), λ(k), ν(k) are not
necessarily feasible. We cannot easily evaluate the duality gap η(k) associated with iteration k of the
algorithm. For any x satisfying f(x) ≺ 0 and λ ⪰ 0, the quantity

η̂(x, λ) = −f (x)0 λ (2347)

is said to be the surrogate duality gap. The surrogate duality gap is simply the duality gap if x is primal
feasible and (λ, ν) is dual feasible, which is when rpri = rdual = 0. The value of t corresponding to the
surrogate duality gap η̂ is m/η̂.

Algorithm 6 Primal Dual Interior Point Method

Given x s.t. fi(x) < 0 for i ∈ [1, m], λ ≻ 0, µ > 1, ε_feas > 0, ε > 0.
while NOT (‖rpri‖₂ ≤ ε_feas AND ‖rdual‖₂ ≤ ε_feas AND η̂ ≤ ε) do
    t := µm/η̂.
    Compute primal-dual search direction ∆y_pd by solving Equation 2340.
    (Line search): Determine step length s > 0, y := y + s∆y_pd.
end while

The line search in Algorithm 6 is a standard backtracking line search based on the residual norm, as
we did in the infeasible Newton method (Algorithm 4). The modification is made to ensure that λ ≻ 0,
f (x) ≺ 0. Let the current iteration have variables in x, λ, ν updated to x+ , λ+ , ν + respectively in the
next iteration. The variables are related by

x+ = x + s∆xpd , λ+ = λ + s∆λpd , ν + = ν + s∆νpd . (2348)

The residual at y+ is written as r+. We begin by computing the largest step s ∈ (0, 1] s.t. λ+ ⪰ 0,
expressed

    s_max = sup{s ∈ (0, 1] : λ + s∆λ ⪰ 0}    (2349)
          = min{1, min{−λi/∆λi : ∆λi < 0}}.    (2350)

Then the backtracking is performed starting from a step slightly smaller than s_max, scaling s by a chosen
factor β ∈ (0, 1) in each round, first until f(x+) ≺ 0 and then further until

krt (x+ , λ+ , ν + )k2 ≤ (1 − αs)krt (x, λ, ν)k2 . (2351)

The same analysis as in the proof of convergence of the infeasible Newton method (Theorem 328) shows that
the line search for the primal-dual interior point method terminates in a finite number of steps.
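The following Python/NumPy sketch puts Algorithm 6 together for a small inequality-form LP. For brevity the equality constraint Ax = b is dropped, so the ν block and rpri disappear, and the problem data in the usage example are made up for illustration; it is a sketch of the method described above, not a robust solver.

    import numpy as np

    def lp_primal_dual(c, A, b, x, max_iter=100, mu=10.0, eps=1e-8, alpha=0.01, beta=0.5):
        """Primal-dual interior point sketch for: minimize c'x s.t. A x <= b.

        Follows Algorithm 6 with the equality constraint dropped; x must satisfy A x < b strictly.
        """
        m = len(b)
        lam = np.ones(m)

        def residuals(x, lam, t):
            f = A @ x - b
            r_dual = c + A.T @ lam
            r_cent = -lam * f - 1.0 / t
            return f, r_dual, r_cent

        for _ in range(max_iter):
            f = A @ x - b
            eta = float(-f @ lam)                    # surrogate duality gap
            r_dual = c + A.T @ lam
            if np.linalg.norm(r_dual) <= eps and eta <= eps:
                break
            t = mu * m / eta                         # t := mu * m / eta_hat
            _, r_dual, r_cent = residuals(x, lam, t)
            # eliminate dlam (Equation 2341) and solve the reduced system for dx
            H = A.T @ np.diag(lam / -f) @ A
            g = r_dual + A.T @ (r_cent / f)
            dx = np.linalg.solve(H, -g)
            dlam = -(lam / f) * (A @ dx) + r_cent / f
            # largest s in (0, 1] keeping lam nonnegative, then backtrack on the residual norm
            neg = dlam < 0
            s = min(1.0, 0.99 * np.min(-lam[neg] / dlam[neg])) if np.any(neg) else 1.0
            norm0 = np.linalg.norm(np.concatenate([r_dual, r_cent]))
            while True:
                x_new, lam_new = x + s * dx, lam + s * dlam
                f_new, rd_new, rc_new = residuals(x_new, lam_new, t)
                if np.all(f_new < 0) and \
                   np.linalg.norm(np.concatenate([rd_new, rc_new])) <= (1 - alpha * s) * norm0:
                    break
                s *= beta
            x, lam = x_new, lam_new
        return x, lam

    # toy usage: minimize -x1 - x2 subject to x >= 0, x1 + x2 <= 1
    c = np.array([-1.0, -1.0])
    A = np.vstack([-np.eye(2), np.ones((1, 2))])
    b = np.concatenate([np.zeros(2), [1.0]])
    print(lp_primal_dual(c, A, b, x=np.array([0.25, 0.25])))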

Chapter 5

Set Theory

5.1 Algebra of Sets


Set theory has many uses in both theoretical and applied settings. One of the topics built on sets is
probability theory, where probability measures (see Definition 267) are set functions from abstract spaces to
real numbers and random variables (see Definition 279) are functions defined on such spaces. The most
elementary use of sets in probability theory is the treatment of experiments, sample outcomes, sample spaces
and events (see Definitions 262, 263, 264).

Definition 261 (Definition, Operations and Terminologies of Sets). Here we define common operations
that may be applied on sets, collectively known as set algebra.

1. Intersection: C = A ∩ B =⇒ ∀x ∈ C, (x ∈ A) ∧ (x ∈ B).

2. Union: C = A ∪ B =⇒ ∀x ∈ C, (x ∈ A) ∨ (x ∈ B).

3. Mutually exclusive (mutex): A ∩ B = ∅ = {}.

4. For A on sample space (see Definition 263) S, the complement AC is defined as the set satisfying
(A ∪ AC = S) ∧ (A ∩ AC = ∅).

A common technique of visualising the relationships between sets is with the use of Venn diagrams.
In keeping with the dense presentation, we will not present Venn diagrams here. The mistake that
many beginners make in relation to sets and probability theory is in using the Venn diagram to infer
independence. Do not make the same mistake. Mutual exclusiveness and independence (see Definition
313) are different concepts. Independence involves probability measures (see Definition 267) while mutual
exclusion does not require the discussion of probability.
There are two methods of proving that some set A = B. The informal method is to draw a Venn
diagram and show they represent the same area. The more formal, and mathematically rigorous method
is to show that (A ⊂ B) ∧ (B ⊂ A). This is done by arguing that for any element a, a ∈ A =⇒ a ∈ B
and vice-versa.

Chapter 6

Probability and Statistical Models

The treatment of probability theory is typically done in two-parts, one without the use of measure theory
(at the undergraduate level) and the other employing measure theoretic arguments (at the graduate level).
Although they are almost never taught together (perhaps for good reason), here we aim to present them
shoulder-to-shoulder. Measure theoretic arguments are necessary to draw convergence arguments and
discuss concepts belonging to the uncountably infinite world. However, since they are describing the same
concept, there is utility in drawing the bridge between the two probability treatments where relevant.
Hopefully what is achieved is a more complete view of statistics and probability theory while minimising
the attendant disorganization. It is also hoped that this way of presentation eases the internalization
of measure theory probability concepts. We assume basic proficiency in set theory. In fact, not much is
assumed, except that the reader is familiar with algebra involving sets. (see Section 5.1)

6.1 Probability Spaces and Probability Measures


Definition 262 (Experiment). An experiment in statistics is a procedure that can be repeated an infinite
number of times with a well defined set of possible outcomes.

Definition 263 (Sample Space and Sample Outcomes). Each of the possible outcomes in an experiment
(see Definition 262) is called an outcome s ∈ S where S is the sample space of all possible outcomes.

Definition 264 (Event). A set of possible outcomes (a subset of the sample space) (see Definition 263)
is known as an event, which we denote e. Then e ⊂ S.

Infinite probability spaces are used to model situations in which random experiments have infinitely
many possible outcomes. Two common examples are (i) sampling a number from the unit interval [0, 1] and
(ii) infinite sequences of coin tosses. The sample space is the set of possible outcomes: in the first case
{ω : ω ∈ [0, 1]}, and in the second Ω∞ = {ω = ω1ω2 · · · : ωn represents the n-th coin toss}, the set of
infinite sequences of heads and tails. These sample spaces are uncountably infinite, meaning we cannot list
their elements in sequence, and P(ω) = 0 for any individual outcome ω. We therefore cannot determine the
probability of an event A ⊂ Ω by summing over its members, and must define the probability of events
directly.

Definition 265 (σ-algebra). Let Ω ≠ ∅ be a set, let F be a collection of subsets of Ω. Then F is


σ-algebra (or σ-field) provided:

1. ∅ ∈ F

2. A ∈ F =⇒ Ac ∈ F

3. sequence A1, A2, · · · ∈ F =⇒ ∪_{i=1}^∞ Ai ∈ F. Any sequence of sets belonging to F also has its
   union in F.

It is easy to confuse the power set with the sigma algebra of a set. The power set is the largest possible
sigma algebra of a set. The trivial σ-field {∅, Ω} is not the power set of Ω, but is also a σ-algebra. All
operations on elements of a sigma algebra give us other sets in the sigma algebra. It is easy to derive
that any countable union of member sets is in the sigma algebra, and so is any countable intersection
(take the complement of the union of the complements and apply De Morgan's law).

Definition 266 (P function). The probability function is a function P that satisfies axioms that are
collectively known as ‘Kolmogorov Axioms of Probability’. For sample space Ω and events A ∈ F defined
over the sample space, the P function satisfies the same properties as defined in Definition 267.

Definition 267 (P measure). Let Ω be a non-empty set and F be a σ-algebra of subsets of Ω. Then a
probability measure P is a function mapping every set A ∈ F to the range [0, 1], written P : F → [0, 1].
We require

1. P(Ω) = 1, note that Ω ∈ F given ∅ ∈ F and the complement property.

2. (countable additivity) where A1, A2, · · · is a disjoint set sequence, then P(∪_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).
   This implies finite additivity P(∪_{i=1}^n Ai) = Σ_{i=1}^n P(Ai) on disjoint sets.

Theorem 332 (Properties of Probability Measures/Functions). We state some trivial but important
results of probability functions without proof. Beginning with (i) P(AC ) = 1 − P(A), (ii) P(∅) = 0, (iii)
A ⊂ B =⇒ P(A) ≤ P(B), (iv) P(A) ≤ 1, (v) P(A ∪ B) = P(A) + P(B) − P(A ∩ B). A less trivial
theorem is provided with proof, with regards to the probability measure on k unions:
    P(∪_{i=1}^n Ai) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + · · · + (−1)^{n+1} P(∩_{i=1}^n Ai)  for any Ai on S, i ∈ [n].    (2352)

Proof. (verify this)

Exercise 536 (Math is Weird Sometimes). Consider events A, B. Then we have P(A ∪ B) = P(A) +
P(B) − P(A ∩ B) = P(A) + P(B) − (1 − P((A ∩ B)^C)) = P(A) + P(B) − (1 − P(A^C ∪ B^C)) = P(A) +
P(B) − 1 + P(A^C ∪ B^C). The probability of at least one of A, B occurring increases with the probability
of at least one not occurring!

Exercise 537. Consider a probability measure as defined in Definition 267, then prove that

1. (A ∈ F, B ∈ F) ∧ A ⊂ B =⇒ P(A) ≤ P(B)

2. A ∈ F ∧ {An}_{n=1}^∞ ⊂ F ∧ limn→∞ P(An) = 0 ∧ (∀n, A ⊂ An) =⇒ P(A) = 0

Proof. 1. We see that for A ⊂ B we have B = A ∪ (B\A) the result follows by countable additivity
of disjoint sequences.

2. For all n, P(A) ≤ P(An ) and therefore P(A) ≤ limn→∞ P(An ) = 0 and 0 ≤ P(A) ≤ 0.

Exercise 538. Prove that the set of sequences of coin tosses in which outcome of each even numbered
coin toss matches the outcome of the preceding toss, such that

A = {ω = ω1ω2 · · · : ω1 = ω2, ω3 = ω4, ω5 = ω6, · · · }

is uncountably infinite. Furthermore, show that if p-head is not zero or one, then P(A) = 0.

Proof. Consider the function φ : A → Ω∞ with φ(ω) = ω1ω3ω5 · · · ; this function is injective and
surjective. Then the cardinality of A matches the cardinality of Ω∞, which means that A is uncountably
infinite. Next, let An = {ω : ω1 = ω2, · · · , ω2n−1 = ω2n}. Then

    P(A) = lim_{n→∞} P(An) = lim_{n→∞} (p² + (1 − p)²)ⁿ = 0,

since p ∉ {0, 1} implies p² + (1 − p)² < 1.

Definition 268 (Probability Space). The triple (Ω, F, P) is termed probability space, with reference to
definitions 265 and 267.

Definition 269 (Uniform Lebesgue Measure L on Unit Intervals). Models choice of sampling a random
number from unit interval, with probability measure on [a, b] by

    P[a, b] = P(a, b) = b − a,    0 ≤ a ≤ b ≤ 1.

The probability measure of any single point is 0.

Considering that we can define an open interval (a, b) as a union of a sequence of closed intervals, we can write

    (a, b) = ∪_{n=1}^∞ [a + 1/n, b − 1/n].

Now consider a σ-algebra formed by starting with the closed intervals and putting in everything else
required to have a σ-algebra. Then it turns out that this σ-algebra contains all open intervals also.

Definition 270 (Borel σ-algebra, B). The σ-algebra constructed by beginning with closed intervals and
adding everything else required to have a σ-algebra is called the Borel σ-algebra of subsets [0, 1], denoted
B. Sets belonging to the set B are called Borel sets, and are subsets of [0, 1].

Exercise 539 (Infinite, Independent Coin Toss Space). We can illustrate infinite probability spaces with
a sequence of infinite coin tosses. Let Ω∞ denote the set of possible outcomes, and the probability of head,
tail be p, q = (1 − p) respectively, both non-zero. The tosses are independent, and we want to construct
a probability measure on this space corresponding to this experiment. We can define P(∅) = 0, P(Ω) = 1.
The two sets form a σ-algebra, which we denote F0 = {∅, Ω}. Note that |F0| = 2^{2⁰} = 2. Then, consider the
two sets AH and AT, denoting the set of sequences beginning with a head and the set beginning with a tail
respectively. For instance, AH may be denoted {ω : ω1 = H}. Let P(AH) = p, P(AT) = q, and we have defined
the probability measure P for the σ-algebra F1 = {∅, Ω, AH, AT}. Note also that |F1| = 2^{2¹} = 4. No other
sets need to be added to form a σ-algebra. We can continue for F2 of size 2^{2²} = 16 and so on. The continuation of this
process gives us the probability of every set that can be described in terms of finitely many tosses. It
turns out that once this is done, the other sets that are not describable in terms of finitely many coin
tosses have determined probabilities. Consider, for example, the singleton set containing the infinite
sequence of heads HHH · · · , which is not describable by finitely many tosses but is a subset of AH, AHH, · · · .
Since we defined P(AH) = p, P(AHH) = p² and so on, this singleton set has probability at most pⁿ for every n,
hence probability zero as p < 1. The same argument can be used to argue that the probability of any individual sequence ∈ Ω∞ equals
zero. We create the σ-algebra F∞ by putting in every set that can be described by the finite coin tosses,
and then everything else required for the σ-algebra property, and it turns out that we will then have the
probability of every set in F∞ . It is determined but not necessarily easily computed, as we shall see. For
instance, consider the set A = {ω = ω1ω2 · · · : limn→∞ Hn(ω1 · · · ωn)/n = 1/2}, which defines the set for which
the long-run average of heads is half. This is in F∞. To see this, for constants m, n ∈ Z+ define

    Sn,m = {ω : |Hn(ω1, · · · , ωn)/n − 1/2| ≤ 1/m}.

Each Sn,m is in Fn with known probability. By the definition of the limit, the specified limit is satisfied iff for every
positive integer m, there exists a positive integer N such that ∀n ≥ N, ω ∈ Sn,m . In other words, the set
for which ω satisfies the limit can be expressed

    A = ∩_{m=1}^∞ ∪_{N=1}^∞ ∩_{n=N}^∞ Sn,m.

Then A ∈ F∞ by construction, since it is formed from countable unions and intersections of members of F∞.
It turns out the Strong LLN asserts P(A) = 1 if p = 0.5 and 0 otherwise.
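A quick Monte Carlo illustration of the set A (a Python sketch; the sample sizes are arbitrary choices): simulating fair coin tosses, the running average of heads approaches 1/2, which is the almost-sure statement of the strong LLN.

    from random import random

    def running_average_heads(n, p=0.5):
        """Long-run fraction of heads H_n / n in n simulated tosses with P(head) = p."""
        heads = sum(random() < p for _ in range(n))
        return heads / n

    for n in (100, 10_000, 1_000_000):
        print(n, running_average_heads(n))   # approaches 1/2, illustrating the set A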

Probability zero events in uncountable probability spaces present a seeming paradox, as highlighted in our
example above. Whenever an event is said to be almost sure, we refer to the case P(A) = 1, even
though the event may not include every possible outcome. The outcomes not included together have probability
P(Ac) = 0.

Definition 271 (Almost Surely). Let (Ω, F, P) be a probability space. If set A ∈ F satisfies P(A) = 1,
we say event A occurs almost surely.

6.2 Counting and Combinatorics


We have defined probabilities, sample spaces, events and other important fundamental artefacts of ran-
domness in the Section 6.1. The most basic probability model is the counting model. Combinatorics is
the science and practice of counting, arranging and ordering of objects. Surprisingly, like Kolmogorov
probability axioms, combinatorics can be reduced to four fundamental rules.

Theorem 333 (Combinatorial Multiplication Rule). Operations Ai, i ∈ [k] performed in sequence can
be conducted in a total of Π_{i=1}^k ni ways, where ni is the number of ways to conduct Ai.

Theorem 334 (n-permutate k). The number of permutations of length k from n distinct objects without
repetitions is written

    ⁿPk = n!/(n − k)!.    (2353)
Proof. The proof follows from application of Theorem 333 by first considering n possible ways, then n−1
ways down to n − k + 1 ways.

Corollary 8. By the Theorem 334 the number of ways to arrange n distinct objects is n!.

Result 30 (Stirling’s Approximation). Stirling’s formula for approximating n! is written


    n! ≈ √(2π) n^{n+1/2} exp(−n).    (2354)

In practice we can write in log form


√ 1
log 2π + (n + ) log n − n
2
and then exponentiating after substituting the value of n.

Exercise 540 (The Rook Problem). How many ways are there to arrange eight rooks on an eight by
eight chessboard such that they are non-capturing? See that each rook, once placed, eliminates one row
and one column from the next placement. Eight distinct rooks can occupy ⁸P₈ valid position sets and be
internally permuted ⁸P₈ ways, for a total of 8! · 8! arrangements.

Definition 272 (n-permutate r categories, multinomial coefficients). The number of permutations of n
objects of r categories, where each type i has ni objects, i ∈ [r], is written

    n! / (Π_{i=1}^r ni!)    (2355)

where Σ_{i=1}^r ni = n. This is known as the multinomial coefficient, due to its appearance when expanding
multinomials.

Exercise 541 (Larsen and Marx [12]). Find the coefficient of x²³ in the expansion of (1 + x⁵ + x⁹)¹⁰⁰
by combinatorial arguments.

Proof. The coefficient of the term corresponds to the number of ways in which the term can be formed.
Here x²³ can only be formed from two of x⁹, one of x⁵ and ninety-seven ones being multiplied together.
Then the coefficient is 100!/(2!·1!·97!).

Exercise 542 (Number of Passwords). How many total passwords of length ten can be constructed from
four letters, four numbers and two symbols? Let the total number of admissible symbols be eight.

Proof. One can choose 10⁴ combinations of numbers, 8² of symbols and 26⁴ · 2⁴ of letters (including upper
cases). Once the numbers, letters and symbols are chosen, they can be arranged in 10!/(4!·4!·2!) ways. The
total number of passwords is then 10!/(4!·4!·2!) · 10⁴ · 8² · 26⁴ · 2⁴.

See that we can often think of the permutation problem as a two step choice of first choosing the
candidates and then arranging them.

Theorem 335 (Circular Permutations). There are (n − 1)! ways to permute n distinct objects in a
circle. Do this by writing nPn and see that a factor of n permutation arrangements are repeated by the
‘circularity’. Divide by n and we get n!/n = (n − 1)!.

Exercise 543 (The Necklace Problem). We have 10 beads of different colours, how many different
necklaces can we form?

Proof. Their circular arrangement permutation cardinality is (n − 1)! for n = 10. However, since the
necklace flipped over is a different circular permutation but the same necklace, we actually need to divide
by two to obtain (n − 1)!/2. To see this, consider a smaller necklace ROY. This is the same necklace as RYO
flipped! We divide by n to account for ROY being identical to OYR and YRO. The flipping accounts
for a further division factor of two.

Definition 273 (n-choose k, binomial coefficients). The number of ways to form combinations of size k
from a set of n distinct objects without repetitions is written

    (n choose k) = ⁿCk = n!/(k!(n − k)!).    (2356)

Due to its common appearance as coefficients in binomial expansions, this term is also called the binomial
coefficients. Cross reference this with the multinomial coefficients. (see Definition 272)

Proof. The proof for this can be seen by first permuting k of n objects to get nPk and then seeing that
order does not matter and dividing by k!.

Theorem 336 (Pascal's rule). For positive natural numbers n and k, Pascal's rule states that
(n+1 choose k) = (n choose k) + (n choose k−1). This can be thought of recursively. To choose k objects from n + 1 objects, I can
first choose or not choose the ‘first’ object. Choosing the object will mean I can choose k − 1 objects from
the remaining n objects. If I do not choose the object then I still have to pick k out of the remaining n
objects. The total number of possibilities are the sum total.

Exercise 544. See that in Theorem 332 we defined the probability of n unions of events. Here we want to
show that the formula adds the probability of each outcome exactly once - there is no double counting or
under counting. Consider the set of outcomes in ∪_{i=1}^n Ai that belong to some k of the Ai and no others.
We want to show that the formula counts this set of outcomes exactly once for arbitrary k. See that
these outcomes get counted (k choose 1) times in the term Σ_i P(Ai), (k choose 2) times in Σ_{i<j} P(Ai ∩ Aj) and so on. The
outcomes are counted a total of

    (k choose 1) − (k choose 2) + (k choose 3) − · · · + (−1)^{k+1} (k choose k)    (2357)

times. See that we can write the binomial expansion 0 = 0^k = (−1 + 1)^k = Σ_{j=0}^k (k choose j)(−1)^j. Then see
that Equation 2357 equals (k choose 0) = 1 and we are done.


Theorem 337 (Some Binomial Identities). Prove that

1. Σ_{i=1}^n i · (n choose i) = n · 2^{n−1}.

2. Σ_{i=0}^n (n choose i)² = (2n choose n).

3. Σ_{k=0}^n (n choose k) = 2^n.

4. (n choose k) = (n choose n−k).

Proof. For 1. see that (1 + x)^n = Σ_{k=0}^n (n choose k) x^k. Differentiating both sides we get
n(1 + x)^{n−1} = Σ_{k=0}^n (n choose k) k x^{k−1}. Substitute x = 1 and we are done. For 2. note that

    (2n choose n) = (n choose 0)(n choose n) + (n choose 1)(n choose n−1) + · · · + (n choose n)(n choose 0).

Using the identity (n choose k) = (n choose n−k) we obtain the equality.
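These identities are easy to spot-check numerically; the following Python snippet (with an arbitrary choice of n) verifies all four using math.comb:

    from math import comb

    n = 12
    assert sum(i * comb(n, i) for i in range(1, n + 1)) == n * 2 ** (n - 1)   # identity 1
    assert sum(comb(n, i) ** 2 for i in range(n + 1)) == comb(2 * n, n)       # identity 2
    assert sum(comb(n, k) for k in range(n + 1)) == 2 ** n                    # identity 3
    assert all(comb(n, k) == comb(n, n - k) for k in range(n + 1))            # identity 4
    print("identities hold for n =", n)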

We have worked hard to count the total number of arrangements and choices of objects. This is often
motivated by the desire to find the probability of an event.

Theorem 338 (Combinatorial Probability, the classical definition of probability). If there are n equally
likely ways to perform an operation and m of them satisfy the condition for occurrence of event A, then
P(A) = m/n.

Exercise 545 (The Birthday Problem). Assume that birth is uniformly distributed over 365 days in a
year and k people are selected at random. What is the probability that there is at least one overlap in
birthday dates?

Proof. By the multiplication rule (see Theorem 333) the total number of birthday sequences is 365^k. The
total number of sequences in which the k people have distinct birth dates is ³⁶⁵Pk. Then the probability that
at least two people share a birthday is simply

    (365^k − ³⁶⁵Pk) / 365^k.    (2358)

We require k ≥ 23 for the probability to be greater than half.
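A small Python computation of Equation 2358 confirms the k ≥ 23 threshold (the helper name is our own):

    from math import prod

    def p_shared_birthday(k, days=365):
        """Probability that at least two of k people share a birthday (Equation 2358)."""
        p_distinct = prod((days - i) / days for i in range(k))   # 365Pk / 365^k
        return 1.0 - p_distinct

    k = 1
    while p_shared_birthday(k) <= 0.5:
        k += 1
    print(k, p_shared_birthday(k))   # smallest k with probability > 1/2 (k = 23)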

Exercise 546 (Discrete Random Walk). A drunkard walks forward and backward randomly with equal
probability. At time/step n what is the probability that he is r steps ahead of where he began?

Proof. Let x be the number of forward steps and y the number of backward steps. Then x + y = n and
x − y = r, and the equations solve to x = (n + r)/2, y = (n − r)/2. The total number of ways for which he
ends up r steps ahead is n!/(((n + r)/2)! · ((n − r)/2)!) = (n choose (n+r)/2), and the total number of ways
to take n steps is 2^n. His probability of being r steps ahead is (n choose (n+r)/2) / 2^n.
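The exercise translates directly into code; the following Python sketch (with arbitrarily chosen n, r and trial count) compares the exact probability with a Monte Carlo estimate:

    from math import comb
    from random import choice

    def p_at_r(n, r):
        """P(simple +/-1 random walk is at position r after n steps); zero unless n, r share parity."""
        if (n + r) % 2 != 0 or abs(r) > n:
            return 0.0
        return comb(n, (n + r) // 2) / 2 ** n

    n, r = 10, 2
    trials = 200_000
    hits = sum(sum(choice((1, -1)) for _ in range(n)) == r for _ in range(trials))
    print(p_at_r(n, r), hits / trials)   # exact probability vs Monte Carlo estimate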

Definition 274 (von Mises probability). For an experiment (see Definition 262) repeated n times under
identical conditions and if event E (see Definition 264) occurs m times out of the n repetitions, the
probability of event E is written P(E) = lim_{n→∞} m/n.

6.3 Random Variables


In general, a random variable can be said to be a function that maps outcomes from a sample space to
real numbers. We write notations such as X : Ω → R to show this. When the sample space Ω is finite,
or countably infinite we can define a probability function as such:

Definition 275 (Probability Function on Countable Sample Space). For a finite or countably infinite
sample space Ω, a real valued function p is said to be a discrete probability function if it satisfies
0 ≤ p(ω), ∀ω ∈ Ω and Σ_{ω∈Ω} p(ω) = 1.

Definition 276 (Discrete Random Variable). A function with domain Ω ranging over finite or countably
infinite set of real numbers is called a discrete random variable. Then for such a random variable X we
can write X(ω) = k for ω ∈ Ω and k ∈ R.

Definition 277 (Discrete Probability Density Function). Associated with the random variable defined
as in Definition 276, its probability density function pX (k) is the probability function such that

pX (k) = P({ω ∈ Ω|X(ω) = k}) = P(X = k). (2359)

The probability density function for a discrete random variable (Definition 276) can be expressed by
formula or by enumerating the domain (in the finite case) and assigning probability values. Often we are
interested in the probabilities of a range, as opposed to at a point. We can then consider its cumulative
density function.

Definition 278 (Discrete Cumulative Density Function). For discrete random variable X (Definition
276), the probability that X takes on values t or lesser is written by its cumulative density function FX ,
written

    FX(t) = P({ω ∈ Ω | X(ω) ≤ t}) = P(X ≤ t).    (2360)

Random variables often take values that have continuous domain. We might have sample spaces that
contain an uncountably infinite number of outcomes. A discrete probability function p(s) as defined in
Definition 275 is not applicable to outcomes in continuous sample spaces. Uncountably infinite sample
spaces often have point probabilities of zero. We give a more general definition under measure theoretic
settings that generalizes the behavior of random variables.

Definition 279 (Random Variable). Let (Ω, F, P) be a probability space, then a random variable X is
a real-valued function defined on Ω with the property that for every Borel subset (see Definition 270)
B ⊂ R, the subset of Ω given by {X ∈ B} = {ω ∈ Ω : X(ω) ∈ B} is in the σ-algebra F. We may permit
the value of a random variable to take ±∞.

To get Borel subsets of R, begin with closed intervals and add all other sets necessary. Unions of
sequences of closed intervals are Borel sets, and it follows that every open interval is Borel. Furthermore,
every open set (not necessarily an interval) is Borel, and every closed set is Borel by complement.

Definition 280 (Borel Subsets of R). We define the collection of Borel subsets of R as B(R), the Borel
σ-algebra of R. We are often interested in the value P(X ∈ B) for B ∈ B(R), which requires that
{X ∈ B} ∈ F for all B ∈ B(R).

We need to allow for the possibility that the random variables under consideration do not assign any
lumps of (probability) mass but rather spread them continuously over the real line. We should think of
the distribution of a random variable as telling us the mass in a set rather than mass at a point. The
distribution of a random variable is itself a probability measure, but it is a measure on subsets of R
rather than on subsets of Ω.

Definition 281 (Distribution Measure of a Random Variable). Let X be a random variable defined
on a probability space as in Definition 279. The distribution measure of X is the probability measure µX
assigning to each Borel subset B of R the mass µX(B) = P(X ∈ B). The definition holds for both
discrete and continuous random variables.

Exercise 547 (A Random Variable Distributed on L(0, 1)). Let P be the uniform Lebesgue measure on
[0, 1]. Define X(ω) = ω and Y (ω) = 1 − ω for all ω ∈ [0, 1]. Then, distribution measure of X is uniform,
that
µx [a, b] = P{ω : a ≤ X(ω) ≤ b} = P[a, b] = b − a, 0 ≤ a ≤ b ≤ 1,

by definition of P. See also that Y has the same distribution as X:

µy [a, b] = P{ω : a ≤ Y (ω) ≤ b} = P{ω : a ≤ (1 − ω) ≤ b} = P[1 − b, 1 − a] = (1 − a) − (1 − b) = b − a, · · ·

Now, consider a different probability measure P̃ on [0, 1] by


    P̃[a, b] = ∫ₐᵇ 2ω dω = b² − a²,    0 ≤ a ≤ b ≤ 1.

We see that the random variable X no longer has the uniform distribution under the probability measure
P̃, with µx[a, b] = b² − a² and µy[a, b] = (1 − a)² − (1 − b)² under the conditions 0 ≤ a ≤ b ≤ 1 - we also see
that the distributions of the two random variables no longer agree!

There are other ways to specify the distribution of random variables other than the distribution
measure. One is the cumulative distribution function.

Definition 282 (Probability Function on Uncountably Infinite Sample Space). A probability function
P on a set of real numbers Ω is continuous if there exists f(t) such that for closed intervals [a, b] ⊂ Ω,
P([a, b]) = ∫ₐᵇ f(t)dt, where f(t) satisfies f(t) ≥ 0, ∀t and ∫_{−∞}^∞ f(t)dt = 1. For an arbitrary set A, we
write P(A) = ∫_A f(t)dt.

The function f(t) could be considered to be the density-scaled population histogram partitioned to
its limit. Therefore, in order to see if sampled data is fitted well by some continuous density f(t),
we can plot the density-scaled sample histogram and superimpose them to see if there is a fit. In order
to map a frequency plot to the density domain, such that the sum of areas of all non-zero probability
partitions equates to one, we need to make the transformation f̂(ct) = nt/(wt N), where ct represents the
class t, nt is the class frequency, wt is the width of the class and N is the total sample size. f̂ is the
density-scaled sample histogram function that assigns probability mass to class ct.

Definition 283 (Cumulative Distribution Functions). We can define the distribution of random variables
in terms of its cumulative distribution function, which is the function

F (x) = P{X ≤ x}, x ∈ R.

Corollary 9 (Duality of F (x) ∼ µx ). If we know the distribution measure µX , then we know cdf F . To
see that, F (x) = µx (−∞, x]. The other direction holds, in that µx (x, y] = F (y) − F (x) for x < y. Then
for a ≤ b, [a, b] = ∩_{n=1}^∞ (a − 1/n, b], and we can compute

    µx[a, b] = lim_{n→∞} µx(a − 1/n, b] = F(b) − lim_{n→∞} F(a − 1/n).
Once the distribution measure µx [a, b] is known for all closed intervals in R, then it is determined for
all Borel subsets of R.

There are 2 special cases for which the distribution of the random variable can be recorded in other
forms: (i) probability density functions and (ii) probability mass functions.

Definition 284 (Random Variable Quantiles). For q ∈ [0, 1], the q-th quantile of a random variable
X ∼ F(x) is defined to be the value

Qq (X) = max {x : P(X < x) ≤ q} . (2361)

The empirical quantiles can be measured from observed samples. For Xi , i ∈ [n] observations, define the
q-th empirical quantile as Qq = X(bnqc) , where (i) indexes the i-th order statistic.

Definition 285 (Probability Density Functions). The density function f (x) is a non-negative function
defined for x ∈ R such that
    µx[a, b] = P{a ≤ X ≤ b} = ∫ₐᵇ f(x)dx,    −∞ < a ≤ b < ∞.

Note that ∫_{−∞}^∞ f(x)dx = lim_{n→∞} ∫_{−n}^n f(x)dx = lim_{n→∞} P{−n ≤ X ≤ n} = P{X ∈ R} = P(Ω) = 1.

Definition 286 (Probability Mass Functions). Consider either the finite sequence of numbers xi, i ∈ [N],
or the infinite sequence x1, x2, · · · , such that with probability one the random variable takes one of the
values in the sequence. Then, we define pi = P{X = xi}, with pi ≥ 0 for all i and Σ_i pi = 1. Then the mass
assigned to a Borel set B ⊂ R by the distribution measure of X is

    µx(B) = Σ_{i : xi ∈ B} pi,    B ∈ B(R),

and we shall call this the probability mass function.

There exists random variables whose distribution is given by mixture of density and probability mass
functions, and even those with both no lumps of mass and no density.

Exercise 548 (A Random Variable U(0, 1)). Construct a uniformly distributed r.v. taking values in [0, 1]
and defined on the infinite coin toss space Ω∞. Suppose independent coin tosses, with p = q = 1/2 as in the
usual settings. For n = 1, 2, · · · , define

    Yn(ω) = 1{ωn = H}

and set X = Σ_{n=1}^∞ Yn/2ⁿ. If Y1 = 0, which happens with probability 1/2, then X ∈ [0, 1/2], and if Y1 = 1
then X ∈ [0.5, 1], and we can continue the enumeration of probabilities and sequences (such as Y1Y2 = 00
with p = 1/4) to see that for any interval [k/2ⁿ, (k+1)/2ⁿ] ⊂ [0, 1], the probability that the interval contains X
is 1/2ⁿ. In terms of the distribution measure µX, we can write this as

    µX[k/2ⁿ, (k+1)/2ⁿ] = 1/2ⁿ,    0 ≤ k ≤ 2ⁿ − 1.

With finite additivity and unions of intervals, when 0 ≤ k ≤ m ≤ 2ⁿ we obtain µX[k/2ⁿ, m/2ⁿ] = (m − k)/2ⁿ,
simplified µX[a, b] = b − a when 0 ≤ a ≤ b ≤ 1.
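The construction can be simulated by truncating the series after finitely many tosses; the Python sketch below (the bit count and sample size are arbitrary choices) checks that the empirical mass of a few intervals is close to their length:

    from random import getrandbits

    def x_from_tosses(n_bits=53):
        """X = sum_n Y_n / 2^n truncated to n_bits fair coin tosses (Exercise 548)."""
        return sum(getrandbits(1) / 2 ** (i + 1) for i in range(n_bits))

    samples = [x_from_tosses() for _ in range(100_000)]
    for a, b in [(0.0, 0.25), (0.25, 0.5), (0.5, 1.0)]:
        frac = sum(a <= x <= b for x in samples) / len(samples)
        print((a, b), frac)   # empirical mass of [a, b] is close to b - a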

Consider the method of obtaining a standard normal random variable by the probability integral
transform method in Definition (621) - another way we can construct such standard normals is to take
Ω = R, F = B(R), and take P to be the probability measure on R satisfying P[a, b] = ∫ₐᵇ φ(x)dx under
conditions −∞ < a ≤ b < ∞, and take X(ω) = ω for all ω ∈ R. While this method can be used to
construct a random variable with any desired distribution, it is less useful when we want to have multiple
random variables, each with specified distributions and dependencies among the random variables. In
such cases, we need to construct a single probability space on which all of the random variables of interest
are defined.

Exercise 549. Construct a standard random normal on infinite coin toss space as in Exercise 539 where
p-head is 0.50.
Proof. First construct a uniform random variable X as in Exercise 548. Then, with Φ(x) = ∫_{−∞}^x (1/√(2π)) exp{−ξ²/2} dξ,
define Z = Φ⁻¹(X). We have P(Z ≤ a) = P(Φ⁻¹(X) ≤ a) = P(X ≤ Φ(a)) = Φ(a), so Z is a standard
normal on the infinite coin toss space.

Exercise 550. Define a sequence of random variables {Zn}_{n=1}^∞ on the infinite coin toss space Ω∞ as in
Exercise 539 such that
lim Zn (ω) = Z(ω) ∀ω ∈ Ω∞
n→∞

and for each n, Zn depends only the first n number of coin tosses.

Proof. Consider the random variable Xn = Σ_{i=1}^n (1/2ⁱ) 1{ωi = H}; then Xn(ω) → X(ω) for all ω ∈ Ω∞, where
X = Σ_{n=1}^∞ 1{ωn = H}/2ⁿ. Then Zn = Φ⁻¹(Xn) → Z = Φ⁻¹(X) for every ω ∈ Ω∞, and Zn depends only on the
first n coin tosses. {Zn}_{n≥1} is the desired sequence.
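Using the standard library's NormalDist for Φ⁻¹, the inverse-transform construction of the last two exercises can be spot-checked by simulation (a sketch; the sample size is arbitrary):

    from random import random
    from statistics import NormalDist, fmean

    # Z = Phi^{-1}(X) for uniform X, as in Exercise 549; NormalDist().inv_cdf is the
    # standard library inverse of the standard normal cdf.
    phi_inv = NormalDist().inv_cdf
    zs = [phi_inv(random()) for _ in range(100_000)]
    print(fmean(zs), fmean(z * z for z in zs))   # approximately 0 and 1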

In the discrete case the c.d.f is a non-decreasing step function with jumps at points of non-zero
probability mass. In the continuous case it is a monotonically non-decreasing continuous function.

Result 31 (Some properties of the Cumulative Density Function). For a continuous random variable Y
with c.d.f FY(y) the following statements are satisfied: (i) P(Y > s) = 1 − FY(s), (ii) P(r < Y ≤ s) =
FY(s) − FY(r), (iii) lim_{y→∞} FY(y) = 1, (iv) lim_{y→−∞} FY(y) = 0. If the density exists then
(v) d/dy FY(y) = fY(y).

6.3.1 Expectations of Random Variables
The expectation value of a random variable can be seen as its central tendency or centre of gravity.
Let X be a random variable defined on probability space (Ω, F, P), a location statistic would be the
‘average value’. If Ω is finite, then this value is
X
E[X] = X(ω)P(ω)
ω∈Ω

and if Ω is countably infinite, such that its elements may be listed in sequence, then

X
E[X] = X(ωk )P(ωk )
k=1

where the k indexes the elements in sequence. However, when Ω is uncountably infinite, then we cannot
define an uncountable sum and integrals must be considered. For continuous random variable Y with
density function fY (y) we shall have
Z ∞
E[Y ] = µY = y · fY (y)dy.
−∞

By writing the above, we assumed that both the sum and integral terms defined converge absolutely,
such that Σ_k |k| pX(k) < ∞ and ∫_{−∞}^∞ |y| fY(y)dy < ∞. We may want a more rigorous treatment of
these formulations, and employ upon measure theoretic arguments. In general we should not take it for
granted that these integrals are well defined, and we need to consider their Lebesgue forms.

6.3.2 Riemann and Lebesgue Integrals


Definition 287 (Riemann Integral). If f(x) is a continuous function defined for all x inside the closed interval
[a, b], then the Riemann integral ∫ₐᵇ f(x)dx is constructed as follows. First partition [a, b] into subintervals
[x0, x1], [x1, x2], · · · , [xn−1, xn] where a = x0 < x1 < · · · < xn = b. Then denote ‖Π‖ = max_{1≤k≤n}(xk −
xk−1), the length of the longest subinterval in the partition. For each subinterval [xk−1, xk], define
Mk = max_{xk−1≤x≤xk} f(x), mk = min_{xk−1≤x≤xk} f(x). The upper/lower Riemann sums are defined

    RS⁺_Π(f) = Σ_{k=1}^n Mk(xk − xk−1),

    RS⁻_Π(f) = Σ_{k=1}^n mk(xk − xk−1),

respectively. The Riemann integral is defined as the common limit of the upper and lower Riemann sums,
which converge to the same value as ‖Π‖ → 0.

The issue with the Riemann integral can be summarized as follows: when using Riemann to compute
expectations, the random variable, say X, is often a function of ω ∈ Ω, which can be some abstract space
of objects. There is no natural partitioning order, and therefore we need to partition instead the y-axis.

Definition 288 (Lebesgue Integral). Assume for a moment that 0 ≤ X(ω) < ∞ for every ω ∈ Ω, and
let Π = {y0 , y1 , · · · } where 0 = y0 < y1 < y2 · · · . For each subinterval [yk , yk+1 ], set

Ak = {ω ∈ Ω : yk ≤ X(ω) < yk+1 }

and define the lower Lebesgue sum to be

    LS⁻_Π(X) = Σ_{k=1}^∞ yk P(Ak),

and this converges as ‖Π‖ → 0. This convergence limit is known as the Lebesgue integral

    ∫_Ω X(ω)dP(ω) = ∫_Ω X dP.

Now consider a general random variable: if the set of ω on which X takes negative values has zero probability,
then we are done. If P(ω : X(ω) ≥ 0) = 1 and P(ω : X(ω) = ∞) > 0 then ∫_Ω X dP = ∞. Finally,
we can also decompose X into its positive and negative components, giving X⁺(ω) = max{X(ω), 0},
X⁻(ω) = max{−X(ω), 0}. Both X⁺, X⁻ are nonnegative, with X = X⁺ − X⁻ and |X| = X⁺ + X⁻. Their
integrals are now well defined by the Lebesgue construction, and provided they are not both ∞, we let

    ∫_Ω X(ω)dP(ω) = ∫_Ω X⁺(ω)dP(ω) − ∫_Ω X⁻(ω)dP(ω).

If they are both finite, then X is integrable. If only one is finite, then the infinite term
dominates. If both are infinite, then ∫_Ω X(ω)dP(ω) is not defined.
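As a numerical illustration of the lower Lebesgue sum, the following Python sketch partitions the y-axis for the random variable X(ω) = ω² on Ω = [0, 1] with the uniform Lebesgue measure (the example and partition size are our own choices); the sum approaches E[X] = 1/3 from below as the partition refines.

    def lower_lebesgue_sum(n_levels=10_000):
        """Lower Lebesgue sum for X(omega) = omega**2 on [0, 1] with the uniform measure.

        A_k = {omega : y_k <= X(omega) < y_{k+1}} has measure sqrt(y_{k+1}) - sqrt(y_k),
        so the sum approximates E[X] = 1/3 from below as the partition refines.
        """
        ys = [k / n_levels for k in range(n_levels + 1)]   # partition of the y-axis, 0 = y_0 < ... < 1
        total = 0.0
        for k in range(n_levels):
            p_ak = ys[k + 1] ** 0.5 - ys[k] ** 0.5         # P(A_k) under the Lebesgue measure on [0, 1]
            total += ys[k] * p_ak
        return total

    print(lower_lebesgue_sum())   # close to, and below, E[X] = 1/3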

Theorem 339 (Properties of Lebesgue Integral). Let X be a random variable on probability space
(Ω, F, P), and Y be another random variable on the same probability space.

(1) If X takes only finitely many values y0, y1, · · · , yn then

    ∫_Ω X(ω)dP(ω) = Σ_{k=0}^n yk P(X = yk).

(2) Integrability: the random variable X is integrable iff ∫_Ω |X(ω)|dP(ω) < ∞.

(3) Comparison: If X ≤ Y almost surely, and if their integrals are well defined, then ∫_Ω X(ω)dP(ω) ≤
∫_Ω Y(ω)dP(ω). If X = Y almost surely, then the inequality becomes an equality.

(4) Linearity: If α, β are real constants and X, Y are integrable, or if α, β are nonnegative constants
and X, Y are nonnegative, then ∫_Ω (αX(ω) + βY(ω))dP(ω) = α ∫_Ω X(ω)dP(ω) + β ∫_Ω Y(ω)dP(ω).

Proof. -

(1) Consider the case when X is almost surely nonnegative. If zero is not one of the yk's, then add
y0 = 0 to the list, and label the yk's such that 0 = y0 < y1 < · · · < yn. Using the y's as partition points, let
Ak = {yk ≤ X < yk+1} = {X = yk}; the lower Lebesgue sum then equals Σ_{k=0}^n yk P(X = yk). Adding
partition points does not affect this sum and we are done upon taking the convergence limit.

(2) Consider the comparison and linearity properties. Since |X| = X⁺ + X⁻, we have X⁺, X⁻ ≤ |X|. If
∫_Ω |X(ω)|dP(ω) < ∞ then ∫_Ω X⁺(ω)dP(ω), ∫_Ω X⁻(ω)dP(ω) < ∞ by the comparison property and X is
integrable by definition (both finite). The other direction is a direct result of the linearity property,
with α, β = 1.

(3) If X ≤ Y almost surely, then X⁺ ≤ Y⁺ and X⁻ ≥ Y⁻ almost surely. For every partition Π, the lower
Lebesgue sums satisfy ∫_Ω X⁺(ω)dP(ω) ≤ ∫_Ω Y⁺(ω)dP(ω) and ∫_Ω X⁻(ω)dP(ω) ≥ ∫_Ω Y⁻(ω)dP(ω). The
proof follows from the decomposition of a random variable W = W⁺ − W⁻.

(4) -

Corollary 10 (Integration of a Random Variable X over a subset of the sample space Ω). We can easily
define the integration of a random variable over a subset A ⊂ Ω using the indicator function:

    ∫_A X(ω)dP(ω) = ∫_Ω 1_A(ω)X(ω)dP(ω)    ∀A ∈ F,

where 1_A(ω) is the indicator function of ω ∈ A. If A, B are disjoint sets in F, then 1_A + 1_B = 1_{A∪B} and by
linearity (see Theorem 339),

    ∫_{A∪B} X(ω)dP(ω) = ∫_A X(ω)dP(ω) + ∫_B X(ω)dP(ω).

Definition 289 (Expectation of a Random Variable). The expectation of X on the probability space (Ω, F, P)
is defined

    E(X) = ∫_Ω X(ω)dP(ω).

This is well defined if at least one of EX⁺, EX⁻ is finite. If both EX⁺ and EX⁻ are finite, then E(X)
is finite and X is integrable; if only one is infinite, then the expectation is infinite with sign determined
by the dominant term; if both are infinite, then the expectation is not defined.

Theorem 340 (Properties of Expectations). Let X be a random variable on space (Ω, F, P), and Y be
another random variable on the same probability space. Then,
Pn
(1) If X takes finitely many values x0 , x1 · · · xn then E(X) = k=0 xk P(X = xk ), and if Ω is finite,
P
then this is equivalent to w∈Ω X(ω)P(ω).

(2) Integrability: X is integrable iff E|X| < ∞.

(3) If X ≤ Y almost surely, and both are either integrable or almost surely nonnegative, then EX ≤ EY .
In particular, if X = Y almost surely, and one of the random variables are integrable or almost surely
nonnegative, then both are integrable or almost surely nonnegative and EX = EY .

(4) If α, β ∈ R, X, Y integrable or α, β > 0, X, Y nonnegative, then E(αX + βY ) = αEX + βEY .

(5) Jensen’s Inequality holds, such that if ϕ is convex, real valued function on R, E|X| < ∞ then
ϕ(EX) ≤ Eϕ(X)

Exercise 551. In Lebesgue integrals, the order of integration can be reversed (much like in Riemann
integrals). The assumption required is that the function being integrated is either nonnegative or integrable.
Let X be a nonnegative random variable with cumulative distribution function F(x) = P(X ≤ x). Show
that E(X) = ∫₀^∞ (1 − F(x))dx by showing that ∫_Ω ∫₀^∞ 1_{[0,X(ω)]}(x) dx dP(ω) is equal to both E(X)
and ∫₀^∞ (1 − F(x))dx.

Proof. By reversing the order of integration, we have

    ∫_Ω ∫₀^∞ 1_{[0,X(ω)]}(x) dx dP(ω) = ∫₀^∞ ∫_Ω 1_{[0,X(ω)]}(x) dP(ω) dx.

The LHS evaluates to ∫_Ω ∫₀^{X(ω)} dx dP(ω) = ∫_Ω X(ω)dP(ω) = E(X) and the RHS evaluates to
∫₀^∞ ∫_Ω 1_{[0,X(ω)]}(x) dP(ω) dx = ∫₀^∞ P(x < X)dx = ∫₀^∞ (1 − F(x))dx. The result follows.
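The identity can be sanity-checked numerically; the sketch below (using an exponential random variable as an arbitrary example) compares the sample mean, the Riemann-sum integral of 1 − F, and the true mean:

    from math import exp
    from random import expovariate
    from statistics import fmean

    lam = 2.0                                    # Exponential(rate = 2), E[X] = 1/lam = 0.5
    xs = [expovariate(lam) for _ in range(200_000)]

    # Riemann sum of the survival function 1 - F(x) = exp(-lam*x) over [0, 20]
    dx = 1e-3
    tail_integral = sum(exp(-lam * k * dx) * dx for k in range(20_000))

    print(fmean(xs), tail_integral, 1 / lam)     # sample mean, integral of 1 - F, true mean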

Exercise 552. A postman goes rogue and random shuffles n letters into n mailboxes. What is the
expected number of letters correctly mailed?

Proof. Each letter getting correctly mailed is Bernoulli with p = 1/n. Then by linearity of expectation the
total number of letters correctly mailed is EX = E[Σ Xi] = Σ EXi = n · (1/n) = 1.

Theorem 341 (Jensen’s Inequality). If f is a convex real valued function, and X is a finite random
variable then
f (E(X)) ≤ E(f (X)).

Exercise 553. Let u ∈ R and consider the convex function ϕ(x) = exp{ux} for all x ∈ R. Let X be a
normal random variable with µ = EX, σ² = E[(X − µ)²]. Verify that E exp{uX} = exp{uµ + ½u²σ²} and
that Jensen's inequality holds.

Proof. See from Theorem 377 that E[exp{uX}] = exp{uµ + σ²u²/2}. It follows that Eϕ(X) =
E exp{uX} = exp{uµ + u²σ²/2} ≥ exp{uµ} = ϕ(EX).
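A Monte Carlo spot check of both the moment formula and Jensen's inequality (parameter values chosen arbitrarily):

    from math import exp
    from random import gauss
    from statistics import fmean

    u, mu, sigma = 0.7, 1.0, 2.0
    xs = [gauss(mu, sigma) for _ in range(500_000)]

    lhs = fmean(exp(u * x) for x in xs)               # Monte Carlo estimate of E[exp(uX)]
    rhs = exp(u * mu + 0.5 * u ** 2 * sigma ** 2)     # closed form exp(u*mu + u^2 sigma^2 / 2)
    print(lhs, rhs, exp(u * mu))                      # lhs ~ rhs, both >= exp(u*E[X]) (Jensen)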

Definition 290 (Lebesgue Measure on R). Let B(R) be the σ-algebra of Borel subsets of R. The Lebesgue
measure on R, denoted L assigns to each set B ∈ B(R) a number in [0, ∞), or the value ∞ such that (i)
L[a, b] = b − a when a ≤ b and (ii) if B1, B2, B3, · · · is a sequence of disjoint subsets in B(R), the Lebesgue
measure satisfies countable additivity (i.e. L(∪_{n=1}^∞ Bn) = Σ_{n=1}^∞ L(Bn)).
The definition is similar to the one on probability measures (see Definition 267) except some sets can
have measure greater than one. The Lebesgue measure of every interval is the length, so that R, half-lines
[a, ∞), (−∞, b] have infinite Lebesgue measure, while single points have measure zero. The measure of
the empty set is zero. Finite additivity follows, whenever the Borel subsets are disjoint, with similar
proof.

Definition 291 (Borel Measurable Functions). Let f (x) be a real valued function on R, and assume that
for every Borel subset B of R, the set {x : f (x) ∈ B} is also a Borel subset of R. Then, this function is
said to be Borel measurable.

Every continuous and piecewise continuous function is Borel measurable. It is quite difficult to find
functions that are not Borel measurable. In the discussion of stochastic calculus, we assume that all
functions are Borel measurable. We want to define the Lebesgue integral ∫_R f(x)dL(x) of f over R.
Assume for a moment that f(x) ≥ 0, x ∈ R. Choose a partition Π = {y0, y1, · · · } where 0 = y0 < y1 < y2 < · · · .
For each subinterval [yk, yk+1), define Bk = {x ∈ R : yk ≤ f(x) < yk+1}. Since we assumed f was Borel
measurable, these are Borel subsets of R and their Lebesgue measures are defined. Define the lower
Lebesgue sum

    LS⁻_Π(f) = Σ_{k=1}^∞ yk L(Bk).

As ‖Π‖ → 0, the Lebesgue sums converge to a limit, which we take to be ∫_R f(x)dL(x). For a general f, if
the set of x where f(x) is negative has zero Lebesgue measure, then we are done. If L{x ∈ R : f(x) < 0} = 0
and L{x ∈ R : f(x) = ∞} > 0 then the integral is ∞. If f(x) can take both positive and negative values,
then define
f + (x) = max{f (x), 0}

f − (x) = max{−f (x), 0}

and both terms are nonnegative. Let

    ∫_R f(x)dL(x) = ∫_R f⁺(x)dL(x) − ∫_R f⁻(x)dL(x),

which is defined provided both terms are not infinite. If both are infinite, then it is again not well defined,
and if both are finite, then f is integrable. This occurs iff ∫_R |f(x)|dL(x) < ∞. The integral shares the
properties of the Lebesgue integral (see Theorem 339), and if f takes finitely many values y0, y1, · · · , yn we
have ∫_R f(x)dL(x) = Σ_{k=0}^n yk L{x ∈ R : f(x) = yk}.
Again, if we want to integrate f(x) over a subset of R, such as some B ∈ B(R), we define ∫_B f(x)dL(x) =
∫_R 1_B(x)f(x)dL(x) with the usual interpretations.

Exercise 554 (Difference between the Riemann and Lebesgue Integrals). Let Ω = [0, 1], and P be the
Lebesgue measure on Ω (see Definition 269). Now consider the random variable X(ω) = 1(ω ∈ P), the
indicator function that ω is irrational. The random variables take only two values, and its expectation is
computed
EX = 1 · P(ω ∈ [0, 1] : ω ∈ P) + 0 · P(ω ∈ [0, 1] : ω ∈ Q)
. Since there are only countably many rational numbers in [0, 1] (they may be listed in sequence) and
each number in the sequence has probability zero, the whole sequence has probability zero by countable
additivity. ω is irrational almost surely and EX = 1. The random variable is almost surely equal to one.
If we had been working with the Riemann integral rather than the Lebesgue integral, we would arrive at a
different conclusion. Intuitively, each partition subinterval contains both irrational and rational numbers,
making Mk = 1, mk = 0. The upper Riemann sum equates to one, and the lower Riemann sum equates to zero, no
matter the size of our subintervals. There is no convergence limit since the function is too discontinuous.

Result 32 (Formalisation of the Differences between the Riemann and Lebesgue Integrals). Let f be a
bounded function on R, and let a < b ∈ R. Then

(1) The Riemann integral ∫ₐᵇ f(x)dx is defined (meaning the upper and lower Riemann sums converge) iff
the set of points x ∈ [a, b] where f(x) is not continuous has Lebesgue measure zero.

(2) If the Riemann integral ∫ₐᵇ f(x)dx is defined, then f is Borel measurable, the Lebesgue integral
∫_{[a,b]} f(x)dL(x) is defined, and their integrals agree.

Definition 292 (Holds Almost Everywhere). If the set of numbers in R failing some property has
Lebesgue measure zero, then we say the property holds almost everywhere.
Then, the Riemann integral ∫ₐᵇ f(x)dx exists iff f(x) is almost everywhere continuous on [a, b]. Since
the Riemann and Lebesgue integrals agree when the Riemann integral is defined, we shall use the
notation ∫ₐᵇ f(x)dx to denote the Lebesgue integral instead of ∫_{[a,b]} f(x)dL(x), although we should be
mindful that the integrals are Lebesgue in most discussions. When developing theory, the integral is often
Lebesgue, and when computing numbers, the techniques employed are Riemann. If we are integrating
over a set B, we use the notation ∫_B f(x)dx.

6.3.3 Convergence of Integrals


Definition 293 (Almost Sure Convergence). Let X1 , X2 , · · · be a sequence of random variables, all
defined on the same probability space (Ω, F, P). Let X be another random variable defined on the same
space, and we say $X_1, X_2, \cdots$ converges to X almost surely, denoted
$$\lim_{n\to\infty} X_n = X \quad \text{almost surely},$$
if the set of ω ∈ Ω for which the sequence $X_1(\omega), X_2(\omega), \cdots$ has limit $X(\omega)$ is a set of probability one. Equivalently, we can say that the set of ω ∈ Ω for which the sequence does not converge to $X(\omega)$ is a set with zero probability.
Exercise 555 (Convergence Application on the Strong Law of Large Numbers). The Strong LLN is a case of almost sure convergence. Consider the infinite coin toss space $\Omega_\infty$, with probability measure corresponding to p = 1/2 on heads and tails, and random variables $Y_k(\omega) = \mathbf{1}\{\omega_k = H\}$, $H_n = \sum_{k=1}^{n} Y_k$, such that $H_n$ is just the number of heads obtained in the first n tosses. The Strong LLN asserts that $\lim_{n\to\infty} \frac{H_n}{n} = \frac{1}{2}$ almost surely.
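A quick simulation (a Python sketch of our own; the seed and toss counts are arbitrary choices) exhibits one sample path of $H_n/n$ settling toward 1/2:

```python
import random

random.seed(0)

def head_fraction_path(n_tosses: int) -> list[float]:
    """One sample path of H_n / n for a fair coin, n = 1, ..., n_tosses."""
    heads, path = 0, []
    for n in range(1, n_tosses + 1):
        heads += random.random() < 0.5   # Y_k = 1{omega_k = H}
        path.append(heads / n)
    return path

path = head_fraction_path(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}:  H_n / n = {path[n - 1]:.4f}")
# The Strong LLN says this path converges to 1/2 with probability one; one run
# of course only exhibits a single omega drawn from the coin toss space.
```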
Definition 294 (Convergence of Functions). Let f1 , f2 · · · be a sequence of real valued, Borel measurable
functions defined on R. Let f be another Borel measurable function defined on R. We say f1 , f2 · · ·
converge to f almost everywhere and write
$$\lim_{n\to\infty} f_n = f \quad \text{almost everywhere}$$
if the set of $x \in \mathbb{R}$ for which the sequence of numbers $f_1(x), f_2(x), \cdots$ does not have limit $f(x)$ is a set of Lebesgue measure zero.
Exercise 556 (Almost Everywhere Convergence of Normal Densities). Consider the sequence of normal densities with mean zero and variance $\frac{1}{n}$ at the n-th index, so that $f_n(x) = \sqrt{\frac{n}{2\pi}}\exp\{-\frac{nx^2}{2}\}$. If $x \neq 0$, then $\lim_{n\to\infty} f_n(x) = 0$, but $\lim_{n\to\infty} f_n(0) = \lim_{n\to\infty}\sqrt{\frac{n}{2\pi}} = \infty$, and we can say that the sequence $f_i, i = 1, 2, \cdots$ converges everywhere to the function
$$f^{*}(x) = \begin{cases} 0 & \text{if } x \neq 0, \\ \infty & \text{if } x = 0, \end{cases} \qquad (2362)$$
and is said to converge almost everywhere to the zero function $f(x) = 0$ for all $x \in \mathbb{R}$. The set of x where convergence to $f(x)$ fails is the singleton $\{0\}$, which has zero Lebesgue measure.
Often, when random variables converge almost surely, their expectations converge to the expectation of the limiting random variable. Likewise, functions that converge almost everywhere often have Lebesgue integrals that converge to the Lebesgue integral of the limiting function. This is not always the case.
Exercise 557 (Differences between the Integral of a Converging Function and the Integral of its Limiting Function). Recall Example (556), for which we have a sequence of normal densities with $\int_{-\infty}^{\infty} f_n(x)\,dx = 1$ for every n, but the almost everywhere limit function f is the identically zero function. It would not help to use the everywhere limit function $f^{*}(x)$, since any two functions that differ only on a set of zero Lebesgue measure must share the same integral value. That is, $\int_{-\infty}^{\infty} f^{*}(x)\,dx = \int_{-\infty}^{\infty} f(x)\,dx = 0$. It also does not help to replace the functions $f_n$ by the functions
$$g_n(x) = \begin{cases} f_n(x) & \text{if } x \neq 0, \\ 0 & \text{if } x = 0. \end{cases} \qquad (2363)$$
Then the sequence $g_1, g_2, \cdots$ converges to 0 everywhere, while the integrals agree with those of the $f_n$, so $\int_{-\infty}^{\infty} g_n(x)\,dx = 1$ as defined. This leads us to the conclusion that
$$\lim_{n\to\infty}\int_{-\infty}^{\infty} f_n(x)\,dx \neq \int_{-\infty}^{\infty}\lim_{n\to\infty} f_n(x)\,dx,$$
since $1 \neq 0$. Under the Riemann integral, the upper Riemann sums for $f^{*}$ are infinite while the lower Riemann sums for $f^{*}$ are zero, so its Riemann integral is not even defined.

In order for the integral of a sequence of functions to converge to the integral of the limiting function, we require some conditions. One such class consists of functions that are nonnegative and converge to the limit from below. If we think of an integral as the area under a curve, the assumption is that as we go farther out in the sequence, we continue to add area and never take it away. Then the area under the curve of the limiting function is the limit of the areas under the functions in the sequence. This is known as monotone convergence.

Theorem 342 (Monotone Convergence). Let $X_1, X_2, \cdots$ be a random variable sequence that converges almost surely to X. If $0 \leq X_1 \leq X_2 \leq \cdots$ almost surely, then $\lim_{n\to\infty} EX_n = EX$. Similarly, for $f_1, f_2, \cdots$ Borel-measurable functions converging almost everywhere to f on $\mathbb{R}$ with $0 \leq f_1 \leq f_2 \leq f_3 \leq \cdots$ almost everywhere, we have $\lim_{n\to\infty}\int_{-\infty}^{\infty} f_n(x)\,dx = \int_{-\infty}^{\infty} f(x)\,dx$.

Corollary 11 (Monotone Convergence to Prove Expectation Values under Countable Ranges). Suppose X is a nonnegative random variable taking countably many values $x_0, x_1, \cdots$; then $EX = \sum_{k=0}^{\infty} x_k P(X = x_k)$. For X with both positive and negative ranges we can perform the transformation $X = X^{+} - X^{-}$ as usual, and the result follows (provided, of course, that we do not end up in an $\infty - \infty$ situation).

Proof. Let $A_k = \{X = x_k\}$ such that $X = \sum_{k=0}^{\infty} x_k\mathbf{1}_{A_k}$. Define $X_n = \sum_{k=0}^{n} x_k\mathbf{1}_{A_k}$. Then $0 \leq X_1 \leq X_2 \leq \cdots$, $\lim_{n\to\infty} X_n = X$ surely, and by Monotone Convergence (Theorem (342))
$$EX = \lim_{n\to\infty} EX_n = \lim_{n\to\infty}\sum_{k=0}^{n} x_k P\{X = x_k\} = \sum_{k=0}^{\infty} x_k P\{X = x_k\}.$$
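For a concrete (and purely illustrative) instance of the corollary, the following Python sketch truncates a Poisson(λ) random variable at n and shows the partial expectations climbing monotonically toward $EX = \lambda$; the choice of distribution and of λ is ours.

```python
import math

lam = 3.0

def truncated_expectation(n: int) -> float:
    """E[X_n] where X_n = sum_{k=0}^{n} k * 1{X = k} for X ~ Poisson(lam)."""
    return sum(k * math.exp(-lam) * lam ** k / math.factorial(k) for k in range(n + 1))

for n in (1, 2, 5, 10, 20, 40):
    print(f"n = {n:>2}:  E[X_n] = {truncated_expectation(n):.6f}")
# The values increase monotonically to EX = lam = 3, exactly as Monotone
# Convergence predicts for the increasing truncations X_n of X.
```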

Definition 295 (Pointwise Convergence). Suppose that X is a set and Y is a topological space, such as
R or C. Then the net or sequence of functions (fn ), all having the same domain X and codomain Y is
said to converge pointwise to a function $f : X \to Y$, denoted
$$\lim_{n\to\infty} f_n = f \ \text{pointwise} \iff \forall x \in X,\ \lim_{n\to\infty} f_n(x) = f(x).$$

Exercise 558 (Differences between Pointwise Convergence and Almost Everywhere Convergence). Let
{fn } be a sequence of measurable functions fn : X → R, where X is some measure space with measure
µ. Let f : X → R be another measurable function. The sequence {fn } converges to f pointwise if for
each x ∈ X, the sequence of real numbers $f_n(x) \to f(x)$; this notion does not involve the measure µ. The sequence
{fn } converges to f almost everywhere/almost surely/pointwise almost everywhere if there exists B ⊂ X
with measure µ(B) = 0 such that {fn } converges to f pointwise on X\B. Pointwise convergence implies
almost everywhere convergence (by taking B = ∅).

Theorem 343 (Dominated Convergence). Let $X_1, X_2, \cdots$ be a sequence of random variables converging almost surely (pointwise convergence also suffices) to a random variable X. If there is another random variable Y such that $EY < \infty$ and $|X_n| \leq Y$ almost surely for all n, then
$$\lim_{n\to\infty} EX_n = EX.$$
Similarly, if $f_1, f_2, \cdots$ is a sequence of Borel-measurable functions on $\mathbb{R}$ converging almost everywhere to a function f, and there exists g such that $\int_{-\infty}^{\infty} g(x)\,dx < \infty$ and $|f_n| \leq g$ almost everywhere for all n, then
$$\lim_{n\to\infty}\int_{-\infty}^{\infty} f_n(x)\,dx = \int_{-\infty}^{\infty} f(x)\,dx.$$

Exercise 559 (Saint Petersburg Paradox). Consider the infinite coin toss space with fair flip probabilities, and a random variable X with $X(\omega) = 2^k$ if $\omega_1 = \cdots = \omega_{k-1} = T$ and $\omega_k = H$; that is, X pays 2 if the first toss is a head, 4 if the first head appears on the second toss, 8 if on the third, and so on. Then $X(\omega)$ is defined for every sequence of coin tosses except the one with all tails, for which we define $X(TTT\cdots) = \infty$. We have $P(X = \infty) = 0$, so X is almost surely finite. By Corollary (11), $EX = 2\cdot P(X = 2) + 4\cdot P(X = 4) + 8\cdot P(X = 8) + \cdots = 1 + 1 + 1 + \cdots = \infty$. Even though X is almost surely finite, its expectation is infinite.
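The paradox is easy to observe by simulation (our own Python sketch; the seed and sample sizes are arbitrary):

```python
import random

random.seed(1)

def petersburg_payout() -> float:
    """Pay 2^k when the first head of a fair coin occurs on toss k."""
    k = 1
    while random.random() < 0.5:   # tails, each with probability 1/2
        k += 1
    return 2.0 ** k

for n in (10**3, 10**5, 10**6):
    mean = sum(petersburg_payout() for _ in range(n)) / n
    print(f"n = {n:>9}:  sample mean = {mean:10.2f}")
# Every individual payout is finite, yet the sample means keep drifting upward
# as n grows instead of settling down -- the simulation signature of EX = infinity.
```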

Exercise 560. For each $n \in \mathbb{Z}^+$, define $f_n$ to be the normal density of $\Phi(0, n)$, such that $f_n(x) = \frac{1}{\sqrt{2n\pi}}\exp\{-\frac{x^2}{2n}\}$. Then,

1. What is the limiting function $f(x) = \lim_{n\to\infty} f_n(x)$?

2. Find $\lim_{n\to\infty}\int_{-\infty}^{\infty} f_n(x)\,dx$.

3. Explain why the inequality $\lim_{n\to\infty}\int_{-\infty}^{\infty} f_n(x)\,dx \neq \int_{-\infty}^{\infty} f(x)\,dx$ violates neither the Monotone Convergence Theorem nor the Dominated Convergence Theorem (see Theorems 342 and 343).

Proof. -

1. It is easy to see that $|f_n(x)| \leq \frac{1}{\sqrt{2n\pi}} \to 0$, and hence $f(x) = 0$ for every x.

2. By the definition of density functions this integral equals one for every n, so the limit is one.

3. The sequence $f_n$ does not approach the limiting function f from below (indeed $f_n > 0 = f$), so the Monotone Convergence Theorem does not apply. For the Dominated Convergence Theorem we would need an integrable function g with $|f_n| \leq g$ for all n; but $\sup_n f_n(x)$ decays only like $1/|x|$ for large $|x|$ (the supremum over n is attained near $n = x^2$), which is not integrable, so there does not exist an integrable function that dominates the sequence $(f_n)$ for all n.
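The failure of the interchange can be seen numerically with the following sketch (our own; it assumes SciPy is available for the quadrature, and the values of n are arbitrary):

```python
import math
from scipy.integrate import quad  # assumes SciPy is installed

def f(n: int):
    """Density of Phi(0, n): variance n, flattening out as n grows."""
    return lambda x: math.exp(-x * x / (2 * n)) / math.sqrt(2 * math.pi * n)

for n in (1, 10, 100, 1000):
    integral, _ = quad(f(n), -math.inf, math.inf)
    print(f"n = {n:>5}:  integral = {integral:.6f},  sup f_n = {f(n)(0.0):.6f}")
# The integral stays at 1 for every n while sup f_n -> 0, so
# lim ∫ f_n = 1 != 0 = ∫ lim f_n; neither MCT (no monotonicity from below)
# nor DCT (no integrable dominating function) applies.
```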

6.3.4 Computing Expectations


Here we develop the theory behind the relationship between integrals over Ω and integrals over $\mathbb{R}$. Recall that the distribution measure of X (see Definition 281) is
$$\mu_X(B) = P(X \in B) \quad \forall B \in \mathcal{B}(\mathbb{R}).$$
Since $\mu_X$ is a probability measure on $\mathbb{R}$, we can use it to integrate functions over $\mathbb{R}$.

Theorem 344 (Relating Integrals over Abstract Spaces to Integrals over R; Introduction of the Standard Machine, from Shreve [19]). Let X be a random variable on a probability space (Ω, F, P) and g be a Borel-measurable function on $\mathbb{R}$. Then
$$E|g(X)| = \int_{\mathbb{R}} |g(x)|\,d\mu_X(x),$$
and if this is finite, then
$$Eg(X) = \int_{\mathbb{R}} g(x)\,d\mu_X(x).$$

Proof. -
Step 1: Indicator Functions. Suppose $g(x) = \mathbf{1}_B(x)$ is the indicator of a Borel subset B of $\mathbb{R}$; since it is nonnegative, the two claims coincide and it suffices to show
$$E\mathbf{1}_B(X) = \int_{\mathbb{R}}\mathbf{1}_B(x)\,d\mu_X(x).$$
On the one hand, $E\mathbf{1}_B(X) = P(X \in B) = \mu_X(B)$. On the other hand, the function $\mathbf{1}_B(x)$ of the dummy variable x takes only the values zero and one, so setting $\Omega = \mathbb{R}$, $X = \mathbf{1}_B$, $P = \mu_X$ and using the formula for expectations of finitely-valued random variables, we obtain the integral
$$\int_{\mathbb{R}}\mathbf{1}_B(x)\,d\mu_X(x) = 1\cdot\mu_X\{x : \mathbf{1}_B(x) = 1\} = \mu_X(B),$$
and the two sides agree.
Step 2: Nonnegative Simple Functions. A simple function is a finite sum of indicator functions times constants, such as $g(x) = \sum_{k=1}^{n}\alpha_k\mathbf{1}_{B_k}(x)$, where $\alpha_k \geq 0$ and $B_k$, $k \in [n]$, are Borel subsets of $\mathbb{R}$. By linearity of integrals,
$$Eg(X) = E\sum_{k=1}^{n}\alpha_k\mathbf{1}_{B_k}(X) = \sum_{k=1}^{n}\alpha_k E\mathbf{1}_{B_k}(X) = \sum_{k=1}^{n}\alpha_k\int_{\mathbb{R}}\mathbf{1}_{B_k}(x)\,d\mu_X(x) \qquad (2364)$$
$$= \int_{\mathbb{R}}\left(\sum_{k=1}^{n}\alpha_k\mathbf{1}_{B_k}(x)\right)d\mu_X(x) = \int_{\mathbb{R}} g(x)\,d\mu_X(x), \qquad (2365)$$
hence $Eg(X) = \int_{\mathbb{R}} g(x)\,d\mu_X(x)$ when g is a nonnegative simple function.
Step 3: Nonnegative Borel-measurable Functions. Let g(x) be an arbitrary nonnegative Borel-measurable function on $\mathbb{R}$, and for each $n \in \mathbb{Z}^+$ define
$$B_{k,n} = \left\{x : \frac{k}{2^n} \leq g(x) < \frac{k+1}{2^n}\right\}, \qquad k = 0, 1, 2, \cdots, 4^n - 1.$$
In computing terms, these give successive approximations of g(x): its value is rounded down to a multiple of $2^{-n}$ and capped at $2^n$. For each fixed n, the sets $B_{0,n}, B_{1,n}, \cdots, B_{4^n-1,n}$ correspond to the partition points $0 < \frac{1}{2^n} < \frac{2}{2^n} < \cdots < \frac{4^n}{2^n} = 2^n$. At the next stage n + 1, the partition points include all points at stage n together with new partition points at the midpoints between the old ones. Therefore, the simple functions
$$g_n(x) = \sum_{k=0}^{4^n-1}\frac{k}{2^n}\mathbf{1}_{B_{k,n}}(x)$$
satisfy $0 \leq g_1 \leq g_2 \leq \cdots \leq g$ and are increasingly better approximations of g as n becomes larger. We have $\lim_{n\to\infty} g_n(x) = g(x)$ for all $x \in \mathbb{R}$, and we know from the previous step that $Eg_n(X) = \int_{\mathbb{R}} g_n(x)\,d\mu_X(x)$ for every n. Applying Monotone Convergence (see Theorem (342)) on both sides, we obtain
$$Eg(X) = \lim_{n\to\infty} Eg_n(X) = \lim_{n\to\infty}\int_{\mathbb{R}} g_n(x)\,d\mu_X(x) = \int_{\mathbb{R}} g(x)\,d\mu_X(x)$$
and we are done.
Step 4: General Borel-measurable Functions. Let g(x) be an arbitrary Borel-measurable function and write $g^{+}(x) = \max\{g(x), 0\}$, $g^{-}(x) = \max\{-g(x), 0\}$. Then from the previous step we know
$$Eg^{+}(X) = \int_{\mathbb{R}} g^{+}(x)\,d\mu_X(x), \qquad Eg^{-}(X) = \int_{\mathbb{R}} g^{-}(x)\,d\mu_X(x).$$
Adding them yields $E|g(X)| = \int_{\mathbb{R}}|g(x)|\,d\mu_X(x)$, and if this is finite then $Eg^{+}(X) = \int_{\mathbb{R}} g^{+}(x)\,d\mu_X(x) < \infty$ and $Eg^{-}(X) = \int_{\mathbb{R}} g^{-}(x)\,d\mu_X(x) < \infty$, so subtracting the two terms finishes the proof.
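The dyadic construction in Step 3 is easy to implement. The following Python sketch (our own; the choices $g(x) = x^2$ and $X \sim N(0,1)$ are purely illustrative) shows the sample averages of $g_n(X)$ increasing toward $E[g(X)] = 1$ on a fixed Monte Carlo sample:

```python
import math
import random

random.seed(2)

def g(x: float) -> float:
    return x * x                      # a nonnegative Borel-measurable test function

def g_n(x: float, n: int) -> float:
    """Dyadic simple-function approximation from Step 3 of the standard machine:
    round g(x) down to the nearest multiple of 2^-n and cap at 2^n."""
    value = g(x)
    if value >= 2.0 ** n:
        return 0.0                    # x lies outside every B_{k,n}, k <= 4^n - 1
    return math.floor(value * 2 ** n) / 2 ** n

# Fix one Monte Carlo sample of X ~ N(0, 1) and watch E[g_n(X)] increase to E[g(X)] = 1.
sample = [random.gauss(0.0, 1.0) for _ in range(200_000)]
exact = sum(g(x) for x in sample) / len(sample)
for n in (1, 2, 4, 8):
    approx = sum(g_n(x, n) for x in sample) / len(sample)
    print(f"n = {n}:  E[g_n(X)] ~= {approx:.4f}   (E[g(X)] ~= {exact:.4f})")
# Pathwise g_n <= g_{n+1} <= g, so the sample averages are exactly nondecreasing,
# mirroring the Monotone Convergence step of the proof.
```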

Exercise 561 (Moment Generating Functions). Let X be a nonnegative random variable, and assume $\varphi(t) = E[\exp\{tX\}] < \infty$ for all real t. Assume further that $E[X\exp\{tX\}] < \infty$ for all real t. We want to show that $\varphi'(t) = E[X\exp\{tX\}]$. Recall the definition of the derivative
$$\varphi'(t) = \lim_{s\to t}\frac{\varphi(t) - \varphi(s)}{t - s} = \lim_{s\to t}\frac{E\exp(tX) - E\exp(sX)}{t - s} = \lim_{s\to t}E\left[\frac{\exp(tX) - \exp(sX)}{t - s}\right].$$
The limit is taken over a continuous variable s, but we may choose a sequence of numbers $\{s_n\}_{n=1}^{\infty} \to t$ and compute
$$\lim_{s_n\to t}E\left[\frac{\exp(tX) - \exp(s_nX)}{t - s_n}\right],$$
taking the limit of expectations of the sequence of random variables $Y_n = \frac{\exp(tX) - \exp(s_nX)}{t - s_n}$. If it turns out that the limit is the same regardless of the choice of sequence converging to t, then this limit is the same as $\lim_{s\to t}E\left[\frac{\exp(tX)-\exp(sX)}{t-s}\right]$ and equals $\varphi'(t)$. The Mean Value Theorem (see Theorem 399) states that if f(t) is differentiable, then for all s, t there exists θ lying between s and t such that $f(t) - f(s) = f'(\theta)(t - s)$. Fixing $\omega \in \Omega$ and defining $f(t) = \exp(tX(\omega))$ we arrive at
$$\exp(tX(\omega)) - \exp(sX(\omega)) = (t - s)X(\omega)\exp\{\theta(\omega)X(\omega)\},$$
where $\theta(\omega)$ depends on ω and lies between s and t.

1. Use the Dominated Convergence Theorem (see Theorem (343)) to show that
$$\lim_{n\to\infty}EY_n = E\left[\lim_{n\to\infty}Y_n\right] = E[X\exp(tX)],$$
establishing that $\varphi'(t) = E[X\exp(tX)]$.

2. Suppose the random variable X can take both positive and negative values, with $E\exp(tX) < \infty$ and $E[|X|\exp(tX)] < \infty$ for all real t. Show again that $\varphi'(t) = E[X\exp(tX)]$.

Proof. -

1. First, using the result
$$\exp(tX(\omega)) - \exp(sX(\omega)) = (t - s)X(\omega)\exp\{\theta(\omega)X(\omega)\},$$
we obtain $|Y_n| = \left|\frac{\exp(tX)-\exp(s_nX)}{t-s_n}\right| = |X\exp(\theta_nX)| = X\exp(\theta_nX) \leq X\exp(\max(2|t|,1)X)$ for all sufficiently large n, since $X \geq 0$ and $\theta_n$ lies between $s_n$ and t with $s_n \to t$. The dominating random variable has finite expectation by assumption, so by the Dominated Convergence Theorem we obtain $\varphi'(t) = \lim_{n\to\infty}EY_n = E\lim_{n\to\infty}Y_n = E[X\exp(tX)]$.

2. We have $E[\exp(tX^{+})\mathbf{1}_{X\geq 0}] + E[\exp(-tX^{-})\mathbf{1}_{X<0}] = E\exp(tX) < \infty$ for all $t \in \mathbb{R}$, which means that $E\exp(t|X|) = E[\exp(tX^{+})\mathbf{1}_{X\geq 0}] + E[\exp(-(-t)X^{-})\mathbf{1}_{X<0}] < \infty$. Since $|Y_n| = |X\exp\{\theta_nX\}| \leq |X|\exp(\max(2|t|,1)|X|)$, which again has finite expectation, the Dominated Convergence Theorem gives $\varphi'(t) = \lim_{n\to\infty}EY_n = E\lim_{n\to\infty}Y_n = E[X\exp(tX)]$.
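A numerical check of the interchange (our own sketch; we take $X \sim U(0,1)$, which is nonnegative with all the required expectations finite, and the values of t, h and the sample size are arbitrary):

```python
import math
import random

random.seed(3)

# X ~ Uniform(0, 1): nonnegative with E exp(tX) and E[X exp(tX)] finite for all t.
sample = [random.random() for _ in range(500_000)]
t, h = 0.7, 1e-4

phi = lambda s: sum(math.exp(s * x) for x in sample) / len(sample)
finite_difference = (phi(t + h) - phi(t - h)) / (2 * h)          # numerical phi'(t)
direct = sum(x * math.exp(t * x) for x in sample) / len(sample)  # E[X exp(tX)]

print(f"central difference of phi at t = {t}: {finite_difference:.5f}")
print(f"E[X exp(tX)] on the same sample:      {direct:.5f}")
# The two estimates agree up to Monte Carlo and finite-difference error,
# illustrating the interchange of limit and expectation justified by DCT.
```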

What we have proved tells us that in order to compute the Lebesgue integral $Eg(X) = \int_{\Omega} g(X(\omega))\,dP(\omega)$ over the abstract space Ω, it suffices to compute the integral $\int_{\mathbb{R}} g(x)\,d\mu_X(x)$ over $\mathbb{R}$. Although the integral is still Lebesgue, we let the integrator be the distribution measure $\mu_X$ rather than the Lebesgue measure. To perform computations, we can express the distribution measure $\mu_X$ in its different forms. A simple case is when X takes finitely many values $x_0, \cdots, x_n$ with mass $P(X = x_k)$ at $x_k$; then we can write $Eg(X) = \int_{\mathbb{R}} g(x)\,\mu_X(dx) = \sum_{k=0}^{n} g(x_k)P(X = x_k)$. In the continuous case, X has a density: a nonnegative, Borel-measurable function f on $\mathbb{R}$ satisfying $\mu_X(B) = \int_B f(x)\,dx$ for all $B \in \mathcal{B}(\mathbb{R})$. This is often bounded and continuous or almost everywhere continuous, allowing us to use the Riemann integral to compute expectations.

Exercise 562. Suppose X is a random variable on a probability space (Ω, F, P) and let A ∈ F. If for all $B \in \mathcal{B}(\mathbb{R})$ we have
$$\int_A\mathbf{1}_B(X(\omega))\,dP(\omega) = P(A)P(X \in B),$$
then we say that X is independent of A. If $X \perp A$, show that for every nonnegative, Borel-measurable function g we have $\int_A g(X(\omega))\,dP(\omega) = P(A)\cdot Eg(X)$.

Proof. If g(x) can be expressed as indicator function 1B (x), B ∈ B(R) then the equality is immediate.
Since any nonnegative, Borel-measurable function g is the limit of an increasing sequence of simple
functions, the desired equality can be proved by Monotone Convergence Theorem by way of standard
machine (see Theorem 344).
Corollary 12. In particular, for discrete X and integrable g(X), the expectation is $Eg(X) = \sum_k g(k)p_X(k)$, and if X is continuous we obtain the analogous form $Eg(X) = \int_{-\infty}^{\infty} g(x)f(x)\,dx$.

Under measure theoretic settings we would use the standard machine (see proof in Theorem 344) to
get expectations of measurable functions of random variables.

Theorem 345 (Expectations of Functions over Random Variables). Let X be a random variable on (Ω, F, P) and g be a Borel-measurable function on $\mathbb{R}$. Furthermore, let f be the density function for X. Then $E|g(X)| = \int_{-\infty}^{\infty}|g(x)|f(x)\,dx$, and if this quantity is finite we have $Eg(X) = \int_{-\infty}^{\infty} g(x)f(x)\,dx$.

Proof. -
Step 1: If $g(x) = \mathbf{1}_B(x)$ then, since $g \geq 0$, both equations are the same and $E\mathbf{1}_B(X) = P(X \in B) = \mu_X(B) = \int_B f(x)\,dx = \int_{-\infty}^{\infty}\mathbf{1}_B(x)f(x)\,dx$.
Step 2: If $g(x) = \sum_{k=1}^{n}\alpha_k\mathbf{1}_{B_k}(x)$, then
$$Eg(X) = E\sum_{k=1}^{n}\alpha_k\mathbf{1}_{B_k} = \sum_{k=1}^{n}\alpha_kE\mathbf{1}_{B_k}(X) = \sum_{k=1}^{n}\alpha_k\int_{-\infty}^{\infty}\mathbf{1}_{B_k}(x)f(x)\,dx \qquad (2366)$$
$$= \int_{-\infty}^{\infty}\sum_{k=1}^{n}\alpha_k\mathbf{1}_{B_k}(x)f(x)\,dx = \int_{-\infty}^{\infty}g(x)f(x)\,dx. \qquad (2367)$$
Step 3: Construct a sequence of nonnegative simple functions $0 \leq g_1 \leq g_2 \leq \cdots \leq g$ such that $\lim_{n\to\infty} g_n(x) = g(x)$ for every $x \in \mathbb{R}$. We have shown
$$Eg_n(X) = \int_{-\infty}^{\infty} g_n(x)f(x)\,dx$$
for every n; hence letting $n \to \infty$ and using the Monotone Convergence Theorem (Theorem 342), we obtain our result.
Step 4: Let g be a general Borel-measurable function, which can take both positive and negative values. Since we proved $Eg^{+}(X) = \int_{-\infty}^{\infty} g^{+}(x)f(x)\,dx$ and $Eg^{-}(X) = \int_{-\infty}^{\infty} g^{-}(x)f(x)\,dx$, we can add these two terms to obtain the first equation and subtract them to get the second one.

Exercise 563 (Break a Stick). A stick of unit length is broken at a point y sampled from U(0, 1). Find the expected ratio of the shorter segment to the longer segment.

Proof. See that the function g(Y) can be expressed by determining whether y lies on the 'left half' or the 'right half' of the stick, which determines which segment is shorter. Then
$$g(y) = \begin{cases} \dfrac{y}{1-y} & 0 \leq y \leq \tfrac{1}{2}, \\[4pt] \dfrac{1-y}{y} & \tfrac{1}{2} < y \leq 1. \end{cases} \qquad (2368)$$
Then see that $Eg(Y) = \int_0^{1/2}\frac{y}{1-y}\,dy + \int_{1/2}^{1}\frac{1-y}{y}\,dy$, where the second integral may be written $\int_{1/2}^{1}\left(\frac{1}{y}-1\right)dy = (\ln y - y)\big|_{1/2}^{1} = \ln 2 - \frac{1}{2}$. By symmetry $Eg(Y) = 2\ln 2 - 1$.
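A Monte Carlo check of the answer (our own sketch; seed and sample size are arbitrary):

```python
import math
import random

random.seed(4)

n = 1_000_000
total = 0.0
for _ in range(n):
    y = random.random()                      # break point, uniform on (0, 1)
    short, long_ = min(y, 1 - y), max(y, 1 - y)
    total += short / long_
print(f"Monte Carlo estimate: {total / n:.4f}")
print(f"2*ln(2) - 1         : {2 * math.log(2) - 1:.4f}")
```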

Exercise 564 (Mean Free Path). The random variable Y used to describe the distance a molecule in a gas travels before colliding with another can be modelled by the exponential density with $\lambda = \frac{1}{\mu}$, such that
$$f_Y(y) = \frac{1}{\mu}\exp\left(-\frac{y}{\mu}\right), \qquad y \geq 0, \qquad (2369)$$
where µ is called the mean free path. See that $EY = \int_0^{\infty} y\,\frac{1}{\mu}\exp(-\frac{y}{\mu})\,dy = \mu\int_0^{\infty} w\exp(-w)\,dw$ by substituting $w = \frac{y}{\mu}$, and we obtain $EY = \mu$ since $\int_0^{\infty} w\exp(-w)\,dw = 1$ (it is the mean of a standard exponential density).

6.3.5 Change of Measure


Here we work in a purely measure-theoretic setting. Consider a pair (Ω, F) and two probability measures P and P̃ on it. Write the equation Z(ω)P(ω) = P̃(ω). When Ω is uncountably infinite, P(ω) = P̃(ω) = 0 for all ω ∈ Ω, and this equation then tells us nothing about the relationship between Z, P, P̃, since Z(ω) can take on any value and still satisfy it. We would like to reassign probabilities in Ω using Z to tell us where in Ω we should revise probabilities upwards and where we should revise them downwards. In fact, we would like to do this on a set-by-set basis, and this will give us meaningful relationships.

Theorem 346 (Change of Measure). Let (Ω, F, P) be a probability space and Z be an almost surely nonnegative random variable with EZ = 1. For A ∈ F, define
$$\tilde{P}(A) = \int_A Z(\omega)\,dP(\omega);$$
then P̃ is a probability measure. Additionally, if X is a nonnegative random variable, then $\tilde{E}X = E[XZ]$, and if Z is almost surely strictly positive, then $EY = \tilde{E}\left[\frac{Y}{Z}\right]$ for every nonnegative random variable Y, where Ẽ indicates computation of expectations w.r.t. the probability measure P̃. If X can take both positive and negative values, then again we may apply the transformation $X = X^{+} - X^{-}$ and the result holds, provided ∞ − ∞ does not result from such a transformation.

Proof. To verify that P̃ is a probability measure (see Definition (267)) we need to prove P̃(Ω) = 1 and countable additivity. By assumption,
$$\tilde{P}(\Omega) = \int_{\Omega} Z(\omega)\,dP(\omega) = EZ = 1.$$
For countable additivity, let $A_1, A_2, \cdots$ be a sequence of disjoint sets in F and let $B_n = \cup_{k=1}^{n}A_k$, $B_\infty = \cup_{k=1}^{\infty}A_k$. Since $\mathbf{1}_{B_1} \leq \mathbf{1}_{B_2} \leq \cdots$ and $\lim_{n\to\infty}\mathbf{1}_{B_n} = \mathbf{1}_{B_\infty}$, by Monotone Convergence (see Theorem 342)
$$\tilde{P}(B_\infty) = \int_{\Omega}\mathbf{1}_{B_\infty}(\omega)Z(\omega)\,dP(\omega) = \lim_{n\to\infty}\int_{\Omega}\mathbf{1}_{B_n}(\omega)Z(\omega)\,dP(\omega).$$
Since $\mathbf{1}_{B_n}(\omega) = \sum_{k=1}^{n}\mathbf{1}_{A_k}(\omega)$, we have
$$\int_{\Omega}\mathbf{1}_{B_n}(\omega)Z(\omega)\,dP(\omega) = \sum_{k=1}^{n}\int_{\Omega}\mathbf{1}_{A_k}(\omega)Z(\omega)\,dP(\omega) = \sum_{k=1}^{n}\tilde{P}(A_k).$$
These two equations give countable additivity, since $\tilde{P}(\cup_{k=1}^{\infty}A_k) = \lim_{n\to\infty}\sum_{k=1}^{n}\tilde{P}(A_k) = \sum_{k=1}^{\infty}\tilde{P}(A_k)$.
Suppose X is a nonnegative random variable. If X is an indicator function $X = \mathbf{1}_A$, we have
$$\tilde{E}X = \tilde{P}(A) = \int_{\Omega}\mathbf{1}_A(\omega)Z(\omega)\,dP(\omega) = E[\mathbf{1}_AZ] = E[XZ].$$
By use of the standard machine (see proof of Theorem 344) we can complete the proof. When Z > 0 almost surely, Y/Z is defined and we may replace X with Y/Z in $\tilde{E}X = E[XZ]$ to obtain $E[Y] = \tilde{E}\left[\frac{Y}{Z}\right]$.

Definition 296 (Equivalent Probability Measures). Let Ω be nonempty set, F be σ-algebra of subsets
of Ω, then two probability measures P, P̃ on (Ω, F) are equivalent if they agree on which sets in F have
probability zero.

The probability measures P, P̃ with the assumptions discussed in Theorem (346) are equivalent measures (under the assumption that Z > 0 almost surely). Suppose A ∈ F is given and P(A) = 0; then $\mathbf{1}_AZ$ is almost surely zero and P̃(A) = 0. Suppose also that B ∈ F satisfies P̃(B) = 0; then $\frac{1}{Z}\mathbf{1}_B = 0$ almost surely under P̃ and $\tilde{E}[\frac{1}{Z}\mathbf{1}_B] = 0$. This implies that $E\mathbf{1}_B = P(B) = 0$, and the two probability measures agree on the sets in F with probability zero. Since probability-one events are the complements of probability-zero events, they agree on the almost sure sets as well. Since the two measures are equivalent, we do not need to specify which measure we have in mind when we say an event occurs almost surely.
Often, our work will first set up sample space Ω, such as the set of possible scenarios in the future.
These sets of scenarios have some actual (objective) probability measure P. However, in derivative
pricing, we use a risk-neutral (risk-adjusted) measure P̃, and insist that the two measures are equivalent.
They may only disagree on the probabilities on the possibilities, and not the impossibilities.
It is common to refer to computation under the actual measure as the real world and under the risk-
neutral measure as the risk-neutral world. There is only one world in the models, a single sample space
Ω representing all possible future states, and a single set of asset prices modeled by random variables.
A hedge that works almost surely under one probability measure works almost surely under the other
probability measure, since the probability measures agree on which events are possible!

Exercise 565. Let P be the uniform Lebesgue measure on Ω = [0, 1] and define
$$Z(\omega) = \begin{cases} 0 & \text{if } 0 \leq \omega < \tfrac{1}{2}, \\ 2 & \text{if } \tfrac{1}{2} \leq \omega \leq 1. \end{cases} \qquad (2370)$$
For $A \in \mathcal{B}[0, 1]$ define $\tilde{P}(A) = \int_A Z(\omega)\,dP(\omega)$ and

1. Show that P̃ is a probability measure.

2. Show that P(A) = 0 ⟹ P̃(A) = 0, i.e. that P̃ is absolutely continuous with respect to P.

3. Show ∃A s.t. P̃(A) = 0 ∧ P(A) > 0 - that they are not equivalent probability measures.

Proof. -

1. We have $\tilde{P}(\Omega) = \int_{\Omega}Z(\omega)\,dP(\omega) = \int_{[0,1]}Z(\omega)\,dP(\omega) = \int_{[0,\frac12)}0\,dP(\omega) + \int_{[\frac12,1]}2\,dP(\omega) = \int_{\frac12}^{1}2\,dx = [2x]_{\frac12}^{1} = 1$. Additionally, if $\{A_i\}_{i=1}^{\infty}$ is a sequence of disjoint Borel subsets of [0, 1], then by the Monotone Convergence Theorem $\tilde{P}(\cup_{i=1}^{\infty}A_i)$ equals
$$\int\mathbf{1}_{\cup_{i=1}^{\infty}A_i}Z(\omega)\,dP(\omega) = \lim_{n\to\infty}\int\mathbf{1}_{\cup_{i=1}^{n}A_i}Z(\omega)\,dP(\omega) = \lim_{n\to\infty}\sum_{i=1}^{n}\int_{A_i}Z(\omega)\,dP(\omega) = \sum_{i=1}^{\infty}\tilde{P}(A_i).$$

2. If P(A) = 0 then $\tilde{P}(A) = \int_A Z(\omega)\,dP(\omega) = 2\int_{A\cap[\frac12,1]}dP(\omega) = 2P(A\cap[\tfrac12,1]) = 0$.

3. Let $A = [0, \tfrac12)$: then $\tilde{P}(A) = 0$ while $P(A) = \tfrac12 > 0$, and we are done.

Exercise 566 (Examples of Change of Measure Under Uncountable Ω). Consider the probability measure P over Ω = [0, 1] that is the uniform Lebesgue measure (see Definition (269)), and let
$$\tilde{P}[a, b] = \int_a^b 2\omega\,d\omega = b^2 - a^2, \qquad 0 \leq a \leq b \leq 1.$$
We have $P(d\omega) = d\omega$ and write $\tilde{P}[a, b] = \int_{[a,b]}2\omega\,dP(\omega)$. Since $\mathcal{B}[0, 1]$ is the σ-algebra generated by the closed intervals, this implies that
$$\tilde{P}(B) = \int_B 2\omega\,dP(\omega) \quad \text{for every Borel set } B \subset [0, 1].$$
This is simply $Z(\omega) = 2\omega$, which is strictly positive almost surely with $EZ = \int_0^1 2\omega\,d\omega = 1$. For every nonnegative random variable $X(\omega)$ we have $\int_0^1 X(\omega)\,d\tilde{P}(\omega) = \int_0^1 X(\omega)\,2\omega\,d\omega$, suggesting the notation $d\tilde{P}(\omega) = 2\omega\,d\omega = 2\omega\,dP(\omega)$. In general, we may rewrite the two equations relating P, P̃, Z as
$$\int_{\Omega}X(\omega)\,d\tilde{P}(\omega) = \int_{\Omega}X(\omega)Z(\omega)\,dP(\omega),$$
$$\int_{\Omega}Y(\omega)\,dP(\omega) = \int_{\Omega}\frac{Y(\omega)}{Z(\omega)}\,d\tilde{P}(\omega).$$
This can be remembered by writing $Z(\omega) = \frac{d\tilde{P}(\omega)}{dP(\omega)}$.

Definition 297 (Radon-Nikodym Derivatives). Let (Ω, F) be a measurable space carrying two equivalent (see Definition 296) probability measures P, P̃. Let Z be an almost surely positive random variable relating the two probability measures via $\tilde{P}(A) = \int_A Z(\omega)\,dP(\omega)$ for A ∈ F. Then Z is the Radon-Nikodym derivative of P̃ w.r.t. P and we write $Z = \frac{d\tilde{P}}{dP}$.

Exercise 567 (Examples of Change of Measure for Normal Random Variables). Let X be a standard normal random variable defined on some probability space (Ω, F, P). Let the set Ω be uncountably infinite with P(ω) = 0 for all ω ∈ Ω. We say that X is a standard normal random variable, by which we mean
$$\mu_X(B) = P(X \in B) = \int_B \varphi(x)\,dx \quad \forall B \in \mathcal{B}(\mathbb{R}),$$
where $\varphi(x) = \frac{1}{\sqrt{2\pi}}\exp\{-\frac{x^2}{2}\}$. If we take $B = (-\infty, b]$, this reduces to the familiar condition $P(X \leq b) = \int_{-\infty}^{b}\varphi(x)\,dx$ for every $b \in \mathbb{R}$. Note that we have $E(X) = 0$ and $\mathrm{Var}(X) = E[X - EX]^2 = 1$. Let θ be a constant and $Y = X + \theta$, so that under P the random variable Y is normal with expectation θ and variance one. We also assume θ to be positive, although this is not necessary. We want to change to a new probability measure P̃ on Ω under which Y is a standard normal random variable. We want $\tilde{E}Y = 0$, $\widetilde{\mathrm{Var}}(Y) = 1$, achieved by assigning less probability to the ω for which Y(ω) is sufficiently positive, and more probability to those for which it is sufficiently negative. We want to change the distribution of Y without changing the random variable itself; this can be seen as analogous to the change from actual to risk-neutral probabilities, where the distribution of asset prices is shifted without changing the asset prices themselves. Define the random variable
$$Z(\omega) = \exp\left\{-\theta X(\omega) - \frac{1}{2}\theta^2\right\} \quad \text{for all } \omega \in \Omega.$$
This random variable has two important properties allowing it to be a Radon-Nikodym derivative for obtaining an equivalent probability measure P̃: (i) Z(ω) > 0 for all ω ∈ Ω, so that Z is positive almost surely (in fact, here, surely), and (ii) EZ = 1. Property (i) is obvious since Z is defined as an exponential, and (ii) follows from integration as follows:
$$EZ = \int_{-\infty}^{\infty}\exp\left\{-\theta x - \tfrac{1}{2}\theta^2\right\}\varphi(x)\,dx \qquad (2371)$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left\{-\tfrac{1}{2}(x^2 + 2\theta x + \theta^2)\right\}dx \qquad (2372)$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left\{-\tfrac{1}{2}(x + \theta)^2\right\}dx \qquad (2373)$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left\{-\tfrac{1}{2}y^2\right\}dy \qquad (2374)$$
with the change of variable y = x + θ. Since the last term is the integral of the standard normal density, it equates to one and we are done. We can use the random variable Z to create P̃ with the definition
$$\tilde{P}(A) = \int_A Z(\omega)\,dP(\omega) \quad \forall A \in F.$$
The random variable Z has the property that if X(ω) is positive, then Z(ω) < 1 (we made the assumption that θ is a positive constant), so P̃ assigns less probability than P to the sets on which X is positive. To see that Y is standard normal under P̃, consider
$$\tilde{P}(Y \leq b) = \int_{\{\omega: Y(\omega)\leq b\}}Z(\omega)\,dP(\omega) \qquad (2375)$$
$$= \int_{\Omega}\mathbf{1}_{\{Y(\omega)\leq b\}}Z(\omega)\,dP(\omega) \qquad (2376)$$
$$= \int_{\Omega}\mathbf{1}_{\{X(\omega)\leq b-\theta\}}\exp\left\{-\theta X(\omega) - \frac{1}{2}\theta^2\right\}dP(\omega). \qquad (2377)$$
We have obtained P̃(Y ≤ b) in terms of a function of the random variable X, integrated w.r.t. the probability measure P under which X is standard normal. By the relation of integrals over abstract spaces to the

real line (see Theorem (344)), we obtain
$$\int_{\Omega}\mathbf{1}_{\{X(\omega)\leq b-\theta\}}\exp\left\{-\theta X(\omega) - \frac{1}{2}\theta^2\right\}dP(\omega)$$
$$= \int_{-\infty}^{\infty}\mathbf{1}_{\{x\leq b-\theta\}}\exp\left\{-\theta x - \frac{1}{2}\theta^2\right\}\varphi(x)\,dx \qquad (2378)$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{b-\theta}\exp\left\{-\theta x - \frac{\theta^2}{2} - \frac{x^2}{2}\right\}dx \qquad (2379)$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{b-\theta}\exp\left\{-\frac{1}{2}(x+\theta)^2\right\}dx \qquad (2380)$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{b}\exp\left\{-\frac{1}{2}y^2\right\}dy \qquad (2381)$$

with the change of variable y = x + θ. Finally, we arrive at the conclusion that Y is standard normal under P̃.
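A Monte Carlo sanity check of this change of measure (our own sketch; θ, b, the seed and the sample size are arbitrary choices) weights P-samples by Z and recovers the standard normal c.d.f. for Y:

```python
import math
import random
from statistics import NormalDist

random.seed(5)
theta, b, n = 0.8, 0.25, 1_000_000

acc = 0.0
for _ in range(n):
    x = random.gauss(0.0, 1.0)                        # X standard normal under P
    y = x + theta                                     # Y = X + theta
    z = math.exp(-theta * x - 0.5 * theta * theta)    # Radon-Nikodym derivative Z
    acc += z if y <= b else 0.0

print(f"P~(Y <= b) via E[1(Y<=b) Z] : {acc / n:.4f}")
print(f"standard normal CDF N(b)    : {NormalDist().cdf(b):.4f}")
# Weighting P-samples by Z reproduces the standard normal law for Y under P~.
```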

Exercise 568. We saw in Exercise 553 that the moment-generating function of a standard normal has the form $E\exp(uX) = \exp(\frac{1}{2}u^2)$. We saw the change of measure on Y = X + θ used to construct a standard normal in Exercise 567. Setting $Z = \exp(-\theta X - \frac{1}{2}\theta^2)$ and defining $\tilde{P}(A) = \int_A Z(\omega)\,dP(\omega)$ for A ∈ F, prove using moment generating functions that Y is standard normal under P̃.

Proof.
$$\tilde{E}\exp(uY) = E[\exp(uY)Z] \qquad (2382)$$
$$= E\left[\exp(uX + u\theta)\exp\left(-\theta X - \frac{1}{2}\theta^2\right)\right] \qquad (2383)$$
$$= \exp\left(u\theta - \frac{1}{2}\theta^2\right)E[\exp((u-\theta)X)] \qquad (2384)$$
$$= \exp\left(u\theta - \frac{1}{2}\theta^2\right)\exp\left(\frac{1}{2}(u-\theta)^2\right) = \exp\left(\frac{u^2}{2}\right). \qquad (2385)$$

Exercise 569. We saw the change of measure technique used to construct a standard normal for Y = X + θ, where X ∼ N(0, 1), in Exercise 567. We took $Z = \exp(-\theta X - \frac{1}{2}\theta^2)$ as the Radon-Nikodym derivative for the probability measure $\tilde{P}(A) = \int_A Z(\omega)\,dP(\omega)$ for all A ∈ F. Now view X as related to Y by the formula X = Y − θ, where Y ∼ N(0, 1) under P̃. We can then replace $X \to Y$, $\theta \to -\theta$, define $\hat{Z} = \exp(\theta Y - \frac{1}{2}\theta^2)$, and use
$$\hat{P}(A) = \int_A \hat{Z}(\omega)\,d\tilde{P}(\omega) \quad \forall A \in F,$$
so that X is standard normal under the P̂ measure. Show that $\hat{Z} = \frac{1}{Z}$ and that P̂ = P.

Proof. We have $\hat{Z} = \exp(\theta Y - \frac{1}{2}\theta^2) = \exp(\theta(X + \theta) - \frac{1}{2}\theta^2) = \exp(\frac{1}{2}\theta^2 + \theta X) = Z^{-1}$. For any A ∈ F, $\hat{P}(A) = \int_A \hat{Z}\,d\tilde{P} = \int_{\Omega}(\mathbf{1}_A\hat{Z})Z\,dP = \int_{\Omega}\mathbf{1}_A\,dP = P(A)$ and we are done. X is standard normal under P̂ since the two probability measures are identical and X is standard normal under P.

Exercise 570. Let X be a standard normal and Y = X + θ, defined on some probability space (Ω, F, P) as in Exercise 567. Our goal is to define a strictly positive random variable Z(ω) with $\tilde{P}(A) = \int_A Z(\omega)\,dP(\omega)$ for all A ∈ F, such that Y under P̃ is standard normal. Suppose we fix $\bar{\omega} \in \Omega$ and choose a set A containing $\bar{\omega}$ that is 'small', giving
$$\tilde{P}(A) \approx Z(\bar{\omega})P(A).$$
Dividing by P(A) we obtain $\frac{\tilde{P}(A)}{P(A)} \approx Z(\bar{\omega})$ for small sets A containing $\bar{\omega}$. With $\bar{\omega}$ fixed, let $x = X(\bar{\omega})$. For $\epsilon > 0$, define $B(x, \epsilon) = [x - \frac{\epsilon}{2}, x + \frac{\epsilon}{2}]$. Let $y = x + \theta$ and $B(y, \epsilon) = [y - \frac{\epsilon}{2}, y + \frac{\epsilon}{2}]$.

1. Show that $\frac{1}{\epsilon}P\{X \in B(x, \epsilon)\} \approx \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{X^2(\bar{\omega})}{2}\right\}$.

2. Show that for Y to be a standard normal random variable under P̃ we require
$$\frac{1}{\epsilon}\tilde{P}\{Y \in B(y, \epsilon)\} \approx \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{Y^2(\bar{\omega})}{2}\right\}.$$

3. Show $\{X \in B(x, \epsilon)\} = \{Y \in B(y, \epsilon)\}$. Denote this set $A(\bar{\omega}, \epsilon)$. This set contains $\bar{\omega}$ and is 'small' when $\epsilon > 0$ is small.

4. Show
$$\frac{\tilde{P}A(\bar{\omega}, \epsilon)}{PA(\bar{\omega}, \epsilon)} \approx \exp\left(-\theta X(\bar{\omega}) - \frac{1}{2}\theta^2\right),$$
which completes our (informal) derivation of the Radon-Nikodym derivative in Exercise 567.

Proof. -

1. $$\frac{1}{\epsilon}P(X \in B(x, \epsilon)) = \frac{1}{\epsilon}\int_{x-\epsilon/2}^{x+\epsilon/2}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)du \approx \frac{1}{\epsilon}\cdot\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)\cdot\epsilon = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{X^2(\bar{\omega})}{2}\right).$$

2. Refer to the proof of Part 1, applied to Y under P̃.

3. $\{X \in B(x, \epsilon)\} = \{X \in B(y - \theta, \epsilon)\} = \{X + \theta \in B(y, \epsilon)\} = \{Y \in B(y, \epsilon)\}$.

4. Approximately, we have
$$\frac{\tilde{P}A(\bar{\omega}, \epsilon)}{PA(\bar{\omega}, \epsilon)} \approx \frac{\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{Y^2(\bar{\omega})}{2}\right)}{\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{X^2(\bar{\omega})}{2}\right)} = \exp\left(-\frac{(X(\bar{\omega})+\theta)^2 - X^2(\bar{\omega})}{2}\right) = \exp\left(-\theta X(\bar{\omega}) - \frac{\theta^2}{2}\right).$$

Exercise 571 (Change of Measure, Exponential Random Variable). Let X > 0 be a random variable defined on a probability space with the exponential distribution, such that for $\lambda > 0$, $P(X \leq a) = 1 - \exp(-\lambda a)$, $a \geq 0$. Let $\tilde{\lambda}$ be another positive constant, define $Z = \frac{\tilde{\lambda}}{\lambda}\exp(-(\tilde{\lambda}-\lambda)X)$, and set $\tilde{P}(A) = \int_A Z\,dP$ for all A ∈ F, the σ-algebra of the probability space. Show (i) that P̃(Ω) = 1 and (ii) the cumulative distribution function of X under P̃.

Proof. For (i), we have
$$\tilde{P}(\Omega) = \int_{\Omega}\frac{\tilde{\lambda}}{\lambda}\exp(-(\tilde{\lambda}-\lambda)X)\,dP = \int_0^{\infty}\frac{\tilde{\lambda}}{\lambda}\exp(-(\tilde{\lambda}-\lambda)x)\,\lambda\exp(-\lambda x)\,dx = \int_0^{\infty}\tilde{\lambda}\exp(-\tilde{\lambda}x)\,dx = 1.$$
For (ii) we have
$$\tilde{P}(X \leq a) = \int_{\{X\leq a\}}\frac{\tilde{\lambda}}{\lambda}\exp(-(\tilde{\lambda}-\lambda)X)\,dP = \int_0^{a}\frac{\tilde{\lambda}}{\lambda}\exp(-(\tilde{\lambda}-\lambda)x)\,\lambda\exp(-\lambda x)\,dx = \int_0^{a}\tilde{\lambda}\exp(-\tilde{\lambda}x)\,dx = 1 - \exp(-\tilde{\lambda}a).$$
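The same reweighting can be checked by simulation (our own sketch; the rates, threshold, seed and sample size are arbitrary choices):

```python
import math
import random

random.seed(6)
lam, lam_tilde, a, n = 1.0, 2.5, 0.6, 1_000_000

acc = 0.0
for _ in range(n):
    x = random.expovariate(lam)                                # X ~ Exp(lam) under P
    z = (lam_tilde / lam) * math.exp(-(lam_tilde - lam) * x)   # Radon-Nikodym derivative
    acc += z if x <= a else 0.0

print(f"P~(X <= a) via E[1(X<=a) Z]: {acc / n:.4f}")
print(f"1 - exp(-lam_tilde * a)    : {1 - math.exp(-lam_tilde * a):.4f}")
# Under the reweighted measure, X behaves like an Exp(lam_tilde) random variable.
```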

Exercise 572. Let X be a random variable on the usual probability space and assume X has a density function f that is positive everywhere. Let g be a strictly increasing, differentiable function satisfying
$$\lim_{y\to-\infty}g(y) = -\infty, \qquad \lim_{y\to\infty}g(y) = \infty,$$
and define the random variable Y = g(X). Let h(y) be an arbitrary nonnegative function satisfying $\int_{-\infty}^{\infty}h(y)\,dy = 1$; we want to change the probability measure so that h(y) becomes the density function for Y. Define
$$Z = \frac{h(g(X))\,g'(X)}{f(X)}.$$
Then

1. show that Z is nonnegative and EZ = 1, and that

2. Y has density h under the probability measure $\tilde{P}(A) = \int_A Z\,dP$ for all A ∈ F.

Proof. -

1. Clearly Z is nonnegative. We have
$$EZ = E\left[\frac{h(g(X))\,g'(X)}{f(X)}\right] = \int_{-\infty}^{\infty}\frac{h(g(x))\,g'(x)}{f(x)}f(x)\,dx = \int_{-\infty}^{\infty}h(g(x))\,dg(x) = \int_{-\infty}^{\infty}h(u)\,du = 1.$$

2. $\tilde{P}(Y \leq a) = \int_{\{g(X)\leq a\}}\frac{h(g(X))\,g'(X)}{f(X)}\,dP = \int_{-\infty}^{g^{-1}(a)}\frac{h(g(x))\,g'(x)}{f(x)}f(x)\,dx = \int_{-\infty}^{g^{-1}(a)}h(g(x))\,dg(x)$. By the change of variable u = g(x), we obtain $\int_{-\infty}^{a}h(u)\,du$ and we are done.

The First Fundamental Theorem of Asset Pricing (see Theorem 425) states that the existence of a risk-neutral measure guarantees that a financial model is arbitrage-free. We are therefore interested in the existence of risk-neutral measures for arbitrage-free models. This involves the construction of a different probability measure P̃ that is equivalent to P. We began with an almost surely positive random variable Z and constructed the equivalent measure P̃, and it turns out that this is essentially the only way to obtain an equivalent probability measure.

Theorem 347 (Radon-Nikodym Theorem). Let P, P̃ be equivalent probability measures defined on (Ω, F). Then there exists an almost surely positive random variable Z such that EZ = 1 and
$$\tilde{P}(A) = \int_A Z(\omega)\,dP(\omega) \quad \forall A \in F.$$

6.3.6 Random Variable Moments


We define the moments of random variable.

Definition 298 (Moments of a Random Variable). Let X be a random variable, and for k = 1, 2, · · · the k-th moment of X is written
$$\mu_k = EX^k. \qquad (2386)$$
This exists provided that $\int_{-\infty}^{\infty}|x|^k f_X(x)\,dx < \infty$. The k-th central moment is then the moment about the mean, that is, $\mu'_k = E[(X - \mu)^k]$. Let $X_i, i \in [n]$, be IID and distributionally equivalent to X. The k-th sample moment
$$\hat{\mu}_k = \frac{1}{n}\sum_{i=1}^{n}X_i^k \qquad (2387)$$
is an estimator of $\mu_k$. A realisation of the estimator $\hat{\mu}_k$ is the value $\frac{1}{n}\sum_{i=1}^{n}x_i^k$ where the $x_i, i \in [n]$, are realisations.

We saw that a measure of central tendency was the expectation, which is the first moment EX = EX 1 .
Another measure we are interested in is the measure of spread, and we are interested in its deviation
from the central tendency.

Definition 299 (Variance). For a random variable X, its variance is written $\mathrm{Var}(X) = E[(X - \mu)^2]$. If we are interested in a unit-compatible measure of spread w.r.t. X, we take $\sqrt{\mathrm{Var}(X)} = \sigma(X)$, its standard deviation. Show we can write the variance in terms of moments: for a random variable W we have $\mathrm{Var}(W) = \mu_2 - \mu_1^2 = \sigma^2$.

Proof. Suppose W is continuous (try the discrete case as a self-exercise). See that $\mathrm{Var}(W) = \int(w - \mu)^2 f_W(w)\,dw = \int(w^2 - 2\mu w + \mu^2)f_W(w)\,dw = \int w^2 f_W(w)\,dw - 2\mu\int w f_W(w)\,dw + \mu^2\int f_W(w)\,dw = \mu_2 - 2\mu^2 + \mu^2 = \mu_2 - \mu_1^2$.

Exercise 573. Show that for W with EW = µ, EW 2 < ∞ that Var(aW + b) = a2 Var(W ).
Proof. Show this by writing $\mathrm{Var}(aW + b) = E[(aW + b)^2] - (E[aW + b])^2$ and expanding the terms, using the linearity of expectation.

Exercise 574. Show that we can express the k-th central moment in terms of the moments up to order k.

Proof. See that we can write $\mu'_k = E[(X - \mu)^k] = \sum_{j=0}^{k}\binom{k}{j}E(X^j)(-\mu)^{k-j}$ by binomial expansion.

Definition 300 (Skewness). The skewness of a probability density function can be measured in terms of
its third central moment. The coefficient of skewness measures the symmetry in a density function and
is written
$$\frac{E[(X - \mu)^3]}{\sigma^3}.$$
Definition 301 (Kurtosis). The coefficient of kurtosis measures the density function’s flatness shape. It
is the scaled fourth central moment written

$$\frac{E[(X - \mu)^4]}{\sigma^4} - 3$$
and is useful for understanding the probability of outliers. Flat density functions are said to be platykurtic
and peaked density functions to be leptokurtic. This is generally in relation to the normal density.

See Exercise 575 for computing the kurtosis on a normal random variable and why we subtract three
from the fraction term.

Exercise 575 (Kurtosis of Normal Random Variable). Verify that the normal random variable has $\frac{E[(X-\mu)^4]}{\sigma^4} = 3$ (see Definition 301), i.e. zero excess kurtosis. That is, for $X \sim \Phi(\mu, \sigma^2)$ show that $E(X - \mu)^4 = 3\sigma^4$.

Proof. From the normal m.g.f. (see Theorem 377), we have $\psi(u) = E\exp(u(X - \mu)) = \exp(\frac{1}{2}u^2\sigma^2)$. Differentiating w.r.t. u we obtain
$$\psi^{(1)}(u) = E[(X-\mu)\exp(u(X-\mu))] = \sigma^2u\exp\left(\tfrac{1}{2}\sigma^2u^2\right), \qquad (2388)$$
and the first central moment is $\psi^{(1)}(0) = 0$. Differentiate again for
$$\psi^{(2)}(u) = \sigma^2\exp\left(\tfrac{1}{2}\sigma^2u^2\right) + \sigma^4u^2\exp\left(\tfrac{1}{2}\sigma^2u^2\right) = (\sigma^2 + \sigma^4u^2)\exp\left(\tfrac{1}{2}\sigma^2u^2\right) \qquad (2389)$$
and so $\psi^{(2)}(0) = \sigma^2$. Differentiate again for
$$\psi^{(3)}(u) = 2\sigma^4u\exp\left(\tfrac{1}{2}\sigma^2u^2\right) + (\sigma^2 + \sigma^4u^2)(\sigma^2u)\exp\left(\tfrac{1}{2}\sigma^2u^2\right) \qquad (2390)$$
$$= \exp\left(\tfrac{1}{2}\sigma^2u^2\right)(2\sigma^4u + \sigma^4u + \sigma^6u^3) \qquad (2391)$$
$$= \exp\left(\tfrac{1}{2}\sigma^2u^2\right)(3\sigma^4u + \sigma^6u^3), \qquad (2392)$$
so that $\psi^{(3)}(0) = 0$. Differentiate again for
$$\psi^{(4)}(u) = (3\sigma^4u + \sigma^6u^3)(\sigma^2u)\exp\left(\tfrac{1}{2}\sigma^2u^2\right) + (3\sigma^4 + 3\sigma^6u^2)\exp\left(\tfrac{1}{2}\sigma^2u^2\right) \qquad (2393)$$
$$= \exp\left(\tfrac{1}{2}\sigma^2u^2\right)(6\sigma^6u^2 + \sigma^8u^4 + 3\sigma^4), \qquad (2394)$$
so that $\psi^{(4)}(0) = 3\sigma^4$. The kurtosis level is then $\frac{\mu'_4}{(\mu'_2)^2} = \frac{3\sigma^4}{\sigma^4} = 3$.
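A simulation check of this fact (our own Python sketch; µ, σ, seed and sample size are arbitrary):

```python
import random
from statistics import fmean

random.seed(7)
mu, sigma, n = 1.0, 2.0, 1_000_000
xs = [random.gauss(mu, sigma) for _ in range(n)]

mean = fmean(xs)
fourth = fmean([(x - mean) ** 4 for x in xs])   # sample fourth central moment
second = fmean([(x - mean) ** 2 for x in xs])   # sample variance

print(f"E[(X - mu)^4] / sigma^4 estimate: {fourth / sigma ** 4:.3f}   (theory: 3)")
print(f"sample kurtosis ratio:            {fourth / second ** 2:.3f}   (theory: 3)")
```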

Theorem 348. If the k-th moment exists, all moments of order j ≤ k exist.

Proof. For a random variable Y, $EY^k$ exists iff $\int|y|^k f_Y(y)\,dy < \infty$. See that for all $j \in [k]$ (Larsen and Marx [12])
$$\int_{-\infty}^{\infty}|y|^j f_Y(y)\,dy = \int_{|y|\leq 1}|y|^j f_Y(y)\,dy + \int_{|y|>1}|y|^j f_Y(y)\,dy \qquad (2395)$$
$$\leq \int_{|y|\leq 1}f_Y(y)\,dy + \int_{|y|>1}|y|^j f_Y(y)\,dy \qquad (2396)$$
$$\leq 1 + \int_{|y|>1}|y|^j f_Y(y)\,dy \qquad (2397)$$
$$\leq 1 + \int_{|y|>1}|y|^k f_Y(y)\,dy < \infty, \qquad (2398)$$
and we are done in the continuous case. We omit the proof for the discrete case, but it is similar.

6.3.7 Random Variable Co-Movements


We defined the variance of a random variable in Definition 299. We now give a more general notion, called covariance, which includes variance as a special case and accounts for joint deviation when X is paired with another random variable.

Definition 302 (Covariance of a Pair of Random Variables). Let X, Y be a pair of random variables for
which their expectations and variances are all defined and finite. The covariance of the pair is denoted
Cov(X, Y ) = E[(X − EX)(Y − EY )]. By linearity of expectations, this is equivalent to E[XY ] − EX · EY .
One obvious corollary is that E[XY ] = E[X]E[Y ] iff their Cov(X, Y ) = 0.

Theorem 349. Let X, Y be random variables with finite variances. Show that $\mathrm{Var}(aX + bY) = a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$.

Proof. By the definition of covariance (Definition 302), see that we can write this as
$$E[(aX + bY)^2] - (E(aX + bY))^2 \qquad (2399)$$
$$= E[a^2X^2 + b^2Y^2 + 2abXY] - (a^2\mu_x^2 + b^2\mu_y^2 + 2ab\mu_x\mu_y) \qquad (2400)$$
$$= [E(a^2X^2) - a^2\mu_x^2] + [E(b^2Y^2) - b^2\mu_y^2] + [2abE(XY) - 2ab\mu_x\mu_y] \qquad (2401)$$
$$= a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y). \qquad (2402)$$

Corollary 13 (Variance of n-sums of Random Variables). For $W_i, i \in [n]$, random variables each with finite variance, we have
$$\mathrm{Var}\left(\sum_{i=1}^{n}a_iW_i\right) = \sum_{i=1}^{n}a_i^2\mathrm{Var}(W_i) + 2\sum_{i<j}a_ia_j\mathrm{Cov}(W_i, W_j). \qquad (2403)$$

Definition 303 (Correlation of a Pair of Random Variables). Assume we have a pair of random variables in the same setting as in Definition 302 (covariance). Assume further that, in addition to the finiteness of expectations and variances, we are guaranteed that both Var(X), Var(Y) > 0. Then we define the correlation coefficient of X, Y:
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}.$$

6.4 Moment Generating Functions


Definition 304 (Moment Generating Functions). For a random variable W, the moment generating function of W is written
$$M_W(t) = E\exp(tW) = \begin{cases} \displaystyle\sum_k \exp(tk)\,p_W(k) & \text{(discrete case)} \\[6pt] \displaystyle\int_{-\infty}^{\infty}\exp(tw)\,f_W(w)\,dw & \text{(continuous case).} \end{cases} \qquad (2404)$$

Exercise 576. Show that the moment generating function of

1. $X \sim \mathrm{Geom}(p)$ is $M_X(t) = \frac{p\exp(t)}{1-(1-p)\exp(t)}$.

2. $X \sim \mathrm{Bin}(n, p)$ is $M_X(t) = (1 - p + p\exp(t))^n$.

3. $Y \sim \mathrm{Exp}(\lambda)$ is $M_Y(t) = \frac{\lambda}{\lambda - t}$.

4. $Y \sim \Phi(\mu, \sigma^2)$ is $M_Y(t) = \exp(\mu t + \frac{\sigma^2t^2}{2})$.

5. $X \sim \mathrm{Poisson}(\lambda)$ is $M_X(t) = \exp(-\lambda + \lambda\exp(t))$.

Proof. (verify this)
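As a partial numerical check of item 5 (our own Python sketch; λ, t and the truncation point are arbitrary choices), the series definition of the m.g.f. can be summed directly and compared with the stated closed form:

```python
import math

lam, t = 2.0, 0.3

# Direct evaluation of E[exp(tX)] = sum_k exp(t*k) * P(X = k) for X ~ Poisson(lam),
# truncating the series once the terms are negligibly small.
series = sum(math.exp(t * k) * math.exp(-lam) * lam ** k / math.factorial(k)
             for k in range(200))
closed_form = math.exp(-lam + lam * math.exp(t))

print(f"series estimate : {series:.8f}")
print(f"closed form     : {closed_form:.8f}")
```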

Theorem 350 (Generating Functions Generate). Let $M_W(t)$ be the moment generating function for W with density $f_W(w)$. Provided the r-th moment exists, then $M_W^{(r)}(0) = EW^r$.

Proof. We show the proof for r = 1, 2 for continuous random variables, but the same holds for higher moments. See that
$$M_Y^{(1)}(0) = \frac{\delta}{\delta t}\int_{-\infty}^{\infty}\exp(ty)f_Y(y)\,dy\Big|_{t=0} = \int_{-\infty}^{\infty}y\exp(ty)f_Y(y)\,dy\Big|_{t=0} \qquad (2405)$$
$$= \int_{-\infty}^{\infty}y\exp(0\cdot y)f_Y(y)\,dy = EY. \qquad (2406)$$
See that
$$M_Y^{(2)}(0) = \frac{\delta^2}{\delta t^2}\int_{-\infty}^{\infty}\exp(ty)f_Y(y)\,dy\Big|_{t=0} = \int_{-\infty}^{\infty}y^2\exp(ty)f_Y(y)\,dy\Big|_{t=0} \qquad (2407)$$
$$= \int_{-\infty}^{\infty}y^2\exp(0\cdot y)f_Y(y)\,dy = EY^2. \qquad (2408)$$

Theorem 351 (Properties of Moment Generating Functions). -

1. If random variables $W_1, W_2$ share the same moment generating function, $M_{W_1}(t) = M_W(t) = M_{W_2}(t)$, for some interval of t containing zero, then they share the same density.

2. Let W be a random variable with m.g.f. $M_W(t)$; then its linear transformation V = aW + b has m.g.f.
$$M_V(t) = \exp(bt)M_W(at). \qquad (2409)$$

3. For $W_i, i \in [n]$, independent random variables each with m.g.f. $M_{W_i}(t)$, the m.g.f. of $W = \sum_i W_i$ equals
$$M_W(t) = \prod_i M_{W_i}(t). \qquad (2410)$$
Proof. (verify this)

Exercise 577. Using the properties of the m.g.f. (Theorem 351), relate the m.g.f. of a standard normal random variable to that of an arbitrary normal random variable.

Proof. For the standard normal, the m.g.f. is $\exp(\frac{1}{2}u^2)$ (see Theorem 377). For $Y \sim \Phi(\mu, \sigma^2)$ write $Z = \frac{Y-\mu}{\sigma} = \frac{Y}{\sigma} - \frac{\mu}{\sigma}$. Then, by the linear transformation property in Theorem 351 and the normal m.g.f. $M_Y(u) = \exp(\mu u + \frac{\sigma^2u^2}{2})$, we have $M_Z(t) = \exp(-\frac{\mu}{\sigma}t)M_Y(\frac{t}{\sigma}) = \exp(-\frac{\mu}{\sigma}t)\exp(\mu\frac{t}{\sigma} + \frac{t^2}{2}) = \exp(\frac{1}{2}t^2)$, which matches the m.g.f. of the standard normal. By the uniqueness property in Theorem 351, Z is a standard normal random variable and shares its density.

6.5 Information and Joint Distributions


Joint probabilities are interesting because they describe relationships between multivariate random variables. The presentation of some information can change our probability estimates on random variables. Consider the case where we only have limited information in relation to the randomness of an outcome or event. If this limited information allows us to make a list of sets that are sure to contain the outcome and other sets that are sure not to contain the outcome, we say that those sets are resolved by the information. Consider three coin tosses and the sample space Ω of the eight possible outcomes. The information from the first coin toss resolves the sets $A_H = \{HHH, HHT, HTH, HTT\}$ and $A_T = \{THH, THT, TTH, TTT\}$. The empty set ∅ and the sample space Ω are always resolved. The four sets resolved by the first toss, namely $\{\emptyset, \Omega, A_H, A_T\}$, form the σ-algebra $\mathcal{F}_1$. Consider an equivalent statement: if we are instead just told, for each set in $\mathcal{F}_1$, whether or not the true ω belongs to that set, we know the outcome of the first coin toss and nothing more. Once the second coin toss is told, we have even more information. The sets in $\mathcal{F}_1$ are still resolved, but we now also have $A_{HH}, A_{HT}, A_{TH}, A_{TT}$. When a set becomes resolved, so is its complement; whenever two sets are resolved, so is their union. We can continue this reasoning, and by the second coin toss we have the σ-algebra
$$\mathcal{F}_2 = \{\emptyset, \Omega, A_H, A_T, A_{HH}, A_{HT}, A_{TH}, A_{TT}, A_{HH}^c, A_{HT}^c, A_{TH}^c, A_{TT}^c\} \cup \{A_{HH}\cup A_{TH}, A_{HH}\cup A_{TT}, A_{HT}\cup A_{TH}, A_{HT}\cup A_{TT}\}.$$
This σ-algebra contains all information learned from the first two coin tosses. For three coin tosses, we have $\mathcal{F}_3 = 2^{\Omega}$, with $2^{2^3} = 256$ subsets belonging to the σ-algebra of Ω. In the beginning, we only have $\mathcal{F}_0 = \{\emptyset, \Omega\}$, known as the trivial σ-field. Note that for $n < m$, $\mathcal{F}_n \subset \mathcal{F}_m$, and as the indexed σ-algebras grow w.r.t. time, we have more information. The collection of σ-algebras $\mathcal{F}_0, \mathcal{F}_1, \mathcal{F}_2, \mathcal{F}_3$ is known as a filtration.

Definition 305 (Filtration). Let Ω be a non-empty set, and T ∈ R+ . Assume for t ∈ [0, T ] that
there exists σ-algebra F(t). If ∀s, t, s ≤ t =⇒ F(s) ⊂ F(t) then we say the collection of σ-algebras
F(t), 0 ≤ t ≤ T is a filtration.

Exercise 578. Consider the sample space $\Omega = C_0[0, T]$, the set of continuous functions defined on [0, T] taking the value zero at time t = 0. Randomly sample a function $\bar{\omega}$ and observe it up to time $t \in [0, T]$. For $t < T$, only some of the subsets of Ω are resolved, since we only know the values $\bar{\omega}(s)$, $0 \leq s \leq t$. An example of a resolved set would be $\{\omega \in \Omega : \max_{0\leq s\leq t}\omega(s) \leq 1\}$, which is in $\mathcal{F}(t)$. An example of an unresolved set would be $\{\omega \in \Omega : \omega(T) > 0\}$ if t < T. The resolved sets are those that can be described by the path of ω up to time t.

Since the σ-algebras were formed by starting with some 'fundamental sets' and adding everything else, we might posit that it would suffice to work with atoms, the indivisible sets in the σ-algebra, instead of all the other sets. The issue is that for uncountable sample spaces there exist sets that cannot be constructed as a countable union of atoms, and uncountable unions are forbidden: we cannot add up probabilities of such unions in the first place. Consider a fixed $t \in [0, T]$ as in Exercise 578, and choose a continuous function f(u) defined on $0 \leq u \leq t$ with f(0) = 0. Then the set of continuous functions $\omega \in C_0[0, T]$ agreeing with f on [0, t] and free to take any values on (t, T] forms an atom of $\mathcal{F}(t)$. This atom can be represented $\{\omega \in C_0[0, T] : \omega(u) = f(u)\ \forall u \in [0, t]\}$. Each time we pick another function f(u) defined on $0 \leq u \leq t$ we arrive at a new atom. There is, however, no way to build a set such as $\{\omega \in \Omega : \omega(t) > 0\}$ by taking countable unions of such atoms. Moreover, these atoms usually have zero probability. We shall therefore work with the sets of $\mathcal{F}(t)$ and not with the atoms.
Let X be a random variable given by some functional formula. Another way we may be presented with 'information' is to be told the value of X(ω). This resolves some sets, such as the set {X ≤ 1}. Every set of the form {X ∈ B}, where B ⊂ R is a Borel set, is resolved.

Definition 306 (σ-algebra generated from a random variable). Let X be a random variable defined
on nonempty sample space Ω, then the σ-algebra generated by X, denoted σ(X) is the collection of all
subsets of Ω of form {X ∈ B} and B ranges over B(R).

Exercise 579. Consider the three-period coin toss model where each head doubles our wealth and each tail halves it. We start with four units, with the random variable $S_t$ denoting our wealth at time t of the game. At time two, we have $S_2(HH\cdot) = 16$, $S_2(HT\cdot) = S_2(TH\cdot) = 4$, $S_2(TT\cdot) = 1$. If we take B to be the singleton of some number, say $B = \{16\}$, then $\{S_2 \in B\} = \{HHH, HHT\} = A_{HH}$. It follows that $A_{HH} \in \sigma(S_2)$, and so on. Taking B = ∅ gives us ∅, and B = R gives Ω. As B ranges over the Borel subsets of R we obtain a list of sets together with all their unions and complements, which turns out to be the σ-algebra $\sigma(S_2)$. Every set in $\sigma(S_2)$ is in the σ-algebra $\mathcal{F}_2$, the information contained in the first two coin tosses. However, $A_{HT}, A_{TH} \in \mathcal{F}_2$ while only their union appears in $\sigma(S_2)$: observing $S_2$ does not tell us about the order of the head and the tail. Since there is enough information in $\mathcal{F}_2$ to determine the value of $S_2$, and even more, we say that $S_2$ is $\mathcal{F}_2$-measurable.
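The coarsening of information by $S_2$ can be made explicit by brute-force enumeration (a small Python sketch of our own; the helper names are illustrative):

```python
from itertools import product

outcomes = ["".join(p) for p in product("HT", repeat=3)]   # the 8 outcomes of 3 tosses

def s2(omega: str, start: float = 4.0) -> float:
    """Wealth after two tosses: each head doubles, each tail halves."""
    value = start
    for toss in omega[:2]:
        value = value * 2 if toss == "H" else value / 2
    return value

# Atoms of sigma(S_2): the level sets {S_2 = b}.
atoms_sigma_s2: dict[float, set[str]] = {}
for omega in outcomes:
    atoms_sigma_s2.setdefault(s2(omega), set()).add(omega)

# Atoms of F_2: sets determined by the first two tosses.
atoms_f2: dict[str, set[str]] = {}
for omega in outcomes:
    atoms_f2.setdefault(omega[:2], set()).add(omega)

print("atoms of sigma(S_2):", atoms_sigma_s2)
print("atoms of F_2       :", atoms_f2)
# sigma(S_2) lumps A_HT and A_TH together (both give S_2 = 4), so it is strictly
# coarser than F_2: S_2 is F_2-measurable, but F_2 also records the order of the
# first head and tail.
```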

Definition 307 (Measurable Random Variables and Their σ-algebras). Let X be a random variable defined on a nonempty sample space Ω, and let G be a σ-algebra of subsets of Ω. Then, if every set in σ(X) is also in G, we say that X is G-measurable. A random variable X is G-measurable iff the information in G is sufficient to determine the value of X. If X is G-measurable, then f(X) is also G-measurable when f is Borel-measurable. If X, Y are G-measurable, then f(X, Y) is G-measurable for any Borel-measurable function f(x, y).

A portfolio position ∆(t) taken at time t must be F(t)-measurable, since the investor must not make decisions using information that is only available in the future!

Definition 308 (Measurable Random Variables and Their Filtration; Adapted Stochastic Processes). Let Ω ≠ ∅ be a sample space with filtration F(t), $t \in [0, T]$ (see Definition 305). Let X(t) be a collection of random variables indexed by t. We say that this collection of random variables is an adapted stochastic process if, for all $t \in [0, T]$, X(t) is F(t)-measurable.

Asset prices, portfolio processes and wealth processes are typically adapted to a filtration of market
information.

6.5.1 Independence and Joint Densities


Here we are interested in situations where there are two or more relevant random variables defined on the same probability space (Ω, F, P) (see Definition 268). The opposite extreme of a random variable that is measurable with respect to a σ-algebra is a random variable that is independent of that σ-algebra: the information within the sets gives no information about the random variable. A middle ground occurs when we have a σ-algebra G and a random variable X that is neither G-measurable nor independent of G. In contrast to the concept of measurability, to discuss independence we need probability measures. Independence can be affected by changes in probability measure, but measurability cannot.
Independence of sets implies that knowledge of an outcome ω belonging to A does not affect our estimate of P(ω ∈ B). Independence of random variables means that observing X(ω) does not affect our estimate of the distribution of Y(ω).

Definition 309 (Independence of Pair Events). Two events A, B are said to be independent if P(A∩B) =
P(A) · P(B).

Result 33. If A ⊥ B then A ⊥ B C , AC ⊥ B C , AC ⊥ B.

Definition 310 (Independence of n Events). Events $A_i, i \in [n]$, are independent if for every subcollection the probability of the intersection factors into the product of the individual probabilities, that is, $P(\cap_{i\in I}A_i) = \prod_{i\in I}P(A_i)$ for each of the $\binom{n}{k}$ subsets $I \subset [n]$ of size k, for all $k \in [n]$.


We need σ-algebras when working with uncountable sets, so we provide the corresponding definitions.

Definition 311 (Independence of Pair σ-algebras). Let (Ω, F, P) be a probability space and H, G be two
sub-σ-algebras of F, so that H, G ⊂ F. Then we say that H is independent of G if

P(A ∩ B) = P(A) · P(B) ∀A ∈ G, ∀B ∈ H

and denote this H ⊥ G.

Definition 312 (Independence of a Pair of Random Variables). We say that two random variables X, Y on (Ω, F, P) are independent, denoted X ⊥ Y, if the σ-algebras they generate are independent: σ(X) ⊥ σ(Y). We say that the random variable X is independent of the σ-algebra G if σ(X) ⊥ G.

Note that the probability measure affects the independence of random variables! To see this, consider the finite coin toss space as in Exercise 579: in general $S_2 \not\perp S_3$, but if $p \in \{0, 1\}$ then they would be independent!

Definition 313 (Independence of σ-algebras). Let (Ω, F, P) be a probability space for which $(G_i)$ is a sequence of sub-σ-algebras of F. For a fixed positive integer n, we say that the n σ-algebras $G_i, i \in [n]$, are independent if
$$P(\cap_{i=1}^{n}A_i) = \prod_{i=1}^{n}P(A_i) \quad \forall A_i \in G_i,\ i \in [n].$$
We say the full sequence of σ-algebras is independent if, for every positive integer n, the first n σ-algebras are independent.

Definition 314 (Independence of Random Variables). Consider a sequence of random variables $\{X_i\}$ on a probability space (Ω, F, P). We say the n random variables $X_i, i \in [n]$, are independent if the σ-algebras $\sigma(X_i), i \in [n]$, are independent. We say the full sequence of random variables is independent if, for every positive integer n, the first n random variables are independent.

Exercise 580. Consider the infinite, independent coin toss space as in Exercise 539. Let Gk be the
σ-algebra of information associated with the k-th coin toss. In other words, Gk comprises sets ∅, Ω∞ and
atoms {ω ∈ Ω∞ : ωk = H}, {ω ∈ Ω∞ : ωk = T }. Under the probability measure defined in Exercise 539,
the full sequence of σ-algebras G1 , G2 , · · · are independent. Define the random variable Yk (ω) = 1{ωk =
H} for the k-th toss, then the full sequence of random variables Y1 , Y2 , · · · are independent.

Theorem 352 (Independence of a Pair of Functions of Random Variables). Let X ⊥ Y and let f, g be Borel-measurable functions on R. Then f(X) ⊥ g(Y).

Proof. Let $\mathcal{A}$ be the σ-algebra generated by f(X), which is a sub-σ-algebra of σ(X). To see this, recall that every set $A \in \sigma(f(X))$ has the form $\{\omega \in \Omega : f(X(\omega)) \in C\}$ where $C \in \mathcal{B}(\mathbb{R})$. Define $D = \{x \in \mathbb{R} : f(x) \in C\}$ and we have $A = \{\omega \in \Omega : f(X(\omega)) \in C\} = \{\omega \in \Omega : X(\omega) \in D\} \in \sigma(X)$. Let $\mathcal{B}$ be the σ-algebra generated by g(Y), which is similarly a sub-σ-algebra of σ(Y). It then follows from X ⊥ Y that $P(A \cap B) = P(A)\cdot P(B)$ for every $A \in \mathcal{A}$, $B \in \mathcal{B}$. We may intuit the proof in words: since functions are many-to-one or one-to-one but never one-to-many, mapping sets through a function gives us at most as much information as the original set itself.

Result 34. For a random variable X and a Borel-measurable function f, we are guaranteed the inclusion
$$\sigma(f(X)) \subseteq \sigma(X). \qquad (2411)$$
Definition 315 (Joint Distribution Measure of a Pair of Random Variables). Let X, Y be random variables. The pair (X, Y) takes values in the plane $\mathbb{R}^2$, and the joint distribution measure of (X, Y) is given by
$$\mu_{X,Y}(C) = P\{(X, Y) \in C\} \quad \forall C \in \mathcal{B}(\mathbb{R}^2),$$
where the C are Borel sets. This is a probability measure ($\mu_{X,Y}(\mathbb{R}^2) = 1$ and countable additivity is satisfied). The joint cumulative distribution function is
$$F_{X,Y}(a, b) = \mu_{X,Y}((-\infty, a]\times(-\infty, b]) = P\{X \leq a, Y \leq b\} \quad \forall a, b \in \mathbb{R}.$$
We say that a nonnegative, Borel-measurable function $f_{X,Y}(x, y)$ is a joint density if
$$\mu_{X,Y}(C) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\mathbf{1}_C(x, y)f_{X,Y}(x, y)\,dy\,dx \quad \forall \text{ Borel sets } C \subset \mathbb{R}^2.$$
This holds iff
$$F_{X,Y}(a, b) = \int_{-\infty}^{a}\int_{-\infty}^{b}f_{X,Y}(x, y)\,dy\,dx \quad \forall a, b \in \mathbb{R}.$$
The (marginal) distribution measures of X, Y are respectively
$$\mu_X(A) = P(X \in A) = \mu_{X,Y}(A\times\mathbb{R}) \quad \forall \text{ Borel subsets } A \subset \mathbb{R},$$
$$\mu_Y(B) = P(Y \in B) = \mu_{X,Y}(\mathbb{R}\times B) \quad \forall \text{ Borel subsets } B \subset \mathbb{R}.$$
The (marginal) c.d.f.s are then
$$F_X(a) = \mu_X(-\infty, a] \quad \forall a \in \mathbb{R}, \qquad F_Y(b) = \mu_Y(-\infty, b] \quad \forall b \in \mathbb{R}.$$
If the marginal densities exist (which is the case when the joint density exists), they are nonnegative Borel-measurable functions such that $\mu_X(A) = \int_A f_X(x)\,dx$ and $\mu_Y(B) = \int_B f_Y(y)\,dy$ for all Borel subsets A, B ⊂ R. These conditions hold iff $F_X(a) = \int_{-\infty}^{a}f_X(x)\,dx$ and $F_Y(b) = \int_{-\infty}^{b}f_Y(y)\,dy$ for all a, b ∈ R, respectively.

We may generate the σ-algebra of Borel subsets of R2 by starting with closed rectangles [a1 , b1 ] ×
[a2 , b2 ] and then adding all other sets necessary. Any set in this σ-algebra is said to be a Borel subset of
R2 .
Under non-measure theoretic settings we would discriminate between the discrete and continuous
variants. We introduce them but keep in mind their unifying interpretations under the distribution
measure.

Definition 316 (Joint and Marginal Densities of Discrete Random Variables). For a discrete sample space Ω and random variables X, Y, the joint probability density function is $p_{X,Y}(x, y) = P(\{\omega \in \Omega \mid X(\omega) = x \wedge Y(\omega) = y\}) = P(X = x, Y = y)$. Their marginal probability density functions are $p_X(x) = \sum_y p_{X,Y}(x, y)$ and $p_Y(y) = \sum_x p_{X,Y}(x, y)$ respectively.

In the case of continuous random variables we need to deal with integrals, that is

Definition 317 (Joint and Marginal Densities of Continuous Random Variables). For random variables $(X_i, i \in [n])$ defined over Euclidean space with some function satisfying $P((X_i, i \in [n]) \in R \subset \mathbb{R}^n) = \int\cdots\int_R f_{(X)_{i\in[n]}}((x)_{i\in[n]})\,dx_1\,dx_2\cdots dx_n$, we say that $f_{(X)_{i\in[n]}}((x)_{i\in[n]})$ is the joint density. The existence of a joint density implies the joint continuity of the random variables over the region R. At any point the function f maps to nonnegative numbers, and the integral over the entire space equals one. The order of integration does not matter. The marginal densities are the joint densities with the other variables integrated out. We show the computation for a bivariate density:
$$f_X(x) = \frac{d}{dx}F_X(x) = \frac{d}{dx}P(X \leq x) = \frac{d}{dx}\int_{-\infty}^{x}\int_{-\infty}^{\infty}f_{X,Y}(u, y)\,dy\,du = \int_{-\infty}^{\infty}f_{X,Y}(x, y)\,dy. \qquad (2412)$$
If the joint probability density function exists then it is related to the joint c.d.f. by
$$f_{(X_i), i\in[n]}(x_{i\in[n]}) = \frac{\delta^n}{\delta x_1\cdots\delta x_n}F_{X_1,\cdots,X_n}(x_1, \cdots, x_n). \qquad (2413)$$
Exercise 581. Suppose a man watches TV and reads a book for times X, Y respectively, with joint density $f(x, y) = xy\exp(-(x + y))$ for $x, y \geq 0$. The probability that he spends at least twice as much time watching TV as he reads can be written $P(X \geq 2Y) = \int_0^{\infty}\int_0^{x/2}xy\exp(-(x+y))\,dy\,dx = \int_0^{\infty}x\exp(-x)\left[\int_0^{x/2}y\exp(-y)\,dy\right]dx$.
Definition 318 (Geometric Probability). The jointly uniform density is the function $f_{X,Y}(x, y) = \frac{1}{(b-a)(d-c)}$ for $x \in [a, b]$, $y \in [c, d]$, and $P((X, Y) \in R) = \frac{\Delta(R)}{(b-a)(d-c)}$ where Δ is the area function. Such probabilities are known as geometric probabilities.
Theorem 353 (Equivalence of Independence Conditions). Let X, Y be random variables. The following statements are equivalent (the statements on joint densities assume their existence):

1. X ⊥ Y.

2. (factoring of joint distribution measures) $\mu_{X,Y}(A\times B) = \mu_X(A)\cdot\mu_Y(B)$ for all Borel subsets A, B ⊂ R.¹

3. (joint c.d.f. factors) $F_{X,Y}(a, b) = F_X(a)\cdot F_Y(b)$ for all a, b ∈ R.

4. (joint moment generating function factors) $E\exp(uX + vY) = E\exp(uX)\cdot E\exp(vY)$ for all u, v ∈ R for which the expectations are finite.

5. (joint density factors) $f_{X,Y}(x, y) = f_X(x)\cdot f_Y(y)$ for almost every x, y ∈ R.

These conditions imply (but are not equivalent to) the factoring of expectations,
$$E[XY] = EX\,EY, \qquad (2414)$$
provided that $E|XY| < \infty$.

Proof. The proof provided here originates from the exercises in Shreve [19]. First we prove (1) → (2). Assume X ⊥ Y; then
$$\mu_{X,Y}(A\times B) = P(X \in A \wedge Y \in B) \qquad (2415)$$
$$= P(X \in A)\cdot P(Y \in B) \qquad (2416)$$
$$= \mu_X(A)\cdot\mu_Y(B). \qquad (2417)$$
(2) → (1): A set of σ(X) is of the form {X ∈ A} and one of σ(Y) of the form {Y ∈ B}. Then, assuming that the joint distribution measure factors, we obtain
$$P(\{X \in A\}\cap\{Y \in B\}) = P\{X \in A, Y \in B\} \qquad (2418)$$
$$= \mu_{X,Y}(A\times B) \qquad (2419)$$
$$= \mu_X(A)\cdot\mu_Y(B) \qquad (2420)$$
$$= P(X \in A)\cdot P(Y \in B), \qquad (2421)$$
¹ Note that in higher-dimensional cases of more than two random variables, factoring of the joint distribution into the product of the marginals for the full collection is a sufficient condition for their independence; we do not need to check that every jointly distributed subcollection also factors.

and therefore every set in σ(X) is independent of every set in σ(Y).
For (2) → (3), assuming the joint distribution measure factors again, we have
$$F_{X,Y}(a, b) = \mu_{X,Y}((-\infty, a]\times(-\infty, b]) \qquad (2422)$$
$$= \mu_X(-\infty, a]\cdot\mu_Y(-\infty, b] \qquad (2423)$$
$$= F_X(a)\cdot F_Y(b). \qquad (2424)$$
(3) → (2): Now assume that the joint c.d.f. factors; whenever we can express A = (−∞, a], B = (−∞, b] we are done, and such intervals generate the Borel σ-algebra.
(3) → (5): Assuming the joint c.d.f. factors, if there exists a joint density then
$$\int_{-\infty}^{a}\int_{-\infty}^{b}f_{X,Y}(x, y)\,dy\,dx = \int_{-\infty}^{a}f_X(x)\,dx\cdot\int_{-\infty}^{b}f_Y(y)\,dy.$$
Differentiating w.r.t. a and b stepwise, we obtain $f_{X,Y}(a, b) = f_X(a)\cdot f_Y(b)$ and we are done.
(5) → (3): Assume there exists a joint density that factors; then by integration we obtain
$$F_{X,Y}(a, b) = \int_{-\infty}^{a}\int_{-\infty}^{b}f_{X,Y}(x, y)\,dy\,dx \qquad (2425)$$
$$= \int_{-\infty}^{a}\int_{-\infty}^{b}f_X(x)\cdot f_Y(y)\,dy\,dx \qquad (2426)$$
$$= \int_{-\infty}^{a}f_X(x)\,dx\cdot\int_{-\infty}^{b}f_Y(y)\,dy \qquad (2427)$$
$$= F_X(a)\cdot F_Y(b). \qquad (2428)$$

(1) → (4): is proved by way of the standard machine (see Theorem 344), starting with the expression of h as an indicator function of a Borel subset B ⊂ R². For every Borel-measurable h(x, y) : R² → R, we have
E|h(X, Y)| = \int_{\mathbb{R}^2} |h(x, y)|\,d\mu_{X,Y}(x, y),
which, if finite, gives
E\,h(X, Y) = \int_{\mathbb{R}^2} h(x, y)\,d\mu_{X,Y}(x, y).
Furthermore, if X ⊥ Y we have E\,h(X, Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x, y)\,d\mu_Y(y)\,d\mu_X(x), and we can fix u, v and take h(x, y) = exp(ux + vy). It follows that
E \exp(uX + vY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp(ux + vy)\,d\mu_Y(y)\,d\mu_X(x)   (2429)
= \int_{-\infty}^{\infty} \exp(ux)\,d\mu_X(x) \cdot \int_{-\infty}^{\infty} \exp(vy)\,d\mu_Y(y)   (2430)
= E\exp(uX) \cdot E\exp(vY).   (2431)

(4) → (1): is beyond our scope.


(1) → (6): When h(x, y) = xy we have EXY = \int_{-\infty}^{\infty} x\,d\mu_X(x) \cdot \int_{-\infty}^{\infty} y\,d\mu_Y(y) = EX\,EY.

Exercise 582 (Independent Random Normals). We say that X, Y are independent standard normal if
they have joint density
1 1
fX,Y (x, y) = exp(− (x2 + y 2 )) ∀x, y ∈ R.
2π 2

396
This is a product of the marginal densities f(·) (x) = √1 exp(− 21 x2 ) where (·) ∈ {X, Y }, and we let

the notation N (a) = Φ(a) for ∼ Φ(0, 1) as in the usual notations for the standard c.d.f. for standard
Ra Rb
normals. The joint c.d.f for (X, Y ) factors and we have FX,Y (a, b) = −∞ −∞ fX (x)fY (y)dydx =
Ra Rb
−∞ X
f (x)dx −∞ fY (y)dy = N (a) · N (b). The joint distribution µX,Y is probability measure on R2
assigning each Borel set C ⊂ R2 equal to the integral of fX,Y (x, y) over C. If we have C = A ×
R R
B, where A, B ∈ B(R) then µX,Y factors as is expressed µX,Y (A × B) = A B fX (x)fY (y)dydx =
R R
f (x)dx B fY (y)dy = µX (A) · µY (B).
A X

We saw in Equation 2414 that the expectation factors when two random variables are independent. See from the definition of covariance (Definition ??) that if the expectation factors, then they must be uncorrelated! Therefore, independence implies uncorrelatedness. The converse is not true in general (it is true in the case of jointly normal random variables, see Definition 344).

Exercise 583 (Uncorrelated and Dependent Normal Random Variables). Here we show an example that re-iterates: uncorrelatedness is not independence. Let X be a standard normal random variable and Z ⊥ X satisfying P(Z = 1) = 1/2, P(Z = −1) = 1/2. Define Y = ZX. We show that X, Y are standard normal, that they are uncorrelated, and that they are not independent. Moreover, (X, Y) does not have a joint density. First consider the distribution of Y through its c.d.f.:

F_Y(b) = P(Y \le b) = P(Y \le b \wedge Z = 1) + P(Y \le b \wedge Z = -1)   (2432)
= P(X \le b \wedge Z = 1) + P(-X \le b \wedge Z = -1)   (2433)
= P(X \le b) \cdot P(Z = 1) + P(-X \le b) \cdot P(Z = -1)   (2434)
= P(X \le b) \cdot \tfrac{1}{2} + P(-X \le b) \cdot \tfrac{1}{2}.   (2435)

Since X is standard normal, −X is standard normal. It follows that P{X ≤ b} = P{−X ≤ b} = N(b) = F_Y(b) and Y is standard normal. Cov(X, Y) = EXY = EZX², and since Z ⊥ X we have Z ⊥ X², so their expectation factors and EZX² = EZ · EX² = 0, which implies Cov(X, Y) = 0 = ρ(X, Y). They are uncorrelated. Consider that if X ⊥ Y, then |X| ⊥ |Y| and

P\{|X| \le 1, |Y| \le 1\} = P\{|X| \le 1\} = N(1) - N(-1) \ne (N(1) - N(-1))^2 = P\{|X| \le 1\} \cdot P\{|Y| \le 1\}.

In other words, they are not independent random variables. Now consider the joint distribution measure µ_{X,Y}. Since |X| = |Y|, the pair (X, Y) is a member of the Borel set C = {(x, y) : x = ±y}. Hence we have µ_{X,Y}(C) = 1, µ_{X,Y}(C^c) = 0. However, C has zero area, and for any nonnegative f we must have

\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} 1_C(x, y) f(x, y)\,dy\,dx = 0.

When we integrate the function 1_C(x, y)f(x, y) over the plane R², we first fix x and integrate out y; since the integrand is then zero except at the two points y = ±x, a set of measure zero, the inner integral is zero. Integrating out x in the next step is integrating a zero function, and we again get zero. There cannot be a joint density for (X, Y)! The joint cumulative distribution function exists (as it always does), and takes the form:

F_{X,Y}(a, b) = P(X \le a \wedge Y \le b)   (2436)
= P(X \le a \wedge X \le b \wedge Z = 1) + P(X \le a \wedge -X \le b \wedge Z = -1)   (2437)
= P(Z = 1) \cdot P(X \le \min(a, b)) + P(Z = -1) \cdot P(-b \le X \le a)   (2438)
= \tfrac{1}{2} N(\min(a, b)) + \tfrac{1}{2}\max(N(a) - N(-b), 0).   (2439)

Show the same result (of uncorrelatedness but non-independence) with the use of moment generating functions. Note that this argument requires the iterated conditioning property, discussed in Theorem 356.

Proof. Consider that the joint moment generating function of (X, Y) is written

E \exp(uX + vY) = E \exp(uX + vXZ)   (2440)
= E[E[\exp(uX + vXZ)|Z]]   (2441)
= E[\exp(uX + vXZ)|Z = 1] \cdot P(Z = 1)   (2442)
+ E[\exp(uX + vXZ)|Z = -1] \cdot P(Z = -1)   (2443)
= \tfrac{1}{2} E\exp(uX + vX) + \tfrac{1}{2} E\exp(uX - vX)   (2444)
= \tfrac{1}{2}\left(\exp\left(\frac{(u + v)^2}{2}\right) + \exp\left(\frac{(u - v)^2}{2}\right)\right).   (2445)

Then by setting u = 0 we see that Y has the mgf of a standard normal, but since the joint mgf does not factor into the product of the marginal mgfs of X, Y (see Theorem 377), we obtain that X and Y are not independent.
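A small simulation (an illustrative sketch, not from the original text) makes the conclusion of Exercise 583 concrete: Y = ZX passes as a standard normal marginally and is uncorrelated with X, yet the joint behaviour of (|X|, |Y|) betrays the dependence.

# Sketch: simulate X standard normal, Z = +/-1 with prob 1/2, and Y = Z*X.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.standard_normal(n)
Z = rng.choice([-1.0, 1.0], size=n)
Y = Z * X

print(Y.mean(), Y.std())            # ~0 and ~1: Y looks standard normal
print(np.corrcoef(X, Y)[0, 1])      # ~0: uncorrelated
# Dependence: P(|X|<=1, |Y|<=1) equals P(|X|<=1), not the product of the marginals.
p_joint = np.mean((np.abs(X) <= 1) & (np.abs(Y) <= 1))
p_x = np.mean(np.abs(X) <= 1)
print(p_joint, p_x, p_x ** 2)       # p_joint ~ p_x, clearly not p_x**2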

Exercise 584 (Another Pair of Uncorrelated but Dependent Normal Random Variables). Shreve [19] Let X, Y be random variables with joint density written

f_{X,Y}(x, y) = \begin{cases} \dfrac{2|x| + y}{\sqrt{2\pi}}\exp\left(-\dfrac{(2|x| + y)^2}{2}\right) & y \ge -|x| \\ 0 & y < -|x| \end{cases}   (2446)

and show that X, Y ∼ Φ(0, 1), ρ(X, Y) = 0, and X, Y are not independent.

Proof. By integrating out Y and using the substitution ξ = 2|x| + y, see that

f_X(x) = \int f_{X,Y}(x, y)\,dy = \int_{\{y \ge -|x|\}} \frac{2|x| + y}{\sqrt{2\pi}}\exp\left(-\frac{(2|x| + y)^2}{2}\right)dy   (2447)
= \int_{\{\xi \ge |x|\}} \frac{\xi}{\sqrt{2\pi}}\exp\left(-\frac{\xi^2}{2}\right)d\xi   (2448)
= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right). \quad \text{(verify this)}   (2449)

Now, see that f_Y(y) can also be written f_Y(y) = \int f_{X,Y}(x, y)\,dx

= \int_{\{|x| \ge -y\}} \frac{2|x| + y}{\sqrt{2\pi}}\exp\left(-\frac{(2|x| + y)^2}{2}\right)dx   (2450)
= \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right). \quad \text{(verify this)}   (2451)

It may also be obtained that EXY = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\,f_{X,Y}(x, y)\,dx\,dy = 0 (verify this). Then they are dependent, since f_{X,Y}(x, y) \ne f_X(x) \cdot f_Y(y), but uncorrelated, since EXY = 0.

6.5.2 Conditioning Probabilities, Densities and Expectations


Now that we have introduced probability densities on both joint and marginal variables, we can define
conditional probabilities. We look at the effect of information on probability estimates, as well as on
random variable distributions.
First we shall look at the probability of sets.

Definition 319. For events A, B defined on a sample space Ω such that P(B) > 0, the conditional probability of A given that B occurred is written P(A|B) = \frac{P(A \cap B)}{P(B)}.

Corollary 14. An immediate corollary from independence of events (Definition 310) and conditional
probability (Definition 319) is that for independent A, B, P(A|B) = P(A).

Theorem 354 (Probability Chain Rule). For events A_{i\in[n]} the chain rule states that

P\left(\bigcap_{i=1}^{n} A_i\right) = \prod_{k=1}^{n} P\left(A_k \,\Big|\, \bigcap_{j=1}^{k-1} A_j\right).   (2452)

Proof. This can be easily seen by defining events onto sequentially higher order events and applying induction on the rule P(A ∩ B) = P(A|B)P(B).

Definition 320 (Partitioning Sample Spaces). A set of events {A_i | i ∈ [n]} is said to partition the sample space S if S = \bigcup_{i=1}^{n} A_i, A_i ∩ A_j = ∅ for all i ≠ j, and P(A_i) > 0 for all i.

Theorem 355 (Law of Total Probability). For a set {A_i | i ∈ [n]} of measurable events that partition the sample space S (see Definition 320), any event B defined on the same probability space can be written to have probability measure

P(B) = \sum_{i=1}^{n} P(B \cap A_i),

which is equivalent to

P(B) = \sum_{i=1}^{n} P(B|A_i) P(A_i).

Definition 321 (Bayes Theorem). The Bayes theorem is most succinctly stated

P(A|B) = \frac{P(B|A)P(A)}{P(B)}.   (2453)

More specifically, for a sample space S with partition set {A_i | i ∈ [n]},

P(A_j|B) = \frac{P(B|A_j)P(A_j)}{\sum_{i=1}^{n} P(B|A_i)P(A_i)}.   (2454)

Proof. The proof follows from simple applications of the Law of Total Probability (see Theorem 355)
and conditional probability (see Definition 319).

Often we have P(A|B), but the quantity of interest is P(B|A). The Bayes theorem (Definition 321) relates the two. For instance, we often have P(Positive|Cancer), the probability of a test correctly identifying a cancer-positive patient. But we want to know the probability that someone who tests positive actually has cancer, that is, P(Cancer|Positive). We then call upon the Bayes theorem.
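The computation is a one-liner once the pieces of the law of total probability are in place; a minimal sketch follows, where the sensitivity, specificity and prevalence figures are made-up inputs for illustration only.

# Bayes theorem for P(Cancer | Positive); the numbers below are hypothetical.
sensitivity = 0.95   # P(Positive | Cancer)
specificity = 0.98   # P(Negative | No Cancer)
prevalence = 0.005   # P(Cancer)

# Law of total probability for P(Positive), then Bayes theorem.
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_cancer_given_pos = sensitivity * prevalence / p_pos
print(p_cancer_given_pos)  # much smaller than the sensitivity, roughly 0.19 here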

Exercise 585 (The Shooter Paradox, Problem from Larsen and Marx [12]). Andy, Bob and Charlie love Diana, but Diana only loves strong men. Andy, Bob and Charlie are placed in a shootout for Diana's heart, and they shoot at each other in the order Andy, Bob, Charlie, Andy and so on. Andy is not that sharp and hits with p = 0.3, Charlie hits with p = 0.5, and Bob does not miss. Andy goes first. Explain his strategy.

Proof. If Andy shoots at Charlie and kills him, Bob goes next and is sure to kill Andy. So he can only aim at Bob. Suppose he hits Bob; then he faces a duel against Charlie, who shoots first, and he survives with probability (denoting CM, AH, AM as the events Charlie misses, Andy hits, Andy misses)

P(\text{Andy Survives}) = P[(CM_1 \cap AH_1) \cup (CM_1 \cap AM_1 \cap CM_2 \cap AH_2) \cup \cdots]   (2455)
= 0.5 \cdot 0.3 + 0.5 \cdot 0.7 \cdot 0.5 \cdot 0.3 + \cdots   (2456)
= 0.5 \cdot 0.3 \cdot \frac{1}{1 - 0.35}   (2457)
= \frac{3}{13}.   (2458)

On the other hand, if he misses Bob, then Bob kills Charlie and Andy survives iff he hits Bob on his next shot, which has probability 3/10 > 3/13. He is better off shooting into the sky on his first shot.

Exercise 586 (Caesar's last breath). There are an estimated 10^{44} molecules of air on earth, and each breath takes in about 2 · 10^{22} molecules. What is the probability that your very next breath contains at least one molecule from the great Julius Caesar's last breath?

Proof. Number each molecule in our next breath and define the event A_i that the i-th molecule is not a Caesar molecule. Then we want 1 − P(\cap_{i=1}^{2\cdot10^{22}} A_i), and we have P(A_1) = \frac{10^{44} − 2\cdot10^{22}}{10^{44}}, P(A_2|A_1) = \frac{10^{44} − 2\cdot10^{22} − 1}{10^{44} − 1}, from which we see that P(A_1) ≈ P(A_2|A_1), and the same holds for the higher order conditioning. Then we can treat the probabilities unconditionally, as in sampling from infinite spaces, and take

1 - \left(\frac{10^{44} - 2\cdot10^{22}}{10^{44}}\right)^{2\cdot10^{22}}   (2459)
= 1 - \left(1 - \frac{2}{10^{22}}\right)^{2\cdot10^{22}}.   (2460)

But see that for small x we have 1 − x ≈ exp(−x), and we can write 1 − (\exp(−2/10^{22}))^{2\cdot10^{22}} = 1 − \exp(−4) ≈ 0.98 as our probability estimate, a surprising result for most.
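The approximation is easy to confirm numerically; the short sketch below uses log1p to avoid the catastrophic cancellation in 1 − 2·10^{-22}, and the molecule counts are just the rough figures quoted in the exercise.

# Numerical check of Exercise 586 with the quoted (rough) molecule counts.
import math

n_air = 1e44      # molecules of air on earth (estimate from the exercise)
n_breath = 2e22   # molecules per breath
p_no_caesar = math.exp(n_breath * math.log1p(-n_breath / n_air))  # (1 - 2e22/1e44)^(2e22)
print(1 - p_no_caesar, 1 - math.exp(-4))  # both approximately 0.9817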

Definition 322. Let X, Y be discrete random variables. Then the conditional probability density of Y given X = x is written p_{Y|x}(y) = P(Y = y|X = x) = \frac{p_{X,Y}(x, y)}{p_X(x)}. Note that here X is at the fixed realized value x and the function is non-random. If X, Y are continuous, then for joint density f_{X,Y}(x, y) and marginal density f_X(x), we write the conditional cumulative distribution function as

P(Y \le y|X = x) = \int_{-\infty}^{y} \frac{f_{X,Y}(x, u)}{f_X(x)}\,du.   (2461)

Proof. We show the proof for the continuous case. Since point probabilities for continuous random variables are zero everywhere, we use limiting arguments. We write

P(Y \le y|X = x) = \lim_{h\to0} P(Y \le y|x \le X \le x + h) = \lim_{h\to0} \frac{\int_x^{x+h}\int_{-\infty}^{y} f_{X,Y}(t, u)\,du\,dt}{\int_x^{x+h} f_X(t)\,dt}.   (2462)

By application of the L'Hopital rule (see Theorem 3) and the Fundamental Theorem of Calculus we obtain

P(Y \le y|X = x) = \lim_{h\to0} \frac{\int_{-\infty}^{y} f_{X,Y}(x + h, u)\,du}{f_X(x + h)} = \int_{-\infty}^{y} \frac{f_{X,Y}(x, u)}{f_X(x)}\,du.   (2463)

Consider a finite coin toss problem of N tosses with head probability p and tail probability q, with Ω being the sample space. Let P(ω) denote the probability of a sequence of coin tosses under these assumptions. Let n ∈ Z⁺, 1 ≤ n ≤ N − 1, and let X be some random variable. The conditional expectation of X under measure P given the information at time n is the random variable

E_n[X](\omega_1 \cdots \omega_n) = \sum_{\omega_{n+1}\cdots\omega_N} p^{\#H(\omega_{n+1}\cdots\omega_N)} q^{\#T(\omega_{n+1}\cdots\omega_N)} X(\omega_1 \cdots \omega_N).

Here the conditioning is on a random variable that is not yet realized, therefore the expectation term itself is random (contrast this with Definition 322). The bracketed arguments represent the outcomes drawn from Ω. Note that when n = 0 we have E_0[X] = EX, and E_N[X](ω_1 ··· ω_N) = X(ω_1 ··· ω_N). We want
to set up a continuous time analogue. Consider the three-period model, we can then write

E2 [S3 ](HH) = pS3 (HHH) + qS3 (HHT ) (2464)


E2 [S3 ](HT ) = pS3 (HT H) + qS3 (HT T ) (2465)
E2 [S3 ](T H) = pS3 (T HH) + qS3 (T HT ) (2466)
E2 [S3 ](T T ) = pS3 (T T H) + qS3 (T T T ) (2467)

The σ-algebra F_2 was constructed from the four fundamental sets (atoms that are indivisible within the σ-algebra), namely A_{HH}, A_{HT}, A_{TH}, A_{TT}. We can rewrite the above so that we have formulations such as E_2[S_3](HH) = \sum_{\omega\in A_{HH}} S_3(\omega)\frac{P(\omega)}{P(A_{HH})}, and so on for the remaining equations. However, since we will be working in continuous, infinite spaces, we shall instead write E_2[S_3](HH) \cdot P(A_{HH}) = \sum_{\omega\in A_{HH}} S_3(\omega)P(\omega) to be more precise. We shall take an alternate route to allow for continuous-time models. On each of the atoms in F_2, the conditional expectation E_2[S_3] is constant, since it does not depend on the third toss, and each atom holds the first two coin tosses fixed. We shall then let the LHS be written as the integral of the integrand E_2[S_3] over the atom, and write E_2[S_3](ω), where ω = ω_1ω_2ω_3, even though ω_3 is irrelevant. The RHS are sums, which are Lebesgue integrals on finite probability spaces. In Lebesgue notation, we can rewrite
\int_{A_{HH}} E_2[S_3](\omega)\,dP(\omega) = \int_{A_{HH}} S_3(\omega)\,dP(\omega)   (2468)
⋮
\int_{A_{TT}} E_2[S_3](\omega)\,dP(\omega) = \int_{A_{TT}} S_3(\omega)\,dP(\omega).   (2469)

That is, on each of the atoms, the value of the conditional expectation is the constant that yields the average over the atom. We can do this for other sets in F_2, obtaining

\int_{A_{H}} E_2[S_3](\omega)\,dP(\omega) = \int_{A_{H}} S_3(\omega)\,dP(\omega),
\int_{A} E_2[S_3](\omega)\,dP(\omega) = \int_{A} S_3(\omega)\,dP(\omega),

and so on for all sets A ∈ F_2. This is called the partial averaging property of conditional expectations - the conditional expectation and the random variable being estimated give the same value when averaged over parts of Ω.
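The partial averaging identity can be checked directly on a toy three-period binomial tree. The sketch below assumes the familiar parameters S_0 = 4, u = 2, d = 1/2 and p = q = 1/2; these particular numbers are an assumption for illustration and are not prescribed by the text.

# Verify partial averaging on the atom A_HH: integral of E2[S3] equals integral of S3.
from itertools import product

S0, u, d, p = 4.0, 2.0, 0.5, 0.5
q = 1.0 - p

def S3(omega):                    # stock price after three tosses
    s = S0
    for w in omega:
        s *= u if w == "H" else d
    return s

def prob(omega):                  # probability of a full path of tosses
    return p ** omega.count("H") * q ** omega.count("T")

def E2_S3(w1, w2):                # conditional expectation given the first two tosses
    return p * S3(w1 + w2 + "H") + q * S3(w1 + w2 + "T")

atom = [w1 + w2 + w3 for (w1, w2, w3) in product("HT", repeat=3) if (w1, w2) == ("H", "H")]
lhs = sum(E2_S3(w[0], w[1]) * prob(w) for w in atom)   # integral of E2[S3] over A_HH
rhs = sum(S3(w) * prob(w) for w in atom)               # integral of S3 over A_HH
print(lhs, rhs)   # equal, as partial averaging requires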

Definition 323 (Conditional Expectations). Let (Ω, F, P) be a probability space and G be a sub-σ-algebra of F. Let X be a random variable that is nonnegative or integrable; then the conditional expectation of X given G, denoted E[X|G], is a random variable satisfying

1. Measurability: E[X|G] is G-measurable, and

2. Partial Averaging: \int_A E[X|G](\omega)\,dP(\omega) = \int_A X(\omega)\,dP(\omega) \quad \forall A \in G.

If G is the σ-algebra generated by some other random variable, say W, such that G = σ(W), we usually abbreviate and write E[X|W]. This is the unifying simplification we make in the difference between measure theoretic and non-measure theoretic treatments.

Although the estimate of X based on information in G is itself a random variable, the value of the
estimate E[X|G] can be determined from the information in G. Partial averaging ensures that E[X|G] is
indeed an estimate of X, providing the same averages as X over all sets in G.
Note that we are guaranteed the existence of E[X|G]. The proof relies on the Radon-Nikodym Theorem (Theorem 347). We show uniqueness. Suppose Y, Z both satisfy the conditions of measurability and partial averaging. Since both are G-measurable, their difference is G-measurable and the set A = {Y − Z > 0} ∈ G. We have

\int_A Y(\omega)\,dP(\omega) = \int_A X(\omega)\,dP(\omega) = \int_A Z(\omega)\,dP(\omega),

which implies \int_A (Y(\omega) - Z(\omega))\,dP(\omega) = 0. Since the integrand is strictly positive on A, this holds only if P(A) = 0, and so Y ≤ Z almost surely. The same argument yields Z ≤ Y, and therefore Y = Z almost surely. The set of ω for which the random variables differ (do not agree) has zero probability, even though different procedures can result in different versions of the random variable when determining E[X|G].

Theorem 356 (Properties of Conditional Expectations). Assume probability space (Ω, F, P) and G ⊂ F
be sub-σ algebra. The proof and theorems again originate from Shreve [19].

1. Linearity of conditional expectations: If X, Y integrable and c1 , c2 constants, or both nonnegative


random variables with c1 , c2 > 0, we have E[c1 X + c2 Y |G] = c1 E[X|G] + c2 E[Y |G]

2. If X, Y, XY are integrable random variables, with X being G-measurable, then E[XY |G] = XE[Y |G].
Holds also if X > 0, Y ≥ 0.

3. Iterated Conditioning: If H is sub-σ-algebra of G and X integrable, then

E[E{X|G}|H] = E[X|H].

Holds if X ≥ 0 also.

4. Independence: If X integrable, X ⊥ G then E[X|G] = EX. Holds if X ≥ 0 also.

5. Conditional Jensen’s Inequality: If ϕ(x) is convex, and X integrable, then

E[ϕ(X)|G] ≥ ϕ(E[X|G])

Proof. -

1. To prove LHS = RHS, we need to verify the measurability and partial averaging conditions. First, the RHS is G-measurable since both E[X|G] and E[Y|G] are G-measurable. Using the fact that the individual conditional expectation terms satisfy partial averaging, we have ∀A ∈ G:

\int_A (c_1 E[X|G](\omega) + c_2 E[Y|G](\omega))\,dP(\omega)   (2470)
= c_1\int_A E[X|G](\omega)\,dP(\omega) + c_2\int_A E[Y|G](\omega)\,dP(\omega)   (2471)
= c_1\int_A X(\omega)\,dP(\omega) + c_2\int_A Y(\omega)\,dP(\omega)   (2472)
= \int_A (c_1 X(\omega) + c_2 Y(\omega))\,dP(\omega).   (2473)

Hence the RHS satisfies partial averaging.

2. First observe that XE[Y|G] is G-measurable since both X and E[Y|G] are G-measurable. We need to consider partial averaging. For X an indicator of the form X = 1_B with B ∈ G, we have ∀A ∈ G:

\int_A X(\omega)E[Y|G](\omega)\,dP(\omega) = \int_{A\cap B} E[Y|G](\omega)\,dP(\omega)   (2474)
= \int_{A\cap B} Y(\omega)\,dP(\omega)   (2475)
= \int_A X(\omega)Y(\omega)\,dP(\omega),   (2476)

and the partial averaging property is satisfied when X is an indicator function. The remainder of the proof can be done by way of the standard machine (see proof in Theorem 344). XE[Y|G] satisfies the partial averaging condition characterizing E[XY|G] and hence they are equivalent.

3. If we estimate X based on the information in G, followed by estimating that estimate based on the smaller amount of information in H, then we obtain the random variable we would have gotten by directly estimating X conditional on H. As proof, first observe that E[X|H] is H-measurable by definition. By partial averaging, ∀A ∈ H:

\int_A E[E[X|G]|H](\omega)\,dP(\omega) = \int_A E[X|G](\omega)\,dP(\omega)   (2477)
= \int_A X(\omega)\,dP(\omega)   (2478)
= \int_A E[X|H](\omega)\,dP(\omega).   (2479)

E[X|H] satisfies the partial averaging property characterising E[E[X|G]|H], and they must be the same.

4. If X ⊥ G, then our best estimate of X is EX. Note that EX is not random and is measurable w.r.t. every σ-algebra! Verifying the partial averaging property of EX characterising E[X|G], we need

\int_A EX\,dP(\omega) = \int_A X(\omega)\,dP(\omega) \quad \forall A \in G.   (2480)

When X is an indicator random variable of the form X = 1_B, with the event B independent of G, we obtain ∀A ∈ G:

\int_A X(\omega)\,dP(\omega) = P(A \cap B) = P(A)\cdot P(B) = P(A)EX = \int_A EX\,dP(\omega).   (2481)

Other forms of X follow by way of the standard machine (Theorem 344).

5. -

Exercise 587 (Degeneracy of Random Variables). Show that a random variable X defined on probability space (Ω, F, P) that is measurable w.r.t. the trivial σ-field is degenerate.

Proof. Zeng [20] We have proved this under non-measure theoretic settings in Corollary 373. Here we give arguments using limits. See that for any real a ∈ R, the probability P(X ≤ a) is zero or one under the measure defined with respect to the trivial σ-field. Then lim_{a→∞} P(X ≤ a) = 1, lim_{a→−∞} P(X ≤ a) = 0, and therefore ∃x_0 s.t. P(X ≤ x_0) = 1 and P(X ≤ x) = 0 for x < x_0. We can write

P(X = x_0) = \lim_{n\to\infty} P\left(X \in \left(x_0 - \tfrac{1}{n}, x_0\right]\right) = \lim_{n\to\infty}\left(P(X \le x_0) - P\left(X \le x_0 - \tfrac{1}{n}\right)\right) = 1.   (2482)

X is almost surely x_0 ∈ R.

Lemma 4 (Independence). Let (Ω, F, P) be a probability space, and G be a sub-σ-algebra of F. Suppose the random variables X_i, i ∈ [K] are G-measurable and the random variables Y_j ⊥ G, j ∈ [L]. Let f(x_1, ··· , x_K, y_1, ··· , y_L) be a function of the dummy variables x_i, y_j, i ∈ [K], j ∈ [L] and define the function

g(x_1, \cdots, x_K) = E f(x_1, \cdots, x_K, Y_1, \cdots, Y_L).   (2483)

Then, we have

E[f(X_1, \cdots, X_K, Y_1, \cdots, Y_L)|G] = g(X_1, \cdots, X_K).

Since the information in G is sufficient to determine the values of X_1, ··· , X_K, we should hold these constant when estimating f(X_{i∈[K]}, Y_{j∈[L]}). The variables Y_j ⊥ G, j ∈ [L] are integrated out without regard to the information in G. The 'hold constant' and 'integrate out' steps are represented by Equation 2483. Since we get an estimate that depends on the values of X_i, i ∈ [K], the dummy variables x_i, i ∈ [K] are replaced by the random variables X_i, i ∈ [K] in the last step.

Exercise 588. Let (X, Y) be jointly normal (see Definition 344), and define W such that

Y = \rho\frac{\sigma_2}{\sigma_1}X + W   (2484)

with X ⊥ W, as we did in Exercise 620. We saw there that W has mean µ_3 = µ_2 − ρµ_1\frac{σ_2}{σ_1} and variance σ_3² = (1 − ρ²)σ_2². To estimate Y with the conditioning σ-algebra σ(X) we obtain the linear regression function

E[Y|X] = \rho\frac{\sigma_2}{\sigma_1}X + EW = \rho\frac{\sigma_2}{\sigma_1}(X - \mu_1) + \mu_2,   (2485)

where we see that the RHS is a σ(X)-measurable random variable. Subtracting 2485 from 2484, the estimation error is Y − E[Y|X] = W − E[W], which has expectation zero (unbiased) and is independent of the estimate E[Y|X], since E[Y|X] is σ(X)-measurable and W ⊥ σ(X). The independence of the error and the conditioning random variable X is a consequence of joint normality. Often, the conditioning random variable and the errors are uncorrelated but not independent.

Continuing, suppose we want to estimate f(X, Y) conditional on the value of X. Since we are not guaranteed that X ⊥ Y we cannot directly apply the independence lemma (4). By writing Y = \frac{\rho\sigma_2}{\sigma_1}X + W, we obtain X that is σ(X)-measurable and W ⊥ σ(X), where W ∼ Φ(µ_3, σ_3²) - then we may use the independence lemma! We shall write

g(x) = E f\left(x, \frac{\rho\sigma_2}{\sigma_1}x + W\right)   (2486)
= \frac{1}{\sigma_3\sqrt{2\pi}}\int_{-\infty}^{\infty} f\left(x, \frac{\rho\sigma_2}{\sigma_1}x + w\right)\exp\left(-\frac{(w - \mu_3)^2}{2\sigma_3^2}\right)dw   (2487)

and E[f (X, Y )|X] = g(X) is a σ(X)-measurable random variable.
Exercise 589 (Axial Rotations). Let X, Y ∼ Φ(0, 1) be IID; then show that (X cos θ + Y sin θ) ⊥ (−X sin θ + Y cos θ).

Proof. Recall that linear combinations of independent normals are jointly normal; then by the properties of jointly normal random variables (see Definition 344) we can prove independence by showing that their covariance is zero. In particular, since Cov((X cos θ + Y sin θ), (−X sin θ + Y cos θ)) = E(X cos θ + Y sin θ)(−X sin θ + Y cos θ) − E(X cos θ + Y sin θ)E(−X sin θ + Y cos θ) = E(X cos θ + Y sin θ)(−X sin θ + Y cos θ) − 0, see that

E(X\cos\theta + Y\sin\theta)(-X\sin\theta + Y\cos\theta)   (2488)
= E\left[-X^2\cos\theta\sin\theta + XY\cos^2\theta - XY\sin^2\theta + Y^2\sin\theta\cos\theta\right]   (2489)
= E\left[(Y^2 - X^2)\sin\theta\cos\theta + XY(\cos^2\theta - \sin^2\theta)\right]   (2490)
= 0,   (2491)

where the last step follows from the independence of X, Y (so that EXY = EX\,EY = 0) and from EX² = EY².

Exercise 590. For random variables X, Y defined on probability space (Ω, F, P), at fixed X = x we can write the function g(x) = E(Y|X = x) = \int_{-\infty}^{\infty} y f_{Y|X}(y|x)\,dy, which is characterized by the conditional density of Y given x. We want to show that if there exists a joint density f_{X,Y}(x, y), then E(Y|X) = g(X). g(X) is σ(X)-measurable since it is a Borel-measurable function of X, so to check E(Y|X) = g(X) we have to show that partial averaging is satisfied. Let A ∈ σ(X); then ∃B ∈ B(R) such that we can express A = {ω ∈ Ω : X(ω) ∈ B} by definition. Show the partial averaging property \int_A g(X)\,dP = \int_A E(Y|X)\,dP = \int_A Y\,dP.

Proof. We write \int_A g(X)\,dP =

\int_{-\infty}^{\infty} g(x)1_B(x)f_X(x)\,dx = \int_{-\infty}^{\infty} f_X(x)1_B(x)\int_{-\infty}^{\infty} y\frac{f_{X,Y}(x, y)}{f_X(x)}\,dy\,dx   (2492)
= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} 1_B(x)\,y\,f_{X,Y}(x, y)\,dy\,dx   (2493)
= E[Y 1_B(X)] = E[Y 1_A] = \int_A Y\,dP.   (2494)

Lemma 5 (A Random Variable is a Function of its measurable σ-algebra). For X defined on (Ω, F, P)
and σ(X) measurable W , show ∃g s.t. W = g(X).

Proof. By definition, A ∈ σ(X) can be expressed as A = {ω ∈ Ω : X(ω) ∈ B}, where B is a Borel set B ∈ B(R). Then if W = 1_A, we have W = 1_B(X), so g = 1_B. Other forms of W follow by the standard machine (see Theorem 344).

Corollary 15. Let X, Y be defined on (Ω, F, P). Show that ∃g such that E[Y|X] = g(X).

Proof. Since E[Y|X] is σ(X)-measurable, then by applying Lemma 5 there exists a Borel-measurable function g such that E[Y|X] = g(X).

Exercise 591. Let Y be defined on probability space (Ω, F, P) and G ⊂ F be a sub-σ-algebra. Write Y = E(Y|G) + ε, so that the residual of the estimate based on the information in G is ε = Y − E(Y|G), which has the properties Eε = 0, Var(ε) = σ_ε². Let X be any estimator for Y that is G-measurable. Prove that Var(ε) ≤ Var(Y − X). This statement is equivalent to showing that E(Y|G) is the minimum variance estimator among all G-measurable estimators.

Proof. See that

Var(Y - X) = E\left[(Y - X - \mu_{Y-X})^2\right]   (2495)
= E\left[\left([Y - E(Y|G)] + [E(Y|G) - X - \mu_{Y-X}]\right)^2\right]   (2496)
= Var(\varepsilon) + E(\xi^2) + 2E(Y\xi - E(Y\xi|G))   (2497)
= Var(\varepsilon) + E(\xi^2) \ge Var(\varepsilon),   (2498)

where ξ = E(Y − X − µ_{Y−X}|G) = E(Y|G) − X − µ_{Y−X}, and where the steps follow from the iterated conditioning property of conditional expectations (see Theorem 373).

Exercise 592. For X, Y defined on probability space (Ω, F, P), write Y = E(Y|X) + (Y − E(Y|X)) and show that ρ((Y − E(Y|X)), ψ) = 0 when ψ is σ(X)-measurable.

Proof. By the partial averaging property (see Definition 323), see that Y − E(Y|X) is a centred random variable. To see that its correlation with ψ equals zero we only need to show that E[ψ(Y − E(Y|X))] = 0. We can write this as

E[\psi Y - \psi E(Y|X)] = 0   (2499)

by the iterated conditioning property.

Definition 324 (Martingales). Let (Ω, F, P) be probability space, T > 0 and F(t), t ∈ [T ] be filtration
of sub-σ-algebras ⊂ F. Consider an adapted stochastic process M (t) (see Definition 308), we say that
1. M (t) is a martingale process if

E[M (t)|F (s)] = M (s) ∀0 ≤ s ≤ t ≤ T, (2500)

2. a submartingale process if

E[M (t)|F (s)] ≥ M (s) ∀0 ≤ s ≤ t ≤ T (2501)

3. a supermartingale process if

E[M (t)|F (s)] ≤ M (s) ∀0 ≤ s ≤ t ≤ T (2502)

Definition 325 (Markov Processes). Let (Ω, F, P) be probability space, and T > 0 and F(t) be filtration
as in usual settings. For adapted stochastic process X(t), t ∈ [T ], if for all 0 ≤ s ≤ t ≤ T and for every
nonnegative, Borel measurable function f there exists another Borel measurable function g such that

E[f (X(t))|F(s)] = g(X(s)) (2503)

then X is Markov process.


Estimating the value of f (X(t)) only depends on the process value X(s) at time s and not on the
process path. We may relax this somewhat without violating definitions by letting the Markov process
be conditioned on higher dimensions. We may establish that a process is Markov via the independence
lemma 4, by writing E[f (X, Y )|G] = g(X) for X that is G measurable and Y ⊥ G, and g(x) = Ef (x, Y ).
Note that in the Markov definition for Equation 2503 the function f may be permitted to depend on t
and g on s. We may rewrite this equation to be a single f varying on time by writing

E[f (t, X(t))|F(s)] = f (s, X(s)), 0 ≤ s ≤ t ≤ T. (2504)

and f becomes a function over t, x written f (t, x) and is related to the Black Scholes Merton partial
differential equation (see Section 13.2.3). It is the Markov property that takes us from stochastic calculus
to the treatment of partial differential equations, as we shall see.

6.5.3 Transformation and Combination of Random Variables
For transformation Y = g(X) we are interested in the distribution of Y . We study these transformations
under discrete and continuous settings.

Theorem 357 (Density Functions of Linear Transformations of Discrete Variables). Suppose X is a discrete random variable and Y = aX + b; then p_Y(y) = p_X\left(\frac{y-b}{a}\right).

Proof. See this by writing p_Y(y) = P(Y = y) = P(aX + b = y) = P\left(X = \frac{y-b}{a}\right) = p_X\left(\frac{y-b}{a}\right).

Theorem 358 (Density Functions of Linear Transformations of Continuous Variables). Suppose X is a continuous random variable and Y = aX + b with a ≠ 0; then f_Y(y) = \frac{1}{|a|}f_X\left(\frac{y-b}{a}\right).

Proof. Write F_Y(y) = P(Y ≤ y) = P(aX + b ≤ y) = P(aX ≤ y − b). If a > 0 then this is P\left(X \le \frac{y-b}{a}\right), and by differentiating w.r.t. y we have f_Y(y) = \frac{1}{|a|}f_X\left(\frac{y-b}{a}\right). If a < 0 then P(aX ≤ y − b) = P\left(X \ge \frac{y-b}{a}\right) = 1 - P\left(X \le \frac{y-b}{a}\right), and by differentiating, f_Y(y) = -\frac{1}{a}f_X\left(\frac{y-b}{a}\right).

Theorem 359 (Density Functions of Sums of Random Variables; convolution of probability distributions, independent case). Let X ⊥ Y be random variables and W = X + Y. If X, Y are discrete, then p_W(w) = \sum_x p_X(x)p_Y(w - x). If X, Y are continuous, then f_W(w) = \int_{-\infty}^{\infty} f_X(x)f_Y(w - x)\,dx.

Proof. For the discrete case see that

p_W(w) = P(W = w) = P(X + Y = w) = P(\cup_x(X = x, Y = w - x)) = \sum_x P(X = x, Y = w - x),   (2505)

and then use independence and the definitions for the equivalence. For the continuous case see that

F_W(w) = \int_{-\infty}^{\infty}\int_{-\infty}^{w-x} f_X(x)f_Y(y)\,dy\,dx = \int_{-\infty}^{\infty} f_X(x)\left[\int_{-\infty}^{w-x} f_Y(y)\,dy\right]dx.   (2506)

See that

f_W(w) = \frac{\partial}{\partial w}\int_{-\infty}^{\infty} f_X(x)F_Y(w - x)\,dx = \int_{-\infty}^{\infty} f_X(x)f_Y(w - x)\,dx.   (2507)

Note that here the implicit assumption for the existence of the joint density requires the integrand to be smooth enough for differentiation and integration to be interchanged.

Exercise 593. A light bulb has lifetime X modelled by the exponential density f_X(x) = λ exp(−λx) for x > 0. In case of failure a second light bulb is on standby, with lifetime Y sharing the same density. The total time that a bulb is alight is W = X + Y. Find its density.

Proof. We have f_W(w) = \int_{-\infty}^{\infty} f_X(x)f_Y(w - x)\,dx. In particular, since f_Y(w − x) > 0 only when x < w and f_X(x) > 0 only when x > 0, the density is written

f_W(w) = \int_0^w f_X(x)f_Y(w - x)\,dx = \int_0^w \lambda\exp(-\lambda x)\lambda\exp(-\lambda(w - x))\,dx   (2508)
= \lambda^2\int_0^w \exp(-\lambda x)\exp(-\lambda(w - x))\,dx = \lambda^2\exp(-\lambda w)\int_0^w dx = \lambda^2 w\exp(-\lambda w).   (2509)

Lemma 6. By the Fundamental Theorem of Calculus and the chain rule,

\frac{\partial}{\partial w}\int_{-\infty}^{wx} f_Y(y)\,dy = f_Y(wx)\frac{\partial}{\partial w}(wx) = x f_Y(wx).   (2510)
Theorem 360 (Densities of Products and Ratios of Random Variables). Let X ⊥ Y be continuous random variables for which X is zero on at most a set of isolated points. Define W = Y/X; then

f_W(w) = \int_{-\infty}^{\infty} |x| f_X(x) f_Y(wx)\,dx.   (2511)

Now if instead we write W = XY, then

f_W(w) = \int_{-\infty}^{\infty} \frac{1}{|x|} f_X(x) f_Y(w/x)\,dx = \int_{-\infty}^{\infty} \frac{1}{|x|} f_X(w/x) f_Y(x)\,dx.   (2512)

Proof. We show the proof for the ratio case W = Y/X. Write

F_W(w) = P\left(\frac{Y}{X} \le w\right) = P\left(\frac{Y}{X} \le w, X \ge 0\right) + P\left(\frac{Y}{X} \le w, X < 0\right)   (2513)
= \int_0^{\infty}\int_{-\infty}^{wx} f_X(x)f_Y(y)\,dy\,dx + P(X < 0) - \int_{-\infty}^{0}\int_{-\infty}^{wx} f_X(x)f_Y(y)\,dy\,dx.   (2514)

By differentiation and application of Lemma 6,

f_W(w) = \frac{\partial}{\partial w}\int_0^{\infty}\int_{-\infty}^{wx} f_X(x)f_Y(y)\,dy\,dx - \frac{\partial}{\partial w}\int_{-\infty}^{0}\int_{-\infty}^{wx} f_X(x)f_Y(y)\,dy\,dx   (2515)
= \int_0^{\infty} x f_X(x) f_Y(wx)\,dx - \int_{-\infty}^{0} x f_X(x) f_Y(wx)\,dx   (2516)
= \int_0^{\infty} x f_X(x) f_Y(wx)\,dx + \int_{-\infty}^{0} (-x) f_X(x) f_Y(wx)\,dx   (2517)
= \int_{-\infty}^{\infty} |x| f_X(x) f_Y(wx)\,dx.   (2518)

Exercise 594. Let X, Y be independent with exponential densities of rate λ. Define W = Y/X and find f_W(w).

Proof. Since exponential random variables are supported on the positive reals, we can do away with the absolute value and write

f_W(w) = \int_0^{\infty} x(\lambda\exp(-\lambda x))(\lambda\exp(-\lambda xw))\,dx = \lambda^2\int_0^{\infty} x\exp(-\lambda(1 + w)x)\,dx = \frac{\lambda^2}{\lambda(1 + w)}E(\hat{W}),   (2519)

where \hat{W} ∼ Exp(λ(1 + w)). Then we have f_W(w) = \frac{\lambda^2}{\lambda(1 + w)}\cdot\frac{1}{\lambda(1 + w)} = \frac{1}{(1 + w)^2} for w ≥ 0.

Exercise 595. Let X ⊥ Y have marginal densities f_X(x) = 1 for 0 ≤ x ≤ 1 and f_Y(y) = 2y for 0 ≤ y ≤ 1. Find the density of W = XY.

Proof. We can write f_W(w) = \int_{-\infty}^{\infty}\frac{1}{|x|}f_X(x)f_Y(w/x)\,dx. See that f_Y(w/x) > 0 only if 0 ≤ w/x ≤ 1, that is when x ≥ w. For f_X(x) > 0 we require 0 ≤ x ≤ 1. Then we can write

f_W(w) = \int_w^1 \frac{1}{x}\cdot\frac{2w}{x}\,dx = 2w\int_w^1\frac{1}{x^2}\,dx = 2 - 2w, \quad 0 \le w \le 1.   (2520)

6.6 Inferential Statistics
Here we assume the setting in which X_i ∼ F, i ∈ [n], where F is an unknown cumulative distribution function. Define the population median as follows:

Definition 326 (Population Median). For X_i ∼ F, i ∈ [n] the population median is written

m = F^{-1}\left(\tfrac{1}{2}\right) = \inf\{x : F(x) \ge \tfrac{1}{2}\}.   (2521)

In the case where we are dealing with a discrete random variable X, m is a point s.t. P(X < m) = P(X > m), when such a point exists. If it turns out that P(X ≤ m) = P(X ≥ m′) = 0.5 for two points m ≤ m′, then the median is taken to be the arithmetic mean ½(m + m′).

When working with median as central measure the standard practice is to use the interquartile range
as dispersion measure as opposed to the variance. This is the range Q3 − Q1 where P(Y ≤ Q1 ) =
0.25, P(Y ≤ Q3 ) = 0.75 for some random variable Y .

Definition 327 (Sample Summaries). Let X_i ∼ F(x), i ∈ [n] and let the center of this distribution be described by the population mean µ (if it exists) and the median m (which always exists). Then define the following summary statistics:

1. Sample Mean: \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i

2. Sample Median:

\hat{m} = \begin{cases} X_{(\frac{n+1}{2})} & n \text{ odd} \\ \frac{1}{2}\left(X_{(\frac{n}{2})} + X_{(\frac{n}{2}+1)}\right) & n \text{ even} \end{cases}

Note that m = F^{-1}(\tfrac{1}{2}) always exists but the mean µ may not. For the Cauchy distribution, see that f(x) = \frac{1}{\pi[1 + x^2]} has E|X| = \int_{-\infty}^{\infty} |x|/\left(\pi(1 + x^2)\right)dx = \infty, yet it has m = 0. For a symmetric density f (with finite mean) we have µ = m.

6.7 Laws and Basics of Probability Concepts

Theorem 361 (Law of Large Numbers). Let X_1, X_2, ··· be IID random variables with expectation µ and variance σ²; then the mean \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i converges to µ as n → ∞. Since \bar{x}_n is a realization of \bar{X}_n, the mean of a large sample converges to the population mean. This is also true of the other moments, so that the empirical distribution converges. In particular, \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 \to \sigma^2 for large samples.

Theorem 362 (Central Limit Theorem). Let X_i, i ∈ [n] be IID random variables with expectation µ and variance σ²; then S_n = \sum_{i=1}^{n} X_i has the property

\frac{S_n - n\mu}{\sigma\sqrt{n}} \overset{asy.}{\sim} \Phi(0, 1)   (2522)

with E\bar{X} = µ, Var(\bar{X}) = \frac{\sigma^2}{n}.

Definition 328 (Standardization of Random Variables). Any random variable X has standardization Y = \frac{X - EX}{SD(X)}, where SD(X) is the square root of Var(X).
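Both limit theorems are easy to observe by simulation; the sketch below standardizes sums of exponential variables (an arbitrary non-normal choice, assumed only for illustration) and compares the result against Φ(0, 1).

# LLN / CLT sketch with exponential summands (an assumed, deliberately non-normal choice).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0          # exponential(1) has mean 1 and variance 1
n, reps = 400, 100_000

samples = rng.exponential(mu, size=(reps, n))
xbar = samples.mean(axis=1)
print(xbar.mean())            # close to mu (law of large numbers)

Z = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))   # standardized sums
print(Z.mean(), Z.std())      # ~0 and ~1
print(np.mean(Z <= 1.96))     # ~0.975, matching the standard normal c.d.f.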

6.8 Sampling Methods
The measurement model is an abstraction of the population from which we can assume our data is generated. Let x_i, i ∈ [n] be independent measurements, with the same protocol, of some constant µ. It may be reasonable to say that our measurements are realizations of the random variables X_i, where X_i = µ + ε_i, and the ε_i are IID random variables with Eε_i = 0, Var(ε_i) = σ². It then follows that E\bar{X} = µ, Var\bar{X} = \frac{\sigma^2}{n}. Each realization is of the form x_i = µ + e_i, i ∈ [n], where e_i is the realized value of ε_i. Our statistical problem relates to estimating these values. A common estimate of µ is \bar{x}, which has SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}. Since Var(ε) = Var(X) = σ², we may use \frac{1}{n-1}\sum_{i=1}^{n}(e_i - \bar{e})^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 to get an estimate of σ².
In general we are not guaranteed unbiasedness of estimators. Let X = µ + ε, and let X be used to measure some constant a ≠ µ. Rewriting X = a + (µ − a) + ε, where (µ − a) is the bias, we obtain the mean squared error for X:

E\left[(X - a)^2\right] = E\left[((\mu - a) + \varepsilon)^2\right]   (2523)
= E\left[(\mu - a)^2 + \varepsilon^2 + 2(\mu - a)\varepsilon\right]   (2524)
= \sigma^2 + (\mu - a)^2.   (2525)

When we take n IID measurements, then \bar{X} = µ + \bar{ε} and we have

E\left[(\bar{X} - a)^2\right] = \frac{\sigma^2}{n} + (\mu - a)^2,   (2526)

which is the all-important formula MSE = SE² + bias².

6.8.1 Simple Random Sampling

The simple random sampling method refers to the method in which, from the population, we make n random draws without replacement. Every subset of size n has an equal probability of being chosen as the sample a priori.

Lemma 7. Let X_i, i ∈ [n] be the sampled random variables with EX_i = µ, Var(X_i) = σ². Let the distinct values in the population be listed ξ_1, ξ_2, ··· , ξ_m, with the number of population members taking value ξ_j being n_j, j ∈ [m]. Note N = \sum_j n_j, and we have that P(X_i = ξ_j) = \frac{n_j}{N}, since each member is equally likely to be selected. For i ≠ j, the draws are negatively covariant, since we can write

Cov(X_i, X_j) = -\frac{\sigma^2}{N - 1},   (2527)

and we note that this covariance is weak when N is large.

Proof. We can use the definition of covariance (see Definition ??) and then find EX_iX_j − EX_iEX_j. The latter term is easy to find, and the first term can be found by writing it in terms of the joint density f(X_i, X_j), which equals f(X_i|X_j)f(X_j). The conditional distribution term can then be evaluated. Here we apply a trick to avoid the density computation. See that the X_i's are sum-wise constant over the full population: if we draw all N members, \sum_{i=1}^{N} X_i = τ ∈ R is not random. We can write Var(\sum_i X_i) = Var(τ) = 0. See also that this is written Cov(\sum_i X_i, \sum_j X_j) = \sum_i\sum_j Cov(X_i, X_j) = N\,Var(X_1) + N(N - 1)Cov(X_i, X_j), i ≠ j, which equals 0, and we are done.

Theorem 363 (Simple Random Sampling Properties). Suppose X is the sampled random variable and τ is the population total; then we can use \bar{X} → µ, T = N\bar{X} → τ as estimators. We have by Lemma 7 that

1. E\bar{X} = µ, ET = τ

2. Var\bar{X} = \frac{1}{n^2}\left[\sum_i Var(X_i) + \sum_{i\ne j} Cov(X_i, X_j)\right] = \frac{1}{n^2}\left[n\sigma^2 + n(n - 1)\frac{-\sigma^2}{N - 1}\right] = \frac{\sigma^2}{n}\frac{N - n}{N - 1}

3. Var(T) = N^2\frac{\sigma^2}{n}\frac{N - n}{N - 1}

The factor \frac{N - n}{N - 1} = 1 - \frac{n - 1}{N - 1} is known as the finite population correction factor. The ratio \frac{n}{N} is known as the sampling fraction.
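The finite population correction can be verified by brute force; the population below is an arbitrary synthetic one (an assumption for illustration), and the draws are made without replacement exactly as in the simple random sampling setting.

# Verify Var(X_bar) = (sigma^2/n) * (N-n)/(N-1) under sampling without replacement.
import numpy as np

rng = np.random.default_rng(3)
N, n, reps = 1000, 50, 50_000
population = rng.gamma(2.0, 3.0, N)          # arbitrary finite population
sigma2 = population.var()                    # population variance (1/N convention)

xbars = np.array([rng.choice(population, n, replace=False).mean() for _ in range(reps)])
print(xbars.var())                           # empirical Var(X_bar)
print(sigma2 / n * (N - n) / (N - 1))        # theoretical value with the correction factor
print(sigma2 / n)                            # with-replacement value, noticeably larger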

Corollary 16 (Sampling Distribution of Proportions). When sampling proportions, we have E\hat{p} = p, Var(\hat{p}) = \frac{p(1 - p)}{n}\left(1 - \frac{n - 1}{N - 1}\right).

Often, when the population is large relative to the sample, we just assume that the draws are made with replacement and the covariance is ignored, so that the correction factor is not applied. With replacement, the standard deviation of the sampling distribution is SD(\bar{X}) = \frac{\sigma}{\sqrt{n}}, but often we do not know σ. The value \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 turns out to have expectation \frac{n - 1}{n}\sigma^2 (verify this), and so the unbiased estimator for the population variance can be written:

Definition 329 (Sample Variance under Sampling with Replacement). S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 is an unbiased estimator of σ² and is known as the sample variance.

Let x_i, i ∈ [n] be n draws with replacement; then s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 is a realization of S² and estimates σ². The standard error of \bar{x} can then be said to be approximately \frac{s}{\sqrt{n}}.
In the proportion case \hat{\sigma}^2 = \hat{p}(1 - \hat{p}) and s^2 = \frac{n}{n-1}\hat{\sigma}^2 is the sample variance; therefore the standard error is written

SE(\hat{p}) = \frac{\sqrt{\hat{p}(1 - \hat{p})}}{\sqrt{n - 1}},

which approximates \sqrt{\frac{p(1 - p)}{n}}. When we are working under the no-replacement assumption (simple random sampling), the standard error should be multiplied by the square root of the finite population correction factor, \sqrt{1 - \frac{n - 1}{N - 1}}. Additionally, when sampling without replacement, to be precise we have (verify this) E\left[\frac{N - 1}{N}S^2\right] = \sigma^2, but we often do not correct for this bias.

Definition 330 (Sample Covariance). We define the sample covariance as

s_{xy} = \frac{1}{n - 1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}).   (2528)

6.8.2 Confidence Intervals

We would like to make some statistical statements about the point estimates obtained from our sample under simple random sampling (see Section 6.8.1). For X_i, i ∈ [n] IID with EX_i = µ, Var(X_i) = σ², by the Central Limit Theorem (see 362) we have

\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \Phi(0, 1) \quad \text{for large } n.   (2529)

Let z_u be such that P(Z > z_u) = u; then for α ∈ [0, 1] we have 1 - \alpha \approx P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right). It follows that our 1 − α confidence interval is written

\left(\bar{X} - z_{\alpha/2}\frac{S}{\sqrt{n}},\; \bar{X} + z_{\alpha/2}\frac{S}{\sqrt{n}}\right),

and we can substitute the sample realizations s² → S² and \bar{x} → \bar{X}. Similarly, in the proportion case we replace the standard error with SE(\hat{p}) = \frac{\sqrt{\hat{p}(1 - \hat{p})}}{\sqrt{n - 1}}. We interpret that, among a large number of confidence intervals computed, a fraction 1 − α contain the true µ. Recall that in simple random sampling, if we want to be precise, the variance estimate is Var\bar{X} = \frac{\sigma^2}{n}\frac{N - n}{N - 1} and the interval estimate needs to be adjusted. We often simplify this with the sampling-with-replacement assumption, since this is approximately true when n ≪ N.
Note that the use of sample estimates in place of population quantities, such as s² → σ², is known as the bootstrap technique. Since the standard error term is approximated by \frac{s}{\sqrt{n}}, the sample size needs to grow as O(σ²) across populations with growing variances in order to obtain the same width of confidence interval at level α.
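The "fraction 1 − α of intervals cover the truth" interpretation can be checked empirically; the sketch below repeatedly samples from an assumed (deliberately skewed) population, forms X̄ ± z_{α/2} s/√n each time, and counts how often the true mean is covered.

# Empirical coverage of the normal-approximation confidence interval (illustrative sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, alpha, reps = 100, 0.05, 20_000
z = stats.norm.ppf(1 - alpha / 2)

samples = rng.lognormal(mean=np.log(5.0), sigma=0.5, size=(reps, n))  # assumed skewed population
true_mean = np.exp(np.log(5.0) + 0.5 * 0.5 ** 2)                      # lognormal mean for these parameters
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)
covered = np.abs(xbar - true_mean) <= z * s / np.sqrt(n)
print(covered.mean())   # close to 1 - alpha = 0.95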

Theorem 364 (Chebyshev's Inequality). For a random variable with variance bounded by σ², Chebyshev's inequality guarantees an upper bound on the probability that \bar{x} misses µ by more than ε:

P\left[|\bar{X} - \mu| \ge \epsilon\right] \le \frac{\sigma^2}{n\epsilon^2}.

Exercise 596 (Applied Chebyshev on Proportions). For a sample of size n = 100 from a population with p = 0.2, find δ such that P(|\hat{p} − p| ≥ δ) = 0.025 and verify that Chebyshev's inequality holds.

Proof. We have approximately \hat{p} \sim \Phi\left(p, \frac{p(1-p)}{n}\right) = \Phi\left(\frac{1}{5}, \frac{4/25}{100} = \left(\frac{1}{25}\right)^2\right). Then we want

P\left(\left|\hat{p} - \tfrac{1}{5}\right| \ge \delta\right) = 0.025   (2530)
P\left(\frac{|\hat{p} - \tfrac{1}{5}|}{1/25} \ge \frac{\delta}{1/25}\right) = 0.025   (2531)
P(|Z| \ge 25\delta) = 0.025,   (2532)

and we have 25δ = z_{0.0125} ≈ 2.241403 by standard quantiles. Then δ = 0.08965, and the Chebyshev bound gives \frac{\sigma^2}{n\delta^2} = \frac{\hat{p}(1 - \hat{p})}{100\delta^2} = 0.19907 \ge 0.025.

Note that the conclusions derived do not apply to sampling methods that are based on non simple
random samples.

6.9 Estimation of Ratios

In this section we formalize the statistical estimation of ratios of random variables.

Definition 331 (Problem Setting for the Ratio Estimation Problem). Consider a population of size N with bivariate characteristics (x_i, y_i), i ∈ [N], and let µ_x = \frac{1}{N}\sum_{i=1}^{N} x_i, µ_y = \frac{1}{N}\sum_{i=1}^{N} y_i. We are interested in the parameter r = \frac{\mu_y}{\mu_x}. Suppose by the simple random sampling method 6.8.1 we obtain the samples (X_i, Y_i), i ∈ [n]; then we can estimate r with R = \frac{\bar{Y}}{\bar{X}}, and by the Law of Large Numbers 361 this is a good estimator. We may attempt to quantify exactly how good it is by analysis of the mean of squared errors.

Recall the Taylor expansion for a univariate function:

Theorem 365 (Univariate Taylor Expansion). For a function f on x over R, the Taylor series expansion of f(x) is written

f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots   (2533)

and in the bivariate case we have the formula:

Theorem 366 (Bivariate Taylor Expansion). For a function f on (x, y) over R², the Taylor series expansion of f(x, y) is written

f(x, y) = f(a, b) + f'(a, b)_x(x - a) + f'(a, b)_y(y - b)   (2534)
+ \frac{1}{2}f''(a, b)_{x^2}(x - a)^2 + \frac{1}{2}f''(a, b)_{y^2}(y - b)^2   (2535)
+ f''(a, b)_{xy}(x - a)(y - b) + \cdots   (2536)

where the abbreviations f''(a, b)_{xy} = \frac{\partial^2 f(a, b)}{\partial x\partial y}, f''(a, b)_{x^2} = \frac{\partial^2 f(a, b)}{\partial x^2}, and so on denote the partial derivative terms.

We write r = f(µ_x, µ_y), R = f(\bar{X}, \bar{Y}) with f(x, y) = y/x. By the bivariate expansion 366 we obtain

R - r = \frac{\partial f}{\partial x}(\mu_x, \mu_y)(\bar{X} - \mu_x) + \frac{\partial f}{\partial y}(\mu_x, \mu_y)(\bar{Y} - \mu_y) + \cdots   (2537)
= \left(\frac{-\mu_y}{\mu_x^2}\right)(\bar{X} - \mu_x) + \left(\frac{1}{\mu_x}\right)(\bar{Y} - \mu_y) + \cdots   (2538)

and equivalently R = r + (\frac{-\mu_y}{\mu_x^2})(\bar{X} - \mu_x) + (\frac{1}{\mu_x})(\bar{Y} - \mu_y) + O(\cdot^2). Then

Var(R) \approx Var\left(r + \left(\frac{-\mu_y}{\mu_x^2}\right)(\bar{X} - \mu_x) + \left(\frac{1}{\mu_x}\right)(\bar{Y} - \mu_y)\right)   (2539)
= \frac{1}{\mu_x^2}Var\left(-\frac{\mu_y}{\mu_x}(\bar{X} - \mu_x) + (\bar{Y} - \mu_y)\right)   (2540)
= \frac{1}{\mu_x^2}\left(r^2\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2 - 2r\,Cov(\bar{X} - \mu_x, \bar{Y} - \mu_y)\right)   (2541)
= \frac{1}{\mu_x^2}\left(r^2\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2 - 2r\,\sigma_{\bar{X}\bar{Y}}\right).   (2542)

Now recall that Var\bar{X} = \frac{\sigma_X^2}{n}\frac{N - n}{N - 1}, Var\bar{Y} = \frac{\sigma_Y^2}{n}\frac{N - n}{N - 1}, and we have

Cov(\bar{X}, \bar{Y}) = \frac{\sigma_{XY}}{n}\left(\frac{N - n}{N - 1}\right),   (2543)

where \sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y) is the population covariance.

Theorem 367 (Variance of Ratios under Simple Random Sampling). It follows that the approximate variance of the ratio estimator R = \frac{\bar{Y}}{\bar{X}} in the ratio estimation problem of Definition 331 satisfies

Var(R) \approx \frac{1}{\mu_x^2}\left(r^2\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2 - 2r\sigma_{\bar{x}\bar{y}}\right)   (2544)
= \frac{1}{n}\frac{N - n}{N - 1}\frac{1}{\mu_x^2}\left(r^2\sigma_x^2 + \sigma_y^2 - 2r\sigma_{xy}\right).   (2545)

A reasonable question of interest is the level of bias from our first order estimate. Say we further estimate the ratio R by Taylor expansion up to quadratic terms. The second order partial derivatives of f(x, y) = \frac{y}{x} are \frac{\partial^2 f}{\partial x^2} = \frac{2y}{x^3}, \frac{\partial^2 f}{\partial y^2} = 0, \frac{\partial^2 f}{\partial x\partial y} = -\frac{1}{x^2}. Then adding the second order terms to the previous Taylor expansion in Equation 2537, by substitution we have:

R - r = \left(\frac{-\mu_y}{\mu_x^2}\right)(\bar{X} - \mu_x) + \left(\frac{1}{\mu_x}\right)(\bar{Y} - \mu_y)   (2546)
+ \frac{1}{2}\left(\frac{2\mu_y}{\mu_x^3}\right)(\bar{X} - \mu_x)^2 + \frac{1}{2}(0)(\bar{Y} - \mu_y)^2 + \left(-\frac{1}{\mu_x^2}\right)(\bar{X} - \mu_x)(\bar{Y} - \mu_y) + O(\cdot^3),   (2547)

which yields

ER - r \approx \frac{1}{\mu_x^2}\left[r\,E(\bar{X} - \mu_x)^2 - E(\bar{X} - \mu_x)(\bar{Y} - \mu_y)\right]   (2548)
= \frac{1}{\mu_x^2}\left[r\sigma_{\bar{X}}^2 - \sigma_{\bar{X}\bar{Y}}\right],   (2549)

and in the case of simple random samples, we add the finite population correction factor to yield, up to second order,

ER \approx r + \frac{1}{n}\frac{N - n}{N - 1}\frac{1}{\mu_X^2}\left(r\sigma_X^2 - \rho\sigma_x\sigma_y\right).   (2550)
We may also be interested in the population correlation coefficient ρ = \frac{\sigma_{xy}}{\sigma_x\sigma_y}, which is an indication of the linear relationship between x, y. Recall that we may estimate the population variance with the sample variance s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2. We may perform a similar estimation of σ_{xy} with the sample covariance, defined in Definition 330. Our estimate \hat{\rho} = \frac{s_{xy}}{s_xs_y} estimates the population correlation coefficient. Assuming the data set (X_i, Y_i), i ∈ [n], we would like to form a (1 − α) confidence interval, which we may obtain from the value R ± z_{α/2} SE(R) for the ratio R. Applying Theorem 367 to the simple random sampling problem and bootstrapping the sample point estimates, we obtain

s_R^2 = \frac{1}{n}\frac{N - n}{N - 1}\frac{1}{\bar{x}^2}\left(R^2s_x^2 + s_y^2 - 2Rs_{xy}\right)   (2551)

and our interval estimate becomes

R \pm z_{\alpha/2}\,s_R.   (2552)

Note that we may use the normal quantiles z_{α/2} because the first order Taylor expansion (see Equation 2537) is a linear combination of jointly normal variables (see Definition 344) by application of the Central Limit Theorem (362). When n is large the standard error dominates the bias, and our interval is approximately good.

6.9.0.1 Estimation of Paired Samples

Suppose we are in the same bivariate problem setting with samples (X_i, Y_i), i ∈ [n], except in this case we know the population x_i, i ∈ [N] but the y_i's are not known. We are interested in finding µ_y, and although the estimate \bar{Y} is a good unconditional estimate, we may do better if ρ(X, Y) is sufficiently far from zero. We may take some n < N size sample (x_i, y_i), i ∈ [n], and our ratio estimate of µ_y is now

\bar{Y}_R = \frac{\mu_x}{\bar{X}}\bar{Y} = \mu_x R,   (2553)

which 'works' since R is an estimate for r = \frac{\mu_y}{\mu_x}. To see if \bar{Y}_R is better than \bar{Y} we shall look at their statistical properties.
Since Var(\bar{Y}_R) = Var(µ_x R), by the variance of ratios under simple random sampling of Theorem (367) we obtain

Var(\bar{Y}_R) \approx \frac{1}{n}\frac{N - n}{N - 1}\frac{\mu_x^2}{\mu_x^2}\left(r^2\sigma_x^2 + \sigma_y^2 - 2r\sigma_{xy}\right)   (2554)
= \frac{1}{n}\frac{N - n}{N - 1}\left(r^2\sigma_x^2 + \sigma_y^2 - 2r\rho\sigma_x\sigma_y\right)   (2555)

with approximate bias (using the computation in Equation (2550))

E\bar{Y}_R - \mu_y \approx \frac{1}{n}\frac{N - n}{N - 1}\frac{1}{\mu_X}\left(r\sigma_X^2 - \rho\sigma_x\sigma_y\right) \in O\left(\frac{1}{n}\right),   (2556)

which has a smaller order of growth relative to SE(\bar{Y}_R) \in O\left(\frac{1}{\sqrt{n}}\right).
Exercise 597 (Comparison of \bar{Y} and \bar{Y}_R). Assuming we drop the bias due to its small growth factor and compare on the variances, show that the ratio estimate \bar{Y}_R has smaller variance, and hence is a better estimate of µ_y than \bar{Y}, when \rho > \frac{1}{2}\frac{C_x}{C_y}, where C_x = \frac{\sigma_x}{\mu_x}, C_y = \frac{\sigma_y}{\mu_y} are the coefficients of variation.

Proof. We may write

Var(\bar{Y}_R) \approx \frac{1}{n}\frac{N - n}{N - 1}\left(r^2\sigma_x^2 + \sigma_y^2 - 2r\rho\sigma_x\sigma_y\right)   (2557)
< \frac{1}{n}\frac{N - n}{N - 1}\left(r^2\sigma_x^2 + \sigma_y^2 - 2r\cdot\frac{1}{2}\frac{\sigma_x/\mu_x}{\sigma_y/\mu_y}\sigma_x\sigma_y\right) \quad \text{(sub. ρ inequality)}   (2558)
= \frac{1}{n}\frac{N - n}{N - 1}\left(r^2\sigma_x^2 + \sigma_y^2 - r^2\sigma_x^2\right)   (2559)
= \frac{1}{n}\frac{N - n}{N - 1}\sigma_y^2   (2560)
= Var\bar{Y}.   (2561)

Note that the variance of \bar{Y}_R may again be estimated by bootstrap techniques as

s^2(\bar{Y}_R) = \frac{1}{n}\frac{N - n}{N - 1}\left(R^2s_x^2 + s_y^2 - 2Rs_{xy}\right),   (2562)

which gives the interval

\bar{Y}_R \pm z_{\alpha/2}\,s_{\bar{Y}_R}.   (2563)

6.10 Method of Moments

Suppose we want to estimate arbitrary parameters of a probability distribution. Say we have a random variable X ∼ Poisson(λ); we may use \bar{X} as an estimate of λ, since EX = λ (see Section 6.17.4) and E\bar{X} = EX. The standard error of this parameter estimate is SD(\bar{X}) = \sqrt{\frac{\lambda}{n}}, and we can bootstrap this with λ → \bar{x} to get the standard error. What we described is an estimation procedure. The general estimation problem can be framed as follows:

Problem 2 (Estimation Problem). Let Xi , i ∈ [n] be IID random variables with density f (x|θ), where
θ ∈ Rp is unknown population parameter. We are given data xi , i ∈ [n], estimate θ. In particular,
consider the parametric family

{f (x|θ) : θ ∈ Θ ⊂ Rp } (2564)

where Θ is the parameter space subset of p-dimensional Euclidean space, then given xi , i ∈ [n] realizations
from IID Xi ∼ f (x|θ0 ), i ∈ [n], find θ0 . Assume that the parametric family has property that θ → f (·|θ)
is injective on Θ. This is called an identifiable family. Unidentifiable families can have parameters that
are impossible to estimate even with infinite data.

The estimate θ̂ is itself a random variable, but we often refer to its realization with the same notation.
The bias and standard error terms are written Eθ̂ − θ and SE(θ̂).
Note that the variance formula Var(X) = µ2 − µ21 can be written in moment terms (see Definition
298).

Definition 332 (Method of Moments). In the estimation problem (Problem 2), if we can express the parameter as a function of the moments, θ = g(µ_1, µ_2, ···), then the method of moments technique generates the estimator

\hat{\theta} = g(\hat{\mu}_1, \hat{\mu}_2, \cdots).

This gives us a method to estimate the parameters of a statistical distribution of interest whenever we can relate the parameters and the moments. Conveniently, it is often easy to express the moments in terms of the parameters and then invert the relationship. Let us consider some of the standard distributions and their method of moments estimators.

Exercise 598 (Poisson Method of Moments). Let X_i, i ∈ [n] be IID Poisson(λ) and recall that µ_1 = λ. Our method of moments estimator is \hat{\lambda} = \hat{\mu}_1 = \bar{X}, which has bias and standard error E\bar{X} - \lambda = 0, SD(\bar{X}) = \sqrt{\frac{\lambda}{n}} \approx \sqrt{\frac{\bar{x}}{n}}.

Exercise 599 (Normal Method of Moments). We have θ = (µ, σ²) with µ_1 = µ, µ_2 = σ² + µ². Then we can write \hat{\mu} = \hat{\mu}_1 = \bar{X} and σ² = µ_2 − µ_1², which yields the method of moments estimator

\hat{\sigma}^2 = \hat{\mu}_2 - \hat{\mu}_1^2 = \frac{1}{n}\sum X_i^2 - (\bar{X})^2 = \frac{1}{n}\sum(X_i - \bar{X})^2.   (2565)

Additionally, suppose we want to learn about the joint distribution of our sample estimators. Recall from the properties of sampling on normals (see 343) that \hat{\mu} ⊥ \hat{\sigma}^2. Hence we may analyse them separately. In particular, we have \hat{\mu} \sim \Phi(\mu, \frac{\sigma^2}{n}) and \frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-1}. E\hat{\mu} = µ is unbiased, but E\hat{\sigma}^2 = \frac{n-1}{n}\sigma^2 has bias -\frac{\sigma^2}{n}, and their standard errors may be found. Again from the sampling properties we obtain Var(\hat{\mu}) = \frac{\sigma^2}{n}, Var(\hat{\sigma}^2) = \frac{\sigma^4}{n^2}\cdot 2(n-1) = \frac{2}{n}\frac{n-1}{n}\sigma^4, which may be found with bootstrap approximates.

Exercise 600 (Gamma Method of Moments). Recall the Gamma density function (Equation 2729); we have unknown θ = (λ, α). We may write µ_1 = \frac{\alpha}{\lambda}, µ_2 = \frac{\alpha(\alpha+1)}{\lambda^2} (verify this), and the method of moments estimators \hat{\lambda}, \hat{\alpha} can be found. We have

\mu_2 = \frac{\alpha^2}{\lambda^2} + \frac{1}{\lambda}\frac{\alpha}{\lambda} = \mu_1^2 + \frac{1}{\lambda}\mu_1.   (2566)

Equivalently, rewriting µ_2 − µ_1² = \frac{\mu_1}{\lambda}, we have the method of moments estimators \hat{\lambda} = \frac{\hat{\mu}_1}{\hat{\mu}_2 - \hat{\mu}_1^2}, \hat{\alpha} = \hat{\lambda}\hat{\mu}_1.

Suppose in the above example on the Gamma method of moments (Exercise 600) we would like to compute bias and standard errors. Since we obtain them as

E_{\lambda,\alpha}\hat{\lambda} - \lambda   (2567)
SD_{\lambda,\alpha}(\hat{\lambda})   (2568)

and similarly for \hat{\alpha}, the error estimates of our estimator themselves require knowledge of λ, α. We shall explore this with Monte Carlo bootstraps in Section 6.11.

6.10.1 Consistency of Estimators

Definition 333 (Consistency). We say an estimator \hat{\theta} of θ is consistent if it converges to θ as the sample size grows (\hat{\theta} \xrightarrow{n\to\infty} \theta). By the Law of Large Numbers 361, the sample moments converge to the population moments, \hat{\mu}_k \to \mu_k; hence the estimators obtained by the method of moments (see Definition 332) are consistent estimators (assuming the smoothness conditions of the Taylor approximations and the existence of higher order derivatives).

Exercise 601 (Angular Distribution). [16] Let X = cos θ have density function f(x) = \frac{1 + \alpha x}{2} for x ∈ [−1, 1], where α ∈ [−1, 1] is an unknown constant. Find the method of moments estimator of α given IID samples X_i, i ∈ [n].

Proof. Since EX = \int_{-1}^{1} x\frac{1 + \alpha x}{2}\,dx = \int_{-1}^{1}\left(\frac{x}{2} + \frac{\alpha}{2}x^2\right)dx = \left[\frac{x^2}{4} + \frac{\alpha}{6}x^3\right]_{-1}^{1} = \frac{\alpha}{3}, we obtain \hat{\alpha} = 3\hat{\mu}_1 = 3\bar{X}.

6.11 Monte Carlo Bootstraps

Recall that in the method of moments (see 2567), when we compute estimates it is often not possible to analytically derive the variance and bias of estimators of arbitrary parameters. Here we explore numerical solutions. A bootstrap technique is a general method of using the sample to approximate some population property. An example is when we substitute sample variances into the computation of interval estimates. If the quantities we want to derive are not easily obtained analytically from the parameter unknowns, then a numerical solution may be warranted by simulating the parametric distribution. For instance, in the Gamma density (see 2729), even if we knew (λ, α) it is not easy to compute Eλ̂, Var λ̂ where λ̂ = \frac{\bar{X}}{\hat{\sigma}^2}. By the Law of Large Numbers (Theorem 361), if we can simulate samples from this distribution we may obtain the expectations by numerical techniques.

Problem 3 (Bootstrapping in Parameter Estimation). Given some θ for a statistical distribution, find the θ̂ estimator of an arbitrary technique (say, the method of moments) based on an n size sample. Find its bias and variance.

Suppose we know θ in the parameter estimation problem and we want to bootstrap it (see Problem 3); then we may simulate an n size sample and obtain a single realization of θ̂ based on the technique, in our case the method of moments estimators. By the Law of Large Numbers, Eλ̂, SD(λ̂) are approximately the mean and standard deviation of λ̂ over the n size samples simulated N times. But mostly we do not know θ. So we may hope that θ̂ computed from some original sample is close to θ, and instead simulate from the population density with parameter θ̂. The idea is that the bias E_{θ̂}\hat{\hat{θ}} − θ̂ approximates E_θθ̂ − θ, and that the standard error SD_{θ̂}(\hat{\hat{θ}}) approximates SD_θ(θ̂). This is then a two-step approximation: (1) the bootstrap approximation θ̂ → θ and (2) the Monte Carlo approximation by the Law of Large Numbers. We can then study in this 'controlled' experiment how different n size samples affect the bias in θ̂, which we expect to decrease in absolute value with increasing n for consistent estimators.
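A parametric Monte Carlo bootstrap along the lines of Problem 3 can be sketched as follows, under assumed true parameters: fit (α̂, λ̂) by the method of moments on an original sample, then resample repeatedly from the fitted Gamma density to approximate the bias and standard error of λ̂. The gamma_mom helper simply repeats the estimator from the earlier listing.

# Parametric Monte Carlo bootstrap of the Gamma method-of-moments estimator (a sketch).
import numpy as np

def gamma_mom(x):
    m1, m2 = np.mean(x), np.mean(x ** 2)
    lam_hat = m1 / (m2 - m1 ** 2)
    return lam_hat * m1, lam_hat                     # (alpha_hat, lambda_hat)

rng = np.random.default_rng(6)
alpha0, lam0, n, B = 2.5, 1.7, 200, 5_000            # assumed truth, sample size, bootstrap draws

original = rng.gamma(alpha0, 1 / lam0, n)            # the data we actually observe
alpha_hat, lam_hat = gamma_mom(original)

# Simulate B samples of size n from the *fitted* density and re-estimate each time.
lam_stars = np.array([gamma_mom(rng.gamma(alpha_hat, 1 / lam_hat, n))[1] for _ in range(B)])

bias_est = lam_stars.mean() - lam_hat                # approximates E_theta(lambda_hat) - lambda
se_est = lam_stars.std(ddof=1)                       # approximates SD(lambda_hat)
print(lam_hat, bias_est, se_est)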

6.12 Method of Maximum Likelihoods

Definition 334 (Maximum Likelihood Estimation, MLE). Suppose the random sample y_i, i ∈ [N] is realized from the IID random variables Y_i, i ∈ [N] with density P_θ(y) indexed by the parameter set θ ∈ Θ. Then the log-probability of the observed sample is \log L(\theta) = \sum_{i=1}^{N}\log P_\theta(y_i). The principle of MLE assumes that the most reasonable estimate of θ is the one under which the probability of the observed sample is maximised. In particular, we also assume that P_θ(y) is an identifiable family, as defined in the estimation problem (see Problem 2).

The bias and standard error of the maximum likelihood estimate are

\text{bias} = E_{\theta_0}\hat{\theta}_0 - \theta_0, \quad SD_{\theta_0}(\hat{\theta}_0).   (2569)

Exercise 602 (Poisson Likelihoods). Let x_i, i ∈ [n] be realizations of IID Poisson(λ_0) random variables (see 6.17.4), where λ_0 is not known; then their likelihood is

L(\lambda) = \prod_{i=1}^{n}\frac{\lambda^{x_i}\exp(-\lambda)}{x_i!} = \frac{\lambda^{\sum x_i}\exp(-n\lambda)}{\prod_{i=1}^{n}x_i!}   (2570)

and log-likelihood

l(\lambda) = -n\lambda + \sum_{i=1}^{n}x_i\log\lambda - \sum_{i=1}^{n}\log x_i!,   (2571)

and setting

\frac{\partial l(\lambda)}{\partial\lambda} = \frac{1}{\lambda}\sum_{i=1}^{n}x_i - n \overset{!}{=} 0   (2572)

gives the solution \hat{\lambda}_0 = \bar{x} as the maximum likelihood estimate, and \bar{X} as the estimator.

In general we also need to confirm that λ̂_0 is a maximum by ensuring that the second derivative is negative there.
Exercise 603 (Normal Likelihoods). Let X_i, i ∈ [n] be IID ∼ Φ(µ, σ²) with the Gaussian density

\phi_{(\mu,\sigma^2)}(x_i) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right\},

so that the log-Gaussian density is simply

-\log(\sigma\sqrt{2\pi}) - \frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2,

which gives the log-likelihood of the data sample

l(\theta) = \log L(\theta) = -\frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.   (2573)

Taking

\frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)   (2574)
\frac{\partial l}{\partial\sigma} = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{\sigma^3}   (2575)

and solving these equations, we arrive at the maximum likelihood estimators \hat{\mu} = \bar{X}, \hat{\sigma} = \sqrt{\frac{1}{n}\sum_i(X_i - \bar{X})^2}.
Exercise 604 (Gamma Likelihoods). Let X_i, i ∈ [n] be IID ∼ Γ(λ, α) (see density Equation 2729); then it has log-density

\alpha\log\lambda + (\alpha - 1)\log x - \lambda x - \log\Gamma(\alpha)   (2576)

and log-likelihood on the data sample

l(\theta) = n\alpha\log\lambda + (\alpha - 1)\sum_{i=1}^{n}\log X_i - \lambda\sum_{i=1}^{n}X_i - n\log\Gamma(\alpha),   (2577)

and taking derivatives we have

\frac{\partial l}{\partial\lambda} = \frac{n\alpha}{\lambda} - \sum_i X_i   (2578)
\frac{\partial l}{\partial\alpha} = n\log\lambda + \sum_i\log X_i - n\frac{\Gamma'(\alpha)}{\Gamma(\alpha)},   (2579)

and we arrive at the solution equations

\hat{\lambda} = \frac{\hat{\alpha}}{\bar{X}},

and substituting \hat{\lambda} for λ we also have

n\log\hat{\alpha} - n\log\bar{X} + \sum_{i=1}^{n}\log X_i - n\frac{\Gamma'(\hat{\alpha})}{\Gamma(\hat{\alpha})} = 0,

which allows for solution finding by numerical methods such as the Newton-Raphson method.
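The α̂ equation is a one-dimensional root-finding problem once λ has been substituted out. The sketch below uses scipy's digamma for Γ'(α)/Γ(α) and Brent's method rather than a hand-rolled Newton-Raphson; that is an implementation choice on assumed synthetic data, not anything mandated by the text.

# Gamma MLE: solve n*log(alpha) - n*log(Xbar) + sum(log X) - n*digamma(alpha) = 0.
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(7)
alpha0, lam0, n = 2.5, 1.7, 5_000
x = rng.gamma(alpha0, 1 / lam0, n)                    # assumed synthetic data

xbar = x.mean()
sum_log = np.log(x).sum()

def score(alpha):
    # derivative of the profile log-likelihood in alpha, after substituting lambda = alpha / xbar
    return n * np.log(alpha) - n * np.log(xbar) + sum_log - n * digamma(alpha)

alpha_hat = brentq(score, 1e-6, 1e3)                  # generous bracketing interval
lam_hat = alpha_hat / xbar
print(alpha_hat, lam_hat)                             # close to (2.5, 1.7) for large n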

Exercise 605 (Multinomial Likelihoods). Consider an experiment with r possible outcomes E_i, i ∈ [r] with corresponding probabilities p = (p_{i∈[r]}). Let X_i be the number of times E_i occurs in n trials; then the random vector (X_{i∈[r]}) has density function

f(x_{i\in[r]}) = \binom{n}{x_1\cdots x_r}\prod_{i=1}^{r}p_i^{x_i},   (2580)

where \sum_{i=1}^{r}x_i = n and x_i ≥ 0. For all i we have EX_i = np_i, VarX_i = np_i(1 − p_i), and when i ≠ j, Cov(X_i, X_j) = −np_ip_j (verify this).
For (x_{i∈[r]}) realized from (X_{i∈[r]}) ∼ Mult(n, p) we obtain the log-likelihood

l(p) = \log\binom{n}{x_1, x_2\cdots x_r} + \sum_{i=1}^{r}x_i\log p_i.   (2581)

Posing the constraint \sum_{i=1}^{r}p_i = 1 with a Lagrange multiplier gives

l'(p, \lambda) = l(p) + \lambda\left(1 - \sum_{i=1}^{r}p_i\right),   (2582)

and to find \arg\max_p l'(p, λ) we take the partial derivative

\frac{\partial l'(p, \lambda)}{\partial p_i} = \frac{\partial l(p)}{\partial p_i} + \frac{\partial}{\partial p_i}\lambda\left(1 - \sum_{j=1}^{r}p_j\right)   (2583)
= \frac{x_i}{p_i} - \lambda\frac{\partial}{\partial p_i}\sum_{j=1}^{r}p_j   (2584)
= \frac{x_i}{p_i} - \lambda \overset{!}{=} 0.   (2585)

Since we want p_i = \frac{x_i}{\lambda} and \sum_{i=1}^{r}p_i = \sum_{i=1}^{r}\frac{x_i}{\lambda} = 1, then \frac{1}{\lambda}\sum_i x_i = 1 and λ = n. We then finally arrive at the MLE \hat{p}_i = \frac{x_i}{n}, and our estimator \hat{p} = X/n, which is element-wise \hat{p}_i = \frac{X_i}{n}. We have EX_i = np_i, E\hat{p}_i = \frac{EX_i}{n} = p_i, Var(\hat{p}_i) = \frac{p_i(1 - p_i)}{n}. For i ≠ j we obtain Cov(\hat{p}_i, \hat{p}_j) = \frac{1}{n^2}Cov(X_i, X_j) = -\frac{p_ip_j}{n}.

Again, the method of moments estimators (see Definition 332) and the method of maximum likelihoods may or may not agree, and similarly we may not have closed form solutions for the bias and standard error estimates. We can then employ bootstrap and Monte Carlo techniques (see 6.11) to obtain these estimates. The MLEs (maximum likelihood estimates) have asymptotically no larger standard error than the method of moments estimates, and are in fact asymptotically the most efficient among consistent estimators. The concept of efficient estimators is discussed in Section 6.14. Consistency is discussed in Definition 333.

6.12.1 Confidence Intervals of Maximum Likelihood Estimates
Consider the parametric family {f(·|θ) : θ ∈ Θ} where here Θ ⊂ R (one-dimensional), as in our estimation problem (see Problem 2). Suppose we want to set up confidence intervals for the estimate θ̂; we may use the Central Limit Theorem 362. In the normal case we have exact intervals.

Exercise 606 (Normal Estimation Intervals). For the MLE X̄ of µ we have

\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1},   (2586)

which then has probability measure P\left(\frac{\bar{X} - \mu}{S/\sqrt{n}} \in (-t_{n-1,\frac{\alpha}{2}}, t_{n-1,\frac{\alpha}{2}})\right) = 1 - \alpha and bootstrap approximate interval

\bar{x} \pm t_{n-1,\frac{\alpha}{2}}\frac{s}{\sqrt{n}}.   (2587)

Recalling the properties of normal sampling (see 343), we have

\frac{(n - 1)S^2}{\sigma^2} = \frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-1},   (2588)

and defining the quantile values by P(\chi^2_{n-1} > \chi^2_{n-1,u}) = u, we have

P\left(\chi^2_{n-1,1-\frac{\alpha}{2}} \le \frac{n\hat{\sigma}^2}{\sigma^2} \le \chi^2_{n-1,\frac{\alpha}{2}}\right) = 1 - \alpha   (2589)

and an interval for σ² at level (1 − α) of:

\left(\frac{n\hat{\sigma}^2}{\chi^2_{n-1,\frac{\alpha}{2}}},\; \frac{n\hat{\sigma}^2}{\chi^2_{n-1,1-\frac{\alpha}{2}}}\right).   (2590)

These intervals are exact.

6.12.2 Large Sample Theory of Maximum Likelihoods


Discussion in the large sample theory behavior of MLEs require understanding of concepts in estimator
Fisher Information, which are discussed in Section 335.

Theorem 368 (Asymptotic Normality of Maximum Likelihoods). Let Xi , i ∈ [n] be IID density f (·|θ), θ ∈
Θ ⊂ Rp . The the MLE θ̂ of θ has asymptotic distribution

nI(θ)(θ̂ − θ) → Φ(0, 1p )
p n→∞
(2591)

where I(θ) is the Fisher information matrix (see 335). In particular, for large enough n we have ap-
proximately
 
1
θ̂ ∼ Φ θ, (2592)
nI(θ)

The theorem holds under the assumption that ∃ > 0 such that ∀x ∈ R, for k = 1, 2, 3 the partial
δk
differentiation δθ k
log f (x|θ) exists on (θ ± ) and is continuous. Additionally we make the assumption
3
δ
that | δθ 3 log f (x|θ)| < M (x) on (θ ± ) with integrable M . These conditions hold in almost all practical

applications. Additionally in higher dimensions we assume that I(θ) is invertible matrix.

420
From Section (343) we already know that for Gaussian random variables Xi , i ∈ [n] from Φ(µ, σ 2 ) we
have: (see 609):
" #
1
σ2 0
I(θ) = 2
(2593)
0 σ2

2
with X̄ ∼ Φ(µ, σn ) (exactly) and that X̄ ⊥ σ̂. By asymptotic normality (Theorem 368) we have
" # " # " 2 #!
σ
X̄ µ n 0
∼Φ , σ2
(2594)
σ̂ σ 0 2n

that they are in fact bivariate normal! When using the case for ν = σ 2 ∈ θ (see 609) then we arrive at
" # " # " 2 #!
σ
X̄ µ 0
∼Φ , n 2σ4 (2595)
σ̂ 2 σ2 0 n

with again the additional information that (X̄, σ̂ 2 ) is approximately bivariate


q normal. By the asymptotic
1
normality theorem on MLE (see Theorem 368) then we must have SE(θ̂) ≈ nI(θ) and by bootstrapping
θ̂ → θ we may compute the Fisher information at the estimate for interval computations. We have
 
θ̂ − θ
1 − α ≈ P −z α2 ≤ q ≤ z α2  (2596)
I(θ)−1
n

which gives equivalent significance intervals


 s s 
I( θ̂) −1 I( θ̂) −1
θ̂ − zα/2 , θ̂ + zα/2  (2597)
n n

by bootstrap approximate under n large. For marginal intervals of θi ∈ θ of multivariate distributions


we may take diag(I(θ))i For joint intervals they would have multivariate normal densities (see 344).

6.12.3 Least Squares and Maximum Likelihoods


In the least squares additive error model Y = f (X) + ,  ∼ Φ(0, σ 2 ) is equivalent to MLE using the
conditional likelihood P(Y |X, θ) = N (fθ (X), σ 2 ).
Recall the Gaussian log-likelihood of data samples from Equation (2598)
N
N 1 X 2
log L(θ) = − log(2π) − N log σ − 2 (yi − fθ (xi )) . (2598)
2 2σ i=1

θ only falls in the last term and we try to minimise this term to obtain the maximum likelihood estimate
for θ̂. This extends to problems such as the multinomial likelihood for regression function P(G|X) on
qualitative outputs, where we use the model P(G = Gk |X = x) = pk,θ (x), k ∈ [K] together with the
log-likelihood function (cross-entropy) of
N
X
log L(θ) = log pgi ,θ (xi ).
i=1

421
6.13 Fisher Information
Let θ̂0 be the MLE (see Section 334) of parameter θ0 based on IID Xi , i ∈ [n] from density function
f (x|θ0 ). Recall by central limit theorem that the distribution of θ̂0 becomes approximately normal. The
bias and standard error of θ̂0 are computed Eθ̂0 −θ0 , SD(θ̂0 ) respectively. Often our estimate is consistent
and the bias goes to zero as n → ∞.

Definition 335 (Fisher Information Matrix). Recall the estimation problem (see Problem 2) and consider
the parametric family {f (x|θ) : θ ∈ Θ ⊂ Rp }. The Fisher information matrix at θ defined I(θ) ∈ Rp,p :
Z ∞ 2 
δ
− 2
log f (x|θ) f (x|θ)dx (2599)
−∞ δθ

such that the entry


 Z ∞ 2 
δ
− log f (x|θ) f (x|θ)dx i=j


 2
−∞ δθi

Iij (θ) = Z ∞ 2  (2600)
 δ

 − log f (x|θ) f (x|θ)dx i 6= j
−∞ δθi δθj

where x can be scalar of vector. For random variable X ∼ f (x|θ) we may write
 2 
δ
I(θ) = −E log f (X|θ) (2601)
δθ2
We may interpret I(θ) as the amount of information about θ in one sample X ∼ f (x|θ), and we may
see this by thinking about is as the expected acceleration of θ on sample.

Consider n IID samples X = (Xi∈[n] ) have joint density g(x|θ) = Πni=1 f (xi |θ) define the n-sample
Fisher information as
δ2
 
J (θ) = −E log g(X|θ) = nI(θ). (2602)
δθ2
We may focus on single sample Fisher information since the generalization to n-sample is proportionally
scaled - let’s take a look at examples.

Exercise 607 (Poisson Fisher Information). Consider Poisson density (see Definition 6.17.4) f (x|λ) =
λx exp(−λ)
x! for x = 0, 1 · · · . The log density follows log f (X|λ) = X log λ − λ − log X! and since
δ X δ2
δλ log f (X|λ) = λ − 1, δλ2 log f (X|λ) = − λX2 we have
EX 1
I(λ) = 2
= . (2603)
λ λ
Obviously since for Poisson, Var(X) = λ - the larger the variance, the smaller the amount of information
from each sample.

Exercise 608 (Bernoulli Fisher Information). Let X ∼ Bernoulli(p), by its density (see Equation 2651)
we have log density

l = log f (X|p) = X log p + (1 − X) log(1 − p) (2604)


δl X 1−X δ 2 l
and taking derivatives δp = p − 1−p , δp2 = − pX2 − 1−X
(1−p)2 we have Fisher information
1 1 1
I(p) = + = (2605)
p 1−p p(1 − p)
δI(p) 1−2p
See from δp = − [p(1−p)]2 that Fisher information is lowest when p = 0.5, that is when each sample

gives least information.

422
Exercise 609 (Normal Fisher Information). We consider the cases when (1) θ = (µ, σ) and when (2)
θ = (µ, σ 2 ).
n 2
o
1. Recalling the Gaussian density f (x|θ) = √1
σ 2π
exp − (x−µ)
2σ 2 we obtain log density

1 (X − µ)2
l = log f (X|θ) = − log σ − log(2π) − . (2606)
2 2σ 2
Then
(X − µ)2
 
δl X −µ 1
= , − + (2607)
δθ σ2 σ σ3
" #
δ2 l −σ −2 −2σ −3 (X − µ)
= (2608)
δθ2 −2σ −3 (X − µ) σ −2 − 3σ −4 (X − µ)2

with Fisher information matrix I(θ)


" #
1
σ2 0
2
(2609)
0 σ2
2
where the formulas EX = µ, σ 2 = E(X − µ)2 are used and the (2, 2) entry follows from − σ12 + 3 σσ4 .

2. Has Fisher information matrix


" #
1
ν 0
I(θ) = 1
(2610)
0 2ν 2

where ν = σ 2 . (verify this)

Exercise 610 (Binomial Fisher Information). For X ∼ Bin(n, p) it has log density
 
n
l = log f (X|p) = log + X log p + (n − X) log(1 − p) (2611)
X
and the Fisher information matrix equals
n
I(p) = (2612)
p(1 − p)
, (verify this) which is expected since the Binomial are n IID samples of Bernoulli (so n times Bernoulli
Fisher information see Exercise 608).

Although the full sequence of n Bernoulli samples tell us which succeed and which fail, and the bino-
mial only tells us the total number of successes - the Fisher information matrix suggests no information
was lost. This is because no information was lost in relation to the estimation of θ. This is discussed in
the context of estimator sufficiency (see Section 6.15).

Exercise 611 (Multinomial Fisher Information). Consider the special case X = (X1 , X2 , X3 ) ∼ M ult(n, p)
where θ = (p1 , p2 ) and p3 = 1 − p1 − p2 . Then it has log density function

l = log f (X|θ) = κ + X1 log p1 + X2 log p2 + X3 log(1 − p1 − p2 ) (2613)

where κ is the log multinomial coefficient independent of θ. Taking derivatives


 
δl X1 X3 X2 X3
= − , − (2614)
δθ p1 p3 p2 p3
" X X X
#
δ2 l − p2 − p2
1 3
− p23
= 1 3 3
(2615)
δθ2 −X p2
3
−X2
p2
−X 3
p2
3 2 3

423
and Fisher information matrix
" #
1 1 1
p1 + p3 p3
I(θ) = n 1 1 1
(2616)
p3 p2 + p3

which is n times the Fisher information of a multinomial experiment of (1, p).

We see it is often the case that Var(θ̂) = I(θ)−1 : that is smaller variance has larger information
in single sample. This is intuitive. Suppose in the extreme case that the population members are non
random and every member is some unknown constant. A single sample reveals this value and we know
the complete distribution.

6.14 Efficiency of Estimators and Asymptotic Relative Efficien-


cies
Suppose we have Xi , i ∈ [n] IID from parametric density f (x|θ), θ ∈ Θ and we have θ̂, θ̃ as estimators of
θ. Their mean squared errors can be decomposed (see Equation 2523)

E(θ̃ − θ)2 = Var(θ̃) + (Eθ̃ − θ)2 (2617)


2 2
E(θ̂ − θ) = Var(θ̂) + (Eθ̂ − θ) (2618)

and we would like the better estimate which has smaller MSE on all or some subset of {θ : θ ∈ Θ} ⊂ Θ.

Theorem 369 (Cramer-Rao Inequality). For unbiased θ̂ where Eθ̂ = θ, then ∀θ ∈ Θ we have
1
Var(θ̂) ≥ . (2619)
nI(θ)

That is, the Cramer-Rao lower bound tells us the most that we can expect from an unbiased estimator.
1
Definition 336 (Efficiency). We say that an unbiased estimator θ̂ is efficient if Var(θ̂) = nI(θ) for all
θ ∈ Θ. That is, its variance is equivalent to the Cramer-Rao lower bound and no unbiased estimator can
be better. Additionally, we say that the efficiency of an unbiased θ̂ can be quantified

1/(nI(θ))
eff(θ̂) = ≤ 1. (2620)
Var(θ̂)
IID
Exercise 612 (Inefficiency of Sample Variance). For Xi ∼ Φ(µ, σ 2 ), i ∈ [n] argue that the sample
1
variance s2 = n−1 2
P
i (Xi − X̄) is not an efficient estimator. Recall that (see properties proved in
2
Definition 343) that (n−1)S
σ2 ∼ χ2n−1 then by expectation of χ2 variable (see Section 6.17.7) we have that
ES = σ (unbiased). Then see by variance properties of chi-squared random variables that Var(s2 ) =
2 2

σ4 2σ 4 1 1
(n−1)2 2(n − 1) = n−1 . By Normal Fisher Information (see Exercise (609)) we have I(σ 2 ) = 2ν 2 = 2σ 4
and we have efficiency

1/nI(θ) 2σ 4 /n n−1
= = < 1. (2621)
Var(θ̂) 2σ 4 /(n − 1) n

In general we have no guarantee of unbiasedness for MLE (see Section 334). However, by asymptotic
normality of maximum likelihood estimators (see Theorem 368) for large n, MLE θ̂ should have small
1
bias and variance close to nI(θ) . That is, the MLE is asymptotically efficient.

424
Definition 337 (Relative Estimator Efficiency). Let Eθ̂ = Eθ̃ = θ, then the relative efficiency of θ̃-to-θ̂
is written

Var(θ̂) eff(θ̃)
eff(θ̃, θ̂) = = . (2622)
Var(θ̃) eff(θ̂)

Note that the relative efficiencies can depend on n. In the case when the n term factors, the value
eff(θ̃, θ̂)−1 gives us size m sample required to make Var(θ̃m ) ≈ Var(θ̂n ). For efficiency ratio α we need
αn samples to achieve an estimator for the population parameters with same variance characteristics.

Corollary 17 (Efficiency of Consistent Estimators). For consistent estimators (see Section 333) θ̂, θ̃
we may define their efficiency with the same interpretations as in efficiency of unbiased estimators (see
Section 336) under asymptotic conditions.

6.14.1 Comparing Asymptotic Relative Efficiency


Let Xi ∼ F (x), i ∈ [n] and have symmetric density f (x), such that µ = m. We use the sample mean and
sample median, as well as sampled trimmed mean X̄α to estimate the distribution centre. These were
discussed under the settings of robust estimation methods in Section 6.19.

Result 35 (Asymptotic Centre Measure Distributions). Their asymptotic distributions of the centre
measure are as follows:
2
1. X̄ ∼ Φ(µ, σn )

2. m̂ ∼ Φ(µ, 4nf12 (µ) )


2
σα
3. X̄α ∼ Φ(µ, n ), where
Z x1−α 
−2
σα2 = 2(1 − 2α) 2
t f (t)dt + αx21−α (2623)
0

where x1−α is 1 − α quantile given P(X ≤ x1−α ) = 1 − α. Since we do not have these in most
cases we apply bootstrap approximate. Let f = bnαc
 
n−f
1 X 2 2 2
σ̂α2 =
  
 X(i) − X̄α + f X(f +1) − X̄α + f X(n−f ) − X̄α  . (2624)
n(1 − 2α)2
i=f +1

Exercise 613 (Asymptotic Relative Efficiency of Centre Measures). From the asymptotic distribution
results of the centre measures (see Result 35), see that:

1. eff(m̂, X̄) = 4σ 2 f 2 (µ)


1 2
2. f (x) from Φ(µ, σ 2 ), then f 2 (µ) = 2πσ 2 and hence eff(m̂, X̄) = π.

π2
3. f (x) from logistic distribution moments, recall that EX = 0, VarX = 3 , then we have eff(m̂, X̄) =
2
π2
2 2 2 2
4σ f (µ) = 4σ f (0) = 4 π3 16
1
= 12 .

Exercise 614 (Almost always goodness of the trimmed mean). Show that the relative efficiency eff(X̄α , X̄) =
σ2
2
σα ≥ (1 − 2α)2 .

425
Proof. Without loss of generality, we can show proof for a centred distribution. Then σ 2 = EX 2 . We
can write that
Z F −1 (1−α) Z ∞
2 2
σ = 2 t f (t)dt + 2 t2 f (t)dt (2625)
0 F −1 (1−α)
Z F −1 (1−α) 2
t2 f (t)dt + 2α F −1 (1 − α) α < 1, F −1 (1 − α) is l.b.

≥ 2 (2626)
0

= (1 − 2α)2 σα2 . (2627)

For value of α = 0.05 we have asymptotic relative efficiency of ≥ 0.81. In the case of the Cauchy
distribution, the trimmed mean m̂ is never much worse than the sample mean but can be much, much
(even infinitely) better at times.

6.15 Sufficient Statistics


Recall that the Binomial Fisher information (see Exercise (610)) is the same as n times the Bernoulli
Fisher information (see Exercise (608)). This might seem paradoxical since the Bernoulli sequence gives
more information than the Binomial summary of total success counts. This confusion is resolved by the
concept of sufficiency.
IID
Definition 338 (Sufficiency). Let Xi ∼ fθ (x), θ ∈ Θ, let T (X) be function of X = (Xi∈[n] ). If the
conditional distribution of X given T = t is the same across θ ∈ Θ for all t of realized T then we
say T is sufficient for Θ (or θ). That is to say that all information about θ in X is contained in T .
We may characterize sufficiency with equivalent statements as follows. Define random variable X on
probability space (see Definition 268) (S, Σ, P). For each disjoint t, define St = {x : T (x) = t} and we
have S = ∪Tt St . Then T is sufficient for θ iff ∃q(x) such that ∀θ ∈ Θ, ∀t ∈ D(T ),

fθ (X = x|T = t) = q(x), x ∈ St (2628)

where q(x) ⊥ θ. For any fixed value of t, we only need to check the conditional distribution of outcomes
x ∈ St .
IID
Exercise 615 (Bernoulli Insufficiency). Let Xi ∼ Bernoulli(p), i ∈ {1, 2} and T = X1 . For space
partition S0 = {(0, 0), (0, 1)}, S1 = {(1, 0), (1, 1)} and x ∈ S0 we have

Pp (X1 = 0, X2 = x2 )
Pp (X = x|T = 0) = = Pp (X2 = x2 ) = px2 (1 − p)1−x2 (2629)
Pp (X1 = 0)

where x2 ∈ {0, 1} and we see that the conditional distribution of X|T 6⊥ p. That is X1 is not sufficient
for p.
IID
Theorem 370 (Equivalency of Sufficiency and Factorization). Let Xi ∼ f (x|θ), i ∈ [n] then T is
sufficient for θ iff ∃g(t, θ), h(x) such that ∀θ ∈ Θ we can write

fθ (x) = g(T (x), θ)h(x) ∀x ∈ D(x). (2630)

426
Proof. →: Suppose we may write fθ (x) = g(T (x), θ)h(x) for all x, then for x ∈ St we may write

fθ (X = x, T = t)
fθ (X = x|T = t) = (2631)
Pθ (T = t)
fθ (X = x)
= (2632)
Pθ (T = t)
g(t, θ)h(x)
= P 0
(2633)
x ∈St g(t, θ)h(x )
0

h(x)
= P 0
(2634)
x0 ∈St h(x )
= q(x) (2635)

and we see that T is sufficient for θ by definition.


←: Assuming ∃q(x) for all θ ∈ Θ and t we have Pθ (X = x|T = t) = q(x) for x ∈ St we have

Pθ (X = x) = Pθ (T = t)Pθ (X = x|T = t) = Pθ (T = t)q(x) (2636)

is factorization into g(t, θ) = Pθ (T = t) and h(x) = q(x).

Theorem 371. If T is sufficient for θ then MLE is function of sufficient statistic T .

Proof. This is easily seen by us recalling that the likelihood function is written g(t, θ)h(x), where MLE
maximises g(t, θ) (a function of t).

For estimator θ̂ which is not a function of a sufficient statistic T , we may obtain a better estimator
by conditioning on T .
IID Pn
Exercise 616 (Bernoulli Sufficiency). Let Xi ∼ Bernoulli(p), i ∈ [n], X = (Xi∈[n] ) and T = i Xi .
For t ∈ {0, 1, · · · n}, define St = {x : x1 + x2 + · · · xn = t} then for x ∈ St we have:

Pp (X = x, T = t) Pp (X = x)
Pp (X = x|T = t) = = (2637)
Pp (T = t) Pp (T = t)
−1
pt (1 − p)(n−t)

n
= n t

n−t
= ⊥ p, (2638)
t p (1 − p)
t

and hence T is sufficient statistic for p. Alternatively see that we can write the density function

fp (x) = Πni=1 pxi (1 − p)1−xi (2639)


P P
xi
= p i (1 − p)n− i xi (2640)
pt (1 − p)n−t · (1)

= (2641)

by equivalency theorem 370 we conclude that T is sufficient for all p.


IID Pn
Exercise 617 (Conditional Bernoulli on Statistic T ). Let Xi , ∼ Bernoulli(p) and T = i Xi , a
sufficient statistic for p. Recall that Xi is unbiased for p. Find conditional distribution of X1 given
T = t ∈ [n].

Proof. See that


n−1 t−1

p· t−1 p (1 − p)n−t t
Pp (X1 = 1|T = t) = n t

n−t
= , (2642)
t p (1 − p)
n

and Pp (X1 = 0|T = t) = 1 − nt .

427
Theorem 372 (Rao-Blackwell Theorem). Let θ̂ be estimator of θ with finite variance, and T be sufficient
statistic for θ. Define θ̃ = E[θ̂|T ]. For every θ ∈ Θ,

E[(θ̃ − θ)2 ] ≤ E[(θ̂ − θ)2 ] (2643)

and the equality holds for θ ∈ Θ iff θ̂ is function of T . In particular if T is sufficient statistic then any
estimator θ̂ that is not a function of T can be improved (in mean squared error terms) by conditioning
on T . The MLE is a function of T .

Proof. Since Eθ̃ = E[E(θ̂|T )] = Eθ̂ by iterated conditioning property (Theorem 373), their bias is equiv-
alent and we may compare their variances. See that (by Theorem 373):

Var(θ̂) = Var(E(θ̂|T )) + E[Var(θ̂|T )] (2644)


= Var(θ̃) + E[Var(θ̂|T )] (2645)
≥ Var(θ̃). (2646)

For equality to hold it requires that E[Var(θ̂|T )] = 0 which occurs when Var(θ̂|T ) = 0, which we argued
requires that θ̂ = g(T ), or that θ̃ = E(θ̂|T ) = θ̂, where the sufficiency of T is required such that the
expectation does not depend on unknown θ.

6.16 Expectations
6.16.1 Conditional Expectations
Definition 339 (Conditional Expectations). Let X, Y be random variables. Then E(Y |x) is conditional
expectation of Y given X = x. The random conditional expectation is the random variable E(Y |X), a
function of X.

Definition 340 (Conditional Variance). Let X, Y be random variables. Then Var(Y |x) is conditional
variance of Y given X = x with value

Var(Y |x) = E(Y 2 |x) − E(Y |x)2 (2647)

with random conditional variance

Var(Y |X) = E(Y 2 |X) − E(Y |X)2 . (2648)

Corollary 18 (Degenerate Random Variables). Note that (1) EY = Y iff Y is constant, and (2)
Var(Y ) = 0 iff Y is constant. This is easily seen by the degeneracy of random variable Y for (1) and
then substituting (1) into the variance formula for (2).

Theorem 373. Let X, Y be random variables. Then

1. EY |X = Y iff Y = g(X).

2. Var(Y |X) = 0 iff Y = g(X).

3. EY = E [E[Y |X]]

4. Var(Y ) = Var(E[Y |X]) + E[Var(Y |X)]

428
Proof. We prove statements (1), (2) in whole and in brief. If Y = g(X) then P(Y |X = x) is concen-
trated on g(x) and E(Y |x) = g(x), Var(Y |x) = 0. Integrating over X we obtain EY |X = g(X) = Y ,
Var(Y |X) = 0. On the other hand, suppose EY |X = Y . For all x we have E(Y |x) = Y ∈ R, a constant
at fixed x. This is equivalent to saying that Y is function of X. Further suppose Var(Y |X) = 0, then
Var(Y |x) = 0 and Y is constant at X = x.
Statement (3) is proved in (Theorem 356) and we show Statement (4). For k = 1, 2, define Zk = E[Y k |X).
Then Var(Y |X) = Z2 − Z12 . Then we have

Var(E[Y |X]) = E(Z12 ) − E(Z1 )2 = E(Z12 ) − E(Y )2 , (2649)


E[Var(Y |X)] = EZ2 − E(Z12 ) 2
= E(Y ) − E(Z12 ) (2650)

where in the last step we use the iterated conditioning property of Statement (3) and definition of variance
to prove equivalency.

Exercise 618. Let X ∼ Φ(µ, σ 2 ) and  ∼ Φ(0, τ 2 ) with X ⊥ . Define Y = X +  and verify that
EY = E[E[Y |X]] , Var(Y ) = Var(E[Y |X]) + E[Var(Y |X)].

Proof. We have EY |X = E[(X + )|X] = X + E|X = X. See that E[E[Y |X]] = EX = µ = EY . For
the next proof, we have Var(Y |X) = Var(X + |X) = V ar(|X) = τ 2 , and Var(E[Y |X]) = Var(X) = σ 2 .
Then Var(Y ) = σ 2 + τ 2 = Var(E[Y |X]) + E[Var(Y |X)].

6.17 Probability Distributions


Here we marry the discussion on random variables and probabilistic distributions. Although they are
distinct concepts, it is random variables that take on some distribution of interest. Recall that a random
variable is a function mapping objects from a measurable set to a real number. A random variable can
have different probability distributions under different probability measures. A random variable defined
on some probability space (see Definition 268) gives us the relevant distributions of interest. We can
then think of random variables as a function transforming the sample space. A random variable that
has finite or countably infinite range is discrete, and those with uncountably infinite range is said to be
continuous.

6.17.1 Bernoulli Trials and Binomial Distribution


In the sampling of a population in which members take binary values, we can indicate their values as
zero and one. The proportion p of the population that has value one, is a parameter of interest. For such
PN
a population, we have µ = N1 i xi = N k
= p and σ 2 = N1 i x2i − µ2 = N1 (K) − µ2 = µ − µ2 = p(1 − p).
P

That is, EX = p, Var(X) = p(1−p) when X is a random variable of a Bernoulli experiment. The density
of X ∼ Bernoulli(p) follows

f (x|p) = px (1 − p)1−x x = 0, 1. (2651)

For the consideration of Binomials, we are interested in repeated series of independent, identical
Bernoulli trials. (see Definition 314)

Definition 341 (Binomial Distribution). For a series of n independent Bernoulli trials with success p,
the probability of successes X = k successes is written
 
n k
P(X = k) = p (1 − p)n−k k ∈ [n]. (2652)
k

429
This is also known as the binomial distribution.

Theorem 374 (Binomial Expectation). Show that for X ∼ Bin(n, p), E(X) = np.
P Pn n k
 n−k
Pn kn! k n−k
Proof. By definition EX = k kpX (k) = k=0 k k p (1 − p) = k=0 k!(n−k)! p (1 − p) =
Pn n! k n−k
P n (n−1)! k−1 n−k
k=1 (k−1)!(n−k)! p (1 − p) . See that this is equals to np k=1 (k−1)!(n−k)! p (1 − p) =
n  
X n − 1 k−1
np p (1 − p)n−k (2653)
k−1
k=1
n−1
X n − 1 
= np pj (1 − p)n−j−1 k−1→j (2654)
j=0
j
m  
X m j
= np p (1 − p)m−j n−1→j (2655)
j=0
j
= np. (2656)

6.17.2 Hypergeometric Random Distribution


The hypergeometric distribution is often casted as a ‘draw balls from urn problem’. While the binomial
casted in an urn model is sampling with replacement to ensure Bernoulli independence, in hypergeometric
randomness we are concerned with simultaneous sampling, or unordered sampling without replacement.

Theorem 375 (Hypergeometric Distributions). For urn with r red chips, w white chips and r + w = N .
The probability of k red chips being selected out of a sampling of n chips drawn at random without
replacement is given by
r w
 
k n−k
P(X = k) = N
 . (2657)
n

This is known as the hypergeometric distribution.

6.17.3 Exponential Random Variables


Let τ be exponential random variable with density f (t) = λ exp(−λt) for t ≥ 0 and otherwise zero. Here
λ is positive constant and we call this the ‘rate’ parameter of the exponential distribution. We see why
by considering the expectation (integrate by parts)
Z ∞ Z ∞
∞ 1 1 ∞ 1
Eτ = λ t exp(−λt)dt = [−t exp(−λt)]0 − λ − exp(−λt)dt = 0 − [exp(−λt)]0 = . (2658)
0 0 λ λ λ
The c.d.f is given by
Z t
t
F (t) = P(τ ≤ t) = λ exp(−λu)du = [− exp(−λu)]0 = 1 − exp(−λt), t ≥ 0. (2659)
0

It follows that P(τ > t) = exp(−λt) for t ≥ 0. The random variable τ can be used to model the time of
occurrence of an event, for instance a credit event. A defining characteristic of the exponential random
variable is known as memorylessness. See for
P(τ > t + s, τ > s) P(τ > t + s) exp(−λ(t + s))
P {(τ ≥ t + s|τ > s)} = = = = exp(−λt). (2660)
P(τ > s) P(τ > s) exp(−λs)
We write a random variable X that is exponentially distributed with rate λ as X ∼ exp(λ).

430
6.17.4 Poisson Distribution
The Poisson distribution is characterized by the parameter λ, and the random variable X with underlying
∼ P oisson(λ) has mass
λk exp{−λ}
P(X = k) = , k = 0, 1, 2 · · · (2661)
k!
and EX = VarX = λ.

6.17.5 Poisson Process


A Poisson process can be generated by counting the number of events modelled by exponential random
variables (see Section 6.17.3) in some time period. Starting with sequence τ1 , τ2 , · · · , with each τi ∼
exp(λ). Let these denote interarrival times of events, such that the first event occurs at τ1 , second event
occurs at τ1 + τ2 and so on. The arrival times are the cumulative sum of these interarrival times, such
that
n
X
Sn = τi (2662)
i=1

represents the arrival time of the n-th event.

Definition 342 (Poisson Process and Intensity). The Poisson process N (t) is the random variable that
counts this number of jumps occurrences, that is


 0 0 ≤ t < S1 ,


1 S1 ≤ t < S2 ,

N (t) = (2663)


 ···


n Sn ≤ t ≤ Sn+1

N (t) is right-continuous, which we denote N (t) = lims→t+ N (s). The σ-algebra generated from observing
1
N (t) up to 0 ≤ t ≤ s is denoted F(t). Since each τi has expectation λ, the jumps arrive at rate λ per
unit time (on average). Often, we also call λ of N (t) as the intensity.

Lemma 8 (Distribution of Arrival Times). For n ≥ 1, random variable arrival time Sn as in Equation
2662 has the gamma density
(λs)n−1
gn (s) = λ exp(−λs), s ≥ 0. (2664)
(n − 1)!
Proof. We know that sum of exponential random variables is gamma random variable. We prove this
statement by induction. When n = 1, then S1 = τ1 , which is exponential random variable. In particular,
g1 (s) = λ exp(−λs) for any s ≥ 0. Then assume it holds for some n, and since Sn+1 = Sn + τn+1 , with
Sn ⊥ τn+1 , then the probability convolution (see Theorem 359) holds s.t Sn+1 has density function
Z s Z s
(λv)n−1
gn (v)f (s − v)dv = λ exp(−λv) · λ exp(−λ(s − v))dv (2665)
0 0 (n − 1)!
λn+1 exp(−λs) s n−1
Z
= v dv (2666)
(n − 1)! 0
λn+1 exp(−λs) n s
= [v ]0 (2667)
n!
n
(λs)
= λ exp(−λs) (2668)
n!
= gn+1 (s). (2669)

431
Lemma 9 (Distribution of Poisson Processes). The Poisson process N (t) (Equation 2663) with intensity
λ has distribution
(λt)k
P {N (t) = k} = exp(−λt), k = 0, 1, · · · (2670)
k!
Proof. Reason that P {N (t) ≥ k} = P{Sk ≤ t}. Then we may write
Z t
(λs)k−1
P {N (t) ≥ k} = P {Sk ≤ t} = λ exp(−λs)ds. (2671)
0 (k − 1)!
Z t
(λs)k
P {N (t) ≥ k + 1} = P{Sk+1 ≤ t} = λ exp(−λs)ds. (2672)
0 k!
If we integrate by parts Equation 2672, then we get
t Z t
(λs)k (λs)k−1

P {N (t) ≥ k + 1} = − exp(−λs) − λ k(− exp(−λs))ds (2673)
k! 0 0 k!
t Z t
(λs)k (λs)k−1

= − exp(−λs) + λ exp(−λs)ds (2674)
k! 0 0 (k − 1)!
(λt)k
= − exp(−λt) + P {N (t) ≥ k} . (2675)
k!
The result follows since P {N (t) = k} = P {N (t) ≥ k} − P {N (t) ≥ k + 1}.

As consequence of the memorylessness of exponential random variables modelling the frequency of


underlying events, which determine the (time) distance between event occurrences, the random variable
N (t + s) − N (s) is independent of F(s), and is in fact distributionally equivalent to the random variable
N (t). The time between any two non-overlapping events are independent and exponentially distributed
1
with mean λ. The increments of Poisson processes are said to stationary in distribution.

Exercise 619. For a Poisson process N (Section 6.17.5) observed up to N (s) and measured as N (s) = k,
show that for small t > 0, we have

P{N (s + t) = k|N (s) = k} = 1 − λt + O t2 ,



(2676)
P{N (s + t) = k + 1|N (s) = k} = λt + O t2 ,

(2677)
P{N (s + t) ≥ k + 2|N (s) = k} = O t2 .

(2678)

Proof. Using the Taylor expansions and the density functions known (Lemma 9), write

P(N (s + t) = k|N (s) = k) = P(N (t) = 0) = exp(−λt) = 1 − λt + O t2 ,




λt
exp(−λt) = λt(1 − λt + O t2 ) = λt + O t2 ,
 
P(N (s + t) = k + 1|N (s) = k) = P(N (t) = 1) =
1!

X (λt)k
exp(−λt) = O t2 .

P(N (s + t) ≥ k + 2|N (t) = k) = P(N (t) ≥ 2) =
k!
k=2

Theorem 376. Let N (t) be Poisson process with intensity λ > 0. and 0 = t0 < t1 < · · · < tn ,
then the increments N (t1 ) − N (t0 ), N (t2 ) − N (t1 ), · · · , N (tn ) − N (tn−1 ) are stationary and independent.
Furthermore, the Poisson increments are distributed
λk (tj+1 − tj )k
P {N (tj+1 ) − N (tj ) = k} = exp(−λ(tj+1 − tj )), k = 0, 1, · · · . (2679)
k!

432
Proof. The proof follows directly from the discussion before, memorylessness of exponential random
variables and application of Lemma 9.

As a result of Theorem 376 we can derive the moments of Poisson increments. The exponential power
series can be stated by any of the equivalent Taylor expansions:
∞ ∞ ∞
X xk X xk−1 X xk−2
exp(x) = = = . (2680)
0
k! 1
(k − 1)! 2
(k − 2)!

N (t) − N (s) has possible values over N, so it should be trivial by the total law of probability that
P∞
0 P {N (t) − N (s) = k} = 1. Computing

X λk (t − s)k
E[N (t) − N (s)] = k exp(−λ(t − s)) (2681)
0
k!

X λk−1 (t − s)k−1
= λ(t − s) exp(−λ(t − s)) (2682)
1
(k − 1)!
= λ(t − s) exp(−λ(t − s)) exp(λ(t − s)) (2683)
= λ(t − s). (2684)

Additionally, compute

h
2
i X λk (t − s)k
E (N (t) − N (s)) = k2 exp(−λ(t − s)) (2685)
0
k!

X λk (t − s)k
= exp(−λ(t − s)) (k − 1 + 1) (2686)
1
(k − 1)!
∞ ∞
X λk (t − s)k X λk (t − s)k
= exp(−λ(t − s)) + exp(−λ(t − s)) (2687)
2
(k − 2)! 1
(k − 1)!

X λk−2 (t − s)k−2
= λ2 (t − s)2 exp(−λ(t − s)) (2688)
2
(k − 2)!

X λk−1 (t − s)k−1
+λ(t − s) exp(−λ(t − s)) (2689)
1
(k − 1)!
= λ2 (t − s)2 + λ(t − s). (2690)

Then

Var(N (t) − N (s)) = λ(t − s). (2691)

In particular, E(N (t) − N (s)) = Var(N (t) − N (s)) for Poisson increments.

6.17.6 Gaussian/Normal Distribution


Random variable X is said to be normally distributed with mean µ, variance σ 2 and is denoted X ∼
Φ(µ, σ 2 ) if it has density

1 (x − µ)2
f (x) = √ exp{− } −∞<x<∞ (2692)
σ 2π 2σ 2
When X is centred random variable (EX = 0) then the odd moments equates to zero. We can see this
R∞ R0 R∞ R∞ R∞
by writing EX k − −∞ xk f (x)dx = −∞ xk f (x)dx + 0 xk f (x)dx = 0 (−x)k f (−x)dx + 0 xk f (x)dx.
X−µ
Denote its standardization Z = σ ∼ Φ(0, 1).

433
IID
Definition 343 (Sample Mean and Sample Variance). When we have Xi , i ∈ [n] ∼ Φ(µ, σ 2 ), then we
define the sample mean and sample variance to be respectively
1X
X̄ = Xi (2693)
n i
1 X
S2 = (Xi − X̄)2 (2694)
n−1 i

Then we can show that

1. X̄ ⊥ S 2
2
2. X̄ ∼ Φ(µ, σn )
(n−1)S 2
3. σ2 ∼ χ2n−1
X̄−µ
4. √
S/ n
∼ tn−1

2σ 4
5. ES 2 = σ 2 , and Var(S 2 ) = n−1

Proof. 1. First let Y = X1 − X̄. We show that EX̄Y = EX̄EY . Then EX̄Y = EX̄(X1 − X̄) =
E[X1 { n1 ( 2
P
i Xi )}] − EX̄ . The first term can be written

1 1
E[X12 + X1 X2 + · · · X1 Xn ] = [EX12 + EX1 EX2 + · · · EX1 EXn ] (2695)
n n
1
= [Var(X1 ) + E(X1 )2 + (n − 1)µ2 ] (2696)
n
1 2
= [(σ + µ2 ) + (n − 1)µ2 ] (2697)
n
σ2
= + µ2 (2698)
n
σ2
and the last term is written EX̄ 2 = Var(X̄) + (EX̄)2 = n + µ2 . Then the first and last term has
difference equals zero, and we have EX̄Y = 0 = EX̄EY which is easily seen from (EX1 = EX̄) →
EY = 0. Then since their expectation factors and S 2 is jointly normal, then they are independent
and we are done.

2. Follows from linearity of expectations and properties of variance.


Pn Pn  2
3. Since we have σ12 i=1 (Xi − µ)2 = i=1 Xiσ−µ ∼ χ2n and we can write (verify this)

n n  2
1 X 2 1 X 2 X̄ − µ
(X i − µ) = (X i − X̄) + √ (2699)
σ 2 i=1 σ 2 i=1 σ/ n
| {z } | {z } | {z }
W U V

and we already showed U ⊥ V in (1). Using the moment generating functions property MW (t) =
MU (t)MV (t) we have

MW (t) (1 − 2t)−n/2 n−1


MU (t) = = = (1 − 2t)− 2 (2700)
MV (t) (1 − 2t)−1/2

which maps to the moment generating function of χ2n−1 random variable.

434
6.17.6.1 The Multivariate Case

Definition 344 (Jointly/Multivariate Normal Random Variables). X, Y jointly normal if they have
joint density

(x − µ1 )2 (y − µ2 )2
  
1 1 2ρ(x − µ1 )(y − µ2 )
fX,Y (x, y) = exp − + − .
2(1 − ρ2 ) σ12 σ22
p
2πσ1 σ2 1 − ρ2 σ1 σ2

In matrix form of n dimensional vector of random variables, a random column vector X = (Xi , i = [n])T
is jointly normal if it has density
 
1 1
fX (x) = p exp − (x − µ)C −1 (x − µ)T (2701)
(2π)n det(C) 2

, where C is the positive definite matrix of covariances.

The density factors iff ρ = 0 in the bivariate case, and into a product of n normal densities iff C is
diagonal matrix. Linear combinations of jointly normal random variables are jointly normal. Independent
normal random variables are jointly normal. We can hence create jointly normal random variables by
beginning with a set of independent normals and then taking their linear combinations. Conversely, any
set of jointly normal random variables can be reduced to linear combinations of independent normals. It
is not necessary to deal directly with joint densities when making linear change in variables, since the
joint normality is preserved under linear combinations - they are characterized completely by their mean
and covariance matrix. Here we give an instance to support the claim that any set of jointly normal
random variables can be reduced to a linear combination of independent normals.

Exercise 620 (Jointly Normal Random Variables as Linear Combination of Independent Normals). Let
ρσ2
(X, Y ) be jointly normal. Define W = Y − σ1 X, so that W ⊥ X. We may see their independence by
writing (recall that uncorrelation implies independence in joint normals):

Cov(X, W ) = E[(X − EX)(W − EW )] (2702)


 
ρσ2
= E[(X − EX)(Y − EY )] − E (X − EX)2 (2703)
σ1
ρσ2 2
= Cov(X, Y ) − σ (2704)
σ1 1
= 0 (2705)
ρ2 σ22
E(W ) = µ3 = µ2 − ρσσ21µ1 , σ32 = E[(W −EW )2 ] = E[(Y −EY )2 ]− 2ρσ
σ1 E[(X −EX)(Y −EY )]+
2
σ12
E[(X −
EX)2 ] = (1 − ρ2 )σ22 . The joint density of (X, W ) is written

1 (x − µ1 )2 (w − µ3 )2
fX,W (x, w) = exp(− − )
2πσ1 σ3 2σ12 2σ32
ρσ2
and we see that Y is a linear composition σ1 X + W of pair of independent normals.

The moment generating function for identifying normal random variables can be found as follows:

435
Theorem 377 (Normal Moment Generating Functions). For X ∼ Φ(µ, σ 2 ), it’s mgf is given:
Z ∞
E[exp{uX}] = exp{ux}φ(x)dx (2706)
−∞
Z ∞
1 (x − µ)2 − 2σ 2 ux
= √ exp{− }dx (2707)
−∞ σ 2π 2σ 2
Z ∞
1 [x − (µ + σ 2 u)]2 − [2σ 2 uµ + σ 4 u2 ]
= √ exp{− }dx (2708)
−∞ σ 2π 2σ 2
Z ∞
σ 2 u2 1 (x − (µ + σ 2 u))2
= exp{uµ + } √ exp{− }dx (2709)
2 −∞ σ 2π 2σ 2
σ 2 u2
= exp{uµ + } (2710)
2

6.17.6.2 Other Gaussian Methods

Exercise 621 (Probability Integral Transform of Uniform Random to Standard Normal Random). Let
1 x2
φ(x) = √ exp{− }
2π 2
be called the standard normal density, and the cumulative normal distribution be such that
Z x
Φ(x) = φ(ξ)dξ.
−∞

Φ(x) is strictly increasing and maps R → (0, 1), and hence the inverse function Φ−1 (y) exists and is
strictly increasing also. For all y ∈ (0, 1), Φ(Φ−1 (y)) = y. Let Y be a uniformly distributed random
variable defined on probability space (see Definition 268) and set X = Φ−1 (Y ). Whenever −∞ < a ≤
b < ∞, we have the distribution measure

µx [a, b] = P{ω ∈ Ω : a ≤ X(ω) ≤ b} (2711)


= P{ω ∈ Ω : a ≤ Φ−1 (Y (ω)) ≤ b} (2712)
−1
= P{ω ∈ Ω : Φ(a) ≤ Φ(Φ (Y (ω))) ≤ Φ(b)} (2713)
= P{ω ∈ Ω : Φ(a) ≤ Y (ω) ≤ Φ(b)} (2714)
= Φ(b) − Φ(a) (2715)
Z b
= φ(x)dx, (2716)
a

and this distribution measure is known as the standard normal distribution measure. Any random variable
with this distribution (regardless of the probability space) is a standard normal.

6.17.7 Chi-Squared Distribution χ2


Let Z ∼ Φ(0, 1) standard normal defined as in Section 6.17.6 and define random variable U = Z 2 , then
we say that that U has chi-square distribution with a degree of freedom one, and denote U ∼ χ21 .

Theorem 378 (C.D.F Chi-Squared, DF = 1). Show that U ∼ χ21 has cumulative density function
√ √
FU (u) = Φ( u) − Φ(− u) (2717)

and probability density function


1 u
fU (u) = √ u−1/2 exp{− } u≥0 (2718)
2π 2

436
√ √ √ √
Proof. We have FU (u) = P(U ≤ u) = P(Z 2 ≤ u) = P(− u ≤ Z ≤ u) = Φ( u) − Φ(− u) = κ.
√ 1 √ 1 √ 1
δκ
Additionally we have fu (u) = δu = φ( u) 12 u− 2 − φ(− u)(− 21 u− 2 ) = φ( u)u− 2 which follows from
symmetricity of the Gaussian density.

We can easily see from the Gamma density (see Equation 2729) that χ21 variables are an instance of
the Gamma variable with parameters (α, λ) = ( 21 , 12 ).
IID Pn
Let Ui , i ∈ [n] ∼ χ21 then V = i=1 Ui has χ2n distribution, which follows from it equivalency to a
Gamma (α, λ) = ( n2 , 12 ) random variable. It has moment generating function
 α n
Result 36 (χ2n Moment Generating Functions). M (t) = E(exp{tV }) = λ−t λ
= (1 − 2t)− 2

Corollary 19 (Chi-Squared Moments). With the use of chi-squared moment generating function, we
have
n
M (t) = (1 − 2t)− 2 (2719)
n n n
M 0 (t) = − (1 − 2t)− 2 −1 (−2) = n(1 − 2t)− 2 −1 (2720)
2
n n n
M 00 (t) = (− − 1)n(1 − 2t)− 2 −2 (−2) = (n + 2)n(1 − 2t)− 2 −2 (−2) (2721)
2
We have EV = M 0 (0) = n, EV 2 = M 00 (0) = n(n + 2), Var(V ) = 2n.

Corollary 20 (Addition of Independent Chi-Squared Random Variables). If X ∼ χ2m , Y ∼ χ2n and


X ⊥ Y then X + Y ∼ χ2m+n .

Proof. Since X ⊥ Y their moment generating function factors, and we have

MX+Y (t) = MX (t)MY (t) (2722)


−m −n
= (1 − 2t) 2 (1 − 2t) 2 (2723)
− m+n
= (1 − 2t) 2 (2724)

and we are done (since this maps to the moment generating function of χ2m+n ).

6.17.8 Log-Normal Distribution


A log-normal random variable X ∼ log Φ(µ, σ) has density function
(log x − µ)2
 
1
f (x) = √ exp − x > 0. (2725)
xσ 2π 2σ 2

6.17.9 t Distribution
Let Z ∼ Φ(0, 1) be standard normal and Vn ∼ χ2n , and that Z ⊥ χ2n . Then the random variable defined
Z
tn = p (2726)
Vn /n
is t-distributed with n degrees of freedom, and has density function

Theorem 379 (Density Function of t Distribution).


− n+1
t2

Γ [(n + 1)/2] 2

f (t) = √ 1+ . (2727)
nπΓ(n/2) n
Proof. (verify this)

This is symmetric around zero, and tn → Z as n → ∞.

437
6.17.10 F Distribution
Let U ∼ χ2m , V ∼ χ2n and U ⊥ V and let
U/m
W =
V /n
, then this is said to have F distribution with (m, n) degrees of freedom.
The W F-random variable has density function
m+n
Γ[(m + n)/2]  m m/2 m −1  m − 2
f (w) = w2 1+ w , w ≥ 0. (2728)
Γ(m/2)Γ(n/2) n n
n
and for value of n ≥ 2 we have EW = n−2 .

Z 2 /1
Lemma 10. If X ∼ tn then X ∼ √ Z and X 2 ∼ Vn /n ∼ F1,n .
Vn /n

U U
Lemma 11. As n → ∞, W ≈ m and mW = V /n ∼ χ2m .

6.17.11 Gamma Distribution Γ


The variable X has Gamma distribution if it has density
λα α−1
g(t) = t exp{−λt}, t ≥ 0. (2729)
Γ(α)

Theorem 380 (Properties of the Gamma Function). The Gamma function Γ(a) has the following
properties:

1. Γ(a + 1) = aΓ(a).

6.17.12 Beta Distribution


We say that Y ∼ Beta(a, b) if it has density function

y a−1 (1 − y)b−1
f (y) = 0 ≤ y ≤ 1. (2730)
B(a, b)
Γ(a)Γ(b)
where B(a, b) = Γ(a+b) .

a
Theorem 381 (Beta Random Variable Properties). If Y ∼ Beta(a, b), then show that (i) EY = a+b
and that (ii) Var(Y ) = (a+b)2ab
(a+b+1) .

Proof. Recall that from properties of the Gamma function (see Theorem 380) that Γ(a + 1) = aΓ(a).
Then we can write
1
y a−1+1 (1 − y)b−1
Z
EY = dy (2731)
0 B(a, b)
1
y (a+1)−1 (1 − y)b−1 B(a + 1, b)
Z
= dy (2732)
0 B(a + 1, b) B(a, b)
Γ(a + 1)Γ(b)/Γ(a + b + 1)
= (2733)
Γ(a)Γ(b)/Γ(a + b)
a
= . (2734)
a+b

438
Similarly, we can write
1
y a−1+2 (1 − y)b−1
Z
EY 2 = dy (2735)
0 B(a, b)
1
y (a+2)−1 (1 − y)b−1 B(a + 2, b)
Z
= dy (2736)
0 B(a + 2, b) B(a, b)
Γ(a + 2)Γ(b)/Γ(a + b + 2)
= (2737)
Γ(a)Γ(b)/Γ(a + b)
a(a + 1)
= (2738)
(a + b + 1)(a + b)
ab
and by their moments it follows that VarY = (a+b)2 (a+b+1)

6.17.13 Logistic Distributions


We say X has logistic distribution if it has density

exp(−x)
f (x) = (2739)
(1 + exp(−x))2
π3
Theorem 382 (Logistic Distribution Moments). Show that the value of µ = EX = 0, σ 2 = Var(X) = 3 .

Proof. It is easy to see that f (−x) = f (x) by substitution, and is therefore symmetrical about origin. It
has µ = EX = 0. Then we can write
Z ∞ Z ∞
2 2 2
σ = EX = x f (x)dx = 2 x2 f (x)dx (2740)
−∞ 0
Z ∞ 2 Z ∞  
x exp(x) 2 1
= 2 dx = −2 x d (2741)
0 (1 + exp(x))2 0 1 + exp(x)
∞ Z ∞
x2

x
= −2 +4 dx i.b.p (see Theorem 2) (2742)
1 + exp(x) 0 0 1 + exp(x)
Z ∞
x exp(−x)
= 4 dx (2743)
0 1 + exp(−x)
Z ∞
= x exp(−x)(1 − exp(−x) + exp(−2x) − exp(−3x) · · · )dx verify this (2744)
0
Z ∞ ∞
X
= 4 x exp(−x) (−1)m exp(−mx)dx (2745)
0 0
Z ∞
∞X
= 4 (−1)m x exp(−(m + 1)x)dx (2746)
0 0

X Z ∞
= 4 (−1)m x exp(−(m + 1)x)dx (2747)
0 0
∞ ∞
(−1)m
X Z
= 4 y exp(−y)dy y = (m + 1)x (2748)
0
(m + 1)2 0

(−1)m
 
X 1 1 1
= = 4 1 − + − · · · . (2749)
0
(m + 1)2 22 32 42

Using the property (verify this)



π2 X 1
= (2750)
6 i=1
i2

439
P 
∞ 1 π2
= 2 41 1 + 1 1

, see that 2 i=1 (2i)2 22 + 32 + ··· = 12 . Then taking their difference the bracketed
2 2
π π
term in Equation 2749 is equals to 12 and σ 2 = 3 .

6.17.14 Rayleigh Distribution


This distribution has interesting applications in physics and is characterized by continous density

y y2
fY (y) = exp(− ) a > 0, 0 ≤ y < ∞. (2751)
a2 2a2
R∞ 2 √ R∞
See that it has expectation EY = y ay2 exp(− ay2 ) = 2 2a 0 v 2 exp(−v 2 )dv by substituting v = √y2a
0 R∞
(verify this). See this is form of v exp(−v 2 ) and for k = 1 we have (verify this) 0 v 2k exp(−v 2 )dv =
2k
R∞ 2 √ √ √
v exp(−v 2 )dv = 41 π, so that E(Y ) = 2 2a 14 π = a π2 .
p
0

6.17.15 Mixture Models


Definition 345 (Gaussian Mixtures). A Gaussian mixture can be described as a two-step generative
model. The first step is to generate a discrete variable determining which component Gaussian to sample
from, followed by sampling a measurement of the chosen density function.

6.18 Order Statistics


Definition 346 (Order Statistics). Define the order statistics of a random sample Xi , i ∈ [n] to be the
values arranged in ascending order, and are denoted X(i) , i ∈ [n].

Theorem 383 ((Joint) Distribution of Order Statistics). Suppose Xi ∼ F (x), i ∈ [n], and density f (x)
exists. We may find the joint distribution of any number of order statistics. Suppose we want to find the
joint distribution of 3 order statistics. Then the joint density function of (X(i) , X(j) , X(k) ), 1 ≤ i ≤ j ≤
k ≤ n is given
n!
f(X(i) ,X(j) ,X(k) ) (u, v, w) = κ (2752)
(i − 1)!(j − i − 1)!(k − j − 1)!(n − k)!

where κ equals

[F (u)]i−1 [F (v) − F (u)]j−i−1 [F (w) − F (v)]k−j−1 [1 − F (w)]n−k f (u)f (v)f (w) (2753)

and u < v < w.

Proof. The result can be easily intuited by combinatorial arguments and dividing the real line into
partitions. These partitions are around the local region of parameters u, v, w and then adding all other
partitions necessary. We can assume density is constant at points in the local region of u, v, w. For
du
instance, such that the probability value inside the interval (u ± 2 ) = f (u) and so on.

Theorem 384 (Distribution of the k-th Order Statistic). Suppose that Xi ∼ F (x), i ∈ [n], and that its
density f (x) exists. For k ∈ [n], show that
n  
X n i n−i
FX(k) (x) = [F (x)] [1 − F (x)] (2754)
i
i=k
n! k−1 n−k
fX(k) (x) = [F (x)] [1 − F (x)] f (x) (2755)
(k − 1)!(n − k)!

440
Proof. See that
n
1(Xi ≤ x) ≥ k)
X
FX(k) (x) = P(X(k) ≤ x) = P( (2756)
i=1
= P(Y ∼ Bin(n, F (x)) ≥ k) (2757)
n  
X n
= [F (x)]i [1 − F (x)]n−i . (2758)
i
i=k

dFX(k) (x)
We may derive fX(k) (x) = dx by standard calculus. This is also easily derived as a special case
from the joint distribution argument in Theorem 383.

Corollary 21 (Corollaries for the k-th Order Statistic). We see from distribution of the k-th order
statistic (Theorem 384) that

FX(1) (x) = 1 − [1 − F (x)]n (2759)


n
FX(n) (x) = [F (x)] (2760)

Exercise 622 (Uniform Order Statistics). Suppose that we have Xi ∼ U (0, 1), i ∈ [n], which has density
f (x) = 1, F (x) = x for values of x ∈ [1]. It is easy to see from the distribution on order statistic (see
Theorem 384) that we have

n!
fX(k) = xk−1 (1 − x)n−k (1)(1) (2761)
(k − 1)!(n − k)!

which is the density of the Beta distribution (see Equation 2730) of Beta(k, n − k + 1). In particular,

X(k) ∼ Beta(k, n − k + 1) (2762)

k
and it follows from Beta random variable moment properties (see 381) that EX(k) = n+1 and VarX(k) =
k(n−k+1) −1
 −2

(n+1)2 (n+2) . See that EX(k) ∈ O n and VarX(k) ∈ O n , that is the variance term is comparatively
 
k i
small and we have X(k) − EX(k) ∈ o(1) and X(k) ≈ EX(k) = n+1 and the points X(i) , n+1 forms
approximately a straight line.

Exercise 623 (Derivation of the QQ-Plot). See that for X ∼ F where F continuous, we have Y =
F (X) ∼ U (0, 1) since P(Y ≤ y) = P(F (X) ≤ y) = F (F −1 (y)) = y. Then we can say that for
Xi ∼ F, i ∈ [n] and F continuous we have F (Xi ) ∼ U (0, 1) for i ∈ [n]. Then since they are monotonic,
by uniform order statistics we get that F (X)(k) = F (X(k) ) ∼ Beta(k, n −
 k +
1) which we showed to
be approximately n+1 . Then X(k) ≈ F −1 ( n+1 ) and the points X(i) , F −1 n+1 , i ∈ [n] should form
k k i

approximately straight line. If we take Φ → F then we can test if our data-set comes from the normal
distribution and is known as the QQ plot. (see 7.3.1)

See from Theorem (383) that for random variables Xi ∼ F (x) with i ∈ [n] and density f (x), we have
a joint density for (X(i) , X(j) ):

n!
fX(i) ,X(j) (u, v) = [F (u)]i−1 [F (v) − F (u)]j−i−1 ][1 − F (v)]n−j f (u)f (v)
(i − 1)!(j − i − 1)!(n − j)!

for range of values u < v. We may often be interested in deriving statistics around the joint random
variables. For instance, many statistics of interest take form W = aX(i) + bX(j) . These statistics include

1. Sample Range: R = X(n) − X(1)

441
1

2. Sample Mid-range: V = 2 X(1) + X(n)

3. Interquartile Range:

X(b3n/4c) − X(b1n/4c) (2763)


X( n+1
2 )
n odd
4. Sample Median:
1 X 
(n/2) + X(n/2+1) n even
2
Corollary 22 (Joint Density of All Order Statistics). A trivial case of the joint density for order statistics
observed in Theorem 383 is the case that for Xi , i ∈ [n] from F (x) and density f (x), their joint p.d.f of
((Xi )i∈[n] ) is given

f(Xi )i∈[n] (x1 , · · · xn ) = n!Πni f (xi ) ∀i, j we have i < j =⇒ xi < xj (2764)

Exercise 624. Let Xi ∼ U (0, 1), i ∈ [n] and then find the following:

1. Joint p.d.f on (X(1) , X(n) )

2. Joint p.d.f of (R, V ), the sample range (see Item 1) and sample mid range (see Item 2)

3. Find the marginal p.d.f of R and of V .

Proof. 1. By application of the density of joint order statistics (see Theorem 383), our solution is
written:
n!
fX(1) ,X(n) (u, v) = [u]1−1 [v − u]n−1−1 ][1 − v]n−n 1 · 1
(1 − 1)!(n − 1 − 1)!(n − n)!
= n(n − 1)[v − u]n−2 0 ≤ u ≤ v ≤ 1. (2765)

2. We can apply variable transform r = xn − x1 , v = 12 (x1 + xn ) with Jacobian −1 (verify this). Then
by change of variable method we obtain

fR,V (r, v) = fX(1) ,X(n) (x1 , xn )| − 1| = n(n − 1)(xn − x1 )n−2 = n(n − 1)rn−2 (2766)

where 0 < r < 1, 2r < v < 1 − 2r .

3. From the previous step we want to integrate out marginal variables, and we get
Z 1−r/2
fR (r) = n(n − 1)rn−2 dv = n(n − 1)rn−2 (1 − r), (2767)
v=r/2

and see that this is the density of Beta(n − 1, 2) (see Equation 2730). Next integrate out the
variable R and we have
Z r=2v
1
n(n − 1)rn−2 dr = n(2v)n−1 0<v≤ ,



 2
fV (v) = Z0 r=2(1−v) (2768)
 1
n(n − 1)rn−2 dr = n(2(1 − v))n−1 ≤ v ≤ 1.



0 2

442
6.18.1 Large Sample Theory on Order Statistics
Here we discuss the consistency (see Theorem 333) and asymptotic normality distributions for order
statistics.

Result 37 (Large Sample Distribution of Order Statistics). We state the following results.

1. P(X(k) > x) = P(nFn (x) ≤ k − 1)

2. If t < F (x) then we have


c
P(Fn (x) ≤ t) ≤ (t − F (x))−4 E(Fn (x) − F (x))4 ≤ (t − F (x))−4 (2769)
n2
and if t > F (x) we have
c
P(Fn (x) ≥ t) ≤ (t − F (x))−4 E(Fn (x) − F (x))4 ≤ (t − F (x))−4 (2770)
n2
for some universal constant c ∈ R.

3. Suppose F has density f and f (F −1 (t)) > 0 then if k = bntc, 0 < t < 1 then

 
−1
 t(1 − t)
n X(k) − F (t) → Φ 0, 2 −1 (2771)
f (F (t))

4. If ∃F −1 (t) such that it is unique and for 0 < t < 1, knn → t as n → ∞ we have X(kn ) → F −1 (t).
(it is consistent)

6.19 Methods in Robust Estimation


It is easy to see from the definition on sample median and sample mean (see Definition 327) that the
sample mean is sensitive to extreme values/outliers while the median is much more stable. This is the
robustness feature of the median statistics. These outliers can occur in both cases where our data is
contaminated or when the density function is strongly tailed. Depending on the context we might prefer
to use m̂ over X̄ as a center measure. However, since median ‘throws away’ many data points, we might
opt for some center measure that is both robust and not so wasteful with the data. In other words,
although m̂ is satisfactory in the presence of outliers, it is not efficient under normality conditions. (see
6.14). X̄ is efficient and well behaved under the normality assumption but in the presence of outliers, is
also inefficient. Let’s define another estimator, the trimmed mean.

Definition 347 (Trimmed Mean). For some α ∈ [0, 12 ] define the α-trimmed mean to be the value
X(bnαc+1) + · · · + X(n−bnαc)
X̄α = . (2772)
n − 2bnαc
By this definition we discard floor(nα) observations on each side and take the sample mean on remaining
data-set. Typically a value of α ∈ [0.10, 0.20] is selected for data which contains moderate outliers. We
can also defined trimmed mean with asymmetric cutoffs. Let this trimmed mean be defined
X(bnαc+1) + · · · + X(n−bnβc)
X̄αβ = (2773)
n − bnαc − bnβc
Corollary 23 (Mean and Median as Trimmed Means). From the definition of trimmed mean (see
Definition 423) we obtain sample mean X̄α=0 = X̄ and sample median X̄α→ 12 = m̂. Sample mean,
sample median were defined in Definitions (327).

443
The asymptotic relative efficiency of these central estimates are discussed in Section 6.14.1.
A gross error model can be used to compare different estimators in the robust estimation process.

Definition 348 (Gross Error Model). -

444
Chapter 7

Hypothesis Testing, Interval


Estimation and Other Tests

Hypothesis testing is one of the major approaches to making inferences about populations. A five-step
approach can be specified as follows in general: make assumptions satisfied (or robust to satisfaction)
by our data, compute a test statistic and calculation probability (known as p-values) and arrive at
conclusions. The first and foremost assumption about most data is that the data is derived from random
sampling. The test statistic indicates how far the point estimate for parameters in our hypothesis differs
from the null hypothesis. In most cases, this distance is measured in the units of standard errors between
the null value and point estimate. We interpret the test statistic conditional under the null hypothesis -
we consider the values we would expect to get for the test statistic according to the sampling distribution
suggested by our null hypothesis. In general, the competing hypothesis are known as the null hypothesis
(H0 ) and alternative hypothesis (H1 ). The p-value computed is the probability that the test statistic
assumes a value equal to, or more extreme than what we have observed. It follows that a small p-value
is in favour of the alternative hypothesis. The p-value is then compared to a ex-ante significance level
to determine if we conclude the data rejects the null hypothesis in favour of the alternative hypothesis.
We shall denote the sample size as n, or N , without signposting in the absence of ambiguity.

7.1 Theory of Hypothesis Tests


Let Xi∈[n] be IID random variables with density f (x, θ), θ ∈ Θ. The null hypothesis H0 competes against
the alternative hypothesis H1 . We define the critical region to be the set for which the range of values of
the random variable rejects H0 , such as {x > k}. Let’s discuss some common terminology in hypothesis
testing literature.

Definition 349 (Hypothesis Testing Terminology). Let a hypothesis test have critical region {x > c}
for some constant c. Then we define the terms Type I error, Type II error, size, significance level and
power as follows:

1. Type I error: the error of rejecting H0 when it is true. The size/significance level of the test is
P0 (X > c).

2. Type II error: the mistake of not rejecting H0 when it is false, which equals P1 (X < c). The
power is defined to be 1 − P1 (type II error), or P1 (X > c), the probability of falling inside the

445
critical region conditional on alternative hypothesis, or the probability of correctly rejecting the null
hypothesis when it is false.

3. Familywise Error Rate: the probability of making at least one type I error. That is, it is the
probability of rejecting one or more of the true, individual null hypothesis. A familywise error
rate with weak control is constructed under the assumption that all of the null hypothesis is true,
while in strong control familywise error rate, the error rate holds regardless of the number of null
hypothesis statements holding true.

The p-value is the probability under H0 that the test statistic is more extreme than the realized value - it
is not the probability that H0 is true. H0 is either true or false, and has no probability distribution. If
the p-value is smaller than the size of the test, then we reject the null hypothesis. The p-value can then
be reframed to be the smallest size α such that a test of this size is rejected.

In the ideal case, our hypothesis test has power one and size zero. However, this is not possible and
we need a tradeoff. A common approach is to pre-determine some size α, and for tests less than that
size we would choose the most powerful test. Consider the general hypothesis

Problem 4. Let Xi , i ∈ [n] be IID density f (x|θ) random variables and suppose we want to test

H0 : θ = θ0 H1 : θ = θ1

and let the critical region be R ⊂ Rn . Then we have size P0 (X ∈ R) and power P1 (X ∈ R).

Definition 350 (Likelihood Ratio of Hypothesis). Let f0 be the density of X under H0 , and f1 be the
density of X under H1 . Then the likelihood ratio of H0 to H1 from Xi , i ∈ [n] IID sample is

Πni f0 (xi )
∧(x) = . (2774)
Πni f1 (xi )

Smaller values of ∧(x) suggests that there is more evidence x has against H0 .

Exercise 625 (Normal Likelihood Ratios). Consider Xi , i ∈ [n] IID random variables from Φ(µ, σ 2 )
where σ 2 is known. Consider the case where H0 : µ = µ0 , H1 : µ = µ1 where µ0 < µ1 . With sample
n = 1 we have
2
exp{− (x−µ 0)
2σ 2 }
∧(x) = 2 . (2775)
exp{− (x−µ 1)
2σ 2 }

Then for test with critical region written

Rc = {x : ∧(x) < c} (2776)

,for non-negative c, see that as c increases both size and power increases. In fact, for any α ∈ [1], there
exists cα such that the size of our test is α.

Lemma 12 (Neyman-Pearson Lemma). Let H0 , H1 be associated with densities f0 , f1 respectively. For


some fixed α ∈ [1] consider the likelihood ratio on IID Xi , i ∈ [n] with critical region

{x : ∧(x) < cα } (2777)

, then among all tests with size ≤ α, this test has maximum power.

446
Referring to the hypothesis ratio for one sample on normal data (see Exercise (625)) and assuming
µ0 < µ1 with known σ 2 we have
P 2
i (xi −µ0 )
exp(− 2σ 2 )
∧(x) = P 2 (2778)
i (xi −µi )
exp(− )
2σ 2
2nx̄(µ1 − µ0 ) − n(µ21 − µ20 )
 
= exp − . (2779)
2σ 2

By Neyman Pearson Lemma 12 the most powerful test of size ≤ α would have the critical region of form
{x : ∧(x) < cα }, which can be shown to be equivalent to {x : x̄ > c0α }. (verify this). We may write
c0α −µ0 c0α −µ0
α = P0 (X̄ > c0α ) = Φ(Z > σ/ n
√ ) and therefore we may write σ/ n
√ = zα , or equivalently:

σ
c0α = µ0 + zα √ . (2780)
n
c0α −µ1
 
µ1 −µ
We also have the power P1 (X̄ > c0α ) = Φ(Z > σ/ n
√ ) = Φ(Z > zα − √0
σ/ n
) > Φ(Z > zα ) = α when
µ1 > µ0 . Consider again the likelihood ratio example given for the same two hypothesis, but suppose
instead this time we are interested in a composite alternative hypothesis in the form H1 : µ > µ0 . The
critical region for this test of size α is
σ
{x̄ > µ0 + zα √ }, (2781)
n

and which has power depending on the value of µ > µ0 . Since the Neyman-Pearson lemma (Lemma 12)
says that for all alternatives µ1 > µ0 , this test is most powerful. In this case our test is uniformly most
powerful, in that for any other test of size ≤ α, the power is smaller everywhere. If we want to instead
form the two-sided composite hypothesis H1 : µ 6= µ0 then let the critical region be form {|x̄ − µ0 | > c},
which for test size α yields c = z α2 √σn . By the Neyman-Pearson Lemma there exists a more powerful
test than our two sided hypothesis test - it is not uniformly most powerful.

7.2 Generalized Likelihood Ratio Tests


See that we may express the H0 , H1 as sets ω0 , ω1 ⊂ Θ. For instance we may write ω0 = {µ0 }, ω1 = {µ :
µ 6= µ0 } for the two-sided composite test on means.

Definition 351 (Generalized Likelihood Ratio). Let L(θ) be the likelihood function for data observed,
then we write
maxθ∈ω0 L(θ)
∧0 = , (2782)
maxθ∈ω1 L(θ)

where smaller values of ∧0 evidences against H0 . Let the considered space be Ω = ω0 ∪ ω1 , then we can
write
( 0
maxθ∈ω0 L(θ) ∧ if num. < den.
∧= = (2783)
maxθ∈Ω L(θ) 1 otherwise.

Therefore we have ∧ ≤ ∧0 . Under H0 if n large we have (verify this) approximately

−2 log ∧ ∼ χ2k (2784)

where k = dim Ω − dim ω0 , is the difference in unknown parameters between Ω and ω0 .

447
Exercise 626 (Normal Generalized Likelihood Ratio Test). Consider Xi , i ∈ [n] of IID Φ(µ, σ 2 ), we
construct ω0 = {µ0 }, ω1 = {µ : µ 6= µ0 }. Then Ω = R and we obtain
2
 P 
1 i (Xi − µ0 )
max L(µ) = exp − (2785)
µ∈ω0 σ n (2π)n/2 2σ 2
2
 P 
1 i (Xi − X̄)
max L(µ) = exp − (2786)
µ∈Ω σ n (2π)n/2 2σ 2

and
− µ0 )2 − i (Xi − X̄)2
P P
i (Xi
−2 log ∧ = (2787)
σ2
2
n(X̄ − µ0 )
= (2788)
σ2
 2
X̄ − µ0
= √ ∼ χ21 . (2789)
σ/ n

For size α we arrive at critical region

(x̄ − µ0 )2
   
2 σ
> χ1,α = |x̄ − µ0 | > zα/2 √ , (2790)
σ 2 /n n

with p-value computed

(x̄ − µ0 )2
 
P χ21 > . (2791)
σ 2 /n

Exercise 627 (Multinomial Generalized Likelihood Ratio Tests). Consider the multinomial likelihoods (see Exercise 605) generalized to the ratio test. Suppose ω0 = {p(θ), θ ∈ Θ}, ω1 = Ω\ω0 , where ω0 ∪ ω1 = Ω is the set of all probability vectors. Let θ̂ be the MLE of θ under ω0 , giving probability estimates p(θ̂). Recall that under Ω our MLE for p_i is Xi /n. We may write

max_{p∈ω0} L(p) = (n choose X1 ··· Xr ) Π_{i=1}^r p_i (θ̂)^{Xi}    (2792)
max_{p∈Ω} L(p) = (n choose X1 ··· Xr ) Π_{i=1}^r (Xi /n)^{Xi}    (2793)
∧ = Π_{i=1}^r (n p_i (θ̂)/Xi )^{Xi} = Π_{i=1}^r (Ei /Xi )^{Xi}    (2794)

where Ei is the expected number of occurrences of the i-th event under the null hypothesis. Then writing

−2 log ∧ = 2 Σ_{i=1}^r Xi log(Xi /Ei )    (2795)

which for large n is approximately (verify this)

−2 log ∧ ≈ Σ_{i=1}^r (Xi − Ei )²/Ei ,    (2796)

which is known as Pearson's chi-squared test statistic. Large values of Pearson's chi-squared test statistic suggest that the observed X is quite different from the predicted Ei , and are evidence against the null hypothesis H0 . This is useful for checking the goodness of fit of models in categorical data analysis.
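A minimal sketch of Equation (2796) in code follows; the observed counts and null probabilities below are illustrative assumptions, not from the text, and the degrees of freedom should be reduced further by the number of parameters estimated under ω0 .

import numpy as np
from scipy import stats

observed = np.array([18, 55, 27])        # hypothetical counts X_i, summing to n
p_null = np.array([0.25, 0.50, 0.25])    # hypothetical p_i(theta_hat) under H0
expected = observed.sum() * p_null       # E_i = n * p_i(theta_hat)

chi2 = np.sum((observed - expected) ** 2 / expected)   # Pearson statistic
pval = stats.chi2.sf(chi2, df=len(observed) - 1)
# equivalent library call: stats.chisquare(observed, f_exp=expected)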

Exercise 628 (Poisson Dispersion Test). For i ∈ [n], Xi ∼ Poisson(λi ) (see the Poisson distribution, Section 6.17.4) and Xi ⊥ Xj for i ≠ j. Let the hypothesis test be H0 : λi = λ for all i against H1 : ∃ λi ≠ λj . We have the sets ω0 = {(λ_{i∈[n]}) : λi = λj ∀ i, j}, ω1 = {(λ_{i∈[n]}) : λi ≠ λj for some i, j}. In ω0 the MLE is λ̂i = X̄, while in Ω we have λ̂i = Xi . Then we can write the likelihood ratios

max_{θ∈ω0} L(θ) = Π_{i=1}^n exp(−X̄) X̄^{Xi} / Xi !    (2797)
max_{θ∈Ω} L(θ) = Π_{i=1}^n exp(−Xi ) Xi^{Xi} / Xi !    (2798)
∧ = Π_{i=1}^n (X̄/Xi )^{Xi} exp(Xi − X̄)    (2799)
−2 log ∧ = −2 Σ_{i=1}^n [Xi log(X̄/Xi ) + Xi − X̄]    (2800)
         = 2 Σ_{i=1}^n Xi log(Xi /X̄) ∼ χ²_{n−1} for n large.    (2801)

The test statistic −2 log ∧ ≈ Σ_{i=1}^n (Xi − X̄)²/X̄ (verify this), which under the null hypothesis has numerator with expectation close to (n − 1)λ, and denominator close to λ.
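The following is a hedged sketch of the Poisson dispersion statistic of Equation (2801); the count data are hypothetical and assumed strictly positive so the logarithm is well defined (Xi = 0 terms contribute zero).

import numpy as np
from scipy import stats

x = np.array([3, 5, 2, 8, 4, 6, 1, 7])    # hypothetical counts X_i
n, xbar = len(x), x.mean()

T = 2.0 * np.sum(x * np.log(x / xbar))        # -2 log Lambda, exact form
T_approx = np.sum((x - xbar) ** 2) / xbar     # quadratic approximation
pval = stats.chi2.sf(T, df=n - 1)             # chi-squared with n - 1 degrees of freedom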

7.3 Tests on Normality


7.3.1 QQ-Plots
The theory behind the derivation of QQ-plots is driven by order statistics and is written up in Exercise 623. Although not a hypothesis test, quantile plots provide a visual way of understanding how the distribution of one sample compares against another. We may plot the standardized sample quantiles against the theoretical Φ(0, 1) distribution and see whether their quantiles match (a straight line) to assess whether the sample is normally distributed. We are able to observe the behavior of the tails relative to normal data. Plotting the expected quantiles on the y-axis and the observed quantiles on the x-axis, our interpretations are as follows (a short plotting sketch follows the list below):

1. Right tail below the straight line: longer right tail than normal.

2. Left tail below the straight line: shorter left tail than normal.
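A minimal plotting sketch follows; the data are simulated heavy-tailed values for illustration. Note that scipy's probplot places the theoretical quantiles on the x-axis and the ordered sample on the y-axis, the opposite of the axis convention above, so the tail interpretations are mirrored accordingly.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

sample = np.random.default_rng(0).standard_t(df=3, size=200)   # heavy-tailed example
z = (sample - sample.mean()) / sample.std(ddof=1)               # standardized sample

stats.probplot(z, dist="norm", plot=plt)   # sample quantiles vs Phi(0, 1) quantiles
plt.show()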

7.4 Tests on Mean


7.4.1 One-Sample T-Test
Assumptions
The assumptions required for the t-test on population mean µ are as follows:

1. Quantitative variable is being measured.

2. Data is obtained from random sampling.

3. Population distribution is approximately normal (see normality tests in Section 7.3). For non-
normal data, test is robust to normality assumptions when n is large and without outliers.

Hypothesis
We can describe the hypothesis in general as such

H0 : µ = µ0   H1 : µ ≠ µ0  or  µ < µ0  or  µ > µ0

Test Statistic and p-values


From our point estimate X̄ we can compute the test statistic¹

T = (X̄ − µ0 ) / (s/√n) ∼ t_{n−1} ,

where SE(X̄) = s/√n and s² = (1/(n − 1)) Σ_i (xi − x̄)². The probability value (the relevant tail probability of t_{n−1} , in the direction of the alternative) may be computed as follows:

from scipy import stats

def one_sample_t_test(sample, mu0, side="greater"):
    # side is one of "two-sided", "greater", "less" as accepted by scipy
    res = stats.ttest_1samp(sample, mu0, alternative=side)
    return res.pvalue, res.statistic

Duality of Two-Sided One-Sample T-Test and Confidence Intervals


We see that confidence intervals and two-sided tests agree in their conclusions. To see this, we can write confidence intervals as

X̄ ± t_{n−1,1−α/2} · SE(X̄),    (2802)

which corresponds to the range of values µ0 that do not result in a rejection of the null hypothesis under an α-significance two-sided hypothesis test. This is because the confidence interval uses the same value of the standard error as the test statistic - a condition that does not hold for tests of proportions (means of indicator functions), since SE(p̂) = √(p̂(1 − p̂)/n).

7.4.2 One-Sample Sign Test


See the definition of the median (Definition 326). Here we assume the setting in which Xi ∼ F, i ∈ [n], where F is an unknown, arbitrary continuous cumulative density. We want to test H0 : m = m0 against H1 : m > m0 at level α. Then we can write the test statistic

S = Σ_{i=1}^n 1{Xi > m0 } ∼ Bin(n, p = 1/2)    (2803)

by definition. Then the critical region shall be the values S ≥ C where P(S ≥ C) = α, and the p-value P(S ≥ s | H0 ) is compared against α. Here the assumption that F is continuous gives p = 1/2, and since P(X = x) = 0 for all x, then P(X1 = X2 ) = ∫ P(X1 = X2 | X2 = x) dF(x) = ∫ P(X1 = x) dF(x) = 0 and all our observations are assumed distinct. On the other hand, if F had jumps then we would obtain P(X = x0 ) > 0 at a jump x0 ; let the aggregate of these points be indexed by ai . We write

P(X1 = X2 ) = Σ_{ai} P(X1 = X2 = ai ) = Σ_{ai} P(X1 = ai )² > 0    (2804)

and we may have repeat observations. In practice, if it turns out that we violate this assumption and some observations equal m0 , then we discard those points and perform the test as usual on the remainder.
¹ Note that when the data are normally distributed, X̄ ∼ Φ(µ, σ²/n) exactly (and approximately so, by the CLT, when n is large). The t-distribution is required because we approximate σ with s, the sample standard deviation.

7.4.2.1 Normal Approximation to Sign Test and Continuity Corrections

For large n we may employ normal approximations: since S ∼ Bin(n, 1/2) we have ES = n/2, Var S = n/4, and by the Central Limit Theorem (see Theorem 362) applied to the summation of Bernoullis (see Section 6.17.1) we obtain S approximately ∼ Φ(n/2, n/4). Since we are working with a continuous approximation to the discrete underlying distribution, we apply a continuity correction so that the p-values are computed

P(S ≤ s) ≈ Φ((s − n/2 + 1/2) / √(n/4)),    (2805)
P(S ≥ s) ≈ 1 − Φ((s − n/2 − 1/2) / √(n/4)),    (2806)
P(|S − n/2| ≥ |s − n/2|) ≈ 2[1 − Φ((|s − n/2| − 1/2) / √(n/4))].    (2807)

We may implement such a test as follows:


import numpy as np
from scipy import stats
from scipy.stats import norm

def one_sample_sign_test(sample, m0, side="greater", norm_approx=False):
    # S counts observations above m0; under H0, S ~ Bin(n, 1/2) (ties at m0 discarded)
    sample = np.asarray(sample)
    sample = sample[sample != m0]
    n = len(sample)
    S = int(np.sum(sample > m0))
    if norm_approx:  # normal approximation with continuity correction
        if side == "greater":   # p-value = P(S >= s), Equation (2806)
            Z = (S - n * 0.5 - 0.5) / np.sqrt(n / 4)
            return 1 - norm.cdf(Z)
        else:                   # side == "less": p-value = P(S <= s), Equation (2805)
            Z = (S - n * 0.5 + 0.5) / np.sqrt(n / 4)
            return norm.cdf(Z)
    # exact binomial test (older SciPy versions expose this as stats.binom_test)
    return stats.binomtest(S, n=n, p=0.5, alternative=side).pvalue

7.4.3 Interval Estimation of the Median


Recall the definition of the population median m as in Definition 326. Then let the sample median take the value m̂ = F̂n⁻¹(1/2). Alternatively we may let

m̂ = (X_(n/2) + X_(n/2+1))/2 for n even, and X_((n+1)/2) for n odd,    (2808)

and we may define the (1 − α) confidence interval for m:

Theorem 385 (Median Interval Theorem). If k(α) is an integer s.t. P(S ≤ k(α) − 1) = α/2, where S is the test statistic as in the sign test (see 2803) with binomial distribution, then the confidence interval estimate is

[X_(k(α)) , X_(n−k(α)+1) ].    (2809)

Proof. We obtain

P(m ∈ [X_(k(α)) , X_(n−k(α)+1) ]) = 1 − P(X_(k(α)) > m) − P(X_(n−k(α)+1) < m)    (2810)
= 1 − P(S ≥ n − (k(α) − 1)) − P(S ≤ k(α) − 1)    (2811)
= 1 − 2P(S ≤ k(α) − 1)    (2812)
= 1 − α.    (2813)
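Below is a minimal sketch of the interval of Theorem 385, assuming n is large enough for the interval to be proper; since the binomial is discrete we take the largest k with P(S ≤ k − 1) ≤ α/2, so coverage is at least 1 − α.

import numpy as np
from scipy import stats

def sign_test_median_ci(sample, alpha=0.05):
    x = np.sort(np.asarray(sample))
    n = len(x)
    # largest k such that P(S <= k - 1) <= alpha / 2 under Bin(n, 1/2)
    k = int(stats.binom.ppf(alpha / 2, n, 0.5))
    if stats.binom.cdf(k, n, 0.5) <= alpha / 2:
        k += 1
    k = max(k, 1)
    return x[k - 1], x[n - k]    # (X_(k), X_(n-k+1)) in 1-based indexing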

7.4.4 One-Sample (Wilcoxon) Signed Rank Test


In the sign test (see Section 7.4.2) we ignored their order statistic and only took into consideration their
signs - here we use their ranks. Here we suppose Xi ∼ F, i ∈ [n] with unknown but continuous cumulative
density function F . We want to test H0 : m = m0 against H1 : m > m0 at significance α, where m is
population median as in Definition 326. Define the signed rank to be:
Definition 352 (Signed Rank).

Ri = sign(Xi − m0 ) · rank(|Xi − m0 |) (2814)


A reasonable Wilcoxon signed rank test statistic takes Σ Ri . We can use the monotonic transformation

W = (1/2) Σ Ri + n(n + 1)/4    (2815)

which rejects H0 iff W ≥ C for some critical value C determined by P(W ≥ C) = α. The p-value computation is written P(W ≥ w | H0 ), and we shall analyze the distribution of W under both exact and approximately normal conditions. See also that we may rewrite the Wilcoxon Signed Rank test statistic as follows:

Exercise 629. Let Ti , i ∈ [s] with Ti < Ti+1 be the ordered positive Ri , and s the number of positive Xi − m0 . Show that we have

W = (1/2) Σ_{i=1}^n Ri + n(n + 1)/4 = Σ_{i=1}^s Ti = Σ_{i=1}^n |Ri | 1{Xi − m0 > 0}.    (2816)

Proof. Note that we can write

Σ_{i=1}^n Ri − 2 Σ_{i=1}^s Ti = (Σ_{i=1}^s Ti + Σ_i Ri 1{Xi − m0 < 0}) − 2 Σ_{i=1}^s Ti    (2817)
= Σ_i Ri 1{Xi − m0 < 0} − Σ_{i=1}^s Ti    (2818)
= −(1 + 2 + ··· + n)    (2819)
= −n(n + 1)/2    (2820)

and hence we can easily write

Σ_{i=1}^s Ti = (1/2) Σ_{i=1}^n Ri + n(n + 1)/4 = W    (2821)

and we can just compute W as the sum of all positive signed ranks. Under the continuity assumption on F we have P(|X1 − m0 | = |X2 − m0 |) = 0 and there are no ties. If there are jumps then there is non-zero probability of ties in absolute rank - we apply the average absolute ranks. A jump at m0 is discarded, and since W = Σ_{i=1}^s Ti it must satisfy 0 ≤ W ≤ n(n + 1)/2, and we observe symmetry around n(n + 1)/4 under the null hypothesis.

7.4.4.1 Distribution of the Wilcoxon Signed Rank Test Statistic

Here we state some useful results without proof that shall be used in the derivation of the exact distributions. Their proofs are intuitive and follow from combinatorial arguments and each result's preceding sub-results.

Result 38. For continuous and symmetric F , writing 1_i = 1{Xi − m0 > 0}, we have

1. (|R1 |, |R2 |, ···, |Rn |) ⊥ (1_1 , ···, 1_n )

2. P(|R1 | = k1 , ···, |Rn | = kn ) = 1/n! where (k_{i∈[n]}) ∈ perm(1, ···, n)

3. 1_i ∼ Bernoulli(1/2), IID

4. P(R1 = r1 , ···, Rn = rn ) = 1/(2ⁿ n!) where (r_{i∈[n]}) ∈ perm(±1, ···, ±n)

5. P(T1 = t1 , ···, Ts = ts , Σ_{j=1}^n 1_j = s) = 1/2ⁿ where the ti 's are any set of values satisfying ti < ti+1 , for Ti , i ∈ [s].

Consider the exact distribution of W for n = 3; each of the 2³ = 8 sign patterns has probability 1/2ⁿ = 1/8:

  R1  R2  R3 | W = Σ Ti | P(W)
   −   −   − |    0     |  1/8
   −   −   + |    3     |  1/8
   −   +   − |    2     |  1/8
   −   +   + |    5     |  1/8
   +   −   − |    1     |  1/8
   +   −   + |    4     |  1/8
   +   +   − |    3     |  1/8
   +   +   + |    6     |  1/8

which gives P(W = 3) = 1/4 and P(W = i) = 1/8 for i ∈ {0} ∪ [6]\{3}.
We may also work by approximation as n gets large. Define the permutation (D1 , ···, Dn ) by Dj = k iff |Zk | = |Z|_(j) , where Zi = Xi − m0 . For instance, for values X = {1, −2, −3, −1.5} (with m0 = 0) and absolute ranks |Rj | = {1, 3, 4, 2}, our definition gives D = {1, 4, 2, 3}. We can think of D[i] = indexof(|Z|_(i) ) when applied to the vector of absolute deviations.

Theorem 386 (Normal Approximation to the Wilcoxon Signed Rank Test Statistic). If Xi ∼ F, i ∈ [n] for continuous F , then as n → ∞ we have

W approximately ∼ Φ(EW, Var(W))    (2822)

where EW = n(n + 1)/4 and Var(W) = n(n + 1)(2n + 1)/24. We omit the proof of asymptotic normality. Writing

W = Σ_{j=1}^s Tj = Σ_{i=1}^n |Ri | 1(Xi − m0 > 0) = Σ_{i=1}^n |Ri | 1_i = Σ_{k=1}^n k 1_{D_k} ,    (2823)

then EW = Σ_{k=1}^n k E 1_{D_k} = Σ_{k=1}^n k (1/2) = n(n + 1)/4. We also have

Var(W) = Σ_{k=1}^n k² Var(1_{D_k}) = Σ_{k=1}^n k² (1/4) = n(n + 1)(2n + 1)/24    (2824)

Here we give the p-value computations of the test statistic W with continuity corrections. The
equivalent library call is also referenced in comment.
import numpy as np
from scipy import stats
from scipy.stats import norm

def one_sample_wilcoxon_signed_rank(sample, m0, side="greater"):
    # W = sum of positive signed ranks; normal approximation with continuity correction
    sample = np.asarray(sample)
    n = len(sample)
    ranks = stats.rankdata(np.abs(sample - m0))      # average ranks under ties
    signed_ranks = np.sign(sample - m0) * ranks
    w = np.sum(signed_ranks[signed_ranks > 0])
    ew = n * (n + 1) / 4
    varw = n * (n + 1) * (2 * n + 1) / 24
    z = (w - ew - 0.5) / np.sqrt(varw) if side == "greater" \
        else (w - ew + 0.5) / np.sqrt(varw)
    p = 1 - norm.cdf(z) if side == "greater" else norm.cdf(z)
    # equivalent library call: stats.wilcoxon(sample - m0, alternative=side)
    return p

7.4.4.2 Relating the Wilcoxon Signed Rank Test to the T-Test

Let Xi ∼ Φ(µ, σ²), i ∈ [n] and suppose we want to test

H0 : µ = µ0   H1 : µ < µ0    (2825)
H0 : m = m0   H1 : m < m0    (2826)

which are equivalent hypotheses, since under the normal prior on Xi the median and the mean agree. Then recall from Section (7.4.1) that the test statistic under the t-test is

T = √n (X̄ − µ0 )/S ∼ t_{n−1}    (2827)

which has critical region T ≤ C for some constant C. We may rewrite (verify this)

T = √((n − 1)/n) · Σ(Xi − m0 ) / √(Σ(Xi − m0 )² − (1/n)(Σ(Xi − m0 ))²)    (2828)

and if we replace Xi − m0 by their signed ranks Ri we obtain

T_R = √((n − 1)/n) · Σ Ri / √(Σ Ri² − (1/n)(Σ Ri )²) = √((n − 1)/n) · Σ Ri / √(Σ i² − (1/n)(Σ Ri )²)    (2829)

which is a strictly increasing function of Σ Ri , a monotonic transformation of the Wilcoxon Signed Rank test statistic W. This gives us rejection region W ≤ C, for some C ∈ R.

7.4.5 Hodges-Lehmann Estimates for the Median and Tukey's Method for Confidence Intervals

Definition 353 (Walsh Averages). Define the set

{(Xi + Xj )/2, 1 ≤ i ≤ j ≤ n}    (2830)

to be the Walsh Averages, of n(n + 1)/2 elements.

Definition 354 (Hodges-Lehmann Estimate of Median). We may obtain an estimate of the population median m (see 326) by taking the median of the Walsh Averages

m̂_HL = median{(Xi + Xj )/2, 1 ≤ i ≤ j ≤ n}.    (2831)

Theorem 387. Defining W to be the total number of Walsh averages (Definition 353) greater than m, we can write

W = Σ_{i=1}^s Ti = Σ_{1≤i≤j≤n} 1{(Xi + Xj )/2 > m},    (2832)

which is the sum of all positive ranks.


Proof. Without loss of generality, assume m = 0. Then we can write the value of Σ_{1≤i≤j≤n} 1{(Xi + Xj )/2 > m} as

1{X_(1) > 0}    (2833)
+ 1{X_(2) + X_(1) > 0} + 1{X_(2) > 0}    (2834)
+ ···    (2835)
+ 1{X_(n) + X_(1) > 0} + 1{X_(n) + X_(2) > 0} + ··· + 1{X_(n) > 0}    (2836)

and let the summation of these rows be W1 + W2 + ··· + Wn . We may show by enumeration that

Wi = R_(i) 1{X_(i) > 0}    (2837)

and therefore that

Σ_i Wi = Σ R_(i) 1{X_(i) > 0} = Σ_{i=1}^s Ti = W    (2838)

and we are done.


Theorem 388 (Tukey's Confidence Interval for Medians). Let M = n(n + 1)/2 and let A_(i) , i ∈ [M] be the order statistics of the Walsh averages. If k(α) is an integer satisfying P(W ≤ k(α) − 1) = α/2, then we obtain a confidence interval for the population median m (see Definition 326) of

[A_(k(α)) , A_(M−k(α)+1) ].    (2839)

Proof. We may write

P(m ∈ [A_(k(α)) , A_(M−k(α)+1) ]) = 1 − P(A_(k(α)) > m) − P(A_(M−k(α)+1) < m)    (2840)
= 1 − P(W ≥ n(n + 1)/2 − (k(α) − 1)) − P(W ≤ k(α) − 1)    (2841)
= 1 − 2P(W ≤ k(α) − 1)    (2842)
= 1 − α.    (2843)
7.4.6 Paired-Sample T-Test
When we have paired data (Xi , Yi ) IID ∼ Φ((µx , µy ), Σ_{2,2}), i ∈ [n], we may test for the difference µx − µy by taking differences and applying the one-sample t-test (Section 7.4.1) to the differences. In particular, assume we have data samples (Xi , Yi ) IID from some distribution F. We are interested in the differences Di = Yi − Xi , which are useful in the analysis of treatments, et cetera. Our hypothesis shall be H0 : µ_D = µ and we may employ both parametric and non-parametric methods.

In the parametric version we may compute the t-statistic

t = (D̄ − µ) / (s_D /√n)    (2844)

with the notation as usual for sample means and variances. Here the D values are assumed normal (or n is large); the test has critical region |t| > t_{n−1,α/2} and confidence interval

D̄ ± t_{n−1,α/2} s_D /√n,    (2845)

and the test is qualitatively no different from a one-sample t-test.

7.4.7 Paired Sample Signed Rank Test


Consider the same setting as in Section 7.4.6 for the paired-sample t-test. Here we can apply the Wilcoxon Signed Rank Test to the differences Di , which under H0 should be symmetric about zero. Let the sample of differences be denoted {Di , i ∈ [n]} and assume no ties and no zeros. Define their order statistics via the rank function. Then define W+ to be the sum of ranks among all positive Di and W− to be the sum of ranks among all negative Di . We expect W+ to be neither too big nor too small under the null assumptions. Since W+ + W− = n(n + 1)/2 is a fixed constant, we can just consider W = min(W+ , W− ). In the presence of zeros we can discard them, and average out ranks between ties. The p-values can be computed analytically using combinatorial arguments, or by consulting statistical tables. The test is robust to outliers and has asymptotic efficiency (see Definition 336) of ≈ 0.955 relative to the t-test when the normality assumptions are fulfilled. Like the paired-sample t-test analogue of the one-sample t-test, the paired sample signed rank test is qualitatively no different from the one-sample signed-rank test (see Section 7.4.4).

7.4.8 Two Sample T-Tests


Problem 5 (Two Sample Comparison of Means). Let the setup be that we have Xi IID ∼ Φ(µX , σ²), Yj IID ∼ Φ(µY , σ²) for i ∈ [n], j ∈ [m], or some other general density with n, m large. We are interested in the test

H0 : µX = µY   H1 : µX ≠ µY    (2846)

and we can use the test statistic X̄ − Ȳ, which is normal since they are independent normals.

We have E(X̄ − Ȳ) = µX − µY (zero when the null is assumed), with Var(X̄ − Ȳ) = σ²(1/n + 1/m). Under known variance we would get the test statistic

Z = (X̄ − Ȳ) / (σ √(1/n + 1/m)),    (2847)

and in the generalization where H0 : µX − µY = d we would use Z = (X̄ − Ȳ − (µX − µY )) / (σ √(1/n + 1/m)). In many practical cases we would not have the value of σ², and we can use the pooled sample variance defined:

Definition 355 (Pooled Sample Variance). We define the pooled sample variance on two samples of sizes n, m respectively to be

s²_p = [(n − 1)s²_X + (m − 1)s²_Y ] / (m + n − 2)    (2848)

where s²_X , s²_Y are the individual sample variances (see Definition 329). By the normal sampling properties (see Theorem 343) and linearity of expectations we can easily see that it is an unbiased estimator of σ², and its distribution may be described via the relation

((m + n − 2)/σ²) s²_p = ((n − 1)/σ²) s²_X + ((m − 1)/σ²) s²_Y ,    (2849)

where the two terms on the right are χ²_{n−1} and χ²_{m−1} variables respectively, suggesting that the LHS is a χ²_{m+n−2} random variable.

Then the t-statistic under unknown variance can be written

t = (X̄ − Ȳ − (µX − µY )) / (s_p √(1/n + 1/m)) ∼ t_{m+n−2} ,    (2850)

and the confidence intervals for (µX − µY ) can be constructed accordingly. In particular, for the two cases of known σ and unknown σ we have respectively

(X̄ − Ȳ) ± z_{α/2} σ √(1/n + 1/m),    (2851)
(X̄ − Ȳ) ± t_{m+n−2,α/2} s_p √(1/n + 1/m).    (2852)

Our two-sided test has critical region |t| > t_{n+m−2,α/2} . In the case of a one-sided alternative, the appropriate critical region would be t > t_{n+m−2,α} .
Consider now a similar setup to Problem 5, except now we have unequal variances. Then we must have Var(X̄ − Ȳ) = σ²_X /n + σ²_Y /m, and under known variances a test statistic

Z = (X̄ − Ȳ − (µX − µY )) / √(σ²_X /n + σ²_Y /m),    (2853)

a standard normal random variable. Of course, yet again, in most practical scenarios we would need to estimate the variances via the sample variances s²_X , s²_Y , but we perform no pooling. The t-statistic computed would be the value

t = (X̄ − Ȳ − (µX − µY )) / √(s²_X /n + s²_Y /m)    (2854)

which turns out to be approximately t-distributed with degrees of freedom

df = (a + b)² / (a²/(n − 1) + b²/(m − 1)),  a = s²_X /n,  b = s²_Y /m.    (2855)

A good rule of thumb for whether we should pool or estimate the variances separately is something like 1/2 < s_X /s_Y < 2. Note that if n, m are large then this test is robust to the normality assumption, since by the CLT (see Theorem 362) X̄ − Ȳ is approximately normal.
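A minimal sketch comparing the pooled and Welch variants (Equations 2850 and 2854-2855) follows; the two samples below are simulated and purely illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=40)
y = rng.normal(0.3, 1.5, size=55)

t_pooled, p_pooled = stats.ttest_ind(x, y, equal_var=True)    # pooled s_p, df = n + m - 2
t_welch, p_welch = stats.ttest_ind(x, y, equal_var=False)     # Welch df of Equation (2855)
ratio = x.std(ddof=1) / y.std(ddof=1)                          # rule of thumb: 1/2 < ratio < 2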

7.4.9 Two-Sample Mann Whitney (Wilcoxon Rank Sum) Test
Here we discuss the non-parametric version of the two-sample t-tests, which required approximate normality; here we make weaker assumptions about the data. In particular, let Xi , i ∈ [n] be IID ∼ F, and Yi , i ∈ [m] be IID ∼ G, where F, G are cumulative density functions. Then we want to test H0 : F = G, that they are generated from the same distribution. Let the alternative hypothesis be H1 : F(t) ≥ G(t) ∧ ∃ t s.t. F(t) ≠ G(t). We often simplify and write µX = µY for the null hypothesis, although they are not equivalent statements. We see that the alternative hypothesis suggests a stochastic comparison:

Definition 356 (Stochastically Larger Random Variables). The statements F(t) ≥ G(t) ↔ P(Xi ≤ t) ≥ P(Yj ≤ t) ↔ P(Xi > t) ≤ P(Yj > t) are equivalent, which is to say that the Yj 's are stochastically larger than the Xi 's.

Let (Z1 , ···, Z_{m+n}) be the pooled sample of X, Y and assume that their values are distinct. Let their order statistics be defined via the rank function rank(Z) = i when Z = Z_(i) . Then define the rank-sum scores under the pooled sample to be

T_X = Σ_{i=1}^n Rank(Xi ),  T_Y = Σ_{i=1}^m Rank(Yi )    (2856)

and since if their densities are the same we expect the ranks to be uniformly scattered between the samples, we shall expect under H0 that the rank sums are neither too large nor too small. Note that T_X + T_Y = (m + n)(m + n + 1)/2 is a fixed constant. See also that T_X , T_Y are bounded

n(n + 1)/2 ≤ T_X ≤ nm + n(n + 1)/2    (2857)
m(m + 1)/2 ≤ T_Y ≤ nm + m(m + 1)/2    (2858)

since T_X is between [1 + 2 + ··· + n, (m + 1) + ··· + (m + n)] and similarly in methodology for T_Y .
Take n1 = min(m, n) and compute sum of ranks from that sample and let this be R. Let R0 =
n1 (m + n + 1) − R and let our Mann Whitney test statistic be computed

R∗ = min(R, R0 ) (2859)

which rejects the null hypothesis if its value is too small. This can be compared against the Mann-Whitney statistical tables. If the values in the order statistics are not distinct, then we may assign average ranks to each data sample, and their effect on the overall significance level would be small. Since we are working with rank sums rather than the data itself, the Mann-Whitney test is robust to outliers, while the parametric t-tests (see Problem 5) were not. This is because a sample variance estimate enters the computation of the t-test statistic, and the presence of outliers inflates this value. The Mann-Whitney test is more powerful than the t-test in the presence of outliers, while it is only slightly less efficient, with asymptotic efficiency (see Definition 336) of 3/π ≈ 0.955 - a requirement of only about 5% more samples. To analyze the exact distribution and asymptotic normality, it is more convenient to work with adjusted figures - consider the following
U_X = T_X − n(n + 1)/2 = Σ_{i=1}^n Rank(Xi ) − n(n + 1)/2,    (2860)
U_Y = T_Y − m(m + 1)/2 = Σ_{i=1}^m Rank(Yi ) − m(m + 1)/2    (2861)

which are bounded 0 ≤ U_X ≤ nm, 0 ≤ U_Y ≤ nm. Additionally their sum evaluates to

U_X + U_Y = T_X + T_Y − n(n + 1)/2 − m(m + 1)/2    (2862)
= (n + m)(n + m + 1)/2 − n(n + 1)/2 − m(m + 1)/2    (2863)
= nm.    (2864)

Then we can define critical regions U_X ≤ D1 or U_Y ≥ D2 .
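The rank-sum computation of Equations (2856) and (2860) and the equivalent library call are sketched below; the two samples are simulated and purely illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=12)
y = rng.normal(0.8, 1.0, size=15)

pooled = np.concatenate([x, y])
ranks = stats.rankdata(pooled)                    # average ranks if ties occur
t_x, t_y = ranks[:len(x)].sum(), ranks[len(x):].sum()
u_x = t_x - len(x) * (len(x) + 1) / 2             # centered statistic U_X of Eq. (2860)
u_y = t_y - len(y) * (len(y) + 1) / 2             # centered statistic U_Y of Eq. (2861)

res = stats.mannwhitneyu(x, y, alternative="two-sided")   # U statistic and p-value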

7.4.10 Distribution of the Wilcoxon Rank Sum Test Statistic


Theorem 389. Under H0 , when F(t) = G(t) for all t, the rank configurations are uniform: P(Rank(Xi ) = ri , Rank(Yj ) = r′j for i ∈ [n], j ∈ [m]) = 1/(n + m)! for every permutation of the pooled ranks. Letting S1 ≤ ··· ≤ Sm be the ordered ranks of the samples from Y, we may write

T_Y = Σ_{i=1}^m Si    (2865)

and under the null hypothesis we obtain

P(S1 = s1 , ···, Sm = sm ) = 1 / (n + m choose m).    (2866)

Using the stated theorem we can obtain exact distributions of U_Y . For instance, consider samples with n = 3, m = 2, observations Xi , i ∈ [3], Yj , j ∈ [2], and n + m = 5. The two Y observations can occupy any of the (5 choose 2) = 10 pairs of pooled ranks, each with probability 1/10:

  Y ranks | T_Y | U_Y
  {1, 2}  |  3  |  0
  {1, 3}  |  4  |  1
  {1, 4}  |  5  |  2
  {1, 5}  |  6  |  3
  {2, 3}  |  5  |  2
  {2, 4}  |  6  |  3
  {2, 5}  |  7  |  4
  {3, 4}  |  7  |  4
  {3, 5}  |  8  |  5
  {4, 5}  |  9  |  6

In our case we have

P(U_Y = i) = 1/10 for i = 0, 1, 5, 6, and P(U_Y = i) = 1/5 for i = 2, 3, 4.    (2867)

In all cases the distribution of U_Y is symmetric about nm/2, and we have

P(U_Y ≤ u) = P(U_Y ≥ nm − u).    (2868)

Lemma 13. Show that

U_Y = T_Y − m(m + 1)/2 = Σ_{i=1}^n Σ_{j=1}^m 1{Xi < Yj }.    (2869)

Proof. We can write the RHS in terms of order statistics and, switching the order of summation,

Σ_{j=1}^m Σ_{i=1}^n 1{Xi < Y_(j) } = Σ_{j=1}^m (Sj − j)    (2870)
= Σ_{j=1}^m Sj − m(m + 1)/2    (2871)
= T_Y − m(m + 1)/2    (2872)
= U_Y .    (2873)

Lemma 14. Let p = P(X1 < Y1 ), q1 = P(X1 < min(Y1 , Y2 )), q2 = P(Y1 > max(X1 , X2 )); then show that EU_Y = nmp and that Var(U_Y ) = nm[p(1 − p) + (m − 1)(q1 − p²) + (n − 1)(q2 − p²)].

Proof. See that

EU_Y = Σ_{i=1}^n Σ_{j=1}^m E 1{Xi < Yj } = Σ_{i=1}^n Σ_{j=1}^m P(Xi < Yj ) = nmp.    (2874)

Write

Var(U_Y ) = Var(Σ_i Σ_j 1_{ij}) = Σ_i Σ_j Σ_k Σ_l Cov(1_{ij}, 1_{kl})    (2875)-(2876)

and use the results (for j ≠ l)

Cov(1_{ij}, 1_{il}) = E(1_{ij} 1_{il}) − E(1_{ij})E(1_{il})    (2877)
= E(1_{Xi<Yj, Xi<Yl}) − p²    (2878)
= E(1_{Xi<min(Yj,Yl)}) − p²    (2879)
= q1 − p².    (2880)

Similarly, for i ≠ k,

Cov(1_{ij}, 1_{kj}) = q2 − p²,    (2881)

while Cov(1_{ij}, 1_{kl}) = 0 when i ≠ k and j ≠ l, and Var(1_{ij}) = p(1 − p); counting the nm, nm(m − 1) and nm(n − 1) terms of each type gives the stated variance.

Lemma 15. Show that when F = G, for p = P(X1 < Y1 ), q1 = P(X1 < min(Y1 , Y2 )), q2 = P(Y1 > max(X1 , X2 )), where X ∼ F and Y ∼ G, we have p = 1/2, q1 = 1/3, q2 = 1/3.

Proof. Note that if F, G are continuous then P(X1 < Y1 ) = P(X1 > Y1 ) = 1/2 = p. In fact the same symmetry argument can be used to show q1 = q2 = 1/3. Alternatively, consider that

q1 = P(X1 < min(Y1 , Y2 ))    (2882)
= ∫_{−∞}^{∞} P(X1 < min(Y1 , Y2 ) | X1 = x) dF(x)    (2883)
= ∫_{−∞}^{∞} P(Y1 > x) P(Y2 > x) dF(x)    (2884)
= ∫_{−∞}^{∞} [1 − F(x)]² dF(x)    (2885)
= ∫_0^1 y² dy    (2886)
= [y³/3]_0^1    (2887)
= 1/3    (2888)

where we used the variable substitution y = 1 − F(x).

By the previous lemmas, under H0 : F = G we have

(U_Y − EU_Y ) / √(Var U_Y ) approximately ∼ Φ(0, 1)    (2889)

with EU_Y = nm/2 and Var U_Y = nm(n + m + 1)/12. We skip the proof of asymptotic normality.

7.5 Location Shift Models


Definition 357 (Location Shift Model). Suppose Xi ∼ F, i ∈ [n] and Yj ∼ G, j ∈ [m]; we say that

G(t) = F(t − ∆)    (2890)

is the location shift model.

A common example of such a location shift would be when ∆ is a treatment effect shifting the control distribution F to the treatment distribution G. The location shift model is a generalization of the two-sample Wilcoxon Rank Sum test (see Section 7.4.9): to see this, write H0 : F(t) = G(t), H1 : F(t) ≥ G(t), which is equivalent to testing H0 : ∆ = 0, H1 : ∆ > 0.

In parametric statistics we often estimate ∆ = EY − EX with ∆̃ = Ȳ − X̄ = (1/nm) Σ_{i=1}^n Σ_{j=1}^m (Yj − Xi ), which is the average of all possible pairwise differences. In non-parametric scenarios we can take the median instead, and consider the order statistics of the nm differences. The median of the D_(i) is written

∆̂ = median{D_(i) , i ∈ [nm]} = (D_(nm/2) + D_(nm/2+1))/2 for nm even, and D_((nm+1)/2) for nm odd.    (2891)

In the Wilcoxon Rank Sum (centralized) test statistics (see Equations 2860) we wrote that U_Y = Σ_{i=1}^n Σ_{j=1}^m 1{Xi < Yj } and we computed the null distribution where ∆ = 0. If ∆ turns out to be non-zero, then we may write

U_Y^{(X, Y−∆)} = Σ_{i=1}^n Σ_{j=1}^m 1{Xi < Yj − ∆} = Σ_{i=1}^n Σ_{j=1}^m 1{Yj − Xi > ∆}    (2892)

which shares the same distribution as U_Y under ∆ = 0.
Theorem 390 (Interval Estimation for the Location Shift Model). If k(α) is an integer s.t. under H0 : ∆ = 0 we have P(U_Y ≤ k(α) − 1) = α/2, where the distribution of U_Y is given in the discussion of the distribution of Wilcoxon Rank Sums (see 7.4.10), show that our (1 − α) confidence interval is given by

[D_(k(α)) , D_(nm−k(α)+1) ].    (2893)

Proof. Write

P(∆ ∈ [D_(k(α)) , D_(nm−k(α)+1) ]) = 1 − P(D_(k(α)) > ∆) − P(D_(nm−k(α)+1) < ∆)    (2894)
= 1 − P(U_Y ≥ nm − (k(α) − 1)) − P(U_Y ≤ k(α) − 1)    (2895)
= 1 − 2P(U_Y ≤ k(α) − 1)    (2896)
= 1 − α.    (2897)
7.6 Parametric and Nonparametric Analysis of Variance, ANOVA/F-Test, Kruskal-Wallis Test (to be reviewed)
The question we may have is whether the differences in population parameter estimates are considered
significant or simply due to random variance.

7.6.1 F Test
A one-way layout is an experimental design in which independent measurements are made under several treatments. As such, the F test is a generalization of the two-sample/two-treatment problem. Let there be I groups and J measurements in each group (we also discuss the case where Ji ≠ Jj for some groups i ≠ j). Further denote by Yij the j-th measurement in group i, and suppose the sampling is generated from

Yij = µ + αi + eij ,  eij IID ∼ Φ(0, σ²)    (2898)

where the αi are normalized such that Σ_i αi = 0. The null hypothesis can be specified as follows: H0 : ∀ i, αi = 0, i.e. there is no difference between the expected values under the selected treatments.

Definition 358 (ANOVA Sum of Squares). Let Ȳi = (1/J) Σ_j Yij and Ȳ̄ = (1/IJ) Σ_i Σ_j Yij ; then we can decompose the total sum of squared errors into the sum of squares within groups and the sum of squares between groups. In particular, we have

Σ_{i=1}^I Σ_{j=1}^J (Yij − Ȳ̄)² = Σ_{i=1}^I Σ_{j=1}^J (Yij − Ȳi )² + J Σ_{i=1}^I (Ȳi − Ȳ̄)²    (2899)
        (SS_T)                        (SS_W)                      (SS_B)

where the second and third terms attribute the sum of squares to within and between groups respectively.

In the ANOVA method, we are particularly interested in the relative variation between and within the groups.

Lemma 16. Let Xi , i ∈ [n] be independent random variables with EXi = µi , Var Xi = σ²; then E(Xi − X̄)² = (µi − µ̄)² + ((n − 1)/n) σ², where µ̄ = (1/n) Σ µi .

Proof. (verify this)

By Lemma 16 and the equation in Definition 358, we have (verify this)


E SS_W = Σ_{i=1}^I Σ_{j=1}^J E(Yij − Ȳi )²    (2900)
       = Σ_{i=1}^I Σ_{j=1}^J ((J − 1)/J) σ²    (2901)
       = I(J − 1) σ²    (2902)

and

E SS_B = J Σ_{i=1}^I E(Ȳi − Ȳ̄)²    (2903)
       = J Σ_{i=1}^I [αi² + ((I − 1)/(IJ)) σ²]    (2904)
       = J Σ_{i=1}^I αi² + (I − 1) σ².    (2905)

Note that E(Yij ) = E(Ȳi ) = µ + αi . We may use SS_W to estimate the σ² term. Let s²_p = SS_W /(I(J − 1)); then s²_p is an unbiased estimate of σ², and it relates to the sample variances of the groups via SS_W = Σ_{i=1}^I (J − 1) s²_i - that is, the variance estimate is a weighted combination of the group sample variances. An important realization is that under the null hypothesis we shall have

E[SS_B /(I − 1)] = σ²    (2906)

and the value SS_W /(I(J − 1)) should be similar. However, if ∃ i with αi ≠ 0 then SS_B will be inflated. We may then define the F test statistic:

Theorem 391 (F-Test Distribution). Assuming e IID ∼ Φ(0, σ²), then SS_W /σ² ∼ χ²_{I(J−1)} . Furthermore, assuming αi = 0 ∀ i ∈ [I], then SS_B /σ² ∼ χ²_{I−1} and SS_B ⊥ SS_W . The test statistic that follows is the value

F = (SS_B /(I − 1)) / (SS_W /(I(J − 1))) ∼ F_{I−1, I(J−1)}

We can think of the numerator as reflecting variation between groups (plus error variance), while the denominator reflects the variation within groups. Under the null hypothesis this value should be ≈ 1. We reject the null if F > F_{I−1,I(J−1)}(α).

7.6.1.1 Groups of Different Size

In the case where the I groups have distinct sizes Ji , i ∈ [I], we arrive at (verify this)

E(SS_W ) = σ² Σ_{i=1}^I (Ji − 1)    (2907)
E(SS_B ) = (I − 1)σ² + Σ_{i=1}^I Ji αi²    (2908)

and the F test statistic computed will have degrees of freedom F_{I−1, Σ_i Ji − I} .
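A sketch of the one-way F test of Theorem 391 in code follows; the three groups below are hypothetical measurements, and the equivalent library call is noted in a comment.

import numpy as np
from scipy import stats

groups = [np.array([5.1, 4.8, 5.6, 5.0]),
          np.array([5.9, 6.2, 5.7, 6.4]),
          np.array([4.9, 5.3, 5.1, 4.7])]

grand = np.concatenate(groups).mean()
ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)          # SS_W
ss_b = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)     # SS_B
df_b, df_w = len(groups) - 1, sum(len(g) for g in groups) - len(groups)
F = (ss_b / df_b) / (ss_w / df_w)
pval = stats.f.sf(F, df_b, df_w)
# equivalent library call: stats.f_oneway(*groups)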

7.6.2 Kruskal Wallis Test
In the I-group setting without quantitative or normal data, we shall use a non-parametric method instead of ANOVA. Recall that in ANOVA we require the sampling to come from a parametric distribution, in particular that the samples are generated from the population function (Equation 2898). The Kruskal-Wallis test provides a similar variant with weaker assumptions, and generalizes the Mann-Whitney test (see Section 7.4.9). We assume here that the observations are independently sampled but require no distributional assumption except continuity. Order statistics are employed and rank metrics are used rather than the original data. Consider the p group/sample problem where Xij ∼ Fi , i ∈ [p], j ∈ [ni ], and our objective is to test

H0 : ∀ i, j ∈ [p], Fi = Fj   H1 : ∃ i, j ∈ [p] s.t. Fi ≤ Fj ∨ Fi ≥ Fj .    (2910)

Combine the p samples and let the pooled ranks Rij correspond to the rank of Xij , and define

SS_T = Σ_{i=1}^p Σ_{j=1}^{ni} (Rij − R̄̄)²    (2911)
SS_treat = Σ_{i=1}^p Σ_{j=1}^{ni} (R̄i − R̄̄)² = Σ_{i=1}^p ni (R̄i − R̄̄)²    (2912)
SS_e = Σ_{i=1}^p Σ_{j=1}^{ni} (Rij − R̄i )²    (2913)

and let the test statistic value SS_treat /SS_T reject the null hypothesis at some critical value C. See that R̄̄ = (n + 1)/2, where n = Σ_i ni , and therefore we obtain
SS_treat = Σ_{i=1}^p ni (R̄i² − 2R̄̄ R̄i + R̄̄²)    (2914)
= Σ_{i=1}^p ni R̄i² − 2R̄̄ Σ_{i=1}^p ni R̄i + n R̄̄²    (2915)
= Σ_{i=1}^p ni R̄i² − 2R̄̄ Σ_{i=1}^p Σ_{j=1}^{ni} Rij + n R̄̄²    (2916)
= Σ_{i=1}^p ni R̄i² − n R̄̄²    (2917)
= Σ_{i=1}^p ni R̄i² − n(n + 1)²/4    (2918)

and

SS_T = Σ_{i=1}^p Σ_{j=1}^{ni} Rij² − n R̄̄²    (2919)
= Σ_{i=1}^p Σ_{j=1}^{ni} Rij² − n(n + 1)²/4    (2920)
= Σ_{i=1}^n i² − n(n + 1)²/4    (2921)
= n(n + 1)(2n + 1)/6 − n(n + 1)²/4    (2922)
= n(n + 1)(n − 1)/12.    (2923)

It follows that

(n − 1) SS_treat / SS_T = (12/(n(n + 1))) (Σ_{i=1}^p ni R̄i² − n(n + 1)²/4)    (2924)
= (12/(n(n + 1))) Σ_{i=1}^p ni R̄i² − 3(n + 1)    (2925)

and let this value be T.

Definition 359 (Kruskal Wallis Test Statistic). The Kruskal-Wallis test statistic is the value T = (12/(n(n + 1))) Σ_{i=1}^p ni R̄i² − 3(n + 1) as in the preceding arguments. In the presence of jumps in the distribution (and hence ties), we use the Kruskal-Wallis mid-rank test statistic

T′ = [(12/(n(n + 1))) Σ_{i=1}^p ni R̄′i² − 3(n + 1)] / (1 − Σ_{i=1}^e (ti³ − ti )/(n³ − n))    (2926)

where R′ij is the mid-rank of Rij , R̄′i = (1/ni ) Σ_{j=1}^{ni} R′ij , e is the number of distinct observation values in the samples, and ti is the number of observations tied with the i-th distinct value in the pooled sample.

7.6.2.1 Distribution of the Kruskal Wallis Test Statistic

Recall from Result 38, in the discussion of the Wilcoxon Signed Rank test, that each signed-rank configuration has probability 1/(2ⁿ n!) under the null settings. Here, since there are no signs, we may take 2ⁿ · 1/(2ⁿ n!) = 1/n! for each permutation mapping, giving the measure P((Rij = rij )_{i∈[p], j∈[ni]}) under the assumption that all the Fi 's are equal. This allows us to explicitly find the exact distribution of T when we have small group sizes. However, for large ni , i ∈ [p], we perform approximations. In particular we have

Result 39 (Approximate Distributions for the Kruskal Wallis Test Statistic). Under H0 : ∀ i, j ∈ [p], Fi = Fj and for large ni , i ∈ [p], we have approximately

T ∼ χ²_{p−1} ,  T′ ∼ χ²_{p−1}    (2927)

and we compute the p-values using the χ² distribution, rejecting if T > χ²_{p−1}(α).
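A minimal sketch of the Kruskal-Wallis statistic of Definition 359 and its χ² approximation follows; the groups are hypothetical, and the library call applies the same mid-rank tie correction.

import numpy as np
from scipy import stats

groups = [np.array([1.2, 3.4, 2.2, 5.0]),
          np.array([4.1, 6.0, 5.5]),
          np.array([0.8, 1.9, 2.5, 1.1])]

pooled = np.concatenate(groups)
ranks = stats.rankdata(pooled)                        # mid-ranks under ties
n = len(pooled)
sizes = [len(g) for g in groups]
rbar = [r.mean() for r in np.split(ranks, np.cumsum(sizes)[:-1])]   # group mean ranks
T = 12.0 / (n * (n + 1)) * sum(ni * rb ** 2 for ni, rb in zip(sizes, rbar)) - 3 * (n + 1)
pval = stats.chi2.sf(T, df=len(groups) - 1)
# equivalent library call: stats.kruskal(*groups)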

7.6.3 Bonferroni’s Method


If the null hypothesis of the ANOVA method or of the Kruskal-Wallis test (see Section 7.6.2.1) is rejected, we might ask which of the groups are involved in the 'significantly different' relationships. However, making pairwise comparisons between groups without adjustment is not a good idea, since we end up with (I choose 2) pairs, and our type I error (see Definition 349) is inflated. The probability of at least one type I error appearing is 1 − (1 − α)^k , where k = (I choose 2). We want to control the local probability of type I errors such that the family-wise error rate / overall type I error rate (see Definition 349) is controlled. Let this error rate be a; then the Bonferroni correction chooses a local error rate of α = a/k. By the duality of confidence intervals and hypothesis tests (see 2802), we form marginal confidence intervals at the 100(1 − a/k)% level, and jointly we can be at least 100(1 − a)% confident.

7.6.4 Tukey’s Method


Assume the same settings as in Bonferroni's Method (see Section 7.6.3), but further assume that we are provided with equal sample sizes and e IID ∼ Φ(0, σ²). Let a be the familywise type I error rate for the k tests; we use Tukey's method to construct confidence intervals for the differences of pairwise means by ensuring that the confidence intervals have joint error rate a. Recall from Equation 2898 that we can write

Ȳi = µ + αi + (1/J) Σ_{j=1}^J eij    (2928)

and by writing µi = µ + αi , we obtain Ȳi − µi = (1/J) Σ_j eij for i ∈ [I]. We have E(Ȳi − µi ) = 0, Var(Ȳi − µi ) = σ²/J, which may be approximated with the sample variance by substituting s²_p /J. Now consider the pairwise comparisons between groups i1 , i2 ; the distribution of

max_{i1 , i2} |(Ȳ_{i1} − µ_{i1}) − (Ȳ_{i2} − µ_{i2})| / (s_p /√J)    (2929)

is called the studentized range distribution. We can construct intervals at the 100(1 − a)% level for µ_{i1} − µ_{i2} with

(Ȳ_{i1} − Ȳ_{i2}) ± q(a) s_p /√J    (2930)

where q(a) is the upper 100a percentage point of the studentized range distribution. By duality, if this interval does not contain zero, we reject the null hypothesis of equal means at familywise error rate a.

Note that if the data follow (conditionally) normal distributions then the ANOVA test can be applied, followed by Tukey's correction if the ANOVA was significant. Under non-normality, we may conduct the Kruskal-Wallis test combined with Bonferroni's correction method.

7.6.5 Two-Way ANOVA Methods


In the two-way ANOVA layout, we estimate how the mean of a quantitative variable changes according to the levels of two categorical variables - including a possible interaction effect. Assume K > 1, and specify the ANOVA population function

Yijk = µ + αi + βj + γij + eijk ,  i ∈ [I], j ∈ [J], k ∈ [K]    (2931)

where ijk indexes the k-th observation in cell ij, eijk IID ∼ Φ(0, σ²), and cell ij indexes the levels of the two categorical variables. αi , βj reflect the individual effects of each group, γij the interaction effects, and eijk is the error term. We constrain these constants such that Σ_{i=1}^I αi = 0, Σ_{j=1}^J βj = 0 and Σ_{i=1}^I γij = Σ_{j=1}^J γij = 0. Again we would like to compare the various sums of squares; the following sums of squares are defined

SS_A = JK Σ_{i=1}^I (Ȳ_{i..} − Ȳ_{...})²    (2932)
SS_B = IK Σ_{j=1}^J (Ȳ_{.j.} − Ȳ_{...})²    (2933)
SS_AB = K Σ_{i=1}^I Σ_{j=1}^J (Ȳ_{ij.} − Ȳ_{i..} − Ȳ_{.j.} + Ȳ_{...})²    (2934)
SS_E = Σ_{i=1}^I Σ_{j=1}^J Σ_{k=1}^K (Yijk − Ȳ_{ij.})²    (2935)
SS_TOT = Σ_{i=1}^I Σ_{j=1}^J Σ_{k=1}^K (Yijk − Ȳ_{...})² = SS_A + SS_B + SS_AB + SS_E    (2936)

We can construct multiple hypotheses on the population function 2931, such as the absence of row effects H0 : ∀ i, αi = 0, the absence of column effects H0 : ∀ j, βj = 0, and the absence of interaction effects H0 :

∀ i, ∀ j, γij = 0. If the errors are IID normal Φ(0, σ²), then under the null our statistics satisfy (verify this)

1. SS_E /σ² ∼ χ²_{IJ(K−1)}

2. SS_A /σ² ∼ χ²_{I−1}

3. SS_B /σ² ∼ χ²_{J−1}

4. SS_AB /σ² ∼ χ²_{(I−1)(J−1)}

5. SS_E ⊥ SS_A ⊥ SS_B ⊥ SS_AB

and the F statistic can be constructed as in the usual scenarios by taking the ratio against SS_E . For instance, when testing for interaction effects we generate the F statistic

F = (SS_AB /((I − 1)(J − 1))) / (SS_E /(IJ(K − 1))) ∼ F_{(I−1)(J−1), IJ(K−1)}

and compute the relevant p-values to make conclusions.

7.7 Correlation Tests


Note that for two random variables X, Y we have ρ(X, Y) = −ρ(X, −Y). Therefore the hypothesis tests

H0 : ρ(x, y) = 0   H1 : ρ(x, y) < 0    (2937)
H0 : ρ(x, −y) = 0   H1 : ρ(x, −y) > 0    (2938)

are equivalent, and we may focus on a single direction.

7.7.1 Parametric Correlation Test


Suppose we are provided with (Xi , Yi ) IID ∼ F(x, y), i ∈ [n] from a bivariate distribution. Then the correlation coefficient of the population is given by ρ = Cov(X, Y)/(σ_X σ_Y ), and we want to test H0 : ρ_XY = 0 against a competing hypothesis, say H1 : ρ_XY > 0. Then our sample correlation coefficient is given by

r = Σ(Xi − X̄)(Yi − Ȳ) / √(Σ(Xi − X̄)² Σ(Yi − Ȳ)²) = (Σ Xi Yi − (1/n) Σ Xi Σ Yi ) / √(Σ(Xi − X̄)² Σ(Yi − Ȳ)²)    (2939)

for which we can show (verify this) the asymptotically equivalent forms

√n (r − ρ)/(1 − r²) approximately ∼ Φ(0, 1)    (2940)
√n (ψ(r) − ψ(ρ)) approximately ∼ Φ(0, 1)    (2941)

for the transformation ψ(r) = (1/2) log((1 − r)/(1 + r)).

7.7.2 Nonparametric Spearman Correlation Test


In the Spearman rank correlation test we take the non-parametric analogue of the parametric correlation test of Section 7.7.1. Let ((Ri )_{i∈[n]}) be the ranks of (Y1 , ···, Yn ) and ((Qj )_{j∈[n]}) be the ranks of (X1 , ···, Xn ). Then define

Definition 360 (Spearman Rank Correlation Coefficient). The Spearman Rank Correlation Coefficient is the value

r = (Σ Qi Ri − (1/n) Σ Qi Σ Ri ) / √(Σ(Qi − Q̄)² Σ(Ri − R̄)²)    (2942)
  = (Σ Qi Ri − (1/n)(Σ i)²) / (Σ i² − (1/n)(Σ i)²)    (2943)
  = (12/(n(n² − 1))) Σ Qi Ri − 3(n + 1)/(n − 1).    (2944)

Permuting the (Xi , Yi ) pairs by the X keys such that the ordered X has ranks 1, ···, n, let Si , i ∈ [n] be the corresponding Y rank and see that

Σ Qi Ri = Σ i Si = −(1/2) Σ(Si − i)² + Σ i²    (2945)

and let our test statistic be D = Σ(Si − i)², which rejects the null hypothesis when D ≤ dα for P(D ≤ dα ) = α.

Exercise 630. Show that 0 ≤ D ≤ (1/3)(n³ − n) and that under the null hypothesis of ρ = 0, D is symmetric about (1/6)(n³ − n), in the sense that P(D ≥ d) = P(D ≤ (1/3)(n³ − n) − d).

Proof. The lower bound of zero is trivial. The maximum value of D occurs when S1 = n, S2 = n − 1, ···, Sn = 1, i.e., Si = n + 1 − i. In this instance we have

max D = Σ [(n + 1) − 2i]²    (2946)
= Σ (n + 1)² − 4(n + 1) Σ i + 4 Σ i²    (2947)
= n(n + 1)² − 2n(n + 1)² + (2/3) n(n + 1)(2n + 1)    (2948)
= (1/3) n(n + 1)(2(2n + 1) − 3(n + 1))    (2949)
= (1/3) n(n² − 1).    (2950)

The symmetry follows from seeing that each permutation mapping, with P((Si = si )_{i∈[n]}) = 1/n!, results in a distribution centred about ED = (1/6)(n³ − n).

We may refer to statistical tables for the p-value computations on D.
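The rank-based computation of Definition 360, together with the classical identity ρ_S = 1 − 6D/(n(n² − 1)) valid when there are no ties, is sketched below; the data are simulated and illustrative, and the equivalent library call is noted.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)

q = stats.rankdata(x)                     # Q_i
r = stats.rankdata(y)                     # R_i
n = len(x)
d = np.sum((q - r) ** 2)                  # equals D = sum (S_i - i)^2 after sorting by x
rho_hat = 1.0 - 6.0 * d / (n * (n ** 2 - 1))   # Spearman coefficient (no ties)
res = stats.spearmanr(x, y)               # rho and p-value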

7.8 Goodness of Fit Tests


We commonly assume that data are sampled from some underlying distribution, but if this assumption is violated, our tests would be invalid. Here we want to validate such assumptions. Let Xi ∼ F, i ∈ [n] be n observations, and let our hypothesis be H0 : F = F0 , H1 : F ≠ F0 .

Definition 361 (Empirical Cumulative Density Function). Define our empirical c.d.f. as

Fn (x) = (1/n) Σ 1(Xi ≤ x).

Theorem 392 (Properties of the Empirical C.D.F). Here we state some properties of the empirical cumulative density function.

1. We have nFn (x) = Σ 1(Xi ≤ x) ∼ Bin(n, p = F(x)) by definition of F(x). See that EFn (x) = (1/n) E[nFn (x)] = F(x), and Var(Fn (x)) = (1/n)² Var(nFn (x)) = (1/n) F(x)(1 − F(x)).

2. By the Strong Law of Large Numbers (see Theorem 361) we have

Fn (x) → EFn (x) = F(x) as n → ∞.    (2951)

3. By the Central Limit Theorem (see Theorem 362) we have approximately

Fn (x) ∼ Φ(F(x), (1/n) F(x)(1 − F(x))).    (2952)

These properties suggest that Fn (x) is a good estimator of F(x).

7.8.1 Kolmogorov Smirnov Test


Let Xi ∼ F, i ∈ [n] and Fn be the empirical c.d.f. defined as in Definition 361. We want to test H0 : F = F0 against the alternative H1 : F ≠ F0 .

Definition 362 (Kolmogorov Smirnov Distance). Let the Kolmogorov-Smirnov distance be defined by the formula

Dn = sup_x |Fn (x) − F(x)|    (2953)

which is the supremum over x of the distance between the empirical c.d.f. and the theoretical c.d.f.

Under the null hypothesis, we may expect that the supremum distance obtained by substituting F0 for F should be small, and our critical region shall be Dn > d for d given by P(Dn > d) = α.

Result 40. The KS-distance as defined in Definition 362 has the same null distribution for all continuous F; that is, the distribution of Dn does not depend on F.

We may refer to statistical tables for the distribution of Dn . If we are provided that P(Dn ≤ c) = 1 − α then we can write

1 − α = P(sup_x |Fn (x) − F(x)| ≤ c)    (2954)
= P(∀ x, −c ≤ F(x) − Fn (x) ≤ c)    (2955)
= P(∀ x, Fn (x) − c ≤ F(x) ≤ Fn (x) + c)    (2956)

which gives us a confidence band F(x) ∈ [Fn (x) ± c] having duality with the two-sided hypothesis test.

Assuming that F0 is a continuous density function, we can compute Dn from the sampled Xi 's by using their order statistics X_(i) . In particular, we have

Dn = max_{1≤i≤n} max{ |i/n − F0 (X_(i) )|, |F0 (X_(i) ) − (i − 1)/n| }.    (2957)

For instance, suppose we are given Xi , i ∈ [n = 10] samples generated from F, and we want to test if they are generated from Φ(µ, σ²). Then we can conduct the following steps to get Dn : (i) first order them by their order statistics, (ii) compute Z_(i) = (X_(i) − µ)/σ, (iii) compute F0 (X_(i) ) = Φ(Z_(i) ), (iv) compute |i/n − F0 (X_(i) )| and |F0 (X_(i) ) − (i − 1)/n|, and (v) take the maximum of the 2n = 20 values computed in step (iv).

If F0 were discrete then our test statistic would take the form

Dn = max_i max{ |F0 (X_(i) ) − Fn (X_(i) )|, |F0⁻(X_(i) ) − Fn (X_(i−1) )| }    (2958)

where F0⁻(X_(i) ) = P(X < X_(i) ). Arranging the X's by their order statistics, we regard ties as a single element; then X_(1) < ··· < X_(m) , 1 ≤ m ≤ n, and we let ni be the number of ties at X_(i) . Our empirical c.d.f. here takes Fn (X_(i) ) = (Σ_{j=1}^i nj )/n.

In general, in the χ² goodness-of-fit test we divide the axis into many subintervals, and different choices of intervals can lead to different conclusions. The Kolmogorov-Smirnov test does not suffer from this ambiguity. Additionally, χ² is a large-sample test with only asymptotic distributions, while for the Kolmogorov-Smirnov test we have exact distributions for Dn . We do not discuss the closed-form distribution for Dn here. Statistical tables may be referenced.
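A minimal sketch of the computation in Equation (2957) and the corresponding library call follows; the sample and the fully specified null distribution Φ(µ0, σ0²) are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=0.2, scale=1.0, size=50)
mu0, sigma0 = 0.0, 1.0                       # hypothesized F_0 = Phi(mu0, sigma0^2)

xs = np.sort(x)
F0 = stats.norm.cdf((xs - mu0) / sigma0)
i = np.arange(1, len(xs) + 1)
Dn = np.max(np.maximum(np.abs(i / len(xs) - F0), np.abs(F0 - (i - 1) / len(xs))))
res = stats.kstest(x, "norm", args=(mu0, sigma0))   # D_n and its p-value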

Chapter 8

Statistical Learning

Statistical learning problems can be classified as supervised or unsupervised, according to whether there is an outcome measurement accompanying the feature set. One is then a prediction problem and the other is an organization/relationship problem. The supervised learning problem is further decomposed into the regression problem and the classification problem, depending on the quantitative/qualitative nature of the output. Not all errors are equal in prediction problems.

8.1 Supervised Learning


In the supervised learning problem, the inputs may be termed predictors, independent variables, or features. The term we are predicting is called the response or dependent variable. In the classification problem, qualitative variables are predicted; these are referred to as categorical or discrete. They may sometimes have an explicit ordering. The task can be simplified as a function approximation problem. Categorical variables that are ordered can be called ordered categorical. Categories are often represented by numeric codes, or binary values such as indicators. The numeric codes are sometimes named targets. A K-level qualitative variable can be represented by a K-vector. More compact coding schemes are possible.

Definition 363 (Variable Representations). The input variable is often symbolized X. If X is in higher
dimensions (a vector), we can refer to components as Xj . Quantitative outputs are denoted Y , and
qualitative ones as G. Observed values of the random variables X, Y, G are written lowercase, such as x,
with the same convention on xi . Matrices are bold uppercase, such as X for N by p matrix of N inputs
and p features. Cardinality N vectors are bold, so we can distinguish the p-vector of inputs xi , i ∈ [N ]
for observation i from the N -vector xj for N samples on feature j ∈ [p]. All vectors are assumed column,
so ith sample/row of X is xT
i . Sometimes, in cases where there are no ambiguity, all matrices (both
cardinality n and p) are written bold.

Problem 6 (Learning Problem). Given input X, make prediction Ŷ of Y , or Ĝ of G. We are given


training data, represented as set {(xi , yi ) : i ∈ [N ]}, and equivalently so for G.

8.2 Generalized Linear Models


The linear model problem is given in Hastie et al. [10]:

Problem 7 (Linear Model). Given input vector Xᵀ = (X1 , ···, Xp ), predict Ŷ = β̂0 + Σ_{j=1}^p Xj β̂j . The β̂0 is termed the intercept/bias. Subsuming 1 into X and the intercept into the β̂ vector, we can represent this in inner-product form Ŷ = Xᵀβ̂. We can generalize such that Ŷ is a K-vector, in which case β̂ ∈ R^{(p+1),K} ; over N samples, X ∈ R^{N,(p+1)} .

In the (p + 1)-dimensional input-output space, (X, Ŷ) represents a hyperplane; with the constant subsumed in X it includes the origin and is a subspace. If the constant is not included, it is an affine set cutting the Y-axis at the point (0, β̂0 ). The function over the p-dimensional input space f(X) = Xᵀβ is linear, and the gradient f′(X) = β is a vector in input space pointing in the direction of steepest ascent. [10]

The question then becomes how we choose the appropriate value of β̂. In the least-squares approach we estimate β to minimise the residual sum of squares (Definition 364).

Definition 364 (Residual Sum of Squares, a.k.a. Sum of Squared Errors).

RSS(β) = Σ_{i=1}^N (yi − xiᵀβ)²    (2959)
       = (y − Xβ)ᵀ(y − Xβ)    (2960)

with X ∈ R^{N,p} .

Note that a minimum exists; it is not necessarily unique. Differentiating w.r.t. β we get the normal equations Xᵀ(y − Xβ) = 0. If XᵀX is non-singular, we have the unique solution β̂ = (XᵀX)⁻¹Xᵀy. For an arbitrary input x, we can form the fitted estimate ŷ = ŷ(x) = xᵀβ̂.
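A minimal sketch of the normal-equations solution follows on simulated data; for numerical stability in practice one would typically prefer np.linalg.lstsq over explicitly solving with XᵀX.

import numpy as np

rng = np.random.default_rng(5)
N, p = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])   # subsume the intercept in X
beta_true = np.array([1.0, 2.0, -1.0, 0.5])                  # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
y_hat = X @ beta_hat                            # fitted values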
These equations also apply to the classification instance, which predicts output in G, as in the Learning Problem 6. For the binary classification instance, we can encode Ĝ using a binary variable, and the decision boundary is the set {x : xᵀβ̂ = m}, where m is the cutoff such that we make one classification or the other according to whether xᵀβ̂ ≥ m.

8.2.1 Nearest Neighbors


Consider the training set T; the nearest-neighbour method uses the nearest neighbours in the training set (in the input space) closest to input x to form Ŷ. We have simple formulations.

Definition 365 (K-Nearest Neighbours). Given input x, the KNN method predicts Ŷ = (1/k) Σ_{xi∈Nk(x)} yi , where Nk (x) ⊂ T are the k points closest in the training sample to x. Closeness is a choice of metric; we assume Euclidean. Then in the classification problem Ŷ is compared to the decision boundary, which can be seen as majority voting in the local input space. When K = 1, the classification degenerates into a Voronoi tessellation.

Simple logic dictates that training error is approximately an increasing function of K, and is zero when K = 1. Although it appears the parameter space is univariate, the effective number of parameters in the KNN method is N/K, and generally N/K > p. [10] The intuition is that if the neighbourhoods were disjoint, there would be N/K neighbourhoods and each neighbourhood fits one parameter, the mean. Clearly, we should not rely on RSS (see 364) to determine the value of K.
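A minimal sketch of the KNN prediction of Definition 365 under the Euclidean metric follows; the training data and query point are simulated and purely illustrative.

import numpy as np

def knn_predict(x0, X_train, y_train, k=5):
    # distances from the query point to every training input
    d = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(d)[:k]               # indices of the k closest neighbours
    return y_train[nearest].mean()            # average response over N_k(x)

rng = np.random.default_rng(6)
X_train = rng.normal(size=(200, 2))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=200)
y0_hat = knn_predict(np.array([0.3, -0.2]), X_train, y_train, k=10)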

8.2.2 Least Squares vs Nearest Neighbours


In least squares, the decision boundary is smooth and stable. Its appropriateness depends on the problem itself - whether a linear decision boundary is apt. It has low variance and high bias. On the other hand, KNN makes fewer assumptions about the data, but has high variance and low bias. The model goodness then depends on the amount of data available in relation to the population distribution. A significant portion of the statistical learning literature consists of variants of these two general methods, such as:

1. Kernel methods use weights that decrease smoothly to zero with distance from target point, rather
than a binary weight.

2. In higher dimensions the distance kernels can overweight some variables more than others.

3. Local regression methods fit linear models by locally weighted least squares.

4. Linear models fit to basis expansion of original inputs allow arbitrary complex models.

5. Projection pursuit and neural networks consist of sums of non-linearly transformed linear models.

8.2.3 Regression Functions, Classifiers and Prediction Errors


Consider the quantitative output and probability spaces as defined in 268. Let X ∈ Rp denote the p-dimensional input vector and Y ∈ R be the output; these are random variables. Let their joint distribution be P(X, Y). Then we seek a function f(X) for prediction, penalized by a loss function L(Y, f(X)). Here we give one such common example.

Definition 366 (Squared Error Loss Function).

L(Y, f(X)) = (Y − f(X))²    (2961)

Definition 367 (Expected Prediction Error).

EPE(f) = E(Y − f(X))² = ∫ [y − f(x)]² P(dx, dy)    (2962)

Conditioning on X, this can be expressed as

∫ [∫ (y − f(x))² P(dy|x)] P(dx) = E_X E_{Y|X}[(Y − f(X))² | X].    (2963)

In the categorical case, we can express

EPE = E[L(G, Ĝ(X))]    (2964)
    = E_X Σ_{k=1}^K L(Gk , Ĝ(X)) P(Gk | X).    (2965)

The criterion for choosing f can be formed as the regression function. It suffices to minimize the EPE pointwise, giving f(x) as follows:

Definition 368 (Regression Function).

f(x) = arg min_c E_{Y|X}[(Y − c)² | X = x],

yielding f(x) = E(Y | X = x). This conditional expectation is known as the regression function, and the best point estimate at any input x is the conditional mean.

Again, in the categorical case, minimising the EPE pointwise we have

Ĝ(x) = arg min_{g∈G} Σ_{k=1}^K L(Gk , g) P(Gk | X = x).

Under the binary (0-1) loss function this simplifies to Ĝ(x) = arg min_{g∈G} [1 − P(g | X = x)].

Definition 369 (Bayes Classifier). The Bayes Classifier is the classifier corresponding to

arg max P(g|X = x),


g∈G

which is to classify to the most probable class, under the conditional distribution of P(G|X). This is
the ‘golden standard’, and is the classification achieved assuming we have knowledge of the conditional
distributions. In general, our classification techniques approximate the Bayes classifier. The error rate
of the Bayes classifier is the Bayes rate.

An analogue in KNN can be drawn, although here we settle for a conditional mean over the neighbours of the input point. Two approximations are employed: (i) the expectation is approximated by an average over samples, and (ii) conditioning on a point is relaxed to conditioning on a neighbourhood.

In the KNN case, we have f̂(x) = (1/|Nk (x)|) Σ_{xi∈Nk(x)} yi , where Nk (x) is the set of the k closest neighbours of x. We can show that, under mild regularity conditions, as N, k → ∞ with k/N → 0, we have f̂(x) → E(Y | X = x). However, the issue is often the lack of data, and hence the stability of our estimates. As the dimension p gets large, so does the metric size of the KNN neighbourhood, and the rate of convergence decreases with increasing p.

Exercise 631. Suppose each of the K classes has an associated target tk of all zeroes except at coordinate k, where the entry is one. Show that classifying to the largest element of ŷ amounts to choosing the closest target, min_k ‖tk − ŷ‖.

Proof. Our work can be formulated as proving that arg max_k ŷk = arg min_k ‖tk − ŷ‖². Consider

‖tk − ŷ‖² = (1 − ŷk )² + Σ_{l≠k} (0 − ŷl )²    (2966)
= 1 + ŷk² − 2ŷk + Σ_{l≠k} ŷl²    (2967)
= 1 − 2ŷk + Σ_l ŷl²    (2968)

and the equivalence follows, since the last summation does not depend on k.

In linear regression, we assume that the regression function f(x) is approximately linear, so that f(x) ≈ xᵀβ. Substituting into the EPE 367, we get E(Y − Xᵀβ)². Differentiating w.r.t. β and solving, we get (verify this):

β = [E(XXᵀ)]⁻¹ E[XY].

We have not conditioned on X, and instead used the functional relationship to pool over values of X. Least squares assumes f(x) is well approximated by a globally linear function, while KNN assumes the regression function is well approximated by a locally constant function.

An example of a model-based approach (but more flexible than the simple linear model discussed) is the additive model, which assumes that f(X) = Σ_{j=1}^p fj (Xj ); this retains the additivity of the linear model but has arbitrary coordinate functions. We can also consider loss functions other than the ℓ2 loss, such as ℓ1 = E|Y − f(X)|, whose solution is the conditional median f̂(x) = median(Y | X = x). While these estimates are more robust than the conditional mean, the ℓ1 criterion has discontinuous derivatives, which hinders widespread use. The loss function for categorical predictions also needs to be adjusted. Assume Ĝ = f̂(X); our loss function can then be represented by a K × K matrix L, where K = |G|. The matrix L has zeros on the diagonal and is non-negative elsewhere, with value L(k, l) = cost of (mis)classifying Gk as Gl . The common choice is the binary {0, 1} loss. The formulation of the EPE can be referenced in Definition 367. In KNN, we approximate the Bayes classifier, with the approximation that the conditional probability at a point in the input space is approximated over the neighbourhood of the point.

8.3 Local Methods in High Dimensions


So far we have seen that the linear model is stable but biased, relative to the KNN method. We now explore this instability using the concept of the curse of dimensionality.

Exercise 632 (Curse of Dimensionality). Consider the p-dimensional hypercube with edge length l. Suppose we want to capture a proportion r of the data in a neighbourhood of a target point. This corresponds to sampling a volume of l^p · r, and the expected edge length of the corresponding sub-cube equals (l^p · r)^{1/p} = l r^{1/p} . To cover a fraction r of the data to form local averages, we must cover an increasingly greater percentage of the range of each input dimension, and the average ceases to be 'local'. Another consequence of this sparse sampling is that in high dimensions, all sample points are close to an edge of the sample. Another manifestation of this issue can be viewed in terms of the sampling density: the sampling density is ∝ N^{1/p} , where N is the sample size and p is the dimensionality. To maintain the same density, the training data size needs to grow as a power of p. In high dimensions, any feasible training sample sparsely populates the input space.

8.4 Bias, Variance and Errors


Definition 370 (Bias Variance Decomposition). Define the training set T (say with |T| = 1000), and let the relationship be Y = f(X). We may compute the EPE at x0 ∈ T, giving us the deterministic mean squared error (MSE) (verify this):

MSE(x0 ) = E_T [f(x0 ) − ŷ0 ]²    (2969)
= E_T [(f(x0 ) − E_T [ŷ0 ]) + (E_T [ŷ0 ] − ŷ0 )]²    (2970)
= E_T [(f(x0 ) − E_T [ŷ0 ])² + (E_T [ŷ0 ] − ŷ0 )² + 2Ψ]    (2971)

where

Ψ = (f(x0 ) − E_T [ŷ0 ])(E_T [ŷ0 ] − ŷ0 ).    (2972)

Then,

E_T (Ψ) = E_T [(f(x0 ) − E_T [ŷ0 ])(E_T [ŷ0 ] − ŷ0 )]    (2973)
= (f(x0 ) − E_T [ŷ0 ]) E_T (E_T [ŷ0 ] − ŷ0 )    (2974)
= 0    (2975)

by factoring out the non-random constant and seeing that the second term equates to zero. Then, we have

MSE(x0 ) = Bias²(ŷ0 ) + Var_T (ŷ0 ).    (2976)

As the complexity of functions of many variables can grow exponentially with dimensionality, to estimate them with the same accuracy as in low dimensions we need the size of the training set to grow at an exponential rate.

Exercise 633. Suppose we know that Y, X have a linear relationship Y = X^T β + ε, where ε ∼
N(0, σ^2). Fitting by least squares, at test point x0 we have ŷ0 = x0^T β̂, which can be written in the form
ŷ0 = x0^T β + Σ_{i=1}^N l_i(x0) ε_i, where l_i(x0) is the i-th element of X(X^T X)^{-1} x0. To see this, recall that we
wrote in Equation (3067) that e = (1 − H)ε, and here x0^T β − ŷ0 = −Σ_{i=1}^N l_i(x0) ε_i. Since least
squares here is unbiased, we can form

EPE(x0) = E_{y0|x0} E_T (y0 − ŷ0)^2                                          (2977)
        = Var(y0|x0) + E_T[ŷ0 − E_T ŷ0]^2 + [E_T ŷ0 − x0^T β]^2              (2978)
        = Var(y0|x0) + Var_T(ŷ0) + Bias^2(ŷ0)                                (2979)
        = σ^2 + E_T[x0^T (X^T X)^{-1} x0] σ^2 + 0                            (2980)

where the additional variance σ^2 is the variance due to non-deterministic targets. Note that our element
x0 is sampled, and therefore has associated sampling variance. If N is large and T is selected at random,
then assuming E(X) = 0, X^T X → N · Cov(X) (verify this) and

E_{x0} EPE(x0) ∼ E_{x0}[x0^T Cov(X)^{-1} x0] σ^2/N + σ^2                     (2981)
             = tr[Cov(X)^{-1} Cov(x0)] σ^2/N + σ^2                           (2982)
             = σ^2 p/N + σ^2                                                 (2983)
             = σ^2 (p + N)/N.                                                (2984)

We can observe that the expected EPE increases linearly in p with slope σ^2/N, and if N is large the p-term
is negligible and the curse of dimensionality is avoided.
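A small Monte Carlo sketch (under the stated assumptions; N, p, σ and the number of trials are arbitrary choices) can be used to check the σ^2 (p + N)/N approximation for the least squares fit:

import numpy as np

rng = np.random.default_rng(0)
N, p, sigma, trials = 500, 20, 1.0, 2000
beta = rng.normal(size=p)

errs = []
for _ in range(trials):
    # Fresh training set with E(X) = 0 and a fresh test point x0 each trial.
    X = rng.normal(size=(N, p))
    y = X @ beta + sigma * rng.normal(size=N)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    x0 = rng.normal(size=p)
    y0 = x0 @ beta + sigma * rng.normal()
    errs.append((y0 - x0 @ beta_hat) ** 2)

print("simulated EPE:      ", np.mean(errs))
print("sigma^2 (p + N) / N:", sigma**2 * (p + N) / N)
# The two values should agree approximately (up to Monte Carlo noise).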

8.5 Statistical Models


The goal of statistical modelling is to find a useful approximation f̂(x) for f(x). We saw that the squared
error loss leads us to the regression function f(x) = E(Y |X = x) for quantitative response variables. Here
we discuss a framework for the study of statistical models.
Suppose the underlying true model can be represented Y = f(X) + ε, where E(ε) = 0, ε ⊥ X. Here,
f(x) = E(Y |X = x) and P(Y |X) depends on X only through the conditional mean f(x). The assumption
that the errors are IID is not strictly necessary, but we should keep it in mind when we average squared
errors uniformly in the EPE. The independence assumption and additive model simplify our analysis, but
in general simple modifications can be used to avoid the independence assumption. For instance, we
can have Var(Y |X = x) = σ^2(x), where both the mean and variance depend on X. The conditional
distribution P(Y |X) may depend on X in complicated ways, but the additive model precludes this. So
far, we have concentrated only on quantitative responses; in qualitative response models we typically
do not have additive error models. In this case the target function p(X) is the conditional density P(G|X)
and is modelled directly.

8.5.1 Supervised Statistical Models


Assume again that the true model is Y = f(X) + ε, and suppose the errors are additive. In machine learning
jargon, we attempt to learn f through a teacher, using training set T = (x_i, y_i), i ∈ [N]. The learning
algorithm takes the inputs and target outputs and adjusts f̂(x_i) based on the differences y_i − f̂(x_i);
this is named, simply, learning by example. The data pairs {x_i, y_i} may be viewed as points in a (p + 1)
dimensional Euclidean space, where the function f(x) has domain equal to the p-dim input subspace.
Assuming the domain is R^p, our goal is then to obtain a useful approximation to f(x) for all x in some region
of R^p based on T. Let the parameter set to learn be θ, for which in the linear model case we
have θ = β. Another such example would be the linear basis expansions fθ(x) = Σ_{k=1}^K h_k(x) θ_k, where
the h_k are a suitable set of functions or transformations on the input vector x, such as x_1^2, cos(x_1), x_1 x_2^2 and so on,
or even nonlinear expansions such as sigmoid transformations of the form h_k(x) = 1/(1 + exp(−x^T β_k)). The
parameters can be obtained by techniques such as least squares (minimising the RSS - Definition (364)). In
function approximation terminology, our parameterized function is a surface in the (p + 1) space, with
T noisy realizations from it. In the linear model, we get simple closed form solutions, and similarly so
for the basis function methods (if the basis functions do not have hidden parameters), but in general
they can also be solved with iterative or numerical methods. A more general principle of estimation than
the least squares method is the maximum likelihood estimation method, which we formalised in Section
334.
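As a brief illustration of a linear basis expansion fitted by least squares (a sketch; the basis functions h_k and the synthetic data are hand-picked assumptions, not prescriptions from the text):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = np.sin(2 * x) + 0.3 * x**2 + 0.1 * rng.normal(size=200)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Basis functions h_k(x): constant, x, x^2 and two fixed sigmoid transformations.
H = np.column_stack([np.ones_like(x), x, x**2, sigmoid(3 * x), sigmoid(-3 * x)])

# f_theta(x) = sum_k h_k(x) theta_k is linear in theta, so least squares applies directly.
theta, *_ = np.linalg.lstsq(H, y, rcond=None)
print(theta)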

8.6 Classes of Restricted Estimators


Consider the RSS (see Definition 364) criterion for arbitrary f - this yields infinitely many solutions.
Any function f̂ passing through the training points can be a solution, and can be arbitrarily complex.
However, such a function might be a poor predictor at test/unobserved points in the input space. We need to
restrict the set of eligible solutions, often either by encoding parametric representations of the function
or by restrictions built into the learning method.

8.6.1 Roughness Penalty and Bayesian Methods


There is a class of functions controlled by explicitly penalizing the RSS(f) with a roughness penalty,
given PRSS(f, λ) = RSS(f) + λJ(f), where J(f) is a specified functional that is large for functions f that
vary rapidly over small regions of input space. One such example is the cubic smoothing spline for one-dim
inputs, with PRSS(f, λ) = Σ_{i=1}^N (y_i − f(x_i))^2 + λ ∫ [f''(x)]^2 dx as the penalized least-squares criterion. The
second term controls large values of the second derivative of f, with penalty cost dictated by λ ≥ 0. If
λ = 0, then any interpolating function will do, and λ = ∞ only permits functions linear in x. There
are many variants, and they can be used to impose special structure - for instance the additive penalties
J(f) = Σ_{j=1}^p J(f_j) can be used with the additive functions f(X) = Σ_{j=1}^p f_j(X_j) to create additive models
with smooth coordinate functions. In projection pursuit regression models, f(X) = Σ_{m=1}^M g_m(α_m^T X) for
adaptively chosen directions α_m. Each function g_m can have an associated roughness penalty.
Penalty function methods are also known as regularization methods. They express a prior belief that the functions
exhibit smoothness, and can be cast in a Bayesian framework. The penalty J corresponds to a log-prior,
and PRSS(f, λ) to the log-posterior distribution. Minimising PRSS(f, λ) corresponds to finding the
posterior mode.
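To make the roughness-penalty idea concrete, the sketch below fits a one-dimensional smoothing spline with scipy; its smoothing factor s plays a role analogous to λ (this is an illustrative stand-in on synthetic data, not the text's own implementation of the PRSS criterion):

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, size=100))
y = np.sin(x) + 0.2 * rng.normal(size=100)

# Larger s -> heavier smoothing, so rough fits are penalized more; s = 0 interpolates.
smooth = UnivariateSpline(x, y, k=3, s=2.0)
interp = UnivariateSpline(x, y, k=3, s=0.0)

grid = np.linspace(0, 10, 5)
print(smooth(grid))
print(interp(grid))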

8.6.2 Kernel Methods and Local Regression


Kernel methods can be thought of as explicitly providing estimates of the regression function or conditional
expectation by specifying the nature of the local neighborhood, and the class of regular functions
fitted locally. The ‘local neighbourhood’ is specified by a kernel function Kλ(x0, x), assigning weights to
points x in a region around x0. An example is the Gaussian kernel, which has a weight function based on
the Gaussian density,

Kλ(x0, x) = (1/λ) exp{−‖x − x0‖^2 / (2λ)},

and assigns weights to points that decay exponentially with their squared Euclidean distance to x0.
The parameter λ corresponds to the variance of the Gaussian density, varying the width of the neighbourhood. A
simple kernel estimate is the Nadaraya-Watson weighted average, with form

f̂(x0) = Σ_{i=1}^N Kλ(x0, x_i) y_i / Σ_{i=1}^N Kλ(x0, x_i).

The Nadaraya-Watson method is discussed in Section 8.11.1. In general, the local regression estimate
f_{θ̂}(x0) of f(x0) is such that θ̂ minimizes

RSS(fθ, x0) = Σ_{i=1}^N Kλ(x0, x_i)(y_i − fθ(x_i))^2,

and fθ is some parameterized function, such as a low-order polynomial. Some examples are (i) fθ(x) = θ0,
resulting in the Nadaraya-Watson estimate, and (ii) fθ(x) = θ0 + θ1 x, resulting in the local linear
regression model. Nearest-neighbour methods can also be thought of as kernel methods, with the metric
K_k(x, x0) = I(‖x − x0‖ ≤ ‖x_{(k)} − x0‖), where x_{(k)} is the training observation ranked k-th in distance
from x0.
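The Nadaraya-Watson estimator above is simple enough to implement directly; a minimal numpy sketch (with synthetic data and an arbitrarily chosen bandwidth λ) follows:

import numpy as np

def gaussian_kernel(x0, x, lam):
    # K_lambda(x0, x) proportional to exp(-||x - x0||^2 / (2 lambda)); the 1/lambda
    # factor cancels in the Nadaraya-Watson ratio, so it is omitted here.
    return np.exp(-((x - x0) ** 2) / (2.0 * lam))

def nadaraya_watson(x0, x_train, y_train, lam=0.5):
    w = gaussian_kernel(x0, x_train, lam)
    return np.sum(w * y_train) / np.sum(w)

rng = np.random.default_rng(3)
x_train = rng.uniform(0, 10, size=200)
y_train = np.sin(x_train) + 0.2 * rng.normal(size=200)
print(nadaraya_watson(5.0, x_train, y_train))   # compare against sin(5) ~ -0.96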

8.6.3 Basis Functions and Dictionary Methods


This class models f as a linear expansion of basis functions fθ(x) = Σ_{m=1}^M θ_m h_m(x), where each h_m is a
function of the input x and the model is linear in θ. The class of linear and polynomial expansions falls under this
category, among other methods. In some cases, the sequence of basis functions is prescribed, such as the
basis for polynomials in x of degree M. For one-dim x, polynomial splines of degree K can be represented by
an appropriate sequence of M spline basis functions, determined in turn by M − K − 1 knots. These produce
functions that are piecewise polynomials of degree K between knots, joined up with continuity of
degree K − 1 at the knots. For instance, consider the linear splines, or piecewise linear functions. An
intuitively satisfying basis consists of the functions b1(x) = 1, b2(x) = x, b_{m+2}(x) = (x − t_m)_+, m ∈ [M − 2],
where t_m is the m-th knot and z_+ denotes the positive part. Tensor products of spline bases can be
used for inputs with dimension larger than one. The parameter M controls the degree of the polynomial or the number
of knots in the splines.
Radial basis functions are symmetric p-dim kernels located at particular centroids,

fθ(x) = Σ_{m=1}^M Kλ_m(µ_m, x) θ_m.

A good example is the Gaussian kernel Kλ(µ, x) = exp{−‖x − µ‖^2 / (2λ)}. Radial basis functions have centroids
µ_m and scales λ_m that need to be determined; spline basis functions have knots. Treating these as free parameters can
change the problem from a straightforward one with closed form solutions to a computationally difficult
non-linear problem. Techniques such as greedy algorithms or two-stage processes can be used.
A single-layer feed-forward neural network with linear output weights can be thought of as an
adaptive basis function method, with a model of the form fθ(x) = Σ_{m=1}^M β_m σ(α_m^T x + b_m), where
σ(x) = 1/(1 + e^{−x}) is the activation function. As in projection pursuit, the directions α_m and biases b_m have
to be determined. Methods with adaptively chosen basis functions are known as dictionary methods, where one has
available a possibly infinite set or dictionary D of candidate basis functions to choose from; building the model
then amounts to a search problem over the dictionary.
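A short sketch of a radial basis expansion with fixed, hand-chosen centroids and a common scale, fitted by least squares; if the centroids and scales were treated as free parameters the problem would instead be non-linear, as noted above (the grid of centroids and the data are assumptions of this sketch):

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + 0.2 * rng.normal(size=300)

centroids = np.linspace(0, 10, 8)   # mu_m fixed on a grid
lam = 1.0                           # common scale lambda_m

# Design matrix of Gaussian kernels K_lambda(mu_m, x); theta enters linearly.
Phi = np.exp(-((x[:, None] - centroids[None, :]) ** 2) / (2.0 * lam))
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)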

8.7 Model Selection, Bias-Variance Tradeoff


The models described in general have either a smoothing or complexity parameter that determines the
restriction of the model, such as the (i) λ penalty scalar, (ii) kernel width and (iii) number of basis
functions.
Consider the KNN case again, with Y = f(X) + ε, E[ε] = 0, Var[ε] = σ^2. Assume for simplicity that the
values of x_i in the sample are fixed and nonrandom. Then the expected prediction error at x0, known as the
test/generalization error, can be decomposed into

EPE_k(x0) = E[(Y − f̂_k(x0))^2 | X = x0]                                      (2985)
          = σ^2 + Bias^2(f̂_k(x0)) + Var_T(f̂_k(x0))                           (2986)
          = σ^2 + [f(x0) − (1/k) Σ_{l=1}^k f(x_{(l)})]^2 + σ^2/k.             (2987)

Note that the parentheses in x_{(l)} indicate the order statistics for the KNN selection. The first term, σ^2, is known
as the irreducible error, and is present even if we knew the true f. We are able to control the other two
terms, which constitute the mean squared error of f̂_k(x0) in the estimation of f(x0), where we have the bias
and variance decomposition.

Definition 371 (Bias of Estimate). The bias is the difference between the true mean value of a random
variable and the expectation of our estimate. In the MSE decomposition of the generalization error, this
can be formulated ET [fˆk (x0 ) − f (x0 )], where f (x0 ) is the non-random conditional mean at x0 .

If the true function is reasonably smooth, the bias term likely increases with k in the nearest
neighbour method.
The variance term, on the other hand, decreases as the inverse of k in nearest neighbours. In
the general function approximation paradigm, as the model complexity of our procedure increases, the
variance tends to increase and the (squared) bias tends to decrease. We want to choose the model complexity
so that there is an attractive tradeoff between the bias and variance, so that the test error is minimized.
Although we can estimate the test error with the training error (1/N) Σ_i (y_i − ŷ_i)^2, this is often not a good
estimate as it does not account for model complexity. The generalization issues and bias-variance tradeoff
lead to the classical U-shaped test error curve in relation to model complexity.
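A rough simulation of the k-nearest-neighbour tradeoff (a sketch; the target function, noise level and sample sizes are arbitrary) shows the training error falling as complexity grows (small k) while the test error follows the U-shape described above:

import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(3 * x)
x_tr = rng.uniform(0, 3, size=200); y_tr = f(x_tr) + 0.3 * rng.normal(size=200)
x_te = rng.uniform(0, 3, size=200); y_te = f(x_te) + 0.3 * rng.normal(size=200)

def knn_predict(x0, k):
    idx = np.argsort(np.abs(x_tr - x0))[:k]   # indices of the k nearest training points
    return y_tr[idx].mean()

for k in [1, 5, 25, 100]:
    tr = np.mean([(y - knn_predict(x, k)) ** 2 for x, y in zip(x_tr, y_tr)])
    te = np.mean([(y - knn_predict(x, k)) ** 2 for x, y in zip(x_te, y_te)])
    print(k, round(tr, 3), round(te, 3))
# k = 1 gives zero training error but a poor test error; very large k underfits both.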

8.8 Least Squares Regression Methods


8.8.1 Simple Least Squares
The simple linear regression model assumes the data follows the relationship

y = β0 + β1 x + ε                                                             (2988)

where ε is the random error with assumed E(ε) = 0, Var(ε) = σ^2 (constant). The conditional mean response at
x shall then be E(Y |X = x) = µ_{y|x} = β0 + β1 x, and the conditional variance at x is
Var(Y |X = x) = σ^2_{y|x} = Var(β0 + β1 x + ε) = σ^2. Note then how the variance is assumed constant in this model -
we shall see how heteroscedasticity affects our model interpretations later on. Our least squares problem
reduces to the problem of finding coefficient estimates and ensuring their adequacy. For samples taken
from the population regression model, we assume i ≠ j ⟹ ε_i ⊥ ε_j, that is, errors (and hence responses)
are uncorrelated. In the simple linear regression equation (Equation 2988), the slope β1 specifies the
change in the mean of the distribution of y under a unit change in x.

8.8.1.1 Assumptions of the Simple Linear Equation

1. ∀i, j, i ≠ j ⟹ ε_i ⊥ ε_j

2. σ^2 = Var(ε) = Var(ε|X = x) = k ∈ R (constant).

8.8.1.2 Model Fitting

The hypothesis space of our simple linear regression problem is the set of all candidate lines specified by
{(β0, β1) : (β0, β1) ∈ R^2}. From the sample regression model y_i = β0 + β1 x_i + ε_i, i ∈ [n], we choose the
model that minimises the sum of squared errors S(β0, β1) = Σ_i ε_i^2 = Σ_i (y_i − β0 − β1 x_i)^2. We can easily
derive by standard calculus the estimators β̂0, β̂1 by solving the linear equations

δS/δβ0 = −2 Σ_i (y_i − β̂0 − β̂1 x_i) = 0                                     (2989)
δS/δβ1 = −2 Σ_i x_i (y_i − β̂0 − β̂1 x_i) = 0,    with solutions              (2990)

n β̂0 + β̂1 Σ_i x_i = Σ_i y_i                                                 (2991)
β̂0 Σ_i x_i + β̂1 Σ_i x_i^2 = Σ_i y_i x_i                                     (2992)

From Equation 2991 we have β̂0 = ȳ − β̂1 x̄, and substituting into Equation 2992 we obtain

(ȳ − β̂1 x̄) Σ_i x_i + β̂1 Σ_i x_i^2 = Σ_i y_i x_i,                           (2993)
β̂1 (Σ_i x_i^2 − x̄ Σ_i x_i) = Σ_i y_i x_i − ȳ Σ_i x_i                        (2994)

This gives solution

β̂1 = [Σ_i y_i x_i − (Σ_i y_i)(Σ_i x_i)/n] / [Σ_i x_i^2 − (Σ_i x_i)^2/n]     (2995)
    = Sxy / Sxx                                                              (2996)
where

Sxx = Σ_i x_i^2 − (Σ_i x_i)^2/n = Σ_i x_i^2 − (n x̄)^2/n                      (2997)
    = Σ_i x_i^2 − n x̄^2 = Σ_i x_i^2 + n x̄^2 − 2n x̄^2                        (2998)
    = Σ_i (x_i^2 + x̄^2 − 2 x_i x̄) = Σ_i (x_i − x̄)^2                         (2999)
    = (n − 1) s^2                                                            (3000)

and

Sxy = Σ_i y_i x_i − (Σ_i y_i)(Σ_i x_i)/n = Σ_i y_i x_i − (n ȳ)(n x̄)/n = Σ_i y_i x_i − n ȳ x̄   (3001)
    = Σ_i y_i (x_i − x̄)                                                                        (3002)

and we obtain the fitted model ŷ = β̂0 + β̂1 x and the residual on sample i, e_i = y_i − β̂0 − β̂1 x_i. The least
squares solution can be computed by the routine:

numpy.linalg.lstsq(a, b, rcond=’warn’)

that computes the vector x that approximately solves the equation a @ x = b.
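For instance, a minimal use of this routine for the simple model (on a synthetic sample; the column of ones supplies the intercept) might look as follows, cross-checked against the closed-form Sxy/Sxx solution derived above:

import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 5, size=100)
y = 2.0 + 1.5 * x + 0.5 * rng.normal(size=100)

A = np.column_stack([np.ones_like(x), x])            # design matrix [1, x]
(beta0_hat, beta1_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum(y * (x - x.mean()))
print(beta1_hat, Sxy / Sxx)                          # slope computed two ways
print(beta0_hat, y.mean() - (Sxy / Sxx) * x.mean())  # intercept computed two ways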

8.8.1.3 Model Properties and Variance of Estimates


Recalling that β̂1 = Sxy/Sxx = Σ_i y_i (x_i − x̄)/Sxx, we may express it in the form Σ_i c_i y_i, where c_i = (x_i − x̄)/Sxx. We see
easily that Σ_i c_i = 0, and Σ_i c_i x_i = Σ_i x_i (x_i − x̄) / Σ_i (x_i − x̄)^2 = Σ_i (x_i^2 − x_i x̄) / Σ_i (x_i^2 + x̄^2 − 2 x_i x̄) = (Σ_i x_i^2 − n x̄^2) / (Σ_i x_i^2 + n x̄^2 − 2n x̄^2) = 1. Finally,
Σ_i c_i^2 = Σ_i (x_i − x̄)^2 / Sxx^2 = 1/Sxx. That is, the slope estimator β̂1 is a linear combination of the observations y_i.
The least squares estimates are unbiased, in that Eβ̂1 = β1, Eβ̂0 = β0. The variance Var(β̂1) =
Var(Σ_{i=1}^n c_i y_i) = Σ_i c_i^2 Var(y_i) = σ^2/Sxx, since we showed Σ_i c_i^2 = 1/Sxx and we assumed the errors (and
accordingly the responses) are uncorrelated. The intercept variance follows Var(β̂0) = Var(ȳ − β̂1 x̄) = Var(ȳ) +
x̄^2 Var(β̂1) − 2x̄ Cov(ȳ, β̂1) = σ^2 (1/n + x̄^2/Sxx), where the covariance term drops off since Cov((1/n) Σ_i y_i, Σ_i c_i y_i) =
(1/n) Σ_i c_i Var(y_i) and Σ_i c_i = 0.
The ordinary least squares estimators (β̂_i), i ∈ [p+1], are the best linear unbiased estimators, in that they
have the smallest variance among the unbiased estimators formed from linear combinations
of the y_i. Some useful results arising from the simple least squares method are that Σ_i (y_i − ŷ_i) = Σ_i e_i = 0,
which implies Σ_i y_i = Σ_i ŷ_i. Not only is the sum of residuals zero, the regressor-weighted and
fitted-value-weighted errors are also zero: Σ_i x_i e_i = 0, Σ_i ŷ_i e_i = 0. The fitted line always passes
through (x̄, ȳ).
We may obtain a point estimate of the conditional variance of y given x using the residual sum of squares
SSres = Σ_i e_i^2 = Σ_i (y_i − ŷ_i)^2. The residual sum of squares is a component of the total sum of squares,
which is relevant to the unconditional variance. Writing the total sum of squares SST = Σ_i (y_i − ȳ)^2 and
noting ŷ_i = β̂0 + β̂1 x_i, then


Σ_i (y_i − ȳ)^2 = Σ_i (y_i − ŷ_i + ŷ_i − ȳ)^2                                                    (3003)
              = Σ_i (y_i − ŷ_i)^2 + Σ_i (ŷ_i − ȳ)^2 + 2 Σ_i [(y_i − ŷ_i)(ŷ_i − ȳ)]               (3004)
              = Σ_i (y_i − ŷ_i)^2 + Σ_i (ŷ_i − ȳ)^2 + 2 Σ_i ŷ_i (y_i − ŷ_i) − 2 Σ_i ȳ (y_i − ŷ_i) (3005)
              = Σ_i (y_i − ŷ_i)^2 + Σ_i (ŷ_i − ȳ)^2 + 2 Σ_i ŷ_i e_i − 2 Σ_i ȳ e_i                 (3006)
              = Σ_i (y_i − ŷ_i)^2 + Σ_i (ŷ_i − ȳ)^2 + 2 Σ_i ŷ_i e_i − 2 ȳ Σ_i e_i                 (3007)
              = Σ_i (y_i − ŷ_i)^2 + Σ_i (ŷ_i − ȳ)^2                                               (3008)
              = SSres + Σ_i (ŷ_i − ȳ)^2                                                           (3009)
we see that the total unconditional dispersion can be broken down into the components explained by the
model and unexplained by our model. That is, SST = SSres + SSreg, where SSreg is the regression sum
of squares, defined as Σ_i (ŷ_i − ȳ)^2.
We may show (verify this) that E[SSres] = (n − 2)σ^2. An unbiased estimator σ̂^2 for σ^2 is SSres/(n − 2).

Definition 372 (Residual Mean Square). The value SSres/(n − 2) is called the residual mean square and is an
unbiased estimator of σ^2, the conditional variance of the response at a given input. The value σ̂ = √(MSres)
shall be called the residual standard error, or equivalently, the standard error of the regression. n − 2 indicates
the degrees of freedom in the residual sum of squares (SSres), attributed to the loss of freedom from estimating
β̂0, β̂1. The estimate σ̂ depends on SSres, and requires that the model assumptions of independent errors and
constant variance be satisfied.

8.8.1.4 Assumptions of the Analysis of Model on Simple Linear Equations

1. The assumptions of the model also apply, for obvious reasons, as assumptions in the analysis of
model. That is, we assume uncorrelated errors with constant variance and mean zero.

2. Additionally, we assume that ε_i ∼ Φ(0, σ^2). In fact, they are identically and independently distributed normal
random variables, implying that (i) for each value/level of the regressor variable, the sub-population
of responses follows a normal distribution and (ii) each such sub-population shares the constant variance
σ^2.

8.8.1.5 Test of Significance on Regression Coefficients

We may perform hypothesis testing for the significance of regression coefficients, for instance under
settings as follows:
H0 : β1 = β10 H1 : β1 6= β10

with test statistic

Z0 = (β̂1 − β10)/sd(β̂1) = (β̂1 − β10)/√(σ^2/Sxx) ∼ N(0, 1).

Since β̂1 is a linear combination of the y_i and y_i ∼ N(β0 + β1 x_i, σ^2), β̂1 must follow a normal distribution.
However, most often σ^2 is unknown and we use the estimator σ̂^2, with test statistic

t0 = (β̂1 − β10)/√(MSres/Sxx) ∼ t_{n−2},

which rejects the null hypothesis in a two-sided test under the condition |t0| > t_{n−2}(α/2), where t_{n−2}(α/2)
indicates the percentile point of a t-distribution with (n − 2) degrees of freedom and α/2 right-tail probability.
It is obvious from the test statistic that SE(β̂1) = √(MSres/Sxx). For a test of the intercept, our equivalent (and
abbreviated) steps would follow H0 : β0 = β00, SE(β̂0) = √(MSres (1/n + x̄^2/Sxx)), with the test statistic following a
t-distribution with (n − 2) degrees of freedom. The null H0 : β1 = 0 implies there is no linear relationship between
y and x supported by the data, and its rejection implies that x helps explain the variability of the response.

8.8.1.6 Test of Significance on Regression Model and ANOVA Methods

We may test for the significance of the regression model by testing if any of the β coefficients are unlikely
to be zero. In the case of the simple linear model, this turns out to be equivalent to the t-test on the
regression coefficient, since we only have one. We re-iterate here in short: H0 : β1 = 0, t0 = (β̂1 − 0)/se(β̂1).
The more general method is known as the Analysis of Variance (ANOVA) method. In Equation 3003 we
demonstrated that the total sum of squares may be decomposed into the sum of squared residuals and
the regression/model sum of squares. Re-iterating:

SST = Σ_i (y_i − ȳ)^2 = Σ_i (ŷ_i − ȳ)^2 + Σ_i (y_i − ŷ_i)^2 = SSreg + SSres.

Since SST has the constraint that Σ_i (y_i − ȳ) = 0, it has (n − 1) degrees of freedom. SSres has
(n − 2) degrees of freedom, and since SSreg = β̂1 · Sxy (verify this) is determined once β̂1 is decided, it
has one degree of freedom. Both sides match.
To test the hypothesis H0 : β1 = 0, we arrive at the following conclusions, conditional on the null:

SSres/σ^2 = (n − 2) MSres/σ^2 ∼ χ^2_{n−2}                                    (3010)
SSreg/σ^2 ∼ χ^2_1                                                            (3011)
SSres ⊥ SSreg.                                                               (3012)

Under H0, this amounts to the test statistic

F0 = (SSreg/1) / (SSres/(n − 2)) ∼ F_{1,(n−2)},

rejecting the null hypothesis when F0 > F_{1,(n−2)}(α). We often call the terms SSreg/1 = MSreg the
regression mean square and SSres/(n − 2) = MSres the residual mean square. It can be shown that the t-test
and F-test in the simple linear model are identical, since

t0 = β̂1 / √(MSres/Sxx)

t0^2 = β̂1^2 Sxx / MSres = β̂1 Sxy / MSres = F0.
The ANOVA tables may be generated in code using the following:

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# 'data' is a DataFrame holding the response y and regressor x.
model = ols('y ~ x', data=data).fit()
anova = anova_lm(model, typ=2)

8.8.1.7 Confidence Intervals on Parameters and Variance Estimates

Assuming the same assumptions as for the hypothesis tests (see Section 8.8.1.4) are satisfied, we may derive
confidence intervals on the β coefficients. Assuming IID errors, the sampling distribution for β_i, i ∈ {0, 1}, is
(β̂_i − β_i)/SE(β̂_i) ∼ t_{n−2}. The standard errors of these estimates were discussed in
Section 8.8.1.5, and their 100(1 − α)% confidence intervals are constructed as β_i ∈ [β̂_i ± t_{n−2}(α/2) · SE(β̂_i)].
The confidence interval for σ^2 corresponds to the interval

[(n − 2) MSres / χ^2_{n−2}(α/2), (n − 2) MSres / χ^2_{n−2}(1 − α/2)].
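In practice these intervals can be read off directly from a fitted statsmodels OLS result; a brief sketch on a synthetic sample (the data and α are arbitrary choices) follows:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 5, size=60)
y = 1.0 + 2.0 * x + rng.normal(size=60)

X = sm.add_constant(x)                 # adds the intercept column
res = sm.OLS(y, X).fit()
print(res.conf_int(alpha=0.05))        # 95% intervals beta_i +/- t_{n-2}(alpha/2) SE(beta_i)
print(res.mse_resid)                   # MS_res, the unbiased estimate of sigma^2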

8.8.1.8 Confidence Intervals and Prediction Intervals on Response

Confidence Intervals
The regression function E(Y |X) gives point estimates for the conditional mean of the response, given input
regressor variables. The point estimator for E(y|x0), Ê(y|x0) = µ̂_{y|x0} = β̂0 + β̂1 x0, is an unbiased
estimator, and we can derive its variance. Recall that ȳ = β̂0 + β̂1 x̄, then β̂0 = ȳ − β̂1 x̄ and

Var(µ̂_{y|x0}) = Var(β̂0 + β̂1 x0) = Var(ȳ + β̂1 (x0 − x̄)) = σ^2/n + (x0 − x̄)^2 Var(β̂1) = σ^2/n + (x0 − x̄)^2 σ^2/Sxx.

We therefore arrive at a confidence interval for the conditional mean response, with the following test
statistic:

(µ̂_{y|x0} − E(y|x0)) / √(MSres · (1/n + (x0 − x̄)^2/Sxx)) ∼ t_{n−2}          (3013)

and confidence intervals at 100(1 − α)%:

µ̂_{y|x0} ± t_{n−2, α/2} √(MSres (1/n + (x0 − x̄)^2/Sxx))                     (3014)

which has minimal interval width at x0 = x̄ and widens with |x0 − x̄|. The standard error in the
computation takes into account the sampling error. We interpret this as follows: if we keep repeating the sampling
N times, then approximately 100(1 − α)% of the N confidence intervals constructed contain the true sub-population
mean.
Prediction Intervals
Another application is the prediction of a new observation y given some specified x = x0. The point
estimate for this response is the same as the point estimate for the conditional mean response. In this
part we are interested in making statistical conclusions about ŷ0 instead of Ê(y|x0). Consider the random
variable ψ = y0 − ŷ0, which takes the normal distribution Φ(0, Var(ψ)), and Var(ψ) = Var(y0 − ŷ0) = Var(y0) +
Var(ŷ0) − 2Cov(y0, ŷ0) = σ^2 + σ^2 (1/n + (x0 − x̄)^2/Sxx) = σ^2 (1 + 1/n + (x0 − x̄)^2/Sxx), since future observations are
necessarily independent of ŷ0. This results in the prediction interval

ŷ0 ± t_{n−2, α/2} √(MSres (1 + 1/n + (x0 − x̄)^2/Sxx)),                      (3015)

which takes into account both the sampling error and the variability of individuals around the
predicted conditional mean. It follows that the prediction interval (also a confidence interval, technically)
is a superset of the confidence interval relating to the conditional mean.
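Both interval types are available from statsmodels' get_prediction; a short sketch on synthetic data is given below - the mean_ci_* columns of the summary frame correspond to the confidence interval of Equation 3014 and the obs_ci_* columns to the prediction interval of Equation 3015:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
data = pd.DataFrame({"x": rng.uniform(0, 10, size=80)})
data["y"] = 3.0 + 0.7 * data["x"] + rng.normal(size=80)

res = smf.ols("y ~ x", data=data).fit()
pred = res.get_prediction(pd.DataFrame({"x": [2.0, 5.0, 9.5]}))
# The prediction (obs_ci) interval is always wider than the mean (mean_ci) interval.
print(pred.summary_frame(alpha=0.05))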

8.8.1.9 No Intercept Models

The no-intercept model y = β1 x + ε may be specified when the intersection with the origin is
part of the problem specification. The estimators take different forms, although the principle
is the same. We have S(β1) = Σ_i (y_i − β1 x_i)^2, the least-squares equation β̂1 Σ_i x_i^2 = Σ_i y_i x_i, giving the unbiased
estimator

β̂1 = Σ_i y_i x_i / Σ_i x_i^2

and fitted regression model ŷ = β̂1 x. We have an estimator σ̂^2 for the conditional variance,

σ̂^2 = MSres = Σ_i (y_i − ŷ_i)^2 / (n − 1) = (Σ_i y_i^2 − β̂1 Σ_i y_i x_i) / (n − 1).

8.8.1.10 Coefficient of Determination, R2
The quantity R^2 = SSreg/SST = 1 − SSres/SST is known as the coefficient of determination, measuring the
proportion of variability in response explained by the regressors and our model. We have 0 ≤ SSres ≤
SST =⇒ R2 ∈ [0, 1] - but note that adding more terms monotonically increases the coefficient of
determination. This does not indicate the appropriateness or complexity of our model!

8.8.1.11 Maximum Likelihood Estimators vs Simple Least Squares

It can be shown that the method of least squares in parameter estimation is identical to maximum likelihood
estimation when the errors are assumed normal. Referring to the maximum likelihood method (see Definition 334), we have y_i ∼
N(β0 + β1 x_i, σ^2) when the errors are assumed IID N(0, σ^2). Using the Gaussian log-likelihood function in
Equation (2598):

L(θ) = −(N/2) log(2π) − N log σ − (1/(2σ^2)) Σ_{i=1}^N (y_i − fθ(x_i))^2       (3016)
     = −(N/2) log(2π) − N log σ − (1/(2σ^2)) Σ_{i=1}^N (y_i − β0 − β1 x_i)^2,  (3017)

which by taking derivatives results in the linear equations

δ log L/δβ0 = (1/σ̃^2) Σ_i (y_i − β̃0 − β̃1 x_i) = 0                           (3018)
δ log L/δβ1 = (1/σ̃^2) Σ_i (y_i − β̃0 − β̃1 x_i) x_i = 0                       (3019)
δ log L/δσ^2 = −n/(2σ̃^2) + (1/(2σ̃^4)) Σ_i (y_i − β̃0 − β̃1 x_i)^2 = 0        (3020)

and solutions

β̃0 = ȳ − β̃1 x̄                                                              (3022)
β̃1 = Σ_i y_i (x_i − x̄) / Σ_i (x_i − x̄)^2                                    (3023)
σ̃^2 = Σ_i (y_i − β̃0 − β̃1 x_i)^2 / n                                         (3024)
which give unbiased estimators of β0, β1 and a biased estimator of σ^2. The relationship between the least
squares estimator and the MLE estimator follows the equation

σ̃^2 = ((n − 2)/n) σ̂^2,

and we see that the bias is small when n is large. In general, the MLE estimators have better statistical
properties than the least-squares estimators, exhibiting consistency (see 333), asymptotic efficiency (see
336) and sufficiency (see 338). The downside of MLE is that it requires distributional assumptions about
the errors - that they are IID ∼ Φ(0, σ^2).

8.8.2 Multiple Least Squares
We may generalize the simple least squares model to multiple regressors, say k of them. Then our
population regression model gives

y = β0 + Σ_{i=1}^k β_i x_i + ε.

The βi , i ∈ [k] are known as regression coefficients and reflect the expected change in response given unit
change in corresponding regressor, ceteris paribus. We note that ‘linearity’ is in coefficients, and the
model may take general forms such as higher order polynomials.

8.8.2.1 Interaction Effects

The regressors may contain interaction effects, which are defined as terms formed from a function of
two ‘atomic’ regressors, each of which should already be included in the model. An example of such a formulaic
relationship could take the form y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε, and we note that in this model
the level of change expected in y given a unit change in x1 depends on the level of x2! Again, here we have
a non-linear surface generated by the regression equation, but we maintain linearity in the regression
coefficients and all conclusions apply.

8.8.2.2 Assumptions and Model Notations

Let n denote the sample size as usual, and k the number of regressors. y_i, i ∈ [n], is the i-th response, with
x_ij the i-th observation of the regressor x_j.
We assume that E(ε) = 0, Var(ε) = σ^2 - in particular that the errors are centered at zero and have constant
variance. The sample regression model takes the form y_i = β0 + Σ_{j=1}^k β_j x_ij + ε_i, i ∈ [n].

8.8.2.3 Model Fitting


Taking the least squares function Σ_i ε_i^2 = Σ_i (y_i − β0 − Σ_{j=1}^k β_j x_ij)^2 and computing the minima, we obtain the
linear equations

δS/δβ0 = −2 Σ_{i=1}^n (y_i − β̂0 − Σ_{j=1}^k β̂_j x_ij) = 0                   (3025)
δS/δβ_{j∈[k]} = −2 Σ_{i=1}^n (y_i − β̂0 − Σ_{j=1}^k β̂_j x_ij) x_ij = 0       (3026)

resulting in p = k + 1 normal equations. Encoding the response vector as y ∈ R^{n·1}, the regressors as X ∈ R^{n·p},
and β ∈ R^{p·1}, ε ∈ R^{n·1}, we may rewrite the sample regression model in matrix form y = Xβ + ε. Note
that the matrix X has a leftmost column of 1's and n rows of regressor data row-stacked.
We can equivalently express

S(β) = Σ_i ε_i^2 = ε^T ε = (y − Xβ)^T (y − Xβ)                               (3027)
     = (y^T − β^T X^T)(y − Xβ)                                                (3028)
     = y^T y − y^T Xβ − β^T X^T y + β^T X^T Xβ                                (3029)
     = y^T y + β^T X^T Xβ − 2β^T X^T y                                        (3030)

where in the last step we use the fact that the transpose of a scalar is the identity transform. By matrix
calculus we arrive at

δS/δβ = (X^T Xβ)^T + (β^T X^T X) − 2(X^T y)^T = 2(β^T X^T X) − 2y^T X,        (3031)

giving the solution X^T Xβ = X^T y and regression coefficient

β̂ = (X^T X)^{-1} X^T y.                                                       (3032)

This gives solution if ∃(XT X)−1 . This exists if the regressors (columns) are linearly independent. This
is when X is full rank (Definition 62). When X is not full rank, then there can be multiple least-squares
solution. A linear algebraic treatment to finding the general solution for the least-squares method when
X is non-full rank are discussed in Theorem 64. Our fitted regression model is written ŷ = xβ̂, where x
is row vector of [1, x1 , x2 · · · xk ]. Continuing, substitute regression coefficient estimates (Equation 3032),
and we obtain

ŷ = Xβ̂ = X(XT X)−1 XT y = Hy (3033)

where

Definition 373 (Hat Matrix). H = X(XT X)−1 XT ∈ Rn,n is known as the hat matrix, for putting a
’hat’ on the response vector.

Corollary 24 (Residuals in Terms of Hat Matrix). Defining the residual terms ei = yi − ŷi , the n
residuals may be written in matrix form:

e = y − ŷ = y − Xβ̂ = y − Hy = (1 − H)y (3034)

Note that the hat matrix H is symmetric and idempotent (that is HH = H), and is also often called
the projection matrix (projections have the property P 2 = P ). The matrix (1 − H) is also symmetric
and idempotent.
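A small numpy sketch computing β̂, the hat matrix and its leverage diagonals for a synthetic design matrix (purely illustrative; the dimensions and coefficients are arbitrary):

import numpy as np

rng = np.random.default_rng(9)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # intercept + k regressors
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T                       # hat matrix: y_hat = H y
e = (np.eye(n) - H) @ y                     # residuals e = (1 - H) y

print(np.allclose(H @ H, H))                # idempotent
print(H.diagonal().sum(), X.shape[1])       # trace(H) = p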

8.8.2.4 Model Properties and Variance of Estimates

Note that

Eβ̂ = E[(X^T X)^{-1} X^T y]                                                   (3035)
    = E[(X^T X)^{-1} X^T (Xβ + ε)]                                            (3036)
    = E[(X^T X)^{-1} X^T Xβ + (X^T X)^{-1} X^T ε]                             (3037)
    = E[1β + (X^T X)^{-1} X^T ε]                                              (3038)
    = β                                                                       (3039)

where the last step uses the assumption that the errors have zero mean to arrive at unbiased regression
coefficient estimates. Furthermore, we have

Cov(β̂) = E[(β̂ − Eβ̂)(β̂ − Eβ̂)^T],                                           (3040)

which is a symmetric positive semi-definite matrix in R^{p,p}. The diagonals are the variances of the
coefficient estimates. We have

Cov(β̂) = Cov((X^T X)^{-1} X^T y)                                             (3041)
        = (X^T X)^{-1} X^T Cov(y) [(X^T X)^{-1} X^T]^T                        (3042)
        = σ^2 (X^T X)^{-1} X^T [(X^T X)^{-1} X^T]^T                           (3043)
        = σ^2 (X^T X)^{-1} X^T X (X^T X)^{-1}                                 (3044)
        = σ^2 1 (X^T X)^{-1}                                                  (3045)
        = σ^2 (X^T X)^{-1}                                                    (3046)

Writing C = (X^T X)^{-1}, we have Cov(β̂_i, β̂_j) = σ^2 C_ij. Similar to the simple least squares approach, in
the estimation of p parameters we arrive at an SSres with (n − p) degrees of freedom. As before, we obtain
the definition

Definition 374 (Residual Mean Square). The residual mean square, defined

MSres = SSres / (n − p),                                                      (3047)

is an unbiased estimator of σ^2.

Note that this estimator is model dependent.

8.8.2.5 Assumptions for Analysis of Multiple Least Squares Regression

In addition to the assumptions specified for the model in Section (8.8.2.2), we need to assume here that
the random errors satisfy

ε_i ∼ Φ(0, σ^2) (IID).

8.8.2.6 Significance Tests for Regression Coefficients by t-tests and Partial Sum of Squares
Method

We may test separately the contribution of a particular regressor in explaining the variability of the
response, or any subset of them.
T-Test
To test the significance of any individual regression coefficient β_j we perform the (abbreviated) test H0 :
β_j = 0, t0 = β̂_j / √(σ̂^2 C_jj) ∼ t_{n−p}, where C_jj is the j-th diagonal of (X^T X)^{-1}. This test is called the
partial/marginal test of significance of β̂_j given all other regressors already present in the model.
Partial/Extra Sum-of-Squares Method
To determine the contribution to SSreg of a particular set of regressors given other regressors already in
the model, consider the following:

Definition 375 (Partial Sum-of-Squares Method). Consider again the population regression model written
y = Xβ + ε, where β is re-arranged to form the block matrix

β = [β1 β2]^T,

where β1 ∈ R^{p−r,1} and β2 ∈ R^{r,1} for r ≤ p partition the regression coefficients of interest. Then we can
rewrite the model as y = X1 β1 + X2 β2 + ε, and we are interested in H0 : β2 = 0 vs H1 : β2 ≠ 0. Now recall
that in the full model we have β̂ = (X^T X)^{-1} X^T y, and we define its regression sum of squares SSreg(β).
Under the reduced model of the null hypothesis we are left with the regression equation y = X1 β1 + ε,
and we can write the regression coefficient estimates as β̂1 = (X1^T X1)^{-1} X1^T y with regression sum of
squares SSreg(β1). Then we define the partial sum of squares of β2 to be

SSreg(β2|β1) = SSreg(β) − SSreg(β1),

and this has r degrees of freedom. Note that SSreg(β2|β1) ⊥ MSres, and for the null
hypothesis H0 : β2 = 0 we construct the test statistic

F0 = (SSreg(β2|β1)/r) / MSres ∼ F_{r,n−p}

under the null hypothesis, which rejects the null hypothesis if it turns out that F0 > F_{r,n−p}(α), the
implication of which is that ∃j ∈ [k − r + 1, k] s.t. x_j is a statistically significant regressor.

Note that the result of the partial sum of squares in the case r = 1 yields the same conclusion and
p-values as would a t-test. Also, if the columns in X1 are orthogonal to the columns in X2 , then we
must have SSreg (β2 ) = SSreg (β2 |β1 ). When we do significance tests and decide to remove a variable, in
general the regression coefficients of the refitted model are not the same. In the special case of orthogonal
sets, no refitting is required.
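The partial F-test can be carried out by comparing nested statsmodels fits with anova_lm; a sketch on synthetic data (where x2 matters and x3 is constructed to be irrelevant) follows:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(10)
df = pd.DataFrame(rng.normal(size=(120, 3)), columns=["x1", "x2", "x3"])
df["y"] = 1.0 + 2.0 * df.x1 - 1.0 * df.x2 + rng.normal(size=120)   # x3 has no effect

reduced = smf.ols("y ~ x1", data=df).fit()
full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

# F-test of H0: the coefficients excluded from the reduced model are jointly zero,
# i.e. SS_reg(beta_2 | beta_1)/r over MS_res of the full model.
print(anova_lm(reduced, full))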

8.8.2.7 Confidence Interval for Regression Coefficient Estimates

Recall that in Equation 3041 we demonstrated that Cov(β̂) = σ^2 (X^T X)^{-1}. Furthermore, since we know that
β̂ is just a linear combination of the observations, and that we have the assumption that ε_i ∼ Φ(0, σ^2) IID
(and therefore that y_i ∼ Φ(β0 + Σ_j β_j x_ij, σ^2)), we can then conclude

β̂ ∼ Φ(β, σ^2 (X^T X)^{-1})                                                   (3048)

where the marginal distribution of a regression coefficient β̂_{j∈[p]} is ∼ Φ(β_j, σ^2 C_jj), where C_jj is the j-th
diagonal of (X^T X)^{-1}. The (abbreviated) hypothesis test would have a test statistic of the form

(β̂_j − β_j) / √(σ̂^2 C_jj) ∼ t_{n−p}, j ∈ [k]

and confidence interval

β̂_j ± t_{n−p}(α/2) √(σ̂^2 C_jj)

with SE(β̂_j) = √(σ̂^2 C_jj).
Joint Confidence Intervals on Regression Coefficients If we have a least squares model with
estimates (β̂0, β̂1), we each have a 100(1 − α)% confidence interval on the coefficients. Say we want a
95% confidence interval on both parameters; assuming independence (which they are not), their ‘joint
correctness’ is only 0.95^2. We need to consider their joint distribution. It can be shown that (verify):

F0 = (β̂ − β)^T X^T X (β̂ − β) / (p MSres) ∼ F_{p,n−p}                        (3049)

which implies P{F0 ≤ F_{p,n−p}(α)} = 1 − α. Constructing a 100(1 − α)% joint confidence region for all
parameters in β, we obtain the values of β for which the test statistic satisfies F0 ≤ F_{p,n−p}(α).
Bonferroni Correction: we may also control the family-wise error rate by instead using the
confidence intervals:

β̂_j ± t_{α/(2p), n−p} · SE(β̂_j), j ∈ [k].                                   (3050)

8.8.2.8 Confidence Interval and Prediction Intervals of Estimates on Mean Response and
Response Variables

We may want to make certain point estimates and give statistical comments about the relevancy of our
estimates on conditional mean response, or just response.
Confidence Interval for the Conditional Mean Response For some point x0 we want to construct
(conditional) mean response intervals. Recall from Equation 3041 that Cov(β̂) = σ^2 (X^T X)^{-1}. Further
suppose we select the estimator ŷ0 = x0^T β̂, which turns out to be an unbiased estimator of E(y|X = x0).
The variance shall be computed as

Var(ŷ0) = Var(x0^T β̂)                                                        (3051)
        = x0^T Cov(β̂) x0                                                      (3052)
        = σ^2 x0^T (X^T X)^{-1} x0.                                            (3053)

We can therefore define the confidence interval for the mean response, given by:

ŷ0 ± t_{n−p}(α/2) √(σ̂^2 x0^T (X^T X)^{-1} x0)                                (3054)

Prediction Interval for Prediction of Response

For a particular point x0, we are interested in making the 100(1 − α)% prediction interval for the response,
given by

ŷ0 ± t_{n−p}(α/2) √(σ̂^2 (1 + x0^T (X^T X)^{-1} x0)).                         (3055)

8.8.2.9 Significance Tests for Regression Model

To test if there is a linear relationship between our response variable and any regressors (if our model is
significant at all), we may construct hypothesis:

H0 : ∀j ∈ [k], β_j = 0     H1 : ∃j ∈ [k] s.t. β_j ≠ 0.

Recall that we have the decomposition SST = SSres + SSreg. We construct the test statistic:

F0 = (SSreg/k) / (SSres/(n − k − 1)) = MSreg / MSres ∼ F_{k,n−p}              (3056)

In particular, we have (verify this):

SSreg = β̂^T X^T y − (Σ_i y_i)^2 / n                                          (3057)
SSres = y^T y − β̂^T X^T y                                                    (3058)
SST = y^T y − (Σ_i y_i)^2 / n                                                 (3059)

and the random variables follow the distributions (assuming the null is true): SSreg/σ^2 ∼ χ^2_k, SSres/σ^2 ∼ χ^2_{n−k−1},
and SSreg ⊥ SSres. We reject the null hypothesis when F0 > F_{k,n−p}(α).

8.8.2.10 Coefficients of Determination and Adjustments
Similar to the simple least squares coefficient of determination as in Section 8.8.1.10, we have R^2 = SSreg/SST.
We define it formally here.

Definition 376 (Coefficient of Determination). The ‘R^2 value’, also known as the coefficient of determination
or proportion of variation explained by the regressors, is defined

R^2 = SSreg / SST
To penalize model complexity, we have the adjusted coefficient of determination, defined:

Definition 377 (Adjusted Coefficient of Determination).

R^2_adj = 1 − (SSres/(n − p)) / (SST/(n − 1))                                 (3060)

8.8.2.11 Interpretation of Model and Coefficients

In the case of a vanilla model with no interaction effects, the interpretation is obvious in that regression
coefficients are mean change in response per unit regressor change, ceteris paribus. However, we can
have a slightly more nuanced discussion. We can see the coefficients as the contribution of xj to response
y after BOTH y, xj have been linearly adjusted for all other regressors. In particular, consider model
y = β0 + β1 x1 + β2 x2 + ε, and suppose we want to interpret the effect of x2 on y. Then, let the steps
follow:

Model 1 : y ∼ x1 : ŷ = α̂0 + α̂1 x1 , with residual y − ŷ = ey·x1 (3061)


Model 2 : x2 ∼ x1 : x̂2 = γ̂0 + γ̂1 x1 , with residual x2 − x̂2 = ex2 ·x1 (3062)
Model 3 : ey·x1 ∼ ex2 ·x1 : êy·x1 = λ̂0 + λ̂1 ex2 ·x1 , with residual ey·x1 − êy·x1 (3063)

From Model 2 it follows that x2 = γ0 + γ1 x1 + e_{x2·x1} and we can rewrite

y = β0 + β1 x1 + β2 (γ0 + γ1 x1 + e_{x2·x1}) + ε                             (3064)
  = (β0 + β2 γ0) + (β1 + β2 γ1) x1 + (β2 e_{x2·x1} + ε)                      (3065)

Relating to Model 1 we can map α0 = β0 + β2 γ0, α1 = β1 + β2 γ1, e_{y·x1} = β2 e_{x2·x1} + ε. We may perform
a similar exercise to obtain e_{y·x2} = β1 e_{x1·x2} + ε. The general relationship for a multiple linear regression
model can be summarized:

e_{y·x1 x2 ···xj−1 xj+1 ···xk} = β_j e_{xj·x1 x2 ···xj−1 xj+1 ···xk} + ε

where β_j is the contribution of x_j to y after both y and x_j have been linearly adjusted for all other regressors.
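This adjusted-residual interpretation can be verified numerically; the check below (a sketch on synthetic data with correlated regressors, not part of the original text) recovers β2 from a regression of residuals on residuals:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 200
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)          # correlated regressors
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

e_y_x1 = sm.OLS(y, sm.add_constant(x1)).fit().resid     # y adjusted for x1
e_x2_x1 = sm.OLS(x2, sm.add_constant(x1)).fit().resid   # x2 adjusted for x1
partial = sm.OLS(e_y_x1, e_x2_x1).fit()                  # regression through the origin

print(full.params[2], partial.params[0])    # both recover the coefficient on x2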

8.8.2.12 Regressor Variable Hull and Extrapolation of the Input Space

Consider that in the case of higher dimensions, an unseen input vector can be elementwise in the range
of the regressors but lie outside the region of the original data. There is hence a hidden extrapolation.

Definition 378 (Regressor Variable Hull). Consider n data points and training data (xi )i∈[n] , where
xi is k-dimensional vector of input. The smallest convex set containing all of these data points shall be
called the regressor variable hull. The set of points x satisfying

x^T (X^T X)^{-1} x ≤ max_i h_ii = h_max,

where H is the hat matrix (see Definition 373) X(X^T X)^{-1} X^T, forms an ellipsoid enclosing all points
inside the regressor variable hull.

For a new input point given by the p-vector x0^T = [1, (x0j)_{j∈[k]}], we say that using the model to fit x0 is
extrapolation if x0^T (X^T X)^{-1} x0 > h_max.

8.8.2.13 Standardization of Model

In general, the units of β̂_j are in terms of the change in units of y per change in units of x_j. To make
the model dimensionless, we can perform standardization to yield standardized regression coefficients.
Unit Normal Scaling: to conduct unit normal scaling for the regressors we take z_ij = (x_ij − x̄_j)/s_j, i ∈ [n], j ∈ [k],
where x̄_j = (1/n) Σ_i x_ij and s_j^2 = (1/(n−1)) Σ_i (x_ij − x̄_j)^2. For the response, we can similarly form y_i^* = (y_i − ȳ)/s_y. Using
the new scaled response and regressors, we can construct

y^* = Σ_{j=1}^k b_j z_j,

which after standardization has no intercept!


Unit Length Scaling: for unit length scaling of the regressors we form w_ij = (x_ij − x̄_j)/√(S_jj), i ∈ [n], j ∈ [k],
where S_jj is the corrected sum of squares for x_j, with definition

Definition 379 (Corrected Sum of Squares).

S_jj = Σ_{i=1}^n (x_ij − x̄_j)^2.                                             (3066)

The name follows since each new regressor w_j has mean zero and length √(Σ_{i=1}^n (w_ij − w̄_j)^2) = 1. Using
unit length scaling for the response, y_i' = (y_i − ȳ)/√(SST), and fitting the model

ŷ' = Σ_{j=1}^k b_j w_j,

we arrive at some interesting properties. In particular, the matrix W^T W = ρ(X), the correlation matrix,
and Z^T Z = (n − 1) W^T W - that is, the estimates of the regression coefficients from unit normal scaling and unit length
scaling are the same.

8.8.2.14 Indicator Functions

We may also encode qualitative/categorical variables of k levels by a set of k − 1 indicator functions - in fact
any binary function that maps to discrete values shall suffice, but indicators are most used. Here the
assumption that the variance is constant across all levels of the category is used, instead of across a continuous axis.
In the vanilla model we can easily see that the indicator function indicates a change of intercept (seen by
substitution of {0, 1}), with the interpretation as a difference of means. In the case of interaction terms
with continuous variables this can also mean a change in both slope and intercept, which we may also
verify simply by specifying such a model and substituting our own values! Note that an interaction term
requires that the atomic regressors are already included in the model, and we may test for the significance
of the categorical variable using the partial sum of squares method (see Definition 375). Suppose the model
follows y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε, where x2 is the indicator function affecting the intercept and slope
of x1; then we perform the (abbreviated) test H0 : β2 = β3 = 0 with test statistic

F0 = (SSreg(β2, β3|β1, β0)/2) / MSres ∼ F_{2,n−4}.

Note that the ‘partial sum’ comes from the fact that we may do something like SSreg (β2 , β3 |β1 , β0 ) =
SSreg (β2 |β1 , β0 ) + SSreg (β3 |β2 , β1 , β0 ).

8.8.3 Adequacy of The Least Squares Method


We made assumptions in the phase of model fitting and during the phase of model analysis and interpretation
(see Section 8.8.1.4 and Section 8.8.2.2). Obviously, ex-ante we made an assumption on the form
of the model, in that it is parametric and the relationship between response and regressors is approximately
linear, known as the linearity assumption. We also made assumptions about the errors - mostly
that the errors ε have expectation zero and constant variance, known as the constant variance/homogeneity
assumption. In the analysis, we further assumed they are ε ∼ Φ(0, σ^2) IID, which is our normality
assumption and independent error assumption.
A possible method would be the study of residuals, either graphically or in numerical forms. Types
of graphs include one-dimensional graphs (such as histograms, stem-and-leaf plots, dot plots, box plots
et cetera), two-dimensional graphs (such as the scatter plot). However, in the case of multiple linear
regression, even the pairwise correlation matrix (and therefore the scatter plot) may not indicate the
presence of linear relationships between regressors. Ideally, pairwise plot of regressors show little to
no correlation (linear patterns), but even then there may exist linear multivariate relationships that
involve more than two variables. These plots can be performed ex-ante. An ex-post analysis may also
be performed, after the model has been fit.

8.8.3.1 Residual Analysis

After fitting the model we may perform residual analysis of e_i = y_i − ŷ_i, i ∈ [n], which have zero mean
and variance approximated by

Σ_i (e_i − ē)^2 / (n − p) = Σ_i e_i^2 / (n − p) = SSres / (n − p) = MSres.
When n >> p then the dependence between residuals ei is weak, but otherwise they are not independent.

8.8.3.1.1 Leverage Values, Influential Values and the Variance of Residuals

Definition 380 (Leverage Values). Recall that we defined the hat matrix (Definition 373) X(X^T X)^{-1} X^T and
obtained ŷ = Hy; we can rewrite ŷ_i = Σ_{j=1}^n h_ij y_j, which shows that the response prediction is a weighted
sum of all observations. In particular, we shall then call h_ii the leverage value for the i-th observation,
indicating the weight of y_i in ŷ_i.

We also wrote the residuals as e = (1 − H)y. By substitution of y = Xβ + ε we obtain

e = (1 − H)(Xβ + ε)                                                           (3067)
  = Xβ + ε − HXβ − Hε                                                         (3068)
  = Xβ + ε − X(X^T X)^{-1} X^T Xβ − Hε                                        (3069)
  = Xβ + ε − Xβ − Hε                                                          (3070)
  = ε − Hε                                                                    (3071)
  = (1 − H)ε                                                                  (3072)
and using Var(ε) = σ^2 1 and the fact that 1 − H is idempotent and symmetric,

Var(e) = Var((1 − H)ε) = (1 − H) Var(ε) (1 − H)^T = σ^2 (1 − H)               (3074)

Finally we arrive at the results

Var(e_i) = σ^2 (1 − h_ii),   Cov(e_i, e_j) = −σ^2 h_ij                        (3075)

where in the covariance term we use the fact that the identity matrix has zero off-diagonals.

8.8.3.2 Standardization of Residuals

Following the variance study of residuals in Equations 3075 we can form standardized residuals. In
particular we define:

Definition 381 (Standardized Residuals). Standardized residuals are residuals defined as


e_i / (σ √(1 − h_ii)), i ∈ [n]                                                (3076)

but again recall that our estimator of σ^2 is MSres. Another alternative is σ̂_{(i)}, where

σ̂_{(i)}^2 = SSres(i)/(n − k − 2) = SSres(i)/(n − p − 1)                      (3077)

where SSres(i) is the sum of squared residuals when the model is fitted without the i-th observation. Then
both MSres and σ̂_{(i)}^2 form unbiased estimators of σ^2. We may then estimate the standardized residuals by
substitution, with the formulations:

Definition 382 (Internally Studentized Residuals). Internally Studentized Residuals are defined
r_i = e_i / √(MSres (1 − h_ii)), i ∈ [n]
and

Definition 383 (Externally Studentized Residuals). Externally Studentized Residuals are defined
r_i^* = e_i / (σ̂_{(i)} √(1 − h_ii)), i ∈ [n]

which can be shown to be related via the monotonic (and hence similar) transformation r_i^* = r_i √((n − p − 1)/(n − p − r_i^2)).

8.8.3.3 Checking Normality Assumptions

We can also check for the normality assumptions on residuals using quantile plots such as the QQ plot
(see Section 7.3.1). Plotting against normal scores the ordered standardized residuals, we can look out
for the presence of tails / skewness, as well as large residuals representing potential outliers. Additionally,
a scatter plot of standardized residuals against either the response or any of the regressors should yield
no pattern. A funnel, cone or bow shape can indicate heteroscedasticity, curves and quadratic pattern
can indicate that the linearity assumption is violated and so on. Additionally, under the normality
assumption we should expect (most) standardized residuals to fall within an absolute value of three. For
instance, a ‘double bow’ (imagine the shape of one’s lips) often occurs when the response is a proportion
between zero and one, since the variance of a binomial random variable near 0.5 is greater than when it is
near the extreme values of the proportion (zero, one). Note that in simple regression, the scatter plot of r_i ∼ X
is visually identical to that of r_i ∼ ŷ.

8.8.3.4 Note on Time Series Data

Additionally, if the data is serially sampled, it might be worth looking at the standardized residuals
plotted against the time order. Ideally, we obtain the ‘no pattern’ pattern, the violation of which might
suggest we have variance as a function of time, σ(t), or even autocorrelation ρ(ε_t, ε_{t−1}) ≠ 0, which is a serious
violation of the independence assumption of errors.

8.8.3.5 Outliers and Influential Data

Residuals that are considerably (absolutely) larger than the others may indicate potential outliers in the
output space. QQ plots (see Section 7.3.1), scatter plots and normality tests are good ways of identifying
outliers. Although outliers can be ‘bad’ and corrected/removed in the case of sampling faults, sometimes
it just reflects a black swan event that is perfectly plausible - in this case deleting the data point can
lead to dangerous conclusions and false sense of model accuracy. The removal of (bad) outliers can affect
regression coefficient estimates, residual sum of squares, coefficients of determination and error variances.
This in turn affects interval estimates and so on.
Recalling the idempotent property, again we have Var(ŷ) = Var(Hy) = H Var(y) H^T = σ^2 H. The hat
matrix determines the covariance matrices of ŷ and e, the fitted values and errors. h_ij is the amount
of leverage exerted by the j-th observation y_j on the i-th fitted value ŷ_i. Recall from Definition 380 that
h_ii = x_i^T (X^T X)^{-1} x_i is a standardized measure of the distance of the i-th observation from the center of
the x space. Large values of h_ii indicate potentially influential points since they are different from the
other points in the input space! We have Σ_i h_ii = rank(H) = p, therefore the average size of a hat
diagonal should be p/n. We can say that for any hat diagonal exceeding 2p/n, the point is a leverage point
(assuming 2p/n < 1). Points with large residuals and large hat diagonals are likely to be influential points,
in that they affect model summary statistics and regression coefficient estimates to a more significant
degree.
To measure the influence of data points, we can use the squared distance between the least square
estimates based on all n data and without the data point. Letting the estimate for regression coefficients
excluding point i be β̂(i) , then define

Definition 384 (Cook’s Distance). Cook’s Distance is defined

D_i(M, c) = (β̂_{(i)} − β̂)^T M (β̂_{(i)} − β̂) / c,  i ∈ [n]                 (3078)

where M, c usually take the forms X^T X and p MSres respectively. With these forms of M, c, then D_i = (r_i^2/p) · h_ii/(1 − h_ii),
where r_i is the internally studentized residual defined as in Definition 382.

Points with large Cook’s Distance are said to have considerable influence. We can interpret the value
as follows: if Di = Fp,n−p (0.5) ≈ 1, then deleting the point i would move β̂(i) to the boundary of an
approximately 50% confidence region for β. Therefore, the cutoff Di > 1 is often used.
Other useful measures of influence are DFFITS and DFBETAS. While invalid data may be removed,
if there is no justification for removal we shall not do so. A common alternative, the ‘robust estimation
technique’, is to down-weight the influence of the point in the model estimates.
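Leverages, studentized residuals and Cook's distance are all exposed through statsmodels' influence diagnostics; a short sketch on synthetic data with one planted high-leverage outlier (the data are an assumption of this example):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
x = rng.uniform(0, 10, size=60)
y = 1.0 + 0.5 * x + rng.normal(size=60)
x[0], y[0] = 25.0, 40.0                     # plant a high-leverage, influential point

res = sm.OLS(y, sm.add_constant(x)).fit()
infl = res.get_influence()

cooks_d, _ = infl.cooks_distance            # Cook's distance D_i (and p-values)
print(np.argmax(cooks_d), cooks_d.max())    # the planted point should dominate
print(infl.hat_matrix_diag[:3])             # leverage values h_ii
print(infl.resid_studentized_internal[:3])  # internally studentized residuals r_i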

8.8.3.6 Lack of Fit

A lack-of-fit test requires that, for at least some levels of the regressor, we have replicate observations on the response.
Suppose in the simple linear model we have y ∼ x, and in the training data we have x ∈ {x1, x2, · · ·, xm},
discrete, with m ≤ n levels. Now assume that for each level i ∈ [m] we have n_i observations, and denote
y_ij to be observation j at the i-th level of the regressor. Note that we have n = Σ_i n_i. The ij-th residual is
y_ij − ŷ_i = (y_ij − ȳ_i) + (ȳ_i − ŷ_i). Squaring both sides and summing over all inputs we obtain

Σ_{i=1}^m Σ_{j=1}^{n_i} (y_ij − ŷ_i)^2 = Σ_{i=1}^m Σ_{j=1}^{n_i} (y_ij − ȳ_i)^2 + Σ_{i=1}^m n_i (ȳ_i − ŷ_i)^2   (3079)

with the three sums labelled SSres, SSPE and SSLOF respectively,

since the cross terms evaluate to zero. We decomposed the sum of squared residuals into two components,
namely the pure error and the lack of fit term. Compare this to the sum of squares in the general ANOVA
problem discussed in Definition 358.

Definition 385 (Lack of Fit Sum of Squares). We define the lack of fit sum of squares to be
SSLOF = Σ_{i=1}^m n_i (ȳ_i − ŷ_i)^2

with the terminologies as specified prior; it is the weighted sum of squared deviations between the fitted and mean
response values. If the fitted values are close to the means, then the lack of fit value is small and strongly
indicates a linear relationship in the regression function.

The lack of fit test statistic follows the distribution

F0 = (SSLOF/(m − 2)) / (SSPE/(n − m)) = MSLOF / MSPE ∼ F_{m−2,n−m}            (3080)

under the null hypothesis H0: the true regression function is linear (in the simple model, the statement that
E(y|x) = β0 + β1 x is adequate), and we reject the null hypothesis if F0 > F_{m−2,n−m}(α). Unfortunately,
this is not very useful in many cases, especially in multiple regression models - repeat observations do
not occur often in higher dimensionality.

8.8.3.7 Multicollinearity

When there is no linear relationship between regressor variables, we say that they are orthogonal. In most
applications however, this is not the case. In the discussions that follow, we will assume all regressors
have been centered and unit scaled. In many cases the regressors can be linearly dependent, affecting the
inferences based on the regression model. The presence of near-linear dependencies between regressors
is called the multicollinearity problem, and affects the regression coefficient estimates. Define X_j to be the
j-th column of the input variable matrix X. Then if there exists a nonzero solution t with Σ_j t_j · X_j = 0,
then (X^T X)^{-1} does not exist. Multicollinearity can be introduced in different stages of the model
building process, such as (i) data collection, (ii) model/population constraints, (iii) model specifications
and (iv) over-defined modelling. If the sampling method is not a pure random sampling and involves
the sampling of a sub-sample of the entire sample space - such that there exists correlation between
variables - we can run into multicollinearity issues. In the case of problem specific constraints, we might
have a situation where we are modelling y ∼ x1 + x2 and it turns out x1 and x2 are related intrinsically
(think income and housing size, et cetera). Model specifications such as higher order polynomial term
might introduce multicollinearity, especially so when the range of a regressor variable is small. Over-
defined models are situations in which p > n. Recall that the variance of the regression estimates is given by
Cov(β̂) = σ^2 (X^T X)^{-1} (see Equation 3041); the ℓ2 error of our regression estimates can then be computed
as the squared distance L1^2 = (β̂ − β)^T (β̂ − β), and we have

E[L1^2] = E[(β̂ − β)^T (β̂ − β)] = Σ_{j=1}^k E(β̂_j − β_j)^2 = Σ_{j=1}^k Var(β̂_j) = σ^2 trace((X^T X)^{-1})   (3081)

where the trace is the sum of the diagonals. When multicollinearity exists, some of the eigenvalues of X^T X will
turn out to be small. Letting λ_j denote the j-th eigenvalue of X^T X, then E[L1^2] = σ^2 trace((X^T X)^{-1}) =
σ^2 Σ_{j=1}^k λ_j^{-1}. At least one of the eigenvalues being involved in a multicollinear relationship inflates the
corresponding 1/λ_j, and the expected error loss in the estimation of β̂ becomes large. Since Var(β̂_j) = σ^2 C_jj, j ∈ [k], we can
show that if X is unit-length scaled (verify this) then ρ(X) = X^T X. We can show that the j-th diagonal
of [ρ(X)]^{-1} is 1/(1 − R_j^2), where R_j^2 is the coefficient of determination from the regression of x_j ∼ {x_i : i ∈ [k], i ≠ j}.
We can see directly from this that strong multicollinearity effects between regressors inflate both the
variances and covariances of our least squares estimates! Often, when X is unit scaled we just call X^T X
the correlation matrix of X, and call it the correlation form of X^T X. Multicollinearity decreases
the generalisability of the model - although it fits well to the training data, the poor regression coefficient
estimates make the model poor for prediction at inputs outside the observed input space.

8.8.3.7.1 Detection of Multicollinearity - Plots, Variance Inflation Factors and Eigensystem Analysis:
Correlation Plots A simple (but non-sure) way of identifying potentially multicollinear variables is to
look at the off-diagonal elements in ρ(X). However, there are many instances when the pairwise correlations
do not reflect the presence of multicollinearity, since near-linear dependence involving more than two
regressors is not reflected.
Variance Inflation Factors Recall that the diagonals of the inverse correlation matrix ρ(X)^{-1} can be useful in
detecting multicollinearity. In particular, we have the values

1/(1 − R_j^2), j ∈ [k].                                                       (3082)

Since the variance of the j-th regression coefficient equals σ^2 C_jj and C_jj increases with 1/(1 − R_j^2), we can view this
value as the factor by which the variance of β̂_j increases due to near-linear dependence among the regressors.

Definition 386 (Variance Inflation Factors). The variance inflation factor for the j-th regressor is the
value

VIF_j = 1/(1 − R_j^2),                                                        (3083)

for which the presence of a large value indicates multicollinearity.¹

Note that scaling data can help decrease the variance inflation factors and improve the fit.
Eigensystem Analysis Since the eigenvalues/characteristic values of a square matrix A ∈ R^{k,k} are
the k roots of det(A − λ1) = 0, the eigenvalues of ρ(X), that is {λ_i, i ∈ [k]}, can be
used to measure collinearity in the data. Small eigenvalues indicate collinearity issues. Let the condition
number of ρ(X) be κ = λ_max/λ_min; if κ > 100 we say that collinearity issues exist. This does not tell
us, however, the number of regressors involved in the collinearity relationship. The condition
indices, κ_j = λ_max/λ_j, j ∈ [k], give a measure of the number of such near-linear dependencies. The method
of eigensystem analysis can also be used to identify the nature of this near-linear dependence.
1A
VIF value exceeding 5 or 10 can indicate that the associated coefficient estimates are poorly estimated. The VIFs
not only detect collinearity but suggest which regressors are involved in the collinear relationship.

497
8.8.4 Correction of Inadequacies
8.8.4.1 Transformation of Response-Regressor

A violation of the linearity assumption may be detected via the analysis of standardized residuals or
in the lack of fit test, as we saw in Section 8.8.3. In these cases, the originally nonlinear function may
be transformably linear. For instance, consider the following linearizable, non-linear functions and their
linear transformations:

    y ∼ x                    Transform                   y′ ∼ x′
    y = β0 x^{β1}            y → log y, x → log x        y′ = log β0 + β1 x′
    y = β0 exp{β1 x}         y → ln y                    y′ = ln β0 + β1 x
    y = β0 + β1 log x        x → log x                   y′ = β0 + β1 x′
    y = x/(β0 x − β1)        y → 1/y, x → 1/x            y′ = β0 − β1 x′

When performing transformations, the problem domain needs to be taken into account. For instance,
two equally feasible specifications - the transformation x → 1/x and the addition of a non-linear higher-order term x² - may be
presented; then we should question whether the quadratically U-shaped relationship between response and
regressor is an intuitively reasonable one.
We may also rely on analytical methods to specify appropriate transformations, such as the Box-Cox
method.

8.8.4.1.1 Box-Cox Method We may transform the response y to correct for non-normality and
heteroscedasticity using the power transform y^λ, and then estimate β, λ using maximum likelihood
methods (see Section 334). We use the transformation

y^(λ) = (y^λ − 1) / (λ ỹ^{λ−1})   when λ ≠ 0
      = ỹ log y                   when λ = 0,    (3084)

where ỹ = exp{(1/n) Σ_{i=1}^n log y_i} represents the geometric mean of the observations. The regression coefficients are
obtained by fitting the model y^(λ) = Xβ + ε using the least squares method (see 3032) or the maximum
likelihood estimation method (see 334).
Note that when comparing between different λ transformations, we may not use the SS_res of the raw transform y^λ,
since each is on a different scale - this is precisely why the transform in (3084) is scaled by ỹ^{λ−1}. The MLE of λ
corresponds to the value for which the SS_res from the model fitted to y^(λ) is minimum. If λ = 0, then we choose
log y as the response; otherwise our new response variable is y^λ.
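A minimal sketch of the Box-Cox grid search described above, assuming a design matrix X (with intercept column) and a strictly positive response y are given; the grid of λ values and the helper name are illustrative assumptions.

import numpy as np

def box_cox_lambda(X, y, lambdas=np.linspace(-2, 2, 81)):
    y = np.asarray(y, dtype=float)
    y_gm = np.exp(np.mean(np.log(y)))          # geometric mean of the responses
    best_lam, best_ss = None, np.inf
    for lam in lambdas:
        if abs(lam) < 1e-12:
            y_lam = y_gm * np.log(y)           # the lambda = 0 branch of (3084)
        else:
            y_lam = (y ** lam - 1.0) / (lam * y_gm ** (lam - 1.0))
        beta, *_ = np.linalg.lstsq(X, y_lam, rcond=None)
        ss_res = np.sum((y_lam - X @ beta) ** 2)
        if ss_res < best_ss:                   # MLE of lambda minimizes SS_res of the scaled transform
            best_lam, best_ss = lam, ss_res
    return best_lam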

8.8.4.1.2 Box-Tidwell Method Another transformation is on the regressor variables instead of
the response. Assuming that ε ~ IID Φ(0, σ²) is at least approximately satisfied, we may apply the
Box-Tidwell method as follows. For the simple model, consider the transformation

ξ = x^α     α ≠ 0
  = log x   α = 0    (3085)

and the least squares fit ŷ = β̂0 + β̂1 x. Fit a new model with the additional regressor w = x log x, such
that we have ŷ = β̂0′ + β̂1′ x + γ̂ w, and take α_1 = γ̂/β̂1 + 1. Repeat the procedure using the new regressor
x → x^{α_1}; this converges rapidly to a satisfactory estimate of α.

Another method of correcting the inadequacies is to adjust the weights of points such that they are
approximately equal in importance to the model fit. This is discussed in the section on the weighted
least squares method (see Section 8.8.5).

8.8.4.2 Ridge Regression

Recall the bias-variance decomposition of the mean squared error of an estimator β̂′ of β:

MSE(β̂′) = E(β̂′ − β)² = Var(β̂′) + Bias(β̂′)².    (3086)

The measurement model relating to this decomposition is given in Equation 2523. If we are able to
increase the bias by a smaller amount than the decrease in variance, our estimate has lower expected error.
From here on we refer to X^T X in its correlation form.

Definition 387 (Ridge Estimator). Define the ridge estimator β̂_R as the solution to (X^T X + λ1)β̂_R = X^T y,
therefore we have

β̂_R = (X^T X + λ1)^{-1} X^T y    (3087)
    = (X^T X + λ1)^{-1} X^T X β̂    (3088)
    = Z_λ β̂    (3089)

where λ ∈ [0, 1] is called the biasing parameter.

Then we have (verify this)

MSE(β̂_R) = Var(β̂_R) + Bias(β̂_R)²    (3090)
          = σ² Σ_{j=1}^k λ_j/(λ_j + λ)² + λ² β^T (X^T X + λ1)^{-2} β    (3091)

where each λ_j is the j-th eigenvalue of X^T X. As the biasing parameter increases, the variance decreases
and the bias increases. There exists λ ≠ 0 such that MSE(β̂_R) < Var(β̂), provided β̂^T β̂ is bounded. Note
that since (verify this)

SS_res = (y − X β̂_R)^T (y − X β̂_R)    (3092)
       = (y − X β̂)^T (y − X β̂) + (β̂_R − β̂)^T X^T X (β̂_R − β̂),    (3093)

where the first term is the SS_res of the unbiased least squares fit, the ridge regression results in a poorer fit
and a lower coefficient of determination as a tradeoff for a lower mean squared error of the regression estimates.
The ridge trace is a useful plot to observe the elements of β̂R against the values of λ. In the case of
multicollinearity issues, the instability of regression coefficients can be observed from the ridge trace.
The ridge regression method is often a technique employed to deal with multicollinear effects.
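A minimal sketch of the ridge estimator in (3087) and a crude ridge trace over a grid of biasing parameters; the standardization step and the grid of λ values are assumptions made for illustration only.

import numpy as np

def ridge_estimator(X, y, lam):
    # beta_R = (X^T X + lam * I)^{-1} X^T y
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def ridge_trace(X, y, lambdas=np.linspace(0.0, 1.0, 21)):
    # standardize so that X^T X is in correlation form, and center the response
    Z = (X - X.mean(axis=0)) / (np.sqrt(len(X)) * X.std(axis=0))
    yc = y - y.mean()
    return {lam: ridge_estimator(Z, yc, lam) for lam in lambdas}

Plotting the coefficients returned by ridge_trace against λ reproduces the ridge trace described above; unstable coefficients at small λ are a symptom of multicollinearity.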

8.8.4.3 Principal Component Regression

8.8.5 Weighted Least Squares


When faced with data with heteroscedastic variance, we can fit by weighted least squares. The residual
terms (y_i − ŷ_i) are scaled by weights inversely proportional to Var(y_i). Recall that in the simple least
squares (see 2989) we took S(β0, β1) = Σ_i (y_i − β0 − β1 x_i)², and we replace this with the weighted sum of
squares objective function:

S(β0, β1) = Σ_i w_i (y_i − β0 − β1 x_i)²    (3094)
with normal equations

β̂0 Σ_{i=1}^n w_i + β̂1 Σ_{i=1}^n w_i x_i = Σ_{i=1}^n w_i y_i    (3095)
β̂0 Σ_{i=1}^n w_i x_i + β̂1 Σ_{i=1}^n w_i x_i² = Σ_{i=1}^n w_i y_i x_i    (3096)

which, when solved, yield the regression coefficient estimates β̂0, β̂1. The w_i here are not variables - they
are set inversely proportional to the variance (such as w_i = 1/σ_i²), where the error term ε_i has
variance σ_i². These variances are unknown a priori but can be estimated, and are discussed later.
Writing the weighted means x̄_w = Σ w_i x_i / Σ w_i and ȳ_w = Σ w_i y_i / Σ w_i, we obtain (verify this) the regression estimates

β̂0 = ȳ_w − β̂1 x̄_w    (3097)
β̂1 = Σ_i w_i (x_i − x̄_w)(y_i − ȳ_w) / Σ_i w_i (x_i − x̄_w)²    (3098)

which are unbiased estimates. The weighted mean squared residual MS(w)_res is an unbiased estimator
of σ². Interpretation of the model is the same as when the weights are uniform. We may remove point i
by setting w_i = 0. Additionally, outliers or influential points may be set w_i < w̄ to down-weight their impact
on the coefficient estimates relative to others. Since we need the conditional variances σ_i², they can be difficult to
obtain. Suppose we already know the relationship Var(y|x) = f(x); then we may encode it in a
functional form. However, in many cases the underlying distributions and relationships are not known -
and we may rely on methods such as estimation from multiple (nearly) repeated values of the regressor.
Although we ideally want multiple response values at each level of the regressor value, we often do not
have sufficient data. Instead, we bin the regressor axis and group the observations. Let the group formed from the
neighbourhood of x_i have average x̄ and sample variance s_y². We may perform a least squares fit of s_y² ∼ x̄,
and substitute x_i → x̄ to obtain an estimate of σ_i².
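A minimal sketch of the closed-form weighted estimates (3097)-(3098), assuming the weights w_i are already known (for instance from an estimated variance function); the function name is an illustrative assumption.

import numpy as np

def weighted_simple_ls(x, y, w):
    x, y, w = map(np.asarray, (x, y, w))
    # weighted means and the closed-form estimates in (3097)-(3098)
    x_w = np.sum(w * x) / np.sum(w)
    y_w = np.sum(w * y) / np.sum(w)
    b1 = np.sum(w * (x - x_w) * (y - y_w)) / np.sum(w * (x - x_w) ** 2)
    b0 = y_w - b1 * x_w
    return b0, b1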

8.8.6 Generalized Least Squares

In the generalized least squares setting we want to fit

y = Xβ + ε

with the weaker assumptions that Eε = 0, Var(ε) = σ²V. In the multiple least squares assumptions (see
Section 8.8.2.2) our assumption of constant and independent error variance corresponds to V = 1. When
V is a diagonal matrix, the errors are uncorrelated but have non-constant variance; in the general
case we have both correlated errors and non-constant variance.
We may no longer use the estimate β̂ = (X^T X)^{-1} X^T y when V ≠ 1. What we want to do, then, is
transform the model into a new set of observations satisfying the ordinary least squares
assumptions and utilize that machinery. Let σ²V represent the matrix Cov(ε); then the generalized least
squares normal equations become (verify this):

(X^T V^{-1} X) β̂ = X^T V^{-1} y    (3099)

with solution

β̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} y,    (3100)

the generalized least squares estimator of β. Note that in the special case where σ²V is a diagonal matrix,
letting W = V^{-1} we recover the weighted least squares equation (verify this)

(XT WX)β̂ = XT Wy (3101)

with solution

β̂ = (XT WX)−1 XT Wy. (3102)

8.8.7 Variable Selection Methods


We discuss the tradeoff between model complexity and predictive power in this section. The basic
strategy can be modelled as follows: 1) fit a full model, 2) perform analysis and validity studies, 3)
determine statistical relevance and significance by tests, 4) edit the model and repeat.
Assume a k-regressor candidate problem with regressors x_i, i ∈ [k], and response y. Then our model
is specified y_i = β0 + Σ_{j=1}^k β_j x_{ij} + ε_i, i ∈ [n]. Suppose r regressors should be deleted from the model;
then we can decompose X into X_q, X_r such that X_q is the q = k − r + 1 columns of the intercept and significant
regressors, while X_r are the deleted ones. We rewrite y = X_q β_q + X_r β_r + ε. Recall the full model
(3032) β̂* = (X^T X)^{-1} X^T y, where β̂* consists of β̂_q* and β̂_r*, and

σ̂*² = (y^T y − β̂*^T X^T y) / (n − k − 1) = y^T [1 − X(X^T X)^{-1} X^T] y / (n − k − 1)    (3103)

with fitted values ŷ_i*. The subset model takes the form β̂_q = (X_q^T X_q)^{-1} X_q^T y and

σ̂² = (y^T y − β̂_q^T X_q^T y) / (n − q) = y^T [1 − X_q (X_q^T X_q)^{-1} X_q^T] y / (n − q)    (3104)
and fitted values ŷ_i. A question we may then ask regards the consequence of mis-specifying our
model by the r regressors. It turns out that (verify this) Eβ̂_q = β_q + Aβ_r, where A = (X_q^T X_q)^{-1} X_q^T X_r,
which results in a biased estimator unless β_r is the zero vector or the deleted regressors are orthogonal to the retained
variables, in that X_q^T X_r = 0. We also have Var(β̂_q) = σ²(X_q^T X_q)^{-1} and Var(β̂*) = σ²(X^T X)^{-1} by definition,
and writing Var(β̂_q*) − Var(β̂_q) we obtain a matrix such that all variances of regression coefficients in the
full model are ≥ the variances of the coefficients in the reduced model. Removal of unnecessary variables will
not increase the variance of the remaining coefficients. We also have MSE(β̂_q) < MSE(β̂_q*), in that the
subset model has smaller mean squared errors.
Note that in the full model σ̂*² is an unbiased estimator of σ², while σ̂² from the subset model is a
biased (upward) estimate of σ². When predicting the response at a point x^T = (x_q^T, x_r^T), the predicted response
from the full model is ŷ* = x^T β̂*, with mean x^T β and prediction variance Var(ŷ*) = σ²[1 + x^T (X^T X)^{-1} x], in
comparison to the predicted response ŷ = x_q^T β̂_q with mean Eŷ = x_q^T β_q + x_q^T Aβ_r and mean squared
error

MSE(ŷ) = σ²[1 + x_q^T (X_q^T X_q)^{-1} x_q] + (x_q^T Aβ_r − x_r^T β_r)².    (3105)

Although (in general) we have a biased estimate of y, the variance of ŷ* from the full model is not less
than the variance of ŷ from the subset model.
We may show (verify this):

MSE(ŷ*) = Var(ŷ*) ≥ MSE(ŷ)    (3106)

under misspecification of the model. Hence, deleting variables improves the precision of the regression
estimates of the retained variables and reduces the variance of the predicted response. If the deleted variables are
not significant, then the bias-variance tradeoff is favourable and we earn lower mean squared errors.

8.8.7.1 Selection Criterion

Some of the metrics used to select between models can be named as follows: i) the coefficient of (multiple)
determination R_q² = SS_reg(q)/SS_T = 1 − SS_res(q)/SS_T, ii) the adjusted coefficient of multiple determination
R_adj² = 1 − [(n − 1)/(n − p)](1 − R_p²), iii) the residual mean square MS_res, and iv) the Akaike and Bayesian
Information Criteria.
For MS_res = SS_res/(n − p), there is a tradeoff between the loss of degrees of freedom and a decrease in
the SS_res value - ideally, we want a model with the minimum MS_res. We may choose the model with the
number of regressors such that this value is minimized, or a nearby model.

8.8.7.1.1 Akaike Information Criteria The AIC method is based on maximising the expected
entropy of the model, trading off goodness-of-fit for simplicity of the model.

Definition 388 (AIC). We define the AIC to be the value

AIC = −2 log(L) + 2p,  p = k + 1,    (3107)

where L is the likelihood function of the model.

In the ordinary least squares setting, we have (verify this): AIC = n log(SS_res/n) + 2p, and our goal
may be to select the model with the smallest AIC value.

8.8.7.1.2 Bayesian Information Criteria The BIC is an extension of the AIC² (see Section 8.8.7.1.1)
with several variants (for instance those of Schwarz and Sawa). The Schwarz form is given by

BIC_s = −2 log(L) + p log(n)    (3108)

which in the ordinary least squares setting yields BIC_s = n log(SS_res/n) + p log n. We would like a model with
the lowest BIC.

² Note that when using the AIC and BIC, the comparison of models is only valid when the response is the same. Additionally,
the models should be fit on the same amount of data, since the input n is part of the criterion.
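A minimal sketch computing the OLS forms of the AIC and BIC given above, assuming a design matrix X whose columns include the intercept; the helper name is an illustrative assumption.

import numpy as np

def ols_aic_bic(X, y):
    n, p = X.shape                      # p counts the intercept column if present
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    aic = n * np.log(ss_res / n) + 2 * p
    bic = n * np.log(ss_res / n) + p * np.log(n)
    return aic, bic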

8.8.7.2 Computational Methods - Brute Force and Stepwise Greedy Solutions

8.8.7.2.1 Brute Force Method : With a total of k candidate regressors, we may fit a model over
the power set and compare the 2^k such models based on an agent-defined utility criterion (see
8.8.7.1). Note that since each model gives regression coefficient estimates, a useful side-effect is that we
may be able to detect multicollinearity issues by observing instability of the regression estimates. If this
is computationally unviable we may opt for greedy methods, although no global optimum is guaranteed.

8.8.7.2.2 Forward Selection Method : The steps may be enumerated as follows: (1) begin with
the intercept model, and pick the regressor that has the highest correlation with the response. (2) If the F
statistic of the model is significant beyond some threshold, greedily select another regressor that has the
largest correlation with the response after adjusting for the effect of the first regressor on the response. This is
known as partial correlation. (3) Compute the partial F statistic, and if this exceeds the threshold add the
regressor and repeat; otherwise terminate.
The partial correlation can be determined as follows:

1. First derive ŷ = β̂0 + β̂1 x_1

2. Then fit x̂_j = α̂_{0j} + α̂_{1j} x_1, for all j ∈ [2, k]

3. Derive ρ(ŷ − y, x̂_j − x_j), for all j ∈ [2, k]

The value in step 3 is known as the partial correlation of x_j with y, and the regressor with the largest
partial correlation also yields the largest partial F statistic

    SS_reg(x_2 | x_1) / MS_res(x_1, x_2)

which we compare against the threshold to see if we shall continue. A minimal sketch of this greedy procedure follows.
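The sketch below implements the greedy forward selection under a fixed partial-F threshold; the threshold value and helper names are illustrative assumptions, not prescriptions from the text.

import numpy as np

def forward_selection(X, y, f_threshold=4.0):
    # X: n x k candidate regressors (no intercept column), y: response
    n, k = X.shape
    selected, remaining = [], list(range(k))
    while remaining:
        best_j, best_f = None, -np.inf
        base = np.column_stack([np.ones(n)] + [X[:, j] for j in selected])
        ss_base = np.sum((y - base @ np.linalg.lstsq(base, y, rcond=None)[0]) ** 2)
        for j in remaining:
            cand = np.column_stack([base, X[:, j]])
            ss_cand = np.sum((y - cand @ np.linalg.lstsq(cand, y, rcond=None)[0]) ** 2)
            ms_res = ss_cand / (n - cand.shape[1])
            f_partial = (ss_base - ss_cand) / ms_res   # SS_reg(x_j | selected) / MS_res(selected, x_j)
            if f_partial > best_f:
                best_j, best_f = j, f_partial
        if best_f < f_threshold:
            break                                      # no remaining regressor clears the threshold
        selected.append(best_j)
        remaining.remove(best_j)
    return selected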

8.8.7.2.3 Backward Elimination Method : The method may be described in a similar format to
the forward selection, but in reverse order. First begin with the full k regressor model, and examine the
partial F statistic (recall that this is equivalent to the t-test statistic) as if it were the last variable to be
added in the model. If the smallest partial F statistic is lower than some threshold, remove the regressor
and repeat or terminate.

8.8.7.2.4 Stepwise Regression Method : The stepwise method is a combination of the forward
and backward elimination methods. Beginning with the intercept model, we initiate forward selection.
The modification is that after the addition of each variable we conduct partial F tests for all variables
in the model to see if any regressors previously added have become redundant due to relationships
between it and the incoming regressor. The removal or addition of more regressors then follows from the
comparison against two thresholds, which are generally distinct.

8.9 Rank Methods for Linear Regression


Recall that in Section 8.8.1.5 we tried to test for significance of the regression coefficient estimates by
using the t-test and the residual mean square (see Definition 374) to obtain variance estimates used in the standard
error computations of our regression coefficients. In particular our hypothesis test statistic was written
Z = (β̂1 − β1)/sqrt(σ²/S_xx), but we generally do not know σ², so we employ MS_res = SS_res/(n − p), where p = k + 1 in a
k-regressor problem, to obtain a t-distributed test statistic instead. Assume the simple least squares
setting and write our test statistic

T_{β1} = (S_xy/S_xx − β1) / sqrt( SS_res / [(n − 2) S_xx] )    (3109)

which, under the null hypothesis H0 : β1 = 0 of interest below, becomes

       = [Σ(x_i − x̄)y_i / Σ(x_i − x̄)²] / sqrt( Σ(y_i − β̂0 − β̂1 x_i)² / [(n − 2) Σ(x_i − x̄)²] )    (3110)
       = sqrt(n − 2) Σ(x_i − x̄)y_i / [ sqrt(Σ(x_i − x̄)²) · sqrt(Σ(y_i − β̂0 − β̂1 x_i)²) ].    (3111)

By rewriting the terms

Σ(y_i − β̂0 − β̂1 x_i)² = Σ[(y_i − ȳ) − β̂1(x_i − x̄)]²    (3112)
                      = Σ(y_i − ȳ)² + β̂1² Σ(x_i − x̄)² − 2β̂1 Σ(x_i − x̄)(y_i − ȳ)    (3113)
                      = Σ(y_i − ȳ)² + β̂1² Σ(x_i − x̄)² − 2β̂1² Σ(x_i − x̄)²    (3114)
                      = Σ(y_i − ȳ)² − β̂1² Σ(x_i − x̄)²    (3115)
                      = Σ(y_i − ȳ)² − [Σ(x_i − x̄)y_i / Σ(x_i − x̄)²]² Σ(x_i − x̄)²    (3116)
                      = Σ y_i² − n ȳ² − [Σ(x_i − x̄)y_i]² / Σ(x_i − x̄)²    (3117)

and hence the t-test statistic can be expressed as

    sqrt(n − 2) V / sqrt( Σ Y_i² − n Ȳ² − V² )    (3118)

where V = Σ(x_i − x̄)Y_i / sqrt(Σ(x_i − x̄)²).
Then for some sample Y ∼ X, where the ranks R_i, i ∈ [n] correspond to the ranks of Y_i, i ∈ [n], replacing
the Y values with the R values in the parametric t-test statistic gives the rank-based test statistic for the
significance of β1:

T′ = sqrt(n − 2) V_R / sqrt( Σ i² − n((n + 1)/2)² − V_R² )    (3119)

where V_R = Σ(x_i − x̄)R_i / sqrt(Σ(x_i − x̄)²). Since T′ increases in V_R, which increases in Σ(x_i − x̄)R_i =
Σ x_i R_i − x̄ Σ R_i = Σ x_i R_i − k for a fixed constant k, our rejection region can be written as follows: for
U = Σ x_i R_i, reject H0 : β1 = 0 if U ≥ u_α, where P(U ≥ u_α) = α. We skip discussion of the distribution
of U except for when we have equally spaced design points (x_1, · · · , x_n). We saw in the discussion
of the Kruskal-Wallis test (see Section 7.6.2.1) that each unsigned rank permutation has probability 1/n!,
and the same can be said under the null hypothesis H0 : β1 = 0 of the rank test. In particular, we
have P((R_i = r_i)_{i∈[n]}) = 1/n!, where (r_{i∈[n]}) is an element of the set of permutations of [n].
Theorem 393. Under the null hypothesis H0 : β1 = 0, U = Σ x_i R_i has distribution

(U − EU) / sqrt(Var U) ≈ Φ(0, 1)    (3120)

with EU = [n(n + 1)/2] x̄ and Var U = [n(n + 1)/12] Σ(x_i − x̄)².

Proof. We show the proof for the moments and skip normality. See that

EU = E Σ x_i R_i = Σ x_i ER_i = n x̄ · (1/n) Σ i = [n(1 + n)/2] x̄.    (3121)

Since Σ R_i is deterministic, its variance equals zero. But we may also express

Var(Σ R_i) = Σ_i Σ_j Cov(R_i, R_j) = Σ_i Var(R_i) + Σ_{i≠j} Cov(R_i, R_j)    (3122)
           = n Var(R_1) + n(n − 1) Cov(R_1, R_2) = 0    (3123)

which implies Cov(R_1, R_2) = −Var(R_1)/(n − 1). Now see that

Var(U) = Σ_i Σ_j x_i x_j Cov(R_i, R_j) = Σ_i x_i² Var(R_i) + Σ_{i≠j} x_i x_j Cov(R_i, R_j)    (3124)
       = Var(R_1) Σ_i x_i² + Var(R_1) Σ_{i≠j} x_i x_j · [−1/(n − 1)]    (3125)
       = Var(R_1) [ Σ_i x_i² − (1/(n − 1)) Σ_{i≠j} x_i x_j ].    (3126)

Now write

Σ_i x_i² − (1/(n − 1)) Σ_{i≠j} x_i x_j = Σ_i x_i² − (1/(n − 1)) [ Σ_{i≠j} x_i x_j + Σ_i x_i² − Σ_i x_i² ]    (3127)
                                       = Σ_i x_i² − (1/(n − 1)) [ Σ_i Σ_j x_i x_j − Σ_i x_i² ]    (3128)
                                       = Σ_i x_i² − (1/(n − 1)) [ n² x̄² − Σ_i x_i² ]    (3129)
                                       = [n/(n − 1)] ( Σ_{i=1}^n x_i² − n x̄² )    (3130)
                                       = [n/(n − 1)] Σ_i (x_i − x̄)².    (3131)

Finally, express

Var(R_1) = (1/n) Σ i² − R̄² = (1/n) Σ i² − ((n + 1)/2)²    (3132)
         = (1/6)(n + 1)(2n + 1) − (n + 1)²/4    (3133)
         = (1/12)(n + 1)[(4n + 2) − 3(n + 1)]    (3134)
         = (1/12)(n + 1)(n − 1).    (3135)

It follows that Var(U) = (1/12)(n + 1)(n − 1) · [n/(n − 1)] Σ_i (x_i − x̄)² = [n(n + 1)/12] Σ (x_i − x̄)², and we are done.

8.10 Nonparametric Density Curve Estimation


8.10.1 Histogram
For X_i ∼ F, i ∈ [n], with density f(x), we may plot a histogram, which depends on the bar width
and the starting point as parameters.

8.10.2 Naive Kernel Density Estimator


Recall by limit arguments that we have

f(x) = F′(x) = lim_{h→0} [F(x + h) − F(x − h)] / (2h) = lim_{h→0} P(X_1 ∈ (x − h, x + h)) / (2h).    (3136)

Then a reasonable estimate of f(x) can be related to the number of observations in its local width, where
we define the width to be a parameter for consideration. That is, our estimator is written

f̂(x) = (1/2h) (1/n) Σ_{i=1}^n 1{X_i ∈ (x − h, x + h)}    (3137)
     = (1/2nh) Σ_{i=1}^n 1{|X_i − x| < h}    (3138)
     = (1/nh) Σ_{i=1}^n (1/2) 1{ |X_i − x|/h < 1 }    (3139)
     = (1/nh) Σ_{i=1}^n k( (X_i − x)/h )    (3140)

where k(t) = (1/2) 1{|t| < 1}. See that here the function k(t) can be considered (1) a density function
symmetric about zero, with (2) ∫k(t)dt = 1, ∫t k(t)dt = 0, ∫t² k(t)dt > 0, and (3) a step function. Since
k(t) is a step function, f̂(x) is a step function.

8.10.3 Kernel Density Estimator


Definition 389 (Kernel Density Estimators). Let k(t) be a kernel function satisfying the properties
∫k(t)dt = 1, ∫t k(t)dt = 0, ∫t² k(t)dt ≠ 0. Then the kernel density estimate of f(x) is defined by

f̂(x) = (1/nh) Σ_{i=1}^n k( (x − X_i)/h )    (3141)

where h is termed the width parameter or smoothing parameter. k(t) is usually a symmetric density function.

Exercise 634 (Examples of Kernel Functions). Some examples of kernel functions satisfying the properties
mentioned in Definition 389 can be enumerated.

1. Uniform/Rectangle: k(t) = (1/2) 1{|t| < 1}

2. Normal: k(t) = (1/sqrt(2π)) exp(−t²/2)

3. Triangular: k(t) = (1 − |t|) 1{|t| < 1}

4. Biweight: k(t) = (15/16)(1 − t²)² 1{|t| < 1}

5. Epanechnikov: k(t) = (3/4)(1 − t²/5)/sqrt(5) · 1{|t| < sqrt(5)}
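A minimal sketch of the kernel density estimate (3141) using the normal kernel and the normal-reference bandwidth h ≈ 1.06 σ n^{−1/5} discussed with Theorem 394 below; the function name and defaults are illustrative assumptions.

import numpy as np

def kde(x_grid, data, h=None):
    x_grid = np.asarray(x_grid, dtype=float)
    data = np.asarray(data, dtype=float)
    n = len(data)
    if h is None:
        h = 1.06 * data.std() * n ** (-1 / 5)          # normal-reference bandwidth
    t = (x_grid[:, None] - data[None, :]) / h
    k = np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)     # normal kernel
    return k.sum(axis=1) / (n * h)                     # (1/nh) sum_i k((x - X_i)/h)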

Theorem 394 (Mean Squared Error Properties of KDE). For fixed x, we have the bias-variance
decomposition (see Definition 370)

MSE(f̂(x)) = Var f̂(x) + bias²(f̂(x)).    (3142)

Averaging over x, the global measure of error is the mean integrated squared error, which can be expressed:

MISE(f̂(x)) = ∫ E[f̂(x) − f(x)]² dx = ∫ Var(f̂(x)) dx + ∫ bias²(f̂(x)) dx.    (3143)

Since we may write

f̂(x) = (1/n) Σ_{i=1}^n (1/h) k( (x − X_i)/h ) = (1/n) Σ g(X_i)    (3144)

and the expectation term

E f̂(x) = E g(X_i) = (1/h) E k( (x − X_i)/h )    (3145)
        = (1/h) ∫ k( (x − y)/h ) f(y) dy    (3146)
        = ∫ k(t) f(x − ht) dt    (3147)
        = ∫ k(t) [ f(x) − ht f′(x) + ((ht)²/2) f″(x) + · · · ] dt    (3148)
        = f(x) + (h²/2) f″(x) ∫ t² k(t) dt + O(h³),    (3149)

which gives us bias(f̂(x)) = (h²/2) f″(x) ∫ t² k(t) dt + O(h³). We can also write in similar fashion

E k²( (x − X_i)/h ) = ∫ k²( (x − y)/h ) f(y) dy    (3150)
                    = h ∫ k²(t) f(x − ht) dt    (3151)
                    = h ∫ k²(t) [ f(x) − ht f′(x) + ((ht)²/2) f″(x) + · · · ] dt    (3152)
                    = h f(x) ∫ k²(t) dt + O(h²).    (3153)

It follows immediately that Var(f̂(x)) = (1/n) Var g(X_i) = (1/nh²) [ E k²((x − X_i)/h) − (E k((x − X_i)/h))² ]
= (1/nh) f(x) ∫ k²(t) dt + · · · .
This allows us to write the mean squared error and the mean integrated squared error:

MSE(f̂(x)) ≈ (1/nh) f(x) ∫ k²(t) dt + (h⁴/4) [f″(x)]² ( ∫ t² k(t) dt )²,    (3154)
MISE(f̂(x)) ≈ (1/nh) ∫ k²(t) dt + (h⁴/4) ∫ [f″(x)]² dx ( ∫ t² k(t) dt )².    (3155)

This is a function of h, and we have g(h) = MISE(f̂(x)) ≈ (1/nh) C₁ + (h⁴/4) C₂, with optimal bin width
h_o = C n^{−1/5} (verify this) at the MISE minimum, where

C = [ ( ∫ [f″(x)]² dx )( ∫ t² k(t) dt )² / ∫ k²(t) dt ]^{−1/5}.    (3156)

If we are working with k(t) = φ(t), the normal kernel (see Exercise 634), then the value h_o ≈
1.06 σ n^{−1/5} in the Φ(µ, σ²) setting.

Theorem 395 (Confidence Intervals using Kernel Density Estimators). If |f″(x)| < C, f(x) ≠ 0 and
nh⁵ → 0 as n → ∞, then

( f̂(x) − f(x) ) / sqrt( Var(f̂(x)) ) ∼ Φ(0, 1)    (3157)

where we can bootstrap Var(f̂(x)) with the estimate (1/nh) f̂(x) ∫ k²(t) dt. Then the confidence interval
follows: f̂(x) ± z_{α/2} sqrt( Var(f̂(x)) ).

8.11 Kernel Regression
Suppose we have (X_i, Y_i) ∼ F(x, y), i ∈ [n], whose joint density f(x, y) exists, and assume the population
model

Y_i = m(X_i) + ε_i,  i ∈ [n],    (3158)

where X_i ⊥ ε_i, Eε_i = 0, Var ε_i = σ². We want to find the true m(x).

8.11.1 Nadaraya–Watson Kernel Regression

We can write E(Y | X = x) = E(m(X) + ε | X = x) = m(x), so the conditional mean is the quantity we wish
to estimate. We can estimate it from the local neighbourhood of regressor points, such that we have

m̂(x) = Σ_i Y_i 1{X_i ∈ (x ± h)} / Σ_i 1{X_i ∈ (x ± h)}    (3159)
     = Σ Y_i 1{|X_i − x| < h} / Σ 1{|X_i − x| < h}    (3160)
     = Σ Y_i (1/2) 1{ |X_i − x|/h < 1 } / Σ (1/2) 1{ |X_i − x|/h < 1 }    (3161)
     = Σ Y_i k( (X_i − x)/h ) / Σ k( (X_i − x)/h ).    (3162)

See also that this can be written in terms of the kernel density estimator f̂(x) (see Definition 389) as

m̂(x) = [ (1/nh) Σ Y_i k( (X_i − x)/h ) ] / [ (1/nh) Σ k( (X_i − x)/h ) ] = [ (1/nh) Σ Y_i k( (X_i − x)/h ) ] / f̂(x).    (3163)

We may derive it from yet another perspective. See that

m(x) = E(Y | X = x) = ∫ y f_{Y|X=x}(y) dy = ∫ y [ f(x, y)/f(x) ] dy = [ ∫ y f(x, y) dy ] / f(x).    (3164)

We may estimate this using f̂(x) → f(x), and for the joint density we use a product of kernels. See that

f̂(x, y) = (1/(n h₁ h₂)) Σ_{i=1}^n k( (x − X_i)/h₁ ) k( (y − Y_i)/h₂ ).    (3165)

Then the estimator of ∫ y f(x, y) dy is written

∫ y f̂(x, y) dy = (1/(n h₁)) Σ Y_i k( (x − X_i)/h₁ )    (3166)

and we arrive at the same formulation for m̂(x).
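A minimal sketch of the Nadaraya–Watson estimator (3162) with a Gaussian kernel; the bandwidth h is a user-supplied assumption and the function name is illustrative.

import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    x_grid, X, Y = map(lambda a: np.asarray(a, dtype=float), (x_grid, X, Y))
    t = (x_grid[:, None] - X[None, :]) / h
    k = np.exp(-0.5 * t ** 2)                   # Gaussian kernel; the normalizer cancels in the ratio
    return (k * Y[None, :]).sum(axis=1) / k.sum(axis=1)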

Chapter 9

Randomization and Simulation

Here we give general statistical theory on randomizations and simulations.

9.1 Permutations
The permutation problem concerns the number of different ways to arrange n elements, where order matters. Take an
n-size array; then the total number of permutations is n!. We might obtain all permutations of some
array by a recursive solution: the set of permutations is the set in which each element of the array is
prepended to every permutation of the remaining elements. The algorithm can be written as follows:
def generate_permutations(array):
    if len(array) <= 1:
        return [array]
    permutations = []
    for i in range(len(array)):
        element = array[i]
        sub_sequence = array[:i] + array[i+1:]
        sub_permutations = generate_permutations(sub_sequence)
        for permutation in sub_permutations:
            permutations.append([element] + permutation)
    assert len(permutations) == np.math.factorial(len(array))
    return permutations

Listing 9.1: permutations

Additionally we might only be interested in generating one member of the set of permutations, then
we shall employ a quicker algorithm which is tantamount to uniform sampling from the permutations
generated by the algorithm presented in Listing 9.1.
def permutation_member(array):
    i = len(array)
    while i > 1:
        j = int(np.random.uniform(0, 1) * i)
        if j >= i: j = i - 1
        i -= 1
        temp = array[i]
        array[i] = array[j]
        array[j] = temp
    return array

Listing 9.2: permutation member

We provide no mathematical proof for this. However, we can verify this by Monte Carlo. By Law of
Large Numbers (see Theorem 361) the sample mean should asymptotically approach its expected value.
n = 10
N = 1000000
perm_arr = [i + 1 for i in range(n)]
permutations = [np.array(permutation_member(perm_arr)) for _ in range(N)]
sums = np.zeros(n)
for perm in permutations:
    sums += perm
sums = sums / len(permutations)
ev = (n + 1) / 2
print(sums, ev)

[5.499291 5.49788 5.502698 5.508888 5.500134 5.498059 5.500234 5.497063 5.497515 5.498238] 5.5

and the numerical trial behaves as expected.

Chapter 10

Utility Theory

10.1 Utility Functions


We assume that consumer preferences can be modelled using utility functions defined over consumption.

Definition 390 (Utility Functions). We denote u(c_t), where c_t is the consumption level at time t. Utility
functions satisfy (i) u′(c_t) > 0, in that more consumption is preferred over less, ceteris paribus, and
(ii) u″(c_t) ≤ 0, in that the law of diminishing marginal returns is satisfied.

Consequently, where consumption is low, the marginal utility u′(c_t) is high, and the converse is true.

10.1.1 CRRA Utility


Definition 391 (CRRA Utility Function). The CRRA utility has form

u(c_t) = c_t^{1−γ} / (1 − γ),

where γ denotes the degree of risk aversion. γ = 0 denotes a risk-neutral decision maker, with u′(c_t) =
1 and u(c_t) = c_t when γ = 0. When γ = 1, the CRRA utility is the log utility function.

The CRRA utility function has a constant response to relative changes in consumption: a fall in
consumption leads to a proportional fall in utility. See that ln( u(c)/u(c₀) ) = (1 − γ) ln( c/c₀ ).
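A small sketch of the CRRA utility and a numerical check of the log-ratio relation above; the parameter values are arbitrary assumptions for illustration.

import numpy as np

def crra_utility(c, gamma):
    if np.isclose(gamma, 1.0):
        return np.log(c)                       # log utility in the gamma -> 1 limit
    return c ** (1 - gamma) / (1 - gamma)

# numerical check of ln(u(c)/u(c0)) = (1 - gamma) ln(c/c0) for gamma = 0.5
c, c0, gamma = 2.0, 1.5, 0.5
lhs = np.log(crra_utility(c, gamma) / crra_utility(c0, gamma))
rhs = (1 - gamma) * np.log(c / c0)
print(lhs, rhs)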

Chapter 11

Statistical Finance

11.1 Simulation and Resampling Methods


11.1.1 Destruction of Patterns in Prices and Bars by Permutation
Here we are interested in permuting market data so as to destroy any price predictability, while preserving
other numerical artefacts such as the drift of the overall market. A use case is when we are trying to
compare the performance of a strategy when applied to market data (difficult to predict) and permuted
(impossible to predict) data for hypothesis testing. These routines will be heavily employed in the
problem of probabilistic analysis for trading systems and strategies (see Section 11.2). Even though a
random walk or Brownian bridge may be employed to generate prices, in scientific procedures we always
want to control all variables except one - otherwise the conclusions of our scientific experiment would
be muddled by confounders. The permutation methods provide excellent candidate solutions to destroy
price predictability while keeping constant as many other distribution properties as possible.

11.1.1.1 Permuting Price Data

Here we are interested in the permutation of a single price series, such as closing data. Recall the
permutation algorithm given in Listing (9.2). Here we want to present an algorithm that keeps the first
and last price of the time series invariant while destroying internal serial patterns. All permutations
share the same global trend and moments. Given price data (which is non-stationary), we can take the
difference of price logs (which is assumed reasonably to be stationary) and then use the permutation
algorithm discussed to generate new evolution of price paths. The new series is exponentiated to map to
the original price domain. Given P_{t∈[T]}, permute the ratios P_{t+1}/P_t, or equivalently D_t = log(P_{t+1}/P_t) = log P_{t+1} − log P_t
for t ∈ [T − 1], and permute to obtain D_t′. Then take the cumulative sum starting from log P_0, adding
elementwise the members of D_t′, to obtain the new log price series. Exponentiate. We present the algorithm that
does this:
def permute_price(price):
    log_prices = np.log(price)
    diff_logs = log_prices[1:] - log_prices[:-1]
    diff_perm = permutation_member(diff_logs)
    cum_change = np.cumsum(diff_perm)
    new_log_prices = np.concatenate(([log_prices[0]], log_prices[0] + cum_change))
    new_prices = np.exp(new_log_prices)
    return new_prices

permuted_price = permute_price([3, 1, 2, 4, 5, 4, 6, 4])
print(permuted_price)
# [3.0000, 6.0000, 4.8000, 9.6000, 14.399, 4.8000, 3.2000, 4.0000]

Listing 11.1: price permute

However, when working with multiple instruments/strategies, we cannot use this algorithm. This is
because we want to try to preserve as much distributional artefacts of the data as possible, and only
destroy price predictability. Applying different permutations to different price series ignores the joint
distribution of correlated markets and destroys inter-market relationships. For instance, this can affect
the analysis of tails of the trading strategy applied to the permuted data. We shall then make the
permutation algorithm more general and apply the same permutation index across markets. The edited
algorithm is presented here.
def permute_price(price, permute_index=None):
    if not permute_index:
        permute_index = permutation_member(list(range(len(price) - 1)))
    log_prices = np.log(price)
    diff_logs = log_prices[1:] - log_prices[:-1]
    diff_perm = diff_logs[permute_index]
    cum_change = np.cumsum(diff_perm)
    new_log_prices = np.concatenate(([log_prices[0]], log_prices[0] + cum_change))
    new_prices = np.exp(new_log_prices)
    return new_prices

def permute_multi_prices(prices):
    assert all([len(price) == len(prices[0]) for price in prices])
    permute_index = permutation_member(list(range(len(prices[0]) - 1)))
    new_prices = [permute_price(price, permute_index=permute_index) for price in prices]
    return new_prices

import random
from pprint import pprint

arr = [3, 1, 2, 4, 5, 4, 6, 4]
prices = []
for trials in range(5):
    random.shuffle(arr)
    prices.append(list(arr))
pprint(prices)
permuted_price = permute_multi_prices(prices)
pprint(permuted_price)

Listing 11.2: multiple prices permute

11.1.1.2 Permuting Bar (OHLCV) Data

In the permutation of bars, the problem is distinct from the problem of permuting price series (see Section
11.1.1.1). This is because there exists both inter-bar and intra-bar distributions that were not present
when working with a univariate time series. We want to preserve these relationships while destroying the
serial patterns, just as in any other controlled test. The intra-bar relationships could be statistics such
as the bar range, bar mid price and so on. Suppose again that we are working with all of log data. Since
the pricing data of the bar is characterised by the open (O), high (H), low (L) and close (C), we may
pick a baseline and infer the other variables relative to this baseline. Let these be the values H − O, L − O,
C − O, denoted by ∆H, ∆L, ∆C respectively (these differences can be assumed stationary since the log has
already been applied; in particular, they are log ratios of the original market variables). The intra-bar relationships are characterized by
the triplet (∆H , ∆L , ∆C ), and matches the idea put forward in Masters [14]. A characteristic property of
inter-bar relationships that holds particularly when the bars are close together in real time is the distance
between one bar’s close and the next’s open. We shall use this to represent the inter-bar relationships
and permute Ot+1 − Ct for t ∈ [T − 1]. The permutation of volume data is even trickier, since they are
unitless and can be both stationary and non-stationary across time depending on the asset. Additionally
they could exhibit seasonality effects, such as time-of-year, time-of-day or day-of-week components. In
that sense, they can affect the joint density of both inter-bar and intra-bar relationships. Although we
have not found a perfect solution to this, one of the observations I have made is that there is correlation
between magnitudes of the intra-bar ranges and volume, and intra-bar price relationships explain roughly
50% of the volume component. Therefore, instead of taking the volume deltas, we shall permute their
raw values on the same swap indices as the triplet (∆H , ∆L , ∆C ) and hope that not much of the joint
relationships were destroyed in this permutation. This gives us (∆H , ∆L , ∆C , V ).
Again these are differences in logarithmic prices - transformations on the price ratios which are
commensurate across history. Our hope is that the intra-bar 4-tuple, and inter-bar pair summarizes the
distribution of OHLCV data that we hope to keep invariant while destroying serial patterns. As in price
series, we keep both the first and last bar invariant to preserve global trend. Then the permutation is
almost exact for moderately large bar samples, since we are not explicitly fixing any two adjacent bars.
The algorithm is presented as follows, with permutation subroutines presented in Listing (9.2):
import pandas as pd
import mplfinance as mpf
import matplotlib.pyplot as plt

def permute_bars(ohlcv, index_inter_bar=None, index_intra_bar=None):
    if not index_inter_bar:
        index_inter_bar = permutation_member(list(range(len(ohlcv) - 1)))
    if not index_intra_bar:
        index_intra_bar = permutation_member(list(range(len(ohlcv) - 2)))

    log_data = np.log(ohlcv)
    delta_h = log_data["high"].values - log_data["open"].values
    delta_l = log_data["low"].values - log_data["open"].values
    delta_c = log_data["close"].values - log_data["open"].values
    diff_deltas_h = np.concatenate((delta_h[1:-1][index_intra_bar], [delta_h[-1]]))
    diff_deltas_l = np.concatenate((delta_l[1:-1][index_intra_bar], [delta_l[-1]]))
    diff_deltas_c = np.concatenate((delta_c[1:-1][index_intra_bar], [delta_c[-1]]))

    new_volumes = np.concatenate(
        (
            [log_data["volume"].values[0]],
            log_data["volume"].values[1:-1][index_intra_bar],
            [log_data["volume"].values[-1]]
        )
    )

    inter_open_to_close = log_data["open"].values[1:] - log_data["close"].values[:-1]
    diff_inter_open_to_close = inter_open_to_close[index_inter_bar]

    new_opens, new_highs, new_lows, new_closes = \
        [log_data["open"].values[0]], \
        [log_data["high"].values[0]], \
        [log_data["low"].values[0]], \
        [log_data["close"].values[0]]

    last_close = new_closes[0]
    for i_delta_h, i_delta_l, i_delta_c, inter_otc in zip(
        diff_deltas_h, diff_deltas_l, diff_deltas_c, diff_inter_open_to_close
    ):
        new_open = last_close + inter_otc
        new_high = new_open + i_delta_h
        new_low = new_open + i_delta_l
        new_close = new_open + i_delta_c
        new_opens.append(new_open)
        new_highs.append(new_high)
        new_lows.append(new_low)
        new_closes.append(new_close)
        last_close = new_close

    new_df = pd.DataFrame(
        {
            "open": new_opens,
            "high": new_highs,
            "low": new_lows,
            "close": new_closes,
            "volume": new_volumes
        }
    )
    new_df = np.exp(new_df)
    new_df.index = ohlcv.index
    return new_df

Listing 11.3: bar permute

Again, as in price permutations (see Section 11.1.1.1), we want to preserve intermarket relationships.
For instance, a high volume day in one instrument is likely to correspond in a high volume day of another
instrument, especially if the pair are correlated. To do this we want to use the same permutation swap
indices across multiple instruments. The algorithm presented does this:
def permute_multi_bars(bars):
    assert all([len(bar) == len(bars[0]) for bar in bars])
    index_inter_bar = permutation_member(list(range(len(bars[0]) - 1)))
    index_intra_bar = permutation_member(list(range(len(bars[0]) - 2)))
    new_bars = [permute_bars(
        bar,
        index_inter_bar=index_inter_bar,
        index_intra_bar=index_intra_bar
    ) for bar in bars]
    return new_bars

Listing 11.4: multiple bars permute

Although the above algorithm works great in theory, it simplifies the practical problem of shuffling data,
particularly when our testing framework accounts for a dynamic universe of trading instruments.
In practice, we are likely to be testing a multi-instrument portfolio, of which the different instruments
do not share the same period of being ‘alive’. Trading instruments are frequently added, delisted and

modified on real exchanges, and this presents an issue. For instance, a common method under the
backtesting approach is to forward-fill then back-fill our pricing data so that the universe of assets share
the same date axis. Now consider an asset that only traded in the recent window, and where most of
the older dates are back-filled. Our return series looks something like this:

R = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.02 0.01
R’ = 0 0 0 0.1 0 0 0 0 0 0 0 0 0 0.2 0 0 0 0

Obviously the permuted time series R0 is invalid! Therefore, when permuting multiple bars of equal
length, we need to ensure that all of the bars under consideration were ‘alive’. In order to permute bars
of unequal length, we can then adopt a modified approach, by permuting maximally overlapping alive local
regions of equal length data and then stitching the locally permuted bars together. Since neither the first
nor the last bar is modified in each local permutation, globally the permutation is still legitimate. Since the
total number of permutations are now equivalent to the product of the cardinalities of the permutation
sets of local regions as opposed to the cardinality of the permutation set of the global region, our penalty
is a lower total number of permutation choices to simulate from. However, for large enough dataset this
should not affect us meaningfully. A greater penalty is that price patterns are not destroyed as much as
we would have hoped, but this is traded-off for preserving intermarket correlations. Additionally, when
the bar permutation is performed in the context of hypothesis testing (see Section 11.2.2), the effect of
not destroying price patterns as strictly is a conservative estimate of the p-value, which corresponds to
a loss of power (see Section 349).
We present the modified algorithm for multi-bar permutations:
from collections import defaultdict

def permute_multi_bars(bars):
    if all([len(bar) == len(bars[0]) for bar in bars]):
        index_inter_bar = permutation_member(list(range(len(bars[0]) - 1)))
        index_intra_bar = permutation_member(list(range(len(bars[0]) - 2)))
        new_bars = [
            permute_bars(
                bar,
                index_inter_bar=index_inter_bar,
                index_intra_bar=index_intra_bar
            )
            for bar in bars
        ]
    else:
        bar_indices = list(range(len(bars)))
        index_to_dates = {k: set(list(bar.index)) for k, bar in zip(bar_indices, bars)}
        date_pool = set()
        for index in list(index_to_dates.values()):
            date_pool = date_pool.union(index)
        date_pool = list(date_pool)
        date_pool.sort()
        partitions, partition_idxs = [], []
        temp_partition = []
        temp_set = set([idx for idx, date_sets in index_to_dates.items()
                        if date_pool[0] in date_sets])

        for i_date in date_pool:
            i_insts = set()
            for inst, date_sets in index_to_dates.items():
                if i_date in date_sets:
                    i_insts.add(inst)
            if i_insts == temp_set:
                temp_partition.append(i_date)
            else:
                partitions.append(temp_partition)
                partition_idxs.append(list(temp_set))
                temp_partition = [i_date]
                temp_set = i_insts
        partitions.append(temp_partition)
        partition_idxs.append(list(temp_set))

        chunked_bars = defaultdict(list)
        for partition, idx_list in zip(partitions, partition_idxs):
            permuted_bars = permute_multi_bars(
                [bars[idx].loc[partition] for idx in idx_list]
            )
            for idx, bar in zip(idx_list, permuted_bars):
                chunked_bars[idx].append(bar)

        new_bars = [None] * len(bars)
        for idx in bar_indices:
            new_bars[idx] = pd.concat(chunked_bars[idx], axis=0)
    return new_bars

Listing 11.5: multiple bars permute

In the permutation of bars that are of granularity finer than one day (intra-day data), we employ
the bar permutation algorithm as a subroutine and independently permute each day. We also permute
the overnight gaps, of which there is one per day boundary. These two permutations are used to stitch
together the new price series. Each intra-day sample is permuted independently to maintain volatility
heterogeneity, which reflects real quoted data. The global trend remains invariant. Volatility clusters
are destroyed in all bar permutations. Bootstrapping is no better in this aspect.

11.1.1.2.1 Cautionary Note on the Walkforward Data Shuffle Since the OOS trade returns
of unpermuted data do not involve the data in the optimization period of the first fold, the training
period of the first fold should never be permuted into the rest of the OOS data used in computing
the performance criterion of a strategy. Since the cardinality of permutations depend on the size of
the permuted set, the total number of permutations depend on the fold size. In order to overcome this
restriction, and also avoid the illegality of permuting in-sample data with out-of-sample data, we permute
the first fold’s training set and data from first fold inclusive and onwards (everything else) separately.
Then the walkforward is done as in usual settings to generate oos trades.

11.2 Probabilistic Analysis of Trading Systems


Here we want to develop statistical theory in the domain of analysing trading systems and strategies. In
particular, we are concerned with the rejection of the hypothesis that our trading system is worthless.
Given some trading results, we want to be able to separate luck from skill, and quantify the degree
to which it is improbable that we are working with a worthless system. We also want to do this in
the context of existing market conditions, such as drift of the overall market. In essence we want to
perform attribution to random noise, beta and alpha. Constructing intervals and expectations for future
returns would be useful. The theory and practice in analysis of trading systems are heavily rooted in the

theory of hypothesis testing (see Section 7.1), parametric and non-parametric methods. Computational
and numerical methods such as the bootstrap and Monte Carlo methods (see Section 6.11) are critical
components of these tests. Our discussions would span different phases of the quant-process. In particular
we look at validation in early-stage model development, as well as middle-late stage probabilistic analysis.
In early stage model development, we often employ parameter optimization techniques, which naturally
result in over-fit models. We shall see that permutation tests provide excellent methods to test for
overfitting without having to use expensive out-of-sample data. In the event that we have a fully
developed trading strategy or system, we are now interested in its out-of-sample performance to produce
unbiased estimates of the trading strategy. This statistical estimator is (in general) unbiased but we
have no guarantees on the standard error - the likeliness that this estimator performance is replicated
in the future is almost surely zero. Hence we would like to quantify how good the performance of our
strategy was, not in relation to the absolute performance but also in the probabilistic sense. The use
cases introduced so far fall under the single strategy problem. In general, a trader would be working
under multi-strategy settings. Our discussions would look at statistical methods when working and
analysing both single strategy and multi-strategy settings. In addition, constructing confidence bounds
for parameters of interest would be useful to build expectations for the future.

11.2.1 Parametric and Non-Parametric Methods for Tests of Location


One may consider applying the one-sample t-test (see Section 7.4.1) and see if our return vector (with
the zeros filtered out) have returns significantly different from zero. However, recall in the t-test our
assumption is that population distribution is approximately normal. Our return vector is far from normal
in practice, and the test is for all purposes invalid. The presence of outliers and extreme returns inflate
the standard error and the test statistic is unreliable. Then we shall look for a more robust method in
non-parametric statistics. In particular we can test for median using the sign test (see Section 7.4.2) that
median returns m̂ were equal to zero or some other baseline value such as the benchmark index. The code
to perform the tests were included in the relevant discussions. However, since the sign test does not make
use of the magnitudes of the distance between m̂ and the data point, in the case of negatively skewed
strategy returns, we would be too optimistic about our trading strategy performance. To account for the
difference in magnitudes in losing days and winning days, we shall employ a more powerful test such as
the Wilcoxon Signed Rank Test (see Section 7.4.4). The distribution that generates these samples comes
from the underlying interaction between the strategy and the market returns, and the samples are hence only weakly
correlated. The results from the sign test and the signed rank test are both generally valid. If we want to
check that our strategy performed (statistically, significantly) better than some baseline value, we could
modify the null parameter. For instance the null parameter can be set to the net market return over the
test period scaled by the long-bias factor. However, these parametric and non-parametric tests presented
do not give us a way to compute statistical significance in the way of general summary statistics, such
as drawdown. An alternative to the non-parametric methods is to employ bootstrapping tests of the
parametric ones, which solves the issue of non-normality since bootstraps are outlier-robust. These
methods are described in Section 11.2.3. However bootstrap tests are poor when evaluating statistics
with long tails. Estimation of ratios are poor in the bootstrap tests. Furthermore, since serial correlation
is destroyed in the bootstrap, any order statistics based summaries (such as the drawdown) are invalid.
Before introducing the Monte Carlo methods in Section 11.2.2 and bootstrap methods in Section
11.2.3, let us discuss some more methods under the non-parametric tests we can employ here for com-
pleteness. Again in addition to the Wilcoxon Signed Rank test, we may want to build a confidence

interval estimation for the median return. The relevant theory and application is the Hodges-Lehmann
estimate and Tukey's interval estimation. These are discussed in Theorem 388.

11.2.2 Monte Carlo Permutation Methods for Arbitrary Performance Criterion

The trading system typically consists of position vectors (or equivalently, their weight vectors) and the
corresponding position return vector that generates the portfolio return series. Let there be p instruments,
let their weight vector on day t be w_t, and let the return vector from day t to t + 1 be denoted r_t. Let our
leverage on day t be written as the scalar l_t. We set the constraint that ‖w_t‖₁ = Σ_{i=1}^p |w_{t,i}| = 1. Then our
return between time t and t + 1 is the dot product l_t w_t · r_t. Our performance over the test
period is then written Π_{t=1}^T (1 + l_t w_t · r_t). This is an ordered pairing, which can be destroyed by permuting
either the decision-making series (l_t, w_t)_{t∈[T]} or the market data series r_{t∈[T]}. Suppose we shuffle the
return series to obtain r′_{t∈[T]}; then the cumulative return Π_{t=1}^T (1 + l_t w_t · r′_t) gives us a method of comparing
random decisions against the original, rule-based pairing. We may then construct a hypothesis test with
the null hypothesis claiming that the rule-pairing is effectively random (and therefore useless). Suppose
we do m such randomized simulations, then under this null hypothesis our original performance is likely
to be as good as any other. Then the probability that the original series is at least j-th best out of
the m additional randomizations is given by j/(m + 1). This can be seen by writing the randomizations in
order ·1 · 2(· · · ) · (m − 1) · m·, and then slotting our series inside any of the '·'s. Stated otherwise, there
is a probability of (k + 1)/(m + 1) that at most k of the randomizations have performance at least as good as
the original. At this stage, we address a technical point that might confuse the reader. Suppose we are given a
performance criterion value x and run two shuffled trials, with x being the best performing value. On a worthless
model assumption the probability of x beating any shuffled trial equals one half, and our p-value computation
should seemingly be (1/2)² = 1/4 ≠ 1/3, where one-third is the p-value suggested by our preceding argument. See that the
p-value computation of 1/4 corresponds to the sign test hypothesis testing (see Section 7.4.2) with the null
hypothesis H0 : m_x = x. This is not what we are testing for! We are testing whether the value of x obtained
was improbable conditional on a worthless model, not whether it is possible that our true median criterion
was x.
Depending on the implementation, lt can be dependent on either the strength of the signal, the risk
control component of the trading system or both. Suppose we want to control for the risk management
capacity and simply analyse the trading system based on its ability for predictive strength - we can modify
the return criterion to compare the value ΠTt (1 + wt r0t ) of shuffles against ΠTt (1 + wt rt ). Additionally,
we shall note that this cumulative return criterion is not necessarily best and can be replaced with some
arbitrary objective function, say ψ((lt , wt , rt )t∈[T ] ). Since our p-value computation is discrete, in order
to better approximate our null distribution we shall use m >> 0.
In the permutation of an ordered series, the order is destroyed. Therefore, any summary statistic
that depend on serial correlation of the time series is no longer valid. This is critical to note in the case
of a single sample. Now, in the case of paired samples, such as the ordered (wt , rt )t∈[T ] it is a sufficient
condition that only one of the pairs are serially uncorrelated. The weight vector does not carry this
property in general, since the decisions are typically computed on a rolling computation that necessarily
involves overlaps of market data. On the other hand, the return series rt∈[T] can typically be assumed to
have weak to no correlation. In statistical finance, it is typically the second moment of returns r2t that
are assumed serially correlated - in fact by permutation of the return series we would lose any volatility
clusters. Our permutation test is therefore legal in this aspect. This is one of the methods to perform the

permutation test - by shuffling the decision series from the return series. Another method that is more
powerful, but computationally more expensive is to permute the data itself by destroying price patterns
but preserving global trends and other artefacts. Then the decision making (the trading strategy) is
performed on this permuted data that should be impossible to predict. Price and bar permutation
methods were discussed in Section 11.1.1. Note that the same principle of only shuffling between ‘alive’
decision/return series applies in the above and all of the below discussions.
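A minimal sketch of the decision-shuffle version of this test, assuming arrays of weights, returns and leverage are given; the cumulative log return is used as the criterion ψ purely for illustration, and the function and parameter names are assumptions rather than a prescribed implementation.

import numpy as np

def permutation_pvalue(weights, returns, leverage, m=1000, seed=0):
    # weights, returns: T x p arrays of decisions and market returns; leverage: length-T array
    weights, returns, leverage = map(np.asarray, (weights, returns, leverage))
    rng = np.random.default_rng(seed)

    def psi(rets):
        # performance criterion: cumulative log return of the (leverage, weight, return) pairing
        daily = leverage * np.sum(weights * rets, axis=1)
        return np.sum(np.log1p(daily))

    observed = psi(returns)
    count = 0
    for _ in range(m):
        shuffled = returns[rng.permutation(len(returns))]   # destroy the decision/return pairing
        if psi(shuffled) >= observed:
            count += 1
    return (1 + count) / (m + 1)                            # (1 + k) / (m + 1)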

11.2.2.1 p-value in in-sample and overfit detection by data shuffle

Here we assume we have the full machinery available to the underlying trading strategy. In general
the quant strategy development process is split between in-sample and out-of-sample (OOS). The out-
of-sample data is used as an acid test and should remain untouched until we have done ‘all we could’,
since any OOS data is no longer so upon usage. OOS testing is therefore expensive. In general, we
shall hope to perform all of our optimizations in-sample and iterate towards a model we are confident of
before testing it OOS to obtain an unbiased performance estimate. In many instances we may employ
optimization and tuning techniques in sample. If we were to perform such optimizations, we would obtain
at best an optimistic estimate of expected returns, since we are iterating on the same dataset ex-post
view. This optimism is warranted in most part to overfitting, where we fit to both legitimate market
patterns and random error - in this section we shall hope to quantify the degree to which our optimism
is overfit-attributed. Here we assume that the optimization method is systematic and repeatable, else no
objective quantification of optimism may be obtainable. In addition, for a good in-sample performance,
we are interested in the quantification of this goodness in relation to the overfit. Using the permutation
algorithms for data shuffling of prices (see Section 11.1.1.1) and bars (see Section 11.1.1.2), employ the
trading machinery to obtain new ψ((lit , wti , rit )t∈[T ] ) for i ∈ [m] where i is the i-th trial of m permutations
of the trading machinery (inclusive of the optimization) and ψ is our performance criterion function. Then
our p-value for the test is computed

1 + i=1 1{ψ((lit , wti , rit )t∈[T ] ) ≥ ψ((lt , wt , rt )t∈[T ] )}


Pm
, (3167)
m+1
k+1
or m+1 , where k is the number of times the strategy on shuffled data performs better or ties with our
original strategy. Note that since the optimization is performed within each iteration, if we are overfitting,
then each of our shuffled (and therefore impossible to predict) trials also give good performance in-sample.
The p-value can then be interpreted as the probability that we obtain a performance as good as we did,
or better, under the null hypothesis of a worthless predictive model. The performance criterion ψ is
an estimator that is a function of random variables - therefore it useful to select estimators that are
not only meaningful but also stable in terms of variance. Other considerations such as the correlation
of in-sample criterion and out-of-sample criterion can go into the choice of ψ. A good candidate is the
profit factor on logarithmic returns marked-to-market. The variance of the selected performance estimator
decreases with the amount of data used in the performance criterion, and the estimator is in fact consistent
(see estimator consistency in Section 333) by the Law of Large Numbers (Theorem 361) if ψ can be expressed
as a function of moments (see method of moments, Section 332). The same reasoning suggests
that when we are performing the permutation tests OOS, we should not be stingy with holding out data.
The walkforward technique allows us to generate a large number of OOS trades. The true value of the
permutation test lies in allowing us to quantify the (im)probability of our good results. This is to be kept in mind
in the discussions to follow.
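As an illustrative sketch of the computation in Equation 3167 (not the author's production code; the permute_bars and run_strategy routines named below are hypothetical placeholders for the data-shuffling algorithms of Section 11.1.1 and the full trading machinery, optimization included), the p-value may be assembled as follows:

import numpy as np

def permutation_pvalue(psi_original, psi_permuted):
    # p-value per Equation 3167: (1 + #{permuted criterion >= original criterion}) / (m + 1)
    psi_permuted = np.asarray(psi_permuted, dtype=float)
    m = psi_permuted.size
    return (1.0 + np.sum(psi_permuted >= psi_original)) / (m + 1.0)

# usage sketch (permute_bars and run_strategy are assumed, not defined here):
# psi_orig = run_strategy(bars)
# psi_perm = [run_strategy(permute_bars(bars, seed=i)) for i in range(m)]
# p_value = permutation_pvalue(psi_orig, psi_perm)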

11.2.2.2 p-value of asset timing in oos by decision shuffle

Here we assume we have $(l_t, w_t, r_t)$ produced from a model run on OOS data. Either the model was
calibrated independently of the data by inductive reasoning and hence had no optimization stage, or it was
the result of an in-sample calibrated model run on a hold-out set. Equivalently, these could also have
been produced from optimization, but under the walk-forward. Then we have an unbiased estimate
$\psi((l_t, w_t, r_t)_{t\in[T]})$ and, similar to the in-sample case in Section 11.2.2.1, we can compute the test statistic
$\sum_{i=1}^{m} \mathbf{1}\{\psi((l_t^i, w_t^i, r_t^i)_{t\in[T]}) \geq \psi((l_t, w_t, r_t)_{t\in[T]})\}$ with p-value

$$\frac{1 + \sum_{i=1}^{m} \mathbf{1}\{\psi((l_t^i, w_t^i, r_t^i)_{t\in[T]}) \geq \psi((l_t, w_t, r_t)_{t\in[T]})\}}{m+1} \qquad (3168)$$

where $(l_t^i, w_t^i, r_t^i)_{t\in[T]}$ is the $i$-th shuffled grouping among $m$ permutation trials. Note that we only need
to shuffle either the decision-making series or the market returns but not both, and as such we can
use a permutation swap index obtained from a member of perm([T]) and apply it to $(r_t)_{t\in[T]}$ to obtain
$(r_t^i)_{t\in[T]}$ for $i \in [m]$ with the leverage and weight vector series unchanged. This is computationally cheaper
and requires less information - we do not even need the trading strategy that produced the pairings or
knowledge of the optimization technique embedded in the model. Of course, it requires that the original
performance score $\psi((l_t, w_t, r_t)_{t\in[T]})$ is unbiased, or our hypothesis test is invalid. This is a simpler and
faster method to test whether there was timing skill in the performance generated beyond the natural
drift, controlling for asset picking ability. Assuming we are working with $p$ instruments such that $w, r$
are $p$-vectors, then if the overall market had positive drift such that $\sum_{t}^{T}\sum_{j}^{p} r_t^j > 0$, then a strategy
that is generally long the market, $\sum_{t}^{T}\sum_{j}^{p} w_t^j > 0$, is likely to have earned money even under the shuffled
conditions. More specifically, if some asset $i$ had persistently positive return over the test period such
that $\sum_{t}^{T} r_t^i > 0$, then any shuffling of the weight vector $w^i$ across time is likely to yield positive return
for that asset component. Therefore our permutation test gives us information as to the ability of asset
timing rather than asset picking. Inside this permutation test we are fixing the totality of weight applied
to each asset across time, but only permuting how we vary the weight in time. To see more clearly that
we are testing for asset timing rather than asset picking, assume a universe of ten assets, indexed one
through ten. First five assets are consistently going up while the last five are going down in price, such
that the overall market is net flat. Suppose we had good asset picking ability and we were consistently
long those going up and short those going down, such that the weight vector across time looks something
like $(0.1, 0.1, \cdots, -0.1, -0.1)$. Then no matter how we permute this index across time we will always
end up with good results in excess of the net zero drift, and our p-value would be large. We would
conclude our model did nothing, despite it having correctly identified the favourable assets to be
positioned in. Rather, it is how we vary the weights across time (the asset timing component) that is
being tested under the hypothesis test. We call the p-value obtained from this hypothesis test to be the
timer’s p-value.
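A minimal sketch of the timer's p-value follows, assuming arrays l of shape (T,), and w, r of shape (T, p) holding the leverage, weight and return series, and using the profit factor purely as an illustrative choice of ψ:

import numpy as np

def profit_factor(pnl):
    # illustrative performance criterion: gross gains over gross losses
    gains, losses = pnl[pnl > 0].sum(), -pnl[pnl < 0].sum()
    return gains / losses if losses > 0 else np.inf

def timer_pvalue(l, w, r, m=1000, seed=0):
    # Equation 3168: permute the time index of the returns only, keeping (l, w) fixed
    rng = np.random.default_rng(seed)
    psi_orig = profit_factor(l * (w * r).sum(axis=1))
    count = 0
    for _ in range(m):
        perm = rng.permutation(r.shape[0])          # swap index drawn from perm([T])
        count += profit_factor(l * (w * r[perm]).sum(axis=1)) >= psi_orig
    return (1 + count) / (m + 1)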

11.2.2.3 p-value of asset picking in oos by decision shuffle

In a similar setting to the asset timing test in Section 11.2.2.2, we assume we are given $(l_t, w_t, r_t)$ produced
from a model run on OOS data. Then we have an unbiased estimate $\psi((l_t, w_t, r_t)_{t\in[T]})$ and, similar to
the in-sample case in Section 11.2.2.1, we can compute the test statistic $\sum_{i=1}^{m} \mathbf{1}\{\psi((l_t, w_t, \tilde{r}_t^i)_{t\in[T]}) \geq \psi((l_t, w_t, r_t)_{t\in[T]})\}$ with p-value

$$\frac{1 + \sum_{i=1}^{m} \mathbf{1}\{\psi((l_t, w_t, \tilde{r}_t^i)_{t\in[T]}) \geq \psi((l_t, w_t, r_t)_{t\in[T]})\}}{m+1} \qquad (3169)$$

where $(l_t, w_t, \tilde{r}_t^i)_{t\in[T]}$ is the $i$-th shuffled grouping among $m$ permutation trials, in which the return vector
has been internally shuffled such that $\forall t \in [T]$, for some fixed $p$-length swap index $s^i \in$ perm([p]), we
have $\tilde{r}_t[\cdot] = r_t[s^i[\cdot]]$ for trial $i \in [m]$. The swap index is held constant across all time periods. Again,
this is computationally not as expensive as the data shuffling methods and gives us meaningful probability
values, assuming that $\psi((l_t, w_t, r_t)_{t\in[T]})$ is an unbiased estimate. Here we are keeping the decision series
fixed across time, but meddling with the signal mapping function. To see that we are testing for asset
picking rather than asset timing, consider the same ten asset problem suggested in Section 11.2.2.2, where
we have expected mean zero drift. Since we are keeping the time index unpermuted, the performance
and p-value depends solely on the signal mapping and relates to the asset picking problem. We call the
p-value obtained from this hypothesis test to be the picker’s p-value.
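The picker's p-value admits an analogous sketch under the same array assumptions as the timer's example above; the only change is that a single swap index per trial is applied across the asset axis, with the decision series untouched (the choice of ψ passed in is again illustrative):

import numpy as np

def picker_pvalue(l, w, r, psi, m=1000, seed=0):
    # Equation 3169: one fixed column permutation of the returns per trial,
    # applied identically at every t, leaving (l, w) unchanged
    rng = np.random.default_rng(seed)
    pnl = lambda rr: l * (w * rr).sum(axis=1)
    psi_orig = psi(pnl(r))
    count = 0
    for _ in range(m):
        si = rng.permutation(r.shape[1])            # swap index drawn from perm([p])
        count += psi(pnl(r[:, si])) >= psi_orig
    return (1 + count) / (m + 1)

# e.g. picker_pvalue(l, w, r, psi=lambda x: x.mean() / x.std())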

11.2.2.4 p-value of trader skill in oos by data shuffle

Here we assume we have the full machinery available to the underlying trading strategy. On OOS data,
we can produce $(l_t, w_t, r_t)$ by running said machinery; then $\psi((l_t, w_t, r_t)_{t\in[T]})$ is an unbiased estimator
for the performance criterion. Similar to the in-sample case in Section 11.2.2.1 we can compute the test statistic
$\sum_{i=1}^{m} \mathbf{1}\{\psi((l_t^i, w_t^i, r_t^i)_{t\in[T]}) \geq \psi((l_t, w_t, r_t)_{t\in[T]})\}$ with p-value

$$\frac{1 + \sum_{i=1}^{m} \mathbf{1}\{\psi((l_t^i, w_t^i, r_t^i)_{t\in[T]}) \geq \psi((l_t, w_t, r_t)_{t\in[T]})\}}{m+1} \qquad (3170)$$

where $(l_t^i, w_t^i, r_t^i)_{t\in[T]}$ is the tuple generated by running the machinery on the $i$-th data bar shuffling
(see Listing 11.5). A total of m permutation trials are executed. The p-value relates to the probability
of getting as good or better results on a worthless, null model in relation to asset-timing (see Section
11.2.2.2) and asset picking (see Section 11.2.2.3), relative to the overall market drift. To see this includes
both asset picking and asset timing components, trivially realize that after data shuffling both the signal
mapping in a single time period t and the variation of weights across time in multiple periods are variable.
While this is a powerful test, this can be somewhat computationally demanding since we need to run
the full machinery (such as the signal computations) m times. An alternative, computationally cheaper
approach would be to employ a double shuffle of the sort in Section 11.2.2.2 followed by Section 11.2.2.3.
For instance, we can use this permutation test to see if a long momentum strategy on an equity pick
was due to net positive drift or the presence of momentum effect itself. Although they are similar,
momentum and drift are distinct concepts. Momentum implies the persistence and predictability of
drift, while the latter is just a location statistic. Note that we need to be careful about shuffling data
in relation to a performance criterion computed from OOS trades in the walkforward optimization (see
Section 11.1.1.2.1). We call the p-value obtained from this hypothesis test to be the trader's p-value.

11.2.2.5 p-value of signal families in oos by data shuffle and bias adjustments

Here we assume we have the full machinery available to the underlying trading strategy. We want to
test for the robustness and strength of signals in relation to future market movements. In most cases
of signals measured at time t, their predictive power (if any) decays with time. Since we cannot compute
the instantaneous return of a signal, we can measure the predictive ability of a signal by measuring
its correlation with rt . In order to prevent serial correlation, the returns should be marked-to-market
daily and not contain multiple periods. If for some reason we believe the signal is mapped to a lagged-
return, we can measure its correlation to rt+k , where k is the proposed lag. However in most cases
k = 0 and predictive strength decays w.r.t time. Another note is that we may choose to normalize the

next day (or some other signal measure) return against volatility, since magnitude of returns should be
taken in context of the overall volatility of the market. We may also adopt variants such as taking the
predictive ability to be the idiosyncratic component of return, rather than the absolute one. Other than
the correlation function of signal and future returns, other measures of uncertainty reduction could be
used. Mutual information functions also offer good candidates, as do the usual performance criteria.
When choosing among signal families, another statistical bias called selection bias creeps into
our process. In the development process, the predominant bias of concern is the training bias, since we
iterate towards a better solution using the same dataset in parameter optimization. This performance
is not guaranteed (actually, almost guaranteed not) to remain as favourable in the future. At this stage,
the Monte Carlo permutation methods are suitable to detect for the overfit, and we apply the techniques
discussed in Section 11.2.2.1. However, each strategy/signal is compared in isolation to its permuted
data - and the (systematically repeatable) optimization of parameters relevant to the strategy is executed
in-silo. Comparing between strategies would be invalid, since the optimization method needs to be
equitable - optimizing a strategy with multiple parameters is not the same as optimizing a univariate
strategy, and therefore the comparison of their optimized performance criteria is invalid. However, once
we pass the development stage into OOS testing, our strategy criteria are unbiased estimators and
we are working on equal footing. Say we have n candidates and corresponding (unbiased) performance
criterion values xi ∼ FXi , i ∈ [n], a trader might be tempted to pick the best k strategies X(i∈[n−k+1,n])
and take them to production, thinking they are representative of future returns. This is the fatal error of
selection bias - since once we are picking k < n strategies we are dealing with the statistical distributions
of their joint order statistics (see Section 346), and not their marginal distributions. At this stage, we
might have a few questions we would like to answer, such as (i) is our best performing strategy statistically
significant? (ii) is the combined performance of our top k strategies statistically significant? and (iii)
what is the value of k such that individually each of our best k strategies are statistically significant?
We hope to answer these questions by the Monte Carlo method. In the following sections to discuss,
assume that we have n strategies/signals/indicators, producing performance criterion sampled from the
distribution Fi∈[n] . Then a realization of their performance estimate on the same OOS dataset can be
written xi∈[n] and let them be written in order by x(i∈[n]) , where x(i) < x(i+1) .

11.2.2.5.1 p-value of 1-best signal For the strategy corresponding to estimate x(n) from X(n)
what is the probability value of obtaining a result as good or even better, conditional on a worthless
model? Then we can perform the data shuffling m times and apply the trading machinery on each of
the n strategies, with p-value computation:
$$\frac{1 + \sum_{j=1}^{m} \mathbf{1}\{\psi((l_t^j, w_t^j, r_t^j)_{t\in[T]})_{(n)} \geq \psi((l_t, w_t, r_t)_{t\in[T]})_{(n)}\}}{m+1} \qquad (3171)$$

where $\psi((l_t, w_t, r_t)_{t\in[T]})_{(n)} = x_{(n)}$ is the best performer out of the $n$ strategies in our original, unpermuted data. This is fixed in the experiment. On the other hand, the variable $\psi((l_t^j, w_t^j, r_t^j)_{t\in[T]})_{(n)} = x_{(n)}^j$
is the best performing strategy (random) generated on the j-th permutation trial. Interestingly, an ana-
logue of this hypothesis test can be drawn to the in-sample overfit detection test discussed in Section
(11.2.2.1), by considering the n strategies as a single multivariate signal, and where the in-sample opti-
mization step is equivalent to the 1-best selection. The selection bias is equivalent to the training bias
under the detection of overfit technique.

11.2.2.5.2 p-value of k-combined signals For the strategies corresponding to estimate {X(i) |i ∈
[n − k + 1, n]}, what is the probability value of obtaining a combined strategy as good or even better,

conditional on a worthless model? For instance, suppose we have fifty moving average pairs, and we
choose the best twenty-five to form a unified momentum signal. What is the probability that this unified
momentum signal produced at least as good of a return behavior? This is a trivial extension of the 1-best
signal (see Section 11.2.2.5.1), and we can compute the p-value on m data shuffled trials:

$$\frac{1 + \sum_{j=1}^{m} \mathbf{1}\{\psi(\wedge\{(l_t^j, w_t^j, r_t^j)_{(i\in[n-k+1,n])}\}_{t\in[T]}) \geq \psi(\wedge\{(l_t, w_t, r_t)_{(i\in[n-k+1,n])}\}_{t\in[T]})\}}{m+1} \qquad (3172)$$

where $\wedge$ is the alpha combiner function that maps a $p$-signal vector into a single unified signal, and
$\psi(\wedge\{(l_t^j, w_t^j, r_t^j)_{(i\in[n-k+1,n])}\}_{t\in[T]})$ is the performance obtained from combining the $k$ best random strategies on shuffled data.

11.2.2.5.3 p-value of k-marginal signals by bounding Consider the settings in the 1-best problem
discussed in Section 11.2.2.5.1 - we may be interested in the second best and onwards strategies, that is,
for the estimates $x_{(i<n)}$, their probability values of obtaining a result at least as good conditional on a
null of worthless models. Since we are using the maximum among the $n$ strategies in each trial $j \in [m]$ as
the comparison, we can use the same methodology to obtain bounds on the unbiased p-value estimates of
the other rank ordered strategies. For the strategy corresponding to $x_{(i)}$, $i \in [n]$, the selection-bias adjusted
p-value $p_i$ has upper bound $p_i^0$ given by

$$p_i^0 = \frac{1 + \sum_{j=1}^{m} \mathbf{1}\{\psi((l_t^j, w_t^j, r_t^j)_{t\in[T]})_{(n)} \geq \psi((l_t, w_t, r_t)_{t\in[T]})_{(i)}\}}{m+1}. \qquad (3173)$$

In the case when $i = n$ we have $p_i = p_i^0$, and otherwise $p_i \leq p_i^0$. We will obtain an overly conservative
estimate, and by taking these bounds as significance cutoffs, we might not have as much power (see Definition 349)
as we would like. However, those that qualify the size bounds are at least as statistically significant as their
p-values suggest.

11.2.2.5.4 p-value of k-marginal signals by greedy selection and FER control In Section
11.2.2.5.3 we provided a method to get upper bounds on probability values for getting a result at least
as good as the observed value for each of the rank ordered signals. However, an issue was the loss
of power, since we were working with upper bounds rather than their true values. Consequentially,
we could have been potentially dismissing indicators that we would have concluded to be exhibiting
statistical significance if the true probability values were known. Although no disastrous error was made
in permitting a worthless strategy, there is potential loss of profit. We shall then look to control their
familywise error rate (see Definition 349), and arrive at an asymptotically most powerful hypothesis
test that allows us to obtain the true p-value for each rank ordered indicator. This algorithm in the
general sense is based on Resampling-Based Stepdown Multiple Testing, studied by Romano and Wolf [18].
The resampling refers to a general category of bootstrapping, permutation or randomization method.
The application of the permutation based stepdown multiple testing for selection bias adjustment was
suggested by Masters [14]. No proofs are provided, although reasoning may be intuited. Numerical tests
may be employed. The bounded p-value algorithm uses the null hypothesis that all competitors are
worthless (unrelated to the prediction target), and the p-value is only an equality for the best performing
signal. The hypothesis test introduced in this section tests each of the null hypotheses while retaining
strong control (see Definition 349) of the familywise error rate α. More
importantly, as opposed to the bounded p-value algorithm suggested in Section 11.2.2.5.3, the test (again,
conjectured) has maximum power, meaning among all possible tests at FER α, this test is the best at
correctly (on average) rejecting the null hypothesis when it is false. This is opposed to the bounded

p-value algorithm, which retains weak control of FER and has good power for the best candidate only.
The algorithm is a modification of the said algorithm that makes use of the upper-bound probability. It
can be explained as follows:

1. Run the bounded p-value algorithm as in Section 11.2.2.5.3.

2. Reject the null hypotheses which have upper bound p-value $p_i < \alpha$, and terminate (go to step 5.)
if none of the null hypotheses may be rejected.

3. Remove the strategies whose null hypotheses have already been rejected from the candidates under
consideration.

4. Go to step 1. and rerun the bounded p-value algorithm, without the already rejected null hypotheses
inside the algorithm, until termination at step 2.

5. When steps 1. to 4. terminate, the null hypotheses rejected thus far are significant, and the
remaining null hypotheses are not rejected.

When the algorithm terminates, we would have obtained the null hypotheses that would be rejected were the
true values computed. Note that if we want to compute exact p-values for each and every rejected null
hypothesis, we can conduct the same algorithm, except we only remove the smallest p-value candidate
at step 3. instead of all of those that already rejected the null hypothesis. This would give us their
p-value in future iterations. For those not rejected, we only have upper bounds on the p-values - but
this is of little practical interest to us anyway. At each stepdown we approximate the null distribution
of the test statistic by permutation, but this null distribution is restricted to those which have not yet
rejected the null hypothesis, thus shrinking the number of competing distributions that approximate the
null distribution as we advance. Note that in practice, we do not have to rerun step 1 on each stepdown
- we can pre-compute their permuted trials and then perform the stepdown part of the algorithm by
referencing the cached values without running our trading machinery again. No mathematical proof is
provided in either Romano and Wolf [18] or Masters [14]. The intuition is that since the FER concerns making
at least one or more errors, and we only perform the stepdown iff we reject the null hypothesis of the
best candidate, the FER is maintained at α. That is, by construction of the test, at each step the best
candidate has type I error probability of α, and if we terminate we make no rejection and we are done. If
we continue, then we make type I error probability of α. Making error from the next iteration onwards
is included in the condition 'at least one error' and our FER is unchanged. Since empirical density
functions are consistent estimators (see Theorem 392), the permutation test gives us a p-value against the
asymptotically exact estimate of the underlying null distribution (and hence asymptotically the most
powerful one).² The strong control arises from having tested greedily from our best candidate down to
the worst candidate or termination, whichever was earlier. Note that we need to terminate the algorithm
once the best candidate no longer rejects the null hypothesis, since it is possible that under the shrunk
hypothesis test that p-value decreases in future step-downs. The issue of monotonicity is discussed and
solved in Romano and Wolf [18], and here we deal with it by immediate termination, since we are not
particularly interested in the p-values of the candidates which are statistically insignificant in the first
place. This test gives us the probability values from the 1-best test (Section 11.2.2.5.1), and finer bounds
on the p-values relative to the k-marginal bounded tests (Section 11.2.2.5.3). These are implemented
under the Russian Doll Testing framework, presented in Appendix A.
2 (verify this) our intuition is that this test is an unbiased, uniformly most asymptotically powerful everywhere test
based on the Neyman Pearson Lemma (see Lemma 12). we hope to prove this in future work.
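A sketch of the stepdown procedure in code, assuming the permuted criterion values have been pre-computed and cached (as suggested above) into a matrix psi_perm of shape (m, n) and the observed values into psi_orig of length n; this is our own minimal sketch, not the Russian Doll framework of Appendix A:

import numpy as np

def stepdown_reject(psi_orig, psi_perm, alpha=0.05):
    # permutation-based stepdown multiple testing: at each step the null distribution
    # is the per-trial maximum over the still-active candidates; reject those whose
    # p-value falls below alpha, drop them, and repeat until no further rejections
    psi_orig, psi_perm = np.asarray(psi_orig), np.asarray(psi_perm)
    m, n = psi_perm.shape
    active = np.ones(n, dtype=bool)
    rejected = np.zeros(n, dtype=bool)
    while active.any():
        max_null = psi_perm[:, active].max(axis=1)
        pvals = (1 + (max_null[:, None] >= psi_orig[None, :]).sum(axis=0)) / (m + 1)
        new = active & (pvals < alpha)
        if not new.any():
            break                      # terminate once no active candidate rejects
        rejected |= new
        active &= ~new                 # stepdown: shrink the competing set
    return rejected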

11.2.3 Monte Carlo Bootstraps³
Inside the Monte Carlo analysis of trading systems (see Section 11.2.2) we were interested in whether
the decision making series were an intelligent one. We were interested in the predictive power of the
strategy to enter into positions in anticipation of future market returns. In the bootstrap test we are
more interested in whether her returns are sufficiently far from zero or some other baseline. These are
fundamentally different hypothesis tests, and can be performed without the weight (and leverage) vector
- we need only the portfolio return vector π = ((πt = lt wt rt )t∈[T ] ). Although this is computationally and
information-ally less demanding, bootstrap tests tend to be less accurate than the Monte Carlo methods
in p-value computation. It is also poor in relation to hypothesis tests involving ratios, such as the Sharpe
ratio and the profit factor, while the Monte Carlo permutation methods retain high degree of accuracy.
If this return sample π were approximately normal or had no outliers, recall that we employ one-sample
t-tests as discussed in Section 11.2.1. If the assumptions were satisfied then while the sample mean gives
us an idea of how good our strategy looks, the p-value puts this in the perspective of luck by taking
into consideration random variation. Of course, this is practically useless since market returns are well,
non-normal and also chock-full of outliers. In the bootstrap we can work with weaker assumptions about
the distribution of the underlying data, and make another assumption instead (which may be considered
somewhat of a stretch): that the shape of the distribution of the sample approximates the shape of the true
distribution well. In other words, we are assuming that the empirical C.D.F (see Definition 361) is close to
the unknown F . Then sampling from this sample as if it were a population would yield similar sampling
distributions as if it were sampled from the population itself. Since we are testing if µπ = Eπ̄ > π0 , then
we are interested in whether the estimator π̄ is sufficiently improbable by computation of p-value from
the test statistic. The sampling distribution must be approximately normal by the Central Limit Theorem,
and we have $z = \frac{\bar{\pi} - \pi_0}{SE(\bar{\pi})}$. Then by the Law of Large Numbers (see Theorem 361) the sample standard
deviation of the $N$ simulations

$$\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(\bar{\pi}_i - \bar{\bar{\pi}}\right)^2} \qquad (3174)$$

approaches the true value of the standard error in our estimator π̄ and our z-score is valid, allowing us to
compute accurate p-values. The recommended value of N could be between 500 ∼ 10000, or even more
depending on the distribution at hand and accuracy we desire. This is the core idea of the bootstrap
method, and we shall use an extension of this. Note that here we assumed that the standard deviation
(and other moments) of the estimator is independent of the actual parameter value - but this is not
always the case. To correct for this we apply a more sophisticated technique known as jackknifing.

3 note that in the bootstrap we work under the same legality assumptions as discussed in Section 11.2.2 in relation to

destroying serial correlation - the bootstraps also destroy the order statistics.
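A minimal bootstrap sketch, assuming a one-dimensional array pi of marked-to-market portfolio returns and testing the mean against a baseline pi0; the bootstrap standard error stands in for SE(π̄) in the z-score above:

import numpy as np
from math import erf, sqrt

def bootstrap_pvalue(pi, pi0=0.0, n_boot=2000, seed=0):
    # one-sided test of mean(pi) > pi0 via z = (mean - pi0) / SE, where SE is
    # estimated as the standard deviation of bootstrap resample means
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    boot_means = np.array([rng.choice(pi, size=pi.size, replace=True).mean()
                           for _ in range(n_boot)])
    z = (pi.mean() - pi0) / boot_means.std(ddof=1)
    return 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))   # one-sided normal p-value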

Chapter 12

Portfolio Management

12.1 Introduction and Problem Settings


12.1.1 Returns
The chapter of portfolio management is concerned with the management of a basket of capital assets.
For instance, the price of a government bond or stock at time t can be denoted pt . This is an example of
a capital asset. The pnl of holding one unit of the asset at [t − 1, t) is πt = pt − pt−1 . The price process
is a stochastic process. The price at any time $t$ is then some initial condition plus a sum of profits,
$p_t = p_l + \sum_{j=l+1}^{t}\pi_j$. If $\pi_t$ is IID with $E\pi_t = \mu$, $Var(\pi_t) = \sigma^2$, we call these the drift and variance respectively
and the price process is a random walk. Then the conditional mean is $E[P_t|P_0] = P_0 + t\mu$ and the conditional
variance is Var(Pt |P0 ) = σ 2 t. Continuous-time treatment of an asset price process sees asset prices as
(generalized) geometric Brownian motion (see Definition 656). We do not deal with continuous-time
treatment here.
We can define returns in many ways, and here we shall define some. The gross return, also known
as the cumulative return, is simply $G_t(k) = \frac{P_t}{P_{t-k}}$, the ratio of prices over two periods. We assume positive
price processes, $G_t(k) \geq 0$. The $k$-period gross return is simply the product of 1-period gross returns
(or more generally, the product of gross returns on non-overlapping periods). This is expressed
$G_t(k) = \Pi_{i=1}^{k} G_{t-k+i}(1)$. The net return is $R_t(k) = G_t(k) - 1 = \frac{P_t - P_{t-k}}{P_{t-k}}$. The log return is
$r_t(k) = \log G_t(k) = \log(P_t) - \log(P_{t-k})$. It is often true that we do not distinguish between net returns and log returns in
literature. The true relationship is

rt (k) = log(1 + Rt (k)) ≈ Rt (k), (3175)

where the approximation is given by log(1 + x) ≈ x when x is small (here log is the natural log, ‘ln’).
Their distinction is of little concern in practice when working on fine granularities of data. However,
when working with returns over longer periods, this approximation does not necessarily hold. The
mathematics is simple but the implications are non-trivial. Log-returns are often favored - they are quite
well behaved over time. The $k$-period log return is additive over the 1-period log returns (again, any
non-overlapping decomposition holds) such that $r_t(k) = \sum_{i=1}^{k} r_{t-k+i}(1)$. Most often, we will not distinguish
between daily net returns and daily log return data and use them interchangeably. In the presence of
dividend payout $D_t$, adjustments need to be made. The net return adjusted for dividends becomes
$\frac{P_t + D_t}{P_{t-1}} - 1$. The dividend-adjusted gross and logarithmic returns follow from their relationship to the net
return statistic.
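A small numerical sketch of these definitions (the price series below is invented purely for illustration):

import numpy as np

p = np.array([100.0, 101.0, 99.5, 102.0, 103.5])    # hypothetical positive price process

gross_1 = p[1:] / p[:-1]        # 1-period gross returns G_t(1)
net_1 = gross_1 - 1.0           # 1-period net returns R_t(1)
log_1 = np.log(gross_1)         # 1-period log returns r_t(1)

assert np.isclose(p[-1] / p[0], gross_1.prod())        # k-period gross = product of 1-period gross
assert np.isclose(np.log(p[-1] / p[0]), log_1.sum())   # k-period log return is additive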

We are interested in the economics of multiple assets. These assets put together form a portfolio.
Let there be $N$ assets, and $w_i$ be the weight assigned to asset $i \in [N]$. A general portfolio allocation can
be denoted $\|w\|_1 = \sum_i^N |w_i| = 1$. If we are working with strategies, or positive return bearing assets, it
makes sense to constrain our weights to positive values, such that $\forall i, w_i \geq 0$. Let the subscript $it$ index asset
$i$ at time $t$. Then

$$P_t = \left(1 + \sum_{i=1}^{N} w_i R_{it}\right) P_{t-1}, \qquad (3176)$$
$$R_t = \frac{P_t}{P_{t-1}} - 1 = \sum_{i=1}^{N} w_i R_{it}, \qquad (3177)$$
$$r_t = \log\left(1 + \sum_{i=1}^{N} w_i R_{it}\right). \qquad (3178)$$

The naive but highly useful assumption is that rt ∼ Φ(µ, σ 2 ). Then log(Pt ) − log(Pt−k ) ∼ Φ(kµ, kσ 2 ).
This is the discrete form statement for asset price processes driven by a Brownian motion. For continuous
time-treatments, see Definition (435). Log-normality is also obtained there. In particular, the asset price
is modelled

$$P_t = P_0 \exp\left(\left(\mu - \frac{\sigma^2}{2}\right) t + \sigma w_t\right), \quad w_t \sim \Phi(0, t). \qquad (3179)$$
Often the portfolio returns are taken in relation to a benchmark. We call this excess returns. For
some asset price benchmark p0t , the excess returns are rt − rt0 . Often the benchmark is set to the risk-
free rate. Cash rates or short-term (3M) Treasury rates are often used in literature when computing
Sharpe ratios. There are many objections. Firstly, benchmarks are arbitrary and choices are themselves
questionable. Secondly, most people live in the nominal world. When we talk about our portfolio returns
at a barbeque party, we say: ‘we made x% a year’. If you say: ‘I made y% net of YoY core CPI inflation’,
then feel free to benchmark the risk-free rate. Just don’t come to my party.
When growth rates (such as interest rates) are continuously compounded, we get nice approximations.
If for some small time period $\frac{1}{n}$ we compound at rate $\frac{r}{n}$, then after $t$ periods we have

$$P_t = P_0\left(1 + \frac{r}{n}\right)^{nt} \xrightarrow{n \to \infty} P_0\exp(rt). \qquad (3180)$$
These have applications in discounting future valuations of wealth, such as cash flows. When the rates
are allowed to be stochastic, we get discount processes. See Definition 4619 for continuous time
treatments. It is not very important at this juncture.

12.1.2 Risk and Standard Deviations


Almost everything we want to know about portfolio management is with regards to estimating the true
distribution, or nature, of $r_t$. The first moment and second central moment (see Definition 298) of $r_t$ are
the expected return and variance respectively, namely

µ = Ert , σ 2 = µ2 − µ21 = E[(rt − µ)2 ]. (3181)



$\sigma = \sqrt{\sigma^2}$ is known as the standard deviation (statistics terminology), volatility (finance terminology) or
portfolio risk (trader terminology). There are many definitions of portfolio risk, but volatility is by
far the most commonly used. Some commonly used distributions in the study of returns are normal
distributions (Section 6.17.6), t-distribution (Section 6.17.9) and log-normal distributions (when rt is
exponentiated, see Section 6.17.8).

Result 41. If $\hat{\theta} - \theta \sim \Phi(0, \frac{1}{n}\xi^2)$, then for any $f(\theta)$ with $|f'(\theta)| < \infty$ we have

$$f(\hat{\theta}) - f(\theta_0) \sim \Phi\left(0, \frac{1}{n}\left(f'(\theta_0)\right)^2\xi^2\right). \qquad (3182)$$

Lemma 17. With IID sample $Y_i$, $i \in [n]$, for $\hat{\mu} = \bar{Y}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_i^n(Y_i - \bar{Y})^2$, then

$$\sqrt{n}(\hat{\mu} - \mu) \sim \Phi(0, \sigma^2), \qquad (3183)$$
$$\sqrt{n}(\hat{\sigma}^2 - \sigma^2) \sim \Phi(0, 2\sigma^4), \qquad (3184)$$
$$\sqrt{n}(\hat{\sigma} - \sigma) \sim \Phi\left(0, \frac{1}{2}\sigma^2\right) \qquad (3185)$$

Proof. The first two results follow immediately from Definition 343. For the third statement, consider
the Taylor expansion of $(\hat{\sigma}^2)^{\frac{1}{2}}$ at $\sigma^2$, expressed

$$\hat{\sigma} = (\hat{\sigma}^2)^{\frac{1}{2}} \approx (\sigma^2)^{\frac{1}{2}} + \frac{1}{2(\sigma^2)^{\frac{1}{2}}}(\hat{\sigma}^2 - \sigma^2) \qquad (3186)$$
$$= \sigma + \frac{1}{2\sigma}(\hat{\sigma}^2 - \sigma^2). \qquad (3187)$$

Since $(\hat{\sigma}^2 - \sigma^2) \sim \Phi(0, \frac{2\sigma^4}{n})$, apply Result 41 with $f(x) = \sqrt{x}$, $\xi^2 = 2\sigma^4$ to get

$$\hat{\sigma} \sim \Phi\left(\sigma, \frac{1}{4\sigma^2}\cdot\frac{2\sigma^4}{n}\right) \qquad (3188)$$

and we are done.

12.1.3 VaR, Conditional VaR


Definition 392 (Value at Risk). Recall the definition of random variable quantiles (see Definition 284).
Let the q quantile be denoted Qq (x). For continuous c.d.f we can also write Qq (x) = F −1 (q). When the
random variable under concern is rt , then −Qq (X) is q VaR. Write

V aRq (X) = −Qq (X) = − max {x : F (x) ≤ q} . (3189)

This is the minimum loss incurred in the worst q fraction of return samples.
Lemma 18 (Properties of VaR). Let X, Y be random variables, and λ > 0, c ∈ R. The following
properties are satisfied by VaR risk measures.
1. V aRq (X + c) = V aRq (X) − c.

2. X ≤ Y =⇒ V aRq (X) ≥ V aRq (Y ). VaR is consistent with first-order stochastic dominance.

3. $VaR_q(\lambda X) = \lambda VaR_q(X)$.


In particular, V aRq (a + bX) = bV aRq (X) − a when b > 0.
Proof. Proof for part 1. Write V aRq (X) = −Qq (X) is such that

P(X < Qq (X)) = q (3190)


P(X + c < Qq (X) + c) = q (3191)

but −Qq (X) − c is V aRq (X + c). For part 2, see

P(X < Qq (X)) = q (3192)


P(X + (Y − X) < Qq (X) + (Y − X)) = q (3193)
P(Y < Qq (X) + (Y − X)) = q (3194)

but −Qq (X) + X − Y is V aRq (Y ). X − Y < 0 and the result follows.

Exercise 635. For X ∼ Φ(µ, σ 2 ) we know that for zq , the q-th quantile of Φ(0, 1), we have
 
$$P\left(\frac{X - \mu}{\sigma} < z_q\right) = q \qquad (3195)$$
$$P(X < \mu + z_q\sigma) = q \qquad (3196)$$

s.t. $Q_q(X) = \mu + z_q\sigma$, $VaR_q(X) = -\mu - z_q\sigma$. The same principle applies if we assume $X \sim t_v$-distribution
and so on. If we fit a distribution to $r_t$, we can then find the VaR at arbitrary $q$ using the estimated
parameters. If we have sufficient data, we can estimate based on empirical quantiles to high resolution
without making parametric assumptions (see Definition 284).

Definition 393 (Conditional VaR, Expected Shortfall). Recall the definition of value-at-risk in Defini-
tion 392. This measured the minimum loss incurred. The expected shortfall measures the average loss
incurred in the worst q fraction of return samples instead, expressed

$$ES_q(X) = \frac{1}{q}\int_0^q VaR_\alpha(X)\, d\alpha. \qquad (3197)$$

Why conditional VaR? See

$$ES_q(X) = -E[X \mid X < -VaR_q(X)] \qquad (3198)$$
$$= -\frac{1}{q}\int_{-\infty}^{-VaR_q(X)} x f(x)\, dx \qquad (3199)$$
$$= -\frac{1}{q}\int_{-\infty}^{-VaR_q(X)} x\, dF(x) \qquad (3200)$$
$$= -\frac{1}{q}\int_0^q F^{-1}(\alpha)\, d\alpha \qquad \alpha = F(x) \qquad (3201)$$
$$= \frac{1}{q}\int_0^q VaR_\alpha(X)\, d\alpha. \qquad (3202)$$

Lemma 19 (Properties of Expected Shortfall). It is easy to see using the results from Lemma 18 that

1. ESq (X + c) = ESq (X) − c, c ∈ R

2. X ≤ Y =⇒ ESq (X) ≥ ESq (Y ).

3. ESq (λX) = λESq (X). λ > 0.
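An empirical sketch of both risk measures, assuming a sample r of simulated fat-tailed daily returns; VaR is the negated empirical q-quantile and the expected shortfall averages the losses beyond it:

import numpy as np

def var_es(r, q=0.05):
    # empirical q-VaR and expected shortfall of a return sample
    r = np.asarray(r, dtype=float)
    var_q = -np.quantile(r, q)
    tail = r[r <= -var_q]
    es_q = -tail.mean() if tail.size else var_q
    return var_q, es_q

r = 0.01 * np.random.default_rng(0).standard_t(df=4, size=10_000)
print(var_es(r, q=0.05))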

12.1.4 Alternative Measures of Portfolio Risk


It is well known that asset returns are not multivariate normal; it is insufficient to parameterize asset
returns with solely the vector (µ, Σ). This is particularly so when portfolio returns exhibit strong skew
or tails. We present other common measures of portfolio risk.

Definition 394 (Mean Absolute Deviation (MAD)). Instead of using the squared deviations, the MAD
statistic takes absolute values, that is
$$MAD(R_p) = E\left|\sum_i^N w_i R_i - \sum_i^N w_i\mu_i\right|, \qquad (3203)$$

where $R_p = \sum_i^N w_i R_i$. If asset returns are assumed multivariate normal, then we have $MAD(R_p) = \sqrt{\frac{2}{\pi}}\,\sigma_p$. (verify this)

The optimal portfolio computation with MAD (Definition 394) portfolio risk is a linear problem which
can be solved by standard linear programming routines.

Definition 395 (Mean Absolute Moment (MAM)). The MAM is a generalization of MAD where we
allow the statistic to take on higher moments of the absolute deviation from the mean. In particular,
$$MAM_q(R_p) = \left(E|R_p - E(R_p)|^q\right)^{\frac{1}{q}} \qquad (3204)$$

is the q-th order mean absolute moment. See that when q = 1, we have MAD (Definition 394), when
q = 2, we have the usual standard deviation.

Definition 396 (Semivariance). The semivariance is the variance statistic computed from using only
part of the data. In the portfolio context, semivariance is the portfolio variance computed on negative
entries. That is

$$\sigma^2_{p,semi} = E\left[\min\left(\sum_i^N w_i R_i - \sum_i^N w_i\mu_i,\ 0\right)^2\right], \qquad (3205)$$

where $R_p = \sum_i^N w_i R_i$.

The semivariance statistic (Definition 396) is used to compute the Sortino ratio where the standard
deviation statistic is swapped out for semivariance in the computation of Sharpe ratios.

Definition 397 (Lower Partial Moment (LPM)). LPM is a generalization of the semivariance (Defini-
tion 396) statistic for higher moments. In particular, the power index q lower partial moment is
$$\sigma_q = \left(E[\min(R_p - \mu_p, 0)^q]\right)^{\frac{1}{q}}. \qquad (3206)$$

Definition 398 (Roy’s Safety First (RSF)). RSF chooses the optimization problem that first assures
the preservation of a percentage of some principal value targeted at a threshold acceptable return. In
particular, the investor solves the optimization problem

$$\min_w\ P(R_p \leq R_0) \qquad (3207)$$

subject to $w'\mathbf{1} = 1$. Since the investor does not know the true probability measure $P$, using Chebyshev's
inequality (Theorem 364), we arrive at $P(R_p \leq R_0) \leq \frac{\sigma_p^2}{(\mu_p - R_0)^2}$ (verify this). The investor therefore solves
the optimization problem $\min_w \frac{\sigma_p}{\mu_p - R_0}$ (subject to the same conditions). See that if we set $R_0 = r_f$, the
risk-free rate, then this is the maximal Sharpe ratio problem (Exercise 638).
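The statistics above are straightforward to estimate empirically; a minimal sketch, assuming a matrix R of asset returns (one row per period) and a weight vector w:

import numpy as np

def portfolio_risk_measures(R, w, q=2):
    # empirical MAD (Definition 394), semivariance (Definition 396)
    # and q-th lower partial moment (Definition 397) of the portfolio R @ w
    rp = np.asarray(R) @ np.asarray(w)
    dev = rp - rp.mean()
    mad = np.abs(dev).mean()
    semivariance = (np.minimum(dev, 0.0) ** 2).mean()
    lpm_q = (np.abs(np.minimum(dev, 0.0)) ** q).mean() ** (1.0 / q)
    return mad, semivariance, lpm_q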

12.1.5 Risk-Adjusted Returns


The portfolio manager is interested in returns, particularly in relation to risk. He should be interested
primarily in the Sharpe ratio, which is computed as the ratio of expected returns to volatility of returns.
Most often, the annualized Sharpe ratio is presented. Suppose daily returns rt are independent. Let
PT PT √
R = i ri . Then Var(R) = i Var(ri ) = T σr2 and σR = T σr . Suppose we have daily return data,
and let number of trading days in a year be 253, then the annualized Sharpe is computed
253 ∗ µr
sharpe = √ (3208)
253 · σr
√ µr
= 253 · . (3209)
σr

The reason why the Sharpe ratio is of primary concern can be outlined. It gives us the return measured
in units of volatility. Volatility can be acquired by leverage. A high Sharpe portfolio with low risk
can be levered up to a desired level of volatility. For a given level of volatility, the returns possible are
commensurate with the Sharpe ratio. As mentioned, some choose to use benchmarked returns. Others
criticize the Sharpe ratio altogether. For the naysayers of Sharpe, I leave you with a quote (Paleologo
[15]):

committees have been formed; replacements have been suggested, including the
beautifully named ‘ulcer index’; recommendations have been ignored.
Titans of finance come and are quickly forgotten.
Volatility and Sharpe will stay in the foreseeable future.

We show important results on the confidence of the sample Sharpe ratio statistic observable on return
data from market trading.
Corollary 25 (Sharpe Ratio). We may estimate the Sharpe ratio by $\hat{SR} = \frac{\hat{\mu}}{\hat{\sigma}}$ with distribution

$$\hat{SR} - SR \sim \Phi\left(0, \frac{1}{T}\left(1 + \frac{1}{2}SR^2\right)\right). \qquad (3210)$$

The 95% confidence interval for the Sharpe Ratio is

$$\hat{SR} = \hat{SR} \pm 1.96\sqrt{\frac{1}{T}\left(1 + \frac{1}{2}\hat{SR}^2\right)} \qquad (3211)$$
This gives us the range of values for which a hypothesis test for zero Sharpe would return statistically
(in)significant p-values.

Proof. By bivariate Taylor expansion arguments, we argued in Lemma 367 that the variance of the ratio
$R = \frac{Y}{X}$ is given by

$$Var(R) \approx \frac{1}{\mu_x^2}\left(r^2\sigma_{\bar{x}}^2 + \sigma_{\bar{y}}^2 - 2r\sigma_{\bar{x}\bar{y}}\right) \qquad (3212)$$

See from Result 17 that we have $\hat{\sigma} \sim \Phi(\sigma, \frac{\sigma^2}{2T})$, and $\hat{\mu} \sim \Phi(\mu, \frac{\sigma^2}{T})$. Ignoring the finite population correction
factor, we can write our bootstrap approximation of the Sharpe estimator variance

$$Var(\hat{SR}) \approx \frac{1}{\hat{\sigma}^2}\left(SR^2\frac{\hat{\sigma}^2}{2T} + \frac{\hat{\sigma}^2}{T}\right), \qquad \hat{\sigma} \perp \hat{\mu}, \text{ Definition 343} \qquad (3213)$$
$$= \frac{1}{T}\left(\frac{1}{2}SR^2 + 1\right) \qquad (3214)$$

and we are done.
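A sketch of Corollary 25 in code, assuming an array r of daily returns and 253 trading days per year; annualizing the confidence band by the same sqrt(253) factor is our convenience here rather than a statement from the text:

import numpy as np

def annualized_sharpe_ci(r, periods=253):
    # annualized Sharpe (Equation 3209) and an approximate 95% interval (Equation 3211)
    r = np.asarray(r, dtype=float)
    T = r.size
    sr = r.mean() / r.std(ddof=1)                     # daily Sharpe estimate
    half = 1.96 * np.sqrt((1.0 + 0.5 * sr**2) / T)
    ann = np.sqrt(periods)
    return ann * sr, ann * (sr - half), ann * (sr + half)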

12.1.6 A Basket of Assets


Assets are held in aggregate, and this is called a portfolio. The portfolio wealth process is observed
with respect to a basket of assets evolving over time. Market instruments are intricately related, and so
are the random variables representing the observables of interest. These random variables shall then be
studied with multivariate methods. In the general form, for random variables $X = (X_i)_{i\in[N]}$, the random
variables have c.d.f. $F(x_1, x_2, \cdots, x_N) = P(X_1 < x_1, \cdots, X_N < x_N)$. If there exists a joint density, then
their relations are given by
$$P(X_1 \in (a_1, b_1), \cdots, X_N \in (a_N, b_N)) = \int_{a_1}^{b_1}\cdots\int_{a_N}^{b_N} f(x_1, \cdots, x_N)\, dx_N \cdots dx_1. \qquad (3215)$$

The mathematics of joint densities are discussed in Section 317. Of particular prevalence is the multivari-
ate Gaussian distributions (see Definition 344). Recall that they are characterized completely by their
expectations and covariance matrix, which we denote here (µ, Σ). Refer to Definitions of random variable
expectations, covariance and correlation in Definitions 289, 302 and 303 respectively. In particular, for
p-column matrix Y = (Yi )Ti∈[p] we have

E(Y ) = (EYi )0i∈[p] = (µi )0i∈[p] , (3216)


Cov(Y ) = Σ = E {[Y − EY ][Y − EY ]0 } . (3217)

The covariance and correlation matrices are said to be semi-positive definite. More generally (obtained
by their definitions and using the transpose (AB)0 = B 0 A0 property):

EAY = AEY, Cov(AY ) = ACov(Y )A0 . (3218)

More generally, Cov(AY, BY ) = ACov(Y )B 0 . We can estimate the population covariance matrix with
the observed data. For observed Y , take
$$S = \frac{1}{n}\sum_{l=1}^{n}(Y_l - \bar{Y})(Y_l - \bar{Y})'. \qquad (3219)$$

It is well known the sample covariance matrix is a poor estimator out-of-sample. Robust estimation of
covariance matrices are discussed in Section 12.6.4.

12.1.7 Computations for the Portfolio


For N assets with returns R = (Ri )i∈[N ] , denote their expected returns and covariance as ri = ERi , Cov(R) =
Σ respectively. The portfolio is a linear combination of assets - let the weighting be denoted by (col-
umn) vector $w$, s.t. the portfolio $R_p$ has relation $R_p = w'R$. In the general case, $\|w\|_1 = 1$, but this
is usually $\sum w_i = 1$ (components are non-negative) in the strategy-allocation problem (since we would
not be trading a strategy with negative expected returns). By the relation given by Equation 3218, the
portfolio's expected return, variance and volatility are given by

$$r_p = ER_p = w'ER, \quad \sigma_p^2 = w'\Sigma w, \quad \sigma_p = \sqrt{w'\Sigma w}. \qquad (3220)$$

It follows that the portfolio Sharpe is given

$$SR_p = \frac{w'ER}{\sqrt{w'\Sigma w}}. \qquad (3221)$$
Our objectives would be to maximise $r_p$, minimize $\sigma_p$, or maximise their ratio $SR_p$. The efficient frontier curve
is the curve along non-Pareto-dominated solutions for $w$ w.r.t. the first two objectives. In particular,
it is the set of solutions for w corresponding to (i) maximum rp for some target σp , or to (ii) minimum
σp for some target rp . To trade on this frontier would be an efficient approach - see otherwise that the
trader’s SRp would be lower.
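A short numerical sketch of Equations 3220 and 3221 (the expected returns and covariance below are invented purely for illustration):

import numpy as np

r = np.array([0.06, 0.04, 0.08])              # hypothetical expected returns
Sigma = np.array([[0.04, 0.01, 0.00],         # hypothetical covariance matrix
                  [0.01, 0.02, 0.01],
                  [0.00, 0.01, 0.09]])
w = np.array([0.5, 0.3, 0.2])

rp = w @ r                                    # portfolio expected return
sigma_p = np.sqrt(w @ Sigma @ w)              # portfolio volatility
print(rp, sigma_p, rp / sigma_p)              # and the portfolio Sharpe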

12.2 Dynamic Signal Weighting


The active portfolio signal weighting technique is a problem in normative (as opposed to positive) eco-
nomics, one that prescribes behaviour that rational investors should pursue in the construction of a

portfolio based on the assumptions outlined in the construction of the normative theory. Diversifica-
tion is related to the CLT (Theorem 362), which states that for sum of IID Xi ∼ F (µ, σ 2 ) of arbitrary
distribution F and bounded variance σ 2 , that
$$\lim_{N\to\infty} P\left(\sum_i^N \frac{X_i - \mu}{\sigma\sqrt{N}} \leq y\right) = \Phi(0, 1), \qquad (3222)$$

where $\Phi(0, 1)$ is the C.D.F. of the standard normal. In particular, for a portfolio of $N$ assets with IID
returns $r_i$, $i \in [N]$, the equal weighted portfolio gives $R_p = \frac{1}{N}\sum_i^N R_i$, and the variance of this portfolio
is $Var(R_p) = \frac{1}{N^2}\sum_i^N Var(R_i) = \frac{1}{N^2}N\sigma^2 = \frac{\sigma^2}{N} \to 0$ as $N \to \infty$.
A reasonable line of thinking for a trader is that she believes that the Sharpe ratio of her portfolio
must increase by adding less than perfectly correlated strategies. A more nuanced trader would realise
that this is not in fact true (for all practical purposes) and there is an asymptotic bound for which the
addition of new alphas does not improve the return profile. For a simple numerical example, consider a
vector of alphas denoted $\alpha = (\alpha_i)_{i\in[N]} \sim \Phi(\mu, \Sigma)$. Then on any day, her return profile on a $\frac{1}{J}$ portfolio
(we call the '$\frac{1}{J}$' portfolio the equal weighted portfolio from here onwards) would be distributed

$$E\bar{\alpha} = \bar{\mu}, \quad Var\,\bar{\alpha} = \frac{1}{N^2}\sum_{i,j}^{N,N}\Sigma_{ij}. \qquad (3223)$$

Assume constant correlation $\rho$ between alphas, and unit variance, such that $\Sigma_{ij} = \rho\mathbf{1}\{i \neq j\}$ and $\Sigma_{ii} = 1$
for all $i \in [N]$. In this scenario we have

$$Var\,\bar{\alpha} = \frac{1}{N^2}\sum_{ij}\Sigma_{ij} \qquad (3224)$$
$$= \frac{1}{N^2}\left[N + N\rho(N - 1)\right] \qquad (3225)$$
$$= \frac{1}{N^2}\left(N + N^2\rho - N\rho\right) \qquad (3226)$$
$$= \frac{1 + \rho(N - 1)}{N} \geq \rho. \qquad (3227)$$
Her portfolio Sharpe must be
$$\text{Sharpe}(\alpha) = \bar{\mu}\sqrt{\frac{N}{1 + \rho(N - 1)}} \leq \frac{\bar{\mu}}{\sqrt{\rho}}. \qquad (3228)$$
In other words, her multi-alpha portfolio cannot increase Sharpe by more than a factor of $\rho^{-\frac{1}{2}}$. This
must be the limit of her performance on a 'diversified' portfolio of correlated alphas. In the more general
case, we have $Var(R_p) = w'\Sigma w = \frac{1}{N^2}\sum_i^N Var(R_i) + \frac{1}{N^2}\sum_{i\neq j}Cov(R_i, R_j) \leq \frac{1}{N^2}N\sigma_{max}^2 + \frac{N(N-1)}{N^2}A = \frac{\sigma_{max}^2}{N} + \frac{N-1}{N}A$,
where $A$ is the mean of the off-diagonal covariances. We see that $A$ is the limit of the portfolio $R_p$ variance
when the number of assets $N \to \infty$.¹ Alphas, in practice, are indeed often correlated. It follows from
the non-vanishing correlations that a dynamic signal weighting scheme may be of interest to achieve
further risk-adjusted returns. This is our discussion of interest.
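A quick numerical sketch of the Sharpe ceiling in Equation 3228 (the mean and correlation values are arbitrary illustrations):

import numpy as np

def equal_weight_sharpe(mu_bar, rho, N):
    # Sharpe of the equal weighted portfolio of unit-variance alphas with common correlation rho
    return mu_bar * np.sqrt(N / (1.0 + rho * (N - 1)))

mu_bar, rho = 0.05, 0.2
for N in (1, 5, 25, 100, 1000):
    print(N, round(equal_weight_sharpe(mu_bar, rho, N), 4))
print("ceiling:", round(mu_bar / np.sqrt(rho), 4))    # the asymptotic bound mu_bar / sqrt(rho)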

12.2.1 Elegant Mathematics, Poor Economics - MPT


We present the classical mathematical models of optimization. These are known as ‘Modern Portfolio
Theory’. There is nothing quite modern about these models, and no serious amount of capital is
1 It
has been shown that if asset returns behave like a stable Paretian distribution, then diversification is not possible.
We do not consider this case, since most, if not all practicing portfolio managers agree that diversification is a meaningful
economic activity, and to some degree achievable.

managed by the mathematics presented herein. However, the advanced models of today would be
motivated by similar goals of optimization. The theory presented herein is the bedrock for evolutions
in portfolio theory.
We present the arguments here, assuming the constraint $\sum w_i = 1$, with no short-constraints. This is to
simplify the mathematics under the settings of 'soft' constraints. Here, $\mathbf{1}\{\cdot\}$ is the indicator function and $\mathbf{1}$
is a vector of ones. All vectors are presented as columns.

Exercise 636 (Global Minimum Variance). Assume N assets, with returns R = (Ri )0i∈[N ] and covariance
Σ = (σij )i∈[N ],j∈[N ] . The global minimum variance portfolio is the portfolio Rp = w0 R that solves

$$\min_w\ Cov(R_p) = w'\Sigma w, \qquad (3229)$$
$$\text{s.t.}\ \sum_i^N w_i = 1. \qquad (3230)$$

Proof.
$$Cov(R_p) = w'\Sigma w = \sum_l^N w_l^2\sigma_l^2 + 2\sum_{i<j}^N w_i w_j\sigma_{ij}. \qquad (3231)$$

The Lagrangian for the problem is

$$L(w, \lambda) = \sum_l^N w_l^2\sigma_l^2 + 2\sum_{i<j}^N w_i w_j\sigma_{ij} + \lambda\left(\sum_l^N w_l - 1\right), \qquad (3232)$$

with Lagrangian equations


$$\frac{\delta L(w, \lambda)}{\delta w_i} = 2\sum_j w_j\sigma_{ij} + \lambda = 0, \qquad (3233)$$
$$\frac{\delta L(w, \lambda)}{\delta \lambda} = \sum_i w_i - 1 = 0. \qquad (3234)$$

This can be written by block matrix


" #" # " #
2Σ 1 w 0
= . (3235)
10 0 λ 1

Solving the set of linear equations $2\Sigma w = -\lambda\mathbf{1}$, $\mathbf{1}'w = 1$, we get

$$w = -\frac{1}{2}\Sigma^{-1}\lambda\mathbf{1}, \qquad (3236)$$
$$\mathbf{1}'w = -\frac{1}{2}\mathbf{1}'\Sigma^{-1}\lambda\mathbf{1} = 1. \qquad (3237)$$

It follows that $\lambda = -2\frac{1}{\mathbf{1}'\Sigma^{-1}\mathbf{1}}$, with solution

$$w = \frac{\Sigma^{-1}\mathbf{1}}{\mathbf{1}'\Sigma^{-1}\mathbf{1}}. \qquad (3238)$$
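A minimal sketch of the closed form in Equation 3238, using a linear solve rather than an explicit inverse:

import numpy as np

def gmv_weights(Sigma):
    # global minimum variance weights: Sigma^{-1} 1 / (1' Sigma^{-1} 1)
    ones = np.ones(Sigma.shape[0])
    x = np.linalg.solve(Sigma, ones)
    return x / (ones @ x)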

Exercise 637 (Markowitz Portfolios). The efficient portfolio is set up

$$\max_w\ r_p = w'r \qquad (3239)$$
$$\text{s.t.}\ w'\Sigma w = \sigma_*^2\ \ \&\ \ \sum_i w_i = 1. \qquad (3240)$$

or

$$\min_w\ Cov(r_p) = w'\Sigma w \qquad (3241)$$
$$\text{s.t.}\ w'r = r_*\ \ \&\ \ \sum_i w_i = 1. \qquad (3242)$$

Proof. Minimizing $w'\Sigma w$ is the same as minimizing $\frac{1}{2}w'\Sigma w$. In particular, see that our Lagrangian
equation is

$$L(w, \lambda) = \frac{1}{2}w'\Sigma w + \lambda_1(1 - w'\mathbf{1}) + \lambda_2(r_* - w'r). \qquad (3243)$$
Then
$$\frac{\delta L}{\delta w} = \Sigma w - \lambda_1\mathbf{1} - \lambda_2 r, \qquad (3244)$$
$$\frac{\delta L}{\delta \lambda_1} = 1 - w'\mathbf{1}, \qquad (3245)$$
$$\frac{\delta L}{\delta \lambda_2} = r_* - w'r. \qquad (3246)$$
Note that $\Sigma' = \Sigma$, so $(\Sigma^{-1})' = \Sigma^{-1}$. Solving this set of linear equations, see that

$$w = \lambda_1\Sigma^{-1}\mathbf{1} - \lambda_2\Sigma^{-1}r \qquad (3247)$$

and substituting,

$$1 - \underbrace{(\mathbf{1}'\Sigma^{-1}\mathbf{1})}_{A}\lambda_1 + \underbrace{(r'\Sigma^{-1}\mathbf{1})}_{B}\lambda_2 = 0 \qquad (3248)$$
$$r_* = \underbrace{(\mathbf{1}'\Sigma^{-1}r)}_{B}\lambda_1 - \underbrace{(r'\Sigma^{-1}r)}_{C}\lambda_2. \qquad (3249)$$

Solving the equations 1 = Aλ1 − Bλ2 , r∗ = Bλ1 − Cλ2 , write

$$\lambda_1 = (1 + B\lambda_2)A^{-1}, \qquad (3250)$$
$$Ar_* = B + B^2\lambda_2 - AC\lambda_2, \qquad (3251)$$
$$\lambda_2(B^2 - AC) = Ar_* - B \qquad (3252)$$
$$\lambda_2 = \frac{B - Ar_*}{AC - B^2} \qquad (3253)$$

and

$$\lambda_1 = \frac{1 + B\frac{B - Ar_*}{AC - B^2}}{A}, \qquad (3254)$$
$$= \frac{AC - B^2 + B^2 - BAr_*}{A(AC - B^2)} \qquad (3255)$$
$$= \frac{C - Br_*}{AC - B^2} \qquad (3256)$$
Then

$$w = \lambda_1\Sigma^{-1}\mathbf{1} - \lambda_2\Sigma^{-1}r \qquad (3257)$$
$$= \frac{C - Br_*}{AC - B^2}\Sigma^{-1}\mathbf{1} - \frac{B - Ar_*}{AC - B^2}\Sigma^{-1}r. \qquad (3258)$$
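The closed form in Equation 3258 can be checked numerically; a minimal sketch (illustrative inputs only):

import numpy as np

def markowitz_weights(r, Sigma, r_star):
    # minimum variance weights hitting target return r_star (Equation 3258)
    ones = np.ones(len(r))
    Si1, Sir = np.linalg.solve(Sigma, ones), np.linalg.solve(Sigma, r)
    A, B, C = ones @ Si1, r @ Si1, r @ Sir
    denom = A * C - B**2
    return (C - B * r_star) / denom * Si1 - (B - A * r_star) / denom * Sir

# w = markowitz_weights(r, Sigma, r_star=0.05); check that w.sum() ~ 1 and w @ r ~ r_star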

The Markowitz problem (Exercise 637) is said to be a quadratic program, and as we showed, has
analytical solutions. In quadratic forms involving only the equality constraints Ax = b (inequality
constraints are of the type Ax 4 b), the problem reduces to solving a set of linear equations. In more
complex settings with inequality and other constraints (such as mixed-integer programs), the solutions
may be obtained numerically. The extensions of more complex constraints and objective functions are
discussed in Section 12.5.

Exercise 638 (Maximum Sharpe/Tangency Portfolio). Solve for

$$\max_w\ \frac{w'r}{(w'\Sigma w)^{\frac{1}{2}}} \qquad (3259)$$

subject to w0 1 = 1.

Proof. The optimization has Lagrangian

$$L(w, \lambda) = \frac{w'r}{(w'\Sigma w)^{\frac{1}{2}}} + \lambda(w'\mathbf{1} - 1). \qquad (3260)$$
The first order conditions are
$$\frac{\delta L}{\delta w} = r(w'\Sigma w)^{-\frac{1}{2}} - (w'r)(w'\Sigma w)^{-\frac{3}{2}}\Sigma w + \lambda\mathbf{1} = 0, \qquad (3261)$$
$$\frac{\delta L}{\delta \lambda} = w'\mathbf{1} - 1 = 0. \qquad (3262)$$
Solving, (verify this) we get

$$w = \frac{\Sigma^{-1}r}{\mathbf{1}'\Sigma^{-1}r}. \qquad (3263)$$
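And the tangency solution of Equation 3263, sketched in the same style as the global minimum variance example:

import numpy as np

def max_sharpe_weights(r, Sigma):
    # maximum Sharpe weights: Sigma^{-1} r / (1' Sigma^{-1} r)
    x = np.linalg.solve(Sigma, r)
    return x / x.sum()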

An equivalent statement of the mean-variance Markowitz formulation in Exercise 637 is to specify


the tradeoff between risk and return in the objective function using some risk aversion constant γ. For
covariance matrix Σ, expected returns r for N assets and risk-aversion factor γ, the mean variance
investor solves for the maximization of

$$w'r - \gamma w'\Sigma w \qquad (3264)$$
$$\text{s.t.}\ \sum_{i=1}^N w_i = 1. \qquad (3265)$$

This is known as the risk aversion formulation of the classical mean-variance optimization problem. It
can be shown that as we increase γ from zero and solve for w, we obtain points along the efficient frontier.

12.2.2 Risk-Free Rates and the Capital Market Line


It can be shown that the set of efficient portfolios of risky assets derived in Section 12.2.1 results in inferior
portfolios relative to when the risk-free asset is made available. There is a rather limiting assumption
that makes this theoretical portfolio less practicable; it is that the risk-free rate is representative of both
the cost of borrowing and lending. For completeness, we present the arguments made anyway. In the
presence of a risk-free asset returning Rf , the investor has another choice to make - the investment in
the risky portfolio of $N$ assets $w_R = (w_{R1}, \cdots, w_{RN})'$, and $(1 - w_R'\mathbf{1})$ invested in the risk-free asset. This
portion is allowed to be both positive and negative, if leveraged portfolios are permitted. Correspondingly,

the portfolio expected returns are $\mu_p = w_R'r + (1 - w_R'\mathbf{1})R_f$ and $\sigma_p^2 = w_R'\Sigma w_R$. The risk-free rate has
no variance component. Again, the investor minimizes his portfolio variance with the objective function
$\min_{w_R}\{w_R'\Sigma w_R\}$, subject to a feasible return target $\mu_0 = w_R'r + (1 - w_R'\mathbf{1})R_f$. The optimal weights are
given by (verify this)

$$w_R = \frac{\mu_0 - R_f}{(r - R_f\mathbf{1})'\Sigma^{-1}(r - R_f\mathbf{1})}\Sigma^{-1}(r - R_f\mathbf{1}). \qquad (3266)$$

See that $w_R$ is then a product of some proportionality constant and the vector $\Sigma^{-1}(r - R_f\mathbf{1})$. Every
minimum variance portfolio at a specified target return is a combination of the risk-free asset and the
N-risky portfolio. The N-risky component is said to be the tangency portfolio. If we let the weights in the
tangency portfolio sum to one, that is $(w_R^*)'\mathbf{1} = 1$, then since for some $C^*$ we must have the relation

$$w_R = C^*\Sigma^{-1}(r - R_f\mathbf{1}), \qquad (3267)$$

we may write

$$\mathbf{1}'w_R = \mathbf{1}'C^*\Sigma^{-1}(r - R_f\mathbf{1}) = 1 \implies C^* = \frac{1}{\mathbf{1}'\Sigma^{-1}(r - R_f\mathbf{1})}.$$

Make the substitution to get the tangency portfolio

$$\frac{1}{\mathbf{1}'\Sigma^{-1}(r - R_f\mathbf{1})}\Sigma^{-1}(r - R_f\mathbf{1}). \qquad (3268)$$

This tangency portfolio is consistent with Equation 3263 under the zero risk-free assumptions. In
essence, every possible combination of the risk-free asset and tangency portfolio forms the capital market
line that is tangent to the efficient frontier at the tangency portfolio. With the exception of the tangency
portfolio, all points on the efficient frontier are actually Pareto dominated by the capital market line
(CML); the rational investor invests on the capital market line for his desired level of risk. To the left
of the tangency portfolio (on the CML), there are positive investments in the risk-free portfolio. To the
right, we are borrowing at the risk-free rate. This is known as the separation property; the investor has
two primary tasks. The first is to decide the allocation of wealth between the N-risky portfolio and the
risk-free asset itself, which depends on the level of risk preference of said investor. The second problem
is the construction of the N-risky portfolio, which has allocations (normatively) recommended by the
quadratic optimization problem.

Exercise 639 (Derivation of the Capital Market Line). Denote the weight in the risk-free asset and
risky portfolio as wf , wM respectively s.t. wf + wM = 1. Then the portfolio Rp has expectation ERp =
wf Rf +wM ERM = (1−wM )Rf +wM E(RM ) = Rf +wM [ERM −Rf ]. Additionally, the portfolio variance
is determined (solely by the risky component) as $Var(R_p) = \sigma_p^2 = w_M^2\sigma_M^2$. So $w_M = \frac{\sigma_p}{\sigma_M}$ and making the
substitutions, we get the CML

$$ER_p = R_f + \left(\frac{ER_M - R_f}{\sigma_M}\right)\sigma_p. \qquad (3269)$$
Compare the CML Equation 3269 to the CAPM model (Equation 642). The term $\left[\frac{ER_M - R_f}{\sigma_M}\right]$ in
Equation 3269 is said to be the risk premium, or the market price of risk. This is the compensation (in
expectation) over the risk-free rate for taking on risk, where the units of risk are in terms of σM and the
amount of risk consumed is σp . The CML gradient is the return compensation for each perceived level
of risk, and is in fact the equilibrium market price of risk.

12.3 Utility functions and Indifference Curves
The question of how an individual investor determines the optimal point along the efficient frontier (or
CML; see Sections 12.2.1 and 12.2.2) depends on their risk preferences. The concept of preference
is captured by the utility function. Utility functions assign a numeric value to all possible choices faced
by the agent, and captures the degree of satisfaction obtained from the set of choices available to the
investor. A higher utility for choice A over choice B describes a preference relation for A < B. The
indifference curve plots the set of points for which the investor is indifferent about either A or B. One
may find these points with the following experiment; presented with a portfolio A returning annualized
10% in return and 5% volatility and another portfolio B returning 20% in return and 5% volatility (and
everything else equal), the investor would pick portfolio B. Suppose we up the volatility in B to 6%, 7%
and so on...at some point we would be rather indifferent between being presented the portfolio A or B.
The set of such indifference portfolios A, B, C, · · · make up the indifference curve. The indifference curve
of expected return against portfolio risk would be upward sloping - as the risk in the portfolio increases,
the investor demands higher expected return as compensation. Then the optimal portfolio of the investor
based on her utility function would be precisely the point in which her utility function is tangential to
the efficient frontier (under the N-risky portfolio) and to the CML (under the risk-free asset and N-risky
portfolio).
We can specify more general forms for investor goals other than the one mentioned in statements such
as in Equation 3264. That is, the risk-aversion formulation is a specific instance of an investor utility
function. The general framework specifies some utility function u, initial wealth W0 and chooses w to
maximise his utility over the next period

$$\max_w\ Eu(W_0(1 + w'R)), \quad w'\mathbf{1} = 1. \qquad (3270)$$

Then our problem is of the expected utility maximization type. Solutions to this depend, of course, on
the utility function. Our risk-aversion formulation Equation 3264 is of the quadratic utility type
$$u(x) = x - \frac{b}{2}x^2, \quad b > 0. \qquad (3271)$$
See that we may express

$$E[u(W_0(1 + w'R))] = E\left[W_0(1 + w'R) - \frac{b}{2}W_0^2(1 + w'R)^2\right] \qquad (3272)$$
$$= W_0 + W_0 Ew'R - \frac{b}{2}W_0^2\left(1 + 2Ew'R + E[(w'R)^2]\right) \qquad (3273)$$
$$= W_0 - \frac{b}{2}W_0^2 + W_0\mu_p(1 - bW_0) - \frac{b}{2}W_0^2(\sigma_p^2 + \mu_p^2). \qquad (3274)$$
Here the investor utility is concerned only with the mean and variance of the portfolio. This is consistent
with the classical mean-variance type analysis. If we assume that the asset returns are jointly normal
(then only {µ, Σ} matters), then the mean-variance model is consistent with the wealth maximization
objective. The general utility function often leads to more complex formulations.
It is often a necessary assumption that the utility function (at least over some range of input values)
satisfies $u' > 0$, $u'' < 0$. This is consistent with the law of diminishing marginal utility that economists
prescribe to the consumption of goods and services. An investor is risk averse; the degree of this risk
aversion is given by the absolute and relative risk aversion, respectively:

$$r_A(x) = -\frac{u''(x)}{u'(x)}, \quad r_R(x) = -\frac{xu''(x)}{u'(x)}. \qquad (3275)$$

The curvature of the utility function describes the risk-aversion of the investor. The curvature of util-
ity functions of market agents, their consumption behavior and their relation to the existence of risk
premiums are discussed in Section 15.1.1 on the microeconomic theory of risk premiums.
Some examples of utility functions and their risk aversion functions are given below.

Definition 399 (Linear Utility Function). The linear utility function is

u(x) = a + bx, rA (x) = rR (x) = 0. (3276)

Definition 400 (Quadratic Utility Function).


$$u(x) = x - \frac{b}{2}x^2, \quad b > 0, \qquad (3277)$$
$$r_A(x) = \frac{b}{1 - bx}, \quad r_R(x) = \frac{bx}{1 - bx}. \qquad (3278)$$
Definition 401 (Exponential Utility Function / Constant Absolute Risk Aversion).
$$u(x) = -\frac{1}{\lambda}\exp(-\lambda x), \quad \lambda > 0, \qquad (3279)$$
$$r_A(x) = \lambda, \quad r_R(x) = \lambda x. \qquad (3280)$$

Definition 402 (Power Utility (Constant Relative Risk Aversion)).

$$u(x) = x^\alpha, \quad 0 < \alpha < 1, \qquad (3281)$$
$$r_A(x) = \frac{1 - \alpha}{x}, \quad r_R(x) = 1 - \alpha. \qquad (3282)$$
Definition 403 (Logarithmic Utility Function (Constant Relative Risk Aversion)).

$$u(x) = \ln(x), \qquad (3283)$$
$$r_A(x) = \frac{1}{x}, \quad r_R(x) = 1. \qquad (3284)$$
The log-utility and power utility functions are known as constant relative risk aversion (CRRA)
utility functions due to their constant relative risk aversion function $r_R(x)$. It has been shown that, given
similar levels of absolute risk aversion, the optimal portfolio is robust to the choice of utility function
[7], in that they result in very similar portfolios despite the different functional forms for utility. In many
practical applications, we hence rely on the quadratic utility as this is computationally tractable.

12.4 Factor Modelling


The factor modelling approach models returns as exposure to some stylistic/systematic market factors
and idiosyncratic components. The main tools are linear regression methods, and their machinery is
discussed in Section 8.8.1 (Simple Least Squares) and in Section 8.8.2 (Multiple Least Squares). Like
the discussions on Modern Portfolio Theory (Section 12.2.1), the content in this chapter is also rooted
in elegant mathematics. Unlike it however, the contents of this chapter are actively and heavily used in
both academia and practice for the purposes of risk-attribution, performance attribution, hedging, and
risk-analysis. Here we present the mathematical formulations underlying the commercial factor models
of such portfolio managers.
In general, factor models can be classified into macroeconomic factor models, fundamental factor
models and statistical factor models. They all rely on the tools of statistical regression, taking the form
\[
R_{it} = \alpha_{it} + \sum_{k=1}^{K}\beta_{ikt}f_{kt} + \epsilon_{it}, \qquad t \in [T],\; i \in [N], \tag{3285}
\]

where $R_{it}$ is the return of asset $i$ at time $t$, $\beta_{ikt}$ is the $k$-th factor beta (also known as factor loading) and $f_{kt}$ is the $k$-th factor value indexed at $t$. $\epsilon_{it}$ is the residual (also known as the asset specific factor). For presentation cleanliness, we would like to represent Equation 3285 in matrix form. Let $R \in \mathbb{R}^{T,N}$, $R_t \in \mathbb{R}^{N,1}$ and $R_i \in \mathbb{R}^{T,1}$. The contents of these matrices should be obvious from their sizes. The assumptions are (often, further assumptions are made in texts) that

1. $\mathbb{E}f_t = \mu_f$, and $\mathrm{Cov}(f_t) = \Sigma_f$.

2. $\mathrm{Cov}(f_{kt}, \epsilon_{it}) = 0$ $\forall k, \forall i, \forall t$.

3. $\mathbb{E}(\epsilon_{it}) = 0$.

4. $\mathrm{Cov}(\epsilon_{it}, \epsilon_{jk}) = \sigma_i^2\cdot\mathbf{1}\{i = j \text{ and } t = k\} = \sigma_i^2\delta_{ij}\delta_{tk}$ (serial and contemporaneous independence).

Here $\delta$ is the Kronecker delta function. See that the regression equation (Equation 3285) can be expressed
\[
\underbrace{R_t}_{\mathbb{R}^{N,1}} = \underbrace{\alpha_t}_{\mathbb{R}^{N,1}} + \underbrace{B_t}_{\mathbb{R}^{N,K}}\underbrace{f_t}_{\mathbb{R}^{K,1}} + \underbrace{\epsilon_t}_{\mathbb{R}^{N,1}}, \qquad t \in [T] \tag{3286}
\]

as $T$ cross-sectional regressions. In time series form, the parameters are not allowed to vary along the time axis (except in explicit models such as rolling regressions that allow for time-varying betas). The regression solutions, that is $\alpha_{it} = \alpha_i$ and $\beta_{ikt} = \beta_{ik}$, do not depend on time. Then we can write
\[
\underbrace{R_i}_{\mathbb{R}^{T,1}} = \underbrace{\mathbf{1}_T\alpha_i}_{\mathbb{R}^{T,1}} + \underbrace{F}_{\mathbb{R}^{T,K}}\underbrace{\beta_i}_{\mathbb{R}^{K,1}} + \underbrace{\epsilon_i}_{\mathbb{R}^{T,1}}, \qquad i \in [N] \tag{3287}
\]
as $N$ time series regressions. Again, their contents are obvious from their sizes.

See Equation 3286. This gives us a return decomposition. We can attribute the expected return on asset $i$ as $\mathbb{E}R_{it} = \alpha_{it} + \beta_{it}'\mathbb{E}f_t$, its alpha (unexplained returns) plus systematic exposure to factor returns. The variance of returns on asset $i$ can also be attributed similarly. See that we can write

\[
\mathrm{Cov}(R_t) = B_t\Sigma_fB_t' + D, \tag{3288}
\]
where $D = \mathbb{E}[\epsilon_t\epsilon_t']$ is the diagonal matrix of idiosyncratic variances $\sigma_i^2$. More specifically,
\begin{align}
\mathrm{Cov}(R_{it}, R_{jt}) &= \beta_{it}'\Sigma_f\beta_{jt} + \sigma_{ij}^2 \tag{3289}\\
&= \beta_{it}'\Sigma_f\beta_{jt} + \sigma_i^2\delta_{ij}. \tag{3290}
\end{align}
Now recall that a basket of assets has the relation $R_p = w'R$. At time $t$, the portfolio factor equation can be expressed
\[
R_{p,t} = w_t'R_t = w_t'\alpha_t + w_t'B_tf_t + w_t'\epsilon_t = \alpha_{p,t} + \beta_{p,t}'f_t + \epsilon_{p,t}, \tag{3291}
\]
where $\alpha_{p,t} = w_t'\alpha_t$, $\beta_{p,t}' = w_t'B_t$, $\epsilon_{p,t} = w_t'\epsilon_t$. This equation will be of use to us. The variance of the portfolio is then
\[
\mathrm{Var}(R_{p,t}) = w_t'B_t\Sigma_fB_t'w_t + w_t'Dw_t. \tag{3292}
\]
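A minimal numerical sketch of Equations 3288 and 3292 (illustrative only, with made-up dimensions and randomly generated inputs): given loadings $B$, factor covariance $\Sigma_f$ and diagonal idiosyncratic variances $D$, it assembles the asset covariance and the systematic/idiosyncratic split of a portfolio's variance.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 3                                   # assets, factors (toy sizes)
B = rng.normal(size=(N, K))                   # factor loadings B_t
A = rng.normal(size=(K, K))
Sigma_f = A @ A.T                             # factor covariance (PSD by construction)
D = np.diag(rng.uniform(0.01, 0.05, size=N))  # idiosyncratic variances

cov_assets = B @ Sigma_f @ B.T + D            # Equation 3288

w = np.full(N, 1 / N)                         # equal weighted portfolio
factor_var = w @ B @ Sigma_f @ B.T @ w        # systematic component of Equation 3292
idio_var = w @ D @ w                          # idiosyncratic component
print(factor_var, idio_var, w @ cov_assets @ w)  # the last equals the sum of the first two
```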

12.4.1 Time Series Factor Models
The time-series factor modelling equation takes the general form (Equation 3310), repeated here:
\[
\underbrace{R_i}_{\mathbb{R}^{T,1}} = \underbrace{\mathbf{1}_T\alpha_i}_{\mathbb{R}^{T,1}} + \underbrace{F}_{\mathbb{R}^{T,K}}\underbrace{\beta_i}_{\mathbb{R}^{K,1}} + \underbrace{\epsilon_i}_{\mathbb{R}^{T,1}}, \qquad i \in [N]. \tag{3293}
\]

Exercise 640 (Macroeconomic Factor Models). Macroeconomic factor models are given by
\[
R_{it} = \alpha_i + \beta_i'f_t + \epsilon_{it}, \tag{3294}
\]
where $f_t$ is the vector of observed macro-factor values. In time series applications, the factor values must be stationary, along with the assumptions on serial and contemporaneous independence of the asset specific residuals.

Exercise 641 (Sharpe's Single Factor Model). Sharpe's single factor model is given by a macroeconomic factor model of the form
\[
R_{it} = \alpha_i + \beta_iR_{mkt,t} + \epsilon_{it}, \qquad i \in [N],\; t \in [T], \tag{3295}
\]
where $R_{mkt,t}$ is the market return net of the risk-free rate. The expected return for a stock at time $t$ is written
\[
\mathbb{E}R_{it} = \alpha_i + \beta_i\mathbb{E}R_{mkt,t}. \tag{3296}
\]
Here $\beta_i$ is such that $\beta_i = \mathrm{Cov}(R_{it}, R_{mkt})/\mathrm{Var}(R_{mkt})$. To see this, write
\begin{align}
\mathrm{Cov}(R_{it}, R_{mkt}) &= \mathrm{Cov}(\alpha_i + \beta_iR_{mkt} + \epsilon_{it}, R_{mkt}) \tag{3297}\\
&= \mathrm{Cov}(\beta_iR_{mkt}, R_{mkt}) + \mathrm{Cov}(\epsilon_{it}, R_{mkt}) \tag{3298}\\
&= \beta_i\mathrm{Var}(R_{mkt}). \tag{3299}
\end{align}

Under this model, we can determine the covariance of a portfolio of assets in terms of the systematic and idiosyncratic contributions. That is,
\[
\Sigma_{assets} = \sigma_{mkt}^2\beta\beta' + D, \tag{3300}
\]
where $D = \mathbb{E}[\epsilon_t\epsilon_t']$, $\beta = (\beta_i)_{i\in[N]}$, and $\sigma_{mkt}^2$ is the market variance. Returns in this model are only correlated through their market factor exposure. With the market observables, we can make estimates $\hat{\beta}_i, \hat{\alpha}_i$. In particular, the sample regression equation is written
\[
R_i = \hat{\alpha}_i\mathbf{1}_T + R_{mkt}\hat{\beta}_i + \hat{\epsilon}_i, \tag{3301}
\]
where
\[
\hat{\beta}_i = \frac{\widehat{\mathrm{cov}}(R_{it}, R_{mkt})}{\widehat{\mathrm{Var}}(R_{mkt})}, \tag{3302}
\]
and $\hat{\alpha}_i = \bar{R}_i - \hat{\beta}_i\bar{R}_{mkt}$. Our estimate for $\hat{\sigma}_{\epsilon_i}^2$ is the mean squared residual
\[
\frac{1}{T-2}\hat{\epsilon}_i'\hat{\epsilon}_i. \tag{3303}
\]
The estimated sample covariance matrix for the assets is
\[
\hat{\Sigma}_{assets} = \hat{\sigma}_{mkt}^2\hat{\beta}\hat{\beta}' + \hat{D}. \tag{3304}
\]

The proportion of variance $R_i^2 = (\hat{\sigma}_{mkt}^2\hat{\beta}\hat{\beta}')_{ii}/[(\hat{\sigma}_{mkt}^2\hat{\beta}\hat{\beta}')_{ii} + \hat{D}_{ii}]$ for asset $i$ is a measure of market risk, while $1 - R_i^2$ represents the idiosyncratic risk component. We may also express this as
\begin{align}
\sigma_i^2 = \mathrm{Var}(R_{it}) &= \sigma_{mkt}^2\beta_i^2 + \sigma_{\epsilon_i}^2 \tag{3305}\\
1 &= \underbrace{\frac{\sigma_{mkt}^2\beta_i^2}{\sigma_i^2}}_{R_i^2} + \underbrace{\frac{\sigma_{\epsilon_i}^2}{\sigma_i^2}}_{1-R_i^2}. \tag{3306}
\end{align}
The constraint of fixed $\beta, \alpha$ may be relaxed by the use of rolling regression or Kalman filter techniques, allowing the parameters to vary with time.
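A small illustrative sketch of the single-index estimation in Equations 3301-3304, using simulated returns (all inputs are synthetic and the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 500, 4
beta_true = rng.uniform(0.5, 1.5, size=N)
r_mkt = rng.normal(0.005, 0.04, size=T)                      # market excess returns
eps = rng.normal(0, 0.02, size=(T, N))
R = 0.001 + r_mkt[:, None] * beta_true + eps                 # simulated asset returns

var_mkt = np.var(r_mkt, ddof=1)
beta_hat = np.array([np.cov(R[:, i], r_mkt)[0, 1] for i in range(N)]) / var_mkt  # Eq 3302
alpha_hat = R.mean(axis=0) - beta_hat * r_mkt.mean()          # alpha estimates
resid = R - alpha_hat - np.outer(r_mkt, beta_hat)
sigma2_hat = (resid**2).sum(axis=0) / (T - 2)                 # Eq 3303
Sigma_assets = var_mkt * np.outer(beta_hat, beta_hat) + np.diag(sigma2_hat)  # Eq 3304
R2 = var_mkt * beta_hat**2 / np.diag(Sigma_assets)            # market-risk proportion per asset
print(beta_hat, R2)
```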

Exercise 642 (Capital Asset Pricing Model (CAPM)). The CAPM model modifies the Single Index
Sharpe model (Exercise 641) and models asset prices as

\[
R_i - r_f = \alpha_i + \beta_i(R_{mkt} - r_f) + \epsilon_i, \tag{3307}
\]
where $r_f$ is the risk-free rate, and $R_{mkt}$ is the market return. The term $(\mathbb{E}R_i - r_f)$ is often called the risk premium, and $(R_{mkt} - r_f)$ is known as the market premium. It is an equilibrium pricing model where $\mathbb{E}\alpha_i = 0$. The (expected) excess returns on an asset depend primarily (completely) on its sensitivity $\beta_i$ to the market index. We can also write Equation 3302 as
\[
\beta_i = \frac{\mathrm{Cov}(R_i, R_{mkt})}{\mathrm{Var}(R_{mkt})} = \frac{\sigma_{R_i,mkt}}{\sigma_{mkt}^2} = \frac{\sigma_{R_i,mkt}}{\sigma_{R_i}\sigma_{mkt}}\cdot\frac{\sigma_{R_i}}{\sigma_{mkt}} = \rho_{R_i,mkt}\frac{\sigma_{R_i}}{\sigma_{mkt}}. \tag{3308}
\]
The CAPM model is the simplest example of a General Equilibrium Theory (GET) model. An equilibrium model matches the supply and demand of market agents based on the assumption of participants' rationality for utility maximization. Derivation of the CAPM assumes that investors are rational and risk-averse and share the same time horizon, along with other assumptions of efficient markets and risk-free lending/borrowing. The implication of this model is that the appropriate risk that investors are compensated for (the risk premium) depends on the asset's systematic variance component. These arguments are laid out in explicit detail in Section 15.1.2. The derivation of the CAPM model under the CRRA utility function is outlined in Section 15.1.4.1. For a market in equilibrium, the CML (Exercise 639) specifies the return of a portfolio as a linear function of the expected return on the market portfolio. CAPM asserts that the expected returns of assets take the form (see Equation 3307, Equation 3308)
\[
\mathbb{E}R_i = r_f + \frac{\mathrm{Cov}(R_i, R_{mkt})}{\mathrm{Var}(R_{mkt})}(\mathbb{E}R_{mkt} - r_f). \tag{3309}
\]
This equation is called the security market line (SML). Expected returns of individual securities lie on the SML in the equilibrium model. Efficient portfolios lie on both the SML and the CML.

Exercise 643 (Fama's Factor Model). The Fama-French style factor models can be seen as an extension of the Sharpe Single Index model (see Exercise 641). The extension can be seen as a parallel extension in theory from the simple linear regression problem to the multiple linear regression problem (see Section 8.8.1 to Section 8.8.2). In particular, the general multiple factor time series regression takes the form of Equation 3310:
\[
\underbrace{R_i}_{\mathbb{R}^{T,1}} = \underbrace{\mathbf{1}_T\alpha_i}_{\mathbb{R}^{T,1}} + \underbrace{F}_{\mathbb{R}^{T,K}}\underbrace{\beta_i}_{\mathbb{R}^{K,1}} + \underbrace{\epsilon_i}_{\mathbb{R}^{T,1}}, \qquad i \in [N]. \tag{3310}
\]
The original Fama-French three factor model specifies the factors to be High minus Low (HML), Small minus Big (SMB) and market (MKT) factors, constructed with respect to book-to-market value, market capitalization and cap-weighted benchmark returns respectively. The regression equation is given by
\[
R_{i,t} - r_{f,t} = \beta_{i,MKT}(R_{t,MKT} - r_{f,t}) + \beta_{i,HML}R_{t,HML} + \beta_{i,SMB}R_{t,SMB} + \epsilon_{i,t}. \tag{3311}
\]

Other Fama-style model extensions follow similar stepwise formulations: based on asset characteristics, portfolios are created by first sorting on the characteristic and forming a hedge portfolio that is long/short the particular factor, creating a time series of returns. The returns of this hedge portfolio are treated as factor returns and are market observable. The general Fama-style models extend the work to incorporate any $k$ factors, such as momentum, accrual factors and so on. By time-series regression methods, write
\begin{align}
R_i &= \hat{\alpha}_i\mathbf{1}_T + F\hat{\beta}_i + \hat{\epsilon}_i \tag{3312}\\
&= X\hat{\gamma} + \hat{\epsilon}_i, \tag{3313}
\end{align}
where $X = [\mathbf{1}_T : F]$ is the feature matrix and $\hat{\gamma} = (\hat{\alpha}_i, \hat{\beta}_i')'$ is the loading vector. See that $\hat{\gamma} = (X'X)^{-1}X'R_i$ from Equation 3032, and we have $\hat{\sigma}_i^2 = \frac{1}{T-K-1}\hat{\epsilon}_i'\hat{\epsilon}_i$. The sample factor covariance matrix is
\[
\hat{\Sigma}_f = \frac{1}{T-1}\sum_{t=1}^{T}(f_t - \bar{f})(f_t - \bar{f})', \tag{3314}
\]
and the estimated covariance matrix for asset returns under the Fama-style models can be written
\[
\hat{\Sigma}_{assets} = \hat{B}\hat{\Sigma}_f\hat{B}' + \hat{D}. \tag{3315}
\]
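An illustrative sketch of the time-series estimation pipeline in Equations 3312-3315, run on synthetic factor and asset returns (all data and dimensions here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, K = 600, 5, 3
F = rng.normal(0, 0.02, size=(T, K))                  # observed factor returns
B_true = rng.normal(size=(N, K))
R = F @ B_true.T + rng.normal(0, 0.01, size=(T, N))   # synthetic asset returns

X = np.column_stack([np.ones(T), F])                  # feature matrix [1_T : F]
gamma_hat = np.linalg.solve(X.T @ X, X.T @ R)         # (alpha_i, beta_i)' per asset, Eq 3313
alpha_hat, B_hat = gamma_hat[0], gamma_hat[1:].T      # B_hat is N x K
resid = R - X @ gamma_hat
D_hat = np.diag((resid**2).sum(axis=0) / (T - K - 1)) # idiosyncratic variances
Sigma_f_hat = np.cov(F, rowvar=False, ddof=1)         # Eq 3314
Sigma_assets = B_hat @ Sigma_f_hat @ B_hat.T + D_hat  # Eq 3315
print(Sigma_assets.shape)
```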

12.4.2 Cross Sectional Models


The cross sectional models take the form in Equation 3286, repeated here:
\[
\underbrace{R_t}_{\mathbb{R}^{N,1}} = \underbrace{\alpha_t}_{\mathbb{R}^{N,1}} + \underbrace{B_t}_{\mathbb{R}^{N,K}}\underbrace{f_t}_{\mathbb{R}^{K,1}} + \underbrace{\epsilon_t}_{\mathbb{R}^{N,1}}, \qquad t \in [T]. \tag{3316}
\]
Fundamental models of the BARRA type (as opposed to the Fama-type, see Exercise 643) use market observables and asset characteristics (such as the industry classification) as the factor loadings, and the econometric problem is to estimate the factors $f_t$. Yes, we use $B_t$ to estimate $f_t$, which may be fairly confusing for some readers, since this flies in the face of the standard statistical regression approach we see in Stats101.

Exercise 644 (BARRA's Factor Model [4]). The BARRA$^2$ factor model takes the form
\[
\underbrace{R_t}_{\mathbb{R}^{N,1}} = \underbrace{B_t}_{\mathbb{R}^{N,K}}\underbrace{f_t}_{\mathbb{R}^{K,1}} + \underbrace{\epsilon_t}_{\mathbb{R}^{N,1}}, \qquad t \in [T]. \tag{3317}
\]

The observable asset fundamentals and market factors are used as the factor betas $B_t$. Using $T$ cross-sectional regressions, we estimate $f_t$, $\forall t \in [T]$. In some texts, the factor loading matrix is assumed to be invariant, that is $B_t = B$. We assume a more general model to allow for factor migration (firms do change industries sometimes). For categorical variables (asset characteristics), the asset betas are often represented by indicator functions, such as

1 {firm is a tech stock} . (3318)


2 refer to BARRA Risk Model Handbook: https://roycheng.cn/files/riskModels/barra_risk_model_handbook.pdf

For numerical asset characteristics, transformation(s) are often applied, such as z-scoring and log functions. Log functions are a good way to model the diminishing marginal rate of return that firms face. To account for heteroskedastic variance across assets (different assets exhibit different variances/volatilities), we can employ weighted least squares (WLS) to obtain efficient estimates of $f_t$. Assuming the asset idiosyncratic variances $\sigma_i^2$ are known, then
\[
\hat{f}_{t,WLS} = (B_t'D^{-1}B_t)^{-1}B_t'D^{-1}R_t, \tag{3319}
\]
where $D = \mathrm{diag}((\sigma_i^2)_{i\in[N]})$, gives an unbiased estimator for $f_t$ with the lowest estimation error. The factor estimates $\hat{f}_{t,WLS}$ in Equation 3319 then have the interpretation as the returns on a portfolio matrix $w_t^* \in \mathbb{R}^{N,K}$ that solves (verify this$^3$)
\[
\min_{w_t^*} \; \frac{1}{2}\mathrm{trace}(w_t^{*\prime}Dw_t^*) \quad \text{s.t. } w_t^{*\prime}B_t = I_K. \tag{3320}
\]
That is, the portfolio constructed from the $k$-th column of $w_t^*$ is the minimum variance portfolio with unit exposure to factor $k$ and no other factor exposure. This is given by
\[
\underbrace{w_t^{*\prime}}_{\mathbb{R}^{K,N}} = (B_t'D^{-1}B_t)^{-1}B_t'D^{-1}, \tag{3321}
\]

and $\hat{f}_{t,GLS} = w_t^{*\prime}R_t$. If $\sum_i w_{i,t}^* = 1$, then this portfolio is called a factor mimicking portfolio. The difference between a factor mimicking portfolio and its corresponding (if any) ETF, such as a sector ETF, is that the portfolio has no exposure to other factors (industries, styles) and so on. The ETF, on the other hand, could have returns attributed to momentum, value and other conflating factors. The factor mimicking portfolio has attendant volatility; write
\begin{align}
\hat{f}_{t,GLS} &= (B_t'D^{-1}B_t)^{-1}B_t'D^{-1}R_t \tag{3322}\\
&= (B_t'D^{-1}B_t)^{-1}B_t'D^{-1}(B_tf_t + \epsilon_t) \tag{3323}\\
&= f_t + (B_t'D^{-1}B_t)^{-1}B_t'D^{-1}\epsilon_t \tag{3324}\\
&= f_t + B_t^{-1}\epsilon_t, \tag{3325}
\end{align}
where $B_t^{-1} := (B_t'D^{-1}B_t)^{-1}B_t'D^{-1}$ denotes the left inverse of $B_t$, and
\begin{align}
\Sigma_{\hat{f}} &= \Sigma_f + B_t^{-1}DB_t^{-1\prime} \tag{3326}\\
&= \Sigma_f + (B_t'D^{-1}B_t)^{-1}. \tag{3327}
\end{align}

In practice, we do not know $\sigma_i^2$, and the assumptions are not fulfilled. However, we have consistent (see estimator consistency, Definition 333) estimators for $\sigma_i^2$. Using the regression equation, we may apply the ordinary least squares (OLS) approach $T$ times to obtain (unbiased, inefficient) estimates of the factor realizations
\[
\hat{f}_{t,OLS} = (B_t'B_t)^{-1}B_t'R_t \tag{3328}
\]
at each $t$. The estimated covariance matrix for factor returns is then $\hat{\Sigma}_{F,OLS} = \frac{1}{T-1}\sum_{t=1}^{T}(\hat{f}_{t,OLS} - \bar{f}_{OLS})(\hat{f}_{t,OLS} - \bar{f}_{OLS})'$. The OLS approach similarly generates a time series of residuals $\hat{\epsilon}_t$. In particular, we can write $\hat{\sigma}_{i,OLS}^2 = \frac{1}{T-1}\sum_{t=1}^{T}(\hat{\epsilon}_{it,OLS} - \bar{\epsilon}_{i,OLS})^2$, a consistent estimator for $\sigma_i^2$. We can

3 some of the intermediate proofs presented here are tentative - I have not worked on them rigorously yet

then use these estimates to obtain $\hat{D}_{OLS}$. Using this in the weighted least squares estimation problem (Equation 3319), we obtain
\begin{align}
\hat{f}_{t,WLS} &= (B_t'\hat{D}_{OLS}^{-1}B_t)^{-1}B_t'\hat{D}_{OLS}^{-1}R_t, \tag{3329}\\
\hat{\Sigma}_{F,WLS} &= \frac{1}{T-1}\sum_{t=1}^{T}(\hat{f}_{t,WLS} - \bar{f}_{WLS})(\hat{f}_{t,WLS} - \bar{f}_{WLS})', \tag{3330}\\
\hat{\sigma}_{i,WLS}^2 &= \frac{1}{T-1}\sum_{t=1}^{T}(\hat{\epsilon}_{it,WLS} - \bar{\epsilon}_{i,WLS})^2, \tag{3331}\\
\hat{\Sigma}_{assets,t,WLS} &= B_t\hat{\Sigma}_{F,WLS}B_t' + \hat{D}_{WLS}. \tag{3332}
\end{align}

This approach deals with the cross-sectional heteroskedasticity of asset returns.
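A compact sketch of the two-pass cross-sectional estimation above (OLS factor estimates, residual variances, then WLS re-estimation), on synthetic loadings and returns; all inputs are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, K = 250, 50, 4
B = rng.normal(size=(N, K))                               # observed loadings (held fixed here)
f_true = rng.normal(0, 0.01, size=(T, K))
sig = rng.uniform(0.01, 0.05, size=N)
R = f_true @ B.T + rng.normal(size=(T, N)) * sig          # synthetic cross-sections of returns

# Pass 1: OLS factor estimates and residual variances (Eq 3328)
f_ols = np.linalg.solve(B.T @ B, B.T @ R.T).T             # T x K
resid = R - f_ols @ B.T
sigma2_ols = resid.var(axis=0, ddof=1)                    # consistent estimates of sigma_i^2

# Pass 2: WLS with D estimated from pass 1 (Eq 3329)
Dinv = np.diag(1.0 / sigma2_ols)
M = np.linalg.solve(B.T @ Dinv @ B, B.T @ Dinv)           # K x N left inverse of B
f_wls = (M @ R.T).T
Sigma_F_wls = np.cov(f_wls, rowvar=False, ddof=1)         # Eq 3330
Sigma_assets = B @ Sigma_F_wls @ B.T + np.diag(sigma2_ols)  # Eq 3332
print(Sigma_assets.shape)
```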

12.4.3 Statistical Models


(discuss this)

12.4.4 Simplified Markowitz Formulations under the Factor Model Analysis


The factor models assume that the investment profiles of capital assets are related only through their factor covariance matrix. In particular, under the regression assumptions for the factor models (see Section 12.4), we assumed serial and contemporaneous independence of the specific errors, such that the covariance matrix of errors $D$ is diagonal. Using this together with the residual returns (factor adjusted expected returns $\alpha_t = r_t - B_t\mu_f$), the formulations given in Section 12.2.1 for modern portfolio theory are simplified. We present a short section detailing these special cases. In particular, here are some common weighting solutions:

1. Proportional: the portfolio position is set proportional to $\alpha_t$. Here, volatility is not taken into consideration.

2. Risk (Volatility) Parity: $\frac{\alpha}{\sigma}$. Positions are scaled by the inverse of their idiosyncratic volatility. We do not need to care about an asset's relation to other assets in the portfolio, since $\sigma_{ij} = 0$ for all $i \neq j$.

3. Mean Variance: $\frac{\alpha}{\sigma^2}$.

4. Shrunk Mean Variance: $\frac{\alpha}{\gamma\sigma^2 + (1-\gamma)\sigma_{sector}^2}$, for $\gamma \in [0, 1]$.

We show the result for mean-variance (Item 3). Assume the trader has the utility function to maximize return $w'\alpha$ subject to penalty $\gamma w'\Sigma w$, where $\gamma$ is some risk-aversion parameter. Here, assume $\gamma = \frac{1}{2}$, such that she aims to maximise $(\alpha_{portfolio} - \frac{1}{2}\sigma_{portfolio}^2)$ - we know from Section 435 that this is the utility function that maximises trader wealth under the assumption of assets driven by a geometric Brownian motion. Then, write
\begin{align}
w'\alpha - \gamma w'\Sigma w &= w'\alpha - \frac{1}{2}w'\Sigma w \tag{3333}\\
&= \sum_i w_i\alpha_i - \frac{1}{2}\sum_{i=j}w_iw_j\sigma_{ij} - \sum_{i<j}w_iw_j\sigma_{ij} \tag{3334}\\
&= \sum_i w_i\alpha_i - \frac{1}{2}\sum_i w_i^2\sigma_i^2. \tag{3335}
\end{align}
Taking the derivative with respect to $w_i$, we get $\alpha_i - w_i\sigma_i^2$, and the optimal allocation for asset $i$ is $\frac{\alpha_i}{\sigma_i^2}$, which is the mean-variance allocation.
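A toy sketch of the four weighting schemes above, applied to made-up alpha and idiosyncratic volatility vectors; the normalisation to unit gross weight is a choice made here for display and is not prescribed by the text:

```python
import numpy as np

alpha = np.array([0.02, 0.01, 0.03, 0.015])      # expected residual returns (made up)
sigma = np.array([0.10, 0.05, 0.20, 0.08])       # idiosyncratic vols (made up)
sigma_sector = 0.12                              # sector vol used for shrinkage (made up)
gamma = 0.5

schemes = {
    "proportional": alpha,
    "vol_parity": alpha / sigma,
    "mean_variance": alpha / sigma**2,
    "shrunk_mean_variance": alpha / (gamma * sigma**2 + (1 - gamma) * sigma_sector**2),
}
for name, raw in schemes.items():
    w = raw / np.abs(raw).sum()                  # scale so gross weights sum to one
    print(name, np.round(w, 3))
```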

12.4.5 Factor Hedging and Attribution
Once the factors and their returns have been identified via the techniques discussed in Section 12.4.1,
12.4.2, 12.4.3, and a portfolio w constructed, recall that our portfolio factor exposure can be modelled
by the equation (Equation 3291)

\[
R_{p,t} = \underbrace{w_t'}_{\mathbb{R}^{1,N}}R_t = w_t'\alpha_t + w_t'\underbrace{B_t}_{\mathbb{R}^{N,K}}\underbrace{f_t}_{\mathbb{R}^{K,1}} + w_t'\epsilon_t = \alpha_{p,t} + \beta_{p,t}'f_t + \epsilon_{p,t}, \tag{3336}
\]
where $\alpha_{p,t} = w_t'\alpha_t$, $\beta_{p,t}' = w_t'B_t$, $\epsilon_{p,t} = w_t'\epsilon_t$. Our attribution (where did our returns/risk come from) and hedging problem (eliminating undesirable risks) will focus on this equation. In particular, the beta of the portfolio to the factors is a linear combination of the component betas $w_t'B_t$, and the portfolio variance decomposes into the factor covariance and idiosyncratic variance contributions (Equation 3292):
\[
\mathrm{Var}(R_{p,t}) = \underbrace{w_t'B_t\Sigma_fB_t'w_t}_{\text{portfolio market variance}} + \underbrace{w_t'Dw_t}_{\text{portfolio idiosyncratic variance}}. \tag{3337}
\]

12.4.5.1 Single/Multiple Factor Hedging

If we assume Sharpe's Single Index model (Exercise 641), then the portfolio market volatility is $w_t'\beta_t\sigma_{mkt,t}$, and the idiosyncratic volatility is $\sqrt{w_t'Dw_t}$. As the number of stocks in the portfolio increases, the proportion of portfolio variance explained by the market variance component increases - the representative market used as the benchmark itself has perfect unit beta and no idiosyncratic variance.$^4$ This is trivial. The regression equation for the market index under the Sharpe single-factor model is
\[
R_{mkt,t} = \alpha_{mkt} + \beta_{mkt}R_{mkt,t} + \epsilon_{mkt,t}, \qquad t \in [T]. \tag{3338}
\]
Then $\beta_{mkt} = 1$, $\epsilon_{mkt,t} = 0$. We can show this with another numerical explanation. For a portfolio $w = (w_i)_{i\in[n]}$, the idiosyncratic variance is $\sum_i w_i^2\sigma_i^2 \leq \sigma_{max}^2\sum_i w_i^2$, with $\sum_i w_i^2 \leq 1$. The greater the diversification, the smaller the idiosyncratic variance. An equal weighted portfolio has a variance bound of $\sigma_{max}^2\,n\cdot\frac{1}{n^2} = \frac{\sigma_{max}^2}{n}$, which is asymptotically zero in $n$. In practice we can find tradable instruments that replicate the returns of the benchmark index, such as ETFs and futures, to hedge away the market risk. Ignoring the tracking error, our market hedging is done by adding $-\beta_{p,t}f_{mkt}$ to the portfolio, such that Equation 3336 becomes
\begin{align}
R_{p,t} &= w_t'R_t - \beta_{p,t}f_{mkt} \tag{3339}\\
&= w_t'\alpha_t + w_t'\beta_tf_{mkt} + w_t'\epsilon_t - \beta_{p,t}f_{mkt} \tag{3340}\\
&= \alpha_{p,t} + \beta_{p,t}f_{mkt} - \beta_{p,t}f_{mkt} + \epsilon_{p,t} \tag{3341}\\
&= \alpha_{p,t} + \epsilon_{p,t}. \tag{3342}
\end{align}
There is no change to the idiosyncratic volatility, and we have hedged our factor return.

$^4$ A significant amount of the AUM of equity portfolio managers has a positive beta exposure to the equity indices. This is because equities empirically increase in value over time. This increase is attributed to the 'equity risk premium' - investors are compensated for taking on additional risk, and for the anticorrelation between equity prices and bad states of the world economy. Funds targeting zero beta and uncorrelated return streams are known as 'absolute return' funds.

Exercise 645 (Multiple Factor Hedging). Let $w \in \mathbb{R}^N$ be a portfolio with factor loadings $B_t$. We want to remove from the portfolio the factor return components $B_tf_t$. Our objective is to minimize the size of the hedging book, so as to retain as much of the original speculative portfolio as possible while removing factor exposures. Let $\Omega \in \mathbb{R}^{N,N}$ be some positive semi-definite matrix; then we want to solve the problem
\begin{align}
\min_h \;& h'\Omega h \qquad \text{(minimize hedge book size)} \tag{3343}\\
\text{s.t. }\;& B_t'(w + h) = 0, \tag{3344}
\end{align}

where $h$ is our hedge portfolio. The Lagrangian for this is
\[
\mathcal{L}(h, \lambda) = h'\Omega h + \lambda'B_t'(w + h). \tag{3345}
\]
Then we need to solve the linear equations
\begin{align}
2\Omega h + B_t\lambda &= 0, \tag{3346}\\
B_t'(w + h) &= 0. \tag{3347}
\end{align}
From the first equation, see that we can write $h = -\frac{1}{2}\Omega^{-1}B_t\lambda$. Substituting into the second equation, we get
\begin{align}
B_t'w &= \frac{1}{2}B_t'\Omega^{-1}B_t\lambda, \tag{3348}\\
2(B_t'\Omega^{-1}B_t)^{-1}B_t'w &= \lambda, \tag{3349}
\end{align}
and substituting back into the first equation gives
\[
2\Omega h + 2B_t(B_t'\Omega^{-1}B_t)^{-1}B_t'w = 0. \tag{3350}
\]
Then the hedge portfolio is
\[
h = -\Omega^{-1}B_t(B_t'\Omega^{-1}B_t)^{-1}B_t'w. \tag{3351}
\]

When the matrix $\Omega$ is the identity, the objective function minimizes the Euclidean size of the hedge book, that is, the distance of the final hedged book from the original portfolio holdings $w$. When $\Omega$ is set to the covariance matrix of asset returns, the hedge portfolio size is volatility adjusted.
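A quick numerical sketch of the hedge in Equation 3351 (synthetic inputs; $\Omega$ is taken as the identity here), verifying that the combined book $w + h$ has zero factor exposure:

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 20, 3
B = rng.normal(size=(N, K))          # factor loadings B_t
w = rng.uniform(0, 0.1, size=N)      # speculative portfolio (made up)
Omega_inv = np.eye(N)                # Omega = identity -> minimal-size hedge book

h = -Omega_inv @ B @ np.linalg.solve(B.T @ Omega_inv @ B, B.T @ w)   # Equation 3351
print(np.allclose(B.T @ (w + h), 0))  # True: all factor exposures are hedged out
```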

Exercise 646 (Multiple Factor Hedging Portfolio as Residuals of Regression on Loading Matrix). See Exercise 645. Show that in a portfolio allocation scheme where weights are proportional to expected returns, the final portfolio (original allocations plus the hedge portfolio) is equal to the residuals $\epsilon$ obtained from the regression
\[
w \sim B_tx + \epsilon. \tag{3352}
\]
Proof. Let the expected returns be denoted by the vector $\alpha$, so that $w = \alpha$. See that the hedge portfolio is given by Equation 3351; when $\Omega = \mathbb{1}$ (the identity, i.e., minimizing the hedge book), we have
\begin{align}
\alpha + h &= \alpha - \Omega^{-1}B_t(B_t'\Omega^{-1}B_t)^{-1}B_t'w \tag{3353}\\
&= \alpha - B_t(B_t'B_t)^{-1}B_t'\alpha \tag{3354}\\
&= (\mathbb{1} - B_t(B_t'B_t)^{-1}B_t')\alpha. \tag{3355}
\end{align}
Recall from Equation 3034 that the residuals from multiple least squares regression may be expressed in terms of the hat matrix, and this is precisely Equation 3355.


12.5 Extension of the Classical Mean-Variance Framework for Various Constraints and Utilities

It has been shown in Section 12.3 that the investor trades (or should attempt to trade) on the frontier or capital market line; where on the frontier he does so is then a matter of the investor's utility function.

12.5.1 Practical Constraints and Modifications to Objectives


Here we consider the addition of constraints and modifications to objectives extending the classical mean-
variance objective function (Equation 3264). The constraints are generally split between the linear,
quadratic, nonlinear and combinatorial types. For current portfolio weighting w0 and target weighting
w, the optimal portfolio distance x = w − w0 denotes the amount to trade.

Definition 404 (Long Only Constraints). The long only constraints require that ∀i ∈ [N ], wi ≥ 0. These
constraints are applied either when there are short selling constraints (short investments can be difficult to
implement in some markets, such as in real-estate), or when we are trading explicitly positive expectancy
instruments such as active alphas.

Definition 405 (Turnover Constraints). Turnover results in transaction costs and inefficient portfolio rebalancing. For some upper bound $U_i$, we may set $|x_i| \leq U_i$ to limit asset turnover. This is useful when cross-sectional transaction costs are highly variable. Often this parameter is set relative to the average daily volume (ADV) or average daily dollar volume (ADDV) of the stock, or some other measure of liquidity. When costs are uniform, the bound is often specified for the entire portfolio instead, such that $\sum_{i=1}^{N}|x_i| \leq U_p$.

Definition 406 (Holding/Diversification Constraints). The holding constraints limit the maximal holdings in an individual asset. The target portfolio is constrained such that we have $\forall i,\; L_i \leq w_i \leq U_i$, where $L_i, U_i$ are lower and upper thresholds respectively. When the exposure to constrain is with respect to some $m$ baskets $S_{i\in[m]}$ (such as asset class, country or industry exposure), then we write $L_i \leq \sum_{j\in S_i}w_j \leq U_i$, for $i \in [m]$.

Definition 407 (Risk Factor Constraints). When the portfolio risk is measured as a (multi) factor model
(see Section 12.4), then the portfolio manager may choose to limit her exposure to a certain factor risk.
Assume a K-factor model; then the returns for asset $i$ are assumed to take the parametric form
\[
R_i = \alpha_i + \sum_{k=1}^{K}\beta_{ik}F_k + \epsilon_i, \tag{3356}
\]
where $F_k$ is the $k$-th factor driver of asset returns. To limit the exposure to some $k$-th risk factor, we use the constraint
\[
\sum_{i=1}^{N}\beta_{ik}w_i \leq U_k, \tag{3357}
\]
for some upper threshold $U_k$. A $k$-th factor neutral portfolio assumes the constraint $\sum_{i=1}^{N}\beta_{ik}w_i = 0$. Factor hedging implementations are discussed in Section 12.4.5.1.

5 In practice, hedging can also contribute significantly to costs and adversely impact performance results. While a useful tool for portfolio analysis, many traders (without mandates) choose to have open factor exposure without incurring hedging costs, with large pnl attribution to beta effects. This is perfectly reasonable - I have found market beta hedging particularly useful in long-short strategies where I aggregate the beta for single stock shorts and proxy the short leg with short futures contracts. This is particularly useful when the return predictability of a L/S strategy is asymmetrically poor on the short side, reducing transaction costs and cutting short-squeeze risk.

6 Another argument against factor-hedging for quant portfolios is that the dynamic correlation of strategies is markedly distinct from the correlations of the assets themselves. When the quantitative strategies span multiple asset classes, finding useful factors can also become a non-trivial problem, giving support to portfolio management strategies that are asset-agnostic and act only upon the numerical artefacts generated by the historical and live trades.

Definition 408 (Tracking Error Constraint (TEV)). When a portfolio is managed relative to a benchmark asset (such as ETFs, mutual funds), then the mandate may be to hold a portfolio within some small tracking error of the reference asset. In particular, denote $w_b$ to be the benchmark portfolio weights, and $R$ the vector of asset returns. Then $R_b = w_b'R$. A possible measure of distance from the benchmark portfolio is the Euclidean distance $\|w - w_b\| \leq M$. The most commonly used metric of benchmark deviation is called the tracking error, which is defined as the variance of the difference between portfolio returns and benchmark returns. That is, given $R_p = w'R$, $R_b = w_b'R$, $TEV = \mathrm{Var}(R_p - R_b)$. We get
\begin{align}
TEV_p &= \mathrm{Var}(R_p - R_b) \tag{3358}\\
&= \mathrm{Var}(w'R - w_b'R) \tag{3359}\\
&= (w - w_b)'\Sigma(w - w_b), \tag{3360}
\end{align}
where $\Sigma$ is the asset covariance matrix. Then the constraint is $(w - w_b)'\Sigma(w - w_b) \leq \sigma_U^2$, for some upper threshold variance $\sigma_U^2$.
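A quick sketch computing the tracking error variance of Equation 3360 for a candidate portfolio against a benchmark (all inputs synthetic):

```python
import numpy as np

rng = np.random.default_rng(8)
N = 5
A = rng.normal(size=(N, N))
Sigma = A @ A.T / N                       # asset covariance (made up)
w_b = np.full(N, 1 / N)                   # benchmark weights
w = w_b + rng.normal(0, 0.02, size=N)     # candidate portfolio with small active bets
w /= w.sum()

d = w - w_b
tev = d @ Sigma @ d                       # Equation 3360
print("tracking error variance:", tev, "tracking error vol:", np.sqrt(tev))
```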

When working with integer/binary constraints, it is helpful to define the decision variable $\delta_i = \mathbf{1}\{w_i \neq 0\}$ for asset $i$.

Definition 409 (Minimum Holding and Transaction Constraint). Small portfolio holdings mean that future changes in portfolio weightings significantly affect holdings. Additionally, small transactions mean that non-variable costs are amortized over small notional volume, and hence are expensive relative to the rebalancing obtained. It is often desirable to specify minimum holding sizes in the portfolio and minimum trade sizes, respectively written
\[
|w_i| \geq L_{w_i}\delta_i, \qquad |x_i| \geq L_{x_i}\delta_i, \qquad i \in [N]. \tag{3361}
\]
Often, the addition of minimum size constraints (Definition 409) results only in a marginally different portfolio, while adding significant challenges to the computational load of numerical optimization. Often this constrained optimization is substituted by a post-hoc filtering step after the classical mean-variance type analysis, avoiding the computational difficulties.

Definition 410 (Cardinality Constraints). A portfolio manager may choose to limit the number of assets held in the portfolio. This cardinality constraint takes the form $\sum_{i=1}^{N}\delta_i = K$, where $K \leq N$.

In practice, market securities trade in lots. Solving for $w$ assumes that, as a function of initial portfolio wealth $W_0$, the position-wise holdings $W_0w$ translate to fractional holdings that are close to the implementable portfolio. This is not always true, particularly when the account size is small relative to contracts which trade in large lot sizes (such as in futures markets). Therefore, we may implement round lot constraints.

Definition 411 (Round Lot Constraints). The portfolio weights are represented as $w_i = z_i\cdot f_i$, $i \in [N]$, where $f_i$ is the fraction of portfolio wealth corresponding to one lot and $z_i$ is an integer number of lots. For instance, for a contract trading in lots of size 100, stock price $S_i$ and wealth $W_0$, we have $f_i = \frac{S_i\cdot100}{W_0}$. Under the round lot constraints, $\sum_{i=1}^{N}w_i = 1$ is often not precisely attainable and we allow an additive error bound $\sum_{i=1}^{N}f_iz_i + \epsilon^- - \epsilon^+ = 1$. This may be expressed
\[
z'\Lambda\mathbf{1} + \epsilon^- - \epsilon^+ = 1, \tag{3362}
\]
where $\Lambda$ is the diagonal matrix with entries $\Lambda_{ii} = f_i$. The investor then solves the problem
\[
\max_z \; z'\Lambda\mu - \lambda z'\Lambda\Sigma\Lambda z - \gamma(\epsilon^- + \epsilon^+), \tag{3363}
\]
subject to $z'\Lambda\mathbf{1} + \epsilon^- - \epsilon^+ = 1$, $\epsilon^-, \epsilon^+ \geq 0$.

The binary variables and round lot constraints (Definition 410, Definition 411) introduce integers into our computer program, and the quadratic mean-variance program becomes a quadratic mixed integer problem (QMIP). These may be computationally expensive. With $N$ assets and a specific cardinality choice $M$, the total number of candidate portfolios is $\binom{N}{M} = \frac{N!}{M!(N-M)!}$. The efficient frontier is discontinuous and is constructed by plotting the frontier of all possible $M$-subsets, followed by removing the dominated portfolios. As the size of the asset universe increases, this quickly becomes computationally intractable. The portfolio optimization problem with minimum holding constraints, cardinality constraints and round lot constraints (Definitions 409, 410 and 411) belongs to the class of NP-complete (nondeterministic polynomial time complete) problems. Heuristics are often employed.
Consider the mean-variance risk-aversion formulation in Equation 3264. We may penalize the friction of trading with a cost function and write the extension
\[
\max_w \; w'\mu - \lambda w'\Sigma w - \lambda_{TC}\cdot TC, \tag{3364}
\]
with constraint $w'\mathbf{1} = 1$. $TC$ is the transaction cost function, depending usually on $w$, $w_0$ and other market observables. The transaction cost is arbitrary; it can involve either complicated nonlinear functions or take some linear/quadratic form.

Definition 412 (Linear/Quadratic Cost). A common assumption is that the transaction cost penalty is a simple function of $x = w - w_0$, and that it may be expressed
\[
TC(x) = \sum_{i=1}^{N}TC_i(x_i), \tag{3365}
\]
where $TC_i$ is the transaction cost function for security $i$. The general form for $TC_i(x_i)$ up to the second order may be written
\[
TC_i(x_i) = \alpha_i\mathbf{1}\{x_i \neq 0\} + \beta_i|x_i| + \gamma_i|x_i|^2, \tag{3366}
\]
for some $\alpha_i, \beta_i, \gamma_i$. When $\alpha_i = 0$ for all $i \in [N]$, the investor solves
\[
\max_w \; w'\mu - \lambda w'\Sigma w - \lambda_{TC}(\beta'|x| + |x|'\Gamma|x|). \tag{3367}
\]
Here the vector $\beta' = (\beta_1, \cdots, \beta_N)$ and $\Gamma$ is the diagonal matrix with $\Gamma_{ii} = \gamma_i$. These may be solved as a quadratic optimization problem. When $\gamma_i = 0$ and $\beta' \neq 0$, then the costs are strictly linear. See Section 12.5.3.

Definition 413 (Piecewise Linear Costs). A piecewise linear transaction cost can be used to model increasing transaction costs beyond certain threshold trade volumes. These may be modelled depending on the trading volume or the state of the order book, for instance. For simplicity, let there be some thresholds $U_1, U_2, U_3$; then the transaction cost function may be written as the piecewise linear function
\[
TC(t) = \begin{cases}
s_1t, & 0 \leq t \leq U_1,\\
s_1U_1 + s_2(t - U_1), & U_1 \leq t \leq U_2,\\
s_1U_1 + s_2U_2 + s_3(t - U_2), & U_2 \leq t \leq U_3,
\end{cases} \tag{3368}
\]
and so on. Often the slopes satisfy $s_1 < s_2 < s_3$. Inclusion of a piecewise linear function in the mean-variance framework is easily tractable. We may introduce the decision variables $y_1, y_2, y_3$, and write the cost penalty as
\[
\lambda_{TC}(s_1\cdot y_1 + s_2\cdot y_2 + s_3\cdot y_3) = \lambda_{TC}\sum_{i=1}^{N}(s_{1,i}y_{1,i} + s_{2,i}y_{2,i} + s_{3,i}y_{3,i}). \tag{3369}
\]
The constraints for the added decision variables are $0 \leq y_{1,i} \leq U_1$, $0 \leq y_{2,i} \leq U_2$, $0 \leq y_{3,i} \leq U_3$. The increasing slopes ($i < j \implies s_i < s_j$) mean that the optimizer sets $y_{k,i} > 0$ only once the lower tranches $y_{j,i}$, $j < k$, have been filled. The total amount traded is $z_i = y_{1,i} + y_{2,i} + y_{3,i}$, and so we set the constraint $z_i = |w_i - w_{0,i}| = |x_i|$.
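A small sketch of the tranche decomposition above (synthetic slopes; here the thresholds are interpreted as tranche widths, consistent with the bounds on $y_{k,i}$): given a trade size $t$, it fills $y_1, y_2, y_3$ greedily and evaluates the piecewise linear cost.

```python
def piecewise_cost(t, slopes=(0.001, 0.002, 0.004), widths=(0.05, 0.05, 0.10)):
    """Greedy tranche fill: cheaper tranches are consumed first, mirroring the optimizer's behaviour."""
    remaining, cost, ys = t, 0.0, []
    for s, u in zip(slopes, widths):
        y = min(remaining, u)      # amount traded in this tranche
        ys.append(y)
        cost += s * y
        remaining -= y
    return cost, ys

print(piecewise_cost(0.12))   # a trade spanning all three tranches
```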

12.5.2 Extension of the Utility Function for Higher Moments of Portfolio Returns
In this section we present the arguments when the investor utility function incorporates higher order
moment statistics. In particular, when the asset returns are no longer assumed to be multivariate
normal, (µ, Σ) of the asset returns are insufficient to capture the distribution of asset returns. The
portfolio may exhibit skew and tail characteristics that are relevant to the problem of portfolio selection.
In general, investors favour positive skew and non-fat tail portfolios. While the return is a linear function
of the signal weighting w, and the standard deviation σ is convex, the skew and kurtosis statistics are
highly nonlinear and can exhibit multiple maxima and minima. It is possible that a portfolio minimizing the variance simultaneously minimizes skew and maximizes kurtosis (which is undesirable to the investor), resulting in a sub-optimal portfolio. The classical mean-variance type portfolios (Section
12.2.1) may hence be extended to incorporate investor views and aversion to risk in the higher moments.
Given initial wealth $W_0$, vector of portfolio weights $w$ and returns $R$, the one-period terminal wealth is expressed $W = W_0(1 + w'R) = W_0(1 + R_p)$, and in expectation $\bar{W} = W_0(1 + w'\mu) = W_0(1 + \mu_p)$. By the Taylor expansion, for utility function $u$ we have
\begin{align}
\mathbb{E}u(W) &= u(\bar{W}) + u^{(1)}(\bar{W})\mathbb{E}[W - \bar{W}] + \frac{1}{2}u^{(2)}(\bar{W})\mathbb{E}[(W - \bar{W})^2] + \frac{1}{3!}u^{(3)}(\bar{W})\mathbb{E}[(W - \bar{W})^3] \tag{3370}\\
&\quad + \frac{1}{4!}u^{(4)}(\bar{W})\mathbb{E}[(W - \bar{W})^4] + O\big((W - \bar{W})^5\big). \tag{3371}
\end{align}
Since $\mathbb{E}W = \bar{W}$, the second term vanishes. See that the terms $\mathbb{E}[(W - \bar{W})^k]$ are the $k$-th (non-standardized) central moments of the random variable $W$. Define these central moments for $k = 2, 3, 4$ as the variance, (non-standardized) skew and kurtosis respectively. That is, we have
\begin{align}
\mu_p &= \mathbb{E}R_p, & \sigma_p^2 &= \mathbb{E}[(R_p - \mu_p)^2] = \mathbb{E}[(W - \bar{W})^2], \tag{3372}\\
s_p^3 &= \mathbb{E}[(R_p - \mu_p)^3] = \mathbb{E}[(W - \bar{W})^3], & \kappa_p^4 &= \mathbb{E}[(R_p - \mu_p)^4] = \mathbb{E}[(W - \bar{W})^4]. \tag{3373}
\end{align}

Then we may express the investor utility function incorporating the higher moments as
\[
\mathbb{E}u(W) = u(\bar{W}) + \frac{1}{2}u^{(2)}(\bar{W})\sigma_p^2 + \frac{1}{3!}u^{(3)}(\bar{W})s_p^3 + \frac{1}{4!}u^{(4)}(\bar{W})\kappa_p^4 + O\big((W - \bar{W})^5\big). \tag{3374}
\]
Assuming the log utility form $u(x) = \ln(x)$ (Definition 403), working through the logarithmic derivatives gives
\[
\mathbb{E}\ln(W) \approx \ln(\bar{W}) - \frac{1}{2\bar{W}^2}\sigma_p^2 + \frac{1}{3\bar{W}^3}s_p^3 - \frac{1}{4\bar{W}^4}\kappa_p^4. \tag{3375}
\]
In essence the investor solves the problem
\[
\max_w \; \ln(\bar{W}) - \frac{1}{2\bar{W}^2}\sigma_p^2 + \frac{1}{3\bar{W}^3}s_p^3 - \frac{1}{4\bar{W}^4}\kappa_p^4 \tag{3376}
\]
subject to the constraint $w'\mathbf{1} = 1$. An investor with specific preferences over any of the moments may opt to instead solve the general form
\[
\max_w \; w'\mu - \lambda_1\sigma_p^2 + \lambda_2s_p^3 - \lambda_3\kappa_p^4. \tag{3377}
\]

See that the investor would be averse to the moments of even order and non-averse to the moments of odd order. Just as the covariance describes the co-movement of asset returns for the second moment, the concepts of co-skew and co-kurtosis describe the higher order co-movement of asset returns. We would like to express a general form for the computation of higher moments with the use of tensor products. In particular, if we express
\begin{align}
M_3 = (s_{ijk}) &= \mathbb{E}[(R - \mu)(R - \mu)' \otimes (R - \mu)'] \tag{3378}\\
M_4 = (\kappa_{ijkl}) &= \mathbb{E}[(R - \mu)(R - \mu)' \otimes (R - \mu)' \otimes (R - \mu)'], \tag{3379}
\end{align}
where each element is expressed respectively as $s_{ijk} = \mathbb{E}[(R_i - \mu_i)(R_j - \mu_j)(R_k - \mu_k)]$ and $\kappa_{ijkl} = \mathbb{E}[(R_i - \mu_i)(R_j - \mu_j)(R_k - \mu_k)(R_l - \mu_l)]$ for $i, j, k, l \in [N]$, then in tensor notation the optimization of Equation 3377 is expressed
\[
\max_w \; w'\mu - \lambda_1w'\Sigma w + \lambda_2w'M_3(w \otimes w) - \lambda_3w'M_4(w \otimes w \otimes w), \tag{3380}
\]
s.t. $w'\mathbf{1} = 1$. This is a fourth order non-convex polynomial function over $w$, and possibly exhibits local optima.$^7$
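A minimal sketch (not from the text, using simulated fat-tailed returns) of the sample co-moment matrices in the flattened Kronecker layout of Equations 3378-3379 and the four-moment objective of Equation 3380; the $\lambda$ values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
T, N = 2000, 4
R = rng.standard_t(df=5, size=(T, N)) * 0.02     # fat-tailed synthetic returns
X = R - R.mean(axis=0)                           # centered returns

M3 = sum(np.outer(x, np.kron(x, x)) for x in X) / T                # N x N^2 co-skewness
M4 = sum(np.outer(x, np.kron(np.kron(x, x), x)) for x in X) / T    # N x N^3 co-kurtosis
Sigma = np.cov(R, rowvar=False)

w = np.full(N, 1 / N)
mu = R.mean(axis=0)
lam1, lam2, lam3 = 1.0, 1.0, 1.0
objective = (w @ mu - lam1 * w @ Sigma @ w
             + lam2 * w @ M3 @ np.kron(w, w)
             - lam3 * w @ M4 @ np.kron(np.kron(w, w), w))          # Equation 3380
print(objective)
```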

12.5.2.1 Polynomial Goal Programming, Fabozzi et al. [7]

We noted that the odd moments are favourable, and the even moments are undesirable for investors.
That is, the investors have multiple goals. This is a multiobjective optimization (MOO) problem. In
particular, suppose the investor is concerned up to the fourth moment, then she solves simultaneously

\begin{align}
\max_w \; O_1(w) &= w'\mu, \tag{3381}\\
\min_w \; O_2(w) &= w'\Sigma w, \tag{3382}\\
\max_w \; O_3(w) &= w'M_3(w \otimes w), \tag{3383}\\
\min_w \; O_4(w) &= w'M_4(w \otimes w \otimes w). \tag{3384}
\end{align}

7 It is known that the sample mean and variance are highly sensitive to outliers (they are arithmetic averaging statistics). Even worse, the standard error of sample moments of order n is proportional to the square root of the moments of order 2n. The skew and kurtosis statistics are even more sensitive to outliers. Therefore, these statistics should be used with caution, and it is desirable to consider more robust estimators of these moments. There is significant literature around the robust estimation of mean and variance statistics. This is less so for the higher moments.

The same constraints apply. The high-level idea is that we first solve the individual goals, and then aim to achieve a good tradeoff between the optimal solutions to each goal. We are primarily interested in the relative percentage invested in each of the assets, therefore we can do without $O_2$ by reducing the problem to the unit variance space, such that our hypothesis space is $\{w \mid w'\Sigma w = 1\}$. Then we solve for
\begin{align}
\max_w \; O_1(w) &= w'\mu, \tag{3385}\\
\max_w \; O_3(w) &= w'M_3(w \otimes w), \tag{3386}\\
\min_w \; O_4(w) &= w'M_4(w \otimes w \otimes w), \tag{3387}
\end{align}
subject to $w'\mathbf{1} = 1$ and $w'\Sigma w = 1$. Let the solutions to each of these problems be $w_1^*, w_3^*, w_4^*$; the second step in our problem now solves for
\[
\min_w \; O(w) = (d_1(w))^{p_1} + (d_3(w))^{p_3} + (d_4(w))^{p_4}, \tag{3388}
\]
where $d_i(w) = O_i(w_i^*) - O_i(w)$. The parameters $p_i$ represent investor preferences over each of the moments, and can be shown to be directly associated with the marginal rate of substitution
\[
MRS_{ij} = \frac{\partial O/\partial d_i}{\partial O/\partial d_j} = \frac{p_id_i(w)^{p_i-1}}{p_jd_j(w)^{p_j-1}}. \tag{3389}
\]

12.5.3 Numerical Solving of the Extension to N Asset Mean-Variance Portfolios with Long Constraints and Linear Costs

We introduced the Markowitz formulations. These have a number of shortfalls, which we shall discuss separately. One concern is that we were unable to add non-linear constraints. Another problem with the mean-variance portfolio is that it is an incomplete model of trading with friction. The other is that it does not account for poor estimation of the covariance matrix. We shall deal with the first two issues here. For covariance matrix $\Sigma$, expected returns $r$ for $N$ assets and risk-aversion factor $\gamma$, the mean-variance investor wants to maximise Equation 3264:
\begin{align}
&\max_w \; w'r - \gamma w'\Sigma w \tag{3390}\\
&\text{s.t. } \sum_{i=1}^{N}w_i = 1. \tag{3391}
\end{align}
We want to make two adjustments: the first is the addition of a cost function, and the second is the addition of long only constraints with $\|w\|_1 = \sum_i w_i = 1$. These modifications/extensions are formally introduced in Definitions 412 and 404 respectively. At this point, some readers may be tempted to run the Markowitz optimization on only the alphas with positive returns over some historical window to get non-negative weight allocations, while setting the remainder of the portfolio to zero weighting. However, this is inaccurate. In fact, it can be shown that when the correlation of alphas is sufficiently high, the resulting recommended weight for a component alpha can be negative even if it were individually profitable.
The mean-variance objective function (Equation 3390) adjusted to account for long-only constraints for the $N$ asset problem can be written
\begin{align}
&\max_w \; w'r - \gamma w'\Sigma w \tag{3392}\\
&\text{s.t. } \sum_{i=1}^{N}w_i = 1, \tag{3393}\\
&\quad\;\; \forall i \in [N],\; w_i \geq 0, \tag{3394}
\end{align}

for asset weighting $w$, expected returns $r$ and asset covariance matrix $\Sigma$. We may use a quadratic optimizer which numerically solves for the vector $w$ that minimizes the generic loss function
\begin{align}
&\min_w \; L(w) = -w'r + \gamma w'\Sigma w, \tag{3395}\\
&\text{s.t. } A'w = b \;\text{ and }\; G'w \preceq h, \tag{3396}
\end{align}
where $A \in \mathbb{R}^{N,m_1}$, $G \in \mathbb{R}^{N,m_2}$ together with $b \in \mathbb{R}^{m_1}$, $h \in \mathbb{R}^{m_2}$ specify $m_1, m_2$ constraints respectively for the $N$ assets. $\Sigma$ is symmetric positive definite. $\preceq$ represents componentwise inequality. Define $\mathbf{1}$ to be the vector of ones, and $\mathbb{1}$ to be the identity matrix. Mapping the long-constrained objective function (Equation 3392) to the loss function (Equation 3395), we obtain the representations $A = \mathbf{1} \in \mathbb{R}^N$, $b = 1$, as well as $G = -\mathbb{1}_N \in \mathbb{R}^{N,N}$, $h = \mathbf{0} \in \mathbb{R}^N$.

Additionally, we would like to add a cost function that penalizes the portfolio turnover. Assuming a fixed bid-ask spread, the turnover penalty is equivalent to the Manhattan distance between adjacent position vectors. That is, for turnover cost $\xi$, the loss function is adjusted to
\[
\min_{w_t} \; L(w_t) = -w_t'r_t + \gamma w_t'\Sigma w_t + \xi|w_t - w_{t-1}|. \tag{3397}
\]

However, this form is non-differentiable due to the non-linear term $|w_t - w_{t-1}|$, and we would like to linearize the cost function as input to our quadratic solver. To make the necessary adjustments, define
\[
|w_t - w_{t-1}| = |\Delta_t| = \Delta_t^+ + \Delta_t^-, \tag{3398}
\]
where $\Delta_t = \Delta_t^+ - \Delta_t^-$ are the positive and negative components of our vector difference. Now see that we can write
\begin{align}
\min_{w_t} \; L(w_t) &= -w_t'r_t + \gamma w_t'\Sigma w_t + \xi|w_t - w_{t-1}| \tag{3399}\\
&= -w_t'r_t + \gamma w_t'\Sigma w_t + \xi(\Delta_t^+ + \Delta_t^-) \tag{3400}\\
&= -(w_t'r_t - \Delta_t^{+\prime}\xi - \Delta_t^{-\prime}\xi) + \gamma w_t'\Sigma_tw_t \tag{3401}\\
&= -\begin{bmatrix} w_t' & -\Delta_t^{+\prime} & -\Delta_t^{-\prime} \end{bmatrix}\begin{bmatrix} r_t \\ \xi \\ \xi \end{bmatrix} + \gamma\begin{bmatrix} w_t' & -\Delta_t^{+\prime} & -\Delta_t^{-\prime} \end{bmatrix}\begin{bmatrix} \Sigma_t & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} w_t \\ -\Delta_t^+ \\ -\Delta_t^- \end{bmatrix}. \tag{3402}
\end{align}

We must also adjust for the relationship between $w_{t-1}$ and $w_t$. In particular, the new constraints imposed by the formulation and linearization of costs are
\begin{align}
w_t &= w_{t-1} + \Delta_t = w_{t-1} + \Delta_t^+ - \Delta_t^-, \tag{3403}\\
\sum_i w_{i,t} &= 1 \;\text{ and }\; \forall i \in [N],\; w_{i,t} \geq 0, \tag{3404}\\
\Delta_t^+, \Delta_t^- &\succeq 0. \tag{3405}
\end{align}
These result in the constraint matrices
\[
\underbrace{\begin{bmatrix} \mathbf{1}' & 0 & 0 \\ \mathbb{1} & \mathbb{1} & -\mathbb{1} \end{bmatrix}}_{A'}\begin{bmatrix} w_t \\ -\Delta_t^+ \\ -\Delta_t^- \end{bmatrix} = \underbrace{\begin{bmatrix} 1 \\ w_{t-1} \end{bmatrix}}_{b}. \tag{3406}
\]
See that this corresponds to the relations
\begin{align}
\sum_i w_{i,t} &= 1, \tag{3407}\\
w_t - \underbrace{(\Delta_t^+ - \Delta_t^-)}_{\Delta_t} &= w_{t-1}. \tag{3408}
\end{align}

Additionally, the long-only constraint and the decomposition of $\Delta_t$ are given by the condition
\[
\underbrace{\begin{bmatrix} -\mathbb{1} & 0 & 0 \\ 0 & \mathbb{1} & 0 \\ 0 & 0 & \mathbb{1} \end{bmatrix}}_{G'}\begin{bmatrix} w_t \\ -\Delta_t^+ \\ -\Delta_t^- \end{bmatrix} \preceq \underbrace{\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}}_{h}. \tag{3409}
\]
We can pass this to the quadratic solver, which obtains $[w_t, -\Delta_t^+, -\Delta_t^-]'$, and take the first $N$ components as the next time-period allocations.
Pseudo code is presented here:
import numpy as np
import pandas as pd
import numpy_ext as npe

from cvxopt import matrix
from cvxopt.solvers import qp

# assumed given: n (number of assets), roll (rolling window), retdf (T x n return DataFrame)
n = 10
roll = 60
_exec = 0.0005                 # linear turnover cost, the xi in Equation 3397
lamb = 0.5                     # risk-aversion coefficient gamma

last = np.array([1 / n] * n)   # previous-period weights; default to parity


def get_rolling_portfolio(*retdf):
    """One optimization step: solves Equation 3399 subject to Equations 3406 and 3409."""
    global last
    retdf = pd.DataFrame(retdf).transpose()
    mr = retdf.mean().values
    # linear term: expected returns on w_t, cost xi on the -Delta+ / -Delta- blocks
    r = matrix(np.block([mr, _exec * np.ones(2 * n)]))
    # quadratic term: asset covariance in the top-left block only (Equation 3402)
    Sigma = matrix(np.block([
        [retdf.cov().values, np.zeros((n, n)), np.zeros((n, n))],
        [np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))],
        [np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))],
    ]))
    # equality constraints: full investment and w_t - (Delta+ - Delta-) = w_{t-1} (Equation 3406)
    A = matrix(np.block([
        [np.ones(n), np.zeros(n), np.zeros(n)],
        [np.eye(n), np.eye(n), -np.eye(n)],
    ]))
    b = matrix(np.block([1.0, last]))
    # inequality constraints: long-only and Delta+, Delta- >= 0 (Equation 3409)
    G = matrix(np.block([
        [-np.eye(n), np.zeros((n, n)), np.zeros((n, n))],
        [np.zeros((n, n)), np.eye(n), np.zeros((n, n))],
        [np.zeros((n, n)), np.zeros((n, n)), np.eye(n)],
    ]))
    h = matrix(np.zeros(3 * n))
    sol = qp(lamb * Sigma, -r, G, h, A, b)["x"]
    last = np.array(sol).flatten()[:n]   # keep only the w_t block
    return last


weights = npe.rolling_apply(get_rolling_portfolio, roll, *retdf.transpose().values)

To be pedantic about the computation of cost handling (including return drift), note that for the terms stated in the loss function (Equation 3397):
\[
\min_{w_t} \; L(w_t) = -w_t'r_t + \gamma w_t'\Sigma w_t + \xi|w_t - w_{t-1}|, \tag{3410}
\]
we shall replace $w_{t-1}$ with $\tilde{w}_{t-1}$, where for price vector $P_t$, return vector $r_t = (\frac{P_{i,t}}{P_{i,t-1}} - 1)_{i\in[n]}$ and portfolio $w_{t-1}$ traded one time step prior, the portfolio input to the optimization function is
\[
\tilde{w}_{t-1} = \frac{(1 + \mathrm{sign}(w_{t-1}) \otimes r_t) \otimes w_{t-1}}{(1 + \mathrm{sign}(w_{t-1}) \otimes r_t) \cdot w_{t-1}}, \tag{3411}
\]
where $\otimes, \cdot$ denote the Hadamard and dot products respectively. This accounts for the change in the portfolio due to realized returns between time steps. The adjustments to the code are given in Section 12.5.4.

12.5.4 Numerical Solving of the Extension to N Multi-Asset Alpha Mean-Variance Portfolios with Long Constraints and Linear Costs

Here we extend the N-asset problem to the N-alpha problem, where the component allocations in the weight vector $w$ correspond to investments in active strategies on marketable securities rather than in the marketable securities themselves. In particular, the trader has $N$ alphas in his portfolio given by the vector $\alpha = (\alpha_i)_{i\in[N]}$, where
\[
\mathbb{E}\alpha = r, \qquad \mathrm{Var}(\alpha) = \Sigma, \tag{3412}
\]
and $\forall i \in [N]$, each alpha $\alpha_i$ recommends some position in an $M$-sized universe of market instruments, which we denote $\tilde{w}_i = (w_1, w_2, \cdots, w_M)_i \in \mathbb{R}^M$. Then given an alpha allocation of $w = (w_1, w_2, \cdots, w_N) \in \mathbb{R}^N_+$, denote the instrument allocation to be
\[
\sum_{i=1}^{N}w_i\tilde{w}_i = \hat{w} \in \mathbb{R}^M. \tag{3413}
\]
Obviously, we would incur penalties on the turnover of $\hat{w}$ due to the trading of securities, rather than on the turnover of $w$. Due to netting effects, it is not necessary that $\|\hat{w}\|_1 = 1$, or that we even attempt to minimize the turnover of alphas, so long as we constrain the turnover of the underlying positions held. The portfolio risk is the covariance matrix of the alphas rather than of the securities. Mathematically, our objective function is written
\[
\max_{w_t} \; w_t'r_t - \gamma w_t'\Sigma w_t - \xi|\hat{w}_t - \hat{w}_{t-1}|, \tag{3414}
\]

and maps to the loss function
\begin{align}
\min_{w_t} \; L(w_t) &= -w_t'r_t + \gamma w_t'\Sigma w_t + \xi(\hat{\Delta}_t^+ + \hat{\Delta}_t^-), \qquad \hat{w}_t - \hat{w}_{t-1} = \hat{\Delta}_t \text{ (see Section 12.5.3)}, \tag{3415}\\
&= -(w_t'r_t - \hat{\Delta}_t^{+\prime}\xi - \hat{\Delta}_t^{-\prime}\xi) + \gamma w_t'\Sigma w_t \tag{3416}\\
&= -\begin{bmatrix} w_t' & -\hat{\Delta}_t^{+\prime} & -\hat{\Delta}_t^{-\prime} \end{bmatrix}\begin{bmatrix} r_t \\ \xi \\ \xi \end{bmatrix} + \gamma\begin{bmatrix} w_t' & -\hat{\Delta}_t^{+\prime} & -\hat{\Delta}_t^{-\prime} \end{bmatrix}\begin{bmatrix} \Sigma_t & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\begin{bmatrix} w_t \\ -\hat{\Delta}_t^+ \\ -\hat{\Delta}_t^- \end{bmatrix}. \tag{3417}
\end{align}

Recalling Equation 3413, column stack the $N$ alpha positions in the $M$ assets to express
\begin{align}
\hat{w}_t &= \sum_{i=1}^{N}\underbrace{\tilde{w}_{i,t}}_{\text{underlying}}w_{i,t} \tag{3418}\\
&= \begin{bmatrix} \tilde{w}_1 & \tilde{w}_2 & \cdots & \tilde{w}_N \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix}
= \underbrace{\begin{bmatrix} \tilde{w}_{11} & \tilde{w}_{21} & \cdots & \tilde{w}_{N1} \\ \tilde{w}_{12} & \cdots & \cdots & \tilde{w}_{N2} \\ \vdots & & & \vdots \\ \tilde{w}_{1M} & \tilde{w}_{2M} & \cdots & \tilde{w}_{NM} \end{bmatrix}}_{\tilde{w}_t' \in \mathbb{R}^{M,N}}\underbrace{\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix}}_{w_t \in \mathbb{R}^N} \tag{3419}\\
&= \underbrace{\tilde{w}_t'}_{\text{constant}}\underbrace{w_t}_{\text{solution}}. \tag{3420}
\end{align}
Here $\tilde{w}_t'$ contains the positions recommended by the alphas; these are not part of the signal-weighting procedure, but rather of the signal-generation one. Now we can specify our constraints. In particular, we want to have
\begin{align}
\hat{w}_t &= \hat{w}_{t-1} + \hat{\Delta}_t = \hat{w}_{t-1} + \hat{\Delta}_t^+ - \hat{\Delta}_t^-, \tag{3421}\\
\sum_i w_{i,t} &= 1 \;\text{ and }\; \forall i \in [N],\; w_{i,t} \geq 0, \tag{3422}\\
\hat{\Delta}_t^+, \hat{\Delta}_t^- &\succeq 0. \tag{3423}
\end{align}
Since we may express
\begin{align}
\hat{w}_t - (\hat{\Delta}_t^+ - \hat{\Delta}_t^-) &= \hat{w}_{t-1}, \tag{3424}\\
\tilde{w}_t'w_t - (\hat{\Delta}_t^+ - \hat{\Delta}_t^-) &= \tilde{w}_{t-1}'w_{t-1}, \tag{3425}
\end{align}
we may write our constraint matrices as
\[
\underbrace{\begin{bmatrix} \mathbf{1}' & 0 & 0 \\ \tilde{w}_t' & \mathbb{1} & -\mathbb{1} \end{bmatrix}}_{A'}\begin{bmatrix} w_t \\ -\hat{\Delta}_t^+ \\ -\hat{\Delta}_t^- \end{bmatrix} = \underbrace{\begin{bmatrix} 1 \\ \tilde{w}_{t-1}'w_{t-1} \end{bmatrix}}_{b} \tag{3426}
\]
and
\[
\underbrace{\begin{bmatrix} -\mathbb{1} & 0 & 0 \\ 0 & \mathbb{1} & 0 \\ 0 & 0 & \mathbb{1} \end{bmatrix}}_{G'}\begin{bmatrix} w_t \\ -\hat{\Delta}_t^+ \\ -\hat{\Delta}_t^- \end{bmatrix} \preceq \underbrace{\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}}_{h}. \tag{3427}
\]

The pseudo-code for the above logic is presented:


# reuses np, pd, matrix and qp imported in Section 12.5.3; est_returns, est_covariance,
# execrates, optargs, instruments and weightss are assumed helpers/containers defined elsewhere
last = np.zeros(len(instruments))   # previous-period underlying positions


def opt(W, t, sretdf, iretdf):
    """
    W       T x M x N array of alpha-recommended positions (time x underlying x alphas)
    t       optimization time index
    sretdf  strategy (alpha) return series
    iretdf  instrument return series
    """
    global last
    n = len(list(sretdf))
    m = len(instruments)
    mr = est_returns(retdf=sretdf, settings=optargs["ret"])          # expected return estimates
    xi = np.array(list(execrates) + list(execrates))                 # per-instrument linear costs
    r = matrix(np.block([mr, xi]))
    cov_est = est_covariance(retdf=sretdf, settings=optargs["cov"])  # alpha covariance estimate
    Sigma = matrix(np.block([
        [cov_est, np.zeros((n, m)), np.zeros((n, m))],
        [np.zeros((m, n)), np.zeros((m, m)), np.zeros((m, m))],
        [np.zeros((m, n)), np.zeros((m, m)), np.zeros((m, m))],
    ]))
    lamb = optargs["parameters"]["risk_aversion_coeff"]
    # equality constraints per Equation 3426: budget on alphas, netted turnover on underlyings
    A = matrix(np.block([
        [np.ones(n), np.zeros(m), np.zeros(m)],
        [W[t][:, list(sretdf)], np.eye(m), -np.eye(m)],
    ]))
    # drift-adjust the previous underlying holdings per Equation 3411 (denominator is the dot product)
    drifted = (1 + np.sign(last) * iretdf.iloc[t].values) * last
    holding = drifted / drifted.sum() if not sum(last) == 0 else np.zeros(len(last))
    b = matrix(np.block([1.0, holding]))
    # inequality constraints per Equation 3427: long-only alphas, nonnegative Delta+/Delta-
    G = matrix(np.block([
        [-np.eye(n), np.zeros((n, m)), np.zeros((n, m))],
        [np.zeros((m, n)), np.eye(m), np.zeros((m, m))],
        [np.zeros((m, n)), np.zeros((m, m)), np.eye(m)],
    ]))
    h = matrix(np.zeros(n + 2 * m))
    sol = qp(lamb * Sigma, -r, G, h, A, b)["x"]

    suballoc = np.array(sol).flatten()[:n]

    salloc = np.zeros(len(list(weightss)))               # alpha allocations over the full alpha universe
    np.put(salloc, list(sretdf), suballoc)
    underlying = W[t] @ salloc.reshape(len(salloc), 1)   # implied underlying positions
    last = underlying.reshape(1, len(instruments))[0]
    return salloc

12.6 Robust Estimation Methods and Algorithms


The major reason for the reluctance of portfolio managers in employing quantitative risk-return opti-
mization techniques even at A-tier funds is the observation that the classical approaches often result
in unstable results due to the sensitivity of the optimization process to its input variables. The corresponding recommended portfolio is often extreme in the signal weights, and does not account for the estimation error implicit in the input variables; so much so that the equal-weighted portfolio often outperforms the mean-variance portfolios. The branch of robust estimation and robust optimization methods from
operations research target the practical applicability of these portfolio techniques. A robust estimator
is a statistical estimation technique that is less sensitive to outliers in the data. The use of Bayesian
methods, robust statistics and uncertainty robust estimation methods are employed in practice to solve
the ‘robust’ portfolio optimization problems. Classical statistics describe certain exhibits of the popula-
tion, such as the mean (center), variance (spread), skewness (symmetry) and kurtosis (tails). Non-robust

classical statistics are often strongly affected by one or two observations; robust statistics aim to come
up with new descriptive statistics to describe the population exhibits in a way that is less affected by
mistakes in the sampling or distributional assumptions. Robust estimation and optimization models in
turn have outputs that are also less sensitive with respect to violations in distributional assumptions.
For an $N$-sized sample $D = (x_1, \cdots, x_N)$ taken from a population with c.d.f. $F(x|\theta)$, a statistical estimator from the sample $D$ can be written $\hat{\vartheta} = \vartheta_N(F_N)$, where $F_N$ is the empirical c.d.f.
\[
F_N(x) = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\{x_i \leq x\}. \tag{3428}
\]
$F_N$ and $\hat{\vartheta}$ are consistent estimators, so in general as $N \to \infty$ we get $F_N(x) \to F(x)$ and $\hat{\vartheta} \to \vartheta_\theta$. Under $F$, the statistical estimator has some probability distribution $\mathcal{L}_F(\vartheta_N)$. An estimator is said to be robust if, for given $F$, small deviations in $F$ result in at most small deviations in the distribution $\mathcal{L}_F(\vartheta_N)$.

Definition 414 (Resistant Estimator). An estimator is said to be resistant if it is insensitive to changes in a single observation. Given an estimator $\hat{\vartheta} = \vartheta_N(F_N)$, the sensitivity to the addition of a point $x$ can be measured by the influence curve function
\[
IC_{\hat{\vartheta},F}(x) = \lim_{s\to0}\frac{\vartheta((1-s)F + s\delta_x) - \vartheta(F)}{s}. \tag{3429}
\]
Here $\delta_x$ is a point mass with probability one at $x$.

Definition 415 (Breakdown Bound). The breakdown bound is the largest possible fraction of observations for which there is a bound on the change of the estimate when that fraction of the sample is altered.

Definition 416 (Rejection Point). The rejection point is defined as the point beyond which the influence curve (Equation 3429) becomes zero. Estimators with a finite rejection point are said to be redescending. Beyond the rejection point, the estimator does not change in value. However, since some of the samples are 'thrown away', there is an efficiency cost.

Definition 417 (Gross Error Sensitivity). The gross error sensitivity is the maximum absolute value of
the influence curve, and is the asymptotic impact of outliers.

Definition 418 (M-Estimators). M-estimators are estimators obtained by function minimization over the sample. For an $N$-sized sample $X = (x_i)_{i\in[N]}$ and estimator $T$ over $X$, the M-estimator has general form
\[
T = \arg\min_t\left\{\sum_{i=1}^{N}f(x_i, t)\right\}, \tag{3430}
\]
for some arbitrary function $f$. If $f$ is smooth and differentiable then the M-estimator solves the equation
\[
\sum_{i=1}^{N}\frac{\partial f(x_i, t)}{\partial t} = 0. \tag{3431}
\]
It can be shown that the influence curve (Equation 3429) of M-estimators takes the form $IC = c\frac{\partial f}{\partial t}$, $c \in \mathbb{R}$ [7].

A commonplace M-estimator is the maximum likelihood estimator (MLE), where f = − log P.

Definition 419 (L-Estimators). Consider an $N$-sized sample $X = (x_i)_{i\in[N]}$ with order statistics $x_{(i)}$ for $i \in [N]$. Then the L-estimators are estimators obtained from a linear combination of order statistics
\[
L = \sum_{i=1}^{N}a_ix_{(i)}, \tag{3432}
\]
where $a_i \in \mathbb{R}$ are constants. The constants are often normalized such that $\sum_{i=1}^{N}a_i = 1$.

The trimmed mean estimator is an example of an L-estimator (Definition 419).

Definition 420 (R-Estimators). R-estimators are estimators obtained by minimizing the sum of residuals weighted by nondecreasing functions of the rank of each residual,
\[
\arg\min\left\{\sum_{i=1}^{N}f(R_i)r_i\right\}, \tag{3433}
\]
where $R_i$ is the rank of residual $r_i$ and $f$ is some nondecreasing function satisfying the centering condition $\sum_{i=1}^{N}f(R_i) = 0$.

Definition 421 (Least Trimmed Squares (LTS)). The LTS estimator is the estimator that minimizes the objective function $\sum_{i=1}^{h}r_{(i)}^2$, where $r_{(i)}$ is the $i$-th order statistic of the residuals and $h \leq N$ is the number of points included. Essentially, we are computing the least squares objective function on a subset of the residuals (trimming away the $N - h$ largest residuals).

Definition 422 (Re-weighted Least Squares). The reweighted least squares estimator minimizes the objective function $\sum_{i=1}^{N}\omega_ir_i^2$, where the $r_i$ are residuals obtained from a robust residual process such as the LTS method (Definition 421).

12.6.1 Robust Estimators for Center


Definition 423 (Trimmed Mean). The trimmed mean is the mean after trimming the edges of the sample data. In particular, for lower and upper order thresholds $L, U$ respectively, the trimmed mean is computed as
\[
\frac{1}{U - L + 1}\sum_{j=L}^{U}x_{(j)}. \tag{3434}
\]

Definition 424 (Winsorized Mean). The winsorized mean $\bar{X}_w$ is the mean of the winsorized data, which is the data obtained by setting data above and beyond some specified thresholds to a fixed value (usually the closest non-winsorized value).

Definition 425 (Median). The median value of the sample $X = (x_1, \cdots, x_N)$ is the value
\[
\mathrm{med}(X) = \begin{cases}
x_{(\frac{n+1}{2})}, & n \text{ is odd},\\
\frac{x_{(\frac{n}{2})} + x_{(\frac{n}{2}+1)}}{2}, & n \text{ is even}.
\end{cases} \tag{3435}
\]

12.6.2 Robust Estimators for Spread


Recall the definition of the mean absolute deviation (Definition 394).

Definition 426 (Median Absolute Deviation). The median absolute deviation is the median (Definition 425) of the absolute deviation of the variable from its median, expressed
\[
\mathrm{med}(|X - \mathrm{med}(X)|). \tag{3436}
\]

Another popular measure is the interquartile range (IQR), the difference between the first and third quartiles. A precise statement is Equation 2763. The winsorized standard deviation is the standard deviation computed from winsorized data (see Definition 424).
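A short illustrative sketch (not from the text) computing the robust location and spread estimators above on a contaminated sample, using scipy helpers where they exist:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(0, 1, size=1000)
x[:10] = 50.0                                   # inject gross outliers

print("mean          ", x.mean())
print("trimmed mean  ", stats.trim_mean(x, proportiontocut=0.05))
print("winsorized    ", stats.mstats.winsorize(x, limits=0.05).mean())
print("median        ", np.median(x))
print("MAD           ", np.median(np.abs(x - np.median(x))))
print("IQR           ", np.subtract(*np.percentile(x, [75, 25])))
print("winsorized std", stats.mstats.winsorize(x, limits=0.05).std(ddof=1))
```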

12.6.3 Robust Estimation of Returns


A robust estimator can be obtained by computing a robust center estimator of the historical data.
As such, the techniques employed in Section 12.6.1 are applicable to the robust estimation of returns.
Return forecasts may be obtained either via historical calibration of market observables, or with the
use of exogenous variables such as fundamental artefacts of the company financial statements. This
section explores the different techniques employed to estimate (expected) return vectors of multiple
assets. These in turn can be used in forecast/signal models, or in adjoint problems such as that of
portfolio optimization and management problems. Some key criterion for evaluating these methods
involve forecast sensitivity, out-of-sample performance, computational tractability and interpretability.
Expected returns exhibit significant time variance and are non-stationary. Therefore the estimation
of forward returns using historical calibration present serious challenges. In particular, their inputs
to classical Modern Portfolio Theory optimization methods (Section 12.2.1) result in concentrated and
PT
unintuitive portfolios. Given historical security returns, we may estimate R̄i = T1 i Ri,t , the sample
mean of asset returns. This is an unbiased estimator of the asset return. When presented with more
1
PT
than one asset, asset covariance is σ̂ij = T −1 t (Ri,t − R̄i )(Rj,t − R̄j ). When presented with some N
1 0
assets, these are usually presented in matrix form, with ER = µ and Σ̂ = N −1 XX , where X is centred
such that at row i we have Xi = (Rij )j∈[T ] − R̄i . If security returns are IID then Σ is the MLE of the
population covariance matrix. In particular Σ̂ ∼ W ashartN −1 (verify this). Although the sample mean
is the best linear unbiased estimator (BLUE) of the population mean when the underlying distributions
are not fat-tailed, in the financial settings these are not true and the high variance of the estimator
results in poor forecasting power. The large estimation error combined with non-stationarity makes R̄i
a poor input into the Markowitz type optimization. Often, the equal weighted portfolio outperforms the
mean-variance optimization when the mean input is the sample calibrated historical return. To produce
more robust estimates, Bayesian estimates and shrinkage techniques have been studied. Additionally, the
mean-variance optimizer treats any input as deterministic - the estimation error (including that of a more
robust one) is not taken into consideration in the optimization process. Therefore, robust variants of the
optimization algorithm may help in producing better forecasts in addition to more robust estimators.
One of the most popular equity valuation models in classical finance is the discounted cash flow model. As in the pricing of coupon-paying bonds, the fair value of a stock is modelled as the sum of discounted future cash flows, expressed as P = Σ_{t=1}^{∞} D_t/(1 + R_t)^t. Some companies do not pay dividends; the proposed alternative is the use of free cash flow (FCF), under the assumption that retained earnings become dividends in the future. Then the discounted free cash flow fair value is P = Σ_{t=1}^{∞} FCF_t/(1 + R_t)^t. We show how to bootstrap return estimates from the observed price. If we assume a finite horizon and that the stock is sold at some terminal value P_T, then the fair value takes the form (under the constant rate assumption)
\[
P = \sum_{t=1}^{T} \frac{D_t}{(1+R)^t} + \frac{P_T}{(1+R)^T}, \qquad P = \sum_{t=1}^{T} \frac{FCF_t}{(1+R)^t} + \frac{P_T}{(1+R)^T}. \qquad (3437)
\]
The internal rate of return (IRR) is the level of expected return, used as the discount rate, that equates the present value of expected future cash flows with the observed market price of the stock. That
is, the IRR is the market-implied rate of return of the stock, and the return expectation satisfies the equation
\[
P_A = \sum_{t=1}^{T} \frac{D_t}{(1+ER)^t} + \frac{P_T}{(1+ER)^T}, \qquad P_A = \sum_{t=1}^{T} \frac{FCF_t}{(1+ER)^t} + \frac{P_T}{(1+ER)^T}, \qquad (3438)
\]
for the time t = 0 market observable P_A. We may then solve for ER. Clearly, the utility of this framework depends on the accuracy of our estimates of P_T and D_t, FCF_t. Dividend growth models and analyst price estimates must be obtained.
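A minimal numerical sketch of backing ER out of Equation 3438 is given below. The cash flow schedule, terminal price and observed price are hypothetical inputs, and bisection is used for illustration only; since the present value is monotone decreasing in the discount rate, any one-dimensional root finder would do.

```python
def present_value(rate, cashflows, terminal_price):
    """PV of cash flows D_1..D_T plus terminal price P_T at the given discount rate."""
    T = len(cashflows)
    pv = sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cashflows, start=1))
    return pv + terminal_price / (1.0 + rate) ** T

def implied_return(price, cashflows, terminal_price, lo=-0.5, hi=1.0, tol=1e-10):
    """Solve PV(ER) = observed price by bisection; PV is decreasing in the rate."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if present_value(mid, cashflows, terminal_price) > price:
            lo = mid            # PV too high -> discount rate must be higher
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# hypothetical example: 5 years of dividends, sold at 120, observed price 100
print(implied_return(100.0, [3.0, 3.2, 3.4, 3.6, 3.8], 120.0))
```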
Definition 427 (James-Stein Shrinkage [7]). Consider the problem of estimating the mean of an N-dimensional multivariate normal random variable X ∼ Φ(µ, Σ) with known covariance Σ. It has been shown that the sample mean µ̂ is not the best estimator of the population mean µ in terms of the loss function L(µ, µ̂) = (µ − µ̂)′Σ^{−1}(µ − µ̂). An improved estimator, known as the James-Stein shrinkage estimator, takes the form
\[
\hat{\mu}_{JS} = (1 − w)\hat{\mu} + w\mu_0 \mathbf{1}, \qquad (3439)
\]
where
\[
w = \min\left(1, \frac{N − 2}{T(\hat{\mu} − \mu_0\mathbf{1})'\Sigma^{-1}(\hat{\mu} − \mu_0\mathbf{1})}\right), \qquad (3440)
\]
and T is the number of observations in the sample. Surprisingly, fairly arbitrary choices of µ_0 were shown to do well. Here the shrinkage target is µ_0 1 and the shrinkage intensity is w.
Definition 428 (Jorion Shrinkage). For sample mean returns µ̂, covariance matrix Σ, asset dimensionality N and historical window T, Jorion's shrinkage estimator of the asset returns is given by
\[
\hat{\mu}_{Jorion} = (1 − w)\hat{\mu} + w\mu_g \mathbf{1}, \qquad (3441)
\]
where
\[
\mu_g = \frac{\mathbf{1}'\Sigma^{-1}\hat{\mu}}{\mathbf{1}'\Sigma^{-1}\mathbf{1}}, \qquad (3442)
\]
\[
w = \frac{N + 2}{N + 2 + T(\hat{\mu} − \mu_g\mathbf{1})'\Sigma^{-1}(\hat{\mu} − \mu_g\mathbf{1})}. \qquad (3443)
\]
See that µ_g is the expected return of the global minimum variance portfolio with weights Σ^{-1}1/(1′Σ^{-1}1) (Exercise 636).
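The sketch below is an illustrative (not authoritative) implementation of the Jorion estimator above, with the James-Stein variant of Definition 427 included for comparison; the simulated return panel and the shrinkage target µ_0 = 0 are assumptions made purely for the example.

```python
import numpy as np

def jorion_shrinkage(mu_hat, sigma, T):
    """Jorion shrinkage of the sample mean vector toward the GMV portfolio return."""
    N = len(mu_hat)
    ones = np.ones(N)
    sigma_inv = np.linalg.inv(sigma)
    mu_g = ones @ sigma_inv @ mu_hat / (ones @ sigma_inv @ ones)   # Eq. (3442)
    d = mu_hat - mu_g * ones
    w = (N + 2) / (N + 2 + T * d @ sigma_inv @ d)                  # Eq. (3443)
    return (1 - w) * mu_hat + w * mu_g * ones, w

def james_stein_shrinkage(mu_hat, sigma, T, mu0=0.0):
    """James-Stein shrinkage toward the constant target mu0 * 1."""
    N = len(mu_hat)
    ones = np.ones(N)
    sigma_inv = np.linalg.inv(sigma)
    d = mu_hat - mu0 * ones
    w = min(1.0, (N - 2) / (T * d @ sigma_inv @ d))                # Eq. (3440)
    return (1 - w) * mu_hat + w * mu0 * ones, w

rng = np.random.default_rng(1)
T, N = 120, 10
R = rng.normal(0.005, 0.04, size=(T, N))          # simulated monthly returns
mu_hat, sigma_hat = R.mean(axis=0), np.cov(R, rowvar=False)
print("Jorion intensity     :", jorion_shrinkage(mu_hat, sigma_hat, T)[1])
print("James-Stein intensity:", james_stein_shrinkage(mu_hat, sigma_hat, T)[1])
```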

12.6.4 Robust Estimation of High Dimensional Covariance and Correlation Matrices

As an alternative to using the sample covariance matrix (a non-parametric estimator) as input to an optimization model, we may hope to do better by imposing some structure on the covariance matrix. A popular approach is to use factor modelling. Another is shrinkage. See Section 12.4 on factor modelling. BARRA factor models (Exercise 644) are used by quantitative portfolio managers to manage and estimate factor risk. The number of factors is often significantly smaller than the universe size and there
is an added benefit in terms of dimensionality reduction. The cost of adding structure to the estimation
of covariance matrices (or statistical estimators in general) is the introduction of model and specification
error. The bias-variance tradeoff from adding structure to the estimator must be favorable for improved
out-of-sample performance. Common techniques such as model ensembles may be employed to obtain
more stable estimates. A possible ensemble would be a basket of sample covariance matrices, shrinkage
and factor-modelled covariance estimators.

12.6.4.1 Difficulty in Estimation of the Covariance Matrix

An issue with estimating the covariance matrix of asset returns arises from its dynamic nature. Co-movements in risk are non-static: increasing the number of data points used in the estimator includes less relevant entries, while reducing the number of data points increases estimation error. Additionally, asset risk exhibits serial correlation (autocorrelation), time-dependent variance (heteroskedasticity) and strong tail dependence. The Newey-West correction targets the autocorrelation problem.
Additionally, we are often presented with return samples of varying length and truncation. Statis-
tical techniques available for dealing with missing observations include expectation maximization (EM)
algorithms. The Stambaugh method may be used when estimating covariance matrices to deal with
truncated return samples in our estimation lookback period.
An improvement to the estimation of asset covariance may be obtained by increasing the sampling frequency: with high-frequency data, better estimation of covariance may be obtained. Additionally, an improved estimator of volatility can be constructed from daily OHLC data; such estimators are known as Garman-Klass estimators. Another possible reference point for asset volatility (if the derivatives are liquid) is the implied volatility obtained from derivatives instruments. Off-diagonals are difficult to obtain this way, however, since there are few liquid instruments referencing the correlation between single stocks. Nevertheless, since the covariance matrix may be decomposed as Σ = σCσ, where C is the asset correlation matrix and σ is the diagonal matrix of asset volatilities (the square roots of diag(Σ)), the implied volatilities are still helpful. Another common approach to the volatility estimation problem is the use of ARCH, GARCH and stochastic volatility models.
For a sample of size N, the empirical/sample covariance is computed by
\[
\hat{\sigma}_{X,Y} = \frac{1}{N-1}\sum_{i=1}^{N}(X_i − \bar{X})(Y_i − \bar{Y}), \qquad (3444)
\]
and the empirical correlation coefficient is simply ρ̂_{X,Y} = σ̂_{X,Y}/(σ̂_X σ̂_Y), where σ̂_X = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i − \bar{X})^2} and similarly for σ̂_Y. The sample correlation and covariance matrices may also be subject to more robust measures.
Since covariance satisfies
\[
\mathrm{Cov}(X, Y) = \frac{1}{4ab}\left[\mathrm{Var}(aX + bY) − \mathrm{Var}(aX − bY)\right], \qquad (3445)
\]
assuming S is a robust scale functional [7] satisfying S(aX + b) = |a|S(X), the robust covariance is defined as
\[
C(X, Y) = \frac{1}{4ab}\left[S(aX + bY)^2 − S(aX − bY)^2\right]. \qquad (3446)
\]
If we let a = 1/S(X), b = 1/S(Y), then the robust correlation coefficient is
\[
c = \frac{1}{4}\left[S(aX + bY)^2 − S(aX − bY)^2\right]. \qquad (3447)
\]
To confine c ∈ [−1, 1], we use the alternative form
\[
r = \frac{S(aX + bY)^2 − S(aX − bY)^2}{S(aX + bY)^2 + S(aX − bY)^2}. \qquad (3448)
\]
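As a sketch, one may take S to be the median absolute deviation rescaled by 1.4826 (the usual consistency factor for the normal, an assumption made here for illustration), which satisfies the scale property S(aX + b) = |a|S(X); the robust correlation of Equation 3448 is then computed as follows.

```python
import numpy as np

def mad_scale(x):
    """Robust scale functional S: rescaled median absolute deviation."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def robust_corr(x, y):
    """Robust correlation, Equation (3448), with a = 1/S(X), b = 1/S(Y)."""
    a, b = 1.0 / mad_scale(x), 1.0 / mad_scale(y)
    s_plus = mad_scale(a * x + b * y) ** 2
    s_minus = mad_scale(a * x - b * y) ** 2
    return (s_plus - s_minus) / (s_plus + s_minus)

rng = np.random.default_rng(2)
z = rng.standard_normal(1000)
x = z + 0.5 * rng.standard_normal(1000)
y = z + 0.5 * rng.standard_normal(1000)
x[:5] += 25.0                                   # contaminate with outliers
print("sample corr:", np.corrcoef(x, y)[0, 1], " robust corr:", robust_corr(x, y))
```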

12.6.4.2 Random Matrix Theory

Random matrix theory allows us to better reason about the instability of large dimensional sample covariance matrices. In other words, sample covariance matrices are observed to fluctuate in a nearly random way, similar to a random matrix - a covariance matrix computed on a set of independent random walks. When both the number of data samples T and the asset dimensionality N go to ∞, with some fixed ratio Q = T/N ≥ 1, it is shown (verify this) that the eigenvalues of the random matrix have density function
\[
f(\lambda) = \frac{Q}{2\pi\sigma^2}\,\frac{\sqrt{(\lambda_{max} − \lambda)(\lambda − \lambda_{min})}}{\lambda}, \qquad (3449)
\]
where λ_{max,min} = σ²[1 + 1/Q ± 2√(1/Q)] and σ² is the mean eigenvalue [7]. It can be shown that the distribution (for both theoretical random matrices and empirical sample matrices) results in a small number of large eigenvalues and a large number of small eigenvalues. The existence of a small number of large eigenvalues entails that the bulk of the information on the statistical distribution is carried by a few eigen-portfolios. The challenges faced when trading low-eigenvalue eigen-portfolios are discussed in Section 12.6.4.6.
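A short numerical sketch of the above: simulate T observations of N independent noise series, compute the sample correlation matrix and compare its eigenvalue spectrum to the theoretical support [λ_min, λ_max] with σ² = 1. Eigenvalues of a real asset correlation matrix that escape the upper edge are then candidates for genuine (non-noise) structure. The sample sizes below are arbitrary.

```python
import numpy as np

T, N = 2000, 400                      # Q = T/N = 5
Q, sigma2 = T / N, 1.0
lam_max = sigma2 * (1 + 1 / Q + 2 * np.sqrt(1 / Q))
lam_min = sigma2 * (1 + 1 / Q - 2 * np.sqrt(1 / Q))

rng = np.random.default_rng(3)
X = rng.standard_normal((T, N))       # pure-noise "returns"
C = np.corrcoef(X, rowvar=False)      # sample correlation matrix of the noise
eigvals = np.linalg.eigvalsh(C)

print("theoretical support: [%.3f, %.3f]" % (lam_min, lam_max))
print("empirical  support : [%.3f, %.3f]" % (eigvals.min(), eigvals.max()))
print("fraction of eigenvalues inside the theoretical band:",
      np.mean((eigvals >= lam_min) & (eigvals <= lam_max)))
```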

12.6.4.3 Change in Centre of Mass

It is known that the second moment of returns exhibits a higher level of predictability relative to the first moment. Volatility is said to 'cluster': a high magnitude of returns (both positive and negative) is related to high magnitudes of returns on the following day. In other words, variance is autocorrelated and the time series has memory. More recent observations are more relevant to the prediction than older ones. The estimators for risk may be improved by weighting the data, giving more importance to more recent observations. For some d < 1, if we give weights 1, d, d², d³, · · · to successive data points further away (in time), then we have
\[
\sigma_{ij} = \frac{\sum_{t=1}^{T} d^{T-t}(R_{i,t} − \bar{R}_i)(R_{j,t} − \bar{R}_j)}{\sum_{t=1}^{T} d^{T-t}} = \frac{1-d}{1-d^T}\sum_{t=1}^{T} d^{T-t}(R_{i,t} − \bar{R}_i)(R_{j,t} − \bar{R}_j). \qquad (3450)
\]
As T → ∞, we have (1 − d)/(1 − d^T) → 1 − d. The decay parameter d may be estimated by MLE or by walk-forward cross-validation.
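A minimal sketch of the exponentially weighted estimator in Equation 3450; the decay d = 0.94 used below is an assumed (RiskMetrics-style) hyperparameter, to be calibrated by MLE or walk-forward validation as noted above.

```python
import numpy as np

def ewma_cov(R, d=0.94):
    """Exponentially weighted covariance, Equation (3450). R has shape (T, N)."""
    T, N = R.shape
    X = R - R.mean(axis=0)                      # demean each asset
    w = d ** np.arange(T - 1, -1, -1)           # weights d^(T-t): most recent = 1
    w /= w.sum()                                # divide by sum of d^(T-t)
    return (X * w[:, None]).T @ X               # sum_t w_t * x_t x_t'

rng = np.random.default_rng(4)
R = rng.normal(0.0, 0.01, size=(500, 5))
print(np.round(ewma_cov(R), 6))
```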

12.6.4.4 Ridge Estimator

12.6.4.5 Ledoit Wolf Constant Correlation

12.6.4.6 EPO Shrinkage

12.6.5 Bayesian Methods

Chapter 13

Stochastic Calculus

13.1 Brownian Motion


13.1.1 Random Walk
Definition 429 (Symmetric Random Walk). A random walk can be simulated by means of the fair coin
toss model, and let the outcomes of these tosses be denoted ω = ωi , i ∈ [1, ∞). Let Xj = 1 if ωj is head
and otherwise Xj = −1, then given starting point of M0 = 0, a symmetric random walk can be written
\[
M_k = \sum_{j=1}^{k} X_j, \qquad k = 1, 2, \cdots \qquad (3451)
\]

Since the coin toss model assumes independent coin tosses, the increments of the random walk can be defined for any increasing time periods. Define the sequence of integers 0 = k_0 < k_1 < · · · < k_m; then the increments M_{k_{i+1}} − M_{k_i} are random and equal to \sum_{j=k_i+1}^{k_{i+1}} X_j. Non-overlapping increments are independent. By linearity of expectation, this random variable has expectation zero. By independence, the variance is Var(M_{k_{i+1}} − M_{k_i}) = k_{i+1} − k_i. This can be seen from Var X_j = E X_j² = ½(1)² + ½(−1)² = 1.

Corollary 26 (Variance Rate of The Symmetric Random Walk). From the preceding arguments, we see
that the variance of the symmetric random walk accumulates at rate of one per unit time.

Corollary 27 (Symmetric Random Walks are Martingales). To see that the symmetric random walk is a martingale (see Definition 324), note that for integers 0 ≤ k < l we have

E[Ml |Fk ] = E[(Ml − Mk ) + Mk |Fk ] = E[Ml − Mk |Fk ] + E[Mk |Fk ] (3452)


= E[Ml − Mk |Fk ] + Mk = E[Ml − Mk ] + Mk = Mk . (3453)

The arguments follow from Mk being an adapted stochastic process (see Definition 308), independence
of future (non-overlapping) increments, and linearity of expectations.

Definition 430 (Quadratic Variation of the Symmetric Random Walk). Define the quadratic variation of the symmetric random walk up to time k to be the quantity:
\[
[M, M]_k = \sum_{j=1}^{k}(M_j − M_{j-1})^2 = k, \qquad (3454)
\]

which is computed for a single path of random variables M .

Although its value is equal to the variance term Var(M_k − M_0), the variance is computed as an average over all paths and not along a single path as in [M, M]_k. If the random walk had biased coin flips such that EX_j is non-zero, then the variance Var(M_k) would be affected but not [M, M]_k. The quadratic variation of a stochastic process generally depends on the realized path, although not in the current case.
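A short simulation illustrating the two statements: across many simulated paths the variance of M_k grows like k, while the quadratic variation equals k on every single path, since each squared increment is exactly one.

```python
import numpy as np

rng = np.random.default_rng(5)
paths, k = 10_000, 250
X = rng.choice([-1, 1], size=(paths, k))     # fair coin tosses X_j
M = X.cumsum(axis=1)                         # symmetric random walks M_k

print("Var(M_k) across paths       :", M[:, -1].var())   # approximately k
print("quadratic variation per path:", (X ** 2).sum(axis=1)[:3])  # exactly k
```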

Definition 431 (Scaled Symmetric Random Walk). We can scale the symmetric random walk discussed
to some smaller time steps, in particular for some n ∈ Z+ the scaled symmetric random walk is defined
to be the random variable
\[
W^{(n)}(t) = \frac{1}{\sqrt{n}} M_{nt}, \quad \text{for integer } nt. \qquad (3455)
\]
If nt is non-integer value, then we take its linear interpolation between the two nearest integer points.
Defined as such, the scaled symmetric random walk also has independent increments. In particular, for
the set of values 0 = t0 < t1 < · · · < tm for each nti integer value, we obtain (W (n) (ti ) − W (n) (tj )) ⊥
(W (n) (tk ) − W (n) (tl )) for ti > tj , tk > tl and non-overlapping [tj , ti ], [tl , tk ]. Then for any 0 ≤ s ≤ t
where ns, nt ∈ Z+ , we have

E(W (n) (t) − W (n) (s)) = 0, Var(W (n) (t) − W (n) (s)) = t − s,

which we can easily see by expansion of the terms.

The scaled symmetric random walk is also martingale, we can write E[W (n) (t)|F(s)] = E[W (n) (t) −
(n)
W (s) + W (n) (s)|F(s)] = W (n) (s). Its quadratic variation has similar derivations, we write
\[
[W^{(m)}, W^{(m)}](k) = \sum_{j=1}^{mk}\left(W^{(m)}\Big(\frac{j}{m}\Big) − W^{(m)}\Big(\frac{j-1}{m}\Big)\right)^2 = \sum_{j=1}^{mk}\left(\frac{1}{\sqrt{m}}X_j\right)^2 = k. \qquad (3456)
\]

Going from time t = 0 to time t = k, by taking sum of squares of increments we obtain k for each path,
regardless of the path. The variance term is also k, but in theory the computation is performed as an
average over price paths.

13.1.2 Limiting Distributions and Derivation of Log-Normality


Theorem 396 (Limiting Distribution of the Scaled Random Walk). We apply the Central Limit Theorem (see Theorem 362) to derive the limiting distribution of scaled random walks (see Definition 431). For some fixed t ≥ 0, we show that the distribution of W^{(n)}(t) converges as n → ∞ to Φ(µ = 0, σ² = t), the normal distribution (see Section 6.17.6). Recall the moment generating function for normals (see Theorem 377); we wrote that this equals exp{½u²t}. Then for nt integer, the mgf of W^{(n)}(t) can be written
\[
\psi_n(u) = E\exp\{uW^{(n)}(t)\} = E\exp\Big\{\frac{u}{\sqrt{n}}M_{nt}\Big\} = E\exp\Big\{\frac{u}{\sqrt{n}}\sum_{j=1}^{nt}X_j\Big\} = E\prod_{j=1}^{nt}\exp\Big\{\frac{u}{\sqrt{n}}X_j\Big\}. \qquad (3457)
\]
Since the X_j's are independent, the expectation of the product factors into the product of expectations, and we obtain \prod_{j=1}^{nt} E\exp\{\frac{u}{\sqrt{n}}X_j\} =
\[
\prod_{j=1}^{nt}\left(\frac12 \exp\Big\{\frac{u}{\sqrt n}\Big\} + \frac12 \exp\Big\{-\frac{u}{\sqrt n}\Big\}\right) = \left(\frac12 \exp\Big\{\frac{u}{\sqrt n}\Big\} + \frac12 \exp\Big\{-\frac{u}{\sqrt n}\Big\}\right)^{nt}. \qquad (3458)
\]
We want to show that this equals ψ(u) = exp{½u²t} in its limit. Consider
\[
\log\psi_n(u) = nt\log\left(\frac12 \exp\Big\{\frac{u}{\sqrt n}\Big\} + \frac12 \exp\Big\{-\frac{u}{\sqrt n}\Big\}\right) \qquad (3459)
\]
and make the change of variable x = 1/\sqrt{n} to obtain
\[
\lim_{n\to\infty}\log\psi_n(u) = \lim_{x\to0^+}\frac{t}{x^2}\log\left(\frac12 \exp\{ux\} + \frac12 \exp\{-ux\}\right). \qquad (3460)
\]
Taking x = 0 gives the indeterminate form 0/0; applying L'Hopital's rule (see Theorem 3) we obtain
\[
\lim_{x\to0^+} t\cdot\frac{\dfrac{\frac u2 \exp(ux)-\frac u2 \exp(-ux)}{\frac12 \exp(ux)+\frac12 \exp(-ux)}}{2x} = \lim_{x\to0^+}\frac{t}{2}\cdot\frac{\frac u2 \exp(ux)-\frac u2 \exp(-ux)}{\big(\frac12 \exp(ux)+\frac12 \exp(-ux)\big)x} \qquad (3461)
\]
\[
= \lim_{x\to0^+}\frac t2\cdot\frac{\frac u2 \exp(ux)-\frac u2 \exp(-ux)}{x} \qquad (3462)
\]
and by repeated application of L'Hopital's rule we obtain
\[
\lim_{n\to\infty}\log\psi_n(u) = \lim_{x\to0^+}\frac t2\left(\frac{u^2}{2}\exp(ux)+\frac{u^2}{2}\exp(-ux)\right) = \frac12 u^2 t,
\]
whose exponential exp{½u²t} is the moment generating function of a Φ(µ = 0, σ² = t) random variable.

Theorem 397 (Log-Normality Distribution under the Binomial Limits). The discretized version of the geometric Brownian motion is the binomial model, and here we show that the limit of a binomial model is log-normal. For the asset price from time 0 → t, let integer n be the number of steps per unit time. Assume nt is an integer, let each up movement be the geometric factor u_n = 1 + σ/√n and each down movement be d_n = 1 − σ/√n for some volatility term σ > 0. Assuming r = 0, the risk-neutral probabilities are given (since exp(0t) = 1) by
\[
\tilde{p} = \frac{1 − d_n}{u_n − d_n} = \frac{\sigma/\sqrt{n}}{2\sigma/\sqrt{n}}, \qquad \tilde{q} = \frac{u_n − 1}{u_n − d_n} = \frac{\sigma/\sqrt{n}}{2\sigma/\sqrt{n}}, \qquad (3463)
\]
and both evaluate to ½. Define the stock price process at time t to be S(t), and let the price of the stock at S(t), t > 0 be determined by the first nt coin tosses (since each unit of time has n tosses). Let the number of heads and tails in the first nt coin tosses be denoted H_{nt} and T_{nt}, so nt = H_{nt} + T_{nt}. Then the random walk (see Definition 429) is written M_{nt} = H_{nt} − T_{nt} and we must have the relationships
\[
H_{nt} = \frac12(nt + M_{nt}), \qquad T_{nt} = \frac12(nt − M_{nt}). \qquad (3464)
\]
Then the stock price at time t must be
\[
S_n(t) = S(0)\,u_n^{H_{nt}} d_n^{T_{nt}} = S(0)\left(1 + \frac{\sigma}{\sqrt n}\right)^{\frac12(nt + M_{nt})}\left(1 − \frac{\sigma}{\sqrt n}\right)^{\frac12(nt − M_{nt})}. \qquad (3465)
\]
Considering the logarithmic form of Equation 3465 we obtain
\[
\log S_n(t) = \log S(0) + \frac12(nt + M_{nt})\log\left(1 + \frac{\sigma}{\sqrt n}\right) + \frac12(nt − M_{nt})\log\left(1 − \frac{\sigma}{\sqrt n}\right). \qquad (3466)
\]
Writing the Taylor expansion of f(x) = log(1 + x) (see Theorem 365) we obtain
\[
\log(1+x) = f(0) + f'(0)x + \frac{1}{2!}f''(0)x^2 + O(x^3) = x − \frac12 x^2 + O(x^3). \qquad (3467)
\]
Applying this Taylor expansion to Equation 3466 with x = ±σ/√n, we can write
\begin{align}
\log S_n(t) &= \log S(0) + \frac12(nt + M_{nt})\left(\frac{\sigma}{\sqrt n} − \frac{\sigma^2}{2n} + O(n^{-3/2})\right) + \frac12(nt − M_{nt})\left(\frac{-\sigma}{\sqrt n} − \frac{\sigma^2}{2n} + O(n^{-3/2})\right) \nonumber\\
&= \log S(0) + nt\left(−\frac{\sigma^2}{2n} + O(n^{-3/2})\right) + M_{nt}\left(\frac{\sigma}{\sqrt n} + O(n^{-3/2})\right) \qquad (3468)\\
&= \log S(0) − \frac12\sigma^2 t + O(n^{-1/2}) + \sigma W^{(n)}(t) + O(n^{-1})W^{(n)}(t) \qquad (3469)
\end{align}


since by definition √n W^{(n)}(t) = M_{nt}. By the central limit theorem (which is the limiting argument using moment generating functions of the scaled random walk from Theorem 396), since W^{(n)}(t) is built from nt tosses, we obtain W^{(n)}(t) → Φ(0, t), and the terms associated with O(n^{-1/2}) and O(n^{-1}) go to zero in the limit of n. Then log S_n(t) → log S(0) + σW(t) − ½σ²t as n → ∞ and we are done. We showed that when n → ∞, S_n(t) converges in distribution to S(t) = S(0) exp{σW(t) − ½σ²t}, where W(t) ∼ Φ(µ = 0, σ² = t). That is, S_n(t) is log-normal in its limit. The random variable X = σW(t) − ½σ²t has expectation −½σ²t and variance σ²t.
σ 2 t.

Exercise 647 (Question from Shreve [19], Solution from Zeng [20]). We showed in Theorem 397 that
the binomial limits lead to an exponentiated Brownian motion and log-normality results. Here we show
the same thing under non-zero rate settings. Let r ≥ 0, σ > 0 then for some positive n we consider the
binomial model taking n steps per unit time, such that the per-period rate is r/n, and the up and down factors are written u_n = exp(σ/√n), d_n = exp(−σ/√n). The risk neutral probabilities are given by (verify this)
\[
\tilde{p}_n = \frac{\frac{r}{n} + 1 − d_n}{u_n − d_n}, \qquad \tilde{q}_n = \frac{u_n − \frac{r}{n} − 1}{u_n − d_n}. \qquad (3470)
\]
For t ∈ Q+ and nt ∈ Z define
\[
M_{nt,n} = \sum_{k=1}^{nt} X_{k,n} \qquad (3471)
\]

where X_{i,n}, i ∈ [n] are IID s.t. P̃(X_{i,n} = 1) = p̃_n and P̃(X_{i,n} = −1) = q̃_n. The stock price at time t after nt steps is given by
\begin{align}
S_n(t) &= S(0)\,u_n^{\frac12(nt+M_{nt,n})} d_n^{\frac12(nt−M_{nt,n})} \qquad (3472)\\
&= S(0)\exp\left\{\frac{\sigma}{2\sqrt n}(nt + M_{nt,n})\right\}\exp\left\{-\frac{\sigma}{2\sqrt n}(nt − M_{nt,n})\right\} \qquad (3473)\\
&= S(0)\exp\left\{\frac{\sigma}{\sqrt n} M_{nt,n}\right\}. \qquad (3474)
\end{align}
We want to show that \lim_{n\to\infty} \frac{\sigma}{\sqrt n}M_{nt,n} \sim \Phi\big((r − \tfrac12\sigma^2)t,\ \sigma^2 t\big), from which we would conclude that S_n(t) ∼ S(0)\exp\{\sigma W(t) + (r − \tfrac12\sigma^2)t\} in the distributional sense.

1. Show the mgf of √1 Mnt,n and denote this ψn (u)


n

2. Compute lim ϕn (u) = lim+ ϕ 1 (u) under change of variable. Show that this may be written as
n→∞ x→0 x2

(rx2 + 1)sinh(ux) + sinh(σ − u)x


 
t
log (3475)
x2 sinh(σx)
exp(z)−exp(−z) exp(z)+exp(−z)
with definitions sinh(z) = 2 , cosh(z) = 2 . Use identity sinh(A − B) =
sinh(A)cosh(B) − cosh(A)sinh(B). If we show this then
(rx2 + 1 − cosh(σx)sinh(ux)
 
t
log cosh(ux) + (3476)
x2 sinh(σx)
is equivalent expression.

3. Given Taylor approximation


1
cosh(z) = 1 + z 2 + O z 4 sinh(z) = z + O z 3
 
(3477)
2
show that
(rx2 + 1 − cosh(σx))sinh(ux) 1 rux2 1
= 1 + u2 x2 + − ux2 σ + O x4 .

cosh(ux) + (3478)
sinh(σx) 2 σ 2


4. Use Taylor approximation log(1 + x) = x + O x2 and compute lim+ log ϕ 1 (u). Explain the
x→0 x2
limiting distribution for the scaled binomial limit.

Proof. -

1. By the independence of Xk,n and their moment generating function factors, we may write
 nt
1 u u
ϕn (u) = E[exp(u √ X1,n )]nt = exp( √ )p˜n + exp(− √ )q˜n (3479)
n n n
 r √ r √ nt
n + 1 − exp(−σ/ n) n + 1 − exp(σ/ n)
  
u −u
= exp( √ ) √ √ − exp( √ ) √ √ .
n exp(σ/ n) − exp(−σ/ n) n exp(σ/ n) − exp(−σ/ n)

2. By substition of x → √1 we obtain
n
 2
rx2 + 1 − exp(σx)
   
t rx + 1 − exp(−σx)
log ϕ 1 (u) = log exp(ux) − exp(−ux)
x2 x2 exp(σx) − exp(−σx) exp(σx) − exp(−σx)
(rx2 + 1)(exp(ux) − exp(−ux)) + exp(σ − u)x − exp(−(σ − u)x)
 
t
= log
x2 exp(σx) − exp(−σx)
2
 
t (rx + 1)sinh(ux) + sinh(σ − u)x
= log (3480)
x2 sinh(σx)
 
t (rx2 + 1)sinh(ux) + sinh(σx)cosh(ux) − cosh(σx)sinh(ux)
= log (3481)
x2 sinh(σx)
(rx2 + 1 − cosh(σx))sinh(ux)
 
t
= log cosh(ux) + . (3482)
x2 sinh(σx)

3.
(rx2 + 1 − cosh(σx))sinh(ux)
cosh(ux) + (3483)
sinh(σx)
2 2
 (rx2 + 1 − 1 − σ 2x + O x4 )(ux + O x3
 
u2 x2 4
= 1+ +O x + (3484)
2 σx + O (x3 )
(3485)
σ2 3 5

u2 x2 (r − +O x
2 )ux
+ O x4

= 1+ + (3486)
2 σx + O (x3 )
2
(r − σ2 )ux3 (1 + O x2 )

u2 x2
+ O x4

= 1+ + 2
(3487)
2 σx(1 + O (x ))
u2 x2 rux2 1
− ux2 σ + O x4 .

= 1+ + (3488)
2 σ 2

4. Since
u2 x2 rux2
 
t 1 2 4

logϕ (u) = log 1 + + − ux σ + O x , (3489)
1
x2 x2 2 σ 2

by Taylor expansion we have

t u2 x2 rux2
 
1 2 4

logϕ (u) = + − ux σ + O x (3490)
1
x2 x2 2 σ 2
lim x→0+ tu2 tru tσu
= + − , (3491)
2 σ 2

the exponential of which is the moment generating function of normal random variable (see mgf
Theorem 377) of mean t( σr − σ2 ) and variance t. By writing
    
1 1 2 r σ
E exp u √ Mnt,n = exp tu + t( − )u , (3492)
n 2 σ 2
√σ Mnt,n σ2 2
then we see that n
converges to normal random Φ(t(r − 2 ), σ t).

13.1.3 Brownian Motion


Definition 432 (Brownian Motion). Define some probability space (see Definition 268) (Ω, F, P). For
each ω ∈ Ω, suppose ∃ continuous function W (t), t ≥ 0 satisfying W (0) = 0 depending on ω. W (t) is
said to be Brownian motion if for all sets of values 0 < t0 < · · · < tm the increments W (ti ) − W (ti−1 )
for i ∈ [m] are independent and each increment has E[W (ti ) − W (ti−1 )] = 0, Var(W (ti ) − W (ti−1 )) =
ti − ti−1 .
The development of the Brownian motion is the continuous time version of the scaled, symmetric
random walk formulated in Definition 431. Unlike its discrete formulation which is approximately normal
on each t, the Brownian motion is exactly normal. For all 0 ≤ s < t, the increment W (t) − W (s) ∼
Φ(0, t − s). In particular, this also holds for s = 0, such that the value at specific time t is normal. The
probability measure P pertains to the probabilities of A ∈ F, which is the σ-algebra of subsets of Ω. It
is the ω ∈ Ω that can be thought of to generate the path of the Brownian motion. If we would like to
compute the probability that W(t) ∈ [a, b] for some t > 0, we can write
\[
P\{a \le W(t) \le b\} = \frac{1}{\sqrt{2\pi t}}\int_a^b \exp\left(\frac{-x^2}{2t}\right) dx. \qquad (3493)
\]
Theorem 398 (Properties of the Brownian Motion). Since the increments of Brownian motion are IID
normal, their linear combination (and hence W (t1 ), · · · W (tm ) are jointly normal. (see Definition 344)
They can then be characterized by their expectation and covariance vectors. For 0 ≤ s < t we obtain the
covariance formula (see Definition 302)

EW (s)W (t) = E[W (s)(W (t) − W (s)) + W 2 (s)] (3494)


2
= EW (s) · E[W (t) − W (s)] + EW (s) (3495)
= Var(W (s)) = s. (3496)

W (s) ⊥ [W (t) − W (s)] follows by definition of Brownian motion (see Definition 432) and the covariance
matrix for Brownian motion m-vector (W (ti )i∈[m] ) has (i, j) entry tmin(i,j) .
Since we may characterize distributions with moment generating functions, let us consider that of the
Brownian motion m-vector (W(t_i))_{i∈[m]}. In particular, its mgf is written
\[
E\exp\Big\{\sum_{i=1}^{m} u_i W(t_i)\Big\} = E\exp\Big\{\sum_{k=1}^{m}\Big(\sum_{j=k}^{m}u_j\Big)\big(W(t_k) − W(t_{k-1})\big)\Big\} \qquad (3497)
\]
where t_0 = 0. Then by the independence of increments their joint mgf factors and we write
\[
\prod_{k=1}^{m} E\exp\Big\{\Big(\sum_{j=k}^{m}u_j\Big)\big(W(t_k) − W(t_{k-1})\big)\Big\} \qquad (3498)
\]
\[
= \prod_{k=1}^{m} \exp\Big\{\frac12\Big(\sum_{j=k}^{m}u_j\Big)^2 (t_k − t_{k-1})\Big\}, \qquad \text{see mgf Theorem 377.} \qquad (3499)
\]

Definition 433 (Equivalent Definition of a Brownian Motion). Consider the definition of a Brownian
motion given in Definition 432. We give two equivalent formulations, the first being that for the cor-
responding 0 = t0 < · · · < tm , W (ti ), i ∈ [m] are jointly normal with mean zero vector and covariance
matrix described by Equation 3494. The second equivalence is that the random variables W (ti ), i ∈ [m]
have joint mgf given Equation 3498.

13.1.3.1 Brownian Motion Filtration

Recall the definition of a filtration (see Definition 305). Here we define the filtration for a Brownian
process.

Definition 434 (Filtration for Brownian Motions). Let (Ω, F, P) be probability space (see Definition
268) defined for Brownian motion W (t) (see Definition 432) on values of t ≥ 0. A filtration for the
Brownian motion satisfies the properties (along with the definitions of filtration itself ) of

1. Adaptivity: For each t ≥ 0, W (t) is F(t) measurable. The Brownian motion is an adapted stochastic
process (see Definition 308) under the Brownian filtration.

2. Future increment independence: For values of t ∈ [0, u), [W (u) − W (t)] ⊥ F(t).

Part (2) of property of Brownian Motion filtration in Definition 434 leads to the Efficient Market
Hypothesis. Note that the Brownian filtration F(t) contains at least as much information obtained by
observing the Brownian motion up to time t, and can contain more, such as the observation of other
processes.

Corollary 28 (Brownian Motion is Martingale). As in usual techniques for 0 ≤ s ≤ t we may write


E[W (t)|F(s)] = E[(W (t) − W (s)) + W (s)|F(s)] = W (s). It is martingale. (see Definition 324)

Exercise 648. Show that for 0 ≤ t < u1 < u2 , [W (u2 ) − W (u1 )] ⊥ F(t).

Proof. We do this by writing that

W (u2 ) − W (u1 ) = W (u2 ) − W (t) + W (t) − W (u1 ) (3500)

and call upon linearity of expectations.

Exercise 649. For Brownian motion W (t) adapted to filtration F(t) for t ≥ 0 show that W 2 (t) − t is
martingale.

Proof. See that for values 0 ≤ s ≤ t, we have

E(W 2 (t)|F(s)) = E[(W (t) − W (s))2 + 2W (t)W (s) − W 2 (s)|F(s)] (3501)


= E[(W (t) − W (s))2 |F(s)] + W (s)E[2W (t) − W (s)|F (s)] (3502)
= (t − s) + W (s)E[W (t) + W (t) − W (s)|F(s)] (3503)
= (t − s) + W (s)E[W (t)|F(s)] (3504)
= (t − s) + W (s)W (s) (3505)
= (t − s) + W 2 (s) (3506)

where we use the property that W (t) is martingale. By subtracting t from both sides we get the result.

13.1.3.2 Variation of the Brownian Motion

The property of the Brownian motion that makes it mathematically interesting is that the paths of the
Brownian motion has non-zero quadratic variation, and this affects our computation when integrating
over a Brownian motion w.r.t time.

Theorem 399 (Mean Value Theorem). For an arbitrary function f(t) that is continuously differentiable everywhere^1, the mean value theorem states that for any subinterval [t_j, t_{j+1}] there ∃ t'_j s.t.
\[
\frac{f(t_{j+1}) − f(t_j)}{t_{j+1} − t_j} = f'(t'_j).
\]
That is, some point t'_j in the subinterval has tangent parallel to the chord connecting (t_j, f(t_j)), (t_{j+1}, f(t_{j+1})).

Theorem 400 (First Order Variation of a Continuously Differentiable Function). For some function
f (t), define the first-order variation
Z T
F VT (t) = |f 0 (t)|dt. (3507)
0

This can be computed as a limit of sums, by considering the partitions 0 = t1 < t2 < · · · < tn = T
and letting the maximum step size of partition equate to kΠk = maxj (tj+1 − tj ). Then F VT (t) =
Pn−1
lim j=0 |f (tj+1 ) − f (tj )|. Prove this statement.
kΠk→0

Proof. By the mean value theorem (see Theorem 399) we can write f (tj+1 ) − f (tj ) = f 0 (t0j )(tj+1 − tj ).
Then we can write
n−1
X n−1
X Z T
F VT (t) = lim |f (tj+1 ) − f (tj )| = lim |f 0 (t0j )|(tj+1 − tj ) = |f 0 (t)|dt (3508)
kΠk→0 kΠk→0 0
j=0 j=0

where the last equality follows by seeing that this is the definition of a Riemann sum (see Definition 287)
for integral of |f 0 (t)|.

Theorem 401 (Second Order (Quadratic) Variation of a Continuously Differentiable Function). For
function f (t) defined on 0 ≤ t ≤ T the quadratic variation of the function up to T is written
n−1
X
[f, f ](T ) = lim [f (tj+1 ) − f (tj )]2 , (3509)
kΠk→0
j=0

for partition Π = {ti : i ∈ [n]} and 0 = t0 < · · · < tn = T . For f that is continuous and differentiable,
the mean value theorem states that
n−1
X n−1
X n−1
X
|f (tj+1 ) − f (tj )|2 = |f 0 (t0j )|2 (tj+1 − tj )2 ≤ kΠk |f 0 (t0j )|2 (tj+1 − tj ). (3510)
j=0 j=0 j=0

We can then bound the quadratic variation


 
n−1
X Z T
0 0 2
[f, f ](T ) ≤ lim kΠk
 |f (tj )| (tj+1 − tj ) = lim kΠk ·
 |f 0 (t)|2 dt = 0. (3511)
kΠ|→0 kΠk→0 0
j=0

The integrability of |f 0 (t)|2 is guaranteed since f (t) is differentiable, and therefore the integral term is
finite. Otherwise, we obtain 0 · ∞ and the result is uncertain.
1 there exists a class of functions that are continuous everywhere but nowhere differentiable! See Weierstrass functions.

In ordinary calculus, the quadratic variation is zero when working with functions having continuous derivatives. However, the paths of the Brownian motion are not differentiable in time. The Mean Value Theorem only holds for functions that are continuously differentiable everywhere. In the case of Brownian motions, however, there exists no value of t where dW(t)/dt is defined; the path is 'pointy'.

Theorem 402 (Second Order (Quadratic) Variation of a Brownian Motion). Let W be Brownian motion
defined as in Definition 432 or as in equivalent Definition 433. Then the quadratic variation of a
Brownian motion is written [W, W ](T ) and equals to T for all T ≥ 0 almost surely.

Proof. Let Π = {ti : i ∈ [n]} be the partition set for [0, T ], and define the sampled quadratic variation
w.r.t to this partition be equal to the random variable
n−1
X 2
QΠ = (W (tj+1 ) − W (tj )) (3512)
j=0

for this sampling frequency. We show QΠ → T as kΠk → 0. We can do this by showing that this
random variable, in its limit, has expectation T and variance zero. This is known as `2 convergence, or
convergence in mean square. The set of paths for which the quadratic variation does not converge to T
has zero probability. For some j, we have
2
E[(W (tj+1 ) − W (tj )) ] = Var(W (tj+1 ) − W (tj )) = tj+1 − tj (3513)
Pn−1
by properties of the Brownian motion. (see Theorem 398) Then the expectation term EQΠ = j=0 (tj+1 −
tj ) = T . Considering its variance, write
h 2 i
Var (W (tj+1 ) − W (tj ))2 = E (W (tj+1 ) − W (tj ))2 − (tj+1 − tj )
 
(3514)
E (W (tj+1 ) − W (tj ))4 + (tj+1 − tj )2 − 2(tj+1 − tj )(W (tj+1 ) − W (tj ))2 .
 
= (3515)
 
Since the fourth moment of X ∼ N (0, σ 2 ) = 3σ 4 (see Exercise 575) we obtain Var (W (tj+1 − W (tj ))2 =

3ψ − 2ψ + ψ = 2ψ, ψ = (tj+1 − tj )2 . (3516)


Pn−1
Then the value Var(QΠ ) = j=0 Var((W (tj+1 ) − W (tj ))2 ) =

n−1
X n−1
X
2(tj+1 − tj )2 ≤ 2kΠk(tj+1 − tj ) = 2kΠkT. (3517)
j=0 j=0

As kΠk → 0, the value lim Var(QΠ ) = 0 and the lim QΠ = EQΠ = T .


kΠk→0 kΠk→0

Since we wrote that E[(W(t_{j+1}) − W(t_j))²] = t_{j+1} − t_j and Var[(W(t_{j+1}) − W(t_j))²] = 2(t_{j+1} − t_j)², it may be tempting to argue that as t_{j+1} − t_j becomes small, the variance diminishes such that (W(t_{j+1}) − W(t_j))² ≈ t_{j+1} − t_j, or that (W(t_{j+1}) − W(t_j))²/(t_{j+1} − t_j) ≈ 1. However, see that
\[
\sqrt{\frac{(W(t_{j+1}) − W(t_j))^2}{t_{j+1} − t_j}} = Y_{j+1} \sim \Phi(0, 1) \qquad (3518)
\]
is standard normal and this approximation does not hold! Suppose we take equally spaced partitions for
jT T
some n >> 0 such that tj = n , j ∈ [n] then tj+1 − tj = n and
n−1 n−1 2
X 2
X Yj+1
QΠ = (W (tj+1 ) − W (tj )) = T . (3519)
j=0 j=0
n

2
Pn−1 Yj+1 2
By Law of Large Numbers (see Theorem 361) lim j=0 n = EYj+1 = 1. Asymptotically the value
n→∞
of the sampled quadratic variation goes to T , although each term in the sum does not, and this is
guaranteed by the Law of Large Numbers. Then we write this informally as

dW (t)dW (t) = dt (3520)

to say that the Brownian motion accumulates quadratic variation at rate one per unit time. This is not
to be confused with the distribution of each Brownian increment and calls upon large number theory in
its formulation. For the quadratic variation over path [0, TX ], we get [W, W ](TX ) = TX . For interval
[TX , TY ], by taking the partitions, squaring the Brownian increments and summing them up, in its limit
the large number theory gives [W, W ](TY ) − [W, W ](TX ) = TY − TX . Brownian motion accumulates
TY − TX units of quadratic variation over interval [TX , TY ]. This is the source of asset volatility, as we
shall see soon.
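A numerical sketch of Theorem 402 and the limiting argument above: refine the partition of [0, T] and observe that the sampled quadratic variation of a single simulated Brownian path concentrates on T, even though each individual normalized squared increment remains a non-degenerate random variable.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 2.0
for n in (10, 100, 10_000, 1_000_000):
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), size=n)    # Brownian increments on one path
    print(f"n = {n:>9d}  sampled [W,W](T) = {np.sum(dW ** 2):.4f}  (T = {T})")
```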
Theorem 403 (Cross Variation of the Brownian Motion). Let Π = {ti : i ∈ [n], tj < tj+1 } be the
partition set for [0, T ]. Then the cross variation of W (t) with t and the quadratic variation of t with
itself is written
n−1
X
lim (W (tj+1 ) − W (tj )) (tj+1 − tj ) = 0 (3521)
kΠk→0
j=0

and
n−1
X
lim (tj+1 − tj )2 = 0 (3522)
kΠk→0
j=0

respectively.
To show this, since |(W (tj+1 ) − W (tj ))(tj+1 − tj )| ≤ maxk∈[n−1] |(W (tk+1 ) − W (tk ))|(tj+1 − tj ), then
n−1
X
lim | (W (tj+1 ) − W (tj )) (tj+1 − tj )| ≤ T lim max |(W (tk+1 ) − W (tk ))| (3523)
kΠk→0 kΠk→0 k∈[n−1]
j=0

and the equivalence follows as the maximum partition length goes to zero. For the self variation we can
similarly write
n−1
X n−1
X
(tj+1 − tj )2 ≤ max (tk+1 − tk ) · (tj+1 − tj ) = T · kΠk (3524)
k∈[n−1]
j=0 j=0

and the result follows. We write these results to be

dW (t)dt = 0 dtdt = 0. (3525)

Exercise 650 (First Order Variation and Cubic Variation of Brownian Motion). For Brownian motion
W (t) define the sampled quadratic variation corresponding to partition Π = {ti ; i ∈ [n], tj < tj+1 } and
t0 = 0, tn = T to be the value
n−1
X
(W (tj+1 ) − W (tj ))2
j=0

which we showed in Theorem 402 to approach T as kΠk, the length of longest subinterval, → 0. Show
that the first order variation and the cubic variation satisfy
n−1
X
|W (tj+1 ) − W (tj )| → ∞ (3526)
j=0
n−1
X
|W (tj+1 ) − W (tj )|3 → 0 (3527)
j=0

under the same partition limits almost surely.
Proof. We show a by contradiction proof from Zeng [20].Assume ∃A ∈ F s.t P(A) > 0 ∧ ∀ω ∈ A,
n−1
X
lim sup |Wtj+1 − Wtj |(ω) < ∞. (3528)
n→∞
j=0

Then we must have for ω ∈ A


n−1
X n−1
X
(Wtj+1 − Wtj )2 (ω) ≤ max |Wtk+1 − Wtk |(ω) · |Wtj+1 − Wtj |(ω) (3529)
0≤k≤n−1
j=0 j=0
n−1
X
≤ max |Wtk+1 − Wtk |(ω) · lim sup |Wtj+1 − Wtj |(ω) → 0. (3530)
0≤k≤n−1 n→∞
j=0

The LHS has both limit 0 and limit T and this is contradiction. The next part is shown by construction
instead, see that under n → ∞
n−1
X n−1
X
3
|W (tj+1 ) − W (tj )| ≤ max |Wtk+1 − Wtk | · (W (tj+1 ) − W (tj ))2 → 0. (3531)
0≤k≤n−1
j=0 j=0

13.1.3.3 Volatility of an Exponentiated Brownian Motion

Definition 435 (Geometric Brownian Motion). Define the exponentiated / geometric Brownian motion
to be the process
 
1
S(t) = S(0) exp{σW (t) + α − σ 2 t} (3532)
2
for some α ∈ R and σ > 0. We showed in Theorem 397 under zero rate settings that we arrive at a similar
stock price process under the limits of the binomial model. Here we show that the Brownian motion’s
quadratic variation gives rise to the volatility σ of a stochastic process. For fixed 0 ≤ T1 < T2 , suppose the
Brownian motion was realized up to t ∈ [T1 , T2 ] and define the partition indices T1 = t0 < · · · < tm = T2 .
The log returns of an asset is written
 
S(tj+1 ) ∆ 1 2
log = σ (W (tj+1 ) − W (tj )) + α − σ (tj+1 − tj ) (3533)
S(tj ) 2
and the realized volatility is the sum of squares of these log returns. We can write
m−1
X 2 m−1  2 m−1
S(tj+1 ) 2
X 2 1 2 X
log = σ (W (tj+1 ) − W (tj )) + α − σ (tj+1 − tj )2 (3534)
j=0
S(t j ) j=0
2 j=0
  m−1
1 X
+2σ α − σ 2 (W (tj+1 ) − W (tj )) (tj+1 − tj ). (3535)
2 j=0

We showed that as kΠk → 0, the second and third term diminishes to zero, while the first term is
approximately σ 2 the quadratic variation accumulated by the Brownian motion. We have
m−1
X 2
S(tj+1 )
log ≈ σ 2 (T2 − T1 ). (3536)
j=0
S(tj )

Therefore, for asset price S(t) that follows a geometric Brownian motion with constant volatility σ, we
can compute it using Equation 3536. Although this is exact in its limit, market observables are typically
discrete and our computation is approximate. Besides, market prices exhibit non-constant volatility, so
there is estimation error for long time intervals.
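A simulation sketch of Equation 3536 (parameter values are arbitrary): generate the per-step log returns of a geometric Brownian path on a fine grid, sum their squares over [T1, T2] and compare against σ²(T2 − T1).

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, sigma, T1, T2, n = 0.05, 0.3, 0.0, 1.0, 100_000
dt = (T2 - T1) / n
dW = rng.normal(0.0, np.sqrt(dt), size=n)
log_returns = sigma * dW + (alpha - 0.5 * sigma ** 2) * dt   # Eq. (3533) per step

print("sum of squared log returns:", np.sum(log_returns ** 2))
print("sigma^2 (T2 - T1)         :", sigma ** 2 * (T2 - T1))
```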

13.1.4 Brownian Motion as Markov Process
Theorem 404 (Brownian Motion as Markov Process). Let W (t), t ≥ 0 be Brownian motion as defined
in Definitions 432 or 433. Let F(t), t ≥ 0 be the Brownian filtration as defined in Definition 434. Then
W (t), t ≥ 0 is Markov process as defined in Definition 325.

Proof. We need to show that

∀0 ≤ s ≤ t, f Borel measurable =⇒ ∃g s.t E[f (W (t))|F(s)] = g(W (s)). (3537)

Write E[f(W(t))|F(s)] = E[f{W(t) − W(s) + W(s)}|F(s)]. By the independence lemma (see Lemma 4), we may express
\[
g(x) = E[f\{W(t) − W(s) + x\}] = \frac{1}{\sqrt{2\pi(t-s)}}\int_{-\infty}^{\infty} f(w + x)\exp\left(-\frac{w^2}{2(t-s)}\right)dw. \qquad (3538)
\]
Applying the change of variable technique for τ = t − s, y = w + x we write
\[
g(x) = \frac{1}{\sqrt{2\pi\tau}}\int_{-\infty}^{\infty} f(y)\exp\left(-\frac{(y-x)^2}{2\tau}\right)dy. \qquad (3539)
\]
Define the transition density p(τ, x, y) for the Brownian motion to be the equation
\[
p(\tau, x, y) = \frac{1}{\sqrt{2\pi\tau}}\exp\left(-\frac{(y-x)^2}{2\tau}\right), \qquad (3540)
\]
R∞
and g(x) = −∞ f (y)p(τ, x, y)dy. The expected value of a Borel measurable f on Brownian motion at
future states t given market information available up to s is expressed in terms of its transition density
as
Z ∞
E[f (W (t))|F(s)] = f (y)p(τ, W (s), y)dy. (3541)
−∞

The transition density p(τ, W (s), y) is the conditional density of W (t) on filtration F(s) and is density in
y, and is normal about W (s) with variance τ = t − s. Since the distribution is completely characterized
by the value W (s) and nothing else, this is Markov.

Exercise 651. Let W (t) be Brownian motion (see Definition 432) adapted to filtration F(t) (see Defi-
nition 305) for t ≥ 0. Show that

1. For µ ∈ R, let the Brownian motion with drift µ be written

X(t) = µt + W (t). (3542)

Show for any Borel-measurable f (y), 0 ≤ s < t that


Z ∞
(y − x − µ(t − s))2
 
1
g(x) = p f (y) exp − dy (3543)
2π(t − s) −∞ 2(t − s)
satisfies E[f (X(t))|F(s)] = g(X(s)). That is, show that the Brownian motion with drift is also
Markov.

2. For ν ∈ R and σ > 0 write a geometric Brownian motion S(t) = S(0) exp{σW (t) + νt} and show
R∞
that for any Borel measurable f (y), the function g(x) = 0 f (y)p(τ, x, y)dy satisfies E[f (S(t))|F(s)] =
g(S(s)) where
(log xy − ντ )2
 
1
p(τ, x, y) = √ exp − τ =t−s (3544)
σy 2πτ 2σ 2 τ
is the transition density. That is, show S is Markov process.

Proof. -
For (1), Write E[f (X(t))|F(s)] = E[f (W (t)+µt−W (s))+W (s))|F(s)] = E[f (Wt −Ws +(Ws +µt))|F(s)].
By independence lemma (see Lemma 4) write

w2
Z  
1
g(x) = E[f (Wt − Ws + x)] = p f (w + x) exp − dw. (3545)
2π(t − s) −∞ 2(t − s)

Then we can write

E[f (Xt )|F(s)] = E[f (Wt − Ws + (Ws + µt)|F(s)] (3546)


Z ∞
x2
 
1
= f (Ws + µt + x) p exp − dx (3547)
−∞ 2π(t − s) 2(t − s)
Z ∞
(y − µt − Ws )2
 
1
= f (y) p exp − dy (3548)
−∞ 2π(t − s) 2(t − s)
Z ∞
(y − Ws − µs − µ(t − s))2
 
1
= f (y) p exp − dy (3549)
−∞ 2π(t − s) 2(t − s)
= g(W (s) + µs) = g(X(s)). (3550)

ν
For (2), see that E[f (S(t))|F(s)] = E[f (S(0) exp{σXt })|F(s)] where the drift is µ = σ. See then that
we may write using the result in (1)
Z ∞
(y − Xs − µ(t − s))2
 
1
E[f (St )|F(s)] = f (S0 exp{σy}) p exp − dy (3551)
−∞ 2π(t − s) 2(t − s)
 h i2 
1 z 1 Ss
Z ∞
1

 σ ln S0 − σ ln S0 − µ(t − s) 
 1
= f (z) p exp − dz z → S0 exp(σy)
0 2π(t − s) 
 2(t − s)  σz

 h i2 
z
Z ∞
1

 ln Ss − ν(t − s) 

= f (z) p exp − 2
dz (3552)
0 σz 2π(t − s) 
 2σ (t − s) 

Z ∞
= f (z)p(τ, Ss , z)dz (3553)
0
= g(Ss ). (3554)

13.1.5 First Passage Time Distribution


In Definition 435 we visited the case of stochastic processes where the Brownian motion appear in the
exponent term. Here we visit a martingale containing a Brownian motion as exponent.

Definition 436 (Exponential Martingale). The exponential martingale corresponding to constant volatil-
ity σ has form
 
1
Z(t) = exp σW (t) − σ 2 t (3555)
2

where W (t) is Brownian motion (see Definition 432, 433) and t ≥ 0.

Theorem 405 (Proof that the Exponential Martingale is Martingale). For W (t) equipped with filtration
F(t) and exponential martingale defined as in Definition 436 see that for 0 ≤ s ≤ t, we can write

E[Z(t)|F(s)] =
   
1 2
E exp σW (t) − σ t − σW (s) + σW (s) |F(s) (3556)
2
 
1 2
= E exp{σ(W (t) − W (s))} · exp{σW (s) − σ t}|F(s) (3557)
2
1 2
= exp{σW (s) − σ t}E [exp{σ(W (t) − W (s))}|F(s)] (3558)
2
1
= exp{σW (s) − σ 2 t}E [exp{σ(W (t) − W (s))}] (3559)
2
1 1
= exp{σW (s) − σ 2 t} exp{ σ 2 (t − s)} see Exercise 553 (3560)
 2  2
1
= exp σW (s) − σ 2 s = Z(s) (3561)
2
and we are done.
Definition 437 (First Passage Time of Stochastic Process). For m ∈ R and stochastic process W (t),
the first passage time to level m is defined to be

τm = min{t ≥ 0, W (t) = m}, (3562)

the earliest time the stochastic process reaches m. For a stochastic process that does not reach m, we set
τm = ∞.
For the exponential martingale defined in Definition 436 see that we can write
  
1 2
1 = Z(0) = EZ(t ∧ τm ) = E exp σW (t ∧ τm ) − σ (t ∧ τm ) (3563)
2
where a ∧ b is written short form for min(a, b). For σ > 0, m > 0, for time t ≤ τm we must have inequality
for the Brownian motion to be W (t) ≤ m, hence we get the inequality 0 ≤ exp{σW (t ∧ τm )} ≤ exp(σm).
We also have
1

1 2 exp{− σ 2 τm }
 as t → ∞ and τm < ∞
exp{− σ (t ∧ τm )} = 2 (3564)
2 exp{− 1 σ 2 t}

if τm = ∞ .
2
The second term converges to zero as t → ∞, and we can express the two cases as
   
1 2 1 2
lim exp − σ (t ∧ τm ) = 1{τm <∞} exp − σ τm . (3565)
t→0 2 2
In the same way, we are guaranteed
(
= exp{σW (τm )} = exp{σm} as t → ∞ and τm < ∞
exp{σW (t ∧ τm )} (3566)
≤ exp(σm) if τm = ∞ .
In any case, the exponential martingale is bounded in its limit, and takes form
 
1 2 1
lim exp σW (t ∧ τm ) − σ (t ∧ τm ) = 1{τm <∞} exp{σm − σ 2 τm }. (3567)
t→∞ 2 2
Taking the Equation 3563 and using Dominated Convergence Theorem (see Theorem 343) to interchange
the limit and expectations, we obtain
     
1 1
lim E exp σW (t ∧ τm ) − σ 2 (t ∧ τm ) = E lim exp σW (t ∧ τm ) − σ 2 (t ∧ τm )
t→∞ 2 t→∞ 2
 
1
= E 1{τm <∞} exp{σm − σ 2 τm } (3568)
2
= 1, (3569)

which means that
 
1
exp(−σm) = E 1{τm <∞} exp{− σ2 τm } (3570)
2
which holds for conditions m, σ > 0. Then we obtain
 
1
lim exp(−σm) = lim E 1{τm <∞} exp{− σ 2 τm } (3571)
σ→0 σ→0 2
 
1
1 = E lim 1{τm <∞} exp{− σ 2 τm } by M.C.T, Theorem 342 (3572)
σ→0 2
= E 1{τm <∞}
 
(3573)
= P [τm < ∞] (almost surely). (3574)

We can then drop the indicator since the probability of τm = ∞ is zero and we obtain E exp{− 12 σ 2 τm } =
 

exp(−σm).

Theorem 406 (Laplace Transform of the First Passage Time Distribution of the Brownian Motion).
For m ∈ R, first passage time τm of W (t) (see Definition 437) is such that P(τm < ∞) = 1 and the
Laplace transform of its distribution is given by the equation

E exp{−ατm } = exp{−|m| 2α} ∀α > 0. (3575)

Proof. We prove by cases. Consider m > 0, then let σ such that 21 σ 2 = α and we are done. Since W (t)
distributionally
is symmetrically distributed, τm ∼ τ|m| and the negative case follows.

Exercise 652 (Infinite Expectation for First Passage Time of Brownian Motion where m 6= 0). Show
that for m 6= 0, τm has expectation ∞.

Proof. Consider the Laplacian transform equation given in Equation 3575. Taking its derivative we
obtain
|m| √
E[τm exp(−ατm )] = √ exp(−|m| 2α) ∀α > 0. (3576)

As α → 0+ , we have E[τm ] = ∞. We have seen a similar ‘paradox’ in Exercise 559, the Saint Petersburg
paradox.

Exercise 653. Let W (t) be Brownian motion equipped with filtration F(t) (see Definition 434) and for
some fixed m > 0, µ ∈ R, we have

X(t) = µt + W (t) τm = min{t ≥ 0 : X(t) = m} (3577)

for 0 ≤ t < ∞. For


   
1
Z(t) = exp σX(t) − σµ + σ 2 t (3578)
2
show that

1. Z(t) is martingale for t ≥ 0.

2. Then show that


    
1
E exp σX(t ∧ τm ) − σµ + σ 2 (t ∧ τm ) =1 t ≥ 0. (3579)
2

3. Then show that for non-negative drift µ ≥ 0, σ > 0 that
     
1 2
E exp σm − σµ + σ (τm ) 1τm <∞ = 1 t≥0 (3580)
2

and that τm is almost surely finite with Laplace transform


n p o
E exp(−ατm ) = exp mµ − m 2α + µ2 α > 0. (3581)

4. Then obtain a formula for Eτm , showing its finiteness.

5. Then show for negative drift µ < 0 for σ > −2µ the expectation value in (3) also equals one, and
that P(τm < ∞) = exp(−2m|µ|) < 1. Use this to show the same Laplace form.

Proof. -

1.
σ2
      
Zt
E |F(s) = E exp σ(Wt − Ws ) + σµ(t − s) − σµ + (t − s) (3582)
Zs 2
σ2
     2 
σ
= exp σµ(t − s) − σµ + (t − s) exp (t − s) use mgf (see Theorem 377)
2 2
σ2
   2 
σ
= exp − (t − s) exp (t − s) (3583)
2 2
= 1. (3584)

2. By the optimal stopping theorem we have EZt∧τm = EZ0 = 1 by use of result proved in (1).

3. In similar reasoning to derivation of Equation 3566, for µ > 0, σ > 0 we can write bounds Zt∧τm ≤
exp(σm). By dominated convergence (see Theorem 343) we can apply

E[1τm <∞ Zτm ] = E[ lim Zt∧τm ] = lim E[Zt∧τm ] = 1. (3585)


t→∞ t→∞

If the first passage time is not finite then Zt∧τm ≤ exp(σm − 12 σ 2 t) → 0 as t → ∞, and E[exp(σm −
(σµ + σ2
1
2 )τm ) τm <∞ ] = 1. Set σ → 0+ and therefore P(τm < ∞) = 1. Writing σµ + σ2
2 = α we
obtain E exp(−ατm ) = exp(−σm) which is the Laplace form by substituting back the value of σ,
p p
since mµ − m 2σµ + σ 2 + µ2 = mµ − m (σ + µ)2 = mµ − m(σ + µ) = −σm.

4.
δ p −m
E exp(−ατm ) = −Eτm exp(−ατm ) = exp(mµ − m 2α + µ2 ) p . (3586)
δα 2α + µ2
m
Set α → 0+ and by monotone convergence (see Theorem 342) obtain Eτm = µ < ∞ for µ > 0 by
−m
seeing that −Eτm exp(−ατm ) → −Eτm and RHS goes to µ .
2
5. Under stated conditions σµ + σ2 > 0, so that Zt∧τm ≤ exp(σm) for finite first passage time and
2
Zt∧τm ≤ exp(σm − ( σ2 + σµ)t) → 0 for τm = ∞ as t → ∞. Then

σ2
 
E exp(σm − (σµ + )τm )1τm <∞ = E lim Zt∧τm = lim EZt∧τm = 1. (3587)
2 t→∞ t→∞

As σ → −2µ+ we obtain E[exp(−2µ+ m)1τm <∞ ] = 1 and hence P(τm < ∞) = exp(2µm) =
σ2
exp(−2|µ|m) < 1. Set α = σµ + 2 > 0 and multiply both sides by exp(−σm) to Equation 3587.
We can then see that

E exp(−ατm ) = E exp(−ατm )(1τm <∞ + 1τm =∞ ) (3588)


= E exp(−ατm )1τm <∞ (3589)
= exp(−σm) (3590)
p
= exp(mµ − m 2α + µ2 ) (3591)
p p
since mµ − m 2σµ + σ 2 + µ2 = mµ − m (σ + µ)2 = mµ − m(σ + µ) = −σm.

13.1.6 Reflection Principle


For some fixed level m > 0 and time t > 0, we are interested in the number of Brownian motion paths
that have at some point t̂ ≤ t reached m. The reflection principle states that for each Brownian motion
path that reaches m prior to t and is at w < m at t such that it is distance (m − w) below m, there ∃
another Brownian path that is at level m + (m − w) = 2m − w at time t. This is the Brownian path that
flips the Brownian increments at time t ≥ τm , the first passage time. (see Definition 437) Since we are
working with continuous time stochastic processes, point probabilities are zero and we concern ourselves
with intervals instead. The reflection equality states that

P{τm ≤ t, W (t) ≤ w} = P{τm ≤ t, W (t) ≥ 2m − w} w ≤ m, m > 0 (3592)


= P{W (t) ≥ 2m − w}P{τm ≤ t|W (t) ≥ 2m − w} (3593)
= P{W (t) ≥ 2m − w}. (3594)

Theorem 407 (First Passage Time Distribution). For all m ≠ 0, the first passage time random variable τ_m has c.d.f.
\[
P\{\tau_m \le t\} = \frac{2}{\sqrt{2\pi}}\int_{\frac{|m|}{\sqrt t}}^{\infty}\exp\Big\{-\frac{y^2}{2}\Big\}dy, \qquad t \ge 0, \qquad (3595)
\]
and density function
\[
f_{\tau_m}(t) = \frac{|m|}{\sqrt{2\pi}}\,t^{-\frac32}\exp\Big\{-\frac{m^2}{2t}\Big\}, \qquad t \ge 0. \qquad (3596)
\]

Proof. For m > 0, by setting w = m see that by the reflection equality, Equation 3592, we obtain P{τ_m ≤ t, W(t) ≤ m} = P{W(t) ≥ m}. Since we have P{τ_m ≤ t, W(t) ≥ m} = P{W(t) ≥ m}P{τ_m ≤ t | W(t) ≥ m} = P{W(t) ≥ m}, by adding the two equations we obtain
\begin{align}
P\{\tau_m \le t\} = 2P\{W(t) \ge m\} &= \frac{2}{\sqrt{2\pi t}}\int_{m}^{\infty}\exp\Big\{-\frac{x^2}{2t}\Big\}dx \qquad (3597)\\
&= \frac{2}{\sqrt{2\pi}}\int_{\frac{|m|}{\sqrt t}}^{\infty}\exp\Big\{-\frac{y^2}{2}\Big\}dy, \qquad m > 0,\ x \to y\sqrt t, \qquad (3598)
\end{align}
for positive m, since dx = \sqrt t\,dy. For negative m, since τ_m is distributionally equal to τ_{|m|}, we have the same result. The density follows by ordinary calculus.

Here the Laplace transform is given by
\[
E\exp(-\alpha\tau_m) = \int_0^{\infty}\exp(-\alpha t)f_{\tau_m}(t)\,dt = \int_0^{\infty}\frac{|m|}{t\sqrt{2\pi t}}\exp\Big(-\alpha t − \frac{m^2}{2t}\Big)dt \qquad \forall\alpha > 0. \qquad (3599)
\]
This is equivalent to the form given in Equation 3575. (verify this)
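A quick Monte Carlo sketch of Theorem 407: simulate discretized Brownian paths, flag those that reach the level m by time t, and compare the empirical frequency with 2P{W(t) ≥ m}. The finite time step introduces a small downward bias, since a path may cross m between grid points; the parameters below are arbitrary.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(8)
m, t, n_steps, n_paths = 1.0, 2.0, 1000, 10_000
dt = t / n_steps
W = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps)), axis=1)
hit = (W.max(axis=1) >= m)                       # reached level m by time t

# P{tau_m <= t} = 2 P{W(t) >= m} = 2 (1 - Phi(m / sqrt(t)))
theoretical = 2 * (1 - 0.5 * (1 + erf(m / sqrt(t) / sqrt(2))))
print("Monte Carlo :", hit.mean())
print("Theoretical :", theoretical)
```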

13.1.7 Brownian Density


Definition 438 (Maximum to Date). For a stochastic process W (s), denote its maximum to date to be
the value

M (t) = max W (s). (3600)


0≤s≤t

By logical reasoning it is trivial to see that for positive m,

M (t) ≥ m ↔ τm ≤ t. (3601)

Write the reflection equality Equation (3592) as

P{M (t) ≥ m, W (t) ≤ w} = P{W (t) ≥ 2m − w} w ≤ m, m > 0. (3602)

Theorem 408 (Joint Density of Brownian Motion and its Maximum). For values of t > 0, the joint
density of (M (t), W (t)) is given
\[
f_{M(t),W(t)}(m, w) = \frac{2(2m − w)}{t\sqrt{2\pi t}}\exp\left(-\frac{(2m − w)^2}{2t}\right), \qquad w \le m,\ m > 0. \qquad (3603)
\]
Proof. By reflection equality equivalence (see Equation 3602), and density of the first passage time of
Brownian motion (see Theorem 407), write
Z ∞ Z w
P{M (t) ≥ m, W (t) ≤ w} = fM (t),W (t) (x, y)dydx (3604)
m −∞
Z ∞
1 z2
= √ exp{− }dz (3605)
2πt 2m−w 2t
= P{W (t) ≥ 2m − w}. (3606)

The equality of the two probability measures are given by Equation 3592. Equation 3604 is by definition
and Equation 3605 is obtained from Equation 3597. Differentiating both integral forms w.r.t m, get
Z w
2 (2m − w)2
− fM (t),W (t) (m, y)dy = − √ exp{− }. (3607)
−∞ 2πt 2t
Differentiating w.r.t w gets us
2(2m − w) (2m − w)2
−fM (t),W (t) (m, w) = − √ exp{− } (3608)
t 2πt 2t
and we are done.

Corollary 29 (Conditional Distribution of M (t) given W (t)). The conditional density is given for values
of w ≤ m, m > 0 by
fM (t),W (t) (m, w)
fM (t)|W (t) (m|w) = (3609)
fW (t) (w)
(2m − w)2 w2
   
2(2m − w) 1
= √ exp{− } / √ exp{− } (3610)
t 2πt 2t 2πt 2t
 
2(2m − w) 2m(m − w)
= exp − . (3611)
t t

13.2 Stochastic Calculus
13.2.1 Ito Integral
For fixed positive T we want to integrate ∫_0^T ∆(t)dW(t), that is, integration w.r.t. some Brownian motion W(t) adapted to filtration F(t). The integrand ∆(t) is also an adapted stochastic process that is F(t) measurable for each t ≥ 0. Since Brownian increments are independent, and since ∆(t) is F(t) measurable, ∆(t) is independent of future Brownian increments too. The Ito integral arises from the inability to differentiate paths of the Brownian motion with respect to t. In particular, for a differentiable g(t) we can write ∫_0^T ∆(t)dg(t) = ∫_0^T ∆(t)g'(t)dt, but this Lebesgue form is unavailable in the context of Brownian motions.

Definition 439 (Ito Integral for Simple Integrands ∆(t)). Let Π = {t0 , t1 · · · tn } be a partition set for
[0, T ] such that 0 = t0 ≤ t1 ≤ · · · ≤ tn = T . Assume ∆(t) is constant for each t ∈ [tj , tj+1 ) and this
process is called a simple process. The value of ∆(t) is motivated by the underlying Brownian motion,
which is then dependent on the drawing from ω ∈ Ω sample space. Since no information is available at
t = 0, then ∆(0) is constant for all possible Brownian paths. If we treat ∆(t) as position sizing and W (t)
as price processes, then the portfolio profit and loss statements are easily expressed:

I(t) = ∆(t0 )[W (t) − W (t0 )] = ∆(0)W (t) t ∈ [t0 , t1 ] (3612)


I(t) = ∆(0)W (t1 ) + ∆(t1 ) [W (t) − W (t1 )] t ∈ [t1 , t2 ] (3613)

and this can be generalized for t ∈ [tk , tk+1 ] we can write


k−1
X
I(t) = ∆(tj ) [W (tj+1 ) − W (tj )] + ∆(tk ) [W (t) − W (tk )] (3614)
j=0

and the process I(t) is called the Ito integral of the simple process ∆(t). We express this in integral
format as
Z t
I(t) = ∆(u)dW (u). (3615)
0

Theorem 409 (Ito Integral from Martingale Trading is Martingale). a) Show the Ito integral in Defini-
tion 439 is martingale. b) Then show that for M (t) martingale, ∆(t) simple process adapted to filtration
F(t) with partition set Π = {0 = t0 < t1 < · · · < tn = T } - that the stochastic integral
k−1
X
I(t) = ∆(tj ) [M (tj+1 ) − M (tj )] + ∆(tj ) [M (t) − M (tk )] (3616)
j=0

is also martingale.

Proof. -
a) We start by writing 0 ≤ s < t ≤ T . Let the partition set be Π = {ti |ti < ti+1 } for [0, T ], and assume
s, t are in different partitions - if not choose a finer partition. Denote the partition s is in as [tl , tl+1 )

and that for t as [tk , tk+1 ). Write the Ito process expressed in summation form (see Equation 3614) as
k−1
X
I(t) = ∆(tj ) [W (tj+1 ) − W (tj )] + ∆(tk ) [W (t) − W (tk )] (3617)
j=0
l−1
X
= ∆(tj ) [W (tj+1 ) − W (tj )] + ∆(tl ) [W (tl+1 ) − W (tl )] (3618)
j=0
k−1
X
+ ∆(tj ) [W (tj+1 ) − W (tj )] + ∆(tk ) [W (t) − W (tk )] (3619)
j=l+1

and we would like to show that E[I(t)|F(s)] = I(s). There are four terms to consider. See that the first
Pl−1 Pl−1
term is F(s) measurable such that E[ j=0 ∆(tj ) [W (tj+1 ) − W (tj )] |F(s)] = j=0 ∆(tj ) [W (tj+1 ) − W (tj )].
For the second term see that

E[∆(tl ) [W (tl+1 ) − W (tl )] |F(s)] = ∆(tl )E [W (tl+1 ) − W (tl )|F(s)] = ∆(tl )(W (s) − W (tl )) (3620)

by expectation linearity and martingale property. The sum of these two conditional expectations give us
I(s). Then for us to show martingale property, the remaining two terms need to sum to zero. For the
third term, consider for each j we can write

E [∆(tj ) [W (tj+1 ) − W (tj )] |F(s)] = E {E [∆(tj ) [W (tj+1 ) − W (tj )] |F(tj )] |F(s)} (3621)
= E {∆(tj ) (E[W (tj+1 )|F(tj )] − W (tj )) |F(s)} (3622)
= E {∆(tj )(W (tj ) − W (tj ))|F(s)} = 0. (3623)

We used linearity of expectation, taking out measurable random variables, iterated expectation (see
Theorem 356) property and martingale property of the Brownian motion. The sum of these terms
involving j is correspondingly zero, and the third term evaluates to zero. The last term has similar
proof, write

E {∆(tk )(W (t) − W (tk ))|F(s)} = E [E {∆(tk ) [W (t) − W (tk )] |F(tk )} |F(s)] (3624)
= E [∆(tk )E [W (t)|F(tk )] − W (tk )|F(s)] (3625)
= E [∆(tk )(W (tk ) − W (tk )|F(s)] = 0. (3626)

Then I(t) is martingale.


b) Now for the general martingale case, see that only the iterated expectation property, adaptivity,
measurability, linearity and martingale property were used in a). These are all satisfied in b) and the
result holds for stochastic calculus with respect to a general martingale.

Recall that I(t) can be framed as the profit from trading a martingale. Trivially, I(0) = 0, and therefore EI(t) = 0 for t > 0. Then I(t) is a centred random variable when evaluated at a time prior to t and has variance Var I(t) = EI²(t).
Theorem 410 (Ito Isometry). The Ito integral as defined in Equation 3615 satisfies
Z t
2
EI (t) = E ∆2 (u)du. (3627)
0
Proof. Write Dj = W (tj+1 ) − W (tj ) for j ∈ [k − 1], and Dk = W (t) − W (tk ) and see that we can express
k−1
X
I(t) = ∆(tj ) [W (tj+1 ) − W (tj )] + ∆(tk ) [W (t) − W (tk )] see Equation 3614 (3628)
j=0
k
X
= ∆(tj )Dj . (3629)
j=0

Then we have
k
X X
I 2 (t) = ∆2 (tj )Dj2 + 2 ∆(ti )∆(tj )Di Dj . (3630)
j=0 i<j

Since ∆(ti )∆(tj )Di is F(tj ) measurable and Dj ⊥ F(tj ) then ∆(ti )∆(tj )Di ⊥ Dj - their expectation
factors and since EDj = 0 the sum of cross terms evaluate to zero expectation. By same reasoning see
that ∆2 (tj ) ⊥ Dj2 and from Brownian motion variance (see Definition 432) we have EDj2 = tj+1 − tj for
j ∈ [k − 1], and EDk2 = t − tk . See that we may write
k
X k
X
EI 2 (t) = E∆2 (tj )Dj2 = E∆2 (tj ) · EDj2 (3631)
j=0 j=1
 
k−1
X
=  E∆2 (tj )(tj+1 − tj ) + E∆2 (tk )(t − tk ). (3632)
j=0

Since ∆(t) is simple process (see Definition 439) then ∆(tj ) is constant over [tj , tj+1 ) and we have
R tj+1 2 Rt
∆2 (tj )(tj+1 − tj ) =
tj
∆ (u)du, and ∆2 (tk )(t − tk ) = tk ∆2 (u)du. Then

k−1
X Z tj+1 Z t
2 2
EI (t) = E ∆ (u)du + E ∆2 (u)du (3633)
j=0 tj tk
 
k−1
X Z tj+1 Z t Z t
= E ∆2 (u)du + ∆2 (u)du = E ∆2 (u)du. (3634)
j=0 tj tk 0

Theorem 411 (Quadratic Variation of the Ito Integral). Consider the Ito integral as in Equation 3615
expressed as a process in its upper limit of integration t. Brownian motion (see Definition 432) accu-
mulates quadratic variation of one unit per time. See that the Brownian motion is scaled in a time and
Rt
path dependent way by the integrand ∆(u) under the integral 0 ∆(u)dW (u). The quadratic variation
accumulated to time t by the Ito integral is
Z t
[I, I](t) = ∆2 (u)du. (3635)
0

In differential form this is written dI(t)dI(t) = ∆2 (t)dt.

Proof. Consider a single subinterval for which the simple process ∆(u) (see Definition 439) is constant
over. Let this be [tj , tj+1 ]. Partition this subinterval such that we have tj = s0 < s1 < · · · < sm = tj+1 .
Consider the sampled quadratic variation on this partition for the interval written
m−1
X m−1
X m−1
X
2 2
[I(si+1 ) − I(si )]2 = [∆(tj )(W (si+1 ) − W (si ))] = ∆2 (tj ) (W (si+1 ) − W (si )) . (3636)
i=0 i=0 i=0
Pm−1 2
As m → ∞ the max partition step → 0 and i=0 (W (si+1 ) − W (si )) → (tj+1 − tj ), by the Brownian
motion quadratic variation (see Theorem 402). Then by limiting arguments the quadratic variation
Rt
accumulated by the Ito integral over this subinterval is written ∆2 (tj )(tj+1 − tj ) = tjj+1 ∆2 (u)du.
Rt
The ‘last piece’ has quadratic variation tk ∆2 (u)du. Piece together the different subintervals, and the
quadratic variation accumulated by I(t) is
Z t
[I, I](t) = ∆2 (u)du. (3637)
0

Rt
The alternative proof using differential forms can be obtained. Write the Ito integral 0
∆(u)dW (u) in
differential form

dI(t) = ∆(t)dW (t) (3638)

and take square to get dI(t)dI(t) = ∆2 (t)dW (t)dW (t) = ∆2 (t)dt, which is to say that the Ito integral
accumulates quadratic variation at rate ∆2 (t) per unit time. This is in general both time and path
dependent.

Compare the Ito Isometry (see Theorem 410) which specifies the variance of an Ito integral, and
the quadratic variation accumulated by an Ito integral (see Theorem 411). In particular, the former is
Rt Rt
written VarI(t) = EI 2 (t) = E 0 ∆2 (u)du and the latter is written 0 ∆2 (u)du. It is now clear that the
quadratic variation of the Ito integral is path dependent - large processes/positions ∆(u) leads to large
quadratic variations and vice versa - it is integral over a random process that is therefore random itself.
2
The variance is an expectation over all such possible paths in the population, and is non-random.

Definition 440 (Differential and Integral Forms). The notations


Z t
I(t) = ∆(u)dW (u) ≡ dI(t) = ∆(t)dW (t). (3639)
0

The integral form is precise in its formulation as in Definition 439. Note that when going from the
differential form to the integral form we need to remember the constant of integration I(0). That is, to
be precise the integral is
Z t
I(t) − I(0) = ∆(u)dW (u) (3640)
0

and the two forms are equivalent if the initial condition I(0) = 0.

We would like to define the integral for when ∆(t) is no longer a simple process. That is, we would
like ∆(t) to vary continuously with time and include jumps.

Definition 441 (Ito Integral for General Integrands ∆(t)). Assume that ∆(t) is an adapted stochastic process equipped with the filtration F(t) for t ≥ 0 and that the square integrability condition

E\int_0^T ∆²(t)dt < ∞   (3641)

is fulfilled. Then choose a partition set 0 = t_0 < t_1 < · · · < t_n = T and for each partition, assume an approximating simple integrand that takes the value ∆(t_j) at each t_j and holds it through [t_j, t_{j+1}). In general, it is possible to choose a sequence ∆_n(t) of simple processes such that we have

\lim_{n→∞} E\int_0^T |∆_n(t) − ∆(t)|² dt = 0,   (3642)

which is to say that our method approximates a continuously varying integrand as the partition size gets smaller. The definition of the Ito integral for a continuously varying integrand is then given

\int_0^t ∆(u)dW(u) = \lim_{n→∞} \int_0^t ∆_n(u)dW(u),  t ∈ [T].   (3643)

Since this is defined in terms of the Ito integral for simple processes (see Definition 439) the general Ito integral inherits these properties.
2 Note that this is distinct from the sampled or empirical variance, which we often just call variance in literature and
practice - and to be pedantic - is a misnomer.

Theorem 412 (Properties of Ito Integrals). For T > 0 and ∆(t) adapted to F(t) satisfying square integrability (see Definition 441) on t ∈ [T], the Ito integral I(t) = \int_0^t ∆(u)dW(u) satisfies

1. Continuity: as a function of the upper limit of integration, t, the paths of I(t) are continuous.

2. Adaptivity: ∀t ∈ [T], I(t) is F(t) measurable.

3. Linearity: For I(t) = \int_0^t ∆(u)dW(u), J(t) = \int_0^t Γ(u)dW(u), linearity holds on aI(t) + bJ(t) for constant a, b.

4. Martingale: I(t) is martingale.

5. Ito Isometry: EI²(t) = E\int_0^t ∆²(u)du.

6. Quadratic Variation: [I, I](t) = \int_0^t ∆²(u)du.
Exercise 654 (Ito Integral of the Brownian Process, Shreve [19]). Compute \int_0^T W(t)dW(t). Choose an equal sized partition by taking some large integer n, for which ∆_n(t) = W(iT/n) for values of t ∈ [iT/n, (i+1)T/n) and i ∈ [n − 1]. Then \lim_{n→∞} E\int_0^T |∆_n(t) − W(t)|² dt = 0 and by definition we can write

\int_0^T W(t)dW(t) = \lim_{n→∞} \int_0^T ∆_n(t)dW(t) = \lim_{n→∞} \sum_{j=0}^{n−1} W\!\left(\frac{jT}{n}\right)\left[W\!\left(\frac{(j+1)T}{n}\right) − W\!\left(\frac{jT}{n}\right)\right].   (3644)

Writing the short form W_j for W(jT/n), see that

\frac{1}{2}\sum_{j=0}^{n−1}(W_{j+1} − W_j)^2 = \frac{1}{2}\sum_{j=0}^{n−1} W_{j+1}^2 − \sum_{j=0}^{n−1} W_j W_{j+1} + \frac{1}{2}\sum_{j=0}^{n−1} W_j^2   (3645)
= \frac{1}{2}\sum_{k=1}^{n} W_k^2 − \sum_{j=0}^{n−1} W_j W_{j+1} + \frac{1}{2}\sum_{j=0}^{n−1} W_j^2   (3646)
= \frac{1}{2}W_n^2 + \frac{1}{2}\sum_{k=0}^{n−1} W_k^2 − \sum_{j=0}^{n−1} W_j W_{j+1} + \frac{1}{2}\sum_{j=0}^{n−1} W_j^2  \quad\text{since } W_0 = 0   (3647)
= \frac{1}{2}W_n^2 + \sum_{j=0}^{n−1} W_j^2 − \sum_{j=0}^{n−1} W_j W_{j+1}   (3648)
= \frac{1}{2}W_n^2 + \sum_{j=0}^{n−1} W_j(W_j − W_{j+1}).   (3649)

It follows that

\sum_{j=0}^{n−1} W\!\left(\frac{jT}{n}\right)\left[W\!\left(\frac{(j+1)T}{n}\right) − W\!\left(\frac{jT}{n}\right)\right] = \frac{1}{2}W^2\!\left(\frac{nT}{n}\right) − \frac{1}{2}\sum_{j=0}^{n−1}\left[W\!\left(\frac{(j+1)T}{n}\right) − W\!\left(\frac{jT}{n}\right)\right]^2,   (3650)

and see that as n → ∞, Equation 3650 becomes

\int_0^T W(t)dW(t) = \frac{1}{2}W^2(T) − \frac{1}{2}[W, W](T) = \frac{1}{2}W^2(T) − \frac{1}{2}T.   (3651)

In contrast, for a differentiable function g with initial condition g(0) = 0 the ordinary calculus results would yield:

\int_0^T g(t)dg(t) = \int_0^T g(t)g'(t)dt = \frac{1}{2}g^2(t)\Big|_0^T = \frac{1}{2}g^2(T).   (3652)

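The convergence in Equation 3644 can be checked on a single simulated path (a minimal sketch, not from the text; the grid sizes are illustrative): the left-endpoint sums approach ½W²(T) − ½T as the partition is refined.

```python
# Minimal sketch: left-endpoint (Ito) sums of ∫ W dW on one fixed Brownian path.
import numpy as np

rng = np.random.default_rng(1)
T, n = 1.0, 2**16
dW = rng.normal(0.0, np.sqrt(T / n), size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])            # W(t_j) on the fine grid, j = 0..n

for m in (2**4, 2**8, 2**12, 2**16):                   # coarser partitions built from the same path
    step = n // m
    Wm = W[::step]                                     # W sampled at the m-point partition
    ito_sum = np.sum(Wm[:-1] * np.diff(Wm))            # Σ W(t_j)[W(t_{j+1}) − W(t_j)]
    print(m, ito_sum, 0.5 * W[-1]**2 - 0.5 * T)        # converges to ½W²(T) − ½T
```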
Exercise 655 (Stratonovich Integral of the Brownian Motion). Find the Stratonovich integral for \int_0^T W(t)dW(t), which computes the integral with the integrand evaluated at the middle point of each subinterval. That is, find

\int_0^T W(t)dW(t) = \lim_{n→∞} \int_0^T ∆_n(t)dW(t) = \lim_{n→∞} \sum_{j=0}^{n−1} W\!\left(\frac{(j+\frac{1}{2})T}{n}\right)\left[W\!\left(\frac{(j+1)T}{n}\right) − W\!\left(\frac{jT}{n}\right)\right].   (3653)

In the general case, we can let the partition set be non equally spaced points, such that we can write Π = {0 = t_0 < t_1 < · · · < t_n = T} and define t_j^* = \frac{t_j + t_{j+1}}{2} to be the midpoint of the subinterval [t_j, t_{j+1}]. Show that
i) the half-sample quadratic variation, given

Q_{Π/2} = \sum_{j=0}^{n−1} \left(W(t_j^*) − W(t_j)\right)^2   (3654)

has limit T/2 as the partition size goes to zero,
ii) and that the Stratonovich integral of W(t) w.r.t W(t), written

\int_0^t W(t) ◦ dW(t) = \lim_{‖Π‖→0} \sum_{j=0}^{n−1} W(t_j^*)[W(t_{j+1}) − W(t_j)] = \frac{1}{2}W^2(T).   (3655)

Proof. -
i) For the first case we can show ℓ² convergence, that is E Q_{Π/2} = T/2 and that the variance in its limit is zero. See that

E Q_{Π/2} = \sum_{j=0}^{n−1} E\left(W(t_j^*) − W(t_j)\right)^2 = \sum_{j=0}^{n−1} (t_j^* − t_j) = \sum_{j=0}^{n−1} \left(\frac{t_j + t_{j+1}}{2} − t_j\right) = \sum_{j=0}^{n−1} \frac{t_{j+1} − t_j}{2} = \frac{T}{2}.   (3656)

See we can write (Zeng [20]) that W(t_j^*) − W(t_j) is distributed as W\!\left(\frac{t_{j+1} − t_j}{2}\right), and that

E[(W^2(t) − t)^2] = E[W^4(t) + t^2 − 2tW^2(t)] = 3(EW^2(t))^2 − 2t^2 + t^2 = 2t^2.   (3657)

See
 2 
n−1
X 2 T
VarQ Π = E  W (t∗j ) − W (tj ) −   (3658)
 
2
j=0
2
 2 
n−1 n−1
X 2 X tj+1 − tj  
W (t∗j ) − W (tj ) −

= E  (3659)

2

j=0 j
  
n−1   2
X 2 tj+1 − tj
= E  W (t∗j ) − W (tj ) − (3660)
  
2

j=0

n−1   
X 2 tj+1 − tj 2 tk+1 − tk
= 2 E W (t∗j ) − W (tj ) − ∗
(W (tk ) − W (tk )) − (3661)
2 2
j<k
n  2
X 2 tj+1 − tj
+ E W (t∗j ) − W (tj ) − (3662)
j=0
2
n−1
" 2 #
X tj+1 − tj
= 0+ E W 2
(t∗j − tj ) − (3663)
j=0
2
n−1  
X
4 ∗ tj+1 − tj 2 2 ∗
= E W (tj − tj ) + ( ) − W (tj − tj )(tj+1 − tj ) (3664)
j=0
2
n−1  2
X tj+1 − tj
= 3(t∗j − tj ) + 2
− (t∗j − tj )(tj+1 − tj ) (3665)
j=0
2
n−1 2  2
Xtj+1 − tj tj+1 − tj tj+1 − tj
= 3 + − (tj+1 − tj ) (3666)
j=0
2 2 2
n  2
X tj+1 − tj
= 2· (3667)
j=0
2
T
≤ max |tj+1 − tj | (3668)
2 j∈[n]
See that this variance term goes to zero as partition size goes to zero.
ii) For the second part see that we can write Shreve [19], Zeng [20]
n−1
X
W (t∗j )(W (tj+1 ) − W (tj )) (3669)
j=0
n−1
X
W (t∗j ) [W (tj+1 ) − W (t∗j )] + [W (t∗j ) − W (tj )]

= (3670)
j=0
n−1
X n−1
X
W (t∗j )[W (tj+1 ) − W (t∗j )] + W (tj )[W (t∗j ) − W (tj )] + [W (t∗j ) − W (tj )]2 . (3671)

=
j=0 j=0
RT
As lim see that the first summation term in Equation 3671 goes to W (t)dW (t), where the partition
0
kΠ∗ k→0
Pn−1
Π∗ corresponds to 0 = t0 < t∗0 < t1 < t∗1 < · · · < t∗n−1 < tn = T . Using the result lim ∗
j=0 [W (tj ) −
kΠk→0
T
W (tj )]2 = 2 from part 1. see that (compare to Equation 3651)
Z T n−1 Z T
X T 1
W (t) ◦ dW (t) = lim W (t∗j )(W (tj+1 ) − W (tj )) = W (t)dW (t) + = W 2 (T ). (3672)
0 kΠk→0
j=0 0 2 2

Obviously, the Stratonovich form is not suitable in the context of financial trading - positions for a time period need to be decided at the beginning of the period itself. For functions with continuous derivatives, both the Ito and Stratonovich integrals agree. It is functions with nonzero quadratic variation for which the integrals are sensitive to where the approximating integrands are evaluated.
We can verify that the Ito integral of the Brownian motion is martingale - take E\int_0^t W(u)dW(u) = E(\frac{1}{2}W^2(t) − \frac{1}{2}t) = \frac{1}{2}EW^2(t) − \frac{1}{2}t = 0.

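The sensitivity to the evaluation point can be seen numerically with a brief sketch (not from the text; names and seeds are illustrative): on the same Brownian path, the midpoint (Stratonovich) sums exceed the left-endpoint (Ito) sums by approximately the half-sample quadratic variation T/2.

```python
# Minimal sketch comparing left-endpoint (Ito) and midpoint (Stratonovich) evaluation rules.
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0, 200000                            # n subintervals; midpoints live on a 2n grid
dW = rng.normal(0.0, np.sqrt(T / (2 * n)), size=2 * n)
W = np.concatenate([[0.0], np.cumsum(dW)])    # W on the refined grid of 2n steps

W_left, W_mid, W_right = W[0:-1:2], W[1::2], W[2::2]
ito = np.sum(W_left * (W_right - W_left))     # Σ W(t_j)[W(t_{j+1}) − W(t_j)]
strat = np.sum(W_mid * (W_right - W_left))    # Σ W(t_j*)[W(t_{j+1}) − W(t_j)]

print("Ito sum          :", ito, "≈ ½W²(T) − ½T =", 0.5 * W[-1]**2 - 0.5 * T)
print("Stratonovich sum :", strat, "≈ ½W²(T)     =", 0.5 * W[-1]**2)
print("difference       :", strat - ito, "≈ T/2 =", T / 2)
```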
13.2.2 Ito Doeblin Formula


Recall that for differentiable f, g the chain rule gives us
d
f (g(t)) = f 0 (g(t))g 0 (t) (3673)
dt
with differential form

df (g(t)) = f 0 (g(t))g 0 (t)dt = f 0 (g(t))dg(t). (3674)

Definition 442 (Ito Doeblin Formula). Recall that the Brownian motion is not time differentiable due
to nonzero quadratic variations. Therefore the correct form is
1
df (W (t)) = f 0 (W (t))dW (t) + f 00 (W (t))dt. (3675)
2
This is known as the Ito Doeblin differential form, and the integral form is written
Z t
1 t 00
Z
f (W (t)) − f (W (0)) = f 0 (W (u))dW (u) + f (W (u))du (3676)
0 2 0
which consists of both the Ito integral form and the Lebesgue integral form.

See that for some integral \int f(u)f'(u)du, applying the change of variable v = f(u) then dv = f'(u)du and the integral \int f(u)f'(u)du = \int v\,dv = \frac{1}{2}v^2 + C. We formalize the Ito Doeblin formula in the context of Brownian motions.

Theorem 413 (Ito Doeblin for Brownian Motion, Shreve [19]). Let f (t, x) be function for which the
partial derivatives ft , fx , fxx are defined and continuous. Let W (t) be Brownian motion adapted to F(t)
and the Ito Doeblin formula for Brownian motion states that
Z T Z T
1 T
Z
f (T, W (T )) − f (0, W (0)) = ft (t, W (t))dt + fx (t, W (t))dW (t) + fxx (t, W (t))dt. (3677)
0 0 2 0
The differential form for this is
1
df (t, W (t)) = ft (t, W (t))dt + fx (t, W (t))dW (t) + fxx (t, W (t))dW (t)dW (t) (3678)
2
1
+ftx (t, W (t))dtdW (t) + ftt (t, W (t))dtdt (3679)
2
1
= ft (t, W (t))dt + fx (t, W (t))dW (t) + fxx (t, W (t))dt. (3680)
2
Proof. We show proof for the specific case when f(x) = \frac{1}{2}x^2. By Taylor expansion and using f^{(1)}(x) = x, f^{(2)}(x) = 1 obtain

f(x_{j+1}) − f(x_j) = f^{(1)}(x_j)(x_{j+1} − x_j) + \frac{1}{2}f^{(2)}(x_j)(x_{j+1} − x_j)^2.   (3681)

This is exact. For T > 0 and partition set Π = {t0 < t1 < · · · < tn } for [0, T ] see that we can write
n−1
X
f (W (T )) − f (W (0)) = [f (W (tj+1 )) − f (W (tj ))] (3682)
j=0
n−1
X
= f (1) (W (tj ))[W (tj+1 ) − W (tj )] + (3683)
j=0
n−1
1 X (2)
f (W (tj ))[W (tj+1 ) − W (tj )]2 . (3684)
2 j=0

For f(x) = \frac{1}{2}x^2 this yields \sum_{j=0}^{n−1} W(t_j)[W(t_{j+1}) − W(t_j)] + \frac{1}{2}\sum_{j=0}^{n−1} [W(t_{j+1}) − W(t_j)]^2. See that as
kΠk → 0 we obtain
n−1
X 1
f (W (T )) − f (W (0)) = lim W (tj )[W (tj+1 ) − W (tj )] + [W, W ](T ) (3685)
kΠk→0
j=0
2
Z T
∆ 1
= W (t)dW (t) + T (3686)
0 2
Z T Z T
1
= f (1) (W (t))dW (t) + f (2) (W (t))dt f (1) (x) = x, f (2) (x) = 1.
0 2 0

Now see that for any other general function f(x), we would obtain terms in [W(t_{j+1}) − W(t_j)]^3, and recall from Exercise 650 that \sum_{j=0}^{n−1} [W(t_{j+1}) − W(t_j)]^3 → 0 as ‖Π‖ → 0, so the final form is unchanged. For a
bivariate f (t, x), in similar fashion consider the Taylor expansion

f (tj+1 , xj+1 ) − f (tj , xj ) = ft (tj , xj )(tj+1 − tj ) + fx (tj , xj )(xj+1 − xj ) (3687)


1
+ fxx (tj , xj )(xj+1 − xj )2 (3688)
2
+ftx (tj , xj )(tj+1 − tj )(xj+1 − xj ) (3689)
1
+ ftt (tj , xj )(tj+1 − tj )2 (3690)
2
+O (xj+1 − xj )3 + O (tj+1 − tj )3 .
 
(3691)

Then

f (T, W (T )) − f (0, W (0)) (3692)


n−1
X
= [f (tj+1 , W (tj+1 )) − f (tj , W (tj ))] (3693)
j=0
n−1
X n−1
X
= ft (tj , W (tj ))(tj+1 − tj ) + fx (tj , W (tj ))(W (tj+1 ) − W (tj )) (3694)
j=0 j=0
n−1
1X
+ fxx (tj , W (tj ))(W (tj+1 ) − W (tj ))2 (3695)
2 j=0
n−1
X
+ ftx (tj , W (tj ))(tj+1 − tj )(W (tj+1 ) − W (tj )) (3696)
j=0
n−1
1X
+ ftt (tj , W (tj ))(tj+1 − tj )2 (3697)
2 j=0
+O (W (tj+1 ) − W (tj ))3 + O (tj+1 − tj )3 .
 
(3698)

We treat each of these five terms separately. As ‖Π‖ → 0 the first term goes to \int_0^T f_t(t, W(t))dt. The second term goes to \int_0^T f_x(t, W(t))dW(t). The third term is \frac{1}{2}\int_0^T f_{xx}(t, W(t))dt, since in the limit of ‖Π‖ → 0 we can write (W(t_{j+1}) − W(t_j))^2 = t_{j+1} − t_j (see Theorem 402 for the result). Now we just need to show that the fourth and fifth terms are zero in the limit. Write
n−1
X
lim ftx (tj , W (tj ))(tj+1 − tj )(W (tj+1 ) − W (tj )) (3699)
kΠk→0
j=0
n−1
X
≤ lim |ftx (tj , W (tj ))| (tj+1 − tj ) |(W (tj+1 ) − W (tj ))| (3700)
kΠk→0
j=0
n−1
X
≤ lim max |(W (tk+1 ) − W (tk ))| · lim |ftx (tj , W (tj ))| (tj+1 − tj ) (3701)
kΠk→0 0≤k≤n−1 kΠk→0
j=0
Z T
=0· |ftx (t, W (t))|dt = 0. (3702)
0

For the last term see that


n−1
1X
lim ftt (tj , W (tj ))(tj+1 − tj )2 (3703)
kΠk→0 2 j=0
n−1
1X
≤ lim |ftt (tj , W (tj ))| · (tj+1 − tj )2 (3704)
kΠk→0 2
j=0
n−1
1 X
≤ lim max (tk+1 − tk ) · lim |ftt (tj , W (tj ))| · (tj+1 − tj ) (3705)
2 kΠk→0 0≤k≤n−1 kΠk→0
j=0
Z T
1
= ·0· |ftt (t, W (t))|dt = 0. (3706)
2 0

The Taylor approximation to first order is written

f(W(t_{j+1})) − f(W(t_j)) = f^{(1)}(W(t_j))(W(t_{j+1}) − W(t_j)) + O(\epsilon)   (3707)

and to second order is written

f(W(t_{j+1})) − f(W(t_j)) = f^{(1)}(W(t_j))(W(t_{j+1}) − W(t_j))   (3708)
    + \frac{1}{2}f^{(2)}(W(t_j))(W(t_{j+1}) − W(t_j))^2 + O(\tilde{\epsilon})   (3709)

where O(\tilde{\epsilon}) ⊂ O(\epsilon). In other words, the second order approximation has smaller error due to convexity adjustments from curvature in f(x). As ‖Π‖ → 0, the errors in Equation 3707 and Equation 3708 go to zero. However, in the summation of these errors for computing f(W(T)) − f(W(0)), only the second order approximation's sum of errors has limit zero as ‖Π‖ → 0. The convexity adjustment is required because the path of the Brownian motion is so volatile as to have nonzero quadratic variation.

Definition 443 (Ito Process). Let W (t), t ≥ 0 be Brownian motion adapted to F(t), then an Ito process
is a stochastic process that satisfies
Z t Z t
X(t) − X(0) = ∆(u)dW (u) + Θ(u)du. (3710)
0 0

X(0) is the initial condition for the stochastic process that is nonrandom, and ∆(u), Θ(u) are adapted
stochastic processes to F(t). (see Definition 308). We assume square integrability and finiteness assump-
Rt Rt
tions such that E 0 ∆2 (u)du, 0 |Θ(u)|du are finite on t > 0. The differential form is written

dX(t) = ∆(t)dW (t) + Θ(t)dt. (3711)

Almost all stochastic processes developed here without jumps are Ito processes of the kind in Defini-
tion 443.

Theorem 414 (Quadratic Variation of the Ito Process). The quadratic variation of the Ito process is
computed
Z t
[X, X](t) = ∆2 (u)du. (3712)
0

In differential form this is dX(t)dX(t) = ∆2 (t)dt.


Rt Rt
Proof. Denote the Ito integrals I(t) = 0 ∆(u)dW (u) and R(t) = 0 Θ(u)du. Both processes are contin-
uous in their upper limit of integration t. Choose partition set for [0, t] and denote it Π = {0 = t0 <
t1 · · · < tn = t}. The sampled quadratic variation w.r.t to this partition is computed
n−1
X n−1
X n−1
X
2 2 2
[X(tj+1 ) − X(tj )] = [I(tj+1 ) − I(tj )] + [R(tj+1 ) − R(tj )] (3713)
j=0 j=0 j=0
n−1
X
+2 [I(tj+1 ) − I(tj )] [R(tj+1 ) − R(tj )] . (3714)
j=0

Pn−1 2 Rt
See that as kΠk → 0 by definition j=0 [I(tj+1 ) − I(tj )] → [I, I](t) which equals 0
∆2 (u)du. The
magnitude of the second term can be bounded
n−1
X
max |R(tk+1 ) − R(tk )| |R(tj+1 ) − R(tj )| (3715)
0≤k≤n−1
j=0
n−1
X Z tj+1
= max |R(tk+1 ) − R(tk )| Θ(u)du (3716)
0≤k≤n−1 tj
j=0
n−1
X Z tj+1
≤ max |R(tk+1 ) − R(tk )| |Θ(u)| du (3717)
0≤k≤n−1 tj
j=0
Z t
= max |R(tk+1 ) − R(tk )| |Θ(u)| du (3718)
0≤k≤n−1 0
which evaluates to 0 · \int_0^t |Θ(u)|du = 0 since we assumed finite integrability, and that R(t) is continuous.
Last but not least, the magnitude of the cross term is bounded by writing

2\max_{0≤k≤n−1}|I(t_{k+1}) − I(t_k)| \cdot \sum_{j=0}^{n−1}|R(t_{j+1}) − R(t_j)| ≤ 2\max_{0≤k≤n−1}|I(t_{k+1}) − I(t_k)| \cdot \int_0^t |Θ(u)|du.   (3719)

Again since I(t) is assumed continuous the cross term has zero limits. Then [X, X](t) = [I, I](t) =
Rt 2
0
∆ (u)du.
Alternatively, see the Equation 3711 for the differential form of Ito process, then we can write
dX(t)dX(t) to be

(∆(t)dW (t) + Θ(t)dt)2 = ∆2 (t)dW (t)dW (t) + 2∆(t)Θ(t)dW (t)dt + Θ2 (t)dtdt = ∆2 (t)dt. (3720)

At each time t the process X accumulates quadratic variation at rate ∆2 (t) per unit time. Then
Rt
[X, X](t) = 0 ∆2 (u)du and this is solely due to the contribution of the quadratic variation of the
Rt Rt
Ito integral I(t) = 0 ∆(u)dW (u). The ordinary integral R(t) = 0 Θ(u)du accumulates no quadratic
variation.

We see that R(t) is not as volatile as I(t), although it is still allowed to be a random variable! For a small time increment h we have R(t + h) ≈ R(t) + hΘ(t). The estimate for I(t + h) over the same time increment is written I(t + h) ≈ I(t) + (W(t + h) − W(t))∆(t). Unlike the other form, we do not have (W(t + h) − W(t)) at time t and the estimate has higher uncertainty. Having defined integrals w.r.t time and w.r.t a Brownian motion (ordinary integrals and Ito integrals respectively), we look to formally develop integrals w.r.t an Ito process.

Definition 444 (Ito Integral w.r.t Ito Process). Let X(t) be an Ito process (see Definition 443) and
let Γ(t) be an adapted stochastic process (see Definition 308) defined for t ≥ 0. Then the integral with
respect to the Ito process X(t) is given
Z t Z t Z t
Γ(u)dX(u) = Γ(u)∆(u)dW (u) + Γ(u)Θ(u)du. (3721)
0 0 0
Assume square integrability and integrability of Γ(u)∆(u) such that E\int_0^t Γ²(u)∆²(u)du and \int_0^t |Γ(u)Θ(u)|du are finite.

Theorem 415 (Ito Doeblin Formula for an Ito Process). Let X(t) be Ito process (see Definition 443)
defined for t ≥ 0 and let f (t, x) be function with partial derivatives ft , fx , fxx defined. For all T ≥ 0 we
have
Z T Z T Z T
1
f (T, X(T )) − f (0, X(0)) = ft (t, X(t))dt + fx (t, X(t))dX(t) + fxx (t, X(t))d[X, X](t)
0 0 2 0
Z T Z T
= ft (t, X(t))dt + fx (t, X(t))∆(t)dW (t) (3722)
0 0
Z T Z T
1
+ fx (t, X(t))Θ(t)dt + fxx (t, X(t))∆2 (t)dt. (3723)
0 2 0

In differential form this is written


1
df (t, X(t)) = ft (t, X(t))dt + fx (t, X(t))dX(t) + fxx (t, X(t))dX(t)dX(t). (3724)
2
Since dX(t) = ∆(t)dW (t) + Θ(t)dt and dX(t)dX(t) = ∆2 (t)dt (see Theorem 414) the form above can be
written in dt and dW (t) forms to be
1
df (t, X(t)) = ft (x, X(t))dt + fx (t, X(t))∆(t)dW (t) + fx (t, X(t))Θ(t)dt + fxx (t, X(t))∆2 (t)dt.
2
Proof. Refer to the Taylor expansion and similar derivations for the Ito Doeblin formula on Brownian

motion. (Theorem 413). See that we can write

f (T, X(T )) − f (0, X(0)) (3725)


n−1
X
= [f (tj+1 , X(tj+1 )) − f (tj , X(tj ))] (3726)
j=0
n−1
X n−1
X
= ft (tj , X(tj ))(tj+1 − tj ) + fx (tj , X(tj ))(X(tj+1 ) − X(tj )) (3727)
j=0 j=0
n−1
1X
+ fxx (tj , X(tj ))(X(tj+1 ) − X(tj ))2 (3728)
2 j=0
n−1
X
+ ftx (tj , X(tj ))(tj+1 − tj )(X(tj+1 ) − X(tj )) (3729)
j=0
n−1
1X
+ ftt (tj , X(tj ))(tj+1 − tj )2 (3730)
2 j=0
+O (X(tj+1 ) − X(tj ))3 + O (tj+1 − tj )3 .
 
(3731)

Again, the last two terms Taylor-expanded out are zero, written informally dtdt = 0 and dtdX(t) = 0
RT
and we consider the first three terms. See the first term evaluates to 0 ft (t, X(t))dt in limit of kΠk → 0.
For the second term the limiting value is
Z T Z T Z T
fx (t, X(t))dX(t) = fx (t, X(t))∆(t)dW (t) + fx (t, X(t))Θ(t)dt. (3732)
0 0 0

The third term has limiting value


Z T Z T
1 1
fxx (t, X(t))d[X, X](t) = fxx (t, X(t))∆2 (t)dt. (3733)
2 0 2 0

Exercise 656 (Generalized Geometric Brownian Motion). Let W (t) be a Brownian motion adapted to
F(t) for t ≥ 0 and let α(t), σ(t) be adapted stochastic processes. Define the Ito process
Z t Z t 
1
X(t) = σ(s)dW (s) + α(s) − σ 2 (s) ds. (3734)
0 0 2

See that it has differential form dX(t) = σ(t)dW (t) + α(t) − 21 σ 2 (t) dt. It follows that


   2
1
dX(t)dX(t) = σ(t)dW (t) + α(t) − σ 2 (t) dt (3735)
2
= σ 2 (t)dt. (3736)

Consider the asset price process given by S(t) = S(0) exp(X(t)). Then we can write
Z t Z t  
1
S(t) = S(0) exp σ(s)dW (s) + α(s) − σ 2 (s) ds . (3737)
0 0 2

The initial condition S(0) is set to be positive, and is not random. Then S(t) = f (X(t)) where
(1) (2)
f (x) = S(0) exp(x) and it has derivatives f (x) = f (x) = S(0) exp(x). By Ito Doeblin formula

(see Definition 442) we can write
1
dS(t) = df (X(t)) = f (1) (X(t))dX(t) + f (2) (X(t))dX(t)dX(t) (3738)
2
1
= S(0) exp(X(t))dX(t) + S(0) exp(X(t))dX(t)dX(t) (3739)
2
1
= S(t)dX(t) + S(t)dX(t)dX(t) (3740)
 2   
1 1
= S(t) σ(t)dW (t) + α(t) − σ 2 (t) dt + S(t)σ 2 (t)dt (3741)
2 2
= α(t)S(t)dt + σ(t)S(t)dW (t). (3742)

This models the dynamics of asset price evolution and states that S(t) has instantaneous (mean) rate of return α(t) and volatility σ(t). Both are allowed to be stochastic processes in t. Our model captures the price evolution under the assumptions that prices are strictly positive, have no jumps, and are driven by a single Brownian motion. Since σ(t), α(t) are random and are allowed to be functions of time, it is not necessarily true that S(t) is log-normal. If they were assumed constant then we would have the geometric Brownian motion model as in Definition 435, modelling S(t) = S(0) exp{σW(t) + (α − \frac{1}{2}σ^2)t}. Recall that S(0) exp{σW(t) − \frac{1}{2}σ^2 t} is martingale (see Exercise 436), where the subtraction of \frac{1}{2}σ^2 t adjusts for the upward drift imposed by the convexity of the function exp(σx). Adding αt gives us a price process with mean rate of return α.

If the mean rate of return α(t) = 0 for all t, then dS(t) = σ(t)S(t)dW(t) and by integration see that we have S(t) = S(0) + \int_0^t σ(s)S(s)dW(s), and since the RHS is a constant plus an Ito integral it is martingale. Then S(t) = S(0) exp\{\int_0^t σ(s)dW(s) − \frac{1}{2}\int_0^t σ^2(s)ds\} is martingale. For non-zero values of α(t) we call it the instantaneous mean rate of return. It is path and time dependent.

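As a quick illustration (a minimal sketch, not from the text; the constant α, σ and the seed are illustrative), an Euler–Maruyama discretization of dS = αS dt + σS dW can be compared against the exact solution S(0)exp(σW(t) + (α − ½σ²)t) driven by the same Brownian path.

```python
# Minimal sketch of GBM: Euler–Maruyama on the SDE vs the closed-form solution (Eq. 3746).
import numpy as np

rng = np.random.default_rng(3)
S0, alpha, sigma, T, n = 100.0, 0.05, 0.2, 1.0, 100000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=n)

S_euler = S0
for dw in dW:                                   # Euler–Maruyama step of dS = αS dt + σS dW
    S_euler += alpha * S_euler * dt + sigma * S_euler * dw

W_T = dW.sum()
S_exact = S0 * np.exp(sigma * W_T + (alpha - 0.5 * sigma**2) * T)   # closed form on same path
print(S_euler, S_exact)                         # close for a fine partition
```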
Exercise 657 (Solving the Stochastic Differential Equation for GBM Stock Price). Solve the stochastic
differential equation

dS(t) = α(t)S(t)dt + σ(t)S(t)dW (t). (3743)


Proof. First, consider f(x) = log(x) with partial derivatives f_x(x) = \frac{1}{x} and f_{xx}(x) = −\frac{1}{x^2}. Then by Ito Doeblin (Theorem 415) write

d log S(t) = \frac{1}{S(t)}dS(t) − \frac{1}{2S^2(t)}dS(t)dS(t) = \frac{1}{S(t)}(α(t)S(t)dt + σ(t)S(t)dW(t)) − \frac{1}{2S^2(t)}σ^2(t)S^2(t)dt
           = \left(α(t) − \frac{1}{2}σ^2(t)\right)dt + σ(t)dW(t).   (3744)
Integrating this form we obtain
Z t  Z t
1 2
log S(t) = log S(0) + α(t) − σ (t) dt + σ(t)dW (t) (3745)
0 2 0

which when exponentiated gives us


Z t Z t  
1 2
S(t) = S(0) exp σ(u)dW (u) + α(u) − σ (u) du , (3746)
0 0 2
the solution to the SDE.

Exercise 658. For geometric Brownian motion


   
1
S(t) = S(0) exp σW (t) + α − σ 2 t (3747)
2
find d(S p (t)) for p ∈ R+ .

Proof. First see that f(x) = x^p has derivatives f^{(1)}(x) = px^{p−1} and f^{(2)}(x) = p(p − 1)x^{p−2}. See also from Exercise 657 that this is modelled by the SDE

dS(t) = αS(t)dt + σS(t)dW (t). (3748)

Now by Ito Doeblin (Theorem 415) get


1
d(S p (t)) = pS p−1 (t)dS(t) + p(p − 1)S p−2 (t)dS(t)dS(t) (3749)
2
p−1 1
(t) (αS(t)dt + σS(t)dW (t)) + p(p − 1)S p−2 (t) σ 2 S 2 (t)dt

= pS (3750)
 2 
1
= S (t) pαdt + pσdW (t) + p(p − 1)σ 2 dt
p
(3751)
2
   
1
= pS p (t) σdW (t) + α + (p − 1)σ 2 dt . (3752)
2

Exercise 659. -

1. Compute dW 4 (t). Write W 4 (t) as sum of Lebesgue integral and Ito integral.

2. Show that EW 4 (T ) = 3T 2 .

3. Find EW 6 (T ).

Proof. -

1. For f (x) = x4 with partial derivatives fx (x) = 4x3 and fxx (x) = 12x2 and application of Ito
Doeblin form in Theorem 415 gives
1
dW 4 (t) = 4W 3 (t)dW (t) + 12W 2 (t)dW (t)dW (t) (3753)
2
= 4W 3 (t)dW (t) + 6W 2 (t)dt. (3754)

Integrate to get
Z t Z t
4 3
W (t) = 4 W (t)dW (t) + 6 W 2 (t)dt. (3755)
0 0

2. Taking expectations see that the first expectation term is martingale trading (see Theorem 409)
with zero expectation. The second term evaluates to
T T
t2 T
Z Z
6 E[W 2 (t)]dt = 6 tdt = 6 | = 3T 2 . (3756)
0 0 2 0

3. With similar approach take


dW^6(t) = 6W^5(t)dW(t) + \frac{1}{2} \cdot 30W^4(t)dt   (3757)

so that W^6(t) = ψ + 15\int_0^t W^4(s)ds where Eψ = 0 and 15\int_0^T EW^4(t)dt = 15\int_0^T 3t^2 dt = 15T^3 (see Exercise 575), the final expectation form EW^6(T) = 15T^3.

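These moment results are easy to check by Monte Carlo (a quick sketch, not from the text; the horizon and sample size are illustrative), since W(T) ∼ N(0, T).

```python
# Minimal sketch checking EW⁴(T) = 3T² and EW⁶(T) = 15T³.
import numpy as np

rng = np.random.default_rng(4)
T, n_samples = 2.0, 2_000_000
WT = rng.normal(0.0, np.sqrt(T), size=n_samples)   # samples of W(T)

print("E W^4(T):", np.mean(WT**4), "vs 3T^2  =", 3 * T**2)
print("E W^6(T):", np.mean(WT**6), "vs 15T^3 =", 15 * T**3)
```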
Theorem 416 (Ito Integral of Deterministic Integrands). Let W(s) be Brownian motion adapted to F(s) and ∆(s) be a deterministic function of time. Define I(t) = \int_0^t ∆(s)dW(s). Then, for each t ≥ 0 the random variable I(t) is distributionally Φ(0, σ² = \int_0^t ∆²(s)ds).

Proof. Since I(t) is martingale and the initial condition is I(0) = 0, then EI(t) = 0. Then it is centred and by Ito isometry (see Theorem 410) we have VarI(t) = EI²(t) = \int_0^t ∆²(s)ds, and we drop the expectation term since ∆(s) is deterministic. We show normality using the mgf of I(t). See that we want to obtain (see Theorem 377) E exp(uI(t)) = exp\{\frac{1}{2}u^2\int_0^t ∆²(s)ds\}, or that E exp\{uI(t) − \frac{1}{2}u^2\int_0^t ∆²(s)ds\} = 1. By definition of I(t) write E exp\{\int_0^t u∆(s)dW(s) − \frac{1}{2}\int_0^t (u∆(s))^2 ds\} = 1. Now realize that the term inside the expectation is a martingale, and in particular it is a geometric Brownian motion (see Definition 435) with α = 0, σ(s) = u∆(s). See it has initial condition of one, and we are done. We used the non-randomness of ∆(s) to move the RHS of our original equation into the expectation term. Without this determinism (of ∆(s)) we cannot say that \int_0^t ∆(s)dW(s) is normal.
(of ∆(s)) then we cannot say that 0 ∆(s)dW (s) is normal.

Exercise 660. For ∆(t) that is nonrandom simple process and Brownian motion W (t) adapted to F(t)
define the stochastic integral
k−1
X
I(t) = ∆(tj )[W (tj+1 ) − W (tj )] + ∆(tk )[W (t) − W (tk )], t ∈ [tk , tk+1 ]. (3758)
j=0

Then

1. Show that for 0 ≤ s < t ≤ T , we have (I(t) − I(s)) ⊥ F(s).


Rt
2. Show [I(t) − I(s)] ∼ Φ(0, σ 2 = s ∆2 (u)du).

3. Use 1, 2, to show that I(t) is martingale.


Rt
4. Show that I 2 (t) − 0 ∆2 (u)du is martingale.

Proof. -

1. We show this by showing that for arbitrary t_l < t_k belonging to two different partition points, (I(t_k) − I(t_l)) ⊥ F(t_l). This is valid since the partition can be defined arbitrarily fine. Then we have

I(t_k) − I(t_l) = \sum_{j=l}^{k−1} ∆(t_j)[W(t_{j+1}) − W(t_j)].   (3759)

∆(t_j) is deterministic for all j, and since each increment is independent of F(t_j) it must be independent of F(t_l). The whole term is independent of F(t_l).

2. By Theorem 416 we know that the I(·)'s are each normal, and the Ito increment is a linear combination of independent normals by the result from 1. Express Var(I(t_k) − I(t_l)) = \sum_{j=l}^{k−1} ∆²(t_j)Var(W(t_{j+1}) − W(t_j)) = \sum_{j=l}^{k−1} ∆²(t_j)(t_{j+1} − t_j) = \int_{t_l}^{t_k} ∆²(u)du and we are done.

3. EI(t)|F(s) = E[I(t) − I(s) + I(s)|F(s)] = I(s).

4.
 Z t  Z s  
E I 2 (t) − ∆2 (u)du − I 2 (s) − ∆2 (u)du |F(s) (3760)
0 0
Z t
= E I 2 (t) − I 2 (s)|F(s) − ∆2 (u)du
 
(3761)
s
Z t
= E (I(t) − I(s))2 − 2I 2 (s) + 2I(t)I(s)|F(s) − ∆2 (u)du
 
(3762)
s
Z t
= E (I(t) − I(s))2 − 2I(s)(I(s) − I(t))|F(s) − ∆2 (u)du
 
(3763)
s
Z t Z t
2
= ∆ (u)du − ∆2 (u)du (3764)
s s
= 0. (3765)

Exercise 661. For ∆(t) that is random simple process and Brownian motion W (t) adapted to F(t)
define the stochastic integral
k−1
X
I(t) = ∆(tj )[W (tj+1 ) − W (tj )] + ∆(tk )[W (t) − W (tk )], t ∈ [tk , tk+1 ]. (3766)
j=0

Then

1. See if for 0 ≤ s < t ≤ T , we have (I(t) − I(s)) ⊥ F(s).


Rt
2. See if [I(t) − I(s)] ∼ Φ(0, σ 2 = s ∆2 (u)du).

3. See if I(t) is martingale when ∆(s) = W (s).


Rt
4. See if that I 2 (t) − 0 ∆2 (u)du is martingale when ∆(s) = W (s).

Proof. -

1. We show by construction for arbitrary tl < tk . Let ∆(s) = W (s), then the term written
k−1
X
I(tk ) − I(tl ) = W (tj )[W (tj+1 ) − W (tj )]. (3767)
j=l

W(t_j)[W(t_{j+1}) − W(t_j)] is not independent of F(t_j) - the property is not satisfied.

2. We can show it is not normal by example when ∆(s) = W(s). We show that the fourth moment does not match 3σ⁴. By generalization of Equation 3767

I(t) − I(s) = W(s)[W(t) − W(s)]   (3768)

so that

E[I(t) − I(s)]^4 = EW_s^4 \cdot E[W_t − W_s]^4 = 3s^2 \cdot 3(t − s)^2 = 9s^2(t − s)^2.   (3769)

On the other hand

E[I(t) − I(s)]^2 = EW_s^2 \cdot E[W_t − W_s]^2 = s(t − s),   (3770)

giving 3σ⁴ = 3s^2(t − s)^2 ≠ 9s^2(t − s)^2, and we see it is not normal.

3.

E[I(t) − I(s)|F(s)] = W (s)E[W (t) − W (s)|F(s)] = 0. (3771)

It is martingale.

4.
 Z t  Z s  
E I 2 (t) − ∆2 (u)du − I 2 (s) − ∆2 (u)du |F(s) (3772)
0 0
 Z t 
= E I 2 (t) − I 2 (s) + ∆2 (u)du|F(s) (3773)
s
 Z t 
2 2 2
= E (I(t) − I(s)) + 2I(t)I(s) − 2I (s) + ∆ (u)du|F(s) (3774)
s
 Z t 
= E (I(t) − I(s))2 + 2I(s)(I(t) − I(s)) + W 2 (u)du|F(s) (3775)
s
= W 2 (s)(t − s) − W 2 (s)(t − s) (3776)
= 0. (3777)

Exercise 662 (Vasicek Interest Rate Model). The Vasicek model says that the interest rate process R(t) can be determined via the stochastic differential equation

dR(t) = (α − βR(t))dt + σdW(t).   (3778)

Here α, β, σ ∈ R_+ and W(t) is Brownian motion. A stochastic differential equation models the evolution of a random process in differential form, involving the process itself and the differential of a Brownian motion. Verify that the closed form solution for the Vasicek model under the specified dynamics is written

R(t) = exp(−βt)R(0) + \frac{α}{β}(1 − exp(−βt)) + σ exp(−βt)\int_0^t exp(βs)dW(s).   (3779)

Proof. Define

f(t, x) = exp(−βt)R(0) + \frac{α}{β}(1 − exp(−βt)) + σ exp(−βt)x,

then

R(t) = f\left(t, X(t) = \int_0^t exp(βs)dW(s)\right).   (3780)

Essentially what we did was to split the function into a deterministic value plus the product of a deterministic value and an Ito process. By the Ito Doeblin formula (see Theorem 415) we can find df(t, X(t)). In particular
see the partial derivatives are respectively

ft (t, x) = −β exp(−βt)R(0) + α exp(−βt) − σβ exp(−βt)x = α − βf (t, x), (3781)


fx (t, x) = σ exp(−βt), (3782)
fxx (t, x) = 0. (3783)

The differential we require in correspondence to the non-zero partial derivatives are dX(t) = exp(βt)dW (t),
and by Ito Doeblin we arrive at df (t, X(t)) =
1
ft (t, X(t))dt + fx (t, X(t))dX(t) + fxx (t, X(t))dX(t)dX(t) = (α − βf (t, X(t))) dt + σdW (t). (3784)
2

We showed that the function f(t, X(t)) satisfies the stochastic differential equation, with matching initial conditions f(0, X(0)) = R(0). Then f(t, X(t)) = R(t) for all t ≥ 0. By the Ito integral of a deterministic integrand (see Theorem 416) the term \int_0^t exp(βs)dW(s) ∼ Φ(0, σ² = \int_0^t exp(2βs)ds = \frac{1}{2β}\{exp(2βt) − 1\}). Using the relation given by Equation 3780, it follows that

R(t) ∼ Φ\left(exp(−βt)R(0) + \frac{α}{β}(1 − exp(−βt)),\; \frac{σ^2}{2β}(1 − exp(−2βt))\right),   (3785)

where only the X(t) is stochastic. The Vasicek model specifies a mean-reverting process for the interest rate. See that when R(t) = α/β, the drift term is zero. When R(t) > α/β the drift is negative, and it is positive in the remaining case. In particular, if R(0) = α/β then ER(t) = α/β, and otherwise \lim_{t→∞} ER(t) = α/β.

Exercise 663. Using the Vasicek model differential form

dR(t) = (α − βR(t))dt + σdW (t) (3786)

derive the closed form Vasicek solution, Equation 3779.

Proof. For function f (t, x) = exp(βt)x with partial derivatives ft (x) = β exp(βt)x, fx (x) = exp(βt) and
fxx (x) = 0, by Ito Doeblin (Theorem 415) write

d(exp(βt)R(t)) = β exp(βt)R(t)dt + exp(βt)dR(t) (3787)


= β exp(βt)R(t)dt + exp(βt) ((α − βR(t))dt + σdW (t)) (3788)
= exp(βt) [βR(t)dt + ((α − βR(t))dt + σdW (t))] (3789)
= exp(βt)(αdt + σdW (t)). (3790)

Now integrating write


Z t
exp(βt)R(t) = R(0) + exp(βs) (αds + σdW (s)) (3791)
0
Z t
α
= R(0) + (exp(βt) − 1) + σ exp(βs)dW (s). (3792)
β 0

Multiply by exp(−βt) on both sides and we are done.

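A short simulation sketch (not from the text; parameter values and names are illustrative) of the Vasicek dynamics checks the sampled mean and variance of R(T) against the moments implied by Equation 3785.

```python
# Minimal sketch: Euler–Maruyama for dR = (α − βR)dt + σ dW vs the closed-form moments.
import numpy as np

rng = np.random.default_rng(5)
alpha, beta, sigma, R0, T, n, n_paths = 0.05, 1.5, 0.02, 0.01, 2.0, 500, 100000
dt = T / n

R = np.full(n_paths, R0)
for _ in range(n):
    R += (alpha - beta * R) * dt + sigma * np.sqrt(dt) * rng.normal(size=n_paths)

mean_cf = np.exp(-beta * T) * R0 + alpha / beta * (1 - np.exp(-beta * T))   # Eq. 3785 mean
var_cf = sigma**2 / (2 * beta) * (1 - np.exp(-2 * beta * T))                # Eq. 3785 variance
print("mean:", R.mean(), "vs", mean_cf)
print("var :", R.var(), "vs", var_cf)
```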
In the era prior to Modern Monetary Theory (MMT), interest rate processes were thought to remain strictly positive. The Vasicek model permits negative rates, which was thought to be a big downside flaw, and the Cox-Ingersoll-Ross (CIR) model was used to alleviate this 'problem' by defining interest rate processes that could not enter negative territory. However, the likes of Europe and, most notably, Japan undertook negative rate monetary policies, and what was once the strength of the CIR model is now considerably limiting. If Vasicek, Cox, Ingersoll and Ross were in 'The Office', Vasicek would go 'Well, Well, Well, How the Turntables...'.

Exercise 664 (CIR Model). The interest rate process R(t) modelled by the stochastic differential equation

dR(t) = (α − βR(t))dt + σ\sqrt{R(t)}dW(t)   (3793)

for Brownian motion W(t) and positive constants α, β, σ is said to follow the Cox-Ingersoll-Ross model. The CIR model, however, has no closed form solution. The CIR model is also mean reverting. Find the expectation and variance of R(t).

Proof. Define f(t, x) = exp(βt)x. By Ito Doeblin (see Theorem 415) write

d(exp(βt)R(t)) = df(t, R(t))   (3794)
= f_t(t, R(t))dt + f_x(t, R(t))dR(t) + \frac{1}{2}f_{xx}(t, R(t))dR(t)dR(t)   (3795)
= β exp(βt)R(t)dt + exp(βt)(α − βR(t))dt + exp(βt)σ\sqrt{R(t)}dW(t)   (3796)
= α exp(βt)dt + σ exp(βt)\sqrt{R(t)}dW(t).   (3797)

Integration yields

exp(βt)R(t) = R(0) + \int_0^t α exp(βu)du + σ\int_0^t exp(βu)\sqrt{R(u)}dW(u)   (3798)
= R(0) + \frac{α}{β}(exp(βt) − 1) + σ\int_0^t exp(βu)\sqrt{R(u)}dW(u).   (3799)

The last term has expectation zero since it is an Ito integral, and

ER(t) = exp(−βt)R(0) + \frac{α}{β}(1 − exp(−βt)),   (3800)

the same as the Vasicek model (see the distribution specified by Equation 3785). For the variance write X(t) = exp(βt)R(t) and see from Equation 3797 that

dX(t) = α exp(βt)dt + σ exp(βt)\sqrt{R(t)}dW(t) = α exp(βt)dt + σ exp\!\left(\frac{βt}{2}\right)\sqrt{X(t)}dW(t).   (3801)

See also that EX(t) = exp(βt)ER(t) = R(0) + \frac{α}{β}(exp(βt) − 1), a restatement of Equation 3800. By the Ito
Product rule (see Corollary 30) see that

d(X^2(t)) = 2X(t)dX(t) + dX(t)dX(t)   (3802)
= 2X(t)\left(α exp(βt)dt + σ exp\!\left(\frac{βt}{2}\right)\sqrt{X(t)}dW(t)\right) + σ^2 exp(βt)X(t)dt   (3803)
= 2α exp(βt)X(t)dt + 2σ exp\!\left(\frac{βt}{2}\right)X^{3/2}(t)dW(t) + σ^2 exp(βt)X(t)dt.

Integrate to get

X^2(t) − X^2(0) = (2α + σ^2)\int_0^t exp(βu)X(u)du + 2σ\int_0^t exp\!\left(\frac{βu}{2}\right)X^{3/2}(u)dW(u).   (3804)

The expectation of the last term is zero (it is an Ito integral) and see that
EX^2(t) = X^2(0) + (2α + σ^2)\int_0^t exp(βu)EX(u)du   (3805)
= R^2(0) + (2α + σ^2)\int_0^t exp(βu)\left(R(0) + \frac{α}{β}(exp(βu) − 1)\right)du   (3806)
= R^2(0) + (2α + σ^2)\int_0^t exp(βu)R(0)du + (2α + σ^2)\int_0^t \frac{α}{β}exp(βu)(exp(βu) − 1)du   (3807)
= R^2(0) + \frac{2α + σ^2}{β}\left(R(0) − \frac{α}{β}\right)(exp(βt) − 1) + \frac{2α + σ^2}{2β}\cdot\frac{α}{β}(exp(2βt) − 1).   (3808)

Verify this last step in Equation 3808. Then

ER^2(t) = exp(−2βt)EX^2(t)   (3809)
= exp(−2βt)R^2(0) + \frac{2α + σ^2}{β}\left(R(0) − \frac{α}{β}\right)(exp(−βt) − exp(−2βt))   (3810)
\quad + \frac{α(2α + σ^2)}{2β^2}(1 − exp(−2βt)).   (3811)

It follows that

Var(R(t)) = ER^2(t) − (ER(t))^2   (3812)
= exp(−2βt)R^2(0) + \frac{2α + σ^2}{β}\left(R(0) − \frac{α}{β}\right)(exp(−βt) − exp(−2βt))   (3813)
\quad + \frac{α(2α + σ^2)}{2β^2}(1 − exp(−2βt)) − exp(−2βt)R^2(0)   (3814)
\quad − \frac{2α}{β}R(0)(exp(−βt) − exp(−2βt)) − \frac{α^2}{β^2}(1 − exp(−βt))^2   (3815)
= \frac{σ^2}{β}R(0)(exp(−βt) − exp(−2βt)) + \frac{ασ^2}{2β^2}(1 − 2exp(−βt) + exp(−2βt)).   (3816)

Verify this last step in Equation 3816. As t → ∞ see that we have \lim_{t→∞} Var(R(t)) = \frac{ασ^2}{2β^2}.

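A simulation sketch (not from the text; the full-truncation Euler scheme and all parameter values are illustrative choices) can be used to check the long-run mean α/β and long-run variance ασ²/(2β²) derived above.

```python
# Minimal sketch: full-truncation Euler for dR = (α − βR)dt + σ√R dW, long-run moments.
import numpy as np

rng = np.random.default_rng(6)
alpha, beta, sigma, R0, T, n, n_paths = 0.06, 1.2, 0.15, 0.05, 20.0, 4000, 50000
dt = T / n

R = np.full(n_paths, R0)
for _ in range(n):
    # negative excursions are floored inside the square root (full truncation)
    R += (alpha - beta * R) * dt + sigma * np.sqrt(np.maximum(R, 0.0) * dt) * rng.normal(size=n_paths)

print("long-run mean:", R.mean(), "vs alpha/beta            =", alpha / beta)
print("long-run var :", R.var(),  "vs alpha*sigma^2/(2beta^2)=", alpha * sigma**2 / (2 * beta**2))
```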
13.2.3 Black Scholes Merton


We obtain the partial differential equation for the price of an option with underlying asset modelled by
geometric Brownian motion.

13.2.3.1 Evolution of Portfolio Value

Let the portfolio value follow X(t) at time t. Let the money market account pay rate r. The portfolio
is invested in stock and cash markets. Stock assumes a geometric Brownian motion modelled by

dS(t) = αS(t)dt + σS(t)dW (t). (3817)

W (t) here is Brownian motion adapted to filtration F(t). The shares held in stock at time t is ∆(t), also
adapted to F(t). The amount invested at rate r is then X(t) − ∆(t)S(t). The change in portfolio value
is then simply

dX(t) = ∆(t)dS(t) + r (X(t) − ∆(t)S(t)) dt (3818)


= ∆(t) (αS(t)dt + σS(t)dW (t)) + r (X(t) − ∆(t)S(t)) dt (3819)
= rX(t)dt + ∆(t)(α − r)S(t)dt + ∆(t)σS(t)dW (t). (3820)

The evolution of the portfolio is composed of mean risk-free rate of return r on portfolio, risk premium
and stock volatility. In discrete settings this is also written as

Xn+1 = ∆n (Sn+1 ) + (1 + r) (Xn − ∆n Sn ) (3821)

and the portfolio value change in single time period is

Xn+1 − Xn = ∆n (Sn+1 − Sn ) + r (Xn − ∆n Sn ) . (3822)

For derivative pricing we would like a discounted valuation to account for the risk-free rate of return.
Then for f (t, x) = exp(−rt)x by Ito Doeblin on Ito processes (see Theorem 415) we have

d (exp(−rt)S(t)) = df (t, S(t)) (3823)


1
= ft (t, S(t))dt + fx (t, S(t))dS(t) + fxx (t, S(t))dS(t)dS(t) (3824)
2
= −r exp(−rt)S(t)dt + exp(−rt)dS(t) (3825)
= −r exp(−rt)S(t)dt + exp(−rt) (αS(t)dt + σS(t)dW (t)) (3826)
= (α − r) exp(−rt)S(t)dt + σ exp(−rt)S(t)dW (t). (3827)

We also have

d(exp(−rt)X(t)) = df (t, X(t)) (3828)


1
= ft (t, X(t))dt + fx (t, X(t))dX(t) + fxx (t, X(t))dX(t)dX(t) (3829)
2
= −r exp(−rt)X(t)dt + exp(−rt)dX(t) (3830)

See from Equation 3820 that

rX(t)dt = dX(t) − {∆(t)(α − r)S(t)dt + ∆(t)σS(t)dW (t)} (3831)

Then substituting into the equation we get

−r exp(−rt)X(t)dt + exp(−rt)dX(t) (3832)


= − exp(−rt) {dX(t) − ∆(t)(α − r)S(t)dt − ∆(t)σS(t)dW (t)} + exp(−rt)dX(t) (3833)
= ∆(t)(α − r) exp(−rt)S(t)dt + ∆(t)σ exp(−rt)S(t)dW (t). (3834)

Then

d(exp(−rt)X(t)) = ∆(t)d (exp(−rt)S(t)) . (3835)

The change in discounted portfolio value is solely due to change in discounted stock price, which after
discounting has mean rate of return only containing the risk premium (α − r).

13.2.3.2 Evolution of Option Value

Consider European call on underlying S struck at K with maturity T and therefore terminal payoff
+
(S(T ) − K) for K > 0, and let the value of this call option at any time t ∈ [T ] be denoted c(t, x) when
stock price at t is S(t) = x. We want to determine the future option values in terms of future stock
prices, and by Ito Doeblin (see Theorem 415) we write
1
dc(t, S(t)) = ct (t, S(t))dt + cx (t, S(t))dS(t) + cxx (t, S(t))dS(t)dS(t) (3836)
2
1
= ct (t, S(t))dt + cx (t, S(t)) (αS(t)dt + σS(t)dW (t)) + cxx (t, S(t))σ 2 S 2 (t)dt Thm. 414
 2
1
= ct (t, S(t)) + αS(t)cx (t, S(t)) + σ 2 S 2 (t)cxx (t, S(t)) dt + σS(t)cx (t, S(t))dW (t). (3837)
2
We are more interested in the discounted value of the European call though, so computing the
differential and using Ito Doeblin on f (t, x) = exp(−rt)x we obtain
1
d (exp(−rt)c(t, S(t))) = ft (t, c(t, S(t)))dt + fx (t, c(t, S(t)))dc(t, S(t)) + fxx (t, c(t, S(t)))dc(t, S(t))dc(t, S(t))
2
= −r exp(−rt)c(t, S(t))dt + exp(−rt)dc(t, S(t)), (3838)

for which we already computed dc(t, S(t)). Let us further substitute these terms and write

d (exp(−rt)c(t, S(t))) = −r exp(−rt)c(t, S(t))dt (3839)


+ exp(−rt) · (3840)
  
1 2 2
ct (t, S(t)) + αS(t)cx (t, S(t)) + σ S (t)cxx (t, S(t)) dt + σS(t)cx (t, S(t))dW (t)
2
(3841)
 
1
= exp(−rt) −rc + ct + αS(t)cx + σ 2 S 2 (t)cxx dt + exp(−rt)σS(t)cx dW (t) (3842)
2
where the arguments t, S(t) to function c and partial derivatives of c are abbreviated out.

13.2.3.3 Deriving the BSM PDE

We want to hedge against a short position in an option. Specifying initial capital X(0), we invest in
stock and money market such that ∀t ∈ [T ], X(t) = c(t, S(t)), or equivalently that exp(−rt)X(t) =
exp(−rt)c(t, S(t)). If the initial condition X(0) = c(0, S(0)) and their evolutions were equal, specifically

d(exp(−rt)X(t)) = d (exp(−rt)c(t, S(t))) ∀t ∈ [T ] (3843)

then we would satisfy such hedging requirements. See this by integrating both sides. Then comparing
Equation 3834 and Equation 3842 the hedging condition holds iff

∆(t)(α − r)S(t)dt + ∆(t)σS(t)dW (t) = (3844)


 
1
−rc + ct + αS(t)cx + σ 2 S 2 (t)cxx dt + σS(t)cx dW (t) (3845)
2

For the Brownian term to match we require ∆(t) = cx (t, S(t)) for all t ∈ [0, T ). This is the delta-
hedging rule, and says that we need units of stock equivalent to the partial derivative of the value of call
option with respect to stock price. This is also called option delta.
For the dt terms to match we require that
1
(α − r)S(t)cx (t, S(t)) = −rc + ct + αS(t)cx + σ 2 S 2 (t)cxx . (3846)
2
This occurs when
1
rc(t, S(t)) = ct (t, S(t)) + rS(t)cx (t, S(t)) + σ 2 S 2 (t)cxx (t, S(t)) t ∈ [0, T ). (3847)
2
Therefore this equation,

c_t(t, x) + rxc_x(t, x) + \frac{1}{2}σ^2x^2c_{xx}(t, x) = rc(t, x)  ∀t ∈ [0, T), x ≥ 0,   (3848)

is called the Black Scholes partial differential equation. To solve for the price of a European call
option we want to solve the partial differential equation with the terminal condition c(T, x) = (x − K)+ .
Then the investor starting with initial capital X(0) = c(0, S(0)) and continuously holding ∆(t) stock
units successfully hedges her short position for all possible stock price paths in t ∈ [0, T ). As t → T
and using continuity of X(t), c(t, S(t)) we can conclude that X(T ) = c(T, S(T )) = (S(T ) − K)+ . Our
hedging problem is solved.

13.2.4 Verifying the Black Scholes Merton Solution


The Black Scholes Merton (BSM) partial differential equation (PDE) is of the backward parabolic type. Equations of this type need boundary conditions at x = 0, x = ∞ and terminal conditions to determine the solution. When x = 0 it is trivial to see c_t(t, 0) = rc(t, 0), an ordinary differential equation. We verify the solution is c(t, 0) = exp(rt)c(0, 0). When t = T see that c(T, 0) = (0 − K)^+ = 0 and therefore c(0, 0) = 0, and hence c(t, 0) = 0 for all t. This is the boundary condition for x = 0. When x → ∞ see that c(t, x) has the same growth rate as x itself, and we can write

lim [c(t, x) − (x − exp(−r(T − t))K)] = 0 ∀t ∈ [T ]. (3849)


x→∞

The solution to the BSM PDE (Equation 3848) with terminal conditions as above is

c(t, x) = xΦ(d+ (τ, x)) − K exp(−rτ )Φ(d− (τ, x)) 0 ≤ t < T, x > 0, (3850)

where τ = T − t, Φ is the c.d.f of the standard normal Φ(0, 1) (see Section 6.17.6) and

d_±(τ, x) = \frac{1}{σ\sqrt{τ}}\left[log\frac{x}{K} + \left(r ± \frac{σ^2}{2}\right)τ\right].   (3851)

As one can see this is a function of five parameters, and we write BSM(τ, x, K, r, σ) to denote the Black Scholes Merton function.

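The formula of Equations 3850 and 3851 translates directly into code (a short sketch; bsm_call is an illustrative name, not the text's notation).

```python
# Minimal sketch of the BSM(τ, x, K, r, σ) call-pricing function of Eq. 3850/3851.
import numpy as np
from scipy.stats import norm

def bsm_call(tau: float, x: float, K: float, r: float, sigma: float) -> float:
    """European call value c(t, x) with time to maturity tau = T - t."""
    d_plus = (np.log(x / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d_minus = d_plus - sigma * np.sqrt(tau)
    return x * norm.cdf(d_plus) - K * np.exp(-r * tau) * norm.cdf(d_minus)

# illustrative inputs: one year to maturity, at-the-money strike
print(bsm_call(tau=1.0, x=100.0, K=100.0, r=0.02, sigma=0.2))
```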
13.2.5 Option Greeks


Lemma 20. For

d_±(τ, x) = \frac{1}{σ\sqrt{τ}}\left[log\frac{x}{K} + \left(r ± \frac{σ^2}{2}\right)τ\right]   (3852)

it is easy to see that

d_−(τ, x) = d_+ − σ\sqrt{τ},   (3853)
d_+(τ, x) = d_− + σ\sqrt{τ}.   (3854)

Exercise 665. Recall that the solution to the European call pricing problem for strike K maturity T at
t with underlying x is given by Equation 3850 and repeated here

c(t, x) = xΦ(d+ (τ, x)) − K exp(−rτ )Φ(d− (τ, x)) 0 ≤ t < T, x > 0. (3855)

Supposedly this satisfies the BSM PDE given by Equation 3848 and repeated here:
1
ct (t, x) + rxcx (t, x) + σ 2 x2 cxx (t, x) = rc(t, x) ∀t ∈ [0, T ), x ≥ 0, (3856)
2
as well as the terminal condition

lim c(t, x) = (x − K)+ , x > 0, x 6= K (3857)


t→T −

and boundary conditions

lim c(t, x) = 0 lim [c(t, x) − (x − exp(−r(T − t))K] = 0 t ∈ [0, T ). (3858)


x→0+ x→∞

1. Verify K exp(−rτ )Φ0 (d− ) = xΦ0 (d+ ).

2. Find the call option delta.

3. Find the call option theta.

4. Find the call option gamma.

5. Use delta and theta form to show BSM PDE satisfaction.

6. Show for x > K, limt→T − d± = ∞ and x ∈ (0, K) that limt→T − d± = −∞. Verify the terminal
condition.

7. Show for t ∈ [0, T ), limx→0+ d± = −∞. Verify boundary condition lim c(t, x) = 0.
x→0+

8. Show for t ∈ [0, T ), limx→∞ d± = ∞. Verify boundary condition


lim [c(t, x) − (x − exp(−r(T − t))K] = 0.
x→∞

Proof. -

1.

K exp(−rτ )Φ0 (d− ) (3859)


1 d2−
= K exp(−rτ ) √ exp(− ) (3860)
2π 2
1 1 √ 2
= K exp(−rτ ) √ exp(− d+ − σ τ ) from Equation 3853 (3861)
2π 2
1 1 √ 
= K exp(−rτ ) √ exp(− d2+ + σ 2 τ − 2d+ σ τ ) (3862)
2π 2
√ σ2 τ 1 1
= K exp(−rτ ) exp(σ τ d+ ) exp(− ) √ exp(− d2+ ) (3863)
2 2π 2
√ σ2 τ 0
= K exp(−rτ ) exp(σ τ d+ ) exp(− )Φ (d+ ) (3864)
2
√ xΦ0 (d+ ) σ2 τ 0
= K exp(−rτ ) exp(σ τ d+ ) 0
exp(− )Φ (d+ ) using the eqn. to verify
K exp(−rτ )Φ (d− ) 2
√ x Φ0 (d+ ) σ2 τ 0
= K exp(−rτ ) exp(σ τ d+ ) exp(rτ ) 0 exp(− )Φ (d+ ) (3865)
K Φ (d− ) 2
√1 exp(− 1 d2 )
√ x 2 + σ2 τ 0
= K exp(−rτ ) exp(σ τ d+ ) exp(rτ ) 12π 1 2
exp(− )Φ (d+ ) (3866)
K √ exp(− 2 d− ) 2

√ x 1 1 σ2 τ 0
= K exp(−rτ ) exp(σ τ d+ ) exp(rτ ) exp(− d2+ + d2− ) exp(− )Φ (d+ ) (3867)
K 2 2 2
√ x 1 1 √ σ2 τ 0
= K exp(−rτ ) exp(σ τ d+ ) exp(rτ ) exp(− d2+ + (d+ − σ τ )2 ) exp(− )Φ (d+ ) (3868)
K 2 2 2
√ x 1 1 √ σ2 τ 0
= K exp(−rτ ) exp(σ τ d+ ) exp(rτ ) exp(− d2+ + (d2+ + σ 2 τ − 2d+ σ τ )) exp(− )Φ (d+ )
K 2 2 2
√ x 1 √ σ2 τ 0
= K exp(−rτ ) exp(σ τ d+ ) exp(rτ ) exp( σ 2 τ − d+ σ τ ) exp(− )Φ (d+ ) (3869)
K 2 2
= xΦ0 (d+ ). (3870)

δc
2. Referencing Equation 3850 take δx =

Φ(d+ (τ, x)) + Φ0 (d+ (τ, x))d0+ (τ, x)x − K exp(−rτ )Φ0 (d− (τ, x))d0− (τ, x) (3871)
= Φ(d+ (τ, x)) + Φ0 (d+ (τ, x))d0+ (τ, x)x − xΦ0 (d+ (τ, x))d0− (τ, x) by part1. (3872)
= Φ(d+ (τ, x)) (3873)

is the option delta.


δc
3. Referencing Equation 3850 take δt =
δ δ
xΦ0 (d+ (τ, x)) d+ (τ, x) − rK exp(−rτ )Φ(d− (τ, x)) − K exp(−rτ )Φ0 (d− (τ, x)) d− (τ, x)
δt δt
0 δ 0 δ √ 
= xΦ (d+ ) d+ − rK exp(−rτ )Φ(d− ) − K exp(−rτ )Φ (d− ) d+ − σ τ see Eqn 3853
δt δt
δ δ √
= d+ [xΦ0 (d+ ) − K exp(−rτ )Φ0 (d− )] − rK exp(−rτ )Φ(d− ) + K exp(−rτ )Φ0 (d− ) (σ τ )
δt δt
δ 0 0 0 δ √
= d+ [xΦ (d+ ) − xΦ (d+ )] − rK exp(−rτ )Φ(d− ) + K exp(−rτ )Φ (d− ) (σ τ ) by part 1.
δt δt
1 1
= −rK exp(−rτ )Φ(d− ) + K exp(−rτ )Φ0 (d− )(−σ √ ) (3874)
2 τ
σx
= −rK exp(−rτ )Φ(d− (τ, x)) − √ Φ0 (d+ (τ, x)), by part 1. (3875)
2 τ

the value of option theta.

4. Taking part 1. the call gamma is cxx (t, x) =

δ δ 1
Φ(d+ (τ, x)) = Φ0 (d+ (τ, x)) d+ (τ, x) = Φ0 (d+ (τ, x)) √ (3876)
δx δx xσ τ

is call option gamma.

5. Substituting parts 2.-3. see that


1
ct (t, x) + rxcx (t, x) + σ 2 x2 cxx (t, x) (3877)
2
σx 1 2 2 δ
= −rK exp(−rτ )Φ(d− (τ, x)) − √ Φ0 (d+ (τ, x)) + rx · Φ(d+ (τ, x)) + σ x Φ(d+ (τ, x))
2 τ 2 δx
σx 1 2 2 0 1
= −rK exp(−rτ )Φ(d− (τ, x)) − √ Φ0 (d+ (τ, x)) + rx · Φ(d+ (τ, x)) + σ x Φ (d+ (τ, x)) √
2 τ 2 xσ τ
σx 1 1
= −rK exp(−rτ )Φ(d− (τ, x)) − √ Φ0 (d+ (τ, x)) + rx · Φ(d+ (τ, x)) + σxΦ0 (d+ (τ, x)) √
2 τ 2 τ
= −rK exp(−rτ )Φ(d− (τ, x)) + rx · Φ(d+ (τ, x)) (3878)
= r[xΦ(d+ (τ, x)) − K exp(−rτ )Φ(d− (τ, x))] (3879)
= rc(t, x). (3880)

6. See by referencing Equation 3851 that as τ → 0 we get


1 x
d± = √ (r + log ) (3881)
σ τ K

so that limτ →0+ d+ = limτ →0+ d− = ∞ when x > K, and that limτ →0+ d± = −∞ when x < K
and zero otherwise. Then the terminal condition
(
xΦ(∞) − K exp(−rτ )Φ(∞) = x − K
lim+ c(t, x) = (3882)
τ →0 xΦ(−∞) − K exp(−rτ )Φ(−∞) = 0
= (x − K)+ , x > 0, x 6= K (3883)

7. See by referencing Equation 3851 that as x → 0+ we get


1 x
d± = √ (−∞ + log ) = −∞. (3884)
σ τ K

Then

lim c(t, x) = xΦ(−∞) − K exp(−rτ )Φ(−∞) = 0. (3885)


x→0+

8. See from part 1. that

x exp(−rτ )Φ0 (d− ) 1 √


= 0
= exp(−rτ )[exp( σ 2 τ − d+ σ τ )]−1 see eqn. 3869 for expansion
K Φ (d+ ) 2

  
1
= exp σ τ d+ − τ r + σ 2 . (3886)
2

Then see by referencing Equation 3851 that as x → ∞ we get


1 x
d± = √ (∞ + log ) = ∞. (3887)
σ τ K

(Φ(d+ ) − 1) Φ0 (d+ ) δx
δ
d+ Φ0 (d+ ) xσ1√τ Φ0 (d+ ) σ√1
τ
lim = lim = = (3888)
x→∞ x−1 x→∞ −x−2 −x−2 −x−1
1
= lim − xΦ0 (d+ ) √ (3889)
x→∞ σ τ

  
1 1 1 1
= lim − K exp σ τ d+ − τ r + σ 2 √ exp(− d2+ ) √
d+ →∞ 2 2π 2 σ τ
= 0. (3890)

Then

lim [c(t, x) − (x − exp(−r(T − t))K] (3891)


x→∞
= lim [(xΦ(d+ ) − K exp(−rτ )Φ(d− )) − (x − exp(−r(τ ))K] (3892)
x→∞
= lim [x(Φ(d+ ) − 1) + K exp(−rτ )(1 − Φ(d− ))] (3893)
x→∞
= 0 + K exp(−rτ )(1 − Φ(∞)) (3894)
= 0 + K exp(−rτ )(1 − 1) (3895)
= 0. (3896)

To make things less lengthy we abbreviate the parameters where there is no ambiguity. For a short European call option hedge we hold c_x units of stock worth xc_x = xΦ(d_+). The hedging portfolio is worth c = xΦ(d_+) − K exp(−rτ)Φ(d_−) and since the c_x units are in stock, each worth x, the money market investment is c − xc_x = −K exp(−rτ)Φ(d_−) - a borrowing. The opposite action is to hedge a long European call by shorting c_x stock units, investing K exp(−rτ)Φ(d_−) in cash markets.

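The Greeks derived in parts 2-4 of the exercise can be checked against finite differences of the call price (a small sketch, not from the text; step sizes and inputs are illustrative).

```python
# Minimal sketch: closed-form delta Φ(d₊), gamma Φ'(d₊)/(xσ√τ), theta −rKe^{−rτ}Φ(d₋) − σxΦ'(d₊)/(2√τ)
# checked against central finite differences of the BSM call price.
import numpy as np
from scipy.stats import norm

def bsm_call(tau, x, K, r, sigma):
    d1 = (np.log(x / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    return x * norm.cdf(d1) - K * np.exp(-r * tau) * norm.cdf(d2)

tau, x, K, r, sigma, h = 0.5, 105.0, 100.0, 0.03, 0.25, 1e-3
d1 = (np.log(x / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
d2 = d1 - sigma * np.sqrt(tau)

delta = norm.cdf(d1)
gamma = norm.pdf(d1) / (x * sigma * np.sqrt(tau))
theta = -r * K * np.exp(-r * tau) * norm.cdf(d2) - sigma * x * norm.pdf(d1) / (2 * np.sqrt(tau))

print(delta, (bsm_call(tau, x + h, K, r, sigma) - bsm_call(tau, x - h, K, r, sigma)) / (2 * h))
print(gamma, (bsm_call(tau, x + h, K, r, sigma) - 2 * bsm_call(tau, x, K, r, sigma)
              + bsm_call(tau, x - h, K, r, sigma)) / h**2)
print(theta, (bsm_call(tau - h, x, K, r, sigma) - bsm_call(tau + h, x, K, r, sigma)) / (2 * h))
```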
Exercise 666 (Gamma Scalping). Note that for fixed t, cx , cxx are always positive. That is delta and
gamma are positive and c is increasing convex function on x. Suppose a hedge is initiated on a long
call option at time t for stock price x1 , we then pay c(t, x1 ) for option, short cx (t, x1 ) stock units and
invest M = x1 cx (t, x1 ) − c(t, x1 ) in cash market. Portfolio of these components after said operations are
c(t, x1 ) − x1 cx (t, x1 ) + M . A fair portfolio is worthless on construction, such that at t this equals zero.
Suppose stock instantaneously falls to x0 , then our portfolio is now worth

c(t, x0 ) − x0 cx (t, x1 ) + M = c(t, x0 ) − x0 cx (t, x1 ) + x1 cx (t, x1 ) − c(t, x1 ) (3897)


= c(t, x0 ) − [cx (t, x1 )(x0 − x1 ) + c(t, x1 )] (3898)
= c(t, x0 ) − cx (t, x1 )(x0 − x1 ) − c(t, x1 ). (3899)

Due to the curvature, (positive) difference between curve c(t, x) at x = x0 and the line y = cx (t, x1 )(x −
x1 ) + c(t, x1 ) creates profit from instantaneous drop in stock price. An instantaneous rise to price x2
makes the portfolio value

c(t, x2 ) − x2 cx (t, x1 ) + M = c(t, x2 ) − x2 cx (t, x1 ) + x1 cx (t, x1 ) − c(t, x1 ) (3900)


= c(t, x2 ) − [cx (t, x1 )(x2 − x1 ) + c(t, x1 )] (3901)
= c(t, x2 ) − cx (t, x1 )(x2 − x1 ) − c(t, x1 ). (3902)

Again, the same curvature creates a positive difference and we profit from instantaneous price change
in both up and down moves. This is called a delta-neutral, long-gamma portfolio, and it profits under high volatility conditions. The curve c(t, x)'s tangency at x = x_1 with the line y = c_x(t, x_1)(x − x_1) + c(t, x_1) is what makes the delta-neutrality - if the line were steeper then we would be short delta and an upward stock price move would increase liabilities faster than the rise in value of the option, and the portfolio loses money, while making money on downward moves due to the greater gain from the decrease in liabilities on the short stock position than the fall in option value. The curve y = c(t, x) itself shifts downwards in time t due to negative theta, and under no arbitrage and geometric Brownian motion assumptions, it turns out all of these effects exactly cancel each other.

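The payoff to instantaneous moves in either direction can be illustrated numerically (a small sketch, not from the text; all inputs are illustrative): Equations 3899 and 3902 are positive for both the down and the up move.

```python
# Minimal sketch of the gamma-scalping P&L c(t, x_new) − c_x(t, x1)(x_new − x1) − c(t, x1).
import numpy as np
from scipy.stats import norm

def bsm_call_and_delta(tau, x, K, r, sigma):
    d1 = (np.log(x / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    return x * norm.cdf(d1) - K * np.exp(-r * tau) * norm.cdf(d2), norm.cdf(d1)

tau, K, r, sigma, x1 = 0.25, 100.0, 0.02, 0.3, 100.0
c1, delta1 = bsm_call_and_delta(tau, x1, K, r, sigma)

for x_new in (95.0, 105.0):                        # instantaneous down and up moves
    c_new, _ = bsm_call_and_delta(tau, x_new, K, r, sigma)
    pnl = c_new - delta1 * (x_new - x1) - c1       # Equations 3899 / 3902
    print(x_new, round(pnl, 4))                    # positive either way (long gamma)
```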
Exercise 667. A hedging portfolio is a portfolio that reproduces the derivative payoff with the underlying
and money market account transactions. Let Xk denote this hedging portfolio value at k and ∆k be the
units of stock held, such that the money market investment is Xk − ∆k Sk . His portfolio value at time
k + 1 is then Xk+1 = ∆k Sk+1 + (1 + r)(Xk − ∆k Sk ), and his earnings are

Xk+1 − Xk = ∆k (Sk+1 − Sk ) + r(Xk − ∆k Sk ) (3903)

with continuous time analogue

dXt = ∆t dSt + r(Xt − ∆t St )dt. (3904)

Let the value of one share invested in the money market account at k time be Mk = (1 + r)k , and let
Xk now be determined by the two processes ∆k and Γk for units in stock and shares in money market
respectively. Then

Xk = ∆k Sk + Γk Mk . (3905)

so that we have

Xk+1 = ∆k Sk+1 + (1 + r)Γk Mk = ∆k Sk+1 + Γk Mk+1 (3906)

and

Xk+1 − Xk = ∆k (Sk+1 − Sk ) + Γk (Mk+1 − Mk ). (3907)

The dynamically rebalancing portfolio is

Xk+1 = ∆k+1 Sk+1 + Γk+1 Mk+1 (3908)

and by comparison of Equation 3906, Equation 3908, the self financing condition equation is given

Sk+1 (∆k+1 − ∆k ) + Mk+1 (Γk+1 − Γk ) = 0, (3909)

or rewritten to be

Sk (∆k+1 − ∆k ) + (Sk+1 − Sk )(∆k+1 − ∆k ) + Mk (Γk+1 − Γk ) + (Mk+1 − Mk )(Γk+1 − Γk ) = 0. (3910)

The continuous time analogue being

St d∆t + dSt d∆t + Mt dΓt + dMt dΓt = 0 (3911)

known as the continuous time self financing condition.


Another common way of deriving the BSM PDE is to first argue that a portfolio holding call and
short ∆(t) stock is worth N (t) = c(t, S(t)) − ∆(t)S(t) and instantaneously risk-free, therefore it earns the
risk-free rate to be arbitrage free. Expression of this form is dN (t) = rN (t)dt. Since portfolio is option

and stock, and option can be written as function over time t, price x, then by Ito Doeblin the call option
price evolves at
dc(t, S(t)) = c_t dt + c_x dS_t + \frac{1}{2}c_{xx}dS_t dS_t   (3912)

and the short stock component evolves at ∆_t dS_t* such that the portfolio evolves at

dN_t = c_t dt + c_x dS_t + \frac{1}{2}c_{xx}dS_t dS_t − ∆_t dS_t   (3913)
= [c_x − ∆_t]dS_t + \left[c_t + \frac{1}{2}σ^2S^2c_{xx}\right]dt.   (3914)
The Brownian motion inside the dSt is to be cancelled for riskless portfolio, hence the delta-hedging
formula cx = ∆t . Since we expect dNt = rNt dt then we can match forms
1
rNt dt = [ct + σ 2 S 2 cxx ]dt (3915)
2
but since Nt = c − ∆t St = c − St cx then making the relevant substitutions gives us BSM PDE
1
r (c − St cx ) dt = [ct + σ 2 S 2 cxx ]dt (3916)
2
1
ct + rSt cx + σ 2 S 2 cxx = rc (3917)
2
with arguments (t, S(t)) abbreviated. We may consider at this point* that the argument is inaccurate,
since we did not use the Ito product rule (Theorem 30) on the term ∆t dSt when considering the stock
component. That is we drew a continuous analogue to discrete form ∆k (Sk+1 − Sk ) for profit in period
k, but if we work under continuous settings for the stock position and setting ∆t = cx (t, St ) the portfolio
evolves at exactly
1
dNt = ct dt + cx dSt + cxx dSt dSt − d∆t St (3918)
2
1
= ct dt + cx dSt + cxx dSt dSt − ∆t dSt − St d∆t − d∆t dSt . (3919)
2
1. Let the money market bear rate r continuously such that Mt = exp(rt) and use Equations 3904 to
derive the continuous self financing equation (Equation 3911).

2. Recall by starting with X0 = c(0, S(0)) and holding ∆t = cx stock and executing the money market
transactions relevant, we will end up with Xt = c(t, S(t)), the dynamic replicating portfolio of the
call option. The money market investment at t is

Xt − ∆t St = c − ∆t St = Nt . (3920)
Nt
The number of money market account shares held is therefore Γt = Mt . Using the SDE for portfolio
(Equation 3919) and using the continuous self-financing condition from part 1. show Equation 3915.

Proof. -

1. See that the portfolio value at t is given Xt = ∆t St + Γt Mt . Write

dXt = ∆t dSt + Γt dMt + St d∆t + dSt d∆t + Mt dΓt + dMt dΓt (3921)

by repeated application of the Ito Product rule (see Theorem 30). Then dXt = ∆t dSt + Γt dMt iff
St d∆t + dSt d∆t + Mt dΓt + dMt dΓt = 0 and the result follows. That is, Equation 3904 is equivalent
to self financing Equation 3911.

2. Write
 
1
dNt = ct dt + cx dSt + cxx dSt dSt − d(∆t St ) (3922)
2
 
1
= ct + cxx σ 2 St2 dt + [cx dSt − d(Xt − Γt Mt ] (3923)
2
 
 
1
= ct + cxx σ 2 St2 dt + Mt dΓt + dMt dΓt +  cx dSt + Γt dMt −dXt  . (3924)
2 | {z }
∆t dSt +Γt dMt =dXt

with the braced comments given by self-financing. Then


 
1
ct + cxx σ 2 St2 dt = dNt − Mt dΓt − dMt dΓt = Γt dMt = Γt rMt dt = rNt dt (3925)
2

and we are done.

Exercise 668 (Solutions to the Black-Scholes-Merton for a European Call). For rate r and volatility σ held constant, let S(t) = S(0) exp(σW(t) + (r − \frac{1}{2}σ^2)t) for which W(t) is Brownian motion (see Definition 432). Assume S(0) > 0 and positive strike K. Show that the present value of the expected payoff at maturity T > 0 for European calls can be written

E[exp(−rT)(S(T) − K)^+] = S(0)Φ(d_+(T, S(0))) − K exp(−rT)Φ(d_−(T, S(0))),   (3926)

for which d_±(T, S(0)) = \frac{1}{σ\sqrt{T}}\left[log\frac{S(0)}{K} + \left(r ± \frac{σ^2}{2}\right)T\right] and Φ is the c.d.f for Φ(0, 1) (see Section 6.17.6).

Proof. See that S_T = S_0 exp\left((r − \frac{1}{2}σ^2)T + σW(T)\right), and the values x realized for W(T) such that S(T) − K is positive satisfy

log\left[S_0 exp\left((r − \frac{1}{2}σ^2)T + σx\right)\right] − log K = log\frac{S_0}{K} + (r − \frac{1}{2}σ^2)T + σx > 0.   (3927)

Write the boundary value

σx = log\frac{K}{S_0} − (r − \frac{1}{2}σ^2)T   (3928)
x = \frac{1}{σ}\left[log\frac{K}{S_0} − (r − \frac{1}{2}σ^2)T\right]   (3929)

so that the expectation is written

E exp(−rT )(ST − K)+


 
(3930)
Z ∞     2
1 2 1 x
= exp(−rT ) h i S0 exp (r − σ )T + σx − K √ exp(− )dx
1 K 1 2
σ log S0 −(r− 2 σ )T
2 2πT 2T
Z ∞ √ y2
   
1 2 1
= exp(−rT ) i S0 exp (r − σ )T + σ T y − K √ exp(− )dy
2 2π 2
h
1
√ log SK −(r− 12 σ 2 )T
σ T 0

1
Z ∞
1 y 2 √
= S0 exp(− σ 2 T ) i √ exp(− + σ T y)dy (3931)
2 2π 2
h
1
√ log SK −(r− 12 σ 2 )T
σ T 0
Z ∞
1 y2
−K exp(−rT ) i √ exp(− )dy (3932)
2π 2
h
1
√ log SK −(r− 12 σ 2 )T
σ T 0
Z ∞
1 ξ2
= S0 √ √ exp(− )dξ (3933)
2π 2
h i
1
√ log SK −(r− 12 σ 2 )T −σ T
σ T 0
Z ∞
1 y2
−K exp(−rT ) i √ exp(− )dy (3934)
2π 2
h
1
√ log SK −(r− 12 σ 2 )T
σ T 0

= S0 Φ(d+ (T, S0 )) − K exp(−rT )Φ(d− (T, S0 )) (3935)

and we are done.

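The expectation computed in Exercise 668 can also be approximated by straightforward Monte Carlo (a sketch, not from the text; sample size and inputs are illustrative) and compared against the closed form.

```python
# Minimal sketch: Monte Carlo estimate of E[e^{−rT}(S(T) − K)⁺] vs the BSM closed form.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
S0, K, r, sigma, T, n = 100.0, 110.0, 0.03, 0.2, 1.0, 5_000_000

WT = rng.normal(0.0, np.sqrt(T), size=n)
ST = S0 * np.exp(sigma * WT + (r - 0.5 * sigma**2) * T)
mc_price = np.exp(-r * T) * np.maximum(ST - K, 0.0).mean()

d1 = (np.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
d2 = d1 - sigma * np.sqrt(T)
cf_price = S0 * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)
print(mc_price, cf_price)        # agreement up to Monte Carlo error
```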
Exercise 669. Recall that the price of a call option is characterized by Equation 3850: c = xΦ(d_+) − K exp(−rτ)Φ(d_−), for which d_± contains the volatility in the form of Equation 3851. Suppose the market assumes volatility σ_1 and therefore d_±(τ, x) = \frac{1}{σ_1\sqrt{τ}}\left[log\frac{x}{K} + \left(r ± \frac{σ_1^2}{2}\right)τ\right], but we know that the underlying really assumes the evolution dS_t = αS_t dt + σ_2 S_t dW_t where σ_2 > σ_1. Beginning with a worthless portfolio X_0 = 0, construct a portfolio of long call, short c_x stock and cash X_t − c + S_t c_x invested at risk-free r. Withdraw cash from the portfolio at rate \frac{1}{2}(σ_2^2 − σ_1^2)S_t^2 c_{xx} such that the portfolio value has differential

dX_t = dc − c_x dS_t + r[X_t − c + S_t c_x]dt − \frac{1}{2}(σ_2^2 − σ_1^2)S_t^2 c_{xx} dt.   (3936)

Show X_t = 0, so that we have an arbitrage opportunity.

Proof. c(t, x) solves the BSM PDE (Equation 3848), that is c_t + rxc_x + \frac{1}{2}σ_1^2x^2c_{xx} = rc, where the parameters (t, S_t) have been abbreviated. By Ito Doeblin (Theorem 415) write
1
dc = ct dt + cx (αSt dt + σ2 St dWt ) + cxx σ22 St2 dt (3937)
  2
1 2 2
= ct + αcx St + σ2 St cxx dt + σ2 St cx dWt (3938)
2
 
1 2 2 2
= rc + (α − r)cx St + St (σ2 − σ1 )cxx dt + σ2 St cx dWt . (3939)
2

Therefore the differential for the portfolio is


 
1 1
dXt = rc + (α − r)cx St + St2 (σ22 − σ12 )cxx + rXt − rc + rSt cx − (σ12 − σ12 )St2 cxx − cx αSt dt
2 2
+ [σ2 St cx − cx σ2 St ] dWt = rXt dt. (3940)

Equivalently Xt = X0 exp(rt), which means Xt = 0 by initial condition and we are done.

13.2.6 Put-Call Parity
A forward on an asset struck at K with maturity T is an obligation to buy one unit of the asset in exchange for K dollars at time T. At maturity the payoff on a long forward is therefore S(T) − K, and let f(t, x) denote the value of such a forward at t ∈ [T] for underlying S(t) = x. The forward value is then the discounted expected payoff written

f(t, x) = x − exp(−r(T − t))K   (3941)

and we can see easily that under no arbitrage conditions the forward value must be this and not lower or higher. Otherwise one may short or long the forward depending on whether it is overvalued or undervalued and hedge with a stock unit, investing or borrowing whatever proceeds at the cash market. For instance a short forward at t = 0 earns f(0, S(0)) = S(0) − exp(−rT)K, and borrowing exp(−rT)K cash to invest in one unit of stock: see that at maturity the debt is K, he receives K and he uses his stock to fulfill the obligation to sell the stock. His portfolio is worth zero at maturity, and his hedge is a perfectly replicating portfolio of the forward. Therefore the forward price of a stock at time t is the value of the strike K that causes the forward contract to be worthless at time t, such that S(t) − exp(−r(T − t))K = 0. For a constant interest rate the forward price of a non-dividend bearing asset is

F (t) = exp(r(T − t))S(t). (3942)

The forward price is the theoretical strike, not the value of a forward contract. At time t = 0 the forward
price is K = exp(rT )S(0) and its value at time t is

f (t, S(t)) = S(t) − exp(rt)S(0). (3943)

A European put on the other hand has payoff (K − ST )+ and see that x − K = (x − K)+ − (K − x)+ .
Then this implies that for the value of a European put p(t, x) with the same strike K and maturity T ,
a European call must satisfy f (T, S(T )) = c(T, S(T )) − p(T, S(T )) and since it holds at T it must hold
for all t ∈ [T ] and we have

f (t, x) = c(t, x) − p(t, x) x ≥ 0, t ∈ [T ] (3944)

and this equation is called the put call parity. Again one may verify that if this relationship did not hold
for European options then she may earn risk-free profit by going long the undervalued side and shorting
the counterpart. The put call parity does not require Black Scholes Merton arguments. The constant
rate assumption gives us the value of a forward contract. Constant volatility assumptions give us Black
Scholes Merton. By put-call parity (Equation 3944) and solution for calls, the price of a put is

p(t, x) = c(t, x) − f (t, x) (3945)


= xΦ(d+ (τ, x)) − K exp(−rτ )Φ(d− (τ, x)) − (x − exp(−rτ )K) (3946)
= x(Φ(d+ (τ, x)) − 1) − K exp(−rτ )(Φ(d− (τ, x)) − 1) (3947)
= K exp(−rτ )Φ(−d− (τ, x)) − xΦ(−d+ (τ, x)). use (Φ(ξ) = 1 − Φ(−ξ)) (3948)

Exercise 670 (Results on Put Greeks and Satisfaction of PDE by Forwards). 1. Determine the delta,
gamma and theta of a European put.

2. Show that short stock and long money market shares are required for hedging short put.

3. Show that the forward value f (t, x) and put value p(t, x) satisfies BSM PDE. (Equation 3848).

Proof. -

1. By the put-call parity (Equation 3944), see that c − p = x − exp(−rτ)K, such that p_x = c_x − 1 = Φ(d_+(τ, x)) − 1, p_xx = c_xx = (1/(σx√τ))Φ'(d_+(τ, x)) (from Equation 3876) and

p_t = c_t + r exp(−rτ)K   (3949)
    = −rK exp(−rτ)Φ(d_−(τ, x)) − (σx/(2√τ))Φ'(d_+(τ, x)) + r exp(−rτ)K   (from Equation 3875)
    = rK exp(−rτ)[1 − Φ(d_−(τ, x))] − (σx/(2√τ))Φ'(d_+(τ, x))   (3950)
    = rK exp(−rτ)Φ(−d_−(τ, x)) − (σx/(2√τ))Φ'(d_+(τ, x)).   (3951)

2. See that ∆t = px = Φ(d+ (τ, x)) − 1 < 0 and therefore he is short stock, long p(t, St ) − px (t, St )St
investment in money market shares.

3. See that the BSM PDE

c_t + rxc_x + ½σ²x²c_xx = rc   (3952)

can be rewritten

(∂/∂t + ½σ²x² ∂²/∂x² + rx ∂/∂x − r) c = 0.   (3953)

c solves the BSM PDE if this equation is satisfied. By put-call parity (see Equation 3944) it suffices to show that f = x − K exp(−rτ) solves the partial differential equation. Write

(∂/∂t + ½σ²x² ∂²/∂x² + rx ∂/∂x − r)(x − K exp(−rτ))   (3954)
= (∂/∂t + ½σ²x² ∂²/∂x² + rx ∂/∂x − r) x − (∂/∂t + ½σ²x² ∂²/∂x² + rx ∂/∂x − r) K exp(−rτ)   (3955)
= (rx − rx) − (rK exp(−rτ) − rK exp(−rτ))   (3956)
= 0   (3957)

and we are done.

13.2.7 Multivariable Stochastic Calculus


Definition 445 (Multiple Brownian Motions). A d-dimensional Brownian motion is defined to be process

W (t) = (W1 (t), · · · , Wd (t)) (3958)

where each element satisfies uni-dimensional Brownian properties (see Definition 432), ∀i 6= j, Wi (t) ⊥
Wj (t). Defined as such, a multidimensional Brownian is a vector of independent uni-dimensional Brow-
nians.

Definition 446 (Filtration for Multiple Brownians). A filtration F(t) can be defined for a d-dimensional Brownian motion (see Definition 445) such that the following properties are satisfied: (i) cumulative information, 0 ≤ s < t =⇒ F(s) ⊂ F(t), (ii) adaptivity, W(t) is F(t) measurable, and (iii) independence of future increments, 0 ≤ t < u =⇒ (W(u) − W(t)) ⊥ F(t).


It follows from Definition 445 that for Wi ∈ (Wi (t))i∈[d] we have [Wi , Wi ](t) = t or in differential
form dWi (t)dWi (t) = dt.

Theorem 417 (Zero Cross Variation of Independent Brownians). For two independent Brownians from
a multiple dimensional Brownian (see Definition 445), that is for some i, j where i 6= j since Wi ⊥ Wj
then [Wi , Wj ](t) = 0, or equivalently dWi (t)dWj (t) = 0 is satisfied.

Proof. Define the partition Π = {0 = t_0 < t_1 < · · · < t_n = T} and for i ≠ j define the sampled cross variation w.r.t. this partition to be

C_Π = Σ_{k=0}^{n−1} [W_i(t_{k+1}) − W_i(t_k)][W_j(t_{k+1}) − W_j(t_k)].   (3959)

By independence, the expectation of each summand factors into the product of two mean-zero increments, so EC_Π = 0. Then Var C_Π = EC_Π², and consider that

C_Π² = Σ_{k=0}^{n−1} [W_i(t_{k+1}) − W_i(t_k)]²[W_j(t_{k+1}) − W_j(t_k)]²   (3960)
     + 2 Σ_{l<k} [W_i(t_{l+1}) − W_i(t_l)][W_j(t_{l+1}) − W_j(t_l)][W_i(t_{k+1}) − W_i(t_k)][W_j(t_{k+1}) − W_j(t_k)].

Again, in the last term every summand factors with expectation zero, such that

EC_Π² = Σ_{k=0}^{n−1} E[W_i(t_{k+1}) − W_i(t_k)]² E[W_j(t_{k+1}) − W_j(t_k)]²   (3961)
      = Σ_{k=0}^{n−1} (t_{k+1} − t_k)² ≤ ||Π|| Σ_{k=0}^{n−1} (t_{k+1} − t_k) = ||Π|| T → 0 as ||Π|| → 0.   (3962)

Then C_Π → 0 in mean square (ℓ² convergence), so the cross variation [W_i, W_j](t) = 0.

Definition 447 (Multiple Ito Processes and Their Quadratic/Cross Variation). A two dimensional Ito
process is presented, but the multiple processes problem generalizes to any number of Ito processes driven
by any, possibly different number of dimensional Brownians. The specific example here shows two Ito
processes on two Brownians. Specify X(t), Y (t) Ito processes with form
Z t Z t Z t
X(t) − X(0) = Θ1 (u)du + σ11 (u)dW1 (u) + σ12 (u)dW2 (u), (3963)
0 0 0
Z t Z t Z t
Y (t) − Y (0) = Θ2 (u)du + σ21 (u)dW1 (u) + σ22 (u)dW2 (u), (3964)
0 0 0

where Θi (u), σij (u) are adapted to the Brownian components. In differential form these are written

dX(t) = Θ1 (t)dt + σ11 (t)dW1 (t) + σ12 (t)dW2 (t), (3965)


dY (t) = Θ2 (t)dt + σ21 (t)dW1 (t) + σ22 (t)dW2 (t). (3966)
Recall that Ito integrals ∫_0^t ψ(u)dW(u) accumulate quadratic variation at rate ψ²(u) per unit time, and this is additive. In particular, the process X(t) accumulates quadratic variation at rate σ_11²(t) + σ_12²(t) per unit time,
and we write
Z t
2 2

[X, X](t) = σ11 (u) + σ12 (u) du (3967)
0

2 2

or in differential form dX(t)dX(t) = σ11 (t) + σ12 (t) dt. Alternatively we might have obtained by
considering the square of Equation 3965, that is
2
(Θ1 (t)dt + σ11 (t)dW1 (t) + σ12 (t)dW2 (t)) (3968)
2 2

= σ11 (t) + σ12 (t) dt (3969)

since dtdt = 0, dtdWi (t) = 0, dWi (t)dWj (t) = 0, dWi (t)dWi (t) = dt for values of i 6= j. A similar
approach is taken to obtain

2 2

dY (t)dY (t) = σ21 (t) + σ22 (t) dt (3970)
dX(t)dY(t) = (σ_11(t)σ_21(t) + σ_12(t)σ_22(t)) dt.   (3971)

The cross variation term in integral form (of Equation 3971) is


Z T
[X, Y ](T ) = (σ11 (u)σ21 (u) + σ12 (u)σ22 (u))du, (3972)
0

which is defined by taking the partition set Π = {0 = t0 < t1 < · · · < tn = T } and taking the limits as
kΠk → 0 for the sampled cross variation
n−1
X
[X(tk+1 ) − X(tk )] [Y (tk+1 ) − Y (tk )] . (3973)
k=0

Theorem 418 (Two Dimensional Ito Doeblin Processes). Again we present the same conditions for
two dimensional processes on two Brownians as in Definition 447 but the general concept is exten-
sible to n dimensions on k Brownians. Let f (t, x, y) be a function for which the partial derivatives
ft , fx , fy , fxx , fyy , fxy , fyx are defined and continuous, and two Ito Processes X(t), Y (t) be defined as
in Definition 447. All the partial derivatives take functional parameters t, X(t), Y (t), so abbreviate
f· (t, X(t), Y (t)) → f· . Similarly all Ito processes are taken w.r.t to time, so abbreviate X(t), Y (t) → X, Y .
In differential form the Ito Doeblin for two dimensional ito processes is written
1 1
df (t, X, Y ) = ft dt + fx dX + fy dY + fxx dXdX + fxy dXdY + fyy dY dY. (3974)
2 2
The Taylor expansion terms involving partial derivatives ftt , ftx , fty have cross variation terms that go
to zero and hence not included. The partial derivatives fxy = fyx for continuous functions so they are
written together. Expanding out the differential forms
1 1
df (t, X, Y ) = ft dt + fx dX + fy dY + fxx dXdX + fxy dXdY + fyy dY dY (3975)
2 2
= ft dt (3976)
+fx [Θ1 (t)dt + σ11 (t)dW1 (t) + σ12 (t)dW2 (t)] (3977)
+fy [Θ2 (t)dt + σ21 (t)dW1 (t) + σ22 (t)dW2 (t)] (3978)
1  2 2
 
+ fxx σ11 (t) + σ12 (t) dt (3979)
2
+ f_xy [(σ_11(t)σ_21(t) + σ_12(t)σ_22(t)) dt]   (3980)
1  2 2
 
+ fyy σ21 (t) + σ22 (t) dt (3981)
2

And therefore the integral form writes to a cumbersome form

f (t, X(t), Y (t)) − f (0, X(0), Y (0)) (3982)


Z t
= [σ11 (u)fx (u, X(u), Y (u)) + σ21 (u)fy (u, X(u), Y (u))] dW1 (u) (3983)
0
Z t
+ [σ12 (u)fx (u, X(u), Y (u)) + σ22 (u)fy (u, X(u), Y (u))] dW2 (u) (3984)
0
Z t
+ [ft (u, X(u), Y (u)) + · · · ] du, (3985)
0

composed of two Ito integrals and one Lebesgue integral.

Corollary 30 (Ito Product Rule). For X(t), Y (t) Ito processes

d(X(t)Y (t)) = X(t)dY (t) + Y (t)dX(t) + dX(t)dY (t). (3986)

Proof. To see this take f (x, y) = xy which has partial derivatives ft = fxx = fyy = 0 and fx = y, fy =
x, fxy = 1 and apply the Ito Doeblin differential from for two Ito Processes. (see Theorem 418)

13.2.8 Levy’s Theorem


Theorem 419 (One Dimensional Levy's). A stochastic process M(t), t ≥ 0 that is a martingale relative to filtration F(t) with initial condition M(0) = 0, has continuous paths, and accumulates quadratic variation [M, M](t) = t for all t ≥ 0 must be a Brownian motion. That is, any such stochastic process must also satisfy the normality of Brownian increments if the specified conditions hold.

Proof. By Ito Doeblin (see Theorem 415) for continuous function f (t, x) with the existence of necessary
partial derivatives are written
1
df (t, M (t)) = ft (t, M (t))dt + fx (t, M (t))dM (t) + fxx (t, M (t))dt, (3987)
2
where the assumption that [M, M ](t) = t under specified conditions are used. Then its integral form is
Z t  Z t
1
f (t, M (t)) − f (0, M (0)) = ft (s, M (s)) + fxx (s, M (s)) ds + fx (s, M (s))dM (s). (3988)
0 2 0

Recall that the stochastic integral of martingale is martingale (see Theorem 409). By assumption its
initial condition is set zero, such that
Z t 
1
Ef (t, M (t)) − f (0, M (0)) = E ft (s, M (s)) + fxx (s, M (s)) ds. (3989)
0 2

For some fixed constant u ∈ R write f(t, x) = exp(ux − ½u²t), and recall that exp(½u²t) is the m.g.f. of Φ(0, σ² = t) evaluated at u (see Theorem 377). This function has partial derivatives f_t(t, x) = −½u²f(t, x), f_x(t, x) = uf(t, x) and f_xx(t, x) = u²f(t, x), and therefore f_t(t, x) + ½f_xx(t, x) = 0. In particular Equation 3989 becomes E exp{uM(t) − ½u²t} = 1, so the m.g.f. of M(t) is E exp{uM(t)} = exp{½u²t}, and by the uniqueness property of moment generating functions (see Theorem 351) we are done.

Theorem 420 (Two Dimensional Levy’s). Specify two martingales M1 (t), M2 (t) adapted to filtration
F(t) and each satisfying the one-dimensional Levy settings as in Theorem 419. Further assume that cross
variation [M1 , M2 ](t) = 0 for all t ≥ 0 then the two Martingales are independent Brownian motions.

Proof. The one-dimensional Levy proof (see proof of Theorem 419) establishes that each process is individually a Brownian motion - we need to show independence. For f(t, x, y), by Ito-Doeblin (see Theorem 418) write

df(t, M_1, M_2) = f_t dt + f_x dM_1 + f_y dM_2 + ½f_xx dM_1 dM_1 + f_xy dM_1 dM_2 + ½f_yy dM_2 dM_2   (3990)
               = f_t dt + f_x dM_1 + f_y dM_2 + ½f_xx dt + ½f_yy dt,   (3991)

where the dropped terms vanish or combine by the assumed cross variation and quadratic variations (dM_1 dM_2 = 0, dM_i dM_i = dt). The
integral form is written

f (t, M1 (t), M2 (t)) − f (0, M1 (0), M2 (0)) (3992)


Z t 
1 1
= ft (s, M1 (s), M2 (s)) + fxx (s, M1 (s), M2 (s)) + fyy (s, M1 (s), M2 (s)) ds (3993)
0 2 2
Z t Z t
+ fx (s, M1 (s), M2 (s))dM1 (s) + fy (x, M1 (s), M2 (s))dM2 (s). (3994)
0 0

See that the last two terms are martingale trading (with zero expectations) so

Ef (t, M1 (t), M2 (t)) − f (0, M1 (0), M2 (0)) (3995)


Z t 
1 1
= E ft (s, M1 (s), M2 (s)) + fxx (s, M1 (s), M2 (s)) + fyy (s, M1 (s), M2 (s)) ds. (3996)
0 2 2
For fixed constants u_1, u_2 define the function f(t, x, y) = exp{u_1 x + u_2 y − ½(u_1² + u_2²)t}. Finding the right partial derivatives, substituting the values and using the specified initial conditions, it turns out we have

E exp{u_1 M_1(t) + u_2 M_2(t)} = exp{½(u_1² + u_2²)t}   (3997)
and see that the joint moment generating function factors into their respective marginal moment gen-
erating functions. By equivalence of independence conditions (see Theorem 353) the Brownian motions
are independent.

Exercise 671 (Correlated Brownian Motion). Let W_1(t), W_2(t) be two independent Brownian motions. Define some constant ρ ∈ [−1, 1] and write W_3(t) = ρW_1(t) + √(1 − ρ²) W_2(t). Then let two stock processes follow

dS_1(t) = α_1 S_1 dt + σ_1 S_1 dW_1(t),
dS_2(t) = α_2 S_2 dt + σ_2 S_2 dW_3(t)
where σ1 , σ2 > 0. By construction W3 (0) = 0 and is continuous, and since

dW3 (t)dW3 (t) = ρ2 dW1 (t)dW1 (t) + (·)dW1 (t)dW2 (t) + (1 − ρ2 )dW2 (t)dW2 (t) = dt, (3998)

by Levy’s Theorem (see Theorem 419) W3 (t) is Brownian motion. Furthermore by their differential
equations S1 , S2 are each geometric Brownian motions with mean rate of return α1 , α2 and volatility
σ1 , σ2 respectively. By the Ito product rule (see Corollary 30) we have

d(W1 (t)W3 (t)) = W1 (t)dW3 (t) + W3 (t)dW1 (t) + dW1 (t)dW3 (t) = W1 (t)dW3 (t) + W3 (t)dW1 (t) + ρdt.

Integrating this form then


Z t Z t
W1 (t)W3 (t) − 0 = W1 (s)dW3 (s) + W3 (s)dW1 (s) + ρt (3999)
0 0

and the first two terms are martingale-trading Ito integrals with expectation zero, so EW_1(t)W_3(t) = ρt. Since Var W_1(t) = Var W_3(t) = t, their correlation is then ρt/(√t · √t) = ρ and we are done.

Exercise 672 (Decomposition of correlated Brownians to independent Brownians, Shreve [19]). Suppose
B1 (t), B2 (t) are Brownians such that

dB1 (t)dB2 (t) = ρ(t)dt (4000)

where ρ(t) is stochastic process taking values [−1, 1]. Define two processes

B1 (t) = W1 (t) (4001)


Z t Z tp
B2 (t) = ρ(s)dW1 (s) + 1 − ρ2 (s)dW2 (s). (4002)
0 0

Show W1 (t), W2 (t) is multidimensional Brownian (see Definition 445), and therefore independent.

Proof. Zeng [20] By the two-dim Levy’s Theorem (see Theorem 420) if we can construct two martingales,
each accumulating quadratic variation at rate t with zero cross variation and initial condition zero then
they are independent Brownians. Write a pair of martingales

dW1 (t) = dB1 (t) (4003)


dW2 (t) = α(t)dB1 (t) + β(t)dB2 (t) (4004)

and we want α(t), β(t) s.t.

dW_2(t)dW_2(t) = [α²(t) + β²(t) + 2α(t)β(t)ρ(t)] dt = dt   (4005)
dW_1(t)dW_2(t) = [α(t) + β(t)ρ(t)] dt = 0.   (4006)

Solving for α(t), β(t), obtain α(t) = −ρ(t)/√(1 − ρ²(t)) and β(t) = 1/√(1 − ρ²(t)), such that W_1(t) = B_1(t) and

W_2(t) = ∫_0^t (−ρ(s)/√(1 − ρ²(s))) dB_1(s) + ∫_0^t (1/√(1 − ρ²(s))) dB_2(s),   (4007)

a pair of independent Brownians. See that this is equivalent to what we want to prove.

Exercise 673. For the proof of Ito-Doeblin (Theorem 413) we showed the case f(x) = ½x², for which the second derivative f''(x) = 1, making the last term ½Σ_j f''(W(t_j))[W(t_{j+1}) − W(t_j)]² → ½∫_0^T f''(W(t))dt as ||Π|| → 0. Now consider the general case (where we are not guaranteed that f''(x) = 1), and we want to show lim_{||Π||→0} Σ_{j=0}^{n−1} f''(W(t_j))[W(t_{j+1}) − W(t_j)]² = ∫_0^T f''(W(t))dt. Write

Z_j = f''(W(t_j))[(W(t_{j+1}) − W(t_j))² − (t_{j+1} − t_j)]

so that

Σ_{j=0}^{n−1} f''(W(t_j))[W(t_{j+1}) − W(t_j)]² = Σ_{j=0}^{n−1} Z_j + Σ_{j=0}^{n−1} f''(W(t_j))(t_{j+1} − t_j).   (4008)

1. Show Z_j is F(t_{j+1}) measurable, and that E[Z_j|F(t_j)] = 0, E[Z_j²|F(t_j)] = 2[f''(W(t_j))]²(t_{j+1} − t_j)².

2. Show the ℓ² convergence of Σ_{j=0}^{n−1} Z_j to zero, which allows us to establish that the proof used in Theorem 413 holds in the general case, assuming E∫_0^T [f''(W(t))]² dt is finite.

Proof. -

1. It is trivial to see that Zj is F(tj+1 ) measurable. Write

E[Zj |Ftj ] = f 00 (Wtj )(tj+1 − tj − tj+1 + tj ) = 0. (4009)

See that
 00 2 
E[Zj2 |Ftj ] f (Wtj ) E (Wtj+1 − Wtj )4 + (tj+1 − tj )2 − 2(tj+1 − tj )(Wtj+1 − Wtj )2 |Ftj

=
 00 2 
f (Wtj ) 3(tj+1 − tj )2 + (tj+1 − tj )2 − 2(tj+1 − tj )2 )

= (4010)
 00 2
= 2 f (Wtj ) (tj+1 − tj )2 . (4011)
hP i
n−1
2. To show `2 convergence we need to show the variance lim Var j=0 Zj = 0. Since the random
kΠk→0
variable considered is centred, write
     
n−1
X n−1
X n−1
X X
Var  Zj  = E ( Zj )2  = E  Zj2 + 2 Zi Zj  (4012)
j=0 j=0 j=0 i<j
n−1
X n
X
E E[Zj2 |Ftj ] + 2
 
= E[Zi E[Zj |Ftj ]] (4013)
j=0 i<j
n−1
X
E E[Zj2 |Ftj ]
 
= by part 1. (4014)
j=0
n−1 h  i
X 2
= E 2 f 00 (Wtj ) (tj+1 − tj )2 by part 1. (4015)
j=0
n
X
≤ 2 max |tj+1 − tj | E[f 00 (Wtj )2 (tj+1 − tj )]. (4016)
j∈[n−1]
j=0

Now see that the last statement goes to zero as kΠk → 0 due to finite integrability of f 00 (W (t))2
Pn RT
which means that j=0 E[f 00 (Wtj )2 (tj+1 − tj )] → 0 E[f 00 (Wt )2 ]dt < ∞.

13.2.9 Gaussian Processes


Definition 448 (Gaussian Processes). A Gaussian process X(t) specified on t ≥ 0 is a stochastic process
for which at times 0 < t1 < · · · < tn the random variables X(ti ), i ∈ [n] are jointly normal. (see Definition
344)

As in our discussion of jointly normal random variables, the vector of Gaussian processes indexed
by time are characterised by their mean and covariance matrix. Adopt the notation EX(t) = m(t) and
Cov(X(s), X(t)) = E[(X(s) − m(s))(X(t) − m(t))] = c(s, t).

Exercise 674 (Brownian Motion is Gaussian Process). Verify that the Brownian motion (see Definition
432) is Gaussian process. (see Definition 448).

Proof. For partition 0 = t0 < t1 < · · · , define Brownian motion increments I(ti ) = W (ti ) − W (ti−1 ).
Pi
Each indexed Brownian can be written W (ti ) = j=1 Ij . See from Theorem 398 that they are jointly
normal, since each term is linear combination of independent normals. Additionally see also that we wrote
for s < t, EW (s)W (t) = s and their covariance is simply s ∧ t. It is Gaussian process by definition.

Exercise 675 (Ito Integral of Deterministic Integrand is Gaussian Process). Show that the Ito integral I(t) = ∫_0^t Δ(s)dW(s) for non-random Δ(t) is a Gaussian process.
Proof. See from the proof of Theorem 416 that E exp{uI(t) − ½u²∫_0^t Δ²(s)ds} = 1, so the moment generating function is E exp(uI(t)) = exp{½u²∫_0^t Δ²(s)ds}, an m.g.f. for Φ(0, σ² = ∫_0^t Δ²(s)ds) (see Theorem 377). Although the normality of I(t) is proved, to show this is a Gaussian process we need to show the indexed Ito integrals I(t_i), i ∈ [n] are jointly normal. Similarly to the proof that Brownian motion is a Gaussian process (see Exercise 674), we can do this by proving that the increments I(t_i) − I(t_{i−1}) are independent normals. Consider two arbitrary 0 < t_1 < t_2; we show I(t_1) − I(t_0) and I(t_2) − I(t_1) are independent and normal. Write

M_u(t) = exp{ uI(t) − ½u² ∫_0^t Δ²(s)ds }   (4017)

and for fixed constant u2 the martingale property of Mu2 (from Theorem 416) implies Mu2 (t1 ) =
E[Mu2 (t2 )|F(t1 )]. Write
 
Mu1 (t1 )Mu2 (t2 )
Mu1 (t1 ) = E |F(t1 ) (4018)
Mu2 (t1 )
1 2 t1 2 1 2 t2 2
  Z Z  
= E exp u1 I(t1 ) + u2 (I(t2 ) − I(t1 )) − u1 ∆ (s)ds − u2 ∆ (s)ds |F(t1 ) .
2 0 2 t1

Taking expectations, see

1 = Mu1 (0) = EMu1 (t1 ) (4019)


  Z t1 Z t2 
1 1
= E exp u1 I(t1 ) + u2 (I(t2 ) − I(t1 )) − u21 ∆2 (s)ds − u22 ∆2 (s)ds (4020)
2 0 2 t1
1 2 t1 2 1 2 t2 2
 Z Z 
= E [exp {u1 I(t1 ) + u2 (I(t2 ) − I(t1 ))}] · exp − u1 ∆ (s)ds − u2 ∆ (s)ds .(4021)
2 0 2 t1

Move the deterministic term to the LHS and we see that the mgf of E [exp {u1 I(t1 ) + u2 (I(t2 ) − I(t1 ))}]
equals
t1
1 2 t2 2
 Z   Z 
1 2 2
exp u ∆ (s)ds · exp u ∆ (s)ds (4022)
2 1 0 2 2 t1

and by their joint mgf factoring and uniqueness property (see Theorem 351) we are done. We can also
characterize their covariance matrix. For entry corresponding to

c(t1 , t2 ) = EI(t1 )I(t2 ) = E [I(t1 )(I(t2 ) − I(t1 ) + I(t1 )] (4023)


2
= E [I(t1 )(I(t2 ) − I(t1 ))] + EI (t1 ) (4024)
Z t1
= ∆2 (s)ds, t1 ≤ t2 (4025)
0

since first term factors out to zero.

Exercise 676. Let (Wi (t))i∈[d] be d dimensional Brownian motion. Let (σij (t))i∈[m],j∈[d] be m · d matrix
process adapted to filtration associated with the d dimensional Brownian motion. Define
  21
Xd
2
σi (t) =  σij (t) (4026)
j=1

. Assume
d Z t
X σij (u)
Bi (t) = dWj (u). (4027)
j=1 0 σi (u)

Show that each Bi is Brownian motion and that dBi (t)dBk (t) = ρik (t)dt where
d
1 X
ρik (t) = σij (t)σkj (t). (4028)
σi (t)σk (t) j=1

Proof. See that we can write


 2
d d 2
X σij (t) X σij (t)
dBi (t)dBi (t) =  dWj (t) = 2 dt = dt. (4029)
j=1
σi (t) σ (t)
j=1 i

where the last step follows from definition of σi (t)2 and the result follows from Levy’s Theorem (see
Theorem 419). Next see that
   
d d
X σij (t) X σ kj (t)
dBi (t)dBk (t) =  dWj (t) ·  dWj (t) (4030)
j
σ i (t) j
σ k (t)

d
X σij (t)σkj (t)
= dt (4031)
j=1
σi (t)σk (t)
= ρik (t)dt. (4032)

where the cross terms cancel for which dWj dWk = 0 when j 6= k.

Exercise 677. Let Bi (t), i ∈ [m] be m one-dimensional Brownians with dBi (t)dBk (t) = ρik (t)dt for all
i, k ∈ [m] where ρik (t) are adapted stochastic processes taking values (−1, 1) for i 6= k and ρik (t) = 1
when i = k. Define a positive definite, symmetric matrix C(t) where the entries Ci,j are given by ρij (t),
which has a square root matrix A(t) such that A(t)A(t)T = C(t). Each element of C(t)i,k = ρik (t) can
Pm
be written as j=1 aij (t)akj (t), ∀i, k ∈ [m]. The matrix A(t) also has an inverse matrix defined A−1 (t)
such that A(t)A−1 (t) = 1 = A−1 (t)A(t). Denoting entries of the inverse matrix A−1 (t)i,j = αij (t) see
that this relationship is given by the Kronecker delta function
m m
αij (t)ajk (t) = δjk = 1i=k .
X X
aij (t)αjk (t) = (4033)
j=1 j=1

Show that there exists m independent Brownian motions such that Wi (t), i ∈ [m] satisfy
m Z
X t
Bi (t) = aij (u)dWj (u) ∀i ∈ [m]. (4034)
j=1 0

Proof. We want to find the Brownian motions that satisfy


m
X
dBi (t) = aij (t)dWj (t) ∀i ∈ [m] (4035)
j=1

which is equivalently

(dB1 (t), · · · , dBm (t))T = A(t)(dW1 (t), · · · , dWm (t))T . (4036)

or equivalently

(dW1 (t), · · · , dWm (t))T = A(t)−1 (dB1 (t), · · · , dBm (t))T . (4037)

For decomposition C(t) = A(t)A(t)T , we have A(t)−1 C(t)A−1·T = 1. We show that the Brownian
motions defined as such are independent. See that

(dW1 (t), · · · , dWm (t))T · (dW1 (t), · · · , dWm (t)) (4038)


−1 T −1·T
= A(t) (dB1 (t), · · · , dBm (t)) · (dB1 (t), · · · , dBm (t))A(t) (4039)
= A(t)−1 C(t)A(t)−1·T (4040)

and we showed that this last term A(t)−1 C(t)A−1·T = 1, and their independence is established.

Exercise 678. Define two Ito processes by


Z t Z t
X1 (t) = X1 (0) + Θ1 (u)du + σ1 (u)dB1 (u) (4041)
0 0
Z t Z t
X2 (t) = X2 (0) + Θ2 (u)du + σ2 (u)dB2 (u) (4042)
0 0

where B1 (t), B2 (t) are two Brownians with relation dB1 (t)dB2 (t) = ρ(t)dt. See that

dX1 (t)dX2 (t) = σ1 (t)σ2 (t)dB1 (t)dB2 (t) = ρ(t)σ1 (t)σ2 (t)dt. (4043)
Show that ρ(t_0) = lim_{ε→0+} C(ε)/√(V_1(ε)V_2(ε)), the instantaneous correlation of X_1(t), X_2(t), where the variables

M_i(ε) = E[X_i(t_0 + ε) − X_i(t_0)|F(t_0)]   i = 1, 2,   (4044)
V_i(ε) = E[(X_i(t_0 + ε) − X_i(t_0))²|F(t_0)] − M_i²(ε),   (4045)
C(ε) = E[(X_1(t_0 + ε) − X_1(t_0))(X_2(t_0 + ε) − X_2(t_0))|F(t_0)] − M_1(ε)M_2(ε).   (4046)
Rt Rt
Proof. Zeng [20] For general Xi (t) − Xi (0) = 0 Θi (u)du + 0 σi (u)dBi (u) see that for i = 1, 2 we have

Mi () = E[Xi (t0 + ) − Xi (t0 )|F(t0 )] (4047)


Z t0 + Z t0 + 
= E Θi (u)du + σi (u)dBi (u)|F(t0 ) (4048)
t0 t0
Z t0 + 
= Θi (t0 ) + E (Θi (u) − Θi (t0 ))du|F(t0 ) . (4049)
t0

By Jensen’s inequality (see Theorem 341) we obtain


Z t0 + Z t0 + 
E (Θi (u) − Θi (t0 ))du|F(t0 ) ≤ E |(Θi (u) − Θi (t0 ))| du|F(t0 ) (4050)
t0 t0

1
R t0 +
Use the bound  t0
|Θi (u) − Θi (t0 )|du ≤ 2M , Θi (u) → Θi (t0 ) as  → 0 and continuity property of Θi
with Dominated Convergence (see Theorem 343) to write
Z t0 +
1 t0 +
 Z 
1
lim E |(Θi (u) − Θi (t0 ))| du|F(t0 ) = E lim |(Θi (u) − Θi (t0 ))| du|F(t0 ) = 0, (4051)
→0  t0 →0  t
0

Rt
which is to say that Mi () = Θi (t0 ) + o(). Write Yi (t) = 0 σi (u)dBi (u), a martingale. dYi (t) =
σi (t)dBi (t) and by Ito Product

dYi (t)Yj (t) = Yi (t)dYj (t) + Yj (t)dYi (t) + dYi (t)dYj (t) (4052)
= Yi (t)dYj (t) + Yj (t)dYi (t) + σi (t)σj (t)dt. (4053)

Rt Rt Rt
Integrating, obtain Yi (t)Yj (t) = 0
Yi (u)dYj (u) + 0
Yj (u)dYi (u) +
σi (u)σj (u)du which has first two
0
Rt
terms as martingale trading (see Exercise 409) and we conclude Yi (t)Yj (t)− 0 σi (u)σj (u)du is martingale.
Then

E[(Xi (t0 + ) − Xi (t0 ))(Xj (t0 + ) − Xj (t0 ))|F(t0 )] (4054)


 Z t0 +  Z t0 +  
= E Yi (t0 + ) − Yi (t0 ) + Θi (u)du Yj (t0 + ) − Yj (t0 ) + Θj (u)du |F(t0 ) (4055)
t0 t0
Z t0 + Z t0 + 
= E[(Yi (t0 + ) − Yi (t0 ))(Yj (t0 + ) − Yj (t0 ))|F(t0 )] + E Θi (u)du Θj (u)du|F(t0 )
t0 t0
 Z t0 +   Z t0 + 
+E (Yi (t0 + ) − Yi (t0 )) Θj (u)du|F(t0 ) + E (Yj (t0 + ) − Yj (t0 )) Θi (u)du|F(t0 )
t0 t0

Use the same limiting arguments (verify this) and Dominated Convergence Theorem to see that the first
term is

I = σi (t0 )σj (t0 )ρij (t0 ) + o() (4056)

and the second term (verify this)


Z t0 + Z t0 +  Z t0 + 
E (Θi (u) − Θi (t0 ))du Θj (u)du|F(t0 ) + Θi (t0 )E Θj (u)du|F(t0 ) (4057)
t0 t0 t0
= o() + (Mi () − o())Mj () (4058)
= Mi ()Mj () + o(). (4059)

By Cauchy–Schwarz write for third term


 Z t0 + 
E (Yi (t0 + ) − Yi (t0 )) Θj (u)du|F(t0 ) (4060)
t0
 Z t0 + 
≤ E |(Yi (t0 + ) − Yi (t0 ))| |Θj (u)|du|F(t0 ) (4061)
t0
p
≤ M  E [(Yi (t0 + ) − Yi (t0 ))2 |F(t0 )] (4062)
p
≤ M  E [Yi (t0 + )2 − Yi (t0 )2 |F(t0 )] (4063)
s Z
t0 + 
≤ M E Θi (u)2 du|F(t0 ) verify this (4064)
t0

≤ M M  (4065)
= o() (4066)

and the fourth term is also ∈ o() (verify this). Then we can write

E[(Xi (t0 + ) − Xi (t0 ))(Xj (t0 + ) − Xj (t0 ))|F(t0 )] = Mi ()Mj () + σi (t0 )σj (t0 )ρij (t0 ) + o() (4067)

and

lim_{ε→0+} C(ε)/√(V_1(ε)V_2(ε)) = lim_{ε→0+} [ρ(t_0)σ_1(t_0)σ_2(t_0)ε + o(ε)] / √[(σ_1²(t_0)ε + o(ε))(σ_2²(t_0)ε + o(ε))] = ρ(t_0).   (4068)

Exercise 679. For a stock price determined by the evolution dS_t = αS_t dt + σS_t dW_t and interest rate r, define the market price of risk to be θ = (α − r)/σ and the state price density process to be

ξ(t) = exp{ −θW(t) − (r + ½θ²)t }.   (4069)

Show that

1. dξ(t) = −θξ(t)dW (t) − rξ(t)dt.

2. Let X denote the portfolio value process, and ∆t denotes adaptive process representing the stock
units held. Then

dXt = rXt dt + ∆t (α − r)St dt + ∆t σSt dWt . (4070)

Show that ξt Xt is martingale.

3. Show for fixed terminal T > 0, when investor begins with initial capital X0 and invests to have VT
at T , he must begin with initial capital X0 = E[ξt VT ], the present value of the random payoff. VT
is FT measurable.

Proof. -

1. See by Ito Doeblin (Theorem 415) that we have


1
d(exp(rt)ξt ) = d(exp(−θWt − θ2 t)) (4071)
2
1 θ2 1
= −θ exp(−θWt − θ2 t)dWt − exp(−θWt − θ2 t)dt (4072)
2 2 2
1 1
+ θ2 exp(−θWt − θ2 t)dWt dWt (4073)
2 2
1 2
= −θ exp(−θWt − θ t)dWt (4074)
2
= −θ(exp(rt)ξt )dWt . (4075)

See also that

d(exp(rt)ξt ) = r exp(rt)ξt dt + exp(rt)dξt . (4076)

Then −θ(exp(rt)ξt )dWt = r exp(rt)ξt dt + exp(rt)dξt and the result follows.

2. By Ito Product rule, definitions and part 1. write

d(ξt Xt ) = ξdXt + Xt dξt + dXt dξt (4077)


= ξt (rXt dt + ∆t (α − r)St dt + ∆t σSt dWt ) + Xt (−θξt dWt − rξt dt) (4078)
+(rXt dt + ∆t (α − r)St dt + ∆t σSt dWt )(−θξt dWt − rξt dt) (4079)
= ξt (∆t (α − r)St dt + ∆t σSt dWt ) − θXt ξt dWt − θ∆t σSt ξt dt (4080)
= ξt ∆t σSt dWt − θXt ξt dWt . (4081)

There is no drift term involved - it is martingale.

3. By part 2. and evaluating ξ0 see X0 = ξ0 X0 = EξT XT = EξT VT .

Exercise 680. For Brownian motion W (t) and Ito integral defined
Z t
B(t) = sign(W (s))dW (s) (4082)
0

show that

1. B(t) is Brownian,

2. that EB(t)W (t) = 0,

3. verify that dW 2 (t) = 2W (t)dW (t) + dt,

4. and finally that E[B(t)W 2 (t)] 6= EB(t) · EW 2 (t).

Proof. -

1. See that dBt dBt = sign(W (t))2 dt = dt and by initial condition zero use Levy Theorem 419 to
conclude it must be Brownian.

2. By Ito product rule (see Corollary 30) we have

dB(t)W (t) = B(t)dW (t) + W (t)dB(t) + dB(t)dW (t) (4083)


= B(t)dW (t) + W (t)signW (t)dW (t) + sign(W (t))dt. (4084)

Integrating and taking expectations obtain


Z t Z t
1 1
EB(t)W (t) = Esign(W (s))ds = E[1Ws ≥0 − 1Ws <0 ]ds = t − t = 0. (4085)
0 0 2 2

3. Use Ito product rule d(W (t)W (t)) = W (t)dW (t) + W (t)dW (t) + dW (t)dW (t).

4. By Ito product rule we have

dB(t)W 2 (t) = B(t)dW 2 (T ) + W 2 (t)dB(t) + dB(t)dW 2 (t) (4086)


2
= B(t)(2W (t)dW (t) + dt) + W (t)dB(t) + dB(t)(2W (t)dW (t) + dt) (4087)
= B(t)(2W (t)dW (t) + dt) + W 2 (t)(dW (t)sign(W (t))) + (dW (t)sign(W (t)))(2W (t)dW (t) + dt)
= 2B(t)W (t)dW (t) + B(t)dt + sign(W (t))W 2 (t)dW (t) + 2W (t)sign(W (t))dt. (4088)

Integrating and taking expectations, obtain


EB(t)W²(t) = E∫_0^t B(s)ds + 2E∫_0^t W(s)·sign(W(s))ds   (4089)
           = ∫_0^t EB(s)ds + 2∫_0^t E|W(s)|ds > 0,   (4090)

while EB(t)·EW²(t) = 0 since EB(t) = 0, so EB(t)W²(t) ≠ EB(t)·EW²(t).

Exercise 681. The application of Ito Doeblin (Theorem 415) holds if there are finitely many points
for x where f 00 (x) is not defined, given f 0 (x) is defined for all x ∈ R and is continuous on x, and that
R T 00
0
f (W (t))dt is defined. An example where the Ito Doeblin formula does not apply is explored, from
Shreve [19]

1. For K > 0, write f (x) = (x − K)+ and compute f 0 (x), f 00 (x).

2. Apply Ito Doeblin formula


Z T Z T
1
f (W (T )) − f (W (0)) = f 0 (W (t))dW (t) + f 00 (W (t))dt (4091)
0 2 0

using f specified in part 1. and show the equality does not hold.

3. Define a sequence of functions {fn }∞
n=1 by

1


 0 x≤K−


 2n
n 1 1 1

fn (x) = (x − K)2 + (x − K) + x ∈ [K ± ] (4092)

 2 2 8n 2n
1


x − K

x≥K+ .
2n
Find fn0 (x), fn00 (x) and show that fn0 (K − 2n
1
), fn0 (K + 2n
1
) is well defined, while fn00 (x) is not defined
1
at x = K ± 2n .

4. Show that

lim fn (x) = (x − K)+ (4093)


n→∞

for every x ∈ R and that


1
lim fn0 (x) = x < K ? 0 : (x = K ? : 1) (4094)
n→∞ 2
which shares the same integral value as

1(K,∞) (x) = 1{x > K} (4095)

since they only disagree on a single point with zero measure.

5. Applying f to Ito Doeblin and letting n → ∞, obtain


Z T Z T
(W (T ) − K)+ − (W (0) − K)+ = 1(K,∞) (W (t))dW (t) + lim n n→∞
1(K± 2n1 ) (W (t))dt.(4096)
0 0

Define the local time of the Brownian motion at K to be the value


Z T
LK (T ) = lim n
n→∞
1(K± 2n1 ) (W (t))dt, (4097)
0

and see that as n → ∞ we get ∞ · 0. Show that if path of the Brownian motion stays strictly below
K on [T ] then LK (T ) = 0.

6. Write
Z T
LK (T ) = (W (T ) − K)+ − 1(K,∞) (W (t))dW (t) (4098)
0

and show that we cannot have LK (T ) = 0 almost surely.

Proof. -

1.


1 x > K,

0
f (x) = ? x = K, (4099)


0 x<K

and f 00 (x) = 0 when x 6= K and undefined otherwise.


R∞ 1 x2
2. See Ef (W (T )) = K (x − K) √2πT exp(− 2T )dx > 0, but the integral on the RHS of Equation 4091
evaluates to zero by martingale trading (see Theorem 409) and part 1.

3. By taking derivative, write
 1 1
n(x − K) +
 x ∈ [K ± ]


 2 2n
fn0 (x) = 1 1 (4100)
x≥K+


 2n

0 else.

See also fn00 (x) = n if x ∈ [K ± 1


2n ] and 0 otherwise. It is to see that at x = K ± 1
2n agrees at the
relevant cases for fn0 (x) but not for fn00 (x).

4. We can see this by seeing which of the cases x = K, x < K, x > K applies and seeing that the
evaluations match (x − K)+ . The limit of the derivatives fn0 (x) follows by taking derivatives of
fn0 (x) and evaluating at respective cases.

5. Define fixed ω such that Wt (ω) < K, ∀t ∈ [T ]. By definition of limits, ∃n0 s.t ∀n ≥ n0 , maxt∈[T ] Wt (ω) <
1
K− 2n such that
Z T
LK (T )(ω) = lim n
n→∞
1[K± 2n1 ] (Wt (ω))dt = 0. (4101)
0

6. Since E[LK (T )] = E[(WT − K)+ ] > 0, it is not possible to have LK (T ) = 0 almost surely.

Exercise 682. Suppose we engage in a short European call hedge that is initially OTM with underlying
driven by equation dS(t) = σS(t)dW (t). Assume r = 0. Our strategy is to be long one share of stock
whenever St > K and own zero shares otherwise, such that our portfolio process is ∆(t) = 1(K,∞) (S(t)).
At terminal T if S(T ) < K then our portfolio value is X(0) = X(T ) = 0. If S(T ) > K then the call is
exercised and we have a stock worth S(T ) and liability K, with payoff X(T ) = (S(T ) − K)+ . Argue that
this stop-loss start-gain strategy is not practical and wrong.

Proof. First, transaction costs exist. Secondly, short sale restrictions exist. Third, market observables
are discrete. The portfolio value is given

dX(t) = ∆(t)dS(t) + r(X(t) − ∆(t)X(t))dt (4102)


= 1(K,∞) (S(t))(σS(t)dW (t)) + r(X(t) − ∆(t)X(t))dt. (4103)

Since r = 0 and initial condition X(0) = 0 we have


Z T
X(T ) = σ 1(K,∞) (S(t)) · S(t)dW (t). (4104)
0

See that EX(T ) = E(S(T ) − K)+ > 0 but the RHS of Equation 4104 is martingale, a contradiction.

13.2.10 Brownian Bridge


Definition 449 (Brownian Bridge from 0 → 0). Let W (t) be Brownian motion (see Definition 432) and
for fixed constant T > 0, define Brownian bridge from 0 to 0 on [0, T ] to be process satisfying
t
X(t) = W (t) − W (T ) t ∈ [T ]. (4105)
T

See that X(0) = X(T ) = 0. See that since W (T ) is not F(t) measurable then X(t) is not adapted to F(t).
t1
See that for partition 0 < t1 < · · · < tn < T , random variables X(t1 ) = W (t1 ) − T W (T ), · · · X(tn ) =
tn
W (tn ) − T W (T ) is jointly normal since each term is jointly normal (since each jointly normal term
is linear sum of independent normal increments). Then the Brownian bridge from 0 to 0 is Gaussian
process. (see Definition 448) The mean function is
 
t
m(t) = EX(t) = E W (t) − W (T ) = 0. (4106)
T
Covariance function is

c(s, t) = E[(W(s) − (s/T)W(T))(W(t) − (t/T)W(T))]   (4107)
        = EW(s)W(t) − (t/T)EW(s)W(T) − (s/T)EW(t)W(T) + (st/T²)EW²(T)   (4108)
        = min(s, t) − 2st/T + st/T   (4109)
        = min(s, t) − st/T.   (4110)
Definition 450 (Brownian Bridge from a → b). Let W (t) be Brownian motion and for fixed T > 0,
a, b ∈ R define Brownian bridge from a to b on [0, T ] to be process
(b − a)t
X a→b (t) = a + + X(t) t ∈ [T ], (4111)
T
(b−a)t
where X(t) is Brownian bridge from 0 to 0 defined as in Definition 449. See that the term a + T is
function drawing line from (0, a) → (T, b) and adding this line to the Brownian bridge give from 0 → 0 on
[0, T ] yields process that goes from a to b in time 0 to T . Adding this deterministic function to Gaussian
process gives us another Gaussian process, with expectation
(b − a)t
ma→b (t) = EX a→b (t) = a + . (4112)
T
The covariance function is still
st
ca→b (s, t) = E X a→b (s) − ma→b (s) X a→b (t) − ma→b (t) = min(s, t) − .
  
(4113)
T
Theorem 421 (Brownian Bridge is Scaled Stochastic Integral). See the variance of Brownian bridge
2
EX 2 (t) = c(t, t) = t − tT = t(TT−t) increases in t ∈ [ T2 ] and decreases in [ T2 , T ], but the variance of the
Rt Rt
Ito integral I(t) = 0 ∆(u)dW (u) is 0 ∆2 (u)du. We cannot write the Brownian bridge as a stochastic
integral of a deterministic integral. To obtain such analogues, define
Z t
1
Y (t) = (T − t) dW (u) t ∈ [0, T ). (4114)
0 T − u
Rt 1
See the integral component 0 T −u dW (u) is Gaussian process (it is Ito integral of deterministic integrand
as in Example 675) when t < T . For values 0 < t1 < · · · < tn < T the random variables Y (ti ) =
(T − ti )I(ti ), i ∈ [n] are jointly normal since they are linear combinations of jointly normal I(ti )’s (see
Theorem 449 for joint normality proof ), and therefore Y is Gaussian process. The mean and covariance
functions of I are respectively (short form ∧ ↔ min)

mI (t) = 0,
Z s∧t
1 1 1
cI (s, t) = du = − ∀s, t ∈ [0, T ).
0 (T − u)2 T −s∧t T

Therefore mean function for Y is mY (t) = 0. Without loss of generality assume that 0 ≤ s ≤ t < T such
1 1 s
that cI (s, t) = T −s − T = T (T −s) and therefore covariance function for Y is

cY (s, t) = E [(T − s)(T − t)I(s)I(t)] (4115)


s
= (T − s)(T − t) (4116)
T (T − s)
(T − t)s
= (4117)
T
st
= s− . (4118)
T
st
By symmetric arguments then cY (s, t) = min(s, t) − T for all s, t ∈ [0, T ). We see this has same
covariance form as the Brownian bridge from a to b, and also for 0 to 0. Since this is Gaussian process,
which is then jointly normal, which is then characterized by mean and covariance forms, then we conclude
Y is identically distributed to the Brownian bridge from 0 to 0.
t(T −t)
Consider the variance EY 2 (t) = cY (t, t) = T and see that as t → T , the variance goes to zero.
The random process Y (t) has variance that goes to zero as t → T .

Theorem 422 (Distributional Equivalence with Brownian Bridge). The process


 Z t
(T − t) 1
dW (u) t ∈ [0, T )

Y (t) = 0 T − u (4119)

0 t = T.

is Gaussian process on [T ] and has mean function mY (t) = 0, t ∈ [T ] and variance function cY (s, t) =
st
min(s, t) − T for s, t ∈ [T ]. See that this Gaussian process is distributionally equivalent to a Brownian
bridge from 0 to 0 on [T ]. (see Definition 449). Unlike the Brownian bridge, this Gaussian bridge is
adapted to filtration generated by W (t).

See that the stochastic differential for Y (t) is computable as


Z t Z t
1 1
dY (t) = dW (u) · d(T − t) + (T − t) · d dW (u) (4120)
0 T − u 0 T −u
Z t
1
= − dW (u) · dt + dW (t) (4121)
0 T − u
Y (t)
= − dt + dW (t). (4122)
T −t
See that the drift term drives the value of Y (t) towards zero as t → T , and the process Y (t) converges
to zero almost surely as t → T .

13.2.10.1 Joint Distributions of the Brownian Bridge

Based on Shreve [19]. Fix 0 = t0 < t1 < · · · < tn < T and define the X a→b (t) to be Brownian bridge
from a to b on [0, T ]. (see Definition 450) Consider the joint density of X a→b (ti ) for i ∈ [n]. Recall the
s(T −t)
covariance function for this is when s < t is c(s, t) = T and write in short from τj = T − tj . Then
define the random variable
X a→b (tj ) X a→b (tj−1 )
Zj = − (4123)
τj τj−1

and since the X a→b (ti )’s are jointly normal, so are the Zj , j ∈ [n]. See that

1 1
EZj = EX a→b (tj ) − EX a→b (tj−1 ) (4124)
τj τj−1
   
1 (b − a)tj 1 (b − a)tj−1
= a+ − a+ (4125)
τj T τj−1 T
a btj a btj−1
= + − − (4126)
T T τj T T τj−1
btj (T − tj−1 ) − btj−1 (T − tj )
= (4127)
T τj τj−1
b(tj − tj−1 )
= . (4128)
τj τj−1

and that
1 2 1
Var(Zj ) = Var(X a→b (tj )) − Cov(X a→b (tj ), X a→b (tj−1 )) + 2 Var(X a→b (tj−1 ))
τj2 τj τj−1 τj−1
1 2 1
= c(tj , tj ) − c(tj , tj−1 ) + 2 c(tj−1 , tj−1 ) (4129)
τj2 τj τj−1 τj−1
1 tj (T − tj ) 2 tj−1 (T − tj ) 1 tj−1 (T − tj−1 )
= 2 − + 2 (4130)
τj T τj τj−1 T τj−1 T
tj tj−1 tj−1
= −2 + (4131)
T τj T τj−1 T τj−1
tj (T − tj−1 ) − 2tj−1 (T − tj ) + tj−1 (T − tj )
= (4132)
T τj τj−1
tj − tj−1
= . (4133)
τj τj−1

For i < j we have


1 1 1 1
Cov(Zi , Zj ) = c(ti , tj ) − c(ti , tj−1 ) − c(ti−1 , tj ) + c(ti−1 , tj−1 )(4134)
τi τj τi τj−1 τi−1 τj τi−1 τj−1
ti (T − tj ) ti (T − tj−1 ) ti−1 (T − tj ) ti−1 (T − tj−1 )
= − − + (4135)
T τi τj T τi τj−1 T τi−1 τj T τi−1 τj−1
= 0. (4136)

Therefore for i 6= j, Zi ⊥ Zj and their joint density is


  2 
1  1 zj − b(tτj −t
 j−1 ) 

j τj−1
f(Z(ti ))i∈[n] (z1 , · · · , zn ) = Πnj=1 q exp − tj −tj−1 (4137)
t −tj−1
2π τjj τj−1  2

τ τ j j−1


  2 
b(t −tj−1 )

 1X n zj − τjj τj−1 
 1
= exp − t −t Πnj=1 q . (4138)

 2 j=1
j j−1
τj τj−1

 tj −tj−1
2π τ τ j j−1

xj xj−1
Let x0 = a and make change of variables zj = τj − τj−1 for j ∈ [1, n]. See the exponent term
 2
b(tj −tj−1 )
n
X zj − τj τj−1
tj −tj−1 (4139)
j=1 τj τj−1
n  2
X τj τj−1 xj xj−1 b(tj − tj−1 )
= − − (4140)
j=1
tj − tj−1 τj τj−1 τj τj−1
n
!
X τj τj−1 x2j x2j−1 b2 (tj − tj−1 )2 2xj xj−1 2xj b(tj − tj−1 ) 2xj−1 b(tj − tj−1 )
= + 2 + − − +
t
j=1 j
− tj−1 τj2 τj−1 2
τj2 τj−1 τj τj−1 τj2 τj−1 2
τj τj−1
n
!
X τj−1 x2j τj x2j−1 b2 (tj − tj−1 ) 2xj xj−1 2xj b 2xj−1 b
= + + − − + (4141)
j=1
τj (tj − tj−1 ) τj−1 (tj − tj−1 ) τj τj−1 tj − tj−1 τj τj−1
n
" #
x2j x2j−1
   
X τj−1 − τj τj−1 − τj 2xj xj−1
= 1+ + 1− − (4142)
j=1
tj − tj−1 τj tj − tj−1 τj−1 tj − tj−1
n   n  
2
X 1 1 X xj xj−1
+b − − 2b − (4143)
j=1
τj τj−1 j=1
τj τj−1
n
" #
x2j x2j−1
   
X tj − tj−1 tj − tj−1 2xj xj−1
= 1+ + 1− − (4144)
j=1
tj − tj−1 τj tj − tj−1 τj−1 tj − tj−1
n   n  
X 1 1 X xj xj−1
+b2 − − 2b − (4145)
j=1
τj τj−1 j=1
τj τj−1
n
" # n
" # n  n 
x2j − 2xj xj−1 + x2j−1 x2j x2j−1
 
X X X 1 1 X xj xj−1
= + − + b2 − − 2b −
j=1
tj − tj−1 j=1
τj τj−1 j=1
τj τj−1 j=1
τj τj−1
n
(xj − xj−1 )2 x2n a2
   
X 1 1 xn a
= + − + b2 − − 2b − (4146)
j=1
tj − tj−1 T − tn T T − tn T T − tn T
n
X (xj − xj−1 )2 (b − xn )2 (b − a)2
= + − . (4147)
j=1
tj − tj−1 T − tn T

Therefore the whole exponent component becomes


  
n 2 2 2
 1 X (x j − x )
j−1  (b − xn ) (b − a)
exp − − + . (4148)
 2 j=1 tj − tj−1 2(T − tn ) 2T 

The Jacobian for this transform is


δzj 1
= j ∈ [n]
δxj τj
and
δzj 1
=− j ∈ [2, n].
δxj−1 τj−1
The Jacobian matrix is
 
1
0 ··· 0
 τ11 1

− τ τ2 ··· 0
J =  1 . (4149)
 ··· ··· ··· · · ·

1
0 0 ··· τn

Determinant for lower triangular is product of diagonals so taking Πi diag(J )i = Πnj=1 τ1j , by change of
variable the joint density for Brownian bridge is

f(X a→b (ti ))i∈[n] (x1 , · · · , xn ) (4150)


 
n
1 τj−1  1X (xj − xj−1 )2 (b − xn )2 (b − a)2 
r
= Πnj=1 p exp − − + (4151)
2π(tj − tj−1 ) τj  2
j=1
tj − tj−1 2(T − tn ) 2T 
 
r n 2 2 2
T 1  1X (x j − x j−1 ) (b − x n ) (b − a)
= Πn p exp − − + (4152)
T − tn j=1 2π(tj − tj−1 )  2
j=1
tj − tj−1 2(T − tn ) 2T 

p(T − tn , xn , b) n
= Πj=1 p(tj − tj−1 , xj−1 , xj ) (4153)
p(T, a, b)
n 2
o
1
where p(τ, x, y) = √2πτ exp − (y−x)
2τ is transition density of Brownian motion.

13.2.11 Brownian Bridge as Conditioned Brownian Motion


Equation 4153 hints to us that the joint density of Brownian bridge is conditional density. The Brownian
bridge from a → b on [0, T ] is Brownian motion starting W (0) = a and conditioned on W (T ) = b. See
that for 0 = t0 < t1 < · · · < tn < T the joint density

f((W (ti )i∈[n] ,W (T )) (x1 , x2 , · · · , b) = p(T − tn , xn , b)Πnj=1 p(tj − tj−1 , xj−1 , xj ) (4154)

where the initial condition for Brownian motion W (0) = x0 = a. See this by arguing that p(t1 −
t0 , x0 , x1 ) = p(t1 , a, x1 ) is density for Brownian motion going from W (0) = a to W (t1 ) = x1 , and that
p(t2 − t1 , x1 , x2 ) is density from W (t1 ) = x1 to W (t2 ) = x2 . Joint density is product of the two terms,
the extension of which gives us Equation 4154. Then the joint conditional density of W (ti ), i ∈ [n] on
W (T ) = b is quotient value
p(T − tn , xn , b)
· Πnj=1 p(tj − tj−1 , xj−1 , xj ), (4155)
p(T, a, b)
a familiar density for the Brownian bridge joint density as in Equation 4153.

Corollary 31 (Distribution of Maximum Value of Brownian Bridge). Denote

M a→b (T ) = max X a→b (t) (4156)


0≤t≤T

as the maximum value obtained by Brownian bridge from a to b on [0, T ]. (see Definition 450). Show
that density function for M a→b (T ) is
 
2(2y − b − a) 2
fM a→b (T ) (y) = exp − (y − a)(y − b) y > max (a, b). (4157)
T T
Proof. Since Brownian bridge from 0 to w on [0, T ] is conditioned Brownian motion on W (T ) = w, the
maximum of X 0→w on [0, T ] is maximum of Brownian motion W on [0, T ] conditioned on W (T ) = w.
Then the density of M 0→w (T ) is the maximum to date conditional density as in Corollary 29 and can
be written
 
2(2m − w) 2m(m − w)
fM 0→w (T ) (m) = exp − w < m, m > 0. (4158)
T T
The result follows by translating to initial condition W (0) = a, thereby replacing m with y − a, and w
with b − a.

13.3 Risk-Neutrality
Risk-neutral pricing is a valuable tool for computing prices of derivatives, made available to us when accompanied by a hedge for a short derivative position. We begin the discussion with a theorem central to the topic. Recall that we can write a Radon-Nikodym derivative Z > 0 (see Definition 347) satisfying EZ = 1 and define a new probability measure P̃(A) = ∫_A Z(ω)dP(ω) for all A ∈ F on the probability space (Ω, F, P), such that ẼX = E[XZ], or EX = Ẽ[X/Z], or crudely Z = dP̃/dP. Exercise 567 shows us that Z = exp(−θX − ½θ²) defined under such settings gives Y = X + θ ∼ Φ(0, 1) under P̃ when X ∼ Φ(0, 1) under P.
Instead of a single random variable, we are interested in the change of measure for a process.
Definition 451 (Radon Nikodym Derivative Process). For probability space (Ω, F, P) and filtration F(t)
on [T ], define the Radon Nikodym derivative process to be

Z(t) = E[Z|F(t)] t ∈ [T ] (4159)

where P(Z > 0) = 1 and EZ = 1. See that this is martingale. In particular, by iterated conditioning

(Theorem 356) we have E[Z(t)|F(s)] = E[E[Z|F(t)]|F(s)] = E[Z|F(s)] = Z(s), where 0 ≤ s ≤ t ≤ T .
Lemma 21 (Properties of Radon Nikodym Process). Consider Radon Nikodym derivative process Z(t)
specified as in Definition 451. The following properties are satisfied:
1. For t ∈ [T ] and F(t) measurable Y , ẼY = EY Z(t).
1
2. For 0 ≤ s ≤ t ≤ T and F(t) measurable Y , Ẽ[Y |F(s)] = Z(s) E[Y Z(t)|F(s)].
Proof. 1. By tower/iterated conditioning property (see Theorem 356) write

ẼY = EY Z = E[E[Y Z|F(t)]] = E[Y E[Z|F(t)]] = EY Z(t). (4160)

2. As in Theorem 356, to prove equality we need to show measurability and partial averaging property.
The measurability is trivial. To show partial averaging property, we require
Z Z
1
E[Y Z(t)|F(s)]dP̃ = Y dP̃ ∀A ∈ F(s). (4161)
A Z(s) A
h i
The LHS evaluates to Ẽ 1A Z(s)1
E[Y Z(t)|F(s)] =

= E[1A E[Y Z(t)|F(s)]] by part 1. (4162)


= E[E[1A Y Z(t)|F(s)]] (4163)
= E[1A Y Z(t)] (4164)
= Ẽ[1A Y ] (4165)
Z
= Y dP̃ (4166)
A

and we are done.

Theorem 423 (Girsanov Theorem, One Dimension). Let W (t), t ∈ [T ] be Brownian motion on proba-
bility space (Ω, F, P) and its filtration F(t) be defined on [T ]. Let Θ(t) be adapted stochastic process to
F(t) (see Definition 308). Define
 Z t
1 t 2
Z 
Z(t) = exp − Θ(u)dW (u) − Θ (u)du , (4167)
0 2 0
Z t
W̃ (t) = W (t) + Θ(u)du. (4168)
0

RT
Assume finite integrability condition E 0
Θ2 (u)Z 2 (u)du < ∞. If Z = Z(T ), then EZ = 1 and under P̃,
W̃ (t) is Brownian.

Proof. By Levy’s Theorem (Theorem 419), we are done if we can show that W̃ (t) accumulates dt
quadratic variation with initial condition zero and is martingale. Continuity and zero initial condi-
tion is trivial. Also, dW̃ (t)dW̃ (t) = (dW (t) + Θ(t)dt)2 = dW (t)dW (t) = dt. We are left with martingale
property. First, write Ito process
Z t Z t
1
X(t) = − Θ(u)dW (u) − Θ2 (u)du (4169)
0 2 0

and applying Ito Doeblin on f (x) = exp(x) see that we obtain


1
dZ(t) = df (X(t)) = f 0 (X(t))dX(t) + f 00 (X(t))dX(t)dX(t) (4170)
 2 
1 2 1
= exp(X(t)) −Θ(t)dW (t) − Θ (t)dt + exp(X(t))Θ2 (t)dt (4171)
2 2
= −Θ(t)Z(t)dW (t). (4172)

There is no drift term, and Z(t) is martingale. In particular we have EZ = EZ(T ) = Z(0) = 1. Then
since Z(t) is martingale and Z = Z(T ), Z(t) = E[Z(T )|F(t)] = E[Z|F(t)] and Z(t) is a Radon Nikodym
derivative process of the sort in Definition 451. Next, we show W̃ (t)Z(t) is martingale under P. By Ito
Product (Lemma 30) write

d(W̃ (t)Z(t)) = W̃ (t)dZ(t) + Z(t)dW̃ (t) + dW̃ (t)dZ(t) (4173)


= −W̃ (t)Θ(t)Z(t)dW (t) + Z(t)dW (t) + Z(t)Θ(t)dt + (dW (t) + Θ(t)dt)(−Θ(t)Z(t)dW (t))
= (−W̃ (t)Θ(t) + 1)Z(t)dW (t). (4174)

Again there is no drift and the process W̃ (t)Z(t) is martingale. For 0 ≤ s ≤ t ≤ T by part 2. of Lemma
21 and martingale property we have
1 1
Ẽ[W̃ (t)|F(s)] = E[W̃ (t)Z(t)|F(s)] = W̃ (s)Z(s) = W̃ (s) (4175)
Z(s) Z(s)

and we have shown that W̃ (t) is martingale under P̃.

The two probability measures P, P̃ in Girsanov’s Theorem (Theorem 423) are equivalent probability
measures (see Definition 296). We develop the theory for an asset price model under P, the actual
probability measure, and under P̃, the risk-neutral (often called Q) measure.

13.3.1 Risk Neutral Measure, Generalized Geometric Brownian Motion


Consider Brownian motion W (t) equipped with filtration F(t) on probability space (Ω, F, P), and stock
price process dS(t) = α(t)S(t)dt + σ(t)S(t)dW (t) defined on [T ]. See that this is generalized geometric
Brownian motion (see Exercise 656), with equivalent expression
Z t Z t  
1 2
S(t) = S(0) exp σ(s)dW (s) + α(s) − σ (s) ds . (4176)
0 0 2

We are also provided with an adapted interest rate process R(t), and the discount process corresponding to this interest rate process is D(t) = exp{−∫_0^t R(s)ds}. By application of Ito-Doeblin to f(x) = exp(−x), writing I(t) = ∫_0^t R(s)ds, we have dD(t) = df(I(t)) = f'(I(t))dI(t) + ½f''(I(t))dI(t)dI(t) = −f(I(t))R(t)dt = −R(t)D(t)dt. See that D(t) has no quadratic variation and we can write its derivative as D'(t) = −R(t)D(t). For a share worth 1 in the money market account at time t = 0, at time t we have exp(∫_0^t R(s)ds) = 1/D(t) (hence the term discount process). The key difference between stock and
money market randomness can be stated: randomness in the model affects the money market account
only indirectly by affecting the interest rate. Interest rate changes do not affect the money market account
instantaneously, but only when they act over time. In essence, this is the difference between processes
with zero quadratic variation and non-zero quadratic variation. The discounted stock price process
follows
Z t Z t  
1
D(t)S(t) = S(0) exp σ(s)dW (s) + α(s) − R(s) − σ 2 (s) ds . (4177)
0 0 2
Write its differential using Ito Product rule to be

d(D(t)S(t)) = (α(t) − R(t))D(t)S(t)dt + σ(t)D(t)S(t)dW (t) (4178)


= σ(t)D(t)S(t)[Θ(t)dt + dW (t)]. (4179)
where Θ(t) = (α(t) − R(t))/σ(t) is called the market price of risk. The discounted stock price process has drift rate α(t) − R(t), the mean rate of return in excess of the interest rate. The variance term is unaffected by discounting. By applying Girsanov's Theorem 423 with probability measure P̃ we can rewrite Equation 4179 as

d(D(t)S(t)) = σ(t)D(t)S(t)dW̃ (t). (4180)

This P̃ is what we call the risk-neutral measure that makes the discounted stock price process D(t)S(t)
a martingale. In integral form this is
Z t
D(t)S(t) = S(0) + σ(u)D(u)S(u)dW̃ (u). (4181)
0

Under the risk-neutral measure, the undiscounted S(t) has mean rate of return at the level equal to
the interest rate. See this by substituting dW (t) = −Θ(t)dt + dW̃ (t), which makes the stock price
Rt
differential equation dS(t) = R(t)S(t)dt+σ(t)S(t)dW̃ (t). Solve the equation (or replace 0 σ(s)dW (s) →
Rt Rt
0
σ(s)dW̃ (s) − 0 (α(s) − R(s))ds in the solution for the generalized geometric Brownian motion for
Exercise 656) to obtain
Z t Z t  
1 2
S(t) = S(0) exp σ(s)dW̃ (s) + R(s) − σ (s) ds . (4182)
0 0 2

The change of measure from P → P̃ changes the mean rate of return but not the volatility. In general,
α(t) > R(t) and the change of measure reduces mean rate of return by placing greater probability on
the paths with lower return.

13.3.2 Risk Neutral Measure, Value of Portfolio Process


Consider portfolio wealth process with initial value X(0) = 0, ∆(t) in stock financed at rate R(t). The
portfolio value follows

dX(t) = ∆(t)dS(t) + R(t)(X(t) − ∆(t)S(t))dt (4183)


= ∆(t)(α(t)S(t)dt + σ(t)S(t)dW (t)) + R(t)(X(t) − ∆(t)S(t))dt (4184)
= R(t)X(t)dt + ∆(t)(α(t) − R(t))S(t)dt + ∆(t)σ(t)S(t)dW (t) (4185)
= R(t)X(t)dt + ∆(t)σ(t)S(t)[Θ(t)dt + dW(t)]   by definition of the market price of risk Θ(t). (4186)

By Ito product rule (see Lemma 30), relation dD(t) = −R(t)D(t)dt and Equation 4179 we have

d(D(t)X(t)) = D(t)dX(t) + X(t)dD(t) + dD(t)dX(t) (4187)


= ∆(t)σ(t)D(t)S(t)[Θ(t)dt + dW (t)] (4188)
= ∆(t)d(D(t)S(t)) (4189)
= ∆(t)σ(t)D(t)S(t)dW̃ (t) (4190)

The discussion argues that the investor can either invest in a money market account with mean rate of return R(t), or in a stock with mean rate of return R(t) under P̃. His discounted portfolio value is a martingale under the measure P̃ regardless of the investment.

13.3.3 Risk Neutral Measure, Derivative Pricing


Let V (T ) be F(T ) measurable random variable, the derivative payoff at terminal T . This is possibly
path dependent. The hedging problem requires that we find the initial capital X(0) and portfolio process
∆(t) required to hedge a short derivative position to have X(T ) = V (T ) almost surely. Under a portfolio
set up satisfying such conditions, and using the fact that D(t)X(t) is martingale under P̃, we would have
the relation

D(t)X(t) = Ẽ[D(T )X(T )|F(t)] = Ẽ[D(T )V (T )|F(t)]. (4191)

We can also call this the price V (t) of the derivative at t, and

D(t)V (t) = Ẽ[D(T )V (T )|F(t)] (4192)

or equivalently
Z T
V (t) = Ẽ[exp(− R(u)du)V (T )|F(t)] t ∈ [T ], (4193)
t

which shall be known as the risk-neutral pricing formula.

Exercise 683 (State Price Density Process). Show that the risk-neutral pricing formula (see Equation
4193) can be expressed

D(t)Z(t)V (t) = E[D(T )Z(T )V (T )|F(t)], (4194)

where Z(T ) is Radon-Nikodym derivative process (see Definition 451) and expectation is w.rt. the ac-
tual probability measure. For some A ∈ F(T ), if a security pays 1A then at time zero it is worth
E[D(T )Z(T )1A ]. The process D(t)Z(t) is known as the state price density process.

Proof. By properties of Radon-Nikodym derivative process (see Lemma 21), D(t)V (t) = Ẽ[D(T )V (T )|F(t)] =
E[ D(T )VZ(t)
(T )Z(T )
|F(t)] and the result follows.

13.3.4 Risk Neutral Measure, Obtaining the Black-Scholes-Merton Form


Under the assumption of constant volatility and interest rate r, the risk-neutral pricing formula (see
Equation 4193) for European call with terminal T strike K is

Ẽ[exp(−rτ )(S(T ) − K)+ |F(t)] (4195)

at time t, where τ = T − t. Recall that geometric Brownian motion is Markov (see Theorem 404), and therefore this computation does not depend on the path before time t. There exists c(t, x) s.t.

c(t, S(t)) = Ẽ[exp(−rτ )(S(T ) − K)+ |F(t)] (4196)

and write

S(t) = S(0) exp{ σW̃(t) + (r − ½σ²)t }.   (4197)

Therefore, we have

S(T) = S(t) exp{ σ(W̃(T) − W̃(t)) + (r − ½σ²)τ }   (4198)
     = S(t) exp{ −σ√τ Y + (r − ½σ²)τ },   (4199)

where Y = −(W̃(T) − W̃(t))/√τ ∼ Φ(0, 1). S(t) is F(t) measurable and the exponent term is independent of F(t), so by the Independence Lemma we have
" + #

    
1 2
c(t, x) = Ẽ exp(−rτ ) x exp −σ τ Y + r − σ τ − K (4200)
2
Z ∞ +

    
1 1 2 1
= √ exp(−rτ ) x exp −σ τ y + r − σ τ − K exp(− y 2 )dy (4201)
2π −∞ 2 2
Z ∞
1 1
= √ exp(−rτ )ψ exp(− y 2 )dy. (4202)
2π −∞ 2

and see that ψ > 0 iff


   
1 x 1
y< √ log + r − σ 2 τ = d− (τ, x). (4203)
σ τ K 2

Then
Z d− (τ,x)  

   
1 1 1
c(t, x) = √ exp(−rτ ) x exp −σ τ y + r − σ 2 τ − K exp(− y 2 )dy
2π −∞ 2 2
Z d− (τ,x)  2 2 Z d− (τ,x)


1 y σ τ 1 1
= √ x exp − − σ τ y − dy − √ exp(−rτ )K exp(− y 2 )dy
2π −∞ 2 2 2π −∞ 2
Z d− (τ,x)
√ 2
 
x 1
= √ exp − (y + σ τ ) dy − exp(−rτ )KΦ(d− (τ, x)) (4204)
2π −∞ 2
Z d− (τ,x)+σ√τ  2
x z
= √ exp − dz − exp(−rτ )KΦ(d− (τ, x)) (4205)
2π −∞ 2
= xΦ(d+ (τ, x)) − exp(−rτ )KΦ(d− (τ, x)). (4206)

where the d± terms are given by Equation 3851 with relation by Equation 3853. Then the Black Scholes
Merton price for the European call is
" + #

    
1 2
BSM (τ, x, K, r, σ) = c(t, x) = Ẽ exp(−rτ ) x exp −σ τ Y + r − σ τ − K (4207)
2

which evaluates to xΦ(d_+(τ, x)) − exp(−rτ)KΦ(d_−(τ, x)). Previously, in Section 13.2.4 we verified the solution to the BSM PDE directly; here we derived the same solution via the risk-neutral measure.

Exercise 684 (Computing the Option Delta under Risk-Neutral Measure). Equation 3873 computed
the option delta by taking derivative of the Black-Scholes form. Here we show alternative methods.

1. Let h(s) = (s − K)+ , then h0 (s) = 1 {s > K} if s 6= K, and is undefined if s = K. Find cx (0, x)
0
using h (x).

2. Show that the form obtained in part 1. can be written by

cx (0, x) = P̂(S(T ) > K) (4208)

where P̂ is a probability measure equivalent to the risk-neutral measure P̃. Show that Ŵ (t) =
W̃ (t) − σt under P̂.

3. Write S(T ) in terms of Ŵ (T ) and show that


( )
Ŵ (T )
P̂(S(T ) > K) = P̂ − √ < d+ (T, x) = Φ(d+ (T, x)). (4209)
T

Proof. 1. See Equation 4197 for expression of S(T ) under the risk neutral measure. Write
 
1
ψ = x exp σ W̃T + (r − σ 2 )T
2
. We have
d
cx (0, x) = Ẽ[exp(−rT )(ψ − K)+ ] (4210)
dx 
d
= Ẽ exp(−rT ) h(ψ) (4211)
dx
 
ψ
= Ẽ exp(−rT ) exp( )1{ψ > K} (4212)
x
 
1 1 K 1
= exp(− σ 2 T )Ẽ exp(σ W̃T )1{W̃T > (ln − (r − σ 2 )T } Zeng [20]
2 σ x 2
" ( )#
1 2 √ W̃T W̃T √ 1 K 1 2 √
= exp(− σ T )Ẽ exp(σ T √ )1 √ − σ T > √ (ln − (r − σ )T ) − σ T
2 T T σ T x 2
1
Z ∞
1 z 2 √ n √ o
= exp(− σ 2 T ) √ exp(− ) exp(σ T z)1 z − σ T > −d+ (T, x) dz (4213)
2 −∞ 2π 2
Z ∞
1 1 √ n √ o
= √ exp(− (z − σ T )2 )1 z − σ T > −d+ (T, x) dz (4214)
−∞ 2π 2
= Φ(d+ (T, x)). (4215)

2. See that Equation 4212 can be written

cx (0, x) = Ẽ[ẐT 1{ST > K}] = Ê[1{ST > K}] = P̂(ST > K) (4216)
1 2
where ẐT = exp(σ W̃T − ẐT |F(t)]. See that Ẑt is martingale under P̃, surely
h 2 σ nT ) and Ẑt = Ẽ[oi
positive and EZˆT = Ẽ exp σ(W̃T − 1 σ 2 T )
2 = 1. By application of Girsanov Theorem using
dP̂
ZT = dP̃
(see Theorem 423) we have
Z t
Ŵt = W̃t + (−σ)du = W̃t − σt, (4217)
0

a P̂ Brownian.

3. See from relation in part 2. that
 
1 2
P̂(ST > K) = P̂(x exp σ W̃T + (r − σ )T > K) (4218)
2
 
1 2
= P̂(x exp σ ŴT + (r + σ )T > K) (4219)
2
σ2
    
x
= P̂ σ ŴT > − log + r+ T (4220)
K 2
 !
σ2
  
ŴT 1 x
= P̂ √ > − √ log + r+ T (4221)
T σ T K 2
!
ŴT
= P̂ √ > −d+ (T, x) (4222)
T
= Φ(d+ (T, x)). (4223)

Exercise 685 (Black Scholes Merton formula for time-varying, deterministic rates and volatility). Con-
sider stock price differential

dS(t) = r(t)S(t)dt + σ(t)S(t)dW̃ (t). (4224)

Here r(t), σ(t) are allowed to vary with time but are non-random. The European call is worth

c(0, S(0)) = Ẽ[D(T )(S(T ) − K)+ ]. (4225)

1. Show that S(T ) can be expressed as S(0) exp(X) where X ∼ Φ.

2. Let BSM (T, x, K, R, Σ) =

Σ2 Σ2
     
1 x 1 x
xΦ √ log + (R + )T − exp(−RT )KΦ √ log + (R − )T (4226)
Σ T K 2 Σ T K 2
be the value at time zero of European call maturing at T with constant volatility Σ and constant
rate R. Show that
 s 
Z T Z T
1 1
c(0, S(0)) = BSM T, S(0), K, r(t)dt, σ 2 (t)dt (4227)
T 0 T 0

Proof. 1. Using the techniques in Example 657 we can write
S(T) = S(0) exp{∫_0^T σ(t)dW̃(t) + ∫_0^T (r(t) − σ²(t)/2)dt}.  (4228)
Then denote X = ∫_0^T σ(t)dW̃(t) + ∫_0^T (r(t) − σ²(t)/2)dt and see that
X ∼ Φ(∫_0^T (r(t) − σ²(t)/2)dt, ∫_0^T σ²(t)dt),  (4229)
since both r, σ are nonrandom functions.

2. When rates and volatility are constant then we can write Equation 4228 as
S(T) = S(0) exp((R − Σ²/2)T + ΣW̃(T)).  (4230)
Denote Y = (R − Σ²/2)T + ΣW̃(T) and see that Y ∼ Φ((R − Σ²/2)T, Σ²T). We may express R
in terms of its moments, such that R = (1/T)(E[Y] + Var(Y)/2). We also have Σ = √((1/T)Var(Y)). The
risk-neutral pricing for the call option has relation
Ẽ[(S(0) exp(Y) − K)+] = exp(RT)BSM(T, S(0), K, R, Σ)  (4231)
= exp(EY + Var(Y)/2) BSM(T, S(0), K, (1/T)(EY + VarY/2), √((1/T)Var(Y))).
It follows that
c(0, S(0)) = exp(−∫_0^T r(t)dt) Ẽ[(S(0) exp(X) − K)+]  (4232)
= BSM(T, S(0), K, (1/T)∫_0^T r(t)dt, √((1/T)∫_0^T σ²(t)dt)).  (4233)

13.3.5 Martingale Representation Theorem


The risk-neutral pricing formula (see Equation 4193) assumes that the agent begins with some initial
capital, holds ∆(t) stocks continuously such that at terminal T his portfolio value is V (T ). It argues
that the initial capital is precisely V (0) = Ẽ[D(T )V (T )], with portfolio value V (t) = Ẽ[ D(T )
D(t) V (T )|F(t)]
for t ∈ [T ]. We verify these assumptions.

13.3.5.1 One Brownian Motion Martingale Representation

Result 42 (Martingale Representation, One Dimension, Shreve [19]). For W (t) Brownian motion on
probability space (Ω, F, P) equipped with filtration F(t), let M (t) be martingale with respect to this filtra-
tion. Then, there exists adapted process Γ(u) satisfying
M(t) = M(0) + ∫_0^t Γ(u)dW(u), t ∈ [T].  (4234)

That is, if filtration is generated by a single Brownian and nothing else, then every martingale with
respect to this filtration is initial condition plus Ito integral with respect to this Brownian motion.

The implication of Result 42 is that in hedging, only the Brownian motion uncertainty needs to be
removed, if it is the only source of uncertainty in the pricing model. Since Ito integrals are continuous,
this also implies that the martingale cannot have jumps. In the one-dimensional Girsanov (Theorem
423), the filtration is allowed to have more than the information generated by the Brownian. If we are
given additional restriction such that the filtration is generated by the Brownian only and nothing more,
then we arrive at the following:

Corollary 32. Let W (t) be Brownian on (Ω, F, P) with filtration F(t) and Θ(t) be adapted process. For
processes
Z(t) = exp{−∫_0^t Θ(u)dW(u) − (1/2)∫_0^t Θ²(u)du}  (4235)
W̃(t) = W(t) + ∫_0^t Θ(u)du,  (4236)
given finite integrability E∫_0^T Θ²(u)Z²(u)du < ∞ and Z = Z(T), we have EZ = 1 and W̃(t) is Brownian
under P̃. Let M̃(t) be martingale under P̃. Then, there exists adapted process Γ̃(u) (see Definition 308)
such that
M̃(t) = M̃(0) + ∫_0^t Γ̃(u)dW̃(u), t ∈ [T].  (4237)
0

Proof. Let f(x) = 1/x. Write X(t) = −∫_0^t Θ(u)dW(u) − (1/2)∫_0^t Θ²(u)du, then Z(t) = exp(X(t)). By Ito
Doeblin (see Theorem 415) get
dZ(t) = Z(t)dX(t) + (1/2)Z(t)Θ²(t)dt  (4238)
= Z(t)(dX(t) + (1/2)Θ²(t)dt)  (4239)
= −Z(t)Θ(t)dW(t).  (4240)
Applying Ito Doeblin again to f(x) get
d(1/Z(t)) = −(1/Z²(t))(−Z(t)Θ(t)dW(t)) + (1/2)(2/Z³(t))Z²(t)Θ²(t)dt = (Θ(t)/Z(t))dW(t) + (Θ²(t)/Z(t))dt.  (4241)
Let M̃(t) be some P̃ martingale, then we show that M(t) = Z(t)M̃(t) is martingale. By application of
Lemma 21 for 0 ≤ s < t we can write
M̃(s) = Ẽ[M̃(t)|F(s)] = E[Z(t)M̃(t)/Z(s) | F(s)].  (4242)
Then E[Z(t)M̃(t)|F(s)] = Z(s)M̃(s) and we have shown that ZM̃ is P martingale. Suppose we are given a
martingale M(t); the Martingale Representation Theorem (Result 42) asserts that ∃Γ(u) satisfying
M(t) = M(0) + ∫_0^t Γ(u)dW(u), t ∈ [T],  (4243)
with differential dM(t) = Γ(t)dW(t). By Ito Product, we have
dM̃(t) = d(M(t) · (1/Z(t))) = M(t)d(1/Z(t)) + (1/Z(t))dM(t) + dM(t)d(1/Z(t))  (4244)
= M(t)[(Θ(t)/Z(t))dW(t) + (Θ²(t)/Z(t))dt] + (1/Z(t))Γ(t)dW(t) + Γ(t)dW(t)[(Θ(t)/Z(t))dW(t) + (Θ²(t)/Z(t))dt]
= M(t)(Θ(t)/Z(t))dW(t) + M(t)(Θ²(t)/Z(t))dt + (1/Z(t))Γ(t)dW(t) + Γ(t)(Θ(t)/Z(t))dt.  (4245)
We may express this to be
dM̃(t) = (Γ(t)/Z(t))(dW(t) + Θ(t)dt) + (M(t)Θ(t)/Z(t))(dW(t) + Θ(t)dt).  (4246)
Letting Γ̃(t) = (Γ(t) + M(t)Θ(t))/Z(t), we obtain
dM̃(t) = Γ̃(t)dW̃(t)  (4247)
and by integrating our proof is complete.

13.3.5.2 Single Stock Hedging

Recall the settings for stock price process as generalized geometric Brownian motion as in Exercise
656. We make the additional assumption that the filtration F(t) is generated solely by the Brownian
W (t). Define V (T ) to be a F(T ) measurable random variable, and V (t) to be the asset price under
the risk-neutral pricing formula (Equation 4193) such that D(t)V (t) = Ẽ[D(T )V (T )|F(t)]. Under the
risk-neutral measure this is martingale. By iterated conditioning see

Ẽ[D(t)V (t)|F(s)] = Ẽ[Ẽ[D(T )V (T )|F(t)]|F(s)] (4248)


= Ẽ[D(T )V (T )|F(s)] (4249)
= D(s)V (s). (4250)

Martingale representation theorem (Theorem 42) asserts there exists Γ̃(u) such that
D(t)V(t) = V(0) + ∫_0^t Γ̃(u)dW̃(u), t ∈ [T].  (4251)
Recall from Equation 4188 that
D(t)X(t) = X(0) + ∫_0^t ∆(u)σ(u)D(u)S(u)dW̃(u), t ∈ [T].  (4252)
To obtain X(t) = V(t) for all t ∈ [T], we require X(0) = V(0) and ∆(t) s.t. ∆(t)σ(t)D(t)S(t) = Γ̃(t), or
that ∆(t) = Γ̃(t)/(σ(t)D(t)S(t)). The assumption we used is that F(t) is generated solely by the Brownian,
which is hedged by trading the stock (we also further assumed that σ(t) is strictly positive). Every
F(t) measurable derivative can be hedged, and the model is said to be ‘complete’. In other words,
the Martingale Representation Theorem justifies (and guarantees) the existence of risk-neutral pricing
formulas. It does not, however, specify how the process Γ̃(t) can be found.

13.3.6 Fundamental Theorem of Asset Pricing


We develop theory on multi-dimensional variants of Girsanov Theorem and Martingale Representation
Theorems, which were covered in one-dimension form in Theorem 423 and Result 42. We therefore use
the notation W (t) to be multidimensional Brownians (see Definition 445). This is defined on probability
space (Ω, F, P), of which P is the actual/empirical probability measure. Assume a fixed terminal T and
denote F = F(T), which is not necessarily generated by the Brownian alone, unless otherwise stated.

Theorem 424 (Girsanov Theorem, Multiple Dimension). Let Θ(t) = (Θi (t))i∈[d] be d-dimensional
adapted processes and define (note how the · defines an inner product)
Z(t) = exp{−∫_0^t Θ(u) · dW(u) − (1/2)∫_0^t ‖Θ(u)‖²du}  (4253)
W̃(t) = W(t) + ∫_0^t Θ(u)du.  (4254)
Assume finite integrability such that E∫_0^T ‖Θ(u)‖²Z²(u)du < ∞, where ‖Θ(u)‖ = (Σ_j Θj²(u))^(1/2) is the
Euclidean norm. Set Z = Z(T), then EZ = 1, and W̃(t) is d-dimensional Brownian under measure
P̃(A) = ∫_A Z(ω)dP(ω), ∀A ∈ F. The Ito integral component-wise expanded for the dot product is
∫_0^t Θ(u) · dW(u) = ∫_0^t Σ_{j=1}^d Θj(u)dWj(u) = Σ_{j=1}^d ∫_0^t Θj(u)dWj(u).  (4255)

Additionally, W̃j(t) = Wj(t) + ∫_0^t Θj(u)du, j ∈ [d]. Under P̃ the component processes of W̃(t) are ⊥,
although they might not be so under P. The component processes of W(t) are independent under P, but
each Θj(t) process can depend on any or all of the component Brownians Wi(t).

Proof. By Levy’s Theorem (see Theorem 420), it suffices to show that for i = 1, 2, W̃i (t) is P̃ martingale
with zero cross variation and t quadratic
h variation.i W̃i (t) is martingale under P̃ iff W̃i (t)Zt is martingale
W̃i (t)Z(t)
under P since Ẽ[W̃i (t)|F(s)] = E Z(s) |F(s) . See that dZ(t) = −Z(t)Θ(t)dW (t) from Equation
4240. By Ito Product

d(W̃i (t)Z(t)) = W̃i (t)dZ(t) + Z(t)dW̃i (t) + dZ(t)dW̃i (t) (4256)


= W̃i (t)(−Z(t)Θ(t)dW (t)) + Z(t)(dWi (t) + Θi (t)dt) + (−Z(t)Θ(t)dW (t))(dWi (t) + Θi (t)dt)
d
X
= W̃i (t)(−Z(t) Θj (t)dWj (t)) + Z(t)(dWi (t) + Θi (t)dt) + (−Z(t)Θi (t)dt) (4257)
j=1
d
X
= W̃i (t)(−Z(t) Θj (t)dWj (t)) + Z(t)dWi (t) (4258)
j=1

has no drift and is P martingale. Then W̃i (t) is P̃ martingale. See cross variation

dW̃i (t)dW̃j (t) = [dWi (t) + Θi (t)dt][dWj (t) + Θj (t)dt] (4259)


= dWi (t)dWj (t) (4260)
= dt1{i=j} . (4261)

Result 43 (Martingale Representation Theorem, Multiple Dimension, Shreve [19]). Let F(t) be generated
by a d-dimensional Brownian W(t) (see Definition 446). Let M(t) be martingale w.r.t. F(t) under P.
Then there exists an adapted d-dimensional stochastic process Γ(u) = (Γi(u))i∈[d] such that
M(t) = M(0) + ∫_0^t Γ(u) · dW(u), t ∈ [T].  (4262)
If we are further given that M̃(t) is P̃ martingale, then there exists adapted d-dimensional process Γ̃(u) =
(Γ̃i(u))i∈[d] satisfying
M̃(t) = M̃(0) + ∫_0^t Γ̃(u) · dW̃(u), t ∈ [T].  (4263)

13.3.6.1 Multidimensional Market Model

Assume m stocks, each following
dSi(t) = αi(t)Si(t)dt + Si(t) Σ_{j=1}^d σij(t)dWj(t), i ∈ [m].  (4264)
Assume that the α, σ matrices are adapted, and writing
σi(t) = √(Σ_{j=1}^d σij²(t)),  (4265)
define the processes
Bi(t) = Σ_{j=1}^d ∫_0^t (σij(u)/σi(u))dWj(u), i ∈ [m].  (4266)
It is easy to see that Bi(t) is martingale, and since d[Bi, Bi](t) = Σ_{j=1}^d (σij²(t)/σi²(t))dt = dt, we see
that Bi(t) is Brownian by application of Levy's Theorem (see Theorem 419). We may express
dSi(t) = αi(t)Si(t)dt + σi(t)Si(t)dBi(t).  (4267)
See that
dBi(t)dBk(t) = Σ_{j=1}^d (σij(t)σkj(t)/(σi(t)σk(t)))dt = ρik(t)dt.  (4268)
By Ito Product rule (Lemma 30) we have d(Bi(t)Bk(t)) = Bi(t)dBk(t) + Bk(t)dBi(t) + dBi(t)dBk(t), and
Bi(t)Bk(t) = ∫_0^t Bi(u)dBk(u) + ∫_0^t Bk(u)dBi(u) + ∫_0^t ρik(u)du.  (4269)
We have
Cov[Bi(t), Bk(t)] = E[Bi(t)Bk(t)] − E[Bi(t)]E[Bk(t)] = E∫_0^t ρik(u)du.  (4270)
In general, Bi(t) is not independent of Bk(t) for i ≠ k, and ρik(t) is the instantaneous correlation between
the two Brownians. If the volatility processes were constant (that is, σij(t), σkj(t) are independent of t,
so that ρik(t) = ρik is constant), the correlation is simply tρik/(√t·√t) = ρik. Using these relations, we have
dSi(t)dSk(t) = σi(t)σk(t)Si(t)Sk(t)dBi(t)dBk(t) = ρik(t)σi(t)σk(t)Si(t)Sk(t)dt,  (4271)
or that (dSi(t)/Si(t))(dSk(t)/Sk(t)) = ρik(t)σi(t)σk(t)dt. Instantaneous standard deviations and correlations are
unaffected by change in measure. Non-instantaneous standard deviations and correlations can be affected by
change of measure when the instantaneous standard deviations and correlations are random.
As before, define the discount process D(t) = exp{−∫_0^t R(u)du}, where R(u) is adapted interest
rate process. Writing X(t) = ∫_0^t R(u)du, and applying Ito Doeblin on f(x) = exp(−x) we have
dD(t) = − exp(−X(t))dX(t) + (1/2) exp(−X(t))dX(t)dX(t)  (4272)
= − exp(−X(t))(R(t)dt)  (4273)
= −D(t)R(t)dt.  (4274)
By Ito product
d(D(t)Si(t)) = D(t)dSi(t) + Si(t)dD(t) + dD(t)dSi(t)  (4275)
= D(t)dSi(t) + Si(t)(−D(t)R(t)dt) + (−D(t)R(t)dt)dSi(t)  (4276)
= D(t)(dSi(t) − Si(t)R(t)dt)  (4277)
= D(t)Si(t)[(αi(t) − R(t))dt + Σ_{j=1}^d σij(t)dWj(t)]  (4278)
= D(t)Si(t)[(αi(t) − R(t))dt + σi(t)dBi(t)], i ∈ [m].  (4279)

13.3.6.2 Existence of Risk-Neutral Measure

Definition 452 (Risk Neutral Probability Measure). A probability measure P̃ is said to be risk-neutral
if 1) P̃, P are equivalent probability measures (see Definition 296) and 2) under P̃, the discounted stock
price D(t)Si (t) is martingale ∀i ∈ [m].

For stock prices to be martingales, we aim to express Equation 4278 as
d(D(t)Si(t)) = D(t)Si(t) Σ_{j=1}^d σij(t)[Θj(t)dt + dWj(t)]  (4280)
= D(t)Si(t) Σ_{j=1}^d σij(t)dW̃j(t),  (4281)
where the conversion in the last step is given by the multidimensional Girsanov Theorem (see Theorem
424). The task is then to find Θj(t) that makes the equations agree - which occurs if the coefficient of
dt is the same in both cases. The resulting equations to be satisfied are
αi(t) − R(t) = Σ_{j=1}^d σij(t)Θj(t), i ∈ [m],  (4282)
which we call the market price of risk equations, a system of m equations in d unknowns.
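Numerically, the market price of risk equations are just a linear system σ(t)Θ(t) = α(t) − R(t) to be solved pointwise in (t, ω). The sketch below is a minimal illustration of our own (with made-up coefficients) using a least-squares solve; if the fitted solution does not reproduce the right-hand side exactly, the system has no solution and, as the next exercise illustrates, the model admits arbitrage.

```python
# Minimal sketch (our own illustration): solving the market price of risk
# equations alpha_i - R = sum_j sigma_ij * Theta_j at a fixed (t, omega).
import numpy as np

sigma = np.array([[0.20, 0.05],      # m = 3 stocks, d = 2 Brownians (made-up numbers)
                  [0.10, 0.15],
                  [0.25, 0.10]])
alpha = np.array([0.08, 0.06, 0.09])
R = 0.02

# Least-squares solve of sigma @ Theta = alpha - R
Theta, _, _, _ = np.linalg.lstsq(sigma, alpha - R, rcond=None)
solvable = np.allclose(sigma @ Theta, alpha - R)
print(Theta, solvable)   # solvable == False means no exact solution exists
```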

Exercise 686 (Shreve [19]). Consider two stocks and one Brownian, with market price of risk equations
α1 − r = σ1θ,  (4283)
α2 − r = σ2θ.  (4284)
The set of linear equations are solved if (α1 − r)/σ1 = (α2 − r)/σ2. Suppose not. Without loss of generality, assume
(α1 − r)/σ1 > (α2 − r)/σ2, and denote µ = (α1 − r)/σ1 − (α2 − r)/σ2 > 0. If an investor holds ∆1(t) = 1/(S1(t)σ1) shares of stock
one and ∆2(t) = −1/(S2(t)σ2) shares of stock two, financed or invested at rate r, the initial capital required is
1/σ1 − 1/σ2, with this value invested or borrowed from the money market to have initial wealth X(0) = 0.
Then
dX(t) = ∆1(t)dS1(t) + ∆2(t)dS2(t) + r(X(t) − ∆1(t)S1(t) − ∆2(t)S2(t))dt  (4285)
= ∆1(t)(α1S1(t)dt + σ1S1(t)dW(t))  (4286)
+ ∆2(t)(α2S2(t)dt + σ2S2(t)dW(t))  (4287)
+ r(X(t) − ∆1(t)S1(t) − ∆2(t)S2(t))dt  (4288)
= (α1/σ1)dt − (α2/σ2)dt + (dW(t) − dW(t))  (4289)
+ r(X(t) − ∆1(t)S1(t) − ∆2(t)S2(t))dt  (4290)
= ((α1 − r)/σ1)dt − ((α2 − r)/σ2)dt + rX(t)dt  (4291)
= µdt + rX(t)dt.  (4292)
By Ito Product rule
d(D(t)X(t)) = D(t)dX(t) + X(t)dD(t) + dX(t)dD(t)  (4293)
= D(t)dX(t) + X(t)[−D(t)R(t)dt] + [−D(t)R(t)dt][µdt + rX(t)dt]  (4294)
= D(t)[dX(t) − X(t)R(t)dt]  (4295)
= D(t)[µdt + rX(t)dt − rX(t)dt]  (4296)
= µD(t)dt.  (4297)
The discounted portfolio value process has positive drift, and this is nonrandom. Arbitrage exists.

If one cannot solve the market price of risk equations, then there is an arbitrage opportunity and the
pricing model is incomplete. If there is a solution, then there is a price of no arbitrage. Let ∆i (t) be
adapted positions for each Si (t), then the portfolio value evolution follows
dX(t) = Σ_{i=1}^m ∆i(t)dSi(t) + R(t)(X(t) − Σ_{i=1}^m ∆i(t)Si(t))dt  (4298)
= R(t)X(t)dt + Σ_{i=1}^m ∆i(t)(dSi(t) − R(t)Si(t)dt)  (4299)
= R(t)X(t)dt + Σ_{i=1}^m (∆i(t)/D(t)) d(D(t)Si(t))   from Equation 4277.  (4300)
Then it follows from Ito Product rule that
d(D(t)X(t)) = D(t)(dX(t) − R(t)X(t)dt)   from Equation 4295  (4301)
= Σ_{i=1}^m ∆i(t)d(D(t)Si(t))   from Equation 4300.  (4302)

Since ∀i ∈ [m], D(t)Si (t) is martingale under risk-neutral measure P̃, then so is D(t)X(t) under P̃.
Under this measure, all investments have a mean rate of return R(t), and the discounted portfolio value
is martingale.

Lemma 22. Let P̃ be risk-neutral measure, and X(t) be value of portfolio, then the discounted portfolio
value process D(t)X(t) is martingale under this risk-neutral measure.

Exercise 687 (Correlation under Change of Measure). Define Bi (t) as in Equation 4266. Assume
market price of risk equations (see Equations 4282) have solutions Θi (t), i ∈ [d].
Z T
W̃j (t) = Wj (t) + Θj (u)du j ∈ [d] (4303)
0

are d independent Brownians under P̃, the risk-neutral measure.

1. Define
d
X σij (t)Θj (t)
γi (t) = (4304)
j=1
σi (t)
qP
d 2 (t) as in Equation 4265 and show that
where σi (t) = j=1 σij
Z t
B̃i (t) = Bi (t) + γi (u)du (4305)
0

is Brownian under the risk-neutral measure.

2. Show that

dSi (t) = R(t)Si (t)dt + σi Si (t)dB̃i (t). (4306)

3. Show dB̃i (t)dB̃k (t) = ρik (t), that conditioned on F(t0 ), both the instantaneous correlation between
B̃1 and B̃2 is the same regardless of whether we use P or P̃.

4. Show that if ρik (t) is deterministic, then for all t ≥ 0 we have
Z t
E[Bi (t)Bk (t)] = Ẽ[B̃i (t)B̃k (t)] = ρik (u)du. (4307)
0

That is, both Bi(t), Bk(t) and B̃i(t), B̃k(t) have correlation (1/t)∫_0^t ρik(u)du under P, P̃ respectively.

5. Assume the multidimensional market model (see Section 13.3.6.1) for m = 2 stocks and d = 2 Brow-
nians. Let W1 (t) ⊥ W2 (t) be Brownians under P. Taking σ11 (t) = σ21 (t) = 0, σ12 (t) = 1, σ22 (t) =
sign(W1 (t)), see that σ1 (t) = σ2 (t) = 1, ρ11 (t) = ρ22 (t) = 1, ρ12 (t) = ρ21 (t) = sign(W1 (t)). Take
Θ1 (t) = 1, Θ2 (t) = 0 so that we have

W̃1 (t) = W1 (t) + t, W̃2 (t) = W2 (t). (4308)

Then γ1 (t) = γ2 (t) = 0 and

B1 (t) = W2 (t) (4309)


B̃1 (t) = B1 (t) (4310)
Z t
B2 (t) = sign(W1 (u))dW2 (u) (4311)
0
B̃2 (t) = B2 (t). (4312)

Show

E[B1 (t)B2 (t)] 6= Ẽ[B̃1 (t)B̃2 (t)]. (4313)

Proof. 1.
d d d
X σij (t) X σij (t)θj (t) X σij (t)
dB̃i (t) = dWj (t) + d dt = dW̃j (t). (4314)
j=1
σi (t) j=1
σi (t) j=1
σi (t)

It is clear that is it martingale. Additionally the quadratic variation is computed


d 2
X σij (t)
dB̃i (t)dB̃i (t) = dt = dt. (4315)
j=1
σi2 (t)

By Levy’s Theorem (see Theorem 419) we have shown it is Brownian.

2. From Equation 4267

dSi (t) = αi (t)Si (t)dt + σi (t)Si (t)dBi (t) (4316)


= R(t)Si (t)dt + σi (t)Si (t)dB̃i (t) + (αi (t) − R(t))Si (t)dt − σi (t)Si (t)γi (t)dt (4317)
d
X d
X
= R(t)Si (t)dt + σi (t)Si (t)dB̃i (t) + σij Θj (t)Si (t)dt − Si (t) σij (t)Θj (t)dt
j=1 j=1

= R(t)Si (t)dt + σi (t)Si (t)dB̃i (t). (4318)

3.

dB̃i (t)dB̃k (t) = (dBi (t) + γi (t)dt)(dBk (t) + γk (t)dt) = dBi (t)dBk (t) = ρik (t)dt. (4319)

4. By Ito Product rule
Z t Z t Z t
E[B1 (t)B2 (t)] = E[ Bi (s)dBk (s)] + E[ Bk (s)dBi (s)] + E[ dBi (s)dBk (s)] (4320)
0 0 0
Z t
= E[ ρik (s)ds] (4321)
0
Z t
= ρik (s)ds. (4322)
0

By part 3. see that the non-zero expectation matches in the computation for Ẽ[B̃i (t)B̃k (t)].

5. See that sign(x) = 1 {x ≥ 0} − 1 {x < 0}. Zeng [20] By Ito Product rule
Z t
E[B1 (t)B2 (t)] = E[ ρik (s)ds] (4323)
0
Z tX d
σij (u)σkj (u)
= E[ du] (4324)
0 j
σi (u)σk (u)
Z t
= E[ sign(W1 (u))du] (4325)
0
Z t
= [P(W1 (u) ≥ 0) − P(W1 (u) < 0)]du = 0. (4326)
0

See that
Z t
Ẽ[B̃1 (t)B̃2 (t)] = Ẽ[ sign(W1 (u))]du (4327)
0
Z t
= [P̃(W1 (u) ≥ 0) − P̃(W1 (u) < 0)]du (4328)
0
Z t
= [P̃(W̃1 (u) ≥ u) − P̃(W̃1 (u) < u)]du (4329)
0
Z t
= 1 − 2P̃(W̃1 (u) < u)du (4330)
0
< 0. (4331)

for t > 0 and we are done.

Exercise 688 (Correlations under Change of Measure and Random Market Prices of Risk). Let W1 (t) ⊥
W2 (t) under P. Let Θ1 (t) = 0, Θ2 (t) = W1 (t) in multidimensional Girsanov Theorem (see Theorem 424)
Rt
s.t. W̃1 (t) = W1 (t) and W̃2 (t) = W2 (t) + 0 W1 (u)du.

1. Show that ẼW1 (t) = ẼW2 (t) = 0 for all t ∈ [T ].

2. Use Ito Product rule on d(W1 (t)W2 (t)) to show that

˜
Cov(W1 (T ), W2 (T )) = 0 6= Cov(W 1 (T ), W2 (T )). (4332)

Rt
Proof. 1. Since W1 (t) = W̃1 (t) the first part is trivial. Next Ẽ[W2 (t)] = Ẽ[W̃2 (t) − 0
W̃1 (u)du] = 0
since Ẽ[W̃2 (t)] = 0.

651
2.
˜
Cov[W 1 (T ), W2 (T )] = Ẽ[W1 (T )W2 (T )] (4333)
"Z #
T Z T
= Ẽ W1 (t)dW2 (t) + W2 (t)dW1 (t) (4334)
0 0
"Z #
T
= Ẽ W̃1 (t)(dW̃2 (t) − W̃1 (t)dt) (4335)
0
"Z #
T
2
= −Ẽ W̃1 (t) dt (4336)
0
Z T
= − tdt (4337)
0
1
= − T 2. (4338)
2
It is trivial to see Cov(W1 (T ), W2 (T )) = 0, since they are independent by assumption.

Definition 453 (Securities Arbitrage). A securities arbitrage or portfolio arbitrage is a portfolio value
process X(t) with initial condition X(0) = 0 and satisfying

P(X(T ) ≥ 0) = 1 P(X(T ) > 0) > 0 (4339)

for T > 0. This happens iff (see Exercise 689 for proof) there is a way to start with positive capital X(0)
and beat the money market account. An equivalent statement is as follows:
P(X(T) ≥ X(0)/D(T)) = 1, P(X(T) > X(0)/D(T)) > 0.  (4340)
Exercise 689 (Proof of Equivalent Statements of Securities Arbitrage). Prove the iff relationship asserted
in Definition 453.

Proof. Zeng [20]. →. Let a > 0, and define X2 (t) = (a + X(t))D(t)−1 , then
 
−1 X2 (0)
P X2 (T ) = (a + X(T ))D(T ) ≥ = P(a + X(T ) ≥ a) = P(X(T ) ≥ 0) = 1, (4341)
D(T )
 
X2 (0)
P X2 (T ) > = P(X(T ) > 0) > 0. (4342)
D(T )
← Let X1 (t) = X2 (t)D(t) − X2 (0). Then X1 (0) = 0 and
 
X2 (0)
P(X1 (T ) ≥ 0) = P X2 (T ) ≥ =1 (4343)
D(T )
 
X2 (0)
P(X1 (T ) > 0) = P X2 (T ) > > 0. (4344)
D(T )

Theorem 425 (First Fundamental Theorem of Asset Pricing). A market model with risk-neutral prob-
ability measure does not admit arbitrage.

Proof. Recall that under P̃ all discounted portfolio value processes are martingale. That is, Ẽ[D(T )X(T )] =
X(0). Define X(t) portfolio process with initial condition zero, then the RHS evaluates to zero. Since the
risk-neutral measure is equivalent (see Definition 296), then P(X(T ) < 0) = 0 =⇒ P̃(X(T ) < 0) = 0,
and since Ẽ[D(T )X(T )] = 0 this implies P̃(X(T ) > 0) = 0. Again by equivalence, we would have
P(X(T ) > 0) = 0 and we have shown that X(t) is arbitrage free.

13.3.6.3 Uniqueness of Risk-Neutral Measures

Definition 454 (Completeness of Market Model). A market model is said to be complete if every
derivative security can be hedged.

Suppose we are provided with a market model with filtration generated by d-dimensional Brownians
with risk neutral measure P̃ and V (T ) that is F(T ) measurable. We can define V (t) be the risk-neutral
pricing such that D(t)V (t) is martingale under P̃. By Martingale Representation Theorem (see Theorem
43), there exists processes Γ̃1 (u), Γ̃2 (u), · · · Γ̃d (u) such that
D(t)V(t) = V(0) + Σ_{j=1}^d ∫_0^t Γ̃j(u)dW̃j(u).  (4345)

Equations 4188 and 4279 show that


d(D(t)X(t)) = Σ_{i=1}^m ∆i(t)d(D(t)Si(t))  (4346)
= Σ_{j=1}^d Σ_{i=1}^m ∆i(t)D(t)Si(t)σij(t)dW̃j(t).  (4347)
In integral form we arrive at
D(t)X(t) = X(0) + Σ_{j=1}^d ∫_0^t Σ_{i=1}^m ∆i(u)D(u)Si(u)σij(u)dW̃j(u).  (4348)
Compare with Equation 4345 to see we require the portfolio processes ∆i(t) to satisfy
Γ̃j(t)/D(t) = Σ_{i=1}^m ∆i(t)Si(t)σij(t), j ∈ [d],  (4349)
and we call these the hedging equations.

Theorem 426 (Second Fundamental Theorem of Asset Pricing). For a market with risk-neutral proba-
bility measure P̃, the market model is complete iff P̃ is unique.

Proof. Assume a complete model, and we show this implies uniqueness of the risk-neutral measure. Let
there be two equivalent risk-neutral measures P̃i for i = 1, 2. Let A ∈ F = F(T ), and suppose the
derivative payoff were V (T ) = 1A D(T
1
) . By completeness, a hedge for a short position in this derivative
exists, and under the risk neutral measures the discounted portfolio processes are martingales such that

P˜1 (A) = E˜1 [D(T )V (T )] = E˜1 [D(T )X(T )] = X(0) (4350)

and

P˜2 (A) = E˜2 [D(T )V (T )] = E˜2 [D(T )X(T )] = X(0). (4351)

Since this is satisfied for arbitrary A ∈ F, the two probability measures must be identical. Suppose
there exists some unique risk-neutral measure P̃, then the filtration for the model is generated by the
d-dimensional Brownians. Uniqueness of this measure implies that there exists a unique solution to
the market price of risk equations for Θi (t), i ∈ [d]. For fixed t, ω, the set of linear equations can be
written Ax = b, where aij = σij (t) is the i, j entry of A ∈ Rm,d , and x = (Θ1 (t) · · · Θd (t))T and
b = (α1 (t) · · · αm (t))T − R(t). The uniqueness assumption says that this linear system has unique

solution x. To show completeness, we require that the hedging equations (see Equation 4349) must be
Γ̃j (t)
solvable regardless of the D(t) corresponding to the security. The hedging equations can be expressed
by the equation A y = c. where y = (∆1 (t)S1 (t) · · · ∆m (t)Sm (t))T , and c = ( Γ̃D(t)
T 1 (t)
· · · Γ̃D(t)
d (t) T
) . The
uniqueness of x implies the existence of solution y (verify this).

Exercise 690 (Every strictly positive asset is generalized geometric Brownian (see Exercise 656)).
Let W (t) be Brownian motion adapted to F(t) defined on probability space (Ω, F, P). Let P̃ be unique
risk-neutral measure and W̃ (t) be Brownian under this measure. Martingale Representation Theorem
(Theorem 42) asserts ∀M̃ (t), ∃Γ̃(t) s.t.
Z t
M̃ (t) = M̃ (0) + Γ̃(u)dW̃ (u) t ∈ [T ]. (4352)
0

If P(V (T ) > 0) = 1 and V (T ) be F(T ) measurable, then risk-neutral pricing equation states (see Equation
4193)
Z T
V (t) = Ẽ[exp(− R(u)du)V (T )|F(t)]. (4353)
t

1. Show there exists adapted process Γ̃(t) satisfying

Γ̃(t)
dV (t) = R(t)V (t)dt + dW̃ (t) t ∈ [T ], (4354)
D(t)

2. ∀t ∈ [T ], P(V (t) > 0) = 1,

3. ∃σ(t) s.t.

dV (t) = R(t)V (t)dt + σ(t)V (t)dW̃ (t), t ∈ [T ]. (4355)

Proof. -

1. Risk neutral pricing asserts D(t)V (t) = Ẽ[D(T )V (T )|F(t)] - it is martingale. Martingale Repre-
sentation theorem asserts ∃Γ̃t s.t.
Z t
D(t)V (t) = Γ̃(u)dW̃ (u), (4356)
0

1
Rt
then V (t) = D(t) 0
Γ̃(u)dW̃ (u) and
Z t
1
dV (t) = −D(t)−2 (dD(t)) Γ̃(u)dW̃ (u) + Γ̃(t)dW̃ (t) (4357)
0 D(t)
Z t
1
= −D(t)−2 (−R(t)D(t)dt) Γ̃(u)dW̃ (u) + Γ̃(t)dW̃ (t) (4358)
0 D(t)
Γ̃(t)
= R(t)V (t)dt + dW (t). (4359)
D(t)

2. The uniqueness of the risk neutral measure asserts the existence of no arbitrage price. Suppose
the price of V (t) at any time is not positive, but is surely positive at T . Then this security can
be arbitraged (see Definition 453) and there is contradiction. Price of the security V (t) must be
almost surely positive.

Γ̃(t)
3. Set σ(t) = V (t)D(t) and we are done.

Exercise 691 (Implied Risk-Neutral Distribution). Let p̃(0, T, x, y) denote the risk-neutral density in
variable y of distribution of S(T ) when S(0) = x. Then the risk-neutral pricing equation (see Equation
4193) asserts the value of a European call be

Ẽ exp(−rT )(S(T ) − K)+


 
c(0, T, x, K) = (4360)
Z ∞
= exp(−rT ) (y − K)p̃(0, T, x, y)dy. (4361)
K

Suppose we were given an option chain at some T for continuous values of strikes K > 0, then compute
the cK , cKK to get the implied distribution p̃(0, T, x, y).

Proof. Differentiate (twice) Equation 4361 to get


Z ∞
cK (0, T, x, K) = − exp(−rT ) p̃(0, T, x, y)dy (4362)
K
cKK (0, T, x, K) = − exp(−rT )(−p̃(0, T, x, y)) (4363)

s.t. p̃(0, T, x, y) = exp(rT )cKK (0, T, x, K), the implied distribution. From Equation 3850 write

c = xΦ(d+ (τ, x)) − K exp(−rτ )Φ(d− (τ, x)) (4364)

with
σ2
   
1x
d± (τ, x) = log√ + r± τ from Equation 3851 (4365)
σ τ K 2

see that for φ(x) = √1



exp(− 12 x2 ), we have φ0 (x) = −xφ(x). Then

−1 1
cK = xφ(d+ ) √ − exp(−rT )Φ(d− ) + exp(−rT )φ(d− ) √ , (4366)
σ TK σ T
x xd+ −1 −1 exp(−rT )d− −1
cKK = √ φ(d+ ) + √ φ(d+ ) √ − exp(−rT )φ(d− ) √ − √ φ(d− ) √
σ TK 2 σ TK Kσ T Kσ T σ T Kσ T
   
x d+ exp(−rT )φ(d− ) d−
= φ(d+ ) √ 1− √ + √ 1+ √ (4367)
2
K σ T σ T Kσ T σ T
" √ # " √ #
x d− + σ T exp(−rT )φ(d− ) d+ − σ T
= φ(d+ ) √ 1− √ + √ 1+ √ Equation 3853
K 2σ T σ T Kσ T σ T
= (exp(−rT)/(Kσ²T))φ(d−)d+ − (x/(K²σ²T))φ(d+)d−.  (4368)
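The relation p̃(0, T, x, y) = exp(rT)cKK(0, T, x, K) gives a direct numerical recipe: differentiate observed call prices twice in strike. The sketch below is a minimal illustration of our own; for want of market data the "observed" prices are generated from the constant-volatility BSM formula, so the recovered density should be the lognormal one.

```python
# Minimal sketch (our own illustration) of Exercise 691: recover the implied
# risk-neutral density p(y) = exp(r*T) * d^2 c / dK^2 by finite differences
# on a grid of call prices in strike.
import numpy as np
from scipy.stats import norm

def bsm_call(T, x, K, r, sigma):
    d_plus = (np.log(x / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    return x * norm.cdf(d_plus) - np.exp(-r * T) * K * norm.cdf(d_plus - sigma * np.sqrt(T))

T, x, r, sigma = 1.0, 100.0, 0.02, 0.20
K = np.linspace(40.0, 250.0, 2001)
c = bsm_call(T, x, K, r, sigma)

dK = K[1] - K[0]
c_KK = np.gradient(np.gradient(c, dK), dK)   # second derivative in strike
density = np.exp(r * T) * c_KK

print(np.trapz(density, K))   # should integrate to roughly 1 over the strike grid
```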

Exercise 692 (Chooser Option). A market model has unique risk neutral measure P̃ and constant rate
r, then under this model the risk neutral pricing for calls, puts and forwards can be written

C(t) = Ẽ[exp(−rτ )(S(T ) − K)+ |F(t)] (4369)


+
P (t) = Ẽ[exp(−rτ ](K − S(T )) |F(t)] (4370)
F (t) = Ẽ[exp(−rτ )(S(T ) − K)|F(t)] (4371)
= exp(rt)Ẽ[exp(−rT )S(T )|F(t)] − exp(−rτ )K (4372)
= S(t) − exp(−rτ )K, (4373)

where we get Equation 4373 from Equation 4372 since under the risk neutral measure, exp(−rt)S(t) is
martingale. Furthermore by put call parity (see Theorem 3944) we have

C(t) − P (t) = F (t). (4374)

Let a chooser option be an option that gives the right at time t0 ∈ [T ] to own either the call or the put.
Show that the value of this option (i) is at t0
+
C(t0 ) + max {0, −F (t0 )} = C(t0 ) + (exp(−r(T − t0 ))K − S(t0 )) , (4375)

and that (ii) the value of the chooser option is the sum of the value of a call with strike K and maturity T,
plus the value of a put with strike exp(−r(T − t0))K and maturity t0.

Proof. (i) Since we would choose the option with higher value,

V (t0 ) = max {C(t0 ), P (t0 )} (4376)


= max {C(t0 ), C(t0 ) − F (t0 )} (4377)
= C(t0 ) + max {0, −F (t0 )} (4378)
+
= C(t0 ) + (exp {−r(T − t0 )} K − S(t0 )) . (4379)

(ii) The risk-neutral pricing formula asserts that under P̃

V (0) = Ẽ[exp(−rt0 )V (t0 )] (4380)


+
= Ẽ[exp(−rt0 )C(t0 ) + (exp(−rT )K − exp(−rt0 )S(t0 )) ] (4381)
= C(0) + Ẽ[exp {−rt0 } (exp {−r(T − t0 )} K − S(t0 ))+ ]. (4382)

13.3.7 Dividend Paying Stocks


Dividend adjustments (e.g re-investments) need to be considered for discounted portfolio processes to be
martingale under risk-neutral measure.

13.3.7.1 Continuous Dividends

For a stock paying dividends continuously at rate A(t), we assume the stock price follows the differential
equation

dS(t) = (α(t) − A(t))S(t)dt + σ(t)S(t)dW (t). (4383)

An investor portfolio process follows

dX(t) = ∆(t)dS(t) + ∆(t)A(t)S(t)dt + R(t)[X(t) − ∆(t)S(t)]dt (4384)


= R(t)X(t)dt + (α(t) − R(t))∆(t)S(t)dt + σ(t)∆(t)S(t)dW (t) (4385)
= R(t)X(t)dt + ∆(t)S(t)σ(t)[Θ(t)dt + dW (t)], (4386)
Rt
where Θ(t) is market price of risk (see Equation 4282). Define W̃ (t) = W (t) + 0
Θ(u)du and by
application of Girsanov (Theorem 423) rewrite the portfolio differential equation to be

dX(t) = R(t)X(t)dt + ∆(t)S(t)σ(t)dW̃ (t). (4387)

By Ito Product Rule (see Theorem 30) the discounted portfolio value process is specified d(D(t)X(t)) =
∆(t)D(t)S(t)σ(t)dW̃ (t) (Equation 4190), and it is martingale under P̃. We have D(t)X(t) = Ẽ[D(T )V (T )|F(t)] =
D(t)V (t), the risk-neutral pricing for the dividend paying stock. Under the risk-neutral measure the stock
price differential follows

dS(t) = [R(t) − A(t)]S(t)dt + σ(t)S(t)dW̃ (t), (4388)

and the discounted stock price is not a martingale under the risk-neutral measure. The solution takes
form
S(t) = S(0) exp{∫_0^t σ(u)dW̃(u) + ∫_0^t (R(u) − A(u) − σ²(u)/2)du}.  (4389)
To obtain the martingale property, we require reinvestment of dividends - in fact, the process
exp{∫_0^t A(u)du} D(t)S(t) = S(0) exp{∫_0^t σ(u)dW̃(u) − (1/2)∫_0^t σ²(u)du}  (4390)
is martingale.

13.3.7.2 European Call Pricing for Continuous Dividends

Assuming the adapted processes for volatility, rate and dividends are constant, the stock price follows
   
1 2
S(t) = S(0) exp σ W̃ (t) + r − a − σ t , (4391)
2
   
1
S(T ) = S(t) exp σ(W̃ (T ) − W̃ (t)) + r − a − σ 2 (T − t) . (4392)
2

By risk-neutral pricing, the derivative value V (t) = Ẽ[exp(−rτ )(S(T ) − K)+ |F(t)]. The call price for
S(t) = x is expressed
"      + #
1 2
c(t, x) = Ẽ exp(−rτ ) x exp σ(W̃ (T ) − W̃ (t)) + r − a − σ τ − K (4393)
2
" + #

    
1 2
= Ẽ exp(−rτ ) x exp −σ τ Y + r − a − σ τ − K , (4394)
2

where Y = −(W̃(T) − W̃(t))/√τ ∼ Φ(0, 1) under P̃. Define
d±(τ, x) = (1/(σ√τ))[log(x/K) + (r − a ± σ²/2)τ].  (4395)

Since the integrand is positive iff Y < d− (τ, x), the call price can be written
Z d− (τ,x)

     
1 1 1
c(t, x) = √ exp(−rτ ) x exp −σ τ y + r − a − σ 2 τ − K exp(− y 2 )dy
2π −∞ 2 2
Z d− (τ,x) Z d− (τ,x)

   
1 1 1 1 1
= √ x exp −σ τ y − a + σ 2 τ − y 2 dy − √ exp(−rτ )K exp(− y 2 )dy
2π −∞ 2 2 2π −∞ 2
Z d− (τ,x)
√ 2
 
1 1
= √ x exp(−aτ ) exp − y + σ τ dy − exp(−rτ )KΦ(d− (τ, x)) (4396)
2π −∞ 2
Z d+ (τ,x)  2
1 z
= √ x exp(−aτ ) exp − dz − exp(−rτ )KΦ(d− (τ, x)) (4397)
2π −∞ 2
= x exp(−aτ )Φ(d+ (τ, x)) − exp(−rτ )KΦ(d− (τ, x)). (4398)
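Equation 4398 is the familiar BSM formula with the dividend yield a discounting the stock leg and shifting the drift in the d± of Equation 4395. The sketch below is a minimal illustration of our own, assuming numpy and scipy are available.

```python
# Minimal sketch (our own illustration) of Equation 4398: European call on a
# stock paying a continuous dividend yield a.
import numpy as np
from scipy.stats import norm

def bsm_call_div_yield(tau, x, K, r, a, sigma):
    # d+/- from Equation 4395 carry the dividend yield in the drift term
    d_plus = (np.log(x / K) + (r - a + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d_minus = d_plus - sigma * np.sqrt(tau)
    return x * np.exp(-a * tau) * norm.cdf(d_plus) - np.exp(-r * tau) * K * norm.cdf(d_minus)

# With a = 0 this collapses to the standard BSM call price
print(bsm_call_div_yield(1.0, 100.0, 105.0, r=0.02, a=0.03, sigma=0.20))
```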

Exercise 693 (Cash Flow Hedging). Let W (t) be Brownian motion on probability space (Ω, F, P) and
let F(t) be filtration generated by W (t). Let α(t), R(t), σ(t) be adapted and a stock follow

dS(t) = α(t)S(t)dt + σ(t)S(t)dW (t) (4399)

. Suppose agent must pay cash flow at rate C(t), then the portfolio differential is

dX(t) = ∆(t)dS(t) + R(t)(X(t) − ∆(t)S(t))dt − C(t)dt. (4400)

Find the initial portfolio value and ∆(t) required to hedge the cash flow, such that X(T ) = 0.

Proof. Zeng [20]. By Ito Product see that

d(Xt Dt ) = Dt dXt + Xt dDt + dDt dXt (4401)


= Dt (∆t dSt + Rt (Xt − ∆t St )dt − Ct dt) + Xt dDt + dDt (∆t dSt + Rt (Xt − ∆t St )dt − Ct dt)
= Dt ∆t dSt + Dt Rt (Xt − ∆t St )dt − Dt Ct dt + Xt dDt + dDt ∆t dSt + dDt Rt (Xt − ∆t St )dt − dDt Ct dt
= ∆t Dt dSt + ∆t dDt dSt − Ct Dt dt + Dt Rt (Xt − ∆t St )dt + Xt dDt + dDt Rt (Xt − ∆t St )dt − dDt Ct dt
= ∆t Dt dSt + ∆t dDt dSt − Ct Dt dt − ∆t St Dt Rt dt
= ∆t Dt dSt + ∆t dDt dSt − Ct Dt dt + ∆t St dDt (4402)
= ∆t (Dt dSt + St dDt + dDt dSt ) − Ct Dt dt (4403)
= ∆t d(Dt St ) − Ct Dt dt (4404)

Integrating, we have
Z T Z T
DT XT − D0 X0 = ∆t d(Dt St ) − Ct Dt dt. (4405)
0 0

Since terminal condition is XT = 0 then we have


Z T Z T
X0 D0 = Ct Dt dt − ∆t d(Dt St ). (4406)
0 0

αt −Rt
Since (see Equation 4179) we have d(Dt St ) = σt (Dt St )(Θt dt + dWt ), where Θt = σt . By Girsanov
(see Theorem 423) get change of measure s.t. dW̃t = Θt dt + dWt and from Equation 4406 we have
Z T Z T Z T
Ct Dt dt = D0 X0 + ∆t d(Dt St ) = D0 X0 + ∆t (Dt St )σt dW̃t . (4407)
0 0 0

This is constant plus Ito integral w.r.t to martingale. Define Mt = Ẽ[MT |F(t)], and martingale MT =
RT
0
Ct Dt dt. By Martingale Representation theorem (see Theorem 42), ∃Γ̃t s.t.
Z t
Mt = M0 + Γ̃t dW̃t . (4408)
0

Γ̃t
Set ∆t = Dt St σt and see that Equation 4405 satisfies the stochastic differential equation in Equation
4400. The terminal wealth is written
Z T Z T Z T
XT DT = D0 X0 + ∆t Dt St σt dW̃t − Ct Dt dt = D0 X0 + ∆t Dt St σt dW̃t − MT = 0.
0 0 0
RT RT
It follows that we must have D0 X0 + 0 ∆t Dt St σt dW̃t = M0 + 0 Γ̃t dW̃t . The initial portfolio wealth
RT
required is X0 = M0 = Ẽ[ 0 Ct Dt dt]. This is the no-arbitrage price of cash flow at setup.

Exercise 694 (Cost of Carry, Shreve [19]). Consider a commodity with price S(t) incurring storage costs
at rate a per unit time per unit of commodity, such that an investor portfolio holding ∆(t) commodity
units has portfolio differential

dX(t) = ∆(t)dS(t) − a∆(t)dt + r(X(t) − ∆(t)S(t))dt. (4409)

Assume constant volatility. Comparing with Equation 4383 we can write

dS(t) = (rS(t) + a)dt + σS(t)dW̃ (t), (4410)

where W̃ (t) is P̃ Brownian.

1. Show that the discounted portfolio value process exp(−rt)X(t) is martingale under the risk-neutral
measure.
n  o
2. Define Y (t) = exp σ W̃ (t) + r − 21 σ 2 t and verify that for t ∈ [T ] we have

dY (t) = rY (t) + σY (t)dW̃ (t), (4411)

which is equivalent to showing exp(−rt)Y (t) is P̃ martingale. Show that


Z t
a
S(t) = S(0)Y (t) + Y (t) ds (4412)
0 Y (s)

satisfies Equation 4410.

3. Derive a formula for Ẽ[S(T )|F(t)] in terms of S(t) by simplifying


Z t Z T
a Y (T )
Ẽ[S(T )|F(t)] = S(0)Ẽ[Y (T )|F(t)] + Ẽ[Y (T )|F(t)] ds + a Ẽ[ |F(t)]ds. (4413)
0 Y (s) t Y (s)

4. Verify the formula obtained in part 3. for the futures price process is a martingale under the risk
neutral measure.

5. A forward contract agreed at t to buy at T for price K of one unit of commodity is risk-neutrally
priced at

Ẽ[exp(−rτ )(S(T ) − K)|F(t)]. (4414)

The forward price is the value of K that makes the value of the contract worthless, and show that
the forward and futures price agree.

6. Consider an investor who shorts a forward contract at time zero. She hedges by borrowing from the
money market to purchase a commodity at time zero, and fulfills the obligation with her commodity
at terminal T in exchange for the forward price. Show that this is exactly enough to pay the
money-market debt, that is X(T ) = S(T ) − F orS (0, T ). Use the relation

exp(−rt)(dS(t) − rS(t)dt) = d(exp(−rt)S(t)). (4415)

Proof. Zeng [20]

1. For f (t, x) = exp(−rt)x we have

ft = −r exp(−rt)x, (4416)
fx = exp(−rt), (4417)
fxx = 0 (4418)

and by Ito Doeblin (see Theorem 415) we have

d(f (t, X(t))) = −r exp(−rt)X(t)dt + exp(−rt)dX(t) (4419)


= −r exp(−rt)X(t)dt + exp(−rt)(∆(t)dS(t) − a∆(t)dt + r(X(t) − ∆(t)S(t))dt)
= ∆(t) exp(−rt)[dS(t) − rS(t)dt − adt] (4420)
= ∆(t) exp(−rt)σS(t)dW̃ (t). (4421)

and discounted portfolio value process is P̃ martingale.

2. Define X(t) = σ W̃ (t) + r − 21 σ 2 t, such that dX(t) = σdW̃ (t) + r − 21 σ 2 dt. Then Y (t) =
 

exp(X(t)) and using f (x) = exp(x) by Ito Doeblin (Theorem 415) write
1
dY (t) = Y (t)dX(t) + Y (t)dX(t)dX(t) (4422)
2  
1 2 1
= Y (t)(σdW̃ (t) + r − σ dt) + Y (t)σ 2 dt (4423)
2 2
 
= Y (t) σdW̃ (t) + rdt . (4424)

Then by applying Ito Doeblin to f (t, Y (t)) = exp(−rt)Y (t) (see Theorem 415) we have

d(exp(−rt)Y (t)) = −r exp(−rt)Y (t)dt + exp(−rt)dY (t) (4425)


 
= −r exp(−rt)Y (t)dt + exp(−rt)(Y (t) σdW̃ (t) + rdt ) (4426)
= σ exp(−rt)Y (t)dW̃ (t). (4427)

exp(−rt)Y (t) is P̃ martingale. By Ito Doeblin (see Theorem 442) and Ito Product rule (see Lemma
30) write
Z t
a a a
dS(t) = S(0)dY (t) + Y (t) dt + dsdY (t) + dY (t)( dt) (4428)
Y (t) 0 Y (s) Y (t)
Z t 
a
= S(0)dY (t) + ds dY (t) + adt (4429)
0 Y (s)
 Z t 
a
= S(0) + ds Y (t)(σdW̃ (t) + rdt) + adt (4430)
0 Y (s)
= S(t)(σdW̃ (t) + rdt) + adt. (4431)

3. Ẽ[S(T )|F(t)]
Z t Z T
a Y (T )
= S(0)Ẽ[Y (T )|F(t)] + Ẽ[Y (T )|F(t)] ds + a |F(t)]ds
Ẽ[ (4432)
0 Y (s) t Y (s)
Z t  Z T
a
= S(0)Y (t)Ẽ[Y (T − t)] + ds Y (t)Ẽ[Y (T − t)] + a Ẽ[Y (T − s)]ds (4433)
0 Y (s) t
 Z t  Z T
a
= S(0) + ds Y (t) exp(r(T − t)) + a exp(r(T − s))ds (4434)
0 Y (s) t
 Z t   T
a 1
= S(0) + ds Y (t) exp(r(T − t)) + a − exp(r(T − s)) (4435)
0 Y (s) r t
 Z t   
a −1 1
= S(0) + ds Y (t) exp(r(T − t)) + a exp(0) + exp(r(T − t)) (4436)
0 Y (s) r r
 Z t 
ads a
= S(0) + Y (t) exp(r(T − t)) − (1 − exp(r(T − t))). (4437)
0 Y (s) r

4. Differentiating, write
 Z t 
ads
dẼ[S(T )|F(t)] = a exp(r(T − t))dt + S(0) + {exp(r(T − t))dY (t) − rY (t) exp(r(T − t))dt}
0 Y (s)
a
+ exp(r(T − t))(−r)dt
r
 Z t 
ads  
= a exp(r(T − t))dt + S(0) + [exp(r(T − t))(Y (t) σdW̃ (t) + rdt )
0 Y (s)
a
−rY (t) exp(r(T − t))dt] + exp(r(T − t))(−r)dt
r
 Z t 
ads
= S(0) + exp(r(T − t))σY (t)dW̃ (t). (4438)
0 Y (s)

Ẽ[S(T )|F(t)] is P̃ martingale.

5. Solving for K in Ẽ[exp(−rτ )(ST − K)|F(t)] = 0, obtain

K = Ẽ[S(T )|F(t)] = F utS (t, T ). see Def. 457 (4439)

6. See by part 1. that

d(exp(−rt)X(t)) = ∆(t) exp(−rt)[dS(t) − rS(t)dt − adt]. (4440)

In this part of the problem the investor holds one underlying, so ∆(t) = 1. Substituting, get

d(exp(−rt)X(t)) = exp(−rt)[dS(t) − rS(t)dt] − exp(−rt)adt (4441)


= d(exp(−rt)S(t)) − a exp(−rt)dt. hint from Eqn. 4415 (4442)

Integrating, write
Z t
exp(−rt)X(t) − exp(0)X(0) = exp(−rt)S(t) − exp(0)S(0) − a exp(−rt)dt, (4443)
0
Z t
exp(−rt)X(t) = exp(−rt)S(t) − S(0) − a exp(−rt)dt. (4444)
0

Then
Z t
a
X(t) = S(t) − exp(rt)S(0) − exp(rt) −r exp(−rt)dt (4445)
−r 0
a
= S(t) − exp(rt)S(0) + exp(rt)[exp(−rt)]t0 (4446)
r
a
= S(t) − S(0) exp(rt) + (1 − exp(rt)) (4447)
r
a
= S(t) − S(0) exp(rt) − (exp(rt) − 1) . (4448)
r
At t = T we have X(T ) = S(T ) − S(0) exp(rT ) − ar (exp(rT ) − 1). By part 3, the forward price is
 Z t 
ads a
Ẽ[S(T )|F(t)] = S(0) + Y (t) exp(rτ ) − (1 − exp(rτ )). (4449)
0 Y (s) r
Substituting t = 0 into the conditioning filtration, get
a
F orS (0, T ) = S(0) exp(rT ) − (1 − exp(rT )) . (4450)
r
Finally, see X(T ) = S(T ) − F orS (0, T ).

13.3.7.3 Lump Sum Dividends

Consider stock paying dividends at 0 < t1 < t2 · · · < tn < T of payment at each tj of sum aj S(tj −),
and aj ∈ [0, 1] is adapted to F(t). The stock price after dividend payments is given S(tj ) = S(tj −) −
aj S(tj −) = (1−aj )S(tj −). Set t0 = 0, tn+1 = T and a0 = an+1 = 0. The stock price follows a generalized
geometric Brownian motion between the dividend dates, that is

dS(t) = α(t)S(t)dt + σ(t)S(t)dW (t), t ∈ [tj , tj+1 ), j ∈ [0..n]. (4451)

This correspond to portfolio wealth differential

dX(t) = ∆(t)dS(t) + R(t)[X(t) − ∆(t)S(t)]dt (4452)


= R(t)X(t)dt + (α(t) − R(t))∆(t)S(t)dt + σ(t)∆(t)S(t)dW (t) (4453)
= R(t)X(t)dt + ∆(t)σ(t)S(t)[Θ(t)dt + dW (t)]. (4454)

On dividend dates, the fall in value of stock is exactly matched by the dividend payments and there are
no jumps in portfolio value. Then for all t ∈ [T ] we have

dX(t) = R(t)X(t)dt + ∆(t)σ(t)S(t)[Θ(t)dt + dW (t)]. (4455)

The risk-neutral pricing formula is given by change of measure P̃ to W̃ as in previous discussion.

13.3.7.4 European Call Pricing for Lump Sum Dividends

Assume volatility, rate and payment rates (σ, r, aj ) are constants. Between dividend dates, the stock
price evolution follows

dS(t) = rS(t)dt + σS(t)dW̃ (t). t ∈ [tk , tj+1 ), j = 0, · · · n. (4456)

Then
   
1
S(tj+1 −) = S(tj ) exp σ(W̃ (tj+1 ) − W̃ (tj )) + r − σ 2 (tj+1 − tj ) . (4457)
2

Recall that after dividend payments we are given the relation
S(tj+1)/S(tj) = (1 − aj+1) exp{σ(W̃(tj+1) − W̃(tj)) + (r − σ²/2)(tj+1 − tj)}.  (4458)
Since we set t0 = 0, tn+1 = T, then
S(T)/S(0) = Π_{j=0}^n S(tj+1)/S(tj) = Π_{j=0}^n (1 − aj+1) exp{σW̃(T) + (r − σ²/2)T},  (4459)
and we see that S(T) has the same solution as the no dividend model, except that the initial price condition
is set to S(0)Π_{j=0}^{n−1}(1 − aj+1). The Black Scholes price is then
c(t, x) = S(0)Π_{j=0}^{n−1}(1 − aj+1)Φ(d+) − exp(−rT)KΦ(d−)  (4460)
with d± = (1/(σ√T))[log(S(0)/K) + Σ_{j=0}^{n−1} log(1 − aj+1) + (r ± σ²/2)T].
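Equation 4460 says the lump-sum dividend call is the plain BSM price computed from the dividend-adjusted initial price S(0)Π(1 − aj+1). The sketch below is a minimal illustration of our own, assuming numpy and scipy are available.

```python
# Minimal sketch (our own illustration) of Equation 4460: European call when
# the stock pays proportional lump-sum dividends a_j at dates before expiry.
import numpy as np
from scipy.stats import norm

def bsm_call_lump_dividends(T, S0, K, r, sigma, a):
    # a: iterable of proportional dividend payments a_1, ..., a_n in (0, 1)
    S0_adj = S0 * np.prod(1.0 - np.asarray(a))   # dividend-adjusted initial price
    d_plus = (np.log(S0_adj / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d_minus = d_plus - sigma * np.sqrt(T)
    return S0_adj * norm.cdf(d_plus) - np.exp(-r * T) * K * norm.cdf(d_minus)

# Example: two 1.5% proportional dividends before a one-year expiry
print(bsm_call_lump_dividends(1.0, 100.0, 105.0, r=0.02, sigma=0.20, a=[0.015, 0.015]))
```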

13.3.8 Forwards and Futures


Assume a complete model with unique risk-neutral measure P̃.

Definition 455 (Zero Coupon Bonds). Let R(t), 0 ≤ t ≤ T̄ be an interest rate process. T̄ is set
such that all securities in our portfolio expire at or prior to T̄. The discount process is then D(t) =
exp{−∫_0^t R(u)du} and by risk-neutral pricing a unit zero coupon bond maturing at T is an asset with
price at t of B(t, T) satisfying
D(t)B(t, T) = Ẽ[D(T)|F(t)], 0 ≤ t ≤ T ≤ T̄.  (4461)

13.3.8.1 Forwards

Assume a stock/asset price process S(t).

Definition 456 (Forward Contract). A forward contract is an agreement to pay delivery price K at
date 0 ≤ T ≤ T̄ for asset S(t). The T-forward price F orS (t, T ) is the strike K that makes the value of
the forward contract worthless at time t.

Theorem 427 (Forward Priced in Zero Coupon Bond Terms). The relationship between the forward
and its underlying can be stated in terms of the price of the zero coupon bond. That is
S(t)
F orS (t, T ) = 0 ≤ t ≤ T ≤ T̄ . (4462)
B(t, T )
Proof. Suppose a trader shorts a forward at strike K maturity T , where K is chosen such that the
S(t)
contract is worthless. Then shorting receives no income. Further suppose he shorts B(t,T ) zcb (see
Definition 455) and uses the income generated to buy one share at S(t). She delivers the stock at T , and
S(t) S(t)
receives K, with liability B(t,T ) , so that her portfolio is worth K − B(t,T ) . The no arbitrage price must
S(t)
be K = B(t,T ) .
We may also show this using the risk-neutral pricing machinery. The price at time t of the forward
contract shall be
1
Ẽ[D(T )(S(T ) − K)|F(t)] (4463)
D(t)
1 K
= Ẽ[D(T )S(T )|F(t)] − Ẽ[D(T )|F(t)] (4464)
D(t) D(t)
= S(t) − KB(t, T ). (4465)

13.3.8.2 Futures

Consider partition 0 = t0 < · · · < tn = T for interval [T ] and let this be in partitions in size of daily
granularity. Assume constant rate for each day, then the discount process is
 
 Z tk+1   X k 
D(tk+1 ) = exp − R(u)du = exp − R(tj )(tj+1 − tj ) . (4466)
0 
j=0

See this is F(tk ) measurable. The zcb price at t = tk is given


1
B(tk , T ) = Ẽ[D(T )|F(tk )]. (4467)
D(tk )
An investor going long one forward at tk has portfolio value at tj ≥ tk equal to
   
1 S(tk )
Vk,j = Ẽ D(T ) S(T ) − |F(tj ) (4468)
D(tj ) B(tk , T )
1 S(tk ) 1
= Ẽ [D(T )S(T )|F(tj )] − · Ẽ[D(T )|F(tj )] (4469)
D(tj ) B(tk , T ) D(tj )
B(tj , T )
= S(tj ) − S(tk ) . (4470)
B(tk , T )
If rates were constant at r then B(t, T ) = exp(−rτ ) and Vk,j = S(tj ) − exp {r(tj − tk )} S(tk ), and the
portfolio value depends on the relative rate of growth between the stock and interest rates. The portfolio
value is only realized as profit at terminal T . Suppose we want a security that provides daily settlements -
we may do so using the object of futures. Denote the futures price to be F utS (t, T ), and this security shall
allow for daily settlement of profits, such that at tk+1 , a long position earns F utS (tk+1 , T ) − F utS (tk , T ),
known as mark to market (m2m) valuation. F utS (t, T ) is F(t) measurable, and F utS (T, T ) = S(T ).
The total cash flow is then
X n
(F utS (ti , T ) − F utS (ti−1 , T )) = F utS (T, T ) − F utS (0, T ) = S(T ) − F utS (0, T ). (4471)
i=1

The daily settlement is enabled vis-a-vis a margin account. The futures price process is set such that
the value of future payments are always zero - one may exit the position or enter into a position without
incurring costs other than those involved in the transaction itself. Mathematically this is formulated
1
0 = Ẽ[D(tk+1 )(F utS (tk+1 , T ) − F utS (tk , T ))|F(tk ))] (4472)
D(tk )
D(tk+1 ) n o
= Ẽ[F utS (tk+1 , T )|F(tk )] − F utS (tk , T ) . (4473)
D(tk )
The futures contract is a (discrete) martingale under P̃, since the above relation gives Ẽ[F utS (tk+1 , T )|F(tk )] =
F utS (tk , T ). Since we are required the boundary condition F utS (T, T ) = S(T ), then an equivalent state-
ment can be made - that the futures price must be

F utS (tk , T ) = Ẽ[S(T )|F(tk )] k = 0, · · · n. (4474)

Since the value of future cash flows at tk to be received at tj , j ≥ k + 1 are all zero, we have
1
Ẽ[D(tj )(F utS (tj , T ) − F utS (tj−1 , T ))|F(tk )] (4475)
D(tk )
1
= Ẽ[Ẽ[D(tj )(F utS (tj , T ) − F utS (tj−1 , T ))|F(tj−1 )]|F(tk )] (4476)
D(tk )
1
= Ẽ[D(tj )Ẽ[(F utS (tj , T ) − F utS (tj−1 , T ))|F(tj−1 )]|F(tk )] (4477)
D(tk )
= 0. (4478)

where we used the martingale property of F utS (t, T ) and D(tj ) measurability w.r.t F(tj−1 ).

Definition 457 (Futures Contract). A futures contract is an agreement to receive as cash flow the
changes in the futures price. The futures price of asset S taking value S(T ) at T is given

F utS (t, T ) = Ẽ[S(T )|F(t)] t ∈ [T ]. (4479)

Theorem 428. The futures price is a martingale under the risk-neutral measure P̃, satisfying F utS (T, T ) =
S(T ). Under the risk neutral measure the contract is always worth zero over any time interval.

Proof. By iterated conditioning arguments, F utS (t, T ) is martingale under P̃ measure. If filtration F(t)
is W (t)-generated, then by Martingale Representation Theorem (see Theorem 42), ∃Γ̃ s.t.
Z t
F utS (t, T ) = F utS (0, T ) + Γ̃(u)dW̃ (u) t ∈ [T ]. (4480)
0

If the investor borrows or invests his cash flow at rate R(t), his portfolio process is

dX(t) = ∆(t)dF utS (t, T ) + R(t)X(t)dt = ∆(t)Γ̃(t)dW̃ (t) + R(t)X(t)dt (4481)

and we have

d(D(t)X(t)) = D(t)∆(t)Γ̃(t)dW̃ (t). (4482)

If the investor has profit/portfolio value X(t0 ) = 0, then at t1 we have


Z t1
D(t1 )X(t1 ) = D(u)∆(u)Γ̃(u)dW̃ (u). (4483)
t0

We can write

Ẽ[D(t1 )X(t1 )|F(t0 )] (4484)


Z t1 Z t0 
= Ẽ D(u)∆(u)Γ̃(u)dW̃ (u) − D(u)∆(u)Γ̃(u)dW̃ (u)|F(t0 ) (4485)
0 0
Z t1  Z t0
= Ẽ D(u)∆(u)Γ̃(u)dW̃ (u)|F(t0 ) − D(u)∆(u)Γ̃(u)dW̃ (u) (4486)
0 0
= 0. (4487)
1
The value of at t = t0 of payment of X(t1 ) at t = t1 is D(t0 ) Ẽ[D(t1 )X(t1 )|F(t0 )] under the risk-neutral
pricing, which is zero regardless of positions in future ∆(u) held. If F(t) were generated by another
source of uncertainty, then we have
Z t1
D(t1 )X(t1 ) = D(u)∆(u)dF utS (u, T ), (4488)
t0

which is martingale trading and therefore Ẽ[D(t1 )X(t1 )|F(t0 )] = 0.

Let C(u) be the sum total cash flow generated by some asset between time 0 and time u. Then this
is F(u) measurable and a portfolio holding one such asset financed/invested in the money market at rate
R follows

dX(u) = dC(u) + R(u)X(u)du. (4489)

We can also write

d(D(u)X(u)) = D(u)dC(u). (4490)

Assuming initial condition zero and integrating, we have
Z T
D(T )X(T ) = D(u)dC(u) (4491)
t

and see that under risk-neutral pricing, the value at t of the terminal value of the portfolio shall be the
sum of discounted cash flows. This is the relation
"Z #
T
1 1
Ẽ[D(T )X(T )|F(t)] = Ẽ D(u)dC(u)|F(t) t ∈ [T ]. (4492)
D(t) D(t) t

If the cash flows for this asset were lump sum payments Ai at time ti for i ∈ [n], where Ai is F(ti )
measurable, then we can express C(u) = i=1 Ai 1[0,u] (ti ). Then the RHS of Equation 4492 can be
Pn

written
" n # n
1 1
D(ti )Ai 1(t,T ] (ti )|F(t) = 1(t,T ] (ti )
X X
Ẽ Ẽ[D(ti )Ai |F(t)]. (4493)
D(t) i=1 i=1
D(t)

These equations represent the risk-neutral valuation of cash flows generated in a portfolio.

13.3.9 Difference in Valuations Between Forwards and Futures


Recall that the arbitrage free prices for forwards and futures were written

S(t)
F orS (t, T ) = , F utS (t, T ) = Ẽ[S(T )|F(t)]. (4494)
B(t, T )

Operating under constant rate assumptions (we also saw this in Equation 4439), then the zcb price is
worth exp(−rτ ) and we have

F orS (t, T ) = exp(rτ )S(t) (4495)


F utS (t, T ) = exp(rT )Ẽ[exp(−rT )S(T )|F(t)] = exp(rT ) exp(−rt)S(t) = exp(rτ )S(t). (4496)

The forward and futures prices agree under constant rate assumptions. If rates were stochastic, then
the risk neutral price of the zcb is B(0, T ) = Ẽ[D(T )], and the forward-futures spread (difference) is
written
ForS(0, T) − FutS(0, T) = S(0)/Ẽ[D(T)] − Ẽ[S(T)]  (4497)
= (1/Ẽ[D(T)]){Ẽ[D(T)S(T)] − Ẽ[D(T)]·Ẽ[S(T)]}  (4498)
= (1/B(0, T)) C̃ov(D(T), S(T)).  (4499)

Note that the covariance is computed under the risk neutral measure. There is negative relation between
discount process and interest rates. If stock price and discount process have positive co-movement, then
prices and rates move in opposite directions. In states of the world where prices are high, rates are low,
and positive futures cash flow earn low rates of interest. In states of world where prices are low, rates
are high, and negative futures cash flow need to be financed by borrowing at higher rates. Therefore,
the forward contract is worth more than the futures contract of same terminal and strike, admitting a
positive spread when rates and prices are negatively covariant. The converse holds true.
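The identity in Equation 4499 can be checked numerically under a toy joint model for the discount factor and the terminal price. The sketch below is a minimal illustration of our own; the dynamics are made up purely to exercise the formula, with the single constraint that the simulated discounted stock satisfies Ẽ[D(T)S(T)] = S(0).

```python
# Minimal sketch (our own toy illustration) of Equation 4499: under stochastic
# rates, For_S(0,T) - Fut_S(0,T) = Cov~(D(T), S(T)) / B(0, T).
import numpy as np

rng = np.random.default_rng(0)
n, T, S0, sigma, rho = 1_000_000, 1.0, 100.0, 0.20, -0.5
z1 = rng.standard_normal(n)
z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)

r_bar = 0.03 + 0.01 * z1                       # crude average short rate over [0, T] (made up)
D = np.exp(-r_bar * T)                          # discount factor D(T)
S = S0 / D * np.exp(sigma * np.sqrt(T) * z2 - 0.5 * sigma**2 * T)   # enforces E~[D*S] = S0

B0 = D.mean()                                   # zero coupon bond price B(0, T)
spread_lhs = S0 / B0 - S.mean()                 # For_S(0,T) - Fut_S(0,T)
spread_rhs = np.cov(D, S)[0, 1] / B0
print(spread_lhs, spread_rhs)                   # agree up to Monte Carlo error
```

With rho < 0 the rate shock and price shock move oppositely, so D(T) and S(T) co-move positively and the printed spread is positive, matching the discussion above.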

13.4 Partial Differential Equations
Derivative pricing can be performed numerically or analytically. Analytic models solve a partial differ-
ential equation.

13.4.1 Stochastic Differential Equations


A SDE takes form

dX(u) = β(u, X(u))du + γ(u, X(u))dW (u), (4500)

where β(u, x) and γ(u, x) are drift and diffusion terms respectively. The initial condition specifies X(t) =
x for some t ≥ 0. The solution to the SDE is computing
Z T Z T
X(T ) = X(t) + β(u, X(u))du + γ(u, X(u))dW (u). (4501)
t t

This can be difficult to compute explicitly, since the stochastic process appears in both the LHS and
RHS. X(T ) is F(T ) measurable (it is in fact F(T )\F(t) measurable). A subclass of the SDE called
one-dimensional linear stochastic differential equations of form

dX(u) = (a(u) + b(u)X(u))du + (γ(u) + σ(u)X(u))dW (u) (4502)

can be solved explicitly. Here a, b, σ, γ are deterministic functions of time. In order to guarantee that the
solution to the SDE has Markov property, the only randomness permitted on the RHS of the equation
should be in X(u) and W(u). When we set β(u, x) = αx, γ(u, x) = σx we get the SDE

dS(u) = αS(u)du + σS(u)dW (u), (4503)

which is the geometric Brownian motion we have been working with in previous sections (see Theorem
435). The solution can be expressed
   
S(T ) 1 2
= exp σ(W (T ) − W (t)) + α − σ (T − t) . (4504)
S(t) 2

Previously, we looked at two interest rate models, namely Vasicek (Exercise 662) and CIR (Exercise
664) models. We look at a generalization of the Vasicek model, the Hull-White interest rate model.

Exercise 695 (Hull-White interest rate models). The Hull-White interest rate models specify the rate
process by

dR(u) = (a(u) − b(u)R(u))du + σ(u)dW̃ (u), (4505)

where a, b, σ are positive, nonrandom functions of time. W̃ (u) is P̃ Brownian. Here β(u, r) = a(u)−b(u)r
and γ(u, r) = σ(u). Let initial condition at time t be specified R(t) = r. Write f(u, x) = exp(∫_0^u b(v)dv)x,
with fu(u, x) = xb(u)exp(∫_0^u b(v)dv), fx(u, x) = exp(∫_0^u b(v)dv). Using the SDE to compute by Ito
Doeblin formula (see Theorem 415)
d[exp(∫_0^u b(v)dv)R(u)] = exp(∫_0^u b(v)dv)(b(u)R(u)du + dR(u))  (4506)
= exp(∫_0^u b(v)dv)(a(u)du + σ(u)dW̃(u)).  (4507)
Integrating, obtain
exp(∫_0^T b(v)dv)R(T) = r exp(∫_0^t b(v)dv) + ∫_t^T exp(∫_0^u b(v)dv)a(u)du + ∫_t^T exp(∫_0^u b(v)dv)σ(u)dW̃(u).
Solve for R(T) to get
R(T) = r exp(−∫_t^T b(v)dv) + ∫_t^T exp(−∫_u^T b(v)dv)a(u)du + ∫_t^T exp(−∫_u^T b(v)dv)σ(u)dW̃(u).  (4508)
Recall that the Ito integral with deterministic integrand (see Theorem 675) has known distribution. In
particular, here the Ito integral term is Φ(0, ∫_t^T exp(−2∫_u^T b(v)dv)σ²(u)du). Therefore, under the measure
P̃, R(T) is normally distributed with mean
r exp(−∫_t^T b(v)dv) + ∫_t^T exp(−∫_u^T b(v)dv)a(u)du  (4509)
and variance
∫_t^T exp(−2∫_u^T b(v)dv)σ²(u)du.  (4510)
Like the Vasicek model, the interest rate model allows the rate to go into negative territory.
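Since the mean and variance in Equations 4509 and 4510 only involve integrals of the deterministic inputs, they can be evaluated by simple quadrature. The sketch below is a minimal illustration of our own (with made-up constant parameters), assuming numpy is available.

```python
# Minimal sketch (our own illustration) of Equations 4509-4510: mean and
# variance of the Hull-White rate R(T) given R(t) = r, for user-supplied
# deterministic a(u), b(u), sigma(u), computed by numerical quadrature.
import numpy as np

def hull_white_moments(t, T, r, a, b, sigma, n=4000):
    u = np.linspace(t, T, n)
    bu = b(u)
    # int_t^u b(v) dv by cumulative trapezoid, then int_u^T b(v) dv by differencing
    cum_b = np.concatenate(([0.0], np.cumsum(0.5 * (bu[1:] + bu[:-1]) * np.diff(u))))
    B = cum_b[-1] - cum_b
    mean = r * np.exp(-B[0]) + np.trapz(np.exp(-B) * a(u), u)
    var = np.trapz(np.exp(-2.0 * B) * sigma(u) ** 2, u)
    return mean, var

# Example with constant (made-up) parameters a = 0.02, b = 0.5, sigma = 0.01
m, v = hull_white_moments(0.0, 5.0, r=0.03,
                          a=lambda u: np.full_like(u, 0.02),
                          b=lambda u: np.full_like(u, 0.5),
                          sigma=lambda u: np.full_like(u, 0.01))
print(m, v)
```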

Exercise 696. Consider the SDE

dX(u) = (a(u) + b(u)X(u))du + (γ(u) + σ(u)X(u))dW (u) (4511)

where W (u) is Brownian adapted to filtration F(u), and a(u), b(u), γ(u), σ(u) are adapted process. Fix
t ≥ 0 and initial value X(t) = x, and define
Z u Z u  
1
Z(u) = exp σ(v)dW (v) + b(v) − σ 2 (v) dv (4512)
t t 2
Z u Z u
a(v) − σ(v)γ(v) γ(v)
Y (u) = x + dv + dW (v). (4513)
t Z(v) t Z(v)

Show that dZ(u) = b(u)Z(u)du + σ(u)Z(u)dW (u) with value 1 at time t, and that

X(u) = Y (u)Z(u) (4514)

solves the SDE Equation 4511 with initial condition X(t) = x.


Ru Ru
σ(v)dW (v)+ t b(v) − 21 σ 2 (v) dv,

Proof. It is trivial to see that Z(t) = 1. Let f (x) = exp(x), and X(u) = t
then Z(u) = f (X(u)). By Ito Doeblin formula (see Theorem 415) we obtain
1
dZ(u) = Z(u)dX(u) + Z(u)dX(u)dX(u) (4515)
 2 
1 2 1
= Z(u) σ(u)dW (u) + (b(u) − σ (u))du + Z(u)σ 2 (u)du (4516)
2 2
= b(u)Z(u)du + σ(u)Z(u)dW (u). (4517)

By definition,

a(u) − σ(u)γ(u) γ(u)


dY (u) = du + dW (u) u ≥ t, (4518)
Z(u) Z(u)

then if X(u) = Y (u)Z(u) we would have X(t) = Y (t)Z(t) = x and

dX(u) = Y (u)dZ(u) + Z(u)dY (u) + dY (u)dZ(u) (4519)


 
a(u) − σ(u)γ(u) γ(u)
= Y (u)(b(u)Z(u)du + σ(u)Z(u)dW (u)) + Z(u) du + dW (u) + σ(u)γ(u)du
Z(u) Z(u)
= (Y (u)b(u)Z(u) + (a(u) − σ(u)γ(u)) + σ(u)γ(u))du + (σ(u)Z(u)Y (u) + γ(u))dW (u) (4520)
= (b(u)X(u) + a(u))du + (σ(u)X(u) + γ(u))dW (u). (4521)

13.4.2 Markov Property


Consider the SDE given by the general form in Equation 4500 repeated here:

dX(u) = β(u, X(u))du + γ(u, X(u))dW (u). (4522)

Let t ∈ [T ] and h(y) be Borel measurable function. Denote g(t, x) = Et,x h(X(T )). Assume integrability
s.t. Et,x |h(X(T ))| < ∞. If we do not have an explicit analytical form for the distribution of X(T ), we
would simulate such a distribution by Monte Carlo methods - we can simulate the SDE. One method is
the Euler method, which chooses a small δ and set

X(t + δ) = x + β(t, x)δ + γ(t, x)√δ ε1,  (4523)
X(t + 2δ) = X(t + δ) + β(t + δ, X(t + δ))δ + γ(t + δ, X(t + δ))√δ ε2,  (4524)
and so on, where the εi are IID Φ(0, 1). We can continue the process until we get X(T), and taking an
average over many such h(X(T)) gives an estimator for g(t, x).
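A minimal sketch of this Euler scheme (our own illustration, not from the text) is given below; the drift, diffusion and payoff are passed in as functions, and the example instantiates the geometric Brownian case with a call payoff, which can be discounted afterwards as in the risk-neutral pricing formula.

```python
# Minimal sketch (our own illustration): estimate g(t, x) = E^{t,x} h(X(T))
# by Euler-Maruyama simulation of dX = beta(u, X) du + gamma(u, X) dW.
import numpy as np

def euler_estimate(t, x, T, beta, gamma, h, delta=1e-2, n_paths=200_000, seed=0):
    rng = np.random.default_rng(seed)
    n_steps = int(round((T - t) / delta))
    X = np.full(n_paths, float(x))
    u = t
    for _ in range(n_steps):
        eps = rng.standard_normal(n_paths)
        X = X + beta(u, X) * delta + gamma(u, X) * np.sqrt(delta) * eps
        u += delta
    return h(X).mean()

# Example: geometric Brownian motion with a call payoff h(x) = max(x - K, 0)
est = euler_estimate(0.0, 100.0, 1.0,
                     beta=lambda u, x: 0.02 * x,
                     gamma=lambda u, x: 0.20 * x,
                     h=lambda x: np.maximum(x - 105.0, 0.0))
print(np.exp(-0.02 * 1.0) * est)   # discounted estimate, comparable to the BSM price
```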

Result 44. Let X(u), u ≥ 0 be solution to SDE with known X(0). Then for t ∈ [T ] we have

E[h(X(T ))|F(t)] = g(t, X(t)). (4525)

If we observe X until time t and attempt to compute the conditional expectation of h(X(T )), we shall
be able to assume the process starts at t with X(t) as initial condition instead. The only relevant piece
of information when computing E[h(X(T ))|F(t)] is the value of X(t) - it must be a Markov process.

Corollary 33 (SDE Solutions are Markov Process). The solutions to the stochastic differential equations
(see Equation 4500) are Markov processes. (see Definition 325).

13.4.3 Feynman-Kac Theorem


Feynman-Kac Theorem relates stochastic differential equations (SDE) to partial differential equations
(PDE). The relationship between the geometric Brownian motion (see Definition 435) and the BSM PDE
(see Equation 3848) is a special case of such relationships.

Lemma 23. Let X(u) be solution to SDE (see Equation 4500) with known initial condition. Let h(y) be
Borel measurable function and for fixed T > 0, let g(t, x) = Et,x h(X(T )). Then, the stochastic process
g(t, X(t)) is martingale.

Proof. By Result 44, for 0 ≤ s ≤ t ≤ T , we have

E[h(X(T ))|F(s)] = g(s, X(s)), (4526)


E[h(X(T ))|F(t)] = g(t, X(t)). (4527)

Then

E[g(t, X(t))|F(s)] = E[E[h(X(T ))|F(t)]|F(s)] (4528)


= E[h(X(T ))|F(s)] (4529)
= g(s, X(s)). (4530)

Theorem 429 (Feynman-Kac Theorem). Consider SDE of the general form (see Equation 4500)

dX(u) = β(u, X(u))du + γ(u, X(u))dW (u). (4531)

Let h(y) be Borel-measurable, and constant T > 0. For t ∈ [T ], define

g(t, x) = Et,x h(X(T )). (4532)

Assume finite integrability of h(X(T )) for all t, x. Then g(t, x) satisfies the partial differential equation
1
gt (t, x) + β(t, x)gx (t, x) + γ 2 (t, x)gxx (t, x) = 0 (4533)
2
and terminal condition g(T, x) = h(x) for all x.

Proof. Let X(t) be the solution to the SDE, by Lemma 23, dg(t, X(t)) must have no drift. Then by Ito
Doeblin (see Theorem 415)
1
dg(t, X(t)) = gt dt + gx dX + gxx dXdX (4534)
2
1
= gt dt + βgx dt + γgx dW + γ 2 gxx dt (4535)
2
1 2
= (gt + βgx + γ gxx )dt + γgx dW. (4536)
2
The diffusion term is zero when
1
gt + βgx + γ 2 gxx = 0 (4537)
2
and the result follows that for every point (t, x) reachable by (t, X(t)), the LHS needs to sum to zero.

Theorem 430 (Discounted Feynman-Kac Theorem). Consider SDE of the general form (see Equation
4500)

dX(u) = β(u, X(u))du + γ(u, X(u))dW (u). (4538)

Let h(y) be Borel-measurable, and constant T > 0. For t ∈ [T ], define

f (t, x) = Et,x [exp(−r(T − t)h(X(T ))]. (4539)

Assume finite integrability of h(X(T )). Then, f (t, x) satisfies


1
ft (t, x) + β(t, x)fx (t, x) + γ 2 fxx (t, x) = rf (t, x) (4540)
2
and terminal condition f (T, x) = h(x).

Proof. Let X(t) be the solution to the SDE, then

f (t, X(t)) = E[exp(−r(T − t))h(X(T ))|F(t)]. (4541)

By iterated conditioning see that for 0 ≤ s ≤ t ≤ T ,

E[f (t, X(t))|F(s)] = E[E[exp(−r(T − t))h(X(T ))|F(t)]|F(s)] (4542)


= E[exp(−r(T − t))h(X(T ))|F(s)] (4543)
6= E[exp(−r(T − s))h(X(T ))|F(s)] (4544)
= f (s, X(s)). (4545)

It is not martingale. To obtain a martingale, we would like to remove the dependence of the expectation
term on t. Discounting the LHS, write

exp(−rt)f (t, X(t)) = E[exp(−rT )h(X(T ))|F(t)] (4546)

Then

E[exp(−rt)f (t, X(t))|F(s)] = E[E[exp(−rT )h(X(T ))|F(t)]|F(s)] (4547)


= E[exp(−rT )h(X(T ))|F(s)] (4548)
= exp(−rs)f (s, X(s)) (4549)

is martingale. By Ito Doeblin form for f (t, x) = exp(−rt)x, with differentials ft = −rx exp(−rt),
fx = exp(−rt), fxx = 0. Then

d(exp(−rt)f (t, X(t))) = −rf exp(−rt)dt + exp(−rt)df (t, X(t)) (4550)


1
= −rf exp(−rt)dt + exp(−rt)(ft dt + fx dX(t) + fxx dX(t)dX(t))
  2
1 2
= exp(−rt) −rf + ft + βfx + γ fxx dt + exp(−rt)γfx dW. (4551)
2
For martingale property to hold, the drift term should be zero and the result follows.

Let h(S(T )) be some option payoff at time T where the underlying S is motivated by a geometric
Brownian motion, with form dS(u) = αS(u)du + σS(u)dW (u). We can express it under the risk-neutral
measure as

dS(u) = rS(u)du + σS(u)dW̃ (u). (4552)

Risk-neutral pricing (see Equation 4193) asserts that the derivative value V (t) = Ẽ[exp(−r(T −t))h(S(T ))|F(t)].
Since the stock price is Markov, and the payoff is function of S(t) only, then ∃v(t, x) s.t. V (t) = v(t, S(t)).
By the (discounted) Feynman-Kac theorem (see Theorem 430) we have
vt(t, x) + rxvx(t, x) + (1/2)σ²x²vxx(t, x) = rv(t, x) (4553)
and see that this is the Black-Scholes-Merton PDE (see Equation 3848). We can use this to price options where the underlying is geometric Brownian, such as European calls, European puts and forwards (or any option with payoff h(S(T))). If the underlying were a generalized geometric Brownian motion (see Exercise 656), that is if σ(t, x) ≠ σ, the stock price would not be a geometric Brownian motion. The BSM PDE does not apply, but we can instead solve
vt(t, x) + rxvx(t, x) + (1/2)σ²(t, x)x²vxx(t, x) = rv(t, x). (4554)

Empirically, the volatility σ implied by market prices differs across options of different strikes. That is, the implied volatility generally exhibits convexity over the strike values, a phenomenon known as the volatility smile. To match the implied volatilities backed out of empirical option values at different dates and strikes, the volatility is allowed to be a function of t and x, and the function σ(t, x) is called the volatility surface.
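As a practical sketch (an illustration, not part of the source derivation): given a quoted call price, the implied volatility is the σ equating the BSM formula to that price, and it is usually obtained with a one-dimensional root finder. The function names and sample quotes below are assumptions.

import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def bsm_call(S, K, T, r, sigma):
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

def implied_vol(price, S, K, T, r, lo=1e-6, hi=5.0):
    # brentq needs a bracket: the quote must lie between the sigma->lo and sigma->hi prices
    return brentq(lambda sig: bsm_call(S, K, T, r, sig) - price, lo, hi)

S, r, T = 100.0, 0.02, 0.5                       # illustrative market data (assumptions)
for K, quote in [(90.0, 12.80), (100.0, 6.10), (110.0, 2.40)]:
    print(K, implied_vol(quote, S, K, T, r))     # differing vols across K trace the smile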

Exercise 697 (Kolmogorov Backward Equation). Consider the SDE of the general form

dX(u) = β(u, X(u))du + γ(u, X(u))dW (u). (4555)

Assume X(u) > 0 - it is generalized geometric Brownian (see Exercise 690). Let p(t, T, x, y) be the
transition density for the solution to the SDE. Show that p(t, T, x, y) satisfies Kolmogorov backward
equation
−pt(t, T, x, y) = β(t, x)px(t, T, x, y) + (1/2)γ²(t, x)pxx(t, T, x, y). (4556)
Proof. By the Feynman-Kac theorem (see Theorem 429), we know that for arbitrary h(y),
Z ∞
g(t, x) = Et,x h(X(T )) = h(y)p(t, T, x, y)dy (4557)
0

satisfies
gt(t, x) + β(t, x)gx(t, x) + (1/2)γ²(t, x)gxx(t, x) = 0. (4558)
Differentiating Equation 4557, obtain
Z ∞
gt (t, x) = h(y)pt (t, T, x, y)dy (4559)
0
Z ∞
gx (t, x) = h(y)px (t, T, x, y)dy (4560)
Z0 ∞
gxx (t, x) = h(y)pxx (t, T, x, y)dy. (4561)
0

Then the Feynman Kac PDE relation implies


Z ∞  
1
h(y) pt + βpx + γ 2 pxx dy = 0, (4562)
0 2

and for this to hold for arbitrary choice of h, the result follows.

Exercise 698 (Kolmogorov Forward Equation). Consider the SDE of the general form

dX(u) = β(u, X(u))du + γ(u, X(u))dW (u). (4563)

Assume X(u) > 0 - it is generalized geometric Brownian (see Exercise 690). Let p(t, T, x, y) be the tran-
sition density for the solution to the SDE. Show that p(t, T, x, y) satisfies Kolmogorov forward equation

(δ/δT)p(t, T, x, y) = −(δ/δy)(β(T, y)p(t, T, x, y)) + (1/2)(δ²/δy²)(γ²(T, y)p(t, T, x, y)). (4564)
In Exercise 697, the variables T, y were held constant, with t, x varying. Here, T, y are allowed to vary with t, x held constant. The variables t, x are hence often called the backward variables, and T, y the forward variables. Let b > 0 and let hb(y) be a function with hb(y) = 0 for y ≤ 0, hb′(y) = 0 for y ≥ b, and hb(b) = hb′(b) = 0. Let the initial condition be X(t) = x and assume the derivatives hb′, hb″ exist. Compute dhb(X(u)). Integrate and take expectations to get the relation
Z b Z T Z b
hb (y)p(t, T, x, y)dy = hb (x) + β(u, y)p(t, u, x, y)h0b (y)dydu (4565)
0 t 0
Z T Z b
1
+ γ 2 (u, y)p(t, u, x, y)h00b (y)dydu. (4566)
2 t 0

Integrate by parts, and differentiate to show the Kolmogorov forward equation holds.

Proof. Zeng [20] By Ito Doeblin formula,


1
dhb (X(u)) = h0b (X(u))dX(u) + h00b (X(u))dX(u)dX(u) (4567)
 2 
1
= h0b (X(u))β(u, X(u)) + γ 2 (u, X(u))h00b (X(u)) du + h0b (X(u))γ(u, X(u))dW (u).
2

Integrating,
Z T   Z T
0 1 2 00
hb (X(T )) − hb (X(t)) = hb (X(u))β(u, X(u)) + γ (u, X(u))hb (X(u)) du + h0b (X(u))γ(u, X(u))dW (u).
t 2 t

Taking expectations,
Z ∞
Et,x [hb (X(T )) − hb (X(t))] = hb (y)p(t, T, x, y)dy − hb (x) (4568)
−∞
Z T  
t,x0 1 2 00
= E hb (X(u))β(u, X(u)) + γ (u, X(u))hb (X(u)) du
t 2
Z T Z b 
1
= h0b (y)β(u, y) + γ 2 (u, y)h00b (y) p(t, u, x, y)dydu. (4569)
t 0 2

Integrating by parts, see that


Z b Z b
b δ
β(u, y)p(t, u, x, y)h0b (y)dy = [hb (y)β(u, y)p(t, u, x, y)]0 − hb (t) (β(u, y)p(t, u, x, y))dy
0 0 δy
Z b
δ
= − hb (y) (β(u, y)p(t, u, x, y))dy. (4570)
0 δy

We also have
Z b Z b
δ 2
γ 2
(u, y)p(t, u, x, y)h00b (y)dy = − (γ (u, y)p(t, u, x, y))h0b (y)dy (4571)
0 0 δy
b
δ2 2
Z
= (γ (u, y)p(t, u, x, y))hb (y)dy. (4572)
0 δy 2

Substituting, we get
Z b Z T Z b
δ
hb (y)p(t, T, x, y)dy = hb (x) − [β(u, y)p(t, u, x, y)]hb (y)dydu (4573)
0 t 0 δy
Z T Z b 2
1 δ
+ (γ 2 (u, y)p(t, u, x, y))hb (y)dydu. (4574)
2 t 0 δy 2

Differentiate to get
b b b
δ2 2
Z Z Z
δ δ 1
hb (y) p(t, T, x, y)dy = − [β(T, y)p(t, T, x, y)]hb (y)dy + [γ (T, y)p(t, T, x, y)]hb (y)dy.
0 δT 0 δy 2 0 δy 2

Rearrange to get
Z b
1 δ2 2
 
δ δ
hb (y) p(t, T, x, y) + (β(T, y)p(t, T, x, y)) − (γ (T, y)p(t, T, x, y)) dy = 0. (4575)
0 δT δy 2 δy 2
Since the choice of h was arbitrary, we require that

δ δ 1 δ2 2
p(t, T, x, y) + (β(T, y)p(t, T, x, y)) − (γ (T, y)p(t, T, x, y)) = 0. (4576)
δT δy 2 δy 2

Exercise 699 (Implied Volatility Surface). Assuming a stock price SDE of form

dS(u) = rS(u)du + σ(u, S(u))S(u)dW̃(u), (4577)

where W̃ is risk-neutral measure Brownian. Then by Exercise 698 the transition density p̃(t, T, x, y)
satisfies

(δ/δT)p̃(t, T, x, y) = −(δ/δy)(ry p̃(t, T, x, y)) + (1/2)(δ²/δy²)(σ²(T, y)y²p̃(t, T, x, y)). (4578)
Let c(0, T, x, K) = exp(−rT) ∫_K^∞ (y − K)p̃(0, T, x, y)dy be the price of a European call expiring at T struck at K. Show that
cT(0, T, x, K) = exp(−rT)rK ∫_K^∞ p̃(0, T, x, y)dy + (1/2)exp(−rT)σ²(T, K)K²p̃(0, T, x, K) (4579)
= −rKcK(0, T, x, K) + (1/2)σ²(T, K)K²cKK(0, T, x, K). (4580)
That is, we can infer the points on the volatility surface for σ 2 (T, K) in terms of cT , cK and cKK . Use
the assumption that

lim (y − K)ry p̃(0, T, x, y) = 0 (4581)


y→∞

and that
δ 2
lim (y − K) (σ (T, y)y 2 p̃(0, T, x, y)) = 0, (4582)
y→∞ δy
lim σ 2 (T, y)y 2 p̃(0, T, x, y) = 0. (4583)
y→∞

Proof. We have
Z ∞ Z ∞
δ ∞
− (y − K) (ry p̃(0, T, x, y)dy = [−(y − K)ry p̃(0, T, x, y)]K + ry p̃(0, T, x, y)dy (4584)
K δy K
Z ∞
= ry p̃(0, T, x, y)dy. (4585)
K

Integrate by parts (twice) to obtain

1 ∞ δ2
Z
(y − K) 2 (σ 2 (T, y)y 2 p̃(0, T, x, y))dy (4586)
2 K δy
 ∞ Z ∞ 
1 δ δ 2
= (y − K) (σ 2 (T, y)y 2 p̃(0, T, x, y)) − (σ (T, y)y 2 p̃(0, T, x, y))dy (4587)
2 δy K K δy
1  2 ∞
= − ( σ (T, y)y 2 p̃(0, T, x, y) K )

(4588)
2
1
= σ 2 (T, K)K 2 p̃(0, T, x, K). (4589)
2

We may write
Z ∞
cT (0, T, x, K) = −rc(0, T, x, K) + exp(−rT ) (y − K)p̃T (0, T, x, y)dy (4590)
K
Z ∞ Z ∞
= −r exp(−rT ) (y − K)p̃(0, T, x, y)dy + exp(−rT ) (y − K)p̃T (0, T, x, y)dy
K K
Z ∞ Z ∞
δ
= −r exp(−rT ) (y − K)p̃(0, T, x, y)dy − exp(−rT ) (y − K) (ry p̃(0, T, x, y))dy
K K δy
Z ∞
1 δ2 2
+ exp(−rT ) (y − K) (σ (T, y)y 2 p̃(0, T, x, y))dy by Equation 4578
K 2 δy 2
Z ∞ Z ∞
= −r exp(−rT ) (y − K)p̃(0, T, x, y)dy + exp(−rT ) ry p̃(0, T, x, y)dy (4591)
K K
1
+ exp(−rT ) σ 2 (T, K)K 2 p̃(0, T, x, K) Equation 4585 and 4589 (4592)
2Z

1
= r exp(−rT )K p̃(0, T, x, y)dy + exp(−rT )σ 2 (T, K)K 2 p̃(0, T, x, K) (4593)
K 2
1 2
= −rKcK (0, T, x, K) + σ (T, K)K 2 cKK (0, T, x, K). (4594)
2

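Equation 4580 rearranges to σ²(T, K) = (cT(0, T, x, K) + rKcK(0, T, x, K)) / ((1/2)K²cKK(0, T, x, K)), so local variance can be read off a call-price surface by differencing. The sketch below is a self-check under assumed parameters: the prices are generated from the constant-volatility BSM formula, so the recovered σ(T, K) should sit near the true constant.

import numpy as np
from scipy.stats import norm

S0, r, sigma_true = 100.0, 0.02, 0.25            # illustrative parameters (assumptions)

def bsm_call(S, K, T, r, sigma):
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d1 - sigma * np.sqrt(T))

K, T, dK, dT = 105.0, 1.0, 0.5, 1.0 / 365.0
c    = lambda T_, K_: bsm_call(S0, K_, T_, r, sigma_true)
c_T  = (c(T + dT, K) - c(T - dT, K)) / (2 * dT)
c_K  = (c(T, K + dK) - c(T, K - dK)) / (2 * dK)
c_KK = (c(T, K + dK) - 2 * c(T, K) + c(T, K - dK)) / dK**2

local_var = (c_T + r * K * c_K) / (0.5 * K**2 * c_KK)   # Equation 4580 rearranged
print("recovered sigma(T, K):", np.sqrt(local_var), "true:", sigma_true)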
Exercise 700 (Heston Stochastic Volatility Model). A Heston Stochastic Volatility model specifies the
stock SDE under risk-neutral measure P̃ to be
dS(t) = rS(t)dt + √V(t) S(t)dW̃1(t), (4595)
dV(t) = (a − bV(t))dt + σ√V(t) dW̃2(t). (4596)

Here a, b, σ > 0 and dW̃1 (t)dW̃2 (t) = ρdt. See that (S(t), V (t)) is a two-dimensional Markov process
(see Corollary 33). Risk-neutral pricing asserts that the price of a European call is given at t to be
Ẽ[exp(−rτ )(S(T ) − K)+ |F(t)]. By the Markov property, there exists c(t, s, v) s.t.

c(t, S(t), V(t)) = Ẽ[exp{−rτ}(S(T) − K)⁺|F(t)]. (4597)

We aim to show c(t, s, v) satisfies PDE


ct + rscs + (a − bv)cv + (1/2)s²vcss + ρσsvcsv + (1/2)σ²vcvv = rc. (4598)
Suppose there are functions f(t, x, v), g(t, x, v) satisfying
ft + (r + (1/2)v)fx + (a − bv + ρσv)fv + (1/2)vfxx + ρσvfxv + (1/2)σ²vfvv = 0, (4599)
gt + (r − (1/2)v)gx + (a − bv)gv + (1/2)vgxx + ρσvgxv + (1/2)σ²vgvv = 0. (4600)
If we define

c(t, s, v) = sf (t, log s, v) − exp(−r(T − t))Kg(t, log s, v), (4601)

show that c(t, s, v) satisfies PDE Equation 4598. Suppose a pair of processes (X(t), V (t)) is governed by
SDE
 
1 p
dX(t) = r + V (t) dt + V (t)dW1 (t) (4602)
2
p
dV (t) = (a − bV (t) + ρσV (t))dt + σ V (t)dW2 (t), (4603)

with dW1 (t)dW2 (t) = ρdt, and f (t, x, v) = Et,x,v 1 {X(T ) ≥ log K}. Show that f (t, x, v) satisfies PDE
Equation 4599 and boundary condition f (T, x, v) = 1 {x ≥ log K} for all v ≥ 0. Similarly, using the
SDE
 
1 p
dX(t) = r − V (t) dt + V (t)dW1 (t) (4604)
2
p
dV (t) = (a − bV (t))dt + σ V (t)dW2 (t), (4605)

with dW1 (t)dW2 (t) = ρdt and g(t, x, v) = Et,x,v 1 {X(T ) ≥ log K}, we can show that g(t, x, v) satisfies
PDE Equation 4600 and boundary condition g(T, x, v) = 1 {x ≥ log K} for all v ≥ 0. Using the re-
sults thus far, show the function c(t, x, v) satisfies the boundary condition c(T, s, v) = (s − K)+ for all
s ≥ 0, v ≥ 0. This is one of the boundary conditions under the Heston model’s (among many oth-
ers) stochastic volatility assumptions. See Exercise 665 for the boundary conditions when volatility were
assumed constant.
Proof. For s ≤ t,

Ẽ[exp(−rt)c(t, St , Vt ))|F(s)] = Ẽ[Ẽ[exp(−rT )(ST − K)+ |F(t)]|F(s)] (4606)


+
= Ẽ[exp(−rT )(ST − K) |F(s)] (4607)
= exp(−rs)c(s, Ss , Vs ). (4608)

It is martingale. Using f (t, c) = exp(−rt)c with derivatives ft = −rf, fc = exp(−rt), fcc = 0 we can
write d(exp(−rt)c(t, St , Vt )) (by Ito Doeblin, Theorem 418)

= −r exp(−rt)cdt + exp(−rt)dc (4609)


1 1
= −r exp(−rt)cdt + exp(−rt)(ct dt + cs dS + cv dV + css dSdS + cvv dV dV + csv dSdV ) (4610)
 2 2 
1 2 1 2
= exp(−rt) −rc + ct + cs rSt + cv (a − bVt ) + css Vt St + cvv σ Vt + csv σVt St ρ dt + ψdW̃t .
2 2
Setting the martingale drift to zero it follows that
1 1
ct + rscs + (a − bv)cv + s2 vcss + ρσsvcsv + σ 2 vcvv = rc. (4611)
2 2
We omit the rather mechanical proof that c(t, s, v) = sf (t, log s, v)−exp(−r(T −t))Kg(t, log s, v) satisfies
Equation 4598. By Markov property, Zeng [20] we can write

f (t, Xt , Vt ) = E [1 {XT ≥ log K} |F(t)] . (4612)

Then f (T, Xt , Vt ) = 1 {XT ≥ log K} =⇒ f (T, x, v) = 1 {x ≥ log K} for all v ≥ 0. Since f (t, Xt , Vt ) is
martingale, write
 
1 1 1
df (t, Xt , Vt ) = ft + fx (r + Vt ) + fv (a − bvt + ρσVt ) + fxx Vt + fvv σ 2 Vt + fxv σVt ρ dt + ξdW̃t .
2 2 2
It follows that
1 1 1
ft + (r + v)fx + (a − bv + ρσv)fv + fxx v + fvv σ 2 v + σvρfxv = 0. (4613)
2 2 2
Finally, see that

c(T, s, v) = sf (T, log s, v) − exp(−rτ )Kg(T, log s, v) (4614)


= s·1{log s ≥ log K} − K·1{log s ≥ log K} (4615)
= 1{s ≥ K}(s − K) (4616)
= (s − K)⁺. (4617)

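As a complementary numerical sketch (not part of the source argument), the Heston call can also be estimated by simulating the pair (S, V) of Equations 4595-4596 under the risk-neutral measure; below is a minimal full-truncation Euler scheme. All parameter values are assumptions chosen for illustration.

import numpy as np

def heston_call_mc(S0, V0, K, T, r, a, b, sigma, rho,
                   n_paths=100_000, n_steps=250, seed=42):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(n_paths, S0, dtype=float)
    V = np.full(n_paths, V0, dtype=float)
    for _ in range(n_steps):
        Z1 = rng.standard_normal(n_paths)
        Z2 = rho * Z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_paths)
        Vp = np.maximum(V, 0.0)                       # full truncation of the variance
        S *= np.exp((r - 0.5 * Vp) * dt + np.sqrt(Vp * dt) * Z1)
        V += (a - b * Vp) * dt + sigma * np.sqrt(Vp * dt) * Z2
    payoff = np.maximum(S - K, 0.0)
    return np.exp(-r * T) * payoff.mean()

# dV = (a - bV)dt + sigma sqrt(V) dW2, correlation rho between the two Brownians
print(heston_call_mc(S0=100.0, V0=0.04, K=100.0, T=1.0, r=0.02,
                     a=0.04 * 2.0, b=2.0, sigma=0.3, rho=-0.7))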
13.4.4 Applications to Interest Rate Models
The simplest SDE for interest rate evolution is of the general form

dR(t) = β(t, R(t))dt + γ(t, R(t))dW̃ (t), (4618)

where the risk-neutral measure is used to price all assets by the risk-neutral pricing formula, which asserts
that discounted asset prices are martingales. These models are also called short-rate models, since R(t)
specifies the rate for short-term borrowing. One factor rate models are models that are determined by
only one SDE. These are often unable to capture complicated yield curve dynamics, and often can only
generate parallel yield curve shifts. As before, denote the discount process
 Z t 
D(t) = exp − R(s)ds (4619)
0

and the money market price process is the reciprocal of the discount process, namely
Z t 
1
= exp R(s)ds , (4620)
D(t) 0

specifying the value of one unit of currency invested continuously in the money market at the prevailing
short rate. Ito Doeblin would give us the relations
 
dD(t) = −R(t)D(t)dt, d(1/D(t)) = (R(t)/D(t))dt (4621)
which would serve us well to remember by heart. A zero-coupon bond (see Definition 455) pays one
dollar at terminal T and risk-neutral pricing formula asserts that the price of this bond should satisfy
D(t)B(t, T ) = Ẽ[D(T )|F(t)], or equivalently
" ( Z ) #
T
B(t, T ) = Ẽ exp − R(s)ds |F(t) . (4622)
t

Definition 458 (Zero Coupon Bond Yield). The yield between times t and T for a zero-coupon bond
can be defined
Y(t, T) = −(1/(T − t)) log B(t, T). (4623)
See that this satisfies

B(t, T ) = exp {−Y (t, T )(T − t)} . (4624)

That is, the yield is the continuously compounding constant rate between t and T implied by the bond
prices. For instance, the 30-year long rate can be written Y (t, t + 30).
Since R is solution to stochastic process governed by SDE Equation 4618, it is Markov process (see
Corollary 33) and we have

B(t, T ) = f (t, R(t)) (4625)

for some f (t, r). To find the PDE for f (t, r), we can use the Discounted Feynman-Kac theorem (see The-
orem 430). In particular, we find a martingale, find its differential and set the drift to zero. Specifically
here,

d(D(t)f (t, R(t))) = f (t, R(t))dD(t) + D(t)df (t, R(t)) + 0 (4626)


1
= f (−RD(t)dt) + D(t)(ft dt + fr dR + frr dRdR) (4627)
  2
1 2
= D(t) −Rf + ft + βfr + γ frr dt + D(t)γfr dW̃ . (4628)
2

Setting the drift to zero, we obtain the PDE
1
ft (t, r) + β(t, r)fr (t, r) + γ 2 (t, r)frr (t, r) = rf (t, r) (4629)
2
with terminal f (T, r) = 1 for all values of r.
We saw the Hull-White interest rate model in Exercise 695. Here, we derive a formula for the price
of a zero coupon bond when the rate is modelled by the Hull-White interest rate model.

Exercise 701 (ZCB Pricing under Hull-White Interest Rate Model). Recall in Exercise 695 that the
short rate evolves by the SDE

dR(t) = (a(t) − b(t)R(t))dt + σ(t)dW̃ (t). (4630)

Suppose a(t), b(t), σ(t) > 0 and are deterministic. The zero-coupon bond partial differential equation
specified by Equation 4629 under the Hull-White rate model is expressed
1
ft (t, r) + (a(t) − b(t)r)fr (t, r) + σ 2 (t)frr (t, r) = rf (t, r). (4631)
2
Suppose the solution (we verify this later) were of the form f (t, r) = exp {−rC(t, T ) − A(t, T )} for some
nonrandom functions C(t, T ), A(t, T ). The zcb yield (see Definition 458) can be expressed
Y(t, T) = −(1/(T − t)) log f(t, r) = (1/(T − t))(rC(t, T) + A(t, T)), (4632)
which we see is an affine function of r of form a · r + b. See also that

ft (t, r) = (−rC 0 (t, T ) − A0 (t, T ))f (t, r) (4633)


fr (t, r) = −C(t, T )f (t, r) (4634)
2
frr (t, r) = C (t, T )f (t, r). (4635)

By substituting into the Hull-White zcb-pde we get


 
1
(−C 0 (t, T ) + b(t)C(t, T ) − 1)r − A0 (t, T ) − a(t)C(t, T ) + σ 2 (t)C 2 (t, T ) f (t, r) = 0. (4636)
2

Since this holds regardless of r, it follows that r term must disappear. In other words, C 0 (t, T ) =
b(t)C(t, T ) − 1 and
1
A0 (t, T ) = −a(t)C(t, T ) + σ 2 (t)C 2 (t, T ). (4637)
2
For terminal condition f (T, r) = 1 to hold for all r, we require that C(T, T ) = A(T, T ) = 0. Given
relations for C 0 (t, T ), A0 (t, T ), C(T, T ), A(T, T ), we can determine the functions to be (see proof )
C(t, T) = ∫_t^T exp(−∫_t^s b(v)dv) ds (4638)
A(t, T) = ∫_t^T (a(s)C(s, T) − (1/2)σ²(s)C²(s, T)) ds. (4639)

Then the price of a zero coupon bond priced by the Hull-White model is written

B(t, T ) = exp {−R(t)C(t, T ) − A(t, T )} . (4640)

Proof. We show that the solutions to

C 0 (t, T ) = b(t)C(t, T ) − 1 (4641)


1
A0 (t, T ) = −a(t)C(t, T ) + σ 2 (t)C 2 (t, T ). (4642)
2
and C(T, T ) = A(T, T ) = 0 are
Z T  Z s 
C(t, T ) = exp − b(v)dv ds (4643)
t t
Z T  
1 2 2
A(t, T ) = a(s)C(s, T ) − σ (s)C (s, T ) ds. (4644)
t 2
First, taking derivatives
  Z s   Z s
d
exp − b(v)dv C(s, T ) = exp(− b(v)dv) [C 0 (s, T ) − b(s)C(s, T )] (4645)
ds 0 0
Z s
= exp(− b(v)dv) [b(s)C(s, T ) − 1 − b(s)C(s, T )] (4646)
0
Z s
= − exp(− b(v)dv). (4647)
0

Integrating from s = t → s = T , write


exp(−∫_0^T b(v)dv)C(T, T) − exp(−∫_0^t b(v)dv)C(t, T) = −∫_t^T exp(−∫_0^s b(v)dv) ds. (4648)

Given final condition C(T, T ) = 0,


C(t, T) = exp(∫_0^t b(v)dv) ∫_t^T exp(−∫_0^s b(v)dv) ds = ∫_t^T exp(−∫_t^s b(v)dv) ds. (4649)

Using Equation 4642 we have


Z T Z T
1
A(T, T ) − A(t, T ) = − a(s)C(s, T )ds + σ 2 (s)C 2 (s, T )ds. (4650)
t 2 t

Substitute A(T, T ) = 0 and we are done.
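Numerically, C(t, T) and A(t, T) of Equations 4638-4639 can be evaluated by quadrature and plugged into the bond price of Equation 4640. The sketch below assumes constant a(t), b(t), σ(t) purely for illustration, in which case C(t, T) = (1 − exp(−b(T − t)))/b is available as a closed-form check.

import numpy as np
from scipy.integrate import quad

a_fn     = lambda t: 0.03    # a(t): assumed constant for illustration
b_fn     = lambda t: 0.5     # b(t): assumed constant
sigma_fn = lambda t: 0.01    # sigma(t): assumed constant

def C(t, T):
    inner = lambda s: np.exp(-quad(b_fn, t, s)[0])
    return quad(inner, t, T)[0]

def A(t, T):
    integrand = lambda s: a_fn(s) * C(s, T) - 0.5 * sigma_fn(s)**2 * C(s, T)**2
    return quad(integrand, t, T)[0]

def zcb_price(R_t, t, T):
    return np.exp(-R_t * C(t, T) - A(t, T))       # Equation 4640

t, T, R_t = 0.0, 5.0, 0.02
print("C(t,T) by quadrature:", C(t, T), " closed form:", (1 - np.exp(-0.5 * (T - t))) / 0.5)
print("B(t,T):", zcb_price(R_t, t, T))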

We encountered the Cox-Ingersoll-Ross model in Exercise 664. Here we price a zero-coupon bond
using the CIR model.

Exercise 702 (ZCB CIR Pricing). Recall the rate evolves with SDE
p
dR(t) = (a − bR(t))dt + σ R(t)dW̃ (t). (4651)

a, b, σ are positive constants. The zero-coupon bond partial differential equation specified by Equation
4629 under the CIR rate model is expressed
1
ft (t, r) + (a − br)fr (t, r) + σ 2 rfrr (t, r) = rf (t, r). (4652)
2
Again assume that the solution takes some form f (t, r) = exp {−rC(t, T ) − A(t, T )}, which have deriva-
tives given by the set of Equations 4634. This is another type of affine yield model. Substituting, see
that
  
1
−C 0 (t, T ) + bC(t, T ) + σ 2 C 2 (t, T ) − 1 r − A0 (t, T ) − aC(t, T ) f (t, r) = 0. (4653)
2

By the same reasoning, since the relation must hold for all values of r, we have
1
C 0 (t, T )= bC(t, T ) + σ 2 C 2 (t, T ) − 1 (4654)
2
A0 (t, T ) = −aC(t, T ) (4655)

with terminal conditions C(T, T ) = A(T, T ) = 0. The solutions (see proof ) are
C(t, T) = sinh(γ(T − t)) / [γ cosh(γ(T − t)) + (1/2)b sinh(γ(T − t))], (4656)
A(t, T) = −(2a/σ²) log[ γ exp(−(1/2)b(T − t)) / (γ cosh(γ(T − t)) + (1/2)b sinh(γ(T − t))) ], (4657)
where γ = (1/2)√(b² + 2σ²), sinh(u) = (exp(u) − exp(−u))/2 and cosh(u) = (exp(u) + exp(−u))/2.

Proof. Shreve [19], Zeng [20]. Here we show the solutions to


  
0 1 2 2 0
−C (t, T ) + bC(t, T ) + σ C (t, T ) − 1 r − A (t, T ) − aC(t, T ) f (t, r) = 0. (4658)
2

1
C 0 (t, T )= bC(t, T ) + σ 2 C 2 (t, T ) − 1 (4659)
2
A0 (t, T ) = −aC(t, T ) (4660)

with terminal conditions C(T, T ) = A(T, T ) = 0 are given by


sinh(γ(T − t))
C(t, T ) = , (4661)
γcosh(γ(T − t)) + 12 b · sinh(γ(T − t))
γ exp(− 21 b(T − t))
 
2a
A(t, T ) = − 2 log . (4662)
σ γ · cosh(γ(T − t)) + 21 b · sinh(γ(T − t))

The proof is done in multiple parts.


n RT o
1. Define ψ(t) = exp 12 σ 2 t C(u, T )du and show that

−2ψ 0 (t)
C(t, T ) = (4663)
σ 2 ψ(t)
−2ψ 00 (t) 1 2 2
C 0 (t, T ) = + σ C (t, T ). (4664)
σ 2 ψ(t) 2

2. Use Equation 4659 to show that


1
ψ 00 (t) − bψ 0 (t) − σ 2 ψ(t) = 0. (4665)
2
A constant-coefficient linear ordinary differential equation has solution form (verify this)

ψ(t) = a1 exp(λ1 t) + a2 exp(λ2 t), (4666)

where λ1 , λ2 are solutions to characteristic equations λ2 − bλ − 21 σ 2 = 0.

3. Show that ψ(t) has form


   
c1 1 c2 1
ψ(t) = 1 exp −( b + γ)(T − t) − 1 exp −( b − γ)(T − t) , (4667)
2b − γ
2b + γ
2 2
1

where γ = 2 b2 + 2σ 2 .

4. Show ψ 0 (t) = c1 exp −( 12 b + γ)(T − t) − c2 exp −( 21 b − γ)(T − t) . Show c1 = c2 .
 

5. Show
 1 1
b−γ

1 2b + γ
ψ(t) = c1 exp(− b(T − t)) 1 2 2 2
exp(−γ(T − t)) − 1 2 2
exp(γ(T − t)) (4668)
4b − γ 4b − γ
2
2c1 −1
= 2
exp( b(T − t)) [b · sinh(γ(T − t)) + 2γ · cosh(γ(T − t))] , (4669)
σ 2
−1
ψ 0 (t) = −2c1 exp( b(T − t))sinh(γ(T − t)). (4670)
2
and show C(t, T ) takes form Equation 4661.
2aψ 0 (t)
6. Show A0 (t, T ) = σ 2 ψ(t) and integrate this form from t → T and show A(t, T ) takes form Equation
4662.

(verify this set of assertions above).
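The closed forms above translate directly into a bond pricer. A minimal sketch (parameter values assumed), pricing B(t, T) = exp{−R(t)C(t, T) − A(t, T)} under the CIR dynamics:

import numpy as np

def cir_zcb(R_t, t, T, a, b, sigma):
    """Zero coupon bond price under CIR using Equations 4656-4657."""
    tau = T - t
    gamma = 0.5 * np.sqrt(b**2 + 2.0 * sigma**2)
    denom = gamma * np.cosh(gamma * tau) + 0.5 * b * np.sinh(gamma * tau)
    C = np.sinh(gamma * tau) / denom
    A = -(2.0 * a / sigma**2) * np.log(gamma * np.exp(-0.5 * b * tau) / denom)
    return np.exp(-R_t * C - A)

# illustrative parameters (assumptions): dR = (a - bR)dt + sigma sqrt(R) dW~
print(cir_zcb(R_t=0.03, t=0.0, T=5.0, a=0.02, b=0.5, sigma=0.1))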

Exercise 703 (Moment Generating Function for CIR process, Shreve [19]). -

1. Let Wi , i ∈ [d] be d-dimensional Brownians (see Definition 445) and a, σ > 0. For j ∈ [d] define
dXj(t) = −(b/2)Xj(t)dt + (σ/2)dWj(t). (4671)
Show that Xj(t) = exp(−(1/2)bt)[Xj(0) + (σ/2)∫_0^t exp((1/2)bu)dWj(u)], and that
Xj(t) ∼ Φ( exp(−(1/2)bt)Xj(0), (σ²/(4b))[1 − exp(−bt)] ). (4672)
2. Define R(t) = Σ_{j=1}^d Xj²(t) and show
dR(t) = (a − bR(t))dt + σ√R(t) dB(t) (4673)
where a = dσ²/4 and B(t) = Σ_{j=1}^d ∫_0^t (Xj(s)/√R(s)) dWj(s).
q
R(0)
3. Assuming R(0) > 0, define Xj (0) = d and show for i ∈ [d],
r !
IID 1 R(0) σ2
Xi (t) ∼ Φ µ(t) = exp(− bt) , v(t) = [1 − exp(−bt)] . (4674)
2 d 4b

4. Since R(t) is sum of IID random normal squares, it must be ∼ χ2d (see Section 6.17.7). Compute
the mgf of R(t), using the mgf

uµ2 (t)
 
1 1
E exp uXj2 (t) = p

exp ∀u < . (4675)
1 − 2v(t)u 1 − 2v(t)u 2v(t)

Use the hints


2
uµ2 (t)

2 1 1 − 2v(t)u µ(t)
ux − (x − µ(t))2 = − x− + , (4676)
2v(t) 2v(t) 1 − 2v(t)u 1 − 2v(t)u
µ(t) v(t)
and that the normal density of variable mean 1−2v(t)u variance 1−2v(t)u can be written
s (  2 )
1 − 2v(t)u 1 − 2v(t)u µ(t)
exp − x− . (4677)
2πv(t) 2v(t) 1 − 2v(t)u

5. Show that R(t) of the CIR model has mgf
  d2  
1 exp(−bt)uR(0)
E exp(uR(t)) = exp (4678)
1 − 2v(t)u 1 − 2v(t)u
  2a2  
1 σ exp(−bt)uR(0) 1
= exp ∀u < . (4679)
1 − 2v(t)u 1 − 2v(t)u 2v(t)

Proof. Zeng [20] -

1. For f (t, x) = exp( 12 bt)x with derivatives ft = b


2 exp( 12 bt)x, fx = exp( 21 bt), fxx = 0, apply Ito
Doeblin formula to obtain
1 b 1 1
d exp( bt)Xj (t) = exp( bt)Xj (t)dt + exp( bt)dXj (t) (4680)
2 2 2 2
1 b
= exp( bt)( Xj (t)dt + dXj (t)) (4681)
2 2
1 b b 1
= exp( bt)( Xj (t)dt + (− Xj (t)dt + σdWj (t))) (4682)
2 2 2 2
1 1
= exp( bt) σdWj (t). (4683)
2 2
Rt
Integrating, exp( 12 bt)Xj (t) − Xj (0) = 12 σ 0 exp( 21 bu)dWj (u) and it follows that

σ t
  Z 
1 1
Xj (t) = exp − bt Xj (0) + exp( bu)dWj (u) . (4684)
2 2 0 2
Then the integrand is deterministic and we can apply Theorem (416). It is easy to see that
EXj = exp(− 12 bt)Xj (0). The variance term can be expressed Var(Xj ) =

σ2 t
Z
exp(−bt) exp(bu)du (4685)
4 0
σ2 1 t
Z
= exp(−bt) b exp(bu)du (4686)
4 b 0
σ2 1
= exp(−bt) [exp(bu)]t0 (4687)
4 b
σ2 1
= exp(−bt) (exp(bt) − 1) (4688)
4 b
σ2
= [1 − exp(−bt)] . (4689)
4b
Pd
2. For R(t) = j=1 Xj2 (t), by Ito Product rule we have
d
X
(2Xj (t)dXj (t) + dXj (t)dXj (t)) (4690)
j=1
d 
σ2
X 
= 2Xj (t)dXj (t) + dt (4691)
j=1
4
d 
σ2

X b 1
= 2Xj (t)(− Xj (t)dt + σdWj (t)) + dt (4692)
j=1
2 2 4
d  
X 1
= −bXj2 (t)dt + σXj (t)dWj (t) + σ 2 dt (4693)
j=1
4
  d
d 2 X (t)
pj
p X
= σ − bR(t) dt + σ R(t) dWj (t). (4694)
4 j=1
R(t)

Pd Xj2 (t)
Then dB(t)dB(t) = j=1 R(t) dt
= dt. By Levy’s Theorem (see Theorem 419) we use the martin-
Pd R t X (s)
gale property and dt quadratic variation to conclude B(t) = j=1 0 √j dWj (s) is Brownian.
R(s)
We have specified a CIR short rate model.

3. The result follows immediately from part 1. and seeing that Xj (t) depends on Wj only.

4. (verify this).

5. Since the Xi ’s are IID their mgf factors and we can write
d
E exp(uR(t)) = E[exp(uXi2 (t))] . (4695)

The result follows. (verify this)
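Steps 1-3 can also be checked by direct simulation: simulate the d processes Xj of Equation 4671, form R(t) = Σ Xj², and compare the sample mean of R(t) with the value implied by Equation 4674, namely d(µ(t)² + v(t)). A quick sketch with assumed parameters:

import numpy as np

rng = np.random.default_rng(7)
d, b, sigma, R0, t_end = 4, 0.8, 0.2, 0.04, 2.0
a = d * sigma**2 / 4.0
n_paths, n_steps = 100_000, 400
dt = t_end / n_steps

X = np.full((n_paths, d), np.sqrt(R0 / d))             # X_j(0) = sqrt(R(0)/d)
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, d))
    X += -0.5 * b * X * dt + 0.5 * sigma * dW           # Equation 4671

R = (X**2).sum(axis=1)
mu = np.exp(-0.5 * b * t_end) * np.sqrt(R0 / d)
v = sigma**2 / (4.0 * b) * (1.0 - np.exp(-b * t_end))
print("sample mean of R(t):", R.mean())
print("d*(mu^2 + v)       :", d * (mu**2 + v))
print("equivalently exp(-bt)R(0) + (a/b)(1 - exp(-bt)):",
      np.exp(-b * t_end) * R0 + a / b * (1 - np.exp(-b * t_end)))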

Exercise 704 (Bond Options). Consider some short-rate model, with differential given by the general
form Equation 4629. Let 0 ≤ t ≤ T1 ≤ T2 , where T1 is the terminal maturity for a European call on a
zcb and T2 is the terminal maturity for the zcb. Suppose we solved for f (t, r) for this short rate model, as
we did in Exercises 701 and 702 satisfying the PDE and terminal conditions specified in Equation 4629.
Then risk-neutral pricing (see Equation 4193) asserts
h i
+
D(t)c(t, R(t)) = Ẽ D(T1 ) (f (T1 , R(T1 )) − K) |F(t) , t ∈ [T ] (4696)

is martingale. By Ito Product (see Lemma 30) we have

d(D(t)c(t, R(t))) = c(t, R(t))dD(t) + D(t)dc(t, R(t)) + 0 (4697)


1
= c(−DRdt) + D(ct dt + cr dR(t) + crr dR(t)dR(t)) (4698)
 2 
1
= D −Rcdt + ct dt + cr dR + crr dRdR (4699)
2
 
1
= D −Rc + ct + βcr + γ 2 crr dt + Dγcr dW̃ . (4700)
2
Since this is martingale with drift zero, we get the PDE
1
ct (t, r) + β(t, r)cr (t, r) + γ 2 (t, r)crr (t, r) = rc(t, r). (4701)
2
Although f (t, r), c(t, r) shares PDE, the terminal condition for c(t, r) is

c(T1 , r) = (f (T1 , r) − K)+ ∀r. (4702)

Exercise 705 (No-arbitrage arguments to bond-pricing Shreve [19]). We derive the PDE for bond
pricing from no arbitrage arguments as opposed to the risk-neutral pricing formulation. Let the short
rate SDE be dR(t) = α(t, R(t))dt + γ(t, R(t))dW (t), where W (t) is Brownian under some arbitrary
probability measure. Denote the zcb (see Definition 455) price as a function f (t, R(t), T ) where T is the
bond-maturity, and assume that fr (t, r, T ) 6= 0 for all values of (r, T ). Define
 
−1 1 2
β(t, r, T ) = −rf (t, r, T ) + ft (t, r, T ) + γ (t, r)frr (t, r, T ) , (4703)
fr (t, r, T ) 2
or equivalently
1
ft (t, r, T ) + β(t, r, T )fr (t, r, T ) + γ 2 (t, r)frr (t, r, T ) = rf (t, r, T ). (4704)
2
In particular if β(t, r, T ) is independent of T , the bond-pricing formula agrees with the Feynman-Kac
theorem 430 PDE derived from risk-neutral pricing arguments.

1. For two maturities 0 < T1 < T2 corresponding to two different zcb held in a portfolio continuously
at amounts ∆1 , ∆2 respectively, show the SDE for a discounted portfolio value process.

2. Denote sign(x) = (x > 0) ? 1 : (x < 0 ? − 1 : 0), and

S(t) = sign {[β(t, R(t), T2 ) − β(t, R(t), T1 )]fr (t, R(t), T1 )fr (t, R(t), T2 )} . (4705)

Show that the portfolio process ∆1 (t) = S(t)fr (t, R(t), T2 ), ∆2 (t) = −S(t)fr (t, R(t), T1 ) leads to an
arbitrage portfolio unless β(t, R(t), T1 ) = β(t, R(t), T2 ). Since the choice of terminals T1 , T2 were
arbitrary, then β(t, r, T ) = β(t, r, (·)) for all (·) ∈ [T ] - it does not depend on (·).

3. For a portfolio investing only in the bond of maturity T , show that the discounted portfolio process
satisfies
1
d(D(t)X(t)) = ∆D[−Rf + ft + αfr + γ 2 frr ]dt + D∆γfr dW. (4706)
2
Show that if fr = 0, then there is an arbitrage unless
1
ft (t, r, T ) + γ 2 (t, r)frr (t, r, T ) = rf (t, r, T ). (4707)
2
Then Equation 4704 holds regardless of the choice of β(t, r, T ) when fr (t, r, T ) = 0.

That is, regardless of whether fr = 0, the same PDE holds. If we change the measure P̃ for which
Z t
1
W̃ (t) = W (t) + [α(u, R(u)) − β(u, R(u))] du, (4708)
0 γ(u, R(u))

then the SDE can be expressed in terms of W̃ (t) of form

dR(t) = β(t, R(t))dt + γ(t, R(t))dW̃ (t) (4709)

which is the general short-rate model in Equation 4618.

Proof. 1. The portfolio differential follows

dX(t) = ∆1 (t)df (t, Rt , T1 ) + ∆2 (t)df (t, Rt , T2 ) + Rt (Xt − ∆1 (t)f (t, Rt , T1 ) − ∆2 (t)f (t, Rt , T2 ))dt.

By Ito Product rule, write

d(Dt Xt ) = Dt dXt + Xt dDt + dDt dXt (4710)


= Dt dXt − Rt Dt Xt dt (4711)
= ··· expand using the portfolio differential, Ito Doeblin and SDE given
= ∆1 (t)D(t) [α(t, R(t)) − β(t, R(t), T1 )] fr (t, R(t), T1 )dt (4712)
+∆2 (t)D(t) [α(t, R(t)) − β(t, R(t), T2 )] fr (t, R(t), T2 )dt (4713)
+D(t)γ(t, R(t)) [∆1 (t)fr (t, R(t), T1 ) + ∆2 (t)fr (t, R(t), T2 )] dW (t). (4714)

We have omitted the intermediate steps, but they should follow as in usual arguments. For instance,
1
df (t, Rt , T1 ) = ft (t, Rt , T1 )dt + fr (t, Rt , T1 )dRt + frr (t, Rt , T1 )dRt dRt (4715)
2
1
= ft (t, Rt , T1 )dt + fr (t, Rt , T1 )dRt + frr (t, Rt , T1 )γ 2 (t, Rt )dt. (4716)
2

2. ∆1 (t) → S(t)fr (t, R(t), T2 ), ∆2 (t) = −S(t)fr (t, R(t), T1 ) into part 1. discounted portfolio process
SDE to get

(S(t)fr (t, R(t), T2 ))D(t) [α(t, R(t)) − β(t, R(t), T1 )] fr (t, R(t), T1 )dt (4717)
+(−S(t)fr (t, R(t), T1 ))D(t) [α(t, R(t)) − β(t, R(t), T2 )] fr (t, R(t), T2 )dt (4718)
+D(t)γ(t, R(t)) [(S(t)fr (t, R(t), T2 ))fr (t, R(t), T1 ) + (−S(t)fr (t, R(t), T1 ))fr (t, R(t), T2 )] dW (t)
= D(t)S(t)[β(t, R(t), T2 ) − β(t, R(t), T1 )]fr (t, R(t), T1 )fr (t, R(t)T2 )dt (4719)
= D(t) |[β(t, R(t), T2 ) − β(t, R(t), T1 )]fr (t, R(t), T1 )fr (t, R(t)T2 )| dt. (4720)

See that the drift is nonnegative and there is no diffusion term, so DT XT − D0 X0 ≥ 0 with strict inequality unless β(t, R(t), T2 ) = β(t, R(t), T1 ); absence of arbitrage therefore forces β(t, R(t), T2 ) = β(t, R(t), T1 ).

3. Set ∆1 (t) = ∆(t), T1 = T, ∆2 (t) = 0, then the equation in part 1. becomes (from Equation 4711)

Dt dXt − Rt Dt Xt dt (4721)
= Dt (∆1 (t)df (t, Rt , T1 ) + 0 + Rt (Xt − ∆1 (t)f (t, Rt , T1 ))dt) − Rt Dt Xt dt (4722)
= Dt (∆(t)df (t, Rt , T ) − ∆(t)f (t, Rt , T1 )Rt dt) (4723)
= Dt ∆(t)[df (t, Rt , T ) − f (t, Rt , T1 )Rt dt] (4724)
 
1 2
= Dt ∆(t) ft dt + fr (αdt + γdW ) + frr γ dt − f Rt dt (4725)
2
1
= ∆D[−Rf + ft + αfr + γ 2 frr ]dt + D∆γfr dW. (4726)
2
When fr is zero, the discounted portfolio differential d(D(t)X(t)) is purely drift and for no-arbitrage
condition to hold we require Rf = ft + αfr + 12 γ 2 frr . That is, for all values of r in range of R(t),
we require the PDE
1
ft (t, r, T ) + γ 2 (t, r)frr (t, r, T ) = rf (t, r, T ) (4727)
2
to hold.

13.4.5 Multidimensional Feynman-Kac Theorem


The number of differential equations and Brownian motions in Feynman-Kac theorems (see Theorem
429, 430) discussed thus far were one.

Theorem 431 (Multidimensional Feynman Kac Theorems). In general, we can have more than one
differential equation and more than one (not necessarily the same number of ) Brownian. Let W (t) =
(W1 (t), W2 (t)) be a two dimensional Brownian (see Definition 445) and two SDE of form

dX1 (u) = β1 (u, X1 (u), X2 (u))du + γ11 (u, X1 (u), X2 (u))dW1 (u) + γ12 (u, X1 (u), X2 (u))dW2 (u),
dX2 (u) = β2 (u, X1 (u), X2 (u))du + γ21 (u, X1 (u), X2 (u))dW1 (u) + γ22 (u, X1 (u), X2 (u))dW2 (u).

As in Corollary 33, the solutions are Markov. Let h(y1 , y2 ) be Borel-measurable functions and the initial
conditions be known X1 (t) = x1 , X2 (t) = x2 . Define for t ∈ [T ] :

g(t, x1 , x2 ) = Et,x1 ,x2 h(X1 (T ), X2 (T )), (4728)


t,x1 ,x2
f (t, x1 , x2 ) = E [exp(−r(T − t))h(X1 (T ), X2 (T ))] (4729)

Then the multi-dimensional Feynman-Kac theorem asserts that
1 2 2
 1 2 2

gt + β1 gx1 + β2 gx2 + γ11 + γ12 gx1 x1 + (γ11 γ21 + γ12 γ22 ) gx1 x2 + γ21 + γ22 gx2 x2 = 0. (4730)
2 2
1 2 2
 1 2 2

ft + β1 fx1 + β2 fx2 + γ11 + γ12 fx1 x1 + (γ11 γ21 + γ12 γ22 ) fx1 x2 + γ21 + γ22 fx2 x2 = rf, (4731)
2 2
with terminal condition g(T, x1 , x2 ) = f (T, x1 , x2 ) = h(x1 , x2 ) for all (x1 , x2 ). As in the one-dimensional
Feynman-Kac theorems (see Theorem 429), the Equations 4730 are derived by starting the processes at
time zero, and observing the processes g(t, X1 (t), X2 (t)) and exp(−rt)f (t, X1 (t), X2 (t)) are martingales.
Take their differentials and set the drift to zero.

Proof. For 0 ≤ s ≤ t ≤ T,

E[g(t, X1 (t), X2 (t)|F(s)] = E[E[h(X1 (T ), X2 (T ))|F(t)]|F(s)] (4732)


= E[h(X1 (T ), X2 (T ))|F(s)] (4733)
= g(t, X1 (s), X2 (s)) (4734)

g(t, X1 (t), X2 (t)) is martingale. The same argument can be used to show exp(−rt)f (t, X1 (t), X2 (t)) is
martingale. For brevity, we only keep track of the drift terms, which we aim to set to zero. We prove a
more general form than the proof asserts, that is when the Brownians are possibly correlated. Suppose
the instantaneous correlation between W1 (t), W2 (t) were written dW1 (t)dW2 (t) = ρdt. By Ito Doeblin
formula (see Theorem 418) we can write

dg(t, X1 (t), X2 (t)) = gt dt + gx1 dgX1 (t) + gx2 dgX2 (t) (4735)
1 1
+ gx1 x1 dX1 (t)dX1 (t) + gx2 x2 dX2 (t)dX2 (t) + gx1 x2 dX1 (t)dX2 (t)(4736)
2 2
1 2 2
= [gt + gx1 β1 + gx2 β2 + gx1 x2 (γ11 + γ12 + 2ργ11 γ12 ) (4737)
2
+gx1 x2 (γ11 γ21 + ργ11 γ22 + ργ12 γ21 + γ12 γ22 ) (4738)
1 2 2
+ gx2 x2 (γ21 + γ22 + 2ργ21 γ22 )]dt + ψdWt . (4739)
2
The solution follows by setting this drift term to zero. Additionally, when the Brownians are uncorrelated,
we can get the result by setting ρ = 0. Similar computational exercise derives the solution for f .

Exercise 706 (Asian Option Pricing). Here the dynamic hedging problem for a path-dependent option
such as the Asian option is solved using the discounted Feynman-Kac theorem (Theorem 431). The
payoff for an Asian option can be written
!+
Z T
1
V (T ) = S(u)du − K , (4740)
T 0

where the S(u) is geometric Brownian (see Definition 435). Under the risk-neutral measure P̃ the SDE
for S(u) can be written

dS(u) = rS(u)du + σS(u)dW̃ (u). (4741)


Rt
Define Y (t) = 0
S(u)du, with SDE dY (u) = S(u)du. By Corollary 33, the pair of processes (S(u), Y (u))
defined by their specified respective differential equations are two-dimensional Markov processes. (see
Definition 325) In fact, Y (u) is not a Markov process in itself since it involves the process S(u). Using
the SDE to generate processes with initial values S(0) > 0 and Y (0) = 0, the payoff can be expressed

1
+
V (T ) = T Y (T ) − K . By risk-neutral pricing (see Equation 4193) we have the relation for the value
of an Asian option
"  + #
1
V (t) = Ẽ exp(−rτ ) Y (T ) − K |F(t) t ∈ [T ]. (4742)
T

By the Markov assumption, ∃v(t, x, y) s.t.

v(t, S(t), Y (t)) = V (t) (4743)


+
satisfying the terminal condition v(T, x, y) = Ty − K

for all x, y. Risk neutral pricing asserts that
exp(−rt)v(t, S(t), Y (t)) is martingale. By Ito Doeblin form for f (t, v) = exp(−rt)v with derivatives
ft = −r exp(−rt)v, fv = exp(−rt), fvv = 0, write

d (exp(−rt)v(t, S(t), Y (t))) = −r exp(−rt)vdt + exp(−rt)dv (4744)


= exp(−rt) [−rvdt + dv] (4745)
 
1 1
= exp(−rt) −rvdt + vt dt + vx dS + vy dY + vxx dSdS + vxy dSdY + vyy dY dY .
2 2
 
1
= exp(−rt) −rvdt + vt dt + vx (rSdt + σSdW̃ ) + vy Sdt + vxx σ 2 S 2 dt . (4746)
2
The drift in Equation 4746 should be zero, such that we obtain the PDE
1
vt (t, x, y) + rxvx (t, x, y) + xvy (t, x, y) + σ 2 x2 vxx (t, x, y) = rv(t, x, y). (4747)
2
See that this is an example of the discounted Feynman-Kac theorems for higher dimensions as given by
Equations 4730. Omitting the zero drift term, Equation 4746 simplifies to

d(exp(−rt)v(t, S(t), (t))) = exp(−rt)σS(t)vx (t, S(t), Y (t))dW̃ (t). (4748)

The discounted portfolio value evolves by SDE (Equation 4190), that is

d(exp(−rt)X(t)) = exp(−rt)σS(t)∆(t)dW̃ (t). (4749)

To achieve dynamic hedging, compare the Equations 4748 and 4749. In particular, a short position in
the Asian option for v(0, S(0), 0) can be used to set up the initial capital for a hedging portfolio, such
that X(0) = v(0, S(0), 0), and using ∆(t) = vx (t, S(t), Y (t)), then the SDEs match and at terminal T ,
we have
X(T) = v(T, S(T), Y(T)) = ((1/T)Y(T) − K)⁺ (4750)
and we are done. The solution to the PDE contains a term xvy (t, x, y) which was not observed in the
European variants.
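The risk-neutral representation of Equation 4742 also gives an immediate Monte Carlo estimator for the fixed-strike Asian call: simulate S under dS = rSdu + σSdW̃, average along the path and discount the payoff. A minimal sketch with assumed parameters and a discrete-time proxy for the integral average:

import numpy as np

def asian_call_mc(S0, K, T, r, sigma, n_paths=200_000, n_steps=252, seed=1):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    S = np.full(n_paths, S0, dtype=float)
    running_sum = np.zeros(n_paths)
    for _ in range(n_steps):
        Z = rng.standard_normal(n_paths)
        S *= np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Z)
        running_sum += S                       # discrete proxy for the integral of S(u) du
    avg = running_sum / n_steps                # approximately (1/T) * integral
    payoff = np.maximum(avg - K, 0.0)
    return np.exp(-r * T) * payoff.mean()

print(asian_call_mc(S0=100.0, K=100.0, T=1.0, r=0.02, sigma=0.2))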

13.5 Exotic Options


Options with payoff determined by the final value of the underlying are known as vanilla options. There are also options whose payoff depends on the price path; these path-dependent options are called exotic options.³ Common examples of such options are barrier options, lookback options and Asian options.
³ Most of the discussion in this section is motivated by the chapters in Stochastic Calculus for Finance II by Shreve [19] - the author, yours truly, does not have the mathematical aptitude nor attention span to construct such lengthy cogent pricing arguments without reference to Shreve's work.

In Section 13.1.6 we looked at distributions of a Maximum to Date random variable (Definition 438).
This distribution was discussed in Theorem 408 w.r.t to the stochastic process being a Brownian motion.
Here we generalize to when the Brownian motion has drift. Define a P̃ Brownian motion W̃ (t) with zero
drift defined on (Ω, F, P̃) and

Ŵ (t) = αt + W̃ (t) (4751)

with α drift under P̃. Define the maximum to date M̂ (T ) = maxt∈[T ] Ŵ (t). See the trivial results
Ŵ (0) = 0 = M̂ (0), M̂ (T ) ≥ 0, and Ŵ (T ) ≤ M̂ (T ).

Theorem 432 (Joint Density of Brownian Motion with Drift and Maximum to Date). The joint density
under P̃ of pair (M̂ (T ), Ŵ (T )) is
 
f̃_{M̂(T),Ŵ(T)}(m, w) = (2(2m − w)/(T√(2πT))) exp{ αw − (1/2)α²T − (1/(2T))(2m − w)² }, w ≤ m, m ≥ 0. (4752)

Proof. Define the exponential martingale


   
1 ∆ 1
Ẑ(t) = exp −αW̃ (t) − α2 t = exp −αŴ (t) + α2 t . (4753)
2 2
R
Define probability measure P̂(A) = A
Z(T )dP̃ for all A ∈ F. By Girsanov Theorem (Theorem 423),
Ŵ (T ) is P̂ Brownian. By Theorem 408, we have

(2m − w)2
 
2(2m − w)
fM̂ (T ),Ŵ (T ) (m, w) = √ exp − w ≤ m, m > 0. (4754)
T 2πT 2T

By the properties of Radon Nikodym derivative processes specified (Lemma 21), we can write
n o
P̃(M̂ (T ) ≤ m, Ŵ (T ) ≤ w) = Ẽ1 M̂ (T ) ≤ m, Ŵ (T ) ≤ w (4755)
1 n o
= Ê 1 M̂ (T ) ≤ m, Ŵ (T ) ≤ w (4756)
Ẑ(T )
  n
1 o
= Ê exp αŴ (T ) − α2 T 1 M̂ (T ) ≤ m, Ŵ (T ) ≤ w (4757)
2
Z w Z m  
1
= exp αy − α2 T fˆM̂ (T ),Ŵ (T ) (x, y)dxdy. (4758)
−∞ −∞ 2

Therefore, the density of (M̂ (T ), Ŵ (T )) under P̃ is written

δ2
 
n o 1 2
P̃ M̂ (T ) ≤ m, Ŵ (T ) ≤ w = exp αw − α T fˆM̂ (T ),Ŵ (T ) (m, w). (4759)
δmδw 2

Corollary 34. We have the following results for m ≥ 0


   
P̃(M̂(T) ≤ m) = Φ((m − αT)/√T) − exp(2αm)Φ((−m − αT)/√T), (4760)
f̂_{M̂(T)}(m) = (2/√(2πT)) exp{−(1/(2T))(m − αT)²} − 2α exp(2αm)Φ((−m − αT)/√T). (4761)

Proof. With reference to Theorem 432, integrate over the valid ranges:
Z mZ m  
2(2µ − w) 1 2 1 2
P̃(M̂ (T ) ≤ m) = √ exp αw − α T − (2µ − w) dµdw (4762)
0 w T 2πT 2 2T
Z 0 Z m  
2(2µ − w) 1 1 2
+ √ exp αw − α2 T − (2µ − w) dµdw (4763)
−∞ 0 T 2πT 2 2T
Z m  m
1 1 1 2
= − √ exp αw − α2 T − (2µ − w) dw (4764)
0 2πT 2 2T w
Z 0   m
1 1 2 1 2
− √ exp αw − α T − (2µ − w) dw (4765)
−∞ 2πT 2 2T
Z m   0
1 1 1 2
= −√ exp αw − α2 T − (2m − w) dw (4766)
2πT 0 2 2T
Z m  
1 1 2 1 2
+√ exp αw − α T − (2w − w) dw (4767)
2πT 0 2 2T
Z 0  
1 1 1 2
−√ exp αw − α2 T − (2m − w) (4768)
2πT −∞ 2 2T
Z 0  
1 1 1 2
+√ exp αw − α2 T − (−w) dw (4769)
2πT −∞ 2 2T
Z m  
1 1 1 2
= −√ exp αw − α2 T − (2m − w) dw (4770)
2πT −∞ 2 2T
Z m  
1 1 2 1 2
+ √ exp αw − α T − w dw. (4771)
2πT −∞ 2 2T

Using

1 2 (2m − w)2 1
− (w − 2m − αT ) = − + αw − 2αm − α2 T, (4772)
2T 2T 2
1 w2 1 2
− (w − αT )2 = − + αw − α T. (4773)
2T 2T 2
Then
exp(2αm) m
Z  
1 2
P̃(M̂ (T ) ≤ m) = − √ exp − (w − 2m − αT ) dw (4774)
2πT −∞ 2T
Z m  
1 1 2
+√ exp − (w − αT ) dw. (4775)
2πT −∞ 2T
w−2m−αT w−αT
Use y → √
T
and y → √
T
, get

Z −m−αT
√ Z m−αT

exp(2αm) T 1 1 T 1
P̃(M̂ (T ) ≤ m) = − √ exp(− y 2 )dy + √ exp(− y 2 )dy (4776)
2π −∞ 2 2π −∞ 2
   
−m − αT m − αT
= − exp(2αm)Φ √ +Φ √ . (4777)
T T
See that
(−m − αT )2 4αmT m2 + 2αmT + α2 T 2
2αm − = − (4778)
2T 2T 2T
m2 − 2αmT + α2 T 2
= − (4779)
2T
(m − αT )2
= − . (4780)
2T

δ
Differentiating, δm P̃(M̂ (T ) ≤ m)
       
m − αT 1 −m − αT −m − αT 1
= φ √ √ − 2α exp(2αm)Φ √ − exp(2αm)φ √ −√
T T T T T
     
1 1 2 −m − αT exp(2αm) 1 2
= √ exp − (m − αT ) − 2α exp(2αm)Φ √ + √ exp − (−m − αT )
2πT 2T T 2πT 2T
     
1 1 2 −m − αT 1 1 2
= √ exp − (m − αT ) − 2α exp(2αm)Φ √ +√ exp − (m − αT ) Eqn 4780
2πT 2T T 2πT 2T
   
2 1 −m − αT
= √ exp − (m − αT )2 − 2α exp(2αm)Φ √ . (4781)
2πT 2T T

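Equation 4760 is easy to check against a discretised simulation of the drifted Brownian path and its running maximum. The sketch below uses assumed values of α, T and m; note the discrete grid slightly understates the true maximum, biasing the simulated probability up a little.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
alpha, T, m = 0.3, 1.0, 0.8
n_paths, n_steps = 200_000, 2_000
dt = T / n_steps

W = np.zeros(n_paths)                          # paths of W_hat(t) = alpha*t + W~(t)
M = np.zeros(n_paths)                          # running maximum, M_hat(0) = 0
for _ in range(n_steps):
    W += alpha * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    M = np.maximum(M, W)

simulated = (M <= m).mean()
closed_form = norm.cdf((m - alpha * T) / np.sqrt(T)) \
    - np.exp(2 * alpha * m) * norm.cdf((-m - alpha * T) / np.sqrt(T))
print("simulated P(M<=m):", simulated, " closed form (Equation 4760):", closed_form)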
The maximum-to-date random variables can be used to analyse a class of barrier options known as
knock-out/in barrier options. Knock out options become worthless when the underlying crosses some
level, as opposed to knock in options which only payoff when the underlying crosses some level. In
particular, an up-and-out option requires the underlying crossing a price ceiling to become worthless,
while the down-and-out option requires the underlying to break a price floor before becoming worthless.
We analyze an up-and-out call, which may be described by the following:

13.5.1 Up-and-Out Call


Definition 459 (Up-and-Out Call). Suppose an underlying is dS(t) = rS(t)dt + σS(t)dW̃ (t) where
W̃ (t) is Brownian under P̃, the risk-neutral measure. A European call with strike K maturity T and
up-and-out barrier B is considered. As in usual cases we may write
 
S(t) = S(0) exp{σW̃(t) + (r − (1/2)σ²)t} = S(0) exp{σŴ(t)}, (4782)
where Ŵ(t) = αt + W̃(t) and α = (1/σ)(r − (1/2)σ²). Define the maximum-to-date M̂(T) for Ŵ(t) such that


n o
max S(t) = S(0) exp σ M̂ (T ) . (4783)
t∈[T ]

The option payoff can be written


V(T) = (S(0) exp{σŴ(T)} − K)⁺ 1{S(0) exp(σM̂(T)) ≤ B} (4784)
= (S(0) exp{σŴ(T)} − K) 1{S(0) exp{σŴ(T)} ≥ K, S(0) exp(σM̂(T)) ≤ B}
= (S(0) exp{σŴ(T)} − K) 1{Ŵ(T) ≥ k, M̂(T) ≤ b} (4785)
where
k = (1/σ) log(K/S(0)), b = (1/σ) log(B/S(0)). (4786)
Result 45 (Optional Sampling Theorem). A (sub/super)martingale stopped at stopping time is (sub/su-
per)martingale. Additionally, denote t ≥ 0 and τ stopping time. If X(t) is submartingale , EX(t ∧ τ ) ≤
EX(t). If X(t) is supermartingale, EX(t ∧ τ ) ≥ EX(t). If X(t) is martingale, EX(t ∧ τ ) = EX(t).

Theorem 433. Let v(t, x) denote price at t of up-and-out call option (see Definition 459) that has not
yet been knocked out, with S(t) = x. v(t, x) must satisfy the BSM PDE Equation 3848 form
1
vt (t, x) + rxvx (t, x) + σ 2 x2 vxx (t, x) = rv(t, x) (4787)
2

in rectangle {(t, x); 0 ≤ t < T, 0 ≤ x ≤ B} and satisfies boundary conditions

v(t, 0) = 0, 0 ≤ t ≤ T, (4788)
v(t, B) = 0, 0 ≤ t < T, (4789)
+
v(T, x) = (x − K) , 0 ≤ x ≤ B. (4790)

We can solve for this either analytically or with the general arguments using the risk-neutral pricing
formula.
By risk-neutral pricing arguments (see Equation 4193) the random variable

exp(−rt)V (t) = Ẽ [exp(−rT )V (T )|F(t)] , t ∈ [T ] (4791)

is martingale. The difference between v(t, S(t)) and V (t) is that the former has no memory of the
price path. In particular, if the underlying breaks the barrier and returns below the barrier at t ∈ [T ],
v(t, S(t)) > 0 but V (t) = 0. To account for this, define ρ to be the earliest time t s.t. S(t) = B. Then
S(t) < B for all 0 ≤ t < ρ and S(ρ) = B. By the non-zero quadratic variation of S(t), S(t) oscillates
and crosses B infinitely many times immediately after ρ, and we can let ρ define the knock-out time.
If it never hits the barrier prior to maturity, set ρ = ∞. Then ρ is called the stopping time, which
were first discussed in Section 13.1.5. The Optional Sampling Theorem (Result 45) asserts that a frozen
martingale is still martingale, that is
(
exp(−rt)V (t) if 0 ≤ t ≤ ρ ,
exp {−r(t ∧ ρ)} V (t ∧ ρ) = (4792)
exp(−rρ)V (ρ) if ρ < t ≤ T

is P̃ martingale. It follows that V (t) = v(t, S(t)) for 0 ≤ t ≤ ρ and that

exp {−r(t ∧ ρ)} v(t ∧ ρ, S(t ∧ ρ)), 0≤t≤T (4793)

is P̃ martingale.

Proof. Using Ito Doeblin (Theorem 415), write d (exp(−rt)v(t, S(t)))


 
1
= exp(−rt) −rvdt + vt dt + vx dS(t) + vxx dS(t)dS(t) (4794)
2
 
1 2 2
= exp(−rt) −rv + vt + rS(t)vx + σ S (t)vxx dt + exp(−rt)σS(t)vx dW̃ (t). (4795)
2

The drift term is zero for t ∈ [ρ], and therefore the BSM PDE holds for every point in the rectangle
{(t, x) : 0 ≤ t < T, 0 ≤ x ≤ B}.

Setting drift to zero, Equation 4795 becomes

d (exp(−rt)v(t, S(t))) = exp(−rt)σS(t)vx dW̃ (t) (4796)

and comparing this with the portfolio evolution SDE 4190, written

d(exp(−rt)X(t)) = exp(−rt)σS(t)∆(t)dW̃ (t). (4797)

If the agent begins with short position in the up-and-out call, initial capital X(0) = v(0, S(0)) and
holds ∆(t) = vx (t, S(t)) continuously until ρ ∧ T , she will have perfectly hedged the option. However,
in practice, when the underlying approaches B near T , the function v(t, x) approaches discontinuity

and has large negative delta vx (t, x) and large negative gamma vxx (t, x) values. The hedger needs to
take large positions as well as large adjustments near expiration near barrier scenarios, which incurs
significant costs. The practice (Shreve [19]) is to price and hedge the option as if it were an option with
B ∗ > B such that these large vx , vxx occurs when underlying S > B. The hedge position is closed out
when S hits B and the contract knocks out.
See from Equation 4785

 n o  n o
V (T ) = S(0) exp σ Ŵ (T ) − K 1 Ŵ (T ) ≥ k, M̂ (T ) ≤ b (4798)

where
1 K 1 B
k= log , b= log . (4799)
σ S(0) σ S(0)

Risk-neutral pricing formula (Equation 4193) asserts that V(0) = Ẽ[exp(−rT)V(T)]. We are interested in the region {(m, w) : k ≤ w ≤ m, 0 ≤ m ≤ b}, and we assume 0 < S(0) ≤ B. Then by the joint density
for (M̂ (T ), Ŵ (T )) in Theorem 432 written
 
2(2m − w) 1 1 2
√ exp αw − α2 T − (2m − w) , w ≤ m, m ≥ 0, (4800)
T 2πT 2 2T
we may write
Z bZ b  
2(2m − w) 1 2 1 2
V (0) = exp(−rT )(S(0) exp(σw) − K) √ exp αw − α T − (2m − w) dmdw
k w+ T 2πT 2 2T
Z b   m=b
1 1 1
= − exp(−rT )(S(0) exp(σw) − K) √ exp αw − α2 T − (2m − w)2 dw
k 2πT 2 2T m=w+
Z b  
1 1 2 1 2
= √ (S(0) exp(σw) − K) exp −rT + αw − α T − w dw (4801)
2πT k 2 2T
Z b  
1 1 1
−√ (S(0) exp(σw) − K) exp −rT + αw − α2 T − (2b − w)2 dw (4802)
2πT k 2 2T
Z b  
1 1 2 1 2
= S(0) √ exp σw − rT + αw − α T − w dw (4803)
2πT k 2 2T
| {z }
I1
Z b  
1 1 2 1 2
−K √ exp −rT + αw − α T − w dw (4804)
2πT k 2 2T
| {z }
I2
Z b  
1 1 2 1 2
−S(0) √ exp σw − rT + αw − α T − (2b − w) dw (4805)
2πT k 2 2T
| {z }
I3
Z b  
1 1 2 1 2
+K √ exp −rT + αw − α T − (2b − w) dw . (4806)
2πT k 2 2T
| {z }
I4

We may write each of these terms in form


Z b   Z b  
1 1 2 1 1 2 1 2
√ exp β + γw − w dw = √ exp − (w − γT ) + γ T + β dw (4807)
2πT k 2T 2πT k 2T 2
  Z √1 (b−γT )
1 2 1 T 1 w − γT
= exp γ T +β √ exp(− y 2 )dy, y→ √ .
2 2π √1 (k−γT ) 2 T
T

Using Φ(z) = 1 − Φ(−z), write:
Z b
1 w2
√ exp(β + γw − )dw (4808)
2πT k 2T
    
1 2 b − γT k − γT
= exp( γ T + β) Φ √ −Φ √ (4809)
2 T T
    
1 2 −k + γT −b + γT
= exp( γ T + β) Φ √ −Φ √ (4810)
2 T T
      
1 2 1 S(0) 1 S(0)
= exp( γ T + β) Φ √ log + γσT −Φ √ log + γσT . (4811)
2 σ T K σ T B
Set
δ±(τ, s) = (1/(σ√τ))[log s + (r ± (1/2)σ²)τ]. (4812)
See that we can express
Z b  
1 1 2 2 1 2
I3 = √ exp σw − rT + αw − α2 T − b2 + bw − w dw, (4813)
2πT k 2 T T 2T
Z b  
1 1 2 2 1 2
I4 = √ exp −rT + αw − α2 T − b2 + bw − w dw. (4814)
2πT k 2 T T 2T
Recall the value of  
1 1 2 1
α= r − σ ↔ r = ασ + σ 2 .
σ 2 2
Then for I1 we have β = −rT − 21 α2 T, γ = α + σ, so
1 2 1 2 1
γ T +β = (α + σ 2 + 2ασ)T − rT − α2 T (4815)
2 2 2
1 2
= σ T + ασT − rT (4816)
2
= 0, (4817)
γσ = (α + σ)σ (4818)
1
= r + σ2 (4819)
2
s.t.
     
S(0) S(0)
I1 = Φ δ+ T, − Φ δ+ T, . (4820)
K B

For I2 we have β = −rT − 12 α2 T, γ = α so 12 γ 2 T + β = −rT, γσ = r − 21 σ 2 s.t.


      
S(0) S(0)
I2 = exp(−rT ) Φ δ− T, − Φ d− T, . (4821)
K B
2b2
For I3 we have β = −rT − 12 α2 T − T ,γ =α+σ+ 2b
T so
 − 2r2 −1
1 2 S(0) σ
γ T +β = log , (4822)
2 B
   2
1 B
γσT = r + σ 2 T + log , (4823)
2 S(0)
s.t.
− 2r2 −1   
B2
    
S(0) σ B
I3 = Φ δ+ T, − Φ δ+ T, . (4824)
B KS(0) S(0)

2b2
For I4 we have β = −rT − 12 α2 T − T ,γ =α+ 2b
T so
 − 2r2 +1
1 2 S(0) σ
γ T +β = −rT + log , (4825)
2 B
   2
1 B
γσT = r − σ 2 T + log , (4826)
2 S(0)
s.t
− 2r2 +1   
B2
    
S(0) σ B
I4 = exp(−rT ) Φ δ− T, − Φ δ− T, . (4827)
B KS(0) S(0)

It follows that for 0 < S(0) ≤ B we have
V(0) = S(0)[Φ(δ+(T, S(0)/K)) − Φ(δ+(T, S(0)/B))] (4828)
− exp(−rT)K[Φ(δ−(T, S(0)/K)) − Φ(δ−(T, S(0)/B))] (4829)
− B(S(0)/B)^(−2r/σ²)[Φ(δ+(T, B²/(K·S(0)))) − Φ(δ+(T, B/S(0)))] (4830)
+ exp(−rT)K(S(0)/B)^(−2r/σ²+1)[Φ(δ−(T, B²/(K·S(0)))) − Φ(δ−(T, B/S(0)))]. (4831)

Replacing T with τ and S(0) with the new initial condition S(t) = x, the price of the up-and-out call is
v(t, x) = x[Φ(δ+(τ, x/K)) − Φ(δ+(τ, x/B))] (4832)
− exp(−rτ)K[Φ(δ−(τ, x/K)) − Φ(δ−(τ, x/B))] (4833)
− B(x/B)^(−2r/σ²)[Φ(δ+(τ, B²/(Kx))) − Φ(δ+(τ, B/x))] (4834)
+ exp(−rτ)K(x/B)^(−2r/σ²+1)[Φ(δ−(τ, B²/(Kx))) − Φ(δ−(τ, B/x))] (4835)

when 0 ≤ t < T, 0 < x ≤ B. For values not in this range, v(t, x) = 0. The special case of the asset
hitting the barrier at expiration gives v(T, B) = B − K.
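The pricing formula translates directly into code. The sketch below implements v(t, x) of Equations 4832-4835 for 0 < x < B (sample parameters are assumptions); as a plausibility check, the value should sit below the corresponding vanilla BSM call.

import numpy as np
from scipy.stats import norm

def delta_pm(tau, s, r, sigma, sign):
    # delta_{+/-}(tau, s) of Equation 4812
    return (np.log(s) + (r + sign * 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))

def up_and_out_call(x, K, B, tau, r, sigma):
    """Equations 4832-4835; valid for 0 < x < B and tau > 0. Returns 0 if knocked out."""
    if x >= B:
        return 0.0
    dp = lambda s: norm.cdf(delta_pm(tau, s, r, sigma, +1.0))
    dm = lambda s: norm.cdf(delta_pm(tau, s, r, sigma, -1.0))
    p = 2.0 * r / sigma**2
    return (x * (dp(x / K) - dp(x / B))
            - np.exp(-r * tau) * K * (dm(x / K) - dm(x / B))
            - B * (x / B) ** (-p) * (dp(B**2 / (K * x)) - dp(B / x))
            + np.exp(-r * tau) * K * (x / B) ** (-p + 1.0)
              * (dm(B**2 / (K * x)) - dm(B / x)))

print("up-and-out call:", up_and_out_call(x=100.0, K=100.0, B=120.0, tau=1.0, r=0.02, sigma=0.2))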

13.5.2 Lookback Options


A lookback option is an option whose payoff depends on the maximum that the underlying asset price
attains over some time interval before expiry.

Definition 460 (Floating Strike Lookback Options). A floating strike lookback option pays off the difference between the maximum of the underlying price over [T] and the price at T. Consider a geometric Brownian underlying with price S(t) = S(0) exp{σŴ(t)}, where Ŵ(t) = αt + W̃(t) and α = (1/σ)(r − (1/2)σ²). A similar statement was made in Definition 459. Specify the maximum to date M̂(t) = max_{0≤u≤t} Ŵ(u) for 0 ≤ t ≤ T and write the maximum of the underlying up to t as

Y (t) = max S(u) = S(0) exp(σ M̂ (t)). (4836)


0≤u≤t

Then the option pays off V (T ) = Y (T ) − S(T ) at T . Risk-neutral pricing formula asserts that

V (t) = Ẽ [exp(−rτ )(Y (T ) − S(T ))|F(t)] . (4837)

By Corollary 33 the processes (S(t), Y (t)) is Markov and ∃v(t, x, y) satisfying

V (t) = v(t, S(t), Y (t)). (4838)

Theorem 434. Let v(t, x, y) denote the price at t of floating strike lookback option (Definition 460)
when S(t) = x, Y (t) = y. The BSM PDE
1
vt + rxvx + σ 2 x2 vxx = rv, (4839)
2
holds in the region {(t, x, y) : 0 ≤ t < T, 0 ≤ x ≤ y} with boundary conditions

v(t, 0, y) = exp {−rτ } y, 0 ≤ t ≤ T, y ≥ 0, (4840)


vy (t, y, y) = 0, 0 ≤ t ≤ T, y > 0, (4841)
v(T, x, y) = y − x, 0 ≤ x ≤ y. (4842)

Under risk-neutral pricing exp(−rt)V (t) = exp(−rt)v(t, S(t), Y (t)) is P̃ martingale. Y (t) has zero
quadratic variation. To see this, for partition set 0 = t0 < t1 < · · · < tm = T write
m
X 2
(Y (tj ) − Y (tj−1 )) (4843)
k=1
m
X
≤ max (Y (tj ) − Y (tj−1 )) (Y (tj ) − Y (tj−1 )) (4844)
j=1,··· ,m
j=1
= max (Y (tj ) − Y (tj−1 )) · (Y (T ) − Y (0)). (4845)
j=1,··· ,m

Since Y (t) is continuous and non-decreasing, maxj=1,··· ,m (Y (tj ) − Y (tj−1 )) goes to zero as the partition
size goes to zero. Y (t) has zero quadratic variation, or colloquially dY (t)dY (t) = 0. Since Y (tj ) −
Y(tj−1) ≥ 0 we do not need to take absolute values. However, there does not exist Θ(t) s.t. we can write Y(t) = Y(0) + ∫_0^t Θ(u)du. Otherwise, Θ(u) would be zero whenever S(t) ≤ max_{u∈[t]} S(u), below its maximum to date.
But the interval on which Y (t) increases has Lebesgue measure zero - an interval where Y (t) is strictly
increasing requires S(t) to be strictly increasing, which would only happen if S(t) has zero quadratic
variation, which is not the case since d[S, S](t) = σ²S²(t)dt for all t. The interval [0, T] contains a sequence of subintervals whose lengths sum to T, and on each of these subintervals Y(t) is constant. The complement of the union of these subintervals is analogous to the Cantor set: although the lengths of the subintervals sum to T, there are uncountably many points of [0, T] not contained in these intervals. The cross variation with S(t) is likewise zero; colloquially, dY(t)dS(t) = 0.

Proof. By the Ito Doeblin formula, we may write


 
1
d (exp {−rt} v(t, S(t), Y (t))) = exp(−rt) −rvdt + vt + vx dS(t) + vxx dS(t)dS(t) + vy dY (t)
2
 
1 2 2
= exp(−rt) −rv + vt dt + rS(t)vx + σ S (t)vxx dt (4846)
2
+ exp(−rt)σS(t)vx dW̃ (t) (4847)
+ exp(−rt)vy dY (t). (4848)

See that exp(−rt)vy (t, S(t), Y (t))dY (t) must be zero. This is because dY (t) = 0 when S(t) is below
Y (t), and when S(t) = Y (t), dY (t) > 0 - for the equation to hold, the term itself must be zero, and the
boundary condition Equation 4841 holds.

We can logicize the following relationship for the price of a floating strike lookback option

v(t, λx, λy) = λv(t, x, y), ∀λ > 0. (4849)

Define u(t, z) = v(t, z, 1). If we knew u(t, z), we would be able to obtain v(t, x, y) via the relationship
x x
v(t, x, y) = yv(t, , 1) = yu(t, ), 0 ≤ t ≤ T, 0 ≤ x ≤ y, y > 0. (4850)
y y
It is easy to see
 
x
vt (t, x, y) = yut t, , (4851)
y
 
x 1
vx (t, x, y) = yuz t, , (4852)
y y
 
x 1
vxx (t, x, y) = uzz t, , (4853)
y y
   
x x −x
vy (t, x, y) = u t, + yuz t, . (4854)
y y y2
Substitution into the BSM PDE, keeping only the dt terms, write
 
1
0 = −rv(t, x, y) + vt (t, x, y) + rxvx (t, x, y) + σ 2 x2 vxx (t, x, y)
2
"          2  #
x x x x 1 2 x x
= y −ru t, + ut t, +r uz t, + σ uzz t, (4855)
y y y y 2 y y
x
z→ y results in the BSM PDE

1
ut (t, z) + rzuz (t, z) + σ 2 z 2 uzz (t, z) = ru(t, z), 0 ≤ t < T, 0 < z < 1. (4856)
2

exp(−rτ )y = v(t, 0, y) = yu(t, 0) =⇒ u(t, 0) = exp(−rτ ) (4857)


0 = vy (t, y, y) = u(t, 1) − uz (t, 1) =⇒ u(t, 1) = uz(t, 1) (4858)
 
x
y − x = v(T, x, y) = yu T, =⇒ u(T, z) = 1 − z. (4859)
y

are our boundary u(t, z) boundary conditions.


We compute the function v(t, x, y) when 0 ≤ t < T and when 0 < x ≤ y. For 0 ≤ t < T , see that
n o n o n o
Y (T ) = S(0) exp σ M̂ (t) exp σ(M̂ (T ) − M̂ (t)) = Y (t) exp σ(M̂ (T ) − M̂ (t)) . (4860)

See that we can write


 +  +
M̂ (T ) − M̂ (t) = max Ŵ (u) − M̂ (t) = max (Ŵ (u) − Ŵ (t)) − (M̂ (t) − Ŵ (t)) (4861)
t≤u≤T t≤u≤T

Multiplying by σ and use the relation Y (t) = S(0) exp(σ M̂ (t)), S(t) = S(0) exp(σ Ŵ (t)) to write
 +
Y (t)
σ(M̂ (T ) − M̂ (t)) = max σ(Ŵ (u) − Ŵ (t)) − log . (4862)
t≤u≤T S(t)
The risk-neutral pricing equation can be written to be
" ( + ) #
Y (t)
V (t) = exp {−rτ } Ẽ Y (t) exp max σ(Ŵ (u) − Ŵ (t)) − log |F(t) − Ẽ [exp(−rτ )S(T )|F(t)] .
t≤u≤T S(t)

Taking out the measurable random variables for first term gives
" ( + ) #
Y (t)
exp {−rτ } Y (t)Ẽ exp max σ(Ŵ (u) − Ŵ (t)) − log |F(t) (4863)
t≤u≤T S(t)

By the Independence Lemma (see Lemma 4), we may write


( + )
y
g(x, y) = Ẽ exp max σ(Ŵ (u) − Ŵ (t)) − log , (4864)
t≤u≤T x

to which Equation 4863 is given by g(S(t), Y (t)) without the conditioning filtration. Then

V (t) = exp(−rτ )Y (t)g(S(t), Y (t)) − S(t), (4865)

where S(t) is the latter term due to risk-neutral pricing. It stands that v(t, x, y) = exp(−rτ )yg(x, y) − x
- we are left with finding g(x, y).
Since we may write

max σ(Ŵ (u) − Ŵ (t)) = σ max (Ŵ (u) − Ŵ (t)), (4866)


t≤u≤T t≤u≤T

where the term maxt≤u≤T (Ŵ (u) − Ŵ (t)) shares distribution with M̂ (τ ), then we may express
h 
y i+
g(x, y) = Ẽ exp σ M̂ (τ ) − log (4867)
x
   
1 y 1 y n o n yo
= Ẽ1 M̂ (τ ) ≤ log exp {0} + Ẽ1 M̂ (τ ) ≥ log exp σ M̂ (τ ) exp − log
σ x σ x x
   o  
1 y x n 1 y
= P̃ M̂ (τ ) ≤ log + Ẽ exp σ M̂ (τ ) 1 M̂ (τ ) ≥ log . (4868)
σ x y σ x

Recall the CDF given by Corollary 34:


   
m − αT −m − αT
P̃(M̂ (T ) ≤ m) = Φ √ − exp(2αm)Φ √ , (4869)
T T
1
r − 21 σ 2 , δ± (τ, s) = 1
log s + r ± 12 σ 2 τ )
   
we can write (recall α = σ

σ τ
     
1 1 y 1 y 1 2
√ log − ατ = √log − r − σ τ (4870)
τ σ x σ τ x 2
   
1 x 1 2
= − √ log + r − σ τ (4871)
σ τ y 2
 
x
= −δ− τ, (4872)
y

and
     
1 1 y 1 y 1 2
√ − log − ατ = √ − log − r − σ τ (4873)
τ σ x σ τ x 2
 y
= −δ− τ, . (4874)
x
Doing m → 1
σ log xy , term exp(2αm) becomes
       2r
2α y 2r y y σ2 −1
exp log = exp 2
− 1 log = . (4875)
σ x σ x x

It follows that
       2r
1 y x y σ2 −1   y 
P̃ M̂ (τ ) ≤ log = Φ −δ− τ, − Φ −δ− τ, . (4876)
σ x y x x
 
Recall Equation 4761: fˆM̂ (T ) (m) = √2πT
2
(m − αT )2 − 2α exp(2αm)Φ −m−αT
 1
exp − 2T √
T
such that

x ∞
   Z
x n o 1 y
Ẽ exp σ M̂ (τ ) 1 M̂ (τ ) ≥ log = exp {σm} fˆM̂ (τ ) (m)dm (4877)
y σ x y σ1 log xy
x ∞
Z  
2 1
= √ exp σm − (m − ατ )2 dm(4878)
y σ1 log xy 2πτ 2τ
Z ∞  
x −m − ατ
− 2α exp {(σ + 2α)m} Φ √ dm.
y σ1 log xy τ
Since
1 1 1
rτ − (m − ατ − στ )2 = rτ − (m − ατ )2 + σ(m − ατ ) − σ 2 τ (4879)
2τ 2τ  2 
1 1 2 1
= rτ − (m − ατ ) + σm − r − σ τ − σ 2 τ
2
(4880)
2τ 2 2
1
= σm − (m − ατ )2 . (4881)

Then we can write
Z ∞  
x 2 1 2
√ exp σm − (m − ατ ) dm (4882)
y 1 y
σ log x
2πτ 2τ
2x exp(rτ ) ∞
Z  
1
= √ exp − (m − ατ − στ )2 dm. (4883)
y 2πτ 1 y
σ log x

If ξ → √ −m
ατ +στ
then the lower limit
τ
 
1 y 1 1 y
log → √ ατ + στ − log (4884)
σ x τ σ x
 
1 x 1
= √ log + (r − σ 2 )τ + σ 2 τ (4885)
σ τ y 2
 
1 x 1
= √ log + rτ + σ 2 τ (4886)
σ τ y 2
 
x
= δ+ τ, . (4887)
y
Substituting,
√ Z x
x ∞ 2x exp(rτ ) τ δ+ (τ, y )
Z    
2 1 2 1 2
√ exp σm − (m − ατ ) dm = √ exp − ξ dξ
y σ1 log xy 2πτ 2τ y 2πτ −∞ 2
  
2x exp(rτ ) x
= Φ δ+ τ, . (4888)
y y
The first part of Equation 4877 is resolved. Next, since σ + 2α = 2r σ , we can write
Z ∞  
x −m − ατ
− 2α exp(m(σ + 2α))Φ √ dm (4889)
y σ1 log xy τ
Z ∞ Z √1 (−m−ατ )  
2αx τ 2 1 2
=− √ exp rm − ξ dξdm (4890)
y 2π σ1 log xy −∞ σ 2
Z −δ− (τ, xy ) Z −ξ√τ −ατ  
2αx 2 1 2
=− √ exp rm − ξ dmdξ.* (4891)
y 2π −∞ 1 y
σ log x
σ 2

4
Evaluating the inner integral, write

−ξ τ −ατ m=−ξ√τ −ατ
2rm ξ 2 2rm ξ 2
Z    
σ
exp − dm = exp − (4892)
1
σ log y
x
σ 2 2r σ 2 1
m= σ y
log x

   
σ 2r  1 σ 2r y 1 2
= exp −ξ τ − ατ − ξ 2 − exp log − ξ .
2r σ 2 2r σ2 x 2
Since

2r √  ξ2 ξ2 2rξ τ 2rατ
−ξ τ − ατ − = − − − (4893)
σ 2 2 σ σ
√ 2
2r2 τ

1 2r τ 2rατ
= − ξ+ + 2 − (4894)
2 σ σ σ
 √ 2
1 2r τ 2rτ
= − ξ+ + 2 (r − σα) (4895)
2 σ σ
 √ 2
1 2r τ
= − ξ+ + rτ, (4896)
2 σ
and
ξ2  y  2r2 ξ2
   
2r y σ
exp log − = exp − , (4897)
σ2 x 2 x 2
Then
Z −ξ√τ −ατ ( √ 2 )
2rm ξ 2
 2
σ  y  σ2r2
  
σ 1 2r τ ξ
exp − dm = exp rτ − ξ+ − exp − .
1
σ log y
x
σ 2 2r 2 σ 2r x 2

Use this in Equation 4891 to get


x ∞
 
−m − ατ
Z
− 2α exp(m(σ + 2α))Φ √ dm (4898)
y σ1 log xy τ
Z −δ− (τ, xy ) Z −ξ√τ −ατ  
2αx 2 1
= − √ exp rm − ξ 2 dmdξ (4899)
y 2π −∞ 1 y
σ log x
σ 2
Z −δ− (τ, xy ) (  √ 2 )
ασx 1 2r τ
= − √ exp rτ − ξ+ dξ (4900)
ry 2π −∞ 2 σ
y
ασ  y  σ2r2 −1 −δ− (τ, x )
Z  2
ξ
+ √ exp − dξ (4901)
r 2π x −∞ 2
y
( √ 2 )
ασx exp(rτ ) −δ− (τ, x )
Z 
1 2r τ
= − √ exp − ξ+ dξ (4902)
ry 2π −∞ 2 σ
ασ  y  σ2r2 −1   y 
+ Φ −δ− τ, . (4903)
r x x

2r τ
Make ζ = ξ + σ so that the upper limit becomes
 y    
1 y 1
−δ− τ, → √ − log − r − σ 2 τ + 2rτ (4904)
x σ τ x 2
   
1 x 1
= √ log + r + σ 2 τ (4905)
σ τ y 2
 
x
= δ+ τ, . (4906)
y
4* verify this change of integrals.

It follows that
∞  
−m − ατ
Z
x
− √
2α exp {(σ + 2α)m} Φ dm (4907)
y 1 y
σ log x
τ
ασ  y  σ2r2 −1 
  
ασx x  y 
= − exp(rτ )Φ δ+ τ, + Φ −δ− τ, . (4908)
ry y r x x

We can then write (see Equation 4865)


     2r
x y σ2 −1   y 
v(t, x, y) = exp(−rτ ) · y · [Φ −δ− τ, − Φ −δ− τ, + Eqn. 4876
y x x
    
x x
+2 exp(rτ )Φ δ+ τ, Eqn. 4888 (4909)
y y
σ2
      
x x
− 1− exp(rτ )Φ δ+ τ, Eqn. 4908 (4910)
2r y y
σ 2  y  σ2r2 −1 
   y 
+ 1− Φ −δ− τ, ]−x (4911)
2r x x
σ2
       
x x
= 1+ xΦ δ+ τ, + exp(−rτ )yΦ −δ− τ, (4912)
2r y y
σ2  y  2r2
σ
  y 
− exp(−rτ ) xΦ −δ− τ, − x, 0 ≤ t < T, 0 < x ≤ y. (4913)
2r x x
Then we can write
σ2
          
x x x x
u t, = 1+ Φ δ+ τ, + exp(−rτ )Φ −δ− τ, (4914)
y 2r y y y
2
 1− 2r2 
σ x σ
 y  x
− exp(−rτ ) Φ −δ− τ, − . (4915)
2r y x y
x
z→ y s.t

σ2
 
u(t, z) = 1+ zΦ (δ+ (τ, z)) + exp(−rτ )Φ (−δ− (τ, z)) (4916)
2r
σ2 2r
exp(−rτ )z 1− σ2 Φ −δ− (τ, z −1 ) − z,

− 0 ≤ t < T, 0 < z ≤ 1. (4917)
2r

13.5.3 Asian Options


Definition 461 (Asian Option). The Asian option payoff depends on the time average of underlying
over some time period between initiation and expiry (possibly the whole period). This average is either
RT Pm
1
sampled continuously or discretely, corresponding to T1 0 S(t)dt, m j=1 S(tj ) where ti∈[m] partitions
[T ]. Some of the mathematics for a fixed-strike Asian Call were discussed in Example 706 - here we
repeat the central arguments and further analyze such options. In particular, assume underlying

dS(t) = rS(t)dt + σS(t)dW̃ (t) (4918)

where W̃ (t) is P̃ Brownian, then the payoff of a fixed-strike Asian call is


!+
Z T
1
V (T ) = S(t)dt − K , K ≥ 0. (4919)
T 0

Risk-neutral pricing asserts V (t) = Ẽ [exp(−rτ )V (T )|F(t)] and that exp(−rt)V (t) = Ẽ [exp(−rT )V (T )|F(t)]
(i.e. exp(−rt)V (t) is P̃ martingale). Recall that this S(t) is not itself Markov, since the Asian option

700
payoff is path dependent. In particular, we need to employ state augmentation, defining a variable
Rt
Y (t) = 0 S(u)du. Then (S(t), Y (t)) is a valid two-dimensional Markov process. V (T ) can be expressed
1
+
T Y (T ) − K and by Markov property we say there ∃v(t, x, y) s.t.
"  + #
1
v(t, S(t), Y (t)) = Ẽ exp(−rτ ) Y (T ) − K |F(t) (4920)
T
= Ẽ [exp(−rτ )V (T )|F(t)] . (4921)

Theorem 435. The Asian call price satisfies PDE


1
vt + rxvx + xvy + σ 2 x2 vxx = rv, 0 ≤ t < T, x ≥ 0, y ∈ R. (4922)
2
The boundary conditions are
y +
v(t, 0, y) = exp(−rτ ) , −K
0 ≤ t < T, y ∈ R, (4923)
T
lim v(t, x, y) = 0, 0 ≤ t < T, x ≥ 0, (4924)
y→−∞
y +
v(T, x, y) = −K , x ≥ 0, y ∈ R. (4925)
T
The proof for the BSM PDE was shown in Example 706, in particular
 
1
d(exp(−rt)v(t, S(t), Y (t)) = exp(−rt) −rv + vt + rSvx + Svy + σ 2 S 2 vxx dt + exp(−rt)σSvx dW̃ (t).
2

The boundary conditions are new. S(t) is always non-negative. If S(t) = 0, Y (t) = y for some t, then
S(u) = 0 for all values of u and Y (u) is constant on [t, T ]. In particular, Y (T ) = y and the Asian call
+
is worth Ty − K at t. Thus boundary Equation 4923 is satisfied. In the theoretical sense, there is no
RT
restriction on y being negative. For Y (t) = y, we can write Y (T ) = y + t S(u)du. We can still obtain
Y (T ) > 0 even if y < 0, and therefore we let y ∈ R. As Y (t) = y and y → −∞, Y (T ) → −∞ and the
probability that the call expires ITM goes to zero, as does the value of the call option - boundary Equation
4924 is satisfied. Equation 4925 is the boundary condition for payoff at maturity.

To work out the partial differential equation for Asian option pricing, we use a change of numeraire
technique. For payoff function
!+
Z T
1
V (T ) = S(t)dt − K , (4926)
c T −c

we would like to create a portfolio process whose time T value is given

1 T
Z
X(T ) = S(u)du − K. (4927)
c T −c

Let γ(t) denote the number of shares of asset held in our portfolio. This has no Brownian term, such
that dγ(t)dγ(t) = 0, dγ(t)dS(t) = 0. Then

d (exp(r(T − t))γ(t)S(t)) = exp(r(T − t))d(γ(t)S(t)) + r exp(r(T − t))γ(t)S(t)dt (4928)


= exp(r(T − t))γ(t)dS(t) + exp(r(T − t))S(t)dγ(t) − r exp(r(T − t))γ(t)S(t)dt.

Rearranging,

exp(r(T − t))γ(t)(dS(t) − rS(t)dt) = d (exp(r(T − t))γ(t)S(t)) − exp(r(T − t))S(t)dγ(t). (4929)

701
Since the portfolio must evolve by

dX(t) = γ(t)dS(t) + r(X(t) − γ(t)S(t))dt (4930)


= rX(t)dt + γ(t)(dS(t) − rS(t)dt), (4931)

so

d (exp(r(T − t))X(t)) = −r exp(r(T − t))X(t)dt + exp(r(T − t))dX(t) (4932)


= exp(r(T − t))γ(t)(dS(t) − rS(t)dt) using Equation 4929, (4933)
= d (exp(r(T − t))γ(t)S(t)) − exp(r(T − t))S(t)dγ(t). (4934)

Let
1

 (1 − exp(−rc)) ,
 0≤t≤T −c
γ(t) = rc (4935)
 1 (1 − exp(−r(T − t))) ,

T − c ≤ t ≤ T.
rc
Let the initial capital be
1
X(0) = (1 − exp(−rc)) S(0) − exp(−rT )K. (4936)
rc
1 1
At t = 0, rc (1 − exp(−rc)) stock is bought for rc (1 − exp(−rc)) S(0). exp(−rT )K amount of capital
is borrowed from the money market account. For 0 ≤ t ≤ T − c, the position is held and the value of
1
holdings in the risky asset is rc (1 − exp(−rc)) S(t) with money market debt exp(−r(T − t))K, s.t.

1
X(t) = (1 − exp(−rc)) S(t) − exp(−r(T − t))K, 0 ≤ t ≤ T − c. (4937)
rc
1
So X(T −c) = rc (1 − exp(−rc)) S(T −c)−exp(−rc)K. For T −c ≤ t ≤ T , dγ(t) = − 1c exp(−r(T −t))dt.
Integrating Equation 4934, we can write
Z t Z t
exp(r(T − t))X(t) = exp(rc)X(T − c) + d (exp(r(T − u))γ(u)S(u)) − exp (r(T − u)) S(u)dγ(u)
T −c T −c
1
= exp(rc) (1 − exp(−rc)) S(T − c) − K (4938)
rc
+ exp(r(T − t))γ(t)S(t) (4939)
1
− exp(rc) (1 − exp(−rc)) S(T − c) (4940)
rc
1 t
Z
+ S(u)du (4941)
c T −c
1 t
Z
= −K + exp(r(T − t))γ(t)S(t) + S(u)du. (4942)
c T −c

Then
Z t
1 1
X(t) = (1 − exp(−r(T − t))) S(t) + exp(−r(T − t)) S(u)du − exp(−r(T − t))K (4943)
rc c T −c

for T − c ≤ t ≤ T . See that we have


Z T
1
X(T ) = S(u)du − K, (4944)
c T −c
+
V (T ) = X (T ) = max {X(T ), 0} (4945)

702
and that at arbitrary time t ≤ T

V (t) = Ẽ [exp(−r(T − t))V (T )|F(t)] = Ẽ exp(−r(T − t))X + (T )|F(t) .


 
(4946)

To compute the RHS of this expectation, we employ change of numeraire. Define


X(t) exp(−rt)X(t)
Y (t) = = , (4947)
S(t) exp(−rt)S(t)
such that the portfolio value is denominated in risky asset terms. Since

d(exp(−rt)S(t)) = −r exp(−rt)S(t)dt + exp(−rt)dS(t) = σ exp(−rt)S(t)dW̃ (t), (4948)

we may write
h i
−1 −3
d (exp(−rt)S(t)) = −(exp(−rt)S(t))−2 d(exp(−rt)S(t)) + (exp(−rt)S(t)) d (exp(−rt)S(t)) d (exp(−rt)S(t))
−2 −3 2
= − (exp(−rt)S(t)) σ(exp(−rt)S(t))dW̃ (t) + (exp(−rt)S(t)) (exp(−rt)S(t)) σ 2 dt
= −σ(exp(−rt)S(t))−1 dW̃ (t) + σ 2 (exp(−rt)S(t))−1 dt.

But

d(exp(−rt)X(t)) = −r exp(−rt)X(t)dt + exp(−rt)dX(t) (4949)


= γ(t) exp(−rt)(dS(t) − rS(t)dt) Equation 4931 (4950)
= γ(t)σ exp(−rt)S(t)dW̃ (t). (4951)

By Ito Product rule, we can write

dY (t) = d (exp(−rt)X(t))(exp(−rt)S(t))−1
 
(4952)
= exp(−rt)X(t)d (exp(−rt)S(t))−1 + (exp(−rt)S(t))−1 d(exp(−rt)X(t))
 
(4953)
+d(exp(−rt)X(t))d (exp(−rt)S(t))−1
 
(4954)
−1 2 −1
= exp(−rt)X(t)(−σ(exp(−rt)S(t)) dW̃ (t) + σ (exp(−rt)S(t)) dt) (4955)
+(exp(−rt)S(t))−1 (γ(t)σ exp(−rt)S(t)dW̃ (t)) (4956)
−1 2 −1
+(−σ(exp(−rt)S(t)) dW̃ (t) + σ (exp(−rt)S(t)) dt)(γ(t)σ exp(−rt)S(t)dW̃ (t))
= −σY (t)dW̃ (t) + σ 2 Y (t)dt + σγ(t)dW̃ (t) − σ 2 γ(t)dt (4957)
= σ[γ(t) − Y (t)][dW̃ (t) − σdt]. (4958)

Although this is not martingale, define

W̃ S (t) = W̃ (t) − σt (4959)

such that

dY (t) = σ[γ(t) − Y (t)]dW̃ S (t). (4960)

By Girsanov’s theorem (see Theorem 423), the Radon Nikodym derivative process (see Definition 451)
 
1 2
Z(t) = exp σ W̃ (t) − σ t . (4961)
2

Then W̃ S (t) is P̃S (A) = A Z(T )dP̃, ∀A ∈ F Brownian and Y (t) is martingale under this measure. See
R

that (α → r in Example 435)


exp(−rt)S(t)
Z(t) = . (4962)
S(0)

703
Then Y (t) is P̃S Markov since the only source of randomness is in Y (t) and W̃ S , while γ(t) is deterministic.
Then

exp(rt)Ẽ exp(−rT )X + (T )|F(t)


 
V (t) = (4963)
"  + #
S(t) exp(−rT )X(T )
= Ẽ exp(−rT )S(T ) |F(t) (4964)
exp(−rt)S(t) exp(−rT )S(T )
S(t) 
Ẽ Z(T )Y + (T )|F(t)

= (4965)
Z(t)
= S(t)ẼS Y + (T )|F(t) .
 
Lemma 21 (4966)

By Markov property, there ∃g(t, y) s.t.

g(t, Y (t)) = ẼS Y + (T )|F(t)


 
(4967)
X(T )
and therefore g(T, Y (T )) = ẼS [Y + (T )|F(T )] = Y + (T ). Since Y (T ) = S(T ) can take on any value, then
g(T, y) = y + , y ∈ R is our boundary Equation. Since g(t, Y (t)) is martingale, we shall write
1
dg(t, Y (t)) = gt dt + gy dY (t) + gyy dY (t)dY (t) (4968)
 2 
1 2
= gt + σ (γ(t) − Y (t)) gyy dt + σ(γ(t) − Y (t))gy dW̃ S (t).
2
(4969)
2
It follows that the PDE:
1
gt (t, y) + σ 2 (γ(t) − y)2 gyy (t, y) = 0, 0 ≤ t < T, y ∈ R (4970)
2
is satisfied by g(t, y).

Theorem 436. For 0 ≤ t ≤ T , the price V (t) of the continuously sampled fixed strike Asian call with
payoff function
!+
Z T
1
V (T ) = S(t)dt − K (4971)
c T −c

is
 
X(t)
V (t) = S(t)g t, , (4972)
S(t)
where g(t, y) satisfies the PDE
1
gt (t, y) + σ 2 (γ(t) − y)2 gyy (t, y) = 0, 0 ≤ t < T, y ∈ R (4973)
2
and
1

 rc (1 − exp(−rc)) S(t) − exp(−r(T − t))K, t ∈ [0, T − c],


X(t) = (4974)
1 t
Z
1
 (1 − exp(−r(T − t))) S(t) + exp(−r(T − t)) S(u)du − exp(−r(T − t))K t ∈ [T − c, T ]


rc c T −c

The boundary conditions for g(t, y) can be stated

g(T, y) = y+ , y∈R (4975)


lim g(t, y) = 0, 0≤t≤T (4976)
y→−∞
lim [g(t, y) − y] = 0, 0 ≤ t ≤ T. (4977)
y→∞

704
The last boundary condition is intuited by the following: when Y (t) is large positive number, then
Y (T ) > 0 with probability ≈ 1. Then g(t, Y (t)) = ẼS [Y (T )+ |F(t)] ≈ ẼS [Y (T )|F(t)] and since Y (T )
is martingale, this conditional expectation is simply Y (t). The change of numeraire allows us to work
with g(t, y) rather than with v(t, x, y), reducing the problem dimensionality and making our computation
problem easier to solve.

When the Asian call payoff is based on discrete sampling rather than continuous sampling, we can use
a similar method. In particular, given sampling times 0 = t0 < t1 < · · · < tm = T and payoff function
 +
m
1 X
V (T ) =  S(tj ) − K  . (4978)
m j=1

1
Pm
Consider a portfolio process with terminal value X(T ) = m j=1 S(tj ) − K. Define
m
1 X
γ(tj ) = exp(−r(T − ti )), j = 0, 1 · · · m, (4979)
m i=j

s.t.
1
γ(tj ) − γ(tj−1 ) = − exp(−r(T − tj−1 )), j = 1, · · · m, (4980)
m
1
and γ(T ) = m. Set γ(t) = γ(tj ) for tj−1 < t ≤ tj . Then this position is held in each subinterval (tj−1 , tj )
s.t. dγ(t) = 0 inside the interval. Integrating Equation 4934 from tj−1 to tj , we can write

exp(r(T − tj ))X(tj ) − exp(r(T − tj−1 ))X(tj−1 ) (4981)


= γ(tj ) [exp(r(T − tj ))S(tj ) − exp(r(T − tj−1 ))S(tj−1 )] (4982)
 
1
= γ(tj ) exp(r(T − tj ))S(tj ) − γ(tj−1 ) − exp(−r(T − tj−1 )) exp(r(T − tj−1 ))S(tj−1 )
m
1
= γ(tj ) exp(r(T − tj ))S(tj ) − γ(tj−1 ) exp(−r(T − tj−1 ))S(tj−1 ) + S(tj−1 ). (4983)
m
Summing from j = 1 to j = k we may obtain

exp(r(T − tk ))X(tk ) − exp(rT )X(0) (4984)


k
1 X
= γ(tk ) exp(r(T − tk ))S(tk ) − γ(0) exp(rT )S(0) + S(tj−1 ) (4985)
m j=1
k−1  
1 X 1
= γ(tk ) exp(r(T − tk ))S(tk ) + S(ti ) + −γ(0) exp(rT ) + S(0). (4986)
m i=1 m

Set
 
1
X(0) = exp(−rT ) γ(0) exp(rT ) − S(0) − exp(−rT )K (4987)
m
s.t.
k−1
1 X
exp(r(T − tk ))X(tk ) = γ(tk ) exp(r(T − tk ))S(tk ) + S(ti ) − K (4988)
m i=1
k−1
1 X
X(tk ) = γ(tk )S(tk ) + exp(−r(T − tk )) S(ti ) − exp(−r(T − tk ))K. (4989)
m i=1

705
Then
m
1 X
X(T ) = X(tm ) = S(ti ) − K (4990)
m i=1

which is our option payoff. To find the portfolio process for some tk ≤ t ≤ tk+1 we may integrate
Equation 4934 (from tk to t) to obtain (recall dγ(t) = 0 in interval) and use Equation 4980, Equation
4988:

exp(r(T − t))X(t) = exp(r(T − tk ))X(tk ) + γ(tk+1 ) [exp(r(T − t))S(t) − exp(r(T − tk ))S(tk )]


k−1
1 X
= γ(tk ) exp(r(T − tk ))S(tk ) + S(ti ) − K + γ(tk+1 ) exp(r(T − t))S(t)
m i=1
 
1
− γ(tk ) − exp(−r(T − tk )) exp(r(T − tk ))S(tk )
m
k
1 X
= γ(tk+1 ) exp(r(T − t))S(t) + S(ti ) − K. (4991)
m i=1

It follows that
k
1 X
X(t) = γ(tk+1 )S(t) + exp(−r(T − t)) S(ti ) − exp(−r(T − t))K, tk ≤ t ≤ tk+1 . (4992)
m i=1

The rest of the argument follows as in the continuous arguments.

13.6 American Derivative Securities


In the European derivatives, the right to exercise the option was only available at maturity. An American
option allows the owner to exercise at any time up to, and including maturity. It follows that an American
option of same strike and maturity must be worth at least as much as the European one. In some cases,
the early-exercise feature imparts a premium to the American derivative security. In other cases, they
are worth exactly the same. A Bermudan option sits halfway, where it can be exercised on some specified
dates. Since an American option can be exercised at any time, it must be worth at least as much as
the value if it were exercised immediately - this value is called the intrinsic value of the option. The
discounted European option value was martingale under risk-neutral pricing. The discounted American
option value is supermartingale (Definition 324) - the owner might fail to exercise at the optimal stopping
time, hence there is tendency for option value to fall. To price an American option, we set up a hedge
portfolio for a short option position. Unlike the European settings, we need to be prepared for exercise
at all times. We can assume the owner exercises at his optimal exercise time. We show that even if he
did not, our hedge portfolio would work.
We define a stopping time as the random variable τ ∈ [0, ∞] associated with the exercise time of an
American option. Then this should not depend on future information. Particularly we require that for
t ≥ 0, the set

{τ = t} = {w ∈ Ω : τ (ω) = t} ∈ F(t). (4993)

We are in fact interested in the sets of type {ω ∈ Ω : T1 ≤ τ (ω) ≤ T2 }.

Definition 462 (Stopping Time). Define the random variable stopping time τ to be

{τ ≤ t} ∈ F(t) ∀t ≥ 0, τ ∈ [0, ∞]. (4994)

706
See that for t ≥ 0, by properties of σ-algebra (see Definition 265)
   c    
1 1 1 1
τ >t− = τ ≤t− ∈F t− ⊂ F(t) =⇒ τ > t − ∈ F(t) (4995)
n n n n
s.t.
  
∞ 1
{τ = t} = {τ ≤ t} ∩ ∩n=1 τ > t − ∈ F(t). (4996)
n
Recall the first passage time random variable τm (see Section 437) for adapted continuous process X(t)
s.t.

τm = min {t ≥ 0 : X(t) = m} . (4997)

We show this is valid stopping time under Definition 462. Let t ≥ 0 be known. Then we want to show
{τ ≤ t} ∈ F(t). If t = 0, then {τ ≤ t} = {τ = 0} is either Ω or ∅ corresponding to X(0) = m, X(0) 6= m
respectively. See {τ ≤ 0} ∈ F(0). If t > 0, then suppose ω ∈ Ω satisfies τ (ω) ≤ t. It follows there
∃s ≤ t, s.t X(s, ω) = m. For each positive integer n, there is some open interval containing s for which
the process X is in m − n1 , m + n1 , and some q ∈ Q in this interval satisfies q ≤ s ≤ t. In other words,


ω is in set
 
1 1
A = ∩∞
n=1 ∪0≤q≤t,q∈Q m− < X(q) < m + . (4998)
n n
It follows {τ ≤ t} ⊂ A. If ω ∈ A, then ∀n ∈ Z+ , ∃qn ≤ t s.t.
1 1
m− < X(qn , ω) < m + . (4999)
n n
See that ∃s ∈ [0, t] and subsequence {qnk }nk=1 s.t limk→∞ qnk = s. But since
1 1
m− < X(qnk , ω) < m + ∀k = 1, 2 · · · , (5000)
nk nk
let k → ∞ and by continuity of X we have X(s, ω) = m. τ (ω) ≤ t and A ⊂ {τ ≤ t}. Therefore
A = {τ ≤ t}. Since X is F(t) adapted, ∀n ∈ Z+ , q ∈ [0, t], Q. the set
 
1 1
m − < X(q) < m + (5001)
n n
belongs to F(q) ⊂ F(t). Since there are only countably many rational numbers q in the interval [0, t],
they may be arranged in a sequence and
 
1 1
Bn = ∪0≤q≤t,q∈Q m − < X(q) < m + (5002)
n n
is union of sequence of sets in F(t), and is therefore itself in F(t). Since Bn is in F(t) for any n ∈ Z+ ,
∩∞
n=1 Bn = A = {τ ≤ t} ∈ F(t).
We say that a process X(t) with value frozen at τ , the stopping time (Definition 462), is a stopped
process and denote it X(t ∧ τ ).

13.6.1 Perpetual American Put


Assume underlying SDE

dS(t) = rS(t)dt + σS(t)dW̃ (t) (5003)

where W̃ (t) is risk-neutral measure Brownian. The perpetual American put pays K − S(t) for exercise
t.

707
Definition 463. Let T be set of all stopping times. Price of the perpetual American put is

v∗ (x) = max Ẽ [exp(−rτ )(K − S(τ ))] , τ < ∞, (5004)


τ ∈T

for x = S(0). If τ = ∞ then exp(−rτ )(K − S(τ )) = 0. Since the owner is expected to choose the exercise
strategy that is optimal, the price is defined to be the maximum over T . Here, v∗ (x) is the initial capital
required for an agent to hedge a short position in the perpetual American put regardless of the exercise
strategy for τ . Since there is no maturity, there is no concept of time to maturity. Since every date is
like every other date, we may intuit that there is no dependence of the optimal exercise policy on t - we
shall be able to specify the optimal policy to be taken based on the underlying S(t) falling to some level
L∗ < K.

Theorem 437 (Laplace transform for first passage time of Brownian motion with Drift.). In Equation
3575 we saw the Laplace transform of the First Passage Time Distribution of a Brownian motion as-
suming no drift. Here we generalize to Brownian motion with drift. Let W̃ (t) be P̃ Brownian, let µ ∈ R
and m > 0. Define

X(t) = µt + W̃ (t) (5005)


τm = min{t ≥ 0 : X(t) = m}, (5006)

and τm = ∞ if X(t) does not reach m. Then


p
Ẽ exp(−λτm ) = exp(−m(−µ + µ2 + 2λ)) ∀λ > 0. (5007)

When τm = ∞, set exp(−λτm ) = 0.


p
Proof. Let σ = −µ + µ2 + 2λ so
1 p 1 p 2
σµ + σ 2 = −µ2 + µ µ2 + 2λ + −µ + µ2 + 2λ (5008)
2 2
p 1 2 p 1
= −µ2 + µ µ2 + 2λ + µ − µ µ2 + 2λ + µ2 + λ (5009)
2 2
= λ. (5010)

It follows that exp(σX(t) − λt) = exp(σµt + σ W̃ (t) − σµt − 12 σ 2 t) = exp(σ W̃ (t) − 12 σ 2 t). See that this
is P̃ martingale. By optional sampling (Result 45),
 
1 2
M (t) = exp σ W̃ (t ∧ τm ) − σ (t ∧ τm ) (5011)
2

is martingale. We may write ∀n ∈ Z+ ,

1 = M (0) = ẼM (n) (5012)


= Ẽ [exp {σX(n ∧ τm ) − λ(n ∧ τm )}] (5013)
= Ẽ [exp {σm − λτm } 1{τm ≤ n}] + Ẽ [exp {σX(n) − λn} 1{τm > n}] . (5014)

Since for i < j,

exp(σm − λτm )1{τm ≤ i} ≤ exp(σm − λτm )1{τm ≤ j} (5015)

and

lim exp(σm − λτm )1 {τm ≤ n} = exp(σm − λτm )1 {τm < ∞} (5016)


n→∞

708
almost surely, then by Monotone Convergence (Theorem 342) we say that

lim Ẽ [exp(σm − λτm )1 {τm ≤ n}] = Ẽ [exp(σm − λτm )1 {τm < ∞}] . (5017)
n→∞

Also we have

0 ≤ exp(σX(n) − λn)1{τm > n} ≤ exp(σm − λn) ≤ exp(σm) (5018)

almost surely such that

lim exp(σX(n) − λn)1{τm > n} ≤ lim exp(σm − λn) = 0. (5019)


n→∞ n→∞

Then by Dominated Convergence (Theorem 343) lim Ẽ [exp(σX(n) − λn)1{τm > n}] = 0 and taking
n→∞
Equation 5014 to limit, obtain

1 = Ẽ [exp {σm − λτm } 1 {τm < ∞}] (5020)

and the result follows.

Monotone Convergence allows us to see that

P̃ {τm < ∞} = Ẽ1{τm < ∞} = lim+ exp(−m(−µ +


p
µ2 + 2λ)) exp(λτm )) = exp(mµ − m|µ|). (5021)
λ→0

If µ ≥ 0, X(t) has nonnegative drift and P̃(τm < ∞) = 1. When µ < 0, X(t) has negative drift and
P̃(τm < ∞) = exp(−2m|µ|) < 1. There is a positive probability that the process does not reach m.
Suppose that the investor sets a level L < K as the level to exercise the put once the stock hits
L. If it turns out S(0) ≤ L she will exercise immediately, s.t. vL (S(0)) = K − S(0). Otherwise her
exercise strategy is at τL = min {t ≥ 0 : S(t) = L}. If the stock prices does not reach L, then τL = ∞.
Risk-neutral pricing asserts that the value of the put under her arbitrary exercise strategy is the value

vL (S(0)) = (K − L)Ẽ exp(−rτL ) ∀S(0) ≥ L. (5022)

Lemma 24. vL (x) is given by the formula



K − x,
 0≤x≤L
vL (x) =  x − 2r2 (5023)
σ
(K − L)
 , x ≥ L.
L
The first case is trivial. For the case when x ≥ L, the stopping time τL is set such that
   
1 2
S(t) = L = x exp σ W̃ (t) + r − σ t (5024)
2
which is when
 
1 1 2 1 x
−W̃ (t) − r − σ t = log . (5025)
σ 2 σ L

See that this is drifted Brownian motion and we apply Theorem 437 with X(t) → −W̃ (t) − σ1 r − 12 σ 2 t,

1 x
λ → r, m → σ log L . Also
 
2 1 2 2 1 4
µ + 2λ = r − rσ + σ + 2r (5026)
σ2 4
 
1 2 2 1 4
= r + rσ + σ (5027)
σ2 4
 2
1 1
= r + σ2 . (5028)
σ2 2

709
p 1
Then −µ + µ2 + 2λ = σ (r− 12 σ 2 ) + σ1 (r + 12 σ 2 ) = 2r
σ . Then substitute into Equation 5007 and we get
    2r
1 x 2r x − σ2
Ẽ exp(−rτL ) = exp − log = . (5029)
σ L σ L
So far the investor choice of L was arbitrary. Particularly, we would like the find the optimal L∗ that
gives us the optimal exercise strategy and valuation vL∗ . The option is always at least worth intrinsic
value. So we only need to concern ourselves with case when x ≥ L. We know from Lemma 24 that L∗
is the value of L that maximises
2r 2r
vL (x) = (K − L)L σ2 x− σ2 , ∀x ≥ L. (5030)
− σ2r2
= g(L)x . (5031)

Taking derivatives,
2r 2r 2r 2r + σ 2 2r2 2r 2r
g 0 (L) = −L σ2 + 2 (K − L)L σ2 −1 = − 2
L σ + 2 KL σ2 −1 (5032)
σ σ σ
2r + σ 2
 
2r 2r
= L σ2 −1 − L + K (5033)
σ2 σ2
Then
2r
L∗ = K (5034)
2r + σ 2
   2r2
2r 2r σ
g(L∗ ) = K− K K (5035)
2r + σ 2 2r + σ 2
 2   2r2
σ K 2r σ
= 2 2
K (5036)
2r + σ 2r + σ
 2r2
σ2

2r σ 2r+σ 2
= 2 2
K σ2 . (5037)
2r + σ 2r + σ
This corresponds to

K − x,
 0 ≤ x ≤ L∗ ,
vL∗ (x) =  − 2r2 (5038)
x σ
(K − L∗ )
 , x ≥ L∗ .
L∗
Taking derivatives

−1,
 0 ≤ x ≤ L∗ ,
0   − 2r2
vL ∗
(x) = 
2r x σ
(5039)
−(K − L∗ )
 , x ≥ L∗ .
σ2 x L∗
Using Equation 5034
2r 2r + σ 2
 
0 2r 2rK 2r 2r 0
vL (L∗ +) = −(K − L∗ ) =− + = − + 2 = −1 = vL (L∗ −). (5040)

σ 2 L∗ σ 2 L∗ σ2 σ 2 2r σ ∗

The derivative of vL∗ is continuous at x = L∗ , and we call this property smooth pasting. Taking further
derivatives, for case x ≥ L∗ , we have
"    − 2r2 #
δ 2r − σ2r2 −1 1 σ
−(K − L∗ ) 2
x (5041)
δx σ L∗
   − 2r2
2r −2r 1 x σ
= −(K − L∗ ) 2 2
−1 2
. (5042)
σ σ x L∗

710
Then

0,
 0 ≤ x < L∗ ,
00 − 2r2
vL ∗
(x) = 2r(2r + σ 2 )

x σ
(5043)
(K − L∗ )
 , x > L∗ .
σ 4 x2 L∗
00
See that smooth-pasting property does not apply to vL ∗
(x). For x > L∗ , (verify this)
  − 2r2
2r2 r(2r + σ 2 )

0 1 2 2 00 x σ
rvL∗ (x) − rxvL ∗
(x) − σ x v L∗ (x) = (K − L ∗ ) r + 2
− 2
= 0, (5044)
2 σ σ L∗
and for 0 ≤ x < L∗ ,
0 1 00
rvL∗ (x) − rxvL ∗
(x) − σ 2 x2 vL ∗
(x) = r(K − x) + rx = rK. (5045)
2
We say that vL∗ (x) satisfies linear complementarity conditions

v(x) ≥ (K − x)+ ∀x ≥ 0, (5046)


1
rv(x) − rxv 0 (x) − σ 2 x2 v 00 (x) ≥ 0 ∀x ≥ 0, (5047)
2
∀x ≥ 0, equality holds in either of the above two equations. (5048)
00
For undefined points in vL ∗
(L∗ ), the linear complementarity equations are held for L∗ +, L∗ − instead.
vL∗ (x) is the only bounded continuous function having a continuous derivative that satisfies these con-
ditions.

Exercise 707. We show there exists unbounded functions that satisfy the conditions not of the form
2r
given by vL∗ (x). For 0 < L < K, assume 2r+σ 2 K > L 6= L∗ (see Equation 5034), and
− σ2r2
1. Show that for A, B ∈ R, f (x) = Ax + Bx satisfies differential equation
1
rf (x) − rxf 0 (x) − σ 2 f 00 (x) = 0 ∀x ≥ 0. (5049)
2
2. Show ∃A, B s.t. f (L) = K − L, f 0 (L) = −1.

3. Show f (x) ≥ (K − x)+ for all x ≥ L with A, B obtained in part 2.

4. Define
(
K − x, 0 ≤ x ≤ L,
v(x) = (5050)
f (x), x ≥ L.
Show v(x) satisfies the linear complementarity conditions.
2r
5. To have bounded solution, we require B = 0. Show B = 0 =⇒ L = 2r+σ 2 K (the same form as
vL∗ ).

Proof. 1. Use
 
2r 2r 2Ar 2r 2r
f 0 (x) = (− 2
)Ax− σ2 −1 + B, f 00 (x) = − − 2 − 1 x− σ2 − 2. (5051)
σ σ2 σ
Substitute into the differential equation
 
2r −2Ar − −2r 1 2r 2r
rAx− σ2 + rBx − rx 2
x σ2
−1
+ B − x2 (−2Ar)(− 2 − 1)x− σ2 −2 (5052)
σ 2 σ
2
 
2Ar x − 2 −1
2r 2r 2r
= x σ − Arx2 x− σ2 −2 (5053)
σ2 σ2
= 0. (5054)

711
2. Use
2r ! −2Ar − 2r2 −1 !
f (L) = AL− σ2 + BL = K − L, f 0 (L) = L σ + B = −1. (5055)
σ2
Then
2r
AL− σ2 + BL + L = K (5056)
1 2r

B = K − L − AL− σ2 (5057)
L
−2Ar − 2r2 −1 1  − 2r

L σ + K − L − AL σ2 = −1 (5058)
σ2 L
−2Ar − 2r2 2r

2
L σ + K − L − AL− σ2 = −L (5059)
σ
−2Ar − 2r2 −2r

2
L σ − AL− σ2 = −K (5060)
σ 
2r − 2r2 − σ2r2
A L σ +L = K (5061)
σ2
  
2r 2r
A L− σ2 +1 = K (5062)
σ2
So
σ2
 
2r
A = KL σ2 (5063)
σ 2 + 2r
σ2
  
2r 2r
B = L−1 K − L − KL σ2 L− σ2 (5064)
σ 2 + 2r
2
  
σ
= L−1 K − L − K (5065)
σ 2 + 2r
K(σ 2 + 2r) − Kσ 2
= −1 (5066)
L(σ 2 + 2r)
2rK
= − 1. (5067)
L(σ 2 + 2r)

3. We want to show that


σ2
   
2r 2r 2rK
f (x) = KL σ2 2
x− σ 2 + − 1 x ≥ (K − x)+ . (5068)
σ + 2r L(σ 2 + 2r)
2r
Since 2r+σ 2 K > L =⇒ B > 0, for x ≥ K, f (x) ≥ BK > 0 = (K − x)+ . For L ≤ x < K, see that
Zeng [20]
2r
σ 2 KL σ2 − 2r2 2rKx
f (x) − (K − x)+ = 2
x σ + −K (5069)
σ + 2r L(σ 2 + 2r)
2r 2r
σ 2 KL σ2 +1 x− σ2 + 2rKx − KL(σ 2 + 2r)
= (5070)
(σ 2 + 2r)L
2r
+1
h
2
 2r
x σ2 +1 2 x σ2r2
i
2r
KL σ2 σ + 2r L − (σ + 2r)( L )
= x− σ 2 . (5071)
(σ 2 + 2r)L
2r 2r
Define g(θ) = σ 2 + 2rθ σ2 +1 − (σ 2 + 2r)θ σ2 , θ ≥ 1. Then g(1) = 0,
 
2r 2r  2r 2r 2r 2r
g 0 (θ) = 2r 2
+ 1 θ σ2 − σ 2 + 2r 2 θ σ2 −1 = 2 (σ 2 + 2r)θ σ2 −1 (θ − 1) ≥ 0. (5072)
σ σ σ
Then g(θ) ≥ 0 for all θ ≥ 1, and f (x) ≥ (K − x)+ for L ≤ x < K. Therefore f (x) ≥ (K − x)+ ,
∀x ≥ L.

712
4. We showed that v(x) ≥ (K − x)+ in part 3. When x ≥ L,
1 1
rv − rxv 0 − σ 2 x2 v 00 = rf − rxf 0 − σ 2 x2 f 00 = 0, (5073)
2 2
where the last equality holds by part 1. When 0 ≤ x ≤ L the RHS is r(K − x) + rx = rK. The last
linear complementarity equation was shown here and in part 3. limx→∞ v(x) = limx→∞ f (x) =
 − 2r2
σ
∞= 6 limx→∞ vL∗ (x) = limx→∞ (K − L∗ ) Lx∗ = 0.

5.
2rK 2rK
B=0↔ 2
− 1 = 0 =⇒ L = . (5074)
L(σ + 2r) 2r + σ 2

0
Exercise 708. Show that for function of form Equation 5023, the smooth pasting condition for vL (x)
when x = L is satisfied only by L∗ given Equation 5034.
0
Proof. It is trivial to see vL (L− ) = −1. For x ≥ L, write
 − 2r2
2r 1 σ 2r
0
vL (x) = −(K − L) 2 (x)− σ2 −1 . (5075)
σ L
Evaluating at L and for smooth pasting to hold we have
 − 2r2
2r 1 σ 2r
0
vL (L+) = − 2 (K − L) (L)− σ2 −1 = −1. (5076)
σ L
Follows that
 − 2r2
2r 1 σ 2r

2
(K − L) (L)− σ2 −1 = 1 (5077)
σ L
2r
(K − L)L−1 = 1 (5078)
σ2
2r 2r
K = (1 + 2 )L (5079)
σ2 σ
2rK
L= σ 2 +2r follows.

Theorem 438. Let S(t), τL∗ be given as discussed. Then exp(−rt)vL∗ (S(t)) is supermartingale under
P̃, and stopped process exp(−r(t ∧ τL∗ ))vL∗ (S(t ∧ τL∗ )) is martingale.

Proof. By Ito Doeblin formula (see Theorem 415) we can write


 
0 1 00
d(exp(−rt)vL∗ (S(t))) = exp(−rt) −rvL∗ (S(t))dt + vL∗ (S(t))dS(t) + vL∗ (S(t))dS(t)dS(t)
2
 
0 1 2 2 00
= exp(−rt) −rvL∗ (S(t)) + rS(t)vL∗ (S(t)) + σ S (t)vL∗ (S(t)) dt (5080)
2
0
+ exp(−rt)σS(t)vL ∗
(S(t))dW̃ (t). (5081)

By the Equations 5044 and 5045 we can express

d(exp(−rt)vL∗ (S(t))) = − exp(−rt)rK 1 {S(t) < L∗ } dt + exp(−rt)σS(t)vL


0

(S(t))dW̃ (t). (5082)

Since the drift is ≤ 0, exp(−rt)vL∗ (S(t)) is supermartingale. The process is martingale prior to τL∗ ,
such that
Z t∧τL∗
0
exp(−r(t ∧ τL∗ )vL∗ (S(t ∧ τL∗ )) = vL∗ (0) + exp(−ru)σS(u)vL ∗
(S(u))dW̃ (u). (5083)
0

713
Corollary 35.

vL∗ (x) = max Ẽ [exp(−rτ )(K − S(τ ))] , (5084)


τ ∈T

where x = S(0) and T is set of all stopping times (see Definition 462). That is, vL∗ (x) is the price of a
perpetual American put.

Proof. Here the corollary states that L∗ gives the optimal stopping time τL∗ among all stopping times.
→ Since exp(−rt)vL∗ (S(t)) is P̃ supermartingale, by optional sampling (Theorem 45),

∀τ ∈ T , vL∗ (x) = vL∗ (S(0)) ≥ Ẽ [exp(−r(t ∧ τ ))vL∗ (S(t ∧ τ ))] . (5085)

Letting t → ∞, by Dominated Convergence (Theorem 343),

vL∗ (x) ≥ Ẽ [exp(−rτ )vL∗ (S(τ ))] ≥ Ẽ [exp(−rτ )(K − S(τ )] , (5086)

since v(x) ≥ (K − x)+ for all x ≥ 0. Then vL∗ (x) ≥ maxτ ∈T Ẽ [exp(−rτ )(K − S(τ )] .
← See that

vL∗ (x) = vL∗ (S(0)) = Ẽ [exp(−r(t ∧ τL∗ ))vL∗ (S(t ∧ τL∗ ))] (5087)

Again let t → ∞ and by Dominated Convergence (Theorem 343),

vL∗ (x) = Ẽ [exp(−rτL∗ )vL∗ (S(τL∗ ))] . (5088)

Since we have the relation

exp(−rτL∗ )vL∗ (S(τL∗ )) = exp(−rτL∗ )vL∗ (L∗ ) = exp(−rτL∗ )(K − L∗ ) = exp(−rτL∗ )(K − S(τL∗ )),

vL∗ (x) = Ẽ [exp(−rτL∗ (K − S(τL∗ ))] ≤ max Ẽ [exp(−rτ )(K − S(τ ))] . (5089)
τ ∈T

The result follows.

Discounted American option prices are martingales up to the time τL∗ , after which they are super-
martingales.
0
Corollary 36. An investor with initial capital X(0) = vL∗ (S(0)) holding ∆(t) = vL ∗
(S(t)) has the
ability to consume cash at rate C(t) = rK 1 {S(t) < L∗ } while satisfying X(t) ≥ (K − S(t))+ at all times
t.

Proof. The portfolio value process has differential

dX(t) = ∆(t)dS(t) + r(X(t) − ∆(t)S(t))dt − C(t)dt. (5090)

By Ito Doeblin (Lemma 415) we can write

d(exp(−rt)X(t)) = exp(−rt) (−rX(t)dt + dX(t)) (5091)


= exp(−rt) (∆(t)dS(t) − r∆(t)S(t)dt − C(t)dt) (5092)
 
= exp(−rt) ∆(t)σS(t)dW̃ (t) − C(t)dt . (5093)

0
Using the hedging process ∆(t) = vL ∗
(S(t)) and consumption C(t) = rK 1 {S(t) < L∗ }, see that
d(exp(−rt)X(t)) = d(exp(−rt)vL∗ (S(t))) where the latter is given by Equation 5083.

714
In a complete market, any discounted price process that is supermartingale can be hedged, and may
sometimes consume. The first two linear complementarity conditions (Equation 5046, 5047) guarantee
that the price is sufficient to satisfy the put seller. However, there exists functions greater than vL∗ (x) that
satisfy the two conditions. The last condition (Equation 5048) guarantees that the price is satisfactory
for the put buyer, in that there exists an exercise strategy that allows the buyer to realize the full value
of the put. This decision can be categorized into the stopping set and continuation set respectively,
denoted

x ≥ 0 : vL∗ (x) = (K − x)+



S = (5094)
x ≥ 0 : vL∗ (x) > (K − x)+

C = (5095)

For S(0) ∈ S, the value of the option is equal intrinsic value and the owner should exercise immediately.
If S(0) ∈ C, then the option value would be worth more than if it were exercised, and the put owner
should wait until S(t) enters S, which is precisely at t = τL∗ . For V (t) = exp(−rt)v(S(t)), where V (t)
is the perpetual American put value, the following conditions hold:

V (t) ≥ (K − S(t))+ ∀t ≥ 0, (5096)


exp(−rt)V (t) is P̃ supermartingale, (5097)
∃τ∗ s.t V (0) = Ẽ exp(−rτ∗ )(K − S(τ∗ ))+ .
 
(5098)

Exercise 709. For two perpetual American puts priced risk-neutrally at v1 (x), v2 (x), with strike 0 <
K1 < K2 , show that v2 (X) satisfies the linear complementarity Equations

1. v2 (x) ≥ (K1 − x)+ ∀x ≥ 0, (5099)


1
2. rv2 (x) − rxv20 (x) − σ 2 x2 v200 (x) ≥ 0 ∀x ≥ 0, (5100)
2
but ∃x for which (strict) equality holds for neither of the linear complementarity equations.

Proof. The first linear-complimentary equation is trivial, since v2 (x) ≥ (K2 − x)+ ≥ (K1 − x)+ . The
second linear-complimentary equation holds without consideration for strike. See the relation 0 < L1,∗ <
K1 , 0 < L2,∗ < K2 and 0 < K1 < K2 . For x < K2 , v2 (x) ≥ (K2 − x)+ > (K1 − x)+ . Also, for
0 < x < L2,∗ ,
1
rv2 (x) − rxv20 (x) − σ 2 x2 v200 (x) = rK2 (5101)
2
by Equation 5045.

Exercise 710 (Perpetual American Put with Continuous Dividends). A geometric Brownian asset
paying continuous dividends at rate a has portfolio differential

dS(t) = (r − a)S(t)dt + σS(t)dW̃ (t), (5102)

where W̃ (t) is risk-neutral measure P̃ Brownian motion.

1. Find the risk-neutral discounted payoff of a strategy that exercises the put at arbitrary level L or
below L.

2. Determine the value of L that maximizes the risk-neutral expected payoff, which must be the price.

3. Show that for any asset price S(0) = x, process exp(−rt)vL∗ (S(t)) is P̃ supermartingale. Show if
x > L∗ and process is stopped at L∗ , stopped supermartingale process becomes martingale.

715
4. Show

vL∗ (x) = max Ẽ [exp(−rτ )(K − S(τ ))] . (5103)


τ ∈T

Proof. 1. For differential Equation 5102 the solution is


 
1
S(t) = S(0) exp σ W̃ (t) + (r − a − σ 2 )t . (5104)
2

See that S(t) = L when −W̃ (t) − σ1 r − a − 21 σ 2 t = σ1 log L


x

. By the Laplace transform given in
Theorem 437, we have
( " r #)
1 x 1 1 2 1 1 2 2
Ẽ exp(−rτL ) = exp − log (r − a − σ ) + (r − a − σ ) + 2r . (5105)
σ L σ 2 σ2 2
q
1
Let γ = σ 2 (r − a − 12 σ 2 ) + 1
σ
1
σ 2 (r −a− 1 2
σ2 ) + 2r such that

x  x −γ
Ẽ exp(−rτL ) = exp(−γ log )= .
L L
So we have

K − x, 0 ≤ x ≤ L,
vL (x) = x (5106)
(K − L)( )−γ , x > L.
L

2. Differentiate case two of Equation 5106 to get

δvL (x) x x −x
= −( )−γ + (K − L)(−γ)( )−γ−1 2 (5107)
δL L L L
x −γ x −γ 1
= −( ) − (K − L)(−γ)( ) (5108)
L L L
 x −γ  1
= − 1 − γ(K − L) . (5109)
L L

Set derivative to zero, which is when 1 − γ(K − L) L1 = 0, s.t. γK − γL = L → γK = L(1 + γ) →


γK γK
L= 1+γ . So we have L∗ = 1+γ .

3. By the Ito Doeblin formula (Theorem 415), write


 
0 1 2 2 00
d (exp(−rt)vL∗ (S(t))) = exp(−rt) −rvL∗ (S(t)) + vL∗ (S(t))(r − a)S(t) + σ S (t)vL∗ (S(t)) dt
2
0
+ exp(−rt)vL ∗
(S(t))σS(t)dW̃ (t). (5110)

If x > L∗ , by Equation 5106, the bracketed term in the drift becomes

0 1 00
−rvL∗ (x) + vL ∗
(x)(r − a)x + σ 2 x2 vL ∗
(x) (5111)
2
 −γ  −γ−1   −γ−2 
x x 1 2 2 x
= −r(K − L∗ ) + (K − L∗ ) −γ (−γ)(r − a)x + σ x (K − L∗ )(−γ)(−γ − 1)
L∗ L∗ 2 L−γ

 −γ  
x 1
= (K − L∗ ) −r + (−γ)(r − a) + σ 2 (γ)(γ + 1) . (5112)
L∗ 2

716
Further making the substitution u = r − a − 21 σ 2 and using definition of γ, we have Zeng [20]
 
1
− −r + (−γ)(r − a) + σ 2 (γ)(γ + 1) (5113)
2
1 1
= r − σ 2 γ 2 + γ(r − a − σ 2 ) (5114)
2 2
r ! 2 r !
1 2 u 1 u2 u 1 u2
= r− σ + + 2r + + + 2r u (5115)
2 σ2 σ σ2 σ2 σ σ2
r ! r
1 2 u2 1 u2 2u u2 u2 u u2
 
= r− σ + 2 + 2r + 3 + 2r + 2 + + 2r (5116)
2 σ4 σ σ2 σ σ2 σ σ σ2
r r
u2 u u2 1 u2 u2 u u2
 
= r− 2 − + 2r − + 2r + 2 + + 2r (5117)
2σ σ σ2 2 σ2 σ σ σ2
= 0. (5118)

On the other hand when x < L∗ , the drift has coefficient


0 1 00
−rvL∗ (x) + vL ∗
(x)(r − a)x + σ 2 x2 vL ∗
(x) (5119)
2
= −r(K − x) − 1(r − a)x = −rK + ax. (5120)

Then the differential Equation 5112 becomes

d (exp(−rt)vL∗ (S(t))) = − exp(−rt)1 {S(t) < L∗ } (rK − aS(t))dt + exp(−rt)vL


0

(S(t))σS(t)dW̃ (t).

To show this is supermartingale we need to show drift ≤ 0, which is the condition 1 {x < L∗ } (rK −
γK
ax) ≥ 0, which is satisfied iff rK − aL∗ ≥ 0. Using L∗ = 1+γ , since
s  2
1 1 1 2 1 1
γ= 2
r − a − σ + 2r + 2 (r − a − σ 2 ) ≥ 0,
σ σ 2 σ 2
then
aγK
rK − ↔ r(γ + 1) − aγ ≥ 0. (5121)
1+γ
Next, for K, σ > 0, r, a ≥ 0, assume rK − aL∗ < 0. This implies r < a since L∗ ≤ K and iff
r−a
relationship implies that r(γ + 1) − aγ < 0. Then Zeng [20], define θ = σ < 0, we can write
r
1 1 1 1
γ = (θ − σ) + (θ − σ 2 )2 + 2r, (5122)
σ 2 σ 2
and

r(γ + 1) − aγ < 0 ↔ (r − a)γ + r < 0 (5123)


" r #
1 1 1 1 2
↔ (r − a) (θ − σ) + (θ − σ) + 2r + r < 0 (5124)
σ 2 σ 2
r
1 1
↔ θ(θ − σ) + θ (θ − σ)2 + 2r + r < 0 (5125)
2 2
r
1 2 1
↔ θ (θ − σ) + 2r < −r − θ(θ − σ) since RHS < 0, (5126)
2 2
 
1 1 1
↔ θ2 (θ − σ 2 )2 + 2r > r2 + θ2 (θ − σ)2 + 2θr(θ − σ) (5127)
2 2 2
↔ 0 > r2 − θrσ (5128)
↔ 0 > r (r − θσ) , (5129)

717
but r ≥ 0 so 0 > r − θσ. But since θσ < 0, r − θσ > 0. This is contradiction. We must have
rK − aL∗ ≥ 0.

4. By part 3, exp(−rt)vL∗ (S(t)) is supermartingale and exp(−r(t ∧ τL∗ )vL∗ (S(t ∧ τL∗ )) is martingale.
We show vL∗ ≥ (K −x)+ . Since γ ≥ 0, L∗ < K, when x ≥ K > L∗ , we have vL∗ (x) > 0 = (K −x)+
and when 0 ≤ x < L∗ , we have vL∗ (x) = (K − x)+ = K − x. When L∗ ≤ x ≤ K, write Zeng [20]
δ x−γ−1
(vL∗ (x) − (K − x)) = −γ(K − L∗ ) +1 (5130)
δx L−γ

L∗−γ−1
≥ −γ(K − L∗ ) −γ + 1 (5131)
L∗
γK 1
= −γ(K − ) γK + 1 (5132)
γ + 1 γ+1
= −(γ + 1) + γ + 1 (5133)
= 0. (5134)

and (vL∗ (x) − (K − x))|x=L∗ = 0. vL∗ (x) − (K − x)+ ≥ 0. Using vL∗ ≥ (K − x)+ and part 3, apply
the proof as in Corollary 35.

13.6.2 Finite Expiration American Put


Again assume the underlying

dS(t) = rS(t)dt + σS(t)dW̃ (t) (5135)

and define the finite-expiration American put as follows:


Definition 464 (Finite-Expiration American put). Let 0 ≤ t ≤ T , x ≥ 0 be known. Assume S(t) = x.
(t)
Let Fu to be the σ-algebra generated by S(v) as v ranges over [t, u], u ∈ [t, T ]. Let Tt,T denote the set
(t) (t)
of stopping times for filtration Fu . That is, ∀u ∈ [t, T ], {τ ≤ u} ∈ Fu . The price at t of an American
put expiring at T is

v(t, x) = max Ẽ [exp(−r(τ − t))(K − S(τ ))|S(t) = x] . (5136)


τ ∈Tt,T

If τ = ∞, set exp(−rτ )(K − S(τ )) = 0.


Result 46. The finite-expiration American put price v(t, x) as given by Definition 464 satisfies the linear
complementarity equations

v(t, x) ≥ (K − x)+ ∀t ∈ [0, T ], x ≥ 0, (5137)


1 2 2
rv(t, x) − vt (t, x) − rxvx (t, x) − σ x vxx (t, x) ≥ 0 ∀t ∈ [0, T ), x ≥ 0 (5138)
2
∀t ∈ [0, T ), x ≥ 0 equality holds for either of the above equations.

Now the optimal exercise time would depend on the time to maturity. Denote the optimal exercise time
L(T − t) then the special cases limT →∞ L(T ) = L∗ (the perpetual American put case) and L(0) = K
is trivial to logicize. Unfortunately, no exact form of L(T − t) is known, but this may be determined
numerically. As in the perpetual American put case, we may divide the decision boundary of the put
owner to exercise by the two sets: stopping set and continuation set, written

(t, x) : v(t, x) = (K − x)+ ,



S = (5139)
(t, x) : v(t, x) > (K − x)+ .

C = (5140)

718
The curve x = L(T − t) is the decision boundary, belonging to S. In particular,
1
rv(t, x) − vt (t, x) − rxvx (t, x) − σ 2 x2 vxx (t, x) = 0 ∀t ∈ [0, T ), x ≥ 0 (5141)
2
when (t, x) ∈ C, and
1
rv(t, x) − vt (t, x) − rxvx (t, x) − σ 2 x2 vxx (t, x) > 0 ∀t ∈ [0, T ), x ≥ 0 (5142)
2
for (t, x) ∈ S, except when x = L(T − t), then equality holds. For (t, x) ∈ S, when 0 ≤ x < L(T − t),
v(t, x) = (K − x)+ = K − x, the RHS of Equation 5138 equals rK. L(·) decreases with increasing time
to maturity. See that

vx (t, x+) = vx (t, x−) = −1 for x = L(T − t), 0 ≤ t < T (5143)

and the smooth-pasting condition holds for vx (t, x). At t = T , this does not hold, since

L(0) = K, v(T, x) = (K − x)+ , (5144)

s.t. vx (T, x−) = −1 6= 0 = vx (T, x+) when x = L(0). vt (t, x), vxx (t, x) also turns out to be non-
continuous along x = L(T − t). The equations
1
rv(t, x) − vt (t, x) − rxvx (t, x) − σ 2 x2 vxx (t, x) = 0 x ≥ L(T − t), (5145)
2
v(t, x) = K − x, 0 ≤ x ≤ L(T − t), (5146)

smooth pasting condition (Equation 5143), terminal condition (Equation 5143), 5144) and asymptotic
condition lim v(t, x) = 0 determines a unique function for v(t, x). We can use finite difference approxi-
x→∞
mation methods to numerically compute for v(t, x), L(T − t).

Theorem 439. Let S(u), u ∈ [t, T ] denote stock price with initial condition S(t) = x and define stopping
set S = {(t, x) : v(t, x) = (K − x)+ }. Let

τL∗ = min {u ∈ [t, T ] : (u, S(u)) ∈ S} , (5147)

and set τ∗ = ∞ if (u, S(u)) does not enter S in [T ]. Then, exp(−ru)v(u, S(u)) is supermartingale under
P̃ in [t, T ] and stopped process exp(−r(u ∧ τ∗ ))v(u, S(u ∧ τ∗ )) is martingale on [t, T ].

Proof. Since vx (u, x) is continuous by smooth-pasting condition, apply Ito Doeblin formula to obtain
 
1
d (exp(−ru)v(u, S(u))) = exp(−ru) −rvdu + vu du + vx dS + vxx dSdS (5148)
2
 
1
= exp(−ru) −rv + vu + rSvx + σ 2 S 2 vxx du + exp(−ru)σSvx dW̃ (u).
2

Since the drift term can be written − exp(−ru)rK 1 {S(u) < L(T − u)}, exp(−ru)v(u, S(u)) is P̃ super-
martingale. Stopped process exp(−r(u ∧ τ∗ ))v(u ∧ τ∗ , S(u ∧ τ∗ )) is martingale for u ∈ [t, T ].

Corollary 37. An agent with initial capital X(0) = v(0, S(0)) with portfolio process ∆(u) = vx (u, S(u))
and consumption of cash at rate C(u) = rK 1 {S(u) < L(T − u)} per unit time has X(u) = v(u, S(u))
for all times from initiation to expiry or exercise. The proof follows as in the same methodology outlined
in Corollary 36.

719
The only function v(t, x) satisfying the smooth pasting condition on boundary L(T − t) must be the
form v(t, x) = maxτ ∈Tt,T Ẽ [exp(−r(τ − t))(K − S(τ ))|S(t) = x]. We show this here. For fixed t ∈ [T ],
optional sampling (Theorem 45) implies that

exp(−r(t ∧ τ ))v(t ∧ τ, S(t ∧ τ )) ≥ Ẽ [exp(−r(T ∧ τ )v(T ∧ τ, S(T ∧ τ )|F(t)] . (5149)

For τ ∈ Tt,T , since


(
τ τ <∞
t ∧ τ = t. T ∧τ = (5150)
T otherwise .
Then for τ ∈ Tt,T , exp(−rt)v(t, S(t))

≥ Ẽ [exp(−rτ )v(τ, S(τ ))1 {τ < ∞} + exp(−rT )v(T, s(T ))1 {τ = ∞} |F(t)] (5151)
≥ Ẽ [exp(−rτ )v(τ, S(τ ))|F(t)] . (5152)

Since v(t, x) ≥ (K − x)+ by the linear complementarity Equation 5137, we have

exp(−rt)v(t, S(t)) ≥ Ẽ [exp(−rτ )(K − S(τ ))|F(t)] . (5153)

Since S(t) is Markov, we can write

exp(−rt)v(t, x) ≥ Ẽ [exp(−rτ )(K − S(τ ))|S(t) = x] . (5154)

This holds for arbitrary τ ∈ Tt,T , so v(t, x) ≥ maxτ ∈Tt,T Ẽ [exp(−r(τ − t))(K − S(τ ))|S(t) = x] . On
the other hand, for τ∗ = min {u ∈ [T ] : (u, S(u)) ∈ S}, the process exp(−r(t ∧ τ∗ ))v(t ∧ τ∗ , S(t ∧ τ∗ )) is
martingale. Then substituting τ → τ∗ , equality holds for Equation 5151. If τ∗ = ∞,then (T, S(T )) ∈ C
and v(T, S(T ))1 {τ∗ = ∞} = 0. Equality holds for Equation 5152. Then equality holds for Equation
5153 since at t = τ∗ < ∞, v(t, S(t)) = K − S(t) and so v(t, x) = Ẽ [exp(−r(τ∗ − t))(K − S(τ∗ ))|S(t) = x].
Then v(t, x) ≤ maxτ ∈Tt,T Ẽ [exp(−rτ )(K − S(τ ))|S(t) = x]. Equality follows.

13.6.3 Finite Expiration American Call


We look at two cases, one with underlying paying dividends and another without. We first assume the
case without dividends and the usual dS(t) = rS(t)dt + σS(t)dW̃ (t). Here W̃ (t) is risk-neutral measure
P̃ Brownian motion.
Lemma 25. Let h(x) be convex function of x ≥ 0, satisfying h(0) = 0. Then exp(−rt)h(S(t)) of the
American derivative security that pays h(S(t)) on exercise is submartingale (see Definition 324).
Proof. for λ ∈ [1], 0 ≤ x1 ≤ x2 we have

h((1 − λ)x1 + λx2 ) ≤ (1 − λ)h(x1 ) + λh(x2 ) (5155)

by the convexity of h(x). x1 → 0, x2 = x gives h(λx) ≤ λh(x) for all x ≥ 0, 0 ≤ λ ≤ 1. Then, for
0 ≤ u ≤ t ≤ T , λ → exp(−r(t − u)) ∈ [0, 1] and therefore

Ẽ [exp(−r(t − u))h(S(t))|F(u)] ≥ Ẽ [h (exp(−r(t − u))S(t)) |F(u)] . (5156)

By Jensen’s inequality (Theorem 341), write


 
Ẽ [h (exp(−r(t − u))S(t)) |F(u)] ≥ h Ẽ [exp(−r(t − u))S(t)|F(u)] (5157)
 
= h exp(ru)Ẽ [exp(−rt)S(t)|F(u)] (5158)
= h (exp(ru) exp(−ru)S(u)) (5159)
= h (S(u)) . (5160)

720
Then Ẽ [exp(−r(t − u))h(S(t))|F(u)] ≥ h(S(u)), which is to say

Ẽ [exp(−rt)h(S(t))|F(u)] ≥ exp(−ru)h(S(u)). (5161)

exp(−rt)h(S(t)) is submartingale.

Theorem 440 (European vs American Derivative Securities with Nonnegative, Convex Payoffs). Let
h(x) be nonnegative, convex function of x ≥ 0, satisfying h(0) = 0. Then the price of American derivative
expiring at T with intrinsic value h(S(t)) is the same as the price of European derivative paying h(S(T ))
at maturity.

Proof. Since

Ẽ [exp(−r(t − u))h(S(t))|F(u)] ≥ h(S(u)), then (5162)


Ẽ [exp(−r(T − u))h(S(T ))|F(u)] ≥ h(S(u)) (5163)

for t ∈ [T ]. The European derivative price dominates the intrinsic value of the American derivative
5
security. The early exercise option is worthless.

Corollary 38. The price of an American call on a no-dividend underlying asset is priced the same as
a European call with same maturity and strike.

Proof. Apply Theorem 440 with h(x) = (x − K)+ .

Exercise 711. Use the Optional Sampling Result 45 and Lemma 25 to show that the American call price
without dividends is the same as the European call price. That is,

Ẽ exp(−rT )(S(T ) − K)+ = max Ẽ exp(−rτ )(S(τ ) − K)+ .


   
(5164)
τ ∈T0,T

Proof. By convexity property, apply Lemma 25. Use the submartingale property on Optional Sampling
Result 45 to get

∀τ ∈ T0,T , Ẽ exp(−rT )(S(T ) − K)+ ≥ Ẽ exp(−r(T ∧ τ ))(S(T ∧ τ ) − K)+


   
(5165)
≥ Ẽ exp(−rτ )(S(τ ) − K)+ .
 
(5166)

Since T ∈ T0,T , then

Ẽ exp(−rT )(S(T ) − K)+ ≤ max Ẽ exp(−rτ )(S(τ ) − K)+ .


   
(5167)
τ ∈T0,T

Equality follows.

Lemma 26. Let c(x) be function defined for all x ≥ 0. If for every 0 ≤ x1 ≤ x2 , λ ∈ [0, 1], if
c((1 − λ)x1 + λx2 ) ≤ (1 − λ1 )c(x1 ) + λ2 c(x2 ), then we say that c(x) is convex. For two convex functions
f (x), g(x), show that h(x) = max {f (x), g(x)} is convex.

Proof. See from

h((1 − λ)x1 + λx2 ) = max {f ((1 − λ)x1 + λx2 ), g((1 − λ)x1 + λx2 )} (5168)
≤ max {(1 − λ)f (x1 ) + λf (x2 ), (1 − λ)g(x1 ) + λg(x2 )} (5169)
≤ (1 − λ) max {f (x1 ), g(x1 )} + λ max {f (x2 ), g(x2 )} (5170)
= (1 − λ)h(x1 ) + λh(x2 ) (5171)
5a detail beginners get caught up with is that the act of exercise is not synonymous with selling the option. It is not
selling the option that is sub-optimal - it is the exercise of it.

721
that h(·) is convex.

The discounted process exp(−rt)(S(t) − K)+ is submartingale under P̃, making early exercise sub-
optimal. This effect is two part. First, since exp(−rt)S(t) is P̃ martingale, exp(−rt)(S(t) − K) must
itself be submartingale as the strike discounting gets ‘lighter’ with time. The second is the convexity
imparted by the max operator. Both effects reinforce the tendency for the process to increase. See that
this is more complicated for the put option - exp(−rt)(K − S(t))) is a supermartingale, but the max
operator imparts a competing effect.
We now consider an underlying paying lump sum dividends. The lump sum dividend model was
discussed in Section 13.3.7.4 for the European call. We analyze similar settings for the American type.
Recalling the critical arguments, we have

dS(t) = rS(t)dt + σS(t)dW̃ (t). t ∈ [tk , tj+1 ), j = 0, · · · n. (5172)

between dividend dates,


  
1 2
S(tj+1 −) = S(tj ) exp σ(W̃ (tj+1 ) − W̃ (tj )) + r − σ (tj+1 − tj ) . (5173)
2
After dividend payments we are given the relation
   
S(tj+1 ) 1 2
= (1 − aj+1 ) exp σ(W̃ (tj+1 ) − W̃ (tj )) + r − σ (tj+1 − tj ) (5174)
S(tj ) 2
and
 
 
1 2
S(T ) = S(tn ) exp σ(W̃ (T ) − W̃ (tn )) + r − σ (T − tn ) . (5175)
2
We want to show that it is sub-optimal to do early exercise, except possibly immediately before a dividend
payment. Between dividend dates, the option satisfies the BSM PDE. At dividend dates, the call price
is the maximum of the call’s intrinsic value and price of the call after the dividend is paid and stock
price reduced. For tn ≤ t ≤ T , exp(−rt)S(t) is P̃ martingale s.t. exp(−rt)(S(t) − K)+ is submartingale
by Lemma 25. This is expressed

Ẽ exp(−rT )(S(T ) − K)+ |F(t) ≥ exp(−rt)(S(t) − K)+ ,


 
tn ≤ t ≤ T. (5176)

For the European call, at t in [tn , T ], the price cn (t, S(t)) = Ẽ [exp(−r(T − t))(S(T ) − K)+ |F(t)] >
(S(t) − K)+ . It is at least worth intrinsic value, which is the value of the an American call at immediate
exercise. The early exercise feature is worthless. The prices agree on [tn , T ] for the European and
American calls, s.t.

cn (t, x) = xΦ (d+ (T − t, x)) − K exp(−r(T − t))Φ (d− (T − t, x)) . (5177)

It is assumed the reader is familiar with this formulation introduced in the European call pricing. The
form d± (T − t, x) is given by Equation 3851. When t = tn , the call price is expressed cn (tn , x)
"      + #
1 2
Ẽ exp(−r(T − tn )) x exp σ(W̃ (T ) − W̃ (tn )) + r − σ (T − tn ) − K (5178)
2

This satisfies the BSM PDE Equation 3848, and terminal condition cn (T, x) = (x − K)+ . We show
cn (tn , x) is convex in x. We want to show that for 0 ≤ x1 ≤ x2 , 0 ≤ λ ≤ 1, we have

cn (tn , (1 − λ)x1 + λx2 ) ≤ (1 − λ)cn (tn , x1 ) + λcn (tn , x2 ). (5179)

722
See that for α ∈ R, the function (αx − K)+ is convex in x, s.t.
     +
1
x exp σ(W̃ (T ) − W̃ (tn )) + r − σ 2 (T − tn ) − K (5180)
2
is convex in x. Apply this to

cn (tn , (1 − λ)x1 + λx2 ) (5181)


"      + #
1 2
= Ẽ exp(−r(T − tn )) ((1 − λ)x1 + λx2 ) exp σ(W̃ (T ) − W̃ (tn )) + r − σ (T − tn ) − K
2

and use convexity of Equation 5180 to show that this is ≤ (1 − λ)cn (tn , x1 ) + λcn (tn , x2 ). Immediately
before dividend payments, the owner can exercise to receive S(tn −) − K or remain an owner of a option
valued at cn (tn , (1 − an )S(tn −)). Clearly he would prefer to choose the better strategy and the call
option should be valued at

hn (x) = max {x − K, cn (tn , (1 − an )x)} (5182)

for x ≥ 0. We show that hn (x) satisfies Lemma 25. Since cn (tn , (1 − an )x) ≥ 0 for all x ≥ 0, hn (x) ≥ 0.
hn (0) = 0 since cn (tn , (1 − an )0) = 0. Since cn (tn , x) is convex in x,

cn (tn , (1 − an )((1 − λ)x1 + λx2 )) ≤ (1 − λ)cn (tn , (1 − an )x1 ) + λcn (tn , (1 − an )x2 ). (5183)

cn (tn , (1 − an )x) is convex in x. Maximum over two convex functions are convex (see Lemma 26). hn (x)
is convex. At t ∈ [tn−1 , tn ), if she exercise at u ∈ [t, tn ) she receives S(u) − K. Else at tn immediately
before dividends her call is worth hn (S(tn −)). For t ∈ [tn−1 , tn ) American call expiring at T has the
same price as the European call expiring immediately before the dividend payment at tn and paying
hn (S(tn −)) on expiry. Lemma 25 asserts that exp(−rt)hn (S(t)) is submartingale in between dividend
dates. Then

Ẽ [exp(−r(u − t))hn (S(u))|F(t)] ≥ hn (S(t)) tn−1 ≤ t ≤ u < tn . (5184)

Let u → t−
n s.t.

Ẽ [exp(−r(tn − t))hn (S(tn −))|F(t)] ≥ hn (S(t)) ≥ S(t) − K. (5185)

The European call expiring at tn immediately before dividends and paying hn (S(tn −)) on expiring is
at least worth the intrinsic value. The option to exercise before tn is worthless. By Markov property of
S(t), ∃cn−1 (t, x) s.t. the European call value is

cn−1 (t, S(t)) = Ẽ [exp(−r(tn − t))hn (S(tn −))|F(t)] , t ∈ [tn−1 , tn ) (5186)

For the case when t = tn−1 , we have


     
1
cn−1 (tn−1 , x) = Ẽ exp(−r(tn − tn−1 ))hn x exp σ(W̃ (tn ) − W̃ (tn−1 )) + r − σ 2 (tn − tn−1 ) (5187)
.
2
cn−1 (t, x) also satisfies the BSM PDE Equation 3848 and terminal condition

cn−1 (tn , x) = hn (tn , x), x ≥ 0. (5188)

We can continue by writing

hn−1 (x) = max {x − K, cn−1 (tn−1 (1 − an−1 )x)} , x ≥ 0. (5189)

723
We can show that hn−1 (x) satisfies conditions for Lemma 25. Using the recurrence relation, solve
recursively for j = n, · · · 0 satisfying the PDE

δ δ 1 δ2
cj−1 (t, x) + rx cj−1 (t, x) + σ 2 x2 2 cj−1 (t, x) = rcj−1 (t, x), (5190)
δt δx 2 δx
cj−1 (tj , x) = hj (x), x ≥ 0. (5191)

for tj−1 ≤ t < tj and x ≥ 0. The base case is cn (tn x), hn (x) as discussed above. For tj−1 ≤ t < tj , if
S(t) = x, then cj−1 (t, x) is the value of the American call price. In the interval [tj−1 , tj ), the American
call price is the same as the European call price expiring at tj with a payoff hj (S(tj −)). The optimal
exercise time is immediately prior to the dividend payment at smallest time tj where S(tj −) − K exceeds
cj (tj , (1 − aj )S(tj −)). Otherwise the optimal exercise is at T if S(T ) > K else ∞.

13.7 Change of Numeraire


Change of numeraire was seen in the pricing of exotic options (see Equation 4947), to simplify the
analysis of our pricing model. A numeraire is the unit of account in which we denominate other assets.
One example is the country currency (domestic or foreign). Another is the bank/money market account.
Another would be the zero coupon bond (see Definition 455).
Assume a multidimensional market model (see Section 13.3.6.1). We have a d-dimensional Brownian
(see Definition 445) W (t) = (Wi (t))i∈[d] on probability space (Ω, F, P), where F(t) is the multiple
Brownian filtration (see Definition 446). We have m stocks with differential equation given by Equation
4264 repeated here:
d
X
dSi (t) = αi (t)Si (t)dt + Si (t) σij (t)dWj (t) i ∈ [m]. (5192)
j=1

o process R(t) for t ∈ [T ]. Then the bank account has price per share at t
Assume an adaptedninterest rate
Rt
worth M (t) = exp 0 R(u)du , and the discount process (see Definition 4619) is the reciprocal D(t) =
n R o
t
exp − 0 R(u)du . Assume a unique risk-neutral measure P̃, s.t. ∃ Θ(t) = (Θi (t))i∈[d] satisfying market
price of risk-equations (see Equation 4282, Theorem 426). The multidimensional Girsanov Theorem 424
gives us a way to construct the risk-neutral measure P̃, where
Z t
W̃j (t) = Wj (t) + Θj (u)du, j ∈ [d] (5193)
0

is a P̃ Brownian and W̃j (t) ⊥ W̃k (t) if j 6= k. The market is complete; equivalently (by Theorem
426) every derivative has a hedge using the underlying primary assets and money market. Discounted
stock price process and discounted portfolio process are P̃ martingales (see Lemma 22). If we were to
Si (t)
denominate asset Si in money market terms, then M (t) = D(t)Si (t) - asset i is worth D(t)Si (t) shares
of money market account. The asset value denominated in shares of money market is P̃ martingale and
we say that P̃ is risk neutral for the money market account numeraire. When changing the numeraire,
we need to change the measure to maintain the risk-neutrality.

Theorem 441 (Stochastic Representation of Assets, Shreve [19]). Let N (t) > 0 be price process for
non-dividend paying asset (primary or derivative) in the multidimensional market model (see Section
13.3.6.1). Then,

∃ν(t) = (νi (t))i∈[d] (5194)

724
s.t.

(i) dN (t) = R(t)N (t)dt + N (t)ν(t) · dW̃ (t). (5195)

Equation 5195 is equivalent to following statements:

(ii) d (D(t)N (t))= D(t)N (t)ν(t) · dW̃ (t), (5196)


Z t Z t 
1
(iii) D(t)N (t) = N (0) exp ν(u) · dW̃ (u) − kv(u)k22 du , (5197)
0 2 0
Z t Z t  
1
(iv) N (t) = N (0) exp ν(u) · dW̃ (u) + R(u) − kν(u)k22 du . (5198)
0 0 2
Every asset has mean return R(t) under P̃. Realized risk-neutral return is solely dependent on volatility
vector processes.
Proof. Risk-neutral pricing formula (see Equation 4193) asserts D(t)N (t) is P̃ martingale. Martingale
Representation Theorem (Result 43) asserts that

d (D(t)N (t)) = Γ̃(t) · dW̃ (t), (5199)

for some Γ̃(t) = (Γ̃i (t))i∈[d] . Since N (t) is strictly positive, let
Γ̃j (t)
vj (t) = ∀j ∈ [d]. (5200)
D(t)N (t)
Then we have

d(D(t)N (t)) = D(t)N (t)ν(t) · dW̃ (t), (5201)

which is (ii). Let


Z t
1
X(t) =ν(u) · dW̃ (u) − kν(u)k22 du, s.t. (5202)
0 2
1
dX(t) = ν(t) · dW̃ (t) − kν(t)k22 dt (5203)
2
d d
X 1X 2
= νj dW̃j (t) − ν (t)dt. (5204)
j=1
2 j=1 j
Pd
It follows that dX(t)dX(t) = j=1 νj2 (t)dt = kν(t)k22 dt. Denote f (x) = N (0) exp(x), then by Ito Doeblin
formula (Theorem 415) we have
1
df (X(t)) = f 0 (X(t))dX(t) + f 00 (X(t))dX(t)dX(t) (5205)
2
= f (X(t))ν(t) · dW̃ (t) (5206)
= N (0) exp(X(t))ν(t) · dW̃ (t) (5207)
= D(t)N (t)ν(t) · dW̃ (t). (5208)

RHS of Equation 5197 is f (X(t)). Initial conditions agree. So (ii) implies (iii). By definition of D(t),
Rt Rt
(iii) implies (iv). By Ito Doeblin formula, writing Y (t) = 0 ν(u) · dW̃ (u) + 0 R(u) − 21 kν(u)k22 du with


N (t) = N (0) exp(Y (t)) we have


1
dN (t) = N (t)dY (t) + N (t)dY (t)dY (t) (5209)
 2 
1 1
= N (t) ν(t)dW̃ (t) + R(t)dt − kν(t)k dt + N (t)kν(t)k2 dt
2
(5210)
2 2
= N (t)R(t)dt + N (t)ν(t) · dW̃ (t). (5211)

So (iv) implies (i).

725
Equation 5197 states
 Z t
1 t
Z 
D(t)N (t)
= exp − −ν(u) · dW̃ (u) − kv(u)k22 du , (5212)
N (0) 0 2 0

and the Multidimensional Girsanov Theorem 424 asserts that the Brownian
Z t
W̃j (t)(N ) = − νj (u)du + W̃j (t), j ∈ [d] (5213)
0

and the multiple Brownian W̃ (t)(N ) = ((W̃i (t)(N ) )i∈[d] ) is a d-dimensional Brownian (see Definition 445)
under P̃(N ) (A), where
Z
1
P̃(N ) (A) = D(T )N (T )dP̃, ∀A ∈ F. (5214)
N (0) A

D(T )N (T )
Here the terms Z(T ) → N (0) , Θj (t) → −νj (t). We have the relation (see Lemma 21)

1
Ẽ(N ) X = Ẽ [XD(T )N (T )] . (5215)
N (0)

Specifically, we are using the Radon Nikodym derivative process (see Definition 451)
 
D(t)N (t) D(T )N (T )
= Ẽ |F(t) , t ∈ [T ]. (5216)
N (0) N (0)

Using the properties given (see Lemma 21), we have the relation (for 0 ≤ s ≤ t ≤ T , F(t)-measurable
random variable Y )
1
Ẽ(Y ) [Y |F(s)] = Ẽ[Y D(t)N (t)|F(s)]. (5217)
D(s)N (s)

Theorem 442 (Change of Risk-Neutral Measure). Let S(t), N (t) be two assets with prices denominated
in the same currency, and denote σ(t) = (σi (t))i∈[d] , ν(t) = (νi (t))i∈[d] as their respective volatility
processes such that

d (D(t)S(t)) = D(t)S(t)σ(t) · dW̃ (t), (5218)


d (D(t)N (t)) = D(t)N (t)ν(t) · dW̃ (t). (5219)

S(t)
Taking N (t) as numeraire, S(t) in N (t) terms is denoted S (N ) (t) = N (t) . Then

dS (N ) (t) = S (N ) (t) [σ(t) − ν(t)] · dW̃ (N ) (t). (5220)

That is, S (N ) (t) is P̃(N ) martingale and the volatility vectors subtract. dN (N ) (t) = 0, since N (N ) (t) =
N (t) 6
N (t) = 1.

Proof. Since
Z t  Z t
1
D(t)S(t) = S(0) exp kσ(u)k2 du ,
σ(u) · dW̃ (u) − (5221)
0 0 2
Z t
1 t
Z 
2
D(t)N (t) = N (0) exp ν(u) · dW̃ (u) − kν(u)k du . (5222)
0 2 0
6A M (t)
more general result is stated. For M1 (t), M2 (t) martingales under P, If M2 (0) = 1, M2 (t) > 0, then M1 (t) is P(M2 )
2
martingale, where P(M2 ) (A) = A M2 (T )dP for all A ∈ F corresponding to the filtration the measure P is associated with.
R

See Exercise 712

726
So
Z t Z t 
(N ) S(0) 1 2 2
S (t) = exp (σ(u) − ν(u)) · dW̃ (u) − (kσ(u)k − kν(u)k )du . (5223)
N (0) 0 2 0

Let
Z t Z t
1
X(t) = (σ(u) − ν(u)) · dW̃ (u) − (kσ(u)k2 − kν(u)k2 )du. (5224)
0 2 0

Take differential
1
dX(t) = (σ(t) − ν(t)) · dW̃ (t) − (kσ(t)k2 − kν(t)k2 )dt (5225)
2
d d
X 1X 2
= (σj (t) − νj (t))dW̃j (t) − (σj (t) − νj2 (t))dt. (5226)
j
2 j
d
X
d[X, X](t) = (σj (t) − νj (t))2 dt (5227)
j
d
X
= (σj2 (t) − 2σj (t)νj (t) + νj2 (t))dt (5228)
j

= kσ(t)k2 dt − 2σ(t) · ν(t)dt + kν(t)k2 dt. (5229)


S(0)
Write f (x) = N (0) exp(x) such that by the Ito Doeblin formula (see Theorem 415), we have

dS (N ) (t) = df (X(t)) (5230)


1
= f 0 (X)dX + f 00 (X)dXdX (5231)
 2 
1 1 1 1
= S (N ) (σ − ν) · dW̃ − kσk2 dt + kνk2 dt + kσk2 dt + kvk2 dt − σ · νdt (5232)
2 2 2 2
h i
(N )
= S (σ − ν) · dW̃ − v · (σ − ν)dt since v · v = kvk2 (5233)
= S (N ) (σ − ν) · (−νdt + dW̃ ) (5234)
= S (N ) (σ − ν) · dW̃ (N ) . (5235)

So S (N ) (t) is P̃(N ) martingale.

N (t) may be written (as we did in the multidimensional market model, Section 13.3.6.1) to be

= R(t)N (t)dt + kν(t)kN (t)dB (N ) (t),


dN (t) (5236)
Z tXd
(N ) νj (u)
B (t) = dW̃j (t). (5237)
0 j=1 kν(u)k

B (N ) (t) is one-dimensional Brownian (see Levy’s Theorem 419). N (t) has volatility kν(t)k. N (t) has
volatility vector ν(t). S (N ) (t) has volatility kσ(t) − ν(t)k. In the special case when money market
1
account is numeraire, N (t) = M (t) = D(t) and d(D(t)N (t)) = 0. The volatility vector for money market
is ν(t) = 0. Discounting/denominating asset prices by the money market account numeraire does not
affect the volatility vector. Volatility vector for asset S (N ) (t) denominated in money market account
units is the same as the volatility vector of asset denominated in currency units.

Exercise 712. Show that for martingales M1 (t), M2 (t) under P, given initial conditions M2 (0) =
M1 (t)
1, M2 (t) > 0, that M2 (t) is a P(M2 ) martingale, where
Z
P(M2 ) (A) = M2 (T )dP. (5238)
A

727
Specifically, let S(t), N(t) be asset prices and N(t) > 0. For the risk-neutral measure P̃, D(t)S(t), D(t)N(t) are martingales. Show that S^(N)(t) = S(t)/N(t) is a P̃^(N) martingale, where P̃^(N) is given by Equation 5214.

Proof. For any t ∈ [T], by properties of the Radon-Nikodym derivative process (Lemma 21), we may write

E^(M2)[ M1(T)/M2(T) | F(t) ] = E[ M2(T)M1(T)/(M2(t)M2(T)) | F(t) ] = E[M1(T) | F(t)]/M2(t) = M1(t)/M2(t).    (5239)

The martingale property is done. Let M1(t) = D(t)S(t) and M2(t) = D(t)N(t)/N(0). Then P̃^(N) → P^(M2) and M1(t)/M2(t) = N(0) S(t)/N(t) is a P̃^(N) martingale. It follows that S^(N)(t) = S(t)/N(t) is a P̃^(N) martingale.

That is, when we change the units of accounting, the change of risk-neutral measure may be obtained by using the Radon-Nikodym derivative process (see Definition 451), which is precisely the numeraire itself, discounted and normalized (by the initial condition) so that it is a martingale with expectation one.

Exercise 713 (Portfolio Change of Numeraire). For two asset prices S(t), N(t) given by

S(t) = S(0) exp( σW̃(t) + (r − σ²/2)t ),    (5240)
N(t) = N(0) exp( νW̃(t) + (r − ν²/2)t ),    (5241)

where W̃(t) is risk-neutral measure P̃ Brownian and σ ∈ R+, ν ∈ R+, r ∈ R. Let the money market be M(t) = exp(rt) per share. Taking N(t) as numeraire, we denote

Ŝ(t) = S(t)/N(t),    (5242)
M̂(t) = M(t)/N(t).    (5243)

By the change of risk-neutral measure Theorem 442, we have

dŜ(t) = (σ − ν)Ŝ(t)dŴ(t),    (5244)
Ŵ(t) = W̃(t) − νt.    (5245)

A self-financing portfolio process X(t) holding ∆(t) shares of the asset has differential

dX(t) = ∆(t)dS(t) + r(X(t) − ∆(t)S(t))dt.    (5246)

Define

Γ(t) = (X(t) − ∆(t)S(t))/M(t)    (5247)

to be the number of shares of the money market account held by the portfolio. Then the differential becomes

dX(t) = ∆(t)dS(t) + rΓ(t)M(t)dt = ∆(t)dS(t) + Γ(t)dM(t).    (5248)

By definition of Γ(t), X(t) = ∆(t)S(t) + Γ(t)M(t). The portfolio denominated in the numeraire is

X̂(t) = X(t)/N(t),    (5249)

s.t.

X̂(t) = ∆(t)Ŝ(t) + Γ(t)M̂(t).    (5250)

Compute the differential of 1/N(t). Compute the differential of M̂(t) and show that it is a P̂ martingale. Next show that

dX̂(t) = ∆(t)dŜ(t) + Γ(t)dM̂(t).    (5251)

Proof. First see that

1/N(t) = (1/N(0)) exp( −νW̃(t) − (r − ν²/2)t ).    (5252)

Define ψ(t) = −νW̃(t) − (r − ν²/2)t, such that dψ(t) = −νdW̃(t) − (r − ν²/2)dt, and by the Ito Doeblin formula:

d(1/N(t)) = (1/N(t))dψ(t) + (1/(2N(t)))ν²dt = (1/N(t))( −νdW̃(t) − rdt + ν²dt )    (5253)
 = (1/N(t))( −ν(−νdt + dW̃(t)) − rdt ) = (1/N(t))( −νdŴ(t) − rdt ).

Next, compute (by the Ito product rule, Lemma 30)

dM̂(t) = M(t)d(1/N(t)) + (1/N(t))dM(t) = M(t)(1/N(t))( −νdŴ(t) − rdt ) + (1/N(t))(rM(t)dt)
 = −νM̂(t)dŴ(t).    (5254)

Finally

dX̂(t) = X(t)d(1/N(t)) + (1/N(t))dX(t) + d(1/N(t))dX(t)    (5255)
 = (∆(t)S(t) + Γ(t)M(t)) d(1/N(t)) + (1/N(t)) (∆(t)dS(t) + Γ(t)dM(t))    (5256)
 + d(1/N(t)) (∆(t)dS(t) + Γ(t)dM(t))    (5257)
 = ∆(t) [ S(t)d(1/N(t)) + (1/N(t))dS(t) + d(1/N(t))dS(t) ]    (5258)
 + Γ(t) [ M(t)d(1/N(t)) + (1/N(t))dM(t) + d(1/N(t))dM(t) ]    (5259)
 = ∆(t)dŜ(t) + Γ(t)dM̂(t).    (5260)

Exercise 714 (Change of numeraire, change in volatility). Let S(t), N(t) be two asset prices with differentials

dS(t) = rS(t)dt + σS(t)dW̃1(t),    (5261)
dN(t) = rN(t)dt + νN(t)dW̃3(t).    (5262)

Here, W̃1(t), W̃3(t) are risk-neutral measure P̃ Brownians, and r, σ, ν are assumed constant. Assume constant correlation dW̃1(t)dW̃3(t) = ρdt. Then

1. Show that S^(N)(t) = S(t)/N(t) has volatility γ = √(σ² + ν² − 2ρσν). That is, there exists a Brownian W̃4(t) under P̃ and some κ such that

dS^(N)(t)/S^(N)(t) = κdt + γdW̃4(t).    (5263)

2. Show how W̃2(t) may be constructed to be independent of W̃1(t), such that

dN(t) = rN(t)dt + νN(t)[ ρdW̃1(t) + √(1 − ρ²)dW̃2(t) ].    (5264)

3. Using Theorem 442, find the volatility vector v = (v1, v2) of S^(N)(t), s.t.

dS^(N)(t) = S^(N)(t)[ v1 dW̃1^(N)(t) + v2 dW̃2^(N)(t) ].    (5265)

Show that in fact we have

√(v1² + v2²) = √(σ² + ν² − 2ρσν).    (5266)

Proof. 1. In the same way as we did in Equation 5253, we may obtain

d(1/N(t)) = (1/N(t))[ −νdW̃3(t) − (r − ν²)dt ].    (5267)

By the Ito product rule, write dS^(N)(t)

 = (1/N(t))dS(t) + S(t)d(1/N(t)) + dS(t)d(1/N(t))    (5268)
 = (1/N(t))(rS(t)dt + σS(t)dW̃1(t)) + S(t)(1/N(t))[ −νdW̃3(t) − (r − ν²)dt ]    (5269)
 + (rS(t)dt + σS(t)dW̃1(t))(1/N(t))[ −νdW̃3(t) − (r − ν²)dt ]    (5270)
 = S^(N)(t)[ rdt + σdW̃1(t) ] + S^(N)(t)[ −νdW̃3(t) − (r − ν²)dt ] − σνρS^(N)(t)dt    (5271)
 = S^(N)(t)(ν² − σνρ)dt + S^(N)(t)(σdW̃1(t) − νdW̃3(t)).    (5272)

Define γ = √(σ² + ν² − 2ρσν) and

γW̃4(t) = σW̃1(t) − νW̃3(t).    (5273)

See that

(dW̃4(t))² = ( (σ/γ)dW̃1(t) − (ν/γ)dW̃3(t) )²    (5274)
 = (σ²/γ²)dt + (ν²/γ²)dt − (2σνρ/γ²)dt    (5275)
 = ((σ² + ν² − 2σνρ)/γ²)dt    (5276)
 = (γ²/γ²)dt = dt.    (5277)

W̃4(t) is a martingale with quadratic variation t. The Brownian property follows by Levy's Theorem 419.

2. See Exercise 672. The appropriate Brownian is

W̃2(t) = −(ρ/√(1 − ρ²))W̃1(t) + (1/√(1 − ρ²))W̃3(t).    (5278)

3. Since (W̃1, W̃2) is a P̃ Brownian, and

dS(t) = rS(t)dt + σS(t)dW̃1(t),    (5279)
dN(t) = rN(t)dt + νρN(t)dW̃1(t) + ν√(1 − ρ²)N(t)dW̃2(t),    (5280)

Theorem 442 asserts that under the change of numeraire the volatility vectors subtract, such that the volatility vector for S^(N) is given by

(v1, v2) = (σ − νρ, −ν√(1 − ρ²)).    (5281)

The volatility is ‖v‖ = √((σ − νρ)² + ν²(1 − ρ²)) = √(σ² + ν²ρ² − 2σνρ + ν²(1 − ρ²)) = √(σ² + ν² − 2νσρ).

13.7.1 Foreign, Domestic Risk-Neutral Measures


Let W (t) = (W1 (t), W2 (t)) be a two-dimensional Brownian (see Definition 445) on probability space
(Ω, F, P). A stock price process in domestic currency satisfies differential equation

dS(t) = α(t)S(t)dt + σ1 (t)S(t)dW1 (t). (5282)

Let the domestic, adapted interest rate be R(t), such that the domestic money market account and the domestic discount process are respectively

M(t) = exp( ∫0^t R(u) du ),    D(t) = exp( −∫0^t R(u) du ).    (5283)

Let the foreign, adapted interest rate be R^f(t), such that the foreign money market account and the foreign discount process are respectively

M^f(t) = exp( ∫0^t R^f(u) du ),    D^f(t) = exp( −∫0^t R^f(u) du ).    (5284)

Let ψ(t) = ∫0^t R^f(u) du s.t. dψ(t) = R^f(t)dt, so

dM^f(t) = M^f(t)dψ(t) + (1/2)(0) = R^f(t)M^f(t)dt.    (5285)

Let the exchange rate Q(t) denote the units of domestic currency per unit of foreign currency. Assume the differential

dQ(t) = γ(t)Q(t)dt + σ2(t)Q(t)[ ρ(t)dW1(t) + √(1 − ρ²(t))dW2(t) ].    (5286)

Define

W3(t) = ∫0^t ρ(u)dW1(u) + ∫0^t √(1 − ρ²(u))dW2(u).    (5287)

This is P Brownian (Levy's Theorem 419). We can express

dQ(t) = γ(t)Q(t)dt + σ2(t)Q(t)dW3(t).    (5288)

For R(t), R^f(t), σ1(t), σ2(t), ρ(t) adapted to the filtration F(t) generated by W(t), where σ1(t), σ2(t) > 0 and ρ(t) ∈ (−1, 1), we can write

(dS(t)/S(t)) · (dQ(t)/Q(t)) = [ α(t)dt + σ1(t)dW1(t) ][ γ(t)dt + σ2(t)(ρ(t)dW1(t) + √(1 − ρ²(t))dW2(t)) ]    (5289)
 = ρ(t)σ1(t)σ2(t)dt.    (5290)

ρ(t) is the instantaneous correlation between relative changes in S(t) and Q(t).

13.7.1.1 Domestic Risk-Neutral Measure

There are three primary assets, namely the domestic, foreign money market account and stock. We
want to price and discount in domestic terms. The domestic money market risk-neutral measure is the
measure that prices all three domestic-money-market-numeraire asset processes as martingales. This
is equivalent to say that the three assets priced in units of domestic money market account must be
martingales. The domestic money market account priced in terms of itself is worth one - it must be
martingale. The stock priced in domestic money market account terms has price S(t)/M (t) = D(t)S(t)
with differential

d(D(t)S(t)) = S(t)dD(t) + D(t)dS(t) + dD(t)dS(t) (5291)


= −S(t)R(t)D(t)dt + D(t)dS(t) + 0 (see Equation 4621) (5292)
= D(t)S(t) [(α(t) − R(t))dt + σ1 (t)dW1 (t)] . (5293)

We want to find Θ1(t) s.t.

W̃1(t) = ∫0^t Θ1(u)du + W1(t)    (5294)

s.t. d(D(t)S(t)) = σ1(t)D(t)S(t)dW̃1(t), and this is our market price of risk equation

σ1(t)Θ1(t) = α(t) − R(t).    (5295)

The foreign money market account in domestic currency terms is M^f(t)Q(t), and in terms of the domestic money market this is D(t)M^f(t)Q(t). Since dM^f(t) = R^f(t)M^f(t)dt (see Equation 5285), by the Ito product rule

d(M^f(t)Q(t)) = M^f(t)dQ(t) + Q(t)dM^f(t) + dM^f(t)dQ(t)    [the last term is 0]    (5296)
 = M^f(t){ γ(t)Q(t)dt + σ2(t)Q(t)[ ρ(t)dW1(t) + √(1 − ρ²(t))dW2(t) ] } + Q(t)R^f(t)M^f(t)dt
 = M^f(t)Q(t)[ (R^f(t) + γ(t))dt + σ2(t)ρ(t)dW1(t) + σ2(t)√(1 − ρ²(t))dW2(t) ].    (5297)

Apply the Ito product rule once more to arrive at the differential d(D(t)M^f(t)Q(t))

 = M^f(t)Q(t)dD(t) + D(t)d(M^f(t)Q(t)) + 0    (5298)
 = −M^f(t)Q(t)R(t)D(t)dt + D(t)M^f(t)Q(t)[ (R^f(t) + γ(t))dt + σ2(t)ρ(t)dW1(t) + σ2(t)√(1 − ρ²(t))dW2(t) ]
 = D(t)M^f(t)Q(t)[ (R^f(t) − R(t) + γ(t))dt + σ2(t)ρ(t)dW1(t) + σ2(t)√(1 − ρ²(t))dW2(t) ].

We would like to write

W̃2(t) = ∫0^t Θ2(u)du + W2(t)    (5299)

s.t.

d(D(t)M^f(t)Q(t)) = D(t)M^f(t)Q(t)[ σ2(t)ρ(t)dW̃1(t) + σ2(t)√(1 − ρ²(t))dW̃2(t) ].    (5300)

Expanding this,

d(D(t)M^f(t)Q(t)) = D(t)M^f(t)Q(t)[ σ2(t)ρ(t)(Θ1(t)dt + dW1(t)) + σ2(t)√(1 − ρ²(t))(Θ2(t)dt + dW2(t)) ]
 = D(t)M^f(t)Q(t)[ σ2(t)ρ(t)Θ1(t)dt + σ2(t)√(1 − ρ²(t))Θ2(t)dt    (5301)
 + σ2(t)ρ(t)dW1(t) + σ2(t)√(1 − ρ²(t))dW2(t) ].

The second market price of risk equation is

σ2(t)ρ(t)Θ1(t) + σ2(t)√(1 − ρ²(t))Θ2(t) = R^f(t) − R(t) + γ(t).    (5302)

The two market price of risk equations determine Θ1 (t), Θ2 (t), which is a system of two linear equations
with two unknowns. If there is a unique solution to this linear system, then there is a unique risk-neutral
measure by the Second Fundamental Theorem of Asset Pricing (see Theorem 426). P̃ may be found using
the multidimensional Girsanov Theorem (see Theorem 424). W̃ (t) = (W̃1 (t), W̃2 (t)) is P̃ two-dimensional
Brownian, and processes 1, D(t)S(t), D(t)M f (t)Q(t) are the domestic risk neutral martingales. Define
W̃3(t) = ∫0^t ρ(u)dW̃1(u) + ∫0^t √(1 − ρ²(u))dW̃2(u).    (5303)

By Levy's Theorem 419, W̃3(t) is P̃ Brownian, and we have cross variations

dW̃1(t)dW̃3(t) = ρ(t)dt,   dW̃2(t)dW̃3(t) = √(1 − ρ²(t))dt.    (5304)

To get the undiscounted differentials, we multiply the discounted processes by M(t), use dM(t) = R(t)M(t)dt and apply the Ito product rule. Doing so gives us

dS(t) = S(t)[ R(t)dt + σ1(t)dW̃1(t) ],    (5305)

d(M^f(t)Q(t)) = D(t)M^f(t)Q(t)dM(t) + M(t)d(D(t)M^f(t)Q(t))    (using Equation 5300 and D(t)M(t) = 1)
 = M^f(t)Q(t)[ R(t)dt + σ2(t)ρ(t)dW̃1(t) + σ2(t)√(1 − ρ²(t))dW̃2(t) ]    (5306)
 = M^f(t)Q(t)[ R(t)dt + σ2(t)dW̃3(t) ].    (5307)

The mean rate of return on asset price processes is R(t). Multiplying M^f(t)Q(t) by D^f(t) and using the Ito product rule, we have

dQ(t) = D^f(t)d(M^f(t)Q(t)) + (−R^f(t)D^f(t)dt)M^f(t)Q(t)    (5308)
 = Q(t)[ (R(t) − R^f(t))dt + σ2(t)dW̃3(t) ].    (5309)

The mean rate of change in exchange rates is the difference between the domestic and foreign interest
rates under the domestic risk-neutral measure. The exchange rate can be considered as a dividend paying
asset, where the underlying is foreign currency paying continuous dividend Rf (t). The reinvestment of
dividends gets us back to the mean rate of return R(t). This is an analogue to the dividend paying stock.
Under the objective measure P, the mean rate of the exchange rate is not restricted to Equation 5309.

13.7.1.2 Foreign Risk-Neutral Measure

Here we construct the foreign risk-neutral measure under which the asset prices for the domestic, for-
eign money market account and stock is martingale. To find the foreign risk-neutral measure, take the
foreign money market account as numeraire, which has value at t denominated in domestic currency
to be M f (t)Q(t), and denominated in domestic money market account as D(t)M f (t)Q(t). Therefore,
given the domestic money market numeraire martingales (1, domestic money market), (D(t)S(t), stock),
(D(t)M f (t)Q(t), foreign money market) we can divide by D(t)M f (t)Q(t) to obtain the foreign money
market martingales, namely M (t)Df (t)/Q(t), Df (t)S(t)/Q(t), 1 for domestic money market, stock and
foreign money market account respectively. d(M f (t)Q(t)) was given by Equation 5307. See that the

volatility vector is ν(t) = (ν1(t), ν2(t)) = (σ2(t)ρ(t), σ2(t)√(1 − ρ²(t))). This is the same volatility vector as that of Q(t). By the change of risk-neutral measure (see Theorem 442), the numeraire M^f(t)Q(t) is accompanied by the risk-neutral measure

P̃^f(A) = (1/(D(0)Q(0))) ∫_A D(T)M^f(T)Q(T) dP̃,   ∀A ∈ F,    (5310)

where D(0) = M^f(0) = 1. Additionally, W̃^f(t) = (W̃1^f(t), W̃2^f(t)) is a P̃^f two-dimensional Brownian (see Definition 445), where

W̃1^f(t) = −∫0^t σ2(u)ρ(u)du + W̃1(t),    (5311)
W̃2^f(t) = −∫0^t σ2(u)√(1 − ρ²(u))du + W̃2(t).    (5312)

P̃^f is the foreign risk-neutral measure. If we define

W̃3^f(t) = ∫0^t ρ(u)dW̃1^f(u) + ∫0^t √(1 − ρ²(u))dW̃2^f(u)    (5313)
 = ∫0^t ρ(u)( −σ2(u)ρ(u)du + dW̃1(u) ) + ∫0^t √(1 − ρ²(u))( −σ2(u)√(1 − ρ²(u))du + dW̃2(u) )
 = −∫0^t σ2(u)du + ∫0^t ( ρ(u)dW̃1(u) + √(1 − ρ²(u))dW̃2(u) )    (5314)
 = −∫0^t σ2(u)du + W̃3(t),   by Equation 5303,    (5315)

then W̃3^f(t) is P̃^f Brownian with cross variations

dW̃1^f(t)dW̃3^f(t) = ρ(t)dt,   dW̃2^f(t)dW̃3^f(t) = √(1 − ρ²(t))dt.    (5316)

Using dD^f(t) = −R^f(t)D^f(t)dt and dM(t) = R(t)M(t)dt,

d(M(t)D^f(t)) = M(t)dD^f(t) + D^f(t)dM(t) = −M(t)R^f(t)D^f(t)dt + R(t)M(t)D^f(t)dt.    (5317)

Also, by Ito Doeblin (Theorem 415), see that for f(x) = 1/x, f′(x) = −1/x², f″(x) = 2/x³, so (using Equation 5309) we have

d(1/Q) = f′(Q)dQ + (1/2)f″(Q)dQdQ    (5318)
 = −(1/Q²)Q[ (R − R^f)dt + σ2dW̃3 ] + (1/Q³)Q²σ2²dt    (5319)
 = (1/Q)[ (R^f − R)dt − σ2dW̃3 ] + (1/Q)σ2²dt    (5320)
 = (1/Q(t))[ (R^f − R + σ2²)dt − σ2dW̃3 ].    (5321)

See that

d( M(t)D^f(t)/Q(t) ) = (1/Q(t))d(M(t)D^f(t)) + M(t)D^f(t)d(1/Q(t))    (5322)
 = (1/Q(t))[ −M(t)R^f(t)D^f(t)dt + R(t)M(t)D^f(t)dt ]    (5323)
 + M(t)D^f(t)(1/Q(t))[ (R^f(t) − R(t) + σ2²(t))dt − σ2(t)dW̃3(t) ]    (5324)
 = (M(t)D^f(t)/Q(t))[ σ2²(t)dt − σ2(t)dW̃3(t) ]    (5325)
 = (M(t)D^f(t)/Q(t))[ −σ2(t)(−σ2(t)dt + dW̃3(t)) ]    (5326)
 = −(M(t)D^f(t)/Q(t)) σ2(t)dW̃3^f(t).    (5327)

We may similarly obtain

d( D^f(t)S(t)/Q(t) ) = (D^f(t)S(t)/Q(t))[ (σ1(t) − σ2(t)ρ(t))dW̃1^f(t) − σ2(t)√(1 − ρ²(t))dW̃2^f(t) ]    (5328)
 = (D^f(t)S(t)/Q(t))[ σ1(t)dW̃1^f(t) − σ2(t)dW̃3^f(t) ].    (5329)

The martingale property of these processes under the foreign risk-neutral measure follows.

Although the mean rate of change of the exchange rate Q(t) is R(t) − R^f(t) under the domestic risk-neutral measure P̃, it should be noted that from the foreign perspective, the exchange rate 1/Q(t) does not have mean rate of change R^f(t) − R(t). This is due to the convexity of the function 1/x. Assume an exchange rate of 0.90 EUR ↔ 1.00 USD (1.11 USD ↔ 1.00 EUR). If it turns out the dollar price of the euro falls by 5%, then 1.00 EUR ↔ 0.95 · 1.11 = 1.0556 USD, or 0.9474 EUR ↔ 1.00 USD, a 0.9474/0.90 − 1 = 5.26% increase in the euro price of the dollar. See from Equation 5321 that the mean rate of change of 1/Q(t) under the domestic risk-neutral measure is R^f(t) − R(t) + σ2²(t). If we are pricing in the foreign currency, we shall use the foreign risk-neutral measure. Recall that dW̃3^f(t) = −σ2(t)dt + dW̃3(t), such that we may write Equation 5321 as

d(1/Q) = (1/Q)[ (R^f − R)dt − σ2dW̃3^f ].    (5330)

The asymmetry is resolved when pricing under the foreign risk-neutral measure. Under the objective measure P, the asymmetry remains:

dQ(t) = γ(t)Q(t)dt + σ2(t)Q(t)dW3(t),    (5331)
d(1/Q(t)) = (1/Q(t))( −γ(t) + σ2²(t) )dt − (1/Q(t))σ2(t)dW3(t).    (5332)

They have the same volatility, but the mean rates of return are asymmetric.

Exercise 715 (Quanto Option). A quanto option pays off in one currency the price in another currency of an underlying asset, without taking into consideration the currency conversion. For instance, a quanto call struck on a British underlying at 25 USD pays off 5 USD if the price of the asset at maturity is 30 GBP. Using Equations 5305 and 5309, show that

Q(t) = Q(0) exp( σ2ρW̃1(t) + σ2√(1 − ρ²)W̃2(t) + (r − r^f − σ2²/2)t ),    (5333)

and that

S(t)/Q(t) = (S(0)/Q(0)) exp( σ4W̃4(t) + (r − a − σ4²/2)t ),    (5334)

where σ4 = √(σ1² − 2σ1σ2ρ + σ2²), a = r − r^f + ρσ1σ2 − σ2², and

W̃4(t) = ((σ1 − σ2ρ)/σ4)W̃1(t) − (σ2√(1 − ρ²)/σ4)W̃2(t).    (5335)

Find the price of the quanto call at t paying (S(T)/Q(T) − K)^+ in domestic currency units at maturity T.

Proof. By the risk-neutral pricing formula (see Equation 4193), the value at t is

Ẽ[ exp(−r(T − t)) (S(T)/Q(T) − K)^+ | F(t) ].    (5336)

Since we may write by Equation 5305

S(t) = S(0) exp( σ1W̃1(t) + (r − σ1²/2)t ),    (5337)

and we may also write (using Equation 5287, Equation 5309)

Q(t) = Q(0) exp( σ2W̃3(t) + (r − r^f − σ2²/2)t ) = Q(0) exp( σ2ρW̃1(t) + σ2√(1 − ρ²)W̃2(t) + (r − r^f − σ2²/2)t ),

then

S(t)/Q(t) = (S(0)/Q(0)) exp( (σ1 − σ2ρ)W̃1(t) − σ2√(1 − ρ²)W̃2(t) + (r^f + σ2²/2 − σ1²/2)t ).    (5338)

Let σ4 = √((σ1 − σ2ρ)² + σ2²(1 − ρ²)) = √(σ1² + σ2² − 2ρσ1σ2) and σ4W̃4(t) = (σ1 − σ2ρ)W̃1(t) − σ2√(1 − ρ²)W̃2(t). See W̃4(t) has quadratic variation t. If we let a = r − r^f + ρσ1σ2 − σ2², then

r − a − σ4²/2 = r − (r − r^f + ρσ1σ2 − σ2²) − σ4²/2    (5339)
 = r^f − σ1σ2ρ + σ2² − σ4²/2    (5340)
 = r^f − σ1σ2ρ + σ2² − (σ1² + σ2² − 2ρσ1σ2)/2    (5341)
 = r^f + σ2²/2 − σ1²/2,    (5342)

so Equation 5338 is

S(t)/Q(t) = (S(0)/Q(0)) exp( σ4W̃4(t) + (r − a − σ4²/2)t ),    (5343)
d(S(t)/Q(t)) = (S(t)/Q(t))[ σ4dW̃4(t) + (r − a)dt ].    (5344)

S(t)/Q(t) behaves like a dividend-paying stock under P̃, so the quanto call option shall be priced by the formula (Equation 4398)

q(t, x) = x exp(−aτ)Φ(d+(τ, x)) − exp(−rτ)KΦ(d−(τ, x)),    (5345)

with

d±(τ, x) = (1/(σ4√τ)) [ log(x/K) + (r − a ± σ4²/2)τ ]    (5346)

when S(t)/Q(t) = x and τ is the time to maturity.
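A minimal numerical sketch of this quanto-call formula (Equations 5345-5346); the parameter values below are illustrative assumptions, not taken from the text.

# Hedged sketch: quanto call on S(T)/Q(T) priced via Equations 5345-5346.
# All numeric inputs are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def quanto_call(x, K, tau, r, rf, sigma1, sigma2, rho):
    """x = S(t)/Q(t); returns the domestic-currency value of the quanto call."""
    sigma4 = np.sqrt(sigma1**2 - 2 * rho * sigma1 * sigma2 + sigma2**2)
    a = r - rf + rho * sigma1 * sigma2 - sigma2**2        # dividend-like drift adjustment
    d_plus = (np.log(x / K) + (r - a + 0.5 * sigma4**2) * tau) / (sigma4 * np.sqrt(tau))
    d_minus = d_plus - sigma4 * np.sqrt(tau)
    return x * np.exp(-a * tau) * norm.cdf(d_plus) - K * np.exp(-r * tau) * norm.cdf(d_minus)

print(quanto_call(x=30.0, K=25.0, tau=1.0, r=0.02, rf=0.04,
                  sigma1=0.3, sigma2=0.1, rho=0.3))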

13.7.2 Forward Exchange Rate
Assume constants for the domestic interest rate r and the foreign interest rate r^f. The exchange rate from the domestic perspective has SDE (Equation 5309)

dQ(t) = Q(t)[ (r − r^f)dt + σ2(t)ρ(t)dW̃1(t) + σ2(t)√(1 − ρ²(t))dW̃2(t) ].    (5347)

Then exp(−(r − r^f)t)Q(t) is a domestic risk-neutral measure P̃ martingale. The forward price is the strike that makes the forward contract value zero. The domestic currency forward price F for a unit of foreign currency with maturity at T has the relation

Ẽ[ exp(−rT)(Q(T) − F) ] = 0.    (5348)

Then exp(−rT)F = Ẽ[exp(−rT)Q(T)] = exp(−r^f T)Ẽ[exp(−(r − r^f)T)Q(T)] = exp(−r^f T)Q(0), so

F = exp((r − r^f)T)Q(0).    (5349)

We call this the T-forward exchange rate for domestic currency per unit of foreign currency. From the foreign perspective, we have

d(1/Q(t)) = (1/Q(t))[ (r^f − r)dt − σ2(t)ρ(t)dW̃1^f(t) − σ2(t)√(1 − ρ²(t))dW̃2^f(t) ].    (5350)

So exp(−(r^f − r)t)(1/Q(t)) is a foreign risk-neutral measure P̃^f martingale. The foreign currency forward price F^f for a unit of domestic currency delivered at T has the relation

Ẽ^f[ exp(−r^f T)(1/Q(T) − F^f) ] = 0.    (5351)

Both 1/Q(T) and F^f are priced in foreign currency. Since

exp(−r^f T)F^f = Ẽ^f[ exp(−r^f T)(1/Q(T)) ] = exp(−rT)Ẽ^f[ exp(−(r^f − r)T)(1/Q(T)) ] = exp(−rT)(1/Q(0)),

then

F^f = exp((r^f − r)T)(1/Q(0)) = 1/F.    (5352)

We call this the T-forward exchange rate for foreign currency per unit of domestic currency.
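A one-line numerical check of Equations 5349 and 5352; spot and rates below are illustrative assumptions.

# Hedged sketch: T-forward FX rates from Equations 5349 and 5352 (illustrative inputs).
import numpy as np

Q0, r, rf, T = 1.10, 0.03, 0.01, 2.0     # spot (domestic per foreign), rates, maturity
F_dom_per_for = Q0 * np.exp((r - rf) * T)
F_for_per_dom = (1.0 / Q0) * np.exp((rf - r) * T)
print(F_dom_per_for, F_for_per_dom, 1.0 / F_dom_per_for)   # last two agree: F^f = 1/F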

13.7.3 Garman-Kohlhagen
Assume the domestic and foreign interest rates r, r^f and the volatility σ2 are constant. A call on a unit of foreign currency has payoff (Q(T) − K)^+ in domestic currency. The time zero value of this call is Ẽ[exp(−rT)(Q(T) − K)^+]. Equation 5309 under the constant-parameter assumptions gives us

dQ(t) = Q(t)[ (r − r^f)dt + σ2dW̃3(t) ],    (5353)

so

Q(T) = Q(0) exp( σ2W̃3(T) + (r − r^f − σ2²/2)T ).    (5354)

See that Y = −W̃3(T)/√T is standard normal under P̃, so that (see the direct analogue for pricing the European call with underlying paying continuous dividends, Section 13.3.7.2) we arrive at the Garman-Kohlhagen formula

Ẽ[ exp(−rT)(Q(T) − K)^+ ] = Ẽ[ exp(−rT)( Q(0)exp(−σ2√T Y + (r − r^f − σ2²/2)T) − K )^+ ]
 = exp(−r^f T)Q(0)Φ(d+) − exp(−rT)KΦ(d−),    (5355)

where d± = (1/(σ2√T)) [ log(Q(0)/K) + (r − r^f ± σ2²/2)T ].
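A minimal implementation sketch of the Garman-Kohlhagen formula (Equation 5355); the inputs below are illustrative assumptions.

# Hedged sketch: Garman-Kohlhagen FX call (Equation 5355), illustrative inputs.
import numpy as np
from scipy.stats import norm

def gk_call(Q0, K, T, r, rf, sigma2):
    """Domestic-currency value of a call on one unit of foreign currency."""
    d_plus = (np.log(Q0 / K) + (r - rf + 0.5 * sigma2**2) * T) / (sigma2 * np.sqrt(T))
    d_minus = d_plus - sigma2 * np.sqrt(T)
    return np.exp(-rf * T) * Q0 * norm.cdf(d_plus) - np.exp(-r * T) * K * norm.cdf(d_minus)

print(gk_call(Q0=1.10, K=1.15, T=0.5, r=0.03, rf=0.01, sigma2=0.12))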

13.7.4 Exchange Rate Put-Call Duality


Recall the Radon-Nikodym derivative of the foreign risk-neutral measure w.r.t. the domestic risk-neutral measure is (see Equation 5310)

dP̃^f/dP̃ = D(T)M^f(T)Q(T)/Q(0),    (5356)

s.t.

Ẽ^f[X] = Ẽ[ X · D(T)M^f(T)Q(T)/Q(0) ].    (5357)

A call struck at K on domestic currency pays (1/Q(T) − K)^+ in foreign currency at T. This should be valued risk-neutrally as the foreign risk-neutral measure discounted payoff,

Ẽ^f[ D^f(T)(1/Q(T) − K)^+ ] = Ẽ[ (D(T)M^f(T)Q(T)/Q(0)) · D^f(T)(1/Q(T) − K)^+ ]    (5358)
 = (1/Q(0)) Ẽ[ D(T)(1 − KQ(T))^+ ]    (5359)
 = Ẽ[ D(T)(K/Q(0))(1/K − Q(T))^+ ].    (5360)

The foreign currency price of the put struck at 1/K on a unit of foreign currency is

Ẽ[ D(T)(1/Q(0))(1/K − Q(T))^+ ].    (5361)

The call is worth K of these puts. That is, the option to exchange K units of foreign currency for one unit of domestic currency must be equivalent to K times the option to exchange 1/K units of domestic currency for one unit of foreign currency.

13.7.5 Forward Measure


Here we introduce what is known as the T-forward measure. Although we may assume a multidimensional Brownian, for simplicity we assume the stochastic processes here are driven by a single Brownian. Recall the zero coupon bond (Definition 455) has risk-neutral price (see Equation 4193) given by

B(t, T) = (1/D(t)) Ẽ[D(T) | F(t)].    (5362)

By definition B(T, T) = 1. An asset price S(t) that is currency denominated has a forward contract valued at t to be

V(t) = (1/D(t)) Ẽ[D(T)(S(T) − K) | F(t)].    (5363)

By the martingale property of D(t)S(t) we get

V(t) = S(t) − (K/D(t)) Ẽ[D(T) | F(t)] = S(t) − KB(t, T).    (5364)

The T-forward price is denoted For_S(t, T), and is the strike that makes the forward contract value zero. In other words,

For_S(t, T) = S(t)/B(t, T).    (5365)

13.7.5.1 Zero-Coupon Bond Numeraire

By risk-neutral arguments, D(t)B(t, T) is a P̃ martingale. By the stochastic asset representation (Theorem 441) there exists σ*(t, T) for the bond satisfying (the coefficient of −1 does not affect the distribution of the bond price process, since we may as well write σ*DB d(−W̃))

d(D(t)B(t, T)) = −σ*(t, T)D(t)B(t, T)dW̃(t).    (5366)

Definition 465 (T-Forward Measure). For fixed maturity T, the T-forward measure is denoted

P̃^T(A) = (1/B(0, T)) ∫_A D(T) dP̃,   A ∈ F.    (5367)

The multidimensional Girsanov Theorem 424 and the Stochastic Representation Theorem 441 assert that

W̃^T(t) = ∫0^t σ*(u, T)du + W̃(t)    (5368)

is P̃^T Brownian. Under the T-forward measure, all assets denominated in units of the zero coupon bond maturing at T are martingales. That is to say, the T-forward prices are P̃^T martingales.

The volatility vector of the T-forward price is the difference between the volatility vectors of the asset and of the T-maturity zero coupon bond. We may write

Ẽ^T[V(T) | F(t)] = (1/(D(t)B(t, T))) Ẽ[D(T)V(T) | F(t)] = V(t)/B(t, T),   s.t.    (5369)
V(t) = B(t, T)Ẽ^T[V(T) | F(t)].    (5370)

The relationship between the interest rates may be complex, making the analysis of D(T)V(T) difficult. However, by changing to the T-forward measure, we obtain a simpler model with only V(T) inside the expectation term.

13.7.6 Stochastic Rate Option Pricing


The BSM discussion in Section 13.2.3 assumed constant rates. This would be clearly inadequate in pricing
derivatives that are dependent on rate movements, such as interest rate swaps and bond options. We
extend the model here, and replace the assumption of constant asset volatility with constant volatility of

the forward on the underlying asset. Because the forward price is forward-measure martingale, denoting
W̃ T (t) to be the Brownian driving asset prices under the forward measure, this constant volatility of
forward assumption can be written as

dF orS (t, T ) = σF orS (t, T )dW̃ T (t). (5371)

Theorem 443 (Black-Scholes Model extension for stochastic rates, Shreve [19]). Let S(t) be a domestic currency denominated asset price, and assume the forward price of this asset has differential

dFor_S(t, T) = σ For_S(t, T) dW̃^T(t),    (5372)

where σ > 0 is constant. The value at t ∈ [T] of a European call expiring at T struck at K is

V(t) = S(t)Φ(d+(t)) − KB(t, T)Φ(d−(t)).    (5373)

The adapted processes for d± are given by

d±(t) = (1/(σ√(T − t))) [ log(For_S(t, T)/K) ± (σ²/2)(T − t) ].    (5374)

A short position in the option may be hedged by holding Φ(d+(t)) shares of the asset and shorting KΦ(d−(t)) T-maturity zero coupon bonds continuously at t. See that if we assume constant rates, then B(t, T) = exp(−r(T − t)) gives us back the price of the European call as we saw in Equation 3850 with fixed rates.
Proof. When t = 0, observe that For_S(0, T) = S(0)/B(0, T) by Equation 5365, and the solution to the differential equation is (see Exercise 657)

For_S(t, T) = (S(0)/B(0, T)) exp( σW̃^T(t) − (σ²/2)t ).    (5375)

This is log-normal under P̃^T. Taking S(t) as numeraire, the asset price S(t) is trivially 1. The risk-neutral measure for the S(t) numeraire is

P̃^S(A) = (1/S(0)) ∫_A D(T)S(T) dP̃,   ∀A ∈ F.    (5376)

The zcb in asset terms is worth

B(t, T)/S(t) = 1/For_S(t, T),   t ∈ [T].    (5377)

By the change of risk-neutral measure (Theorem 442), this is a P̃^S martingale. Computing the differential of 1/For_S(t, T) using f(x) = 1/x, f′(x) = −1/x², f″(x) = 2/x³, we may write

d(1/For_S(t, T)) = df(For_S(t, T))    (5378)
 = f′(For_S(t, T))dFor_S(t, T) + (1/2)f″(For_S(t, T))dFor_S(t, T)dFor_S(t, T)
 = −(σ/For_S(t, T))dW̃^T(t) + (σ²/For_S(t, T))dt    (5379)
 = −(σ/For_S(t, T))(−σdt + dW̃^T(t)).    (5380)

Since this is asserted to be a P̃^S martingale, W̃^S(t) = −σt + W̃^T(t) must be P̃^S Brownian. The volatility is σ. Equation 5380 has solution

1/For_S(t, T) = (B(0, T)/S(0)) exp( −σW̃^S(t) − (σ²/2)t ).    (5381)

This is log-normal under P̃^S. The risk-neutral pricing formula asserts the European call price is

V(0) = Ẽ[D(T)(S(T) − K)^+]    (5382)
 = Ẽ[D(T)S(T)1{S(T) > K}] − KẼ[D(T)1{S(T) > K}]    (5383)
 = S(0)Ẽ[ (D(T)S(T)/S(0)) 1{S(T) > K} ] − KB(0, T)Ẽ[ (D(T)/B(0, T)) 1{S(T) > K} ]    (5384)
 = S(0)P̃^S{S(T) > K} − KB(0, T)P̃^T{For_S(T, T) > K}   using For_S(T, T) = S(T)
 = S(0)P̃^S{ 1/For_S(T, T) < 1/K } − KB(0, T)P̃^T{For_S(T, T) > K}.    (5385)

Since W̃^S(T) ∼ Φ(0, T) under P̃^S, then

P̃^S{ 1/For_S(T, T) < 1/K } = P̃^S{ (B(0, T)/S(0)) exp(−σW̃^S(T) − (σ²/2)T) < 1/K }   Equation 5381
 = P̃^S{ −σW̃^S(T) − (σ²/2)T < log(S(0)/(KB(0, T))) }    (5386)
 = P̃^S{ −W̃^S(T)/√T < (1/(σ√T))( log(S(0)/(KB(0, T))) + (σ²/2)T ) }    (5387)
 = Φ(d+(0)).    (5388)

Using W̃^T(T) ∼ Φ(0, T) under P̃^T, then

P̃^T{ For_S(T, T) > K } = P̃^T{ (S(0)/B(0, T)) exp(σW̃^T(T) − (σ²/2)T) > K }   Equation 5375
 = P̃^T{ σW̃^T(T) − (σ²/2)T > log(KB(0, T)/S(0)) }    (5389)
 = P̃^T{ W̃^T(T)/√T > (1/(σ√T))( log(KB(0, T)/S(0)) + (σ²/2)T ) }    (5390)
 = P̃^T{ −W̃^T(T)/√T < (1/(σ√T))( log(S(0)/(KB(0, T))) − (σ²/2)T ) }    (5391)
 = Φ(d−(0)).    (5392)

The case when t > 0 may be generalized. Referring to the Black-Scholes-Merton pricing Equation 5373, consider taking the zero-coupon bond as numeraire rather than currency. Then we have

V(t)/B(t, T) = For_S(t, T)Φ(d+(t)) − KΦ(d−(t)).    (5393)

If we hedge a short position in the call option by holding Φ(d+(t)) shares of the asset and shorting KΦ(d−(t)) ZCBs, then the portfolio value with ZCB numeraire agrees with Equation 5393. To show that the short option hedge is successful, we need to verify the self-financing property. The capital gains differential associated with this portfolio in ZCB terms is Φ(d+(t))dFor_S(t, T). Since the numeraire is the ZCB, bond price movements do not affect capital gains. By the Ito product rule (Lemma 30), the portfolio evolves as

d(V(t)/B(t, T)) = Φ(d+(t))dFor_S(t, T) + For_S(t, T)dΦ(d+(t)) + dFor_S(t, T)dΦ(d+(t)) − KdΦ(d−(t)).

For self-financing to hold, we require that

For_S(t, T)dΦ(d+(t)) + dFor_S(t, T)dΦ(d+(t)) − KdΦ(d−(t)) = 0,    (5394)

i.e. that the change of portfolio value is solely due to capital gains.
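Before the verification exercise, here is a small sketch of the pricing formula of Theorem 443 written directly in terms of the forward price; B(t, T) and the other inputs below are illustrative assumptions.

# Hedged sketch: European call under Theorem 443, written in terms of the
# T-forward price For_S(t,T) = S(t)/B(t,T). Inputs are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def call_on_forward(S_t, B_tT, K, tau, sigma):
    """V(t) = S(t)*Phi(d+) - K*B(t,T)*Phi(d-), with d± from Equation 5374."""
    forward = S_t / B_tT
    d_plus = (np.log(forward / K) + 0.5 * sigma**2 * tau) / (sigma * np.sqrt(tau))
    d_minus = d_plus - sigma * np.sqrt(tau)
    return S_t * norm.cdf(d_plus) - K * B_tT * norm.cdf(d_minus)

# With a constant rate r, B(t,T) = exp(-r*tau) recovers the usual Black-Scholes call.
print(call_on_forward(S_t=100.0, B_tT=np.exp(-0.03 * 1.0), K=105.0, tau=1.0, sigma=0.2))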

Exercise 716. Using d± given by Equation 5374,

d±(t) = (1/(σ√(T − t))) [ log(For_S(t, T)/K) ± (σ²/2)(T − t) ],    (5395)

verify the self-financing condition Equation 5394. The steps are outlined, Shreve [19]:

1. Show d− = d+ − σ√τ.

2. Show d+² − d−² = 2 log(For_S(t, T)/K).

3. Show

For_S(t, T) exp(−d+²/2) − K exp(−d−²/2) = 0.    (5396)

4. Show

dd+ = (1/(2στ^{3/2})) log(For_S(t, T)/K) dt − (3σ/(4√τ)) dt + (1/√τ) dW̃^T(t).    (5397)

5. Show

dd− = dd+ + (σ/(2√τ)) dt.    (5398)

6. Show

dd+dd+ = dd−dd− = dt/τ.    (5399)

7. Show

dΦ(d+) = (1/√(2π)) exp(−d+²/2) dd+ − (d+/(2τ√(2π))) exp(−d+²/2) dt.    (5400)

8. Show

dΦ(d−) = (1/√(2π)) exp(−d−²/2) dd+ + (σ/√(2πτ)) exp(−d−²/2) dt − (d+/(2τ√(2π))) exp(−d−²/2) dt.    (5401)

9. Show

dFor_S(t, T) dΦ(d+) = (σFor_S(t, T)/√(2πτ)) exp(−d+²/2) dt.    (5402)

10. Complete it.

Proof. 1. Follow the definition. See Equation 3853.

2.

(d+ − d−)(d+ + d−) = (σ√τ)(d+ + d+ − σ√τ)    (5403)
 = 2σ√τ d+ − σ²τ    (5404)
 = 2 log(For_S/K) + τσ² − σ²τ    (5405)
 = 2 log(For_S(t, T)/K).    (5406)

3.

exp(−d+²/2)[ For_S(t, T) − K exp((d+² − d−²)/2) ] = exp(−d+²/2)[ For_S(t, T) − K exp(log(For_S(t, T)/K)) ] = 0.

4. Define f(x) = log(x/K), then f′(x) = 1/x, f″(x) = −1/x², and

dd+(t) = (1/(2στ^{3/2}))[ log(For_S/K) + (σ²/2)τ ]dt + (1/(σ√τ))[ (1/For_S)dFor_S − (1/(2For_S²))dFor_S dFor_S − (σ²/2)dt ]
 = (1/(2στ^{3/2})) log(For_S/K) dt + (σ/(4√τ))dt + (1/(σ√τ))( σdW̃^T(t) − (σ²/2)dt − (σ²/2)dt )    (5407)
 = (1/(2στ^{3/2})) log(For_S/K) dt − (3σ/(4√τ))dt + (1/√τ)dW̃^T(t).    (5408)

5.

dd−(t) = dd+(t) − d(σ√τ) = dd+(t) + (σ/(2√τ))dt.    (5409)

6. From part 5, the dt term does not affect quadratic variation. The result follows from part 4:

(dd−(t))² = (dd+(t))² = dt/τ.    (5410)

7.

dΦ(d+) = Φ′(d+)dd+ + (1/2)Φ″(d+)(dd+)(dd+)    (5411)
 = (1/√(2π)) exp(−d+²/2) dd+ + (1/2)(1/√(2π)) exp(−d+²/2)(−d+)(dt/τ).    (5412)

8.

dΦ(d−) = Φ′(d−)dd− + (1/2)Φ″(d−)(dd−)(dd−)    (5413)
 = (1/√(2π)) exp(−d−²/2) dd+ + (exp(−d−²/2)/√(2π))(σ/(2√τ))dt + (1/2)(exp(−d−²/2)/√(2π))(−d−)(dt/τ)    (5414)
 = (1/√(2π)) exp(−d−²/2) dd+ + (σ exp(−d−²/2)/(2√(2πτ)))dt + (exp(−d−²/2)(σ√τ − d+)/(2τ√(2π)))dt    (5415)
 = (1/√(2π)) exp(−d−²/2) dd+ + (σ exp(−d−²/2)/√(2πτ))dt − (d+ exp(−d−²/2)/(2τ√(2π)))dt.    (5416)

9.

dFor_S dΦ(d+) = σFor_S dW̃^T(t) · (1/√(2π)) exp(−d+²/2)(1/√τ)dW̃^T(t) = (σFor_S exp(−d+²/2)/√(2πτ))dt.    (5417)

10.

For_S dΦ(d+) + dFor_S dΦ(d+) − KdΦ(d−)    (5418)
 = For_S [ (1/√(2π)) exp(−d+²/2) dd+ − (d+/(2τ√(2π))) exp(−d+²/2) dt ] + (σFor_S exp(−d+²/2)/√(2πτ))dt    (5419)
 − K [ (1/√(2π)) exp(−d−²/2) dd+ + (σ/√(2πτ)) exp(−d−²/2) dt − (d+/(2τ√(2π))) exp(−d−²/2) dt ]    (5420)
 = [ −(For_S d+/(2τ√(2π))) exp(−d+²/2) + (σFor_S/√(2πτ)) exp(−d+²/2) − (Kσ/√(2πτ)) exp(−d−²/2) + (Kd+/(2τ√(2π))) exp(−d−²/2) ] dt
 + (1/√(2π)) [ For_S exp(−d+²/2) − K exp(−d−²/2) ] dd+    (5421)
 = 0,   by part 3.    (5422)

13.8 Term Structure Models


Markets are not governed by a single interest rate.⁷ Costs of borrowing vary over time. Bonds of different maturities have different yields (see Definition 458). For time periods 0 = T0 < · · · < Tn and B(0, Tj) the price of a zcb paying one at maturity Tj, a coupon-paying bond making fixed payments Ci at Ti, i ∈ [j], is valued as the discounted sum of cash flows

Σ_{i=1}^{j} Ci B(0, Ti).    (5423)

C1, · · · , C_{j−1} represent coupons (interest payments) and Cj represents the interest payment plus principal (or face, par value) paid at maturity Tj. If we are given instead the prices of these coupon-paying bonds at different maturities, we can solve recursively for B(0, Ti), a technique known as bootstrapping (a sketch follows this paragraph). That is, bootstrapping coupon-paying bond prices gives us zero-coupon bond prices - see that this is equivalent to giving us the yield to maturity by Equation 4624. The different yields at different maturities make up the yield curve. The interest rate term structure may be described via the forward rates f(t, T, T + ∆), the interest rate for borrowing from T to T + ∆ agreed at t. When we let ∆ → 0, we abbreviate the last parameter and write f(t, T) - this is the instantaneous forward rate. When t = T, we write r(t) = f(t, t), and this is the short rate. The forward rate curve specifies, for fixed t, the mapping T → f(t, T). Some one-factor short rate models were discussed in Exercises 664, 662 and 695. The short-rate model is an idealization, since 'instantaneous' agreements are assumed. Rather, these correspond to the shortest maturity yield, or overnight rates such as the LIBOR and SOFR rates. We assume bonds bear no default risk in the developments herein. In other pricing models, we assumed an SDE under the objective measure and solved market price of risk equations that allow us to switch to the risk-neutral measure. However, interest rates are not prices of assets, and we cannot infer market prices of risk from interest rates alone. If we are instead given the prices of primary assets, such as zcbs, then we may infer market prices of risk. But these zcbs assume risk-neutral pricing, which assumes the existence of a risk-neutral measure to begin with. Therefore, we work here with models that begin with the risk-neutral measure at the outset. The Heath-Jarrow-Morton (HJM) model gives us a mechanism for evolving the forward rate curve forward in time. The forward rate curve and zero-coupon bond prices can be deduced from one another. We see how to enforce conditions such that no-arbitrage assumptions are satisfied. Every Brownian-motion driven model must satisfy the HJM no-arbitrage conditions. Defining the simple interest rate f(t, T, T + δ) = L(t, T), a commonplace market observable corresponds to the forward LIBOR, where δ is in annual terms, such that δ = 0.25, 0.50 correspond to the 3M-LIBOR and 6M-LIBOR respectively.

⁷ The discussions here are motivated by Shreve's text, Stochastic Calculus for Finance II, Continuous Time Models [19].
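A minimal sketch of the bootstrapping step mentioned above, under the simplifying assumption of one coupon bond per maturity with annual payment dates; all numeric inputs are hypothetical.

# Hedged sketch of bootstrapping zcb prices B(0, T_i) from coupon-bond prices
# (Equation 5423), assuming one bond per maturity and annual payment dates.
# All inputs are illustrative assumptions.
def bootstrap_zcb(bond_prices, coupons, face=100.0):
    """bond_prices[j], coupons[j]: price and annual coupon of the bond maturing at T_{j+1}."""
    zcb = []                                                   # zcb[i] = B(0, T_{i+1})
    for j, (price, c) in enumerate(zip(bond_prices, coupons)):
        pv_known_coupons = sum(c * zcb[i] for i in range(j))   # coupons at already-solved dates
        zcb.append((price - pv_known_coupons) / (c + face))    # solve Equation 5423 for the last cash flow
    return zcb

print(bootstrap_zcb(bond_prices=[101.0, 100.5, 99.0], coupons=[3.0, 3.0, 3.0]))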

13.8.1 Affine-Yield Models


Affine yield models specify yields for zcbs as affine functions of the interest rate. Examples are the CIR
model and Hull-White model discussed in Exercise 664 and Exercise 695 respectively. Here we develop
two factor, constant coefficient models of the interest rates. For models discussed, we reduce them to
their canonical form. The canonical model does not have an economic interpretation. They simply help
to simplify the models to minimal parameters to aid in the model calibration. All two-factor affine yield
models may be obtained from one of the canonical models discussed herein. There are three different
types: 1) both factors have constant diffusion terms, and therefore are Gaussian processes possibly taking
on negative values, 2) both factors appear under the square root in diffusion terms and are therefore
strictly nonnegative, and 3) one of the two factors appears under the square root in its diffusion term. In order, these are called the two-factor Vasicek, two-factor CIR and two-factor mixed term structure models.

13.8.1.1 Two-Factor Vasicek Model

Assume factors X1(t), X2(t) follow the SDEs

dX1(t) = (a1 − b11X1(t) − b12X2(t))dt + σ1dB̃1(t),    (5424)
dX2(t) = (a2 − b21X1(t) − b22X2(t))dt + σ2dB̃2(t).    (5425)

Here B̃1(t), B̃2(t) are P̃ Brownians with constant correlation dB̃1(t)dB̃2(t) = νdt, σ1, σ2 > 0, and assume the matrix

B = [ b11  b12 ; b21  b22 ]    (5426)

has eigenvalues λ1, λ2 > 0, which causes X1(t), X2(t) to be mean-reverting (verify this). Assume interest rates are an affine function of the factors, such that

R(t) = ε0 + ε1X1(t) + ε2X2(t),    (5427)

where εi, i ∈ [2], are constants. Different choices of parameters (ai, bij, σi, εi) can specify the same process R(t) - we specify a canonical two-factor Vasicek model by reducing the model to

dY1(t) = −λ1Y1(t)dt + dW̃1(t),    (5428)
dY2(t) = −λ21Y1(t)dt − λ2Y2(t)dt + dW̃2(t),    (5429)
R(t) = δ0 + δ1Y1(t) + δ2Y2(t),    (5430)

where W̃1(t), W̃2(t) are independent Brownians. This model has six parameters λ1, λ2, λ21, δ0, δ1, δ2 with λ1, λ2 > 0. These can be allowed to be functions of time. The reduction is done by transforming the matrix B into its Jordan normal form, such that for a choice of non-singular matrix P = [ p11  p12 ; p21  p22 ] we have

K = P B P⁻¹ = [ λ1  0 ; κ  λ2 ].    (5431)

If λ1 6= λ2 , then K is diagonal and columns of P −1 are eigenvectors of B (Exercise 67). If λ1 = λ2 but


κ 6= 0, we choose P such that κ = 1. Then for
" # " # " # " #
X1 (t) a1 σ1 0 B̃1 (t)
X(t) = , A= , Σ= , B̃(t) = , (5432)
X2 (t) a2 0 σ2 B̃2 (t)

Equations 5424, 5425 are expressed

dX(t) = Adt − BX(t)dt + ΣdB̃(t). (5433)

If we let X̄(t) = P X(t), then

dX̄(t) = P Adt − P BX(t)dt + P ΣdB̃(t) (5434)


= P Adt − KP X(t)dt + P ΣdB̃(t) (5435)
= P Adt − K X̄(t)dt + P ΣdB̃(t). (5436)

or componentwise

dX̄1 (t) = (p11 a1 + p12 a2 ) dt − λ1 X̄1 (t)dt + p11 σ1 dB̃1 (t) + p12 σ2 dB̃2 (t), (5437)
dX̄2 (t) = (p21 a1 + p22 a2 ) dt − κX̄1 (t)dt − λ2 X̄2 (t)dt + p21 σ1 dB̃1 (t) + p22 σ2 dB̃2 (t). (5438)

Lemma 27. For σ1 > 0, σ2 > 0, v ∈ (−1, 1) and non-singular matrix P , we have

γi = p2i1 σ12 + 2νpi1 pi2 σ1 σ2 + p2i2 σ22 > 0, i = 1, 2. (5439)

and
1
p11 p21 σ12 + ν(p11 p22 + p12 p21 )σ1 σ2 + p12 p22 σ22 ∈ (−1, 1).

ρ= √ (5440)
γ1 γ2
" # " √ #
1 ν √ a 1 − a2
Proof. Since ν ∈ (−1, 1), N = has matrix square root. A valid square root is N = √
ν 1 1 − a2 a
q
1 1

2
where a = sign(ν) 2 + 2 1 − ν . We can verify this directly by matrix computation. In particular,
√ 2
a2 + 1 − a2 = a2 + (1 − a2 ) = 1, and
r s  
p
2
1 1p 2
1 1p
2a 1 − a = 2sign(ν) + 1−ν 1− + 1 − ν2 (5441)
2 2 2 2
r r
1 1p 2
1 1p
= 2sign(ν) + 1−ν − 1 − ν2 (5442)
2 2 2 2
r
1 1
= 2sign(ν) − (1 − ν 2 ) (5443)
4 4
1
= 2sign(ν) |ν| (5444)
2
= ν. (5445)

Since matrices N , Σ, P 0 are nonsingular,
" √ √ #
√ p11 σ 1 a + p 12 σ2 1 − a2 p σ a+p σ
21 1 22 2 1 − a2 h i
N ΣP 0 = √ √ =: c1 c2 , (5446)
p11 σ1 1 − a2 + p12 σ2 a p21 σ1 1 − a2 + p22 σ2 a

where c1 , c2 are column representations is nonsingular (Theorem 14). By invertibility, the columns c1 , c2
are linearly independent (and therefore also non-zero) by Theorem 72. Since ci is non-zero, then γi =
kci k2 > 0 for i = 1, 2 (Theorem 56) and Cauchy-Schwarz (Theorem 68) asserts that |c1 · c2 | < kc1 kkc2 k
(equality holds only when c1 , c2 are linearly dependent). Then

−kc1 kkc2 k < c1 · c2 < kc1 kkc2 k, (5447)

which is iff ρ ∈ (−1, 1) by definition.

Define
1  
B̄i (t) = √ pi1 σ1 B̃1 (t) + pi2 σ2 B̃2 (t) , i = 1, 2. (5448)
γi

Then
1
dB̄1 (t)dB̄1 (t) = (p11 σ1 dB̃1 (t) + p12 σ2 dB̃2 (t))2 (5449)
γ1
1 2 2
p11 σ1 dt + p212 σ22 dt + 2p11 p12 σ1 σ2 νdt

= (5450)
γ1
1 2 2
= (p σ + p212 σ22 + 2p11 p12 σ1 σ2 ν)dt (5451)
γ1 11 1
= dt. (5452)

Similarly, dB̄2 (t)dB̄2 (t) = dt. B̄1 (t), B̄2 (t) are Brownian motions by Levy Theorem 419. Their instanta-
neous correlation is constant

dB̄1 (t)dB̄2 (t) = ρdt. (5453)

Write Equations 5437 and 5438 as



dX̄1 (t) = (p11 a1 + p12 a2 )dt − λ1 X̄1 (t)dt + γ1 dB̄1 (t), (5454)

dX̄2 (t) = (p21 a1 + p22 a2 )dt − κX̄1 (t)dt − λ2 X̄2 (t)dt + γ2 dB̄2 (t) (5455)

using definitions for B̄i (t) given Equation 5448. Denote


 
1 p11 a1 + p12 a2
X̂1 (t) = √ X̄1 (t) − , (5456)
γ1 λ1
 
1 κ(p11 a1 + p12 a2 ) p21 a1 + p22 a2
X̂2 (t) = √ X̄2 (t) − − . (5457)
γ2 λ1 λ2 λ2
 
For f (x) = √1γ1 x − p11 a1λ+p1
12 a2
, f 0 (x) = √1γ1 , f 00 (x) = 0 then (Ito Doeblin, Theorem 415)

1
dX̂1 (t) = √ dX̄1 (t) (5458)
γ1
1  √ 
= √ (p11 a1 + p12 a2 )dt − λ1 X̄1 (t)dt + γ1 dB̄1 (t) (5459)
γ1
= −λ1 X̂1 (t)dt + dB̄1 (t) (5460)

and by similar approach


r
γ1
dX̂2 (t) = −κ X̂1 (t)dt − λ2 X̂2 (t)dt + dB̄2 (t). (5461)
γ2

Define
1 
W̃1 (t) = B̄1 (t), W̃2 (t) = p −ρB̄1 (t) + B̄2 (t) , (5462)
1− ρ2
and see that

dW̃i (t)dW̃j (t) = 1 {i = j} dt. (5463)

Then by Levy’s Theorem, (W̃1 (t), W̃2 (t)) is a two-dimensional Brownian motion (see Definition 445). If
we set
−ρX̂1 (t) + X̂2 (t)
Y1 (t) = X̂1 (t), Y2 (t) = p , (5464)
1 − ρ2
we may write (using Equations 5460, 5461, 5462, 5464)
1 h i
dY2 (t) = p −ρdX̂1 (t) + dX̂2 (t) (5465)
1 − ρ2
 r  
1 γ1 1  
= p ρλ1 − κ X̂1 (t) − λ2 X̂2 (t) dt + p −ρdB̄1 (t) + dB̄2 (t)
1−ρ 2 γ 2 1−ρ 2
 r  
1 γ1 −λ2 X̂2 (t)
= p ρλ1 − κ Y1 (t) dt + p dt + dW̃2 (t)
1−ρ 2 γ 2 1 − ρ2
 r   !
1 γ1 X̂1 (t)(ρλ2 ) −λ2 X̂2 (t)
= p ρλ1 − ρλ2 − κ Y1 (t) dt + p + p dt + dW̃2 (t)
1 − ρ2 γ2 1 − ρ2 1 − ρ2
 r 
1 γ1
= p ρλ1 − ρλ2 − κ Y1 (t)dt − λ2 Y2 (t)dt + dW̃2 (t). (5466)
1−ρ 2 γ2
Then the linear system for Equations 5454, 5455 become

dY1 (t) = −λ1 Y1 (t)dt + dW̃1 (t), see Equations 5460, 5462, 5464 (5467)
dY2 (t) = −λ21 Y1 (t)dt − λ2 Y2 (t)dt + dW̃2 (t), see Equation 5466. (5468)

where
 r 
1 γ1
λ21 = p −ρλ1 + ρλ2 + κ . (5469)
1 − ρ2 γ2
We have shown the first two Equations 5428 and 5429 of the canonical model expression. Lastly, using
all the substitutions thus far, see that

Y1 (t) = X̂1 (t) (5470)


 
1 p11 a1 + p12 a2
= √ X̄1 (t) − (5471)
γ1 λ1
 
1 p11 a1 + p12 a2
= √ p11 X1 (t) + p12 X2 (t) − , (5472)
γ1 λ1
1  
Y2 (t) = p −ρX̂1 (t) + X̂2 (t) (5473)
1 − ρ2
   
ρ p11 a1 + p12 a2 1 κ(p11 a1 + p12 a2 ) p21 a1 + p22 a2
= −p X̄1 (t) − +p X̄2 (t) + −
γ1 (1 − ρ2 ) λ1 γ2 (1 − ρ2 ) λ1 λ2 λ2
 
ρ p11 a1 + p12 a2
= −p p11 X1 (t) + p12 X2 (t) − (5474)
γ1 (1 − ρ )2 λ1
 
1 κ(p11 a1 + p12 a2 ) p21 a1 + p22 a2
+p p21 X1 (t) + p22 X2 (t) + − . (5475)
γ2 (1 − ρ )2 λ1 λ2 λ2

See we can express

Y (t) = Γ(P X(t) + V ), (5476)


" #
Y1 (t)
Y (t) = , (5477)
Y2 (t)
 
√1 0
γ1
Γ = (5478)
−√ ρ 2
 1

√ ,
γ1 (1−ρ ) γ2 (1−ρ2 )
" #
− p11 a1λ+p 12 a2

V = κ(p11 a1 +p12 a2 )
1
p21 a1 +p22 a2
. (5479)
λ1 λ2 − λ2

Solving for X(t),

X(t) = P −1 (Γ−1 Y (t) − V ). (5480)

So expressing as column matrix (see block multiplication in Exercise 11)

R(t) = 0 + [1 , 2 ]X(t) (5481)


= 0 + [1 , 2 ]P −1 Γ−1 Y (t) − [1 , 2 ]P −1 V (5482)
= δ0 + [δ1 , δ2 ]Y (t), (5483)

where δ0 − 0 − [1 , 2 ]P −1 V and [δ1 , δ2 ] = [1 , 2 ]P −1 Γ−1 , which is the last equation in the canonical
form (Equation 5430).
The zcb prices under the canonical two-factor Vasicek model may be stated. Risk-neutral pricing
asserts that
" #
Z T
B(t, T ) = Ẽ exp(− R(u)du)|F(t) , t ∈ [T ]. (5484)
t

R(t) is a function of Y1 (t), Y2 (t), and is solution to a system of SDEs which must be Markov (see Cor
33), then ∃f (t, y1 , y2 ) s.t.

B(t, T ) = f (t, Y1 (t), Y2 (t)). (5485)

Recall D(t)B(t, T ) is P̃ martingale. Using dD(t) = −R(t)D(t)dt (Equation 4621), Ito product rule
(Theorem 30),

d (D(t)B(t, T )) = d (D(t)f (t, Y1 (t), Y2 (t))) (5486)


= d(D(t))f (t, Y1 (t), Y2 (t)) + d(f (t, Y1 (t), Y2 (t)))D(t) (5487)
= −R(t)D(t)f (t, Y1 (t), Y2 (t))dt (5488)
 
1 1
+D(t) ft dt + fy1 dY1 + fy2 dY2 + fy1 y1 dY1 dY1 + fy2 y2 dY2 dY2 + fy1 y2 dY1 dY2 (5489)
2 2
 
1 1
= D −Rf dt + ft dt + fy1 dY1 + fy2 dY2 + fy1 y1 dY1 dY1 + fy2 y2 dY2 dY2 + fy1 y2 dY1 dY2 .
2 2

Substitute the canonical forms (Equations 5428,5429,5430), so that


 
1 1
d(D(t)B(t, T )) = D −(δ0 + δ1 Y1 + δ2 Y2 )f + ft − λ1 Y1 fy1 − λ21 Y1 fy2 − λ2 Y2 fy2 + fy1 y1 + fy2 y2 dt
2 2
h i
+D fy1 dW̃1 + fy2 dW̃2 . (5490)

This is martingale so drift is zero and we arrive at the PDE
1 1
−(δ0 + δ1 y1 + δ2 y2 )f + ft − λ1 y1 fy1 − λ21 y1 fy2 − λ2 y2 fy2 + fy1 y1 + fy2 y2 = 0 (5491)
2 2
for all t ∈ [0, T ) and y1 , y2 ∈ R. The terminal conditions are

f (T, y1 , y2 ) = 1 ∀y1 , y2 ∈ R. (5492)

We assume the solution takes form

f (t, y1 , y2 ) = exp {−y1 C1 (T − t) − y2 C2 (T − t) − A(T − t)} (5493)

for some functions C1 (τ ), C2 (τ ), A(τ ). The zcb prices depend on the time to maturity/relative maturity
term τ = T − t. The terminal conditions imply that

C1 (0) = C2 (0) = A(0) = 0. (5494)

Using δ
δt Ci (τ ) = Ci0 (τ ) δt
δ
τ = −Ci0 (τ ), as well as δ
δt A(τ ) = −A0 (τ ), then

ft = [y1 C10 + y2 C20 + A0 ]f, (5495)


fy1 = −C1 f, (5496)
fy2 = −C2 f, (5497)
fy1 y1 = C12 f, (5498)
fy2 y2 = C22 f, (5499)
fy1 y2 = C1 C2 f. (5500)

Then we may write Equation 5491 as


1 1
−(δ0 + δ1 y1 + δ2 y2 )f + [y1 C10 + y2 C20 + A0 ]f − λ1 y1 (−C1 f ) − λ21 y1 (−C2 f ) − λ2 y2 (−C2 f ) + C12 f + C22 f = 0
2 2
which simplifies to
  
0 0 0 1 2 1 2
(C1 + λ1 C1 + λ21 C2 − δ1 )y1 + (C2 + λ2 C2 − δ2 )y2 + A + C1 + C2 − δ0 f = 0. (5501)
2 2
Since this holds for all y1 , y2 ∈ R, we must have that

C10 + λ1 C1 + λ21 C2 − δ1 = 0, (5502)


C20 + λ2 C2 − δ2 = 0, (5503)

and these two equations imply that

A′ + (1/2)C1² + (1/2)C2² − δ0 = 0,    (5504)

a system of three ordinary differential equations. The solutions satisfying the terminal conditions C1(0) = C2(0) = A(0) = 0 can be shown to be

C1(τ) = (1/λ1)(δ1 − λ21δ2/λ2)(1 − exp(−λ1τ)) + (λ21δ2/(λ2(λ1 − λ2)))(exp(−λ2τ) − exp(−λ1τ)),   (λ1 ≠ λ2),
C1(τ) = (1/λ1)(δ1 − λ21δ2/λ1)(1 − exp(−λ1τ)) + (λ21δ2/λ1)τ exp(−λ1τ),   (λ1 = λ2),    (5505)

C2(τ) = (δ2/λ2)(1 − exp(−λ2τ)),    (5506)

A(τ) = ∫0^τ ( −(1/2)C1²(u) − (1/2)C2²(u) + δ0 ) du.    (5507)
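A small numerical sketch of these formulas: C1 and C2 in closed form (λ1 ≠ λ2 branch), A by simple quadrature of Equation 5507, assembled into the bond price of Equation 5493. Parameter values are illustrative assumptions.

# Hedged sketch: zcb price in the canonical two-factor Vasicek model,
# B(t,T) = exp(-y1*C1(tau) - y2*C2(tau) - A(tau)) with C1, C2, A from
# Equations 5505-5507 (lambda1 != lambda2 branch). Inputs are illustrative.
import numpy as np

lam1, lam2, lam21 = 0.9, 0.4, 0.2
d0, d1, d2 = 0.02, 0.01, 0.015            # delta_0, delta_1, delta_2

def C1(tau):
    a = (d1 - lam21 * d2 / lam2) * (1 - np.exp(-lam1 * tau)) / lam1
    b = lam21 * d2 / (lam2 * (lam1 - lam2)) * (np.exp(-lam2 * tau) - np.exp(-lam1 * tau))
    return a + b

def C2(tau):
    return d2 / lam2 * (1 - np.exp(-lam2 * tau))

def A(tau, n=2000):
    u = np.linspace(0.0, tau, n)          # trapezoid rule for Equation 5507
    vals = -0.5 * C1(u)**2 - 0.5 * C2(u)**2 + d0
    return np.sum((vals[:-1] + vals[1:]) / 2.0) * (u[1] - u[0])

def bond_price(y1, y2, tau):
    return np.exp(-y1 * C1(tau) - y2 * C2(tau) - A(tau))

print(bond_price(y1=0.5, y2=-0.3, tau=5.0))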

Exercise 717. Show the solutions to

C10 + λ1 C1 + λ21 C2 − δ1 = 0, (5508)


C20
+ λ 2 C2 − δ2 = 0, (5509)
1 1
A0 + C12 + C22 − δ0 = 0 (5510)
2 2
with initial conditions C1 (0) = C2 (0) = A(0) = 0 are given by Equations 5505, 5506 and 5507.

Proof. We first show C2 . See that


δ
(exp(λ2 τ )C2 (τ )) = exp(λ2 τ ) (λ2 C2 (τ ) + C20 (τ )) (5511)
δτ
= exp(λ2 τ ) (λ2 C2 (τ ) − λ2 C2 (τ ) + δ2 ) sub. Equation 5509 (5512)
= exp(λ2 τ )δ2 . (5513)

Integrating from 0 → τ and using the initial condition, we get


Z τ
exp(λ2 τ )C2 (τ ) = exp(λ2 u)δ2 du (5514)
0
δ2
= (exp(λ2 τ ) − 1) . (5515)
λ2
δ2
So C2 (τ ) = λ2 (1 − exp(−λ2 τ )). Solving for C1 and using our solution for C2 , we may write

δ
(exp(λ1 τ )C1 (τ )) = exp(λ1 τ ) (λ1 C1 (τ ) + C10 (τ )) (5516)
δτ
= exp(λ1 τ ) (−λ21 C2 (τ ) + δ1 ) (5517)
 
λ21 δ2
= exp(λ1 τ ) − (1 − exp(−λ2 τ )) + δ1 . (5518)
λ2
If we integrate this from 0 → τ , using the initial condition, get
Z τ
λ21 δ2
exp(λ1 τ )C1 (τ ) = − (exp(λ1 u) − exp((λ1 − λ2 )u)) + δ1 exp(λ1 u)du (5519)
λ2
Z0 τ Z τ
λ21 δ2 λ21 δ2
= − exp(λ1 u) + δ1 exp(λ1 u)du + exp((λ1 − λ2 )u)du. (5520)
0 λ 2 0 λ2
Evaluating this at λ1 6= λ2 , we get
 
λ21 δ2 1 λ21 δ2 1
δ1 − (exp(λ1 τ ) − 1) + (exp((λ1 − λ2 )τ ) − 1) . (5521)
λ2 λ1 λ2 λ1 − λ2
Evaluating at λ1 = λ2 instead, we get
 
λ21 δ2 1 λ21 δ2
δ1 − (exp(λ1 τ ) − 1) +τ . (5522)
λ2 λ1 λ2
Multiplying both sides of Equation 5521 and Equation 5522 by exp(−λ1 τ ), we get solution Equation 5505.
The final solution for Equation 5507 can be obtained similarly, using the initial condition A(0) = 0.

For fixed relative maturity τ̄ (bond at t maturing at t + τ̄ ), denote the corresponding yield L(t).
We call this the long rate. After determining the evolution of the short rate R(t) risk-neutrally, we can
determine the price of the zcbs maturing at (t + τ̄ ). That is, the short-rate model determines the long
rate. In any affine-yield model, the long rate satisfies some SDE - we attempt to work this out. For the
two-factor Vasicek model, the zcbs take form

B(t, T ) = exp {−Y1 (t)C1 (τ ) − Y2 (t)C2 (τ ) − A(τ )} , (5523)

where C1 (τ ), C2 (τ ), A(τ ) are given by the previous discussions (see Exercise 717). The long rate may be
written
−1 1
L(t) = log B(t, t + τ̄ ) = [C1 (τ̄ )Y1 (t) + C2 (τ̄ )Y2 (t) + A(τ̄ )] . (5524)
τ̄ τ̄
See that this is affine function of canonical factors Y1 (t), Y2 (t). If we want to get back the economic
interpretations of our model, then we aim to write a two-factor Vasicek model of the forms in Equation
5424, 5425 where X1 (t) → R(t) and X2 (t) → L(t). See that we may express Equations 5430 and 5524
with the system
" # " #" # " #
R(t) δ1 δ2 Y1 (t) δ0
= 1 1
+ 1
. (5525)
L(t) τ̄ C1 (τ̄ ) τ̄ C2 (τ̄ ) Y2 (t) τ̄ A(τ̄ )

Lemma 28. The matrix


" #
δ1 δ2
D= 1 1
(5526)
τ̄ C1 (τ̄ ) τ̄ C2 (τ̄ )

is non-singular iff δ2 6= 0 and

(λ1 − λ2 )δ1 + λ21 δ2 6= 0. (5527)

Proof. Shreve [19]. Consider function f (x) = 1 − exp(−x) − x exp(−x). See that f 0 (x) = exp(−x) −
exp(−x) + x exp(−x) = x exp(−x) > 0 for all x > 0, and that f (x) > 0 for any positive x since
f (0) = 0. Let h(x) = 1
x (1 − exp(−x)) , then h0 (x) = − x12 + 1
x2 exp(−x) + 1
x exp(−x) = − x12 (1 −
exp(−x) − x exp(−x)) = − x12 f (x) < 0 is for any x > 0. So h(x) is strictly decreasing function on the
positive real line. D is invertible iff det(D) 6= 0 (Theorem 72). The determinant is easy to compute

(Exercise 16). When λ1 6= λ2 , we may write (Equations 5505, 5506)
1
det(D) = [δ1 C2 (τ̄ ) − δ2 C1 (τ̄ )] (5528)
τ̄  
1 δ2 1 λ21 δ2
= [δ1 (1 − exp(−λ2 τ̄ )) − δ2 δ1 − (1 − exp(−λ1 τ̄ )) (5529)
τ̄ λ2 λ1 λ2
λ21 δ22
− (exp(−λ2 τ̄ ) − exp(−λ1 τ̄ ))] (5530)
λ2 (λ1 − λ2 )
 
δ1 δ2 δ2 λ21 δ2
= (1 − exp(−λ2 τ̄ )) − δ1 − (1 − exp(−λ1 τ̄ )) (5531)
τ̄ λ2 τ̄ λ1 λ2
λ21 δ22
− (exp(−λ2 τ̄ ) − exp(−λ1 τ̄ )) (5532)
τ̄ λ2 (λ1 − λ2 )
δ1 δ2 δ2 δ1 λ21 δ22
= (1 − exp(−λ2 τ̄ )) − (1 − exp(−λ1 τ̄ )) + (1 − exp(−λ1 τ̄ )) (5533)
τ̄ λ2 τ̄ λ1 τ̄ λ1 λ2
λ21 δ22
− (exp(−λ2 τ̄ ) − exp(−λ1 τ̄ )) (5534)
τ̄ λ2 (λ1 − λ2 )
δ1 δ2 δ2 δ1
= (1 − exp(−λ2 τ̄ )) − (1 − exp(−λ1 τ̄ )) (5535)
τ̄ λ2 τ̄ λ1
λ21 δ22
+ [(λ1 − λ2 )(1 − exp(−λ1 τ̄ )) − λ1 exp(−λ2 τ̄ ) + λ1 exp(−λ1 τ̄ ))](5536)
(λ1 − λ2 )τ̄ λ1 λ2
 
1 1
= δ1 δ2 (1 − exp(−λ2 τ̄ )) − (1 − exp(−λ1 τ̄ )) (5537)
λ2 τ̄ λ1 τ̄
λ21 δ22
+ [λ1 (1 − exp(−λ2 τ̄ )) − λ2 (1 − exp(−λ1 τ̄ ))] (5538)
(λ1 − λ2 )λ1 λ2 τ̄
   
λ21 δ2 1 1
= δ2 δ1 + (1 − exp(−λ2 τ̄ )) − (1 − exp(−λ1 τ̄ )) (5539)
(λ1 − λ2 ) λ2 τ̄ λ1 τ̄
 
λ21 δ2
= δ2 δ1 + [h(λ2 τ̄ ) − h(λ1 τ̄ )] . (5540)
(λ1 − λ2 )

λ1 6= λ2 and h is strictly decreasing, so h(λ2 τ̄ ) 6= h(λ1 τ̄ ) and determinant is nonzero iff δ2 6= 0,


(λ1 − λ2 )δ1 + λ21 δ2 6= 0. When λ1 = λ2 , then
1
det(D) = [δ1 C2 (τ̄ ) − δ2 C1 (τ̄ )] (5541)
τ̄
λ21 δ22
 
1 δ2 1 λ21 δ2
= [δ1 (1 − exp(−λ1 τ̄ )) − δ2 δ1 − (1 − exp(−λ1 τ̄ )) − τ̄ exp(−λ1 τ̄ )]
τ̄ λ1 λ1 λ1 λ1
λ21 δ22
 
δ1 δ2 δ2 λ21 δ2
= (1 − exp(−λ1 τ )) − δ1 − (1 − exp(−λ1 τ̄ )) − exp(−λ1 τ̄ ))
τ̄ λ1 τ̄ λ1 λ1 λ1
λ21 δ22
= (1 − exp(−λ1 τ̄ ) − λ1 τ̄ exp(−λ1 τ̄ )) (5542)
λ21 τ̄
λ21 δ22
= f (λ1 τ̄ ). (5543)
λ21 τ̄

Again, λ1 τ̄ is positive and f (λ1 τ̄ ) 6= 0, so for determinant to be nonzero we arrive at the same statement.

If the matrix D in Lemma 28 satisfies the invertibility conditions, we may obtain


" # " #−1 " # " #!
Y1 (t) δ1 δ2 R(t) δ0
= 1 1
− 1
. (5544)
Y2 (t) τ̄ C1 (τ̄ ) τ̄ C2 (τ̄ ) L(t) τ̄ A(τ̄ )

Then using Equations 5428, 5429, 5525 and 5544, get
" # " #" #
dR(t) δ1 δ2 dY1 (t)
= 1 1
(5545)
dL(t) τ̄ C1 (τ̄ ) τ̄ C2 (τ̄ ) dY2 (t)
" # " #" # " #!
δ1 δ2 λ1 0 Y1 (t) dW̃1 (t)
= 1 1
− dt + (5546)
τ̄ C1 (τ̄ ) τ̄ C2 (τ̄ ) λ21 λ2 Y2 (t) dW̃2 (t)
" #" #" #−1 " #
δ1 δ2 λ1 0 δ1 δ2 δ0
= 1 1 1 1 1
dt (5547)
τ̄ C1 (τ̄ ) τ̄ C2 (τ̄ ) λ21 λ2 τ̄ C1 (τ̄ ) τ̄ C2 (τ̄ ) τ̄ A(τ̄ )
" #" #" #−1 " # " #" #
δ1 δ2 λ1 0 δ1 δ2 R(t) δ1 δ2 dW̃1 (t)
− 1 1 1 1
dt + 1 1
.
C
τ̄ 1 (τ̄ ) C
τ̄ 2 (τ̄ ) λ21 λ 2 C
τ̄ 1 (τ̄ ) C
τ̄ 2 (τ̄ ) L(t) τ̄ 1 (τ̄ )
C τ̄ C2 (τ̄ ) dW̃2 (t)

The parameters a1 , a2 corresponding to the SDEs for the two-factor Vasicek form (Equations 5424, 5425)
are given by
" # " #" #" #−1 " #
a1 δ1 δ2 λ1 0 δ1 δ2 δ0
= 1 1 1 1 1
. (5548)
a2 τ̄ C1 (τ̄ ) τ̄ C2 (τ̄ ) λ21 λ2 τ̄ C1 (τ̄ ) τ̄ C2 (τ̄ ) τ̄ A(τ̄ )

Also,
" # " #" #" #−1
b11 b12 δ1 δ2 λ1 0 δ1 δ2
B= = 1 1 1 1
, (5549)
b21 b22 τ̄ C1 (τ̄ ) τ̄ C2 (τ̄ ) λ21 λ2 τ̄ C1 (τ̄ ) τ̄ C2 (τ̄ )
p p
with positive eigenvalues λ1 and λ2 . Furthermore, writing σ1 = δ12 + δ22 and σ2 = τ̄1 C12 (τ̄ ) + C22 (τ̄ ),
1  
B̃1 (t) = δ1 W̃1 (t) + δ2 W̃2 (t) , (5550)
σ1
1  
B̃2 (t) = C1 (τ̄ )W̃1 (t) + C2 (τ̄ )W̃2 (t) , (5551)
σ2 τ̄
and 0 = 2 = 0, 1 = 1 for Equation 5427.
The canonical two factor Vasicek model can be written

dY (t) = −ΛY (t)dt + dW̃ (t), (5552)

where
" # " # " #
Y1 (t) λ1 0 W̃1 (t)
Y (t) = , Λ= , W̃ (t) = . (5553)
Y2 (t) λ21 λ2 W̃2 (t)

Consider the definition of the matrix exponential (Definition 85); here we give the exponential of Λt.

Lemma 29. Denote Λ = [ λ1  0 ; λ21  λ2 ]. Then

exp(Λt) = [ exp(λ1t)  0 ; (λ21/(λ1 − λ2))(exp(λ1t) − exp(λ2t))  exp(λ2t) ],   λ1 ≠ λ2,
exp(Λt) = [ exp(λ1t)  0 ; λ21 t exp(λ1t)  exp(λ1t) ],   λ1 = λ2.    (5554)

Either way,

(d/dt) exp(Λt) = Λ exp(Λt) = exp(Λt)Λ,    (5555)

where exp(−Λt) = exp(Λt)⁻¹ is obtained by setting λ1 → −λ1, λ2 → −λ2, λ21 → −λ21.

Proof. Shreve [19]. First we claim that for λ1 6= λ2 , we have
" #
n
(λ 1 t) 0
(Λt)n = λn −λn , n = 0, 1, · · · . (5556)
λ21 tn λ11 −λ22 (λ2 t)n

The base case for n = 0 is trivial and evaluates to (Λt)0 = 12 . Then we assume this holds for some n > 0
and write

(Λt)n+1 = (Λt)(Λt)n (5557)


" #" #
n
λ1 t 0 (λ1 t) 0
= n
λ −λ n (5558)
λ21 t λ2 t λ21 tn λ11 −λ22 (λ2 t)n
" #
(λ1 t)n+1 0
= n+1

n λn n
1 −λ2
 (5559)
λ21 t λ1 + λ2 λ1 −λ2 (λ2 t)n+1
" #
(λ1 t)n+1 0
= λn+1 −λn+1 . (5560)
λ21 tn+1 1 λ1 −λ22 (λ2 t)n+1

By mathematical induction, our claim is true. Then by definition of matrix exponent and matrix sum-
mations, we may write

X 1
exp(Λt) = (Λt)n (5561)
0
n!
" P∞ #
1
(λ1 t)n 0
= λ21 P∞ 1 0 n! n P∞ 1 n
 P∞ 1 n
(5562)
λ1 −λ2 0 n! (λ1 t) − 0 n! (λ2 t) 0 n! (λ2 t)
" #
exp(λ1 t) 0
= λ21
(5563)
λ1 −λ2 (exp(λ1 t) − exp(λ2 t)) exp(λ2 t)

When λ1 = λ2 , we claim
" #
n (λ1 t)n 0
(Λt) = , n = 1, 2, · · · . (5564)
nλ21 λn−1
1 tn (λ1 t)n

The same method of induction can be used to verify this. Using the result
∞ ∞
X n n−1 n δ X 1 δ
λ21 λ1 t = λ21 (λ1 t)n = λ21 exp(λ1 t) = λ21 t exp(λ1 t), (5565)
0
n! δλ1 0
n! δλ 1

we can write the matrix exponent



" P #
∞ 1 n
1 0 n! (λ1 t) 0
X
n
exp(Λt) = (Λt) = ∞ n n−1 n P∞ 1 n
(5566)
n!
P
0
λ21 0 n! λ1 t 0 n! (λ1 t)
" #
exp(λ1 t) 0
as the matrix . It is straightforward to compute the derivatives componentwise,
λ21 t exp(λ1 t) exp(λ1 t)
and it is also straightforward to compute the matrix inverses,
" # " #
exp(−λ1 t) 0 exp(−λ1 t) 0
exp(−Λt) = λ21
, (5567)
λ1 −λ2 (exp(−λ1 t) − exp(−λ2 t)) exp(−λ2 t) −λ21 t exp(−λ1 t) exp(−λ1 t)

respectively for cases λ1 6= λ2 and λ1 = λ2 .
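A quick numerical check of the closed form in Lemma 29 against a general-purpose matrix exponential; the parameter values are illustrative assumptions.

# Hedged sketch: verify the closed form of Lemma 29 (lambda1 != lambda2 case)
# against scipy.linalg.expm. Parameter values are illustrative assumptions.
import numpy as np
from scipy.linalg import expm

lam1, lam2, lam21, t = 0.9, 0.4, 0.2, 2.5
Lambda = np.array([[lam1, 0.0], [lam21, lam2]])

closed_form = np.array([
    [np.exp(lam1 * t), 0.0],
    [lam21 / (lam1 - lam2) * (np.exp(lam1 * t) - np.exp(lam2 * t)), np.exp(lam2 * t)],
])
print(np.allclose(closed_form, expm(Lambda * t)))   # True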

By Equation 5552, with function f (t, x) = exp(Λt)x with derivatives ft (t, x) = Λx exp(Λt), fx (t, x) =
exp(Λt), ftx (t, x) = Λ exp(Λt) and fxx (x) = 0, the Ito Doeblin formula on f (Y (t)) evaluates to

d (exp(Λt)Y (t)) = ΛY (t) exp(Λt)dt + exp(Λt)dY (t) + Λ exp(Λt)dtdY (t) (5568)


= exp(Λt) (ΛY (t)dt + dY (t)) (5569)
= exp(Λt)dW̃ (t), see Equation 5552. (5570)

Integrating,
Z t
exp(Λt)Y (t) − Y (0) = exp(Λu)dW̃ (u), (5571)
0
Z t
Y (t) = exp(−Λt)Y (0) + exp(−Λt) exp(Λu)dW̃ (u), (5572)
0
Z t
= exp(−Λt)Y (0) + exp(−Λ(t − u))dW̃ (u). (5573)
0

Recall the Equations 5567 for inverses. If λ1 6= λ2 , then


Z t
Y1 (t) = exp(−λ1 t)Y1 (0) + exp(−λ1 (t − u))dW̃1 (u), (5574)
0
λ21
Y2 (t) = (exp(−λ1 t) − exp(−λ2 t)) Y1 (0) + exp(−λ2 t)Y2 (0) (5575)
λ1 − λ2
Z t Z t
λ21
+ (exp(−λ1 (t − u)) − exp(−λ2 (t − u))) dW̃1 (u) + exp(−λ2 (t − u))dW̃2 (u)
λ1 − λ2 0 0

and if λ1 = λ2 , then
Z t
Y1 (t) = exp(−λ1 t)Y1 (0) + exp(−λ1 (t − u))dW̃1 (u), (5576)
0
Z t
Y2 (t) = −λ21 t exp(−λ1 t)Y1 (0) + exp(−λ1 t)Y2 (0) − λ21 (t − u) exp(−λ1 (t − u))dW̃1 (u)(5577)
0
Z t
+ exp(−λ1 (t − u))dW̃2 (u) (5578)
0

are the componentwise expressions. The integrands are nonrandom; the processes Y1(t), Y2(t) are Gaussian (see Exercise ??) and therefore R(t) = δ0 + δ1Y1(t) + δ2Y2(t) is normally distributed. There is positive probability that R(t) < 0. Next, we look at a model where the factors are guaranteed to be nonnegative at all times almost surely.
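To illustrate the last point, a short Euler-Maruyama simulation of the canonical factors (Equations 5428-5430) makes the positive probability of negative rates visible; parameters are illustrative assumptions.

# Hedged sketch: Euler-Maruyama simulation of the canonical two-factor Vasicek
# factors (Equations 5428-5430); reports the fraction of paths with R(T) < 0.
# Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
lam1, lam2, lam21 = 0.9, 0.4, 0.2
d0, d1, d2 = 0.02, 0.01, 0.015
T, n_steps, n_paths = 5.0, 500, 100_000
dt = T / n_steps

Y1 = np.zeros(n_paths)
Y2 = np.zeros(n_paths)
for _ in range(n_steps):
    dW1 = np.sqrt(dt) * rng.standard_normal(n_paths)
    dW2 = np.sqrt(dt) * rng.standard_normal(n_paths)   # independent of dW1
    Y1 += -lam1 * Y1 * dt + dW1
    Y2 += (-lam21 * Y1 - lam2 * Y2) * dt + dW2
R = d0 + d1 * Y1 + d2 * Y2
print((R < 0).mean())        # strictly positive: Gaussian rates can go negative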

13.8.1.2 Two-Factor CIR Model

Define the interest rate by the model

R(t) = δ0 + δ1 Y1 (t) + δ2 Y2 (t), (5579)

except here we enforce conditions δ0 ≥ 0, δ1 > 0, δ2 > 0 and assume R(0) ≥ 0. Then we have R(t) ≥ 0
for all t ≥ 0 almost surely. The evolution of the factor processes in the canonical two-factor CIR model
follows
p
dY1 (t) = (µ1 − λ11 Y1 (t) − λ12 Y2 (t)) dt + Y1 (t)dW̃1 (t), (5580)
p
dY2 (t) = (µ2 − λ21 Y1 (t) − λ22 Y2 (t)) dt + Y2 (t)dW̃2 (t). (5581)

We assume that

µ1 ≥ 0, µ2 ≥ 0, λ11 > 0, λ22 > 0, λ12 ≤ 0, λ21 ≤ 0. (5582)

These conditions specify a process for which the drift

µi − λi1 Y1 (t) − λi2 Y2 (t) (5583)

can be negative but is nonnegative whenever Yi (t) = 0 and Y{1,2}\i ≥ 0. When Y1 (0), Y2 (0) ≥ 0, the
processes Y1 (t) ≥ 0, Y2 (t) ≥ 0 almost surely. It is assumed that W̃1 (t) ⊥ W̃2 (t), which is required for the
affine property as we shall see.
In this canonical two-factor CIR model, again by the Markov property there ∃f (t, y1 , y2 ) satisfying
B(t, T ) = f (t, Y1 (t), Y2 (t)). We write d (D(t)B(t, T )) = d (D(t)f (t, Y1 (t), Y2 (t)))

= −RDf dt + Ddf (5584)


 
1 1
= D −Rf dt + ft dt + fy1 dY1 + fy2 dY2 + fy1 y2 dY1 dY2 + fy1 y1 dY1 dY1 + fy2 y2 dY2 dY2 (5585)
2 2
 
1 1
= D −(δ0 + δ1 Y1 + δ2 Y2 )f + ft + (µ1 − λ11 Y1 − λ12 Y2 )fy1 + (µ2 − λ21 Y1 − λ22 Y2 )fy2 + Y1 fy1 y1 + Y2 fy2 y2 dt
2 2
hp p i
+ Y1 fy1 dW̃1 + Y2 fy2 dW̃2 . (5586)

By martingale property we arrive at the PDE


1 1
−(δ0 + δ1 y1 + δ2 y2 )f + ft + (µ1 − λ11 y1 − λ12 y2 )fy1 + (µ2 − λ21 y1 − λ22 y2 )fy2 + y1 fy1 y1 + y2 fy2 y2 = 0(5587)
2 2
for all t ∈ [0, T ) and all y1 ≥ 0, y2 ≥ 0. We want to find a solution of the affine yield form

f (t, y1 , y2 ) = exp {−y1 C1 (τ ) − y2 C2 (τ ) − A(τ )} (5588)

for some functions C1 (τ ), C2 (τ ), A(τ ) and time to maturity τ = T − t. The terminal conditions

f (T, Y1 (T ), Y2 (T )) = B(T, T ) = 1 (5589)

imply the initial conditions C1(0) = C2(0) = A(0) = 0. As before, since (δ/δt)Ci(τ) = −Ci′(τ) and (δ/δt)A(τ) = −A′(τ), we may write the PDE of Equation 5587 (using the differentials found in Equations 5495 and a method similar to that used in obtaining Equation 5501) as

[(C1′ + λ11C1 + λ21C2 + ½C1² − δ1)y1 + (C2′ + λ12C1 + λ22C2 + ½C2² − δ2)y2 + (A′ − µ1C1 − µ2C2 − δ0)] f = 0.   (5590)

By the same reasoning as before, these correspond to the system of ordinary differential equations

C1′(τ) = −λ11C1(τ) − λ21C2(τ) − ½C1²(τ) + δ1,   (5591)
C2′(τ) = −λ12C1(τ) − λ22C2(τ) − ½C2²(τ) + δ2,   (5592)
A′(τ) = µ1C1(τ) + µ2C2(τ) + δ0.   (5593)

The solution to this system of ODEs with the initial conditions can be found by numerical methods. If we did not assume W̃1(t) ⊥ W̃2(t), we would obtain an additional ρ√(Y1(t)Y2(t)) fy1y2 term in the dt part of Equation 5586, and the resulting PDE would not lead to the same ODEs.
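As an illustration of that numerical step, here is a minimal sketch that integrates the system (5591)-(5593) in the time-to-maturity variable τ and then evaluates the affine form (5588). The function name cir2_bond_price and all parameter values are hypothetical, chosen only to satisfy the sign conditions in Equation 5582.

import numpy as np
from scipy.integrate import solve_ivp

def cir2_bond_price(tau, y1, y2, delta0, delta1, delta2, mu1, mu2, lam11, lam12, lam21, lam22):
    # Right-hand side of the ODE system (5591)-(5593) in the variable tau = T - t.
    def rhs(_, z):
        C1, C2, A = z
        dC1 = -lam11 * C1 - lam21 * C2 - 0.5 * C1 ** 2 + delta1
        dC2 = -lam12 * C1 - lam22 * C2 - 0.5 * C2 ** 2 + delta2
        dA = mu1 * C1 + mu2 * C2 + delta0
        return [dC1, dC2, dA]
    sol = solve_ivp(rhs, (0.0, tau), [0.0, 0.0, 0.0], rtol=1e-8)  # initial conditions C1(0) = C2(0) = A(0) = 0
    C1, C2, A = sol.y[:, -1]
    return np.exp(-y1 * C1 - y2 * C2 - A)                         # affine form, Equation 5588

# Example call with illustrative parameters satisfying Equation 5582.
print(cir2_bond_price(tau=2.0, y1=0.02, y2=0.01, delta0=0.01, delta1=1.0, delta2=1.0,
                      mu1=0.03, mu2=0.03, lam11=0.5, lam12=-0.1, lam21=-0.1, lam22=0.4))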

13.8.1.3 Mixed Model

In the CIR model, both factors were almost surely nonnegative. In the mixed model, we allow one to
become negative. The canonical two-factor mixed model can be written
dY1(t) = (µ − λ1Y1(t)) dt + √(Y1(t)) dW̃1(t),   (5594)
dY2(t) = −λ2Y2(t) dt + σ21√(Y1(t)) dW̃1(t) + √(α + βY1(t)) dW̃2(t).   (5595)

Assume µ ≥ 0, λ1 > 0, λ2 > 0, α ≥ 0, β ≥ 0, σ21 ∈ R, W̃1 (t) ⊥ W̃2 (t), Y1 (0) ≥ 0. Then Y1 (t) ≥ 0 almost
surely and Y2 (t) ∈ R. The interest rate takes equation

R(t) = δ0 + δ1 Y1 (t) + δ2 Y2 (t), (5596)

and zcb bond prices take affine-yield form

B(t, T ) = exp {−Y1 (t)C1 (τ ) − Y2 (t)C2 (τ ) − A(τ )} . (5597)

A system of ordinary differential equations may be obtained, and can be solved together with the initial conditions C1(0) = C2(0) = A(0) = 0 as in the other models.

13.8.2 Heath-Jarrow-Morton
The HJM model models the forward rates and evolves the yield curve forward in time. For a fixed time
horizon T̄ , 0 ≤ t ≤ T ≤ T̄ denote B(t, T ) as the price of the T-maturity risk-free zcb with face value
one. For all 0 ≤ t ≤ T ≤ T̄ , the bond price B(t, T ) < 1 if R(t) > 0 and t < T . However, it is possible for
some implementations of the HJM to violate this assumption. For some positive δ, if we (i) take a short position in one T-maturity bond with income B(t, T), and (ii) go long B(t, T)/B(t, T + δ) units of (T + δ)-maturity zcb bonds, the net capital outlay is zero. We call this forward investing. This is because we agree to invest 1 at T and receive B(t, T)/B(t, T + δ) ≥ 1 at T + δ. The yield y applied to the one dollar invested at T and returning B(t, T)/B(t, T + δ) at T + δ (Definition 458) satisfies 1 = [B(t, T)/B(t, T + δ)] exp(−δy), a fact we write as any one of the equivalent forms

y = −(1/δ) log[B(t, T + δ)/B(t, T)] = (1/δ) log[B(t, T)/B(t, T + δ)] = −[log B(t, T + δ) − log B(t, T)]/δ.   (5598)
See that if the bond price B(t, T + δ) is smaller than B(t, T), the yield is strictly positive. This yield is F(t)-measurable, since it is agreed upon at the earlier time t; any other interest rate would admit arbitrage. Letting δ → 0, we obtain the (instantaneous) forward rate

f(t, T) = −lim_{δ→0} [log B(t, T + δ) − log B(t, T)]/δ   (5599)
        = −(δ/δT) log B(t, T).   (5600)
On the other hand, if we knew f(t, T) for all values of 0 ≤ t ≤ T ≤ T̄, we would be able to recover the values of B(t, T) by the relation

∫_t^T f(t, v)dv = −∫_t^T (δ/δv) log B(t, v)dv = −[log B(t, T) − log B(t, t)] = −log B(t, T),   (5601)

since B(t, t) = 1. Then we can write

B(t, T) = exp{−∫_t^T f(t, v)dv},  0 ≤ t ≤ T ≤ T̄.   (5602)

The bond prices and forward rates can therefore be determined from each other. The model for the forward rates and the model for the bond prices are, from a theoretical standpoint, the same.
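As a small sanity check of this equivalence, the following sketch converts a hypothetical discrete zcb curve into piecewise-constant forward rates via Equation 5600 and recovers the bond prices via Equation 5602; the numbers are illustrative only.

import numpy as np

mats = np.array([0.0, 0.5, 1.0, 1.5, 2.0])          # maturities T (hypothetical grid)
zcb = np.array([1.0, 0.985, 0.968, 0.949, 0.929])   # B(0, T) on that grid (hypothetical prices)
# Equation 5600 by finite differences: piecewise-constant forward rate on each interval.
fwd = -np.diff(np.log(zcb)) / np.diff(mats)
# Equation 5602: B(0, T) = exp(-∫_0^T f(0, v)dv), recovered by summing the piecewise integrals.
recovered = np.concatenate([[1.0], np.exp(-np.cumsum(fwd * np.diff(mats)))])
assert np.allclose(recovered, zcb)                   # the round trip reproduces the input curve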

13.8.2.1 Dynamics of Forward Rates and Bond Prices

Assume an initial forward rate curve f (0, T ), 0 ≤ T ≤ T̄ . In the HJM model, the forward rates are given
by
Z t Z t
f (t, T ) = f (0, T ) + α(u, T )du + σ(u, T )dW (u) (5603)
0 0

where 0 ≤ t ≤ T . In differential form this is written

df (t, T ) = α(t, T )dt + σ(t, T )dW (t), t ∈ [T ]. (5604)

The differential is with respect to t. The variable T is constant. α(t, T), σ(t, T) are allowed to be random processes. Since −∫_t^T f(t, v)dv has the variable t in two different places, its differential is

d(−∫_t^T f(t, v)dv) = f(t, t)dt − ∫_t^T df(t, v)dv   (5605)
                    = R(t)dt − ∫_t^T [α(t, v)dt + σ(t, v)dW(t)]dv.   (5606)

Reversing the order of integration (verify this is valid) we write


∫_t^T α(t, v)dt dv = ∫_t^T α(t, v)dv dt = α∗(t, T)dt,   (5607)
∫_t^T σ(t, v)dW(t)dv = ∫_t^T σ(t, v)dv dW(t) = σ∗(t, T)dW(t),   (5608)

where

α∗(t, T) = ∫_t^T α(t, v)dv,  σ∗(t, T) = ∫_t^T σ(t, v)dv.   (5609)

It follows that

d(−∫_t^T f(t, v)dv) = R(t)dt − α∗(t, T)dt − σ∗(t, T)dW(t).   (5610)

Let g(x) = exp(x); then we have (Equation 5602)

B(t, T) = g(−∫_t^T f(t, v)dv),   (5611)

and by Ito Doeblin we have (Equations 5609, 5610)

dB(t, T) = g′(−∫_t^T f(t, v)dv) d(−∫_t^T f(t, v)dv) + ½g″(−∫_t^T f(t, v)dv)[d(−∫_t^T f(t, v)dv)]²
         = B(t, T)[R(t)dt − α∗(t, T)dt − σ∗(t, T)dW(t)] + ½B(t, T)(σ∗(t, T))²dt   (5612)
         = B(t, T)[R(t) − α∗(t, T) + ½(σ∗(t, T))²]dt − σ∗(t, T)B(t, T)dW(t).   (5613)

To preclude arbitrage, we shall seek a probability measure P̃ s.t.

D(t)B(t, T) = exp(−∫_0^t R(u)du) B(t, T),  t ∈ [T]   (5614)

is a martingale. Since (using Equation 5613)

d(D(t)B(t, T)) = −R(t)D(t)B(t, T)dt + D(t)dB(t, T)   (5615)
             = D(t)B(t, T){[−α∗(t, T) + ½(σ∗(t, T))²]dt − σ∗(t, T)dW(t)},   (5616)

to get the change of measure we want to express

[−α∗(t, T) + ½(σ∗(t, T))²]dt − σ∗(t, T)dW(t) = −σ∗(t, T)[Θ(t)dt + dW(t)].   (5617)
By Girsanov Theorem 423, if we write
Z t
W̃ (t) = Θ(u)du + W (t), (5618)
0

then W̃ (t) is P̃ Brownian, and

d(D(t)B(t, T )) = −D(t)B(t, T )σ ∗ (t, T )dW̃ (t). (5619)

The market price of risk equations at the different maturities T ∈ (0, T̄] require solving for Θ(t) satisfying

−α∗(t, T) + ½(σ∗(t, T))² = −σ∗(t, T)Θ(t).   (5620)

That is, we have one market price of risk equation for each bond maturity: Θ(t) should satisfy Equation 5620 for all maturities T ∈ (0, T̄]. The number of Θ processes matches the number of sources of uncertainty, and since we assumed a single Brownian motion, we have one such process. Recall (Equation 5609)

(δ/δT)α∗(t, T) = α(t, T),  (δ/δT)σ∗(t, T) = σ(t, T).   (5621)
Then differentiating Equation 5620 w.r.t. T, we get

−α(t, T) + σ∗(t, T)σ(t, T) = −σ(t, T)Θ(t),   (5622)
α(t, T) = σ(t, T)[σ∗(t, T) + Θ(t)].   (5623)

Theorem 444 (HJM No-Arbitrage Condition). A term structure model specified for zcb maturities
T ∈ (0, T̄ ] and driven by single Brownian does not admit arbitrage if there ∃Θ(t) s.t.

α(t, T ) = σ(t, T ) [σ ∗ (t, T ) + Θ(t)] (5624)

holds for all 0 ≤ t ≤ T ≤ T̄. Here α(t, T), σ(t, T) represent the drift and diffusion terms respectively of the forward rate, that is, Equation 5604,

df (t, T ) = α(t, T )dt + σ(t, T )dW (t), 0 ≤ t ≤ T. (5625)


Additionally, the process σ∗(t, T) = ∫_t^T σ(t, v)dv, and Θ(t) is the market price of risk.

Proof. If the market price of risk Θ(t) solves Equation 5624, we aim to show it must satisfy the market
price of risk equations (Equation 5620). Then the Girsanov’s Theorem 423 applies and we can construct
a risk-neutral measure, which by the Second Fundamental Theorem of Asset Pricing 426, asserts the
no-arbitrage condition. Suppose Θ(t) solves Equation 5624; then integrating the equation in the maturity variable from t to T, we obtain

∫_t^T α(t, v)dv = ∫_t^T σ(t, v)[σ∗(t, v) + Θ(t)]dv,   (5626)
[α∗(t, v)]_t^T = ½[(σ∗(t, v))²]_t^T + [σ∗(t, v)Θ(t)]_t^T,   (5627)
α∗(t, T) = ½(σ∗(t, T))² + σ∗(t, T)Θ(t).   (5628)

The last equation follows since α∗(t, t) = σ∗(t, t) = 0.

The market price of risk can be found from Equation 5624 as

Θ(t) = α(t, T)/σ(t, T) − σ∗(t, T),  0 ≤ t ≤ T,   (5629)

which is defined for σ(t, T) ≠ 0. Θ(t) is unique, so the risk-neutral measure is unique, and again Theorem 426 asserts that the model is complete.

13.8.2.2 Heath-Jarrow-Morton under Risk-Neutrality

Assuming the no-arbitrage condition (Theorem 444) is satisfied, we may rewrite the SDE in Equation 5604 as

df(t, T) = α(t, T)dt + σ(t, T)dW(t)   (5630)
         = σ(t, T)σ∗(t, T)dt + σ(t, T)[Θ(t)dt + dW(t)]  (Equation 5624)   (5631)
         = σ(t, T)σ∗(t, T)dt + σ(t, T)dW̃(t),   (5632)

where W̃ (t) is Brownian obtained from change-of-measure Equation 5618 asserted by Girsanov’s Theorem
423, and the bond-price differential may be written (using the market price of risk) as Equation 5619

d(D(t)B(t, T )) = −D(t)B(t, T )σ ∗ (t, T )dW̃ (t). (5633)

Recall (Equation 4621) that d(1/D(t)) = [R(t)/D(t)]dt, so by the Ito product rule (Theorem 30),

dB(t, T) = d[(1/D(t)) · D(t)B(t, T)]   (5634)
         = [R(t)/D(t)]D(t)B(t, T)dt − σ∗(t, T)B(t, T)dW̃(t)   (5635)
         = R(t)B(t, T)dt − σ∗(t, T)B(t, T)dW̃(t).   (5636)

Theorem 445 (Risk-Neutral Evolution of Term Structure). A term structure model satisfying the HJM no-arbitrage condition (Theorem 444) has forward rates that evolve under the SDE (Equation 5632)

df (t, T ) = σ(t, T )σ ∗ (t, T )dt + σ(t, T )dW̃ (t), (5637)

where W̃ (t) is risk-neutral measure P̃ Brownian. Additionally, the zcb prices evolve by sde Equation 5636

dB(t, T ) = R(t)B(t, T )dt − σ ∗ (t, T )B(t, T )dW̃ (t). (5638)


The process σ∗(t, T) = ∫_t^T σ(t, v)dv, and R(t) = f(t, t) is the interest rate. Furthermore, the discounted bond process satisfies

d(D(t)B(t, T )) = −σ ∗ (t, T )D(t)B(t, T )dW̃ (t), (5639)


where D(t) is the discount process exp(−∫_0^t R(u)du). It then follows that the solution for B(t, T) takes the form

B(t, T) = B(0, T) exp{∫_0^t R(u)du − ∫_0^t σ∗(u, T)dW̃(u) − ½∫_0^t (σ∗(u, T))²du}   (5640)
        = [B(0, T)/D(t)] exp{−∫_0^t σ∗(u, T)dW̃(u) − ½∫_0^t (σ∗(u, T))²du}.   (5641)

13.8.2.3 Relation to the Affine-Yield Models

Every term-structure model driven by Brownian motion is an HJM model. The forward rates specified under the term-structure model must satisfy the no-arbitrage condition (Theorem 444), which asserts that forward rates and bonds evolve as in Theorem 445 under the risk-neutral measure.
Consider the one-factor Hull White and CIR models (Exercises 695 and 701, 664 and 702 respectively),
which specify short rate dynamics

dR(t) = β(t, R(t))dt + γ(t, R(t))dW̃ (t), (5642)

where W̃ (t) is P̃ risk-neutral measure Brownian. In particular, the Hull-White model specifies the
parameters

β(t, r) = a(t) − b(t)r, γ(t, r) = σ(t) (5643)

for a(t), b(t), σ(t) > 0, while the CIR model specifies

β(t, r) = a − br,  γ(t, r) = σ√r,   (5644)

for some a, b, σ ∈ R+ . We assume affine yield, in particular recall (Exercises 695,701, 664, 702) that
zero-coupon bond prices take form

B(t, T ) = exp(−R(t)C(t, T ) − A(t, T )), (5645)

where in the Hull-White model we have (see Exercise 701)


C(t, T) = ∫_t^T exp(−∫_t^s b(v)dv) ds,   (5646)
A(t, T) = ∫_t^T [a(s)C(s, T) − ½σ²(s)C²(s, T)] ds,   (5647)

and for the CIR model (see Equations 4661, 4662),

C(t, T) = sinh(γ(T − t)) / [γ cosh(γ(T − t)) + ½b sinh(γ(T − t))],   (5648)
A(t, T) = −(2a/σ²) log{ γ exp(−½b(T − t)) / [γ cosh(γ(T − t)) + ½b sinh(γ(T − t))] }.   (5649)

By Equation 5600 and Equation 5645, we can write

f(t, T) = −(δ/δT) log B(t, T) = R(t)(δ/δT)C(t, T) + (δ/δT)A(t, T).   (5650)
The forward rate differential is given using the Ito product rule (Theorem 30):

df(t, T) = (δ/δT)C(t, T)dR(t) + R(t)(δ/δT)C′(t, T)dt + (δ/δT)A′(t, T)dt   (5651)
         = [(δ/δT)C(t, T)β(t, R(t)) + R(t)(δ/δT)C′(t, T) + (δ/δT)A′(t, T)]dt + (δ/δT)C(t, T)γ(t, R(t))dW̃(t).
δT δT δT δT

The diffusion term implies that this is an HJM model with (see Equation 5632)

σ(t, T) = (δ/δT)C(t, T)γ(t, R(t)).   (5652)

Additionally, since we are working under the risk-neutral P̃, Equation 5637 of Theorem 445 asserts that the drift term is σ(t, T)σ∗(t, T) = σ(t, T)∫_t^T σ(t, v)dv = (δ/δT)C(t, T)γ(t, R(t)) ∫_t^T (δ/δv)C(t, v)γ(t, R(t))dv, and therefore we derive

(δ/δT)C(t, T)β(t, R(t)) + R(t)(δ/δT)C′(t, T) + (δ/δT)A′(t, T)   (5653)
= (δ/δT)C(t, T)γ(t, R(t)) ∫_t^T (δ/δv)C(t, v)γ(t, R(t))dv   (5654)
= (δ/δT)C(t, T)γ(t, R(t))[C(t, T) − C(t, t)]γ(t, R(t))   (5655)
= (δ/δT)C(t, T)γ²(t, R(t))C(t, T).   (5656)
Exercise 718. Verify the no-arbitrage condition for affine-yield models, Equation 5656, in the case of
a Vasicek model.

Proof. The Vasicek model takes the Hull-White form with constant parameters, that is, Equation 5643 with a(t) = a, b(t) = b, σ(t) = σ. Then Equations 5646 and 5647 give

C(t, T) = (1/b)(1 − exp(−b(T − t))),  A′(t, T) = −aC(t, T) + ½σ²C²(t, T),   (5657)

and therefore (δ/δT)C(t, T) = exp(−b(T − t)) and (δ/δT)C′(t, T) = b exp(−bτ). Additionally,

(δ/δT)A′(t, T) = −a(δ/δT)C(t, T) + σ²C(t, T)(δ/δT)C(t, T)   (5658)
             = −a exp(−bτ) + (σ²/b)(1 − exp(−bτ)) exp(−bτ),   (5659)

that is, (δ/δT)A′(t, T) = (σ²/b − a) exp(−b(T − t)) − (σ²/b) exp(−2b(T − t)). Then (using Equation 5652) σ(t, T) = σ exp(−b(T − t)) and

σ∗(t, T) = ∫_t^T σ exp(−b(T − u))du = (σ/b)(1 − exp(−b(T − t))),   (5660)

and making the substitutions we get

(δ/δT)C(t, T)β(t, R(t)) + R(t)(δ/δT)C′(t, T) + (δ/δT)A′(t, T)   (5661)
= exp(−b(T − t))(a − bR(t)) + R(t)b exp(−b(T − t)) + (σ²/b − a) exp(−b(T − t)) − (σ²/b) exp(−2b(T − t))
= (σ²/b)[exp(−b(T − t)) − exp(−2b(T − t))]   (5662)
= σ exp(−b(T − t)) · (σ/b)(1 − exp(−b(T − t)))   (5663)
= σ(t, T)σ∗(t, T).   (5664)
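The algebra above can also be checked mechanically. Below is a minimal sympy sketch of the verification; the symbol names are arbitrary, and the check simply confirms that Equation 5661 reduces to σ(t, T)σ∗(t, T).

import sympy as sp

a, b, sigma, R, t, T, u = sp.symbols('a b sigma R t T u', positive=True)
C = (1 - sp.exp(-b*(T - t))) / b                               # C(t, T), Equation 5657
A_prime = -a*C + sp.Rational(1, 2)*sigma**2*C**2               # A'(t, T), Equation 5657
lhs = (sp.diff(C, T)*(a - b*R)                                 # (δ/δT)C · β(t, R)
       + R*sp.diff(sp.diff(C, t), T)                           # R · (δ/δT)C'(t, T)
       + sp.diff(A_prime, T))                                  # (δ/δT)A'(t, T)
sigma_tT = sigma*sp.exp(-b*(T - t))                            # σ(t, T) in the Vasicek case, Equation 5652
sigma_star = sp.integrate(sigma*sp.exp(-b*(T - u)), (u, t, T)) # σ*(t, T), Equation 5660
assert sp.simplify(lhs - sigma_tT*sigma_star) == 0             # Equation 5664 holds identically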

13.8.2.4 Implementation of HJM

Shreve [19]. Since Equation 5604 and Equation 5637 specify differentials under their respective measures featuring the same σ(t, T), we may estimate σ(t, T) from historical data. Then σ∗(t, T) is obtained as ∫_t^T σ(t, v)dv by definition, and together with the initial forward curve f(0, T), we may determine the variables required in Theorem 445. We may compute the short rates

R(t) = f(t, t) = f(0, t) + ∫_0^t σ∗(u, t)σ(u, t)du + ∫_0^t σ(u, t)dW̃(u).   (5665)

Market price of risk Θ(t) and forward rate drift α(t, T ) are not considered, but if we switch back to
the actual measure and look to estimate non-diffusion terms from historical data, then we require them.
Assume that σ(t, T ) takes form σ(t, T ) = σ̃(T − t) min{M, f (t, T )}, for some σ̃(τ ), τ ≥ 0, M ∈ R+ . The
forward rate is capped to prevent explosion, but the non-linearity of the min-function makes the forward
rate not log-normally distributed. Calibrating σ̃(T − t) to historical data, the forward rate evolves by
the sde

df (t, T ) = α(t, T )dt + σ̃(T − t) min{M, f (t, T )}dW (t). (5666)

Suppose that the forward rates at t1 < t2 < · · · < tJ < 0 were historically observed, with relative
maturities τ1 < τ2 < · · · < τK s.t. f (tj , tj + τk ) is known for j ∈ [J], k ∈ [K], and also that we observed
for some small δ, the value f (tj + δ, tj + τk ). It is assumed that tj + δ < tj+1 , tJ + δ ≤ 0 holds. Then
(see Euler method, Equation 4523) we may write

f (tj + δ, tj + τk ) − f (tj , tj + τk ) ≈ δα(tj , tj + τk ) + σ̃(τk ) min{M, f (tj , tj + τk )}(W (tj + δ) − W (tj )).
(5667)

If we let

Djk = [f(tj + δ, tj + τk) − f(tj, tj + τk)] / [√δ · min{M, f(tj, tj + τk)}]   (5668)
    ≈ √δ · α(tj, tj + τk)/min{M, f(tj, tj + τk)} + σ̃(τk) · [W(tj + δ) − W(tj)]/√δ,   (5669)

then, ignoring the first term, which is O(√δ), and writing Xj = [W(tj + δ) − W(tj)]/√δ ∼ Φ(0, 1), see that Djk ≈ σ̃(τk)Xj. We have obtained independent standard normal random variables Xj, j ∈ [J], and we can treat the Djk as independent observations of the forward rates taken at t1, t2, · · · , tJ with the same relative maturity τk. The empirical covariance is given by

Ck1,k2 = (1/J) Σ_{j=1}^{J} Djk1 Djk2,   (5670)

and this is an estimator for the theoretical covariance E[σ̃(τk1)σ̃(τk2)Xj²] = σ̃(τk1)σ̃(τk2). We want to find σ̃(τi), i ∈ [K], s.t.

Ck1,k2 = σ̃(τk1)σ̃(τk2),  k1, k2 ∈ [K].   (5671)

By symmetry, we have Ck1,k2 = Ck2,k1, so in total we have a system of ½K(K + 1) equations and K unknowns. To determine the best choice of σ̃(τi), i ∈ [K], we use principal component analysis. For D = (Djk)J×K, we may express C = (Cij)K×K as (1/J)D′D, and since C is symmetric and positive semi-definite, it has the principal component decomposition C = Σ_{i=1}^{K} λi ei ei′, where λ1 ≥ λ2 ≥ · · · ≥ λK ≥ 0 are the eigenvalues of C with corresponding eigenvectors ei. The set {ei, i ∈ [K]} is orthonormal (Definition 73). We want to express C = (σ̃(τ1), σ̃(τ2), · · · , σ̃(τK))′(σ̃(τ1), σ̃(τ2), · · · , σ̃(τK)), but since this cannot be done exactly, we use the best approximation (σ̃(τ1), σ̃(τ2), · · · , σ̃(τK))′ = √λ1 e1. In the final step of the calibration, a nonrandom function s(t) is introduced s.t.

df (t, T ) = σ(t, T )σ ∗ (t, T )dt + s(t)σ̃(T − t) min{M, f (t, T )}dW̃ (t). (5672)

The volatility is assumed to take the form σ(t, T) = s(t)σ̃(T − t) min{M, f(t, T)}, such that

σ∗(t, T) = ∫_t^T σ(t, v)dv = s(t)∫_t^T σ̃(v − t) min{M, f(t, v)}dv,   (5673)

and the forward rate is evolved by the SDE Equation 5672. It turns out that this maintains the no-
arbitrage condition (verify this). The introduction of s(t) allows us to get the model to agree with the
market price.
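A minimal numerical sketch of the calibration step above follows. It builds the matrix D from simulated scaled forward-rate increments, forms the empirical covariance of Equation 5670, and takes the best rank-one fit √λ1 e1 as the estimate of (σ̃(τ1), · · · , σ̃(τK)); all inputs are hypothetical and chosen purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
J, K = 500, 4
sigma_true = np.array([0.010, 0.008, 0.006, 0.005])   # assumed "true" vols, used only to simulate data
D = rng.standard_normal((J, 1)) * sigma_true            # D_{jk} ≈ σ̃(τ_k) X_j, one X_j per historical date
C = D.T @ D / J                                          # empirical covariance, Equation 5670
eigval, eigvec = np.linalg.eigh(C)                       # eigenvalues in ascending order
lam1, e1 = eigval[-1], eigvec[:, -1]
sigma_hat = np.sqrt(lam1) * e1                           # best rank-one approximation, σ̃ ≈ sqrt(λ1) e1
sigma_hat *= np.sign(sigma_hat[0])                       # fix the arbitrary sign of the eigenvector
print(sigma_hat)                                         # should be close to sigma_true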

13.8.3 Forward LIBOR Model


Recall that under the risk-neutral measure, forward rates evolve under Equation 5637:

df (t, T ) = σ(t, T )σ ∗ (t, T )dt + σ(t, T )dW̃ (t). (5674)

We want to apply the Black-Scholes formula for equity options to fixed income derivatives. But recall that
the forward rate is not log-normally distributed. We want to build a model in which the forward rates
are indeed log-normal under the risk-neutral measure. We want to set σ(t, T ) = σf (t, T ) in Equation
5637 for this purpose. But then we get
σ∗(t, T) = ∫_t^T σ(t, v)dv = σ∫_t^T f(t, v)dv,   (5675)

and the dt term becomes σ²f(t, T)∫_t^T f(t, v)dv. It can be shown (verify this) that this drift term causes the forward rates to explode. Thus, working with continuously compounded forward rates becomes tricky, and we use forward LIBORs instead.

13.8.3.1 (Forward) LIBOR

Let 0 ≤ t ≤ T, δ > 0. Recall we may perform forward investing over [T, T + δ] by shorting one T-maturity zcb and going long B(t, T)/B(t, T + δ) units of (T + δ)-maturity zcbs for zero cost at t. The forward-investing yield, or the continuously compounded interest rate over this interval, is given by Equation 5598. We instead study
simple interest rates, and denote this L(t, T ) (instead of the L(t) previously defined for continuously
compounded rates) given by the relation

B(t, T )
1 + δL(t, T ) = , (5676)
B(t, T + δ)

which gives

B(t, T ) − B(t, T + δ)
L(t, T ) = . (5677)
δB(t, T + δ)

When 0 ≤ t < T, L(t, T) is called the forward LIBOR. When t = T, we call it the (spot) LIBOR. δ is the tenor of the LIBOR, which is usually 3M or 6M.
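For concreteness, a one-line computation of Equation 5677 from two zcb prices; the numbers are purely illustrative.

def forward_libor(B_T, B_T_plus_delta, delta):
    # Equation 5677: simple forward rate over [T, T + delta] implied by the two zcb prices.
    return (B_T - B_T_plus_delta) / (delta * B_T_plus_delta)

# Hypothetical 3M example: B(t, T) = 0.985, B(t, T + 0.25) = 0.970.
print(forward_libor(0.985, 0.970, 0.25))   # about 0.0619, i.e. a 6.19% simple annualized rate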

13.8.3.2 Backset LIBOR Contract

An interest rate swap is an agreement with two legs, taken by party A and party B. One party, say A,
makes fixed interest rate payments on some notional value to B at given dates, while B makes variable
interest rate payments on the same notional value to A. The variable (floating reference) rate is often
the backset LIBOR, which for some payment date, is the LIBOR set on the previous payment date.

Theorem 446 (Backset LIBOR Pricing). Assume notional value one. Let 0 ≤ t ≤ T + δ, δ > 0. Then
the no arbitrage price at t of contract paying L(T, T ) at T + δ is given by
(
B(t, T + δ)L(t, T ) 0 ≤ t ≤ T,
S(t) = (5678)
B(t, T + δ)L(T, T ) T ≤ t ≤ T + δ.

Proof. The case T ≤ t ≤ T + δ is trivial to reason about. When 0 ≤ t ≤ T, we may write (Equation 5677)

B(t, T + δ)L(t, T) = [B(t, T) − B(t, T + δ)]/δ.   (5679)

If our portfolio at t is worth [B(t, T) − B(t, T + δ)]/δ, and we go long 1/δ units of T-maturity zcbs and short 1/δ units of (T + δ)-maturity zcbs, then at T we may use the zcb proceeds to purchase 1/(δB(T, T + δ)) units of (T + δ)-maturity zcbs. We then hold a net 1/(δB(T, T + δ)) − 1/δ units of (T + δ)-maturity zcbs, which at maturity pay off

1/(δB(T, T + δ)) − 1/δ = [B(T, T) − B(T, T + δ)]/(δB(T, T + δ)) = L(T, T),  see Definition 5677.   (5680)

We have essentially replicated the payoff, and therefore Equation 5679 gives the value at t of the backset LIBOR contract.

13.8.3.3 Black Caplet Formula

An interest rate cap is a fixed income derivative that pays the difference between a variable reference rate and a fixed rate (the cap) applied to some principal, whenever the variable rate is greater than the fixed one. That is, for tenor δ, principal P and cap K, an interest rate cap with backset LIBOR as the reference rate has the payoff (δP L(δj, δj) − K)+ at time δ(j + 1), j = 0, 1, · · · , n. To determine the cap price, it turns out we can just price one of the payments and then sum the prices over the payments. Since (δP L(δj, δj) − K)+ = δP (L(δj, δj) − K′)+, where K′ = K/(δP), we may just price the payment (L(T, T) − K)+ at T + δ for some T. Recall Theorem 446; then the backset LIBOR contract with price S(t) and B(t, T + δ)
numeraire is expressed
(
S(t) L(t, T ), 0 ≤ t ≤ T,
= (5681)
B(t, T + δ) L(T, T ), T ≤ t ≤ T + δ.

See Equation 5365. For 0 ≤ t ≤ T , the forward LIBOR L(t, T ) is the (T + δ) forward price of the
contract paying backset LIBOR L(T, T ) at T + δ. Building a term-structure model under P satisfying
HJM no arbitrage condition (Equation 5656), then ∃W̃ (t) under risk-neutral P̃ with forward rates, bond
prices given by Theorem 445. Change of numeraire (Theorem 442) asserts the risk-neutral measure with
numeraire B(t, T + δ) is given by
Z
1
P̃T +δ (A) = D(T + δ)dP̃, ∀A ∈ F. (5682)
B(0, T + δ) A

Additionally, (see portfolio change of numeraire Exercise 713, bond differential Equation 5636)
Z t
W̃ T +δ (t) = σ ∗ (u, T + δ)du + W̃ (t) (5683)
0

is P̃T +δ Brownian. We call P̃T +δ the (T + δ)-forward measure and Martingale Representation Theorem
42 asserts ∃γ(t, T ) s.t. for fixed T , we have

dL(t, T ) = γ(t, T )L(t, T )dW̃ T +δ (t), t ∈ [T ]. (5684)

Changing to the (T + δ) forward measure using the (T + δ)-maturity zcb as numeraire, we obtain a differential without a drift term. When γ(t, T) is nonrandom, the forward LIBOR L(t, T) is log-normally distributed under P̃T+δ.
Theorem 447 (Black Caplet Formula, Shreve [19]). Assume the forward LIBOR evolves under the SDE
(Equation 5684)

dL(t, T ) = γ(t, T )L(t, T )dW̃ T +δ (t), t ∈ [T ], (5685)

where γ(t, T ) is nonrandom function. Then a caplet option paying (L(T, T ) − K)+ at T + δ, K ≥ 0 has
price at time zero

B(0, T + δ) [L(0, T )Φ(d+ ) − KΦ(d− )] , (5686)

where

d± = (1/√(∫_0^T γ²(t, T)dt)) [log(L(0, T)/K) ± ½∫_0^T γ²(t, T)dt].   (5687)
Proof. The risk-neutral pricing formula (Equation 4193) asserts that (see also Theorem 21)

Ẽ[D(T + δ)(L(T, T) − K)+] = B(0, T + δ)Ẽ[(D(T + δ)/B(0, T + δ))(L(T, T) − K)+]   (5688)
                          = B(0, T + δ)ẼT+δ(L(T, T) − K)+.   (5689)

We can solve the SDE Equation 5684 to get

L(T, T) = L(0, T) exp{∫_0^T γ(t, T)dW̃T+δ(t) − ½∫_0^T γ²(t, T)dt}.   (5690)

Since γ(t, T) is nonrandom, apply Theorem 416. In particular, ∫_0^T γ(t, T)dW̃T+δ(t) is a normal random variable with mean zero and variance γ̄²(T)T, where γ̄(T) := √((1/T)∫_0^T γ²(t, T)dt). If we let X = −(1/(γ̄(T)√T)) ∫_0^T γ(t, T)dW̃T+δ(t), then we can write ∫_0^T γ(t, T)dW̃T+δ(t) = −γ̄(T)√T X. See that X is a P̃T+δ standard normal random variable. Then

L(T, T) = L(0, T) exp{−γ̄(T)√T X − ½γ̄²(T)T}   (5691)

and

ẼT+δ(L(T, T) − K)+ = ẼT+δ[(L(0, T) exp{−γ̄(T)√T X − ½γ̄²(T)T} − K)+]   (5692)
                   = BSM(T, L(0, T), K, 0, γ̄(T))   (5693)
                   = L(0, T)Φ(d+) − KΦ(d−).   (5694)

See the derivation of Equation 4207 for the assertions of the last two equalities.
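A minimal implementation sketch of Theorem 447 follows; black_caplet and its inputs are hypothetical names, γ̄ is the root-mean-square volatility over [0, T], and the price of a cap payment on principal P with tenor δ would simply be δP times this value.

from math import log, sqrt
from statistics import NormalDist

def black_caplet(B0_T_plus_delta, L0, K, gamma_bar, T):
    # Equations 5686-5687: time-zero price of a caplet paying (L(T, T) - K)^+ at T + delta.
    v = gamma_bar * sqrt(T)
    d_plus = (log(L0 / K) + 0.5 * v * v) / v
    d_minus = d_plus - v
    N = NormalDist().cdf
    return B0_T_plus_delta * (L0 * N(d_plus) - K * N(d_minus))

# Illustrative numbers: at-the-money caplet on a 5% forward LIBOR with 20% vol, set at T = 1.
print(black_caplet(B0_T_plus_delta=0.97, L0=0.05, K=0.05, gamma_bar=0.20, T=1.0))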

13.8.3.4 Relating Forward LIBOR and ZCB Volatility


The forward LIBOR (Definition 5677) satisfies L(t, T) + 1/δ = B(t, T)/(δB(t, T + δ)). We want to obtain the
evolution of the forward LIBOR under (T + δ) forward measure. By Theorem 445 of the discounted
bond price solution to the SDE, write
L(t, T) + 1/δ = B(t, T)/(δB(t, T + δ))   (5695)
             = [B(0, T)/(δB(0, T + δ))] exp{∫_0^t [σ∗(u, T + δ) − σ∗(u, T)]dW̃(u) + ½∫_0^t [σ∗(u, T + δ)² − σ∗(u, T)²]du}.

Then L(t, T) = −1/δ + [B(0, T)/(δB(0, T + δ))] exp(X(t)), where

X(t) = ∫_0^t [σ∗(u, T + δ) − σ∗(u, T)]dW̃(u) + ½∫_0^t [σ∗(u, T + δ)² − σ∗(u, T)²]du.   (5696)

Then dX(t) = [σ∗(t, T + δ) − σ∗(t, T)]dW̃(t) + ½[σ∗(t, T + δ)² − σ∗(t, T)²]dt. Then L(t, T) = f(X(t)), where f(x) = −1/δ + [B(0, T)/(δB(0, T + δ))] exp(x) and f′(x) = f″(x) = [B(0, T)/(δB(0, T + δ))] exp(x). It is a straightforward exercise to compute

dL(t, T) = f′(X)dX(t) + ½f″(X)dX(t)dX(t)   (5697)
         = [B(0, T)/(δB(0, T + δ))] exp(X(t))dX(t) + ½[B(0, T)/(δB(0, T + δ))] exp(X(t))dX(t)dX(t)   (5698)
         = (L(t, T) + 1/δ)dX(t) + ½(L(t, T) + 1/δ)dX(t)dX(t)   (5699)
         = · · ·   (5700)
         = (L(t, T) + 1/δ)[σ∗(t, T + δ) − σ∗(t, T)][σ∗(t, T + δ)dt + dW̃(t)].   (5701)
Equation 5683 asserts

dW̃ T +δ (t) = σ ∗ (t, T + δ)dt + dW̃ (t). (5702)

Then we may write

dL(t, T) = (1/δ)(1 + δL(t, T))[σ∗(t, T + δ) − σ∗(t, T)]dW̃T+δ(t).   (5703)
Relating Equation 5684 and Equation 5703, the volatility of the forward LIBOR and the volatilities of the (T + δ)- and T-maturity zcbs are related via

γ(t, T) = [(1 + δL(t, T))/(δL(t, T))][σ∗(t, T + δ) − σ∗(t, T)].   (5704)

13.8.3.5 Forward LIBOR Term Structure Model

Recall that Equations 5676, 5682, 5702, 5685 and 5704 state:

1 + δL(t, T) = B(t, T)/B(t, T + δ),   (5705)
P̃T+δ(A) = (1/B(0, T + δ)) ∫_A D(T + δ)dP̃,  ∀A ∈ F,   (5706)
dW̃T+δ(t) = σ∗(t, T + δ)dt + dW̃(t),   (5707)
dL(t, T) = γ(t, T)L(t, T)dW̃T+δ(t),  t ∈ [T],   (5708)
γ(t, T) = [(1 + δL(t, T))/(δL(t, T))][σ∗(t, T + δ) − σ∗(t, T)]   (5709)

for 0 ≤ t ≤ T ≤ T̄ − δ. If at time zero we use the market observables to determine caplet prices for
maturity dates Tj = jδ, j ∈ [n], then we may imply the volatilities γ̄(Tj ) appearing in Theorem 447.
We want to calibrate our model to the initial term structure. Starting with T̄ = (n + 1)δ, choose a nonrandom, nonnegative function γ(t, Tj), 0 ≤ t ≤ Tj, j ∈ [n], s.t. √((1/Tj)∫_0^{Tj} γ²(t, Tj)dt) = γ̄(Tj). An instance of such a function could be γ(t, Tj) = γ̄(Tj) for t ∈ [Tj]. We may then evolve the LIBOR using dL(t, T), and the forward LIBORs agree with the market prices at least for T = Tj, j ∈ [n]. Equation
5708 at T = Tj gives us a formula for L(t, Tj) in terms of the forward Brownian motion W̃Tj+1(t), which is different for different values of j. We need to develop a relationship between the equations of this system.
Since Equation 5707 asserts

dW̃ Tj (t) = σ ∗ (t, Tj )dt + dW̃ (t), 0 ≤ t ≤ Tj , (5710)

it follows that

dW̃Tj(t) − dW̃Tj+1(t) = [σ∗(t, Tj) − σ∗(t, Tj+1)]dt = −[δγ(t, Tj)L(t, Tj)/(1 + δL(t, Tj))]dt   (5711)

for 0 ≤ t ≤ Tj, where the last step was obtained via Equation 5709. Then we may iteratively obtain dW̃Tj+1 in terms of dW̃Tn+1(t); in particular,

dW̃Tj+1(t) = −Σ_{i=j+1}^{n} [δγ(t, Ti)L(t, Ti)/(1 + δL(t, Ti))]dt + dW̃Tn+1(t),  0 ≤ t ≤ Tj+1   (5712)

for all j ∈ [n]. Then the forward LIBOR differential Equation 5708 can be expressed as

dL(t, Tj) = γ(t, Tj)L(t, Tj)[−Σ_{i=j+1}^{n} δγ(t, Ti)L(t, Ti)/(1 + δL(t, Ti)) dt + dW̃Tn+1(t)],  0 ≤ t ≤ Tj, j ∈ [n].   (5713)

A single Brownian motion drives all of the n equations, and we construct the forward LIBOR processes by first choosing W̃Tn+1(t) on (Ω, F, P̃Tn+1). It is assumed that the forward LIBORs L(0, Tj), j ∈ [n + 1], are known initially. Use the differential equation (Equation 5713) to iteratively solve for L(t, Tj) using L(t, T>j).
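A minimal Euler discretization sketch of Equation 5713 under the terminal measure follows; the tenor, the flat initial curve, the constant per-maturity volatilities γ(t, Tj) = γ̄(Tj) and the step size are all illustrative assumptions, and no care is taken here to keep the rates positive.

import numpy as np

rng = np.random.default_rng(1)
delta, n = 0.25, 8
L = np.full(n, 0.03)               # L(0, T_j), j = 0, ..., n-1 (hypothetical flat initial curve)
gamma_bar = np.full(n, 0.15)       # gamma(t, T_j) taken constant, equal to assumed caplet vols
dt = 1.0 / 252
n_steps = int(delta / dt)          # evolve up to T_0 = 0.25 for illustration
for _ in range(n_steps):
    dW = np.sqrt(dt) * rng.standard_normal()   # increment of the single terminal-measure Brownian motion
    drifts = np.empty(n)
    for j in range(n):
        # drift correction of Equation 5712, summed over i = j+1, ..., n-1 in 0-based indexing
        drifts[j] = -sum(delta * gamma_bar[i] * L[i] / (1.0 + delta * L[i]) for i in range(j + 1, n))
    L += gamma_bar * L * (drifts * dt + dW)    # Equation 5713, one Euler step for every maturity
print(L)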
We construct the volatility σ∗(t, Tj) of the zcbs maturing at the various Tj, j ∈ [n + 1]. We are subject to the constraint lim_{t→Tj} σ∗(t, Tj) = σ∗(Tj, Tj) = 0, since the bond price B(t, Tj) → 1 as t → Tj. Otherwise we are free to choose any form for σ∗(t, Tj), Tj−1 ≤ t < Tj, and suppose we make such a choice for j = 1, · · · , n + 1 satisfying the limit constraints. Then σ∗(t, Tj) is determined for all t ∈ [0, Tj). We verify this claim. First, our choice of σ∗(t, T1) on [T0, T1) already covers all 0 ≤ t < T1. Then by Equation 5709, we may write

σ∗(t, T2) = σ∗(t, T1) + δγ(t, T1)L(t, T1)/(1 + δL(t, T1)),   (5714)

which decides σ∗(t, T2) for all values 0 ≤ t < T2. Then write

σ∗(t, T3) = σ∗(t, T2) + δγ(t, T2)L(t, T2)/(1 + δL(t, T2)).   (5715)

The same algorithm can be used to obtain σ∗(t, Tj) for all j ∈ [n + 1]. Using the bond volatilities and the change of measure given by Equation 5707, we may write

dB(t, Tj) = R(t)B(t, Tj)dt − σ∗(t, Tj)B(t, Tj)dW̃(t)   (5716)
          = R(t)B(t, Tj)dt + σ∗(t, Tj)σ∗(t, Tn+1)B(t, Tj)dt − σ∗(t, Tj)B(t, Tj)dW̃Tn+1(t).   (5717)

Working with the discounted form allows us to exclude R(t), which is not yet specified, giving us

d(D(t)B(t, Tj)) = σ∗(t, Tj)σ∗(t, Tn+1)D(t)B(t, Tj)dt − σ∗(t, Tj)D(t)B(t, Tj)dW̃Tn+1(t),  0 ≤ t ≤ Tj.   (5718)

The initial condition is specified by Equation 5705; in particular,

D(0)B(0, Tj) = B(0, Tj) = Π_{i=0}^{j−1} B(0, Ti+1)/B(0, Ti) = Π_{i=0}^{j−1} (1 + δL(0, Ti))^{−1}.   (5719)

We can solve the SDE Equation 5718 to get

D(t)B(t, Tj) = B(0, Tj) exp{−∫_0^t σ∗(u, Tj)dW̃Tn+1(u) − ∫_0^t [½σ∗(u, Tj)² − σ∗(u, Tj)σ∗(u, Tn+1)]du}.   (5720)

Note that in the case when t = Tj, the equation simplifies. In particular,

D(Tj)B(Tj, Tj) = D(Tj)   (5721)
             = B(0, Tj) exp{−∫_0^{Tj} σ∗(u, Tj)dW̃Tn+1(u) − ∫_0^{Tj} [½(σ∗(u, Tj))² − σ∗(u, Tj)σ∗(u, Tn+1)]du}.

Furthermore, when j = n + 1, then

D(Tn+1) = B(0, Tn+1) exp{−∫_0^{Tn+1} σ∗(u, Tn+1)dW̃Tn+1(u) + ½∫_0^{Tn+1} σ∗(u, Tn+1)²du}.   (5722)

Recall that

P̃Tn+1(A) = ∫_A [D(Tn+1)/B(0, Tn+1)]dP̃,  ∀A ∈ F,   (5723)
P̃(A) = ∫_A [B(0, Tn+1)/D(Tn+1)]dP̃Tn+1,  ∀A ∈ F.   (5724)

Since we started with P̃Tn+1, it is more convenient to use the latter equation, so

B(0, Tn+1)/D(Tn+1) = exp{∫_0^{Tn+1} σ∗(u, Tn+1)dW̃Tn+1(u) − ½∫_0^{Tn+1} σ∗(u, Tn+1)²du}.   (5725)

Theorem 448. Under P̃ given by Equation 5724, the discounted zcb processes given by Equations 5718, 5719 (or Equation 5720) are martingales.

Proof. Consider Equation 5683,

W̃(t) = W̃Tn+1(t) − ∫_0^t σ∗(u, Tn+1)du,  0 ≤ t ≤ Tn+1,   (5726)

which can be used to write Equation 5718 as

d(D(t)B(t, Tj)) = −σ∗(t, Tj)D(t)B(t, Tj)dW̃(t).   (5727)

By Girsanov's Theorem 423, with Θ(u) → −σ∗(u, Tn+1), W̃(t) is P̂ Brownian, where

P̂(A) = ∫_A Z(Tn+1)dP̃Tn+1,  ∀A ∈ F,   (5728)

and

Z(Tn+1) = exp{−∫_0^{Tn+1} Θ(u)dW̃Tn+1(u) − ½∫_0^{Tn+1} Θ²(u)du}.   (5729)

See that Z(Tn+1) = B(0, Tn+1)/D(Tn+1), so P̂ = P̃.

13.9 Calculus for Jump Processes
In this part, we introduce jump-diffusion processes, in particular the case when there are a finite number
of jumps in some finite time interval. A jump process consists of some initial condition, Ito integral with
respect to Brownian motion dW (t), a Riemann integral w.r.t to time dt and a pure jump process. A pure
jump process starts at zero and in each finite time interval, has finitely many jumps and is otherwise
constant. As in Brownian motions driving continuous path processes, the Poisson process drives a jump
process. The relation between concepts in exponential random variables, Gamma random variables and
Poisson distributions are done in the text on statistics and probability distributions. Refer to Section
6.17.3 and Section 6.17.5.

13.9.1 Compensated and Compound Poisson Processes


We refer to Section 6.17.5 for discussion on Poisson processes. The Poisson random variable only increases
(at each event occurrence) and jumps with fixed-size one. Here we relax these assumptions by defining
compensated and compound Poisson processes.

Definition 466 (Compensated Poisson Process). For N (t) Poisson process (see Equation 2663), inten-
sity λ, define the compensated Poisson process as

M (t) = N (t) − λt. (5730)

Theorem 449 (Compensated Poisson Process is Martingale). M (t) as defined in Definition 466 is
martingale.

Proof. For 0 ≤ s < t, by independence of the increment N(t) − N(s) from F(s) and measurability of M(s), N(s) with respect to F(s), we may write

E[M (t)|F(s)] = E[M (t) − M (s) + M (s)|F(s)] (5731)


= E[M (t) − M (s)|F(s)] + E[M (s)|F(s)] (5732)
= E[N (t) − N (s) − λ(t − s)|F(s)] + M (s) (5733)
= E[N (t) − N (s)] − λ(t − s) + M (s) (5734)
= M (s). (5735)

The Poisson process N (t) increments (jumps) by one each time. We want to allow this to be random,
for modelling market prices.

Definition 467 (Compound Poisson Process). Let N (t) be Poisson process intensity λ, and Y1 , Y2 , · · ·
be sequence of IID random variables, EYi = β. We also assume Yi is independent of N (t). Then define
the compound Poisson process as
N (t)
X
Q(t) = Yi , t ≥ 0. (5736)
i=1

Jumps in Q(t) occur at the same times as in N(t), but the jump sizes are modelled by the distribution of Yi. The compound Poisson increments are independent, and Q(t) − Q(s) = Σ_{i=N(s)+1}^{N(t)} Yi is also independent of Q(s). Additionally, the increments of Q are stationary: distributionally, Q(t) − Q(s) is equivalent to Q(t − s). We can compute the moments of Q(t) by the double expectation (one w.r.t. Y, the other w.r.t. N(t)), s.t.

EQ(t) = Σ_{k=0}^{∞} E[Σ_{i=1}^{k} Yi | N(t) = k] P(N(t) = k)   (5737)
      = Σ_{k=0}^{∞} βk exp(−λt)(λt)^k/k!   (5738)
      = βλt Σ_{k=1}^{∞} exp(−λt)(λt)^{k−1}/(k − 1)!   (5739)
      = βλt.   (5740)

The last equality follows since the summation term is Taylor expansion of exp(λt). The average number
of jumps is λt, where is each jump is on average size β, so it turns out we can just multiply their
expectation, since we asserted they were independently distributed.
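A quick Monte Carlo sanity check of E Q(t) = βλt is sketched below; the exponential jump-size distribution (with mean β) and all numbers are illustrative choices, not part of the definition.

import numpy as np

rng = np.random.default_rng(2)
lam, beta, t, n_paths = 2.0, 0.5, 3.0, 100_000
counts = rng.poisson(lam * t, size=n_paths)                # N(t) for each simulated path
Q = np.array([rng.exponential(beta, size=k).sum() for k in counts])   # Q(t) = sum of the jump sizes
print(Q.mean(), beta * lam * t)                            # the two numbers should be close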

Theorem 450 (Compensated Compound Poisson Process is Martingale). For Q(t) defined as in Definition 467, the process Q(t) − βλt (said to be the compensated compound Poisson process) is a martingale.

Proof. The proof follows similarly as in proof of Theorem 449. In particular, see

E[Q(t) − βλt|F(s)] = E[Q(t) − Q(s)|F(s)] + Q(s) − βλt = βλ(t − s) + Q(s) − βλt = Q(s) − βλs.(5741)

Theorem 451. Let Q(t) be compound Poisson process (Definition 467), then for 0 = t0 < t1 < · · · < tn ,
the increments Q(t1 ) − Q(t0 ), · · · , Q(tn ) − Q(tn−1 ) are independent and stationary. Q(tj ) − Q(tj−1 ) is
distributionally equivalent to Q(tj − tj−1 ) - only the time distance is relevant.

Exercise 719. Let M (t) = −λt + N (t) be compensated Poisson process (Definition 466), then show that

1. M 2 (t) is submartingale and that

2. M 2 (t) − λt is martingale.

Proof. Zeng [20]. If we show (ii), then (i) follows, since then E[M²(t)|F(s)] = M²(s) + λ(t − s) ≥ M²(s). So we show (ii). Recall that the increments in M(t) are stationary and independent (non-random jump case of Theorem 451); then

E[M 2 (t) − M 2 (s)|F(s)] = E[(M (t) − M (s))2 |F(s)] + E[(M (t) − M (s)) · 2M (s)|F(s)] (5742)
= E[M 2 (t − s)] + 2M (s)E[M (t − s)] (5743)
= Var(N (t − s)) (5744)
= λ(t − s). (5745)

M 2 (s) is F(s) measurable so taking out what is measured and shifting the terms appropriately give us
E[M 2 (t) − λt|F(s)] = M 2 (s) − λs.

Exercise 720. Prove that a compound Poisson process (Definition 467) is Markov (Definition 325).

Proof. By the Independence Lemma 4, we may write

E[h(Q(T ))|F(t)] = E[h(Q(T ) − Q(t) + Q(t))|F(t)] = E[h(Q(T − t) + Q(t))|F(t)] = g(t, Q(t)). (5746)

See that g(t, x) = E[h(Q(T − t) + x)].

13.9.2 Moment Generating Function for the Compound Poisson Process
It turns out that the density function of Q(t) (Definition 467) is complicated, but its moment generating
function has simpler formulation. For compound Poisson process Q(t), denote the mgf for increment size
random variables Yi as

ϕY (u) = E exp(uYi ), (5747)

then since they are IID we may write

ϕQ(t)(u) = E exp(uQ(t))   (5748)
         = E exp{u Σ_{i=1}^{N(t)} Yi}   (5749)
         = P{N(t) = 0} + Σ_{k=1}^{∞} E[exp{u Σ_{i=1}^{k} Yi} | N(t) = k] P{N(t) = k}   (5750)
         = P{N(t) = 0} + Σ_{k=1}^{∞} E exp{u Σ_{i=1}^{k} Yi} P(N(t) = k)   (5751)
         = exp(−λt) + Σ_{k=1}^{∞} Π_{i=1}^{k} E exp(uYi) · exp(−λt)(λt)^k/k!   (5752)
         = exp(−λt) + Σ_{k=1}^{∞} exp(−λt)(ϕY(u)λt)^k/k!   (5753)
         = Σ_{k=0}^{∞} exp(−λt)(ϕY(u)λt)^k/k!   (5754)
         = exp(−λt) exp(ϕY(u)λt)   (5755)
         = exp(ϕY(u)λt − λt)   (5756)
         = exp(λt(ϕY(u) − 1)).   (5757)

We consider the case when Yi is a discrete random variable taking values ym, m ∈ [M], with probabilities p(ym) = P(Yi = ym), where Σ_{m=1}^{M} p(ym) = 1. Then we may write

ϕY(u) = Σ_{m=1}^{M} p(ym) exp(uym).   (5758)

We may then express

ϕQ(t)(u) = exp{λt(Σ_{m=1}^{M} p(ym) exp(uym) − 1)}   (5759)
         = exp{λt Σ_{m=1}^{M} p(ym)(exp(uym) − 1)}   (5760)
         = Π_{m=1}^{M} exp{λp(ym)t(exp(uym) − 1)}.   (5761)

The RHS is product of mgf of M scaled Poisson processes, m-th process having intensity λp(ym ) and
jump size ym .

Exercise 721. Prove that the geometric Poisson process

S(t) = exp {N (t) log(σ + 1) − λσt} = (σ + 1)N (t) exp(−λσt) (5762)

is martingale where N (t) is Poisson process with λ > 0 intensity and S(0) > 0, σ > −1.

Proof. For t ≤ u we have (see Equation 5757)

E[S(u)/S(t) | F(t)] = E[(σ + 1)^{N(u)−N(t)} exp(−λσ(u − t)) | F(t)]   (5763)
                    = exp(−λσ(u − t)) E[(σ + 1)^{N(u−t)}]   (5764)
                    = exp(−λσ(u − t)) E[exp{N(u − t) log(σ + 1)}]   (5765)
                    = exp(−λσ(u − t)) exp{λ(u − t)(exp(log(σ + 1)) − 1)}   (5766)
                    = exp{−λσ(u − t) + λσ(u − t)}   (5767)
                    = 1.   (5768)

Then S(t) = E[S(u)|F(t)].
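A small Monte Carlo illustration of this martingale property is sketched below: the expectation of S(t) stays at S(0) for every t. All parameter values are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
lam, sigma, S0, n_paths = 3.0, 0.4, 1.0, 500_000
for t in (0.5, 1.0, 2.0):
    N_t = rng.poisson(lam * t, size=n_paths)
    S_t = S0 * (sigma + 1.0) ** N_t * np.exp(-lam * sigma * t)   # Equation 5762 pathwise
    print(t, S_t.mean())                                         # each mean should be close to S0 = 1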

Theorem 452. Let yi , i ∈ [M ] be finite set of nonzero numbers and p(yi ) denote their respective proba-
bilities that sum to one. For λ > 0, let N̄m (t) be independent Poisson process with intensity λp(ym ) for
m ∈ [M ]. If we let
M
X
Q̄(t) = ym N̄m (t), t ≥ 0, (5769)
m=1

then Q̄(t) is compound Poisson process (Definition 467). In particular, if we let Ȳi be the size of the i-th
jump in Q̄(t), then we may write
M
X
N̄ (t) = N̄m (t), t ≥ 0, (5770)
m=1

where N̄(t) is a Poisson process with intensity λ. The random variables Ȳi ⊥ Ȳj for i ≠ j, and P{Ȳi = ym} = p(ym) for m ∈ [M]. The random variables Ȳi are independent of N̄(t), and the relationship
N̄ (t)
X
Q̄(t) = Ȳi , t≥0 (5771)
i=1

holds.

Proof. If we write

ϕ_{ym N̄m(t)}(u) = exp{λp(ym)t(exp(uym) − 1)},  see Equation 5761,   (5772)

we may express

ϕQ̄(t)(u) = E exp{u Σ_{m=1}^{M} ym N̄m(t)}   (5773)
          = Π_{i=1}^{M} E exp(uyi N̄i(t))   (5774)
          = Π_{i=1}^{M} ϕ_{yi N̄i(t)}(u)   (5775)
          = Π_{i=1}^{M} exp{λp(yi)t(exp(uyi) − 1)}.   (5776)

See that the expression in Equation 5761 is consistent with Equation 5776. Since Q and Q̄ share the same distribution, the same must also be true of the number of jumps and the sizes of the jumps.

Therefore a compound Poisson process with jumps taking discrete set of values from a probability
distribution can be thought of as a single Poisson process with random jumps or as a linear sum of
independent Poisson processes that take jump of size one.

Corollary 39. For random variable sequence Yi ’s taking discrete set of values ym , m ∈ [M ] with proba-
bility p(ym ), that is P(Yi = ym ) = p(ym ), let N (t) be Poisson process and define the compound Poisson
process (Definition 2663, Definition 467) by
N (t)
X
Q(t) = Yi , (5777)
i=1

and Nm (t), m ∈ [M ] be the number of jumps of size ym in Q ≤ t. Then


M
X M
X
N (t) = Nm (t), Q(t) = ym Nm (t). (5778)
m=1 m=1

Nm ’s are independent Poisson processes, and each Nm has intensity λp(ym ).
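The decomposition can be illustrated numerically: the sketch below builds the same compound Poisson distribution two ways, once as Q(t) = Σ Yi with discrete jump sizes and once as Σ ym Nm(t) with independent Poisson processes of intensity λp(ym); all values are illustrative.

import numpy as np

rng = np.random.default_rng(3)
lam, t, n_paths = 1.5, 2.0, 200_000
y = np.array([-1.0, 0.5, 2.0])         # possible jump sizes y_m (hypothetical)
p = np.array([0.2, 0.5, 0.3])          # their probabilities p(y_m)
# Construction 1: draw N(t), then draw the jump sizes from p (Equation 5777).
N = rng.poisson(lam * t, size=n_paths)
Q1 = np.array([rng.choice(y, size=k, p=p).sum() for k in N])
# Construction 2: independent Poisson counts with intensities λ p(y_m) (Equation 5778).
Q2 = sum(y[m] * rng.poisson(lam * p[m] * t, size=n_paths) for m in range(len(y)))
print(Q1.mean(), Q2.mean(), lam * t * (y * p).sum())   # all three agree up to Monte Carlo error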

13.9.3 Integrals and Differentials of Jump Processes


The integrals here contain process with jumps in the integrators, which differ from the continuous types
treated in Ito integrals.
Definition 468. Let (Ω, F, P) be probability space, let F(t), t ≥ 0 be filtration defined on the probability
space. A Poisson process N is Poisson process relative to F(t) if N (t) is F(t) measurable for all t, and
for all u > t, N (u) − N (t) ⊥ F(t). Similarly, a compound Poisson process Q is compound Poisson
relative to F(t) if Q(t) is F(t) measurable for all t, and for all u > t, Q(u) − Q(t) ⊥ F(t).
Rt
A stochastic integral where the integrator X has jumps is written 0 Φ(s)dX(s), where X(t) is right-
continuous and of the general form

X(t) = X(0) + I(t) + R(t) + J(t). (5779)


Rt Rt
In particular, X(0) is initial condition, I(t) = 0 Γ(s)dW (s) is Ito integral and R(t) = 0 Θ(s)ds is
Riemann integral. All processes Γ(s), Θ(s) are adapted to F(t), which is defined on the probability space
(Ω, F, P). The continuous part of X(t) is denoted X c (t) and this is
Z t Z t
X c (t) = X(0) + I(t) + R(t) = X(0) + Γ(s)dW (s) + Θ(s)ds (5780)
0 0
Rt
and we know from previous discussion that quadratic variation of X c (t) is written [X c , X c ](t) = 0
Γ2 (s)ds,
a fact we write dX c (t)dX c (t) = Γ2 (t)dt in differential form. The jump is added to X c (t) by term J(t),
which is an adapted, right-continuous pure jump process with initial condition J(0) = 0. Right conti-
nuity implies J(t) = J(t+) = lims→t+ J(s) for t ≥ 0. Therefore if the jump at t of process J is written
∆J(t), the relationship ∆J(t) = J(t) − J(t−) holds. It is assumed there is no jump at time zero, and
that there are finite number of jumps in a finite interval, and that J(t) is constant between jumps. See
that a Poisson and compound Poisson process would be of the pure jump type, the compensated Poisson
process decreases between jumps and is not pure.
Definition 469 (Jump Process). A jump process X(t) takes form

X(t) = X(0) + I(t) + R(t) + J(t), (5781)

where I(t), R(t), J(t) are the Ito integral, Riemann integral and pure jump parts respectively. In par-
Rt Rt
ticular, X(0) is initial condition, I(t) = 0 Γ(s)dW (s) is Ito integral and R(t) = 0 Θ(s)ds is Rie-
mann integral. Additionally, X c (t) is said to be continuous part of jump process X(t) when X c (t) =
X(0) + I(t) + R(t). By continuity of the continuous part and right-continuity of J(t), X(t) is right-
continuous. It is also adapted to filtration.

We can write X(t−) = X(0) + I(t) + R(t) + J(t−), and ∆X(t) = X(t) − X(t−). If ∆X(t) ≠ 0, it is due to the jump of size ∆J(t).

Definition 470 (Stochastic Integral of Jump Process). Let X(t) be jump process (Definition 469), and
Φ(s) be adapted process, then
Z t Z t Z t X
Φ(s)dX(s) = Φ(s)Γ(s)dW (s) + Φ(s)Θ(s)ds + Φ(s)∆J(s). (5782)
0 0 0 0<s≤t

The differential form for this is

Φ(t)dX(t) = Φ(t)dI(t) + Φ(t)dR(t) + Φ(t)dJ(t) (5783)


= Φ(t)dX c (t) + Φ(t)dJ(t). (5784)

Here

Φ(t)dI(t) = Φ(t)Γ(t)dW (t), (5785)


Φ(t)dR(t) = Φ(t)Θ(t)dt, (5786)
Φ(t)dX c (t) = Φ(t)Γ(t)dW (t) + Φ(t)Θ(t)dt. (5787)

Exercise 722 (An Example of a Stochastic Integral with Jump Process Integrator, Shreve [19]). Let a
jump process X(t) be the compensated Poisson process X(t) = M (t) = N (t) − λt (it has no Ito integral
component). Let Φ(s) = ∆N (s). By Definition 470, and since Φ(s) is non-zero only at finite points with
Lebesgue measure zero, then
Z t Z t Z t
Φ(s)dX c (s) = Φ(s)dR(s) = −λ ∆N (s)ds = 0. (5788)
0 0 0

Furthermore, we have
Z t X
Φ(s)dN (s) = (∆N (s))2 = N (t). (5789)
0 0<s≤t

We may therefore write

∫_0^t Φ(s)dM(s) = −λ∫_0^t Φ(s)ds + ∫_0^t Φ(s)dN(s) = N(t).   (5790)

It turns out that, unlike the Ito integral, the stochastic integral w.r.t. a martingale is not necessarily a martingale (see Exercise 722; N(t) is strictly nonnegative!). Therefore an agent holding the position Φ(s) = ∆N(s) in the compensated Poisson process M(t), flat except at jump times where he is long, would constitute an arbitrage. In practice this cannot be implemented, since it is impossible to be long precisely at the jump times only. Φ(s) depends only on the path of the underlying process M
up to and including time s, and not on the future path. It turns out that being adapted is necessary but
not a sufficient condition for portfolio replication - we in fact require that the integrand is predictable. We
can do this by insisting that the integrands (the integrator is still right-continuous) are left-continuous.
Φ(s) = ∆N (s) is not an example of a left-continuous integrand.

Result 47. For a jump process X(t) (Definition 469) assumed to be a martingale, and some left-continuous adapted integrand Φ(s), the stochastic integral

∫_0^t Φ(s)dX(s)   (5791)

is a martingale, subject to the integrability condition E∫_0^t Γ²(s)Φ²(s)ds < ∞ for all t ≥ 0.

Rt
As a result of the right-continuity of integrator X(t), the integral 0
Φ(s)dX(s) is right-continuous in
the upper limit of integration t. The integral jumps whenever X jumps (assuming Φ 6= 0). The integral
value at t includes the jump at t, if there is one. We see this martingale property with the following
example.

Exercise 723 (Shreve [19]). Let N (t) be Poisson process (see Equation 2663) with intensity λ. Let
M (t) be compensated Poisson process N (t) − λt (Definition 466), and the integrand be the left-continuous
function Φ(s) = 1{s ∈ [0, S1]}, where there is a jump at S1. We may write

∫_0^t Φ(s)dM(s) = −λt for 0 ≤ t < S1, and 1 − λS1 for t ≥ S1 (this includes the jump increment 1);   (5792)

in either case this equals 1{t ∈ [S1, ∞)} − λ(t ∧ S1). For the case 0 ≤ s < t, write

E [1{t ∈ [S1 , ∞)} − λ(t ∧ S1 )|F(s)] = P{S1 ≤ t|F(s)} − λE[t ∧ S1 |F(s)]. (5793)

We consider two subcases, when S1 ≤ s and when S1 > s. In particular, if S1 ≤ s, since S1 ≤ s < t, the
RHS evaluates to 1 − λS1 = 1{s ∈ [S1 , ∞)} − λ(s ∧ S1 ) and we are done. Otherwise by memorylessness
of S1 = τ1 we may write P{S1 ≤ t|F(s)} = 1 − exp(−λ(t − s)). Compute

−(δ/δu)P{S1 > u | S1 > s} = −(δ/δu) exp(−λ(u − s)) = λ exp(−λ(u − s)),  u > s.   (5794)

Integrating by parts and using Equation 5794, we obtain the result

λE[t ∧ S1 |F(s)] = λE[t ∧ S1 |S1 > s] (5795)


Z ∞
= λ2 (t ∧ u) exp(−λ(u − s))du (5796)
s
Z t Z ∞
= λ2 u exp(−λ(u − s))du + λ2 t exp(−λ(u − s))du (5797)
s t
Z t
= −λ[u exp(−λ(u − s))]ts + λ exp(−λ(u − s))du − λ[t exp(−λ(u − s))]∞
t(5798)
s
= λs − λt exp(−λ(t − s)) − [exp(−λ(u − s))]ts + λt exp(−λ(t − s)) (5799)
= λs − exp(−λ(t − s)) + 1. (5800)

Then Equation 5793 is [1 − exp(−λ(t − s))] − λs + exp(−λ(t − s)) − 1 = −λs = 1{s ∈ [S1 , ∞)} − λ(s ∧ S1 )
and we are done since this is the case for S1 > s.

To get the differential and apply Ito Doeblin formula to jump processes, we must first find its quadratic
and cross variations.

Definition 471 (Quadratic Variation of Jump Processes). Let X(t) be jump process, and choose partition
0 = t0 < t1 < · · · < tn = T . Denote the longest subinterval by kΠk = maxj (tj+1 − tj ) and write
n−1
X
QΠ (X) = (X(tj+1 ) − X(tj ))2 . (5801)
j=0

Then the quadratic variation X on [0, T ] is written

[X, X](T ) = lim QΠ (X). (5802)


kΠk→0

In general this may be random (such as a path-dependent one like [I, I](T) = ∫_0^T Γ²(s)ds). The cross variation is defined for two jump processes X1(t), X2(t) by setting

CΠ(X1, X2) = Σ_{j=0}^{n−1} [(X1(tj+1) − X1(tj))(X2(tj+1) − X2(tj))],   (5803)

and [X1, X2](T) = lim_{kΠk→0} CΠ(X1, X2) is the cross variation on [0, T].

Theorem 453. For i = 1, 2, define the jump processes (Definition 469)

Xi(t) = Xi(0) + Ii(t) + Ri(t) + Ji(t),   (5804)

where Ii(t) = ∫_0^t Γi(s)dW(s), Ri(t) = ∫_0^t Θi(s)ds and Ji(t) is a right-continuous pure jump process. Then the quadratic and cross variations (see Definition 471) are given by

[X1, X1](T) = [X1c, X1c](T) + [J1, J1](T) = ∫_0^T Γ1²(s)ds + Σ_{0<s≤T} (∆J1(s))²,   (5805)
[X1, X2](T) = [X1c, X2c](T) + [J1, J2](T) = ∫_0^T Γ1(s)Γ2(s)ds + Σ_{0<s≤T} ∆J1(s)∆J2(s).   (5806)

Proof. See that [X2 , X2 ](T ) is similar to [X1 , X1 ](T ), except we make X1 → X2 and invoke symmetry.
Also, [X1 , X2 ](T ) is general case of [X1 , X1 ](T ) where X1 (T ) = X2 (T ), so we need only provide reasoning
for the cross variation arguments. In particular, by Definition 471, write
n−1
X
CΠ (X1 , X2 ) = [(X1 (tj+1 ) − X1 (tj ))(X2 (tj+1 ) − X2 (tj ))] (5807)
j=0
n−1
X
= (X1c (tj+1 ) − X1c (tj ) + J1 (tj+1 ) − J1 (tj ))(X2c (tj+1 ) − X2c (tj ) + J2 (tj+1 ) − J2 (tj ))
j=0
n−1
X
= (X1c (tj+1 ) − X1c (tj ))(X2c (tj+1 ) − X2c (tj )) (5808)
j=0
n−1
X
+ (X1c (tj+1 ) − X1c (tj ))(J2c (tj+1 ) − J2c (tj )) (5809)
j=0
n−1
X
+ (X2c (tj+1 ) − X2c (tj ))(J1c (tj+1 ) − J1c (tj )) (5810)
j=0
n−1
X
+ (J1 (tj+1 ) − J1 (tj ))(J2 (tj+1 ) − J2 (tj )). (5811)
j=0

The cross variation is this equation in its limit as kΠk → 0, and we know from continuous theory that
n−1
X Z T
lim [(X1c (tj+1 ) − X1c (tj ))(X2c (tj+1 ) − X2c (tj ))] = [X1c , X2c ](T ) = Γ1 (s)Γ2 (s)ds. (5812)
kΠk→0 0
j=0

Next consider

Σ_{j=0}^{n−1} (X1c(tj+1) − X1c(tj))(J2(tj+1) − J2(tj))   (5813)
≤ max_{0≤j≤n−1} |X1c(tj+1) − X1c(tj)| · Σ_{j=0}^{n−1} |J2(tj+1) − J2(tj)|.   (5814)

The rightmost term is finite, since there are finite number of jumps with finite size. But the left term
max0≤j≤n−1 |X1c (tj+1 ) − X1c (tj )| → 0 as kΠk → 0 so this does not contribute to the cross variation.
Pn−1
Symmetrical arguments assert j=0 (X2c (tj+1 ) − X2c (tj ))(J1c (tj+1 ) − J1c (tj )) does not contribute to cross
variation in its limit. Only the cross variation of the pure jump process remains. For fixed ω ∈ Ω,
the path is fixed. Choose partition set s.t. there is at most one jump belonging to J1 in each interval
(tj , tj+1 ], and at most one jump belonging to J2 in each interval (tj , tj+1 ]. If they happen to have jumps
in the same interval, then they are treated as a simultaneous jump. Define the set A1 , A2 of indices of
j for which the corresponding subinterval (tj , tj+1 ] contained a jump from J1 and J2 respectively, then
see that
n−1
X
(J1 (tj+1 ) − J1 (tj ))(J2 (tj+1 ) − J2 (tj )) (5815)
j=0
X
= (J1 (tj+1 ) − J1 (tj ))(J2 (tj+1 ) − J2 (tj )) (5816)
j∈A1 ∩A2
X
= ∆J1 (s)∆J2 (s). (5817)
0<s≤t

For convenience the results in Theorem 453 may be stated as follows, for jump processes i = 1, 2
where Xi (t) = Xi (0) + Xic (t) + Ji (t), we have

dX1 (t)dX2 (t) = dX1c (t)dX2c (t) + dJ1 (t)dJ2 (t) (5818)

and dX1c (t)dJ2 (t) = dX2c (t)dJ1 (t) = 0. If we expand the continuous parts dXic (t) it is also trivial to see
that the cross variation between any continuous process and pure jump process is zero.

Corollary 40. Let W (t), M (t) = N (t) − λt be Brownian motion and compensated Poisson process
relative to filtration F(t) respectively, then the cross variation [W, M ](t) = 0 for all t ≥ 0.

Proof. This follows directly from Theorem 453.

Corollary 41. For i = 1, 2, define the jump processes (Definition 469)

Xi(t) = Xi(0) + Ii(t) + Ri(t) + Ji(t),   (5819)

where Ii(t) = ∫_0^t Γi(s)dW(s), Ri(t) = ∫_0^t Θi(s)ds and Ji(t) is a right-continuous pure jump process. Let X̃i(0) be constant, and Φi(s) be an adapted process. If we let

X̃i(t) = X̃i(0) + ∫_0^t Φi(s)dXi(s),   (5820)

then

X̃i(t) = X̃i(0) + Ĩi(t) + R̃i(t) + J̃i(t),   (5821)

where Ĩi(t) = ∫_0^t Φi(s)Γi(s)dW(s), R̃i(t) = ∫_0^t Φi(s)Θi(s)ds and J̃i(t) = Σ_{0<s≤t} Φi(s)∆Ji(s). Then the cross variations are given by

[X̃1, X̃2](t) = [X̃1c, X̃2c](t) + [J̃1, J̃2](t)   (5822)
             = ∫_0^t Φ1(s)Φ2(s)Γ1(s)Γ2(s)ds + Σ_{0<s≤t} Φ1(s)Φ2(s)∆J1(s)∆J2(s)   (5823)
             = ∫_0^t Φ1(s)Φ2(s)d[X1, X2](s).   (5824)

In differential form this is written for

dX̃i (t) = Φi (t)dXi (t), i = 1, 2, (5825)

we have

dX̃1 (t)dX̃2 (t) = Φ1 (t)Φ2 (t)dX1 (t)dX2 (t) (5826)


Rt Rt
Recall the Ito Doeblin formula for continuous process X c (t) = X c (0) + 0
Γ(s)dW (s) + 0
Θ(s)ds,
adapted Γ(s), Θ(s) with differential form

dX c (s) = Γ(s)dW (s) + Θ(s)ds, dX c (s)dX c (s) = Γ2 (s)ds, (5827)

have result
1
df (X c (s)) = f 0 (X c (s))dX c (s) + f 00 (X c (s))dX c (s)dX c (s) (5828)
2
1
= f 0 (X c (s))Γ(s)dW (s) + f 0 (X c (s))Θ(s)ds + f 00 (X c (s))Γ2 (s)ds. (5829)
2
In integral form this is
Z t Z t Z t
0 0 1
c c
f (X (t)) − f (X (0)) = c
f (X (s))Γ(s)dW (s) + c
f (X (s))Θ(s)ds + f 00 (X c (s))Γ2 (s)ds.(5830)
0 0 2 0

To get the results for jump processes, we add the pure jump process J(t) to get X(t) = X(0) + I(t) +
R(t) + J(t) = X c (t) + J(t). Between jumps of J, the same differential as in Equation 5829 holds for
df (X(s)) by substituting X c → X. However, when there is jump, we must account for the jump in f (X)
from f (X(s−)) → f (X(s)).

Theorem 454 (Ito Doeblin Formula, One Jump Process). Let X(t) be a jump process, and let f(x) have continuous derivatives f′(x), f″(x). Then

f(X(t)) − f(X(0)) = ∫_0^t f′(X(s))dXc(s) + ½∫_0^t f″(X(s))dXc(s)dXc(s) + Σ_{0<s≤t} [f(X(s)) − f(X(s−))].   (5831)

Proof. For fixed ω ∈ Ω, define 0 < τ1 < τ2 < · · · < τn−1 < τn = t to be partition intervals, where
τ1 , · · · τn−1 are interval times. τ0 = 0 is assumed not to be a jump time, while τn is possibly one. For
u, v in same interval (τj , τj+1 ) and u < v, there is no jump, so by continuity we may write
f(X(v)) − f(X(u)) = ∫_u^v f′(X(s))dXc(s) + ½∫_u^v f″(X(s))dXc(s)dXc(s),   (5832)

and by the right-continuity, in fact we have


f(X(τj+1−)) − f(X(τj)) = ∫_{τj}^{τj+1} f′(X(s))dXc(s) + ½∫_{τj}^{τj+1} f″(X(s))dXc(s)dXc(s) =: κ−.   (5833)

Right-continuity implies that lim_{v→τj+1−} ∫_u^v f′(X(s))dXc(s) = ∫_u^{τj+1} f′(X(s))dXc(s), but not in the case when
f 0 (X(s))dX c (s) but not in the case when
v→τj+1
dX c → dX, since the RHS would include the jump. If we were to include the jump point, then we must
add the jump size f (X(τj+1 )) − f (X(τj+1 −)), s.t.

f (X(τj+1 )) − f (X(τj )) = κ− + f (X(τj+1 )) − f (X(τj+1 −)). (5834)

Sum over all intervals to obtain

f(X(t)) − f(X(0)) = Σ_{j=0}^{n−1} [f(X(τj+1)) − f(X(τj))]   (5835)
= ∫_0^t f′(X(s))dXc(s) + ½∫_0^t f″(X(s))dXc(s)dXc(s) + Σ_{j=0}^{n−1} [f(X(τj+1)) − f(X(τj+1−))].

The last term Σ_{j=0}^{n−1} [f(X(τj+1)) − f(X(τj+1−))] = Σ_{0<s≤t} [f(X(s)) − f(X(s−))] and the result follows. There is no general differential form for Equation 5831; we are not guaranteed to have a differential form for the summation of jumps.

We see one where it is possible to obtain the differential form.

Exercise 724. Consider a geometric Poisson process given by

S(t) = S(0) exp {−λσt + N (t) log(σ + 1)} = S(0) exp {−λσt} (σ + 1)N (t) . (5836)

Here σ > −1.

Proof. Writing S(t) = S(0)f (X(t)) = S(0) exp(X(t)) where X(t) = −λσt + N (t) log(σ + 1), by jump
process Ito Doeblin (Theorem 454) we may write
Z t X
f (X(t)) − f (X(0)) = −λσ f 0 (X(u))du + [f (X(u)) − f (X(u−))] (5837)
0 0<u≤t
Z t
= −λσ S(u)du + Σ0<u≤t [S(u) − S(u−)]. (5838)
0

See that if there is jump at u, then S(u) = (σ +1)S(u−) (see Equation 5836), so S(u)−S(u−) = σS(u−).
Otherwise by continuity S(u) − S(u−) = 0 if u is not jump time. If ∆N (u) = 1 for jump at u and 0
for no-jump at u, then S(u) − S(u−) = σS(u−)∆N (u) and the summation of the jump terms can be
written as an integral. Therefore Equation 5838 can be expressed as

S(t) = S(0) − λσ∫_0^t S(u−)du + σ∫_0^t S(u−)dN(u) = S(0) + σ∫_0^t S(u−)dM(u),   (5839)
Rt Rt
where we have used 0
S(u)du = 0
S(u−)du (since there are only finite jumps at points with Lebesgue
measure zero), and the definition of compensated Poisson process M (t) = N (t)−λt. Additionally, S(u−)
is left continuous, so Result 47 asserts S(t) is martingale. The differential form is

dS(t) = σS(t−)dM (t) = −λσS(t)dt + σS(t−)dN (t). (5840)

This differential form exists because we were able to write the jump in f (X) at jump time u in terms of
the jump in f (X(u−)).

Corollary 42. For W(t), N(t) Brownian motion and Poisson process with intensity λ > 0 respectively, defined on a probability space (Ω, F, P) relative to filtration F(t), we have W(t) ⊥ N(t).8

8 A stronger form of this statement holds, that is, the processes N, W are independent. Any stochastic process driven solely by W and any stochastic process driven solely by N are independent. It can be shown that the vectors of random variables (W(ti))i∈[n] ⊥ (N(ti))i∈[n].

Proof. We show this using moment generating functions. Define
 
1 2
Y (t) = exp u1 W (t) + u2 N (t) − u1 t − λ(exp(u2 ) − 1)t . (5841)
2
Then writing X(t) = u1 W (t) + u2 N (t) − 12 u21 (t) − λ(exp(u2 ) − 1)t, we can use Ito Doeblin formula for
jump processes (Theorem 454) for Y (s) = f (X(s)). See that
1
dX c (s) = u1 dW (s) − u21 ds − λ(exp(u2 ) − 1)ds, dX c (s)dX c (s) = u21 ds. (5842)
2
Also by definition of Y (s), if s is jump time then Y (s) = Y (s−) exp(u2 ), so

Y (s) − Y (s−) = (exp(u2 ) − 1)Y (s−)∆N (s). (5843)

Then apply Ito Doeblin formula, Y (t) = f (X(t)) =


Z t
1 t 00
Z X
f (X(0)) + f 0 (X(s))dX c (s) + f (X(s))dX c (s)dX c (s) + [f (X(s)) − f (X(s−))] (5844)
0 2 0
0<s≤t
Z t Z t Z t Z t
1 1 X
= 1 + u1 Y (s)dW (s) − u21 Y (s)ds − λ(exp(u2 ) − 1) Y (s)ds + u21 Y (s)ds + [Y (s) − Y (s−)]
0 2 0 0 2 0 0<s≤t
Z t
1 2 t
Z t
1 2 t
Z Z
= 1 + u1 Y (s)dW (s) − u1 Y (s)ds − λ(exp(u2 ) − 1) Y (s)ds + u1 Y (s)ds (5845)
0 2 0 0 2 0
Z t
+(exp(u2 ) − 1) Y (s−)dN (s) see Equation 5843 (5846)
0
Z t Z t
= 1 + u1 Y (s)dW (s) + (exp(u2 ) − 1) Y (s−)dM (s), (5847)
0 0
Rt Rt
where we used M (s) = N (s)−λs compensated Poisson process and 0
Y (s−)ds = 0
Y (s)ds by continuity
and finiteness of jump points. One is Ito integral with Brownian integrator, and another is martingale
integral with left-continuous integrand (see Theorem 47), and the rest are constants. So clearly Y (t) is
martingale, and EY (t) = Y (0) = 1 and
 
1 2
E exp u1 W (t) + u2 N (t) − u1 t − λ(exp(u2 ) − 1)t = 1 ∀t ≥ 0. (5848)
2
It follows that
 
1 2
E exp {u1 W (t) + u2 N (t)} = exp u1 t + λ(exp(u2 ) − 1)t , (5849)
2
the product of the moment generating functions for W (t), N (t) respectively. By equivalence of indepen-
dence conditions (Theorem 353), we are done.

Exercise 725. Prove that for W(t), Q(t) Brownian motion and compound Poisson process respectively, defined on a probability space (Ω, F, P) relative to filtration F(t), we have W(t) ⊥ Q(t).9

Proof. Let X(t) = u1 W (t) − 21 u21 t + u2 Q(t) − λt(ϕ(u2 ) − 1) (see Equation 5747). Then by Ito Doeblin
Theorem 454, we can write
Z t  
1
exp(X(t)) − exp(X(0)) = exp(X(s)) u1 dW (s) − u21 ds − λ(ϕ(u2 ) − 1)ds (5850)
0 2
1 t
Z X
+ exp(X(s))u21 ds + [exp(X(s)) − exp(X(s−))]. (5851)
2 0
0<s≤t
9A stronger form of this statement holds, that is the processes Q, W are independent. Any stochastic process driven
solely by W , and any stochastic process driven solely by Q are independent. It can be shown that the vector of random
variables (W (ti )i∈[n] ) ⊥ (Q(ti )i∈[n] ),

Since ∆X(t) = u2 ∆Q(t) = u2 YN (T ) ∆N (t), then exp(X(t))−exp(X(t−)) = exp(X(t−)) [exp(∆X(t)) − 1] =
 
exp(X(t−)) exp(u2 YN (t) ) − 1 ∆N (t). Denote the compound Poisson process
PN (t) 
H(t) = i=1 exp(u2 YN (i) − 1 . We have the relation

H(t) − λE [exp(u2 Y1 ) − 1] t = H(t) − λ(ϕ(u2 ) − 1)t. (5852)

This is martingale process. Next, since exp(X(t)) − exp(X(t−)) = exp(X(t−))∆H(t), and


Z t  
1 2
exp(X(t)) − 1 = exp(X(s)) u1 dW (s) − u1 ds − λ(ϕ(u2 ) − 1)ds (5853)
0 2
Z t Z t
1
+ exp(X(s))u21 ds + exp(X(s−))dH(s) (5854)
2 0 0
Z t Z t
= exp(X(s))u1 dW (s) + exp(X(s−))d(H(s) − λ(ϕ(u2 ) − 1)s). (5855)
0 0

Therefore, exp(X(t)) is martingale and E exp(X(t)) = 1 by initial condition X(0) = 0. It follows that
1
E [exp(u1 W (t) + u2 Q(t))] = exp( u1 t) exp(λt(ϕ(u2 ) − 1)). (5856)
2
The RHS is the product of their respective mgfs, so we are done.

Exercise 726. Zeng [20]. Prove that for Poisson processes N1(t), N2(t) with no simultaneous jumps almost surely, N1(t) ⊥ N2(t) for all t > 0, and indeed that the paths of N1, N2 are independent.

Proof. For fixed 0 ≤ s < t, let X(t) = u1 (N1 (t) − N1 (s)) + u2 (N2 (t) − N2 (s)) − λ1 (exp(u1 ) − 1)(t − s) −
λ2 (exp(u2 ) − 1)(t − s). The Ito Doeblin formula (Theorem 454) asserts that exp(X(t)) − exp(X(s))
Z t
1 t
Z X
= exp(X(u))dX c (u) + exp(X(u))dX c (u)dX c (u) + [exp(X(u)) − exp(X(u−))]
s 2 s
s<u≤t
Z t X
= exp(X(u))[−λ1 (exp(u1 ) − 1) − λ2 (exp(u2 ) − 1)]du + [exp(X(u)) − exp(X(u−))].
(5857)
s s<u≤t

Since ∆X(t) = u1 ∆N1 (t) + u2 ∆N2 (t) and ∆N1 (t)∆N2 (t) = 0 by no simultaneous jump assumption, we
may write

exp(X(u)) − exp(X(u−)) = exp(X(u−))(exp(∆X(u)) − 1) (5858)


exp(X(u−))[(exp(u1 ) − 1)∆N1 (u) + (exp(u2 ) − 1)∆N2 (u)].(5859)
=
Rt Rt
As before, the Riemann integrals agree for equality s exp(X(u))du = s exp(X(u−))du. Then continue
Equation 5857

exp(X(t)) − exp(X(s)) (5860)


Z t
= exp(X(u−)) [−λ1 (exp(u1 ) − 1) − λ2 (exp(u2 ) − 1)] du (5861)
s
X
+ exp(X(u−)) [(exp(u1 ) − 1)∆N1 (u) + (exp(u2 ) − 1)∆N2 (u)] (5862)
s<u≤t
= ∫_s^t exp(X(u−)) [(exp(u1) − 1)d(N1(u) − λ1 u) + (exp(u2) − 1)d(N2(u) − λ2 u)].   (5863)

Then E exp(X(t)) = E exp(X(s)) = exp(X(0)) = 1 and so

E exp {u1 (N1 (t) − N1 (s)) + u2 (N2 (t) − N2 (s))} = exp {λ1 (exp(u1 ) − 1)(t − s)} exp {λ2 (exp(u2 ) − 1)(t − s)}
= E exp {u1 (N1 (t) − N1 (s))} E exp {u2 (N2 (t) − N2 (s))} .

It follows that the Poisson increments (N1 (t) − N1 (s)) ⊥ (N2 (t) − N2 (s)). If we use the initial condition
N1 (0) = N2 (0) = 0, we have shown N1 (t) ⊥ N2 (t). To show the stronger result for the whole path,
we employ mathematical induction. In particular, the base case is done, and ((N1(ti))_{i∈[n]}) ⊥ ((N2(ti))_{i∈[n]}) iff ((N1(ti) − N1(ti−1))_{i∈[n]}) ⊥ ((N2(ti) − N2(ti−1))_{i∈[n]}). Let t0 = 0, and see that
  
X n X n 
E exp ui (N1 (ti ) − N1 (ti−1 )) + vj (N2 (tj ) − N2 (tj−1 ))  (5864)
 
i j
  
n−1
X n−1
X 
= E exp ui (N1 (ti ) − N1 (ti−1 )) + vj (N2 (tj ) − N2 (tj−1 ))  · (5865)
 
i j

E [exp {un (N1 (tn ) − N1 (tn−1 )) + vn (N2 (tn ) − N2 (tn−1 ))}] (5866)

The first expectation is the inductive case and we assume inductively that their expectation factors. The
second expectation is a specific case of Poisson increments, which we showed earlier take independence
result (N1 (t) − N1 (s)) ⊥ (N2 (t) − N2 (s)). Their expectation also factors. Then the overall expectation
factors for the entirety of the path.

Theorem 455 (Ito Doeblin Formula, Multiple Jump Processes). The generalized form of the Ito Doeblin
formula on one-jump process (Theorem 454) is available in the multi-dimensional case. We look at the
two-dimensional form as special case, but the same theory applies in higher dimensions. For two jump
processes X1 (t), X2 (t), let f (t, x1 , x2 ) be function with continuous first and second partial derivatives,
then we have

f (t, X1 (t), X2 (t)) − f (0, X1 (0), X2 (0)) (5867)


Z t Z t Z t
c
= ft ds + fx1 dX1 + fx2 dX2c (5868)
0 0 0
1 t 1 t
Z Z Z t
+ fx1 x1 dX1c dX1c + fx2 x2 dX2c dX2c + fx1 x2 dX1c dX2c (5869)
2 0 2 0 0
X
+ [f (s, X1 (s), X2 (s)) − f (s, X1 (s−), X2 (s−))]. (5870)
0<s≤t

Corollary 43 (Ito Product Rule, Jump Processes). Let X1 (t), X2 (t) be jump processes, then
Z t Z t
X1 (t)X2 (t) − X1 (0)X2 (0) = X2 (s)dX1c (s) + X1 (s)dX2c (s)
0 0
X
+[X1c , X2c ](t) + [X1 (s)X2 (s) − X1 (s−)X2 (s−)] (5871)
0<s≤t
Z t Z t
= X2 (s)dX1c (s) + X1 (s)dX2c (s) + [X1 , X2 ](t). (5872)
0 0

Proof. First equality follows directly from Ito Doeblin formula by applying Theorem 455 with f (x1 , x2 ) =
x1 x2 . Note that by definition
Z t
dX1c (s)dX2c (s) = [X1c , X2c ](t), (5873)
0

and so dX1c (t)dX2c (t) = d[X1c , X2c ](t). Applying the two-dimensional Ito Doeblin formula gives us
Z t Z t Z t
X1 (t)X2 (t) = X1 (0)X2 (0) + X2 (t)dX1c (s) + X1 (s)dX2c (s) + dX1c (s)dX2c (s) (5874)
0 0 0
X
+ [X1 (s)X2 (s) − X1 (s−)X2 (s−)] (5875)
0<s≤t

Since we may write Ji (t) = Xi (t) − Xic (t), and using Equation 5806, we have
Z t Z t
X2 (s−)dX1 (s) +
X1 (s−)dX2 (s) + [X1 , X2 ](t) (5876)
0 0
Z t Z t Z t Z t
= X2 (s−)dX1c (s) + X2 (s−)dJ1 (s) + X1 (s−)dX2c (s) + X1 (s−)dJ2 (s) (5877)
0 0 0 0
X
+[X1c , X2c ](t) + ∆J1 (s)∆J2 (s) (5878)
0<s≤t
Z t Z t
= X2 (s)dX1c (s) + X1 (s)dX2c (s) + [X1c , X2c ](t) (5879)
0 0
X
+ [X2 (s−)∆X1 (s) + X1 (s−)∆X2 (s) + ∆X1 (s)∆X2 (s)]. (5880)
0<s≤t

We are trying to show that Equation 5872 is equivalent to Equation 5871. It hence suffices if we show
that the summand in Equation 5880 is the same as the summand in Equation 5871. The equations
follow by seeing that the jump at points in Xi is same as in Ji , which is to say ∆Xi (t) = ∆Ji (t). Finally,
expand the summand, writing

X1 (s)X2 (s) − X1 (s−)X2 (s−) (5881)


= (X1 (s−) + ∆X1 (s))(X2 (s−) + ∆X2 (s)) − X1 (s−)X2 (s−) (5882)
= X1 (s−)X2 (s−) + X1 (s−)∆X2 (s) + ∆X1 (s)X2 (s−) + ∆X1 (s)∆X2 (s) − X1 (s−)X2 (s−)
= X1 (s−)∆X2 (s) + ∆X1 (s)X2 (s−) + ∆X1 (s)∆X2 (s). (5883)

The result follows.

Exercise 727. By the Ito Product rule Corollary 43, show that two Poisson processes (see Section
6.17.5) N1 (t) ⊥ N2 (t) with intensities λ1 , λ2 respectively have no simultaneous jump almost surely.

Proof. Define the two compensated Poisson processes Mi(t) = Ni(t) − λi t for i = 1, 2, and by independence, their expectation factors s.t. EM1(t)M2(t) = EM1(t)EM2(t) = 0. The Ito product rule states that
Z t Z t
M1 (t)M2 (t) = M1 (s−)dM2 (s) + M2 (s−)dM1 (s) + [M1 , M2 ](t). (5884)
0 0

By left continuity of integrands and martingale property of Mi (t), Theorem 47 asserts that
 
X
0 = E[M1 , M2 ](t) = E  ∆N1 (s)∆N2 (s) . (5885)
0<s≤t

Since the summands ∆N1 (s)∆N2 (s) are non-negative for all s ∈ (0, t], and this holds at all t, the
whole summation term equals zero and therefore each summand is zero. N1 , N2 have no simultaneous
jumps.

13.9.4 Change of Measure for Jump Processes


Recall that the Radon-Nikodym derivative process Z(t) may be written
Z(t) = exp{ − ∫_0^t Γ(s)dW(s) − (1/2) ∫_0^t Γ²(s)ds },   (5886)

which satisfies the SDE

dZ(t) = −Γ(t)Z(t)dW (t) (5887)


= Z(t)dX c (t) (5888)
Rt
when X c (t) = − 0
Γ(s)dW (s). We may write
 
1
Z(t) = exp X c (t) − [X c , X c ](t) . (5889)
2
When working with jumps, Equation 5888 is replaced with

dZ X (t) = Z X (t−)dX(t), (5890)

where X is jump process. When X jumps in size ∆X(s), correspondingly Z X takes jump size ∆Z X (s) =
Z X (s−)∆X(s). Then

Z X (s) = Z X (s−) + ∆Z X (s) = Z X (s−)(1 + ∆X(s)). (5891)

Corollary 44 (Doleans-Dade exponential, Shreve [19]). Let X(t) be jump process, then the Doleans-Dade
exponential of X is the process
 
1
Z X (t) = exp X c (t) − [X c , X c ](t) Π0<s≤t (1 + ∆X(s)). (5892)
2
This is the solution to the SDE

dZ X (t) = Z X (t−)dX(t), (5893)

and with initial condition Z X (0) = 1 the integral of the differential equation may be written
Z t
Z X (t) = 1 + Z X (s−)dX(s). (5894)
0
Rt Rt
Proof. Since X(t) = X c (t) + J(t), where X c (t) = 0 Γ(s)dW (s) + 0 Θ(s)ds, write
Z t Z t 
1
Y (t) = exp Γ(s)dW (s) + Θ(s)ds − Γ2 (s)ds (5895)
2
 0 0

1
= exp X c (t) − [X c , X c ](t) . (5896)
2
The usual Ito calculus gives us dY (t) = Y (t)dX c (t) = Y (t−)dX c (t). Let τ1 be the first jump time,
and define K(t) = 1 in (0, τ1 ), and define K(t) = Π0<s≤t (1 + ∆X(s)) for any time t ≥ τ1 . Then
K(t) is a pure jump process and Z X (t) = Y (t)K(t) (see Equation 5892). If X has jump at t, then
K(t) = K(t−)(1 + ∆X(t)), so

∆K(t) = K(t) − K(t−) = K(t−)∆X(t). (5897)

Since Y (t) is continuous and K(t) is pure jump, Theorem 453 asserts that [Y, K](t) = 0, and by Ito
Product rule (Theorem 43), obtain
Z t Z t
X
Z (t) = Y (t)K(t) = Y (0) + K(s−)dY (s) + Y (s−)dK(s) (5898)
0 0
Z t X
= 1+ Y (s−)K(s−)dX c (s) + Y (s−)K(s−)∆X(s) (5899)
0 0<s≤t
Z t
= 1+ Y (s−)K(s−)dX(s) (5900)
0
Z t
= 1+ Z X (s−)dX(s). (5901)
0

In Poisson processes, the change of measure can change the intensity. In compound Poisson processes,
the intensity and the distribution of jump sizes may be affected by change of measure.
Let N (t) be Poisson process on (Ω, F, P) relative to filtration F(t), t ≥ 0. Let N (t) have intensity
λ, s.t. EN (t) = λt. Denote M (t) = −λt + N (t), the compensated Poisson process. Recall this is P
martingale (see Theorem 449). Furthermore, let λ̃ be positive number and define

Z(t) = exp{(λ − λ̃)t} (λ̃/λ)^{N(t)}.   (5902)
It turns out that for fixed positive T , Z(T ) is a valid Radon-Nikodym derivative process that causes
N (t) to have intensity λ̃ under P̃. By the exponential term and positive values for λ, λ̃, it is trivial that
Z(T ) is almost surely positive. We verify that EZ(T ) = 1.

Lemma 30. For


Z(t) = exp{(λ − λ̃)t} (λ̃/λ)^{N(t)},   (5903)

we have
we have

dZ(t) = ((λ̃ − λ)/λ) Z(t−)dM(t).   (5904)
In particular, Z(t) is P martingale and EZ(t) = 1 for all t ≥ 0.
Proof. Define X(t) = ((λ̃ − λ)/λ) M(t) = ((λ̃ − λ)/λ)(−λt + N(t)) = (λ − λ̃)t + ((λ̃ − λ)/λ) N(t) = X^c(t) + J(t), where X^c(t) = (λ − λ̃)t and J(t) = ((λ̃ − λ)/λ) N(t). See that [X^c, X^c](t) = 0, and in case of jump at t, ∆X(t) = (λ̃ − λ)/λ.
We may then write

1 + ∆X(t) = λ̃/λ,   (5905)
and we may express Equation 5903 Z(t) as the Doleans-Dade exponential Z X (t) discussed Corollary 44,
that is
 
c 1 c c
Z(t) = exp X (t) − [X , X ](t) Π0<s≤t (1 + ∆X(s)). (5906)
2

Rt
In particular, we have (see Equation 5894) Z(t) = 1 + 0
Z(s−)dX(s), and we have martingale
property by left-continuity of integrand (Result 47). Since initial condition Z(0) = 1, EZ(t) = 1 by
martingale property.
R
Then P̃(A) = A Z(T )dP, for all A ∈ F gives us the new measure through Z(T ). We see that this
affects the Poisson intensity.
R
Theorem 456. The probability measure P̃(A) = A
Z(T )dP with Z described as in Lemma 30 is a
probability measure causing the process N (t) to be Poisson process with intensity λ̃.

Proof.

Ẽ exp{uN(t)} = E[exp{uN(t)} Z(t)]   (5907)

= exp{(λ − λ̃)t} E[exp(uN(t)) (λ̃/λ)^{N(t)}]   (5908)

= exp{(λ − λ̃)t} E[exp{(u + log(λ̃/λ)) N(t)}]   (5909)

= exp{(λ − λ̃)t} exp{λt [exp(u + log(λ̃/λ)) − 1]}   (5910)

= exp{λ̃t(exp(u) − 1)}.   (5911)

Compare this with Equation 5757 to see that this is valid change of intensity.
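A quick simulation illustrates the intensity change. The sketch below (our own, with illustrative parameters) reweights Poisson(λ) samples of N(T) by Z(T) from Lemma 30 and checks that the reweighted mean is λ̃T rather than λT.

```python
import numpy as np

# Sketch (illustrative values, not from the text): weighting paths by
# Z(T) = exp((lam - lam_t)*T) * (lam_t/lam)**N(T) turns Poisson(lam) into Poisson(lam_t).
rng = np.random.default_rng(2)
lam, lam_t, T, n = 2.0, 3.5, 1.0, 1_000_000

N_T = rng.poisson(lam * T, size=n)
Z_T = np.exp((lam - lam_t) * T) * (lam_t / lam) ** N_T

print("E[Z(T)]      ~", Z_T.mean(), "(should be 1)")
print("E[N(T) Z(T)] ~", (N_T * Z_T).mean(), "vs lam_t*T =", lam_t * T)
```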

Exercise 728. Consider the geometric Poisson process

S(t) = S(0) exp {αt + N (t) log(σ + 1) − λσt} = S(0) exp {(α − λσ)t} (σ + 1)N (t) . (5912)

Here σ > −1 and N (t) is Poisson process intensity λ under P. Reason from Exercise 724 that exp(−αt)S(t)
is martingale, so we have differential

dS(t) = αS(t)dt + σS(t−)dM (t), (5913)

where M (t) = −λt + N (t). We want to obtain a risk-neutral measure P̃ s.t. the differential

dS(t) = rS(t)dt + σS(t−)dM̃ (t) (5914)

holds, and where M̃ (t) = N (t) − λ̃t is compensated Poisson process under P̃ for N (t) Poisson process
with intensity λ̃. Since the dt drift term in Equation 5913 (due to dM (t)) is (α − λσ)S(t)dt, and
α−r
(r − λ̃σ)S(t)dt in Equation 5914, we solve for λ̃ = λ − σ . We use the result that S(t−)dt, S(t)dt
have the same integrals. Using the random variable Z(t) as in Equation 5903 and change of measure
P̃(A) = A Z(T )dP, we obtain risk-neutral measure P̃. For this to be valid we require λ̃ > 0, so λ > α−r
R
σ .
Otherwise no risk-neutral measure may be obtained, and arbitrage exists. In particular, if σ > 0, and
α−r
λ≤ σ , see that

S(t) ≥ S(0) exp(rt)(σ + 1)N (t) ≥ S(0) exp(rt). (5915)

We may create an arbitrage portfolio by borrowing at r and investing in stock. If −1 < σ < 0, then an
arbitrage portfolio is created by shorting the stock and investing in money market.
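A small numeric illustration of the risk-neutral intensity in this exercise, with made-up parameter values:

```python
# Illustration only (our own example values) of Exercise 728: the risk-neutral
# intensity lam_tilde = lam - (alpha - r)/sigma must be positive for P~ to exist.
alpha, r, sigma, lam = 0.08, 0.02, 0.25, 1.0
lam_tilde = lam - (alpha - r) / sigma
print("risk-neutral intensity:", lam_tilde)          # 0.76 > 0, so no arbitrage here
print("arbitrage if lam <=", (alpha - r) / sigma)
```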

Let N (t) be Poisson process with intensity λ, and a sequence of IIDs Yi ’s defined on probability
space (Ω, F, P). Additionally assume Yi ⊥ N (t). Define a compound Poisson process obtained via
PN (t)
Q(t) = i=1 Yi and see that if N jumps at t, then ∆Q(t) = YN (t) . We wish to employ change of
measure s.t. both the distribution of Yi ’s and Poisson intensity is adjusted. Assume Yi takes discrete
values ym∈[M ] with probability P{Yi = ym } = p(ym ). We further assume p(ym ) > 0, and trivially
P
m∈M p(ym ) = 1. Let Nm (t) be number of jumps size ym in Q(t) up to and including t, and recall
(Corollary 39) that
M
X M
X
N (t) = Nm (t), Q(t) = ym Nm (t) (5916)
m m

holds. Corollary 39 asserts Nm ’s are independent Poisson processes with intensity λm = λp(ym ). For
λ̃m > 0, m ∈ [M], set

Zm(t) = exp{(λm − λ̃m)t} (λ̃m/λm)^{Nm(t)},   Z(t) = Π_{m=1}^{M} Zm(t).   (5917)

Lemma 31. Z(t) specified Equation 5917 is martingale, and EZ(t) = 1 for all t ≥ 0.
Proof. Lemma 30 asserts that
dZm(t) = ((λ̃m − λm)/λm) Zm(t−)dMm(t),   (5918)
where Mm (t) = −λm t + Nm (t). Zm (t−) is left-continuous, and the martingale property of Zm holds
by Result 47. When m 6= n, then Nm , Nn have no simultaneous jumps, and cross variation is zero. Ito
product rule (Theorem 43) asserts

d(Z1 (t)Z2 (t)) = Z2 (t−)dZ1 (t) + Z1 (t−)dZ2 (t). (5919)

Additionally, since Z1 , Z2 are martingales and Zi (t−), i = 1, 2 are left continuous, it also follows that
Z1 Z2 is martingale process. Furthermore, see that Z1 Z2 itself has no simultaneous jumps with Z3 , then
iterating the previous argument gives

d(Z1 (t)Z2 (t)Z3 (t)) = Z3 (t−)d(Z1 (t)Z2 (t)) + (Z1 (t−)Z2 (t−))dZ3 (t). (5920)

We can repeat the same argument to argue that Z(t) = Π_{m=1}^{M} Zm(t) is martingale. It is trivial to see
Z(T ) > 0 almost surely, Z(0) = 1, and by martingale property EZ(T ) = 1 and
Z
P̃(A) = Z(t)dP, ∀A ∈ F. (5921)
A
is valid change of measure.

Theorem 457 (Change of compound Poisson intensity and jump distributions). Assume P̃ specified as
in Equation 5921, and let Q(t) be the compound Poisson process above. Then under P̃, Q(t) is compound Poisson with intensity λ̃ = Σ_{m=1}^{M} λ̃m. Furthermore, the jumps of Q(t), written Y1, Y2, · · ·, are a sequence of IIDs with P̃{Yi = ym} = p̃(ym) = λ̃m/λ̃.
Proof. Recall Ni , i ∈ [M ] are independent under P. Then we may write for 0 ≤ t ≤ T , that

Ẽ [exp(uQ(t))] = E [exp(uQ(t))Z(t)] (5922)


 ( M ) !N (t) 
X n o λ̃m
= E exp u ym Nm (t) · ΠM
m exp (λ m − λ̃ m )t  (5923)
m
λm
" ( ! )#
n o λ̃m
= ΠM
mexp (λm − λ̃m )t · E exp uym + log Nm (t) (5924)
λm
( ( ) !)
M
n o λ̃m
= Πm exp (λm − λ̃m )t · exp λm t exp uym + log( ) −1 (5925)
λm
n o
= ΠMm exp (λm − λ̃m )t + λ̃m t exp(uym ) − λm t (5926)
n o
= ΠMm exp λ̃m t(exp(uym ) − 1) (5927)
= Π_{m=1}^{M} exp{λ̃t p̃(ym) exp(uym) − λ̃m t},  using λ̃m = λ̃p̃(ym),   (5928)
( M
!)
X
= exp λ̃t p̃(ym ) exp(uym ) − 1 . (5929)
m

See that we may also write
Z(t) = exp{Σ_{m=1}^{M}(λm − λ̃m)t} · Π_{m=1}^{M} (λ̃p̃(ym)/(λp(ym)))^{Nm(t)} = exp{(λ − λ̃)t} Π_{i=1}^{N(t)} λ̃p̃(Yi)/(λp(Yi)).   (5930)

If Yi ∼ f(y) density function, then we may apply change of measure s.t. Q(t) has λ̃ intensity and Yi ∼ f̃(y) using the Radon-Nikodym derivative process

Z(t) = exp{(λ − λ̃)t} Π_{i=1}^{N(t)} λ̃f̃(Yi)/(λf(Yi)).   (5931)

Zero division error is avoided by assuming f˜(y) = 0 for all sets of values in y where f (y) = 0.

Lemma 32. Process Z(t) (Equation 5930) is martingale and EZ(t) = 1 for all t ≥ 0.
N (t) λ̃f˜(Yi )
Proof. Shreve [19] Consider the pure jump process J(t) = Πi=1 λf (Yi ) . At jump times of Q, reason that

λ̃f˜(YN (t) ) f˜(∆Q(t))


J(t) = J(t−) = J(t−) . (5932)
λf (YN (t) ) λf (∆Q(t))

Follows that at jump


" #
λ̃f˜(∆Q(t))
∆J(t) = J(t) − J(t−) = − 1 J(t−). (5933)
λf (∆Q(t))

Define compound Poisson process


N (t)
X λ̃f˜(Yi )
H(t) = , (5934)
i=1
λf (Yi )

λ̃f˜(∆Q(t))
then ∆H(t) = λf (∆Q(t)) . Furthermore, see that
" #
λ̃f˜(Yi ) λ̃ ∞ f˜(y)
Z
λ̃
E = f (y)dy = . (5935)
λf (Yi ) λ −∞ f (y) λ

Therefore H(t) − λ̃t is martingale, so (using Equation 5933)

∆J(t) = J(t−)∆H(t) − J(t−)∆N (t), (5936)

and since J, H, N are all pure jump processes, the differential equation

dJ(t) = J(t−)dH(t) − J(t−)dN (t) (5937)


n o
holds. Then using the Ito Doeblin formula, we may write Z(t) = exp (λ − λ̃)t J(t)
Z t n o Z t n o
= Z(0) + J(s−)(λ − λ̃) exp (λ − λ̃)s ds + exp (λ − λ̃)s dJ(s) (5938)
0 0
Z t n o Z t n o Z t n o
= 1+ exp (λ − λ̃)s J(s−)(λ − λ̃)ds + exp (λ − λ̃)s J(s−)dH(s) − exp (λ − λ̃)s J(s−)dN (s)
0 0 0
Z t n o Z t n o
= 1+ exp (λ − λ̃)s J(s−)d(H(s) − λ̃s) − exp (λ − λ̃)s J(s−)d(N (s) − λs) (5939)
0 0
Z t Z t
= 1+ Z(s−)d(H(s) − λ̃s) − Z(s−)d(N (s) − λs). (5940)
0 0

See the two integrators are both martingales (they are compensated Poisson processes), and by initial
condition Z(0) = 1 we get EZ(t) = 1 for all t ≥ 0. The martingale property holds.
The differential form may be written

dZ(t) = Z(t−)d(H(t) − λ̃t) − Z(t−)d(N (t) − λt). (5941)

The differential form implies

∆Z(t) = Z(t−)∆H(t) − Z(t−)∆N (t). (5942)

For fixed T > 0,


Z
P̃(A) = Z(T )dP, ∀A ∈ F(t) (5943)
A

is valid change of measure.

Theorem 458 (Change of compound Poisson intensity and continuous jump distributions). Under
measure P̃ defined as in Equation 5943, Q(t) is a compound Poisson process with intensity λ̃ with IID
jumps Yi ∼ f˜(y), where f˜(y) is continuous density function.
n o
Proof. Shreve [19]. We show Ẽ exp(uQ(t)) = exp λ̃t(ϕ̃Y (u) − 1) , and
Z ∞
ϕ̃Y (u) = exp(uy)f˜(y)dy. (5944)
−∞

First, define
n o
X(t) = exp uQ(t) − λ̃t(ϕ̃Y (u) − 1) , (5945)

and see that at jump times for Q, X(t) = X(t−) exp(u∆Q(t)), and therefore

∆X(t) = X(t) − X(t−) = X(t−) (exp {u∆Q(t)} − 1) . (5946)

Define
N (t)
X λ̃f˜(Yi )
V (t) = exp {uYi } (5947)
i=1
λf (Yi )

and see that this is compound Poisson process. Furthermore,


" #
λ̃f˜(Yi ) λ̃ ∞ f˜(y)
Z
λ̃
E exp(uYi ) = exp(uy) f (y)dy = ϕ̃Y (u). (5948)
λf (Yi ) λ −∞ f (y) λ

So the compensated Poisson process V (t) − λ̃tϕ̃Y (u) is martingale. At jump times of Q, we obtain

λ̃f˜(∆Q(t))
∆V (t) = exp {u∆Q(t)} = exp(u∆Q(t))∆H(t), (5949)
λf (∆Q(t))

where H(t), ∆H(t) are given by Equation 5934. X(t), Z(t) have no Ito integral components and their

cross variation is given by (use Equation 5946, 5942)
X
[X, Z](t) = ∆X(s)∆Z(s) (5950)
0<s≤t
X X
= X(s−)Z(s−)(exp {u∆Q(s)} − 1)∆H(s) − X(s−)Z(s−)(exp {u∆Q(s)} − 1)∆N (s)
0<s≤t 0<s≤t
X
= X(s−)Z(s−)∆V (s) (5951)
0<s≤t
X
− X(s−)Z(s−)∆H(s) (5952)
0<s≤t
X
− X(s−)Z(s−)(exp {u∆Q(s)} − 1). (5953)
0<s≤t
(5954)

The last equality follows from the relation (exp(u∆Q(s))−1)∆N (s) = exp(u∆Q(s))−1 since exp(u∆Q(s))−
1 is non-zero iff ∆N (s) is non-zero, which is when ∆N (s) is just one. Then by Ito Product rule we may
write
Z t Z t
X(t)Z(t) = 1 + X(s−)dZ(s) + Z(s−)dX(s) + [X, Z](t). (5955)
0 0
Rt
1 is constant, 0
X(s−)dZ(s) is martingale by left-continuity and Z being martingale (and invoking Result
47), so considering the remaining terms, (use Equations 5944,5946 in obtaining the second equality)
Z t
Z(s−)dX(s) + [X, Z](t) (5956)
0
Z t X
= Z(s−)dX c (s) + Z(s−)∆X(s) + [X, Z](t) (5957)
0 0<s≤t
Z t X
= −λ̃(ϕ̃Y (u) − 1) X(s−)Z(s−)ds + X(s−)Z(s−)(exp(u∆Q(s)) − 1) (5958)
0 0<s≤t
X X
+ X(s−)Z(s−)∆V (s) − X(s−)Z(s−)∆H(s) (5959)
0<s≤t 0<s≤t
X
− X(s−)Z(s−)(exp(u∆Q(s)) − 1) (5960)
0<s≤t
Z t Z t
= X(s−)Z(s−)d(V (s) − λ̃sϕ̃Y (u)) − X(s−)Z(s−)d(H(s) − λ̃s). (5961)
0 0

We see the remaining terms are also martingales. Then

Ẽ[exp(uQ(t)] = E[exp(uQ(t))Z(t)], (5962)

and since we showed (by the martingale property) that EX(t)Z(t) = 1, we may write
n o
1 = exp −λ̃t(ϕ̃Y (u) − 1) E[exp(uQ(t))Z(t)] (5963)

and the result follows by definition of X(t) (Equation 5945).


PN (t)
As before, define a compound Poisson process Q(t) = i=1 Yi , where N (t) is Poisson process
and some Brownian W (t) on probability space (Ω, F, P). Assume Q(t) has intensity λ and let jumps
Yi ∼ f (y). Assume W (t), Q(t) is defined relative to a single filtration F(t). We know Yi (t) ⊥ W (t).

It turns out that Q(t) ⊥ W (t). Let λ̃ > 0 be defined and f˜(y) be density function defined satisfying
∀y, f (y) = 0 =⇒ f˜(y) = 0. Assume an adapted process Θ(t), and define the processes
 Z t
1 t 2
Z 
Z1 (t) = exp − Θ(u)dW (u) − Θ (u)du , (5964)
0 2 0
n o ˜
N (t) λ̃f (yi )
Z2 (t) = exp (λ − λ̃)t Πi=1 , (5965)
λf (yi )
Z(t) = Z1 (t)Z2 (t). (5966)

Lemma 33. Process Z(t) specified by Equation 5966 is martingale and EZ(t) = 1 for all t ≥ 0.

Proof. We know Z1 (t) is martingale. Lemma 32 asserts that Z2 (t) is martingale. Z1 (t) is continuous,
Z2 (t) has no Ito integral, so Theorem 453 asserts cross variation [Z1 , Z2 ](t) = 0. Then Ito product rule
(Theorem 43) asserts
Z t Z t
Z1 (t)Z2 (t) = Z1 (0)Z2 (0) + Z1 (s−)dZ2 (s) + Z2 (s−)dZ1 (s), (5967)
0 0

and by left-continuity, martingale-ness of Zi (t) for i = 1, 2 apply Theorem 47 to get martingale property
for Z(t) and use initial condition Z(0) = 1 to finish the proof.

Theorem 459. Define probability measure


Z
P̃(A) = Z(T )dP, ∀A ∈ F (5968)
A

for fixed T > 0 and where Z(T ) is defined by Equation 5966. Furthermore,
Z t
W̃ (t) = W (t) + Θ(s)ds (5969)
0

is P̃ Brownian, and Q(t) is compound Poisson process with intensity λ̃. Each jump in Q(t) of Yi ’s are
IID with density f˜(y), and W̃ (t) ⊥ Q(t).

Proof. We want to show that


 
h n oi 1 2 n o
Ẽ exp u1 W̃ (t) + u2 Q(t) = exp u t exp λ̃t(ϕ̃Y (u2 ) − 1) , (5970)
2
R∞
where (see Equation 5944) ϕ̃Y (u) = −∞ exp(uy)f˜(y)dy. Recall exp( 21 u2 t) is mgf of the random variable
n o
∼ Φ(µ = 0, σ 2 = t) and that exp λ̃t(ϕ̃Y (u2 ) − 1) is mgf of compound Poisson process with intensity λ̃
and jump density f˜(y). If Θ(t) ⊥ Q(t), then Z1 ⊥ Q and it follows that
h n oi h i
Ẽ exp u1 W̃ (t) + u2 Q(t) = E exp{u1 W̃ (t)}Z1 (t) · exp{u2 Q(t)}Z2 (t) (5971)
= E[exp(u1 W̃ (t))Z1 (t)] · E [exp(u2 Q(t))Z2 (t)] .
(5972)
n o
Recall that the first term is exp 21 u21 t and that the second term equates to (Equation 5963) exp λ̃t(ϕY (u2 ) − 1) .


Their joint mgf factors. Now suppose when Θ(t) 6⊥ Q(t). Define
 
1 2
X1 (t) = exp u1 W̃ (t) − u1 t , (5973)
2
n o
X2 (t) = exp u2 Q(t) − λ̃t(ϕ̃Y (u2 ) − 1) . (5974)

See that
 
1 1
dX1 (t) = X1 (t) u1 dW̃ (t) − u21 dt + u21 X1 (t)dt (5975)
2 2
= u1 X1 (t)dW̃ (t) (5976)
= u1 X1 (t)dW (t) + u1 Θ(t)X1 (t)dt (5977)

and dZ1 (t) = −Θ(t)Z1 (t)dW (t), and by Ito product

d(X1 (t)Z1 (t)) = X1 (t)dZ1 (t) + Z1 (t)dX1 (t) + dX1 (t)dZ1 (t) (5978)
= −Θ(t)X1 (t)Z1 (t)dW (t) + u1 X1 (t)Z1 (t)dW (t) + u1 Θ(t)X1 (t)Z1 (t)dt − u1 Θ(t)X1 (t)Z1 (t)dt
= (u1 − Θ(t))X1 (t)Z1 (t)dW (t). (5979)

It follows that X1 (t)Z1 (t) is martingale, see Theorem 458 for assertion that X2 (t)Z2 (t) is martingale.
X1 (t)Z1 (t) is continuous, X2 (t), Z2 (t) has no Ito integral component, cross variation evaluates to zero
and by Ito Product
Z t Z t
X1 (t)Z1 (t)X2 (t)Z2 (t) = 1 + X1 (s−)Z1 (s−)d(X2 (s)Z2 (s)) + X2 (s−)Z2 (s−)d(X1 (s)Z1 (s)).(5980)
0 0

Left-continuity of integrands and Theorem 47 asserts EX1 (t)Z1 (t)X2 (t)Z2 (t) = 1 by initial condition
and martingale property. Equation 5970 holds.

Theorem 460. Define a compound Poisson process Q(t) with jumps Yi taking discrete values ym , m ∈
[M ] with p(ym ) = P{Yi = ym } satisfying the usual conditions of probability distributions, and let λ̃ > 0
and p̃(yi ) > 0, i ∈ [M ] sum to one. Further define

N (t) λ̃p̃(Yi )
n o
Z2 (t) = exp (λ − λ̃)t Πi=1 , (5981)
λp(Yi )

and Z(t) as in Equation 5966 with replacement of Equation 5965 by Z2 (t) defined here. Define P̃(A) =
R
A
Z(T )dP for all A ∈ F, and the process
Z t
W̃ (t) = W (t) + Θ(s)ds. (5982)
0

Then W̃ (t) is P̃ Brownian motion, and Q(t) is compound Poisson process intensity λ̃ with IID jumps
P̃(Yi = ym ) = p̃(ym ) with W̃ (t) ⊥ Q(t).

13.9.5 Option Pricing under the Jump Model of S(t) driven by N (t)
When the underlying is driven by a jump process, we need to revise our pricing model. We work in the
case of European call option. Consider an asset S(t) driven by a Poisson process and satisfying

S(t) = S(0) exp {αt + N (t) log(σ + 1) − λσt} (5983)


N (t)
= S(0) exp {(α − λσ)t} (σ + 1) . (5984)

Recall that this has differential form dS(t) = αS(t)dt + σS(t−)dM (t), where M (t) = −λt + N (t) is the
compensated Poisson process. For maturity T we want to price European call on underlying S, s.t. the
α−r α−r
payoff at maturity is V (T ) = (S(T ) − K)+ , and recall we require that λ > σ . Then λ̃ = λ − σ > 0,

n o  N (t)
λ̃
R
and we can define P̃(A) = A
Z(T )dP, where Z(t) = exp (λ − λ̃)t λ . We refer readers to the
previous arguments for these formulations. Furthermore, M̃ (t) = N (t) − λ̃t is martingale, and

dS(t) = rS(t)dt + σS(t−)dM̃ (t), (5985)


d(exp(−rt)S(t)) = σ exp(−rt)S(t−)dM̃ (t). (5986)

Equation 5984 asserts that under the change of measure


n o n o
S(T ) = S(0) exp (r − λ̃σ)t (σ + 1)N (t) · exp (r − λ̃σ)(T − t) (σ + 1)N (T )−N (t) (5987)
n o
= S(t) exp (r − λ̃σ)(T − t) (σ + 1)N (T )−N (t) . (5988)

The discounted asset price is P̃ martingale. Risk neutral pricing asserts that we may express

Ẽ exp {−r(T − t)} (S(T ) − K)+ |F(t)


 
V (t) = (5989)
h n o i
N (T )−N (t) +
= Ẽ exp {−r(T − t)} (S(t) exp (r − λ̃σ)(T − t) (σ + 1) − K) |F(t) . (5990)
h n o i
Since S(t) is F(t) measurable, and exp (r − λ̃σ)(T − t) (σ + 1)N (T )−N (t) ⊥ F(t). By Independence
Lemma, ∃c(t, S(t)) s.t. V (t) = c(t, S(t)) and
h n o i
c(t, x) = Ẽ exp {−r(T − t)} (x exp (r − λ̃σ)(T − t) (σ + 1)N (T )−N (t) − K)+ |F(t) (5991)

= Σ_{j=0}^{∞} exp{−r(T − t)} (x exp{(r − λ̃σ)(T − t)}(σ + 1)^j − K)^+ · exp(−λ̃(T − t)) λ̃^j (T − t)^j / j!

= Σ_{j=0}^{∞} (x exp{−λ̃σ(T − t)}(σ + 1)^j − K exp{−r(T − t)})^+ · exp(−λ̃(T − t)) λ̃^j (T − t)^j / j!.
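For computation, the series can simply be truncated. The sketch below is our own (parameter names and values are illustrative) and evaluates the sum above for a pure-jump underlying.

```python
import math

def jump_call_price(x, K, r, sigma, lam_tilde, tau, n_terms=200):
    """Sketch of the series above: European call under dS = rS dt + sigma S(t-) dM~(t),
    truncating the Poisson sum at n_terms. Here sigma is the jump size parameter (> -1)."""
    total, log_pois = 0.0, -lam_tilde * tau     # log of e^{-lam~ tau} (lam~ tau)^j / j!
    for j in range(n_terms):
        payoff = max(x * math.exp(-lam_tilde * sigma * tau) * (sigma + 1.0) ** j
                     - K * math.exp(-r * tau), 0.0)
        total += payoff * math.exp(log_pois)
        log_pois += math.log(lam_tilde * tau) - math.log(j + 1)   # next Poisson weight
    return total

print(jump_call_price(x=100.0, K=100.0, r=0.02, sigma=0.1, lam_tilde=0.76, tau=1.0))
```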

Returning to the series: when t = T, the only non-zero term in the sum occurs at j = 0, so c(T, x) = (x − K)+ for all x ≥ 0. This is our terminal condition. By iterated conditioning we can argue

exp(−rt)c(t, S(t)) = exp(−rt)V (t) = Ẽ exp(−rT )(S(T ) − K)+ |F(t) .


 
(5992)

That is, the discounted price process is a P̃ martingale. Then

dS(t) = (r − λ̃σ)S(t)dt + σS(t−)dN (t), (5993)

where the continuous part is dS c (t) = (r − λ̃σ)S(t)dt. Additionally, if stock price jumps at time t, then
∆S(t) = S(t) − S(t−) = σS(t−), so S(t) = (σ + 1)S(t−). Ito Doeblin formula (Theorem 454) asserts
that (we abbreviated the cases where arguments are simply u or S(u) such as c(u, S(u)) → c, S(u) → S,
as well as for the partial derivatives cx )

exp(−rt)c(t, S(t)) (5994)


Z t X
= c(0, S(0)) + exp(−ru)[−rcdu + ct du + cx dS c ] + exp(−ru)[c(u, S(u)) − c(u, S(u−))]
0 0<s≤t
Z t h Z t
i
= c(0, S(0)) + exp(−ru) −rc + ct + (r − λ̃σ)Scx du + exp(−ru)[c(u, (σ + 1)S(u−)) − c(u, S(u−))]dN (u)
0 0
= c(0, S(0)) + ∫_0^t exp(−ru)[−rc + ct + (r − λ̃σ)Scx]du + ∫_0^t exp(−ru)[c(u, (σ + 1)S(u−)) − c(u, S(u−))]λ̃du
Z t
+ exp(−ru)[c(u, (σ + 1)S(u−)) − c(u, S(u−))]dM̃ (u). (5995)
0

Rt Rt
Using 0
exp(−ru)[c(u, (σ+1)S(u−))−c(u, S(u−))]λ̃du = 0
exp(−ru)[c(u, (σ+1)S(u))−c(u, S(u))]λ̃du,
we have
Z t
exp(−rt)c(t, S(t)) = c(0, S(0)) + exp(−ru)[−rc + ct + (r − λ̃σ)Scx + λ̃(c(u, (σ + 1)S(u)) − c(u, S(u)))]du
0
Z t
+ exp(−ru)[c(u, (σ + 1)S(u−)) − c(u, S(u−))]dM̃ (u). (5996)
0

The integral with M̃ (u) as integrator has left-continuous integrand, and is therefore martingale. We
know beforehand that exp(−rt)c(t, S(t)) is martingale. Then the term c(0, S(0)) plus the Riemann
integral, which is the difference of two martingales must itself be martingale. This can happen only if
the integrand is zero, that is (expanded)

−rc(t, S(t)) + ct (t, S(t)) + (r − λ̃σ)S(t)cx (t, S(t)) + λ̃(c(t, (σ + 1)S(t)) − c(t, S(t))) = 0. (5997)

See that d(exp(−rt)c(t, S(t))) =

exp(−rt)[−rc + ct + (r − λ̃σ)S(t)cx ]dt + exp(−rt)[c(t, (σ + 1)S(t−)) − c(t, S(t−))]dN (t). (5998)

To use the same argument as we did in continuous time theory, we need to make sure that the non drift
term has an integrator that is martingale with left-continuous integrand. Here N (t) is not martingale
for the non-drift part and we cannot use the same argument as we did in continuous time. By Equation
5997 we have the equation

−rc(t, x) + ct (t, x) + (r − λ̃σ)xcx (t, x) + λ̃(c(t, (σ + 1)x) − c(t, x)) = 0. (5999)

This must hold for 0 ≤ t < T and x ≥ 0, and is said to be a differential difference equation. See that we
have
Z t
exp(−rt)c(t, S(t)) = c(0, S(0)) + exp(−ru)[c(u, (σ + 1)S(u−)) − c(u, S(u−))]dM̃ (u). (6000)
0

Using t → T for Equation 6000 gives us a way to hedge a short European call in the jump model.
If we begin with X(0) = c(0, S(0)) to invest in stock and money market s.t X(t) = c(t, S(t)), or
exp(−rt)X(t) = exp(−rt)c(t, S(t)) for all t ∈ [T ], would mean we have matched differentials

d(exp(−rt)c(t, S(t))) = exp(−rt)[c(t, (σ + 1)S(t−)) − c(t, S(t−))]dM̃ (t). (6001)

Let the stock held in the portfolio process be Γ(t), then portfolio evolves at

dX(t) = Γ(t−)dS(t) + r[X(t) − Γ(t)S(t)]dt, (6002)

s.t (using Equation 5985)

d(exp(−rt)X(t)) = exp(−rt)[−rXdt + dX(t)] (6003)


= exp(−rt)[Γ(t−)dS(t) − rΓ(t)S(t)dt] (6004)
= exp(−rt)σΓ(t−)S(t−)dM̃ (t). (6005)

Then the portfolio process should satisfy

c(t, (σ + 1)S(t−)) − c(t, S(t−))


Γ(t−) = . (6006)
σS(t−)

If we write
c(t, (σ + 1)S(t)) − c(t, S(t))
Γ(t) = , ∀t ∈ [T ], (6007)
σS(t)
then integrating, (use Equation 6005)
Z t
exp(−rt)X(t) = X(0) + exp(−ru)[c(u, (σ + 1)S(u−)) − c(u, S(u−))]dM̃ (u). (6008)
0

Then X(t) = c(t, S(t)) for all values of t and we have constructed the hedge portfolio. See that in the
event of a jump at t, we have c(t, (σ + 1)S(t−)) − c(t, S(t−)) change in option price, while the hedging
portfolio changes by (use Equation 6006)

Γ(t−)(S(t) − S(t−)) = Γ(t−)σS(t−) = c(t, (σ + 1)S(t−)) − c(t, S(t−)). (6009)

If there is no jump then dS(t) = (r − λ̃σ)S(t)dt and the differential (see Equation 5998)

d(exp(−rt)c(t, S(t))) = exp(−rt)[−rc + ct + (r − λ̃σ)S(t)cx ]dt (6010)


= − exp(−rt)λ̃[c(t, (σ + 1)S(t)) − c(t, S(t))]dt. see Equation 5999.
(6011)

Meanwhile the portfolio differential is (using Equation 6005, 6006)

d(exp(−rt)X(t)) = exp(−rt)σΓ(t)S(t)(−λ̃dt) (6012)


= − exp(−rt)λ̃[c(t, (σ + 1)S(t)) − c(t, S(t))]dt. (6013)

We have verified that the hedge portfolio works at both the jump time and non-jump times. In fact,
the same hedging argument can be applied with similar analysis as we did for European calls for any
European derivative with payoff h(S(T )). The differential-difference equation (Equation 5999) applies,
and the terminal condition for the general form is c(T, x) = h(x). The model is complete; there is unique
risk-neutral measure and every derivative security can be hedged, including non-European ones.
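To make the hedge concrete, Equation 6007 translates directly into code. The function below is a minimal sketch (the names are ours); `c` stands for any pricing function c(t, x), for example the truncated series given earlier.

```python
def jump_hedge_ratio(c, t, S, sigma):
    """Sketch of Equation (6007): units of stock held against a short call under the
    jump model, Gamma(t) = [c(t, (sigma+1)S) - c(t, S)] / (sigma * S)."""
    return (c(t, (sigma + 1.0) * S) - c(t, S)) / (sigma * S)

# e.g. with some pricing function call_price(t, x) assumed available:
# gamma_t = jump_hedge_ratio(call_price, t=0.0, S=100.0, sigma=0.1)
```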

13.9.6 Option Pricing under the Jump Model of S(t) driven by W (t) and Q(t)
Let W (t), N1 , · · · NM (t) be Brownian motion and M independent Poisson processes defined on (Ω, F, P)
relative to filtration F(t). Let λm > 0 be intensity of Nm (t) for m ∈ [M ], and let −1 < y1 < · · · < yM
be nonzero, and set
M M N (t)
X X X
N (t) = Nm (t), Q(t) = ym Nm (t) = Yi . (6014)
m m i=1
PM
Then recall N is Poisson process with intensity λ = m λm , and Q is compound Poisson process. Let
λm
Yi take discrete distribution, and define p(ym ) = λ . The random variables Y1 , Y2 , · · · are IID. Set
M M
X 1X
β = EYi = ym p(ym ) = λm ym . (6015)
m
λ m
PM
Recall Q(t) − βλt = Q(t) − t m λm ym is martingale. Now assume a stock price differential

dS(t) = αS(t)dt + σS(t)dW (t) + S(t−)d(Q(t) − βλt) (6016)


= (α − βλ)S(t)dt + σS(t)dW (t) + S(t−)dQ(t). (6017)

The assumptions for yi > −1 ensures that stock does not jump to below zero price, or even to zero.

Theorem 461. Shreve [19]. We assert that the solution to Equation 6017 is
   
1 2 N (t)
S(t) = S(0) exp σW (t) + α − βλ − σ t Πi=1 (Yi + 1). (6018)
2
N (t)
Proof. Define X(t) = exp σW (t) + α − βλ − 21 σ 2 t s.t. S(t) = X(t)J(t), where J(t) = Πi=1 (Yi + 1)
 

is the pure jump process. We know that

dX(t) = (α − βλ)X(t)dt + σX(t)dW (t). (6019)

At jump i, we have J(t) = J(t−)(1 + Yi ) and therefore ∆J(t) = J(t) − J(t−) = J(t−)Yi = J(t−)∆Q(t).
This holds at all times, including at non-jump times. It follows that dJ(t) = J(t−)dQ(t) holds and by
Ito product rule
Z t Z t
S(t) = X(t)J(t) = S(0) + X(s−)dJ(s) + J(s)dX(s) + [X, J](t). (6020)
0 0

Last term evaluates to zero, since X is continuous and J is pure jump. Then
Z t Z t Z t
S(t) = X(t)J(t) = S(0) + X(s−)J(s−)dQ(s) + (α − βλ) J(s)X(s)ds + σ J(s)X(s)dW (s).
(6021)
0 0 0

In differential form, this is expressed

dS(t) = d(X(t)J(t)) (6022)


= X(t−)J(t−)dQ(t) + (α − βλ)J(t)X(t)dt + σJ(t)X(t)dW (t) (6023)
= S(t−)dQ(t) + (α − βλ)S(t)dt + σS(t)dW (t). (6024)
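Equation 6018 is straightforward to simulate. The sketch below uses our own parameter choices and a two-point jump distribution; it draws terminal values of S(t) and checks the P-drift E S(t) = S(0) exp(αt) implied by Equation 6017.

```python
import numpy as np

# Monte Carlo sketch of Equation (6018) (illustrative values): simulate
# S(t) = S(0) exp{sigma W(t) + (alpha - beta*lam - sigma^2/2) t} * prod_i (1 + Y_i)
# with two-point jumps, and check E[S(t)] = S(0) exp(alpha t).
rng = np.random.default_rng(3)
S0, alpha, sigma, t, n = 100.0, 0.05, 0.2, 1.0, 500_000
ys, lams = np.array([-0.1, 0.15]), np.array([0.7, 0.5])    # jump sizes / intensities
lam, p = lams.sum(), lams / lams.sum()
beta = float(np.dot(ys, p))                                 # beta = E[Y_i]

W = rng.normal(0.0, np.sqrt(t), size=n)
Nm = rng.poisson(lams * t, size=(n, 2))                     # jump counts per jump type
jump_factor = np.prod((1.0 + ys) ** Nm, axis=1)
S_t = S0 * np.exp(sigma * W + (alpha - beta * lam - 0.5 * sigma**2) * t) * jump_factor

print(S_t.mean(), "vs", S0 * np.exp(alpha * t))
```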

If we let θ ∈ R, λ̃i > 0 for i ∈ [M ], and let


 
1
Z0 (t) = exp −θW (t) − θ2 t (6025)
2
!Nm (t)
n o λ̃m
Zm (t) = exp (λm − λ̃m )t , m ∈ [M ], (6026)
λm
Z(t) = Z0 (t)ΠM
m Zm (t), (6027)
Z
P̃(A) = Z(T )dP, ∀A ∈ F. (6028)
A

Then as in previous arguments, under P̃, we have W̃ (t) = W (t) + θt Brownian motion, Nm Poisson
processes with intensity λ̃m for m ∈ [M ], and each of W̃ , Ni , i ∈ [M ] are independent processes. If we let
PM PM
λ̃ = m λ̃m and p̃(ym ) = λ̃λ̃m , recall that under P̃, the process N (t) = m Nm (t) is Poisson intensity
λ̃ with IID jump sizes taking distribution P̃(Yi = ym ) = p̃(ym ). Additionally, Q(t) − β̃ λ̃t is martingale,
where
M M
X 1X
β̃ = ẼYi = ym p̃(ym ) = λ̃m ym . (6029)
m λ̃ m

P̃ is risk-neutral iff the stock has drift r under P̃, that is iff

dS(t) = (α − βλ)S(t)dt + σS(t)dW (t) + S(t−)dQ(t) (6030)


!
= rS(t)dt + σS(t)dW̃ (t) + S(t−)d(Q(t) − β̃λt). (6031)

We arrive at the market price of risk equation (expand dW̃ (t) = dW (t) + θdt)

α − βλ = r + σθ − β̃ λ̃, (6032)

which is equivalently expressed (using Equation 6015, 6029)


M
X
α − r = σθ + βλ − β̃ λ̃ = σθ + (λm − λ̃m )ym . (6033)
m

This is an underdetermined linear system: one equation in M + 1 unknowns (θ, λ̃1, · · · , λ̃M). We know (Lemma 2) there are infinitely many solutions, and hence infinitely many risk-neutral measures. The market is not complete. To get a unique risk-neutral measure, we require more equations (and hence more stocks).

Exercise 729. Assume a single Brownian W and two independent Poisson processes N1 , N2 with re-
spective intensities λ1 , λ2 defined on (Ω, F, P) and three compound processes by the following

Qi (t) = yi,1 N1 (t) + yi,2 N2 (t), i = 1, 2, 3. (6034)

Here each of the yi,m > −1 and set λ = λ1 + λ2 ,


1
βi = (λ1 yi,1 + λ2 yi,2 ) , i = 1, 2, 3. (6035)
λ
Next assume a stock process taking

dSi (t) = (αi − βi λ)Si (t)dt + σi Si (t)dW (t) + Si (t−)dQi (t), i = 1, 2, 3. (6036)

The market price of risk equations that correspond to these differential equations are (see Equation 6033)

αi − r = σi θ + (λ1 − λ̃1 )yi,1 + (λ2 − λ̃2 )yi,2 , i = 1, 2, 3. (6037)

This is a system of three equations in three unknowns, and it is possible that there is a unique solution
to the system, which would imply a unique risk-neutral measure.

For some θ, λ̃i , i ∈ [M ] satisfying market price of risk equations (Equations 6033), we would obtain

dS(t) = rS(t) + σS(t)dW̃ (t) + S(t−)d(Q(t) − β̃ λ̃t) (6038)


= (r − β̃ λ̃)dt + σS(t)dW̃ (t) + S(t−)dQ(t). (6039)

This has solution (see Equation 6018)


   
1 2 N (t)
S(t) = S(0) exp σ W̃ (t) + r − β̃ λ̃ − σ t Πi=1 (1 + Yi ). (6040)
2
To compute the risk-neutral price of a call with underlying given by Equation 6040, first see that (Equa-
PM
tion 6035) β̃ λ̃ = m λ̃m ym , and this term appears in the formula. We can first start with desired
risk-neutral intensities λ̃i , i ∈ [M ], and then choose θ to satisfy market price of risk. Define

κ(τ, x) = xΦ(d+ (τ, x)) − K exp(−rτ )Φ(d− (τ, x)), (6041)

where
σ2
   
1 x
d± (τ, x) = √ log + r± τ , (6042)
σ τ K 2
Recall this is the same form as in European call pricing in the jump-less models. So
" + #

    
1 2 P̃
κ(τ, x) = Ẽ exp(−rτ ) x exp −σ τ Y + r − σ τ − K , Y ∼ Φ(0, 1). (6043)
2

Theorem 462. For 0 ≤ t < T , the risk-neutral price of a call option is given by

V (t) = Ẽ exp(−rτ )(S(T ) − K)+ |F(t) ,


 
(6044)

and where V (t) = c(t, S(t)) for



X λ̃j τ j  n o 
c(t, x) = exp(−λ̃τ ) Ẽκ τ, x exp −β̃ λ̃τ Πji (1 + Yi ) . (6045)
j=0
j!

Proof. Shreve [19] Let 0 ≤ t < T and for τ = T − t, we have


   
1 2 N (T )
S(T ) = S(t) exp σ(W̃ (T ) − W̃ (t)) + r − β̃ λ̃ − σ τ Πi=N (t)+1 (1 + Yi ). (6046)
2

Independence Lemma asserts we may write V (t) = Ẽ[exp(−rτ )(S(T )−K)+ |F(t)] = c(t, S(t)), and where
c(t, x)
"      + #
1 2 N (T )
= Ẽ exp(−rτ ) x exp σ(W̃ (T ) − W̃ (t)) + r − β̃ λ̃ − σ τ · Πi=N (t)+1 (1 + Yi ) − K
2
" " +  ##
2
    
σ N (T )
= Ẽ Ẽ exp−rτ x exp σ(W̃ (T ) − W̃ (t)) + r − β̃ λ̃ − τ · Πi=N (t)+1 (1 + Yi ) − K |σ ΠN (T ) (1 + Yi )
2 i=N (t)+1
" " +  ##
√ σ2
     
−rτ −β̃ λ̃ N (T ) N (T )
= Ẽ Ẽ exp x exp exp −σ τ Y + r − τ Πi=N (t)+1 (1 + Yi ) − K |σ Πi=N (t)+1 (1 + Yi ) ,
2

and where Y = −(W̃(T) − W̃(t))/√τ ∼ Φ(0, 1) under P̃. Here σ(X) is the σ-algebra generated from the random
N (T ) N (T ) N (T )
variable X; since Πi=N (t)+1 (1 + Yi ) is σ(Πi=N (t)+1 (1 + Yi )) measurable and Y ⊥ σ(Πi=N (t)+1 (1 + Yi )),
by Independence Lemma, we can write
" +  #
√ σ2
     
−rτ −β̃ λ̃ N (T ) N (T )
Ẽ exp x exp exp −σ τ Y + r − τ Πi=N (t)+1 (1 + Yi ) − K |σ Πi=N (t)+1 (1 + Yi )
2
 n o 
N (T )
= κ τ, x exp −β̃ λ̃τ Πi=N (t)+1 (Yi + 1) (6047)
s.t. c(t, x) = Ẽ[κ(τ, x exp{−β̃λ̃τ} Π_{i=N(t)+1}^{N(T)} (Yi + 1))]. If N(T) − N(t) = j, then Π_{i=N(t)+1}^{N(T)} (1 + Yi) has the same distribution as Π_{i=1}^{j} (1 + Yi). Since P̃{N(T) − N(t) = j} = exp(−λ̃τ) λ̃^j τ^j / j!, we are done.
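The formula in Equation 6045 can be evaluated by truncating the Poisson sum and estimating the inner expectation by Monte Carlo over the jump products. The sketch below is our own implementation under these assumptions (discrete two-point jumps, illustrative parameters); κ is Equation 6041.

```python
import numpy as np
from scipy.stats import norm

def kappa(tau, x, K, r, sigma):
    """Black-Scholes-style kernel of Equation (6041)."""
    d_plus = (np.log(x / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d_minus = d_plus - sigma * np.sqrt(tau)
    return x * norm.cdf(d_plus) - K * np.exp(-r * tau) * norm.cdf(d_minus)

def jump_diffusion_call(x, K, r, sigma, tau, ys, p_tilde, lam_tilde,
                        n_terms=60, n_mc=20_000, seed=0):
    """Sketch of Equation (6045): truncate the Poisson sum and estimate
    E~[kappa(tau, x e^{-beta~ lam~ tau} prod(1+Y_i))] by Monte Carlo. Names are ours."""
    rng = np.random.default_rng(seed)
    beta_tilde = float(np.dot(ys, p_tilde))
    price, pois_w = 0.0, np.exp(-lam_tilde * tau)
    for j in range(n_terms):
        if j == 0:
            inner = kappa(tau, x * np.exp(-beta_tilde * lam_tilde * tau), K, r, sigma)
        else:
            draws = rng.choice(ys, size=(n_mc, j), p=p_tilde)
            prod = np.prod(1.0 + draws, axis=1)
            inner = kappa(tau, x * np.exp(-beta_tilde * lam_tilde * tau) * prod,
                          K, r, sigma).mean()
        price += pois_w * inner
        pois_w *= lam_tilde * tau / (j + 1)     # next Poisson weight
    return price

print(jump_diffusion_call(100.0, 100.0, 0.02, 0.2, 1.0,
                          ys=np.array([-0.1, 0.15]), p_tilde=np.array([0.6, 0.4]),
                          lam_tilde=1.2))
```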

When we are working with Yi ∼ f (y) continuous density functions, we may replace β = EYi =
R∞
−1
yf (y)dy. Similar assumptions for avoiding zero division errors are required as in previous discussions.
We may choose θ, λ̃ > 0 and density f˜(y) so that the market price of risk becomes

α − r = σθ + βλ − β̃ λ̃ (6048)
R∞
where β̃ = ẼYi = −1
y f˜(y)dy and the same theorem applies.

Theorem 463. For call price c(t, x) given by Equation 6045



X λ̃j τ j  n o 
c(t, x) = exp(−λ̃τ ) Ẽκ τ, x exp −β̃ λ̃τ Πji (1 + Yi ) , (6049)
j=0
j!

the condition
"M #
1 2 2 X
−rc + ct + (r − β̃ λ̃)xcx + σ x cxx + λ̃ p̃(ym )c(t, (1 + ym )x) − c(t, x) = 0, 0 ≤ t < T, x ≥ 0
2 m

is satisfied. The terminal condition is c(T, x) = (x − K)+ for all x ≥ 0.

Proof. See that (Equation 6031) dS c (t) = (r − β̃ λ̃)S(t)dt + σS(t)dW̃ (t), and by Ito Doeblin we may
write

exp(−rt)c(t, S(t)) − c(0, S(0)) (6050)


Z t   Z t
1
= exp(−ru) −rc + ct + (r − β̃ λ̃)Scx + σ 2 S 2 cxx du + exp(−ru)σScx dW̃ (u) (6051)
0 2 0
X
+ exp(−ru)[c(u, S(u)) − c(u, S(u−))]. (6052)
0<u≤t

If u is Nm jump time, then S(u) = (1 + ym )S(u−) and we may write


X
exp(−ru)[c(u, S(u)) − c(u, S(u−))] (6053)
0<u≤t
M X
X
= exp(−ru)[c(u, (1 + ym )S(u−)) − c(u, S(u−))]∆Nm (u) (6054)
m 0<u≤t
M Z
X t
= exp(−ru)[c(u, (1 + ym )S(u−)) − c(u, S(u−))]d(Nm (u) − λ̃m u) (6055)
m 0
Z t M
X λ̃m
+ exp(−ru)λ̃[ c(u, (1 + ym )S(u)) − c(u, S(u))]du (6056)
0 m λ̃
M Z
X t
= exp(−ru)[c(u, (1 + ym )S(u−)) − c(u, S(u−))]d(Nm (u) − λ̃m u) (6057)
m 0
" M
#
Z t X
+ exp(−ru)λ̃ p̃(ym )c(u, (1 + ym )S(u)) − c(u, S(u)) du (6058)
0 m

Using this in the integral form (Equation 6052), and taking differentials, write

d(exp(−rt)c(t, S(t)) (6059)


( M
)
1 X
= exp(−rt) −rc + ct + (r − β̃ λ̃)Scx + σ 2 S 2 cxx + λ̃ [p̃(ym )c(t, (1 + ym )S(t)) − c(t, S(t))] dt
2 m

+ exp(−rt)σScx dW̃ (t) (6060)


M
X
+ exp(−rt)[c(t, (1 + ym )S(t−)) − c(t, S(t−))]d(Nm (t) − λ̃m t). (6061)
m

See Equation 6061. The non dt terms are martingales. The LHS is martingale by risk-neutral pricing,
and the difference between LHS and non-dt terms must be martingale - it follows the dt term needs to
be zero.

Corollary 45. The call price c(t, x) as in Equation 6045 satisfies d(exp(−rt)c(t, S(t))
M
X
= exp(−rt)σScx dW̃ (t) + exp(−rt)[c(t, (1 + ym )S(t−)) − c(t, S(t−))]d(Nm (t) − λ̃m t)(6062)
m

= exp(−rt)σS(t)cx dW̃ (t) (6063)


+ exp(−rt)[c(t, S(t)) − c(t, S(t−))]dN (t) (6064)
"M #
X
− exp(−rt)λ̃ p̃(ym )c(t, (1 + ym )S(t−)) − c(t, S(t−)) dt. (6065)
m

Proof. The equality follows by setting dt term to zero by martingale property, and recalling that N (t) =
PM PM λ̃m
m Nm (t), λ̃ = m λ̃m , p̃(ym ) = λ̃ .
Again, when Yi ∼ f̃(y) for continuous density f̃ instead, then we make the changes Σ_{m=1}^{M} p̃(ym)c(t, (1 + ym)x) → ∫_{−1}^{∞} c(t, (1 + y)x)f̃(y)dy, and Σ_{m=1}^{M} p̃(ym)c(t, (1 + ym)S(t−)) → ∫_{−1}^{∞} c(t, (1 + y)S(t−))f̃(y)dy.

Now consider the hedging problem for a short European call. As usual, we start with X(0) =
c(0, S(0)), match differentials in the discounted portfolio value and discounted European call value,
where portfolio X(t) evolves at

dX(t) = Γ(t−)dS(t) + r[X(t) − Γ(t)S(t)]dt (6066)

for adapted Γ(t) shares of stock continuously held at t.


Since (see Equation 6031)

d(exp(−rt)X(t)) = exp(−rt)[−rX(t)dt + dX(t)] (6067)


= exp(−rt)[Γ(t−)dS(t) − rΓ(t)S(t)dt] (6068)
= exp(−rt)[Γ(t)σS(t)dW̃ (t) + Γ(t−)S(t−)d(Q(t) − β̃ λ̃t)] (6069)
M
X
= exp(−rt)[Γ(t)σS(t)dW̃ (t) + Γ(t−)S(t−) ym (dNm (t) − λ̃m dt)]. (6070)
m

If we cancel out the Brownians by trying the usual delta-hedging by holding Γ(t) = cx (t, S(t)), then the
Brownian motion component disappears and the hedge gives us (see Equation 6065)

d (exp(−rt)c(t, S(t)) − exp(−rt)X(t)) (6071)


M
X
= exp(−rt)[c(t, (1 + ym )S(t−)) − c(t, S(t−)) − ym S(t−)cx (t, S(t−))] · (dNm (t) − λ̃m dt).
m

κ(τ, x) is convex, so c(t, x) is convex and c(t, x2) − c(t, x1) > (x2 − x1)cx(t, x1) for all x1 ≥ 0, x2 ≥ 0 and x1 ≠ x2. Then

c(t, (1 + ym )S(t−)) − c(t, S(t−)) > ym S(t−)cx (t, S(t−)). (6072)

Note ym > −1 and non-zero, hence the strict inequality. Then the differential d (exp(−rt)c(t, S(t)) − exp(−rt)X(t)) <
0 between jumps and > 0 at jumps. Both exp(−rt)c(t, S(t)), exp(−rt)X(t) are P̃ martingales. Since their
initial condition is the same, then Ẽ[exp(−rt)c(t, S(t))] = Ẽ[exp(−rt)X(t)] for all t ∈ [T ].
The delta hedge only works in P̃ expectation, in that on average, the delta hedge works. When
Yi ∼ f˜(y), where f˜ is continuous density function instead, then we have

d(exp(−rt)c(t, S(t)) − exp(−rt)X(t)) (6073)


= exp(−rt)[c(t, S(t)) − c(t, S(t−)) − (S(t) − S(t−))cx (t, S(t−))]dN (t) (6074)
Z ∞
− exp(−rt)λ̃ [c(t, (1 + y)S(t−)) − c(t, S(t−)) − yS(t−)cx (t, S(t−))]f˜(y)dydt. (6075)
−1

Since

c(t, (y + 1)S(t−)) − c(t, S(t−)) − yS(t−)cx (t, S(t−)) > 0, (6076)

for all nonzero y > −1, between jumps the differential d(exp(−rt)c(t, S(t)) − exp(−rt)X(t)) < 0 and
the hedge portfolio outperforms. Since c(t, S(t)) − c(t, S(t−)) − (S(t) − S(t−))cx (t, S(t−)) > 0, at jump
times, the hedge portfolio underperforms.

Chapter 14

Volatility Trading

This chapter is attributed to the theory and practice of volatility trading, with primary focus on the
variance risk premium. Necessarily, the theory of option pricing will be involved. Although volatility
may as well be traded with the use of linear products (such as UVXY), or with other derivatives (such as
variance swaps), the accessibility and depth of the option markets give us the most commonplace access
to trading volatility. Option pricing is discussed in our chapter on stochastic calculus (Chapter 13),
exploring option pricing through the lenses of the Black-Scholes-Merton model and various extensions.
Both the pricing of European and American (and Asian) options were given treatment in those sections
- here we will give more focus to the pricing of European options, since they are less unwieldy and
produce better intuition - we will find out that as long as interest rates are low or time horizon is small,
or both - then the European option is satisfactory approximation for American one. In most cases, such
approximation should suffice. After all, long term memory in volatility is regime dependent and most
variance premiums are harvested over short horizons.
Options are the right but not the obligation to buy or sell some underlying asset S at (European)
or up to (American) some specified time T at some specified price K. The right to do so is valued as
the option premium, or the value of the option. The seller is in turn obligated to fulfill the terms of this
contract if the buyer chooses to exercise his right. The parameters defining an option contract are given
by the option type (call/put/barrier), underlying asset S, strike K, expiry T , exercise style (typically
specified by a region). A call is the right to buy, and a put is the right to sell. The underlying asset may
affect the form of settlement, and therefore the obligation the seller of an option contract is subjected
to. For instance, the underlying asset of a stock option is shares (often 100 units) of the stock, and
some multiple of the cash value of the index in index options.
The specifications of the option contract in relation to market structure, clearing and margin concerns
are left up to the reader to find out and clearly depends on both product and brokerage being traded
with.

14.1 Arbitrage Bounds


We begin by presenting bounds for option pricing that do not require any models. The arguments rely
on the principle of no-arbitrage. The formal definition is given in Definition 453, and the existence of a
risk-neutral measure asserts the absence of arbitrage (Theorem 425). We would not need such formality
though - simply said, we should not be able to construct a portfolio that begins with zero capital and at

some point in the future has positive probability of profit and zero probability of loss. No free money,
that is.
We begin by denoting the value of the European call, European put, American call, American put as
c, p, C, P respectively. Then we have

c ≤ C, p ≤ P. (6077)

The American option is always worth at least as much as the European option, since the American
option becomes a European option if we simply do not choose to exercise before expiry. All we have on
top is choice. If the inequalities do not hold, the arbitrage portfolio is short the European and long the
American at same strike and expiry.
A call option cannot cost more than the underlying. We have

c ≤ S. (6078)

Otherwise the arbitrage portfolio is short the call and long stock. We use the short proceeds to purchase
stock, and invest the remaining. If the call expires worthless, we are done. If the call is exercised, we
fulfill our obligations with the stock held. The money market investment is our free lunch. See that we
can similarly fulfill our obligations at any time prior to expiry, so the argument holds for American calls

C ≤ S. (6079)

An American put cannot be more than the strike price.

P ≤ K. (6080)

Otherwise, the arbitrageur can sell the put, receive P, and since the maximum possible payout on the short put is K, he locks in a risk-free profit of at least P − K > 0. For the European option, since the exercise can only occur at expiry, we have

p ≤ K exp(−rt). (6081)

The arbitrage portfolio if this inequality does not hold sells put for p, invests in the money market for
p exp(rt) > K at expiry, which is greater than the maximum loss K.
The arguments for put-call parity relationship between a European call and European put is given
in Equation 3944. There we presented that the payoff of a long call, short put portfolio is replicating
portfolio for a forward contract and therefore must satisfy

c − p = S − K exp(−rt). (6082)

If the underlying pays dividends, we would have to modify the discounting and use c − p = S − D −
K exp(−rt), where D is the present value of any dividends paid.
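In practice the parity relation is used as a quick mispricing check. A minimal sketch, with illustrative quotes only:

```python
import math

def put_call_parity_gap(c, p, S, K, r, T, D=0.0):
    """Deviation from European put-call parity c - p = S - D - K e^{-rT}.
    A materially nonzero gap (beyond costs) signals a potential mispricing."""
    return (c - p) - (S - D - K * math.exp(-r * T))

print(put_call_parity_gap(c=7.30, p=5.10, S=100.0, K=100.0, r=0.03, T=1.0))
```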
We may easily derive from this put-call parity relationship the following bounds:

c ≥ S − K exp(−rt), (6083)
p ≥ K exp(−rt) − S. (6084)

The put-call argument does not hold when working with American options, since there is no guarantee
that the short American put is not exercised to expiry. We know, however, that the value of the American
options must be at least as great as their intrinsic value, else the arbitrage portfolio purchases and
immediately exercises the right:

C ≥ max(0, S − K), (6085)


P ≥ max(0, K − S). (6086)

In addition the American call is at least as valuable as the European call, and since we have a lower
bound already for the European call

C ≥ max(0, S − K exp(−rt)). (6087)

This tighter bound implies that the American call option on a non-dividend paying stock should never
be exerised early. An in-the-money call option may be sold for S − K exp(−rt), greater than the intrinsic
value S − K obtained from early exercise. In this particular scenario, the early exercise option is not
useful and both American and European call options on non-dividend stocks are priced the same.
Now consider two strikes K2 > K1 . We want to construct bounds for the relationship between calls
and puts of different strike price. Consider a portfolio that is long one call at K1 strike and short one
call at K2 strike. For the European option, at maturity, if S < K1 , then both calls are worthless, if
K1 < S < K2 , then only the long call position is in-the-money with payout S − K1 , else if S > K2 ,
the long call position is worth S − K1 , short call position is worth −(S − K2 ) and the portfolio is worth
K2 − K1 . In any case the portfolio is at least worth zero and we can write

c(K1) ≥ c(K2).   (6088)

In all three scenarios the portfolio value at expiration is ≤ K2 − K1 , so we must have

c(K1 ) − c(K2 ) ≤ (K2 − K1 ) exp(−rt). (6089)

But the same argument holds if we were to allow exercise prior to expiry, with the exclusion of the
discount factor -

C(K1 ) − C(K2 ) ≤ (K2 − K1 ). (6090)

Now consider a portfolio that is short one put at K1 strike and long one put at strike K2 . We can run
through the same scenarios and determine that the value of the final portfolio is K2 − K1 , K2 − S, 0 in
the three states of the world S < K1 , K1 < S < K2 and S > K2 respectively. In all states the portfolio
value at expiry is positive and ≤ K2 − K1 , so

p(K2 ) ≥ p(K1 ) (6091)

and

p(K2) − p(K1) ≤ (K2 − K1) exp(−rt)   (6092)

and again if we were to consider the possibility of early exercise

P(K2) − P(K1) ≤ (K2 − K1).   (6093)

For the next relationship, we would not outline the arguments, but it should be relatively straight-
forward algebra to verify. For three strikes given K1 < K2 < K3 and α := (K3 − K2)/(K3 − K1), we have

αc(K1 ) + (1 − α)c(K3 ) ≥ c(K2 ), (6094)


αC(K1 ) + (1 − α)C(K3 ) ≥ C(K2 ), (6095)
αp(K1 ) + (1 − α)p(K3 ) ≥ p(K2 ), (6096)
αP (K1 ) + (1 − α)P (K3 ) ≥ P (K2 ). (6097)
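These convexity (butterfly) bounds are easy to monitor across a strike ladder. A minimal sketch, with made-up quotes:

```python
def convexity_violated(k1, k2, k3, price1, price2, price3, tol=0.0):
    """Check the butterfly bound alpha*V(K1) + (1-alpha)*V(K3) >= V(K2),
    alpha = (K3-K2)/(K3-K1); returns True if the middle strike quote is too rich.
    Applies to calls or puts, European or American, per Equations (6094)-(6097)."""
    alpha = (k3 - k2) / (k3 - k1)
    return alpha * price1 + (1.0 - alpha) * price3 < price2 - tol

print(convexity_violated(90.0, 100.0, 110.0, price1=12.0, price2=8.5, price3=4.0))
```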

805
Consider two portfolios. The first portfolio is long one American call, strike K with K in cash.
The second portfolio is long one American put and one stock. For the first portfolio, we assume the
underlying pays no dividends, so there is no early exercise - or rather we choose not to. At expiry, if
S < K then our portfolio is worth K exp(rt) and otherwise we choose to exercise, and our portfolio is
worth S − K + K exp(rt) = S + K(exp(rt) − 1). Consider the second portfolio - if we do not exercise
early, we have K if the put expires in the money and we exercise the right to sell our stock, otherwise we
hold on to the stock and our portfolio is worth S. If we do exercise early, it is only so when the put is in
the money, at which point we receive K at t∗ and invest in the money market to get K exp(r(t − t∗ )) at
expiry. See that the first portfolio is worth at least as much as the second portfolio, so C + K ≥ P + S,
or

C − P ≥ S − K. (6098)

Again recall the put-call parity relationship c − p = S − K exp(−rt). Then p = c − S + K exp(−rT ).


Since the P > p, we have

P ≥ c − S + K exp(−rT ). (6099)

We said when early exercise feature is worthless then the American option pricing is priced as the
European one, so in the absence of dividends we have

C − P ≤ S − K exp(−rT ). (6100)

Using the lower bound found earlier - we have

S − K ≤ C − P ≤ S − K exp(−rT ). (6101)

Here the put-call parity bounds on American options demonstrate that if the time to expiry is small, or
rates are low, then the European one approximates American ones pretty well. This has to do with how
favorable it is for the buyer of a put to do early exercise on a deep in the money option and invest his
proceeds in the money market. When the stock pays dividends then our bound is modified:

S − D − K ≤ C − P ≤ S − K exp(−rT ), (6102)

where D is the present value of dividends paid.


An interesting option structure is the box spread. Consider two strikes K1 < K2 , and then construct
a portfolio c(K1 ), −c(K2 ), −p(K1 ), p(K2 ), and go through the mental exercise to verify that in all states
of the world, the terminal value of the portfolio is K2 − K1 . So

c(K1 ) − c(K2 ) − p(K1 ) + p(K2 ) = (K2 − K1 )exp(−rT ). (6103)

The sole risk is in the interest rate movements. Note that only the European options perform this
function.
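Equivalently, the market cost of the box reveals an implied financing rate, which is a common practical use of Equation 6103. A minimal sketch with illustrative quotes:

```python
import math

def box_spread_implied_rate(c1, c2, p1, p2, k1, k2, T):
    """From c(K1) - c(K2) - p(K1) + p(K2) = (K2 - K1) e^{-rT}, back out the rate r."""
    cost = c1 - c2 - p1 + p2
    return -math.log(cost / (k2 - k1)) / T

print(box_spread_implied_rate(c1=12.0, c2=4.3, p1=2.0, p2=4.1, k1=90.0, k2=100.0, T=1.0))
```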

14.2 Pricing Model


For those who have already taken the stochastic calculus pill (Chapter 13), we walk through the central
arguments that lead to the pricing of a European put and call option. For those who have not done so,
we will outline the main points such that our arguments remain tractable in this section, but the next
page or so will probably not make any sense - fret not.

It all begins with the symmetric random walk (Definition 429), which is scaled (Definiton 431)
and shown to be normal in distribution as the number of time steps n → ∞ (Theorem 396), the
continuous version of which is the Brownian motion (Definition 432). We saw the Brownian motion is
not continuously differentiable, and there is non-trivial second order quadratic variation in the Brownian
motion, a fact we summarise as dW (t)dW (t) = dt (Theorem 402). Furthermore it was shown that the
cross-variation of the Brownian motion with time and the quadratic variation of time itself is given by
dW (t)dt = dtdt = 0 (Theorem 403). We showed how an exponentiated Brownian motion gives rise to
the geometric Brownian motion which we typically assume as asset price dynamics, and how this leads
to the (sampled) volatility of an asset price process (Definition 435). These are the central results of the
first part in financial stochastic calculus.
The Ito integral is then introduced - we are not able to differentiate paths of the Brownian motion
w.r.t to time, so we defined an integral directly w.r.t the Brownian motion such that we may specify
an integral I(t) = ∫_0^t ∆(u)dW(u), where ∆(u) is a simple process - we began with simple integrands
(Definiton 439), and we observe that Ito integrals w.r.t to martingales are also martingales (Theorem
409). An important result is Ito isometry (Theorem 410) and the quadratic variation of the Ito integral,
facts we write as EI²(t) = E∫_0^t ∆²(u)du and dI(t)dI(t) = ∆²(t)dt respectively. It is later extended to
general integrands as a limit of the simple integrands, which now allows us to include portfolio processes
(Definition 441). Most of this is to arrive at the important Ito-Lemma.
The above points are lost on many a trader, who instead chooses to begin with a memorization
of the Ito-Lemma, which would mostly suffice in following the pricing arguments in the Black-Scholes
Merton model. There are many variants, and we summarize: the Ito Doeblin formula for a differentiable
f (x) is written (Definition 442)
df(W(t)) = f′(W(t))dW(t) + (1/2) f″(W(t))dt.   (6104)

We can let f further depend on time t and so (Theorem 413)

df(t, W(t)) = ft(t, W(t))dt + fx(t, W(t))dW(t) + (1/2) fxx(t, W(t))dt.   (6105)
An Ito process (Definition 443) is a stochastic process that evolves both w.r.t time and a Brownian
motion, given
Z t Z t
X(t) − X(0) = ∆(u)dW (u) + Θ(u)du. (6106)
0 0

All the other important results can be obtained from the above results about cross and quadratic variation
of Brownian motions. For instance the differential form of the Ito process is written dX(t) = ∆(t)dW (t)+
Θ(t)dt. One may easily derive that the quadratic variation of the Ito process:

dX(t)dX(t) = (∆(t)dW (t) + Θ(t)dt)(∆(t)dW (t) + Θ(t)dt) = ∆2 (t)dt (6107)

using dW (t)dW (t) = dt, dW (t)dt = dtdt = 0 without going through the arduous proof (Theorem 414).
Now we may define an Ito integral w.r.t to an Ito process (Definition 444), we write:
Z t Z t Z t
Γ(u)dX(u) = Γ(u)∆(u)dW (u) + Γ(u)Θ(u)du. (6108)
0 0 0

and the Ito Doeblin formula for this is (as we may expect) shown in Theorem 415
df(t, X(t)) = ft(t, X(t))dt + fx(t, X(t))dX(t) + (1/2) fxx(t, X(t))dX(t)dX(t).   (6109)

It won’t be difficult to express this in terms of dt and dW (t) by expanding dX(t) which we already know
how to do.
We see how the geometric Brownian motion can be generalized in Example 656. There we see how the Ito Lemma can be used to relate the asset price process S(t) = S(0) exp{∫_0^t σ(s)dW(s) + ∫_0^t (α(s) − (1/2)σ²(s)) ds}
to the asset price dynamics

dS(t) = α(t)S(t)dt + σ(t)S(t)dW (t). (6110)

The reverse derivation - obtaining the asset price process from asset price dynamics is given in Exercise
657.
We shall move on to the Black-Scholes-Merton derivation. We begin with some X(t) cash at the
outset, and follow its evolution across time holding some ∆(t) units of stock worth S(t) at time t
dynamically across time (Section 13.2.3.1). We then see how a portfolio of a single European call option
might evolve (Section 13.2.3.2) - using Ito’s Lemma. Now we want to construct a hedge portfolio for
a short European call option - then naturally all we need to do is match their evolution dynamics by
matching the drift and diffusion terms. It turns out that the hedge units required ∆(t) is precisely
cx (t, S(t)), and to match the drift terms, we arrive at the Black Scholes Merton partial differential
equation (Equation 3848):
1
ct (t, x) + rxcx (t, x) + σ 2 x2 cxx (t, x) = rc(t, x) ∀t ∈ [0, T ), x ≥ 0. (6111)
2
In this discussion we did not raise the concern of the payoff related to the European call option at all -
in fact this partial differentiation equation needs to be satisfied by all sorts of derivatives. It is a solution
to the partial differentiation equation, parameterized by the boundary and terminal conditions relating
to the specific derivative that determines the asset pricing formula. The specific solution to the PDE
for European call option is verified in Section 13.2.4 and in Exercise 665, where we assert it is given by
Equation 3850

c(t, x) = xΦ(d+ (τ, x)) − K exp(−rτ )Φ(d− (τ, x)) 0 ≤ t < T, x > 0, (6112)

where τ = T − t, Φ is normal c.d.f of Φ(0, 1) (see Section 6.17.6) and

σ2
   
1 x
d± (τ, x) = √ log + r± τ . (6113)
σ τ K 2

In Exercise 665 we work out the arduous calculus steps in deriving the option greeks from the closed
form solution for European calls, giving

1. call delta: Φ(d+ (τ, x)),

2. call theta: −rK exp(−rτ )Φ(d− (τ, x)) − √ Φ0 (d+ (τ, x)),
σx
2 τ

3. call gamma: Φ0 (d+ (τ, x)) xσ1√τ .

The other two first-order Greeks such as rho and vega were not explored there - particularly because
rho effects are secondary to option price changes in the real world, and in the purest form the Black-
Scholes-Merton assumes constant volatility. However, the vega term is in fact important to option price
dynamics and we will explore them here, in addition to more second-order Greeks for completeness.
The mathematics of gamma scalping imposed by the curvature of the option pricing is explored in
Example 666. An alternative way of arriving at the Black-Scholes-Merton PDE is presented in Exercise

808
667. One constructs a portfolio of a call option and short ∆(t) of stock and realises this portfolio is risk-
free - hence it must earn the risk-free rate. The arguments in the proof are precise, but use Ito-Lemma.
Again, it is not assumed one is familiar with stochastic calculus, so we will present the same argument
but in an approximated, discrete form in this section, but whether you are familiar with one or the other,
it is the motivation at which we arrive the PDE that is important. At this stage we only verified the
solution to the BSM - now in Exercise 668 we derive it, and show that it is the discounted expected
payoff. The reward from correctly predicting volatility distinct from the market implied volatility levels
are highlighted by Example 669.
Most of the arguments outlined above should do the trick; for good measure we will go abit further.
Section 13.3 introduces the concept of risk-neutrality via the change of measure, the important result
being that derivatives are priced as risk-neutral expected payoffs, and is given the risk-neutral pricing
formula (Equation 4193). Explaining exactly what this means would be too mathematical at this point,
but the intuition of why risk-neutral probabilities arise in the first place is encapsulated well by the
examples given in Exercise 730, which shows that actuarial value need not equal to the market price
even when economic agents are rational investors. Actually we have covered the European option call
pricing for time-varying rates and volatility using an extended-form of the Black-Scholes-Merton model,
but we will leave this as reference (Exercise 685) and not use it here. The more important extension
is when the stock presents dividends. Here it is the portfolio with the re-investment of dividends in
money-market that is risk-neutral martingale (Section 13.3.7.1). The European option price hence takes
an adjusted close-form solution (Equation 4398):

S(t) exp(−aτ )Φ(d+ (τ, x)) − exp(−rτ )KΦ(d− (τ, x)). (6114)

where τ is time to maturity and a is the continuous dividend yield. This is appropriate if the underlying
is an index, for instance. For a stock option it is more likely that the dividends are lump-sum (Definition
13.3.7.3), and the pricing is abit more difficult (Equation 13.3.7.4):

c(t, x) = S(0)Πn−1
j=0 (1 − aj+1 )Φ(d+ ) − exp(−rT )KΦ(d− ) (6115)
h Pn−1 i
with d± = σ√1 T log S(0) 1 2

K + j=0 log(1 − aj+1 ) + r ± 2 σ T , where aj∈[n] are the dividend rates prior
to option expiry.
If one were to carefully go through each of the references above, then understanding the precise
continuous time arguments should be no issue, albeit abit challenging. Either way, here we give an
outline of the arguments that should be understandable for any reader, regardless of whether she has
gone through referenced material. The binomial one-step option pricing model is given before in Exercise
730 to highlight the existence of a risk-neutral world - here we give some more details. Assume a stock
trades at S today, and we short ∆ units of stock and long one call option struck at K = S (at the
money). In the next time step, with probability p it has gross return u and with probability 1 − p it has
gross return d. On the up move our portfolio is worth Cu = max(S0 u − 100, 0) − ∆S0 u, on the down
move it is worth Cd = max(S0 d − 100, 0) − ∆S0 d, to be hedged we want:

Cu − ∆S0 u = Cd − ∆S0 d. (6116)

Obviously the hedge ratio we require is


Cu − Cd
∆= . (6117)
Su − Sd
Then we are indifferent to the stock price move and we are perfectly hedged in this two-state economy.
This also means that our portfolio is risk-free, and should therefore earn the risk-free rate. Then the

809
present value of the portfolio is given by exp(−rT )(Cu − ∆Su ). The current value of the portfolio is
given by C − ∆S. So the value of the call option today is given by

C = ∆S + exp(−rT )(Cu − ∆Su ) (6118)


= exp(−rT ) [exp(rT )∆S + Cu − ∆Su ] . (6119)
exp(rt)−d
Define p := u−d , and we have
 
Cu − Cd Cu − Cd
C = exp(−rT ) exp(rT ) S + Cu − Su (6120)
Su − Sd Su − Sd
 
Cu − Cd Cu − Cd
= exp(−rT ) exp(rT ) + Cu − u (6121)
u−d u−d
 
exp(rT ) + u − d − u exp(rT )(−1) + u
= exp(−rT ) Cu + (6122)
u−d u−d
= exp(−rT ) [pCu + (1 − p)Cd ] . (6123)

Interestingly p does not involve any probability estimations - but rather depend on the components
related to the magnitude of up and down moves; it is the volatility that matters. If we interpret p to be
some probability, then one may see that
exp(rt) − d u − exp(rt) exp(rt)Su − exp(rt)Sd − Sdu + Sdu
pSu + (1 − p)Sd = Su + Sd = = exp(rt)S.
u−d u−d u−d
This is an expectation of the stock price in the next time step in the risk-neutral world. The risk-neutral
pricing assumes that the stock grows at the risk free rate. It is the limit of the binomial model that
gives rise to the geometric Brownian motion assumed for asset prices (Theorem 397). One may use this
to price paths to obtain the pricing of different kinds of options (such as put options) by changing the
payoff, as well as different exercise styles such as that of an American option - however, we will not
explore this option here. It should be enough to keep in mind that as the number of time steps is allowed
to reach infinity, the binomial tree results converges to our BSM result.
We derive the BSM model using some discrete time arguments, parallel to what we did in Exercise
667. We start by assuming a delta-hedged position holding a call option and short ∆ stock units, so
portfolio is worth C − ∆St . The change in the portfolio value is given by

C(St+1 ) − C(St ) − ∆(St+1 − St ) − r(C − ∆St ), (6124)

corresponding to change in the call price, change in the underlying stock price and interest accrued. We
may approximate this first term C(St+1 ) − C(St ) by second-order Taylor expansion w.r.t to asset price
change and change in call value w.r.t to time passing, so we may write
1
∆(St+1 − St ) + Γ(St+1 − St )2 + θ − ∆(St+1 − St ) − r(C − ∆St ) (6125)
2
1
Γ(St+1 − St )2 + θ − r(C − ∆St ) (6126)
2
 2
δc δ2 c δC St+1 −St
where ∆ = δS , Γ = δS 2 , θ = − δt . On average we have S ≈ σ 2 , so we let (St+1 − St )2 take
σ 2 S 2 , and substituting we obtain
1 2 2 !
Γσ S + θ − r(C − ∆St ) = 0. (6127)
2
and this should equal zero since the portfolio is financed with zero initial capital. We can refer to the
above pargraphs and see this is just the Black-Scholes-Merton PDE. We assumed that the underlying

810
can be traded in any size, shortable and that it is deeply liquid, with a common interest for borrowing
and investing, and that there were no dividends. We also assumed that the price is continuous and
that asset prices are log-normal. We assumed that volatility is constant, and that we live in a tax-free
world. Clearly this world does not exist, and it is these assumptions that cause the inaccuracy of BSM
models in pricing options in the real world. Even though it might not be that accurate in pricing a
single option, the BSM model is still a great tool in allowing us to evaluate the relative pricing between
different options, including those of different underlying, strike or expiry dates. It is likely that the errors
in the BSM model for an option correlates to the errors in the estimation relating to a similar option,
and the estimate of the spread value may be fairly robust. This allows us to make statements such as
’option A looks to be more rich in volatility compared to option B’, and as a volatility trader this should
be critically helpful.

14.3 Option Greeks


Our discussion so far was mostly on call options, but we know what the put-call parity relationship is,
so we do also know how to price a put option (Equaiton 3948). The European options give closed form
solutions, and the American ones detract only mildly from the European ones, so we will stick to our
analysis on European options. We will also discard with the convention of c, C, p, P and treat them
indiscriminately.
So let’s repeat some of the more salient points:

c(t, x) = xΦ(d+ (τ, x)) − K exp(−rτ )Φ(d− (τ, x)) (6128)


p(t, x) = K exp(−rτ )Φ(−d− (τ, x)) − xΦ(−d+ (τ, x)) (6129)

with
σ2
   
1 x
d± (τ, x) = √ log + r± τ . (6130)
σ τ K 2
For graphical interpretation of these formulas, use the Internet. As mentioned, we have already done
the hard work to compute option greeks in Exercise 665 on European calls, and the same for European
puts by the put-call parity relationship (Exercise 670) :
δC
1. δS = ∆c = Φ(d+ (τ, x)),
δP
2. δS = ∆p = Φ(d+ (τ, x)) − 1,

3. − δC √ Φ0 (d+ (τ, x)),


σx
δt = θc = −rK exp(−rτ )Φ(d− (τ, x)) − 2 τ

4. − δP √ Φ0 (d+ (τ, x)).


σx
δt = θp = rK exp(−rτ )Φ(−d− (τ, x)) − 2 τ

δ2 C δ2 P
5. δS 2 = Γc = δS 2 = Γp = Φ0 (d+ (τ, x)) xσ1√τ .

A rough rule of thumb for an ATM call option delta is about 0.50, and this approximation is more
accurate the closer the option is to expiry. Gamma is the measure of the convexity of the curve interpo-
lating between a deep-in-the money option and a far-out-of-the money option. As such, one may think
of Gamma as a measure of how option-like an option is. When option is near expiry, Gamma is high for
ATM options and is roughly bell-shaped w.r.t to the strike, although the peak lies slightly to the left of
the strike. The maximum is actually at
3σ 2
 
S ∗ = K exp −(r + )t . (6131)
2

811
To make comparisons across different underlying stock levels, we might take S · Γ to normalize the stock
price component that contributes to Gamma absolute levels. The Gamma scalping effects are explored
in Exercise 666. Theta is the negative partial derivative of the option value w.r.t time. It is usually
θ
presented as a normalized value over a day, so we may take θ̃ = 365 . Some argue that θ effects should
not be fully accounted over the weekends.
 We won’t bother with the philosophical discussion. Theta is
3σ 2
the minimum at Ŝ = K exp (r + 2 )t .
Consider Equation 6126 that describes our PnL approximately:
1 2 2
Γσ S + θ − r(C − ∆St ) = 0. (6132)
2
We are interested in PnL movements over short periods of time - we may ignore the third term. The
two Greeks that dominate our PnL is written
1 1
Γ(St+1 − St )2 + θ = Γ(∆S )2 + θ = 0. (6133)
2 2
Then θ = − 21 S 2 ( ∆SS )2 Γ ≈ − 12 S 2 σimp
2
Γ where the approximation holds on average. The gamma and
theta effects are priced to cancel each other out. The PnL for a delta hedged position can approximately
be written as such:
1
P nL = Γ(∆S )2 + θ(∆t ) (6134)
2 " #
2
1 2 ∆S 2
= ΓS − σimp (∆t) . (6135)
2 S

The breakeven price movement in the underlying for our delta-hedged portfolio to have net zero PnL is
the level where the squared daily return equals implied variance. We may also relate this to the profits
from correctly calibrating a mispriced market implied volatility - see Exercise 669, where we argue that
we may earn at instantaneous rate of 12 (σrlz
2 2
− σimp )St2 Γ.
So far we have introduced delta, gamma and theta. To get vega and rho, we need to do some more
mathematics. We know from Equation 3853 that

d− (τ, x) = d+ − σ τ , (6136)

d+ (τ, x) = d− + σ τ . (6137)

So we have
δd− δd+ √ δd+ δd−
= − τ, = . (6138)
δσ δσ δr δr
We assert an additional relation.
h   i
1 x σ2
Lemma 34. For d+ , d− given by d± (τ, x) = √
σ τ
log K + r± 2 τ we have

S0 Φ0 (d+ ) = K exp(−rt)Φ0 (d− ). (6139)

Proof. First see that

d2− − d2+ = (d− − d+ )(d− + d+ ) (6140)


√ √
= (−σ τ )(2d+ − σ τ ) (6141)
σ2
" #
√ 2 ln SK0 + 2(r + 2 )τ

= (−σ τ ) √ −σ τ (6142)
σ τ
 
S0
= −2 ln + rτ . (6143)
K

812
2
Then since Φ0 (x) = √1

exp(− x2 ), see that

Φ0 (d+ )
 
1 2 2
 S0
ln 0 = d − d+ = − ln + rτ . (6144)
Φ (d− ) 2 − K
The result follows.

Let’s continue by computing vega, the partial derivative of an option value w.r.t implied volatility.
See that
δC δd+ δd−
= S0 Φ0 (d+ ) − K exp(−rτ )Φ0 (d− ) (6145)
δσ δσ δσ
0 δd+ 0 δd+ √
= S0 Φ (d+ ) − K exp(−rτ )Φ (d− )( − τ) Equation 6138 (6146)
δσ δσ
δd+ √
= (S0 Φ0 (d+ ) − K exp(−rτ )Φ0 (d− )) + τ K exp(−rτ )Φ0 (d− ) (6147)
√ δσ
= τ S0 Φ0 (d+ ). Lemma 34. (6148)

Clearly the put call parity asserts that the vega of a European call and European put are identical. So
we have

νC = νP = τ S0 Φ0 (d+ ). (6149)

Vega is typically scaled so that it represents the dollar change in option value per percentage point
change in implied volatility. It can be shown that ν = ΓσS 2 τ . As volatility acts on the underlying
process over time, the sensitivity of the call price to volatility changes, which is precisely what vega
measures, increases w.r.t to time. An ATM option that is close to expiry has large gamma and small
vega, and vice versa. This is the reason why purchasing short dated options are referred to as buying
gamma, and buying long dated options are referred to as buying vega.
Often traders are more interested in the total profit from correctly predicting mispriced volatility
rather than the instantaneous profit given 12 (σrlz
2 2
− σimp )St2 Γ. We may approximate it in vega terms as
2 2
follows: suppose we have some call priced at σimp but σrlz was realized, then for δ = σrlz − σimp we have

2 2 δC
C(σimp + δ) = C(σimp )+δ 2 . (6150)
δσrlz
All else equal, the profit is given by δ δσδC
2 and we have
rlz

δC δC δσrlz 1
2 = 2
=ν· . (6151)
δσrlz δσrlz δσ 2σrlz
So the PnL is approximated
1 ν 2 2

δ·ν = σrlz − σimp (6152)
2σrlz 2σrlz
ν
= (σrlz − σimp ) (σrlz + σimp ) (6153)
2σrlz
≈ ν (σrlz − σimp ) , (6154)

the approximate, or average total profit from accurate prediction of volatility over implied levels.
To compute rho we take ρc =
 
δC δd+ δd−
= S0 Φ0 (d+ ) − K exp(−rτ ) Φ0 (d− ) − τ Φ(d− ) (6155)
δr δr δr
δd+
= [S0 Φ0 (d+ ) − K exp(−rτ )Φ0 (d− )] + τ K exp(−rτ )Φ(d− ) (6156)
δr
= τ K exp(−rτ )Φ(d− ). Lemma 34 (6157)

813
δC δP
By the put-call parity we have δr − δr = τ K exp(−rτ ) so ρP =

δP δC
= − τ K exp(−rτ ) (6158)
δr δr
= τ K exp(−rτ ) (Φ(d− ) − 1) (6159)
= τ K exp(−rτ ) (1 − Φ(−d− ) − 1) (6160)
= −τ K exp(−rτ )Φ(−d− ). (6161)

Rho is typically scaled to be the dollar change in option value per percentage point change in interest
rates.

14.4 Volatility Measurement and Forecasting


If the mathematics of the previous section were lost on traders, the primary realization should be that
volatility is an important, if not the most important input into option pricing. The estimation of volatility
that is subsequently more accurate than implied by the market should, on avarege, lead to monetary
profits - but not much has been said so far about volatility measurement or forecasting it. This section
shall be dedicated to that front.

14.4.1 Measurements of Volatility


Volatility is a concept, but it is not directly measurable. We may assume the true price processes are
continuous and unknown, whereas the data we obtain are samples of this price process measured at
discrete time intervals. Our computation for the estimate of population volatility from sample data use
statistical estimators. There are many statistical estimators, differing in bias (see Definition 371) and
efficiency (see Definition 336). The bias of an estimator is related to the systematic over/under estimation
of the population parameter, while the efficiency indicates the convergence properties of our statistical
estimator - which is related to the sampling error, or the average amount of noise in our statistical
estimator. The forecasting of volatility requires a model that captures our notion of the behavior of
volatility processes as measured by our choice of statistical estimator.
The definition for volatility is given as the square root of variance of logarithmic returns. The sample
variance is given as
n
1 X
s2 = (xi − x̄)2 , (6162)
N i=1

where s is the volatility and xi are logarithmic returns, typically measured by close-to-close prices,
and x̄ is the sample mean for the logarithmic returns. n is the size of the sample used for volatility
estimation. Normally, volatility is presented in annualized terms, and variance is (assumed) additive,
so we need a scalarization factor. If s2 in Equation 6162 was computed with daily returns, then the
annualized variance would be 252 · s2 (because there are typically 252 trading days in a year) and the

annualized volatility would be 252 · s. The sample variance is an example of a statistical estimator for
the population variance parameter σ 2 . Given logarithmic returns r, we know that σ 2 = E[r2 ] − E[r]2 ,
but often E[r] is very close to zero. It is even smaller when we square it, so σ 2 ≈ E[r2 ] and we often
treat the mean return as zero - this is rather convenient, and also standard practice, since x̄, the sample
mean of return in Equation 6162 is very noisy. We can typically improve the accuracy of measurement

814
by removing the noisy parameter; so that the estimator we use becomes
n
1X 2
s2 = x . (6163)
n i i

But the sample variance is biased (Exercise 343) - note that this does not require normality assumptions,
although in the derivation of the Black-Scholes model we assumed that returns are normal. To get an
unbiased estimate we take
n 2
σ̂ 2 = s . (6164)
n−1
σ̂ 2 is unbiased estimator for the population variance, so Eσ̂ 2 = σ 2 . One should note that taking the
square root of this does not yield an unbiased estimator of the population volatility, due to the concavity

imposed by ·. In particular, by Jensen’s inequality (Theorem 341) we have
√ √ √
Eσ̂ = E σ̂ 2 ≤ Eσ̂ 2 = σ 2 = σ. (6165)

If the returns are assumed normal then it turns out that

2 Γ( n2 )
r
Eσ̂ = σ, (6166)
n Γ( n−1
2 )
q n
σ̂ 2 Γ( 2 )
so that k is unbiased estimator of population standard deviation, where k = n Γ( n−1 is the bias
2 )
correction factor. A rough value of n commonly used in practice is about 30, and for this value k is
slightly greater than 0.96. In volatility trading, it is often also customary to set n to the dte of the
option. Although unbiased, our close-to-close estimator is an inefficient estimator for portfolio variance
- so the convergence is rather slow. We need more samples, so we can increase n - but we are not
measuring volatility over a stationary process. As n increases, we are capturing market dynamics that
are increasingly further from the current regime - and hence the tradeoffs between convergence and
sampling relevant data to prevailing market regimes. The variance of our estimator is roughly given

σ2
V ar(σ̂) ≈ . (6167)
2n
The tradeoff may be alleviated in two directions - one is sampling at higher frequencies, or using a
different statistical estimator all together.
An alternative volatility estimator is the Parkinson estimator, given
v
u n  2
u 1 X hi
σ̂ = t ln (6168)
4n ln 2 i li

where hi , li are the high and low prices in the data bar. Here the volatility is measured by the bar
range, which captures a different aspect of price volatility. By sampling two data points per OHLC
bar, for a fixed window of time n, we have more data points, and hence better convergence properties.
Assuming a driftless geometric brownian motion, the Parkinson estimator is shown to be about five times
more efficient than the close to close estimator. However, when the GBM assumptions are violated, this
efficiency benefit is muted. Furthermore, Parkinson estimator assumes continuous prices, but since high
and low prices are obtained from discretely sampled market data during regular trading hours, it is highly
probably that the true, underlying extreme prices were not recorded, especially during overnight trading
- hence Parkinson estimator tends to underestimate true volatility. The Parkinson estimator should be

815
divided by correction factors to reduce the impact of bias, which can be rather significant. For n = 30,
the correction factor is roughly 0.75. Another issue other than the bias is the tendency for jumps in the
opening prices in markets.
Another popular volatility estimator combines the information from both the bar range (like the
Parkinson estimator) and close-to-close and is named the Garman-Klass estimator
v
u n  2 n  2
u1 X 1 hi 1X ci
σ̂ = t ln − (2 ln 2 − 1) ln , (6169)
n i 2 li n i ci−1

where ci is the closing price sampled at period i. To relax the assumptions on a driftless GBM process,
the Rogers-Satchel estimator gives
v
u n      
u1 X hi hi li li
σ̂ = t ln ln + ln ln (6170)
n i ci oi ci oi

which allow price trends. To further account for possibility of jumps in opening prices, the Yang-Zhang
estimator gives
p
σ̂ = σ̂o2 + kσ̂c2 + (1 − k)σ̂rs
2 (6171)

where
n n
!2
1 X oi 1X oi
σ̂o2 = ln − ln , (6172)
n−1 i ci−1 n i ci−1
n n
!2
1 X ci 1 X ci
σ̂c2 = ln − ln , (6173)
n−1 i oi n i oi
n      
2 1X hi hi li li
σ̂rs = ln ln + ln ln , (6174)
n i ci oi ci oi
0.34
k = n+1 . (6175)
1.34 + n−1

We can see that Yang-Zhang estimator is a mixture of the overnight volatility, intraday volatility and
Rogers-Satchel estimator. The Garman-Klass estimator, Rogers-Satchel estimator and Yang Zhang
estimator are all biased downwards, since they use price extremes on assumptions of continuous prices,
but in reality the prices are sampled discretely only at market hours. Although each successove estimator
is constructed to address a shortcoming of the last, in practice, there is no ‘best’ estimator. The efficency
gains and improvements assume a geometric Brownian motion, but when tested on simulated market
data behaving closer to real market data, their differences are less distinct. Furthermore it is impossible
to benchmark their performance on real market data, as the true value (and hence errors) are unknown.
If they were known - we would not need an estimator in the first place. A fine estimator would be the
mean of all of these estimators. It is more important that we know what these estimators are telling us.
For instance, if the Parkinson volatility is higher than the close-to-close volatility, we might suspect that
a large proportion of volatility is driven by intraday movements.

816
Chapter 15

Risk Premiums

15.1 Formalism
Determine the fair actuarial priced security as the price equivalent to the pricing under objective expected
payoff. Then the rational investor should choose to trade the side that best hedges his portfolio, such that
variance is minimized at constant return expectation. However, investor portfolios are more correlated
than not, leading to flow imbalance. This imbalance drives market price away from actuarial value, and
is the risk premium. Traders taking on addition to aggregate risk take the compensation as opposed
to the hedger. Iqbal [11] claims changes in FX are primarily driven by risk premiums, with secondary
variables in expected real exchange rates, inflation expectations, BOP, rate differentials and others. In
economic theory, negative economics cause yields to rise and lower currency strength. However, most
developed market governments bonds take negative correlation to economic state. EMFX (AUD, NOK,
...) fall relative to DMFX (USD, CHF, JPY) in risk-off, even if the negative impact are geographically
based in the DM. It stands counter-intuitive that sovereign debt value rises during economic weakness,
just as fiscal position weakens due to falling tax receipts and rising welfare payments. It is then that
market/currency volatility is not driven by pure fundamentals, and premium most be considered for
a more sound empirical analysis. Fact that government bonds and some currencies exhibit negative
correlation to economic state must be reviewed in the context of risk premiums.
The discussions in this section are motivated by the work of Iqbal [11] on risk premiums in relation
to currency markets and asset pricing.
Definition 472 (Actuarial Asset Value). Define the actuarial value of an asset at t under probability
measure P as
VtP = EPt [Xt+1 ]
where Xt+1 is the (random) payoff in the next time period, and Et is the time t conditional expectation.
Assume rates at zero here. These reflect the actual (objective) probabilities of the events in the sample
space.
Definition 473 (Market Asset Value). Define the market value of an asset at t under probability measure
Q as
Vt = EQ
t [Xt+1 ]

where Xt+1 is the (random) payoff in the next time period, and Et is the time t conditional expectation.
Assume rates at zero here. These reflect the market implied probabilities of the events in the sample
space, and can be noted as the risk-neutral or risk-adjusted probability measure.

817
Definition 474 (Risk Premium). The risk premium is the market value less the actuarial value of the
asset, given Vt − VtP , with notations as defined in Definition 472 and Definition 473. Note that the Q,
the risk-neutral probability measure of the market asset value is dropped for brevity.

Exercise 730 (Three Examples of Risk Premiums). We draw three illustrations from Iqbal [11] to show
the existence for risk premiums in market pricing.

1. A Risky Corporate Bond Consider a risky zero coupon bond with unit payoff. The objective
probability of default is 0.10. Actuarial bond value is computed VtP = E[Xt+1 ] = 0.9·1+0.10·0 = 0.9.
However, there may be a risk premium associated, such that the market pricing is 0.8. Then the
Q probabilities of default are 0.8, by solving the obvious linear equation. The buyer of the bond
has an expected payoff of 0.10, and this is her risk premium. The change from P world to Q
is called the change of measure, or risk-neutral/risk-adjusted pricing. The differences can be
intuited. For instance, suppose the bond has high correlation with other economic assets. The bond
(company) survives if the economy does well. In the face of negative economic news, the assets in
an investor portfolio performs poorly, and so does the bond. The bond adds risk (variance) to the
investor’s portfolio rather than diversify it, and hence the buyer demands a risk premium to hold
the bond. Conversely, consider the CDS insuring the same corporate bond. Under no discount and
no-arbitrage, the portfolio of {bond, CDS} must equal 1 at maturity. Then the market price of
CDS = 1 − 0.80 = 0.20, but the expected payoff is only 0.10, amounting to a negative risk premium
of −0.10. Other assets that can act as a portfolio hedge are developed market government bonds,
DMFX relative to EMFX and so on. Why do currencies such as USD, CHF, JPY, (and often)
EUR often exhibit small or negative risk premiums in relation to AUD, MXN and so on?

2. Bookmaking Consider a bookmaker offering odds on an election between Growth Party and Dec-
imation Party, with impact on national economics as their name suggests. She receives 70 bets on
0.3
DP and 30 bets on GP, implying Q(DPwin ) = 0.70. These define decimal odds of 1 + 0.7 = 1.42 for
DP, and 3.33 for GP. This translates to a unit bet on DP returning 1.42 if DP wins. The bookie is
perfectly hedged. (Suppose he receives 100, and his payout is fixed at 70 ∗ 1.42 = 30 ∗ 3.33 = 100.)
The key here is that she does not know the P, and is perfectly hedged with the information on
capital flow. Suppose objective probabilities P 6= Q, and that P places equal probabilities on the
binary outcome. Why can this divergence occur? This can be due to investor hedging the risk that
DP wins. The DP win bet should carry a negative risk premium, since this hedges their overall
portfolio. This example can also be used to illustrated elevated volatility levels ahead of important
events. Suppose DP doubles down on their decimation plan, and in a second round of bookmaking,
the bets imply Q(DPwin ) = 0.80. Assume no change in P. The price of the bet still changes. No
change in the objective probabilities are required to observe volatility. It does not matter that differ-
ent market participants have different views on the probabilities of market outcomes - the market
behaves as if the underlying and derivative bets are priced on the same probability measure - the
risk-neutral one. Otherwise, there will be arbitrage.

3. Bimodal One-Step Option Pricing Consider the situation where GBP/USD spot trades at
1.30 before the BOE has key decision on rates. A hike will send the GBP-USD to 1.31 and a
hold to 1.29, with objective probabilities P(Hike) = 0.8 and P(Hold) = 0.2. FX options market
maker is tasked with pricing a call with notional 100M GBP and strike 1.30. Inspecting the option
payoffs her hedging strategy allows her to price the option at 500 thousand USD. She sells the
option for 500 thousand USD and buys 50M GBP/USD as hedge. If BOE hikes, then she owes

818
100M · (1.31 − 1.30) = 1M USD and makes 50M · (1.31 − 1.30) = 500K on her trade, with net profit
zero after considering the option premium. Else if BOE holds rates she loses 50M · (1.29 − 1.30) =
500K USD on her trade, the call is not exercised and her net profit is zero. In fact we could have
inferred the Q probabilities from tthe spot price and used this to correctly price the option, by solving
the equation
1.30 = Q(Hike) · 1.31 + Q(Hold) · 1.29
If she had simply used the risk-neutral probabilities Q(Hike) = Q(Hold) = 0.50 and computed

Vt = 100M · [Q(Hike) · max(1.31 − 1.30, 0) + Q(Hold) · max(1.29 − 1.30, 0)] (6176)


= 100M · (0.5 · 0.01) (6177)
= 500K (6178)

her hedging strategy would have been the same! Note that if she had priced the option with P, the
objective probabilities, she would have arrived at a price open to arbitrage.

It is no coincidence that the Q probabilities pricing the GBP/USD spot (or any other underlying) is
the same one that prices the option. This is the First Fundamental Theorem of Asset Pricing, which we
discuss in Theorem (425) under more theoretical settings. Arbitrage is not possible iff there exists Q
probabilities to price all assets, including derivatives. Since the same Q probabilities are used to price
spot and derivative, then the risk premium associated with both are consistent.

Corollary 46 (Duality of Risk Premium on Underlying and Derivative). An aside of the First Fun-
damental Theorem of Asset Pricing (see Theorem (425)) is that since the same risk-neutral probability
measure Q are used to price spot and derivative, the two must have consistent risk premiums.

Definition 475 (Risk-Off / Risk-On). From the above discussions on risk-premiums (Example (730)),
a period of increased risk-aversion, also known as risk-off sentiment can be viewed as the increase in
divergence between Q probabilities and P probabilities. The inference on convergence follows as logical.

15.1.1 Microeconomic Foundations of Risk Premiums


This section presents the theory of how utility theory can be used to explain risk premiums, in particular
the concavity of utility functions affecting consumption smoothness and linking equivalent probability
measures (see Definition (296) on equivalent probability measures).

15.1.1.1 One-Period Model

Consider a utility function as defined in (see Definition (390)), then the investor’s lifetime utility in a
single-period model is
U (ct , ct+1 ) = u(ct ) + βEPt [u(ct+1 )]
where EPt [u(ct+1 )] indicates expected utility in the next period, based on consumption ct+1 , and β is
degree of discounting for consumption in the next period. This is closely related to real interest rates.

Theorem 464 (Central Pricing Equation in the One-Period Utility Model). Purchase of assets provide
payoff, and investors can choose to buy more assets today (and consume less today) in favour of future
consumption. Denote Vt as price of asset and Xt+1 denote the random payoff in the next period. Then
investors buy n number of assets to maximise utility such that
 
n = arg max u(ct ) + βEPt [u(ct+1 )]

819
where the argument n is involved in the terms ct = et − Vt · n and ct+1 = et+1 + nXt+1 with endowments
et , et+1 representing exogenous states of the investor. By substitution,
δ  !
u(et − Vt · n) + βEPt [et+1 + nXt+1 ] = 0 (6179)
δn
which has solution Vt u0 (ct ) = EPt [βu0 (ct+1 )Xt+1 ]. The LHS represents to first order the loss of utility
today due to consumption fall, balanced with a corresponding expected gain in utility in the next period
from receiving the payoff Xt+1 discounted by her time preference β. Rearranging, we can express
 0 
u (ct+1 )
Vt = EPt β 0 Xt+1
u (ct )
and we call this the central pricing equation.
The takeaway from the central pricing equation (Theorem 464) is that the correlation of asset random
payoff and marginal utility is important to asset pricing today. If Xt+1 is high in states of the economy
where ct+1 is high, then this asset must get a lower price. This is because the ratio of marginal utilities
decrease as ct+1 is high (recall the diminishing marginal returns of utility functions, Definition 390).
These are assets that add risk to an investor portfolio. The converse is true. Investors bid up assets that
allow them to smooth consumption (by reducing portfolio variance).
Corollary 47 (Asset Price Depends on Correlation of Random Payoff and Marginal Utility Ratio).
Cov(X,Y )
Note that we have the formula Cov(X, Y ) = E[XY ] − EXEY and ρ(X, Y ) = σX σY . Since the
central pricing equation must also be used to price a risk-free h 0 zero icoupon bond, then (with no discount
assumption) Xt+1 = 1 in all economic states and 1 = Et β uu(c P t+1 )
0 (c )
t
*.
Then we may rewrite
 0 
u (ct+1 )
Vt = EPt β 0 Xt+1 (6180)
u (ct )
 0   0 
u (ct+1 ) P u (ct+1 )
= EPt β 0 Et [Xt+1 ] + Covt β 0 , Xt+1 (6181)
u (ct ) u (ct )
 0   0   0 
u (ct+1 ) P u (ct+1 ) u (ct+1 )
= EPt β 0 Et [Xt+1 ] + ρt β 0 , Xt+1 σt β 0 σt [Xt+1 ] (6182)
u (ct ) u (ct ) u (ct )
 0   0 
u (ct+1 ) u (ct+1 )
= EPt [Xt+1 ] + ρt β 0 , Xt+1 σt β 0 σt [Xt+1 ] (by substituting *) (6183)
u (ct ) u (ct )
risk premium
z   }| 0 {
0

u (c t+1 ) u (ct+1 )
= VtP + ρt β 0 , Xt+1 σt β 0 σt [Xt+1 ] (6184)
u (ct ) u (ct )
| {z }
correlation
h 0 i

where the actuarial value VtP = EPt [Xt+1 ]. The terms ρt β uu(c t+1 )
0 (c ) , Xt+1
t
is the correlation of the as-
h 0 i
set payoff and marginal utility, σt β uu(c t+1 )
0 (c )
t
is the volatility of the marginal utility itself and σt [Xt+1 ]
denotes the asset volatility. The market price is decomposed into the actuarial value plus a risk pre-
mium, which has 3 components. The correlation between the random payoff and marginal utility ratios
determine the sign of the risk premium. Additionally, the marginal utility volatility term notes that if
consumption is volatile, then investors require a larger absolute risk premium, reflecting the desire for
smoother consumption and reduced consumption volatility. The first and second component can be viewed
as the amount of premium per unit of risk, and the final term is the amount of risk.
If investors were risk neutral, then the utility function is linear and u0 (ct ) = α for all t, and Vt = VtP -
volatility of marginal utility ratios degenerate to zero. This illustrates the requirement that risk premiums
are defined by utility curvature.

820
15.1.2 Idiosyncratic Risk
Here the concept that idiosyncratic risk (volatility) is not premium-compensated is illustrated. Writing
sys
the asset payoff as Xt+1 = Xt+1 + t+1 where the t+1 is ⊥ to consumption c, with EPt [t+1 ] = 0. The
sys
Xt+1 is the systematic component of the asset payoff correlated with c, by substitution we obtain

h 0 i
 0
u (ct+1 )
 Covt β uu(c t+1 )
0 (c )
t
, X sys
t+1 +  t+1
ρt β 0 , Xt+1 = h 0 i (6185)
u (ct ) u (ct+1 )
σt β u0 (ct ) σt [Xt+1 ]
h 0 i
Covt β uu(c t+1 )
0 (c ) , Xt+1
t
sys

= h 0 i (by ⊥) (6186)
u (ct+1 )
σt β u0 (ct ) σt [Xt+1 ]
h 0 i h 0 i 
ρt β uu(c t+1 )
t
sys
0 (c ) , Xt+1 · σt β u0 (c )
u (ct+1 )
t
σ t X sys 
t+1
= h 0 i (6187)
u (ct+1 )
σt β u0 (ct ) σt [Xt+1 ]
h 0 i
ρt β uu(c t+1 ) sys  sys 
0 (c ) , Xt+1 · σt Xt+1
t
= (6188)
σt [Xt+1 ]

by substitution we derive

 0   0 
u (ct+1 ) u (ct+1 )
Vt = VtP + ρt β 0 , Xt+1 σt β 0 σt [Xt+1 ] (6189)
u (ct ) u (ct )
 0   0 
u (ct+1 ) sys u (ct+1 )  sys 
= VtP + ρt β 0 , Xt+1 σt β 0 σt Xt+1 (6190)
u (ct ) u (ct )

and we see the asset value is unchanged. The additional volatility of the idiosyncratic component is
not compensated with a risk premium. Assets with higher volatility do not necessarily earn higher risk
premiums, unless the risk is correlated with consumption.

15.1.2.1 Continuous Time Models

Now consider a discrete model economy, with asset price


 0  X 0
u (ct+1 ) u (ct+1 (s)) X
Vt = EPt β 0 Xt+1 = β 0
Xt+1 (s)P(s) = Xt+1 (s)Q(s).
u (ct ) s
u (ct ) s

Hence, we have

Q(s) u0 (ct+1 (s))


=β (6191)
P(s) u0 (ct )
0
and Q(s) = β u (cu0t+1 (s))
(ct ) · P(s). When marginal utility is high (when consumption is low), investors assign
greater probability weight in the risk-neutral world to states of the world when consumption is low
(economic contractions). The risk-adjusted probabilities are synonymous with risk-neutral probabilities
since once the Q probabilities are known, assets are priced as if investors are risk-neutral in that prices
are Q payoffs.
Here the asset pricing equations from dynamic hedging is linked to the central pricing equation
(see Theorem 464). Consider Equation 6191 linking the probability measures via the ratio of marginal

821
utilities, and the central pricing equation that may be written
X
Vt = Xt+1 (s)Q(s) (6192)
s
X Q(s)
= Xt+1 (s)P(s) (6193)
s
P(s)
 
P Q
= Et Xt+1 , (6194)
P
where we can think of Q, P as random variables taking values Q(s), P(s) in state s at time t + 1 under the
model economy. We want to obtain continuous time analogues. Assuming an underlying S following:
dSt
= µdt + σdZtP , (6195)
St
with µ expected return, σ volatility and ZtP , the objective P measure Brownian. Specifically, Vt is function
of time and stock, a fact we can write more explicitly V (t = t, x = S(t)). We hedge a long derivative
position with short Vx = ∆t units stock. The evolution of the portfolio process is then written (see Ito
Doeblin Theorem 415)

dWt = dVt − ∆t dSt (6196)


1
= Vt dt + Vx dSt + Vxx dSt dSt − ∆t (dSt ) (6197)
2
1
= (Vt dt + Vxx dSt dSt ) + Vx dSt − ∆t dSt (6198)
2
1
= [Vt + Vxx σ 2 St2 ]dt. (6199)
2
Brownian exposure is hedged away. Since we are operating under zero interest rate assumptions, then
Wt must be martingale and drift term zero, s.t. we have the BSM PDE Equation 3848 (set r = 0)
1
Vt + Vxx σ 2 St2 = 0. (6200)
2
Let the boundary condition of be ψ(ST ) = max(ST − K, 0). Risk-neutral pricing (Equation 4193) asserts
that Vt (t, St ) = EQ
t [ψ(ST )] holds. Feynman-Kac theorem (Theorem 429) asserts that Vt (t, St ) solves the
BSM PDE above. Write
dSt
= µdt + σdZtP (6201)
St
µ
= σ dt + σdZtP (6202)
σ
µ 
= σ dt + dZtP (6203)
σ
= σdZtQ , (6204)

where the last step follows from Girsanov’s Theorem (Theorem 423) with Radon-Nikodym derivative
process (Definition 451)
 Z t
µ P 1 t  µ 2
Z 
ξ(t) = exp − dZt − du (6205)
0 σ 2 0 σ
 
µ 1 µ 2

= exp − ZtP − t (6206)
σ 2 σ
Set ξ = ξT and Girsanov Theorem asserts that ZtQ is Q Brownian, which is a P equivalent probability
measure with relation
dQ
= ξ. (6207)
dP

822
The risk-neutral prices are related to the objective-measure prices (again, zero interest rates, taking out
what is measurable - Theorem 356; and by properties of Radon Nikodym derivative processes - Result
21) by
1 P
St = EQ
t [St ] = E [ξT ST ]. (6208)
ξt t
Asset pricing equations resulting from dynamic hedging arguments are consistent with the form in
central pricing equation derived from utility maximisation in that assets are priced using risk-adjusted
Q probabilities.
The continuous time utility maximisation gives analogue of Equation 6179 to be the first-order-
condition

St u0 (ct ) = Et [exp {−δ} u0 (ct+ )St+ ] . (6209)

Here δ is the investor time preference and  a small time increment. A purchase of n units of security would
lead to consumption fall of nSt u0 (ct ). This is compensated by an increase in utility by the time-discounted
consumption nEt [exp {−δ} u0 (ct+ )St+ ] in future periods. The balance of immediate consumption and
future consumption leads to the first-order condition Equation 6209. Letting ξs = exp(−δs)u0 (ct ), see
that the forms Equation 6208 is equivalent to Equation 6209. Then

u0 (ct+ )
 
P
St = Et exp(−δ) 0 St+ ,  → 0+ (6210)
u (ct )

is our continuous time central pricing equation where the risk-neutral pricing is related to the risk-
objective pricing via marginal utility functions.

15.1.3 Interest Rates


Here the assumption that rates are zero are relaxed, showing the central pricing equation predicts
strong bid for safe assets in times of economic turmoil from risk-averse investors. In economic turmoil,
expectations for future growth are lowered, higher uncertainty over future consumption increase savings
demand, which in turn leads to lower rates.
Rt
Let rtreal = 0 ru du s.t. it represents the continuously compounded real risk-free rate of interest.
Then Qreal
t = exp(−rtreal ) where Qreal
t is the real price of a risk-free zero coupon bond (Definition 455)
paying one dollar of real output at maturity. Consider the CRRA utility function as defined in Definition
ct1−γ
391, which we reiterate u(ct ) = 1−γ . We have u0 (ct ) = c−γ
t . The central pricing equation must apply to
the bond and
 0 
u (ct+1 )
Qreal
t = EPt
β 0 Xt+1 (6211)
u (ct )
"  −γ #
ct+1
= EPt β (6212)
ct

Assuming that consumption growth follows a log-normal distribution, then


ct+1
ln ∼ N (µ, σ 2 ),
ct

823
then
   
ct+1 −γ
Qreal
t = EPt
exp ln β( ) (6213)
ct
  
P ct+1
= exp{ln β} · Et exp −γ ln( ) (6214)
ct
 
1
= exp ln β − γµ + γ 2 σ 2 (6215)
2

using the result E exp{uX} = exp{uµ+ 12 u2 σ 2 } for a standard random normal (see Exercise 553). Taking
logs on both sides, we have
   
ct+1 1 2 2 ct+1
rtreal = − ln β + γEPt ln − γ σt ln .
| {z } ct 2 ct
Impatience | {z } | {z }
Growth Expectation Precautionary Savings

The impatience term determines the investor time preference. In particular, small β value consump-
tion today over the future. High rates of interest are required to convince impatient investors to save
up hfor rainy
i days. The growth term tends to lower rates when expectations of consumption growth
P ct+1
Et ln ct is weak. Lower growth expectations increase demand savings to smooth consumption in the
next period, scaled by the risk aversion level γ. The precautionary savings term lowers rates, with im-
pact determined
 by
 the degree of economic uncertainty (or synonymously the volatility of consumption
growth σt2 ln ct+1
ct . Again, this is also scaled by a risk aversion factor of γ 2 .

15.1.4 Expected Returns


Here we rewrite the central pricing equation to arrive at the CAPM. By referring to the asset beta,
practitioners are effectively rank ordering the risk premium. Consider again the one-period
h 0 imodel, and
1 real P u (ct+1 )
a risk free bond with payoff Xt+1 = 1 in all states, giving us RF = Qt = Et β u0 (ct ) where RF
denotes the real risk-free gross return (see Section 12.1.1). By substitution we have
 0   0 
∆ u (ct+1 ) P u (ct+1 )
Vt = EPt β 0 Et [Xt+1 ] + Covt β 0 , Xt+1 (6216)
u (ct ) u (ct )
P
 0 
Et [Xt+1 ] u (ct+1 )
= + Covt β 0 , Xt+1 (6217)
RF u (ct )
Xt+1
Finally, let Rt+1 = Vt denote the asset’s random gross return. Dividing by Vt and rearranging we have
 0 
EP [Xt+1 ] 1 u (ct+1 )
1= t + · Covt β 0 , Xt+1 (6218)
RF V t Vt u (ct )
 0 
EPt [Xt+1 ] RF u (ct+1 )
RF = + · Covt β 0 , Xt+1 (6219)
Vt Vt u (ct )
 0 
RF u (ct+1 )
EPt [Rt+1 ] − RF = − · Covt β 0 , Xt+1 (6220)
Vt u (ct )
 0 
u (ct+1 )
EPt [Rt+1 ] − RF = −RF Covt β 0 , Rt+1 . (6221)
u (ct )
1
This equation is just a reiteration of the previous forms - if asset return is high in states of the world
where consumption is depressed, then the RHS is positive and assets are priced with positive expected
returns, compensating for the addition of systematic risk. Uncorrelated and idiosyncratic risk have
expected returns equivalent to RF .
1 there is a correction for an errata in Iqbal [11] with a missing RF in the RHS of Equation 6221.

824
15.1.4.1 Central Pricing Equation and the Capital Asset Pricing Model

We show relation of the central pricing equation to CAPM formulations (see Exercise 642). Consider
the marginal utility form

u0 (ct+1 ) M
β = −bRt+1 (6222)
u0 (ct )
M
where b is constant and Rt+1 is the return of market portfolio of value weighted securities. In practice
this may be proxied using weighted indices such as the S&P500. Consider an investor with a log utility
u(c) = lnc, then by substitution we obtain
ct M
β = −bRt+1
ct+1

and taking logs


ct+1 b M
ln = ln + ln Rt+1 . (6223)
ct β
Investors hold the market portfolio in aggregate. If investors consume a fraction of their wealth at each
point in time, then aggregate consumption growth may be proportional to market returns. The equations
Equation 6222 and Equation 6223 give reasonable descriptions of marginal utility growth. We can then
refer to an asset’s consumption correlation and investor portfolio correlation interchangeably.
Consider again Equation 6221, we have

u0 (ct+1 )
 
EPt [Rt+1 ] − RF = −RF Covt β 0 , Rt+1 .
u (ct )

and since the market index is tradable we must have (by substitution of Equation 6222)

M M M

EPt [Rt+1 ] − RF = −RF Covt −bRt+1 , Rt+1 .
M M
= bRF Cov(Rt+1 , Rt+1 ) (6224)
M
= bRF Var(Rt+1 ) (6225)
M
EPt [Rt+1 −RF ]
which is that RF = bVar(Rt+1M ) . Hence Equation 6221 becomes

M

EPt [Rt+1 ] − RF = −RF Covt −bRt+1 , Rt+1 .
M
EPt [Rt+1 − RF ] M

= − M
Covt −bRt+1 , Rt+1 (6226)
bVar(Rt+1 )
M

Covt Rt+1 , Rt+1 P M
= M )
Et [Rt+1 − RF ] (6227)
Var(Rt+1
= β CAP M EPt [Rt+1
M
− RF ] (6228)

This is the CAPM, stating that the expected excess returns of asset is equal to the regression beta
multiplied by the expected return on market less risk-free rate. This is often calibrated with the regression

i
Rt+1 − RF = β CAP M,i · (Rt+1
M
− RF ) + it+1 ,

replacing the RM with a benchmark index such as the S&P 500. In currency markets, assets with high
betas (and therefore high risk premiums) are typically EM currencies, and some DM currencies. Negative
beta currencies typically are the safe-havens such as the USD, JPY, CHF and sometimes the EUR.

825
Chapter 16

Alternative Data

826
Chapter 17

Programming

17.1 Python Programming


Python is perhaps by far the most popular scripting language for data scientists and machine learning
engineers. It is effective, easy to understand and allows the programmer to focus on the task at hand. It
is high-level - in that it abstracts away many low-level details, such as memory allocation and garbage
collection. Add together other features such as dynamic typing, interpretation and vast ecosystem of
third-party libraries and frameworks, it is quick and easy to write up prototypes and test out ideas
for engineering problems. For perhaps the same reasons, Python is one of the favorite languages when
it comes to introductory courses in data science, and is quite often the peek into the programming
paradigm, alongside technologies like R and Excel.
But there is no free-lunch; these features come at a cost. Aforementioned courses (solid ones) most
likely cover the essential constructs and concepts such as variables, data types, data structures, control
flow, functions, objects, functional/procedural/objected-oriented programming and what-not, but much
less is covered on the internal implementation of Python. It turns out that the convenient features
come at an efficiency cost due to the overhead introduced by the internal implementation of Python
used to ensure the safe operations of what were once explicitly programmed but now abstracted away.
Python code is executed by the Python interpreter, which reads and executes Python bytecode. The
interpreter is responsible for tokenizing and parsing the source code, generating bytecode, and executing
it on the Python Virtual Machine (PVM). The PVM is responsible for executing Python bytecode. It
provides an abstraction layer between the bytecode and the underlying hardware, allowing Python code
to be platform-independent. In one of the internal implementations, CPython (the reference, and most
common, implementation of Python), the Global Interpreter Lock (GIL) is present, and it acts as a
lock/mutex that protects access to Python objects, preventing multiple native threads from executing
Python bytecode simultaneously. This affects concurrency and parallelism in multi-threaded Python
programs. Computationally demanding tasks in the Python runtime do not run any faster on multiple
threads than it does on one, because each thread waits in turn to obtain the GIL access to execute
bytecode.
Python is a great solution for rapid development and prototyping, as well as building scalable and
readable systems. This would be an important feature for quants, and even more so for quant setups in
small groups or solo/‘retail’ desks, where the need to seamlessly test out ideas and deploy algorithms are
required without being spread too thin across multiple technologies. On the other hand, the scientific

827
and numerical computing demands of quantitative work is often unsuited to the untrained Python eye.
This section on Python programming is dedicated to elevation of Python programming for the purposes
of scientific programming. In many, if not most cases, Python applications are capable of running at the
same efficiency and time requirements as their C or cpp counterparts, if properly managed. Additionally,
if one considers the entire Software Development Life Cycle (SDLC), such as design, development, testing,
deployment and maintainence of software applications, Python is capable of offering a much more suitable
solution to one’s dev needs.

17.1.1 Computers and Python as a Language


If the introduction has been of any help, it should hint to us where we should look at to get our
optimizations in - precisely where the opimitzations have taken place! If we understand how Python’s
abstractions work and its interaction with the computing units, we would be squared away. Here we give
a whirlwind tour of the components of the computer system. One may take this to be a concise overview
of computers - what one might expect from a college course in Computer Organization, if it were a few
paragraphs long. Of course, the specifics depend on the computer architecture, operating system and
more, but we provide only enough detail so as to build intuition, that in which is necessary to appreciate
the rest of the writings.
The computer system can be thouught of to be composed of the computing units (Central Processing
Unit - CPU), the memory/storage units and input/output devices.
The CPU is the brain of the computer, and executes instructions, performs calculations, and manages
data movement. The Graphics Processing Unit (GPU) is also increasingly more commonplace, and acts
as auxiliary computing power specialized for graphics processing and numerical computations, allowing
for extensive parallelization. The memory (Random Access Memory - RAM) stores data and instructions
that the CPU needs to access quickly, and is volatile. Volatile memory is memory that loses its contents
when the power is turned off. The storage is on the other hand, non-volatile, and is used to store
data permanently or semi-permanently. Examples include Hard Disk Drives (HDDs), Solid State Drives
(SSDs), and Flash memory, which we colloquially call ‘disk’. Input devices allow users to input data into
the computer system, while output devices display or output information from the computer. Network
connections are affectively a connection to all other computing systems.
Modern computers have variable architectures. Most CPUs today typically have cache memory, which
is a small amount of high-speed memory located directly on the CPU chip itself. It serves as a buffer
between the CPU and the slower RAM, and comes in different levels. The common one is the L1-L2-L3
cache, in order of increasing latency, but also increasing capacity.
In general, memory and storage units are said to be in a tiered-architecture, where latency and
capacity are at odds. The main idea is that for manipulation of bits to occur, those bits must be
physically present at the CPU. The tiered hierarchy allows the CPU to access data and instructions
quickly, reducing the time spent waiting for data to be fetched. These units differ in how ‘close’ they
are to the CPU, and differ in read/write throughput. The data that we think we need more often/soon
is placed closer to the CPU - this arrangement is handled by the OS. Imagine we own a hundred shoes,
you would place your ‘’favorite pair’ closer to the door and dump the ugly one in the closet - this reduces
search time for your shoes, on average. One can think of high-performance storage to be smaller in size
and closer to the CPU for frequently accessed data, as opposed to high-capacity storage to be larger in size
and further from the CPU. The CPU itself is characterized by the number of operations it can perform in
a clock cycle, and the number of clock cycles in a second. These are measured respectively in Instructions

828
Per Cycle (IPC) and clock speed in units of Hertz (Hz ). These metrics are competing with each other.
Some instruction sets contain instructions to operate on multiple data simultaenously, and this is said to
be vectorization, a class of single instruction, multiple data (SIMD) operation. Due to physical limitations
of making transistors smaller, advancements in clock speeds and IPC have only been incremental over
the past decade, and improvements in computing performance have come from other approaches such as
hyperthreading, multi-core architectures, out-of-order execution, branch prediction and so on. Multi-core
architectures include multiple CPUs within the same unit, and we will be extensively looking at how to
make use of multiple cores in the texts to follow. The other techniques are not of particular interest to
programmers using general-purpose programming languages such as Python, and are optimizations at
the lower hardware and operating system levels. Many of our techniques involved would be to do with
unlocking and making copies of the GIL, or at least at similar levels of abstraction.
Not only do the different memory and storage units have different throughputs/latency, the speed
of which they may be accessed also depends on how close or far away the data requested is stored in
the physical sense. Reading chunks of data laid out sequentially is significantly faster than jumping
memory addresses. The movement of data bits itself goes through the tiered architecture - data starts
from the disk, then makes its way through the RAM and L1/L2 caches. Optimizing the performance of
Python applications could mean optimizing the memory access patterns of the program, which means
optimizing where the data is placed in memory and how it is transferred. In many circumstances, this
can be directly or indirectly influenced by how the programmer writes program code. This need not
be complicated, and could simply mean using a numpy array for fixed length numerical data types as
opposed to a Python list data structure. We leave the details for later.
These blocks of computing units communicate with each other using what are called buses, such as the
system (frontside) bus, data buses, address buses and control buses. The data bus is responsible for
transferring actual data between the CPU, memory, and other devices. It carries binary data in the form
of electrical signals, allowing for bidirectional data transfer. Like the CPU metrics, the performance of
the data bus is measured by how much data it moves, which is a function of how much data it can move
in a single transfer (bus width) and how many transfers it can do per second (bus frequency). Each data
bus transfer can only move sequential bits of data. Naturally, a large bus width, combined with data
laid out in contiguous memory blocks, can move large amounts of relevant data in a single data transfer
- which then facilitates vectorization, since vectorization can only occur for data that is available
at the CPU registers. Having a small bus width but a high transfer frequency can make for quick access to
data scattered across different memory addresses. This is precisely why Python's native data types are
poorly suited for vectorization. Since Python does automatic garbage collection, memory is automatically
allocated and freed - and this results in memory fragmentation, hurting the transfer to the CPU caches.
Often, multiple data bus transfers are required for computations in Python. Furthermore, since Python
is interpreted and allows dynamic typing, it is also difficult to perform compiler optimizations, since
the precise behavior is not known until runtime. Python's popular scientific computing modules such as
numpy, pandas and scipy all make use of non-native data types from C, Fortran and others that alleviate
these issues, as we shall soon find out.

17.1.2 Code and Computer Profiling


The discussion so far should build some intuition as to where we might look to optimize our code. These
ideas are undoubtedly useful, and once familiar, give the programmer good discernment and heuristics
for writing efficient code. But until we have such intuition, and even after - our decisions to write
optimizations should be driven by statistics. After all, we are quantitative developers. Both intuition
and data should be used in the development process. Performance profiling allows the developer to
find performance bottlenecks and attack areas of concern, and can reveal surprising efficiency issues
overlooked by intuition. Code profiling is particularly useful in identifying performance issues in
function calls whose implementation details we are not intimately aware of, such
as third-party libraries and APIs. Again, it is the abstraction that allows us to concern ourselves solely
with the inputs and outputs of function calls, and not the internal details. However, it may be precisely
these details that are obscure to intuition while dragging down performance. Profiling can be done at many
levels, including the CPU, memory, storage and network resources.
Running code faster can boil down to many approaches. One could use a more efficient algorithm
(a binary search on a sorted list has complexity O (log n) while a linear search is O (n)),
run a better CPU (or whatever resource the job needs), or run optimized code. Suppose your job is to
run a train from Washington to California. An efficient algorithm would be to go Washington, Oregon,
then California. An inefficient one would go Washington, Idaho, Wyoming, Utah, Arizona and then
California. Algorithms boil down to rethinking the problem and finding a better, more effective solution.
We could instead run the same algorithm but do it faster - an example would be the evolution of trains from
the steam engine to diesel engines. Last but not least, we can make peripheral adjustments, such
as the type of wheels, the number of stops, the material of the train and so on, in order to further optimize our
journey. In this chapter, we will be pursuing optimizations of this last kind - algorithms are
conceptual and problem-specific, and designing a better CPU is not really our concern - we will be
learning how to make peripheral optimizations given the context of our problem and the resources we are
given. For theory on designing, writing and analysing computer algorithms, a great reference is Cormen
et al. [6].
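To make the algorithmic point concrete, here is a minimal sketch (using only the standard library; the sizes are arbitrary) contrasting the O(n) linear scan with an O(log n) binary search via the bisect module:

import bisect
import random
import timeit

data = sorted(random.sample(range(10_000_000), 1_000_000))
target = data[-1]  # worst case for the linear scan

def linear_search(xs, x):
    # O(n): inspect every element until the target is found
    for i, v in enumerate(xs):
        if v == x:
            return i
    return -1

def binary_search(xs, x):
    # O(log n): repeatedly halve the search interval on the sorted list
    i = bisect.bisect_left(xs, x)
    return i if i < len(xs) and xs[i] == x else -1

print(timeit.timeit(lambda: linear_search(data, target), number=10))
print(timeit.timeit(lambda: binary_search(data, target), number=10))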

17.1.2.1 time.time(), function decorators, IPython and OS utilities

The obvious criterion or benchmark we want to begin with is how fast our code runs. We may measure
this using the time module, with the following code:
1 import time
2

3 # Function to perform some work


4 def work ( n ) :
5 result = 0
6 for i in range ( n ) :
7 result += i
8 return result
9

10

11 start_time = time . time ()


12 result = work (1000000)
13 end_time = time . time ()
14

15 # Calculate elapsed time


16 elapsed_time = end_time - start_time
17 print ( elapsed_time )

Listing 17.1: work.py

While this timing method is functional, it quickly gets unwieldy and messy to deal with. We would
ideally like to do it in a much cleaner way, using decorators, which are function wrappers that can modify
the behavior of functions with a simple tag.
With the decorator function:
1 import time
2 import functools
3

4 def timer ( func ) :


5 @functools . wraps ( func )
6 def wrapper (* args , ** kwargs ) :
7 start_time = time . time ()
8 result = func (* args , ** kwargs )
9 end_time = time . time ()
10 elapsed_time = end_time - start_time
11 print ( f " Elapsed time for { func . __name__ }: { elapsed_time } seconds " )
12 return result
13 return wrapper
14

15 @timer
16 def work ( n ) :
17 ...

we should be able to get a cleaner measurement of time with


1 >>> work (1000000) # equivalent to timer ( work ) (1000000)
2 Elapsed time for work : 0.08699989318847656 seconds
3 499999500000

The statement @functools.wraps(func) is not strictly necessary, but it is good practice, so
that we preserve the meta-attributes of the original function. For example, the statement work.__name__
evaluates to ’work’, but would otherwise evaluate to ’wrapper’ without the wrapping.
Clearly, repeated timings of the same code do not take precisely the same duration - there are slight
aberrations depending on computer load, other running tasks and so on. It may be reasonable to run
the timer multiple times to get an intuition for the spread of runtimes. Each run should take place under
similar settings; otherwise you may get confusing results. I have been victim to being utterly confused by
timing results when using computer resources that were not completely familiar to me; it turned out that
my Python scripts ran a lot slower when I was actively editing code in my Gitpod workspace browser
instance, as opposed to standing by and waiting for my code to execute.
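One way to obtain such a spread from within a plain script is the standard timeit module; a minimal sketch (reusing the work function from Listing 17.1, with arbitrary repetition counts) might look like:

import statistics
import timeit

def work(n=1_000_000):
    result = 0
    for i in range(n):
        result += i
    return result

# five repetitions of ten calls each, ideally under similar machine load
timings = timeit.repeat(lambda: work(), repeat=5, number=10)
print(f"mean={statistics.mean(timings):.4f}s stdev={statistics.stdev(timings):.4f}s")
print(f"best={min(timings):.4f}s worst={max(timings):.4f}s")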
If you are scripting in the IPython environment or on a Jupyter Notebook, there is an inbuilt timer
that does the profiling for us, with multiple repetitions to get us runtime statistics:
1 >>> % timeit work (100000)
2 8.77 ms +/ - 35.5 microseconds per loop ( mean +/ - std . dev . of 7 runs , 100 loops each )

In Linux/Unix systems, we have the /usr/bin/time utility for measuring the resource usage (such
as CPU time, memory usage, etc.) of other commands or scripts. It is invoked with the command:
1 / usr / bin / time [ options ] command [ arguments ]

and one of the options, -l, gives additional information about memory usage along with the time spent.
Suppose we run the script work.py in Listing 17.1; the following is an example:

>>>
/usr/bin/time -l python3 work.py
<<<
0.0606992244720459 #output from print statement

0.11 real 0.09 user 0.01 sys
8323072 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
3630 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
152 involuntary context switches
823586913 instructions retired
332458388 cycles elapsed
5435392 peak memory footprint

Some of the statistics are elaborated on:

1. Real Time: The actual elapsed time from the start to the finish of the command. This includes
time spent waiting for resources (e.g., I/O operations).

2. User Time: The amount of CPU time spent executing the command in user mode. It includes
time spent executing code in the program.

3. Sys Time (System Time): The amount of CPU time spent by the system executing system
calls on behalf of the command.

4. Maximum Resident Set Size: The maximum amount of memory (in kilobytes) that the process
used during its execution. This represents the maximum amount of physical memory the process
occupied in RAM.

5. Average Shared Memory Size: Average size of shared memory segments used by the process.

6. Page Reclaims: The number of minor (soft) page faults - faults the kernel could satisfy without any
disk I/O, for example when a newly allocated page is touched for the first time or a page is recovered directly from the page cache.

7. Page Faults: The number of major (hard) page faults - faults where the requested data is not found
in physical memory and must be retrieved from secondary storage.

8. Swaps: The number of times that memory contents have been swapped between memory and
disk. Swapping occurs when the operating system moves data between RAM and disk storage to
free up memory.

9. Block Input Operations: Number of block input operations, typically involving reading data
from disk or other I/O devices.

10. Block Output Operations: Number of block output operations, typically involving writing data
to disk or other I/O devices.

11. Messages Sent: Number of messages sent by the process.

12. Messages Received: Number of messages received by the process.

13. Signals Received: Number of signals received by the process, such as interrupt signals or termi-
nation signals.

14. Voluntary Context Switches: Number of voluntary context switches, which occur when a
process voluntarily relinquishes the CPU.

15. Involuntary Context Switches: Number of involuntary context switches, which occur when a
process is preempted by the scheduler and its execution is paused.

16. Instructions Retired: Number of instructions executed by the CPU.

17. Cycles Elapsed: Number of CPU cycles elapsed during the execution of the command. This
represents the actual number of clock cycles consumed by the CPU.

The amount of time spent waiting on I/O is approximately real−user−sys. Page reclaims count minor
(soft) page faults: faults that the kernel can satisfy without any disk I/O, for instance when a freshly
allocated page is touched for the first time or a page is recovered directly from the page cache. Page
faults here count major (hard) faults: a process attempts to access data that is not currently in physical
memory (RAM), and the operating system must load the required data from disk into memory before
allowing the process to continue accessing it. This is expensive (in time cost). Such faults can happen
for various reasons, such as when accessing data that was paged out to disk due to memory pressure,
or when accessing memory-mapped files or virtual memory regions. Messages typically refer to inter-
process communication (IPC) messages sent or received by the process being monitored. Inter-process
communication allows different processes to exchange data, synchronize their activities, and coordinate
their execution. Messages can be sent between processes using various mechanisms, such as message
queues, sockets, signals, pipes, and shared memory. You can build your intuition around high-level
Python functions and how they interact with the kernel through these statistics. The statements
1 with open ( " work . txt " , " w " ) as f :
2 f . write ( " worked " )

would typically show up in the page reclaim and block output counts, rather than as major page faults. The statement input("take in some keyboard input") should trigger a voluntary
context switch. The concepts of context switching, in relation to preemptive multitasking and cooperative
multitasking, are discussed in our notes on the asyncio library (Section 17.1.3). If you press CTRL-C before
typing the input, the process is interrupted with a SIGINT, which shows up in the signals
received count.

17.1.2.2 Function Profiling with cProfile

Now that we are able to profile an entire Python script and time operations, we want a more granular
view into the performance metrics so that we can drill down into the pain points. We would then perform
the optimizations, time our programs again and verify that our timings have indeed improved. Again, we
stress that the timings should be performed under similar circumstances - the profiling itself can introduce
time delays. An immediate and simple - albeit insignificant - example is the overhead caused by the
decorator @timer, which is itself Python code that takes time to run.

In this section we introduce the ability to profile function calls in Python using the cProfile module, a
powerful tool for analyzing the performance of Python programs by providing detailed runtime statistics.
cProfile provides deterministic profiling of Python programs, allowing the developer to analyze the
performance of their Python code by measuring the execution time of each function and the number of
times each function is called. Statistics such as the total time spent in each function, the number of calls
made to each function, and the time spent in each function’s subcalls are recorded. It is deterministic
in the sense that it records the exact function calls and their durations without sampling, providing a
more accurate picture of the program’s behavior. This is as opposed to statistical profilers, which sample
the program’s execution at regular intervals. The module is built into the standard library, and hooks
into the PVM to do the profiling. Rather than ramble through the documentation, we provide
sample use cases on how to utilize the module.
A rather contrived example is shown to demonstrate its utility. Later, we demonstrate its usefulness with
a case study on building optimized code for backtesting trading systems. Suppose we have a recursive
implementation for computing the Fibonacci sequence, and we want the n = 30-th value.
1 def fibonacci ( n ) :
2 if n <= 1:
3 return n
4 else :
5 return fibonacci (n -1) + fibonacci (n -2)
6

7 def main () :
8 fibonacci (30)
9

10 if __name__ == " __main__ " :


11 main ()

Listing 17.2: fib.py

We can execute python -m cProfile test.py, and we receive the following output:

2692541 function calls (5 primitive calls) in 0.333 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)


1 0.000 0.000 0.333 0.333 test.py:1(<module>)
2692537/1 0.333 0.000 0.333 0.333 test.py:1(fibonacci)
1 0.000 0.000 0.333 0.333 test.py:7(main)
1 0.000 0.000 0.333 0.333 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {method ’disable’ of ’_lsprof.Profiler’ objects}

We know how many times fibonacci was called in our recursive implementation - something which was not
immediately obvious by looking at the code (without some analysis). We will most likely be
working with Python scripts of much greater complexity, where there are multiple functions, branching
conditions and so on. The cumulative time tells us how much of the overall execution time
is spent inside a given function. If a function takes surprisingly long to run each time it is called,
we may be concerned. If a function is being called a large number of times, it may warrant attention.
If a function is both called many times and takes a long time to run per call, our attention is
piqued. We can write the statistics to disk and then call upon modules to give us more details. The
following example is rather illuminating about the internal implementation of pandas function
calls. A script is created to populate a dataframe with random values, which we then iterate through
using two types of scalar access patterns, alternating between loc and at.
1 import pandas as pd
2 import numpy as np
3

4 def generate_random_dataframe ( rows , cols ) :


5 data = np . random . randint (0 , 100 , size =( rows , cols ) )
6 return pd . DataFrame ( data , columns = list ( range ( cols ) ) )
7

8 def iterate ( df ) :
9 for i in range ( len ( df ) ) :
10 for j in range ( len ( df . columns ) ) :
11 value = df . loc [i , j ] if j %2==0 else df . at [i , j ]
12

13 rows = 1000
14 cols = 1000
15 df = generate_random_dataframe ( rows , cols )
16 iterate ( df )

Listing 17.3: profiling.py

Now we profile this and write the statistics to disk with python -m cProfile -o profile.bin profiling.py. We
analyse the statistics file with the following script:
1 import pstats
2 p = pstats . Stats ( " profile . bin " )
3 p = p . strip_dirs () . sort_stats ( " cumulative " )
4 p . print_stats (10)

with the following output:

Fri Mar 8 17:02:02 2024 profile.bin

26836585 function calls (26826798 primitive calls) in 7.681 seconds

Ordered by: cumulative time


List reduced from 2330 to 10 due to restriction <10>

ncalls tottime percall cumtime percall filename:lineno(function)


728/1 0.005 0.000 7.681 7.681 {built-in method builtins.exec}
1 0.000 0.000 7.681 7.681 test.py:1(<module>)
1 0.564 0.564 6.796 6.796 test.py:8(iterate)
500000 0.564 0.000 3.895 0.000 indexing.py:864(__getitem__)
500000 0.360 0.000 2.201 0.000 indexing.py:2070(__getitem__)
500000 0.882 0.000 1.950 0.000 indexing.py:925(_is_scalar_access)
1000000 0.675 0.000 1.858 0.000 frame.py:2992(_get_value)
500000 0.348 0.000 1.364 0.000 indexing.py:2015(__getitem__)
7 0.000 0.000 0.910 0.130 __init__.py:3(<module>)
483/1 0.002 0.000 0.880 0.880 <frozen importlib._bootstrap>:966(_find_and_load)

We get 500000 calls each to indexing.py:864(__getitem__), indexing.py:2070(__getitem__), indexing.py:925(_is_scalar_access) and indexing.py:2015(__getitem__). frame.py:2992(_get_value) gets a million calls.
This is clearly suggestive that even though both loc and at perform scalar access in our script, they go
through distinct function calls, while both end up calling _get_value.
We may get some clarity on where the function calls came from, by specifying p.print_callers(9), and
this gives output

Ordered by: cumulative time


List reduced from 2330 to 9 due to restriction <9>

Function was called by...


ncalls tottime cumtime
{built-in method builtins.exec} <- 383/1 0.000 0.879 ...
26 0.004 0.005 __init__.py:357(namedtuple)
316 0.000 0.001 overrides.py:166(decorator)
1 0.000 0.000 six.py:1(<module>)
1 0.000 0.000 typing.py:1(<module>)
test.py:1(<module>) <- 1 0.000 7.681 {built-in method builtins.exec}
test.py:8(iterate) <- 1 0.564 6.796 test.py:1(<module>)
indexing.py:864(__getitem__) <- 500000 0.564 3.895 test.py:8(iterate)
indexing.py:2070(__getitem__) <- 500000 0.360 2.201 test.py:8(iterate)
indexing.py:925(_is_scalar_access) <- 500000 0.882 1.950 indexing.py:864(__getitem__)
frame.py:2992(_get_value) <- 500000 0.342 0.931 indexing.py:864(__getitem__)
500000 0.333 0.926 indexing.py:2015(__getitem__)
indexing.py:2015(__getitem__) <- 500000 0.348 1.364 indexing.py:2070(__getitem__)
__init__.py:3(<module>) <- 7 0.000 0.910 {built-in method builtins.exec}

We can infer from this that indexing.py:864 calls both indexing.py:925(_is_scalar_access) and frame.py
:2992(_get_value), whereas indexing.py:2070(__getitem__) only calls indexing.py:2015(__getitem__), which
then calls frame.py:2992(_get_value) without the check for scalar access. We can verify from the pandas
documentation that at is indeed the fastest way to do label-based scalar access. Without even looking at
the pandas source code, we can suspect that the former path is associated with the loc operator, while the
latter is taken via the at operator. Modifying the scalar access line to value = df.at[i, j], and
redoing the profile:

Fri Mar 8 18:00:11 2024 profile.bin

18336588 function calls (18326801 primitive calls) in 4.797 seconds

Ordered by: cumulative time


List reduced from 2324 to 10 due to restriction <10>

ncalls tottime percall cumtime percall filename:lineno(function)


728/1 0.005 0.000 4.798 4.798 {built-in method builtins.exec}
1 0.000 0.000 4.798 4.798 test.py:1(<module>)
1 0.449 0.449 4.501 4.501 test.py:8(iterate)
1000000 0.618 0.000 3.925 0.000 indexing.py:2070(__getitem__)
1000000 0.570 0.000 2.454 0.000 indexing.py:2015(__getitem__)

1000000 0.634 0.000 1.731 0.000 frame.py:2992(_get_value)
1000000 0.203 0.000 0.679 0.000 series.py:540(_values)
1000000 0.290 0.000 0.477 0.000 managers.py:1622(internal_values)
1000000 0.417 0.000 0.465 0.000 indexing.py:2064(_axes_are_unique)
1000000 0.202 0.000 0.387 0.000 generic.py:471(ndim)

we verify this fact, and see that runtime improved from 7.681 to 4.797 seconds, with a different function
call profile. Our suspicions are correct.
For those who prefer a graphical interface to interact with, or to present their optimization choices and
profiling results to management, a nice tool in the form of snakeviz does that for us - we
can do python -m pip install snakeviz and then run python -m snakeviz profile.bin, which displays a
graphical interface in the browser for your perusal.

Figure 17.1: Graphical Interface shown
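As an aside, besides the command-line invocation, cProfile can also be driven programmatically, which is convenient when we only want to profile a specific region of a larger application. A minimal sketch (reusing the fibonacci function from Listing 17.2; the argument 25 and the print limit are arbitrary) is given below:

import cProfile
import io
import pstats

def fibonacci(n):
    return n if n <= 1 else fibonacci(n - 1) + fibonacci(n - 2)

profiler = cProfile.Profile()
profiler.enable()
fibonacci(25)            # only this region is profiled
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).strip_dirs().sort_stats("cumulative")
stats.print_stats(5)     # top five entries by cumulative time
print(stream.getvalue())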

17.1.2.3 Line Profiling with line_profiler

While careful analysis with function profiling (Section 17.1.2.2) and of the call stack/counts gave
us an idea of which lines were concerning in case studies such as the one in Listing 17.3, in
general, function profiling tells us which function calls are concerning, not which parts of their
implementation are. For instance, the function profiling tells us that the call iterate(df) is problematic,
but it is not immediately obvious why this function is slowing down our overall
program. To gain insight into the function implementation and which parts of the function
are cause for concern, we need to drill down into a line-by-line analysis, which is given to us
by the line_profiler library, accessible via
1 python3 -m pip install line_profiler

To perform this analysis, we are given a function decorator @profile, and all tagged functions
will be profiled at runtime when run using the kernprof script. Since we are doing a line-by-line analysis,
let us modify Listing 17.3 to

1 import pandas as pd
2 import numpy as np
3

4 def generate_random_dataframe ( rows , cols ) :


5 data = np . random . randint (0 , 100 , size =( rows , cols ) )
6 return pd . DataFrame ( data , columns = list ( range ( cols ) ) )
7

8 @profile
9 def iterate ( df ) :
10 for i in range ( len ( df ) ) :
11 for j in range ( len ( df . columns ) ) :
12 if j %2==0:
13 value = df . loc [i , j ]
14 else :
15 df . at [i , j ]
16 rows = 1000
17 cols = 1000
18 df = generate_random_dataframe ( rows , cols )
19 iterate ( df )

Listing 17.4: profiling.py

so that the two scalar-accessing function calls appear on different lines. Notice the tagging with @profile
- no imports are required - we may run the script with
1 python3 -m kernprof -l -v test . py

which gives us

Wrote profile results to test.py.lprof


Timer unit: 1e-06 s

Total time: 17.3349 s


File: test.py
Function: iterate at line 8

Line # Hits Time Per Hit % Time Line Contents


==============================================================
8 @profile
9 def iterate(df):
10 1001 388.0 0.4 0.0 for i in range(len(df)):
11 1001000 494465.0 0.5 2.9 for j in range(len(df.columns)):
12 1000000 415803.0 0.4 2.4 if j%2==0:
13 500000 11187304.0 22.4 64.5 value = df.loc[i, j]
14 else:
15 500000 5236952.0 10.5 30.2 df.at[i, j]

which produces output that more or less speaks for itself. The two lines of scalar access share the same number of
hits, but one takes about double the runtime of the other - and unlike with the function profiling, this was a no-brainer to
figure out.
At different levels of abstraction, we get different angles on the optimization process.
Starting off with the high-level profiling of the generic timers and OS utilities discussed in Section
17.1.2.1, we can get an overall feel for our Python application's ability to utilize the available computing
resources. Function profiling, as in Section 17.1.2.2, helps us understand which logical components are slowing
down our process, and we can then use the line profiling discussed here in Section 17.1.2.3 to zoom in
on the pain hotspots.

17.1.2.4 Bytecode Profiling with dis

In Section 17.1 we discussed how a Python application is first assembled into bytecode, which is then run
on the PVM. Although function profiling and line profiling (see Sections 17.1.2.2 and 17.1.2.3) give
us good artefacts for choosing our point of entry into the efficiency problem, they do little in the way
of building intuition or constructing a mental model of the underlying function calls. The dis module
in the standard Python library allows us to inspect the underlying bytecode that is run in the CPython
reference implementation. A good rule of thumb is that shorter bytecode corresponds to faster runtimes,
although this, as emphasized, must be backed up by actual profiling results. Let's take a simple look:
1 import numpy as np
2 import dis
3

4 arr = np . array ([1 , 2 , 3])


5 def func1 ( arr ) :
6 return sum ( arr )
7

8 def func2 ( arr ) :


9 sum = 0
10 for i in arr :
11 sum = sum + i
12 return sum
13

14 def func3 ( arr ) :


15 return np . sum ( arr )
16

17 print ( func1 ( arr ) , func2 ( arr ) , func3 ( arr ) )


18 print ( " - - - - - - - - - - - - - - - - - - " )
19 dis . dis ( func1 )
20 print ( " - - - - - - - - - - - - - - - - - - " )
21 dis . dis ( func2 )
22 print ( " - - - - - - - - - - - - - - - - - - " )
23 dis . dis ( func3 )

gives the output

6 6 6
------------------
5 0 RESUME 0

6 2 LOAD_GLOBAL 1 (NULL + sum)


14 LOAD_FAST 0 (arr)
16 PRECALL 1
20 CALL 1
30 RETURN_VALUE
------------------
8 0 RESUME 0

9 2 LOAD_CONST 1 (0)
4 STORE_FAST 1 (sum)

10 6 LOAD_FAST 0 (arr)
8 GET_ITER
>> 10 FOR_ITER 7 (to 26)
12 STORE_FAST 2 (i)

11 14 LOAD_FAST 1 (sum)
16 LOAD_FAST 2 (i)
18 BINARY_OP 0 (+)
22 STORE_FAST 1 (sum)
24 JUMP_BACKWARD 8 (to 10)

12 >> 26 LOAD_FAST 1 (sum)


28 RETURN_VALUE
------------------
14 0 RESUME 0

15 2 LOAD_GLOBAL 1 (NULL + np)


14 LOAD_ATTR 1 (sum)
24 LOAD_FAST 0 (arr)
26 PRECALL 1
30 CALL 1
40 RETURN_VALUE

Using the decorator @timer written in Section 17.1.2.1, we test on the array arr = np.array(range(10000000)),
and the function calls print(func1(arr),func2(arr),func3(arr)) give the output

Elapsed time for func1: 0.5867447853088379 seconds


Elapsed time for func2: 0.7382478713989258 seconds
Elapsed time for func3: 0.003645181655883789 seconds
49999995000000 49999995000000 49999995000000

We see from the timing that func1 runs faster than func2, and we can see from both the actual Python
code and the bytecode that func1 is more terse and succinctly defined. Inspection of the bytecode reveals
that this is likely because func2 keeps track of local variables to iterate over the list inside the
for loop and to maintain the running sum, while func1 has no such steps - in fact, no STORE_FAST
instruction is even called. These steps carry time overhead, but that is not something we might have
been aware of before. Interestingly, func3 runs significantly faster than both, even though its bytecode
is almost identical to that of func1 - so bytecode inspection is not a foolproof way of identifying performance
bottlenecks, although it can be very helpful in building intuition. It turns out that the
difference in runtime is due to Python's dynamic typing - during each summation operation in func1, the
data types of the objects being added must be checked for type compatibility - this is not an
issue in func3. Additionally, vectorization also plays a role in the performance difference between the
two. We will explore these concepts in the sections to follow.

17.1.3 Asynchronous Programming
When working with applications that involve heavy I/O operations, such as downloading stock market
data or running database queries, these operations may be time-intensive and degrade the performance
of the application. If we run these operations sequentially, then the performance issues grow linearly
in the number of such operations and the application does not scale. If we can exploit concurrency
to fire off these requests together and wait for them simultaneously, our application spends far less time idle and
benefits from lower latency. In general, an application's work can be divided into I/O-bound and CPU-bound
(bound meaning limiting factor) work. I/O operations involve a computer's input and output devices,
such as the keyboard, hard drive and network cards. CPU operations involve the central processing unit
and consist of work such as mathematical computation. Concurrency refers to allowing more than a
single task to be handled at the same time. One of the ways this is achieved in Python is through
the asyncio module, which stands for asynchronous I/O. Asynchronous programming refers to the act
of running tasks in the background, separate from the main application. The asyncio library utilizes
the concurrency model known as a single-threaded event loop. We will explore what these mean later.
Additionally, asyncio provides utility functions that allow us to work with multiple threads and multiple
processes, giving us better control over these workflows in a neat abstraction.
Concurrency means that at least two tasks are happening at the same time. This can be rather vague
and needs to be clarified. Concurrency does not in fact imply that something is running in parallel.
When we say that parallelism exists, it requires that the tasks are both concurrent and executing at
the same instant. We may let the cake cool in the oven while preparing the table -
these two events can be conducted concurrently. When you hire someone to cook the fish while you
prepare the jus, the events occur in parallel. Concurrency can be obtained on a single-core CPU via
task-switching, using a mechanism known as preemptive multitasking. Parallelism can only happen on a
machine with at least two cores. Parallelism implies concurrency, but not the other way around. With
multitasking, one may have multiple tasks happening concurrently - except that only one of them is executing
at any one time. Multitasking falls under two types - preemptive multitasking and cooperative multitasking.
In the former, the OS (operating system) decides how and when to switch between different tasks via time
slicing. In the latter, the running process decides when to give up CPU time via explicitly earmarked regions
in the code. asyncio in particular employs the cooperative multitasking model to achieve concurrency.
As mentioned, it is single-threaded (and therefore also single-process) and runs a concurrent-but-not-
parallel model. Of course, we can combine parallelism with asyncio, and we will see how to do that
later. Cooperative multitasking has some benefits. One of them is that it is less resource-intensive. When
the OS switches between running threads or processes, a context switch is said to be involved, and the
OS needs to save the state of the running process/thread in order to continue it later. This is expensive.
By explicitly earmarking the areas for giving up CPU control, the application also gains efficiency,
assuming it is done properly.
This begs the question: what is the difference between a process and a thread? A process is an application
that has its own memory space that other applications cannot access. Multiple processes can run on a
single machine at the same time if there are multiple cores; otherwise they run concurrently via time
slicing. The algorithm that determines when the OS triggers task-switching is OS-dependent, and will
hence not be discussed in this section, which is intended to be mostly OS-independent. A thread, on the
other hand, does not have its own memory space. Instead, it shares memory with the process
that created it. A process is always associated with at least one thread, known as the main thread. The
other threads created from the main thread are known as worker threads. Threads can run alongside one
another on a multi-core CPU, and the OS can also switch between threads via time slicing. One can
think of threads as lightweight processes with a different memory profile. The entry into any Python
application begins with a process and its main thread.
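A minimal sketch to make these terms concrete - it only uses the standard os, threading and multiprocessing modules, and the thread name "worker-1" is arbitrary:

import os
import threading
import multiprocessing

def report():
    # every thread lives inside a process and shares that process's memory space
    print(os.getpid(),
          multiprocessing.current_process().name,
          threading.current_thread().name)

report()  # main thread of the main process, e.g. "MainProcess MainThread"
worker = threading.Thread(target=report, name="worker-1")
worker.start()
worker.join()  # same pid and process name, different thread name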
However, it turns out that multi-threading for achieving concurrency in Python is somewhat nuanced
due to a feature known as the GIL (global interpreter lock). Any high-level language is converted into
machine language by compilers or interpreters, or a mix of both, and has lower-level representations in
bytecode, assembly language and so on, which are further abstractions on top of binary code, which
specifies computer instructions in ones and zeros. We are purposefully abstracted away from this gritty
detail, and so we will not go into it. Briefly, the GIL prevents a Python process from executing
more than one Python bytecode instruction at any time. While this may seem restrictive, the
feature exists due to the memory management in CPython, the reference implementation of Python.
Memory in CPython is managed by a mechanism known as reference counting, which keeps track of the
number of variables referencing an object. When a variable pointer is assigned to the object, the
reference count is incremented - and then decremented when the variable no longer needs it. When the
reference count hits zero, the object is garbage collected to free up memory. CPython, however, is not
thread-safe, which means that if multiple threads share a variable and attempt to modify it, the result
is non-deterministic - a state known as a race condition. While threads can run concurrently on different
cores, the interpreter lock ensures that only one of them runs Python code at a time to prevent race
conditions. The other threads idle. While this can paint a gloomy picture for Python as a programming
language, there are workarounds. Firstly, the GIL is released when a task occurs outside the Python
runtime. An example is when I/O operations happen. I/O operations make lower-level system calls to
the OS that are outside of the Python runtime, and the GIL is released since we are not working directly
with Python objects. The GIL is reacquired when the data is received and translated back into a Python
object instance. In other languages such as Java, C and C++, the GIL is non-existent and threads running
on different cores run in parallel. asyncio exploits the fact that I/O releases the GIL to give us concurrency
on a single thread, using objects known as coroutines. A coroutine itself can be thought of as an even
lighter-weight version of a thread. Of course, a single thread belongs to a single process, so coroutines
share the same memory space. To understand this single-threaded event loop model, we introduce
sockets. A socket is a low-level abstraction for sending and receiving data between applications over
a network.
Sockets are blocking by default, meaning that when we fire off a request for a data packet,
execution stops/blocks until we receive the data. However, they can be made to operate in a non-blocking
mode, which means we fire and forget until the OS notifies us that we have something in our inbox. These notification
systems are provided by the OS and are OS-dependent. asyncio abstracts away these details to work with
the different notification systems, allowing developers to remain mostly ignorant of the underlying
mechanisms of data transfer. In asyncio, when we hit an I/O operation, we ask the OS to make the I/O
request, and then register with the notification system to be reminded of any updates. Meanwhile, we are
free to return to the Python runtime and execute other work. When the notification is received, Python
receives the object and continues the code execution. The tasks that are waiting for I/O are tracked
inside the Python runtime using a construct known as the event loop.
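The OS notification mechanism that asyncio wraps can be glimpsed with the standard selectors module, which abstracts over epoll/kqueue/select. The following is only a minimal sketch of the idea, not how asyncio is implemented internally; the host and port are arbitrary:

import selectors
import socket

sel = selectors.DefaultSelector()

sock = socket.socket()
sock.setblocking(False)               # fire-and-forget: connect returns immediately
try:
    sock.connect(("example.com", 80))
except BlockingIOError:
    pass                              # expected for a non-blocking connect

sel.register(sock, selectors.EVENT_WRITE)  # ask the OS to tell us when the socket is writable
# ... free to do other work here ...
events = sel.select(timeout=5)             # block (up to 5s) only when we choose to wait
for key, mask in events:
    print("socket ready - we could now send the request")
sel.unregister(sock)
sock.close()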
In asyncio, the event loop keeps a queue of tasks, which are wrappers around coroutines. When an
event loop is instantiated, an empty queue of tasks is created. Each iteration of the event loop checks
for tasks to run and runs them until an I/O operation is hit, at which point asyncio makes the necessary
system calls for the operation and registers to be notified of its progress. It then goes on to look for other
tasks to execute. Each iteration also involves checking for any tasks that have already completed.

Figure 17.2: An example of a thread submitting tasks to the event loop.
Source: Figure 1.9 in Fowler [8]

Creating a coroutine is fairly straightforward, using the async def keyword, with the await keyword at
the I/O code points. We will demonstrate this later in some substantial examples. When a coroutine is
called in the manner result = coroutine(args), it is not executed immediately; instead, a coroutine object is
assigned to the variable result. To run a coroutine, it needs to be explicitly put on the event loop, and
an iteration of the event loop needs to be triggered. The different ways of doing so will be of primary
interest going forward in this section.
One of the convenience functions is asyncio.run, which takes in a coroutine and runs it.
1 import asyncio
2

3 async def main () :


4 await asyncio . sleep (1)
5 print ( " slept one ( s ) " )
6 return True
7

8 if __name__ == " __main__ " :


9 result = asyncio . run ( main () )

Here, asyncio creates a new event loop and runs the main coroutine until completion. asyncio.sleep
is also a coroutine, which we will use as a stand-in to simulate an I/O operation such as a database
request. After completion, some cleanup is done and the event loop is closed. The asyncio.run function
is intended to be the main entry point into an asynchronous application. Other coroutines should be launched
inside the coroutine that is passed to this run function, which in our case is main. The await keyword
is followed by an object of the awaitable type, which is satisfied by coroutines. The coroutine itself is
paused on the await sleep instruction, and is then woken up later to continue execution. While no
concurrent work was done here, these are the basic building blocks of an asynchronous application.
In fact, suppose we write this:
1 import asyncio

2

3 async def main () :


4 await asyncio . sleep (1)
5 print ( " slept one ( s ) " )
6 await asyncio . sleep (1)
7 print ( " slept one ( s ) " )
8 return True
9

10 if __name__ == " __main__ " :


11 result = asyncio . run ( main () )

it will actually take roughly two seconds. As mentioned, the coroutine itself pauses until the line completes,
which means we are sleeping one second twice, one after the other, instead of together. But
we really want concurrency. In order to achieve this, we need to introduce tasks, which are wrappers
around coroutines that schedule the contained coroutine to run on the event loop as early as possible. This
scheduling and execution is non-blocking, while the await keyword is blocking. We can do this with
the asyncio.create_task function. When a task is created, it is scheduled to run on the next iteration of the
event loop. An iteration of the event loop can be triggered by an await statement. Usually, when a task
is created, we would await it at some point. An interesting aside follows. If we write this:
1 async def verbose_sleep ( s ) :
2 await asyncio . sleep ( s )
3 print ( " slept for " , s )
4

5 async def main () :


6 task1 = asyncio . create_task ( verbose_sleep (5) )
7 task2 = asyncio . create_task ( verbose_sleep (6) )
8 return True
9

10 if __name__ == " __main__ " :


11 asyncio . run ( main () )

the application would actually terminate immediately. Here the task objects are created but the event
loop is not triggered and the function terminates. If instead we go with
1 async def verbose_sleep ( s ) :
2 await asyncio . sleep ( s )
3 print ( " slept for " , s )
4

5 async def main () :


6 task1 = asyncio . create_task ( verbose_sleep (5) )
7 task2 = asyncio . create_task ( verbose_sleep (3) )
8 await task2
9 return True
10

11 if __name__ == " __main__ " :


12 asyncio . run ( main () )

then we will see ’slept for 3’ - both tasks are scheduled and run on the event loop, but the application
terminates after the second task is completed. There are still more seconds to sleep on the first task and
it never gets to finish. Instead,
1 async def verbose_sleep ( s ) :
2 await asyncio . sleep ( s )
3 print ( " slept for " , s )
4

5 async def main () :
6 task1 = asyncio . create_task ( verbose_sleep (5) )
7 task2 = asyncio . create_task ( verbose_sleep (6) )
8 await task2
9 return True
10

11 if __name__ == " __main__ " :


12 asyncio . run ( main () )

this would print both ’slept for 5’ and ’slept for 6’. Even though we never awaited the first task, it runs and
completes before the second task completes. Of course, if we await both tasks, we will always see both
print statements. Tasks are scheduled to run as soon as possible, which generally means when the first
await statement is encountered after the task has been created. The sleeping here is done concurrently,
and in the last example, should terminate after approximately six seconds and not eleven. Of course, we
may also do some useful work instead of wasting the waiting time. Before we demonstrate that, let's create a utility decorator
function that acts as a stopwatch timer for our application. We will discuss decorators in other sections
-
1 def asyn_timefn ( func ) :
2 import time
3 from functools import wraps
4 @wraps ( func )
5 async def timediff (* args , ** kwargs ) :
6 a = time . time ()
7 result = await func (* args , ** kwargs )
8 print ( f " @timefn : { func . __name__ } took { time . time () - a } seconds " )
9 return result
10 return timediff

Now suppose we want to compute the mean of a uniform random sample of ten million numbers.
1 async def verbose_sleep ( s ) :
2 await asyncio . sleep ( s )
3 print ( " slept for " , s )
4

5 @asyn_timefn
6 async def main () :
7 task1 = asyncio . create_task ( verbose_sleep (5) )
8 task2 = asyncio . create_task ( verbose_sleep (6) )
9 import random
10 import numpy as np
11 work = np . mean ([ random . uniform (0 ,1) for _ in range (10000000) ])
12 await task1
13 await task2
14 return work
15

16 if __name__ == " __main__ " :


17 work = asyncio . run ( main () )
18 print ( work )
19

20 ’’’
21 slept for 5
22 slept for 6
23 @timefn : main took 8.519409894943237 seconds
24 0.4998836811668402
25 ’’’

Interestingly, it took quite a bit more than six seconds to complete this - it turns out that the mathematical
computation was not done concurrently. This is because the tasks were created and scheduled, but no
event loop iteration was triggered before the computation. We could swap lines 11 and 12, but this would only mean
doing the computation concurrently with task2. We want to do the math while waiting for both
task1 and task2. Here's the trick:
1 @asyn_timefn
2 async def main () :
3 task1 = asyncio . create_task ( verbose_sleep (5) )
4 task2 = asyncio . create_task ( verbose_sleep (6) )
5 import random
6 import numpy as np
7 await asyncio . sleep (0)
8 work = np . mean ([ random . uniform (0 ,1) for _ in range (10000000) ])
9 await task1
10 await task2
11 return work
12 ’’’
13 slept for 5
14 slept for 6
15 @timefn : main took 6.001631021499634 seconds
16 0.49996590071870267
17 ’’’

The zero-sleep operation triggers the event loop, fires both sleep requests and does the mathematical
work. There is almost no additional time spent on the statistical sampling work. Now we know how to
work with tasks and coroutines, and some quirks. Let’s discover some more of the asyncio workflow and
control.
Network connections can be unstable and might hang indefinitely, so we want to be able to cancel
tasks. Task objects come with a cancel method, and the cancelled task raises a CancelledError when we
await the task.
1 async def main () :
2 task1 = asyncio . create_task ( verbose_sleep (5) )
3 assert not task1 . done ()
4 task1 . cancel ()
5 try :
6 await task1
7 except asyncio . CancelledError :
8 print ( " task was cancelled " )
9 return
10

11 if __name__ == " __main__ " :


12 asyncio . run ( main () )
13

14 ’’’
15 task was cancelled
16 ’’’

It should be noted that the CancelledError is only thrown at await statements, meaning that if the
cancel is submitted while the task is midway through executing plain Python code, the code runs until
the next await statement, if any. We might already have a time limit in mind for cancelling the task,
in which case we should go for asyncio.wait_for, which takes in a coroutine/task and the number of seconds to wait -
after which an asyncio.TimeoutError is thrown:

1 async def main () :
2 try :
3 task1 = asyncio . create_task ( verbose_sleep (3) )
4 await asyncio . wait_for ( task1 ,2)
5 await task1
6 except asyncio . TimeoutError :
7 print ( " task was timed out with status cancel = " , task1 . cancelled () )
8 return
9

10 if __name__ == " __main__ " :


11 asyncio . run ( main () )
12

13 ’’’
14 task was timed out with status cancel = True
15 ’’’

But suppose we want to let the task continue executing instead of cancelling it after the specified number
of seconds - we can asyncio.shield the task:
1 async def main () :
2 try :
3 task1 = asyncio . create_task ( verbose_sleep (3) )
4 await asyncio . wait_for ( asyncio . shield ( task1 ) ,2)
5 await task1
6 except asyncio . TimeoutError :
7 print ( " task was timed out with status cancel = " , task1 . cancelled () )
8 await task1
9 return
10

11 if __name__ == " __main__ " :


12 asyncio . run ( main () )
13 ’’’
14 task was timed out with status cancel = False
15 slept for 3
16 ’’’

Now let us understand what links tasks and coroutines. We know they are both awaitable, but what is
an awaitable? To answer that, we need to introduce futures. A future is a Python object that contains a single
value you expect to obtain at some point in the future. When you first create one, that value normally does not exist yet;
somewhere downstream, once you have it, you set the value of the future to that object. Futures can be
awaited, and are either done or not done. The following example should be sufficiently clear:
1 async def main () :
2 future = asyncio . Future ()
3 print ( future . done () )
4 async def _helper () :
5 await asyncio . sleep (3)
6 future . set_result ( " DONE " )
7

8 task = asyncio . create_task ( _helper () )


9 await asyncio . sleep (0)
10 value = await future
11 print ( future . done () , " with result " , value )
12 return
13

14 if __name__ == " __main__ " :


15 asyncio . run ( main () )

16 ’’’
17 False
18 True with result DONE
19 ’’’

The internal implementation of the asyncio API relies heavily on futures. A task can be thought of
as a combination of a coroutine and a future. When a task is created, an empty future is created - with the
expectation that the future's result will be set to the output of the coroutine in question.
Both are awaitables, which is to say they are instances of classes that implement the
Awaitable abstract base class. Abstract base classes are discussed in other sections, but briefly, they
specify behavior that classes implementing them should satisfy: they determine the contract
specification. For instance, when we hire a burger cook, calling chef.cook() should get us a
burger, even though different cooks may prefer sous vide, the grill or whatever. In our case, the method to be
implemented is the __await__ method. The inheritance diagram looks something like this:

Awaitable
  Coroutine
  Future
    Task

When we look at the API documentation for asyncio, it is useful to know what kind of object instance
each function requires. A child is always an instance of its parent class, so a function that takes in
an awaitable can be passed an instance of a coroutine, a future or a task. A function that only takes in
coroutine instances, however, would not work when a task is passed in.
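A minimal sketch that checks this hierarchy at runtime (nothing here is application-specific; the names are arbitrary):

import asyncio
from collections.abc import Awaitable, Coroutine

async def coro():
    return 42

async def main():
    c = coro()                          # a bare coroutine object
    task = asyncio.create_task(coro())  # a task wrapping a coroutine
    fut = asyncio.get_running_loop().create_future()
    fut.set_result(42)

    print(isinstance(c, Coroutine), isinstance(c, Awaitable))            # True True
    print(isinstance(task, asyncio.Future), isinstance(task, Awaitable)) # True True
    print(isinstance(fut, asyncio.Future), isinstance(fut, Awaitable))   # True True
    await asyncio.gather(c, task, fut)

asyncio.run(main())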
We now discuss common mistakes in using the asyncio API. One is running CPU-bound code inside
tasks or coroutines without multiprocessing. Observe that the following two listings take approximately the
same time.
1 async def cpu_work () :
2 import numpy as np
3 import random
4 sample = [ random . uniform (0 ,1) for _ in range (10000000) ]
5 sum = 0
6 for num in sample :
7 sum += num
8 return sum
9

10 @asyn_timefn
11 async def main () :
12 task1 = asyncio . create_task ( cpu_work () )
13 task2 = asyncio . create_task ( cpu_work () )
14

15 await task1
16 await task2
17 return
18

19 if __name__ == " __main__ " :


20 asyncio . run ( main () )
21 ’’’
22 @timefn : main took 4.500612020492554 seconds
23 ’’’

1 def cpu_work () :

2 import numpy as np
3 import random
4 sample = [ random . uniform (0 ,1) for _ in range (10000000) ]
5 sum = 0
6 for num in sample :
7 sum += num
8 return sum
9

10 @asyn_timefn
11 async def main () :
12 cpu_work ()
13 cpu_work ()
14 return
15 ’’’
16 @timefn : main took 4.320184946060181 seconds
17 ’’’

Actually, the second program runs faster, because there is less overhead without coroutines. Although
the difference is marginal, the coroutines certainly do not help the program. Firstly, the CPU work here does not
contain any await statements, so it will not give up execution until it finishes; other I/O tasks behind it in the event
loop never get a chance to submit their I/O requests in the meantime. Secondly, because asyncio is
single-threaded, no CPU work occurs in parallel. Furthermore, even if it were multi-threaded, the GIL
would prevent any parallelism. What we would want here is multiprocessing, which will be discussed
later. The second mistake is to asynchronously call blocking APIs. When we know that we are
making an I/O operation through an API, we might be tempted to wrap it in a coroutine. But the API
itself may block the event loop, so we will not get performance benefits.
1 async def hangukquant () :
2 import requests
3 res = requests . get ( " https :// www . hangukquant . com " )
4 return
5

6 @asyn_timefn
7 async def main () :
8 tasks =[ asyncio . create_task ( hangukquant () ) for _ in range (30) ]
9 await asyncio . sleep (0)
10 [ await task for task in tasks ]
11 return
12

13 if __name__ == " __main__ " :


14 asyncio . run ( main () )
15 ’’’
16 @timefn : main took 6.972156286239624 seconds
17 ’’’

That the network request is an I/O operation is accurate. However, the internal implementation of the
requests module blocks on the get method, so no asynchronous work is done here. We can either use
multithreading with blocking APIs such as requests, or use dedicated asyncio-supported libraries such as
aiohttp, which uses non-blocking sockets. We will examine them later.
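As a preview of the multithreading route, a minimal sketch (assuming Python 3.9+ for asyncio.to_thread and that requests is installed; the URL and request count mirror the example above) hands the blocking requests.get call to a worker thread so the event loop is not blocked:

import asyncio
import requests  # blocking HTTP client

async def fetch(url):
    # run the blocking call in a worker thread; the GIL is released while the
    # thread waits on the network, so the event loop stays responsive
    return await asyncio.to_thread(requests.get, url)

async def main():
    urls = ["https://www.hangukquant.com"] * 30
    responses = await asyncio.gather(*[fetch(u) for u in urls])
    print([r.status_code for r in responses])

if __name__ == "__main__":
    asyncio.run(main())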
Previously, we used the asyncio.run method to handle the event loop for us. We can actually have
more fine-grained control over the event loop behavior, but we then need to be careful about cleaning
up resources and closing the loop. This would work:
1 async def main () :

2 await verbose_sleep (1)
3 return
4

5 if __name__ == " __main__ " :


6 loop = asyncio . new_event_loop ()
7 try :
8 loop . run_until_complete ( main () )
9 finally :
10 loop . close ()

We can also replace new_event_loop with get_event_loop, which gets the running loop if any, or creates
a new one. If we use get_running_loop instead, we would get an error that there is no currently running
event loop. We can also explicitly put non-async functions on the event loop, to run on its next iteration,
with call_soon:
1 def print_hello () :
2 print ( " hello " )
3

4 async def main () :


5 loop = asyncio . get_running_loop ()
6 loop . call_soon ( print_hello )
7 await verbose_sleep (1)
8 return
9

10 if __name__ == " __main__ " :


11 loop = asyncio . new_event_loop ()
12 try :
13 loop . run_until_complete ( main () )
14 finally :
15 loop . close ()
16 ’’’
17 hello
18 slept for 1
19 ’’’

We refer readers to the documentation for the nuances, if more control over the event loop is desired within
the application. Other options exist, such as debug mode, i.e. asyncio.run(main(), debug=True). When
working with production-quality applications, exception handling is very important. Note that when
an exception is thrown inside a running task, the task is considered done, with the exception as its result. No
exception is thrown up the call stack, and there is no cleanup. To get the exception thrown to us, we
need to await the task; otherwise the exception may never be retrieved, particularly if the task is never
garbage collected.
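A minimal sketch of this behavior (the exception type and sleep duration are arbitrary): the exception only surfaces when the task is awaited, and can afterwards also be inspected via the task's exception method:

import asyncio

async def failing():
    await asyncio.sleep(1)
    raise ValueError("something went wrong")

async def main():
    task = asyncio.create_task(failing())
    try:
        await task                              # the exception surfaces here
    except ValueError as e:
        print("caught:", e)
    print(task.done(), repr(task.exception()))  # True ValueError('something went wrong')

asyncio.run(main())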
We want to have control over the workflow of coroutines, and to explore some useful APIs that are
asyncio-compatible. An example is the aiohttp library, which uses non-blocking sockets to make web
requests and returns coroutines for those requests. As mentioned, the requests library is blocking and
would block the thread it runs in. Before we go into aiohttp, let us first understand the role of context
managers in resource control. In particular, we are interested in asynchronous context managers - they
implement two coroutine methods, __aenter__ and __aexit__. Among other things, context managers help manage
resources and perform auxiliary functionality such as resource clean-up. A common example would be:
1 with open ( " test . txt " ) as file :
2 lines = file . readlines ()

which is a more Python-ic way of doing this:

1 file = open ( " test . txt " )
2 try :
3 lines = file . readlines ()
4 finally :
5 file . close ()

With asynchronous context managers, the syntax is slightly different. For instance, we will use
async with instead of the with keyword. We can acquire an asynchronous lock like this:
1 lock = asyncio . Lock ()
2 counter = 0
3 async def increment () :
4 global counter
5 async with lock :
6 counter += 1

We will not go into intimate detail on context managers - they are discussed in other sections.
aiohttp uses async context managers for acquiring HTTP sessions and connections. Sessioning is a
networking term for keeping connections open. Without going into the details, when HTTP requests are
made over the network, steps such as the TCP handshake and the exchange of other information are
expensive. Sessioning keeps the connections open and allows us to recycle them; this is known as
connection pooling, which reduces the resources required and improves the performance of aiohttp-based
applications. With this non-blocking library, we may use sessioning and coroutines to speed up the same
problem as in the blocking requests example earlier in Section 17.1.3.
1 async def hangukquant ( session ) :
2 async with session . get ( " https :// www . hangukquant . com " ) as res :
3 return await res . text ()
4

5 @asyn_timefn
6 async def main () :
7 import aiohttp
8 async with aiohttp . ClientSession () as session :
9 tasks =[ asyncio . create_task ( hangukquant ( session ) ) for _ in range (30) ]
10 await asyncio . sleep (0)
11 [ await task for task in tasks ]
12 return
13

14 if __name__ == " __main__ " :


15 asyncio . run ( main () )
16 ’’’
17 @timefn : main took 1.417881965637207 seconds
18 ’’’

Notice that our program runs significantly faster. The ClientSession creates a default maximum of
one hundred connections. This may be altered - we refer readers to the aiohttp documentation. Previously,
we set timeouts with the asyncio.wait_for method. Although that works with arbitrary coroutines,
and hence is also applicable here, aiohttp provides out-of-the-box functionality to do this cleanly. At
the session level, we can instantiate the client session with a timeout object to set a timeout limit, replacing
the relevant lines of code with:
1 session_timeout = aiohttp . ClientTimeout ( total =1 , connect =0.3)
2 timed_session = aiohttp . ClientSession ( timeout = session_timeout )
3 async with timed_session :
4 ...

which sets a time limit on every request made through the session, as well as on sub-operations such as establishing connections.
At the per-request level, we can also specify timeouts by passing a client timeout object to the get
request:
1 async with session . get (
2 " https :// www . hangukquant . com " ,
3 timeout = aiohttp . ClientTimeout ( total =0.5)
4 ) as res :
5 ...

If the requests are not successful within the specified time, we get an asyncio.TimeoutError at the
await statement. The last code example works, but it is kind of clunky. First, we have to wrap coroutines
in task objects, then trigger an iteration of the event loop with a pseudo-sleep, and then await each
task. We can do this cleanly using the asyncio.gather method, which takes in a sequence of awaitables
and runs them concurrently. If any of the awaitables passed in are coroutines, they are automatically
wrapped in task objects so that they run on the event loop concurrently. The gather method itself
returns an awaitable, and even though the individual tasks may not complete in a deterministic order, the
results are returned in the order we pass them in. Refer to the following example:
async def hangukquant(session, request_no):
    async with session.get("https://www.hangukquant.com") as res:
        result = await res.text()
        return request_no

async def main():
    import aiohttp
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*[hangukquant(session, i) for i in range(30)])
        print(results)
    return

if __name__ == "__main__":
    asyncio.run(main())
'''
[0, 1, 2, 3, ..., 28, 29]
'''

Much cleaner. What about exceptions? gather has an optional parameter return_exceptions, which specifies
two different behaviors. It defaults to False, which causes gather to throw an exception when we await
it if any of the coroutines throws an exception. It should be noted that in this scenario the other
coroutines are not actually cancelled - if we handle the exception, they are allowed to run to
completion. Otherwise, they run until they are cleaned up downstream or until completion, whichever
is earlier. If we set return_exceptions to True, then the gather method does not throw any exception.
Instead, any exception thrown inside a coroutine is returned in its place in the results list:
async def hangukquant(session, request_no):
    async with session.get("https://www.hangukquant.com") as res:
        import random
        if random.uniform(0, 1) > 0.5:
            raise Exception("unlucky error")
        result = await res.text()
        return request_no

@asyn_timefn
async def main():
    import aiohttp
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *[hangukquant(session, i) for i in range(10)],
            return_exceptions=True
        )
        print(results)
    return

if __name__ == "__main__":
    asyncio.run(main())
'''
[0, Exception('unlucky error'), 2, Exception('unlucky error'), Exception('unlucky error'),
 Exception('unlucky error'), Exception('unlucky error'), Exception('unlucky error'), 8, 9]
'''

Note that if we let the exception be thrown at gather, then we only retrieve the first one when we
await it. Maybe this is okay, but we may want to specify a different behavior. Perhaps we would
like to cancel our tasks if one of them fails, such as when the tasks make API requests to the same
server and we receive a 429 code as rate limits get triggered. Spamming the server with the remaining
requests might get us blacklisted, so cancelling the remaining tasks would be desirable. Also, if the tasks
take disproportionate amounts of time to complete, the gather method only completes after the longest
task is done - we may prefer to work on the results of the requests that have already completed first.
asyncio provides the as_completed method, which takes in a list of awaitables and returns an iterator of
futures. We can iterate over and await each of the futures, with the result being the return value
of a coroutine on a first-finish basis. Therefore, the ordering of results is non-deterministic. We
demonstrate with the following:
async def work(num):
    import random
    work_cost = random.uniform(0, 4)
    await asyncio.sleep(work_cost)
    return (num, work_cost)

@asyn_timefn
async def main():
    async_work = [work(i) for i in range(10)]
    for finished in asyncio.as_completed(async_work):
        res = await finished
        print(res)
    return

if __name__ == "__main__":
    asyncio.run(main())

'''
(6, 0.3673681494715675)
(9, 1.276757248018661)
(7, 1.3913958197279603)
(3, 2.3091369741247556)
(5, 2.3978259114992753)
(8, 2.4415577155094477)
(2, 2.546779250573)
(1, 2.60398276135025)
(0, 3.159090865296541)
(4, 3.824869021632983)
@timefn: main took 3.829425096511841 seconds
'''

Since we have access to individual tasks, we also have better control over exception handling. As
before, any task exception is thrown at the await statement. There is an optional parameter
timeout in the asyncio.as_completed function, which specifies the number of seconds to let the group
of tasks run. Any task taking longer than the timeout throws an asyncio.TimeoutError when
awaited. As we get the results of the tasks that finish earlier right away, any result retrieved may
be worked on straight away without the results of the others. While this is an advantage over gather, we
might want even more control. When a timeout occurs, an exception is thrown, but there is no cancellation -
the tasks still run in the background. Perhaps we would like to cancel them. asyncio offers another
method, wait, that gives several options on when we would like to receive our results from the tasks. The
method returns two sets - one in which the tasks are completed (either a result is computed or an exception
is thrown) and the other in which the tasks are still running. The optional timeout configuration
on this method specifies the time requirement; however, upon timeout, no exception is thrown and
the coroutines are not cancelled - we would have to iterate over the pending task set to cancel
them. Instead of throwing an exception on timeout, all the tasks still running are simply returned in the
pending set for us to handle as we prefer. The options in the wait protocol are ALL_COMPLETED,
FIRST_EXCEPTION and FIRST_COMPLETED, and it should be pretty clear from their names what
behavior they specify. We refer the details to the documentation, and instead provide an example. To
be brief, the ALL_COMPLETED option behaves similarly to gather, and the FIRST_COMPLETED
option behaves similarly to as_completed. Of course, if we receive the results of the tasks that
completed earlier, we may go ahead and perform work on them without waiting for the other tasks to
complete. The FIRST_EXCEPTION option is interesting - we shall demonstrate a program that cancels the
pending tasks on encountering an exception:
async def work(num):
    import random
    work_cost = random.uniform(0, 1)
    await asyncio.sleep(work_cost)
    if random.uniform(0, 1) < 0.30:
        raise Exception(f"Task {num} had an unlucky failure")
    return (num, work_cost)

@asyn_timefn
async def main():
    async_work = [asyncio.create_task(work(i)) for i in range(10)]
    done, pending = await asyncio.wait(async_work, return_when=asyncio.FIRST_EXCEPTION)
    for item in done:
        if item.exception() is None:
            print(item.result())
        else:
            print(item.exception())
    for item in pending:
        print("cancelling", async_work.index(item))
        item.cancel()
    return

if __name__ == "__main__":
    asyncio.run(main())

'''
Task 7 had an unlucky failure
(4, 0.3565076327157949)
(2, 0.20337667536574278)
(9, 0.1277640138373286)
cancelling 5
cancelling 8
cancelling 3
cancelling 6
cancelling 0
cancelling 1
'''

Note that the done set is guaranteed to contain at least one element - the task that threw the
exception is itself considered done.
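To make the timeout behaviour concrete, here is a minimal sketch of our own (not one of the listings above): with a short timeout, the slower tasks are simply handed back in the pending set, and cancellation is left to us.

import asyncio

async def work(num):
    await asyncio.sleep(num)
    return num

async def main():
    tasks = [asyncio.create_task(work(i)) for i in range(5)]
    # no exception is raised on timeout; unfinished tasks land in the pending set
    done, pending = await asyncio.wait(tasks, timeout=1.5)
    print("done:", sorted(task.result() for task in done))
    for task in pending:
        task.cancel()  # cancellation is our responsibility

if __name__ == "__main__":
    asyncio.run(main())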
As mentioned in earlier text, the asyncio library is interoperable with both threads and processes; this
shall now be further explored. The reasons for multiprocessing were laid out earlier: multiprocessing
allows us to take advantage of a multi-core machine. Each subprocess has its own Python interpreter
and is subject to its own GIL, meaning the computer as a whole can run multiple lines of application
bytecode at once. Since this is a primer on asyncio, we will focus primarily on the synergy between
multiprocessing and asynchronous programming, rather than on multiprocessing by itself. Python parallelism
will be discussed in other sections. We are interested in a construct known as the process pool, which is
a collection of Python processes we can use to run functions in parallel as separate programs.
Let us first begin with a timer for normal Python functions:
def timefn(func):
    import time
    from functools import wraps
    @wraps(func)
    def timediff(*args, **kwargs):
        a = time.time()
        result = func(*args, **kwargs)
        print(f"@timefn: {func.__name__} took {time.time() - a} seconds")
        return result
    return timediff

Further suppose we are interested in drawing bootstrap samples of a uniform random variable to study
sampling distributions:
def work():
    import random
    sample = [random.uniform(0, 1) for _ in range(10000000)]
    return sample

@timefn
def main():
    results = [work() for _ in range(10)]
    print("bootstrap samples obtained")

if __name__ == "__main__":
    main()
'''
bootstrap samples obtained
@timefn: main took 23.34557318687439 seconds
'''

We can do the same in parallel with a process pool from the multiprocessing library, using the
following program:
def work():
    import random
    sample = [random.uniform(0, 1) for _ in range(10000000)]
    return sample

@timefn
def main():
    from multiprocessing import Pool
    with Pool() as process_pool:
        bootstrap_samples = [
            process_pool.apply_async(work, args=()) for _ in range(10)
        ]
        results = [sample.get() for sample in bootstrap_samples]
    print("bootstrap samples obtained")

if __name__ == "__main__":
    main()
'''
bootstrap samples obtained
@timefn: main took 12.30059003829956 seconds
'''

While the performance improvement is good, as before we might want more granular control
over the workflow, such as the behaviour implemented in as_completed, gather and so on. Python offers
an abstraction on top of the multiprocessing process pools in the form of the concurrent.futures module,
which uses executors for both processes and threads that interoperate with asyncio. These executors
are called ProcessPoolExecutors and ThreadPoolExecutors; they share the same method signatures, allowing
us to reuse similar code for multi-threading and multi-processing where appropriate. Both classes
implement the Executor abstract base class, which exposes a submit function that takes a callable and
returns a concurrent.futures.Future object, and a map function, which takes a callable and a list
of function arguments and calls the function with each argument asynchronously. The return value
of map is an iterator over the results of our calls, in deterministic order. We first work with CPU-bound
functions, and hence demonstrate the ProcessPoolExecutor. We could do this with the map function:
def work(n):
    import random
    sample = [random.uniform(0, 1) for _ in range(n)]
    return sample

def main():
    import random
    from concurrent.futures import ProcessPoolExecutor

    with ProcessPoolExecutor() as process_pool:
        args = [random.randint(1000000, 10000000) for _ in range(10)]
        results = process_pool.map(work, args)
        for result in results:
            print(len(result))
    print("bootstrap samples obtained")

if __name__ == "__main__":
    main()
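For completeness, the submit interface mentioned above can be paired with the module-level concurrent.futures.as_completed to consume results on a first-finish basis; a minimal sketch of our own under the same bootstrap-sampling setup:

import random
from concurrent.futures import ProcessPoolExecutor, as_completed

def work(n):
    return [random.uniform(0, 1) for _ in range(n)]

def main():
    with ProcessPoolExecutor() as process_pool:
        # submit returns a concurrent.futures.Future per call
        futures = [
            process_pool.submit(work, random.randint(1000000, 10000000))
            for _ in range(10)
        ]
        for future in as_completed(futures):  # yields futures as they finish
            print(len(future.result()))
    print("bootstrap samples obtained")

if __name__ == "__main__":
    main()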

The map version does the bootstrap sampling in parallel, but results come back in the order that the
arguments are passed in - which means that if the first argument was the largest sample size, we
would not be able to work with the samples that have already been generated for the smaller problems.
Suppose instead we want behavior like that of as_completed, by hooking the executor up to the asyncio module. The asyncio
event loop provides a run_in_executor function, which takes an Executor instance and a callable, and
runs that callable on that executor, which would normally be a process or thread pool. Of course, the
executor we pass in should fit the task, whether it is CPU or I/O bound. An awaitable is
returned. This means that any of the asyncio APIs taking in an awaitable can now be applied to achieve
the desired workflow, just as we did with coroutines.
Note that run_in_executor only takes a callable and does not allow us to supply function arguments -
but we can work around this by using partial function application from the functools module to build a
function call with zero arguments. Here is a demonstration using the gather functionality:
import random
import asyncio

def work(n):
    sample = [random.uniform(0, 1) for _ in range(n)]
    return sample

async def main():
    from concurrent.futures import ProcessPoolExecutor
    from functools import partial
    with ProcessPoolExecutor() as process_pool:
        loop = asyncio.get_running_loop()
        args = [random.randint(1000000, 10000000) for _ in range(10)]
        executions = [
            loop.run_in_executor(
                process_pool,
                partial(work, arg)
            ) for arg in args
        ]
        await asyncio.gather(*executions)
    print("bootstrap samples obtained")

if __name__ == "__main__":
    asyncio.run(main())

We can also use other functionality, such as as_completed, as we did before. When splitting work
between multiple subprocesses there is a tradeoff: when another process is spun off to do work for
us, the function arguments are serialized (pickled) and the worker process deserializes them, which is
expensive, so spinning up new processes can eat into the performance gains. On the
other hand, too few subprocesses means fewer hands to do the work. What if we want some form of
communication between the subprocesses? Say we are looking for a nice prime
factorization of a large number, and we want to tell the other processes once the job is done. Processes
have different memory spaces, but multiprocessing provides a concept of shared memory.

Figure 17.3: Memory States. Source: Figure 6.2 in Fowler [8]

In particular, multiprocessing supports two kinds of shared data: values and arrays. A value contains
a singular value, such as a float or an integer, while an array is an array of such singular values. We refer to the
documentation for more details on supported data types. These are instantiated using objects from
the multiprocessing library, and we can indicate the data type with a single character known as the
'typecode'. Again, inter-process communication is not the central focus of this section, so we will skip
the details. Do note that when working with shared data there is the possibility of race conditions for
non-atomic operations. These can be avoided by synchronizing access to the critical parts of the code
that touch shared data, using a mechanism known as a lock or mutex. These locks
can be acquired and released via the methods get_lock().acquire() and get_lock().release(), which can be
called on the object instances of the values or arrays in shared memory. A very simple example is
demonstrated:
def work(work_done):
    work_done.value += 1
    return

@timefn
def main():
    from multiprocessing import Process
    from multiprocessing import Value
    work_done = Value("i", 0)
    processes = [Process(target=work, args=(work_done,)) for _ in range(300)]
    [p.start() for p in processes]
    [p.join() for p in processes]
    print("work done:", work_done.value)

if __name__ == "__main__":
    main()
'''
work done: 298
'''

We fall short of 300 because of the race condition. If the increment is treated as a critical section, we should instead write:


def work(work_done):
    with work_done.get_lock():
        work_done.value += 1
    return
'''
work done: 300
'''

where the context manager handles acquiring and releasing the lock resource. When we are working
with process pools, the approach is slightly different. As mentioned, submitting a task to the process
pool involves serializing the arguments before they are put on the task queue. Since shared data is, by
definition, shared among worker processes, it makes no sense to pickle and unpickle it back and
forth between processes. Instead, we store the shared counter in a global variable and initialize it when
the process pool is set up, via an initializer that is called as each process in the pool starts. Here is an
example:
import random
import asyncio

work_done = None

def init(counter):
    global work_done
    work_done = counter

def work():
    with work_done.get_lock():
        work_done.value += 1
    return

async def main():
    from concurrent.futures import ProcessPoolExecutor
    from multiprocessing import Value
    counter = Value("i", 0)
    with ProcessPoolExecutor(initializer=init, initargs=(counter,)) as process_pool:
        loop = asyncio.get_running_loop()
        executions = [loop.run_in_executor(process_pool, work) for _ in range(100)]
        await asyncio.gather(*executions)
        print(counter.value)

if __name__ == "__main__":
    asyncio.run(main())

Suppose now we have a mixed workload - there are both CPU-heavy and I/O-heavy operations. We
can handle this kind of workload using multiprocessing, running a separate event loop in each
subprocess. Here is the design pattern:

import asyncio

def work(some):
    async def coroutine():
        await asyncio.sleep(5)
        print(some)
    return asyncio.run(coroutine())

async def main():
    from concurrent.futures import ProcessPoolExecutor
    with ProcessPoolExecutor() as process_pool:
        process_pool.map(work, [_ for _ in range(10)])

if __name__ == "__main__":
    asyncio.run(main())

The executor spins up multiple processes inside the process pool, and then inside each process a
new event loop is created and the coroutine is executed on that event loop. Of course, we have not
demonstrated any actual work being computed, but we will quickly apply these techniques in the other
sections.
Now that we have demonstrated the ProcessPoolExecutor, we want to demonstrate cases in which the
ThreadPoolExecutor is used, particularly when no asyncio-compatible library is available to us. Blocking
I/O operations release the GIL, so we can use multiple threads to achieve concurrency even when no
coroutines are involved. Although we can do without it, asyncio provides utility functions to handle
the workflow of these threads, as we saw for coroutines and processes. The nuances of threads will be
covered in another section - here we just make brief comments. An important distinction for threads is
between daemon and non-daemonic threads. Daemon threads shut down automatically when no
non-daemonic thread (such as the main thread) is left running. This means that if our main thread exits, the
daemonic threads are shut down alongside it, whereas a non-daemonic thread keeps the program alive
until it terminates or throws its own exception. This is important to distinguish since it determines the
error behavior of our application. We saw the program in Listing 17.1.3; in that
particular scenario, we managed to obtain concurrency using the aiohttp library, which is a non-blocking,
asyncio-compatible counterpart of the requests library. But suppose no such library exists, and we still
want to achieve concurrency - for instance, we might be using a third-party SDK for a data-service
provider that only exposes blocking API requests. We can do this with threading:
def main():
    import requests
    import threading
    num = 10
    results = [None] * num
    def hangukquant(id):
        res = requests.get("https://www.hangukquant.com")
        results[id] = res.status_code

    threads = [threading.Thread(target=hangukquant, args=(id,)) for id in range(num)]
    [thread.start() for thread in threads]
    [thread.join() for thread in threads]
    print(results)

if __name__ == "__main__":
    main()
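As a brief aside, the daemon flag discussed above is set when a thread is constructed; a minimal sketch of our own:

import threading
import time

def background_poll():
    while True:
        time.sleep(1)  # e.g. poll some blocking resource forever

# daemon=True: the thread is killed when the last non-daemonic thread exits,
# so the infinite loop above does not keep the interpreter alive
thread = threading.Thread(target=background_poll, daemon=True)
thread.start()
print("main thread exiting; the daemon thread dies with it")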

As we did with processes, we would like better workflow control over the threads and access
to the functionality that asyncio provides, such as as_completed and gather. Of course, this is
where the ThreadPoolExecutor comes in:
import asyncio
import functools

async def main():
    import requests
    from concurrent.futures import ThreadPoolExecutor
    num = 10
    def hangukquant(id):
        res = requests.get("https://www.hangukquant.com")
        return res.status_code

    loop = asyncio.get_event_loop()
    with ThreadPoolExecutor() as pool:
        tasks = [
            loop.run_in_executor(
                pool,
                functools.partial(hangukquant, id)
            ) for id in range(num)
        ]
        results = await asyncio.gather(*tasks)
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
'''
[200, 200, 200, 200, 200, 200, 200, 200, 200, 200]
'''

So concurrency is achieved through threads even though we are using the requests library, which is blocking,
and these threads are managed through the asyncio library. Note, however, that aiohttp would
still offer a more efficient solution, since threads are created at the OS level and involve context switching -
they are hence more expensive to handle than coroutines, which are managed
inside the Python runtime. A useful note is that the default executor is a ThreadPoolExecutor, which
acts as a reusable singleton throughout the application; we could have skipped the instantiation of
the thread pool and gone with loop.run_in_executor(None, ...). As with multiprocessing code, when we
use multithreading we need to be careful about race conditions. As opposed to multiprocessing, in
multithreading the memory space is shared, hence no shared-memory variable is required, but synchronized
access is still relevant. Locks, reentrant locks and semaphores are some of the constructs used to
synchronize access to critical regions of the code. This is not the focus of this section; it should be noted,
however, that asynchronous locks and synchronous locks are distinct, in that while a normal lock is waiting to
be acquired the thread is blocked, whereas an asynchronous lock allows the execution of
other tasks while waiting. Advanced readers who wish to practice should read the asyncio source code:
Github Link. As a bonus, you may refer to the implementation of our credit-semaphore: Github Link,
which uses entry credits as a resource pool to control entry into code regions guarded by a semaphore.
Last but not least, in some notable scenarios libraries such as numpy make underlying calls to C and
Fortran code, and hence do not involve running Python bytecode. In these
cases the GIL is released and even multithreading can speed up CPU-bound work. However, this depends
on the internal implementation of these methods and should not be taken at face value; it is important
to understand the nature of these functions and whether they interact with Python objects in the CPU
workload. Benchmarking and code profiling are useful tools in this respect.
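As a rough illustration of this last point, one might benchmark a numpy-heavy function sequentially and under a thread pool. This is our own sketch, and whether the threaded version is actually faster depends on the numpy build and on whether the underlying BLAS call releases the GIL:

import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def multiply(mat):
    # large matrix products are dispatched to compiled C/BLAS routines
    return mat @ mat

def main():
    rng = np.random.default_rng(0)
    mats = [rng.standard_normal((1500, 1500)) for _ in range(8)]

    start = time.time()
    _ = [multiply(mat) for mat in mats]
    print(f"sequential took {time.time() - start:.3f} seconds")

    start = time.time()
    with ThreadPoolExecutor() as pool:
        _ = list(pool.map(multiply, mats))
    print(f"threaded took {time.time() - start:.3f} seconds")

if __name__ == "__main__":
    main()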

Chapter 18

Conclusions

References

[1] linearalgebras.com. https://linearalgebras.com/.
[2] S. Axler. Linear Algebra Done Right Solutions Manual. 1997.
[3] S. Axler. Linear Algebra Done Right. Springer, 2015.
[4] BARRA. United States Equity, BARRA. BARRA, 2021.
[5] Boyd and Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2022.
[7] F. J. Fabozzi, P. N. Kolm, D. A. Pachamanova, and S. M. Focardi. Robust Portfolio Optimization and Management. Wiley, 2007.
[8] M. Fowler. Python Concurrency with asyncio. Manning Publications, 2022.
[9] J. Gallier. The Schur Complement and Symmetric Positive Semidefinite (and Definite) Matrices. https://www.cis.upenn.edu/~jean/schur-comp.pdf.
[10] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, Second Edition. Springer Series in Statistics.
[11] A. S. Iqbal. Foreign Exchange, Practical Asset Pricing and Macroeconomic Theory. Palgrave Macmillan.
[12] R. J. Larsen and M. L. Marx. An Introduction to Mathematical Statistics and Its Applications. Pearson.
[13] S. L. Ma, K. L. Ng, and V. Tan. Linear Algebra: Concepts and Techniques on Euclidean Spaces. McGraw-Hill Education, 2016.
[14] T. Masters. Permutation and Randomization Tests for Trading System Development, Algorithms in C++. Masters.
[15] G. A. Paleologo. Advanced Portfolio Management, A Quant's Guide for Fundamental Investors. Wiley, 2021.
[16] J. A. Rice. Mathematical Statistics and Data Analysis. Duxbury Advanced Series, Thomson Brooks/Cole.
[17] S. Roman. Advanced Linear Algebra. Springer, 2008.
[18] J. P. Romano and M. Wolf. Efficient computation of adjusted p-values for resampling-based stepdown multiple testing. 2016.
[19] S. E. Shreve. Stochastic Calculus for Finance II, Continuous Time Models. Springer Finance.
[20] Y. Zeng. Stochastic Calculus for Finance II: Continuous-Time Models, Solution of Exercise Problems. 2015.
Appendix A

Russian Doll Tester

This chapter presents code that is part of the proprietary testing engines made by the author. It is a
backtesting engine that optimizes computational efficiency, whilst making it simple for users to implement
their own custom, powerful trading strategies and indicators on top of the presented machinery. Although
designed with latency in mind, it is not intended for high-frequency trading purposes. The Russian Doll
Testing framework and backtester allows for testing both single-strategy and multi-strategy portfolios.
Our suite of statistical libraries allows comparisons of strategy/fund performance found in the standard
literature, as well as our own code and practice of probabilistic analysis of trading system decisions
under the standard hypothesis testing model.

Appendix B

Market Historians

B.1 1960

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 30.5 ∼ 30.8 0.98 ↑↔↑↔ CPI 29.37 ∼ 29.84 1.6 ↑↑↑↑
30.5−30.8 29.37−29.84
FFR 3.99 ∼ 1.45 -63.66 ↓↓↓↓ INFL 1M 1.24 ∼ 1.6 28.96 ↑↓↓↑
1.45−3.99 1.23−1.93
INFL 1Y 1.51 ∼ 1.08 -28.79 ···↓ RGDPPC 17364.0 ∼ 16970.0 -2.27 ↓↓↓↑
1.08−1.51 16922.0−17364.0
SP500 59.89 ∼ 58.11 -2.97 ↓↑↓↑ TAX Y HB 91.0 ∼ 91.0 0.0 ···↔
52.2−60.39 91.0−91.0
TAX Y LB 20.0 ∼ 20.0 0.0 ···↔ UNEMP US 5.2 ∼ 6.6 26.92 ↓↑↑↑
20.0−20.0 4.8−6.8
Table B.1: -

B.2 1961

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 30.8 ∼ 31.2 1.3 ↑↑↑↑ CPI 29.84 ∼ 30.04 0.67 ↓↑↑↑
30.8−31.2 29.81−30.04
FFR 1.45 ∼ 2.15 48.28 ↑↓↑↓ INFL 1M 1.6 ∼ 0.67 -58.12 ↓↑↓↓
1.17−2.61 0.67−1.6
INFL 1Y 1.08 ∼ 1.12 3.8 ···↑ RGDPPC 16970.0 ∼ 17966.0 5.87 ↑↑↑↑
1.08−1.12 16970.0−17966.0
SP500 57.57 ∼ 71.55 24.28 ↑↓↑↑ TAX Y HB 91.0 ∼ 91.0 0.0 ···↔
57.57−72.64 91.0−91.0
TAX Y LB 20.0 ∼ 20.0 0.0 ···↔ UNEMP US 6.6 ∼ 5.8 -12.12 ↑↑↓↓
20.0−20.0 5.8−6.9
Table B.2: -

B.3 1962

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 31.2 ∼ 31.5 0.96 ↑↑↑↔ CPI 30.04 ∼ 30.44 1.33 ↑↑↑↑
31.2−31.6 30.04−30.44

FFR 2.15 ∼ 2.92 35.81 ↑↓↑↑ INFL 1M 0.67 ∼ 1.33 98.67 ↑↓↑↓
2.15−2.94 0.67−1.47
INFL 1Y 1.12 ∼ 1.21 8.79 ···↑ RGDPPC 17966.0 ∼ 18337.0 2.07 ↑↑↓↑
1.12−1.21 17966.0−18337.0
SP500 70.96 ∼ 63.1 -11.08 ↓↓↓↑ TAX Y HB 91.0 ∼ 91.0 0.0 ···↔
52.32−71.13 91.0−91.0
TAX Y LB 20.0 ∼ 20.0 0.0 ···↔ UNEMP US 5.8 ∼ 5.8 0.0 ↓↓↑↑
20.0−20.0 5.3−5.8
Table B.3: -

B.4 1963

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 31.5 ∼ 32.2 2.22 ↑↑↑↑ CPI 30.44 ∼ 30.94 1.64 ↑↑↑↑
31.5−32.2 30.44−30.94
FFR 2.92 ∼ 3.48 19.18 ↓↑↑↓ INFL 1M 1.33 ∼ 1.64 23.36 ↓↑↓↑
2.9−3.5 0.89−1.65
INFL 1Y 1.21 ∼ 1.31 7.78 ···↑ RGDPPC 18337.0 ∼ 19215.0 4.79 ↑↑↑↑
1.21−1.31 18337.0−19215.0
SP500 63.1 ∼ 75.02 18.89 ↑↑↑↑ TAX Y HB 91.0 ∼ 77.0 -15.38 ···↓
62.69−75.02 77.0−91.0
TAX Y LB 20.0 ∼ 16.0 -20.0 ···↓ UNEMP US 5.8 ∼ 5.6 -3.45 ↓↓↓↑
16.0−20.0 5.5−6.1
Table B.4: -

B.5 1964

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 32.2 ∼ 32.6 1.24 ↔↑↑↑ CPI 30.94 ∼ 31.28 1.1 ↑↑↑↑
32.2−32.6 30.91−31.28
FFR 3.48 ∼ 3.9 12.07 ↓↓↓↑ INFL 1M 1.64 ∼ 1.1 -33.1 ↓↓↑↓
3.36−3.9 0.98−1.64
INFL 1Y 1.31 ∼ 1.67 27.5 ···↑ RGDPPC 19215.0 ∼ 20002.0 4.1 ↑↑↔↑
1.31−1.67 19215.0−20002.0
SP500 75.02 ∼ 84.75 12.97 ↑↑↑↑ TAX Y HB 77.0 ∼ 70.0 -9.09 ···↓
75.02−86.28 70.0−77.0
TAX Y LB 16.0 ∼ 14.0 -12.5 ···↓ UNEMP US 5.6 ∼ 4.8 -14.29 ↓↓↑↓
14.0−16.0 4.8−5.6
Table B.5: -

B.6 1965

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 32.6 ∼ 33.0 1.23 ↑↔↑↑ CPI 31.28 ∼ 31.88 1.92 ↑↑↑↑
32.6−33.0 31.28−31.88
FFR 3.9 ∼ 4.42 13.33 ↑↔↓↑ INFL 1M 1.1 ∼ 1.92 74.55 ↑↑↓↑
3.9−4.42 1.1−1.93
INFL 1Y 1.67 ∼ 2.99 79.27 ···↑ RGDPPC 20002.0 ∼ 21444.0 7.21 ↑↑↑↑
1.67−2.99 20002.0−21444.0
SP500 84.75 ∼ 92.43 9.06 ↑↓↑↑ TAX Y HB 70.0 ∼ 70.0 0.0 ···↔
81.6−92.63 70.0−70.0
TAX Y LB 14.0 ∼ 14.0 0.0 ···↔ UNEMP US 4.8 ∼ 4.0 -16.67 ↑↓↓↓
14.0−14.0 4.0−5.0

Table B.6: -

B.7 1966

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 33.0 ∼ 34.2 3.64 ↑↑↑↑ CPI 31.88 ∼ 32.9 3.2 ↑↑↑↑
33.0−34.2 31.88−32.92
FFR 4.42 ∼ 4.94 11.76 ↑↑↑↓ INFL 1M 1.92 ∼ 3.2 66.8 ↑↓↑↓
4.42−5.76 1.92−3.79
INFL 1Y 2.99 ∼ 2.78 -7.2 ···↓ RGDPPC 21444.0 ∼ 21827.0 1.79 ↑↑↑↑
2.78−2.99 21444.0−21827.0
SP500 92.43 ∼ 80.33 -13.09 ↓↓↓↑ TAX Y HB 70.0 ∼ 70.0 0.0 ···↔
73.2−94.06 70.0−70.0
TAX Y LB 14.0 ∼ 14.0 0.0 ···↔ UNEMP US 4.0 ∼ 3.7 -7.5 ↓↑↔↓
14.0−14.0 3.7−4.0
Table B.7: -

B.8 1967

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 34.2 ∼ 35.5 3.8 ↑↑↑↑ CPI 32.9 ∼ 34.1 3.65 ↑↑↑↑
34.2−35.5 32.9−34.1
FFR 4.94 ∼ 4.6 -6.88 ↓↓↑↑ INFL 1M 3.2 ∼ 3.65 14.0 ↓↑↓↑
3.79−5.0 2.32−3.65
INFL 1Y 2.78 ∼ 4.22 51.96 ···↑ RGDPPC 21827.0 ∼ 22432.0 2.77 ↓↑↑↑
2.78−4.22 21792.0−22432.0
SP500 80.38 ∼ 96.47 20.02 ↑↑↑↑ TAX Y HB 70.0 ∼ 75.25 7.5 ···↑
80.38−97.59 70.0−75.25
TAX Y LB 14.0 ∼ 14.0 0.0 ···↔ UNEMP US 3.7 ∼ 3.5 -5.41 ↔↑↑↓
14.0−14.0 3.5−4.3
Table B.8: -

B.9 1968

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 35.5 ∼ 37.3 5.07 ↑↑↑↑ CPI 34.1 ∼ 35.7 4.69 ↑↑↑↑
35.5−37.3 34.1−35.7
FFR 4.6 ∼ 6.3 36.96 ↑↑↓↑ INFL 1M 3.65 ∼ 4.69 28.64 ↑↑↑↓
4.6−6.3 3.64−4.75
INFL 1Y 4.22 ∼ 5.41 28.38 ···↑ RGDPPC 22432.0 ∼ 23209.0 3.46 ↑↑↑↑
4.22−5.41 22432.0−23209.0
SP500 96.11 ∼ 103.86 8.06 ↓↑↑↑ TAX Y HB 75.25 ∼ 77.0 2.33 ···↑
87.72−108.37 75.25−77.0
TAX Y LB 14.0 ∼ 14.0 0.0 ···↔ UNEMP US 3.5 ∼ 3.3 -5.71 ↔↑↓↓
14.0−14.0 3.3−3.8
Table B.9: -

B.10 1969

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 37.3 ∼ 39.6 6.17 ↑↑↑↑ CPI 35.7 ∼ 37.9 6.16 ↑↑↑↑
37.3−39.6 35.7−37.9
FFR 6.3 ∼ 8.98 42.54 ↑↑↑↓ INFL 1M 4.69 ∼ 6.16 31.34 ↑↓↑↑
6.3−9.19 4.68−6.16
INFL 1Y 5.41 ∼ 5.9 8.88 ···↑ RGDPPC 23209.0 ∼ 23043.0 -0.72 ↑↑↓↓
5.41−5.9 23043.0−23310.0
SP500 103.86 ∼ 92.06 -11.36 ↓↓↓↓ TAX Y HB 77.0 ∼ 71.75 -6.82 ···↓
89.2−106.16 71.75−77.0
TAX Y LB 14.0 ∼ 14.0 0.0 ···↔ UNEMP US 3.3 ∼ 3.9 18.18 ↑↑↑↔
14.0−14.0 3.3−4.0
Table B.10: -

B.11 1970

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 39.6 ∼ 42.1 6.31 ↑↑↑↑ CPI 37.9 ∼ 39.9 5.28 ↑↑↑↑
39.6−42.1 37.9−39.9
FFR 8.98 ∼ 4.14 -53.9 ↓↓↓↓ INFL 1M 6.16 ∼ 5.28 -14.37 ↓↓↓↓
4.14−8.98 5.28−6.42
INFL 1Y 5.9 ∼ 4.26 -27.81 ···↓ RGDPPC 23043.0 ∼ 23359.0 1.37 ↓↑↓↑
4.26−5.9 22820.0−23359.0
SP500 92.06 ∼ 92.15 0.1 ↓↓↑↑ TAX Y HB 71.75 ∼ 70.0 -2.44 ···↓
69.29−93.46 70.0−71.75
TAX Y LB 14.0 ∼ 14.0 0.0 ···↔ UNEMP US 3.9 ∼ 6.0 53.85 ↑↑↑↑
14.0−14.0 3.9−6.0
Table B.11: -

B.12 1971

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 42.1 ∼ 43.5 3.33 ↑↑↑↑ CPI 39.9 ∼ 41.2 3.26 ↑↑↑↑
42.1−43.5 39.9−41.2
FFR 4.14 ∼ 3.5 -15.46 ↑↑↓↓ INFL 1M 5.28 ∼ 3.26 -38.26 ↓↑↓↓
3.5−5.56 3.26−5.28
INFL 1Y 4.26 ∼ 3.31 -22.33 ···↓ RGDPPC 23359.0 ∼ 23893.0 2.29 ↑↑↓↑
3.31−4.26 23359.0−23893.0
SP500 92.15 ∼ 102.09 10.79 ↑↓↓↑ TAX Y HB 70.0 ∼ 70.0 0.0 ···↔
90.16−104.77 70.0−70.0
TAX Y LB 14.0 ∼ 14.0 0.0 ···↔ UNEMP US 6.0 ∼ 5.9 -1.67 ↑↓↔↑
14.0−14.0 5.6−6.2
Table B.12: -

B.13 1972

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 43.5 ∼ 44.6 2.53 ↑↑↑↑ CPI 41.2 ∼ 42.7 3.64 ↑↑↑↑
43.5−44.6 41.2−42.7

FFR 3.5 ∼ 5.94 69.71 ↑↑↑↑ INFL 1M 3.26 ∼ 3.64 11.74 ↑↓↑↑
3.29−5.94 2.95−3.76
INFL 1Y 3.31 ∼ 6.22 88.17 ···↑ RGDPPC 23893.0 ∼ 25449.0 6.51 ↑↑↑↑
3.31−6.22 23893.0−25449.0
SP500 102.09 ∼ 118.05 15.63 ↑↓↑↑ TAX Y HB 70.0 ∼ 70.0 0.0 ···↔
101.67−119.12 70.0−70.0
TAX Y LB 14.0 ∼ 14.0 0.0 ···↔ UNEMP US 5.9 ∼ 5.0 -15.25 ↔↓↔↓
14.0−14.0 5.0−5.9
Table B.13: -

B.14 1973

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 44.6 ∼ 46.9 5.16 ↑↑↑↑ CPI 42.7 ∼ 46.8 9.6 ↑↑↑↑
44.6−46.9 42.7−46.8
FFR 5.94 ∼ 9.65 62.46 ↑↑↓↓ INFL 1M 3.64 ∼ 9.6 163.73 ↑↑↑↑
5.94−10.78 3.64−9.6
INFL 1Y 6.22 ∼ 11.04 77.41 ···↑ RGDPPC 25449.0 ∼ 25387.0 -0.24 ↑↓↑↓
6.22−11.04 25387.0−25680.0
SP500 119.1 ∼ 97.55 -18.09 ↓↓↑↓ TAX Y HB 70.0 ∼ 70.0 0.0 ···↔
92.16−120.24 70.0−70.0
TAX Y LB 14.0 ∼ 14.0 0.0 ···↔ UNEMP US 5.0 ∼ 5.2 4.0 ↔↓↓↑
14.0−14.0 4.5−5.2
Table B.14: -

B.15 1974

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 46.9 ∼ 52.3 11.51 ↑↑↑↑ CPI 46.8 ∼ 52.3 11.75 ↑↑↑↑
46.9−52.3 46.8−52.3
FFR 9.65 ∼ 7.13 -26.11 ↑↑↓↓ INFL 1M 9.6 ∼ 11.75 22.39 ↑↑↑↓
7.13−12.92 9.6−12.2
INFL 1Y 11.04 ∼ 9.13 -17.25 ···↓ RGDPPC 25387.0 ∼ 24574.0 -3.2 ↑↓↓↓
9.13−11.04 24574.0−25399.0
SP500 97.55 ∼ 68.56 -29.72 ↓↓↓↑ TAX Y HB 70.0 ∼ 70.0 0.0 ···↔
62.28−99.8 70.0−70.0
TAX Y LB 14.0 ∼ 14.0 0.0 ···↔ UNEMP US 5.2 ∼ 8.2 57.69 ↓↑↑↑
14.0−14.0 5.0−8.2
Table B.15: -

B.16 1975

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 52.3 ∼ 55.9 6.88 ↑↑↑↑ CPI 52.3 ∼ 55.8 6.69 ↑↑↑↑
52.3−55.9 52.3−55.8
FFR 7.13 ∼ 4.87 -31.7 ↓↑↓↓ INFL 1M 11.75 ∼ 6.69 -43.06 ↓↓↓↓
4.87−7.13 6.69−11.75
INFL 1Y 9.13 ∼ 5.74 -37.18 ···↓ MONEY ONE 286.5 ∼ 296.9 3.63 ↓↑↓↑
5.74−9.13 267.1−296.9
RGDPPC 24574.0 ∼ 25826.0 5.09 ↑↑↑↑ SP500 68.56 ∼ 90.19 31.55 ↑↑↓↑
24574.0−25826.0 68.56−95.61
TAX Y HB 70.0 ∼ 70.0 0.0 ···↔ TAX Y LB 14.0 ∼ 14.0 0.0 ···↔
70.0−70.0 14.0−14.0

UNEMP US 8.2 ∼ 7.8 -4.88 ↑↓↑↓
7.8−9.2
Table B.16: -

B.17 1976

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 55.9 ∼ 59.3 6.08 ↑↑↑↑ CPI 55.8 ∼ 58.7 5.2 ↑↑↑↑
55.9−59.3 55.8−58.7
FFR 4.87 ∼ 4.61 -5.34 ↓↑↓↓ INFL 1M 6.69 ∼ 5.2 -22.34 ↓↓↓↓
4.61−5.48 5.04−6.69
INFL 1Y 5.74 ∼ 6.49 13.06 ···↑ MONEY ONE 302.2 ∼ 315.0 4.24 ↓↓↓↑
5.74−6.49 281.4−315.6
RGDPPC 25826.0 ∼ 26400.0 2.22 ↑↑↑↑ SP500 90.19 ∼ 107.46 19.15 ↑↑↑↑
25826.0−26400.0 90.19−107.83
TAX Y HB 70.0 ∼ 70.0 0.0 ···↔ TAX Y LB 14.0 ∼ 14.0 0.0 ···↔
70.0−70.0 14.0−14.0
UNEMP US 7.8 ∼ 7.3 -6.41 ↓↑↑↓
7.3−8.1
Table B.17: -

B.18 1977

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 59.3 ∼ 63.1 6.41 ↑↑↑↑ CPI 58.7 ∼ 62.7 6.81 ↑↑↑↑
59.3−63.1 58.7−62.7
FFR 4.61 ∼ 6.7 45.34 ↑↑↑↑ INFL 1M 5.2 ∼ 6.81 31.12 ↑↓↓↑
4.61−6.7 5.2−6.95
INFL 1Y 6.49 ∼ 7.65 17.9 ···↑ MONEY ONE 323.3 ∼ 338.5 4.7 ↓↓↓↑
6.49−7.65 302.8−340.9
RGDPPC 26400.0 ∼ 27208.0 3.06 ↑↑↓↑ SP500 107.46 ∼ 95.1 -11.5 ↓↑↓↓
26400.0−27252.0 90.71−107.46
TAX Y HB 70.0 ∼ 70.0 0.0 ···↔ TAX Y LB 14.0 ∼ 14.0 0.0 ···↔
70.0−70.0 14.0−14.0
UNEMP US 7.3 ∼ 6.3 -13.7 ↓↓↑↓
6.3−7.5
Table B.18: -

B.19 1978

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 63.1 ∼ 68.5 8.56 ↑↑↑↑ CPI 62.7 ∼ 68.5 9.25 ↑↑↑↑
63.1−68.5 62.7−68.5
FFR 6.7 ∼ 10.07 50.3 ↑↑↑↑ INFL 1M 6.81 ∼ 9.25 35.75 ↓↑↑↑
6.7−10.07 6.24−9.25
INFL 1Y 7.65 ∼ 11.27 47.32 ···↑ MONEY ONE 349.6 ∼ 377.4 7.95 ↓↑↓↑
7.65−11.27 324.8−377.4
RGDPPC 27208.0 ∼ 28668.0 5.37 ↑↑↑↓ SP500 93.82 ∼ 96.11 2.44 ↓↑↑↓
27208.0−28684.0 86.9−106.99
TAX Y HB 70.0 ∼ 70.0 0.0 ···↔ TAX Y LB 14.0 ∼ 14.0 0.0 ···↔
70.0−70.0 14.0−14.0
UNEMP US 6.3 ∼ 5.8 -7.94 ↓↑↓↔
5.7−6.3

Table B.19: -

B.20 1979

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 68.5 ∼ 76.7 11.97 ↑↑↑↑ CPI 68.5 ∼ 78.0 13.87 ↑↑↑↑
68.5−76.7 68.5−78.0
FFR 10.07 ∼ 13.82 37.24 ↓↑↑↑ INFL 1M 9.25 ∼ 13.87 49.92 ↑↑↑↑
10.01−13.82 9.25−13.87
INFL 1Y 11.27 ∼ 13.51 19.91 ···↑ MONEY ONE 377.4 ∼ 401.0 6.25 ↓↑↑↑
11.27−13.51 347.4−401.0
RGDPPC 28668.0 ∼ 28744.0 0.27 ↓↑↓↑ SP500 96.73 ∼ 107.94 11.59 ↑↑↑↓
28628.0−28747.0 96.13−111.27
TAX Y HB 70.0 ∼ 70.0 0.0 ···↔ TAX Y LB 14.0 ∼ 14.0 0.0 ···↔
70.0−70.0 14.0−14.0
UNEMP US 5.8 ∼ 6.2 6.9 ↔↓↑↑
5.6−6.2
Table B.20: -

B.21 1980

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 76.7 ∼ 85.4 11.34 ↑↑↑↑ CPI 78.0 ∼ 87.2 11.79 ↑↑↑↑
76.7−85.4 78.0−87.2
FFR 13.82 ∼ 19.08 38.06 ↑↓↑↑ INFL 1M 13.87 ∼ 11.79 -14.95 ↑↓↓↓
9.03−19.08 11.79−14.59
INFL 1Y 13.51 ∼ 10.32 -23.64 ···↓ MONEY ONE 401.0 ∼ 415.8 3.69 ↓↓↑↑
10.32−13.51 372.7−421.4
MONEY TWO 1595.2 ∼ 1601.8 0.41 ···↑ RGDPPC 28744.0 ∼ 28926.0 0.63 ↓↓↑↑
1594.8−1606.3 27956.0−28926.0
SP500 107.94 ∼ 135.76 25.77 ↓↑↑↑ TAX Y HB 70.0 ∼ 69.12 -1.25 ···↓
98.22−140.52 69.12−70.0
TAX Y LB 14.0 ∼ 14.0 0.0 ···↔ UNEMP US 6.2 ∼ 7.4 19.35 ↑↑↓↓
14.0−14.0 6.0−7.8
Table B.21: -

B.22 1981

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 85.4 ∼ 93.3 9.25 ↑↑↑↑ CPI 87.2 ∼ 94.4 8.26 ↑↑↑↑
85.4−93.3 87.2−94.4
FFR 19.08 ∼ 13.22 -30.71 ↓↑↓↓ INFL 1M 11.79 ∼ 8.26 -30.0 ↓↑↓↓
12.37−19.1 8.26−11.79
INFL 1Y 10.32 ∼ 6.16 -40.28 ···↓ MONEY ONE 430.1 ∼ 445.9 3.67 ↓↓↓↑
6.16−10.32 402.6−447.0
MONEY TWO 1620.7 ∼ 1758.2 8.48 ↑↓↑↑ RGDPPC 28926.0 ∼ 27952.0 -3.37 ↓↑↓↓
1599.8−1760.2 27952.0−28926.0
SP500 135.76 ∼ 122.55 -9.73 ↑↓↓↑ TAX Y HB 69.12 ∼ 50.0 -27.67 ···↓
112.77−138.12 50.0−69.12
TAX Y LB 14.0 ∼ 12.0 -14.29 ···↓ UNEMP US 7.4 ∼ 8.5 14.86 ↓↓↑↑
12.0−14.0 7.0−8.9

Table B.22: -

B.23 1982

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 93.3 ∼ 97.6 4.61 ↑↑↑↑ CPI 94.4 ∼ 97.9 3.71 ↑↑↑↓
93.3−97.6 94.4−98.1
FFR 13.22 ∼ 8.68 -34.34 ↑↓↓↓ INFL 1M 8.26 ∼ 3.71 -55.1 ↓↓↓↓
8.68−14.94 3.71−8.26
INFL 1Y 6.16 ∼ 3.21 -47.86 ···↓ MONEY ONE 462.5 ∼ 482.9 4.41 ↓↓↓↑
3.21−6.16 428.0−488.0
MONEY TWO 1780.1 ∼ 1925.6 8.17 ↑↑↑↑ RGDPPC 27952.0 ∼ 28136.0 0.66 ↑↓↓↑
1761.3−1925.6 27829.0−28136.0
SP500 122.55 ∼ 140.64 14.76 ↓↓↑↑ TAX Y HB 50.0 ∼ 50.0 0.0 ···↔
102.42−143.02 50.0−50.0
TAX Y LB 12.0 ∼ 11.0 -8.33 ···↓ UNEMP US 8.5 ∼ 10.4 22.35 ↑↑↑↔
11.0−12.0 8.5−10.8
Table B.23: -

B.24 1983

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 97.6 ∼ 102.5 5.02 ↑↑↑↑ CPI 97.9 ∼ 102.1 4.29 ↑↑↑↑
97.6−102.5 97.9−102.1
FFR 8.68 ∼ 9.56 10.14 ↑↑↑↑ INFL 1M 3.71 ∼ 4.29 15.71 ↑↓↑↑
8.51−9.56 2.36−4.29
INFL 1Y 3.21 ∼ 4.32 34.39 ···↑ MONEY ONE 493.2 ∼ 529.4 7.34 ↓↓↓↑
3.21−4.32 467.5−533.2
MONEY TWO 1949.5 ∼ 2131.3 9.33 ↑↑↑↑ RGDPPC 28136.0 ∼ 30275.0 7.6 ↑↑↑↑
1949.5−2137.0 28136.0−30275.0
SP500 140.64 ∼ 164.93 17.27 ↑↑↓↓ TAX Y HB 50.0 ∼ 50.0 0.0 ···↔
138.34−172.65 50.0−50.0
TAX Y LB 11.0 ∼ 11.0 0.0 ···↔ UNEMP US 10.4 ∼ 8.0 -23.08 ↓↓↓↓
11.0−11.0 8.0−10.4
Table B.24: -

B.25 1984

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 102.5 ∼ 107.1 4.49 ↑↑↑↑ CPI 102.1 ∼ 105.7 3.53 ↑↑↑↑
102.5−107.1 102.1−105.7
FFR 9.56 ∼ 8.35 -12.66 ↑↑↓↓ INFL 1M 4.29 ∼ 3.53 -17.81 ↑↓↓↓
8.35−11.64 3.53−4.89
INFL 1Y 4.32 ∼ 3.56 -17.51 ···↓ MONEY ONE 541.3 ∼ 572.6 5.78 ↓↑↓↑
3.56−4.32 508.8−572.6
MONEY TWO 2150.5 ∼ 2335.9 8.62 ↑↑↑↑ RGDPPC 30275.0 ∼ 31394.0 3.7 ↑↑↑↑
2125.5−2335.9 30275.0−31394.0
SP500 164.04 ∼ 167.24 1.95 ↓↓↑↑ TAX Y HB 50.0 ∼ 50.0 0.0 ···↔
147.82−170.41 50.0−50.0
TAX Y LB 11.0 ∼ 11.0 0.0 ···↔ UNEMP US 8.0 ∼ 7.4 -7.5 ↓↓↓↔
11.0−11.0 7.1−8.0

Table B.25: -

B.26 1985

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 107.1 ∼ 111.9 4.48 ↑↑↑↑ CPI 105.7 ∼ 109.9 3.97 ↑↑↑↑
107.1−111.9 105.7−109.9
FFR 8.35 ∼ 8.14 -2.51 ↓↓↑↑ INFL 1M 3.53 ∼ 3.97 12.69 ↑↓↓↑
7.53−8.58 3.24−3.97
INFL 1Y 3.56 ∼ 1.86 -47.8 ···↓ MONEY ONE 572.6 ∼ 635.4 10.97 ↓↑↑↑
1.86−3.56 542.6−635.4
MONEY TWO 2335.9 ∼ 2509.0 7.41 ↑↑↑↑ RGDPPC 31394.0 ∼ 32418.0 3.26 ↑↑↑↑
2327.6−2509.0 31394.0−32418.0
SP500 167.24 ∼ 211.28 26.33 ↑↑↓↑ TAX Y HB 50.0 ∼ 50.0 0.0 ···↔
163.68−212.02 50.0−50.0
TAX Y LB 11.0 ∼ 11.0 0.0 ···↔ UNEMP US 7.4 ∼ 6.7 -9.46 ↓↔↓↓
11.0−11.0 6.7−7.4
Table B.26: -

B.27 1986

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 111.9 ∼ 116.1 3.75 ↑↑↑↑ CPI 109.9 ∼ 111.4 1.36 ↓↑↑↑
111.9−116.1 108.7−111.4
FFR 8.14 ∼ 6.43 -21.01 ↓↓↓↑ INFL 1M 3.97 ∼ 1.36 -65.65 ↓↑↓↓
5.85−8.14 1.19−3.97
INFL 1Y 1.86 ∼ 3.74 101.26 ···↑ MONEY ONE 654.8 ∼ 750.3 14.58 ↓↑↓↑
1.86−3.74 607.5−750.3
MONEY TWO 2539.7 ∼ 2755.8 8.51 ↓↑↑↑ RGDPPC 32418.0 ∼ 33001.0 1.8 ↑↑↑↑
2495.8−2755.8 32418.0−33001.0
SP500 211.28 ∼ 242.17 14.62 ↑↑↓↑ TAX Y HB 50.0 ∼ 38.5 -23.0 ···↓
203.49−254.0 38.5−50.0
TAX Y LB 11.0 ∼ 11.0 0.0 ···↔ UNEMP US 6.7 ∼ 6.7 0.0 ↑↓↑↓
11.0−11.0 6.7−7.3
Table B.27: -

B.28 1987

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 116.1 ∼ 121.1 4.31 ↑↑↑↑ CPI 111.4 ∼ 116.0 4.13 ↑↑↑↑
116.1−121.1 111.4−116.0
FFR 6.43 ∼ 6.83 6.22 ↓↑↑↓ INFL 1M 1.36 ∼ 4.13 202.54 ↑↑↑↓
6.1−7.29 1.36−4.53
INFL 1Y 3.74 ∼ 4.01 7.17 ···↑ MONEY ONE 780.7 ∼ 765.0 -2.01 ↓↓↓↑
3.74−4.01 712.0−780.7
MONEY TWO 2798.6 ∼ 2840.4 1.49 ↓↓↓↑ RGDPPC 33001.0 ∼ 34112.0 3.37 ↑↑↑↑
2732.0−2844.4 33001.0−34112.0
SP500 242.17 ∼ 247.08 2.03 ↑↑↑↓ TAX Y HB 38.5 ∼ 28.0 -27.27 ···↓
223.92−336.77 28.0−38.5
TAX Y LB 11.0 ∼ 15.0 36.36 ···↑ UNEMP US 6.7 ∼ 5.8 -13.43 ↓↓↔↓
11.0−15.0 5.8−6.7

Table B.28: -

B.29 1988

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 121.1 ∼ 126.7 4.62 ↑↑↑↑ CPI 116.0 ∼ 121.2 4.48 ↑↑↑↑
121.1−126.7 116.0−121.2
FFR 6.83 ∼ 9.12 33.53 ↑↑↑↑ INFL 1M 4.13 ∼ 4.48 8.56 ↓↑↑↑
6.58−9.12 3.83−4.48
INFL 1Y 4.01 ∼ 4.83 20.4 ···↑ MONEY ONE 787.1 ∼ 797.7 1.35 ↓↓↓↑
4.01−4.83 735.8−801.0
MONEY TWO 2869.7 ∼ 2997.9 4.47 ↑↓↓↑ RGDPPC 34112.0 ∼ 35253.0 3.34 ↑↑↑↑
2849.4−3007.3 34112.0−35253.0
SP500 247.08 ∼ 277.72 12.4 ↑↑↑↑ TAX Y HB 28.0 ∼ 28.0 0.0 ···↔
242.63−283.66 28.0−28.0
TAX Y LB 15.0 ∼ 15.0 0.0 ···↔ UNEMP US 5.8 ∼ 5.4 -6.9 ↓↔↓↑
15.0−15.0 5.3−5.8
Table B.29: -

B.30 1989

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 126.7 ∼ 132.3 4.42 ↑↑↑↑ CPI 121.2 ∼ 127.5 5.2 ↑↑↑↑
126.7−132.3 121.2−127.5
FFR 9.12 ∼ 8.23 -9.76 ↑↓↓↓ INFL 1M 4.48 ∼ 5.2 15.96 ↑↑↓↑
8.23−9.85 4.44−5.28
INFL 1Y 4.83 ∼ 5.4 11.83 ···↑ MONEY ONE 816.3 ∼ 825.0 1.07 ↓↓↓↑
4.83−5.4 754.6−825.0
MONEY TWO 3022.3 ∼ 3187.7 5.47 ↓↓↑↑ RGDPPC 35253.0 ∼ 35903.0 1.84 ↑↑↓↑
2971.9−3187.7 35253.0−35903.0
SP500 275.31 ∼ 353.4 28.36 ↑↑↑↑ TAX Y HB 28.0 ∼ 28.0 0.0 ···↔
275.31−359.8 28.0−28.0
TAX Y LB 15.0 ∼ 15.0 0.0 ···↔ UNEMP US 5.4 ∼ 5.3 -1.85 ↓↓↑↔
15.0−15.0 5.0−5.4
Table B.30: -

B.31 1990

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

CCPI 132.3 ∼ 139.7 5.59 ↑↑↑↑ CPI 127.5 ∼ 134.7 5.65 ↑↑↑↑
132.3−139.7 127.5−134.7
FFR 8.23 ∼ 6.91 -16.04 ↑↓↓↓ INFL 1M 5.2 ∼ 5.65 8.64 ↓↑↑↓
6.91−8.29 4.37−6.38
INFL 1Y 5.4 ∼ 4.23 -21.55 ···↓ MONEY ONE 825.0 ∼ 852.4 3.32 ↓↓↓↑
4.23−5.4 773.1−852.4
MONEY TWO 3187.7 ∼ 3297.3 3.44 ↑↓↑↑ RGDPPC 35903.0 ∼ 35107.0 -2.22 ↑↓↓↓
3147.7−3297.3 35107.0−35930.0
SP500 359.69 ∼ 330.22 -8.19 ↓↑↓↑ TAX Y HB 28.0 ∼ 31.0 10.71 ···↑
295.46−368.95 28.0−31.0
TAX Y LB 15.0 ∼ 15.0 0.0 ···↔ UNEMP US 5.3 ∼ 6.2 16.98 ↑↑↑↑
15.0−15.0 5.2−6.2

Table B.31: -

B.32 1991

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

BOP −2031.0 ∼ −2031.0 0.0 · · ·· CCPI 139.7 ∼ 145.3 4.01 ↑↑↑↑


−2031.0−−2031.0 139.7−145.3
CPI 134.7 ∼ 138.3 2.67 ↑↑↑↑ FFR 6.91 ∼ 4.03 -41.68 ↓↓↓↓
134.7−138.3 4.03−6.91
INFL 1M 5.65 ∼ 2.67 -52.67 ↓↓↓↓ INFL 1Y 4.23 ∼ 3.03 -28.48 ···↓
2.67−5.65 3.03−4.23
MONEY ONE 852.4 ∼ 919.1 7.82 ↓↑↓↑ MONEY TWO 3297.3 ∼ 3381.4 2.55 ↑↑↓↑
806.9−919.1 3272.6−3400.7
RGDPPC 35107.0 ∼ 35656.0 1.56 ↑↑↑↑ SP500 330.22 ∼ 417.09 26.31 ↑↑↑↑
35107.0−35656.0 311.49−417.09
TAX Y HB 31.0 ∼ 31.0 0.0 ···↔ TAX Y LB 15.0 ∼ 15.0 0.0 ···↔
31.0−31.0 15.0−15.0
UNEMP US 6.2 ∼ 7.1 14.52 ↑↑↔↑
6.2−7.1
Table B.32: -

B.33 1992

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

BOP −2031.0 ∼ −3998.0 96.85 ↓↓↓↓ CCPI 145.3 ∼ 150.3 3.44 ↑↑↑↑
−4482.0−35.0 145.3−150.3
CPI 138.3 ∼ 142.8 3.25 ↑↑↑↑ FFR 4.03 ∼ 3.02 -25.06 ↓↓↓↓
138.3−142.8 2.92−4.06
INFL 1M 2.67 ∼ 3.25 21.75 ↑↓↑↓ INFL 1Y 3.03 ∼ 2.95 -2.55 ···↓
2.67−3.28 2.95−3.03
MONEY ONE 941.6 ∼ 1048.6 11.36 ↓↓↓↑ MONEY TWO 3403.9 ∼ 3437.2 0.98 ↓↓↓↑
897.2−1048.6 3368.2−3457.1
RGDPPC 35656.0 ∼ 36342.0 1.92 ↑↑↑↓ SP500 417.09 ∼ 435.71 4.46 ↓↑↑↑
35656.0−36381.0 394.5−441.28
TAX Y HB 31.0 ∼ 39.6 27.74 ···↑ TAX Y LB 15.0 ∼ 15.0 0.0 ···↔
31.0−39.6 15.0−15.0
UNEMP US 7.1 ∼ 7.1 0.0 ↑↑↓↓
7.1−7.8
Table B.33: -

B.34 1993

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

BOP −3998.0 ∼ −7201.0 80.12 ↓↓↓↑ CCPI 150.3 ∼ 154.7 2.93 ↑↑↑↑
−7677.0−−3409.0 150.3−154.7
CPI 142.8 ∼ 146.3 2.45 ↑↑↑↑ FFR 3.02 ∼ 3.05 0.99 ↓↑↓↑
142.8−146.3 2.96−3.09
INFL 1M 3.25 ∼ 2.45 -24.67 ↓↓↓↓ INFL 1Y 2.95 ∼ 2.61 -11.66 ···↓
2.45−3.25 2.61−2.95
MONEY ONE 1071.1 ∼ 1154.6 7.8 ↓↓↓↑ MONEY TWO 3462.0 ∼ 3494.8 0.95 ↓↓↓↑
1014.2−1154.6 3392.7−3510.1

RGDPPC 36342.0 ∼ 37131.0 2.17 ↑↑↑↑ SP500 435.71 ∼ 466.45 7.06 ↑↓↑↑
36342.0−37131.0 429.05−470.94
TAX Y HB 39.6 ∼ 39.6 0.0 ···↔ TAX Y LB 15.0 ∼ 15.0 0.0 ···↔
39.6−39.6 15.0−15.0
UNEMP US 7.1 ∼ 6.7 -5.63 ↓↓↔↓
6.4−7.1
Table B.34: -

B.35 1994

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

BOP −7201.0 ∼ −10660.0 48.03 ↓↓↑↓ CCPI 154.7 ∼ 159.1 2.84 ↑↑↑↑
−10660.0−−6096.0 154.7−159.1
CPI 146.3 ∼ 150.5 2.87 ↑↑↑↑ FFR 3.05 ∼ 5.53 81.31 ↑↑↑↑
146.3−150.5 3.05−5.53
INFL 1M 2.45 ∼ 2.87 17.13 ↓↑↓↑ INFL 1Y 2.61 ∼ 2.81 7.59 ···↑
2.29−2.97 2.61−2.81
MONEY ONE 1172.6 ∼ 1171.0 -0.14 ↓↓↓↑ MONEY TWO 3511.7 ∼ 3508.5 -0.09 ↓↓↓↑
1109.6−1172.6 3447.2−3543.8
RGDPPC 37131.0 ∼ 37967.0 2.25 ↑↑↑↑ SP500 466.45 ∼ 459.27 -1.54 ↓↑↑↓
37131.0−37967.0 438.92−482.0
TAX Y HB 39.6 ∼ 39.6 0.0 ···↔ TAX Y LB 15.0 ∼ 15.0 0.0 ···↔
39.6−39.6 15.0−15.0
UNEMP US 6.7 ∼ 5.7 -14.93 ↓↓↓↓
5.4−6.7
Table B.35: -

B.36 1995

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

BOP −10660.0 ∼ −9568.0 -10.24 ↓↑↑↓ CCPI 159.1 ∼ 163.8 2.95 ↑↑↑↑
−10719.0−−6098.0 159.1−163.8
CPI 150.5 ∼ 154.7 2.79 ↑↑↑↑ FFR 5.53 ∼ 5.56 0.54 ↑↓↓↓
150.5−154.7 5.53−6.05
INFL 1M 2.87 ∼ 2.79 -2.79 ↑↓↓↑ INFL 1Y 2.81 ∼ 2.93 4.48 ···↑
2.53−3.13 2.81−2.93
MONEY ONE 1183.1 ∼ 1165.1 -1.52 ↓↓↓↑ MONEY TWO 3517.9 ∼ 3671.9 4.38 ↓↑↑↑
1116.6−1193.1 3464.5−3671.9
RGDPPC 37967.0 ∼ 38503.0 1.41 ↑↑↑↑ SP500 459.11 ∼ 615.93 34.16 ↑↑↑↑
37967.0−38503.0 459.11−621.69
TAX Y HB 39.6 ∼ 39.6 0.0 ···↔ TAX Y LB 15.0 ∼ 15.0 0.0 ···↔
39.6−39.6 15.0−15.0
UNEMP US 5.7 ∼ 5.8 1.75 ↑↓↓↑
5.4−5.8
Table B.36: -

B.37 1996

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

BOP −9568.0 ∼ −12706.0 32.8 ↑↓↑↓ CCPI 163.8 ∼ 167.9 2.5 ↑↑↑↑
−12706.0−−6784.0 163.8−167.9
CPI 154.7 ∼ 159.4 3.04 ↑↑↑↑ FFR 5.56 ∼ 5.25 -5.58 ↓↑↓↑
154.7−159.4 5.22−5.56

INFL 1M 2.79 ∼ 3.04 8.87 ↑↑↑↓ INFL 1Y 2.93 ∼ 2.34 -20.25 ···↓
2.72−3.38 2.34−2.93
MONEY ONE 1165.1 ∼ 1112.4 -4.52 ↓↓↓↑ MONEY TWO 3671.9 ∼ 3830.0 4.31 ↑↑↑↑
1064.5−1165.1 3618.0−3837.1
RGDPPC 38503.0 ∼ 39782.0 3.32 ↑↑↑↑ SP500 620.73 ∼ 740.74 19.33 ↑↑↑↑
38503.0−39782.0 598.48−757.03
TAX Y HB 39.6 ∼ 39.6 0.0 ···↔ TAX Y LB 15.0 ∼ 15.0 0.0 ···↔
39.6−39.6 15.0−15.0
UNEMP US 5.8 ∼ 5.4 -6.9 ↓↔↓↑
5.1−5.8
Table B.37: -

B.38 1997

CODE S∼E % YoY − − −− CODE S∼E % YoY − − −−


range range

BOP −12706.0 ∼ −12043.0 -5.22 ↑↓↑↓ CCPI 167.9 ∼ 171.7 2.26 ↑↑↑↑
−12706.0−−8037.0 167.9−171.7
CPI 159.4 ∼ 162.0 1.63 ↑↑↑↑ FFR 5.25 ∼ 5.56 5.9 ↑↑↓↑
159.4−162.0 5.19−5.56
INFL 1M 3.04 ∼ 1.63 -46.31 ↓↓↓↓ INFL 1Y 2.34 ∼ 1.55 -33.6 ···↓
1.63−3.04 1.55−2.34
MONEY ONE 1125.3 ∼ 1114.2 -0.99 ↓↓↓↑ MONEY TWO 3867.4 ∼ 4051.9 4.77 ↑↓↓↑
1043.8−1125.3 3796.9−4051.9
RGDPPC 39782.0 ∼ 41131.0 3.39 ↑↑↑↑ SP500 740.74 ∼ 970.43 31.01 ↑↑↑↑
39782.0−41131.0 737.01−983.79
TAX Y HB 39.6 ∼ 39.6 0.0 ···↔ TAX Y LB 15.0 ∼ 15.0 0.0 ···↔
39.6−39.6 15.0−15.0
UNEMP US 5.4 ∼ 4.7 -12.96 ↓↓↓↔
4.6−5.4
Table B.38: -

B.39 1998

CODE S∼E % YoY − − −− CODE S∼E % YoY −−


range range

BOP −12043.0 ∼ −16990.0 41.08 ↓↑↓↓ CCPI 171.7 ∼ 175.7 2.33 ↑↑


−16990.0−−12043.0 171.7−175.7
CPI 162.0 ∼ 164.7 1.67 ↑↑↑↑ CURACC BAL −62666.0 ∼ −62666.0 0.0 ·
162.0−164.7 −62666.0−−62666.0
FFR 5.56 ∼ 4.63 -16.73 ↓↑↓↓ INFL 1M 1.63 ∼ 1.67 2.18 ↓↑
4.63−5.56 1.38−1.75
INFL 1Y 1.55 ∼ 2.19 40.96 ···↑ MONEY ONE 1133.5 ∼ 1135.7 0.19 ↓↓
1.55−2.19 1057.5−1135.7
MONEY TWO 4097.5 ∼ 4401.6 7.42 ↑↓↓↑ RGDPPC 41131.0 ∼ 42674.0 3.75 ↑↑
4026.3−4406.6 41131.0−42674.0
SP500 970.43 ∼ 1229.23 26.67 ↑↑↓↑ TAX Y HB 39.6 ∼ 39.6 0.0 ··
927.69−1241.81 39.6−39.6
TAX Y LB 15.0 ∼ 15.0 0.0 ···↔ UNEMP US 4.7 ∼ 4.3 -8.51 ↓↑
15.0−15.0 4.3−4.7
Table B.39: -

B.40 1999

CODE S∼E % YoY − − −− CODE S∼E % YoY −−


range range

BOP −16990.0 ∼ −28003.0 64.82 ↓↓↓↓ CCPI 175.7 ∼ 179.2 1.99 ↑↑
−28003.0−−16990.0 175.7−179.2
CPI 164.7 ∼ 169.3 2.79 ↑↑↑↑ CURACC BAL −62666.0 ∼ −96811.0 54.49 ↓↓
164.7−169.3 −96811.0−−62666.0
FFR 4.63 ∼ 5.45 17.71 ↑↑↑↑ INFL 1M 1.67 ∼ 2.79 67.58 ↑↓
4.63−5.45 1.67−2.79
INFL 1Y 2.19 ∼ 3.38 54.33 ···↑ MONEY ONE 1165.2 ∼ 1172.1 0.59 ↓↓
2.19−3.38 1077.2−1172.1
MONEY TWO 4443.2 ∼ 4677.4 5.27 ↑↓↓↑ RGDPPC 42674.0 ∼ 43957.0 3.01 ↑↑
4379.9−4679.7 42674.0−43957.0
SP500 1229.23 ∼ 1469.25 19.53 ↑↑↓↑ TAX Y HB 39.6 ∼ 39.6 0.0 ··
1212.19−1469.25 39.6−39.6
TAX Y LB 15.0 ∼ 15.0 0.0 ···↔ UNEMP US 4.3 ∼ 4.0 -6.98 ↔↔
15.0−15.0 4.0−4.4
Table B.40: -

B.41 2000

CODE S∼E % YoY − − −− CODE S∼E % YoY −


range range

BOP −28003.0 ∼ −33262.0 18.78 ↓↓↓↓ CCPI 179.2 ∼ 184.1 2.73 ↑


−34263.0−−28003.0 179.2−184.1
CPI 169.3 ∼ 175.6 3.72 ↑↑↑↑ CURACC BAL −96811.0 ∼ −106879.0 10.4 ↓
169.3−175.6 −108464.0−−96811.0
FFR 5.45 ∼ 5.98 9.72 ↑↑↓↓ INFL 1M 2.79 ∼ 3.72 33.24 ↑
5.45−6.54 2.79−3.76
INFL 1Y 3.38 ∼ 2.83 -16.31 ···↓ MONEY ONE 1198.8 ∼ 1152.5 -3.86 ↓
2.83−3.38 1074.8−1198.8
MONEY TWO 4706.5 ∼ 4996.0 6.15 ↑↓↑↑ RGDPPC 43957.0 ∼ 44500.0 1.24 ↑
4647.8−4996.0 43957.0−44731.0
SP500 1469.25 ∼ 1320.28 -10.14 ↑↓↓↓ TAX Y HB 39.6 ∼ 39.1 -1.26 ·
1264.74−1527.46 39.1−39.6
TAX Y LB 15.0 ∼ 10.0 -33.33 ···↓ UNEMP US 4.0 ∼ 4.2 5.0 ↓
10.0−15.0 3.9−4.2
Table B.41: -

B.42 2001

CODE S∼E % YoY − − −− CODE S∼E % YoY −


range range

BOP −33262.0 ∼ −28518.0 -14.26 ↑↑↓↑ CCPI 184.1 ∼ 188.6 2.44


−33262.0−−18692.0 184.1−188.6
CPI 175.6 ∼ 177.7 1.2 ↑↑↑↑ CURACC BAL −106879.0 ∼ −103992.0 -2.7
175.6−178.1 −106879.0−−87840.0
FFR 5.98 ∼ 1.73 -71.07 ↓↓↓↓ INFL 1M 3.72 ∼ 1.2 -67.86
1.73−5.98 1.2−3.72
INFL 1Y 2.83 ∼ 1.59 -43.88 ···↓ MONEY ONE 1152.5 ∼ 1247.7 8.26
1.59−2.83 1072.6−1266.6
MONEY TWO 4996.0 ∼ 5492.9 9.95 ↑↓↑↑ RGDPPC 44500.0 ∼ 44695.0 0.44
4939.4−5497.3 44375.0−44695.0
SP500 1283.27 ∼ 1148.08 -10.53 ↓↑↓↑ TAX Y HB 39.1 ∼ 38.6 -1.28
965.8−1373.73 38.6−39.1
TAX Y LB 10.0 ∼ 10.0 0.0 ···↔ UNEMP US 4.2 ∼ 5.6 33.33
10.0−10.0 4.2−5.8
Table B.42: -

B.43 2002

CODE S∼E % YoY − − −− CODE S∼E % YoY −


range range

BOP −28518.0 ∼ −41115.0 44.17 ↓↑↓↓ CCPI 188.6 ∼ 192.3 1.96


−44242.0−−28518.0 188.6−192.3
CPI 177.7 ∼ 182.6 2.76 ↑↑↑↑ CURACC BAL −103992.0 ∼ −134986.0 29.8
177.7−182.6 −134986.0−−103992.0
FFR 1.73 ∼ 1.24 -28.32 ↑↓↑↓ INFL 1M 1.2 ∼ 2.76 130.58
1.24−1.75 1.07−2.76
INFL 1Y 1.59 ∼ 2.27 43.13 ···↑ MONEY ONE 1247.7 ∼ 1289.0 3.31
1.59−2.27 1152.6−1289.0
MONEY TWO 5492.9 ∼ 5831.4 6.16 ↑↑↑↑ RGDPPC 44695.0 ∼ 44987.0 0.65
5424.2−5854.1 44695.0−44987.0
SP500 1148.08 ∼ 879.82 -23.37 ↓↓↓↑ TAX Y HB 38.6 ∼ 35.0 -9.33
776.76−1172.51 35.0−38.6
TAX Y LB 10.0 ∼ 10.0 0.0 ···↔ UNEMP US 5.6 ∼ 5.7 1.79
10.0−10.0 5.5−6.0
Table B.43: -

B.44 2003

CODE S∼E % YoY − − −− CODE S∼E % YoY −−−


range range

BOP −41115.0 ∼ −43057.0 4.72 ↓↑↓↓ BREAK 10 1.64 ∼ nan nan ↑↓↑↔
−43456.0−−38010.0 nan−nan
CCPI 192.3 ∼ 194.4 1.09 ↑↑↑↑ CPI 182.6 ∼ 186.3 2.03 ↑↑↑↑
192.3−194.4 182.6−186.3
CURACC BAL −134986.0 ∼ −137586.0 1.93 ↑↓↑↓ FFR 1.24 ∼ 1.0 -19.35 ↑↓↔
−137586.0−−126719.0 0.98−1.26
INFL 1M 2.76 ∼ 2.03 -26.52 ↓↓↓↓ INFL 1Y 2.27 ∼ 2.68 17.94 ···↑
1.89−3.15 2.27−2.68
MONEY ONE 1236.6 ∼ 1371.0 10.87 ↑↑↑↑ MONEY TWO 5872.3 ∼ 6083.7 3.6 ↑↑↓↑
1192.1−1371.0 5786.4−6155.9
RGDPPC 44987.0 ∼ 46560.0 3.5 ↑↑↑↑ SP500 879.82 ∼ 1111.92 26.38 ↓↑↑↑
44987.0−46560.0 800.73−1111.92
TAX Y HB 35.0 ∼ 35.0 0.0 ···↔ TAX Y LB 10.0 ∼ 10.0 0.0 ···↔
35.0−35.0 10.0−10.0
UNEMP US 5.7 ∼ 5.6 -1.75 ↑↑↓↓
5.6−6.4
Table B.44: -

B.45 2004

CODE S∼E % YoY − − −− CODE S∼E % YoY −


range range

BOP −43057.0 ∼ −58270.0 35.33 ↓↓↓↓ BREAK 10 2.27 ∼ 2.56 12.78 ↔


−60297.0−−42086.0 nan−nan
BREAK 20 2.8 ∼ 2.79 -0.36 ·· ↓↑ CCPI 194.4 ∼ 198.9 2.31
2.72−2.86 194.4−198.9
CPI 186.3 ∼ 191.6 2.84 ↑↑↑↑ CURACC BAL −137586.0 ∼ −169807.0 23.42
186.3−191.7 −179825.0−−137586.0
FFR 1.0 ∼ 2.28 128.0 ↔↑↑↑ GLD 44.38 ∼ 43.8 -1.31
1.0−2.28 43.44−45.6
INFL 1M 2.03 ∼ 2.84 40.4 ↑↑↑↓ INFL 1Y 2.68 ∼ 3.39 26.73
1.69−3.62 2.68−3.39
MONEY ONE 1327.6 ∼ 1457.1 9.75 ↑↑↑↑ MONEY TWO 6117.6 ∼ 6443.5 5.33
1256.9−1457.1 6005.9−6450.2
RGDPPC 46560.0 ∼ 47804.0 2.67 ↑↑↑↑ SP500 1111.92 ∼ 1211.92 8.99
46560.0−47804.0 1063.23−1213.55

TAX Y HB 35.0 ∼ 35.0 0.0 ···↔ TAX Y LB 10.0 ∼ 10.0 0.0 ·
35.0−35.0 10.0−10.0
UNEMP US 5.6 ∼ 5.2 -7.14 ↔↓↔↓
5.2−5.7
Table B.45: -

B.46 2005

CODE S∼E % YoY − − −− CODE S∼E % YoY −


range range

BOP −58270.0 ∼ −68513.0 17.58 ↑↓↓↑ BREAK 10 2.56 ∼ 2.33 -8.98


−68885.0−−54986.0 nan−nan
BREAK 20 2.79 ∼ 2.6 -6.81 ↑↓↑↓ CCPI 198.9 ∼ 203.2 2.16
2.48−2.94 198.9−203.2
CPI 191.6 ∼ 199.3 4.02 ↑↑↑↑ CURACC BAL −169807.0 ∼ −198058.0 16.64
191.6−199.3 −209820.0−−169807.0
FFR 2.28 ∼ 4.29 88.16 ↑↑↑↑ GLD 43.8 ∼ 51.58 17.76
2.28−4.29 41.26−52.56
INFL 1M 2.84 ∼ 4.02 41.26 ↑↓↑↓ INFL 1Y 3.39 ∼ 3.23 -4.92
2.54−4.74 3.23−3.39
MONEY ONE 1442.9 ∼ 1460.0 1.19 ↓↑↓↑ MONEY TWO 6434.1 ∼ 6716.2 4.38
1304.4−1460.0 6335.0−6717.2
RGDPPC 47804.0 ∼ 48857.0 2.2 ↑↑↑↑ SP500 1211.92 ∼ 1248.29 3.0
47804.0−48857.0 1137.5−1272.74
TAX Y HB 35.0 ∼ 35.0 0.0 ···↔ TAX Y LB 10.0 ∼ 10.0 0.0 ·
35.0−35.0 10.0−10.0
UNEMP US 5.2 ∼ 4.7 -9.62 ↔↓↔↓
4.7−5.4
Table B.46: -

B.47 2006

CODE S∼E % YoY − − −− CODE S∼E % YoY −


range range

BOP −68513.0 ∼ −59118.0 -13.71 ↑↓↑↓ BREAK 10 nan ∼ nan nan ↔


−69857.0−−58233.0 nan−nan
BREAK 20 2.6 ∼ 2.53 -2.69 ↑↓↓↓ CCPI 203.2 ∼ 208.63 2.67
2.52−2.87 203.2−208.63
CPI 199.3 ∼ 203.44 2.08 ↑↑↓↑ CURACC BAL −198058.0 ∼ −199093.0 0.52
199.3−203.8 −216063.0−−190233.0
FFR 4.29 ∼ 5.25 22.38 ↑↑↑↔ GLD 53.12 ∼ 63.21 18.99
4.29−5.25 52.34−71.12
INFL 1M 4.02 ∼ 2.08 -48.35 ↓↑↓↑ INFL 1Y 3.23 ∼ 2.85 -11.57
1.41−4.18 2.85−3.23
MONEY ONE 1458.4 ∼ 1445.5 -0.88 ↓↓↓↑ MONEY TWO 6733.9 ∼ 7097.4 5.4
1305.8−1458.4 6642.5−7097.4
RGDPPC 48857.0 ∼ 48994.0 0.28 ↑↓↑↓ SP500 1268.8 ∼ 1418.3 11.78
48815.0−49070.0 1223.69−1427.09
TAX Y HB 35.0 ∼ 35.0 0.0 ···↔ TAX Y LB 10.0 ∼ 10.0 0.0 ·
35.0−35.0 10.0−10.0
UNEMP US 4.7 ∼ 4.6 -2.13 ↔↑↓↑
4.4−4.8
Table B.47: -

B.48 2007

CODE S∼E % YoY − − −− CODE S∼E % YoY −
range range

BOP −59118.0 ∼ −58200.0 -1.55 ↑↓↑↓ BREAK 10 nan ∼ nan nan ↔


−63891.0−−56453.0 nan−nan
BREAK 20 2.53 ∼ 2.54 0.4 ↑↓↔↓ CCPI 208.63 ∼ 213.76 2.46
2.49−2.62 208.63−213.76
CPI 203.44 ∼ 212.17 4.29 ↑↑↑↑ CURACC BAL −199093.0 ∼ −181630.0 -8.77
203.44−212.17 −199093.0−−159481.0
FFR 5.25 ∼ 3.94 -24.95 ↔↑↓↓ GLD 62.28 ∼ 82.46 32.4
3.94−5.26 60.17−83.0
INFL 1M 2.08 ∼ 4.29 106.9 ↑↓↑↑ INFL 1Y 2.85 ∼ 3.84 34.58
1.9−4.37 2.85−3.84
MONEY ONE 1445.5 ∼ 1445.6 0.01 ↓↓↓↑ MONEY TWO 7097.4 ∼ 7512.0 5.84
1313.2−1445.6 6997.4−7512.0
RGDPPC 48994.0 ∼ 49080.0 0.18 ↑↑↑↓ SP500 1416.6 ∼ 1468.36 3.65
48994.0−49520.0 1374.12−1565.15
TAX Y HB 35.0 ∼ 35.0 0.0 ···↔ TAX Y LB 10.0 ∼ 10.0 0.0 ·
35.0−35.0 10.0−10.0
UNEMP US 4.6 ∼ 4.9 6.52 ↓↑↑↑
4.4−5.0
Table B.48: -

B.49 2008

CODE S∼E % YoY − − −− CODE S∼E % YoY −−


range range

BOP −58200.0 ∼ −36030.0 -38.09 ↓↓↑↑ BREAK 10 2.31 ∼ nan nan ↔


−62321.0−−36030.0 nan−nan
BREAK 20 2.54 ∼ 1.0 -60.63 ↓↔↓↓ CCPI 213.76 ∼ 217.26 1.64 ↑
0.86−2.62 213.76−217.26
CPI 212.17 ∼ 211.93 -0.11 ↑↑↓↓ CURACC BAL −181630.0 ∼ −96782.0 -46.71 ↑
211.4−219.02 −181630.0−−96782.0
FFR 3.94 ∼ 0.15 -96.19 ↓↓↓↓ GLD 82.46 ∼ 86.52 4.92 ↑
0.15−3.94 70.0−99.17
INFL 1M 4.29 ∼ −0.11 -102.64 ↓↑↓↓ INFL 1Y 3.84 ∼ −0.36 -109.26 ·
−0.11−5.5 −0.36−3.84
MONEY ONE 1445.6 ∼ 1714.1 18.57 ↑↑↑↑ MONEY TWO 7512.0 ∼ 8179.5 8.89 ↑
1308.0−1714.1 7448.1−8179.5
RGDPPC 49080.0 ∼ 46931.0 -4.38 ↑↓↓↓ SP500 1468.36 ∼ 903.25 -38.49 ↓
46931.0−49215.0 752.44−1468.36
TAX Y HB 35.0 ∼ 35.0 0.0 ···↔ TAX Y LB 10.0 ∼ 10.0 0.0 ··
35.0−35.0 10.0−10.0
UNEMP US 4.9 ∼ 7.6 55.1 ↑↑↑↑
4.8−7.6
Table B.49: -

B.50 2009

CODE | S ∼ E | % YoY | trend | range
BOP | −36030.0 ∼ −37288.0 | 3.49 | ↑↓↓↓ | −40181.0 to −25962.0
BREAK 10 | 0.11 ∼ nan | nan | ↔ | nan to nan
BREAK 20 | 1.0 ∼ 2.5 | 150.0 | ↑↑↑↑ | 1.0 to 2.5
CCPI | 217.26 ∼ 220.46 | 1.47 | ↑ | 217.26 to 220.77
CPI | 211.93 ∼ 217.49 | 2.62 | ↑↑↑↑ | 211.93 to 217.49
CURACC BAL | −96782.0 ∼ −108972.0 | 12.6 | ↑ | −108972.0 to −88040.0
FFR | 0.15 ∼ 0.11 | -26.67 | ↔↑↓↓ | 0.11 to 0.22
GLD | 86.52 ∼ 107.31 | 24.03 | ↑ | 79.79 to 119.18
INFL 1M | −0.11 ∼ 2.62 | -2407.6 | ↓↓↑↑ | −1.96 to 2.81
INFL 1Y | −0.36 ∼ 1.64 | -561.27 | · | −0.36 to 1.64
MONEY ONE | 1670.5 ∼ 1800.9 | 7.81 | ↓↑↑↑ | 1504.2 to 1800.9
MONEY TWO | 8249.4 ∼ 8399.9 | 1.82 | ↑ | 8183.4 to 8447.3
RGDPPC | 46931.0 ∼ 47257.0 | 0.69 | ↓↑↑↑ | 46786.0 to 47257.0
SP500 | 903.25 ∼ 1115.1 | 23.45 | ↓ | 676.53 to 1127.78
TAX Y HB | 35.0 ∼ 35.0 | 0.0 | ···↔ | 35.0 to 35.0
TAX Y LB | 10.0 ∼ 10.0 | 0.0 | ·· | 10.0 to 10.0
UNEMP US | 7.6 ∼ 9.7 | 27.63 | ↑↑↑↓ | 7.6 to 10.2
Table B.50: -

B.51 2010

CODE | S ∼ E | % YoY | trend | range
BOP | −37288.0 ∼ −46341.0 | 24.28 | ↓↓↑↓ | −49895.0 to −37288.0
BREAK 10 | 2.37 ∼ 2.3 | -2.95 | ↔ | nan to nan
BREAK 20 | 2.5 ∼ 2.58 | 3.2 | ↑↓↑↑ | 1.87 to 2.63
CCPI | 220.46 ∼ 222.59 | 0.96 | | 220.46 to 222.59
CPI | 217.49 ∼ 221.19 | 1.7 | ↓↑↑↑ | 217.2 to 221.19
CURACC BAL | −108972.0 ∼ −118581.0 | 8.82 | | −118581.0 to −102033.0
FFR | 0.11 ∼ 0.17 | 54.55 | ↑↓↑↓ | 0.11 to 0.2
GLD | 107.31 ∼ 138.72 | 29.27 | | 104.04 to 139.11
INFL 1M | 2.62 ∼ 1.7 | -35.11 | ↓↓↓↑ | 1.08 to 2.62
INFL 1Y | 1.64 ∼ 3.16 | 92.49 | | 1.64 to 3.16
MONEY ONE | 1782.9 ∼ 1964.2 | 10.17 | ↑↑↑↑ | 1616.0 to 1964.2
MONEY TWO | 8413.3 ∼ 8869.3 | 5.42 | | 8398.2 to 8869.3
RGDPPC | 47257.0 ∼ 47861.0 | 1.28 | ↑↑↑↓ | 47257.0 to 48096.0
SP500 | 1115.1 ∼ 1257.64 | 12.78 | | 1022.58 to 1259.78
TAX Y HB | 35.0 ∼ 35.0 | 0.0 | ···↔ | 35.0 to 35.0
TAX Y LB | 10.0 ∼ 10.0 | 0.0 | · | 10.0 to 10.0
UNEMP US | 9.7 ∼ 9.0 | -7.22 | ↑↓↑↓ | 9.0 to 9.9
Table B.51: -

B.52 2011

CODE | S ∼ E | % YoY | trend | range
BOP | −46341.0 ∼ −52565.0 | 13.43 | ↑↓↑↓ | −53067.0 to −43107.0
BREAK 10 | 2.3 ∼ 1.96 | -14.78 | | nan to nan
BREAK 20 | 2.58 ∼ 2.19 | -15.12 | ↑↓↓↑ | 2.11 to 2.8
CCPI | 222.59 ∼ 227.68 | 2.29 | | 222.59 to 227.68
CPI | 221.19 ∼ 227.84 | 3.01 | ↑↑↑↑ | 221.19 to 227.84
CURACC BAL | −118581.0 ∼ −123962.0 | 4.54 | | −123962.0 to −107979.0
FFR | 0.17 ∼ 0.08 | -52.94 | ↓↓↔↑ | 0.07 to 0.17
GLD | 138.72 ∼ 151.99 | 9.57 | | 127.93 to 184.59
INFL 1M | 1.7 ∼ 3.01 | 76.9 | ↑↑↓↓ | 1.7 to 3.81
INFL 1Y | 3.16 ∼ 2.07 | -34.45 | | 2.07 to 3.16
MONEY ONE | 1954.5 ∼ 2290.8 | 17.21 | ↑↑↑↑ | 1783.7 to 2290.8
MONEY TWO | 8881.0 ∼ 9718.9 | 9.43 | | 8774.7 to 9724.0
RGDPPC | 47861.0 ∼ 49079.0 | 2.54 | ↑↑↑↑ | 47861.0 to 49079.0
SP500 | 1257.64 ∼ 1257.6 | -0.0 | | 1099.23 to 1363.61
TAX Y HB | 35.0 ∼ 35.0 | 0.0 | ···↔ | 35.0 to 35.0
TAX Y LB | 10.0 ∼ 10.0 | 0.0 | · | 10.0 to 10.0
UNEMP US | 9.0 ∼ 8.3 | -7.78 | ↔↑↓↓ | 8.3 to 9.2
Table B.52: -

B.53 2012

CODE | S ∼ E | % YoY | trend | range
BOP | −52565.0 ∼ −44448.0 | -15.44 | ↑↑↓↓ | −52565.0 to −38539.0
BREAK 10 | nan ∼ nan | nan | ↔ | nan to nan
BREAK 20 | 2.19 ∼ 2.48 | 13.24 | ↑↓↑↓ | 2.09 to 2.52
CCPI | 227.68 ∼ 232.11 | 1.94 | | 227.68 to 232.11
CPI | 227.84 ∼ 231.68 | 1.68 | ↑↓↑↑ | 227.84 to 231.68
CURACC BAL | −123962.0 ∼ −105487.0 | -14.9 | | −123962.0 to −105487.0
FFR | 0.08 ∼ 0.14 | 75.0 | ↑↑↔↓ | 0.08 to 0.16
GLD | 155.92 ∼ 162.02 | 3.91 | | 149.46 to 173.61
INFL 1M | 3.01 ∼ 1.68 | -44.03 | ↓↓↑↓ | 1.42 to 3.01
INFL 1Y | 2.07 ∼ 1.46 | -29.21 | | 1.46 to 2.07
MONEY ONE | 2380.3 ∼ 2662.9 | 11.87 | ↓↓↑↑ | 2137.6 to 2662.9
MONEY TWO | 9829.1 ∼ 10581.0 | 7.65 | | 9653.0 to 10581.0
RGDPPC | 49079.0 ∼ 49376.0 | 0.61 | ↑↑↓↑ | 49079.0 to 49388.0
SP500 | 1277.06 ∼ 1426.19 | 11.68 | | 1277.06 to 1465.77
TAX Y HB | 35.0 ∼ 39.6 | 13.14 | ···↑ | 35.0 to 39.6
TAX Y LB | 10.0 ∼ 10.0 | 0.0 | · | 10.0 to 10.0
UNEMP US | 8.3 ∼ 7.9 | -4.82 | ↓↑↓↔ | 7.7 to 8.3
Table B.53: -

B.54 2013

CODE | S ∼ E | % YoY | trend | range
BOP | −44448.0 ∼ −39095.0 | -12.04 | ↑↑↓↑ | −45027.0 to −34224.0
BREAK 10 | 2.45 ∼ nan | nan | ↔ | nan to nan
BREAK 20 | 2.48 ∼ 2.35 | -5.24 | ↔↓↑↑ | 2.09 to 2.59
CCPI | 232.11 ∼ 235.84 | 1.61 | | 232.11 to 235.84
CPI | 231.68 ∼ 235.29 | 1.56 | ↑↑↑↑ | 231.68 to 235.29
CURACC BAL | −105487.0 ∼ −111156.0 | 5.37 | | −111156.0 to −87317.0
FFR | 0.14 ∼ 0.07 | -50.0 | ↑↓↔↓ | 0.07 to 0.15
GLD | 162.02 ∼ 116.12 | -28.33 | | 114.82 to 163.67
INFL 1M | 1.68 ∼ 1.56 | -7.5 | ↓↑↓↑ | 0.88 to 2.02
INFL 1Y | 1.46 ∼ 1.62 | 10.74 | | 1.46 to 1.62
MONEY ONE | 2662.9 ∼ 2875.3 | 7.98 | ↑↓↑↑ | 2368.6 to 2875.3
MONEY TWO | 10581.0 ∼ 11080.3 | 4.72 | | 10334.6 to 11080.3
RGDPPC | 49376.0 ∼ 50171.0 | 1.61 | ↑↑↑↓ | 49376.0 to 50236.0
SP500 | 1426.19 ∼ 1848.36 | 29.6 | | 1426.19 to 1848.36
TAX Y HB | 39.6 ∼ 39.6 | 0.0 | ···↔ | 39.6 to 39.6
TAX Y LB | 10.0 ∼ 10.0 | 0.0 | · | 10.0 to 10.0
UNEMP US | 7.9 ∼ 6.6 | -16.46 | ↓↓↓↓ | 6.6 to 7.9
Table B.54: -

B.55 2014

CODE | S ∼ E | % YoY | trend | range
BOP | −39095.0 ∼ −41752.0 | 6.8 | ↓↑↓↑ | −47236.0 to −39001.0
BREAK 10 | 2.24 ∼ nan | nan | ↔ | nan to nan
BREAK 20 | 2.35 ∼ 1.7 | -27.66 | ↓↑↓↓ | 1.7 to 2.35
CCPI | 235.84 ∼ 239.87 | 1.71 | | 235.84 to 239.87
CPI | 235.29 ∼ 234.75 | -0.23 | ↑↑↓↓ | 234.75 to 237.5
CURACC BAL | −111156.0 ∼ −113337.0 | 1.96 | | −113451.0 to −98506.0
FFR | 0.07 ∼ 0.11 | 57.14 | ↑↔↔↑ | 0.07 to 0.12
GLD | 116.12 ∼ 113.58 | -2.19 | | 109.79 to 133.1
INFL 1M | 1.56 ∼ −0.23 | -114.76 | ↑↓↓↓ | −0.23 to 2.17
INFL 1Y | 1.62 ∼ 0.12 | -92.69 | | 0.12 to 1.62
MONEY ONE | 2712.3 ∼ 3157.8 | 16.43 | ↑↑↑↑ | 2591.1 to 3157.8
MONEY TWO | 11124.8 ∼ 11769.4 | 5.79 | | 10968.7 to 11769.4
RGDPPC | 50171.0 ∼ 50853.0 | 1.36 | ↑↑↑↓ | 50171.0 to 51042.0
SP500 | 1848.36 ∼ 2058.9 | 11.39 | | 1741.89 to 2090.57
TAX Y HB | 39.6 ∼ 39.6 | 0.0 | ···↔ | 39.6 to 39.6
TAX Y LB | 10.0 ∼ 10.0 | 0.0 | · | 10.0 to 10.0
UNEMP US | 6.6 ∼ 5.7 | -13.64 | ↓↓↓↓ | 5.6 to 6.7
Table B.55: -

B.56 2015

CODE | S ∼ E | % YoY | trend | range
BOP | −41752.0 ∼ −45677.0 | 9.4 | ↑↓↓↓ | −51367.0 to −35444.0
BREAK 10 | 1.68 ∼ nan | nan | ↔ | nan to nan
BREAK 20 | 1.7 ∼ 1.44 | -15.29 | ↑↓↓↓ | 1.44 to 1.99
CCPI | 239.87 ∼ 245.23 | 2.23 | | 239.87 to 245.23
CPI | 234.75 ∼ 237.65 | 1.24 | ↑↑↓↓ | 234.75 to 238.03
CURACC BAL | −113337.0 ∼ −124671.0 | 10.0 | | −125314.0 to −109676.0
FFR | 0.11 ∼ 0.34 | 209.09 | ↑↑↓↑ | 0.11 to 0.34
GLD | 113.58 ∼ 101.46 | -10.67 | | 100.5 to 125.23
INFL 1M | −0.23 ∼ 1.24 | -638.21 | ↑↑↓↑ | −0.23 to 1.24
INFL 1Y | 0.12 ∼ 1.26 | 963.49 | | 0.12 to 1.26
MONEY ONE | 3016.8 ∼ 3319.0 | 10.02 | ↑↑↑↑ | 2830.4 to 3319.0
MONEY TWO | 11809.3 ∼ 12452.5 | 5.45 | | 11641.2 to 12459.0
RGDPPC | 50853.0 ∼ 51023.0 | 0.33 | ↓↑↑↑ | 50660.0 to 51023.0
SP500 | 2058.9 ∼ 2043.94 | -0.73 | | 1867.61 to 2130.82
TAX Y HB | 39.6 ∼ 39.6 | 0.0 | ···↔ | 39.6 to 39.6
TAX Y LB | 10.0 ∼ 10.0 | 0.0 | · | 10.0 to 10.0
UNEMP US | 5.7 ∼ 4.9 | -14.04 | ↓↓↓↓ | 4.9 to 5.7
Table B.56: -

B.57 2016

CODE | S ∼ E | % YoY | trend | range
BOP | −45677.0 ∼ −48492.0 | 6.16 | ↑↓↓↓ | −48492.0 to −36440.0
BREAK 10 | 1.54 ∼ 1.95 | 26.62 | ↔ | nan to nan
BREAK 20 | 1.44 ∼ 2.01 | 39.58 | ↑↓↑↑ | 1.35 to 2.01
CCPI | 245.23 ∼ 250.78 | 2.26 | | 245.23 to 250.78
CPI | 237.65 ∼ 243.62 | 2.51 | ↑↑↑↑ | 237.34 to 243.62
CURACC BAL | −124671.0 ∼ −116781.0 | -6.33 | | −124671.0 to −112382.0
FFR | 0.34 ∼ 0.65 | 91.18 | ↑↑↑↑ | 0.34 to 0.65
GLD | 101.46 ∼ 109.61 | 8.03 | | 101.46 to 130.52
INFL 1M | 1.24 ∼ 2.51 | 102.86 | ↓↓↑↑ | 0.85 to 2.51
INFL 1Y | 1.26 ∼ 2.13 | 68.84 | | 1.26 to 2.13
MONEY ONE | 3319.1 ∼ 3518.5 | 6.01 | ↑↑↑↑ | 2984.5 to 3518.5
MONEY TWO | 12482.1 ∼ 13341.7 | 6.89 | | 12375.4 to 13344.8
RGDPPC | 51023.0 ∼ 51860.0 | 1.64 | ↑↑↑↑ | 51023.0 to 51860.0
SP500 | 2043.94 ∼ 2238.83 | 9.54 | | 1829.08 to 2271.72
TAX Y HB | 39.6 ∼ 39.6 | 0.0 | ···↔ | 39.6 to 39.6
TAX Y LB | 10.0 ∼ 10.0 | 0.0 | · | 10.0 to 10.0
UNEMP US | 4.9 ∼ 4.8 | -2.04 | ↑↓↔↓ | 4.6 to 5.0
Table B.57: -

B.58 2017

CODE | S ∼ E | % YoY | trend | range
BOP | −48492.0 ∼ −56601.0 | 16.72 | ↑↑↓↓ | −56601.0 to −42395.0
BREAK 10 | nan ∼ nan | nan | ↔ | nan to nan
BREAK 20 | 2.01 ∼ 2.04 | 1.49 | ↓↓↑↑ | 1.79 to 2.04
CCPI | 250.78 ∼ 255.29 | 1.8 | | 250.78 to 255.29
CPI | 243.62 ∼ 248.86 | 2.15 | ↑↑↑↑ | 243.62 to 248.86
CURACC BAL | −116781.0 ∼ −124105.0 | 6.27 | | −128158.0 to −100566.0
FFR | 0.65 ∼ 1.41 | 116.92 | ↑↑↔↑ | 0.65 to 1.41
GLD | 110.47 ∼ 123.65 | 11.93 | | 110.47 to 128.13
INFL 1M | 2.51 ∼ 2.15 | -14.3 | ↓↓↑↑ | 1.64 to 2.81
INFL 1Y | 2.13 ∼ 2.44 | 14.67 | | 2.13 to 2.44
MONEY ONE | 3568.9 ∼ 3837.3 | 7.52 | ↑↓↓↑ | 3238.3 to 3837.3
MONEY TWO | 13394.8 ∼ 13961.3 | 4.23 | | 13163.3 to 13965.4
RGDPPC | 51860.0 ∼ 53099.0 | 2.39 | ↑↑↑↑ | 51860.0 to 53099.0
SP500 | 2257.83 ∼ 2673.61 | 18.42 | | 2257.83 to 2690.16
TAX Y HB | 39.6 ∼ 37.0 | -6.57 | ···↓ | 37.0 to 39.6
TAX Y LB | 10.0 ∼ 10.0 | 0.0 | · | 10.0 to 10.0
UNEMP US | 4.8 ∼ 4.1 | -14.58 | ↓↓↓↔ | 4.1 to 4.8
Table B.58: -

B.59 2018

CODE | S ∼ E | % YoY | trend | range
BOP | −56601.0 ∼ −51149.0 | -9.63 | ↑↓↓↑ | −59769.0 to −43053.0
BREAK 10 | nan ∼ nan | nan | ↔ | nan to nan
BREAK 20 | 2.04 ∼ 1.82 | -10.78 | ↑↓↑↓ | 1.82 to 2.13
CCPI | 255.29 ∼ 260.7 | 2.12 | | 255.29 to 260.7
CPI | 248.86 ∼ 252.72 | 1.55 | ↑↑↑↓ | 248.86 to 252.77
CURACC BAL | −124105.0 ∼ −130403.0 | 5.07 | | −134377.0 to −101460.0
FFR | 1.41 ∼ 2.4 | 70.21 | ↑↑↑↑ | 1.41 to 2.4
GLD | 125.15 ∼ 121.25 | -3.12 | | 111.1 to 128.83
INFL 1M | 2.15 ∼ 1.55 | -27.92 | ↑↑↓↓ | 1.55 to 2.85
INFL 1Y | 2.44 ∼ 1.81 | -25.81 | | 1.81 to 2.44
MONEY ONE | 3837.3 ∼ 4016.7 | 4.68 | ↓↓↓↑ | 3448.3 to 4016.7
MONEY TWO | 13961.3 ∼ 14628.5 | 4.78 | | 13717.9 to 14628.5
RGDPPC | 53099.0 ∼ 57541.0 | 8.37 | ↑↑↑↑ | 53099.0 to 57541.0
SP500 | 2695.81 ∼ 2506.85 | -7.01 | | 2351.1 to 2930.75
TAX Y HB | 37.0 ∼ 37.0 | 0.0 | ···· | 37.0 to 37.0
TAX Y LB | 10.0 ∼ 10.0 | 0.0 | | 10.0 to 10.0
UNEMP US | 4.1 ∼ 4.0 | -2.44 | ↓↔↓↑ | 3.7 to 4.1
Table B.59: -

B.60 2019

CODE | S ∼ E | % YoY | trend | range
BOP | −51149.0 ∼ −45338.0 | -11.36 | ↑↓↑↑ | −55520.0 to −43086.0
BREAK 10 | 1.71 ∼ nan | nan | ↔ | nan to nan
BREAK 20 | 1.82 ∼ 1.81 | -0.55 | ↑↓↓↑ | 1.64 to 1.97
CCPI | 260.7 ∼ 266.48 | 2.22 | | 260.7 to 266.48
CPI | 252.72 ∼ 259.04 | 2.5 | ↑↑↑↑ | 252.72 to 259.04
CURACC BAL | −130403.0 ∼ −104204.0 | -20.09 | | −130403.0 to −104204.0
FFR | 2.4 ∼ 1.55 | -35.42 | ↑↓↓↓ | 1.55 to 2.42
GLD | 121.25 ∼ 142.9 | 17.86 | | 119.94 to 146.66
INFL 1M | 1.55 ∼ 2.5 | 61.25 | ↑↓↓↑ | 1.52 to 2.5
INFL 1Y | 1.81 ∼ 1.23 | -31.93 | | 1.23 to 1.81
MONEY ONE | 4016.7 ∼ 4264.9 | 6.18 | ↑↑↑↑ | 3598.2 to 4264.9
MONEY TWO | 14628.5 ∼ 15538.0 | 6.22 | | 14350.8 to 15538.0
RGDPPC | 57541.0 ∼ 57621.0 | 0.14 | ↑↑↑↓ | 57541.0 to 58386.0
SP500 | 2506.85 ∼ 3230.78 | 28.88 | | 2447.89 to 3240.02
UNEMP US | 4.0 ∼ 3.6 | -10.0 | ↓↑↓↔ | 3.5 to 4.0
Table B.60: -

B.61 2020

CODE | S ∼ E | % YoY | trend | range
BOP | −45338.0 ∼ −68213.0 | 50.45 | ↓↓↑↓ | −68213.0 to −39932.0
BREAK 10 | 1.77 ∼ nan | nan | ↔ | nan to nan
BREAK 20 | 1.81 ∼ 2.16 | 19.34 | ↓↑↑↑ | 1.23 to 2.16
CCPI | 266.48 ∼ 270.02 | 1.33 | | 265.44 to 270.12
CPI | 259.04 ∼ 262.65 | 1.39 | ↓↑↑↑ | 255.87 to 262.65
CURACC BAL | −104204.0 ∼ −195739.0 | 87.84 | | −195739.0 to −104204.0
FFR | 1.55 ∼ 0.09 | -94.19 | ↓↑↔↔ | 0.05 to 1.58
GLD | 142.9 ∼ 178.36 | 24.81 | | 138.04 to 193.89
INFL 1M | 2.5 ∼ 1.39 | -44.22 | ↓↑↑↑ | 0.23 to 2.5
INFL 1Y | 1.23 ∼ 4.7 | 280.83 | | 1.23 to 4.7
MONEY ONE | 4007.7 ∼ 7038.1 | 75.61 | ↑↑↑↑ | 3847.9 to 7038.1
MONEY TWO | 15573.1 ∼ 19440.2 | 24.83 | | 15351.3 to 19440.2
RGDPPC | 57621.0 ∼ 57664.0 | 0.07 | ↓↑↑↑ | 52155.0 to 57664.0
SP500 | 3230.78 ∼ 3756.07 | 16.26 | | 2237.4 to 3756.07
UNEMP US | 3.6 ∼ 6.3 | 75.0 | ↑↓↓↓ | 3.5 to 14.7
Table B.61: -

B.62 2021

CODE | S ∼ E | % YoY | trend | range
BOP | −68213.0 ∼ −89692.0 | 31.49 | ↓↓↑↓ | −89692.0 to −67116.0
BREAK 10 | 1.99 ∼ 2.56 | 28.64 | ↔ | nan to nan
BREAK 20 | 2.16 ∼ 2.46 | 13.89 | ↑↓↑↓ | 2.16 to 2.6
CCPI | 270.02 ∼ 286.43 | 6.08 | | 270.02 to 286.43
CPI | 262.65 ∼ 282.6 | 7.6 | ↑↑↑↑ | 262.65 to 282.6
CURACC BAL | −195739.0 ∼ −291418.0 | 48.88 | | −291418.0 to −190282.0
FFR | 0.09 ∼ 0.08 | -11.11 | ↓↑↓↔ | 0.06 to 0.1
GLD | 178.36 ∼ 170.96 | -4.15 | | 157.49 to 182.87
INFL 1M | 1.39 ∼ 7.6 | 444.55 | ↑↑↑↑ | 1.39 to 7.6
INFL 1Y | 4.7 ∼ 4.7 | 0.0 | | 4.7 to 4.7
MONEY ONE | 7044.4 ∼ 20761.5 | 194.72 | ↑↑↑↑ | 6556.6 to 20761.5
MONEY TWO | 19511.2 ∼ 21871.5 | 12.1 | | 19424.9 to 21871.5
RGDPPC | 57664.0 ∼ 59312.0 | 2.86 | ↑↑↑↓ | 57664.0 to 59692.0
SP500 | 3756.07 ∼ 4766.18 | 26.89 | | 3700.65 to 4793.06
UNEMP US | 6.3 ∼ 4.0 | -36.51 | ↓↓↓↓ | 3.9 to 6.3
Table B.62: -

B.63 2022

CODE | S ∼ E | % YoY | trend | range
BOP | −89692.0 ∼ −67419.0 | -24.83 | ↑↑↓↑ | −109802.0 to −61511.0
BREAK 10 | 2.56 ∼ 2.3 | -10.16 | | nan to nan
BREAK 20 | 2.46 ∼ 2.46 | 0.0 | ↑↓↑↓ | 2.44 to 2.86
CCPI | 286.43 ∼ 302.7 | 5.68 | | 286.43 to 302.7
CPI | 282.6 ∼ 300.54 | 6.35 | ↑↑↑↑ | 282.6 to 300.54
CURACC BAL | −291418.0 ∼ −217106.0 | -25.5 | | −291418.0 to −217106.0
FFR | 0.08 ∼ 4.33 | 5312.5 | ↑↑↑↑ | 0.08 to 4.33
GLD | 170.96 ∼ 169.64 | -0.77 | | 151.23 to 191.51
INFL 1M | 7.6 ∼ 6.35 | -16.43 | ↑↑↓↓ | 6.35 to 8.93
MONEY ONE | 20728.3 ∼ 19803.9 | -4.46 | | 19803.9 to 21019.7
MONEY TWO | 21844.7 ∼ 21363.3 | -2.2 | ↑↓↓↓ | 21288.9 to 22136.9
RGDPPC | 59312.0 ∼ 60422.0 | 1.87 | | 59115.0 to 60422.0
SP500 | 4766.18 ∼ 3839.5 | -19.44 | ↓↓↓↑ | 3577.03 to 4796.56
UNEMP US | 4.0 ∼ 3.4 | -15.0 | | 3.4 to 4.0
Table B.63: -

B.64 2023

CODE | S ∼ E | % YoY | trend | range
BREAK 10 | nan ∼ 2.39 | nan | ↔··· | nan to nan
BREAK 20 | 2.46 ∼ 2.46 | 0.0 | ···· | 2.46 to 2.46
CCPI | 302.7 ∼ 302.7 | 0.0 | ···· | 302.7 to 302.7
CPI | 300.54 ∼ 300.54 | 0.0 | ···· | 300.54 to 300.54
FFR | 4.33 ∼ 4.33 | 0.0 | ···· | 4.33 to 4.33
GLD | 171.06 ∼ 169.57 | -0.87 | ↓··· | 169.57 to 181.67
INFL 1M | 6.35 ∼ 6.35 | 0.0 | ···· | 6.35 to 6.35
MONEY ONE | 19735.4 ∼ 19735.4 | 0.0 | ···· | 19735.4 to 19735.4
MONEY TWO | 21328.5 ∼ 21328.5 | 0.0 | ···· | 21328.5 to 21328.5
SP500 | 3824.14 ∼ 4012.32 | 4.92 | ↑··· | 3808.1 to 4179.76
UNEMP US | 3.4 ∼ 3.4 | 0.0 | ···· | 3.4 to 3.4
Table B.64: -

Appendix C

CODE References

BOP
Frequency: M
Range: 2022-12-01∼2022-12-01
Description: Trade Balance: Goods and Services, Balance of Payments Basis
Citations: U.S. Census Bureau and U.S. Bureau of Economic Analysis, Trade Balance: Goods and Services, Balance of Pay-
ments Basis [BOPGSTB], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/BOPGSTB,
February 20, 2023.
Notes:
BREAK 10
Frequency: D
Range: 2023-02-23∼2023-02-23
Description: The breakeven inflation rate represents a measure of expected inflation derived from 10-Year Trea-
sury Constant Maturity Securities (BC 10YEAR) and 10-Year Treasury Inflation-Indexed Constant Maturity Securities
(TC 10YEAR). The latest value implies what market participants expect inflation to be in the next 10 years, on average.
Citations: Federal Reserve Bank of St. Louis, 10-Year Breakeven Inflation Rate [T10YIE], retrieved from FRED, Federal
Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/T10YIE, February 20, 2023.
Notes: breakeven rates are based on latest release/revision.
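As an illustration of the derivation above, the following is a minimal sketch, assuming the pandas_datareader FRED interface and the DGS10 (nominal) and DFII10 (inflation-indexed) 10-year constant-maturity yield series; the breakeven rate is taken as the nominal yield minus the inflation-indexed yield, which is how the published T10YIE series is defined.

# Sketch: reconstruct the 10-year breakeven inflation rate (BREAK 10 / T10YIE)
# as the spread between nominal and inflation-indexed 10-year yields from FRED.
from pandas_datareader import data as pdr

yields = pdr.DataReader(["DGS10", "DFII10"], "fred",
                        start="2010-01-01", end="2023-02-23")
breakeven_10y = (yields["DGS10"] - yields["DFII10"]).rename("BREAK_10")
print(breakeven_10y.dropna().tail())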
BREAK 20
Frequency: M
Range: 2023-01-01∼2023-01-01
Description: The breakeven inflation rate represents a measure of expected inflation derived from 20-Year Trea-
sury Constant Maturity Securities (BC 20YEARM) and 20-Year Treasury Inflation-Indexed Constant Maturity Securities
(TC 20YEARM). The latest value implies what market participants expect inflation to be in the next 20 years, on average.
Citations: Federal Reserve Bank of St. Louis, 20-year Breakeven Inflation Rate [T20YIEM], retrieved from FRED, Federal
Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/T20YIEM, February 20, 2023.
Notes: breakeven rates are based on latest release/revision.
CCPI
Frequency: M
Range: 2023-01-01∼2023-01-01
Description: The “Consumer Price Index for All Urban Consumers: All Items Less Food & Energy” is an aggregate
of prices paid by urban consumers for a typical basket of goods, excluding food and energy. This measurement, known as
“Core CPI,” is widely used by economists because food and energy have very volatile prices.
Citations: U.S. Bureau of Labor Statistics, Consumer Price Index for All Urban Consumers: All Items Less Food and En-
ergy in U.S. City Average [CPILFESL], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/CPILFESL,
February 11, 2023.
Notes:
CPI
Frequency: M
Range: 2023-01-01∼2023-01-01
Description: Consumer Price Index for All Urban Consumers: All Items (CPIAUCSL) is a price index of a basket
of goods and services paid by urban consumers. Percent changes in the price index measure the inflation rate. It covers
roughly 88 percent of the total population, accounting for wage earners, clerical workers, technical workers, self-employed,
short-term workers, unemployed, retirees, and those not in the labor force. The CPIs are based on prices for food, clothing,
shelter, and fuels; transportation fares; service fees (e.g., water and sewer service); and sales taxes. Prices are collected
monthly from about 4,000 housing units and approximately 26,000 retail establishments across 87 urban areas. To calculate
the index, price changes are averaged with weights representing their importance in the spending of the particular group.
The Bureau of Labor Statistics also releases a seasonally adjusted index, which removes the effects of seasonal changes,
such as weather, school year, production cycles, and holidays. Unlike core CPI (CPILFESL), this index includes volatile
food and oil prices.
Citations: U.S. Bureau of Labor Statistics, Consumer Price Index for All Urban Consumers: All Items in U.S. City Aver-
age [CPIAUCSL], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/CPIAUCSL,
February 11, 2023.
Notes: CPI is based on the latest release/revision.
CURACC BAL
Frequency: Q
Range: 2022-07-01∼2022-07-01
Description: Calculated by subtracting the imports of goods and services and income payments (debits) from the
exports of goods and services and income receipts (credits)
Citations: U.S. Bureau of Economic Analysis, Balance on current account [IEABC], retrieved from FRED, Federal Reserve
Bank of St. Louis; https://fred.stlouisfed.org/series/IEABC, February 20, 2023.
Notes:
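Schematically, with purely illustrative numbers (not taken from any BEA release):

\[ \text{CURACC BAL} = \text{credits} - \text{debits} = 980{,}000 - 1{,}150{,}000 = -170{,}000, \]

i.e., the balance is negative (a current account deficit) whenever debits exceed credits.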
FFR
Frequency: M
Range: 2023-01-01∼2023-01-01
Description: Federal Funds Effective Rate: Averages of daily figures. The federal funds rate is the interest rate at which
depository institutions trade federal funds (balances held at Federal Reserve Banks) with each other overnight. When a
depository institution has surplus balances in its reserve account, it lends to other banks in need of larger balances. In
simpler terms, a bank with excess cash, which is often referred to as liquidity, will lend to another bank that needs to
quickly raise liquidity, and the FFR is the weighted average rate for all of these negotiations. The Federal Reserve steers
the FFR toward its target through open market operations, i.e., by buying and selling government bonds (government
debt). The Federal Open Market Committee (FOMC) meets eight times a year to determine the federal funds target rate.
Selling government bonds raises the FFR because banks have less liquidity to trade with other banks; the converse also
holds. In making its monetary policy decisions, the FOMC considers economic data such as trends in
prices and wages, employment, consumer spending and income, business investments, and foreign exchange markets. The
federal funds rate is the central interest rate in the U.S. financial market. It influences other interest rates such as the
prime rate, which is the rate banks charge their customers with higher credit ratings. Additionally, the federal funds rate
indirectly influences longer-term interest rates such as mortgages, loans, and savings.
Citations: Board of Governors of the Federal Reserve System (US), Federal Funds Effective Rate [FEDFUNDS], retrieved
from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/FEDFUNDS, February 11, 2023.
Notes:
GLD
Frequency: D
Range: 2023-02-23∼2023-02-23
Description: SPDR® Gold MiniShares (NYSE Arca: GLDM) offers investors one of the lowest available expense ratios
for a U.S. listed physically gold-backed ETF.
Citations: Yahoo! Finance
Notes:
INFL 1M
Frequency: M
Range: 2023-01-01∼2023-01-01
Description: YoY Inflation Rate Derived from Monthly Consumer Price Index, FRED CPIAUCSL
Citations: U.S. Bureau of Labor Statistics, Consumer Price Index for All Urban Consumers: All Items in U.S. City Aver-
age [CPIAUCSL], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/CPIAUCSL,
February 11, 2023.
Notes: CPI is based on the latest release/revision; this series is a smoothed version of INFL 1Y.
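A minimal sketch of this derivation, assuming the pandas_datareader FRED interface and taking INFL 1M as the plain 12-month percentage change of CPIAUCSL (any additional smoothing applied in these notes is not reproduced here):

# Sketch: YoY inflation at monthly frequency from the CPIAUCSL price index.
from pandas_datareader import data as pdr

cpi = pdr.DataReader("CPIAUCSL", "fred", start="2000-01-01", end="2023-01-01")["CPIAUCSL"]
infl_1m = (100.0 * cpi.pct_change(12)).rename("INFL_1M")  # percent change vs. 12 months ago
print(infl_1m.dropna().tail())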
INFL 1Y

Frequency: Y
Range: 2021-01-01∼2021-01-01
Description: Inflation as measured by the consumer price index reflects the annual percentage change in the cost to the
average consumer of acquiring a basket of goods and services that may be fixed or changed at specified intervals, such as
yearly. The Laspeyres formula is generally used.
Citations: World Bank, Inflation, consumer prices for the United States [FPCPITOTLZGUSA], retrieved from FRED,
Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/FPCPITOTLZGUSA, February 11, 2023.
Notes:
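For reference, the Laspeyres price index mentioned above weights current-period prices by base-period quantities,

\[ P_L^{(t)} = 100 \times \frac{\sum_i p_{i,t}\, q_{i,0}}{\sum_i p_{i,0}\, q_{i,0}}, \]

where p_{i,t} is the price of item i in period t and q_{i,0} is its base-period quantity; the annual inflation rate is then the year-over-year percentage change of this index.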
MONEY ONE
Frequency: W
Range: 2023-01-02∼2023-01-02
Description: Before May 2020, M1 consists of (1) currency outside the U.S. Treasury, Federal Reserve Banks, and the
vaults of depository institutions; (2) demand deposits at commercial banks (excluding those amounts held by depository
institutions, the U.S. government, and foreign banks and official institutions) less cash items in the process of collection and
Federal Reserve float; and (3) other checkable deposits (OCDs), consisting of negotiable order of withdrawal, or NOW, and
automatic transfer service, or ATS, accounts at depository institutions, share draft accounts at credit unions, and demand
deposits at thrift institutions. Beginning May 2020, M1 consists of (1) currency outside the U.S. Treasury, Federal Reserve
Banks, and the vaults of depository institutions; (2) demand deposits at commercial banks (excluding those amounts held
by depository institutions, the U.S. government, and foreign banks and official institutions) less cash items in the process
of collection and Federal Reserve float; and (3) other liquid deposits, consisting of OCDs and savings deposits (including
money market deposit accounts). Seasonally adjusted M1 is constructed by summing currency, demand deposits, and
OCDs (before May 2020) or other liquid deposits (beginning May 2020), each seasonally adjusted separately.
Citations: Board of Governors of the Federal Reserve System (US), M1 [WM1NS], retrieved from FRED, Federal Reserve
Bank of St. Louis; https://fred.stlouisfed.org/series/WM1NS, February 20, 2023.
Notes: UNITS: BILLIONS OF DOLLARS
MONEY TWO
Frequency: W
Range: 2023-01-02∼2023-01-02
Description: Before May 2020, M2 consists of M1 plus (1) savings deposits (including money market deposit accounts);
(2) small-denomination time deposits (time deposits in amounts of less than $100,000) less individual retirement account
(IRA) and Keogh balances at depository institutions; and (3) balances in retail money market funds (MMFs) less IRA
and Keogh balances at MMFs. Beginning May 2020, M2 consists of M1 plus (1) small-denomination time deposits (time
deposits in amounts of less than $100,000) less IRA and Keogh balances at depository institutions; and (2) balances in
retail MMFs less IRA and Keogh balances at MMFs. Seasonally adjusted M2 is constructed by summing savings deposits
(before May 2020), small-denomination time deposits, and retail MMFs, each seasonally adjusted separately, and adding
this result to seasonally adjusted M1.
Citations: Board of Governors of the Federal Reserve System (US), M2 [WM2NS], retrieved from FRED, Federal Reserve
Bank of St. Louis; https://fred.stlouisfed.org/series/WM2NS, February 20, 2023.
Notes: UNITS: BILLIONS OF DOLLARS
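A minimal sketch, assuming the pandas_datareader FRED interface, that pulls the weekly, not-seasonally-adjusted aggregates cited above and backs out the non-M1 component of M2 (small-denomination time deposits plus retail money market funds, per the description):

# Sketch: weekly M1 (WM1NS) and M2 (WM2NS) stocks in billions of dollars,
# plus the non-M1 portion of M2 implied by the definitions above.
from pandas_datareader import data as pdr

money = pdr.DataReader(["WM1NS", "WM2NS"], "fred", start="2019-01-01", end="2023-01-02")
money["NON_M1_M2"] = money["WM2NS"] - money["WM1NS"]  # billions of dollars
print(money.tail())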
RGDPPC
Frequency: Q
Range: 2022-10-01∼2022-10-01
Description: Real Gross Domestic Product Per Capita
Citations: U.S. Bureau of Economic Analysis, Real gross domestic product per capita [A939RX0Q048SBEA], retrieved
from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/A939RX0Q048SBEA, February 20, 2023.
Notes:
SP500
Frequency: D
Range: 2023-02-23∼2023-02-23
Description: Large cap U.S. equities market. The index includes 500 leading companies in leading industries of the U.S.
economy, which are publicly held on either the NYSE or NASDAQ. Adjusted for dividends.
Citations: Yahoo! Finance
Notes:
TAX Y HB
Frequency: Y
Range: 2018-01-01∼2018-01-01

Description: Highest bracket personal income tax rates. Years represent the tax years. Starting with 1985, tax bracket
boundaries were indexed for inflation, using the U.S. Department of Labor Consumer Price Index for Urban Consumers
(CPS-U). As stated by the source, tax rates shown are for the regular income tax, i.e., for normal tax and surtax, applicable
to U.S. citizens and residents. Therefore, the rates exclude provisions unique to nonresident aliens. Tax rates exclude the
effect of tax credits (which reduce the tax liability), except as noted, and several specific add-on or other taxes applicable
to all or some tax years.
Citations: U.S. Department of the Treasury. Internal Revenue Service, U.S Individual Income Tax: Tax Rates for Regular
Tax: Highest Bracket [IITTRHB], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/IITTRHB,
February 11, 2023.
Notes:
TAX Y LB
Frequency: Y
Range: 2018-01-01∼2018-01-01
Description: Lowest bracket personal income tax rates. Years represent the tax years. Starting with 1985, tax bracket
boundaries were indexed for inflation, using the U.S. Department of Labor Consumer Price Index for Urban Consumers
(CPS-U). As stated by the source, tax rates shown are for the regular income tax, i.e., for normal tax and surtax, applicable
to U.S. citizens and residents. Therefore, the rates exclude provisions unique to nonresident aliens. Tax rates exclude the
effect of tax credits (which reduce the tax liability), except as noted, and several specific add-on or other taxes applicable
to all or some tax years.
Citations: U.S. Department of the Treasury. Internal Revenue Service, U.S Individual Income Tax: Tax Rates for Regular
Tax: Lowest Bracket [IITTRLB], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/IITTRLB,
February 11, 2023.
Notes:
UNEMP US
Frequency: M
Range: 2023-01-01∼2023-01-01
Description: The unemployment rate represents the number of unemployed as a percentage of the labor force. Labor
force data are restricted to people 16 years of age and older, who currently reside in 1 of the 50 states or the District of
Columbia, who do not reside in institutions (e.g., penal and mental facilities, homes for the aged), and who are not on
active duty in the Armed Forces. This rate is also defined as the U-3 measure of labor underutilization. The series comes
from the ‘Current Population Survey (Household Survey)’. The source code is LNS14000000.
Citations: U.S. Bureau of Labor Statistics, Unemployment Rate [UNRATE], retrieved from FRED, Federal Reserve Bank
of St. Louis; https://fred.stlouisfed.org/series/UNRATE, February 11, 2023.
Notes:
