Quantitative Analysis

OGARP
Financial Risk Manager
2020
EXAM PART I
Quantitative Analysis
I
(p G A R P
2020
EXAM PART I
Quantitative Analysis
Pearson
Copyright © 2020 by the Global Association of Risk Professionals
All rights reserved.
This copyright covers material written expressly for this volume by the editor/s as well as the compilation itself.
It does not cover the individual selections herein that first appeared elsew here. Permission to reprint these has
been obtained by Pearson Education, Inc. for this edition only. Further reproduction by any means, electronic or
m echanical, including photocopying and recording, or by any information storage or retrieval system , must be
arranged with the individual copyright holders noted.
All tradem arks, service marks, registered tradem arks, and registered service marks are the property of their
respective owners and are used herein for identification purposes only.
Pearson Education, Inc., 330 Hudson Street, New York, New York 10013
A Pearson Education Com pany
w w w .pearsoned.com
Printed in the United States of Am erica
ScoutAutomatedPrintCode
000200010272205718
EEB /K W
Pearson ISBN 10: 0135969360

ISBN 13: 9780135969366
C o n ten ts
Chapter 1 Fundamentals Chapter 2 Random Variables 11

of Probability 1
2.1 Definition of a Random Variable 12
1.1 Sample Space, Event Space, 2.2 Discrete Random Variables 13
and Events 2
2.3 Expectations 14
Probability 2
Properties of the Expectation O perator 14
Fundamental Principles of Probability 4
Conditional Probability 4 2.4 Moments 14
Exam ple: SIFI Failures 5 The Four Named Moments 16
Moments and Linear Transform ations 17
1.2 Independence 5
Conditional Independence 6 2.5 Continuous Random Variables 18
1.3 Bayes'Rule 7 2.6 Quantiles and Modes 20
1.4 Summary 7 2.7 Summary 22
Questions 8 Questions 23
Short Concept Questions 8 Short Concept Questions 23
Practice Questions 8 Practice Q uestions 23
Answers 9 Answers 24
Solved Problems 9 Solved Problems 25
Conditional Distributions 50
Chapter 3 Common Univariate
4.2 Expectations and Moments 51
Random Variables 27 Expectations 51
Moments 51
The Variance of Sums of Random Variables 53
3.1 Discrete Random Variables 28
Covariance, Correlation, and Independence 54
Bernoulli 28
Binomial 29 4.3 Conditional Expectations 54
Poisson 30 Conditional Independence 54
3.2 Continuous Random Variables 31 4.4 Continuous Random Variables 55

Uniform 31 Marginal and Conditional Distributions 56
Normal 32
4.5 Independent, Identically
Approxim ating Discrete Random
Distributed Random Variables 56
Variables 33
Log-Normal 34 4.6 Summary 57
x2 36 Questions 58
Student's t 36 Short Concept Questions 58
F 38 Practice Q uestions 58
Exponential 39
Answers 59
Beta 40
Short Concept Questions 59
M ixtures of Distributions 40
Solved Problems 59
3.3 Summary 43
Questions 44
Chapter 5 Sample Moments 63
Practice Questions 44
Answers 45
Short Concept Questions 45 5.1 The First Two Moments 64
Solved Problems 45 Estim ating the Mean 64
Estim ating the Variance and Standard
Deviation 65
Presenting the Mean and Standard
Chapter 4 Multivariate Random Deviation 66
Variables 47 Estim ating the Mean and Variance
Using Data 67
5.2 Higher Moments 67

4.1 Discrete Random Variables 48
Probability M atrices 48 5.3 The BLUE Mean Estimator 70
Marginal Distributions 49 5.4 Large Sample Behavior
Independence 50 of the Mean 71
iv ■ Contents
5.5 The Median and Other Quantiles 73 Answers 96
5.6 Multivariate Moments 74
74 Solved Problems 96
Covariance and Correlation
Sample Mean of Two Variables 75
Coskew ness and Cokurtosis 76
5.7 Summary 78 Chapter 7 Linear Regression 101

Questions 79
7.1 Linear Regression 102
Practice Questions 79
Linear Regression Param eters 102
Answers 80 Linearity 103
Short Concept Questions 80 Transformations 103
Solved Problems 80 Dummy Variables 104
7.2 Ordinary Least Squares 104
Chapter 6 Hypothesis Testing 83 7.3 Properties of OLS Parameter

Estimators 106
Shocks Are Mean Zero 106
6.1 Elements of a Hypothesis Test 84 Data Are Realizations from iid Random
Null and Alternative Hypotheses 84 Variables 107
One-sided Alternative Hypotheses 85 Variance of X 107
Test Statistic 85 Constant Variance of Shocks 107
Type I Error and Test Size 85 No Outliers 107
Critical Value and Decision Rule 86 Implications of O LS Assum ptions 108

Exam ple: Testing a Hypothesis about the 7.4 Inference and Hypothesis Testing 110
Mean 86
Type II Error and Test Power 88 7.5 Application: CAPM 111
Example: The Effect of Sample Size on Power 88 7.6 Application: Hedging 113
Exam ple: The Effect of Test Size and
Population Mean on Power 90 7.7 Application: Performance
1
Confidence Intervals 90
Evaluation 114
The p-value of a Test 91 7.8 Summary 115
6.2 Testing the Equality of Questions 116
Two Means 92 Short Concept Questions 116
6.3 Summary 94 Practice Questions 116
Questions 95 Answers 117

Practice Questions 95 Solved Problems 117
Contents ■ v
9.5 Multicollinearity 145
Chapter 8 Regression with Residual Plots 146
Multiple Explanatory O utliers 146
Variables 119 9.6 Strengths of OLS 147
9.7 Summary 149
8.1 Multiple Explanatory Variables 120 Questions 150
Additional Assum ptions 121 Short Concept Questions 150
8.2 Application: Multi-Factor Risk Practice Questions 150
Models 122 Answers 153
8.3 Measuring Model Fit 125 Short Concept Questions 153
Solved Problems 153
8.4 Testing Parameters In Regression
Models 127
M ultivariate Confidence Intervals 128
The F-statistic of a Regression 129 Chapter 10 Stationary Time
8.5 Summary 131 Series 159
8.6 Appendix 131
Tables 131
10.1 Stochastic Processes 160
Questions 134
10.2 Covariance Stationarity 161
Practice Questions 134 10.3 White Noise 162
Dependent W hite Noise 163
Answers 135
Wold's Theorem 163
Solved Problems 135 10.4 Autoregressive (AR) Models 163
Autocovariances and Autocorrelations 164
The Lag O perator 164
AR(p) 165
Chapter 9 Regression
Diagnostics 139 10.5 Moving Average (MA)
Models 169
10.6 Autoregressive Moving
9.1 Omitted Variables 140 Average (ARMA) Models 171
9.2 Extraneous Included Variables 141 10.7 Sample Autocorrelation 174
9.3 The Bias-Variance Tradeoff 141 Joint Tests of Autocorrelations 174
Specification Analysis 174
9.4 Heteroskedasticity 142
Approaches to Modeling Heteroskedastic 10.8 Model Selection 177
Data 145 Box-Jenkins 178
vi ■ Contents
10.9 Forecasting 178 Short Concept Questions 204
Practice Q uestions 204
10.10 Seasonality 181
Answers 206
10.11 Summary 182
Appendix 182 Solved Problems 206
Characteristic Equations 182
Questions 183
Short Concept Questions 183 Chapter 12 Measuring Returns,
Practice Questions 183 Volatility, and
Answers 184 Correlation 211
Solved Problems 184
12.1 Measuring Returns 212
12.2 Measuring Volatility and Risk 213
Chapter 11 Non-Stationary Implied Volatility 214
Time Series 187 12.3 The Distribution of Financial
Returns 214
Power Laws 215
11.1 Time Trends 188
Polynomial Trends 188 12.4 Correlation Versus
Dependence 216
11.2 Seasonality 190
Alternative Measures of Correlation 216
Seasonal Dummy Variables 191
12.5 Summary 219
11.3 Time Trends, Seasonalities,
and Cycles 191 Questions 220
11.4 Random Walks and Unit
Practice Q uestions 220
Roots 194
Unit Roots 194 Answers 221
The Problems with Unit Roots 195 Short Concept Questions 221
Testing for Unit Roots 195 Solved Problems 221
Seasonal Differencing 198
11.5 Spurious Regression 199

11.6 When to Difference? 199 Chapter 13 Simulation and
Bootstrapping 223
11.7 Forecasting 201
Forecast Confidence Intervals 202
11.8 Summary 202 13.1 Simulating Random Variables 224

Questions 204 13.2 Approximating Moments 224
Contents ■ vii
The Mean of a Normal 226 13.4 Comparing Simulation
Approxim ating the Price of a Call and Bootstrapping 234
Option 227
13.5 Summary 234
Improving the Accuracy of Simulations 229
Questions 235
Antithetic Variables 229
Control Variates 229 Practice Q uestions 235
Application: Valuing an Option 230
Answers 236
Limitation of Simulations 231 Short Concept Questions 236
Solved Problems 236
13.3 Bootstrapping 231
Bootstrapping Stock Market Returns 232 Index 239
Limitations of Bootstrapping 234
viii ■ Contents
Chairman
Dr. Rene Stulz
Everett D. Reese Chair of Banking and M onetary Econom ics,
The Ohio State University
Members
Richard Apostolik Dr. Attilio Meucci, CFA
President and C E O , Global Association of Risk Professionals Founder, ARPM
Michelle McCarthy Beck, SMD Dr. Victor Ng, CFA, MD

C hief Risk Officer, T IA A Financial Solutions C h ief Risk Architect, M arket Risk M anagem ent and Analysis,
Goldm an Sachs
Richard Brandt, MD
O perational Risk M anagem ent, Citigroup Dr. Matthew Pritsker
Senior Financial Econom ist and Policy Advisor / Supervision,
Julian Chen, FRM, SVP
Regulation, and C redit, Federal Reserve Bank of Boston
FRM Program Manager, Global Association of Risk Professionals
Dr. Samantha Roberts, FRM, SVP
Dr. Christopher Donohue, MD
Balance Sheet Analytics & M odeling, PN C Bank
G A R P Benchm arking Initiative, Global Association of Risk
Professionals Dr. Til Schuermann
Partner, O liver Wyman
Donald Edgar, FRM, MD
Risk & Q uantitative Analysis, BlackRock Nick Strange, FCA
Director, Supervisory Risk Specialists, Prudential Regulation
Herve Geny
Authority, Bank of England
Group Head of Internal A udit, London Stock Exchange Group
Dr. Sverrir Porvaldsson, FRM
Keith Isaac, FRM, VP
Senior Q uant, SEB
Capital Markets Risk M anagem ent, TD Bank Group
William May, SVP

Global Head of Certifications and Educational Program s, Global
Association of Risk Professionals
FRM® Committee ■ ix
ATTRIBUTIONS
Author
Kevin Sheppard, PhD, Associate Professor in Financial
Econom ics and Tutorial Fellow in Econom ics at the
University of O xford
Reviewers
Chris Brooks, PhD, Professor of Finance at the ICM A Centre, Deco Caigu Liu, FRM , C S C , CBAP, Global A sset Liability
University of Reading, M anagem ent, Manulife
Paul Feehan, PhD, Professor of M athem atics and Director of the John Craig Anderson, FRM , Director— D ebt and Capital
M asters of M athematical Finance Program , Rutgers University Advisory Services, Advize Capital
Erick W. Rengifo, PhD, Associate Professor of Econom ics, Thierry Schnyder, FRM , C SM , Risk Analyst, Bank for
Fordham University International Settlem ents
Richard Morrin, PhD, FRM , ERP, C FA , C IPM , Financial Officer, Yaroslav Nevmerzhytskyi, FRM , C FA , ERP, Deputy C h ief Risk
International Finance Corporation O fficer and Financial Risk M anagem ent Team Leader,
Naftogaz of Ukraine
Antonio Firmo, FRM , Senior A udit M anager— Trading Risk,
Risk C apital, and M odeling, Royal Bank of Scotland
x ■ Attributions
Fundamentals
of Probability
Learning Objectives
A fter com pleting this reading you should be able to:
D escribe an event and an event space. Define and calculate a conditional probability.
D escribe independent events and mutually exclusive Distinguish between conditional and unconditional
events. probabilities.
Explain the difference between independent events and Explain and apply Bayes' rule.
conditionally independent events.
Calculate the probability of an event for a discrete prob

ability function.
Probability theory is the foundation of statistics, econom etrics, Events are subsets of a sam ple space, and an event is denoted
and risk m anagem ent. This chapter introduces probability in its by co. An event is a set of outcom es and may contain one or
sim plest form . more of the values in the sam ple space, or it may even contain
no elem ents. W hen an event includes only one outcom e, it is
A probability measures the likelihood that some event occurs.
often referred to as an elem entary event. Events are sets and
In most financial applications of probability, events are tightly
are usually written with set notation. For exam ple, when inter
coupled with numeric values. Exam ples of such events include
ested in the sum of two dice, one event is that the sum is odd:
the loss on a portfolio, the number of defaults in a m ortgage
{(1,2), (1,4), (1,6), . . . ,(4,5), (5,6)}. A nother event is that the sum
pool, or the sensitivity of a portfolio to a rise in short-term inter
is greater than or equal to nine: {(4,5), (4,6), (5,5), (5,6), (6,6)}.
est rates. Events can also be measured by values without a
natural numeric correspondence. These include categorical vari The event space, usually denoted by T , consists of all com bina
ables, such as the type of a financial institution or the rating on a tions of outcom es to which probabilities can be assigned. Note
corporate bond. that the event space is an abstract concept and separate from
any specific application. As an exam ple, suppose that {A }, {B }
Probability is introduced through three fundam ental principles.
are the possible outcom es of an experim ent. W ithout knowing
1 . The probability of any event is non-negative. what {A }, {B } mean in practice, one might consider the following
2 . The sum of the probabilities across all outcom es is one. events.
3 . The joint probability of two independent events is the prod 1 . A occurs, B does not.
uct of the probability of each.
2. B occurs, A does not.
This chapter also introduces conditional probability, an impor 3 . Both A and B occur.
tant concept that assigns probabilities within a subset of all
4. Neither A nor B occur.
events. A fter that, this chapter moves on to discuss indepen
dence and conditional independence. This chapter concludes This would give an event space consisting of four events
by exam ining Bayes' rule, which provides a simple and powerful {A , B, {A , B}, 0} . Because this event space has a finite number of
expression for incorporating new information or data into prob outcom es, it is called a d iscrete probability sp a ce .*3
ability estim ates. In the corporate default exam ple, the event space is
{D efault, No Default, {D efault, No Default}, 0} . It might appear
surprising that the event space contains two "im possible"
1.1 SAM PLE SPACE, EVEN T SPACE, events for this exam ple:
AND EVENTS
1. {D efault, No D efault}, which contains both outcom es; and
A sam ple space is a set containing all possible outcom es of an 2. The em pty set 0, which contains neither.
experim ent and is denoted as (1. The set of outcom es depends
However, note that a definite probability of 0 can be assigned to
on the problem being studied. For exam ple, when examining "im possible" events. Because the event space contains all sets
the returns on the S&P 500, the sam ple space is the set of all
that can be assigned a probability, these two events are there
real numbers and is denoted by U. On the other hand, if the fore part of the event space.
focus is on the direction of the return on the S&P 500, then the
sam ple space is {Positive, N egative}. W hen modeling corporate
defaults, the sam ple space is {D efault, No Default}. W hen rolling Probability
a single six-sided die, the sam ple space is { 1 , 2 , . . . , 6}. The
Probability measures the likelihood of an event. Probabilities are
sam ple space for two identical six-sided dice contains 21 ele
always between 0 and 1 (inclusive). An event with probability 0
ments representing all distinct pairs of values: {(1,1), (1,2),
never occurs, while an event with a probability 1 always occurs.
(1,3), . . . , (5,5), (5,6), (6 ,6 )}.1 If we are modeling the sum of two
The sim plest interpretation of probability is that it is the fre
dice, however, then the sam ple space is {2, . . . , 12}.
quency with which an event would occur if a set of independent
experim ents was run. This interpretation of probability is known
1 This assumes that the dice are indistinguishable and therefore one can
only observe the values shown (and not which die produced each value).
An event that contains no elements is known as the empty set.
If the dice were distinguishable from each other (e.g., they have differ
ent colors), then the sample space would contain 36 values (because 3 A discrete probability space is one that contains either finitely many
{1,2} and {2,1} are different). outcomes or (possibly) countably infinitely many outcomes.
2 ■ Financial Risk Manager Exam Part I: Quantitative Analysis

as the frequentist interpretation and focuses on objective prob subsets. Pi is the intersection operator, and thus A R B is the set
ability. Note that this is a conceptual interpretation, whereas of outcom es that appear in both A and 8. The union operator is
finance is mostly focused on non-experim ental events (e.g ., the denoted with U , and A U B is the set that contains all outcom es
return on the stock market). in either A , 8, or both. The com plem ent of the set A , denoted
by A c, is the set of all outcom es that are not in A . Finally, the
Probability can also be interpreted from a subjective point of
bottom-right panel illustrates mutually exclusive events, which
view. Under this alternative interpretation, probability reflects
means that either A or 8 could happen, but not both together.
or incorporates an individual's beliefs about the likelihood of an
Here, the intersection A R B is the em pty set ( 0) .
event occurring. These beliefs may differ across individuals and
do not have to agree with the objective probability of an event. Recall the earlier exam ple of rolling two dice, which featured
An exam ple of this would be a senior executive assuming a sp e two events:
cific probability for an event (e.g ., a recession, an interest rate 1 . The event that the sum rolled is greater than or equal to
increase, or a rating downgrade) as part of a scenario analysis or nine; and
stress test.
2 . The event that the sum is odd.
Probability is defined over event spaces. M athem atically, it
The intersection of these two events contains the outcom es
assigns a number between 0 and 1 (inclusive) to each event in
from the sam ple space where both conditions are sim ultane
the event space. A s defined above, events are simply com bina
ously satisfied. This happens when the sum is equal to nine or
tions of outcom es (i.e., subsets of the sam ple space). It is there
11, meaning that the intersection of these two events consists of
fore possible to illustrate some core probability concepts using
the outcom es {(4,5), (5,6)}. The union of the two events includes
the language of sets.
all outcom es in both sets without duplication (i.e., the outcom es
Figure 1.1 illustrates the key properties of sets, with each rect in the set of odd sums as well as the even sums greater than
angle representing a sam ple space (II) with two events (i.e., A or equal to nine). The com plem ent of the event that the sum is
and 8). Note that the standard notation for a set uses capital odd is the set of outcom es where the sum of two dice is even:
letters (e.g ., A , 8), while subscripts (e.g ., A/) are used to denote ({1,1}, {1 ,3 }, {1 ,5 }, . . . , (5,5), (6,6)}. Finally, the intersection of
Intersection Union
Complement Mutually Exclusive
Fiaure 1.1 The top-left panel illustrates the intersection of two sets, which is indicated by the shaded area
surrounded by the dark solid line. The top right panel illustrates the union of two sets, which is indicated by the
shaded area surrounded by the dark solid line. The shaded area in the bottom-left panel illustrates Ac, which is
the complement to A and includes every outcome that is not in A. The bottom-right panel illustrates mutually
exclusive sets, which have an empty intersection (0).
Chapter 1 Fundamentals of Probability ■ 3

the event that the sum is odd with the event that the sum is 2008), the probability of another failure is likely to be higher.
even is the null set (because there cannot be a sum which is This difference is im portant, especially in risk m anagem ent.
sim ultaneously odd and even). This is an exam ple of two mutu
Conditional probability can also be used to incorporate additional
ally exclusive events.
information to update unconditional probabilities. For example,
consider two randomly chosen students preparing to take a risk
Fundamental Principles of Probability management exam, which is passed (on average) by 50% of test
takers. Based on only this information, one would estimate that both
The three fundam ental principles of probability are defined Student A and Student B have a 50% chance of passing the exam.
using events, event spaces, and sam ple spaces. Note that Pr(co)
Now suppose that the average length of study time needed
is a function that returns the probability of an event a>.
to pass the exam is 200 hours. Suppose further that Student A
1 . A ny event A in the event space T has Pr(A) > 0. studied for less than 100 hours, whereas Student B studied for
2 . The probability of all events in (1 is one and thus P r((l) = 1. more than 400 hours. Common sense would say that Student
B has a higher probability of passing com pared to Student A .
3 . If the events A 1 and A 2 are mutually exclusive, then
In fact, a survey might find that the probability of passing for
Pr(A -|U A 2) = Pr(A-|) + Pr(A2). This holds for any number n
students who studied less than 100 hours is 10%, whereas the
of mutually exclusive events, so that Pr(A -|U A 2 . . . L M n)
probability of passing for students who studied more than 400
= S " - , Pr(A)-4
hours is 80% . Using this new inform ation, the conditional prob
These three principles— collectively known as the Axiom s of ability that A passes the exam is 10%, while the conditional
Probability— are the foundations of probability theory. They probability that B passes is 80%.
imply two useful properties of probability. First, the probability
The conditional probability of the event A occurring given that B
of an event or its com plem ent must be 1.
has occurred is denoted as Pr(A| B) and can be calculated as:
P K A U A 0) = Pr(A) + Pr(A°) = 1 (1.1)
P r(A P lB )
Second, the probability of the union of any two sets can be Pr(A| B) = (1.3)
Pr(B)
decom posed into:
The probability of observing an event in A , given that an event
P r(A U B ) = Pr(A) + Pr(B) - P r(A D B ) (1.2) in B has been observed, depends on the probability of observ
This result dem onstrates that the total probability of two events ing events that are in both rescaled by the probability that an
is the probability of each event minus the probability of the event in B occurs. In other words, conditional probability oper
event defined by the intersection. The elem ents in the inter ates as if B is the event space and A is an event inside this new,
section are in both sets, therefore subtracting this term avoids restricted space.
double counting. As a simple exam ple, note that the probability of rolling a three on
a fair die is 1/6. This is because the event of rolling a three occurs
in one outcome in a set of six possible outcomes: {1, 2, 3, 4, 5, 6}.
Conditional Probability
However, if it is known that the number rolled is odd, then the
Probability is commonly used to exam ine events in subsets of probability of rolling a three is 1/3 because this additional informa
the full event space. Specifically, we are often interested in the tion restricts the set of outcomes to three possibilities: { 1 ,3 , 5}.
probability of an event happening only if another event hap
Figure 1.2 dem onstrates this idea using sets. The left panel
pens first. This concept is called conditional probability because
shows the unconditional probability where (1 is the event space
we are computing a probability on that condition that another
and A and B are events. The right panel shows the conditional
event occurs.
probability of A given B. The numeric values show the probabil
For exam ple, we might want to determ ine the probability that a ity of each region, meaning that the unconditional probability of
large financial institution fails given that another large financial A is Pr(A) = 16% and the conditional probability of A given B is
institution has also failed. Note that the probability of a large Pr(A| B) = 12%/48% = 25%.
financial institution failing is normally quite low. However, when
An im portant application of conditional probability is the Law of
one large financial institution fails (e.g ., Lehmann Brothers in
Total Probability. This law states that the total probability of an
event can be reconstructed using conditional probabilities under
4 This result holds more generally for infinite collections of mutually the condition that the probability of the sets being conditioned
exclusive events under some technical conditions. is equal to 1.

U n co n d itio n a l P ro b a b ility C o n d itio n a l P ro b a b ility , Pr (A |B J
Fiq u re 1.2 The left panel shows the unconditional probability of A and B. The right panel shows the conditional
probability of A given B. The numeric values indicate the probability of each region.
For exam ple, Pr(B) + Pr(Bc) = 1, and so: Example: SIFI Failures
Pr(A) = Pr(A| B) Pr(B) + Pr(A| B°) Pr(B°) (1.4)
System ically Important Financial Institutions (SIFIs) are a cat
This property holds for any collection of mutually exclusive egory of designated large financial institutions that are deeply
events {B,}, where ^ P r (B ,) = 1. Intuitively, this property must connected to large portions of the econom y. They are usually
hold because each outcom e in A must occur in one (and only banks, although other types of institutions such as insurers, asset
one) of the B,. m anagers, or central counterparties may be designated as SI FIs
as w ell. SI FIs are subject to additional regulation and supervi
Figure 1.3 illustrates the Law of Total Probability. Note that the
sion, as the failure of even one of them could lead to a major
probability that A occurs is 9% + 12% + 14% = 35% . The four
disruption in financial m arkets.
conditional probabilities corresponding to each set B-: are
Suppose that the chance of at least one SIFI failing in any given
Pr(A| B-,) = 14% /30% = 46.6% ,
year is 1%. Suppose further that when at least one SIFI fails,
Pr(A| B2) = 9% /22% = 40.9% , there is a 20% chance that the number of failing SI FIs is exactly
Pr(A| B3) = 12%/28% = 42.8% , 1, 2, 3, 4 or 5. In other words, there is a 0.2% chance that one
SIFI fails, 0.2% chance that two institutions fail, and so on.
Pr(A| B4) = 0
Finally, assume that more than five SI FIs cannot fail, because
These conditional probabilities are then rescaled by the proba governm ents and central banks would intervene.
bilities of the conditioning event to recover the total probability.
If we define the event that one or more SI FIs fail as E-1 , then
4
Pr(E-|) = 1 % . This is a small probability. However, if we are inter
2 P r ( A | Bi) Pr(Bf) = 46.6% X 30% + 40.9% X 22% + 42.8%
ested in the probability of two or more failures given that at
;=1 x 28% + 0 x 20% = 35%.
least one has occurred, then this is
Pr(Ei D E 2)
Pr(E2| E 1) ^ = 80%
Pr(E,) Pr(E-,) 1%
Here, E2 is the event that two or more SI FIs fail, and it is a sub
set of E -1 by construction (because E -1 is the event of one or more
failures). Because E-|D E2 = E2, P r(E-|D E2) = Pr(E2).
This conditional probability tells us that, while it would be sur

prising to see a SIFI failure in any given year, there is very likely
to be two or more failures if there is at least one failure. The left
panel of Figure 1.4 graphically illustrates this point.
exclusive. A is the event indicated by the dark circle, 1.2 IN DEPEN DEN CE
and Pr(A) = 35%. The probability of each event B;
includes the values that overlap with A as well as those Independence is a core concept that is exploited throughout
that do not. For example, Pr(B^) = 16% + 14% = 30%. statistics and econom etrics. Two events are independent if the

U n co n d itio n a l P ro b a b ility C o n d itio n a l P ro b a b ility
Fiqure 1.4 The left panel illustrates the unconditional probability of at least one SIFI failing. The right panel
illustrates the conditional probability of 1, 2, 3, 4 or 5 failures given that at least one SIFI has failed. The E, are
the events that exactly i SIFIs fail. These differ from E,, which measure the event that / or more institutions fail.
However, E 5 = E5 because the maximum number of failures is assumed to be five.
probability that one event occurs does not depend on whether words, if A-|, A 2, . . . , A n are independent events, then
the other event occurs. W hen the two events A and B are inde Pr(A 1n A 2n • • • O A n) = Pr(A-|) X Pr(A2) X . . . X Pr(An).
pendent, then:
P r(A D B ) = Pr(A) Pr(B) (1.5)

Conditional Independence
This implies that for independent events, the conditional prob
ability of the event is equal to the unconditional probability of Like probability, independence can be redefined to hold condi
the event. W hen A and B are independent, then: tional on another event. Two events A and B are conditionally
independent if:
P r(A H B ) Pr(A) Pr(B)
Pr(B| A) = Pr(B) ( 1.6 )
Pr(A) Pr(A) P r (A n B |C ) = P r (A |Q P r(B |C ) (1.7)
It is also the case that Pr(A| B) = Pr(A). The conditional probability P r(A P |B | C) is the probability of an
outcom e that is in both A and B occurring given that an out
If A and B are mutually exclusive, then B cannot occur if A
come in C occurs.
occurs. This means that the outcom e of A affects Pr(B) and so
mutually exclusive events are not independent. Another way to Note that two types of independence— unconditional and
see is to note that when Pr(A) and Pr(B) are both greater than 0, conditional— do not imply each other. Events can be both
then P r(A f)B ) = Pr(A) X Pr(B) > 0 (i.e., there must be a positive unconditionally dependent (i.e., not independent) and condi
probability that both occur). However, mutually exclusive events tionally independent. Sim ilarly, events can be independent, yet
have P r(A H B ) = 0 and thus cannot be independent. conditional on another event they may be dependent.
Independence can be generalized to any number of events The left-hand panel of Figure 1.5 contains an illustration of
using the same principle: the joint probability of the events two independent events. Each of the events A and B has a
is the product of the probability of each event. In other 40% chance of occurring. The intersection has the probability
Independence Conditional Independence
A B
A n e
24% 16% 24%
Fiaure 1.5 The left panel shows two events are independent so that the joint probability, Pr(A fl B), is the same
as the product of the probability of each event, Pr(A) X Pr(B). The right panel contains an example of two events,
A and B, which are dependent but also conditionally independent given C.

Pr(A) X Pr(B) ( = 40% X 40% = 16%) occurring, and so the Com bining these produces the conditional probability that the
requirem ents for independence are satisfied. m anager is a star given that she had a high return:
M eanwhile, the right-hand panel shows an exam ple where . , 20% x 10% 2%
Pr(Star|H igh Return) = ---- -----= = 30.8%
A and B are unconditionally dependent: Pr(A) = 40% , 6.5% 6.5%
Pr(B) = 40% and P r(A P |8 ) = 25% (rather than 16% as would be
In this case, a single high return updates the probability that a
required for independence). However, conditioning on the event
manager is a star from 10% (i.e., the unconditional probability) to
C restricts the space so that Pr(A| Q = 50%, Pr(B| Q = 50%
30.8% (i.e., the conditional probability). This idea can be extended
and Pr(A| C)Pr(B| C) = 25% , meaning that these events are
to classifying a manager given multiple returns from a fund.
conditionally independent. A further implication of condi
tional independence is that Pr(A| B f lC ) = Pr(A| C) = 50% and
Pr(B| A R C ) = Pr(B| Q = 50% , which both hold. 1.4 SUMMARY
Probability is essential for statistics, econom etrics, and data
1.3 BAYES' RULE analysis. This chapter introduces probability as a sim ple way
to understand events that occur with different frequencies.
Bayes' rule provides a method to construct conditional prob Understanding probability begins with the sam ple space, which
abilities using other probability m easures. It is both a simple defines the range of possible outcom es and events that are col
application of the definition of conditional probability and an lections of these outcom es. Probability is assigned to events,
extrem ely im portant tool. The formal statem ent of Bayes' rule is and these probabilities have three key properties.
Pr(A| B) Pr(B)
Pr(B| A) = (1.9) 1. Any event A in the event space T has Pr(A) > 0.
Pr(A)
2. The probability of all events in (1 is 1 and therefore
This is a sim ple rewriting of the definition of conditional
Pr(ft) = 1.
probability, because P r(A |B ) = P r(A n B )/P r(B ) so that
3 . If the events A-| and A 2 are mutually exclusive,
P r(A H B ) = Pr(A| B) Pr(B). Reversing these argum ents also gives
P r(A n B ) = Pr(B| A) Pr(A). Equating these two expressions and then Pr(A -|(JA 2) = Pr(A-i) + Pr(A2). This holds for
any number of mutually exclusive events, so that
solving for P r(B |A ) produces Bayes' rule. This approach uses the
unconditional probabilities of A and B, as well as information P r ( A , U A 2 . . . U A „) = 2 /=i FMA>.
about the conditional probability of A given B, to com pute the

These properties allow a com plete description of event prob
probability of B given A .
abilities using a standard set of operators (e.g ., union, inter
Bayes' rule has many financial and risk m anagem ent applica section, and com plem ent). Conditional probability extends
tions. For exam ple, suppose that 10% of fund managers are the definition of probability to subsets of all events, so that a
superstars. Superstars have a 20% chance of beating their conditional probability is like an unconditional probability, but
benchm ark by more than 5% each year, whereas normal fund only within a sm aller event space.
managers have only a 5% chance of beating their benchm ark by
This chapter also exam ines independence and conditional inde
more than 5%.
pendence. Two events are independent if the probability of both
Suppose there is a fund m anager that beat her benchm ark by events occurring is the product of the probability of each event.
7%. W hat is the chance that she is a superstar? Here, event A is In other words, events are independent if there is no inform a
the m anager significantly outperform ing the fund's benchm ark. tion about the likelihood of one event given knowledge of the
Event B is that the m anager is a superstar. Using Bayes' rule: other. Independence extends to conditional independence,
Pr(High Return | Star) Pr(Star) which replaces the unconditional probability with a conditional
Pr(Star| High Return) probability. An im portant feature of conditional independence
Pr(High Return)
is that events that are not unconditionally independent may be
The probability of a superstar is 10%. The probability of a high conditionally independent.
return if she is a superstar is 20% . The probability of a high
return from either type of m anager is Finally, Bayes' rule uses basic ideas from probability to develop
a precise expression for how information about one event can
Pr(High Return) = Pr(High Return]Star) Pr(Star)
be used to learn about another event. This is an im portant prac
+ Pr(High Return | Normal) Pr(Normal)
tical tool because analysts are often interested in updating their
= 20% x 10% + 5% x 90% = 6.5% beliefs using observed data.

The following questions are intended to help candidates understand the material. They are not actual FRM exam questions.
QUESTIONS
Short Concept Questions

1.1 Can independent events be mutually exclusive? 1.3 W hat are the sam ple spaces, events, and event spaces if
1.2 Can Bayes' rule be helpful if A and B are independent? y ° u are interested in measuring the probability of a corpo-
W hat if A and B are perfectly dependent so that B is a rate takeover and w hether * was hostile friendly?
subset of A?
Practice Questions
1.4 Based on the probabilities in the diagram below, what are a. Pr(Ac)
the values of the following? b. P r (D |A (J B (J C )
c. P r(A |A )
d. Pr(B| A)
e. Pr(C| A)
f. Pr(D| A)
g. P r((A U D )c)
h. Pr(A cn D c)
i. Are any of the four events pairwise independent?
1.6 Continue the application of Bayes' rule to com pute the

probability that a m anager is a star after observing two
years of "high" returns.
a. Pr(A)
b. P r(A |B ) 1.7 There are two com panies in an econom y, Com pany A and
c. Pr(S| A) Com pany B. The default rate for Com pany A is 10%, and
d. P r(A n e n c ) the default rate for Com pany B is 20% . Assum e defaults
e. P r (B |A D Q for the two com panies occur independently.
f. P r (A D B |Q a. W hat is the probability that both com panies default?
g. P r ( A U B |Q b. W hat is the probability that either Com pany A or
h. Considering the three distinct pairs of events, are any Com pany B defaults?
of these pairs independent?
1.8 C red it card com panies rapidly assess transactions
1.5 Based on the probabilities in the plot below, what are the for fraud. In each day, a large card issuer assesses
values of the following? 10,000,000 transactions. O f these, 0.001% are fraudulent.
If their algorithm identifies 90% of all fraudulent transac
tions but also 0.0001% of legitim ate transactions, what
is the probability that a transaction is fraudulent if it has
been flagg ed ?
1.9 If fund managers beat their benchmark at random (meaning

a 50% chance that they beat their benchm ark), and annual
returns are independent, what is the chance that a fund
m anager beats her benchm ark for ten years in a row?

ANSW ERS

1.1 No. The probability of observing both must be the prod probabilities, Pr(A)/Pr(B). Here the probability also only
uct of the individual probabilities, and mutually exclusive depends on unconditional probabilities and so never
events have probability 0 of sim ultaneously occurring. changes.
Pr(B| A)Pr(A) 1.3 Using the notation No Takeover (NT), Hostile Takeover
1.2 Bayes' rule says that Pr(A| B) = . If these
Pr(B) (HT) and Friendly Takeover (FT), the sam ple space is {N T,
events are independent, then Pr(B| A) = Pr(B) so that HT, H T}. The events are all distinct com binations of the
Pr(A| B) = Pr(A). B has no information about A and so sam ple space including the em pty set, </>, NT, HT, FT,
updating with Bayes' rule never changes the conditional {N T, H T}, {N T, F T }, {H T, F T }, {N T, HT, F T }. The event
probability. If these events are perfectly dependent, space is the set of all events.
then Pr(B | A) = 1, so that Pr(A | B) is just the ratio of the
Solved Problems
1.4 a. Pr(A) = 12% + 11% + 15% + 10% = 48% c. This is trivially 100%.
b. Pr(A| B) = P r(A rW P r(B ) = (11% + 10%)/ d. P r(B H A ) = 9%. The conditional probability is
(11% + 10% + 16% + 18%) = 38.2%
c. Pr(B| A) = P r(A rW P r(A ) = (11% + 10%)/48%
e. There is no overlap and so P r(C n A ) = 0.
= 43.8%
f. P r(D H A ) = 9%. The conditional probability is 30%.
d. P r (A D B n C ) = 10%
g. This is the total probability not in A or D. It is
e. Pr(B|ADC) = Pr(BnADC)/Pr(AnC)
1 - P r(A U D ) = 1 - (Pr(A) + Pr(D) - P r(A D D ))
= 10% /(15% + 10%) = 40%
= 100% - (30% + 36% - 9%) = 43%.
f. Pr(ADB|C) = Pr(AnenC)/Pr(C)
h. This area is the intersection of the space not in A with
= 10%/(15% + 10% + 18 + 10%) = 18.9%
the space not in D. This area is the same as the area
g. P r (A U B |C ) = P r((A (JB )n C )/P r(C ) that is not in A or D, P r((A (JD )c) and so 43% .
= (15% + 10% + 18%)/53% = 81.1%
i. The four regions have probabilities A = 30%,
h. We can use the rule that events are indepen B = 30% , C = 28% and D = 36% . The only region
dent if their joint probability is the product of the that satisfied the requirem ent that the joint probability
probability of each. Pr(A) = 48% , Pr(B) = 55%, is the product of the individual probabilities is A and B
Pr(C) = 53%, P r(A D B ) = 21% # (48% X 55%), because P r(A D B ) = 9% = Pr(A)Pr(B) = 30% X 30%.
P r(A D C ) = 25% ^ (48% X 53%),
1.6 Consider the three scenarios: (High, High), (High, Low)
P r(B D C ) = 28% ^ (55% X 53%). None of these
and (Low, Low).
events are pairwise independent.
We are interested in Pr (Star\High, High) using Bayes'
1 .5 a. 1 - Pr(A) = 100% - 30% = 70%
P r(High, High] Star)Pr(Star)
rule, this is equal t o -------- ^
P
-'
b. This value is P r(D n (A U B U C ))/P r(A (JB (JC ).
The total probability in the three areas A , B, and Stars produce high returns in 20% of years, and so
C is 73% . The overlap of D with these three is Pr (High, H igh\Star) = 20% X 20% Pr (Star) is still 10%.
9% + 8% + 7% = 24% , and so the conditional prob Finally, we need to com pute Pr (High, High), which is
24% Pr (High, H ig h jS ta r) Pr (Star) + Pr(H/gh, High\ Normal)
ability is 33%.
73%

Pr(Normal). This value is 20% X 20% X 10% + 5% X 5% 1.8 We are interested in Pr(Fraud|Flag). This value is
= 90% = 0.625% . Com bing these values, Pr(Fraud D FIag )/Pr(Flag ). The probability that a trans
20% X 20% X 10% action is flagged is .001% X 90% + 99.999% X 0.0001%
.625% = .000999% . The P r(F ra u d n F la g ) = .001% X 90%
This is a large increase from the 30% chance after one year. .0009
= .0009. Com bining these values, = 90% . This
.000999
1.7 a. 0.10*0.20 = 0.02 —»• 2% indicates that 10% of the flagged transactions are not
b. Calculate this in two ways: actually fraudulent.
i. Using Equation 1.1: 1.9 If each year was independent, then the probability of
P r(A U B ) = Pr(A) + Pr(B) - P r(A D B ) beating the benchm ark ten years in a row is the product
P r(A U B ) = Pr(A) + Pr(B) - P r(A D B ) of the probabilities of beating the benchm ark each year,
= 10% + 20% - 2% = 28%.
1Y ° = 1 . 1% .
2 1,024
ii. Using the identity: P r(A U A c) = Pr(A) + Pr(Ac) = 1
Let C be the event of no defaults. Then
Pr(C) = (1 - (1 - Pr(A))*(1 - Pr(B))
= (1 - 10%)(1 - 20%) = 0.9*0.8 = 72%.
Then the com plem ent of C is the event of any
defaults Pr(C c) = 1 - Pr(C) = 28%.

Random Variables
Learning Objectives
D escribe and distinguish a probability mass function from Characterize the quantile function and quantile-based
a cumulative distribution function, and explain the rela estim ators.
tionship between these tw o.
Explain the effect of a linear transform ation of a random
Understand and apply the concept of a mathematical variable on the mean, variance, standard deviation, skew
expectation of a random variable. ness, kurtosis, m edian, and interquartile range.
D escribe the four common population moments.
Explain the differences between a probability mass func

tion and a probability density function.
11
Probability can be used to describe any situation with an mass function, a continuous random variables depends on a prob
elem ent of uncertainty. However, random variables restrict ability density function (PDF). However, this difference is of little
attention to uncertain phenomena that can be described with consequence in practice, and the concepts introduced for dis
numeric values. This restriction allows standard m athem ati crete random variables all extend to continuous random variables.
cal tools to be applied to the analysis of random phenom ena.
For exam ple, the return on a portfolio of stocks is numeric,
and a random variable can therefore be used to describe its 2.1 DEFINITION O F A RANDOM
uncertainty. The default on a corporate bond can be similarly VARIABLE
described using numeric values by assigning one to the event
that the bond is in default and zero to the event that the bond is The axiom s of probability are general enough to describe
in good standing. many forms of random ness (e.g ., a coin flip, a draw from a
This chapter begins by defining a random variable and relating well-shuffled deck of cards, or a future return on the S&P 500).
However, directly applying probability can be difficult because
its definition to the concepts presented in the previous chapter.
The initial focus is on discrete random variables, which take on it is defined on events, which are abstract concepts. Random
variables limit attention to random phenomena which can be
distinct values. Note that most results from discrete random
variables can be com puted using only a weighted sum. described using numeric values. This covers a wide range of
applications relevant to risk m anagem ent (e.g ., describing asset
Two functions are commonly used to describe the chance of returns, measuring defaults, or quantifying uncertainty in param
observing various values from a random variable: the probability
eter estim ates). The restriction that random variables are only
mass function (PMF) and the cumulative distribution function defined on numeric values sim plifies the mathematical tools
(CD F). These functions are closely related, and each can be
used to characterize their behavior.
derived from the other. The PM F is particularly useful when defin
ing the expected value of a random variable, which is a weighted A random variable is a function of co (i.e., an event in the sample
average that depends on both the outcomes of the random vari space (1) that returns a number x. It is conventional to write a
random variable using upper-case letters (e.g ., X, Y, or Z) and to
able and the probabilities associated with each outcome.
express the realization of that random variable with lower-case
M oments are used to summarize the key features of random letters (e.g ., x, y, or z). It is im portant to distinguish between
variables. A moment is the expected value of a carefully chosen
these tw o: A random variable is a function, whereas a realization
function designed to measure a characteristic of a random vari
is a number. The relationship between a realization and its corre
able. Four moments are commonly used in finance and risk man
sponding random variable can be succinctly expressed as:
agem ent: the mean (which measures the average value of the
x = X{o>) (2.1)
random variable), the variance (which measures the spread/dis-
persion), the skewness (which measures asym m etry), and the For exam ple, let X be the random variable defined by the roll of
kurtosis (which measures the chance of observing a large devia a fair die. Then we can denote x as the result of a single roll (i.e.,
tion from the m ean).1 one event). The probability that the random variable is equal to
five can be expressed as:
Another set of im portant measures for random variables
includes those that depend on the quantile function, which is Pr(X = x) when x = 5
the inverse of the CDF. The quantile function, which can be used
Univariate random variables are frequently distinguished based
to map a random variable's probability to its realization, defines
on the range of values produced. The two classes of random vari
two moment-like m easures: the median (which measures the
ables are discrete and continuous. A discrete random variable is
central tendency of a random variable) and the interquartile
one that produces distinct values. A continuous random variable
range (which is an alternative measure of spread). The quantile
produces values from an uncountable set, (e.g., any number on
function is also used to simulate data from a random variable
the real line IR).
with a particular distribution later in this book.
The chapter concludes by examining continuous random vari

ables, which produce values that lie in a continuous range.
W hereas a discrete random variable depends on a probability 1
*
2 The set of values that can be produced by a discrete random variable

1 The skewness and kurtosis are technically standardized versions of the can be either finite or have an infinite range. When the set of values is
third and fourth moment. infinite, the set must be countable.

1 .0 1 --------------
0.8
0.6
0.4
0.2
0 .0 --------------■
-------------- '-------------- •
- 0.5 0.0 0.5 1.0 1.5
random variable with p = 0.7. The right panel shows the
corresponding cumulative distribution function.
2.2 DISCRETE RANDOM VARIABLES Note that fx(0) = p°(1 — p )1 = 1 — p a n d fx O ) = p 1(1 — p)° = p,
so that the probability of observing 0 is 1 — p and the probability
A discrete random variable assigns a probability to a set of dis of observing 1 is p. The PM F is denoted by f^x) to distinguish the
tinct values. This set can be either finite or contain a countably random variable X underlying the mass function from the realiza
infinite set of values. An im portant exam ple of a discrete ran tion of the random variable (i.e., x).
dom variable with a finite number of values is a Bernoulli ran The PM F of a Bernoulli can be equivalently expressed using a
dom variable, which can only take a value of 0 or 1. They are list of values:
frequently encountered when measuring binary random events
(e.g ., the default of a loan).
«*) = { 1 - P X = ° (2.3)
I p x = 1
Because the values of random variables are always numerical,
random variables can be described precisely using mathematical The counterpart of a PM F is the cum ulative distribution function
functions. The set of values that the random variable may take is (C D F), which measures the total probability of observing a value
called the support of the function. For exam ple, the support for less than or equal to the input x (i.e., Pr(X < x)). In the case of a
a Bernoulli random variable is {0 ,1 }. Two functions are frequently Bernoulli random variable, the C D F is a sim ple step function:
used when describing the properties of a discrete random vari
( 0 x< 0
able's distribution.
Fx(x) = l 1 — p 0 < x < 1 (2.4)
The first function is known as the probability mass function [ 1 x> 1
(PM F). This function returns the probability that a random vari
Because the C D F measures the total probability that X < x, it
able takes a certain value. Because the PM F returns probabili
must be m onotonic and increasing in x. Note that while the PM F
ties, any PM F must have two properties.
is a discrete function (i.e., it only has support for 0 and 1), the
1. The value returned from a PM F must be non-negative. C D F returns Pr(X < x) and is therefore a continuous function
that has support for any value of x, not just 0 and 1.
2. The sum across all values in the support of a random vari
able must be one. Figure 2.1 shows the PM F (left panel) and C D F (right panel) of a
Bernoulli random variable with param eter p = 0.7 (i.e., there is
For exam ple, if X is a Bernoulli random variable and the proba a 70% chance of observing x = 1). The PM F is only defined for
bility that X = 1 is denoted with p (where p is between 0 and 1), the two possible outcom es (i.e., 0 or 1). In contrast, the C D F is
the PM F of X is defined for all values of x, and so is a step function that has the
fxW = p V - p )’ " x (2.2) value 0 for x < 0, 0.3 for 0 < x < 1, and 1 for x > 1.
and the inputs to the PM F are either 0 or 1.3 The PM F and the C D F are closely related, and the C D F can
always be expressed as the sum of the PM F for all values in the
support that are less than or equal to x:
3 The leading example of a countably infinite set is the integers, and any
countably infinite set has a one-to-one relationship to the integers. FxM = 2 (2.5)
t e R ( x ) ,t < x
Chapter 2 Random Variables ■ 13

where R(X) is used to denote the range of realizations that If the strike price K = 40, then the payoff of the call option can
may be produced from the random variable X and G means be written as
"an elem ent of." Conversely, the PM F is the difference of the
c(S) = m ax(S - 40, 0)
C D Fs evaluated at consecutive values in the support of X. For a
Bernoulli random variable: Thus, the expected payoff of the call option is
E[c(S)] = 0.2 X max(20 — 40,0) + 0.5 X max(50 - 40,0)

m = Fx(x) - F x f x - 1) (2.6)
+ 0.3 X m ax(100 — 40,0)
for x G {0,1}.
= 0.2 x 0 + 0.5 x 10 + 0.3 x 60
= 23.
2.3 EXPECTATIONS
The expected value of a random variable is denoted E[X] and is Properties of the Expectation Operator
defined as:
The expectation operator E[ - ] is like a function whose input is
another function (i.e., a random variable). It perform s a specific
E[X] = 2 x Pr<x = *>' (2.7)
xGR(X) task on a random variable: computing a weighted average of its
possible values.
where R(X) is the range of values of the random variable X. For
exam ple, when X is a Bernoulli random variable where the prob The expectation operator has one particularly useful property in
ability of observing 1 is p: that it is a linear operator.4 An im m ediate implication of linearity
is that the expectation of a constant is the constant (e.g .,
E[X] = 0 X (1 - p) + 1 X p = p (2.8)
E[a] = a). A related implication is that the expectation of the
Thus, an expectation is simply an average of the values in expectation is just the expectation (e.g ., E[E[X]] = E[X]). Further
the support of X weighted by the probability that a value x is more, the expected value of a random variable is a constant
observed. (and not a random variable).
Functions of random variables are also random variables and W hile linearity is a useful property, the expectation operator
therefore have expected values. The expectation of a function f does not pass through nonlinear functions of X. For exam ple,
of a random variable X is defined as: E[1/X ] 1 /E[X ] w henever X is a non-degenerate random vari
ab le .5 This property holds for other nonlinear functions so that
E[/(X)] = 2 = x), (2.9) in general6 E[g(X)] # g(E[X]).
xeR(X)
where f(x) is a function of the realization value x.
For exam ple, when X is a Bernoulli random variable, then the 2.4 MOMENTS
expected value of the exponential of X is
A s stated previously, moments are a set of commonly used
E[exp(X)] = exp(0) X (1 — p) + exp(1) X p = (1 — p) + p exp(1)
descriptive measures that characterize im portant features of ran
As an exam ple, consider a call option (i.e., a derivative contract dom variables. Specifically, a moment is the expected value of
with a payoff that is a nonlinear function of the future price of a function of a random variable. The first moment is defined as
an asset). The payoff of the call option is c(S) = m ax(S - K, 0), the expected value of X:
where S is the value of the asset and K is the strike price. If the
A1 = E[X] (2.10)
asset price is above the strike price, the call option pays out
S — K. If the price is below the strike price, the call option pays The second and higher moments all share the same form and
zero. Valuing a call option involves computing the exp ecta only differ in one parameter. These moments are defined as:
tion of a function (i.e., the payoff) of the price (i.e., a random Ar = E[(X - E[X])r], (2.11)
variable).
Suppose the value of an asset S takes on one of three values

4 Note that if a, b, and c are constants, and X and Y are random vari
(i.e., 20, 50, or 100) in one year. The PM F of S is
ables, then E[a + bX + cY] = a + bE[X] + cE[Y].
f 0.2 if s = 20 5 A degenerate random variable is a random variable that always returns
fs(s) = < 0.5 if s = 50 the same values, and thus is a constant.
[ 0.3 if s = 100 6 Except when g(X) is a linear function.

JEN SEN 'S IN EQ U ALITY
Jensen's inequality is a useful rule that provides some insight Figure 2.2 illustrates Jensen's inequality for both convex and
into the expectation of a nonlinear function of a random vari concave functions. In each panel, the random variable is dis
able. Jensen's inequality applies when a nonlinear function is crete and can only take on one of two values, each with proba
either concave or convex. bility of 50%. The two points in the center of each graph show
the function of the expectation (which is always on the curved
• A function g(x) is concave if g((1 - t)x + tx) > (1 — t)g(x)
graph of the function) and the expected value of the function
+ tg(x), for all t in the interval [0,1 ] and x in the domain
(which is always on the straight line connecting the two values).
of g.
• A function h(x) is convex if h(( 1 - t)x + tx) < (1 — t)h(x) A s an exam ple of a convex function, suppose you play a
+ th(x), for all t in the interval [0,1 ] and x in the domain game in which you roll a die and receive the square of the
of h. number rolled. Let X represent the die roll. We have that the
payoff function is f(X) = X 2. Then:
If h(X) is a convex function of X, then E[h(X)] > h(E[X]). Sim i
E[x] = = 3.5
larly, if g(X) is a concave function, then E[g(x)] < g(E[X]).
1 + 2 + 3 + 4 + 5 + 6
6
C onvex and concave functions are common in finance (e.g ., 1 + 4 + 9 + 16 + 25 + 36
E[f(X)] = E[X2] = 15.167
many derivative payoffs are convex in the price of the under 6
lying asset). It is useful to understand that the expected
value of a convex function is larger than the function of the and:
expected value. f(E[X]) = f( 3.5) = 3 .5 2 = 12.5.
Convex Function Concave Function
Figure 2.2 These two panels demonstrate Jensen's inequality. The left shows the difference between the
expected value of the function E[H(x)] and the function at the expected value, H(E[X]), for a convex function.
The right shows this difference for a concave function.
w here r denotes the order of the m om ent (2 for the second, 3 There is also a second class of m oments, called non-central
for the third, and so on). In the definition of the second and m om ents.8 These moments are not centered around the mean
higher m om ents, the first m om ent is su b tracted , which and thus are defined as:
ensures that changes in the first m om ent have no effect on the
/zrN C= E [ X r] (2.12)
higher m o m ents.7
8 The non-central moments, which do not subtract the first moment, are
less useful for describing random variables because a change in the first
moment changes all higher order non-central moments. They are, how
7 Formally these moments are known as central moments because they ever, often simpler to compute for many random variables, and so it is
are all centered on the first moment, E[X], common to compute central moments using non-central moments.

Non-central moments and central moments contain the same across distributions with different first and second m oments, and
information and the rth central moment can be directly con means that the skewness is unit-free by construction.
structed from the first r non-central m oments. For exam ple, the
The skewness measures asym m etry in a distribution, because
second central moment can be calculated as:9
the third power depends on the sign of the difference. Negative
E[(X - E[X])2] = E[X2] - E[X]2 = M2n c - ( ^ c)2 values of the skewness indicate that the chance of observing a
large (in magnitude) negative value is higher than the chance of
Note that the first central and non-central moments are the
observing a similarly large positive value. A sset returns are fre
same:
quently found to have negative skewness.
The fourth standardized moment, known as the kurtosis, is

defined as:
The Four Named Moments
E[(X - E[X])4
kurtosis(X) (2.14)
There are four common moments: the mean, the variance, the
skewness, and the kurtosis. The mean is the first moment and is
Its definition is analogous to that of skewness, excep t that
denoted
kurtosis uses the fourth power instead of the third. The use
A = E[X] of the fourth power makes the kurtosis especially sensitive to
The mean is the average value of X and is also referred to as the large deviations of either sign. This makes it useful in analyzing
location of the distribution. the chance of observing a large (absolute) return. Specifically,
the kurtosis of a random variable is commonly benchm arked
The variance is the second moment and is denoted by:
against that of a normally distributed random variable, which is
cr2 = E[(X - E[X])2] = E[(X - ,u)2] 3. Random variables with kurtosis greater than 3 are described
The variance is a measure of the dispersion of a random vari as being heavy-tailed/fat-tailed (i.e., having excess kurtosis).
able. It m easures the squared deviation from the m ean, and Financial return distributions have been widely docum ented to
thus it is not sensitive to the direction of the difference relative be heavy-tailed. Like skewness, kurtosis is naturally unit-free and
to the mean. can be directly com pared across random variables with different
means and variances.
The standard deviation is denoted by a and is defined as the
square root of the variance (i.e., a / ct 2). It is a commonly
reported alternative to the variance, as it is a more natural m ea
sure of dispersion and is directly com parable to the mean
STANDARDIZING RANDOM
(because the units of m easurem ent of the standard deviation are
VARIABLES
the same as those of the m ean).10
W hen X has mean /z and variance cr2, a standardized ver
The final two named moments are both standardized versions of
central moments. The skewness is defined as: sion of X can be constructed as . This variable has
a
mean 0 and unit variance (and standard deviation). These
X - IX.
skew(X). g t * - g * i >3 . E (2.13) can be verified because:
(T a
E[X] - /x
The final version in the definition clearly dem onstrates that the
a
skewness measures the cubed standardized deviation, (see box:
Standardizing Random Variables) Note that the random variable where /x and a are constants and therefore pass through
(X — /jl )/ ct is a standardized version of X that has mean 0 and the expectation operator. The variance is
variance 1. This standardization makes the skewness com parable
9 Note that E[(X - E[X])2] = E[X2 - 2xE[X] + E[X]2] = E[X2] - 2E[X] E[X] because adding a constant has no effect on the variance
+ E[X]2 = E[X2] - E[X]2 (i.e., V [X - c] = V[X]) and the variance of b X is
10 The units of the variance are the squared units of the random variable. b2V[X],
For example, if X is the return on an asset, measured in percent, then
the units of the mean and standard deviation are both percentages. The when b and c are constants.
units of the variance are squared percentages.

M ean S ta n d a rd D e v ia tio n
Skew Kurtosis
Fiqure 2.3 The four panels demonstrate the effect of changes in the four moments. The top-left panel shows the
effect of changing the mean. The top-right compares two distributions with the same mean but different standard
deviations. The bottom-left panel shows the effect of negative skewness. The bottom-right shows the effect of
heavy tails, which produces excess kurtosis.
Figure 2.3 shows the effect of changing each moment. The top- bottom-right graph shows the effect of heavy tails, which pro
left panel com pares the distribution of two random variables duce excess kurtosis. The heavy-tailed distribution has more
that differ only in the mean. The top-right com pares two ran probability mass near the center and in the tails, and less in the
dom variables with the same mean but different standard devia intervals (—2, —1) and (1,2).
tions. In this case, the random variable with a sm aller standard
deviation has both a narrower and a taller density. Note that
this increase in height is required when the scale is reduced to Moments and Linear Transformations
ensure that the total probability (as measured by the area under
Many variables used in finance and risk m anagem ent do not
the curve) remains 1.
have a natural scale. For exam ple, asset returns are commonly
The bottom -left panel shows the effect of a negative skew on expressed as proportions or (if multiplied by 100) as percent
the distribution of a random variable. The skewed distribu ages. This difference is an exam ple of a linear transform ation. It
tion has a short right tail that abruptly cuts off near 2 as well is helpful to understand the effect of linear transform ations on
as a long left tail which extends to the edge of the graph. The the first four moments of a random variable.

Let Y = a + bX, where a and b are both constant values. It For exam ple, consider
is common to refer to a as a location shift and b as a scale,
• The interval [0,1 ], and
because these directly affect the mean and standard deviation
• A 10-sided die with the numbers {0 .0 5 ,0 .1 5 , ... ,0.95}.
of Y. The mean of Y is
Note that there is a 10% probability of getting any single num
E[Y] = a + bE[X]
ber. Now suppose the die has 1,000 sides with the numbers
This follows directly from the linearity of the expectation opera {0.0005, ... ,0.9995} and thus there is a 0.1% chance of any
tor. The variance of Y is single number appearing. As the number of points (or number of
b2V[X] sides on the die) increases, the probability of any specific value
decreases. Therefore, as the number of points becom es infinite,
or equivalently:
the probability of a single point approaches zero. In other words,
a continuous distribution is like a die with an infinite number of
sides. In order to define a probability in a continuous distribution,
Note that the location shift a has no effect on the variance
because the variance measures deviations around the mean. The it is therefore necessary to exam ine the values b etw een points.
standard deviation of Y is Recall that a PM F is required to sum to 1 across the support of
V b 2a 2 = |b|cr X. This property also applies to a PDF, excep t that the sum is
replaced by an integral:
The standard deviation is also insensitive to the shift by a and
is linear in b. Finally, if b is positive (so that Y = a + b X is an fx(x)dx - 1
increasing transform ation), then the skewness and kurtosis of Y
are identical to the skewness and kurtosis of X. This is because The PD F has a natural graphical interpretation and the area
both moments are defined on standardized quantities (e.g ., under the PD F between x L and Xu is Pr(xL < X < x j).
(Y — E[Y])/ v V [Y ]), which removes the effect of the location
M eanwhile, the C D F of a continuous random variable is identical
shift by a and rescaling by b. If b < 0 (and thus Y = a + b X is
to that of a discrete random variable. It is a function that returns
a decreasing transform ation), then the skewness has the same
the total probability that the random variable X is less than or
m agnitude but the opposite sign. This is because it uses an odd
equal to the input (i.e., Pr(X < x)). For most common continuous
power (i.e., 3). The kurtosis, which uses an even power (i.e., 4), is
random variables, the PD F and the C D F are directly related. The
unaffected when b < 0.
PD F can be derived from the C D F by taking the derivative:
BFxjx)
2.5 CONTINUOUS RANDOM c)X
VARIABLES Conversely, the C D F is the integral of the PD F up to x:
A continuous random variable is one with a continuous sup

Fx(x) = I fx(z)dz
port. For exam ple, it is common to use a continuous random J — oc
variable when modeling asset returns so that the return can be The left panel of Figure 2.4 illustrates the relationship between
expressed at the accuracy level of the asset's price (e.g ., dollars the PD F and C D F of a continuous random variable. The random
or cents). W hile many continuous random variables have support variable has support on (0,°c)# meaning that the C D F at the
for any value on the real number line (i.e., R), others have sup point x (i.e., Fx(x)) is the area under the curve between 0 and x.
port for a defined interval on the real number line (e.g ., [0,1]).
The riqht panel shows the relationship between a PD F and the
Continuous random variables use a probability density function area in an interval. The probability P r (- 1 < X < | ) is the area
(PDF) in place of the probability mass function. The PD F f(X) under the PD F between these two points. It can be com puted
returns a non-negative value for any input in the support of X. from the C D F as:
Note that a single value of f{X) is technically not a probability
Fxi 1/3) - FxH-1/2) = P r ( X < 1/3) - Pr(X < - 1 / 2 )
because the probability of any single value is always 0. This is
necessary because there are infinitely many values x in the sup The probabilities Pr(a < X < b), Pr(a < X < b), Pr(a < X < b),
port of X. If the probability of each value was larger than 0, then and Pr(a < X < b) are the same when X is a continuous random
the total probability would be larger than 1 and thus violate an variable. However, this is not necessarily the case when X is a
axiom of probability. discrete random variable.

Fiqure 2.4 The left panel shows the relationship between the PDF and CDF of a random variable. The right
panel illustrates the probability of an interval.
The expectation of a continuous random variable is also an inte panel shows a more accurate approxim ation with a much sm aller
gral. The mean E[X] is approxim ation error. As the number of points in the approxim a
tion increases, the accuracy of the approxim ation increases as
E[X] = / xfx (x)dx (2.15) well.
J R(X)
The PD F and the C D F of many common random variables are
Superficially, this appears to be different from the exp ecta
com plicated. For exam ple, the most widely used continuous
tion of a discrete random variable. However, an integral can be
distribution in finance and econom ics is known as a normal (or
approxim ated as a sum so that:
Gaussian) distribution. The PD F of a normal variable is
E[X] - £ x Pr(X = x)
where the sum is over a range of values in the support of X. The (2.16)
relationship between continuous and discrete random variables
is illustrated in Figure 2.5. The left panel shows a seven-point where /x and cr2 are param eters that determ ine the mean and
PM F that is a coarse approxim ation to the PD F of a normal variance of the random variable (respectively). The C D F of a nor
random variable. This approxim ation has large errors (which are mal random variable does not have a closed form expression,
illustrated by gaps between the PM F and the PD F). The right and therefore it is calculated using numerical methods.
Coarse Approximation Fine Approximation
Fiaure 2.5 These two graphs show discrete approximations to a continuous random variable. The left shows a
coarse approximation, and the right shows a fine approximation.

Fortunately, it is rare to work directly with the PD F of a continu so that the quantile function is the inverse of the CDF. Recall
ous random variable. The next chapter describes several com that the C D F transform s values in the support of X to the cum u
mon continuous random variables that are used throughout lative probability. These values are always between 0 and 1. The
finance and risk m anagem ent. These common distributions quantile function thus maps a cumulative probability to the cor
have well-docum ented properties, and so it is not necessary to responding quantile.
directly com pute moments or most other required m easures.
Figure 2.6 shows the relationship between the C D F and the quan
tile function of a continuous random variable. The left panel shows
2.6 QUANTILES AND MODES that the C D F Fx(x) maps realizations to the associated quantiles.
The right panel shows how the quantile function Qx(a) = F x 1(a)
Moments are the most common measures used to describe maps quantiles to the values where Pr(X < x) = a. This means that
and com pare random variables. They are well understood and the quantile function is the reflection of the C D F along a 45° line.
capture im portant features. However, quantiles can be used to Q uantiles are used to construct useful descriptions of random
construct an alternative set of descriptive m easures of a random variables that com plem ent moments. Two are common: the
variable. For a continuous or discrete random variable X, the median and the interquartile range.
a-quantile X is the sm allest number q such that Pr(X < q) = a.
The median is defined as the 50% quantile and measures the
This is traditionally denoted by:
point that evenly splits the probability in the distribution (i.e., so
inf {q: Pr(X < q) = a}, (2.17) that half of the probability is below the median and half is above
where a G [0,1 ] and inf selects the sm allest possible value satis it). The median is a measure of location and is often reported
fying the requirem ent that Pr(X < q) = a. The inf is only needed along with the mean. W hen distributions are sym m etric, the
when measuring the quantile of distributions that have regions mean and the median are identical. W hen distributions are
with no probability. skew ed, these values generally differ (e.g ., in the common case
of negative skew, the mean is below the median). The interquar
In most applications in finance and risk m anagem ent, the
tile range (IQR) is defined as the difference between the 75%
assumed distributions are continuous and without regions of
and 25% quantiles. It is a measure of dispersion that is com pa
zero probability. This means that for each a, there is exactly one
rable to the standard deviation. The IQR depends on values that
number q such that Pr(X < q) = a. In this im portant case, the
are relatively central and never in the tail of the distribution. This
quantile function can be defined as such that Q(a) returns the
makes it less sensitive to changes in the extrem e tail of a distri
a-quantile q of X and:
bution, which manifest as outliers (i.e., observations that are far
Q » = F x V ), (2.18) from many of the data points) in a data set.
CDF Quantile Function
Fiaure 2.6 The left panel shows the CDF of a continuous random variable, which maps values x to the associated
quantile (i.e., F^x)). The right panel shows the quantile function, which maps quantiles to the corresponding
inverse CDF value (i.e., Qx(«) = F~1(a:)). It can easily be seen that either graph is a reflection of the other through
the diagonal (i.e., with the axes flipped).

Q uantiles are also noteworthy because they are always well- variable. Specifically, the mode is a measure of com m on ten
defined for any random variable. M oments, even the mean, may dency. It measures the location of the most frequently observed
not be finite when a random variable is very heavy-tailed. The values from a random variable. W hen a random variable is con
m edian, or any other quantile, is always well-defined even in tinuous, the mode is the highest point in the PDF.
cases where the mean is not. The top panels of Figure 2.7 show
Note that random variables may have one or more modes.
the relationship between the mean and the median in a sym
W hen a distribution has more than one m ode, it is common to
m etric distribution (left) and in a positively skewed distribution
label any local maximum in the PD F as a m ode. These distri
(right). The bottom-left panel relates the IQR to a 2cr interval in
butions are described as bimodal (if there are two modes) or
a normal distribution.
multimodal (if there are more than two modes). The bottom-
O ne final measure of central tendency, the m ode, is also fre right panel of Figure 2.7 shows a bimodal and a multimodal
quently used to characterize the distribution of a random distribution.
Symmetric Distribution Asymmetric Distribution
IQR and Standard Deviation Multimodal Distributions
Figure 2.7 The top panels demonstrate the relationship between the mean, median, and mode in a symmetric,
unimodal distribution (left) and a unimodal right-skewed distribution (right). The bottom-left panel demonstrates
the relationship between the interquartile range and the standard deviation. The bottom-right panel contains
examples of both a bimodal and a multimodal distribution.

2.7 SUMMARY 3 . The skewness, which measures asym m etry; and
4 . The kurtosis, which measures the relative probability of

This chapter introduced random variables, which are functions observing an extrem e value.
from the sam ple space that return numeric values. This property
Financial variables are often rescaled, and these four moments
allows us to apply standard results from m athem atics to charac
all change in easily understandable ways.
terize their behavior. The probability of observing various values
of a random variable is summarized by the probability mass • The mean is affected by location shifts and rescaling.
function (PM F) and the cum ulative distribution function (CD F). • The variance is only affected by rescaling.
These two functions are closely related: The PM F returns the
• The skewness and kurtosis are unaffected by increasing linear
probability of a specific value x, while the C D F returns the total
transform ations.
probability that the random variable is less than or equal to x.
Finally, the quantile function allows two other common m ea
Expectations are a natural method to com pute the average value
sures to be defined:
of a random variable (or a function of a random variable). When
a random variable X is discrete, the expectation is just a sum of 1 . The median, which is an alternative measure of central
the outcomes weighted by the probabilities of each outcom e. tendency.
2 . The interquartile range, which measures dispersion.

Moments are the expected values for sim ple functions of a ran
dom variable. The four key moments are Later chapters show how these m easures— both moments and
quantiles— are used to enhance our understanding of observed
1 . The mean, which measures the average value of the random
data.
variable;
2. The variance, which is a measure of deviation from the mean;

QUESTIONS

2.1 W hat is the key difference between a discrete and a con 2 .5 W hat is excess kurtosis?
tinuous random variable?
2 .6 How are quantile functions related to the C D F? W hen X is
2 .2 W hat do a PM F and a C D F m easure? How are these a continuous random variable, when does this relationship
related? not hold?
2 .3 W hat two properties must a PM F satisfy? 2 .7 W hat are the median and interquartile range? W hen is the
2 .4 How are the mean, standard deviation, and variance of Y median equal to the mean?
related to these values of X when Y = a + bX ?
Practice Questions
2 .8 W hat is E[XE[X]]? Hint: recall that E[X] is a constant. 2 .1 0 Suppose the return on an asset has the following
2 .9 Consider the following data on a discrete random distribution:
variable X:
Return Probability
X -4 % 6%
1 - 2 .4 5 6 -3 % 9%
2 - 3 .3 8 8 -2 % 11%
3 - 6 .8 1 6 -1 % 12%
4 1.531 0% 14%
5 1.737 1% 16%
6 - 1 .2 5 4 2% 15%
7 - 1 .1 6 4 3% 8%
8 1.532 4% 5%
9 2.550 5% 4%
10 0.296 a. Com pute the mean, variance, and standard deviation.
11 - 0 .9 7 9 b. Verify your result in (a) by computing E[X2] directly and

using the alternative expression for the variance.
12 - 4 .2 5 9
c. Is this distribution skew ed?
13 2.810 d. Does this distribution have excess kurtosis?
14 - 1 .6 0 8 e. W hat is the median of this distribution?
15 - 0 .5 7 5
a. Calculate the mean and variance of X.

b. Standardize X and check that the mean and variance
of X are 0 and 1 respectively.
c. Calculate the skew and kurtosis of X.

ANSW ERS

2.1 Discrete random variables have support on only a finite PDF
(or possibly countably infinite) set of points. These points
are all distinct in the sense that it there is always some
number between any two points in the support. A con
tinuous random variable has support over a continuous
range of values. This range can be finite, as in a uniform
random variable; or infinite, as in a standard normal ran
dom variable.
2.2 The PM F m easures the probability of a single value.

The C D F m easures the total probability that the ran
dom variable is less than or equal to a sp ecific value—
that is, the cum ulative probability that X < x for som e
value x. The C D F is the sum of the PM F across all val
ues less than x.
2.3 A PM F must be non-negative and sum across the support

of the random variable to 1.
2.4 E[Y] = E[a + bX]

= E[a] + E[bX] Expectation of sum is the
sum of expectations
= a + bE[X] Expectation of a constant is constant.
The variance of V is V[a + bX] = V[bX] = b2V[X]

because the variance of a constant is 0 and the variance
of b X is b2 tim es the variance of X.
The standard deviation of V is simply the square root of

the variance given above, V b ^ v jx j = |b| v V [ X ] .
2.5 Excess kurtosis is the standard kurtosis measure minus 3. 2.7 The median is the point where 50% of the probability is
This reference value com es from the normal distribution, located on either side. It may not be unique in discrete
and so excess kurtosis m easures the kurtosis above that distributions, although it is common to choose the sm all
of the normal distribution. est value where this condition is satisfied. The IQR is the
2.6 The quantile function is the inverse of the C D F function. difference between the 25% and 75% quantiles. These
That is, if u = Fx(x) returns the C D F value of X, then are the points where 25% and 75% of the probability of
q = F x \u ) is the quantile function. The C D F function is the random variable lies to the left. This difference is a
not invertible if there are regions where there is no prob measure of spread that is like the standard deviation.
ability. This corresponds to a ju m p in the CDF, as shown The median is equal to the mean in any sym m etric distri
in the figures that follow. bution if the first moment exists (is finite).

Solved Problems
2.8 E[XE[X\] = 1 E[X \X Pr(X = x) = E [X ]2 X Pr(X = x) X X Standardized
= E[X] X E[X] = (E[X])2 because E[X] is a constant and
8 1.532 0.898
does not depend on X. In other words, the expected
value of a random variable d o e s not depend on the ran 9 2.550 1.289
dom variable. 10 0.296 0.423
2.9 a. The mean is —0.803 and the variance is 6.762. 11 - 0 .9 7 9 - 0 .0 6 8
b.
12 - 4 .2 5 9 - 1 .3 2 9
X X Standardized 13 2.810 1.389
1 - 2 .4 5 6 - 0 .6 3 6 14 - 1 .6 0 8 - 0 .3 1 0
2 - 3 .3 8 8 - 0 .9 9 4 15 - 0 .5 7 5 0.088
3 - 6 .8 1 6 - 2 .3 1 2 Mean - 0 .8 0 3 0.000
4 1.531 0.897 Variance 6.762 1.000
5 1.737 0.977 Standard 2.600 1.000

Deviation
6 - 1 .2 5 4 - 0 .1 7 3
7 - 1 .1 6 4 - 0 .1 3 9
(X Standardized)3 (X Standardized)4
X X Standardized (Skew) (Kurtosis)
1 - 2 .4 5 6 - 0 .6 3 6 - 0 .2 5 7 0.163
2 - 3 .3 8 8 - 0 .9 9 4 - 0 .9 8 2 0.977
3 - 6 .8 1 6 - 2 .3 1 2 - 1 2 .3 6 4 28.590
4 1.531 0.897 0.723 0.649
5 1.737 0.977 0.932 0.910
6 - 1 .2 5 4 - 0 .1 7 3 - 0 .0 0 5 0.001
7 - 1 .1 6 4 - 0 .1 3 9 - 0 .0 0 3 0.000
8 1.532 0.898 0.724 0.650
9 2.550 1.289 2.143 2.764
10 0.296 0.423 0.075 0.032
11 - 0 .9 7 9 - 0 .0 6 8 0.000 0.000
12 - 4 .2 5 9 - 1 .3 2 9 - 2 .3 4 8 3.120
13 2.810 1.389 2.682 3.726
14 - 1 .6 0 8 - 0 .3 1 0 - 0 .0 3 0 0.009
15 - 0 .5 7 5 0.088 0.001 0.000
Mean - 0 .8 0 3 0.000 -0 .5 8 1 2.773
Variance 6.762 1.000
Standard Deviation 2.600 1.000
Hence, the skew of X is —0.581 and the kurtosis of X is 2.773.

2.10 a. The mean is E[X] = Pr(X = x) = 0.25% . d. The kurtosis requires computing
The variance is Var[X] = £ (x — E[X])2 Pr(X = x)

E[(X - E[X])4
= 0.000555. kurtosis(X)
The standard deviation is \ / Var[X] = 2.355% .

Pr(X = x).
b. E[X2] = 2 x 2 Pr(X = x) = .000561 and so
E[X2] - (E[X])2 = 0.000561 - (.0025)2 = .000555, Thus the kurtosis is 2.24. The excess kurtosis is then
which is the same. 2.24 - 3 = - 0 .7 6 . This distribution does not have
c. The skewness requires computing excess kurtosis.
e. The median is the value where at least 50% probabil

X - ix
skew(X) = I M 1 ! = E ity lies to the left, and at least 50% probability lies to
a
(T
the right. Cum ulating the probabilities into a CDF, this
X — fJL^
occurs at the return value of 0%.
2
3
Pr(X = x).
= a
Thus the skewness is 0.021, and the distribution has a

mild positive skew.

Common Univariate
Random Variables
Learning Objectives
Distinguish the key properties and identify the common Describe a mixture distribution and explain the creation
occurrences of the following distributions: uniform and characteristics of mixture distributions.
distribution, Bernoulli distribution, binomial distribution,
Poisson distribution, normal distribution, lognormal
distribution, chi-squared distribution, Student's t, and
F-distributions.
27
There are over two hundred named random variable distribu (e.g ., a bear m arket, a severe loss in a portfolio, or a com pany
tions. Each of these distributions has been developed to explain defaulting on its obligations).
key features of real-world phenom ena. This chapter exam ines a
The Bernoulli distribution depends on a single parameter, p,
set of distributions commonly applied to financial data and used
which is the probability that a success (i.e., 1) is observed. The
by risk m anagers.
probability mass function (PM F) of a Bernoulli(p) is
Risk managers model uncertainty in many form s, so this set
M y) = py(1 - p f ~y)(3-D
includes both discrete and continuous random variables. There
are three common discrete distributions: the Bernoulli (which Note that this function only produces two values:
was introduced in the previous chapter), the binomial, and the p, when y = 1
Poisson. The Bernoulli is a general purpose distribution that is
1 — p, when y = 0
typically used to model binary events. The binomial distribution
describes the sum of n independent Bernoulli random variables. M eanwhile, its C D F is a step function with three values:
The Poisson distribution is commonly used to model hazard
( 0 y < 0
rates, which count the number of events that occur in a fixed
Fy(y) = < 1 — p 0 < y < 1 (3.2)
unit of tim e (e.g ., the number of corporations defaulting in the
[ 1 y —1
next quarter).
The moments of a Bernoulli are easy to calculate. Suppose that
There is a w ider variety of continuous distributions used by
X is a Bernoulli random variable with param eter p. This can be
risk m anagers. The most basic is a uniform distribution, which
expressed as
serves as a foundation for all random variables. The most widely
used distribution is the normal, which is used for tasks such as Y ~ Bernoulli(p)
modeling financial returns and im plem enting statistical tests. The mean and variance of Y are respectively
Many other frequently used distributions are closely related
E[Y] = p X 1 + (1 - p) X 0 = p,
to the normal. These include the Student's t, the chi-square
i x 2), and the F, all of which are encountered when evaluating and
statistical models. V[Y] = E[Y2] - E[Y]2 = (p X 12 + (1 - p) X 02) - p2 = p(1 - p)
This chapter concludes by introducing mixture distribu Note that the mean reflects the probability of observing a 1,
tions, which are built using two or more distinct com ponent and thus it is just p. The variance reflects the fact that if p is
distributions. A mixture is produced by randomly sampling either near 0 or near 1, then one value is observed much more
from each com ponent so that the mixture distribution inher frequently than the other (and thus the variance is relatively
its characteristics of each com ponent. M ixtures can be low). The variance is maximized when p = 50%, which reflects
used to build distributions that match im portant features the equal probability that failure and success (i.e., 0 and 1)
of financial data. For exam ple, mixing two normal random are observed.
variables with different variances produces a random vari
Figure 3.1 describes two Bernoulli random variables:
able that has a larger kurtosis than either of the mixture
Y ~ Bernoulli(p = 0.5) and Y ~ Bernoulli(p = 0.9). Their PM Fs
com ponents.
and C D Fs are shown in the left and right panels, respectively.
Note that the C D Fs are step functions that increase from 0 to
3.1 DISCRETE RANDOM VARIABLES 1 — p at 0, and then to 1 at 1.
Bernoulli random variables can be applied to events that are

Bernoulli binary in nature. For exam ple, loan defaults can be m odeled by
making the observed values 1 for a default and 0 for no default.
The Bernoulli is a simple and frequently encountered distribu
M eanwhile, p (i.e., the probability of default) can be calculated
tion. It is a discrete distribution for random variables that pro
using various relevant factors. Com plete m odels can include
duces one of two values: 0 or 1. It applies to any problem with a
factors specific to each recipient (e.g ., a credit score) as well
binary outcom e (e.g ., bull and bear m arkets, corporate defaults,
as those that reflect w ider econom ic conditions (e.g ., the G D P
or the classification of fraudulent transactions). It is common to
growth rate).
label the outcom e 1 as a success and 0 as a failure. These labels
are often misnomers in finance and risk m anagem ent because 1 Bernoulli random variables can also be used to model excep
is used to denote the unusual state, which is often undesirable tionally large portfolio losses, taking on a value of 1 when such a

PM F CDF
Fiqure 3.1 The left panel shows the probability mass functions of two Bernoulli random variables,
one with p = 0.5 and the other with p = 0.9. The right panel shows the cumulative distribution
functions for the same two variables.
loss is realized (e.g ., a loss that exceeds Value-at-Risk [VaR]) and By construction, a binomial random variable is always non
0 otherwise. negative, integer-valued, and a B(n, p) is always less than or
equal to n. The skew ness of a binomial depends on p, with
Binomial small values producing right-skewed distributions. The PM F of

a B(n, p) is
A binom ial random variab le m easures the total num ber of
successes from n ind ep end ent Bernoulli random variab les, fy( )= r j p m - P)"~y
w here each has a probability of success equal to p. In other
w ords, binom ial distributions are used to model counts of
Note that ( ), which is commonly expressed as "n choose y ,"
indep end ent events.
counts the distinct number of ways that y successes could have
A binomial distribution has two param eters:
been realized from n experim ents. It is equivalent to
1. n, the number of independent experim ents; and
n\ _ n!
2 . p, the probability that each experim ent is successful.1
y) ~ yKn - y)!'
If n variables X, ~ Bernoulli(p) are independent, then a bino where "!" indicates factorial.
mial with param eters n and p is defined as Y = ^ " =1X,- and
The probability density function (PDF) for any values of y and n
expressed as B(n, p).
can be com puted using basic probability. For exam ple, consider
The mean follows directly from the properties of moments a binomial random variable that models the flipping of two
described in the previous chapter coins. In this case, an outcom e of heads represents a success
E[Y] = np (i.e., it takes on a value of 1), whereas a tails outcom e will be
considered a failure (i.e., it takes on a value of 0). This variable
M eanwhile, the variance of V is
can be expressed as Y ~ B(2, 0.5). Note that:
V[Y] = np(1 - p)
• n = 2 because there are two flips, and
Note that the p( 1 — p) com ponent reflects the uncertainty com
• p = 0.5 because each flip has a 50% chance of success (i.e.,
ing from the Bernoulli random variables, while n considers the
producing heads).
number of independent com ponent variables.
W hen the two coins are flipped, there are four possible out
com es: {Tails, Tails}, {Tails, Heads}, {H eads, Tails} or {H eads,
1 The outcomes of Bernoulli random variables are called experiments
due to the historical application in a laboratory setting. While financial Heads}. This means that the number of successes (i.e., heads)
data are not usually experimental, this term is still used. can either be 0, or 2.
Chapter 3 Common Univariate Random Variables ■ 29

Fiqure 3.2 The left panel shows the PMF of two binomial distributions. The solid line is the PDF of a
normal random variable, N(np, np(1-p)) = N(10, 6), that approximates the PMF of the B(25, 0.4). The
right panel shows the CDFs of the two binomials.
Since the coin flips are independent, the probability of {Tails, variables. M eanwhile, the second configuration has n = 25 and
Tails} is the product of the probability of each flip being tails, p = 0.4, creating a shape that resem bles the density of a normal
which is random variable. The solid line in the PM F panel shows the PD F
of this approxim ation. The right panel shows the C D Fs, which
(1 /2 )2 = 1 /4
are step functions.
Furtherm ore, the probability of one head and one tail is also
1/4, as is the probability of two heads.
Poisson
Now, the variable's PD F can be produced by multiplying the
probabilities of each number of by the number of ways in which Poisson random variables are used to measure counts of events
it could occur over fixed tim e spans. For exam ple, one application of a Poisson
is to model the number of loan defaults that occur each month.
Q ) x (V/ X (’/2)2 = 1/4 if y = 0 Poisson random variables are always non-negative and integer
valued. The Poisson distribution has a single parameter, which
F Y(y) = < 2 X y 4 = X (1/2)1 X (1/2)1 = 1/2 if y = 1. is called the hazard rate and expressed as A, that signifies the
average number of events per interval. Therefore, the mean and
( 2 ) x <V2)2 X I1/ / = 1/4 if y = 2 variance of Y ~ Poisson(A) is simply:
E[Y] = V[Y] = A
M eanwhile, the C D F is the sum2 of the cum ulated PM F between
0 and y The PM F of a Poisson random variable is
y e xp (—A)
(3.5)
y!
M eanwhile, the C D F of a Poisson is defined as the sum of the
Note that the normal distribution provides a convenient approxi
PM F for values less than the input
mation to the binomial if both np > 10 and n(1 — p) > 10. This is
illustrated in Figure 3.2 and explained further later in this chapter. LyJ x
F Y(y) = e xp (—A ) 2 t (3.6)
Figure 3.2 shows the PDFs and C D Fs of two binomial random ;=o,!
variables. One distribution has n = 4 and p = 0.1, which pro Note that [ y j is the floor operator, which produces the greatest
duces a distribution with right-skew. Note that the mass is heav integer that is less than or equal to y.
ily concentrated around y = 0 and y = 1, which reflects the
The Poisson param eter A can be considered as a hazard rate in
small probability of success of each of the four Bernoulli random
survival modeling.
2 The CDF is defined using the floor function, [y J , which returns y when A s an exam ple, consider a fixed-incom e portfolio that consists
y is an integer and the largest integer smaller than y when y is not. of a large number of bonds. On average, five bonds within the

PM F Cdf
Fiqure 3.3 The left panel shows the PMF for two values of A, 3 and 15. The right panel
shows the corresponding CDFs.
portfolio default each month. Assum ing that the probability of The C D F returns the cum ulative probability of observing a value
any bond defaulting is independent of the other bonds, what is less than or equal to the argum ent, and is
the probability that exactly two bonds default in one month?
if y < a
In this case, y = 2 and A = 5. Thus
if a < y < b. (3.8)
52e xp (—5)
M 2) = ----- ^----- = 0.0842 = 8.42% if y > b
The C D F is 0 to the left of a, which is the sm allest value that

A useful feature of the Poisson (one that is uncommon among
could be produced, linearly increases from a to b, and then is 1
distributions) is that it is infinitely divisible. If X i ~ Poisson(A-i),
above b. W hen a = 0 and b = 1, the distribution is called the
and X 2 ~ Poisson(A2) are independent, and Y = X-| + X 2, then
standard uniform. Any uniform random variable can be con
V ~ Poisson (Ai + A2). This feature makes the Poisson well-
structed from a standard uniform l/-, using the relationship:
suited to work with tim e-series data, because summing the
number of events in a sampling interval (e.g ., a w eek, month, or U2 = a + (b - a)U1# (3.9)
quarter) does not change the distribution.
where a and b are the bounds of U2.
Figure 3.3 shows the PM Fs (left panel) and C D Fs (right panel)
Figure 3.4 plots the PDFs of U(0, 1) and U( 1/2, 3). These PDFs
for two configurations of a Poisson random variable. W hen A is
both have a simple box-shape, which reflects the constant prob
small, the distribution is right-skewed. W hen A = 15, the distri
ability across the support of the random variable. The right
bution is nearly sym m etric.
panel shows the C D Fs that are piecewise linear functions that
are 0 below the lower bound, 1 above the upper bound, and
have a slope of 1/(b — a) between these two values.
3.2 CONTINUOUS RANDOM
The mean of a uniform random variable Y ~ U(a, b) is the mid
VARIABLES
point of the support
Uniform a + b
2
The sim plest continuous random variable is a uniform random
variable. A uniform distribution assumes that any value within M eanwhile, the variance depends only on the width of the
the range [a, b] is equally likely to occur. The PD F of a uniform is support:
1 (b - a )2
---- (3.7) V[ Y]
- a 12
The PD F of a uniform random variable does not depend on y, The mean and variance of a standard uniform random variable
because all values are equally likely. are 1/2 and 1/12, respectively.

PDF CDF
Fiaure 3.4
and a uniform between 1/2 and 3. The right panel shows the cumulative distribution functions
corresponding to these two random variables.
The probability that Y falls into an interval with lower bound / (i.e., the process where data is used to determ ine the truth
and upper bound u (i.e., Pr(/ < Y < u)) is fulness of an objective statem ent).
minfu, b) — max(/; a) • Normal random variables are infinitely divisible, which makes
---- ——r---------- (3.10)
b —a them suitable for simulating asset prices in models that
assume that prices are continuously evolving.
which sim plifies to (u — l)/(b — a) when both / > a and u < b.
• The normal is closely related to many other im portant distri
As an exam ple, a uniform random variable Y ~ U(—2, 4) has a
butions, including the Student's t, the y 2, and the F.
mean of
• The normal is closed (i.e., weighted sums of normal ran
-2 + 4
dom variables are normally distributed) under linear
2
operations.
A variance of
• Estim ators derived under the assumption that the underly
(4 - (—2f ing observations are normally distributed often have simple
12 closed forms.
And the probability that Y falls between —1 and 2 is The normal distribution has two param eters: /z (i.e., the mean)
min(2, 4) — m a x (- 1 , — 2) 2 ~ ( - 1) and a 2 (the variance). Therefore
Pr(/ < Y < u)
4 - (- 2 ) 4 - (- 2 )
E M = /x
Normal and
The normal distribution is the most commonly used distribu V[Y] = a 2

tion in risk m anagem ent. It is also commonly referred to as a
A normal distribution has no skewness (because it is sym m etri
Gaussian distribution (after m athem atician Carl Friedrich Gauss)
cal) and a kurtosis of 3. This kurtosis is often used as a bench
or a bell curve (which reflects the shape of the PDF). The norma
mark when assessing w hether another distribution is heavy/
distribution is popular for many reasons.
fat-tailed (i.e., has relatively greater probability of observing an
• Many continuous random variables are approxim ately nor especially large deviation than if the random variable is normally
mally distributed. distributed).
• The distribution of many discrete random variables can be Figure 3.5 shows the shapes of the PDFs of three normal distri
well approxim ated by a normal. butions, all of which have a bell shape. Each PD F is centered at
• The normal distribution plays a key role in the Central Limit the respective mean /x of each distribution. Note that increasing
Theorem (CLT), which is widely used in hypothesis testing the variance a 2 increases the spread of the distribution.

PDF CDF
Fiaure 3.5 The left panel shows the PDFs of three normal random variables, N(0, 1)
(i.e., the standard normal), N(0, 4), and N(2, 1). The right panel shows the corresponding CDFs.
The normal can generate any value in °°), although it is The central probability of each region is 100% — a. It is common
unlikely (i.e., less than one in 10,000) to observe values more to substitute 2 for 1.96 here as well.
than 4er away from the mean. In fact, values more than 3cr away
Figure 3.6 shows a graphical representation of these two m ea
from the mean are expected in only one in 370 realizations of a
sures. The left panel shows the area covered by 1a , 2 a , and
normal random variable.
3cr. These regions are centered at the mean /x. The right panel
The PD F of a normal is shows the area covered by three confidence intervals when the
random variable is a standard normal.
My) = v f a exp(“ ^ ) (311) Note that the sums of independent3 normally distributed ran
dom variables are also normally distributed. If X-| ~ N(/x-\, of)
Recall that while the C D F of a normal does not have a closed
and X 2 ~ N(/x2, a2) are independent, and Y = X | + X 2, then
form , fast numerical approxim ations are widely available in Excel
Y ~ N(ju 1 + /x2, a\ + 0 2 ). This property sim plifies describing
or other statistical software packages.
log returns at different frequencies; if daily log returns are
An im portant special case of a normal occurs when fi = 0 and independent and normal, then w eekly and monthly returns are
a 2 = 1. This is called a standard normal, and it is common to as well.
use Z t o denote a random variable with distribution N(0,1). It
is also common to use <£(z) to denote the standard normal PD F
and <f>(z) to denote the standard normal CDF. Approximating Discrete Random Variables
The normal is widely applied, and key quantiles from the normal Recall that a normal distribution can approxim ate a binomial
distribution are commonly used for two purposes. random variable if both np > 10 and n(1 - p) > 10. When
these conditions are satisfied, a binomial has either many inde
1. First, the quantiles are used to approxim ate the chance of
pendent experim ents or a probability that is not extrem e (or
observing values more than 1er, 2a, and 3a when describ
both), and so the PM F is nearly sym m etric and well approxi
ing a log return as a normal random variable.
mated by a N{np, np(1 — p)).4
2. Second, when constructing confidence intervals for esti
mated param eters, it is common to consider sym m etric
intervals that contain 90% , 95% , or 99% of the probability.
These values are summarized in Table 3.1. The left panel tabu 3 Chapter 4 shows this property holds more generally for normal random
lates the area covered by an interval of q a for q = 1 ,2 , 3. Both variables and independence is not required.
the exact probabilities and the common approxim ations used by 4 The approximation of a binomial with a normal random variable is an
application of the CLT, which establishes conditions where averages of
practitioners are given. The right panel shows the end points of
independent and identically distributed random variables behave like
a sym m etric interval with a total tail probability of a. This means normal random variables when the number of samples is large. Chapter 5
that each tail contains probability a / 2 for a = 10%, 5%, and 1%. provides a thorough treatment of CLTs.

Table 3.1 The Left Panel Shows the Total Probability within 1, 2, and 3 Standard Deviations of the Mean for a
Normal Random Variable. The Right Column Contains the Common Approximation Used for Each Multiple. The
Right Panel Shows the Endpoints for a Symmetric Interval with a Total Tail Probability of 10%, 5%, or 1% for a
Standard Normal
Standardized Distance from /x Confidence Intervals
Common Central Tail

Approxim ation Probability Probability
Exact Value
1(7 68.2% 68% 90% 10% ± 1 .6 4 5
2(7 95.4% 95% 95% 5% ± 1 .9 6
3(7 99.7% 99% 99% 1% ± 2 .5 7
Standardized Distance Confidence Intervals
Fiaure 3.6 The left panel shows the area (probability) covered by ±1, ±2, and ±3o\ The right
panel shows the critical values that define regions with 90%, 95%, and 99% coverage so that the total
area in the tails is 10%, 5%, and 1%, respectively. The probability in each tail is a/2.
The Poisson can also be approxim ated by a normal random For exam ple, if stock prices are assumed to be normally distrib
variable. W hen A is large, then a Poisson(A) can be well approxi uted, there is a positive (although perhaps tiny) probability that
mated by a Normal(A, A). This approxim ation is commonly the stock price becom es negative. This is im possible under a
applied when A > 1000. log-normal model.
A log-normal distribution can be denoted as y ~ LogN(/z, a 2),
Log-Normal or equivalently as ln(Y) ~ N(/z, a 2), where ln(Y) is normally dis

tributed with mean /z and variance cr2. Here, y is a non-linear
A variable Y is said to be log-normally distributed if the natu transform ation of a normal random variable and the m ean5 of y
2
ral logarithm of Y is normally distributed. In other words, if is exp (/z + (r /2).
X = In V, then Y is log-normally distributed if and only if X is
normally distributed. Alternatively, a log-normal can be defined
Y = exp(X), (3.12)
where X ~ N(/z, cr). An im portant property of the log-normal 5 The mean reflects the mean of the underlying normal random variable
distribution is that Y can never be negative, whereas X can be as well as its variance. The extra term arises due to Jensen's inequality:
The exponential function is a convex function and so it must be the case
negative because it is normally distributed. This log-normal that E[Y] > exp (E[X]) = exp(/z). The variance depends on both model
property can be desirable when constructing certain models. parameters and V[Y] = (exp(cr2) — 1) exp(2/z + a 2).

W ORKING WITH NORMAL RANDOM VARIABLES
Note that a normal random variable with any mean and vari The result allows the C D F of the original normal random vari
ance can be constructed as a sim ple linear transform ation of able to be com puted from the C D F of the standard normal
another normal random variable.
To see how this works, begin with the standard normal (i.e., F x « = <* > ( ^ T !:) = J ^ r M
Z ~ N(0,1)). Defining X = /x + erZ (where /x and er are con
stants) therefore produces a random variable with a N(/x, a 2) Figure 3.7 provides an illustration of CDF, which measures the
distribution. shaded area under the PDF.
The mean and variance of X can be verified using the results Note that only one tail is required because the stan
from Chapter 2 to show that dard normal distribution is sym m etric around 0 so that
Pr(Z < 0) = Pr(Z > 0) = 0.5. This sym m etry ensures that
E[X] = /A + crE[Z] = fjt
Pr(Z < z) = Pr(Z > - z ) = 1 - Pr(Z < - z )
and
when z < 0.
Var[X] = o-2Var[Z] = cr2
For exam ple, P r(-1 < Z < 2) is simply
C hapter 2 also explains how a random variable is standard
ized, so that if X — N(/x, a 2), then Pr(Z < 2) - Pr(Z < - 1 )
However, because the standard Z table only includes posi

N( 0, 1) tive values of Z, the probability that Z < - 1 is the same
as the probability that Z > 1, which is 1 - Pr(Z < 1).
This result allows the C D F of a normally distributed random Com bining these tw o, the total probability in this region is
variable to be evaluated using the C D F of a standard normal. Pr(Z < 2) - (1 - Pr(Z < 1)). Looking up these two left tail
For exam ple, suppose that X ~ N(3,4), and we wanted to values, Pr(—1 < Z < 2) is .9772 - (1 - .8413) = .8185. This
evaluate the C D F of X at 2, which is Pr(X < 2). Applying the method can be applied to any normal random variable by
standardization transform ation to both X and 2 gives standardizing it, so that:
Pr(X < 2) = Pr
< Z <
The PD F of a log-normal is given by: X ~ N(ju, a 2)], the C D F of Y can be com puted by inverting the
transform ation using ln(Y) and then applying the normal C D F on
(In y ~ /x )
exp (3.13) the transform ed value
2cr2
Because a log-normal random variable is defined as the exp o In y —^

Fyiy) = $ (3.14)
a
nential of a normal random variable [i.e., Y = exp X , where

PDF CDF
2.0 -
1.5
1 .0 -
LogN(0.08, .22)
0.5 LogN(0.16, .22)
— LogN(0.08, A 2)
0 .0
2.5 2.0 2.5
Fig u re 3 .8 The left panel shows the PDFs of three log-normal random variables, two of which have /x = 8%,
and two with ct = 20%, which are typical of annual equity returns. The right panel shows the corresponding CDFs
Figure 3.8 shows the PDFs (left panel) and C D Fs (right panel) for
three log-normal random variables. These have been param eter
and
ized to reflect plausible distributions of annual returns on an
equity m arket index. Note that the expected value of the distri V[Y] = 2v
bution is always greater than exp(/z) due to the cr2/ 2 term in the The PD F of a ^ random variable is
mean. For exam ple, when /z = 8% and a = 20% , the expected
1
value is 1.105, and when a = 40% , the expected value is 1.174. My) = y<V2-1> e x p ( - y 2), (3.16)
A t the same tim e, changing /x to 16% produces an expected 2(’ 2)T(M )
value of 1.197. The effect of a 2 on the expected value reflects where T(x) is known as the Gam m a function.6
the asym m etric nature of returns; the worst return is - 1 0 0 % ,
Figure 3.9 shows the PDFs and C D Fs for three param eter val
whereas the upside is unbounded.
ues of a x 2 random variable. These PDFs are only defined for
positive values, because the x 2 is the distribution of a sum of
squared random variables. W hen v = 1, the distribution has a
single mode at 0. As v increases, the density shifts to the right
The x 2 (chi-squared) distribution is frequently encountered when
and spreads out, reflecting the mean [v) and the variance [2v) of
testing hypotheses about model param eters. It is also used
a In all cases, the x 2 distribution is right-skewed. The skew
when modeling variables that are always positive, (e.g ., the
ness declines as the degrees of freedom increases, and when
VIX Index). A x% random variable is defined as the sum of the
v is large ( > 25), a ^ random variable is well approxim ated
squares of v (G reek nu) independent standard normal random
by a N(v, 2v).
variables
Y = J l Zf (3.15) Student's t
1=1
The Student's t distribution is closely related to the normal, but
Note that a x 2 distribution has v d e g re e s o f freedom , a concept
it has heavier tails. The Student's t distribution was originally
that arises when dealing with models that have k param eters
using n data points. D egrees of freedom measure the amount developed fo rte stin g hypotheses using small sam ples.
of data that is available to test model param eters, because esti

mating model param eters requires a minimum number of obser
6 The Gamma function is defined as the integral
vations (e.g ., k). In many m odels, the degree of freedom used in
testing is n — k. T(x) = f zx~1 exp (—z)dz.
Jo
Using standard properties for sums of independent random When x is a positive integer, T(x) = (x — 1) X (x — 2) X (x — 3) X
variables, it can be shown that for Y ~ x? X 3 X 2 X 1 = (x — 1)!, with ! being the factorial operator.

PDF CDF
Fiaure 3.9 The left panel shows the PDFs of \ 2 distributed random variables with v = 1, 3, or 5. The right panel
shows the corresponding CDFs.
A Student's t is a one-param eter distribution. This parameter, than 3 (i.e ., it is larger than the kurtosis of a normal random
denoted by v, is also called the degrees of freedom param variable).
eter. W hile it affects many aspects of the distribution, the most
In some applications of a Student's t, it is desirable to separate
im portant effect is on the shape of the tails.
the degrees of freedom from the variance. Using the basic
A Student's t is the distribution of result that
V[aY] = a2V[Y]
(3.17)
it is easy to see that
where Z is a standard normal, W is a x l random variable, and Z
and W a re independent.7 Dividing a standard normal by another
random variable produces heavier tails than the standard nor
mal. This is true for all values of v, although a Student's t con when Y ~ t„. This distribution is known as a standardized Stu
verges to a standard normal as v —» °c.8 dent's t, because it has mean 0 and variance 1 for any value
of v. Im portantly, it can be rescaled to have any variance and
If Y ~ tv, then the mean is
re-centered to have any mean if v > 2.
E[Y] = 0
For exam ple, the generalized Student's t is param eterized with
and the variance is three param eters reflecting the mean, variance, and degrees of
freedom , and is denoted as G en. a 2).
v
The features of the generalized t make it better suited to
The kurtosis of Y is m odeling the returns of many assets than the normal distri
bution. It captures the heavy tails in asset returns (i.e ., the
v —2
kurtosis(Y) = 3 increased likelihood of observing a value larger than q a rela
v - 4
tive to a normal for values | q| > 2 ) while retaining the fle xib il
The mean is only finite if v > 1 and the variance is only finite ity of a normal random variable to directly set the mean
if v > 2. The kurtosis is defined for v > 4 and is always larger and variance.
For exam ple, the chance of observing a value more than 4cr
away from the m ean, if X-i is normally distributed, is about 1 in
15,000. If X 2 is a standardized t with the same mean and vari
7 Z and W are independent if the normal random variable used to gener
ance with the degree of freedom param eter v = 8, then the
ate W is independent of Z.
chance of observing a value larger than 4a- is 1 in 580. This dif
8 This should be intuitive considering that E[\N/v) = 1 for all values of v
and V[W/i^] = 1jv, so that as v becomes larger, the denominator closely ference has im portant im plications for risk m anagem ent, esp e
resembles a constant value. cially when we are worried about rare events.

PDF CDF
Fiaure 3.10 The left panel shows the PDF of a generalized Student's t with four degrees of freedom and the
PDF of a standard normal. The right panel shows the corresponding CDFs.
Figure 3.10 com pares the PD Fs and C D Fs of a generalized Stu vvi iiu i is ui ny
dent's t with v = 4 to a standard normal, which is the limiting
2 v \ {v'l + V2 — 2)
distribution as v —> o°. The Student's t is more peaked, heavier- V[Y] =
vy {v2 - 2)2{v 2 - 4)
tailed, and has less probability in the regions around ± 1.5. This
appearance is common in heavy-tailed distributions. These two and is only finite for v 2 > 4.
distributions have the same variance by construction, and so the When using an F in hypothesis testing, v-\ is usually determined by
increased probability of a large m agnitude observation in the t the hypothesis being tested and is typically small (e.g., 1, 2, 3, . . . ) ,
must be offset with an increased chance of a small magnitude while v2 is related to the sample size (and so is relatively large). When
observation to keep the variance constant. v2 is large, the denominator {W2/ v 2) has mean 1 and variance10
2 / v 2 ~ 0, and so behaves like a constant. In this case, if Y ~ F„1f „2,
then the distribution is close to W l /z'1, and the mean and variance
are 1 and 2 fvy, respectively. The PDF and C D F of an F a re both
The F is another distribution that is commonly encountered complicated and involve integrals that do not have closed forms.
when testing hypotheses about model param eters. The F h a s
Figure 3.11 shows the PD F and C D F for three sets of param eter
two param eters, and v2, respectively known as the numerator
values of F distributed random variables. Com paring F3 10 and
and denom inator degrees of freedom . These param eters are
the F4jo , the extra degree of freedom in the num erator has two
usually written as subscripts in the shorthand expression F„ „ .
effects. The first is that it shifts the distribution to the right. The
An F distribution is defined as the ratio of two independent \ 2 second is that it shifts the probability closer to the mean (i.e.,
random variables where each has been divided by its degree of 1.25). The result is that these two distributions have the same
freedom mean but the F3 10 has a larger variance.
W i/*1 Com paring the F3 and F3oc, the distribution with a larger
(3.18)
W2/ v 2 denom inator degrees of freedom param eter lies to the left and
has a thinner tail. These param eters change the mean and vari
where W-, ~ , and W2 ~ X$2, and W-, and W2 are indepen
ance of F3 oo to 1 and 2/3, respectively.
dent.9 If Y ~ F„1( „2, then the mean of Y is
The F is closely related to the \ 2 and the Student's t. When
= 1 and v2 is sufficiently large, then F 1jV is close to xi- If
V ~ t„, then Y2 ~ Fi „. These properties can be readily verified
using the definitions of the t, x 2 and F distributions.
9 Wy is independent of VV2 if the normal random variables used to

construct W-1 are independent of the normal random variables used to
construct W2. describes the relationship between values that are almost equal.

PDF CDF
1.0 i
Fiqure 3.11 The left panel shows the PDF of three F-distributed random variables. The right panel shows the
corresponding CDFs.
Exponential the number of loan defaults per quarter. If X is Poisson distributed

with parameter f3, then the time between each subsequent loan
The exponential distribution uses a single parameter, (3, that default has as an exponential distribution with parameter (3.
determ ines both the mean and variance. If Y ~ Exponential^ )
Exponential variables are also memoryless, meaning that their
E[Y] = j8
distributions are independent of their histories. For exam ple,
and suppose that the time until a company defaults is exponentially
V[Y] = p distributed with param eter (3. Assuming that [3 is known and con
stant, it follows that the probability of the company defaulting
The PD F of an Exponential^ ) is
within the next year (days 0 to 365) is the same as the probability
My) = % exp ( - % ) , y a 0 (3.19) that company defaults between a year and two years from now
(days 365 to 730). In fact, the probability the company defaults
The C D F is
in any one-year window is the same (assuming it has not already
Fy<y) = 1 - exp ( - % ) (3.20)
defaulted). The memoryless property of an exponential random
The exponential distribution is closely related to the Poisson distri variable is a statem ent of probability conditioned on the com
bution. For example, suppose X is a random variable that measures pany having not defaulted. However, it does not imply that there
PDF CDF
Fiqure 3.12 The left panel shows the PDFs of three random variables with exponential distributions having
shape parameters between 1 and 5. The right panel shows the corresponding CDFs.

PDF CDF
3.0 1.0 1
/ 7
Beta(1/4 ,1/4) Ss
Beta(1,1 ) / , /
— Beta(5,5)
B e ta (7 ,3)
1.5-
1. 0 -
0.5
0.
Fiqure 3.13 The left panel shows the PDFs for four configurations of the Beta distribution, including the
where a = j3 = 1, which produces a standard uniform. The right panel shows the corresponding CDFs.
is equal probability today that the company defaults in in the first The left panel of Figure 3.13 shows the PDFs for four Beta
year (days 0 to 365) or in the first two years (days 0 to 730). distributions. Three of the exam ples have the same mean (i.e.,
1/2). W hen both param eters are less than 1, the distribution
Figure 3.12 shows the PD F and C D F of three exponential distri
places most of the probability mass near the boundaries. When
butions with f3 between 1 and 5.
a = /3 = 1, then the Beta sim plifies to a standard uniform dis
As an exam ple, assume that the time to default for a specific tribution. As the param eters increase, the distribution becom es
segm ent of credit card consum ers is exponentially distributed more concentrated around the mean. W hen a = 7 and (3 = 3,
with a f3 of five years. To find the probability that the custom er the distribution shifts towards the upper bound. In general,
will not default before year six, start by calculating the cum ula increasing a shifts the distribution towards 1, and increasing [3
tive distribution until year six and then subtract this from 100%: shifts the distribution towards 0. Increasing both param eters
proportionally reduces the variance and concentrates the distri
Survival > Y e a r 6 = 1 - Fy(61/3 = 5) = 1 - ( 1 - e 6/s ) = 30.1%
bution around the mean.
Beta Mixtures of Distributions

The Beta distribution applies to continuous random variables Mixture distributions build new, com plex distributions using two
with outcom es between 0 and 1. It is commonly used to model or more com ponent distributions. W hile this chapter focuses on
probabilities that naturally fall into this range. The Beta distri two-com ponent mixtures, a mixture can be constructed using
bution has two param eters, a and f3, that jointly determ ine the any (finite) number of com ponents.11
mean and variance of a Beta-distributed random variable. If A two-com ponent mixture first draws a value from a Bernoulli
V ~ Beta(a, (3)
random variable. Then, depending on the value (0 or 1), draws
a from one of two com ponent distributions. This structure makes
a + (3 it simple to com pute the C D F of the mixture when the com po
nents are normal random variables
and
Suppose there is X-| ~ Fx 1 , X 2 ~ FXl, and W ~ Bernoulli(p).
a(3
There mixture of X-] and X 2 would therefore be
(a + (3)2 [a + [3 + 1)
V = WX-, + (1 - W )X 2
The PD F of a Beta(a, (3) is
ya-1)(1 _ y)(£-1)
(3.21) 11 Mixtures with infinitely many components are more complicated
B(a, (3) because it is not always straight forward to be assured that the density
will integrate to unity. A Student's t is an infinite mixture of normal ran
where B(a, (3) = r(a )r(/ 3 )/ r(a + f3) and T ( ‘ ) is the Gam m a dom variables where the variance of each component is constructed
function. from draws from the ^ distribution.

If we sam pled from this distribution, 100 X p% of draws would An im portant feature of mixtures is that they can have both
come from the distribution of X-\, and the rem ainder would be skewness and excess kurtosis even when their com ponents have
from the distribution of X 2. no skewness or excess kurtosis (i.e., the com ponents are normal
random variables).
Both the C D F and the PD F of a mixture distribution are the
weighted averages of the C D Fs or PDFs (respectively) of the For exam ple, consider a distribution generated by mixing two
com ponents normal random variables:
Fyiy) = pFXl(y) + d - p)FX2(y) (3.22) Y = WX-, + (1 - W)X2,
My) = pfx,(y) + d - p)fx,(y) (3.23) where X-, ~ N { - 0.9, 1 - 0 .9 2), X 2 ~ N(0.9, 1 - 0 .9 2), and
W hile directly computing the central moments of a mixture W ~ Bernoulli(p = 0.5). The param eters of the three com po
distribution is challenging, doing so for non-central moments is nent normal distributions and the Bernoulli have been chosen so
sim ple. For exam ple, the mean of the mixture is that EM = 0 and V[Y] = 1.
E[Y] = p E M ] + (1 - p)E[X2] (3.24) In this case, the mixing probability is 50%. The mean of Y is 0 by
construction because
and the second non-central moment is
E M = E[WX-, + (1 - W)X2]
E[Y2] = pE[X?) + (1 - p )E[X l) (3.25)
= E[WX-, + X 2 - W X 2]
These two non-central moments can be com bined to construct = E[W\E[X,] + E[X2] - E[W\E[X2]
the variance = 0 .5 (- 0 .9 ) + 0.9 - 0.5(0.9)
= 0
V M = E[Y2] - E M 2
Sim ilarly, the variance is 1 because
Higher-order, non-central moments are equally easy to compute
because they are weighted versions of the non-central moments of E[X?] = E[X2] = EM 2] = 0 .9 2 + (1 - 0 .9 2) = 1
the mixture's components. Using the relationship between the third
and thus
and fourth central and non-central moments defined in Chapter 2,
these central moments are related to the non-central moments by VM = EM2] - EM 2 = 1 - 0 = 1
E[(Y - E M )3] = E [Y 3] - 3E[Y2]E[Y] + 2 E [Y ]3 A s seen in Figure 3.14, the mixture build from these two com
ponents is bim odal. This is because each com ponent has a small
E[(Y - E M )4] = E [Y 4] - 4E[Y3]E[Y] + 6E[Y2]E[Y]2 - 3E[Y]4
variance and the two means are distinct.
(3.26)
Bi-modal Mixture
PDF CDF
Fiaure 3.14 The panels show the PDFs (left) and CDFs (right) of a bimodal mixture where each component has
probability 50% of occurring and a standard normal (N(0, 1)).

Contaminated Normal
PDF CDF
standard normal.
As another exam ple, consider a mixture of X i ~ N(0, .7252) and clo se r to the b o u n d ary (i.e ., 0 or 1) than the C D F of the co n
X 2 ~ N(0, 3 .1 62), weighted 0.95 and 0.5 (respectively). Note tam in ate d norm al.
that X-] is close to a standard normal, whereas X 2 has a much Mixing com ponents with different means and variances
greater variance. As such, draws from X 2 appear as outliers in produces a distribution that is both skewed and heavy
the mixture distribution. tailed. For exam ple, consider a mixture with com ponents
Th e result is a m ixtu re d istrib u tio n th at is very h eavy-tailed X-| ~ N (0.16, 0.61) and X 2 ~ N{—1.5, 2), where the probability
with a kurto sis o f 1 5 .7 . Fig u re 3 .1 5 co m p ares the P D F and of drawing from the distribution of X-| is p = 90% . The mean
the C D F of a co n tam in ated norm al w ith th a t of a sta n and variance of this mixture are 0 and 1, respectively. The skew
dard norm al. N ote th a t the co n tam in ated norm al is sim ilar ness and kurtosis are —0.96 and 5.50, respectively.
in shap e to a g e n e ra lize d S tu d en t's t, as it is both m ore A s shown in Figure 3.16, this distribution has a strong left skew
p eaked and h eavier-tailed then th e norm al. T h e heavy tails and is mildly heavy-tailed. Specifically, the mixture is clearly left-
can be seen in the C D F by the cro ssing of the tw o cu rves fo r skewed with much more probability between —4 and —2 than
valu es b elo w —2 and ab o ve 2, w h e reas the norm al C D F is between 2 and 4.
Mixed Normal
PDF CDF
standard normal.

3.3 SUMMARY The lognormal distribution, which is a sim ple transform ation
of a normal distribution, is often assumed to describe price
This chapter introduces the key distributions— both discrete and processes and is the distribution underlying the famous Black-
continuous— commonly encountered in finance and risk man Scholes Merton model. Three common continuous distributions,
agem ent. Fam iliarity with these distributions is im portant when the Student's t, the x 2<and the F, are all transform ations of
building financial data models because each distribution has a independent normal random variables. The x 2 and the F a re also
specific feature set that makes it applicable to data with corre w idely used in hypotheses testing, while the Student's tc a n be
sponding features. used to model returns that are heavier-tailed than normal ran
dom variables.
The sim plest is the Bernoulli, which can be applied to any binary
event. The binomial generalizes the Bernoulli to the distribution The chapter concludes by introducing mixture distributions. This
of n independent events. The sim plest continuous distribution is class of distributions allows multiple simple distributions to be
the uniform, which specifies that all values within its support are com bined to produce distributions with em pirically im portant
equally likely. The normal is the most w idely used distribution features (e.g ., skewness and heavy tails).
and has a broad range of applications, (e.g ., modeling financial
returns). The normal is also w idely used when testing hypoth
eses about unknown param eters.

The following questions are intended to help candidates understand the material. They are n o t actual FRM exam questions.
QUESTIONS

3.1 W hat is the name of the distribution that describes any 3.5 How much probability is within ± 2 a of the mean of a nor
random binary event? mal random variable?
3.2 Under what conditions can a binomial and a Poisson be 3.6 How many independent standard normal random vari
approxim ated by a normal random variable? ables are required to construct a
3.3 W hat type of events are Poisson random variables used to 3.7 If a Beta-distributed random variable is used to describe
describe? some data, what requirem ent must the data satisfy?
3.4 If X is a standard uniform, what is Pr(0.2 < Z < 0.4)? W hat 3.8 How are the mean and variance of a mixture of normal
about Pr(/ < X < u), where 0 < / < u < 1? random variables related to the mean and variance of the
com ponents of the m ixture?
Practice Questions
3.9 Either using a Z table or the Excel function N O RM .S.D IST, a. The fund has access to a USD 10 million line of credit
com pute that does not count as part of its portfolio. W hat is the
chance that the firm's loss in a month exceeds this line
a. P r ( - 1.5 < Z < 0), where Z ~ N(0, 1)
b. Pr(Z < - 1 .5 ), where Z ~ N(0, 1) of credit?
b. W hat would the line of credit need to be to ensure
c. Pr(—1.5 < X < 0), where X ~ N(1, 2)
d. Pr(X > 2), where X ~ N(1, 2) that the firm's loss was less than the line of credit in
99.9% of months (or equivalently, larger than the LO C
e. Pr(W > 12), where W ~ N(3, 9)
in 0.1% of months)?
3.10 Either using a Z table or the Excel function N O RM .S.IN V,
com pute
3.13 If the kurtosis of some returns on a small-cap stock portfo
lio was 6, what would the degrees of freedom param eter
a. z so that Pr(z < Z) = .95 when Z ~ N(0, 1) be if they were generated by a generalized Student's t„?
b. z so that Pr(z > Z) = .95 when Z ~ N(0, 1)
W hat if the kurtosis was 9?
c. z so that Pr(—z < Z < z) = .75 when Z ~ N(0, 1)
d. a and b so that Pr(a < X < b) = .75 and
3.14 An analyst is using the following exponential function to
Pr(a < X) = 0.125 when X ~ N(2, 4) model corporate default rates:
3.11 If the return on a stock, R, is normally distributed with a M y) = % exp ( - % ) , y 2: 0,
daily mean of 8%/252 and a daily variance of (20%)2/252, where y is the number of years.
find the values where
a. W hat is the cum ulative probability of default within the
a. Pr(R < r) = .001 first five years?
b. Pr(R < r) = .01 b. W hat is the cum ulative probability of default within the
c. Pr(R < r) = .05 first ten years given that the com pany has survived for
five years?
3.12 The monthly return on a hedge fund portfolio with USD 1
billion in assets is N[.02, .0003). W hat is the distribution of
the gain in a month?

A N SW ERS

3.1 Any binary random variable can be described as a Ber them , so Pr(0.2 < Z < 0.4) = 0.4 - 0.2 = 20% and
noulli by mapping the outcom es to 0 or 1. For exam ple, Pr(/ < X < u) = u - I.
a corporate default (1 for default) or the direction of the
3.5 Form ally this quantity is 95.45% , but this area is com
m arker (1 for positive return). monly approxim ated as 95%.
3.2 A binomial can be well approxim ated by a normal when 3.6 A ^ is defined as the sum of v in dependent standard
np and n(1 — p) are both greater than or equal to 10.
normal random variables, and so v.
A Poisson can be accurately approxim ated by a normal
3.7 Beta random variables are continuous values on the
when the param eter A is large. The approxim ation will
interval [0,1], and so the values must lie in this range.
becom e more accurate as A increases. The standard
recom m endation for approxim ating the Poisson with a 3.8 The mean is the weighted average of the means of the
normal requires A > 1000. com ponents. The variance is more com plicated. The
second non-central moment, E[X 2] of the mixture is the
3.3 Poisson random variables are used for counts— the num
weighted average of the second non-central moments
ber of events in a fixed unit for tim e. For exam ple, the
of the com ponents. The variance is then E[X2] — E[X]2,
number of m ortgage defaults over a month.
which depends the first and second moments of
3.4 In a standard uniform, the probability between the mixture.
two points only depends on the distance between
Solved Problems
3.9 a. 43.3% . In Excel, the command to com pute this value answer to the previous problem by re-centering on
is N O R M .S.D IST(0,TR U E) - N O R M .S .D IS T (-1 .5 ,T R U E ). the mean and scaling by the standard deviation, so
that a = 2 X —1. 15 + 2 and b = 2 X 1.15 + 2. Note
b. 6.7% . In Excel, the command to com pute this value is
N O R M .S .D IS T (—1.5,TR U E). that the formula is a = a X q + / j l , where q is the
quantile value.
c. 20.1% . In Excel, the command to com pute this
value is N O RM .S.D IST((0 - 1)/SQ RT(2),TRUE) 3.11 a. The mean is 0.031% per day and the variance is
- N O R M .S .D IS T ((-1 .5 -1 )/S Q R T (2 ),T R U E ). 1.58 per day (so that the standard deviation is
1.26% per day). To find these values, we trans
d. 24.0% . In Excel, the command to com pute this value
form the variable to be standard normal, so that
is 1 - N O RM .S.D IST((2 - 1)/SQ RT(2),TRU E).
r - /x
Pr(R < r) = .001 = Pr Z < = .001
e. 0.13% . In Excel, the command to com pute this value \ (T /
is 1 - N O RM .S.D IST((12 - 3)/3,TRU E). The value for the standard normal is —3.09
3.10 a. 1.645. In Excel, the command to com pute this value is (NORM.S.INV(O.OOI) in Excel) so that
N O RM .S.IN V(.95). —3.09 X a + /jl = —3.86% .
b. The same idea an be used here where z = —2.32

b. —1.645. In Excel, the command to com pute this value
so that Pr(Z < z) = .01. Transforming this value,
is N O RM .S.IN V(.05).
r = —2.32 X a + /jl = —2.89% .
c. 1.15. Here the tail to the left should have 12.5%
c. Here the value of z is - 1 .6 4 5 so that r = —1.645 X
and the tail to the right should also have 12.5%.
a + /jl = —2.04% . These are all common VaR quan
In Excel, the command to com pute this value is
tiles and suggest that there is a 5% chance that the
-N O R M .S.IN V (.125).
return would be less than 2.04% on any given day, a
d. —0.3 and 4.3. The area of the left and right should
1% change that it would be less than 2.89% , and a
each have 12.5% . These can be constructed using the

one in 1,000 chance that the return would be less than that k [v — 4) = 3{v — 2) so that k v — 4 k = 3v — 6 and
3.86% , if returns were normally distributed. 4 k —6
— 3v = 4 — 6. Finally, solving for v, v = • Plug-
—3
k v k
3.12 a. The monthly return is 2%, and the monthly standard k
2 4 -6
deviation is 1.73% . In USD, the monthly change in gmg in v 18 = 6andi^ = = 5. The
6 -3 3 9 -3
portfolio value has a mean of 2% X USD 1 billion =
kurtosis falls rapidly as v grows. For exam ple, if v = 12
USD 20 million and a standard deviation of 1.73% X
then k = 3.75, which is only slightly higher than the kur
USD 1 billion = USD 17.3 million. The probability that
tosis of a normal (3).
the portfolio loses more than USD 10 million is than
(working in millions) 3.14 a. Fy(5) = 1 — exp
V - 20 b. W e need to divide the marginal probability of default
Pr(V < - 1 0 ) = Pr <
17.3 between years five and ten:
= Pr (Z < - 1 .7 3 )
Fy(10) - Fy(5)
Using the normal table, Pr(Z < - 1 .7 3 ) = 4.18% .
By the SURVIVAL probability through year five
b. Here we work in the other direction. First, we find (1 - result from part a).
the quantile where Pr(Z < z) = 99.9% , which gives
z = - 3 .0 9 . This is then scaled to the distribution of Fy(10) - Fy(5) = exp - exp ( - 1% )
the change in the value of the portfolio by m ultiply
ing by the standard deviation and adding the mean, — exp
Fy( 10) ~ Fy(5)
17.3 X —3.09 + 20 = —33.46. The fund would need
1 - Fy(5)
a line of credit of USD 33.46 million to have a 99.9% exp
change of having a change above this level.
3.13 In a Student's t, the kurtosis depends only on the degree
of freedom and is k = 3 ----—. This can be solved so

v —4

Multivariate
Random Variables
Learning Objectives
Explain how a probability m atrix can be used to express a Explain the effects of applying linear transformations on the
probability mass function (PM F). covariance and correlation between two random variables.
Com pute the marginal and conditional distributions of a Com pute the variance of a weighted sum of two random
discrete bivariate random variable. variables.
Explain how the expectation of a function is com puted for Com pute the conditional expectation of a com ponent of a
a bivariate discrete random variable. bivariate random variable.
Define covariance and explain what it m easures. Describe the features of an iid sequence of random
variables.
Explain the relationship between the covariance and
correlation of two random variables and how these are Explain how the iid property is helpful in computing the
related to the independence of the two variables. mean and variance of a sum of iid random variables.
47
M ultivariate random variables extend the concept of a single A PM F describes the probability of the outcom es as a function
random variable to include measures of dependence between of the coordinates X1 and x 2. The probabilities are always non
two or more random variables. negative, less than or equal to 1, and the sum across all possible
values in the support of X-] and X 2 is 1.
Note that all results from Chapter 2 are directly applicable here,
because each com ponent of a m ultivariate random variable The leading exam ple of a discrete bivariate random variable is
o
is a univariate random variable. This chapter focuses on the the trinomial distribution, which is the distribution of n inde
extensions required to understand multivariate random vari pendent trials where each trial produces one of three outcom es.
ables, including how expectations change, new m oments, and In other words, this distribution generalizes the binomial.
additional characterizations of uncertainty. W hile this chapter
The two com ponents of a trinomial are X-| and X 2, which count
discusses continuous and discrete m ultivariate random variables
the number of realizations of outcom es 1 and 2. The count of
in separate sections, note that virtually all results for continuous
the realizations for outcom e 3 is given by n — X-| — X 2, and so it
variables can be derived from those for discrete variables by
is redundant given knowledge of X-| and X 2.
replacing the sum with an integral.
For exam ple, when measuring the credit quality of a diversi
This chapter focuses primarily on bivariate random variables to
fied bond portfolio, the n bonds can be classified as investm ent
simplify the m athem atics required to understand the key con
grade, high yield or unrated. In this case, X-, is the count of the
cepts. All definitions and results extend directly to random vari
investm ent grade bonds and X 2 is the count of the high yield
ables with three or more com ponents.
bonds.
The trinomial PM F has three param eters:

4.1 DISCRETE RANDOM VARIABLES 1. n (i.e., the total number of experim ents),
2 . p 1 (i.e., the probability of observing outcom e 1), and

Multivate random variables are vectors1 of random variables. For
exam ple, a bivariate random variable X would be a vector with 3 . p2 (i.e., the probability of observing outcom e 2).
two com ponents: X-| and X 2 . Similarly, its realization (i.e., x) would Because these outcom es are exclusive, the probability of out
have two com ponent values as well: x-| and x 2. Treated sepa
come 3 is simply
rately, x-| is a realization from X 1 and x 2 is a realization from X 2.
P3 = 1 - P 2 - Pi
M ultivariate random variables are like their univariate counter
The PM F of a trinomial random variable is
parts in that they can be discrete or continuous. Similarly, both
types of random variables are denoted with uppercase letters
fx,.x,(*i,x2) = , ,, n!--------- r7Ptp?(1 - p, - p t f 1- * ’ - *
(e.g ., X, Y, or Z ). x !x2!(n — x-] — x2)!
1
Note that the probability mass function (PMF)/probability density

Figure 4.1 plots the PM F of a trinomial with p 1 = 20%,
function (PDF) for a bivariate random variable returns the probabil
p2 = 50% and n = 5. The mass function shows that most of the
ity that two random variables each take a certain value. This means
mass occurs when X-| is small (i.e., 0 or 1).
that plotting these functions requires three axes: X 1f X 2, and the
probability mass/density. Because the C D F is either the sum of the The C D F of a bivariate variable is a function returning the total
PM F for all values in the support that are less than or equal to x probability that each com ponent is less than or equal to a given
(for discrete random variables) or the integral of the PD F (for con value for that com ponent, so that:
tinuous random variables), it requires three axes as well.
F x1(x2(xi' x2) = 2 2 &1fx2(ti» *2) (4-3)
t,ER(X,) t 2 E R(X2)
t!<X! t2<X2
Probability Matrices In the equation above, t| contains the values that X-| may take as
long as t| < x-|, and t2 is similarly defined for X 2.
The PM F of a bivariate random variable is a function that returns
the probability that X = x. Put in term s of the com ponents, this D iscrete distributions defined over a finite set of values can be
would mean X-| = x-| and X 2 = x 2 described using a probability m atrix, which relates realizations
fx1(x2(* i, * 2 ) = FM-î = x-|, X 2 = x 2) (4.1)
2 Both the binomial and the trinomial are special cases of the multino
1 A vector of dimension n is an ordered collection of n elements, which mial distribution, which is the distribution of n independent trials that
are called components. can take one of k possible outcomes.

random variable.3 This distribution contains
the probabilities of realizations of X-| and its
PM F denoted by fXl(x i)- Note that this is the
same notation used to describe the PM F of a
univariate random variable in Chapter 2.
The marginal PM F of X-| can be com puted

by summing the probability for each realiza
tion of X-| across all values in the support of
X 2. The marginal PM F is defined as:
W * 1) = 2 fx1(x2(*i/ * 2 ) (4.4)
x2 e R(X2)
Returning to the previous exam ple using

the probability m atrix, the marginal PM F of
the recom m endation when the stock return
is 5% is
Fiqure 4.1 The PMF of a trinomial random variable with p = 20%, fXl(5%) = 2 ft5%, x 2)
x2={—1,0,1}
p2 = 50% and n = 5.
= 0% + 15% + 20% = 35%
to p robabilities. In other w ords, it is a tabular representation This calculation can be repeated for stock returns of - 5 % and
of a PMF. 0% to construct the com plete marginal PMF, which is
For exam ple, suppose that the return on a company's stock is

Stock Return
related to the rating given to the com pany by an analyst. For
-5 % 0% 5%
sim plicity, assume that the stock can only have three possible
returns: 0, 5%, or —5%. Analyst ratings can be positive, neutral, 35% 30% 35%
or negative, and are labeled —1, 0, or 1 (respectively). The prob
W hen a PM F is represented as a probability m atrix, the two mar
ability m atrix for this problem is
ginal distributions are com puted by summing across columns
(which constructs the marginal distribution of the row variables)
Analyst Ratings Stock Return (X j) or summing down rows (which constructs the marginal PM F for
-5 % 0% 5% the column variables).
Analyst (X2)
Negative -1 20% 10% 0% The two marginal distributions in the stock return/recommenda-
tion exam ple are
Neutral 0 10% 15% 15%
Positive 1 5% 5% 20% Stock Return (X))
-5 % 0% 5% fx2(*z)
Each cell contains the probability that the combination of the
Negative -1 20% 10% 0% 30%
two outcom es is realized. For exam ple, there is a 10% chance
Analyst (X2)
that the stock price declines with a neutral rating and a 20% Neutral 0 10% 15% 15% 40%
chance that it declines with a negative rating. Positive 1 5% 5% 20% 30%
fx,(*i) 35% 30% 35%
Marginal Distributions Note that marginal PM F is a univariate PMF.
The PM F describes the join t distribution of the two com ponents

[i.e., Pr(X-| = x-|, X 2 = x 2)] and provides a com plete description 3 The focus of the chapter is on bivariate random variables, and so a
marginal distribution can only be defined for a single random variable.
of the uncertainty across both random variables. However, it is
For a random variable with k-components, the marginal distribution can
also possible to exam ine each variable individually. be computed for any subset of the component random variable contain
ing between 1 and k — 1 components. Any marginal distribution is con
The distribution of a single com ponent of a bivariate random vari structed from the joint distribution by summing the probability over the
able is called a marginal distribution, and it is simply a univariate excluded components.
Chapter 4 Multivariate Random Variables ■ 49

The marginal C D F is defined in a natural way using the margina conditional distribution, on the other hand, sum marizes the
PM F so that Pr(X-| < x-|) m easures the total probability in the probability of the outcom es for one random variable conditional
marginal PM F less than Xi on the other taking a specific value. In C hapter 1, the condi
tional probability of two events was defined as:
^ (x -i) = X ^ (T i)
tne R(X,)
tiS Xl P r (A H B )
Pr(B)
Independence This definition of conditional probability can be applied to a
In C hapter 1, two events A and B were defined to be indepen bivariate random variable to construct a conditional distribution.
dent if The conditional distribution of X-] given X 2 is defined as:
P r (A D B ) = Pr(A) Pr(B) fx„x2(xi. * 2 )

f x ^ N X z — * 2) — (4.6)
fx,(*2)
This definition of independence extends directly to bivariate
random variables, and the com ponents of a bivariate random In other words, it is the joint probability of the two events
variable are independent if divided by the marginal probability that X 2 = x 2.
fx,,x2(*i / x 2) = fXl (x-|)fX2(x2) (4.5) For exam ple, suppose that we want to determ ine the distribu
tion of stock returns conditional on a positive analyst rating. This
Independence requires that the joint PM F of X-| and X 2 be the
conditional distribution is
product of the marginal PM Fs.
r , lxx _ ^ _ fx 1(x2(x i, _ fx1(x2(x i, 1)
Returning to the exam p le of the stock return and ratings, ^ , x 2(x i |X2 - 1) - - Q3
the jo in t distribution (left) and the product of the m arginals
(right) are Applying this expression to the row corresponding to X 2 = 1
in the bivariate probability matrix, the conditional distribution
Joint Distribution Product of Marginals is then:
Stock Return (X j) Stock Return (X ,)

Stock Return
-5 % 0% 5% -5 % 0% 5%
-5 % 0% 5%
Analyst (X 2)
Negative -1 20% 10% 0% 10.5% 9% 10.5%

16.6% 16.6% 66.6%
Neutral 0 10% 15% 15% 14% 12% 14%
Recall that probabilities of the three outcom es of the stock
Positive 1 5% 5% 20% 10.5% 9% 10.5% return are 5%, 5%, and 20% when the analyst makes a positive
recom m endation. These values are divided by the marginal
The m atrix on the right is constructed by multiplying the mar
probability of a positive recom m endation (i.e., 30%) to produce
ginal PM F of the stock return and the marginal PM F of the
the conditional distribution. Note that a conditional PM F must
analyst rating. For exam ple, the marginal probability that the
satisfy the conditions of any PMF, and so the probabilities are
stock return is —5% is the sum of the first column (i.e., 35%). The
always non-negative and sum to one.
probability that the analyst gives a negative rating is the sum of
the first row (i.e., 30%). The upper-left value in the right panel Conditional distributions can also be defined over a set of out
is then com es for one of the variables. For exam ple, consider the distri
bution of the stock return given that the analyst did not give a
35% X 30% = 10.5%
positive rating. This set is
If two random variables are independent, one must contain no
X 2 e { - 1 ,0 }
information about the other. Flowever, this condition is violated
here because if we know that if X-| = 5% , then X 2 cannot be The conditional PM F must sum across all outcom es in the set
negative. Therefore, the two random variables are dependent that is conditioned on S = { - 1 ,0 } , and so:
(i.e., not independent). S X2es^x1,x2(x i/ x 2)
^x 1|x 2(x i |X2 G S) (4.7)
S X2es^x2(x 2 )
Conditional Distributions The marginal probability that X 2 G { - 1 ,0 } is the sum of the (mar
ginal) probabilities of these two outcom es:
The marginal PM Fs summarize the probabilities of the out
com es for each com ponent and are thus univariate PM Fs. The fx2M ) + fx2(0) = 70%

The conditional probability is then: The expectation of a function g(X-1 , X 2) is a probability weighted
average of the function of the outcom es g(x-|, x 2). The exp ecta
20% + 10%
=42.8% tion is defined as:
70%
10% + 15%
fxn|X2(xi|X2 E {-1,0}) =35.7% E[g(X1fX 2) ] = 2 2 9 ( x i ,x 2)fXl,x2( x i , x 2) (4.8)
70% x , e R(X1)x2 G R(X2)
0% + 15% Note that while g (x1f x 2) depends on x-] and x 2, it may be a func
21.4%
70% tion of only one of the com ponents. The double sum across
the support of both X-] and X 2 is required when computing any
The definition of a conditional distribution can also be rear
expectation— even when the function depends on only one
ranged so that the joint PM F is expressed in term s of the mar
com ponent.
ginal PM F and a conditional PMF. For exam ple:
A s in the univariate case, E[g(X-|, X 2)j g(E[X-i],E[X2j) for a non
fx1(x2(* 1 , x 2) = fxi|X2(x l | - ^ 2 = x 2 ^x2(x 2 )
linear function g(x-|, x 2).
or
A s an exam ple, consider the following joint PM F:
fx1(x2(*1, x 2) — ^X2|X ,( x 2|^1 — X i) f X l(X i)
These identities can be transform ed to produce further insights. Xi

First, recall that the com ponents of a bivariate random variable
1 2
are independent if:
3 15% 10%
fx1(x2(x i, x 2) = fx1(x1)fx2(x2) x"
4 60% 15%
Comparing this definition to the decomposition of the joint into
the conditional and the marginal, then independence requires that: Now, consider the function
fx, |x2(x i |X2 — x 2) — fXl(xi) g (* i, x 2) = x f 2
In other words, equality of the conditional and the marginal The expectation of g (x1f x 2) is therefore
PM Fs is a requirem ent of independence. Specifically, knowledge
E[g(x-|, x 2)] = 2 2 g {x h x 2)f(x-|, x 2)
about the value of X 2 (e.g ., X 2 = x 2) must not contain any infor x,E^,2}x2e{3A}
mation about X-|. E[g (x1f x 2)] = 13(0 .15) + 14(0.6) + 23(0.10) + 24(0.15)
Returning to the exam ple of the stock return and recom m enda = 0.15 + 0.60 + 0.80 + 2.4
tion, the marginal distribution of the stock return and the condi = 3.95
tional distribution when the analyst gives a positive rating are
Conditional PMF on Moments

Marginal PMF X2 = 1
Expectations are used to define moments of bivariate ran
Stock Return -5 % 0% 5% -5 % 0% 5% dom variables in the same way that they are used to define
Probability 35% 30% 35% 16.6% 16.6% 66.6% moments for univariate random variables. Because X = [X 1f X 2],
the first moment of X (i.e., the mean E[X]) is the mean of the
These distributions differ, and, therefore, reconfirm that the rec
com ponents
om mendation and the return are not independent.
E[X] = [E[X n],E[X2]] = [fx,l f x2] (4.9)
The second moment of X adds an extra term and is referred to

4.2 EXPECTATIONS AND MOMENTS as the covariance of X. The covariance is technically a 2-by-2
m atrix of values, where the values along one diagonal are the
Expectations variances of X-| and X 2, and the term s in the other diagonal are
the covariance between X-, and X 2.
The expectation of a function of a bivariate random variable is
defined analogously to that of a univariate random variable.4 The variance is a measure of dispersion for a single vari
able, and the definition of the variance of each com ponent is
unchanged:
4 As a reminder, the expectation is the weighted sum of the outcome
times the probability of the outcome. V [X ,] = E[(X, - E[X , ])2] (4.10)

The covariance between X-, and X 2 is defined as: Note that the variance of a + b X | is
C o v[X 1/X 2] = E[CX, - E[X 1])(X2 - E[X2])] b2V[X-| ]

= E[X 1X 2] - E[X-|]E[X2] (4.11)
This means that location shifts by an amount a have no effect on
The covariance is a measure of dispersion that captures how the the variance, whereas rescaling by b scales the variance by b2.
variables move together. Note that it is a generalization of the
The covariance between two random variables scales in an
variance, and the covariance of a variable with itself is just the
analogous way, so that:
variance:
Cov[a + bX-|, c + d X 2] = b d C o v[X lf X 2]
C o v[X 1f X |] = E[(X, - E P C ,])^ - E[X-|])]
= E[(X, - E[X , ])2] The covariance is defined in terms of the deviation from the mean
= V [X ,] for each random variable, and so location shifts have no effect. At

the same time, the scale of each component contributes multi-
In a bivariate random variable, there are two variances and one
plicatively to the change in the covariance. Combining these two
covariance. Therefore, it is necessary to distinguish between
properties demonstrates that the correlation estimator is scale free:
these when using short-hand notation. It is common to express
V[X-|] as (j\ and V[X2] as erf. The symbols cr-, and cr2 are used to ab Cov[X-|, X 2]
Corr[aX-|, bX2]
denote the standard deviations of X-| and X 2, respectively. V a 2V [X 1] V b 2V[X2]
An alternative abbreviation schem e uses (Xn and cr22 to indicate ab C o v[X 1 ;X 2]
the variances of X-| and X 2, respectively. The short-hand nota lallbl v v ix d V v f x ^

tion for their standard deviations is identical in this alternative sign(a)sign(b)Corr[X-|, X 2]
schem e. The covariance is commonly abbreviated to er12 in both
The correlation between two random variables is commonly
schem es.
denoted by p (or p12 to specify that this is the correlation
The covariance depends on the scales of X-| and of X 2. It is between the com ponents).
therefore more common to report the correlation, which is a
The covariance can be expressed in term s of the correlation and
scale-free m easure. The correlation is defined as:
the standard deviations:
Cov[X-|, X 2] (J12 (J12
Corr[X-], X 2] (4.12) °"12 (4.14)
0-1(72 ~ P l 2 a / \a 2
The variance of each com ponent and the covariance between

Correlation measures the strength of the linear relationship
the two com ponents of X are frequently expressed as a 2-by-2
between two variables and is always between —1 and 1. For
covariance matrix. The covariance m atrix of X is
exam ple, if X 2 = a + b X lf then the correlation between X-| and
X 2 is 1 if b > 0, —1 if b < 0 or 0 if b = 0. This can be directly (7 1 2 C7|1 (712
Cov[X] = (4.15)
verified because: _C7|2 (T2 _ LO-12 (J22
Cov[X-|, X 2] = C o v[X l7 a + bX-|] Cross-variable versions of the skewness (coskewness) and kurto-
= E[X i(a + bX-,)] - E[XdE[a + bX-,] sis (cokurtosis) are similarly defined by taking powers of the two
= E[aXi + bX2] - a E ^ ] - bE[X-,]2 variables that sum to 3 or 4, respectively. There are two distinct
= aE[Xi] - aE[Xi] + b(E[Xf] - EfX^2) = bV[X-,] coskewness measures and three distinct cokurtosis m easures.
For exam ple, the two coskewness m easures are
Recall from C hapter 2 that V[a + bX] = b2V[X]. This means that
when b ^ 0, the correlation is E[(X, - E[X ,])2(X2 - E[X,])] , E[(X, - EIX ^ X X , - E[X ,])2]
and
<7W2
? (T1 09
02
bV[Xi ] (4.16)
= sign(b) (4.13)5
V \ / [ x j] V b 2W[X^ ] Like the skewness of a single random variable, the coskewness
W hen X i and X 2 tend to increase together, then the correlation m easures are standardized.
is positive. If X 2 tends to decrease when X-| increases, then these
W hile interpreting these m easures is more challenging than the
two random variables have a negative correlation.
covariance, they both measure w hether one random variable
(raised to the power 1) takes a clear direction w henever the
other return (raised to the power 2) is large in m agnitude. For
5 The sign (b) function returns a value of -1 if b is negative or a value of exam ple, it is common for returns on individual stocks to have
1 if b is positive. negative coskew, so that when one return is negative, the other

tends to experience a period of high volatility (i.e., its squared The variance of the sum is the sum of the variances plus twice
value is large). the covariance. For independent variables, the variance of the
sum is simply the sum of the variances. The covariance in the
variance of the sum captures the propensity of the two random
The Variance of Sums of Random variables to move together. This result can be generalized to
Variables weighted sums:
The covariance plays a key role in the variance of the sum of two N/[aX, + bX2] = a2V[X1] + b2V[X2] + l a b C o v[X 1f X 2] (4.18)
or more random variables:
This result is im portant in portfolio applications, where the con
V[X, + X 2] = V [X n] + V[X2] + 2C o v[X 1; X 2] (4.17) stants are the weights in the portfolio.
CORRELATION AND PO RTFOLIO DIVERSIFICATION

Correlation plays an im portant role in determ ining the ben (in the sense that it makes the variance as small as possible)
efits of portfolio diversification. The return on a portfolio can be shown to be
depends on:
(722 ~
w
°~12
• The distribution of the returns on the assets in the portfo (T ii 2 o " i2 T (722
lio, and
Figure 4.2 plots the standard deviation of a portfolio with two
• The portfolio's weights on these assets. assets that have a covariance m atrix of:
The portfolio weights measure the share of funds invested in an 20%2 p X 20% X 10%
asset. These weights must sum to 1 by construction, and so in a _p x 20 % x 10% 10%2
simple portfolio with two assets, the two portfolio weights are
In this exam ple, the first asset has equity market-like volatility
• w (i.e., the w eight on the first asset), and
of 20% , and the second has corporate bond-like volatility of
• 1 — w (i.e., the w eight on the second asset). 10%. M eanwhile, the correlation is varied between - 1 and 1.
If the returns on two assets have a covariance m atrix of: The portfolio's standard deviation (or variance) is smallest when
the correlation is strong and negative. It is also asym m etric,
°"i (Tu because larger positive correlations lead to larger portfolio
_(Ti2 (t \ _ standard deviations than small negative correlations. This
happens because the optimal w eight is negative for the larg
then the variance of the portfolio's return is est correlations and so the second asset has a w eight larger
than 1. While it is optimal to have a weight larger than 1, this
w 2(j\ + (1 — w )2<
t 2 + 2w(1 — w)er12
extra total exposure limits the benefits to diversification. When
This is an application of the formula for the covariance of the correlation is negative, both assets have weights between 0
w eighted sums of random variables. The optimal w eight and 1, and so this additional source of variance is not present.
Fiqure 4.2 The portfolio standard deviation using the optimal portfolio weight in a two-asset portfolio,
where the first asset has a volatility of 20%, the second has a volatility of 10%, and the correlation between
the two is varied between —1 and 1.

Covariance, Correlation, In the stock return/analyst rating exam ple, consider the
expected return on the stock given a positive analyst rating. The
and Independence
conditional distribution here [i.e., fx1 ix2 (x il ^ 2 = 1)1 is
There is an im portant relationship between correlation and
independence. W hen two random variables are independent, Stock Return
they must have zero correlation because (by definition) they
-5 % 0 % 5%
must contain no information about each other. However, if two
random variables have zero correlation, they are not necessarily 16.6% 16.6% . %
6 6 6
independent.
The conditional expectation of the return is then:
Correlation is a measure of linear dependence. If two variables
E [X ,|X 2 = 1] = - 5 % x 16.6% + 0 x 16.6% + 5% x . %
6 6 6
have a strong linear relationship (i.e., they produce values that
= 2.5%
lie close to a straight line), then they have a large correlation. If
two random variables have no linear relationship, then their cor Conditional expectations can be defined over a value or any set of
relation is zero. values that X 2 might take. For exam ple, the conditional expecta
tion of X-| given that the stock is not upgraded (i.e., X 2 G { —1,0})
However, variables can be non-linearly dependent. A simple
depends on the conditional PMF of X-] given that X 2 G { —1,0}.
exam ple of this occurs when:
This expectation is computed using the conditional PMF:
X , ~ N(0,1) and X 2 = X? ^X1 |X2 (X11^2 £ { “ 1,0})
Using the fact that the skewness of a normal is 0, it can be Conditional expectations can be extended to any moment by
shown that Cov[X-|, X 2] = 0. However, these two random vari replacing all expectation operators with conditional expectation
ables are clearly not independent. Because the random variable operators. For exam ple, the variance of a random variable X-] is
X 2 is a function of X-|, the realization of X-| determ ines the real
M X ,] = E[(X, - E ix ,])2] = E[X2] - E[X , ] 2
ization of X 2. Sim ilarly, a realization of X 2 can be used to predict
X-| up to a m ultiplicative factor of + 1 . The conditional variance of X-] given X 2 is then:
Finally, if the correlation between two variables is 0, then V[X, |X 2 = x 2] = E[(X, - E[X , |X 2 = x 2])2|X2 = x 2]
the expectation of the product of the two is the product of
= E [X 2 |X 2 = x 2] - E [X ,|X 2 = x 2 ]2
the expectations. This means that if Corr[X-|, X 2] = 0, then
E[X 1 X 2] = E[X-|]E[X2]. This follows from the definition of covari Returning to the exam ple of the stock return and the recom
ance because: m endation, the standard deviation of the stock return (i.e., the
square root of the variance) is
Cov[X-|, X 2] = E[X-|X2] - E[X-|]E[X2],
a-i = V v p ^
so that when the covariance is zero, the two term s on the right-
hand side must be equal = VE[X?] - E[Xi ]2
E [X )X 2] = E[X-|]E[X2] ((- 0 .0 5 )2(0.35) + 02(0.30) + (0.05)2(0.35)) -

(—0.05(.35) + 0(.30) + 0.05(.35))2
= 4.18%
4.3 CONDITIONAL EXPECTATIONS The standard deviation conditional on a positive rating is
° X , |X2 = 1 = V V I X .I X , = X2 ]
Many applications exam ine the expectation of a random vari
able conditional on the outcom e of another random variable. = V e [X?|X 2 = 1] - E [X ,|X 2 = 1]2
For exam ple, it is common in risk m anagem ent to model the
((—0.05)2(0.166) + 02(0.166) + (0.05)2(0.666))
expected loss on a portfolio given a large negative market
- (-0 .0 5 (.1 6 6 ) + 0(0.166) + 0.05(0.666))2
return.
= 3.81%
A conditional expectation is an expectation when one random
variable takes a specific value or falls into a defined range of val
Conditional Independence
ues. It uses the same expression as any other expectation and is
a weighted average where the probabilities are determ ined by a A s a rule, random variables are dependent in finance and risk
conditional PMF. m anagem ent. This dependence arises from many sources,

including shifts in investor risk aversion, cross-asset or cross-bor 4.4 CONTINUOUS RANDOM
der spillovers, and crowded portfolio strategies. However, con
ditioning is a useful tool to remove the dependence between
VARIABLES
variables.
A s is the case with univariate random variables, moving from
Returning to the stock return and recom m endation exam ple, discrete m ultivariate random variables to those that are continu
suppose that an investor wants to identify firm s that have a ous changes little. The most substantial change is the replace
conservative m anagem ent style. This style variable can be ment of the PM F with a PDF, and the corresponding switch from
labeled Z. sums to integrals when computing expectations. All definitions
and high-level results (e.g ., the effect of linear transform ations
Now suppose that 12.5% of the com panies are in the conserva
on the covariance of random variables) are identical.
tive category and that the probability of the outcom es of stock
returns and recom m endations based on m anagem ent style are The PD F of a bivariate continuous random variable returns the
as in the table below: probability associated with a single point. This value is not tech
nically a probability because, like a univariate random variable, it
Conservative Other is only possible to assign a positive probability to intervals. The
Style (Z) Stock Return (X |) Stock Return (X ,) PD F has the same form as a PM F and is written as fXl,x2 (x i/ x 2 )-
The probability in a rectangular region is defined as:
-5 % 0 % 5% -5 % 0 % 5%
Analyst (X2)
r u1 u2
Negative - 1 1 % 4% 0 % 19% 6 % 0 % Pr(/ 1 ^ X 1 U '| O I2 ^ X2 ^ U2) ~ I fx„x2(x i- x 2 ) c /x 2 dx,

A (4.19)
Neutral 0 1 % 4% 0 % 9% 1 1 % 15%
Positive 1 0.5% 2 % 0 % 4.5% 3% 2 0 % This expression com putes the area under the PD F function in
the region defined by /1f u ,, l2, and U2 . 6 A PD F function is always
Note that the overall probabilities are consistent with the non-negative and must integrate to one across the support of
original bivariate m atrix. For exam ple, there is still a total 20% the two com ponents.
chance of a negative rating when the stock return is —5%.
The cum ulative distribution function is the area of a rectangular
Now consider the bivariate distribution conditional on a com region under the PD F where the lower bounds are — 00 and the
pany being in this conservative category. As before, each upper bounds are the argum ents in the C D F:
individual probability is rescaled by a constant facto r so that X1 /*X2
the total conditional probability sums to 1. In other w ords,

each value is divided by the probability that the m anagem ent / - 00
/
J — oc
fit,, t2)d t2 d t, (4.20)
is conservative (i.e ., 12.5% ), which is also the sum of the values

If the PD F is not defined for all values, then the lower bounds
in the panel.
can be adjusted to be the sm allest values where the random
variable has support.
Conservative
Most continuous m ultivariate random variables have com pli
Stock Return (X ,)
cated PD Fs. The sim plest is the bivariate uniform where the two
-5 % 0 % 5% fx2 (x 2 ) com ponents are independent. The PD F of this random variable
Negative % 32% % 40% is constant:
£ - 1 8 0
+-»
CO
Neutral 0 8 % 32% 0 % 40% 6 (1, x 2( x i / x2 ) = 1 (4.21)
CD
Positive 4% 16% % % and the support of the density is the unit square (i.e., a square
C
< 1 0 2 0
% 80% % whose sides have lengths of 1). The C D F of the bivariate inde
^ ( x i) 2 0 0
pendent uniform is
This shows that this bivariate distribution, which conditions on J X x 2(x i. x 2) = X1X2 (4 -22)
the com pany having a conservative m anagem ent style, is equal
to the product of its marginal distributions. To see this, check
that the cells of the m atrix are equal to the product of their cor
responding marginals [e.g ., 8 % = 40%(20%)]. This distribution 6 The probability that a continuous random variable takes a single point
value is 0, and so the probability over a closed interval is the same as
is therefore conditionally independent even though the original the probability over the open interval with the same bounds, so that
distribution is not. Pr(/-| < X-| “v u, Cl l2 <'- X 2 <C u2) — Pr(/i — X, — u, Cl l2 — X 2 — u2).

This PD F and C D F correspond to a uniform random variable that 4.5 INDEPENDENT, IDENTICALLY
has independent com ponents. This is not always the case, how
ever, and there are many exam ples of uniform random variables
DISTRIBUTED RANDOM VARIABLES
that are dependent. D ependent uniforms and the special role
Independent, identically distributed (iid) random variables are
they play in understanding the dependence in any multivariate
those that are generated by a single univariate distribution.
random variable are exam ined in later chapters.
They are commonly em ployed as fundam ental shocks in m odels,
especially in tim e series.
Marginal and Conditional Distributions For exam ple, it is common to write
For a continuous bivariate random variable, the marginal PD F

integrates one com ponent out of the joint PDF.
as a generator of an n-component m ultivariate normal random
fx1fx2(*i, xz)dx2 (4.23) variable where each com ponent is independent of all other
com ponents and is distributed a 2).
This definition is identical to that of the marginal distribution
M anipulating iid random variables is particularly sim ple,
of a discrete random variable, aside from the replacem ent of
because the variables are independent and have the same
the sum with an integral. The effect is the also sam e, and the
m om ents. For exam ple, the exp ected value of a sum of n iid
marginal PD F of one com ponent is just a continuous univariate
random variables is
random variable.
The conditional PD F is analogously defined, and the conditional (4.25)

PD F of X-| given X 2 is the ratio of the joint PD F to the PD F of X 2:
where /x = E[X,] is the common mean of the random variables.
f x : ,X2(*1 , X 2)
î|X2 (x ll ^ 2 — x 2) ~ (4.24) This result only depends on the fact that the variables are identi
fx M
cal because the expectation of a sum is always the sum of the
This conditional PD F is the distribution of X-] if X 2 is known to expectations (due to the linearity of the expectation operator).
be equal to x 2. The conditional PD F is a density and has all the
The variance of a sum of n iid random variables is
properties of a univariate PDF.
n n n n
The conditional distribution can also be defined when X 2 is in an 2 V [ X ,] + 2 2 2 C o v [X „X J
i=1 ; = U = ;+ 1
interval. Conditional distributions of this form are im portant in
n n
risk m anagem ent when X-, and X 2 are asset returns. In this case,
2 > 2 + 2 2 2 0
the distribution of interest is that of X-| given that X 2 has suffered i= 1 j= 1 k=j+ 1
a large loss. This structure is used in later chapters when exam in
(4.26)
ing concepts such as expected shortfall.
The variance of a sum of iid random variables is just n tim es the
For exam ple, consider the distribution of the return on a hedge
variance of one variable. Unlike the expectation of the sum,
fund (X-|) given that the monthly return on the S&P 500 (X2) is in
however, this sim plification does require independence.
the bottom 5% of its distribution. Historical data shows that the
S&P 500 return is above —6.19% in 95% of months, and thus the Too see this, note that the first line shows that the variance of
case of interest X 2 G (—°°, —6.19% ). The conditional PD F of the the sum is the sum of the variances plus all pairwise covariances
hedge fund's return is multiplied by tw o. There are n(n - 1)/2 distinct pairs, and the
double sum is a sim ple method to account for the distinct com
-0.0619
bination of indices. Furtherm ore, the independence assumption
fx„x 2 (x i, x 2 )dx 2
— oc ensures that the covariances are all zero, which leads to the sim
^ \x2(x i | ^ 2 £ (-0 0 / -6 .1 9 % )) -0.0619
plification of the sum. If the data are not independent, then the
fx (x 2 )c/x2
— oc covariance between different com ponents would appear in the
variance of the sum.
W hile this expression might look strange, it is identical to the
definition of the PM F when one of the variables is in a set of The assumption that the distribution of X, is identical is also
values. This conditional PD F is a weighted average of the PD F of im portant because it ensures that V[X,] = a for all /. W ithout
X-| when X 2 < —6.19% , with the weights being the probabilities this assumption, the variance of the sum is still the sum of the
associated with the values of X 2 in this range. variances, although each variance may take a different value.

It is im portant to distinguish between the variance of the sum weighted by their associated probabilities. The same moments
of multiple random variables and the variance of a multiple of used for univariate random variables (i.e., the mean and the vari
a single random variable. In particular, if X-, and X2 are iid with ance) are also useful for describing bivariate random variables.
variance a 2, then:
However, these moments are not sufficient to fully describe
V[X-, + X2] = 2cr2 a bivariate distribution, and so it is necessary to describe the
dependence between the com ponents. The most common m ea
is different from
sure of dependence is the covariance, which m easures the com
V[2X-,] = 4ct2
mon dispersion between two variables. In practice, it is more
or useful to report the correlation, which is a scale- and unit-free
transform ation of the covariance.
V[2X2] = 4rr2
Bivariate distributions can be transform ed into either marginal
This property plays an im portant role when estimating unknown
or conditional distributions. A marginal distribution summarizes
param eters. The variance of the sum of iid random variables
the information about a single variable and is simply a univariate
grows linearly. This means that when the sum of n random vari
distribution. A conditional distribution describes the probability
ables is divided by n to form an average, the variance of the
of one random variable conditional on an outcome or a range
average reduces as n grows. This property is explored in detail
of outcomes of another. The conditional distribution plays a key
in the next chapter.
role in finance and risk m anagem ent, where it is used to measure
the return on an asset conditional on a significant loss in another
asset. The conditional distribution can also be used to produce a
4.6 SUMMARY
conditional expectation and conditional moments, which provide
summaries of the conditional distribution of a random variable.
M ultivariate random variables are natural extensions of univari
ate random variables. They are defined using PM Fs (for discrete Moving from discrete to continuous random variables makes
variables) or PD Fs (for continuous variables), which describe the little difference aside from some small technical changes. The
joint probability of outcom e com binations. Expected values are concepts are identical, and all properties of moments continue
com puted using the same m ethod, where the outcom es are to hold.

QUESTIONS

4.1 How are the marginal distributions related when two ran 4.6 Under what conditions are conditional expectations and
dom variables are independent? marginal expectations equal?
4.2 W hen are conditional distributions equal to marginal 4.7 If X-| and X 2 are independent continuous random variables
distributions? with PDFs fXl(x 1 ) and fx2 (x 2 )» respectively, what is the joint
4.3 If X-| and X 2 are independent, what is the correlation PD F of the two random variables?
between X-| and X 2? 4.8 If X-| and X 2 both have univariate normal distributions, is
the joint distribution of X-, and X 2 a bivariate normal?
4.4 If X-| and X 2 are uncorrelated (Corr[X-|, X 2] = 0), are they
independent? 4.9 In the previous question, what if X-, and X 2 are
independent?
4.5 W hat is the effect on the covariance between X-| and X 2 of
rescaling X-| by w and X 2 by (1 — w), where w is between 4.10 Are sums of iid normal random variables normally distributed?
0 and 1 ?
Practice Questions
4.11 Suppose that the annual profit of two firm s, one an of the profit of Big Firm when Small Firm either has no
incum bent (Big Firm , X-|) and the other a startup (Small profit or loses money (X 2 < 0)?
Firm , X 2), can be described with the following probability 4.15 The return distributions for the S&P 500 and Nikkei for the
m atrix:
next year are given as:
Small Firm (X2) Return - 1 0 % 0 % + 1 0 %

S&P 500
-U SD 1M USD 0 USD 2M USD 4M Probability 25% 50% 25%
>< -U SD 50M 1.97% 3.90% 0 .8 % 0 .1 % Return -5 % 0 % 8 %
£ Nikkei
•mb USD 0 3.93% 23.5% 1 2 .6 % 2.99% Probability 2 0 % 60% 2 0 %
LL
U) USD 10M .8 % 12.7% 14.2% .6 8 %
5 a. W hat is the joint distribution m atrix if the two return
0 6
USD 100M 0 % 3.09% 6.58% 6.16% series are unrelated?

b. If the actual m atrix looks as follows:
a. W hat are the marginal distributions of the profits of
each firm? S&P 500
b. Are the returns on the two firms independent?
-10% 0% +10%
c. W hat is the conditional distribution of Small Firm's
profits if Big Firm has a $100M profit year? -5 % 15% 1%
4.12 Using the probability m atrix for Big Firm and Small Firm, Nikkei 0% 5% 40%
what are the covariance and correlations between the 8% 6% 9%
profits of these two firm s?
Fill in the missing cells to match the original given
4.13 If an investor owned 20% of Big Firm and 80% of Small marginal distributions.
Firm, what are the expected profits of the investor and the
c. For the m atrix in b, what is the conditional distribution
standard deviation of the investor's profits?
of the Nikkei given a 10% return on the S&P?
4.14 In the Big Firm -Sm all Firm exam ple, what are the condi
tional expected profit and conditional standard deviation

ANSW ERS

4.1 The joint PM F is the product of the two marginals. This is 4.5 Use the properties of rescaled covariance,
a key property of independence. C o v[w X 1f (1 — w)X2] = w(1 — w)Cov[X-|, X 2].
4.2 W hen the random variables are independent. If the condi 4.6 If the two components X-] and X 2 are independent, then the
tional distribution of X i given X 2 equals the marginal, then conditional expectation of any function of X i given X 2 = x 2
X 2 has no information about the probability of different is always the same as the marginal expectation of the same
values in X-|. function. That is, E[g(X-|)|X 2 = x 2] = E[g(X-|)] for any value x 2.
4.3 Independent random variables always have a correlation of 0. 4.7 The joint PD F is fx1 (x 2 (x i/ x 2 ) = fx 1 (x i)^x2 (x 2 ) because when
This statement is only technically true if the variance of both independent the joint is the product of the marginals.
is well defined so that the correlation is also well defined. 4.8 Not necessarily. It is possible that the joint is not a bivari
4.4 No. There are many ways that two random variables can ate normal if the dependence structure between X-| and
be dependent but not correlated. For exam ple, if there is X 2 is different from what is possible with a normal. The
no linear relationship between the variables but they both distribution that generated the data points plotted below
tend to be large in m agnitude (of either sign) at the same has normal marginals, but this pattern of dependence is
tim e. In this case, there may be tail dependence but not not possible in a normal that always appear elliptical.
linear dependence (correlation). The image below plots
realizations from a dependent bivariate random variable
with 0 correlation. The dependence in this distribution
arises because if X-| is close to 0 then X 2 is also close to 0.
- 2 - 1 0 1 2
Xi
4.9 It would be if these are independent, because this is a

bivariate normal with 0 correlation.
4.10 Yes because iid normal random variables are jointly nor
- 2 - 1 0 1 2
Xi mally distributed with 0 correlation, the sum of iid normal

random variable is also normal.
Solved Problems
4.11 a. The marginal distributions are com puted by summing b. They are independent if the joint is the product of the
across rows for Big Firm and down columns for Small marginals. Looking at the upper left cell, 6.77% X 6.70%
Firm. They are is not equal to 1.97%, and so they are not.
c. The conditional distribution is just the row correspond
Big Firm Small Firm ing to USD 100M normalized to sum to 1. The condi
tional is different from the marginal, which is another
- U S D 50M 6.77 - U S D 1M 6.70
dem onstration that the profits are not independent.
USD 0 43.02 USD 0 43.19
- U S D 1M USD 0 USD 2M USD 4M
USD 10M 34.38 USD 2M 34.18
USD 4M % 19.5% 41.6% 38.9%

USD 100M 15.83 15.93
0

4.12 First, the two means and variances can be com puted from Finally, the conditional expectation is E[X-||X 2 < 0] =
the marginal distributions in the earlier problem . Then 2x-|Pr(Xi = x-||X2 < 0) = USD 3.01 M. The conditional
the covariance can be com puted using the alternative expectation squared is E [X f|X 2 < 0] = 940.31, and so
form , which is E[X-[X 2\ — E[X-|]E[X2]. The means can be the conditional variance is \/[X 1 ] = E[Xf] — E [X - |]2 =
com puted using E[Xj] = I x j Pr(Xy = xj) for j = 1,2. The 940.31 - 3.01 2 = 931.25 and the conditional standard
variance can be com puted using E [X f] - (E[Xy])2, which deviation is USD 30.52M .
requires computing E [X 2] using X x 2 Pr(X/ = xj). 4.15 a. If the two variables are unrelated, then the joint distri
For Big Firm , these values are E[X-|] = USD 15.88M , bution is just given by the products
E[X?j = 1786.63 and V[X-,] = 1534.36.
fx1(x2(*-w *2) =
For Small Firm , these values are E [X 2] = USD 1.25M , For exam ple, the probability of seeing —10% on the
E [X 2] = 3.98 and V[X2] = 2.41.
S&P and —5% on the Nikkei is 25% *20% = 5%:
The expected value of the cross product is
E [X 1 X 2] = S S x 1 x 2 Pr(X 1 = x-,, X 2 = x 2) = 43.22. S&P 500
The covariance is then 43.22 — 1.25 X 15.88 = 23.37 -1 0 % 0% + 10%

and the correlation is 2 3 .3 7 /(V 2 .4 1 X 1534.36 = 0.384 -5 % 5% 10% 5%
4.13 Here we can com pute these from the com ponents Nikkei 0% 15% 30% 15%
in the previous step. The expected profits are then
8% 5% 10% 5%
E[P] = 20% X E [X n] + 80% X E [X 2] = 4.18. The variance
of the portfolio is then
S&P 500
(20%) 2 Var[X.,] + (80%)2Var[X2] + 2(20% )(80% )Cov[X1, X 2]
-1 0 % 0% + 10%
This value is 0.04 X 1534.36 + 0.64 X 2.41 + 2 X 0.16 X
23.37 = 70.39, and so the standard deviation is USD 8.39M. -5 % 15% 1%
4.14 We need to com pute the conditional distribution given Nikkei 0% 5% 40%
X 2 < 0. The relevant rows of the probability m atrix are 8% 6% 9%
- U S D 1M USD 0 The sum of the rows needs to match the S&P 500 mar
X^Xz
ginal distribution:
- U S D 50M 1.97% 3.90%
USD 0 3.93% 23.5% Return - 1 0 % 0 % + 1 0 %

S&P 500
USD 10M 0 .8 % 12.7% Probability 25% 50% 25%
USD 100M 0 % 3.09% In other words, the x in this column:
The conditional distribution can be constructed by

-1 0 %
summing across rows and then normalizing to sum to unity.
-5 % 15%
The non-normalized sum and the normalized version are
0% 5%
X,|X2 < 0 Non-Normalized Normalized
8% x%
- U S D 50M 5.87% 11.77%
needs to be such that the sum of 15 + 5 + x = 25.
USD 0 27.43% 54.98%
So, x = 5%.
USD 10M 13.5% 27.06%
USD 100M 3.09% 6.19%

Continuing yields: Dividing each entry by the total mass (25% in this case)
S&P 500 S&P 500 Conditional

Distribution
-1 0 % 0% + 10% + 10%
-5 % 15% 4% 1% -5 % 1% 4%
Nikkei 0% 5% 40% 15% Nikkei 0% 15% 60%
8% 5% 6% 9% 8% 9% 36%
The same process works going across the columns, 25% 100%
c. Working with the last column:
S&P 500
+ 10%
-5 % 1%
Nikkei 0% 15%
8% 9%
25%
Chapter 4 Multivariate Random Variables

Sample Moments
Learning Objectives
Estim ate the mean, variance, and standard deviation using Explain how the Law of Large Num bers (LLN) and Central
sam ple data. Limit Theorem (CLT) apply to the sam ple mean.
Explain the difference between a population moment and Estim ate and interpret the skewness and kurtosis of a
a sam ple moment. random variable.
Distinguish between an estim ator and an estim ate. Use sam ple data to estim ate quantiles, including the
median.
D escribe the bias of an estim ator and explain what the
bias m easures. Estim ate the mean of two variables and apply the CLT.
Explain what is meant by the statem ent that the mean Estim ate the covariance and correlation between two
estim ator is BLU E. random variables.
D escribe the consistency of an estim ator and explain the Explain how coskewness and cokurtosis are related to
usefulness of this concept. skewness and kurtosis.
63
This chapter describes how sam ple moments are used to esti an observed data set. In contrast, an estimate is the value pro-
A
mate unknown population moments. In particular, this chapter duced by an application of the estim ator to data.
pays special attention to the estimation of the mean. This is
The mean estim ator is a function of random variables, and so it
because when data are generated from independent identically
is also a random variable. Its properties can be exam ined using
distributed (iid) random variables, the mean estim ator has sev
the tools developed in the previous chapters.
eral desirable properties.
The expectation of the mean is
• It is (on average) equal to the population mean.
• As the number of observations grows, the sam ple mean (5.2)
becom es arbitrarily close to the population mean.
The expected value of the mean estim ator is the same as the
• The distribution of the sam ple mean can be approxim ated
population mean. W hile the primary focus is on the case where
using a standard normal distribution.
Xj are iid, this result shows that the expectation of the mean
This final property is w idely used to test hypotheses about pop estim ator is /x w henever the mean is constant (i.e., E[X(] = /x for
ulation param eters using observed data. all /). This property is useful in applications to tim e-series that
Data can also be used to estim ate higher-order moments such are not always iid but have a constant mean.
as variance, skewness, and kurtosis. The first four (standard The bias of an estim ator is defined as:
ized) moments (i.e., mean, variance, skewness, and kurtosis) are
Bias(0) = E[0] - 0 (5.3)
widely used in finance and risk m anagem ent to describe the key
features of data sets. where 0 is the true (or population) value of the param eter that
we are estim ating (e.g ., /jl or cr2). It measures the difference
Q uantiles provide an alternative method to describe the dis
between the expected value of the estim ator and the popula
tribution of a data set. Q uantile measures are particularly use
tion value estim ated. Applying this definition to the mean:
ful in applications to financial data because they are robust to
extrem e outliers. Finally, this chapter shows how univariate Bias(/x) = E[/x] — fx = jx — /x = 0
sam ple moments can extend to more than one data series.
Because the mean estim ator's bias is zero, it is unbiased.
The variance of the mean estim ator can also be com puted usinq
5.1 THE FIRST TWO MOMENTS the standard properties of random variab les . 3 Recall that the
general formula for the variance of a sum is the sum of the vari
Estimating the Mean ances plus any covariances between the random variables:
W hen working with random variables, we are interested in the 1 n 1 n 1 I n

value of population param eters such as the mean (/x) or the V[/l] = V = ^ v 2 V [ X (] + Covariances
L n ,t i J n L;=1 J n l ;= 1
variance (cr2). However, these values are not observable, and so (5.4)
data is used to estim ate these values. The population mean is
Because X ) are iid, these variables are uncorrelated and so have
estim ated using the sam ple mean (i.e., average) of the data. The
0 covariance. Thus:
mean estim ator is defined as:
1 n
V = = —j (no-2) = <r2 / n (5.5)
A = ~ ± X , (5.1) L n i= 1 n /=i nr
n,±1
A mean estim ate /x is obtained from Equation 5.1 by replacing
1 This definition only covers point estimators, that is, an estimator whose
the random variables (i.e., X,-) with their observed values (i.e., x,). value is a single point (a scalar or finite-dimensional vector). This is the
The hat-notation O is used to distinguish an estim ator or an most widely used class of estimators.
estim ate (in this case /x) from the unknown population param 2 This result only depends on the linearity of the expectation operator,
eter (in this case /j l ) of interest. Note that X is another common so that the expectation of the sum is the sum of the expectation.
symbol for the sam ple mean. In this chapter, the random vari 3 The variance of the sample mean estimator is distinct from variance
ables X (- are assumed to be independent and identically distrib of the underlying data (i.e., X1( X2, . . . Xn). The variance of the sample
mean depends on the sample size n, whereas the variance of the data
uted, so that E[X,] = /x and V[X,] = a 2 for all /.
does not. These two are commonly confused when first encountered
The mean estim ator is a function that transforms data into an because the variance of the sample mean depends on the variance of
the data (i.e., cr2). See the box, Standard Errors and Standard Deviation,
estimate of the population mean. More generally, an estim ator is for more discussion about the distinction between the variance of an
a mathematical procedure that calculates an estimate based on estimator and the variance of the data used in the estimator.

The variance of the mean estim ator Sample vs. Actual Variance
depends on two values: the variance
of the data (i.e., a 2) and the number of
1 .1
observations (i.e., n). The variance in the

“O
CD
data is noise that obscures the mean. The +-» 1.05
_03
more variable the data, the harder it is D
_U
03
to estim ate the mean of that data. The U
(D
U
variance of the mean estim ator decreases c
03
as the number of observations increases,
>
CD
and so larger sam ples produce estim ates Q)

CD 0.95
03
of the mean that tend to be closer to the Q)
>
population mean. <
0.9
500 1000 1500 2000 2500
Estimating the Variance
0
and Standard Deviation Sample Variance --------- True Variance ---------- Unbiased Estimator
Recall that the variance of a random vari Number of Trials

able is defined as: Fiqure 5.1
o-2 = V[X] = E[(x - E[X])2]
Note that this bias is small when n is large . 5
Given a set of n iid random variables X ,, the sam ple estim ator of
The bias arises because the sam ple variance depends on the
the variance is
estim ator of the mean. Estimation of the mean consum es a
degree of freedom , and the sam ple mean tends to resem ble
~ A) 2 <5 -6 >
n i=1 the sam ple of data a little too well (especially for sm aller sam ple
A sample variance estim ate (i.e., the value of a 2) is obtained from sizes). This slight "overfitting" of the observed data produces a
Equation 5.6 by replacing the X, with the observed values x r slight underestim ation of the population variance.
Note that the sample variance estim ator and the definition of the Because the bias is known, an unbiased estim ator for the vari
population variance have a strong resem blance. Replacing the ance can be constructed as:
expectation operator E[.] with the averaging operator n- 1 ^ ? =1[.]
transforms the expression for the population moment into the (X; - £ ): (5.8)
estim ator moment. The sample average is known as the sample

analog to the expectation operator. It is a powerful tool that is This estim ator is unbiased, as E[s2] = a 2.
used throughout statistics, econom etrics, and this chapter to To see this principle in action, suppose that multiple sam ples
transform population moments into estim ators . 4 are drawn from an N(0,1) distribution, each sam ple has a size of
Unlike the mean, the sam ple variance is a biased estimator. 20, and a sam ple variance a 2 is calculated for each sam ple. The
Note that it can be shown that: results are as shown in Figure 5.1.
A s exp ected , the value converges overtim e to 0.95, because:

E[a 2] = a 2 - — = ^ a 2 (5.7)
n n
E[<x2] = ----- -CT2 = ^ ( 1 ) = 0.95
The sam ple variance is therefore biased with: n 2 0
B ias((j2) = E[<t 2] — a 2 Also shown is the unbiased variance estim ator s , which is simply
the biased estim ator multiplied b y ----- —.
n - 1
n The difference in the bias of these two estim ators might sug
= <x2/n gest that s 2 is a better estim ator of the population variance than
a 2. However, this is not necessarily the case because a 2 has a
4 Recall that the population variance can be equivalently defined as
E[X2] - E[X]2. The sample analog can be applied here, and an alternative
estimator of the sample variance is a2 — n-1Ef= iX 2 - (n-1 SjL-iXj)2. This 5 This is an example of a finite sample bias, and a 2 is asymptotically
form is numerically identical to Equation 5.6. unbiased because n— lim
>oc
Bias(a-2) = 0.
Chapter 5 Sample Moments ■ 65

STANDARD ERRORS VERSUS SCALING OF THE MEAN
STANDARD DEVIATIONS AND STANDARD DEVIATION
The variance of the sample mean depends on the variance Returns can also be measured using the difference
of the data. The standard deviation of the mean estimator is between log prices (i.e., the natural logarithm of the
returns). Log returns are convenient, as the two-period
return is just the sum of two consecutive returns:
and thus, it depends on the standard deviation of lnP 2 — lnP 0 = (In P2 — In Pn) + (InP-, — lnP0),
the data.
= P2 + Pi
The standard deviation of the mean (or any other estimator) where P; is the price of the asset in period /, and the return
is known as a standard error. Standard errors and standard P, is defined using consecutive prices.
deviations are similar, although these terms are not used
interchangeably. Standard deviation refers to the uncer The convention followed here is that the first price occurs
tainty in a random variable or data set, whereas standard at time 0 so that the first return, which is m easured using
error is used to refer to the uncertainty of an estimator. This the price at the end of period one, is P-].
distinction is important, because the standard error of an The return over n periods is then:
estimator declines as the sample size increases. Meanwhile, n
standard deviation does not change with the sample size. R-, + P 2 + • • • + Pn = 2 R i
/=1
Standard errors are im portant in hypothesis testing and
If returns are iid with mean E[P (] = /z and variance
when performing Monte Carlo sim ulations. In both of
V [P(] = o’2, then the mean and variance of the n-period
these applications, the standard error provides an indica
return are n/z and n o 2, respectively. Note that the mean
tion of the accuracy of the estim ate.
follows directly from the expectation of sums of random
variables, because the expectation of the sum is the sum
of the expectation.
sm aller variance than s 2 (i.e., V[<x2] < V[s2]). Financial statistics The variance relies crucially on the iid property, so that
the covariance between the returns is 0. In practice,
typically involve large datasets, and so the choice between o 2 or
financial returns are not iid but can be uncorrelated, which
s 2 makes little difference in practice. The convention is to prefer
is sufficient for this relationship to hold. Finally, standard
a - 2 when the sam ple size is m oderately large (i.e., n > 30). deviation is the square root of the varianceând so the
n-period standard deviation scales with V n .
The sam ple standard deviation is estim ated using the square
root of the sam ple variance (i.e., <x = V m 2 or s = V s 2). The
square root is a nonlinear function and so both estim ators of the
standard deviation are biased. However, this bias diminishes as n
O ne challenge when using asset price data is the choice of
becom es large and is typically small in large financial data sets.
sampling frequency. Most assets are priced at least once per
day, and many assets have prices that are continuously avail
Presenting the Mean and Standard able throughout the trading day (e.g ., equities, some sovereign
Deviation bonds, and futures). O ther return series, such as hedge fund
returns, are only available at lower frequencies (e.g ., once per
Means and standard deviations are the most widely reported
month). This can create challenges when describing the mean or
statistics. Their popularity is due to several factors.
standard deviation of financial returns. For exam ple, sampling
• The mean and standard deviation are often sufficient to over one day could give an asset's average return (i.e., /zDa//y) as
describe the data (e.g ., if the data are normally distributed 0 . 1 %, whereas sampling over one w eek (i.e., jlw e e k iy ) could indi
or are generated by some other one- or two-param eter cate an average return of 0.485% .
distributions).
In practice, it is preferred to report the annualized mean and
• These two statistics provide guidance about the likely range standard deviation, regardless of the sampling frequency. Annu
of values that can be observed. alized sam ple means are scaled by a constant that measures the
• The mean and standard deviation are in the same units as the number of sam ple periods in a year. For exam ple, a monthly
data, and so can be easily com pared. For exam ple, if the data mean is multiplied by 12 to produce an annualized mean. Simi
are percentage returns of a financial asset, the mean and larly, a w eekly mean is multiplied by 52 to produce an annual
standard deviation are also measured in percentage returns. ized version, and a daily mean is multiplied by the number of
This is not true for other statistics, such as the variance. trading days per year (e.g ., 252 or 260). Note that the scaling

convention for daily statistics varies by country (and possibly 5.2 HIGHER MOMENTS
by the year) because the number of trading days differs across
m arkets. Beyond the mean and standard deviation, two higher order
Returning to the previous exam ple with the sam ples from one moments are also commonly measured in financial data: skew
day and one week: ness and kurtosis.
Recall that the skewness is a standardized version of the third

f i'A n n u a l ^ ^ - f^ W e e k ly
central moment, and so it is unit- and scale-free. The population
= (252)(0.001) = (52)(0.00485)
value of the skewness is defined as:
= 2.52%
E[(X - E[X])3]
M eanwhile, standard deviations use the square root of the same Skewness(X) = (5.9)
E[(X - E[X])2 f/>
scale factors. For exam ple, the standard deviation com puted
from daily data is multiplied by V 2 5 2 to produce an annualized where /X3 is the third central moment and er is the standard
standard deviation. deviation.
The third moment raises deviations from the mean to the third
power. This has the effect of amplifying larger shocks relative
Estimating the Mean and Variance
to sm aller ones (compared to the variance, which only squares
Using Data deviations). Because the third power also preserves the sign of
As an exam ple, consider a set of four data series extracted from the deviations, an asym m etric random variable will not have a
the Federal Reserve Econom ic Data (FRED ) database of the third moment equal to 0. W hen large deviations tend to come
Federal Reserve Bank of St. Louis : 6 from the left tail, the distribution is negatively skew ed. If they
are more likely to come from the right tail, then the distribution
1. The IC E BoAM L US Corp M aster Total Return Index Value,
is positively skew ed.
which measures the return on a diversified portfolio of cor
porate bonds; Kurtosis is similarly defined using the fourth moment standard
ized by the squared variance:
2. The Russell 2000 Index, a small-cap equity index that m ea
sures the average perform ance of 2 ,0 0 0 firms; E[(X - E[X])4] = M4
Kurtosis(X) = k (5.10)
3. The Gold Fixing Price in the London bullion m arket, the E[(X - E[X])2]2 <r4
benchm ark price for gold; and The kurtosis uses a higher power than the variance, and so it is
4. The spot price of W est Texas Interm ediate Crude O il, the more sensitive to large observations (i.e., outliers). It discards
leading price series for US crude oil. sign information (because its power is even), and so it measures
the relative likelihood of seeing an observation from the tail of
All data in this set are from the period between January 1, 1987
the distribution.
and D ecem ber 31, 2018. These prices are sam pled daily, weekly
(on Thursdays), and at the end of the m onth . 7 Returns are com The kurtosis is usually com pared to the kurtosis of a normal dis
puted using log returns, defined as the difference between con tribution, which is 3. Random variables with a kurtosis greater
secutive log prices (i.e., lnPt — InPt_-j). than 3 are often called heavy-tailed/fat-tailed, because the fre
quency of large deviations is higher than that of a random vari
Table 5.1 reports the means and standard deviations for these
able with a normal distribution . 8
four series. Note how the values change across the three sam
pling intervals. The bottom rows of each panel contain the annu The estim ators of the skewness and the kurtosis both use the
alized versions of each statistic, which largely remove the effect principle of the sam ple analog to the expectation operator. The
of the sampling frequency. The scale factors applied are 252 estim ators for the skewness and kurtosis are
(for daily returns), 52 (for w eekly returns), and 12 (for monthly
A3
returns). The annual versions of the means are quite close to one and (5.11)
another, as are the annualized standard deviations.
6 See Federal Reserve Bank of St. Louis. Federal Reserve Economic

Data: FRED: St. Louis Fed. Retrieved from https://fred.stlouisfed.org/
8 Many software packages report excess kurtosis, which is the kurtosis
7 Thursday-to-Thursday prices are preferred when computing returns minus 3 (i.e., k - 3), and so care is needed when using reported kurtosis
because public holidays are more likely to occur on a Friday. to assess tail heaviness.

Table 5.1 Sample Statistics of the Returns of Four Representative Financial Assets. The Returns
Are Computed from Daily, Weekly (Using Thursday Prices), and End-of-Month Prices. Each Panel
Reports the Mean, Standard Deviation, Skewness, and Kurtosis. The Annualized Mean and
Standard Deviation Are Reported and Are Based on Scale Factors of 252, 52, and 12 for Daily,
Weekly, and Monthly Data, Respectively. The Row Labeled n Reports the Number of Observations.
The Final Row in the Top Panel Reports the Average Number of Observations per Year
Daily Data
Bonds Stocks Gold Crude
Mean 0.027% 0.044% 0 .0 2 0 % 0.044%
Std. Dev. 0.301% 1.295% 1 .0 0 2 % 2.496%
Skewness -0 .3 2 1 - 0 .2 0 4 - 0 .0 9 5 0.090
Kurtosis 5.62 9.41 9.56 15.75
Ann. Mean 6.9% 1 1 .2 % 5.2% 1 1 .1 %
Ann. Std. Dev. 4.8% 2 0 .6 % 15.9% 39.6%
n 7,346 7,346 7,346 7,346
n/Year 245 245 245 245

Weekly Data
Mean 0.129% 0.206% 0.096% 0.193%
Std. Dev. 0.652% 2.744% 2.154% 5.120%
Skewness - 0 .3 9 7 - 0 .6 7 8 0 .2 1 1 - 0 .0 7 6
Kurtosis 5.57 11.65 7.82 5.79
Ann. Mean 6.7% 10.7% 5.0% 1 0 .0 %
Ann. Std. Dev. 4.7% 19.8% 15.5% 36.9%
n 1,565 1,565 1,565 1,565

Monthly Data
Mean 0.559% 0 .8 6 6 % 0.427% 0.709%
Std. Dev. 1.462% 5.287% 4.460% 9.375%
Skewness - 0 .7 4 5 - 0 .4 4 3 0.146 0.275
Kurtosis 7.65 3.89 4.31 5.19
Ann. Mean 6.7% 10.4% 5.1% 8.5%
Ann. Std. Dev. 5.1% 18.3% 15.4% 32.5%
n 359 359 359 359
The third and fourth central moments are estim ated by: Table 5.1 also reports the estim ated skewness and kurtosis for
the four assets and three sampling horizons. These moments are
A3 (5.12) scale-free, and so there is no need to annualize them .
Skewness, while being unit-free, does not follow a simple scaling

A4 (5.13)
law across sampling frequencies. Financial data is heteroskedastic

HISTOGRAMS AND K ER N EL DEN SITY PLOTS
Histograms are simple graphical tools used to represent the common kernel density estimators use a weight function that
frequency distribution of a data series. Histograms divide the declines as the distance between two points increases, thus pro
range of the data into m bins and then tabulate the number ducing a smooth curve.
of values that fall within the bounds of each bin. Kernel den
Figure 5.2 shows two exam ples of histograms and kernel den
sity plots are sm oothed versions of histogram plots.
sities. The left panel shows the daily returns of the Russell 2000
A density plot differs from a histogram in two ways. First, rather Index, while the right panel shows its monthly returns. Because
than using discrete bins with well-defined edges, a density plot there are over 7,000 daily returns, the density plot is only
computes the number of observations that are close to any slightly smoother than the histogram. In the monthly exam ple,
point on the x-axis. In effect, a density plot uses as many bins as there are only 383 observations and so the difference between
there are data points in the sample. Second, a density plot uses the two is more pronounced. Kernel density plots are generally
a weighted count where the weight depends on the distance preferred over histograms when visualizing the distribution of
between the point on the x-axis and the observed value. Most a dataset.
Daily Return Monthly Return

Fiqure 5.2
(i.e., the volatility of return changes over time) and the differ that of a normal than returns sam pled at higher frequencies.
ences in the volatility dynamics across assets produce a different This is because large short-term changes are diluted over longer
scaling of skew ness . 9 Gold has a positive skewness, especially at horizons by periods of relative calm ness.
longer horizons, indicating that positive surprises are more likely
Figure 5.3 contains density plots of the four financial assets in
than negative surprises. The skewness of crude oil returns is neg
the previous exam ple. Each plot com pares the density of the
ative when returns are sampled at the daily horizon but becomes
assets' returns to that of a normal distribution with the same
positive when sampling over longer horizons. Bond and stock
mean and variance. These four plots highlight some common
market returns are usually negatively skewed.
features that apply across asset classes. The most striking fe a
All four asset return series have excess kurtosis, although ture of all four plots is the heavy tails. Note that the extrem e
bonds have few er extrem e observations than the other three tails (i.e., the values outside ± 2 cr) of the em pirical densities are
asset classes. Kurtosis declines at longer horizons, and it is above those of the normal. The more obvious hint that these
a stylized fact that returns sam pled at low frequencies (i.e., distributions are heavy-tailed is the sharp peak in the center of
monthly, or even quarterly) have a distribution that is closer to each density. Because both the em pirical density and the nor
mal density match the sam ple variance, the contribution of the
9 Time-varying volatility is a pervasive property of financial asset returns. heavy tails to the variance must be offset by an increase in mass
This topic is covered in Chapter 13. near the center. A t the same tim e, regions near ± 1 cr lose mass

Small-Cap Stocks Investment-Grade Bonds
Gold West Texas Intermediate Crude
F ia u re 5 .3 Density plots of daily returns data from small-cap stocks, corporate bonds, gold, and crude oil. The
solid line in each plot is a data-based density estimate of each asset's returns. The dashed line in each panel is the
density of a normal random variable with the same mean and variance as the corresponding asset returns. The
vertical lines indicate ± 2<x.
to the center and the tail to preserve the variance. The negative Linear estim ators of the mean can be expressed as:
skewness is also evident in stock, bond, and crude oil returns.
All three densities appear to be leaning to the right, which is A = 2 wiX "
;= 1
evidence of a long left-tail.
where w,- are weights that do not depend on X,. In the sample
mean estimator, for exam ple, w,- = /n.
5.3 THE BLUE MEAN ESTIMATOR
1
The unbiasedness of the mean estim ator was shown earlier in
The mean estim ator is the Best Linear Unbiased Estim ator this chapter. Showing that the mean is BLU E is also straightfor
1n ward, but it is quite tedious and so is left as an exercise . 101 *
(BLU E) of the population mean when the data are iid. In this
context, best indicates that the mean estim ator has the lowest
variance of any linear unbiased estim ator (LUE). 11 To show that this claim is true, consider another linear estimator that
uses a set of weights of the form w,- = w,- + d,. Unbiasedness requires
that Evv; = 1, and so E d , = 0. The remaining steps only require com
10 BLUE only requires that the mean and variance are the same for all puting the variance, which is equal to the variance of the sample mean
observations. These conditions are satisfied when random variables are iid. estimator plus a term that is always positive.

BLUE is a desirable property for an estimator, because it estab Figure 5.4 illustrates the LLN using sim ulated data. The figures
lishes that the estimator is the best estimator (in the sense of hav on the left all use log-normal data with param eters ju = 0 and
ing the smallest variance) among all linear and unbiased estimators. a 2 = 1. The figures on the right use data generated by a Pois
It does not, however, imply that there are no superior estimators to son with shape param eter equal to 3. Both distributions are
the sample mean. It only implies that these estimators must either right-skewed, and the Poisson is a discrete distribution.
be biased or nonlinear. Maximum likelihood estimators of the
The top panels show the PD F and the PM F of the simulated
mean are generally more accurate than the sample mean, although
data used in the illustration. The middle panel shows the distri
they are usually nonlinear and often biased in finite samples.
bution of the sam ple mean using simulated data. The number
of sim ulated data points varies from 10 to 640 in multiples of
5.4 LARGE SAM PLE BEHAVIOR 4. This ratio between the sam ple sizes ensures that the stan
dard deviation halves each tim e the sam ple size increases. The
O F THE MEAN dashed line in the center of the plots shows the population
mean. These finite-sam ple distributions of the mean estim ators
The mean estim ator is always unbiased, and the variance of the
are calculated using data simulated from the assumed distribu
mean estim ator takes a simple form when the data are iid. Two
tion. The sam ple mean is com puted from each simulated sam
m oments, however, are usually not enough to com pletely
ple, and 10,000 independent sam ples are constructed. The plots
describe the distribution of the mean estimator. If data are iid
in the center row show the em pirical density of the estim ated
normally distributed, then the mean estim ator is also normally
sam ple means for each sam ple size.
distributed . 1 2 However, it is generally not possible to establish
the exact distribution of the mean based on a finite sam ple of n Two features are evident from the LLN plots. First, the density
observations. Instead, modern econom etrics exploits the behav of the sam ple mean is not a normal. In sam ple means com puted
ior of the mean in large sam ples (formally, when n —» °c) to from the sim ulated log-normal data, the distribution of the
approxim ate the distribution of the mean in finite sam ples. sam ple mean is right-skewed, especially when n is small. Sec
ond, the distribution becom es narrower and more concentrated
The Law of Large Numbers (LLN) establishes the large sample
around the population mean as the sam ple size increases. The
behavior of the mean estim ator and provides conditions where
collapse of the distribution around the population mean is evi
an average converges to its expectation. The sim plest LLN for
dence that the LLN applies to these estim ators.
iid random variables is the Kolm ogorov Strong Law of Large
Num bers. This LLN states that if {X,} is a sequence of iid random The LLN also applies to other sam ple m oments. For exam ple,
variables with /jl = E[X,], then: when the data are iid and E[X2] is finite, then the LLN ensures
that a 2 —>a 2, and so the sam ple variance estim ator is also
consistent . 1 4
Consistency is an im portant property of an estimator, although

The symbol —> means co n verg es alm ost surely , 1 3 This property
it is not enough to understand its distribution. For exam ple, dis
ensures that the probability of j l n being far from /x converges to
tribution of jl — /a is not easy to study because it collapses to 0
0 as n —> oo.
as n —> oo.
When an LLN applies to an estimator, the estim ator is said to be
The solution is to rescale the difference between the esti
consistent. Consistency requires that an estim ator is asym ptoti
mate and population value by V r i . This is the precise value
cally unbiased, and so any finite sample bias must diminish as n 2
increases (e.g ., cr has this property). A consistent estimator, how
2 that stabilizes the distribution because V[/x] = — so that
ever, has another key feature: as the sample size n grows larger, V[V7>£] = nV[/I] = cr. The CLT can also be applied to the res
the variance of the estim ator converges to zero (i.e., V[/xn] —>0). caled difference under some additional assumptions.
This ensures that the chance of observing a large deviation from
the population value is negligible in large samples. The sim plest CLT is known as the Lindberg-Levy CLT. This CLT
states that if {X,} are iid, then
12 This result only depends on the property that the sum of normal ran Vn(M - f l ) A N(0, a 2 ) ,
dom variables is also normally distributed.

13 Almost sure convergence is a technical concept from probability
theory. An estimator that converges almost surely must converge for any 14 Formally the LLN can be applied to two averages, n-1 E[X2]
sequence of values that can be produced by the random variables X„ and n-1 E[X]. The sample variance estimator can be equivalently
i — , ,....
1 2 expressed using these two averages.

Log-Normal (0,1) Poisson (3)
PDF PMF
Law of Large Numbers
n 640
n 160
n 40
n 10
TV
^ 5 ;..
0 1.65 2 4.0 4.5
Central Limit Theorem
0. 5 -
0. 4 -
Figure 5.4 These figures demonstrate the consistency of the sample mean estimator using the Law of Large Numbers
(middle panels) and with respect to the normal distribution when scaled (bottom panels). The figures are generated using
simulated data. The left panels use simulated data from a log-normal distribution with parameters M = 0, <j 2 = 1. The
right panels use simulated data from a discrete distribution, the Poisson with shape parameter equal to 3. The top panels
show the PDF of the log-normal and the PMF of the Poisson.

where /z = E[X (] and cr2 = V[X,]. The symbol denotes conver accurate. Note that skewness in the distribution of the sample
gence in distribution. mean is a finite-sam ple property, and when n = 160, the distri
bution is extrem ely close to the standard normal.
This CLT requires one additional assumption from the LLN: that
In the application of the CLT to the mean of data generated
the variance a 2 is finite (as the LLN only requires that the mean
is finite). This CLT can be alternatively expressed as: from Poisson random variables (i.e., the right panel), the CLT
appears to be accurate for all sam ple sizes.
5.5 THE MEDIAN AND OTHER

This version makes it clear that the mean, when scaled by its
QUANTILES
standard error (i.e., a / V n ) , is asym ptotically standard normally
The m edian, like the mean, measures the central tendency of a
distributed.
distribution. Specifically, it is the 50% quantile of the distribu
CLTs extend LLNs and provide an approximation to the distribution tion and the point where the probabilities of observing a value
of the sample mean estimator. Furthermore, they do not require above or below it are equal (i.e., 50%). W hen a distribution is
knowledge of the distribution of the random variables generating sym m etric, the median is in the center of the distribution and is
the data. In fact, only independence and some moment conditions the same as the mean. W hen distributions are asym m etric, the
are required for the CLT to apply to a sample mean estimator. median is larger (smaller) than the mean if the distribution is left-
The CLT is an asym ptotic result and so technically only holds skewed (right-skewed).
in the limit. In practice, the CLT is used as an approxim ation in Estimation of the median is simple. First, the data are sorted from
finite sam ples so that the distribution of the sam ple mean is smallest to largest. When the sample size is odd, the value in posi
approxim ated by: tion (n + 1)/2 of the sorted list is used to estimate the median:
A ~ N(/x, a2/n) median(x) = x ( n + 1 ) / 2 (5.14)
This expression for the mean shows that (in large samples) the W hen the sam ple size is even, the median is estim ated using the
distribution of the sam ple mean estim ator (A) is centered on the average of the two central points of the sorted list:
population mean (/z), and the variance of the sam ple average
median(x) = (1/2)(x n / 2 + x n / 2 +1). (5.15)
declines as n grows.
Two other commonly reported quantiles are the 25% and 75%
The bottom panels of Figure 5.4 show the empirical distribution
quantiles. These are estim ated using the same method as the
for 10,000 simulated values of the sample mean. The CLT says
m edian. More generally, the a-quantile is estim ated from the
that the sample mean is normally distributed in large sam ples, so
2 sorted data using the value in location an. W hen an is not an
that A ~ M m , <r /n)• This value is standardized by subtracting the integer value, then the usual practice is to take the average of
mean and dividing by the standard errors of the mean so that:
A
the points im m ediately above and below a n . 1 5
These two quantile estim ates (q 2 5 and q75) are frequently used
together to estim ate the inter-quartile range:
is a standard normal random variable. The sim ulated values
IQ R = q 7 5 - q 2 5
show w hether the CLT provides an accurate approxim ation to
the (unknown) true distribution of the sam ple mean for different The IQ R is a measure of dispersion and so is an alternative to
sam ple sizes n. standard deviation. O ther quantile measures can be constructed
to measure the extent to which the series is asym m etric or to
Determining when n is large enough is the fundam ental ques
estim ate the heaviness of the tails.
tion when applying a CLT. Returning to Figure 5.4, note that the
shaded areas in the bottom panels are the PD F of a standard Two features make quantiles attractive. First, quantiles have
normal. For data generated from the log-normal (i.e., the left the sam e units as the underlying data, and so they are easy to
panel), the CLT is n o tan accurate approxim ation when n = 10 interp ret. For exam p le, the 25% quantile for a sam ple of asset
because of the evident right-skewness (due to the skewness in
the random data sam pled from the log-normal). W hen n = 40,
15 There is a wide range of methods to interpolate quantiles when an is
however, the skew has substantially disappeared and the not an integer, and different methods are used across the range of com
approxim ation by a standard normal distribution appears to be mon statistical software.

returns estim ates the point w here there is 25% probability of (i.e., bonds, stocks, and crude oil). Note that the median scales
observing a sm aller return (and a 75% probability of observing similarly to the mean, and the w eekly median is about five times
a larger return). M eanw hile, the interquartile range estim ates larger than the daily value.
a central interval w here there is 50% probability of observing
The interquartile range also agrees with the standard deviation
a return.
estim ates; crude is the riskiest of the assets, followed by stocks
The second attractive feature is robustness to outliers (i.e., val and gold, while bonds are the safest (assuming that the risk of
ues very far from the mean). For exam ple, suppose that some each asset can be measured by the dispersion of its returns).
observed data { x 1f. . . , x n} are contam inated with an outlier.
Both the median and IQ R are unaffected by the presence of an
outlier. However, this is not true of the mean estimator, which 5.6 MULTIVARIATE MOMENTS
gives w eight 1/n to the outlier. If the outlier is far from the other
(valid) observations, this distortion can be large. The variance, The sample analog can be extended from univariate statistics (i.e.,
involving a single random variable) to multivariate statistics (i.e.,
because it squares the difference between the outlier and the
mean, is even more sensitive to outliers. involving two or more random variables).
Table 5.2 contains the m edian, quantile, and IQR estim ates for The extension of the mean is straightforward. The sam ple mean
the four-asset exam ple discussed in this chapter. The median of two series is just the collection of the two univariate sample
means. However, extending the sam ple variance estim ator to
return is above the mean return for the negatively skewed assets
m ultivariate datasets requires estim ating the variance of each
series and the covariance between each pair. The higher-order
m oments, such as skewness and kurtosis, can be similarly
Table 5.2 Quantile Estimates for the Four Asset
defined in a m ultivariate setting.
Classes Using Data Sampled Daily, Weekly, and
Monthly. The Median Is the 50% Quantile. The
Inter-Quartile Range (IQR) Is the Difference between Covariance and Correlation
the 75% Quantile and the 25% Quantile
Recall that covariance measures the linear dependence between
Daily Data two random variables and is defined as:
Bonds Stocks Gold Crude o-XY - C o v[X ,Y ] = E[(X - E[X ])(Y - E[Y])]
The sam ple covariance estim ator uses the sam ple analog to the
Median 0.034% 0.103% 0.014% 0.058%
expectation operator
q(.25) -0 .1 4 0 % -0 .5 3 4 % -0 .4 4 5 % - 1 .2 1 %
q(-7 5) 0.207% 0.660% 0.496% 1.30% oxv = - i; (X, - Y; - M

, 5.16)
(
n it 1
IQR 0.347% 1.19% 0.941% 2.51%
where (xx is the sam ple mean of X and /tY is the sam ple mean
Weekly Data of Y. Like the sam ple variance estimator, the sam ple covariance
estim ator is biased toward zero. Dividing by n — 1, rather than
n, produces an unbiased estim ate of the covariance.
Median 0.159% 0.459% 0.044% 0.224%
Correlation is the standardized version of the covariance and is
q(.25) -0 .2 4 4 % - 1 .1 2 % - 1 .0 1 % - 2 .6 4 % usually preferred because it does not depend on the scale of X o r
q(.75) 0.542% 1.69% 1.24% 3.03% Y. The correlation is estim ated by dividing the sample covariance
by the product of the sample standard deviations of each asset:
IQR 0.786% 2.81% 2.24% 5.67%
Monthly Data (JX Y (Tx y

(5.17)
Px y =
(Tx ( J y
The sam ple correlation estim ator is biased in finite sam ples,
Median 0.640% 1.56% 0.265% 0.839% even if unbiased estim ators are used to estim ate the variances
q(.25) - 0 .2 2 0 % - 2 .4 7 % - 2 .2 9 % - 5 .1 4 % and covariance. This occurs for the same reason that the sample
estim ator of the standard deviation is always biased: Unbi
q(-7 5) 1.42% 4.12% 2.84% 6.72%
asedness is not preserved through nonlinear transform ations
IQR 1.64% 6.59% 5.14% 11.9%
(because E[g(Z)] g(E[Z]) for nonlinear functions g).

Table 5.3 contains the estim ated correlations betw een gold and stocks is small (i.e., between - 2 and 4%). In contrast,
the returns in the four-asset exam p le. Th ese correlations the sample correlation between the Russell 1000 (a large-cap
are estim ated using returns com puted from daily, w eekly, stock index) and the Russell 2000 (a small-cap stock index) is
and m onthly data. Note that correlations are unit free by 87%, 8 8 %, and 85% when measured using daily, weekly, or
construction. monthly data, respectively. This feature is useful when construct
ing diversified portfolios, because the correlation between the
The highest correlation (between gold and crude oil) ranges
assets held appears in the variance of the portfolio.
between 13% and 19% across the different sampling frequen
cies. Stocks and bonds have a negative correlation when m ea
sured using daily data and a positive correlation when measured Sample Mean of Two Variables
using monthly data. Correlations measured at different frequen
Estim ating the means of two random variables is no different
cies do not have to be the sam e, and here the monthly cor
from estim ating the mean of each separately, so that:
relations are generally larger than the daily correlations. This is
because short-term returns are often driven by short-term issues 1 n
Ax = n it 1
and Ayy
such as liquidity, whereas longer-term returns are more sensitive
to m acroeconom ic changes. W hen data are iid, the CLT applies to each estimator. However,
These are all relatively small correlations, and cross-asset class it is more useful to consider the joint behavior of the two mean
correlations are generally sm aller than correlations within an estim ators by treating them as a bivariate statistic. The CLT can
asset class. For exam ple, the estim ate of the correlation between be applied by stacking the two mean estim ators into a vector:
(5.18)
Table 5.3 Sample Correlations between Bond, This 2-by-1 vector is asym ptotically normally distributed if the
Stock, Gold, and Crude Oil Returns. The Correlations m ultivariate random variable Z = [X, Y] is iid . 1 6
Are Measured Using Data Sampled Daily, Weekly, and
The CLT for the vector depends on the 2-by-2 covariance matrix for
Monthly
the data. A covariance matrix collects the two variances (one for X
Daily Data and one for Y) and the covariance between X and Yinto a matrix:
Bonds Stocks Gold Crude "x (Tx y

(5.19)
L ct xy o -y J
Bonds —
- 1 6 .4 % 3.0% - 9 .9 %
The elem ents along one diagonal of the covariance m atrix are
Stocks - 1 6 .4 % - .1 % 12.5%
the variances of each series, and those on the other are the
—
2
Gold 3.0% - 2 .1 % —
13.0% covariance between the pair.
Crude - 9 .9 % 12.5% 13.0% —
The CLT for a pair of mean estim ators is virtually identical to that
Weekly Data of a single mean estimator, where the scalar variance is replaced
by the covariance matrix. The CLT for bivariate iid data series
Bonds Stocks Gold Crude states that:
Bonds —
- 2 .4 % 2.3% - 7 .7 %
(5.20)
Stocks - 2 .4 % —
4.0% 13.2%
Gold 2.3% 4.0% —

15.2% so that the scaled difference between the vector of means is
asym ptotically m ultivariate normally distributed. In practice, this
Crude - 7 .7 % 13.2% 15.2% —
CLT is applied by treating the mean estim ators as a bivariate

Monthly Data normal random variable:
Bonds —
15.1% 19.6% - 0 .2 %
Stocks 15.1% —
- 0 .2 % 1 1 .6 %
Gold 19.6% - 0 .2 % —
18.9%
Crude - 0 .2 % 1 1 .6 % 18.9% —
This assumes each component has a finite variance.

Table 5.4 Estimated Moments of the Monthly The two coskewness measures are
Returns on the Russell 2000 and the BoAML Total s(X ,X ,Y ) = El<* - E l X l f t r - Em )]
Return Index During the Period between 1987 and ayCTy
2018. The Means, Variance, and Covariance Are All
and
Annualized. The Correlation Is Scale-Free
s (X ,Y ,Y ) = E K X - E[X ])(Y - E[V-])2]
Moment
/V
-2 -2 (5.21)
As "S AB <*B <r S B P oxoy
10.4 335.4 6.71 25.6 14.0 0.151 The coskewness standardizes the cross-third moments by the
variance of one of the variables and the standard deviation of
the other, and so is scale- and unit-free.
Table 5.4 presents the annualized estim ates of the means, vari
These measures both capture the likelihood of the data taking
ances, covariance, and correlation for the monthly returns on
a large directional value w henever the other variable is large in
the Russell 2000 and the BoA M L Total Return Index (respectively
m agnitude. When there is no sensitivity to the direction of one
sym bolized by S and B). Note that the returns on the Russell
variable to the magnitude of the other, the two coskewnesses are
2 0 0 0 are much more volatile than the returns on the corporate
0. For exam ple, the coskewness in a bivariate normal is always 0,
bond index. The correlation between the returns on these two
even when the correlation is different from 0. Note that the uni
indices (i.e., 0.151) is positive but relatively small.
variate skewness estimators are s(X,X,X) and s{Y,Y,Y).
The CLT depends on the population values for the two variances
Coskewnesses are estimated by applying the sample analog to the
(<xf and o b) and the covariance between the two (c t sb = pascrB).
expectation operator to the definition of coskewness. For example:
To operationalize the CLT, the population values are replaced
with estim ates com puted using the sam ple variance of each V n S ^ iK - ~ A x )2 (y ; ~ A y )
s(X,X,Y) = 2
series and the sam ple covariance between the two series. A
(t x (Ty
A
Applying the bivariate CLT:

Table 5.5 contains the skewness and coskewness for the four
As As 0 335.4 14.0 assets (i.e., bonds, stocks, crude oil, and gold) using data sampled
n NI
LA b AbJ 0 14.0 25.6 monthly. The estimates of the coskewness are mostly negative.
Recall that in practice, the definition is applied as if the mean The bond-stock and crude oil-stock pairs have the largest coskew
estim ators follow the limiting distribution, and so the estim ates ness, although neither relationship is symmetric. For exam ple,
are treated as if they are normally distributed: s(X ,X ,Y ) in the bond-stock pair indicates that stock returns tend
to be negative when bond volatility is high. However, s(X,Y,Y) is
As AS 0.934 0.039
~ n (^ / close to so that the sign of bond returns does not appear to be
-A B _ _ A b _ _0 .0 3 9 0.071 _ 0
where the covariance m atrix is equal to the previous covariance strongly linked to the magnitude of stock returns.
m atrix divided by n (which is 359 in this exam ple).
The most im portant insight from the bivariate CLT is that cor
Table 5.5 Skewness and Coskewness in the Group
relation in the data produces a correlation between the sample
of Four Asset Classes Estimated Using Monthly
means. Moreover, the correlation between the means is identical
Data. The Far Left and Far Right Columns Contain
to the correlation between the data series.
Skewness Measures of the Variables Labeled X and
Y, Respectively. The Middle Two Columns Report the
Coskew ness and Cokurtosis Estimated Coskewness
Like variance, skewness and kurtosis can be extended to pairs X Y s(X,X,X) s(X,X,Y) s(X,Y,Y) s(Y,Y,Y)
of random variables. W hen computing cross pth m oments, there Bonds Crude - 0 .1 7 9 - 0 .0 0 8 0.040 0.104
are p — 1 different m easures. Applying the principle to the first
Bonds Gold - 0 .1 7 9 - 0 .0 1 8 - 0 .0 3 2 - 0 .1 0 1
four moments there are
Bonds Stocks - 0 .1 7 9 - 0 .0 8 2 0 .0 1 2 - 0 .3 6 5
• No cross means,
Crude Gold 0.104 - 0 .0 1 0 - 0 .0 9 8 - 0 .1 0 1
• One cross variance (covariance),
Crude Stocks 0.104 - 0 .0 6 4 - 0 .1 2 7 - 0 .3 6 5
• Two measures of cross-skewness (coskewness), and
Gold Stocks - - 0 .0 0 5 0.014 - 0 .3 6 5
• Three cross-kurtoses (cokurtosis).
0 .1 0 1

k ( X X Y, Y) k (X,X.X. Y)
Fiqure 5.5 Plots of the cokurtoses for the symmetric case {k [X,X,Y,Y), left) and the asymmetric case
(k (X,Y,Y,Y), right). Note that k (X,Y,Y,Y) has the same shape as k [X,X,X,Y) and so is omitted.
The cokurtosis uses com binations of powers that add to 4, and W hen exam ining kurtosis, the value is usually com pared to the
so there are three configurations: (1,3), (2,2), and (3,1). The (2,2) kurtosis of a normal distribution (which is 3). Com paring a cokur
measure is the easiest to understand, and captures the sensitiv tosis to that of a normal is more difficult, because the cokurtosis
ity of the m agnitude of one series to the m agnitude of the other of a bivariate normal depends on the correlation.
series. If both series tend to be large in m agnitude at the same
Figure 5.5 contains plots of the cokurtoses for normal data as a
tim e, then the (2,2) cokurtosis is large. The other two kurtosis
function of the correlation between the variables (i.e., p),
m easures, (3,1) and (1,3), capture the agreem ent of the return
which is always between —1 and 1. The sym m etric cokurtosis
signs when the power 3 return is large.
k {X ,X ,Y ,Y ) ranges between 1 and 3: it is 1 when returns are
The three measures of cokurtosis are defined as: uncorrelated (and therefore independent, because the two
random variables are normal) and rises sym m etrically as the cor
E[(X — E[X])2(Y — E[Y])2]
k ( X X Y . Y ) (5.22) relation moves away from 0. The asym m etric cokurtosis ranges
aX(Ty
2 2
from —3 to 3 and is linear in p.
E[(X — E[X])3(Y - E[Y])]
k {X ,X ,X ,Y ) (5.23) Table 5.6 contains estim ates of the kurtosis and cokurtosis for
a }a Y
the four assets previously described. The sam ple kurtosis is
E[(X - E[X ])(Y - E[V])3] com puted using the sam ple analog to the expectation operator.
x i X . Y X Y ) (5.24)
axo-y The first and last columns contain the kurtosis of the two assets
Table 5.6 Kurtosis and Cokurtosis in the Group of Four Asset Classes Estimated Using Monthly Data. The Far
Left and Far Right Columns Contain Kurtosis Measures of the Variables Labeled X and Y, Respectively. The Middle
Three Columns Report the Three Cokurtosis Estimates
X Y k (X,X,X,X) k (X,X,X,Y) k (X,X,Y,Y) k (X,Y/Y/Y) k (Y,Y,Y,Y)
Bonds Crude 6.662 - 0 .2 4 8 1.996 0.244 15.656
Bonds Gold 6.662 - 0 .5 3 7 1.808 0.547 9.390
Bonds Stocks 6.662 - 1 .5 2 2 2.921 0.682 1 1 .0 2 0
Crude Gold 15.656 4.010 2.833 7.011 9.390
Crude Stocks 15.656 - 0 .4 4 3 2.490 4.750 1 1 .0 2 0
Gold Stocks 9.390 -0 .0 8 1 3.230 2.491 1 1 .0 2 0

reported in the row. The center column reports the sym m etric • The mean estim ator is consistent, and in large sam ples the
cokurtosis k {X ,X ,Y ,Y ), which measures the strength of the link estim ated mean is close to the population mean.
between the variances of the two series. Note that gold and • W hen the variance is finite, the distribution of the mean esti
stocks appear to have the strongest link. mator can be approxim ated using the CLT.
The two columns k (X ,X ,X ,Y ) and k (X ,Y ,Y ,Y ) measure the All moment estim ators use the principle of the sam ple analog,
strength of the dependence in the signed extrem es. In each where the expectation operators in the population moment are
column, the extrem e values of the variable that appears three replaced by the sam ple average. The skewness describes the
tim es have more influence on the cokurtosis. For exam ple, the sym m etry of the data and the kurtosis describes the heaviness
bond-stock pair k (X ,X ,X ,Y ) is —1.5, which indicates that the of the tails. Data series representing four major asset classes—
return on stocks tends to have the opposite sign of the return bonds, equities, gold, and energy— all confirm two stylized
on bonds when bonds are in their tail. M eanwhile, k (X ,Y ,Y ,Y ) facts of financial return data: The distributions of financial asset
is positive, which indicates that the bond returns tend to have returns are not sym m etric (i.e., they are skewed) and heavy
the same sign as the stock returns when stocks have an extrem e tailed (i.e., have a kurtosis larger than that of a normal random
observation. variable).
Estimators based on quantiles (i.e., the median and the quartiles)

complement sample moments. These values are simple to estimate
5.7 SUMMARY and can be used to construct alternative measures to describe a
data set. Quantile-based statistics are naturally robust to outliers
This chapter describes the steps required to estim ate four
and so are particularly attractive when using financial data.
im portant moments: the mean, the variance, the skewness, and
the kurtosis. These statistics are widely reported to describe The chapter concludes by introducing multivariate moment esti
observed data, and each measures an im portant feature in the mators. The sam ple covariance and correlation measure linear
distribution of a random variable. The im portant properties of dependence between two data series. The CLT extends naturally
the mean estim ator include the following. to the mean of a bivariate random variable. The resulting bivari
ate CLT depends on the covariance matrix, which summarizes
• The mean is unbiased.
both the variances of two random variables and the covariance
• W hen the observed data are iid, the mean has a simple between them . Finally, the bivariate extensions of skewness and
expression for its standard error. kurtosis— coskewness and cokurtosis— provide additional tools
• The mean estim ator is BLU E. to describe the joint distribution of two random variables.

QUESTIONS

5.1 W hat is the difference between a sam ple mean and the 5.5 W hy is annualization useful, and how are means and stan
population mean? dard deviations annualized?
5.2 W hat is the difference between the standard deviation of 5.6 W hat do skewness and kurtosis measure?
the sample data and the standard error of the sample mean?
5.7 W hat is the skewness and kurtosis of a normal random
5.3 Is an unbiased estim ator always better than a biased variable?
estim ator?
5.8 How are quantiles estim ated?
5.4 W hat is the sam ple analog to the expectation operator, 5.9 W hat advantages do quantiles have compared to moments?
and how is it used?
5.10 W hen dealing with two random variables, how is coskew
ness interpreted?
Practice Questions
5.11 Suppose four independent random variables X h X 2, X 3, and It is hypothesized that the data comes from a uniform dis
X 4 all have mean /z = 1 and variances o f 1 2, 1 2 <2, and 2, tribution, U(0 , b).
1
respectively. What is the expectation and variance of — a. Calculate the sam ple mean and variance.
5.12 Using the same information from question 1, com pute the b. W hat are the unbiased estim ators of the mean and
1 1 variance?
expectation and variance of 2X-| + 2 X 2 + —X 3 + —X 4.
c. Calculate the b in 1/(0, b) using the formula for the
5.13 Using the same information from question 1, com pute the mean of a uniform distribution and the value of the
2 2 1 1
expectation and variance of —X i + —X? + — Xa + — X 4? unbiased sam ple mean found in part b.
^ 5 1 5 ^ 10 3 10 4
d. Calculate the b in 1/(0, b) using the formula for the
5.14 W hat is the effect on the sam ple covariance between X
variance of a uniform distribution and the value of the
and Y if X is scaled by a constant a? W hat is the effect on
unbiased sam ple variance found in part b.
the sam ple correlation?
5.15 An experim ent yields the following data: 5.16 For the following data:
Observation Value
Trial Number Value
1 0.38
1 0 .0 0
2 0.28
2 0.07
3 0.27
3 0.13
4 0.99
4 0.13
5 0.26
5 0 .2 0
0.43
0.23
6
6
7 0.25 a. W hat is the median?

8 0.27 b. W hat is the IQR?
9 0.34
1 0 0.41
1 1 0.60
1 2 0 .6 6
13 0.76
14 0.77
15 0.96

ANSW ERS

5.1 The sam ple mean is a numerical average of observed common scale factors are 252, 52, 12, and 4 to convert
data. The population mean is the unknown (true) value from daily, w eekly, monthly, and quarterly. These scale
that the sam ple mean is used to approxim ate. factors are applied to the high-frequency mean or variance
to produce annualized means and variances. The standard
When data are iid, the standard error of the mean is ——,
Vn deviation is scaled by the square root of the scale.
where a is the standard deviation of the data. The standard
5.6 Skewness measures the tendency to see larger values in
error measures the estimation error in the sample mean. As
one direction or the other relative to the mean. For exam
the sample size grows the standard error converges to 0. The ple, a distribution with negative skew will tend to produce
(population) standard deviation measures the dispersion in
larger in m agnitude values less than the mean. Kurtosis is
the data. It is a fixed value independent of the sample size.
a m easurem ent of how frequently large observations of
5.3 Not necessarily. W hile an unbiased estim ator is always either sign are m easured.
right, on average, it may be very inaccurate, which would 5.7 The skewness of a normal is 0, and the kurtosis is 3.
be reflected by a large variance. It may be better to use
5.8 Q uantiles are estim ated by sorting the data from sm allest
an estim ator with a small bias but with lower variance. This
to largest and then choosing the values with the sorted
tradeoff may be especially valuable if the bias depends on
index na to measure the quantile.
the sam ple size, decreasing as the sam ple size increases,
and the sam ple size is large. 5.9 There are two potential advantages. First, quantiles are
always well defined (exist) even when random variables
5.4 The sam ple analog replaces the expectation opera
are very heavy-tailed. Second, quantiles are less sensitive
tor with the average so that E[(g(X))] is approxim ated
to outliers than moments.
by n~ 1 ^ g (x ,). The sam ple analog is a universally used
approach to estim ating moments of any order. 5.10 Coskew ness measures the likelihood of one variable tak
ing a large directional value w henever the other variable
5.5 Annualization transform s moments measured using
is large in m agnitude. W hen there is no sensitivity to the
data sam pled at any frequency— daily, w eekly, monthly,
direction of one variable to the m agnitude of the other,
quarterly— to a value that is equivalent under an assum p
the coskewness measures are .
tion to a measure com puted from annual data. The
0
Solved Problems
5.13 The expectation is com puted using the same steps:
5.11 The expectation is E —(Xi + X 2 + X 3 + X 4 ) = 2 2 1 1
-X-, + - x 2 + — x 3 + — x 4
1 5 1 5 1 10 3 10 4
+ E [X 2] + E [X 3] + E [X 4]) — — (/x + /x + /x + /x) — /x — 1
= | E [X , ] + | e [X2] + ^ £ [ X 3] + ^ E [ X 4]
The variance is Var — (X-i + X 2 + X 3 + X 4) = T ( V a r [ X , ]
2 2 1 1
5/x + 5 /x + 1 0 jJL + — /JL = /X = 1.
1 0 ‘
+ Var[X2] + Var[X3] + Var[X4]) — — ( — + 2 + ^ + / ~ ^\6
This estim ator is unbiased.
5.12 The expectation is E 2 Xy + 2 X 2 + —X 3 + —X 4 = 2\± The variance of the sum is the sum of the variances
because the random variables are independent.
1 1
+ 2/x + —/jl + = 5/x = 5. The variance is 2Xi + 2X 2 2 2 1 1
Var 77X3 + —-x4
— f x V a r
5 X l + 5 Xa + *10 6 1 0 4
1 1 1 1
+ X 3 + ~X 4 = 4Var[X-,] + 4V ar[X 2] + ~Var[X3]
— Trz’VarfX'i ] + 7rir\/ar[X2] + Var[X3] + - 7 —Var[X4]
77
2 6 2 4
25 1J 25 2 100 1 0 0
1 4 4 2 2 4 1 4 1 1 1
+ - V a r[X 4] = - + - + - + - = 5 .
4 4J 2 2 4 4 — X - + — x - + x 2+ x 2=
25 2 25 2 1 0 0 1 0 0 25

1 — From here, we anchor the first observation at 0% and

5.14 The sample covariance is defined — ..(X,- — X)(Y,- — Y ).
n ' the last at 1 0 0 %, equally spacing the in-between val
Multiplying each X,- by a constant a will rescale the
ues. The divisions here will be 10 0 /n —1 = 20%:
1 — —
sam ple mean by a so that —^ ;= 1 (aX(- — aX)(Y, — Y) =
Ranked
- 2 ( = i a<x i - - V) = a - - The Position Value Rank
final covariance is a tim es the original covariance. The 1 0.26 0 %
(JX Y
correlation is then p = /V /V so that scaling X,- by a will 2 0.27 %
OXCTy
2 0
3 0.28 40%
produce p = „ , and the sam ple correlation is
dC Ty(Ty CTy(Ty 4 0.38 60%
invariant to rescaling the data.
5 0.43 80%
5.15 a. Use the standard formula to get the sam ple variance
6 0.99 1 0 0 %
(here, n = 15):
n The median (50%) lies exactly half way between the
jl = n“ 1 2 */ = 0-39, third and fourth ranked observations. Therefore:
n
0.28 + 0.38
a 2 = n“ 1 2 ( X / - A ) 2 = 0.08. Median = ---------------= 0.33
/=1
b. The sam ple mean is already unbiased. b. This requires calculating the 25% and 75% levels
For the variance: The 25% level is 5/20 = 25% of the way between
ranked observations 2 and 3. Therefore:
n
s2 = ct 2 = l 5 0.080 = 0.086.
n - 1 14 q25% = 0-75 * 0.27 + 0.25 * 0.28 = 0.2725
c. The mean for a U(a,b) distribution is given as: Sim ilarly, the 75% level is 75% of the way between
a + b observations 4 and 5:
q75% = 0.25 * 0.38 + 0.75 * 0.43 = 0.4175

+ b
Thus, IQR = (0.2725, 0.4175).
0
0.385 = — ------->b = 0.77.
d. The variance for a U(a,b) distribution is given as
2 _ (b - a ) 2
fJ “ 12
b2
0.086 = — ^ b = 1.016
12
5.16 a. The first step is to rank order the observations:
Ranked
Position Value
1 0.26
2 0.27
3 0.28
4 0.38
5 0.43
6 0.99

Hypothesis Testing
Learning Objectives
Construct an appropriate null hypothesis and alternative • Understand how a hypothesis test and a confidence
hypothesis and distinguish between the two. interval are related.
Construct and apply confidence intervals for one Explain what the p-value of a hypothesis test measures.
sided and two-sided hypothesis tests, and interpret
the results of hypothesis tests with a specific level of Interpret the results of hypothesis tests with a specific
confidence. level of confidence.
Differentiate between a one-sided and a two-sided test Identify the steps to test a hypothesis about the difference
and identify when to use each test. between two population means.
Explain the difference between Type I and Type II errors Explain the problem of multiple testing and how it can
and how these relate to the size and power of a test. bias results.
83
As explained in C hapter 5, the true values of the population 5. The critical value, which is a value that is com pared to the
param eters are unknown. However, observed data contain infor test statistic to determ ine whether to reject the null hypoth
mation about those param eters. esis; and
A hypothesis is a precise statem ent about population param 6 . The decision rule, which com bines the test statistic and criti
eters, and the process of exam ining w hether a hypothesis is cal value to determ ine w hether to reject the null hypothesis.
consistent with the sam ple data is known as hypothesis testing.
In frequentist inference, hypothesis testing can be reduced to
one universal question: How likely is the observed data if the
Null and Alternative Hypotheses
hypothesis is true? The first step in performing a test is to identify the null and
alternative hypotheses. The null hypothesis (H0) is a statem ent
Testing a hypothesis about a population param eter starts by
specifying null hypothesis and an alternative hypothesis. The null about the population values of one or more param eters. It is
hypothesis is an assumption about the population parameter. som etim es called the maintained hypothesis, because it is main
The alternative hypothesis specifies the population param eter tained (or assumed) that the null hypothesis is true throughout
the testing procedure. Hypothesis testing relies crucially on this
values (i.e., the critical values) where the null hypothesis should
be rejected. These critical values are determ ined by: assum ption, and many elem ents of a hypothesis test are based
on the distribution of the sam ple statistic (e.g ., the sample
• The distribution of the test statistic when the null hypothesis mean) when the null is true.
is true, and
In most settings, the null hypothesis is often consistent with
• The size of the test, which reflects our aversion to rejecting a
there being nothing unusual about the data. For exam ple, when
null hypothesis that is in fact true.
considering w hether to invest in a mutual fund, the natural null
O bserved data are used to construct a test statistic, and the hypothesis is that the fund does not generate an abnormally
value of the test statistic is com pared to the critical values to high return. This hypothesis can be form alized as the statem ent
determ ine whether the null hypothesis should be rejected. that the expected return on the fund is equal to the expected
This chapter outlines the steps used to test hypotheses. It return on its style-m atched benchm ark. This null hypothesis is
begins by exam ining w hether the mean excess return on the expressed (in term s of population values) as:
stock m arket (i.e., the m arket premium) is 0. The steps used Ho: A f = ABM,
to test a hypothesis about the mean are nearly identical to
where /xF is the expected return of the mutual fund (i.e., E[rFj)
those used in many econom etric problem s and so are widely
and /x Bm is the expected return on the style benchmark (i.e., E[r6M]).
applicable.
This null can be equivalently expressed using the difference
This chapter also exam ines testing the equality of the means for between the means:
two sequences of random variables and explains how tests are
used to validate the specification of Value-at-Risk (VaR) m odels. ffo: A f “ A b m = 0
A nother im portant type of null hypothesis occurs when testing

the accuracy of an econom etric m odel. For exam ple, consider
6.1 ELEM ENTS O F A HYPOTHESIS the testing of VaR m odels, which are im portant m easures of
TEST 4
3
2
1
* portfolio risk. The 95% VaR of a portfolio defines a region
where the daily return on the portfolio should fall 95% of the
A hypothesis test has six distinct com ponents: tim e. The area excluded from this region is always in the left
tail and so captures losses to the portfolio. W hen testing VaR
1. The null hypothesis, which specifies a param eter value that
m odels, the null hypothesis is that the model used to produce
is assumed to be true;
the VaR is correct. However, this hypothesis is not a precise
2. The alternative hypothesis, which defines the range of val statem ent about the value of a population param eter and so
ues where the null should be rejected; is not directly testable. Instead, the null is constructed from a
3. The test statistic, which has a known distribution when the key feature of a correct VaR m odel: The 95% VaR of a portfolio
null is true; should be violated on 5% of days. W hen testing a VaR m odel,
the null hypothesis is
4. The size of the test, which captures the willingness to make
a mistake and falsely reject a null hypothesis that is true; E[H/Tt] = 5%,

where a H IT is defined as a VaR exceedance (i.e., the loss on day One-sided alternatives are only used in specific circumstances.
t (Lt) is larger than the VaR ) . 1 For exam ple, the negative consequences of exceeding VaR limits
(e.g., additional audits by regulators, loss of investor confidence,
The next step in performing a test is to specify the alternative
bankruptcy, and so on) may be more severe than those caused
hypothesis (H-1 ), which determ ines the values of the population
by falling short of those limits (suboptimal capital deployment). In
param eter where the null hypothesis should be rejected. In most
this scenario, a one-sided test with power to detect an excessive
applications, the natural alternative is the set of values that are
number of large losses may be preferable to a two-sided test.
not allowed under the null hypothesis. Returning to the exam ple
of evaluating a mutual fund's perform ance, the natural alterna
tive is H^: /jl f 7 ^ /xBM (or equivalently, H-p /xF — /j l b m ¥= 0). When Test Statistic
testing a VaR model, the natural alternative is H-p E[H/Tt] 5%.
The test statistic is a summary of the observed data that has
Here, rejection of the null indicates that the VaR model is not
a known distribution when the null hypothesis is true. Test
accurate: It may be either too conservative (too few violations)
statistics take many forms and follow a wide range of distribu
or too aggressive (too many violations).
tions. This chapter focuses on test statistics that are normally
distributed, which are commonly used when the distribution of
One-Sided Alternative Hypotheses param eter estim ators can be approxim ated using a Central Limit
Theorem (CLT). This single class of test statistic can be used to
In som e testing problem s, the alternative hypothesis might not
test many im portant hypotheses, including those about means,
fully com plem ent the null. The most common exam ple of this is regression coefficients, or econom etric m odels.
called a one-sided alternative, which is used when the outcom e
of interest is only above or only below the value assum ed by Consider a test of the null hypothesis about a mean: H q : /x = /x q .
W hen data are iid, C hapter 5 showed that the sam ple mean esti
the null.
mator is asym ptotically normally distributed:
For the exam ple of the fund manager's perform ance, the null
is equal perform ance between the return on the fund and the V n (£ - ix)4. N(0,
return on the benchm ark (i.e., H0: /xF = /xBM). If the alternative where a 2 is the variance of the sequence of iid random variables
was set to H-p /zF /j l B m , then a rejection of the null would only used to estim ate the mean. W hen the true value of the mean (/z)
indicate that the return is differen t than that of the benchm ark is equal to the value assumed by the null (/x0), then the asym p
(i.e., it could be either higher or lower than the benchm ark). totic distribution leads to the test statistic:
To focus on the case where the fund m anager outperform s her
benchm ark, then a one-sided alternative is needed: N(0 , 1 ) ( 6 . 1)
/xF > p BM
The test statistic T (also known as the t-statistic) is asymptotically
or equivalently: standard normally distributed and centered at the null hypoth
Hp. /xF - p BM > 0 esis. The population variance is replaced by the sample variance,
because a 2 is a consistent estimator of the population variance a 2
Note that a one-sided alternative increases the chance of reject
when the sample size is sufficiently large. While this chapter focuses
ing a false null hypothesis when the true param eter is in the
on testing hypotheses about the mean in most examples, T is appli
range covered by the (one-sided) alternative.
cable to any estimator that is asymptotically normally distributed.
Note that if ixF < /jl BMi then neither the null nor the alternative
is true. However, because the values where the null is rejected Type I Error and Test Size
are determ ined exclusively by the alternative hypothesis, the
null should not be rejected. This highlights the fact that failing In an ideal world, a false (true) null would always (never) be
to reject the null hypothesis does not mean the null is true, but rejected. However, in practice there is a tradeoff between avoiding
only that there is insufficient evidence against the null. a rejection of a true null and avoiding a failure to reject a false null.
Rejecting a true null hypothesis is called a Type I error. It is pos

sible to calculate the probability of a Type I error, because the
1 Formally, HITt — /[Lt > Va/?t], where /[ • ] is an indicator function that

is 1 when the argument is true (and 0 otherwise) and Lt is the portfolio
loss. 2 Also known as an error of the first kind.
Chapter 6 Hypothesis Testing ■ 85

distribution of the test statistic is known when the null hypothe one-sided lower alternatives are 10% (—1.28), 5% (—1.645), and
sis is true. The probability of com mitting a Type I error is known 1% (—2.32). Test statistics less than the critical value indicate that
as the test size and is denoted a. The test size is chosen to the null should be rejected. When using a one-sided upper tailed
reflect the willingness to m istakenly reject a true null hypothesis. alternative (H0: /jl > /x q ) , the sign of the critical value flips and
The most common test size is 5% (i.e., so that there is a 1 in 20 rejection occurs if the test statistic is larger than the critical value.
chance that a rejected null hypothesis is actually correct).
Sm aller test sizes (e.g ., 1% or even 0.1%) are used when it is Example: Testing a Hypothesis
especially im portant to avoid incorrectly rejecting a true null.
about the Mean
A s an exam ple, suppose that X = [X 1f X 2, . . . , X n] represents the
Critical Value and Decision Rule annual excess return 3 on the S&P 500 from 1951 through 2018.
The estim ated average excess return is calculated using the real
The test size and the alternative are com bined with the distribu
izations of X (i.e., x = [x1; x 2, . . . , x nj) and is therefore:
tion of the test statistic to construct the critical value of the test.
The critical value depends on the distribution of the test statistic
and defines a range of values where the null hypothesis should
be rejected in favor of the alternative. This range is known as In this case, jl = 7.62% .
the rejection region. W hen the test statistic has a standard nor
To determ ine whether the true expected value (i.e., /x = E[X,]) is
mal distribution, the critical value depends on both the size and
different from 0 , a hypothesis test can be constructed where:
the type of the alternative hypothesis (i.e., whether it is one- or
two-sided). In the common case where the null is H0: /x = /x0
and a two-sided alternative is used (e.g ., H f. /jl ¥= /x q ) , then the and
critical value defines a region in the tails with probability a so
H ^: /jl ¥= 0
that the probability in either tail is a/2.
Recall that the test statistic T is
The decision rule com bines the critical value (Ctt, which depends
on the test size a), the alternative hypothesis, and the test statis N(0 , 1 )
tic (T) into one decision: whether to reject the null in favor of the
alternative or to fail to reject the null. W hen testing a hypothesis The standard normal distribution of T when the null is true is
using a t-test, the decision rule depends on the alternative. illustrated in the top-left panel of Figure 6.1.
W hen using a two-sided alternative, the decision rule is to reject
In this exam ple, the standard deviation of the excess returns is
the null if the absolute value of the test statistic is larger than
16.5% , the sam ple size n = 6 8 , /x = 7.62% and the null assumes
the critical value (i.e., | T| > Ca). W hen testing against a one
that / jl = 0. Therefore, the T value for this test is
sided lower alternative, the decision rejects the null if T < Ca
and when using a one-sided upper alternative, the decision rule 0.0762 - 0
rejects the null if T > Ca. W hen a null is rejected using a test 0 .1 6 5 / V 6 8
size of a, we say that the result is significant at the a-level. Using a test size (i.e., a) of 5%, the next step is to find an inter
The top left panel of Figure 6.1 shows the rejection region when val containing 95% of the probability when the null is true. This
a = 5%. The shaded rejection region in each tail has a probabil interval is constructed by finding the critical values, which are
ity of 2.5% . The null is rejected in favor of the alternative hypoth ± 1.96 when a = 5%.
esis when the value of the test statistic falls into this region. Test
The critical values determ ine the rejection region, which consists
statistics in the 95% probability region in the center of the distri of (—oo, —1.96] and [1.96, oo). This rejection region is the shaded
bution do not lead to rejection of the null hypothesis. Common area in the top-right plot of Figure 6.1.
test sizes and critical values for testing against two-sided alterna
Because test statistic value of 3.8 is in the rejection region, it is
tives are 10% ( ± 1.645), 5% ( ± 1.96), and 1% ( ± 2.57).
unlikely to have occurred if the null is true. Therefore, the null
When using a one-sided alternative, the rejection region lies in
a single tail. For exam ple, when the alternative is one-sided lower
3 The excess return is the difference between the S&P 500 and the risk
(H-,: /x <^. /x q ) , then the critical value is chosen so that the probabil free rate. A risk-free rate is the return on a short-term sovereign bond.
ity in the lower tail is a. Common test sizes and critical values for Here the return on a one-month US government bond is used.

N ull D istrib u tio n T w o -S id e d A lte rn a tiv e
- 1 .9 6 1.96
One-Sided Alternative (Lower) One-Sided Alternative (Upper)
-1 .6 4 5 1.645
F ia u re 6.1 The top-left panel shows the distribution of the test statistic, a standard normal when the null is true.
The top-right panel shows the rejection region (shaded) when using a two-sided alternative and a test size of 5%.
The bottom two panels show the rejection regions when using one-sided lower (left) and one-sided upper (right)
alternatives also using a test size of 5%.
THE NORM AL AND THE STU D EN T'S t

This chapter uses the standard normal distribution to deter There is no need to use an approxim ation in this special case.
mine the critical values in hypothesis tests about unknown The exact distribution requires using the unbiased estim ator
param eters. The use of the normal follows from the CLT and of the variance:
is applicable to averages of random variables from any distri
bution when some technical conditions are met. ( 6 . 2)
(x; - A)2
However, there are two circum stances where the Student's
t„ should be used in place of the normal to determ ine the
The test statistic for the null H0: /jl = /x0 \s
critical value of a test. First, when the random variables in the
average are iid normally distributed [i.e., X, ~ N(/x,cr2)], then
the t-test statistic has an exact Student's tn_-| distribution. (6.3)
(C on tin u ed)

(C ontinued)
W hen n is small (i.e., less than 30), the Student's t has been Figure 6.2 com pares a normal, ts, and t -15 densities. Note
docum ented to provide a better approxim ation to the distri that tails of the t are heavier than the normal, although they
bution of T than the normal. This result holds even when the becom e closer as the degrees of freedom increase. Using
random variables in the average (X,-) are not normally distrib critical values from the Student's t, along with the use of s 2 to
uted. Note that the critical values from a Student's tn_-| are estim ate the variance, is the recom m ended method to test
larger than those from a normal, and so using a tn_-| reduces a hypothesis about a mean when the sam ple size contains
the probability of rejecting the null (all things equal). In prac 30 or few er observations. In larger sam ples, the difference
tice, the larger critical values do a better job of producing between the two is negligible, and common practice is to
tests with the correct size a. estim ate the variance with a 2 and to use critical values from
the normal distribution.
Densities Right Tail
Figure 6.2 Comparison of the densities of the normal, t5, and t15. The right panel focuses on the right tail
for values above 1.645, the 5% cutoff from the normal distribution.
hypothesis (H0: /x = 0 ) can be rejected in favor of the alternative Table 6.1 Decision Matrix for a Hypothesis Test
hypothesis (H-p /j l 0 ). Relating the Unknown Truth to the Decision Taken
about the Truth of a Hypothesis. Size a and Power
Type II Error and Test Power 1 — (3 Are Probabilities and so Are between 0 and 1
A Type II error occurs when the alternative is true, but the null is Null Hypothesis
not rejected. The probability of a Type II error is denoted by the True False
G reek letter (3. In practice, (3 should be small so that the power
of the test, defined as 1 — (3, is high. Fail to Reject Correct Type II Error
Decision
d - a) 03)
The power of a test m easures the probability that a false null is
rejected. Table 6.1 shows the decision m atrix that relates the Reject Type I Error Correct
decision taken— rejecting or failing to reject the null— to the (a) d - jS)
truth of the null hypothesis. Note that the values in parentheses
in Table 6.1 are the probabilities of arriving at that decision.
Example: The Effect of Sample Size
W hereas the size of the test can be set by the tester, the
power of a test cannot be controlled in a similar fashion. This is
on Power
because the power depends on: Figure 6.3 illustrates the core concepts of hypothesis testing in a
• The sam ple size, small sample (where n = 1 0 ) and in a larger sample (where n = 1 0 0 ).
• The distance between the value assumed under the null and In both panels, the data are assumed to be generated according
the true value of the parameter, and to:
• The size of the test. X, ~ N(0.3, 1)

Small Sample (n = 10)
Large Sample (n = 100)
Fiq u re 6 .3 Both panels show two distributions: the distribution of the mean when the null is true (solid), and the
distribution of the sample mean that is centered at the population mean of 0.3 (dashed). The vertical lines show
the critical values, which are determined by the distribution centered on assumed mean in the null hypothesis (0.0).
The shaded regions show the power of the test, which is the probability that the null hypothesis is rejected in favor
of the alternative. The hatched region shows the probability of making a Type II error (i.e., /3).
The sam ple average /x is the average of n iid normal random When the sample mean falls outside of the set of critical values,
variables and so is also normally distributed with E [jl] = 0.3 and the null is rejected in favor of the alternative. The probability
V[/x] = 1 / V n . The null hypothesis is H q : ju = 0 and the alterna that the sample mean falls into the rejection region is illustrated
tive is H1: /x 0 . by the grey region outside of the critical values. This area is the
power of the test (i.e., the chance that the null is rejected when
Both panels in Figure 6.3 shows two distributions. The first is
the data is generated under the alternative). M eanwhile, the
the distribution of the sam ple mean if the null is true (shown
probability of failing to reject the null (i.e., (3) is indicated by the
with the solid line). It is centered at 0 and has a standard d evia
hatched area.
tion of 1 / v 10. The second is the distribution of the sam ple
mean (shown with the dashed line). It is centered at the true Increasing the sam ple size has an obvious effect— both distribu
value of j l = 0.3 and has the sam e standard deviation. tions are narrower. These narrower distributions increase the
probability that the null is correctly rejected and so increase the
The dashed lines show the critical values for a test with a size of
power of the test. In even larger sam ples (e.g ., n = 1000), the
5%, which are
two distributions would have virtually no overlap so that the null
• ± 0 .6 1 , when n = 1 0 , and is rejected with an exceedingly high probability (> 99.999% ).
This pattern shows that the power (i.e., the probability that the
• ± 0 .1 9 6 , when n = 100
false null is rejected) increases with n.
for the top and bottom panels, respectively.

Example: The Effect of Test Size Confidence Intervals
and Population Mean on Power A confidence intervals is a range of param eters that com ple
Figure 6.4 shows how the power of a hypothesis test is affected ments the rejection region. For exam ple, a 95% confidence
by the true (population) mean and the test size. In this case, the interval contains the set of param eter values where the null
null hypothesis is H0: /x = 0 and the test uses sizes of 1%, 5%, hypothesis cannot be rejected when using a 5% test. More gen
and 10%. he curves show the power for the sam ple size n = 50. erally, a 1 — a confidence interval contains the values that can
not be rejected when using a test size of a.
Note how test size influences power, because larger sizes (e.g .,
1 0 %) use sm aller critical values and have larger rejection regions The bounds of a confidence interval depend on the type of
than sm aller tests sizes (e.g ., 1%). Sm aller critical values increase alternative used in the test. W hen H q : fx = /xo is tested against
the probability that the sam ple mean falls into the rejection the two-sided alternative (e.g ., H-\: fx ^ /x0), then the 1 — a con
region, and so increase the test power. fidence interval is
Note that changing the sam ple test size has an im portant side [jx — Ca X f f / V n , jx + Ca X (x / V r i], (6.4)
effect: Large test sizes increase the probability of committing a where Ca is the critical value for a two-sided test (e.g ., if a = 5%
Type I error and rejecting a true null hypothesis. The choice of then Ca = 1.96).
the test size therefore affects both the probability of making a
Returning to the exam ple of estim ating the mean excess return
Type I error and that of making a Type II error. In other words,
on the S&P 500, the 95% confidence interval is
small sizes reduce the frequency of Type I errors, while large
sizes reduce the frequency of Type II errors. The test size should 16.5 16.5
thus be chosen to reflect the costs and benefits of committing 7.62 - 1.96 X —= 7.62 + 1.96 X [3.69, 11.54]
V 68 V 68-
either a Type I or Type II error.
This confidence interval indicates that any value of the null
Note that when the power is equal to a (i.e., the test size) when
between 3.69 and 11.54 cannot be rejected against a two-sided
the true value of /x is close to 0. This makes sense, because the
alternative. O ther values (e.g ., 0) are outside of this confidence
null should be rejected with probability a even when true.
interval, and so the null H0: /x = 0 is rejected.
As the true jx moves away from 0, however, the power increases
Confidence intervals can also be constructed for one-sided
monotonically. This is because it is easier to tell that the null is
alternatives. The one-sided 1 — a confidence interval is asym
false when the data are generated by a population /x far from
m etric and is either:
the value assumed by the null. In other words, large effects are
easier to detect than small effects. ( - o o , jx + Ca X <x/\/n]
Effect of Test Size (a ) and True n on Power
Fiqure 6.4

or W hen test statistics are normally distributed, the p-value
depends on the alternative. W hen the alternative is two-sided
[ ji - Ca X ( j/ V n , <*),
(H-i- A jhq ), the p-vsluo of 3 tost ststistic v s Iu b T is
where Ca is the critical value for a one-sided test with a size of
2(1 - <t>(|T|)), (6.5)
a. Returning to the excess return on the stock m arket, the one
sided confidence interval for testing the null H0: /x = 0 against where O(z) is the C D F of a standard normal evaluated at z.
the alternative H f. /jl > 0 is
In this expression, | T| is used to ensure that the probability in
r 16.5 the right-tail is always m easured, irrespective of whether T is
7.62 - 1.645 X oc = [4.32, °c)
L V 68 positive or negative. Therefore, <I>(| T |) com putes the probability
Pr(Z < | T | ) and 1 — <T( | T | ) com putes the area to the right of
The critical value is reduced from 1.96 to 1.645 due to the direc
| T | under the normal density (Pr(Z > | T | ), where Z is a standard
tionality of the test. In the two-sided test, the rejection region
normal random variable. Finally, this area is doubled when using
was shared between the left and right tails. In the one-sided
a two-sided test, because the test statistic T may lie in either tail.
test, the rejection region is only in the left tail. The null that
/x = 0 is outside of this one-sided confidence interval and so is In the exam ple testing w hether the excess return on the stock
rejected. m arket is 0 (H0: /x = 0 ), the test statistic value is
One-sided confidence intervals highlight a surprising issue when T = 3.80

using one-sided tests: The sign of the mean estim ate jl affects
Therefore, the p-value can be calculated as:
the decision to reject. For exam ple, when testing H0: /jl = 0
against the one-sided alternative Hfi /x > 0 , any negative esti 2(1 - <J>( 13 .8 0 1)) = 2(1 - 0.9993)
mate of /x does not lead to rejection of the null hypothesis. = 2(.00007)

= 0.00014
W hile confidence intervals can be interpreted as the set of null
values that would not be rejected using a test size of a, they Because 0.014% is less than the test size of 5%, the null hypoth
can also be correctly interpreted using a repeated sampling esis is rejected.
argum ent. Suppose that many independent sam ples are gener In addition, the one-sided p-value for testing the null against the
ated and that the 1 — a confidence interval is constructed for alternative Hf. /x > 0 is 1 — <E>(3.80) = 0.007% , which is half of
each sam ple. It is the case that 1 — a (e.g ., 95%) of these con the two-sided p-value.
fidence intervals contain the true param eter when the number
The top panels of Figure 6.5 show the area measured by the
of repeated sam ples is large. This second interpretation is more
p-value when the alternative is two-sided. The test statistic and
of a thought experim ent in practice because it is usually not
the tail probabilities in the left panel are and . 8 %, respec
possible to create multiple independent sam ples of data from
2 .1 1
tively, making the p-value equal to 3.6% . In this case, the null
financial m arkets.
hypothesis would be rejected using a test size of 5% (but not
1%). The p-value of 3.6% indicates that there is a 3.6% probabil
The p-value of a Test
ity of observing a | T| > 2.1 the null hypothesis is true.
A hypothesis test can also be summarized by its p-value, which
In the right panel, the test statistic is —0.5 and the total prob
measures the probability of observing a test statistic that is more
ability in both tails (i.e., the p-value) is 61.8% . This test would
extrem e than the one com puted from the observed data when
not reject the null using any standard test size. The large p-value
the null hypothesis is true.
here indicates that when the null is true, three out of five sam
A p-value com bines the test statistic, distribution of the test sta ples would produce an absolute test statistic as large as 0.5.
tistic, and the critical values into a single number that is always
The formula for a one-sided p-value depends on the direction
between 0 and 1. This value can always be used to determ ine
of the alternative hypothesis. W hen using a one-sided lower
whether a null hypothesis should be rejected: If the p-value is
tailed test (e.g ., H-p /x < /x0), the p-value is <J>(T) (i.e., the area
less than the size of the test, then the null is rejected.
in the left tail). W hen using a one-sided upper-tailed test (e.g .,
The p-value of a test statistic is equivalently defined as the Hy. /x Ao)/ the p-value is 1 *t*(T) (i.e., the area in the right
sm allest test size (a ) where the null is rejected. Any test size tail). The p-values for one-sided alternatives do not use the
larger than the p-value leads to rejection, whereas using a test absolute value because the sign of the test statistic matters for
size sm aller than the p-value fails to reject the null. determ ining statistical significance.

I P-value x Test Stat. Value
Figure 6.5 Graphical representation of p-values. The two panels show the area measured by a p-value when
using a two-sided alternative. The p-value measures the area above | T| and below — | T\ when using the test
statistic T.
PRACTICAL VERSUS STATISTICAL SIG N IFICA N CE

Hypothesis tests establish w hether a finding is statistically sig where 8 is a non-zero value and so always falls into the rejec
nificant (i.e., w hether the null hypothesis is rejected using an tion region. A s a result, very large sam ples produce statisti
appropriate test size). The hypothesis test does not, however, cally significant results even when the alternative is extrem ely
indicate w hether a finding has practical significance. In large close to the null.
sam ples, standard hypothesis tests have the property that the
Practical significance applies a second check when assessing
null hypothesis is always rejected w henever null is false and
w hether a result is noteworthy. Practical significance exam
the alternative is true. This happens because any discrepancy,
ines the m agnitude of the estim ated effect, not m agnitude
no m atter how small, must produce a test statistic that grows
of the test statistic or its p-value. It com plem ents statistical
arbitrarily large as n increases, because the test statistic can
significance, because for a result to be actionable it must be
be expressed
both statistically and practically significant.
For exam ple, suppose a new portfolio strategy reduces port
<J
■ V • » /V
folio variance by 0 . 1 % per year while not affecting the perfor

In large sam ples, ft converges to the true value of /z so that mance. Given a long enough historical sample, this difference
when the null is false: is always statistically significant. However, if the portfolio is a
/z - /x0 7^ 0 high-risk strategy with a volatility of 40% per year, then this
minor reduction in volatility is unlikely to alter the attractive
A s n increases, the test statistic takes the form:
ness of the portfolio to new investors. Here the result is statisti
V^8, cally, but not practically, significant and thus is not actionable.
If the null hypothesis is true, then:

6.2 TESTING THE EQUALITY O F TWO
E[Z,] = E[X (-j - E[Y;] = /zx - fjy = 0
MEANS
This is a standard hypothesis test of H0: /xz = 0 against the alter
Testing w hether the means of two series are equal is a com native that H ' n z 0 .
mon problem . Consider the iid bivariate random variable
The test statistic is constructed as:
Wj = [X„ V/]. The com ponent random variables X, and Y; are
each iid and may be contem poraneously correlated (i.e.,
T = - ^ =
C o rr[X „Y (] 0). V m f/n
Now consider a test of the null hypothesis H0: /jl x = /j l y (i.e., that and its value can be com pared to a standard normal. This
the com ponent random variables X; and Y; have equal means). method autom atically accounts for any correlation between X,
To im plem ent a test, construct a new random variable: and Yj because:
Z, = X, - Y, VIZ;] = V[X(-] + V[ Y,] - 2 Cov[X„ YU

THE RISKS OF REPEATED TESTING
Multiple testing (i.e., testing multiple hypotheses on the If the test statistics are independent, then there is a
same data) is a pervasive problem that affects fields as 40.1% (= 100% — (95% )10) chance that one will appear statis
diverse as finance, psychology, biology, and medical tically significant at the 5% level when n = 10. This probabil
research. Reusing data can produce spurious results and find ity increases to 99.4% when n = 100. In others words: If you
ings that do not hold up to rigorous scrutiny. keep testing strategies on the same data, you are bound to
eventually find one that appears to work.
The fundam ental problem with multiple testing is that the
test size (i.e., the probability that a true null is rejected) is That being said, there are ways to control for multiple
only applicable for a single test. However, repeated testing testing. The leading m ethods are the Bonferroni (or Holm-
creates test sizes that are much larger than the assumed size Bonferroni) correction, which corrects the decision rule of a
of a and therefore increases the probability of a Type I error. testing procedure when using multiple tests. It is sim ple to
im plem ent because it only makes use of standard statistics
For exam ple, suppose there are n trading strategies that are
and their p-values. A s a result, Holm-Bonferroni can be very
being com pared to a buy-and-hold strategy. These strate
conservative and have low power.
gies can be tested to determ ine which (if any) provide a bet
ter return than the buy-and-hold approach. The test would O ther m ethods can involve controlling the False Discovery
involve n nulls of the form: Rate (FDR) or limiting the Fam ilywise Error Rate (FW ER). FDR
and FW ER are more com plex approaches and use sophis
H 0: At; = AL B a H , ticated, data-intensive m ethods to overcom e this loss of
where /Xj is the return on strategy / and fxBaH is the return from power.
the buy-and-hold strategy.
Note that because jx z = jxx — (iy and <xf = T x + <xy ~ 2oxy, this A s an exam ple, consider two distributions that measure the
test statistic can be equivalently expressed as: average rainfall in City X and City Y, respectively. Their means,
variances, sam ple size and correlation are
/xx = 10cm; jjLy = 14cm; ax = 4, ay = 6 , nx = riy = 12,

Corr[X;,Y;] = 0.30
This expression for T shows that the correlation between For the hypothesis to not be rejected at the 95% level, the m axi
the two series affects the test statistic. If the two series are mum difference between /xx and jxY would be
positively correlated, then the covariance betw een X, and Y)
D
reduces the variance of the difference. This occurs because 1.96 = > 2 .6 cm
4 + 6 - 2 * 0.3 * V 4 V 6
both random variables tend to have errors in the same direc
12
tion, and so the difference between X, — Yt elim inates som e of
the common error. However, because:
Finally, a special version of this test statistic is available when X , Ax “ Ay -4 > 2.6
and Yj are both iid and mutually independent. W hen X, and Y,
The null is therefore rejected when a = 0.05.
are independent, they are not paired as a bivariate random vari
able, and so the number of sam ple points for X ) (nx) and Y, (nY) Furtherm ore, note that the value of the test statistic is
may differ. The test statistic for testing that the means are equal
(i.e., H0: /xx = /xY) is 4
10 - 14___________
4 + 6 - 2 * 0.3 * V 4 V 6
12
= -5 .2 1
4 This test statistic also has a standard normal distribution when the null Therefore, the null would be rejected even at a 99.99% confi
is true. dence level!

6.3 SUMMARY com bined to produce a critical value, which defines the rejection
region. Test statistics in the rejection region lead to rejection of
This chapter introduces hypothesis testing. Im plem enting a the null. If the test statistic is not in the rejection region, null is
hypothesis test involves specifying the null and alternative not rejected.
hypotheses. The null hypothesis is an assumption about the Confidence intervals and p-values are alternative m ethods to
population param eters being tested. The alternative hypothesis express the results of a testing procedure. Confidence intervals
defines the values where the null hypothesis should be rejected. contain a set of values where the null hypothesis should not be
Most applications use a two-sided alternative that is the natural rejected. The p-value is the probability of observing a test sta
com plem ent to the null. The hypothesis test is im plem ented tistic as large as the one com puted using the data if the null is
with a test statistic that has a known distribution when the null true. This means that a null should be rejected if the p-value is
is true. sm aller than the test size of a.
The size of the test is chosen to control the probability of a Most of the focus of this chapter is on testing a null hypothesis
Type I error, which is the rejection of a true null hypothesis. The about the mean (H0: /z = /x0) using the CLT. More generally, this
test size should reflect the cost of being wrong when the null testing fram ework is applicable to any param eter that has an
is true. The distribution of the test statistic and the test size are asym ptotic distribution that can be approxim ated using a CLT.

QUESTIONS

6.1 W hat role does the alternative hypothesis play when test 6.6 W hat is the critical value in a hypothesis test?
ing a hypothesis?
6.7 W hat does the p-value of a hypothesis m easure?
6.2 W hen is a one-sided alternative useful? 6.8 How can a confidence interval be used when testing a
6.3 W hat is the size of a hypothesis test? hypothesis?
6.4 W hat are the trade-offs when choosing the size of a test? 6.9 W hat does the VaR of a portfolio measure?
6.5 W hat is the power of a hypothesis test? 6.10 W hat are three m ethods to evaluate a VaR model of a
portfolio?
Practice Questions
6.11 Suppose you wish to test w hether the default rate of 6.15 You collect 50 years of annual data on equity and bond
bonds that are rated as investm ent grade by S&P is returns. The estim ated mean equity return is 7.3% per
the same as the default rate on bonds rated as invest year, and the sam ple mean bond return is 2.7% per year.
ment grade by Fitch. W hat are the null and alternative The sam ple standard deviations are 18.4% and 5.3% ,
hypotheses? W hat data would you need to test the null respectively. The correlation between the two-return
hypothesis? series is - 6 0 % . Are the expected returns on these two
assets the sam e? Does your answer change if the correla
6.12 Using the Excel function N O R M .S.D IST or a normal prob
ability table, what are the critical values when testing the tion is 0 ?
null hypothesis, H0: /z = /z0, against 6.16 If a p-VaR model is well specified, HITs should be iid
a. a one-sided lower alternative using a size of 1 0 %? Bernoulli^ — p). W hat is the probability of observing two
HITs in a row? Can you think of how this could be used to
b. a one-sided upper alternative using a size of 2 0 %?
c. a two-sided alternative using a size of 2 %? perform a test that the model is correct?
d. a two-sided alternative using a size of . 1 %? 6.17 A data m anagem ent group wants to test the null hypoth
esis that observed data is N(0,1) distributed by evaluating
6.13 Find the p-value for the following t-test statistics of the
the mean of a set of random draws. However, the actual
null hypothesis, H0: /z = /z0.
underlying data is distributed as N(1, 2.25).
a. Statistic: 1.45, Alternative: Two-sided
b. Statistic: 1.45, Alternative: One-sided upper a. If the sam ple size is 10, what is the probability of a
Type II error and the power of the test? Assum e a 90%
c. Statistic: 1.45, Alternative: One-sided lower
d. Statistic: —2.3, Alternative: One-sided upper confidence level on a two-sided test.
b. How many sam ples would need to be taken to reduce
e. Statistic: 2.7, Alternative: Two-sided
the probability of a Type II error to less than 1%?
6.14 If you are given a 99% confidence interval for the mean
return on the Nasdaq 100 of [2.32% , 12.78% ], what is the
6.18 Suppose that for a linear regression, an estimation of
sam ple mean and standard error? If this confidence inter slope is stated as having a one-sided p-value of 0.04.
W hat does this mean?
val is based on 37 years of data, assumed to be iid, what is
the sam ple standard deviation?

ANSW ERS

6.1 The alternative specifies the range of values where the 6.7 The p-value measures the probability that the observed
null should be rejected. In most tests, the alternative is data would have been generated if the null is true. It is
the natural com plem ent to the null, although this does the size where the test would just reject the null in favor
not have to be the case. It is possible to specify a test of the alternative.
where neither the null nor the alternative is true.
6.8 If the null is outside of the confidence interval, then it
6.2 A one-sided alternative is helpful when there is some should be rejected in favor of the alternative.
guidance about the value of the param eter when the null 6.9 The VaR is a measure or the m agnitude of the loss that
is false. A leading exam ple occurs when testing whether
the portfolio will lose with some specified probability
a risk premium is 0 against a one-sided alternative that (e.g ., 5%) over some fixed horizon (e.g ., one day or one
the premium is positive. w eek). The p-VaR is form ally defined as the value where:
6.3 The size is the probability of wrongly rejecting the null Pr(L > VaR) = 1 - p, (6 . 6 )
when it is in fact true. This is equivalently the probability
where L is the loss of the portfolio of the selected time
of a Type I error.
horizon and 1 — p is the probability that a large loss
6.4 A small test size reduces the probability of a Type I error,
occurs.
that is, rejecting the null when it is true. It also lowers the
power of the test, that is the chance of rejecting a false null 6.10 First, the exact distribution that com putes the exact
distribution of the number of VaR violations (HITs) over
because smaller sizes correspond to larger critical values.
some tim e period. Second, the asym ptotic method that
6.5 The power of a test the probability of rejecting a false exam ines the average number of violations and uses the
test when the alternative is true. It is one minus the prob
CLT to constrict the asym ptotic distribution. Finally, the
ability of a Type II error.
likelihood ratio that exploits the Bernoulli distribution
6.6 The critical value determ ines the rejection region for a that underlies the 0-1 H IT variables.
test. Test statistics larger than the critical value indicate
the null should be rejected in favor of the alternative.
Solved Problems
6.11 The null is that the IG (investm ent grade) default rate d. This size corresponds to the same critical value as a
is the same for S&P rated firms as it is for Fitch-rated .05% upper-tailed test, which is 3.29.
firms. If Sj are default indicators for S&P rated firms, and
6.13 a. In a two-sided test, the p-value is the area under
Fj are default indicators for Fitch-rated firm s, then the the normal curve for values greater than 1.45 or less
null is H q : As = A f , which states that the mean values than - 1 .4 5 . This is tw ice the area less than - 1 .4 5 , or
are the sam e. The null can be equivalently expressed 2 X 0.074. = .148.
as H0: a s — A f = 0. The alternative is Hy. /j l $ ^ a f or b. In a one-sided upper test, we need the area above
equivalently H^: /xs — a f ^ 0 - The data required to test 1.45. This is half the previous answer, or 0.074.
this hypothesis would be binary random variables where
1 indicates that an IG bond defaulted and 0 if it did not
c. In a one-sided lower test, we need the area under
the curve for values less than 1.45. This is just
within a fixed tim e fram e (e.g ., a quarter).
1 - 0.074 = .926.
6.12 a. The Iower-tailed critical value is —1.28.
d. Here we need to probability under the normal curve
b. The one-sided upper-critical value is 0.84.
for values above —2.3, which is 1 minus the area less
c. A two-sided alternative with a size of 2% uses the than - 2 .3 , or 1 — 0.01 = 0.99.
same critical value as a one-sided upper test, which is
e. Because it is a two-sided alternative, we need that
2.32. Each tail has probability 1% when using the criti
area for values larger than 2.7 or less than —2.7,
cal value.

which is tw ice the area less than —2.7. This value is ± 1.96 using a size of 5%. The null is not rejected. If the
2 X .0034 = .0068. correlation was 0 , then the variance estim ate would be
6.14 The mean is the midpoint of a sym m etric confidence 0.0366, and the test statistic is 1.69. The null would still
not be rejected if the size was 5%, although if the test
interval (the usual type), and so is 7.55% . The 99%
Cl is constructed as [/x — c X & ,jl + c X a] and so size was 10%, then the critical value would be ± 1.645
and the correlation would matter.
c X a = 12.78% — 7.55% = 5.23% . The critical value for
a 99% Cl corresponds to the point where there is 0.5% in 6.16 The probability of a H IT should be 1 — p if the model
5.23% is correct. They should also be independent, and so
each tail, or 2.57, and so a 2.03% needs to
2.57 the probability of observing two HITs in a row should
be surveyed.
be (1 — p)2. The can be form ulated as the null that
6.15 The null hypothesis is H0: /j l e = /j l b . The alternative is Hb: E[H IT i X H ITi+i] = (1 — p ) 2 and tested against the
H-|: fx E ¥= /j l b . The test statistic is based on the difference alternative H-p E[H/T(- X H/T/+1] ^ (1 — p)2. This can be
of the average returns, 8 = 7.3% — 2.7% = 4.6% . im plem ented as a sim ple test of a mean by defining the
The estim ator of the variance of the difference is random variable X/ = HIT} X H/T( + 1 and then testing the
a E + a B - 2(t be, which is 0 .1 842 + 0.0532 - 2 X - 0 .6 X null H0: fJLx = (1 — p ) 2 using a standard test of a mean.
0.184 X 0.053 = 0.0483. The test statistic is
g
6.17 a. W hen the null hypothesis is false, the probability of
—, = 1.47. The critical value for a two-sides test is a Type II error is equal to the probability that the
/O043
hypothesis fails to be rejected, as per the diagram in
V 50
the chapter:
Further recall that for a 90% confidence level on a N(0,1) distribution, the cut-off points are + /—1.65.
-1.65 1.65
Chapter 6 Hypothesis Testing 97
Now, if there are 10 sam ples taken from an N(0,1) In actuality, the true distribution is N (1,2.25), so
then the standard deviation is reduced the a = V 2 .2 5 = 1.5. For a sam ple size of 10, the
expected sam ple standard deviation is
°H> — = = 0.316
V 10
/V
A 5
0.474
O s a m p le
VTo
Therefore, the cut-off points are
Calculating the equivalent distance of + /— 1.65 in this
± 1 .6 5 * 0 .3 1 6 = ± 0 .5 2 2 distribution com pared to a standard N(0,1) yields
Hypothesis Testing for Normal Distribution Parameters
N(0,0.316A2) --------- N(1,0.474A2)
- 0 .5 2 2 - 1 Therefore, the requirem ent becom es

-3 .2 1
0.474
1 - Pr(> right) = 0.01
and
This occurs at a Z-score of (using the Excel function
+ 0.522 - 1 NORM SINV) - 2 .3 2 .
1.00
0.474
-
Accordingly, the following equations need to be

The probability of being on the left-hand side is prac
solved
tically zero. For the right,
Pr(> right) = 1 - $ (- 1 .0 0 ) = 1 - 15.9% = 84.1% . - 2 .3 2

So total probability of a Type II error is 1 - the prob
ability of being in the two tails is
Plugging in K yields:
Pr(Non — R e je c tio n | H0 is false) = 1 - [Pr(< left)
+ Pr(> right)] « 1 - 84.1% = 15.9%
Therefore, the power of the test is 84.1% .
b. The requirem ent is to have

1 - [Pr(< left) + Pr(> right)] = 1%
Clearly, as n increases from 10, the probability of

being in the left-hand tail will only decrease from And because partial observations are not allowed,
already being close to zero. n = 27.

6 .1 8 The p-value tells us the "probability of observing a test will correspond to the value of x that solves the following
statistic that is more extrem e than the one com puted equation:
from the observed data when the null hypothesis is tru e." 1 - O(x) = 0.04
In other words, if the estim ate of the slope is true, then in
Using the excel function N O RM SIN V gives x = 1.75.
a randomized trial the probability of getting a larger new
slope estim ate is 4% .Assum ing a normal distribution, this

Learning Objectives
D escribe the models that can be estim ated using linear • Construct, apply, and interpret hypothesis tests and
regression and differentiate them from those which confidence intervals for a single regression coefficient in a
cannot. regression.
Interpret the results of an O LS regression with a single Explain the steps needed to perform a hypothesis test in a
explanatory variable. linear regression.
Describe the key assumptions of OLS parameter estimation. Describe the relationship between a t-statistic, it's p-value,
and a confidence interval.
Characterize the properties of O LS estim ators and their
sampling distributions.
101
Linear regression is a w idely applied statistical tool for modeling
the relationship between random variables. It has many appeal VARIABLE NAMING CO N VEN TIO N S
ing features (e.g ., closed-form estim ators, interpretable param
Linear regression is used widely across finance, economics,
eters, and a flexible specification) and can be adapted to a wide other social sciences, engineering, and the natural sciences.
variety of problem s. This chapter develops the bivariate linear W hile users within a discipline often use the same nomen
regression m odel, which relates a dependent variable to a single clature, there are differences across disciplines. The three
explanatory variable. M odels with multiple explanatory variables variables in a regression are frequently referred to using a
variety of names. These are listed in the table below.
are exam ined later in this book.
The chapter begins by exam ining the specifications that can Explained Variable Explanatory Variable Shock
(and cannot) be m odeled in the linear regression fram ework.
Left-Hand-Side Right-Hand-Side Innovation
Regression is surprisingly flexible and can (using carefully con Variable Variable
structed explanatory variables) describe a wide variety of rela
D ependent Variable Independent Variable Noise
tionships. Dummy variables and interactions are two key tools
used when modeling data. These are w idely used to build fle x Regressand Regressor Error
ible m odels where param eter values change depending on the Disturbance
value of another variable.
This chapter presents the O rdinary Least Squares (OLS) estim a

tors, which have a simple moment-like structure and depend on
where:
the mean, variance, and covariance of the data. W hile these esti
mators are derived without assum ptions about the process that • /3, commonly called the slope or the regression coefficient,
generated the data, they are needed to interpret the model. measures the sensitivity of Y to changes in X;
Five assumptions are used to establish key statistical properties • a, commonly called the intercept, is a constant; and
of linear regression estim ators. W hen satisfied, the param eter • e, commonly called the shock/innovation/error, represents a
estim ators are asym ptotically normal, and standard inference com ponent in Y that cannot be explained by X . This shock is
can be used to test hypotheses. assumed to have mean 0 so that
Finally, this chapter presents three key applications of linear E[Y] = E[a + (3X + e] = a + (3E[X\
regression in finance: measuring asset exposure to risk factors,
The presence of the shock is also why statistical analysis is
hedging, and evaluating fund m anager perform ance.
required. If the shock were not present, then Y and X would
have a perfect linear relationship and the value of the two
param eters could be exactly determ ined using only basic
7.1 LINEAR REGRESSION algebra.
Regression analysis is the most w idely used method to measure, An obvious interpretation of a is the value of Y when X is zero.
m odel, and test relationships between random variables. It is However, this interpretation is not meangingful if X cannot plau
A
widely used in finance to measure the sensitivity of a portfolio to sibly be zero.

common risk factors, estim ate optimal hedge ratios for m anag
For exam p le, co nsid er a m odel that exam ines how the aver
ing specific risks, and to measure fund m anager perform ance.
age m aturity of co rp o rate bond offerings (Y) d ep end s on
Regression models all exam ine the relationship between a firm m arket capitalization (X). Because firm size is alw ays
dependent variable (Y) and one or more explanatory variables p o sitive, the observed data alw ays lie to the right of the
(X). This chapter focuses on the most w idely applied form of y-axis and interpreting a as the average m aturity of debt
regression m odeling: linear regression with a single explanatory offered by firm s with a m arket capitalization of USD 0 is not
variable. m eaningful. M oreover, it may be the case that the prediction
at 0 is not even feasib le (e .g ., a negative value of a would
be incorrect in the m aturity exam ple because m aturity is also
Linear Regression Parameters alw ays positive).
Linear regression assumes a linear relationship between an
explanatory variable X and a dependent variable Y so that:
1 For example, when modeling how the average maturity of corporate
Y = a + (3X + e, (7.1) bond offerings (Y) depends on firm market capitalization (X).

W hen the explanatory variable cannot be 0, then a is nonlinear function of another . 3 Transforming variables enables
interpreted as the value that ensures that Y lies on the fitted linear regression to explain a wide range of relationships
regression line at X (i.e., Y = a + /3X). This interpretation is between the dependent and the explanatory variables. For
always meaningful because x, can always lie on either side of X. exam ple, it is possible to model the relationship between Y and
the pair X and X 2 in a linear regression by setting X-| = X and
X 2 = X 2, so that:
Linearity
V = a + f t X + (32X 2 + e
Explanatory variables can be continuous, discrete, or functions
= a + (3-\X-\ + /32 X 2 + e
of one or more variables (e.g ., X 3 = X-| X X 2). However, all linear
regressions must satisfy three essential restrictions. Here, the second explanatory variable is a function of the first,
and the model is not linear in X, because a small increase in X b y
1 . First, the relationship between Y and the explanatory vari
A X increases y b y : 4
ables X-|, X 2, . . . , Xfc must be linear in the unknown coef
ficients. This means that the term on the right-hand-side of A y = (ft + 2 ftX )A X
the model must have a single unknown coefficient multi
Note that rate of change of y depends on both the change A X
plied by a single explanatory variable.
and the level of X.
2. Second, the error must be additive. This restriction excludes
Many smooth nonlinear functions can be accurately approxi
some m odels where the variance of the error depends on
mated using polynom ials, and so these relationships can be
observed data (e.g ., Y = a + (3X + yXe).
m odeled using linear regression by including powers of explana
3. Finally, all explanatory variables must be observable. This tory variables.
limitation precludes directly applying linear regression with
missing d ata . 2
Transformations
For exam ple, consider
Some model specifications that do not naively satisfy the three
V = a + p x * + €,
requirem ents for linearity can be transform ed so that they do.
where y is an unknown param eter that measures the shape of For exam ple, suppose that y is an always positive random vari
the nonlinear relationship. The param eters of this model can able, and that Y, X , and e are related through:
not be estim ated using the m ethodology of linear regression
y = aXPe,
because /3XY contains two unknown param eters and y does
not enter m ultiplicatively (and therefore it violates the first where e is a positive-valued shock. This specification is not com
restriction). patible with the requirem ents of linear regression, because X is
raised to an unknown coefficient f3 and the error e is not addi
However, the use of m ultiple regression (i.e., multiple explana
tive. However, taking the natural logarithm of both sides of the
tory variables) can allow some nonlinear relationships to be
equation transform s the relationship so that:
m odeled using linear regression. The multiple regression model
includes k explanatory variables so that: In y = In a + (3 In X + In e
Y = a + ftX-, + . . . + fikXk "F e (7.2) Y = a + (3X + e
Building models with multiple explanatory variables allows the This transform ed model satisfies the three requirem ents of linear
effect of an explanatory variable to be measured while control regression.
ling for other variables known to be related to Y. For exam ple, W hen interpreting the slope of a transform ed relationship, note
when assessing the perform ance of a hedge fund manager, it that the coefficient [3 m easures the effect of a change in the
is common to use between seven and 1 1 explanatory variables transform ation o f X on the transform ation o f Y .
that capture various sources of risk.
Note that the k explanatory variables are not assumed to be 3 While the k explanatory variables are not assumed to be independent,
independent, and so one variable can be defined as a known they are assumed to not be perfectly correlated. This structure allows
for nonlinear transformation of a variable to be included in a regression
model, but not for linear transformation because Corr[X„ a + bX,] = 1
if b > 0 and —1 if b < 0. Further details in the assumptions made with
2 There are many approaches to imputing missing values in linear mod
using multiple explanatory variables are presented in Chapter 10.
els that then allow linear regression to be applied to the combined data
set including the imputed values. 4 This is the result of taking the partial derivative of Y with respect to X.
Chapter 7 Linear Regression ■ 103

For exam ple, if the model is dummy variable, because it is produced by interacting a binary
dummy variable with another variable.
In Yj = a + (3 In X + e,,
The dummy interaction makes it possible to describe two m od
then f3 is the change in In Y for a small change in In X.
els with a single equation:
Com puting the direct effect of X on Y requires taking the deriva
• Y = a + /3-|X + e, when D = 0; and
tive of the functions transform ing these variables. In the case of
In Y, the derivative is • Y = a + (fr + /32)X + e, when D = 1.
dY/Y, Even though each segm ent is linear in X, the slopes of the
two lines differ when (32 ^ 0 and thus the overall relationship
which can be directly interpreted as a percentage change.
between Y and X is nonlinear.
W hen both X and Y have been replaced by their logarithm s
(called a log-log m odel), then (3 m easures the percentage
change in Y for a 1% change in X. In other w ords, the co effi 7.2 ORDINARY LEAST SQUARES
cient in a log-log model is directly interpretable as the elastic
ity of Y with respect to X. Consider a basic regression model that includes a single explan
atory variable:
However, if only the dependent variable is replaced by its loga
rithm (e.g ., In Y, = a + (3X, + e), then (3 measures the percent Y = ex + (3X + e
change in Y for a one-unit change in X. This model has three param eters: a (i.e., the intercept), [3
(i.e., the slope), and a 2 (i.e., the variance of e).
Suppose there n observations of Y and X. All pairs (x,-, y,) are

Dummy Variables assumed to have the same linear relationship so that:
An im portant class of explanatory variable is known as a dummy. Yi = a + (3y, + €j

A dummy random variable is binary and only takes the value 0 These param eters can be estim ated using Ordinary Least
or 1. Dummies are used to encode qualitative information (e.g., Squares (OLS) estim ators, which are derived by minimizing the
a firm's sector or a bond's country of origin). sum of squared deviations.
A dummy takes the value 1 when the observation has the qual In this case, the squared deviations being minimized are
ity and 0 if it does not. For exam ple, when encoding sectors, between the realizations of the dependent variable Y and their
the transportation sector dummy is 1 for a firm whose primary expected values given the respective realizations of X:
business is transportation (e.g ., a commercial airline or a bus n
operator) and 0 for firms outside the industry. Dummies are also a, (3 = arg m in ^ Y ; ~ a - /3x,)2, (7.3)
a’P ;=1
commonly constructed as binary transform ations of other ran
where arg min denotes argum ent of the minimum . 5
dom variables (e.g ., a m arket direction dummy that encodes the
return on the m arket as 1 if negative and 0 if positive). In turn, these estim ators minimize the residual sum of squares,
which is defined as:
Dummies can be added to models to change the intercept (a)
or the slope (/3) when the dummy variable takes the value 1. For
S ( y < - « - P * i) 2 = 2 ^
exam ple, an em pirically relevant extension of CAPM allows for /=1 /=1
the sensitivity of a firm's return to depend on the direction of In other words, the estim ators (i.e., a and f3) are the intercept
the m arket return. If we define X t o be the excess m arket return and slope of a line that best fits the data because it minimizes
A
and D to be a dummy variable that is 1 if the excess market the squared deviations between the line a + fix, and the realiza
return is negative (i.e., a loss), then this asym m etric CAPM speci tions of the dependent variable y,-.
fies the relationship as The solution to the minimization problem is
Y = a + /3-|X + /32(X X D) + e a = y - (3X
= a + /3 -|X-| + /32 X 2 + e - X K y, - Y)
(7.4)
This model is a linear regression because the two variables S H A - x )2 '
(X-, = X and X 2 = X X D) are both observable and each is
related to the dependent variable through a single coefficient. 5 For example, arg min f[x) is the value of X the produces that lowest
The second explanatory variable (X X D) is called an interaction value of f(x). *

where Y and X are the averages of y-t and x„ respectively. The estim ator for the slope can be equivalently expressed as the
ratio of the sam ple covariance between X and Y to the variance
Note that the estim ator (3 is only sensible if 2 "=i<*■- *>2 > o,
of X, because we can multiply both the num erator and denomi-
which requires that x, has some variation around its mean. If x,
. 1
only takes one value, then all points lie along a vertical line and nator by — to get
the slope is infinite. Violations of this assumption are simple to
identify in a dataset and so are easily avoided.
- 2 n i <x ' - x ><Yi - y ) (Tx y
P = (7.6)
A
The estimators a and /3 are then used to construct the fitted values:
&XX
yi = a + px, - x >2
and the model residuals This ratio can also be rewritten in term s of the correlation and
V /V /V n
the standard deviations so that:
€ i = y , ~ y \ = y- , - a - px-,
<J y
Finally, the variance of the shocks is estim ated by: P = PxY^r (7.7)
Cx
The sample correlation pxy uniquely determines the direction of the
line relating Y and X. It also plays a role in determining the magni
Note that the variance estim ator divides by n — 2 to account tude of the slope. In addition, the sample standard deviations ay
for the two estim ated param eters in the m odel, so that s 2 is an and ax scale the correlation in the final determination of the regres-
unbiased estim ator of a 2 (i.e., E[s2] = a 2). sion coefficient p . This relationship highlights a useful property of
O LS: p = 0 if and only if p = 0. Thus, linear regression is a conve
The residuals are always mean 0 (i.e., = 0) and uncorrelated
nient method to formally test whether data are correlated.
with X, (i.e., ^ ^ ( X ; — X)e(- = 0). These two properties are conse
quences of minimizing the sum of squared errors. These residuals Figure 7.1 illustrates the com ponents of a fitted model that
are different from the shocks in the model, which are never observed. regresses the mining sector portfolio returns on the market
F ia u re 7.1 Plot of the annual returns on a sector portfolio that tracks the performance of mining companies and
the returns on the US equity market between 1999 and 2018. The diagonal line is the fitted regression line where
the portfolio return is regressed on the market return. The square point is the fitted value when using the market
return in 2003 of 31.8%. The fund's return was 100.5%. The brace shows the residual, which is the error between
the observed value and the regression line.

return for 20-year period between 1999 and 2018. The solid line This assumption also implies that the unconditional mean of the
is the estim ated regression line, which minimizes the sum of the errors is zero (E[e] = 0).
squared vertical distances between the observed data and the
This assumption is not directly testable, as shocks are not observ
fitted line.
able and the estim ated residuals (i.e., e,) are always exactly
The mean annual excess returns of the mining sector (Y) and uncorrelated with the observations of the explanatory variable
the m arket (X) are y = 15.4% and x = 7.77% , respectively. The (i.e., Xj). Determ ining whether this assumption is reasonable
covariance m atrix for the two annual returns is requires a careful exam ination of the data generating process
r for (Y, X). Exam ples of data generating processes where this
~2
ay °X Y 1770 512.3
assumption is violated include the following.
-O -XY <?X J _512.3 337.3_
• Sam ple selection bias or survivorship bias. Sample selection
The estim ate of the slope can be calculated using Equation 7.6:
bias occurs when some observations are not recorded due
d-XY _ 512.3 to missing values of yr For exam ple, when modeling housing
1.52
a** ~~ 337.3 price changes using completed transactions, it is important
The intercept depends on the slope and the two estim ated to remember that homeowners often avoid selling when they
means, and is have negative equity (i.e., when the value of the home is less
than the outstanding mortgage). This effect removes homes
a = y - fix = 15.4% - 1.52 X 7.77% = 3.60%.
with prices that have fallen the most from the sample and
The estim ated slope shows that the return on the mining sec creates sample selection bias. Similarly, when studying firm
tor portfolio increases by an average of 1.52% when the m arket performance, the firms in operation are "winners" in the sense
return increases by 1%. M eanwhile, the estim ated intercept indi that they have performed well enough to continue to do busi
cates that the portfolio returns an average of 3.607% when the ness. Firms that are less successful are more likely to delist, and
m arket return is zero. any model estimated using only the surviving firms is not repre
The square m arker shows the fitted value of the portfolio return sentative of the behavior of all firms. Survivorship bias is com
in 2003. Note that observed mining portfolio m arket return is far monly addressed using carefully constructed databases that
above the fitted value and so the residual (marked by the brace) report the final return for all firms, including those that delist.
is positive. • Sim ultaneity bias. W hen X and Y are sim ultaneously deter
mined, then modeling Y as a function of X is not m eaning
ful because X is also a function of Y. The classic exam ple
7.3 PROPERTIES O F OLS PARAMETER of sim ultaneity bias is the relationship between quantities
ESTIMATORS transacted and the transaction price. Changing the transac
tion price affects the quantity traded, which in turn affects
The derivation of O LS estim ators requires only one easy-to-ver- the price.
ify assumption— that the variance of X is positive. However, five
• O m itted variables. The model should not exclude variables
additional assum ptions are needed to establish conditions that
that are im portant determ inants of Y. Om itting these creates
ensure the O LS estim ators are interpretable and have desirable
coefficients that are biased and may indicate a relationship
statistical properties.
when there is, in fact, none.
• A ttenuation bias. W hen explanatory variables are measured
Shocks Are Mean Zero with error, the estim ated slopes are sm aller in m agnitude
than the true slopes. This is because m easurem ent error
Shocks are mean zero conditional on X, so that E[e|X ] = 0. This
attenuates the relationship between Y and X, leading to
property is known as mean in d ep en d en ce and it requires that X
inconsistent param eter estim ates. Figure 7.2 shows the effect
has no information about the location of e.
of attenuation bias using sim ulated data. The true explana
This assumption states that: tory variable is X* (marked with dots) and is not observed.
E[eg(X)] = 0, Instead, a noisy measure (marked with crosses) of it is used
X, = X* + Vii
where g(X) is any well-defined function of X, including both
the identity function (g(X) = X) and the constant function where 17 , is independent of X; and the shock in the m odel. These
(g(X) = 1). This assumption implies that Corr[e, X] = 0, so that m easurem ent errors randomly shift the explanatory variables
the innovations are uncorrelated with the explanatory variables. and produce a line with a slope that is always flatter than the

Figure 7.2 An illustration of attenuation bias. The explanatory variable Xf (dots) is not observed and the model
is instead estimated using noisy measurements, X, = )£, + rj-, (crosses), where rj; is independent measurement
noise. The arrows show the random shift from the true value of the explanatory variable to the value used in the
regression.
true relationship. M easurem ent bias is common in finance when Variance of X

using balance-sheet variables reported on quarterly statem ents.
Attenuation bias has been called the iron law o f econom etrics, The variance of X is strictly greater than 0 (i.e., o x > 0). This
which suggests that estim ated coefficients system atically under assumption is required when estim ating the param eters and
state the true strength of relationships due to the pervasiveness ensures that (3 is well defined.
of m easurem ent error.

Constant Variance of Shocks
Data Are Realizations from iid Random The variance of the shocks is finite and does not vary with X,
Variables so that:
W hen using financial and econom ic data, explanatory variables V[e,|X] = o- 2 < oc (7.8)
are not fixed values that can be set or altered. Rather, an explan
This assumption is known as hom oskedasticity and requires that
atory variable is typically produced from the same process that
the variance of all shocks is the sam e.
produces the dependent variable.
The top panel of Figure 7.3 illustrates this assumption with sim u
For exam ple, the explanatory variable in CAPM is the market
lated data. The density of the shocks is shown for five values of
return and the dependent variable is the return on a portfolio.
Xj. Note that this distribution does not vary and so the variance
Both the return on the portfolio and the m arket return are the
of the shocks is constant.
result of m arket participants incorporating new information into
their expectations of future prices. Thus, CAPM models the The bottom panel of Figure 7.3 illustrates a violation of this
mean of the excess return on the portfolio co n d itio n ed on the assum ption. The density of the errors is changing with x„ and
excess return on the market. the variance of the shocks (which is reflected by the width of the
interval) is increasing in the explanatory variable.
Form ally, it is assumed that the pairs [x„ y () are iid draws from
their joint distribution. This assumption allows x, and y,- to
be sim ultaneously generated. Im portantly, the iid assum p No Outliers
tion affects the uncertainty of the O LS param eters estim ators
The probability of large outliers in X should be sm all . 6 O LS
because it rules out correlation across observations.
m inim izes the sum of squared errors and therefore is sensitive
In some applications of O LS (e.g ., sequentially produced time to large deviations in either the shocks or the explanatory
series data), the iid assumption is violated. Note that O LS can
still be used in situations where (X, V) are not iid, although the
6 Formally, we assume that the random variables X, have finite fourth
method used to com pute standard errors of the estim ators must moments. This technical condition cannot be directly verified using a
be m odified. sample of data, and so this technical detail is of limited value.

Homoskedastic Shocks
yt
Heteroskedastic Shocks
Fig u re 7 .3 The densities along the regression line illustrate the distribution of the shocks for five values of X,.
In the top panel, these distributions are identical and the shocks in the regression are homoskedastic. The bottom
panel shows heteroskedastic data where the variance of the shock is an increasing function of the value of X,.
variables. O utliers lead to large increases in the sum of Implications of OLS Assumptions
squared errors, and param eters estim ated using O LS in the
presence of outliers may differ substantially from the true These assumptions come with two meaningful im plica
param eters. tions. First, they imply that the estim ators are unbiased so
that:
Figure 7.4 shows exam ples of outliers in e (top panel) and X
(bottom panel). The single outlier in the error has the potential E[a] = a and E[/3] = (3
to substantially alter the fitted line. Here, the extrem e outlier
leads to a doubling of the slope despite the 49 other points all Unbiased param eter estim ates are (on average) equal to the
occurring close to the true line. The bottom panel shows that true param eters. This is a finite sam ple property and so holds for
outlier in X can also produce a large change in the estim ated any number of observations n. In large sam ples, the estim ators
slope. In this case, the outlier reduces the estim ate of the slope are consistent so that a-*-* a and [3 [3 as n —» °c. This property
by more than 50%. ensures that the estim ated param eters are very close to the true
values when n is large.
The sim plest method to detect and address outliers is to visu
ally exam ine data for extrem e observations by plotting both the The second meaningful implication is that the two estim ators are
explanatory variables and the fitted residuals. jointly normally distributed.

yt
Correct Fit, 8 = 1.33
Outlier Fit, $ = 1.42
Extreme Outlier Fit, 8 = 2.82
rrs rs rtr; • • i T ^ 4 >. ■•
X,
Fiqure 7.4 The top panel shows the effect of a single outlier in e(- on the estimated slope coefficient of a
moderate outlier and an extreme outlier. The bottom panel shows the effect on the slope of an outlier and an
extreme outlier in the explanatory variable.
The asym ptotic distribution of the estim ated slope is The asym ptotic distribution of the intercept is
- / . ,\ . d (r2{fi% + o%)\
V n 0 - p ) ^ N (o , (7.9) V n (a - ^------ J (7.10)
where V[X] = 0 $ 's the variance of X. The estimation error in a also depends on the variance of the
This CLT strengthens the consistency result and allows for residuals and the variance of X. In addition, it depends on fix
hypothesis testing. Note that the variance of the slope estim ator (i.e., the squared mean of X). If X has mean 0 (i.e., fix = 0), then
depends on two m oments: the variance of the shocks (i.e., a 2) the asym ptotic variance in Equation 7.10 sim plifies to a 2 and a
and the variance of the explanatory variables (i.e., a%. Further- simply estim ates the mean of Y.
A
more, the variance of (3 increases with cr . This is not surprising, In practice, the CLT is used as an approximation so that [3 is treated
because accurately estimating the slope is more difficult when as a normal random variable that is centered at the true slope [3:
the data are noisy.
P ~ { , ^ (7.11)
A q
On the other hand, the variance of (3 is decreasing in cr* This n p
reflects the value of having widely dispersed explanatory vari

ables; all things equal, it is easier to identify the slope when the The effect of the sam ple size n is clear in this approxim ation: the
observations of X are spread out. variance of (3 decreases as the sam ple size increases.

estimated param eter and the value specified by the null hypoth
ESTIMATING THE STANDARD ERROR esis. When testing the null H0:/3 = (3q , the test statistic is
OF THE OLS PARAM ETERS
p - Po P ~ Po
(7.16)
Because the expression in Equation 7.11 depends on s.e.(jS) s
o o
the unknown values a and ax, it cannot be used to test
hypotheses. Rather, these values are replaced by the esti V 2 !L,(x - x ):
mator of the variance of the shocks (i.e., s2) and the large W hen the null hypothesis is true, then the test statistic has an
sam ple estim ator of the variance of the observations of X: asym ptotic normal distribution. The test is im plem ented by
com paring the test statistic value to the critical values from a
^ = i< x , - x (7.12)
standard normal distribution. W hen the alternative is two-sided
1 )2
n i= 1
(e.g ., H t f 7 ^ P q) i then the null is rejected when:
Note that nSx = ^ ”= 1 (X, - X ) 2 and thus the standard
form of the estim ator of the variance of /3 is t > CS'
where Cs is the critical value from the normal distribution for a

(7.13)
E / = i(x / - *) test with size s. For exam ple, when using a 5% test, the critical
The estim ated standard error of f3 is then:

A
value is chosen so that the probability in each tail is 2.5% , and
so C s = 1.96.
1
s.e.(/3) = (7.14)
The test statistic can also be transform ed into a p-value, which
V 2 " = i ( X - X) 2 V n °x
m easures the probability of observing a test statistic as large
The estim ator for the variance of a can be constructed
as the one observed if the null is true. The p-value is com puted
by replacing population quantities with their sam ple ana
logues, so that: by measuring the area in both tails that is beyond the test sta
tistic. It is com puted by first measuring the area in the right tail
s 2 (n - 1 x ?)
beyond the test statistic, 1 — <E>(| T|), and then doubling this to
2
(7.15)
2 -= i(X/-x)2 account for both tails so that:
where the num erator uses a sim plification derived from
p-value = 2(1 - <D(|T|)), (7.17)
the identity E[X ,]2 = /x£ + a *
where T>(z) is the C D F of a standard normal. A p-value less than
5% indicates that the null is rejected using a test size of 5%.
The asym ptotic distribution can also be used to construct 1 — c

7.4 IN FEREN CE AND HYPOTHESIS confidence intervals using the quantiles of a standard normal
TESTING distribution . 8 Recall that a 1 — c confidence interval contains the
set of null hypothesis values that are not rejected when using a
When the assumptions in Section 7.3 are satisfied, the estimators test size of c. For exam ple, the 90% confidence interval for [3 is
of a and f3 are normally distributed in large sam ples . 7 Therefore,
/V /V
testing a hypothesis about a regression param eter is identical to /3 — 1.645 X s.e.(yS), yS + 1.645 X s.e.(yS) / (7.18)
testing a hypothesis about the mean of a random variable (or any
other estimator that follows a CLT). Tests are implemented using where Pr( — 1.645 < Z < 1.645) = 90% when Z is a standard
a t-test, which measures the normalized difference between the normal random variable.
It is common to report the t-statistic of the regression coeffi

cient, which is the value of the test statistic for the specific null
7 The parameters are jointly asymptotically (bivariate) normally distrib hypothesis that the param eter is 0 .
uted so that:
For exam ple, the t-statistic of [3 has the null H0:(3 = 0 and
o - V x + a% alternative Hy.f3 0. Values larger than ± 1.96 (or 2, using the
ox rule-of-thumb value) indicate that the param eter is statistically
o 2^x
ox
This joint distribution allows performing joint hypothesis testing of nulls 8 In Chapter 6, this is called that 1 — a confidence interval where a is the
that involve both a and (3. In most applications /3 is the parameter of size of the test, c has the same interpretation as a in the previous chap
interest. The parameter a is included to prevent model misspecification ter and is used here to avoid ambiguity with the intercept a in the linear
which would affect the estimated value of /3. regression model.

TESTING THE SLO PE IN A LINEAR REGRESSIO N M ODEL
The asym ptotic variance of the slope depends on the vari This value can be used to construct the test statistic of [3,
ance of the explanatory variable and the variance of the which tests the null H0:/3 = 0:
shocks. These two values can be estim ated using only the
elem ents of the covariance m atrix between the dependent (3 _ 1.52
3.97
and the explanatory variables. It can be shown that: 0.383 “ 0.383
The absolute value of the test statistic is larger than the criti
cal value for a test with a size of 5% (i.e., 1.96 s), and so the
and so the variance of the innovations can be estim ated null is rejected. The estim ated coefficient and its standard
using error can also be com bined to construct a two-sided 95%
confidence interval:
13 — 1.96 X s.e.(/3), (3 + 1.96 X s.e.(/3)

where px y is the sam ple correlation between the dependent
and the explanatory variables. Using the estim ates in the = [1.52 - 1.96 X 0.383, 1.52 + 1.96 X 0.383]
mining sector exam ple in Section 7.2, the estim ated shock
= [0.769, 2.27]
variance is
s2 = X 1770 (1 - .6632) = 991 The confidence interval does not include 0, which reconfirms
the previous finding that the null is rejected when using a
A
The asym ptotic variance of (3 is then: 5% test.
991 Finally, the p-value of the slope is

2.93
337.3
2(1 - <E>(| 3 .7 6 1)) = 2(1 - 0.99996) = 0.0001
/V
The standard error of /3 is

This value is less than 5% and so the p-value provides another
V 2 .9 3 In = V 2 .9 3 / 2 0 = 0.383 method to test the null.
different from 0. Many software packages also report the the contribution of the m arket premium E[Rm*] to the overall
p-value of the t-statistic and a 95% confidence interval for the return on the portfolio because:
estim ated parameter. These can equivalently be used to deter
E[Rp*] = a + [3E[Rm*] (7.20)
mine whether a param eter is statistically different from 0 (i.e.,
the null is rejected if 0 is not in the 95% confidence interval or if The intercept a is a measure of the abnormal return. This coef
the p-value is less than 5%). ficient measures the return that the portfolio generates in excess
of what would be expected given its exposure to the m arket risk
premium (3E[Rm*].
7.5 APPLICATION: CAPM A s an illustration, CAPM is estim ated using 30 years of monthly
data between 1989 and 2018. The portfolios measure the value-
CAPM relates the excess return on a portfolio to the return on weighted return to all firms in a sector. The sectors include
the m arket, so that: banks, technology com panies, and industrials. The market port
Rp - Rf = a + /3(Rmi — Rfi) + e ■
„ (7.19) folio measures the return on the com plete US equity market.
The risk-free rate is the interest rate of a one-month US T-bill.
where Rp is the return on a portfolio, Rm is the m arket return,
and Rfi is the risk-free rate. This can also be written as: Table 7.1 reports the param eter estim ates, standard errors,
t-statistics, p-values and 95% confidence intervals for the esti
Rp* = a + (3Rm* + €j
mated param eters. The CAPM [3s range from 0.569 for the beer
so that Rp* is the excess return on the portfolio and Rm* is the and liquor sector to 1.414 in the com puters sector. All estim ates
excess return on the market. of f3 are statistically different from 0.
Note that both coefficients are im portant to practitioners. The The hypothesis H0:a = 0 can be equivalently tested using any of
slope [3 m easures the sensitivity to changes in the market and the three reported m easures: the t-statistic, the p-value, or the

Table 7.1 The First Column of Numbers Reports the Estimated Value of the Regression Parameter. The Second
Column Reports the Estimated Standard Error. The Third Column Reports the t-Statistic, Which Is the Ratio of the First
Two Columns. The Column Labeled p-value Contains the p-value of the t-Statistic and the Column Labeled Confidence
Interval Contains a 95% Confidence Interval for the Parameter. The Final Column Reports the R2 for the Model
Parameter Estimate Std. Err. t-stat. p-value Conf. Int. R2
Banking a 0.026 0.208 0.124 0.901 [- 0 .3 8 2 , 0.433] 0.584
p 1 .1 0 0 0.049 22.49 0 .0 0 0 [1.004, 1.196]
Beer and Liquor a 0.515 0 .2 2 0 2.339 0.019 [0.083, 0.946] 0.251
p 0.569 0.052 10.98 0 .0 0 0 [0.467, 0.670]
Chemicals a 0.054 0.183 0.298 0.766 [- 0 .3 0 4 , 0.412] 0.618
p 1.038 0.043 24.16 0 .0 0 0 [0.953, 1.122]
Computers a - 0 .1 0 4 0.269 - 0 .3 8 6 0.700 [- 0 .6 3 1 , 0.424] 0.581
p 1.414 0.063 22.34 0 .0 0 0 [1.290, 1.538]
Consumer Goods a 0.245 0.167 1.461 0.144 [- 0 .0 8 4 , 0.573] 0.418
p 0.632 0.039 16.06 0 .0 0 0 [0.555, 0.709]
Electrical Equipment a 0.130 0.177 0.738 0.460 [- 0 .2 1 6 , 0.477] 0.713
p 1.244 0.042 29.93 0 .0 0 0 [1.162, 1.325]
Retail a 0.224 0.164 1.368 0.171 [- 0 .0 9 7 , 0.545] 0.613
p 0.920 0.039 23.89 0 .0 0 0 [0.845, 0.996]
Shipping Containers a 0.062 0.227 0.273 0.785 [- 0 .3 8 4 , 0.508] 0.480
p 0.975 0.053 18.23 0 .0 0 0 [0.870, 1.080]
Transportation a 0.091 0.176 0.517 0.605 [- 0 .2 5 4 , 0.436] 0.593
p 0.948 0.041 22.89 0 .0 0 0 [0.867, 1.029]
Wholesale a - 0 .0 4 3 0.140 - 0 .3 0 8 0.758 [- 0 .3 1 8 , 0.232] 0.667
p 0.885 0.033 26.86 0 .0 0 0 [0.821, 0.950]
confidence interval. The t-statistic is the ratio of the estim ated the total variation in Rp* that can be explained by Rm*. In linear
param eter to its standard error. The standard error of a is esti regressions with a single explanatory variable, the R2 is equal to
mated using the squared sam ple correlation between the dependent and the
explanatory variable:
S2 ( A x + O x )
s.e.(a) = (7.21)
n<% '
R2 = cS T r2 [R£, I C l (7.22)
2
,(*,• - *)
where jxx is the mean of X and ox = n 1 ^ f= 1 (x, — X) It also can be used to understand how the variance of the
model error is related to the variance of the dependent variable,
W hen using the t-statistic, values larger than 1.96 indicate that
because:
the null should be rejected. If using the p-value, values sm aller
than 0.05 indicate rejection of the null. Finally, when using the cr2 = (1 — R2)oy
confidence interval, the null is rejected when is not in the con
This measure is more useful when exam ining models with more
0
fidence interval. The estim ates of a here are generally close to 0

than one explanatory variable, as shown in Chapter 8 . In this
and are not statistically different from 0 in of the 9 portfolios.
application, it is useful to determ ine which sectors are more
8
The final column of Table 7.1 reports the R2 of the m odel. This is closely coupled to the m arket. For exam ple, electrical equip
a measure of the fit of the model and reports the percentage of ment and wholesale both have large R2, which indicates that

B e e r and L iq u o r E le c tric a l E q u ip m e n t
F ia u re 7 .5 Plot of the monthly portfolio returns on the beer and liquor and electrical equipment sectors against
the market (x-axis) between 1989 and 2018. The line is the fitted regression.
the returns of these sectors are tightly coupled with the market where Rpi is the return on the portfolio to be hedged and R^j
return. is the return on the hedging instrument. The /3 is the optimal
hedge ratio and a measures the expected return on the hedged
Figure 7.5 plots the excess market returns and the excess
portfolio (i.e., the return on the original portfolio minus the
returns for electrical equipm ent and beer and liquor. O verall, the
scaled return of the hedging instrument).
patterns among the returns do not look very different. However,
the beer and liquor portfolio had a large negative return when Table 7.2 contains estimated hedge ratios for a set for widely
the m arket had its largest return in this sam ple, which lowers the traded exchange-traded funds (ETFs). These funds are all part
slope and reduces the R2. of the Select Sector SPDR ETF family and target the return of a
diversified industry portfolio. M eanwhile, the hedging instrument
is the SPDR S&P 500 ETF, which tracks the return on the S&P 500.
7.6 APPLICATION: HEDGING The models are estim ated using monthly data from 1999
until 2018. The first column reports the average return on
Estim ating the optimal hedge ratio is a natural application of
the unhedged fund. The next two report the estim ates of the
linear regression. Hedges are commonly used to eliminate
regression coefficients. The t-statistics are reported below each
undesirable risks. For exam ple, a long-short equity hedge fund
coefficient. The estim ate of a has been multiplied by 12 so that
attem pts to profit from differences in returns between firms,
it can be interpreted as the annual hedged portfolio return.
but not on the return to the overall m arket. However, the com
Scaling a has no effect on the t-statistic, and so hypothesis tests
position of the fund's long and short positions may produce an
are similarly unaffected by this type of rescaling.
unwanted correlation between the fund's return and the market.
A hedge can be used to remove the m arket risk. Note that all unhedged returns are positive, and five are statisti
cally different from zero. The hedged returns, however, have
In this case, the slope coefficient is directly interpretable as the
mixed signs and are not statistically different from zero when
hedge ratio that minimizes the variance of the unhedged risk . 9
using a 5% test (i.e., t-statistic of 2). The third column is the esti
The regression model estim ated is
mate of the hedge ratio. The next two columns report the stan
Rpi = « + fiRhi + €i> (7.23) dard deviations of the unhedged (sp) and hedged portfolios (s^).
The quantity sh is the same as the standard error of the regres
sion (s), which measures the standard deviation of the residuals
in the model. It can be seen that the standard deviation of a
9 The regression coefficient fully hedges the risk in the sense that the hedged portfolio is often much lower than that of its unhedged
resulting portfolio is uncorrelated with the hedged factor. When factors
have positive risk premia, this can be expensive, and it is also common counterpart. Furtherm ore, this reduction in standard deviation is
to partially hedge to eliminate some, but not all, factor risk. largest for the models with the highest R2.

Table 7.2 Optimal Hedge Ratios for Select Sector SPDR ETFS Hedged against the Market Using the SPDR S&P 500 ETF.
The First Column, /jl , Reports the Annualized Mean Return of the ETF. The Column Labeled a Reports the Estimate of the
Annualized Return on the Hedged Portfolio. The Third Column Is the Estimate of the Hedge Ratio. The Columns sp and Sf,
Report the Standard Deviation of the ETF Return and of the Hedged Return, Respectively. The Final Column Reports the
R2 from the Model Used to Estimate the Hedge Ratio. The t-Statistics Are Reported Below the Estimated Parameters
A a P sp Sh R2
Consumer 6.235 2.907 0.483 1 2 .0 9.8 0.335
(2.30) (1.30) (10.81)
Energy 10.188 3.907 0.913 21.3 16.8 0.377
(2 . 1 1 ) ( 1 .0 1 ) (11.85)
Financial 7.480 - 0 .7 7 0 1.199 2 2 .0 13.7 0.613
(1.50) (- 0 .2 5 ) (19.15)
Health 8.298 3.129 0.751 13.7 8.5 0.617
(2.67) (1.61) (19.33)
Materials 9.442 1.403 1.168 2 1 .0 12.7 0.634
(1.98) (0.48) (20.07)
Technology 7.713 - 1 .6 8 2 1.365 23.3 1 2 .6 0.706
(1-46) (- 0 .5 8 ) (23.63)
Utilities 7.473 4.481 0.435 14.9 13.5 0.176
(2 . 2 2 ) (1.45) (7.04)
7.7 APPLICATION: PERFORM ANCE type— to fully account for portfolio risk exposures. This is the
topic of the next chapter.
EVALUATION
Table 7.3 contains results from the regression:
Fund managers are frequently evaluated against a style bench
Rfi = a + [3 Rsi + e, (7.24)
mark to control for the typical perform ance of a fund with the
same investm ent style. High-performing fund managers should where Rf is the monthly return on the fund and Rs is the return on
outperform their benchm arks and thus generate positive a. the style benchmark. The performance of four funds, each using
Perform ance analysis can be im plem ented using a regression a different investment style, is exam ined. Two of the funds, the
where the return on the fund is regressed on the return to its Fidelity Contrafund and the JPM organ Small Cap Growth Fund,
benchm ark, a is the most im portant param eter in this applica invest in high-growth equities. The Fidelity fund invests in large-
tion because it can be used to detect superior or inferior perfor cap stocks, while the JPM organ fund focuses on smaller firms. The
m ance; (3 measures the sensitivity to the benchm ark and so can other two funds, the PIM CO Total Return Fund and the PIM CO
be used to exam ine whether the benchm ark is appropriate for High Yield Fund, invest in bonds. The total return fund targets the
the fund's return. A good choice of a benchm ark should have a total return to investing in US bonds, while the high-yield fund
(3 near one and a large R2 because (3 = 1 indicates that the risks focuses on generating returns from non-investment grade securi
captured by the benchm ark are similar to the fund's and large R2 ties. Each of the four funds has a different style, and so the bench
indicates that the benchm ark is highly correlated with the fund's mark is adjusted to match a given fund's investment objective.
perform ance. These regressions use monthly returns, and the reported excess
performance is annualized by multiplying by 12. All monthly
The choice of benchm ark plays an im portant role when assess
returns available for each fund, from the fund's launch until the
ing w hether a fund produces excess returns. If a fund invests
end of 2018, are used in the analysis. The number of returns used
in risks that are not captured by the benchm ark, then the fund
to estimate the parameters in each model is reported in the table.
might appear to produce a (because the benchm ark cannot
account for all risk exposures). In practice, it is common to use Estim ates of a are greater than zero for all four funds, and three
multiple factors— between 3 and 11— depending on the fund of these are statistically different from zero. Fidelity's Contrafund

Table 7.3 The Left Panel Reports a, p, and the R1 2*from a Performance Evaluation Model Using the Style
Benchmark Listed in Italics. The Right Column Uses a Market Model where the Benchmark Is the Return on the
S&P 500. The t-Statistics Are Reported Below Estimated Parameters. The Final Column Reports the Number of
Observations Available to Estimate Parameters
Style Market
a P R2 a P R2 n
Fidelity Contrafund 4.072 0.861 0.836 5.639 0.955 0.803 619
L a rg e G ro w th (4.47) (56.04) (5.68) (50.16)
JPMorgan Small Cap 1.075 0.933 0.912 3.012 1.094 0.602 329
Growth A
Sm all G ro w th (0.94) (58.05) (1.24) (22.25)
PIMCO Total Return Fund 1.172 1.055 0.891 6.565 0.050 0.031 379
In term ed ia te-T e rm B o n d (4.42) (55.41) (8.94) (3.47)
PIMCO High Yield Fund 1.713 0.919 0.902 4.864 0.326 0.409 312
H igh Y ie ld B o n d (3.73) (53.29) (4.39) (14.65)
has outperform ed its benchm ark by over 4% per year during the estim ators in linear regression models have sim ple closed-forms
50 years in the sam ple . 1 0 However, the Contrafund also has the that depend on only the first two moments of Y and X. The
sm allest /3 estim ate and R2. This suggests that the benchm ark slope param eter is closely related to the correlation between
portfolio used might not be appropriate for the strategy used the dependent and explanatory variables, and the slope is zero
by this fund. For the other three funds, the R2 values show that if and only if the two variables are uncorrelated.
the benchm arks are highly correlated with the fund returns. The
/V D eriving the estim ators for these only requires one trivial
/3 are all close to 1, which also indicates that the funds' returns
assum ption— that the exp lan ato ry variab le has som e v a ria
closely com ove with their benchm arks.
tion. Interpreting the results of an estim ated regression
The right column repeats the exercise using the return on the requires four additional assum ptions. Th e most im portant of
S&P 500 benchmark. The market appears to be nearly as good th ese, and the most difficult to verify, is the conditional mean
as the large growth style return in adjusting the return on Fidelity's assum ption:
Contrafund, and the two R2 values are similar. The estimated a
E [ £ ,|x j = 0
is somewhat larger and is still significantly larger than 0. It has
a lower correlation with the small-cap fund because the market This assumption rules out some em pirically relevant applications
portfolio is, by construction, a large-cap portfolio. The conclusion (e.g ., when the sam ple has selection bias or if X and Y are simul
is, however, the same, and the a is not statistically different from 0 . taneously determ ined). The other assum ptions (i.e., that the
data are generated by iid random variables, have no outliers,
The R2 on the bond funds are both low, and near 0 for the PIM CO
and are hom oscedastic) are sim pler to verify using either graphi
Total Return Fund. The poor fit of the market model indicates that
cal analysis or formal tests.
the return on the equity market is not a good benchmark to use
when assessing the performance of these bond funds. W hen these assum ptions are satisfied , then [3 can be inter
preted as the change in Y per unit change in X . The O LS
estim ators of a and f3 also satisfy a C LT and so have a nor
7.8 SUMMARY
mal distribution in large sam ples. The C LT allow s hypoth
esis testing using t-tests and for confidence intervals to be
This chapter introduces linear regression through a focus on
co n stru cted . The t-statistic, a t-test for the null hypothesis
models that contain a single explanatory variable. Param eter
that the param eter is zero, is freq u en tly reported to indicate
w hether a param eter is statistically significant. Finally, the
n
These funds have traded for more than 25 years. The estimates of
1
chap ter presents three standard applications of regression:
a are not representative because this sample suffers from survivorship
bias. The funds have all survived for an extended period which is only facto r sensitivity estim ation, hedging, and fund perform ance
likely if a fund meets or exceeds its performance target. evaluation.

QUESTIONS

7.1 W hat are the three requirem ents of linear regression? 7.4 How is a t-test related to the t-statistic of a param eter?
7.2 You suspect that the CAPM held on all days except 7.5 The R 2 from regressing Y on X is the same as the R2 of
those with a Federal Open M arkets Com m ittee (FO M C) regressing X on Y. True or false?
announcem ent, and on these days the f3 is different. How
7.6 W hat is R2 if /3 = 0?
can a dummy be used to capture this effect? W hat could
you do if you suspected that both a and (3 are different on
7.7 The return on an optimally hedged portfolio is independent
of the return of the hedging instrument. True or false?
FO M C days?
7.8 W hat are the consequences of using a poor benchm ark to
7.3 W hat does a confidence interval for a regression param
evaluate a fund m anager?
eter measure?
Practice Questions
7.9 Find the O LS estim ators for the following data: 7.11 In a CAPM that regresses W ells Fargo on the m arket, the
coefficients on monthly data are a = 0.1 and [3 = 1. 2.
X y W hat is the expected excess return on W ells Fargo when
0 - 1 .4 6 the excess return on the market is 3.5% ?
1 0.35 7.12 You fit a CAPM that regresses the excess return of
Coca-Cola on the excess m arket return using 20 years
2 6.46
of monthly data. You estim ate a = 0.71, (3 = 1.37,
3 4.09 s 2 = 20.38, <xj = 19.82 and jl x = 0.71.
4 7.34 a. W hat are the standard errors of a and /3?
A
5 6.18 b. W hat are the t-statistics for a and (3?

c. W hat is the 99% confidence interval for /§?
6 14.97
7.13 Suppose you estim ated a model that regressed the vol
7 14.28
ume of put and call options traded on the S&P 500 on a
8 2 0 .2 0
measure of overnight news, VOL-, = a + (3NEWSj + e,-.
9 21.24 Your statistical package returned a 90% confidence inter
val for (3 of [0.32,1.89].
7.10 In running a regression of the returns of Stock XYZ
a. Is /3 statistically significant when using a 5% test?
against the returns on the m arket, the standard d evia
b. W hat is the p-value of the t-statistic of (3?
tion for the returns of Stock X YZ is 20% , and that of the
m arket returns is 15%. If the estim ated beta is found to 7.14 If (3 = 1.05 and its 95% confidence interval is [0.43,1.67],
be 0 .7 5 : what is the t-statistic of [3? W hat is the p-value of the
t-statistic?
a. W hat is the correlation between the returns of Stock
X YZ and those of the market? 7.15 How does adding leverage (borrowing at a risk-free rate,
b. If the market falls by 2%, what is the expected return which is not random) to a portfolio change the optimal
on Stock XYZ? hedge ratio? For exam ple, if the portfolio returns were
c. W hat is the maximum possible value of beta given that doubled so that R p = 2Rp, what is the optimal hedge
the standard deviation of the returns of Stock X YZ is ratio for the Rp? Hint: Use the formula for the O LS estim a
2 0 % and those of the m arket is 15%? tor of (3.
116 Financial Risk Manager Exam Part I: Quantitative Analysis

ANSW ERS

7.1 The three requirem ents are that the model is linear in the 7.5 True. The R2 in a regression with a single explanatory
unknown coefficients, the error is additive, and there are no variable is the squared sam plejcprrelation between X
missing values. and Y.
7.2 The model that allowed differences in the slope would 7.6 The only tim e /3 = 0 is when Cov(X,V) = 0 and so the
A
be Rj = a + p R mii + ylFOMCRm,i + €i where If o mc is two variables are uncorrelated. The R2 is the squared cor
1 on FO M C days and 0 otherw ise. If y is not zero, relation and so must be 0 .
then the slope is different on FO M C days. This can be 7.7 False. The optim ally hedged portfolio is uncorrelated
extended to both param eters by estim ating the model
with the return on the hedge. It is not necessarily inde
Ot y IFO M C "f P R m ,i y ^ F O M C ^ m .i "f pendent. It would be independent if the returns were
7.3 The 1 — a confidence interval contains the set of null jointly normally distributed.
hypotheses that could not be rejected using a test size 7.8 If the benchm ark is poor, so that the evaluation model is
of a.
m isspecified, then the slope /3 will be m ism easured. As a
7.4 The t-stat is a t-test of the null H0:f3 = 0 against the alterna result, some of the com pensation for system ic risk (which
tive H-\:(3 9^ 0. is captured by (3) may be attributed to skill a.
Solved Problems
7.9 Doing the basic needed calculations:
X y x —x y-y 1 .x-xf (x -x )(y -y ) ( y - ( a + /be) ) 2
0 - 1 .4 6 - 4 .5 0 - 10 .8 6 20.25 48.85 0.25
1 0.35 - 3 .5 0 - 9 .0 4 12.25 31.66 0.04
2 6.46 - 2 .5 0 -2 .9 3 6.25 7.33 11.41
3 4.09 - 1 .5 0 - 5 .3 0 2.25 7.96 2.31
4 7.34 - 0 .5 0 -2 .0 5 0.25 1.03 0.62
5 6.18 0.50 - 3 .2 2 0.25 -1.61 20.06
6 14.97 1.50 5.57 2.25 8.36 3.19
7 14.28 2.50 4.89 6.25 1 2 .2 2 2.03
8 2 0 .2 0 3.50 10.80 12.25 37.82 3.88
9 21.54 4.50 12.14 20.25 54.64 0.61
A verage 4.50 9.40 8.25 20.83
SUM 82.50 208.25 44.41

/V X ; = 1 (X; ~ X)(Y; ~ Y) 20.83 confidence interval is then [1.37 — 2.58 X 0.065,1.37

2.52 + 2.58 X 0.065] = [1.20,1.54].
2/=i (x, - x)2 8.25
7.13 a. Yes. The null hypothesis value of H0:(3 = 0 is not in the
a = Y - f3X = 9.40 - 2.52 * 4 .5 0 = - 1 .9 6
confidence interval, and so the null is rejected using a
Continuing to estim ate the variance: test size of 1 0 %.
n
1 1 b. The range of the confidence interval is 2c X se{(3),
= 1 0 - 2 19.87 = 5.55
.2 _
= n - 2 (t where c is the critical value used to construct

i= 1
PxyO-2 ^ the 90% confidence interval (5% in each tail,

PxyVy
7.10 a. (3 = 0.75 = - ^ - - p xy = 56.25%
A
so 1.645). The estim ate of [3 is the m idpoint

0 \
of the confidence interval so 1.105. Because
b. (3 * 2% = 0.75 * 2% = 1.5% /x — 1.645se(/3) = 0.32 and jx + 1.645se(/3) = 1.89,
c. The maximum correlation is 1, therefore: these can be solved for se{(3) = .477. The t-stat is
1 * <T,
A
P
0.20 then = 2.32. Finally, the two-sided p-value is
Pm ax = 1.33 se{(3)
o\ 0.15
2(1 — <E»( 1 1 1 )) = 2.06% , where <I>( •) is the standard
7.11 The expected return is 0.1 + 1.2 X 3.5% = 14.2%. This
normal C D F function. This p-value confirms the rejec
value is the fitted value from the linear regression when
tion in the previous step because it is less than 1 0 %.
the m arket return is 3.5% .
1.67 - 0.43
7.12 a. The standard error for (3 is 7.14 The standard error is — ------—— = 0.316.
2 X 1.96
20.38 The t-stat is then 3.31. The two-sided p-value is
= 0.065.
n X & x 240 X 19.82 2(1 - 0 ( |t |) ) = 0.09% .
The standard error for a is C ov(Rp/ Rh)
7.15 The O LS [3 estim ator is — ^ — . Scaling Rp
20.38 X (0.712 + 19.82) Var[Rh)
0.295.
240 X 19.82 by a leverage v, the optimal hedge would be
Cov(vRp, Rh) CoviRp, Rh)
1.37 — \r-u D \
— = v xr u D \ = VP ' where P ls the led9e
The t-stat for (3 is 20.9. Var(Rh) Var(Rh)
0.065
ratio from the unlevered portfolio. Levering up increases
0.71
The t-stat for a is 2.40. exposure to risk that can be hedged and to the hedge
0.295
ratio must account for this.
c. A 99% two-sided confidence interval uses the
critical value from a one-sided test with a size
of 1% (0.5% in each tail), which is 2.57. The

Regression
with Multiple
Explanatory
Variables
Learning Objectives
Distinguish between the relative assumptions of single and Construct, apply, and interpret joint hypothesis tests
multiple regression. and confidence intervals for multiple coefficients in
a regression.
Interpret regression coefficients in a multiple regression.
Interpret goodness of fit m easures for single and multiple

regressions, including R 2 and adjusted R2.
119
Linear regression with a single explanatory variable provides key
insights into O LS estim ators and their properties. In practice, REGRESSIO N WITH INDISTINCT
however, models typically use multiple variables where it is pos VARIABLES
sible to isolate the unique contribution of each explanatory vari
If some explanatory variables are functions of the same ran
able. A model built with multiple variables can also distinguish dom variable (e.g ., if X 2 = X 2), then:
the effect of a novel predictor from the set of explanatory vari
Y, = a + f3-\Xj1 + /32 X 2 + Sj
ables known to be related to the dependent variable.
In this case, it is not possible to change X-| while holding
This chapter introduces the k-variate regression model, which
the other variables constant. The interpretation of the coef
enables the coefficients to measure the distinct contribution of ficients in models with this structure depends on the value
each explanatory variable to the variation in the dependent vari of X-| because a small change of AX-| in X-| changes Y b y
able. This structure allows the estimated parameters to be inter (/3-1 + 2/32 X 1 )A X 1. This effect captures the direct, linear effect of
preted while holding the value of other included variables constant. a change in X-| through /3i and its nonlinear effect through f32.
The first section of this chapter shows how the structure of the
O LS estim ator leads to this interpretation in a model with two When all explanatory variables are distinct (i.e., no variable is an
correlated explanatory variables. The coefficients in this model exact function of the others), then the coefficients are interpreted
can be estim ated using three single-variable regressions. This as holding all oth er values fixed. For exam ple, /3i is the effect of a
stepw ise estimation procedure provides an intuitive understand small increase in X-, holding all other variables constant.
ing of the effect measured by a regression coefficient. W hile the param eter estim ators of a and the /3s have closed
The R2, introduced in the previous chapter, is a simple measure form s, they can be difficult to express without matrices and
that describes the amount of variation explained by a model. It linear algebra. However, it is possible to understand how param
is also the basis of the F-test, which extends the t-test to allow eters are estim ated in a model with two explanatory variables.
multiple hypotheses to be tested sim ultaneously. The F-test Suppose the model is
accounts for the uncertainty in the coefficient estim ators, includ Y, = a + p iX u + /32 X 2i + Sj
ing the correlation between them .
The O LS estim ator for /3-| can be com puted using three single
The key ideas in this chapter are illustrated through an exten variable regressions. The first regresses X-| on X 2 and retains the
sion of C A P M , which incorporates additional risk factors that residuals from this regression:
system atically affect the cross section of asset returns. These
Xu = Xu - S0 - S , X 2if (8 . 2 )
multi-factor models are commonly used in risk m anagem ent to
A A
determ ine sensitivities to econom ically im portant investm ent where S0 and S-| are O LS param eter estim ators and X Vl is the
characteristics and to risk-adjust returns when assessing fund residual of X-|j.
m anager perform ance.
The second regression does the same for Y:
Y , = Y ; - y 0 - y iX 2/, (8.3)
8.1 MULTIPLE EXPLANATORY where yo and 7 i are O LS param eter estim ators, and Y Vl is the
VARIABLES residual of Y-n
The final step regresses the residual of Y (i.e., Y) on the residual

Linear regression can be directly extended to models with more
of X-| (i.e.,X-|):
than one explanatory variab le . 1 The general structure of a
regression is Y j = a * ; + e/ (8-4)
Yj = a + f r X Vl + fi2X 2j + . . . + (3kX kj + s ii (8.1) The O LS estim ate j8 1 is identical to the estim ate com puted by
fitting the full model using Excel or any other statistical software
where there are k explanatory variables and a single constant a.
p ackag e . 2
The first two regressions have a single purpose: to remove the

direct effect of X 2 from Y and X \. They do this by decom posing
1 Historically, models with multiple explanatory variables are called mul A _____ "V »
■%*
tiple regressions. This distinction is not important and so models with This model does not include a constant because Y and X have mean
any number of explanatory variables are referred to as linear regressions zero by construction. If a constant is included, its estimated value is
or simply regressions. exactly 0 and the estimate of /3i is unaffected.

each variable into two com ponents: one that is perfectly corre Multiple linear regression assumes that the explanatory vari
lated with X 2 (i.e., the fitted value) and one that is uncorrelated ables are not perfectly linearly dependent (i.e., each explana
with X 2 (i.e., the residual). As a result, the two residuals (i.e., Y tory variable must have some variation that cannot be perfectly
and X 1 ) are uncorrelated with X 2 by construction. explained by the other variables in the model).
The final regression estim ates the linear relationship (i.e., fa ) The distinct variation in each regressor is exploited to estim ate
between the com ponents of Y and X-] that are uncorrelated with the model param eters. If this assumption is violated, then the
(and so cannot be explained by) X 2. variables are p e rfe ctly collinear. This assumption is simple to
verify in practice, and most statistical software packages pro
Finally, the O LS estimate of fa can be computed in the same man
duce a warning or an error when the explanatory variables are
ner by reversing the roles of X-| and X 2 (i.e., so that fa measures
perfectly collinear.
the effect of the component in X 2 that is uncorrelated with X-|).
The remaining assumptions require simple m odifications to
This stepw ise estimation can be used to estim ate models with
account for the k-explanatory variables.
any number of regressors. In the k-variable model:
• All variables must have positive variances so that cry. > 0 for
Y , = a + f a X v + f a X 2 i + ... + f a X k j + s„ (8.5)
j = 1 , 2 , . . . , k.
the O LS estim ate of fa is com puted by first regressing each of
• The error is assumed to have mean zero conditional on the
X-| and Y on a constant and the remaining k — 1 explanatory
explanatory variables (E[e(-1X i X 2/, . . . , X fc/] = 0).
variables. The residuals from these two regressions are mean
• The random variables (X 1(- ,X 2/, . . . , X ki) are assumed to
zero and uncorrelated with the remaining k — 1 explanatory
variables. The O LS estim ator of fa is then estim ated by regress be iid.
ing the residuals of Y on the residuals of X-|. • The assumption of no outliers is extended to hold for each
explanatory variable so that E[X yj] < 00 for j = 1 , 2 . . . , k.
Additional Assumptions • The constant variance assumption is similarly extended to
O1 o
hold for all explanatory variables (E[ef |X-|/, X 2„ . . . ,X/c;] = er ).
Extending the model to multiple regressors requires one addi
tional assum ption, along with some m odifications, to the five The interpretation of each of the modified assum ptions is
assumptions in Section 7.3. unchanged from the single explanatory variable model.
REGRESSIO N STEP-BY-STEP
The A ppendix contains a table of the annual excess returns The first step is to estim ate the coefficients of the model that
on the shipping container industry portfolio (i.e., Rp*), the rem oves the effect of the market from the shipping container
excess return on the m arket (i.e., Rm*), and the return on a portfolio:
value portfolio (i.e., Rv). R p* = 8 0 + 8 ^ R m* + V p
It also contains the squared deviations from the mean of each
variable (labeled Ep, E^, and E^, for squared error) and the The slope (i.e., S-|) depends on the covariance between the
cross-product of the deviations (EpE m and E vE m). The sums industry portfolio and the market portfolio divided by the
variance of the m arket portfolio. This is estim ated by using
of these columns, when divided by the sam ple size, are the
estim ates of the variances (squares) and covariances (cross- the sum of the cross-product between the two and the sum
products) of these series. of the squared m arket return deviations. (Note that it is
unnecessary to divide each by n — 1 to produce a covariance
If this data set were to be entered into Excel and a regres or variance estim ate, because this cancels in the ratio.). The
sion run using the Analysis ToolPak add-in, the fitted model
/V
intercept depends on the estim ate 5-, and the means of the
would be two series:
4.06 0.72 Rm* 0.101 Rv „ 4639
0.394, 0.696
(4.16) + (0.22) + (0.29) + 6
6665
where the standard error is in parentheses below each
The intercept depends on the estim ate <5i and the means of
coefficient.
the two series:
Here the stepw ise procedure is applied to verify the estim ate
of the coefficient on the value factor. §0 = 8.58 - 0.696 X 6.03 = 4.39
(C ontinued)
Chapter 8 Regression with Multiple Explanatory Variables ■ 121

(C ontinued)
These same calculations are used to estim ate the regression The values of this residual series are in the second table in
of the value factor return on the excess return of the market: the appendix. The first two columns sum to exactly 0, which
is a consequence of fitting the first step model. The final
R v = To + 7i^m* + V *
three columns contain the squares and cross-products of the
The param eter estim ates are residual series. These are the key inputs into the coefficient
1464 estim ate for the value factor, which depends on the covari
7 ! = -777- = - 0 .2 2 0 , y 0 = 1.94 - (- 0 .2 2 0 ) X 6.03 = 3.27 ance between the two residuals divided by the variance of
6665
the value factor residual. A s before, the n — 1 term can be
Finally, the effect of the market is removed using the esti om itted in each because it cancels, and the estim ate of the
mated coefficients to produce the following residuals: regression coefficient is
Rp*i = Rp*j — 4.39 — 0.696 Rm*j 353

0.101
3483
R vi R v j — 3.27 — (—0 . 2 2 0 ) R m *j
8.2 APPLICATION: MULTI-FACTOR where Rm* is the excess return on the m arket, Ris m easures the
excess return of value firms over growth firms, and Rs measures
RISK M ODELS the excess return of small-cap firms over large-cap firm s . 6 The
returns of all three factors and the industry portfolios are avail
M arket variation is an im portant determ inant of the variation
able in Ken French's data library . 7
in asset returns. However, other factors that capture specific
characteristics are also useful in explaining differences in returns The t-statistic is reported below each coefficient in parentheses.
across firms or industries. The coefficients on the m arket return are all similar in magnitude
to those in Table 8.1. They are all positive, close to one, and
The Fam a-French three-factor m odel 3 is a leading exam ple of
have large t-statistics indicating that the returns on most indus
a m ulti-factor approach. This model exp and s upon C A PM by
tries com ove closely with the overall m arket return.
including tw o additional facto rs: the size facto r (which ca p
tures the propensity of sm all-cap firm s to generate higher The coefficients on the size and value factors, however, have a
returns than large-cap firm s) and the value facto r (which m ea different structure. These coefficients are all sm aller in magni
sures the additional return that valu e 4 firm s earn above tude and range from —0.6 to 0.8. Positive estim ates indicate that
growth firm s). the industry has a positive exposure to the factor. For exam ple,
the wholesale industry has a coefficient of 0.26 on the size fac
Both the size and value factors are measured using the returns
tor, indicating that a 1 % factor return increases the return on this
of long-short portfolios that buy firms with the desirable feature
portfolio by 0.26% . This suggests one of two possibilities: either
(i.e., small-cap or value firms) and short sell those with the unde
the firms in this industry are relatively small firms, or their profits
sirable feature (i.e., large-cap or grow th 5 firms).
depend crucially on the same factors as those affecting the per
Table 8.1 extends the results for CAPM in Table 7.1 to the Fama- form ance of small firms. Note that these two explanations for
French three-factor model. The table contains param eter esti the positive exposure are not exclusive and both are likely true.
mates for the model: The com puter industry portfolio, on the other hand, has a large
R p * i = a + AnRm*; + P s R s i + P v R v i + £ i,
negative coefficient on the value factor. This coefficient suggests
that the typical firm in this industry is a growth firm.
Figure 8.1 illustrates the difference between estim ates of the

3 Fama, E. F., & French, K. R. (1993). Common Risk Factors in the Returns factor exposure in a single variable model (which only includes
on Stocks and Bonds. Journal of Financial Economics, 33(1), 3-56. the factor return) and in models that include additional explana
4 A firm is classified as a value stock if its fundamentals are strong rela tory variables.
tive to its price. The value ratio is computed by comparing the book
value of assets to the market value of assets (also called the book-to-
market). Firms with high book-to-market ratios represent a good value
because investing in these firms buys more (tangible) assets than an 6 The size and value factor returns are not measured in excess of the
equal investment in non-value firms. risk-free rate because their construction as long-short portfolios elimi
nates the risk-free rate from the portfolio return.
5 Firms with low book-to-market ratios are called growth firms because
the high price of these firms forecasts high growth in their fundamentals 7 Ken French's data library can be accessed at http://mba.tuck
(e.g., revenues, profits, or the book value of assets). .dartmouth.edu/pages/faculty/ken.french/data_library.html.

Table 8.1 Regression Results from Models of Industry Portfolio Returns on Three Factors: The Market, the Size
Factor, and the Value Factor. The t-statistics Are Reported in Parentheses Below Each Coefficient. The Right-Most
Column Reports the Model R2
a Pm Ps A, R2
Banking - 0 .1 7 2 1.228 - 0 .1 5 6 0.838 0.764
(- 1 .0 9 ) (32.26) (- 3 .0 3 ) (15.15)
Beer and Liquor 0.517 0.628 - 0 .3 9 4 - 0 .0 3 3 0.314
(2.44) (12.28) (- 5 .6 7 ) (- 0 .4 4 )
Shipping Containers - 0 .0 0 3 1 . 0 1 1 - 0 .0 1 2 0.276 0.498

( - 0 .0 1 ) (18.63) (- 0 .1 6 ) (3.51)
Chemicals - 0 .0 5 3 1.098 - 0 .0 2 8 0.458 0.677

(- 0 .3 1 ) (26.90) (- 0 .5 1 ) (7.72)
Electrical Equipment 0.103 1.263 - 0 .0 3 3 0.115 0.717

(0.59) (29.61) (- 0 .5 7 ) (1.85)
Computers 0.055 1.296 0.224 - 0 .6 6 6 0.659

(0.23) (21.99) (2.80) (- 7 .7 8 )
Consumer Goods 0.215 0.675 - 0 .1 7 4 0.117 0.446

(1.31) (17.03) (- 3 .2 4 ) (2.03)
Retail 0 .2 2 1 0.927 - 0 .0 3 6 0 .0 10 0.614

(1.35) (23.33) ( - 0 .6 6 ) (0.17)
Transportation - 0 .0 1 2 0.995 0.045 0.443 0.651

(- 0 .0 7 ) (25.14) (0.84) (7.71)
Wholesale - 0 .1 0 4 0.877 0.260 0.277 0.715

(- 0 .8 0 ) (27.83) (6.09) (6.05)
The top panels contain return plots of the com puter industry
CONTROLS portfolio against the return on the size factor (left) or the return
on the value factor (right). The solid line shows the fitted regres
Controls are explanatory variables that are known to have
sion of the excess industry portfolio return on the factor return.
a clear relationship with the dependent variable. It is fre
quently the case that the param eter values on the controls For exam ple, the model fit in the top left panel is
are not directly of interest.
Rp*/ Q
?“L P s ^ s i
It is common to control for the effects of explanatory

The size factor has a strong positive relationship (0.762) to the
variables that are known to have a strong relationship
to the dependent variable when exam ining the explana com puter industry return, and the value factor model indicates a
tory power of a new variable. This practice ensures that strong negative relationship (—1.04). These estim ates differ from
the novel variable is finding a feature of the data that is those included in Table 7.1 because two variables (i.e., the mar
distinct from the effects captured by a standard set of ket and the other factor) are excluded from the model.
explanatory variables.
The middle panel shows the effect of controlling for the market
Leaving out these controls is known as om itted vari
return. This is done by plotting:
able bias. For exam ple, when researchers develop a new
measure that explains the cross section of asset prices, • The residuals calculated by regressing the sector portfolio
it is common to include the m arket, size, and value fac
returns on the m arket (i.e., the y-axis), and
tor returns when assessing w hether the new measure
improves the model. • The residuals calculated by regressing the factor being exam
ined on the m arket (i.e., the x-axis).

C o m p u te rs on S ize C o m p u te rs on V alu e
Computers on Size, Given Market Computers on Value, Given Market
—40 -----------»---------- ,-----------,-----------1- —40 ^----- 1

---------------- '---------------- <-
—20 -1 0 0 10 20 -1 0 0 10
Size I Market Value I Market
Computers on Size, Given Market and Value Computers on Value, Given Market and Size
CD
-I 20 6S = 0.224
0 *. • • *••••*.
• •
•• • •*
••• ; • • •
0 0
•• •• ••• • •
CO
0
- 20 -
E
o
O
-4 0
-20 10 0 10 20
Size I Market, Value Value I Market, Size
Fiaure 8.1 Illustration of the differences between estimates of the direct effect and the effect given other
relevant variables. Each column contains plots that depict the relationship between the computer industry returns
and either the size factor (left) or the value (right) factor. The top panel in each column plots the portfolio returns
(y-axis) against the factor returns and shows the regression line computed by regressing the portfolio return on
only the factor return. The middle panel shows the relationship between the portfolio return and the factor return
when controlling for the market return. The values plotted are the residuals from regressing the portfolio return
on the market return (y-axis) and the factor return on the market return (x-axis). The bottom panel shows the
relationship when controlling for both the market and the other factor. The values shown are also residuals where
the effect of the market and the other factor are eliminated.
Controlling for the market removes its direct effect from both the of the factor return (i.e., R s j or R vi ) on the m arket return. For
industry portfolio return and the factor return. The slope /3S is then exam ple:
estimated in a regression that includes either the size or the value
R si = R si - 8 ~ y R m *h
factor returns and the market return. For exam ple, the coefficient
in the center-left panel is estimated using the regression: where 8 and y are estim ated using O LS.
Rp*/ Q: + "h /3sRsj + £/ These residuals have "p urged" the effect of the m arket (i.e., the
correlation com puted using the residuals is conditional on the
Controlling for the m arket affects both coefficients: The size
m arket). This conditional correlation of - 2 2 .8 % is reported in
effect is halved, while the value effect is reduced by 30% . These
the bottom panel of Table 8.2.
changes are not surprising given the correlation between the
size/value returns and the m arket return, which are reported in
Table 8.2.
8.3 MEASURING M ODEL FIT
The final row shows the effect of controlling for both the returns
of the m arket and the o th er factor. Only one model is estim ated Minimizing the squared residuals decom poses the total varia
to produce the slopes in both panels: tion of the dependent data into two distinct com ponents: one
that captures the unexplained variation (due to the error in the
Rp*,- a + /3mRm*; + /3 s R S j + j3 v R vi + £;
model) and another that measures explained variation (which
The data points plotted are the residuals from regressions on depends on both the estim ated param eters and the variation in
the m arket return and the other factor return, and so control the explanatory variables).
for the effects of variables that are not shown. For exam ple, the
The total variation in the dependent variable is called the total
points in the bottom-left panel depict the relationship between
sum of squares (TSS), which is defined as the sum of the squared
the com puter industry portfolio returns and the size factor
deviations of Y, around the sam ple mean:
returns, controlling for the m arket and the value factor.
Adding the additional factor further reduces the magnitude TSS= 2 (Y , - ? ) 2 (8 . 6 )

of both coefficients. The reduction occurs because the factor /= 1
Each dependent variable is decom posed into two com ponents:
returns are correlated, conditional on the m arket, and so the
the fitted value (Y,) and the estim ated residual (a,), so that:
slopes reported in the center panels do not capture the distinct
contribution of each factor. Y, = a + A X 1( + . . . + p kX kj + Ej = 9; + £/ (8.7)
The conditional correlation coefficient is the correlation between Transforming the data and the fitted values by subtracting the
the residuals R si and R v il which are calculated from a regression sam ple mean from both sides:
Y,-Y= Y ; - Y + Ej
Table 8.2 The Top Panel Contains the Estimated ensures that both sides of the equation are mean zero. Finally,
Correlation Matrix of the Factor Returns. The Bottom squaring and summing these values over all / from 1 to n yields
Panel Contains the Estimated Conditional Correlation
2 ( Y - Y f = 2 ( Y , - Y + ej)2
Matrix for the Returns to the Size Portfolio and the i= 1 ;=1
Value Portfolio Controlling for the Market Return
E ( Y , - Y )2 = 2 ( Y , - Y ) 2 + 2 "= ,® / (8 . 8 )
Correlation /= 1 /= 1
Note that the second line is missing the cross term between
Market Size Value /V _
___
__
[Y; — Y) and s r This cross term is 0 because the sam ple correla-
Market 1 0.216 - 0 .1 7 6 tion between the fitted values Y, and the estim ated residuals
Size 0.216 1 - 0 .2 5 6 Ej is 0 .8
Value - 0 .1 7 6 - 0 .2 5 6 1
Conditional Correlation 8 The first-order conditions, or normal equations, of the OLS

minimization problem are ^ . =1Xjje,- = 0 for j = 1, 2, . . . k, and
Size Value
2 " =1£(- = 0. These imply that the sample correlation is exactly 0,
Size 1 - 0 .2 2 8 i.e., Cov[Xj,s] = Corr[X,,£] = 0 for all j. The correlation is also 0 with
Yj because the fitted value is a linear function of the X's, namely
Value - 0 .2 2 8
Yj - a + îX1(- + . . . + pkXkl.
1

The three com ponents of this decom position: Accordingly
T SS = E S S + R SS (8.9) ESS 123.988

86.7%
TSS 142.977
are TSS (the total sum of squares), E S S (the explained sum of
squares), and RSS (the residual sum of squares). It can then be deduced that
Normalizing this relationship by dividing both sides by TSS RSS

— - = 1 - 0.867 = 13.3%
yields TSS
and
ESS RSS
TSS + TSS R SS = 0.133 X 142.977 = 19.02
This leads to the definition of R2 in a model with any number of R2 m easures the percentage of the variation in the data that can
included explanatory variables: be explained by the m odel. This is equivalently measured by the
ratio of the explained variation to the total variation (which is
( 8 . 10) equal to one minus the ratio of the residual variation to the total
variation). Because O LS estim ates param eters by finding the val
As an exam ple, suppose the results of a model fit are as follows: ues that minimize the RSS, the O LS estim ator also m aximizes R2.
A s explained in C hapter 7, R2 is defined as the squared cor

Observation relation between the dependent variable and the explanatory
Number Y y variable in a model with a single explanatory variable. In such a
<M
(y y)2
A
S
>>
I
m odel, the fitted value is a linear transform ation of the explana-

1 1.922 2.821 0.850 3.316
tory variable (i.e., Y, = a + /3X,) and so the correlation between
2 - 2 .3 6 7 - 2 .1 3 7 11.337 9.841 Yj and X; is the same as the correlation between Y; and Y; (up to
/V
3 - 4 .8 6 8 -4 .5 0 1 34.433 30.261 the sign, which changes if [3 is negative).
4 5.017 4.948 16.136 15.587 When a model has multiple explanatory variables, R2 is a com
5 2.272 2.450 1.618 2.103 plicated function of the correlations among the explanatory
variables and those between the explanatory variables and the
6 - 0 .5 9 6 0.062 2.547 0.880
dependent variable. However, R2 in a model with multiple regres-
7 2.329 2.721 1.766 2.962 sors is the squared correlation between Y, and the fitted value Yj,
8 4.295 3.894 10.857 8.375 R2 = C o rr 2 [Y ,Y ]
9 - 0 .8 7 5 - 1 .2 3 8 3.516 5.009 This interpretation of R2 provides another interpretation of the
1 0 1.259 1.538 0.067 0.289 O LS estim ator: The regression coefficients (3^,. . . , (3k are cho
1 1 0.442 0.425 0.311 0.331 sen to produce the linear combination of X 1f . . . , X k that m axi
mizes the correlation with Y.
1 2 - 1 .1 0 5 - 1 .2 8 3 4.431 5.212
A model that is com pletely incapable of explaining the observed
13 3.524 1.116 6.371 0.013
data has an R2 of 0 (because all variation is in the residuals). A
14 0.971 1.932 0 .0 0 1 0.869 model that perfectly explains the data (so that all residuals are 0 )
15 2.377 2.259 1.896 1.585 has an R2 of 1. All other models must produce values that fall
between these two bounds so that R2 is never negative and
16 5.445 6.163 19.758 26.657
always less than 1 .
17 -0 .3 3 1 0.044 1.772 0.914
W hile R2 is a useful method to assess model fit, it has three
18 - 3 .6 3 8 - 2 .0 1 0 21.511 9.060
im portant limitations.
19 2.949 0.375 3.799 0.391
1. Adding a new variable to the model always increases the R2.
2 0 0.978 0.421 0 . 0 0 0 0.335 For exam ple, if the original model is Y, = a + /3-|X-|, + e,
and the expanded model Y; = a + /3iX1(- + /32 X 2; + e; ' s
estim ated, then the R2 of the expanded model m ust be
A verage 1.0 0 0
greater than or equal to the R2 of the original model. This is
SUM 142.977 123.988
because the expanded model always has the same TSS and

nearly always has a sm aller RSS, resulting in a higher R2. The where the adjustm ent factor £ = (n — 1)/(n — k — 1). Note that
only situation where adding a variable does not increase R2 £ must be greater than 1 because the denom inator is less than
is if /32 = 0. In that case, the RSS remains the same (as does the numerator.
the R2). As explained previously, adding an additional explanatory vari
2. R2 cannot be com pared across models with different depen able usually decreases R2 and can never increase it. On the other
dent variables. For exam ple, when Y, is always positive, it is hand, including additional explanatory variables (i.e., increasing k)
not possible to com pare the R2 of a model in levels (Yj) and always increases £. The adjusted R2 captures the tradeoff between
logs (In Yj). It is also not possible to com pare the R2 for two increasing £ and decreasing RSS as one considers larger models.
m odels that are logically equivalent (in the sense that both If a model with additional explanatory variables produces a negli
the fit of the model as measured by RSS and predictions gible decrease in the RSS when compared to a base model, then
from the models are identical). The second case can occur the loss of a degree of freedom produces a smaller R2.
when the dependent variable is transform ed by adding or
The adjustm ent to the R2 may produce negative values if a
subtracting one or more of the explanatory variables. For
model produces an exceptionally poor fit. In most financial data
exam ple, consider
applications, n is relatively large and so the loss of a degree of
Yj = a + (3 Xj + Sj freedom has little effect on R2. In large sam ples, the adjustm ent
and term £ is very small and so R2 tends to increase even when an
additional variable has little explanatory power.
Y, - Xj = 8 + yXj + 77j
O LS estim ates of the slope in these two specifications of the

same model have the relationship that y + 1 =73. W hile 8.4 TESTING PARAMETERS
these two models have the same RSS, they have different IN REGRESSION MODELS
TSS and therefore different R2. These models are therefore
statistically identical yet cannot be com pared using R2. W hen the assum ptions in Section 8.1 are satisfied, the param-
3 . There is no such thing as a "g o o d " value for R2. W hether eter estim ators a and ySy follow a CLT. Thus, testing a hypothesis
a model provides a good description of the data depends about a single coefficient in a model with multiple regressors is
on the nature of the data. For exam ple, an R2 of 5% would identical to testing in a model with a single explanatory variable.
be implausibly high for a model that predicts the one-day Tests of the null hypothesis H0: \3j = ftp are im plem ented using
ahead return on a liquid equity index futures contract using a t-test:
the current value of explanatory variables. On the other P j “ PjO
( 8 . 12)
hand, an R2 less than 70% would be quite low for a model
sê-(Pj)
for predicting the returns on a well-diversified large-cap ^-----^ A
where s.e.ify) is the estim ated standard error of fy. W hile the
portfolio using the contem poraneous return on the equity
m arket (i.e., CAPM ). precise formula for the standard error is com plicated in the
k-variable model, it is a standard output in all statistical software
The first limitation is addressed (in a limited way) by the packages.
adjusted R2, which is written as R2. The adjusted R2 modifies the
However, the t-test is not directly applicable when testing com
standard R2 to account for the degrees of freedom used when
estim ating model param eters, so that: plex hypotheses that involve more than one parameter, because
the param eter estim ators can be correlated . 9 This correlation
^ 2 _ RSS/(n - k - 1) com plicates extending the univariate t-test to tests of multiple
TSS/(n - 1)
param eters.
= 1 --------n 7 1 . (1 — R2), (8.11) Instead, the more common choice is an alternative called the
n - k - 1
F-test. This type of test com pares the fit of the model (measured
where n is the number of observations in the sam ple and k is
using the RSS) when the null hypothesis is true relative to the fit
the number of explanatory variables included in the model (not
of the model without the restriction on the param eters assumed
including the constant term a).
by the null.
Adjusted R can also be expressed as:
RSS 9 Specifically, the parameter estimators will be correlated if the explana

= 1 -1 tory variables are correlated.
TSS'

Implementing an F-test requires estim ating two m odels. The so that the null should be rejected if at least one of the coeffi
first is the full model that is to be tested. This model is called cients is different from zero.
the unrestricted model and has an RSS denoted by RSSu. The
Imposing the null hypothesis requires replacing the param eters
second m odel, called the restricted m odel, im poses the null
with their assumed value if the null is true. Imposing the null
hypothesis on the unrestricted model and its R SS is denoted
hypothesis on the unrestricted model produces the restricted
R S S r . The F-test com pares the fit of these two m odels:
model:
(R S S r — RSSu)/q
(8.13) Rp*(- = a + p m R m *j + 0 X R sj + 0 X R vj + Sj
RSSu/(n - k u - 1)'
a + /3mRm*/ “I” e/»
where q is the number of restrictions imposed on the unre
which is the C A PM .
stricted model to produce the restricted model and ku is the
number of explanatory variables in the unrestricted model. An In this hypothesis test, two coefficients are restricted to specific
F-test has an distribution. values and so q = 2. The F-test is then com puted by estim at
ing both regressions, storing the two R SS values, and then
F-tests can be equivalently expressed in term s of the R2 from
com puting:
the restricted and unrestricted m odels. Using this alternative
param eterization: _ (R S S r - RSSu)/2
F ~ R S S j/in - 4)
r («U - R|)/q
(8.14)
(1 - R2u )/{n - ku - 1 ) Finally, if the test statistic F is larger than the critical value of an
F2 n_ 4 distribution using a size of a (e.g., 5%), then the null is
This expression is num erically identical to Equation 8 .1 2 . 1 0
rejected. If the test statistic is smaller than the critical value, then
If the restriction imposed by the null hypothesis does not m ean the null hypothesis is not rejected, and it is concluded that CAPM
ingfully alter the fit of the m odel, then the two R SS measures appears to be adequate in explaining the returns to the portfolio.
are similar, and the test statistic is small. On the other hand, if
Table 8.3 contains the value of the F-test statistics of this null
the unrestricted model fits the data significantly better than the
hypothesis applied to the industry portfolios. The p-value of
restricted m odel, then the R SS from the two models should dif
the test statistic, which measures the probability of observing
fer by a large amount so that the value of the F-test statistic is
the value of the test statistic if the null is true, is reported below
large. A large test statistic indicates that the unrestricted model
each test statistic. The p-value is com puted as 1 — C(F), where
provides a superior fit and so the null hypothesis is rejected.
C is the C D F of an F 2 n_ 4 distribution. A p-value of less than 5%
Implementing an F-test requires imposing the null hypothesis on indicates that the null is rejected when using a 5% test, which is
the model and then estimating the restricted model using O LS. For the case for all portfolios excep t the electrical equipm ent and
example, consider a test of whether CAPM , which only includes the retail industry portfolios.
market return as a factor, provides as good a fit as a multi-factor
model that additionally includes the size and value factors.
Multivariate Confidence Intervals
The unrestricted model includes all three explanatory variables,
so that: Recall that for a single parameter, a 1 — a confidence interval
defines the values where a test with size a is not rejected. The
Rp*(- ex + /3mRm*(- + fisRsi T p vRvl + Gj,
same idea can be applied to an F-test to define a region where
where m indicates the m arket (i.e., so that Rm*(- is the return to the null hypothesis is not rejected.
the m arket factor above the risk-free-rate), s indicates size, and
Figure 8.2 contains four exam ples of bivariate confidence
v indicates value. The null hypothesis is then:
regions . 1 1 These regions are multidimensional ellipses that
H0: Ps = A/ = 0 account for the variances of each param eter and the correlation
The alternative hypothesis is that at least one of param eters is between them . The confidence regions in Figure 8.2 are all con
not equal to zero: structed from the model:
H i: (3S 7 ^ 0 o r p v 0 Y j = « + P -\X -\ + /32 X 2 + e ,,
10 Transforming between the two forms relies on two identities: 11 Confidence intervals are used with a single parameter. "Confidence
1 - Rt = RSSu/TSS and R?u - R2r = (1 - R2r) - (1 - Rfa) so that region" is the preferred term when constructing the regions where the
R b - R2r = (RSS r - RSSuVTSS. null is not rejected for more than one parameter.

Table 8.3 The Three Columns Report the Test Statistic Value Using F-Tests
for the Nulls H0: Ps = p v = 0 (Left), H0: = 1 and ps + p v = 0 (Center),
and the Final Column Reports the F-Statistic of the Regression That Tests
H0: P m = P s = P v = 0 (Right). The Values Below Parameters in Parentheses
Are the p-values of the Test Statistic, Which Is Computed as 1 — C(F), Where C
Is the CDF of the Distribution of the Test Statistic
H„: f t = f t = 0 H0: p m = 1 and fa + f a = 0 F-statistic
Banking 135.3 51.8 575.7

(0 . 0 0 0 ) (0 . 0 0 0 ) (0 . 0 0 0 )
Shipping Containers 6 .6 2.5 176.9

(0 . 0 0 2 ) (0.087) (0 . 0 0 0 )
Chemicals 32.1 14.5 372.8

(0 . 0 0 0 ) (0 . 0 0 0 ) (0 . 0 0 0 )
Electrical Equipment 2 .2 19.3 450.5

(0 . 1 1 1 ) (0 . 0 0 0 ) (0 . 0 0 0 )
Computers 40.8 17.8 344.3

(0 . 0 0 0 ) (0 . 0 0 0 ) (0 . 0 0 0 )
Consumer Goods 9.2 33.6 143.4

(0 . 0 0 0 ) (0 . 0 0 0 ) (0 . 0 0 0 )
Retail 0.3 1.7 282.9

(0.763) (0.182) (0 . 0 0 0 )
where fa = /32 = 1. The errors are standard normal random vari if X-| is above its mean, X 2 will tend to be below its mean due to
ables, and the explanatory variables are bivariate normal: the negative correlation.
o-n po"i cr2 \ The bottom-right shows a different scenario where p = 50%
pO"i 0~2 022 J/ and a- 1 1 = 0.5. The length of the region for each variable is
The baseline specification uses a bivariate standard normal so proportional to the inverse of its variance, therefore, halving the
that cr-] 1 = CT2 2 = 1 and p = 0. The confidence region in the variance of X i doubles the width of the region along the x-axis
baseline specification with uncorrelated regressors is shown in relative to the upper-right panel.
the top left panel. This region is spherical, which reflects the lack
of correlation between the estim ated estim ators fa and fa .
The F-statistic of a Regression
The top-right panel is generated from a model where the cor
relation between the explanatory variables is p = 50%. The The F-statistic of a regression is the multivariate generalization
positive correlation between the variables creates a negative of the t-statistic of a coefficient. It is an F-test of the null that all
correlation between the estim ated param eters. W hen the explanatory variables have 0 coefficients:
explanatory variables are positively related and fa is overesti-
/V /V
Ho: fa = fa = • • • = Ac = 0
mated (i.e., so fa > fa ), then fa tends to underestim ate fa to
com pensate. The negative correlation m itigates the common The alternative hypothesis is
com ponent in the two correlated explanatory variables. H p (3j 7^ 0
The bottom -left panel shows the case where the correla for some j in 1 ,2 . . . , k. The test should reject w henever one or
tion is large and negative (—80%). The confidence interval is more of the regression coefficients are nonzero.
upward-sloping, which reflects the positive correlation between
The null hypothesis does not assume any value for the intercept,
param eter estim ators. Because the variables have a negative
/V A
a, and so the restricted model only includes an intercept:
correlation, fa will tend to overestim ate fa to com pensate if fa is
overestim ated. The com pensation improves the fit because Yj = a + Ej

N o C o rre la tio n 5 0 % C o rre la tio n
99% Conf. Region

1.6- 95% Conf. Region
90% Conf. Region
95% Conf. Int.
0.7-
0.4-
T T T T
-80% Correlation 50% Correlation, = V2
Fig u re 8.2 Four bivariate confidence intervals. Each confidence interval is centered on the estimated
/V /V
parameters fa (x-axis) and fa (y-axis). The top-left panel shows the confidence interval when the two explanatory
variables in the model are uncorrelated. The top-right panel shows the confidence interval when the variables
have identical variances but a correlation of 50%. The bottom-left panel shows the confidence interval when
the correlation is —80%. The bottom-right shows the confidence interval when the correlation is 50% but
the variance of fa is twice the variance of fa. The dark bars along the axes show individual (univariate) 95%
confidence intervals for fa and fa.
/V /V

C O N FLIC T IN G F- AND t-TEST STATISTICS
Suppose a model with two explanatory variables is to be fit: sim ilar to any other model fit where the coefficients sum to 1
(e.g ., fa = 1.5 and fa = 0.5). In models with substantial mul-
Y, — a + /3,X-|/ + f a X 2\ + Sj.
ticollinearity, it is simple to detect an effect of the two but
The significance of this relationship can be exam ined by: difficult to uniquely attribute it to either.
• Using an F-test of the null H0: fa = fa = 0, or On the other hand, if the F-test fails to reject but one of the
• Testing each of the two coefficients separately using t-tests does, then this indicates that one variable has a very
small effect. If X-, and X 2 are uncorrelated, it can be shown
t-tests of the null H0: fa = 0 for j = 1 , 2 .
that:
Under most circum stances, the conclusions drawn from these
two methods agree (i.e., both testing methods should indi
cate that at least one of the param eters is different from zero, /\
or both should fail to reject their respective null hypotheses). where F is the F-test statistic and T; is the t test statistic for /3r
If variable 2 has no effect on Y (i.e., T 2 ~ 0), then F ~ T2/2.
D isagreem ent between the two types of tests is driven by
The F-test will only indicate significance if T-| is m oderately
one of two issues. W hen the F-test rejects the null but the
large. For exam ple, when n = 250, then T2 would need to
t-tests do not, then it is likely that the data are multicollinear.
be larger than 6.06 to reject the null if T 2 = 0, which would
The F-test tells us that the model is providing a useful fit of
require | fa | > 2.46. Treated as a single test, any value
the data. The t-tests indicate that the fit of the model can
to | T-i | > 1.96 indicates that the null H0: fa = 0 should
not be uniquely attributed to either of the two explanatory
be rejected, and so when 1.96 < | fa \ < 2.46, the tests
variables. For exam ple, if X<\ and X 2 are extrem ely highly cor
disagree.
related (say 98% + ), then the model fit from fa = fa = 1 is
The test statistic of the variab le th at is uncorrelated with the other included
variab le s.
(T SS - RSSu)/k
F =RSS»/(n - fc - 1) ~ F^
~'(8' 15) fit is assessed using R2, which measures the ratio of the varia
Model
tion explained by the model to the total variation in the data. While
depends on the TSS, because T SS = R S S r when the restricted
intuitive, this measure suffers from some important limitations: It
model has no explanatory variables. Also, note that q = k
never decreases when an additional variable is added to a model
because every coefficient is being tested.
and is it not interpretable when the dependent variable changes.
Table 8.3 reports the F-statistic for the three-factor m odels of The adjusted R2 (R2) partially addresses the first of these concerns.
the industry portfolio returns in Table 8.1. The portfolio returns
The same variation measures that appear in R2 are used to test
are strongly related to the m arket, and so it is not surprising that
model parameters with F-tests. This test statistic measures the
all F-statistics indicate that the null of all coefficients being zero
difference in the fit between a restricted model (which imposes
is rejected.
the null hypothesis) and the unrestricted model. If the restriction
is valid, the fits of the two models should be similar. When the
8.5 SUMMARY restriction is invalid, the restricted model should fit the data worse
than the unrestricted model so that the test statistic is large,
This chapter describes the k-variable linear regression model, and the null should be rejected. Most statistical software pack
which extends the single explanatory variable model to account ages report the F-statistic of the regression, which tests the null
for additional sources of variation. All key concepts from the hypothesis that the explanatory variables have zero coefficients.
simple specification carry over to the general model.
H ow ever, the p resence of m ultiple exp lan ato ry variab les

8.6 APPENDIX
changes how co efficien ts are in te rp re te d . In m ost cases, the
co efficien ts are in terp reted as the effe ct of a sm all change
Tables
in one e xp lan ato ry variab le holding all others co n stan t. This
m eans th at the O LS estim ato rs m easure the unique co n These tables contain the values that are used in the exam ple
tribution of each e xp lan ato ry variab le using a com ponent that runs through the chapter.

First-Step Regressions The columns labeled using variable names starting with E are
the squares and cross-products of the dem eaned data. For
This table contains the annual returns from 1999 until 2018 for
exam ple, E P = Rp* — Rp* is the dem eaned excess return on
three series:
the shipping container return, and E p = (Rp* — Rp*)2. The col
1. The excess return on the shipping container industry, Rp*; umns labeled with two E variables are cross-products, so that
2. The excess return on the m arket portfolio, Rm*; and EpEm (Rp* ~ Rp*)(Rm * ~ Rm*)-
3 . The return of the value factor, Rv.
Year Rp* Rm E2
m El EpEm
1999 - 5 .4 6 19.7 - 2 4 .8 197.2 186.9 715.2 - 1 9 2 .0 - 3 6 5 .6
2000 - 3 7 .7 - 1 6 .7 39.2 2142.0 516.5 1388.1 1051.9 - 8 4 6 .8
2001 22.9 - 1 4 .7 15.8 205.0 429.6 192.0 - 2 9 6 .8 - 2 8 7 .2
2002 14.4 - 2 2 .4 8 .6 33.8 808.1 44.3 - 1 6 5 .4 - 1 8 9 .3
2003 11.9 30.5 3.79 1 1 . 0 598.9 3.4 81.2 45.2
2004 36.5 10 .6 7.1 779.4 20.9 26.6 127.7 23.6
2005 - 2 .3 2 3 7.86 118.9 9.2 35.0 33.0 - 1 7 .9
2006 1 2 .1 10 .2 12.5 12.4 17.4 111.5 14.7 44.1
2007 24 0.976 - 1 3 .6 237.7 25.5 241.6 - 7 7 .9 78.5
2008 - 3 0 .4 - 3 7 .8 1.71 1519.6 1920.8 0 .1 1708.5 1 0 .2
2009 34.8 28.2 - 4 .2 8 687.4 491.6 38.7 581.3 - 1 3 8 .0
2010 17.7 17.4 - 3 .8 5 83.1 129.3 33.6 103.7 - 6 5 .9
2011 - 5 .8 7 0.415 - 8 .2 6 208.9 31.5 104.1 81.1 57.3
2012 18.1 16.3 8.36 90.6 105.5 41.2 97.8 65.9
2013 36.7 35.2 1.31 790.6 851.1 0.4 820.3 - 1 8 .5
2014 12 .8 11.7 - 1 .9 2 17.8 32.2 14.9 23.9 - 2 1 .9
2015 -5 .3 1 0.06 - 9 .8 7 193.0 35.6 139.5 82.9 70.5
2016 19.2 13.3 20.7 112.7 52.9 351.8 77.2 136.4
2017 16.9 21.4 - 1 1 . 2 69.2 236.3 172.7 127.9 - 2 0 2 .0
2018 - 1 9 .3 -6 .8 1 - 1 0 .3 777.4 164.8 149.9 357.9 157.2
Sum 171.6 120.5 38.9 8287.7 6664.8 3804.6 4638.9 - 1 4 6 4 .2
Mean 8.58 6.03 1.94 414.39 333.24 190.23 231.94 -7 3 .2 1

Second-Step Regressions B2
Year Rp* Rv Rl RP*RV
This table contains the values of the shipping container industry
2007 18.9 -16.7 358 277 -315.3
return residuals ( R p*) and the value factor (Rv) from a regres
sion on a constant and the excess return on the m arket. These 2008 -8.5 -9.9 72 97 83.6
are, by construction, mean zero and uncorrelated with the 2009 10.8 -1.4 116 2 -14.6
excess return on the m arket. The final three columns contain the
2010 1.2 -3.3 1 11 -4.0
squares of these residuals and their cross-products.
2011 -10.5 -11.4 111 131 120.6
Year Rp* Rv Rp* Rl Rp*Rv 2012 2.4 8.7 6 75 20.5
1999 -23.6 -23.74 555 564 559.3 2013 7.8 5.8 61 33 45.1
2000 -30.5 32.3 928 1041 -982.9 2014 0.3 -2.6 0 7 -0.7
2001 28.7 9.3 826 87 267.4 2015 -9.7 -13.1 95 172 127.8
2002 25.6 0.4 656 0 10.6 2016 5.6 20.4 31 414 113.1
2003 -13.7 7.2 188 52 -99.1 2017 -2.4 -9.8 6 95 23.3
2004 24.7 6.2 612 38 152.4 2018 -18.9 -15.1 359 227 285.4
2005 -8.8 5.3 77 28 -46.2 Sum 0 0 5059 3483 353
2006 0.6 11.5 0 132 7.0 Mean 0 0 253 174 17.7

QUESTIONS

8.1 How does the correlation between pairs of explanatory 8.5 You estim ate a regression model
variables affect the interpretation of the coefficients in a Yj = a + f3-\X-| j + i82X 2l + s,. Using the F-stat of the model,
regression with multiple explanatory variables? you reject the null H0 :/31 = yS2 = 0 but fail to reject either
8.2 Both R2 and R2 only increase when adding a new variable. of the nulls H q :/3^ = 0 or H q :/32 = 0 using the t-stat of the
coefficient. Which values of p = C o rr[X 1f X 2] make this
True or false?
scenario more likely?
8.3 W hen is R2 less than 0? Can this measure of fit be larger
than 1 ? 8.6 Suppose you fit a model where both t-statistics are exactly
+ 1.0. Could an F-statistic lead to rejection in this circum
8.4 The value of an F-test can be negative. True or false?
stance? Use a diagram to help explain why or why not.
Practice Questions
8.7 Construct 95% confidence intervals and p-values for the 8.11 Using the data below:
coefficients in the shipping containers industry portfolio
returns.
Trial Number y *1 *2
8.8 The models were estimated on monthly returns. W hat is the 1 - 5 .7 6 -3 .4 8 -1 .3 7

estimate of the annualized a for the beer and liquor indus 2 0.03 - 0 .0 2 -0 .6 2
try portfolio? W hat is the t-statistic for the annualized a?
3 -0 .2 5 - 0 .5 -1 .0 7
8.9 Suppose Yj and X , followed a bivariate normal so that:
4 - 2 .7 2 -0 .1 8 - 1.0 1
V; -0 .8 2 0.39
5 -3 .0 8
6 -7.1 -2 .0 8 1.39
W hat is the R2 in a model that regresses Y on X? W hat
7 -4.1 - 1 .0 6 0.75
would the R2 be in a model that regressed X on V?
0.14 -0 .6 3
8.10 A model was estim ated using daily data from the S&P 500
8 0 .0 2
from 1977 until 2017 which included five day-of-the-week 9 -6 .1 3 - 1.6 6 1.31
dummies (n = 10,087). The R2 from this regression was 10 0.74 0 .6 8 -0 .1 5
0.000599. Is there evidence that the mean varies with the
a. A pply O LS linear regression to find the param eters for
day of the w eek?
the following equation:
Y, = a + p-\XVl + (32X 2i + €j
b. W hat is the R2, R2, and F statistic of this regression?

ANSW ERS

8.1 Regression coefficients are interpreted as holding all other
variables constant, and so the correlation between the regres
sors is immaterial to the interpretation of the coefficients.
8.2 False. R2 will always increase. R2 can decrease if the

degree-of-freedom adjusted residual variance estimator,
sf
— , does not decrease when adding a new regressor.
This can happen because k is the number of regressors
in the model and so if the num erator does not change,
adding an additional regressor will increase k and so
decrease the denominator.
8.3 R2 is always less than R2 and so cannot be larger than 1. 8.6 Yes. When the regressors are very positively correlated,
It can be sm aller than 0 if the model explains virtually then this can happen. The t-stats are small because the
nothing and the degree-of-freedom loss is sufficient to variables are highly co-linear, so that the variation in the
push its value negative. left-hand-side variable cannot be uniquely attributed to
8.4 False. An F-test measures the reduction in the sum of either. However, the F-stat is a joint test and so if there is
squared residuals that occurs in the expanded model and an effect in at least one, the Fe an reject even if the t-stats
depends on SSRu — S S R r . This value is always positive do not. See the image below that shows the region where
because the unrestricted model must fit better than the the F would fail to reject, and the two t-stats.
restricted model.
8.5 This is most likely to occur when the regressors are highly
correlated. If the regressors are positively correlated, then
the parameter estimators of the coefficients will be nega
tively correlated. If both values are positive, this would lead
to rejection by the F-test. Similarly, if the regressors were
negatively correlated, then the estimators are positively
correlated and the F will reject if one t is positive and the
other is negative. The figure below shows the case for pos
itively correlated regressors. The shaded region is the area 3 •-------- •-------- --------- 1-------- 1-------- •
where the F would fail to reject. The t-stats are outside this -3 - 2 - 1 0 1 2 3
area even though neither is individually significant.
Solved Problems
8.7 The confidence interval is ft ± 1.96 X se. The p-value is
2(1 — <E>(| 1 1)) where <I>( -) is the standard normal CDF.
Coefficient Estimate t-stat Std. Err. Conf. Int. P-value
8.8 The annualized a is 12a = 12 X 0.517 = 6.204% . The
Pm 1 . 0 1 1 18.63 0.054 [0.905, 0.000 t-stat is unaffected because the standard error is also
1.117]
scaled by 1 2 .
- - 0 .1 6 0.075 [- 0 .1 5 9 , 1.128
Ps 0 .0 1 2
8.9 Because this is a regression with a single explanatory
0.135]
variable, the R2 is the squared correlation. The correla
Pv 0.276 3.51 0.079 [0 . 1 2 1 , 0 .0 0 2
tion is 0 .9 / ^ 3 = .519 and so the R2 = 0.27. The R2 is
0.431]

the same in both regressions because it only depends on

4 - 2 .7 2 -0 .1 8 - 1.0 1
the correlation between the variables.
5 -3 .0 8 -0 .8 2 0.39
8.10 The model estim ated is
6 -7.1 -2 .0 8 1.39
Y, = /3 -|D-| + (32D2 + /33 D 3 + /34 D 4 + /34 D 5 + e„ where
7 -4.1 - 1 .0 6 0.75
Dj is a dummy that takes the value 1 if the index of the
w eekday is / (e.g ., Monday = 1, Tuesday = 2, . . . ). 8 0.14 0 .0 2 -0 .6 3
The restriction is that H q :/3-\ = = = /34 = ySs 9 -6 .1 3 - 1.6 6 1.31
so there this is no day-of-the-week effect.
0.74 -0 .1 5
This model can be equivalently written as
10 0 .6 8
Yj = /jl + 82D2
S3D3 + 84D4 + S4D5 + Sj, therefore,
+ A verage -2 .8 2 3 -0 .9 1 0 - 0 .1 0 1
here the null is H0:82 = 83 = 84 = 8 5 . In the two m odels,

H = /3-|, and /jl + 8, = (3,. The second form of the model
is a more standard null for an F-stat.
Trial Number y - y x1 — x1 *2 “ *2
The F-stat of the regression is
1 - 2 .9 3 7 -2 .5 7 0 -1 .2 6 9
^0.000599
R 2- 0/
4 2 2.853 0.890 -0 .5 1 9
1.51. The distribution is
1- R 2 1 -0.000599
/n —5 / 3 2.573 0.410 -0 .9 6 9
/(10087 —5)
an ^4 , 1 0 0 8 2 and the critical value using a 5% size is 2.37. 4 0.103 0.730 -0 .9 0 9
The test statistic is less than the critical value, therefore, 5 -0 .2 5 7 0.090 0.491
the null that all effects are 0 is not rejected.
6 -4 .2 7 7 -1 .1 7 0 1.491
8.11 a. Start by setting up the basic calculations: 7 -1 .2 7 7 -0 .1 5 0 0.851
Trial Number y *1 *2 8 2.963 0.930 -0 .5 2 9
1 - 5 .7 6 - 3 .4 8 -1 .3 7 9 -3 .3 0 7 -0 .7 5 0 1.411
2 0.03 - 0 .0 2 -0 .6 2 10 3.563 1.590 -0 .0 4 9
3 -0 .2 5 - 0 .5 -1 .0 7 A verage
N>
Trial Number (*! - X t )2 (x2 - x2)2 (x-, - Xt )(y - y) (x2 - x2)(y - y) (*! - )(*2 - * 2)
<<
I
1 8.626 6.605 1.610 7.548 3.727 3.261
2 8.140 0.792 0.269 2.539 -1.481 -0 .4 6 2
3 6.620 0.168 0.939 1.055 -2 .4 9 3 -0 .3 9 7
4 0.011 0.533 0.826 0.075 -0 .0 9 4 -0 .6 6 4
5 0.066 0.008 0.241 -0 .0 2 3 -0 .1 2 6 0.044
6 18.293 1.369 2.223 5.004 -6 .3 7 7 -1 .7 4 4
7 1.631 0.023 0.724 0.192 -1 .0 8 7 -0 .1 2 8
8 8.779 0.865 0.280 2.756 -1 .5 6 7 -0 .4 9 2
9 10.936 0.563 1.991 2.480 -4 .6 6 6 -1 .0 5 8
10 12.695 2.528 0.002 5.665 -0 .1 7 5 -0 .0 7 8
A verage 7.580 1.345 0.911 2.729 -1 .4 3 4 -0 .1 7 2

This gives y 0 = fiy - rj 1/xx, = - 2 .8 2 3 - (2 .0 2 9 )(-0 .9 1 ) = - 0 .9 7 6
jxy = —2.823 /Xx, = —0.91 /zx2 = —0.101 So:

cry = 7.58 ox, = 1.345 o$2 = 0.911
Trial
& x,x = - 0 .1 7 2 (Jy(Jxy = 2.729 o-yox = -1 -4 3 4 ~2
2 2
Number k *2 kx2 *2
First regressing on the last independent variable: 1 2.274 -1 .5 9 8 -3 .6 3 5 2 .5 5 5
X1(- — S'0
n + 5iX->; + X 1/
1^2/ 2 1.047 -0 .4 0 6 -0 .4 2 5 0.164
X2 - 0 .1 7 2
0.189 3 1.741 -0 .9 1 7 -1 .5 9 6 0.841
0.911
4 -1 .3 7 8 -0 .8 1 6 1.125 0.666
5q = /xXl - î Mx2 = -0-91 - 0.189*(—0.101) = -0.929
5 -0 .4 4 0 0.502 -0.221 0.252
Vi ~ To + 71^2/ + 6 -1 .9 0 5 1.341 -2 .5 5 4 1.798
- 1 .4 3 4
- 1 .5 7 4 7 -0 .9 7 4 0.831 -0 .8 0 9 0.691
0.911
8 1.076 -0 .4 1 0 -0 .4 4 2 0.168
7o = Mr - fiM x2 = - 2 .8 2 3 - (—1 .574*—0.101 ) = - 2 .9 8 2
9 -1 .7 8 7 1.315 -2 .3 4 9 1.728
Trial 10 0.338 0.154 0.052 0.024
Number ~2
y xi y*1 xi A verage -1 .0 8 5 0.889
1 - 4 .9 3 4 -2 .8 1 0 13.865 7.896
and
2 2.036 0.792 1.612 0.627
C o v ( K ,X 2) = - 1 .0 8 5 o-| = 0.889
3 1.048 0.227 0.238 0.051
- 1 .0 8 5
4 -1 .3 2 8 0.558 -0.741 0.311 = - 1.221
0.889
f t =
5 0.516 0.183 0.094 0.033 Tying things together:

6 -1 .9 3 0 -0 .8 8 8 1.715 0.789 V, - ftX -,, - /32X 2( = a + e, = - 1 .8 7 3 X 1; + 1.221 X 2l
7 0.063 0.011 0.001 0.000
8 2.130 0.830 1.768 0.689 Trial Number a + €j 4
9 -1 .0 8 6 -0 .4 8 3 0.525 0.234 1 -0 .9 1 3 0.328 0.108
10 3.486 1.581 5.510 2.498 2 -0 .6 8 9 0.553 0.306
A verage 2.459 1.313 3 -0 .6 1 9 0.623 0.388
4 -3 .6 1 5 -2 .3 7 3 5.632
C o v (Y ,X J 2.459 5 -1 .0 6 8 0.173 0.030
1.873
1.313 6 -1 .5 0 8 -0 .2 6 7 0.071
Repeating the process for X 2: 7 -1 .2 0 0 0.042 0.002
*2 i = io + + x 2i 8 -0 .6 6 6 0.576 0.331
- 0 .1 7 2 9 -1 .4 2 3 -0.181 0.033
- 0 .1 2 8
1.345
10 -0 .7 1 7 0.525 0.276
£0 = Mx 2 - îM x, = -0 .1 0 1 - (- 0 .1 2 8 ) - (- 0 .9 1 ) = - 0 .2 1 7
A verage -1 .2 4 2 0.718
Yj = fjo + + K So a = - 1 .2 4 2 , and the variance of the residuals
2.729 is 0.718.
2.029
1.345

Finally, asserting the presumption of normality: So, the TSS = 75.797 and the ESS = 68.598.
Y; = - 1 .2 4 2 + 1.873*X 1(- - 1.221*X2, p2 _ E S S _ 68.598

t\ = 90.5%
+ £// £/ N(0, 0.718) T SS ~ 75.797
n - 1
Note that the variance for e, is biased and the unbi- R2 = 1 - (1 - R2)
n - k - 1
10
ased figure is this figure multiplied by — ---- — = 1.429 10 - 1
1I (1 - 0.905) = 0.878
or 1.206. 10 - 2 -
b. (.R SS r - RSSu)/q (7.58 - 0.72)/2
F = 33.47
RSSu/(N - ku - 1) 0.72/(10 - 2 - 1 )
Trial
Number
A
y y (y - y )2 (y - y )2
1 -5 .7 6 0 -6 .0 8 9 8.626 10.664
2 0.030 -0 .5 2 3 8.140 5.290
3 -0 .2 5 0 -0 .8 7 3 6.620 3.802
4 -2 .7 2 0 -0 .3 4 7 0.011 6.131
5 -3 .0 8 0 -3 .2 5 4 0.066 0.185
6 -7 .1 0 0 -6 .8 3 4 18.293 16.085
7 -4 .1 0 0 -4 .1 4 2 1.631 1.741
8 0.140 -0 .4 3 6 8.779 5.698
9 -6 .1 3 0 -5 .9 4 9 10.936 9.774
10 0.740 0.215 12.695 9.227
A verage -2 .8 2 3
SUM 75.797 68.598

Learning Objectives
Explain how to test w hether a regression is affected by Explain two model selection procedures and how these
heteroskedasticity. relate to the bias-variance tradeoff.
D escribe approaches to using heteroskedastic data. Describe the various m ethods of visualizing residuals and
their relative strengths.
Characterize m ulticollinearity and its consequences; distin
guish between m ulticollinearity and perfect collinearity. Describe methods for identifying outliers and their impact.
D escribe the consequences of excluding a relevant Determ ine the conditions under which O LS is the best
explanatory variable from a model and contrast those with linear unbiased estimator.
the consequences of including an irrelevant regressor.
139
This chapter exam ines model specification. Ideally, a model estim ator of param eter standard errors, but not the specification
should include all variables that explain the dependent variable of the regression model or the estim ated param eters.
and exclude all that do not. In practice, achieving this goal is chal
The chapter concludes by exam ining the strengths of the O LS
lenging, and the model selection process must account for the
estimator. W hen five assum ptions are satisfied, the O LS estim a
costs and benefits of including additional variables. Once a model
tor is the most accurate in the class of linear unbiased estim a
has been selected, the specification should be checked for any
tors. W hen model residuals are also normally distributed, then
obvious deficiencies. Standard specification checks include resid
O LS is the best estim ator of the model param eters (in the sense
ual diagnostics and formal statistical tests to examine whether the
of producing the most precise estim ates of model param eters
assumptions used to justify the O LS estimator(s) are satisfied.
among a wide range of candidate estim ators).
Om itting explanatory variables that affect the dependent vari
able creates biased coefficients. The bias depends on both the
coefficient of the excluded variable and the correlation between 9.1 OMITTED VARIABLES
the excluded variable and the remaining variables in the model.
W hen a variable is excluded, the coefficients on the included An om itted variable is one that has a non-zero coefficient but is
variables adjust to capture the variation shared between the not included in a model. Om itting a variable has two effects.
excluded and the included variables.
1. First, the remaining variables absorb the effects of the omit
On the other hand, including irrelevant variables does not bias ted variable attributable to common variation. This changes
coefficients. In large samples, the coefficient on an extraneous vari the regression coefficients on the included variables so that
able converges to its population value of zero. Including an unnec they do not consistently estim ate the effect of a change in
essary explanatory variable does, however, increase the uncertainty the explanatory variable on the dependent variable (holding
of the estimated model parameters. When the explanatory vari all other effects constant).
ables are correlated, the extraneous variable competes with the
2. Second, the estim ated residuals are larger in magnitude
relevant variables to explain the data, which in turn makes it more
than the true shocks. This is because the residuals contain
difficult to precisely estimate the parameters that are non-zero.
both the true shock and any effect of the om itted variable
Determ ining w hether a variable should be included in a model that cannot be captured by the included variables.
reflects a bias-variance tradeoff. Large models that include all
Suppose the data generating process includes variables X i
conceivable explanatory variables are likely to have coefficients
and X 2:
that are unbiased (i.e., equal, on average, to their population
values). However, because it may be difficult to attribute an Yj = a + f a X Vl + f a X 2i + 6j
effect to a single variable, the coefficient estim ators in large Now suppose X 2 is excluded from the estim ated model. The
models are also relatively im precise. Sm aller m odels, on the model fit is then:
other hand, lead to more precise estim ates of the effects of the
Y-, — a + f a X Vl + e;
included explanatory variables.
In large sam ples, the O LS estim ator fa converges to:
Model selection captures this bias-variance tradeoff when
sam ple data are used to determ ine w hether a variable should be fa + fa&
included. This chapter introduces two methods that are widely where:
used to select a model from a set of candidate m odels: general-
= C o v [X i,X 2]
to-specific model selection and cross-validation.
vpg
O nce a model has been chosen, additional specification checks
is the population value of the slope coefficient in a regression of
are used to determ ine whether the specification is accurate. Any
X 2 o n X-|.
defects detected in this post-estimation analysis are then used
to adjust the model. The standard specification checks include The bias due to the om itted variable depends on the population
testing to ensure that the functional form is adequate, the model coefficient of the excluded variable fa and the strength of the
param eters are constant, and the assumption that the errors relationship between X-| and X 2 (which is measured by 8). When
have constant variance (i.e., homoscedasticity) is com patible with X-| and X 2 are highly correlated, then X-, can explain a substantial
the data. Tests of functional form or param eter stability provide portion of the variation in X 2 and the bias is large. If X-| and X 2
A
guidance about whether additional variables should be included are uncorrelated (so that 8 = 0), then fa is a consistent estim a
in the model. Rejecting hom oscedasticity requires adjusting the tor of fa . In other words, om itted variables bias the coefficient

on any variable that is correlated with the om itted variable(s). model from a set of candidate explanatory variables, and some
Note that econom ic data are generally correlated (often highly), of these m ethods always select the true model (at least in large
and so omitting relevant variables creates biased and inconsis sam ples). O ther methods perform well in applications of linear
tent estim ates of the effect of the included variables. regression despite having w eaker theoretical foundations.
Two of these m ethods are general-to-specific model selection

and m-fold cross-validation.2 General-to-specific (GtS) model
9.2 EXTRANEOUS INCLUDED selection begins by specifying a large model that includes all rel
VARIABLES evant variables. If there are coefficients in this estim ated model
that are statistically insignificant (using a test size of a), the vari
An extraneous variable is one that is included in the model but
able with the coefficient with the sm allest absolute t-statistic is
is not needed. This type of variable has a true coefficient of 0
rem oved. Then, the model is re-estimated using the remaining
and is consistently estim ated to be 0 in large sam ples. explanatory variables. These two steps (remove and re-estimate)
However, there is a cost to including unnecessary variables. To are repeated until the model contains no coefficients that are
see this, recall the definition of the adjusted R2 from Chapter 8: statistically insignificant. Common choices for a are between 1%
and 0.1% (which correspond to selecting models where all abso
RSS
R2 = 1 - € lute t-statistics are at least 2.57 or 3.29, respectively).
TSS'
m-fold cross-validation is an alternative method that is popular
where the adjustm ent factor £ = (n — 1)/(n — k — 1).
in modern data science. It is designed to select a model that
Adding variables increases k, which increases £ and, therefore, perform s well in fitting observations not used to estim ate the
reduces R 2. M eanwhile, an expanded model nearly always has param eters (a process known as out-of-sample prediction).
a sm aller RSS, which (hopefully) offsets the increase in £ to pro Cross-validation selects variables with consistent predictive
duce a larger R 2. However, this is not true when the true coeffi power and excludes variables with small coefficients that do
cient is equal to 0. In that case, RSS remains the same even as £ not improve out-of-sample predictions. This mechanism has the
grows, resulting in a sm aller R 2 as well as a larger standard error. effect of directly em bedding the bias-variance tradeoff in the
Furthermore, the standard error grows larger as the correlation model selection.
between X-, and X 2 increases. In most applications to financial data, The first step is to determ ine a set of candidate m odels. When
correlations are large and so the impact of an extraneous variable the number of explanatory variables is small, then it is possible
on the standard errors of relevant variables can be substantial. to consider all com binations of explanatory variables. In a data
set with ten explanatory variables, for exam ple, 2 10 = 1,024 dis
tinct candidate models can be constructed.3
9.3 THE BIAS-VARIANCE TRAD EO FF
Cross-validation begins by randomly splitting the data into m
equal sized blocks. Com m on choices for m are five and ten.
The choice between omitting a relevant variable and including
Param eters are then estim ated using m — 1 of the m blocks,
an irrelevant variable is ultimately a tradeoff between bias and
and the residuals are com puted using these estim ated param
variance. Larger m odels tend to have a lower bias (due to the
eter values on the data in the excluded block. These residu
inclusion of additional relevant variables), but they also have less
als are known as out-of-sample residuals, because they are
precise estim ated param eters (because including extraneous
not included in the sam ple used to estim ate param eters. This
variables increases estimation error). On the other hand, models
process of estim ating param eters and computing residuals is
with few explanatory variables have less estimation error but are
repeated a total of m tim es (so that each block is used to com
more likely to produce biased param eter estim ates.
pute residuals once).
The bias-variance tradeoff is the fundam ental challenge in vari
able selectio n .1 There are many methods for choosing a final
2 It most other references on model selection, this type of cross-
validation is referred to as k-fold cross-validation, m is used here to clar
ify that the number of blocks in the cross-validation (m) is distinct from
1 Bias measures the expected deviation between an estimator and the
/V /V
the number of explanatory variables in the model (k).
true value of the unknown coefficient and is defined Bias(0) = E[0] — 0q.
A A A
Variance measures the squared deviation, V[fl] = E[(0 — E[0]) ], and so 3 If a data set has n candidate explanatory variables, then then there are
q
bias and variance are not directly comparable. In practice, the bias-vari 2n possible model specifications. When n is larger than 20, the number
ance tradeoff is a tradeoff between squared bias and variance because of models becomes very large, and so it is not possible to consider all
E[(0 - 6>0)2] = Bias2(0) + V[0], specifications.
Chapter 9 Regression Diagnostics ■ 141

y
Fiq u re 9.1An example of heteroskedastic data. The standard deviation of the errors is increasing in the distance
between X and E[X], so that the observations at the end of the range are noisier than those in the center.
The sum of squared errors is then com puted using the residuals Figure 9.1 contains a sim ulated data set with heteroskedastic
estim ated from the out-of-sample data. Finally, the preferred shocks. The residual variance increases as X moves away from its
model is chosen from the set of candidate models by select m ean; consequently, the largest variances occur when (X — /xx)2
ing the model with the sm allest out-of-sample sum of squared is largest. This form of heteroskedasticity is common in financial
residuals. In machine learning or data science applications, the asset returns.
m — 1 blocks are referred to as the training s e t and the omitted In a regression, the most extrem e observations (of X ) con
block is called the validation set. tain the most information about the slope of the regression
line. W hen these observations are coupled with high-variance
residuals, estim ating the slope becom es more difficult. This is
9.4 HETEROSKEDASTICITY reflected by a larger standard error for /§.
W hen residuals are heteroskedastic, the standard errors can be

Hom oscedasticity is one of the five assum ptions used to deter
estim ated using W hite's estim ator (also called Eiker-W hite in
mine the asym ptotic distribution of an O LS estimator. It requires
some software packages). Param eters can be tested using
that the variance of e(- is constant and so does not system atically
t-tests by using W hite's standard error in the place of standard
vary with any of the explanatory variables. W hen this is not the
error used for hom oscedastic data. On the other hand, F-tests,
case, then the residuals are heteroskedastic.
are not as easy to adjust for heteroskedasticity and so caution is
Form ally, hom oscedasticity requires that the conditional vari required when testing multiple hypotheses if the shocks are
ance of the shock is constant (i.e., N/[e,|X1(-, X 2/, . . . , X /J = a 2). In heteroskedastic.4
models of financial data, this assumption is often false. A conse
quence of heteroskedasticity is that the asym ptotic distribution 4 A Wald test can be used when residuals are heteroskedastic. The
of the estim ated param eters takes a different form . However, expression for this type of test requires linear algebra. This modification
of F-tests is automatically employed by statistical software when using
heteroskedasticity d o es not affect the consistency or unbiased heteroskedasticity-robust parameter covariances (i.e., when using White
ness of the O LS param eter estimator. standard errors).

A sim ple test for heteroskedasticity, proposed by W hite is These residuals are then squared and regressed on a
im plem ented as a two-step procedure.5 constant, the excess m arket return, and the squared excess
m arket return:
1. Estim ate the model and com pute the residuals (e,).
2. Regress the sq u a red residuals on a constant, all explanatory e? = 7o + 7 iR m*; + 72(Rm*;)2 + V i
variables, and the cross-product of all explanatory variables Table 9.1 reports the param eter estim ates for y , and y 2 (which
(including the product of each variable with itself). should be 0 if the data are hom oscedastic). The values below
For exam ple, if the original model has two explanatory variables: the estim ated coefficients are t-statistics.
A part from the banking industry, all estim ates of y , are statisti
Y, — a + fa X Vl + f a X 2j + €j
cally insignificant at the 10% level (i.e., they are sm aller in m ag
then the first step com putes the residuals using the O LS param nitude than 1.645). This indicates that the market return does
eter estim ators:
not generally linearly affect the variance of the residuals. How
e ; = Y , - a - f a X , , - f a X 2i ever, most of the coefficients on y 2 are statistically significant
at the 10% level, and so the shock variance is large when the
The second step regresses the squared residuals on a constant,
squared m arket return is large; in other words, the volatilities of
X 1f X 2, X i X l and X ^ :
the shock and m arket are positively related.
- To + 7 iX ;i + 72^2/ + 73*1/ + 74^2/ + 7 s*i ,x 2/ + 17/ (9 . 1)
If the data are hom oscedastic, then ef should not be explained
by any the variables on the right-hand side and the null is Table 9.1 White's Test for Heteroskedasticity
Applied to the Industry Portfolios. The First-Step
H0: 71 = ' ' ’ = 75 = 0
Model Is a CAPM. The Second-Step Model Is
The test statistic is com puted as nR2, where the R2 is com puted
€? = yo + n Rm*i + 72 (R m * i)2 + Vu where Rm*, Is
in the second regression. The test statistic has a Xk(k+3)/2 distri the Excess Return on the Market. The Values Below
bution, where k is the number of explanatory variables in the the Estimated Coefficients Are t-Statistics. The Third
first-stage m odel.6 Column Contains White's Test Statistic, nR2, Which Has
W hen k = 2, the null imposes five restrictions and the test sta a \ 2 Distribution. The p-Value of the Test Statistic Is in
tistic is distributed xi- Large values of the test statistic indicate Parentheses. The Final Column Reports the R2 from This
that the null is false, so that hom oscedasticity is rejected in favor Regression. The Number of Observations is n = 360
of heteroskedasticity. W hen this occurs, heteroskedasticity-
7i 72 White Stat. R2
robust (White) standard errors should be used.
Banking - 0 .4 8 8 0.140 40.63 0.113
Table 9.1 contains the results of W hite's test on the industry
portfolios considered in C hapter 8. The first-step model is (- 2 .6 7 ) (5.50) (0.000)
derived from C A PM : Shipping 0.144 0.053 0.95 0.003
Containers
Rp*/ ex + /3Rm*(- + €j, (0.36) (0.96) (0.623)
where Rp*(- is the excess return on a portfolio and Rm*(- is the Chemicals 0.042 0.096 7.72 0.021
excess return on the market. (0.17) (2.77) (0.021)
The residuals from the regression are constructed as: Electrical 0.191 0.092 10.21 0.028
Equipment
= RP*; - a ~ /3Rm*; (0.93) (3.23) (0.006)
Computers 0.348 0.205 10.70 0.030

5 White, H, "A Heteroskedasticity-consistent Covariance Matrix Estima (0.78) (3.32) (0.005)
tor and a Direct Test for Heteroskedasticity." Econometrica, 48(4) (1980):
817-838. Consumer 0.492 0.085 3.51 0.010
6 The Breusch-Pagan test is a special case of White's test that excludes Goods
(1.29) (1.61) (0.173)
the cross-products of the explanatory variables, and so the distribution
of the test statistic is a \ 2k. This version of the test is often preferred in Retail 0.102 0.050 3.10 0.009
large models to moderate the number of regressors in the second-step
(0.50) (1.76) (0.212)
regression.

Table 9.2 Parameter Estimates from Regressions of the Returns of Each Industry Portfolio on the Returns of the
Factor Portfolios: The Market, the Size Factor, and the Value Factor. The Line Immediately Below the Parameter
Estimates Contains the t-Statistics of the Coefficient Computed Using White's Heteroskedasticity-Robust Covariance
Estimator. The Next Line Contains the t-Statistics (in italics) of the Coefficient That Are Only Valid if the Shocks Are
Homoscedastic. The Final Two Columns Report Test Statistics of the Null H0: ps = p v = 0 Using the Standard F-Test
and the Heteroskedasticity-Robust F-Test. The Number Below Each Test Statistic Is the p-Value of the Test Statistic7
a Pm Ps Pm F-test Robust F
Banking - 0 .1 7 2 1.228 - 0 .1 5 6 0.838 135.3 87.2
(-1 .0 1 9 ) (23.015) (-2 .7 2 9 ) (12.124) (0.000) (0.000)
(-1 .0 9 5 ) (32.261) (- 3 .0 2 7 ) (15.152)
Shipping Containers - 0 .0 0 3 1.011 - 0 .0 1 2 0.276 6.57 4.11
(-0 .0 1 2 ) (17.128) (-0 .1 0 7 ) (2.851) (0.002) (0.016)
(- 0 .0 7 2 ) (18.633) (- 0 .1 6 1 ) (3.510)
Chemicals - 0 .0 5 3 1.098 - 0 .0 2 8 0.458 32.1 18.0
(-0 .3 0 9 ) (21.988) (-0 .4 3 3 ) (5.783) (0.000) (0.000)
(- 0 .3 1 4 ) (26.902) (- 0 .5 0 6 ) (7 .720)
Electrical Equipment 0.103 1.263 - 0 .0 3 3 0.115 2.21 1.53
(0.585) (25.147) (-0 .4 5 3 ) (1.653) (0.111) (0.217)
(0.585) (29.613) (- 0 .5 7 0 ) (1.855)
Computers 0.055 1.296 0.224 - 0 .6 6 6 40.8 24.7
(0.228) (18.343) (2.285) (-6 .3 6 9 ) (0.000) (0.000)
(0.226) (21.988) (2.800) (- 7 .7 8 4 )
Consumer Goods 0.215 0.675 - 0 .1 7 4 0.117 9.18 1.61
(1.372) (11.779) (-1 .3 0 6 ) (1.507) (0.000) (0.200)
(1.311) (17.026) (- 3 .2 4 1 ) (2.028)
Retail 0.221 0.927 - 0 .0 3 6 0.010 0.27 0.12
(1.346) (20.006) (-0 .4 7 8 ) (0.154) (0.763) (0.887)
(1.347) (23.327) (- 0 .6 6 1 ) (0.174)
The third column reports the test statistic for W hite's test (which The t-statistics on the size and value factor coefficients that
is nR2 from the second-stage regression on the residuals). The account for heteroskedasticity are generally sm aller than the
null of hom oscedasticity imposes two restrictions on the coef t-statistics com puted assuming hom oscedasticity. The changes
ficients in the second-stage regression (H 0 : 7 1 = 72 = 0)/ and so in the consum er goods industry portfolio are large enough to
the test statistic has a X 2 distribution. Four of the seven series alter the conclusion about the significance of the size factor
reject the null hypothesis and therefore appear to be hetero- (because the t-statistic decreases from - 3 .2 to —1.3).
skedastic. The final column reports the R2 of the second-stage
The final two columns contain test statistics for the null that
m odel. All regressions use 30 years of monthly observations
CAPM is adequate (i.e., Hq :/3s = f3v = 0). The two test statistics
(n = 360).
Table 9.2 includes t-statistics com puted using W hite's covari

ance estim ator in the three-factor m odels. It contains the esti 7 The standard definition of the heteroskedasticity-robust F-statistic has
a distribution, while the standard F-test has a Fqn_k_-| distribution.
mated param eters for each industry portfolio, along with the
When n is large, Fq n_k_i ~ and so the robust test statistic has been
heteroskedasticity-robust t-statistics and the t-statistics com divided by q = 2 to make the values of these two test statistics approxi
puted assuming hom oscedasticity (in italics). mately comparable.

are the RSS-based F-test statistic (which is only valid when the The param eters of this model are estim ated using O LS on the
data are hom oscedastic) and a heteroskedasticity-robust F-test transform ed d ata.8 The modified error is e-Jw,, and so the vari
statistic. Below each test statistic is its p-value. ance of the error (N/[e/w,] = a 2) is constant across observations.
The challenge with im plem enting W LS is determ ining the
The robust F-tests show a sim ilar effect: The robust te st statis
weights. O ne approach is to use a three-step procedure, which
tics are sm aller and have larger p-values than the test statistics
estim ates the model with O LS, then estim ates the weights using
assum ing that the shocks are hom oscedastic. In the te st of
the residuals, and then finally re-estimates the transform ed
the coefficients in the consum er goods portfolio, this differ
m odel. This approach is known as Feasible W LS or Feasible
ence changes the conclusion: W hereas, the standard F-test
G eneralized Least Squares (FW LS/FG LS).
indicates that the null is false, the robust version has a p-value
of 20% and so the null cannot be rejected . This te st result is
consistent with the change in the t-statistic of the coefficients
for this industry. This is a com mon pattern when relaxing the
9.5 MULTICOLLINEARITY
assum ption of hom oscedasticity; heteroskedasticity-robust
M ulticollinearity occurs when one or more explanatory variables
standard errors indicate that param eters are less precisely
can be substantially explained by the other(s). For exam ple, if
estim ated.
a model has two explanatory variables, then the variables are
m ulticollinear if the R2 in the regression of one on the other is
Approaches to Modeling Heteroskedastic very high.
Data M ulticollinearity differs from perfect collinearity, where one of
There are three common approaches used to model data with the variables is perfectly explained by the others (in which case
heteroskedastic shocks. The first (and sim plest) is to ignore the the R*2*of a regression of Xj on the remaining explanatory vari
heteroskedasticity when estim ating the param eters and then ables is 1). The rule-of-thumb is that R2 above 90% creates
use the heteroskedasticity-robust (White) covariance estim ator issues in m oderate sam ple sizes (i.e., 100s of observations).9
in hypothesis tests. This approach is frequently used because M ulticollinearity is a common problem in finance and risk man
the param eters are estim ated using O LS. However, while this agem ent because many regressors are sim ultaneously deter
method is sim ple, it often produces substantially less precise mined by and sensitive to the same news. M ulticollinearity is not
model param eter estim ates when com pared to methods that ruled out by any of the assumptions of linear regression, and so
directly address heteroskedasticity. it does not pose a technical challenge to param eter estimation
Th e second approach is to transform the data. For exam p le, or hypothesis testing. It does, however, create challenges in
it is com m on to find h e te ro ske d asticity in data th at are modeling data.
strictly p ositive (e.g., trad in g volum e or firm revenues). W hen W hen data are multicollinear, it is common to find coeffi
data are alw ays p o sitive, it is tem p ting to m odel the natu cients that are jointly statistically significant (as evidenced by
ral log of the d e p en d en t va ria b le , b ecause it can reduce or the F-statistic of the regression) and yet have small individual
even elim inate h e te ro ske d asticity and may provide a b e tte r t-statistics (e.g ., less than 1.96 in magnitude). This contradiction
d escrip tio n of the data. An altern ative is to transform the occurs because the joint statistical analysis can identify some
data by dividing the d e p e n d e n t variab le by ano ther (strictly effect from the regressors as a group (using the F-statistic) but
positive) variab le . For e xam p le , dividing the dividend per cannot uniquely attribute the effect to a single variable (using
share by the share price produces the yield of a sto ck. (N ote the t-statistics).
th at yield s are often h o m o sced astic even when d ivid en d s per
share are not.)
The final (and most com plicated) approach is to use weighted

least squares (W LS). W LS is a generalization of O LS that applies
a set of weights to the data before estim ating param eters. If
V[e,] = w fa 2 and w,- is known, then transform ing the data by e The WLS regression regresses the weighted version of Y„ namely Y„
on two explanatory variables. The first, C„ is the inverse of the weight,
dividing by w,- rem oves the heteroskedasticity from the errors.
1/w,-. The second is the weighted version of X„ namely X,-. While the WLS
W LS regresses Y/w , on X /w , so that: regression does not explicitly include an intercept, a is still interpretable
as the intercept.
Wj Wj Wj w,' 9 In a linear regression with a single explanatory variable, the R2 is the

squared correlation (p2) between the dependent and the explanatory
Y, = aC , + (3X-, + €j. (9.2) variable. An R2 — 90% corresponds to a correlation of 95%.

There are two options available to address multicollinear m easures the sensitivity of the fitted values in a regression to
explanatory variables. The first (and easiest) is to ignore the dropping a single observation j. It is defined as:
issue because it is technically not a problem . The alternative
is to identify m ulticollinear variables and to consider rem ov (9.5)
ing these from the m odel. The standard method to determ ine
whether variables are excessively m ulticollinear is to use the where Y\~^ is the fitted value of Yt when observation j is dropped
variance inflation factor. This measure com pares the variance of and the model is estimated using n — 1 observations, k is the num
the regression coefficient on an explanatory variable Xj in two ber of coefficients in the regression model, and s2 is the estimate
m odels: one that only includes Xj and one that includes all k of the error variance from the model that uses all observations.
explanatory variables:
Dj should be small when an observation is an inlier (i.e., does
Xjj — To + 7 i^ i; + • • • + 7 / - 1'X ,- i ; not disproportionately affect param eter estim ates). Values of Dj
+ 'Yj+'iXj+y, + . . . + 7 kXki + Vi (9-3) larger than 1 indicate that observation j has a large im pact on
the estim ated model param eters.
The variance inflation factor for variable j is then:
For exam ple, consider the following data:
1
(9.4)
1 - Rf
Table 9.3
where Rf com es from a regression of Xj on the other variables
in the m odel. Values above 10 (i.e., which indicate that 90% of Trial X y
the variation in Xj can be explained by the other variables in the

1 3.86 1.89
model) are considered excessive.
2 1.89 0.64
Variables with exceedingly high variance inflation factors should
3 1.38 - 0 .6 2
be considered for exclusion from the model. In practice, model
selection procedures screen out models with high VIF variables, 4 0.20 1.20
because these variables have param eters that are poorly esti 5 -0 .8 1 - 2 .4 2
mated and excluding them tends to have little im pact on the
6 1.92 0.67
fitted values from the model.
7 2.96 0.81
8 1.60 0.26
Residual Plots
9 1.46 0.72
Residual plots are standard m ethods used to detect deficiencies
10 1.52 -0 .4 1
in a model specification. An ideal model would have residuals
11 0.68 1.13
that are not system atically related to any of the included explan
atory variables. The residuals should also be relatively small in 12 0.03 - 1 .7 8
m agnitude (i.e., typically within ±4s, where s2 is the estim ated 13 1.56 1.46
variance of the shocks in the model).
14 - 0 .1 2 - 0 .6 2
The basic residual plot com pares e, (y-axis) against the real
15 7.60 1.85
ization of the explanatory variables x r An alternative uses
the standardized residuals e /s so that the m agnitude of the
deviation is more apparent. Both outliers and model sp ecifica Note that Trial 15 yields a result that is quite different from the
tion problem s (e.g ., om itted nonlinear relationships) can be rest of the data. To see whether this observation is an outlier, a
detected in these plots. model is fit on the entire data set (i.e., Y,) as well as on just the
first 14 observations (i.e., The resulting coefficients are
Outliers Yj = 1.38 + 1.04X f
and
O utliers are values that, if rem oved from the sam ple, produce
large changes in the estim ated coefficients. Cook's distance Yj-fi = 1.15 + 0.69X,

The fitted values from these models are shown in the table The variance from the full run (i.e., s2) is calculated as 2.19.
below: Applying Cook's distance yields
- V/): 4.6
Table 9.4 Di = = 1.05
ks2 2*2.19
Trial Because Dj > 1, Trial 15 can be considered an outlier.

X;
A
y, Yj-i) ( Y H 1 - Y))2
A s another exam ple, consider the industry portfolio data from
1 1.89 3.35 2.45 0.81
C hapter 8. The top panel in Figure 9.2 shows Cook's distance
2 0.64 2.05 1.59 0.21 in the three-factor m odel, where the dependent variable is the
3 - 0 .6 2 0.74 0.73 0.00 excess return on the consum er good's industry portfolio. The
plot shows that one observation has a distance larger than 1.
4 1.20 2.63 1.98 0.43
The bottom panel shows the effect of this single observation on
5 - 2 .4 2 - 1 .1 4 -0 .5 1 0.39 the estim ate of the coefficient on the size factor /§s.
6 0.67 2.08 1.61 0.22
Dropping this single influential observation changes f3s from
7 0.81 2.23 1.71 0.27
A
- 0 .3 0 to - 0 .1 8 . The standard error of f3s is 0.054, and so this

8 0.26 1.65 1.33 0.11 single observation produces a change of 2.2 standard errors.
The other two model param eters do not meaningfully change,
9 0.72 2.13 1.65 0.24
because the identified observation was not associated with a
10 -0 .4 1 0.96 0.87 0.01 large deviation in the other explanatory variables.
11 1.13 2.56 1.93 0.40
12 - 1 .7 8 - 0 .4 7 - 0 .0 7 0.16
9.6 STRENGTHS O F OLS
13 1.46 2.90 2.15 0.56
14 - 0 .6 2 0.74 0.73 0.00 Under assum ptions introduced in Chapter 7, O LS is the Best
Linear Unbiased Estim ator (BLU E). Best indicates that the O LS
15 1.85 3.31 2.42 0.79
estim ator achieves the sm allest variance among any estim ator
Sum 4.60
that is linear and unbiased.
OLS IS BLUE
/v
O LS is a linear estim ator because both a and f3 can be w rit and so also do not depend on Y.
ten as linear functions of Y„ so that:
O LS is the best estim ator in the sense that any other LUE
must have larger variances. The intuition behind this con-
b= i ; w ,y , elusion follows from a result that any other LUE (3 can be
/=i A A A /\
expressed as j8 + 6, where [3 is the O LS estim ator and 8 is

The weights in a linear estim ator cannot depend on the Y„
A
uncorrelated with f3. The variance of any other LUE estim ator
but they may depend on other variables. is then:
Note that the weights in the O LS estim ator of (3: V[j8] = V[jS] + V[6]
This expression is minimized when V[<5] = 0. Because this

only happens when (3 is the O LS estimator, any other estim a
2 " = ,( x - * ):
tor would have a larger variance. This result is known as the
depend only on the values of X. Gauss-M arkov Theorem , which is named after m athem ati
cians Carl Friedrich Gauss and Andrey Markov.
M eanwhile, the weights in the O LS estim ator of a are

Cook's Distance
1992 1996 2000 2004 2008 2012 2016
Effect of Identified Outlier

201 Without, ps = -0.301
All, 5S = -0 .1 8 0
10
0
cc
I
o
Q_
- 10 -
-1 5 -1 0 0 -5 5 10 15 20
Size Factor
Fiaure 9.2 The top panel shows the values of Cook's distance in a regression of the excess return on the
consumer good industry portfolio on the three factors, the excess return on the market and the returns on the size
and value portfolios. The bottom panel shows the effect on the estimated coefficient on the size factor when the
outlier identified by Cook's distance is removed.
W hile BLU E is a strong argum ent in favor of using O LS to esti If e(- ~ N[0,(t 1
21
), then O LS is the Best Unbiased Estim ator (BU E).
mate model param eters, it com es with two im portant caveats. That is, the assumption that the errors are normally distributed
strengthens B LU E to BU E so that O LS is the Best (in the sense
1. First, the BLU E property of O LS depends crucially on the
of having the sm allest variance) Unbiased Estim ator among all
hom oscedasticity of the residuals. W hen the residuals have
linear and nonlinear estim ators. This is a com pelling justification
variances that change with X (as is common in financial
for using O LS when the errors are iid normal random
data), then it is possible to construct better LUEs of a and (3
variab le s.11
using W LS (under some additional assumptions).
2 . Second, many estimators are not linear. For exam ple, many However, it is not the case that O LS is only applicable when the
maximum likelihood estimators (MLEs) are nonlinear.10 M LEs errors are normally distributed. The only requirem ents on the
have desirable properties and generally have the smallest errors are the assum ptions that the residuals have conditional
asym ptotic variance of any consistent estimator. M LEs are, mean zero and that there are no outliers. Normality of the errors
however, generally biased in finite samples and so it is not

possible to compare them to O LS in fixed sample sizes.
11 The standard test of normality is the Jarque-Bera statistic, which tests

two properties of normal random variables: that the skewness is zero
n
Maximum likelihood estimators are another class of estimators. and the kurtosis is 3.
1

is n o t required for a and /3 to accurately estim ate the population variance inflation factor is a sim ple m ethod to check for excess
values or for the estim ators to have desirable properties (e.g ., collinearity and provides guidance about which variables can
consistency or asym ptotic normality). be rem oved.
Second, the im pact of outliers should be recognized. Graphical

tools provide a sim ple method to inspect residuals for outliers or
9.7 SUMMARY m isspecification. Cook's distance provides a numerical measure
of the outlyingness of a data point. If outliers are detected, the
Developing a model requires making choices that reflect a bias- validity of the data point should be considered. In applications
variance tradeoff. Large m odels are less likely to omit im portant in finance and risk m anagem ent, especially when modeling
explanatory variables, and thus, param eter estim ators are less return data, dropping these points is risky because the return
biased. However, large models are also more likely to include occurred in the past and sim ilar returns may occur in the future.
irrelevant variables that reduce the precision of the coefficient
The chapter concludes by describing the conditions where O LS
estim ators. Large models are also more likely to contain vari
is a desirable m ethod to estim ate model param eters. W hen
ables that are redundant (i.e., whose effects can be explained by
five key assum ptions are satisfied, the O LS param eter estim a
other variables in the model). General-to-specific model selec
tor is the Best Linear Unbiased Estim ator (BLU E), which pro
tion and cross-validation are two methods that account for the
vides a strong justification for choosing O LS over alternative
bias-variance tradeoff to select a single specification.
m ethods. W hen the residuals are also normally distributed, the
An accurate model should have several desirable features. B LU E property is strengthened so that O LS is the BU E (best-
First, included variables should not be excessively collinear. The unbiased estim ator).

QUESTIONS

9.1. Define hom oscedasticity and heteroskedasticity. When irrelevant regressors are included when the regressors are
might you exp ect data to be hom oscedastic? uncorrelated?
9.2. W hat conditions are required for a regression to suffer 9.5. W hy does a variable with a large variance inflation factor
from om itted variable bias? indicate that a model may be im proved?
9.3. W hat are the costs and benefits of dropping highly collin- 9.6. The sam ple mean is an O LS estim ator of the model
ear variables? Yj = a + 6,. W hat does the BLU E property imply about
the mean estim ator?
9.4. If you use a general-to-specific model selection with a
test size of a, what is the probability that one or more 9.7. W hat is the strongest justification for using O LS to esti
mate model param eters?
Practice Questions
9.8. In a model with a single explanatory variable, what value there are nonlinear effects of some of these variables,
of the R2 in the second step in a test for heteroskedasticity and so use a R E S E T test including both the squared and
indicates that the null would be rejected for sam ple sizes cubic term . The R2 of the original model is 68.6% , and the
of 100, 500, or 2500? (Hint: Look up the critical values of a R2 from the model that includes both additional term s is
Xq, where q is the number of restrictions in the test.) 68.9% . You have 456 observations. W hat do you conclude
about the specification of the model?
9.9. Suppose the true relationship between Y and two explan
atory variables is Yj = 2 + 1.2X-I,- — 2.1X2/ + e/- 9.12. You are investigating the determ inants of book leverage
of firms. Your model includes an intercept and four addi
a. W hat is the population value of (3-\ in the regression
tional variables: the natural logarithm of sales revenue, the
Yj = a + (3-jX-1 + €j if pxiX2 = 0-6 and ox = 1 and
book-to-market ratio, the ratio of EBlTD A-to-book assets,
<>i = 1 / 2 ?
and Net property-plant-and-equipm ent (Net PPE). Your
b. W hat value of pXlx2 would make f3-\ = 0?
initial model groups firm s in all industries. Suppose you
9.10. In a model with two explanatory variables,
want to test w hether the four slope coefficients are con
Y-, = a + jS-|X1( + /32X 2; + €„ what does the correlation
stant across energy sector firms.
need to be between the two regressors, X | and X 2, for the
variance inflation factor to be above 10? a. W hat additional model would you estim ate to im ple
ment the Chow test?
9.11. You are interested in understanding the determ inants of
b. If your sam ple has 433 observations, what is the distri
the yield spread of corporate bonds above a maturity
bution of the Chow test?
matched sovereign bond. You include three explanatory
c. If the R2 in the original specification is 16.6% , how
variables: the leverage defined as the ratio of long-term
large would the R2 have to be in the regression with
debt to the book value of assets, a dummy variable for
the dummy interactions for the null to be rejected
high yield, and a measure of the volatility of the profit
using a 5% test size?
ability of the issuer. You are interested in testing whether

9.13. G iven the data set: where the data was used for fitting the model versus the
blue indicating the data that was held back):
Y X 1 X2
M1 m2 m3
1 -2 .3 5 3 -0 .4 0 9 -0 .0 0 8
2 -0 .1 1 4 0.397 -1 .2 1 6 1 2.1 0.6 1.1
3 -1 .6 6 5 -0 .8 5 6 -0.911 2 -0.1 - 0 .4 2.5
4 -0 .3 6 4 1.682 0.366 3 - 0 .8 -0.1 1
5 -0.081 0.455 -0 .6 3 9 4 - 2 .6 - 0 .5 -0.1
6 -0 .7 3 5 - 1 .3 9 -1 .0 8 6 5 -1 .1 - 0 .6 1.3
7 -2 .5 0 7 0.954 0.67 6 - 0 .4 0.9 0.5
8 -1 .1 4 4 1.021 0.238 7 0.6 0 0
9 -2 .4 1 9 -0 .1 5 6 -0 .0 5 5 8 0.9 -1 - 0 .4
10 -3.151 1.382 1.148 9 1.2 0.8 -0.1
11 -2 .0 8 5 -0 .5 6 2 -0 .1 3 5 Using the principles of m-fold cross validation, which
12 -2 .9 7 2 -1 .5 5 4 -0 .2 9 9 model should be selected?
13 -0 .6 3 3 -1 .1 2 3 -1 .0 2 7 9.15. Consider the following data provided in the table and

plotted below:
14 -2 .6 7 8 -0 .1 2 4 0.331
15 -7 .0 9 5 0.284 2.622 Y X
a. Find the param eters for the model: 1 -1 0 .4 2 -5

Y = a + /3-|X-|. 2 -1 4 .1 2 -4
b. Find the param eters for the model: 3 -1 4 .7 2 -3
Y = a + ^2X 2 -
4 -8 .2 5 -2
c. Using the results from part a and part b, and the corre
5 2.3 -1
lation between the two explanatory variables, estim ate
the param eters for the full model: 6 1.59 0
Y = a + f a X y + ^2X 2 . 7 5.87 2
d. C heck the answer to part c by using the Excel function 8 14.13 3
LIN EST or the Excel regression macro.
9 6.78 4
9.14. A given set of data was divided into three equal parts.
10 32 1
Three separate models were developed— each model
using two of the three parts for fitting. Errors were cal
culated for each model. The diagram below shows the
standard errors for each model run (the orange highlights
Chapter 9 Regression Diagnostics 151

-10
i----------------------1---------------------- 1---------------2 0 — a—
M anagem ent is concerned that the last data point in the

table is an outlier.
a. Calculate Cook's Distance for the last data point.

Use built-in regression functions in Excel to assist as
needed, such as LINEST.
b. W hat is the interpretation of this result? W hat does it
say about the status of the last data point as an outlier

ANSW ERS

9.1. Hom oscedasticity is a property of the model errors able to clearly attribute m ovem ent in the left-hand-side
where they have the same variance. Heteroskedasticity is variable to the remaining variables.
a property of the errors where their variance changes sys 9.4. If the regressors are uncorrelated, then their joint dis
tem atically with the explanatory variables in the model. tribution has no correlation, and so the probability that
Experim ental data are highly likely to be hom oscedastic. one-or-more irrelevant regressors is included is 1 minus
In general, the more hom ogeneous the data, the more the probability that none are. The probability of includ
likely the errors will be hom oscedastic. In finance, we
ing a single irrelevant regressor is 1 — a, where a is the
often use data with substantially different scales, for test size. This is the definition of the size of a test. The
exam ple, corporate earnings or leverage ratios. This het probability that no irrelevant regressors are included is
erogeneity is frequently accom panied by heteroskedas
(1 — a)p, where p is the number of variables considered.
ticity in model errors.
9.5. W hen a variable has a large VIF, it is highly correlated
9.2. There are two conditions. First, the variable must be with the other variables in the model. The param eter on
om itted in the sense that it has a non-zero coefficient in
this variable is usually poorly estim ated, and so dropping
the correct model. Second, the om itted variable must be it tends to have little im pact on the fit of the model.
correlated with an included variable. If the X variables are
9.6. The sam ple mean estim ator is the BLU E because it is a
uncorrelated, then the coefficients can be estim ated con
special case of O LS if the data are hom oscedastic.
sistently even if there is an om itted variable.
9.7. W hen the errors are iid normally distributed, then the
9.3. The cost is that the model may be m isspecified if the
O LS estim ator of the regression param eters {a and (3, but
dropped variable should be in the true model. The ben
not s2) is an M VUE (Minimum Variance Unbiased Estim a
efit is that the param eter estim ates of the remaining
tor), and so there is not a better estim ator available.
variables will improve because the O LS estim ator will be
Solved Problems
9.8. W hen there is a single explanatory variable in the original b. The om itted variable formula shows that when a vari
m odel, the auxiliary regression used for W hite's test will able is excluded, the coefficient on the included vari
have an intercept and two explanatory variables— the ables is /3-i + y/32, where y is the population regression
original variable and the variable squared. W hite's test coefficient from the regression X 2( = 8 + y X i; + I?;.
statistic is nR2, and the null has a X 2 distribution. When This coefficient depends on the correlation between
using a test with a 5% size, the critical value is 5.99. Solv the two and the standard deviations of the two vari-
5.99 ox, VT72
ing for the R2, R2 < . The maximum value of R2 that ables, so that y = pXlx2— = 0-6 = 0.424. The
n
would not reject is 0.0599, 0.011, and 0.0023 when n is coefficient is then 1.2 — 2.1 X 0.424 = 0.309.
100, 500, and 2500, respectively.
p X V T X VTT2
Solving 1.2 — 2.1 X 0 so that
9.9. a. Using the om itted variable form ula, when 1
X 2 is om itted from the regression,
if p = ---------- — = 0.522 then 13i would be 0 in the
<0 /0.6 x VT x V T 2 \ 2.1 X V T 2
f t = 1.2 - 2.1 x ( ----------- ------------ j = - 0 .1 8 ,
om itted variable regression.
, 0.6 x VT x VT2 . .
w h e re ----------- ------------ is the regression slope from
a regression of X 2 on X-|.

1 the slope coefficient to vary in the energy industry, so

9.10. The variance inflation factor i s ------- t t , where R? mea-
1 - R? 1 that the extended model is
sures how well variable j is explained by the other Yj = a + (3s \n SALES, + [3bB T M ) + (3EEBITDA , + foN etPPE }
variables in the model. Here there are two variables, y si E n e r g y ^ l^ S A L E S j + y BlE n e r g y ^ B TM j + y ElE n e r g y
and so Rf = Px,x2- The variance inflation factor of 10 X EBITD A i + yplEnergy X N etPPE, + e„
solves 10 = 1/(1 - p^x2), so that (1 - px,x2) = 1/10 or
where l Energy is a dummy that takes the value 1 if a firm
Px,x2 = 1 — 1/10 = 0.9. Correlations greater than (in
is in the energy industry.
absolute value) V (T 9 ~ 0.948 would produce values
above 10. b. The Chow test is an F-test with degrees of freedom
equal to the number of coefficients being tested (four
9.11. The R ES ET test exam ines whether the two additional
here, because there are four interaction terms) and
explanatory variables that squared and cubed fitted
the number of observations minus the number of
values have zero coefficients. It is im plem ented using an
coefficients in the extended model (433 - 9 here), so
F-test:
that the test statistic has a F4 424 distribution.
0.689 - 0.686 /1 - 0.689 \
F 2,450 c. The F-test statistic is
V 456 - 6 )
The F-test exam ines the difference between the R2 in the

two m odels. The critical value for an F2450 is 3.01 (F.INV.
The critical value is 2.392 (F.IN V.RT(0.05,4,424)). The
RT(0.05,2,450) in Excel). The value of the test statistic is
value of Rfj can be found by trial-and-error, and values
2.17, which is less than the critical value, and so the null
equal to 18.5% or larger lead to rejection.
that the coefficient on the squared and cubic term s is 0 is
not rejected. 9.13. a. Using the standard form ula:
C o v iY .X ,)
9.12. a. The original model is Y, = a + /%In S A L E S , + (3b BTM ,
0i = v w
+ (3e EB\TD A, + PpNetPPEj + er The Chow test allows
Y X1 (Y -Y ) ( X ,—X . H Y - Y ) ( X ,- X ,) 2
1 -2 .3 5 3 -0 .4 0 9 -0 .4 0 9 -0 .3 5 3 0.145 0.167
2 -0 .1 1 4 0.397 0.397 1.886 0.749 0.158
3 -1 .6 6 5 -0 .8 5 6 -0 .8 5 6 0.335 -0 .2 8 7 0.733
4 -0 .3 6 4 1.682 1.682 1.636 2.751 2.829
5 -0.081 0.455 0.455 1.919 0.873 0.207
6 -0 .7 3 5 - 1 .3 9 -1 .3 9 0 1.265 -1 .7 5 8 1.932
7 -2 .5 0 7 0.954 0.954 -0 .5 0 7 -0 .4 8 4 0.910
8 -1 .1 4 4 1.021 1.021 0.856 0.874 1.042
9 -2 .4 1 9 -0 .1 5 6 -0 .1 5 6 -0 .4 1 9 0.065 0.024
10 -3.151 1.382 1.382 -1.151 -1.591 1.910
11 -2 .0 8 5 -0 .5 6 2 -0 .5 6 2 -0 .0 8 5 0.048 0.316
12 -2 .9 7 2 -1 .5 5 4 -1 .5 5 4 -0 .9 7 2 1.511 2.415
13 -0 .6 3 3 -1 .1 2 3 -1 .1 2 3 1.367 -1 .5 3 5 1.261
14 -2 .6 7 8 -0 .1 2 4 -0 .1 2 4 -0 .6 7 8 0.084 0.015
15 -7 .0 9 5 0.284 0.284 -5 .0 9 5 -1 .4 4 7 0.081
Average -2 .0 0 0 0 .0 0 0 0 .0 0 0 0.933

Therefore,
0.000
0.000
0.933
And, of course,
a = Y - f r X , = - 2 .0 0 0
Y = -2.000
b. Proceeding as in part a gives:
Y X2 (X 2 - X 2) (Y -Y ) (X 2 - X 2) CY - Y ) (X 2 - x 2f
1 -2 .3 5 3 -0 .0 0 8 -0 .0 0 8 -0 .3 5 3 0.003 0.000
2 -0 .1 1 4 -1 .2 1 6 -1 .2 1 6 1.886 -2 .2 9 3 1.478
3 -1 .6 6 5 -0.911 -0.911 0.335 -0 .3 0 5 0.830
4 -0 .3 6 4 0.366 0.366 1.636 0.599 0.134
5 -0.081 -0 .6 3 9 -0 .6 3 9 1.919 -1 .2 2 6 0.408
6 -0 .7 3 5 -1 .0 8 6 -1 .0 8 6 1.265 -1 .3 7 3 1.179
7 -2 .5 0 7 0.67 0.670 -0 .5 0 7 -0 .3 4 0 0.449
8 -1 .1 4 4 0.238 0.238 0.856 0.204 0.057
9 -2 .4 1 9 -0 .0 5 5 -0 .0 5 5 -0 .4 1 9 0.023 0.003
10 -3.151 1.148 1.148 -1.151 -1 .3 2 2 1.318
11 -2 .0 8 5 -0 .1 3 5 -0 .1 3 5 -0 .0 8 5 0.012 0.018
12 -2 .9 7 2 -0 .2 9 9 -0 .2 9 9 -0 .9 7 2 0.291 0.089
13 -0 .6 3 3 -1 .0 2 7 -1 .0 2 7 1.367 -1 .4 0 4 1.055
14 -2 .6 7 8 0.331 0.331 -0 .6 7 8 -0 .2 2 5 0.110
15 -7 .0 9 5 2.622 2.622 -5 .0 9 5 -1 3 .3 6 0 6.875
A ve ra g e -2 .0 0 0 0 .0 0 0 -1 .3 8 1 0.934
- 1 .3 8 1
- 1 .4 7 9
0.934
a = Y - (32X 2 = - 2 .0 0 0
y = -2 .0 0 0 - 1 . 4 7 9 X 2
c. The key statem ent from the chapter is "the estim ator
/3-| converges to (3-\ + (328 "
So, part a gives that:
/3i + (328^ = 0
And part b gives that:
(32 + (3-\82 —1.479

X1 X2 (X2- X 2) (X1- X 1)(X2- X 2) (X, - X, )2 (X2 - X2)2
1 -0 .4 0 9 -0 .0 0 8 -0 .4 0 9 -0 .0 0 8 0.003 0.167 0.000
2 0.397 -1 .2 1 6 0.397 -1 .2 1 6 -0 .4 8 3 0.158 1.478
3 -0 .8 5 6 -0.911 -0 .8 5 6 -0.911 0.780 0.733 0.830
4 1.682 0.366 1.682 0.366 0.616 2.829 0.134
5 0.455 -0 .6 3 9 0.455 -0 .6 3 9 -0.291 0.207 0.408
6 - 1 .3 9 -1 .0 8 6 -1 .3 9 0 -1 .0 8 6 1.510 1.932 1.179
7 0.954 0.67 0.954 0.670 0.639 0.910 0.449
8 1.021 0.238 1.021 0.238 0.243 1.042 0.057
9 -0 .1 5 6 -0 .0 5 5 -0 .1 5 6 -0 .0 5 5 0.009 0.024 0.003
10 1.382 1.148 1.382 1.148 1.587 1.910 1.318
11 -0 .5 6 2 -0 .1 3 5 -0 .5 6 2 -0 .1 3 5 0.076 0.316 0.018
12 -1 .5 5 4 -0 .2 9 9 -1 .5 5 4 -0 .2 9 9 0.465 2.415 0.089
13 -1 .1 2 3 -1 .0 2 7 -1 .1 2 3 -1 .0 2 7 1.153 1.261 1.055
14 -0 .1 2 4 0.331 -0 .1 2 4 0.331 -0.041 0.015 0.110
15 0.284 2.622 0.284 2.622 0.744 0.081 6.875
Average 0 .0 0 0 0 .0 0 0 0.467 0.933 0.934
C o v [X i,X 2] 0.467 Solving yields that

0.501
V [X ,] 0.933 f t = 0.989 f t = - 1 .9 7 3 3
and
And plugging these back into the equation to find
C o v[X lfX 2] 0.467 alpha:
So = = 0.500
V [X 2] 0.934
Y = a + ftX-| + f t X 2 —2.000
Note: In this case these quantities are essentially
equal, but that will not usually be the case. Y = - 2 .0 0 0 + 0.989X-, - 1.9733X2
Plugging back in the 2 X 2 system of equations: d. All approaches should provide the same answer.
f t + 0.501 f t = 0
0 .5 0 0 ft + f t = - 1 .4 7 9

9.14. T h e fi rst task is to calculate the squared residuals:
M1 M2 M3 M1 M2 M3
1 2.1 0.6 1.1 4.41 0.36 1.21
2 -0.1 0.4 2.5 0.01 0.16 6.25
3 - 0 .8 0.1 1 0.64 0.01 1
4 - 2 .6 0.5 -0.1 6.76 0.25 0.01
5 -1 .1 - 0 .6 1.3 1.21 0.36 1.69
6 -0.4 - 0 .9 0.5 0.16 0.81 0.25
7 0.6 0 0 0.36 0 0
8 0.9 -1 - 0 .4 0.81 1 0.16
9 1.2 0.8 -0.1 1.44 0.64 0.01
T O T A L RSS 15.8 3.59 10.58
C V RSS 5.06 1.42 0.17
The model selected is the one that has the sm allest RSS Y1 is the estim ate using the entire data set
within the blue out-of-sample boxes— this is M3.
Y2 is the estim ate using the first nine observations
9.15. a. Following the m ethodology of the first exam ple:
So the Cook's distance for the last data point is
106.552
Alpha Beta 0.61
2 *8 7 .7 8 3
Entire 3.26 3.49 b. The last data point does not have a major influence on
First nine 0.10 2.96 the O LS param eters. However, it may still be an out
lier, and other tests would need to be conducted to
Com pleting the table: draw such a conclusion.
Y X Y1 Y2 (Y1 —Y2)A2 Y 1 -Y
1 -1 0 .4 2 -5 -14.181 -1 4 .7 0 7 0.277 -3.761
2 -1 4 .1 2 -4 -1 0 .6 9 2 -1 1 .7 4 5 1.107 3.428
3 -1 4 .7 2 -3 -7 .2 0 4 -8 .7 8 3 2.491 7.516
4 -8 .2 5 -2 -3 .7 1 6 -5.821 4.428 4.534
5 2.3 -1 -0 .2 2 8 -2 .8 5 8 6.919 -2 .5 2 8
6 1.59 0 3.260 0.104 9.963 1.670
7 5.87 2 10.236 6.028 17.713 4.366
8 14.13 3 13.724 8.990 22.418 -0 .4 0 6
9 6.78 4 17.213 11.952 27.676 10.433
10 32 1 6.748 3.066 13.561 -2 5 .2 5 2
Sum 106.552
SumSq/10 87.783

Learning Objectives
D escribe the requirem ents for a series to be Define and describe the properties of autoregressive
covariance-stationary. moving average (ARM A) processes.
Define the autocovariance function and the autocorrela Describe the application of A R, M A, and A R M A processes.
tion function (A CF).
Describe sam ple autocorrelation and partial
Define white noise, describe independent white noise and autocorrelation.
normal (Gaussian) white noise.
Describe the Box-Pierce Q -statistic and the Ljung-Box Q
Define and describe the properties of autoregressive (AR) statistic.
processes.
Explain how forecasts are generated from A R M A models.
Define and describe the properties of moving average
(MA) processes. Describe the role of mean reversion in long-horizon
forecasts.
Explain how a lag operator works.
Explain how seasonality is modeled in a covariance
Explain mean reversion and calculate a mean-reverting stationary A RM A .
level.
159
Tim e-series analysis is a fundam ental tool in finance and risk (t) (i.e., so that Ys is observed before Yt w henever s < t). The
m anagem ent. Many key tim e series (e.g ., interest rates and ordering of a tim e series is im portant when predicting future
spreads) have predictable com ponents. Building accurate m od values using past observations.
els allows past values to be used to forecast future changes in
Stochastic processes can describe a wide variety of
these series.
data-generating m odels. One of the sim plest useful stochastic
A tim e series can be decom posed into three distinct com po processes is Yt ~ N(/z, a 2), which is a reasonable description
nents: the trend, which captures the changes in the level of the of the data-generating process for returns on some financial
time series over tim e; the seasonal com ponent, which captures assets.
predictable changes in the tim e series according to the tim e of
A leading exam ple of a more com plex stochastic process
year; and the cyclical com ponent, which captures the cycles in
(and one that is a focus of this chapter) is a first-order
the data. W hereas the first two com ponents are determ inistic,
autoregression (AR):
the third com ponent is determ ined by both the shocks to the
process and the memory (i.e., persistence) of the process. Yt = s + <f>Y( _ 1 + etl (10.1)
This chapter focuses on the cyclical com ponent. The proper where 8 is a constant, </>is a model param eter measuring the
ties of this com ponent determ ine w hether past deviations from strength of the relationship between two consecutive observa
trends or seasonal com ponents are useful for forecasting the tions, and et is a shock.
future. The next chapter extends the model to include trends This first-order AR process can describe a wide variety of
and determ inistic seasonal com ponents and exam ines methods
financial and econom ic tim e series (e.g ., interest rates, com
that can be used to remove trends. modity convenience yields, or the growth rate of industrial
This chapter begins by introducing tim e-series and linear production).
processes. This class of processes is broad and includes most
Figure 10.1 shows exam ples of tim e series that are produced by
common tim e-series m odels. It then introduces a key con
stochastic processes with very different properties. The top-left
cept in tim e-series analysis called covariance stationarity. A panel shows the monthly return on the S&P 500 index. This time
time series is covariance-stationary if its first two moments do
series appears to be mostly noise and so is hard to predict. The
not change across tim e. Im portantly, any time series that is top-right panel contains the VIX index, which appears to be very
covariance-stationary can be described by a linear processes.
slowly mean-reverting (i.e., deviations from the long-run average
W hile linear processes are very general, they are also not are tem porary but long-lived).
directly applicable to m odeling. Instead, two classes of models
The bottom panels show two im portant measures of interest
are used to approxim ate general linear processes: autoregres rates: the slope and curvature of the US Treasury yield curve.
sions (AR) and moving averages (M As). The properties of these
The slope is defined as the difference between the yield on a
processes are exploited when modeling data by comparing the ten-year bond and that on a one-year bond. The curvature is
theoretical structure of the different model classes to sample
defined as the difference between two slopes:
statistics. These inspections serve as a guide when selecting a
m odel. Model selection is refined using information criteria (1C) (V10 - Ys) - (V5 - Vi),
that form alize the tradeoff between bias and variance when where Ym is the yield on a bond with a maturity of m years. Both
choosing a specification. This chapter covers the two most series are mean-reverting and have im portant cyclical com po
widely used criteria. nents that are related to the state of the econom y.
The chapter concludes by exam ining how these models are used This chapter focuses exclusively on linear processes. Any linear
to produce out-of-sample forecasts and how ARs and MAs can process can be written as:
be adapted to tim e series with seasonal dynamics.
Yt = 5t + if/0et + if/ - + if/2e t- 2 + " -
oc
= s t + 00.2)
10.1 STOCHASTIC PROCESSES
i= 0
This process is linear in {et}, which is a mean zero stochastic pro

Stochastic processes are ordered collections of random vari cess commonly referred to as the shock. In addition, the process
ables. They are denoted using {Y t}, reflecting the fact that they 8t is determ inistic and the coefficients on the shocks, if/j, are
are sequences of random variables that are ordered in time constant.

Return on the S&P 500 CBOE VIX Index
60
50 -
1960 1970 1980 1990 2000 2010 1995 2000 2005 2010 2015
Slope of the Yield Curve Curvature of the Yield Curve

4 -I 2
1960 1970 1980 1990 2000 2010 1960 1970 1980 1990 2000 2010
Fiaure 10.1 These plots show examples of time series. The top-left panel shows the monthly return
on the S&P 500. The top-right panel shows the end-of-month value of the VIX index. The bottom panels
show the slope and curvature of the US Treasury yield curve.
This chapter only studies models where 8t = 8 for all t, so that 10.2 COVARIANCE STATIONARITY
the determ inistic process is constant. The next chapter relaxes
this restriction to include m odels where 8t explicitly depends on The ability of a model to forecast a tim e series depends crucially
tt o account for both tim e trends and seasonal effects. on w hether the past resem bles the future. Stationarity is a key
This chapter limits attention to linear processes for two reasons. concept that form alizes the structure of a tim e series and justi
First, a wide range of processes, even nonlinear processes, have fies the use of historical data to build m odels. Covariance sta
linear representations. Second, linear processes can be directly tionarity depends on the first two moments of a tim e series: the
related to linear m odels, which are the workhorses of tim e-series mean and the autocovariances.
analysis. Linear models are the most w idely used class of models Autocovariance is a tim e-series specific concept. The autocovari
for modeling and forecasting the conditional mean of a process, ance is defined as the covariance between a stochastic process
defined as: at different points in tim e. Its definition is the tim e-series analog
E t n + iM of the covariance between two random variables. The hth auto

covariance is defined as:
T t is known as the information set and includes everything
known at tim e t, including the history of Y (Yt, Yt_ 1, Yt_ 2, ....). 7th = E[(Vt - E[Yt])(Yt_ h - E[Yt_ h])] (10.3)
Chapter 10 Stationary Time Series ■ 161

Autocovariance is denoted using y, where the subscripts denote The autocorrelation function (ACF) is similarly defined using the
the period (i.e., t) and the lag (i.e., h) between observations. autocorrelations:
W hen h = 0 then:
p(h) = P\h\ (10.6)
n o = E[(Yt - E[ yt])*
12]3, The A C F is typically paired with a closely related measure, called
which is the variance of Yt. the partial autocorrelation function (PACF). The partial auto
correlation between Yt and Yt_h measures the strength of the
This new measure is needed to define the properties of a
correlation between these two values controlling for the values
covariance-stationary time series. A tim e series is covariance
between them (i.e., Yt_-|, Yt_ 2, . . . Yt_ h+1).
stationary if its first two moments satisfy three key properties.
The partial autocorrelation is a nonlinear transform ation of the
1 . The mean is constant and does not change over tim e (i.e.,
A C F and is widely used in model selection. It is identical to
E[Yt] = fx for all t).
the autocorrelation at the first lag because there are no values
2 . The variance is finite and does not change over tim e (i.e., between Yt and Yt_ 1f but it differs at all other lags. It is common
V[Yt] = yo < 00 for all t). to use a(h) to represent this function.
3 . The autocovariance is finite, does not change over tim e,
and only depends on the distance between observation h
(i.e., C ov[Yt, Yt_h] = yh for all t).
10.3 WHITE NOISE
Covariance stationarity is im portant when modeling and
forecasting tim e series. W hite noise is the fundam ental building block of any tim e-series
• A covariance-stationary tim e series has constant relationships m odel. A white noise process is denoted
across tim e. et ~ WN(0, a 2)
• Param eters estim ated from non-stationary tim e series are
where a 2 is the variance of the shock. Note that any white noise
more difficult to interpret.
process is covariance-stationary because the first two moments
A constant relationship across tim e allows historical data to be are tim e-invariant, and the variance is finite.
used to estim ate models that are applicable to future, out-of-
Shocks from a white noise process are used to simulate data.
sam ple observations. M eanwhile, non-stationary time series
W hite noise processes {et} have three properties.
can be difficult to interpret because their estim ated param eters
1 . Mean zero (i.e., E[et] = 0). This property is a convenience,
are not asym ptotically normally distributed. Furtherm ore, non-
because any process with an error that has non-zero mean
stationary tim e series are subject to spurious relationships where
can always be defined in term s of a mean-zero error.
unrelated series appear to have strong, statistically significant
correlations. 2. Constant and finite variance (i.e., V[e t] = a 2 < oc ). This is a
technical assumption that is needed for the third property.
When { Y J is covariance-stationary (i.e., the autocovariance does
not depend on time), the autocorrelation is defined as the ratio: 3 . No autocorrelation or autocovariance (i.e.,
Cov[et, et_h] = E[et et_h] = 0 for all h 0).
Cov[ y „ y t-i,] n
Ph /------------- /----- (10.4) The lack of correlation is the essential characteristic of a white
W [ y m Yh J v W , ro
noise process and plays a key role in the estimation of time-
Autocorrelations, like correlations, are always between —1 and 1, series model param eters. Maintaining this assumption forces all
inclusive. The autocovariance function is a function of h that autocorrelation in a tim e series to be driven by model param e
returns the autocovariance between Yt and Yt_ h ters and not shocks. Testing whether estim ated m odels produce
shocks that are consistent with this property is a critical step in
y(h) = rm (10.5) validating the specification of a model.
This function is only well defined for covariance-stationary pro
Note that any distribution that has a finite variance and non-zero
cesses. The sym m etry follows from the third property of covari
mean can be transform ed into a white noise process by sub-
ance stationarity, because the autocovariance depends only on iid
tracting the mean. For exam ple, if r\t ~ , then et = rjt — v has
h and not t, so that:
mean zero and finite variance, so is a white noise process that
C o v[Yt, Yt_ h] = C o v[Yt+h, Yt] has a positive skew.

Independent, identically distributed random variables are a sim where ip0, i/q, ... are constants, {et} is a white noise process,
ple but im portant special case of white noise processes. Any iid ip0 = 1, and < 00. Wold's theorem also states that this
sequence that has mean zero and finite variance is white noise representation of a covariance-stationary process is unique.
[e.g ., et ~ G(0, cr2), where G(0, a 2) is a distribution function1 of a
random variable that has mean zero and variance a 2]. Gaussian
white noise is a special case of iid noise, where et ~ N(0, cr2).
10.4 AUTOREGRESSIVE (AR) MODELS
W hite noise processes do not require a specific distributional
assum ption, and so assuming that model shocks are normally Autoregressive models are the most widely applied tim e-series
distributed is a stronger assumption than is required in time- m odels in finance and econom ics. These models relate the cur
series analysis. W hile convenient, this assumption is also often rent value of a stochastic process (i.e., Yt) to its previous value
em pirically refuted by financial asset data. (i.e., Tt- i). Recall that a first order AR process, which can be
denoted by AR(1), evolves according to:
Dependent White Noise Yt = 6 + 4>Yt- 1 + et,
W hite noise is uncorrelated across tim e but is not necessarily where 8 is called the intercept, cp is the AR parameter, and the
independent. D ependent white noise is particularly im portant in shock et ~ WN[0, cr2).
finance and risk m anagem ent because asset returns are unpre The AR param eter determ ines the persistence of Yt. This means
dictable but have persistent time-varying volatility. that an AR(1) process is covariance-stationary when \<f)\ < 1 and
D ependent white noise relaxes the iid assumption while main non-stationary when cp = 1. This chapter focuses exclusively on
taining the three properties of white noise. A leading exam ple the covariance-stationary case and leaves the treatm ent of non-
of a dependent white noise process is known as an A utore stationary tim e series to the next chapter.
gressive Conditional H eteroskedasticity (ARCH) process. The Because {Y t} is assumed to be covariance-stationary, the mean,
variance of a shock from an A RCH process depends on the variance, and autocovariances are all constant. Note that the
m agnitude of the previous shock. This process exhibits a prop mean of Yt is not 8 - it depends on both 8 and cp.
erty called volatility clustering, where volatility can be above or
Using the property of a covariance-stationary tim e series that:
below its long-run level for many consecutive periods. Volatility
clustering is an im portant feature of many financial tim e series, E[Yt] = E[YM ] = /x
especially asset returns. The dependence in an A RCH process the long-run (or unconditional) mean is
leads it to have a predictable variance but not a predictable
E[Yt] = 5 + cf>E[Yt_ ,] + E[et]
mean, and shocks from an A RCH process are not correlated
H = 8 + cp/x + 0
across tim e.
/X(1 - 0 = 5
Wold's Theorem M= —!
1 - (f)
(10.8)
Thus, the long-run mean depends on both param eters in the

Wold's theorem establishes the key role of white noise in any
model.
covariance-stationary process. It also provides an im portant
justification for using linear processes to model covariance The variance of Yt also depends on cp:
stationary tim e series. If { Y J is a mean-zero covariance-stationary V[Yt] = V[5 + <£Yt_-| + et]
process, then:
= <p2\/[ Yt-, ] + V[£t] + C o v[Y^ 1f e j
Yt = et + j/qet-T + + •••
oc
To = <£2 7 o + cr2 + 0
= 2 > / * t- i. (10.7)
;=o (10.9)
1 -cP 2
Specifically, the variance of Yt depends on both the variance

of the shocks (i.e., a 2) and the AR param eter (i.e., <p). This
1 Note that G is not necessarily a normal distribution. For example, if
G is a generalized Student's t„ distribution with v > 2, then the shocks derivation relies on the white noise assumption for et so that
have heavy tails. C o v[Yt_-|, et] is zero because Y t_-1 depends on et_-|, et_ 2, ....

Autocovariances and Autocorrelations The PA CF is non-zero only for the first lag. These general
patterns— slow decay in the A C F and a steep cutoff in the
The first autocovariance is in the AR(1) process is PA C F— play a key role when choosing appropriate models to
C o v[Yt, Yt_ ,] = Cov[S + + etl r t_ n] apply to a data set.
= </>Cov[Yt_ 1# Ym ] + C ov[Yt_-|, et] Figure 10.2 contains the autocorrelation and PA C Fs of two
= <f>yo (10.10) A R(1)s. The left panel shows the A C F and PA C F when </> = 0.7.
The remaining autocovariances can be recursively com puted by The A C F decays slow ly to zero and the PA C F is 0.7 at the first
noting that: lag and then zero for all other lags. The right panel shows
the A C F and PA C F when </>is negative. The pattern is sim i
C o v[Yt, Yt_ h] = Cov[S + </>Y't_ 1 + etl Yt_ h]
lar, with the additional feature that the A C F oscillates every
= 0 C o v [Y t_ 1; Yt_ h] + C ov[Yt_ h, €t]
other lag.
= <f>yh--i (10.11)
Note that C o v[Yt_ h, et] = 0 because Yt_ h depends on
et_-i, et_ 2 , ...., whereas et happens after the final shock in Yt_b. The Lag Operator
Applying the recursion, y 2 = <f>y-\ = (f>2yo, and (in general):
The lag operator is a convenient tool for expressing and manip
7 h = <f>h7o ulating more com plex tim e-series m odels. The lag operator L
The autocovariance function is then: shifts the tim e index of an observation, so that LYt = Yt_-1 . It has
six properties.
y(h) = ^ y 0 (10.12)
1. The lag operator shifts the tim e index back one observa
and the A C F is
tion, (i.e., LYt = Yt_-|)
p(h) = = 4>lhl (10.13) 2. (PY, = n_p (e.g ., L2Y, = (Y) ,=

L M = Y,_2)
7o
3. The lag operator applied to a constant is just the constant
The A C F geom etrically decays to zero as h increases. It also
(i.e., L8 = 8)
oscillates between negative and positive values if —1 < <j> < 0.
In practice, negative values of </>are uncommon in econom ic and 4. The pth order lag polynomial is written
financial tim e series.
a(L) = 1 + a-] L + a2L^ + . . . +
Finally, the PA C F of an AR(1) is
For exam ple:
h e {0 ,± D
(10.14) 3(L)Yt = Yt + a-]Yt_i + a2Yt- 2 + . . . + apYt_ p
h > 2
AR(1), <£ = 0.7 AR(1), 0 = -0.9

1. 0n 1.0

5. Lag polynomials can be m ultiplied. For exam ple, if a(L) and Deriving the properties for an AR(p) model is more involved
b(L) are first-order polynom ials, then: than for an AR(1). Instead, this chapter focuses on how the prop
erties of an AR(p) broadly mirror those of an AR(1).
a(L)b(L)Yt = (1 + 3 ,1 )0 + b-|L)Yt = (1 + a iL)(Yt + b ,Y t_.,)
= + b-| Yt_-| + a, Yf_, + a ,b ,Y t _ 2 Note that AR(p) model can also be written using a lag
polynomial:
Polynomial multiplication is com m utative so that a(L)b(L)
= b(L)a(L). (1 - 0 , L --------- </>pLp)Y t = 6 + et
9
6 . If the coefficients in the lag polynomial satisfy some tech W hen {Y t} is covariance-stationary, the long-run mean is
nical conditions, the polynomial can be inverted so that
a(L)a(L)~1 = 1. W hen a(L) is the first-order polynomial ElY,] = * = 1---- 2 ----- T ----------- T " (10.17)
1 - - 0 2 --------- 0p
1 — a-|L, it is invertible if |a-|| < 1 and its inverse is
oc
and the long-run variance is
d - a i r 1 = E a'Li
;=o
V[Yt] = *0 = T---- 7 --------
1 0-ip, 02P2
= 1 + aL + a2L2 + a2L3 + . . . (10.15)
This point acts as an attractor, in the sense that an A R process
W hile it is not necessary to manually invert a lag polynom ial, the
tends to move closer to the m ean. Figure 10.3 shows the sam
concept of invertibility is useful for two reasons. First, an AR pro
ple paths of two persistent A R processes. The first is an AR(1)
cess is only covariance-stationary if its lag polynomial is invert
with an A R coefficient of 0 .9 7 . This process is very persistent,
ible. For exam ple, the AR(1):
although it is mean reverting to /j l . The second process is
Yf — 8 + <j5>Y f-i + Yt = 1.6Yt_ , - 0 .7 Y t_ 2 + et. This process is less persistent than
the AR(1) and crosses the mean more freq u en tly.3
can be expressed with a lag polynomial:
Note that expression for the long-run mean suggests that
Yt = 8 + 4>LYt + et
(f)-\ + • • • + (f)p < 1 is a necessary condition for stationarity.
(1 - < f> L )Y t = S + et The condition is sim ple to check, and if the sum of the coef
If \(f)\ < 1 then this lag polynomial is invertible: ficients is equal to or larger than 1, then the process cannot be
covariance-stationary.
d - <*>Lr1(i - = d - <*>1.n-1> + a - <*>l p.-1 et
O
C oc The autocorrelation and PA CFs of an AR(p) share a common
structure with the A C F and PA CF of an AR(1). For exam ple, the
;=o j= o
oc A C F of an AR(p) also decays to zero as the lag length increases
= ---- T + and may oscillate. However, higher-order ARs can produce more
<P 1=0
com plex patterns in their A C F s than an AR(1).
Second, invertibility plays a key role when selecting a unique M eanwhile, the PACF of an AR(p) shows a sharp drop-off at p
model for a tim e series using the Box-Jenkins m ethodology. lags. This pattern naturally generalizes the shape of the PA CF in
Model building and the role of invertibility is covered in more the AR(1), which cuts-off at one lag.
detail in Section 10.8.
The left panel of Figure 10.4 shows the first 15 values for the
In higher order polynomials, it is difficult to define the values of the A C F and PA CF of the AR(2):
model coefficients where the polynomial is invertible— and hence
Yt = 8 + 1.4Yt_ , - 0 .8 Y t_ 2 + et
the parameter values where an AR is covariance-stationary and the
use of characteristic equations becomes necessary. This method is
explained in more detail in the Appendix to this chapter. 9
This model is covariance-stationary if the roots of the associated
characteristic polynomial:
AR (p) zp - (/mz P- 1 --------- <£p_ i z - 4>p = 0
The pth order AR process generalizes the first-order process to are all less than 1 in absolute value.
include p lags of Y in the m odel. The AR(p) model is 3 A useful measure of persistence in an AR(p) is the largest absolute root
in the characteristic polynomial. The roots in this equation are complex
Yt = 8 + </>iYt_, + 4 >2 ^ t - 2 + ••• + < t> p Y t - p + e t (10.16) and the absolute value of the largest root is 0.894.

AR(1)
A R (2 )
Fiqure 10.3 Plot of the sample paths of two AR processes. The AR(1) specification is
Yt = 0.97 Yt_<\ + et and the AR(2) specification is Yt = 1.6Yt_^ — 0.8Yt_ 2 + €t• Both simulated paths
use the same white noise innovations.
AR(2) Seasonal AR(1)
Fiaure 10.4 Plots of the autocorrelation and PACFs for an AR(2) (left panel) and a Seasonal AR(1)
(right panel) for quarterly data.
The A C F oscillates and decays slowly to zero as the lag length current value of Yt. This m odel is tech n ically an AR(5)
increases. Unlike the AR(1) with a negative coefficient, the oscil b ecau se:
lation does not occur in every period. Instead, it oscillates every
(1 - fa L - </>4L4 + 4>ÂL5)Yt = 8 + et
four or five periods, which produces persistent cycles in the time
y t = 6 + <AiYt-i + </>4Yt- 4 - ^ ^ 4 ^ -5 + €t
series.
The A C F and the PA CF are consistent with the expanded model.
The right panel shows the A C F and PA CF of a model known as
a Seasonal A R. This model is easiest to express using the lag Suppose now that the key coefficients in this model take the
polynomial form: values (f)-| = 0.2 and </>4 = 0.8, which would indicate a small
amount of short-run persistence coupled with a large amount of
(1 - 4>41-4)(1 - <hDYt = e,
seasonal persistence. The seasonality produces the large spikes
This m odel has a q u arterly seaso nality w here the in the A C F every four lags. As exp ected , the PA C F is non-zero
shock lagged four quarters has a distinct effe ct on the for the first five lags because this model is effectively an AR(5).

YULE-W ALKER EQUATIONS
The Yule-W alker (YW) equations provide an expression that can be com puted using the first equation because yj = pyyo,
relates the param eters of an AR to the autocovariances of so that:
the AR process. This approach uses p + 1 equations to solve
7o = <Ai P i 7 o + </>2P27o + • • • + (J>PPp7o + er2
for the long-run variance yo and the first p autocorrelations.
Autocovariances (or autocorrelations) at lags larger than p
are then easily com puted with a recursive structure starting (10.19)
1 - </>-|p-| - (f>2P2 - • • • - (f)ppp
from the first p autocovariances.
Higher-order autocorrelations are recursively com puted using
The equations are
the relationship:
C o v[Yt, Yt] = Cov[<5 + </>1Yt_ 1 + • • • + </>pYt_ p + et, Yt]
Pj ~ <f>'\Pj-'\ + (f>2Pj-2 + • ' ' + <t>pPj-p (10.20)
C ov[Yt, Yt_-|] = Cov[S + + * • • + (f>pYt-p + etl Yt- il
A p p lyin g th e Yule-W alker Eq u atio n s to an AR(2)
For exam ple, suppose that
C o v[Yt, Yt_ p] = Cov[S + + • • • + </>pYt_ p + et< Yt_ p]
Yt = 1.4Yt_-, - 0 .4 5 Yt_ 2
These equations appear sim pler when expressed using the
standard autocovariance notation y f This AR(2) is stationary because the roots of the characteristic
polynomial are 0.9 and 0.5. The Y W equations are
To = <£i 7 i + </>272 + ••* + (f>p 7P + cr2
pi = 1.4 — 0.45pi
71 = <£i 7 o + ^271 + ••* + </>P7 p -i p2 = 1.4p-| — 0.45
The solutions to these are
7P - </>lYp- 1 + 4 > 2 V p - 2 + * • • + </>p7o P1 = 1.4/1 .4 5 = 0.966 and p2 = 1.42/1 .4 5 - 0.45 = 0.902
Excluding the first equation, dividing each row by the long- The remaining autocorrelations are recursively computed using
run variance y 0 produces a set of equations that relate the Pj = l^ p ^ - 0.45pM
autocorrelations:
Using this equation, next three autocorrelations can be com
P1 = $1 + </>2Pl + • • • + (f>ppp-1 ; puted as 0.828, 0.753, and 0.682.
Pp = (fr’ipp-'l + <t>2Pp-2 + ' • • + (f>p (10.18) Finally, the long-run variance is
W hile this system of p equations has p unknown values, it

always has a unique solution when <£1f . . . , 4>p are com pat 14 / 1 42
1 - 1.4 X — — + 0.45 X — — - 0.45
ible with an assumption of stationarity. The long-run variance 1.45 V 1 -4 5
M ODELING TIM E-SERIES DATA: AN EXAM PLE

The concepts of this chapter are applied to modeling two O bservations of this series are available monthly from 1979
important macrofinancial time series. The first is the default until the end of 2018.
premium, defined as the difference in the interest rates of two
The second series is the annualized growth rate of real
corporate bond portfolios (i.e., a "safe" portfolio and a "risky"
Gross Dom estic Product (real G D P) in the United States.
one). The safe portfolio holds only Aaa-rated securities and
Real G D P is a broad measure of econom ic activity that
defaults among these bonds are extrem ely rare. Meanwhile,
includes consumption and investm ent. Changes in price
the risky portfolio holds Baa-rated corporate bonds, which is
levels over tim e have been removed so that the growth rate
the lowest rating for investment grade securities.
only m easures changes in econom ic activity. Real G D P is
The default premium is defined as the difference between m easured quarterly, and so the annualized growth rate is
the yields of these two portfolios: defined as:
D E F t = Aaat — Baat R G D P G t = 400 X (In R G D P t - In R G D P t_-,),

where R G D P t is the level of real G D P in period t. Data are having an estim ated AR coefficient of 0.962. However,
available quarterly between 1947Q1 and 2018Q 4. the AR(2) and AR(3) specifications have more com plicated
dynamics than the AR(1), and so their param eters cannot be
Figure 10.5 contains plots of these two series. The left panel
as easily interpreted. Instead, the largest absolute root from
shows the default premium, and the right panel contains the
the characteristic equation for each model indicates the per
growth rate of real GDP. The dark bands indicate recessions,
sistence in higher-order AR processes. These values are sim i
as determ ined by the National Bureau of Econom ic Research.
lar across the three specifications.
Note that both series are responsive to the state of the econ
omy: The default premium rises when econom ic conditions The three AR specifications estim ated on the real G D P
are poor, and so is countercyclical, whereas the real G D P growth data indicate that the persistence for this data is
growth rate is pro-cyclical by definition. lower than for that for the default premium.
Results from fitting AR models with orders 1, 2, and 3 are The R2 in a tim e-series model has the same interpretation
reported in Table 10.1, and t-statistics are reported below as in a linear regression: It measures the percentage of the
each estim ated coefficient. variation in Yt that is explained by the model. The R2 in all
three real G D P growth models is also lower, indicating that it
The models for the default premium (i.e., top panel) indicate
is more difficult to forecast G D P growth using historical data.
that the series is highly persistent, with the AR(1) model
Default Premium Annualized Real GDP Growth Rate
10
-5 -
1950 1960 1970 1980 1990 2000 2010

Fiaure 10.5 The left panel shows the default premium defined as the difference between the
average interest rates on portfolios of Aaa- and Baa-rated bonds. The right panel shows the
annualized growth rate of real GDP. Dark bars indicate recessions.
Table 10.1 Parameter Estimates from AR Models with Orders 1, 2, and 3. The Top Panel Reports
Parameter Estimates Using the Default Premium, and the Bottom Panel Reports Parameter Estimates from
Models Using Real GDP Growth. The Column Labeled R2 Reports the Fit of the Model. The Final Column
Reports the Absolute Values of the Roots of the Characteristic Equations
Default Premium
8 <P2 R2 Abs. Roots
0.040 0.962 0.927 0.962
(8.721) (90.039)
0.053 1.272 - 0 .3 2 2 0.935 0.924, 0.348
(11.897) (33.294) (-8 .4 2 1 )
0.046 1.315 - 0 .4 9 0 0.132 0.936 0.944, 0.374, 0.374
(10.498) (32.853) (-7 .7 1 6 ) 3.300

Real GDP Growth Rate
8 <Pi <P2 <P3 R2 A bs. Roots
1.996 0.360 0.130 0.360
(9.658) (6.544)
1.765 0.319 0.114 0.141 0.533, 0.213
(8.609) (5.455) (1.938)
1.963 0.333 0.150 - 0 .1 1 3 0.152 0.486, 0.486, 0.477
9.622 (5.681) (2.448) (-1 .9 2 1 )
10.5 MOVING AVERAGE (MA) MODELS The PA CF of an MA(1), on the other hand, is com plex and has
non-zero values at all lags. This pattern is the inverse of what an
A first-order M A, denoted MA(1), is defined as: AR(1) would produce.
Yf = fi + 0et_ 1 + et, (10.21) The MA(1) generalizes to an MA(q), which includes q lags of the
shock:
where et~ WN(0, a 2) is a white noise process.
Yt — /x + et + 0-|6t_-| + • • • + 0qet_ q. (10.25)
The observed value of Yt depends on both the contem pora
neous shock et and the previous shock. The param eter 0 acts
Here, /x is the mean of Yt because all shocks are white noise and
as a w eight and determ ines the strength of the effect of the
so have zero expected value. The variance of an MA(q) can be
previous shock. The param eter /j l is the mean of the process
shown to be
because:
7o — o"2 (1 + 02+ ■■* + (10.26)
E[Yt] = /x + 0 E [e ^ ] + E[et]
= /i + 0 x O + O The autocovariance function is
= A ( 10 . 22 )
W hen 0 is positive, an MA(1) is persistent because two (10.27)

consecutive values are (on average) positively correlated.
W hen 0 is negative, the process aggressively mean reverts
because the effect of the previous shock is reversed in the cur where 0O = 1.
rent period. The A C F is always zero for lags larger than q and the PA C F is
M oving averages are always covariance-stationary. An MA(1) has also com plex. In general, the PA CF of an MA(q) is non-zero at all
a limited memory, because only the shock in the previous period lags.
im pacts the current value. The variance of an MA(1) is Figure 10.6 plots the autocorrelation and PA CFs for four MA
V [Yt] = V[/x + 0et_ 1 + et] processes. The top-left contains the values for an MA(1) with a
positive coefficient. Here, the first autocorrelation is positive,
= 02V[et_ 1] + V[et]
and all other autocorrelations are zero. The PA CF oscillates and
= (1 + 02)(t 2 (10.23) decays slowly towards zero.
because the shocks are white noise and so are uncorrelated. Any The top-right shows the A C F and PA CF for an MA(1) with a
MA(1) has exactly one non-zero autocorrelation, and the A C F is large negative coefficient. The first autocorrelation is nega
tive, and the remaining autocorrelations are all zero. The PACF
1 h = 0
decays slowly towards zero.
0
P(h) = h = 1 (10.24)
1 + 02 The bottom-left and right panels plot the values for an MA(2)
0 h > 2 and an MA(4), respectively. These both show the same

M A (1), 0 , =0 .7 M A (1 )# 01 = - 0 .9
MA(2)#0, = 0.7, 02 = -0.3 MA(4), = 0.2, i = 1,2,3,4
Fiaure 10.6 These four panels show the autocorrelation and PACFs for two MA(1)s (top),
an MA(2) (bottom-left) and an MA(4) (bottom-right). The ACF cuts off sharply at the MA
order (q) while the PACF decays, possibly oscillating, over many lags.
pattern— one that is key to distinguishing M A models from many lags. This is the opposite pattern from what is seen in
AR m odels: The A C F cuts off sharply at the order of the MA, Figure 10.2 and Figure 10.4, where the A C F s decayed slowly
whereas the PA CF oscillates and decays tow ards zero over while the PA CF cut off sharply.
APPLYING MOVING A V ERA G E M ODELS: EXAM PLE CO N TIN UED

Moving average models are estim ated for both the default are lower than the R2 in the AR(1) presented in Table 10.1. All
premium and real G D P growth data. The estim ated param estim ated coefficients are statistically significant with large
eters, t-statistics and R2 values of the estim ated models are t- statistics.
presented in Table 10.2.
G D P growth appears to be better described by a finite-order
W hile the default premium is a highly persistent series, an M A. W hile the R2 from the MA(1) is less than that of the
MA(q) is not an accurate approxim ation of its dynamics AR(1), the R2 from the second and third-order models is sim i
because these models are only autocorrelated with their first lar. Adding a fourth M A lag does not change the model fit
A
q lags. Note that the R2 values of all m odels, even the MA(4), and 04 is not statistically different from zero.

Table 10.2 Parameter Estimates from MA Models with Orders between 1 and 4. The Top Panel Reports
Parameter Estimates Using the Default Premium, and the Bottom Panel Reports Parameters Estimates from
Models Using Real GDP Growth. The Column Labeled R2 Reports the Fit of the Model
Default Premium
8 02 03 04 R2
1.074 0.915 0.688
(56.538) (69.917)
1.074 1.328 0.694 0.836
(49.449) (48.559) (27.838)
1.073 1.432 1.221 0.616 0.892
(43.206) (42.062) (31.475) (22.962)
1.073 1.527 1.465 0.989 0.349 0.908
37.388 (39.277) (25.735) (20.561) (10.396)
8 01 02 03 04 R2
3.125 0.269 0.096
(11.686) (5.725)
3.117 0.307 0.227 0.146
(9.924) (5.306) (4.153)
3.113 0.323 0.276 0.092 0.153
(9.031) (5.576) (4.364) (1.548)
3.113 0.328 0.279 0.096 0.014 0.153
(8.894) (5.309) (4.269) 1.531 (0.211)
10.6 AUTOREGRESSIVE MOVING The autocovariance function is
AVERAGE (ARMA) MODELS a

(1 + 2<f>0 + 02)
h = 0
1 - c f> 2
Autoregressive Moving A verage (ARM A) processes com bine AR 2 </>(1 + ( f>0 ) + 0(1 + p0
< )
7 (h) = < cr h = 1 (10.29)
and M A processes. For exam ple, a simple ARM A(1,1) evolves 1 -< f> 2
according to: 07/.-1 h > 2
Yt — 8 + + 0et_ -| + et (10.28) The autocovariance function is com plicated, even in an

A RM A (1,1). The A C F decays as h increases and oscillates if
The mean of this process is
4> < 0. This is consistent with the A C F of an AR process. How
M = 5/(1 - 4>) ever, the PA C F also slowly decays toward zero as the lag length
increases. This behavior is consistent with a M A process. The
The variance is
slow decay o f both the P A C F and A C F is the key feature that
y 0 = er2(1 + 2<f>0 + 02)/(1 - 4>2) distinguishes an A R M A from an A R or an M A p ro cess.

A R M A (1 ,1 ) p ath A R M A (2 ,2 ) p ath
ARMA(1,1) ACF and PACF ARMA(2,2) ACF and PACF
Fiaure 10.7 The left panels plot a simulated path (top) and the ACF and PACF of the
ARMA(1,1), Yt = 0 .8 7 ^ ! — 0.46*-! + et The right panels plot a simulated path (top) and
the ACF and PACF of the ARMA(2,2), Yt = 1.4Yt_-, - 0.6Yt_ 2 + 0.9et_., + 0.6et_ 2 + et.
The left panels of Figure 10.7 show a sim ulated path (top) and Yt — 8 + <
7>iVt _ t + ••• + 4>pY t - p + îet- 1 + *•* + Qqe t - q + et
the autocorrelation and PA CFs (bottom) of an A RM A (1,1) with (10.30)
</> = 0.8 and 0 = —0.4. Both functions decay slowly to zero
O r when com pactly expressed using lag polynomials
as the lag length increases. The decay resem bles that of an
A R (1), although the level of the first autocorrelation is less than d - <hL - ■■• - (f>pLp)Yt = (1 + 0 ,L + • • • + OqLp)et
4> = 0.8. This difference arises because the M A term elim inates L)Yt = S + 0(L)et
some of the correlation with the first shock.
Like A RM A (1,1), the A RM A (p,q) is covariance-stationary if the
An ARM A(1,1) process is covariance-stationary if \(f>\ < I . T h e A R com ponent is covariance-stationary.4
M A coefficient plays no role in determ ining w hether the pro
The autocovariance and A C Fs of A R M A processes are more
cess is covariance-stationary, because any M A is covariance
com plicated than in pure AR or M A m odels, although the gen
stationary and the MA com ponent only affects a single lag
eral pattern in these is sim ple. A RM A (p,q) models have A C Fs
of the shock. The AR com ponent, however, affects all lagged
shocks and so if <£i is too large, then the time series is not
covariance-stationary.
4 This requires the roots of the characteristic equation:
A RM A (p,q) processes com bine AR(p) and MA(q) processes
to produce models with more com plicated dynam ics. An zp - + ••• + (})p --\Z + <f>p = 0
A RM A (p,q) evolves according to to all be less than 1 in absolute value.

APPLYING A U TO R EG R ESSIV E MOVING A V ERA G E M ODELS: EXAM PLE
CON TIN UED
Table 10.3 reports the param eter estim ates, t-statistics and Models for real G D P growth also benefit from both term s.
the model R2 for a range of A R M A m odels fit to both the Adding the M A term allows the AR coefficients to increase,
default premium and real G D P growth. which provides more persistence while limiting the effect
of the most recent shock. The final m odel, an A RM A (3,2), is
The default premium data is fitted using models with p or q
selected using a model selection criterion (see Section 10.8).
up to 2. However, including a second lag of either the AR or
It does provide a better fit than any of the sm aller m odels,
MA term does not improve the model fit over the A RM A (1,1).
although there is some risk of overfitting when using such a
Com paring the ARM A(1,1) with the AR(1) in Table 10.1, the
large model on quarterly data. The final column reports the
AR param eter is virtually identical, and the MA param eter
absolute value of the roots of the characteristic polynomial
is positive. This configuration of param eters in the A R M A
for the A R. The maximum absolute root provides an indica
produces an A C F that lies below the implied A C F of the esti
tion of the persistence in the process.
mated AR m odel. Adding the M A com ponent has improved
the model fit as measured by the R2 and the t-statistic on §■|.
The ARM A(1,1) fits slightly better than an AR(2) and about as
well as an AR(3).
Table 10.3 Parameter Estimates from ARMA Models. The Top Panel Reports Parameter Estimates Using
the Default Premium, and the Bottom Panel Reports Parameter Estimates from Models on Real GDP Growth.
The Column Labeled R2 Reports the Fit of the Model. The Final Column Reports the Absolute Value of the
Roots of the Characteristic Equations
Default Premium
8 <Pi <P2 01 02 R2 Abs. Roots
0.066 0.938 0.381 0.936 0.938
(0.681) (66.110) (9.845)
0.066 0.942 - 0 .0 0 4 0.377 0.936 0.938, 0.004
(0.681) (8.332) (-0 .0 3 7 ) (3.566)
0.067 0.938 0.381 0.002 0.936 0.938
(0.685) (61.429) (8.760) (0.038)
8 <Pi <P2 <P3 01 02 R2 Abs. Roots
1.512 0.515 - 0 .1 7 4 0.137 0.515
(4.319) (4.909) (-1 .5 3 1 )
2.851 - 0 .2 3 4 0.318 0.563 0.148 0.693, 0.459
(8.174) (-0 .8 3 9 ) (3.418) (1.950)
2.286 0.266 0.057 0.184 0.151 0.266
(6.627) (1.566) (0.345) (2.516)
1.185 1.689 - 1 .2 9 2 0.224 - 1 .3 7 6 0.921 0.177 0.975,

0.975, 0.235
(4.101) (21.579) (-1 1 .9 6 2 ) (3.340) (-2 9 .6 4 4 ) (16.714)

and PA CFs that decay slowly to zero as the lag increases (while Two closely related joint tests of autocorrelations are often used
possibly oscillating). when validating a m odel. They both test the joint null hypoth
esis that all of the autocorrelations are sim ultaneously zero (i.e.,
The right panels of Figure 10.7 show a sim ulated path and the
H q \ P i = P 2 = * •' = Ph = 0) against the alternative hypothesis
A C F and PA C F from an A RM A (2,2). This path uses the same
that at least one is non-zero (i.e., Hy. pj 7^ 0 for some j). Both
shocks as in the top-left panel, only with a different model. Both
tests have asym ptotic Xh distributions. Values of the test statistic
the A C F and the PA C F have an oscillating pattern and decay
larger than the critical value indicate that the autocorrelations
toward zero as the lag length increases.
are not zero.
The Box-Pierce test statistic is the sum of the squared autocor

10.7 SAM PLE AUTOCORRELATION relations scaled by the sam ple size T:
h
Sam ple autocorrelation and partial autocorrelations are used
Q „r = T ^ p f 00.33)
to build and validate A R M A m odels. These statistics are first 1=1
applied to the data to understand the dependence structure
W hen the null is true, VYp,- is asym ptotically distributed as
and to select a set of candidate m odels. They are then applied
a standard normal and thus Tpf is distributed as a The
to the estim ated residuals { e j to determ ine w hether they are
estim ated autocorrelations are also independent and so
consistent with the key assumption that et ~ WN(0, a 2).
the sum of h independent x i variables is, by definition, a Xh
The most common estim ator of the sam ple autocovariance is variable.
T _ _
The Ljung-Box statistic is a version of the Box-Pierce statistic
% = (T ~h)“ ' 2 - Y )(Y ;_ h - Y ), (10.31)
i=h + 1 that works better in sm aller sam ples. It is defined as:
where Y is the full sam ple average. This estim ator uses all avai
able data to estim ate yh-
QLe = (10.34)
The autocorrelation estim ator is then defined as:

so that in large sam ples Q 6P ~ Q LB. The Ljung-Box test is also
2 L + 1 O i - Y ) ( Y ,.h - Y) 7h distributed as a Xh-
Ph A (10.32)
- n To W hen T is m odest (e.g ., 100), the finite sam ple distribution of
The autocorrelation estim ator drops divisors T — h (numerator) the Ljung-Box when the null is true is closer to the asym ptotic Xh
and T to ensure that —1 < p h < 1. This estim ator has a slight distribution. Therefore, it is the preferred method to test mul
bias towards zero that disappears as T becom es large. tiple autocorrelations.
The choice of h can affect the conclusion of the test that the
Joint Tests of Autocorrelations residuals have no autocorrelation. W hen testing the specifica
tion of an A RM A(p,q) m odel, h should be larger than m ax(p,q).
Testing autocorrelation in the residuals is a standard specifica Residual autocorrelations that fall within the lags of the model
tion check applied after fitting an A R M A m odel. It is common are typically small, and the A RM A(p,q) estim ator is effectively
practice to use both graphical inspections as well as formal tests overfitting these values. It is common to use between 5 and 20
in specification analysis. lags when testing a small A R M A model (p, q < 2).
Graphical exam ination of a fitted model includes plotting the
residuals to check for any apparent deficiencies and plotting the
Specification Analysis
sam ple A C F and PA CF of the residuals (i.e., et). For exam ple, a
graphical exam ination can involve assessing the 95% confidence Autocorrelation and partial autocorrelation play a key role in
intervals for the null H0: p h = 0 (ACF) or H^: a h = 0 (PACF). The both model selection and specification analysis. The left column
presence of many violations of the 95% confidence interval in of Figure 10.8 plots 24 sam ple autocorrelations and partial auto
the sam ple A C F or PA C F would indicate that the model does correlations of the default premium (i.e., two years of monthly
not adequately capture the dynamics of the data. However, it is observations). The right column shows 12 sam ple A C F and
common to see a few estim ates outside of the 95% confidence PA C F values of real G D P growth (i.e., three years of quarterly
bands, even in a well-specified m odel. This pattern makes it observations). The horizontal dashed lines mark the boundary of
challenging to rely exclusively on graphical tools to determ ine a 95% confidence interval derived from an assumption that the
whether a model is well-specified. process is white noise.

A u to c o rre la tio n s
Default Premium Real GDP Growth
Partial Autocorrelations
F ia u re 1 0.8 The left column contains the sample autocorrelations (top) and partial autocorrelations of
the default premium. The right panel shows the sample autocorrelations and partial autocorrelations of
real GDP growth.
The A C F of the default premium decays very slowly towards The top four panels in Figure 10.9 show the sam ple A C F and
zero, which is consistent with an AR m odel. The PA CF has three PA C F of the estim ated model residuals from these specifica
values which are statistically different from zero. This pattern is tions. Note that these plots are the equivalents of those in
consistent with either a low-order AR or an A R M A with a MA Figure 10.8, excep t that the data are replaced by the model
com ponent that has a small coefficient. residuals. All four plots indicate that the residuals are consistent
with a white noise process. The estim ated autocorrelations are
The A C F and PA CF of real G D P growth indicate w eaker persis
generally insignificant, small in m agnitude, and lack any discern
tence, and the A C F is different from zero in the first two lags.
ible pattern.
The PA C F resem bles the ACF, and the first two lags are statisti
cally different from zero. These are consistent with a low-order The bottom panel plots both the Ljung-Box statistic and its
A R, M A, or A RM A . p-value. The statistic is com puted for all lag lengths between 1
and 12. The default premium residuals have small autocorrela
Sam ple autocorrelations and partial autocorrelations are also
tions in the first four lags, and the null that all autocorrelations
applied to investigate the adequacy of model specifications. In
are zero is not rejected. However, the 5th autocorrelation of the
Figure 10.9, an ARM A(1,1) is estim ated on the default premium
estim ated residuals is large enough that the Q-statistic rejects
data, and an AR(1) is used to model real G D P growth.

A u to c o rre la tio n s
Partial Autocorrelations
Test Statistics
Real GDP Growth
1 .0 1 . 0 -I
30
0.8-
20 0.6
Ql j (Right)
P-value
0.4-
-10
0.2 0.2-
0
0 .0 0.0-
1 2 3 4 5 6 7 8 9 10 11 12 7 8 9 10 11 12
Fiaure 10.9 The top panels show the sample ACF and PACF of the model residuals. The residuals for
the default premium are from an ARMA(1,1). The residuals for real GDP growth are from an AR(1). The
bottom plot shows the Ljung-Box Q statistic (line, right axis) and its p-value (bars, left-axis).

PARAM ETER ESTIMATION IN AR AND ARMA M ODELS
The param eters of an AR(p) model can be estim ated using This bias com es from the fact that Yt's innovation (i.e., et) can
O LS. The dependent variable is Yt and the explanatory vari appear on both sides of the equation: on the left when Yt is
ables are the p lagged values. The param eter estim ators are the dependent variable and on the right when Yt is a lagged
defined as: value for Yt+ 1 , Yt+2 , . . . , Yt+p. This repetition of the same
shock on both sides is the source of the finite-sam ple bias.
T
arg min ^ (Yt ~ 8 ~ (t>iYt - '\ --------- fipYt-p)2 (10.35) O LS cannot be used when an M A com ponent is added to
6 ,</>!,. . . <f>p t = p + ' \
an A R or when estim ating a pure MA. W hereas O LS requires
that all explanatory variables are observable, the lagged
Estim ation of the param eters in an A R is identical to param
errors that appear on the right-hand side of an MA do not
eter estimation in linear regression m odels. Standard errors
m eet this requirem ent and can only be inferred once the
are also estim ated using the same expressions as in a linear
model param eters have been estim ated. The coefficients of
regression. The only feature of a linear regression that can
M A and A R M A models are instead estim ated using maximum
not be directly applied to param eter estimation in an AR(p) is
likelihood, which assumes that the innovations are jointly nor
unbiasedness. W hereas the linear regression param eter esti
mally distributed with mean zero and variance cr2.
mators are unbiased, param eter estim ators of the A R coef
ficients are biased (although they are consistent).
the null. The rem ainder of the lags also indicate that the null counterparts of these exhibit a similar pattern, then an AR
should be rejected .5 model is likely to fit the data.
The residuals from the AR(1) estim ated on the real G D P growth In general, slow decay in the sam ple A C F indicates that the
data are consistent with white noise. Both the sam ple A C F and model needs an AR com ponent, and slow decay in the sample
PA CF have one value that is statistically significant. However, a PA C F indicates that the model should have an M A com ponent.
small number of rejections are expected when exam ining mul Note that the goal of this initial step is to identify plausible can
tiple lags because each has a 5% chance of rejecting when the didate m odels. It is not necessary to make a final selection using
null is true. The Ljung-Box test also does not reject the null at only graphical analysis.
any lag length.
O nce an initial set of models has been identified, the next step
is to measure their fit. The most natural measure of fit is the
sam ple variance of the estim ated residuals, also known as the
10.8 M ODEL SELECTION Mean Squared Error of the model. It is defined as:
Determ ining the appropriate lag lengths for the A R and MA

a2 = l (10.36)
com ponents (i.e., p and q, respectively) is a key challenge when ' t=1
building an A R M A model. The first step in model building is to
Sm aller values indicate that the model explains more of the time
inspect the sam ple autocorrelation and sam ple PA CFs. These
series. Unfortunately, choosing a model to minimize the residual
provide essential insights into the correlation structure of the
variance a 2 also selects a specification that is far too large. This
data and can be used to determ ine the class of models that are
objective suffers from the same problem as maximizing the R2
likely to explain that structure.
in a linear regression— adding additional lags to either the AR
For exam ple, the A C F and PA C F of an AR(p) model have a or M A com ponents always produces sm aller M SEs. The solution
distinct pattern: The A C F decays slowly (possibly with oscilla to this overfitting problem is to add a penalty to the M SE that
tions), while the PACF has a sharp cutoff at p lags. If the sample increases each tim e a new param eter is added.
Penalized M SE measures are known as information criteria (IC).

The leading two among these are the Akaike Information C rite
5 The Ljung-Box test is not robust with respect to heteroskedasticity in rion (AIC) and the Bayesian Information Criterion (B IC ).6
the data. A robust version can be constructed by estimating an AR(p)
on the residuals using OLS and using White's heteroskedasticity robust These IC both reflect the bias-variance tradeoff and attem pt to
estimator. The regression is: balance the cost and benefit of alternative specifications that
et = 5 + -! + 4>2et- 2 + • • • + 4>P^t-P + Vt include different orders of the AR and M A com ponents.
The null hypothesis is that all coefficients on the lagged terms are zero,
H0: <jf>i = • • • — (J)p — 0, and that alternative is Hp (f)j # 0 for some j. 6 This is also known as the Schwarz IC (SIC or S/BIC).

The A IC is defined as: 2. Invertibility — W hen choosing param eters in MA processes
(either pure M A or A RM A ), always select param eter values
A IC = T in a 2 + 2 k, (10.37)
so that the MA coefficients are invertible. Because any
where T is the sam ple size and k is the number of param eters. MA(q) model has 2q equivalent representations, requir
The BIC alters the penalty and is defined as: ing invertibility is a sim ple method to choose one of these
representations.
B IC = T in a 2 + k in T (10.38)
These two principles are enough to ensure that a unique specifi
Note that the penalty term in the A IC adds a constant "co st"
cation is selected to match any A C F and PACF.
of two per parameter, whereas the penalty in the BIC has a cost
that slow ly increases with T. This difference in the penalty has
two im plications. 10.9 FORECASTING
1. The BIC always selects a model that is no larger (i.e., in
Forecasts use current information to predict the future. W hile it
term s of the lag lengths p and q) than the model selected
by the A IC (assuming T > 8 , because In 8 = 2.07). is common to make one-step-ahead forecasts, which predict the
next value in the time series, forecasts can be generated for any
2. The BIC is a consistent 7 model selection criterion (i.e., the
horizon h.
true model is selected as °c ).
The one-step forecast E[T r + 1 \Tt ] is the expectation of YT+ ^
The difference between the two IC can be understood in the
conditional on F j, which is the tim e T information set. This infor
context of hypothesis testing.
mation set contains all values that are known at time T, includ
• The A IC behaves like a model selection methodology that ing the entire history of Y: Y j, Y 7 - 1 , . . . . I t also includes the
includes any variable that is statistically significant with a fixed- history of the shocks eT, 6 7 - 1 ,. .. (even though these are not
test size of s%. When using a fixed-test size, there is an s% directly observed ) 8 as well as all values of any other variable that
chance that a coefficient that is not relevant is selected. In occurred at tim e T or earlier. The shorthand notation Eyf • ] is
aggregate, this can lead to selecting models that are too large. commonly used for the conditional expectation where:
• The BIC behaves in the same way, only s —» 0 as T —» oo. The E r [Yr+1] = E[Yr+1|!Fr ]
quantity s goes to zero slowly, so that any relevant variable
Three rules sim plify recursively generating forecasts.
always has a t-statistic that is larger than the required criti
cal value in large sam ples. Variables that are not needed, 1. The expectation of any variable with a tim e subscript
however, always have t-statistics that are less than the critical T or earlier is the realization of that variable (e.g.,
value (for a T large enough). As a result, such variables are E j{Y t ] = Y j). Residuals are also in the information set (e.g .,
always excluded. £ 7 ^ 7 - 2 ] = ^7 - 2 ).
2. The expectation of future shocks is zero (e.g ., E T[e 7 +i] = 0).
Box-Jenkins This is true for any horizon (i.e., E T[eT+h] = 0 for all h > 1).
It is possible to specify two (or more) models that have differ 3. Forecasts are generated recursively starting with E T[ Y 7 +1].
ent param eter values but are equivalent. Equivalence between The forecast at horizon h may depend on forecasts from
A R M A models means that the mean, ACF, and PA CF of the earlier steps (E T[Y 7 +1] , . . . , E T [Y T+h_ 1]). W hen these
models are identical. quantities appear in the forecast for period T + h, they
are replaced by the forecasts com puted for horizons
The Box-Jenkins m ethodology provides two principles to select
1 , 2 ,... , h —1 .
among equivalent models.
1. Parsimony — W hen selecting between models that are

equivalent (and thus produce the same fit), always choose
the model with few er param eters.
8 While the shocks are not directly observed, they can be esti
mated from the data by transforming the ARMA process into the
7 This is analogous to parameter estimator consistency, which ensures AR(oo) yt - <Aiyt—1 —02yt-2 —• • • = et. The appendix contains an
that estimated parameters converge to their true values. example illustrating how an MA(1) process is inverted into an AR(°°).

APPLIED M ODEL SELECTIO N
Figure 10.10 plots the A IC and the BIC values for A RM A The IC disagree on the preferred model for the real G D P
models where p < 4 and q < 2 for both the default premium growth data. The A IC selects an A R M A (3 ,2 ), w hereas BIC
and real G D P data. The y-axis labels have been omitted prefers an A R M A (1,0) [i.e ., sim ply an AR(1)]. The second and
because only the relative values m atter for an information third best m odels according to the A IC are an A RM A (2,2)
criterion. and an A R(3), respectively. The BIC ranks an MA(2) as the
second-best model and an AR(2) as the third. The differ
The A IC and the BIC both select the same model for the
ences betw een the criteria reflect the effect of the sam ple
default premium— an A R M A (1,1). The two criteria also agree
size and the precision of the estim ates in m odels for real
on the second and third best specifications, which are an
G D P growth.
ARM A(1,2) and an A RM A (2,1), respectively.

AIC AIC
• :
. • ! ■ l
I I I I I
0 1 2 3 4
AR Order (p ) AR Order (p )
BIC BIC
■ ■
1 2 3 0 1 2 3
AR Order (p) AR Order (p)
Figure 10.10 AIC and BIC values for ARMA models with p E {0, 1, 2, 3, 4} and q E {0, 1, 2} for the
default premium (left) and real GDP growth (right). Models with p = 0 have been excluded from the
default premium plot because these produce excessively large values of the ICs. The selected model
has the smallest information criterion.

Applying these rules to an AR(1), the one-step forecast is This duality between the mean reversion level and the long run
forecast reflects a property of any covariance-stationary time
E t [Y t+1] = E T[8 + 4>YT + eT+1]
series— the current value of Yr always has a negligible im pact on
= 6 + 4>Ej [Y t] + E T[eT+1] the values of Y in the distant future. Form ally, in any covariance
= 8 + 4>Yt stationary tim e series:
and depends only on the final observed value YT. Jim E r [Y r+h] = E [Y t],
h—>
The two-step forecast is which is the long-run (or unconditional) mean.
E t [Y t +2 ] = E r [S + 4>Yt+ -| + eT+2] These same steps generalize to MA and A R M A processes using

the three rules of forecasting. For exam ple, the first three fore
= 6 + </>ET[Y T+1] + E T[eT+2]
casts from an MA(2):
= 8 + (f>{8 + 4>Yt)
YT = /x + eT_ -| + 02eT- 2 + eT
= 8 + <f>8 + 4>2Y t
are
The two-step forecast depends on the one-step forecast and so
E[Y r+1] = /X + O^T + 0 2€ j - 1
is also a function of Y j.
E[Yt+2] = /x + 02eT
These steps repeat for any horizon h so that:
E[Y t+3] = /x
EjfYj+h] = 8 + (f>8 + 4>28 + • • • + (f)*1 + (f>^Yj
MAs depend on (at most) q lags of the residual, and so all fore
h
= + (f>hYT (10.39) casts for horizon h > q are the long-run mean /x.
/=o
The forecast error is the difference between the realized (future)
W hen h is large, cf)h must be small because { Y J is assumed to be value and its tim e T fo re ca st and is defined as Yj+h ~ ErfYj+ h].
a stationary tim e series. It can be shown that: The one-step forecast error:
VY+ 1 - E 7 [Y j +1 ] = 6 7+1
i™ + ^ Y t = 7---- T
/=o 1 “
is always just the surprise in the next period in any A R M A
This limit is the same as the mean reversion level (or long-run m odel. Forecasts errors for longer-horizon forecasts are mostly
mean) of an AR(1). functions of the model param eters.
A PPLIED FO RECASTIN G
The preferred model for the default premium is the The selected model for real G D P growth is the AR(1):
A R M A (1 ,1):
R G D P G t = 1.996 + 0.360 R G D P G t_ , + et.
D E F t = 0.066 + 0.938 D E F t + 0.381 et_-, + et
The final real G D P growth rate is R G D P G T = 2.564 in
Generating out-of-sample forecasts requires the final value of 2018Q 4. Forecasts for longer horizons are recursively com
the default premium and the final estim ated residual. These puted using the relationship:
values are D E F T = 1 . 1 1 and eT = 0.0843. The one-step
E-JRGPDGy+h] = 1.996 + 0.360 E7[RGDPG7+/1_ 1]
ahead forecast is
For exam ple, the first forecast is 1.996 + 0.36 X 2.564
E t [D E F t+1] = 0.066 + 0.938 X 1.11 + 0.381
= 2.919, and the second is 1.996 + 0.360 X 2.81 = 3.047.
X 0.0843 = 1.139
The remaining forecasts for the first year are presented in
The two-step and higher forecasts are recursively com puted Table 10.4. The persistence param eter in the AR(1) is small,
using the relationship: and the forecasts rapidly converge to the long-run mean of
the growth rate of real G D P (i.e., 1.996/(1 — 0.36) = 3.119).
E T[D E F T+h] = 0.066 + 0.938 E T[D E F T+h_-\]
These values are reported in Table 10.4.The A R param eter is
relatively close to 1, and so the forecasts converge slowly to
the long-run mean of 0.066/(1 - .938) = 1.06.

Table 10.4 Multi-Step Forecasts for the Default Premium (Left) and Real GDP Growth Rate (Right).
The Forecasts for the Default Premium Are Constructed from an ARMA(1,1). The Forecasts for the Real GDP
Growth Rate Are from an AR(1) Model
Horizon Date Value Date Value
1 Jan 2019 1.139 2019Q1 2.919
2 Feb 2019 1.135 2019Q 2 3.047
3 Mar 2019 1.131 2019Q3 3.093
4 A p r 2019 1.127 2019Q 4 3.110
5 May 2019 1.123 2020Q1 3.116
6 Jun 2019 1.119 2020Q 2 3.118
7 Jul 2019 1.116 2020Q 3 3.119
8 Aug 2019 1.113 2020Q 4 3.119
10.10 SEASONALITY these two com ponents into a single specification. For exam ple,
a model using monthly data with a seasonal AR(1) and a short
Macrofinancial tim e series often have seasonalities. For exam term AR(1) is
ple, many activities related to home buying and construction (1 - 0 ,0 (1 - 0 SL 1 2 )Y , = S + e,

are higher in the sum mer months than in the winter months.
(1 - 0 ,L - 0 SL12 + < M SL13)Y, = 8 + et
Seasonality can be determ inistic (e.g ., constant level shift in
a series during the sum mer months) or stochastic. Series with Yt = 8 + V>i Yt_ , + 0 SY ,_ , 2 + 0 i 0 s'','t—1 3 +
determ inistic seasonality are non-stationary, whereas those with

This seasonal AR model is a restricted AR(13), where most of the
stochastic seasonality can b e stationary (i.e., they can be m od
coefficients are zero, and the coefficient on lag 13 is linked to
eled using ARM As).
the coefficient on the other included lags.
Seasonality occurs on an annual basis. For exam ple, the seasona
Seasonality can be introduced to the AR com ponent, the MA
com ponent for quarterly data appears in gaps of four and that
com ponent, or both. The seasonality is included in a model
of monthly data appears in gaps of 1 2 .
by multiplying the short-run lag polynomial by a seasonal lag
A pure seasonal model only uses lags at the seasonal frequency. polynom ial. The specification of Seasonal A R M A models is
For exam ple, a pure seasonal AR(1) model of quarterly data denoted
series is
A R M A (p, q) X (ps, qs)f,
(1 - 0 L 4)Y t = 6 + 6,
where p and q are the orders of the short-run lag polynomials,
Yt = 8 + 4>Yt_ 4 + et
ps and qs are the orders of the seasonal lag polynomials, and f
This process has peculiar dynamics that are identical to those is the seasonal horizon (e.g ., 4 or 12). For exam ple, the seasonal
generated from four independent AR(1) processes interwoven A R in the previous exam ple is an ARM A(1,0) X ( 1 , 0 )12.
to make an AR(4). However, this structure is not plausible in
In practice, seasonal com ponents are usually restricted to one
most econom ic tim e series.
lag because the precision of the param eters related to the sea
A more plausible structure includes both short-term and sea sonal com ponent depends on the number of full seasonal cycles
sonal com ponents. The seasonal com ponent uses lags at the in the sam ple. In turn, the number of seasonal observations
seasonal frequency, while the short-term com ponent uses lags depends on the sam ple size and the frequency (i.e., T/f). For
at the observation frequency. A seasonal A R M A combines exam ple, when T is m odest (e.g ., 100), then T/f is small.

APPENDIX
M ODEL SELECTIO N IN SEA SO N A L
TIME SERIES Characteristic Equations
Model selection in seasonal tim e series is identical to
In higher order polynom ials, it is difficult to define the val
selection in non-seasonal tim e series. The A C F and PACF
are inspected at both the observational and the seasonal ues of the model coefficients where the polynomial is
frequencies to determ ine an initial model specification. invertible— and hence the param eter values where an AR
Seasonal ARs have slowly decaying A C Fs and a sharp is covariance-stationary. The stationarity conditions can be
cutoff in the PACF (when using the seasonal frequency). determ ined by associating the pth order lag polynomial:
Seasonal MAs have the opposite pattern, where the PACF
1 - fa L ------~ (f>pLp (10.40)
slowly decays and the A C F drops off sharply. Dynamics
present in the A C F and PACF at the observation frequency with its characteristic equation:
indicate that a short-term com ponent is required. Finally,
1C can be used to select between models that contain only zp — </>-|Zp_1 — (f>2zp~2 — ■■ — (fjp-^z — cf)p = 0 (10.41)
observation frequency lags [i.e., an A RM A (p, q) X (0,0)f)]# The characteristic equation is also a pth order polynomial where
only seasonal lags [i.e., A RM A (0,0) X (ps, qs)f], or a mixture
the ordering of the powers has been reversed.
of both.
The lag polynomial is invertible if the roots of the characteristic
equation c1# c2 . . . , cp in the expression:
10.11 SUMMARY (z - c-|)(z - c2) . . . (z - cp) = 0 (10.42)
are all less than 1 in absolute value.9

This chapter introduces the tools used to analyze, model,
and forecast tim e series. The focus has been exclusively on In the first-order polynomial:
covariance-stationary time series that have constant means and
1 - fa L
autocovariances. This property ensures that the dependence is
the associated characteristic equation is
stable across tim e and is essential when modeling a tim e series.
z — (f>-\ = 0
Stationary series are m odeled using autoregressive processes
(AR) or moving averages (MA). These models imply different and the solution is z = </>-]. If \ < 1, then the polynomial is
dependence structures, but both are built using white noise invertible and the AR(1) is stationary.
shocks that are uncorrelated across tim e. This assumption In the AR(2):
allows model residuals to be inspected to determ ine w hether a
specification appears to be adequate. In addition, these models
Y t = 8 + Y t - -| + $ 2 ^ - 2 + e t
are often com bined in autoregressive moving average models The lag polynomial is 1 — </>-| L — (f>2L2 and its associated charac
(ARM A), which can be used to approxim ate any linear process. teristic equation is z2 — 4><\Z — (f>2. If </>-| = 1.4 and </>2 = —0.45,
then the factored characteristic polynomial is (z - 0.9)(z - 0.5).
Alternative specifications are compared using one of two IC: the
The two roots are 0.9 and 0.5. Both are less than 1 in absolute
A IC or the BIC. These criteria tradeoff bias and variance by penal
value, and so this AR(2) is stationary.
izing larger models that do not improve fit. This BIC leads to con
sistent model selection and is more conservative than the A IC .
Finally, this chapter explained how A R M A models are used

to produce out-of-sample forecasts. These concepts extend
directly to tim e series with seasonal com ponents using seasonal
A R M A m odels, which build models by combining observation 9 The roots may be complex valued, so that Cy — ay + by v —1. When Cy is
complex, |Cy| = V a J + bf is known as complex modulus.
and seasonal frequencies.

QUESTIONS

10.1. W hat are the three com ponents of a tim e series? 10.6. W hat is the maximum number of non-zero autocorrela
tions in an MA(q)?
10.2. D escribe the four features of a tim e series that violate the
assum ptions of covariance stationarity. 10.7. Rewrite the MA(2), Yt = 0.3 + 0.8et_-| + 0.16et_2 + et,
using a lag polynomial.
10.3. How does Gaussian white noise differ from general white
noise? 10.8. W hy does the A IC select a model that is either the same
10.4. W hat are the properties of the lag operator? size or larger than the model selected using the BIC ?
10.5. W hat are the key features of the A C F and PA C F of an 10.9. W hat steps should be taken if the A C F and/or PA C F of
model residuals do not appear to be white noise?
MA(q) process?
Practice Questions
10.10. In the covariance-stationary AR(2), Yt = 0.3 + c. Calculate the tim e series for the following ARM A(1,1)
1.4Yt_-| — 0 .6 Y t_ 2 + et, where et ~ WN(0, a 2), what is m odel, taking e0 = 0, q0 = 0, rj = 0.5, and
the long-run mean E[Yt] and variance V[Yt]? (f> = 0 = 0.375:
10.11. If (1 — 0.4L)(1 — 0 .9 L4)Y t = et, what is the form of this 10.14. In the MA(2), Yt = 4.1 + 5et_-| + 6.25et_2 + et, where
model if written as a standard AR(5) with Yt_ 1f. . . , Yt_ 5 et ~ WN[0, a 2), what is the A C F ?
on the right-hand side of the equation?
10.15. Suppose all residual autocorrelations are 1 .5 / V ^ where
10.12. For the equation: T is the sam ple size. Would these violate the confidence
15 5 3 bands in an A C F plot?
Y‘ = 3 i + ? y * - ' - 8 Y* - 2 + e‘ 10.16. Suppose you observed sam ple autocorrelations

W hat is the lag polynomial? Pi = 0.24, p2 = —0.04, and p3 = 0.08 in a tim e series
10.13. G iven the following data for a set of innovations: with 100 observations. W ould a Ljung-Box Q statistic
reject its null hypothesis using a test with a size of
et 5%? W ould the test reject the null using a size of 10%
1 -0.58 or 1%?
2 1.62 10.17. Suppose you fit a range of A R M A models to a data

set. The A R M A orders and estim ated residual variance
3 -1.34 -2
a are
4 0.88
5 -1.11 Order c r 2 Order c r 2 Order c r 2
6 0.52 (1 ,0 ) 1.212 (1, D 1.147 (1 ,2 ) 1.112
7 0.81 (2, 0) 1.175 (2, 1) 1.097 (2, 2) 1.097
8 -0.65 (3, 0) 1.150 (3, 1) 1.097 (3, 2) 1.068
9 0.85 (4, 0) 1.133 (4 ,1 ) 1.096 (4, 2) 1.065
10 -0.88 Which model is selected by the A IC and BIC if the sample size
T = 250?
a. Calculate the tim e series for the following AR(1)
model, taking y0 = 0, S = 0.5, and </> = 0.75: 10.18. The 2018Q 4 value of real G D P growth is
b. Calculate the tim e series for the following MA(1) RGDPGj - t = 2.793. W hat are the forecasts for 2019Q1
model, taking e0 = 0, ju = 0.5, and 0 = 0.75: - 2019Q 4 using the AR(2) model in Table 10.1?

ANSWERS
Short Concept Questions b. Lag operators multiply so that LL = L2 and

L2Y, = L ( L Y , ) = L Y ,_ , .
10.1. The three com ponents are trend, seasonal, and cyclical.
c. The lag operator applied to a constant is just the con
10.2. a. Random walks— The variance of the process changes stant, L8 = 8.
over tim e because a shock is never "forgotten" by the
process.
d. The pth order lag polynomial is written 4>[L) =
1+<
/>
-| L + 4>2L2 + ••• + (f>pLp.
b. Determ inistic trends— The mean of the process is
e. Lag polynomials can be multiplied to produce another
changing over tim e due to a tim e trend; for exam ple,
lag polynomial.
the series has a positive growth rate.
f. Under some assum ptions, lag polynomials can be
c. Determ inistic seasonality— The mean of the process
inverted.
depends on the time observation due to a seasonal,
repeating pattern. 10.5. The A C F of an MA(q) is non-zero until q lags, and then is
zero for all values above q. The PA C F declines slowly in
d. Structure breaks— The model param eters change over
the lag and may oscillate, but there is no lag above which
tim e.
the PA CF is always zero.
10.3. G eneral white noise can have any distribution and does
10.6 . q, the order of the model.
not have the be independent across tim e. Gaussian
white noise is iid, and each shock is normally distributed 10.7. Y t = et(1 + 0.8L + 0.16L2)
N(0, a 2). 10.8. The BIC has a sharper penalty for all sam ple sizes larger
10.4. a. The lag operator shifts the tim e index back one than 7, and so will always prefer few er lags.
observation. 10.9. The model should be expanded to include additional AR
or M A coefficients.
Solved Problems
10.10. Because this process is covariance-stationary 10.13. a. Yt — 8 + 4>Yt_i+et
0.3
E [Y t] = A = = 1.5 yt
1 - 1.4 - (0.6)
0 .3 : 1 -0 .0 8
V [Y t] = y 0 = = 0.45
1 - 1.4 - (0.6) 2 2.06
10.11. (1 - 0.4L)(1 - 0 .9 L4)Y t = (1 - 0 .4 L - 0 .9 L4 + 0 .3 6 L3)Yt 3 0.71
=> Yt - 0 .4 Yt_ ! - 0 .9 Y t_ 4 + 0 .3 6 Yt_ 5 = et 4 1.91
Yt = 0 .4 Y t_-| + 0 .9 Y t_ 4 - 0 .3 6 Yt_ 5 + et 5 0.82
6 1.64
10.12.1 — — f c L 2 = 1 - f l + fL*
7 2.54
8 1.76
9 2.67
10 1.62

Each value is generated via the form ula: Each value is generated via the form ula:
yy = 0.5 + 0.75 * 0 - 0.58 = - 0 .0 8 qy = 0.5 + 0.375 * 0 + 0.375 * 0 - 0.58 = - 0 .0 8
y2 = 0.5 + 0.75 * ( - 0 .0 8 ) + 1.62 = 2.12 z2 = 0.5 + 0 .3 7 5 (-0 .0 8 ) + 0 .3 7 5 (-0 .5 8 ) + 1.62 = 1.87

Z2
and so forth. and so forth.
2\ 2
Z{ = fJL + 0 € {- y + 10.14. = (1 + 52 + _
.2
and y 2 = 6cr . The autocorrelations are then
Zt Po = 1, py = 0.557 and p2 = 0.096.
1 -0 .0 8 10.15. No, the confidence bands are ± 1 .96/^ / t and so these

would not. If many autocorrelations are jointly relatively
2 1.69
large, then this series is likely autocorrelated. A Ljung-
3 0.38
Box test would likely detect the joint autocorrelation.
4 0.38
1 0 .1 6 . The test statistic is
5 0.05
6 0.19 Qlb = T 2 ( ) P?
/= 1 T — i
7 1.70
= 1001 ^ ) (0.24)2 + 100 [ ^ j (- 0 .0 4 )2 + 100 ( ^ ) (0.08)2
8 0.46 99 98 97
9 0.86 = 5.93 + 0.16 + 0.67 = 6.77
10 0.26 The critical values for tests sizes of 10%, 5%, and 1% are
6.25, 7.81, and 11.34, respectively. (CHISQ.INV(1 -a, 5)
Each value is generated via the form ula: in Excel, where a is the test size). The null is only be
rejected using a size of 10%.
Zy = 0.5 + 0.75 * 0 - 0.58 = - 0 .0 8
10.17. The ICs and the selected model in bold:
z2 = 0.5 + 0 .7 5 (—0.58) + 1.62 = 1.69
and so forth. Order AIC Order AIC Order AIC
c. qt = n + <fiqt-y + 0et_ , + et (1 ,0 ) 50.068 (1, D 38.287 (1 ,2 ) 32.540
(2, 0) 44.317 (2, 1) 29.145 (2, 2) 31.145

qt
(3, 0) 40.940 (3 ,1 ) 31.145 (3, 2) 26.447
1 -0 .0 8
(4, 0) 39.217 (4, 1) 32.917 (4, 2) 27.744
2 1.87
3 0.47 Order BIC Order BIC Order BIC

4 1.05
(1 ,0 ) 53.589 d, D 45.330 (1 ,2 ) 43.104
5 0.12
(2, 0) 51.360 (2, 1) 39.709 (2, 2) 45.231
6 0.65
(3, 0) 51.505 (3, 1) 45.231 (3, 2) 44.054
7 1.75
(4, 0) 53.303 (4 ,1 ) 50.524 (4, 2) 48.872
8 0.81
9 0.41
10 0.47

10.18. All forecasts are recursively com puted starting with the The three- and four-step depend entirely on other
first: forecasts:
1.765 + 0.319 X 2.564 + 0.114 X 2.793 = 2.90. 1.765 + 0.319 X 2.98 + 0.114 X 2.90 = 3.05.
The two-step forecast uses the one-step forecast: 1.765 + 0.319 X 3.05 + 0.114 X 2.98 = 3.08.
1.765 + 0.319 X 2.90 + 0.114 X 2.564 = 2.98.

Learning Objectives
D escribe linear and nonlinear time trends. Describe how to test if a tim e series contains a unit root.
Explain how to use regression analysis to model Explain how to construct an h-step-ahead point forecast
seasonality. for a tim e series with seasonality.
D escribe a random walk and a unit root. Calculate the estim ated trend value and form an interval
forecast for a tim e series.
Explain the challenges of modeling tim e series containing
unit roots.
187
Covariance-stationary tim e series have means, variances, and The correct approach to modeling time series with trends
autocovariances that do not depend on tim e. Any tim e series depends on the source of the non-stationarity. If a tim e series
that is not covariance-stationary is non-stationary. This chapter only contains determ inistic trends, then directly capturing the
covers the three most pervasive sources of non-stationarity in determ inistic effects is the best method to model the data. In
financial and econom ic time series: tim e trends, seasonalities, fact, removing the determ inistic trends produces a detrended
and unit roots (more commonly known as random walks). series that is covariance-stationary.
Tim e trends are the sim plest deviation from stationarity. Tim e On the other hand, if a tim e series contains a stochastic trend,
trend models capture the propensity of many tim e series to the only transform ation that produces a covariance-stationary
grow over tim e. These m odels are often applied to the log series is a difference (i.e., Yt — Yt_,- for some value of j > 1).
transform ation so that the trend captures the growth rate of the Differencing a tim e series also removes tim e trends and, when
variable. Estim ation and interpretation of param eters in trend the difference is constructed at the seasonal frequency, also
models that contain no other dynamics are sim ple. rem oves determ inistic seasonalities. Ultim ately, many trending
econom ic tim e series contain both determ inistic and stochastic
Seasonalities induce non-stationary behavior in time series by
trends, and so differencing is a robust method for removing
relating the mean of the process to the month or quarter of the
A
both effects.
year. Seasonalities can be modeled in one of two ways: shifts in
the mean that depend on the period of the year, or an annual
cycle where the value in the current period depends on the shock
in the same period in the previous year. These two types of sea
11.1 TIME TRENDS
sonality are not mutually exclusive, and there are two approaches
to modeling them . The simple method includes dummy variables
Polynomial Trends
for the month or quarter. These variables allow the mean to vary A tim e trend determ inistically shifts the mean of a series.
across the calendar year and accom m odate predictable variation The most basic exam ple of a non-stationary tim e series is a
in the level of a time series. The more sophisticated approach process with a linear tim e trend:
uses the year-over-year change in the variable to eliminate the
Yt = 80 + 8 ,t + etl (11.1)
seasonality in the transform ed data series. The year-over-year
change eliminates the seasonal com ponent in the mean, and the where et ~ W N{0, a 2)
differenced time series is modeled as a stationary ARM A. This process is non-stationary because the mean depends on time:
Finally, random walks (also called unit roots) are the most E[Yt] = S0 + 8 ,t
pervasive form of non-stationarity in financial and econom ic
In this case, the time trend S-| measures the average change in Y
tim e series. For exam ple, virtually all asset prices behave like
across subsequent observations.
a random walk. Directly modeling a tim e series that contains a
unit root is difficult because param eter estim ators are biased The linear tim e trend model generalizes to a polynomial time
and are not normally distributed, even in large sam ples. trend model by including higher powers of tim e:
However, the solution to this problem is sim ple: model the
Yt = Sq "F 5-|t + § 2 ^ + • • • + 8mt m + et (11.2)
difference (i.e., Yt — Vt-i)-
In practice, most tim e trend models are limited to first
All non-stationary tim e series contain trends that may be d e te r
(i.e., linear) or second-degree (i.e., quadratic) polynomials.
m inistic or stochastic. For determ inistic trends (e.g ., tim e trends
and determ inistic seasonalities), knowledge of the period is The param eters of a polynomial tim e trend model can be consis
enough to m easure, m odel, and forecast the trend. On the tently estim ated using O LS. The resulting estim ators are asym p
other hand, random walks are the most im portant exam ple of totically normally distributed, and standard errors and t-statistics
a stochastic trend. A tim e series that follows a random walk can be used to test hypotheses if the residuals are white noise.
depends equally on all past shocks. If the residuals are not com patible with the white noise assum p
tion, then the t-statistics and model R2 are m isleading.
A linear trend model implies that the series grows by a constant
1 In some applications, financial data have seasonality-like issues at quantity each period. In econom ic and financial time series, this
higher frequencies, for example the day of the week or the hour of the is problem atic for two reasons.
day. While seasonalities are only formally applicable to annual cycles in
data, the concept is applicable to other time series that contain a time- 1. If the trend is positive, then the growth rate of the series
dependent pattern. (i.e., S i/ t) will fall over tim e.

2. If 8 is less than zero, then Yt will eventually becom e The top-left panel of Figure 11.1 plots the level of real G D P and
negative. However, this is not plausible for many financial the fitted values from the two m odels. The top-right panel plots
tim e series (e.g ., prices or quantities). the residuals:
In contrast, constant growth rates are more plausible than con et = R G D P t — 8 0 — 8 -\t — ^ t 2,
stant growth in most financial and m acroeconom ic variables. /V
where 8 2 = 0 in the linear model.

These growth rates can be exam ined using a log-linear models
that relate the natural log of Yt to a linear trend. For exam ple: Both residual series are highly persistent and not white noise,
indicating that these m odels are inadequate to describe the
In Vt = S0 + S-it + et (11.3)
changing level of real GDP.
This model implies that:
The bottom two rows of Table 11.1 contain the estim ated
E[ln Vt+1 - In Yt] = 8 param eters and t-statistics from log-linear and log-quadratic
m odels. The log-linear model indicates that real G D P grows by
or equivalently:
0.79% per quarter (i.e., about 3% per year). M eanwhile, the log-
E[ln(Yt+1/ Y t)] = quadratic indicates that the initial growth rate is higher— 1.1%
so that the growth rate of Y is constant. per quarter at the beginning of the sam ple— and that it is slowly
declining. The sam ple has 288 quarters of data, and so the quar
Log-linear m odels can be extended to higher powers of tim e.
terly growth rate is 0.829% (= 1.1% — 0.00094% X 288) at the
For exam ple, the log-quadratic time trend model:
end of the sam ple.
In Yt = 8 0 + V + S 2 t 2 + et (11.4)
The bottom-left panel of Figure 11.1 plots the natural log of real
implies that the growth rate of Yt is <5-| + 2<52t, and so depends G D P along with the fitted values from the two m odels. The right
on tim e. In practice, log-linear models are sufficient to capture panel shows the residuals from these two regressions. The errors
the trend in series that grow over tim e. from the log-specification are also clearly not white noise, and
so these pure tim e trend models are also inadequate to explain
As an exam ple, consider a data set that m easures the real G D P
the growth of real GDP.
of the United States on a quarterly basis from 1947 until the end
of 2018. The specification of the trend models is The final column of Table 11.1 reports the R 2 of the m odels.
R G D P t = 8 q + 6-|t + S2t 2 + et, These values are extrem ely high and the quadratic m odels both
have R 2 above 99% . High R 2 in trending series are inevitable,
where § 2 = 0 in the linear model. and the R 2 will approach 100% in all of four specifications as
The param eter estim ates and t-statistics are reported in the top the sam ple size grows. In practice, R 2 is not considered to
two rows of Table 11.1. The linear model indicates that real G D P be a useful m easure in trending tim e series. Instead, residual
grows by a constant USD 59 billion each quarter, whereas the diagnostics or other form al tests are used to assess model
quadratic model implies the growth is accelerating. adequacy.
Table 11.1 Estimation Results for Linear, Quadratic, Log-Linear, and Log-Quadratic Models Fitted to US Real
GDP. The t-Statistics Are Reported in Parentheses. The Final Column Reports the R2 of the Model
Dependent So Si s2 R2
RGDP - 4 3 .2 9 59.38 0.964
(-0 .3 8 4 ) (87.81)
RGDP 1966 17.80 0.144 0.996
(32.69) (18.53) (44.68)
In(RGDP) 7.712 0.00789 0.990
(972.5) (165.9)
In(RGDP) 7.581 0.011 -9 .4 e -0 6 0.997
(1278) (112.0) (-2 9 .6 0 )
Chapter 11 Non-Stationary Time Series ■ 189

Level of Real GDP
Fitted Values Residuals
3 ......... Linear
2 ♦ ----- Quadratic %
1■ fj*i A | a A ♦•
'/ V1 \f»
v A i
/ vw^/cr
s /h
\v-
0 "rr T
I
# n /W -A
\
-1 ■ / ¥ jfv • j T»•- /4*
•*•
-2
-3
1950 1965 1980 1995 2010
Log of Real GDP

Fitted Values Residuals
3
In { R G D P )
2
&
$16T
Log-linear t
Log-quadratic
Vt
1■
f> w
$8T 0 if t j -
:A K r A
w r
S
-1
$ 4T -
-2
- 3 -I
1950 1965 1980 1995 2010 1950 1965 1980 1995 2010
Fiaure 11.1 The top-left panel plots the level of real GDP and the fitted values from linear and quadratic trend
models. The top-right panel plots the residuals from the linear and quadratic trend models. The bottom-left
panel plots the log of real GDP and the fitted values from log-linear and log-quadratic time trend models. The
bottom-right plots the residuals from the log-linear and log-quadratic models.
11.2 SEASONALITY the other months of the year.2 M eanwhile, the trading volume of
com m odity futures varies predictably across the year, reflecting
A seasonal tim e series exhibits determ inistic shifts throughout periods when information about crop growing conditions is
the year. For exam ple, new housing has a strong seasonality that revealed. Energy prices also have strong seasonalities, although
peaks in the sum mer months and slows in winter. M eanwhile, the pattern and strength of the seasonality depend on the geo
natural gas consumption has the opposite seasonal pattern: ris graphic location of the market.
ing in winter and subsiding in summer. Both seasonalities are C alendar effects generalize seasonalities to cycles that do not
driven by predictable events (e.g ., the weather). necessarily span an entire year. For exam ple, an anomaly has
Seasonalities also appear in financial and econom ic tim e series.
Many of these seasonalities result from the features of markets
2 Tax optimization, where investors sell poorly performing investments
or regulatory structures. For exam ple, the January effect posits in late December to realize losses and then reinvest in January, is one
that equity m arket returns are larger in January than they are in plausible explanation for this phenomenon.

been identified where Monday returns are statistically signifi- A s an exam ple, consider the monthly growth of gasoline
cantly lower than Friday returns in US equity data. Calendar consumption in the United States from 1992 until 2018.
effects also appear in intra-daily data, where trading volume and This tim e series has a strong seasonality, with consumption
price volatility follow a "U " shaped pattern of spikes during the being much higher in sum m er months than in winter months.
m arket opening/closing and a mid-day lull. Figure 11.2 plots the percentage growth rate, which is defined
100 X (In G C t - In G C t_-|)

Seasonal Dummy Variables The seasonality in this tim e series is obvious: March is consis
tently the highest growth month, whereas Septem ber is one of
Determ inistic seasonalities produce differences in the mean of a
four months with large declines in gasoline consum ption.
time series that are simple to model using dummy variables.
A seasonal model is estim ated on these growth rates:
Suppose that Yt is a seasonal tim e series that has a different
mean in each period. The seasonality repeats each s periods AGCt = 8 + 71 In + / 1 + • • • + 7 n /n t + et>
7 2 2
(e.g ., every four in a quarterly series or 12 in a monthly series).

where /-|t is the January dummy, l2t is a February dummy and so
The seasonalities are directly modeled as:
forth.
Yt = 8 + y i/ u + 72 /21 + • • * + 7 s - i/ s- i t + (11.5)
The fitted values from this model are plotted in the bottom-
where left panel of Figure 11.2 and the residuals are plotted in the
bottom -right panel. W hile the residuals appear to be approxi
ljt = 1 when t (m od s) = j, and
m ately white noise, the Ljung-Box Q-test statistic using ten lags
ljt = 0 when t (m od s) 7 ^j is 33. Because this test statistic is distributed 7 10 , the p-value
of the test statistic is less than 0 . 1 %, and so the null hypoth
The function a (m od b) yields the rem ainder of —. For exam ple,
esis that the residuals are not serial correlated is rejected.
10 (m od 6) = 4.
Therefore, this model is not adequate to fully describe the
Note that dummy variable lst is om itted to avoid what is known dynam ics in the data.
as the dum m y variable trap, which occurs when dummy
variables are m ulticollinear.*4
The mean in period 1 is 11.3 TIME TRENDS, SEASONALITIES,

E[Y i] = 5 + 71 AND CYCLES
The mean in period 2 is
Determ inistic term s (e.g ., tim e trends and seasonalities) are
E[Y2] = 5 + 72 often insufficient to describe econom ic time series. A s a result,
the residuals from models that only contain determ inistic terms
The mean in period s is
tend not to be white noise.
E [Y J = 5
W hen the residuals are not white noise but appear to be
because all dummy variables are zero. covariance-stationary, then the determ inistic term s can be
The param eters of this model are estim ated using O LS by com bined with AR or M A term s to capture the three com
regressing Yt on a constant and s — 1 dummy variables. 8 is ponents of a tim e series: the trend, the seasonality, and the
interpreted as the average of Yt in period s, whereas y-. m ea cyclicality.
sures the difference between the period j mean and the period A process is trend-stationary if removing a determ inistic time
s mean. trend produces a series that is covariance-stationary. For exam
ple, if the residuals from the model:
A recent explanation of this difference posits that key macroeconomic Yt = 8q + 5-|t +

announcements, for example, the number of jobs created in the
previous month, are more likely to occur on Friday than on other days, are not white noise but are stationary (e .g ., follow an AR(1)
and that the additional return is compensation for risk. process), then the model can be augm ented to capture the
4 For example, consider a model with separate dummy variables for dynam ics. This means that adding an A R term to the model
when a light switch turned on (lon) and when it is turned off (/<,#). These should produce residuals that appear to be white noise:
two variables are multicollinear, because if lon — 1 then lQff = 0 (and vice
versa). This problem is avoided by simply omitting one of these variables. Yt = 5c>+ 5it + </>Yt_i + € f

M o n th ly G a s o lin e C o n su m p tio n G ro w th R ate
Gasoline Sales Growth A March T September
-30- T T T
1995 2000 2005 2010 2015
Residuals
Fitted Values
Figure 11.2 The month-over-month growth rate of gasoline consumption in the United States. The upticks
indicate the growth rate in March of each year and the downticks mark the growth rate in September.
If there is a seasonal com ponent to the data, then seasonal trend-stationary processes have different behaviors and proper
dummies can be added as well to produce ties. Tests of trend stationarity against a random walk are pre
s—1 sented in Section 11.4.
Yt = So + Sit + 2 'Yilit + W t - 1 + et/
/=1 The top panel of Figure 11.3 shows the log of gasoline con
where <5-| captures the long-term trend of the process, 7 ,- sumption in the United States (i.e., In G C t). This series has not
m easure seasonal shifts in the mean from the trend growth been differenced and so the tim e trend— at least until the finan
(i.e ., <5-|t), and 4>Yt- 1 is an AR term that captures the cyclical cial crisis of 2008— is evident. The predictable seasonal devia
co m p o nent.5 tions from the tim e trend are also clearly discernable. Both time
trends and seasonalities are determ inistic, so that the regressors
W hile adding a cyclical com ponent is a simple method to
in the model:
account for serial correlation in trend-stationary data, many
trending econom ic time series are n o t trend-stationary. If a 11
Yt = 50 + 6it + ^t2 + 2 'Yilit + et
series contains a unit root (i.e., a random walk com ponent), then ;= 1
detrending cannot eliminate the non-stationarity. Unit roots and only depend on tim e.
Estim ated param eters, t-stats, model R2, and a Ljung-Box

5 This example only includes a single AR component. The cyclical Q-statistic using ten lags are reported in Table 11.2. The model
component can follow any ARMA(p, q) process. in the first row excludes the quadratic term that, when included

Lo g G a s o lin e C o n su m p tio n
1995 2000 2005 2010 2015
Residuals
Seasonal Dummies and Quadratic Trend Dummies, Quadratic Trend, and AR(1)
1995 2000 2005 2010 2015 2008 2011 2014 2017

Fiaure 11.3 The top panel plots the natural log of gasoline consumption in the United States (In GCt) using
the data from Figure 11.2. The bottom-left panel plots the residuals from a model that includes seasonal dummy
variables and a quadratic time trend. The bottom-right panel plots the residuals from a model with seasonal
dummies, a quadratic trend, and a Seasonal AR(1).
(second row), has a negative coefficient that appears to be The third row of Table 11.2 reports param eter estim ates from a
statistically significant.6 model that extends the dum m y-quadratic trend model with a
cyclical com ponent [i.e., an AR(1)] and the bottom-right panel
W hile both m odels produce large R2, the trend m akes this
of Figure 11.3 plots the residuals from this model. The A R(1)
m easure less useful than in a covariance-stationary tim e
coefficient indicates the series is very persistent, and the coef
series. The bottom -left panel of Figure 11.3 plots the residu
ficient is very close to 1.
als from the model that includes both seasonal dum m ies and
a quadratic tim e trend. The residuals are highly persistent However, none of these models appear capable of capturing
and clearly not white noise, indicating that the model is badly all of the dynamics in the data, and the Ljung-Box Q-statistic
m isspecified. indicates that the null is rejected in all specifications. The esti
mated param eters in the autoregression are close to one, which
suggests that the detrended series is n o t covariance-stationary.
6 The usual t-stat only has an asymptotic standard normal distribution This exam ple highlights the challenges of incorporating cyclical
when the model is correctly specified in the sense that the residuals are
white noise. The residuals in these models are serially correlated and so com ponents (e.g ., autoregressions) with trends and seasonal
not white noise. dummies.

Table 11.2 Parameter Estimates and t-Statistics from Models That Include Combinations
of Seasonal Dummies, Linear and Quadratic Trends, and an AR(1) Fitted to Gasoline Sales.
The First Two Lines Report Parameters from Models That Include Seasonal Dummies and a
Linear or Quadratic Time Trend Component. The Third Row Reports Estimates from a Model
that Adds an AR(1) Component to the Results in the Second Row. The Final Two Columns
Report the Model R2 and the Ljung-Box Q-Statistic Using Ten Lags (p-Value in Parentheses).
Seasonal Si S2 <P1 R2 Q 10
Dummies
/ 0.00428 0.851 2193.3
(40.662) (0.000)
/ 0.00901 —1.4e-05 0.908 1752.6
(25.120) (-1 3 .5 4 9 ) (0.000)
/ 0.000437 -8 .2 5 e -0 7 0.959 0.993 36.4
(2.567) (-2 .3 4 3 ) (61.307) (0.000)
11.4 RANDOM WALKS Unit Roots

AND UNIT ROOTS Unit roots generalize random walks by adding short-run station
ary dynamics to the long-run random walk. A unit root process is
Random walks and their generalization, unit root processes,
usually described using a lag polynomial:
are the third and most im portant source of non-stationarity in
econom ic tim e series. A simple random walk process evolves «MUVt = m e t
according to: (1 - L)cf>(L)Yt = 0(L)et, (11.8)
where et ~ W N(0, a 2) is a white noise process and 0[L)et is an M A.7

Yt = Y t- i + e t (11.6)
Substituting Yt_ 1 = Yt_ 2 + et_ i into the random walk: For exam ple, the AR(2)
Yt = 1.8Yt_ ! - 0 .8 Y t_ 2 + et
Yt = (Y t - 2 + et - i) + et
shows that Yt depends on the previous two shocks and the value can be written with a lag polynomial as:
of the series two periods in the past (i.e., Yt_ 2). (1 - 1.8L + 0 .8 L2)Yt = et
Repeating the substitution until observation zero is reached (1 - L)(1 - 0 .8 L)Y t = et
The unit root is evident in the factored polynom ial.8

Y ,= Y0 + 2 ei (11.7)
1=1
it can be seen that Yt depends equally on every shock between
7 The full lag polynomial ij/[L) can be factored into the unit root, (1 — L),
periods 1 and t, as well as on an initial value Y0. and a remainder lag polynomial <£(L) that has a characteristic equation
with stationary roots. This process is called a unit root precisely because
Recall that in a stationary AR(1): one of the roots in the characteristic equation:
Yt = <f>Yt- ‘i + €t zp + (Aizp~1 + • • • + *AP- 1 Z + = (z - 1)(zp~1 + <MP~2
= e t + 4>et--\ + 4>2et-2 + ' ' ' + 1eo + 4>tYo + • • • + <f)p_-1),
Stationarity requires that \<f>\ < 1 so that for t large enough, both is equal to unity (i.e., 1). The remaining roots of zp_1 + <f>-\Zp~2 + . . . +
the initial value and shocks in the distant past make a negligible </)p_ 1 are all less than 1 in absolute value.
contribution to Yt. Random walks place equal weight on all shocks 8 The remaining polynomial 4>[L) — 1 — 0.8L has a characteristic root of
and the initial value, and so a shock in period t permanently 0.8 and so corresponds to a stationary AR(1). The characteristic equation
affects all future values. Unlike covariance-stationary time series, for the AR(2) with a unit root is
which always mean revert, random walks become more dispersed z2 — 1.8z + 0.8 = (z — 1)(z — 0.8),
over tim e. Specifically, the variance of a random walk is and the two roots are 1 and 0.8. A standard random walk is a special
case of a unit root process with iJ/{L) = 1 — L, </>(L) = 1 and 0[L) — 1.
v tv ,] = to-2

The Problems with Unit Roots processes track closely for the first 20 observations, but
then diverge. The stationary AR process is mean reverting
Unit roots and covariance-stationary processes have im portant to zero and regularly crosses this value. The random walk is
differences. There are three key challenges when modeling a not m ean-reverting.
time series that has a unit root.
The solution to all three problem s is to difference a time
1 . Param eter estim ators in A R M A models fitted to time series series that contains a unit root. If Yt has a unit root, then the
containing a unit root have what is called a Dickey-Fuller difference:
(DF) distribution. The DF distribution is asym m etric, sample-
A Y t = Yt - Yt_ !
size dependent, and its critical values depend on whether
the model includes time trends. These features make it dif does not.
ficult to perform inference and model selection when fitting
For exam ple, in a random walk with a drift:
m odels on tim e series that contain unit roots.
Yt = 8 + Yt_-| + et,
2. Ruling out spurious relationships is difficult. A pair of time
series have a spurious relationship when there are no fun where et ~ WN[0, a 2), the difference:
damental links between them , but a regression of one on A Y t = 8 + Yt_ i — Yt_-| + et = 8 + et
the other produces a coefficient estim ate that is large and
is a constant plus a white noise shock. In the general case of a
seem in gly statistically different from zero. W hen a time
unit root process:
series contains a unit root, it is common to find spurious
relationships with other tim e series that have a time trend (1 - L)0(L)Yt = et
or a unit root. Predictive modeling with unit roots is risky 0(D((1 - D Y t) = et
because m odels with highly significant coefficients may be 0 (L )A Y t = e t (11.9)
com pletely misleading and produce large out-of-sample
Because 0(L) is a lag polynomial of a stationary process, then
forecasting errors.
the random variable defined by the difference (i.e., A Yt) must be
3 . A unit root process does not mean revert. A stationary AR stationary.
is m ean-reverting, and so the long-run mean can be esti
m ated. Mean-reversion affects the forecasts of stationary
A R models so that EJYt+h] ~ E[Yt] when h is large. In con Testing for Unit Roots
trast, a random walk is not stationary and EJYf+ fJ = Yt for
Unit root tests exam ine whether a process contains a unit root.
any horizon h. Figure 11.4 plots two sim ulated AR(2) models
They can also distinguish trend-stationary models from unit root
Yt = </>iYt_-| + 0 2Yt_ 2 + €t using the same initial value and
processes with drift.
shocks. The models only differ in the coefficients. The sta
tionary AR is param eterized using 01 = 1.8 and 0 2 = —0.9. It is im portant to correctly distinguish tim e series that are per
The unit root has coefficients 0-1 = 1.8 and 0 2 = - 0 .8 . The sistent but stationary from tim e series that contain a unit root.
Fiaure 11.4 Two AR processes generated using the same initial value and shocks. The data from the stationary
AR model are generated according to Y t = 1.87^ - 0.9Yt_ 2 + e t- The data from the unit root process are
generated by Yt = 1.87^ - 0.8Yt_ 2 + et.

W hile a unit root can only be removed by differencing, taking select the number of lagged differences is to choose the lag
the difference of a series that is already stationary is known as length to minimize A IC . The maximum lag length should be set
over-differencing. An over-differenced time series requires a to a reasonable value that depends on the length of the time
model that is more com plex than would be required to explain series and the frequency of sampling (e.g ., 24 for monthly data
the dynamics in the unmodified series. or eight for quarterly data).
For exam ple, if: Recall that the A IC tends to select a larger model than criteria
such as the BIC. This approach to selecting the lag length is
Yt = 8 + et
preferred because it is essential that the residuals are approxi
then m ately white noise, and so selecting too many lags is better
A Y t = 8 + et — 8 — et_ i = et — et_ i than selecting too few. Ultim ately, any reasonable lag length
selection procedure— IC-based, graphical, or general-to-specific
The original series is a trivial model— an AR(0). The over-dif
selection— should produce valid test statistics and the same
ferenced series is a MA(1) that is not invertible. The additiona
conclusion.
com plexity in models of over-differenced series increases
param eter estimation error and so reduces the accuracy of The included determ inistic term s have a more significant
forecasts. im pact on the A D F test statistic. The D F distribution depends
on the choice of determ inistic term s. Including more term s
The Augm ented Dickey-Fuller (ADF) specification is the most
skew s the distribution to the left, and so the critical value
widely used unit root test. An A D F test is im plem ented using an
becom es more negative as additional determ inistic term s are
O LS regression where the difference of a series is regressed on
included. For exam ple, the 5% critical values in a tim e series
its lagged level, relevant determ inistic term s, and lagged differ
with 250 observations are —1.94 when no determ inistic term s
ences. The general form of an A D F regression is
are included, —2.87 when a constant is included, and —3.43
y Y t_ S0 + S-jt Ai + - • + ApA Y t_ g when a constant and trend are included. All things equal, ad d
AYt = + +
ing additional determ inistic term s m akes rejecting the null
Lagged Level Determ inistic Lagged Differences
more difficult when a tim e series d o e s not contain a unit root.
( 11.10)
This reduction in the pow er of an A D F test suggests a conser
The A D F test statistic is the t-statistic of y . To understand the vative approach when deciding which determ inistic trends to
A D F test, consider a im plem enting a test with a model that only include in the test.
includes the lagged level:
On the other hand, if the tim e series is trend-stationary, then the
A Yt = y Y t-y + et A D F test must include a constant. If the A D F regression is esti
If Yt is a random walk, then: mated without the constant, then the null is asym ptotically never
re jected , and the power of the test is zero.
Yt = Yt_-, + et
Y t~ Yt_ ! = Yt_-, - Yt_ ! + et Avoiding this outcom e requires including any relevant deter
A Y t = 0 X Yt_-, + et ministic term s. The recom m ended method to determ ine the
relevant determ inistic term s is to use t-statistics to test their
so that the value of y is 0 when the process is a random walk.
statistical significance using a size of 10%. Any determ inistic
Under the null H0: y = 0, Yt is a random walk. The alternative is
regressor that is statistically significant at the 10% level should
H 1 : y < 0, which corresponds to the case that Yt is covariance
be included. If the trend is insignificant at the 10% level, then it
stationary. Note that the alternative is one-sided, and the null is
can be dropped, and the A D F test can be rerun including only a
not rejected if y > 0. Positive values of y correspond to an AR
constant. If the constant is also insignificant, then it too can be
coefficient that is larger than 1, and so the process is explosive
dropped, and the test rerun with no determ inistic com ponents.
and not covariance-stationary.
However, most applications to financial and m acroeconom ic
Implementing an A D F test on a tim e series requires making two tim e series require the constant to be included.
choices: which determ inistic term s to include and the number
W hen the null of a unit root cannot be rejected, the series
of lags of the differenced data to use. The number of lags to
should be differenced. The best practice is to repeat the A D F
include is sim ple to determ ine— it should be large enough to
test on the differenced data to ensure that it is stationary. If
absorb any short-run dynamics in the difference A Y t.
the difference is also non-stationary (i.e., the null cannot be
The lagged differences in the A D F test are included to ensure rejected on the difference), then the series should be double
that et is a white noise process. The recom m ended method to differenced. If the double-differenced data are not stationary,

then this is an indication that some other transform ation may be the t-statistic is above the 5% critical value, and the null is not
required before testing stationarity. For exam ple, if the series is rejected . The top row exclu d es determ inistic term s and pro
always positive, it is possible that the natural log should be used duces a positive estim ate of y. This is a com mon pattern— a
instead of the unadjusted data. positive estim ate of y usually indicates an inappropriate
choice of d eterm inistic term s. Th e num ber of lags is chosen
Table 11.3 contains the results of six A D F tests on the natural
by m inim izing the A IC across all m odels that include up to
log of real GDP. The top panel contains tests of log RGDP, and
eight lags.
the bottom panel contains tests on the growth rate (i.e., the dif
ference of the log). The bottom panel contains the sam e three specifications
of determ inistic term s on the growth rate of real G D P (i.e.,
Each panel contains A D F tests for all three configurations of
Ain R G D P t). In this case, the tim e trend is statistically signifi
determ inistic term s. In the top panel, the trend is not statisti
cant, and so this model should be used to test for a unit root.
cally significant at the 10% level because its absolute t-statistic
The t-statistic on y is - 8 .1 6 5 , which is below the critical value
is less than 1.645. The lack of statistical significance indicates
of —3.426 for a 5% test, and so the test indicates that the real
that the trend should be dropped and the test run including
G D P growth rate does not contain a unit root. The A D F test
only a constant, which is shown in the m iddle row. The con
that only includes a constant produces a sim ilar conclusion.
stant is highly significant, and so this model is the preferred
Excluding both determ inistic term s produces an A D F test
specification.
statistic that can be rejected at 5% but not 1%. This value is
The t-statistic on the co efficient on the lagged level y is not reliable because the determ inistic term s are statistically
- 2 .3 3 0 . Using the critical values listed in the right colum ns, significant.
Table 11.3 ADF Test Results for the Natural Log of Real GDP and the Difference of the Log of Real GDP (i.e.,
GDP Growth). The Column y Reports the Parameter Estimate from the ADF Test Regression and the ADF Statistic
in Parentheses. The Columns 80 and ^ Report Estimates of the Constant and Linear Trend Along with t-Statistics.
The Lags Column Reports the Number of Lags in the ADF Regression as Selected by the AIC. The Final Columns
Report the 5% and 1% Critical Values and Test Statistics Less (i.e., More Negative) Than the Critical Value that
Leads to Rejection of the Null Hypothesis of a Unit Root
Log of Real GDP
Deterministic r So Si Lags 5% CV 1% C V
None 5.14e-04 3 - 1 .9 4 2 - 2 .5 7 4
(5.994)
Constant -1 .9 2 e -0 3 0.023 5 - 2 .8 7 2 - 3 .4 5 4
(-2 .3 3 0 ) (3.053)
Trend -8 .2 7 e -0 3 0.072 5.09e-05 5 - 3 .4 2 6 -3 .9 9 1
(-1 .0 1 8 ) (1.147) (0.786)
Growth of Real GDP
Deterministic r So Si Lags 5% CV 1% C V
None - 0 .1 3 3 8 - 1 .9 4 2 - 2 .5 7 4
(-2 .1 8 4 )
Constant - 0 .6 2 6 4.87e-03 2 - 2 .8 7 2 - 3 .4 5 4
(-8 .4 1 4 ) (6.287)
Trend - 0 .7 8 3 8.13e-03 —1.47e-05 4 - 3 .4 2 6 -3 .9 9 1
(-8 .1 6 5 ) (5.641) (-2 .2 3 7 )

Table 11.4 ADF Test Results for the Default Premium, Defined as the Difference between the Interest Rates on
Portfolios of Aaa- and Baa-Rated Bonds. The Column y Reports the Parameter Estimate from the ADF and the
ADF Statistic in Parentheses. The Columns 80 and fr. Report Estimates of the Constant and Linear Trend along with
t-Statistics. The Lags Column Reports the Number of Lags in the ADF Regression as Selected by the AIC. The Final
Columns Report the 5% and 1% Critical Values
Default Premium
Deterministic 7 So Si Lags 5% CV 1% CV
None -5 .3 4 e -0 3 16 - 1 .9 4 2 -2 .5 7 1
(-1 .2 2 0 )
Constant - 0 .0 4 2 0.045 10 - 2 .8 6 8 - 3 .4 4 4
(-3 .2 5 1 ) (3.037)
Trend - 0 .0 4 8 0.063 —5.00e-05 10 - 3 .4 2 0 - 3 .9 7 8
(-3 .4 7 3 ) (3.009) (-1 .2 1 7 )
Table 11.4 shows the results of testing the default premium— The seasonal difference is denoted A 4Yt and is defined
defined as the difference between the yield on a portfolio of Yt — Yt_ 4. Substituting in the difference and sim plifying, this
Aaa-rated corporate bonds and Baa-rated corporate bonds. The difference evolves according to:
tests are im plem ented using 40 years of monthly data between
Yt ~ Vt-4 = S-|(t — (t ~ 4)) + et — et_4
1979 and 2018.
A 4Yt = 4<5-| + et — et_ 4
For this data, the time trend is not statistically significant and so
The seasonal dummies are eliminated because:
can be safely dropped. The null that the constant is zero, how
ever, is explicitly rejected at the 10% level, and so the A D F test 7fjt ~ Yjljt-4 = 7j(ljt ~ Ijt- 4 ) = 7j x 0
should be evaluated using a specification that includes a con This is because lJt = /Jt_ 4 for any value of t (because the observa
stant. The t-statistic on y is —3.23, which is less than the 5% (but tions are four periods apart).
not the 1%) critical value. In the previous chapter, we saw that the
The differenced series is a MA(1), and so is covariance
default premium is highly persistent. These A D F tests indicate
stationary.
that it is also covariance-stationary. The number of lags is chosen
to minimize the A IC across all models that include between zero By removing trends, seasonality, and unit roots, seasonal
and 24 lags. differencing produces a time series that is covariance-stationary.
The seasonally differenced time series can then be modeled
Seasonal Differencing using standard A R M A models. Seasonally differenced series
are directly interpretable as the year-over-year change in Yt
Seasonal differencing is an alternative approach to modeling or, when logged, as the year-over-year growth rate in Yt.
seasonal tim e series with unit roots. The seasonal difference is
constructed by subtracting the value in the sam e p e rio d in the Figure 11.5 contains a plot of the seasonal difference in the
previous year. This transform ation elim inates determ inistic sea log gasoline consumption series. This series is constructed by
subtracting log consumption in the same month of the previous
sonalities, tim e trends, and unit roots.
year, 100 X (In G C t — In G C t_ i 2 ).
For exam ple, suppose Yt is a quarterly tim e series with both
determ inistic seasonalities and a non-zero growth rate, so that: The difference elim inates the striking seasonal pattern present in
the month-over-month growth rate plotted in Figure 11.2, and
Yt = <
50 + y d n + l/2tht + 73tht + 8^t + et, so this series can be m odeled as a covariance-stationary time
where et ~ WN{0, cr2). series without dum m ies.

Year-over-year Growth in Gasoline Consumption
40% ■
T995 2000 2005 2010 2015

Fiaure 11.5 The seasonal difference, or year-over-year growth, of log gasoline consumption in the United States,
100 X (In GQ - In GCt_ 12).
11.5 SPURIOUS REGRESSION JP Y /U S D rate may also have a tim e trend, although it is less
pronounced.
Spurious regression is a common pitfall when modeling the rela The first line of Table 11.5 contains the estim ated param eters,
tionship between two or more non-stationary tim e series. t-statistics and R2 of the model. This naive analysis suggests that
For exam ple, suppose Yt and X t are independent unit roots with the exchange rate in the previous w eek is a strong predictor of
Gaussian white noise shocks, and the cross-sectional regression: the price of the Russell 1000 at the end of the subsequent week.
This result, also plotted in Figure 11.6, is spurious because both
Yt = a + (3Xt + et
series have unit roots.
is estim ated using 100 observations.
The bottom panel of Figure 11.6 plots the residuals (i.e., et) from
W hen using simulated data, the t-statistic on /3 is statistically dif this regression. These residuals are highly persistent. Further
ferent from zero in over 75% of the simulations when using a 5% more, conducting an A D F test results in a failure to reject the
test. The rejection rate should be 5%, and the actual rejection null that the residuals contain a unit root.
rate is extrem ely high considering that the series are in d ep en
The correct approach uses the differences of the series (i.e., the
d en t and so have no relationship. Furtherm ore, the rejection
returns) to explore w hether the JP Y /U S D rate has predictive
probability converges to one in large sam ples, and so a naive
power. The well-specified regression is
statistical analysis always indicates a strong relationship between
the two series. This phenomenon is called a spurious regression. Ain R 1000t = a + /3A In JP Y U S D t- 1 + et
Spurious regression is a frequently occurring problem when The differenced data series are both stationary, a requirem ent
both the dependent and one or more independent variables for O LS estim ators to be well behaved in large sam ples.
are non-stationary. However, it is not an issue in stationary time Results from this regression are reported in the second line of
series, and so ensuring that both X t and Yt are stationary (and Table 11.5. The R2 is only 0.3% , indicating that the JP Y /U SD
differencing if not) prevents this problem . return has little predictive power for the future return of the
As another exam ple, consider the w eekly log of the Russell 1000 equity index.
index and the log of the Jap an ese JP Y /U SD exchange rate.
The current value of the index is regressed on the lag of the
exchange rate: 11.6 WHEN TO D IFFER EN C E?
In R 1000t = a + [3\n JP Y U S D t_ i + et
Many financial and econom ic tim e series are highly persistent
The two series are plotted in the top panel of Figure 11.6. The but stationary, and differencing is only required when a time
Russell 1000 has an obvious tim e trend over the sam ple. The series contains a unit root. W hen a tim e series cannot be easily

Log Levels of Russell 2000 and JPY/USD Exchange Rate
Fiaure 11.6 The top panel plots the logs of the Russell 1000 index, the JPY/USD exchange rate, and the fitted
value from regressing the Russell 1000 on the JPY/USD exchange rate. The bottom panel plots the regression
residuals from a spurious regression of the Russell 1000 on the JPY/USD exchange rate.
categorized as stationary or non-stationary, it is good practice

Table 11.5 Estimation Results from a Spurious
to build models for the series in both levels and differences.
Regression (First Row) and a Well-Specified
Regression. The First Row Contains Estimation A s an exam ple, consider the default premium, which is a highly
Results from a Model That Regresses the End-of-Week persistent tim e series that rejects the null of a unit root when
Log of the Russell 1000 Index on the Log JPY/USD tested .Th e default premium can be m odeled using both
Exchange Rate at the End of the Previous Week. approaches: An AR(3) is estim ated on the levels, and an AR(2) is
The Second Row Shows Estimates from a Model estim ated on the differences.9
That Regresses the Weekly Return of the Russell
The estim ated param eters from these two m odels are reported
1000 on the Previous Week's JPY/USD Exchange
in Table 11.6. The model of the differences can be equivalently
Rate Return
expressed as a model in levels because:
a P R2 A Y t = 8 + </>iA Y t_-| + <£2A Yt_2 +
Logs 22.855 - 3 .2 3 2 67.6% Yt ~ Yt_ , = 8 + </>1(Yt- 1 - Yt_ 2) + < M V 2 - Yt_ 3) + €t

Yt = 8 + (1 + </>i)v t_ -I + ((t>2 - (/>1)Yt_ 2 - (f>2Yt-3 + et
(95.823) (-6 5 .9 3 2 )
Returns 0.000 - 0 .0 5 5 0.3%

9 Modeling the differences assumes that the process contains a unit root
(4.489) (-2 .4 1 4 )
and so the AR order is reduced by one.

Table 11.6 The First Row Reports Parameter Estimates and t-Statistics of an
AR(3) Estimated on the Level of the Default Premium. The Second Row Reports
Estimates from an AR(2) Estimated on the Difference in the Default Premium. The
Final Row Reports the Estimated Parameters in the Implied AR(3) Reconstructed
from the AR(2) Estimated on the Differences
8 <Pi (p2 <P3
Levels 0.044 1.330 - 0 .5 2 4 0.152
(0.362) (29.539) (-7 .3 0 7 ) (3.385)
Differences - 0 .0 0 0 0.360 - 0 .1 8 2
(-0 .0 1 4 ) (8.028) (-4 .0 5 7 )
Diff. as Levels - 0 .0 0 0 1.360 - 0 .5 4 2 0.182
(-0 .0 1 4 ) (30.305) (-7 .4 8 1 ) (4.057)
Figure 11.7 The paths of the forecasts of the default premium starting in January 2018 for models estimated in
levels and differences.
The transform ed param eters are reported in the bottom row of 11.7 FORECASTING
Table 11.6. W hile these param eters appear to be very similar,
they imply different forecast paths: The model estim ated in lev Constructing forecasts from models with time trends, seasonali
els is mean reverting to 1.06% , whereas the model estim ated ties, and cyclical components is no different from constructing
using differences has short-term dynamics but is not mean forecasts from stationary A RM A models. The forecast is the time T
reverting (because it contains a unit root). expected value of YT+h. For exam ple, in a linear time trend model:
Figure 11.7 shows the default premium for the four years prior Yr — S0 + S t T + e
to 2018 and two forecasts for the following 24 months. W hereas
so that Yj+h is
the mean reversion in the levels-estim ated model is appar
ent, the differences-estim ated model predicts little change in Yr+h ~ $0 + <5-i(T + h) + ej+h
the default premium over the next two years despite having The tim e T expectation of YT+h is
coefficients different from zero. Ultim ately, determ ining which
ErtVr+h] = S0 + ^ ( T + h) + E T[eT+/l]
approach works better is an em pirical question, and the best
= §o "F 6 - | ( T + h)
practice is to consider models in both levels and differences
when series are highly persistent. because eT+h is a white noise shock and so has a mean of zero

Forecasting seasonal time series in m odels with dummies
requires tracking the period of the forecast. In the dummy-only FO RECASTIN G LEV ELS USING
model for data with a period of s: LOG M ODELS
Using a log transform ation com plicates forecasting.
Yt = 8 + 2 7 fa + et The log is a nonlinear transform ation, and so E[YT+h] 7^
i= 1
the 1-step ahead forecast is exp E[ln YT+h\.
This difference follows from Jensen's inequality because
Er[W+i] = 8 + yj, the log is a concave transform ation. If the assumption
where j = (T + 1) (mod s) is the period of the forecast and the on the noise is strengthened so that it is also Gaussian,
coefficient on the om itted period is y 0 = 0. et ~ N(0, a 2), then the properties of a log-normal can be
used to construct forecasts of the level from a model of
For example in a quarterly model that excludes the dummy for Q4, the log. Recall that if X ~ N(/x, a 2) then W = exp(X) ~
if T is period 125, then E r [Y r+1] = 8 + y (125 +i)(mod 4 ) = 8 + y 2- Lo g N[/a , a 2). The mean of W depends on both /x and a 2,
The h-step ahead forecasts are constructed using an identical pro and E[W] = exp(/x + a 2j 2). For exam ple, in the log-linear
trend model:
cess only tracking the period of T + h, so that:
In YT+h = 80 + 8^{T + h) + eT+h
E-dYr+h] = $ + yp the mean is E r [ln YT+h] = 80 + 8^(T + h) and a 2 is the
where j = (T + h) (mod s). variance of the shock, eT+h, so that In YT+h ~ N(80 +
S i(T + h), a 2). The expected value of Yj+h is then
Forecasts from models that include both cyclical components and exp(50 + S-|(T + h) + a 2/ 2).
seasonal dummies are constructed recursively. The procedure is
virtually identical to forecasting a stationary A RM A model, and
the only difference is that the deterministic term in the h-step
practice, cr is not known. However, it can be estimated as the
forecasts depends on the period of the forecast, T + h.
square root of the residual variance from the estimated regression.
The effect of the recent value diminishes as the horizon
Confidence intervals can be constructed for forecasts from any
grows, and so models containing a cyclical com ponent and
m odel, whether the process is stationary or not. The confidence
with seasonal dummies exhibit mean reversion. However, the
interval only depends on the variance of YT+h — E T[YT+h]. In
mean reversion is not to a single unconditional mean as in a
m odels that include an AR or M A com ponent, the forecast error
covariance-stationary AR(p). Instead, the forecasts from a model
variance depends on both the noise variance as well as the AR
with seasonal dummies mean revert to the period-specific
and MA param eters.
means captured by the dum m ies.
Forecast Confidence Intervals 11.8 SUMMARY

In some applications, it is useful to construct both the forecasts Non-stationary time series can contain determ inistic or stochas
and confidence intervals to express the uncertainty of the future
tic trends. The correct approach to modeling a non-stationary
value. The forecast confidence interval depends on the variance tim e series depends on the type of trend. If a tim e series has
of the forecast error, which is defined as the difference between only determ inistic trends (e.g ., tim e trends or determ inistic
the time T + h realization and its forecast:
seasonalities), then the trends can be directly m odeled in con
Yr+h ~ E r [Y r+h] junction with the cyclical com ponent. W hen the series contains
a stochastic trend (e.g ., a unit root), then the series must be dif
For exam ple, consider a simple linear time trend model
ferenced to remove the trend.
Yj+h = + <5-|(T + h) + €j+h
N on-stationary tim e series are challenging to model for
In this case: three reasons. First, the distribution of param eter estim ators
E r [Y r+h] = S0 + « !( T + h) depends on w hether the tim e series contains a unit root, and
so correctly interpreting test statistics is challenging. Second,
and the forecast error is eT+h.
it is easy to find spurious relationships that can arise due to
When the error is Gaussian white noise N(0, a 2), then the 95% determ inistic or stochastic trends. Finally, forecasts of non-
confidence interval for the future value is E T[Y T+h] ± 1.96 a . In stationary tim e series do not m ean-revert to a fixed level.

Forecasts produced from m odels built using over- or under that the series contains a unit root, then the series should be
differenced data are likely to perform very poorly, especially at differenced. In most cases, the first difference elim inates both
longer horizons. the stochastic trend and a linear tim e trend, if present, and so
produces a stationary tim e series. However, in a seasonal time
The Augm ented Dickey-Fuller (ADF) test is used to differentiate
series, differencing should be im plem ented at the seasonal fre
between a unit root and a trend-stationary process. Running an
quency. This elim inates determ inistic seasonalities, linear time
A D F test is the first step in building a model for a tim e series
trends, and unit roots.
that might be non-stationary. If the A D F fails to reject the null

QUESTIONS

11.1 W hen modeling In Yt using a tim e trend model, what is 11.3 W hy does a unit root with a tim e trend,
the relationship between exp E T[ln YT+h] and E T[Y T+h] Yt = 61 + Yt_-] + et not depend explicitly on t?
for any forecasting period h? Are these ever the same?
11.4 W hat are the consequences of excluding a determ inistic
Assum e the error term s is normally distributed around a term in an A D F test if it is needed? W hat are the conse
mean of zero.
quences of including an extra determ inistic term that is
11.2 Suppose an hourly tim e series has a calendar effect where not required to fit the data?
the hour of the day m atters. How would the dummy vari 11.5 W hy doesn't the A D F test reject if the t-statistic is large
able approach be im plem ented to capture this calendar
and positive?
effect? How could differencing be used instead to remove
the seasonality?
Practice Questions
3
11.6 A Iinear tim e trend model is estim ated on annual real 11.8 The seasonal dummy model Yt = 8 + 2 ; = 1 Yjljt + et is
euro-area GDP, measured in billions of 2010 euros, estim ated on the quarterly growth rate of housing starts,
using data from 1995 until 2018. The estim ated model is and the estim ated param eters are y i = 6.23, 72 = 56.77,
RG D P t = —234178.8 + 121.3 X t + et. The estim ate of 73 = 10.61, and 8 = —15.79 using data until the end
the residual standard deviation is a = 262.8. Construct of 2018. W hat are the forecast growth rates for the four
point forecasts and 95% confidence intervals (assuming quarters of 2019?
Gaussian white noise errors) for the next three years. Note
11.9 A D F tests are conducted on the log of the ten-year
that t is the year, so that in the first observation, t = 1995, US governm ent bond interest rate using data from
and in the last, t = 2018.
1988 until the end of 2017. Th e results of the A D F
11.7 A log-linear trend model is also estim ated on annual euro with different configurations of the determ inistic term s
area G D P for the same period. The estim ated model is are reported in the table below. The final three co l
In R G D P t = —18.15 + .0136 t + et, and the estim ated umns report the num ber of lags included in the test
standard deviation of et is 0.0322. Assum ing the shocks as selected using the A IC and the 5% and 1% critical
are normally distributed, what are the point forecasts of values that are appropriate for the sam ple size and
G D P for the next three years? How do these compare included determ inistic term s. Do interest rates contain
those from in the previous problem ? a unit root?
Deterministic r So Lags 5% CV 1% CV
None - 0 .0 0 3 7 - 1 .9 4 2 - 2 .5 7 2
( - 1 .6 6 6 )
Constant - 0 .0 0 9 0 .0 1 0 4 - 2 .8 7 0 - 3 .4 4 9
(-1 .4 2 5 ) (1.027)
Trend - 0 .0 8 5 0.188 - 0 .0 0 0 3 - 3 .4 2 3 - 3 .9 8 4
(-4 .3 7 8 ) (4.260) (-4 .1 0 9 )

11.10 An AR(2) process is presented: The data is to be evaluated using the following regres
Yt = kYt_ i + m Yt_ 2 + et sion with a Dickey-Fuller (DF) Test:
It is determ ined that the equation has a unit root. Yt = Y t - 1 + e t.
The specific regression to be evaluated is

a. W hat conditions must k and m satisfy for the first dif
ference series to be stable? yt - yt- i = myt-'i + et
b. Express the equation as an AR(1) process. The relevant critical values for the D F test are 1%: —2.60,
11.11 The following data is to be evaluated for being a 5%: - 1 .9 5 .
random walk:
a. W hat is the appropriate null hypothesis and counter
hypothesis if a one-sided D F test is em ployed?
t Yt
b. Calculate the D F t-statistic, taking Vq = 0.
1 -1.81 Hint: For a regression with no intercept, y,- = mx,- + e,-:
2 -2.31 Ix ^
3 -1 .6 5
4 -0.81
5 -0.31
6 -0 .3 7
7 0.47
8 1.10
9 0.53
10 0.47
11 1.00
12 1.68
13 0.90
14 1.99
15 1.31
16 0.76
17 0.43
18 -1 .0 1
19 0.27
20 -2.21
21 -3.21
22 - 1 .9 0
23 -0 .5 3
24 - 0 .6 0
25 0.34

ANSWERS

11.1 A tim e trend model for In Yt can be stated as: where /Jt = 1 when t(m od 24) = j, and lJt = 0 when
t(m od 24) 7^ j
In Yt = g(t) + €tl e ~ N(0, a 2),
where g(t) is a function of t. Differencing this series can be achieved by looking at

observation 24 periods (hours) apart from each other
So,
(the following presumes that the error term s are iid and
E r [ln Vr+h] = g (T + h), normal):
which gives y t + 2 4 ~ y t = g (t + 24) - g(t) + et+24 - et
exp E r [ln Yr+h] = exp [g (T + h)] O nce the determ inistic time trend is removed the
On the other hand: remaining is a covariance-stationary MA(1) process.
EA yr+ h l = E T[e x p (g (T + h) + eT+h)] = exp (g (T + h) 11.3 The time trend becom es apparent as the series is propa
+ Eyfexp eT+h)j, gated backwards:
which equals yt — 6-| + yt_! + €t — 2§-| + yt_ 2 + et + et_i — • • •

2
t
E T[Y T+h] = exp [g (T + h)] + y
= tS, + Y0 + 2 6/
/= 1
And so:
11.4 Excluding required determ inistic term s m akes it
2
harder to reject the null, and the pow er of the test
E T [yT+h} = exp E r [ln y T+h] + y
will fall tow ards zero. On the other hand, including
These will be equal if the variance is zero (in other words, superfluous determ inistic term s skew s the distribu
if the process is com pletely determ inistic. tion to the left and also reduces the pow er of the test.
1 1 .2 Let s = 24 represent the hour of the day in military time 11.5 The A D F focuses on the left side of the distribution and
(e.g. 13 = 1 p.m .). Then rejection occurs if the test statistic is to the left of the
Yt = 9(t) + 71*11 + 72*2t + • • • + 7 2 3 /2 3 t + € t>

dem arcation point (e.g ., large and negative).
Solved Problems
11.6 Note that there is no AR or M A com ponent, so the vari And the error bounds on the In are + /— 1.96*0.0322, so
ance remains constant. Therefore, the 95% confidence the bounds are given in proportional term s rather than
interval is + / — 1.96*262.8 = + / — 515.1 about the fixed values.
expected value. Bounds_M ultiplier = exp( ± 1.96 * 0.0322)
A s for the expected means: = e xp (± 0.0631) = 0.939,1.065
E [R G D P 2o i 9] = -2 3 4 1 7 8 .8 + 121.3 X 2019 = 10,725.9 Calculating E[ln R G D P J:
E [R G D P 202o ] = -2 3 4 1 7 8 .8 + 121.3 X 2020 = 10,847.2 E[ln R G D P2019] = - 1 8 .1 5 + 0.0136 * 2019 = 9.308
E [R G D P 202-\] = -2 3 4 1 7 8 .8 + 121.3 X 2021 = 10,968.5
E[ln R G D P2020] = - 1 8 .1 5 + 0.0136 * 2020 = 9.322
11.7 In this case: E[ln R G D P202i ] = - 1 8 .1 5 + 0.0136 * 2020 = 9.336
a
ET[YT] = exp ET[ln Y t ] + Furtherm ore,
0.03222
0.0005
2

(which will only make a small im pact in this exam ple) For the trend m odel, both of these have t-stats with
absolute values > 4, well within the bounds of statisti
So:
cal significance at even the 99% level. Accordingly, the
E [R G D P 2o i 9] = exp(9.308 + 0.0005) = 11,031.4
proper model to study is the last one.
E [R G D P 2 0 2 0 ] = exp(9.322 + 0.0005) = 11,186.9
For this m odel, the t-stat on -y is to the right of the 1%
E[R G D P 202i ] = exp(9.336 + 0.0005) = 11,344.6 CV — therefore, the null hypothesis of having a unit root
And the 95% confidence bands are given as: is rejected at the 99% confidence level. Note that if the
proper model were either the constant or no-trend then
95% ^CBo_ = [0 .9 3 9 * 11031.4,1.065 * 11031.4]J
d RGDP 2019 1 '
the null hypothesis would not be rejected.
= [10,358.5,11,748.4]
11.10 a. G iven that the equation has a unit root means that the
95% c b RGDP2020 = [10,504.5,11,914.0] factorization of the characteristic equation must be of
95% c b RGDp2019 = [10,652.6,12,082.0] the form:
In comparison with #1, the bands are growing in size and (z - 1)(z — c) = z2 - (c + 1)z + c,
overall the results are a bit bigger.
where 0 < < 1
1 1 .8 Though there is variance from quarter-to-quarter, the
The characteristic equation for the original relation
expected value of Y t is the same for any two observa
ship is
tions of the same quarter, regardless of the year.
z2 - k i
Accordingly:
r n
3 Therefore, it is required that | m < 1

E [Y Q1] = 8 + 2 Jjljt = - 1 5 .7 9 + 6.23 * 1 + 56.77 * 0
7=1 and
+ 10.61 * 0 = - 9 .5 6 k = ( - m + 1) or (1 - m)
3
E [Y Q2] = 8 + 2 7 jljt = - 1 5 .7 9 + 6.23 * 0 + 56.77 * 1
b. Express the equation as an AR(1) process:
7=1
Yt = (—m + 1)Vt_ 1 + m Yf-2 +
+ 10.61 * 0 = 40.98
Yt - Yt_-| = —m Yt_ 1 + m Yt_2 + et
3
E [Y Q3] = 8 + = - 1 5 .7 9 + 6.23 * 0 + 56.77 * 0 = —m(Yt_-| - Yt_ 2) + €t
y=i
Defining Z t = Yt — Yt_-|
+ 10.61 * 1 = - 5 .1 8
yields
3
E [Y Q4] = 8 + 2 y fjt = - 1 5 .7 9 + 6.23 * 0 + 56.77 * 0 Zt = + et
y=i
+ 10.61 * 0 = - 1 5 .7 9 11.11 a. H0: m = 0
Hfi m < 0
1 1 .9 T h e fi rst step is to select the appropriate model to use.
For these, the statistical significance of the param eters The counterhypothesis is different from usual practice
for the constant and trend must be taken into account. because only the left tail is being evaluated.

b. Com pleting the table yields
t yt yt - 1 A yt yt - 1 * A y t y?-i
1 -1 .8 1 0 -1.81 0.00 0.00
2 -2.31 -1.81 -0 .5 0 0.91 3.28
3 - 1 .6 5 -2.31 0.66 - 1 .5 2 5.34
4 -0.81 - 1 .6 5 0.84 - 1 .3 9 2.72
5 -0.31 -0.81 0.50 -0.41 0.66
6 -0 .3 7 -0.31 -0 .0 6 0.02 0.10
7 0.47 -0 .3 7 0.84 -0.31 0.14
8 1.10 0.47 0.63 0.30 0.22
9 0.53 1.10 -0 .5 7 -0 .6 3 1.21
10 0.47 0.53 -0 .0 6 -0 .0 3 0.28
11 1.00 0.47 0.53 0.25 0.22
12 1.68 1.00 0.68 0.68 1.00
13 0.90 1.68 -0 .7 8 -1.31 2.82
14 1.99 0.90 1.09 0.98 0.81
15 1.31 1.99 -0 .6 8 -1 .3 5 3.96
16 0.76 1.31 -0 .5 5 -0 .7 2 1.72
17 0.43 0.76 -0 .3 3 -0 .2 5 0.58
18 -1.01 0.43 - 1 .4 4 -0 .6 2 0.18
19 0.27 -1.01 1.28 - 1 .2 9 1.02
20 -2.21 0.27 -2 .4 8 -0 .6 7 0.07
21 -3.21 -2.21 - 1 .0 0 2.21 4.88
22 - 1 .9 0 -3.21 1.31 -4.21 10.30
23 -0 .5 3 - 1 .9 0 1.37 - 2 .6 0 3.61
24 - 0 .6 0 -0 .5 3 -0 .0 7 0.04 0.28
25 0.34 - 0 .6 0 0.94 -0 .5 6 0.36
SUM -5.81 0.34 -1 2 .5 0 45.76
So m can be calculated as:
_ 1 x,y, _ - 1 2 .5 0
- 0 .2 7 3
m “ S x ,2 “ 45.76

Returning to calculate the errors:
t y t y t - 1 A yt y t - 1 * A yt yf-i
1 -1.81 0 -1.81 0.00 0.00 -1.81
2 -2.31 -1.81 -0 .5 0 0.91 3.28 - 0 .9 9
3 -1 .6 5 -2.31 0.66 - 1 .5 2 5.34 0.03
4 -0.81 -1 .6 5 0.84 - 1 .3 9 2.72 0.39
5 -0.31 -0.81 0.50 -0.41 0.66 0.28
6 -0 .3 7 -0.31 - 0 .0 6 0.02 0.10 - 0 .1 4
7 0.47 -0 .3 7 0.84 -0.31 0.14 0.74
8 1.10 0.47 0.63 0.30 0.22 0.76
9 0.53 1.10 -0 .5 7 -0 .6 3 1.21 -0 .2 7
10 0.47 0.53 - 0 .0 6 -0 .0 3 0.28 0.08
11 1.00 0.47 0.53 0.25 0.22 0.66
12 1.68 1.00 0.68 0.68 1.00 0.95
13 0.90 1.68 -0 .7 8 -1.31 2.82 -0 .3 2
14 1.99 0.90 1.09 0.98 0.81 1.34
15 1.31 1.99 -0 .6 8 -1 .3 5 3.96 - 0 .1 4
16 0.76 1.31 -0 .5 5 -0 .7 2 1.72 -0 .1 9
17 0.43 0.76 -0 .3 3 -0 .2 5 0.58 -0 .1 2
18 -1.01 0.43 - 1 .4 4 -0 .6 2 0.18 - 1 .3 2
19 0.27 -1.01 1.28 - 1 .2 9 1.02 1.00
20 -2.21 0.27 - 2 .4 8 -0 .6 7 0.07 -2.41
21 -3.21 -2.21 - 1 .0 0 2.21 4.88 - 1 .6 0
22 - 1 .9 0 -3.21 1.31 -4.21 10.30 0.43
23 -0 .5 3 - 1 .9 0 1.37 - 2 .6 0 3.61 0.85
24 - 0 .6 0 -0 .5 3 -0 .0 7 0.04 0.28 -0.21
25 0.34 - 0 .6 0 0.94 -0 .5 6 0.36 0.78
SUM -5.81 0.34 -1 2 .5 0 45.76 -1 .2 5
St Dev 1.36 0.95
Using the STD EV function in E X C E L , As the t-stat is less than the 5% point on the D F distri
The standard error of this is bution, the null is rejected at a 95% confidence level,
but the t-stat is not sufficiently negative to enable
(T, 0.95
= 0.140 rejection at the 99% confidence level.
nay<-i 5 * 1.36
So the t-statistic is
m - 0 .2 7 3
t - stat = = - 1 .9 6
(T,m 0.140

Measuring Returns,
Volatility, and
Correlation
Learning Objectives
Calculate, distinguish, and convert between sim ple and Describe the power law and its use for non-normal
continuously com pounded returns. distributions.
Define and distinguish between volatility, variance rate, Define correlation and covariance and differentiate
and implied volatility. between correlation and dependence.
D escribe how the first two moments may be insufficient to Describe properties of correlations between normally dis
describe non-normal distributions. tributed variables when using a one-factor model.
Explain how the Jarque-Bera test is used to determ ine

w hether returns are normally distributed.
211
Financial asset return volatilities are not constant, and how they The tim e scale is arbitrary and may be short (e.g ., an hour or a
change can have im portant im plications for risk m anagem ent. day) or long (e.g ., a quarter or a year).
For exam ple, if returns are normally distributed with an annual
The return of an asset over multiple periods is the product of
ized volatility of 10%, then a portfolio loss larger than 1% occurs
the sim ple returns in each period:
on only 5% of days and the loss on 99.9% of days will be less
T
than 2%. W hen annualized volatility is 100%, then 5% of days
have losses larger than 10.2% , and one in four days would have
1 + n +
Rt =
t=1
<1 R t) ( 1 2 . 2)
losses exceeding 4.2% . There are also continuously com pounded returns, also known as
This chapter begins by exam ining the distributions of financial log returns. These are com puted as the difference of the natural
asset returns and why they are not consistent with the normal logarithm of the price:
distribution. In fact, returns are both fat-tailed and skew ed. rt = ln P t — ln P t_-| (12.3)
The heavy tails are, in particular, generated by tim e-varying
The log return is traditionally denoted with lower case letter
volatility. Two popular m easures of volatility are p resented,
(i.e., rt).
the Black-Scholes-M erton model and the V IX Index. Both
m easures make use of option prices to m easure the future The main advantage of log returns is that the total return over
exp ected volatility. multiple periods is just the sum of the single period log returns:
Capturing the dependence among assets in a portfolio is also

rT = i > t 02.4)
a crucial step in portfolio construction. In portfolios with many t=1
assets, the distribution of the portfolio return is predom inantly
A log return approxim ates the simple return. However, the
determ ined by the dependence between the assets held. If the
accuracy of this approxim ation is poor when the simple return
assets are weakly related, then the gains to diversification are
is large, and log returns are more commonly used when returns
large, and the chance of experiencing an exceptionally large
are com puted over short tim e spans.
loss should be small. On the other hand, if the assets are highly
dependent, especially in their tails, then the probability of a Converting between the simple and log return uses the
large loss may be surprisingly high. relationship:
The final section reviews linear correlation, which is a common 1 + Rt = exp rt (12.5)
measure that plays a key role in optimizing a portfolio. It is not, Figure 12.1 shows the relationship between the simple return
however, enough to characterize the dependence between the and the log return. In the left panel, which shows the relation
asset returns in a portfolio. Instead, this section presents two ship when the simple return is between —12% and 12%, the
alternatives that are correlation-like but have different proper approximation error is small (i.e., less than 0.85% ). The right
ties. These alternative m easures are better suited to understand panel expands the range to ± 5 0 % , and the approximation error
ing dependence when it is nonlinear. They also have some useful is much larger (e.g., when the simple return is —40% , the log
invariance and robustness properties. return is —51%; and when the simple return is 40%, the log return
is 33.6%).
12.1 MEASURING RETURNS The log return is always less than the sim ple return. Further
more, sim ple returns are also never less than - 1 0 0 % (i.e ., a
All estim ators of volatility depend on returns, and there are two total loss). Log returns do not respect this intuitive boundary
common m ethods used to construct returns. The usual defini and are less than —100% w henever the sim ple return is less
tion of a return on an asset bought at tim e t — 1 and sold at than —63%.

tim e t is The sim ple additive structure of log returns makes them more
attractive in cases where the error is small (i.e., over short peri
ods). For exam ple, it is common for volatility models to use log
returns because they are usually com puted using daily or weekly
where Pt is the price of the asset in period t.
data. Despite this common practice, however, it is im portant
Returns com puted using this formula are called simple returns to have a sense of the approxim ation error em bedded in log
and are traditionally expressed with an uppercase letter (i.e., Rt). returns when returns are large in m agnitude.

Fiqure 12.1 The left panel shows the relationship between the simple return and the log return when the simple
return ranges between —12% and 12%. The right panel expands the range of the returns to —50% and 50%.
12.2 MEASURING VOLATILITY The mean of the w eekly return is 5/ j l and, because et is an iid
sequence, the variance of the w eekly return is 5tr2 and the vola
AND RISK tility of the w eekly return is V 5er. This is an im portant feature of
financial returns— both the mean and the variance scale linearly
The volatility of a financial asset is usually measured by the stan
in the holding period, while the volatility scales with the square-
dard deviation of its returns. A sim ple but useful model for the
root of the holding p erio d .1
return on a financial asset is
This scaling law allows us to transform volatility between time
rt = /j l + oetl (12.6)
scales. The common practice is to report the annualized volatil
where et is a shock with mean zero and variance 1, /x is the mean ity, which is the volatility over a year. W hen the volatility is m ea
of the return (i.e., E[rt] = /x), and a 2 is the variance of the return sured daily, it is common to convert daily volatility to annualized
(i.e., V[rt] = a 2). volatility by scaling by V 2 5 2 so that:
The shock can also be written as:
ânnual = V252 X ^ ai|y (12.7)
et = a e t
The scaling factor 252 approxim ates the number of trading days
so that et has mean zero and variance a 2. in the US and most other developed m arkets. Volatilities com
The shock is assumed to be independent and identically distributed puted over other intervals can be easily converted by computing
(iid) across observations. While it is not necessary to specify the dis the number of sampling intervals in a year and using this as the
tribution of et, it is assumed that et ~ N(0,1). With this assumption, scaling factor.
returns are also iid N(/x, a ). However, while the normal is a conve For exam ple, if volatility is m easured with monthly returns, then
nient distribution to use, the next section demonstrates that it does the annualized volatility is constructed using a scaling factor
not provide a full description of most financial returns. of 12:
The volatility of an asset is the standard deviation of the returns ânnual = V l2 X q*onth|y (12.8)
(i.e., \T a 2 = cr). This measure is the volatility of the returns over
the tim e span where the returns are m easured, and so if returns The variance, also called the variance rate, is just the square of
are com puted using daily closing prices, this measure is the daily the volatility. It scales linearly with tim e and so can also be easily
volatility. transform ed to an annual rate.
In a sim ple m odel, the return over multiple periods is the sum
of the returns. For exam ple, if returns are calculated daily, the
weekly return is
5 5 5 1 The linear scaling of the variance depends crucially on the assumption
2 rt+/ = 2 ^ + a e t+i = 5/x + c r 2 et+; that returns are not serially correlated. This assumption is plausible for
/= 1 ;= 1 /=1 the time series of returns on many liquid financial assets.
Chapter 12 Measuring Returns, Volatility, and Correlation ■ 213

Table 12.1 Annualized Volatilities Computed from The most well-known expression for determ ining the price of an
Variance Estimates Constructed Daily, Weekly, and option is the Black-Scholes-M erton m odel. It relates the price of
Monthly for the Returns on Gold, the S&P 500, and the a call option (i.e., C t) to the interest rate of a riskless asset (i.e., rf),
JPY/USD Exchange Rate the current asset price, the strike price, the tim e until maturity
(measured in T years), and the annual variance of the return
Gold S&P 500 ¥ /$ (i.e., a 2). All values in the m odel, including the call price, are
Daily 15.0% 16.0% 8.5% observable e x c e p t the volatility:
Weekly 17.6% 17.1% 9.7% C , = « rf, T, P„ K, O-2) (12.10)
Monthly 17.1% 14.5% 9.8% The value of a that equates the observed call option price with
the four other param eters in the formula is known as the implied
volatility. This implied volatility is, by construction, an annual
Consider the volatility of the returns of three assets:
value, and so it does not need to be transform ed further.
1 . Gold (specifically, the price as determ ined at the London
The Black-Scholes-M erton option pricing model uses several
gold fix),
simplifying assumptions that are not consistent with how mar
2. The return on the S&P 500 index, and kets actually operate. Specifically, the model assumes that the
3. The percentage change in the Jap an ese yen/US dollar rate variance does not change over tim e. This is not com patible with
(JPY/U SD ). financial data, and so the implied volatility that is calculated
using the model is not always consistent across options with dif
Returns are com puted daily, w eekly (using Thursday closing
r\ ferent strike prices and/or m aturities.
prices), and monthly (using the final closing price in the month).
The variance of returns is estim ated using the standard The VIX Index is another measure of implied volatility that
estim ator: reflects the implied volatility on the S&P 500 over the next cal
endar 30 days, constructed using options with a wide range of
- m )2 strike prices.
; t=i
The m ethodology of the VIX has been extended to many other
applied to returns constructed at each of the three frequencies,
assets, including other key equity indices, individual stocks,
where jl is the sam ple average return.
gold, crude oil, and US Treasury bonds. The most important
Table 12.1 shows the annualized volatilities constructed by trans limitation of the VIX is that it can only be com puted for assets
forming the variance estim ates. The annualized volatilities are with large, liquid derivatives m arkets, and so it is not possible to
similar within each asset and the minor differences across sam apply the VIX m ethodology to most financial assets. Because the
pling frequencies can be attributed to estimation error (because VIX Index makes uses of option prices with expiration dates in
these volatilities are all estim ates of the true volatility). the future, it is a forward-looking measure of volatility. This dif
fers from backward-looking volatility estim ates generated using
(historical) asset returns.
Implied Volatility
Implied volatility is an alternative measure that is constructed
using option prices. Both put and call options have payouts that
12.3 THE DISTRIBUTION
are nonlinear functions of the underlying price of an asset. For O F FINANCIAL RETURNS
exam ple, the payout of a European call option is defined as:
A normal distribution is sym m etric and thin-tailed, and so has
m ax(PT — K, 0), (12.9)
no skewness or excess kurtosis. However, many return series are
where P j is the price of the asset when the option expires (T) both skewed and fat-tailed.
and K is called the strike price. For exam ple, consider the returns of gold, the S&P 500, and
This expression is a convex function of the price of the asset and the JP Y /U S D exchange rate. The left two columns of Table 12.2
so is sensitive to the variance of the return on the asset. contain the skewness and kurtosis of these series when m ea
sured using daily, w eekly (Thursday-to-Thursday), monthly,
and quarterly returns. The monthly and quarterly returns are
2 Weekly returns use Thursday-to-Thursday prices to avoid the impact of com puted using the last trading date in the month or quarter,
any end-of-week effects. respectively.

Table 12.2 Estimated Skewness and Kurtosis for Three Assets Using Data between 1978 and 2017. Returns Are
Computed from Prices Sampled Daily, Weekly, Monthly, and Quarterly. The Third Column Contains the Value of
the Jarque-Bera Test Statistic. The Final Column Contains the p-Value for the Null Hypothesis That H0:S = 0 and
k = 3, Where S Is the Skewness and k Is the Kurtosis
Skewness Kurtosis JB Statistic p-Value
Gold Daily 0.31 17.57 89571.3 0.000
Weekly 0.93 13.08 9139.2 0.000
Monthly 0.63 7.05 358.6 0.000
Quarterly 1.03 6.65 116.7 0.000
S&P 500 Daily - 0 .7 4 23.79 182548.7 0.000
Weekly - 0 .6 8 11.68 6715.3 0.000
Monthly - 0 .6 5 5.26 136.2 0.000
Quarterly - 0 .6 2 3.99 16.6 0.000
JPY/USD Daily - 0 .3 8 6.95 6766.7 0.000
Weekly -0 .6 1 7.13 1610.9 0.000
Monthly -0 .2 1 4.09 27.4 0.000
Quarterly - 0 .2 0 3.14 1.2 0.557
All three assets have a skewness that is different from zero and (k — 3)2/24 also has a x i distribution. These two statistics are
a kurtosis larger than 3. In general, the skewness changes little asym ptotically independent (uncorrelated), and so J B ~ xi-
as the horizon increases from daily to quarterly, whereas the
Test statistics that have x 2 distributions should be small when
kurtosis declines substantially. The returns on the S&P 500 and
the null hypothesis is true and large when the null is unlikely
the returns on the JP Y /U S D exchange rate are both negatively
to be correct. The critical values of a X 2 are 5.99 for a test size
skew ed, while the returns on gold are positively skew ed. This
of 5% and 9.21 for a test size of 1%. W hen the test statistic is
difference arises because equities and gold tend to move in dif
above these values, the null that the data are normally distrib
ferent directions when m arkets are stressed— equities decline
uted is rejected.
m arkedly in stress periods, while gold serves as a flight-to-safety
asset and so moves in the opposite direction. The J B statistic is reported for the three assets in Table 12.2.
It is rejected in all cases excep t for the quarterly return on
The Jarque-Bera test statistic is used to form ally test w hether
the JP Y /U S D rate. The final column reports the p-value of the
the sam ple skew ness and kurtosis are com patible with an
test statistic, which is 1 — F ^ J B ), where is the C D F of a X 2
assumption that the returns are normally distributed. The null
random variable.
hypothesis in the Jarque-Bera test is H0:S = 0 and k = 3, where
S is the skew ness and k is the kurtosis of the return distribution. W hile the test rejects for equities even at the quarterly fre
These are the population values for normal random variables. quency, the low-frequency returns are closer to a normal than
The alternative hypothesis is H-pS 0 or k 3. The test the daily or w eekly returns on the S&P 500. This is a common
statistic is feature of many financial return series— returns com puted over
longer periods appear to be better approxim ated by a normal
J B = ( T - 1) ( 12 . 11) distribution.
where T is the sam ple size.

Power Laws
W hen returns are normally distributed, the skewness is asymp-
totically normally distributed with a variance of 6, so that 5 / 6 An alternative method to understand the non-normality of
has a x i distribution. M eanwhile, the kurtosis is asym ptotically financial returns is to study the tails. Normal random variables
normally distributed with mean 3 and variance of 24, and so have thin tails so that the probability of a return larger than ka

Figure 12.2 Plot of the log probability that a random variable X > x, where X has a mean of zero and unit
standard deviation. Each curve represents In Pr(X > x) = ln(1 — F^x)), which measures the chance of appearing
in the extreme upper tail for x E (4, 8).
declines rapidly as k increases, whereas many other distributions and nonlinear dependence. The linear correlation estim ator
have tails that decline less quickly for large deviations. (also known as Pearson's correlation) measures linear depen
dence. Previous chapters have shown that correlation (i.e., p) and
The most important class of these have p o w er law tails, so that the
the regression slope (i.e., (3) are intim ately related, and the
probability of seeing a realization larger than a given value of x is
regression slope is zero if and only if the correlation is zero.
P(X > x) = kx~a, Regression explains the sense in which correlation m easures lin
where k and a are constants. ear dependence.
The Student's t is an exam ple of a w idely used distribution with In the regression:
a power law tail. Yj = a + jQXj + €j
The simplest method to compare tail behavior is to examine the if Y and X are standardized to have unit variance (i.e.,
natural log of the tail probability. Figure 12.2 plots the natural log ox = oy = 1), then the regression slope is the correlation.
of the probability that a random variable with mean zero and unit
In contrast, nonlinear dependence takes many forms and cannot
variance appears in its tail [i.e., ln P r(X > x)]. Four distributions
be summarized by a single statistic. For exam ple, many asset
are shown: a normal and three parameterizations of a Student's t.
returns have common heteroskedasticity (i.e., the volatility across
The normal curve is quadratic in x, and its tail quickly diverges assets is simultaneously high or low). However, linear correlation
from the tails of the other three distributions. The Student's t does not capture this type of dependence between assets.
tails are linear in x, which reflects the slow decay of the prob
ability in the tail. This slow decline is precisely why these distri
butions are fat-tailed and produce many more observations that
Alternative Measures of Correlation
are far from the mean when com pared to a normal. Linear correlation is insufficient to capture dependence when
assets have nonlinear dependence. Researchers often use two
alternative dependence m easures: rank correlation (also known
12.4 CORRELATION VERSUS
as Spearm an's correlation) and Kendal's r (tau). These statistics
D EPEN D EN CE are correlation-like: both are scale invariant, have values that
always lie between —1 and 1, are zero when the returns are inde
The dependence between assets plays a key role in portfolio
pendent, and are positive (negative) when there is in increasing
diversification and tail risk. Recall that two random variables, X
(decreasing) relationship between the random variables.
and Y, are independent if their joint density is equal to the prod
uct of their marginal densities:
Rank Correlation
fx,v(x, y) = f x M M y ) Rank correlation is the linear correlation estim ator applied to the
Any random variables that are not independent are dependent. ranks of the observations. Suppose that n observations of two
Financial assets are highly dependent and exhibit both linear random variables, X and Y, are obtained. Let Rankx and Ranky

be the rank of X and Y, respectively. The rank operator assigns 1 the data-generating process (DGP) Y, = X , + e/; where X,- and e,
to the sm allest value of each, 2 to the second sm allest, and so are independent standard normal random variables. The top-right
on until the largest is given rank n.3 The rank correlation estim a panel plots simulated data from the D G P Y, = exp(2X,) + e, using
tor is defined as: the same realizations of X, and et. Both plots include the true con
Cov[Rankx, Ranky] ditional mean E [Y |X ] and the fitted regression line a + (3X. These
( 12 . 12)
align closely in the left panel and only differ due to estimation
\A/[Ran/cx]VV[Ran/cy]
error. In the right panel, however, the linear relationship is a poor
W hen all ranks are distinct, the estim ator can be equivalently
approximation to the conditional expectation.
expressed as a function of the difference in the pairwise ranks:
The bottom panels plot the ranked values Rankx against
6 2 / 1= -|(R an/cX/ - RankYi):
P s= 1 - (12.13) RankYr Both D G Ps are m onotonic and so the ranks have linear
n(n2 - 1)
relationships.
where RankXj — RankYj measures the difference in the ranks of
the two elements of the same observation (i.e., X; and Y,). When Kendall's r
highly ranked values of X are paired with highly ranked values of
Kendall's t measures the relative frequency of concordant and
Y, then the difference is small (i.e., (RXi — RYi) 2 ~ 0) and the corre
discordant pairs.
lation is near 1. When X a n d Y h a ve strong negative dependence
(so that largest values of X couple with the smallest values of V), Two random variables, (X/, Y,) and [Xj, Yj) are concordant if
then the difference is large and the rank correlation is close to - 1 . X,- > Xj and Y, > Yj or if X, < Xj and Y, < Yj. In other words,
concordant pairs agree about the relative position of X and Y.
Note that rank correlation depends on the strength of the linear
Pairs that disagree about the order are discordant, and pairs
relationship between the ranks of X and Y, not the linear rela
with ties (X/ = Xj or Y, = Yj) are neither. Random variables with
tionship between the random variables them selves.
a strong positive dependence have many concordant pairs,
W hen variables have a linear relationship, rank and linear corre whereas variables with a strong negative relationship have
lation are usually similar in m agnitude. The rank correlation esti mostly discordant pairs.
mator is less efficient than the linear correlation estim ator and is
Kendal's r is defined as:
commonly used as an additional robustness check.
"c “ n d n. "d
(12.14)
/V
Significant differences in the two correlations indicate an impor T =

n(n - 1)/2 nc + nd + nt nc + nd + nt‘
tant nonlinear relationship. Rank correlation has two distinct
advantages over linear correlation. where nc is the count of the concordant pairs, nd is the count of
the discordant pairs, and nt is the count of the tie s.4
1. First, it is robust to outliers because only the ranks, not the
values of X and Y, are used. Kendall's t is defined as the difference between the probabil
ity of concordance and the probability of discordance. When
2. Second, it is invariant with respect to any monotonic
all pairs are concordant, then t = 1. If all pairs are discordant,
increasing transform ation of X, and Y/ (e.g ., X, —>f (X,),
r is —1. O ther patterns produce values between these two
where f(X ,) > f ( X j ) when X, > Xj). Linear correlation is only
extrem es. Note that concordance only depends on the pairwise
invariant with respect to increasing linear transform ations
order, not the m agnitude of data, and so m onotonic increasing
(e.g ., X, —* a + bX„ where b > 0). This invariance with
transform ations also have no effect on Kendall's r .
respect to a wide class of transform ations makes rank cor
relation particularly useful when exam ining the relationship Figure 12.4 plots the relationship between linear (Pearson) cor
between primary and derivative asset returns. relation, rank (Spearman) correlation, and Kendall's t for a bivari
ate normal.
Figure 12.3 illustrates the difference between correlation and rank
correlation. The top-left panel plots simulated observations from The rank correlation is virtually identical to the linear correlation for
normal random variables. Kendall's t is increasing in p, although the
3 When observation values are not unique then the ranks cannot be relationship is not linear. Kendall's r produces values much closer
uniquely assigned, and so each observation in a tied group is assigned
the average rank of the observations in the group. For example, in the
data set {1, 2, 3, 3, 3, 3, 4, 4, 5}, the value 3 repeats four times. The 4 This estimator may be far from 1 or -1 if the data have many ties.
ranks of the four values in the group of 3s are 3, 4, 5, and 6, so that the There are improved estimators that modify the denominator to account
rank assigned to each 3 is (3 + 4 + 5 + 6)/4 = 4.5. The two repetitions for ties that have better properties when this is the case. Ties are not an
of 4 are assigned rank (7 + 8)/2 = 7.5, and so that the ranks of the val important concern in most applications to financial assets returns, and it
ues in the data set are {1, 2, 4.5, 4.5, 4.5, 4.5, 7.5, 7.5, 9}. is not usually necessary to use an alternative.

Linear D ependence Nonlinear D ependence
Ranks Ranks
oc
Fiqure 12.3 The left panels plot data that are linearly dependent. The top-left panel plots the raw data along
with the conditional expectation and the estimated regression line. The bottom-left panel shows the ranks of the
data and the fitted regression line on the ranks. The right panels show data with a nonlinear relationship. The
top-right panel shows the data, the conditional expectation, and the linear approximation. The bottom-right panel
plots the ranks of the data along with the fitted regression line.
Fiaure 12.4 Relationship between Pearson's correlation [p), Spearman's rank correlation (ps), and Kendall's r for
a bivariate normal distribution.

to zero than the linear correlation for most of the range of p. Esti Table 12.3 Population Values of Linear (Pearson)
mates of r with the same sign as the sample correlation, but that Correlation, Rank (Spearman) Correlation, and Kendall's
are smaller in magnitude, do not provide evidence of nonlinearity. t for the Linear and Nonlinear DGPs in Figure 12.3
Table 12.3 contains the population values for the linear cor
Linear DGP Nonlinear DGP
relation, rank correlation, and Kendall's t using the two DGPs
depicted in Figure 12.3. The linear and rank correlations are nearly Linear Correlation 0.707 0.291
identical when the DGP is linear and Kendall's r is predictably Rank Correlation 0.690 0.837
lower here. When the D G P is nonlinear, both the rank correlation
Kendall's t 0.500 0.667
and Kendall's r are larger than the linear correlation. This pattern
STRUCTURED CORRELATION MATRICES

A correlation m atrix is interpretable as a covariance matrix This value is negative, and so X cannot be a valid covariance
where all random variables have unit variance. The fact that m atrix. This requirem ent— that any weighted average must
any linear combination of random variables must have a non have a positive variance— is known as positive definiteness.
negative variance imposes restrictions on the values in a cor
Practitioners commonly impose structure on correlation
relation m atrix of three or more random variables.
m atrices to ensure that they are positive definite. Two struc
For exam ple, suppose that a trivariate normal random vari tured correlations are commonly used.
able with com ponents X-|, X 2, and X 3 has mean zero and
1 . The first type sets all correlations equal to the same
covariance (and correlation) matrix:
value (i.e., p,y = p for all pairs / and j), a model known as
1 - 0.9 - 0 .9 “ equicorrelation.
2 . The second type uses a factor structure that assumes
correlations are due to exposure to a common factor.
This type of structured correlation mimics the correlation
The variance of an average of the three com ponents: of assets related through C A P M , so that the correlation
between any two entries can be written as p,y = ypy-,,
-(3X1 + 2 x - 0 .9 + 2 x - 0 .9 + 2 x - 0 .9 ) = - 0 .2 6 7
where both y i and y 2 are between - 1 and 1.
in the two alternative measures of correlation indicates that the W hile an assumption of normality is convenient, it is not con
dependence in the data has important nonlinearities. sistent with most observed financial asset return data. In fact,
most returns are both skewed and heavy-tailed. These two
12.5 SUMMARY features are most prominent when measuring returns using
high-frequency data (e.g ., daily), whereas the normal is a better
This chapter began with a discussion of simple returns and log approxim ation over long horizons (i.e., one quarter or longer).
returns. These two measures broadly agree when returns are
Deviations from normality can be exam ined in a couple of
small but substantially disagree if they are large in m agnitude.
ways. The Jarque-Bera test exam ines the skewness and kurtosis
In practice, log returns are used when prices are sam pled fre
of a data series to determ ine whether it is consistent with an
quently, because the approxim ation error is unimportant. O ver
assumption of normality. M eanwhile, the tail index is a measure
longer horizons, the simple return provides a more accurate
of the tail decay that exam ines the rate at which the log-density
measure of the perform ance of an investm ent.
decays. Small values of the tail index indicate that the returns on
Both the mean and the variance of asset returns grow linearly in an asset are heavy-tailed.
tim e, so that the h-period mean and variance are h multiplied
The chapter concluded with an exam ination of alternative
by the first-period mean and variance. The standard deviation,
measures of dependence. W hile linear correlation is an impor
which is the square root of the variance, grows with V 7 ).
tant m easure, it is only a measure of linear dependence. Two
Im plied volatility is an alternative m easure of risk that can alternative m easures, rank correlation and Kendall's r , are more
be constructed for assets that have liquid options m arkets. suited to measure nonlinear dependence. These two measures
Im plied volatility can be estim ate using the Black-Scholes- are correlation-like in the sense that they are between —1 and
M erton option pricing model or the V IX Index, which 1. Both also have two desirable properties when com pared to
reflects the im plied volatility on the S&P 500 over the next linear correlation: They are invariant to monotonic increasing
30 calendar days. transform ations and they are robust to outliers.

QUESTIONS

12.1. If returns are fat-tailed, and we instead assume that they 12.3. Variances are usually transform ed into volatilities by tak
are normally distributed, is the probability of a large loss ing the square root. This changes the units from squared
over- or under-estimated? returns to returns. W hy isn't the same transform ation used
12.2. Is Iinear correlation sensitive to a large outlier pair? W hat on covariances?
about rank correlation and Kendall's r?
Practice Questions
12.4. On Black Monday in O ctober 1987, the Dow Jo n es Indus 1 2 .1 1 . The following data is collected for four distributions:
trial A verage fell from the previous close of 2,246.74 to
1,738.74. W hat was the sim ple return on this day? W hat Dataset Skew Kurtosis T
was the log return? A 0.85 3.00 50
12.5. If the annualized volatility on the S&P 500 is 20% , what is B 0.85 3.00 51
the volatility over one day, one w eek, and one month? C 0.35 3.35 125
12.6. If the mean return on the S&P 500 is 9% per year, what is D 0.35 3.35 250
the mean return over one day, one w eek, and one month?
W hich of these datasets are likely (at the 95% confidence
12.7. If an asset has zero skewness, what is the maximum kurto-
level) to not be drawn from a normal distribution?
sis it can have to not reject normality with a sam ple size of
100 using a 5% test? W hat if the sam ple size is 2,500? 12.12. Calculate the rank correlations for the following data:
12.8. If the skewness of an asset's return is —0.2 and its kurtosis 1
•
X Y
is 4, what is the value of a Jarque-Bera test statistic when
1 0.22 2.73
T = 100? W hat if T = 1,000?
2 1.41 6.63
12.9. Calculate the simple and log returns for the following data:
3 - 0 .3 0 - 2 .1 9
Time Price 4 - 0 .5 9 -6.51
0 100 5 -3 .0 8 -0 .9 9
1 98.90 6 1.08 2.63
2 98.68 7 -0 .4 5 - 3 .4 0
3 99.21 8 0.40 5.10
4 98.16 9 -0 .7 5 - 5 .1 4
5 98.07 10 0.24 1.14
6 97.14
12.13. Find Kendall's Tau for the following data:
7 95.70
X V
•
8 96.57 J
1 3.12 2.58
9 97.65
2 - 1 .2 6 -0 .0 5
10 96.77
3 2.08 -0 .7 2
12.10. The implied volatility for an ATM money option is
4 -0 .2 8 -0 .5 2
reported at 20% , annualized. Based upon this, what
would be the daily implied volatility? 5 - 1 .9 6 - 0 .4 0

ANSW ERS

12.1. We will underestim ate the probability of a large loss correlation and Kendall's t are both less sensitive to a
because the normal has thin tails. For exam ple, the prob single outlier because the effect is bounded and is aver
ability of a 4a loss from a normal is .003% (3 in 100,000). aged out by the other observations. If n is large, then
The probability of a 4rj loss from a standardized Stu a single outlier will have an immaterial im pact on the
dent's t4 is .24% or 80 tim es more likely. estim ates.
12.2. Yes. Linear correlation is sensitive in the sense that mov 12.3. Covariance may have either sign, and so the square
ing a single point in X 1f X 2 space far from the true linear root transform ation is not reliable if the sign is negative.
relationship can severely affect the correlation. Effec Transformation to (3 or correlation is preferred when
tively a single outlier has an unbounded effect on the exam ining the magnitude of a covariance.
covariance, and ultimately the linear correlation. Rank
Solved Problems
^7 2 8 7 4 _ 2246 74 12.9. S imple returns use the form ula:
12.4. The simple return i s ------- ------- = —22.6% . The
2246.74
P t ~ P t-
log return is In 1738.74 - In 2246.74 = 25.6% . The Rt =
P t-
difference between the two is increasing in the
m agnitude. Log returns use the form ula:
12.5. The scale factors for the variance are 252, 52, and 12, Pt
rt = In Pt — In Pt_ 1 = In
P t-
respectively. The volatility scales with the square root of
i i r i 20% 20%
the scale factor, and so — ----- = 1.26% , — — = 2.77% , Time Price Simple Log
V252 \/5 2
20 % 0 100
and — 7 = = 5.77% .
V12
1 98.90 -1 .1 0 % -1 .1 1 %
12.6. The mean scales linearly with the scale factor, and so
9% 9% . . 9% . 2 98.68 -0 .2 2 % -0 .2 2 %
— — = .036% , —— = 0.17% , and —— = 0.75% .
252 52 12 3 99.21 0.54% 0.54%
12.7. The Jarque-Bera has X 2 distribution, and the criti
4 98.16 -1 .0 6 % -1 .0 6 %
cal value for a test with a size of 5% is 5.99. The
f S 2 (k - 3)2 5 98.07 -0 .0 9 % -0 .0 9 %
Jarque-Bera statistic is J B = (T - 1)( — -I-----— —
6 97.14 -0 .9 5 % -0 .9 5 %
so that when the skewness S = 0, the test statistic is
7 95.70 -1 .4 8 % -1 .4 9 %
(k - 3)2
(T — 1)— —— . In order to not reject the null, we need 8 96.57 0.91% 0.90%
24
(k - 3)2 5.99 9 97.65 1 .12% 1.11%
( T - 1) < 5.99, and so (k — 3)2 < 24 X
24 T - 1
10 96.77 -0 .9 0 % -0 .9 1 %
5.99
and k —3 T .»/ 24 X . W hen T = 100, this value
T - 1 12.10. Using the equation
is 4.20. W hen T = 2,500 this value is 3.24. This shows
®annual = V 2 5 2 X o Ldaily
;
that the J B test statistic is sensitive to even mild excess
kurtosis. 0.2 = V 2 5 2 X daily
12.8. Using the formula in the previous problem , the value of 0.2
&daily = 1.26%
the J B is 4.785 when T = 100 and 48.3 when T = 1000. V 2 5 2

12.11. The appropriate test statistic is the Jarque-Bera statistic. 12.13. The first step is to rank the data:
1 X Y Ry/
•
R x,
1 3.12 2.58 5 5
This will be distributed as ar|d for a 5% test size the
critical value is 5.99 2 - 1 .2 6 -0 .0 5 2 4
3 2.08 -0 .7 2 4 1
Dataset Skew Kurtosis T JB
4 -0 .2 8 -0 .5 2 3 2
A 0.85 3.00 50 5.90
5 - 1 .9 6 - 0 .4 0 1 3
B 0.85 3.00 51 6.02
The next step is to see which pairs are concordant, which
C 0.35 3.35 125 3.16
are discordant, and which are neither.
D 0.35 3.35 250 6.35
W hich leads to the pair-by-pair com parison:
So, as the J B figure is greater than 5.99 for datasets

B & D, their respective null hypotheses (normally distrib
uted) are rejected.
1 2 .1 2 . Following the procedure outlined in the text (and exam

ple), the first step is to com plete the table:
•
1 X Y R x, Ry; RX , ~ R Xi
1 0.22 2.73 6 8 - 2
2 1.41 6.63 10 10 0 For the first observation, because both the rank of X & Y
3 - 0 .3 0 - 2 .1 9 5 4 1 are the maximum, every other observation will autom ati
cally be concordant with respect to this one.
4 - 0 .5 9 -6.51 3 1 2
By contrast, looking at the second and third observa
S -3 .0 8 -0 .9 9 1 5 -4
tions, the X rank increases while the Y rank decreases.
6 1.08 2.63 9 7 2 Therefore, this pair is discordant.
7 -0 .4 5 - 3 .4 0 4 3 1 „ _ nc - nd _ 5 -5 _
8 0.40 5.10 8 9 -1 T n(n - 1)/2 10
9 -0 .7 5 - 5 .1 4 2 2 0
10 0.24 1.14 7 6 1
Sum Squares 32
6 S / = i ( Rx; ~ Ry/): 6*32

1 0.81
10(100 1)
= -
n(n 2 - 1) -

Simulation and
Bootstrapping
Learning Objective
D escribe the basic steps to conduct a Monte Carlo Describe pseudo-random number generation.
simulation.
Describe situations where the bootstrapping method
D escribe ways to reduce Monte Carlo sampling error. is ineffective.
Explain the use of antithetic and control variates in reduc Describe the disadvantages of the simulation approach to
ing Monte Carlo sampling error. financial problem solving.
D escribe the bootstrapping method and its advantage

over Monte Carlo simulation.
Simulation is an im portant practical tool in modern risk man (PRNGs) and are initialized with what is called a seed value. It is
agem ent with a wide variety of applications. Exam ples of these important to note that each distinct seed value produces an iden
applications include computing the expected payoff of an tical set of random values every time the PNRG is run.
option, measuring the downside risk in a portfolio, and assess
The reproducibility of outputs from PRN Gs has two im portant
ing estim ator accuracy.
applications.
Monte Carlo simulation— often called a Monte Carlo experim ent
1. It allows results to be replicated across multiple exp eri
or just a Monte Carlo— is a simple approach to approxim ate the
ments, because the same sequence of random values
expected value of a random variable using numerical methods.
can always be generated by using the same seed value.
A Monte Carlo generates random draws from an assumed data
This feature is im portant when exploring alternative m od
generating process (D G P). It then applies the function (or func
els or estim ators, because it allows the alternatives to
tions) to these sim ulated values to generate a realization from
be estim ated on the same simulated data. It also allows
the unknown distribution of the transform ed random variables.
simulation-based results to be reproduced later, which is
This process is repeated, and the statistic of interest (e.g ., the
im portant for regulatory com pliance.
expected price of a call option) is approxim ated using the sim u
lated values. Im portantly, repeating the simulation improves the
2. Setting the seed allows the same set of random numbers to
accuracy of the approxim ation, and the number of replications be generated on multiple computers. This feature is widely
used when examining large portfolios containing many finan
can be chosen to achieve any required level of precision.
cial instruments (i.e., thousands or more) that are all driven by
Another important application of simulated methods is boot
a common set of fundamental factors. Distributed simulations
strapping. The bootstrap gets its name from a seem ingly impossi that assess the risk in the portfolio must use the same simu
ble feat: "to pull oneself up by one's bootstraps." Bootstrapping
lated values of the factors when studying the joint behavior
uses observed data to simulate from the unknown distribu of the instruments held in the portfolio. Initializing the PRNGs
tion generating the observed data. This is done by combining
with a common seed in a cluster environment ensures that
observed data with simulated values to create a new sample that the realizations of the common factors are identical.
is closely related to, but different from, the observed data.
Monte Carlo simulation and bootstrapping are closely

related. In both cases, the goal is to com pute the expected 13.2 APPROXIMATING MOMENTS
value of an (often com plex) function. Both m ethods use
com puter-generated values (i.e., sim ulated data) to numerically Monte Carlo simulation is frequently used to estim ate popula
approxim ate this expected value. tion moments or functions.
The fundam ental difference between simulation and bootstrap Suppose X ~ Fx, where Fx is a known distribution. In simple
ping is the source of the sim ulated data. W hen using simulation, distributions, many moments (e.g ., the mean or variance) are
the user specifies a com plete D G P that is used to produce the analytically tractable. For exam ple, if X ~ N(/x,cr2) then E[X] = /jl
sim ulated data. In an application of the bootstrap, the observed and V[X] = a 2.
data are used directly to generate the simulated data set w ith However, analytical expressions for moments are not available
out specifying a com plete DGP. for several relevant cases, (e.g ., when random variables are con
structed from com plex dynam ic models or when com puting the
expectation of a com plicated nonlinear function of a random
13.1 SIMULATING RANDOM variable). Simulation provides a sim ple method to approxim ate
VARIABLES the analytically intractable expectation.
Suppose X is a random variable that can be simulated (e.g ., a

Conducting simulation experim ents requires generating random
normal) and g is a function that can be evaluated at realizations
values from an assumed distribution. However, numbers gen
of X. Simulation constructs many independent copies of g(X) by
erated by com puters are not actually random and are usually
simulating draws from X (i.e., x,) and then com puting1 g; = g(x,).
referred to as pseudo-random numbers.
Pseudo-random numbers are generated by com plex but deter

ministic functions that produce values that are difficult to predict,
1 In many applications of simulations, X is a vector with n elements.
and so appear to be random. The functions that produce pseudo Computing the expected value of a scalar-valued function that depends
random values are known as pseudo-random number generators on a simulated vector uses g, = g(x-|(-, x2,-, . . . , x m) in place of g, — g(x,).

SIMULATING RANDOM VALUES FROM A SP EC IFIC DISTRIBUTION
Random num ber generators sim ulate standard uniforms— X has C D F Fx, then U = Fx(X) is a uniform random variable.
Uniform (0,1). In most applications, it is necessary to This relationship implies that applying the inverse C D F to a
transform the uniform sam ples into draws from another uniform random variable U produces a random variable dis
distribution. tributed according to Fx. Form ally, if Fx is the inverse C D F
function of X, then F x {U ) ~ Fx .
For exam ple, consider a simulation of data coming from a
heavy-tailed probability distribution, such as a generalized The top panel of Figure 13.1 shows how realizations of four
Student's t with six degrees of freedom . Recall that a Stu iid uniform random variables (i.e., uh u2, u3, and u4) are
dent's t random variable is defined as: m apped through the inverse C D F (i.e., F x ) of X t o simulate
four iid draws from a generalized Student's t with six degrees
Z /V w /v,
of freedom .
where Z is a standard normal and W is an independent ^ The bottom panel illustrates how each transform ed value
random variable with v degrees of freedom .
(i.e., xj) is the realization that has C D F value of u,-. This
The definition of a Student's t cannot be directly applied method can also be applied to discrete random vari
because com puters only generate data from uniform distribu ables because the C D F of a discrete random variable is
tions. However, Chapter 3 showed that if a random variable a step function.
Inverse CDF
0.3 0.9 “ 1
CDF
Fiaure 13.1 The top panel illustrates random values generated from Student's t distribution using uniform
values, marked on the x-axis. The uniform values are mapped through the inverse CDF of the Student's t to
generate the random variates marked on the y-axis. The bottom panel show how the generated values map
back to the uniform originally used to produce each value through the CDF.
Chapter 13 Simulation and Bootstrapping ■ 225

This is repeated b tim es2 to produce a set of iid draws from the
unknown distribution of g(X). The sim ulated values g,- are realiza CON DUCTIN G A SIMULATION
tions from the distribution of g(X) and so can be used to approx EXPERIM EN T
imate the moment or other quantity of interest.
1. G enerate data x, = [x1(-, x 2,-, . . . , x n(] according to the
For exam ple, to estim ate the mean of g(X), the unknown popu assumed DGP.
lation value is approxim ated using an average. Sim ulated draws 2. Calculate the function or statistic of interest,
are iid by construction, and b draws are used to approxim ate 9i = g M .
the expected value as: 3 . Repeat steps 1 and 2 to produce b replications.
1 k 4. Estim ate the quantity of interest from

ElsrfX)] = r S s t X ) (13.1) (91 92i ••• i 9b}'
D;= 1
/
5. A ssess the accuracy by computing the standard error.

The approxim ated expectation is an average of b iid random If the accuracy is too low, increase b until the required
variables, and so the Law of Large Numbers (LLN) applies accuracy is achieved.
Jim E[g(X)] = E[g(X)] (13.2)
Furtherm ore, the Central Limit Theorem (CLT) also applies to

the average. The variance of the sim ulated expectation is esti Monte Carlo experim ents allow the finite sam ple distribution of
mated by: an estim ator to be tabulated and com pared to its asym ptotic
distribution derived from the CLT. The finite sam ple distribution
V[£[g(X)]] =
is also w idely used to exam ine the effect of using an estim ate
where <r| = V[g(X)]. instead of the true value param eter when making a decision.
This variance is estim ated using Assessing the uncertainty in an estim ated param eter aids in
understanding the im pact of alternative decision rules (e.g.,
1 b ____ _
S-l= T :2 ( s ( X > - E[g(X)])2, w hether to hedge a risk fully or partially) under realistic condi
(13.3)
D/=1
tions where the true param eter is unknown.
which is the standard variance estim ator for iid data.
Consider the finite-sam ple distribution of model param-
The standard error of the sim ulated expectation (i.e., <rg/ \ / b ) eter 0. First, a simulated sam ple of random variables
measures the accuracy of the approxim ation, which allows b to x = [x1f x 2, . . . , x n] is generated from the assumed D G R These
be chosen to achieve any desired level of accuracy. sim ulated values are then used to estim ate 0 .
Simulations are applicable to a wide range of problem s, and the This process— simulating a new data set and then estim ating the
set of simulated draws {g 1# g2, . . . , g j can be used to com pute model param eters— is repeated b tim es. This produces a set of
A A /V
other quantities of interest. For exam ple, the a-quantile of g(X) b sam ples {01f 0 2, . . . , 0 b} from the finite-sam ple distribution of
can be estim ated using the em pirical quantile of the simulated the estim ator of 0. These values can be used to assess any prop
values. This is particularly im portant in risk m anagem ent when erty (e.g ., the bias, variance, or quantiles) of the finite-sam ple
A
estim ating the lower quantiles— a G {0.1% , 1%, 5%, 10%}— of distribution of 0 .
a portfolio's return distribution. The em pirical a-quantile is esti
For exam ple, the bias, defined as:
mated by sorting the b draws from sm allest to largest and then
selecting the value in position ba of the sorted set. Bias(0) = E[0] - 0
Simulation studies are also used to assess the finite sample is estim ated in a finite sam ple of size b using
properties of param eter estim ators. Most results presented A
in this book rely on asym ptotic results based on the LLN and
BiiS( ) =
0 ( 1 3 . )
4
the CLT. Flowever, only n observations are availble to estim ate

param eters in practice, and the approxim ation made using a The Mean of a Normal
CLT may not be adequate when the sam ple size is small.
Consider the problem of approxim ating the mean of a ran
dom variable that comes from a normal probability distribution
X ~ N{/jl , a 2). By construction, the mean of X is /x, so it is not
2 b is used to denote the number of replications in a simulation study
and to distinguish the number of replications from the size of a single necessary to use simulation to approxim ate this value. It is also
simulated sample that may have n or T observations. known that E[X] = /x and V[X\ = cr2 /n. Approxim ating the mean

6 = 50
0 1 2
Fiaure 13.2 Distribution of the estimated expected value of a normal with /jl = 1 and a 2 = 9 for simulated
sample sizes of 50 and 1,250.
in a Monte Carlo simulation requires generating b (where b = n) sampling error of a sim ulated expectation is proportional to
draws from a normal with mean ju and variance a 2. 1 / V b . This rate of convergence holds for any moment approxi
V -------
mated using simulation and is a consequence of the CLT for
Figure 13.2 contains a density plot of E[X] = X, where /x = 1
iid data. Furtherm ore, this convergence rate applies even to
and a 2 = 3, for two replications sizes b = 50 and b = 1,250.
approxim ations of the expected value of com plex functions
These densities were constructed by generating 1,000 indepen
because the b replications are iid.
dent approxim ations for each replication size.
In both cases, the distribution is centered on the true value 1.

However, realizations when b = 50 are highly dispersed and Approximating the Price of a Call Option
many are far from 1. W hen b = 1,250, the density is much more
concentrated and virtually all realizations are within ± 0 .2 of 1. A s an exam ple, consider the simulation of the price of a Euro
pean call option. A European call pays
Table 13.1 contains the average estim ates of ju and the stan
dard error of the expected value estim ate. The standard error max(0, S j — K),
decreases as b increases, although only slowly. Increasing b by where S T is the stock price when the option expires at time T
a factor of 25 (i.e., from 50 to 1,250) reduces the standard error and K is the strike price.
by a factor of five. This is the expected reduction because the
Figure 13.3 plots the payoff function of the call options. The
payoff is a nonlinear function of the underlying stock price at
Table 13.1 Average Estimated Expected Value expiration, so computing the expected payoff requires a model
and Error of the Expected Value as a Function of the for the stock price.
Number of Replications Used
In this case, it is assumed that the log of the stock price is normally
b Mean of Standard Deviation distributed. The time T lo g stock price (i.e., sT) is equal to the sum
Sample Means of Sample Means of the initial stock price s0, a mean, and a normally distributed error:
10 0.985 0.917 sT = s0 + T(rf - a 2 /2) + V T x (- (13.5)
50 1.017 0.428 where x, is a sim ulated value from a N(0, a 2), rf is the risk-free
rate, and a 2 is the variance of the stock return. Both the risk-free
250 1.001 0.185
rate and the variance are measured using annualized values, and
1250 1.005 0.087
T is defined so that one year corresponds to T = 1.

Figure 13.3 The payoff of a European call option as a function of the terminal stock price ST and the strike price K.
The final stock price is S T = exp (sT), and the present value of the 1 _
where a i = —^ (Q — C )2 is the variance estim ator applied to
option's payoff is y b '
the sim ulated call prices.
C = e xp (—r/T) m ax(ST — K, 0) (13.6)
A two-sided 95% confidence interval for the price can be con
Simulating the price of an option requires an estim ate of the
structed using the average simulated price and its standard error:
stock price volatility. The estim ated volatility from the past 50
years of the S&P 500 returns is a = 16.4% . The risk-free rate is [ E [ Q - 1.96 X s.e.(E[C l), E[C] + 1.96 X s.e.(E[C])]
assumed to be 2%, the option expires in two years, the initial
price is S q = 2,500, and the strike price is K = 2,500. Table 13.2 Estimated Present Value of the Payoff of
a European Call Option with Two Years to Maturity. The
The expected price of a call option is estim ated by the average
Left Column Contains the Approximation of the Price
of the simulated values:
Using b Replications. The Center Column Reports the
Estimated Standard Error of the Approximation. The
gel = c = H e ,,
bi= i Right Column Reports the Ratio of the Standard Error
where C,- are simulated values of the payoff of the call option.
in the Row to the Standard Error in the Row Above
Using Equation 13.5, the sim ulated stock prices are b Price Std. Err. Std. Err. Ratio
S jj = exp(s0 + T(r — a 2 / 2 ) + V T x ,) 50 301.50 71.23 —
Using the notation of the previous section, the function g(x,) is 200 283.63 30.19 2.36
defined 800 276.21 15.39 1.96
g(x,) = e xp (—rT ) max(exp(so + T(r — a 2 12) + V 7 x ,) — K, 0) 3,200 283.93 7.48 2.06
The number of replications varies from b = 50 to 3,276,800 in 12,800 275.62 3.64 2.06
constant multiplies of 4. The left column of Table 13.2 shows the 51,200 278.49 1.85 1.97
expected price of a call option from a single approxim ation, and
204,800 278.64 0.93 1.99
the center column reports the estim ated standard error of the
approxim ation. The standard error is com puted as: 819,200 278.85 0.46 2.01
3,276,800 279.15 0.23 2.00

s.e.(E[Cj) = V W b , 0 3 .7 )

The standard error is large relative to the expected option where U2 is also a uniform random variable and, so:
price except for the largest values of b. For exam ple, when
F x 1 (U2) ~ Fx
b = 12,800, the 95% confidence interval covers a range of USD
14 (= 2 X 1.96 X 3.64), which is 5% of the value of the call option. By construction, the correlation between U | and U2 is negative,
and mapping these values through the inverse C D F generates
The right column of Table 13.2 contains the ratio of the standard
Fx distributed random variables that are negatively correlated.
error of the estim ate with the next sm allest replication size. The
standard error declines in the expected manner as b increases For exam ple, suppose that F is a standard normal distribution.
because the standard error is c r j \ f b , where ag is the standard Recall that the standard normal is sym m etric so that Fû-1 ) =
deviation of the simulated call option prices. —FxH1 — u-|) (e.g ., if u -1 = 0.25, then x-| = F~1 (u-|) = —0.67 and
x2 = F ~ \ u 2) = F~1(1 — u-|) = 0.67). These values are perfectly
negatively correlated.
Improving the Accuracy of Simulations
Using antithetic random variables in simulations is virtually
The standard method of estimating an expected value using identical to running a standard simulation. The only differ
simulation relies on the LLN. The standard error of the estim ated ence occurs when generating the values used in the sim ula
expected value is proportional to 1 /\/ib and, because the sim u tions. These values are generated in pairs: ((/], 1 — U-]),
lations are independent, it only depends on the variance of the (U2, 1 — U2 ) , . . . ,(Ub/2 , 1 — Ub/2 ). These uniforms are then trans
sim ulated values. form ed to have the required distribution using the inverse CDF.
Recall that the variance of a sum of random variables is the Note that only b/2 uniforms are required to produce b simu
sum of the variances plus tw ice the covariances between each lated values because the random variables are generated in
distinct pair of random variables. W hen the simulations are inde pairs. Antithetic pairs produced using this algorithm are only
pendent, these covariances are all zero. However, if covariance guaranteed to reduce the simulation error if the function g(X)
between the sim ulated values is negative, then the variance of is monotonic (i.e., either always increasing or decreasing) in x.
the sum is less than the sum of the variances. This is the idea M onotonicity ensures that the negative correlation between x ,
behind antithetic variables, which are random values that are and —X; will translate to g(x,) and g (—x ().
constructed to generate negative correlation within the values
The improvement in accuracy when using antithetic variables occurs
used in the simulation.
due to the correlation between the paired random variables. In
Control variates are an alternative that add values that are mean the usual case with b iid draws, the standard error of the simulated
zero and correlated to the simulation. Because the control vari expectation is ^g/X^b. Because the antithetic variables are corre
ate is mean zero, its addition does not bias the approxim ation. lated, the standard error of the simulated expectation becomes
The correlation between the control variate and the function of
a-g V TT
interest allows an optimal combination of the control variate and (13.8)
the original simulation value to be constructed to reduce the Vb
variance of the approxim ation. Thus, the standard error is reduced if p < 0.
Antithetic variables and control variates are com plem entary, and
they can be used sim ultaneously to reduce the approxim ation Control Variates
error in a Monte Carlo simulation.
Control variates are an alternative method to reduce simulation
error. The standard simulation method uses the approxim ation:
Antithetic Variables
1 k
E[g(X)] = - 2 g W
Antithetic variates add a second set of random variables that D/=1
are constructed to have a negative correlation with the iid vari
This approxim ation is consistent because each random variable
ables used in the sim ulation. They are generated in pairs using a
g(X() can be decom posed as
single uniform value.
E[g(X;)] + 77 ,,
Recall that if U -1 is a uniform random variable, then:
where 17 , is a mean zero error.
F x ( U i) ~ Fx
A control variate is another random variable /i(X() that has
Antithetic variables can be generated using
mean 0 (i.e., E[h(X,)] = 0) but is correlated with the error 77 ;
U2 = 1 - U „ (i.e., Corr[h(X,), 77 ,] 7 ^ 0).

A good control variate should have two properties. g(X) is the call option value function) is determ ined to be -43%.
The standard error of the sim ulated value is reduced by approxi
1 . First, it should be inexpensive to construct from x,. If the
m ately 25% (1 — VT~ + p). In this exam ple, using antithetic vari
control variate is slow to com pute, then larger variance
ables is equivalent to performing 78% more simulations using
reductions— holding the com putational cost fixed— may be
only independent draws and so has effectively doubled the
achieved by increasing the number of simulations b rather
number of replications at a relatively low com putational cost.
than constructing the control variates.
The right-center columns of Table 13.3 show the effect of using a
2 . Second, a control variate should have a high correlation
with g(X). The optimal combination param eter /3 that control variate (in this case, the value of a European put option).
minimizes the approxim ation error is estim ated using the A put option has the payoff function m ax(0, K - S T), so it is in the
regression: money when the final stock price is below the strike price K.
Here, the control variate is constructed by subtracting the
g(x,) = a + /3h(x,) + v, (13.9)
closed-form Black-Scholes-M erton put price from the sim ulated
put price:
Application: Valuing an Option
P. = e x p (- rT ) max(0, K - S Ti) - PBS(r, S 0, K, a , T)
Reconsider the exam ple of simulating the price of a European
The simulated price of the put estim ates the analytical price,
call option.
and so the difference is mean zero (i.e., E[P] = 0). The call
The left-center columns of Table 13.3 show the prices simulated option price is then approxim ated by:
using antithetic random variables, along with their standard
errors. Because the random values used in the simulation are e [c ] 4 | q ^ ( 1 | p,)
normally distributed, the antithetic value is the negative of the

where [3 is the O LS estim ate com puted by regressing C, on P,
original value. In other words, the values used in the simulation
are x 1f - x 1f x 2, - x 2, . . . ,x b/2, - x b/2. Note that the variance reduction is similar in m agnitude to the
reduction achieved using antithetic variables, because the cor
The value of the call option is a nonlinear function of these val
relation between the sim ulated call and put option prices, (i.e.,
ues, so the correlation of the sim ulated call option prices is not
Corr[C/,P,]) is determ ined to be —44%.
the same as the correlation of the simulated values of X. The
correlation p between the paired call option prices that use x, The right-m ost colum ns of Table 14.3 use both antithetic
and —x (- (i.e., the correlation between g(x,) and g (—x,), where variab les and the control variate to reduce the standard error
Table 13.3 Present Value of the Payoff of a European Call Option with Two Years to Maturity. The Left
Columns Use b iid Simulations. The Results in the Left-Center Columns Use Antithetic Variables. The Results in the
Right-Center Columns Add a Control Variate to the Simulation. The Results in the Right-Most Columns Use Both
Antithetic Variables and a Control Variate
Standard Antithetic Control Var. Both
n Price Std. Err. Price Std. Err. Price Std. Err. Price Std. Err.
50 301.50 71.23 388.58 65.46 326.31 64.10 433.40 58.38
200 283.63 30.19 286.49 23.61 287.51 26.49 289.33 21.08
800 276.21 15.39 279.95 11.43 276.32 13.71 280.07 10.21
3,200 283.93 7.48 282.06 5.54 282.95 6.61 284.01 4.89
12,800 275.62 3.64 279.12 2.75 277.72 3.21 279.68 2.43
51,200 278.49 1.85 278.99 1.39 279.46 1.64 279.14 1.24
204,800 278.64 0.93 278.39 0.70 278.53 0.83 278.18 0.62
819,200 278.85 0.46 278.06 0.35 278.55 0.41 277.79 0.31
3,276,800 279.15 0.23 278.45 0.17 278.89 0.21 278.36 0.15

by 33% when com pared to a sim ulation using b iid draw s. im plem ent because the sam ples are created by drawing with
The com bined effect is equivalent to using 125% more iid replacem ent from the observed data.
replications.
Suppose that a simulation sam ple size of m is needed from a
data set with a total of n observations. The iid bootstrap gener
Limitation of Simulations ates observation indices by randomly sampling with replacem ent
from the values {1,2, . . . , n}. These random indices are then
Monte Carlo simulation is a straightforward method to approxi used to select the observed data that are included in the sim u
mate moments or to understand estim ators' behavior. The big lated sam ple. The resampled data are commonly described as a
gest challenge when using simulation to approxim ate moments bootstrap sam ple to distinguish them from the original data.
is the specification of the DGP. If the D G P does not adequately
For exam ple, suppose that five observations are required from a
describe the observed data, then the approxim ation of the
sam ple of 20 data points {x-|, . . . , x 20}. Table 13.4 contains a set
moment may be unreliable. The m isspecification in the DGP
of five bootstrap sim ulations, each with five sam ples.
can occur due to many factors, including the choice of distribu
tions, the specification of the dynamics used to generate the The first simulation uses observations {X 3 , x 6, x 2, X5 , Xiq }, and the
sam ple, or the use of im precise param eter estim ates to simulate second simulation uses {x 6, x 3, x 2, x 18, x 15}. These two simulated
the data. sam ples partially overlap. This is expected because the iid boot
strap sam ples with replacem ent. The fifth simulation sample
The other im portant consideration is the com putational cost.
uses {x 8, x 12, x 8, x 2, x 5}, so it repeats the same observation twice
Running a single simulation for a sim ple D G P takes little tim e on
(i.e., x 8). This is also expected when generating many bootstrap
a modern computer. However, running many simulation exp eri
sam ples, the probability 3 * of an index repeating is nearly .
ments (e.g ., to understand the sensitivity of an option price to
1
model param eters) can be very tim e-consum ing. This cost limits The bootstrap is otherwise identical to Monte Carlo simulation,
the ability of an analyst to explore the specification used for the where the bootstrap sam ples are used in place of the simulated
DGP. Therefore analytical expressions should be preferred when sam ples. The expected value of a function is approxim ated by:
available, and simulation should be used when they are not.
1
E[g(X)] = - 2 g(x?j< *!,?/ ••• / *mSj), (13.10)
0/=i
13.3 BOOTSTRAPPING where x f f is observation / from bootstrap sam ple j and a total of
b bootstrap replications are used.
Bootstrapping is an alternative method to generate simulated The iid bootstrap is generally applicable when observations are
data. However, while both simulation and bootstrapping make independent across tim e. In some situations, this is a reasonable
use of observed data, there are some im portant differences. assum ption. For exam ple, if sampling from the S&P 500 in 2005
or 2017, both calm periods with low volatility, the distribution of
Sim ulation uses observed data to estim ate key model param
daily returns is similar throughout the entire year and so the iid
eters (e .g ., the mean and standard deviation of the S&P 500).
bootstrap is a reasonable choice.
These param eters are then com bined with an assum ption
about the distribution of the standardized returns to com plete However, it is often the case that financial data are dependent
the DGP. across time and thus a more sophisticated bootstrapping method
is required. The sim plest of these methods is known as the
In contrast, bootstrapping directly uses the observed data to
circular block bootstrap (CBB). It only differs in one important
simulate a sam ple that has similar characteristics. Specifically, it
aspect— instead of sampling a single observation with replace
does this without directly modeling the observed data or mak
ment, the C B B samples blocks of size q with replacem ent.
ing specific assumptions about their distribution.
The key to understanding bootstrapping lies in one simple

fact: The unknown distribution being sam pled from is the same 3 In an iid bootstrap with b replications where each bootstrap sample
one that produced the observed data. W hile this distribution is has m observations from an original sample size of n, the probability of
unknown, a distribution that resem bles (in a probabilistic sense) at least one bootstrap sample with a repeated index is
the distribution that produced the observed data can be sam n!

1
pled from by using the same observed data. nm(n — m)\
-
where ! is the factorial operator. In the example in the text with five
There are two classes of bootstraps used in risk m anagem ent. bootstrap samples, each with a size of five drawn from a sample size of
The first is known as the iid bootstrap. It is extrem ely simple to 20, this value is 93.4%.

Table 13.4 Simulated Indices Used to Produce Five iid Bootstrap Samples, Each with Size Five from an Original
Sample with 20 Observations
Locationn Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
1 3 6 10 14 8
2 6 3 17 11 12
3 2 2 19 18 8
4 5 18 10 19 2
5 10 15 12 7 5
GEN ERATIN G A SAM PLE USING THE GEN ERATIN G A SAM PLE USING
iid BOOTSTRAP THE CBB
1. G enerate a random set of m integers, {/1f i2, . . . , /m}, 1. Select the block size q.
from { 1 , 2 , . . . , n} with replacem ent. 2. Select the first block index / from { 1 , 2 , . . . , n} and
2. Construct the bootstrap sam ple as {x(-, x,2, . .. , x,m}. append {x„ x ,+1, . . . x /+q} to the bootstrap sample
where indices larger than n wrap around.
3 . W hile the bootstrap sam ple has few er than m ele
ments, repeat step 2.
For exam ple, suppose that 100 observations are available and
they are sam pled blocks of size q = 10. 100 blocks are con 4. If the bootstrap sam ple has more than m elem ents,
drop values from the end of the bootstrap sample
structed, starting with {x 1f x 2, . . . , x 10}, {x 2, x 3, . . . , x-|i}, . . . ,
until the sam ple size is m.
{*91, * 9 2, • • • , x-ioo)/ {x<?2, x<?3' • • • ' X l00' X l^' * • • ' {X100/ x 1/ • • • X9}.
Note that while the first 91 blocks use ten consecutive observa
tions, the final nine blocks all wrap around (as if observation one
occurs im m ediately after observation 100). W hen the loss is measured as the percentage of the portfolio
value, then the p-VaR is equivalently defined as a quantile of the
The C B B is used to produce bootstrap sam ples by sampling
return distribution.
blocks, with replacem ent, until the required bootstrap sample
size is produced. If the number of observations sam pled is larger A s an exam ple, consider the a one-year VaR for the S&P 500.
than the required sam ple size, the final observations in the final Com puting the VaR requires simulating many copies of one year
block are dropped. For exam ple, if the block size is ten and the of data (i.e., 252 days) and then finding the em pirical quantile of
desired sam ple size if 75, then eight blocks of ten are selected the sim ulated annual returns.
and the final five observation from the final block are dropped. The data set used spans from 1969 until 2018. Both the iid
The final detail required to use the C B B is the choice of the and the C B B s are used to g enerate bootstrap sam ples. The
block size q. This value should be large enough to capture the data set is large (i.e ., 13,000 observations) and so the block
dependence in the data, although not so large as to leave too size for C B B is set to 126 observations (i.e ., six months of
few blocks. The rule-of-thumb is to use a block size equal to the daily data).
square root of the sam ple size (i.e., V n ) . Bootstrapped annual log returns are simulated using 252 daily
log returns:
Bootstrapping Stock Market Returns 252
Estim ating m arket risk is an im portant application of bootstrap

ping in financial m arkets. Here we are interested in estimating
where the r-s are the bootstrapped log returns for replication j.
the p-Value-at-Risk (p-VaR):
The VaR is then com puted by sorting the b bootstrapped annual
argmin Pr(L > VaR) = 1 — p, (13.11)
V aR
returns from lowest to highest and then selecting the value at
where L is the loss of the portfolio over a specified period (e.g ., position (1 — p)b, which is the empirical 1 — p quantile of the
one month) and 1 — p is the probability that the loss occurs. annual returns.

Table 13.5 VaR Estimates (Quantiles of the Annual five returns from each of the 50 years in the sam ple. In contrast,
Return Distribution) Using the iid and Circular Block the C B B uses exactly two blocks to sim ulate an annual return.
Bootstraps W hen the C B B selects a block, the volatility— high, normal, or
low— is preserved throughout the block.
Quantile Value-at-Risk iid Value-at-Risk (CBB)
The inclusion of long periods with elevated volatility explains
.1% 44.1% 53.5% why the VaR estim ates for p = 99.9% or 99% from the C B B
1% 31.0% 35.6% are larger than the estim ates from the iid bootstrap. This same
feature also explains the lower VaR estim ates when p = 95%
5% 19.0% 18.2%
or less. The sm aller VaR estim ates are the result of the C B B
10% 12.9% 10.9% selecting blocks with volatility below average, which produce
sim ulated losses sm aller than those produced by the "balanced"
The number of bootstrap replications is b = 100,000. A large (but not em pirically realistic) returns constructed using the iid
number of replications is needed to improve the accuracy of the bootstrap.
0.1% quantile because only one in 1,000 replications falls into
Figure 13.4 shows the left tail of the sim ulated return distribu
this extrem e left tail.
tion, focusing on the bottom 10% of the distribution of boot
As shown in Table 13.5, the C B B produces larger VaR estimates strapped annual returns.
for small values of 1 — p and smaller VaR estimates for large val
The tails of the densities cross near the 2% quantile, and the
ues when compared to the iid bootstrap. This difference occurs
C B B produces a density of annual returns with more probability
because financial asset returns are dependent. Specifically, finan
in the tail. This crossing is another consequence of the excess
cial assets show evidence of volatility clustering and so go through
balance in the iid bootstrap sam ples, and the distribution con
periods where volatility is consistently higher than (e.g., 2002 or
structed from the iid bootstrap is less heavy-tailed than the dis
2008) or lower than (e.g., 2005 or 2017) its long-run average.
tribution constructed from the C B B bootstrap. The points below
The iid bootstrap sam ples at random from the entire set of the curves show the sm allest 1% of actual simulated annual
observations and in doing so produces a simulated sam ple that returns used to estim ate the VaR. O nly 36% of these extrem e
is too "b alan ced ." A typical sam ple from the iid bootstrap has observations are generated by the iid bootstrap.
• m■
iid Bootstrap
Circular Block Bootstrap
V 'V 'f
>
;\ y $ i
-80 -70 -6 0 -5 0 -4 0 -30 -2 0
Figure 13.4 Left tail of the simulated annual return distribution using the iid and circular block bootstraps. The
tail shows 10% of the return density. The points below the density show the lowest 1% of the simulated annual
returns produced by the two bootstrap methods.

Limitations of Bootstrapping the bootstrap method used replicates the actual dependence
observed in the data. If the bootstrap does not reproduce the
W hile bootstrapping is a useful statistical technique, it has its dependence, then statistics com puted from bootstrapped sam
limitations. There are two specific issues that arise when using a ples cannot capture the sampling variation in the observed data.
bootstrap.
Both Monte Carlo simulation and bootstrapping suffer from
• First, bootstrapping uses the entire data set to generate a the "B lack Sw an" problem — simulations generated using either
sim ulated sam ple. W hen the past and the present are similar, method resem ble the historical data. Bootstrapping is especially
this is a desirable feature. However, if the current state of sensitive to this issue, and a bootstrap sam ple cannot generate
the financial m arket is different from its normal state, then data that did not occur in the sam ple. A good statistical model,
the bootstrap may not be reliable. For exam ple, many assets on the other hand, should allow the possibility of future losses
experienced record levels of volatility during the financial cri that are larger than those that have been realized in the past. In
sis of 2008. Using a bootstrap to estim ate the VaR in O ctober practice, D G Ps used in simulation studies are calibrated to his
2008 produces an optim istic view of risk because the boot torical data and so are usually limited in their ability to generate
strap estim ate uses data from periods where the volatility simulation sam ples substantially different from what has been
was much lower. already observed.
• The second limitation arises due to structural changes in
m arkets so that the present is significantly different from the
past. Consider bootstrapping the interest rate on short-term 13.5 SUMMARY
US governm ent debt. This rate was near zero from late 2008
until 2015 due to both m arket conditions and interventions This chapter introduces simulation as a method to approxim ate
by the Federal Reserve. Bootstrapping the historical data moments or other quantities that depend on the expected value
cannot produce sam ples with short-term interest rates that of a random variable (e.g ., an option price) when analytical
replicate this period because rates had never previously been expressions are not available.
near zero for an extended period. Two closely related m ethods are covered: Monte Carlo sim ula
tion and bootstrapping. Both m ethods use random sam ples to
sim ulate data: A simulation uses a com plete specification for the
13.4 COMPARING SIMULATION DGP, whereas the bootstrap method randomly resam ples from
AND BOOTSTRAPPING the observed data. However, the key steps to approxim ate a
moment are the same for the two methods.
Monte Carlo simulation and bootstrapping are both essential
1 . Construct a sim ulated or bootstrap sam ple.
tools in finance and risk m anagem ent. They are closely related
and only differ in the method used to generate simulated data. 2. Com pute the function or estim ator using the sam ple cre
ated in step 1.
A Monte Carlo simulation uses a full statistical model that
3 . Repeat steps 1 and 2 to construct b replications.
includes an assumption about the distribution of the shocks.
The need for a com plete model is a weakness of sim ulation, and 4. Estim ate the quantity of interest using the b values pro
poor models produce inaccurate results even when the number duced in step 2.
of replications is large.
Both simulation and the bootstrap depend on user choices. In
The bootstrap method avoids the specification of a model and a simulation, the entire D G P is specified— this D G P plays a key
instead makes the key assumption that the present resem bles role in the validity of the simulation experim ent. When using
the past. The validity of this assumption is essential when using a bootstrap, the resam pled data must mimic the dependence
a bootstrap. Applying a bootstrap requires understanding structure of the data, and the past must be a reliable guide for
the dependence in the observed data, and it is essential that the future.

QUESTIONS

13.1 W hat is the main difference between a simulation and the 13.5 How are bootstrap sam ples generated using the iid
bootstrap? bootstrap?
13.2 How is the seed of a pseudo-random number generator 13.6 How are bootstrap sam ples generated using the circular
useful? block bootstrap (CBB)?
13.3 W hat are antithetic variables and control variates? 13.7 W hen is the C B B needed?
13.4 W hy do antithetic variables and control variates improve 13.8 W hat are the key limitations of the bootstrap?
the accuracy of a simulation?
Practice Questions
13.9 Suppose you are interested in approximating the a. Calculate the option payoffs using the equation
expected value of an option. Based on an initial sample M ax(x — 0.5,0).
of 100 replications, you estimate that the fair value of the b. W hat are the antithetic counterparts for each sce
option is USD 47 using the mean of these 100 replications. nario and their associated values?
You also note that the standard deviation of these 100 c. Assum ing that the sam ple correlation holds for the
replications is USD 12.30. How many simulations would entire population, what is the change in expected
you need to run in order to obtain a 95% confidence inter sam ple standard error attainable by using antithetic
val that is less than 1% of the fair value of the option? How sam pling?
many would you need to run to get within 0.1% ? d. How does the answer to part c change if the payoff
13.10 Plot the variance reduction when using antithetic random function is Max(| x — 0.5 |, 0)?
variables as a function of the correlation between pairs of

observations.
13.11 Plot the variance reduction when using control variates as

a function of the correlation between the control variate
and the random variable used in the simulation.
13.12 A random draw from an N(1,5) distribution is 1.2. If this

was generated using the inverse of the cum ulative dis
tribution approach, what is the original draw from the
U(0,1) distribution?
13.13 G iven the data below:
N(0,1)
1 - 0 .6 5
2 - 0 .1 2
3 0.92
4 1.28
5 0.63
6 - 1 .9 8
7 0.22
8 0.4
9 0.86
10 1.74
Chapter 13 Simulation and Bootstrapping

ANSW ERS

13.1 A simulation uses draws from a specific distribution to correlation with the simulated values. The variance reduc
simulate the shocks and other quantities in a model error. tion works in the same manner because the variance of
A simulation requires a fully specified model that provides the sum depends on the sum of the variances.
the distribution of all random quantities. A bootstrap uses 1 3 .5 A set of n integers are randomly generated with replace
random sampling from the observed data to produce a
ment from 1 , 2 , . . . , n. These are used to select the indi
new dataset that has sim ilar characteristics. In a bootstrap, ces of original data to produce the bootstrapped sam ple.
the random sam ples are used to generate indices when
1 3 .6 The C B B uses sam ples from the original data in blocks
selecting the data to include in the bootstrap sam ple.
of size m. Each block is chosen by selecting the index for
1 3 .2 The seed is a value that sets the internal state of a the first observation in the block. A total of \m] blocks
pseudo-random number generator. Setting the seed are selected, where n is the sam ple size where [ • ] selects
allows the same sequence of pseudo-random numbers to the sm allest integer larger than its input. If the number of
be generated, which allow simulations to be reproduced. observations selected is larger than m due to rounding,
1 3 .3 A ntithetic variables are sim ulated variables that have only the first n are kept. If a block starts in the last m — 1
been constructed to have a negative correlation with a observations, the block wraps around to the start of
previously sam pled set of sim ulated values. Control vari the sam ple.
ates are mean zero values that are correlated with the 1 3 .7 The C B B is generally needed when the data being boot
quantity whose expectation is being com puted via sim u strapped have tim e series dependence. A block bootstrap
lation. The control variates are then com bined with the is required when the statistic being bootstrapped is sen
sim ulated value using an optimal w eight to reduce the sitive to the serial correlation in the data. For exam ple,
simulation variance. Control variates should be cheap to when bootstrapping the mean, the C B B is required if the
produce. data are serially correlated. If bootstrapping the variance,
the C B B is required if the squares are serially correlated—
1 3 .4 Both introduce some negative correlation in the sim ula
that is, if the data has conditional heteroskedasticity.
tion. Antithetic variables are selected to have negative
correlation across sim ulations. Because the variance of 1 3 .8 The most important limitation is that the bootstrap cannot
the sum depends on the covariance, and the covariance generate data that has not occurred in the sample. This
is negative, the variance of the sum is less than the case is sometimes called the "Black Swan" problem. A simula
if the variables were independent. Control variates add tion can potentially generate shocks larger than historically
a mean-zero term to a simulation that has a negative observed values if the assumed distribution has this feature.
Solved Problems
1 3 .9 The standard deviation is USD 12.30, and a 95% confi = 0.47 => V n = 2 x 1•0 4 7x 1230 = 102.5 (so 103). Using
dence interval is [/x — 1.96 X A + 1.96 X and .1% , we would need 1,025.8 (replace 0.47 with 0.047)
so the width is 2 X 1.96 X If we want this and so 1,026 replications.

vn
12.30
value to be 1% of USD 47.00, then 2 x 1 . 9 6 X
Vn

13.10
13.
13.12 Essentially, the question is to determ ine the cumulative b. Recall that for a N(0,1) distribution:
distribution up to 1.2 on an N(1,5) distribution.
Fxiu rf = —Fx( 1 - Ut )
This can be done with the Excel function
Accordingly, the table is populated as follows:
N O RM D IST(1.2,1 ,sqrt(5),True) and returns the value 0.535.
13.13 a. The formula is simply im plem ented in Excel. For the Option Antithetic Option
first entry: N(0,1) Payoff Sample Payoff
M a x (-0 .6 5 - 1,0) = 0
1 - 0 .6 5 0 0.65 0.15
N(0,1) Option Payoff 2 - 0 .1 2 0 0.12 0
1 - 0 .6 5 0 3 0.92 0.42 - 0 .9 2 0
2 - 0 .1 2 0 4 1.28 0.78 - 1 .2 8 0
3 0.92 0.42 5 0.63 0.13 - 0 .6 3 0
4 1.28 0.78 6 - 1 .9 8 0 1.98 1.48
5 0.63 0.13 7 0.22 0 - 0 .2 2 0
6 - 1 .9 8 0 8 0.4 0 - 0 .4 0
7 0.22 0 9 0.86 0.36 - 0 .8 6 0
8 0.4 0 10 1.74 1.24 - 1 .7 4 0
9 0.86 0.36
10 1.74 1.24

c. Using Excel, the correlation is —27% . Therefore, the The correlation between the two option value vec
reduction in error is tors is + 25% , so the IN C R EA S E in standard error is
1 - V i + p = 15% V i T p - 1 = 12%.
For the same number of simulations.
d. The M ax in the payoff function is
M a x(|x — 0.5 |,0) = |x — 0.5
The table is now as follows:
Option Antithetic Option

N(0,1) Payoff Sample Payoff
1 - 0 .6 5 1.15 0.65 0.15
2 - 0 .1 2 0.62 0.12 0.38
3 0.92 0.42 - 0 .9 2 1.42
4 1.28 0.78 - 1 .2 8 1.78
5 0.63 0.13 - 0 .6 3 1.13
6 - 1 .9 8 2.48 1.98 1.48
7 0.22 0.28 - 0 .2 2 0.72
8 0.4 0.1 - 0 .4 0.9
9 0.86 0.36 - 0 .8 6 1.36
10 1.74 1.24 - 1 .7 4 2.24

IN D EX
A Black-Scholes-Merton model, 212, 214

BLUE. See best linear unbiased estimator (BLUE)
Akaike information criterion (AIC), 177, 178
Bonferroni (or Holm-Bonferroni) correction, 93
annualized volatility, 213, 214
bootstrapping, 231
antithetic variables, 229
limitations, 233
AR models. See autoregressive (AR) models
vs. Monte Carlo simulation, 234
attenuation bias, 106
stock market returns, 232-233
autocorrelation, 162
Box-Jenkins methodology, 178
autocorrelation function (ACF), 162, 164, 169
autocovariance, 161-164
autoregressive conditional heteroskedasticity (ARCH) process, 163
C
autoregressive (AR) models, 160 CAPM, 111-113
AR(p) model, 165-166 CBB. See circular block bootstrap (CBB)
autocovariances, 164 central limit theorem (CLT), 71, 73, 75, 76, 226
default premium, 167-168 central moments, 16
lag operator, 164-165 characteristic equations, 182
real GDP growth rate, 168, 169 X2 (chi-squared) distribution, 36
Yule-Walker equations, 167 circular block bootstrap (CBB), 231
autoregressive moving average (ARMA) processes, 171-174 sample generation using, 232
axioms of probability, 4 stock market returns, 232-233
cokurtosis, 76-78
common univariate random variables
B continuous random variables, 31-42
Bayesian information criterion (BIC), 177-178 discrete random variables, 28-31
Bayes' rule, 7 conditional correlation coefficient, 125
Bernoulli distribution, 28-29 conditional distributions, 50-51
Bernoulli random variable, 13 conditional expectation, 54-55
best linear unbiased estimator (BLUE), 70-71, 147-148 conditional independence, 6-7, 54-55
best unbiased estimator (BUE), 148 conditional probability, 4-5
Beta distribution, 40 confidence intervals, 202
bias-variance tradeoff, 141-142 hypothesis testing, 90-91
binomial distribution, 29-30 multivariate, 128-129
bivariate confidence intervals, 128, 130 constant variance of shocks, 107
continuous random variables, 18-20, 55-56 Fama-French three-factor model, 122
approximating discrete random variables, 33-34 familywise error rate (FWER), 93
Beta distribution, 40 F distribution, 38-39
X distribution, 36
2 Feasible Generalized Least Squares (FGLS), 145
exponential distribution, 39-40 feasible weighted least squares, 145
F distribution, 38-39 Federal Reserve Economic Data (FRED) database, 67
log-normal distribution, 34-36 Fidelity's Contrafund, 114-115
mixture distributions, 40-42 financial asset, 213, 214
normal distribution, 32-34 first-order AR process, 160
Student's t distribution, 36-38 forecasting
uniform distribution, 41-42 non-stationary time series, 201-202
controls, 123 stationary time series, 178-181
control variates, 229-230 French, K. R., 122
Cook's distance, 147, 148 F-test statistics, 127-129, 131
correlation, 53, 54, 74-75 fundamental principles, 2
vs. dependence, 216-219 fund managers, 114
matrix, 219
coskewness, 76-78
G
covariance, 54, 74-75
covariance stationarity, 161-162 Gauss, C. F., 147
cumulative distribution function (CDF), 12 Gaussian distribution, 32
Gauss-Markov Theorem, 147
general-to-specific (GtS) model, 141
D
data-generating process (DGP), 217, 224
H
default premium, 167-168
dependent white noise, 163 hedging, 113-114
discrete random variables, 13-14, 48-51 heteroskedasticity, 142-145
Bernoulli, 28-29 higher order moments, 67-70
binomial, 29-30 higher-order, non-central moments, 41
conditional distributions, 50-51 high-performing fund managers, 114
independence, 50 histograms, 69
marginal distributions, 49-50 hypothesis testing, 110-111
Poisson, 30-31 about the mean, 86-88
probability matrices, 48-49 confidence intervals, 90-91
distribution of financial returns, 214-216 critical value and decision rule, 86
definition, 84
dummy variables, 104, 191
effect of sample size on power, 88-89
effect of test size and population mean on power, 100
E null and alternative hypotheses, 84-85
Eiker-White estimator, 142 one-sided alternative hypotheses, 85
events, 2-5 p-value of test, 91-92
event space, 2-5 testing the equality of two means, 92-93
exchange-traded funds (ETFs), 113 test statistic, 85
expectation operator, 14 type I error and test size, 85-86
expectations, 51 type II error and test power, 88
expected value of a random variable, 14
exponential distribution, 39-40
I
extraneous variable, 141
iid bootstrap, 231
sample generation using, 232
F stock market returns, 232-233
false discovery rate (FDR), 93 iid random variables, 107
Fama, E. F., 122 implied volatility, 214
240 ■ Index
independence, 5-7 Markov, Andrey, 147
conditional, 6-7 maximum likelihood estimators (MLEs), 148
independent, identically distributed random variables, 56-57 mean, 16-17
independent variables, 54 estimation of, 64-65
inference testing, 110-111 of random variable, 226-227
information criterion (IC) sample behavior of, 71-73
Akaike, 177, 178 scaling of, 66
Bayesian, 177-178 and standard deviation, 66-67
interquartile range (IQR), 20 of two variables, 75-76
invertibility, 178 and variance using data, 67
median, 73-74
J mixture distributions, 40-42
MLEs. See maximum likelihood estimators (MLEs)
Japanese yen/US dollar rate (JPY/USD), 199, 200, 214, 215
model selection, 174-177
Jarque-Bera test statistic, 215
moments, 51-54
Jensen's inequality, 15, 202
multivariate. See multivariate moments
random variable
K central moments, 16
Kendall's t , 217-219 definition, 14
Kernel density plots, 69 kurtosis, 16-17
kurtosis, 16-17, 67-69 linear transformations, 17-18
k-variate linear regression model, 120. See also multiple linear mean, 16-17
regression non-central moments, 15, 16
skewness, 16-17
L standard deviation, 16
variance, 16-17
law of large numbers (LLN), 71,73
Monte Carlo simulation
linear correlation, 216
antithetic variables, 229
linearity, 103
approximating moments, 224, 225
linear process, 160-161
vs. bootstrapping, 234
linear regression, 102
control variates, 229-230
CAPM, 111-113
generating random values/variables, 224, 225
dummy variables, 104
improving accuracy of, 229
hedging, 113-114
limitation, 231
inference and hypothesis testing, 110-111
mean of random variable, 226-227
linearity, 103
price of European call option, 227-231
OLS, 104-110
moving average (MA) models, 169-171
ordinary least squares, 104-110
multicollinearity
parameters, 102-103
outliers, 146-147
performance evaluation, 114-115
residual plots, 146
transformations, 103-104
multi-factor risk models, 122-125
linear transformations, 17-18
multiple explanatory variables, 120-122
linear unbiased estimator (LUE), 70
multiple linear regression, 103
Ljung-Box Q-statistic, 193, 194
F-test statistics, 127-129
log-normal distribution, 34-36
with indistinct variables, 120
log-quadratic time trend model, 189
model fit measurement, 125-127
log returns, 212, 213
multi-factor risk models, 122-125
LUE estimator, 147
with multiple explanatory variables, 120-122
multivariate confidence intervals, 128-129
M R2 method, 126-127
marginal and conditional distributions, 56 stepwise procedure, 121
marginal distributions, 49-50 testing parameters, 127-131
market variation, 122 t-test statistics, 127
Index ■ 241
multivariate confidence intervals, 128-129 Poisson random variables, 30-31
multivariate moments, 74 polynomial trends, 188-190
coskewness and cokurtosis, 76-78 portfolio diversification, 53
covariance and correlation, 74-75 power laws, 215-216
sample mean of two variables, 75-76 probability
multivariate random variables Bayes' rule, 7
conditional expectation, 54-55 events, 2-5
continuous random variables, 55-56 event space, 2-5
discrete random variables, 48-51 fundamental principles, 2, 4
expectations and moments, 51-54 independence, 5-7
independent, identically distributed random variables, 56-57 matrices, 48-49
in rectangular region, 55
N sample space, 2-5
probability mass function (PMF), 12, 13
non-central moments, 15, 16
pseudo-random number generators (PRNGs), 224
non-stationary time series
pseudo-random numbers, 224
cyclical component, 192
p-Value-at-Risk (p-VaR), 232
default premium, 200, 201
p-value of test statistic, 91-92
forecasting, 201-202
log of gasoline consumption, in United States, 192-193
random walks and unit roots, 194-199 R
seasonality, 190-192 random variables
spurious regression, 199, 200 continuous, 18-20
time trends, 188-190 cumulative distribution function, 12
normal distribution, 32-34 definition of, 12-13
null and alternative hypotheses, 84-85 discrete, 13-14
expectations, 14
O modes, 20-21
OLS estimators. See ordinary least squares (OLS) estimators moments, 14-18
omitted variables, 106, 140-141 probability mass function, 12
one-sided alternative hypotheses, 85 quantiles, 20-21
ordinary least squares (OLS), 104-106 random walks, 194-199
constant variance of shocks, 107 rank correlation, 216-219
iid random variables, 107 regression
implications of assumptions, 108-110 analysis, 102
no outliers, 107-108 with multiple explanatory variables. See multiple linear regression
parameter estimators properties, 106-110 regression diagnostics
shocks are mean zero conditional, 106-107 bias-variance tradeoff, 141-142
standard error estimation, 110 extraneous variable, 141
variance of X, 107 heteroskedasticity, 142-145
ordinary least squares (OLS) estimators, 120, 121, 126 multicollinearity, 145-147
homoscedasticity, 142, 145 omitted variable, 140-141
strengths of, 147-149 returns, measuring, 212
weighted least squares, 145 R2 method, 126-127
P S
PACF. See partial autocorrelation function (PACF) sample autocorrelation, 174-177
parsimony, 178 sample moments, 64
partial autocorrelation function (PACF), 162, 164-166, 169, 170, 174-178 best linear unbiased estimator, 70-71
Pearson's correlation, 216 higher order moments, 67-70
performance evaluation, linear regression, 114-115 mean and standard deviation, 66-67
PIMCO High Yield Fund, 114 mean and variance using data, 67
PIMCO Total Return Fund, 114, 115 mean estimation, 64-65
242 ■ Index
median, 73-74 survivorship bias, 106
multivariate moments, 74-78 systemically important financial institutions (SIFIs), 5
sample behavior of mean, 71-73
variance and standard deviation estimation, 65-66
T
sample selection bias, 106
test statistics, 84-87, 91-93, 110, 111, 128, 129, 131, 143-145, 174, 176,
sample space, 2-5
191, 196, 197, 202, 215
seasonal differencing, 198, 199
time-series analysis, 160
seasonality
time trends, 188-190
non-stationary time series, 190-192
total sum of squares (TSS), 125, 126
stationary time series, 181-182
transformations, linear regression, 103-104
simple returns, 212, 213
t-statistics, 122, 123, 131
simulation, 224
t-test statistics, 127
conducting experiments, 224, 226
type I error and test size, 85-86
Monte Carlo. See Monte Carlo simulation
type II error and test power, 88
simultaneity bias, 106
skewness, 16-17, 67-70
Spearman correlation, 217-219 U
spurious regression, 199, 200 uniform distribution, 41-42
standard deviation, 16 unit roots
mean and, 66-67 problems with, 195
scaling of, 66 testing for, 195-198
vs. standard errors, 66 univariate random variables, 12
and variance estimation, 65-66
standard errors
estimation, OLS, 110
V
vs. standard deviations, 66 validation set, 142
standardized Student's t distribution, 37 variance, 16-17
stationary time series and standard deviation estimation, 65-66
ARMA models, 171-174 of X, 107
autoregressive models, 163-169 VIX Index, 214
covariance stationarity, 161-162 volatility and risk measures, 213-214
forecasting, 178-181
model selection, 174-177 W
moving average models, 169-171
weighted least squares (WLS), 145
sample autocorrelation, 174-177
White's estimator, 142-144
seasonality, 181-182
Wold's theorem, 163
stochastic processes, 160-161
white noise, 162-163
stochastic processes, 160-161 Y
Student's t distribution, 36-38 Yule-Walker (YW) equations, 167
Index 243

Quantitative Analysis

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Quantitative Analysis

Uploaded by

Copyright:

Available Formats

OGARP

Financial Risk Manager

Printed in the United States of Am erica

Pearson ISBN 10: 0135969360

Chapter 1 Fundamentals Chapter 2 Random Variables 11

3.2 Continuous Random Variables 31 4.4 Continuous Random Variables 55

5.2 Higher Moments 67

5.7 Summary 78 Chapter 7 Linear Regression 101

7.2 Ordinary Least Squares 104

Chapter 6 Hypothesis Testing 83 7.3 Properties of OLS Parameter

One-sided Alternative Hypotheses 85 Variance of X 107

Test Statistic 85 Constant Variance of Shocks 107

Type I Error and Test Size 85 No Outliers 107

Critical Value and Decision Rule 86 Implications of O LS Assum ptions 108

Questions 95 Answers 117

11.5 Spurious Regression 199

11.8 Summary 202 13.1 Simulating Random Variables 224

Michelle McCarthy Beck, SMD Dr. Victor Ng, CFA, MD

William May, SVP

Calculate the probability of an event for a discrete prob­

2 ■ Financial Risk Manager Exam Part I: Quantitative Analysis

Complement Mutually Exclusive

Chapter 1 Fundamentals of Probability ■ 3

4 ■ Financial Risk Manager Exam Part I: Quantitative Analysis

This conditional probability tells us that, while it would be sur­

Chapter 1 Fundamentals of Probability ■ 5

P r(A D B ) = Pr(A) Pr(B) (1.5)

Independence Conditional Independence

24% 16% 24%

6 ■ Financial Risk Manager Exam Part I: Quantitative Analysis

about the conditional probability of A given B, to com pute the

Chapter 1 Fundamentals of Probability ■ 7

Short Concept Questions

1.6 Continue the application of Bayes' rule to com pute the

1.9 If fund managers beat their benchmark at random (meaning

8 ■ Financial Risk Manager Exam Part I: Quantitative Analysis

Short Concept Questions

Chapter 1 Fundamentals of Probability ■ 9

20% X 20% X 10% action is flagged is .001% X 90% + 99.999% X 0.0001%

.625% = .000999% . The P r(F ra u d n F la g ) = .001% X 90%

10 ■ Financial Risk Manager Exam Part I: Quantitative Analysis

D escribe the four common population moments.

Explain the differences between a probability mass func­

The chapter concludes by examining continuous random vari­

2 The set of values that can be produced by a discrete random variable

12 ■ Financial Risk Manager Exam Part I: Quantitative Analysis

Chapter 2 Random Variables ■ 13

E[c(S)] = 0.2 X max(20 — 40,0) + 0.5 X max(50 - 40,0)

where f(x) is a function of the realization value x.

Suppose the value of an asset S takes on one of three values

14 ■ Financial Risk Manager Exam Part I: Quantitative Analysis

Convex Function Concave Function

Chapter 2 Random Variables ■ 15

The fourth standardized moment, known as the kurtosis, is

16 ■ Financial Risk Manager Exam Part I: Quantitative Analysis

Chapter 2 Random Variables ■ 17

standard deviation of Y is Recall that a PM F is required to sum to 1 across the support of

A continuous random variable is one with a continuous sup­

18 ■ Financial Risk Manager Exam Part I: Quantitative Analysis

Coarse Approximation Fine Approximation

Chapter 2 Random Variables ■ 19

CDF Quantile Function

20 ■ Financial Risk Manager Exam Part I: Quantitative Analysis

Calculate the probability of an event for a discrete prob

This conditional probability tells us that, while it would be sur

Explain the differences between a probability mass func

The chapter concludes by examining continuous random vari

A continuous random variable is one with a continuous sup

e. The median is the value where at least 50% probabil

The normal distribution is the most commonly used distribu V[Y] = a 2

Log-Normal or equivalently as ln(Y) ~ N(/z, a 2), where ln(Y) is normally dis

However, because the standard Z table only includes posi

Because a log-normal random variable is defined as the exp o In y —^

of data that is available to test model param eters, because esti