GARP
2020 EXAM PART I
Quantitative Analysis
Pearson
Copyright © 2020 by the Global Association of Risk Professionals
All rights reserved.

This copyright covers material written expressly for this volume by the editor/s as well as the compilation itself. It does not cover the individual selections herein that first appeared elsewhere. Permission to reprint these has been obtained by Pearson Education, Inc. for this edition only. Further reproduction by any means, electronic or mechanical, including photocopying and recording, or by any information storage or retrieval system, must be arranged with the individual copyright holders noted.

All trademarks, service marks, registered trademarks, and registered service marks are the property of their respective owners and are used herein for identification purposes only.

Pearson Education, Inc., 330 Hudson Street, New York, New York 10013
A Pearson Education Company
www.pearsoned.com
Contents

Answers 9
Short Concept Questions 9
Solved Problems 9

Answers 24
Short Concept Questions 24
Solved Problems 25

Chapter 3 Common Univariate Random Variables 27
3.1 Discrete Random Variables 28
Bernoulli 28
Binomial 29
Poisson 30
Answers 45
Short Concept Questions 45
Solved Problems 45

Chapter 4 Multivariate Random Variables 47
Conditional Distributions 50
4.2 Expectations and Moments 51
Expectations 51
Moments 51
The Variance of Sums of Random Variables 53
Covariance, Correlation, and Independence 54
4.3 Conditional Expectations 54
Conditional Independence 54

5.1 The First Two Moments 64
Estimating the Mean 64
Estimating the Variance and Standard Deviation 65
Presenting the Mean and Standard Deviation 66
Estimating the Mean and Variance Using Data 67
5.5 The Median and Other Quantiles 73
5.6 Multivariate Moments 74
Covariance and Correlation 74
Sample Mean of Two Variables 75
Coskewness and Cokurtosis 76
Answers 96
Short Concept Questions 96
Solved Problems 96
Chapter 8 Regression with Multiple Explanatory Variables 119
8.1 Multiple Explanatory Variables 120
Additional Assumptions 121
8.2 Application: Multi-Factor Risk Models 122
8.3 Measuring Model Fit 125
8.4 Testing Parameters in Regression Models 127
Multivariate Confidence Intervals 128
The F-statistic of a Regression 129
8.5 Summary 131
8.6 Appendix 131
Tables 131
Questions 134
Short Concept Questions 134
Practice Questions 134
Answers 135
Short Concept Questions 135
Solved Problems 135

Chapter 9 Regression Diagnostics 139
9.1 Omitted Variables 140
9.2 Extraneous Included Variables 141
9.3 The Bias-Variance Tradeoff 141
9.4 Heteroskedasticity 142
Approaches to Modeling Heteroskedastic Data 145
9.5 Multicollinearity 145
Residual Plots 146
Outliers 146
9.6 Strengths of OLS 147
9.7 Summary 149
Questions 150
Short Concept Questions 150
Practice Questions 150
Answers 153
Short Concept Questions 153
Solved Problems 153

Chapter 10 Stationary Time Series 159
10.1 Stochastic Processes 160
10.2 Covariance Stationarity 161
10.3 White Noise 162
Dependent White Noise 163
Wold's Theorem 163
10.4 Autoregressive (AR) Models 163
Autocovariances and Autocorrelations 164
The Lag Operator 164
AR(p) 165
10.5 Moving Average (MA) Models 169
10.6 Autoregressive Moving Average (ARMA) Models 171
10.7 Sample Autocorrelation 174
Joint Tests of Autocorrelations 174
Specification Analysis 174
10.8 Model Selection 177
Box-Jenkins 178
10.9 Forecasting 178
10.10 Seasonality 181
10.11 Summary 182
Appendix 182
Characteristic Equations 182
Questions 183
Short Concept Questions 183
Practice Questions 183
Answers 184
Short Concept Questions 184
Solved Problems 184

Chapter 11 Non-Stationary Time Series 187
11.1 Time Trends 188
Polynomial Trends 188
11.2 Seasonality 190
Seasonal Dummy Variables 191
11.3 Time Trends, Seasonalities, and Cycles 191
11.4 Random Walks and Unit Roots 194
Unit Roots 194
The Problems with Unit Roots 195
Testing for Unit Roots 195
Seasonal Differencing 198
Short Concept Questions 204
Practice Questions 204
Answers 206
Short Concept Questions 206
Solved Problems 206

Chapter 12 Measuring Returns, Volatility, and Correlation 211
12.1 Measuring Returns 212
12.2 Measuring Volatility and Risk 213
Implied Volatility 214
12.3 The Distribution of Financial Returns 214
Power Laws 215
12.4 Correlation Versus Dependence 216
Alternative Measures of Correlation 216
12.5 Summary 219
Questions 220
Short Concept Questions 220
Practice Questions 220
Answers 221
Short Concept Questions 221
Solved Problems 221
The Mean of a Normal 226
Approximating the Price of a Call Option 227
Improving the Accuracy of Simulations 229
Antithetic Variables 229
Control Variates 229
Application: Valuing an Option 230
Limitation of Simulations 231
13.3 Bootstrapping 231
Bootstrapping Stock Market Returns 232
Limitations of Bootstrapping 234
13.4 Comparing Simulation and Bootstrapping 234
13.5 Summary 234
Questions 235
Short Concept Questions 235
Practice Questions 235
Answers 236
Short Concept Questions 236
Solved Problems 236

Index 239
FRM® Committee

Chairman
Dr. Rene Stulz, Everett D. Reese Chair of Banking and Monetary Economics, The Ohio State University

Members
Richard Apostolik, President and CEO, Global Association of Risk Professionals
Dr. Attilio Meucci, CFA, Founder, ARPM
ATTRIBUTIONS

Author
Kevin Sheppard, PhD, Associate Professor in Financial Economics and Tutorial Fellow in Economics at the University of Oxford

Reviewers
Chris Brooks, PhD, Professor of Finance at the ICMA Centre, University of Reading
Paul Feehan, PhD, Professor of Mathematics and Director of the Masters of Mathematical Finance Program, Rutgers University
Erick W. Rengifo, PhD, Associate Professor of Economics, Fordham University
Richard Morrin, PhD, FRM, ERP, CFA, CIPM, Financial Officer, International Finance Corporation
Antonio Firmo, FRM, Senior Audit Manager, Trading Risk, Risk Capital, and Modeling, Royal Bank of Scotland
Deco Caigu Liu, FRM, CSC, CBAP, Global Asset Liability Management, Manulife
John Craig Anderson, FRM, Director, Debt and Capital Advisory Services, Advize Capital
Thierry Schnyder, FRM, CSM, Risk Analyst, Bank for International Settlements
Yaroslav Nevmerzhytskyi, FRM, CFA, ERP, Deputy Chief Risk Officer and Financial Risk Management Team Leader, Naftogaz of Ukraine
Fundamentals of Probability

Learning Objectives
After completing this reading you should be able to:
• Describe an event and an event space.
• Describe independent events and mutually exclusive events.
• Explain the difference between independent events and conditionally independent events.
• Define and calculate a conditional probability.
• Distinguish between conditional and unconditional probabilities.
• Explain and apply Bayes' rule.
3. The joint probability of two independent events is the product of the probability of each.

This chapter also introduces conditional probability, an important concept that assigns probabilities within a subset of all events. After that, this chapter moves on to discuss independence and conditional independence. This chapter concludes by examining Bayes' rule, which provides a simple and powerful expression for incorporating new information or data into probability estimates.

1.1 SAMPLE SPACE, EVENT SPACE, AND EVENTS

A sample space is a set containing all possible outcomes of an experiment and is denoted as Ω. The set of outcomes depends on the problem being studied. For example, when examining the returns on the S&P 500, the sample space is the set of all real numbers and is denoted by ℝ. On the other hand, if the focus is on the direction of the return on the S&P 500, then the sample space is {Positive, Negative}. When modeling corporate defaults, the sample space is {Default, No Default}. When rolling a single six-sided die, the sample space is {1, 2, . . . , 6}. The sample space for two identical six-sided dice contains 21 elements representing all distinct pairs of values: {(1,1), (1,2), (1,3), . . . , (5,5), (5,6), (6,6)}.¹ If we are modeling the sum of two dice, however, then the sample space is {2, . . . , 12}.

For two events A and B, there are four possibilities:
1. A occurs, B does not.
2. B occurs, A does not.
3. Both A and B occur.
4. Neither A nor B occurs.

This would give an event space consisting of four events {A, B, {A, B}, ∅}. Because this event space has a finite number of outcomes, it is called a discrete probability space.³

In the corporate default example, the event space is {Default, No Default, {Default, No Default}, ∅}. It might appear surprising that the event space contains two "impossible" events for this example:
1. {Default, No Default}, which contains both outcomes; and
2. The empty set ∅, which contains neither.²

However, note that a definite probability of 0 can be assigned to "impossible" events. Because the event space contains all sets that can be assigned a probability, these two events are therefore part of the event space.

Probability

Probability measures the likelihood of an event. Probabilities are always between 0 and 1 (inclusive). An event with probability 0 never occurs, while an event with probability 1 always occurs. The simplest interpretation of probability is that it is the frequency with which an event would occur if a set of independent experiments was run. This interpretation of probability is known as the frequentist interpretation.

¹ This assumes that the dice are indistinguishable and therefore one can only observe the values shown (and not which die produced each value). If the dice were distinguishable from each other (e.g., they have different colors), then the sample space would contain 36 values (because {1,2} and {2,1} are different).
² An event that contains no elements is known as the empty set.
³ A discrete probability space is one that contains either finitely many outcomes or (possibly) countably infinitely many outcomes.
Figure 1.1 The top-left panel illustrates the intersection of two sets, which is indicated by the shaded area surrounded by the dark solid line. The top-right panel illustrates the union of two sets, which is indicated by the shaded area surrounded by the dark solid line. The shaded area in the bottom-left panel illustrates Aᶜ, which is the complement of A and includes every outcome that is not in A. The bottom-right panel illustrates mutually exclusive sets, which have an empty intersection (∅).
Fundamental Principles of Probability

The three fundamental principles of probability are defined using events, event spaces, and sample spaces. Note that Pr(ω) is a function that returns the probability of an event ω.

1. Any event A in the event space ℱ has Pr(A) ≥ 0.
2. The probability of all events in Ω is one, and thus Pr(Ω) = 1.
3. If the events A₁ and A₂ are mutually exclusive, then Pr(A₁ ∪ A₂) = Pr(A₁) + Pr(A₂). This holds for any number n of mutually exclusive events, so that Pr(A₁ ∪ A₂ ∪ . . . ∪ Aₙ) = Σⁿᵢ₌₁ Pr(Aᵢ).⁴

⁴ This result holds more generally for infinite collections of mutually exclusive events under some technical conditions.

These three principles, collectively known as the Axioms of Probability, are the foundations of probability theory. They imply two useful properties of probability. First, the probability of an event or its complement must be 1:

Pr(A ∪ Aᶜ) = Pr(A) + Pr(Aᶜ) = 1   (1.1)

Second, the probability of the union of any two sets can be decomposed into:

Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)   (1.2)

This result demonstrates that the total probability of two events is the probability of each event minus the probability of the event defined by the intersection. The elements in the intersection are in both sets; subtracting this term therefore avoids double counting.

Conditional Probability

Probability is commonly used to examine events in subsets of the full event space. Specifically, we are often interested in the probability of an event happening only if another event happens first. This concept is called conditional probability because we are computing a probability on the condition that another event occurs.

For example, we might want to determine the probability that a large financial institution fails given that another large financial institution has also failed. Note that the probability of a large financial institution failing is normally quite low. However, when one large financial institution fails (e.g., Lehman Brothers in . . .

. . . management exam, which is passed (on average) by 50% of test takers. Based on only this information, one would estimate that both Student A and Student B have a 50% chance of passing the exam.

Now suppose that the average length of study time needed to pass the exam is 200 hours. Suppose further that Student A studied for less than 100 hours, whereas Student B studied for more than 400 hours. Common sense would say that Student B has a higher probability of passing compared to Student A. In fact, a survey might find that the probability of passing for students who studied less than 100 hours is 10%, whereas the probability of passing for students who studied more than 400 hours is 80%. Using this new information, the conditional probability that A passes the exam is 10%, while the conditional probability that B passes is 80%.

The conditional probability of the event A occurring given that B has occurred is denoted as Pr(A|B) and can be calculated as:

Pr(A|B) = Pr(A ∩ B) / Pr(B)   (1.3)

The probability of observing an event in A, given that an event in B has been observed, depends on the probability of observing events that are in both, rescaled by the probability that an event in B occurs. In other words, conditional probability operates as if B is the event space and A is an event inside this new, restricted space.

As a simple example, note that the probability of rolling a three on a fair die is 1/6. This is because the event of rolling a three occurs in one outcome in a set of six possible outcomes: {1, 2, 3, 4, 5, 6}. However, if it is known that the number rolled is odd, then the probability of rolling a three is 1/3 because this additional information restricts the set of outcomes to three possibilities: {1, 3, 5}.

Figure 1.2 demonstrates this idea using sets. The left panel shows the unconditional probability, where Ω is the event space and A and B are events. The right panel shows the conditional probability of A given B. The numeric values show the probability of each region, meaning that the unconditional probability of A is Pr(A) = 16% and the conditional probability of A given B is Pr(A|B) = 12%/48% = 25%.

An important application of conditional probability is the Law of Total Probability. This law states that the total probability of an event can be reconstructed using conditional probabilities, provided that the probabilities of the conditioning events sum to 1.
Figure 1.2 The left panel shows the unconditional probability of A and B. The right panel shows the conditional probability of A given B. The numeric values indicate the probability of each region.
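The calculation in Figure 1.2 follows directly from the definition Pr(A|B) = Pr(A ∩ B)/Pr(B). A minimal sketch (the helper function is my own; the region probabilities are those shown in the figure):

```python
def conditional_probability(p_joint, p_condition):
    """Pr(A|B) = Pr(A and B) / Pr(B); the conditioning event must
    have positive probability."""
    if p_condition <= 0:
        raise ValueError("conditioning event must have positive probability")
    return p_joint / p_condition

# Figure 1.2 values: Pr(A and B) = 12%, Pr(B) = 48%.
print(conditional_probability(0.12, 0.48))  # 0.25, i.e., Pr(A|B) = 25%
```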
For example, Pr(B) + Pr(Bᶜ) = 1, and so:

Pr(A) = Pr(A|B) Pr(B) + Pr(A|Bᶜ) Pr(Bᶜ)   (1.4)

This property holds for any collection of mutually exclusive events {Bᵢ} where Σᵢ Pr(Bᵢ) = 1. Intuitively, this property must hold because each outcome in A must occur in one (and only one) of the Bᵢ.

Figure 1.3 illustrates the Law of Total Probability. Note that the probability that A occurs is 9% + 12% + 14% = 35%. The four conditional probabilities corresponding to each set Bᵢ are:

Pr(A|B₁) = 14%/30% = 46.6%,
Pr(A|B₂) = 9%/22% = 40.9%,
Pr(A|B₃) = 12%/28% = 42.8%,
Pr(A|B₄) = 0.

These conditional probabilities are then rescaled by the probabilities of the conditioning events to recover the total probability:

Σ⁴ᵢ₌₁ Pr(A|Bᵢ) Pr(Bᵢ) = 46.6% × 30% + 40.9% × 22% + 42.8% × 28% + 0 × 20% = 35%.

Example: SIFI Failures

Systemically Important Financial Institutions (SIFIs) are a category of designated large financial institutions that are deeply connected to large portions of the economy. They are usually banks, although other types of institutions such as insurers, asset managers, or central counterparties may be designated as SIFIs as well. SIFIs are subject to additional regulation and supervision, as the failure of even one of them could lead to a major disruption in financial markets.

Suppose that the chance of at least one SIFI failing in any given year is 1%. Suppose further that when at least one SIFI fails, there is a 20% chance that the number of failing SIFIs is exactly 1, 2, 3, 4, or 5. In other words, there is a 0.2% chance that one SIFI fails, a 0.2% chance that two institutions fail, and so on. Finally, assume that more than five SIFIs cannot fail, because governments and central banks would intervene.

If we define the event that one or more SIFIs fail as E₁, then Pr(E₁) = 1%. This is a small probability. However, if we are interested in the probability of two or more failures given that at least one has occurred, then this is:

Pr(E₂|E₁) = Pr(E₁ ∩ E₂)/Pr(E₁) = Pr(E₂)/Pr(E₁) = 0.8%/1% = 80%

Here, E₂ is the event that two or more SIFIs fail, and it is a subset of E₁ by construction (because E₁ is the event of one or more failures). Because E₁ ∩ E₂ = E₂, Pr(E₁ ∩ E₂) = Pr(E₂).
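The SIFI arithmetic can be checked numerically. This is a sketch (the variable names are my own) using the example's assumptions: a 1% chance of at least one failure in a year, split evenly across exactly one to five failures.

```python
# Unconditional probability that exactly k SIFIs fail (k = 1, ..., 5):
# each is 20% of the 1% chance that at least one SIFI fails.
p_exactly = {k: 0.01 * 0.20 for k in range(1, 6)}

p_at_least_one = sum(p_exactly.values())                        # Pr(E1) = 1%
p_two_or_more = sum(p for k, p in p_exactly.items() if k >= 2)  # Pr(E2) = 0.8%

# E2 is a subset of E1, so Pr(E2 | E1) = Pr(E2) / Pr(E1).
print(round(p_two_or_more / p_at_least_one, 4))  # 0.8
```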
Figure 1.3 The events B₁ through B₄ are mutually exclusive. A is the event indicated by the dark circle, and Pr(A) = 35%. The probability of each event Bᵢ includes the values that overlap with A as well as those that do not. For example, Pr(B₁) = 16% + 14% = 30%.

Figure 1.4 The left panel illustrates the unconditional probability of at least one SIFI failing. The right panel illustrates the conditional probability of 1, 2, 3, 4, or 5 failures given that at least one SIFI has failed. The events that exactly i SIFIs fail differ from the events Eᵢ that i or more institutions fail, although the two coincide for i = 5 because the maximum number of failures is assumed to be five.

1.2 INDEPENDENCE

Independence is a core concept that is exploited throughout statistics and econometrics. Two events are independent if the probability that one event occurs does not depend on whether the other event occurs. When the two events A and B are independent, then:

Pr(A ∩ B) = Pr(A) × Pr(B)

It is also the case that Pr(A|B) = Pr(A).

If A and B are mutually exclusive, then B cannot occur if A occurs. This means that the outcome of A affects Pr(B), and so mutually exclusive events are not independent. Another way to see this is to note that when Pr(A) and Pr(B) are both greater than 0, independence would require Pr(A ∩ B) = Pr(A) × Pr(B) > 0 (i.e., there must be a positive probability that both occur). However, mutually exclusive events have Pr(A ∩ B) = 0 and thus cannot be independent.

Independence can be generalized to any number of events using the same principle: the joint probability of the events is the product of the probability of each event. In other words, if A₁, A₂, . . . , Aₙ are independent events, then Pr(A₁ ∩ A₂ ∩ . . . ∩ Aₙ) = Pr(A₁) × Pr(A₂) × . . . × Pr(Aₙ).

The conditional probability Pr(A ∩ B|C) is the probability of an outcome that is in both A and B occurring, given that an outcome in C occurs.

Note that the two types of independence, unconditional and conditional, do not imply each other. Events can be both unconditionally dependent (i.e., not independent) and conditionally independent. Similarly, events can be independent, yet conditional on another event they may be dependent.

The left-hand panel of Figure 1.5 contains an illustration of two independent events. Each of the events A and B has a 40% chance of occurring. The intersection has the probability 16% (i.e., 40% × 40%).

Figure 1.5 The left panel shows two events that are independent, so that the joint probability, Pr(A ∩ B), is the same as the product of the probability of each event, Pr(A) × Pr(B). The right panel contains an example of two events, A and B, which are dependent but also conditionally independent given C.
Meanwhile, the right-hand panel shows an example where A and B are unconditionally dependent: Pr(A) = 40%, Pr(B) = 40%, and Pr(A ∩ B) = 25% (rather than the 16% that would be required for independence). However, conditioning on the event C restricts the space so that Pr(A|C) = 50%, Pr(B|C) = 50%, and Pr(A ∩ B|C) = Pr(A|C) Pr(B|C) = 25%, meaning that these events are conditionally independent. A further implication of conditional independence is that Pr(A|B ∩ C) = Pr(A|C) = 50% and Pr(B|A ∩ C) = Pr(B|C) = 50%, which both hold.

1.3 BAYES' RULE

Bayes' rule provides a method to construct conditional probabilities using other probability measures. It is both a simple application of the definition of conditional probability and an extremely important tool. The formal statement of Bayes' rule is:

Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)   (1.9)

This is a simple rewriting of the definition of conditional probability, because Pr(A|B) = Pr(A ∩ B)/Pr(B), so that Pr(A ∩ B) = Pr(A|B) Pr(B). Reversing these arguments also gives Pr(A ∩ B) = Pr(B|A) Pr(A). Equating these two expressions and solving for Pr(B|A) produces Bayes' rule. This approach uses the unconditional probabilities of A and B, as well as information . . .

. . .

Pr(Star|High Return) = Pr(High Return|Star) Pr(Star) / Pr(High Return)

The probability of a superstar is 10%. The probability of a high return if she is a superstar is 20%. The probability of a high return from either type of manager is:

Pr(High Return) = Pr(High Return|Star) Pr(Star) + Pr(High Return|Normal) Pr(Normal) = 20% × 10% + 5% × 90% = 6.5%

so that:

Pr(Star|High Return) = (20% × 10%)/6.5% = 2%/6.5% = 30.8%

In this case, a single high return updates the probability that a manager is a star from 10% (i.e., the unconditional probability) to 30.8% (i.e., the conditional probability). This idea can be extended to classifying a manager given multiple returns from a fund.

1.4 SUMMARY

Probability is essential for statistics, econometrics, and data analysis. This chapter introduces probability as a simple way to understand events that occur with different frequencies. Understanding probability begins with the sample space, which defines the range of possible outcomes, and events, which are collections of these outcomes. Probability is assigned to events, and these probabilities have three key properties:

1. Any event A in the event space ℱ has Pr(A) ≥ 0.
2. The probability of all events in Ω is 1, and therefore Pr(Ω) = 1.
3. If the events A₁ and A₂ are mutually exclusive, then Pr(A₁ ∪ A₂) = Pr(A₁) + Pr(A₂). This holds for any number of mutually exclusive events, so that Pr(A₁ ∪ A₂ . . . ∪ Aₙ) = Σⁿᵢ₌₁ Pr(Aᵢ).

. . . which replaces the unconditional probability with a conditional probability. An important feature of conditional independence is that events that are not unconditionally independent may be conditionally independent.

Finally, Bayes' rule uses basic ideas from probability to develop a precise expression for how information about one event can be used to learn about another event. This is an important practical tool because analysts are often interested in updating their beliefs using observed data.
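The fund-manager update can be written as a short Bayes' rule calculation. This is a sketch; the function and argument names are my own, while the probabilities are the chapter's: a 10% prior of being a star, a 20% chance of a high return for a star, and a 5% chance for a normal manager.

```python
def bayes_update(prior, likelihood, likelihood_alt):
    """Posterior probability via Bayes' rule. The denominator is the
    total probability of the evidence from either manager type
    (the Law of Total Probability)."""
    evidence = likelihood * prior + likelihood_alt * (1 - prior)
    return likelihood * prior / evidence

posterior = bayes_update(prior=0.10, likelihood=0.20, likelihood_alt=0.05)
print(round(posterior, 3))  # 0.308: one high return raises 10% to about 30.8%
```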
QUESTIONS

Practice Questions

1.4 Based on the probabilities in the diagram below, what are the values of the following?
a. Pr(Aᶜ)
b. Pr(D|A ∪ B ∪ C)
c. Pr(A|A)
d. Pr(B|A)
e. Pr(C|A)
f. Pr(D|A)
g. Pr((A ∪ D)ᶜ)
h. Pr(Aᶜ ∩ Dᶜ)
i. Are any of the four events pairwise independent?
ANSWERS

Solved Problems

1.4 a. Pr(A) = 12% + 11% + 15% + 10% = 48%
b. Pr(A|B) = Pr(A ∩ B)/Pr(B) = (11% + 10%)/(11% + 10% + 16% + 18%) = 38.2%
c. Pr(B|A) = Pr(A ∩ B)/Pr(A) = (11% + 10%)/48% = 43.8%
d. Pr(A ∩ B ∩ C) = 10%
e. Pr(B|A ∩ C) = Pr(B ∩ A ∩ C)/Pr(A ∩ C) = 10%/(15% + 10%) = 40%
f. Pr(A ∩ B|C) = Pr(A ∩ B ∩ C)/Pr(C) = 10%/(15% + 10% + 18% + 10%) = 18.9%
g. Pr(A ∪ B|C) = Pr((A ∪ B) ∩ C)/Pr(C) = (15% + 10% + 18%)/53% = 81.1%
h. We can use the rule that events are independent if their joint probability is the product of the probability of each. Pr(A) = 48%, Pr(B) = 55%, Pr(C) = 53%, Pr(A ∩ B) = 21% ≠ (48% × 55%), Pr(A ∩ C) = 25% ≠ (48% × 53%), Pr(B ∩ C) = 28% ≠ (55% × 53%). None of these events are pairwise independent.

1.5 a. 1 − Pr(A) = 100% − 30% = 70%
b. This value is Pr(D ∩ (A ∪ B ∪ C))/Pr(A ∪ B ∪ C). The total probability in the three areas A, B, and C is 73%. The overlap of D with these three is 9% + 8% + 7% = 24%, and so the conditional probability is 24%/73% = 33%.
c. This is trivially 100%.
d. Pr(B ∩ A) = 9%. The conditional probability is 30%.
e. There is no overlap, and so Pr(C ∩ A) = 0.
f. Pr(D ∩ A) = 9%. The conditional probability is 30%.
g. This is the total probability not in A or D. It is 1 − Pr(A ∪ D) = 1 − (Pr(A) + Pr(D) − Pr(A ∩ D)) = 100% − (30% + 36% − 9%) = 43%.
h. This area is the intersection of the space not in A with the space not in D. This area is the same as the area that is not in A or D, Pr((A ∪ D)ᶜ), and so 43%.
i. The four regions have probabilities A = 30%, B = 30%, C = 28%, and D = 36%. The only pair that satisfies the requirement that the joint probability is the product of the individual probabilities is A and B, because Pr(A ∩ B) = 9% = Pr(A) Pr(B) = 30% × 30%.

1.6 Consider the three scenarios: (High, High), (High, Low), and (Low, Low). We are interested in Pr(Star|High, High); using Bayes' rule, this is equal to Pr(High, High|Star) Pr(Star)/Pr(High, High). Stars produce high returns in 20% of years, and so Pr(High, High|Star) = 20% × 20%. Pr(Star) is still 10%. Finally, we need to compute Pr(High, High), which is Pr(High, High|Star) Pr(Star) + Pr(High, High|Normal) Pr(Normal). This value is 20% × 20% × 10% + 5% × 5% × 90% = 0.625%. Combining these values, Pr(Star|High, High) = 0.4%/0.625% = 64%. This is a large increase from the 30% chance after one year.

1.7 a. 0.10 × 0.20 = 0.02 → 2%
b. Calculate this in two ways:
i. Using Equation 1.2: Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B) = 10% + 20% − 2% = 28%.
ii. Using the identity Pr(A ∪ Aᶜ) = Pr(A) + Pr(Aᶜ) = 1: Let C be the event of no defaults. Then Pr(C) = (1 − Pr(A))(1 − Pr(B)) = (1 − 10%)(1 − 20%) = 0.9 × 0.8 = 72%. The complement of C is the event of any defaults, so Pr(Cᶜ) = 1 − Pr(C) = 28%.

1.8 We are interested in Pr(Fraud|Flag). This value is Pr(Fraud ∩ Flag)/Pr(Flag). The probability that a trans . . . = .0009. Combining these values, .0009/.000999 = 90%. This indicates that 10% of the flagged transactions are not actually fraudulent.

1.9 If each year was independent, then the probability of beating the benchmark ten years in a row is the product of the probabilities of beating the benchmark each year: (1/2)¹⁰ = 1/1,024 ≈ 0.1%.
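Several arithmetic steps in the solutions can be verified directly. A sketch for answer 1.5 (variable names are my own; the probabilities are those stated in the solution: Pr(A) = Pr(B) = 30%, Pr(D) = 36%, Pr(A ∩ D) = 9%, and Pr(A ∩ B) = 9%):

```python
p_a, p_b, p_d = 0.30, 0.30, 0.36
p_a_and_d, p_a_and_b = 0.09, 0.09

# Part g: Pr((A u D)^c) = 1 - (Pr(A) + Pr(D) - Pr(A n D))
p_not_a_or_d = 1 - (p_a + p_d - p_a_and_d)
print(round(p_not_a_or_d, 2))  # 0.43

# Part i: A and B are pairwise independent when Pr(A n B) = Pr(A) * Pr(B).
print(abs(p_a_and_b - p_a * p_b) < 1e-12)  # True
```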
• Describe and distinguish a probability mass function from a cumulative distribution function, and explain the relationship between these two.
• Understand and apply the concept of a mathematical expectation of a random variable.
• Characterize the quantile function and quantile-based estimators.
• Explain the effect of a linear transformation of a random variable on the mean, variance, standard deviation, skewness, kurtosis, median, and interquartile range.
Probability can be used to describe any situation with an element of uncertainty. However, random variables restrict attention to uncertain phenomena that can be described with numeric values. This restriction allows standard mathematical tools to be applied to the analysis of random phenomena. For example, the return on a portfolio of stocks is numeric, and a random variable can therefore be used to describe its uncertainty. The default on a corporate bond can be similarly described using numeric values by assigning one to the event that the bond is in default and zero to the event that the bond is in good standing.

This chapter begins by defining a random variable and relating its definition to the concepts presented in the previous chapter. The initial focus is on discrete random variables, which take on distinct values. Note that most results from discrete random variables can be computed using only a weighted sum.

Two functions are commonly used to describe the chance of observing various values from a random variable: the probability mass function (PMF) and the cumulative distribution function (CDF). These functions are closely related, and each can be derived from the other. The PMF is particularly useful when defining the expected value of a random variable, which is a weighted average that depends on both the outcomes of the random variable and the probabilities associated with each outcome.

Moments are used to summarize the key features of random variables. A moment is the expected value of a carefully chosen function designed to measure a characteristic of a random variable. Four moments are commonly used in finance and risk management: the mean (which measures the average value of the random variable), the variance (which measures the spread/dispersion), the skewness (which measures asymmetry), and the kurtosis (which measures the chance of observing a large deviation from the mean).¹

Another set of important measures for random variables includes those that depend on the quantile function, which is the inverse of the CDF. The quantile function, which can be used to map a random variable's probability to its realization, defines two moment-like measures: the median (which measures the central tendency of a random variable) and the interquartile range (which is an alternative measure of spread). The quantile function is also used to simulate data from a random variable with a particular distribution later in this book.

. . . mass function, a continuous random variable depends on a probability density function (PDF). However, this difference is of little consequence in practice, and the concepts introduced for discrete random variables all extend to continuous random variables.

2.1 DEFINITION OF A RANDOM VARIABLE

The axioms of probability are general enough to describe many forms of randomness (e.g., a coin flip, a draw from a well-shuffled deck of cards, or a future return on the S&P 500). However, directly applying probability can be difficult because it is defined on events, which are abstract concepts. Random variables limit attention to random phenomena which can be described using numeric values. This covers a wide range of applications relevant to risk management (e.g., describing asset returns, measuring defaults, or quantifying uncertainty in parameter estimates). The restriction that random variables are only defined on numeric values simplifies the mathematical tools used to characterize their behavior.

A random variable is a function of ω (i.e., an event in the sample space Ω) that returns a number x. It is conventional to write a random variable using upper-case letters (e.g., X, Y, or Z) and to express the realization of that random variable with lower-case letters (e.g., x, y, or z). It is important to distinguish between these two: A random variable is a function, whereas a realization is a number. The relationship between a realization and its corresponding random variable can be succinctly expressed as:

x = X(ω)   (2.1)

For example, let X be the random variable defined by the roll of a fair die. Then we can denote x as the result of a single roll (i.e., one event). The probability that the random variable is equal to five can be expressed as:

Pr(X = x) when x = 5

Univariate random variables are frequently distinguished based on the range of values produced. The two classes of random variables are discrete and continuous. A discrete random variable is one that produces distinct values. A continuous random variable produces values from an uncountable set (e.g., any number on the real line ℝ).
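The distinction between a random variable (a function) and a realization (a number) can be sketched in code. This is an illustrative sketch, not from the text; the function name X and the seed are arbitrary choices.

```python
import random

def X(omega):
    """A random variable: a function mapping an outcome in the
    sample space to a number. For a die roll, the outcome is
    already numeric, so the mapping is the identity."""
    return float(omega)

sample_space = [1, 2, 3, 4, 5, 6]
rng = random.Random(7)
omega = rng.choice(sample_space)  # one outcome of the experiment
x = X(omega)                      # x = X(omega), the realization

# For a fair die, Pr(X = 5) = 1/6; any single realization lies in the support.
print(x in {1.0, 2.0, 3.0, 4.0, 5.0, 6.0})  # True
```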
. . . random variable with p = 0.7. The right panel shows the corresponding cumulative distribution function.
2.2 DISCRETE RANDOM VARIABLES

A discrete random variable assigns a probability to a set of distinct values. This set can be either finite or contain a countably infinite set of values. An important example of a discrete random variable with a finite number of values is a Bernoulli random variable, which can only take a value of 0 or 1. Bernoulli random variables are frequently encountered when measuring binary random events (e.g., the default of a loan).

Because the values of random variables are always numerical, random variables can be described precisely using mathematical functions. The set of values that the random variable may take is called the support of the function. For example, the support for a Bernoulli random variable is {0, 1}. Two functions are frequently used when describing the properties of a discrete random variable's distribution.

The first function is known as the probability mass function (PMF). This function returns the probability that a random variable takes a certain value. Because the PMF returns probabilities, any PMF must have two properties:

1. The value returned from a PMF must be non-negative.
2. The sum across all values in the support of a random variable must be one.

For example, if X is a Bernoulli random variable and the probability that X = 1 is denoted by p (where p is between 0 and 1), the PMF of X is

f_X(x) = p^x (1 - p)^(1-x)    (2.2)

and the inputs to the PMF are either 0 or 1.³ Note that f_X(0) = p⁰(1 - p)¹ = 1 - p and f_X(1) = p¹(1 - p)⁰ = p, so that the probability of observing 0 is 1 - p and the probability of observing 1 is p. The PMF is denoted by f_X(x) to distinguish the random variable X underlying the mass function from the realization of the random variable (i.e., x).

The PMF of a Bernoulli can be equivalently expressed using a list of values:

f_X(x) = { 1 - p,  x = 0
           p,      x = 1        (2.3)

The counterpart of a PMF is the cumulative distribution function (CDF), which measures the total probability of observing a value less than or equal to the input x (i.e., Pr(X ≤ x)). In the case of a Bernoulli random variable, the CDF is a simple step function:

F_X(x) = { 0,      x < 0
           1 - p,  0 ≤ x < 1    (2.4)
           1,      x ≥ 1

Because the CDF measures the total probability that X ≤ x, it must be monotonic and increasing in x. Note that while the PMF is a discrete function (i.e., it only has support for 0 and 1), the CDF returns Pr(X ≤ x) and is therefore defined for any value of x, not just 0 and 1.

Figure 2.1 shows the PMF (left panel) and CDF (right panel) of a Bernoulli random variable with parameter p = 0.7 (i.e., there is a 70% chance of observing x = 1). The PMF is only defined for the two possible outcomes (i.e., 0 or 1). In contrast, the CDF is defined for all values of x, and so is a step function that has the value 0 for x < 0, 0.3 for 0 ≤ x < 1, and 1 for x ≥ 1.

The PMF and the CDF are closely related, and the CDF can always be expressed as the sum of the PMF for all values in the support that are less than or equal to x:

F_X(x) = Σ_{t ∈ R(X), t ≤ x} f_X(t)    (2.5)

³ The leading example of a countably infinite set is the integers, and any countably infinite set has a one-to-one relationship to the integers.
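As a minimal sketch (the function names are illustrative, not from the text), the Bernoulli PMF of equation (2.2) and the step-function CDF of equation (2.4) can be written as:

```python
def bernoulli_pmf(x, p):
    """PMF of a Bernoulli(p): p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        return 0.0
    return p**x * (1 - p)**(1 - x)

def bernoulli_cdf(x, p):
    """Step-function CDF of a Bernoulli(p): Pr(X <= x)."""
    if x < 0:
        return 0.0
    if x < 1:
        return 1 - p
    return 1.0

# With p = 0.7 the PMF assigns 0.3 to x = 0 and 0.7 to x = 1,
# and the PMF values sum to one (the second PMF property).
p = 0.7
assert abs(bernoulli_pmf(0, p) + bernoulli_pmf(1, p) - 1.0) < 1e-12
```

Evaluating the CDF at any point between 0 and 1 (e.g., x = 0.5) returns 1 - p, matching the step function described in the text.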
2.3 EXPECTATIONS

The expected value of a random variable is denoted E[X] and is defined as:

E[X] = Σ_{x ∈ R(X)} x Pr(X = x),    (2.7)

where R(X) is the range of values of the random variable X. For example, when X is a Bernoulli random variable where the probability of observing 1 is p:

E[X] = 0 × (1 - p) + 1 × p = p    (2.8)

Thus, an expectation is simply an average of the values in the support of X weighted by the probability that a value x is observed.

Functions of random variables are also random variables and therefore have expected values. The expectation of a function f of a random variable X is defined as:

E[f(X)] = Σ_{x ∈ R(X)} f(x) Pr(X = x),    (2.9)

For example, when X is a Bernoulli random variable, then the expected value of the exponential of X is

E[exp(X)] = exp(0) × (1 - p) + exp(1) × p = (1 - p) + p exp(1)

As an example, consider a call option (i.e., a derivative contract with a payoff that is a nonlinear function of the future price of an asset). The payoff of the call option is c(S) = max(S - K, 0), where S is the value of the asset and K is the strike price. If the asset price is above the strike price, the call option pays out S - K. If the price is below the strike price, the call option pays zero. Valuing a call option involves computing the expectation of a function (i.e., the payoff) of the price (i.e., a random variable).

Properties of the Expectation Operator

The expectation operator E[·] is like a function whose input is another function (i.e., a random variable). It performs a specific task on a random variable: computing a weighted average of its possible values.

The expectation operator has one particularly useful property in that it is a linear operator.⁴ An immediate implication of linearity is that the expectation of a constant is the constant (e.g., E[a] = a). A related implication is that the expectation of the expectation is just the expectation (e.g., E[E[X]] = E[X]). Furthermore, the expected value of a random variable is a constant (and not a random variable).

While linearity is a useful property, the expectation operator does not pass through nonlinear functions of X. For example, E[1/X] ≠ 1/E[X] whenever X is a non-degenerate random variable.⁵ This property holds for other nonlinear functions so that in general⁶ E[g(X)] ≠ g(E[X]).

Convex and concave functions are common in finance (e.g., many derivative payoffs are convex in the price of the underlying asset). It is useful to understand that the expected value of a convex function is larger than the function of the expected value. For example, if X is the roll of a fair die and f(x) = x², then

E[f(X)] = E[X²] = (1 + 4 + 9 + 16 + 25 + 36)/6 = 15.167

and:

f(E[X]) = f(3.5) = 3.5² = 12.25.

Figure 2.2 These two panels demonstrate Jensen's inequality. The left shows the difference between the expected value of the function E[H(X)] and the function at the expected value, H(E[X]), for a convex function. The right shows this difference for a concave function.

2.4 MOMENTS

As stated previously, moments are a set of commonly used descriptive measures that characterize important features of random variables. Specifically, a moment is the expected value of a function of a random variable. The first moment is defined as the expected value of X:

μ₁ = E[X]    (2.10)

The second and higher moments all share the same form and only differ in one parameter. These moments are defined as:

μ_r = E[(X - E[X])^r],    (2.11)

where r denotes the order of the moment (2 for the second, 3 for the third, and so on). In the definition of the second and higher moments, the first moment is subtracted, which ensures that changes in the first moment have no effect on the higher moments.⁷

There is also a second class of moments, called non-central moments.⁸ These moments are not centered around the mean and thus are defined as:

μ_r^NC = E[X^r]    (2.12)

⁷ Formally these moments are known as central moments because they are all centered on the first moment, E[X].
⁸ The non-central moments, which do not subtract the first moment, are less useful for describing random variables because a change in the first moment changes all higher order non-central moments. They are, however, often simpler to compute for many random variables, and so it is common to compute central moments using non-central moments.
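The fair-die computation above can be reproduced numerically. The `expectation` helper below is illustrative (it is not from the text); it implements the weighted sum in equation (2.9) and confirms that E[X²] exceeds (E[X])² for the convex function f(x) = x²:

```python
from fractions import Fraction

def expectation(support, probs, f=lambda x: x):
    """E[f(X)] = sum of f(x) * Pr(X = x) over the support (equation 2.9)."""
    return sum(f(x) * p for x, p in zip(support, probs))

die = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6               # fair die: each face has probability 1/6

mean = expectation(die, probs)                      # E[X] = 7/2
mean_sq = expectation(die, probs, lambda x: x**2)   # E[X^2] = 91/6

# Jensen-style comparison for the convex function f(x) = x^2:
# E[f(X)] is larger than f(E[X]).
assert mean == Fraction(7, 2)
assert mean_sq > mean**2
```

Using exact fractions avoids rounding, so the comparison E[X²] > (E[X])² holds exactly rather than only to floating-point tolerance.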
The variance is a measure of the dispersion of a random variable. It measures the squared deviation from the mean, and thus it is not sensitive to the direction of the difference relative to the mean.⁹

The standard deviation is denoted by σ and is defined as the square root of the variance (i.e., σ = √σ²). It is a commonly reported alternative to the variance, as it is a more natural measure of dispersion and is directly comparable to the mean (because the units of measurement of the standard deviation are the same as those of the mean).¹⁰

The final two named moments are both standardized versions of central moments. The skewness is defined as:

skew(X) = E[(X - E[X])³]/σ³ = E[((X - μ)/σ)³]    (2.13)

The final version in the definition clearly demonstrates that the skewness measures the cubed standardized deviation (see box: Standardizing Random Variables). Note that the random variable (X - μ)/σ is a standardized version of X that has mean 0 and variance 1. This standardization makes the skewness comparable across random variables. The kurtosis is defined analogously using the fourth power of the standardized deviation, kurt(X) = E[((X - μ)/σ)⁴]. Distributions with kurtosis above that of a normal random variable (which is 3) are described as being heavy-tailed/fat-tailed (i.e., having excess kurtosis). Financial return distributions have been widely documented to be heavy-tailed. Like skewness, kurtosis is naturally unit-free and can be directly compared across random variables with different means and variances.

STANDARDIZING RANDOM VARIABLES

When X has mean μ and variance σ², a standardized version of X can be constructed as (X - μ)/σ. This variable has mean 0 and unit variance (and standard deviation). These can be verified because:

E[(X - μ)/σ] = (E[X] - μ)/σ = 0

where μ and σ are constants and therefore pass through the expectation operator. The variance is

V[(X - μ)/σ] = V[X]/σ² = 1

because adding a constant has no effect on the variance (i.e., V[X - c] = V[X]) and the variance of bX is b²V[X], when b and c are constants.

⁹ Note that E[(X - E[X])²] = E[X² - 2XE[X] + E[X]²] = E[X²] - 2E[X]E[X] + E[X]² = E[X²] - E[X]².
¹⁰ The units of the variance are the squared units of the random variable. For example, if X is the return on an asset, measured in percent, then the units of the mean and standard deviation are both percentages. The units of the variance are squared percentages.
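The standardization described in the box, and the definitions of skewness and kurtosis as third and fourth moments of the standardized variable (X - μ)/σ, can be sketched for a discrete distribution. The `moments` helper is an illustrative assumption, not code from the text:

```python
def moments(support, probs):
    """Return mean, variance, skewness, and kurtosis of a discrete distribution.

    Skewness and kurtosis are computed as expectations of the third and fourth
    powers of the standardized deviation (x - mean) / sd.
    """
    mean = sum(x * p for x, p in zip(support, probs))
    var = sum((x - mean) ** 2 * p for x, p in zip(support, probs))
    sd = var ** 0.5
    z = [(x - mean) / sd for x in support]          # standardized support values
    skew = sum(zi ** 3 * p for zi, p in zip(z, probs))
    kurt = sum(zi ** 4 * p for zi, p in zip(z, probs))
    return mean, var, skew, kurt

# A symmetric three-point distribution has zero skewness, as the text suggests
# for any symmetric distribution.
mean, var, skew, kurt = moments([-1, 0, 1], [0.25, 0.5, 0.25])
assert abs(skew) < 1e-12
```

Because skewness and kurtosis are built from standardized quantities, they are unit-free: rescaling the support by a positive constant leaves both unchanged.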
Figure 2.3 The four panels demonstrate the effect of changes in the four moments. The top-left panel shows the effect of changing the mean. The top-right compares two distributions with the same mean but different standard deviations. The bottom-left panel shows the effect of negative skewness. The bottom-right shows the effect of heavy tails, which produces excess kurtosis.
Figure 2.3 shows the effect of changing each moment. The top-left panel compares the distribution of two random variables that differ only in the mean. The top-right compares two random variables with the same mean but different standard deviations. In this case, the random variable with a smaller standard deviation has both a narrower and a taller density. Note that this increase in height is required when the scale is reduced to ensure that the total probability (as measured by the area under the curve) remains 1.

The bottom-left panel shows the effect of a negative skew on the distribution of a random variable. The skewed distribution has a short right tail that abruptly cuts off near 2 as well as a long left tail which extends to the edge of the graph. The bottom-right graph shows the effect of heavy tails, which produce excess kurtosis. The heavy-tailed distribution has more probability mass near the center and in the tails, and less in the intervals (-2, -1) and (1, 2).

Moments and Linear Transformations

Many variables used in finance and risk management do not have a natural scale. For example, asset returns are commonly expressed as proportions or (if multiplied by 100) as percentages. This difference is an example of a linear transformation. It is helpful to understand the effect of linear transformations on the first four moments of a random variable.
When Y = a + bX is a linear transformation of X, the mean of Y is a + bE[X], and the variance of Y is b²σ_X², so that the standard deviation of Y is √(b²σ_X²) = |b|σ_X. The standard deviation is also insensitive to the shift by a and is linear in b. Finally, if b is positive (so that Y = a + bX is an increasing transformation), then the skewness and kurtosis of Y are identical to the skewness and kurtosis of X. This is because both moments are defined on standardized quantities (e.g., (Y - E[Y])/√V[Y]), which removes the effect of the location shift by a and rescaling by b. If b < 0 (and thus Y = a + bX is a decreasing transformation), then the skewness has the same magnitude but the opposite sign. This is because it uses an odd power (i.e., 3). The kurtosis, which uses an even power (i.e., 4), is unaffected when b < 0.

2.5 CONTINUOUS RANDOM VARIABLES

It is common to use a continuous random variable when modeling asset returns so that the return can be expressed at the accuracy level of the asset's price (e.g., dollars or cents). While many continuous random variables have support for any value on the real number line (i.e., ℝ), others have support for a defined interval on the real number line (e.g., [0, 1]).

Continuous random variables use a probability density function (PDF) in place of the probability mass function. The PDF f_X(x) returns a non-negative value for any input in the support of X. Note that a single value of f_X(x) is technically not a probability because the probability of any single value is always 0. This is necessary because there are infinitely many values x in the support of X. If the probability of each value was larger than 0, then the total probability would be larger than 1 and thus violate an axiom of probability.

Recall that a PMF must sum to one across the support of the random variable. This property also applies to a PDF, except that the sum is replaced by an integral:

∫_{R(X)} f_X(x) dx = 1

The PDF has a natural graphical interpretation, and the area under the PDF between x_L and x_U is Pr(x_L < X < x_U).

Meanwhile, the CDF of a continuous random variable is identical to that of a discrete random variable. It is a function that returns the total probability that the random variable X is less than or equal to the input (i.e., Pr(X ≤ x)). For most common continuous random variables, the PDF and the CDF are directly related. The PDF can be derived from the CDF by taking the derivative:

f_X(x) = ∂F_X(x)/∂x

Conversely, the CDF is the integral of the PDF up to x:

F_X(x) = ∫_{-∞}^{x} f_X(t) dt

The left panel of Figure 2.4 illustrates the relationship between the PDF and CDF of a continuous random variable. The random variable has support on (0, ∞), meaning that the CDF at the point x (i.e., F_X(x)) is the area under the curve between 0 and x. The right panel shows the relationship between a PDF and the area in an interval. The probability Pr(-1/2 < X < 1/3) is the area under the PDF between these two points. It can be computed from the CDF as:

F_X(1/3) - F_X(-1/2) = Pr(X ≤ 1/3) - Pr(X ≤ -1/2)

The probabilities Pr(a < X < b), Pr(a ≤ X < b), Pr(a < X ≤ b), and Pr(a ≤ X ≤ b) are the same when X is a continuous random variable. However, this is not necessarily the case when X is a discrete random variable.
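The link between a PDF and its CDF can be illustrated numerically: integrating the density up to x approximates F_X(x), and interval probabilities are CDF differences. This is a sketch under an assumed uniform-on-[0, 2] density, not an example from the text:

```python
def cdf_from_pdf(pdf, lo, x, n=100_000):
    """Approximate F(x) = integral of pdf from lo to x using the midpoint rule."""
    if x <= lo:
        return 0.0
    h = (x - lo) / n
    return sum(pdf(lo + (i + 0.5) * h) for i in range(n)) * h

# Illustrative density: uniform on [0, 2], so f(t) = 1/2 on the support.
pdf = lambda t: 0.5 if 0.0 <= t <= 2.0 else 0.0

# The total probability integrates to one (the continuous analogue of a PMF
# summing to one), and Pr(0.5 < X < 1.5) is the CDF difference F(1.5) - F(0.5).
total = cdf_from_pdf(pdf, 0.0, 2.0)
prob = cdf_from_pdf(pdf, 0.0, 1.5) - cdf_from_pdf(pdf, 0.0, 0.5)
```

Because the density is continuous, the endpoints do not matter: including or excluding 0.5 and 1.5 leaves the probability unchanged, mirroring the point made in the text.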
The expectation of a continuous random variable is also an integral. The mean E[X] is

E[X] = ∫_{R(X)} x f_X(x) dx    (2.15)

Superficially, this appears to be different from the expectation of a discrete random variable. However, an integral can be approximated as a sum so that:

E[X] ≈ Σ x Pr(X = x)

where the sum is over a range of values in the support of X. The relationship between continuous and discrete random variables is illustrated in Figure 2.5. The left panel shows a seven-point PMF that is a coarse approximation to the PDF of a normal random variable. This approximation has large errors (which are illustrated by gaps between the PMF and the PDF). The right panel shows a more accurate approximation with a much smaller approximation error. As the number of points in the approximation increases, the accuracy of the approximation increases as well.

The PDF and the CDF of many common random variables are complicated. For example, the most widely used continuous distribution in finance and economics is known as a normal (or Gaussian) distribution. The PDF of a normal variable is

f_X(x) = (1/√(2πσ²)) exp(-(x - μ)²/(2σ²))    (2.16)

where μ and σ² are parameters that determine the mean and variance of the random variable (respectively). The CDF of a normal random variable does not have a closed-form expression, and therefore it is calculated using numerical methods.

Figure 2.5 These two graphs show discrete approximations to a continuous random variable. The left shows a coarse approximation, and the right shows a fine approximation.
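The discrete approximation illustrated in Figure 2.5 can be sketched by evaluating the normal PDF of equation (2.16) on a grid and normalizing the resulting weights into a PMF. The grid construction below (seven points spanning four standard deviations) is an assumption for illustration, not the book's procedure:

```python
import math

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Normal PDF (equation 2.16)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def discrete_mean(mu=1.0, sigma2=1.0, n_points=7, width=4.0):
    """Approximate E[X] with an n-point PMF built from the normal PDF."""
    sd = math.sqrt(sigma2)
    # Evenly spaced grid from mu - width*sd to mu + width*sd.
    grid = [mu - width * sd + 2 * width * sd * i / (n_points - 1)
            for i in range(n_points)]
    weights = [normal_pdf(g, mu, sigma2) for g in grid]
    total = sum(weights)
    probs = [w / total for w in weights]            # normalize into a valid PMF
    return sum(g * p for g, p in zip(grid, probs))  # E[X] ~ sum of x * Pr(X = x)
```

Because the grid is symmetric around μ, even the coarse seven-point approximation recovers the mean; higher moments improve as the number of points grows, matching the text's description.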
2.6 QUANTILES AND MODES

Moments are the most common measures used to describe and compare random variables. They are well understood and capture important features. However, quantiles can be used to construct an alternative set of descriptive measures of a random variable. For a continuous or discrete random variable X, the α-quantile of X is the smallest number q such that Pr(X ≤ q) = α. This is traditionally denoted by:

inf{q : Pr(X ≤ q) = α},    (2.17)

where α ∈ [0, 1] and inf selects the smallest possible value satisfying the requirement that Pr(X ≤ q) = α. The inf is only needed when measuring the quantile of distributions that have regions with no probability.

In most applications in finance and risk management, the assumed distributions are continuous and without regions of zero probability. This means that for each α, there is exactly one number q such that Pr(X ≤ q) = α. In this important case, the quantile function can be defined such that Q(α) returns the α-quantile q of X and:

Q(α) = F_X⁻¹(α),    (2.18)

The left panel of Figure 2.6 shows that the CDF F_X(x) maps realizations to the associated quantiles. The right panel shows how the quantile function Q_X(α) = F_X⁻¹(α) maps quantiles to the values where Pr(X ≤ x) = α. This means that the quantile function is the reflection of the CDF along a 45° line.

Quantiles are used to construct useful descriptions of random variables that complement moments. Two are common: the median and the interquartile range.

The median is defined as the 50% quantile and measures the point that evenly splits the probability in the distribution (i.e., so that half of the probability is below the median and half is above it). The median is a measure of location and is often reported along with the mean. When distributions are symmetric, the mean and the median are identical. When distributions are skewed, these values generally differ (e.g., in the common case of negative skew, the mean is below the median). The interquartile range (IQR) is defined as the difference between the 75% and 25% quantiles. It is a measure of dispersion that is comparable to the standard deviation. The IQR depends on values that are relatively central and never in the tail of the distribution. This makes it less sensitive to changes in the extreme tail of a distribution, which manifest as outliers (i.e., observations that are far from many of the data points) in a data set.

Figure 2.6 The left panel shows the CDF of a continuous random variable, which maps values x to the associated quantile (i.e., F_X(x)). The right panel shows the quantile function, which maps quantiles to the corresponding inverse CDF value (i.e., Q_X(α) = F_X⁻¹(α)). It can easily be seen that either graph is a reflection of the other through the diagonal (i.e., with the axes flipped).
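When the CDF is strictly increasing, the quantile function Q(α) = F_X⁻¹(α) can be computed by numerically inverting the CDF. The bisection helper and the uniform CDF below are illustrative assumptions, not from the text:

```python
def quantile(cdf, alpha, lo, hi, tol=1e-10):
    """Invert a strictly increasing CDF by bisection: find q with F(q) = alpha."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Illustrative CDF: uniform on [0, 10], so F(x) = x / 10 on the support.
cdf = lambda x: x / 10.0

# The median is the 50% quantile; the IQR is the 75% quantile minus the 25% quantile.
median = quantile(cdf, 0.50, 0.0, 10.0)
iqr = quantile(cdf, 0.75, 0.0, 10.0) - quantile(cdf, 0.25, 0.0, 10.0)
```

For this symmetric distribution the median coincides with the mean, and the IQR measures the spread of the central half of the probability, as described in the text.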
Figure 2.7 The top panels demonstrate the relationship between the mean, median, and mode in a symmetric,
unimodal distribution (left) and a unimodal right-skewed distribution (right). The bottom-left panel demonstrates
the relationship between the interquartile range and the standard deviation. The bottom-right panel contains
examples of both a bimodal and a multimodal distribution.
QUESTIONS

2.3 What two properties must a PMF satisfy?

2.4 How are the mean, standard deviation, and variance of Y = a + bX related to the mean, standard deviation, and variance of X?

2.7 What are the median and interquartile range? When is the median equal to the mean?

Practice Questions

2.8 What is E[XE[X]]? Hint: recall that E[X] is a constant.

2.9 Compute the mean and variance, the standardized values, and the skewness and kurtosis of the following 15 observations of the random variable X:

      X
 1  -2.456
 2  -3.388
 3  -6.816
 4   1.531
 5   1.737
 6  -1.254
 7  -1.164
 8   1.532
 9   2.550
10   0.296
11  -0.979
12  -4.259
13   2.810
14  -1.608
15  -0.575

2.10 Suppose the return on an asset has the following distribution:

Return   Probability
  -4%        6%
  -3%        9%
  -2%       11%
  -1%       12%
   0%       14%
   1%       16%
   2%       15%
   3%        8%
   4%        5%
   5%        4%
ANSWERS

2.5 Excess kurtosis is the standard kurtosis measure minus 3. This reference value comes from the normal distribution, and so excess kurtosis measures the kurtosis above that of the normal distribution.

2.6 The quantile function is the inverse of the CDF function. That is, if u = F_X(x) returns the CDF value of X, then q = F_X⁻¹(u) is the quantile function. The CDF function is not invertible if there are regions where there is no probability. This corresponds to a jump in the CDF, as shown in the figures that follow.

2.7 The median is the point where 50% of the probability is located on either side. It may not be unique in discrete distributions, although it is common to choose the smallest value where this condition is satisfied. The IQR is the difference between the 25% and 75% quantiles. These are the points where 25% and 75% of the probability of the random variable lies to the left. This difference is a measure of spread that is like the standard deviation. The median is equal to the mean in any symmetric distribution if the first moment exists (is finite).
Solved Problems

2.8 E[XE[X]] = Σ_x x E[X] Pr(X = x) = E[X] Σ_x x Pr(X = x) = E[X] × E[X] = (E[X])² because E[X] is a constant and does not depend on X. In other words, the expected value of a random variable does not depend on the random variable.

2.9 a. The mean is -0.803 and the variance is 6.762.

b.
         X    X Standardized
 1    -2.456      -0.636
 2    -3.388      -0.994
 3    -6.816      -2.312
 4     1.531       0.897
 5     1.737       0.977
 6    -1.254      -0.173
 7    -1.164      -0.139
 8     1.532       0.898
 9     2.550       1.289
10     0.296       0.423
11    -0.979      -0.068
12    -4.259      -1.329
13     2.810       1.389
14    -1.608      -0.310
15    -0.575       0.088
Mean  -0.803       0.000
Variance 6.762     1.000

c./d.
         X    X Standardized   (X Standardized)³   (X Standardized)⁴
                                    (Skew)             (Kurtosis)
 1    -2.456      -0.636           -0.257               0.163
 2    -3.388      -0.994           -0.982               0.977
 3    -6.816      -2.312          -12.364              28.590
 4     1.531       0.897            0.722               0.647
 5     1.737       0.977            0.933               0.911
 6    -1.254      -0.173           -0.005               0.001
 7    -1.164      -0.139           -0.003               0.000
 8     1.532       0.898            0.724               0.650
 9     2.550       1.289            2.142               2.761
10     0.296       0.423            0.076               0.032
11    -0.979      -0.068            0.000               0.000
12    -4.259      -1.329           -2.348               3.120
13     2.810       1.389            2.680               3.722
14    -1.608      -0.310           -0.030               0.009
15    -0.575       0.088            0.001               0.000

2.10 a. The mean is E[X] = Σ x Pr(X = x) = 0.25%.

d. The kurtosis requires computing Σ ((x - μ)/σ)⁴ Pr(X = x).
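The figures reported in the solution to 2.9 can be reproduced directly. Note that the variance here is the population variance (dividing by n = 15), which is what makes the reported values -0.803 and 6.762 consistent with the standardized column:

```python
x = [-2.456, -3.388, -6.816, 1.531, 1.737, -1.254, -1.164, 1.532,
     2.550, 0.296, -0.979, -4.259, 2.810, -1.608, -0.575]

n = len(x)
mean = sum(x) / n                              # about -0.803
var = sum((xi - mean) ** 2 for xi in x) / n    # population variance, about 6.762
sd = var ** 0.5

# Standardized observations, as in part b: (x - mean) / sd.
z = [(xi - mean) / sd for xi in x]

# Skewness and kurtosis, as in parts c and d: averages of z**3 and z**4.
skew = sum(zi ** 3 for zi in z) / n
kurt = sum(zi ** 4 for zi in z) / n
```

The standardized values have mean 0 and variance 1 by construction, which matches the "Mean 0.000 / Variance 1.000" rows in the table above.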
• Distinguish the key properties and identify the common occurrences of the following distributions: uniform distribution, Bernoulli distribution, binomial distribution, Poisson distribution, normal distribution, lognormal distribution, chi-squared distribution, Student's t, and F-distributions.

• Describe a mixture distribution and explain the creation and characteristics of mixture distributions.
There are over two hundred named random variable distributions. Each of these distributions has been developed to explain key features of real-world phenomena. This chapter examines a set of distributions commonly applied to financial data and used by risk managers.

Risk managers model uncertainty in many forms, so this set includes both discrete and continuous random variables. There are three common discrete distributions: the Bernoulli (which was introduced in the previous chapter), the binomial, and the Poisson. The Bernoulli is a general purpose distribution that is typically used to model binary events. The binomial distribution describes the sum of n independent Bernoulli random variables. The Poisson distribution is commonly used to model hazard rates, which count the number of events that occur in a fixed unit of time (e.g., the number of corporations defaulting in the next quarter).

There is a wider variety of continuous distributions used by risk managers. The most basic is a uniform distribution, which serves as a foundation for all random variables. The most widely used distribution is the normal, which is used for tasks such as modeling financial returns and implementing statistical tests. Many other frequently used distributions are closely related to the normal. These include the Student's t, the chi-square (χ²), and the F, all of which are encountered when evaluating statistical models.

This chapter concludes by introducing mixture distributions, which are built using two or more distinct component distributions. A mixture is produced by randomly sampling from each component so that the mixture distribution inherits characteristics of each component. Mixtures can be used to build distributions that match important features of financial data. For example, mixing two normal random variables with different variances produces a random variable that has a larger kurtosis than either of the mixture components.

3.1 DISCRETE RANDOM VARIABLES

Bernoulli

A Bernoulli random variable is used to model binary outcomes (e.g., a bear market, a severe loss in a portfolio, or a company defaulting on its obligations).

The Bernoulli distribution depends on a single parameter, p, which is the probability that a success (i.e., 1) is observed. The probability mass function (PMF) of a Bernoulli(p) is

f_Y(y) = p^y (1 - p)^(1-y)    (3.1)

Note that this function only produces two values:

p, when y = 1
1 - p, when y = 0

Meanwhile, its CDF is a step function with three values:

F_Y(y) = { 0,      y < 0
           1 - p,  0 ≤ y < 1    (3.2)
           1,      y ≥ 1

The moments of a Bernoulli are easy to calculate. Suppose that Y is a Bernoulli random variable with parameter p. This can be expressed as

Y ~ Bernoulli(p)

The mean and variance of Y are respectively

E[Y] = p × 1 + (1 - p) × 0 = p,

and

V[Y] = E[Y²] - E[Y]² = (p × 1² + (1 - p) × 0²) - p² = p(1 - p)

Note that the mean reflects the probability of observing a 1, and thus it is just p. The variance reflects the fact that if p is either near 0 or near 1, then one value is observed much more frequently than the other (and thus the variance is relatively low). The variance is maximized when p = 50%, which reflects the equal probability that failure and success (i.e., 0 and 1) are observed.

Figure 3.1 describes two Bernoulli random variables: Y ~ Bernoulli(p = 0.5) and Y ~ Bernoulli(p = 0.9). Their PMFs and CDFs are shown in the left and right panels, respectively. Note that the CDFs are step functions that increase from 0 to 1 - p at 0, and then to 1 at 1.
Figure 3.1 The left panel shows the probability mass functions of two Bernoulli random variables,
one with p = 0.5 and the other with p = 0.9. The right panel shows the cumulative distribution
functions for the same two variables.
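A short sketch confirms the Bernoulli moments derived above (the helper name is illustrative): the mean is p, the variance is p(1 - p), and the variance peaks at p = 0.5:

```python
def bernoulli_moments(p):
    """Mean and variance of Y ~ Bernoulli(p), computed from the two-point PMF."""
    mean = p * 1 + (1 - p) * 0
    var = p * 1**2 + (1 - p) * 0**2 - mean**2   # E[Y^2] - E[Y]^2 = p(1 - p)
    return mean, var

# The variance p(1 - p) is largest at p = 0.5 over a grid of probabilities,
# matching the text's observation about equal probability of failure and success.
grid = [i / 100 for i in range(1, 100)]
best_p = max(grid, key=lambda p: bernoulli_moments(p)[1])
```

For p near 0 or 1 the variance shrinks toward zero, reflecting that one outcome dominates.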
For example, a Bernoulli can be defined to take the value 1 when a loss is realized (e.g., a loss that exceeds Value-at-Risk [VaR]) and 0 otherwise.

Binomial

A binomial random variable describes the sum of n independent Bernoulli random variables. By construction, a binomial random variable is always non-negative, integer-valued, and a B(n, p) is always less than or equal to n. The skewness of a binomial depends on p, with positive skew when p < 1/2 and negative skew when p > 1/2.

As an example, consider a binomial constructed from two independent flips of a fair coin, where Y counts the number of heads. Since the coin flips are independent, the probability of {Tails, Tails} is the product of the probability of each flip being tails, which is

(1/2)² = 1/4

Furthermore, the probability of each sequence with one head and one tail (i.e., {Heads, Tails} or {Tails, Heads}) is also 1/4, as is the probability of two heads.

Now, the variable's PMF can be produced by multiplying the probability of each number of heads by the number of ways in which it could occur:

f_Y(y) = { (2 choose 0) × (1/2)⁰ × (1/2)² = 1/4   if y = 0
           (2 choose 1) × (1/2)¹ × (1/2)¹ = 1/2   if y = 1
           (2 choose 2) × (1/2)² × (1/2)⁰ = 1/4   if y = 2

Meanwhile, the CDF is the sum² of the cumulated PMF between 0 and y.

Note that the normal distribution provides a convenient approximation to the binomial if both np > 10 and n(1 - p) > 10. This is illustrated in Figure 3.2 and explained further later in this chapter.

Figure 3.2 shows the PMFs and CDFs of two binomial random variables. One distribution has n = 4 and p = 0.1, which produces a distribution with right-skew. Note that the mass is heavily concentrated around y = 0 and y = 1, which reflects the small probability of success of each of the four Bernoulli random variables. Meanwhile, the second configuration has n = 25 and p = 0.4, creating a shape that resembles the density of a normal random variable. The solid line in the PMF panel shows the PDF of this approximation. The right panel shows the CDFs, which are step functions.

Poisson

Poisson random variables are used to measure counts of events over fixed time spans. For example, one application of a Poisson is to model the number of loan defaults that occur each month. Poisson random variables are always non-negative and integer-valued. The Poisson distribution has a single parameter, which is called the hazard rate and expressed as λ, that signifies the average number of events per interval. Therefore, the mean and variance of Y ~ Poisson(λ) are simply:

E[Y] = V[Y] = λ

The PMF of a Poisson random variable is

f_Y(y) = (λ^y exp(-λ))/y!    (3.5)

Meanwhile, the CDF of a Poisson is defined as the sum of the PMF for values less than or equal to the input:

F_Y(y) = exp(-λ) Σ_{i=0}^{⌊y⌋} λ^i/i!    (3.6)

Note that ⌊y⌋ is the floor operator, which produces the greatest integer that is less than or equal to y.

The Poisson parameter λ can be considered as a hazard rate in survival modeling. As an example, consider a fixed-income portfolio that consists of a large number of bonds. On average, five bonds within the

² The CDF is defined using the floor function, ⌊y⌋, which returns y when y is an integer and the largest integer smaller than y when y is not.
Figure 3.3 The left panel shows the PMF for two values of λ, 3 and 15. The right panel shows the corresponding CDFs.
portfolio default each month. Assuming that the probability of any bond defaulting is independent of the other bonds, what is the probability that exactly two bonds default in one month?

In this case, y = 2 and λ = 5. Thus

f_Y(2) = (5² exp(-5))/2! = 0.0842 = 8.42%

Uniform

The simplest continuous random variable is a uniform random variable. A uniform distribution assumes that any value within the range [a, b] is equally likely to occur. The PDF of a uniform is

f_Y(y) = 1/(b - a)    (3.7)

The PDF of a uniform random variable does not depend on y, because all values are equally likely.

The CDF returns the cumulative probability of observing a value less than or equal to the argument, and is

F_Y(y) = { 0,                y < a
           (y - a)/(b - a),  a ≤ y ≤ b    (3.8)
           1,                y > b

The mean of a uniform is the midpoint of its support:

E[Y] = (a + b)/2

Meanwhile, the variance depends only on the width of the support:

V[Y] = (b - a)²/12

The mean and variance of a standard uniform random variable are 1/2 and 1/12, respectively.
Figure 3.4 The left panel shows the PDFs of two uniform random variables: a standard uniform and a uniform between 1/2 and 3. The right panel shows the cumulative distribution functions corresponding to these two random variables.
The probability that Y falls into an interval with lower bound l and upper bound u (i.e., Pr(l < Y < u)) is

(min(u, b) - max(l, a))/(b - a)    (3.10)

which simplifies to (u - l)/(b - a) when both l ≥ a and u ≤ b.

As an example, a uniform random variable Y ~ U(-2, 4) has a mean of

(-2 + 4)/2 = 1

a variance of

(4 - (-2))²/12 = 3

and the probability that Y falls between -1 and 2 is

Pr(l < Y < u) = (min(2, 4) - max(-1, -2))/(4 - (-2)) = (2 - (-1))/6 = 1/2

Normal

The normal is widely used for several reasons:

• The normal is frequently used in hypothesis testing (i.e., the process where data is used to determine the truthfulness of an objective statement).
• Normal random variables are infinitely divisible, which makes them suitable for simulating asset prices in models that assume that prices are continuously evolving.
• The normal is closely related to many other important distributions, including the Student's t, the χ², and the F.
• The normal is closed under linear operations (i.e., weighted sums of normal random variables are normally distributed).
• Estimators derived under the assumption that the underlying observations are normally distributed often have simple closed forms.

The normal distribution has two parameters: μ (i.e., the mean) and σ² (the variance). Therefore

E[Y] = μ

and

V[Y] = σ²
Figure 3.5 The left panel shows the PDFs of three normal random variables, N(0, 1) (i.e., the standard normal), N(0, 4), and N(2, 1). The right panel shows the corresponding CDFs.
The normal can generate any value in (−∞, ∞), although it is unlikely (i.e., less than one in 10,000) to observe values more than 4σ away from the mean. In fact, values more than 3σ away from the mean are expected in only one in 370 realizations of a normal random variable.

The PDF of a normal is

f(y) = (1/√(2πσ²)) exp(−(y − μ)²/(2σ²))   (3.11)

Recall that while the CDF of a normal does not have a closed form, fast numerical approximations are widely available in Excel or other statistical software packages.

An important special case of a normal occurs when μ = 0 and σ² = 1. This is called a standard normal, and it is common to use Z to denote a random variable with distribution N(0, 1). It is also common to use φ(z) to denote the standard normal PDF and Φ(z) to denote the standard normal CDF.

The normal is widely applied, and key quantiles from the normal distribution are commonly used for two purposes.

1. First, the quantiles are used to approximate the chance of observing values more than 1σ, 2σ, and 3σ away from the mean when describing a log return as a normal random variable.
2. Second, when constructing confidence intervals for estimated parameters, it is common to consider symmetric intervals that contain 90%, 95%, or 99% of the probability.

These values are summarized in Table 3.1. The left panel tabulates the area covered by an interval of ±qσ for q = 1, 2, 3. Both the exact probabilities and the common approximations used by practitioners are given. The right panel shows the end points of a symmetric interval with a total tail probability of α. This means that each tail contains probability α/2 for α = 10%, 5%, and 1%. The central probability of each region is 100% − α. It is common to substitute 2 for 1.96 here as well.

Figure 3.6 shows a graphical representation of these two measures. The left panel shows the area covered by ±1σ, ±2σ, and ±3σ. These regions are centered at the mean μ. The right panel shows the area covered by three confidence intervals when the random variable is a standard normal.

Note that the sums of independent³ normally distributed random variables are also normally distributed. If X₁ ~ N(μ₁, σ₁²) and X₂ ~ N(μ₂, σ₂²) are independent, and Y = X₁ + X₂, then Y ~ N(μ₁ + μ₂, σ₁² + σ₂²). This property simplifies describing log returns at different frequencies; if daily log returns are independent and normal, then weekly and monthly returns are as well.

Approximating Discrete Random Variables

Recall that a normal distribution can approximate a binomial random variable if both np > 10 and n(1 − p) > 10. When these conditions are satisfied, a binomial has either many independent experiments or a probability that is not extreme (or both), and so the PMF is nearly symmetric and well approximated by a N(np, np(1 − p)).⁴

³ Chapter 4 shows this property holds more generally for normal random variables and independence is not required.

⁴ The approximation of a binomial with a normal random variable is an application of the CLT, which establishes conditions where averages of independent and identically distributed random variables behave like normal random variables when the number of samples is large. Chapter 5 provides a thorough treatment of CLTs.
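The exact coverage probabilities tabulated in Table 3.1 can be reproduced with only the standard library, using the identity Φ(z) = (1 + erf(z/√2))/2. This is an illustrative sketch; the function name is ours.

```python
import math

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probability covered by an interval of +/- q sigma, q = 1, 2, 3.
for q in (1, 2, 3):
    print(q, round(phi(q) - phi(-q), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```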
Figure 3.6 The left panel shows the area (probability) covered by ±1σ, ±2σ, and ±3σ. The right panel shows the critical values that define regions with 90%, 95%, and 99% coverage so that the total area in the tails is 10%, 5%, and 1%, respectively. The probability in each tail is α/2.
The Poisson can also be approximated by a normal random variable. When λ is large, then a Poisson(λ) can be well approximated by a Normal(λ, λ). This approximation is commonly applied when λ > 1,000.

Log-Normal

A log-normal random variable is defined as

Y = exp(X),   (3.12)

where X ~ N(μ, σ²). An important property of the log-normal distribution is that Y can never be negative, whereas X can be negative because it is normally distributed. This log-normal property can be desirable when constructing certain models. For example, if stock prices are assumed to be normally distributed, there is a positive (although perhaps tiny) probability that the stock price becomes negative. This is impossible under a log-normal model.

⁵ The mean reflects the mean of the underlying normal random variable as well as its variance. The extra term arises due to Jensen's inequality: the exponential function is a convex function, and so it must be the case that E[Y] > exp(E[X]) = exp(μ). The variance depends on both model parameters: V[Y] = (exp(σ²) − 1) exp(2μ + σ²).
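The quality of the normal approximation to a Poisson can be checked directly. The sketch below uses λ = 100 rather than λ > 1,000 purely to keep the exact CDF numerically convenient; the helper functions are ours.

```python
import math

def poisson_cdf(k, lam):
    # Exact Poisson CDF by summing the PMF recursively.
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def normal_cdf(x, mu, var):
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))

lam = 100
exact = poisson_cdf(105, lam)
approx = normal_cdf(105, lam, lam)  # Normal(lambda, lambda) approximation
print(round(exact, 3), round(approx, 3))  # the two agree to within a few percent
```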
To see how this works, begin with the standard normal (i.e., Z ~ N(0, 1)). Defining X = μ + σZ (where μ and σ are constants) therefore produces a random variable with a N(μ, σ²) distribution.

The mean and variance of X can be verified using the results from Chapter 2 to show that

E[X] = μ + σE[Z] = μ

and

Var[X] = σ²Var[Z] = σ²

Chapter 2 also explains how a random variable is standardized, so that if X ~ N(μ, σ²), then

Pr(−1 < X < 2) = Pr((−1 − μ)/σ < Z < (2 − μ)/σ)

and the CDF of X follows from the standard normal CDF as

F_X(x) = Φ((x − μ)/σ)

Figure 3.7 provides an illustration of the CDF, which measures the shaded area under the PDF.

Note that only one tail is required because the standard normal distribution is symmetric around 0, so that Pr(Z < 0) = Pr(Z > 0) = 0.5. This symmetry ensures that

Pr(Z < z) = Pr(Z > −z) = 1 − Pr(Z < −z)

when z < 0.

For example, Pr(−1 < Z < 2) is simply

Pr(Z < 2) − Pr(Z < −1)

The PDF of a log-normal is given by:

f_Y(y) = (1/(y√(2πσ²))) exp(−(ln y − μ)²/(2σ²))   (3.13)

Because ln(Y) = X ~ N(μ, σ²), the CDF of Y can be computed by inverting the transformation using ln(Y) and then applying the normal CDF on the transformed value:

F_Y(y) = Φ((ln y − μ)/σ)
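The standardization identities above translate directly into code. This sketch (with our own helper names) computes Pr(−1 < Z < 2) and evaluates a log-normal CDF via the inverse transformation.

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    # Standardize, then apply Phi(z) = (1 + erf(z / sqrt(2))) / 2.
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Pr(-1 < Z < 2) = Phi(2) - Phi(-1)
print(round(norm_cdf(2) - norm_cdf(-1), 4))  # 0.8186

def lognorm_cdf(y, mu, sigma):
    # F_Y(y) = Phi((ln y - mu) / sigma) for Y = exp(X), X ~ N(mu, sigma^2).
    return norm_cdf(math.log(y), mu, sigma)

print(round(lognorm_cdf(1.0, 0.08, 0.20), 4))  # Pr(Y <= 1) = Phi(-0.4)
```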
Figure 3.8 The left panel shows the PDFs of three log-normal random variables (LogN(0.08, 0.2²), LogN(0.16, 0.2²), and LogN(0.08, 0.4²)), two of which have μ = 8% and two of which have σ = 20%, which are typical of annual equity returns. The right panel shows the corresponding CDFs.
Figure 3.8 shows the PDFs (left panel) and CDFs (right panel) for three log-normal random variables. These have been parameterized to reflect plausible distributions of annual returns on an equity market index. Note that the expected value of the distribution is always greater than exp(μ) due to the σ²/2 term in the mean.⁵ For example, when μ = 8% and σ = 20%, the expected value is 1.105, and when σ = 40%, the expected value is 1.174. At the same time, changing μ to 16% produces an expected value of 1.197. The effect of σ² on the expected value reflects the asymmetric nature of returns; the worst return is −100%, whereas the upside is unbounded.

Chi-Squared (χ²)

The χ² (chi-squared) distribution is frequently encountered when testing hypotheses about model parameters. It is also used when modeling variables that are always positive (e.g., the VIX Index). A χ²_ν random variable is defined as the sum of the squares of ν (Greek nu) independent standard normal random variables

Y = Σ_{i=1}^{ν} Z_i²   (3.15)

Note that a χ² distribution has ν degrees of freedom, a concept that arises when dealing with models that have k parameters estimated using n data points. Degrees of freedom measure the amount …

If Y ~ χ²_ν, then E[Y] = ν and V[Y] = 2ν. The PDF of a χ²_ν random variable is

f(y) = (1 / (2^(ν/2) Γ(ν/2))) y^(ν/2 − 1) exp(−y/2)   (3.16)

where Γ(x) is known as the Gamma function.⁶

Figure 3.9 shows the PDFs and CDFs for three parameter values of a χ² random variable. These PDFs are only defined for positive values, because the χ² is the distribution of a sum of squared random variables. When ν = 1, the distribution has a single mode at 0. As ν increases, the density shifts to the right and spreads out, reflecting the mean (ν) and the variance (2ν) of a χ²_ν. In all cases, the χ² distribution is right-skewed. The skewness declines as the degrees of freedom increases, and when ν is large (> 25), a χ²_ν random variable is well approximated by a N(ν, 2ν).

Student's t

The Student's t distribution is closely related to the normal, but it has heavier tails. The Student's t distribution was originally developed for testing hypotheses using small samples.
Figure 3.9 The left panel shows the PDFs of χ²-distributed random variables with ν = 1, 3, or 5. The right panel shows the corresponding CDFs.
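Equation 3.15's definition can be checked by simulation: summing ν squared standard normal draws should give a sample mean near ν and a sample variance near 2ν. A minimal sketch using only the standard library (function name ours):

```python
import random
import statistics

random.seed(42)

def chi2_draw(nu):
    # Sum of nu squared independent standard normals (Equation 3.15).
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu))

nu = 5
draws = [chi2_draw(nu) for _ in range(100_000)]
print(round(statistics.fmean(draws), 2))     # close to nu = 5
print(round(statistics.variance(draws), 1))  # close to 2 * nu = 10
```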
A Student's t is a one-parameter distribution. This parameter, denoted by ν, is also called the degrees of freedom parameter. While it affects many aspects of the distribution, the most important effect is on the shape of the tails.

A Student's t is the distribution of

Y = Z / √(W/ν)   (3.17)

where Z is a standard normal, W is a χ²_ν random variable, and Z and W are independent.⁷ Dividing a standard normal by another random variable produces heavier tails than the standard normal. This is true for all values of ν, although a Student's t converges to a standard normal as ν → ∞.⁸

If Y ~ t_ν, then the mean is

E[Y] = 0

and the variance is

V[Y] = ν/(ν − 2)

The kurtosis of Y is

kurtosis(Y) = 3(ν − 2)/(ν − 4)

The mean is only finite if ν > 1 and the variance is only finite if ν > 2. The kurtosis is defined for ν > 4 and is always larger than 3 (i.e., it is larger than the kurtosis of a normal random variable).

In some applications of a Student's t, it is desirable to separate the degrees of freedom from the variance. Using the basic result that

V[aY] = a²V[Y]

it is easy to see that

V[√((ν − 2)/ν) Y] = ((ν − 2)/ν) × ν/(ν − 2) = 1

when Y ~ t_ν. This distribution is known as a standardized Student's t, because it has mean 0 and variance 1 for any value of ν. Importantly, it can be rescaled to have any variance and re-centered to have any mean if ν > 2.

For example, the generalized Student's t is parameterized with three parameters reflecting the mean, variance, and degrees of freedom, and is denoted as Gen. t_ν(μ, σ²).

The features of the generalized t make it better suited to modeling the returns of many assets than the normal distribution. It captures the heavy tails in asset returns (i.e., the increased likelihood of observing a value larger than qσ relative to a normal for values |q| > 2) while retaining the flexibility of a normal random variable to directly set the mean and variance.

For example, the chance of observing a value more than 4σ away from the mean, if X₁ is normally distributed, is about 1 in 15,000. If X₂ is a standardized t with the same mean and variance with the degree of freedom parameter ν = 8, then the chance of observing a value larger than 4σ is 1 in 580. This difference has important implications for risk management, especially when we are worried about rare events.

⁷ Z and W are independent if the normal random variable used to generate W is independent of Z.

⁸ This should be intuitive considering that E[W/ν] = 1 for all values of ν and V[W/ν] = 2/ν, so that as ν becomes larger, the denominator closely resembles a constant value.
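The kurtosis formula 3(ν − 2)/(ν − 4) can be inverted to recover ν from an observed kurtosis, a calculation used again in Problem 3.13. A small sketch (function names ours):

```python
def t_kurtosis(nu):
    # Kurtosis of a Student's t, defined for nu > 4.
    return 3 * (nu - 2) / (nu - 4)

def nu_from_kurtosis(k):
    # Solve k = 3(nu - 2)/(nu - 4) for nu; requires k > 3.
    return (4 * k - 6) / (k - 3)

print(t_kurtosis(8))          # 4.5
print(nu_from_kurtosis(6.0))  # 6.0
print(nu_from_kurtosis(9.0))  # 5.0
```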
Figure 3.10 The left panel shows the PDF of a generalized Student's t with four degrees of freedom and the PDF of a standard normal. The right panel shows the corresponding CDFs.
Figure 3.10 compares the PDFs and CDFs of a generalized Student's t with ν = 4 to a standard normal, which is the limiting distribution as ν → ∞. The Student's t is more peaked, heavier-tailed, and has less probability in the regions around ±1.5. This appearance is common in heavy-tailed distributions. These two distributions have the same variance by construction, and so the increased probability of a large magnitude observation in the t must be offset with an increased chance of a small magnitude observation to keep the variance constant.

F

The F is another distribution that is commonly encountered when testing hypotheses about model parameters. The F has two parameters, ν₁ and ν₂, respectively known as the numerator and denominator degrees of freedom. These parameters are usually written as subscripts in the shorthand expression F_{ν₁,ν₂}.

An F distribution is defined as the ratio of two independent χ² random variables where each has been divided by its degrees of freedom

Y = (W₁/ν₁) / (W₂/ν₂)   (3.18)

where W₁ ~ χ²_{ν₁} and W₂ ~ χ²_{ν₂}, and W₁ and W₂ are independent.⁹ If Y ~ F_{ν₁,ν₂}, then the mean of Y is

E[Y] = ν₂/(ν₂ − 2),

which is only finite when ν₂ > 2, and the variance is

V[Y] = 2ν₂²(ν₁ + ν₂ − 2) / (ν₁(ν₂ − 2)²(ν₂ − 4))

and is only finite for ν₂ > 4.

When using an F in hypothesis testing, ν₁ is usually determined by the hypothesis being tested and is typically small (e.g., 1, 2, 3, . . .), while ν₂ is related to the sample size (and so is relatively large). When ν₂ is large, the denominator (W₂/ν₂) has mean 1 and variance¹⁰ 2/ν₂ ≈ 0, and so behaves like a constant. In this case, if Y ~ F_{ν₁,ν₂}, then the distribution is close to W₁/ν₁, and the mean and variance are 1 and 2/ν₁, respectively. The PDF and CDF of an F are both complicated and involve integrals that do not have closed forms.

Figure 3.11 shows the PDF and CDF for three sets of parameter values of F-distributed random variables. Comparing the F_{3,10} and the F_{4,10}, the extra degree of freedom in the numerator has two effects. The first is that it shifts the distribution to the right. The second is that it shifts the probability closer to the mean (i.e., 1.25). The result is that these two distributions have the same mean but the F_{3,10} has a larger variance.

Comparing the F_{3,10} and F_{3,∞}, the distribution with a larger denominator degrees of freedom parameter lies to the left and has a thinner tail. These parameters change the mean and variance of F_{3,∞} to 1 and 2/3, respectively.

The F is closely related to the χ² and the Student's t. When ν₁ = 1 and ν₂ is sufficiently large, then F_{1,ν₂} is close to a χ²₁. If Y ~ t_ν, then Y² ~ F_{1,ν}. These properties can be readily verified using the definitions of the t, χ², and F distributions.
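The F moments above can be evaluated numerically, confirming that F(3, 10) and F(4, 10) share the mean 1.25 while F(3, 10) has the larger variance. An illustrative sketch (helper names ours):

```python
def f_mean(nu2):
    # E[Y] = nu2 / (nu2 - 2), finite for nu2 > 2.
    return nu2 / (nu2 - 2)

def f_var(nu1, nu2):
    # V[Y] = 2 nu2^2 (nu1 + nu2 - 2) / (nu1 (nu2 - 2)^2 (nu2 - 4)), finite for nu2 > 4.
    return 2 * nu2**2 * (nu1 + nu2 - 2) / (nu1 * (nu2 - 2) ** 2 * (nu2 - 4))

print(f_mean(10))              # 1.25
print(round(f_var(3, 10), 4))  # 1.9097
print(round(f_var(4, 10), 4))  # 1.5625
```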
Figure 3.11 The left panel shows the PDF of three F-distributed random variables. The right panel shows the corresponding CDFs.
Figure 3.12 The left panel shows the PDFs of three random variables with exponential distributions having shape parameters between 1 and 5. The right panel shows the corresponding CDFs.
Figure 3.13 The left panel shows the PDFs for four configurations of the Beta distribution (Beta(1/4, 1/4), Beta(1, 1), Beta(5, 5), and Beta(7, 3)), including the case where α = β = 1, which produces a standard uniform. The right panel shows the corresponding CDFs.
… is equal probability today that the company defaults in the first year (days 0 to 365) or in the first two years (days 0 to 730).

Figure 3.12 shows the PDF and CDF of three exponential distributions with β between 1 and 5.

As an example, assume that the time to default for a specific segment of credit card consumers is exponentially distributed with a β of five years. To find the probability that the customer will not default before year six, start by calculating the cumulative distribution until year six and then subtract this from 100%:

Pr(Survival > Year 6) = 1 − F_Y(6 | β = 5) = 1 − (1 − e^(−6/5)) = 30.1%

The left panel of Figure 3.13 shows the PDFs for four Beta distributions. Three of the examples have the same mean (i.e., 1/2). When both parameters are less than 1, the distribution places most of the probability mass near the boundaries. When α = β = 1, then the Beta simplifies to a standard uniform distribution. As the parameters increase, the distribution becomes more concentrated around the mean. When α = 7 and β = 3, the distribution shifts towards the upper bound. In general, increasing α shifts the distribution towards 1, and increasing β shifts the distribution towards 0. Increasing both parameters proportionally reduces the variance and concentrates the distribution around the mean.
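The credit card default example can be reproduced in a few lines (an illustrative sketch; the function name is ours):

```python
import math

def exp_cdf(y, beta):
    # Exponential CDF with scale beta: F(y) = 1 - exp(-y / beta).
    return 1.0 - math.exp(-y / beta)

# Probability of surviving beyond year six when beta = 5 years.
survival = 1.0 - exp_cdf(6, 5)
print(round(survival, 3))  # 0.301
```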
f(y) = p f_{X₁}(y) + (1 − p) f_{X₂}(y)   (3.23)

where X₁ ~ N(−0.9, 1 − 0.9²), X₂ ~ N(0.9, 1 − 0.9²), and W ~ Bernoulli(p = 0.5). The parameters of the two component normal distributions and the Bernoulli have been chosen so that E[Y] = 0 and V[Y] = 1.

While directly computing the central moments of a mixture distribution is challenging, doing so for non-central moments is simple. For example, the mean of the mixture is

E[Y] = pE[X₁] + (1 − p)E[X₂]   (3.24)

and the second non-central moment is

E[Y²] = pE[X₁²] + (1 − p)E[X₂²]   (3.25)

These two non-central moments can be combined to construct the variance

V[Y] = E[Y²] − E[Y]²

Higher-order non-central moments are equally easy to compute because they are weighted versions of the non-central moments of the mixture's components. Using the relationship between the third and fourth central and non-central moments defined in Chapter 2, these central moments are related to the non-central moments by

E[(Y − E[Y])³] = E[Y³] − 3E[Y²]E[Y] + 2E[Y]³

E[(Y − E[Y])⁴] = E[Y⁴] − 4E[Y³]E[Y] + 6E[Y²]E[Y]² − 3E[Y]⁴   (3.26)

In this case, the mixing probability is 50%. The mean of Y is 0 by construction because

E[Y] = E[WX₁ + (1 − W)X₂]
     = E[WX₁ + X₂ − WX₂]
     = E[W]E[X₁] + E[X₂] − E[W]E[X₂]
     = 0.5(−0.9) + 0.9 − 0.5(0.9)
     = 0

Similarly, the variance is 1 because

E[X₁²] = E[X₂²] = E[Y²] = 0.9² + (1 − 0.9²) = 1

and thus

V[Y] = E[Y²] − E[Y]² = 1 − 0 = 1

As seen in Figure 3.14, the mixture built from these two components is bimodal. This is because each component has a small variance and the two means are distinct.
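Equations 3.24 and 3.25 give the mixture's mean and variance from component moments. A sketch for the bimodal example (function name ours); the printed values should be 0 and 1 up to floating-point rounding:

```python
def mixture_mean_var(p, mu1, var1, mu2, var2):
    # Non-central moments of a two-component mixture (Equations 3.24 and 3.25).
    m1 = p * mu1 + (1 - p) * mu2
    m2 = p * (var1 + mu1**2) + (1 - p) * (var2 + mu2**2)
    return m1, m2 - m1**2  # mean, variance

# Bimodal example: X1 ~ N(-0.9, 1 - 0.9^2), X2 ~ N(0.9, 1 - 0.9^2), p = 0.5.
mean, var = mixture_mean_var(0.5, -0.9, 1 - 0.81, 0.9, 1 - 0.81)
print(round(mean, 10), round(var, 10))
```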
Figure 3.14 The panels show the PDFs (left) and CDFs (right) of a bimodal mixture, where each component has probability 50% of occurring, and of a standard normal (N(0, 1)).
As another example, consider a mixture of X₁ ~ N(0, 0.725²) and X₂ ~ N(0, 3.16²), weighted 0.95 and 0.05 (respectively). Note that X₁ is close to a standard normal, whereas X₂ has a much greater variance. As such, draws from X₂ appear as outliers in the mixture distribution.

The result is a mixture distribution that is very heavy-tailed with a kurtosis of 15.7. Figure 3.15 compares the PDF and the CDF of a contaminated normal with that of a standard normal. Note that the contaminated normal is similar in shape to a generalized Student's t, as it is both more peaked and heavier-tailed than the normal. The heavy tails can be seen in the CDF by the crossing of the two curves for values below −2 and above 2, whereas the normal CDF is closer to the boundary (i.e., 0 or 1) than the CDF of the contaminated normal.

Mixing components with different means and variances produces a distribution that is both skewed and heavy-tailed. For example, consider a mixture with components X₁ ~ N(0.16, 0.61) and X₂ ~ N(−1.5, 2), where the probability of drawing from the distribution of X₁ is p = 90%. The mean and variance of this mixture are 0 and 1, respectively. The skewness and kurtosis are −0.96 and 5.50, respectively.

As shown in Figure 3.16, this distribution has a strong left skew and is mildly heavy-tailed. Specifically, the mixture is clearly left-skewed with much more probability between −4 and −2 than between 2 and 4.
Figure 3.15 The panels show the PDFs (left) and CDFs (right) of the mixed (contaminated) normal and a standard normal.
QUESTIONS

3.2 Under what conditions can a binomial and a Poisson be approximated by a normal random variable?

3.3 What type of events are Poisson random variables used to describe?

3.4 If X is a standard uniform, what is Pr(0.2 < X < 0.4)? What about Pr(l < X < u), where 0 < l < u < 1?

3.6 How many independent standard normal random variables are required to construct a χ²_ν random variable?

3.7 If a Beta-distributed random variable is used to describe some data, what requirement must the data satisfy?

3.8 How are the mean and variance of a mixture of normal random variables related to the mean and variance of the components of the mixture?
Practice Questions

3.9 Either using a Z table or the Excel function NORM.S.DIST, compute

a. Pr(−1.5 < Z < 0), where Z ~ N(0, 1)
b. Pr(Z < −1.5), where Z ~ N(0, 1)
c. Pr(−1.5 < X < 0), where X ~ N(1, 2)
d. Pr(X > 2), where X ~ N(1, 2)
e. Pr(W > 12), where W ~ N(3, 9)

3.10 Either using a Z table or the Excel function NORM.S.INV, compute

a. z so that Pr(Z < z) = .95 when Z ~ N(0, 1)
b. z so that Pr(Z > z) = .95 when Z ~ N(0, 1)
c. z so that Pr(−z < Z < z) = .75 when Z ~ N(0, 1)
d. a and b so that Pr(a < X < b) = .75 and Pr(X < a) = 0.125 when X ~ N(2, 4)

3.11 If the return R on a portfolio is normally distributed with a daily mean of 8%/252 and a daily variance of (20%)²/252, find the values where

a. Pr(R < r) = .001
b. Pr(R < r) = .01
c. Pr(R < r) = .05

3.12 The monthly return on a hedge fund portfolio with USD 1 billion in assets is N(.02, .0003). What is the distribution of the gain in a month?

a. The fund has access to a USD 10 million line of credit that does not count as part of its portfolio. What is the chance that the firm's loss in a month exceeds this line of credit?
b. What would the line of credit need to be to ensure that the firm's loss was less than the line of credit in 99.9% of months (or equivalently, larger than the LOC in 0.1% of months)?

3.13 If the kurtosis of some returns on a small-cap stock portfolio was 6, what would the degrees of freedom parameter be if they were generated by a generalized Student's t_ν? What if the kurtosis was 9?

3.14 An analyst is using the following exponential function to model corporate default rates:

F_Y(y) = … ,

where y is the number of years.

a. What is the cumulative probability of default within the first five years?
b. What is the cumulative probability of default within the first ten years given that the company has survived for five years?
ANSWERS

Solved Problems

3.9 a. 43.3%. In Excel, the command to compute this value is NORM.S.DIST(0,TRUE) − NORM.S.DIST(−1.5,TRUE).
b. 6.7%. In Excel, the command to compute this value is NORM.S.DIST(−1.5,TRUE).
c. 20.1%. In Excel, the command to compute this value is NORM.S.DIST((0 − 1)/SQRT(2),TRUE) − NORM.S.DIST((−1.5 − 1)/SQRT(2),TRUE).
d. 24.0%. In Excel, the command to compute this value is 1 − NORM.S.DIST((2 − 1)/SQRT(2),TRUE).
e. 0.13%. In Excel, the command to compute this value is 1 − NORM.S.DIST((12 − 3)/3,TRUE).

3.10 a. 1.645. In Excel, the command to compute this value is NORM.S.INV(0.95).
d. The values follow from the answer to the previous problem by re-centering on the mean and scaling by the standard deviation, so that a = 2 × (−1.15) + 2 and b = 2 × 1.15 + 2. Note that the formula is a = σ × q + μ, where q is the quantile value.

3.11 a. The mean is 0.031% per day and the variance is 1.58 × 10⁻⁴ per day (so that the standard deviation is 1.26% per day). To find these values, we transform the variable to be standard normal, so that

Pr(R < r) = Pr(Z < (r − μ)/σ) = .001

The value for the standard normal is −3.09 (NORM.S.INV(0.001) in Excel), so that r = −3.09 × 1.26% + 0.031% ≈ −3.86%. There is a one in 1,000 chance that the return would be less than −3.86%, if returns were normally distributed.

3.12 The monthly return is N(.02, .0003), so the mean is 2% and the standard deviation is √.0003 ≈ 1.73%. In USD, the monthly change in portfolio value has a mean of 2% × USD 1 billion = USD 20 million and a standard deviation of 1.73% × USD 1 billion = USD 17.3 million.

a. The probability that the portfolio loses more than USD 10 million is then (working in millions)

Pr(V < −10) = Pr((V − 20)/17.3 < (−10 − 20)/17.3) = Pr(Z < −1.73)

Using the normal table, Pr(Z < −1.73) = 4.18%.

b. Here we work in the other direction. First, we find the quantile where Pr(Z < z) = 0.1%, which gives z = −3.09. This is then scaled to the distribution of the change in the value of the portfolio by multiplying by the standard deviation and adding the mean, 17.3 × (−3.09) + 20 = −33.46. The fund would need a line of credit of USD 33.46 million to have a 99.9% chance of having a change above this level.

3.13 The kurtosis of a generalized Student's t_ν is κ = 3(ν − 2)/(ν − 4), so that κ(ν − 4) = 3(ν − 2), so that κν − 4κ = 3ν − 6 and κν − 3ν = 4κ − 6. Finally, solving for ν,

ν = (4κ − 6)/(κ − 3)

Plugging in κ = 6, ν = (24 − 6)/(6 − 3) = 6, and plugging in κ = 9, ν = (36 − 6)/(9 − 3) = 5. The kurtosis falls rapidly as ν grows. For example, if ν = 12 then κ = 3.75, which is only slightly higher than the kurtosis of a normal (3).

3.14 a. F_Y(5) = 1 − exp(…).
b. We need to divide the marginal probability of default between years five and ten by the survival probability through year five (1 − the result from part a):

(F_Y(10) − F_Y(5)) / (1 − F_Y(5))
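The line-of-credit answers in 3.12 can be cross-checked numerically (an illustrative sketch using our own helper; the text's 4.18% uses the rounded z = −1.73):

```python
import math

def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Monthly P&L in USD millions is N(20, 17.3^2).
print(round(norm_cdf(-10, 20, 17.3), 3))  # about 0.041: Pr(loss exceeds 10M)

# Part b: the 0.1% quantile is mu + sigma * z with z = -3.09.
print(round(20 + 17.3 * (-3.09), 2))      # -33.46
```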
• Explain how a probability matrix can be used to express a probability mass function (PMF).
• Compute the marginal and conditional distributions of a discrete bivariate random variable.
• Explain how the expectation of a function is computed for a bivariate discrete random variable.
• Define covariance and explain what it measures.
• Explain the relationship between the covariance and correlation of two random variables and how these are related to the independence of the two variables.
• Explain the effects of applying linear transformations on the covariance and correlation between two random variables.
• Compute the variance of a weighted sum of two random variables.
• Compute the conditional expectation of a component of a bivariate random variable.
• Describe the features of an iid sequence of random variables.
• Explain how the iid property is helpful in computing the mean and variance of a sum of iid random variables.
Multivariate random variables extend the concept of a single random variable to include measures of dependence between two or more random variables.

Note that all results from Chapter 2 are directly applicable here, because each component¹ of a multivariate random variable is a univariate random variable. This chapter focuses on the extensions required to understand multivariate random variables, including how expectations change, new moments, and additional characterizations of uncertainty. While this chapter discusses continuous and discrete multivariate random variables in separate sections, note that virtually all results for continuous variables can be derived from those for discrete variables by replacing the sum with an integral.

This chapter focuses primarily on bivariate random variables to simplify the mathematics required to understand the key concepts. All definitions and results extend directly to random variables with three or more components.

A PMF describes the probability of the outcomes as a function of the coordinates x₁ and x₂. The probabilities are always non-negative, less than or equal to 1, and the sum across all possible values in the support of X₁ and X₂ is 1.

The leading example of a discrete bivariate random variable is the trinomial distribution,² which is the distribution of n independent trials where each trial produces one of three outcomes. In other words, this distribution generalizes the binomial.

The two components of a trinomial are X₁ and X₂, which count the number of realizations of outcomes 1 and 2. The count of the realizations for outcome 3 is given by n − X₁ − X₂, and so it is redundant given knowledge of X₁ and X₂.

For example, when measuring the credit quality of a diversified bond portfolio, the n bonds can be classified as investment grade, high yield, or unrated. In this case, X₁ is the count of the investment grade bonds and X₂ is the count of the high yield bonds.

¹ A vector of dimension n is an ordered collection of n elements, which are called components.

² Both the binomial and the trinomial are special cases of the multinomial distribution, which is the distribution of n independent trials that can take one of k possible outcomes.
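The trinomial PMF behind Figure 4.1 can be written directly from the counting argument. This is a sketch; the function name is ours and the evaluated point is arbitrary.

```python
from math import comb

def trinomial_pmf(x1, x2, n, p1, p2):
    # n trials; outcome 3 occurs n - x1 - x2 times with probability 1 - p1 - p2.
    p3 = 1 - p1 - p2
    return comb(n, x1) * comb(n - x1, x2) * p1**x1 * p2**x2 * p3 ** (n - x1 - x2)

# Figure 4.1 parameters: p1 = 20%, p2 = 50%, n = 5.
print(round(trinomial_pmf(1, 2, 5, 0.2, 0.5), 4))  # 0.135
```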
… to probabilities. In other words, it is a tabular representation of a PMF.

f_{X₁}(x₁) = Σ_{x₂ ∈ R(X₂)} f_{X₁,X₂}(x₁, x₂)   (4.4)

For example, the marginal probability that the stock return is 5% is

f_{X₁}(5%) = Σ_{x₂ ∈ {−1,0,1}} f_{X₁,X₂}(5%, x₂) = 0% + 15% + 20% = 35%

This calculation can be repeated for stock returns of −5% and 0% to construct the complete marginal PMF.

Figure 4.1 The PMF of a trinomial random variable with p₁ = 20%, p₂ = 50%, and n = 5.
                          Stock Return (X₁)
Analyst (X₂)              −5%    0%    5%
Negative   −1             20%   10%    0%
Neutral     0             10%   15%   15%
Positive    1              5%    5%   20%

Each cell contains the probability that the combination of the two outcomes is realized. For example, there is a 10% chance that the stock price declines with a neutral rating and a 20% chance that it declines with a negative rating.

The two marginal distributions in the stock return/recommendation example are

                          Stock Return (X₁)
Analyst (X₂)              −5%    0%    5%   f_{X₂}(x₂)
Negative   −1             20%   10%    0%      30%
Neutral     0             10%   15%   15%      40%
Positive    1              5%    5%   20%      30%
f_{X₁}(x₁)                35%   30%   35%
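The marginal distributions can be computed mechanically by summing the rows and columns of the probability matrix. A sketch using the example's numbers (variable names ours):

```python
# Joint PMF: rows are analyst ratings (-1, 0, 1); columns are returns (-5%, 0%, 5%).
joint = {
    -1: [0.20, 0.10, 0.00],
     0: [0.10, 0.15, 0.15],
     1: [0.05, 0.05, 0.20],
}

# Marginal of the rating: sum each row. Marginal of the return: sum each column.
f_x2 = {rating: round(sum(row), 2) for rating, row in joint.items()}
f_x1 = [round(sum(row[j] for row in joint.values()), 2) for j in range(3)]
print(f_x2)  # {-1: 0.3, 0: 0.4, 1: 0.3}
print(f_x1)  # [0.35, 0.3, 0.35]
```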
In Chapter 1, two events A and B were defined to be independent if Pr(A ∩ B) = Pr(A)Pr(B). The analogous condition for random variables is

f_{X₁,X₂}(x₁, x₂) = f_{X₁}(x₁) f_{X₂}(x₂)   (4.5)

Independence requires that the joint PMF of X₁ and X₂ be the product of the marginal PMFs.

Returning to the example of the stock return and ratings, the joint distribution (left) and the product of the marginals (right) are

Joint Distribution    Product of Marginals

… bivariate random variable to construct a conditional distribution. For example, suppose that we want to determine the distribution of stock returns conditional on a positive analyst rating. This conditional distribution is

f_{X₁|X₂}(x₁ | X₂ = 1) = f_{X₁,X₂}(x₁, 1) / f_{X₂}(1) = f_{X₁,X₂}(x₁, 1) / 0.3

Applying this expression to the row corresponding to X₂ = 1 in the bivariate probability matrix, the conditional distribution is then:
In other words, equality of the conditional and the marginal PMFs is a requirement of independence. Specifically, knowledge about the value of X₂ (e.g., X₂ = x₂) must not contain any information about X₁.

Returning to the example of the stock return and recommendation, the marginal distribution of the stock return and the conditional distribution when the analyst gives a positive rating are:

The expectation of g(x₁, x₂) is therefore

E[g(x₁, x₂)] = Σ_{x₁ ∈ {1,2}} Σ_{x₂ ∈ {3,4}} g(x₁, x₂) f(x₁, x₂)
             = 1³(0.15) + 1⁴(0.60) + 2³(0.10) + 2⁴(0.15)
             = 0.15 + 0.60 + 0.80 + 2.40
             = 3.95
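The double-sum expectation above can be verified with a tiny script. The PMF below is inferred from the worked terms (support {1, 2} × {3, 4} with g(x₁, x₂) = x₁^x₂); treat it as a hypothetical reconstruction.

```python
# Hypothetical PMF reconstructed from the worked example's terms.
pmf = {(1, 3): 0.15, (1, 4): 0.60, (2, 3): 0.10, (2, 4): 0.15}

def expectation(g, pmf):
    # E[g(X1, X2)] = sum over the support of g(x1, x2) * f(x1, x2).
    return sum(g(x1, x2) * p for (x1, x2), p in pmf.items())

print(round(expectation(lambda x1, x2: x1 ** x2, pmf), 2))  # 3.95
```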
Cov[X₁, X₂] = Cov[X₁, a + bX₁]
            = E[X₁(a + bX₁)] − E[X₁]E[a + bX₁]
            = E[aX₁ + bX₁²] − aE[X₁] − bE[X₁]²
            = aE[X₁] − aE[X₁] + b(E[X₁²] − E[X₁]²)
            = bV[X₁]

Recall from Chapter 2 that V[a + bX] = b²V[X]. This means that when b ≠ 0, the correlation is

Corr[X₁, X₂] = bV[X₁] / (√V[X₁] × √(b²V[X₁])) = sign(b)   (4.13)⁵

When X₁ and X₂ tend to increase together, then the correlation is positive. If X₂ tends to decrease when X₁ increases, then these two random variables have a negative correlation.

Cross-variable versions of the skewness (coskewness) and kurtosis (cokurtosis) are similarly defined by taking powers of the two variables that sum to 3 or 4, respectively. There are two distinct coskewness measures and three distinct cokurtosis measures. For example, the two coskewness measures are

E[(X₁ − E[X₁])²(X₂ − E[X₂])] / (σ₁²σ₂)  and  E[(X₁ − E[X₁])(X₂ − E[X₂])²] / (σ₁σ₂²)   (4.16)

Like the skewness of a single random variable, the coskewness measures are standardized.

While interpreting these measures is more challenging than the covariance, they both measure whether one random variable (raised to the power 1) takes a clear direction whenever the other return (raised to the power 2) is large in magnitude. For example, it is common for returns on individual stocks to have negative coskew, so that when one return is negative, the other …

⁵ The sign(b) function returns a value of −1 if b is negative or a value of 1 if b is positive.
The covariance plays a key role in the variance of the sum of two or more random variables:

V[X₁ + X₂] = V[X₁] + V[X₂] + 2Cov[X₁, X₂]   (4.17)

V[aX₁ + bX₂] = a²V[X₁] + b²V[X₂] + 2ab Cov[X₁, X₂]   (4.18)

This result is important in portfolio applications, where the constants are the weights in the portfolio.
The variance of a portfolio's return depends on two things:

• The distribution of the returns on the assets in the portfolio, and
• The portfolio's weights on these assets.

The portfolio weights measure the share of funds invested in an asset. These weights must sum to 1 by construction, and so in a simple portfolio with two assets, the two portfolio weights are

• w (i.e., the weight on the first asset), and
• 1 − w (i.e., the weight on the second asset).

If the returns on two assets have a covariance matrix of:

[σ₁²   σ₁₂]
[σ₁₂   σ₂²]

then the variance of the portfolio's return is

w²σ₁² + (1 − w)²σ₂² + 2w(1 − w)σ₁₂

This is an application of the formula for the covariance of weighted sums of random variables. The optimal weight …

Figure 4.2 plots the standard deviation of a portfolio with two assets that have a covariance matrix of:

[20%²             ρ × 20% × 10%]
[ρ × 20% × 10%    10%²]

In this example, the first asset has equity market-like volatility of 20%, and the second has corporate bond-like volatility of 10%. Meanwhile, the correlation is varied between −1 and 1. The portfolio's standard deviation (or variance) is smallest when the correlation is strong and negative. It is also asymmetric, because larger positive correlations lead to larger portfolio standard deviations than small negative correlations. This happens because the optimal weight is negative for the largest correlations and so the second asset has a weight larger than 1. While it is optimal to have a weight larger than 1, this extra total exposure limits the benefits to diversification. When the correlation is negative, both assets have weights between 0 and 1, and so this additional source of variance is not present.
Fiqure 4.2 The portfolio standard deviation using the optimal portfolio weight in a two-asset portfolio,
where the first asset has a volatility of 20%, the second has a volatility of 10%, and the correlation between
the two is varied between —1 and 1.
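The Figure 4.2 calculation can be sketched in a few lines. The closed-form minimum-variance weight w* = (σ2² − σ12)/(σ1² + σ2² − 2σ12) is the standard result from differentiating the portfolio variance; it is supplied here as an assumption, since the text's derivation is not reproduced in this excerpt.

```python
import math

# Sketch: minimum-variance weight and portfolio standard deviation for a
# two-asset portfolio, using the 20%/10% volatilities from the text.
def min_var_weight(s1, s2, rho):
    s12 = rho * s1 * s2
    return (s2**2 - s12) / (s1**2 + s2**2 - 2 * s12)

def port_std(w, s1, s2, rho):
    s12 = rho * s1 * s2
    return math.sqrt(w**2 * s1**2 + (1 - w)**2 * s2**2 + 2 * w * (1 - w) * s12)

s1, s2 = 0.20, 0.10  # equity-like and bond-like volatilities
for rho in (-1.0, -0.5, 0.0, 0.5, 0.9):
    w = min_var_weight(s1, s2, rho)
    print(f"rho={rho:+.1f}  w*={w:+.3f}  sigma_p={port_std(w, s1, s2, rho):.4f}")
```

At ρ = −1 the optimal weight is 1/3 and the portfolio risk is zero (a perfect hedge), while at large positive correlations the optimal weight on the first asset turns negative, matching the asymmetry discussed above.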
independent.

Correlation is a measure of linear dependence. If two variables have a strong linear relationship (i.e., they produce values that lie close to a straight line), then they have a large correlation. If two random variables have no linear relationship, then their correlation is zero.

However, variables can be non-linearly dependent. A simple example of this occurs when:

X1 ~ N(0,1) and X2 = X1²

Using the fact that the skewness of a normal is 0, it can be shown that Cov[X1, X2] = 0. However, these two random variables are clearly not independent. Because the random variable X2 is a function of X1, the realization of X1 determines the realization of X2. Similarly, a realization of X2 can be used to predict X1 up to a multiplicative factor of ±1.

Finally, if the correlation between two variables is 0, then the expectation of the product of the two is the product of the expectations. This means that if Corr[X1, X2] = 0, then E[X1X2] = E[X1]E[X2]. This follows from the definition of covariance because:

Cov[X1, X2] = E[X1X2] − E[X1]E[X2],

so that when the covariance is zero, the two terms on the right-hand side must be equal.

4.3 CONDITIONAL EXPECTATIONS

Many applications examine the expectation of a random variable conditional on the outcome of another random variable. For example, it is common in risk management to model the expected loss on a portfolio given a large negative market return.

A conditional expectation is an expectation when one random variable takes a specific value or falls into a defined range of values. It uses the same expression as any other expectation and is a weighted average where the probabilities are determined by a conditional PMF. The conditional expectation of the return is then:

E[X1|X2 = 1] = −5% × 16.6% + 0 × 16.6% + 5% × 66.6% = 2.5%

Conditional expectations can be defined over a value or any set of values that X2 might take. For example, the conditional expectation of X1 given that the stock is not upgraded (i.e., X2 ∈ {−1, 0}) depends on the conditional PMF of X1 given that X2 ∈ {−1, 0}. This expectation is computed using the conditional PMF:

f_X1|X2(x1 | X2 ∈ {−1, 0})

Conditional expectations can be extended to any moment by replacing all expectation operators with conditional expectation operators. For example, the variance of a random variable X1 is

V[X1] = E[(X1 − E[X1])²] = E[X1²] − E[X1]²

The conditional variance of X1 given X2 is then:

V[X1|X2 = x2] = E[(X1 − E[X1|X2 = x2])² | X2 = x2] = E[X1²|X2 = x2] − E[X1|X2 = x2]²

Returning to the example of the stock return and the recommendation, the standard deviation of the stock return (i.e., the square root of the variance) is

σ1 = √V[X1] = √(E[X1²] − E[X1]²) = 4.18%

The conditional standard deviation given an upgrade is

σ_X1|X2=1 = √V[X1|X2 = 1]
         = √(E[X1²|X2 = 1] − E[X1|X2 = 1]²)
         = √(((−0.05)²(0.166) + 0²(0.166) + (0.05)²(0.666)) − (−0.05(0.166) + 0(0.166) + 0.05(0.666))²)
         = 3.81%

Conditional Independence

As a rule, random variables are dependent in finance and risk management. This dependence arises from many sources,
(Table: the joint distribution of the analyst rating (X2) and the stock return (X1) for each management style (Z); stock returns take the values −5%, 0%, and 5%.)

Note that the overall probabilities are consistent with the original bivariate matrix. For example, there is still a total 20% chance of a negative rating when the stock return is −5%.

Now consider the bivariate distribution conditional on a company being in this conservative category. As before, each individual probability is rescaled by a constant factor so that the probabilities sum to one:

Analyst (X2)     x1 = −5%   x1 = 0%   x1 = 5%   fX2(x2)
Negative (−1)    8%         32%       0%        40%
Neutral (0)      8%         32%       0%        40%
Positive (1)     4%         16%       0%        20%
fX1(x1)          20%        80%       0%

This shows that this bivariate distribution, which conditions on the company having a conservative management style, is equal to the product of its marginal distributions. To see this, check that the cells of the matrix are equal to the product of their corresponding marginals (e.g., 8% = 40% × 20%). This distribution is therefore conditionally independent even though the original distribution is not.

A continuous bivariate PDF has the same form as a PMF and is written as fX1,X2(x1, x2). The probability in a rectangular region is defined as:

Pr(l1 < X1 < u1 ∩ l2 < X2 < u2) = ∫ from l1 to u1 ∫ from l2 to u2 fX1,X2(x1, x2) dx2 dx1

This expression computes the area under the PDF function in the region defined by l1, u1, l2, and u2.6 A PDF function is always non-negative and must integrate to one across the support of the two components.

The cumulative distribution function is the area of a rectangular region under the PDF where the lower bounds are −∞ and the upper bounds are the arguments in the CDF:

FX1,X2(x1, x2) = ∫ from −∞ to x1 ∫ from −∞ to x2 fX1,X2(u1, u2) du2 du1

For example, consider the bivariate independent uniform distribution, where

fX1,X2(x1, x2) = 1   (4.21)

and the support of the density is the unit square (i.e., a square whose sides have lengths of 1). The CDF of the bivariate independent uniform is

FX1,X2(x1, x2) = x1x2   (4.22)

6 The probability that a continuous random variable takes a single point value is 0, and so the probability over a closed interval is the same as the probability over the open interval with the same bounds, so that Pr(l1 < X1 < u1 ∩ l2 < X2 < u2) = Pr(l1 ≤ X1 ≤ u1 ∩ l2 ≤ X2 ≤ u2).
a large loss. This structure is used in later chapters when examining concepts such as expected shortfall.

For example, consider the distribution of the return on a hedge fund (X1) given that the monthly return on the S&P 500 (X2) is in the bottom 5% of its distribution. Historical data show that the S&P 500 return is above −6.19% in 95% of months, and thus the case of interest is X2 ∈ (−∞, −6.19%). The conditional PDF of the hedge fund's return is

f_X1|X2(x1 | X2 ∈ (−∞, −6.19%)) = (∫ from −∞ to −0.0619 fX1,X2(x1, x2) dx2) / (∫ from −∞ to −0.0619 fX2(x2) dx2)

While this expression might look strange, it is identical to the definition of the PMF when one of the variables is in a set of values. This conditional PDF is a weighted average of the PDF of X1 when X2 ≤ −6.19%, with the weights being the probabilities associated with the values of X2 in this range.

The variance of a sum of iid random variables is just n times the variance of one variable:

V[X1 + X2 + ... + Xn] = Σ_{i=1}^{n} V[Xi] + 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} Cov[Xi, Xj]
                      = Σ_{i=1}^{n} V[Xi]
                      = nσ²   (4.26)

Unlike the expectation of the sum, however, this simplification does require independence. To see this, note that the first line shows that the variance of the sum is the sum of the variances plus all pairwise covariances multiplied by two. There are n(n − 1)/2 distinct pairs, and the double sum is a simple method to account for the distinct combinations of indices. Furthermore, the independence assumption ensures that the covariances are all zero, which leads to the simplification of the sum. If the data are not independent, then the covariance between different components would appear in the variance of the sum.

The assumption that the distribution of the Xi is identical is also important because it ensures that V[Xi] = σ² for all i. Without this assumption, the variance of the sum is still the sum of the variances, although each variance may take a different value.
QUESTIONS

4.2 When are conditional distributions equal to marginal distributions?

4.3 If X1 and X2 are independent, what is the correlation between X1 and X2?

4.4 If X1 and X2 are uncorrelated (Corr[X1, X2] = 0), are they independent?

4.5 What is the effect on the covariance between X1 and X2 of rescaling X1 by w and X2 by (1 − w), where w is between 0 and 1?

4.7 If X1 and X2 are independent continuous random variables with PDFs fX1(x1) and fX2(x2), respectively, what is the joint PDF of the two random variables?

4.8 If X1 and X2 both have univariate normal distributions, is the joint distribution of X1 and X2 a bivariate normal?

4.9 In the previous question, what if X1 and X2 are independent?

4.10 Are sums of iid normal random variables normally distributed?
Practice Questions

4.11 Suppose that the annual profit of two firms, one an incumbent (Big Firm, X1) and the other a startup (Small Firm, X2), can be described with the following probability matrix: (probability matrix table)

4.12 Using the probability matrix for Big Firm and Small Firm, what are the covariance and correlation between the profits of these two firms?

4.13 If an investor owned 20% of Big Firm and 80% of Small Firm, what are the expected profits of the investor and the standard deviation of the investor's profits?

4.14 In the Big Firm-Small Firm example, what are the conditional expected profit and conditional standard deviation of the profit of Big Firm when Small Firm either has no profit or loses money (X2 ≤ 0)?

4.15 The return distributions for the S&P 500 and Nikkei for the next year are given as: (returns table for the S&P 500 and the Nikkei)
b. Fill in the missing cells to match the original given marginal distributions.
c. For the matrix in b, what is the conditional distribution of the Nikkei given a 10% return on the S&P?
ANSWERS

4.2 When the random variables are independent. If the conditional distribution of X1 given X2 equals the marginal, then X2 has no information about the probability of different values in X1.

4.3 Independent random variables always have a correlation of 0. This statement is only technically true if the variance of both is well defined so that the correlation is also well defined.

4.4 No. There are many ways that two random variables can be dependent but not correlated. For example, there may be no linear relationship between the variables, but they may both tend to be large in magnitude (of either sign) at the same time. In this case, there is tail dependence but not linear dependence (correlation). The image below plots realizations from a dependent bivariate random variable with 0 correlation. The dependence in this distribution arises because if X1 is close to 0, then X2 is also close to 0. (Scatter plot of X2 against X1; both axes run from −2 to 2.)

4.6 If the two components X1 and X2 are independent, then the conditional expectation of any function of X1 given X2 = x2 is always the same as the marginal expectation of the same function. That is, E[g(X1)|X2 = x2] = E[g(X1)] for any value x2.

4.7 The joint PDF is fX1,X2(x1, x2) = fX1(x1)fX2(x2), because when independent the joint is the product of the marginals.

4.8 Not necessarily. It is possible that the joint is not a bivariate normal if the dependence structure between X1 and X2 is different from what is possible with a normal. The distribution that generated the data points plotted below has normal marginals, but this pattern of dependence is not possible in a normal, which always appears elliptical. (Scatter plot with normal marginals but non-elliptical dependence; both axes run from −2 to 2.)

4.10 Yes, because iid normal random variables are jointly normal.
Solved Problems

4.11 a. The marginal distributions are computed by summing across rows for Big Firm and down columns for Small Firm. They are:

Big Firm              Small Firm
−USD 50M    6.77%     −USD 1M    6.70%
USD 0       43.02%    USD 0      43.19%
USD 10M     34.38%    USD 2M     34.18%
USD 100M    15.83%    USD 4M     15.93%

(The USD 100M and USD 4M entries are implied by the requirement that each marginal distribution sums to 100%.)

b. They are independent if the joint is the product of the marginals. Looking at the upper-left cell, 6.77% × 6.70% is not equal to 1.97%, and so they are not.

c. The conditional distribution is just the row corresponding to USD 100M normalized to sum to 1. The conditional distribution is different from the marginal, which is another demonstration that the profits are not independent.

4.12 First, the two means and variances can be computed from the marginal distributions in the earlier problem. Then the covariance can be computed using the alternative form, which is E[X1X2] − E[X1]E[X2]. The means can be computed using E[Xj] = Σ xj Pr(Xj = xj) for j = 1, 2. The variance can be computed using E[Xj²] − (E[Xj])², which requires computing E[Xj²] using Σ xj² Pr(Xj = xj).

For Big Firm, these values are E[X1] = USD 15.88M, E[X1²] = 1786.63, and V[X1] = 1534.36. For Small Firm, these values are E[X2] = USD 1.25M, E[X2²] = 3.98, and V[X2] = 2.41. The expected value of the cross product is then combined with these moments to produce the covariance and correlation.

4.14 We need to compute the conditional distribution given X2 ≤ 0. The relevant rows of the probability matrix are those for X2 = −USD 1M and X2 = USD 0; for example, Pr(X1 = −USD 50M, X2 = −USD 1M) = 1.97% and Pr(X1 = −USD 50M, X2 = USD 0) = 3.90%. Finally, the conditional expectation is E[X1|X2 ≤ 0] = Σ x1 Pr(X1 = x1|X2 ≤ 0) = USD 3.01M. The conditional expectation of the squared profit is E[X1²|X2 ≤ 0] = 940.31, and so the conditional variance is V[X1|X2 ≤ 0] = E[X1²|X2 ≤ 0] − E[X1|X2 ≤ 0]² = 940.31 − 3.01² = 931.25, and the conditional standard deviation is USD 30.52M.

4.15 a. If the two variables are unrelated, then the joint distribution is just given by the product of the marginals, fX1,X2(x1, x2) = fX1(x1)fX2(x2). For example, the probability of seeing −10% on the S&P and −5% on the Nikkei is 25% × 20% = 5%.

b. The sum of the rows needs to match the S&P 500 marginal distribution, and the same process works going across the columns. Continuing in this way fills in the remaining cells.

c. The conditional distribution of the Nikkei given a 10% return on the S&P 500 is obtained by dividing each entry in that row by the total mass of the row (25% in this case), so that the conditional probabilities sum to 100%.
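The Problem 4.14 steps (restrict to the conditioning event, renormalize, then take first and second moments) generalize to any joint PMF. The matrix below is a hypothetical stand-in, since the original probability matrix is not reproduced in this excerpt; the function itself mirrors the solution's mechanics.

```python
import math

# Sketch: conditional mean and standard deviation of X1 from a joint PMF,
# conditioning on an event for X2. The matrix here is hypothetical; only the
# method follows the Problem 4.14 solution.
def conditional_moments(pmf, x1_vals, x2_vals, event):
    """pmf[i][j] = Pr(X1 = x1_vals[i], X2 = x2_vals[j]); event filters X2."""
    mass = sum(pmf[i][j] for i in range(len(x1_vals))
               for j in range(len(x2_vals)) if event(x2_vals[j]))
    cond = [sum(pmf[i][j] for j in range(len(x2_vals)) if event(x2_vals[j])) / mass
            for i in range(len(x1_vals))]
    m1 = sum(p * x for p, x in zip(cond, x1_vals))       # E[X1 | event]
    m2 = sum(p * x * x for p, x in zip(cond, x1_vals))   # E[X1^2 | event]
    return m1, math.sqrt(m2 - m1 * m1)

# Hypothetical joint distribution (profits in USD millions); rows are X1, columns X2.
x1_vals = [-50, 0, 10, 100]
x2_vals = [-1, 0, 2, 4]
pmf = [[0.02, 0.04, 0.00, 0.00],
       [0.05, 0.20, 0.15, 0.03],
       [0.01, 0.15, 0.10, 0.09],
       [0.00, 0.01, 0.05, 0.10]]

cmean, csd = conditional_moments(pmf, x1_vals, x2_vals, lambda x2: x2 <= 0)
print(round(cmean, 2), round(csd, 2))
```

Swapping in the book's actual matrix would reproduce the USD 3.01M and USD 30.52M figures from the solution.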
• Estimate the mean, variance, and standard deviation using sample data.

• Explain the difference between a population moment and a sample moment.

• Distinguish between an estimator and an estimate.

• Describe the bias of an estimator and explain what the bias measures.

• Explain what is meant by the statement that the mean estimator is BLUE.

• Describe the consistency of an estimator and explain the usefulness of this concept.

• Explain how the Law of Large Numbers (LLN) and Central Limit Theorem (CLT) apply to the sample mean.

• Estimate and interpret the skewness and kurtosis of a random variable.

• Use sample data to estimate quantiles, including the median.

• Estimate the mean of two variables and apply the CLT.

• Estimate the covariance and correlation between two random variables.

• Explain how coskewness and cokurtosis are related to skewness and kurtosis.
This chapter describes how sample moments are used to estimate unknown population moments. In particular, this chapter pays special attention to the estimation of the mean. This is because when data are generated from independent identically distributed (iid) random variables, the mean estimator has several desirable properties.

• It is (on average) equal to the population mean.

• As the number of observations grows, the sample mean becomes arbitrarily close to the population mean.

• The distribution of the sample mean can be approximated using a standard normal distribution.

This final property is widely used to test hypotheses about population parameters using observed data.

Data can also be used to estimate higher-order moments such as variance, skewness, and kurtosis. The first four (standardized) moments (i.e., mean, variance, skewness, and kurtosis) are widely used in finance and risk management to describe the key features of data sets.

Quantiles provide an alternative method to describe the distribution of a data set. Quantile measures are particularly useful in applications to financial data because they are robust to extreme outliers. Finally, this chapter shows how univariate sample moments can extend to more than one data series.

5.1 THE FIRST TWO MOMENTS

Estimating the Mean

An estimator is a function of random variables that is applied to an observed data set. In contrast, an estimate is the value produced by an application of the estimator to data.

The mean estimator is a function of random variables, and so it is also a random variable. Its properties can be examined using the tools developed in the previous chapters. The expectation of the mean estimator is

E[μ̂] = μ   (5.2)

The expected value of the mean estimator is the same as the population mean. While the primary focus is on the case where the Xi are iid, this result shows that the expectation of the mean estimator is μ whenever the mean is constant (i.e., E[Xi] = μ for all i). This property is useful in applications to time series that are not always iid but have a constant mean.

The bias of an estimator is defined as:

Bias(θ̂) = E[θ̂] − θ   (5.3)

where θ is the true (or population) value of the parameter that we are estimating (e.g., μ or σ²). It measures the difference between the expected value of the estimator and the population value estimated. Applying this definition to the mean:

Bias(μ̂) = E[μ̂] − μ = μ − μ = 0

Because the mean estimator's bias is zero, it is unbiased.

The variance of the mean estimator can also be computed using the standard properties of random variables.3 Recall that the general formula for the variance of a sum is the sum of the variances plus any covariances between the random variables. When the data are iid, the covariances are all zero and each observation has the same variance σ², so that

V[μ̂] = σ²/n

Estimating the Variance and Standard Deviation
Note that the sample variance estimator and the definition of the population variance have a strong resemblance. Replacing the expectation operator E[·] with the averaging operator n⁻¹ Σ_{i=1}^{n} [·] transforms the expression for the population moment into the sample moment. This sample analog principle is used throughout statistics, econometrics, and this chapter to transform population moments into estimators.4

Unlike the mean, the sample variance is a biased estimator. It can be shown that:

Bias(σ̂²) = E[σ̂²] − σ² = ((n − 1)/n)σ² − σ² = −σ²/n 5

Because the bias is known, an unbiased estimator for the variance can be constructed as:

s² = (n/(n − 1))σ̂² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − μ̂)²   (5.8)

To see this principle in action, suppose that multiple samples are drawn from an N(0,1) distribution, each sample has a size of 20, and a sample variance σ̂² is calculated for each sample. The results are as shown in Figure 5.1, which compares the sample variance, the true variance, and the unbiased estimator. Also shown is the unbiased variance estimator s², which is simply the biased estimator multiplied by n/(n − 1).

The difference in the bias of these two estimators might suggest that s² is a better estimator of the population variance than σ̂². However, this is not necessarily the case, because σ̂² has a

4 Recall that the population variance can be equivalently defined as E[X²] − E[X]². The sample analog can be applied here, and an alternative estimator of the sample variance is σ̂² = n⁻¹ Σ_{i=1}^{n} Xi² − (n⁻¹ Σ_{i=1}^{n} Xi)². This form is numerically identical to Equation 5.6.

5 This is an example of a finite sample bias, and σ̂² is asymptotically unbiased because lim_{n→∞} Bias(σ̂²) = 0.
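The Figure 5.1 experiment is straightforward to reproduce. The sketch below repeats the draw-and-estimate cycle: with n = 20, the average of the biased estimator should sit near (n − 1)/n = 0.95, while the average of s² should sit near 1.

```python
import random

# Sketch of the Figure 5.1 experiment: many samples of size 20 from N(0,1).
# The biased estimator divides by n; the unbiased s^2 divides by n - 1.
random.seed(1)
n, reps = 20, 20_000
biased, unbiased = [], []
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]
    m = sum(x) / n
    ss = sum((xi - m) ** 2 for xi in x)
    biased.append(ss / n)
    unbiased.append(ss / (n - 1))

mb = sum(biased) / reps       # average of the biased estimator
mu = sum(unbiased) / reps     # average of the unbiased estimator
print(round(mb, 3), round(mu, 3))
```

The downward bias of σ̂² is visible in the first average, and multiplying by n/(n − 1) removes it, exactly as Equation 5.8 prescribes.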
smaller variance than s² (i.e., V[σ̂²] < V[s²]). Financial statistics typically involve large data sets, and so the choice between σ̂² and s² makes little difference in practice. The convention is to prefer σ̂² when the sample size is moderately large (i.e., n > 30).

The sample standard deviation is estimated using the square root of the sample variance (i.e., σ̂ = √σ̂² or s = √s²). The square root is a nonlinear function, and so both estimators of the standard deviation are biased. However, this bias diminishes as n becomes large and is typically small in large financial data sets.

standard deviation does not change with the sample size. Standard errors are important in hypothesis testing and when performing Monte Carlo simulations. In both of these applications, the standard error provides an indication of the accuracy of the estimate.

The n-period return is

R1 + R2 + ... + Rn = Σ_{i=1}^{n} Ri

If returns are iid with mean E[Ri] = μ and variance V[Ri] = σ², then the mean and variance of the n-period return are nμ and nσ², respectively. Note that the mean follows directly from the expectation of sums of random variables, because the expectation of the sum is the sum of the expectations. The variance relies crucially on the iid property, so that the covariance between the returns is 0. In practice, financial returns are not iid but can be uncorrelated, which is sufficient for this relationship to hold. Finally, standard deviation is the square root of the variance, and so the n-period standard deviation scales with √n.

One challenge when using asset price data is the choice of
sampling frequency. Most assets are priced at least once per day, and many assets have prices that are continuously available throughout the trading day (e.g., equities, some sovereign bonds, and futures). Other return series, such as hedge fund returns, are only available at lower frequencies (e.g., once per month). This can create challenges when describing the mean or standard deviation of financial returns. For example, sampling over one day could give an asset's average return (i.e., μDaily) as 0.1%, whereas sampling over one week (i.e., μWeekly) could indicate an average return of 0.485%.

In practice, it is preferred to report the annualized mean and standard deviation, regardless of the sampling frequency. Annualized sample means are scaled by a constant that measures the number of sample periods in a year. For example, a monthly mean is multiplied by 12 to produce an annualized mean. Similarly, a weekly mean is multiplied by 52 to produce an annualized version, and a daily mean is multiplied by the number of trading days per year (e.g., 252 or 260). Note that the scaling

Presenting the Mean and Standard Deviation

Means and standard deviations are the most widely reported statistics. Their popularity is due to several factors.

• The mean and standard deviation are often sufficient to describe the data (e.g., if the data are normally distributed or are generated by some other one- or two-parameter distribution).

• These two statistics provide guidance about the likely range of values that can be observed.

• The mean and standard deviation are in the same units as the data, and so can be easily compared. For example, if the data are percentage returns of a financial asset, the mean and standard deviation are also measured in percentage returns. This is not true for other statistics, such as the variance.
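The annualization rules described above reduce to two one-line functions: means scale linearly with the number of periods per year, and standard deviations scale with the square root (which relies on the iid/uncorrelated assumption discussed earlier).

```python
import math

# Sketch: annualizing a mean and a standard deviation, per the scaling rules
# in the text (12 for monthly, 52 for weekly, ~252 trading days for daily).
def annualize_mean(mean, periods_per_year):
    return mean * periods_per_year

def annualize_std(std, periods_per_year):
    return std * math.sqrt(periods_per_year)

# A 0.1% daily mean and a 1% daily standard deviation, with 252 trading days:
print(annualize_mean(0.001, 252))
print(annualize_std(0.01, 252))
```

Note the asymmetry: a daily mean grows by a factor of 252 when annualized, but a daily standard deviation grows only by √252 ≈ 15.9, which is why comparing statistics across sampling frequencies without annualizing is misleading.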
Estimating the Mean and Variance Using Data

As an example, consider a set of four data series extracted from the Federal Reserve Economic Data (FRED) database of the Federal Reserve Bank of St. Louis:6

1. The ICE BofAML US Corp Master Total Return Index Value, which measures the return on a diversified portfolio of corporate bonds;

2. The Russell 2000 Index, a small-cap equity index that measures the average performance of 2,000 firms;

3. The Gold Fixing Price in the London bullion market, the benchmark price for gold; and

4. The spot price of West Texas Intermediate Crude Oil, the leading price series for US crude oil.

All data in this set are from the period between January 1, 1987 and December 31, 2018. These prices are sampled daily, weekly (on Thursdays), and at the end of the month.7 Returns are computed using log returns, defined as the difference between consecutive log prices (i.e., ln Pt − ln Pt−1).

Table 5.1 reports the means and standard deviations for these four series. Note how the values change across the three sampling intervals. The bottom rows of each panel contain the annualized versions of each statistic, which largely remove the effect of the sampling frequency. The scale factors applied are 252 (for daily returns), 52 (for weekly returns), and 12 (for monthly returns). The annual versions of the means are quite close to one another, as are the annualized standard deviations.

The third moment raises deviations from the mean to the third power. This has the effect of amplifying larger shocks relative to smaller ones (compared to the variance, which only squares deviations). Because the third power also preserves the sign of the deviations, an asymmetric random variable will not have a third moment equal to 0. When large deviations tend to come from the left tail, the distribution is negatively skewed. If they are more likely to come from the right tail, then the distribution is positively skewed.

Kurtosis is similarly defined using the fourth moment standardized by the squared variance:

Kurtosis(X) = E[(X − E[X])⁴] / E[(X − E[X])²]² = μ4/σ⁴ = κ   (5.10)

The kurtosis uses a higher power than the variance, and so it is more sensitive to large observations (i.e., outliers). It discards sign information (because its power is even), and so it measures the relative likelihood of seeing an observation from the tail of the distribution.

The kurtosis is usually compared to the kurtosis of a normal distribution, which is 3. Random variables with a kurtosis greater than 3 are often called heavy-tailed/fat-tailed, because the frequency of large deviations is higher than that of a random variable with a normal distribution.8

The estimators of the skewness and the kurtosis both use the principle of the sample analog to the expectation operator. The estimators for the skewness and kurtosis are

μ̂3/σ̂³ and μ̂4/σ̂⁴   (5.11)

The third and fourth central moments are estimated by:

μ̂3 = n⁻¹ Σ_{i=1}^{n} (Xi − μ̂)³ and μ̂4 = n⁻¹ Σ_{i=1}^{n} (Xi − μ̂)⁴   (5.12)

Table 5.1 also reports the estimated skewness and kurtosis for the four assets and three sampling horizons. These moments are scale-free, and so there is no need to annualize them.
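The Equation 5.11/5.12 estimators translate directly into code via the sample-analog principle. On simulated normal data, the estimates should be near 0 (skewness) and 3 (kurtosis).

```python
import random

# Sketch of the Equation 5.11/5.12 estimators: sample skewness mu3_hat/sigma_hat^3
# and sample kurtosis mu4_hat/sigma_hat^4, built from sample central moments.
def skew_kurt(x):
    n = len(x)
    m = sum(x) / n
    m2 = sum((xi - m) ** 2 for xi in x) / n   # sigma_hat^2
    m3 = sum((xi - m) ** 3 for xi in x) / n   # mu3_hat
    m4 = sum((xi - m) ** 4 for xi in x) / n   # mu4_hat
    return m3 / m2 ** 1.5, m4 / m2 ** 2

random.seed(0)
normal = [random.gauss(0, 1) for _ in range(100_000)]
s, k = skew_kurt(normal)
print(round(s, 2), round(k, 2))  # near 0 and 3 for normal data
```

Applying the same function to heavy-tailed return data would produce a kurtosis well above 3, which is the excess-kurtosis pattern Table 5.1 documents for all four assets.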
Time-varying volatility (i.e., the volatility of return changes over time) and the differences in the volatility dynamics across assets produce a different scaling of skewness.9 Gold has a positive skewness, especially at longer horizons, indicating that positive surprises are more likely than negative surprises. The skewness of crude oil returns is negative when returns are sampled at the daily horizon but becomes positive when sampling over longer horizons. Bond and stock market returns are usually negatively skewed.

All four asset return series have excess kurtosis, although bonds have fewer extreme observations than the other three asset classes. Kurtosis declines at longer horizons, and it is a stylized fact that returns sampled at low frequencies (i.e., monthly, or even quarterly) have a distribution that is closer to that of a normal than returns sampled at higher frequencies. This is because large short-term changes are diluted over longer horizons by periods of relative calmness.

Figure 5.3 contains density plots of the four financial assets in the previous example. Each plot compares the density of the asset's returns to that of a normal distribution with the same mean and variance. These four plots highlight some common features that apply across asset classes. The most striking feature of all four plots is the heavy tails. Note that the extreme tails (i.e., the values outside ±2σ) of the empirical densities are above those of the normal. The more obvious hint that these distributions are heavy-tailed is the sharp peak in the center of each density. Because both the empirical density and the normal density match the sample variance, the contribution of the heavy tails to the variance must be offset by an increase in mass near the center. At the same time, regions near ±1σ lose mass

9 Time-varying volatility is a pervasive property of financial asset returns. This topic is covered in Chapter 13.

Figure 5.3 Density plots of daily returns data from small-cap stocks, corporate bonds, gold, and crude oil. The solid line in each plot is a data-based density estimate of each asset's returns. The dashed line in each panel is the density of a normal random variable with the same mean and variance as the corresponding asset returns. The vertical lines indicate ±2σ.
to the center and the tail to preserve the variance. The negative skewness is also evident in stock, bond, and crude oil returns. All three densities appear to be leaning to the right, which is evidence of a long left tail.

5.3 THE BLUE MEAN ESTIMATOR

The mean estimator is the Best Linear Unbiased Estimator (BLUE) of the population mean when the data are iid.10 In this context, best indicates that the mean estimator has the lowest variance of any linear unbiased estimator (LUE).11

Linear estimators of the mean can be expressed as:

μ̂ = Σ_{i=1}^{n} wiXi

where the wi are weights that do not depend on Xi. In the sample mean estimator, for example, wi = 1/n. Showing that a linear estimator with weights summing to one is unbiased follows from the results earlier in this chapter. Showing that the mean is BLUE is also straightforward, but it is quite tedious and so is left as an exercise.

10 BLUE only requires that the mean and variance are the same for all observations. These conditions are satisfied when random variables are iid.

11 To show that this claim is true, consider another linear estimator that uses a set of weights of the form w̃i = wi + di. Unbiasedness requires that Σw̃i = 1, and so Σdi = 0. The remaining steps only require computing the variance, which is equal to the variance of the sample mean estimator plus a term that is always positive.
5.4 LARGE SAMPLE BEHAVIOR OF THE MEAN

The mean estimator is always unbiased, and the variance of the mean estimator takes a simple form when the data are iid. Two moments, however, are usually not enough to completely describe the distribution of the mean estimator. If data are iid normally distributed, then the mean estimator is also normally distributed.12 However, it is generally not possible to establish the exact distribution of the mean based on a finite sample of n observations. Instead, modern econometrics exploits the behavior of the mean in large samples (formally, when n → ∞) to approximate the distribution of the mean in finite samples.

The Law of Large Numbers (LLN) establishes the large-sample behavior of the mean estimator and provides conditions under which an average converges to its expectation. The simplest LLN for iid random variables is the Kolmogorov Strong Law of Large Numbers. This LLN states that if {Xi} is a sequence of iid random variables with μ = E[Xi], then the sample mean μ̂n converges (almost surely) to μ as n grows.

The sample sizes used in Figure 5.4 are n = 10, 40, 160, and 640. This ratio between the sample sizes ensures that the standard deviation halves each time the sample size increases. The dashed line in the center of the plots shows the population mean. These finite-sample distributions of the mean estimators are calculated using data simulated from the assumed distribution. The sample mean is computed from each simulated sample, and 10,000 independent samples are constructed. The plots in the center row show the empirical density of the estimated sample means for each sample size.

Two features are evident from the LLN plots. First, the density of the sample mean is not a normal. In sample means computed from the simulated log-normal data, the distribution of the sample mean is right-skewed, especially when n is small. Second, the distribution becomes narrower and more concentrated around the population mean as the sample size increases. The collapse of the distribution around the population mean is evidence that the LLN applies to these estimators.

The LLN also applies to other sample moments. For example, when the data are iid and E[X²] is finite, then the LLN ensures that σ̂² → σ², and so the sample variance estimator is also consistent.14
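The LLN's collapse of the sample mean onto the population mean is easy to see in a quick simulation. The exponential population below (mean 1) is an illustrative choice standing in for the skewed distributions in Figure 5.4.

```python
import random

# Sketch of the LLN: the sample mean of iid draws approaches the population
# mean as n grows. The population here is exponential with mean 1.
random.seed(11)

def sample_mean(n):
    return sum(random.expovariate(1.0) for _ in range(n)) / n

errors = {n: abs(sample_mean(n) - 1.0) for n in (10, 1_000, 100_000)}
print(errors)
```

For any single run the error need not shrink monotonically, but the standard deviation of the sample mean falls like 1/√n, so the large-sample error is reliably tiny.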
12 This result only depends on the property that the sum of normal random variables is also normal.

The simplest Central Limit Theorem (CLT) for iid random variables states that:

√n(μ̂ − μ) →d N(0, σ²)

Figure 5.4 These figures demonstrate the consistency of the sample mean estimator using the Law of Large Numbers (middle panels) and with respect to the normal distribution when scaled (bottom panels). The figures are generated using simulated data. The left panels use simulated data from a log-normal distribution with parameters μ = 0, σ² = 1. The right panels use simulated data from a discrete distribution, the Poisson with shape parameter equal to 3. The top panels show the PDF of the log-normal and the PMF of the Poisson.
This CLT requires one additional assumption beyond the LLN: that the variance σ² is finite (the LLN only requires that the mean is finite).¹³ This CLT can be alternatively expressed as:

μ̂ ~ N(μ, σ²/n)

This expression for the mean shows that (in large samples) the distribution of the sample mean estimator (μ̂) is centered on the population mean (μ), and that the variance of the sample average declines as n grows.

The bottom panels of Figure 5.4 show the empirical distribution for 10,000 simulated values of the sample mean. The CLT says that the sample mean is normally distributed in large samples, so that μ̂ ~ N(μ, σ²/n). This value is standardized by subtracting the mean and dividing by the standard error of the mean, so that

(μ̂ − μ)/(σ/√n)

is a standard normal random variable. The simulated values show whether the CLT provides an accurate approximation to the (unknown) true distribution of the sample mean for different sample sizes n.

Determining when n is large enough is the fundamental question when applying a CLT. Returning to Figure 5.4, note that the shaded areas in the bottom panels are the PDF of a standard normal. For data generated from the log-normal (i.e., the left panels), the CLT is not an accurate approximation when n = 10 because of the evident right-skewness (due to the skewness in the random data sampled from the log-normal). When n = 40, however, the skew has substantially disappeared, and the approximation by a standard normal distribution appears to be reasonable; when n = 160, the distribution is extremely close to the standard normal. In the application of the CLT to the mean of data generated from Poisson random variables (i.e., the right panels), the CLT appears to be accurate for all sample sizes.

When the sample size is even, the median is estimated using the average of the two central points of the sorted list:

median(x) = (1/2)(x_{n/2} + x_{n/2+1})   (5.15)

Two other commonly reported quantiles are the 25% and 75% quantiles. These are estimated using the same method as the median. More generally, the α-quantile is estimated from the sorted data using the value in location αn. When αn is not an integer value, the usual practice is to take the average of the points immediately above and below αn.¹⁵

These two quantile estimates (q₂₅ and q₇₅) are frequently used together to estimate the inter-quartile range:

IQR = q₇₅ − q₂₅

The IQR is a measure of dispersion and so is an alternative to the standard deviation. Other quantile measures can be constructed to measure the extent to which the series is asymmetric or to estimate the heaviness of the tails.

Two features make quantiles attractive. First, quantiles have the same units as the underlying data, and so they are easy to interpret. For example, the 25% quantile for a sample of asset returns is itself measured in returns.

¹³ The symbol →d denotes convergence in distribution.

¹⁵ There is a wide range of methods to interpolate quantiles when αn is not an integer, and different methods are used across the range of common statistical software.
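The location rule described above can be sketched as follows. This is a minimal illustration of the text's convention (average the point at location αn and its neighbor), not the interpolation used by any particular statistical package, and the data values are made up.

```python
def quantile(data, alpha):
    """Estimate the alpha-quantile via the simple location rule:
    find position alpha*n in the sorted sample (1-based) and average
    the neighboring points, matching the even-n median rule (5.15).
    Assumes 1 <= alpha*n <= n-1."""
    x = sorted(data)
    i = int(alpha * len(x))
    return 0.5 * (x[i - 1] + x[i])

data = [3, 1, 4, 1, 5, 9, 2, 6]          # n = 8
med = quantile(data, 0.5)                 # (3 + 4) / 2 = 3.5
iqr = quantile(data, 0.75) - quantile(data, 0.25)
```

For this sample, q₂₅ = 1.5 and q₇₅ = 5.5, so the IQR is 4.0.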
Table 5.2 contains the median, quantile, and IQR estimates for the four-asset example discussed in this chapter. The median return is above the mean return for the negatively skewed assets.

Table 5.2 Quantile Estimates for the Four Asset Classes Using Data Sampled Daily, Weekly, and Monthly. The Median Is the 50% Quantile. The Inter-Quartile Range (IQR) Is the Difference between the 75% Quantile and the 25% Quantile

Daily Data
          Bonds      Stocks     Gold       Crude
Median    0.034%     0.103%     0.014%     0.058%
q(.25)    −0.140%    −0.534%    −0.445%    −1.21%

The extension of the mean is straightforward. The sample mean of two series is just the collection of the two univariate sample means. However, extending the sample variance estimator to multivariate datasets requires estimating the variance of each series and the covariance between each pair. The higher-order moments, such as skewness and kurtosis, can be similarly defined in a multivariate setting.

Covariance and Correlation

Recall that covariance measures the linear dependence between two random variables and is defined as:

σ_XY = Cov[X, Y] = E[(X − E[X])(Y − E[Y])]

The sample covariance estimator uses the sample analog to the expectation operator:

σ̂_XY = n⁻¹ Σᵢ₌₁ⁿ (Xᵢ − μ̂_X)(Yᵢ − μ̂_Y),

where μ̂_X = n⁻¹ ΣXᵢ and μ̂_Y = n⁻¹ ΣYᵢ are the univariate sample means.
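The sample-analog estimators above can be written directly. A minimal sketch (function names are mine, the data are made up):

```python
def sample_cov(x, y):
    """Sample covariance: replace E[.] with an average over the sample."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def sample_corr(x, y):
    """Sample correlation: covariance scaled by the two standard deviations."""
    return sample_cov(x, y) / (sample_cov(x, x) * sample_cov(y, y)) ** 0.5

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # y = 2x, so the correlation is exactly 1
cov_xy = sample_cov(x, y)
rho_xy = sample_corr(x, y)
```

Note that `sample_cov(x, x)` is simply the sample variance, so the correlation is the covariance rescaled to the unit-free interval [−1, 1].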
such as liquidity, whereas longer-term returns are more sensitive to macroeconomic changes.

These are all relatively small correlations, and cross-asset-class correlations are generally smaller than correlations within an asset class. For example, the estimated correlations between the asset classes in Table 5.3 are all small in magnitude.

Table 5.3 Sample Correlations between Bond, Stock, Gold, and Crude Oil Returns. The Correlations Are Measured Using Data Sampled Daily, Weekly, and Monthly

Daily Data
          Bonds     Stocks    Gold      Crude
Gold      3.0%      −2.1%     —         13.0%

Weekly Data
          Bonds     Stocks    Gold      Crude
Bonds     —         −2.4%     2.3%      −7.7%
Stocks    −2.4%     —         4.0%      13.2%

Monthly Data
          Bonds     Stocks    Gold      Crude
Bonds     —         15.1%     19.6%     −0.2%
Stocks    15.1%     —         −0.2%     11.6%
Gold      19.6%     −0.2%     —         18.9%
Crude     −0.2%     11.6%     18.9%     —

When data are iid, the CLT applies to each estimator. However, it is more useful to consider the joint behavior of the two mean estimators by treating them as a bivariate statistic. The CLT can be applied by stacking the two mean estimators into a vector:

μ̂ = [μ̂_X, μ̂_Y]′   (5.18)

This 2-by-1 vector is asymptotically normally distributed if the multivariate random variable Z = [X, Y] is iid.¹⁶

The CLT for the vector depends on the 2-by-2 covariance matrix of the data. A covariance matrix collects the two variances (one for X and one for Y) and the covariance between X and Y into a matrix:

Σ = [σ²_X   σ_XY]
    [σ_XY   σ²_Y]

where the off-diagonal element is the covariance between the pair.

The CLT for a pair of mean estimators is virtually identical to that of a single mean estimator, where the scalar variance is replaced by the covariance matrix. The CLT for bivariate iid data series states that:

√n(μ̂ − μ) →d N(0, Σ)   (5.20)
Table 5.4 presents the annualized estimates of the means, variances, covariance, and correlation for the monthly returns on the Russell 2000 and the BoAML Total Return Index (symbolized by S and B, respectively). Note that the returns on the Russell 2000 are much more volatile than the returns on the corporate bond index. The correlation between the returns on these two indices (i.e., 0.151) is positive but relatively small.

Table 5.4 Annualized Estimates of the Means, Variances, Covariance, and Correlation for Monthly Returns on the Russell 2000 (S) and the BoAML Total Return Index (B)

Mean (S)   Variance (S)   Mean (B)   Variance (B)   Covariance   Correlation
10.4       335.4          6.71       25.6           14.0         0.151

The CLT depends on the population values for the two variances (σ²_S and σ²_B) and the covariance between the two (σ_SB = ρ σ_S σ_B). To operationalize the CLT, the population values are replaced with estimates computed using the sample variance of each series and the sample covariance between the two series, where the covariance matrix is equal to the previous covariance matrix with the population values replaced by these sample estimates.

The coskewness standardizes the cross-third moments by the variance of one of the variables and the standard deviation of the other, and so is scale- and unit-free. These measures both capture the likelihood of the data taking a large directional value whenever the other variable is large in magnitude. When there is no sensitivity of the direction of one variable to the magnitude of the other, the two coskewnesses are 0. For example, the coskewness in a bivariate normal is always 0, even when the correlation is different from 0. Note that the univariate skewness estimators are s(X,X,X) and s(Y,Y,Y).

Coskewnesses are estimated by applying the sample analog to the expectation operator to the definition of coskewness. For example:

ŝ(X,X,Y) = [n⁻¹ Σᵢ₌₁ⁿ (xᵢ − μ̂_X)²(yᵢ − μ̂_Y)] / (σ̂²_X σ̂_Y)

The small coskewness estimates in Table 5.5 indicate, for example, that the direction of bond returns is not strongly linked to the magnitude of stock returns.
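The estimator ŝ(X,X,Y) above can be sketched directly, and its scale-invariance checked numerically. This is an illustration, not the book's code; the function name and data are mine.

```python
import math

def coskew_xxy(x, y):
    """Sample coskewness s(X,X,Y): the cross-third moment
    E[(X - mu_X)^2 (Y - mu_Y)] standardized by var(X) * std(Y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    m21 = sum((a - mx) ** 2 * (b - my) for a, b in zip(x, y)) / n
    return m21 / (vx * math.sqrt(vy))

x = [0.1, -0.2, 0.4, 0.0, -0.3]
y = [0.2, -0.1, 0.3, 0.1, -0.5]
s = coskew_xxy(x, y)
s_scaled = coskew_xxy([10 * a for a in x], y)  # rescaling X leaves s unchanged
```

Multiplying X by a positive constant scales both the cross moment and var(X) by the square of that constant, so the ratio is unchanged, which is what "scale- and unit-free" means here.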
The most important insight from the bivariate CLT is that correlation in the data produces correlation between the sample means. Moreover, the correlation between the means is identical to the correlation between the data series.

Coskewness and Cokurtosis

Like variance, skewness and kurtosis can be extended to pairs of random variables. When computing cross pth moments, there are p − 1 different measures. Applying this principle to the first four moments, there are:

• No cross means,
• One cross variance (the covariance),
• Two measures of cross-skewness (coskewness), and
• Three cross-kurtoses (cokurtosis).

Table 5.5 Skewness and Coskewness in the Group of Four Asset Classes Estimated Using Monthly Data. The Far Left and Far Right Columns Contain Skewness Measures of the Variables Labeled X and Y, Respectively. The Middle Two Columns Report the Estimated Coskewness

X        Y        s(X,X,X)   s(X,X,Y)   s(X,Y,Y)   s(Y,Y,Y)
Bonds    Crude    −0.179     −0.008     0.040      0.104
Bonds    Gold     −0.179     −0.018     −0.032     −0.101
Bonds    Stocks   −0.179     −0.082     0.012      −0.365
Crude    Gold     0.104      −0.010     −0.098     −0.101
Crude    Stocks   0.104      −0.064     −0.127     −0.365
Gold     Stocks   −0.101     −0.005     0.014      −0.365
Figure 5.5 Plots of the cokurtoses for the symmetric case (κ(X,X,Y,Y), left) and the asymmetric case (κ(X,Y,Y,Y), right). Note that κ(X,X,X,Y) has the same shape as κ(X,Y,Y,Y) and so is omitted.
The cokurtosis uses combinations of powers that add to 4, and so there are three configurations: (1,3), (2,2), and (3,1). The (2,2) measure is the easiest to understand and captures the sensitivity of the magnitude of one series to the magnitude of the other series. If both series tend to be large in magnitude at the same time, then the (2,2) cokurtosis is large. The other two measures, (3,1) and (1,3), capture the agreement of the return signs when the cubed return is large.

The three measures of cokurtosis are defined as:

κ(X,X,Y,Y) = E[(X − E[X])²(Y − E[Y])²] / (σ²_X σ²_Y)   (5.22)

κ(X,X,X,Y) = E[(X − E[X])³(Y − E[Y])] / (σ³_X σ_Y)   (5.23)

κ(X,Y,Y,Y) = E[(X − E[X])(Y − E[Y])³] / (σ_X σ³_Y)   (5.24)

When examining kurtosis, the value is usually compared to the kurtosis of a normal distribution (which is 3). Comparing a cokurtosis to that of a normal is more difficult, because the cokurtosis of a bivariate normal depends on the correlation.

Figure 5.5 contains plots of the cokurtoses for normal data as a function of the correlation between the variables (i.e., ρ), which is always between −1 and 1. The symmetric cokurtosis κ(X,X,Y,Y) ranges between 1 and 3: it is 1 when the returns are uncorrelated (and therefore independent, because the two random variables are normal) and rises symmetrically as the correlation moves away from 0. The asymmetric cokurtosis ranges from −3 to 3 and is linear in ρ.

Table 5.6 contains estimates of the kurtosis and cokurtosis for the four assets previously described. The sample kurtosis is computed using the sample analog to the expectation operator. The first and last columns contain the kurtosis of the two assets.

Table 5.6 Kurtosis and Cokurtosis in the Group of Four Asset Classes Estimated Using Monthly Data. The Far Left and Far Right Columns Contain Kurtosis Measures of the Variables Labeled X and Y, Respectively. The Middle Three Columns Report the Three Cokurtosis Estimates
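The ranges quoted above for the bivariate normal correspond to the closed-form moments κ(X,X,Y,Y) = 1 + 2ρ² and κ(X,X,X,Y) = 3ρ for standardized normals. A short simulation sketch (my own, not from the book) checks both at ρ = 0.5, where each should be about 1.5:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.5
# Draw correlated standard normal pairs
x, y = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]],
                               size=500_000).T

# Both variables have unit variance, so the denominators in (5.22)
# and (5.23) are 1 and the cokurtoses are plain cross moments.
k22 = float(np.mean(x**2 * y**2))  # theory: 1 + 2*rho^2 = 1.5
k31 = float(np.mean(x**3 * y))     # theory: 3*rho = 1.5
```

Sweeping ρ over [−1, 1] with this code reproduces the shapes in Figure 5.5: k22 is a symmetric curve from 1 up to 3, while k31 is linear from −3 to 3.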
QUESTIONS

5.2 What is the difference between the standard deviation of the sample data and the standard error of the sample mean?

5.3 Is an unbiased estimator always better than a biased estimator?

5.4 What is the sample analog to the expectation operator, and how is it used?

5.6 What do skewness and kurtosis measure?

5.7 What are the skewness and kurtosis of a normal random variable?

5.8 How are quantiles estimated?

5.9 What advantages do quantiles have compared to moments?

5.10 When dealing with two random variables, how is coskewness interpreted?
Practice Questions

5.11 Suppose four independent random variables X₁, X₂, X₃, and X₄ all have mean μ = 1 and variances 1/2, 1/2, 2, and 2, respectively. What are the expectation and variance of ¼(X₁ + X₂ + X₃ + X₄)?

5.12 Using the same information from question 5.11, compute the expectation and variance of 2X₁ + 2X₂ + ½X₃ + ½X₄.

5.13 Using the same information from question 5.11, compute the expectation and variance of (2/5)X₁ + (2/5)X₂ + (1/10)X₃ + (1/10)X₄.

5.14 What is the effect on the sample covariance between X and Y if X is scaled by a constant a? What is the effect on the sample correlation?

5.15 An experiment yields the following data:

Trial Number   Value
1              0.00
2              0.07
3              0.13
4              0.13
5              0.20
6              0.23
⋮              ⋮
13             0.76
14             0.77
15             0.96

It is hypothesized that the data come from a uniform distribution, U(0, b).
a. Calculate the sample mean and variance.
b. What are the unbiased estimators of the mean and variance?
c. Calculate b in U(0, b) using the formula for the mean of a uniform distribution and the value of the unbiased sample mean found in part b.
d. Calculate b in U(0, b) using the formula for the variance of a uniform distribution and the value of the unbiased sample variance found in part b.

5.16 For the following data:

Observation   Value
1             0.38
2             0.28
3             0.27
4             0.99
5             0.26
6             0.43
ANSWERS

Solved Problems

5.11 The expectation is E[¼(X₁ + X₂ + X₃ + X₄)] = ¼(E[X₁] + E[X₂] + E[X₃] + E[X₄]) = ¼(μ + μ + μ + μ) = μ = 1. This estimator is unbiased.

The variance is Var[¼(X₁ + X₂ + X₃ + X₄)] = (1/16)(Var[X₁] + Var[X₂] + Var[X₃] + Var[X₄]) = (1/16)(1/2 + 1/2 + 2 + 2) = 5/16. The variance of the sum is the sum of the variances because the random variables are independent.

5.12 The expectation is E[2X₁ + 2X₂ + ½X₃ + ½X₄] = 2μ + 2μ + ½μ + ½μ = 5μ = 5.

The variance is Var[2X₁ + 2X₂ + ½X₃ + ½X₄] = 4Var[X₁] + 4Var[X₂] + ¼Var[X₃] + ¼Var[X₄] = 4(1/2) + 4(1/2) + ¼(2) + ¼(2) = 2 + 2 + 1/2 + 1/2 = 5.

5.13 The expectation is computed using the same steps: E[(2/5)X₁ + (2/5)X₂ + (1/10)X₃ + (1/10)X₄] = (2/5)E[X₁] + (2/5)E[X₂] + (1/10)E[X₃] + (1/10)E[X₄] = (2/5)μ + (2/5)μ + (1/10)μ + (1/10)μ = μ = 1. This estimator is unbiased.

The variance is Var[(2/5)X₁ + (2/5)X₂ + (1/10)X₃ + (1/10)X₄] = (4/25)Var[X₁] + (4/25)Var[X₂] + (1/100)Var[X₃] + (1/100)Var[X₄] = (4/25)(1/2) + (4/25)(1/2) + (1/100)(2) + (1/100)(2) = 2/25 + 2/25 + 1/50 + 1/50 = 1/5.

5.14 Scaling X by a constant a scales the sample covariance by the same constant, so the covariance becomes a·σ̂_XY. Rescaling the data produces ρ̂ = (a·σ̂_XY)/(a·σ̂_X σ̂_Y) = σ̂_XY/(σ̂_X σ̂_Y), and so the sample correlation is invariant to rescaling the data.

5.15 a. Use the standard formulas to get the sample mean and variance (here, n = 15):

μ̂ = n⁻¹ Σᵢ₌₁ⁿ xᵢ = 0.39,
σ̂² = n⁻¹ Σᵢ₌₁ⁿ (xᵢ − μ̂)² = 0.08.

b. The sample mean is already unbiased. For the variance:

s² = [n/(n − 1)] σ̂² = (15/14) × 0.080 = 0.086.

c. The mean of a U(a, b) distribution is (a + b)/2, so with a = 0:

0.39 = b/2 ⇒ b = 0.78

d. The variance of a U(a, b) distribution is (b − a)²/12, so with a = 0:

0.086 = b²/12 ⇒ b = 1.016

5.16 a. The first step is to rank-order the observations:

Ranked Position   Value   Quantile
1                 0.26    0%
2                 0.27    20%
3                 0.28    40%
4                 0.38    60%
5                 0.43    80%
6                 0.99    100%

The median (50%) lies exactly halfway between the third and fourth ranked observations. Therefore:

Median = (0.28 + 0.38)/2 = 0.33

b. This requires calculating the 25% and 75% levels. The 25% level is 25% of the way between ranked observations 2 and 3. Therefore:

q25% = 0.75 × 0.27 + 0.25 × 0.28 = 0.2725

Similarly, the 75% level is 75% of the way between observations 4 and 5:

q75% = 0.25 × 0.38 + 0.75 × 0.43 = 0.4175
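The quantile arithmetic in the solution to 5.16 can be verified in a few lines (a sketch, not part of the original solution):

```python
ranked = [0.26, 0.27, 0.28, 0.38, 0.43, 0.99]

median = 0.5 * (ranked[2] + ranked[3])      # halfway between 3rd and 4th values
q25 = 0.75 * ranked[1] + 0.25 * ranked[2]   # 25% of the way from obs 2 to obs 3
q75 = 0.25 * ranked[3] + 0.75 * ranked[4]   # 75% of the way from obs 4 to obs 5
print(median, q25, q75)
```

Running this reproduces the values in the solution: 0.33, 0.2725, and 0.4175.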
• Construct an appropriate null hypothesis and alternative hypothesis and distinguish between the two.

• Construct and apply confidence intervals for one-sided and two-sided hypothesis tests, and interpret the results of hypothesis tests with a specific level of confidence.

• Differentiate between a one-sided and a two-sided test and identify when to use each test.

• Explain the difference between Type I and Type II errors and how these relate to the size and power of a test.

• Understand how a hypothesis test and a confidence interval are related.

• Explain what the p-value of a hypothesis test measures.

• Interpret the results of hypothesis tests with a specific level of confidence.

• Identify the steps to test a hypothesis about the difference between two population means.

• Explain the problem of multiple testing and how it can bias results.
As explained in Chapter 5, the true values of the population parameters are unknown. However, observed data contain information about those parameters.

A hypothesis is a precise statement about population parameters, and the process of examining whether a hypothesis is consistent with the sample data is known as hypothesis testing. In frequentist inference, hypothesis testing can be reduced to one universal question: How likely is the observed data if the hypothesis is true?

Testing a hypothesis about a population parameter starts by specifying a null hypothesis and an alternative hypothesis. The null hypothesis is an assumption about the population parameter. The alternative hypothesis specifies the population parameter values (i.e., the critical values) where the null hypothesis should be rejected. These critical values are determined by:

• The distribution of the test statistic when the null hypothesis is true, and
• The size of the test, which reflects our aversion to rejecting a null hypothesis that is in fact true.

Observed data are used to construct a test statistic, and the value of the test statistic is compared to the critical values to determine whether the null hypothesis should be rejected.

This chapter outlines the steps used to test hypotheses. It begins by examining whether the mean excess return on the stock market (i.e., the market premium) is 0. The steps used to test a hypothesis about the mean are nearly identical to those used in many econometric problems and so are widely applicable. This chapter also examines testing the equality of the means for two sequences of random variables and explains how tests are used to validate the specification of Value-at-Risk (VaR) models.

5. The critical value, which is a value that is compared to the test statistic to determine whether to reject the null hypothesis; and

6. The decision rule, which combines the test statistic and critical value to determine whether to reject the null hypothesis.

Null and Alternative Hypotheses

The first step in performing a test is to identify the null and alternative hypotheses. The null hypothesis (H₀) is a statement about the population values of one or more parameters. It is sometimes called the maintained hypothesis, because it is maintained (or assumed) to be true throughout the testing procedure. Hypothesis testing relies crucially on this assumption, and many elements of a hypothesis test are based on the distribution of the sample statistic (e.g., the sample mean) when the null is true.

In most settings, the null hypothesis is consistent with there being nothing unusual about the data. For example, when considering whether to invest in a mutual fund, the natural null hypothesis is that the fund does not generate an abnormally high return. This hypothesis can be formalized as the statement that the expected return on the fund is equal to the expected return on its style-matched benchmark. This null hypothesis is expressed (in terms of population values) as:

H₀: μ_F = μ_BM,

where μ_F is the expected return of the mutual fund (i.e., E[r_F]) and μ_BM is the expected return on the style benchmark (i.e., E[r_BM]). This null can be equivalently expressed using the difference between the means:

H₀: μ_F − μ_BM = 0
If the alternative were set to H₁: μ_F ≠ μ_BM, then a rejection of the null would only indicate that the return is different from that of the benchmark (i.e., it could be either higher or lower than the benchmark). To focus on the case where the fund manager outperforms her benchmark, a one-sided alternative is needed:

H₁: μ_F > μ_BM

or equivalently:

H₁: μ_F − μ_BM > 0

Note that a one-sided alternative increases the chance of rejecting a false null hypothesis when the true parameter is in the range covered by the (one-sided) alternative.

Note that if μ_F < μ_BM, then neither the null nor the alternative is true. However, because the values where the null is rejected are determined exclusively by the alternative hypothesis, the null should not be rejected. This highlights the fact that failing to reject the null hypothesis does not mean the null is true, but only that there is insufficient evidence against the null.

Here σ² is the variance of the sequence of iid random variables used to estimate the mean. When the true value of the mean (μ) is equal to the value assumed by the null (μ₀), the asymptotic distribution leads to the test statistic:

T = (μ̂ − μ₀)/√(σ̂²/n) →d N(0, 1)   (6.1)

The test statistic T (also known as the t-statistic) is asymptotically standard normally distributed and centered at the null hypothesis. The population variance is replaced by the sample variance, because σ̂² is a consistent estimator of the population variance σ² when the sample size is sufficiently large. While this chapter focuses on testing hypotheses about the mean in most examples, T is applicable to any estimator that is asymptotically normally distributed.

Type I Error and Test Size

In an ideal world, a false (true) null would always (never) be rejected. However, in practice there is a tradeoff between avoiding the rejection of a true null and avoiding a failure to reject a false null.
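Equation 6.1 can be made concrete with a short sketch. This is illustrative only: the function name and the data values are made up, and the sample-analog variance (divide by n) is used, as in this chapter.

```python
import math

def t_stat(data, mu0):
    """T = (mu_hat - mu_0) / sqrt(sigma_hat^2 / n), as in Equation 6.1."""
    n = len(data)
    mu_hat = sum(data) / n
    var_hat = sum((x - mu_hat) ** 2 for x in data) / n  # sample analog of sigma^2
    return (mu_hat - mu0) / math.sqrt(var_hat / n)

# Reject H0: mu = mu0 at the 5% level (two-sided) when |T| > 1.96
data = [0.8, 1.2, 1.1, 0.9, 1.0, 1.3, 0.7, 1.0]
T = t_stat(data, 0.0)
```

For these made-up data the sample mean is 1.0, so testing H₀: μ = 0 produces a large |T| and a rejection, while testing H₀: μ = 1 produces T = 0 exactly.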
Figure 6.1 The top-left panel shows the distribution of the test statistic, a standard normal, when the null is true. The top-right panel shows the rejection region (shaded) when using a two-sided alternative and a test size of 5% (critical values ±1.96). The bottom two panels show the rejection regions when using one-sided lower (left) and one-sided upper (right) alternatives, also using a test size of 5% (critical values ∓1.645).
When n is small (i.e., less than 30), the Student's t has been documented to provide a better approximation to the distribution of T than the normal. This result holds even when the random variables in the average (Xᵢ) are not normally distributed. Note that the critical values from a Student's t with n − 1 degrees of freedom are larger than those from a normal, and so using a t distribution reduces the probability of rejecting the null (all things equal). In practice, the larger critical values do a better job of producing tests with the correct size α.

Figure 6.2 compares the normal, t₅, and t₁₅ densities. Note that the tails of the t are heavier than those of the normal, although they become closer as the degrees of freedom increase. Using critical values from the Student's t, along with the use of s² to estimate the variance, is the recommended method to test a hypothesis about a mean when the sample contains 30 or fewer observations. In larger samples, the difference between the two is negligible, and common practice is to estimate the variance with σ̂² and to use critical values from the normal distribution.

Figure 6.2 Comparison of the densities of the normal, t₅, and t₁₅. The right panel focuses on the right tail for values above 1.645, the 5% cutoff from the normal distribution.
hypothesis (H₀: μ = 0) can be rejected in favor of the alternative hypothesis (H₁: μ ≠ 0).

Type II Error and Test Power

A Type II error occurs when the alternative is true, but the null is not rejected. The probability of a Type II error is denoted by the Greek letter β. In practice, β should be small so that the power of the test, defined as 1 − β, is high. The power of a test measures the probability that a false null is rejected.

Table 6.1 shows the decision matrix that relates the decision taken (rejecting or failing to reject the null) to the truth of the null hypothesis. Note that the values in parentheses in Table 6.1 are the probabilities of arriving at that decision.

Table 6.1 Decision Matrix for a Hypothesis Test Relating the Unknown Truth to the Decision Taken about the Truth of a Hypothesis. Size α and Power 1 − β Are Probabilities and so Are between 0 and 1

                              Null Hypothesis
Decision                      True                  False
Fail to Reject                Correct (1 − α)       Type II Error (β)
Reject                        Type I Error (α)      Correct (1 − β)

Whereas the size of the test can be set by the tester, the power of a test cannot be controlled in a similar fashion. This is because the power depends on:

• The sample size,
• The distance between the value assumed under the null and the true value of the parameter, and
• The size of the test.

Example: The Effect of Sample Size on Power

Figure 6.3 illustrates the core concepts of hypothesis testing in a small sample (where n = 10) and in a larger sample (where n = 100). In both panels, the data are assumed to be generated according to:

Xᵢ ~ N(0.3, 1)
Figure 6.3 Both panels show two distributions: the distribution of the mean when the null is true (solid), and the distribution of the sample mean that is centered at the population mean of 0.3 (dashed). The vertical lines show the critical values, which are determined by the distribution centered on the mean assumed in the null hypothesis (0.0). The shaded regions show the power of the test, which is the probability that the null hypothesis is rejected in favor of the alternative. The hatched region shows the probability of making a Type II error (i.e., β).

The sample average μ̂ is the average of n iid normal random variables and so is also normally distributed, with E[μ̂] = 0.3 and V[μ̂] = 1/n. The null hypothesis is H₀: μ = 0 and the alternative is H₁: μ ≠ 0.
Both panels in Figure 6.3 show two distributions. The first is the distribution of the sample mean if the null is true (shown with the solid line). It is centered at 0 and, in the top panel, has a standard deviation of 1/√10. The second is the distribution of the sample mean (shown with the dashed line). It is centered at the true value of μ = 0.3 and has the same standard deviation.

The dashed vertical lines show the critical values for a test with a size of 5%, which are

• ±0.61, when n = 10, and
• ±0.196, when n = 100

for the top and bottom panels, respectively.

When the sample mean falls outside of the set of critical values, the null is rejected in favor of the alternative. The probability that the sample mean falls into the rejection region is illustrated by the grey region outside of the critical values. This area is the power of the test (i.e., the chance that the null is rejected when the data are generated under the alternative). Meanwhile, the probability of failing to reject the null (i.e., β) is indicated by the hatched area.

Increasing the sample size has an obvious effect: both distributions are narrower. These narrower distributions increase the probability that the null is correctly rejected and so increase the power of the test. In even larger samples (e.g., n = 1000), the two distributions would have virtually no overlap, so that the null is rejected with an exceedingly high probability (> 99.999%). This pattern shows that the power (i.e., the probability that the false null is rejected) increases with n.
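The power figures in this example can be computed directly from the normal CDF rather than simulated. A sketch under the setup above (H₀: μ = 0, true μ = 0.3, unit variance, two-sided 5% test); the function names are mine.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(n, true_mu=0.3):
    """P(reject H0: mu = 0) when mu_hat ~ N(true_mu, 1/n)."""
    se = 1.0 / sqrt(n)
    c = 1.96 * se  # critical value expressed on the scale of the sample mean
    return (1.0 - norm_cdf((c - true_mu) / se)) + norm_cdf((-c - true_mu) / se)

print(power(10), power(100), power(1000))
```

This reproduces the pattern described in the text: modest power at n = 10, high power at n = 100, and power indistinguishable from 1 (above 99.999%) at n = 1000.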
Note that changing the test size has an important side effect: larger test sizes increase the probability of committing a Type I error and rejecting a true null hypothesis. The choice of the test size therefore affects both the probability of making a Type I error and that of making a Type II error. In other words, small sizes reduce the frequency of Type I errors, while large sizes reduce the frequency of Type II errors. The test size should thus be chosen to reflect the costs and benefits of committing either a Type I or Type II error.

Note that the power is equal to α (i.e., the test size) when the true value of μ is close to 0. This makes sense, because the null should be rejected with probability α even when true. As the true μ moves away from 0, however, the power increases monotonically. This is because it is easier to tell that the null is false when the data are generated by a population μ far from the value assumed by the null. In other words, large effects are easier to detect than small effects.

The 1 − α two-sided confidence interval for the mean is:

[μ̂ − C_α × σ̂/√n, μ̂ + C_α × σ̂/√n],   (6.4)

where C_α is the critical value for a two-sided test (e.g., if α = 5%, then C_α = 1.96).

Returning to the example of estimating the mean excess return on the S&P 500, the 95% confidence interval is

[7.62 − 1.96 × 16.5/√68, 7.62 + 1.96 × 16.5/√68] = [3.69, 11.54]

This confidence interval indicates that any value of the null between 3.69 and 11.54 cannot be rejected against a two-sided alternative. Other values (e.g., 0) are outside of this confidence interval, and so the null H₀: μ = 0 is rejected.

Confidence intervals can also be constructed for one-sided alternatives. The one-sided 1 − α confidence interval is asymmetric and is either

(−∞, μ̂ + C_α × σ̂/√n]  or  [μ̂ − C_α × σ̂/√n, ∞)
Figure 6.4

The p-value of a Test

A hypothesis test can also be summarized by its p-value, which measures the probability of observing a test statistic that is more extreme than the one computed from the observed data when the null hypothesis is true.

A p-value combines the test statistic, the distribution of the test statistic, and the critical values into a single number that is always between 0 and 1. This value can always be used to determine whether a null hypothesis should be rejected: if the p-value is less than the size of the test, then the null is rejected.

The p-value of a test statistic is equivalently defined as the smallest test size (α) at which the null is rejected. Any test size larger than the p-value leads to rejection, whereas using a test size smaller than the p-value fails to reject the null.

In the left panel, the test statistic is 2.1, and the areas in the two tails sum to 3.6%, making the p-value equal to 3.6%. In this case, the null hypothesis would be rejected using a test size of 5% (but not 1%). The p-value of 3.6% indicates that there is a 3.6% probability of observing a |T| > 2.1 when the null hypothesis is true.

In the right panel, the test statistic is −0.5 and the total probability in both tails (i.e., the p-value) is 61.8%. This test would not reject the null using any standard test size. The large p-value here indicates that, when the null is true, three out of five samples would produce an absolute test statistic as large as 0.5.

The formula for a one-sided p-value depends on the direction of the alternative hypothesis. When using a one-sided lower-tailed test (e.g., H₁: μ < μ₀), the p-value is Φ(T) (i.e., the area in the left tail). When using a one-sided upper-tailed test (e.g., H₁: μ > μ₀), the p-value is 1 − Φ(T) (i.e., the area in the right tail). The p-values for one-sided alternatives do not use the absolute value, because the sign of the test statistic matters for determining statistical significance.

Figure 6.5 Graphical representation of p-values. The two panels show the area measured by a p-value when using a two-sided alternative. The p-value measures the area above |T| and below −|T| when using the test statistic T.
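The two-sided p-value, the area above |T| plus the area below −|T|, is 2(1 − Φ(|T|)). A sketch checking the two examples above (the helper function is mine):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_value_two_sided(t):
    """Area above |T| plus area below -|T| under the standard normal."""
    return 2.0 * (1.0 - norm_cdf(abs(t)))

print(p_value_two_sided(2.1))   # roughly the 3.6% from the text
print(p_value_two_sided(-0.5))  # roughly the 61.8% from the text
```

A usage note: because the p-value is the smallest test size at which the null is rejected, comparing it against α replaces any table lookup of critical values.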
Now consider a test of the null hypothesis H₀: μ_X = μ_Y (i.e., that the component random variables Xᵢ and Yᵢ have equal means). To implement a test, construct a new random variable Zᵢ = Xᵢ − Yᵢ, so that the null becomes a hypothesis about a single mean (H₀: μ_Z = 0) and the value of the usual test statistic can be compared to a standard normal. This method automatically accounts for any correlation between Xᵢ and Yᵢ.

Note that because μ_Z = μ_X − μ_Y and σ_Z² = σ_X² + σ_Y² − 2σ_XY, this test statistic can be equivalently expressed as:

T = (μ̂_X − μ̂_Y) / √((σ̂_X² + σ̂_Y² − 2σ̂_XY)/n)

This expression for T shows that the correlation between the two series affects the test statistic. If the two series are positively correlated, then the covariance between Xᵢ and Yᵢ reduces the variance of the difference. This occurs because both random variables tend to have errors in the same direction, and so the difference Xᵢ − Yᵢ eliminates some of the common error.

As an example, consider two distributions that measure the average rainfall in City X and City Y, respectively. Their means are 10 cm and 14 cm, their variances are 4 and 6, the sample size is 12, and the correlation is 0.3. For the hypothesis to not be rejected at the 95% level, the maximum difference between μ_X and μ_Y would be

1.96 × √((4 + 6 − 2 × 0.3 × √4 × √6)/12) ≈ 2.6 cm

However, because |μ̂_X − μ̂_Y| = 4 > 2.6, the null is rejected when α = 0.05. Furthermore, note that the value of the test statistic is

(10 − 14) / √((4 + 6 − 2 × 0.3 × √4 × √6)/12) = −5.21

Therefore, the null would be rejected even at a 99.99% confidence level!

Finally, a special version of this test statistic is available when Xᵢ and Yᵢ are both iid and mutually independent. When Xᵢ and Yᵢ are independent, they are not paired as a bivariate random variable, and so the number of sample points for Xᵢ (n_X) and Yᵢ (n_Y) may differ. The test statistic for testing that the means are equal (i.e., H₀: μ_X = μ_Y) is

T = (μ̂_X − μ̂_Y) / √(σ̂_X²/n_X + σ̂_Y²/n_Y)⁴

⁴ This test statistic also has a standard normal distribution when the null is true.
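The rainfall example can be verified numerically. Below is a minimal sketch in plain Python (the variable names are ours; the numbers are the ones given in the example):

```python
from math import sqrt

# Values from the rainfall example: City X and City Y
mean_x, mean_y = 10.0, 14.0   # sample means (cm)
var_x, var_y = 4.0, 6.0       # sample variances
rho = 0.3                     # sample correlation
n = 12                        # number of paired observations

# Variance of the difference Z_i = X_i - Y_i
var_z = var_x + var_y - 2.0 * rho * sqrt(var_x) * sqrt(var_y)

# Test statistic for H0: mu_X = mu_Y, compared to a standard normal
t_stat = (mean_x - mean_y) / sqrt(var_z / n)
print(round(t_stat, 2))  # -5.21, so the null is rejected even at tiny sizes
```

Note how the positive correlation shrinks var_z below var_x + var_y, which makes the test statistic larger in absolute value than it would be for independent series.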
The size of the test is chosen to control the probability of a Type I error, which is the rejection of a true null hypothesis. The test size should reflect the cost of being wrong when the null is true. The distribution of the test statistic and the test size together determine the critical values of the test.

Most of the focus of this chapter is on testing a null hypothesis about the mean (H₀: μ = μ₀) using the CLT. More generally, this testing framework is applicable to any parameter that has an asymptotic distribution that can be approximated using a CLT.
QUESTIONS

6.4 What are the trade-offs when choosing the size of a test?
6.5 What is the power of a hypothesis test?
6.9 What does the VaR of a portfolio measure?
6.10 What are three methods to evaluate a VaR model of a portfolio?
Practice Questions

6.11 Suppose you wish to test whether the default rate of bonds that are rated as investment grade by S&P is the same as the default rate on bonds rated as investment grade by Fitch. What are the null and alternative hypotheses? What data would you need to test the null hypothesis?

6.12 Using the Excel function NORM.S.DIST or a normal probability table, what are the critical values when testing the null hypothesis, H₀: μ = μ₀, against
a. a one-sided lower alternative using a size of 10%?
b. a one-sided upper alternative using a size of 20%?
c. a two-sided alternative using a size of 2%?
d. a two-sided alternative using a size of 0.1%?

6.13 Find the p-value for the following t-test statistics of the null hypothesis, H₀: μ = μ₀.
a. Statistic: 1.45, Alternative: Two-sided
b. Statistic: 1.45, Alternative: One-sided upper
c. Statistic: 1.45, Alternative: One-sided lower
d. Statistic: −2.3, Alternative: One-sided upper
e. Statistic: 2.7, Alternative: Two-sided

6.14 If you are given a 99% confidence interval for the mean return on the Nasdaq 100 of [2.32%, 12.78%], what is the sample mean and standard error? If this confidence interval is based on 37 years of data, assumed to be iid, what is the sample standard deviation?

6.15 You collect 50 years of annual data on equity and bond returns. The estimated mean equity return is 7.3% per year, and the sample mean bond return is 2.7% per year. The sample standard deviations are 18.4% and 5.3%, respectively. The correlation between the two return series is −60%. Are the expected returns on these two assets the same? Does your answer change if the correlation is 0?

6.16 If a p-VaR model is well specified, HITs should be iid Bernoulli(1 − p). What is the probability of observing two HITs in a row? Can you think of how this could be used to perform a test that the model is correct?

6.17 A data management group wants to test the null hypothesis that observed data is N(0,1) distributed by evaluating the mean of a set of random draws. However, the actual underlying data is distributed as N(1, 2.25).
a. If the sample size is 10, what is the probability of a Type II error and the power of the test? Assume a 90% confidence level on a two-sided test.
b. How many samples would need to be taken to reduce the probability of a Type II error to less than 1%?

6.18 Suppose that for a linear regression, an estimation of the slope is stated as having a one-sided p-value of 0.04. What does this mean?
ANSWERS

Solved Problems
Chapter 6 Hypothesis Testing

The following questions are intended to help candidates understand the material. They are not actual FRM exam questions.

6.11 The null is that the IG (investment grade) default rate is the same for S&P-rated firms as it is for Fitch-rated firms. If Sᵢ are default indicators for S&P-rated firms, and Fᵢ are default indicators for Fitch-rated firms, then the null is H₀: μ_S = μ_F, which states that the mean values are the same. The null can be equivalently expressed as H₀: μ_S − μ_F = 0. The alternative is H₁: μ_S ≠ μ_F, or equivalently H₁: μ_S − μ_F ≠ 0. The data required to test this hypothesis would be binary random variables where 1 indicates that an IG bond defaulted and 0 if it did not within a fixed time frame (e.g., a quarter).

6.12 a. The lower-tailed critical value is −1.28.
b. The one-sided upper critical value is 0.84.
c. A two-sided alternative with a size of 2% uses the same critical value as a one-sided upper test with a size of 1%, which is 2.32. Each tail has probability 1% when using this critical value.
d. This size corresponds to the same critical value as a 0.05% upper-tailed test, which is 3.29.

6.13 a. In a two-sided test, the p-value is the area under the normal curve for values greater than 1.45 or less than −1.45. This is twice the area less than −1.45, or 2 × 0.074 = 0.148.
b. In a one-sided upper test, we need the area above 1.45. This is half the previous answer, or 0.074.
c. In a one-sided lower test, we need the area under the curve for values less than 1.45. This is just 1 − 0.074 = 0.926.
d. Here we need the probability under the normal curve for values above −2.3, which is 1 minus the area less than −2.3, or 1 − 0.01 = 0.99.
e. Because it is a two-sided alternative, we need the area for values larger than 2.7 or less than −2.7, which is twice the area less than −2.7. This value is 2 × 0.0034 = 0.0068.

6.14 The mean is the midpoint of a symmetric confidence interval (the usual type), and so is 7.55%. The 99% CI is constructed as [μ̂ − c × σ̂, μ̂ + c × σ̂], and so c × σ̂ = 12.78% − 7.55% = 5.23%. The critical value for a 99% CI corresponds to the point where there is 0.5% in each tail, or 2.57, and so the standard error is σ̂ = 5.23%/2.57 = 2.03%. With 37 years of iid data, the sample standard deviation is then 2.03% × √37 ≈ 12.3%.

6.15 The null hypothesis is H₀: μ_E = μ_B. The alternative is H₁: μ_E ≠ μ_B. The test statistic is based on the difference of the average returns, δ = 7.3% − 2.7% = 4.6%. The estimator of the variance of the difference is σ̂_E² + σ̂_B² − 2σ̂_BE, which is 0.184² + 0.053² − 2 × (−0.6) × 0.184 × 0.053 = 0.0483. The test statistic is

0.046 / √(0.0483/50) = 1.47

The critical value for a two-sided test is ±1.96 using a size of 5%. The null is not rejected. If the correlation was 0, then the variance estimate would be 0.0366, and the test statistic is 1.69. The null would still not be rejected if the size was 5%, although if the test size was 10%, then the critical value would be ±1.645 and the correlation would matter.

6.16 The probability of a HIT should be 1 − p if the model is correct. HITs should also be independent, and so the probability of observing two HITs in a row should be (1 − p)². This can be formulated as the null that E[HITᵢ × HITᵢ₊₁] = (1 − p)² and tested against the alternative H₁: E[HITᵢ × HITᵢ₊₁] ≠ (1 − p)². This can be implemented as a simple test of a mean by defining the random variable Xᵢ = HITᵢ × HITᵢ₊₁ and then testing the null H₀: μ_X = (1 − p)² using a standard test of a mean.

6.17 a. When the null hypothesis is false, the probability of a Type II error is equal to the probability that the hypothesis fails to be rejected, as per the diagram in the chapter. Recall that for a 90% confidence level on an N(0,1) distribution, the cut-off points are ±1.65. Now, if 10 samples are taken from an N(0,1), then the standard deviation of the sample mean is reduced to

σ_H₀ = 1/√10 = 0.316

Therefore, the cut-off points are ±1.65 × 0.316 = ±0.522.

In actuality, the true distribution is N(1, 2.25), so σ = √2.25 = 1.5. For a sample size of 10, the standard deviation of the sample mean is

σ_sample = 1.5/√10 = 0.474

Calculating the equivalent distance of the cut-offs ±0.522 in this distribution compared to a standard N(0,1) yields (0.522 − 1)/0.474 ≈ −1.01 and (−0.522 − 1)/0.474 ≈ −3.21, so the probability of a Type II error is Φ(−1.01) − Φ(−3.21) ≈ 0.156 and the power of the test is 1 − 0.156 = 0.844.

6.18 The p-value tells us the "probability of observing a test statistic that is more extreme than the one computed from the observed data when the null hypothesis is true." In other words, if the null hypothesis about the slope is true, then in a randomized trial the probability of getting a larger new slope estimate is 4%. Assuming a normal distribution, this will correspond to the value of x that solves the following equation:

1 − Φ(x) = 0.04

Using the Excel function NORMSINV gives x = 1.75.
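The p-values in 6.13 and the cut-off in 6.18 can be reproduced with the standard normal CDF, Φ(x) = ½(1 + erf(x/√2)). A small sketch in plain Python, using bisection in place of Excel's NORMSINV (function names are ours):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# 6.13: p-values for a test statistic of 1.45
p_upper = 1.0 - phi(1.45)        # one-sided upper: about 0.074
p_two_sided = 2.0 * p_upper      # two-sided: about 0.147
p_lower = phi(1.45)              # one-sided lower: about 0.926

# 6.18: find x solving 1 - phi(x) = 0.04 by bisection on [0, 5]
lo, hi = 0.0, 5.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if 1.0 - phi(mid) > 0.04:    # tail still too fat, move right
        lo = mid
    else:
        hi = mid
x = 0.5 * (lo + hi)
print(round(p_two_sided, 3), round(x, 2))
```

The small differences from the table-based answers (0.147 versus 0.148) come from rounding the tail area to three decimals before doubling.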
• Describe the models that can be estimated using linear regression and differentiate them from those which cannot.
• Interpret the results of an OLS regression with a single explanatory variable.
• Describe the key assumptions of OLS parameter estimation.
• Characterize the properties of OLS estimators and their sampling distributions.
• Construct, apply, and interpret hypothesis tests and confidence intervals for a single regression coefficient in a regression.
• Explain the steps needed to perform a hypothesis test in a linear regression.
• Describe the relationship between a t-statistic, its p-value, and a confidence interval.
Linear regression is a widely applied statistical tool for modeling the relationship between random variables. It has many appealing features (e.g., closed-form estimators, interpretable parameters, and a flexible specification) and can be adapted to a wide variety of problems. This chapter develops the bivariate linear regression model, which relates a dependent variable to a single explanatory variable. Models with multiple explanatory variables are examined later in this book.

The chapter begins by examining the specifications that can (and cannot) be modeled in the linear regression framework. Regression is surprisingly flexible and can (using carefully constructed explanatory variables) describe a wide variety of relationships. Dummy variables and interactions are two key tools used when modeling data. These are widely used to build flexible models where parameter values change depending on the value of another variable.

VARIABLE NAMING CONVENTIONS

Linear regression is used widely across finance, economics, other social sciences, engineering, and the natural sciences. While users within a discipline often use the same nomenclature, there are differences across disciplines. The three variables in a regression are frequently referred to using a variety of names. These are listed in the table below.

Explained Variable         Explanatory Variable        Shock
Left-Hand-Side Variable    Right-Hand-Side Variable    Innovation
Dependent Variable         Independent Variable        Noise
Regressand                 Regressor                   Error
                                                       Disturbance
… generated the data, they are needed to interpret the model. Five assumptions are used to establish key statistical properties of linear regression estimators. When satisfied, the parameter estimators are asymptotically normal, and standard inference can be used to test hypotheses.

Finally, this chapter presents three key applications of linear regression in finance: measuring asset exposure to risk factors, hedging, and evaluating fund manager performance.

7.1 LINEAR REGRESSION

Regression analysis is the most widely used method to measure, model, and test relationships between random variables. In the model Y = α + βX + ε:

• β, commonly called the slope, measures the sensitivity of Y to changes in X;
• α, commonly called the intercept, is a constant; and
• ε, commonly called the shock/innovation/error, represents a component in Y that cannot be explained by X. This shock is assumed to have mean 0 so that

E[Y] = E[α + βX + ε] = α + βE[X]

The presence of the shock is also why statistical analysis is required. If the shock were not present, then Y and X would have a perfect linear relationship and the value of the two parameters could be exactly determined using only basic algebra.

An obvious interpretation of α is the value of Y when X is zero. However, this interpretation is not meaningful if X cannot plausibly take the value zero.
Building models with multiple explanatory variables allows the effect of an explanatory variable to be measured while controlling for other variables known to be related to Y. For example, when assessing the performance of a hedge fund manager, it is common to use between seven and 11 explanatory variables that capture various sources of risk.

Note that the k explanatory variables are not assumed to be independent, and so one variable can be defined as a known transformation of another.³ This transformed model satisfies the three requirements of linear regression. When interpreting the slope of a transformed relationship, note that the coefficient β measures the effect of a change in the transformation of X on the transformation of Y.

² There are many approaches to imputing missing values in linear models that then allow linear regression to be applied to the combined data set including the imputed values.

³ While the k explanatory variables are not assumed to be independent, they are assumed to not be perfectly correlated. This structure allows for a nonlinear transformation of a variable to be included in a regression model, but not a linear transformation, because Corr[Xᵢ, a + bXᵢ] = 1 if b > 0 and −1 if b < 0. Further details on the assumptions made when using multiple explanatory variables are presented in Chapter 10.

⁴ This is the result of taking the partial derivative of Y with respect to X.
… dY/Y, which can be directly interpreted as a percentage change.⁴

When both X and Y have been replaced by their logarithms (called a log-log model), then β measures the percentage change in Y for a 1% change in X. In other words, the coefficient in a log-log model is directly interpretable as the elasticity of Y with respect to X.

However, if only the dependent variable is replaced by its logarithm (e.g., ln Yᵢ = α + βXᵢ + εᵢ), then β measures the percent change in Y for a one-unit change in X.

A dummy takes the value 1 when the observation has the quality and 0 if it does not. For example, when encoding sectors, the transportation sector dummy is 1 for a firm whose primary business is transportation (e.g., a commercial airline or a bus operator) and 0 for firms outside the industry. Dummies are also commonly constructed as binary transformations of other random variables (e.g., a market direction dummy that encodes the return on the market as 1 if negative and 0 if positive).

Dummies can be added to models to change the intercept (α) or the slope (β) when the dummy variable takes the value 1. For example, an empirically relevant extension of CAPM allows for the sensitivity of a firm's return to depend on the direction of the market return. If we define X to be the excess market return and D to be a dummy variable that is 1 if the excess market return is negative (i.e., a loss), then this asymmetric CAPM specifies the relationship as

Y = α + β₁X + β₂(X × D) + ε
  = α + β₁X₁ + β₂X₂ + ε

This model is a linear regression because the two variables (X₁ = X and X₂ = X × D) are both observable and each is related to the dependent variable through a single coefficient. The second explanatory variable (X × D) is called an interaction. Even though each segment is linear in X, the slopes of the two lines differ when β₂ ≠ 0, and thus the overall relationship between Y and X is nonlinear.

7.2 ORDINARY LEAST SQUARES

Consider a basic regression model that includes a single explanatory variable:

Y = α + βX + ε

This model has three parameters: α (i.e., the intercept), β (i.e., the slope), and σ² (i.e., the variance of ε).

In this case, the squared deviations being minimized are between the realizations of the dependent variable Y and their expected values given the respective realizations of X:

α̂, β̂ = arg min over (α, β) of Σᵢ₌₁ⁿ (yᵢ − α − βxᵢ)²   (7.3)

where arg min denotes the argument of the minimum.⁵ In turn, these estimators minimize the residual sum of squares, which is defined as:

Σᵢ₌₁ⁿ (yᵢ − α̂ − β̂xᵢ)² = Σᵢ₌₁ⁿ ε̂ᵢ²

In other words, the estimators (i.e., α̂ and β̂) are the intercept and slope of a line that best fits the data because it minimizes the squared deviations between the line α̂ + β̂xᵢ and the realizations of the dependent variable yᵢ.

The solution to the minimization problem is

α̂ = ȳ − β̂x̄
β̂ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²   (7.4)

⁵ For example, arg min f(x) is the value of x that produces the lowest value of f(x).
The estimators α̂ and β̂ are then used to construct the fitted values

ŷᵢ = α̂ + β̂xᵢ

and the model residuals

ε̂ᵢ = yᵢ − ŷᵢ = yᵢ − α̂ − β̂xᵢ

Finally, the variance of the shocks is estimated by:

s² = (1/(n − 2)) Σᵢ₌₁ⁿ ε̂ᵢ²

Note that the variance estimator divides by n − 2 to account for the two estimated parameters in the model, so that s² is an unbiased estimator of σ² (i.e., E[s²] = σ²).

The residuals are always mean 0 (i.e., Σᵢ₌₁ⁿ ε̂ᵢ = 0) and uncorrelated with Xᵢ (i.e., Σᵢ₌₁ⁿ (xᵢ − x̄)ε̂ᵢ = 0). These two properties are consequences of minimizing the sum of squared errors. These residuals are different from the shocks in the model, which are never observed.

The slope estimator is the ratio of the sample covariance to the sample variance of X, β̂ = σ̂_XY/σ̂_X². This ratio can also be rewritten in terms of the correlation and the standard deviations so that:

β̂ = ρ̂_XY × σ̂_Y/σ̂_X   (7.7)

The sample correlation ρ̂_XY uniquely determines the direction of the line relating Y and X. It also plays a role in determining the magnitude of the slope. In addition, the sample standard deviations σ̂_Y and σ̂_X scale the correlation in the final determination of the regression coefficient β̂. This relationship highlights a useful property of OLS: β̂ = 0 if and only if ρ̂ = 0. Thus, linear regression is a convenient method to formally test whether data are correlated.

Figure 7.1 illustrates the components of a fitted model that regresses the mining sector portfolio returns on the market.
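The properties above (mean-zero residuals, residuals uncorrelated with X, the n − 2 divisor, and Equation 7.7) can all be checked numerically. A sketch in plain Python with made-up data:

```python
from math import sqrt

x = [1.0, 2.0, 4.0, 5.0, 7.0]
y = [2.1, 3.9, 8.3, 9.8, 14.5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

beta = sxy / sxx
alpha = ybar - beta * xbar
resid = [yi - alpha - beta * xi for xi, yi in zip(x, y)]

# OLS first-order conditions: residuals sum to zero and are uncorrelated with x
assert abs(sum(resid)) < 1e-9
assert abs(sum((xi - xbar) * e for xi, e in zip(x, resid))) < 1e-9

# Unbiased shock-variance estimator divides by n - 2 (two estimated parameters)
s2 = sum(e ** 2 for e in resid) / (n - 2)

# Equation 7.7: the slope equals the sample correlation scaled by sd(y)/sd(x)
rho = sxy / sqrt(sxx * syy)
assert abs(beta - rho * sqrt(syy / sxx)) < 1e-12
print(round(beta, 4), round(s2, 4))
```

The last assertion is an algebraic identity: ρ̂ × σ̂_Y/σ̂_X reduces to σ̂_XY/σ̂_X², so it holds for any data set.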
Figure 7.1 Plot of the annual returns on a sector portfolio that tracks the performance of mining companies and the returns on the US equity market between 1999 and 2018. The diagonal line is the fitted regression line where the portfolio return is regressed on the market return. The square point is the fitted value when using the market return in 2003 of 31.8%. The fund's return was 100.5%. The brace shows the residual, which is the error between the observed value and the regression line.
where g(X) is any well-defined function of X, including both the identity function (g(X) = X) and the constant function (g(X) = 1). This assumption implies that Corr[ε, X] = 0, so that the innovations are uncorrelated with the explanatory variables. This assumption is required when estimating the parameters and ensures that β is well defined.

Measurement error in the explanatory variable takes the form

Xᵢ = Xᵢ* + ηᵢ

where ηᵢ is independent of Xᵢ* and the shock in the model. These measurement errors randomly shift the explanatory variables and produce a line with a slope that is always flatter than the true relationship, which suggests that estimated coefficients systematically understate the true strength of relationships due to the pervasiveness of measurement error.
When using financial and economic data, explanatory variables are not fixed values that can be set or altered. Rather, an explanatory variable is typically produced from the same process that produces the dependent variable.

For example, the explanatory variable in CAPM is the market return and the dependent variable is the return on a portfolio. Both the return on the portfolio and the market return are the result of market participants incorporating new information into their expectations of future prices. Thus, CAPM models the mean of the excess return on the portfolio conditioned on the excess return on the market.

The variance of the shocks is assumed to be finite and constant:

V[εᵢ|X] = σ² < ∞   (7.8)

This assumption is known as homoskedasticity and requires that the variance of all shocks is the same.

The top panel of Figure 7.3 illustrates this assumption with simulated data. The density of the shocks is shown for five values of Xᵢ. Note that this distribution does not vary, and so the variance of the shocks is constant.

The bottom panel of Figure 7.3 illustrates a violation of this assumption. The density of the errors is changing with xᵢ, and the variance of the shocks (which is reflected by the width of the interval) is increasing in the explanatory variable.
Formally, it is assumed that the pairs (xᵢ, yᵢ) are iid draws from their joint distribution. This assumption allows xᵢ and yᵢ to be simultaneously generated. Importantly, the iid assumption affects the uncertainty of the OLS parameter estimators because it rules out correlation across observations.

In some applications of OLS (e.g., sequentially produced time series data), the iid assumption is violated. Note that OLS can still be used in situations where (X, Y) are not iid, although the method used to compute standard errors of the estimators must be modified.

No Outliers

The probability of large outliers in X should be small.⁶ OLS minimizes the sum of squared errors and therefore is sensitive to large deviations in either the shocks or the explanatory

⁶ Formally, we assume that the random variables Xᵢ have finite fourth moments. This technical condition cannot be directly verified using a sample of data, and so this technical detail is of limited value.
Figure 7.3 The densities along the regression line illustrate the distribution of the shocks for five values of Xᵢ. In the top panel, these distributions are identical and the shocks in the regression are homoskedastic. The bottom panel shows heteroskedastic data where the variance of the shock is an increasing function of the value of Xᵢ.
variables. Outliers lead to large increases in the sum of squared errors, and parameters estimated using OLS in the presence of outliers may differ substantially from the true parameters.

Figure 7.4 shows examples of outliers in ε (top panel) and X (bottom panel). The single outlier in the error has the potential to substantially alter the fitted line. Here, the extreme outlier leads to a doubling of the slope despite the 49 other points all occurring close to the true line. The bottom panel shows that an outlier in X can also produce a large change in the estimated slope. In this case, the outlier reduces the estimate of the slope by more than 50%.

The simplest method to detect and address outliers is to visually examine data for extreme observations by plotting both the explanatory variables and the fitted residuals.

Implications of OLS Assumptions

These assumptions come with two meaningful implications. First, they imply that the estimators are unbiased, so that:

E[α̂] = α and E[β̂] = β

Unbiased parameter estimates are (on average) equal to the true parameters. This is a finite sample property and so holds for any number of observations n. In large samples, the estimators are consistent, so that α̂ → α and β̂ → β as n → ∞. This property ensures that the estimated parameters are very close to the true values when n is large.

The second meaningful implication is that the two estimators are jointly normally distributed.
Figure 7.4 The top panel shows the effect on the estimated slope coefficient of a moderate outlier and an extreme outlier in εᵢ. The bottom panel shows the effect on the slope of an outlier and an extreme outlier in the explanatory variable.
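The outlier sensitivity shown in Figure 7.4 is easy to reproduce: contaminate otherwise perfectly linear data with one extreme error and compare the fitted slopes. A sketch with invented data:

```python
def slope(x, y):
    # OLS slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

# Clean data exactly on the line y = x
x = [float(v) for v in range(10)]
y = list(x)
clean_slope = slope(x, y)          # exactly 1.0

# One large positive error on the last observation
y_outlier = y[:-1] + [y[-1] + 30.0]
outlier_slope = slope(x, y_outlier)

print(clean_slope, round(outlier_slope, 3))
```

A single contaminated point out of ten more than doubles the estimated slope here, even though the other nine points lie exactly on the true line.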
The asymptotic distribution of the estimated slope is

√n (β̂ − β) →d N(0, σ²/σ_X²)   (7.9)

where V[X] = σ_X² is the variance of X.

This CLT strengthens the consistency result and allows for hypothesis testing. Note that the variance of the slope estimator depends on two moments: the variance of the shocks (i.e., σ²) and the variance of the explanatory variables (i.e., σ_X²). Furthermore, the variance of β̂ increases with σ². This is not surprising, because accurately estimating the slope is more difficult when the data are noisy.

The asymptotic distribution of the intercept is

√n (α̂ − α) →d N(0, σ²(μ_X² + σ_X²)/σ_X²)   (7.10)

The estimation error in α̂ also depends on the variance of the residuals and the variance of X. In addition, it depends on μ_X² (i.e., the squared mean of X). If X has mean 0 (i.e., μ_X = 0), then the asymptotic variance in Equation 7.10 simplifies to σ² and α̂ simply estimates the mean of Y.

In practice, the CLT is used as an approximation, so that β̂ is treated as a normal random variable that is centered at the true slope β:

β̂ ∼ N(β, σ²/(n σ_X²))   (7.11)
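The approximation in Equation 7.11 can be checked by simulation: draw many samples from a known model, estimate the slope in each, and compare the dispersion of β̂ with σ/(√n × σ_X). A sketch using only the Python standard library (all data-generating values are ours):

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(7)  # fixed seed so the experiment is reproducible

# Data-generating process (values chosen for illustration)
alpha_true, beta_true = 0.5, 1.5
sigma_eps, sigma_x, n = 2.0, 1.0, 100

def fit_slope():
    """Simulate one sample of size n and return the OLS slope estimate."""
    x = [random.gauss(0.0, sigma_x) for _ in range(n)]
    y = [alpha_true + beta_true * xi + random.gauss(0.0, sigma_eps) for xi in x]
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

betas = [fit_slope() for _ in range(1000)]

# Equation 7.11: beta-hat is approximately N(beta, sigma^2 / (n * sigma_x^2))
asymptotic_sd = sigma_eps / (sqrt(n) * sigma_x)  # 0.2 for these values
print(round(mean(betas), 3), round(pstdev(betas), 3), asymptotic_sd)
```

The Monte Carlo mean of the slope estimates is close to the true slope (unbiasedness), and their standard deviation is close to the asymptotic value 0.2.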
Testing a hypothesis about a regression parameter is identical to testing a hypothesis about the mean of a random variable (or any other estimator that follows a CLT). Tests are implemented using a t-test, which measures the normalized difference between the estimated parameter and the value of the parameter under the null hypothesis. A two-sided 90% confidence interval for the slope is constructed as

(β̂ − 1.645 × s.e.(β̂), β̂ + 1.645 × s.e.(β̂))   (7.18)

where Pr(−1.645 < Z < 1.645) = 90% when Z is a standard normal random variable.

In the example, the variance of the innovations can be estimated using

s² = (1 − 0.663²) × 1770 = 991

The absolute value of the test statistic is larger than the critical value for a test with a size of 5% (i.e., 1.96), and so the null is rejected. The estimated coefficient and its standard error can also be combined to construct a two-sided 95% confidence interval. The confidence interval does not include 0, which reconfirms the previous finding that the null is rejected when using a test with a size of 5%, i.e., that the parameter is statistically different from 0. Many software packages also report the p-value of the t-statistic and a 95% confidence interval for the estimated parameter. These can equivalently be used to determine whether a parameter is statistically different from 0 (i.e., the null is rejected if 0 is not in the 95% confidence interval or if the p-value is less than 5%).

7.5 APPLICATION: CAPM

CAPM relates the excess return on a portfolio to the excess return on the market, so that:

Rpᵢ − Rfᵢ = α + β(Rmᵢ − Rfᵢ) + εᵢ   (7.19)

where Rpᵢ is the return on a portfolio, Rmᵢ is the market return, and Rfᵢ is the risk-free rate. This can also be written as:

Rpᵢ* = α + βRmᵢ* + εᵢ

so that Rpᵢ* is the excess return on the portfolio and Rmᵢ* is the excess return on the market.

Note that both coefficients are important to practitioners. The slope β measures the sensitivity to changes in the market and the contribution of the market premium E[Rm*] to the overall return on the portfolio because:

E[Rp*] = α + βE[Rm*]   (7.20)

The intercept α is a measure of the abnormal return. This coefficient measures the return that the portfolio generates in excess of what would be expected given its exposure to the market risk premium βE[Rm*].

As an illustration, CAPM is estimated using 30 years of monthly data between 1989 and 2018. The portfolios measure the value-weighted return to all firms in a sector. The sectors include banks, technology companies, and industrials. The market portfolio measures the return on the complete US equity market. The risk-free rate is the interest rate of a one-month US T-bill.

Table 7.1 reports the parameter estimates, standard errors, t-statistics, p-values and 95% confidence intervals for the estimated parameters. The CAPM βs range from 0.569 for the beer and liquor sector to 1.414 in the computers sector. All estimates of β are statistically different from 0.

The hypothesis H₀: α = 0 can be equivalently tested using any of the three reported measures: the t-statistic, the p-value, or the confidence interval. The t-statistic is the ratio of the estimated parameter to its standard error. The standard error of α̂ is estimated using

s.e.(α̂) = √(s²(μ̂_X² + σ̂_X²)/(n σ̂_X²))   (7.21)

where μ̂_X is the mean of X and σ̂_X² = n⁻¹ Σᵢ₌₁ⁿ (xᵢ − x̄)².

When using the t-statistic, values larger than 1.96 in absolute value indicate that the null should be rejected. If using the p-value, values smaller than 0.05 indicate rejection of the null. Finally, when using the confidence interval, the null is rejected when 0 is not in the confidence interval.

The final column of Table 7.1 reports the R² of the model. This is a measure of the fit of the model and reports the percentage of the total variation in Rp* that can be explained by Rm*. In linear regressions with a single explanatory variable, the R² is equal to the squared sample correlation between the dependent and the explanatory variable:

R² = Corr²[Rp*, Rm*]   (7.22)

It also can be used to understand how the variance of the model error is related to the variance of the dependent variable, because:

σ̂² = (1 − R²)σ̂_Y²

This measure is more useful when examining models with more than one explanatory variable. A large R² indicates that a sector is closely coupled to the market. For example, electrical equipment and wholesale both have large R², which indicates that the returns of these sectors are tightly coupled with the market return.

Figure 7.5 Plot of the monthly portfolio returns on the beer and liquor and electrical equipment sectors against the market (x-axis) between 1989 and 2018. The line is the fitted regression.

Figure 7.5 plots the excess market returns and the excess returns for electrical equipment and beer and liquor. Overall, the patterns among the returns do not look very different. However, the beer and liquor portfolio had a large negative return when the market had its largest return in this sample, which lowers the slope and reduces the R².

7.6 APPLICATION: HEDGING

Estimating the optimal hedge ratio is a natural application of linear regression. Hedges are commonly used to eliminate undesirable risks. For example, a long-short equity hedge fund attempts to profit from differences in returns between firms, but not on the return to the overall market. However, the composition of the fund's long and short positions may produce an unwanted correlation between the fund's return and the market. A hedge can be used to remove the market risk.

In this case, the slope coefficient is directly interpretable as the hedge ratio that minimizes the variance of the unhedged risk.⁹ The regression model estimated is

Rpᵢ = α + βRhᵢ + εᵢ   (7.23)

where Rpᵢ is the return on the portfolio to be hedged and Rhᵢ is the return on the hedging instrument. The β is the optimal hedge ratio, and α measures the expected return on the hedged portfolio (i.e., the return on the original portfolio minus the scaled return of the hedging instrument).

Table 7.2 contains estimated hedge ratios for a set of widely traded exchange-traded funds (ETFs). These funds are all part of the Select Sector SPDR ETF family and target the return of a diversified industry portfolio. Meanwhile, the hedging instrument is the SPDR S&P 500 ETF, which tracks the return on the S&P 500.

The models are estimated using monthly data from 1999 until 2018. The first column reports the average return on the unhedged fund. The next two report the estimates of the regression coefficients. The t-statistics are reported below each coefficient. The estimate of α̂ has been multiplied by 12 so that it can be interpreted as the annual hedged portfolio return. Scaling α̂ has no effect on the t-statistic, and so hypothesis tests are similarly unaffected by this type of rescaling.

Note that all unhedged returns are positive, and five are statistically different from zero. The hedged returns, however, have mixed signs and are not statistically different from zero when using a 5% test (i.e., a t-statistic of 2). The third column is the estimate of the hedge ratio. The next two columns report the standard deviations of the unhedged (sp) and hedged (sh) portfolios. The quantity sh is the same as the standard error of the regression (s), which measures the standard deviation of the residuals in the model. It can be seen that the standard deviation of a hedged portfolio is often much lower than that of its unhedged counterpart. Furthermore, this reduction in standard deviation is largest for the models with the highest R².

⁹ The regression coefficient fully hedges the risk in the sense that the resulting portfolio is uncorrelated with the hedged factor. When factors have positive risk premia, this can be expensive, and it is also common to partially hedge to eliminate some, but not all, factor risk.
[Table: estimates of α, β, s_p, s_h, and R² for each fund, with t-statistics in parentheses below each coefficient; only the t-statistic rows are recoverable: (2.11) (1.01) (11.85); (1.50) (−0.25) (19.15); (1.46) (−0.58) (23.63); (2.22) (1.45) (7.04).]
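The minimum-variance hedge ratio in Equation (7.23) is simply the OLS slope, Cov(R_p, R_h)/Var(R_h). A minimal sketch with simulated return series (all numbers here are made up for illustration, not the table's data):

```python
import random

# Illustrative (simulated) unhedged portfolio and hedge-instrument returns.
random.seed(42)
r_h = [random.gauss(0.5, 4.0) for _ in range(120)]          # hedge instrument
r_p = [0.2 + 0.6 * x + random.gauss(0, 2.0) for x in r_h]   # unhedged portfolio

n = len(r_h)
mean_h = sum(r_h) / n
mean_p = sum(r_p) / n

# OLS slope = sample Cov(Rp, Rh) / Var(Rh): the minimum-variance hedge ratio.
cov_ph = sum((h - mean_h) * (p - mean_p) for h, p in zip(r_h, r_p)) / (n - 1)
var_h = sum((h - mean_h) ** 2 for h in r_h) / (n - 1)
beta = cov_ph / var_h

# Hedged return: short beta units of the hedge instrument.
hedged = [p - beta * h for p, h in zip(r_p, r_h)]

# The hedged series is (numerically) uncorrelated with the hedge instrument.
mean_hd = sum(hedged) / n
cov_check = sum((h - mean_h) * (x - mean_hd) for h, x in zip(r_h, hedged)) / (n - 1)
print(round(beta, 3), abs(cov_check) < 1e-9)
```

The zero covariance between the hedged portfolio and the hedge instrument is exactly the property described in footnote 9.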
7.7 APPLICATION: PERFORMANCE EVALUATION

Fund managers are frequently evaluated against a style benchmark to control for the typical performance of a fund with the same investment style. High-performing fund managers should outperform their benchmarks and thus generate positive α. Performance analysis can be implemented using a regression where the return on the fund is regressed on the return to its benchmark. α is the most important parameter in this application because it can be used to detect superior or inferior performance; β measures the sensitivity to the benchmark and so can be used to examine whether the benchmark is appropriate for the fund's return. A good choice of benchmark should have a β near one and a large R², because β = 1 indicates that the risks captured by the benchmark are similar to the fund's, and a large R² indicates that the benchmark is highly correlated with the fund's performance.

The choice of benchmark plays an important role when assessing whether a fund produces excess returns. If a fund invests in risks that are not captured by the benchmark, then the fund might appear to produce α (because the benchmark cannot account for all risk exposures). In practice, it is common to use multiple factors (between 3 and 11, depending on the fund type) to fully account for portfolio risk exposures. This is the topic of the next chapter.

Table 7.3 contains results from the regression:

R_f,i = α + β R_s,i + ε_i,    (7.24)

where R_f is the monthly return on the fund and R_s is the return on the style benchmark. The performance of four funds, each using a different investment style, is examined. Two of the funds, the Fidelity Contrafund and the JPMorgan Small Cap Growth Fund, invest in high-growth equities. The Fidelity fund invests in large-cap stocks, while the JPMorgan fund focuses on smaller firms. The other two funds, the PIMCO Total Return Fund and the PIMCO High Yield Fund, invest in bonds. The total return fund targets the total return to investing in US bonds, while the high-yield fund focuses on generating returns from non-investment grade securities. Each of the four funds has a different style, and so the benchmark is adjusted to match a given fund's investment objective. These regressions use monthly returns, and the reported excess performance is annualized by multiplying by 12. All monthly returns available for each fund, from the fund's launch until the end of 2018, are used in the analysis. The number of returns used to estimate the parameters in each model is reported in the table.

Estimates of α are greater than zero for all four funds, and three of these are statistically different from zero. Fidelity's Contrafund
                                 Style                    Market
                           α      β      R²        α      β      R²       n
Fidelity Contrafund      [values not recoverable]
JPMorgan Small Cap
  Growth A             1.075  0.933  0.912     3.012  1.094  0.602     329
PIMCO Total Return
  Fund                 1.172  1.055  0.891     6.565  0.050  0.031     379
PIMCO High Yield
  Fund                 1.713  0.919  0.902     4.864  0.326  0.409     312
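The style regression (7.24), with α annualized by multiplying by 12, can be sketched as follows. The fund and benchmark series here are simulated for illustration; they are not the Table 7.3 data:

```python
import random

# Simulated monthly fund and style-benchmark excess returns (illustrative only).
random.seed(7)
bench = [random.gauss(0.8, 4.0) for _ in range(240)]
fund = [0.09 + 1.0 * b + random.gauss(0, 1.5) for b in bench]  # true monthly alpha 0.09

n = len(bench)
mb = sum(bench) / n
mf = sum(fund) / n

# OLS slope and intercept of the fund return on the benchmark return.
beta = (sum((b - mb) * (f - mf) for b, f in zip(bench, fund))
        / sum((b - mb) ** 2 for b in bench))
alpha_monthly = mf - beta * mb
alpha_annual = 12 * alpha_monthly  # annualize by multiplying by 12, as in the text
print(round(beta, 2), round(alpha_annual, 2))
```

A β̂ near one with a large R² would indicate a well-chosen benchmark; a positive and statistically significant annualized α̂ would indicate outperformance.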
has outperformed its benchmark by over 4% per year during the 50 years in the sample.10 However, the Contrafund also has the smallest β estimate and R². This suggests that the benchmark portfolio used might not be appropriate for the strategy used by this fund. For the other three funds, the R² values show that the benchmarks are highly correlated with the fund returns. The β̂ are all close to 1, which also indicates that the funds' returns closely comove with their benchmarks.

The right column repeats the exercise using the return on the S&P 500 benchmark. The market appears to be nearly as good as the large growth style return in adjusting the return on Fidelity's Contrafund, and the two R² values are similar. The estimated α is somewhat larger and is still significantly larger than 0. It has a lower correlation with the small-cap fund because the market portfolio is, by construction, a large-cap portfolio. The conclusion is, however, the same, and the α is not statistically different from 0.

The R² on the bond funds are both low, and near 0 for the PIMCO Total Return Fund. The poor fit of the market model indicates that the return on the equity market is not a good benchmark to use when assessing the performance of these bond funds.

7.8 SUMMARY

This chapter introduces linear regression through a focus on models that contain a single explanatory variable. Parameter estimators in linear regression models have simple closed forms that depend on only the first two moments of Y and X. The slope parameter is closely related to the correlation between the dependent and explanatory variables, and the slope is zero if and only if the two variables are uncorrelated.

Deriving the estimators for these only requires one trivial assumption: that the explanatory variable has some variation. Interpreting the results of an estimated regression requires four additional assumptions. The most important of these, and the most difficult to verify, is the conditional mean assumption:

E[ε_i | X_i] = 0

This assumption rules out some empirically relevant applications (e.g., when the sample has selection bias or if X and Y are simultaneously determined). The other assumptions (i.e., that the data are generated by iid random variables, have no outliers, and are homoscedastic) are simpler to verify using either graphical analysis or formal tests.

When these assumptions are satisfied, then β can be interpreted as the change in Y per unit change in X. The OLS estimators of α and β also satisfy a CLT and so have a normal distribution in large samples. The CLT allows hypothesis testing using t-tests and for confidence intervals to be constructed. The t-statistic, a t-test for the null hypothesis that the parameter is zero, is frequently reported to indicate whether a parameter is statistically significant. Finally, the chapter presents three standard applications of regression: factor sensitivity estimation, hedging, and fund performance evaluation.

10 These funds have traded for more than 25 years. The estimates of α are not representative because this sample suffers from survivorship bias. The funds have all survived for an extended period, which is only likely if a fund meets or exceeds its performance target.
QUESTIONS

Practice Questions

7.9 Find the OLS estimators for the following data:

    x      y
    1    0.35
    2    6.46
    3    4.09
    4    7.34

7.11 In a CAPM that regresses Wells Fargo on the market, the coefficients on monthly data are α = 0.1 and β = 1.2. What is the expected excess return on Wells Fargo when …

7.12 You fit a CAPM that regresses the excess return of Coca-Cola on the excess market return using 20 years of monthly data. You estimate α̂ = 0.71, β̂ = 1.37, s² = 20.38, σ̂²_x = 19.82, and μ̂_x = 0.71.
    a. What are the standard errors of α̂ and β̂?
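The closed forms from this chapter, β̂ = S_xy/S_xx and α̂ = ȳ − β̂x̄, can be applied directly to the data in Question 7.9. A sketch of the computation:

```python
# OLS estimators for the data in Question 7.9.
x = [1, 2, 3, 4]
y = [0.35, 6.46, 4.09, 7.34]

n = len(x)
x_bar = sum(x) / n                      # 2.5
y_bar = sum(y) / n                      # 4.56

# Sums of cross-products and squared deviations around the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

beta_hat = s_xy / s_xx                  # 9.3 / 5 = 1.86
alpha_hat = y_bar - beta_hat * x_bar    # 4.56 - 1.86 * 2.5 = -0.09
print(round(beta_hat, 2), round(alpha_hat, 2))
```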
ANSWERS

7.2 The model that allows differences in the slope would be R_i = α + β R_m,i + γ I_FOMC R_m,i + ε_i, where I_FOMC is 1 on FOMC days and 0 otherwise. If γ is not zero, then the slope is different on FOMC days. This can be extended to both parameters by estimating the model R_i = α + γ_1 I_FOMC + β R_m,i + γ_2 I_FOMC R_m,i + ε_i.

7.3 The 1 − α confidence interval contains the set of null hypotheses that could not be rejected using a test size of α.

7.4 The t-stat is a t-test of the null H_0: β = 0 against the alternative H_1: β ≠ 0.

7.6 The only time β̂ = 0 is when Cov(X, Y) = 0, and so the two variables are uncorrelated. The R² is the squared correlation and so must be 0.

7.7 False. The optimally hedged portfolio is uncorrelated with the return on the hedge. It is not necessarily independent. It would be independent if the returns were jointly normally distributed.

7.8 If the benchmark is poor, so that the evaluation model is misspecified, then the slope β will be mismeasured. As a result, some of the compensation for systematic risk (which is captured by β) may be attributed to skill α.

Solved Problems

7.9 Doing the basic needed calculations:
Distinguish between the relative assumptions of single and multiple regression.

Interpret regression coefficients in a multiple regression.

Construct, apply, and interpret joint hypothesis tests and confidence intervals for multiple coefficients in a regression.
Linear regression with a single explanatory variable provides key insights into OLS estimators and their properties. In practice, however, models typically use multiple variables where it is possible to isolate the unique contribution of each explanatory variable. A model built with multiple variables can also distinguish the effect of a novel predictor from the set of explanatory variables known to be related to the dependent variable.

This chapter introduces the k-variate regression model, which enables the coefficients to measure the distinct contribution of each explanatory variable to the variation in the dependent variable. This structure allows the estimated parameters to be interpreted while holding the value of the other included variables constant.

The first section of this chapter shows how the structure of the OLS estimator leads to this interpretation in a model with two correlated explanatory variables. The coefficients in this model can be estimated using three single-variable regressions. This stepwise estimation procedure provides an intuitive understanding of the effect measured by a regression coefficient.

The R², introduced in the previous chapter, is a simple measure that describes the amount of variation explained by a model. It is also the basis of the F-test, which extends the t-test to allow multiple hypotheses to be tested simultaneously. The F-test accounts for the uncertainty in the coefficient estimators, including the correlation between them.

The key ideas in this chapter are illustrated through an extension of CAPM, which incorporates additional risk factors that systematically affect the cross section of asset returns. These multi-factor models are commonly used in risk management to determine sensitivities to economically important investment characteristics and to risk-adjust returns when assessing fund manager performance.

REGRESSION WITH INDISTINCT VARIABLES

If some explanatory variables are functions of the same random variable (e.g., if X_2 = X_1²), then:

Y_i = α + β_1 X_1i + β_2 X_2i + ε_i

In this case, it is not possible to change X_1 while holding the other variables constant. The interpretation of the coefficients in models with this structure depends on the value of X_1, because a small change of ΔX_1 in X_1 changes Y by (β_1 + 2 β_2 X_1)ΔX_1. This effect captures the direct, linear effect of a change in X_1 through β_1 and its nonlinear effect through β_2.

When all explanatory variables are distinct (i.e., no variable is an exact function of the others), then the coefficients are interpreted as holding all other values fixed. For example, β_1 is the effect of a small increase in X_1 holding all other variables constant.

8.1 MULTIPLE EXPLANATORY VARIABLES

The k-variate regression model is

Y_i = α + β_1 X_1i + β_2 X_2i + ... + β_k X_ki + ε_i,    (8.1)

where there are k explanatory variables and a single constant α.1

While the parameter estimators of α and the βs have closed forms, they can be difficult to express without matrices and linear algebra. However, it is possible to understand how parameters are estimated in a model with two explanatory variables. Suppose the model is

Y_i = α + β_1 X_1i + β_2 X_2i + ε_i

The OLS estimator for β_1 can be computed using three single-variable regressions. The first regresses X_1 on X_2 and retains the residuals from this regression:

X̃_1i = X_1i − δ̂_0 − δ̂_1 X_2i,    (8.2)

where δ̂_0 and δ̂_1 are OLS parameter estimators and X̃_1i is the residual of X_1i.

The second regression does the same for Y:

Ỹ_i = Y_i − γ̂_0 − γ̂_1 X_2i,    (8.3)

where γ̂_0 and γ̂_1 are OLS parameter estimators, and Ỹ_i is the residual of Y_i.

The OLS estimate β̂_1 is identical to the estimate computed by fitting the full model using Excel or any other statistical software package.2

1 Historically, models with multiple explanatory variables are called multiple regressions. This distinction is not important, and so models with any number of explanatory variables are referred to as linear regressions or simply regressions.

2 This model does not include a constant because Ỹ and X̃ have mean zero by construction. If a constant is included, its estimated value is exactly 0 and the estimate of β̂_1 is unaffected.
The final regression estimates the linear relationship (i.e., β̂_1) between the components of Y and X_1 that are uncorrelated with (and so cannot be explained by) X_2.

Finally, the OLS estimate of β̂_2 can be computed in the same manner by reversing the roles of X_1 and X_2 (i.e., so that β̂_2 measures the effect of the component in X_2 that is uncorrelated with X_1).

This stepwise estimation can be used to estimate models with any number of regressors. In the k-variable model:

Y_i = α + β_1 X_1i + β_2 X_2i + ... + β_k X_ki + ε_i,    (8.5)

the OLS estimate of β̂_1 is computed by first regressing each of X_1 and Y on a constant and the remaining k − 1 explanatory variables. The residuals from these two regressions are mean zero and uncorrelated with the remaining k − 1 explanatory variables. The OLS estimator of β̂_1 is then estimated by regressing the residuals of Y on the residuals of X_1.

Additional Assumptions

Extending the model to multiple regressors requires one additional assumption, along with some modifications, to the five assumptions in Section 7.3. The distinct variation in each regressor is exploited to estimate the model parameters. If this assumption is violated, then the variables are perfectly collinear. This assumption is simple to verify in practice, and most statistical software packages produce a warning or an error when the explanatory variables are perfectly collinear.

The remaining assumptions require simple modifications to account for the k explanatory variables.

• All variables must have positive variances so that σ²_Xj > 0 for j = 1, 2, ..., k.
• The error is assumed to have mean zero conditional on the explanatory variables (E[ε_i | X_1i, X_2i, ..., X_ki] = 0).
• The random variables (X_1i, X_2i, ..., X_ki) are assumed to be iid.
• The assumption of no outliers is extended to hold for each explanatory variable so that E[X⁴_ji] < ∞ for j = 1, 2, ..., k.
• The constant variance assumption is similarly extended to hold for all explanatory variables (E[ε²_i | X_1i, X_2i, ..., X_ki] = σ²).

The interpretation of each of the modified assumptions is unchanged from the single explanatory variable model.
REGRESSION STEP-BY-STEP

The Appendix contains a table of the annual excess returns on the shipping container industry portfolio (i.e., R_p*), the excess return on the market (i.e., R_m*), and the return on a value portfolio (i.e., R_v). It also contains the squared deviations from the mean of each variable (labeled E²_p, E²_m, and E²_v, for squared error) and the cross-products of the deviations (E_p E_m and E_v E_m). The sums of these columns, when divided by the sample size, are the estimates of the variances (squares) and covariances (cross-products) of these series.

If this data set were entered into Excel and a regression run using the Analysis ToolPak add-in, the fitted model would be

R_p* = 4.06 + 0.72 R_m* + 0.101 R_v
      (4.16)  (0.22)     (0.29)

where the standard error is in parentheses below each coefficient.

Here the stepwise procedure is applied to verify the estimate of the coefficient on the value factor.

The first step is to estimate the coefficients of the model that removes the effect of the market from the shipping container portfolio:

R_p* = δ_0 + δ_1 R_m* + η_p

The slope (i.e., δ_1) depends on the covariance between the industry portfolio and the market portfolio divided by the variance of the market portfolio. This is estimated using the sum of the cross-products between the two and the sum of the squared market return deviations. (Note that it is unnecessary to divide each by n − 1 to produce a covariance or variance estimate, because this cancels in the ratio.)

δ̂_1 = 4639 / 6665 = 0.696

The intercept depends on the estimate δ̂_1 and the means of the two series:

δ̂_0 = 8.58 − 0.696 × 6.03 = 4.39

(Continued)

These same calculations are used to estimate the regression of the value factor return on the excess return of the market:

R_v = γ_0 + γ_1 R_m* + η_v

The parameter estimates are

γ̂_1 = −1464 / 6665 = −0.220,   γ̂_0 = 1.94 − (−0.220) × 6.03 = 3.27

Finally, the effect of the market is removed using the estimated coefficients to produce the residuals. The values of this residual series are in the second table in the appendix. The first two columns sum to exactly 0, which is a consequence of fitting the first-step model. The final three columns contain the squares and cross-products of the residual series. These are the key inputs into the coefficient estimate for the value factor, which depends on the covariance between the two residuals divided by the variance of the value factor residual. As before, the n − 1 term can be omitted in each because it cancels, and the estimate of the regression coefficient is
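The box's point that the (n − 1) normalization cancels in the slope can be sketched directly: the ratio of the sum of cross-products to the sum of squared deviations equals the ratio of the sample covariance to the sample variance. The numbers below are illustrative, not the appendix data:

```python
# Slope from raw sums versus slope from sample moments: the (n - 1)
# factor cancels, so the two are identical (illustrative numbers).
rm = [4.0, -2.0, 7.0, 1.0, 10.0]   # market excess returns (made up)
rp = [3.5, -1.0, 6.0, 2.0, 8.5]    # industry portfolio returns (made up)

n = len(rm)
m_rm = sum(rm) / n
m_rp = sum(rp) / n
sum_cross = sum((a - m_rm) * (b - m_rp) for a, b in zip(rm, rp))
sum_sq = sum((a - m_rm) ** 2 for a in rm)

slope_from_sums = sum_cross / sum_sq
slope_from_moments = (sum_cross / (n - 1)) / (sum_sq / (n - 1))
print(abs(slope_from_sums - slope_from_moments) < 1e-12)
```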
8.2 APPLICATION: MULTI-FACTOR RISK MODELS

Market variation is an important determinant of the variation in asset returns. However, other factors that capture specific characteristics are also useful in explaining differences in returns across firms or industries.

The Fama-French three-factor model3 is a leading example of a multi-factor approach. This model expands upon CAPM by including two additional factors: the size factor (which captures the propensity of small-cap firms to generate higher returns than large-cap firms) and the value factor (which measures the additional return that value4 firms earn above growth firms).

Both the size and value factors are measured using the returns of long-short portfolios that buy firms with the desirable feature (i.e., small-cap or value firms) and short sell those with the undesirable feature (i.e., large-cap or growth5 firms).

Table 8.1 extends the results for CAPM in Table 7.1 to the Fama-French three-factor model. The table contains parameter estimates for the model:

R_p*,i = α + β_m R_m*,i + β_s R_s,i + β_v R_v,i + ε_i,

where R_m* is the excess return on the market, R_v measures the excess return of value firms over growth firms, and R_s measures the excess return of small-cap firms over large-cap firms.6 The returns of all three factors and the industry portfolios are available in Ken French's data library.7

The t-statistic is reported below each coefficient in parentheses. The coefficients on the market return are all similar in magnitude to those in Table 7.1. They are all positive, close to one, and have large t-statistics, indicating that the returns on most industries comove closely with the overall market return.

The coefficients on the size and value factors, however, have a different structure. These coefficients are all smaller in magnitude and range from −0.6 to 0.8. Positive estimates indicate that the industry has a positive exposure to the factor. For example, the wholesale industry has a coefficient of 0.26 on the size factor, indicating that a 1% factor return increases the return on this portfolio by 0.26%. This suggests one of two possibilities: either the firms in this industry are relatively small firms, or their profits depend crucially on the same factors as those affecting the performance of small firms. Note that these two explanations for the positive exposure are not exclusive, and both are likely true. The computer industry portfolio, on the other hand, has a large negative coefficient on the value factor. This coefficient suggests that the typical firm in this industry is a growth firm.
Table 8.1 (excerpt)

                       α       β_m      β_s      β_v      R²
Banking             −0.172    1.228   −0.156    0.838    0.764
                   (−1.09)  (32.26)  (−3.03)  (15.15)
Beer and Liquor      0.517    0.628   −0.394   −0.033    0.314
                    (2.44)  (12.28)  (−5.67)  (−0.44)
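Estimating the three-factor model is a least-squares fit with a constant and three regressors. A sketch on simulated factor returns (the series and coefficient values are made up for illustration, not the Table 8.1 data):

```python
import numpy as np

# Sketch of estimating the three-factor model on simulated returns.
rng = np.random.default_rng(1)
n = 600
rm = rng.normal(0.6, 4.0, n)                  # market excess return
rs = rng.normal(0.2, 3.0, n)                  # size factor return
rv = rng.normal(0.3, 3.0, n)                  # value factor return
rp = -0.1 + 1.1 * rm + 0.3 * rs - 0.5 * rv + rng.normal(0, 2.0, n)

# Regress the industry portfolio excess return on a constant and the
# three factor returns.
X = np.column_stack([np.ones(n), rm, rs, rv])
coef = np.linalg.lstsq(X, rp, rcond=None)[0]
alpha, beta_m, beta_s, beta_v = coef
print(np.round(coef, 2))
```

A positive β̂_s would indicate small-cap-like exposure; a negative β̂_v, as for the computer industry in the text, would indicate growth-like exposure.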
CONTROLS

Controls are explanatory variables that are known to have a clear relationship with the dependent variable. It is frequently the case that the parameter values on the controls are not directly of interest.

The top panels contain return plots of the computer industry portfolio against the return on the size factor (left) or the return on the value factor (right). The solid line shows the fitted regression of the excess industry portfolio return on the factor return. For example, the model fit in the top-left panel is

R_p*,i = α̂ + β̂_s R_s,i
[Figure 8.1 panels: "Computers on Size, Given Market and Value" (x-axis: Size | Market, Value) and "Computers on Value, Given Market and Size" (x-axis: Value | Market, Size); the top-left fit has β̂_s = 0.224.]
Figure 8.1 Illustration of the differences between estimates of the direct effect and the effect given other relevant variables. Each column contains plots that depict the relationship between the computer industry returns and either the size factor (left) or the value factor (right). The top panel in each column plots the portfolio returns (y-axis) against the factor returns and shows the regression line computed by regressing the portfolio return on only the factor return. The middle panel shows the relationship between the portfolio return and the factor return when controlling for the market return. The values plotted are the residuals from regressing the portfolio return on the market return (y-axis) and the factor return on the market return (x-axis). The bottom panel shows the relationship when controlling for both the market and the other factor. The values shown are also residuals where the effect of the market and the other factor are eliminated.
Controlling for the market removes its direct effect from both the industry portfolio return and the factor return. The slope β_s is then estimated in a regression that includes either the size or the value factor returns and the market return. For example, the coefficient in the center-left panel is estimated using the regression:

R_p*,i = α + β_m R_m*,i + β_s R_s,i + ε_i

Controlling for the market affects both coefficients: The size effect is halved, while the value effect is reduced by 30%. These changes are not surprising given the correlations between the size/value returns and the market return, which are reported in Table 8.2.

The final row shows the effect of controlling for both the returns of the market and the other factor. Only one model is estimated to produce the slopes in both panels:

R_p*,i = α + β_m R_m*,i + β_s R_s,i + β_v R_v,i + ε_i

The data points plotted are the residuals from regressions on the market return and the other factor return, and so control for the effects of variables that are not shown. For example, the points in the bottom-left panel depict the relationship between the computer industry portfolio returns and the size factor returns, controlling for the market and the value factor.

The conditional correlation coefficient is the correlation between the residuals R̃_s,i and R̃_v,i, which are calculated from a regression of the factor return (i.e., R_s,i or R_v,i) on the market return. For example:

R̃_s,i = R_s,i − δ̂ − γ̂ R_m*,i,

where δ̂ and γ̂ are estimated using OLS. These residuals have "purged" the effect of the market (i.e., the correlation computed using the residuals is conditional on the market). This conditional correlation of −22.8% is reported in the bottom panel of Table 8.2.

Table 8.2 The top panel contains the estimated correlation matrix of the factor returns. The bottom panel contains the estimated conditional correlation matrix for the returns to the size portfolio and the value portfolio controlling for the market return.

Correlation    Market     Size      Value
Market          1         0.216    −0.176
Size            0.216     1        −0.256
Value          −0.176    −0.256     1

8.3 MEASURING MODEL FIT

Minimizing the squared residuals decomposes the total variation of the dependent data into two distinct components: one that captures the unexplained variation (due to the error in the model) and another that measures the explained variation (which depends on both the estimated parameters and the variation in the explanatory variables).

The total variation in the dependent variable is called the total sum of squares (TSS), which is defined as the sum of the squared deviations of Y_i around the sample mean:

TSS = Σ_{i=1}^{n} (Y_i − Ȳ)²

Transforming the data and the fitted values by subtracting the sample mean from both sides:

Y_i − Ȳ = Ŷ_i − Ȳ + ε̂_i

ensures that both sides of the equation are mean zero. Finally, squaring and summing these values over all i from 1 to n yields

Σ_{i=1}^{n} (Y_i − Ȳ)² = Σ_{i=1}^{n} (Ŷ_i − Ȳ + ε̂_i)²
Σ_{i=1}^{n} (Y_i − Ȳ)² = Σ_{i=1}^{n} (Ŷ_i − Ȳ)² + Σ_{i=1}^{n} ε̂_i²    (8.8)

Note that the second line is missing the cross term between (Ŷ_i − Ȳ) and ε̂_i. This cross term is 0 because the sample correlation between the fitted values Ŷ_i and the estimated residuals ε̂_i is 0.8
This leads to the definition of R² in a model with any number of included explanatory variables:

R² = ESS/TSS = 1 − RSS/TSS    (8.10)

R² measures the percentage of the variation in the data that can be explained by the model. This is equivalently measured by the ratio of the explained variation to the total variation (which is equal to one minus the ratio of the residual variation to the total variation). Because OLS estimates parameters by finding the values that minimize the RSS, the OLS estimator also maximizes R².

When a model has multiple explanatory variables, R² is a complicated function of the correlations among the explanatory variables and those between the explanatory variables and the dependent variable. However, R² in a model with multiple regressors is the squared correlation between Y_i and the fitted value Ŷ_i:

R² = Corr²[Y, Ŷ]

This interpretation of R² provides another interpretation of the OLS estimator: The regression coefficients β̂_1, ..., β̂_k are chosen to produce the linear combination of X_1, ..., X_k that maximizes the correlation with Y.

A model that is completely incapable of explaining the observed data has an R² of 0 (because all variation is in the residuals). A model that perfectly explains the data (so that all residuals are 0) has an R² of 1. All other models must produce values that fall between these two bounds, so that R² is never negative and always less than 1.

As an example, suppose the results of a model fit are as follows:

  i       Y_i      Ŷ_i     (Y_i − Ȳ)²   (Ŷ_i − Ȳ)²
  4      5.017    4.948     16.136       15.587
  5      2.272    2.450      1.618        2.103
  6     −0.596    0.062      2.547        0.880
  7      2.329    2.721      1.766        2.962
  8      4.295    3.894     10.857        8.375
  9     −0.875   −1.238      3.516        5.009
 10      1.259    1.538      0.067        0.289
 11      0.442    0.425      0.311        0.331
 12     −1.105   −1.283      4.431        5.212
 13      3.524    1.116      6.371        0.013
 14      0.971    1.932      0.001        0.869
 15      2.377    2.259      1.896        1.585
 16      5.445    6.163     19.758       26.657
 17     −0.331    0.044      1.772        0.914
 18     −3.638   −2.010     21.511        9.060
 19      2.949    0.375      3.799        0.391
 20      0.978    0.421      0.000        0.335
Average  1.000
SUM                         142.977      123.988

[Rows 1–3 of the table are not recoverable.]

While R² is a useful method to assess model fit, it has three important limitations.

1. Adding a new variable to the model always increases the R². For example, if the original model is Y_i = α + β_1 X_1i + ε_i and the expanded model Y_i = α + β_1 X_1i + β_2 X_2i + ε_i is estimated, then the R² of the expanded model must be greater than or equal to the R² of the original model. This is because the expanded model always has the same TSS and
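The decomposition in (8.8) and the identity R² = Corr²[Y, Ŷ] can be verified numerically on simulated data:

```python
import numpy as np

# Verify TSS = ESS + RSS and R^2 = Corr^2(Y, Y-hat) on simulated data.
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)

# Fit Y on a constant and X by least squares.
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta
resid = y - y_hat

tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
rss = np.sum(resid ** 2)                # residual sum of squares

r2 = 1 - rss / tss
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
print(np.isclose(tss, ess + rss), np.isclose(r2, ess / tss), np.isclose(r2, r2_corr))
```

The cross term vanishes because the residuals are uncorrelated with the fitted values, which is what makes the decomposition exact.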
because the expanded model always has the same TSS and
where s.e.(β̂_j) is the estimated standard error of β̂_j. While the precise formula for the standard error is complicated in the k-variable model, it is a standard output in all statistical software packages.

The first limitation is addressed (in a limited way) by the adjusted R², which is written as R̄². The adjusted R² modifies the standard R² to account for the degrees of freedom used when estimating model parameters, so that:

R̄² = 1 − [RSS/(n − k − 1)] / [TSS/(n − 1)],

where n is the number of observations in the sample and k is the number of explanatory variables included in the model (not including the constant term α). Adjusted R² can also be expressed as:

R̄² = 1 − [(n − 1)/(n − k − 1)] (1 − R²),    (8.11)

However, the t-test is not directly applicable when testing complex hypotheses that involve more than one parameter, because the parameter estimators can be correlated.9 This correlation complicates extending the univariate t-test to tests of multiple parameters.

Instead, the more common choice is an alternative called the F-test. This type of test compares the fit of the model (measured using the RSS) when the null hypothesis is true relative to the fit of the model without the restriction on the parameters assumed by the null. The alternative hypothesis is that at least one of the parameters is

between them. The confidence regions in Figure 8.2 are all con

10 Transforming between the two forms relies on two identities: 1 − R²_u = RSS_u/TSS and R²_u − R²_r = (1 − R²_r) − (1 − R²_u), so that R²_u − R²_r = (RSS_r − RSS_u)/TSS.

11 Confidence intervals are used with a single parameter. "Confidence region" is the preferred term when constructing the regions where the null is not rejected for more than one parameter.
where β_1 = β_2 = 1. The errors are standard normal random variables, and the explanatory variables are bivariate normal with covariance matrix

[ σ_11       ρ σ_1 σ_2 ]
[ ρ σ_1 σ_2  σ_22      ]

The baseline specification uses a bivariate standard normal so that σ_11 = σ_22 = 1 and ρ = 0. The confidence region in the baseline specification with uncorrelated regressors is shown in the top-left panel. This region is spherical, which reflects the lack of correlation between the estimators β̂_1 and β̂_2.

The top-right panel is generated from a model where the correlation between the explanatory variables is ρ = 50%. The positive correlation between the variables creates a negative correlation between the estimated parameters. When the explanatory variables are positively related and β_1 is overestimated (i.e., so β̂_1 > β_1), then β̂_2 tends to underestimate β_2 to compensate. The negative correlation mitigates the common component in the two correlated explanatory variables.

The bottom-left panel shows the case where the correlation is large and negative (−80%). The confidence region is upward-sloping, which reflects the positive correlation between the parameter estimators. Because the variables have a negative correlation, β̂_2 will tend to overestimate β_2 to compensate if β̂_1 is overestimated. The compensation improves the fit because if X_1 is above its mean, X_2 will tend to be below its mean due to the negative correlation.

The bottom-right panel shows a different scenario where ρ = 50% and σ_11 = 0.5. The length of the region for each variable is proportional to the inverse of its variance; therefore, halving the variance of X_1 doubles the width of the region along the x-axis relative to the upper-right panel.

The F-statistic of a Regression

The F-statistic of a regression is the multivariate generalization of the t-statistic of a coefficient. It is an F-test of the null that all explanatory variables have 0 coefficients:

H_0: β_1 = β_2 = ... = β_k = 0

The alternative hypothesis is

H_1: β_j ≠ 0

for some j in 1, 2, ..., k. The test should reject whenever one or more of the regression coefficients are nonzero.

The null hypothesis does not assume any value for the intercept, α, and so the restricted model only includes an intercept:

Y_i = α + ε_i
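The F-statistic described here compares the RSS of the restricted (intercept-only) and unrestricted models. A sketch on simulated data, using the standard form F = [(RSS_r − RSS_u)/k] / [RSS_u/(n − k − 1)]:

```python
import numpy as np

# Sketch of the regression F-statistic: compare the restricted model
# (intercept only) with the unrestricted model via residual sums of squares.
rng = np.random.default_rng(5)
n, k = 250, 2
x = rng.normal(size=(n, k))
y = 0.3 + 0.4 * x[:, 0] + 0.0 * x[:, 1] + rng.normal(size=n)

X_u = np.column_stack([np.ones(n), x])   # unrestricted: constant + k regressors
X_r = np.ones((n, 1))                    # restricted: constant only

rss_u = np.sum((y - X_u @ np.linalg.lstsq(X_u, y, rcond=None)[0]) ** 2)
rss_r = np.sum((y - X_r @ np.linalg.lstsq(X_r, y, rcond=None)[0]) ** 2)

# F-statistic with k restrictions and n - k - 1 residual degrees of freedom.
f_stat = ((rss_r - rss_u) / k) / (rss_u / (n - k - 1))
print(round(f_stat, 2))
```

Because only the first regressor matters in this simulation, the F-statistic is driven almost entirely by that coefficient, as in the approximation discussed below.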
Figure 8.2 Four bivariate confidence regions. Each confidence region is centered on the estimated parameters β̂_1 (x-axis) and β̂_2 (y-axis). The top-left panel shows the confidence region when the two explanatory variables in the model are uncorrelated. The top-right panel shows the confidence region when the variables have identical variances but a correlation of 50%. The bottom-left panel shows the confidence region when the correlation is −80%. The bottom-right shows the confidence region when the correlation is 50% but the variance of β̂_1 is twice the variance of β̂_2. The dark bars along the axes show individual (univariate) 95% confidence intervals for β̂_1 and β̂_2.
/V /V
• Using an F-test of the null H₀: β₁ = β₂ = 0, or
• Testing each of the two coefficients separately using t-tests of the null H₀: βⱼ = 0 for j = 1, 2.

Under most circumstances, the conclusions drawn from these two methods agree (i.e., both testing methods should indicate that at least one of the parameters is different from zero, or both should fail to reject their respective null hypotheses). Disagreement between the two types of tests is driven by one of two issues. When the F-test rejects the null but the t-tests do not, then it is likely that the data are multicollinear. The F-test tells us that the model is providing a useful fit of the data. The t-tests indicate that the fit of the model cannot be uniquely attributed to either of the two explanatory variables. For example, if X₁ and X₂ are extremely highly correlated (say 98%+), then the model fit from β̂₁ = β̂₂ = 1 is difficult to distinguish from the fit of other coefficient combinations with the same total effect (e.g., β̂₁ = 2 and β̂₂ = 0).

On the other hand, if the F-test fails to reject but one of the t-tests does, then this indicates that one variable has a very small effect. If X₁ and X₂ are uncorrelated, it can be shown that:

F = (T₁² + T₂²)/2

where F is the F-test statistic and Tⱼ is the t-test statistic for β̂ⱼ (here T₁ is the test statistic of the variable that is uncorrelated with the other included variables). If variable 2 has no effect on Y (i.e., T₂ ≈ 0), then F ≈ T₁²/2. The F-test will only indicate significance if T₁ is moderately large. For example, when n = 250, then T₁² would need to be larger than 6.06 to reject the null if T₂ = 0, which would require |T₁| > 2.46. Treated as a single test, any value of |T₁| > 1.96 indicates that the null H₀: β₁ = 0 should be rejected, and so when 1.96 < |T₁| < 2.46, the tests disagree.

The F-statistic of the regression is computed as

F = [(TSS − RSS_U)/k] / [RSS_U/(n − k − 1)] ~ F_{k,n−k−1}    (8.15)

Note that the numerator depends on the TSS because TSS = RSS_R when the restricted model has no explanatory variables. Also, note that q = k because every coefficient is being tested.

Table 8.3 reports the F-statistic for the three-factor models of the industry portfolio returns in Table 8.1. The portfolio returns are strongly related to the market, and so it is not surprising that all F-statistics indicate that the null of all coefficients being zero is rejected.

8.5 SUMMARY

This chapter describes the k-variable linear regression model, which extends the single explanatory variable model to account for additional sources of variation. All key concepts from the simple specification carry over to the general model.

Model fit is assessed using R², which measures the ratio of the variation explained by the model to the total variation in the data. While intuitive, this measure suffers from some important limitations: It never decreases when an additional variable is added to a model, and it is not interpretable when the dependent variable changes. The adjusted R² (R̄²) partially addresses the first of these concerns.

The same variation measures that appear in R² are used to test model parameters with F-tests. This test statistic measures the difference in the fit between a restricted model (which imposes the null hypothesis) and the unrestricted model. If the restriction is valid, the fits of the two models should be similar. When the restriction is invalid, the restricted model should fit the data worse than the unrestricted model so that the test statistic is large, and the null should be rejected. Most statistical software packages report the F-statistic of the regression, which tests the null hypothesis that the explanatory variables have zero coefficients.
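Equation 8.15 and the F = (T₁² + T₂²)/2 identity for uncorrelated regressors can be checked numerically. A sketch on synthetic data (the regressors are made exactly orthogonal in-sample, so the identity holds exactly rather than approximately):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 250, 2

# Build demeaned, exactly orthogonal regressors via QR.
z = rng.standard_normal((n, 2))
z -= z.mean(axis=0)
q, _ = np.linalg.qr(z)               # orthonormal columns, still mean zero
X = np.column_stack([np.ones(n), q])

y = X @ np.array([0.5, 0.4, 0.0]) + rng.standard_normal(n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
rss_u = resid @ resid
tss = ((y - y.mean()) ** 2).sum()    # RSS of the intercept-only model

F = ((tss - rss_u) / k) / (rss_u / (n - k - 1))  # Equation 8.15

s2 = rss_u / (n - k - 1)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t1, t2 = beta[1] / se[1], beta[2] / se[2]

# With in-sample orthogonal regressors, F equals (T1^2 + T2^2)/2 exactly.
print(abs(F - (t1**2 + t2**2) / 2) < 1e-8)
```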
QUESTIONS

8.2 Both R² and R̄² only increase when adding a new variable. True or false?

8.3 When is R̄² less than 0? Can this measure of fit be larger than 1?

8.4 The value of an F-test can be negative. True or false?

8.5 Suppose the F-test of a regression rejects, but you fail to reject either of the nulls H₀: β₁ = 0 or H₀: β₂ = 0 using the t-stat of the coefficient. Which values of ρ = Corr[X₁, X₂] make this scenario more likely?

8.6 Suppose you fit a model where both t-statistics are exactly +1.0. Could an F-statistic lead to rejection in this circumstance? Use a diagram to help explain why or why not.
Practice Questions
8.7 Construct 95% confidence intervals and p-values for the 8.11 Using the data below:
coefficients in the shipping containers industry portfolio
returns.
Trial Number y *1 *2
V; -0 .8 2 0.39
5 -3 .0 8
6 -7.1 -2 .0 8 1.39
W hat is the R2 in a model that regresses Y on X? W hat
7 -4.1 - 1 .0 6 0.75
would the R2 be in a model that regressed X on V?
0.14 -0 .6 3
8.10 A model was estim ated using daily data from the S&P 500
8 0 .0 2
from 1977 until 2017 which included five day-of-the-week 9 -6 .1 3 - 1.6 6 1.31
dummies (n = 10,087). The R2 from this regression was 10 0.74 0 .6 8 -0 .1 5
0.000599. Is there evidence that the mean varies with the
a. A pply O LS linear regression to find the param eters for
day of the w eek?
the following equation:
Y, = a + p-\XVl + (32X 2i + €j
ANSWERS

8.3 R̄² is always less than R² and so cannot be larger than 1. It can be smaller than 0 if the model explains virtually nothing and the degrees-of-freedom loss is sufficient to push its value negative.

8.4 False. An F-test measures the reduction in the sum of squared residuals that occurs in the expanded model and depends on RSS_R − RSS_U. This value is always positive because the unrestricted model must fit better than the restricted model.

8.5 This is most likely to occur when the regressors are highly correlated. If the regressors are positively correlated, then the parameter estimators of the coefficients will be negatively correlated. If both values are positive, this would lead to rejection by the F-test. Similarly, if the regressors were negatively correlated, then the estimators are positively correlated and the F will reject if one t is positive and the other is negative. The figure below shows the case for positively correlated regressors. The shaded region is the area where the F would fail to reject. The t-stats are outside this area even though neither is individually significant.

8.6 Yes. When the regressors are very positively correlated, then this can happen. The t-stats are small because the variables are highly collinear, so that the variation in the left-hand-side variable cannot be uniquely attributed to either. However, the F-stat is a joint test, and so if there is an effect in at least one variable, the F can reject even if the t-stats do not. See the image below, which shows the region where the F would fail to reject, and the two t-stats.
Solved Problems

8.7 The confidence interval is β̂ ± 1.96 × se. The p-value is 2(1 − Φ(|t|)), where Φ(·) is the standard normal CDF.

Coefficient | Estimate | t-stat | Std. Err. | Conf. Int.       | P-value
β_m         |  1.011   | 18.63  | 0.054     | [0.905, 1.117]   | 0.000
β_s         | −0.012   | −0.16  | 0.075     | [−0.159, 0.135]  | 0.873
β_v         |  0.276   |  3.51  | 0.079     | [0.121, 0.431]   | 0.000

8.8 The annualized α is 12α = 12 × 0.517 = 6.204%. The t-stat is unaffected because the standard error is also scaled by 12.

8.9 Because this is a regression with a single explanatory variable, the R² is the squared correlation. The correlation is 0.9/√3 = 0.519, and so the R² = 0.27. The R² is the same when X is regressed on Y, because the R² of a bivariate regression is the squared correlation regardless of which variable is on the left-hand side.
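The formulas used in 8.7 (interval β̂ ± 1.96 × se and p-value 2(1 − Φ(|t|))) can be reproduced with only the standard library; the helper below uses math.erf for the normal CDF and is applied to the β_m row's numbers:

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def summarize(estimate: float, std_err: float):
    """95% confidence interval and two-sided p-value for one coefficient."""
    t_stat = estimate / std_err
    ci = (estimate - 1.96 * std_err, estimate + 1.96 * std_err)
    p_value = 2.0 * (1.0 - norm_cdf(abs(t_stat)))
    return t_stat, ci, p_value

# Beta_m row from the table: estimate 1.011, standard error 0.054.
t, ci, p = summarize(1.011, 0.054)
print(round(ci[0], 3), round(ci[1], 3))  # → 0.905 1.117
```

Applying the same helper to the β_s row (−0.012, 0.075) reproduces the 0.873 p-value to rounding.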
8.10 The model is Yᵢ = μ + δ₂D₂ᵢ + δ₃D₃ᵢ + δ₄D₄ᵢ + δ₅D₅ᵢ + εᵢ; therefore, the null that the mean does not vary with the day of the week is H₀: δ₂ = δ₃ = δ₄ = δ₅ = 0. The test statistic nR² = 10,087 × 0.000599 = 6.04 is below the 5% critical value of a χ²₄ (9.49), and so there is no evidence that the mean varies with the day of the week.

8.11 a. The sample means are ȳ = −2.823, x̄₁ = −0.910, and x̄₂ = −0.101. The coefficients can be recovered with the stepwise (residual-regression) approach, regressing each variable of interest on the other explanatory variable and then working with the residuals.

First, regress X₁ on X₂, X₁ᵢ = δ₀ + δ₁X₂ᵢ + error:

δ₁ = Cov(X₁, X₂)/Var(X₂) = −0.172/0.911 = −0.189
δ₀ = μ_X₁ − δ₁μ_X₂ = −0.91 − (−0.189 × −0.101) = −0.929

and regress Y on X₂, Yᵢ = γ₀ + γ₁X₂ᵢ + error:

γ₁ = Cov(Y, X₂)/Var(Χ₂) = −1.434/0.911 = −1.574
γ₀ = μ_Y − γ₁μ_X₂ = −2.823 − (−1.574 × −0.101) = −2.982

Using the residuals Ỹ and X̃₁ from these two regressions,

β̂₁ = Cov(Ỹ, X̃₁)/Var(X̃₁) = 2.459/1.873 = 1.313

Repeating the process for X₂, regress X₂ on X₁, X₂ᵢ = λ₀ + λ₁X₁ᵢ + error:

λ₁ = Cov(X₁, X₂)/Var(X₁) = −0.172/1.345 = −0.128
λ₀ = μ_X₂ − λ₁μ_X₁ = −0.101 − (−0.128 × −0.91) = −0.217

and regress Y on X₁, Yᵢ = η₀ + η₁X₁ᵢ + error, where

η₁ = Cov(Y, X₁)/Var(X₁) = 2.729/1.345 = 2.029

The residuals Ỹ and X̃₂ from these two regressions give

β̂₂ = Cov(Ỹ, X̃₂)/Var(X̃₂) = −1.085/0.889 = −1.221

So α̂ = −1.242, and the variance of the residuals is 0.718.

Finally, computing the fitted values Ŷᵢ (the rows recoverable from the solution are shown):

Trial |      y |      Ŷ | (y − ȳ)² | (Ŷ − ȳ)²
    1 | −5.760 | −6.089 |    8.626 |   10.664
    3 | −0.250 | −0.873 |    6.620 |    3.802
    4 | −2.720 | −0.347 |    0.011 |    6.131
    5 | −3.080 | −3.254 |    0.066 |    0.185
    6 | −7.100 | −6.834 |   18.293 |   16.085
    7 | −4.100 | −4.142 |    1.631 |    1.741
    9 | −6.130 | −5.949 |   10.936 |    9.774

So, the TSS = 75.797 and the ESS = 68.598.
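The stepwise calculation in 8.11 is an application of the Frisch–Waugh result: regressing residualized Y on residualized X₁ reproduces the multivariate coefficient β̂₁. A numpy sketch on synthetic data (not the problem's values):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.standard_normal(n)
x2 = 0.5 * x1 + rng.standard_normal(n)       # correlated regressors
y = 1.0 + 1.3 * x1 - 1.2 * x2 + rng.standard_normal(n)

# Full multivariate OLS.
X = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

def resid(v, w):
    """Residual of v after regressing it on an intercept and w."""
    Z = np.column_stack([np.ones(len(w)), w])
    g, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ g

# Stepwise: residualize y and x1 with respect to x2, then regress.
y_t, x1_t = resid(y, x2), resid(x1, x2)
beta1_stepwise = (x1_t @ y_t) / (x1_t @ x1_t)

print(abs(beta1_stepwise - beta_full[1]) < 1e-10)
```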
• Explain how to test whether a regression is affected by heteroskedasticity.
• Describe approaches to using heteroskedastic data.
• Characterize multicollinearity and its consequences; distinguish between multicollinearity and perfect collinearity.
• Describe the consequences of excluding a relevant explanatory variable from a model and contrast those with the consequences of including an irrelevant regressor.
• Explain two model selection procedures and how these relate to the bias-variance tradeoff.
• Describe the various methods of visualizing residuals and their relative strengths.
• Describe methods for identifying outliers and their impact.
• Determine the conditions under which OLS is the best linear unbiased estimator.
This chapter examines model specification. Ideally, a model should include all variables that explain the dependent variable and exclude all that do not. In practice, achieving this goal is challenging, and the model selection process must account for the costs and benefits of including additional variables. Once a model has been selected, the specification should be checked for any obvious deficiencies. Standard specification checks include residual diagnostics and formal statistical tests to examine whether the assumptions used to justify the OLS estimator(s) are satisfied.

Omitting explanatory variables that affect the dependent variable creates biased coefficients. The bias depends on both the coefficient of the excluded variable and the correlation between the excluded variable and the remaining variables in the model. When a variable is excluded, the coefficients on the included variables adjust to capture the variation shared between the excluded and the included variables.

On the other hand, including irrelevant variables does not bias coefficients. In large samples, the coefficient on an extraneous variable converges to its population value of zero. Including an unnecessary explanatory variable does, however, increase the uncertainty of the estimated model parameters. When the explanatory variables are correlated, the extraneous variable competes with the relevant variables to explain the data, which in turn makes it more difficult to precisely estimate the parameters that are non-zero.

Determining whether a variable should be included in a model reflects a bias-variance tradeoff. Large models that include all conceivable explanatory variables are likely to have coefficients that are unbiased (i.e., equal, on average, to their population values). However, because it may be difficult to attribute an effect to a single variable, the coefficient estimators in large models are also relatively imprecise. Smaller models, on the other hand, lead to more precise estimates of the effects of the included explanatory variables.

Model selection captures this bias-variance tradeoff when sample data are used to determine whether a variable should be included. This chapter introduces two methods that are widely used to select a model from a set of candidate models: general-to-specific model selection and cross-validation.

Once a model has been chosen, additional specification checks are used to determine whether the specification is accurate. Any defects detected in this post-estimation analysis are then used to adjust the model. The standard specification checks include testing to ensure that the functional form is adequate, the model parameters are constant, and the assumption that the errors have constant variance (i.e., homoscedasticity) is compatible with the data. Tests of functional form or parameter stability provide guidance about whether additional variables should be included in the model. Rejecting homoscedasticity requires adjusting the estimator of parameter standard errors, but not the specification of the regression model or the estimated parameters.

The chapter concludes by examining the strengths of the OLS estimator. When five assumptions are satisfied, the OLS estimator is the most accurate in the class of linear unbiased estimators. When model residuals are also normally distributed, then OLS is the best estimator of the model parameters (in the sense of producing the most precise estimates of model parameters among a wide range of candidate estimators).

9.1 OMITTED VARIABLES

An omitted variable is one that has a non-zero coefficient but is not included in a model. Omitting a variable has two effects.

1. First, the remaining variables absorb the effects of the omitted variable attributable to common variation. This changes the regression coefficients on the included variables so that they do not consistently estimate the effect of a change in the explanatory variable on the dependent variable (holding all other effects constant).
2. Second, the estimated residuals are larger in magnitude than the true shocks. This is because the residuals contain both the true shock and any effect of the omitted variable that cannot be captured by the included variables.

Suppose the data generating process includes variables X₁ and X₂:

Yᵢ = α + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ

Now suppose X₂ is excluded from the estimated model. The model fit is then:

Yᵢ = α + β₁X₁ᵢ + εᵢ

In large samples, the OLS estimator β̂₁ converges to:

β₁ + β₂δ

where:

δ = Cov[X₁, X₂]/V[X₁]

is the population value of the slope coefficient in a regression of X₂ on X₁.

The bias due to the omitted variable depends on the population coefficient of the excluded variable β₂ and the strength of the relationship between X₁ and X₂ (which is measured by δ). When X₁ and X₂ are highly correlated, then X₁ can explain a substantial portion of the variation in X₂ and the bias is large. If X₁ and X₂ are uncorrelated (so that δ = 0), then β̂₁ is a consistent estimator of β₁. In other words, omitted variables bias the coefficients on the included variables whenever the included and excluded variables are correlated.
Furthermore, the standard error grows larger as the correlation between X₁ and X₂ increases. In most applications to financial data, correlations are large and so the impact of an extraneous variable on the standard errors of relevant variables can be substantial.

9.3 THE BIAS-VARIANCE TRADEOFF

The choice between omitting a relevant variable and including an irrelevant variable is ultimately a tradeoff between bias and variance. Larger models tend to have a lower bias (due to the inclusion of additional relevant variables), but they also have less precise estimated parameters (because including extraneous variables increases estimation error). On the other hand, models with few explanatory variables have less estimation error but are more likely to produce biased parameter estimates.

The bias-variance tradeoff is the fundamental challenge in variable selection.¹ There are many methods for choosing a final model. The first step in model selection is to determine a set of candidate models. When the number of explanatory variables is small, then it is possible to consider all combinations of explanatory variables. In a data set with ten explanatory variables, for example, 2¹⁰ = 1,024 distinct candidate models can be constructed.³

Cross-validation² begins by randomly splitting the data into m equal-sized blocks. Common choices for m are five and ten. Parameters are then estimated using m − 1 of the m blocks, and the residuals are computed using these estimated parameter values on the data in the excluded block. These residuals are known as out-of-sample residuals, because they are not included in the sample used to estimate parameters. This process of estimating parameters and computing residuals is repeated a total of m times (so that each block is used to compute residuals once).

¹ Bias measures the expected deviation between an estimator and the true value of the unknown coefficient and is defined Bias(θ̂) = E[θ̂] − θ₀. Variance measures the squared deviation, V[θ̂] = E[(θ̂ − E[θ̂])²], and so bias and variance are not directly comparable. In practice, the bias-variance tradeoff is a tradeoff between squared bias and variance because E[(θ̂ − θ₀)²] = Bias²(θ̂) + V[θ̂].

² In most other references on model selection, this type of cross-validation is referred to as k-fold cross-validation. m is used here to clarify that the number of blocks in the cross-validation (m) is distinct from the number of explanatory variables in the model (k).

³ If a data set has n candidate explanatory variables, then there are 2ⁿ possible model specifications. When n is larger than 20, the number of models becomes very large, and so it is not possible to consider all specifications.
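The splitting-and-refitting procedure above can be sketched in a few lines of numpy; the function and data names are illustrative, and the model with the smaller out-of-sample sum of squared residuals would be preferred:

```python
import numpy as np

def cv_sse(X, y, m=5, seed=0):
    """Out-of-sample sum of squared residuals from m-fold cross-validation."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    blocks = np.array_split(idx, m)          # m roughly equal-sized blocks
    sse = 0.0
    for block in blocks:
        train = np.setdiff1d(idx, block)     # the other m - 1 blocks
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[block] - X[block] @ beta   # out-of-sample residuals
        sse += resid @ resid
    return sse

rng = np.random.default_rng(4)
n = 300
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)                  # irrelevant regressor
y = 1.0 + 0.8 * x1 + rng.standard_normal(n)

small = np.column_stack([np.ones(n), x1])
large = np.column_stack([np.ones(n), x1, x2])
print(cv_sse(small, y), cv_sse(large, y))
```

In machine-learning terminology, the m − 1 training blocks and the held-out block correspond to the training and validation sets described below.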
Figure 9.1 An example of heteroskedastic data. The standard deviation of the errors is increasing in the distance between X and E[X], so that the observations at the end of the range are noisier than those in the center.
The sum of squared errors is then computed using the residuals estimated from the out-of-sample data. Finally, the preferred model is chosen from the set of candidate models by selecting the model with the smallest out-of-sample sum of squared residuals. In machine learning or data science applications, the m − 1 blocks are referred to as the training set and the omitted block is called the validation set.

9.4 HETEROSKEDASTICITY

Figure 9.1 contains a simulated data set with heteroskedastic shocks. The residual variance increases as X moves away from its mean; consequently, the largest variances occur when (X − μₓ)² is largest. This form of heteroskedasticity is common in financial asset returns.

In a regression, the most extreme observations (of X) contain the most information about the slope of the regression line. When these observations are coupled with high-variance residuals, estimating the slope becomes more difficult. This is reflected by a larger standard error for β̂.
… variables, and the cross-product of all explanatory variables (including the product of each variable with itself).

For example, if the original model has two explanatory variables:

Yᵢ = α + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ

then the first step computes the residuals using the OLS parameter estimators:

ε̂ᵢ = Yᵢ − α̂ − β̂₁X₁ᵢ − β̂₂X₂ᵢ

The second step regresses the squared residuals on a constant, X₁, X₂, X₁², X₂², and X₁X₂:

ε̂ᵢ² = γ₀ + γ₁X₁ᵢ + γ₂X₂ᵢ + γ₃X₁ᵢ² + γ₄X₂ᵢ² + γ₅X₁ᵢX₂ᵢ + ηᵢ    (9.1)

If the data are homoscedastic, then ε̂ᵢ² should not be explained by any of the variables on the right-hand side, and the null is

H₀: γ₁ = ⋯ = γ₅ = 0

The test statistic is computed as nR², where the R² is computed in the second regression. The test statistic has a χ²_{k(k+3)/2} distribution, where k is the number of explanatory variables in the first-stage model.⁶

When k = 2, the null imposes five restrictions and the test statistic is distributed χ²₅. Large values of the test statistic indicate that the null is false, so that homoscedasticity is rejected in favor of heteroskedasticity. When this occurs, heteroskedasticity-robust (White) standard errors should be used.

Table 9.1 contains the results of White's test on the industry portfolios considered in Chapter 8. The first-step model is derived from the CAPM:

Rp*ᵢ = α + βRm*ᵢ + εᵢ,

where Rp*ᵢ is the excess return on a portfolio and Rm*ᵢ is the excess return on the market. The residuals from the regression are constructed as:

ε̂ᵢ = Rp*ᵢ − α̂ − β̂Rm*ᵢ

Table 9.1 reports the parameter estimates for γ₁ and γ₂ (which should be 0 if the data are homoscedastic). The values below the estimated coefficients are t-statistics.

Apart from the banking industry, all estimates of γ₁ are statistically insignificant at the 10% level (i.e., they are smaller in magnitude than 1.645). This indicates that the market return does not generally linearly affect the variance of the residuals. However, most of the coefficients on γ₂ are statistically significant at the 10% level, and so the shock variance is large when the squared market return is large; in other words, the volatilities of the shock and market are positively related.

Table 9.1 White's Test for Heteroskedasticity Applied to the Industry Portfolios. The First-Step Model Is a CAPM. The Second-Step Model Is ε̂ᵢ² = γ₀ + γ₁Rm*ᵢ + γ₂(Rm*ᵢ)² + ηᵢ, Where Rm*ᵢ Is the Excess Return on the Market. The Values Below the Estimated Coefficients Are t-Statistics. The Third Column Contains White's Test Statistic, nR², Which Has a χ²₂ Distribution. The p-Value of the Test Statistic Is in Parentheses. The Final Column Reports the R² from This Regression. The Number of Observations Is n = 360.

                     |      γ₁ |     γ₂ | White Stat. |    R²
Banking              |  −0.488 |  0.140 |       40.63 | 0.113
                     | (−2.67) | (5.50) |     (0.000) |
Shipping Containers  |   0.144 |  0.053 |        0.95 | 0.003
                     |  (0.36) | (0.96) |     (0.623) |
Chemicals            |   0.042 |  0.096 |        7.72 | 0.021
                     |  (0.17) | (2.77) |     (0.021) |
Electrical Equipment |   0.191 |  0.092 |       10.21 | 0.028
                     |  (0.93) | (3.23) |     (0.006) |

The third column reports the test statistic for White's test (which is nR² from the second-stage regression on the residuals). The null of homoscedasticity imposes two restrictions on the coefficients in the second-stage regression (H₀: γ₁ = γ₂ = 0), and so the test statistic has a χ²₂ distribution. Four of the seven series reject the null hypothesis and therefore appear to be heteroskedastic. The final column reports the R² of the second-stage model. All regressions use 30 years of monthly observations (n = 360).

                                |        α |     β_m |      β_s |     β_v
t-stat (White, robust)          | (−0.072) | (18.633) | (−0.161) | (3.510)
t-stat (assuming homoscedasticity) | (−0.314) | (26.902) | (−0.506) | (7.720)

The t-statistics on the size and value factor coefficients that account for heteroskedasticity are generally smaller than the t-statistics computed assuming homoscedasticity. The changes in the consumer goods industry portfolio are large enough to alter the conclusion about the significance of the size factor (because the t-statistic decreases from −3.2 to −1.3).

The final two columns contain test statistics for the null that the CAPM is adequate (i.e., H₀: β_s = β_v = 0). The two test statistics …
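White's two-step test can be sketched for the single-regressor case used in the CAPM first step; 5.99 is the 5% critical value of a χ²₂, matching the text. The data are simulated, not the industry portfolios:

```python
import numpy as np

def white_test_stat(x, y):
    """White's test statistic nR^2 for a single-regressor model.

    Step 1: OLS of y on [1, x]. Step 2: regress the squared residuals
    on [1, x, x^2] and return n times that regression's R^2.
    """
    n = len(y)
    X1 = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    e2 = (y - X1 @ b) ** 2

    X2 = np.column_stack([np.ones(n), x, x**2])
    g, *_ = np.linalg.lstsq(X2, e2, rcond=None)
    u = e2 - X2 @ g
    r2 = 1.0 - (u @ u) / ((e2 - e2.mean()) ** 2).sum()
    return n * r2

rng = np.random.default_rng(5)
n = 360
x = rng.standard_normal(n)
y_hom = 1.0 + 0.5 * x + rng.standard_normal(n)              # homoscedastic
y_het = 1.0 + 0.5 * x + np.abs(x) * rng.standard_normal(n)  # variance grows with x^2

stat_hom, stat_het = white_test_stat(x, y_hom), white_test_stat(x, y_het)
# Compare with the chi-square(2) 5% critical value, 5.99.
print(stat_het > 5.99)
```

When the statistic exceeds the critical value, heteroskedasticity-robust (White) standard errors should be used, as described above.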
Xⱼᵢ = γ₀ + γ₁X₁ᵢ + ⋯ + γⱼ₋₁Xⱼ₋₁,ᵢ + γⱼ₊₁Xⱼ₊₁,ᵢ + ⋯ + γₖXₖᵢ + ηᵢ    (9.3)

The variance inflation factor for variable j is then:

VIFⱼ = 1/(1 − Rⱼ²)    (9.4)

where Rⱼ² comes from a regression of Xⱼ on the other variables in the model (Equation 9.3). Values above 10 (i.e., which indicate that 90% of the variation in Xⱼ is explained by the other variables in the model) are considered large.

Residual Plots

Residual plots are standard methods used to detect deficiencies in a model specification. An ideal model would have residuals that are not systematically related to any of the included explanatory variables. The residuals should also be relatively small in magnitude (i.e., typically within ±4s, where s² is the estimated variance of the shocks in the model).

The basic residual plot compares ε̂ᵢ (y-axis) against the realization of the explanatory variables xᵢ. An alternative uses the standardized residuals ε̂ᵢ/s so that the magnitude of the deviation is more apparent. Both outliers and model specification problems (e.g., omitted nonlinear relationships) can be detected in these plots.

Outliers

Outliers are values that, if removed from the sample, produce large changes in the estimated coefficients. Cook's distance measures the influence of each observation j, and values larger than 1 indicate that observation j has a large impact on the estimated model parameters.

For example, consider the following data:

Table 9.3

Trial |     y |     X
    8 |  1.60 |  0.26
    9 |  1.46 |  0.72
   10 |  1.52 | −0.41
   11 |  0.68 |  1.13
   12 |  0.03 | −1.78
   13 |  1.56 |  1.46
   14 | −0.12 | −0.62
   15 |  7.60 |  1.85

Note that Trial 15 yields a result that is quite different from the rest of the data. To see whether this observation is an outlier, a model is fit on the entire data set (producing fitted values Ŷᵢ) as well as on just the first 14 observations (producing Ŷ₍₋ⱼ₎,ᵢ). The resulting coefficients for the model excluding Trial 15 are

Ŷ₍₋ⱼ₎ = 1.15 + 0.69X

and Cook's distance is

Dⱼ = Σᵢ (Ŷ₍₋ⱼ₎,ᵢ − Ŷᵢ)² / (ks²) = 4.6/(2 × 2.19) = 1.05

Table 9.4

Trial |     X |    Ŷᵢ | Ŷ₍₋ⱼ₎,ᵢ | (Ŷ₍₋ⱼ₎,ᵢ − Ŷᵢ)²
    1 |  1.89 |  3.35 |   2.45 | 0.81
    2 |  0.64 |  2.05 |   1.59 | 0.21
    3 | −0.62 |  0.74 |   0.73 | 0.00
    4 |  1.20 |  2.63 |   1.98 | 0.43
    5 | −2.42 | −1.14 |  −0.51 | 0.39
    6 |  0.67 |  2.08 |   1.61 | 0.22
    7 |  0.81 |  2.23 |   1.71 | 0.27
   12 | −1.78 | −0.47 |  −0.07 | 0.16
   13 |  1.46 |  2.90 |   2.15 | 0.56
   14 | −0.62 |  0.74 |   0.73 | 0.00
   15 |  1.85 |  3.31 |   2.42 | 0.79
  Sum |       |       |        | 4.60

As another example, consider the industry portfolio data from Chapter 8. The top panel in Figure 9.2 shows Cook's distance in the three-factor model, where the dependent variable is the excess return on the consumer goods industry portfolio. The plot shows that one observation has a distance larger than 1. The bottom panel shows the effect of this single observation on the estimate of the coefficient on the size factor β̂ₛ. Dropping this single influential observation changes β̂ₛ from …

9.6 STRENGTHS OF OLS

Under assumptions introduced in Chapter 7, OLS is the Best Linear Unbiased Estimator (BLUE). Best indicates that the OLS estimator achieves the smallest variance among any estimator that is linear and unbiased.
OLS IS BLUE

OLS is a linear estimator because both α̂ and β̂ can be written as linear functions of Yᵢ, so that:

β̂ = Σᵢ₌₁ⁿ wᵢYᵢ

where the weights wᵢ do not depend on Y, but they may depend on other variables. Note that the weights in the OLS estimator of β are functions of the explanatory variables alone, and so also do not depend on Y.

OLS is the best estimator in the sense that any other LUE must have larger variances. The intuition behind this conclusion follows from a result that any other LUE β̃ can be written as β̂ plus a component b̃ that is uncorrelated with β̂. The variance of any other LUE estimator is then:

V[β̃] = V[β̂] + V[b̃]

Figure 9.2 The top panel shows the values of Cook's distance in a regression of the excess return on the consumer goods industry portfolio on the three factors: the excess return on the market and the returns on the size and value portfolios. The bottom panel shows the effect on the estimated coefficient on the size factor when the outlier identified by Cook's distance is removed.
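The distances plotted in Figure 9.2 follow the same formula as the Trial-15 example, Dⱼ = Σᵢ(Ŷ₍₋ⱼ₎,ᵢ − Ŷᵢ)²/(ks²). A numpy sketch on simulated data with one planted outlier (illustrative, not the book's numbers):

```python
import numpy as np

def fit(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def cooks_distance(X, y, j):
    """D_j = sum_i (yhat_{(-j),i} - yhat_i)^2 / (k * s^2), k parameters."""
    n, k = X.shape
    beta = fit(X, y)
    yhat = X @ beta
    s2 = ((y - yhat) @ (y - yhat)) / (n - k)
    keep = np.arange(n) != j
    beta_mj = fit(X[keep], y[keep])      # refit without observation j
    yhat_mj = X @ beta_mj
    return ((yhat_mj - yhat) ** 2).sum() / (k * s2)

rng = np.random.default_rng(6)
n = 30
x = rng.standard_normal(n)
y = 1.0 + 0.7 * x + 0.5 * rng.standard_normal(n)
y[-1] += 10.0                            # plant one influential outlier
X = np.column_stack([np.ones(n), x])

d = np.array([cooks_distance(X, y, j) for j in range(n)])
print(d.argmax() == n - 1)               # the planted outlier dominates
```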
While BLUE is a strong argument in favor of using OLS to estimate model parameters, it comes with two important caveats.

1. First, the BLUE property of OLS depends crucially on the homoscedasticity of the residuals. When the residuals have variances that change with X (as is common in financial data), then it is possible to construct better LUEs of α and β using WLS (under some additional assumptions).
2. Second, many estimators are not linear. For example, many maximum likelihood estimators (MLEs) are nonlinear.¹⁰ MLEs have desirable properties and generally have the smallest asymptotic variance of any consistent estimator. MLEs are …

If εᵢ ~ N(0, σ²), then OLS is the Best Unbiased Estimator (BUE). That is, the assumption that the errors are normally distributed strengthens BLUE to BUE, so that OLS is the Best (in the sense of having the smallest variance) Unbiased Estimator among all linear and nonlinear estimators. This is a compelling justification for using OLS when the errors are iid normal random variables.¹¹

However, it is not the case that OLS is only applicable when the errors are normally distributed. The only requirements on the errors are the assumptions that the residuals have conditional mean zero and that there are no outliers. Normality of the errors …
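The first caveat can be made concrete. Under heteroskedasticity with known error variances Ω, Var[β̂_OLS] = (X′X)⁻¹X′ΩX(X′X)⁻¹ while Var[β̂_WLS] = (X′Ω⁻¹X)⁻¹, and the WLS slope variance is smaller. A deterministic numpy comparison (assumed design and error variances):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])

# Heteroskedastic error variances that grow with x^2 (as in Figure 9.1).
omega = np.diag(1.0 + 3.0 * x**2)

XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ omega @ X @ XtX_inv            # sandwich form
var_wls = np.linalg.inv(X.T @ np.linalg.inv(omega) @ X)  # GLS/WLS variance

# WLS (with the true weights) has the smaller slope variance.
print(var_wls[1, 1] < var_ols[1, 1])
```

This is a sketch of the textbook comparison, not a recipe for practice: the true Ω is unknown in applications, which is why the additional assumptions mentioned in the first caveat matter.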
QUESTIONS

9.2 What conditions are required for a regression to suffer from omitted variable bias?

9.3 What are the costs and benefits of dropping highly collinear variables?

9.4 If you use a general-to-specific model selection with a test size of α, what is the probability that one or more …

9.5 Why does a variable with a large variance inflation factor indicate that a model may be improved?

9.6 The sample mean is an OLS estimator of the model Yᵢ = α + εᵢ. What does the BLUE property imply about the mean estimator?

9.7 What is the strongest justification for using OLS to estimate model parameters?

Practice Questions

9.8 In a model with a single explanatory variable, what value of the R² in the second step in a test for heteroskedasticity indicates that the null would be rejected for sample sizes of 100, 500, or 2500? (Hint: Look up the critical values of a χ²_q, where q is the number of restrictions in the test.)

9.9 Suppose the true relationship between Y and two explanatory variables is Yᵢ = 2 + 1.2X₁ᵢ − 2.1X₂ᵢ + εᵢ.

a. What is the population value of β₁ in the regression Yᵢ = α + β₁X₁ᵢ + εᵢ if ρ_X₁X₂ = 0.6, σ²_X₁ = 1, and σ²_X₂ = 1.2?
b. What value of ρ_X₁X₂ would make β₁ = 0?

9.10 In a model with two explanatory variables, Yᵢ = α + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ, what does the correlation need to be between the two regressors, X₁ and X₂, for the variance inflation factor to be above 10?

9.11 You are interested in understanding the determinants of the yield spread of corporate bonds above a maturity-matched sovereign bond. You include three explanatory variables: the leverage defined as the ratio of long-term debt to the book value of assets, a dummy variable for high yield, and a measure of the volatility of the profitability of the issuer. You are interested in testing whether there are nonlinear effects of some of these variables, and so use a RESET test including both the squared and cubic terms. The R² of the original model is 68.6%, and the R² from the model that includes both additional terms is 68.9%. You have 456 observations. What do you conclude about the specification of the model?

9.12 You are investigating the determinants of book leverage of firms. Your model includes an intercept and four additional variables: the natural logarithm of sales revenue, the book-to-market ratio, the ratio of EBITDA to book assets, and net property-plant-and-equipment (Net PPE). Your initial model groups firms in all industries. Suppose you want to test whether the four slope coefficients are constant across energy sector firms.

a. What additional model would you estimate to implement the Chow test?
b. If your sample has 433 observations, what is the distribution of the Chow test?
c. If the R² in the original specification is 16.6%, how large would the R² have to be in the regression with the dummy interactions for the null to be rejected using a 5% test size?
ability of the issuer. You are interested in testing whether
9.13. G iven the data set: where the data was used for fitting the model versus the
blue indicating the data that was held back):
Y X 1 X2
M1 m2 m3
1 -2 .3 5 3 -0 .4 0 9 -0 .0 0 8
6 -0 .7 3 5 - 1 .3 9 -1 .0 8 6 5 -1 .1 - 0 .6 1.3
9 -2 .4 1 9 -0 .1 5 6 -0 .0 5 5 8 0.9 -1 - 0 .4
15 -7 .0 9 5 0.284 2.622 Y X
-10
i----------------------1---------------------- 1---------------2 0 — a—
ANSWERS
Solved Problems
9.8. When there is a single explanatory variable in the original model, the auxiliary regression used for White's test will have an intercept and two explanatory variables: the original variable and the variable squared. White's test statistic is nR², and the null has a χ²_2 distribution. When using a test with a 5% size, the critical value is 5.99. Solving for the R² gives R² < 5.99/n. The maximum value of R² that would not reject is therefore 0.0599, 0.012, and 0.0024 when n is 100, 500, and 2500, respectively.

9.9. a. Using the omitted variable formula, when X_2 is omitted from the regression,

β̂_1 = 1.2 - 2.1 × (0.6 × √1 × √1.2)/1 = -0.18,

where (0.6 × √1 × √1.2)/1 is the regression slope from a regression of X_2 on X_1.

b. The omitted variable formula shows that when a variable is excluded, the coefficient on the included variable is β_1 + γβ_2, where γ is the population regression coefficient from the regression X_{2i} = δ + γX_{1i} + η_i. This coefficient depends on the correlation between the two variables and their standard deviations: γ = ρ_{X1,X2} σ_{X2}/σ_{X1}. Solving 1.2 - 2.1 × ρ × √1.2 = 0 shows that if ρ = 1.2/(2.1 × √1.2) = 0.522, then β_1 would be 0 in the omitted-variable regression.
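The cutoff values in 9.8 follow from a one-line calculation. This sketch (assuming only the χ²_2 5% critical value of 5.99 quoted above) tabulates the largest auxiliary-regression R² that fails to reject for each sample size.

```python
# White's test statistic is n * R^2 with a chi-square(2) null here,
# so the test fails to reject whenever R^2 < critical_value / n.
CRITICAL_5PCT_CHI2_2 = 5.99  # 5% critical value of a chi-square with 2 dof

def max_r2_not_rejecting(n: int) -> float:
    """Largest auxiliary-regression R^2 that does not reject the null."""
    return CRITICAL_5PCT_CHI2_2 / n

for n in (100, 500, 2500):
    print(n, round(max_r2_not_rejecting(n), 4))  # 0.0599, 0.012, 0.0024
```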
Obs      Y       X1      (X1-X̄1)  (Y-Ȳ)   (X1-X̄1)(Y-Ȳ)  (X1-X̄1)²
1      -2.353  -0.409  -0.409   -0.353   0.145          0.167
3      -1.665  -0.856  -0.856    0.335  -0.287          0.733
6      -0.735  -1.390  -1.390    1.265  -1.758          1.932
9      -2.419  -0.156  -0.156   -0.419   0.065          0.024
11     -2.085  -0.562  -0.562   -0.085   0.048          0.316
12     -2.972  -1.554  -1.554   -0.972   1.511          2.415
13     -0.633  -1.123  -1.123    1.367  -1.535          1.261
14     -2.678  -0.124  -0.124   -0.678   0.084          0.015
Average -2.000  0.000                    0.000          0.933

Therefore,

β̂_1 = 0.000/0.933 = 0.000

And, of course,

α̂ = Ȳ - β̂_1 X̄_1 = -2.000

so the fitted line is Ŷ = -2.000.
b. Proceeding as in part a gives:

Obs      Y       X2      (X2-X̄2)  (Y-Ȳ)   (X2-X̄2)(Y-Ȳ)  (X2-X̄2)²
1      -2.353  -0.008  -0.008   -0.353   0.003          0.000
2      -0.114  -1.216  -1.216    1.886  -2.293          1.478
6      -0.735  -1.086  -1.086    1.265  -1.373          1.179
9      -2.419  -0.055  -0.055   -0.419   0.023          0.003
11     -2.085  -0.135  -0.135   -0.085   0.012          0.018
12     -2.972  -0.299  -0.299   -0.972   0.291          0.089
13     -0.633  -1.027  -1.027    1.367  -1.404          1.055
Average -2.000  0.000                   -1.381          0.934

Therefore,

β̂_2 = -1.381/0.934 = -1.479

and

α̂ = Ȳ - β̂_2 X̄_2 = -2.000

so the fitted line is Ŷ = -2.000 - 1.479X_2.
c. The key statement from the chapter is "the estimator β̂_1 converges to β_1 + β_2δ," so setting

β_1 + β_2δ = 0

and plugging back into the 2 × 2 system of equations:

β̂_1 + 0.501β̂_2 = 0
0.500β̂_1 + β̂_2 = -1.479

d. All approaches should provide the same answer.
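The 2 × 2 system above can be solved by substitution. This sketch uses the coefficients as printed (0.501 and 0.500) and is only a numerical check of the quoted system, not part of the original solution.

```python
# Solve the system:
#   b1 + 0.501 * b2 = 0
#   0.500 * b1 + b2 = -1.479
# Substitute b1 = -0.501 * b2 into the second equation.
a12, a21, rhs = 0.501, 0.500, -1.479

b2 = rhs / (1 - a21 * a12)  # (1 - 0.2505) * b2 = -1.479
b1 = -a12 * b2

print(round(b1, 3), round(b2, 3))  # 0.989 -1.973
```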
The out-of-sample errors and squared errors (partial; observation 7 shown):

Obs    Error M1  Error M2  Error M3   Sq. M1  Sq. M2  Sq. M3
7      0.6       0         0          0.36    0       0

The model selected is the one that has the smallest RSS within the blue out-of-sample boxes: this is M3.

9.15. a. Following the methodology of the first example, where Y1 is the estimate using the entire data set and Y2 is the estimate using the first nine observations:

            Alpha   Beta
Entire      3.26    3.49
First nine  0.10    2.96

Completing the table (partial):

Obs     Y       X    Y1       Y2       (Y1-Y2)²  Y1-Y
2     -14.12   -4   -10.692  -11.745   1.107     3.428
3     -14.72   -3   -7.204   -8.783    2.491     7.516
5       2.3    -1   -0.228   -2.858    6.919    -2.528

Sum of (Y1-Y2)²:  106.552
Σ(Y1-Y)²/10:      87.783

So the Cook's distance for the last data point is

106.552/(2 × 87.783) = 0.61

b. The last data point does not have a major influence on the OLS parameters. However, it may still be an outlier, and other tests would need to be conducted to draw such a conclusion.
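The Cook's distance arithmetic can be verified directly from the two table sums; the divisor of 2 is the number of estimated parameters (intercept and slope).

```python
# Cook's distance for the last observation, using the sums from the table:
# D = (sum of squared changes in fitted values) / (k * s^2), with k = 2.
sum_sq_fit_change = 106.552  # sum of (Y1 - Y2)^2 from the table
scale = 87.783               # the "sum of squares / 10" scale from the table

cooks_d = sum_sq_fit_change / (2 * scale)
print(round(cooks_d, 2))  # 0.61
```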
• Describe the requirements for a series to be covariance-stationary.
• Define the autocovariance function and the autocorrelation function (ACF).
• Define white noise, and describe independent white noise and normal (Gaussian) white noise.
• Define and describe the properties of autoregressive (AR) processes.
• Define and describe the properties of moving average (MA) processes.
• Explain how a lag operator works.
• Explain mean reversion and calculate a mean-reverting level.
• Define and describe the properties of autoregressive moving average (ARMA) processes.
• Describe the application of AR, MA, and ARMA processes.
• Describe sample autocorrelation and partial autocorrelation.
• Describe the Box-Pierce Q-statistic and the Ljung-Box Q statistic.
• Explain how forecasts are generated from ARMA models.
• Describe the role of mean reversion in long-horizon forecasts.
• Explain how seasonality is modeled in a covariance-stationary ARMA.
Time-series analysis is a fundamental tool in finance and risk management. Many key time series (e.g., interest rates and spreads) have predictable components. Building accurate models allows past values to be used to forecast future changes in these series.

A time series can be decomposed into three distinct components: the trend, which captures the changes in the level of the time series over time; the seasonal component, which captures predictable changes in the time series according to the time of year; and the cyclical component, which captures the cycles in the data. Whereas the first two components are deterministic, the third component is determined by both the shocks to the process and the memory (i.e., persistence) of the process.

This chapter focuses on the cyclical component. The properties of this component determine whether past deviations from trends or seasonal components are useful for forecasting the future. The next chapter extends the model to include trends and deterministic seasonal components and examines methods that can be used to remove trends.

This chapter begins by introducing time series and linear processes. This class of processes is broad and includes most common time-series models. It then introduces a key concept in time-series analysis called covariance stationarity. A time series is covariance-stationary if its first two moments do not change across time. Importantly, any time series that is covariance-stationary can be described by a linear process.

While linear processes are very general, they are also not directly applicable to modeling. Instead, two classes of models are used to approximate general linear processes: autoregressions (ARs) and moving averages (MAs). The properties of these processes are exploited when modeling data by comparing the theoretical structure of the different model classes to sample statistics. These inspections serve as a guide when selecting a model. Model selection is refined using information criteria (IC) that formalize the tradeoff between bias and variance when choosing a specification. This chapter covers the two most widely used criteria.

The chapter concludes by examining how these models are used to produce out-of-sample forecasts and how ARs and MAs can be adapted to time series with seasonal dynamics.

10.1 STOCHASTIC PROCESSES

A time series is a sequence of random variables indexed by time (t) (i.e., so that Y_s is observed before Y_t whenever s < t). The ordering of a time series is important when predicting future values using past observations.

Stochastic processes can describe a wide variety of data-generating models. One of the simplest useful stochastic processes is Y_t ~ N(μ, σ²), which is a reasonable description of the data-generating process for returns on some financial assets.

A leading example of a more complex stochastic process (and one that is a focus of this chapter) is a first-order autoregression (AR):

Y_t = δ + φY_{t-1} + ε_t,  (10.1)

where δ is a constant, φ is a model parameter measuring the strength of the relationship between two consecutive observations, and ε_t is a shock.

This first-order AR process can describe a wide variety of financial and economic time series (e.g., interest rates, commodity convenience yields, or the growth rate of industrial production).

Figure 10.1 shows examples of time series that are produced by stochastic processes with very different properties. The top-left panel shows the monthly return on the S&P 500 index. This time series appears to be mostly noise and so is hard to predict. The top-right panel contains the VIX index, which appears to be very slowly mean-reverting (i.e., deviations from the long-run average are temporary but long-lived).

The bottom panels show two important measures of interest rates: the slope and curvature of the US Treasury yield curve. The slope is defined as the difference between the yield on a ten-year bond and that on a one-year bond. The curvature is defined as the difference between two slopes:

(Y_10 - Y_5) - (Y_5 - Y_1),

where Y_m is the yield on a bond with a maturity of m years. Both series are mean-reverting and have important cyclical components that are related to the state of the economy.

This chapter focuses exclusively on linear processes. Any linear process can be written as:

Y_t = δ_t + ψ_0ε_t + ψ_1ε_{t-1} + ψ_2ε_{t-2} + ⋯ = δ_t + Σ_{i=0}^{∞} ψ_i ε_{t-i}.  (10.2)
Figure 10.1 These plots show examples of time series. The top-left panel shows the monthly return on the S&P 500. The top-right panel shows the end-of-month value of the VIX index. The bottom panels show the slope and curvature of the US Treasury yield curve.
This chapter only studies models where δ_t = δ for all t, so that the deterministic process is constant. The next chapter relaxes this restriction to include models where δ_t explicitly depends on t to account for both time trends and seasonal effects.

This chapter limits attention to linear processes for two reasons. First, a wide range of processes, even nonlinear processes, have linear representations. Second, linear processes can be directly related to linear models, which are the workhorses of time-series analysis. Linear models are the most widely used class of models for modeling and forecasting the conditional mean of a process.

10.2 COVARIANCE STATIONARITY

The ability of a model to forecast a time series depends crucially on whether the past resembles the future. Stationarity is a key concept that formalizes the structure of a time series and justifies the use of historical data to build models. Covariance stationarity depends on the first two moments of a time series: the mean and the autocovariances.

Autocovariance is a time-series-specific concept. The autocovariance is defined as the covariance between a stochastic process at different points in time. Its definition is the time-series analog of the covariance between two random variables.
White noise is uncorrelated across time but is not necessarily independent. This distinction is important in finance and risk management because asset returns are unpredictable but have persistent time-varying volatility.

Dependent white noise relaxes the iid assumption while maintaining the three properties of white noise. A leading example of a dependent white noise process is known as an Autoregressive Conditional Heteroskedasticity (ARCH) process. The variance of a shock from an ARCH process depends on the magnitude of the previous shock. This process exhibits a property called volatility clustering, where volatility can be above or below its long-run level for many consecutive periods. Volatility clustering is an important feature of many financial time series, especially asset returns. The dependence in an ARCH process leads it to have a predictable variance but not a predictable mean, and shocks from an ARCH process are not correlated across time.

Wold's Theorem

Wold's theorem states that any covariance-stationary time series can be represented as a linear combination of white noise shocks:

Y_t = Σ_{i=0}^{∞} ψ_i ε_{t-i}.  (10.7)

Consider again the AR(1), Y_t = δ + φY_{t-1} + ε_t, where δ is called the intercept, φ is the AR parameter, and ε_t is the shock. The AR parameter determines the persistence of Y_t: an AR(1) process is covariance-stationary when |φ| < 1 and non-stationary when φ = 1. This chapter focuses exclusively on the covariance-stationary case and leaves the treatment of non-stationary time series to the next chapter.

Because {Y_t} is assumed to be covariance-stationary, the mean, variance, and autocovariances are all constant. Note that the mean of Y_t is not δ: it depends on both δ and φ. Using the property of a covariance-stationary time series that

E[Y_t] = E[Y_{t-1}] = μ,

the long-run (or unconditional) mean follows from

E[Y_t] = δ + φE[Y_{t-1}] + E[ε_t]
μ = δ + φμ + 0
μ(1 - φ) = δ
μ = δ/(1 - φ).  (10.8)

The long-run variance is similarly

γ_0 = σ²/(1 - φ²).  (10.9)
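A quick way to check Equations 10.8 and 10.9 is to simulate a long AR(1) path and compare the sample moments with the closed forms. The parameter values below (δ = 0.5, φ = 0.8, σ = 1) are illustrative choices, not values from the text.

```python
import random

# AR(1): Y_t = delta + phi * Y_{t-1} + eps_t, with eps_t ~ N(0, sigma^2)
delta, phi, sigma = 0.5, 0.8, 1.0

mu = delta / (1 - phi)              # long-run mean (Equation 10.8)
gamma0 = sigma**2 / (1 - phi**2)    # long-run variance (Equation 10.9)

# Simulate a long path and compare sample moments with the formulas.
random.seed(42)
y = mu  # start at the long-run mean to reduce burn-in effects
path = []
for _ in range(200_000):
    y = delta + phi * y + random.gauss(0.0, sigma)
    path.append(y)

sample_mean = sum(path) / len(path)
sample_var = sum((v - sample_mean) ** 2 for v in path) / len(path)

print(round(mu, 3), round(gamma0, 3))  # 2.5 2.778
print(round(sample_mean, 2), round(sample_var, 2))  # close to the above
```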
The autocovariance function of the covariance-stationary AR(1) is then

γ(h) = φ^h γ_0,  (10.12)

and the ACF is

ρ(h) = γ(h)/γ_0 = φ^h.

When φ is negative, the ACF alternates in sign and may oscillate. Higher-order ARs can produce more complex patterns in their ACFs than an AR(1).

The lag operator, denoted L, shifts the time index of an observation, so that LY_t = Y_{t-1}. It has six properties:

1. The lag operator shifts the time index back one observation (i.e., LY_t = Y_{t-1}).

The pth-order AR process generalizes the first-order process to include p lags of Y in the model. The AR(p) model is

Y_t = δ + φ_1Y_{t-1} + φ_2Y_{t-2} + ⋯ + φ_pY_{t-p} + ε_t.  (10.16)

Meanwhile, the PACF of an AR(p) shows a sharp drop-off at p lags. This pattern naturally generalizes the shape of the PACF in the AR(1), which cuts off at one lag.

The left panel of Figure 10.4 shows the first 15 values of the ACF and PACF of the AR(2):

Y_t = δ + 1.4Y_{t-1} - 0.8Y_{t-2} + ε_t.

This model is covariance-stationary if the roots of the associated characteristic polynomial are all less than 1 in absolute value.³

Invertibility plays a key role when selecting a unique model for a time series using the Box-Jenkins methodology. Model building and the role of invertibility are covered in more detail in Section 10.8. In higher-order polynomials, it is difficult to define the values of the model coefficients where the polynomial is invertible (and hence the parameter values where an AR is covariance-stationary), and the use of characteristic equations becomes necessary. This method is explained in more detail in the Appendix to this chapter.

³ A useful measure of persistence in an AR(p) is the largest absolute root of the characteristic polynomial. The roots of this AR(2)'s characteristic equation are complex, and the absolute value of the largest root is 0.894.
Figure 10.3 Plot of the sample paths of two AR processes. The AR(1) specification is Y_t = 0.97Y_{t-1} + ε_t and the AR(2) specification is Y_t = 1.6Y_{t-1} - 0.8Y_{t-2} + ε_t. Both simulated paths use the same white noise innovations.

Figure 10.4 Plots of the autocorrelation and PACFs for an AR(2) (left panel) and a Seasonal AR(1) (right panel) for quarterly data.
The ACF oscillates and decays slowly to zero as the lag length increases. Unlike the AR(1) with a negative coefficient, the oscillation does not occur in every period. Instead, it oscillates every four or five periods, which produces persistent cycles in the time series.

The right panel shows the ACF and PACF of a model known as a Seasonal AR. This model is easiest to express using the lag polynomial form:

(1 - φ_4L⁴)(1 - φ_1L)Y_t = δ + ε_t.

This model has a quarterly seasonality where the shock lagged four quarters has a distinct effect on the current value of Y_t. This model is technically an AR(5) because

(1 - φ_1L - φ_4L⁴ + φ_1φ_4L⁵)Y_t = δ + ε_t,

so that

Y_t = δ + φ_1Y_{t-1} + φ_4Y_{t-4} - φ_1φ_4Y_{t-5} + ε_t.

The ACF and the PACF are consistent with the expanded model.

Suppose now that the key coefficients in this model take the values φ_1 = 0.2 and φ_4 = 0.8, which would indicate a small amount of short-run persistence coupled with a large amount of seasonal persistence. The seasonality produces the large spikes in the ACF every four lags. As expected, the PACF is non-zero for the first five lags because this model is effectively an AR(5).

Excluding the first equation and dividing each row by the long-run variance γ_0 produces a set of equations that relate the autocorrelations:

ρ_1 = φ_1 + φ_2ρ_1 + ⋯ + φ_pρ_{p-1}.

The remaining autocorrelations are recursively computed using

ρ_j = 1.4ρ_{j-1} - 0.45ρ_{j-2}.

Using this equation, the next three autocorrelations can be computed as 0.828, 0.753, and 0.682.
Results from fitting AR models with orders 1, 2, and 3 are reported in Table 10.1, and t-statistics are reported below each estimated coefficient. The models for the default premium (i.e., the top panel) indicate that the series is highly persistent.

The R² in a time-series model has the same interpretation as in a linear regression: It measures the percentage of the variation in Y_t that is explained by the model. The R² in all three real GDP growth models is lower, indicating that it is more difficult to forecast GDP growth using historical data.

Table 10.1 Parameter estimates from AR models with orders 1, 2, and 3. The top panel reports parameter estimates using the default premium, and the bottom panel reports parameter estimates from models using real GDP growth. The column labeled R² reports the fit of the model. The final column reports the absolute values of the roots of the characteristic equations. [Table body not reproduced; only scattered t-statistics survive in the extraction.]
10.5 MOVING AVERAGE (MA) MODELS

A first-order MA, denoted MA(1), is defined as

Y_t = μ + θε_{t-1} + ε_t,  (10.21)

where ε_t ~ WN(0, σ²) is a white noise process.

The observed value of Y_t depends on both the contemporaneous shock ε_t and the previous shock. The parameter θ acts as a weight and determines the strength of the effect of the previous shock. The parameter μ is the mean of the process because

E[Y_t] = μ + θE[ε_{t-1}] + E[ε_t] = μ + θ × 0 + 0 = μ,  (10.22)

because the shocks are white noise and so have zero expected value. Any MA(1) has exactly one non-zero autocorrelation, and the ACF is

ρ(h) = 1,             h = 0
     = θ/(1 + θ²),    h = 1
     = 0,             h ≥ 2.  (10.24)

The PACF of an MA(1), on the other hand, is complex and has non-zero values at all lags. This pattern is the inverse of what an AR(1) would produce.

The MA(1) generalizes to the MA(q), which includes q lags of the shock:

Y_t = μ + ε_t + θ_1ε_{t-1} + ⋯ + θ_qε_{t-q}.  (10.25)

Here, μ is the mean of Y_t because all shocks are white noise and so have zero expected value. The variance of an MA(q) can be shown to be

γ_0 = σ²(1 + θ_1² + ⋯ + θ_q²).  (10.26)

The top-right panel of Figure 10.6 shows the ACF and PACF for an MA(1) with a large negative coefficient. The first autocorrelation is negative, and the remaining autocorrelations are all zero. The PACF decays slowly towards zero.

The bottom-left and bottom-right panels plot the values for an MA(2) and an MA(4), respectively. These both show the same pattern, one that is key to distinguishing MA models from AR models: The ACF cuts off sharply at the order of the MA, whereas the PACF oscillates and decays towards zero over many lags. This is the opposite pattern from what is seen in Figure 10.2 and Figure 10.4, where the ACFs decayed slowly while the PACF cut off sharply.

Figure 10.6 These four panels show the autocorrelation and PACFs for two MA(1)s (top), an MA(2) (bottom-left), and an MA(4) (bottom-right). The ACF cuts off sharply at the MA order (q) while the PACF decays, possibly oscillating, over many lags.
Note that the R² values of all models, even the MA(4), remain low, and θ_4 is not statistically different from zero.

Table 10.2 Parameter estimates from MA models for the default premium (top panel) and real GDP growth (bottom panel). [Table body not reproduced; only column headers (δ, θ_1, ..., θ_4, R²) and scattered t-statistics survive in the extraction.]
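Equation 10.24 is simple to evaluate. This sketch codes the MA(1) ACF and, with an illustrative θ = -0.8 (a hypothetical value, not one from the text), reproduces the single negative autocorrelation described for the top-right panel of Figure 10.6.

```python
# ACF of an MA(1): rho(1) = theta / (1 + theta^2), rho(h) = 0 for h >= 2.
def ma1_acf(theta: float, h: int) -> float:
    if h == 0:
        return 1.0
    if h == 1:
        return theta / (1 + theta**2)
    return 0.0

# A large negative theta produces a single negative autocorrelation.
print(round(ma1_acf(-0.8, 1), 3))  # -0.488
print(ma1_acf(-0.8, 5))            # 0.0
```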
Figure 10.7 The left panels plot a simulated path (top) and the ACF and PACF of the ARMA(1,1), Y_t = 0.8Y_{t-1} - 0.4ε_{t-1} + ε_t. The right panels plot a simulated path (top) and the ACF and PACF of the ARMA(2,2), Y_t = 1.4Y_{t-1} - 0.6Y_{t-2} + 0.9ε_{t-1} + 0.6ε_{t-2} + ε_t.

The left panels of Figure 10.7 show a simulated path (top) and the autocorrelation and PACFs (bottom) of an ARMA(1,1) with φ = 0.8 and θ = -0.4. Both functions decay slowly to zero as the lag length increases. The decay resembles that of an AR(1), although the level of the first autocorrelation is less than φ = 0.8. This difference arises because the MA term eliminates some of the correlation with the first shock.

An ARMA(1,1) process is covariance-stationary if |φ| < 1. The MA coefficient plays no role in determining whether the process is covariance-stationary, because any MA is covariance-stationary and the MA component only affects a single lag of the shock. The AR component, however, affects all lagged shocks, and so if φ_1 is too large, then the time series is not covariance-stationary.

ARMA(p,q) processes combine AR(p) and MA(q) processes to produce models with more complicated dynamics. An ARMA(p,q) evolves according to

Y_t = δ + φ_1Y_{t-1} + ⋯ + φ_pY_{t-p} + θ_1ε_{t-1} + ⋯ + θ_qε_{t-q} + ε_t,  (10.30)

or, when compactly expressed using lag polynomials,

(1 - φ_1L - ⋯ - φ_pL^p)Y_t = δ + (1 + θ_1L + ⋯ + θ_qL^q)ε_t
φ(L)Y_t = δ + θ(L)ε_t.

Like the ARMA(1,1), the ARMA(p,q) is covariance-stationary if the AR component is covariance-stationary.⁴

The autocovariance and ACFs of ARMA processes are more complicated than in pure AR or MA models, although the general pattern in these is simple. ARMA(p,q) models have ACFs that decay gradually towards zero.

⁴ This requires the roots of the characteristic equation z^p - φ_1z^{p-1} - ⋯ - φ_{p-1}z - φ_p = 0 to all be less than 1 in absolute value.
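The claim that the ARMA(1,1)'s first autocorrelation sits below φ can be checked with the standard closed-form result ρ_1 = (1 + φθ)(φ + θ)/(1 + 2φθ + θ²) (a textbook formula, not stated in this excerpt); autocorrelations beyond the first then decay at rate φ.

```python
phi, theta = 0.8, -0.4  # the ARMA(1,1) parameters from Figure 10.7

# First autocorrelation of an ARMA(1,1) (standard closed-form result).
rho1 = (1 + phi * theta) * (phi + theta) / (1 + 2 * phi * theta + theta**2)

# Subsequent autocorrelations decay geometrically at rate phi.
rho = [rho1 * phi**(h - 1) for h in range(1, 6)]

print(round(rho1, 3))  # 0.523, below the AR coefficient of 0.8
```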
Table 10.3 Parameter estimates from ARMA models. The top panel reports parameter estimates using the default premium, and the bottom panel reports parameter estimates from models on real GDP growth. The column labeled R² reports the fit of the model. The final column reports the absolute value of the roots of the characteristic equations. [Table body not reproduced.]
Sample autocorrelations and partial autocorrelations are used to build and validate ARMA models. These statistics are first applied to the data to understand the dependence structure and to select a set of candidate models. They are then applied to the estimated residuals {ε̂_t} to determine whether they are consistent with the key assumption that ε_t ~ WN(0, σ²).

The most common estimator of the sample autocovariance is

γ̂_h = (T - h)^{-1} Σ_{t=h+1}^{T} (Y_t - Ȳ)(Y_{t-h} - Ȳ),  (10.31)

where Ȳ is the full-sample average. This estimator uses all available data to estimate γ_h.

Joint Tests of Autocorrelations

Testing autocorrelation in the residuals is a standard specification check applied after fitting an ARMA model. It is common practice to use both graphical inspections and formal tests in specification analysis.

The Box-Pierce statistic jointly tests whether the first h autocorrelations are all zero:

Q_BP = T Σ_{i=1}^{h} ρ̂_i².  (10.33)

When the null is true, √T ρ̂_i is asymptotically distributed as a standard normal, and thus T ρ̂_i² is distributed as a χ²_1. The estimated autocorrelations are also independent, and so the sum of h independent χ²_1 variables is, by definition, a χ²_h variable.

The Ljung-Box statistic is a version of the Box-Pierce statistic that works better in smaller samples. It is defined as

Q_LB = T(T + 2) Σ_{i=1}^{h} ρ̂_i²/(T - i).  (10.34)

The choice of h can affect the conclusion of the test that the residuals have no autocorrelation. When testing the specification of an ARMA(p,q) model, h should be larger than max(p,q). Residual autocorrelations that fall within the lags of the model are typically small, because the ARMA(p,q) estimator is effectively overfitting these values. It is common to use between 5 and 20 lags when testing a small ARMA model (p, q ≤ 2).
Graphical examination of a fitted model includes plotting the residuals to check for any apparent deficiencies and plotting the sample ACF and PACF of the residuals (i.e., ε̂_t). For example, a graphical examination can involve assessing the 95% confidence intervals for the null H_0: ρ_h = 0 (ACF) or H_0: α_h = 0 (PACF). The presence of many violations of the 95% confidence interval in the sample ACF or PACF would indicate that the model does not adequately capture the dynamics of the data. However, it is common to see a few estimates outside of the 95% confidence bands, even in a well-specified model. This pattern makes it challenging to rely exclusively on graphical tools to determine whether a model is well-specified.

Specification Analysis

Autocorrelation and partial autocorrelation play a key role in both model selection and specification analysis. The left column of Figure 10.8 plots 24 sample autocorrelations and partial autocorrelations of the default premium (i.e., two years of monthly observations). The right column shows 12 sample ACF and PACF values of real GDP growth (i.e., three years of quarterly observations). The horizontal dashed lines mark the boundary of a 95% confidence interval derived from an assumption that the process is white noise.
Figure 10.8 The left column contains the sample autocorrelations (top) and partial autocorrelations of the default premium. The right column shows the sample autocorrelations and partial autocorrelations of real GDP growth.
The ACF of the default premium decays very slowly towards zero, which is consistent with an AR model. The PACF has three values which are statistically different from zero. This pattern is consistent with either a low-order AR or an ARMA with an MA component that has a small coefficient.

The ACF and PACF of real GDP growth indicate weaker persistence, and the ACF is different from zero in the first two lags. The PACF resembles the ACF, and the first two lags are statistically different from zero. These are consistent with a low-order AR, MA, or ARMA.

Sample autocorrelations and partial autocorrelations are also applied to investigate the adequacy of model specifications. In Figure 10.9, an ARMA(1,1) is estimated on the default premium data, and an AR(1) is used to model real GDP growth.

The top four panels in Figure 10.9 show the sample ACF and PACF of the estimated model residuals from these specifications. Note that these plots are the equivalents of those in Figure 10.8, except that the data are replaced by the model residuals. All four plots indicate that the residuals are consistent with a white noise process. The estimated autocorrelations are generally insignificant, small in magnitude, and lack any discernible pattern.

The bottom panel plots both the Ljung-Box statistic and its p-value. The statistic is computed for all lag lengths between 1 and 12. The default premium residuals have small autocorrelations in the first four lags, and the null that all autocorrelations are zero is not rejected. However, the 5th autocorrelation of the estimated residuals is large enough that the Q-statistic rejects
Figure 10.9 The top panels show the sample ACF and PACF of the model residuals. The residuals for the default premium are from an ARMA(1,1). The residuals for real GDP growth are from an AR(1). The bottom plot shows the Ljung-Box Q statistic (line, right axis) and its p-value (bars, left axis).
the null. The remainder of the lags also indicate that the null should be rejected.⁵

The residuals from the AR(1) estimated on the real GDP growth data are consistent with white noise. Both the sample ACF and PACF have one value that is statistically significant. However, a small number of rejections is expected when examining multiple lags, because each has a 5% chance of rejecting when the null is true. The Ljung-Box test also does not reject the null at any lag length.

10.8 MODEL SELECTION

If the sample counterparts of these exhibit a similar pattern, then an AR model is likely to fit the data. In general, slow decay in the sample ACF indicates that the model needs an AR component, and slow decay in the sample PACF indicates that the model should have an MA component. Note that the goal of this initial step is to identify plausible candidate models. It is not necessary to make a final selection using only graphical analysis.

Once an initial set of models has been identified, the next step is to measure their fit. The most natural measure of fit is the sample variance of the estimated residuals, also known as the Mean Squared Error (MSE) of the model.

⁵ The null hypothesis is that all coefficients on the lagged terms are zero, H_0: φ_1 = ⋯ = φ_p = 0, and the alternative is H_1: φ_j ≠ 0 for some j.
⁶ This is also known as the Schwarz IC (SIC or S/BIC).
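As a sketch of how the IC tradeoff works, one common parameterization (an assumption here; texts differ in constants) is AIC = T ln(σ̂²) + 2k and BIC = T ln(σ̂²) + k ln T, where k is the number of estimated parameters. Because ln T > 2 once T exceeds e² ≈ 7.4, the BIC penalizes extra parameters more heavily, which is why it is the more conservative criterion.

```python
import math

# One common parameterization of the two criteria (constants vary by text).
def aic(mse: float, T: int, k: int) -> float:
    return T * math.log(mse) + 2 * k

def bic(mse: float, T: int, k: int) -> float:
    return T * math.log(mse) + k * math.log(T)

# Hypothetical fits: a larger model must lower the MSE enough to justify
# its extra parameters, and the BIC demands a bigger improvement.
T = 100
vals = (aic(1.00, T, 2), bic(1.00, T, 2), aic(0.94, T, 4), bic(0.94, T, 4))
print([round(v, 2) for v in vals])  # [4.0, 9.21, 1.81, 12.23]
```

With these hypothetical numbers the AIC prefers the larger model (1.81 < 4.0) while the BIC keeps the smaller one (12.23 > 9.21), illustrating the BIC's conservatism.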
It is possible to specify two (or more) models that have different parameter values but are equivalent. Equivalence between ARMA models means that the mean, ACF, and PACF of the models are identical. The Box-Jenkins methodology provides two principles to select among equivalent models.

3. Forecasts are generated recursively starting with E_T[Y_{T+1}]. The forecast at horizon h may depend on forecasts from earlier steps (E_T[Y_{T+1}], ..., E_T[Y_{T+h-1}]). When these quantities appear in the forecast for period T + h, they are replaced by the forecasts computed for horizons 1, 2, ..., h - 1.

Figure 10.10 AIC and BIC values for ARMA models with p ∈ {0, 1, 2, 3, 4} and q ∈ {0, 1, 2} for the default premium (left) and real GDP growth (right). Models with p = 0 have been excluded from the default premium plot because these produce excessively large values of the ICs. The selected model has the smallest information criterion.
and depends only on the final observed value Y_T. The two-step and higher forecasts are computed in the same recursive fashion. As the horizon grows,

lim_{h→∞} E_T[Y_{T+h}] = lim_{h→∞} (Σ_{i=0}^{h-1} φ^i δ + φ^h Y_T) = δ/(1 - φ) = E[Y_t],

which is the long-run (or unconditional) mean. This limit is the same as the mean reversion level (or long-run mean) of an AR(1).

The one-step forecast error

Y_{T+1} - E_T[Y_{T+1}] = ε_{T+1}

is always just the surprise in the next period in any ARMA model. Forecast errors for longer-horizon forecasts are mostly functions of the model parameters.
APPLIED FORECASTING

The preferred model for the default premium is the ARMA(1,1):

DEF_t = 0.066 + 0.938 DEF_{t-1} + 0.381 ε_{t-1} + ε_t.

Generating out-of-sample forecasts requires the final value of the default premium and the final estimated residual. These values are DEF_T = 1.11 and ε̂_T = 0.0843. The one-step-ahead forecast is

E_T[DEF_{T+1}] = 0.066 + 0.938 × 1.11 + 0.381 × 0.0843 = 1.139.

The two-step and higher forecasts are recursively computed using the relationship

E_T[DEF_{T+h}] = 0.066 + 0.938 E_T[DEF_{T+h-1}].

These values are reported in Table 10.4. The AR parameter is relatively close to 1, and so the forecasts converge slowly to the long-run mean of 0.066/(1 - 0.938) = 1.06.

The selected model for real GDP growth is the AR(1):

RGDPG_t = 1.996 + 0.360 RGDPG_{t-1} + ε_t.

The final real GDP growth rate is RGDPG_T = 2.564 in 2018Q4. Forecasts for longer horizons are recursively computed using the relationship

E_T[RGDPG_{T+h}] = 1.996 + 0.360 E_T[RGDPG_{T+h-1}].

For example, the first forecast is 1.996 + 0.360 × 2.564 = 2.919, and the second is 1.996 + 0.360 × 2.919 = 3.047. The remaining forecasts for the first year are presented in Table 10.4. The persistence parameter in the AR(1) is small, and the forecasts rapidly converge to the long-run mean of the growth rate of real GDP (i.e., 1.996/(1 - 0.360) = 3.119).
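The recursive forecasts in the box can be reproduced in a few lines using the AR(1) coefficients quoted above for real GDP growth.

```python
# Recursive h-step forecasts for the AR(1) fit to real GDP growth.
delta, phi = 1.996, 0.360
y_T = 2.564  # last observed growth rate (2018Q4)

forecasts = []
f = y_T
for _ in range(4):            # four quarterly steps ahead
    f = delta + phi * f       # E_T[Y_{T+h}] = delta + phi * E_T[Y_{T+h-1}]
    forecasts.append(round(f, 3))

long_run_mean = delta / (1 - phi)
print(forecasts)                 # [2.919, 3.047, 3.093, 3.109]
print(round(long_run_mean, 3))   # converges toward ~3.119
```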
10.10 SEASONALITY

Macrofinancial time series often have seasonalities. Seasonal and short-term dynamics can be combined into a single specification. For example, a model using monthly data can include both a seasonal AR(1) and a short-term AR(1).

Seasonal ARs have slowly decaying ACFs and a sharp cutoff in the PACF (when using the seasonal frequency). Seasonal MAs have the opposite pattern, where the PACF slowly decays and the ACF drops off sharply. Dynamics present in the ACF and PACF at the observation frequency indicate that a short-term component is required. Finally, IC can be used to select between models that contain only observation-frequency lags [i.e., an ARMA(p, q) × (0, 0)f], only seasonal lags [i.e., an ARMA(0, 0) × (ps, qs)f], or a mixture of both.

An AR(p) is covariance-stationary if its characteristic roots are all less than one in absolute value. The stationarity conditions can be determined by associating the pth-order lag polynomial:

1 − φ1L − ⋯ − φpL^p    (10.40)

with its characteristic equation:

z^p − φ1z^(p−1) − φ2z^(p−2) − ⋯ − φp−1z − φp = 0    (10.41)

The characteristic equation is also a pth-order polynomial where the ordering of the powers has been reversed. The lag polynomial is invertible if the roots c1, c2, …, cp in the factored expression (z − c1)(z − c2)⋯(z − cp) are all less than one in absolute value.

For example, consider an AR(2) with lag polynomial 1 − φ1L − φ2L² and associated characteristic equation z² − φ1z − φ2. If φ1 = 1.4 and φ2 = −0.45, then the factored characteristic polynomial is (z − 0.9)(z − 0.5). The two roots are 0.9 and 0.5. Both are less than 1 in absolute value, and so this AR(2) is stationary.

Autoregressions and moving averages are often combined in autoregressive moving average models (ARMA), which can be used to approximate any linear process. Alternative specifications are compared using one of two ICs: the AIC or the BIC. These criteria trade off bias and variance by penalizing larger models that do not improve fit. The BIC leads to consistent model selection and is more conservative than the AIC.
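The root check in the AR(2) example can be reproduced numerically. A minimal sketch (the coefficient values are those of the worked example):

```python
import numpy as np

# Characteristic equation of an AR(p): z^p - phi1*z^(p-1) - ... - phip = 0.
# For the AR(2) with phi1 = 1.4 and phi2 = -0.45 the coefficients are:
phi1, phi2 = 1.4, -0.45
roots = np.roots([1.0, -phi1, -phi2])  # roots of z^2 - 1.4z + 0.45

print(sorted(abs(roots)))              # [0.5, 0.9]
stationary = all(abs(r) < 1 for r in roots)
print(stationary)                      # True
```

The same three lines work for any AR(p): pass `[1, -phi1, ..., -phip]` to `np.roots` and check that every root is inside the unit circle.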
QUESTIONS

10.5. What are the key features of the ACF and PACF of an MA(q) process?

10.9. What steps should be taken if the ACF and/or PACF of model residuals do not appear to be white noise?

Practice Questions

10.10. In the covariance-stationary AR(2), Yt = 0.3 + 1.4Yt−1 − 0.6Yt−2 + εt, where εt ~ WN(0, σ²), what is the long-run mean E[Yt] and variance V[Yt]?

10.11. If (1 − 0.4L)(1 − 0.9L⁴)Yt = εt, what is the form of this model if written as a standard AR(5) with Yt−1, …, Yt−5 on the right-hand side of the equation?

10.12. For the equation: 15 5 3

10.13. Given the following data for a set of innovations:

t | εt
1 | −0.58
… | …
10 | −0.88

a. Calculate the time series for the following AR(1) model, taking y0 = 0, δ = 0.5, and φ = 0.75:
b. Calculate the time series for the following MA(1) model, taking ε0 = 0, μ = 0.5, and θ = 0.75:
c. Calculate the time series for the following ARMA(1,1) model, taking ε0 = 0, y0 = 0, δ = 0.5, and φ = θ = 0.375:

10.14. In the MA(2), Yt = 4.1 + 5εt−1 + 6.25εt−2 + εt, where εt ~ WN(0, σ²), what is the ACF?

10.15. Suppose all residual autocorrelations are 1.5/√T, where T is the sample size. Would these violate the confidence bands in an ACF plot?

10.16. Suppose the first three sample autocorrelations of a series with 100 observations are 0.24, −0.04, and 0.08. Would a Ljung-Box Q statistic reject its null hypothesis using a test with a size of 5%? Would the test reject the null using a size of 10% or 1%?

10.17. Which model is selected by the AIC and BIC if the sample size is T = 250?

10.18. The 2018Q3 value of real GDP growth is RGDPGT−1 = 2.793. What are the forecasts for 2019Q1–2019Q4 using the AR(2) model in Table 10.1?
ANSWERS

Solved Problems

10.10. Because this process is covariance-stationary:

E[Yt] = μ = 0.3/(1 − 1.4 − (−0.6)) = 0.3/0.2 = 1.5

V[Yt] = γ0 = 0.45

10.12. 1 − … L² = 1 − … L + … L⁴

10.13. a. Yt = δ + φYt−1 + εt. Each value is generated via the formula Yt = 0.5 + 0.75Yt−1 + εt; for example, Y1 = 0.5 + 0.75 × 0 − 0.58 = −0.08 and Y2 = 0.5 + 0.75 × (−0.08) + 1.62 = 2.06. The remaining values include:

t | Yt
1 | −0.08
2 | 2.06
6 | 1.64
7 | 2.54
8 | 1.76
9 | 2.67
10 | 1.62

b. Each value is generated via the formula Zt = 0.5 + 0.75εt−1 + εt; for example, Z1 = 0.5 + 0.75 × 0 − 0.58 = −0.08 and Z2 = 0.5 + 0.75 × (−0.58) + 1.62 = 1.69. The remaining values include:

t | Zt
6 | 0.19
7 | 1.70
8 | 0.46
10 | 0.26

c. The remaining values include: t = 9: 0.41, t = 10: 0.47.

10.14. γ0 = (1 + 5² + 6.25²)σ² = 65.0625σ², γ1 = (5 + 5 × 6.25)σ² = 36.25σ², and γ2 = 6.25σ². The autocorrelations are then ρ0 = 1, ρ1 = 0.557, and ρ2 = 0.096.

10.16. QLB = T Σ (i = 1, …, 3) [(T + 2)/(T − i)] ρ̂i²

= 100(102/99)(0.24)² + 100(102/98)(−0.04)² + 100(102/97)(0.08)² = 6.77

The critical values for test sizes of 10%, 5%, and 1% are 6.25, 7.81, and 11.34, respectively (CHISQ.INV(1 − α, 3) in Excel, where α is the test size). The null is only rejected using a size of 10%.

10.17. The ICs and the selected model in bold:

10.18. All forecasts are recursively computed starting with the first:

1.765 + 0.319 × 2.564 + 0.114 × 2.793 = 2.90

The two-step forecast uses the one-step forecast:

1.765 + 0.319 × 2.90 + 0.114 × 2.564 = 2.98

The three- and four-step forecasts depend entirely on other forecasts:

1.765 + 0.319 × 2.98 + 0.114 × 2.90 = 3.05
1.765 + 0.319 × 3.05 + 0.114 × 2.98 = 3.08
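The Ljung-Box arithmetic in the solved problem can be checked with a short script. The autocorrelations (0.24, −0.04, 0.08) and T = 100 are taken from the solution; the script itself is an illustrative sketch.

```python
# Ljung-Box Q statistic for the first three sample autocorrelations
# rho = (0.24, -0.04, 0.08) with T = 100 observations.
T = 100
rhos = [0.24, -0.04, 0.08]

q_lb = sum(T * (T + 2) / (T - i) * r**2 for i, r in enumerate(rhos, start=1))
print(round(q_lb, 2))  # 6.77

# Chi-square(3) critical values for sizes 10%, 5%, and 1%
for size, cv in [(0.10, 6.25), (0.05, 7.81), (0.01, 11.34)]:
    print(size, q_lb > cv)  # rejects only at the 10% size
```

Because the statistic (about 6.77) lies between the 10% and 5% critical values, only the least demanding test rejects the white-noise null.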
• Describe linear and nonlinear time trends.
• Explain how to use regression analysis to model seasonality.
• Describe a random walk and a unit root.
• Explain the challenges of modeling time series containing unit roots.
• Describe how to test if a time series contains a unit root.
• Explain how to construct an h-step-ahead point forecast for a time series with seasonality.
• Calculate the estimated trend value and form an interval forecast for a time series.
Covariance-stationary time series have means, variances, and autocovariances that do not depend on time. Any time series that is not covariance-stationary is non-stationary. This chapter covers the three most pervasive sources of non-stationarity in financial and economic time series: time trends, seasonalities, and unit roots (more commonly known as random walks).

Time trends are the simplest deviation from stationarity. Time trend models capture the propensity of many time series to grow over time. These models are often applied to the log transformation so that the trend captures the growth rate of the variable. Estimation and interpretation of parameters in trend models that contain no other dynamics are simple.

Seasonalities induce non-stationary behavior in time series by relating the mean of the process to the month or quarter of the year.¹ Seasonalities can be modeled in one of two ways: shifts in the mean that depend on the period of the year, or an annual cycle where the value in the current period depends on the shock in the same period in the previous year. These two types of seasonality are not mutually exclusive, and there are two approaches to modeling them. The simple method includes dummy variables for the month or quarter. These variables allow the mean to vary across the calendar year and accommodate predictable variation in the level of a time series. The more sophisticated approach uses the year-over-year change in the variable to eliminate the seasonality in the transformed data series. The year-over-year change eliminates the seasonal component in the mean, and the differenced time series is modeled as a stationary ARMA.

Finally, random walks (also called unit roots) are the most pervasive form of non-stationarity in financial and economic time series. For example, virtually all asset prices behave like a random walk. Directly modeling a time series that contains a unit root is difficult because parameter estimators are biased and are not normally distributed, even in large samples. However, the solution to this problem is simple: model the difference (i.e., Yt − Yt−1).

All non-stationary time series contain trends that may be deterministic or stochastic. For deterministic trends (e.g., time trends and deterministic seasonalities), knowledge of the period is enough to measure, model, and forecast the trend. On the other hand, random walks are the most important example of a stochastic trend. A time series that follows a random walk depends equally on all past shocks.

The correct approach to modeling time series with trends depends on the source of the non-stationarity. If a time series only contains deterministic trends, then directly capturing the deterministic effects is the best method to model the data. In fact, removing the deterministic trends produces a detrended series that is covariance-stationary. On the other hand, if a time series contains a stochastic trend, the only transformation that produces a covariance-stationary series is a difference (i.e., Yt − Yt−j for some value of j ≥ 1). Differencing a time series also removes time trends and, when the difference is constructed at the seasonal frequency, also removes deterministic seasonalities. Ultimately, many trending economic time series contain both deterministic and stochastic trends, and so differencing is a robust method for removing both effects.

11.1 TIME TRENDS

Polynomial Trends

A time trend deterministically shifts the mean of a series. The most basic example of a non-stationary time series is a process with a linear time trend:

Yt = δ0 + δ1t + εt,    (11.1)

where εt ~ WN(0, σ²). This process is non-stationary because the mean depends on time:

E[Yt] = δ0 + δ1t

In this case, the time trend δ1 measures the average change in Y across subsequent observations.

The linear time trend model generalizes to a polynomial time trend model by including higher powers of time:

Yt = δ0 + δ1t + δ2t² + ⋯ + δm t^m + εt    (11.2)

In practice, most time trend models are limited to first- (i.e., linear) or second-degree (i.e., quadratic) polynomials.

The parameters of a polynomial time trend model can be consistently estimated using OLS. The resulting estimators are asymptotically normally distributed, and standard errors and t-statistics can be used to test hypotheses if the residuals are white noise. If the residuals are not compatible with the white noise assumption, then the t-statistics and model R² are misleading.
A linear trend implies that the series changes by a fixed quantity each period. In economic and financial time series, this is problematic for two reasons:

1. If the trend is positive, then the growth rate of the series (i.e., δ1/t) will fall over time.

¹ In some applications, financial data have seasonality-like issues at higher frequencies, for example the day of the week or the hour of the day. While seasonalities are only formally applicable to annual cycles in data, the concept is applicable to other time series that contain a time-dependent pattern.
In contrast, constant growth rates are more plausible than constant growth in most financial and macroeconomic variables.

The linear and quadratic trend models are

RGDPt = δ0 + δ1t + δ2t² + εt,

where δ2 = 0 in the linear model. The parameter estimates and t-statistics are reported in the top two rows of Table 11.1. The linear model indicates that real GDP grows by a constant USD 59 billion each quarter, whereas the quadratic model implies the growth is accelerating. The residuals are computed as

ε̂t = RGDPt − δ̂0 − δ̂1t − δ̂2t²

The R² values are extremely high, and the quadratic models both have R² above 99%. High R² in trending series are inevitable, and the R² will approach 100% in all four specifications as the sample size grows. In practice, R² is not considered to be a useful measure in trending time series. Instead, residual diagnostics or other formal tests are used to assess model adequacy.
Table 11.1 Estimation Results for Linear, Quadratic, Log-Linear, and Log-Quadratic Models Fitted to US Real GDP. The t-Statistics Are Reported in Parentheses. The Final Column Reports the R² of the Model
Figure 11.1 The top-left panel plots the level of real GDP and the fitted values from linear and quadratic trend models. The top-right panel plots the residuals from the linear and quadratic trend models. The bottom-left panel plots the log of real GDP and the fitted values from log-linear and log-quadratic time trend models. The bottom-right plots the residuals from the log-linear and log-quadratic models.
11.2 SEASONALITY

A seasonal time series exhibits deterministic shifts throughout the year. For example, new housing has a strong seasonality that peaks in the summer months and slows in winter. Meanwhile, natural gas consumption has the opposite seasonal pattern: rising in winter and subsiding in summer. Both seasonalities are driven by predictable events (e.g., the weather).

Seasonalities also appear in financial and economic time series. Many of these seasonalities result from the features of markets or regulatory structures. For example, the January effect posits that equity market returns are larger in January than they are in the other months of the year.² Meanwhile, the trading volume of commodity futures varies predictably across the year, reflecting periods when information about crop growing conditions is revealed. Energy prices also have strong seasonalities, although the pattern and strength of the seasonality depend on the geographic location of the market.

Calendar effects generalize seasonalities to cycles that do not necessarily span an entire year. For example, an anomaly has

² Tax optimization, where investors sell poorly performing investments in late December to realize losses and then reinvest in January, is one plausible explanation for this phenomenon.
The detrended series may still contain dynamics. This means that adding an AR term to the model should produce residuals that appear to be white noise:

Yt = δ0 + δ1t + φYt−1 + εt

⁴ For example, consider a model with separate dummy variables for when a light switch is turned on (Ion) and when it is turned off (Ioff). These two variables are multicollinear, because if Ion = 1 then Ioff = 0 (and vice versa). This problem is avoided by simply omitting one of these variables.

Figure 11.2 The month-over-month growth rate of gasoline consumption in the United States. The upticks indicate the growth rate in March of each year and the downticks mark the growth rate in September.
If there is a seasonal component to the data, then seasonal dummies can be added as well to produce

Yt = δ0 + δ1t + γ1I1t + ⋯ + γs−1Is−1,t + φYt−1 + εt,

where δ1 captures the long-term trend of the process, the γi measure seasonal shifts in the mean from the trend growth (i.e., δ1t), and φYt−1 is an AR term that captures the cyclical component.⁵

While adding a cyclical component is a simple method to account for serial correlation in trend-stationary data, many trending economic time series are not trend-stationary. If a series contains a unit root (i.e., a random walk component), then detrending cannot eliminate the non-stationarity. Unit roots and trend-stationary processes have different behaviors and properties. Tests of trend stationarity against a random walk are presented in Section 11.4.

The top panel of Figure 11.3 shows the log of gasoline consumption in the United States (i.e., ln GCt). This series has not been differenced, and so the time trend, at least until the financial crisis of 2008, is evident. The predictable seasonal deviations from the time trend are also clearly discernible. Both time trends and seasonalities are deterministic, so that the regressors in the model:

Yt = δ0 + δ1t + δ2t² + γ1I1t + ⋯ + γ11I11,t + εt

only depend on time.
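The trend-dummy-AR specification above amounts to one OLS regression with a constant, a trend column, s − 1 dummy columns, and the lagged level. The sketch below builds that design matrix for simulated quarterly data; all parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated quarterly series: trend + seasonal mean shifts + AR(1) cycle
T = 400
t = np.arange(T, dtype=float)
season = np.tile([1.0, -0.5, 2.0, 0.0], T // 4)  # quarterly shifts
eps = rng.normal(0, 1.0, T)
y = np.empty(T)
y[0] = 0.0
for i in range(1, T):
    y[i] = 0.05 * t[i] + season[i] + 0.5 * y[i - 1] + eps[i]

# Design matrix: constant, trend, s - 1 = 3 seasonal dummies, lagged level
quarters = np.arange(T) % 4
dummies = np.column_stack([(quarters == q).astype(float) for q in range(3)])
X = np.column_stack([np.ones(T - 1), t[1:], dummies[1:], y[:-1]])
beta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print(beta[-1])  # AR coefficient, close to the true 0.5
```

One quarter is deliberately left without a dummy: its mean is absorbed by the constant, which avoids the multicollinearity discussed in footnote 4.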
Figure 11.3 (bottom panels): residuals from the model with seasonal dummies and a quadratic trend, and from the model with dummies, a quadratic trend, and an AR(1).
(second row) has a negative coefficient that appears to be statistically significant.⁶

While both models produce large R², the trend makes this measure less useful than in a covariance-stationary time series. The bottom-left panel of Figure 11.3 plots the residuals from the model that includes both seasonal dummies and a quadratic time trend. The residuals are highly persistent and clearly not white noise, indicating that the model is badly misspecified.

The third row of Table 11.2 reports parameter estimates from a model that extends the dummy-quadratic trend model with a cyclical component [i.e., an AR(1)], and the bottom-right panel of Figure 11.3 plots the residuals from this model. The AR(1) coefficient indicates the series is very persistent, and the coefficient is very close to 1.

However, none of these models appear capable of capturing all of the dynamics in the data, and the Ljung-Box Q-statistic indicates that the null is rejected in all specifications. The estimated parameters in the autoregression are close to one, which suggests that the detrended series is not covariance-stationary. This example highlights the challenges of incorporating cyclical components (e.g., autoregressions) with trends and seasonal dummies.

⁶ The usual t-stat only has an asymptotic standard normal distribution when the model is correctly specified in the sense that the residuals are white noise. The residuals in these models are serially correlated and so not white noise.
Table 11.2 (excerpt; columns: Seasonal Dummies, δ1, δ2, φ1, R², Q10): ✓, 0.00428 (40.662), …, 0.851, 2193.3 (0.000)
Substituting Yt−1 = Yt−2 + εt−1 into the random walk:

Yt = (Yt−2 + εt−1) + εt

shows that Yt depends on the previous two shocks and the value of the series two periods in the past (i.e., Yt−2). Unlike covariance-stationary processes, which always mean revert, random walks become more dispersed over time. Specifically, the variance of a random walk is

V[Yt] = tσ²

For example, the AR(2)

Yt = 1.8Yt−1 − 0.8Yt−2 + εt

can be written with a lag polynomial as:

(1 − 1.8L + 0.8L²)Yt = εt

The characteristic equation factors as

z² − 1.8z + 0.8 = (z − 1)(z − 0.8),

and the two roots are 1 and 0.8. A standard random walk is a special case of a unit root process with ψ(L) = 1 − L, φ(L) = 1, and θ(L) = 1.

Figure 11.4 Two AR processes generated using the same initial value and shocks. The data from the stationary AR model are generated according to Yt = 1.8Yt−1 − 0.9Yt−2 + εt. The data from the unit root process are generated by Yt = 1.8Yt−1 − 0.8Yt−2 + εt.
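The linear growth of the variance, V[Yt] = tσ², is easy to confirm by simulation. A sketch with made-up simulation settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate many random-walk paths Y_t = Y_{t-1} + e_t with sigma = 1
n_paths, T = 20000, 100
shocks = rng.normal(0, 1.0, size=(n_paths, T))
paths = shocks.cumsum(axis=1)

# V[Y_t] = t * sigma^2: the cross-sectional variance grows linearly in t
for t in (10, 50, 100):
    print(t, round(paths[:, t - 1].var(), 1))  # roughly 10, 50, 100
```

In contrast, rerunning the same experiment with a stationary AR(1) (e.g., φ = 0.9) would show the cross-sectional variance leveling off at σ²/(1 − φ²) rather than growing without bound.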
For example, if:

Yt = δ + εt

then

ΔYt = δ + εt − δ − εt−1 = εt − εt−1

The original series is a trivial model, an AR(0). The over-differenced series is an MA(1) that is not invertible. The additional complexity in models of over-differenced series increases parameter estimation error and so reduces the accuracy of forecasts.

The Augmented Dickey-Fuller (ADF) specification is the most widely used unit root test. An ADF test is implemented using an OLS regression where the difference of a series is regressed on its lagged level, relevant deterministic terms, and lagged differences. The general form of an ADF regression is

ΔYt = γYt−1 + δ0 + δ1t + λ1ΔYt−1 + ⋯ + λpΔYt−p + εt    (11.10)

where γYt−1 is the lagged level, δ0 + δ1t are the deterministic terms, and λ1ΔYt−1, …, λpΔYt−p are the lagged differences.

The ADF test statistic is the t-statistic of γ̂. To understand the ADF test, consider implementing a test with a model that only includes the lagged level:

ΔYt = γYt−1 + εt

If Yt is a random walk, then:

Yt = Yt−1 + εt
Yt − Yt−1 = Yt−1 − Yt−1 + εt
ΔYt = 0 × Yt−1 + εt

so that the value of γ is 0 when the process is a random walk. Under the null H0: γ = 0, Yt is a random walk. The alternative is H1: γ < 0, which corresponds to the case that Yt is covariance-stationary. Note that the alternative is one-sided, and the null is not rejected if γ̂ > 0. Positive values of γ correspond to an AR coefficient that is larger than 1, and so the process is explosive and not covariance-stationary.

Implementing an ADF test on a time series requires making two choices: which deterministic terms to include and the number of lags of the differenced data to use. The number of lags to include is simple to determine: it should be large enough to absorb any short-run dynamics in the difference ΔYt. The lagged differences in the ADF test are included to ensure that εt is a white noise process. The recommended method to select the lag length is to use an information criterion such as the AIC. Recall that the AIC tends to select a larger model than criteria such as the BIC. This approach to selecting the lag length is preferred because it is essential that the residuals are approximately white noise, and so selecting too many lags is better than selecting too few. Ultimately, any reasonable lag length selection procedure (IC-based, graphical, or general-to-specific selection) should produce valid test statistics and the same conclusion.

The included deterministic terms have a more significant impact on the ADF test statistic. The DF distribution depends on the choice of deterministic terms. Including more terms skews the distribution to the left, and so the critical value becomes more negative as additional deterministic terms are included. For example, the 5% critical values in a time series with 250 observations are −1.94 when no deterministic terms are included, −2.87 when a constant is included, and −3.43 when a constant and trend are included. All things equal, adding additional deterministic terms makes rejecting the null more difficult when a time series does not contain a unit root. This reduction in the power of an ADF test suggests a conservative approach when deciding which deterministic trends to include in the test.

On the other hand, if the time series is trend-stationary, then the ADF test must include a constant. If the ADF regression is estimated without the constant, then the null is asymptotically never rejected, and the power of the test is zero.

Avoiding this outcome requires including any relevant deterministic terms. The recommended method to determine the relevant deterministic terms is to use t-statistics to test their statistical significance using a size of 10%. Any deterministic regressor that is statistically significant at the 10% level should be included. If the trend is insignificant at the 10% level, then it can be dropped, and the ADF test can be rerun including only a constant. If the constant is also insignificant, then it too can be dropped, and the test rerun with no deterministic components. However, most applications to financial and macroeconomic time series require the constant to be included.

When the null of a unit root cannot be rejected, the series should be differenced. The best practice is to repeat the ADF test on the differenced data to ensure that it is stationary. If the difference is also non-stationary (i.e., the null cannot be rejected on the difference), then the series should be double-differenced. If the double-differenced data are not stationary,
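The simplest version of the ADF regression, with only the lagged level and no deterministic terms or lagged differences, can be sketched directly. The simulated series and its AR coefficient (φ = 0.2, chosen so the series is clearly stationary) are assumptions for illustration; in practice a library routine such as statsmodels' `adfuller` would also supply the correct Dickey-Fuller critical values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a clearly stationary AR(1): Y_t = 0.2*Y_{t-1} + e_t
T = 500
y = np.empty(T)
y[0] = 0.0
for t in range(1, T):
    y[t] = 0.2 * y[t - 1] + rng.normal()

# ADF-style regression with only the lagged level: dY_t = gamma*Y_{t-1} + e_t
dy, ylag = np.diff(y), y[:-1]
gamma = (ylag @ dy) / (ylag @ ylag)
resid = dy - gamma * ylag
se = np.sqrt(resid @ resid / (len(dy) - 1) / (ylag @ ylag))
t_stat = gamma / se

print(round(gamma, 2))  # close to 0.2 - 1 = -0.8
print(t_stat < -1.94)   # True: rejects the unit-root null at the 5% level
```

Under the null, γ̂'s t-statistic follows the Dickey-Fuller distribution rather than the standard normal, which is why the comparison uses −1.94 rather than −1.65.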
Table 11.3 ADF Test Results for the Natural Log of Real GDP and the Difference of the Log of Real GDP (i.e., GDP Growth). The Column γ Reports the Parameter Estimate from the ADF Test Regression and the ADF Statistic in Parentheses. The Columns δ0 and δ1 Report Estimates of the Constant and Linear Trend Along with t-Statistics. The Lags Column Reports the Number of Lags in the ADF Regression as Selected by the AIC. The Final Columns Report the 5% and 1% Critical Values; Test Statistics Less (i.e., More Negative) Than the Critical Value Lead to Rejection of the Null Hypothesis of a Unit Root

Log of Real GDP

Deterministic | γ | δ0 | δ1 | Lags | 5% CV | 1% CV
None | 5.14e-04 (5.994) | | | 3 | −1.942 | −2.574
Constant | −1.92e-03 (−2.330) | 0.023 (3.053) | | 5 | −2.872 | −3.454

GDP Growth

Deterministic | γ | δ0 | δ1 | Lags | 5% CV | 1% CV
None | −0.133 (−2.184) | | | 8 | −1.942 | −2.574
Constant | −0.626 (−8.414) | 4.87e-03 (6.287) | | 2 | −2.872 | −3.454
Table 11.4 Default Premium

Deterministic | γ | δ0 | δ1 | Lags | 5% CV | 1% CV
None | −5.34e-03 (−1.220) | | | 16 | −1.942 | −2.571
Constant | −0.042 (−3.251) | 0.045 (3.037) | | 10 | −2.868 | −3.444
Table 11.4 shows the results of testing the default premium, defined as the difference between the yield on a portfolio of Aaa-rated corporate bonds and Baa-rated corporate bonds. The tests are implemented using 40 years of monthly data between 1979 and 2018.

For this data, the time trend is not statistically significant and so can be safely dropped. The null that the constant is zero, however, is explicitly rejected at the 10% level, and so the ADF test should be evaluated using a specification that includes a constant. The t-statistic on γ̂ is −3.25, which is less than the 5% (but not the 1%) critical value. In the previous chapter, we saw that the default premium is highly persistent. These ADF tests indicate that it is also covariance-stationary. The number of lags is chosen to minimize the AIC across all models that include between zero and 24 lags.

Seasonal Differencing

Seasonal differencing is an alternative approach to modeling seasonal time series with unit roots. The seasonal difference is constructed by subtracting the value in the same period in the previous year. This transformation eliminates deterministic seasonalities, time trends, and unit roots.

For example, suppose Yt is a quarterly time series with both deterministic seasonalities and a non-zero growth rate, so that:

Yt = δ0 + γ1I1t + γ2I2t + γ3I3t + δ1t + εt,

where εt ~ WN(0, σ²).

The seasonal difference is denoted Δ4Yt and is defined as Yt − Yt−4. Substituting in the difference and simplifying, this difference evolves according to:

Yt − Yt−4 = δ1(t − (t − 4)) + εt − εt−4
Δ4Yt = 4δ1 + εt − εt−4

The seasonal dummies are eliminated because:

γjIjt − γjIjt−4 = γj(Ijt − Ijt−4) = γj × 0

This is because Ijt = Ijt−4 for any value of t (because the observations are four periods apart). The differenced series is an MA(1), and so is covariance-stationary.

By removing trends, seasonality, and unit roots, seasonal differencing produces a time series that is covariance-stationary. The seasonally differenced time series can then be modeled using standard ARMA models. Seasonally differenced series are directly interpretable as the year-over-year change in Yt or, when logged, as the year-over-year growth rate in Yt.

Figure 11.5 contains a plot of the seasonal difference in the log gasoline consumption series. This series is constructed by subtracting log consumption in the same month of the previous year, 100 × (ln GCt − ln GCt−12).
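The algebra above can be verified numerically: seasonally differencing a series built from dummies, a trend, and white noise leaves only the constant 4δ1 plus an MA-type error. The seasonal shifts and trend slope below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Quarterly series with seasonal dummies, a linear trend, and WN shocks:
# Y_t = delta0 + gamma_q(t) + delta1*t + e_t
T = 200
t = np.arange(T, dtype=float)
gammas = np.array([1.5, -2.0, 0.75, 0.0])  # seasonal mean shifts
delta0, delta1 = 3.0, 0.25
eps = rng.normal(0, 1.0, T)
y = delta0 + gammas[np.arange(T) % 4] + delta1 * t + eps

# Seasonal (lag-4) difference: removes the dummies and reduces the
# trend to a constant, D4 Y_t = 4*delta1 + e_t - e_{t-4}
d4 = y[4:] - y[:-4]
print(round(d4.mean(), 2))  # close to 4 * 0.25 = 1.0
```

Adding a `cumsum` of shocks to `y` (a unit root) would not change this conclusion, since Δ4 = (1 − L)(1 + L + L² + L³) contains the first difference as a factor.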
11.5 SPURIOUS REGRESSION

Spurious regression is a common pitfall when modeling the relationship between two or more non-stationary time series. For example, suppose Yt and Xt are independent unit roots with Gaussian white noise shocks, and the regression:

Yt = α + βXt + εt

is estimated using 100 observations.

When using simulated data, the t-statistic on β̂ is statistically different from zero in over 75% of the simulations when using a 5% test. The rejection rate should be 5%, and the actual rejection rate is extremely high considering that the series are independent and so have no relationship. Furthermore, the rejection probability converges to one in large samples, and so a naive statistical analysis always indicates a strong relationship between the two series. This phenomenon is called a spurious regression.

Spurious regression is a frequently occurring problem when both the dependent and one or more independent variables are non-stationary. However, it is not an issue in stationary time series, and so ensuring that both Xt and Yt are stationary (and differencing if not) prevents this problem.

As another example, consider the weekly log of the Russell 1000 index and the log of the Japanese JPY/USD exchange rate. The current value of the index is regressed on the lag of the exchange rate:

ln R1000t = α + β ln JPYUSDt−1 + εt

The two series are plotted in the top panel of Figure 11.6. The Russell 1000 has an obvious time trend over the sample. The JPY/USD rate may also have a time trend, although it is less pronounced.

The first line of Table 11.5 contains the estimated parameters, t-statistics, and R² of the model. This naive analysis suggests that the exchange rate in the previous week is a strong predictor of the price of the Russell 1000 at the end of the subsequent week. This result, also plotted in Figure 11.6, is spurious because both series have unit roots.

The bottom panel of Figure 11.6 plots the residuals (i.e., ε̂t) from this regression. These residuals are highly persistent. Furthermore, conducting an ADF test results in a failure to reject the null that the residuals contain a unit root.

The correct approach uses the differences of the series (i.e., the returns) to explore whether the JPY/USD rate has predictive power. The well-specified regression is

Δln R1000t = α + βΔln JPYUSDt−1 + εt

The differenced data series are both stationary, a requirement for OLS estimators to be well behaved in large samples. Results from this regression are reported in the second line of Table 11.5. The R² is only 0.3%, indicating that the JPY/USD return has little predictive power for the future return of the equity index.

11.6 WHEN TO DIFFERENCE?

Many financial and economic time series are highly persistent but stationary, and differencing is only required when a time series contains a unit root. When a time series cannot be easily

Figure 11.6 The top panel plots the logs of the Russell 1000 index, the JPY/USD exchange rate, and the fitted value from regressing the Russell 1000 on the JPY/USD exchange rate. The bottom panel plots the regression residuals from a spurious regression of the Russell 1000 on the JPY/USD exchange rate.
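The over-75% rejection rate for independent random walks is straightforward to reproduce by Monte Carlo. The sketch below uses invented simulation settings (500 replications, T = 100) and the usual OLS t-statistic:

```python
import numpy as np

rng = np.random.default_rng(5)

# Regress one independent random walk on another and record how often
# the t-statistic on beta exceeds the usual 5% critical value of 1.96.
n_sims, T = 500, 100
rejections = 0
for _ in range(n_sims):
    y = rng.normal(size=T).cumsum()
    x = rng.normal(size=T).cumsum()
    X = np.column_stack([np.ones(T), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (T - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    t_beta = beta[1] / np.sqrt(cov[1, 1])
    rejections += abs(t_beta) > 1.96

print(rejections / n_sims)  # far above the nominal 5% rate
```

Replacing the `cumsum()` calls with plain white noise drops the rejection frequency back to roughly 5%, which is the stationary-series case described in the text.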
Table 11.6 (excerpt), Differences row: −0.000, 0.360, −0.182

Figure 11.7 The paths of the forecasts of the default premium starting in January 2018 for models estimated in levels and differences.

The transformed parameters are reported in the bottom row of Table 11.6. While these parameters appear to be very similar, they imply different forecast paths: the model estimated in levels is mean reverting to 1.06%, whereas the model estimated using differences has short-term dynamics but is not mean reverting (because it contains a unit root).

Figure 11.7 shows the default premium for the four years prior to 2018 and two forecasts for the following 24 months. Whereas the mean reversion in the levels-estimated model is apparent, the differences-estimated model predicts little change in the default premium over the next two years despite having coefficients different from zero. Ultimately, determining which approach works better is an empirical question, and the best practice is to consider models in both levels and differences when series are highly persistent.

11.7 FORECASTING

Constructing forecasts from models with time trends, seasonalities, and cyclical components is no different from constructing forecasts from stationary ARMA models. The forecast is the time-T expected value of YT+h. For example, in a linear time trend model:

YT = δ0 + δ1T + εT

so that YT+h is

YT+h = δ0 + δ1(T + h) + εT+h

The time-T expectation of YT+h is

ET[YT+h] = δ0 + δ1(T + h) + ET[εT+h] = δ0 + δ1(T + h)

because εT+h is a white noise shock and so has a mean of zero. The point forecast is therefore

ET[YT+h] = δ0 + δ1(T + h)

and the forecast error is εT+h. When the error is Gaussian white noise N(0, σ²), then the 95% confidence interval for the future value is ET[YT+h] ± 1.96σ. In

Non-stationarity creates several challenges. The distribution of test statistics depends on whether the time series contains a unit root, and so correctly interpreting test statistics is challenging. Second, it is easy to find spurious relationships that can arise due to deterministic or stochastic trends. Finally, forecasts of non-stationary time series do not mean-revert to a fixed level.
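The point forecast and interval for the linear trend model reduce to two lines of arithmetic. A sketch with hypothetical parameters (δ0 = 2, δ1 = 0.5, σ = 3):

```python
# h-step point forecast and 95% interval for a linear time trend model
# Y_t = delta0 + delta1*t + e_t with Gaussian white noise errors.
def trend_forecast(delta0, delta1, T, h, sigma):
    point = delta0 + delta1 * (T + h)
    half_width = 1.96 * sigma
    return point, (point - half_width, point + half_width)

# Hypothetical parameters for illustration
point, (lo, hi) = trend_forecast(delta0=2.0, delta1=0.5, T=200, h=3, sigma=3.0)
print(point)     # 2.0 + 0.5 * 203 = 103.5
print((lo, hi))  # about (97.62, 109.38)
```

Note that the interval width does not grow with h: with no AR or MA component the forecast error is the single shock εT+h, so every horizon has the same ±1.96σ band.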
QUESTIONS

Practice Questions

11.6 A linear time trend model is estimated on annual real euro-area GDP, measured in billions of 2010 euros, using data from 1995 until 2018. The estimated model is RGDPt = −234178.8 + 121.3 × t + εt. The estimate of the residual standard deviation is σ̂ = 262.8. Construct point forecasts and 95% confidence intervals (assuming Gaussian white noise errors) for the next three years. Note that t is the year, so that in the first observation, t = 1995, and in the last, t = 2018.

11.7 A log-linear trend model is also estimated on annual euro-area GDP for the same period. The estimated model is ln RGDPt = −18.15 + 0.0136t + εt, and the estimated standard deviation of εt is 0.0322. Assuming the shocks are normally distributed, what are the point forecasts of GDP for the next three years? How do these compare with those from the previous problem?

11.8 The seasonal dummy model Yt = δ + γ1I1t + γ2I2t + γ3I3t + εt is estimated on the quarterly growth rate of housing starts, and the estimated parameters are γ̂1 = 6.23, γ̂2 = 56.77, γ̂3 = 10.61, and δ̂ = −15.79 using data until the end of 2018. What are the forecast growth rates for the four quarters of 2019?

11.9 ADF tests are conducted on the log of the ten-year US government bond interest rate using data from 1988 until the end of 2017. The results of the ADF tests with different configurations of the deterministic terms are reported in the table below. The final three columns report the number of lags included in the test as selected using the AIC and the 5% and 1% critical values that are appropriate for the sample size and included deterministic terms. Do interest rates contain a unit root?

Deterministic | γ | δ0 | δ1 | Lags | 5% CV | 1% CV
None | −0.003 (−1.666) | | | 7 | −1.942 | −2.572
Constant | −0.009 (−1.425) | 0.010 (1.027) | | 4 | −2.870 | −3.449
Trend | −0.085 | 0.188 | −0.000 | 3 | −3.423 | −3.984
11.10 An AR(2) process is presented. The data below are to be evaluated using the regression Δy_t = m y_{t-1} + ε_t:

 t      y_t
 4     -0.81
 5     -0.31
 6     -0.37
 7      0.47
 8      1.10
 9      0.53
10      0.47
11      1.00
12      1.68
13      0.90
14      1.99
15      1.31
16      0.76
17      0.43
18     -1.01
19      0.27
20     -2.21
21     -3.21
22     -1.90
23     -0.53
24     -0.60
25      0.34
ANSWERS

E_T[Y_{T+h}] = E_T[exp(g(T + h) + ε_{T+h})] = exp(g(T + h)) E_T[exp(ε_{T+h})]

11.3 The time trend becomes apparent as the series is propagated backwards:
Solved Problems
11.6 Note that there is no AR or MA component, so the variance remains constant. Therefore, the 95% confidence interval is +/- 1.96 × 262.8 = +/- 515.1 about the expected value. As for the expected means:

E[RGDP_2019] = -234178.8 + 121.3 × 2019 = 10,725.9
E[RGDP_2020] = -234178.8 + 121.3 × 2020 = 10,847.2
E[RGDP_2021] = -234178.8 + 121.3 × 2021 = 10,968.5

11.7 In this case:

E_T[Y_T] = exp(E_T[ln Y_T] + σ²/2)

where σ²/2 = 0.0322²/2 ≈ 0.0005 (which will only make a small impact in this example). Calculating E[ln RGDP_t]:

E[ln RGDP_2019] = -18.15 + 0.0136 × 2019 = 9.308
E[ln RGDP_2020] = -18.15 + 0.0136 × 2020 = 9.322
E[ln RGDP_2021] = -18.15 + 0.0136 × 2021 = 9.336

So:

E[RGDP_2019] = exp(9.308 + 0.0005) = 11,031.4
E[RGDP_2020] = exp(9.322 + 0.0005) = 11,186.9
E[RGDP_2021] = exp(9.336 + 0.0005) = 11,344.6

The error bounds on the ln are +/- 1.96 × 0.0322, so the bounds are given in proportional terms rather than fixed values:

Bounds_Multiplier = exp(±1.96 × 0.0322) = exp(±0.0631) = 0.939, 1.065

And the 95% confidence bands are given as:

95% CB_RGDP_2019 = [0.939 × 11,031.4, 1.065 × 11,031.4] = [10,358.5, 11,748.4]
95% CB_RGDP_2020 = [10,504.5, 11,914.0]
95% CB_RGDP_2021 = [10,652.6, 12,082.0]

In comparison with Problem 11.6, the bands grow in size and overall the results are a bit larger.

11.8 Though there is variance from quarter to quarter, the expected value of Y_t is the same for any two observations of the same quarter, regardless of the year. Accordingly, the forecasts for the four quarters of 2019 are:

Q1: δ̂ + γ̂_1 = -15.79 + 6.23 = -9.56
Q2: δ̂ + γ̂_2 = -15.79 + 56.77 = 40.98
Q3: δ̂ + γ̂_3 = -15.79 + 10.61 = -5.18
Q4: δ̂ = -15.79

11.9 For the trend model, both deterministic coefficients have t-stats with absolute values > 4, well within the bounds of statistical significance at even the 99% level. Accordingly, the proper model to study is the last one.

For this model, the t-stat on γ is to the left of the 1% CV; therefore, the null hypothesis of having a unit root is rejected at the 99% confidence level. Note that if the proper model were either the constant or the no-deterministic configuration, then the null hypothesis would not be rejected.

11.10 a. Given that the equation has a unit root, the factorization of the characteristic equation must be of the form

(z - 1)(z - c) = z² - (c + 1)z + c, where 0 < c < 1

The characteristic equation for the original AR(2) relationship, Y_t = φ_1 Y_{t-1} + φ_2 Y_{t-2} + ε_t, is z² - φ_1 z - φ_2 = 0.

b. The regression components are tabulated as follows (intermediate rows omitted):

 t     y_t     y_{t-1}    Δy_t    y_{t-1} × Δy_t    y²_{t-1}
 1    -1.81     0         -1.81        0.00            0.00
...
23    -0.53    -1.90       1.37       -2.60            3.61
24    -0.60    -0.53      -0.07        0.04            0.28

m̂ = Σ y_{t-1} Δy_t / Σ y²_{t-1} = -12.50/45.76 = -0.273

Using the STDEV function in Excel, the residual standard deviation is σ̂ ≈ 0.95, and the standard error of the slope is

σ_m̂ = σ̂ / √(Σ y²_{t-1}) = 0.95/√45.76 = 0.140

So the t-statistic is

t-stat = m̂/σ_m̂ = -0.273/0.140 = -1.96

As the t-stat is less than the 5% point on the DF distribution, the null is rejected at a 95% confidence level, but the t-stat is not sufficiently negative to enable rejection at the 99% confidence level.
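The slope, standard error, and t-statistic in part (b) of 11.10 follow a simple recipe: regress Δy_t on y_{t-1} with no intercept. A sketch on made-up data (not the problem's series), with the residual standard deviation s supplied separately:

```python
import math

def df_regression(y, s):
    """Slope m = sum(y[t-1] * dy[t]) / sum(y[t-1]^2), its standard error
    s / sqrt(sum(y[t-1]^2)), and the t-stat m / se, where s is the residual
    standard deviation (computed elsewhere, e.g., with a STDEV function)."""
    x = y[:-1]                                  # lagged values y_{t-1}
    dy = [b - a for a, b in zip(y, y[1:])]      # first differences
    sxy = sum(xi * di for xi, di in zip(x, dy))
    sxx = sum(xi * xi for xi in x)
    m = sxy / sxx
    se = s / math.sqrt(sxx)
    return m, se, m / se

# Made-up data for illustration only
y = [0.0, 0.4, -0.2, 0.3, -0.5, 0.1]
m, se, tstat = df_regression(y, s=0.3)
```

The resulting t-statistic is then compared with Dickey-Fuller critical values, not the usual Student's t values, because the regressor is nonstationary under the null.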
• Calculate, distinguish, and convert between simple and continuously compounded returns.
• Define and distinguish between volatility, variance rate, and implied volatility.
• Describe how the first two moments may be insufficient to describe non-normal distributions.
• Describe the power law and its use for non-normal distributions.
• Define correlation and covariance and differentiate between correlation and dependence.
• Describe properties of correlations between normally distributed variables when using a one-factor model.
Financial asset return volatilities are not constant, and how they change can have important implications for risk management. For example, if returns are normally distributed with an annualized volatility of 10%, then a portfolio loss larger than 1% occurs on only 5% of days and the loss on 99.9% of days will be less than 2%. When annualized volatility is 100%, then 5% of days have losses larger than 10.2%, and one in four days would have losses exceeding 4.2%.

This chapter begins by examining the distributions of financial asset returns and why they are not consistent with the normal distribution. In fact, returns are both fat-tailed and skewed. The heavy tails are, in particular, generated by time-varying volatility. Two popular measures of volatility are presented, the Black-Scholes-Merton model and the VIX Index. Both measures make use of option prices to measure the future expected volatility.

The final section reviews linear correlation, which is a common measure that plays a key role in optimizing a portfolio. It is not, however, enough to characterize the dependence between the asset returns in a portfolio. Instead, this section presents two alternatives that are correlation-like but have different properties. These alternative measures are better suited to understanding dependence when it is nonlinear. They also have some useful invariance and robustness properties.

12.1 MEASURING RETURNS

All estimators of volatility depend on returns, and there are two common methods used to construct returns: simple returns and continuously compounded (log) returns. The time scale is arbitrary and may be short (e.g., an hour or a day) or long (e.g., a quarter or a year).

The return of an asset over multiple periods is the product of the simple returns in each period:

1 + R_T = Π_{t=1}^{T} (1 + R_t)    (12.2)

There are also continuously compounded returns, also known as log returns. These are computed as the difference of the natural logarithm of the price:

r_t = ln P_t - ln P_{t-1}    (12.3)

The log return is traditionally denoted with a lower-case letter (i.e., r_t). The main advantage of log returns is that the total return over multiple periods is just the sum of the single-period log returns:

r_T = Σ_{t=1}^{T} r_t    (12.4)

The simple and log returns of a single period are related through

1 + R_t = exp(r_t)    (12.5)

Figure 12.1 shows the relationship between the simple return and the log return. In the left panel, which shows the relationship when the simple return is between -12% and 12%, the approximation error is small (i.e., less than 0.85%). The right panel expands the range to ±50%, and the approximation error is much larger (e.g., when the simple return is -40%, the log return is -51%; and when the simple return is 40%, the log return is 33.6%).

The log return is always less than the simple return. Furthermore, simple returns are also never less than -100% (i.e., a total loss). Log returns do not respect this intuitive boundary and are less than -100% whenever the simple return is less than e^(-1) - 1 ≈ -63.2%.
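The return definitions above can be verified directly; a short sketch with an illustrative price series:

```python
import math

prices = [100.0, 98.9, 101.2, 99.5]   # illustrative price series

# Simple returns: R_t = P_t / P_{t-1} - 1
simple = [p1 / p0 - 1 for p0, p1 in zip(prices, prices[1:])]

# Log returns: r_t = ln(P_t) - ln(P_{t-1})
logret = [math.log(p1) - math.log(p0) for p0, p1 in zip(prices, prices[1:])]

# Multi-period simple return is the product of (1 + R_t) ...
total_simple = math.prod(1 + r for r in simple)
# ... while the multi-period log return is the sum of the r_t
total_log = sum(logret)
```

Both aggregates recover the same price relative P_T/P_0, and exp(total_log) equals total_simple, illustrating Equations 12.2 through 12.5.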
12.2 MEASURING VOLATILITY AND RISK

The volatility of a financial asset is usually measured by the standard deviation of its returns. A simple but useful model for the return on a financial asset is

r_t = μ + σe_t,    (12.6)

where e_t is a shock with mean zero and variance 1, μ is the mean of the return (i.e., E[r_t] = μ), and σ² is the variance of the return (i.e., V[r_t] = σ²).

The shock can also be written as

ε_t = σe_t

so that ε_t has mean zero and variance σ². The shock is assumed to be independent and identically distributed (iid) across observations. While it is not necessary to specify the distribution of e_t, it is assumed that e_t ~ N(0,1). With this assumption, returns are also iid N(μ, σ²). However, while the normal is a convenient distribution to use, the next section demonstrates that it does not provide a full description of most financial returns.

The volatility of an asset is the standard deviation of the returns (i.e., √σ² = σ). This measure is the volatility of the returns over the time span where the returns are measured, and so if returns are computed using daily closing prices, this measure is the daily volatility.

In a simple model, the return over multiple periods is the sum of the returns. For example, if returns are calculated daily, the weekly return is

Σ_{i=1}^{5} r_{t+i} = Σ_{i=1}^{5} (μ + σe_{t+i}) = 5μ + σ Σ_{i=1}^{5} e_{t+i}

The mean of the weekly return is 5μ and, because e_t is an iid sequence, the variance of the weekly return is 5σ² and the volatility of the weekly return is √5 σ. This is an important feature of financial returns: both the mean and the variance scale linearly in the holding period, while the volatility scales with the square root of the holding period.¹

This scaling law allows us to transform volatility between time scales. The common practice is to report the annualized volatility, which is the volatility over a year. When the volatility is measured daily, it is common to convert daily volatility to annualized volatility by scaling by √252 so that:

σ_annual = √252 × σ_daily    (12.7)

The scaling factor 252 approximates the number of trading days in the US and most other developed markets. Volatilities computed over other intervals can be easily converted by computing the number of sampling intervals in a year and using this as the scaling factor. For example, if volatility is measured with monthly returns, then the annualized volatility is constructed using a scaling factor of 12:

σ_annual = √12 × σ_monthly    (12.8)

The variance, also called the variance rate, is just the square of the volatility. It scales linearly with time and so can also be easily transformed to an annual rate.

¹ The linear scaling of the variance depends crucially on the assumption that returns are not serially correlated. This assumption is plausible for the time series of returns on many liquid financial assets.
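The square-root-of-time rule in Equations 12.7 and 12.8 is one line of code; a sketch with illustrative volatility numbers:

```python
import math

def annualize_vol(vol, periods_per_year):
    """Scale a per-period volatility to annual terms: vol * sqrt(periods)."""
    return vol * math.sqrt(periods_per_year)

daily_vol = 0.01                                   # 1% daily volatility (illustrative)
annual = annualize_vol(daily_vol, 252)             # roughly 15.9% annualized
from_monthly = annualize_vol(0.05, 12)             # 5% monthly, roughly 17.3% annualized

# The variance rate scales linearly with time instead of with the square root
annual_variance = daily_vol ** 2 * 252
```

Note the caveat from the footnote: the linear scaling of variance, and hence this rule, relies on returns being serially uncorrelated.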
Consider the volatility of the returns of three assets:

1. Gold (specifically, the price as determined at the London gold fix),
2. The return on the S&P 500 index, and
3. The percentage change in the Japanese yen/US dollar rate (JPY/USD).

Returns are computed daily, weekly (using Thursday closing prices²), and monthly (using the final closing price in the month). The variance of returns is estimated using the standard estimator

σ̂² = (1/T) Σ_{t=1}^{T} (r_t - μ̂)²

applied to returns constructed at each of the three frequencies, where μ̂ is the sample average return.

Table 12.1 shows the annualized volatilities constructed by transforming the variance estimates. The annualized volatilities are similar within each asset, and the minor differences across sampling frequencies can be attributed to estimation error (because these volatilities are all estimates of the true volatility).

Table 12.1 Annualized Volatilities

Monthly    17.1%    14.5%    9.8%

Implied Volatility

Implied volatility is an alternative measure that is constructed using option prices. Both put and call options have payouts that are nonlinear functions of the underlying price of an asset. For example, the payout of a European call option is defined as

max(P_T - K, 0),    (12.9)

where P_T is the price of the asset when the option expires (T) and K is called the strike price. This expression is a convex function of the price of the asset and so is sensitive to the variance of the return on the asset.

The value of σ that equates the observed call option price with the four other parameters in the formula is known as the implied volatility. This implied volatility is, by construction, an annual value, and so it does not need to be transformed further.

The Black-Scholes-Merton option pricing model uses several simplifying assumptions that are not consistent with how markets actually operate. Specifically, the model assumes that the variance does not change over time. This is not compatible with financial data, and so the implied volatility that is calculated using the model is not always consistent across options with different strike prices and/or maturities.

The VIX Index is another measure of implied volatility that reflects the implied volatility on the S&P 500 over the next 30 calendar days, constructed using options with a wide range of strike prices.

The methodology of the VIX has been extended to many other assets, including other key equity indices, individual stocks, gold, crude oil, and US Treasury bonds. The most important limitation of the VIX is that it can only be computed for assets with large, liquid derivatives markets, and so it is not possible to apply the VIX methodology to most financial assets. Because the VIX Index makes use of option prices with expiration dates in the future, it is a forward-looking measure of volatility. This differs from backward-looking volatility estimates generated using (historical) asset returns.

² Weekly returns use Thursday-to-Thursday prices to avoid the impact of any end-of-week effects.

12.3 THE DISTRIBUTION OF FINANCIAL RETURNS

A normal distribution is symmetric and thin-tailed, and so has no skewness or excess kurtosis. However, many return series are both skewed and fat-tailed.

For example, consider the returns of gold, the S&P 500, and the JPY/USD exchange rate. The left two columns of Table 12.2 contain the skewness and kurtosis of these series when measured using daily, weekly (Thursday-to-Thursday), monthly, and quarterly returns. The monthly and quarterly returns are computed using the last trading date in the month or quarter, respectively.

All three assets have a skewness that is different from zero and a kurtosis larger than 3. In general, the skewness changes little as the horizon increases from daily to quarterly, whereas the kurtosis declines substantially. The returns on the S&P 500 and the returns on the JPY/USD exchange rate are both negatively skewed, while the returns on gold are positively skewed. This difference arises because equities and gold tend to move in different directions when markets are stressed: equities decline markedly in stress periods, while gold serves as a flight-to-safety asset and so moves in the opposite direction.

The Jarque-Bera test statistic is used to formally test whether the sample skewness and kurtosis are compatible with an assumption that the returns are normally distributed. The null hypothesis in the Jarque-Bera test is H0: S = 0 and κ = 3, where S is the skewness and κ is the kurtosis of the return distribution. These are the population values for normal random variables. The alternative hypothesis is H1: S ≠ 0 or κ ≠ 3. The test statistic is

JB = (T - 1)(S²/6 + (κ - 3)²/24)    (12.11)

Under the null, (T - 1)S²/6 has a χ²₁ distribution, and (T - 1)(κ - 3)²/24 also has a χ²₁ distribution. These two statistics are asymptotically independent (uncorrelated), and so JB ~ χ²₂.

Test statistics that have χ² distributions should be small when the null hypothesis is true and large when the null is unlikely to be correct. The critical values of a χ²₂ are 5.99 for a test size of 5% and 9.21 for a test size of 1%. When the test statistic is above these values, the null that the data are normally distributed is rejected.

The JB statistic is reported for the three assets in Table 12.2. It is rejected in all cases except for the quarterly return on the JPY/USD rate. The final column reports the p-value of the test statistic, which is 1 - F_{χ²₂}(JB), where F_{χ²₂} is the CDF of a χ²₂ random variable.

While the test rejects for equities even at the quarterly frequency, the low-frequency returns are closer to a normal than the daily or weekly returns on the S&P 500. This is a common feature of many financial return series: returns computed over longer periods appear to be better approximated by a normal distribution.
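The JB statistic in Equation 12.11 is simple to compute from the sample skewness and kurtosis; a sketch using the χ²₂ critical value quoted above:

```python
def jarque_bera(skew, kurt, T):
    """JB = (T - 1) * (S^2/6 + (k - 3)^2/24); compare with chi-squared(2)
    critical values (5.99 at a 5% size, 9.21 at a 1% size)."""
    return (T - 1) * (skew ** 2 / 6 + (kurt - 3) ** 2 / 24)

# Illustrative inputs: skewness -0.2 and kurtosis 4 in a sample of 100
jb = jarque_bera(skew=-0.2, kurt=4.0, T=100)
reject_5pct = jb > 5.99
```

With these inputs JB is about 4.79, below 5.99, so the null of normality is not rejected at the 5% level; the same moments with a larger T would reject, since JB grows linearly with T - 1.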
The tail of a normal declines rapidly as x increases, whereas many other distributions have tails that decline less quickly for large deviations. The most important class of these have power law tails, so that the probability of seeing a realization larger than a given value of x is

P(X > x) = kx^(-α),

where k and α are constants. The Student's t is an example of a widely used distribution with a power law tail.

The simplest method to compare tail behavior is to examine the natural log of the tail probability. Figure 12.2 plots the natural log of the probability that a random variable with mean zero and unit variance appears in its tail [i.e., ln Pr(X > x)]. Four distributions are shown: a normal and three parameterizations of a Student's t.

The normal curve is quadratic in x, and its tail quickly diverges from the tails of the other three distributions. The Student's t tails are linear in x, which reflects the slow decay of the probability in the tail. This slow decline is precisely why these distributions are fat-tailed and produce many more observations that are far from the mean when compared to a normal.

12.4 CORRELATION VERSUS DEPENDENCE

The dependence between assets plays a key role in portfolio diversification and tail risk. Recall that two random variables, X and Y, are independent if their joint density is equal to the product of their marginal densities:

f_{X,Y}(x, y) = f_X(x) f_Y(y)

Any random variables that are not independent are dependent. Financial assets are highly dependent and exhibit both linear and nonlinear dependence. The linear correlation estimator (also known as Pearson's correlation) measures linear dependence. Previous chapters have shown that correlation (i.e., ρ) and the regression slope (i.e., β) are intimately related, and the regression slope is zero if and only if the correlation is zero. Regression explains the sense in which correlation measures linear dependence. In the regression

Y_i = α + βX_i + ε_i

if Y and X are standardized to have unit variance (i.e., σ_X = σ_Y = 1), then the regression slope is the correlation.

In contrast, nonlinear dependence takes many forms and cannot be summarized by a single statistic. For example, many asset returns have common heteroskedasticity (i.e., the volatility across assets is simultaneously high or low). However, linear correlation does not capture this type of dependence between assets.

Alternative Measures of Correlation

Linear correlation is insufficient to capture dependence when assets have nonlinear dependence. Researchers often use two alternative dependence measures: rank correlation (also known as Spearman's correlation) and Kendall's τ (tau). These statistics are correlation-like: both are scale invariant, have values that always lie between -1 and 1, are zero when the returns are independent, and are positive (negative) when there is an increasing (decreasing) relationship between the random variables.

Rank Correlation

Rank correlation is the linear correlation estimator applied to the ranks of the observations. Suppose that n observations of two random variables, X and Y, are obtained. Let Rank_X and Rank_Y denote the ranks of the observations of each series. If the ranks of X and Y are exactly opposite, then the difference is large and the rank correlation is close to -1.

Note that rank correlation depends on the strength of the linear relationship between the ranks of X and Y, not the linear relationship between the random variables themselves. When variables have a linear relationship, rank and linear correlation are usually similar in magnitude. The rank correlation estimator is less efficient than the linear correlation estimator and is commonly used as an additional robustness check.

Kendall's τ depends on the number of concordant and discordant pairs of observations. A pair of observations (X_i, Y_i) and (X_j, Y_j) is concordant if X_i > X_j and Y_i > Y_j or if X_i < X_j and Y_i < Y_j. In other words, concordant pairs agree about the relative position of X and Y. Pairs that disagree about the order are discordant, and pairs with ties (X_i = X_j or Y_i = Y_j) are neither. Random variables with a strong positive dependence have many concordant pairs, whereas variables with a strong negative relationship have mostly discordant pairs.

Kendall's τ is defined as

τ̂ = (n_c - n_d) / (n(n - 1)/2),    (12.14)

where n_c is the number of concordant pairs and n_d is the number of discordant pairs.

Figure 12.3 The left panels plot data that are linearly dependent. The top-left panel plots the raw data along with the conditional expectation and the estimated regression line. The bottom-left panel shows the ranks of the data and the fitted regression line on the ranks. The right panels show data with a nonlinear relationship. The top-right panel shows the data, the conditional expectation, and the linear approximation. The bottom-right panel plots the ranks of the data along with the fitted regression line.

Figure 12.4 Relationship between Pearson's correlation (ρ), Spearman's rank correlation (ρ_s), and Kendall's τ for a bivariate normal distribution.
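Both alternative measures can be computed in a few lines. A self-contained sketch that assumes no ties, matching the discussion above (rank correlation as Pearson's correlation applied to ranks, and Kendall's τ from concordant and discordant pair counts):

```python
def ranks(values):
    """Rank observations from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Rank correlation: the linear correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """tau = (n_c - n_d) / (n(n - 1)/2) from concordant/discordant pairs."""
    n, nc, nd = len(x), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                nc += 1
            elif s < 0:
                nd += 1
    return (nc - nd) / (n * (n - 1) / 2)
```

Applied to the five pairs in Practice Question 12.13 below, kendall_tau returns 0, matching the worked answer.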
Table 12.3 contains the population values for the linear correlation, rank correlation, and Kendall's τ using the two DGPs depicted in Figure 12.3. The linear and rank correlations are nearly identical when the DGP is linear, and Kendall's τ is predictably lower here. When the DGP is nonlinear, both the rank correlation and Kendall's τ are larger than the linear correlation. This pattern in the two alternative measures of correlation indicates that the dependence in the data has important nonlinearities.

Table 12.3

                      Linear DGP    Nonlinear DGP
Linear Correlation      0.707          0.291
Rank Correlation        0.690          0.837
Kendall's τ             0.500          0.667

12.5 SUMMARY

This chapter began with a discussion of simple returns and log returns. These two measures broadly agree when returns are small but substantially disagree if they are large in magnitude. In practice, log returns are used when prices are sampled frequently, because the approximation error is unimportant. Over longer horizons, the simple return provides a more accurate measure of the performance of an investment.

Both the mean and the variance of asset returns grow linearly in time, so that the h-period mean and variance are h multiplied by the first-period mean and variance. The standard deviation, which is the square root of the variance, grows with √h.

Implied volatility is an alternative measure of risk that can be constructed for assets that have liquid options markets. Implied volatility can be estimated using the Black-Scholes-Merton option pricing model or the VIX Index, which reflects the implied volatility on the S&P 500 over the next 30 calendar days.

While an assumption of normality is convenient, it is not consistent with most observed financial asset return data. In fact, most returns are both skewed and heavy-tailed. These two features are most prominent when measuring returns using high-frequency data (e.g., daily), whereas the normal is a better approximation over long horizons (i.e., one quarter or longer).

Deviations from normality can be examined in a couple of ways. The Jarque-Bera test examines the skewness and kurtosis of a data series to determine whether it is consistent with an assumption of normality. Meanwhile, the tail index is a measure of the tail decay that examines the rate at which the log-density decays. Small values of the tail index indicate that the returns on an asset are heavy-tailed.

The chapter concluded with an examination of alternative measures of dependence. While linear correlation is an important measure, it is only a measure of linear dependence. Two alternative measures, rank correlation and Kendall's τ, are more suited to measure nonlinear dependence. These two measures are correlation-like in the sense that they are between -1 and 1. Both also have two desirable properties when compared to linear correlation: They are invariant to monotonic increasing transformations and they are robust to outliers.
QUESTIONS

Practice Questions

12.4. On Black Monday in October 1987, the Dow Jones Industrial Average fell from the previous close of 2,246.74 to 1,738.74. What was the simple return on this day? What was the log return?

12.5. If the annualized volatility on the S&P 500 is 20%, what is the volatility over one day, one week, and one month?

12.6. If the mean return on the S&P 500 is 9% per year, what is the mean return over one day, one week, and one month?

12.7. If an asset has zero skewness, what is the maximum kurtosis it can have to not reject normality with a sample size of 100 using a 5% test? What if the sample size is 2,500?

12.8. If the skewness of an asset's return is -0.2 and its kurtosis is 4, what is the value of a Jarque-Bera test statistic when T = 100? What if T = 1,000?

12.9. Calculate the simple and log returns for the following data:

Time    Price
 0      100
 1       98.90
 2       98.68
 3       99.21
 4       98.16
 5       98.07
 6       97.14
 7       95.70
 8       96.57
 9       97.65
10       96.77

12.10. The implied volatility for an at-the-money (ATM) option is reported at 20%, annualized. Based upon this, what would be the daily implied volatility?

12.11. The following data is collected for four distributions:

Dataset    Skew    Kurtosis      T
A          0.85      3.00        50
B          0.85      3.00        51
C          0.35      3.35       125
D          0.35      3.35       250

Which of these datasets are likely (at the 95% confidence level) to not be drawn from a normal distribution?

12.12. Calculate the rank correlations for the following data:

 i       X        Y
 1      0.22     2.73
 2      1.41     6.63
 3     -0.30    -2.19
 4     -0.59    -6.51
 5     -3.08    -0.99
 6      1.08     2.63
 7     -0.45    -3.40
 8      0.40     5.10
 9     -0.75    -5.14
10      0.24     1.14

12.13. Find Kendall's τ for the following data:

 i       X        Y
 1      3.12     2.58
 2     -1.26    -0.05
 3      2.08    -0.72
 4     -0.28    -0.52
 5     -1.96    -0.40
ANSW ERS
12.2. Yes. Linear correlation is sensitive in the sense that mov 12.3. Covariance may have either sign, and so the square
ing a single point in X 1f X 2 space far from the true linear root transform ation is not reliable if the sign is negative.
relationship can severely affect the correlation. Effec Transformation to (3 or correlation is preferred when
tively a single outlier has an unbounded effect on the exam ining the magnitude of a covariance.
covariance, and ultimately the linear correlation. Rank
Solved Problems
12.4. The simple return is (1,738.74 - 2,246.74)/2,246.74 = -22.6%. The log return is ln 1738.74 - ln 2246.74 = -25.6%. The difference between the two is increasing in the magnitude of the return.

12.5. The scale factors for the variance are 252, 52, and 12, respectively. The volatility scales with the square root of the scale factor, and so 20%/√252 = 1.26%, 20%/√52 = 2.77%, and 20%/√12 = 5.77%.

12.6. The mean scales linearly with the scale factor, and so 9%/252 = 0.036%, 9%/52 = 0.17%, and 9%/12 = 0.75%.

12.7. The Jarque-Bera statistic has a χ²₂ distribution, and the critical value for a test with a size of 5% is 5.99. The Jarque-Bera statistic is

JB = (T - 1)(S²/6 + (κ - 3)²/24)

so that when the skewness S = 0, the test statistic is (T - 1)(κ - 3)²/24. In order to not reject the null, we need (T - 1)(κ - 3)²/24 < 5.99, and so (κ - 3)² < 24 × 5.99/(T - 1) and κ < 3 + √(24 × 5.99/(T - 1)). When T = 100, this value is 4.20. When T = 2,500 this value is 3.24. This shows that the JB test statistic is sensitive to even mild excess kurtosis.

12.8. Using the formula in the previous problem, the value of the JB is 4.785 when T = 100 and 48.3 when T = 1,000.

12.9. Simple returns use the formula

R_t = (P_t - P_{t-1})/P_{t-1}

Log returns use the formula

r_t = ln P_t - ln P_{t-1} = ln(P_t/P_{t-1})

Time    Price    Simple    Log
 0      100
 1       98.90   -1.10%    -1.11%
 2       98.68   -0.22%    -0.22%
 3       99.21    0.54%     0.54%
 4       98.16   -1.06%    -1.06%
 5       98.07   -0.09%    -0.09%
 6       97.14   -0.95%    -0.95%
 7       95.70   -1.48%    -1.49%
 8       96.57    0.91%     0.90%
 9       97.65    1.12%     1.11%
10       96.77   -0.90%    -0.91%

12.10. Using the equation

σ_annual = √252 × σ_daily
0.2 = √252 × σ_daily
σ_daily = 0.2/√252 = 1.26%
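The kurtosis bound in 12.7 and the statistics in 12.8 can be re-derived numerically; a sketch:

```python
def max_kurtosis(T, crit=5.99):
    """Largest kurtosis (with zero skewness) that does not reject normality:
    k = 3 + sqrt(24 * crit / (T - 1))."""
    return 3 + (24 * crit / (T - 1)) ** 0.5

def jb_stat(skew, kurt, T):
    """Jarque-Bera statistic (T - 1)(S^2/6 + (k - 3)^2/24)."""
    return (T - 1) * (skew ** 2 / 6 + (kurt - 3) ** 2 / 24)

k100 = max_kurtosis(100)     # about 4.20
k2500 = max_kurtosis(2500)   # about 3.24
```

The shrinking bound as T grows is exactly the sensitivity to mild excess kurtosis noted in the answer to 12.7.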
12.11. The appropriate test statistic is the Jarque-Bera statistic:

JB = (T - 1)(S²/6 + (κ - 3)²/24)

This will be distributed as χ²₂, and for a 5% test size the critical value is 5.99.

Dataset    Skew    Kurtosis      T      JB
A          0.85      3.00        50    5.90
B          0.85      3.00        51    6.02
C          0.35      3.35       125    3.16
D          0.35      3.35       250    6.35

Datasets B and D have JB statistics above the critical value, and so they are unlikely to be drawn from a normal distribution.

12.12. First, rank the data and compute the rank differences:

 i       X        Y       R_X    R_Y    R_X - R_Y
 1      0.22     2.73      6      8        -2
 2      1.41     6.63     10     10         0
 3     -0.30    -2.19      5      4         1
 4     -0.59    -6.51      3      1         2
 5     -3.08    -0.99      1      5        -4
 6      1.08     2.63      9      7         2
 7     -0.45    -3.40      4      3         1
 8      0.40     5.10      8      9        -1
 9     -0.75    -5.14      2      2         0
10      0.24     1.14      7      6         1

The sum of the squared rank differences is 32, so the rank correlation is

ρ̂_s = 1 - 6 Σ d_i² / (n(n² - 1)) = 1 - (6 × 32)/(10 × 99) = 0.806

12.13. The first step is to rank the data:

 i       X        Y       R_X    R_Y
 1      3.12     2.58      5      5
 2     -1.26    -0.05      2      4
 3      2.08    -0.72      4      1
 4     -0.28    -0.52      3      2
 5     -1.96    -0.40      1      3

The next step is to see which pairs are concordant, which are discordant, and which are neither. For the first observation, because both the rank of X and Y are the maximum, every other observation will automatically be concordant with respect to this one. By contrast, looking at the second and third observations, the X rank increases while the Y rank decreases. Therefore, this pair is discordant. Counting all pairs gives n_c = 5 and n_d = 5, so

τ̂ = (n_c - n_d)/(n(n - 1)/2) = (5 - 5)/10 = 0
• Describe the basic steps to conduct a Monte Carlo simulation.
• Describe ways to reduce Monte Carlo sampling error.
• Explain the use of antithetic and control variates in reducing Monte Carlo sampling error.
• Describe pseudo-random number generation.
• Describe situations where the bootstrapping method is ineffective.
• Describe the disadvantages of the simulation approach to financial problem solving.

The fundamental difference between simulation and bootstrapping is the source of the simulated data. When using simulation, the user specifies a complete DGP that is used to produce the simulated data. In an application of the bootstrap, the observed data are used directly to generate the simulated data set without specifying a complete DGP.

13.1 SIMULATING RANDOM VARIABLES

Suppose X ~ F_X, where F_X is a known distribution. In simple distributions, many moments (e.g., the mean or variance) are analytically tractable. For example, if X ~ N(μ, σ²) then E[X] = μ and V[X] = σ².

However, analytical expressions for moments are not available for several relevant cases (e.g., when random variables are constructed from complex dynamic models or when computing the expectation of a complicated nonlinear function of a random variable). Simulation provides a simple method to approximate the analytically intractable expectation.

Figure 13.1 The top panel illustrates random values generated from a Student's t distribution using uniform values, marked on the x-axis. The uniform values are mapped through the inverse CDF of the Student's t to generate the random variates marked on the y-axis. The bottom panel shows how the generated values map back to the uniform originally used to produce each value through the CDF.
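The construction in Figure 13.1, mapping uniforms through an inverse CDF, is the standard way to simulate random variables. A sketch using the exponential distribution, whose inverse CDF has a closed form (the Student's t in the figure would require a numerical inverse):

```python
import math
import random

def inverse_cdf_exponential(u, lam=1.0):
    """Inverse CDF of Exp(lam): F^{-1}(u) = -ln(1 - u) / lam."""
    return -math.log(1.0 - u) / lam

def simulate(n, lam=1.0, seed=42):
    """Draw U ~ Uniform(0, 1) and map each draw through the inverse CDF."""
    rng = random.Random(seed)
    return [inverse_cdf_exponential(rng.random(), lam) for _ in range(n)]

draws = simulate(100_000, lam=2.0)
approx_mean = sum(draws) / len(draws)   # should be close to 1 / lam = 0.5
```

Any distribution with a computable inverse CDF can be simulated this way, which is why uniform pseudo-random number generation is the building block for all the simulations in this chapter.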
This variance is estimated using

σ̂²_g = (1/b) Σ_{i=1}^{b} (g(x_i) − Ê[g(X)])²,    (13.3)

which is the standard variance estimator for iid data. The standard error of the simulated expectation (i.e., σ̂_g/√b) measures the accuracy of the approximation, which allows b to be chosen to achieve any desired level of accuracy.

Simulations are applicable to a wide range of problems, and the set of simulated draws {g_1, g_2, ..., g_b} can be used to compute other quantities of interest. For example, the α-quantile of g(X) can be estimated using the empirical quantile of the simulated values. This is particularly important in risk management when estimating the lower quantiles, α ∈ {0.1%, 1%, 5%, 10%}, of a portfolio's return distribution. The empirical α-quantile is estimated by sorting the b draws from smallest to largest and then selecting the value in position bα of the sorted set.

Simulation studies are also used to assess the finite-sample distributions of estimators, because many of the results in this book rely on asymptotic results based on the LLN and the CLT. Assessing the uncertainty in an estimated parameter aids in understanding the impact of alternative decision rules (e.g., whether to hedge a risk fully or partially) under realistic conditions where the true parameter is unknown.

Consider the finite-sample distribution of a model parameter θ. First, a simulated sample of random variables x = [x_1, x_2, ..., x_n] is generated from the assumed DGP. These simulated values are then used to estimate θ̂.

This process (simulating a new data set and then estimating the model parameters) is repeated b times. This produces a set of b samples {θ̂_1, θ̂_2, ..., θ̂_b} from the finite-sample distribution of the estimator of θ. These values can be used to assess any property (e.g., the bias, variance, or quantiles) of the finite-sample distribution of θ̂.

For example, the bias, defined as

Bias(θ̂) = E[θ̂] − θ,

is estimated in a finite sample of size b using

Bias(θ̂) ≈ (1/b) Σ_{i=1}^{b} θ̂_i − θ.    (13.4)

Figure 13.2 Distribution of the estimated expected value of a normal with μ = 1 and σ² = 9 for simulated sample sizes of 50 and 1,250.
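The variance estimator in Equation 13.3 and the bias approximation in Equation 13.4 can be sketched as follows (an added illustration; g(X) = X² with X ~ N(0, 1) is a hypothetical choice whose true expectation is 1):

```python
import math
import random

random.seed(7)
b = 10_000

# Simulate g(X) = X**2 with X ~ N(0, 1); the true expectation E[X**2] is 1.
g = [random.gauss(0.0, 1.0) ** 2 for _ in range(b)]

g_bar = sum(g) / b                                  # simulated E[g(X)]
sigma2_g = sum((gi - g_bar) ** 2 for gi in g) / b   # Equation 13.3
std_err = math.sqrt(sigma2_g / b)                   # sigma_g / sqrt(b)
bias = g_bar - 1.0                                  # Equation 13.4 with known theta

print(round(g_bar, 3), round(std_err, 4), round(bias, 3))
```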
Approximating this expectation in a Monte Carlo simulation requires generating b (where b = n) draws from a normal with mean μ and variance σ².

Figure 13.2 contains a density plot of Ê[X] = X̄, where μ = 1 and σ = 3, for two replication sizes, b = 50 and b = 1,250. These densities were constructed by generating 1,000 independent approximations for each replication size.

The sampling error of a simulated expectation is proportional to 1/√b. This rate of convergence holds for any moment approximated using simulation and is a consequence of the CLT for iid data. Furthermore, this convergence rate applies even to approximations of the expected value of complex functions because the b replications are iid.
b      Mean   Std. Dev.
50     1.017  0.428
250    1.001  0.185
1,250  1.005  0.087

where x_i is a simulated value from a N(0, σ²), r_f is the risk-free rate, and σ² is the variance of the stock return. Both the risk-free rate and the variance are measured using annualized values, and T is defined so that one year corresponds to T = 1.
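As a check (added here, not from the text), the standard deviations in the table above are close to the CLT prediction σ/√b with σ = 3:

```python
import math

sigma = 3.0  # the standard deviation of the simulated normal (variance 9)

# Replication size b mapped to the standard deviation reported in the table.
reported = {50: 0.428, 250: 0.185, 1_250: 0.087}

for b, sd in reported.items():
    print(b, round(sigma / math.sqrt(b), 3), sd)  # CLT: s.e. = sigma / sqrt(b)
```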
The final stock price is S_T = exp(s_T), and the present value of the option's payoff is

C = exp(−r_f T) max(S_T − K, 0).    (13.6)

Simulating the price of an option requires an estimate of the stock price volatility. The estimated volatility from the past 50 years of S&P 500 returns is σ̂ = 16.4%. The risk-free rate is assumed to be 2%, the option expires in two years, the initial price is S_0 = 2,500, and the strike price is K = 2,500.

Using Equation 13.5, the simulated stock prices are generated, and the number of replications varies from b = 50 to 3,276,800 in constant multiples of 4. Using the notation of the previous section, the function g(x_i) is the discounted option payoff in Equation 13.6.

The expected price of a call option is estimated by the average of the simulated values:

Ê[C] = C̄ = (1/b) Σ_{i=1}^{b} C_i,

where C_i are simulated values of the payoff of the call option. The left column of Table 13.2 shows the expected price of a call option from a single approximation, and the center column reports the estimated standard error of the approximation. The standard error is computed as σ̂_C/√b, where σ̂²_C = (1/b) Σ_{i=1}^{b} (C_i − C̄)² is the variance estimator applied to the simulated call prices.

A two-sided 95% confidence interval for the price can be constructed using the average simulated price and its standard error:

[Ê[C] − 1.96 × s.e.(Ê[C]), Ê[C] + 1.96 × s.e.(Ê[C])]

Table 13.2 Estimated Present Value of the Payoff of a European Call Option with Two Years to Maturity. The Left Column Contains the Approximation of the Price Using b Replications. The Center Column Reports the Estimated Standard Error of the Approximation. The Right Column Reports the Ratio of the Standard Error in the Row to the Standard Error in the Row Above.

b        Price   Std. Err.  Std. Err. Ratio
200      283.63  30.19      2.36
12,800   275.62  3.64       2.06
51,200   278.49  1.85       1.97
204,800  278.64  0.93       1.99
819,200  278.85  0.46       2.01
Antithetic variables and control variates are complementary, and they can be used simultaneously to reduce the approximation error in a Monte Carlo simulation.

Antithetic Variables

Antithetic variates add a second set of random variables that are constructed to have a negative correlation with the iid variables used in the simulation. They are generated in pairs using a single uniform value. Recall that if U_1 is a uniform random variable, then F_X^{-1}(U_1) ~ F_X. Antithetic variables can be generated using U_2 = 1 − U_1, which is negatively correlated with U_1.

Control Variates

Control variates are an alternative method to reduce simulation error. The standard simulation method uses the approximation

Ê[g(X)] = (1/b) Σ_{i=1}^{b} g(X_i).

This approximation is consistent because each random variable g(X_i) can be decomposed as E[g(X_i)] + η_i, where η_i is a mean-zero error. A control variate is another random variable h(X_i) that has mean 0 (i.e., E[h(X_i)] = 0) but is correlated with the error η_i (i.e., Corr[h(X_i), η_i] ≠ 0).
The coefficient β that minimizes the approximation error is estimated using the regression

g(x_i) = α + βh(x_i) + ν_i.    (13.9)

Application: Valuing an Option

Reconsider the example of simulating the price of a European call option. A put option has the payoff function max(0, K − S_T), so it is in the money when the final stock price is below the strike price K. Here, the control variate is constructed by subtracting the closed-form Black-Scholes-Merton put price from the simulated put price:

P_i = exp(−r_f T) max(0, K − S_{T,i}) − P_BS(r_f, S_0, K, σ, T).

The simulated price of the put estimates the analytical price, and so the difference is mean zero (i.e., E[P_i] = 0). The call option price is then approximated by

Ê[C] = (1/b) Σ_{i=1}^{b} (C_i − β̂P_i).

The left-center columns of Table 13.3 show the prices simulated using antithetic random variables, along with their standard errors. Because the random values used in the simulation are negatively correlated, the standard errors are reduced.

Table 13.3 Present Value of the Payoff of a European Call Option with Two Years to Maturity. The Left Columns Use b iid Simulations. The Results in the Left-Center Columns Use Antithetic Variables. The Results in the Right-Center Columns Add a Control Variate to the Simulation. The Results in the Right-Most Columns Use Both Antithetic Variables and a Control Variate.

n    Price  Std. Err.  Price  Std. Err.  Price  Std. Err.  Price  Std. Err.
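A control-variate sketch (added illustration with a hypothetical target): estimate E[exp(U)] for U ~ Uniform(0, 1), whose true value is e − 1, using h(U) = U − 0.5 as the mean-zero control.

```python
import math
import random

random.seed(3)
b = 10_000

u = [random.random() for _ in range(b)]
g = [math.exp(ui) for ui in u]   # target draws; E[exp(U)] = e - 1
h = [ui - 0.5 for ui in u]       # control variate with known mean zero

mean_g = sum(g) / b
mean_h = sum(h) / b

# Equation 13.9: the optimal coefficient is the OLS slope of g on h.
beta = sum((gi - mean_g) * (hi - mean_h) for gi, hi in zip(g, h)) / \
       sum((hi - mean_h) ** 2 for hi in h)

adjusted = [gi - beta * hi for gi, hi in zip(g, h)]
controlled = sum(adjusted) / b

# The residual variance after the control adjustment is smaller by construction,
# so the controlled estimator has a smaller standard error.
var_plain = sum((gi - mean_g) ** 2 for gi in g) / b
var_ctrl = sum((ai - controlled) ** 2 for ai in adjusted) / b
print(round(mean_g, 4), round(controlled, 4))
```

Because Corr[exp(U), U] is close to one, the controlled estimator removes most of the simulation noise.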
Each replication (simulating a data set and then estimating the model parameters) can be very time-consuming. This cost limits the ability of an analyst to explore the specification used for the DGP. Therefore, analytical expressions should be preferred when available, and simulation should be used when they are not.

13.3 BOOTSTRAPPING

Bootstrapping is an alternative method to generate simulated data. However, while both simulation and bootstrapping make use of observed data, there are some important differences.

Simulation uses observed data to estimate key model parameters (e.g., the mean and standard deviation of the S&P 500). These parameters are then combined with an assumption about the distribution of the standardized returns to complete the DGP. In contrast, bootstrapping directly uses the observed data to simulate a sample that has similar characteristics. Specifically, it does this without directly modeling the observed data or making specific assumptions about their distribution.

There are two classes of bootstraps used in risk management. The first is known as the iid bootstrap. It is extremely simple to implement. For example, the table below shows five bootstrap samples, each of size five, drawn with replacement from a sample of size 20:

Sample   Bootstrap Indices
1        3, 6, 10, 14, 8
2        6, 3, 17, 11, 12
3        2, 2, 19, 18, 8
4        5, 18, 10, 19, 2
5        10, 15, 12, 7, 5

The bootstrap is otherwise identical to Monte Carlo simulation, where the bootstrap samples are used in place of the simulated samples. The expected value of a function is approximated by

Ê[g(X)] = (1/b) Σ_{j=1}^{b} g(x*_{1,j}, x*_{2,j}, ..., x*_{m,j}),    (13.10)

where x*_{i,j} is observation i from bootstrap sample j and a total of b bootstrap replications are used.

The iid bootstrap is generally applicable when observations are independent across time. In some situations, this is a reasonable assumption. For example, if sampling from the S&P 500 in 2005 or 2017, both calm periods with low volatility, the distribution of daily returns is similar throughout the entire year, and so the iid bootstrap is a reasonable choice.

However, it is often the case that financial data are dependent across time, and thus a more sophisticated bootstrapping method is required. The simplest of these methods is known as the circular block bootstrap (CBB). It differs in only one important aspect: instead of sampling a single observation with replacement, the CBB samples blocks of size q with replacement.

Footnote: where ! is the factorial operator. In the example in the text with five bootstrap samples, each with a size of five drawn from a sample size of 20, this value is 93.4%.
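The two-step iid bootstrap can be sketched as follows (an added illustration on hypothetical data):

```python
import random
import statistics

random.seed(9)
# Hypothetical observed data: 20 daily returns (illustrative values only).
data = [random.gauss(0.0005, 0.01) for _ in range(20)]

def iid_bootstrap_sample(x, m):
    """Steps 1-2 of the iid bootstrap: draw m indices with replacement."""
    n = len(x)
    return [x[random.randrange(n)] for _ in range(m)]

# Equation 13.10 with g() taken to be the sample mean: approximate the
# finite-sample distribution of the mean with b bootstrap replications.
b = 1_000
means = [statistics.fmean(iid_bootstrap_sample(data, len(data))) for _ in range(b)]

print(round(statistics.fmean(means), 4), round(statistics.stdev(means), 4))
```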
GENERATING A SAMPLE USING THE iid BOOTSTRAP

1. Generate a random set of m integers, {i_1, i_2, ..., i_m}, from {1, 2, ..., n} with replacement.
2. Construct the bootstrap sample as {x_{i_1}, x_{i_2}, ..., x_{i_m}}.

GENERATING A SAMPLE USING THE CBB

1. Select the block size q.
2. Select a block index i from {1, 2, ..., n} and append {x_i, x_{i+1}, ..., x_{i+q-1}} to the bootstrap sample, where indices larger than n wrap around.
3. While the bootstrap sample has fewer than m elements, repeat step 2.
4. If the bootstrap sample has more than m elements, drop values from the end of the bootstrap sample until the sample size is m.

For example, suppose that 100 observations are available and they are sampled in blocks of size q = 10. 100 blocks are constructed, starting with {x_1, x_2, ..., x_10}, {x_2, x_3, ..., x_11}, ..., {x_91, x_92, ..., x_100}, {x_92, x_93, ..., x_100, x_1}, ..., {x_100, x_1, ..., x_9}. Note that while the first 91 blocks use ten consecutive observations, the final nine blocks all wrap around (as if observation one
occurs immediately after observation 100).

The CBB is used to produce bootstrap samples by sampling blocks, with replacement, until the required bootstrap sample size is produced. If the number of observations sampled is larger than the required sample size, the final observations in the final block are dropped. For example, if the block size is ten and the desired sample size is 75, then eight blocks of ten are selected and the final five observations from the final block are dropped.

The final detail required to use the CBB is the choice of the block size q. This value should be large enough to capture the dependence in the data, although not so large as to leave too few blocks. The rule-of-thumb is to use a block size equal to the square root of the sample size (i.e., √n).

Bootstrapping Stock Market Returns

When the loss is measured as a percentage of the portfolio value, the p-VaR is equivalently defined as a quantile of the return distribution. As an example, consider the one-year VaR for the S&P 500. Computing the VaR requires simulating many copies of one year of data (i.e., 252 days) and then finding the empirical quantile of the simulated annual returns.

The data set used spans from 1969 until 2018. Both the iid bootstrap and the CBB are used to generate bootstrap samples. The data set is large (i.e., 13,000 observations), and so the block size for the CBB is set to 126 observations (i.e., six months of daily data).

Bootstrapped annual log returns are simulated by summing 252 bootstrapped daily log returns:

r*_j = Σ_{i=1}^{252} r*_{i,j},

where r*_{i,j} is the i-th daily log return in bootstrap sample j.
Figure 13.4 Left tail of the simulated annual return distribution using the iid and circular block bootstraps. The
tail shows 10% of the return density. The points below the density show the lowest 1% of the simulated annual
returns produced by the two bootstrap methods.
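A sketch of the CBB and the annual-VaR calculation (added illustration; the daily returns here are synthetic stand-ins for the S&P 500 sample described in the text):

```python
import random

random.seed(13)
# Synthetic daily log returns standing in for the S&P 500 sample in the text.
returns = [random.gauss(0.0003, 0.011) for _ in range(13_000)]

def cbb_sample(x, m, q):
    """Circular block bootstrap: draw blocks of size q, wrapping past the end,
    then trim the sample back to m observations."""
    n = len(x)
    out = []
    while len(out) < m:
        start = random.randrange(n)
        out.extend(x[(start + j) % n] for j in range(q))
    return out[:m]

# Each simulated annual log return is the sum of 252 bootstrapped daily returns;
# the block size of 126 follows the six-month choice in the text.
b = 2_000
annual = sorted(sum(cbb_sample(returns, 252, 126)) for _ in range(b))

# The 5% VaR of the return distribution is the empirical 5% quantile
# (position b * alpha in the sorted draws).
var_5 = annual[int(0.05 * b)]
print(round(var_5, 3))
```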
QUESTIONS
Practice Questions
13.9 Suppose you are interested in approximating the expected value of an option. Based on an initial sample of 100 replications, you estimate that the fair value of the option is USD 47 using the mean of these 100 replications. You also note that the standard deviation of these 100 replications is USD 12.30. How many simulations would you need to run in order to obtain a 95% confidence interval that is less than 1% of the fair value of the option? How many would you need to run to get within 0.1%?

13.10 Plot the variance reduction when using antithetic random variables.

13.13 For the ten N(0,1) draws in the table below:

a. Calculate the option payoffs using the equation Max(x − 0.5, 0).
b. What are the antithetic counterparts for each scenario and their associated values?
c. Assuming that the sample correlation holds for the entire population, what is the change in expected sample standard error attainable by using antithetic sampling?
d. How does the answer to part c change if the payoff function is Max(|x − 0.5|, 0)?

Scenario  N(0,1)
1   −0.65
2   −0.12
3    0.92
4    1.28
5    0.63
6   −1.98
7    0.22
8    0.40
9    0.86
10   1.74
ANSWERS
Solved Problems
13.9 The standard deviation is USD 12.30, and a 95% confidence interval is [μ̂ − 1.96 × σ/√n, μ̂ + 1.96 × σ/√n], which has width 2 × 1.96 × σ/√n. Setting this width to 1% of the fair value gives 2 × 1.96 × 12.30/√n = 0.47, so √n = 102.6 and n = 10,525. Using 0.1%, replace 0.47 with 0.047, giving √n = 1,025.9 and n = 1,052,415.
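The arithmetic can be checked with a short script (added illustration):

```python
import math

fair_value, sd, z = 47.0, 12.30, 1.96

def replications_needed(width):
    """Smallest n such that the 95% CI width 2 * z * sd / sqrt(n) is below width."""
    root_n = 2 * z * sd / width
    return math.ceil(root_n ** 2)

print(replications_needed(0.01 * fair_value))   # CI width below 1% of fair value
print(replications_needed(0.001 * fair_value))  # CI width below 0.1%
```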
13.12 Essentially, the question is to determine the cumulative distribution up to 1.2 on an N(1,5) distribution. This can be done with the Excel function NORMDIST(1.2, 1, SQRT(5), TRUE), which returns the value 0.535.

13.13 a. The formula is simply implemented in Excel. For the first entry: Max(−0.65 − 0.5, 0) = 0.

b. Recall that for a N(0,1) distribution, F_X^{-1}(U_i) = −F_X^{-1}(1 − U_i), so the antithetic counterpart of each draw is its negative. Accordingly, the table is populated as follows:

Scenario  N(0,1)  Option Payoff  Antithetic Sample  Antithetic Payoff
1   −0.65  0     0.65   0.15
2   −0.12  0     0.12   0
3    0.92  0.42  −0.92  0
4    1.28  0.78  −1.28  0
5    0.63  0.13  −0.63  0
6   −1.98  0     1.98   1.48
7    0.22  0     −0.22  0
8    0.40  0     −0.40  0
9    0.86  0.36  −0.86  0
10   1.74  1.24  −1.74  0
c. Using Excel, the correlation between the original and antithetic option payoffs is −27%. Therefore, the reduction in standard error is 1 − √(1 + ρ) = 15%.

d. The correlation between the two option value vectors is +25%, so the increase in standard error is √(1 + ρ) − 1 = 12%.
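Parts a through c can be verified with a few lines (added illustration):

```python
draws = [-0.65, -0.12, 0.92, 1.28, 0.63, -1.98, 0.22, 0.40, 0.86, 1.74]

def payoff(x):
    return max(x - 0.5, 0.0)

original = [payoff(x) for x in draws]
antithetic = [payoff(-x) for x in draws]  # the antithetic of a N(0,1) draw is its negative

def correlation(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rho = correlation(original, antithetic)
reduction = 1 - (1 + rho) ** 0.5
print(round(rho, 2), round(reduction, 2))  # prints: -0.27 0.15
```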
240 ■ Index
independence, 5-7 Markov, Andrey, 147
conditional, 6-7 maximum likelihood estimators (MLEs), 148
independent, identically distributed random variables, 56-57 mean, 16-17
independent variables, 54 estimation of, 64-65
inference testing, 110-111 of random variable, 226-227
information criterion (IC) sample behavior of, 71-73
Akaike, 177, 178 scaling of, 66
Bayesian, 177-178 and standard deviation, 66-67
interquartile range (IQR), 20 of two variables, 75-76
invertibility, 178 and variance using data, 67
median, 73-74
J mixture distributions, 40-42
MLEs. See maximum likelihood estimators (MLEs)
Japanese yen/US dollar rate (JPY/USD), 199, 200, 214, 215
model selection, 174-177
Jarque-Bera test statistic, 215
moments, 51-54
Jensen's inequality, 15, 202
multivariate. See multivariate moments
random variable
K central moments, 16
Kendall's τ, 217-219
Kernel density plots, 69 kurtosis, 16-17
kurtosis, 16-17, 67-69 linear transformations, 17-18
k-variate linear regression model, 120. See also multiple linear mean, 16-17
regression non-central moments, 15, 16
skewness, 16-17
L standard deviation, 16
variance, 16-17
law of large numbers (LLN), 71, 73
Monte Carlo simulation
linear correlation, 216
antithetic variables, 229
linearity, 103
approximating moments, 224, 225
linear process, 160-161
vs. bootstrapping, 234
linear regression, 102
control variates, 229-230
CAPM, 111-113
generating random values/variables, 224, 225
dummy variables, 104
improving accuracy of, 229
hedging, 113-114
limitation, 231
inference and hypothesis testing, 110-111
mean of random variable, 226-227
linearity, 103
price of European call option, 227-231
OLS, 104-110
moving average (MA) models, 169-171
ordinary least squares, 104-110
multicollinearity
parameters, 102-103
outliers, 146-147
performance evaluation, 114-115
residual plots, 146
transformations, 103-104
multi-factor risk models, 122-125
linear transformations, 17-18
multiple explanatory variables, 120-122
linear unbiased estimator (LUE), 70
multiple linear regression, 103
Ljung-Box Q-statistic, 193, 194
F-test statistics, 127-129
log-normal distribution, 34-36
with indistinct variables, 120
log-quadratic time trend model, 189
model fit measurement, 125-127
log returns, 212, 213
multi-factor risk models, 122-125
LUE estimator, 147
with multiple explanatory variables, 120-122
multivariate confidence intervals, 128-129
M R2 method, 126-127
marginal and conditional distributions, 56 stepwise procedure, 121
marginal distributions, 49-50 testing parameters, 127-131
market variation, 122 t-test statistics, 127
multivariate confidence intervals, 128-129 Poisson random variables, 30-31
multivariate moments, 74 polynomial trends, 188-190
coskewness and cokurtosis, 76-78 portfolio diversification, 53
covariance and correlation, 74-75 power laws, 215-216
sample mean of two variables, 75-76 probability
multivariate random variables Bayes' rule, 7
conditional expectation, 54-55 events, 2-5
continuous random variables, 55-56 event space, 2-5
discrete random variables, 48-51 fundamental principles, 2, 4
expectations and moments, 51-54 independence, 5-7
independent, identically distributed random variables, 56-57 matrices, 48-49
in rectangular region, 55
N sample space, 2-5
probability mass function (PMF), 12, 13
non-central moments, 15, 16
pseudo-random number generators (PRNGs), 224
non-stationary time series
pseudo-random numbers, 224
cyclical component, 192
p-Value-at-Risk (p-VaR), 232
default premium, 200, 201
p-value of test statistic, 91-92
forecasting, 201-202
log of gasoline consumption, in United States, 192-193
random walks and unit roots, 194-199 R
seasonality, 190-192 random variables
spurious regression, 199, 200 continuous, 18-20
time trends, 188-190 cumulative distribution function, 12
normal distribution, 32-34 definition of, 12-13
null and alternative hypotheses, 84-85 discrete, 13-14
expectations, 14
O modes, 20-21
OLS estimators. See ordinary least squares (OLS) estimators moments, 14-18
omitted variables, 106, 140-141 probability mass function, 12
one-sided alternative hypotheses, 85 quantiles, 20-21
ordinary least squares (OLS), 104-106 random walks, 194-199
constant variance of shocks, 107 rank correlation, 216-219
iid random variables, 107 regression
implications of assumptions, 108-110 analysis, 102
no outliers, 107-108 with multiple explanatory variables. See multiple linear regression
parameter estimators properties, 106-110 regression diagnostics
shocks are mean zero conditional, 106-107 bias-variance tradeoff, 141-142
standard error estimation, 110 extraneous variable, 141
variance of X, 107 heteroskedasticity, 142-145
ordinary least squares (OLS) estimators, 120, 121, 126 multicollinearity, 145-147
homoscedasticity, 142, 145 omitted variable, 140-141
strengths of, 147-149 returns, measuring, 212
weighted least squares, 145 R2 method, 126-127
P S
PACF. See partial autocorrelation function (PACF) sample autocorrelation, 174-177
parsimony, 178 sample moments, 64
partial autocorrelation function (PACF), 162, 164-166, 169, 170, 174-178 best linear unbiased estimator, 70-71
Pearson's correlation, 216 higher order moments, 67-70
performance evaluation, linear regression, 114-115 mean and standard deviation, 66-67
PIMCO High Yield Fund, 114 mean and variance using data, 67
PIMCO Total Return Fund, 114, 115 mean estimation, 64-65
median, 73-74 survivorship bias, 106
multivariate moments, 74-78 systemically important financial institutions (SIFIs), 5
sample behavior of mean, 71-73
variance and standard deviation estimation, 65-66
T
sample selection bias, 106
test statistics, 84-87, 91-93, 110, 111, 128, 129, 131, 143-145, 174, 176,
sample space, 2-5
191, 196, 197, 202, 215
seasonal differencing, 198, 199
time-series analysis, 160
seasonality
time trends, 188-190
non-stationary time series, 190-192
total sum of squares (TSS), 125, 126
stationary time series, 181-182
transformations, linear regression, 103-104
simple returns, 212, 213
t-statistics, 122, 123, 131
simulation, 224
t-test statistics, 127
conducting experiments, 224, 226
type I error and test size, 85-86
Monte Carlo. See Monte Carlo simulation
type II error and test power, 88
simultaneity bias, 106
skewness, 16-17, 67-70
Spearman correlation, 217-219 U
spurious regression, 199, 200 uniform distribution, 41-42
standard deviation, 16 unit roots
mean and, 66-67 problems with, 195
scaling of, 66 testing for, 195-198
vs. standard errors, 66 univariate random variables, 12
and variance estimation, 65-66
standard errors
estimation, OLS, 110
V
vs. standard deviations, 66 validation set, 142
standardized Student's t distribution, 37 variance, 16-17
stationary time series and standard deviation estimation, 65-66
ARMA models, 171-174 of X, 107
autoregressive models, 163-169 VIX Index, 214
covariance stationarity, 161-162 volatility and risk measures, 213-214
forecasting, 178-181
model selection, 174-177 W
moving average models, 169-171
weighted least squares (WLS), 145
sample autocorrelation, 174-177
White's estimator, 142-144
seasonality, 181-182
Wold's theorem, 163
stochastic processes, 160-161
white noise, 162-163
stochastic processes, 160-161 Y
Student's t distribution, 36-38 Yule-Walker (YW) equations, 167