Empirical Finance

Executive MSc in Investment and Risk Management Programme
Prof. Robert L Kimmel
robert.kimmel@edhec.edu
+65 6631 8579
EDHEC Business School
24-27 Mar 2011
22-24 Aug 2011
Singapore Campus
Introduction
This course is about Empirical Finance.
What do the available data tell us about financial markets, and do they
support or contradict the various theories we have developed to explain the
behaviour of financial markets?
We will focus mainly on pricing, that is, how prices of financial assets are
determined. It is possible to focus on other aspects of financial markets,
e.g., trading volume.
The course will discuss both econometric techniques and the actual
empirical findings.
Basic Principles
Probability and Distributions
Why is there even a subject matter called Empirical Finance?
1. Astronomers can predict the positions of the planets, and phenomena
   such as eclipses, with extreme accuracy, centuries in advance.
2. Meteorologists can predict the weather a few days in advance.
3. Can stock market analysts predict stock prices ten minutes in
   advance?
Humans have essentially no effect on the motion of the planets, and only
(possibly) a very long-term effect on the weather. Prices of financial assets
are set on a minute-to-minute basis by people.
How do they decide what the prices of financial assets should be?
The extent to which financial markets incorporate available information
into asset prices (the degree of market efficiency) is very hotly debated, in
both academic and industry circles.
There is no question, though, that events nobody knows about yet can't
be incorporated into asset prices.
The evolution of the macroeconomy, technological progress, and societal
evolution are all very hard to predict, even by people who spend their
whole lives studying such things. They are best modelled as random
processes.
If the fundamental economic processes that affect asset prices are random,
then the asset prices themselves are also random.
The fact that security prices are random has profound implications for
investors: much of financial theory involves the investor's problem of
trading off risk and average return.
However, it also has profound implications for those who study financial
markets. Financial theories are generally about relations between average
returns and various measures of risk. If we observe that the average
returns of securities differ from what is predicted by a theory, what
conclusion do we draw?
1. The theory is wrong.
2. The theory is right, but its predictions are not met exactly because of
   the random variation in asset prices.
Which is it?
Probability and statistics are absolutely fundamental to the study of
financial markets.
Example: suppose there are three assets, X, Y, and Z. We have
developed an economic theory that tells us what (on average) the returns
of the assets ought to be. We then get a sample of monthly returns
(annualised) of the three assets, over the last 20-year period. The results
are as follows.

                                          Asset
                                          X      Y      Z
Average return (predicted)                8%     10%    12%
Average return (observed)                 6%     16%    14%
Standard deviation of return (observed)   25%    40%    60%

How do the predictions of the theory hold up? Do you have enough
information to tell?
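One way to start on this question is sketched below in Python. It rests on two assumptions that are not stated in the example: the standard deviations in the table are annualised volatilities, and returns are independent over time, so that the standard error of an average annual return over 20 years is the standard deviation divided by the square root of 20. The names `assets` and `z_score` are illustrative only.

```python
import math

YEARS = 20  # length of the sample in the example

# (predicted mean, observed mean, observed standard deviation), in percent per year
assets = {
    "X": (8.0, 6.0, 25.0),
    "Y": (10.0, 16.0, 40.0),
    "Z": (12.0, 14.0, 60.0),
}

def z_score(predicted, observed, sd, years=YEARS):
    """Gap between observed and predicted mean, measured in standard errors.

    Assumption (not from the slides): sd is an annualised volatility and
    returns are independent, so the standard error of the mean annual
    return is sd / sqrt(years).
    """
    standard_error = sd / math.sqrt(years)
    return (observed - predicted) / standard_error

z_scores = {name: z_score(*row) for name, row in assets.items()}

for name, z in z_scores.items():
    print(f"{name}: observed mean is {z:+.2f} standard errors from the prediction")
```

Under these assumptions, each observed average lies within one standard error of its predicted value, so a sample like this one would not by itself contradict the theory.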
A probability distribution specifies the likelihood of each possible outcome
of a random process. Distributions can be discrete or continuous.
When a random variable has a discrete probability distribution, there are
either finitely many outcomes, or countably many.
Consider a six-sided die, each side labelled with a number from one to six.
If each side is equally likely to come up when the die is rolled, then the
probabilities $p_1, \ldots, p_6$ are all equal to 1/6.
Probabilities (in a discrete probability distribution) must satisfy two
properties:
1. The probabilities must be zero or positive.
2. The probabilities must add up to one.
Do the probabilities specified above satisfy both of these constraints?
[Figure: Probability Distribution of Six-sided Die Throw]
A discrete probability distribution can have infinitely many outcomes, each
with positive probability.
Suppose we throw a coin with a "heads" and a "tails" side. The coin is
fair, meaning each side has a probability of 1/2. Suppose we throw this
coin repeatedly, and call X the number of throws until the first head.
What is the probability distribution of X?
There is a 1/2 probability that the first throw will be heads, so
$p_1 = 1/2$. The probability that the second throw will be the first head is
1/4, so $p_2 = 1/4$. More generally, $p_i = (1/2)^i$. There is no limit to the
value of $i$; it is possible (although not likely) that it will take a million, a
billion, a trillion trillion trillion, etc. throws.
Do these probabilities satisfy the two rules?
Each of the probabilities is clearly greater than zero, so we have no
problem with negative probabilities. Do they add up to one?
\[ \sum_{i=1}^{\infty} p_i = \sum_{i=1}^{\infty} \left( \frac{1}{2} \right)^i = 1 \]
(For justification of the last step, see any reference on geometric infinite
series.)
The probabilities are non-negative and add up to one, so they are valid
probabilities. More generally, any distribution with
\[ p_i = (1 - p)^{i-1} p \]
for some $p \in (0, 1]$ is called a geometric distribution.
[Figure: Probability Distribution of First Head in Coin Throw Example]
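The two rules can be checked numerically for the coin example. The sketch below uses exact fractions and truncates the infinite sum after 50 terms; the truncation point and the names `geometric_pmf` and `shortfall` are choices made here purely for illustration.

```python
from fractions import Fraction

def geometric_pmf(i, p):
    """Probability that the first head arrives on throw i (i = 1, 2, ...)."""
    return (1 - p) ** (i - 1) * p

p = Fraction(1, 2)   # fair coin
n_terms = 50         # truncation point, chosen only for illustration

probs = [geometric_pmf(i, p) for i in range(1, n_terms + 1)]
partial_sum = sum(probs)

# What is still missing from a total of one after n_terms terms.
# For the geometric distribution this is exactly (1 - p) ** n_terms.
shortfall = 1 - partial_sum

print(float(partial_sum), shortfall)
```

Every term is positive, and the shortfall halves with each extra term, which is the sense in which the infinite sum equals one.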
Continuous probability distributions have uncountably infinitely many
possible outcomes.
Example: what is the amount of rainfall in the centre of Singapore on 22
June 2011, measured in millimetres?
This quantity could take any non-negative value: it could be zero (no
rainfall at all), or any positive number. (Since water consists of molecules,
the amount of rainfall is actually a discrete quantity; however, it is very
well approximated by a continuous distribution.)
Continuous probability distributions are specified by a probability density
function.
Example: the random variable X has a uniform probability distribution on
the interval [0, 1]. Then X has the probability density function $f_X(x) = 1$.
The density function does not specify the probability of each outcome;
each particular outcome is infinitely improbable (i.e., has probability of 0).
But ranges of outcomes have positive probability; what is the probability
that X falls in the interval [0.2, 0.3]?
\[ P(0.2 \le X \le 0.3) = \int_{0.2}^{0.3} f_X(x) \, dx = \int_{0.2}^{0.3} (1) \, dx = x \Big|_{0.2}^{0.3} = 0.1 \]
Probability density functions must satisfy two rules:
1. They must be non-negative.
2. They must integrate to one.
Does this uniform probability distribution satisfy these constraints?
The uniform probability density on [0, 1] is obviously positive on this
range. It also integrates to one:
\[ \int_0^1 f_X(x) \, dx = \int_0^1 (1) \, dx = 1 \]
Note that this integral is only taken over the range of possible values
[0, 1]. We can instead take the probability density to be defined as 0
outside this range:
\[ f_X(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & x < 0 \text{ or } x > 1 \end{cases} \]
We can then just integrate over the entire real line $(-\infty, +\infty)$, and the
value of the integral is still one.
More generally, a uniform distribution can be defined on any range [a, b],
with b > a:
\[ f_X(x) = \begin{cases} \frac{1}{b - a} & a \le x \le b \\ 0 & x < a \text{ or } x > b \end{cases} \]
Note that the probability density satisfies the two requirements; it is
non-negative, and it integrates to one.
[Figure: Uniform Distribution on [0, 1]]
Another example: the exponential distribution, with probability density
function defined on the interval $[0, +\infty)$:
\[ f_X(x) = \lambda e^{-\lambda x}, \quad \lambda > 0 \]
Note that this is not a single distribution, but a family of many
distributions, indexed by the parameter $\lambda$.
The exponential distribution has many applications; for example, it is used
to model the time until a radioactive particle decays. It is sometimes used
to model time to default in credit risk applications.
Does the exponential distribution satisfy the two requirements for a valid
probability distribution?
[Figure: Exponential Distribution with $\lambda = 0.5$]
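A short worked check of the two requirements, writing the density as $\lambda e^{-\lambda x}$ with $\lambda > 0$ (the symbol $\lambda$ is the conventional choice for the parameter and is assumed here). The density is positive for every $x \ge 0$, since both factors are positive, and it integrates to one:

```latex
\int_0^{\infty} \lambda e^{-\lambda x} \, dx
  = \left[ -e^{-\lambda x} \right]_0^{\infty}
  = 0 - (-1)
  = 1
```

The result does not depend on the value of $\lambda$, so every member of the family is a valid density.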
Another example: the normal, or Gaussian distribution. This distribution
is defined for all real numbers (positive, zero, and negative), and has the
density function:
\[ f_X(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}, \quad \sigma > 0 \]
Despite its somewhat odd appearance, the normal distribution arises in a
very natural way in many, many applications, and is one of the most
fundamental continuous distributions there is. It is often used to model
returns of financial assets.
Note that the Gaussian distribution is actually a family of distributions,
indexed by $\mu$ and $\sigma$. More on these parameters later.
Does the Gaussian distribution satisfy the two requirements for a valid
probability distribution?
[Figure: Gaussian Distribution with $\mu = 0.1$ and $\sigma = 0.25$]
We will often use summary statistics, which capture some (but not all) of
the information in the probability distribution of a random variable.
One of the most important is the mean, or expected value. This is just the
average outcome, weighted by probabilities.
\[ E[X] = \sum_{i=1}^{N} x_i p_i \]
where $x_i$ is the value of a particular outcome, and $p_i$ is its probability. The
sum must be taken across all possible outcomes (the number of outcomes
being denoted by N here).
For a random variable with a continuous distribution, the mean is an
integral over all possible outcomes (weighted by probability).
\[ E[X] = \int_{-\infty}^{+\infty} x f_X(x) \, dx \]
The expected values of the die and coin throw examples are 3.5 and 2,
respectively. The uniform distribution on [a, b] has an expected value of
$(a + b)/2$. The exponential distribution has a mean of $1/\lambda$. The normal
(Gaussian) distribution has a mean of $\mu$.
When there are infinitely many possible outcomes, the expected value may
not even exist: what is the expected value of a random variable that has
value 2 with probability 1/2, 4 with probability 1/4, etc.? The expected
value also does not even have to be one of the possible outcomes: in the
die throw example, the mean is 3.5, but no throw ever has this value.
For a random variable X, any function g (X) of X is also a random
variable, and we can contemplate its expected value. For example, if X is
the value of a die throw (1 through 6, with equal probability), what is the
expected value of the squared outcome?
From the definition of an expected value:
\[ E[X^2] = \sum_{i=1}^{6} x_i^2 p_i = \frac{1}{6} (1)^2 + \ldots + \frac{1}{6} (6)^2 = \frac{91}{6} \]
Similarly, $E[X^3] = 441/6$ and $E[X^4] = 2275/6$. (Try it.)
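These moments can be checked directly from the definition. The sketch below uses exact fractions so that no rounding is involved; the helper name `raw_moment` is illustrative only.

```python
from fractions import Fraction

outcomes = range(1, 7)   # faces of the die
prob = Fraction(1, 6)    # each face is equally likely

def raw_moment(n):
    """E[X^n] for a fair six-sided die: probability-weighted sum of x**n."""
    return sum(prob * x ** n for x in outcomes)

moments = {n: raw_moment(n) for n in range(1, 5)}

for n, value in moments.items():
    print(f"E[X^{n}] = {value} = {float(value):.4f}")
```

Note that `Fraction` reduces to lowest terms, so 441/6 is displayed as 147/2.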

When there are infinitely many possible outcomes, the expected value of X
or a particular function of X may not exist. However, for the coin throwing
example, $E[X^n]$ is well-defined for any integer $n \ge 0$. Can you find $E[X]$
and $E[X^2]$?
We care not just about the expected value (or average outcome), but also
how large deviations from the average tend to be. The variance of a
random variable is one such measure. For discrete and continuous random
variables, respectively, the variance is:
\[ \mathrm{Var}[X] = \sum_{i=1}^{N} p_i (x_i - E[X])^2 \]
\[ \mathrm{Var}[X] = \int_{-\infty}^{+\infty} f_X(x) (x - E[X])^2 \, dx \]
In both cases, we can express the variance as an expected value:
\[ \mathrm{Var}[X] = E\left[ (X - E[X])^2 \right] = E[X^2] - (E[X])^2 \]
The last step follows from the definitions of expected value and variance,
although the algebra is tedious.
What is the variance of X in the die throw example? One method: go
straight to the definition of variance:
\[ \mathrm{Var}[X] = \sum_{i=1}^{N} p_i (x_i - E[X])^2 = \frac{1}{6} (1 - 3.5)^2 + \ldots + \frac{1}{6} (6 - 3.5)^2 = \frac{35}{12} \]
Another method: find the variance in terms of quantities we have already
calculated:
\[ \mathrm{Var}[X] = E[X^2] - (E[X])^2 = \frac{91}{6} - \left( \frac{7}{2} \right)^2 = \frac{35}{12} \]
Both methods give the same answer, which is not a coincidence.
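Both routes to the variance can be reproduced in a few lines. This is a minimal sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

outcomes = range(1, 7)
prob = Fraction(1, 6)  # each face equally likely

mean = sum(prob * x for x in outcomes)

# Method 1: probability-weighted squared deviations from the mean
var_definition = sum(prob * (x - mean) ** 2 for x in outcomes)

# Method 2: E[X^2] - (E[X])^2
second_moment = sum(prob * x ** 2 for x in outcomes)
var_shortcut = second_moment - mean ** 2

print(var_definition, var_shortcut)  # prints: 35/12 35/12
```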
What is the variance in the coin throwing example?
When there are infinitely many outcomes, variance (like expected value)
may not exist. For example, a Student's t distribution with 2 degrees of
freedom has an expected value of 0, but its variance does not exist.
For most distributions we deal with, both mean and variance are
well-defined. For the exponential distribution, the variance is:
\[ \mathrm{Var}[X] = \frac{1}{\lambda^2} \]
(Can you prove it?)
For the normal (Gaussian) distribution, the variance is:
\[ \mathrm{Var}[X] = \sigma^2 \]
(Proof of this result is more difficult.)
Variance is, by construction, zero or positive. (It is only zero if the random
variable is always equal to its mean.) It is never negative.
The mean, or expected value of a random variable can be expressed in the
same units as the random variable itself; however, variance is not so
convenient. For example, suppose the annual return of a security has a
normal distribution, with $\mu = 0.1$ and $\sigma = 0.4$. Then the mean (or
average) return is 0.1, or 10%, but its variance is 0.16; the units are
percent squared per year squared. We therefore will often use standard
deviation instead of variance:
\[ \mathrm{SD}[X] \equiv \sqrt{\mathrm{Var}[X]} \]
Standard deviation, like variance, is always zero or positive, but is in the
same units as the original random variable. In the example above, the
standard deviation of the security's return is 40% per year.
In financial and economic applications, mean and variance are used all the
time. Less often, so-called higher order moments are used, e.g., the third
and fourth (centred) moments:
\[ E\left[ (X - E[X])^3 \right] = E[X^3] - 3 E[X^2] E[X] + 2 (E[X])^3 \]
\[ E\left[ (X - E[X])^4 \right] = E[X^4] - 4 E[X^3] E[X] + 6 E[X^2] (E[X])^2 - 3 (E[X])^4 \]
Like variance, these quantities are not in the most convenient units, so they
are often converted to dimensionless quantities, skewness and kurtosis.
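Skewness and kurtosis are defined as:
\[ \mathrm{Skew} \equiv \frac{E\left[ (X - E[X])^3 \right]}{(\mathrm{Var}[X])^{3/2}} \qquad \mathrm{Kurt} \equiv \frac{E\left[ (X - E[X])^4 \right]}{(\mathrm{Var}[X])^2} - 3 \]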
The kurtosis (sometimes called excess kurtosis) has 3 subtracted out to
make a normal distribution have a kurtosis of 0; any distribution with
positive kurtosis is therefore more kurtotic than a normal distribution.
Skewness is related to the symmetry of a distribution, and kurtosis is
related to the probability of extreme values.
Skewness can take any value, positive or negative. Any symmetric
distribution (e.g., the normal distribution, the uniform distribution, or the
die throwing example) has skewness of zero.
If a distribution has most of the probability near the mean, but also has
a small amount of probability of extremely high values, then the
distribution will have positive skewness. If the extreme values are low
instead of high, then the skewness will be negative.
Income distributions in most countries have positive skewness: most
people earn an amount around the median, but a very small number of
people typically earn very high incomes.
The skewness of the exponential distribution is 2; the skewness of the
distribution in the coin throwing example is $3/\sqrt{2}$. (Can you derive these
results?)
Kurtosis has to do with the probability of extreme observations. If a
random variable is almost always close to the mean, but with some small
probability, it can take on a very large value (above or below the mean),
then the distribution has high kurtosis.
The lowest possible value of kurtosis is $-2$; there is no maximum value of
kurtosis. It is possible for the skewness and the kurtosis of a distribution
not to exist.
The exponential distribution has a kurtosis of 6; the uniform distribution
has a kurtosis of $-1.2$. The Gaussian distribution has a kurtosis of zero.
The coin throwing example has a kurtosis of 6.5, and the die throwing
example has a kurtosis of $-222/175$. (Can you derive these results?)
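The values quoted for the two discrete examples can be checked from the definitions. In the sketch below, the coin example is truncated at 200 throws (an arbitrary cut-off chosen here; the neglected probability is far below floating-point precision), and the function name `skew_and_excess_kurtosis` is illustrative only.

```python
import math
from fractions import Fraction

def skew_and_excess_kurtosis(outcomes, probs):
    """Skewness and excess kurtosis of a discrete distribution."""
    mean = sum(p * x for x, p in zip(outcomes, probs))

    def central_moment(n):
        return sum(p * (x - mean) ** n for x, p in zip(outcomes, probs))

    variance = central_moment(2)
    skew = float(central_moment(3)) / float(variance) ** 1.5
    kurt = central_moment(4) / variance ** 2 - 3  # 3 subtracted: excess kurtosis
    return skew, kurt

# Die throw: exact fractions, six equally likely outcomes
die_x = list(range(1, 7))
die_p = [Fraction(1, 6)] * 6
die_skew, die_kurt = skew_and_excess_kurtosis(die_x, die_p)

# Coin throw: number of throws until the first head, truncated at 200 throws
coin_x = list(range(1, 201))
coin_p = [0.5 ** i for i in coin_x]
coin_skew, coin_kurt = skew_and_excess_kurtosis(coin_x, coin_p)

print(die_skew, die_kurt)    # symmetric die: zero skewness, negative kurtosis
print(coin_skew, coin_kurt)  # compare with 3 / sqrt(2) and 6.5
```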
[Figure: Exponential vs. Gaussian Distribution]
[Figure: Exponential vs. Gaussian Distribution, Right Tail]
[Figure: Gaussian vs. Student's t Distribution]
Estimation and Inference
Problem: we do not know the distribution of random events.
1. For the coin throwing example, it seems like the probability of
   "heads" is 0.5. Are you sure? Maybe it is a trick coin.
2. For a security return, we know the future return is random (i.e., we
   cannot predict it in advance with perfect accuracy). But what is its
   distribution?
If we have historical data (e.g., we have observed the coin being thrown
repeatedly, or we have historical returns for a security), we can use this
data to learn something about the probabilities of different outcomes. (Is
there an implicit assumption here?)
Estimation of the entire probability distribution of a random variable is a
very difficult problem. (Easy for some special cases, like the coin throwing
example.) We will focus on estimating quantities such as the mean and
variance of a random variable.
How do we estimate the mean (expected value) of a random variable, such
as the outcome of a coin throw, or the future return of a security?
An extremely general method: take the sample average of the available
observations. Suppose we have observed N realisations of the random
variable X, denoted by $X_1, \ldots, X_N$. Then we can estimate the average
with:
\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i \]
Is this a good way to estimate the expected value of a random variable?
Example: probability of heads with a coin throw.
Call the value of a coin throw X = 1 if it comes up heads, and X = 0
otherwise. Call p the probability of heads. Then:
\[ E[X] = \sum_{i=1}^{2} x_i p_i = p \cdot 1 + (1 - p) \cdot 0 = p \]
So estimating the expected value of X is the same thing as estimating the
probability of heads. Compute the sample mean by throwing the coin N
times, counting each "heads" as 1, and each "tails" as 0. Count up the
number of heads, and divide by N. This is $\bar{X}$, the sample mean.
Will the sample average be equal to the true average (i.e., the expected
value)?
Example: expected return of a security.
Collect historical returns for the last N months. Add them all up, and
divide by N:
\[ \bar{R} = \frac{1}{N} \sum_{i=1}^{N} R_i \]
This method is very commonly used to estimate expected returns of
broadly diversified portfolios; it is used less often to try to estimate the
expected returns of individual securities. (Any idea why?)
Will the sample average return be equal to the true expected return?
What are the statistical properties of the sample mean?
First, we will need a few basic results. Let X and Y be random variables,
and let a, b, and c be constants. Then:
\[ E[X + Y] = E[X] + E[Y] \]
\[ E[aX] = a E[X] \]
\[ E[a + bX + cY] = a + b E[X] + c E[Y] \]
These results are true for both discrete and continuous random variables,
and follow directly from the definition of expected value. (The derivation
is a little tedious though.)
The first two results are just special cases of the third, which can be
generalized; let $X_1, \ldots, X_N$ be random variables, and let $a_0, \ldots, a_N$ be
constants. Then:
\[ E\left[ a_0 + \sum_{i=1}^{N} a_i X_i \right] = a_0 + \sum_{i=1}^{N} a_i E[X_i] \]
This last result will be extremely useful in analysing the statistical
properties of the sample mean.
Note that the sample mean is itself a random variable; sometimes it will be
higher than the true mean, and sometimes it will be lower. We can find its
expected value, just like we can with any other random variable:
\[ E[\bar{X}] = E\left[ \frac{1}{N} \sum_{i=1}^{N} X_i \right] = E\left[ \sum_{i=1}^{N} \frac{1}{N} X_i \right] = \sum_{i=1}^{N} \frac{1}{N} E[X_i] = \sum_{i=1}^{N} \frac{1}{N} E[X] = E[X] \]
So the expected value of the sample average is equal to the true
average: if you estimate the true mean with the sample mean, then on
average, you will get it right!
We would also like to examine how precise the estimate tends to be: how
much can the sample average deviate from the true average? However, we
need some additional tools first.
Let X and Y be random variables. The joint distribution tells us the
probabilities of different possible outcomes of X and of Y individually, but
it also tells us how X and Y are related. Suppose there are M possible
values of X, and N possible values of Y. Then the joint probability $p_{i,j}$ is
the probability that X will take the value $x_i$, and Y will simultaneously
take the value $y_j$.
The joint probabilities of X and Y must satisfy the same two restrictions
that all probabilities must satisfy: they must be non-negative, and they
must add up to one.
We can also consider the probabilities of either X or Y, considered alone.
For example, let $p^{(X)}_1, \ldots, p^{(X)}_M$ be the probabilities of the M possible
values of X, and let $p^{(Y)}_1, \ldots, p^{(Y)}_N$ be the probabilities of the N possible
values of Y. Then these two sets of probabilities are called the marginal
probabilities of X and Y.
There is a relation between the marginal probabilities and the joint
probabilities. Specifically:
\[ p^{(X)}_i = \sum_{j=1}^{N} p_{i,j} \qquad p^{(Y)}_j = \sum_{i=1}^{M} p_{i,j} \]
Suppose X and Y can each take on the values $-1$, 0, or $+1$, and do so
with the following probabilities:

                  X
           -1     0      +1
     -1    0.20   0.10   0.00
Y     0    0.20   0.05   0.20
     +1    0.10   0.00   0.15

What are the marginal probabilities of X and Y?
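The marginal probabilities follow from summing the table along each direction. A minimal sketch, assuming (as in the table layout) that rows correspond to values of Y and columns to values of X; the dictionary names are illustrative.

```python
from fractions import Fraction

values = (-1, 0, 1)

# Rows of the table: key is the value of Y, entries run over X = -1, 0, +1
rows = {
    -1: ("0.20", "0.10", "0.00"),
     0: ("0.20", "0.05", "0.20"),
     1: ("0.10", "0.00", "0.15"),
}

# joint[y][x] = P(X = x, Y = y), stored as exact fractions
joint = {y: {x: Fraction(s) for x, s in zip(values, row)} for y, row in rows.items()}

# Marginal of X: sum over all values of Y (down each column)
marginal_x = {x: sum(joint[y][x] for y in values) for x in values}
# Marginal of Y: sum over all values of X (along each row)
marginal_y = {y: sum(joint[y][x] for x in values) for y in values}

print({x: float(p) for x, p in marginal_x.items()})
print({y: float(p) for y, p in marginal_y.items()})
```

Each set of marginal probabilities adds up to one, as any valid set of probabilities must.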
We can also specify the joint probability density function $f_{X,Y}(x, y)$ for
two random variables with a continuous distribution.
The probability that $X \in [a, b]$ and $Y \in [c, d]$ is:
\[ P(a \le X \le b, \; c \le Y \le d) = \int_a^b \int_c^d f_{X,Y}(x, y) \, dy \, dx \]
In either the discrete or the continuous case, expected values are defined
analogously to the case of a single random variable:
\[ E[g(X, Y)] = \sum_{i=1}^{M} \sum_{j=1}^{N} p_{i,j} \, g(x_i, y_j) \]
\[ E[g(X, Y)] = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f_{X,Y}(x, y) \, g(x, y) \, dy \, dx \]
We say the discrete random variables X and Y are independent if:
\[ p_{i,j} = p^{(X)}_i \, p^{(Y)}_j \]
If X and Y are continuous, then they are independent if:
\[ f_{X,Y}(x, y) = f_X(x) \, f_Y(y) \]
Intuitively, X and Y are independent if knowledge of X tells you nothing
about the probability of different outcomes of Y, and vice-versa.
We define the covariance between X and Y as:
\[ \mathrm{Cov}[X, Y] \equiv E[(X - E[X]) (Y - E[Y])] = E[XY] - E[X] E[Y] \]
Covariance is a measure of how the two random variables are related; e.g.,
if it is positive, then when X is above its mean value, Y also tends to be
above its mean value.
If two random variables are independent, then their covariance is zero.
(Proof?) However, it is possible for random variables to have a covariance
of zero, but not be independent.
Other useful properties of covariance are:
\[ \mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X] \qquad \mathrm{Cov}[X, X] = \mathrm{Var}[X] \]
These follow immediately from the definition.
The units of covariance are not particularly useful, so one may prefer
correlation:
\[ \mathrm{Corr}[X, Y] \equiv \frac{\mathrm{Cov}[X, Y]}{\mathrm{SD}[X] \, \mathrm{SD}[Y]} \]
Correlation is not well-defined if either X or Y has a standard deviation of
zero. But otherwise, correlation is dimensionless, and is bounded between
its maximum value of $+1$ and its minimum value of $-1$.
Correlation and covariance have the same sign; that is, they are both
positive, both negative, or both zero.
If two random variables have a correlation of zero, we say they are
uncorrelated. This does not necessarily mean that they are independent!
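A small example of uncorrelated but dependent variables, constructed here for illustration (it is not from the slides): let X take the values -1, 0, and +1 with equal probability, and let Y equal X squared. Y is completely determined by X, yet the covariance is zero.

```python
from fractions import Fraction

third = Fraction(1, 3)

# Joint probabilities keyed by (x, y). Y = X**2, so Y is fully determined by X.
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

def expect(g):
    """Expected value of g(X, Y) under the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

cov_xy = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require P(X=0, Y=0) to equal P(X=0) * P(Y=0)
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_joint_00 = joint[(0, 0)]

print(cov_xy)                   # 0: X and Y are uncorrelated
print(p_joint_00, p_x0 * p_y0)  # 1/3 versus 1/9: X and Y are not independent
```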
Example: X and Y have a bivariate normal distribution:
\[ f_{X,Y}(x, y) = \frac{1}{2 \pi \sqrt{\sigma_X^2 \sigma_Y^2 (1 - \rho^2)}} \exp\left( - \frac{(x - \mu_X)^2 \sigma_Y^2 - 2 \rho (x - \mu_X) (y - \mu_Y) \sigma_X \sigma_Y + (y - \mu_Y)^2 \sigma_X^2}{2 \left[ \sigma_X^2 \sigma_Y^2 (1 - \rho^2) \right]} \right) \]
This distribution has the following properties:
\[ E[X] = \mu_X \qquad E[Y] = \mu_Y \]
\[ \mathrm{Var}[X] = \sigma_X^2 \qquad \mathrm{Var}[Y] = \sigma_Y^2 \qquad \mathrm{Corr}[X, Y] = \rho \]
Note that, if $\rho = 0$, then X and Y are independent. (Can you show it?)
For this particular distribution, X and Y are independent if and only if
they are uncorrelated.
This result does not generalise to other distributions! It is not true even
for normal distributions; X and Y can each have a marginal normal
distribution and a correlation of zero, but not be independent. (Can you
construct an example?)
[Figure: Two Standard Gaussian Distributions, Zero Correlation]
[Figure: Two Standard Gaussian Distributions, Correlation of +0.5]
[Figure: Two Standard Gaussian Distributions, Correlation of -0.5]
The following properties of variance follow from the definition. (Can you
derive them?) Let X and Y be random variables, and let a, b, and c be
constants. Then:
\[ \mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2 \, \mathrm{Cov}[X, Y] \]
\[ \mathrm{Var}[aX] = a^2 \, \mathrm{Var}[X] \]
\[ \mathrm{Var}[a + bX + cY] = b^2 \, \mathrm{Var}[X] + c^2 \, \mathrm{Var}[Y] + 2bc \, \mathrm{Cov}[X, Y] \]
The first two are special cases of the third.
More generally, if $X_1, \ldots, X_N$ are random variables and $a_0, \ldots, a_N$ are
constants:
\[ \mathrm{Var}\left[ a_0 + \sum_{i=1}^{N} a_i X_i \right] = \sum_{i=1}^{N} a_i^2 \, \mathrm{Var}[X_i] + 2 \sum_{i=1}^{N} \sum_{j=i+1}^{N} a_i a_j \, \mathrm{Cov}[X_i, X_j] \]
The presence of the covariance terms has very profound implications for
portfolio choice. What is the above result if the $X_1, \ldots, X_N$ are all
uncorrelated with each other?
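The general formula translates directly into code. The sketch below applies it to a hypothetical two-asset portfolio; the weights, standard deviations (20% and 30%), and correlation (0.25) are made-up numbers chosen only to illustrate the effect of the covariance term.

```python
import math

def linear_combination_variance(weights, cov):
    """Variance of sum_i weights[i] * X_i, given the covariance matrix cov.

    cov[i][i] is Var[X_i] and cov[i][j] is Cov[X_i, X_j]. A constant term
    a_0 is omitted because it does not affect the variance.
    """
    n = len(weights)
    own_terms = sum(weights[i] ** 2 * cov[i][i] for i in range(n))
    cross_terms = 2 * sum(
        weights[i] * weights[j] * cov[i][j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    return own_terms + cross_terms

# Hypothetical inputs (not from the slides)
sd_1, sd_2, corr = 0.20, 0.30, 0.25
cov_12 = corr * sd_1 * sd_2
weights = [0.5, 0.5]

cov_correlated = [[sd_1 ** 2, cov_12], [cov_12, sd_2 ** 2]]
cov_uncorrelated = [[sd_1 ** 2, 0.0], [0.0, sd_2 ** 2]]

var_correlated = linear_combination_variance(weights, cov_correlated)
var_uncorrelated = linear_combination_variance(weights, cov_uncorrelated)

print(math.sqrt(var_correlated), math.sqrt(var_uncorrelated))
```

When the variables are uncorrelated, the double sum vanishes and only the weighted variances remain, which is why the second portfolio has the lower standard deviation.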
At this point, it may be useful to specify some properties of covariances.
Let X, Y, U, and V be random variables, and let a, b, c, d, f , and g be
constants. Then:
\[ \mathrm{Cov}[a + bX + cY, \; d + fU + gV] = bf \, \mathrm{Cov}[X, U] + bg \, \mathrm{Cov}[X, V] + cf \, \mathrm{Cov}[Y, U] + cg \, \mathrm{Cov}[Y, V] \]
For both variances and covariances, adding a constant to the arguments
has no effect.
The previous result may also provide some insight into why constants that
appear multiplicatively inside a variance must be squared when they are
taken outside:
\[ \mathrm{Var}[bX] = \mathrm{Cov}[bX, bX] = b^2 \, \mathrm{Cov}[X, X] = b^2 \, \mathrm{Var}[X] \]
We will state and use a number of statistical results in this section and the
next without proof; if you want to fill in the proofs, the above property of
covariance will often be useful. This result generalizes to arbitrary linear
combinations of random variables in the obvious way.
We can now further analyse the statistical properties of the sample mean.
Specifically, we would like to find its variance. At this point, we assume
the $X_1, \ldots, X_N$ are independent of each other. (Is this a reasonable
assumption?)
\[ \mathrm{Var}[\bar{X}] = \mathrm{Var}\left[ \frac{1}{N} \sum_{i=1}^{N} X_i \right] = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{Var}[X_i] = \frac{1}{N} \mathrm{Var}[X] \]
The standard deviation of the sample mean is:
\[ \mathrm{SD}[\bar{X}] \equiv \sqrt{\mathrm{Var}[\bar{X}]} = \frac{1}{\sqrt{N}} \, \mathrm{SD}[X] \]
From the above results, we can reach the not very surprising conclusion
that, the more observations we have, the better an estimate of the true
mean $\bar{X}$ is. On average, it is right; furthermore, the more observations we
have, the less likely $\bar{X}$ is to deviate widely from the true mean.
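The 1/N scaling can be verified exactly for coin throws. The sketch below assumes the standard binomial formula for the probability of k heads in n independent throws, and then computes the mean and variance of the sample mean from its full distribution; `sample_mean_moments` is an illustrative name.

```python
from fractions import Fraction
from math import comb

def sample_mean_moments(p, n):
    """Exact mean and variance of the sample mean of n independent coin throws.

    Assumption: the number of heads k in n independent throws has probability
    comb(n, k) * p**k * (1 - p)**(n - k) (the binomial formula).
    """
    probs = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    values = [Fraction(k, n) for k in range(n + 1)]  # sample mean when k heads occur
    mean = sum(q * v for q, v in zip(probs, values))
    variance = sum(q * (v - mean) ** 2 for q, v in zip(probs, values))
    return mean, variance

p = Fraction(1, 2)  # fair coin, so Var[X] = p * (1 - p) = 1/4
moments_by_n = {n: sample_mean_moments(p, n) for n in (10, 40, 160)}

for n, (mean, variance) in moments_by_n.items():
    print(n, mean, variance)
```

Each fourfold increase in N cuts the variance of the sample mean by a factor of four, and therefore halves its standard deviation.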
Example: coin throwing.
Recall our method of estimating the probability a coin comes up
"heads": throw the coin N times, count the number of heads, and divide
by N. The resulting number (which is the sample mean) is an estimate of
the probability of heads.
On average, the sample mean is an accurate estimate of the true mean.
But if you throw a coin 1,000 times, will it always come up heads 500
times, even if it is a fair coin? Suppose it comes up heads 550 times; is
this evidence that it is a trick coin?
Recall that heads receives a value of 1, and tails receives a value of 0.
The average value is p, where p is the probability of heads.
What is the variance of a single coin throw?
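\[ E[X^2] = p (1)^2 + (1 - p) (0)^2 = p \]
\[ \mathrm{Var}[X] = E[X^2] - (E[X])^2 = p - p^2 = p (1 - p) \]
What is the variance of the sample average?
\[ \mathrm{Var}[\bar{X}] = \frac{1}{N} \mathrm{Var}[X] = \frac{p (1 - p)}{N} \]
We don't know the value of p, so we don't know the variance of the
sample mean.
However, note that $p (1 - p)$ takes a maximum value of 1/4 at p = 1/2.
So we know for sure that:
\[ \mathrm{Var}[\bar{X}] \le \frac{1}{4N} \qquad \mathrm{SD}[\bar{X}] \le \frac{1}{2 \sqrt{N}} \]
For N = 1,000 and a fair coin, we have $E[\bar{X}] = 0.5$ and $\mathrm{SD}[\bar{X}] \approx 0.01581$.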
Suppose after 1,000 throws, we observe "heads" 550 times. Is the coin
fair? The sample mean $\bar{X}$ is 0.55. If the coin is fair, then p = 0.5, and
$E[\bar{X}] = 0.5$ and $\mathrm{SD}[\bar{X}] \approx 0.01581$. There are two possibilities:
1. The coin is not fair, and comes up heads more often than tails.
2. The coin is fair, but came up heads more often than tails just
   due to chance.
Which is it?
When data are generated by a random process, we can never know
anything with absolute certainty. However, we may be able to come to a
conclusion with high probability.
We now construct a test statistic, of the form:
\[ Z = \frac{\bar{X} - \mu_0}{\sigma} \]
where $\bar{X}$ is the sample mean (i.e., the mean estimated from the data), $\mu_0$
is the hypothesized mean (in this case, 0.5, since we are testing whether
the coin is fair), and $\sigma$ is the standard deviation of the quantity being
tested. Since 550 throws out of 1,000 came up heads, $\bar{X} = 0.55$, vs. the
hypothesized value of $\mu_0 = 0.5$. We have calculated $\sigma = 0.01581$. So the
test statistic is:
\[ Z = \frac{\bar{X} - \mu_0}{\sigma} = \frac{0.55 - 0.50}{0.01581} = 3.16 \]
Intuitively, the observed outcome (550 heads) is 3.16 standard deviations
above the mean outcome, if the coin were fair. Could this have happened
by chance?
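The arithmetic behind the test statistic can be written as a small function. This is a sketch for the coin example only: it takes the standard deviation of the sample mean to be the square root of p0 (1 - p0) / N under the hypothesized probability p0, as derived above; the name `z_statistic` is illustrative.

```python
import math

def z_statistic(heads, throws, p0=0.5):
    """Number of standard deviations between the sample mean and p0."""
    sample_mean = heads / throws
    # Standard deviation of the sample mean if the true probability is p0
    sigma = math.sqrt(p0 * (1 - p0) / throws)
    return (sample_mean - p0) / sigma

z = z_statistic(550, 1000)
print(round(z, 2))
```

The same function can be used to see how the statistic changes with the sample size, for example `z_statistic(55, 100)` for the same proportion of heads in only 100 throws.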
Certainly 550 heads could have happened by chance; 600 heads, 900
heads, or 999 heads, or even 1,000 heads could have happened by chance.
But how likely is it? We can get some idea of how probable an outcome is,
due to chance, even if the hypothesis being tested is true, using a result
known as Chebyshev's inequality.
This result states that a random variable takes values at least k
standard deviations away from the mean with a probability that is at most
$1/k^2$.
For $k \le 1$, the bound is at least 1, but we knew that
already, since nothing can happen with probability greater than one. But
for two standard deviations, Chebyshev's inequality tells us that such
outcomes can happen with probability of at most 1/4; depending on the
actual distribution, the true probability might be smaller. Outcomes three
standard deviations away from the mean happen with probability of at
most 1/9, etc.
In this case, the probability of getting a realised value of $\bar{X}$ that is
k = 3.16 standard deviations away from the mean is at most $1/k^2 = 0.10$.
So 550 heads could have occurred by chance, even if the coin is fair; but
the probability that the outcome would be 50 or more coin throws away
from the expected value of 500, is at most 0.10.
Are you willing to conclude that the coin is not fair, based on this test? If
not, how extreme would the outcome have to be in order to convince you
that the coin is not fair?
In fact, the actual probability of an outcome at least as extreme as 550
heads, assuming the coin is fair, is quite a bit smaller than 0.10. The exact
distribution of the outcome is known in this case; it is called the binomial
distribution. However, the binomial distribution is a bit unwieldy for large
values of N, so we will resort to an approximation.
Central Limit Theorem: when the number of observations is large, the
distribution of the sample mean $\bar{X}$ is approximately normal, regardless of
the distribution of X. (Requires existence of finite mean and variance.)
If a random variable has a normal distribution, then any linear function of
that random variable also has a normal distribution. (Can you prove it?)
The sample mean, $\bar{X}$, has a normal distribution (approximately) by the
central limit theorem. Recall the test statistic:
\[ Z = \frac{\bar{X} - \mu_0}{\sigma} \]
The test statistic Z is a linear function of $\bar{X}$ (note the other quantities in
the expression above are not random), and therefore also has
approximately a normal distribution.
What are the mean and standard deviation of the test statistic Z?
(Assume the hypothesis, that E[X] = 0.5, is true.)
\[ E[Z] = E\left[ \frac{\bar{X} - \mu_0}{\sigma} \right] = \frac{E[\bar{X}] - \mu_0}{\sigma} = \frac{\mu_0 - \mu_0}{\sigma} = 0 \]
\[ \mathrm{Var}[Z] = \mathrm{Var}\left[ \frac{\bar{X} - \mu_0}{\sigma} \right] = \frac{1}{\sigma^2} \mathrm{Var}[\bar{X} - \mu_0] = \frac{1}{\sigma^2} \mathrm{Var}[\bar{X}] = \frac{1}{\sigma^2} \sigma^2 = 1 \]
\[ \mathrm{SD}[Z] = \sqrt{\mathrm{Var}[Z]} = \sqrt{1} = 1 \]
The test statistic Z thus has approximately a normal distribution, with
mean of 0 and variance of 1. (This is not a coincidence: the test statistic
was designed to have these properties.)
We can now use the test statistic to determine how likely an outcome of
550 heads is, if the coin is fair.
Basic properties of a normal distribution:
1. The realised value is within one standard deviation of the mean with
   probability 0.682.
2. The realised value is within two standard deviations of the mean with
   probability 0.954.
3. The realised value is within three standard deviations of the mean
   with probability 0.997.
These statistics are determined by integrating over the appropriate range
of the density function for the normal distribution.
For example, to find the second result, we can calculate:
\[ \mathrm{Prob}(\mu - 2\sigma \le X \le \mu + 2\sigma) = \int_{\mu - 2\sigma}^{\mu + 2\sigma} \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}} \, dx \]
The integral above cannot be found in closed-form; however, it can be
evaluated numerically. (A closed-form approximation that is known to be
accurate to at least 15 decimal places does exist.)
Many books have tables of the value of integrals of the normal density
function for different ranges, and many software packages can also
calculate it. By any of these methods, we can determine that an
observation at least 3.16 standard deviations from the mean occurs with
probability of only 0.00159.
In other words, if you were to throw a fair coin 1,000 times, the combined
probability that you would get either
1. 550 heads or more
2. 450 heads or fewer
is only 0.00159, and the probability that the number of heads will fall
strictly between 450 and 550 is 0.99841. (These probabilities are based on an
approximation, that the sample mean has a normal distribution. The
approximation is fairly accurate in this case.)
[Figure: Coin Throw Example, 1,000,000 Trials, 1,000 Throws Each Trial]
[Figure: Coin Throw Example, Standardised Distribution]
Since the distribution of $\bar{X}$ is approximately normal for a large number of
coin throws, the probability that the number of heads would differ from
the mean value by at least 50 is approximately 0.00159.
The true value (based on the exact distribution of $\bar{X}$, which in this
example is binomial) is 0.00173; the assumption of normality leads to
some inaccuracy, but not too much.
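The probabilities quoted in this example can be checked numerically. The sketch below is illustrative only: the function names are made up for this example, the normal probabilities are obtained from the error function in Python's standard library, and the exact binomial probability is computed by brute-force counting (assuming an even number of throws). Small differences from the rounded figures quoted in the text are to be expected.

```python
import math

def within_k_sd(k):
    """P(|Z| <= k) for a standard normal Z."""
    return math.erf(k / math.sqrt(2.0))

def two_sided_normal_tail(z):
    """P(|Z| >= z) for a standard normal Z."""
    return math.erfc(z / math.sqrt(2.0))

def exact_binomial_two_sided(n, deviation):
    """P(|heads - n/2| >= deviation) for n throws of a fair coin (n even)."""
    centre = n // 2
    favourable = sum(
        math.comb(n, k) for k in range(n + 1) if abs(k - centre) >= deviation
    )
    return favourable / 2 ** n

n_throws = 1000
sd_heads = math.sqrt(n_throws * 0.5 * 0.5)  # SD of the number of heads
z = 50 / sd_heads                           # about 3.16 standard deviations

p_normal = two_sided_normal_tail(z)
p_exact = exact_binomial_two_sided(n_throws, 50)

for k in (1, 2, 3):
    print(f"within {k} SD: {within_k_sd(k):.3f}")
print(f"z = {z:.2f}, normal approximation = {p_normal:.5f}, exact = {p_exact:.5f}")
```

Both tail probabilities come out below 0.002, so the conclusion drawn in the text does not depend on which of the two is used.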
So, if the coin were fair, the expected number of heads would be 500, and
a realised value as far away as 550 would occur with probability of less
than 0.002; the probability that the number of heads would be closer to
500 is more than 0.998.
Does 550 heads seem very likely to occur just by chance? Are you willing
to declare that the coin is not fair?
Whether we use the approximate probability of 0.00159 (based on the
normal approximation) or the exact probability of 0.00173 (based on the
binomial distribution), this number has a name: it is often called the
p-value. A p-value is simply the probability that, under the hypothesis
being tested, data at least as extreme as what has been observed would
occur just by chance. The p-value in this example is rather extreme: a
result this extreme (50 or more heads away from the expected value of 500)
should occur just by chance, if the coin were fair, fewer than two times
out of a thousand. If the coin were fair, we have just observed quite a
remarkable coincidence. It is possible the coin is fair, but it doesn't
seem very likely.
We will now try to formalise this idea.
We have an hypothesis: the coin is fair, and the probability of heads is
0.5.
We also have evidence: 550 heads out of 1,000 coin throws.
There are two types of errors we can make here:
1. Type I Error: we reject the hypothesis (that is, conclude that the
   coin is not fair) when it in fact is fair.
2. Type II Error: we fail to reject the hypothesis (concluding the coin is
   fair) when it is in fact not fair.
It is impossible to avoid both types of errors completely. All we can do is
trade the probability of one off against the other.
The nearly universal convention in finance and economics (which is
completely arbitrary) is to set the probability of a Type I Error at 0.05.
Hypothesis: the coin is fair (the probability of heads is 0.5).
Evidence: 550 heads from 1,000 coin throws.
If the hypothesis is true, the probability of getting a deviation from the
mean this large is only 0.00159 (using the normal approximation; the
exact p-value is 0.00173).
Since this probability is less than 0.05, we reject the hypothesis, and
conclude the coin is not fair.
Could we have just made a Type I error?
Yes, we could have just made a Type I error. The only way to avoid Type I
errors (incorrect rejection of an hypothesis that is true) is never to reject
any hypothesis. If one takes that approach, one is likely to commit quite a
lot of Type II errors (failure to reject an hypothesis which is false).
When the hypothesis is true, if we use a cut-off of 0.05 (as we did in this
example), we are likely to reject the hypothesis (incorrectly) one time in
every twenty. If this risk of Type I error is unacceptably large, we can lower
our cut-off; for example, we could reject the hypothesis only if the p-value
is less than 0.02. Then we will only commit a Type I error one time in
every fifty, which is an improvement. However, this comes at a price: the
probability of a Type II error goes up. We will fail to reject an hypothesis
that is false more often, if we decrease our cut-off value. There is no way
around this trade-off.
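The trade-off can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the coin is actually biased with a probability of heads of 0.53 (a made-up value, not taken from the example), and uses the normal approximation to the sample mean of 1,000 throws. The function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist

def type_ii_probability(alpha, n, p_true, p_null=0.5):
    """Approximate P(fail to reject p_null) when the true heads probability is p_true.

    Two-sided test with Type I error probability alpha, based on the normal
    approximation to the sample mean of n coin throws.
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se_null = math.sqrt(p_null * (1.0 - p_null) / n)
    lower = p_null - z_crit * se_null  # reject if the sample mean is below this
    upper = p_null + z_crit * se_null  # ... or above this
    sampling = NormalDist(mu=p_true, sigma=math.sqrt(p_true * (1.0 - p_true) / n))
    return sampling.cdf(upper) - sampling.cdf(lower)

# Hypothetical biased coin: probability of heads 0.53 (an assumption).
beta_05 = type_ii_probability(alpha=0.05, n=1000, p_true=0.53)
beta_02 = type_ii_probability(alpha=0.02, n=1000, p_true=0.53)
print(f"cut-off 0.05: P(Type II error) = {beta_05:.3f}")
print(f"cut-off 0.02: P(Type II error) = {beta_02:.3f}")
```

Under these assumptions, lowering the cut-off from 0.05 to 0.02 reduces the Type I error probability but raises the probability of failing to detect the biased coin.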
One could take the approach of trying to assess how costly Type I and
Type II errors are, and changing the cut-off value accordingly. For
example, consider a medical test that is designed to detect the early stages
of a curable disease. If our hypothesis is "the patient is healthy", then a
Type I error is a false positive: concluding that the patient is sick, when in
fact the patient is healthy. A Type II error is a false negative: failure to
detect the disease, when the patient in fact has it.
If the test is very sensitive, there will be very few false negatives (very few
Type II errors), but there will also be a lot of false positives (lots of Type I
errors). If the test is adjusted so that it is not so sensitive, then there will
be fewer false positives, but more false negatives. So how sensitive should
we make the test?
If we conclude that the cost of a Type II error is very high (a sick patient
fails to get treatment, wrongly believing s/he is healthy), whereas the
Type I error is less costly (a healthy patient has some rather anxious
moments, and undergoes some additional testing/treatment before it is
realised that there was a false positive), then we should make the test very
sensitive. If the costs are different (for example, maybe the disease is not
so serious, and the treatment is expensive, painful, and largely ineffective),
then we should make the test less sensitive.
This type of analysis is used frequently in some disciplines, such as
engineering. It has largely gone out of fashion in financial analysis, where
arbitrary benchmarks (such as 0.05 probability of a Type I error) are
commonplace.
Basic Principles Testing Pricing Models
Returning to the three securities mentioned earlier:

                                             Asset
                                           X      Y      Z
Average return (predicted)                 8%    10%    12%
Average return (observed)                  6%    16%    14%
Standard deviation of return (observed)   25%    40%    60%
Recall that the observed quantities were estimated from 20 years of
monthly returns data. Can we safely conclude that the securities do not
conform to the predictions of the theory?
This problem is much more difficult than the coin throwing example.
Assume the predictions of the model are correct; then the deviations of
the observed average returns from the predicted average returns are just
due to the random variation of the data. We already know:
$$E[\bar{X}] = 8\% \qquad E[\bar{Y}] = 10\% \qquad E[\bar{Z}] = 12\%$$
But we need to know the standard deviations as well:
$$SD[\bar{X}] = \,? \qquad SD[\bar{Y}] = \,? \qquad SD[\bar{Z}] = \,?$$
There were 20 years of monthly data, so N = 240, and $\sqrt{240} \approx 15.49$.
Therefore:
$$SD[\bar{X}] = \frac{SD[X]}{15.49} \qquad SD[\bar{Y}] = \frac{SD[Y]}{15.49} \qquad SD[\bar{Z}] = \frac{SD[Z]}{15.49}$$
The problem is that we do not know the standard deviations of X, Y, and
Z; we can only estimate them from the data. Estimates were included in
the table, but how these were determined was not specified.
The usual way of estimating the variance of a random variable (which can
then be used to estimate the variance of the sample average) is as follows:
$$s^2_{XX} = \frac{1}{N - 1} \sum_{i=1}^{N} \left(X_i - \bar{X}\right)^2$$
Note that, in order to calculate $s^2_{XX}$, we must first calculate $\bar{X}$. The
presence of the N - 1 (instead of N) in the denominator may seem
puzzling; this is a correction to account for the fact that the mean is not
known exactly, but must be estimated with $\bar{X}$.
The sample variance $s^2_{XX}$ is itself a random variable; what are its
statistical properties?
We have all the tools we need to find its mean and variance, although the
algebra can be tedious.
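$$\begin{aligned}
E\left[s^2_{XX}\right] &= E\left[\frac{1}{N - 1} \sum_{i=1}^{N} \left(X_i - \bar{X}\right)^2\right] \\
&= \frac{1}{N - 1} \sum_{i=1}^{N} \left(E\left[X_i^2\right] - 2 E\left[X_i \bar{X}\right] + E\left[\bar{X}^2\right]\right) \\
&= \frac{1}{N - 1} \sum_{i=1}^{N} \left(Var[X] + E[X]^2 - \frac{2}{N} Var[X] - 2 E[X]^2 + \frac{1}{N} Var[X] + E[X]^2\right) \\
&= Var[X]
\end{aligned}$$
Can you fill in the missing steps?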
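One way to fill in the intermediate steps is sketched below. It assumes the observations $X_1, \ldots, X_N$ are independent and identically distributed, so that $E[X_i X_j] = E[X]^2$ whenever $i \ne j$; that assumption is not stated explicitly above.

```latex
\begin{align*}
E\left[X_i^2\right] &= Var[X] + E[X]^2 \\
E\left[X_i \bar{X}\right] &= \frac{1}{N} \sum_{j=1}^{N} E\left[X_i X_j\right]
  = \frac{1}{N} \left(E\left[X_i^2\right] + (N - 1) E[X]^2\right)
  = \frac{1}{N} Var[X] + E[X]^2 \\
E\left[\bar{X}^2\right] &= Var\left[\bar{X}\right] + E\left[\bar{X}\right]^2
  = \frac{1}{N} Var[X] + E[X]^2 \\
E\left[\left(X_i - \bar{X}\right)^2\right] &= \left(1 - \frac{2}{N} + \frac{1}{N}\right) Var[X]
  = \frac{N - 1}{N} Var[X] \\
E\left[s^2_{XX}\right] &= \frac{1}{N - 1} \cdot N \cdot \frac{N - 1}{N} \, Var[X] = Var[X]
\end{align*}
```

The $E[X]^2$ terms cancel exactly, and the factor $(N - 1)/N$ is what the N - 1 divisor is designed to undo.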
The following results can also be derived, with considerable difficulty:
$$Var\left[s^2_{XX}\right] = (SD[X])^4 \left(\frac{2}{N - 1} + \frac{Kurt[X]}{N}\right)$$
$$Cov\left[\bar{X}, s^2_{XX}\right] = \frac{Skew[X] \, (SD[X])^3}{N}$$
If X happens to have a normal distribution, then its skewness and (excess)
kurtosis are each equal to zero, the sample mean and variance are
uncorrelated with each other, and the variance of $s^2_{XX}$ has a very simple form.
We will not prove these results, but if X has a normal distribution, then
$\bar{X}$ also has a normal distribution, and $s^2_{XX}$ has a (scaled)
chi-square distribution.
Returning to the example, consider security X. We have a theory that
predicts its expected return is 8%, but when we estimate the mean with
$\bar{X}$, it is 6%. The estimated standard deviation (we will use the notation
$s_X$) is 25%.
We would like to construct a test statistic:
$$Z = \frac{\bar{X} - \mu_0}{SD[\bar{X}]} = \sqrt{N} \left(\frac{\bar{X} - \mu_0}{SD[X]}\right)$$
If the hypothesis is correct, then the expected value of $\bar{X}$ is 8% and its
standard deviation is $SD[X] / \sqrt{240}$ (recall that there are 240 monthly
observations). The test statistic then has a mean of zero, and a standard
deviation of one. If X has a normal distribution, then Z also has a normal
distribution; even if X isn't normal, then by the central limit theorem, Z is
approximately normal for large N.
[Figure: Z-statistic for Stock Return Example: 1,000,000 Trials]
The test statistic Z is therefore ideal, except for one little problem: it is
infeasible. We don't know SD[X], and can only estimate it. Note that this
situation is different from the coin throwing example; there, under the
hypothesis (that the coin is fair, and the probability of heads is 1/2), we
knew the standard deviation of a coin throw. Here, we don't; the
hypothesis tells us what the value of the mean ought to be, but is silent
with respect to the variance and standard deviation.
Instead, we must use the estimated standard deviation, rather than the
actual, to form our test statistic:
$$t = \sqrt{N} \left(\frac{\bar{X} - \mu_0}{s_X}\right)$$
Because the standard deviation used in our test statistic is estimated, the
distribution of the test statistic is not normal, even if X is. Under the
assumption of normality for X, the test statistic t has a Student's t
distribution with N - 1 degrees of freedom.
The t-distribution approaches a standard normal distribution (i.e., a
normal distribution with a mean of zero and a standard deviation of one)
as the degrees of freedom become large. When there are many data
observed, the uncertainty in the estimate of the mean remains much larger
than the uncertainty in the estimate of the standard deviation, and the t
statistic approaches the distribution it would have if the standard deviation
were known with certainty: a standard normal. When the number of data
observations is small, though, the deviation from normality can be very
significant.
[Figure: T Distribution with Various Degrees of Freedom]
[Figure: T-statistic for Stock Return Example: 1,000,000 Trials]
[Figure: T-statistic with Non-Gaussian Returns: 1,000,000 Trials, T = 240]
The test statistic for security X is then:
$$t = \sqrt{N} \left(\frac{\bar{X} - \mu_0}{s_X}\right) = \sqrt{240} \left(\frac{6\% - 8\%}{25\%}\right) \approx -1.24$$
Since the number of degrees of freedom is quite large, we can simply treat
the t-statistic as if it were normally distributed. A test statistic of -1.24
corresponds to a (two-sided) p-value of approximately 0.215; that is, if the
hypothesis were true, there is still a probability of 0.215 that the sample
average return of the security would differ from the hypothesized value by
at least 2%.
If we use the 0.05 cut-off for p-values, as is common practice in finance,
we cannot reject the hypothesis that E[X] = 8%. The risk that we are
making a Type I error is too high.
Do the other securities provide evidence against the model?
Let's find out. We'll use subscripts to indicate t-statistics for different
securities.
$$t_X = \frac{6\% - 8\%}{\left(\frac{25\%}{15.49}\right)} \approx -1.24 \qquad t_Y = \frac{16\% - 10\%}{\left(\frac{40\%}{15.49}\right)} \approx 2.32$$
$$t_Z = \frac{14\% - 12\%}{\left(\frac{60\%}{15.49}\right)} \approx 0.52$$
The corresponding p-values for X, Y, and Z are 0.215, 0.020, and 0.603,
respectively. If we use 0.05 as our cut-off value (that is, 95% confidence),
then securities X and Z do not provide evidence against the theory, since
their p-values are larger than 0.05. However, security Y violates the
prediction; its p-value is less than 0.05, so we can reject the hypothesis
that the expected return is E[Y] = 10%.
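The three t-statistics and their p-values can be reproduced with a few lines of code. This is a minimal sketch: the variable and function names are illustrative, and the p-values treat the t-statistic as standard normal, as the text does for large N. Results may differ from the quoted figures in the third decimal place because the text rounds the t-statistics first.

```python
import math
from statistics import NormalDist

n_obs = 240  # 20 years of monthly returns

# (predicted mean, observed mean, observed standard deviation), in percent
securities = {
    "X": (8.0, 6.0, 25.0),
    "Y": (10.0, 16.0, 40.0),
    "Z": (12.0, 14.0, 60.0),
}

def t_statistic(predicted, observed, sd, n):
    """sqrt(n) * (sample mean - hypothesised mean) / sample standard deviation."""
    return math.sqrt(n) * (observed - predicted) / sd

def two_sided_p_value(t):
    """Two-sided p-value, treating t as standard normal (large-sample shortcut)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

results = {}
for name, (predicted, observed, sd) in securities.items():
    t = t_statistic(predicted, observed, sd, n_obs)
    results[name] = (t, two_sided_p_value(t))
    print(f"{name}: t = {t:+.2f}, p-value = {results[name][1]:.3f}")
```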
Basic Principles Multivariate Tests
Is there anything wrong with what we are doing here?
It doesn't make any sense to test the securities one at a time. Suppose the
model we are testing is actually true; it correctly describes the expected
returns of all securities. If we go out and test its predictions one security
at a time, then for each test we conduct, there is a 0.05 probability
(assuming 95% confidence) of a Type I error. If, for example, we test a
model for Japanese stock returns, and decide to conduct a statistical test
for each of the 225 stocks in the Nikkei 225 index, that is 225 chances to
have a Type I error. How likely is it that at least some of the stocks will
appear to violate the predictions of the model, just by chance, even though
the model is true?
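A rough answer can be computed under the simplifying assumption that the individual tests are independent of one another. That assumption is made here only for illustration; stock returns are typically correlated, so the true probability will differ. The function name is hypothetical.

```python
def prob_at_least_one_rejection(n_tests, alpha=0.05):
    """P(at least one Type I error) across n_tests independent tests of true hypotheses."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n_tests in (1, 3, 20, 225):
    p = prob_at_least_one_rejection(n_tests)
    print(f"{n_tests:3d} tests: P(at least one false rejection) = {p:.5f}")
```

Under independence, a false rejection somewhere among 225 tests is a near certainty, which is why a finding that "some stocks violate the model" carries little weight on its own.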
What we really ought to do is perform a single statistical test of all the
securities simultaneously. For example, we could consider a test statistic
along the lines of the following:
$$F = t_X^2 + t_Y^2 + t_Z^2 = \frac{\left(\bar{R}_X - \mu_{0,X}\right)^2}{\hat{\sigma}^2\left[\bar{R}_X\right]} + \frac{\left(\bar{R}_Y - \mu_{0,Y}\right)^2}{\hat{\sigma}^2\left[\bar{R}_Y\right]} + \frac{\left(\bar{R}_Z - \mu_{0,Z}\right)^2}{\hat{\sigma}^2\left[\bar{R}_Z\right]}$$
Intuitively, this statistic has some advantages: it is big when the
t-statistics for the individual assets are big, it places more weight on
violations of the theory's predictions for assets which have small standard
deviations, etc. It also seems like it has a distribution that can be
calculated: it is the sum of three squared t distributions. But are these t
distributions independent?
The test statistic just proposed doesn't work if we can't be sure that the
returns of the three assets are independent (or at least uncorrelated). We
can fix this defect, but first, we will need to be able to estimate
covariances from historical data. The usual way of estimating the
covariance between X and Y is:
$$s^2_{XY} = \frac{1}{T - 1} \sum_{t=1}^{T} \left(X_t - \bar{X}\right) \left(Y_t - \bar{Y}\right)$$
This estimator is unbiased, i.e., $E\left[s^2_{XY}\right] = Cov[X, Y]$. Derivation of its
variance (and covariance with other statistics) is very difficult.
The T - 1 divisor, instead of T, is often a point of confusion. T - 1 is
used to make our estimate unbiased. Some just use T, but if you estimate
covariance (or variance) this way, then your estimate is biased; it tends to
be a little too small, on average. For large T, it doesn't matter very much.
Some software products are quite inconsistent about which divisor they
use, T - 1 or T. For example, a spreadsheet product produced by a
software company based in Redmond, Washington, USA, uses T - 1 in the
VAR function, but T in the COVAR function. Therefore, even though
Cov[X, X] = Var[X] by definition, this software package returns different
values for "VAR(A1:A10)" and "COVAR(A1:A10,A1:A10)". When you
have a piece of software do these sorts of calculations for you, make sure it
is doing what you think it is doing.
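The same kind of check is easy to carry out in any environment. As an illustration, Python's standard `statistics` module offers both conventions: `statistics.variance` divides by T - 1 and `statistics.pvariance` divides by T. The data below are made up, and the `covariance` helper is written out by hand for this sketch so that the divisor is explicit.

```python
import statistics

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # made-up numbers, for illustration only

def covariance(xs, ys, divisor_offset):
    """Sample covariance with divisor len(xs) - divisor_offset (1 for T-1, 0 for T)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - divisor_offset)

var_unbiased = statistics.variance(data)  # divides by T - 1
var_biased = statistics.pvariance(data)   # divides by T

print("T - 1 divisor:", var_unbiased, covariance(data, data, 1))
print("T divisor:    ", var_biased, covariance(data, data, 0))
```

Cov[X, X] matches Var[X] only when both are computed with the same divisor; mixing the two conventions reproduces the spreadsheet discrepancy described above.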
When we need to estimate a correlation from historical data, we will do so
as follows:
$$\hat{\rho} = \frac{s^2_{XY}}{s_X \, s_Y}$$
The little hat over the $\rho$ indicates that the quantity is the estimated,
rather than true correlation.
We now return to the problem of constructing a joint test statistic. For
convenience, we will call the assets $X_1, \ldots, X_N$. It is convenient to arrange
the means of the assets in a column vector, and the variances and
covariances in a matrix:
$$\mu = \begin{pmatrix} E[X_1] \\ \vdots \\ E[X_N] \end{pmatrix} \qquad \Sigma = \begin{pmatrix} Var[X_1] & \cdots & Cov[X_1, X_N] \\ \vdots & \ddots & \vdots \\ Cov[X_N, X_1] & \cdots & Var[X_N] \end{pmatrix}$$
The sample equivalents are:
$$\hat{\mu} = \begin{pmatrix} \bar{X}_1 \\ \vdots \\ \bar{X}_N \end{pmatrix} \qquad \hat{\Sigma} = \begin{pmatrix} s^2_{11} & \cdots & s^2_{1N} \\ \vdots & \ddots & \vdots \\ s^2_{N1} & \cdots & s^2_{NN} \end{pmatrix}$$
where, through a slight abuse of previous notation, $s^2_{ij}$ is the sample
covariance of $X_i$ and $X_j$.
We will need three linear algebra operations to construct a reasonable test
statistic: matrix multiplication, matrix transposition, and matrix inversion.
In case these operations are not familiar, we will start with multiplication
of a row vector by a column vector. To perform this operation, we just
multiply each element in one of the vectors by its corresponding element in
the other vector, and add the products all up:
$$\begin{pmatrix} x_1 & \cdots & x_N \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \sum_{i=1}^{N} x_i y_i$$
The number of elements in the two vectors must be the same; otherwise
the product is undefined.
More generally, we can find the product of any two matrices, provided the
number of columns in the first matrix is equal to the number of rows in
the second matrix. The product of a K x M matrix and an M x N matrix
is a K x N matrix. The element in row i and column j of the product is
row i of the first matrix multiplied by column j of the second matrix:
$$\begin{pmatrix} x_{11} & \cdots & x_{1M} \\ \vdots & \ddots & \vdots \\ x_{K1} & \cdots & x_{KM} \end{pmatrix} \begin{pmatrix} y_{11} & \cdots & y_{1N} \\ \vdots & \ddots & \vdots \\ y_{M1} & \cdots & y_{MN} \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{M} x_{1i} y_{i1} & \cdots & \sum_{i=1}^{M} x_{1i} y_{iN} \\ \vdots & \ddots & \vdots \\ \sum_{i=1}^{M} x_{Ki} y_{i1} & \cdots & \sum_{i=1}^{M} x_{Ki} y_{iN} \end{pmatrix}$$
The inner dimensions of the two matrices must match, or the product is
undefined.
Many of the rules of ordinary multiplication do not apply to matrix
multiplication; for example, matrix multiplication is not commutative.
A numeric example of matrix multiplication:
$$\begin{pmatrix} 3 & 5 & -2 \\ 4 & 1 & 0 \end{pmatrix} \begin{pmatrix} 6 & 1 \\ -8 & 4 \\ 2 & 1 \end{pmatrix} = \begin{pmatrix} -26 & 21 \\ 16 & 8 \end{pmatrix}$$
Given the large number of operations involved, it is not a bad idea to have
a computer available before multiplying even relatively modestly sized
matrices together. For example, to multiply a 5 x 8 matrix by an 8 x 3
matrix requires 120 multiplications and 105 additions.
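As a minimal sketch of what such a computer calculation looks like, the function below multiplies two matrices stored as lists of rows, following the row-times-column rule directly. The function name is illustrative; in practice one would normally use a numerical library instead.

```python
def matmul(a, b):
    """Multiply matrix a (K x M) by matrix b (M x N), both given as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    n_cols = len(b[0])
    return [
        [sum(row[i] * b[i][j] for i in range(inner)) for j in range(n_cols)]
        for row in a
    ]

# The matrices from the numeric example.
x = [[3, 5, -2],
     [4, 1, 0]]
y = [[6, 1],
     [-8, 4],
     [2, 1]]

print(matmul(x, y))  # the 2 x 2 product
print(matmul(y, x))  # a 3 x 3 product: the order of multiplication matters
```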
Transpose is a very simple operation, usually denoted by either a T or a
prime superscript, i.e., $C^T$ or $C'$. The matrix is flipped around, so that the
rows become columns and the columns become rows:
$$\begin{pmatrix} x_{11} & \cdots & x_{1N} \\ \vdots & \ddots & \vdots \\ x_{M1} & \cdots & x_{MN} \end{pmatrix}^T = \begin{pmatrix} x_{11} & \cdots & x_{M1} \\ \vdots & \ddots & \vdots \\ x_{1N} & \cdots & x_{MN} \end{pmatrix}$$
A numeric example:
$$\begin{pmatrix} 1 & 3 & -2 \\ -8 & 0 & 4 \end{pmatrix}^T = \begin{pmatrix} 1 & -8 \\ 3 & 0 \\ -2 & 4 \end{pmatrix}$$
It doesn't get much easier than matrix transposition.
Matrix operations can be used to avoid cumbersome algebraic expressions
involving large numbers of assets. For example, consider N assets, with
returns $R_1, \ldots, R_N$, and a portfolio with share $a_1$ invested in the first
asset, $a_2$ invested in the second asset, and so on, up to $a_N$ invested in
asset N. (The weights $a_i$ should add up to one.) What is the variance of
the return of this portfolio?
$$Var[a_1 R_1 + \ldots + a_N R_N] = \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j \, Cov[R_i, R_j]$$
Arranging the $a_1, \ldots, a_N$ in a column vector a, the returns $R_1, \ldots, R_N$ in a
column vector R, and the variances and covariances of returns in a matrix
$\Sigma$, we can express the above as:
$$Var\left[a^T R\right] = a^T \Sigma a$$
(Try it!) This expression is valid for any number of assets.
Numeric example: suppose the returns of three assets have the covariance
matrix:
$$\Sigma = \begin{pmatrix} 0.040 & 0.012 & 0.020 \\ 0.012 & 0.090 & 0.036 \\ 0.020 & 0.036 & 0.160 \end{pmatrix}$$
What is the variance of the return of a portfolio that is 0.2 invested in the
first asset, 0.6 in the second asset, and 0.1 invested in the third asset?
$$\begin{pmatrix} 0.2 \\ 0.6 \\ 0.1 \end{pmatrix}^T \begin{pmatrix} 0.040 & 0.012 & 0.020 \\ 0.012 & 0.090 & 0.036 \\ 0.020 & 0.036 & 0.160 \end{pmatrix} \begin{pmatrix} 0.2 \\ 0.6 \\ 0.1 \end{pmatrix} = 0.0436$$
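The quadratic form can be evaluated directly from the double-sum expression for portfolio variance. The sketch below uses the covariance matrix and weights from the numeric example; the function and variable names are illustrative.

```python
def quadratic_form(weights, cov):
    """Compute a' * Sigma * a as the double sum of a_i * a_j * Cov[R_i, R_j]."""
    n = len(weights)
    return sum(
        weights[i] * cov[i][j] * weights[j] for i in range(n) for j in range(n)
    )

cov = [[0.040, 0.012, 0.020],
       [0.012, 0.090, 0.036],
       [0.020, 0.036, 0.160]]
weights = [0.2, 0.6, 0.1]

portfolio_variance = quadratic_form(weights, cov)
portfolio_sd = portfolio_variance ** 0.5
print(f"variance = {portfolio_variance:.4f}, standard deviation = {portfolio_sd:.4f}")
```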
Matrix inversion, usually denoted by a -1 superscript, as in $C^{-1}$, is a
rather difficult operation. The inverse of a matrix satisfies the condition:
$$C \, C^{-1} = C^{-1} \, C = I$$
where I is the identity matrix, which has 1 for each element on the
diagonal, and 0 everywhere else:
$$I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$$
If a matrix is not square (i.e., same number of rows and columns), it does
not have an inverse.
Square matrices may or may not have inverses, although covariance
matrices usually do. Specifically, every matrix $\Sigma$ that is the covariance
matrix of some set of random variables R is automatically positive
semidefinite:
$$Var\left[a^T R\right] = a^T \Sigma a \ge 0 \quad \forall a$$
Such a matrix is also positive definite if it satisfies the stronger condition:
$$Var\left[a^T R\right] = a^T \Sigma a > 0 \quad \forall a \ne 0$$
A covariance matrix has an inverse if and only if it is positive definite.
That is, if the only portfolio of assets that is risk-free (i.e., has variance of
zero) is the portfolio with weight zero on every asset, then the covariance
matrix of the asset returns is positive definite.
Numeric examples: matrix inversion is actually rather easy for diagonal
matrices, i.e., those in which the off-diagonal elements are all zero:
$$\begin{pmatrix} 5 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix}^{-1} = \begin{pmatrix} 0.2 & 0.0 & 0.0 \\ 0.0 & 0.5 & 0.0 \\ 0.0 & 0.0 & 1.0 \end{pmatrix}$$
Note that the inverse is also diagonal, and the elements are just the
reciprocals of the elements in the original matrix.
Things are a bit more complicated in general:
$$\begin{pmatrix} 3 & 6 & 1 \\ 4 & 7 & -2 \\ 6 & 13 & 0 \end{pmatrix}^{-1} = \begin{pmatrix} 1.6250 & 0.8125 & -1.1875 \\ -0.7500 & -0.3750 & 0.6250 \\ 0.6250 & -0.1875 & -0.1875 \end{pmatrix}$$
(Try verifying the inverses.)
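One way to verify an inverse is to multiply it by the original matrix, in both orders, and check that the result is the identity matrix. The sketch below does this for the two examples; the helper names are illustrative, and a small tolerance is used because of floating-point rounding.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def is_identity(m, tol=1e-9):
    """True if m is (numerically) an identity matrix."""
    return all(
        abs(m[i][j] - (1.0 if i == j else 0.0)) < tol
        for i in range(len(m))
        for j in range(len(m[0]))
    )

diag = [[5, 0, 0], [0, 2, 0], [0, 0, 1]]
diag_inv = [[0.2, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]]

a = [[3, 6, 1], [4, 7, -2], [6, 13, 0]]
a_inv = [[1.6250, 0.8125, -1.1875],
         [-0.7500, -0.3750, 0.6250],
         [0.6250, -0.1875, -0.1875]]

print(is_identity(matmul(diag, diag_inv)), is_identity(matmul(diag_inv, diag)))
print(is_identity(matmul(a, a_inv)), is_identity(matmul(a_inv, a)))
```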
Recall the example of the three securities, which were used to test a model
of expected returns. We have no information on the covariances between
the three asset returns; suppose these are all estimated at exactly zero
(not very likely, but assume so for purposes of the discussion). We can
arrange the sample mean returns in a vector, and the hypothesized mean
returns in another vector:
$$\hat{\mu} = \begin{pmatrix} 6\% \\ 16\% \\ 14\% \end{pmatrix} \qquad \mu_0 = \begin{pmatrix} 8\% \\ 10\% \\ 12\% \end{pmatrix}$$
The estimated variances and covariances can be arranged in a matrix:
$$\hat{\Sigma} = \begin{pmatrix} 0.0625 & 0 & 0 \\ 0 & 0.16 & 0 \\ 0 & 0 & 0.36 \end{pmatrix}$$
The proposed joint test statistic can be expressed as:
$$F = (\hat{\mu} - \mu_0)^T \, \hat{\Sigma}^{-1} \, (\hat{\mu} - \mu_0)$$
At an intuitive level, this test statistic has some good properties. When
any of the assets have an estimated expected return that is far from the
hypothesized value, this tends to make the test statistic large.
Furthermore, it gives more weight to assets whose mean is estimated more
accurately. If an asset has a small (estimated) variance of return, when
$\hat{\Sigma}$ is inverted, the corresponding element is large, giving more weight to the
deviation of this asset's average return from the hypothesized value. Assets
with large variance of return require larger differences between the observed
and hypothesized returns to have the same effect on the test statistic.
This test statistic works just as well when the asset returns are correlated;
the only modification we will make is to add a scaling factor:
$$F = \frac{T \, (T - N)}{N \, (T - 1)} \, (\hat{\mu} - \mu_0)^T \, \hat{\Sigma}^{-1} \, (\hat{\mu} - \mu_0)$$
where T is (as before) the number of observations, and N is the number
of assets. Under an assumption of normality (the asset returns have the
multivariate normal distribution), this test statistic has an F distribution.
An F distribution has two degrees of freedom parameters; the first is N,
and the second is T - N. This is sometimes written $F_{N,T-N}$. Tables of
the F distribution are widely available in statistics books and other
references; many software packages can calculate them.
[Figure: F-statistic for Stock Return Example: 1,000,000 Trials]
When T is very large, the assumption of multivariate normality is not
particularly important. Recall that, for our application, the first degrees of
freedom parameter is N, and the second is T - N. The $F_{d_1,d_2}$ distribution
approaches a chi-square distribution with $d_1$ degrees of freedom (divided
by $d_1$) as $d_2$ approaches $+\infty$; since $d_2 = T - N$ approaches $+\infty$ as T
becomes very large, this is the limiting distribution of the test statistic
for very large T. However, the test statistic approaches this distribution,
for very large T, even if the data are not multivariate normally distributed.
[Figure: Chi-square Distribution with Various Degrees of Freedom]
[Figure: F Distribution and Limiting Chi-square Distribution]
A test procedure is therefore:
1. Estimate the sample means, sample variances, and sample covariances
   of the asset returns from historical data.
2. Arrange the sample means into a vector, and the sample variances
   and covariances into a matrix.
3. Also arrange the hypothesized values of the mean returns into a
   vector.
4. Calculate the test statistic F.
5. Determine the p-value of this statistic, using tables from a book,
   software, or some other source.
6. If the p-value is small enough (e.g., smaller than 0.05 for a 95%
   confidence test), then reject the hypothesis that the model is correct.
Numeric example: suppose the (estimated) covariance matrix for the
three assets is:
$$\hat{\Sigma} = \begin{pmatrix} 0.0625 & -0.0200 & 0.0300 \\ -0.0200 & 0.1600 & 0.0240 \\ 0.0300 & 0.0240 & 0.3600 \end{pmatrix}$$
(Are these numbers consistent with the standard deviations reported
earlier?)
Can we reject, with 95% confidence, the predictions of the model?
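The test statistic is:
$$F = \frac{240 \, (240 - 3)}{3 \, (240 - 1)} \begin{pmatrix} 6\% - 8\% \\ 16\% - 10\% \\ 14\% - 12\% \end{pmatrix}^T \begin{pmatrix} 0.0625 & -0.0200 & 0.0300 \\ -0.0200 & 0.1600 & 0.0240 \\ 0.0300 & 0.0240 & 0.3600 \end{pmatrix}^{-1} \begin{pmatrix} 6\% - 8\% \\ 16\% - 10\% \\ 14\% - 12\% \end{pmatrix} \approx 2.066$$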
This distribution has 3 and 237 degrees of freedom. Many tables for the F
distribution do not actually show p-values for different values of the F
statistic, but rather a single cut-off value for tests of different confidence
levels. From a table for 95% confidence tests, we find that the cut-off
value for an F distribution with 3 and 120 degrees of freedom is 2.6802,
and for 3 and infinitely many degrees of freedom, it is 2.6049. For 3 and
237 degrees of freedom, it must be somewhere in between.
If the F-statistic is above the cut-off value (between 2.60 and 2.68), then
the p-value is below 0.05, and we can reject the hypothesis (correctness of
the model) with 95% confidence. If the F-statistic is below the cut-off
value, then the p-value is above 0.05, and we cannot reject the
hypothesis. (Recall that this does not mean the hypothesis is true; it
means we have not found sufficient evidence to conclude that the
hypothesis is false.)
The F-statistic is 2.066, which is well below the cut-off value, so we
cannot reject the hypothesis with 95% confidence. (We cannot reject it
with 90% confidence either; the p-value is 0.1053.)
So despite the fact that a t-test rejects the hypothesis for one of the assets
individually, a joint test based on an F-statistic fails to reject the
hypothesis. We have not seen enough evidence to convince us, with 95%
confidence, that the model is false.
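The calculation of the joint test statistic can be sketched in code. Rather than inverting the covariance matrix explicitly, the example below solves a linear system by Gaussian elimination, which gives the same product of the inverse with the vector of deviations. The function names are illustrative, and the p-value is not computed here because the F distribution is not available in Python's standard library.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x

def joint_f_statistic(sample_means, hypothesised_means, cov, n_obs):
    """Scaled quadratic form T(T-N)/(N(T-1)) * d' * inverse(cov) * d."""
    n_assets = len(sample_means)
    diff = [m - m0 for m, m0 in zip(sample_means, hypothesised_means)]
    weighted = solve(cov, diff)  # equals inverse(cov) * diff
    quad = sum(d * w for d, w in zip(diff, weighted))
    scale = n_obs * (n_obs - n_assets) / (n_assets * (n_obs - 1))
    return scale * quad

cov_hat = [[0.0625, -0.0200, 0.0300],
           [-0.0200, 0.1600, 0.0240],
           [0.0300, 0.0240, 0.3600]]
f_stat = joint_f_statistic([0.06, 0.16, 0.14], [0.08, 0.10, 0.12], cov_hat, 240)
print(f"F = {f_stat:.3f}")
```

The result can then be compared with the tabulated 95% cut-off values of 2.6049 and 2.6802 quoted above.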
It is worthwhile in a discussion of hypothesis testing to warn against the
dangers of data mining.
In some disciplines, data mining is considered a good thing; one can even
take a course to learn how to do it. In finance and economics, if someone
tells you that you are data mining, that person is not paying you a
compliment.
What is data mining? Recall that, even if an hypothesis is true, there is a
certain probability of committing a Type I error (rejecting the hypothesis
even when it is true). For example, suppose you believe that the level of
the high tide has an effect on stock market returns. The reality is that
your theory is wrong, and the tides have no effect on the stock market;
however, you don't know this.
So, you gather some data on the tides and the stock market, and perform
a statistical test of your hypothesis. Following common practice, you reject
the hypothesis "the tides have no effect on the stock market" if the
p-value of your statistical test is 0.05 or less. There is then a one in twenty
chance that you will reject the hypothesis, and conclude that the tides do
have an effect on the stock market (even though they don't).
Data mining refers to the practice of performing statistical test after
statistical test, until finding one that rejects, and then reporting only the
last test. This is a recipe for finding spurious results; chances are good
that the result you report will be a Type I error, rather than a legitimate
result.
The pressure to find results is enormous, both in academic and industry
circles. Failure to find a result may mean no publication in academia, and
no clients in industry. The incentives to engage in data mining are huge,
and many engage in it, either fully aware of what they are doing, or having
successfully deluded themselves into believing that what they are doing is
legitimate.
A rule of thumb is the following: if you can't think of a reasonable
economic story for the statistical result you have found, that should be a
warning sign that the result is the product of data mining.
Testing the CAPM
Testing the CAPM Conditional Probabilities
We need to look at the relation between the returns of multiple securities;
the notion of conditional probabilities is absolutely central to the analysis.
The probability of an event very likely depends on how much information
one has. For example, it is much easier to forecast the value of a stock (or
the weather, or an election) one day in advance than it is three years in
advance. The reason is that, over a three-year period, a great deal
happens that affects the value of the stock (or the weather, or the
outcome of the election). If you are making your forecast one day
in advance, then you already know almost everything that has affected the
variable you are forecasting during those three years; the only information
you are missing pertains to the one remaining day. If you are making your
forecast three years in advance, you are doing so with much less information.
Probabilities therefore depend on an information set; people with different
information have different probabilities for the same event. In some
contexts, the idea of the information set is left implicit; however, we will
sometimes need to make it explicit.
We will often deal with the situation of two distinct information sets, with
one being a strict subset of the other. Probabilities based on the more
informative information set are then called conditional probabilities, and
those based on the less informative information set are called
unconditional or marginal probabilities.
Example: genetic predisposition to a disease.
Suppose that some members of the population will develop a disease, with
some probability. It is then discovered that people with a specific genetic
mutation have greater probability of developing the disease. We will define
two random variables, D and M. Each of these random variables has only
two possible values.
$$D = \begin{cases} 1 & \text{if a person develops the disease} \\ 0 & \text{if a person does not develop the disease} \end{cases}$$
$$M = \begin{cases} 1 & \text{if a person has the genetic mutation} \\ 0 & \text{if a person does not have the genetic mutation} \end{cases}$$
Here are the joint probabilities of having the genetic mutation, and
developing the disease:

         M = 0   M = 1
D = 0    0.76    0.17
D = 1    0.04    0.03
Answer the following questions:
1. Are these numbers valid probabilities?
2. What is the probability that a randomly selected person has the
   genetic mutation?
3. What is the probability that a randomly selected person will develop
   the disease?
The answers to the last two questions are unconditional probabilities.
Suppose there is a test that can determine whether a person has the
genetic mutation. Then we might want to know answers to questions like:
1. What is the probability that a person with the genetic mutation will
   develop the disease?
2. What is the probability that a person without the genetic mutation
   will develop the disease?
The answers to these questions are conditional probabilities.
How can we calculate conditional probabilities?
Refer to any two events as A and B. For example, A can be the event "a
person develops the disease" (i.e., D = 1), and B can be the event "a
person has the genetic mutation" (i.e., M = 1). These events are relevant
for our example, but A and B can be any arbitrary events.
The usual definition of a conditional probability is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
In words, the probability that event A occurs, conditional on the fact that
event B occurs, is the unconditional probability that both events A and B
occur, divided by the unconditional probability that B occurs.
Technical issue: this definition has the problem that, if event B has
probability of zero, then the conditional probability is undefined (zero
divided by zero). This may seem like a rather minor defect, but there
are reasons to worry about it. So conditional probabilities are
sometimes instead defined as numbers that satisfy:
$$P(A \mid B) \, P(B) = P(A \cap B)$$
If there are events with probability of zero, then conditional probabilities
are not uniquely defined.
Conditional probabilities, by construction, satisfy the same constraints that
unconditional probabilities satisfy: they are non-negative, and they add up
to one. Let's calculate one:
$$P(D = 1 \mid M = 0) = \frac{P[(D = 1) \cap (M = 0)]}{P(M = 0)} = \frac{0.04}{0.76 + 0.04} = 0.05$$
Since D = 1 is the event that a person develops the disease, and
M = 0 is the event that the person does not have the genetic mutation,
we have calculated the probability that a person who does not have
the genetic mutation will develop the disease: specifically, 0.05.
What are the following conditional probabilities?
1. A person without the genetic mutation does not develop the disease.
2. A person with the genetic mutation does not develop the disease.
3. A person with the genetic mutation develops the disease.
Can you verify that the probabilities, conditional on having the genetic
mutation, are non-negative and add up to one? How about the
probabilities, conditional on not having the genetic mutation?
These are not the only conditional probabilities we can contemplate: what
is the probability that a person who developed the disease had the genetic
mutation?
Just as we have conditional and unconditional probabilities, we can have
conditional and unconditional expectations, conditional and unconditional
variances, conditional and unconditional correlations, etc.
Conditional expectations, variances, covariances, etc., have all the same
properties as unconditional expectations, variances, and covariances. The
former are based on conditional probabilities, and the latter are based on
unconditional probabilities.
However, there are a few results that involve the relations between
conditional and unconditional expectations, variances, etc., that we will
need.
Law of iterated expectations: consider some random variable Y. Since Y
is random, we do not know its value, but assume we know the probabilities
of different outcomes. We can calculate the expected value of Y from the
definition.
However, suppose there is another random variable X, and that we know
the joint probability distribution of X and Y. Then we can also calculate
the expectation of Y, conditional on different values of X. (For random
variables that take values of 1 and 0 only, the expected value is equal to
the probability of a 1 outcome, so we have already effectively done this in
the genetic mutation example.) We now ask what the relation is between
the conditional expectation, $E[Y \mid X]$, and the unconditional expectation
$E[Y]$.
The law of iterated expectations tells us:
$$E[Y] = E[E[Y \mid X]]$$
The conditional expectation, $E[Y \mid X]$, is itself a random variable, since it
depends on the value of X. We can find the (unconditional) expected
value of $E[Y \mid X]$, the same way we can find the expected value of any
other random variable. The law of iterated expectations tells us that the
unconditional expectation of $E[Y \mid X]$ is simply equal to the unconditional
expectation of Y itself.
We will make use of this result when dealing with regression analysis.
Proof of law of iterated expectations: take two random variables, X and
Y. Denote the M possible values of X as $x_1, \ldots, x_M$, and the N possible
values of Y as $y_1, \ldots, y_N$. Let $p_{i,j}$ be the probabilities of the $M \times N$
possible joint outcomes of X and Y.
The probabilities of Y, conditional on $X = x_i$ for some $1 \le i \le M$, are:
$$p_{j \mid i} = \frac{p_{i,j}}{\sum_{k=1}^{N} p_{i,k}}$$
We can use these probabilities to calculate the expected value of Y,
conditional on $X = x_i$:
$$E[Y \mid X = x_i] = \sum_{j=1}^{N} y_j\, p_{j \mid i} = \sum_{j=1}^{N} y_j \frac{p_{i,j}}{\sum_{k=1}^{N} p_{i,k}}$$
Note there is such a conditional expectation for each of the M possible
values of X.
We will now calculate the unconditional expectation of the conditional
expectation $E[Y \mid X]$, writing $p_i = \sum_{k=1}^{N} p_{i,k}$ for the marginal
probability that $X = x_i$:
$$E[E[Y \mid X]] = \sum_{i=1}^{M} p_i\, E[Y \mid X = x_i] = \sum_{i=1}^{M} p_i \sum_{j=1}^{N} y_j \frac{p_{i,j}}{\sum_{k=1}^{N} p_{i,k}} = \sum_{i=1}^{M} \sum_{j=1}^{N} y_j\, p_{i,j} = E[Y]$$
Can you verify the law of iterated expectations on the data from the
genetic mutation example?
Interpret the law of iterated expectations. What, in words, does it say?
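The relation between conditional and unconditional variance is not so
simple. But we can derive a relation, by taking the unconditional
expectation of the conditional variance, $\mathrm{Var}[Y \mid X]$:
$$\begin{aligned}
E[\mathrm{Var}[Y \mid X]] &= E\left[E[Y^2 \mid X] - (E[Y \mid X])^2\right] \\
&= E\left[E[Y^2 \mid X]\right] - E\left[(E[Y \mid X])^2\right] \\
&= E[Y^2] - (E[E[Y \mid X]])^2 + (E[E[Y \mid X]])^2 - E\left[(E[Y \mid X])^2\right] \\
&= \left(E[Y^2] - (E[Y])^2\right) - \left(E\left[(E[Y \mid X])^2\right] - (E[E[Y \mid X]])^2\right) \\
&= \mathrm{Var}[Y] - \mathrm{Var}[E[Y \mid X]]
\end{aligned}$$
After a little rearrangement:
$$\mathrm{Var}[Y] = E[\mathrm{Var}[Y \mid X]] + \mathrm{Var}[E[Y \mid X]]$$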
This result plays a role in regression analysis.
Example: suppose your job pays a salary and a bonus. If you do not
receive a promotion within the next year, then your total compensation will
be either S$500,000 or S$300,000, with equal probability. However, with
probability 0.5, you will receive a promotion, and your total compensation
will then be either S$2,500,000 or S$2,300,000, also with equal probability.
Answer the following questions:
1. What is the variance of your compensation, conditional on not
receiving a promotion?
2. What is the variance of your compensation, conditional on receiving a
promotion?
3. What is the unconditional variance of your compensation?
Note that the answer to the third question is much larger than the answer
to either of the first two questions. Why?
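One way to check your answers is to compute both sides of the variance decomposition numerically. The sketch below uses the compensation figures from the example; the helper function and variable names are illustrative, not part of the course material.

```python
# Each outcome: (promoted?, compensation in S$, joint probability).
outcomes = [
    (False, 300_000, 0.25),
    (False, 500_000, 0.25),
    (True, 2_300_000, 0.25),
    (True, 2_500_000, 0.25),
]

def mean_and_variance(pairs):
    """Mean and variance of a discrete distribution given (value, weight) pairs."""
    total = sum(p for _, p in pairs)
    mean = sum(v * p for v, p in pairs) / total
    var = sum((v - mean) ** 2 * p for v, p in pairs) / total
    return mean, var

# Conditional distribution for each promotion outcome: (probability, mean, variance).
cond = {}
for promoted in (False, True):
    pairs = [(v, p) for flag, v, p in outcomes if flag == promoted]
    prob = sum(p for _, p in pairs)
    cond[promoted] = (prob, *mean_and_variance(pairs))

uncond_mean, uncond_var = mean_and_variance([(v, p) for _, v, p in outcomes])

# E[Var[Y|X]]: probability-weighted average of the conditional variances.
expected_cond_var = sum(prob * var for prob, _, var in cond.values())
# Var[E[Y|X]]: variance of the conditional means around their own mean.
mean_of_cond_means = sum(prob * m for prob, m, _ in cond.values())
var_of_cond_means = sum(prob * (m - mean_of_cond_means) ** 2 for prob, m, _ in cond.values())

print(uncond_var, expected_cond_var + var_of_cond_means)
```

The mean of the conditional means equals the unconditional mean (the law of iterated expectations), and almost all of the unconditional variance comes from the variance of the conditional means.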
We can think about conditional probabilities in the context of continuous
probability distributions as well. Suppose that $f_{X,Y}(x, y)$ is the joint
probability density function of the random variables X and Y.
The marginal densities are:
$$f_X(x) = \int f_{X,Y}(x, y)\, dy \qquad f_Y(y) = \int f_{X,Y}(x, y)\, dx$$
The conditional densities are:
$$f_{X \mid Y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \qquad f_{Y \mid X}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)}$$
All the same results about conditional expectations and variances (e.g.,
the law of iterated expectations) hold in the world of continuous
probability distributions as well.
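Example: suppose the returns of two financial assets have a bivariate
normal distribution:
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sqrt{\sigma_X^2 \sigma_Y^2 (1 - \rho^2)}} \exp\left(-\frac{(x - \mu_X)^2 \sigma_Y^2 + (y - \mu_Y)^2 \sigma_X^2 - 2\rho\, (x - \mu_X)(y - \mu_Y)\, \sigma_X \sigma_Y}{2 \sigma_X^2 \sigma_Y^2 (1 - \rho^2)}\right)$$
This distribution has the following properties:
$$E[X] = \mu_X \qquad E[Y] = \mu_Y$$
$$\mathrm{Var}[X] = \sigma_X^2 \qquad \mathrm{Cov}[X, Y] = \rho\, \sigma_X \sigma_Y \qquad \mathrm{Var}[Y] = \sigma_Y^2$$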
The marginal distributions are:
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma_X^2}} \exp\left(-\frac{(x - \mu_X)^2}{2\sigma_X^2}\right) \qquad f_Y(y) = \frac{1}{\sqrt{2\pi\sigma_Y^2}} \exp\left(-\frac{(y - \mu_Y)^2}{2\sigma_Y^2}\right)$$
Note that the marginal distributions are Gaussian. X and Y are
independent if and only if they are uncorrelated, that is, if $\rho = 0$. (Note
that this is not a general principle, but a fact which is specific to this
distribution!)
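The conditional distributions are:
$$f_{X \mid Y}(x) = \frac{1}{\sqrt{2\pi\sigma_X^2 (1 - \rho^2)}} \exp\left(-\frac{\left(x - \mu_X - \rho\, (y - \mu_Y) \frac{\sigma_X}{\sigma_Y}\right)^2}{2\sigma_X^2 (1 - \rho^2)}\right)$$
$$f_{Y \mid X}(y) = \frac{1}{\sqrt{2\pi\sigma_Y^2 (1 - \rho^2)}} \exp\left(-\frac{\left(y - \mu_Y - \rho\, (x - \mu_X) \frac{\sigma_Y}{\sigma_X}\right)^2}{2\sigma_Y^2 (1 - \rho^2)}\right)$$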
From these, we can find the conditional means:
$$E[X \mid Y = y] = \mu_X + \rho\, (y - \mu_Y) \frac{\sigma_X}{\sigma_Y} \qquad E[Y \mid X = x] = \mu_Y + \rho\, (x - \mu_X) \frac{\sigma_Y}{\sigma_X}$$
Do these satisfy the law of iterated expectations?
The conditional variances are:
$$\mathrm{Var}[X \mid Y] = \sigma_X^2 \left(1 - \rho^2\right) \qquad \mathrm{Var}[Y \mid X] = \sigma_Y^2 \left(1 - \rho^2\right)$$
Do these satisfy the correct relation with the unconditional variances?
If you know the return of one security, your expectation of the return of
the other changes (provided $\rho \neq 0$), and the variance of the return is
smaller. These facts play a key role in regression analysis.
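The conditional mean and variance formulas can be checked by simulation. The following is a rough sketch using only the Python standard library; the parameter values are hypothetical and chosen purely for illustration.

```python
import math
import random

# Hypothetical parameters (not taken from the course data).
mu_x, mu_y = 0.01, 0.02
sigma_x, sigma_y = 0.05, 0.08
rho = 0.6

rng = random.Random(0)
n = 100_000
xs, ys = [], []
for _ in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    # Build correlated normals from two independent standard normals.
    xs.append(mu_x + sigma_x * z1)
    ys.append(mu_y + sigma_y * (rho * z1 + math.sqrt(1 - rho**2) * z2))

mean_x = sum(xs) / n
mean_y = sum(ys) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Sample slope of Y on X; the theoretical value is rho * sigma_y / sigma_x.
slope = cov_xy / var_x

# Deviations of Y from its theoretical conditional mean given X.
deviations = [y - (mu_y + rho * (x - mu_x) * sigma_y / sigma_x) for x, y in zip(xs, ys)]
cond_var = sum(d * d for d in deviations) / n

print(slope, rho * sigma_y / sigma_x)
print(cond_var, sigma_y**2 * (1 - rho**2))
```

With this many draws, the sample slope and the variance around the conditional mean should both be close to their theoretical values, and the latter is visibly smaller than the unconditional variance of Y.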
Testing the CAPM Regression and CAPM
We will be interested in more realistic models than the simplistic one used
above. For example, consider the so-called market model:
$$(R_i - R_f) = \alpha_i + \beta_i (R_M - R_f) + \varepsilon_i$$
where $R_i$ is the return of some security (indexed by i), $R_f$ is the
concurrent return of a risk-free asset, and $R_M$ is the return of the
market portfolio. (For our purposes, we do not need to worry about the
definition of the market portfolio; just take it as given.)
The above relation has no content unless there are some restrictions on $\varepsilon_i$.
(Without such restrictions, we can simply choose $\alpha_i$ and $\beta_i$ to be any
numbers at all, and then $\varepsilon_i$ is whatever number is necessary to make the
equation hold.)
One set of assumptions we could use is:
$$E[\varepsilon_i] = 0 \qquad \mathrm{Cov}[R_M, \varepsilon_i] = 0$$
The above equation therefore breaks the excess return (that is, the return
of the security minus the return of the risk-free asset) into two
components: a term proportional to the excess return of the market
portfolio, and a component (represented by $\alpha_i$ and $\varepsilon_i$) that is uncorrelated
with the market return.
Provided $\mathrm{Var}[R_M] > 0$, these restrictions are sufficient to identify $\alpha_i$ and
$\beta_i$ uniquely; in fact, we can then always find $\alpha_i$ and $\beta_i$ such that the
market model equation is satisfied. So, with these assumptions, the
market model has no economic content: it is always true.
A somewhat stronger assumption we will use has to do with the
conditional distribution of the $\varepsilon_i$:
$$E[\varepsilon_i \mid R_M] = 0$$
This condition implies the two earlier conditions. By the law of
iterated expectations:
$$E[\varepsilon_i] = E[E[\varepsilon_i \mid R_M]] = E[0] = 0$$
From the definition of a covariance:
$$\begin{aligned}
\mathrm{Cov}[R_M, \varepsilon_i] &= E[R_M \varepsilon_i] - E[R_M]\, E[\varepsilon_i] \\
&= E[E[R_M \varepsilon_i \mid R_M]] - E[R_M] \cdot 0 \\
&= E[R_M\, E[\varepsilon_i \mid R_M]] = E[R_M \cdot 0] = 0
\end{aligned}$$
So the assumption that $\varepsilon_i$, conditional on $R_M$, has a mean of zero, implies
both that it has an unconditional mean of zero, and that it has a
covariance with $R_M$ of zero.
The reverse implication does not hold; we can have both:
$$E[\varepsilon_i] = 0 \qquad \mathrm{Cov}[R_M, \varepsilon_i] = 0$$
without having $E[\varepsilon_i \mid R_M] = 0$. This can occur when there is a non-linear
form of dependence between the two variables.
[Figure: Non-linear dependence between X and $\varepsilon$]
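A small discrete example makes the point concrete. In this sketch (a hypothetical distribution, not one from the course), the error term is a quadratic function of X: it has zero mean and zero covariance with X, yet its conditional mean is clearly not zero.

```python
from fractions import Fraction

# Hypothetical distribution: X is -1, 0, or 1 with equal probability,
# and the error term is a deterministic, non-linear function of X.
x_values = [-1, 0, 1]
prob = Fraction(1, 3)

def eps(x):
    """Error term: x squared minus its mean, so that E[eps] = 0."""
    return x * x - Fraction(2, 3)

mean_x = sum(prob * x for x in x_values)
mean_eps = sum(prob * eps(x) for x in x_values)
cov_x_eps = sum(prob * (x - mean_x) * (eps(x) - mean_eps) for x in x_values)

# Given X, eps is known exactly, so E[eps | X = x] is just eps(x).
cond_mean_eps = {x: eps(x) for x in x_values}

print(mean_eps, cov_x_eps, cond_mean_eps)
```

Exact fractions are used so that the zero mean and zero covariance are exact rather than approximate.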
So the assumption:
$$E[\varepsilon_i \mid R_M] = 0$$
has actual restrictive content (i.e., the model may or may not be true).
The other two, less restrictive, assumptions serve only to identify $\alpha_i$ and
$\beta_i$, but are always true once these two constants are chosen appropriately.
We don't know the values of $\alpha_i$ and $\beta_i$. Can we estimate them from
historical data?
Linear regression is the most commonly used tool in financial economics.
Many researchers do not know how to do much else. Many of them don't
even know how to do a linear regression properly; it is also the most
misused tool in financial economics. But it is a tool we can use to
estimate the amount of market risk (i.e., the value of $\beta_i$) the security has,
the amount of idiosyncratic risk (i.e., the magnitude of the error term
$\varepsilon_i$), and the magnitude of the risk premium (the value of $\alpha_i$) the security
earns that is not already included in the market return component.
The linear regression model (single variable) is:
$$\underbrace{Y}_{\text{Dependent variable}} = \alpha + \beta \underbrace{X}_{\text{Independent variable}} + \underbrace{\varepsilon}_{\text{Error term}}$$
The random process Y is related to another variable (possibly random) X;
knowledge of X allows us to make more precise predictions about Y than
we would be able to make if we didn't know the value of X. X is usually
called the independent, or explanatory, variable; Y is called the dependent
variable. The relation between X and Y is linear. In financial economics,
there are very sound theoretical reasons to believe there should be linear
relations between certain variables, so linear regression is an appropriate
econometric technique.
Without any constraints on $\varepsilon$, there are many different ways to express
such a relation between the dependent and independent variables.
However, the usual assumptions in regression analysis are analogous to
those we made in the case of the market model. The weakest possible
assumptions are:
$$E[\varepsilon] = 0 \qquad \mathrm{Cov}[\varepsilon, X] = 0$$
These two restrictions are adequate to identify $\alpha$ and $\beta$ in the regression
equation:
$$Y = \alpha + \beta X + \varepsilon$$
Taking expected value of both sides, we find:
$$E[Y] = E[\alpha + \beta X + \varepsilon] = \alpha + \beta E[X] + E[\varepsilon] = \alpha + \beta E[X]$$
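Suppose we want to forecast the value of Y using a linear function of X:
$$Y_0 = \alpha + \beta X$$
What is the best forecast? We will measure the forecast error by mean
squared error:
$$\begin{aligned}
MSE = E\left[(Y - Y_0)^2\right] &= E\left[(Y - \alpha - \beta X)^2\right] \\
&= E[Y^2] + \alpha^2 + \beta^2 E[X^2] - 2\alpha E[Y] - 2\beta E[XY] + 2\alpha\beta E[X] \\
&= \left(\sigma_Y^2 + \mu_Y^2\right) + \alpha^2 + \beta^2 \left(\sigma_X^2 + \mu_X^2\right) - 2\alpha\mu_Y - 2\beta \left(\mu_X \mu_Y + \rho\, \sigma_X \sigma_Y\right) + 2\alpha\beta\mu_X
\end{aligned}$$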
The objective now is to minimise forecasting error.
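First-order conditions: differentiate MSE with respect to $\alpha$ and $\beta$:
$$\frac{\partial}{\partial \alpha} MSE = 2\alpha - 2\mu_Y + 2\beta\mu_X$$
$$\frac{\partial}{\partial \beta} MSE = 2\beta \left(\sigma_X^2 + \mu_X^2\right) - 2 \left(\mu_X \mu_Y + \rho\, \sigma_X \sigma_Y\right) + 2\alpha\mu_X$$
Setting these two equal to zero and solving:
$$\alpha = \mu_Y - \beta\mu_X \qquad \beta = \frac{\rho\, \sigma_X \sigma_Y}{\sigma_X^2} = \frac{\mathrm{Cov}[X, Y]}{\mathrm{Var}[X]}$$
Note that, with this choice of $\alpha$ and $\beta$, the forecast is unbiased:
$$E[Y - Y_0] = E[Y - \alpha - \beta X] = \mu_Y - \alpha - \beta\mu_X = 0$$
So this choice of $\alpha$ and $\beta$ produces a forecast of Y that is, on average,
correct, and that minimises the forecast variance.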
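The result can be checked numerically. The sketch below treats a handful of hypothetical (x, y) pairs as equally likely joint outcomes, computes the coefficients from the moment formulas, and confirms that nearby coefficient values give a larger mean squared error. The data and names are illustrative assumptions.

```python
# Hypothetical (x, y) pairs, each treated as an equally likely joint outcome.
pairs = [(-0.05, -0.08), (0.00, 0.01), (0.02, 0.03), (0.04, 0.09), (0.07, 0.10)]
n = len(pairs)

mu_x = sum(x for x, _ in pairs) / n
mu_y = sum(y for _, y in pairs) / n
var_x = sum((x - mu_x) ** 2 for x, _ in pairs) / n
cov_xy = sum((x - mu_x) * (y - mu_y) for x, y in pairs) / n

# Coefficients from the first-order conditions.
beta = cov_xy / var_x
alpha = mu_y - beta * mu_x

def mse(a, b):
    """Mean squared error of the linear forecast a + b * x."""
    return sum((y - a - b * x) ** 2 for x, y in pairs) / n

best = mse(alpha, beta)
# Average forecast error at the optimum (should be zero: the forecast is unbiased).
bias = sum(y - alpha - beta * x for x, y in pairs) / n

# MSE at a grid of perturbed coefficients around the optimum.
perturbed = [
    mse(alpha + da, beta + db)
    for da in (-0.01, 0.0, 0.01)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]

print(alpha, beta, best, min(perturbed))
```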
Variance Decomposition: note that the variance of the dependent variable
can be expressed simply in two parts:
$$Y = \alpha + \beta X + \varepsilon$$
$$\begin{aligned}
\mathrm{Var}[Y] &= \mathrm{Var}[\alpha + \beta X + \varepsilon] \\
&= \beta^2\, \mathrm{Var}[X] + \mathrm{Var}[\varepsilon] + 2\beta \underbrace{\mathrm{Cov}[X, \varepsilon]}_{=0} \\
&= \beta^2\, \mathrm{Var}[X] + \mathrm{Var}[\varepsilon]
\end{aligned}$$
The variance of the dependent variable therefore consists of a component
that is explained by the independent variable, and another component that
is not explained. The covariance term in the expression for the variance of
a sum goes away, because X and $\varepsilon$ are assumed to be uncorrelated.
(We could also say that the $\varepsilon$ are defined so that this is true.)
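Note that the market model is conveniently written in the same form as a
linear regression model:
$$\underbrace{(R_i - R_f)}_{\text{Dependent variable}} = \alpha_i + \beta_i \underbrace{(R_M - R_f)}_{\text{Independent variable}} + \underbrace{\varepsilon_i}_{\text{Error term}}$$
We can therefore apply linear regression results to the market model, just
by making the appropriate changes in notation:
$$R_i - R_f = \alpha_i + \beta_i (R_M - R_f) + \varepsilon_i$$
$$\mathrm{Var}[R_i - R_f] = \underbrace{\beta_i^2\, \mathrm{Var}[R_M - R_f]}_{\text{Systematic Risk}} + \underbrace{\mathrm{Var}[\varepsilon_i]}_{\text{Idiosyncratic Risk}}$$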
The risk (measured by variance) of the excess return of any security can
therefore be expressed as the sum of two components:
1. Systematic, or market, risk: the variance of the excess return of the
market, multiplied by the squared beta coefficient of the security.
2. Idiosyncratic risk, a source of risk that is uncorrelated with the market
return.
Recall the value of $\beta_i$:
$$\beta_i = \frac{\mathrm{Cov}[R_i - R_f, R_M - R_f]}{\mathrm{Var}[R_M - R_f]}$$
So $\beta_i$ is a measure of how much the excess return $R_i - R_f$ covaries with
the market excess return $R_M - R_f$; that is, how much systematic risk
the excess return of the security has.
Only very minor technical assumptions (existence of the means and
variances of the dependent and independent variables) have been made in
deriving this result. It is always true, subject to these technical
assumptions.
Example: CAPM. The CAPM was an early theory relating the excess
return of a security to a measure of its risk. The central prediction of
the CAPM is that, for every security:
$$E[R_i] = R_f + \beta_i \left(E[R_M] - R_f\right)$$
where $\beta_i$ is exactly the same as in the market model:
$$\beta_i = \frac{\mathrm{Cov}[R_i - R_f, R_M - R_f]}{\mathrm{Var}[R_M - R_f]}$$
A very similar result can be obtained simply by taking the expected value
of both sides of the equation for the market model.
The excess return of the security is decomposed into a market component
and an idiosyncratic component. If we take expected values of both sides
(and rearrange a bit), we have:
$$\begin{aligned}
R_i - R_f &= \alpha_i + \beta_i (R_M - R_f) + \varepsilon_i \\
E[R_i - R_f] &= E[\alpha_i + \beta_i (R_M - R_f) + \varepsilon_i] \\
E[R_i] - R_f &= \alpha_i + \beta_i \left(E[R_M] - R_f\right) \\
E[R_i] &= R_f + \alpha_i + \beta_i \left(E[R_M] - R_f\right)
\end{aligned}$$
This is a purely mechanical derivation that is always true (subject to
existence of the means and variances of the excess returns), with no
economic content. But note that the last line is almost the same as the
CAPM equation, except that it has an extra $\alpha_i$ term.
The prediction of the CAPM is therefore that $\alpha_i = 0$ for every security,
that is, exposure to idiosyncratic risk does not change the expected return
of the security. The only way to have an expected return different from
the risk-free rate is to face market risk.
The derivation of the CAPM involves an equilibrium argument: given a
model for investor behaviour (specifically, investors care about the mean of
return and standard deviation of return of their portfolios), security
markets are in equilibrium (that is, supply of each security is equal to its
demand) if and only if the market is the tangency portfolio, of
Markowitz portfolio theory fame.
[Figure: Portfolio Theory with Market Portfolio As Tangency]
The following characterisations of the CAPM are fully equivalent:
1. The market portfolio is the tangency portfolio, i.e., it is the only
portfolio consisting only of risky assets that is mean-variance efficient.
2. The expected return of every security satisfies
$$E[R_i] - R_f = \beta_i \left(E[R_M] - R_f\right)$$
where
$$\beta_i = \frac{\mathrm{Cov}[R_i - R_f, R_M - R_f]}{\mathrm{Var}[R_M - R_f]}$$
Although these two characterisations may seem unrelated, it is possible to
derive either one from the other. (Any idea how?)
The market model has no economic content: it is simply the
decomposition of the return of a security into components. The CAPM,
on the other hand, makes a very specific prediction, which can be
characterised either in terms of the mean-variance efficiency of the market
portfolio, or in terms of the $\alpha_i$ coefficients from the market model (i.e.,
they should all be zero).
The variance decomposition:
$$\mathrm{Var}[R_i - R_f] = \beta_i^2\, \mathrm{Var}[R_M - R_f] + \mathrm{Var}[\varepsilon_i]$$
is often cited within the context of the CAPM, but note that this is simply
a property of the market model decomposition of excess returns: it is
always true, whether or not the CAPM is true.
How can we determine whether the CAPM is true?
Unfortunately, when one purchases a financial asset, one does not receive
a notarised certificate stipulating the expected return, the standard
deviation of return, and the covariance with market return. So we cannot
calculate $\alpha$ and $\beta$, but only estimate them.
One method is to estimate the means and variances of both excess
returns, and the covariance between them, using the sample means,
sample variances, and sample covariance, and use these in place of the
true moments in the formulae for $\alpha$ and $\beta$:
$$\hat\alpha = \bar{Y} - \hat\beta \bar{X} \qquad \hat\beta = \frac{s_{XY}^2}{s_{XX}^2}$$
The little hats indicate that the quantities underneath are estimates,
rather than the true values.
We can now write a regression equation in terms of the estimated, rather
than true, parameters:
$$Y_t = \hat\alpha + \hat\beta X_t + \hat\varepsilon_t$$
(Sometimes a subscript of i is used instead of t, depending on the
context.) Note that the error term also has a hat over it, because these
are also estimated; if the true $\alpha$ and $\beta$ are different from the estimated
values, then the true error terms (also called residuals) are also different
from the estimated errors.
The regression estimates have a very particular property. We can consider
the fitted values of $Y_t$, that is, the values predicted by the estimated
regression equation, without the error terms:
$$\hat{Y}_t = \hat\alpha + \hat\beta X_t$$
The estimated error terms $\hat\varepsilon_t$ are then just the difference between the
observed value of $Y_t$ and the fitted values $\hat{Y}_t$:
$$Y_t - \hat{Y}_t = \hat\varepsilon_t$$
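Consider the sum of the squared errors:
$$\begin{aligned}
\sum_{t=1}^{T} \hat\varepsilon_t^2 &= \sum_{t=1}^{T} \left(Y_t - \hat\alpha - \hat\beta X_t\right)^2 \\
&= \sum_{t=1}^{T} Y_t^2 + T\hat\alpha^2 + \hat\beta^2 \sum_{t=1}^{T} X_t^2 - 2\hat\alpha \sum_{t=1}^{T} Y_t - 2\hat\beta \sum_{t=1}^{T} X_t Y_t + 2\hat\alpha\hat\beta \sum_{t=1}^{T} X_t
\end{aligned}$$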
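How can we choose $\hat\alpha$ and $\hat\beta$ to minimise the sum of the squared errors?
First differentiate with respect to $\hat\alpha$ and $\hat\beta$:
$$\frac{\partial}{\partial \hat\alpha} \sum_{t=1}^{T} \hat\varepsilon_t^2 = 2T\hat\alpha - 2\sum_{t=1}^{T} Y_t + 2\hat\beta \sum_{t=1}^{T} X_t$$
$$\frac{\partial}{\partial \hat\beta} \sum_{t=1}^{T} \hat\varepsilon_t^2 = 2\hat\beta \sum_{t=1}^{T} X_t^2 - 2\sum_{t=1}^{T} X_t Y_t + 2\hat\alpha \sum_{t=1}^{T} X_t$$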
Setting these equal to zero and solving for $\hat\alpha$ and $\hat\beta$:
$$\hat\alpha = \frac{1}{T} \sum_{t=1}^{T} Y_t - \hat\beta\, \frac{1}{T} \sum_{t=1}^{T} X_t = \bar{Y} - \hat\beta \bar{X}$$
$$\hat\beta = \frac{\sum_{t=1}^{T} X_t Y_t - \frac{1}{T} \sum_{t=1}^{T} X_t \sum_{t=1}^{T} Y_t}{\sum_{t=1}^{T} X_t^2 - \frac{1}{T} \left(\sum_{t=1}^{T} X_t\right)^2} = \frac{s_{XY}^2}{s_{XX}^2}$$
So using the sample means, variances, and covariances to estimate $\alpha$ and
$\beta$ minimises the sum of the squared residual terms; it provides the best
possible fit, measured by the sum of squared residuals.
For this reason, the linear regression technique we have examined is called
least squares regression, or ordinary least squares regression. (The
"ordinary" distinguishes it from a more advanced technique, generalised
least squares.) Use of this technique is so common that one often sees it
abbreviated simply as OLS.
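As a rough illustration, the OLS estimates can be computed directly from sample moments, and the first-order conditions checked at the result. The return series below are hypothetical numbers made up for the sketch; only the Python standard library is used.

```python
# Hypothetical monthly excess returns (market X, security Y), for illustration only.
x = [0.021, -0.034, 0.012, 0.045, -0.018, 0.007, -0.052, 0.030]
y = [0.035, -0.050, 0.010, 0.081, -0.020, 0.015, -0.090, 0.042]
T = len(x)

x_bar = sum(x) / T
y_bar = sum(y) / T
# Sample covariance and variance (the divisor cancels in the ratio).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / T
s_xx = sum((xi - x_bar) ** 2 for xi in x) / T

beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# First-order conditions of the sum of squared errors, evaluated at the estimates.
d_alpha = 2 * T * alpha_hat - 2 * sum(y) + 2 * beta_hat * sum(x)
d_beta = (
    2 * beta_hat * sum(xi ** 2 for xi in x)
    - 2 * sum(xi * yi for xi, yi in zip(x, y))
    + 2 * alpha_hat * sum(x)
)

print(alpha_hat, beta_hat, d_alpha, d_beta)
```

Both derivatives come out at zero up to floating-point rounding, which is what the minimisation requires.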
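Recall some properties of the regression model:
$$E[\varepsilon] = 0 \qquad \mathrm{Cov}[X, \varepsilon] = 0$$
We would like to verify that the sample equivalents of these statements
also hold, that is:
$$\sum_{t=1}^{T} \hat\varepsilon_t = 0 \qquad \sum_{t=1}^{T} X_t \hat\varepsilon_t = 0$$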
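Begin with the estimated regression equation:
$$Y_t = \hat\alpha + \hat\beta X_t + \hat\varepsilon_t$$
We now sum up across all observations and divide by T:
$$\frac{1}{T} \sum_{t=1}^{T} Y_t = \frac{1}{T} \sum_{t=1}^{T} \hat\alpha + \hat\beta\, \frac{1}{T} \sum_{t=1}^{T} X_t + \frac{1}{T} \sum_{t=1}^{T} \hat\varepsilon_t$$
$$\bar{Y} = \hat\alpha + \hat\beta \bar{X} + \frac{1}{T} \sum_{t=1}^{T} \hat\varepsilon_t$$
But recall that $\hat\alpha = \bar{Y} - \hat\beta \bar{X}$. It follows that:
$$\frac{1}{T} \sum_{t=1}^{T} \hat\varepsilon_t = 0 \quad \text{or} \quad \sum_{t=1}^{T} \hat\varepsilon_t = 0$$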
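To derive the second result, multiply both sides of the estimated regression
equation by $(X_t - \bar{X})$:
$$(X_t - \bar{X})\, Y_t = \hat\alpha\, (X_t - \bar{X}) + \hat\beta\, (X_t - \bar{X})\, X_t + (X_t - \bar{X})\, \hat\varepsilon_t$$
Sum up across observations and divide by T:
$$\frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})\, Y_t = \frac{1}{T} \sum_{t=1}^{T} \hat\alpha\, (X_t - \bar{X}) + \hat\beta\, \frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})\, X_t + \frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})\, \hat\varepsilon_t$$
The term containing $\hat\alpha$ is equal to zero (the deviations $X_t - \bar{X}$ sum to
zero), so we have:
$$\frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})\, Y_t = \hat\beta\, \frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})\, X_t + \frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})\, \hat\varepsilon_t$$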
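We will need some additional results that may not be immediately obvious,
specifically:
$$\frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})\, Y_t = \frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})(Y_t - \bar{Y}) = s_{XY}^2$$
$$\frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})\, X_t = \frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})^2 = s_{XX}^2$$
(Can you prove them?)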
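We therefore have:
$$s_{XY}^2 = \hat\beta\, s_{XX}^2 + \frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})\, \hat\varepsilon_t$$
But recall that $\hat\beta = s_{XY}^2 / s_{XX}^2$. Then:
$$s_{XY}^2 = s_{XY}^2 + \frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})\, \hat\varepsilon_t$$
$$0 = \frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})\, \hat\varepsilon_t$$
Rearranging the last line a bit:
$$\frac{1}{T} \sum_{t=1}^{T} X_t \hat\varepsilon_t = \frac{1}{T} \sum_{t=1}^{T} \bar{X} \hat\varepsilon_t = 0$$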
We will need these results later.
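These two sample properties are easy to confirm numerically. The sketch below fits the regression by the moment formulas on hypothetical data (made up for illustration) and checks that the estimated residuals sum to zero and are orthogonal to X.

```python
# Hypothetical data, for illustration only.
x = [0.021, -0.034, 0.012, 0.045, -0.018, 0.007, -0.052, 0.030]
y = [0.035, -0.050, 0.010, 0.081, -0.020, 0.015, -0.090, 0.042]
T = len(x)

x_bar = sum(x) / T
y_bar = sum(y) / T
beta_hat = (
    sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    / sum((xi - x_bar) ** 2 for xi in x)
)
alpha_hat = y_bar - beta_hat * x_bar

# Estimated residuals: observed value minus fitted value.
resid = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]

sum_resid = sum(resid)                                   # sample analogue of E[eps] = 0
sum_x_resid = sum(xi * ei for xi, ei in zip(x, resid))   # sample analogue of Cov[X, eps] = 0

print(sum_resid, sum_x_resid)
```

Both sums are zero up to rounding for any data set, because they are exactly the first-order conditions of the least squares problem.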
Statistical properties of the OLS estimates: we now make a stronger
assumption about the behaviour of the residuals:
$$E[\varepsilon_i \mid X] = 0$$
Recall that this implies the other two properties:
$$E[\varepsilon_i] = 0 \qquad \mathrm{Cov}[X, \varepsilon_i] = 0$$
To analyse the properties of the OLS estimates, we first treat the X as
non-random. That is, all probabilities, expectations, etc., will be
conditional on the values of X. We also derive some unconditional results
later, using the law of iterated expectations.
Recall the (true) regression equation:
$$Y = \alpha + \beta X + \varepsilon$$
Conditional on X, the only source of randomness comes from the $\varepsilon$; that
is, Y is random (again, conditional on X) because the $\varepsilon$ are random.
Taking the expected values of the OLS estimates $\hat\alpha$ and $\hat\beta$, we find:
$$E[\hat\alpha \mid X] = \alpha \qquad E[\hat\beta \mid X] = \beta$$
(Can you derive these results?)
The OLS estimates are unbiased, i.e., on average, they are correct. These
results hold under extremely general conditions; we need only assume the
existence of the means and variances of $\varepsilon$.
The above results are expectations conditional on X; but note an
important fact: the conditional expectations do not depend on X. We
can therefore apply the law of iterated expectations to find the
unconditional expectations of the OLS estimates:
$$E[\hat\alpha] = E[E[\hat\alpha \mid X]] = E[\alpha] = \alpha$$
$$E[\hat\beta] = E[E[\hat\beta \mid X]] = E[\beta] = \beta$$
The OLS estimates are therefore unconditionally unbiased as well. It
doesn't matter whether the X are deterministic or random, or what their
distribution is; the OLS estimates are still unbiased.
To find the variance of the estimates (and the covariance between them)
requires an assumption about the statistical properties of the residuals.
We have already assumed $E[\varepsilon_t \mid X] = 0$ for all t. We further assume:
$$\mathrm{Cov}[\varepsilon_s, \varepsilon_t \mid X] = \begin{cases} \sigma^2 & s = t \\ 0 & s \neq t \end{cases}$$
We say that the residual terms are uncorrelated (with each other) and
homoscedastic, meaning that each has the same variance as all the others.
(Good assumption?)
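Under these assumptions, it is possible (can you do it?) to derive the
following results:
$$\mathrm{Var}[\hat\alpha \mid X] = \frac{\sigma^2}{T} + \frac{\sigma^2 \bar{X}^2}{(T - 1)\, s_{XX}^2}$$
$$\mathrm{Var}[\hat\beta \mid X] = \frac{\sigma^2}{(T - 1)\, s_{XX}^2}$$
$$\mathrm{Cov}[\hat\alpha, \hat\beta \mid X] = -\frac{\sigma^2 \bar{X}}{(T - 1)\, s_{XX}^2}$$
(In these formulae, $s_{XX}^2$ is the sample variance computed with a divisor of
$T - 1$, so that $(T - 1)\, s_{XX}^2 = \sum_{t=1}^{T} (X_t - \bar{X})^2$.)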
Recall that the OLS estimates are unbiased, that is, on average, they are
equal to the true values. This is the case conditional on the X, and also
unconditionally. The above results show (under the assumptions of
uncorrelated errors and homoscedasticity) that as the number of
observations grows, the estimates become more and more precise. The
OLS estimates are consistent, that is, the probability of a deviation of any
given size goes to zero as the number of observations goes to infinity.
The previous results are conditional on the X. It is not so easy to derive
unconditional expressions for the variances of $\hat\alpha$ and $\hat\beta$, nor the
covariance between them. Taking $\hat\beta$ as an example, we know the relation
between the conditional and unconditional variance:
$$\mathrm{Var}[\hat\beta] = E\left[\mathrm{Var}[\hat\beta \mid X]\right] + \mathrm{Var}\left[E[\hat\beta \mid X]\right]$$
The second term on the right-hand side is rather simple. (What is it?)
The first term is difficult, because the conditional variance $\mathrm{Var}[\hat\beta \mid X]$ is a
complicated function of the X. In order to know the unconditional
variance of $\hat\beta$, we need to know something about the statistical properties
of the X, and even then, it is difficult.
We have derived many statistical properties of the OLS estimates. But
note that there is a problem: the variances and covariances of $\hat\alpha$ and $\hat\beta$
depend on $\sigma^2$, which is the variance of the error term, $\varepsilon$. We don't know
this quantity, and can only estimate it. The usual way of estimating it is:
$$s^2 = \frac{1}{T - 2} \sum_{t=1}^{T} \hat\varepsilon_t^2$$
The $T - 2$ divisor is used because there are two degrees of freedom used
in fitting the data ($\hat\alpha$ and $\hat\beta$); it makes the estimator unbiased:
$$E[s^2] = \sigma^2$$
(The above result holds, conditional on X or unconditionally.) If T were
used in place of $T - 2$, the estimate of the variance of the residuals would
be (on average) too low. (Proof?)
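A small Monte Carlo experiment can show how the two divisors behave on average. This is only a sketch under assumed values: the true coefficients, the error standard deviation, the fixed regressor values, and the number of trials are all hypothetical choices.

```python
import random

rng = random.Random(42)

# Hypothetical true model and fixed regressor values (results are conditional on X).
T = 10
alpha, beta, sigma = 0.5, 2.0, 1.0
xs = [float(t) for t in range(T)]
x_bar = sum(xs) / T
sxx = sum((x - x_bar) ** 2 for x in xs)

def sum_squared_residuals():
    """Simulate one sample, fit OLS, and return the sum of squared estimated residuals."""
    ys = [alpha + beta * x + rng.gauss(0, sigma) for x in xs]
    y_bar = sum(ys) / T
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

trials = 20_000
mean_ssr = sum(sum_squared_residuals() for _ in range(trials)) / trials

est_T_minus_2 = mean_ssr / (T - 2)  # average of s^2 with the T - 2 divisor
est_T = mean_ssr / T                # average of the estimate with a divisor of T

print(est_T_minus_2, est_T, sigma**2)
```

With the $T - 2$ divisor the average estimate is close to the true $\sigma^2$; with a divisor of T it is smaller by a factor of about $(T - 2)/T$.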
The estimate for $\sigma^2$ can be plugged into the formulae for the variances of
$\hat\alpha$ and $\hat\beta$, and the covariance between them, to estimate these quantities.
Estimates for the standard deviations of other estimates (such as regression
coefficients) are usually called standard errors, and we often denote them
by $\hat\sigma(\cdot)$. The square of the standard error is an estimate of the variance of
the coefficient, since variance is just the square of standard deviation:
$$\hat\sigma^2(\hat\alpha) = \frac{s^2}{T} + \frac{s^2 \bar{X}^2}{(T - 1)\, s_{XX}^2}$$
$$\hat\sigma^2(\hat\beta) = \frac{s^2}{(T - 1)\, s_{XX}^2}$$
$$\hat\sigma^2(\hat\alpha, \hat\beta) = -\frac{s^2 \bar{X}}{(T - 1)\, s_{XX}^2}$$
Then $\hat\sigma(\hat\alpha) = \sqrt{\hat\sigma^2(\hat\alpha)}$ and $\hat\sigma(\hat\beta) = \sqrt{\hat\sigma^2(\hat\beta)}$. Note that the standard
errors are conditional on X.
Let's try a real example. Estimate the regression equation:
$$\underbrace{(R_i - R_f)}_{\text{Y Variable}} = \alpha_i + \beta_i \underbrace{(R_M - R_f)}_{\text{X Variable}} + \varepsilon_i$$
for a particular security. We use historical data, monthly returns from July
of 1931 until September of 2009. The market return here is a broad
portfolio of US stocks; the security whose excess returns are the Y
variable is also a portfolio of US stocks.
The estimated regression equation then is:
$$\underbrace{(R_{i,t} - R_{f,t})}_{\text{Y Variable}} = \hat\alpha_i + \hat\beta_i \underbrace{(R_{M,t} - R_{f,t})}_{\text{X Variable}} + \hat\varepsilon_{i,t}$$
First, let's have a look at the data.
[Figure: Excess Return of Asset vs. Excess Market Return]
Many different software packages can estimate the regression equation.
(Trying to run a regression without a computer is not recommended.)
There are special-purpose statistics packages, general-purpose
mathematical packages (e.g., Matlab), and programming languages. It is
also possible to estimate a regression equation using a spreadsheet package,
such as Microsoft Excel or OpenOffice Calc. Using Microsoft Excel, for
example, one can calculate most of the relevant quantities using the
various worksheet functions. However, Excel also has a built-in regression
tool (it must be enabled using the Add-in manager first; it is part of the
Analysis ToolPak).
The following is the output of the Excel regression tool; it is not very
pretty, but contains a lot of information.
[Figure: Microsoft Excel Regression Output]
"Intercept" refers to $\hat\alpha$, and "X Variable 1" refers to $\hat\beta$. So the estimated
coefficients are $\hat\alpha = 0.005046$ and $\hat\beta = 1.6596$. The estimated standard
deviations (also called standard errors) are 0.002751 and 0.0509,
respectively; these are just the square roots of the estimated variances of
$\hat\alpha$ and $\hat\beta$ derived above. (Note that the standard errors are conditional on
the X.)
Here is the data again, but with the estimated regression line drawn
through it.
[Figure: CAPM Regression Line]
Do the regression results support the CAPM? (What should we look at?)
Note that the regression output includes quite a lot of additional
information: t-statistics, p-values, confidence intervals, and various
statistics about the different sources of variance (such as the $R^2$ statistic).
Let's have a look at them in turn.
For each estimated coefficient (both of them), there is a t-statistic,
calculated as follows:
$$t_{\hat\alpha} = \frac{\hat\alpha}{\hat\sigma(\hat\alpha)} \qquad t_{\hat\beta} = \frac{\hat\beta}{\hat\sigma(\hat\beta)}$$
At this point, we will make the additional assumption that the $\varepsilon$ terms are
conditionally (i.e., conditional on X) Gaussian. We have already assumed
that the $\varepsilon$ are uncorrelated with each other, and all have the same variance.
With the additional assumption of normality, it is possible to derive many
new results. For example, the estimates $\hat\alpha$ and $\hat\beta$ also have a Gaussian
distribution (conditional on X).
Under the normality assumption, and the further assumption that the true
coefficients are zero, the t-statistics have a Student's t distribution (which
is why they are called t-statistics) with $T - 2$ degrees of freedom. For
large T, the Student's t distribution approximates a standard normal. Also
for large T, the assumption of normality becomes less important; the
distribution of the t-statistics is then approximately a standard normal,
despite the non-normality of the $\varepsilon$.
The regression output also includes p-values; these are the probabilities
of observing t-statistics at least as large in magnitude as those obtained,
if the true coefficient were zero.
[Figure: Regression t-statistic for coefficient, 1,000,000 trials]
Test the following hypotheses:
1. $\alpha = 0$
2. $\beta = 0$
Use any (correct) method you like.
Do the regression results support the CAPM? Why or why not?
The t-statistics (and p-values) included in the regression output are
designed for testing the hypotheses that the corresponding coefficients are
equal to zero. These are not the only hypotheses we can contemplate.
Suppose we wish to test the hypothesis that $\beta = 1.0$: how do we do it?
We can still use t-statistics, but we must construct our own:
$$t = \frac{\hat\beta - \beta_0}{\hat\sigma(\hat\beta)} = \frac{1.6596 - 1.0}{0.0509} \approx 12.97$$
This t-statistic has a Student's t-distribution with 937 degrees of freedom
(there are 939 monthly observations); with such a large number of degrees
of freedom, it will be very close to a normal distribution.
The p-value corresponding to this t-statistic is tiny; so small, for example,
that built-in functions in Microsoft Excel just print out "0". It isn't really
0, but it is so small it is hard to calculate it accurately.
Can you reject the hypothesis that $\beta = 1.0$?
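The calculation can be sketched with the Python standard library, using the normal approximation to the t-distribution that the text says is appropriate for this sample size. The inputs are the rounded figures printed above, so the ratio comes out near 12.96 rather than the quoted 12.97 (presumably the quoted value uses unrounded estimates). The sketch also shows why a naive p-value calculation prints zero.

```python
import math
from statistics import NormalDist

# Rounded estimates quoted in the text.
beta_hat = 1.6596
se_beta = 0.0509
beta_null = 1.0  # hypothesised value of beta

t_stat = (beta_hat - beta_null) / se_beta

# Naive two-sided p-value: this far into the tail, the cdf rounds to 1 in
# double precision, so the result collapses to (essentially) zero.
p_naive = 2 * (1 - NormalDist().cdf(abs(t_stat)))

# The complementary error function keeps precision in the tail:
# 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2)).
p_tail = math.erfc(abs(t_stat) / math.sqrt(2))

print(round(t_stat, 2), p_naive, p_tail)
```

The tail-accurate p-value is positive but astronomically small, so the hypothesis is rejected at any conventional significance level.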
The regression output includes some other information that would have
allowed us to answer the above question very simply, provided we use a
cut-off of 0.05 probability of a Type I error.
Note the "Lower 95%" and "Upper 95%" entries for each coefficient.
These are the endpoints of a 95% confidence interval. A confidence
interval is simply a range of values where we think the true coefficient is
found with a given level of confidence.
Consider the statistic:
$$t = \frac{\hat\beta - \beta}{\hat\sigma(\hat\beta)}$$
Although we don't know the true value $\beta$, we nonetheless know that this
statistic has a Student's t-distribution with $T - 2$ degrees of freedom. We
can rearrange it a bit:
$$\beta = \hat\beta - t\, \hat\sigma(\hat\beta)$$
(Because the t-distribution is symmetric, the sign of t does not affect the
resulting interval.) We observed the $\hat\beta$ (the regression software kindly
printed it out for us), and we also observed $\hat\sigma(\hat\beta)$ (the same way). In what
range would we expect the true parameter to be 95% of the time?
For T = 939, the t-statistic has a t-distribution with 937 degrees of
freedom, which is very close to a standard normal distribution. A normal
distribution takes on values between $-1.96$ and $+1.96$ 95% of the time.
(Sometimes the value 1.96 is simply rounded off to 2.)
We would therefore expect that, with 95% probability, the true value $\beta$ lies
within 1.96 standard errors of the estimated value $\hat\beta$. The regression
software produced the 95% confidence interval in exactly this way; for
example, note the lower value of 1.5598. The estimate is 1.6596, and the
standard error is 0.0509. 1.96 standard errors below the estimate is (with
some rounding error) 1.5598, the lower end of the 95% confidence interval.
The confidence interval for $\alpha$ was constructed in the same way.
Confidence intervals for any level of confidence can be found in this way;
for 95% confidence, the regression software did it for us.
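The interval can be reproduced from the two quoted numbers. This sketch uses the normal approximation and the rounded estimate and standard error from the text, so the endpoints agree with the regression output only up to rounding.

```python
from statistics import NormalDist

# Rounded estimate and standard error quoted in the text.
beta_hat = 1.6596
se_beta = 0.0509

# Critical value leaving 2.5% in each tail of a standard normal (about 1.96).
z = NormalDist().inv_cdf(0.975)

lower = beta_hat - z * se_beta
upper = beta_hat + z * se_beta

# A hypothesised value outside the interval is rejected at the 5% level.
contains_one = lower <= 1.0 <= upper

print(round(z, 4), round(lower, 4), round(upper, 4), contains_one)
```

Changing `0.975` to another quantile (for example `0.995` for a 99% interval) gives confidence intervals for other levels of confidence.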
Another statistic produced by the regression package is called the
R-squared, or $R^2$, statistic. This statistic is widely known, and widely
misunderstood. The definition is:
$$R^2 = 1 - \frac{\sum_{t=1}^{T} \hat\varepsilon_t^2}{\sum_{t=1}^{T} \left(Y_t - \bar{Y}\right)^2}$$
To understand this statistic, recall the decomposition of the
(unconditional) variance of the dependent variable:
$$\mathrm{Var}[Y] = \beta^2\, \mathrm{Var}[X] + \mathrm{Var}[\varepsilon]$$
So the variance of the dependent variable is the sum of a term
proportional to the variance of the independent variable, and the variance
of the residuals.
In the context of the market model, where $Y = R_i - R_f$, and
$X = R_M - R_f$, this result has the following interpretation:
$$\underbrace{\mathrm{Var}[R_i]}_{\text{Total Risk}} = \underbrace{\beta^2\, \mathrm{Var}[R_M]}_{\text{Market or Systematic Risk}} + \underbrace{\mathrm{Var}[\varepsilon]}_{\text{Idiosyncratic Risk}}$$
The risk of any security is its systematic (or market) risk, plus its
idiosyncratic (i.e., uncorrelated with market) risk.
The above result is based on the true regression equation; we can do
something similar with the estimated regression equation.
Start with the estimated equation:
$$Y_t = \hat\alpha + \hat\beta X_t + \hat\varepsilon_t$$
Take the sample average of both sides:
$$\bar{Y} = \hat\alpha + \hat\beta \bar{X}$$
Subtract the second equation from the first:
$$Y_t - \bar{Y} = \hat\beta \left(X_t - \bar{X}\right) + \hat\varepsilon_t$$
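Square both sides:
$$\left(Y_t - \bar{Y}\right)^2 = \hat\beta^2 \left(X_t - \bar{X}\right)^2 + 2\hat\beta \left(X_t - \bar{X}\right) \hat\varepsilon_t + \hat\varepsilon_t^2$$
Sum up across all observations:
$$\begin{aligned}
\sum_{t=1}^{T} \left(Y_t - \bar{Y}\right)^2 &= \hat\beta^2 \sum_{t=1}^{T} \left(X_t - \bar{X}\right)^2 + 2\hat\beta \sum_{t=1}^{T} \left(X_t - \bar{X}\right) \hat\varepsilon_t + \sum_{t=1}^{T} \hat\varepsilon_t^2 \\
&= \hat\beta^2 \sum_{t=1}^{T} \left(X_t - \bar{X}\right)^2 + \sum_{t=1}^{T} \hat\varepsilon_t^2
\end{aligned}$$
(Why does the middle term disappear?)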
This result is analogous to the variance decomposition derived earlier. The
sum of the squared deviations of Y from its (estimated) mean value can
be expressed as a term related to the sum of the squared deviations of X
from its (estimated) mean value, and the sum of the squared residuals.
When Y is the excess return of some security, and X is the excess return
of the market, this result is:
$$\sum_{t=1}^{T} \left(R_{i,t} - \bar{R}_i\right)^2 = \hat\beta^2 \sum_{t=1}^{T} \left(R_{M,t} - \bar{R}_M\right)^2 + \sum_{t=1}^{T} \hat\varepsilon_t^2$$
(The above result assumes that the risk-free rate is constant over time. It
is only a minor adjustment to accommodate a time-varying risk-free rate.)
Recall that the definition of R-squared is one minus the ratio of the last
term on the right-hand side to the left-hand side:
\[ R^2 = 1 - \frac{\sum_{t=1}^{T} \hat{\epsilon}_t^2}{\sum_{t=1}^{T} \left( R_{i,t} - \bar{R}_i \right)^2} \]
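Neither of the terms on the right-hand side of the equation above can be
negative; this places bounds on the $R^2$ statistic.
Consider two extreme cases:
\[ \underbrace{\sum_{t=1}^{T} \left( R_{i,t} - \bar{R}_i \right)^2}_{\text{Total risk}} = \underbrace{\hat{\beta}^2 \sum_{t=1}^{T} \left( R_{M,t} - \bar{R}_M \right)^2}_{\text{Systematic component}} + \underbrace{\sum_{t=1}^{T} \hat{\epsilon}_t^2}_{\text{Idiosyncratic component}} \]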
In the first case, the systematic component is equal to zero (this occurs
when $\hat{\beta} = 0$). Then the last term on the right is equal to the left-hand
side, and the $R^2$ statistic is equal to zero. This means that the
(estimated) risk of the security is entirely idiosyncratic risk; none of the
security's risk is explained by exposure to market risk.
In the second case, the last component is equal to zero. Then there is no
idiosyncratic risk, and the risk of the security consists only of market risk.
In this case, the $R^2$ statistic is one.
In all other cases, both right-hand terms are positive, and the $R^2$ statistic
is somewhere between 0 and 1.
Intuitively, $R^2$ measures how much variation in the dependent variable, Y,
is explained by its relation to the independent variable, X. A high $R^2$
means that almost all of the variability in Y is explained by its relation to
X; if you know X, you can make very accurate predictions about Y. On
the other hand, if $R^2$ is very low, then almost none of the variability of Y
is explained by its relation to X; knowledge of X hardly helps at all in
predicting Y.
In a regression with a single X variable (i.e., every regression we have
considered so far), $R^2$ has another interpretation: it is the square of the
correlation between the dependent and independent variables. (Can you
prove this?) It does not have this interpretation when there is more than
one X variable, a case we consider later.
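The decomposition and the two interpretations of $R^2$ can be checked numerically. The following is a minimal sketch in plain Python; the function name `r_squared_decomposition` and the return series are hypothetical, chosen only for illustration.

```python
def r_squared_decomposition(x, y):
    """Fit Y = alpha + beta * X by OLS and decompose the variation in Y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    resid = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
    ssr = sum(e ** 2 for e in resid)
    return {
        "total": syy,                        # sum of squared deviations of Y
        "systematic": beta_hat ** 2 * sxx,   # beta^2 times squared deviations of X
        "idiosyncratic": ssr,                # sum of squared residuals
        "cross_term": sum((xi - x_bar) * e for xi, e in zip(x, resid)),
        "r_squared": 1.0 - ssr / syy,
        "corr_squared": sxy ** 2 / (sxx * syy),
    }

# Hypothetical monthly excess returns: the market, then one security.
market_excess = [0.010, -0.020, 0.030, 0.000, 0.020, -0.010]
asset_excess = [0.015, -0.030, 0.020, 0.005, 0.035, -0.020]

parts = r_squared_decomposition(market_excess, asset_excess)
for name, value in parts.items():
    print(name, value)
```

In this sketch the total equals the systematic plus the idiosyncratic component up to rounding error, the cross term is numerically zero, and the $R^2$ agrees with the squared sample correlation.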
As with all other quantities produced by regression, the $R^2$ is an estimated
statistic:
\[ R^2 = 1 - \frac{\mathrm{Var}[\epsilon]}{\mathrm{Var}[Y]} \]
The $R^2$ is based on estimates of the above quantities; however, the more
observations we have, the more accurately we will be able to estimate
them, and the closer the estimated $R^2$ statistic will be to its true value.
It is a common misconception that a regression with a high $R^2$ is a
good regression, because most of the variation in the Y variable is
explained, and a regression with a low $R^2$ is a bad regression, because
most of the variation in Y remains unexplained. A good regression is one
that gives the information you need. If you are evaluating the CAPM, for
example, does the $R^2$ matter?
The CAPM is a theory that predicts that security returns obey:
\[ E[R_i] = R_f + \beta_i \left( E[R_M] - R_f \right) \]
The only reason a security ever returns something other than the risk-free
rate, on average, is that it has exposure to market risk. If it has exposure
to idiosyncratic risk, it will deviate from its expected return, but the
expected return itself will not be higher or lower because of the exposure
to idiosyncratic risk.
Linear regression is an ideal tool for evaluating the CAPM. We can
estimate the following regression equation using historical data:
\[ R_{i,t} - R_f = \hat{\alpha} + \hat{\beta} \left( R_{M,t} - R_f \right) + \hat{\epsilon}_{i,t} \]
The prediction of the CAPM is that $\alpha = 0$.
The estimated value $\hat{\alpha}$ will hardly ever be exactly zero, even if the CAPM
is true, because $\hat{\alpha}$ is just an estimate of $\alpha$. Because of the randomness of
the data, the estimated value will not be exactly equal to the true value.
We can therefore perform a statistical test.
To do so, we advance the hypothesis that the CAPM is true, and $\alpha = 0$.
Under this hypothesis, the test statistic
\[ t = \frac{\hat{\alpha}}{\sigma(\hat{\alpha})} \]
has a Student's t distribution with $T - 2$ degrees of freedom (where T is
the number of time periods for which we have historical data). We can use
this fact to calculate the p-value of the t-statistic, and use this to reject,
or fail to reject, the hypothesis at the desired level of confidence.
If we are content with a 0.05 probability of making a Type I error
(rejecting the hypothesis when it is in fact true), then our decision is,
reject the hypothesis when the p-value is less than 0.05, but fail to reject
the hypothesis if the p-value is greater than 0.05.
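As a rough illustration of the single-asset test, the sketch below computes the estimated alpha, its standard error, and the t-statistic in plain Python. The function name `alpha_t_test` and the four data points are hypothetical. The sketch stops at the t-statistic; the p-value would have to come from a table or a statistics package, since the Student's t distribution function is not part of the Python standard library.

```python
import math

def alpha_t_test(x, y):
    """Return (alpha_hat, standard error of alpha_hat, t-statistic)."""
    t_obs = len(x)
    x_bar = sum(x) / t_obs
    y_bar = sum(y) / t_obs
    sxx = sum((xi - x_bar) ** 2 for xi in x)   # equals (T - 1) * s_XX^2
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    resid = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
    # Residual variance with T - 2 in the denominator (two estimated parameters).
    s2 = sum(e ** 2 for e in resid) / (t_obs - 2)
    se_alpha = math.sqrt(s2 * (1.0 / t_obs + x_bar ** 2 / sxx))
    return alpha_hat, se_alpha, alpha_hat / se_alpha

# Hypothetical data, chosen so the arithmetic is easy to follow by hand.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 7.0]
alpha_hat, se_alpha, t_stat = alpha_t_test(x, y)
print(alpha_hat, se_alpha, t_stat)
```

With real data, x would be the market excess return and y the excess return of the security; the t-statistic is then compared with the Student's t distribution with $T - 2$ degrees of freedom.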
Philosophical note: if we fail to reject the hypothesis, that does not
necessarily mean the hypothesis (in this case, the CAPM) is true. One can
draw an analogy to criminal trials. In many jurisdictions (although not in
all) the prosecution must provide a very strong proof of the guilt of the
accused party in order to secure a conviction. If the guilt of the accused is
not proven to a high enough standard, the accused is found not guilty.
Not guilty is not the same as innocent. It is possible that the accused is
innocent; however, it is also possible that the accused committed the
crime, but the prosecution did not provide a strong enough case to find
the accused guilty (Type II error).
Failing to reject the hypothesis (in this example, the CAPM) is like finding
the accused not guilty. The case that the CAPM is false has not been
proven. That may be because the CAPM is true; or it may be because the
CAPM is false, but we don't have enough evidence to prove this with
sufficient confidence. If so, we have committed a Type II error (failing to
reject a hypothesis that is false), but there is no way around this; the
only way we can reduce the probability of a Type II error is to increase the
probability of a Type I error (rejecting a hypothesis that is true).
What assumptions did we have to make?
The key assumptions are:
1. The error terms (the idiosyncratic component of the security's risk)
are uncorrelated with each other.
2. The error terms are homoscedastic (i.e., each one has the same
variance).
3. The error terms have, conditional on the market return $R_M$, a
normal or Gaussian distribution.
The last assumption (conditional normality of the $\epsilon$) is not particularly
important if we have a lot of data; the central limit theorem then does its
work. There are techniques for dealing (with greater or lesser degrees of
success) with violations of the first two assumptions, and we will examine
some of these later.
Testing the CAPM Multivariate CAPM Tests
The big problem with what we have just done is, the CAPM is not a
prediction about the returns of one security; it is a prediction about the
returns of all of them. We should test the returns of many securities to see
if they conform to the predictions of the CAPM, not just one.
We have already seen how to construct such a test using a very simple
model for expected returns. We would now like to extend this test to
models like the CAPM. The procedure is quite similar.
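For each security, the returns conform to:
\[ R_{i,t} - R_{f,t} = \alpha_i + \beta_i \left( R_{M,t} - R_{f,t} \right) + \epsilon_{i,t} \]
Arranging the $\epsilon_{i,t}$ for different values of $i$ into a vector:
\[ \epsilon_t = \begin{bmatrix} \epsilon_{1,t} \\ \vdots \\ \epsilon_{N,t} \end{bmatrix} \]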
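The mean vector and covariance matrix of the residuals are then:
\[ E[\epsilon_t \mid X] = \begin{bmatrix} E[\epsilon_{1,t} \mid X] \\ \vdots \\ E[\epsilon_{N,t} \mid X] \end{bmatrix} \]
\[ \mathrm{Var}[\epsilon_t \mid X] = \begin{bmatrix} \mathrm{Var}[\epsilon_{1,t} \mid X] & \cdots & \mathrm{Cov}[\epsilon_{1,t}, \epsilon_{N,t} \mid X] \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}[\epsilon_{N,t}, \epsilon_{1,t} \mid X] & \cdots & \mathrm{Var}[\epsilon_{N,t} \mid X] \end{bmatrix} \]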
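So far, no assumptions; everything is mechanical. We now assume:
\[ E[\epsilon_t \mid X] = 0_{N \times 1} \quad \text{for all } t \]
\[ \mathrm{Var}[\epsilon_t \mid X] = \begin{bmatrix} \sigma^2_{1,1} & \cdots & \sigma^2_{1,N} \\ \vdots & \ddots & \vdots \\ \sigma^2_{N,1} & \cdots & \sigma^2_{N,N} \end{bmatrix} \quad \text{for all } t \]
\[ \mathrm{Cov}[\epsilon_s, \epsilon_t \mid X] = 0_{N \times N} \quad \text{for all } s \neq t \]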
(What are the unconditional implications of these assumptions?)
The assumptions state that (all conditional on X) the residuals always
have expected value of zero, and residuals from different time periods are
uncorrelated with each other. The residual for any one of the particular
assets always has the same variance (across time periods), and the
covariance between the residuals for two particular assets always has the
same value.
Using the law of iterated expectations, the result on the relation between
conditional and unconditional variance, and a similar result for covariance,
it is possible to show that the exactly analogous statements also hold
unconditionally.
We can make still stronger assumptions; specifically, that $\epsilon_s$ and $\epsilon_t$ are
independent for $s \neq t$, and that each $\epsilon_t$ has a multivariate normal
distribution:
\[ f_{\epsilon_t}(x) = \frac{1}{(2\pi)^{N/2} \, |\Sigma|^{1/2}} \exp\left( -\frac{x^T \Sigma^{-1} x}{2} \right) \]
where $|\Sigma|$ denotes the determinant of the covariance matrix $\Sigma$. Note that
x is an N-element vector here. Also recall that the mean of the $\epsilon_t$ is zero,
and the covariance matrix is $\Sigma$.
An important property of the multivariate normal (or Gaussian)
distribution is that each element of $\epsilon_t$, considered individually, has a
marginal distribution which is normal (or Gaussian).
We assume that the residuals have this distribution conditional on the X,
but they have the same distribution unconditionally.
Note what we have done here: we have standard OLS regression
assumptions for each asset, considered individually.
However, we have also specified the covariances between contemporaneous
error (residual) terms for different assets at the same time period.
The $\hat{\alpha}_i$ and $\hat{\beta}_i$ coefficients, standard errors, and t-statistics may all be
estimated by running a standard OLS regression for each asset. The
estimated coefficients have all the usual properties (which we derived
earlier).
However, with our additional assumptions, we can derive some of the joint
properties of the coefficients estimated in different regressions.
To develop our CAPM test, we need to know how the $\hat{\alpha}_i$ coefficients from
different regressions are related. The values of the $\hat{\beta}_i$ coefficients are
irrelevant for purposes of testing the CAPM, as are other statistics that
are produced by regression (e.g., the $R^2$ statistics).
We therefore need the joint distribution of $\hat{\alpha}_1, \ldots, \hat{\alpha}_N$, although for
completeness, we find the joint distribution of all the coefficients, including
the $\hat{\beta}_1, \ldots, \hat{\beta}_N$.
Under the assumption of multivariate normality of the $\epsilon_{i,t}$, the joint
distribution of the $\hat{\alpha}_i$ (conditional on X) is also multivariate normal.
The $\hat{\alpha}_i$ coefficients are just complicated linear combinations of the $\epsilon_{i,t}$,
which have a multivariate normal distribution; linear combinations of
multivariate normal random variables are themselves multivariate normal
(can you prove it?). Furthermore, a multivariate normal distribution is
completely characterised by its mean vector and covariance matrix.
If we can find the means, variances, and covariances of the $\hat{\alpha}_i$, we know
their full distribution, conditional on X. (Note: we already know the
means and variances, so we only need the covariances.)
Recall that:
\[ \hat{\alpha}_i = \bar{Y}_i - \hat{\beta}_i \bar{X} \qquad \hat{\beta}_i = \frac{s^2_{XY_i}}{s^2_{XX}} \]
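Then after a fair amount of algebra, we find results that are only slightly
different than those we found when considering only a single regression:
\[ \mathrm{Cov}[\hat{\alpha}_i, \hat{\alpha}_j \mid X] = \frac{\sigma^2_{ij}}{T} + \frac{\sigma^2_{ij} \bar{X}^2}{(T-1) s^2_{XX}} \]
\[ \mathrm{Cov}[\hat{\beta}_i, \hat{\beta}_j \mid X] = \frac{\sigma^2_{ij}}{(T-1) s^2_{XX}} \]
\[ \mathrm{Cov}[\hat{\alpha}_i, \hat{\beta}_j \mid X] = -\frac{\sigma^2_{ij} \bar{X}}{(T-1) s^2_{XX}} \]
Are these results consistent with what was found earlier, for a single
regression?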
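We can start to put things into vectors and matrices:
\[ \alpha = \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix} \qquad \hat{\alpha} = \begin{bmatrix} \hat{\alpha}_1 \\ \vdots \\ \hat{\alpha}_N \end{bmatrix} \]
\[ \Sigma = \mathrm{Var}[\epsilon_t \mid X] = \begin{bmatrix} \sigma^2_{1,1} & \cdots & \sigma^2_{1,N} \\ \vdots & \ddots & \vdots \\ \sigma^2_{N,1} & \cdots & \sigma^2_{N,N} \end{bmatrix} \]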
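Then note that:
\[ \mathrm{Var}[\hat{\alpha} \mid X] = \left( \frac{1}{T} + \frac{\bar{X}^2}{(T-1) s^2_{XX}} \right) \Sigma \]
The prediction of the CAPM is that
\[ \alpha = 0_{N \times 1} \]
that is, the $\alpha$ of every asset is equal to zero.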
We can think about a test statistic of the form:
\[ \chi^2 = \frac{\hat{\alpha}^T \Sigma^{-1} \hat{\alpha}}{\frac{1}{T} + \frac{\bar{X}^2}{(T-1) s^2_{XX}}} \]
The above statistic is equal to zero if $\hat{\alpha} = 0_{N \times 1}$, and is greater than zero
otherwise; covariance matrices have the properties:
\[ x^T \Sigma^{-1} x \geq 0 \quad \text{for all } x \]
\[ x^T \Sigma^{-1} x > 0 \quad \text{for all } x \neq 0 \]
So the test statistic will be small if the estimated $\hat{\alpha}$ are small (i.e., close to
zero), and large if some of the $\hat{\alpha}$ are large. So this seems like a reasonable
test statistic. Under our distributional assumptions (including multivariate
normality), the above statistic has a chi-square distribution with N degrees
of freedom.
The problem with this test statistic, as with several others we have
considered, is that it is unimplementable; we don't know the value of $\Sigma$.
But we can estimate it instead.
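We already know how to estimate $\mathrm{Var}[\epsilon_i]$:
\[ s^2_i = \frac{1}{T-2} \sum_{t=1}^{T} \hat{\epsilon}_{i,t}^2 \]
Recall that this estimate of $\mathrm{Var}[\epsilon_i]$ is unbiased:
\[ E\left[ s^2_i \right] = \mathrm{Var}[\epsilon_i] \]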
The $T - 2$ is used instead of T, because there are two estimated
parameters in the regression ($\hat{\alpha}$ and $\hat{\beta}$); this causes the estimated residuals
to be slightly smaller than the actual residuals. If we used T, we would
slightly underestimate the variance of the residuals.
Under the assumption of normality, the estimated variance of the residuals
has a chi-square distribution (after appropriate scaling).
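We can estimate the covariance between the residuals for two different
regressions in an analogous way:
\[ s^2_{ij} = \frac{1}{T-2} \sum_{t=1}^{T} \hat{\epsilon}_{i,t} \hat{\epsilon}_{j,t} \]
This estimate is also unbiased:
\[ E\left[ s^2_{ij} \right] = \mathrm{Cov}[\epsilon_i, \epsilon_j] \]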
We can arrange all the estimates in a matrix, which is the sample
counterpart to $\Sigma$:
\[ \hat{\Sigma} = \begin{bmatrix} s^2_{11} & \cdots & s^2_{1N} \\ \vdots & \ddots & \vdots \\ s^2_{N1} & \cdots & s^2_{NN} \end{bmatrix} \]
(Side note: under the assumption of multivariate normality, the entire
matrix $\hat{\Sigma}$ has a Wishart distribution.)
We can use the estimated $\hat{\Sigma}$ in place of the actual (but unknown) $\Sigma$
matrix to form our test statistic.
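The estimate of the residual covariance matrix can be sketched in a few lines of plain Python. The function name `residual_covariance` and the residual values are hypothetical; the only ingredient taken from the discussion above is the $T - 2$ divisor applied to sums of products of estimated residuals.

```python
def residual_covariance(residuals):
    """residuals[i][t] is the estimated residual of asset i in period t.

    Returns the N x N matrix whose (i, j) entry is the sum over t of
    residuals[i][t] * residuals[j][t], divided by T - 2.
    """
    n_assets = len(residuals)
    t_obs = len(residuals[0])
    scale = 1.0 / (t_obs - 2)
    return [
        [scale * sum(ei * ej for ei, ej in zip(residuals[i], residuals[j]))
         for j in range(n_assets)]
        for i in range(n_assets)
    ]

# Hypothetical residuals for two assets over four periods.
resid = [
    [-0.1, 0.3, -0.3, 0.1],
    [0.2, -0.1, -0.3, 0.2],
]
sigma_hat = residual_covariance(resid)
for row in sigma_hat:
    print(row)
```

The diagonal entries reproduce the usual residual variance estimates from the individual regressions; the off-diagonal entries are the part a standard single-regression package does not report.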
The new test statistic is therefore:
\[ t^2 = \frac{\hat{\alpha}^T \hat{\Sigma}^{-1} \hat{\alpha}}{\frac{1}{T} + \frac{\bar{X}^2}{(T-1) s^2_{XX}}} \]
The distribution of the $t^2$ statistic is called, oddly enough, Hotelling's
t-square distribution. It is closely related to the F distribution. Specifically:
\[ F = \frac{T - N - 1}{N (T - 2)} \, t^2 \]
has an F distribution with $N$ and $T - N - 1$ degrees of freedom.
[Figure: Regression F-statistic, Three Assets, 1,000,000 Trials]
We now have everything we need to test the CAPM. The procedure is:
1. Run a regression of the excess returns of an asset, $R_{i,t} - R_{f,t}$, on the
excess return of the market, $R_{M,t} - R_{f,t}$, to find the estimated
coefficients, $\hat{\alpha}_i$, $\hat{\beta}_i$, and residuals $\hat{\epsilon}_{i,t}$. Run such a regression for each
asset.
2. Use the estimated residuals, the $\hat{\epsilon}_{i,t}$, to estimate the variance of each
residual term $\mathrm{Var}[\epsilon_i]$, and the covariance for each pair of residual
terms $\mathrm{Cov}[\epsilon_i, \epsilon_j]$.
3. Arrange the estimated $\hat{\alpha}_i$ in an $N \times 1$ vector, and the estimated $s^2_{ij}$
in an $N \times N$ matrix.
4. Calculate the sample mean and sample variance of $R_{M,t} - R_{f,t}$. Since
this is the X variable in the regression, we will call these quantities
$\bar{X}$ and $s^2_{XX}$.
5. Calculate the test statistic:
\[ F = \frac{T - N - 1}{N (T - 2)} \cdot \frac{\hat{\alpha}^T \hat{\Sigma}^{-1} \hat{\alpha}}{\frac{1}{T} + \frac{\bar{X}^2}{(T-1) s^2_{XX}}} \]
6. Choose a confidence level (or equivalently, the probability of a Type I
error).
7. Compare the F statistic from Step 5 to the cut-off value for an F
distribution with $N$ and $T - N - 1$ degrees of freedom, for the
confidence level chosen in Step 6. Reject (i.e., conclude the CAPM is
false) if the F statistic is larger than the cut-off value, and fail to
reject if the F statistic is smaller than the cut-off value.
Let's try it with three assets. One of them is the asset we used in the
earlier example.
[Figure: Excess Return of First Asset vs. Market Excess Return]
[Figure: Excess Return of Second Asset vs. Market Excess Return]
[Figure: Excess Return of Third Asset vs. Market Excess Return]
[Figure: Microsoft Excel Regression Output for First Asset]
[Figure: Microsoft Excel Regression Output for Second Asset]
[Figure: Microsoft Excel Regression Output for Third Asset]
[Figure: Regression Line for First Asset]
[Figure: Regression Line for Second Asset]
[Figure: Regression Line for Third Asset]
The regression package has calculated standard errors, etc., for each
regression individually. One thing it did not produce is estimates of the
covariance of the residuals for different assets. However, these can be
found with a bit of spreadsheet work:
\[ \hat{\Sigma} = \begin{bmatrix} 0.007010 & 0.000523 & 0.000840 \\ 0.000523 & 0.000649 & 0.000361 \\ 0.000840 & 0.000361 & 0.001841 \end{bmatrix} \]
The diagonal elements are estimates of the variances of the residuals for
the three regressions, respectively; their square roots (the standard errors
of the regressions) may be found in the regression output. The off-diagonal
elements are not to be found anywhere.
We also need the vector $\hat{\alpha}$:
\[ \hat{\alpha} = \begin{bmatrix} -0.005047 \\ \phantom{-}0.002572 \\ \phantom{-}0.001932 \end{bmatrix} \]
These values were kindly produced by the regression package.
We also need $\bar{X}$ and $s^2_{XX}$, but these do not seem to have been reported in
the regression output. Some quick spreadsheet work will give us the
answer:
\[ \bar{X} = 0.006306 \qquad s^2_{XX} = 0.002890 \]
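We are now ready to calculate the test statistic:
\[
\begin{aligned}
F &= \frac{T - N - 1}{N (T - 2)} \cdot \frac{\hat{\alpha}^T \hat{\Sigma}^{-1} \hat{\alpha}}{\frac{1}{T} + \frac{\bar{X}^2}{(T-1) s^2_{XX}}} \\
&= \frac{939 - 3 - 1}{3 (939 - 2)} \cdot \frac{ \begin{bmatrix} -0.005047 \\ \phantom{-}0.002572 \\ \phantom{-}0.001932 \end{bmatrix}^T \begin{bmatrix} 0.007010 & 0.000523 & 0.000840 \\ 0.000523 & 0.000649 & 0.000361 \\ 0.000840 & 0.000361 & 0.001841 \end{bmatrix}^{-1} \begin{bmatrix} -0.005047 \\ \phantom{-}0.002572 \\ \phantom{-}0.001932 \end{bmatrix} }{ \frac{1}{939} + \frac{0.006306^2}{(939 - 1) \, 0.002890} } \\
&= 5.7438
\end{aligned}
\]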
The last number is the test statistic, which has an F distribution with 3
and 935 degrees of freedom.
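The arithmetic of this example can be reproduced with a short plain-Python sketch. The helper names `solve_linear` and `grs_f_statistic` are hypothetical; the inputs are the rounded values quoted in the text, so the result agrees with 5.7438 only up to rounding of those inputs.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def grs_f_statistic(alpha_hat, sigma_hat, t_obs, x_bar, s2_xx):
    """F statistic with N and T - N - 1 degrees of freedom."""
    n_assets = len(alpha_hat)
    weights = solve_linear(sigma_hat, alpha_hat)         # Sigma^{-1} alpha
    quad = sum(a * w for a, w in zip(alpha_hat, weights))
    scale = 1.0 / t_obs + x_bar ** 2 / ((t_obs - 1) * s2_xx)
    return (t_obs - n_assets - 1) / (n_assets * (t_obs - 2)) * quad / scale

# Rounded inputs quoted in the worked example.
alpha_hat = [-0.005047, 0.002572, 0.001932]
sigma_hat = [
    [0.007010, 0.000523, 0.000840],
    [0.000523, 0.000649, 0.000361],
    [0.000840, 0.000361, 0.001841],
]
f_stat = grs_f_statistic(alpha_hat, sigma_hat, t_obs=939,
                         x_bar=0.006306, s2_xx=0.002890)
print(round(f_stat, 2))
```

Solving the linear system avoids forming the inverse of the covariance matrix explicitly, which is a common numerical habit rather than anything specific to this test. The p-value is not computed here, because the F distribution function is not available in the Python standard library.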
If we are using a 95% confidence level, we reject when the p-value is
below 0.05. Here the p-value is much, much smaller: specifically, it is
0.000679. We have thus found highly significant (with more than 99.93%
confidence) evidence against the CAPM.
Based on the evidence offered by the three assets, we can therefore reject
the CAPM with a very high degree of confidence. The three assets do not
conform to the predictions of the CAPM.
A few practical matters.
Non-normality: generally not too much of a problem. Our derivation of
the distribution of the test statistic assumed the $\epsilon_t$ had a multivariate
normal distribution. The estimated $\hat{\alpha}_i$ then also have a normal
distribution. The $\hat{\alpha}$ are not just sample averages, but the central limit
theorem is general enough to be applied to this situation also: under
relatively mild technical assumptions, the $\hat{\alpha}$ have approximately a normal
distribution when T is large.
Furthermore, for large T, the error in the estimation of the $\alpha$ is more
important than the error in the estimation of the variances and covariances
of the residuals. For very large T, the test statistic approaches a scaled
chi-square distribution (which is the limiting distribution of the F
distribution as the second degrees of freedom parameter grows to infinity),
even if the residuals are not normal.
Heteroscedasticity: we have assumed that each $\epsilon_t$ has the same variance.
If the variance of $\epsilon_t$ changes over time, then our estimates of the standard
errors of the $\hat{\alpha}$ are not necessarily accurate, even for very large values of
T. There are, however, relatively robust methods for dealing with various
forms of heteroscedasticity.
Autocorrelation: we have also assumed that $\epsilon_s$ is uncorrelated with $\epsilon_t$ if
$s \neq t$. Statistically, there are methods of dealing with violations of this
assumption, although they are generally not as robust as the methods for
dealing with heteroscedasticity. Furthermore, if the $\epsilon_t$ are autocorrelated,
we can only interpret the results as a test of the unconditional CAPM.
Time-varying coefficients: if the relation between the individual assets
and the market portfolio is changing, we are still testing an unconditional
version of the CAPM.
Too many assets: our method of testing requires that the number of
assets be smaller than the number of time periods over which they are
observed. In many common data sets, this is not the case; for example, in
the US stock market, there are thousands of stocks, but we generally do
not have such a long history of their returns.
One method for dealing with the last two problems is that of forming
portfolios. If assets are grouped into portfolios based on their
characteristics (e.g., small firm stocks in one portfolio, large firm stocks in
another), then we might expect that the characteristics of the portfolios are
relatively stable over time. An individual asset may drift from portfolio to
portfolio, as its nature changes, but the portfolios themselves have
relatively constant properties.
Furthermore, as long as the portfolios are formed using information
available at the time of portfolio formation (if we do not look into the
future when deciding which portfolio an individual asset goes to), then
the testing method we have developed is not affected by the portfolio
formation step.
Testing the CAPM Two-pass Regression
The testing method we have described was developed by Gibbons, Ross,
and Shanken (1989). It is a test based on the prediction of the
CAPM: the model predicts that the $\alpha$ coefficient of each asset is zero,
and that is what the procedure tests.
An extremely common alternative procedure used in the finance literature
is the so-called two-pass regression methodology. Few procedures in the
history of financial or economic thought have been used to reach more
false conclusions than this one.
The two-pass regression methodology is based on the CAPM equation for
expected returns:
\[ E[R_i] - R_f = \beta_i \left( E[R_M] - R_f \right) \]
We have already run regression tests, in which each observation of the
dependent variable is the excess return of some security in a particular
time period. However, the idea behind the two-pass regression
methodology is to run a regression in which each observation of the
dependent variable is the expected excess return of a particular security,
estimated over the entire time series of available data.
Considering the second pass first, the regression is as follows:
\[ \underbrace{E[R_i] - R_f}_{\text{Dependent variable}} = \gamma_0 + \gamma_1 \underbrace{\hat{\beta}_i}_{\text{Independent variable}} + \underbrace{\eta_i}_{\text{Error term}} \]
There is one observation for each security in the study, observations of the
dependent variable are the expected excess returns of the securities, and
observations of the independent variable are the beta coefficients of the
same securities.
Neither the expected returns nor the beta coefficients of the securities are
observed directly. The expected returns can be estimated with the sample
means; the beta coefficients are estimated with a first-pass regression.
The two-pass regression is therefore as follows. For concreteness, let us
suppose data on monthly returns of 25 assets are available for a 30 year
period.
1. Estimate the expected returns of the securities in the study by
calculating their sample means, i.e., for each asset, add up all the 360
monthly returns, and divide the sum by 360.
2. Estimate the beta coefficients of each of the assets with the following
regression:
\[ R_{i,t} - R_{f,t} = \alpha_i + \beta_i \left( R_{M,t} - R_{f,t} \right) + \epsilon_{i,t} \]
There are 25 regressions to be run (one for each asset), and each
regression has 360 observations.
3. Run a single regression:
\[ E[R_i] - R_f = \gamma_0 + \gamma_1 \beta_i + \eta_i \]
There are 25 observations, one for each asset. Observations of the
dependent variable are the expected (excess) returns from Step 1, and
observations of the independent variable are the beta coefficients
from the regressions run in Step 2.
Note that the first-pass regressions are performed solely to calculate the
beta coefficients. The alpha coefficients (which are the ones relevant for
testing the CAPM) are simply discarded.
There are many problems with this procedure, although this does not stop
it from being used constantly in the finance literature to prove extremely
dubious or patently false facts.
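To make the mechanics concrete, the three steps can be sketched in plain Python. All names (`ols`, `two_pass`) and the tiny data set are hypothetical. The data are constructed so that every asset satisfies the CAPM exactly with no noise, which is the one situation in which the second pass returns exactly what the theory predicts; the sketch illustrates the procedure and is not an endorsement of it.

```python
def mean(values):
    return sum(values) / len(values)

def ols(x, y):
    """Return (intercept, slope) of a one-regressor OLS fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def two_pass(asset_excess, market_excess):
    # Step 1: sample mean excess return of each asset.
    mean_excess = [mean(r) for r in asset_excess]
    # Step 2: first-pass time-series regressions; only the betas are kept,
    # and the alphas are thrown away.
    betas = [ols(market_excess, r)[1] for r in asset_excess]
    # Step 3: one cross-sectional regression of mean excess returns on betas.
    gamma0, gamma1 = ols(betas, mean_excess)
    return gamma0, gamma1, betas

# Hypothetical market excess returns and three noiseless CAPM assets.
market_excess = [0.01, -0.02, 0.03, 0.00]
true_betas = [0.5, 1.0, 1.5]
asset_excess = [[b * m for m in market_excess] for b in true_betas]

gamma0, gamma1, betas = two_pass(asset_excess, market_excess)
print(gamma0, gamma1, betas)
```

Here the intercept is numerically zero and the slope equals the sample mean of the market excess return (0.005). Nothing in the output measures how precisely the inputs to the second pass were estimated.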
The first issue is that the independent variable in the second-pass
regression is not the true beta coefficient, but an estimate produced by the
first-pass regression. When the X variable in a regression is observed with
error, this gives rise to a problem known as errors-in-variables. Many of
the results we have derived for the statistical properties of the regression
estimates do not hold when there is an errors-in-variables problem; in
particular, the coefficients from the second-pass regression are biased.
This problem has been dealt with by Shanken (1992), although Shanken's
correction for the errors-in-variables problem is often ignored in practice.
The second problem is that there is almost nothing produced by the
second-pass regression that is useful in answering any of the questions we
might be inclined to ask. For example, what would you look at in the
output of the second-pass regression to decide whether the CAPM is true
or not?
The prediction of the CAPM is that, for every security:
\[ E[R_i] - R_f = \beta_i \left( E[R_M] - R_f \right) \]
The regression equation is:
\[ \underbrace{E[R_i] - R_f}_{\text{Y variable}} = \gamma_0 + \gamma_1 \underbrace{\hat{\beta}_i}_{\text{X variable}} + \underbrace{\eta_i}_{\text{Error term}} \]
It follows that, if the CAPM is true, then $\gamma_0 = 0$ and $\gamma_1 = E[R_M] - R_f$. It
also follows that $\eta_i = 0$ for every security.
None of these conditions are likely to hold in any given data sample, even
if the CAPM were true. However, there is no simple statistical test to
apply to the results of the two-pass regression to determine whether
deviations from the CAPM are due to luck (sampling variation), or
whether they are due to the CAPM being incorrect.
It is certainly quite possible for the predictions of the CAPM for $\gamma_0$ and $\gamma_1$
to hold closely (or even perfectly), even if the CAPM is seriously violated
in the data sample. The following graph shows the expected returns and
beta coefficients of fifty (hypothetical) securities, and a regression line.
[Figure: Expected Returns and Beta Coefficients]
The $\gamma_0$ and $\gamma_1$ coefficients are estimated at 0 and 8%, respectively. If the
market risk premium were to be 8%, then these values would be exactly
what the CAPM predicts. Does this mean the data support the CAPM?
The prediction of the CAPM is that every security falls on the line. Even if
the CAPM were true, though, securities would not fall exactly on the line,
because expected returns and beta coefficients are estimated with error.
How can we tell, from the cross-sectional regression, whether the
deviations from the predictions of the CAPM are due to chance, or
because the theory is just wrong? We can't. To answer this question, we
would need to know how precisely the expected returns and beta
coefficients are estimated, and any relevant information on these points
from the first-pass regression was discarded.
The $\hat{\alpha}$ coefficients, which contain useful information about the deviation of
the data from the predictions of the CAPM, were also discarded; it is a
truly bizarre testing procedure that calculates from the first-pass regression
the information that is exactly what we need to test our theory, and then
promptly throws this information into the rubbish bin.
Perhaps one is interested in estimating the risk premium associated with
the market portfolio, rather than in testing the CAPM. One can already
estimate the market risk premium by finding the sample average of
$R_M - R_f$.
Under what circumstances is it better to estimate the market risk premium
with $\hat{\gamma}_1$ instead of the sample average of the returns of the market
portfolio? See Hou and Kimmel (2010), who argue that it is better to use
$\hat{\gamma}_1$ under no circumstances at all.
The two-pass regression methodology is not useful in testing the
predictions of a model such as the CAPM.
The two-pass regression methodology is not useful in estimating the risk
premia of the market portfolio (or other risk factors that carry a risk
premium).
What is the two-pass regression methodology good for? After many years
of pondering this question, the instructor is able to think of only one thing:
if you are interested in proving results that are not true, it is the ideal
procedure.
Despite its evident lack of utility, the two-pass regression methodology has
been used widely for many years in the finance literature. This frequent
use of the procedure has not caused it to become more sensible than it
was years ago.
Just say no to the two-pass regression procedure.
Testing Multifactor Models
Testing Multifactor Models Single-asset Tests
We will be interested in testing more complicated models than the CAPM.
For example, the well-known Fama-French three-factor model for security
returns is:
\[ E[R_i] = R_f + b_i \, \mathrm{RMRF} + s_i \, \mathrm{SMB} + h_i \, \mathrm{HML} + \epsilon_i \]
where RMRF, SMB, and HML are the (excess) returns of long-short
portfolios, i.e., portfolios that contain a long position in some assets, and
an equally sized short position in other assets, so that the portfolio has a
net cost of zero (more on these three portfolios later). It is assumed that,
for every security:
\[ E[\epsilon_i] = 0 \qquad \mathrm{Cov}[\epsilon_i, \mathrm{RMRF}] = 0 \qquad \mathrm{Cov}[\epsilon_i, \mathrm{SMB}] = 0 \qquad \mathrm{Cov}[\epsilon_i, \mathrm{HML}] = 0 \]
This model was developed in the 1990s to overcome empirical difficulties
of the CAPM; it is motivated entirely by empirical findings, with no
underlying theoretical rationale.
More generally, we will be interested in models of the form:
\[ E[R_i] = R_f + \beta_{i,1} F_1 + \ldots + \beta_{i,N} F_N + \epsilon_i \]
where the $F_1, \ldots, F_N$ are risk factors thought to influence expected returns
of securities. The factors may be the excess returns of portfolios, or they
may simply be some macroeconomic variables.
Arbitrage Pricing Theory, also known as APT, is a model for expected
returns that takes this form.
So far, we have focused exclusively on models of expected returns
expressed in terms of risk factors, and will continue to do so for a little
while. However, there are many other questions one could address in
financial economics, other than what are the determinants of expected
returns. Just a few examples are:
1. What determines the volatility of securities?
2. What determines trading volume?
3. What determines corporate funding/investment decisions?
4. What factors govern firm profitability?
A tool that is commonly used to address all of these questions is multiple
regression, which is linear regression with multiple X variables.
When considering models of expected returns, there are very sound
theoretical reasons to think we ought to be looking at a linear relationship
between the expected returns and risk factors.
For other questions (e.g., what determines corporate profitability), it is not
so obvious that a linear model is the correct one.
Nonetheless, linear regression is used extremely commonly in the finance
literature to address all sorts of questions. In some cases, the use of a
linear model is entirely appropriate; in other cases, it may be a force fitting
of a familiar and comfortable tool into a place where it doesn't belong.
The general multiple regression model is of the form:
\[ Y = \alpha + \beta_1 X_1 + \ldots + \beta_N X_N + \epsilon \]
with the minimal identifying assumptions:
\[ E[\epsilon] = 0 \qquad \mathrm{Cov}[\epsilon, X_i] = 0 \qquad 1 \leq i \leq N \]
As in the single-variable regression case, these assumptions are not
restrictive, in the sense that it is impossible for the data to violate them.
Under minimal technical assumptions (first and second moments of the Y
and all X exist, and the variance-covariance matrix of the X does not have
a determinant of zero), there are always values of $\alpha$ and $\beta_1, \ldots, \beta_N$ that
make these assumptions true.
For convenience, the multiple regression equation is sometimes expressed
in vector format:
\[ X = \begin{bmatrix} 1 & X_1 & \cdots & X_N \end{bmatrix} \qquad \beta = \begin{bmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_N \end{bmatrix} \]
Note that the $\alpha$ has been incorporated into the $\beta$ matrix, and the X
matrix includes not only the $X_1, \ldots, X_N$, but also a 1 at the beginning.
We can refer to this 1 as $X_0$, and the $\alpha$ as $\beta_0$.
The vector expression of the multiple linear regression equation is then
simply:
\[ Y = X \beta + \epsilon \]
There is no need for a separate $\alpha$ term, since this has been included as one
of the $\beta$ coefficients; the $\alpha$ is now simply the $\beta$ corresponding to an X
variable that happens to be constant.
When a multiple regression is written in this form, the conditions that the
error term has expected value of zero and is uncorrelated with the X
variables can be replaced by a single condition:
\[ E[\epsilon X] = 0_{1 \times (N+1)} \]
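Is this one condition really the same as the two we had previously?
Recall that the vector X is:
\[ X = \begin{bmatrix} 1 & X_1 & \cdots & X_N \end{bmatrix} \]
Our condition is then:
\[ E[\epsilon X] = 0 \]
\[ E\left[ \epsilon \begin{bmatrix} 1 & X_1 & \cdots & X_N \end{bmatrix} \right] = 0 \]
\[ E[\epsilon] = 0 \qquad E[\epsilon X_1] = 0 \qquad \ldots \qquad E[\epsilon X_N] = 0 \]
But note that:
\[ E[\epsilon X_i] = \mathrm{Cov}[\epsilon, X_i] + \underbrace{E[\epsilon]}_{=0} E[X_i] \]
\[ E[\epsilon X_i] = \mathrm{Cov}[\epsilon, X_i] \]
So the single new condition is in fact equivalent to the two original
conditions.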
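Example: Fama-French three-factor model for security returns:
\[ \underbrace{R_i - R_f}_{\text{Y variable}} = \alpha_i + b_i \underbrace{\mathrm{RMRF}}_{\text{First X variable}} + s_i \underbrace{\mathrm{SMB}}_{\text{Second X variable}} + h_i \underbrace{\mathrm{HML}}_{\text{Third X variable}} + \underbrace{\epsilon_i}_{\text{Error term}} \]
where i is an index that tells us which security we are regressing on the
three X variables. This model is commonly written as it is above, without
any $\beta$ coefficients, but note that $b_i$, $s_i$, and $h_i$ simply take the place of
$\beta_1$, $\beta_2$, and $\beta_3$.
Expressing the regression equation in vector terms, the X and $\beta$ are:
\[ X = \begin{bmatrix} 1 & \mathrm{RMRF} & \mathrm{SMB} & \mathrm{HML} \end{bmatrix} \qquad \beta = \begin{bmatrix} \alpha_i \\ b_i \\ s_i \\ h_i \end{bmatrix} \]
The model is said to hold if the $\alpha_i$ is equal to zero for every security (not
unlike the CAPM).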
How to find the regression coefficients?
Beginning with the exact regression equation:
\[ Y = X \beta + \epsilon \]
we can rearrange it slightly:
\[ \epsilon = Y - X \beta = Y^T - \beta^T X^T \]
The last step applies because X is a $1 \times (N + 1)$ vector, and $\beta$ is an
$(N + 1) \times 1$ vector. Their product therefore is just $1 \times 1$, and equal to its
own transpose. A general result on matrix transposition is:
\[ (AB)^T = B^T A^T \]
which has been applied here.
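We can now apply the condition:
\[
\begin{aligned}
E[\epsilon X] &= 0 \\
E\left[\left(Y^T - \beta^T X^T\right) X\right] &= 0 \\
E\left[Y^T X\right] - E\left[\beta^T X^T X\right] &= 0 \\
E\left[Y^T X\right] &= \beta^T E\left[X^T X\right]
\end{aligned}
\]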
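Continuing the derivation:
\[
\begin{aligned}
E\left[Y^T X\right] &= \beta^T E\left[X^T X\right] \\
E\left[Y^T X\right] \left(E\left[X^T X\right]\right)^{-1} &= \beta^T \\
\beta &= \left(E\left[X^T X\right]\right)^{-1} E\left[X^T Y\right]
\end{aligned}
\]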
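It may not be obvious, but this result is completely consistent with the results from the single-variable case. Let's consider the case of just one $X$ variable:
\[ X = \begin{bmatrix} 1 & X_1 \end{bmatrix} \]
Then:
\[
\begin{aligned}
\beta &= \left(E\left[X^T X\right]\right)^{-1} E\left[X^T Y\right] \\
&= \left(E\left[\begin{bmatrix} 1 \\ X_1 \end{bmatrix} \begin{bmatrix} 1 & X_1 \end{bmatrix}\right]\right)^{-1} E\left[\begin{bmatrix} 1 \\ X_1 \end{bmatrix} Y\right] \\
&= \left(E\begin{bmatrix} 1 & X_1 \\ X_1 & X_1^2 \end{bmatrix}\right)^{-1} E\begin{bmatrix} Y \\ X_1 Y \end{bmatrix} \\
&= \begin{bmatrix} 1 & E[X_1] \\ E[X_1] & E\left[X_1^2\right] \end{bmatrix}^{-1} \begin{bmatrix} E[Y] \\ E[X_1 Y] \end{bmatrix}
\end{aligned}
\]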
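To proceed any further, we will need to know the inverse of the matrix shown above.
For any $2 \times 2$ matrix, the inverse is:
\[
\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} =
\begin{bmatrix} \dfrac{d}{ad - bc} & \dfrac{-b}{ad - bc} \\[2ex] \dfrac{-c}{ad - bc} & \dfrac{a}{ad - bc} \end{bmatrix}
\]
provided $ad - bc \neq 0$. (Can you verify that this is the correct inverse?)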
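One way to answer the question above is to check the formula numerically. The sketch below is only an illustration: the helper name `inverse_2x2` and the matrix entries are arbitrary choices, not part of the course material. It multiplies a matrix by the closed-form inverse and compares the product with the identity.

```python
import numpy as np

def inverse_2x2(a, b, c, d):
    """Inverse of [[a, b], [c, d]] from the closed-form formula."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular: ad - bc = 0")
    return np.array([[d, -b], [-c, a]]) / det

# Arbitrary illustrative entries (hypothetical, chosen so that ad - bc = 1).
A = np.array([[2.0, 1.0], [5.0, 3.0]])
A_inv = inverse_2x2(2.0, 1.0, 5.0, 3.0)
product = A @ A_inv  # should be the 2x2 identity matrix
```

The same check works for any entries with a nonzero determinant; a symbolic verification simply multiplies the two matrices by hand.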
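Applying this result within the context of our problem:
\[
\begin{aligned}
\begin{bmatrix} 1 & E[X_1] \\ E[X_1] & E\left[X_1^2\right] \end{bmatrix}^{-1}
&= \begin{bmatrix}
\dfrac{E\left[X_1^2\right]}{E\left[X_1^2\right] - (E[X_1])^2} & \dfrac{-E[X_1]}{E\left[X_1^2\right] - (E[X_1])^2} \\[2ex]
\dfrac{-E[X_1]}{E\left[X_1^2\right] - (E[X_1])^2} & \dfrac{1}{E\left[X_1^2\right] - (E[X_1])^2}
\end{bmatrix} \\
&= \begin{bmatrix}
\dfrac{E\left[X_1^2\right]}{\operatorname{Var}[X_1]} & \dfrac{-E[X_1]}{\operatorname{Var}[X_1]} \\[2ex]
\dfrac{-E[X_1]}{\operatorname{Var}[X_1]} & \dfrac{1}{\operatorname{Var}[X_1]}
\end{bmatrix}
\end{aligned}
\]
There is only a division by zero problem if $X_1$ has a variance of zero.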
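Continuing where we left off:
\[
\begin{aligned}
\beta &= \begin{bmatrix} 1 & E[X_1] \\ E[X_1] & E\left[X_1^2\right] \end{bmatrix}^{-1} \begin{bmatrix} E[Y] \\ E[X_1 Y] \end{bmatrix} \\
&= \begin{bmatrix}
\dfrac{E\left[X_1^2\right]}{\operatorname{Var}[X_1]} & \dfrac{-E[X_1]}{\operatorname{Var}[X_1]} \\[2ex]
\dfrac{-E[X_1]}{\operatorname{Var}[X_1]} & \dfrac{1}{\operatorname{Var}[X_1]}
\end{bmatrix} \begin{bmatrix} E[Y] \\ E[X_1 Y] \end{bmatrix} \\
&= \begin{bmatrix}
\dfrac{E\left[X_1^2\right] E[Y] - E[X_1] E[X_1 Y]}{\operatorname{Var}[X_1]} \\[2ex]
\dfrac{E[X_1 Y] - E[X_1] E[Y]}{\operatorname{Var}[X_1]}
\end{bmatrix} \\
&= \begin{bmatrix}
\dfrac{\operatorname{Var}[X_1] E[Y] + (E[X_1])^2 E[Y] - E[X_1] \operatorname{Cov}[X_1, Y] - (E[X_1])^2 E[Y]}{\operatorname{Var}[X_1]} \\[2ex]
\dfrac{E[X_1 Y] - E[X_1] E[Y]}{\operatorname{Var}[X_1]}
\end{bmatrix} \\
&= \begin{bmatrix}
E[Y] - \dfrac{\operatorname{Cov}[X_1, Y]}{\operatorname{Var}[X_1]} E[X_1] \\[2ex]
\dfrac{\operatorname{Cov}[X_1, Y]}{\operatorname{Var}[X_1]}
\end{bmatrix}
\end{aligned}
\]
Note that, in the last expression, the bottom element of the vector is just $\beta$ from the case of the single-variable regression; then the top element is just the $\alpha$.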
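The agreement between the matrix formula and the single-variable formulas can also be checked numerically. The sketch below is an illustration under stated assumptions: the data are simulated with made-up parameters, and expectations are replaced by sample averages (dividing by the sample size, so that the algebra carries over exactly).

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(1.0, 2.0, size=1000)                  # hypothetical X_1 sample
y = 0.5 + 1.5 * x1 + rng.normal(0.0, 1.0, size=1000)  # hypothetical Y sample

# Matrix formula, with expectations replaced by sample averages.
X = np.column_stack([np.ones_like(x1), x1])   # each row is [1, X_1]
EXtX = X.T @ X / len(y)                       # sample version of E[X^T X]
EXtY = X.T @ y / len(y)                       # sample version of E[X^T Y]
beta_matrix = np.linalg.solve(EXtX, EXtY)     # [alpha, beta]

# Single-variable formulas: beta = Cov/Var, alpha = E[Y] - beta * E[X_1].
slope = np.cov(x1, y, bias=True)[0, 1] / np.var(x1)
intercept = y.mean() - slope * x1.mean()
```

Up to floating-point rounding, `beta_matrix` equals `[intercept, slope]`, which is the content of the last line of the derivation.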
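Trying to perform such manipulations with two or more $X$ variables quickly becomes extremely tedious, so we will tend to use the vector/matrix representation.
As in the single-variable regression case, there is an estimated version of the regression equation:
\[ Y_t = X_t \hat\beta + \hat\epsilon_t \]
We can estimate $\hat\beta$ by minimizing the sum of the squared errors:
\[ \sum_{t=1}^{T} \hat\epsilon_t^2 = \sum_{t=1}^{T} \left(Y_t - X_t \hat\beta\right)^2 \]
For each element of $\hat\beta$, there is a first-order condition:
\[
\frac{\partial}{\partial \hat\beta_i} \sum_{t=1}^{T} \hat\epsilon_t^2
= \frac{\partial}{\partial \hat\beta_i} \sum_{t=1}^{T} \left(Y_t - X_t \hat\beta\right)^2
= -\sum_{t=1}^{T} 2 X_{t,i} \left(Y_t - X_t \hat\beta\right) = 0
\]
This condition must hold for each $0 \le i \le N$.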
At this point, it is convenient to assume the dependent variable observations are arranged in a vector, and the independent variable observations are arranged in a matrix:
\[
Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_T \end{bmatrix}
\qquad
X = \begin{bmatrix} 1 & X_{1,1} & \cdots & X_{1,N} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{T,1} & \cdots & X_{T,N} \end{bmatrix}
\]
Each individual first-order condition is then that column $i$ of $X$ (transposed), multiplied by the vector of residuals, is equal to zero. This is the sample counterpart to the condition that $E[\epsilon X_i] = 0$.
However, we can express the first-order conditions all at once by writing:
\[ X^T \left(Y - X \hat\beta\right) = 0_{(N+1) \times 1} \]
Rearranging this last line a bit:
\[ X^T X \hat\beta = X^T Y \]
or:
\[ \hat\beta = \left(X^T X\right)^{-1} X^T Y \]
The least-squares estimate $\hat\beta$ is therefore calculated by arranging the $X$ and the $Y$ in a matrix and a vector, respectively, and performing the above operations (which include matrix transposition, multiplication, and inversion).
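A minimal sketch of this calculation in NumPy follows. The function name `ols_coefficients` and the simulated data (including the "true" coefficients) are assumptions made for illustration only. The sketch solves the normal equations $X^T X \hat\beta = X^T Y$ directly rather than forming the inverse, which gives the same $\hat\beta$.

```python
import numpy as np

def ols_coefficients(x_vars, y):
    """Return beta_hat = (X^T X)^{-1} X^T Y, with a constant prepended to X."""
    x_vars = np.asarray(x_vars, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), x_vars])   # T x (N+1) matrix
    # Solve the normal equations X^T X beta = X^T Y.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Simulated data with hypothetical coefficients [alpha, beta_1, beta_2, beta_3].
rng = np.random.default_rng(1)
T, N = 200, 3
x_vars = rng.normal(size=(T, N))
true_beta = np.array([0.1, 1.0, -0.5, 0.3])
y = true_beta[0] + x_vars @ true_beta[1:] + 0.05 * rng.normal(size=T)

beta_hat = ols_coefficients(x_vars, y)
```

The first element of `beta_hat` is the estimated constant and the remaining elements are the slope estimates, matching the ordering of the columns of $X$.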
Note that $\hat\beta$ (as well as $\beta$ itself) is an $(N+1) \times 1$ vector, $X$ is a $T \times (N+1)$ matrix, and $Y$ is a $T \times 1$ vector. Then:
\[
\underbrace{\hat\beta}_{(N+1) \times 1} = \underbrace{\left(X^T X\right)^{-1}}_{(N+1) \times (N+1)} \; \underbrace{X^T Y}_{(N+1) \times 1}
\]
As $N$ is the number of $X$ variables (not counting the constant), in the single-variable regression $\hat\beta$ is a $2 \times 1$ vector; the top element is the estimated $\alpha$, and the bottom element is what we had previously called the estimated $\beta$.
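Statistical properties of the estimates: we can proceed as follows:
\[
\begin{aligned}
\hat\beta &= \left(X^T X\right)^{-1} X^T Y = \left(X^T X\right)^{-1} X^T (X\beta + \epsilon) \\
&= \left(X^T X\right)^{-1} X^T X \beta + \left(X^T X\right)^{-1} X^T \epsilon \\
&= \beta + \left(X^T X\right)^{-1} X^T \epsilon
\end{aligned}
\]
So the estimated beta coefficients are equal to the actual beta coefficients, plus an error term that depends on the $\epsilon$, but also on the $X$.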
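To proceed any further, we need (as in the single-variable regression case) to strengthen our assumption about the properties of the $\epsilon$. We now assume that:
\[ E[\epsilon_t \mid X] = 0 \]
Note that this assumption implies one of our earlier assumptions, by the law of iterated expectations:
\[
\begin{aligned}
E[\epsilon_t \mid X] &= 0 \\
E[E[\epsilon_t \mid X]] &= 0 \\
E[\epsilon_t] &= 0
\end{aligned}
\]
But it also implies the other condition:
\[
\begin{aligned}
E[\epsilon_t \mid X] &= 0 \\
E[\epsilon_t X \mid X] &= 0 \\
E[\epsilon_t X] &= 0 \\
\operatorname{Cov}[\epsilon_t, X] + \underbrace{E[\epsilon_t]}_{=0} E[X] &= 0 \\
\operatorname{Cov}[\epsilon_t, X] &= 0
\end{aligned}
\]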
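Once again, we derive the statistical properties of the estimate conditional on $X$:
\[
\begin{aligned}
E\left[\hat\beta \mid X\right] &= E\left[\beta + \left(X^T X\right)^{-1} X^T \epsilon \,\middle|\, X\right] \\
&= E[\beta \mid X] + E\left[\left(X^T X\right)^{-1} X^T \epsilon \,\middle|\, X\right] \\
&= \beta + \left(X^T X\right)^{-1} X^T E[\epsilon \mid X] = \beta
\end{aligned}
\]
The coefficients are therefore unbiased, conditional on the $X$; by the law of iterated expectations, they are also unbiased unconditionally:
\[ E\left[\hat\beta\right] = E\left[E\left[\hat\beta \mid X\right]\right] = E[\beta] = \beta \]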
In order to proceed any further, we need to strengthen our assumptions on the distribution of the residuals still more. We use the same condition we used in the single-variable case:
\[
\operatorname{Cov}[\epsilon_s, \epsilon_t \mid X] =
\begin{cases} \sigma^2 & s = t \\ 0 & s \neq t \end{cases}
\]
that is, we assume the residuals are homoscedastic, and uncorrelated across time periods.
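We would then like to find:
\[
\operatorname{Var}\left[\hat\beta \mid X\right] \equiv
\begin{bmatrix}
\operatorname{Var}\left[\hat\beta_0 \mid X\right] & \cdots & \operatorname{Cov}\left[\hat\beta_0, \hat\beta_N \mid X\right] \\
\vdots & \ddots & \vdots \\
\operatorname{Cov}\left[\hat\beta_N, \hat\beta_0 \mid X\right] & \cdots & \operatorname{Var}\left[\hat\beta_N \mid X\right]
\end{bmatrix}
\]
We can write this instead as:
\[
\operatorname{Var}\left[\hat\beta \mid X\right] = E\left[\hat\beta \hat\beta^T \mid X\right] - E\left[\hat\beta \mid X\right] E\left[\hat\beta^T \mid X\right]
\]
The last term is easy, since $E[\hat\beta \mid X]$ is just $\beta$. The first term requires a fair amount of manipulation.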
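We can proceed as follows:
\[
\begin{aligned}
E\left[\hat\beta \hat\beta^T \mid X\right]
&= E\left[\left(X^T X\right)^{-1} X^T Y Y^T X \left(X^T X\right)^{-1} \,\middle|\, X\right] \\
&= \left(X^T X\right)^{-1} X^T E\left[Y Y^T \mid X\right] X \left(X^T X\right)^{-1} \\
&= \left(X^T X\right)^{-1} X^T E\left[(X\beta + \epsilon)(X\beta + \epsilon)^T \mid X\right] X \left(X^T X\right)^{-1} \\
&= \left(X^T X\right)^{-1} X^T \left( X \beta \beta^T X^T + E[\epsilon \mid X] \beta^T X^T + X \beta E\left[\epsilon^T \mid X\right] + E\left[\epsilon \epsilon^T \mid X\right] \right) X \left(X^T X\right)^{-1}
\end{aligned}
\]
Since $E[\epsilon \mid X] = 0$, the two middle terms vanish, and:
\[
\begin{aligned}
E\left[\hat\beta \hat\beta^T \mid X\right]
&= \left(X^T X\right)^{-1} X^T X \beta \beta^T X^T X \left(X^T X\right)^{-1} + \left(X^T X\right)^{-1} X^T E\left[\epsilon \epsilon^T \mid X\right] X \left(X^T X\right)^{-1} \\
&= \beta \beta^T + \left(X^T X\right)^{-1} X^T E\left[\epsilon \epsilon^T \mid X\right] X \left(X^T X\right)^{-1} \\
&= \beta \beta^T + \sigma^2 \left(X^T X\right)^{-1} X^T X \left(X^T X\right)^{-1} \\
&= \beta \beta^T + \sigma^2 \left(X^T X\right)^{-1}
\end{aligned}
\]
(How does the next to last step follow?)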
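It follows that:
\[
\begin{aligned}
\operatorname{Var}\left[\hat\beta \mid X\right]
&= E\left[\hat\beta \hat\beta^T \mid X\right] - E\left[\hat\beta \mid X\right] E\left[\hat\beta^T \mid X\right] \\
&= \beta \beta^T + \sigma^2 \left(X^T X\right)^{-1} - \beta \beta^T \\
&= \sigma^2 \left(X^T X\right)^{-1}
\end{aligned}
\]
This result is entirely consistent with the expressions derived when we considered single-variable regressions, although the calculations needed to show this are quite tedious. This result, however, is quite general, applying to regressions with an arbitrary number of $X$ variables.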
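The result $\operatorname{Var}[\hat\beta \mid X] = \sigma^2 (X^T X)^{-1}$ can be illustrated with a small simulation. The sketch below is not part of the course material: it holds a hypothetical $X$ matrix fixed, draws many sets of homoscedastic normal residuals with a made-up $\sigma$, re-estimates $\hat\beta$ each time, and compares the sample covariance of the estimates with the formula.

```python
import numpy as np

rng = np.random.default_rng(2)
T, sigma = 50, 0.5
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])  # held fixed
beta = np.array([0.2, 1.0, -1.0])          # hypothetical true coefficients
XtX_inv = np.linalg.inv(X.T @ X)

draws = 20000
eps = sigma * rng.normal(size=(draws, T))  # homoscedastic, uncorrelated
Y = X @ beta + eps                         # each row is one simulated sample
# Row r of beta_hats is the transpose of (X^T X)^{-1} X^T Y_r.
beta_hats = Y @ X @ XtX_inv

simulated_mean = beta_hats.mean(axis=0)            # close to beta
simulated_cov = np.cov(beta_hats, rowvar=False)    # close to the formula
theoretical_cov = sigma**2 * XtX_inv
```

Because $X$ is the same in every draw, the simulation estimates the conditional moments; the agreement is only approximate, with the gap shrinking as the number of draws grows.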
As in the single-variable case, the problem is that we don't know the value of $\sigma^2$, but must estimate it. The usual way of estimating it is:
\[
s^2 = \frac{1}{T - N - 1} \sum_{t=1}^{T} \hat\epsilon_t^2 = \frac{1}{T - N - 1} \sum_{t=1}^{T} \left(Y_t - X_t \hat\beta\right)^2
\]
The $T - N - 1$, in place of $T$, is there because there are $N + 1$ independent variables (the $N$ $X$ variables, plus the constant), and the $\hat\beta$ coefficients are estimated. If $T$ were used instead, the estimate of $\sigma^2$ would be biased, being on average too small. This estimate is unbiased:
\[ E\left[s^2\right] = \sigma^2 \]
(The above result holds conditionally on the $X$, or unconditionally.)
We now have the statistical properties of the regression estimates: the $\hat\beta$ coefficients are, on average, equal to the true coefficients $\beta$, and we have derived the variance-covariance matrix of the estimated $\hat\beta$ coefficients:
\[
E\left[\hat\beta \mid X\right] = \beta
\qquad
\operatorname{Var}\left[\hat\beta \mid X\right] = \sigma^2 \left(X^T X\right)^{-1}
\]
We also have a method of estimation for $\sigma^2$; it can be estimated with $s^2$.
With these results, we can perform statistical estimation and tests involving the coefficients of the regression. We will use such tests to evaluate multiple-factor models of expected returns.
To test a hypothesis about an estimated coefficient, we can calculate the test statistic:
\[
t = \frac{\hat\beta_i - (\beta_i)_0}{\hat\sigma\left(\hat\beta_i\right)}
= \frac{\hat\beta_i - (\beta_i)_0}{\sqrt{s^2 \left[\left(X^T X\right)^{-1}\right]_{ii}}}
\]
where $(\beta_i)_0$ is the hypothesized value of the coefficient. The notation $[\cdot]_{ij}$ denotes the element of the enclosed matrix in row $i$ and column $j$; we start numbering the rows and columns with zero, not one, to match the convention on the $\beta$ vector. Note that the elements along the diagonal of $(X^T X)^{-1}$, when multiplied by $s^2$, are estimates of the variance of the coefficients $\hat\alpha, \hat\beta_1, \ldots, \hat\beta_N$.
It is conceptually straightforward, although practically tedious, to verify that this statistic is the same one we used when testing the expected value of some quantity (set $N = 0$), or in the single-variable regression (set $N = 1$).
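The pieces above (the estimate $s^2$, the standard errors, and the t-statistics) fit together as in the following sketch. The function name `ols_t_statistics` and the simulated data are assumptions for illustration; the standard error of each coefficient is taken as the square root of $s^2$ times the corresponding diagonal element of $(X^T X)^{-1}$.

```python
import numpy as np

def ols_t_statistics(X, y, hypothesized=None):
    """OLS estimates, standard errors and t-statistics.

    X is T x (N+1) with a leading column of ones; `hypothesized` holds the
    (beta_i)_0 values and defaults to zero for every coefficient.
    """
    T, k = X.shape                           # k = N + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (T - k)             # divides by T - N - 1
    std_err = np.sqrt(s2 * np.diag(XtX_inv))
    if hypothesized is None:
        hypothesized = np.zeros(k)
    t_stats = (beta_hat - hypothesized) / std_err
    return beta_hat, std_err, t_stats, T - k  # last item: degrees of freedom

# Simulated data with hypothetical coefficients [alpha, beta_1, beta_2].
rng = np.random.default_rng(3)
T = 120
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([0.0, 1.0, 0.5]) + rng.normal(size=T)
beta_hat, std_err, t_stats, dof = ols_t_statistics(X, y)
```

The first entry of `t_stats` is the statistic for the constant, in line with the zero-based numbering used in the text.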
Once again, we need to strengthen our assumptions about the residuals further still to conduct statistical inference. We make exactly the same assumption made in the single-variable regression case, that each residual has a Gaussian, or normal distribution. We have already assumed that the residuals have, conditional on $X$, expected value equal to zero, are homoscedastic, and uncorrelated with each other.
Under this assumption, what is the distribution of our test statistic? When testing the expected value of a random variable (i.e., no $X$ variables at all, or $N = 0$), this test statistic has a Student's t distribution with $T - 1$ degrees of freedom. In single-variable regression (i.e., one $X$ variable, or $N = 1$), the test statistic has a Student's t distribution with $T - 2$ degrees of freedom.
You may have started to notice a pattern here. If so, what do you think the distribution of the test statistic is for arbitrary $N$?
For any value of $N$, the test statistic has a Student's t distribution with $T - N - 1$ degrees of freedom. The degrees of freedom parameter is the number of time series observations of the $Y$ variable we have, minus one degree of freedom for each $\beta$ coefficient estimated, and minus one degree of freedom for the estimated $\alpha$ coefficient.
In practice, testing whether the estimated $\alpha$ coefficient is equal to zero (the test we usually want to perform when evaluating asset pricing models) is therefore almost exactly the same as in the single-variable case. Run a regression, producing an estimate $\hat\alpha$ and also its standard error ($\hat\sigma(\hat\alpha)$). The formulae to calculate these quantities are more complicated than in the single-variable case, but most regression software packages hide the details from you.
We then calculate a test statistic, by taking the ratio of the $\hat\alpha$ estimate to its standard error ($\hat\alpha / \hat\sigma(\hat\alpha)$). Most regression software packages report this result as a t-statistic, for each estimated coefficient. The distribution is, as before, a Student's t distribution, but with fewer degrees of freedom than in the single-variable regression case. Most software packages will kindly print out p-values corresponding to the t-statistics.
Note that the number of degrees of freedom is equal to $T - N - 1$. If the number of time series observations ($T$) is not more than the number of $X$ variables plus one, then the result of the regression is generally a perfect fit, and there is no way to perform statistical inference: there is simply not enough information available.
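The perfect-fit case is easy to reproduce. In the sketch below (an illustration with made-up random data, not course data), there are exactly as many observations as coefficients, so the fitted values match $Y$ exactly even though $Y$ is unrelated to $X$.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 3
T = N + 1                                   # as many observations as coefficients
X = np.column_stack([np.ones(T), rng.normal(size=(T, N))])
y = rng.normal(size=T)                      # pure noise, unrelated to X

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat                # numerically zero: a perfect fit
degrees_of_freedom = T - N - 1              # zero, so s^2 would be 0 / 0
```

With zero degrees of freedom the estimate $s^2$ is undefined, so no standard errors or t-statistics can be formed.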
Recall the Fama-French three-factor model:
\[ R_i - R_f = b_i \, \mathrm{RMRF} + s_i \, \mathrm{SMB} + h_i \, \mathrm{HML} + \epsilon_i \]
This model was proposed in the early 1990s, in light of the perceived failure of the CAPM to explain certain features of expected security returns.
Each of the three factors on the right-hand side is the excess return of a zero-cost portfolio. That is, the factors are tradable portfolios with long and short positions. Calling the long position $L$ and the short position $S$, consider the value of a portfolio which has S\$1 invested in $L$ and $-$S\$1 invested in $S$. Initially, this portfolio has value of zero, because the long and short positions have equal and opposite value.
However, over time, the values of both the long and the short positions change. Calling $L_0$ and $L_1$ the values of the long position initially and at some later time, and $S_0$ and $S_1$ the values of the short position initially and later, then:
\[ L_1 = L_0 (1 + r_L) \qquad S_1 = S_0 (1 + r_S) \]
The values of the combined portfolio now, $V_0$, and after the passage of time, $V_1$, are:
\[
\begin{aligned}
V_0 &= L_0 + S_0 = \text{S\$}1 - \text{S\$}1 = \text{S\$}0 \\
V_1 &= L_1 + S_1 = L_0 (1 + r_L) + S_0 (1 + r_S) \\
&= \text{S\$}1 \, (1 + r_L) - \text{S\$}1 \, (1 + r_S) = \text{S\$}1 \, (r_L - r_S)
\end{aligned}
\]
The value of a particular factor (i.e., RMRF, SMB, or HML) is simply the portfolio value per unit of currency invested in the long position (and also per negative unit of currency invested in the short position), that is, $r_L - r_S$. The differences between the three factors are therefore only the contents of the long and short components, which are described below:
1. RMRF: the long position is the market portfolio, the short position is the risk-free rate of return.
2. SMB: stands for "small minus big". The long position is a portfolio containing small company stocks, and the short position is a portfolio containing large company stocks.
3. HML: stands for "high minus low". The long position is a portfolio containing "value" stocks, and the short position is a portfolio containing "growth" stocks.
Details on the formation of the three portfolios are found at Ken French's website (Dartmouth College); there, one may also freely download long histories of the factor realisations.
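The zero-cost portfolio arithmetic described above can be sketched in a few lines. The function name and the return figures are hypothetical; the point is only that the ending value of a position that is long one unit of currency and short one unit equals the return difference $r_L - r_S$ per unit invested.

```python
def zero_cost_portfolio_value(long_return, short_return, notional=1.0):
    """Ending value V_1 of a portfolio long `notional` in L and short `notional` in S."""
    long_value = notional * (1.0 + long_return)      # L_1 = L_0 (1 + r_L)
    short_value = -notional * (1.0 + short_return)   # S_1 = S_0 (1 + r_S), with S_0 < 0
    return long_value + short_value                  # V_1 = notional * (r_L - r_S)

# Hypothetical one-period returns: 3% on the long side, 1% on the short side.
v1 = zero_cost_portfolio_value(long_return=0.03, short_return=0.01)
```

The same calculation applies to each of the three factors; only the contents of the long and short sides differ.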
The CAPM originated from a theoretical argument, beginning with an assumption about investor behaviour, and ending with a condition that makes security markets in equilibrium (supply equal to demand). Many initial tests of the CAPM tended to support its predictions; however, over time, it became more and more clear that the CAPM systematically failed to capture certain features of security returns.
Among empirical violations of the CAPM were so-called "value" and "size" effects. Stocks of small firms, and "value" stocks, tended to outperform the stocks of large firms and "growth" stocks, even after accounting for any difference in their $\beta$ coefficients. Small firm stocks do tend to have higher $\beta$ coefficients than large firm stocks, so, according to the CAPM, they should have higher expected returns. However, the difference in expected returns is too large to be explained by the difference in $\beta$ coefficients.
Fama and French in the 1990s therefore developed the two factors, SMB and HML, to try to capture the size and value effects. (Note that the first factor, RMRF, is simply the excess return of the market, and is what the CAPM predicts should matter for expected returns.)
The Fama-French model is purely empirical; there was no widely accepted theory predicting that there should be size and value effects at the time. There have been various theoretical explanations derived after the fact to try to explain the value and size effects, although none of these explanations has won widespread support.
The prediction of the Fama-French model is therefore that the expected return of a security is the risk-free rate, plus risk premia for exposure to three different sources of risk: market risk is measured by the $b_i$ coefficient, the risk of small company stocks is captured by the $s_i$ coefficient, and the risk of value stocks is captured by the $h_i$ coefficient.
Although there is some empirical evidence for the existence of a risk premium for each of the three factors, there is considerable debate about the reason for the risk premia associated with the value and size factors.
Fama and French argue that these risk premia are due to rational risk aversion by investors toward the risk captured by the SMB and HML factors, but are unable to say much about the nature of these risks.
Others argue that the value and size effects are due to investor irrationality, arising from various systematic psychological biases.
Note the linear form of the Fama-French model. Does this make sense?
Recall the analysis of portfolio theory. There exists a portfolio, containing
risky assets only, that has the highest level of expected return for a given
level of standard deviation of return, or, equivalently, the lowest standard
deviation of return for a given level of expected return.
[Figure: Mean-variance efficient portfolios]
The prediction of the CAPM is that the tangency portfolio is the market portfolio. Then there is a relation between the expected return of any security and its covariance with the market return:
\[
E[R_i] - R_f = \beta_i \left(E[R_M] - R_f\right)
\qquad
\beta_i = \frac{\operatorname{Cov}[R_i, R_M]}{\operatorname{Var}[R_M]}
\]
The CAPM may or may not be true: if the tangency portfolio is not the market portfolio, at least some securities will violate the above relation. However, a similar relation always holds for the tangency portfolio:
\[
E[R_i] - R_f = \beta_i \left(E[R_T] - R_f\right)
\qquad
\beta_i = \frac{\operatorname{Cov}[R_i, R_T]}{\operatorname{Var}[R_T]}
\]
(Any idea how to prove this result?)
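Before attempting a proof, the relation can be checked numerically. The sketch below uses made-up expected returns and a made-up covariance matrix for three risky assets, and relies on the standard mean-variance result that tangency portfolio weights are proportional to the inverse covariance matrix times the vector of expected excess returns; that formula is brought in for illustration and is not derived in this section.

```python
import numpy as np

rf = 0.02
mu = np.array([0.08, 0.10, 0.05])                 # hypothetical expected returns
cov = np.array([[0.040, 0.006, 0.004],
                [0.006, 0.090, 0.010],
                [0.004, 0.010, 0.020]])           # hypothetical covariance matrix

# Tangency weights are proportional to inverse(cov) @ (mu - rf).
raw = np.linalg.solve(cov, mu - rf)
w = raw / raw.sum()                               # normalise to sum to one

exp_tangency = w @ mu                             # E[R_T]
var_tangency = w @ cov @ w                        # Var[R_T]
betas = (cov @ w) / var_tangency                  # Cov[R_i, R_T] / Var[R_T]

lhs = mu - rf                                     # E[R_i] - R_f
rhs = betas * (exp_tangency - rf)                 # beta_i (E[R_T] - R_f)
```

The two sides agree for every asset, whatever inputs are chosen (provided the covariance matrix is invertible and the raw weights do not sum to zero), which hints at how the proof goes: the covariance of each asset with the tangency portfolio is proportional to that asset's expected excess return.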
There is a relation between the tangency portfolio and multiple-factor models in which the factors are traded zero-cost portfolios. Specifically, a multiple-factor asset pricing model (such as the Fama-French model) explains the returns of all assets if and only if the return of the tangency portfolio is the risk-free rate of return, plus a linear function of the factors.
In the case of the Fama-French model, this condition is:
\[ R_T = R_f + c_1 \, \mathrm{RMRF} + c_2 \, \mathrm{SMB} + c_3 \, \mathrm{HML} \]
The Fama-French model predicts the expected returns of all assets if and only if the above condition holds for some $c_1$, $c_2$, and $c_3$.
So there is a very sound theoretical reason for postulating a linear relation between expected returns of securities and exposures to risk factors. Whatever one believes about the source of the various risk premia, we can test the model the same way.
Recall once again the Fama-French model:
\[ R_i - R_f = b_i \, \mathrm{RMRF} + s_i \, \mathrm{SMB} + h_i \, \mathrm{HML} + \epsilon_i \]
We can express this relation within the context of a regression equation (with a constant):
\[ Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon \]
As with the CAPM case, we will estimate the coefficients of the regression equation, then perform a statistical test to see if the predictions of the asset pricing model are supported.
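The correspondence between the notation in the regression equation and the Fama-French model is:
\[
\underbrace{Y}_{\text{Regression}} = \underbrace{R_i - R_f}_{\text{FF model}}
\]
\[
\underbrace{X = \begin{bmatrix} 1 & X_1 & X_2 & X_3 \end{bmatrix}}_{\text{Regression}}
= \underbrace{\begin{bmatrix} 1 & \mathrm{RMRF} & \mathrm{SMB} & \mathrm{HML} \end{bmatrix}}_{\text{FF model}}
\]
\[
\underbrace{\beta = \begin{bmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}}_{\text{Regression}}
= \underbrace{\begin{bmatrix} 0 \\ b_i \\ s_i \\ h_i \end{bmatrix}}_{\text{FF model}}
\]
Note that the Fama-French model has nothing that corresponds to the $\alpha$ coefficient in the regression equation. This is what we will test: is the estimated $\alpha$ coefficient equal to zero?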
Example: we will continue with the same asset we used to test the CAPM, but now we will test the Fama-French model instead. Recall that we failed to reject CAPM at the 95% confidence level, based on a t-statistic of 1.83, which has a p-value of 0.0669. As we failed to reject the CAPM based on the evidence offered by this asset, does it follow that we will also fail to reject the three-factor model (which includes the CAPM as a special case)?
We do not have the technology to produce a four-dimensional plot showing
the excess returns of the asset against each of the three Fama-French
factors simultaneously, but we can look at them one at a time.
[Figures: Excess Return of Asset vs. RMRF; Excess Return of Asset vs. SMB; Excess Return of Asset vs. HML]
[Figure: Microsoft Excel regression output, Fama-French three-factor model]
The regression package produces estimates of each coefficient: "Intercept" is the $\hat\alpha$, and "X Variable 1" through "X Variable 3" refer to the three $\hat\beta$ coefficients.
The hypothesis we wish to test is whether the $\alpha$ coefficient is equal to zero. The estimated $\hat\alpha$ coefficient is 0.008698, with a standard error of 0.002336.
The t-statistic is therefore approximately 3.72 (calculate it yourself, or just look at the regression output). With 935 degrees of freedom, a t distribution is very close to a normal distribution, and the corresponding p-value is 0.000208.
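The t-statistic quoted above can be recomputed from the two numbers given in the text. The sketch below also evaluates a two-sided p-value using the normal approximation mentioned in the text (written with the complementary error function from the standard library), so it is an approximation to the Student's t p-value reported by the regression package rather than the package's exact figure.

```python
import math

alpha_hat = 0.008698        # estimated intercept quoted in the text
std_error = 0.002336        # its standard error, also quoted in the text

t_stat = alpha_hat / std_error

# Two-sided p-value under the normal approximation: 2 * (1 - Phi(|t|)),
# which equals erfc(|t| / sqrt(2)).
p_normal = math.erfc(abs(t_stat) / math.sqrt(2.0))
```

With this many degrees of freedom the normal approximation slightly understates the exact t-distribution p-value, but the two are of the same order of magnitude.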
Do you feel the evidence here supports the Fama-French three-factor
model? How do the results compare to the test we performed for the
CAPM?
Style Analysis: although not directly related to the testing of an asset pricing model, regressions like the one we just ran are sometimes used to perform style analysis. The goal of style analysis is to try to determine the return characteristics of an asset, usually some kind of managed fund (mutual fund, unit trust, etc.).
Such funds often advertise their characteristics, using words or phrases like "value", "growth", "small-cap", etc. It is often found, though, that fund managers do not always keep true to their advertised style, and the style of particular funds can change over time. The goal of style analysis is to find out what they are really doing, not what they claim to be doing.
The asset whose excess returns were the $Y$ variable in this regression is not an actual fund; the returns are those of a hypothetical fund constructed after the fact, and calculated using historical data. However, let's perform the style analysis anyway.
Answer the following questions, using the regression results:
1. What kind of exposure does this fund have to market risk?
2. Does this fund seem to be investing in small-cap stocks, or large-cap stocks?
3. Does this fund seem to be investing in value stocks, or growth stocks?
How certain are you of your answers?
Testing Multifactor Models Multivariate Tests
As before, we want a test of many securities, not just one of them. The
procedure is similar to what we have already used for simpler models of
expected returns.
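For each security, the returns conform to:
\[ R_{i,t} - R_{f,t} = \alpha_i + b_i \, \mathrm{RMRF}_t + s_i \, \mathrm{SMB}_t + h_i \, \mathrm{HML}_t + \epsilon_{i,t} \]
Arranging the $\epsilon_{i,t}$ for different values of $i$ into a vector:
\[ \epsilon_t = \begin{bmatrix} \epsilon_{1,t} \\ \vdots \\ \epsilon_{M,t} \end{bmatrix} \]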
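The mean vector and covariance matrix of the residuals are then:
\[
E[\epsilon_t \mid X] = \begin{bmatrix} E[\epsilon_{1,t} \mid X] \\ \vdots \\ E[\epsilon_{M,t} \mid X] \end{bmatrix}
\]
\[
\operatorname{Var}[\epsilon_t \mid X] = \begin{bmatrix}
\operatorname{Var}[\epsilon_{1,t} \mid X] & \cdots & \operatorname{Cov}[\epsilon_{1,t}, \epsilon_{M,t} \mid X] \\
\vdots & \ddots & \vdots \\
\operatorname{Cov}[\epsilon_{M,t}, \epsilon_{1,t} \mid X] & \cdots & \operatorname{Var}[\epsilon_{M,t} \mid X]
\end{bmatrix}
\]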
So far, no assumptions: everything is mechanical. We now assume:
\[
\begin{aligned}
E[\epsilon_t \mid X] &= 0_{M \times 1} \quad \text{for all } t \\
\operatorname{Var}[\epsilon_t \mid X] &= \Sigma \equiv \begin{bmatrix}
\sigma^2_{1,1} & \cdots & \sigma^2_{1,M} \\
\vdots & \ddots & \vdots \\
\sigma^2_{M,1} & \cdots & \sigma^2_{M,M}
\end{bmatrix} \quad \text{for all } t \\
\operatorname{Cov}[\epsilon_s, \epsilon_t \mid X] &= 0_{M \times M} \quad \text{for all } s \neq t
\end{aligned}
\]
(What are the unconditional implications of these assumptions?)
The assumptions state that (all conditional on $X$) the residuals always have expected value of zero, and residuals from different time periods are uncorrelated with each other. The residual for any one of the particular assets always has the same variance (across time periods), and the covariance between the residuals for two particular assets always has the same value.
Using the law of iterated expectations, the result on the relation between conditional and unconditional variance, and a similar result for covariance, it is possible to show that the exactly analogous statements also hold unconditionally.
We can make still stronger assumptions: specifically, that $\epsilon_s$ and $\epsilon_t$ are independent for $s \neq t$, and that each $\epsilon_t$ has a multivariate normal distribution:
\[
f_{\epsilon_t}(x) = \frac{1}{(2\pi)^{M/2} \, |\Sigma|^{1/2}} \, e^{-\frac{x^T \Sigma^{-1} x}{2}}
\]
where $|\Sigma|$ denotes the determinant of the covariance matrix $\Sigma$. Note that $x$ is an $M$-element vector here. Also recall that the mean of the $\epsilon_t$ is zero, and the covariance matrix is $\Sigma$.
An important property of the multivariate normal (or Gaussian) distribution is that each element of $\epsilon_t$, considered individually, has a marginal distribution which is normal (or Gaussian).
We assume that the residuals have this distribution conditional on the $X$, but they have the same distribution unconditionally.
Note what we have done here: we have standard OLS regression assumptions for each asset, considered individually.
However, we have also specified the covariances between contemporaneous error (residual) terms for different assets at the same time period.
The regression coefficients (in the case of the Fama-French model, $\alpha_i$, $b_i$, $s_i$, and $h_i$), standard errors, and t-statistics may all be estimated by running a standard OLS regression for each asset. The estimated coefficients have all the usual properties (which we derived earlier).
As before, we can derive some of the joint properties of the coefficients estimated in different regressions.
Recall that:
\[ \hat\beta = \left(X^T X\right)^{-1} X^T Y \]
We will index $\hat\beta$ and $Y$ by $i$, to indicate different assets, i.e., $\hat\beta_i$ and $Y_i$:
\[ \hat\beta_i = \left(X^T X\right)^{-1} X^T Y_i \]
We then have:
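\[
\begin{aligned}
\operatorname{Cov}\left[\hat\beta_i, \hat\beta_j \mid X\right]
&= E\left[\hat\beta_i \hat\beta_j^T \mid X\right] - E\left[\hat\beta_i \mid X\right] E\left[\hat\beta_j^T \mid X\right] \\
&= E\left[\left(X^T X\right)^{-1} \left(X^T Y_i\right) \left(Y_j^T X\right) \left(X^T X\right)^{-1} \,\middle|\, X\right] - \beta_i \beta_j^T \\
&= E\left[\left(X^T X\right)^{-1} \left(X^T (X\beta_i + \epsilon_i)\right) \left((X\beta_j + \epsilon_j)^T X\right) \left(X^T X\right)^{-1} \,\middle|\, X\right] - \beta_i \beta_j^T \\
&= E\left[\left(\beta_i + \left(X^T X\right)^{-1} X^T \epsilon_i\right) \left(\beta_j^T + \epsilon_j^T X \left(X^T X\right)^{-1}\right) \,\middle|\, X\right] - \beta_i \beta_j^T
\end{aligned}
\]
Expanding the product:
\[
\begin{aligned}
\operatorname{Cov}\left[\hat\beta_i, \hat\beta_j \mid X\right]
&= \beta_i \beta_j^T
+ \beta_i \underbrace{E\left[\epsilon_j^T X \left(X^T X\right)^{-1} \,\middle|\, X\right]}_{=0}
+ \underbrace{E\left[\left(X^T X\right)^{-1} X^T \epsilon_i \,\middle|\, X\right]}_{=0} \beta_j^T \\
&\quad + E\left[\left(X^T X\right)^{-1} X^T \epsilon_i \epsilon_j^T X \left(X^T X\right)^{-1} \,\middle|\, X\right] - \beta_i \beta_j^T
\end{aligned}
\]
Finally:
\[
\begin{aligned}
\operatorname{Cov}\left[\hat\beta_i, \hat\beta_j \mid X\right]
&= \left(X^T X\right)^{-1} X^T E\left[\epsilon_i \epsilon_j^T \mid X\right] X \left(X^T X\right)^{-1} \\
&= \left(X^T X\right)^{-1} \sigma^2_{ij}
\end{aligned}
\]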
Recall that the $\hat\beta_i$ and $\hat\beta_j$ vectors include the $\hat\alpha_i$ and $\hat\alpha_j$ coefficients as their first elements. For purposes of testing an asset pricing model, only the covariances between the $\hat\alpha$ coefficients for different assets matter, although the expression just derived gives us the covariance between every element of $\hat\beta_i$ and $\hat\beta_j$. We will denote by:
\[ \left[\left(X^T X\right)^{-1}\right]_{11} \]
the element in the first row and first column of the matrix $(X^T X)^{-1}$.
Then:
\[ \operatorname{Cov}[\hat\alpha_i, \hat\alpha_j \mid X] = \left[\left(X^T X\right)^{-1}\right]_{11} \sigma^2_{ij} \]
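We can arrange the $\hat\alpha_i$ coefficients in a vector, with one element for each of the $M$ assets:
\[ \hat\alpha = \begin{bmatrix} \hat\alpha_1 \\ \vdots \\ \hat\alpha_M \end{bmatrix} \]
If we give the covariance matrix of the residuals a name:
\[ \Sigma = \operatorname{Var}[\epsilon \mid X] \]
then we can write:
\[ \operatorname{Var}[\hat\alpha \mid X] = \left[\left(X^T X\right)^{-1}\right]_{11} \Sigma \]
This last result is a conditional variance; trying to write an unconditional variance for $\hat\alpha$ would require knowledge of the full distribution of $X$, and would be challenging even then.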
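A test statistic could therefore be:
\[
\begin{aligned}
\chi^2 &= \hat\alpha^T \left(\operatorname{Var}[\hat\alpha \mid X]\right)^{-1} \hat\alpha \\
&= \hat\alpha^T \left(\left[\left(X^T X\right)^{-1}\right]_{11} \Sigma\right)^{-1} \hat\alpha \\
&= \frac{\hat\alpha^T \Sigma^{-1} \hat\alpha}{\left[\left(X^T X\right)^{-1}\right]_{11}}
\end{aligned}
\]
This statistic has a chi-squared distribution with $M$ degrees of freedom (where $M$ is the number of assets used to perform the test). In what should be a now familiar pattern, though, the problem with the test statistic is that it cannot be implemented; it contains unknown parameters (the elements of $\Sigma$). So we need to use estimates instead.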
We already know how to estimate $\operatorname{Var}[\epsilon_i]$:
\[ s_i^2 = \frac{1}{T - N - 1} \sum_{t=1}^{T} \hat\epsilon_{i,t}^2 \]
Recall that this estimate of $\operatorname{Var}[\epsilon_i]$ is unbiased:
\[ E\left[s_i^2\right] = \operatorname{Var}[\epsilon_i] \]
The $T - N - 1$ is used instead of $T$, because there are $N + 1$ estimated parameters in the regression; this causes the estimated residuals to be slightly smaller than the actual residuals. If we used $T$, we would underestimate the variance of the residuals.
Under the assumption of normality, the estimated variance of the residuals has a chi-square distribution (after appropriate scaling).
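We can estimate the covariance between the residuals for two different regressions in an analogous way:
\[ s_{ij}^2 = \frac{1}{T - N - 1} \sum_{t=1}^{T} \hat\epsilon_{i,t} \hat\epsilon_{j,t} \]
This estimate is also unbiased:
\[ E\left[s_{ij}^2\right] = \operatorname{Cov}[\epsilon_i, \epsilon_j] \]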
We can arrange all the estimates in a matrix, which is the sample counterpart to $\Sigma$:
\[
\hat\Sigma = \begin{bmatrix}
s_{11}^2 & \cdots & s_{1M}^2 \\
\vdots & \ddots & \vdots \\
s_{M1}^2 & \cdots & s_{MM}^2
\end{bmatrix}
\]
(Side note: under the assumption of multivariate normality, the entire matrix $\hat\Sigma$ has a Wishart distribution.)
We can use the estimated $\hat\Sigma$ in place of the actual (but unknown) $\Sigma$ matrix to form our test statistic.
The new test statistic is therefore:
\[ t^2 = \frac{\hat\alpha^T \hat\Sigma^{-1} \hat\alpha}{\left[\left(X^T X\right)^{-1}\right]_{11}} \]
As before, the distribution of the $t^2$ statistic is Hotelling's t-square, and it has a close relation to the F distribution:
\[ F = \frac{T - M - N}{M (T - N - 1)} \, t^2 \]
has an F distribution with $M$ and $T - M - N$ degrees of freedom. Recall that $M$ is the number of assets used in the test, and $N$ is the number of explanatory factors included in the model.
The test procedure for a multiple-factor model is therefore:
1. Run a regression of the excess returns for an asset, $R_{i,t} - R_{f,t}$, on the explanatory factors, e.g., RMRF, SMB, and HML for the Fama-French model, to find the estimated coefficients, $\hat\beta_i$. Recall that for each asset, $\hat\beta_i$ is a vector, which includes $\hat\alpha_i$ as its first element. Run such a regression for each asset.
2. Use the estimated residuals, the $\hat\epsilon_{i,t}$, to estimate the variance of each residual term $\operatorname{Var}[\epsilon_i]$, and the covariance for each pair of residual terms $\operatorname{Cov}[\epsilon_i, \epsilon_j]$.
3. Arrange the estimated $\hat\alpha_i$ in an $M \times 1$ vector, and the estimated $s_{ij}^2$ in an $M \times M$ matrix.
4. Calculate the matrix $(X^T X)^{-1}$, and choose the element $\left[(X^T X)^{-1}\right]_{11}$ in the first row and first column.
5. Calculate the test statistic:
\[ F = \frac{T - M - N}{M (T - N - 1)} \cdot \frac{\hat\alpha^T \hat\Sigma^{-1} \hat\alpha}{\left[\left(X^T X\right)^{-1}\right]_{11}} \]
6. Choose a confidence level (or equivalently, the probability of a Type I error).
7. Compare the F statistic from Step 5 to the cut-off value for an F distribution with $M$ and $T - M - N$ degrees of freedom, for the confidence level chosen in Step 6. Reject (i.e., conclude the model is false) if the F statistic is larger than the cut-off value, and fail to reject if the F statistic is smaller than the cut-off value.
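Steps 1 to 5 of the procedure above can be sketched as follows. The function name `multifactor_f_statistic` and the simulated returns are assumptions made for illustration; the cut-off value in Step 7 is not computed here and would come from an F table or a statistics package.

```python
import numpy as np

def multifactor_f_statistic(excess_returns, factors):
    """F statistic for the joint hypothesis that every alpha_i is zero.

    excess_returns: T x M array (one column per asset).
    factors:        T x N array (one column per factor).
    Returns the statistic and its two degrees-of-freedom parameters.
    """
    T, M = excess_returns.shape
    N = factors.shape[1]
    X = np.column_stack([np.ones(T), factors])
    XtX_inv = np.linalg.inv(X.T @ X)
    B = XtX_inv @ X.T @ excess_returns           # (N+1) x M, one column per asset
    alpha_hat = B[0]                             # Step 1: estimated alphas
    resid = excess_returns - X @ B               # T x M estimated residuals
    sigma_hat = resid.T @ resid / (T - N - 1)    # Steps 2-3: M x M matrix
    # Step 4 uses the first diagonal element of (X^T X)^{-1}.
    t2 = alpha_hat @ np.linalg.solve(sigma_hat, alpha_hat) / XtX_inv[0, 0]
    f_stat = (T - M - N) / (M * (T - N - 1)) * t2   # Step 5
    return f_stat, M, T - M - N

# Simulated (hypothetical) data: 5 assets, 3 factors, 240 periods.
rng = np.random.default_rng(5)
T, M, N = 240, 5, 3
factors = 0.04 * rng.normal(size=(T, N))
loadings = rng.normal(size=(N, M))
returns_null = factors @ loadings + 0.02 * rng.normal(size=(T, M))  # all alphas zero

f_null, df1, df2 = multifactor_f_statistic(returns_null, factors)
f_alt, _, _ = multifactor_f_statistic(returns_null + 0.01, factors)  # alphas of 1%
```

Adding a constant to every return shifts each estimated alpha by that constant and leaves the residuals unchanged, so `f_alt` is far larger than `f_null`, which is the behaviour the test relies on.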
Robust Standard Errors
Robust Standard Errors White (1980) Method
Recall linear regression. The estimated $\hat\beta$ coefficients are:
\[
\hat\beta = \left(X^T X\right)^{-1} X^T Y = \left(X^T X\right)^{-1} X^T (X\beta + \epsilon) = \beta + \left(X^T X\right)^{-1} X^T \epsilon
\]
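We can find the variance of $\hat\beta$, conditional on $X$:
\[
\begin{aligned}
\operatorname{Var}\left[\hat\beta \mid X\right]
&= E\left[\left(\hat\beta - \beta\right) \left(\hat\beta - \beta\right)^T \,\middle|\, X\right] \\
&= E\left[\left(X^T X\right)^{-1} X^T \epsilon \epsilon^T X \left(X^T X\right)^{-1} \,\middle|\, X\right] \\
&= \left(X^T X\right)^{-1} X^T E\left[\epsilon \epsilon^T \mid X\right] X \left(X^T X\right)^{-1}
\end{aligned}
\]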
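Under the homoscedasticity assumption, we have:
\[ E\left[\epsilon \epsilon^T \mid X\right] = I \sigma^2 \]
where $I$ is the identity matrix.
In that case, we also have:
\[ \operatorname{Var}\left[\hat\beta \mid X\right] = \sigma^2 \left(X^T X\right)^{-1} \]
Then $\sigma^2$ can be estimated by:
\[ \hat\sigma^2 = \frac{1}{T - N - 1} \sum_{t=1}^{T} \hat\epsilon_t^2 \]
where the $\hat\epsilon_t$ are the estimated residuals from the regression, $T$ is the number of observations, and $N$ is the number of $X$ variables (not counting the constant).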
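Similarly, the covariance between the $\hat\beta$ coefficients for two different regressions (with the same $X$ variables) is:
\[ \operatorname{Cov}\left[\hat\beta_i, \hat\beta_j \mid X\right] = \sigma^2_{ij} \left(X^T X\right)^{-1} \]
where $\sigma^2_{ij}$ can be estimated by:
\[ \hat\sigma^2_{ij} = \frac{1}{T - N - 1} \sum_{t=1}^{T} \hat\epsilon_{i,t} \hat\epsilon_{j,t} \]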
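What if the errors are not homoscedastic?
The expression
\[
\operatorname{Var}\left[\hat\beta \mid X\right] = \left(X^T X\right)^{-1} X^T E\left[\epsilon \epsilon^T \mid X\right] X \left(X^T X\right)^{-1}
\]
is still valid, as is the analogous result for two different regressions (with the same $X$ variables):
\[
\operatorname{Cov}\left[\hat\beta_i, \hat\beta_j \mid X\right] = \left(X^T X\right)^{-1} X^T E\left[\epsilon_i \epsilon_j^T \mid X\right] X \left(X^T X\right)^{-1}
\]
The method of White (1980) allows estimation of the standard errors in a way that is consistent for many different forms of heteroscedasticity.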
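White (1980): simply estimate the covariance of each pair of residuals by the products of the estimated residuals:
\[
\begin{bmatrix}
\hat\epsilon_{i,1} \hat\epsilon_{j,1} & 0 & \cdots & 0 & 0 \\
0 & \hat\epsilon_{i,2} \hat\epsilon_{j,2} & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \hat\epsilon_{i,T-1} \hat\epsilon_{j,T-1} & 0 \\
0 & 0 & \cdots & 0 & \hat\epsilon_{i,T} \hat\epsilon_{j,T}
\end{bmatrix}
\approx E\left[\epsilon_i \epsilon_j^T \mid X\right]
\]
Note that this matrix is not estimated consistently: there is only one observation for each diagonal element!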
However, this matrix is not the end result. We plug this matrix into the expression for the covariance of $\hat\beta_i$ and $\hat\beta_j$.
If the homoscedasticity assumption is wrong, then the variance of $\hat\beta$ (or the covariance between $\hat\beta_i$ and $\hat\beta_j$) is estimated inconsistently by the simple method. Statistical tests based on this estimated variance matrix will be inaccurate.
By contrast, the variance of $\hat\beta$ (or the covariance between $\hat\beta_i$ and $\hat\beta_j$) is estimated consistently for many different types of heteroscedasticity using White's method.
If the homoscedasticity assumption is correct, White's method still produces consistent estimates of the variance of $\hat\beta$. But they tend to be less accurate than those obtained by using the simple method.
Note that White's method is strictly for calculating the standard errors. The coefficient estimates $\hat\beta$ themselves are still exactly the same as in the traditional OLS approach; White's method simply gives us a better idea of how accurate these estimates are in the presence of heteroscedasticity.
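A minimal sketch of the two variance estimates for a single regression follows. The function name `white_covariance` and the simulated heteroscedastic data are assumptions for illustration. The robust estimate plugs the diagonal matrix of squared estimated residuals into the sandwich expression, with no small-sample adjustment.

```python
import numpy as np

def white_covariance(X, y):
    """OLS coefficients with the simple and White (1980) covariance estimates."""
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    # Simple (homoscedastic) estimate: s^2 (X^T X)^{-1}.
    simple = resid @ resid / (T - k) * XtX_inv
    # White: replace E[eps eps^T | X] by diag(resid_t^2) in the sandwich formula.
    middle = (X * resid[:, None] ** 2).T @ X     # X^T diag(resid^2) X
    white = XtX_inv @ middle @ XtX_inv
    return beta_hat, simple, white

# Simulated data in which the residual variance grows with |x| (hypothetical).
rng = np.random.default_rng(6)
T = 2000
x = rng.normal(size=T)
X = np.column_stack([np.ones(T), x])
eps = rng.normal(size=T) * (0.5 + np.abs(x))
y = 1.0 + 2.0 * x + eps

beta_hat, simple, white = white_covariance(X, y)
robust_se = np.sqrt(np.diag(white))              # White standard errors
```

In this example the residual variance is largest where $x$ is far from its mean, so the simple formula understates the variance of the slope estimate and the White estimate is noticeably larger; the coefficient estimates themselves are identical under both approaches.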
A similar method from Newey and West (1987) deals with correlation between residuals, but it is much more fragile than the White (1980) method.
GMM
GMM Mean-Variance Efficiency
Almost every single-factor asset pricing model has the same basic form:
\[
E[R_i] = R_f + \beta_i \lambda
\qquad
\beta_i \equiv \frac{\operatorname{Cov}[R_i, F]}{\operatorname{Var}[F]}
\]
The factor $F$ may be the excess return of a traded asset, or it may be some macroeconomic variable that is not necessarily traded. For example, in the CAPM, the factor $F$ is the excess return of the market portfolio.
Whenever the factor is the excess return of a portfolio, the risk premium $\lambda$ must be the average excess return of the factor portfolio, or the model does not price the factor itself correctly.
E.g., CAPM:
\[
E[R_i] - R_f = \beta_i \left(E[R_M] - R_f\right)
\qquad
\beta_i \equiv \frac{\operatorname{Cov}[R_i, R_M]}{\operatorname{Var}[R_M]}
\]
A single-factor asset pricing model whose factor is a portfolio excess return works (in the sense that it prices all assets correctly) if and only if the factor portfolio is mean-variance efficient.
Proof: when a risk-free asset exists, a portfolio with return $R_X$ is mean-variance efficient if and only if it has the highest possible (in absolute magnitude) Sharpe ratio, i.e., for any portfolio at all with return $R_Y$:
\[
\left| \frac{E[R_X] - R_f}{\operatorname{SD}[R_X]} \right| \geq \left| \frac{E[R_Y] - R_f}{\operatorname{SD}[R_Y]} \right|
\]
The two directions to be shown are:
1. (Factor model prices all assets correctly) $\Longrightarrow$ (Factor portfolio is mean-variance efficient)
2. (Factor portfolio is mean-variance efficient) $\Longrightarrow$ (Factor model prices all assets correctly)
Starting with the first direction: suppose a single-factor model (with a
traded factor) prices all the assets correctly:
$$E[R_i] = R_f + \beta_i \left(E[R_X] - R_f\right) \qquad \beta_i \equiv \frac{\mathrm{Cov}[R_i, R_X]}{\mathrm{Var}[R_X]}$$
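We first note the following fact about the variance of $R_i$:
$$\begin{aligned}
\mathrm{Var}[R_i] &= \mathrm{Var}[(R_i - \beta_i R_X) + \beta_i R_X] \\
&= \beta_i^2\,\mathrm{Var}[R_X] + \mathrm{Var}[R_i - \beta_i R_X] + 2\beta_i\,\mathrm{Cov}[R_X, R_i - \beta_i R_X]
\end{aligned}$$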
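The last term on the right-hand side is zero:
$$\begin{aligned}
\mathrm{Cov}[R_X, R_i - \beta_i R_X] &= \mathrm{Cov}[R_X, R_i] - \beta_i\,\mathrm{Cov}[R_X, R_X] \\
&= \mathrm{Cov}[R_X, R_i] - \frac{\mathrm{Cov}[R_i, R_X]}{\mathrm{Var}[R_X]}\,\mathrm{Cov}[R_X, R_X] \\
&= 0
\end{aligned}$$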
It follows that:
$$\mathrm{Var}[R_i] = \beta_i^2\,\mathrm{Var}[R_X] + \mathrm{Var}[R_i - \beta_i R_X]$$
The last term on the right-hand side is a variance, so it cannot be negative. It can be
zero, which is the case if the return $R_i$ is perfectly (positively or negatively)
correlated with the return $R_X$. So it is necessarily the case that:
$$\mathrm{Var}[R_i] \geq \beta_i^2\,\mathrm{Var}[R_X]$$
and therefore:
$$\mathrm{SD}[R_i] \geq |\beta_i|\,\mathrm{SD}[R_X]$$
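Turning to the Sharpe ratio of $R_i$:
$$\begin{aligned}
\frac{E[R_i] - R_f}{\mathrm{SD}[R_i]} &= \frac{\left(R_f + \beta_i \left(E[R_X] - R_f\right)\right) - R_f}{\mathrm{SD}[R_i]} \\
&= \frac{\beta_i \left(E[R_X] - R_f\right)}{\mathrm{SD}[R_i]} \\
&= \underbrace{\frac{E[R_X] - R_f}{\mathrm{SD}[R_X]}}_{\text{Sharpe ratio of } R_X} \left(\frac{\beta_i\,\mathrm{SD}[R_X]}{\mathrm{SD}[R_i]}\right)
\end{aligned}$$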
Note that the second factor on the right-hand side must have absolute
value of one or less. It follows that the Sharpe ratio of $R_i$ is, in absolute value,
less than or equal to the Sharpe ratio of $R_X$.
This proves the first implication; if a traded factor prices all assets
correctly, it must be mean-variance efficient.
To prove the second direction, assume $R_X$ is the return of any mean-variance efficient
portfolio (except the risk-free asset). We now consider any other asset or portfolio
with return $R_i$. Consider the portfolio $R_{i,\epsilon}$, defined as:
$$R_{i,\epsilon} \equiv \epsilon R_i + (1 - \epsilon) R_X$$
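Then:
$$E[R_{i,\epsilon}] = \epsilon\,E[R_i] + (1 - \epsilon)\,E[R_X]$$
and:
$$\begin{aligned}
\mathrm{Var}[R_{i,\epsilon}] &= \epsilon^2\,\mathrm{Var}[R_i] + (1 - \epsilon)^2\,\mathrm{Var}[R_X] + 2\epsilon (1 - \epsilon)\,\mathrm{Cov}[R_i, R_X] \\
&= \mathrm{Var}[R_X] + \epsilon \left(2\,\mathrm{Cov}[R_i, R_X] - 2\,\mathrm{Var}[R_X]\right) \\
&\quad + \epsilon^2 \left(\mathrm{Var}[R_i] + \mathrm{Var}[R_X] - 2\,\mathrm{Cov}[R_i, R_X]\right) \\
&= \mathrm{Var}[R_X] + \epsilon \left(2\,\mathrm{Cov}[R_i, R_X] - 2\,\mathrm{Var}[R_X]\right) + \epsilon^2\,\mathrm{Var}[R_i - R_X]
\end{aligned}$$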
The absolute value of the Sharpe ratio of $R_{i,\epsilon}$ must be less than or equal
to the absolute value of the Sharpe ratio of $R_X$.
First, dispense with the trivial case in which $R_X$ has a Sharpe ratio of
zero. In that case the Sharpe ratio of every asset must also be zero, which means every
asset earns the risk-free rate in expectation. Then:
$$E[R_i] = R_f + \beta_i \left(E[R_X] - R_f\right)$$
trivially, because every asset earns the risk-free rate on average, and the
factor in parentheses is zero.
For a non-trivial case, suppose that the Sharpe ratio of $R_X$ is positive.
Then it must be the case that:
$$\frac{E[R_{i,\epsilon}] - R_f}{\mathrm{SD}[R_{i,\epsilon}]} \leq \frac{E[R_X] - R_f}{\mathrm{SD}[R_X]}$$
This relation must hold for every choice of $R_i$ and every choice of $\epsilon$.
Rearranging a bit and squaring both sides, we find:
$$\left(E[R_{i,\epsilon}] - R_f\right)^2 \mathrm{Var}[R_X] \leq \left(E[R_X] - R_f\right)^2 \mathrm{Var}[R_{i,\epsilon}]$$
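Each side of the inequality is a quadratic function of $\epsilon$.
Considering the left-hand side first:
$$\begin{aligned}
\left(E[R_{i,\epsilon}] - R_f\right)^2 \mathrm{Var}[R_X] &= \left(\epsilon\,E[R_i] + (1 - \epsilon)\,E[R_X] - R_f\right)^2 \mathrm{Var}[R_X] \\
&= \left(E[R_X] - R_f\right)^2 \mathrm{Var}[R_X] \\
&\quad + 2\epsilon \left(E[R_X] - R_f\right) \left(E[R_i] - E[R_X]\right) \mathrm{Var}[R_X] \\
&\quad + \epsilon^2 \left(E[R_i] - E[R_X]\right)^2 \mathrm{Var}[R_X]
\end{aligned}$$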
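Now the right-hand side:
$$\begin{aligned}
\left(E[R_X] - R_f\right)^2 \mathrm{Var}[R_{i,\epsilon}] &= \left(E[R_X] - R_f\right)^2 \mathrm{Var}[R_X] \\
&\quad + 2\epsilon \left(E[R_X] - R_f\right)^2 \left(\mathrm{Cov}[R_i, R_X] - \mathrm{Var}[R_X]\right) \\
&\quad + \epsilon^2 \left(E[R_X] - R_f\right)^2 \mathrm{Var}[R_i - R_X]
\end{aligned}$$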
Note that the terms that do not depend on $\epsilon$ are identical.
Consider very small (positive or negative) values of $\epsilon$, such that the
$\epsilon^2$ terms can safely be ignored. Then it must be the case that:
$$2\epsilon \left(E[R_X] - R_f\right) \left(E[R_i] - E[R_X]\right) \mathrm{Var}[R_X] \leq 2\epsilon \left(E[R_X] - R_f\right)^2 \left(\mathrm{Cov}[R_i, R_X] - \mathrm{Var}[R_X]\right)$$
This result must hold for both positive and negative $\epsilon$. This can only be
the case if:
$$\left(E[R_X] - R_f\right) \left(E[R_i] - E[R_X]\right) \mathrm{Var}[R_X] = \left(E[R_X] - R_f\right)^2 \left(\mathrm{Cov}[R_i, R_X] - \mathrm{Var}[R_X]\right)$$
After some manipulation, this becomes:
$$E[R_i] = R_f + \frac{\mathrm{Cov}[R_i, R_X]}{\mathrm{Var}[R_X]} \left(E[R_X] - R_f\right) = R_f + \beta_i \left(E[R_X] - R_f\right)$$
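One way to carry out the manipulation, offered here as a sketch rather than as the original derivation: in the non-trivial case $E[R_X] \neq R_f$ and $\mathrm{Var}[R_X] > 0$, so both sides of the equality between the terms linear in the portfolio weight can be divided by $\left(E[R_X] - R_f\right) \mathrm{Var}[R_X]$.

```latex
\begin{aligned}
E[R_i] - E[R_X] &= \left(E[R_X] - R_f\right) \frac{\mathrm{Cov}[R_i, R_X] - \mathrm{Var}[R_X]}{\mathrm{Var}[R_X]} \\
                &= \beta_i \left(E[R_X] - R_f\right) - \left(E[R_X] - R_f\right) \\
E[R_i]          &= R_f + \beta_i \left(E[R_X] - R_f\right)
\end{aligned}
```

The $E[R_X]$ terms cancel in the last step, leaving the familiar single-factor pricing relation.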
If the Sharpe ratio of $R_X$ is negative, only minor modification is required.
This proves the other direction of the implication.
So a single-factor model (with a traded factor) prices all assets correctly if
and only if the factor is a mean-variance efficient portfolio.
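The equivalence can be checked numerically. The sketch below uses hypothetical expected returns and covariances for three risky assets, forms the maximum Sharpe ratio portfolio with weights proportional to $\mathrm{Var}[R]^{-1} (E[R] - R_f)$, and then compares each asset's expected return with the single-factor prediction. The helper `solve` and all numbers are illustrative assumptions.

```python
def solve(a, b):
    """Solve a x = b for a small square system by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Hypothetical inputs (illustrative numbers only)
rf = 0.02
mu = [0.06, 0.09, 0.12]
cov = [[0.040, 0.006, 0.010],
       [0.006, 0.090, 0.020],
       [0.010, 0.020, 0.160]]

# Mean-variance efficient (maximum Sharpe ratio) weights, normalised to sum to one
raw = solve(cov, [m - rf for m in mu])
w = [x / sum(raw) for x in raw]

mean_x = sum(wi * mi for wi, mi in zip(w, mu))                    # E[R_X]
cov_with_x = [sum(cov[i][j] * w[j] for j in range(3)) for i in range(3)]  # Cov[R_i, R_X]
var_x = sum(w[i] * cov_with_x[i] for i in range(3))               # Var[R_X]

betas = [c / var_x for c in cov_with_x]
predicted = [rf + b * (mean_x - rf) for b in betas]               # single-factor prediction
```

With an efficient factor portfolio, `predicted` matches `mu` asset by asset; replacing `w` with arbitrary weights generally breaks the match.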
What about multiple-factor models?
Assume all the factors are traded, and arrange them in a column vector $R_X$. A
typical multiple-factor model has the form:
$$E[R_i] = R_f + \beta_i^T \left(E[R_X] - R_f\right) \qquad \beta_i = \mathrm{Var}[R_X]^{-1}\,\mathrm{Cov}[R_X, R_i]$$
We now note that for every multiple-factor model (whether or not it prices
all the assets correctly), there exists an exactly equivalent single-factor
model, in the sense that both models predict the same expected returns.
In the single-factor model, the factor can be expressed as a portfolio of the
factors in the multiple-factor model.
Consider a single-factor model (we will call the factor $R_Z$), where the
factor can be expressed as:
$$R_Z = w^T R_X \qquad \text{with } w^T \iota = 1$$
where $R_X$ are the factors in the multiple-factor model, and $\iota$ is simply a
column vector with each element equal to one. Then:
$$E[R_Z] = w^T E[R_X] \qquad \mathrm{Var}[R_Z] = w^T\,\mathrm{Var}[R_X]\,w \qquad \mathrm{Cov}[R_Z, R_i] = w^T\,\mathrm{Cov}[R_X, R_i]$$
The prediction of a model based on $R_Z$ is therefore:
$$\begin{aligned}
E[R_i] &= R_f + \beta_{i,Z} \left(E[R_Z] - R_f\right) \\
&= R_f + \frac{w^T\,\mathrm{Cov}[R_X, R_i]}{w^T\,\mathrm{Var}[R_X]\,w} \left(w^T E[R_X] - R_f\right)
\end{aligned}$$
Setting the predictions of the single- and multiple-factor models equal to
each other:
$$R_f + \mathrm{Cov}[R_i, R_X]\,\mathrm{Var}[R_X]^{-1} \left(E[R_X] - R_f\right) \overset{?}{=} R_f + \frac{w^T\,\mathrm{Cov}[R_X, R_i]}{w^T\,\mathrm{Var}[R_X]\,w} \left(w^T E[R_X] - R_f\right)$$
Some slight rearrangement:
$$\mathrm{Cov}[R_i, R_X]\,\mathrm{Var}[R_X]^{-1} \left(E[R_X] - R_f\right) \overset{?}{=} \frac{\mathrm{Cov}[R_i, R_X]\,w}{w^T\,\mathrm{Var}[R_X]\,w} \left(w^T E[R_X] - R_f\right)$$
We need to choose $w$ to make this relation true.
We can now choose a specific value of $w$:
$$w = \frac{\mathrm{Var}[R_X]^{-1} \left(E[R_X] - R_f\right)}{\iota^T\,\mathrm{Var}[R_X]^{-1} \left(E[R_X] - R_f\right)}$$
Note that the elements of $w$ add up to one, i.e., $w^T \iota = 1$.
Also, one can verify that with this choice of $w$, the predictions of the
single- and multiple-factor models are the same, i.e., the preceding
equations hold.
Note that the existence of an equivalent single-factor model does not
depend on the correctness of the multiple-factor model.
With the particular choice of $w$ above, $R_Z$ is mean-variance efficient among all
portfolios that can be formed from the $R_X$.
Every multiple-factor model with traded factors is therefore equivalent to a
single-factor model, where the single factor is a portfolio of the factors in
the original model.
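The claim that the two models make the same predictions can be checked on a small case. The sketch below assumes two traded factors and one test asset, with purely hypothetical moments, and compares the multiple-factor prediction with the prediction of the single factor $R_Z = w^T R_X$.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical inputs (illustrative numbers only)
rf = 0.01
mean_x = [0.07, 0.04]                      # E[R_X] for the two traded factors
var_x = [[0.030, 0.004], [0.004, 0.020]]   # Var[R_X]
cov_xi = [0.012, -0.003]                   # Cov[R_X, R_i] for some test asset i

excess = [m - rf for m in mean_x]

# Multiple-factor prediction: R_f + beta_i^T (E[R_X] - R_f)
beta_multi = matvec(inv2(var_x), cov_xi)
pred_multi = rf + dot(beta_multi, excess)

# Equivalent single factor: w proportional to Var[R_X]^{-1} (E[R_X] - R_f)
raw = matvec(inv2(var_x), excess)
w = [r / sum(raw) for r in raw]
beta_z = dot(w, cov_xi) / dot(w, matvec(var_x, w))
pred_single = rf + beta_z * (dot(w, mean_x) - rf)
```

Note that `cov_xi` was chosen arbitrarily: the two predictions agree whether or not either model prices the asset correctly.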
Recall the test of a multiple-factor model. This test can be viewed as a
test of an entire class of single-factor models, i.e., does there exist any
single-factor model, among those that can be constructed from the
multiple factors, that prices all of the assets correctly?
Although the test procedure answered this question, it did not (at least
not directly) produce an estimate of the equivalent single-factor model.
We will look at another test procedure that does estimate the equivalent
single-factor model as well. However, first, we will derive some results on
models with non-traded factors.
Consider a single-factor model, where the factor $F$ is not necessarily
traded. The prediction of the model is:
$$E[R_i] = R_f + \beta_i \lambda \qquad \beta_i = \frac{\mathrm{Cov}[R_i, F]}{\mathrm{Var}[F]}$$
Suppose the asset returns are all arranged in a vector, called $R$. The factor
$F$ can be decomposed into a traded and non-traded component:
$$F = \gamma_0 + \gamma^T R + \eta \qquad E[\eta] = 0 \qquad \mathrm{Cov}[R, \eta] = 0$$
It is possible to replace the factor with the traded component $\gamma^T R$, and
produce an alternative model which makes exactly the same predictions.
(Note that there is no requirement here that the elements of $\gamma$ sum up to
one.)
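Define the alternate factor:
$$F^* = \gamma^T R$$
We first look at the relationship between the covariances of the assets with
$F$, and with $F^*$. Note that they are the same:
$$\begin{aligned}
\mathrm{Cov}[R, F] &= \mathrm{Cov}\left[R, \gamma_0 + \gamma^T R + \eta\right] \\
&= \mathrm{Cov}\left[R, \gamma^T R\right] + \mathrm{Cov}[R, \eta] \\
&= \mathrm{Cov}\left[R, \gamma^T R\right] = \mathrm{Cov}\left[R, F^*\right]
\end{aligned}$$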
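The variances of $F$ and $F^*$ are (in general) not the same:
$$\begin{aligned}
\mathrm{Var}[F] &= \mathrm{Var}\left[\gamma_0 + \gamma^T R + \eta\right] = \mathrm{Var}\left[\gamma^T R + \eta\right] \\
&= \mathrm{Var}\left[\gamma^T R\right] + \mathrm{Var}[\eta] + 2\,\mathrm{Cov}\left[\gamma^T R, \eta\right] \\
&= \mathrm{Var}\left[F^*\right] + \mathrm{Var}[\eta]
\end{aligned}$$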
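The relation between the $\beta$ coefficients with respect to $F$ and $F^*$ follows:
$$\begin{aligned}
\beta_i^* &= \frac{\mathrm{Cov}[R_i, F^*]}{\mathrm{Var}[F^*]} = \frac{\mathrm{Cov}[R_i, F]}{\mathrm{Var}[F]} \cdot \frac{\mathrm{Var}[F]}{\mathrm{Var}[F^*]} \\
&= \beta_i\,\frac{\mathrm{Var}[F^*] + \mathrm{Var}[\eta]}{\mathrm{Var}[F^*]} = \beta_i \left(1 + \frac{\mathrm{Var}[\eta]}{\mathrm{Var}[F^*]}\right)
\end{aligned}$$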
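The prediction of a model based on $F^*$ instead of $F$ would therefore be:
$$E[R_i] = R_f + \beta_i^* \lambda^* = R_f + \beta_i \left(1 + \frac{\mathrm{Var}[\eta]}{\mathrm{Var}[F^*]}\right) \lambda^*$$
The prediction of a model based on $F$ is:
$$E[R_i] = R_f + \beta_i \lambda$$
The predictions of the two models are therefore identical, provided:
$$\lambda = \lambda^* \left(1 + \frac{\mathrm{Var}[\eta]}{\mathrm{Var}[F^*]}\right)$$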
There is no guarantee that the elements of $\gamma$ add up to one. However, we
can normalise the weights:
$$F^{**} = \frac{\gamma^T R}{\gamma^T \iota}$$
It is a straightforward exercise to show that the model with $F^{**}$ is exactly
equivalent to the original model with $F$, provided:
$$\lambda = \gamma^T \iota\;\lambda^{**} \left(1 + \frac{\mathrm{Var}[\eta]}{\mathrm{Var}[F^*]}\right)$$
Any single-factor model with a non-traded factor is therefore exactly
equivalent to another single-factor model with a traded factor.
The traded factor $F^{**}$ is a portfolio that, out of all possible portfolios, is
maximally correlated with the original factor, $F$.
The risk premium of the traded factor is, in general, different from that of
the non-traded factor.
We already know that a single-factor model with a traded factor prices all
assets correctly if and only if the traded factor is mean-variance efficient.
It follows that a single-factor model with a non-traded factor prices all
assets correctly if and only if the portfolio that is maximally correlated
with the factor is mean-variance efficient.
A similar result holds for multiple-factor models with non-traded factors.
1. There exists an exactly equivalent model with all traded factors.
2. The traded factors are the returns of the portfolios that are maximally correlated with the original factors.
3. The risk premia of the traded factors are different from the risk premia of the non-traded factors.
4. The multiple-factor model explains all asset returns correctly if and only if some combination of the maximally correlated portfolios is mean-variance efficient.
GMM Stochastic Discount Factor
Stochastic discount factor: a random variable $m_{t,T}$ with the property
that, for any asset or portfolio with future cash flow $X_T$, the current price
$P_t$ of the asset or portfolio is given by:
$$P_t = E[m_{t,T} X_T]$$
Although it looks different, a stochastic discount factor asset pricing
model is exactly equivalent to an expected return/$\beta$ style model.
$$\begin{aligned}
P_t &= E[m_{t,T} X_T] \\
P_t &= E[m_{t,T}]\,E[X_T] + \mathrm{Cov}[m_{t,T}, X_T] \\
\frac{1}{E[m_{t,T}]} &= \frac{E[X_T]}{P_t} + \frac{\mathrm{Cov}[m_{t,T}, X_T]}{P_t\,E[m_{t,T}]} \\
\frac{1}{E[m_{t,T}]} - 1 &= E\left[\frac{X_T}{P_t} - 1\right] + \mathrm{Cov}\left[m_{t,T}, \frac{X_T}{P_t} - 1\right] \frac{1}{E[m_{t,T}]} \\
E[R_{t,T}] &= \frac{1}{E[m_{t,T}]} - 1 - \frac{\mathrm{Cov}[m_{t,T}, R_{t,T}]}{\mathrm{Var}[m_{t,T}]} \cdot \frac{\mathrm{Var}[m_{t,T}]}{E[m_{t,T}]} \\
E[R_{t,T}] &= \frac{1}{E[m_{t,T}]} - 1 - \beta\,\frac{\mathrm{Var}[m_{t,T}]}{E[m_{t,T}]}
\end{aligned}$$
where $R_{t,T} = X_T / P_t - 1$ is the return of the asset and:
$$\beta = \frac{\mathrm{Cov}[m_{t,T}, R_{t,T}]}{\mathrm{Var}[m_{t,T}]}$$
Suppose the stochastic discount factor model prices the risk-free asset
correctly. Then:
$$R_f = \frac{1}{E[m_{t,T}]} - 1 \qquad E[m_{t,T}] = \frac{1}{1 + R_f}$$
Then we can proceed as follows:
$$\begin{aligned}
E[R_{t,T}] &= \frac{1}{E[m_{t,T}]} - 1 - \beta\,\frac{\mathrm{Var}[m_{t,T}]}{E[m_{t,T}]} \\
E[R_{t,T}] &= R_f + \beta \left(-\mathrm{Var}[m_{t,T}]\,(1 + R_f)\right)
\end{aligned}$$
So the stochastic discount factor pricing model is fully equivalent to a $\beta$ pricing
model, with:
$$\lambda = -\mathrm{Var}[m_{t,T}]\,(1 + R_f)$$
Suppose we have a single-factor model, with a traded factor $R_X$. We can
construct a stochastic discount factor of the form:
$$m_{t,T} = a + b R_X$$
We must have:
$$E[m_{t,T}] = a + b\,E[R_X] = \frac{1}{1 + R_f} \qquad a = \frac{1}{1 + R_f} - b\,E[R_X]$$
So we can write:
$$m_{t,T} = \frac{1}{1 + R_f} + b \left(R_X - E[R_X]\right)$$
But we also have:
$$\mathrm{Var}[m_{t,T}] = b^2\,\mathrm{Var}[R_X]$$
So the pricing relation is:
$$E[R_{t,T}] = R_f + \beta \left(-b^2\,\mathrm{Var}[R_X]\,(1 + R_f)\right)$$
The stochastic discount factor should price $R_X$ itself correctly as well.
Note that:
$$\beta_X = \frac{1}{b}$$
Then:
$$E[R_X] = R_f - b\,\mathrm{Var}[R_X]\,(1 + R_f)$$
Solving for $b$:
$$b = \frac{R_f - E[R_X]}{\mathrm{Var}[R_X]\,(1 + R_f)}$$
Putting it all together, we have:
$$m_{t,T} = \frac{\mathrm{Var}[R_X] + \left(R_f - E[R_X]\right) \left(R_X - E[R_X]\right)}{\mathrm{Var}[R_X]\,(1 + R_f)}$$
and:
$$E[R_{t,T}] = R_f + \beta\,b \left(E[R_X] - R_f\right)$$
Note that the $\beta$ above is the coefficient with respect to $m_{t,T}$, not $R_X$.
The coefficient with respect to $R_X$ is $\beta b$, so the above is equivalent to
the more traditional type of pricing relation.
The traditional $\beta$-based and the stochastic discount factor pricing
approaches are equivalent: for any portfolio $R_X$, there exists a stochastic
discount factor $m_{t,T} = a + b R_X$ (with $a$ and $b$ specified above) that prices
all assets equivalently.
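The equivalence can be illustrated on a hypothetical four-state economy. The sketch below builds an asset that satisfies the $\beta$ pricing relation exactly, constructs $m_{t,T} = a + b R_X$ with $a$ and $b$ as given above, and checks that one unit of currency invested in any of the assets has a price of one, i.e., $E[m_{t,T} (1 + R)] = 1$. All state probabilities and returns are illustrative assumptions.

```python
def mean(v, p):
    return sum(pi * vi for pi, vi in zip(p, v))

def cov(u, v, p):
    mu, mv = mean(u, p), mean(v, p)
    return sum(pi * (ui - mu) * (vi - mv) for pi, ui, vi in zip(p, u, v))

# Hypothetical four-state economy (illustrative numbers only)
prob = [0.25, 0.25, 0.25, 0.25]
rf = 0.02
r_x = [0.30, 0.10, -0.05, -0.15]   # return of the traded factor in each state

# Build an asset that the beta model prices correctly: a beta exposure to R_X
# plus noise that has mean zero and is uncorrelated with R_X.
z = [1.0, -2.0, 3.0, -2.0]
slope = cov(z, r_x, prob) / cov(r_x, r_x, prob)
noise = [0.05 * (zi - mean(z, prob) - slope * (xi - mean(r_x, prob)))
         for zi, xi in zip(z, r_x)]
beta = 1.4
r_i = [rf + beta * (xi - rf) + ni for xi, ni in zip(r_x, noise)]

# Stochastic discount factor m = a + b R_X
b = (rf - mean(r_x, prob)) / (cov(r_x, r_x, prob) * (1 + rf))
a = 1 / (1 + rf) - b * mean(r_x, prob)
m = [a + b * xi for xi in r_x]

def price_of_one_invested(returns):
    """E[m (1 + R)]; equals 1 when m prices the asset correctly."""
    return mean([mi * (1 + ri) for mi, ri in zip(m, returns)], prob)
```

Calling `price_of_one_invested` on `r_i`, on `r_x`, and on a constant return of `rf` should give one in each case, up to rounding.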
Are stochastic discount factors unique? Suppose $\eta$ has the properties:
$$E[\eta] = 0 \qquad \mathrm{Cov}[R, \eta] = 0$$
where $R$ is the vector containing all asset returns, i.e., $\eta$ has mean of zero
and is uncorrelated with any of the asset returns. Then consider the
alternate stochastic discount factor:
$$m^*_{t,T} = m_{t,T} + \eta$$
Then:
$$\begin{aligned}
E\left[m^*_{t,T} X_T\right] &= E[(m_{t,T} + \eta) X_T] = E[m_{t,T} X_T] + E[\eta X_T] \\
&= E[m_{t,T} X_T] + E[\eta]\,E[X_T] + \mathrm{Cov}[\eta, X_T] \\
&= E[m_{t,T} X_T]
\end{aligned}$$
So the two stochastic discount factors make exactly the same prediction.
A consequence of this result is that there is an equivalence between
traditional $\beta$ pricing models and stochastic discount factor models when
the factor is not traded as well.
$$E[R_i] = R_f + \beta \lambda \qquad \beta = \frac{\mathrm{Cov}[R_i, F]}{\mathrm{Var}[F]}$$
But the factor $F$ can be written as:
$$F = \gamma_0 + \gamma^T R + \eta$$
Defining $F^* = \gamma^T R$, we have a traded factor that prices the assets just as
well as the original factor, so there is a stochastic discount factor:
$$m_{t,T} = a + b F^*$$
But we can also contemplate:
$$m^*_{t,T} = a + b F^* + b \eta = a - b \gamma_0 + b \gamma_0 + b F^* + b \eta = (a - b \gamma_0) + b F$$
The situation is similar in a multiple-factor setting. For every multiple-factor
traditional pricing model, there exists a stochastic discount factor
model which is completely equivalent, and the stochastic discount factor is
a linear function of the factors in the traditional model.
Example: Fama/French model:
$$E[R_i] = R_f + b_i\,\mathrm{RMRF} + s_i\,\mathrm{SMB} + h_i\,\mathrm{HML}$$
There exists a stochastic discount factor:
$$m = c_0 + c_1\,\mathrm{RMRF} + c_2\,\mathrm{SMB} + c_3\,\mathrm{HML}$$
which prices the assets in exactly the same way. This is the case for
models with non-traded factors as well.
The stochastic discount factor approach has one advantage we might be
interested in: a test of the model simultaneously tells us which factors are
needed and which are not. This information is difficult to extract from the
other testing approach we have considered. (This does not prevent many
researchers from doing it incorrectly; careful implementation of an asset
pricing test followed by horrendous misinterpretation of the results is
commonplace.)
An estimation and testing procedure we can use with stochastic discount
factor asset pricing models is the Generalised Method of Moments, often
just called GMM.
But first, we will have a look at the Method of Moments.
GMM Method of Moments
Many estimation methods can be considered a special case of the method
of moments, sometimes abbreviated MoM. [1]
The basic idea behind method of moments is quite simple. Suppose you
wish to estimate the mean and variance of some random variable $X$, and
you have collected many observations of this random variable, $X_1, \ldots, X_T$.
The first and second moments of the random variable $X$ are:
$$E[X] = \mu \qquad E\left[X^2\right] = \mu^2 + \sigma^2$$
where $\mu$ and $\sigma$ are the mean and standard deviation of $X$, respectively.
These conditions are sometimes written in the form:
$$E[X - \mu] = 0 \qquad E\left[X^2 - \left(\mu^2 + \sigma^2\right)\right] = 0$$
[1] Method of Moments is not to be confused with the Ministry of Manpower, the
organisation in Singapore responsible for immigration passes and visas for foreign
workers. Both are sometimes known by the acronym MoM.
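The idea behind the method of moments is to replace the moment conditions
with their sample counterparts:
$$\frac{1}{T} \sum_{i=1}^{T} \left(X_i - \hat{\mu}\right) = 0 \qquad \frac{1}{T} \sum_{i=1}^{T} \left(X_i^2 - \left(\hat{\mu}^2 + \hat{\sigma}^2\right)\right) = 0$$
and then to choose $\hat{\mu}$ and $\hat{\sigma}$ so that the sample moment conditions are
satisfied.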
There are two parameters to be estimated, so two moment conditions are
needed to identify the parameters uniquely.
With only one moment condition, there would be many different
combinations of $\hat{\mu}$ and $\hat{\sigma}$ that satisfy the condition. The system is then
underidentified.
With three or more moment conditions, it is virtually certain that the
conditions cannot all be satisfied by any choice of $\hat{\mu}$ and $\hat{\sigma}$. The system is
overidentified.
If the number of moment conditions is equal to the number of parameters,
and none of the moment conditions are redundant or conflicting, then the
system is said to be exactly identified.
Method of moments requires that the system be exactly identified.
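In this case, we can solve for $\hat{\mu}$ and $\hat{\sigma}$ explicitly:
$$\hat{\mu} = \frac{1}{T} \sum_{i=1}^{T} X_i \qquad \hat{\sigma} = \sqrt{\frac{1}{T} \sum_{i=1}^{T} \left(X_i^2 - \left(\frac{1}{T} \sum_{j=1}^{T} X_j\right)^2\right)}$$
The method of moments estimates are almost the same as the traditional
estimates for mean and standard deviation.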
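A minimal sketch of these two estimates, using a hypothetical four-observation sample, shows how they relate to the traditional standard deviation (which divides by $T - 1$). The function name is an assumption made for this example.

```python
import math

def method_of_moments_mean_sd(xs):
    """Solve the two sample moment conditions for (mu_hat, sigma_hat)."""
    t = len(xs)
    m1 = sum(xs) / t                  # sample first moment
    m2 = sum(x * x for x in xs) / t   # sample second moment
    mu_hat = m1
    sigma_hat = math.sqrt(m2 - m1 ** 2)
    return mu_hat, sigma_hat

xs = [1.0, 2.0, 3.0, 4.0]             # hypothetical data
mu_hat, sigma_hat = method_of_moments_mean_sd(xs)

# Traditional estimate: divide the sum of squared deviations by T - 1 instead of T
sd_traditional = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / (len(xs) - 1))
```

The mean estimates coincide, while `sigma_hat` is slightly smaller than `sd_traditional`; the gap shrinks as the sample grows.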
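More formally, we can arrange the parameters to be estimated in a vector:
$$\theta = \begin{pmatrix} \mu \\ \sigma \end{pmatrix} \qquad \hat{\theta} = \begin{pmatrix} \hat{\mu} \\ \hat{\sigma} \end{pmatrix}$$
The moments used in estimation can also be put into a vector:
$$g(X, \theta) = \begin{pmatrix} X - \mu \\ X^2 - \left(\mu^2 + \sigma^2\right) \end{pmatrix}$$
We will refer to both the population and sample averages of the moment
conditions:
$$m(\theta) = E[g(X, \theta)] \qquad \hat{m}(\theta) = \frac{1}{T} \sum_{i=1}^{T} g(X_i, \theta)$$
The method of moments is simply to choose $\hat{\theta}$ so that $\hat{m}(\hat{\theta}) = 0$.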
There are often different ways to estimate the same quantities; the
different methods can usually be interpreted as method of moments
estimation with different moment conditions.
(Can you suggest a different set of moment conditions that cause division
by $T - 1$ instead of $T$ in the calculation of $\hat{\sigma}$?)
Another common estimation procedure, maximum likelihood estimation, is
simply method of moments with a particular choice of moment conditions.
Example: ordinary least squares regression (OLS). Recall the
single-variable regression equation:
$$Y = \alpha + \beta X + \epsilon$$
The two conditions we used to identify the $\alpha$ and $\beta$ parameters were:
$$E[\epsilon] = 0 \qquad \mathrm{Cov}[X, \epsilon] = 0$$
We can write equivalently:
$$E[\epsilon] = 0 \qquad E[X \epsilon] = 0$$
The problem is that $\epsilon$ is not directly observed; however, we can solve this
problem by writing:
$$\epsilon = Y - \alpha - \beta X$$
This can be substituted into the conditions above.
These conditions now are:
$$E[Y - \alpha - \beta X] = 0 \qquad E[X (Y - \alpha - \beta X)] = 0$$
These conditions can be used to construct moment conditions $g(X, Y, \theta)$:
$$g(X, Y, \theta) = \begin{pmatrix} Y - \alpha - \beta X \\ X (Y - \alpha - \beta X) \end{pmatrix}$$
with the usual:
$$m(\theta) = E[g(X, Y, \theta)] \qquad \hat{m}(\theta) = \frac{1}{T} \sum_{i=1}^{T} g(X_i, Y_i, \theta)$$
The estimates obtained by choosing $\hat{\theta}$ to make $\hat{m}(\hat{\theta}) = 0$ are exactly the
same as the usual OLS estimates. So single-variable OLS regression can be
viewed as a type of method of moments estimation.
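Because the two sample moment conditions are linear in $\alpha$ and $\beta$, they can be solved directly as a two-equation system. The sketch below does this by Cramer's rule on a hypothetical data set and then evaluates the sample moment conditions at the solution; the function names are assumptions made for this example.

```python
def ols_by_method_of_moments(xs, ys):
    """Solve the two sample moment conditions for (alpha, beta)."""
    t = len(xs)
    mx = sum(xs) / t
    my = sum(ys) / t
    mxx = sum(x * x for x in xs) / t
    mxy = sum(x * y for x, y in zip(xs, ys)) / t
    # The conditions read:  my  - alpha      - beta * mx  = 0
    #                       mxy - alpha * mx - beta * mxx = 0
    det = mxx - mx * mx
    alpha = (my * mxx - mx * mxy) / det
    beta = (mxy - mx * my) / det
    return alpha, beta

def sample_moment_conditions(xs, ys, alpha, beta):
    """Sample averages of g(X, Y, theta) at the given parameter values."""
    t = len(xs)
    m1 = sum(y - alpha - beta * x for x, y in zip(xs, ys)) / t
    m2 = sum(x * (y - alpha - beta * x) for x, y in zip(xs, ys)) / t
    return m1, m2

xs = [0.0, 1.0, 2.0, 3.0]      # hypothetical data
ys = [1.0, 2.9, 5.2, 6.9]
alpha, beta = ols_by_method_of_moments(xs, ys)
m1, m2 = sample_moment_conditions(xs, ys, alpha, beta)
```

The slope formula reduces to the sample covariance of $X$ and $Y$ divided by the sample variance of $X$, which is the usual OLS slope.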
Multiple regression is no different. Recall:
$$Y = X \beta + \epsilon$$
where $Y$ is a scalar, but $X$ is a row vector, and $\beta$ is a column vector. Also
recall that the first element of $X$ is taken to be 1, and the first element of
$\beta$ corresponds to the $\alpha$ coefficient, when the regression is not written in
vector-matrix form.
The appropriate conditions are:
$$E\left[X^T \epsilon\right] = 0$$
which is a vector of conditions; there is one for each $X$ variable, and one
for the constant (which is included as the first $X$ variable).
Since $\epsilon = Y - X \beta$, these conditions can be rewritten as:
$$E\left[X^T (Y - X \beta)\right] = 0$$
The moments are therefore:
$$g(X, Y, \theta) = X^T (Y - X \beta)$$
with the usual:
$$m(\theta) = E[g(X, Y, \theta)] \qquad \hat{m}(\theta) = \frac{1}{T} \sum_{i=1}^{T} g(X_i, Y_i, \theta)$$
The method of moments estimate results from choosing $\hat{\theta}$ so that
$\hat{m}(\hat{\theta}) = 0$.
So multiple regression is also a type of method of moments estimation.
In the simple case just considered, it is possible to solve explicitly for the
estimates. Sometimes it will be difficult or impossible to do so; in such
cases, numeric search procedures are relatively straightforward to
implement on modern computers.
In complicated problems, it may be difficult to show that the solution (i.e.,
estimates that satisfy all the moment conditions) is unique.
It is usually difficult to derive exact distributional results about estimates
obtained through method of moments.
However, under quite general conditions, it is possible to derive asymptotic
results, i.e., results which are approximately true when the amount of data
is large.
The tool to be used is the delta method.
Suppose $X$ has a Gaussian distribution, with mean $\mu$ and standard
deviation $\sigma$. What is the distribution of $f(X)$?
The exact answer to this question depends on the specific choice of $f(\cdot)$,
and may be difficult to calculate in particular cases.
However, if $f(\cdot)$ is sufficiently smooth, we can approximate it as:
$$f(x) \approx f(\mu) + \frac{df}{dx}(\mu)\,(x - \mu)$$
Then:
$$E[f(X)] \approx f(\mu) \qquad \mathrm{Var}[f(X)] \approx \left(\frac{df}{dx}(\mu)\right)^2 \sigma^2$$
These approximations will be accurate if the variance of $X$ is small enough
so that it is nearly always in a region where $f(\cdot)$ is well approximated by a
linear function. If that is the case, $f(X)$ will also have approximately a
Gaussian distribution.
[Figure: Distribution of $e^X$, where $X$ is Gaussian with $\mu = 0.2$ and $\sigma = 0.8$. Four further panels are titled "Distribution of $e^X$".]
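For $f(x) = e^x$ the exact answer happens to be known (the lognormal distribution), so the quality of the delta method approximation can be checked directly. The sketch below compares the two for a small and a large standard deviation; reading the figure's parameters as $\mu = 0.2$ and $\sigma = 0.8$ is an assumption, and the smaller value $\sigma = 0.05$ is hypothetical.

```python
import math

def delta_method_exp(mu, sigma):
    """Delta-method mean and variance of exp(X) for X ~ N(mu, sigma^2)."""
    slope = math.exp(mu)              # df/dx evaluated at mu, for f(x) = exp(x)
    return math.exp(mu), slope ** 2 * sigma ** 2

def exact_lognormal(mu, sigma):
    """Exact mean and variance of exp(X), i.e., of a lognormal random variable."""
    mean = math.exp(mu + sigma ** 2 / 2)
    var = (math.exp(sigma ** 2) - 1) * math.exp(2 * mu + sigma ** 2)
    return mean, var

for sigma in (0.05, 0.8):
    approx_mean, approx_var = delta_method_exp(0.2, sigma)
    true_mean, true_var = exact_lognormal(0.2, sigma)
    # Ratios close to one mean the linear approximation is working well
    print(sigma, approx_mean / true_mean, approx_var / true_var)
```

With the small standard deviation both ratios are close to one; with $\sigma = 0.8$ the approximation understates both the mean and the variance noticeably, because $e^x$ is far from linear over the range that $X$ typically covers.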
The moment conditions, evaluated at the true (but unknown) parameter
vector $\theta$, are always equal to zero.
$$m(\theta) = 0$$
But consider the estimated moment conditions, evaluated at the true
parameter values, $\hat{m}(\theta)$. Due to sampling variation, the moment
conditions will most likely be something different from zero.
In our running example:
$$\hat{m}(\theta) = \frac{1}{T} \sum_{i=1}^{T} g(X_i, \theta) = \begin{pmatrix} \frac{1}{T} \sum_{i=1}^{T} (X_i - \mu) \\ \frac{1}{T} \sum_{i=1}^{T} \left(X_i^2 - \left(\mu^2 + \sigma^2\right)\right) \end{pmatrix}$$
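The moment conditions are just sample averages.
On the assumption that successive observations of the $X_i$ are independent
and identically distributed, we can estimate:
$$\widehat{\mathrm{Var}}[g(X, \theta)] = \frac{1}{T}\,[g(X, \theta)]^T [g(X, \theta)]$$
Here $[g(X, \theta)]$ is read as the matrix with one row per observation, the $i$-th row being $g(X_i, \theta)$ written as a row vector.
For this particular example, these estimates are:
$$\begin{aligned}
\widehat{\mathrm{Var}}[X_i - \mu] &= \frac{1}{T} \sum_{i=1}^{T} (X_i - \mu)^2 \\
\widehat{\mathrm{Var}}\left[X_i^2 - \left(\mu^2 + \sigma^2\right)\right] &= \frac{1}{T} \sum_{i=1}^{T} \left(X_i^2 - \left(\mu^2 + \sigma^2\right)\right)^2 \\
\widehat{\mathrm{Cov}}\left[X_i - \mu,\; X_i^2 - \left(\mu^2 + \sigma^2\right)\right] &= \frac{1}{T} \sum_{i=1}^{T} (X_i - \mu) \left(X_i^2 - \left(\mu^2 + \sigma^2\right)\right)
\end{aligned}$$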
For the first condition, it might seem odd that the division is by $T$ rather
than $T - 1$, since it looks like simply a sample variance estimate. However,
in this case, the mean is known; it is uncertainty in the estimation of the
mean that results in a $T - 1$ divisor. Furthermore, our statistical results
will only be valid for large $T$ anyway, so it doesn't really matter.
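We can then estimate the variance of $\hat{m}(\theta)$. (Note: it makes no sense at
all to try to estimate the variance of either $m(\theta)$ or $\hat{m}(\hat{\theta})$. Why?)
$$\widehat{\mathrm{Var}}[\hat{m}(\theta)] = \frac{1}{T}\,\widehat{\mathrm{Var}}[g(X, \theta)]$$
For this particular example:
$$\begin{aligned}
\widehat{\mathrm{Var}}[\hat{m}_1(\theta)] &= \frac{1}{T}\,\widehat{\mathrm{Var}}[X_i - \mu] \\
\widehat{\mathrm{Var}}[\hat{m}_2(\theta)] &= \frac{1}{T}\,\widehat{\mathrm{Var}}\left[X_i^2 - \left(\mu^2 + \sigma^2\right)\right] \\
\widehat{\mathrm{Cov}}[\hat{m}_1(\theta), \hat{m}_2(\theta)] &= \frac{1}{T}\,\widehat{\mathrm{Cov}}\left[X_i - \mu,\; X_i^2 - \left(\mu^2 + \sigma^2\right)\right]
\end{aligned}$$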
Two issues:
First, the successive observations may not be independent and identically
distributed. If so, it is not clear that these estimates of the variances and
covariances of the moment conditions will be good estimates. However,
under relatively mild technical restrictions, deviations from independence
and identical distribution will not matter as $T$ becomes very large. All the
statistical results we derive will only be valid for large $T$ anyway, so we
won't worry about this problem too much.
The second issue is that the estimates of the variances and covariances
depend on the true parameters, which are not known. We can use the
estimated parameters instead, which may cause the estimated variances
and covariances to be off a bit. But this effect goes away for large $T$, and
goes away faster than the uncertainty in the estimates of the parameters
goes away. So for large $T$, we can just ignore the fact that we must use
$\hat{\theta}$ instead of $\theta$.
We have a method for estimating the variances and covariances of the
sample moment conditions, evaluated at the true parameters $\theta$. The
expected values of the moment conditions (also evaluated at the true
parameters $\theta$) are zero.
It is possible to go further still, and claim (under some technical
conditions) that the moment conditions have an asymptotically normal
distribution. We usually write this as:
$$\sqrt{T}\,\hat{m}(\theta) \xrightarrow{d} N(0, \Omega)$$
for some matrix $\Omega$, which is called the asymptotic variance.
The asymptotic distribution of the moment conditions is not what we are
ultimately interested in, but rather a means to the end. The goal is the
variance of the estimates $\hat{\theta}$, and these can be found from the moment
conditions, using the delta method.
We first note that:
$$m(\theta) = 0 \qquad \hat{m}(\hat{\theta}) = 0$$
The first is true by definition; the second is true because we choose $\hat{\theta}$ to
make it true. In general, though, we have:
$$\hat{m}(\theta) \neq 0$$
and we have a method to estimate the variance of the moment conditions
$\hat{m}(\theta)$.
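Suppose there are $N$ parameters (and therefore $N$ moment conditions).
We can write:
$$0 = \hat{m}(\hat{\theta}) \approx \underbrace{\hat{m}(\theta)}_{N \times 1} + \underbrace{\left[\frac{\partial \hat{m}}{\partial \theta}(\hat{\theta})\right]}_{N \times N} \underbrace{\left(\hat{\theta} - \theta\right)}_{N \times 1}$$
where:
$$\frac{\partial \hat{m}}{\partial \theta}(\hat{\theta}) = \begin{pmatrix} \frac{\partial \hat{m}_1}{\partial \theta_1}(\hat{\theta}) & \cdots & \frac{\partial \hat{m}_1}{\partial \theta_N}(\hat{\theta}) \\ \vdots & \ddots & \vdots \\ \frac{\partial \hat{m}_N}{\partial \theta_1}(\hat{\theta}) & \cdots & \frac{\partial \hat{m}_N}{\partial \theta_N}(\hat{\theta}) \end{pmatrix}$$
These derivatives can be calculated explicitly, since the moment conditions
are known explicitly, and are (unless the problem is really bizarre)
amenable to differentiation. As a practical matter, particularly if the
moment conditions are complicated, it might be easier to calculate the
derivatives numerically.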
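We can rearrange the approximate relation:
$$0 = \hat{m}(\hat{\theta}) \approx \hat{m}(\theta) + \left[\frac{\partial \hat{m}}{\partial \theta}(\hat{\theta})\right] \left(\hat{\theta} - \theta\right)$$
to read:
$$\hat{\theta} \approx \theta - \left[\frac{\partial \hat{m}}{\partial \theta}(\hat{\theta})\right]^{-1} \hat{m}(\theta)$$
With a further approximation, we can write:
$$\begin{aligned}
\mathrm{Var}[\hat{\theta}] &\approx \left[\frac{\partial \hat{m}}{\partial \theta}(\hat{\theta})\right]^{-1} \mathrm{Var}[\hat{m}(\theta)] \left(\left[\frac{\partial \hat{m}}{\partial \theta}(\hat{\theta})\right]^{-1}\right)^T \\
&\approx \frac{1}{T} \left[\frac{\partial \hat{m}}{\partial \theta}(\hat{\theta})\right]^{-1} \Omega \left(\left[\frac{\partial \hat{m}}{\partial \theta}(\hat{\theta})\right]^{-1}\right)^T
\end{aligned}$$
Note that all of the quantities on the right-hand side are known explicitly,
or can be estimated.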
The matrix of derivatives is a random quantity; the uncertainty has been
ignored in the calculation of $\mathrm{Var}[\hat{\theta}]$. This is not a problem
for large $T$, though, since the additional variance due to the uncertainty in
this term goes away at a rate faster than $1/T$. (Why?)
The sources of approximation in the calculation of the variance of $\hat{\theta}$ are
therefore:
1. The variance of $\hat{m}(\theta)$ must be estimated, and will differ from the true variance.
2. Successive observations of the data may not be independent.
3. We have used a linear approximation to the moment condition function $\hat{m}(\theta)$.
4. We have ignored the randomness in the matrix of derivatives of the moment conditions with respect to the parameters.
All of these cause the estimated variance of $\hat{\theta}$ to deviate from the true
variance; however, all of these sources of approximation go away at a rate
faster than $1/T$. So for large $T$, they can be ignored.
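The asymptotic distribution of the method of moments estimator is
therefore:
$$\sqrt{T} \left(\hat{\theta} - \theta\right) \xrightarrow{d} N\left(0,\; G^{-1}\,E\left[[g(X, \theta)]^T [g(X, \theta)]\right] \left(G^T\right)^{-1}\right)$$
where:
$$G = E\begin{pmatrix} \frac{\partial g_1}{\partial \theta_1}(X, \theta) & \cdots & \frac{\partial g_1}{\partial \theta_N}(X, \theta) \\ \vdots & \ddots & \vdots \\ \frac{\partial g_N}{\partial \theta_1}(X, \theta) & \cdots & \frac{\partial g_N}{\partial \theta_N}(X, \theta) \end{pmatrix}$$
All quantities can be estimated; the error in the estimates goes away
asymptotically.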
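For the running example (mean and standard deviation), the pieces of this formula can be written out by hand. Differentiating the two moment functions gives $G$ with rows $(-1, 0)$ and $(-2\mu, -2\sigma)$, whose inverse is available in closed form. The sketch below, with a hypothetical function name and toy data, estimates the middle matrix by the sample average of the outer products of $g(X_i, \hat{\theta})$ and returns the implied standard errors.

```python
import math

def mom_mean_sd_with_se(xs):
    """Method of moments estimates of (mu, sigma) with asymptotic standard errors."""
    t = len(xs)
    mu = sum(xs) / t
    sigma = math.sqrt(sum(x * x for x in xs) / t - mu ** 2)
    # Moment functions evaluated at the estimates, one pair per observation
    g = [(x - mu, x * x - (mu ** 2 + sigma ** 2)) for x in xs]
    # Sample estimate of the 2x2 second-moment matrix of g
    s11 = sum(a * a for a, _ in g) / t
    s12 = sum(a * b for a, b in g) / t
    s22 = sum(b * b for _, b in g) / t
    # G = [[-1, 0], [-2 mu, -2 sigma]]; its inverse in closed form
    gi = [[-1.0, 0.0], [mu / sigma, -1.0 / (2.0 * sigma)]]
    # Var[theta_hat] is approximately (1/T) G^{-1} S (G^{-1})^T; only the diagonal is needed
    var_mu = (gi[0][0] ** 2 * s11) / t
    var_sigma = (gi[1][0] ** 2 * s11
                 + 2 * gi[1][0] * gi[1][1] * s12
                 + gi[1][1] ** 2 * s22) / t
    return mu, sigma, math.sqrt(var_mu), math.sqrt(var_sigma)

mu, sigma, se_mu, se_sigma = mom_mean_sd_with_se([1.0, 2.0, 3.0, 4.0])  # hypothetical data
```

The standard error of the mean reduces to $\hat{\sigma} / \sqrt{T}$, the familiar result, which is a useful sanity check on the general formula. A sample of four observations is far too small for the asymptotic results to be trusted; it is used here only to keep the arithmetic easy to follow.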
For the examples we have considered, we already had good methods of
estimation. The method of moments produced similar (in the case of the
mean and variance estimation) or identical (in the case of linear
regression) estimates.
However, the method of moments is extremely general; it can be applied to
many, many types of problems. Furthermore, it does not require the strong
distributional assumptions typically made in, for example, linear regression.
Nearly all estimation techniques, when looked at the right way, can be
interpreted as method of moments for some choice of moment conditions.
The price to be paid for this generality is that the statistical results on
estimates are weaker. In linear regression, under the assumption of
multivariate conditional normality of the error terms, we are able to derive
the exact distribution of the estimates, and also of some test statistics.
These results are valid (provided the assumption is correct), even for
relatively small data samples.
By contrast, the method of moments results on the distribution of the
estimates are only approximate. The approximation becomes very accurate
for large data sets, but may be quite inaccurate for small data sets.
Method of moments is often easy to use, even for complicated estimation
problems, and the results are valid (for large data sets) even when the
problem defies simple analysis. The asymptotic results are often all that
can be derived explicitly for many problems; if there are concerns that the
data set may be too small for the asymptotic results to apply, techniques
such as simulation (generate sample data using known parameter values,
estimate the parameters from the data, repeat one million times, compare
the estimated parameters to those used to generate the data in the first
place) can be used to assess whether the asymptotic results are accurate.
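The simulation check described above can be sketched in a few lines. The example below is a scaled-down illustration with hypothetical settings (Gaussian data, 50 observations per sample, 2,000 repetitions instead of one million): it compares the simulated standard deviation of the estimated mean with the asymptotic value $\sigma / \sqrt{T}$.

```python
import math
import random

def simulate_mu_hat_sd(mu, sigma, t, reps, seed=0):
    """Monte Carlo standard deviation of the method of moments estimate of the mean."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(t)]   # data from known parameters
        estimates.append(sum(sample) / t)                    # estimate from that data
    centre = sum(estimates) / reps
    return math.sqrt(sum((e - centre) ** 2 for e in estimates) / reps)

# Hypothetical settings, chosen to keep the run short
simulated_sd = simulate_mu_hat_sd(mu=0.2, sigma=0.8, t=50, reps=2000)
asymptotic_sd = 0.8 / math.sqrt(50)
```

If the two numbers are close, the asymptotic result is a reasonable guide at this sample size; the same loop can be repeated for other estimators or smaller samples where the answer is less obvious.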
In addition to estimation, we would often like to test hypotheses. The
traditional approach (method of moments) is to collect more moment
conditions than there are parameters to be estimated, use some of the
moments to estimate the parameters, and use the rest to conduct
statistical tests.
For example, suppose we believe that some data are drawn from a normal
distribution. Call the random variable $X$. Some properties of the normal
distribution are:
$$\begin{aligned}
E[X] &= \mu & E\left[X^2\right] &= \mu^2 + \sigma^2 \\
E\left[X^3\right] &= 3\mu\sigma^2 + \mu^3 & E\left[X^4\right] &= 3\sigma^4 + 6\mu^2\sigma^2 + \mu^4
\end{aligned}$$
We could estimate the parameters $\mu$ and $\sigma$ using the first two conditions.
The other two conditions ought to be satisfied (at least approximately) if
the data are drawn from a normal distribution, but might not be if the
data are drawn from some other distribution.
So we could estimate the sample third and fourth moments:
$$\overline{X^3} = \frac{1}{T} \sum_{i=1}^{T} X_i^3 \qquad \overline{X^4} = \frac{1}{T} \sum_{i=1}^{T} X_i^4$$
We could then derive the distribution of these two sample statistics (under
an assumption of normality), and apply a statistical test to see if they are
about what they should be.
This is the traditional approach to estimation and testing. With $M$
parameters to be estimated, and $N > M$ moment conditions, use $M$
moment conditions to estimate the $M$ parameters, and use the other
$N - M$ moment conditions to conduct statistical tests.
There is an alternative procedure called generalised method of moments
(GMM). Although the implementation is somewhat complicated, the idea
behind GMM is simple.
Use all $N$ moment conditions to estimate the $M$ parameters and
conduct statistical tests simultaneously.
We will now look at GMM in detail.
GMM GMM
There are essentially three things we would like to do.
1. Estimate the parameters of a model.
2. Estimate the variance of the parameter estimates.
3. Test the model.
We will start by having a look at estimation.
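Example: suppose we think $X$ is a normal random variable, but don't
know its mean and variance. The parameter vector is:
$$\theta = \begin{pmatrix} \mu \\ \sigma \end{pmatrix} \qquad \hat{\theta} = \begin{pmatrix} \hat{\mu} \\ \hat{\sigma} \end{pmatrix}$$
We can write down the following moment conditions:
$$m(\theta) = E[g(X, \theta)] = \begin{pmatrix} E[X - \mu] \\ E\left[X^2 - \mu^2 - \sigma^2\right] \\ E\left[X^3 - 3\mu\sigma^2 - \mu^3\right] \\ E\left[X^4 - 3\sigma^4 - 6\mu^2\sigma^2 - \mu^4\right] \end{pmatrix}$$
These should all be equal to zero.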
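The sample counterparts are:
$$\hat{m}(\theta) = \frac{1}{T} \sum_{i=1}^{T} g(X_i, \theta) = \begin{pmatrix} \frac{1}{T} \sum_{i=1}^{T} (X_i - \mu) \\ \frac{1}{T} \sum_{i=1}^{T} \left(X_i^2 - \mu^2 - \sigma^2\right) \\ \frac{1}{T} \sum_{i=1}^{T} \left(X_i^3 - 3\mu\sigma^2 - \mu^3\right) \\ \frac{1}{T} \sum_{i=1}^{T} \left(X_i^4 - 3\sigma^4 - 6\mu^2\sigma^2 - \mu^4\right) \end{pmatrix}$$
The method of moments procedure is to choose $\hat{\theta}$ to make all the sample
moment conditions equal to zero.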
The problem here is that there are four moment conditions, but only two
parameters. Even if the model (of normality) is perfectly true, it is
virtually certain that there will be no value of $\hat{\theta}$ that makes all four moment
conditions equal to zero.
Since perfection is not possible, we will have to settle for some
approximation to it. The sample moment conditions cannot all be satisfied
perfectly, so we will try to satisfy them as closely as possible. For example,
we could choose $\hat{\theta}$ as follows:
$$\hat{\theta} = \arg\min_{\theta}\, [\hat{m}(\theta)]^T [\hat{m}(\theta)] = \arg\min_{\theta} \sum_{i=1}^{N} [\hat{m}_i(\theta)]^2$$
In other words, we can try to make the sample moment conditions as close
to zero as possible, where "close" is defined by squaring the moment
conditions and adding them together.
This approach places equal weight on all the moment conditions. We
might want to consider placing more weight on some than on others.
Suppose $w_i$ are numbers indicating the weight we wish to place on each of
the conditions, i.e., a large value means that the estimation procedure
should try to satisfy that moment condition very closely, even at the
expense of violating some of the other moment conditions rather badly.
The $\hat{\theta}$ could then be chosen in this way:
$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} w_i\,[\hat{m}_i(\theta)]^2$$
In the problem of estimating $\mu$ and $\sigma$ for the normal random variable, we
might want to place higher weight on satisfying the first two conditions
(about the mean and variance) than on the other two. (The
traditional approach would be to use only the first two conditions in
estimation, which is equivalent to $w_1 = 1$, $w_2 = 1$, $w_3 = 0$, and $w_4 = 0$.)
More generally, we could consider using an $N \times N$ matrix $W$ (one row and
one column per moment condition) to decide
how much weight to put on the different moment conditions.
The $\hat{\theta}$ would then be chosen as:
$$\hat{\theta} = \arg\min_{\theta}\, [\hat{m}(\theta)]^T\,W\,[\hat{m}(\theta)]$$
It is not a good idea to choose just any old matrix $W$. We will always use
positive semidefinite matrices. Recall that $W$ is positive semidefinite if:
$$a^T W a \geq 0 \quad \forall a$$
The $W$ we use will usually be not only positive semidefinite, but positive
definite. Recall that a matrix is positive definite if it is positive
semidefinite, and:
$$a^T W a = 0 \;\Rightarrow\; a = 0$$
If the weighting matrix $W$ is not positive semidefinite, then there may be
no solution to the minimisation problem; it might, for example, be possible
to find values of $\hat{m}(\theta)$ that make the objective function an arbitrarily large (in
magnitude) negative number.
This type of procedure is sometimes called minimum distance estimation.
We need to determine the properties of the estimates obtained in this way,
and also discuss the choice of $W$.
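By the same technique used to find the asymptotic variance in the method
of moments case, we can find the asymptotic variance of $\hat\theta$ here:
$$\sqrt{T}\,(\hat\theta - \theta) \xrightarrow{d} N\left(0,\ (G^T W G)^{-1}\, G^T W \Omega W G\, (G^T W G)^{-1}\right)$$
where $G$ is the $M \times N$ matrix of expected derivatives of the $M$ moment
functions with respect to the $N$ parameters, and $\Omega$ is the covariance
matrix of the moment functions:
$$G = E\begin{bmatrix} \dfrac{\partial g_1}{\partial\theta_1}(X,\theta) & \cdots & \dfrac{\partial g_1}{\partial\theta_N}(X,\theta) \\ \vdots & \ddots & \vdots \\ \dfrac{\partial g_M}{\partial\theta_1}(X,\theta) & \cdots & \dfrac{\partial g_M}{\partial\theta_N}(X,\theta) \end{bmatrix}
\qquad
\Omega = E\left[\, [g(X,\theta)]\,[g(X,\theta)]^T \right]$$
Once again, everything in the above expression can be estimated; the
estimation error goes away asymptotically.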
What is the best (if there is such a thing) choice of $W$?
It turns out that there is an optimal choice, which is:
$$W = \Omega^{-1}$$
It can be shown that this produces the lowest possible asymptotic
variances for $\hat\theta$. With this choice of $W$:
$$\sqrt{T}\,(\hat\theta - \theta) \xrightarrow{d} N\left(0,\ (G^T \Omega^{-1} G)^{-1}\right)$$
There is but one small problem: we don't know $\Omega$. However, we can
estimate it.
Any consistent estimate of $\Omega$ produces asymptotically efficient results.
A consistent estimator is one with the following property: for any arbitrary
level of error, the probability that the estimate will deviate from the true
parameter by more than that level goes to zero as $T$ goes to $+\infty$. With
more and more data, the estimate becomes more and more accurate, and
approaches the true parameter very closely for large $T$.
$$\lim_{T \to +\infty} \mathrm{Prob}\left( \|\hat\Omega - \Omega\| < \epsilon \right) = 1 \quad \forall\, \epsilon > 0$$
Any arbitrary accuracy level is achieved with more and more data if the
estimator is consistent.
How to estimate it?
The most common approach to implementing GMM estimation is in two
stages.
In the first stage, the GMM procedure is implemented with $W = I$ (the
identity matrix). The entire purpose of the first stage is to produce an
estimate $\hat\Omega$. Although the first stage also produces estimates
$\hat\theta$ and other information that can be used to calculate the
asymptotic variance of $\hat\theta$, etc., only the information on
$\hat\Omega$ is retained.
In the second stage, the estimate $\hat\Omega$ from the first stage is used to
obtain estimates $\hat\theta$, estimates of the asymptotic variance of
$\hat\theta$, etc.
Other procedures have been used. For example, a third (or fourth, or fifth)
stage can be added, with each stage using the estimated $\hat\Omega$ from the
previous stage. These alternate procedures are asymptotically equivalent
to the two-stage procedure; however, simulation studies show they may
have some limited beneficial effect on the quality of estimation for small $T$.
The two-stage GMM procedure is therefore:
1. Produce a first-stage estimate of the parameter vector, $\hat\theta_1$, using
$W = I$, that is, solve the following problem:
$$\hat\theta_1 = \arg\min_\theta\, [\hat m(\theta)]^T [\hat m(\theta)]$$
2. Calculate the estimate of $\Omega$ from which the weighting matrix is built:
$$\hat\Omega = \frac{1}{T} \sum_{t=1}^{T} \left[ g(Y_t, \hat\theta_1) \right] \left[ g(Y_t, \hat\theta_1) \right]^T$$
3. Produce the second-stage estimate of the parameter vector, using
$W = \hat\Omega^{-1}$, i.e., solve the following problem:
$$\hat\theta_2 = \arg\min_\theta\, [\hat m(\theta)]^T\, \hat\Omega^{-1}\, [\hat m(\theta)]$$
4. Estimate the asymptotic variance of $\hat\theta_2$ using the following result,
with $G$ and $\Omega$ replaced by their estimates:
$$\sqrt{T}\,(\hat\theta_2 - \theta) \xrightarrow{d} N\left(0,\ (G^T \Omega^{-1} G)^{-1}\right)$$
5. Do whatever it is you wanted to do with the estimates and
variance/covariance estimates.
There is one other thing we can do, which is test the model. If extra
moment conditions were used (i.e., more than the number of parameters
to be estimated), then the system is overidentified, and the extra
information in these conditions can be exploited to construct a test of the
model we are estimating.
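Step 2 of the procedure is a plain average of outer products, which the following minimal sketch spells out. It assumes the moment functions have already been evaluated at the first-stage estimate and stored as one list per observation; the names `sample_moment_vector` and `omega_hat` are hypothetical.

```python
def sample_moment_vector(g_rows):
    """Average the moment functions over observations: m_hat = (1/T) * sum_t g_t."""
    t = len(g_rows)
    k = len(g_rows[0])
    return [sum(row[i] for row in g_rows) / t for i in range(k)]


def omega_hat(g_rows):
    """Estimate Omega as (1/T) * sum_t g_t g_t^T.

    g_rows[t] holds the moment functions for observation t, already evaluated
    at the first-stage parameter estimate (an assumption of this sketch).
    """
    t = len(g_rows)
    k = len(g_rows[0])
    return [
        [sum(row[i] * row[j] for row in g_rows) / t for j in range(k)]
        for i in range(k)
    ]


# Tiny made-up example with two moment functions and four observations.
g_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(sample_moment_vector(g_rows))
print(omega_hat(g_rows))
```

The second-stage weighting matrix is the inverse of this estimate; the inversion and the minimisation itself are left to whatever numerical library is at hand.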
The statistical test we will use is often called the overidentifying
restrictions test, or a J test.
The test statistic we will use is:
$$J = T\, [\hat m(\hat\theta)]^T\, W\, [\hat m(\hat\theta)] = T \left[ \frac{1}{T} \sum_{i=1}^{T} g(X_i, \hat\theta) \right]^T W \left[ \frac{1}{T} \sum_{i=1}^{T} g(X_i, \hat\theta) \right]$$
If all the moment conditions are true (i.e., if $m(\theta) = 0$), then the J
statistic has, asymptotically, a chi-square distribution with degrees of
freedom equal to the number of moment conditions minus the number of
parameters estimated.
The J statistic can therefore be used to test whether the model is correct
or not. (Note: if the model is rejected by the J-test, the estimates
$\hat\theta$ should be treated with suspicion!)

GMM Example
Example: suppose we think observations of the random variable $X$ are
drawn from a normal distribution. For purposes of this example, 1000
observations of draws from a normal distribution with mean of 2 and
standard deviation of 3 were generated.
We will use the four moment conditions:
$$m(\theta) = E[g(X, \theta)] = \begin{bmatrix} E[X - \mu] \\ E[X^2 - \mu^2 - \sigma^2] \\ E[X^3 - 3\mu\sigma^2 - \mu^3] \\ E[X^4 - 3\sigma^4 - 6\mu^2\sigma^2 - \mu^4] \end{bmatrix}$$
Using the two-step GMM procedure, we begin with a weighting matrix
$W = I$, and solve the problem:
$$\hat\theta_1 = \arg\min_\theta\, [\hat m(\theta)]^T [\hat m(\theta)]$$
The estimated parameters are:
$$\hat\theta_1 = \begin{bmatrix} \hat\mu \\ \hat\sigma \end{bmatrix} = \begin{bmatrix} 2.1312 \\ 2.8853 \end{bmatrix}$$
The four estimated moment conditions are:
$$\hat m(\hat\theta_1) = \begin{bmatrix} 0.01800 \\ 0.00457 \\ 0.45713 \\ 0.79567 \end{bmatrix}$$
Neither of these will be used though; we only need the weighting matrix to
use in the second stage of the procedure.
The weighting matrix to be used in the second stage is:
$$W = \begin{bmatrix} 0.55340 & 0.06423 & 0.01463 & 0.00166 \\ 0.06423 & 0.03041 & 0.00076 & 0.00039 \\ 0.01463 & 0.00076 & 0.00073 & 0.00007 \\ 0.00166 & 0.00039 & 0.00007 & 0.00001 \end{bmatrix}$$
This is the inverse of the estimate $\hat\Omega$.
Using this new weighting matrix, the second stage estimates are the
solution to the problem:
$$\hat\theta_2 = \arg\min_\theta\, [\hat m(\theta)]^T\, W\, [\hat m(\theta)]$$
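$$\hat\theta_2 = \begin{bmatrix} \hat\mu \\ \hat\sigma \end{bmatrix} = \begin{bmatrix} 2.1332 \\ 2.8971 \end{bmatrix}
\qquad
\hat m(\hat\theta_2) = \begin{bmatrix} 0.00396 \\ 0.05662 \\ 0.52774 \\ 5.77782 \end{bmatrix}$$
None of the moment conditions are estimated at zero, which will be the
case when the system is overidentified. Whether they are "close" to zero is
a question for the J-test.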
The estimated covariance matrix of the moment conditions is:
$$\hat\Omega = \begin{bmatrix} 8.43 & 35.11 & 321.01 & 1978.97 \\ 35.11 & 286.42 & 2134.67 & 18191.07 \\ 321.01 & 2134.67 & 20156.87 & 153228.87 \\ 1978.97 & 18191.07 & 153228.87 & 1394503.74 \end{bmatrix}$$
Which conditions do you think receive the most weight in the second
stage?
The estimated covariance matrix of the parameters is given by:
$$\widehat{\mathrm{Var}}(\hat\theta) = \begin{bmatrix} 0.00835 & 0.00020 \\ 0.00020 & 0.00407 \end{bmatrix}$$
The estimated standard errors of $\hat\mu$ and $\hat\sigma$ are 0.09138 and
0.06377, respectively.
The one remaining thing to do is to calculate the J-statistic. This is quite
simple, since it is simply the value of the objective function minimised in
the second stage, multiplied by 1000 (the number of observations). The
value is 0.4210. Since it has a chi-square distribution with 2 degrees of
freedom, the p-value is 0.8102. The model (i.e., the assumption of
normality) therefore cannot be rejected at any reasonable confidence level.
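The p-value quoted for this J statistic can be reproduced without a statistics library, because the chi-square upper tail has a closed form when the degrees of freedom are even. The sketch below relies on that standard identity; the function name `chi2_sf_even_df` is made up here, and odd degrees of freedom would need a proper incomplete gamma routine instead.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(chi2_df > x) for a positive, even df.

    Uses the identity P = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!,
    which holds only when df is even.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


j_stat = 0.4210  # value quoted in the text: objective times 1000 observations
p_value = chi2_sf_even_df(j_stat, 2)
print(round(p_value, 4))
```

For 2 degrees of freedom the formula collapses to exp(-J/2), which is why the p-value stays large for a J statistic this small.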
We now repeat the procedure with a different data set. This time, we use 1000
observations of the form $X = 5Z^2 - 1$, where $Z$ is a standard normal
random variable. The first-stage estimated parameters and moment conditions are:
$$\hat\theta_1 = \begin{bmatrix} \hat\mu \\ \hat\sigma \end{bmatrix} = \begin{bmatrix} 4.2228 \\ 9.8716 \end{bmatrix}
\qquad
\hat m(\hat\theta_1) = \begin{bmatrix} 0.2669 \\ 56.2134 \\ 0.2040 \\ 0.5971 \end{bmatrix}$$
Neither of these will be used though; we only need the weighting matrix to
use in the second stage of the procedure.
The weighting matrix to be used in the second stage is:
$$W = \begin{bmatrix} 0.106 & 0.00274 & 0.00010 & 0.000002 \\ 0.00274 & 0.00033 & 0.00001 & 0.0000001 \\ 0.00010 & 0.00001 & 0.000001 & 0.00000002 \\ 0.000002 & 0.0000001 & 0.00000002 & 0.0000000003 \end{bmatrix}$$
This is the inverse of the estimate $\hat\Omega$.
Using this new weighting matrix, the second stage estimates are the
solution to the problem:
$$\hat\theta_2 = \arg\min_\theta\, [\hat m(\theta)]^T\, W\, [\hat m(\theta)]$$
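$$\hat\theta_2 = \begin{bmatrix} \hat\mu \\ \hat\sigma \end{bmatrix} = \begin{bmatrix} 2.7888 \\ 3.7824 \end{bmatrix}
\qquad
\hat m(\hat\theta_2) = \begin{bmatrix} 1.1671 \\ 36.9825 \\ 1168.209 \\ 37889.3 \end{bmatrix}$$
None of the moment conditions are estimated at zero, which will be the
case when the system is overidentified. Whether they are "close" to zero is
a question for the J-test.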
The estimated covariance matrix of the moment conditions is:
$$\hat\Omega = \begin{bmatrix} 44.779 & 1119.093 & 35414 & 1333609 \\ 1119.093 & 37110.341 & 1410434 & 60377752 \\ 35414 & 1410434 & 60943442 & 2855401114 \\ 1333610 & 60377753 & 2855401114 & 1.42446 \times 10^{11} \end{bmatrix}$$
Which conditions do you think receive the most weight in the second
stage?
The estimated covariance matrix of the parameters is given by:
$$\widehat{\mathrm{Var}}(\hat\theta) = \begin{bmatrix} 0.01846 & 0.01288 \\ 0.01288 & 0.01234 \end{bmatrix}$$
The estimated standard errors of $\hat\mu$ and $\hat\sigma$ are 0.13588 and
0.11110, respectively.
The one remaining thing to do is to calculate the J-statistic. This is quite
simple, since it is simply the value of the objective function minimised in
the second stage, multiplied by 1000 (the number of observations). The
value is 57.06. Since it has a chi-square distribution with 2 degrees of
freedom, the p-value is about $3.08 \times 10^{-10}$. The model (i.e., the
assumption of normality) is very easily rejected at any reasonable
confidence level.
Since the model is rejected, we should be somewhat cautious about using
the parameter estimates. The estimates of the mean and standard
deviation using more conventional methods are 3.956 and 6.592,
respectively, which are quite far from the GMM estimates.
GMM Asset Pricing Models
Example: testing asset pricing models. We have already developed a test
based on the $\alpha$ coefficients in regressions of excess returns on the
factor values.
Can we formulate this type of test GMM style?
Take the FF model as an example. (It works the same way for any linear
factor model.) The prediction of the model is
$$R_{i,t} = R_{f,t} + \beta_{i,RMRF}\, RMRF_t + \beta_{i,SMB}\, SMB_t + \beta_{i,HML}\, HML_t + \epsilon_{i,t}$$
with
$$E[\epsilon_{i,t}] = 0 \qquad \mathrm{Cov}[\epsilon_{i,t}, RMRF_t] = 0 \qquad \mathrm{Cov}[\epsilon_{i,t}, SMB_t] = 0 \qquad \mathrm{Cov}[\epsilon_{i,t}, HML_t] = 0$$
The returns equation can be rearranged to solve for $\epsilon_{i,t}$, and then
plugged into the above four equations to form moment conditions.
If there are $M$ assets and $N$ factors in the model, then there are $M \times N$
parameters to be estimated (one $\beta$ coefficient for each factor and each
asset), and $M \times (N + 1)$ moment conditions (the error term for each asset
must not covary with the factors, and must have a mean of zero).
The system is overidentified: there are $M$ surplus moment conditions.
They are the condition that $\alpha_i = 0$ for each asset. In our previous
methodology, this condition was used to formulate a statistical test, after
estimation of all the parameters (method of moments procedure). Here,
the restriction is imposed during the estimation. Estimation and testing
have been built into a single step.
The procedure just described is not the typical GMM technique used to
estimate/test asset pricing models. Do the $\beta$ coefficients really need to
be estimated to test the model?
Recall the stochastic discount factor formulation:
$$m_t = c_0 + c_1\, RMRF_t + c_2\, SMB_t + c_3\, HML_t$$
If the Fama-French model works, there exists such a stochastic discount
factor (although we don't necessarily know the values of $c_0$, $c_1$, $c_2$,
and $c_3$), with the property that:
$$E[m_t R_{i,t}] = 1$$
for every asset. Since this will hold for the risk-free asset as well, we have:
$$E[m_t (R_{i,t} - R_{f,t})] = 0$$
We will test the excess returns formulation.
Note that the $m_t$ is not identified, because we do not use the information
on the risk-free asset. For any $m_t$ that satisfies the excess-return moment
conditions, $k m_t$ (for any value of $k$, including 0) satisfies them as well.
If we allow $c_0$, $c_1$, $c_2$, and $c_3$ to be anything at all, then the
solution that will satisfy all moment conditions is to set all equal to zero.
To avoid this problem, we impose the restriction $E[m_t] = 1$. This could be
added as a moment condition, or it could be built in explicitly to the
estimation procedure. We will take the latter approach.
This alternate formulation has $N$ parameters to be estimated, and $M$
moment conditions. Provided $M > N$, the system is overidentified.
Note that the $\beta$ coefficients are not estimated. Is the model being tested
really the same model we tested with our previous methodology?
Under the assumptions made, the regression and F-statistic approach is
completely correct, and we are able to derive the exact distribution of the
test statistic. However, the assumptions are strong, and may be violated.
This GMM approach makes weaker assumptions, at the price of only
asymptotic results.
As we will see, though, there is a certain amount of flexibility in the GMM
approach that allows us to do some things that would be quite difficult in
the regression framework. More on that later.
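To build in explicitly the constraint that $E[m_t] = 1$, we will rewrite the
stochastic discount factor:
$$\begin{aligned}
m_t &= c_0 + c_1\, RMRF_t + c_2\, SMB_t + c_3\, HML_t \\
    &= c_0 \left( 1 + \frac{c_1}{c_0} RMRF_t + \frac{c_2}{c_0} SMB_t + \frac{c_3}{c_0} HML_t \right) \\
    &= c_0 \left( 1 + d_1\, RMRF_t + d_2\, SMB_t + d_3\, HML_t \right)
\end{aligned}$$
We will treat $d_1$, $d_2$, and $d_3$ as the parameters to be estimated, and (for
each choice of these parameters), set:
$$c_0 = \frac{1}{1 + d_1 \overline{RMRF} + d_2 \overline{SMB} + d_3 \overline{HML}}$$
where the overbars denote the averages of the factors, so that $m_t$ has a mean of one.
The moment conditions are therefore just $E[m_t (R_{i,t} - R_{f,t})] = 0$.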
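To make the normalisation concrete, the following sketch builds the stochastic discount factor series from a candidate parameter vector and evaluates the sample moment conditions. It is only an illustration under stated assumptions: the factor averages are sample averages, the data are plain nested lists, and the names `sdf_series` and `pricing_moments` are invented for this example.

```python
def sdf_series(d, factors):
    """Return m_t = c0 * (1 + d . f_t), with c0 set so the sample mean of m_t is 1.

    factors[t] is the list of factor values at time t (e.g. RMRF, SMB, HML).
    """
    n_obs = len(factors)
    means = [sum(f[k] for f in factors) / n_obs for k in range(len(d))]
    c0 = 1.0 / (1.0 + sum(dk * mk for dk, mk in zip(d, means)))
    return [c0 * (1.0 + sum(dk * fk for dk, fk in zip(d, f))) for f in factors]


def pricing_moments(d, factors, excess_returns):
    """Sample averages of m_t * (R_{i,t} - R_{f,t}), one entry per asset."""
    m = sdf_series(d, factors)
    n_obs = len(m)
    n_assets = len(excess_returns[0])
    return [
        sum(m[t] * excess_returns[t][i] for t in range(n_obs)) / n_obs
        for i in range(n_assets)
    ]


# Made-up numbers purely for illustration (3 periods, 3 factors, 2 assets).
factors = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0], [-1.0, 2.0, 1.0]]
excess = [[0.01, -0.02], [0.03, 0.00], [-0.01, 0.02]]
print(pricing_moments([-0.1, 0.05, -0.2], factors, excess))
```

A GMM routine would search over `d` to make these averages as close to zero as the weighting matrix demands; with `d` set to zero the discount factor is identically one and the moments reduce to the average excess returns.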
Implementing the first stage of the estimation procedure using the 25
Fama-French portfolios, we find:
$$m_t = 1.023537\, (1 - 0.016492\, RMRF_t - 0.002107\, SMB_t - 0.027360\, HML_t)$$
Note, however, that the whole purpose of the first-stage estimation is to
construct the weighting matrix for use in the second stage. This is done by
calculating $\hat\Omega$ (not shown, due to its unwieldy size), and choosing
$W = \hat\Omega^{-1}$.
The second stage estimates are then:
$$m_t = 1.071349\, (1 - 0.042621\, RMRF_t - 0.024479\, SMB_t - 0.074085\, HML_t)$$
Note that these estimates are quite different from the first stage estimates.
(Is this evidence of a problem?)
In order to estimate the covariance matrix of the parameter estimates, we
need the expected values of the derivatives of $g(Y, \theta)$ with respect to
$\theta$. Although these could be calculated analytically, they have been
calculated numerically instead (by making small changes to the parameter
estimates, and evaluating the expected value of the $g(Y, \theta)$ function,
with some robustness check).
The estimated covariance matrix is then
$$\widehat{\mathrm{Var}}(\hat\theta) = \begin{bmatrix} 0.00007639 & 0.00003211 & 0.00000976 \\ 0.00003211 & 0.00014410 & 0.00000923 \\ 0.00000976 & 0.00000923 & 0.00012532 \end{bmatrix}$$
The estimated standard errors of $d_1$, $d_2$, and $d_3$ are 0.008740, 0.012004,
and 0.011195, respectively. T-statistics for the three parameters are
$-4.87$, $-2.04$, and $-6.62$, respectively; note that these statistics
asymptotically have a standard normal distribution.
We have estimated the parameters of the stochastic discount factor
representation of the Fama-French model, and tested it using 25 asset
excess returns. We have also constructed test statistics to determine
whether each factor is necessary in the model; note that it is far from
obvious how to do this within the regression framework previously studied.
The regression framework does produce some outputs that the GMM
framework does not; for example, we have estimates of $\beta$ coefficients from
the regressions, if these are of interest.
There is one remaining thing to do with the GMM approach: we should
test the model using the J statistic. This statistic can be calculated at
65.05, and has an asymptotic chi-square distribution with 22 degrees of
freedom (25 moment conditions minus 3 parameters to be estimated).
The p-value is 0.00000388.
What conclusions can we draw?
We have estimated a model, and concluded that all three factors are
needed at the 95% confidence level. An alternate procedure, used by, for
example, Hou and Kimmel (2010), suggests that we can be 95% confident
that the RMRF and HML factors are needed, but not the SMB factor.
(The incorrect two-pass regression inference procedure may come up with
a different conclusion still.) Why the difference?
One potential explanation is that the GMM results are derived under an
assumption of model correctness. The J statistic rejects the model
violently, so we have some reason to be concerned about the other results.
By contrast, the Hou-Kimmel procedure is robust to misspecification. It
tells us, even if the model is misspecified, whether removal of a factor
makes the model worse; i.e., do the pricing errors become larger.
The Hou-Kimmel procedure works only with traded factors; Kimmel and
Robotti are working on a procedure, robust to misspecification, that works
with both traded and non-traded factors. It is so simple that both authors
are amazed that, as nearly as they can tell, no one has thought of it before.
One particular advantage of the GMM approach is the ability to test both
conditional and unconditional models.
Our testing so far has been, in effect, of unconditional models. In the
regression approach, it is assumed that the statistical properties of asset
returns have not changed over time; the GMM approach (as we
implemented it) was similar.
However, this is not a requirement of the GMM approach; we can take
into account conditional restrictions.
The moment conditions in the GMM approach are of the form:
$$E[m_t (R_{i,t} - R_{f,t})] = 0$$
Suppose we have a stochastic discount factor that works, in the sense that
the above moment condition is satisfied for all excess returns. However,
suppose that in each time period, the joint statistical properties of the
stochastic discount factor $m_t$ and the excess returns are different from
their unconditional behaviour. I.e., although the above equation is
satisfied, it may not be the case that:
$$E_t[m_{t+1} (R_{i,t+1} - R_{f,t+1})] = 0$$
The above must hold on average, across all time periods, but it may not
hold in each particular time period. Although the stochastic discount
factor prices the asset correctly on average, it may overprice or underprice
it in any particular time period.
Let us suppose that this is in fact not the case; the stochastic discount
factor we have chosen prices all assets correctly all the time, not just on
average. Then the much stronger condition:
$$E_t[m_{t+1} (R_{i,t+1} - R_{f,t+1})] = 0$$
holds. It follows immediately that for any piece of information at all that is
available at time $t$, called $X_t$:
$$X_t\, E_t[m_{t+1} (R_{i,t+1} - R_{f,t+1})] = 0$$
We can take $X_t$ inside the expectation:
$$E_t[X_t\, m_{t+1} (R_{i,t+1} - R_{f,t+1})] = 0$$
Then by the law of iterated expectations:
$$E[X_t\, m_{t+1} (R_{i,t+1} - R_{f,t+1})] = 0$$
Does this look like a new moment condition?
If there are information variables $X_t$ that potentially help us predict
returns (or the behaviour of the stochastic discount factor, or the joint
behaviour of the two), then it is quite simple to incorporate additional
moment restrictions into the GMM procedure that allow us to test conditional
as well as unconditional implications of the model.
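A minimal sketch of how such conditioning information multiplies the number of moment conditions is given below. It assumes the inputs are already aligned, so that entry `k` of `instruments` holds information known one period before entry `k` of `m` and `excess_returns` is realised; the function name `conditional_moments` is hypothetical.

```python
def conditional_moments(m, excess_returns, instruments):
    """Sample averages of x_t * m_{t+1} * (R_{i,t+1} - R_{f,t+1}).

    m[k]              : stochastic discount factor realised in period k
    excess_returns[k] : list of asset excess returns realised in period k
    instruments[k]    : list of information variables known one period earlier
    Returns one moment per (instrument, asset) pair, instruments varying slowest.
    """
    n_obs = len(m)
    n_assets = len(excess_returns[0])
    n_instr = len(instruments[0])
    return [
        sum(instruments[k][j] * m[k] * excess_returns[k][i] for k in range(n_obs)) / n_obs
        for j in range(n_instr)
        for i in range(n_assets)
    ]


# A constant instrument (always 1.0) reproduces the unconditional moment;
# the second, made-up instrument adds one extra condition per asset.
m = [1.0, 1.0]
excess = [[0.5], [-0.5]]
instruments = [[1.0, 2.0], [1.0, 0.0]]
print(conditional_moments(m, excess, instruments))
```

With K information variables and M assets the system has K times M moment conditions, which is why a handful of predictive variables can make the overidentifying restrictions test considerably more demanding.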
If we were to find such information variables and retest the Fama-French
three-factor model, how do you think the result would change?
It is substantially more difficult to incorporate conditional implications of
asset pricing models into the regression-based approach.
Return Predictability
Return Predictability Autoregression
The basic autoregressive AR(1) process is:
$$X_t = \rho X_{t-1} + \epsilon_t$$
where
$$\mathrm{Cov}[\epsilon_s, \epsilon_t] = \begin{cases} \sigma^2 & s = t \\ 0 & s \ne t \end{cases}$$
The behaviour of this sort of process depends critically on the value of the
$\rho$ parameter. We can rewrite the above as:
$$X_t = \sum_{s=1}^{t} \rho^{t-s} \epsilon_s + \rho^t X_0$$
The above is called the moving average representation. If $|\rho| < 1$, then the
terms on the right-hand side become smaller and smaller for smaller values
of $s$, so what happened to this process a long time ago is essentially
irrelevant for its behaviour today. The process is said to be stationary.
If $\rho = 1$ or $\rho = -1$, then the errors do not diminish over time, but
rather accumulate. The process is then said to have a unit root.
If $|\rho| > 1$, then the errors not only do not diminish over time, they are
actually amplified over time. A process that exhibits this type of behaviour
is nonstationary. (Note: the unit root process could also be said to be
non-stationary.)
Generalisation: we can include a constant:
$$Y_t = c + \rho Y_{t-1} + \epsilon_t$$
The introduction of the constant does not fundamentally alter the
properties of the process, e.g., $|\rho| < 1$ still results in a stationary
process, $\rho = 1$ and $\rho = -1$ correspond to unit roots, etc.
[Figure: Autoregressive processes with $\rho = 0.75$, $\rho = 1$, $\rho = -1$, and $\rho = 1.01$]
Properties of the autoregressive process:
$$E_{t-1}[Y_t] = c + \rho Y_{t-1} \qquad \mathrm{Var}_{t-1}[Y_t] = \sigma^2$$
Unconditional results exist only for the stationary case $|\rho| < 1$:
$$E[Y_t] = \frac{c}{1 - \rho} \qquad \mathrm{Var}[Y_t] = \frac{\sigma^2}{1 - \rho^2}$$
Even if $\sigma^2$ is small, the unconditional variance of the autoregressive
process can be quite large if $\rho$ is close to 1 or $-1$.
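The unconditional formulas can be checked against a long simulated path. This is a rough sketch rather than a formal test: it assumes Gaussian errors, starts the path at the unconditional mean, and uses the invented helper names `ar1_unconditional` and `simulate_ar1`.

```python
import random


def ar1_unconditional(c, rho, sigma):
    """Unconditional mean and variance of Y_t = c + rho * Y_{t-1} + eps_t."""
    if abs(rho) >= 1:
        raise ValueError("unconditional moments exist only for |rho| < 1")
    return c / (1 - rho), sigma**2 / (1 - rho**2)


def simulate_ar1(c, rho, sigma, n_obs, seed=0):
    """Simulate n_obs values with Gaussian errors, starting at the unconditional mean."""
    rng = random.Random(seed)
    y, _ = ar1_unconditional(c, rho, sigma)
    path = []
    for _ in range(n_obs):
        y = c + rho * y + rng.gauss(0.0, sigma)
        path.append(y)
    return path


mean, var = ar1_unconditional(1.0, 0.75, 1.0)
path = simulate_ar1(1.0, 0.75, 1.0, 200000)
sample_mean = sum(path) / len(path)
sample_var = sum((y - sample_mean) ** 2 for y in path) / len(path)
print(mean, var)
print(sample_mean, sample_var)
```

Raising `rho` towards one in `ar1_unconditional` shows how quickly the unconditional variance grows even though the error variance stays fixed.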
In small data samples, it can be quite difficult to tell whether a process is
stationary or not.
[Figure: Unconditional variance of autoregressive process as function of $\rho$ ($\sigma = 1$)]
Consider the unconditional covariance:
$$\mathrm{Cov}[Y_t, Y_{t-1}] = \mathrm{Cov}[c + \rho Y_{t-1} + \epsilon_t,\ Y_{t-1}] = \rho\, \mathrm{Var}[Y_{t-1}]$$
Note that this unconditional covariance exists only if $|\rho| < 1$.
The autocorrelation is given by:
$$\mathrm{Corr}[Y_t, Y_{t-1}] = \frac{\mathrm{Cov}[Y_t, Y_{t-1}]}{\sqrt{\mathrm{Var}[Y_t]\, \mathrm{Var}[Y_{t-1}]}} = \frac{\rho\, \mathrm{Var}[Y_{t-1}]}{\sqrt{\mathrm{Var}[Y_t]\, \mathrm{Var}[Y_{t-1}]}} = \rho$$
where we have taken advantage of the fact that $\mathrm{Var}[Y_t] = \mathrm{Var}[Y_{t-1}]$.
Applying this procedure iteratively, we find for any $n$:
$$\mathrm{Corr}[Y_t, Y_{t-n}] = \rho^n$$
Autocorrelations for an AR(1) process therefore decay exponentially (in
magnitude) with the time between the two observations. Recall that the
autocorrelation is only defined for a stationary process.
[Figure: Autocorrelation of AR(1) Process as Function of $\rho$]
Estimation: it looks like a regression equation:
$$\underbrace{Y_t}_{\text{Dependent variable}} = \underbrace{c}_{\text{Analogous to } \alpha} + \underbrace{\rho}_{\text{Analogous to } \beta} Y_{t-1} + \underbrace{\epsilon_t}_{\text{Error term}}$$
It looks just like a regression, in which the independent variable is simply a
lagged value of the dependent variable.
Can we run a regression, and use the statistical results on the properties of
OLS regression results?
Answer to the first part: yes.
Answer to the second part: no.
A key assumption of regression analysis is not satisfied here. (Which one?)
Regression becomes a little tricky with non-stationary processes.
Example: spurious regression.
Suppose $X_t$ and $Y_t$ are processes that follow:
$$X_t = X_{t-1} + \epsilon_t \qquad Y_t = Y_{t-1} + \eta_t$$
where $\epsilon_t$ and $\eta_t$ are standard normal random variables,
independent for all $t$. Note that the two processes have absolutely nothing
to do with each other.
What happens if we run the regression?
$$Y_t = \alpha + \beta X_t + \nu_t$$
Let's find out: 100 observations of each variable, beginning with
$X_0 = Y_0 = 0$.
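The experiment is easy to repeat many times. The sketch below simulates pairs of independent random walks, regresses one on the other by OLS, and counts how often the conventional t-test on the slope rejects at the 5% level; the number of repetitions, the seed, and the helper names are arbitrary choices made for this illustration.

```python
import math
import random


def random_walk(n_obs, rng):
    """Return X_1..X_n for X_t = X_{t-1} + eps_t with X_0 = 0 and N(0,1) errors."""
    x, path = 0.0, []
    for _ in range(n_obs):
        x += rng.gauss(0.0, 1.0)
        path.append(x)
    return path


def ols_slope_tstat(x, y):
    """Conventional OLS t-statistic for beta in y = alpha + beta * x + error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    return beta / math.sqrt(rss / (n - 2) / sxx)


def rejection_rate(n_obs=100, n_sims=500, seed=0):
    """Fraction of simulated regressions with |t| > 1.96 on the slope."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        x = random_walk(n_obs, rng)
        y = random_walk(n_obs, rng)
        if abs(ols_slope_tstat(x, y)) > 1.96:
            hits += 1
    return hits / n_sims


print(rejection_rate())
```

If the usual OLS inference were valid, the printed fraction would be close to 0.05; with unit-root regressors it is typically far higher, which is the sense in which the regression is spurious.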
[Figure: Spurious Regression Results]
What happened?
We can apply standard regression techniques to an AR(1) process.
However, some of the standard statistical properties of OLS do not hold,
because the usual assumptions are not met.
Consistency: the OLS estimates for $c$ and $\rho$ are consistent. In equations:
$$\lim_{t \to +\infty} \mathrm{Prob}(|\hat c - c| > \epsilon) = 0 \qquad \lim_{t \to +\infty} \mathrm{Prob}(|\hat\rho - \rho| > \epsilon) = 0$$
for all $\epsilon > 0$. In words, the probability that the estimates will vary
from the truth by a given amount gets smaller and smaller (and asymptotically
vanishes completely) with more and more data.
Asymptotic normality: for very large $t$, $\hat c$ and $\hat\rho$ have an
asymptotically normal distribution:
$$\sqrt{t} \begin{bmatrix} \hat c - c \\ \hat\rho - \rho \end{bmatrix} \to N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix},\ \begin{bmatrix} \sigma^2 & 0 \\ 0 & 1 - \rho^2 \end{bmatrix} \right)$$
These results are strictly asymptotic. In small samples, the distribution of
the estimated parameters will tend not to be normal, and the estimates are
not unbiased.
Examples: Vasicek model.
In the Vasicek interest rate model, the short-term (or instantaneous)
interest rate follows an autoregressive process:
$$r_t = c + \rho r_{t-1} + \epsilon_t$$
where $\epsilon_t$ has a Gaussian distribution. We would normally expect $\rho$
to be a positive number slightly less than one, so that the interest rate
process is stationary but highly persistent. (Note: in the Vasicek model, the
interest rate is usually written in continuous time, but the above is a
discrete-time version.)
Under the assumption that $\epsilon_t$ is Gaussian, the interest rate process is
also Gaussian (both conditionally and unconditionally). As such, it is possible
for the interest rate to become negative. Is that a good modelling feature?
More examples: stock prices. An autoregressive model is not particularly
suited for stock prices, because we would expect stocks with high prices to
exhibit more volatility (in price changes) than stocks with low prices. But
what about log stock prices?
$$\ln S_t = c + \rho \ln S_{t-1} + \epsilon_t$$
What seems like a reasonable value of $\rho$?
If $\rho = 1$, and the $\epsilon_t$ process is Gaussian, then the distribution of
$\ln S_t$ is also Gaussian (conditionally; the unconditional distribution does
not exist). Black-Scholes-Merton.
If this is the process followed by prices, what are the properties of returns?
More examples: volatility. Is an autoregressive process a good model for
the volatility of a financial asset?
Not particularly. We tend to think of volatility as stationary, and an AR(1)
process with $|\rho| < 1$ has this property. However, an AR(1) process can take
on negative values, which is impossible for volatility. We could take
volatility to be the absolute value of an AR(1) process, but such a process
will often approach (and reach) the value zero. Is that a good modelling
property?
For something like volatility, it is probably better to have a process in
which the variance of $\epsilon_t$ depends on the level of volatility. That way,
the variance of $\epsilon_t$ can become small when volatility is very low,
preventing it from crossing into negative territory. An AR(1) process does not
allow this sort of behaviour.
Returns: suppose log prices follow an AR(1) process, with $\rho = 1$:
$$\ln S_t = c + \ln S_{t-1} + \epsilon_t$$
The continuously compounded return is then:
$$\ln \frac{S_t}{S_{t-1}} = c + \epsilon_t$$
so that returns are uncorrelated with each other, and always have the
same distribution. This is the case for simple returns as well:
$$\frac{S_t - S_{t-1}}{S_{t-1}} = e^{c + \epsilon_t} - 1$$
What if we think returns are not uncorrelated over time?
We could model returns (rather than prices) as AR(1) processes.
Positive or negative autocorrelation, decaying over time.
At high frequencies, returns are almost certainly autocorrelated due to
various microstructure effects (e.g., bid-ask bounce).
What about lower frequencies?
What would cause returns to have positive autocorrelation? What about
negative autocorrelation?
[Figure: Fama-French Portfolio Monthly Return Autocorrelations]
[Figure: Fama-French Portfolio Monthly Excess Return Autocorrelations]
[Figure: Monthly Risk-free Rate Autocorrelations]
As shown, many of the assets have a very substantial autocorrelation after
a single lag, but most then approach zero at the second lag. It appears
there may be some oscillatory pattern (negative autocorrelations for all
assets at the third lag, substantial and positive for most at nine lags), but
without doing a formal statistical test (difficult), it's hard to be sure if
these are real phenomena or just sampling variation.
So it might seem that an AR(1) model could be a reasonable model for
returns.
[Table: Estimated Parameters in AR(1) Autoregression of Risk-free Rate]
As shown, the interest rate process is highly persistent.
But NOTE: many of the results in the regression output are derived under
an incorrect assumption.
In an AR(1) autoregression, the residuals and the X variables are not
uncorrelated!
These results can be considered valid asymptotically, for large $T$. This
data sample had 939 observations, so there probably isn't too much to
worry about.
Return Predictability Long Memory
GMM test of an asset pricing model, taking into account predictability of
returns?
We find something a bit odd, though, if we look at the autocorrelations of
the absolute value of returns instead of returns.
[Figure: Fama-French Portfolio Monthly Absolute Return Autocorrelations]
[Figure: Fama-French Portfolio Monthly Absolute Excess Return Autocorrelations]
Long memory property of returns.
What kind of process generates autocorrelation patterns like this? (Recall
these patterns are observed in absolute returns, not returns.)
Certainly not an AR(1) process.
Maybe an AR(2) or similar process can do it.
$$Y_t = c + b_1 Y_{t-1} + b_2 Y_{t-2} + \epsilon_t$$
Analysis of AR(m) models for $m > 1$ is quite difficult, and often best left
to computers equipped with appropriate software, but we are able to get a
few explicit results.
How to analyse something like an AR(2) process?
Rewrite it as:
$$\begin{bmatrix} Y_t \\ Y_{t-1} \end{bmatrix} = \begin{bmatrix} c \\ 0 \end{bmatrix} + \begin{bmatrix} b_1 & b_2 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} Y_{t-1} \\ Y_{t-2} \end{bmatrix} + \begin{bmatrix} \epsilon_t \\ 0 \end{bmatrix}$$
Consider the eigenvalues of the matrix in the middle term on the
right-hand side. These are numbers $\lambda$ that make the following expression
zero:
$$\det\left( \begin{bmatrix} b_1 & b_2 \\ 1 & 0 \end{bmatrix} - \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix} \right) = 0$$
We can solve explicitly for the two solutions:
$$\lambda = \frac{b_1 \pm \sqrt{b_1^2 + 4 b_2}}{2}$$
The solutions may be distinct real numbers, the same real number, or
complex conjugate pairs.
The condition for stationarity of an AR(2) process is that the two solutions
(call them $\lambda_1$ and $\lambda_2$) lie within the unit circle on the complex
plane:
$$|\lambda_1| < 1 \qquad |\lambda_2| < 1$$
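The stationarity check is short enough to write out. The sketch below evaluates the two roots of the quadratic with complex arithmetic, so that the complex-conjugate case is handled as well; the function names `ar2_roots` and `is_stationary` are made up for this illustration.

```python
import cmath


def ar2_roots(b1, b2):
    """Eigenvalues of [[b1, b2], [1, 0]]: (b1 +/- sqrt(b1^2 + 4*b2)) / 2."""
    disc = cmath.sqrt(b1 * b1 + 4 * b2)
    return (b1 + disc) / 2, (b1 - disc) / 2


def is_stationary(b1, b2):
    """True when both eigenvalues lie strictly inside the unit circle."""
    return all(abs(lam) < 1 for lam in ar2_roots(b1, b2))


for b1, b2 in [(0.5, 0.3), (0.5, 0.6), (1.0, -0.5)]:
    roots = ar2_roots(b1, b2)
    print(b1, b2, [round(abs(r), 4) for r in roots], is_stationary(b1, b2))
```

The third parameter pair has a negative discriminant, so its roots form a complex conjugate pair with the same modulus.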
If the process is stationary, then we have:
$$E[Y_t] = \frac{c}{1 - b_1 - b_2} \qquad \mathrm{Var}[Y_t] = \frac{\sigma^2}{1 - b_1^2 - b_2^2 - \dfrac{2 b_1^2 b_2}{1 - b_2}}$$
The correlation structure of an AR(2) process (when the $\lambda$ are real) is a
mixture of two decaying exponential processes. Depending on the
parameter values, one of these could be fast decaying, and one slow. An
AR(2) process can therefore have persistent components.
Note that we would have to model absolute, or squared, returns this way,
not returns themselves. More on that later.
Return Predictability VAR
Vector autoregression: allow $Y_t$ to be a vector of variables.
$$\underbrace{Y_t}_{N \times 1 \text{ vector}} = \underbrace{c}_{N \times 1 \text{ vector}} + \underbrace{B}_{N \times N \text{ matrix}} Y_{t-1} + \underbrace{\epsilon_t}_{N \times 1 \text{ vector}}$$
The above is a VAR(1) process. Every AR(m) process can be rewritten as
a VAR(1) process. (We have already done this for one particular case.)
Stationarity condition: all eigenvalues of $B$ have absolute value less than
one.
Assuming stationarity:
$$E[Y_t] = (I - B)^{-1} c$$
The variance of a stationary VAR(1) process is the solution to:
$$\mathrm{Var}[Y_t] = B\, \mathrm{Var}[Y_t]\, B^T + \mathrm{Var}[\epsilon_t]$$
We also have:
$$\mathrm{Cov}[Y_t, Y_{t-1}] = B\, \mathrm{Var}[Y_t]$$
The general result is:
$$\mathrm{Cov}[Y_t, Y_{t-n}] = B^n\, \mathrm{Var}[Y_t]$$
The stationarity condition requires that $B^n$ becomes smaller in an
appropriate sense for larger and larger values of $n$.
It is also possible to write VAR(m) models for $m > 1$:
$$Y_t = c + B_1 Y_{t-1} + B_2 Y_{t-2} + B_3 Y_{t-3} + B_4 Y_{t-4} + B_5 Y_{t-5} + \epsilon_t$$
It is very easy to get carried away. Suppose $Y_t$ has 10 elements, and ten
years of weekly history are available, for 520 observations (containing
5,200 numbers).
How many parameters are there to be estimated?
An overly aggressive VAR(m) model is an exercise in overfitting. (A very
famous Bayesian econometrician used to refer to VAR models as "Very
Awful Regressions." Something of an overstatement, but some discipline is
needed.)
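The parameter count for a VAR(m) is simple arithmetic, sketched below. The count assumes an intercept vector, m full coefficient matrices, and (optionally) the distinct entries of a symmetric error covariance matrix; the function name `var_parameter_count` is invented for this illustration.

```python
def var_parameter_count(n_vars, n_lags, include_error_cov=True):
    """Number of free parameters in a VAR(n_lags) with n_vars variables.

    Counts the intercept vector (n_vars), the lag matrices (n_lags * n_vars^2)
    and, optionally, the distinct elements of the symmetric error covariance.
    """
    count = n_vars + n_lags * n_vars**2
    if include_error_cov:
        count += n_vars * (n_vars + 1) // 2
    return count


n_vars, n_lags, n_obs = 10, 5, 520
params = var_parameter_count(n_vars, n_lags)
print(params, "parameters versus", n_vars * n_obs, "data points")
```

For the five-lag, ten-variable case this leaves fewer than ten data points per parameter, which is the overfitting concern raised above.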
Estimation: a VAR(m) model for any $m$ can be estimated using OLS
regression.
As in the AR(1) case, the technique works, but the statistical properties of
the estimates are different.
Consistency: the parameter estimates are consistent; for large amounts of
data, they converge to the true parameter values.
Bias: the VAR(m) estimates are, in general, not unbiased. (Some work
has been done on finding unbiased estimates.) However, the bias goes
away for large data samples.
Asymptotic normality: the estimates have an asymptotic Gaussian
distribution, even if the $\epsilon_t$ do not.
The asymptotic variance is complicated.
Return Predictability Persistence
There are many alternative methods for estimation of a VAR(m) model.
The OLS method ignores some of the information in the first $m - 1$
observations; some estimation methods try to capture this.
Others try to eliminate the bias.
Although they may have desirable statistical properties, they are generally
complicated to implement.
There are alternate ways to generate persistent processes.
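Consider an AR(1) process:
$$X_t = c + d X_{t-1} + \epsilon_t$$
However, the $X_t$ process is not observed; it is latent. What is observed is:
$$Y_t = a + b X_t + \eta_t$$
where $\epsilon_t$ and $\eta_t$ are independent. The statistical properties of
$X_t$ are already worked out:
$$E[X_t] = \frac{c}{1 - d} \qquad \mathrm{Var}[X_t] = \frac{\sigma_\epsilon^2}{1 - d^2} \qquad \mathrm{Corr}[X_s, X_t] = d^{|s-t|}$$
The properties of the observed variable $Y_t$ are easy enough to work out:
$$E[Y_t] = a + \frac{bc}{1 - d} \qquad \mathrm{Var}[Y_t] = \frac{b^2 \sigma_\epsilon^2}{1 - d^2} + \sigma_\eta^2$$
$$\mathrm{Cov}[Y_s, Y_t] = \begin{cases} \dfrac{b^2 \sigma_\epsilon^2}{1 - d^2} + \sigma_\eta^2 & s = t \\[2ex] \dfrac{b^2 \sigma_\epsilon^2\, d^{|s-t|}}{1 - d^2} & s \ne t \end{cases}$$
A process of this type can generate a low but persistent autocorrelation.
[Figure: Autocorrelated process with $b = 0.2$, $d = 0.95$, $\sigma_\epsilon = 1$, and $\sigma_\eta = 1$]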
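The "low but persistent" pattern follows directly from the covariance formula, as the sketch below shows for the parameter values in the figure caption. The function name `observed_autocorrelation` is made up, and the calculation assumes the latent process is stationary (that is, |d| < 1).

```python
def observed_autocorrelation(lag, b, d, sigma_eps, sigma_eta):
    """Autocorrelation of Y_t = a + b * X_t + eta_t at the given lag.

    X_t is a stationary latent AR(1) with coefficient d and error s.d. sigma_eps;
    eta_t is independent measurement noise with s.d. sigma_eta.
    """
    if lag == 0:
        return 1.0
    signal_var = b**2 * sigma_eps**2 / (1 - d**2)
    return signal_var * d ** abs(lag) / (signal_var + sigma_eta**2)


for lag in (1, 6, 12, 24):
    print(lag, round(observed_autocorrelation(lag, 0.2, 0.95, 1.0, 1.0), 4))
```

The measurement noise scales every autocorrelation down by the same factor, while the rate of decay is still governed by `d`; that combination gives small first-order autocorrelation that nonetheless dies out slowly.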
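How does this process compare to the AR(2)? It is just a special case:
$$\begin{bmatrix} X_t \\ Y_t \end{bmatrix} = \begin{bmatrix} c \\ a + bc \end{bmatrix} + \begin{bmatrix} d & 0 \\ bd & 0 \end{bmatrix} \begin{bmatrix} X_{t-1} \\ Y_{t-1} \end{bmatrix} + \begin{bmatrix} \epsilon_t \\ \eta_t + b\epsilon_t \end{bmatrix}$$
However, the state variables are only partially observed.
Identification: $X_t$ is not observed, and as a consequence, there are many
different values of the parameters $a$, $b$, $c$, $d$, $\sigma_\epsilon$, and
$\sigma_\eta$ which produce identical behaviour for the observed variable $Y_t$.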
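Example: suppose $X'_t = \alpha + \beta X_t$. Then the process followed by $X'_t$ is:
$$\begin{aligned}
X'_t &= \alpha + \beta X_t = \alpha + \beta (c + d X_{t-1} + \epsilon_t) \\
     &= \alpha + \beta c + d \beta X_{t-1} + \beta \epsilon_t = \alpha + \beta c + d\, (X'_{t-1} - \alpha) + \beta \epsilon_t \\
     &= [\alpha (1 - d) + \beta c] + d X'_{t-1} + \beta \epsilon_t
\end{aligned}$$
Then the process $Y_t$ follows:
$$Y_t = a + b X_t + \eta_t = a + b\, \frac{X'_t - \alpha}{\beta} + \eta_t = \left( a - \frac{\alpha b}{\beta} \right) + \frac{b}{\beta} X'_t + \eta_t$$
We can choose:
$$\alpha = -\frac{c}{\sigma_\epsilon (1 - d)} \qquad \beta = \frac{1}{\sigma_\epsilon}$$
Then:
$$X'_t = d X'_{t-1} + \frac{1}{\sigma_\epsilon} \epsilon_t = d X'_{t-1} + \epsilon'_t$$
$$Y_t = \left( a + \frac{bc}{1 - d} \right) + b \sigma_\epsilon X'_t + \eta_t = a' + b' X'_t + \eta_t$$
Therefore, without loss of generality, we will drop the prime notation and
just take $c = 0$ and $\sigma_\epsilon = 1$.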
With this normalisation, the AR(2) representation is:
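$$\begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \begin{pmatrix} 0 \\ a \end{pmatrix} + \begin{pmatrix} d & 0 \\ bd & 0 \end{pmatrix} \begin{pmatrix} X_{t-1} \\ Y_{t-1} \end{pmatrix} + \begin{pmatrix} \epsilon_t \\ \eta_t + b\epsilon_t \end{pmatrix}$$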
Estimation: there are four parameters, $a$, $b$, $d$, and $\sigma_\eta$. How to estimate them?
We first note that if $\epsilon_t$ and $\eta_t$ are assumed to have a multivariate normal distribution, then $X_t$ and $Y_t$ also do (both conditionally and unconditionally). Let's see what we can say about the probability distribution of the observed data series, $Y_0, \ldots, Y_T$.
First, we note that the $Y_0, \ldots, Y_T$ have a multivariate normal distribution.
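We can therefore write the joint probability density function as:
$$f_Y(y) = \frac{1}{(2\pi)^{\frac{T+1}{2}} |\Sigma|^{\frac{1}{2}}} \exp\left( -\frac{(y - \mu)^T \Sigma^{-1} (y - \mu)}{2} \right)$$
where $\mu$ contains the unconditional means of the $Y_t$ (which are equal to $a$), and $\Sigma$ is the unconditional covariance matrix of the $Y_t$:
$$\Sigma = \frac{b^2}{1 - d^2} \begin{pmatrix} 1 & d & \cdots & d^{T-1} & d^T \\ d & 1 & \cdots & d^{T-2} & d^{T-1} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ d^{T-1} & d^{T-2} & \cdots & 1 & d \\ d^T & d^{T-1} & \cdots & d & 1 \end{pmatrix} + \sigma_\eta^2 I$$
where $I$ is the identity matrix.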
Return Predictability Maximum Likelihood
Maximum likelihood estimation: choose the parameter values that make the probability (or likelihood) density function of the observations as high as possible.
Maximum likelihood can be thought of as a special case of method of moments. (What are the moment conditions?)
Maximum likelihood has desirable asymptotic properties: it is efficient, meaning that asymptotically (and under standard regularity conditions), the parameter estimates have a variance at least as small as that of any other consistent estimator.
Simple example: estimation of the mean and standard deviation of a normal distribution.
Suppose we observe $X_1, \ldots, X_N$, which are drawn independently from a normal distribution with mean $\mu$ and standard deviation $\sigma$.
The joint density function of the observations is:
$$f_X(x) = \frac{1}{(2\pi\sigma^2)^{\frac{N}{2}}} \exp\left( -\sum_{i=1}^N \frac{(x_i - \mu)^2}{2\sigma^2} \right)$$
Easier to work with logs; maximising the logarithm is the same as maximising the original function.
$$\ln f_X(x) = -\frac{N}{2} \ln\left( 2\pi\sigma^2 \right) - \sum_{i=1}^N \frac{(x_i - \mu)^2}{2\sigma^2}$$
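We need to find the $\mu$ and $\sigma$ that maximise the logarithmic likelihood function.
Taking derivatives with respect to $\mu$ and $\sigma$:
$$\frac{\partial}{\partial \mu} \ln f_X(x) = \sum_{i=1}^N \frac{x_i - \mu}{\sigma^2}$$
$$\frac{\partial}{\partial \sigma} \ln f_X(x) = -\frac{N}{\sigma} + \sum_{i=1}^N \frac{(x_i - \mu)^2}{\sigma^3}$$
We will now put hats on the parameters to indicate the optimal values.
Setting the first equal to zero and solving for $\hat\mu$, we find:
$$\hat\mu = \frac{1}{N} \sum_{i=1}^N x_i$$
Setting the second equal to zero and solving, we find:
$$\hat\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \hat\mu)^2}$$
Note that the maximum likelihood estimate of $\sigma$ is different from the usual estimate: it divides by $N$ rather than $N - 1$.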
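The closed-form estimates above are easy to check numerically. The following is a minimal sketch using NumPy; the function name `normal_mle`, the simulated sample, and its parameter values are illustrative assumptions only.

```python
import numpy as np

def normal_mle(x):
    """Maximum likelihood estimates of the mean and standard deviation of a normal sample."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    # The ML estimate divides the sum of squared deviations by N, not N - 1.
    sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))
    return mu_hat, sigma_hat

rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=2.0, size=50)  # hypothetical data

mu_hat, sigma_hat = normal_mle(sample)
sigma_usual = sample.std(ddof=1)  # the usual estimate, dividing by N - 1
print(mu_hat, sigma_hat, sigma_usual)
```

For any sample, the maximum likelihood estimate of $\sigma$ is smaller than the usual one by the factor $\sqrt{(N-1)/N}$, which is negligible in large samples.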
Returning to the original estimation problem, we have in principle a technique (maximum likelihood) that allows us to estimate the parameters. We know the probability density function, so we could choose the parameters that maximise its value.
There is a practical problem though; the density function involves the inverse of the $\Sigma$ matrix, and if the number of data observations is large, then this will be a huge matrix, and very difficult to invert.
We can instead apply an iterative procedure. The trick is to write down the probability density of the observed quantities in a different way.
The trick is:
1. Each time period, calculate the joint probability distribution of $X_t$ and $Y_t$, conditional on $X_{t-1}$ and $Y_0, \ldots, Y_{t-1}$.
2. Calculate the joint probability distribution of $X_t$ and $Y_t$, conditional only on $Y_0, \ldots, Y_{t-1}$.
3. Calculate the probability distribution of $X_t$ conditional on $Y_0, \ldots, Y_t$.
The third step is needed to carry out the first two steps at the next time period.
When we are done, among the results are the probability distributions of each $Y_t$, conditional on all previous $Y_0, \ldots, Y_{t-1}$.
We can string these together to find the joint probability distribution of all $Y_0, \ldots, Y_T$, and use it to perform maximum likelihood estimation.
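First time step, first task. What is the unconditional joint distribution of $X_0$ and $Y_0$? (There are no previous observations to condition on.)
The joint distribution is bivariate normal. Under the normalisations we have made, we have:
$$E\begin{pmatrix} X_0 \\ Y_0 \end{pmatrix} = \begin{pmatrix} 0 \\ a \end{pmatrix} \qquad Var\begin{pmatrix} X_0 \\ Y_0 \end{pmatrix} = \begin{pmatrix} \dfrac{1}{1 - d^2} & \dfrac{b}{1 - d^2} \\ \dfrac{b}{1 - d^2} & \dfrac{b^2}{1 - d^2} + \sigma_\eta^2 \end{pmatrix}$$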
Note that the marginal means and variances of either $X_0$ or $Y_0$ can readily be extracted from the above.
The second task is to find probabilities not conditional on previous values of $X_t$. Since there are none, this task need not be performed for the first time step.
The third task is to find the distribution of $X_0$, conditional on $Y_0$. This distribution is Gaussian, with mean and variance:
$$E[X_0 | Y_0] = \frac{b}{b^2 + (1 - d^2)\sigma_\eta^2} (Y_0 - a)$$
$$Var[X_0 | Y_0] = \frac{1}{1 - d^2} \left( \frac{\sigma_\eta^2}{\frac{b^2}{1 - d^2} + \sigma_\eta^2} \right)$$
These results follow from the properties of a bivariate normal distribution.
The approach we are taking here depends very much on normality. It can be extended to include more $X_t$ or $Y_t$ variables, but it cannot easily be extended to non-Gaussian data.
Each subsequent time step: the first task is to find the distribution of $X_t$ and $Y_t$, conditional on $X_{t-1}$ and $Y_0, \ldots, Y_{t-1}$.
This distribution is bivariate normal. The means, variances, and covariances are:
$$E_{X_{t-1}, Y_0, \ldots, Y_{t-1}}\begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \begin{pmatrix} dX_{t-1} \\ a + bdX_{t-1} \end{pmatrix}$$
$$Var_{X_{t-1}, Y_0, \ldots, Y_{t-1}}\begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \begin{pmatrix} 1 & b \\ b & b^2 + \sigma_\eta^2 \end{pmatrix}$$
Note that these results do not depend on any of the previous results. Also note that the dependence on $Y_0, \ldots, Y_{t-1}$ is somewhat superfluous, as the distribution of $X_t$ and $Y_t$ depends only on $X_{t-1}$. However, we leave this dependence in, as it simplifies later stages.
Second task: find the distribution of $X_t$ and $Y_t$, conditional only on $Y_0, \ldots, Y_{t-1}$. (In the first task, we calculated these quantities conditional on $Y_0, \ldots, Y_{t-1}$ and $X_{t-1}$.)
The distribution we are looking for is bivariate Gaussian, so all we have to do is find the means and variances and covariances. Starting with the means, we use the law of iterated expectations:
$$\begin{aligned} E_{Y_0, \ldots, Y_{t-1}}\begin{pmatrix} X_t \\ Y_t \end{pmatrix} &= E_{Y_0, \ldots, Y_{t-1}}\left[ E_{X_{t-1}, Y_0, \ldots, Y_{t-1}}\begin{pmatrix} X_t \\ Y_t \end{pmatrix} \right] \\ &= E_{Y_0, \ldots, Y_{t-1}}\begin{pmatrix} dX_{t-1} \\ a + bdX_{t-1} \end{pmatrix} \\ &= \begin{pmatrix} d \, E_{Y_0, \ldots, Y_{t-1}}[X_{t-1}] \\ a + bd \, E_{Y_0, \ldots, Y_{t-1}}[X_{t-1}] \end{pmatrix} \end{aligned}$$
Note that the expectation that appears in the last expression (the same one appears twice) was calculated in the third task of the previous time step.
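For variance, we note that:
$$Var_{Y_0, \ldots, Y_{t-1}}\begin{pmatrix} X_t \\ Y_t \end{pmatrix} = E_{Y_0, \ldots, Y_{t-1}}\left[ Var_{X_{t-1}, Y_0, \ldots, Y_{t-1}}\begin{pmatrix} X_t \\ Y_t \end{pmatrix} \right] + Var_{Y_0, \ldots, Y_{t-1}}\left[ E_{X_{t-1}, Y_0, \ldots, Y_{t-1}}\begin{pmatrix} X_t \\ Y_t \end{pmatrix} \right]$$
For the first term, we note that:
$$E_{Y_0, \ldots, Y_{t-1}}\left[ Var_{X_{t-1}, Y_0, \ldots, Y_{t-1}}\begin{pmatrix} X_t \\ Y_t \end{pmatrix} \right] = Var_{X_{t-1}, Y_0, \ldots, Y_{t-1}}\begin{pmatrix} X_t \\ Y_t \end{pmatrix}$$
since we are taking expectations of a matrix of constants.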
Turning to the second term, we note that:
$$Var_{Y_0, \ldots, Y_{t-1}}\left[ E_{X_{t-1}, Y_0, \ldots, Y_{t-1}}\begin{pmatrix} X_t \\ Y_t \end{pmatrix} \right] = Var_{Y_0, \ldots, Y_{t-1}}\begin{pmatrix} dX_{t-1} \\ a + bdX_{t-1} \end{pmatrix} = \begin{pmatrix} d^2 & bd^2 \\ bd^2 & b^2 d^2 \end{pmatrix} Var_{Y_0, \ldots, Y_{t-1}}[X_{t-1}]$$
The variance in the last expression was calculated in the third task at the previous time step.
Putting the last two results together, we have the variance of $X_t$ and $Y_t$, conditional on $Y_0, \ldots, Y_{t-1}$.
The third task is to find the distribution of $X_t$, conditional on $Y_0, \ldots, Y_t$.
We already have the joint distribution of $X_t$ and $Y_t$, conditional on $Y_0, \ldots, Y_{t-1}$. Call the relevant quantities $\mu_X$, $\mu_Y$, $\sigma^2_{XX}$, $\sigma^2_{XY}$, and $\sigma^2_{YY}$.
Then:
$$E_{Y_0, \ldots, Y_t}[X_t] = \mu_X + \frac{\sigma^2_{XY}}{\sigma^2_{YY}} (Y_t - \mu_Y)$$
$$Var_{Y_0, \ldots, Y_t}[X_t] = \sigma^2_{XX} - \frac{\left( \sigma^2_{XY} \right)^2}{\sigma^2_{YY}}$$
Since the distribution we are looking for is Gaussian, these are all we need.
Proceeding in this way for each time step, one of the outputs of the second task is the distribution of $Y_t$, conditional on $Y_0, \ldots, Y_{t-1}$.
We can string these together, to find the unconditional distribution of the entire series of observations $Y_0, \ldots, Y_T$:
$$f_{Y_0, \ldots, Y_T}(y_0, \ldots, y_T) = f_{Y_0}(y_0) \prod_{t=1}^T f_{Y_t | Y_0, \ldots, Y_{t-1}}(y_t; y_0, \ldots, y_{t-1})$$
This is the joint unconditional probability density function, or likelihood, of the entire series of observations of $Y_t$. We can use it to perform maximum likelihood estimation.
Choose the parameters $a$, $b$, $d$, and $\sigma_\eta$ that maximise the value of the likelihood.
Can try to solve for explicit formulas, or can just search numerically.
This procedure is known as filtering, or Kalman filtering.
In general, filtering works in principle, but is fiendishly difficult to apply in practice.
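The three tasks can be sketched in a few lines of code for the normalised model ($c = 0$, $\sigma_\epsilon = 1$). This is only an illustrative sketch: the function names, the simulated data, and the parameter values are assumptions, and the second function evaluates the same likelihood by brute force from the covariance matrix $\Sigma$ purely as a cross-check.

```python
import numpy as np

def filter_loglik(y, a, b, d, sigma_eta):
    """Log-likelihood of y[0], ..., y[T] built up from the three-task recursion."""
    # First time step: unconditional joint moments of (X_0, Y_0).
    mu_x, mu_y = 0.0, a
    s_xx = 1.0 / (1.0 - d**2)
    s_xy = b * s_xx
    s_yy = b**2 * s_xx + sigma_eta**2
    loglik = 0.0
    for y_t in np.asarray(y, dtype=float):
        # Density of Y_t given Y_0..Y_{t-1}: an output of the second task.
        loglik += -0.5 * (np.log(2.0 * np.pi * s_yy) + (y_t - mu_y) ** 2 / s_yy)
        # Third task: mean and variance of X_t given Y_0..Y_t.
        m = mu_x + s_xy / s_yy * (y_t - mu_y)
        p = s_xx - s_xy**2 / s_yy
        # First and second tasks for the next step: moments of
        # (X_{t+1}, Y_{t+1}) given Y_0..Y_t.
        mu_x, mu_y = d * m, a + b * d * m
        s_xx = 1.0 + d**2 * p
        s_xy = b + b * d**2 * p
        s_yy = b**2 + sigma_eta**2 + b**2 * d**2 * p
    return loglik

def direct_loglik(y, a, b, d, sigma_eta):
    """Same log-likelihood from the joint multivariate normal density (cross-check only)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    cov = b**2 / (1.0 - d**2) * d**lags + sigma_eta**2 * np.eye(n)
    dev = y - a
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + dev @ np.linalg.solve(cov, dev))

# Hypothetical parameter values and simulated data.
rng = np.random.default_rng(1)
a, b, d, sigma_eta = 4.0, 0.5, 0.9, 1.5
x = rng.standard_normal() / np.sqrt(1.0 - d**2)
ys = []
for _ in range(200):
    ys.append(a + b * x + sigma_eta * rng.standard_normal())
    x = d * x + rng.standard_normal()

ll_filter = filter_loglik(ys, a, b, d, sigma_eta)
ll_direct = direct_loglik(ys, a, b, d, sigma_eta)
print(ll_filter, ll_direct)
```

The recursion never forms or inverts the large covariance matrix; its cost grows linearly with the number of observations. To estimate the parameters, `filter_loglik` would be handed to a numerical optimiser.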
Multivariate Gaussian random variables have:
1. Conditional variances and covariances that are constant.
2. Conditional means that are linear.
These properties make filtering tractable under an assumption of normality.
What if the data are not normally distributed?
It is not an uncommon procedure to just apply Kalman filtering anyway, and hope for the best.
Forecasting: the third task in the iterative procedure produces the distribution of $X_t$, conditional on $Y_0, \ldots, Y_t$.
This task is needed because it is used when performing the second task (finding the distribution of $X_t$ and $Y_t$, conditional on $Y_0, \ldots, Y_{t-1}$) for the next time step.
Therefore, for the last time step, the third task is unnecessary: the result is never used.
However, if the goal is forecasting, we can go ahead and perform the first and second tasks for the next time step (i.e., one that hasn't happened yet), to get the distribution of $Y_{T+1}$, conditional on all previous observations $Y_0, \ldots, Y_T$, if desired.
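As a sketch of the forecasting step, suppose the third task at the last time step has produced the filtered mean and variance of $X_T$. The names `m_T` and `p_T` below are assumed labels for those two quantities, and the numbers in the usage line are made up for illustration.

```python
def forecast_next(m_T, p_T, a, b, d, sigma_eta):
    """Mean and variance of Y_{T+1} given Y_0..Y_T (normalised model, sigma_eps = 1)."""
    # First and second tasks for time T + 1, applied to the latent variable.
    mean_x = d * m_T
    var_x = 1.0 + d**2 * p_T
    # Y_{T+1} = a + b * X_{T+1} + eta_{T+1}.
    mean_y = a + b * mean_x
    var_y = b**2 * var_x + sigma_eta**2
    return mean_y, var_y

# Hypothetical filtered state and parameter values.
mean_y, var_y = forecast_next(m_T=2.0, p_T=0.5, a=1.0, b=2.0, d=0.5, sigma_eta=1.0)
print(mean_y, var_y)
```

Because the forecast distribution is Gaussian, the mean and variance describe it completely.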
Example: estimate the parameters of such a process, taking the absolute value of RMRF as the $Y_t$ variable.
Estimates:
$$\hat a = 4.0973 \qquad \hat b = 0.4792 \qquad \hat d = 0.9721 \qquad \hat\sigma_\eta = 3.3143$$
Standard errors are:
$$\sigma(\hat a) = 0.6765 \qquad \sigma(\hat b) = 0.0520 \qquad \sigma(\hat d) = 0.0072 \qquad \sigma(\hat\sigma_\eta) = 0.0417$$
(More on how to calculate standard errors later.)
We can have a look at how the empirical autocorrelations compare to
those estimated by the model.
Empirical vs. Implied Autocorrelations
Forecast vs. Observed Values
Forecast vs. Observed Values (First Third of Sample)
Forecast vs. Observed Values (Middle Third of Sample)
Forecast vs. Observed Values (Last Third of Sample)
The unconditional mean and variance of the $Y_t$ variable implied by the model are off a bit from those obtained simply by taking the sample averages, but not by huge amounts. As shown in the graph, the first-order autocorrelation is also off slightly from the sample estimate.
How are standard errors calculated? (Or more precisely, estimated?)
There are two commonly used methods for estimating the standard errors of parameter estimates when using maximum likelihood. They are asymptotically equivalent, but can differ in small samples; one has a tendency to be more robust than the other.
One method involves the second derivative of the logarithm of the likelihood function, evaluated at the estimated parameter values.
The method employed here uses the first derivative instead.
Call the parameters collectively $\theta$. Maximum likelihood is performed by maximising the joint probability density function (or likelihood function) across all possible parameter values. In practice, we usually maximise the logarithm of the joint density, which is equivalent. Denote the logarithm of the likelihood of observation $Y_t = y_t$ by $\ell(y_t; \theta)$:
$$\ell(y_t; \theta) = \ln f_{Y_t}(y_t; \theta)$$
The score functions are the derivatives of this function with respect to the parameters; since there is one score function for each parameter, and since these functions can be evaluated at each observation $Y_t$, we use the notation
$$\ell_\theta(y; \theta) = \begin{pmatrix} \frac{\partial}{\partial \theta_1} \ln f_{Y_1}(y_1; \theta) & \cdots & \frac{\partial}{\partial \theta_N} \ln f_{Y_1}(y_1; \theta) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial \theta_1} \ln f_{Y_T}(y_T; \theta) & \cdots & \frac{\partial}{\partial \theta_N} \ln f_{Y_T}(y_T; \theta) \end{pmatrix}$$
to refer to the entire matrix, where $T$ is the number of time series observations and $N$ is the number of parameters.
We can estimate the standard errors of the parameter estimates as follows:
$$Var\left[ \hat\theta \right] = \left[ \left( \ell_\theta(y; \hat\theta) \right)^T \left( \ell_\theta(y; \hat\theta) \right) \right]^{-1}$$
The result is an $N \times N$ matrix; the standard errors can be extracted by taking the square roots of the diagonal elements. Other information, such as the correlation between estimates of different parameters, can be extracted from this matrix if desired.
The derivatives can sometimes be evaluated explicitly, but often it is much
more practical to evaluate them numerically, by varying the parameter
values by a small amount.
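The first-derivative method can be sketched with numerically evaluated scores. The example below applies it to the simple normal model rather than to the filtered model, since the per-observation log density is available in closed form there; the function names, the step size, and the simulated sample are all illustrative assumptions.

```python
import numpy as np

def log_density(theta, x):
    """Log density of each observation under a normal model; theta = (mu, sigma)."""
    mu, sigma = theta
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)

def numeric_scores(theta, x, step=1e-5):
    """T x N matrix of score functions, by central differences in each parameter."""
    theta = np.asarray(theta, dtype=float)
    scores = np.empty((len(x), len(theta)))
    for j in range(len(theta)):
        bump = np.zeros_like(theta)
        bump[j] = step
        scores[:, j] = (log_density(theta + bump, x) - log_density(theta - bump, x)) / (2.0 * step)
    return scores

def score_standard_errors(theta_hat, x):
    """Standard errors from the inverse of (scores transposed times scores)."""
    s = numeric_scores(theta_hat, x)
    cov = np.linalg.inv(s.T @ s)
    return np.sqrt(np.diag(cov)), cov

rng = np.random.default_rng(2)
x = rng.normal(loc=0.5, scale=2.0, size=5000)  # hypothetical data
theta_hat = np.array([x.mean(), x.std(ddof=0)])  # ML estimates for this model
se, cov = score_standard_errors(theta_hat, x)
print("estimates:", theta_hat, "standard errors:", se)
```

For this model the standard error of the mean should come out close to $\hat\sigma / \sqrt{T}$, which gives a quick sanity check on the numerical derivatives.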
Hypothesis testing and the like is normally conducted only asymptotically,
i.e., assuming the parameter estimate has a normal distribution, and the
standard error is estimated precisely. The test statistic then has a normal
distribution.
More explicit (e.g., small-sample rather than asymptotic) results are
available for maximum likelihood estimation only in particular special
cases. In general, numeric search for the optimal parameter values (i.e.,
those that maximise the likelihood function) and numeric evaluation of
derivatives to calculate standard errors is the best we can do.
Volatility Modelling
Volatility Modelling Autoregressive Models
Volatility is rarely directly observed. It generally must be estimated, or
inferred from prices of assets (e.g., options).
In relatively rare circumstances, volatility may be identified in a relatively straightforward way from prices of instruments such as volatility swaps, but even then, there is some subtlety in relating the traded price of volatility with the actual volatility of financial assets.
In many simple financial models (e.g., Black-Scholes-Merton option pricing), volatility of financial assets is constant.
Unlike return predictability, which is controversial, the fact that volatility changes over time in a way that is at least somewhat predictable is virtually undisputed.
Model-based vs. model-free.
General AR(m) model for returns process:
$$Y_t = c + \sum_{i=1}^m b_i Y_{t-i} + \epsilon_t$$
Properties of such a model:
1. Conditional mean is a linear function of past observations.
2. Conditional variance is constant.
3. Conditional distribution is Gaussian.
It is possible to relax all three assumptions.
A more general model:
$$Y_t = \mu(Y_{t-1}, \ldots, Y_{t-m}) + \sigma(Y_{t-1}, \ldots, Y_{t-m}) \, \epsilon_t$$
1. Allows non-linear drift.
2. Allows time-varying volatility (possibly varying in a non-linear way).
3. Allows non-Gaussian error terms.
How to estimate?
In general, it is difficult.
Two-stage vs. quasi-maximum likelihood.
Quasi-maximum likelihood: assume the $\epsilon_t$ have a Gaussian distribution; can use tricks like linearisation of the coefficient functions $\mu$ and $\sigma$.
Two-stage: if the parameters that affect $\mu$ are different from the parameters that affect $\sigma$, then we can estimate the parameters that affect $\mu$ first. E.g., if $\mu$ is a linear function, then we can just use OLS. The results will be consistent, but note the heteroscedasticity; could use White's method to estimate standard errors.
The residuals from the estimation of the mean can then be analysed to estimate the parameters of the $\sigma$ function.
Both of these methods have their issues. In the two-stage procedure, we are analysing estimated, rather than actual, residuals, and this source of error is often simply ignored. Furthermore, the same parameters will often show up in both $\mu$ and $\sigma$. The quasi-maximum likelihood approach relies on assumptions that may quite explicitly be violated.
Some common models: Vasicek model for the interest rate:
$$r_t = \alpha + \beta r_{t-1} + \sigma \epsilon_t$$
where the error term has a Gaussian distribution.
Simple AR(1) process:
1. Linear expected value.
2. Constant volatility.
3. Gaussian error term.
First interest rate model (1977) in the modern spirit.
How to estimate the Vasicek model?
It is an AR(1) model, so we can just use OLS regression.
Results are consistent, although biased in small samples.
OLS results (almost) coincide with maximum likelihood estimation ($T$ vs. $T - 1$ in estimation of the volatility parameter).
Hard to do better.
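A minimal sketch of the OLS approach, run on simulated data. The simulation routine, the function names, and the parameter values are assumptions for illustration; the volatility estimate shown divides the sum of squared residuals by the number of regression observations, which is the maximum likelihood version.

```python
import numpy as np

def simulate_vasicek(alpha, beta, sigma, r0, n, rng):
    """Simulate r_t = alpha + beta * r_{t-1} + sigma * eps_t with Gaussian eps_t."""
    r = np.empty(n + 1)
    r[0] = r0
    shocks = rng.standard_normal(n)
    for t in range(1, n + 1):
        r[t] = alpha + beta * r[t - 1] + sigma * shocks[t - 1]
    return r

def fit_vasicek_ols(r):
    """Regress r_t on a constant and r_{t-1}; return (alpha, beta, sigma) estimates."""
    y, x = r[1:], r[:-1]
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma_hat = np.sqrt(resid @ resid / len(y))  # ML version of the volatility estimate
    return coef[0], coef[1], sigma_hat

rng = np.random.default_rng(4)
rates = simulate_vasicek(alpha=0.1, beta=0.9, sigma=0.2, r0=1.0, n=20000, rng=rng)
alpha_hat, beta_hat, sigma_hat = fit_vasicek_ols(rates)
print(alpha_hat, beta_hat, sigma_hat)
```

With a long simulated sample the estimates sit close to the true values; with short samples the small-sample bias in the persistence parameter becomes visible.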
Cox, Ingersoll, and Ross model for the interest rate:
$$r_t = \alpha + \beta r_{t-1} + \sigma \sqrt{r_{t-1}} \, \epsilon_t$$
where the error term has a Gaussian distribution.
Violates the conditions for an AR(1) model in several ways:
1. Expected value still linear.
2. Volatility that is increasing in the level of the interest rate.
3. Error term has a non-central chi square distribution.
Avoids several undesirable properties of the Vasicek model: the interest rate cannot become negative, and volatility is not constant.
How to estimate the parameters of the Cox, Ingersoll, and Ross model?
It could be considered an AR(1) model if we allow non-Gaussian error
terms.
Can use OLS regression to estimate $\alpha$ and $\beta$.
Residuals can then be divided by the square root of the lagged interest rate; the sample standard deviation of the result is an estimate of $\sigma$. (Ignores error in estimation of the residuals.)
Alternative: maximum likelihood estimation. The conditional likelihood function is known explicitly; it is a non-central chi square distribution.
Straightforward in principle, sometimes a little difficult in practice. (Likelihood contains modified Bessel function of the first kind; sometimes it is the product of one extremely large number and another extremely small number.)
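The two-stage regression approach described above can be sketched as follows. The data are simulated from a discretised version of the model with Gaussian shocks, floored at a tiny positive number so that the square root is always defined; that simulation scheme, the function names, and the parameter values are assumptions made only for this illustration.

```python
import numpy as np

def simulate_cir_discrete(alpha, beta, sigma, r0, n, rng):
    """Simulate r_t = alpha + beta * r_{t-1} + sigma * sqrt(r_{t-1}) * eps_t."""
    r = np.empty(n + 1)
    r[0] = r0
    shocks = rng.standard_normal(n)
    for t in range(1, n + 1):
        step = alpha + beta * r[t - 1] + sigma * np.sqrt(r[t - 1]) * shocks[t - 1]
        r[t] = max(step, 1e-8)  # keep the simulated rate positive
    return r

def fit_cir_two_stage(r):
    """Stage 1: OLS for alpha and beta. Stage 2: scale residuals to estimate sigma."""
    r = np.asarray(r, dtype=float)
    y, x = r[1:], r[:-1]
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    scaled = resid / np.sqrt(x)  # divide by the square root of the lagged rate
    return coef[0], coef[1], scaled.std(ddof=0)

rng = np.random.default_rng(5)
rates = simulate_cir_discrete(alpha=0.005, beta=0.9, sigma=0.02, r0=0.05, n=20000, rng=rng)
alpha_hat, beta_hat, sigma_hat = fit_cir_two_stage(rates)
print(alpha_hat, beta_hat, sigma_hat)
```

The second stage treats the first-stage residuals as if they were the true errors, which is exactly the source of estimation error that the two-stage approach ignores.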
Volatility Modelling ARCH
Another extension: the ARCH (Autoregressive Conditional Heteroscedasticity) model, Engle (1982).
$$Y_t = \sqrt{c + dY_{t-1}^2} \, \epsilon_t$$
Volatility depends on the lagged squared level of the process. Note that $c \geq 0$ and $d \geq 0$.
The above is an ARCH(1) model. More generally, the ARCH(q) model is:
$$Y_t = \sqrt{c + \sum_{i=1}^q d_i Y_{t-i}^2} \, \epsilon_t$$
Analogously with the ARCH(1) model, we require $c \geq 0$ and $d_i \geq 0$ for all $i$.
Not a particularly sensible process as written for financial prices/returns, but with slight adaptation, can capture the phenomenon of volatility clustering.
When does an ARCH(1) model have an unconditional variance?
Note that:
$$E\left[ Y_t^2 \mid Y_{t-1} \right] = c + dY_{t-1}^2$$
By the law of iterated expectations (assuming it can be applied):
$$E\left[ Y_t^2 \right] = c + d \, E\left[ Y_{t-1}^2 \right]$$
If the process has an unconditional variance, then the two expectations above must be equal. Since $c \geq 0$ and $d \geq 0$, it must be the case that either $c = 0$ and $d = 1$, or $d < 1$. Assuming the latter, we have:
$$E\left[ Y_t^2 \right] = \frac{c}{1 - d}$$
Since the unconditional expectation is equal to zero, this is also the unconditional variance.
If $d > 1$, the ARCH(1) process does not have an unconditional variance. (Does that mean it is not stationary?)
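The variance formula can be checked by simulation. This is a rough sketch with assumed parameter values ($c = 1$, $d = 0.2$), an assumed burn-in period, and Gaussian shocks with unit variance.

```python
import numpy as np

def simulate_arch1(c, d, n, rng, burn_in=500):
    """Simulate Y_t = sqrt(c + d * Y_{t-1}^2) * eps_t with standard normal eps_t."""
    y = np.zeros(n + burn_in)
    shocks = rng.standard_normal(n + burn_in)
    for t in range(1, n + burn_in):
        y[t] = np.sqrt(c + d * y[t - 1] ** 2) * shocks[t]
    return y[burn_in:]  # drop the start-up period

rng = np.random.default_rng(6)
c, d = 1.0, 0.2
y = simulate_arch1(c, d, n=200000, rng=rng)

sample_var = np.mean(y**2)      # the unconditional mean is zero
implied_var = c / (1.0 - d)     # formula derived above
print(sample_var, implied_var)
```

The sample second moment should sit close to $c / (1 - d) = 1.25$ for these values.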
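Tail behaviour: consider an ARCH(1) model, written as follows:
$$Y_t = \sigma_t \epsilon_t \qquad \sigma_t^2 = c + dY_{t-1}^2$$
If the $\epsilon_t$ have a Gaussian distribution, then:
$$E\left[ Y_t^4 \mid Y_{t-1} \right] = \sigma_t^4 \, E\left[ \epsilon_t^4 \mid Y_{t-1} \right] = 3\sigma_t^4 = 3c^2 + 6cdY_{t-1}^2 + 3d^2 Y_{t-1}^4$$
$$E\left[ Y_t^4 \right] = 3c^2 + 6cd \, E\left[ Y_{t-1}^2 \right] + 3d^2 \, E\left[ Y_{t-1}^4 \right] = 3c^2 + \frac{6c^2 d}{1 - d} + 3d^2 \, E\left[ Y_{t-1}^4 \right]$$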
We know that $d < 1$ is required for $Y_t$ to have a finite unconditional variance. If it is to have a finite unconditional fourth moment, it must be the case that $d < 1/\sqrt{3} \approx 0.577$.
If that stronger condition is satisfied, then
$$E\left[ Y_t^4 \right] = \frac{3c^2 + \frac{6c^2 d}{1 - d}}{1 - 3d^2}$$
Recall that
$$E\left[ Y_t^2 \right] = \frac{c}{1 - d}$$
The unconditional excess kurtosis of $Y_t$ is therefore
$$Kurt[Y_t] = \frac{E\left[ Y_t^4 \right]}{E\left[ Y_t^2 \right]^2} - 3 = \frac{6d^2}{1 - 3d^2}$$
Unless $d = 0$ (in which case the $Y_t$ process is just a series of independent Gaussian random variables), the ARCH(1) process is leptokurtic, i.e., fat-tailed. If $d > 1/\sqrt{3}$, then the unconditional distribution of the ARCH(1) process is so fat-tailed that the excess kurtosis statistic does not exist (infinite?).
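The step from the two moment formulas to the excess kurtosis is pure algebra, and it can be verified numerically. The helper below is a small sketch; the function name and the test values of $c$ and $d$ are assumptions.

```python
import math

def arch1_moments(c, d):
    """Unconditional second moment, fourth moment, and excess kurtosis of a Gaussian ARCH(1)."""
    if not 0.0 <= d < 1.0 / math.sqrt(3.0):
        raise ValueError("the fourth moment is finite only for 0 <= d < 1/sqrt(3)")
    second = c / (1.0 - d)
    fourth = (3.0 * c**2 + 6.0 * c**2 * d / (1.0 - d)) / (1.0 - 3.0 * d**2)
    excess_kurtosis = fourth / second**2 - 3.0
    return second, fourth, excess_kurtosis

for d in (0.0, 0.2, 0.5):
    second, fourth, kurt = arch1_moments(c=1.0, d=d)
    closed_form = 6.0 * d**2 / (1.0 - 3.0 * d**2)
    print(d, round(kurt, 6), round(closed_form, 6))
```

The excess kurtosis does not depend on $c$, and it grows without bound as $d$ approaches $1/\sqrt{3}$.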
Simulated ARCH(1) Process: c = 0 and d = 0
Simulated ARCH(1) Process: c = 0 and d = 0.2
Simulated Squared Values of ARCH(1) Process: c = 0 and d = 0
Simulated Squared Values of ARCH(1) Process: c = 0 and d = 0.2
Estimation: the parameters of an ARCH(q) model can be estimated consistently by regression. Just take $Y_t^2$ as the Y variable, and $Y_{t-1}^2, \ldots, Y_{t-q}^2$ as the X variables.
As with AR models, many of the OLS small sample statistical results fail to hold in this context. However, the estimates are still consistent.
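The regression can be sketched as below, using simulated ARCH(1) data so that the true values are known. The function names and the parameter values are illustrative assumptions.

```python
import numpy as np

def simulate_arch1(c, d, n, rng, burn_in=500):
    """Simulate Y_t = sqrt(c + d * Y_{t-1}^2) * eps_t with standard normal eps_t."""
    y = np.zeros(n + burn_in)
    shocks = rng.standard_normal(n + burn_in)
    for t in range(1, n + burn_in):
        y[t] = np.sqrt(c + d * y[t - 1] ** 2) * shocks[t]
    return y[burn_in:]

def fit_arch_ols(y, q):
    """Regress Y_t^2 on a constant and Y_{t-1}^2, ..., Y_{t-q}^2."""
    y2 = np.asarray(y, dtype=float) ** 2
    n = len(y2)
    target = y2[q:]
    # Column i holds the squared observation lagged by i periods.
    lags = [y2[q - i : n - i] for i in range(1, q + 1)]
    design = np.column_stack([np.ones(n - q)] + lags)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[0], coef[1:]

rng = np.random.default_rng(7)
y = simulate_arch1(c=1.0, d=0.2, n=100000, rng=rng)
c_hat, d_hat = fit_arch_ols(y, q=1)
print(c_hat, d_hat)
```

Nothing in the regression forces the estimated coefficients to be non-negative, so in small samples the fitted values of the variance can occasionally be negative.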
Can add a constant to an ARCH(q) model:
$$Y_t = a + \sigma_t \epsilon_t \qquad \sigma_t^2 = c + \sum_{i=1}^q d_i (Y_{t-i} - a)^2$$
Is this a reasonable model for returns?
How to estimate it?
This model is an ideal candidate for the two-stage approach: first estimate the mean (the $a$ parameter), then use the residuals to estimate the volatility parameters $c$ and $d_1, \ldots, d_q$.
Example: RMRF. Estimate the $a$ parameter by the sample mean, which is 0.6306 (measured in units of percent per month). The standard error (estimated in the usual way) is 0.1755 (also in percent per month).
Subtract the estimated mean value from each observation, then square the differences. Use OLS on the result to estimate the ARCH parameters; the X variables are just lagged values of the Y variables.
Estimated ARCH(1) Process: Demeaned RMRF Process
Alternative: maximum likelihood is viable for estimation of an ARCH process.
Under the assumption of a Gaussian error, $Y_t$ has (conditional on past observations) a normal distribution with mean equal to $a$ and variance equal to:
$$Var[Y_t] = c + \sum_{i=1}^q d_i (Y_{t-i} - a)^2$$
Gaussian likelihood is very amenable to simple maximum likelihood estimation. (Other models for $\epsilon_t$ are also used, e.g., heavy-tailed distributions such as Student's t.)
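A sketch of the Gaussian likelihood for this model follows. To keep the example dependency-free, the mean is fixed at the sample mean and the ARCH(1) parameters are found by a coarse grid search; a numerical optimiser over all parameters would be the natural replacement. The function names, the grid, the simulated data, and the choice to condition on the first $q$ observations are all assumptions of this illustration.

```python
import numpy as np

def simulate_arch1(a, c, d, n, rng, burn_in=500):
    """Simulate Y_t = a + sigma_t * eps_t with sigma_t^2 = c + d * (Y_{t-1} - a)^2."""
    dev = np.zeros(n + burn_in)
    shocks = rng.standard_normal(n + burn_in)
    for t in range(1, n + burn_in):
        dev[t] = np.sqrt(c + d * dev[t - 1] ** 2) * shocks[t]
    return a + dev[burn_in:]

def arch_neg_loglik(a, c, d_coefs, y):
    """Negative Gaussian log-likelihood of an ARCH(q) model, conditioning on the first q values."""
    d_coefs = np.atleast_1d(np.asarray(d_coefs, dtype=float))
    q = len(d_coefs)
    dev2 = (np.asarray(y, dtype=float) - a) ** 2
    n = len(dev2)
    var = np.full(n - q, float(c))
    for i in range(1, q + 1):
        var += d_coefs[i - 1] * dev2[q - i : n - i]
    return 0.5 * np.sum(np.log(2.0 * np.pi * var) + dev2[q:] / var)

rng = np.random.default_rng(3)
y = simulate_arch1(a=0.5, c=1.0, d=0.2, n=20000, rng=rng)

a_hat = y.mean()
c_grid = np.linspace(0.5, 1.5, 21)
d_grid = np.linspace(0.0, 0.5, 26)
best = min((arch_neg_loglik(a_hat, c, [d], y), c, d) for c in c_grid for d in d_grid)
_, c_hat, d_hat = best
print(a_hat, c_hat, d_hat)
```

Swapping the Gaussian density for a heavier-tailed one only changes the expression inside `arch_neg_loglik`; the variance recursion stays the same.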
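ARCH(5) model: estimated parameters are:
$$\hat a = 0.7460 \qquad \hat c = 10.6681$$
$$\hat d_1 = 0.0932 \qquad \hat d_2 = 0.2025 \qquad \hat d_3 = 0.1383 \qquad \hat d_4 = 0.1545 \qquad \hat d_5 = 0.0264$$
Standard errors are:
$$\sigma(\hat a) = 0.0725 \qquad \sigma(\hat c) = 0.3352$$
$$\sigma(\hat d_1) = 0.0175 \qquad \sigma(\hat d_2) = 0.0120 \qquad \sigma(\hat d_3) = 0.0222 \qquad \sigma(\hat d_4) = 0.0198 \qquad \sigma(\hat d_5) = 0.0141$$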
How do these compare to the regression results for ARCH(5)?
ARCH(5) Volatility Forecasts: First Third of Sample
ARCH(5) Volatility Forecasts: Middle Third of Sample
ARCH(5) Volatility Forecasts: Last Third of Sample
ARCH(5) Volatility Forecasts: First Third of Sample
ARCH(5) Volatility Forecasts: Middle Third of Sample
ARCH(5) Volatility Forecasts: Last Third of Sample
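Other possible extensions: more general volatility modelling:
$$Y_t = a + \sigma_t \epsilon_t$$
$$\sigma_t^2 = c + \begin{pmatrix} Y_{t-q} - a \\ \vdots \\ Y_{t-1} - a \end{pmatrix}^T \Omega \begin{pmatrix} Y_{t-q} - a \\ \vdots \\ Y_{t-1} - a \end{pmatrix}$$
where $\Omega$ is a positive semidefinite matrix.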
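More general modelling of the mean:
$$Y_t = a + \sum_{i=1}^q b_i Y_{t-i} + \sigma_t \epsilon_t$$
$$\sigma_t^2 = c + \begin{pmatrix} Y_{t-q} - \left( a + \sum_{i=1}^q b_i Y_{t-q-i} \right) \\ \vdots \\ Y_{t-1} - \left( a + \sum_{i=1}^q b_i Y_{t-1-i} \right) \end{pmatrix}^T \Omega \begin{pmatrix} Y_{t-q} - \left( a + \sum_{i=1}^q b_i Y_{t-q-i} \right) \\ \vdots \\ Y_{t-1} - \left( a + \sum_{i=1}^q b_i Y_{t-1-i} \right) \end{pmatrix}$$
Each element of the vector is the lagged residual from the mean equation.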
How to estimate?
ARCH was introduced several decades ago as a mechanism for introducing conditional heteroscedasticity to time series modelling.
Volatility clustering in financial returns: the ARCH model has some success capturing this phenomenon.
Disadvantages of ARCH:
1. Somewhat awkward constraints on parameters for existence of variance and kurtosis for $q > 1$.
2. Assumes symmetric volatility response: both large positive and large negative observations cause high future volatility. It therefore cannot capture the leverage effect.
Various extensions to ARCH have been introduced to try to address these shortcomings, including GARCH and stochastic volatility.
