It is true that M. Fourier held the opinion that the principal aim of mathematics was public utility and the explanation of natural phenomena; but a philosopher such as he should have known that the sole aim of science is the honour of the human mind, and that, on this ground, a question about numbers is worth as much as a question about the system of the world.
Carl Gustav Jacobi (letter to Adrien-Marie Legendre, from Königsberg, July 2nd, 1830)
Among the first examples of least squares: Roger Cotes with Robert Smith, ed., "Harmonia mensurarum" (Cambridge, England: 1722), chapter "Aestimatio errorum in mixta mathesi per variationes partium trianguli plani et sphaerici", p. 22.
A hypothesis should be made explicit: no systematic bias in the measuring instruments. Thomas Simpson points this out in "An attempt to shew the advantage arising by taking the mean of a number of observations in practical astronomy" (from "Miscellaneous Tracts on Some Curious Subjects ...", London, 1757). In modern terms this shall become E(ε|X) = 0 (see section 9).
1 Returns
1.1 Return definitions
Returns come in two versions. Let Pit be the price of the i-th stock at time t.
The linear or simple return (in what follows we shall deal with dividends and with
total returns) between times tj−1 and tj is defined as:

rtj = Ptj/Ptj−1 − 1

while the log return is defined as:

rtj∗ = ln(Ptj/Ptj−1)
In both these definitions of return we do not consider possible dividends. There exist
corresponding definitions of total return where, in the case a dividend Dj is accrued
between times tj−1 and tj , the numerator of both ratios becomes Ptj + Dj .
Moreover, here we do not apply any accrual convention to our returns, that is: we
just consider period returns and do not transform, say, daily computed returns (i.e.
tj − tj−1 = one day) into a yearly basis.
Notice that such transforms are, instead, the rule in most databases and return
reporting and, being purely notional, are based on many different "accrual conventions",
each arising in, and appropriate to, a specific field. You should be quite attentive
to the specific definitions and formulas applied in the fields of your interest.
While Ptj means "price at time tj", the symbol rtj is a shorthand for "return between
times tj−1 and tj", so that the notation is not really complete and its interpretation
depends on the context. When needed, for clarity's sake, we shall specify returns as
indexed by the beginning and the end point of the time interval in which they are
computed, as, for instance, in rtj−1;tj.
Now, some obvious but important comments.
We begin by saying that only the linear return can be termed “percentage” return,
as it is defined as the ratio of two (positive) quantities minus 1. The log return is on
a different scale. The use of a % sign or of the “percentage” term for log returns is,
strictly speaking, wrong, even if it is quite common.
In fact: the two definitions of return, log and linear, yield different numbers when
applied to the same prices, except in the case when the two prices are identical.
Let’s study this in some detail.
We can easily show that ln(x) ≤ x − 1, where equality holds only if x = 1 (no
change in prices); moreover, the difference between linear and log returns is an
increasing function of |x − 1|.
In fact, x − 1 is equal and tangent to ln(x) at x = 1. Moreover, since the second
derivative of (x − 1) − ln(x), that is 1/x², is always positive, the difference between
linear and log return shall always be positive, except when x = 1, and shall be bigger
the bigger |x − 1| is.
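The inequality and its monotonicity can be checked numerically; a minimal sketch (the grid of price ratios is arbitrary):

```python
import math

# difference between linear return (x - 1) and log return ln(x)
# as a function of the price ratio x = P_t / P_{t-1}
def gap(x):
    return (x - 1) - math.log(x)

ratios = [0.5, 0.8, 0.95, 1.0, 1.05, 1.2, 2.0]
gaps = [gap(x) for x in ratios]

# the gap is never negative, vanishes only at x = 1, and grows with |x - 1|
assert all(g >= 0 for g in gaps)
assert gap(1.0) == 0.0
assert gap(1.2) > gap(1.05) and gap(0.5) > gap(0.8)
```

This is exactly the reason why mistaking one return for the other produces errors that are all of the same sign.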
There are several implications of this simple result. An important, if obvious, one
is that, if one kind of return is mistaken for the other, the inevitable “approximation”
errors shall be all of the same sign.
In Finance the ratio of two prices of the same security separated by a not too long
time interval (this ratio is sometimes called "total return" and may be corrected by
taking into account accruals, as mentioned above) is often modeled as a random variable
with an expected value very near 1. This implies that the two definitions of returns,
linear and log, shall yield very different values with sizable probability only when the
variance (or more in general a dispersion measure) of the price ratio distribution is
non negligible, so that observations far from the expected value have non negligible
probability.
Since standard models in Finance assume that variance of returns increases when
the time between prices for which the return is computed increases, this also implies
that the two definitions shall more likely imply different values when applied to long
term returns.
Why two definitions? After all, the corresponding prices are the same and this
implies that both definitions, if not swapped by error, give us the same information.
The point is that each definition is useful, in the sense of making computations
simpler, in different cases.
You should also ask yourself: "why bother with returns and not simply discuss
prices?". Again: the only reason is that using returns in several circumstances simplifies
computations and makes it easier to build models for the dynamics of prices.
We now see which properties of linear and log returns make their use a useful
simplification in different cases.
From now on, for simplicity, let us only consider times t and t − 1.
Let the value of a buy and hold portfolio, composed of k stocks each in a nominal
quantity ni, at time t be:

Σi=1..k ni Pit

The linear return of the portfolio between times t − 1 and t is then:

rt = (Σi=1..k ni Pit)/(Σj=1..k nj Pjt−1) − 1 = Σi=1..k (ni Pit−1)/(Σj=1..k nj Pjt−1) · (Pit/Pit−1) − 1 =

= Σi=1..k wit (rit + 1) − 1 = (Σi=1..k wit rit + Σi=1..k wit) − 1 = Σi=1..k wit rit + 1 − 1 = Σi=1..k wit rit

where wit = (ni Pit−1)/(Σj=1..k nj Pjt−1) are terms summing to 1. Notice that, while the time
index of wit is t, the value of wit depends on terms which are all known at time t − 1.
When the wit are all non negative (long only portfolio), they are typically called
"weights" and represent the percentage of the portfolio invested in the i-th stock at
time t − 1.
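The aggregation identity can be verified on a toy portfolio (all quantities and prices below are made-up illustration numbers):

```python
# hypothetical 3-stock buy-and-hold portfolio: nominal quantities n_i
# and prices at t-1 and t
n      = [100.0, 50.0, 200.0]
p_prev = [10.0, 40.0, 5.0]    # P_{i,t-1}
p_now  = [11.0, 38.0, 5.5]    # P_{i,t}

value_prev = sum(ni * p for ni, p in zip(n, p_prev))
value_now  = sum(ni * p for ni, p in zip(n, p_now))

# direct portfolio linear return
r_direct = value_now / value_prev - 1

# weights w_it (known at time t-1) and component linear returns
w = [ni * p / value_prev for ni, p in zip(n, p_prev)]
r = [pn / pp - 1 for pn, pp in zip(p_now, p_prev)]
r_aggregated = sum(wi * ri for wi, ri in zip(w, r))

assert abs(sum(w) - 1) < 1e-12            # weights sum to one
assert abs(r_direct - r_aggregated) < 1e-12
```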
This simple “aggregation” result is very useful. Suppose, for instance, that returns
are stochastic and you know, at time t − 1, the expected values for the linear returns
between time t − 1 and t. You are at time t − 1 and want to compute the expected
value of your portfolio return between t − 1 and t.
Since the expected value is a linear operator (the expected value of a sum is the sum
of the expected values; moreover, additive and multiplicative constants can be taken
out of the expected value) and the weights wit are known at time t − 1, hence non
stochastic when you make your computations, if we are at time t − 1 we can easily
compute the expected return of the portfolio as:

E(rt) = Σi=1..k wit E(rit)
Moreover if we know all the covariances between rit and rjt (if i = j we simply have
a variance) we can find the variance of the portfolio return as:
V(rt) = Σi=1..k Σj=1..k wit wjt Cov(rit; rjt)
Notice, in the end, that this is true no matter how big an amount of time passes
between t − 1 and t, provided you do your computations at time t − 1.
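In code, the two formulas read as follows, with hypothetical numbers for the weights, the expected returns and the covariance matrix (numpy is used for the matrix algebra):

```python
import numpy as np

w  = np.array([0.5, 0.3, 0.2])          # weights, known at time t-1
mu = np.array([0.04, 0.06, 0.02])       # E(r_it), hypothetical values
cov = np.array([[0.04, 0.01, 0.00],     # Cov(r_it, r_jt), hypothetical values
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.01]])

exp_port = w @ mu        # E(r_t)  = sum_i w_it E(r_it)
var_port = w @ cov @ w   # V(r_t)  = sum_i sum_j w_it w_jt Cov(r_it, r_jt)
```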
This breaks down if we are, say, at time t − 1 and consider the portfolio return
between time t and time t + 1. If we suppose the ni unchanged (buy and hold portfolio)
the aggregation formula shall still be valid, but with wit+1 ≠ wit. This is because wit+1
shall depend on security prices at time t, while wit depends on security prices at time t − 1.
If we are at time t − 1, then, while wit is non stochastic, wit+1 is stochastic, as it depends
on prices available in the future, at time t. This means that the easy expected value
and variance formulas cannot be used now, because we cannot "take the wit+1 outside"
the expected value and variance operators.
In formulas, we could not make passages such as:

E(rt+1) = Σi=1..k wit+1 E(rit+1)
We should be satisfied with the almost useless:

E(rt+1) = Σi=1..k E(wit+1 rit+1)

The same problem holds for the computation of the variance.
There exists an investing strategy which keeps wit constant over time by changing
the ni. This is called a "constant relative weights" strategy.
This is NOT a buy and hold strategy: if you are long all securities, it requires
selling a proper amount of those securities which outperform the other securities and
buying a proper amount of the underperforming securities, so that the net investment is 0
and the weights are kept constant.
(In the real world such a strategy may imply huge transaction costs.)
With such a strategy, we could use the expected value and variance formulas for
returns on periods which begin in our future (provided we know the expected values,
variances and covariances of each return included in the formula).
Now, log returns.
For log returns the aggregation result in a portfolio, valid for linear returns, does
not apply. In fact we have:
rt∗ = ln( (Σi=1..k ni Pit)/(Σj=1..k nj Pjt−1) ) = ln( Σi=1..k wit (Pit/Pit−1) ) = ln( Σi=1..k wit exp(rit∗) )
The log return of the portfolio is not a linear function of the log (and also of the
linear) returns of the components. In this case assumptions on the expected values
and covariances of the (log) returns of each security in the portfolio cannot be (easily)
translated into assumptions on the expected value and the variance of the portfolio
return by simple use of basic “expected value of the sum” and “variance of the sum”
formulas.
Think how difficult this would make it to perform any standard portfolio optimization
procedure, as, for instance, the Markowitz mean/variance model, using log returns.
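The failure of linear aggregation for log returns is easy to see numerically; a two-stock sketch with made-up numbers:

```python
import math

# hypothetical two-stock portfolio
w = [0.6, 0.4]          # weights at time t
r_log = [0.10, -0.05]   # component log returns r*_it

# true portfolio log return: ln of the weighted sum of exp(r*_it)
r_port = math.log(sum(wi * math.exp(ri) for wi, ri in zip(w, r_log)))

# naive weighted sum of log returns
r_naive = sum(wi * ri for wi, ri in zip(w, r_log))

# the two differ: the portfolio log return is NOT a linear
# combination of the component log returns
assert abs(r_port - r_naive) > 1e-4
```

By convexity of the exponential, the naive weighted sum always understates the true portfolio log return.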
While log returns create problems for aggregation over portfolios, they are
much easier to use than linear returns when we aim at describing the evolution of the
price of a single security through several time intervals.
Suppose we observe the price Pt at times t1, ..., tn; the log return between t1 and tn
shall be the sum of the intermediate log returns:

rt∗1,tn = ln(Ptn/Pt1) = ln( Πi=2..n Pti/Pti−1 ) = Σi=2..n rt∗i
It is then easy, for instance, given the expected values and the covariances of the
sub period returns, to compute the expected value and the variance of the full period
return (from t1 to tn).
On the other hand, linear returns over a time interval are not the sum of sub-period
linear returns.
We have:

rt1,tn = Ptn/Pt1 − 1 = Πi=2..n (Pti/Pti−1) − 1 = Πi=2..n (rti + 1) − 1
We see that, while it is possible, and in fact easy, to connect a period linear return
with subperiod returns, the function connecting subperiod linear returns to the period
linear return is not a sum but a product.
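Both aggregation-over-time rules can be checked on a short hypothetical price path:

```python
import math

# hypothetical prices of a single security at times t1..t5
prices = [100.0, 103.0, 101.0, 108.0, 107.0]

# subperiod log returns sum to the full-period log return
log_subs = [math.log(b / a) for a, b in zip(prices, prices[1:])]
assert abs(sum(log_subs) - math.log(prices[-1] / prices[0])) < 1e-12

# subperiod linear returns compound (product); they do not sum
lin_subs = [b / a - 1 for a, b in zip(prices, prices[1:])]
prod = 1.0
for r in lin_subs:
    prod *= 1 + r
assert abs((prod - 1) - (prices[-1] / prices[0] - 1)) < 1e-12
assert abs(sum(lin_subs) - (prices[-1] / prices[0] - 1)) > 1e-4
```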
The expected value of a product is difficult to evaluate, even if we know each single
subperiod expected value, as it does not depend (in general) only on the expected
values of the terms.
A notable special case in which this is possible is that of non correlation over time
among the terms.
For the computation of the variance the problem is even worse, even in the non
correlation over time case.
So we have: linear returns are useful for portfolios over single time intervals, log
returns for single securities over time.
It is clear that, when problems involving the modeling of portfolio evolution over
time are considered, no single definition of return is fully satisfactory.
In these cases we either see approximations or, simply, models are directly expressed
in terms of prices.
You should keep in mind that standard "introductory" portfolio allocation models
are one period models, hence usually based on linear returns (but always read the
details, sometimes you’ll be surprised).
To sum up: the two definitions of returns yield different values when the ratio
between consecutive prices is not equal to 1. The linear definition works very well for
portfolios over a single period, conditional on the knowledge of prices at time t − 1:
expected values and variances of portfolio returns can be derived from expected values,
variances and covariances of the components, as the portfolio linear return over a time
period is a linear combination of the returns of the portfolio components.
For analogous reasons the log definition works very well for single securities over
time.
We conclude this section with three warnings which expand on what was already written
in terms of accrual conventions.
These warnings should be obvious, but experience teaches the opposite.
First. Many other definitions of return exist and each one originates either from
traditional accounting practice (and typically is connected with some specific asset
class) or from specific computational needs. These are usually based on linear returns
but use different conventions for computing the number of days between two prices and
the accrual of possible dividends and coupons.
Second. No single definition is the "correct" or the "wrong" one; in fact such a
statement has no meaning. The correctness in the use of a definition depends on the
context in which it is applied (accounting uses are to be satisfied) and, obviously, on
avoiding naive errors such as exponentiating linear returns to derive prices or
summing log returns over different securities in order to get portfolio returns.
For instance: the fact that, for a price ratio near 1, the two definitions give
similar values should not lead the reader to the following reasoning: "if I break a
sizable period of time into many short sub periods, such that prices at consecutive times
are likely to be very similar, I am going to make a very small error if I use, say, the
linear return in the accrual formula for the log return". This is wrong: in any single sub
period the error is going to be small, but, as mentioned above, this error always has
the same sign, so that it shall sum up and not cancel, and on the full time interval the
total error shall be the same no matter how many sub periods we consider.
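A sketch of this same-sign accumulation, on a made-up daily price path:

```python
import math

# hypothetical daily closing prices over one "period"
prices = [100.0, 100.8, 100.2, 101.5, 102.1, 101.7, 103.0]

lin = [b / a - 1 for a, b in zip(prices, prices[1:])]
log = [math.log(b / a) for a, b in zip(prices, prices[1:])]

errors = [r - rs for r, rs in zip(lin, log)]
assert all(e >= 0 for e in errors)   # every subperiod error has the same sign

# summing linear returns as if they were log returns overstates the
# full-period log return by exactly the accumulated subperiod error
total_error = sum(lin) - math.log(prices[-1] / prices[0])
assert abs(total_error - sum(errors)) < 1e-12
assert total_error > 0
```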
Third: this multiplicity of definitions requires that, when we speak about any
properties of “returns”, it should be made clear which return definition we have in
mind. For instance: the expected value of log returns must not be confused with the
expected value of linear returns. The probability distribution of log returns shall not
be the same as the probability distribution of linear returns, and so on.
Practitioners are very precise in specifying such definitions in financial contracts.
The common imprecisions found in financial newspapers can be justified in view of their
descriptive purposes.
[Figure: r and r∗ as functions of Pt/Pt−1]
1.2 Price and return data: a cautionary tale
Finance is "full of numbers": price data and related statistics are gathered for commercial
and institutional reasons and are readily available in both free and costly/commercial
databases. This has been true for many years and, for some relevant markets, databases
have been reconstructed back to the nineteenth century; in some cases even before.
As in any field where data are so overwhelmingly available and not directly created
by the researcher through experiments, any researcher must be cautious before using data
and follow at least some very simple rules, which could be summarized in the sentence:
"KNOW YOUR DATA BEFORE USING IT!".
What does the number mean? How was it recorded? Did it always mean the same
thing? These are three very simple questions which should get an answer before any
analysis is attempted. Failure to do so could taint results in such a way as to make
them irrelevant or even ridiculous.
Moreover: the abundance of numbers should not be meant to imply that Finance
is necessarily amenable to mathematical analysis.
Mathematics, actually, builds models and theories that may not even require numbers
or quantities. However, it requires the existence of simple and stable relationships
between well defined "objects", which may or may not be numbers or quantities.
For this reason, understanding the numbers of Finance and its quantities is, above
all, the first necessary step toward understanding whether some of these, or some proper
modifications of these, can be the object of useful mathematical modelling.
But understanding numbers in Finance is not easy.
This is true even at a very basic level, by far lower than what is required in order
to assess the possibility of useful mathematical modelling.
Here is not the place for a detailed discussion, but it could be useful for us to try
and analyze what, at a first sight, should be a very simple example.
Suppose you wish to answer the following question: “how did the US stock market
behave during its history”.
This seems a quite straightforward question; it also is a quite obviously important
question, and you may think a simple and clear answer should be readily available.
You browse the Internet and run a search for literature on the topic expecting such
a simple, clear and unanimous answer.
Suppose you are able to shunt off conspiracy theorists, finance fanatics, quack doctors
and snake oil sellers, Ponzi scheme advertising and the like.
(All these abound in the field.)
Let us say that you concentrate on academic and academic-linked literature (beware:
this by no means assures you to fully avoid conspiracy theorists, fanatics, snake oil
sellers and Ponzi schemers).
At the onset, you could be puzzled by the fact that, in the overwhelming majority
of papers and books, the performance of markets where thousands of securities, not
always the same, are traded, and traded in different historical moments and under
different institutional rules, is summarized by a single number, an index. For the
moment, we put this point aside and follow this path.
You find a whole jungle of academic and non academic references among which
you choose, e.g., two frequently quoted expository books by famous academicians:
"Irrational Exuberance" by Robert J. Shiller (of Yale) and "Stocks for the Long Run" by
Jeremy J. Siegel (of Wharton)1 .
You browse through the first chapter of both books.
You first look at Figure 1-1 of Siegel, which tells you that 1 dollar invested in stock
in 1802 would have become 7,500,000 dollars by 1997. Moreover you read that 1 dollar
of 1802 is equivalent (according to Siegel) to 12 dollars in 1997. You divide 7,500,000
by 12 and get a real return of about 625,000 times (62,500,000%!).
On the other hand, Figure 1.1 of Shiller’s book gives the following information:
between 1871 and 2000 the S&P composite index corrected by inflation grew from
(roughly) 70 to (roughly) 1400 with a real return of roughly 20 times (2000%).
Both numbers are big, but also quite different.
Now you are puzzled.
Sure: a part of the difference is due to the different time basis.
Looking at Siegel's figure, you see that the dollar value of the investment around
1870 was about 200. Even exaggerating inflation, attributing the full 12 times
devaluation to the 1870-2000 period, and assessing these 200 dollars in 1870 to be worth
2400 dollars in 1997, we would have a real increase of 3125 times, which is still about
150 times Shiller's number.
The difference, obviously, cannot come from the difference in terminal years of the
sample, as the period 1997-2000 was a bull market period and should reduce, not
increase, the difference.
Now, both Authors are famous Finance professors and at least one of them (Shiller)
is one of the gurus of the present crisis. So the problem must be in the reader (us).
Let us try and improve our understanding by reading the details. First we notice
that Siegel quotes as sources for the raw data the Cowles series, as reprinted in Shiller's
book "Market Volatility", for the 1871-1926 period, and the CRSP data for the following
period. Shiller, on the other hand, mentions the S&P composite index.
Reading with care we see another difference: Shiller speaks about a "price" index,
while Siegel speaks about a total return index with reinvested dividends. Is this the trick?
Browsing the Internet we see that Shiller’s data are actually available for down-
loading (http://www.econ.yale.edu/∼shiller/data.htm).
With these data we can compute the total return for Shiller's data between 1871 and
1997: the real increase now is from 1 dollar to 3654 dollars in real terms.
1 The connection between the two authors and the two books is clearly stated by Shiller in his
Acknowledgments.
We also see that the CPI passed from 12 to 154 in the same time interval, so that the
“12 times” rule for the value of the dollar used by Siegel seems a good approximation2 .
There is still some disagreement between the numbers (Siegel 3125, but with exag-
gerated inflation, and Shiller 3654, but with 3 added years of bull market). However,
we think that, at least for answering our question, we have enough understanding.
In particular, we understand (most of) the reason for the, apparent, huge difference
between the statements of the two Authors as initially considered.
In this very short analysis we did learn some important things.
First: understand your question. “How did the US market behave during its history”
is, now we understand, not quite a well specified question.
Are we looking for a summary of the history of prices, or for the history of one
dollar invested in the market? The two different questions have two different answers
and require different data.
Second: understand your data. Price data? Total return data? Raw or inflation
corrected?
There are many subtle but relevant points that should be made, we only mention
the Survivorship Bias problem which taints any ex post analysis of financial series.
We stop here, for the moment, and do not mention the fact that a lot of discussion
has run about the relevance of the question and of the answers and their interpretation.
A final interesting fact is this: while Siegel and Shiller start with data which, once
understood, are very similar, they reach quite different conclusions. At least, this is
their opinion of their own works.
We can reconcile the data: we understand they are using the same data in two
different ways.
However, a puzzle remains: why does each of them draw a different conclusion and,
moreover, why do they perfectly understand this and, still, "agree to disagree"?
When we deal with rational and clever people, as they are, this implies a deep
disagreement not about the data, but about the correct way of modelling and interpreting
them, so as to be able to draw conclusions.
Again, be warned that, as mentioned above, while researchers broadly agree on the
following numbers, they broadly disagree on their interpretation.
We do this with the yearly Shiller dataset (widely used in the academic literature). We
shall concentrate on the total log return series.
The dataset starts with 1871 and is updated each year. Since in the latest available
version (at the time this chapter was written) Shiller's dataset runs up to 2013
included, we shall limit our computations to the interval 1871-2013.
During this time interval the average real log total return of the index was 6.33%.
In the same period the average real one year interest rate was 1.03%, so that the
“risk premium” was about 5.3%.
The standard deviation of the real log total return was 17.09% while the same
statistic for one year real interest rates was 6.54%.
The 5.3% average real log total return in excess to the yearly rate (which was even
higher up to year 2000) compared with the 17.09% standard deviation (even smaller
than this up to 2000) did generate a literature concerned with the “equity premium
puzzle”.
The average of the real dividend yield (up to 2011 only) is 4.45% and the standard
deviation of the same is 1.5%.
The average real log price return was 2.16% and the standard deviation of the same
17.68%.
While we can only approximately sum these two results and compare them with
the total real log return, we see that most of the equity premium is associated with the
dividend yield.
Notice that the correlation coefficient between real dividend yield and real log price
return is .10 (positive but small); this explains why the standard deviation of the total
real log return is even smaller than the sum of the standard deviations of real log price
return and real dividend yield (diversification effect).
On the other hand this small correlation is, by itself, a puzzle. This because the
value of a stock is commonly interpreted as some kind of expected value of future
discounted dividends.
A last piece of simple data: the 1 year autocorrelation of the real total log return
series is very small: 2.29%. This is a first simple piece of evidence of the fact that it is very
difficult to forecast future returns on the basis of past returns.
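As a sketch of how such summary statistics are computed (on simulated stand-in data, NOT the actual Shiller series; the mean and volatility are only roughly as quoted):

```python
import math
import random

# hypothetical yearly real total log returns
random.seed(0)
rets = [random.gauss(0.063, 0.17) for _ in range(143)]

n = len(rets)
mean = sum(rets) / n
std = math.sqrt(sum((r - mean) ** 2 for r in rets) / (n - 1))

# lag-1 (one year) autocorrelation of the series
num = sum((rets[i] - mean) * (rets[i + 1] - mean) for i in range(n - 1))
den = sum((r - mean) ** 2 for r in rets)
acf1 = num / den
```

On iid simulated data the estimated autocorrelation comes out near zero, much as in the quoted historical series.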
Some of these empirical facts are at the basis of the simple stock price evolution
model we shall introduce in the next chapter: the log random walk.
[FIGURE 1-1: Total Nominal Return Indexes, 1802-1997]
Examples
Exercise 1a - returns.xls
Exercise 1b - returns.xls
Moreover, while the hypothesis of constant variance for (log) returns may be a good
first order approximation of what we observe in markets, at least in the short run, the
same hypothesis for prices is not empirically sound: in general price changes tend to
have a variance which is an increasing function of the price level.
A couple of points to stress.
First: ∆ is the “fraction of time” over which the return is defined. This may be
expressed in any unit of time measurement: ∆ = 1 may mean one year, one month,
one day, at the choice of the user.
However, care must be taken so that µ and σ² are assigned consistently with the
choice of the unit of measurement of ∆. In fact, µ and σ² represent the expected value and
variance of the log return over a horizon of length ∆ = 1, and they shall be completely
different if 1 means, say, one year (as it usually does) or one day (we shall discuss in
what follows a particular convention for translating the values of µ and σ² between
different units of measurement of time; this convention is one of the consequences of
the log random walk model).
Second: suppose the model is valid for a time interval of ∆ and consider what
happens over a time span of, say, 2∆.
By simply iterating the model twice we have that the log return over 2∆ is the sum
of two log returns over ∆, so that its expected value is 2µ and its variance is 2σ².
Even from this point of view, the model is useful for introductory and simple
purposes only. The weight of empirical analysis during the last thirty years has led most
researchers to consider this model as a very approximate probabilistic description of
stock price behavior.
In a nutshell: while no consensus has been reached on an alternative standard
model, there is a general agreement about the fact that some sort of (very weak)
dependence exists for today’s returns on the full, or at least recent, history of returns.
Moreover, the constancy of the expected value and variance of the innovation term
has been strongly called into question, both in terms of slow variations of these
parameters and of possible sudden "variance explosions".
By far the weakest aspect of the model is the Gaussian assumption (when required)
and we shall discuss this point in some detail when dealing with value at risk.
In any case, the LRW still underlies many conventions regarding the presentation
of market Statistics. Moreover the LRW is perhaps the most important justification
for the commonly held equivalence between the intuitive term "volatility" and the
statistical entity "variance" (or better "standard deviation").
An important example of the influence of the log random walk model on market
practice concerns the “annualization” of expected value and variance.
We are used to the fact that, often, the rate of return of an investment over a given
time period is reported in an "annualized" way. The precise conversion from a period
rate to a yearly rate depends on accrual conventions. For instance, for an investment
of less than one year in length, the most frequent convention is to multiply the period
rate by the ratio between the (properly measured according to the relevant accrual
conventions) length of one year and the length of the investment. So, for instance, if
we have an investment which lasts three months and yields a rate of 1% in these three
months, the rate on a yearly basis shall be 4%.
It is clear that this is just a convention: the rate for an investment of one year
in length shall NOT, in general, be equal to 4%; this is just the annualized rate for
our three months investment. The two shall coincide, for instance, if the term structure
of interest rates is constant. However, such a convention can be useful for comparison
across investment horizons.
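The convention in the three-month example reads, in code:

```python
# a three-month investment yielding 1% in the period, annualized with the
# "multiply by (length of year / length of investment)" convention
period_rate = 0.01
year_over_period = 4   # one year contains four three-month periods
annualized_rate = period_rate * year_over_period
# a notional 4%: an annualized figure, not a promised one-year rate
```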
In a similar way, when we speak of the expected return or the standard
deviation/variance of an investment, it is common to report the number in an annualized way,
even if we speak of returns for periods of less or of more than one year. The actual
annualization procedure is based on a convention which is very similar to the one used
in the case of interest rates. As in that case, the convention is "true", that is: annualized
values of expected value and variance correspond to per annum expected values and
variances, only in particular cases. The specific particular case on which the convention
used in practice is based is the LRW hypothesis.
If we assume the LRW and consider a sequence of n log returns rt∗ at times t, t −
1, t − 2, ..., t − n + 1 (just for the sake of simplicity in notation we suppose each time
interval ∆ to be of length 1 and drop the generic ∆) we have that:

E(rt∗−n,t) = E( Σi=0..n−1 rt∗−i ) = Σi=0..n−1 E(rt∗−i) = nµ

V(rt∗−n,t) = V( Σi=0..n−1 rt∗−i ) = Σi=0..n−1 V(rt∗−i) = nσ²
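A quick Monte Carlo check of these two scaling rules (all parameter values are hypothetical):

```python
import random
import statistics

random.seed(42)
mu, sigma, n = 0.05, 0.20, 12   # hypothetical per-period parameters

# many simulated n-period log returns, each the sum of n iid Gaussian terms
sums = [sum(random.gauss(mu, sigma) for _ in range(n)) for _ in range(20000)]

est_mean = statistics.fmean(sums)
est_var = statistics.variance(sums)
# est_mean should be close to n*mu = 0.6, est_var close to n*sigma**2 = 0.48
```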
common). Data is presented both in price scale (starting value 100) and in log price
scale. The reason is simple. Consider the distribution of the log return after 100 years under
our hypothesis. This is going to be the distribution of the sum of 100 iid Gaussian RVs,
each with expected value 5.36% and standard deviation 18.1%. Using known results
we have that this distribution shall be Gaussian with expected value 536% and standard
deviation 181%. So, a standard ±2σ interval for the terminal value of this sum
is 536% ± 362%, or, in price terms, 100e^(5.36±3.62), that is, an interval with lower extreme
about 569 and upper extreme about 794,263. This means that under our hypotheses the possible
histories can be quite different. No problem in this if we recall the unconditional nature
of the model.
To get a quick idea: the actual historical evolution of the market as measured by
our index gave a final value of the index equal to about 21000, which corresponds, as
said, to a sum of log returns of 536%. This is, by construction, smack in the middle of
the distribution of the summed log returns and is the median of the price distribution.
However, due to the exponentiation, or, if you prefer, due to the power of compound
interest, the distribution of final values is highly asymmetric (it is lognormal), so that
the range of possible values above the median of prices is much bigger than the range
below it. We only simulated 100 possible histories. Even with such a limited sample
we have a top terminal price of more than 2,000,000 (in a very lucky world, for long
investors; we wonder what studying Finance would be like in such a world...) and a bottom
terminal price below 100 (again: in a world so unlucky that, had we lived in it, we
likely would not talk about the stock market)4 .
This result could be puzzling as the “possible histories” seem very heterogeneous.
This is an immediate consequence of the log random walk hypothesis. If we estimate
4 Compare this with the Siegel-Shiller data we discussed in section 1, then think about the result of
our simulation in such extreme worlds. For instance, with the historical mean and standard deviation
of the extremely depressed version of the 20th century, the simulation I would show you in this possible
world, provided you and I were still interested in this topic, would be quite different from what you see
here. And all the same, this possible story is a result totally compatible (under Gaussian LRW) with
what we did actually see in our real history. Spend a little time thinking about this point. It could be
"illuminating".
Think also about the economic sustainability of such extreme worlds: such extreme market behaviours
cannot happen by themselves (this is not the plot of some lucky or unlucky casino guy, it is the market
value of an economy, which should sustain such values, provided investors are not totally dumb), and
about how they could be so absurd just because they underline the possibly absurd extreme conclusions
we can derive from a simple LRW model.
Last but not least, remember that all this comes from the analysis of the stock market in a very,
up to now, successful country: the USA. But we analyze it so much also because it was successful
(and so, for instance, most Finance schools, journals and researchers are USA based). This biases
our conclusions if we wish to apply such conclusions to the rest of the world or, even, to the future
of USA. Maybe a more balanced view could be gained by comparing this result with the evolution of
stock markets all around the world (this is not a new idea, Robert J. Barro, for instance did this in
“Rare Disasters and Asset Markets in the Twentieth Century.” (2006) Quarterly Journal of Economics,
121(3): 823–66.)
µ and σ out of a long series of data (one century) we are using data from a very
heterogeneous set of economic and historical conditions. Then we use these numbers in
order to simulate “possible histories” without conditioning on any particular evolution
of the historical or economic variables which could and do influence the stock price.
In other words: we are using the log random walk model as a “marginal” model.
That is: it is unconditional to everything you may know or suppose about the evolution
of other variables connected with the evolution of the modeled stock price.
This point is quite relevant if we wish to understand the sometimes surprising
implications of this simple model.
In the above example, according to the model and the historically estimated parameters, we get the ±2σ interval 536% ± 362% (beware the % sign: these are log
returns), or, in price terms, 100·e^(5.36±3.62), that is an interval with lower extreme
about 569 and upper extreme about 794263⁵. It must be clear that such a wide set of histories is possible,
with non negligible probability, only because we did assume nothing on the (century
long) evolution of all the variables that influence prices. Only under this “ignorance” assumption can such a heterogeneous set of trajectories have non negligible
probability.
If we are puzzled by the result this is because, while the model describes the possible
evolution of prices “in whatever conditions”, unconditional on anything (in fact, we
estimate expected return and standard deviation using a long history, during which
many different things happened), when we see the implications of the model we, almost
invariably, are conditioned by our recent memories: we recall recent events or,
unconsciously, make some hypothesis on the future, for instance that
economic growth shall be, on average, similar to what we have recently seen. Since the estimates
of µ and σ we use (or even our assumption of zero correlation of log returns and, more
in general, the structure of the model itself, which contains no other variables but a
single price) are NOT conditional on such (implicit) hypotheses, it is not surprising that
the model gives us variation bounds much wider than what we could expect.
This misunderstanding is quite common and it is to be always kept in mind when
discussing results of the applications of the log random walk model6 .
5 By the way: this should be enough to understand why we should not use the term % when
speaking of log returns.
6 There exists a wide body of literature, both from the applied and the academic sides, that suggests
ways for “conditioning” the model.
[Figure: 100 years of simulated log random walk data, 100 simulated paths (mean log return 5.35%, st. dev. 18.1%); x-axis: years 0–120, y-axis: index values up to 1000000.]
[Figure: 100 years of simulated log random walk data (range subset) compared with the USA stock market in the 20th century (mean log return 5.35%, st. dev. 18.1%); x-axis: years 0–120, y-axis: index values up to 50000.]
[Figure: 100 years of simulated log random walk data, log scale, compared with the USA stock market in the 20th century (mean log return 5.35%, st. dev. 18.1%); x-axis: years 0–120, y-axis: index values from 1 to 10000000.]
2.1 "Stocks for the long run" and time diversification
These are very interesting and popular topics, part of the lore of the financial milieu. A
short discussion shall be useful to clarify some issues connected with the LRW hypothesis, together with some implicit assumptions underlying much financial advertising.
We have three flavors of these arguments. Two are “a priori” arguments, depending
on the log random walk hypothesis or something equivalent to it, the third is an a
posteriori argument based on historical data.
It is quite important to have a clear idea of the different weight and meaning of these
arguments. In fact, most of the “puzzling” statements you may find in the advertising
of “stocks for the long run” by the investment industry depend on a wrong “mix” of the
arguments.7
You can choose a very small α and a very big C, what happens is that the required
n shall be bigger.
The investment suggestion could be: if your time horizon is of an undetermined
number n of years, then choose the investment that has the highest expected return
per unit of standard deviation, even if the standard deviation is very high. Even if this
investment may seem too risky in the “short run”, there is always a time horizon such
that, for that horizon, the probability of any given loss is as small as you like or, which
is the same, the Sharpe ratio is as big as you like.
Typically, such high return (and high volatility) investment are stocks, so: "stocks
for the long run".
Now, the “run” can really be “long”.
The value of n for which this lower bound crosses a given C level is the solution of

nµ − √n·z₁₋α·σ ≥ C
The typical stock has a σ/µ ratio for one year of the order of about 6 or more. So,
even allowing for a big α, so that z₁₋α is near one (check by yourself the corresponding
α), the required n shall be in the range of 36, which is only slightly shorter than the
average working life.
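The crossing point is easy to check numerically. A minimal sketch, assuming the round numbers used in the text (σ/µ = 6 and z₁₋α ≈ 1); the function name is of course just for illustration:

```python
import math

def min_years(mu, sigma, z, c=0.0):
    # Smallest integer n with n*mu - sqrt(n)*z*sigma >= c, i.e. the first
    # horizon at which the lower bound on cumulated log return crosses c.
    n = 1
    while n * mu - math.sqrt(n) * z * sigma < c:
        n += 1
    return n

# With sigma/mu = 6 and z = 1 the bound n*mu >= sqrt(n)*sigma gives sqrt(n) >= 6.
print(min_years(mu=0.05, sigma=0.30, z=1.0))  # -> 36
```

Raising z (a smaller α) or asking for C > 0 pushes the required n further up, as the text remarks.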
Under the stated hypotheses these arguments are all formally correct.
The question is whether the result is relevant and the investment suggestions reasonable.
Let us consider some possible critiques.
This investment suggestion is or is not reasonable depending on the investor’s cri-
terion of choice. This, for instance, could be the full period expected return given some
probability of a given loss, or the Sharpe ratio for the full n periods or, for instance, the
per period Sharpe ratio (which obviously is a constant) or, again, the absolute volatility
over the full period of investment (which obviously increases without bounds), and so
on.
For instance, a typical critique to the statement is phrased like this: "Why should
we consider as good a given investment for n time periods, if we do not consider it
good for each single one of those periods?"
This critique, shared by economists the likes of Samuelson, is correct if we believe
that the investor takes into account the per period Sharpe ratio or some measure of
probable loss and expected return per period.
In other words the critique is correct if, very reasonably, we believe the investor does
not consider equivalent investments with identical Sharpe ratios but over different time
spans.
Another frequent critique is: "It is true: the expected value of the investment
increases without bounds but so does its volatility so, in the end, over the long run
I am, in absolute terms, much more uncertain on my investment result" (the mean-
standard deviation ratio goes up only because the numerator grows faster than the
denominator).
This is reasonable as a critique if we believe the investor to decide on the basis of
the absolute volatility of the investment over the full time period.
We should also point out that to choose a single asset class only because, by itself, it
has the highest Sharpe ratio, should always be criticized on the basis of diversification
arguments.
In the end, acceptance or refusal, on an a priori basis, of the investment suggestions
implied by this argument (by itself formally correct) depends on how we model the
investor’s decision making.
to the first investment in the other nine years?
Second: we are using two different return measures. In the first case we use the
return of the (one year) investment, in the second case the “average per year” return
of the ten year investment. But, while I can pocket the first, I cannot pocket the second.
All that I can derive from the second investment is the distribution of returns over
the ten year period which, obviously, has ten times the expected value and ten times
the variance of the distribution of the average return (which, we stress again, is not
the return I could get from the investment).
Dividing by ten the return of the second investment does not make it comparable
with the return of the first AND gives us a “measure of return” which is, actually, not
a measure of return.
And notice: the apparent “time diversification” fully comes from the “divide by n”
implied by the computation of the average return.
So, no time diversification exists but only a wrong comparison between different
investments using a true measure of return and an average per year return.
Comparable investments could be, e.g., a ten year investment in the diversified
portfolio and a ten year investment in the single security.
A possible correct comparison criterion could be the comparison between the ten
year expected return and return variance of the two investments.
However, in this case the diversified investment is seen to yield the same expected
value as the undiversified investment but with one tenth of the variance, so that these
two investments, now comparable, are by no means equivalent: the single security
investment is seen, in the mean variance sense, as inferior to the diversified investment.
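The comparison can be made concrete with a hypothetical numerical reading: ten i.i.d. securities, each with yearly log-return mean µ and variance σ², held for ten years (the specific numbers below are only illustrative):

```python
# Assumed illustrative numbers: mu = 5% per year, sigma^2 = 0.09 per year.
mu, sigma2, years, n_secs = 0.05, 0.09, 10, 10

# Ten year investment in the single security:
single_mean, single_var = years * mu, years * sigma2

# Ten year investment in the equally weighted portfolio of the ten i.i.d.
# securities: per-year portfolio variance is sigma2 / n_secs.
div_mean, div_var = years * mu, years * sigma2 / n_secs

print(single_mean, single_var)  # same horizon, same expected log return
print(div_mean, div_var)        # but one tenth of the variance
```

Same expected value, one tenth of the variance: the two ten-year investments are now genuinely comparable, and the diversified one dominates in the mean-variance sense.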
Analogously, we could ask which investment on a single security over ten years has
the same return mean and variance as the one year diversified investment. The obvious
answer is: an investment of one tenth the size of the diversified investment. In other
words: in order to have the same effective (that is: you can actually get it from an
investment) return distribution, the two investments must be not only over different
time periods but also of different sizes.
While the first version of the argument could be argued for, at least under some
hypothetical, maybe unlikely but coherent, setting, this second version of the argument
is a true fallacy.
version of the argument. As such its acceptance or rejection entirely depends on the
way we study historical data. Its use as a motivation of investment suggestions wholly
depends on the hypothesis that, to some extent, the future behaviour of markets shall
resemble the past.
In short this argument states that, based on the analysis of historical prices, stocks
were always, or at least quite frequently, a good long run investment.
Being an historical argument, even if true (and here is not the place to argue for
or against this point), it does not imply that the past behavior should replicate in
the future.
While apparently held by the majority of financial journalists (provided they do
not weight too much, say, the post 1990 period in Japan or the financial crisis-Covid19
period for most of the rest of the world), and broadly popular in trouble free times (at
least as popular as the, historically false, argument about real estate as the most sure,
if not the best, investment), and so quite popular for most time periods, at least in
the USA and during the first thirty and the last fifty years of the past century, this
argument is quite controversial among researchers.
The two very famous and quite readable books we quoted in the chapter about
returns: Robert Shiller’s "Irrational Exuberance" vs Jeremy Siegel’s "Stocks for the
Long Run" share (sic!) opposite views on the topic (derived, as we hinted at but do
not have the time to fully discuss, from different readings of the same data).
While not the place for discussing the point, we would suggest the reader, just for
the sake of amusement, to consider a basic fault of such "in the long run it was ..."
arguments.
We have a typical example of the case where the fact itself of considering the
argument, or even the phenomenon itself to which the argument applies, depends on
the fact that the phenomenon happened, that is: something “was good in the
long run”.
In fact we could doubt the possibility for an institution (the stock market) which
has survived in its modern form, at least in the USA, since, say, the second half
of the nineteenth century, to survive up to today without at least giving a sustainable
impression of offering some opportunities.
Such arguments, if not accompanied by something else to support them, become
somewhat empty: the analogue of being surprised when observing that the dish I
most frequently eat is also among those I like the most or, more in the extreme, that
old people did not die young or, again, that when we are in one of many queues, we
recall spending most time in the slowest queue.
Sometimes, however, the "opportunity" of some institution and how to connect this
with its survival can manifest in strange, revealing ways.
For instance, games of chance have existed from time immemorial with the only “long
run” property of making the bank holder richer, together with the occasional random
lucky player. The overall population of players is made, as a whole, poorer. So, while it
is clear here what is the "opportunity" of this institution (both for the, usually, steadily
enriched bank holder and the available, albeit unlikely, hope of a quick enrichment), the
survival of such an institution based on such opportunities tells us something interesting
about man’s mind.
This should not puzzle the reader as it is the bitter bread and butter of any research
field where we decide to use Probability and Statistics for writing and testing models,
but only observational data are available and no (relevant) experiments are possible.
Examples
Exercise 2 - IBM random walk.xls
3 Volatility estimation
In applied Finance the term “volatility” has many connected meanings. We mention
here just the main three:
1. Volatility may simply mean the attitude of market prices, rates, returns etc. to
change in an unpredictable and unjustified manner. This without connection to
any formal definition of “change”, “unpredictable” or “unjustified”. Here volatility
is tantamount to chance, luck, unknown destiny, etc. Usually the term has a negative undertone and is mainly used in bear markets. In bullish markets the term
is not frequently used and it is typically replaced by more “positive” synonyms:
a volatile bull market is “exuberant”, “tonic” or “lively”.
2. More formally, and mostly for risk managers, volatility has something to do with
the standard deviation of (should be log) returns and, sometimes, is estimated
using historical data (hence the name “Historical Volatility”).
3. For derivative traders, and frequently for risk managers too, “volatility” is the
name of one (or more) parameters in derivative models which, under the hy-
potheses that make “true” the models, are connected with the standard deviation
of underlying variables. However, in the understanding that these hypotheses are
never valid in practice, such parameters are not estimated from historical data
on the underlying variables (say, using time series of stock returns) but directly
backed out from quoted prices of derivatives, using the pricing model as a fitting
formula (take the parameter value which, in the given formula, fits the observed
derivative price). This is in accord with the strange, but widely held and, in fact,
formally justifiable, notion that models may be useful even if the hypotheses underlying them are false. This is “Implied Volatility”. If you do not know the term,
memorize it. It shall come up many times during your career.
In what follows we shall introduce a standard and widely applied method for estimating
volatility on the basis of historical data on returns, that is, we consider the second
meaning of volatility.
Under the LRW hypothesis a sensible estimate of σ 2 is:
S² = Σ_{i=0,…,n} (r*_{t−i} − r̄*)² / n
We could show that this is suboptimal, in terms of sampling variance of the estimate.
Implicitly, then, while we “believe” the log random walk when “annualizing” volatility,
we do not believe it when estimating volatility if we use this estimate.
Moreover, it shall be noticed that, in this estimate, the sampling mean of returns
does not appear. This is a choice which can be justified in two ways: first, we can
assume the expected return µ over a small time interval to be very small. With a non
negligible variance it is quite likely that an estimate of the expected value of returns
could show a higher sampling variability than its likely size and so it could create
problems for the statistical stability of the variance estimate8. Second, an estimate
of the variance where the expected value is set to 0, that is, as written above, an
estimate of the variance which is, actually, an estimate of the second moment, tends
to overestimate, not to underestimate, the variance.
In fact, variance equals the mean of squares (second moment) minus the squared
mean. If you set the squared mean to 0 you tend to exaggerate the estimate.
For institutional investors, traditionally long the market, this could be seen as
a conservative estimate. Obviously this may not be a reasonable choice for hedged
investors and derivative traders.
The apparent truncation at n deserves a brief comment. As we have just seen,
the standard estimate should be based on the full set of available observations. This
could be applied as a convention also to the RiskMetrics estimate. On the other hand,
consider the fact that, e.g., λ = 0.95 raised to the power of 256 (conventionally one
year of daily data) is less than 0.000002. So, at least with daily data, to truncate n
after one year of data (or even before) is substantially the same as considering the full
data set.
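A two-line check of this truncation argument, assuming λ = 0.95 and 256 daily observations per year as in the text:

```python
LAM = 0.95
full_sum = 1 / (1 - LAM)                      # infinite geometric sum: 20.0
one_year = sum(LAM ** i for i in range(257))  # weights kept when truncating at one year
print(LAM ** 256)            # below 0.000002, as stated in the text
print(full_sum - one_year)   # total weight discarded by the truncation: tiny too
```

Both the last retained weight and the total discarded weight are negligible, which is why truncation and the full data set give substantially the same estimate.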
Using this idea and the known identity:

Σ_{i=0,…,N} λ^i = (1 − λ^(N+1)) / (1 − λ)
8 A simple “back of the envelope” computation: say the standard deviation of stock returns over
one year is in the range of 30%. Even in the simple case where data on returns are i.i.d., if we estimate
the expected return over one year with the sample mean we need about 30 observations (years!) in
order to reduce the sampling standard deviation of the mean to about 5.5%, so as to be able to estimate
reliably risk premia (this is financial jargon: the expected value of return is commonly called ’risk
premium’, implying some kind of APT, even if it also contains the risk free rate) of the size of at
least (usual 2σ rule) 8%-10% per year (quite big indeed!). Notice that things do not improve if we use
monthly or weekly or daily data (why?). It is clear that any direct approach to the estimate of risk
premia is doomed to failure. A connected argument shall be considered at the end of this chapter.
We can then approximate the Vt estimate as:
Vt = (1 − λ) Σ_{i=0,…,n} λ^i (r*_{t−i})²
We have (writing Σλ^i for Σ_{i=0,…,n} λ^i):

Vt = λ [Σ_{i=0,…,n} λ^i (r*_{t−1−i})²] / Σλ^i + (r*_t)² / Σλ^i − λ^(n+1) (r*_{t−n−1})² / Σλ^i =

= [Σ_{i=0,…,n} λ^(i+1) (r*_{t−1−i})²] / Σλ^i + (r*_t)² / Σλ^i − λ^(n+1) (r*_{t−n−1})² / Σλ^i =

= [(r*_t)² + Σ_{i=0,…,n−1} λ^(i+1) (r*_{t−1−i})²] / Σλ^i = [Σ_{i=0,…,n} λ^i (r*_{t−i})²] / Σλ^i

so that, approximating 1/Σλ^i with (1 − λ) and dropping the last, negligible, term9:

Vt = λVt−1 + (1 − λ)(r*_t)²
In practice the new estimate at time t, Vt, is a weighted mean of the old estimate
at time t − 1, Vt−1 (recall: the weight λ is usually big), and of the latest squared log
return (the weight 1 − λ is usually small).
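That the recursion really reproduces the exponentially weighted sum of squared returns can be checked on simulated data; a sketch, with λ = 0.94 (the classic RiskMetrics value quoted later in the text) and hypothetical daily log returns:

```python
import random

LAM = 0.94
rng = random.Random(0)
returns = [rng.gauss(0.0, 0.01) for _ in range(2000)]  # hypothetical daily log returns

# Recursive form: V_t = lam * V_{t-1} + (1 - lam) * r_t^2
v = returns[0] ** 2            # one conventional way to seed the recursion
for r in returns[1:]:
    v = LAM * v + (1 - LAM) * r ** 2

# Direct weighted-sum form: (1 - lam) * sum_i lam**i * r_{t-i}^2
direct = (1 - LAM) * sum(LAM ** i * r ** 2
                         for i, r in enumerate(reversed(returns[1:])))
print(v, direct)  # equal up to the (negligible) weight left on the seed term
```

The two numbers agree to machine precision: after 2000 steps the seed value carries a weight of λ^1999, which is effectively zero.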
A simple consequence of this (and of the fact that the estimate does not consider
the mean return) is the following.
9 λ^(n+1) (r*_{t−n−1})² / Σ_{i=0,…,n} λ^i ≈ (1 − λ)λ^(n+1) (r*_{t−n−1})² and, for 0 < λ < 1, big n and any squared return “not too big”,
this shall be approximately 0.
Since the squared return is always non negative and λ is usually near one, this
formula implies that, even if the new return is 0, Vt is still going to be equal to λVt−1,
so that the estimated variance can decrease at most by a fraction 1 − λ. This, in
the hypothetical case of an r*_t equal or very near to 0.
On the other hand, it can increase, in principle, by any amount when abnormally
big squared returns are observed. This implies an asymmetric behavior: following any
shock, that is a big positive or big negative return, we have an abrupt jump in Vt, while
a sequence of “normally small” values for returns shall reduce the estimated value in a
smoothed way, the faster the smaller is λ (but this is never small in the applications).
The reader should remember that this behavior of estimated volatility is purely a
feature of the formula used for the estimate.
Compare this with the classic “equally weighted” estimate (again: µ set to 0):
S²_t = [Σ_{i=0,…,n} (r*_{t−i})²] / (n + 1) = S²_{t−1} + (r*_t)²/(n + 1) − (r*_{t−n−1})²/(n + 1)
The big difference, here, is not in the first and second term, but in the third. In
this case the weight of the third (and oldest) term is not negligible: it counts as much
as the most recent term.
With the classic estimate, once a “big shock” exits the range of the estimate, from t
to t − n, we observe a big downward jump in the value of the estimate. This is because
of the absence of smoothing.
Hence, the main reason for smoothing is to avoid that today’s estimate be affected,
in a relevant way, by old observations dropping off the sample range.
Let us now go back to
Vt = λVt−1 + (1 − λ)(r*_t)²
Is there any hypothesis on the “evolution of the return variance”, that makes this
behaviour not only a behaviour of the estimate evolution but also a behaviour of the
variance evolution?
If we want this, we imply that we do not totally agree with the standard, constant
σ 2 , version LRW hypothesis, as written above, as we are implying a time evolution of
the variance of returns.
The recursive formula we just found:
Vt = λVt−1 + (1 − λ)(r*_t)²
is the empirical analogue of an auto regressive model for the variance of returns the
like of:
σ²_t = γσ²_{t−1} + ε²_t
which is a particular case of a class of dynamic models for conditional volatility
(ARCH: Auto Regressive Conditional Heteroskedastic) which has enjoyed considerable
fortune in the econometric literature.
We do not go further in this but you shall discuss this topic in more advanced
econometrics courses.
The above discussion, involving the smoothed estimate for the return variance, is
by no means just a fancy theoretical analysis or a curiosity related to RiskMetrics. It
is the basis of current regulations.
Here, as an instance, I reproduce a paragraph of the EBA (European Banking
Authority) paper EBA/CP/2015/27.
The quoted Article 365(1) of Regulation (EU) No 575/2013 (On prudential require-
ments for credit institutions and investment firms), is as follows:
Article 365
VaR and stressed VaR Calculation
1. The calculation of the value-at-risk number referred to in Article 364
shall be subject to the following requirements:
(a) daily calculation of the value-at-risk number;
(b) a 99th percentile, one-tailed confidence interval;
(c) a 10-day holding period;
(d) an effective historical observation period of at least one year except
where a shorter observation period is justified by a significant upsurge in
price volatility;
(e) at least monthly data set updates.
The institution may use value-at-risk numbers calculated according to
shorter holding periods than 10 days scaled up to 10 days by an appropriate
methodology that is reviewed periodically.
2. In addition, the institution shall at least weekly calculate a “stressed
value-at-risk” of the current portfolio, in accordance with the requirements
set out in the first paragraph, with value-at-risk model inputs calibrated to
historical data from a continuous 12-month period of significant financial
stress relevant to the institution’s portfolio. The choice of such historical
data shall be subject to at least annual review by the institution, which shall
notify the outcome to the competent authorities. EBA shall monitor the
range of practices for calculating stressed value at risk and shall, in accor-
dance with Article 16 of Regulation (EU) No 1093/2010, issue guidelines
on such practices.
The language is a little bureaucratically contrived. However, the meaning of this rule
is that, if you use the exponentially smoothed estimate truncated at N, so that your
(daily data) weights are w_{t−i} = λ^i / Σ_{j=0,…,N} λ^j = λ^i (1 − λ)/(1 − λ^(N+1)), it must be that the “weighted
average time lag of the individual observations”, that is Σ_{i=0,…,N} i·w_{t−i}, be at least 125
(days).
This, for given N, requires a specific choice, or range of choices, of λ.
Notice that if N = 250 the only possible choice is λ = 1. In order to decrease λ,
so that you really have a smoothed estimate, and still respect the rule, you must increase N.10
But there is more.
Since for N → ∞ the weighted average time lag is λ/(1 − λ), the requirement asks
in any case (that is: whatever be N) for a value λ > 125/126 = .992063. An even
bigger number shall be needed for moderate N. This is much bigger than what used
to be the common case in the past. The examples in the “classic” edition of the RiskMetrics Technical Document (iv edition, 1996) use λ = .94 which, even with very big
N, corresponds to a weighted average time lag of .94/.06 = 15.(6), by far too small
according to the new rules.
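The weighted average time lag is straightforward to compute; a sketch checking the three cases discussed above (λ = 1 with N = 250, the limiting λ = .992063, and the old RiskMetrics λ = .94):

```python
def weighted_average_lag(lam, n):
    # weights proportional to lam**i for lags i = 0..n
    w = [lam ** i for i in range(n + 1)]
    return sum(i * wi for i, wi in enumerate(w)) / sum(w)

print(weighted_average_lag(1.0, 250))          # exactly 125: the N = 250 case
print(weighted_average_lag(0.992063, 100000))  # about 125, i.e. lam/(1 - lam)
print(weighted_average_lag(0.94, 100000))      # about 15.67: too reactive for the rule
```

Note that for finite N the lag is always below the λ/(1 − λ) limit, which is why moderate N forces an even bigger λ.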
at least in the sense that the percentage standard error in the estimate of the variance
shall be smaller than in the case of expected return estimation.
The educated heuristics underlying such a belief are as follows11 .
Consider log returns from a typical stock; let them be i.i.d. with expected value (on
a yearly basis) of .07 and standard deviation .3.
Suppose log random walk with constant expected value and variance.
The usual estimate of the expected value, that is the arithmetic mean, shall be
unbiased and with a sampling standard deviation of .3/√n, where n is the number of
years used in the estimation.
Hence, the ratio of the expected value to the standard error of its estimate, a measure of how precise we may expect the estimate to be, under these hypotheses, shall
be .07/(.3/√n), that is, roughly √n/4.
Hence, for a t-ratio of 2 we need √n = 8, that is n = 64 (years!). If we want
a standard error equal to 1/4 of µ (a t-ratio of 4) we need √n = 16 and n = 256.
This means you need 256 years of data for a 2σ confidence interval that still implies a
possible error of 50% in the estimate of µ (as the sampling standard deviation would
be just of the order of 1/4 of the estimate of the mean).
This simple back of the envelope computation explains why we know so little about
expected returns: if our a priori are correct then it is very difficult to estimate them.
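The back-of-the-envelope above can be wrapped in a two-line function; the name and the argument (the assumed σ/µ ratio) are of course just for illustration:

```python
import math

def years_for_t_ratio(sigma_over_mu, target):
    # smallest integer n with sqrt(n)/(sigma/mu) >= target
    return math.ceil((target * sigma_over_mu) ** 2)

print(years_for_t_ratio(4, 2))           # -> 64 years for a t-ratio of 2
print(years_for_t_ratio(4, 4))           # -> 256 years for a t-ratio of 4
print(years_for_t_ratio(0.3 / 0.07, 2))  # -> 74 with the exact .3/.07 ratio
```

The text rounds .3/.07 to 4; keeping the exact ratio makes the required samples slightly bigger still, which only reinforces the point.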
There could be a way out. Do not use yearly data but, say, monthly data.
Alas, for log returns and under log random walk this does not work.
Keep n constant and use any k sub periods per year (of length 1/k in yearly terms)
such that the number of observations in n years (for returns over the sub periods) is
kn. The strategy could be that of estimating the sub period expected value µk = µ/k
(the equality is due to the log random walk hypothesis) and then getting an estimate of
the yearly expected value by multiplying the sub period estimate by k. If we indicate
with r*_{ki} the log returns for the sub periods, with σ²_k = σ²/k as variance, we would have:

V(µ̂_k) = V(Σ_{i=1,…,kn} r*_{ki} / kn) = σ²_k/(kn) = σ²/(k²n)
This seems much better than before, but it is an illusion: we do not need an estimate
of µ_k, we need an estimate of µ = kµ_k, that is: kµ̂_k. We must then compute V(kµ̂_k)
and this is

V(kµ̂_k) = k²V(µ̂_k) = σ²/n
Exactly the same as with “aggregated” data. This should not surprise us: in fact the
arithmetic mean of log returns is simply given by the log of the ratio of the last to the
first price divided by the required number of data points. In other words it only changes
11 This point is discussed in many papers and book chapters. Among the most illustrious examples, see
Appendix A in: Merton, R.C., 1980.”On estimating the expected return on the market: an exploratory
investigation”. J. Financ. Econ. 8, 323–361.
because of the denominator: n for a yearly mean and kn for a sub period (of length 1/k)
mean. No information is added by using sub period data, hence no improvement in
the variance of the estimate.
In even simpler words: the estimates of µ based on yearly data or on any subperiod
data (same time interval, obviously) are always identical, so their sampling variances
are always the same.
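This identity is easy to verify on one simulated history; a sketch with assumed parameters (µ = .07, σ = .3, 50 years sampled monthly):

```python
import math
import random

rng = random.Random(7)
YEARS, K = 50, 12          # K hypothetical sub-periods (months) per year
mu, sigma = 0.07, 0.30

# one simulated history of monthly log returns under the LRW
monthly = [rng.gauss(mu / K, sigma / math.sqrt(K)) for _ in range(YEARS * K)]
# yearly log returns are just sums of the monthly ones
yearly = [sum(monthly[i * K:(i + 1) * K]) for i in range(YEARS)]

mu_hat_yearly = sum(yearly) / YEARS
mu_hat_from_monthly = K * (sum(monthly) / (YEARS * K))
print(mu_hat_yearly, mu_hat_from_monthly)  # the same number, to rounding
```

Both estimates reduce to log(P_T/P_0)/YEARS, so they coincide whatever the sampling frequency.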
In summary: the expected return is difficult to estimate for two reasons.
First: σ is expected to be much bigger than µ and the t-ratio, hence the precision
of the estimate, depends on the ratio of these.
Second: even if we increase the frequency of observations, nothing changes for the
estimate of the (yearly) µ so that its sampling variance stays the same.
Now, let us do a similar analysis for the variance. In order to make things simple
we shall suppose that µ is known and data are Gaussian. This allows us to quickly
find some useful results.
The general case (unknown µ) is given below, but nothing relevant intervenes when
we remove the two simplifying hypotheses.
We start with the classic estimate of the variance, at the end of this section we shall
also consider the smoothed estimate.
Let us compute the sampling variance of our variance estimate (known µ), letting
r*_i be the yearly log return:

V(σ̂²) = V(Σ_{i=1,…,n} (r*_i − µ)²/n) = (1/n)·V((r* − µ)²) = (1/n)·(E((r* − µ)⁴) − E((r* − µ)²)²) = (1/n)·(µc4 − σ⁴)
Where µc4 = E((r* − µ)⁴) is the fourth centered moment and, without further hypotheses,
could be any non negative constant.
If the r*_i are Gaussian we have µc4 = 3σ⁴ (take it for granted, no proof required)
and the resulting variance of the sampling variance is

V(Σ_{i=1,…,n} (r*_i − µ)²/n) = (2/n)·σ⁴
So that the sampling standard deviation of the estimated variance shall be, with our
numbers, √2·σ²/√n.
The ratio of σ² to the sampling standard deviation of its estimate shall be

σ²√n / (σ²√2) ≈ .7√n
In order to get a ratio equal to 2 we need √n > 2/.7 and we get there with just n = 9,
instead of 64 as in the expected value case. For a ratio of 4 or greater we now need
√n > 4/.7 and for this n = 33 suffices (instead of n = 256 for the above discussed case
of the expected value).
But there is much more: for estimating the variance the use of higher frequency
data improves the result.
Let our strategy be that of estimating the yearly variance as k times the estimated
variance for a sub period of length 1/k in yearly terms (the prime sign indicates that
this is a new estimate):

σ̂′² = k·σ̂²_k
Using the same notation and hypotheses as above we get
V(σ̂²_k) = V(Σ_{i=1,…,kn} (r*_{ki} − µ_k)²/kn) = (1/kn)·V((r*_k − µ_k)²) = (1/kn)·(E((r*_k − µ_k)⁴) − E((r*_k − µ_k)²)²) =

= (1/kn)·(µ_{kc4} − σ⁴_k) = (2/kn)·σ⁴_k = (2/kn)·(σ⁴/k²)
where we used the Gaussian hypothesis. Then

V(σ̂′²) = k²·V(σ̂²_k) = (2/kn)·σ⁴
And we see that now k is in the formula: using sub period data improves the estimate.
Now the ratio for the variance estimate is equal to

σ²√(kn) / (σ²√2) ≈ .7√(kn)
So that the use of k sub periods per year has an effect identical to that of multiplying
the number of years by k.
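A small Monte Carlo makes the √(kn) effect tangible; a sketch assuming σ = .3, µ known and set to 0, and ten years of data sampled yearly (k = 1) versus monthly (k = 12):

```python
import math
import random
import statistics

rng = random.Random(1)
sigma, years, REPS = 0.30, 10, 2000

def annualized_var_estimate(k):
    # k times the mean squared sub-period return (mu known and set to 0)
    n_obs = years * k
    rs = [rng.gauss(0.0, sigma / math.sqrt(k)) for _ in range(n_obs)]
    return k * sum(r * r for r in rs) / n_obs

sampling_sd = {}
for k in (1, 12):
    sampling_sd[k] = statistics.stdev(annualized_var_estimate(k) for _ in range(REPS))
    theory = math.sqrt(2 / (k * years)) * sigma ** 2
    print(k, sampling_sd[k], theory)
```

The simulated sampling standard deviations track the theoretical √(2/(kn))·σ² closely: moving from yearly to monthly data shrinks the standard error by about √12.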
With, say, monthly data, we need less than one year (actually 9 months) for a ratio of 2 and slightly less than 3 years (just 33 months) for a ratio of 4: with monthly data the ratio for the variance becomes greater than 4 in 3 years instead of the 33 years needed with yearly data¹².
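The claim that k sub periods per year act exactly like k extra years can be checked by simulation under the same hypotheses (iid Gaussian log returns, known mean); the parameters and seed below are illustrative choices of mine:

```python
import random
import statistics

# Monte Carlo sketch: check V(k * sigma_hat_k^2) = 2*sigma^4 / (k*n)
random.seed(0)
sigma, n_years, k = 0.2, 5, 12      # assumed yearly vol, years, sub-periods/year
mu_k = 0.0                          # known sub-period mean
estimates = []
for _ in range(2000):
    # k*n sub-period log returns, each with standard deviation sigma/sqrt(k)
    sub = [random.gauss(mu_k, sigma / k ** 0.5) for _ in range(n_years * k)]
    var_k = sum((r - mu_k) ** 2 for r in sub) / len(sub)  # known-mean estimate
    estimates.append(k * var_k)                           # sigma_hat'^2

emp = statistics.pvariance(estimates)
theory = 2 * sigma ** 4 / (k * n_years)
assert abs(emp - theory) / theory < 0.15   # simulation matches the formula
```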
Estimating σ 2 is then easier than estimating µ for two reasons:
First: the ratio between σ² and the standard deviation of its estimate is bigger than that for the expected value, whatever the n. This comes from the empirical, and theoretical, observation that expected returns are much smaller than volatilities.
Second: even if the first reason did not hold, we still have the fact that using higher frequency data improves (dramatically) the quality of the estimate for σ² while it is irrelevant for the estimate of µ.
¹²We see that if we decrease the observation interval, so that the frequency of observation per unit period k increases, in the limit we get a sampling standard deviation of the variance equal to zero. This should not be taken too seriously: the log random walk model, which underlies this result, may be a good approximation for time intervals which are both not too long and not too short. Below the 1 day horizon we enter the world of intraday, trade by trade data, which cannot be summarized in the simple log random walk hypothesis.
As mentioned above, all our formulas hold for known µ and Gaussian log returns.
For the general case we have the following result:
with i.i.d. log returns not necessarily Gaussian, for the estimate
$$S^2 = \sum_{i=1}^{n}(r_i^*-\bar r^*)^2/(n-1)$$
we get
$$V(S^2) = \frac{\mu_{c4}}{n} - \frac{\sigma^4(n-3)}{n(n-1)}$$
which, for not too small n and a fourth centered moment not very different from
the Gaussian case, gives us the same result as the above formula.
Notice that in all these cases the sampling variance of the estimate of the variance
(as that of the estimate of the expected value) goes to 0 with n going to infinity.
Let us conclude with the case of the smoothed estimate.
We are going to use the approximation for the denominator given by $\sum_{i=0,\dots,n}\lambda^i \approx 1/(1-\lambda)$ (and, for the squared weights, $\sum_{i=0,\dots,n}\lambda^{2i} \approx 1/(1-\lambda^2)$):

$$V(V_t) = V\left((1-\lambda)\sum_{i=0,\dots,n}\lambda^i r_{t-i}^{*2}\right) = (1-\lambda)^2\sum_{i=0,\dots,n}\lambda^{2i}V(r_{t-i}^{*2}) = \frac{(1-\lambda)^2}{1-\lambda^2}\left(\mu_4-\mu_2^2\right) = 2\sigma^4\frac{1-\lambda}{1+\lambda}$$
where the last equality is true if the expected value is zero (as assumed in RiskMet-
rics) and log returns are Gaussian (and recall: 1 − λ2 = (1 + λ)(1 − λ)).
Here it is meaningless to compare this with the quality of the estimate for µ because
this is assumed equal to zero.
It is, however, interesting to compare the result with a result based on sub periods of length 1/k in yearly terms. Everything depends on the choice of λ for the sub periods. If we set it to λ^(1/k) we have

$$V_{kt} = (1-\lambda^{1/k})\sum_{i=0,\dots,kn}\lambda^{i/k}r_{kt-i}^{*2}$$

$$V(V_{kt}) = 2\,\frac{1-\lambda^{1/k}}{1+\lambda^{1/k}}\,\frac{\sigma^4}{k^2}$$
and we have, for the estimate $V_t' = kV_{kt}$ (we use the prime sign because this is different with respect to the estimate using aggregated data)

$$V(kV_{kt}) = k^2\,2\,\frac{1-\lambda^{1/k}}{1+\lambda^{1/k}}\,\frac{\sigma^4}{k^2} = 2\sigma^4\,\frac{1-\lambda^{1/k}}{1+\lambda^{1/k}}$$
This, for 0 < λ < 1 and k > 1, is always smaller than the variance computed using
only full period data (k = 1)13 .
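A quick numerical check of this claim, for an illustrative λ = 0.95 and monthly sub periods (k = 12):

```python
def smooth_factor(lam):
    """The multiplier (1 - lam)/(1 + lam) in V(V_t) = 2*sigma^4*(1-lam)/(1+lam)."""
    return (1 - lam) / (1 + lam)

lam, k = 0.95, 12                      # illustrative values
full = smooth_factor(lam)              # full-period data, lambda = 0.95
sub = smooth_factor(lam ** (1 / k))    # k sub-periods with lambda^(1/k)
assert sub < full                      # sub-period smoothing always wins for k > 1
```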
A last observation. What we did see here may seem to be strictly connected with
stock returns.
Actually, this is not the case.
The properties we discussed are valid for any series of at least approximately iid (and not too un-Gaussian) random variables, where you must estimate µ and σ² and you are interested in the quality of the two estimates. The only case when we can expect the quality of the estimate of µ to be better than the quality of the estimate of σ² is when we can assume µ much bigger than σ. Otherwise it shall always be simpler to get a better estimate of σ² (and σ) than of µ, in particular if you can increase the frequency of observations.
Examples
Exercise 2 - volatility.xls Exercise 3 - risk premium.xls Exercise 3a - exp smoothing.xls
Exercise 3b - historical and implied volatility.xls Exercise 3c - volatility.xls
$$\Phi\left(\frac{k-\mu}{\sigma}\right)$$
Notice that, for distributions characterized by more than two parameters, as for instance a non standardized version of the T distribution, this property is obviously no longer valid.
It is then of real interest to find good distribution models for stock returns and, in
particular, to evaluate whether the simplest and most tractable model: the Gaussian
distribution, can do the job.
A better understanding of the problem can be achieved if we consider that, in most
applications, we are not interested in the overall fit of the Gaussian distribution to
observed returns but only in the quality of fit for hot spots of the distribution, mainly
tails.
In Finance the biggest losses are usually connected to extreme, negative, observa-
tions (for an unhedged institutional investor). We shall see that the Gaussian distri-
bution while being, overall, not such a bad approximation of the underlying return
distribution, is not so for the extreme, say 1-2%, tails15 .
When studying stock returns, we observe extreme events, mainly negative, in the
order of µ minus 5 σ and more with a frequency which is incompatible with the prob-
ability of such or more negative events under the hypothesis of Gaussianity.
In these evaluations µ and σ are estimated using a long record of data.
While quite rare (do not be fooled by the fact that extreme events always make the
news and so become memorable) such extreme events are much more frequent than
should be compatible with a Gaussian calibrated on the expected value and variance
of observed data.
For instance, let us consider events where returns are more negative than µ − 5σ.
In a Gaussian, the probability of such events is less than 0.00000028 (use the Excel
function which computes such probability).
Is the frequency of similar events actually so small?
Obviously, we can only observe frequencies, not probabilities, and we can only
estimate µ and σ but, with the inevitable statistical approximation, a simple empirical
analysis shall be useful.
Let us consider an example based on a long series of I.B.M. daily returns.
Between Jan 2nd 1962 and Dec 29th 2005 the IBM daily return shows a standard deviation of 0.0164¹⁶. In this time period the return was below −5σ 14 times
¹⁵The Gaussian distribution can be a good approximation of many different unimodal distributions if we are interested (as is true in many applications of Statistics) in the behaviour of a random variable near its median. For modeling extreme events, having to do with system failures, breakdowns, crises and similar phenomena, a totally different kind of distribution may be required.
¹⁶Data are in the Excel file "Exercise 2 - IBM random walk".
(suppose a µ of 0; if you use the historical mean the number of extreme events is even bigger).
The number of observations is 11013, so the observed frequency of a −5σ event is 0.00127. This is, obviously, small, but it is more than 4500 times the probability of such observations for a Gaussian with the same standard deviation!
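The comparison can be reproduced with the standard library (`statistics.NormalDist`); since the exact ratio depends on how the inputs are rounded, only its order of magnitude is asserted here:

```python
from statistics import NormalDist

p_gauss = NormalDist().cdf(-5)   # P(Z <= -5), about 2.9e-7
freq = 14 / 11013                # observed -5*sigma frequency for I.B.M.
ratio = freq / p_gauss
assert p_gauss < 0.00000029      # the bound quoted in the text
assert 4000 < ratio < 5000       # thousands of times the Gaussian probability
```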
This is true for a very "mature" and "conservative" stock like I.B.M.
While a frequency of 0.00127 is very small, the events on which it is computed
(big crashes) are those which are remembered in the history of the market. It is
quite clear that, in this case, a Gaussian distribution hypothesis could imply a gross underestimation of the probability of such events, and belief in the Gaussian distribution could imply taking poorly measured risks.
The observed behaviour of the empirical distribution of returns of, basically, any
stock, can be summarized in the motto: fat tails, thin shoulders, tall head.
In other words, given a set of (typically daily) returns over a long enough time period
(we need to estimate tails and this requires lots of data) we can plot the histogram of
our data on the density of a Gaussian distribution with the same mean and standard
deviation. What we observe is that, while overall the Gaussian interpolation of the
histogram is good, if we zoom in the extreme tails: say, first and last two percent
of data, we see that the tails of the histogram decrease at a slower rate than those
of the Gaussian distribution. Moreover, toward the center of the distribution, we see
how the “shoulders” of the histogram are thinner than those of the Gaussian and,
correspondingly, the histogram is more peaked around the mean.
The following plots are from the excel file “Exercise 4 - Non Gaussian returns”, in
this worksheet we use data from May 19th 1995 to Sep 28th 2005 on the same I.B.M.
series as before.
The first plot compares the interpolated Histogram of empirical data (blue) with a
Gaussian density with the same mean and variance as the data (magenta). You can
clearly see the mentioned “fat tails, thin shoulders”.
Since tails, fat or not, are still tails (that is: they are thin), in the second plot we focus on the extreme left tail, where at this scale the difference between the empirical and the Gaussian distribution becomes visible. The x axis is scaled in terms of standard deviation units (1 means 1 standard deviation) and we see that, moving to the left starting at, roughly, 2, the empirical tail is above the Gaussian tail: extreme observations are more frequent than what we would expect in a Gaussian distribution with the same mean and variance as the data.
[Figure: Empirical VS Gaussian density. I.B.M. data — relative frequency histogram of returns against the standard Gaussian density.]

[Figure: Left tail empirical VS Gaussian CDF. I.B.M. data — empirical CDF against the Gaussian CDF on the left tail, x axis from −6 to 0 standard deviation units.]
Another way to compare the empirical distribution with a Gaussian model (or any
model you may choose) is the Quantile-Quantile (QQ) plot. In the worksheet you find
the standardized version of the plot. In order to build a standardized QQ plot from
data you must first choose a comparison distribution, in our case the Gaussian. The
second step is that of standardizing the data, using some estimate of the data expected
value and variance. The standardized dataset is then sorted in increasing order and the observations in this dataset shall be the X coordinates in the plot. For each observation of the standardized returns dataset, compute the relative frequency with which smaller than or equal values were observed. Then compute, using some software version of the standard Gaussian CDF tables, the value of the standard Gaussian which leaves on its left exactly the same probability as the relative frequency left on its left by the X data; this shall be the corresponding Y coordinate in the plot.
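The construction just described can be sketched in a few lines (the function name and the i/(n+1) plotting positions are my choices; the text leaves the exact frequency convention open):

```python
from statistics import NormalDist, mean, pstdev

def qq_points(returns):
    """Standardized QQ-plot coordinates against a standard Gaussian.
    X: sorted standardized returns; Y: standard Gaussian quantile of the
    empirical CDF, using i/(n+1) positions to avoid probabilities 0 and 1."""
    m, s = mean(returns), pstdev(returns)
    xs = sorted((r - m) / s for r in returns)
    n = len(xs)
    nd = NormalDist()
    ys = [nd.inv_cdf((i + 1) / (n + 1)) for i in range(n)]
    return list(zip(xs, ys))

# For roughly Gaussian data the points should hug the bisecting line.
pts = qq_points([0.01, -0.02, 0.005, 0.015, -0.01, 0.0, 0.02, -0.015])
assert all(abs(x - y) < 1.0 for x, y in pts)
```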
[Figure: Quantile Quantile Plot. I.B.M. data — standardized sorted returns (x axis, −10 to 8) against the standard Gaussian equivalent observations (y axis).]
In the end what you see is a curve of coordinates X,Y. If the curve is a bisecting
straight line, your empirical CDF is approximated well by a Gaussian CDF. Departures
from the bisecting line are hints of possible non Gaussianity. To facilitate the reading of
the plot, a bisecting line is added to the picture. In a second, equivalent, version of the plot the X coordinate is the same but on the Y axis we plot the difference between the Y as computed for the previous plot and the bisecting line. This is called a "detrended" QQ plot.
For the I.B.M. data we see how, on the left tail, observed data are above the
diagonal, meaning a left tail heavier than the Gaussian. On the opposite side of the
plot we see how the QQ plot lies below the bisecting line. Again: this means that we are observing data far from the mean, on the right side of it, with a higher frequency than compatible with the Gaussian hypothesis.
Since the data are standardized, the scale of the plot is in terms of number of
standard deviations. We see that, on the left tail, we even observe data near and
beyond -6 times the standard deviation. The tail from minus infinity to -6 times the
standard deviation contains a probability of the order of 5 divided by one billion for the
standard Gaussian distribution. We also observe 10 data points on the leftmost −5σ
tail. Since our dataset is based on 10 years of data, roughly 2600 observations, if we read our data as the result of independent extractions from the same Gaussian, these observations, while possible, are by no means expected: the probability of observing 10 times, in 2600 independent draws, something which has in each draw a probability of 0.00000028 of being observed is virtually 0¹⁷.
We can also follow a different, strongly related, line of thought. We see that in this dataset made of about 2600 daily observations we observe an extreme negative return of around −8σ. This is the most negative return, hence the minimum observed value.
Now, let us ask the following question: what is the probability of observing such a negative minimum if data come from a Gaussian?
Suppose data are iid and distributed according to a (standardized) Gaussian. In
this case the probability of observing data below the minus 8 sigma level is Φ(−8) and
this for each of the 2600 observations.
However, the probability of observing AT LEAST one value less than or equal to this is 1 minus the probability of never observing such a value, that is, due to iid: $1-(1-\Phi(-8))^{2600}$. It is clear that $1-\Phi(-8)$ is almost 1, but $(1-\Phi(-8))^{2600}$ could be much smaller.
¹⁷To understand this use the binomial distribution. Question: suppose the probability of observing a −5σ in each of 2600 independent "draws" is 0.00000028. What is the probability of observing 10 such events? The answer, computed with Excel, is: $\binom{2600}{10}0.00000028^{10}(1-0.00000028)^{2590} = 0.0000...$ Meaning that, at the precision level of Excel, we have a 0! While the exact number is not 0, this means that, at least in Excel, the actual rounding error could be quite bigger than the result. For all purposes the answer is 0. Question: in this section we evaluated the "un-likelihood" of −5σ results in two different ways: first with a ratio between frequency and Gaussian based probability, then using the binomial distribution and, again, the Gaussian based probability. What is the connection between these two, different, arguments?
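The binomial computation in the footnote can be redone in code; the value underflows Excel's displayed precision but not a double:

```python
from math import comb

p, n, k = 0.00000028, 2600, 10
# P(exactly 10 "-5 sigma" events in 2600 independent draws)
prob = comb(n, k) * p ** k * (1 - p) ** (n - k)
# Around 1e-38: not an exact zero, but zero for all practical purposes.
assert 0 < prob < 1e-35
```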
Is it small enough to make $1-(1-\Phi(-8))^{2600}$ big enough, so that a minimum value of −8σ over 2600 iid observations from a standard Gaussian would not be termed "anomalous"?
The computation is not too simple, as Φ(−8) is a VERY small number and the precision of the Excel routine for its computation cannot be guaranteed. However, using Excel we get $(1-\Phi(-8))^{2600} = .999999999998268$, so that even if we take into account the 2600 observations the probability of observing as minimum of the sample a −8σ data point is still not really different from 0. I checked the result using Matlab (whose numerical routines should be more precise than Excel's), getting a very similar result.
In order to get $1-(1-\Phi(-8))^n$ in the range of .01 (still very unlikely) we would need n = 15,000,000,000,000. These are open market days and would correspond roughly to 59 billion years, a time period roughly 4 times the current estimate of the age of our universe. (Again: beware of roundings!)
In any sense observing even a single −8σ value during the full history of the stock,
is quite unlikely if data come from a standard Gaussian.
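Both figures can be reproduced with the standard library; `log1p`/`expm1` keep the computation stable where a direct power may lose precision:

```python
import math
from statistics import NormalDist

phi8 = NormalDist().cdf(-8.0)                    # Phi(-8), about 6.2e-16
# 1 - (1 - Phi(-8))^2600, computed stably
p_min = -math.expm1(2600 * math.log1p(-phi8))
assert 0 < p_min < 1e-10                         # essentially zero

# Observations needed before the probability reaches about 1%:
n_for_1pct = math.log(0.99) / math.log1p(-phi8)
assert 1e13 < n_for_1pct < 2e13                  # ~15 trillion trading days
```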
It should be noticed, as a comparison, that for a T distribution with, say, 3 degrees of freedom the probability of never observing a return of −8σ over 2600 days is only 0.347165227, so that the observed minimum (or a still smaller value) has a probability of 0.652834773, that is: by no means unlikely (in doing this computation recall that the Student's T distribution variance is ν/(ν−2), where ν is the number of degrees of freedom, so that the quantile corresponding to −8 in a standard Gaussian is, now, $-8\sqrt{\nu/(\nu-2)}$).
As you can see, while at first sight similar to the Gaussian, the T distribution is
VERY un-Gaussian when tail behaviour is what interests us.
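The T computation can be verified without any statistics package, since the CDF of a Student's T with ν = 3 has a closed form; the rescaling by √3 is the variance adjustment mentioned above:

```python
import math

def t3_cdf(t):
    """Closed-form CDF of Student's T with 3 degrees of freedom."""
    x = t / math.sqrt(3)
    return 0.5 + (math.atan(x) + x / (1 + x * x)) / math.pi

# A "-8 sigma" return: T(3) has variance nu/(nu-2) = 3, so rescale by sqrt(3).
p_day = t3_cdf(-8 * math.sqrt(3))
p_never = (1 - p_day) ** 2600
assert 0.345 < p_never < 0.350    # close to the 0.347165 quoted in the text
```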
We cannot dedicate more space to this problem. A vast literature exists on the
pitfalls of using the Gaussian distribution for extreme returns.
In the following section we shall consider the relevance of these empirical facts from
the point of view of VaR estimation.
Examples
Exercise 4 - Non normal returns.xls Exercise 4b - Non normal returns.xls
Suppose a sum W is invested in a portfolio at time t0 and we are interested in the
p&l (profit and losses) between t0 and t1 that is: Wt1 − Wt0 . In all the (for us) relevant
cases, that is when some “risk” is involved, this p&l shall be stochastic due to the fact
that Wt1 as seen at t0 is stochastic. Our purpose is to give a simple summary of such
stochastic behaviour of the p&l aimed at quantifying our “risk” in a possibly immediate
way.
Many such measures can be (and have been) suggested. The RiskMetrics procedure
chose as its basis the so called VaR “Value at Risk”.
Given a level α (usually very small: 1% to 5% as a rule) of probability, the VaR is
defined as the α-quantile of the distribution of Wt1 − Wt0 .
The definition of an α-quantile xα for the distribution of a random variable X is easy to write down and understand when the distribution of X, $F_X(x)$, is continuous and strictly increasing at least in an interval $(x_l, x_u)$ such that $F_X(x_l) < \alpha < F_X(x_u)$.
In this case we simply have

$$x_\alpha \equiv x : P(X \le x) = F_X(x) = \alpha$$

and

$$x_\alpha = F_X^{-1}(\alpha)$$

where the inverse of $F_X(x)$, that is $F_X^{-1}(\alpha)$, is defined in a unique way, continuous and strictly increasing at least for $F_X(x_l) < \alpha < F_X(x_u)$.
Here, the α-quantile is nothing but the value of X which corresponds to a cumulated
probability exactly equal to α and we indicate such value with xα .
In the case of a cumulative distribution function with jumps (corresponding to
probability masses concentrated in specific values of x) there may be no x such that
FX (x) = α for a given α.
In this case the convention we use here is that of setting xα equal to the maximum
of the values x of X such that x is of positive probability and FX (x) ≤ α.
Barring this possibility, it is correct to say that the VaR, at level α, for a time
horizon between t0 and t1 , of your investment, is that value of the profit and loss such
that the probability of observing a worse one is equal to α.
This definition seems to imply that we are required to directly compute a quantile
of the p&l. This is not the case.
In fact what is required is a quantile of the return distribution.
Indeed we have

$$W_{t_1} - W_{t_0} = W_{t_0}\,r_{t_0,t_1}$$

$$W_{t_1} - W_{t_0} = W_{t_0}\left(e^{r^*_{t_0,t_1}} - 1\right)$$

where $r_{t_0,t_1}$ and $r^*_{t_0,t_1}$ are, respectively, the linear and the log return in the time interval from t0 to t1.
Since the functions return → p&l are both continuous and strictly increasing, the problem of finding the required quantile of the p&l is equivalent to the problem of finding the same quantile in the distribution of returns and transforming it back to p&l.
In this section we shall consider three different estimates of the VaR which rely on
different sets of hypotheses.
Each estimate shall be presented in a very simple form; the reader is warned that actual implementation of any of these estimates requires a detailed analysis of the available data and in some cases is subject to detailed regulation.
You’ll see more about this in more advanced courses of the master.
$$P(R \le r_\alpha) = \alpha$$

and, proceeding with the usual argument, already well known from confidence intervals theory, we get

$$\Phi\left(\frac{r_\alpha-\mu}{\sigma}\right) = \alpha$$

where zα is the usual α-quantile of the standard Gaussian CDF Φ(.).
¹⁹This hypothesis is not reasonable for linear returns, which are bounded below. It is however
We have, then

$$(r_\alpha - \mu)/\sigma = z_\alpha$$

$$r_\alpha = \mu + \sigma z_\alpha$$
This is quite easy. The problem is that, for small values of α, we are considering quantiles very far on the left tail, and our previous empirical analysis has shown how the Gaussian hypothesis for returns (overall not so bad) is inadequate for extreme tails.
Typically the problem of fat tails shall imply a dangerous undervaluation of the VaR, in the sense that the estimate shall tend to be less negative than it should.
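The Gaussian quantile itself is a one-liner; here µ is set to 0 for illustration (an assumption: the text's own estimates use the sample mean, so its figures differ slightly) and σ is the full-sample I.B.M. value 0.021421 used later in this section:

```python
from statistics import NormalDist

def gaussian_quantile(mu, sigma, alpha):
    """Gaussian return quantile r_alpha = mu + sigma * z_alpha."""
    return mu + sigma * NormalDist().inv_cdf(alpha)

r01 = gaussian_quantile(0.0, 0.021421, 0.01)   # 1% quantile, mu = 0 assumed
assert -0.051 < r01 < -0.049                   # close to the -4.93% in the text
```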
During the discussion about the different precision in estimating E(r) = µ and V(r) = σ² we derived, for the Gaussian zero-µ case, the formula

$$V(\hat\sigma^2) = \frac{\sum_{i=0,\dots,n}\lambda^{2i}}{\left(\sum_{i=0,\dots,n}\lambda^{i}\right)^2}\,2\sigma^4$$
$$V(\hat r_\alpha^2) = z_\alpha^4\,2\sigma^4\,\frac{1-\lambda}{1+\lambda}$$

We can then estimate the σ⁴ term by taking the square of the estimate of the variance and get

$$\hat V(\hat r_\alpha^2) = z_\alpha^4\,2\hat\sigma^4\,\frac{1-\lambda}{1+\lambda}$$
A possible approximate confidence bound for the quantile, with the usual "two sigma" rule, is obtained as follows: build a two sigma one sided (upper) interval for $\hat r_\alpha^2$; (minus) the square root of its extreme shall then be the lower bound for the quantile.
$$\hat r_\alpha^2 + 2\sqrt{\hat V(\hat r_\alpha^2)} = \hat r_\alpha^2 + 2\hat\sigma^2 z_\alpha^2\sqrt{2\,\frac{1-\lambda}{1+\lambda}} = \hat r_\alpha^2\left(1 + 2\sqrt{2\,\frac{1-\lambda}{1+\lambda}}\right)$$

is an (upper) confidence bound for the square of the quantile estimate. In order to convert it into a (lower) bound for the quantile estimate we simply take

$$\hat r_\alpha\sqrt{1 + 2\sqrt{2\,\frac{1-\lambda}{1+\lambda}}}$$
Let us assume that our variance estimate comes from a typical implementation of the smoothed estimate formula with daily data, n = 256 (meaning roughly one year of data) and λ = 0.95.
In this case we have $\sqrt{1 + 2\sqrt{2\frac{1-\lambda}{1+\lambda}}} = 1.2053$ and the bound shall be, roughly, 20% more negative than the point estimate of the quantile, that is

$$\hat r_\alpha\sqrt{1 + 2\sqrt{2\,\frac{1-\lambda}{1+\lambda}}} = -0.0321 \times 1.2053 = -.0387$$

Notice that $\sqrt{1 + 2\sqrt{2\frac{1-\lambda}{1+\lambda}}} = 1.2053$ only depends on the choice of λ, so that it can be precomputed for any estimate sharing the same choice of λ.
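The widening factor, as just noted, can indeed be precomputed for any λ; a sketch:

```python
import math

def widen_factor(lam):
    """sqrt(1 + 2*sqrt(2*(1-lam)/(1+lam))): multiplier turning the point
    estimate of the quantile into an approximate two-sigma lower bound."""
    return math.sqrt(1 + 2 * math.sqrt(2 * (1 - lam) / (1 + lam)))

f = widen_factor(0.95)
assert abs(f - 1.2053) < 2e-4                # the factor quoted in the text
assert abs(-0.0321 * f - (-0.0387)) < 1e-4   # the bound computed in the text
```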
What we found is a confidence bound for the quantile of the (log) return.
In order to transform this into a confidence bound for the VaR we need to know
the amount invested W at time t0 .
The bound to the VaR shall be $W(e^{-.0387}-1) = W\times(-.0380)$, that is, a loss of 3.8% (and here the use of % is correct, why?).
In order to further understand the consequences of using the smoothed estimate, consider the case of the "classic" estimate, that is, λ = 1 in

$$V(\hat\sigma^2) = \frac{\sum_{i=0,\dots,n}\lambda^{2i}}{\left(\sum_{i=0,\dots,n}\lambda^{i}\right)^2}\,2\sigma^4$$

so that

$$V(\hat\sigma^2) = \frac{2}{n+1}\sigma^4$$

and the bound shall be

$$\hat r_\alpha\sqrt{1 + 2\sqrt{\frac{2}{n+1}}}$$

and, due to n, we quickly have that the extreme of the interval becomes almost identical to the point estimate. For n = 256 we already get $\hat r_\alpha\sqrt{1+2\frac{\sqrt 2}{\sqrt{257}}} = \hat r_\alpha \times 1.084 = -0.0348$.
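For comparison, the same factor in the classic λ = 1 case depends on n and shrinks quickly:

```python
import math

def classic_factor(n):
    """sqrt(1 + 2*sqrt(2/(n+1))): the lambda = 1 ("classic") widening factor."""
    return math.sqrt(1 + 2 * math.sqrt(2 / (n + 1)))

f = classic_factor(256)
assert abs(f - 1.084) < 1e-3                 # the factor quoted in the text
assert abs(-0.0321 * f - (-0.0348)) < 1e-4   # almost the point estimate
assert classic_factor(1024) < classic_factor(256)   # shrinks as n grows
```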
i.i.d. returns with common distribution F(.) which yield observed values {r1, r2, ..., rn}; then our estimate of F(.) shall be:

$$\hat P(R \le r) = \hat F_R(r) = \frac{\#\,r_i \le r}{n} = \sum_{i=1}^{n} I(r_i \le r)/n$$
where $\#\,r_i \le r$ means "the number of observed returns less than or equal to r" and $I(r_i \le r)$ is a function which is equal to 1 if $r_i \le r$ and 0 otherwise.
Under our hypothesis of i.i.d. returns with unknown distribution F (.) the above
defined estimate works quite well in the sense that
$$E\left(\sum_{i=1}^{n} I(r_i \le r)/n\right) = \frac{n}{n}E(I(r_i \le r)) = P(r_i \le r) = F(r)$$

and

$$V\left(\sum_{i=1}^{n} I(r_i \le r)/n\right) = \frac{n}{n^2}V(I(r_i \le r)) = F(r)(1-F(r))/n$$

where the last passage depends on the fact that, for given r, $I(r_i \le r)$ is a Bernoulli random variable with P = F(r).
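The estimator itself is one line of code; a sketch with made-up returns:

```python
def ecdf(returns, r):
    """Empirical CDF: the fraction of observed returns <= r."""
    return sum(1 for ri in returns if ri <= r) / len(returns)

data = [-0.03, -0.01, 0.0, 0.005, 0.012, -0.022, 0.018, -0.004]  # illustrative
assert ecdf(data, -0.01) == 3 / 8   # counts -0.03, -0.022, -0.01
assert ecdf(data, 1.0) == 1.0
```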
Given this estimate of F , the non parametric VaR is, in principle, very easy to
compute.
Order the observed ri in an increasing way, then define r̂α as the smallest ri such that the observed frequency of data less than or equal to it is α, if such an ri exists. This ri does not exist if α is not one of the observed values of cumulative frequencies, that is: if there exists no ri such that $\hat F_R(r_i) = \alpha$. In this case we make an exception with respect to the common definition of empirical quantile and define r̂α as the biggest observed ri such that $\hat F_R(r_i) < \alpha$.
(Linear or other interpolations between consecutive observations are frequently used
but we shall consider this in the “semi parametric VaR” section). This is nothing but
a possible definition for the inversion of the empirical CDF20 .
The problem with this estimate is that, if α is small, we are considering areas of the
tail where, probably, we made very few observations. In this case the estimate could
be quite unstable and unreliable. The reader should compare this estimate with the
estimate of a quantile in the Gaussian case. In the Gaussian case we estimate quantiles
inverting the CDF which, on its turn, is estimated indirectly, by estimating µ and σ,
the unknown parameters. This implies that any data point tells us something about
any point of the distribution (maybe very far from the observed point) as it contributes
²⁰Our choice does not correspond to some definitions of empirical quantile you may find in Statistics books. In particular, in the case where no ri exists such that $\hat F_R(r_i) = \alpha$, the empirical quantile is sometimes defined as the smallest observed ri such that $\hat F_R(r_i) > \alpha$. This would not be proper for our purpose, which is to estimate the size of a possible loss and, if needed, exaggerate it on the safe side.
to the estimate of both parameters. In other terms, a parametric hypothesis allows
us to estimate the shape of the distribution in regions where we do not make any
observations. Instead, in the non parametric case, each data point, in some sense, has
only a “local” story to tell. To be more precise: the non parametric estimate of the
CDF at a given point r does not change if we change in any way the values of our data
provided we keep constant the number of observations smaller than and greater than
r.
So, we use very little information from the data in a non parametric estimate while
the influence of any data point on a parametric estimate is big. An unwritten law of
Statistics is that, if you use little information you are going to get an estimate which is
robust to many possible hypotheses of the data distribution but with a high sampling
variability; on the other hand, if you use a lot of information in your data, as you do in
a parametric model, you are going to have an estimate which is not robust but with a
smaller sampling variability. This is what happens in the case of non parametric VaR
when compared to, say, Gaussian VaR.
equal to, see the previous sentence).
Since observations are iid and, supposing a continuous underlying distribution, the probability of observing a return less than or equal to rα is, by definition, α, the probability of making exactly i observations less than (or equal to) rα (and so n − i bigger than rα) is

$$\binom{n}{i}\alpha^i(1-\alpha)^{n-i}$$
We then have that the probability of making at most j observations less than (or equal to) rα, that is, the probability that r(j) be greater than or equal to rα, is equal to the sum of the probabilities of observing exactly i returns smaller than or equal to rα for i = 0, 1, 2, ..., j. For i = 0 all observations are greater than rα; for i = 1 only the smallest observation is smaller than or equal to rα, and so on up to i = j where we have exactly j observations smaller than or equal to rα (we are including the case r(j) = rα because we want to be on the "safe side" and avoid a possible undervaluation of the risk). Obviously, from i = j + 1 onward, we have j + 1 or more observations smaller than or equal to rα, so that r(j) shall be, supposing the probability of "ties" (identical observations) equal to 0 as in the case of a continuous F, strictly smaller than rα.
In the end, the probability of “making a mistake” in the sense of undervaluing the
possible loss, that is the probability of choosing an empirical quantile r(j) greater than
rα , is given by:
$$P(r_{(j)} \ge r_\alpha) = \sum_{i=0}^{j}\binom{n}{i}\alpha^i(1-\alpha)^{n-i}$$
Now the confidence limit: to be conservative, we want to estimate rα with an empirical quantile r(j) such that we have a small probability β that the true quantile rα is smaller than its estimate. This, again, is because we are willing to overstate and not to understate risk; hence we "prefer" to choose an estimate more negative than rα rather than a less negative one. Obviously, we would also like not to exaggerate on the safe side.
Our strategy shall be as follows: we choose an r(j) such that $P(r_{(j)} \ge r_\alpha) \le \beta$ for a given β which represents, with its size, how much we are willing to accept an underestimation of the risk (the smaller the β, the more averse we are to underestimating rα). On the other hand we do not want j to be smaller (that is, r(j) more negative) than required. Summarizing, we must solve the problem

$$\max(j) : P(r_{(j)} \ge r_\alpha) \le \beta$$
If α = .01 and n = 2000, intuitively we could use as an estimate of rα the empirical quantile r(20). However, if we make this choice, we are going (for n and α not too small) to have roughly fifty-fifty probability that the true quantile is on the left or on the right of the estimate. This is due to the central limit theorem, according to which

$$\sum_{i=0}^{j}\binom{n}{i}\alpha^i(1-\alpha)^{n-i} \approx \Phi_{n\alpha;\,n\alpha(1-\alpha)}(j)$$

If the approximation works for our n and α, we see that nα becomes the mean of (almost) a Gaussian, hence the probability on its right and on its left becomes .5.
For reasons of prudence fifty-fifty is not good for us: we go for a smaller probability that the chosen quantile be bigger than rα, that is, for a β smaller than .5. For this reason we choose an empirical quantile corresponding to a j smaller than the j just smaller than (or equal to) nα, and we do this according to the above rule.
The just quoted central limit theorem, if n is big and α not too small, simplifies
our computations with the following approximation:
$$P(r_{(j)} \ge r_\alpha) = \sum_{i=0}^{j}\binom{n}{i}\alpha^i(1-\alpha)^{n-i} \approx \Phi_{n\alpha;\,n\alpha(1-\alpha)}(j) = \Phi_{0;1}\left(\frac{j-n\alpha}{\sqrt{n\alpha(1-\alpha)}}\right)$$

With this approximation, we want to solve

$$\max(j) : \Phi_{0;1}\left(\frac{j-n\alpha}{\sqrt{n\alpha(1-\alpha)}}\right) \le \beta$$

So that our solution is given by the biggest (integer) j such that $\frac{j-n\alpha}{\sqrt{n\alpha(1-\alpha)}} \le z_\beta$ or, which is the same, the biggest (integer) j such that $j \le n\alpha + \sqrt{n\alpha(1-\alpha)}\,z_\beta$.
Using the more compact "integer part" notation and calling $\hat r_{\alpha,\beta}$ our lower bound, we have:

$$\hat r_{\alpha,\beta} = r_{\left(\left[n\alpha+\sqrt{n\alpha(1-\alpha)}\,z_\beta\right]\right)}$$
Notice that $\left[n\alpha+\sqrt{n\alpha(1-\alpha)}\,z_\beta\right]$ does not depend on the observed data but on α, β, n only. Hence the solution, in terms of j, that is: which ordered observation to use (obviously NOT in terms of r(j)), is known before sampling.
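The rule can be wrapped in a small function (the name is mine):

```python
import math
from statistics import NormalDist

def conservative_rank(n, alpha, beta):
    """Integer part of n*alpha + z_beta*sqrt(n*alpha*(1-alpha)): the index j
    of the ordered observation r_(j) used as the conservative (level-beta)
    bound for the empirical alpha-quantile."""
    z_beta = NormalDist().inv_cdf(beta)
    return math.floor(n * alpha + z_beta * math.sqrt(n * alpha * (1 - alpha)))

# 1000 observations, 2% VaR, beta = 2.5%: the 11-th ordered observation.
assert conservative_rank(1000, 0.02, 0.025) == 11
```

With β = .5 the correction vanishes and the rule falls back to the naive nα-th observation, e.g. `conservative_rank(2000, 0.01, 0.5)` gives 20.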
Suppose, for instance, you have 1000 observations and look for the 2% VaR. The
most obvious empirical estimate of the 2% quantile is the 20th ordered observation,
but, according to the central limit theorem, the probability that the true 2% quantile
is on its left (as on its right but this is not important for us) is 50%.
To be conservative you wish for a quantile which has only 2.5% probability of being on the right of the 2% quantile. Hence you choose a β of 2.5% (zβ = −1.96) and you get

$$n\alpha + z_\beta\sqrt{n\alpha(1-\alpha)} = 1000\times.02 - 1.96\sqrt{1000\times.02\times.98} = 11.32$$

According to this result your choice for the lower (97.5%) confidence bound for the (2%) VaR is given by the [11.32] = 11-th ordered observation, that is, roughly, the 1% empirical quantile.
Beware: do not mistake α for β. The first defines the quantile you want to estimate
(rα ) and the second the confidence level of the confidence interval.
Is this prudential estimate much different w.r.t. the simple "expected" quantile? It depends on the distance between ranked observations on the tail, for observed cumulative frequencies of value about α. If the tail goes down quickly the distance is small and the difference between, in this case, the 11th and the 20th ordered observation shall not be big. On the contrary, with heavy tails the difference between the 1% and the 2% empirical quantile can be quite big.
As an example consider the case of the I.B.M. data between May 19th 1995 to Sep
28th 2005 discussed above.
The point estimate of the 2.5% quantile (ranked observation 66) is -4.105% and that of the 1% quantile (ranked observation 26) is -5.57%.
These point estimates correspond to 97.5% confidence bounds of -4.67% (observation 50) and -6.53% (observation 16): in the first case roughly .5% more negative than the point estimate, in the second case 1%. The reason for the difference is that around the 1% empirical quantile observations are more "rarefied", hence with larger intervals in between, than around the 2.5% empirical quantile.
With the same data, a Gaussian VaR estimate using, for comparison, the full sample standard deviation as estimate of σ (value 0.021421) gives, for the 2.5% VaR, a point estimate of -4.14%, to be compared with the 2.5% empirical quantile -4.105% (confidence bound -4.67%). However, in the Gaussian case the (approximate 2σ) lower confidence limit, given the more than 2600 observations and the unsmoothed estimate, is -4.23%: very similar to the point estimate. As we saw a moment ago, this is not true for the empirical VaR (.5% of difference between the estimate and the confidence limit).
Things are worse for more extreme quantiles.
If we compute the 1% quantile in the Gaussian case, we get -4.93% with a (two σ)
bound of -5.02%, to be compared with the non parametric -5.57% and the corresponding
bound of -6.53%.21
21 These estimates may change very much if we change the sample. For instance, with a longer
stretch of data, between 1962 and 2005, the standard deviation is 0.0164 and the 2% Gaussian VaR is
When we are evaluating extreme quantiles, two "negative" forces add up. First, the
empirical distribution is very "granular" in the tails (very few observations). Second,
the empirically observed heavy tails imply the possibility of considerable differences
between contiguous quantiles, bigger than expected in the case of Gaussian data.
Non parametric VaR, sometimes dubbed “historical” VaR because it uses the ob-
served history of returns in order to estimate the empirical CDF, is probably the most
frequently used in practice. Again, confidence limits are often ignored and this could be
due to their dismally big size.
The problem of a big sampling variance for such estimates is very well known.
Applied VaR practitioners and academics have suggested, in recent years, an amazing
number of possible strategies for improving the quality of the non parametric tail
estimate. Most of these suggestions fall into two categories: semi parametric modeling of
the distribution tails and filtered resampling algorithms.
In the following subsection we shall consider a simple example of semi parametric
model. The resampling approach is left for more advanced courses.
We suppose that for r negative enough:

P(R ≤ r) = L(r)|r|^(−a)

where, for such negative enough r, L(·) is a slowly varying function for r → −∞
(formally this means lim r→−∞ L(λr)/L(r) = 1 ∀λ > 0, and you can understand this as implying
that the function L is approximately a constant for big negative values of r) and a
is the speed with which the tail of the CDF goes to zero with a polynomial rate.
This is sometimes called a “Pareto” tail because a famous density showing this tail
behaviour bears the name of Vilfredo Pareto.
This choice of tail behaviour could be justified on the basis of limit theory, as hinted
at before, or on the basis of good empirical fitting to data.
Notice that the Gaussian CDF has exponential tails, which go to 0 much faster
than polynomial tails. Pareto tails are, thus, a model for “heavy” tails.
Provided we know where to plug in the model (that is: which value of r is negative
enough) our first task is that of estimating a, the only parameter in the model. In
order to do so we take the logarithm of the previous expression and we get:

log(P(R ≤ r)) = log(L(r)) − a log(|r|)

We then assume that, maybe with an error, log(L(r)) can be approximated by a constant C:

log(P(R ≤ r)) ≈ C − a log(|r|)

This expression begins to be similar to a linear model. In fact, if, in correspondence of
any observed ri, we estimate log(P(R ≤ ri)) with log(F̂R(ri)) and summarize the
various approximations in an error term ui, we have:

log(F̂R(ri)) = C − a log(|ri|) + ui
A linear regression based on this model shall not work for the full dataset of returns,
but it shall work for a properly chosen subset of extreme negative returns. A simple
way to find the proper subset of observations is that of plotting log(F̂R (ri )) against
log(|ri |) for the left tail of the distribution. Typically this plot shall show a parabolic
region (consistent with the Gaussian hypothesis) followed by a linear region (consistent
with the polynomial hypothesis). The regression shall be run with data from the second
region.
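The procedure just described can be sketched as follows (simulated heavy-tailed data stand in for the observed returns, and the choice of the most negative 1% of observations as the "tail" is a hypothetical stand-in for the cut-off one would read off the log-log plot):

```python
import numpy as np

# Simulated heavy-tailed returns standing in for the observed series
# (Student t with 3 degrees of freedom; the 0.02 scale is arbitrary).
rng = np.random.default_rng(0)
r = 0.02 * rng.standard_t(df=3, size=2600)

r_sorted = np.sort(r)                  # ascending: most negative first
n = len(r_sorted)
F_hat = np.arange(1, n + 1) / n        # empirical CDF at each ordered obs

# Tail subset: here simply the most negative 1% of the observations;
# in practice the cut-off should come from inspecting the log-log plot.
m = int(0.01 * n)
x = np.log(np.abs(r_sorted[:m]))       # log |r_i|
y = np.log(F_hat[:m])                  # log F_hat(r_i)

# OLS fit of log F_hat = C - a log|r| + u: the slope estimates -a
slope, C = np.polyfit(x, y, 1)
a_hat = -slope                         # estimated tail index
```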
Suppose we now have an estimate for a: how do we get an estimate of the quantile?
The problem is that of plugging in the parametric tail to the non parametric estimate
of the CDF.
The solution is simple if we suppose we have a good non parametric estimate for
the quantile rα1, where α1 is too big for this quantile estimate to be used as VaR.
[Figure: log-log plot of extreme negative data (in absolute value); the linear hypothesis seems to work on the left of −3/−4 sigma.]
What we need is an estimate of rα2 for α2 < α1. If we suppose that the tail model is
approximately true for both quantiles we have:

α1/α2 = (L(rα1)/L(rα2)) · (rα2/rα1)^a

But the ratio L(rα1)/L(rα2) should be very near to 1 (the same slowly varying function computed
at not very far away points) so that we can directly solve for rα2:

rα2 = rα1 (α1/α2)^(1/a)
Given the non parametrically estimated rα1, an estimate of a (based on the above
described regression) and a chosen α2, we are then able to estimate the quantile rα2.
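The plug-in step is a one-liner (the numbers below are purely hypothetical: a reliable 5% empirical quantile of −3% and a tail index a = 3):

```python
def extrapolate_quantile(r_a1, alpha1, alpha2, a):
    """Pareto-tail plug-in: r_{alpha2} = r_{alpha1} * (alpha1/alpha2)**(1/a)."""
    return r_a1 * (alpha1 / alpha2) ** (1.0 / a)

# Hypothetical inputs: reliable 5% empirical quantile of -3%, tail index a = 3
r_1pct = extrapolate_quantile(-0.03, 0.05, 0.01, 3.0)  # more negative than -3%
```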
The computation of the confidence interval of the semi parametric VaR is slightly
more difficult than for the other two VaR estimation methods, hence we shall not
discuss it here.
(However, you can find the formulas for the computation of this confidence interval
in the Excel file where the three methods are compared on the same dataset).
A detailed discussion of this method, with suggestions for the choice of the subset of
data on which to estimate the tail index and formulas for more sophisticated confidence
intervals, may be found in the paper referenced in footnote 22.
However, the reading of this paper is not required for the course.
A comparison of Gaussian, non parametric and semi parametric VaR is shown in
detail in the Excel worksheet Exercise 5 VaR.xls.
Examples
Exercise 5 - VaR.xls Exercise 5b
obvious point of view of any professional involved in asset pricing, asset management
and risk management.
To begin with, we need a compact and reasonably clear notation for dealing with
vectors of returns, both from the mathematical and the probabilistic/statistical point
of view. For this reason the second part of these handouts opens with two chapters, 6
and 7, dedicated to a quick introduction to the basic notation. You can find something
more in the appendixes of these handouts (section 12).
Most of the second part of these handouts is centered on the study of the general
linear model (chapter 9) and of its applications in Finance. I expect this topic to be new for
most of the class, hence the handouts contain a rather detailed and complete, if simple,
introduction to this topic. Another important tool introduced in the following chapters
is principal component analysis (chapter 11), in the context of linear asset pricing models.
Most of what follows is self contained, however some basic concepts and results of
Probability and Statistics are required. You can find these in appendix 13.
Among the most important of these concepts and results, to be added to those
already summarized in the first part of these handouts, I would point out: conditional
expectation and regressive dependence, point estimation, unbiasedness, efficiency.
Again, a short summary of these can be found in 13. Among the most important
points see: from 13.43 to 13.47 and from 13.91 to 13.104.
6 Matrix algebra
I suppose the Reader knows what a matrix and a vector are, the basic rules for
multiplication between matrices and between matrices and scalars, and for the sum between
matrices. I also suppose the Reader to know the meaning and basic properties of a matrix
inverse and of a quadratic form. This very short section only recalls a small number
of matrix results and presents a very useful result called "spectral decomposition" or
"eigendecomposition" theorem. Moreover we consider some differentiation rules.
In what follows I'll write sums and products without declaring matrix dimensions.
I'll always suppose the matrices to have the correct dimensions.
The inverse of a square matrix A is indicated by A−1, with A−1A = I = AA−1. A
property of the inverse is that, if A and B are invertible, then (AB)−1 = B−1A−1. Notice
that, if A and B are square invertible matrices and AB = I then, since (AB)−1 = I =
B−1A−1, by multiplying on the left by B and on the right by A we have BB−1A−1A =
BA = I.
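A quick numerical check of the product rule for inverses (a sketch; the random 3 × 3 matrices are just an illustration and are invertible with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

lhs = np.linalg.inv(A @ B)
rhs = np.linalg.inv(B) @ np.linalg.inv(A)   # (AB)^{-1} = B^{-1} A^{-1}
assert np.allclose(lhs, rhs)
```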
The rank of a matrix A (no matter if square or not), Rank(A), is the maximum
number of linearly independent rows or columns in A. Put in a different way, the
rank of a matrix is the order of the biggest (square) matrix that can be obtained from
A by deleting rows and/or columns and whose determinant is not zero. Obviously, then,
the rank of a matrix cannot be bigger than its smaller dimension.
A fundamental property of the rank of the product is this: Rank(AB) ≤ min(Rank(A), Rank(B)).
Turning to differentiation rules, for a symmetric matrix A and a non random vector q we have:
∂(x′Ax)/∂x = 2Ax

and

∂(x′q)/∂x = q

The proof of these two formulas is quite simple. We give a proof for a generic element
k of the derivative column vector:

x′Ax = Σi Σj xi xj ai,j

∂(Σi Σj xi xj ai,j)/∂xk = Σ j≠k xj ak,j + Σ i≠k xi ai,k + 2xk ak,k = Σ j≠k xj ak,j + Σ j≠k xj ak,j + 2xk ak,k = 2Ak,· x

where Ak,· means the k−th row of A and we used the fact that A is a symmetric
matrix. Moreover:

x′q = Σj xj qj

∂(x′q)/∂xk = qk
An important point to stress is that the derivative of a function with respect to a
vector always has the same dimension as the vector, so, for instance (remember that
A is symmetric):

∂(x′Ax)/∂x′ = 2x′A
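The first differentiation rule can be checked numerically by comparing 2Ax with a central-difference gradient (a sketch; the matrix, the point x and the step h are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
A = M + M.T                     # a symmetric matrix
x = rng.normal(size=4)

def f(v):
    return v @ A @ v            # the quadratic form x'Ax

# Central-difference numerical gradient, one coordinate at a time
h = 1e-6
num_grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)])

assert np.allclose(num_grad, 2 * A @ x, atol=1e-5)   # matches 2Ax
```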
A multi purpose fundamental result in matrix algebra is the so called "spectral
theorem": any symmetric matrix A can be written as

A = XΛX′ = Σj λj xj xj′   with X′X = XX′ = I and Λ diagonal

where xj is the j−th column of X and is called the j−th eigenvector of A, and the elements
λj on the diagonal of Λ are called the eigenvalues of A. These are positive, if A is pd,
and can always be arranged (rearranging also the corresponding columns of X) in non
increasing order.
Multiplying A by an eigenvector xi we get Axi = Σj λj xj (xj′ xi).
Each term in the sum is going to be equal to 0 (orthonormal xj vectors) except the i-th,
which is going to be equal to xi λi, so that xi solves the equation (A − λi I)xi = 0.
Notice that xi′xi = 1, so that the "trivial" solution xi = 0 is NOT a solution of this
problem. Hence, any feasible solution requires |A − λi I| = 0, so we see that the
λi's are the roots of the equation |A − λI| = 0 (the so called "characteristic equation"
for A). If this determinant equation is written down in full, it shall be seen that it is
a polynomial equation in the variable λ of degree equal to the rank k of A (its size,
if A is of full rank). As is well known since high school days, polynomial equations
have k real and/or complex roots (the so called “fundamental theorem of algebra”).
However, an explicit formula for finding such roots only exists (in the general case) for
k ≤ 4. On the other hand, finding the roots of this equation is so relevant a problem
in applied Mathematics that numerical algorithms for computing them exist at least
from the times of Newton.
The representation A = Σj λj xj xj′ makes obvious many classic matrix algebra
results. For instance, we know that Az = 0 may have nontrivial solutions only if A
is non invertible. In the case of a symmetric psd matrix this implies that the number
of positive eigenvalues is smaller than the size of the matrix. In this case, writing
Az = Σj λj xj xj′ z = 0 immediately shows that the solution(s) to the homogeneous
linear system must be found among those (non null) vectors z which are orthogonal to
each eigenvector xj. Such vectors, obviously, cannot exist if A is invertible23.
A last useful result is the so called "matrix inversion lemma": for matrices of suitable dimensions such that the required inverses exist,

(A + BCD)−1 = A−1 − A−1B(C−1 + DA−1B)−1DA−1

7 Random matrices and vectors
We use both random matrices and random vectors. A random matrix is simply a
matrix of random variables, the same for a random vector.
The expected value of a random matrix or vector Q is simply the matrix or vector
of the expected values of each variable in the matrix or vector and is indicated as E(Q).
For a random (column) vector z we define the variance covariance matrix, indicated
with V (z) but sometimes with Cov(z) or C(z), as:

V (z) = E((z − E(z))(z − E(z))′)
For the expected value of a matrix or a vector we have a result which generalizes
the linearity property of the scalar expected value.
23 For the lovers of formal language: k orthonormal vectors of size k (k−vectors) "span" a
k−dimensional space, in the sense that any vector in the space can be written as a linear combination
of such orthonormal vectors. For this reason, the only k−vector orthogonal to all k orthonormal
vectors (which means that the vector is not a linear combination of them) is the null vector. On
the other hand, given q < k orthonormal k−vectors, these span a q−dimensional subspace and there
exist other k − q orthonormal k−vectors which are orthogonal to the first q and span the "orthogonal
complement" of the space spanned by the q vectors. This is simply the space of all k−vectors which
cannot be written as linear combinations of the q vectors, equivalently: the space of all vectors which
are orthogonal to the q vectors. You see how a k−dimensional pd matrix implicitly defines a full
orthonormal basis for a k−dimensional space. Moreover, the knowledge of its eigenvectors allows us to
split this space into orthogonal subspaces.
Let A1 and A2 be random matrices (of any dimension, including vectors) and
B, C, D, G, F non random matrices. We have:

E(BA1C + DA2G + F) = BE(A1)C + DE(A2)G + F

where, as anticipated, we suppose that all the products and sums have meaning, that
is: the dimensions are correct and the expected values exist.
The covariance matrix has a very important property which generalizes the well
known result about the variance of a sum of random variables.
Let z be a random (column) vector H a non random matrix and L a non random
vector then:
V (Hz + L) = HV (z)H 0
Suppose for instance that z is a 2 × 1 vector and H has a single row made of ones; in
this case the above result yields the usual formula for the variance of the sum of 2
random variables.
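A minimal numerical illustration (the entries of Σ below are hypothetical):

```python
import numpy as np

Sigma = np.array([[4.0, 1.5],
                  [1.5, 9.0]])      # V(z) for a 2x1 random vector z
H = np.array([[1.0, 1.0]])          # a single row of ones: Hz = z1 + z2

var_sum = (H @ Sigma @ H.T)[0, 0]   # V(Hz) = H V(z) H'
# Usual scalar formula: Var(z1) + Var(z2) + 2 Cov(z1, z2)
assert np.isclose(var_sum, 4.0 + 9.0 + 2 * 1.5)
```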
However, there exist k − q vectors zj∗ such that zj∗′zj∗ = 1, zj∗′zj′∗ = 0 for j ≠ j′ and
zj∗′xl = 0 for any vector xl which is an eigenvector of A.
Using any of these zj∗, or any non zero scalar multiple of them, it is then possible to
build linear combinations of the random variables whose varcov matrix is A such that
these linear combinations have zero variance.
If A is a variance covariance matrix of linear returns for financial securities, this
implies that it is possible to build a portfolio24 of such securities and, possibly, the risk
free rate, such that the single securities returns are random but the overall position
return is non random.
By no arbitrage, any such position, being risk free, must yield the same risk free
rate (otherwise you can borrow at the lower rate and invest at the higher rate for a
sure, possibly unbounded, profit).
This very important property is central in asset pricing theory and, more in general,
in asset management.
(Compare this with the example regarding the constrained minimization of a quadratic
form in the Appendix).
As we shall see in the section dedicated to factor models and principal components,
it is often the case that covariance matrices of returns for large sets of securities are
approximately not of full rank, that is: it may be that all eigenvalues are non zero but
many of these are almost zero. In this case it is possible to build portfolios of risky
securities whose return is "almost" riskless. This has important applications in (hedge)
fund management and, more in general, trading and asset pricing.
7.2 Note
These are the barely essential matrix results for this course. Many more useful results
of matrix algebra exist, both in general and applied to Statistics and Econometrics.
For the interested student the Internet offers a number of useful resources.
We limit ourselves to quoting a "matrix cookbook" you could download from the
Internet: the title is "The Matrix Cookbook"25.
Examples
Exercise 6-Matrix Algebra.xls
24 Or, alternatively, a long short position.
25 The link worked when I last checked it in August 2019, but I cannot guarantee stability of the link.
8 The deFinetti, Markowitz and Roy model for asset
allocation
Consider an investor who allocates fractions w of wealth to risky securities with (random) linear returns R, expected returns µR = E(R) and covariance matrix Σ = V (R), and the remaining fraction 1 − w′1 to a risk free security with non random return rf (1 is a vector of ones). The portfolio return is:

RΠ = (1 − w′1)rf + w′R
So that the expected value and the variance of the portfolio return are

E(RΠ) = (1 − w′1)rf + w′µR

and

V (RΠ) = w′Σw

The problem for the fund manager is to choose w so that, for a given expected value
c of the portfolio return, the variance of the portfolio return is minimized. In formulas:

min w′Σw   subject to   (1 − w′1)rf + w′µR = c
Equivalently the fund manager could fix the variance and choose w such that the
expected return is maximized.
In both problems it would be sensible to use an inequality constraint. For instance,
in the first problem, we could look for

min w′Σw   subject to   (1 − w′1)rf + w′µR ≥ c
We choose the = version just for allowing direct use of Lagrange multipliers as we’ll
see in what follows.
Notice that we do not assume the sum of the elements of w to be 1. This shall
be true only if no risk free investment is made. However, obviously, if we complement
the vector w with the fraction of portfolio invested in the risk free security, the sum
of all the portfolio fractions is 1. Moreover we do not require each element of w to
be positive. This can be done, but not in the straightforward way we are going to
follow26.
In order to solve the problem we consider its Lagrangian function (notice the rearranged constraint):

L(w, λ) = w′Σw − 2λ(w′(µR − rf 1) − (c − rf))

Setting to zero the derivative with respect to w (using the differentiation rules of chapter 6) gives 2Σw − 2λ(µR − rf 1) = 0, that is:

w = λΣ−1(µR − rf 1)
At this point we already notice that the required solution is a scalar multiple (λ) of a
vector which does not depend on c. In other words the relative weights of the stocks
in the portfolio are already known and do not depend on c. What is still not known is
the relative weight of the portfolio of stocks with respect to the risk free security.
This is a first instance of a "separation theorem": the amount of expected return
we want to achieve only influences the allocation between the risk free security and
the stock portfolio but does not influence the allocation among different stocks (the
optimal risky portfolio is uniquely determined).
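The solution can be sketched numerically (all the inputs below, covariance matrix, expected returns, risk free rate and target c, are purely hypothetical):

```python
import numpy as np

# Hypothetical inputs: 3 stocks with covariance Sigma and expected returns mu
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
mu = np.array([0.06, 0.08, 0.10])
rf = 0.02                                  # risk free rate
c = 0.05                                   # target expected portfolio return
ones = np.ones(3)

# First order condition: w = lambda * Sigma^{-1}(mu - rf*1).
# The direction does not depend on c (separation theorem);
# lambda scales the risky exposure so that the constraint is met.
base = np.linalg.solve(Sigma, mu - rf * ones)
lam = (c - rf) / (base @ (mu - rf * ones))
w = lam * base

expected_return = (1 - w @ ones) * rf + w @ mu   # equals c by construction
```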
As a second comment we see that, had our objective been that of solving

max w (1 − w′1)rf + w′µR − (1/(2λ)) w′Σw

that is: had we wished to maximize some "mean variance" utility criterion, our result
would have been exactly the same. Since the (negative) weight of the variance in this
criterion is given by −1/(2λ), as a rule 1/λ is termed "risk aversion parameter".
9 Linear regression
9.1 What is a regression
In what follows we shall consider linear models as models of linear regressions.
In other words, we shall suppose that a "set" of random variables (in general a
matrix) Z (maybe only partially observable) has a probability distribution P(Z), that
we want to study the conditional expected value (aka regression function) of one vector
of Z, which we indicate with Y, conditional on a matrix of other elements of Z, which we
call X (so that, obviously, we suppose such conditional expectation to exist), and that
we suppose this conditional expectation to be expressed by E(Y |X) = Xβ, where β is
a vector of non random parameters.
This is read as a "linear" regression function (the important point is that this is
linear in the β vector).
It should be obvious that, out of a given Z, we could compute and be interested in
many conditional expectations which we could derive from P (Z): the joint probability
distribution of Z. It may also be that not all the variables in Z are involved in the
computation. As an example, we could be interested in the conditional expected value
of one single element of Z given another single element.
Why do we compute conditional expectations?
Basically because we want to make "forecasts", that is: we suppose to know the
values of some of the variables in Z and on the basis of these we want to forecast the
values of other variables in Z.
And why should we use a conditional expectation in order to make a prediction?
The main reason for this choice is that, if we evaluate the error in a forecast in a
specific way: the mean square error, then E(Y |X) is the function of X which minimizes
this measure of error.
I.e.: you cannot make a better (in MSE sense) choice than E(Y |X).
In what follows we give a brief summary of the main properties of a regression
function. More detail on this is found in appendix 13, but it is not required for the
exam.
We begin with general properties which only require the regression function to exist,
be it linear or not.
Here we suppose Y to be a column vector.
1. Suppose Y = g(X). That is: Y is a function of X. Then E(Y |X) = g(X).
2. E(E(Y |X)) = E(Y ). This is sometimes called “law of iterated expectations”.
3. E(Y − E(Y |X)) = 0. This is a corollary of the second property.
4. E((Y − E(Y |X))E(Y |X)′) = 0. In words: the covariance between forecasts
E(Y |X) and forecast errors Y − E(Y |X) is a matrix of zeroes.
Now we are ready to state (the proof is not required) the result about the optimal
property of the regression function as a forecast.
We do this in a way similar to the one we shall follow further on in order to prove
the Gauss Markov theorem.
Consider any (vector) function h(X) (the vector has the same dimension as Y ).
Suppose you want to “forecast” Y using h(X) and you want this forecast to be “the
best possible”.
The regression function is a possible candidate, as it IS a function of X (call
E(Y |X) = φ(X)), but there exist, in general, infinitely many other possibilities.
You measure the "expected forecast error" "size" with its mean square error matrix27:
E((Y − h(X))(Y − h(X))′).
We would like this to be as “small” as possible: this seems a sensible idea.
However, this is a matrix, so that we must define “small” in a non trivial way.
Here we shall look for a choice of h(X) = h∗(X) such that any other choice would
yield a MSE matrix whose difference from that corresponding to h∗(X) is (at least)
PSD, that is:

E((Y − h(X))(Y − h(X))′) = E((Y − h∗(X))(Y − h∗(X))′) + H

where H is PSD.
Theorem 9.1. For any h(X) ≠ φ(X) we have E((Y − h(X))(Y − h(X))′) = E((Y −
φ(X))(Y − φ(X))′) + H where H is (at least) PSD.
27 In general this is not the variance covariance matrix as we are not requiring E(h(X)) = E(Y ).
In this sense you cannot choose a better function of X in order to forecast Y than
φ(X) = E(Y |X).
Note: in basic courses of statistics, this theorem is stated in the common and
simpler case where Y is a scalar random variable so that the MSE matrix becomes a
simple MSE.
It is important to stress the point that the regression function is the best choice
only if "best" is measured by means of a MSE; different choices of objective functions shall
yield different results.
Verbal summary of the result: the regression function is the “best” (in this
particular sense) function of X if your aim is to forecast Y “on the basis of ”
X.
Note: If, instead of E((Y − h(X))(Y − h(X))′), we decide to minimize E((Y −
h(X))′(Y − h(X))), that is: the expected sum of squared errors of forecast, a proof
following the same steps as the above proof, just changing the position of the transpose
sign, shows that the regression function minimizes the expected sum of squared errors
of forecast. In this case the objective function is a scalar, so the term "minimizes" has
the usual sense.
All these results do not require the regression function to be linear.
We need linearity in order to state and prove the following important property.
Suppose X1 is a subset of the columns of X and suppose (linearity) that E(Y |X) =
Xβ and E(X|X1 ) = X1 G where I use an uppercase letter here (G) because X is a
matrix so that E(X|X1 ) is a matrix of regression functions.
We then have E(Y |X1 ) = E(E(Y |X)|X1 ) = E(Xβ|X1 ) = X1 Gβ.
As stated above, given a choice of Y many regressions are possible if we condition Y
to different sets of X. However, there shall be a connection between these regressions.
This simple result gives you the required connection (for the linear case).
Using this result we can compute the coefficients of the regression of Y on X1 when
we know the coefficients of the regression of Y on X and X1 is a subset of X.
A more general version of this result, the partial regression theorem, shall be dis-
cussed in what follows.
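A small simulation can illustrate the connection (a sketch; the data generating process, with E(X2|X1) = 0.5 X1, is our own hypothetical choice, so the short regression coefficient should approach β1 + 0.5 β2 = 2):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)        # E(x2 | x1) = 0.5 x1, linear
X = np.column_stack([x1, x2])
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(size=n)         # E(y | x1, x2) = x1 + 2 x2

# Long regression of y on (x1, x2) recovers beta;
# short regression of y on x1 alone recovers G beta = 1*1 + 0.5*2 = 2
b_long, *_ = np.linalg.lstsq(X, y, rcond=None)
b_short, *_ = np.linalg.lstsq(x1[:, None], y, rcond=None)
```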
Suppose the model is Y = Xβ + ε and suppose E(ε) = 0 and V (ε) = σε²In.
These hypotheses are best understood in a statistical (that is: estimation in repeated
samples) setting. Each sample we are going to draw shall be given by a realization
of Y. In each sample X (observable) and β (unobservable) shall be the same.
What makes Y "random", that is: changing in a (partially) unpredictable (for us) way
from sample to sample, is the random "innovation" or "error" vector ε which we cannot
observe.
As for the "partially" clause: under the assumed hypotheses it is clear that

E(Y ) = E(Xβ + ε) = Xβ + E(ε) = Xβ

So that the expected value of the random Y is not random (while it is unknown, as
β is unknown).
In this sense we may say that we are modeling the regression function of Y on
X. However, since the matrix X is non random, which means that the probability
of observing that particular X is one or, equivalently as observed above, that in any
sample X (and β) shall always be the same (so that Y is random just due to the effect
of the random element ε, in fact: V (Y ) = V (Xβ + ε) = V (ε) = σε²In), the conditional
expectation shall be the same as the unconditional expectation.
E(ε|X) = 0
covariance matrix and conditional second moments matrix is true only because we assumed
that the conditional expectation of ε is zero. (See by yourself what happens
otherwise).
An immediate result is that:

E(Y |X) = E(Xβ + ε|X) = Xβ

and this property fully justifies the name "linear regression" for our model.
With a stochastic X and our added hypothesis our model becomes a linear model
for a conditional expectation: a regression function.
9.5 Basic statistical properties of the OLS estimate
This estimate has been derived as a best approximation. To derive it we did assume
nothing from the statistical point of view.
However, we shall use the estimate in a setting where our observations are just a
"sample" of all possible observations; in a setting, that is, where parts of the model are
stochastic/random.
We already have two sets of hypotheses for describing such randomness: weak
OLS hypotheses with non random and with random X.
Under these hypotheses we shall now see how the OLS estimate behaves in a sta-
tistical sense.
It is easy to show that β̂OLS is unbiased for β. In fact, in the non random X case:

E(β̂OLS) = E((X′X)−1X′Y ) = (X′X)−1X′Xβ + (X′X)−1X′E(ε) = β

where in the first passage we use the fact that X is non random and in the second one
we use the hypothesis that β is non random and that E(ε) = 0.
It is also easy to compute V (β̂OLS):

V (β̂OLS) = E((X′X)−1X′εε′X(X′X)−1) − E((X′X)−1X′ε)E(ε′X(X′X)−1)

But the second term in the sum was just shown to be equal to 0, so:

V (β̂OLS) = E((X′X)−1X′εε′X(X′X)−1) = EX((X′X)−1X′Eε|X(εε′|X)X(X′X)−1)

Now the term Eε|X(εε′|X) is, by hypothesis, equal to σε²In so that:

EX((X′X)−1X′Eε|X(εε′|X)X(X′X)−1) = σε²EX((X′X)−1X′X(X′X)−1) = σε²EX((X′X)−1)

In short: with a stochastic X and the new OLS hypotheses, β̂OLS is still unbiased
but now its covariance matrix is fully unknown as it depends on the expected value of
(X′X)−1.
The results with stochastic X imply those with non stochastic X but require more
steps, hence the usefulness of teaching both, with the first as an introduction to the second.
We are now just one step short of being able to state an important result in OLS
theory.
We would like to prove a theorem of the kind: the best unbiased estimate of β is
β̂OLS.
Alas, this is actually not true in this generality. The theorem turns out to be true
if we further restrict the class of competing estimates to those linear in Y, that is: each
competing estimate must be of the form HY with H a known non random matrix.
Theorem 9.4. Under the weak OLS hypotheses β̂OLS is the Best Linear Unbiased
Estimate (BLUE) of β.
Proof. Any linear estimate of β can be written as β̂ = ((X′X)−1X′ + C)Y with an
arbitrary C. Since the estimate must be unbiased we have:
The proof begins by recalling that any pd matrix has a pd inverse. Moreover any
pd matrix A can be written as A = PP′ with P invertible. We then have Σ = PP′
and Σ−1 = (PP′)−1 = P′−1P−1.
Multiply the model Y = Xβ + ε by P−1, getting P−1Y = P−1Xβ + P−1ε.
Call now:
• Y∗ = P−1Y
• X∗ = P−1X
• ε∗ = P−1ε.
9.6.1 A note on the case of stochastic X
We did prove the Gauss Markov theorem using weak OLS hypotheses with non stochastic
X. The more general, stochastic X case can be easily dealt with using the iterated
expectation rule.
In the case of stochastic X, the theorem proof (using OLS hypotheses for the case of
stochastic X) goes, working conditional on X, exactly as before. The last statement
then becomes
This is the varcov matrix of the OLS estimate when X is stochastic plus the ex-
pected value of an at least psd matrix. If the second term is at least psd too, we have
our proof.
This is easy: we must show that, for any non stochastic vector z, we have z′E(CC′)z ≥
0; but, by the basic properties of the expected value operator, we have z′E(CC′)z =
E(z′CC′z).
We already know that CC′ is at least psd; this is equivalent to saying that, whatever z,
we have z′CC′z ≥ 0. The expected value of a non negative number cannot be negative
(internal property of the expected value) and we have our proof.
You should also notice that, in this proof, we allow C to be stochastic UNCONDI-
TIONAL on X provided it is non stochastic CONDITIONAL on X.
X′Xβ̂OLS = X′Y
X′(Xβ̂OLS − Y ) = 0
X′ε̂ = 0
This in particular implies
β̂′OLS X′ε̂ = Ŷ′ε̂ = 0
This result is independent of the OLS hypotheses and depends only on the fact
that β̂OLS minimizes the sum of squared errors.
9.8 R2
A useful consequence of this result, jointly with the assumption that the first column
of X is a column of ones, allows us to define an index of "goodness of fit" (read: how
much did I minimize the squares?) called R².
In fact:

Y′Y = (Ŷ + ε̂)′(Ŷ + ε̂) = Ŷ′Ŷ + ε̂′ε̂

where the last equality comes from the fact that Ŷ′ε̂ = 0.
Moreover, if X contains as first column a column of ones, X′ε̂ = 0 implies that the
sum, hence the arithmetic average, of the vector ε̂ is equal to zero, so that the average
of Y coincides with the average of Ŷ. Subtracting nȲ² from both sides then gives:

Y′Y − nȲ² = (Ŷ′Ŷ − nȲ²) + ε̂′ε̂

where n is the length of the vectors (number of observations). In other words, indicating
with Var(Y) the numerical variance of the vector Y (that is: the mean of the squares
minus the squared mean), we have:

Var(Y) = Var(Ŷ) + Var(ε̂)
We see that the variance of Y neatly decomposes into two non negative parts. There
is no covariance! This is totally peculiar to the use of least squares and implies the
definition of a very natural measure of "goodness of fit":

R² = Var(Ŷ)/Var(Y) = 1 − Var(ε̂)/Var(Y)
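The decomposition, and the absence of a covariance term, can be checked numerically (a sketch on simulated data with a constant in the regression):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # first column of ones
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
e_hat = y - y_hat

# Var(Y) = Var(Y_hat) + Var(e_hat): no covariance term
assert np.isclose(y.var(), y_hat.var() + e_hat.var())
R2 = y_hat.var() / y.var()
assert np.isclose(R2, 1 - e_hat.var() / y.var())
```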
Suppose that, for a given matrix X of regressors, Y is a vector of n iid random variables of variance σ², so that there
should be an R² of 0 out of the regression of Y on X. However the expected value
of the sampling variance of the elements of Y (that is: the denominator of the R²) is
σ², while we can show that the expected value of the sampling variance of the elements of
the error of fit vector ε̂ is σ²(n − k)/n, so that the expected value of the sampling variance
of Ŷ in the same regression is going to be σ²k/n (because Y = Ŷ + ε̂) and this is the
numerator of the R².
While we know that the expected value of the ratio is not the ratio of the expected
values, this could still be a good approximation, so we can say that, in the case of no
regression at all between Y and X (that is: theoretical value of R² equal to 0), the
expected value of the R² is approximated by (σ²k/n)/σ² = k/n and this number could be quite
big if you use many variables and do not have many observations.
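A small simulation illustrates the point (a sketch; n, k and the number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, reps = 100, 20, 500
r2 = []
for _ in range(reps):
    X = rng.normal(size=(n, k))
    y = rng.normal(size=n)                 # y totally unrelated to X
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    r2.append(1 - e.var() / y.var())

mean_r2 = sum(r2) / reps                   # close to k/n = 0.2, not to 0
```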
This simple fact should make us wary of using regression as an "exploratory" tool
for finding the "most relevant variables" in a wide set of potential candidates. Such an
attitude has been common, and was successfully criticized, many times in the past
and, today, is back in fashion within the "data mining" movement. "To be wary" does
not mean "to utterly avoid": taken with care such procedures and, more in general,
exploratory data analysis, may be useful.
$$E(\hat Y) = X\beta \qquad V(\hat Y) = \sigma^2 X(X'X)^{-1}X'$$
In summary: we see that Ŷ is indeed an unbiased estimate of E(Y). On the other
hand, we see that ε̂ has a non-diagonal covariance matrix even if (or, better, just
because) the vector ε is made of uncorrelated errors.
This property of the estimated residuals is, in some sense, unsatisfactory and led
some researchers to define a different estimate of the residuals (not OLS based)
with the property of being uncorrelated under OLS hypotheses. This different estimate,
which we do not discuss here, is known in the literature as BLUS residuals (where
the ending S stands for "scalar", that is: with diagonal covariance matrix).
If X is stochastic we work, as usual, conditionally on X and under weak OLS hy-
potheses with stochastic X.
And, since E(ε̂) = E(E(ε̂|X)), we have E(ε̂) = 0 (notice that, for simplicity, we
now drop the suffixes of the expected values).
A similar result holds for V(ε̂). It must be said, however, that in this case only the
conditional values, and not the marginal values, are usually of interest, as we usually
condition on the observed X.
We can do this both with non-stochastic X and with stochastic X. However, since
in practice, and always in what follows, we shall condition on X, there shall
practically be no difference.
Why this hypothesis? When we wish to test hypotheses we need to find distributions
of sample functions; for instance, we are going to need the distribution of β̂_OLS.
Up to now we know that under weak OLS hypotheses β̂_OLS has expected value
vector β and variance-covariance matrix σ²(X'X)⁻¹. Moreover, we know that β̂_OLS =
β + (X'X)⁻¹X'ε, that is: it is a linear function of ε (β and X are non-stochastic or,
if X is stochastic, we condition on it). With the added hypothesis we can then conclude
that β̂_OLS has (possibly conditional on X) a Gaussian distribution with expected value
vector β and variance-covariance matrix σ²(X'X)⁻¹.
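The linear-function identity β̂_OLS = β + (X'X)⁻¹X'ε can be verified numerically on simulated data (a minimal sketch assuming numpy; sizes, seed and "true" coefficients are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])      # "true" coefficients, arbitrary
eps = rng.standard_normal(n)
Y = X @ beta + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
# beta_hat equals beta + (X'X)^{-1} X' eps exactly (up to floating point)
```

Since this is an algebraic identity, it holds in every sample, whatever the distribution of ε.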
This may seem too expedient: OK, computations are now simple, but why Gaussian
errors? In fact it often is too expedient, and the pros and cons of the hypothesis could
be (and are) discussed at length. For the moment we shall take it as a customary usage in
the econometric world we live in, a usage to be applied with much care.
First, confidence intervals.
Under the above hypotheses we have (no proof required)

$$\frac{\hat\beta_j - \beta_j}{\sqrt{\sigma^2\{(X'X)^{-1}\}_{jj}}} \sim N(0,1)$$
If we remember the properties of determinants and inverses of diagonal matrices we see from this
formula that, in the case of a diagonal covariance matrix, this density becomes a product of k one-
dimensional Gaussian densities (one for each element of the vector z̃). So, in the Gaussian case, non-
correlation and independence are the same. In fact, if Σ is a diagonal matrix with diagonal terms σᵢ², we have

$$|\Sigma| = \prod_{i=1}^k \sigma_i^2 \qquad \Sigma^{-1} = \begin{pmatrix} \frac{1}{\sigma_1^2} & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \frac{1}{\sigma_k^2} \end{pmatrix}$$

so that

$$f(z;\mu,\Sigma) = (2\pi)^{-\frac{k}{2}} \left(\prod_{i=1}^k \frac{1}{\sigma_i}\right) \exp\left(-\frac{1}{2}\sum_{i=1}^k \frac{(z_i-\mu_i)^2}{\sigma_i^2}\right) = \prod_{i=1}^k (2\pi\sigma_i^2)^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\frac{(z_i-\mu_i)^2}{\sigma_i^2}\right) = \prod_{i=1}^k f(z_i;\mu_i,\sigma_i^2)$$

In words: with a diagonal covariance matrix the joint density is the product of the marginal densities,
which is the definition of independence.
An important property of a k-dimensional Gaussian distribution is that, if A and B are non-stochastic
(a vector and a matrix of dimensions such that A + Bz̃ is meaningful), then the distribution of A + Bz̃ is Gaussian
with expected value vector A + Bμ and variance-covariance matrix BΣB'. Linear transforms of a Gaussian
random vector are Gaussian random vectors. This, for instance, implies that, if z̃ is a Gaussian
random vector, then each z̃ᵢ is Gaussian, as we stated a moment ago in the proof of equivalence
between non-correlation and independence for the Gaussian distribution. This is easy to see: just
write z̃ᵢ = 1ᵢ'z̃, where 1ᵢ' is a k-dimensional row vector with null elements except a 1 in the i-th
place, and apply the linearity property.
where the distribution is to be understood as conditional on X, if X is stochastic.
We then have that

$$P\left(\beta_j \in \left[\hat\beta_j \pm z_{1-\frac{\alpha}{2}}\sqrt{\sigma^2\{(X'X)^{-1}\}_{jj}}\right]\right) = 1-\alpha$$

Hence

$$\left[\hat\beta_j \pm z_{1-\frac{\alpha}{2}}\sqrt{\sigma^2\{(X'X)^{-1}\}_{jj}}\right]$$

is a 1 − α confidence interval for βj on the basis of its estimate β̂j, where z_{1−α/2} is
the usual 1 − α/2 quantile of the standard Gaussian distribution.
In the case of unknown σ², we can estimate it with

$$\hat\sigma^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-k}$$

and, again without proof, we have that

$$\left[\hat\beta_j \pm {}_{n-k}t_{1-\frac{\alpha}{2}}\sqrt{\hat\sigma^2\{(X'X)^{-1}\}_{jj}}\right]$$

is a 1 − α confidence interval for βj on the basis of its estimate β̂j, where {}_{n−k}t_{1−α/2}
is the 1 − α/2 quantile of the T distribution with n − k degrees of freedom (n is the
number of rows and k the number of columns of X).
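These t-based intervals can be computed directly; a sketch, assuming numpy and scipy are available (data simulated for illustration; all names are ours, not from the text):

```python
import numpy as np
from scipy import stats

# simulated data; sizes and "true" coefficients are arbitrary choices
rng = np.random.default_rng(2)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
Y = X @ np.array([0.5, 1.5]) + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k)              # unbiased estimate of sigma^2

alpha = 0.05
t_quant = stats.t.ppf(1 - alpha / 2, df=n - k)    # (n-k) t_{1-alpha/2}
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
lower = beta_hat - t_quant * se
upper = beta_hat + t_quant * se                   # [lower_j, upper_j]: 1-alpha CI for beta_j
```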
Second, testing statistical hypotheses.
We suppose everybody knows what a Statistical Hypothesis is (see Appendix in
case). We now define a “linear” statistical hypothesis.
A linear hypothesis on β can be written as Rβ = c, where R is a matrix of known
constants and c a vector of known constants. For the purpose of this summary we shall
concentrate on two particular R: a 1 × k vector R where only the j-th element is 1 and
the others are zeros, and a (k − 1) × k matrix where the first column is of zeros and the
remaining (k − 1) × (k − 1) square matrix is an identity. In both cases c is made of
zeros (in the first case a single 0 and in the second a k − 1 vector of zeros).
The first kind of hypothesis is simply that the j-th β is zero (while all other pa-
rameters are free to have any value); the second kind of hypothesis is simply that all
parameters are jointly zero (with the possible exception of the intercept). For (non-
trivial) historical reasons these hypotheses are considered so frequently relevant that
any program for OLS regression tests them. Whether these hypotheses are of interest
to you is for you to evaluate.
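The two particular R matrices just described can be written out explicitly; a sketch assuming numpy, with k = 4 and j = 2 chosen purely for illustration:

```python
import numpy as np

k = 4   # number of columns of X (intercept included); illustrative value
j = 2   # index of the coefficient tested in the first kind of hypothesis

# first kind: H0: beta_j = 0 -> R is 1 x k with a single 1 in position j, c = (0)
R1 = np.zeros((1, k))
R1[0, j] = 1.0
c1 = np.zeros(1)

# second kind: H0: all betas except the intercept are jointly 0
# -> a first column of zeros, then a (k-1) x (k-1) identity; c is a k-1 vector of zeros
R2 = np.hstack([np.zeros((k - 1, 1)), np.eye(k - 1)])
c2 = np.zeros(k - 1)
```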
I shall not detail the procedure for the test of the hypothesis that all parameters,
except, possibly, the intercept are jointly equal to zero. I only mention the fact that
the result of this test is displayed, usually, in any OLS program output. The name of
this test is F test.
A little more detail on the univariate test.
The standard test for the hypothesis H0: βj = 0 against H1: βj ≠ 0 (complete by
yourself the specification of the hypotheses) requires the distribution of β̂_OLS, and this, as we wrote above,
requires a strengthening of the OLS hypotheses which takes the form of the assumption that
ε is distributed according to an n-variate Gaussian: ε ∼ Nn(0, σ²In).
We do not discuss here the reasons for and against this hypothesis.
Under this hypothesis, as seen above, we can show that β̂_OLS ∼ Nk(β, σ²(X'X)⁻¹)
(this may be conditional on X if this is stochastic). Hence the ratio

$$\frac{\hat\beta_j - \beta_j}{\sqrt{\sigma^2\{(X'X)^{-1}\}_{jj}}}$$

(I drop the subscript OLS from the estimate in order to avoid double subscript prob-
lems) is distributed according to a standard Gaussian (i.e. N1(0, 1)).
Suppose now we set βj = 0 in the above ratio. In this case the distribution of the
ratio shall be a standard Gaussian only if H0 : βj = 0 is true. This allows us to define
a reasonable rejection region for our test.
Reject H0: βj = 0 with a size of the error of the first kind equal to α iff:

$$\frac{\hat\beta_j}{\sqrt{\sigma^2\{(X'X)^{-1}\}_{jj}}} \notin \left[-z_{1-\frac{\alpha}{2}};\; +z_{1-\frac{\alpha}{2}}\right]$$

or, when σ² is unknown and estimated with σ̂², iff:

$$\frac{\hat\beta_j}{\sqrt{\hat\sigma^2\{(X'X)^{-1}\}_{jj}}} \notin \left[-{}_{n-k}t_{1-\frac{\alpha}{2}};\; +{}_{n-k}t_{1-\frac{\alpha}{2}}\right]$$

where {}_{n−k}t_{1−α/2} is the quantile in a T distribution with n − k degrees of freedom which
leaves on its left a probability equal to 1 − α/2.
The use of the T distribution is the reason for the name given to this test: the T
test.
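A sketch of the T test in practice (numpy and scipy assumed; the data are simulated for illustration): compute the T-ratio with estimated σ² and compare it with the t quantile.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
Y = X @ np.array([1.0, 0.8]) + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
resid = Y - X @ b
s2 = resid @ resid / (n - k)                  # estimate of sigma^2
t_ratio = b[1] / np.sqrt(s2 * XtX_inv[1, 1])  # T-ratio for beta_1 under H0: beta_1 = 0

alpha = 0.05
crit = stats.t.ppf(1 - alpha / 2, df=n - k)   # (n-k) t_{1-alpha/2}
reject = abs(t_ratio) > crit                  # reject H0 iff the ratio is outside [-crit, +crit]
```

Note that `crit` is slightly larger than the Gaussian quantile z_{0.975} ≈ 1.96, as the t distribution has fatter tails for finite n − k.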
As in the case of the T test there exist many different hypotheses which can be
tested using an F test, but the standard hypothesis tested by the universally reported
F test is:
H0: all the betas corresponding to non-constant regressors are jointly equal to 0
H1: at least one of the above mentioned betas is not 0
The idea is that, if the null is accepted (i.e. big P-value), no "regression" exists
(provided we did not make an error of the second kind, obviously). Hence the popularity
of the test.
A last detail on the stochastic X case. As we saw here, working with non-stochastic X
or conditionally on X yields the same results.
We should stress, however, that many properties also hold unconditionally on X.
For instance, since

$$P\left(\beta_j \in \left[\hat\beta_j \pm z_{1-\frac{\alpha}{2}}\sqrt{\sigma^2\{(X'X)^{-1}\}_{jj}}\right]\right) = 1-\alpha$$

holds conditionally on any given X, by taking the expected value over X the same
coverage probability holds unconditionally.
9.11 “Forecasts”
Here we intend the term “forecast” in a very restricted meaning.
Suppose you estimated β using a sample of Y and X; let us say that the sample is
of n observations (rows).
Now suppose a new set of m rows of X is given to you and you are asked to assess
what the corresponding Y could be.
Stated like this, the question does not allow for an answer. We need to assume some
connection between the old rows and the new rows of data. A possibility is as follows.
Let the model for the n rows of data used to estimate β with β̂_OLS be

$$Y = X\beta + \varepsilon$$

Suppose we now have data for m more rows for the variables in X, and call these
Xf. Let the model for the corresponding new "potential" observations be

$$Y_f = X_f\beta + \varepsilon_f$$

and suppose (we consider here the general case where X can be stochastic) that
E(ε|X, Xf) = 0 = E(εf|X, Xf), V(ε|X, Xf) = σ²In, V(εf|X, Xf) = σ²Im and E(εεf'|X, Xf) = 0.
(Notice the double conditioning to both X and Xf )
In this case the obvious (BLUE) estimate for E(Yf|X, Xf) is Ŷf = Xf β̂, with ex-
pected value Xf β and variance-covariance matrix σ²Xf(X'X)⁻¹Xf'.
If we define the "point forecast error" as Yf − Ŷf, its expected value shall
be 0 and its variance-covariance matrix σ²(Im + Xf(X'X)⁻¹Xf').
Be very careful not to mistake these formulas for the corresponding ones for Ŷ.
On the basis of these formulas, and working either under strong OLS hypotheses or using
Chebyshev, it is possible to derive (exact or approximate) confidence intervals for the
estimate of the expected value of each element in the new set of observations and for
the corresponding point forecast errors.
For instance, under the Gaussian hypothesis, the two-tails 1 − α confidence interval
for the expected value of a single observation in the forecasting sample, under the
hypothesis of a known error variance and corresponding to a row of values of Xf equal to
xf, is given by:

$$\left[x_f\hat\beta_{OLS} \pm z_{(1-\alpha/2)}\,\sigma\sqrt{x_f(X'X)^{-1}x_f'}\right]$$

The corresponding confidence interval for the point estimate, that is, the forecast
interval which takes into account the point forecast error, is:

$$\left[x_f\hat\beta_{OLS} \pm z_{(1-\alpha/2)}\,\sigma\sqrt{1 + x_f(X'X)^{-1}x_f'}\right]$$
In case the error variance is not known and is estimated with the unbiased
estimate

$$\hat\sigma^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-k}$$

as described above, the only changes to be made in the formulas are: σ is substituted
with its estimate σ̂ = √σ̂², and z_(1−α/2) is substituted with {}_{n−k}t_{(1−α/2)}, that is: the
(1 − α/2) quantile of a T distribution whose degrees-of-freedom parameter is the
difference between n and k: the number of rows and columns of X, the regressors
matrix in the estimation sample.
It is easy to see that the second interval shall always be wider than the first, as it
takes into account not only the sampling uncertainty in estimating β but also the
uncertainty added by εf.
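The two intervals can be computed side by side; a sketch under the known-σ hypothesis (numpy and the standard library assumed; the new row `x_f` and all constants are illustrative):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(4)
n = 80
sigma = 1.0                                        # error sd, assumed known here
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
Y = X @ np.array([2.0, 1.0]) + sigma * rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
x_f = np.array([1.0, 0.5])                         # one new (hypothetical) row of regressors
z = NormalDist().inv_cdf(0.975)                    # z_{1-alpha/2} with alpha = 0.05
h = x_f @ XtX_inv @ x_f                            # x_f (X'X)^{-1} x_f'
# interval for the expected value of the new observation
ci_mean = (x_f @ b - z * sigma * np.sqrt(h), x_f @ b + z * sigma * np.sqrt(h))
# forecast interval: adds the variance of the new error eps_f
ci_forecast = (x_f @ b - z * sigma * np.sqrt(1 + h), x_f @ b + z * sigma * np.sqrt(1 + h))
```

As stated in the text, the forecast interval always contains the interval for the mean, since √(1 + h) > √h.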
• Find a critical region for which the maximum size of error of the first kind is α
and, possibly, with a sensible size of error of the second kind.
• Reject H0 if your sample falls in the critical region, otherwise do not reject H0
(“do not reject” is more precise than the term “accept”).
We need now to develop tools which allow us to really understand, and correctly
read, the results of a linear model.
The two main (and connected) tools for this purpose are the partial regression
theorem and the definition of the semi partial R².
There exist two versions of the partial regression theorem. They are very similar be-
cause the proof is based on the strong mathematical similarity between two completely
different objects: frequencies and probabilities.
We first prove the “frequency based” version, that is: the partial regression theorem
valid for OLS estimates.
The second version has to do with "theoretical" regression functions, that is, with
probability, and can be seen as a direct application of the law of iterated expectations
to a linear regression.
While quite obvious in terms of proof, the partial regression theorem tells us some-
thing which, maybe, is a priori unexpected: any given coefficient in a linear regression
is NOT a derivative with respect to the corresponding variable, in the common sense
of the term.
In fact, what a coefficient in a linear regression really is, is something of much more
interest and to understand this is fundamental in order to correctly interpret the result
of a regression.
Theorem 9.5. The estimate of any given linear regression coefficient βj in the model
E(Y |X) = Xβ can be computed in two different ways yielding exactly the same result:
1) by regressing Y on all the columns of X,
2) by first regressing the j−th column of X on all the other columns of X, computing
the residuals of this regression and then by regressing Y on these residuals.
Proof. Write the model as Y = Xjβj + X−jβ−j + ε, where you isolate the j-th column
of X in Xj and put the rest in X−j. To make things simple, suppose the intercept is in
X−j.
You estimate it with OLS and get: Y = Xjβ̂j + X−jβ̂−j + ε̂.
Now write the auxiliary regression Xj = X−jγj + uj and estimate it with OLS to get
Xj = X−jγ̂j + ûj.
Substitute this in the original OLS estimated model:

$$Y = (X_{-j}\hat\gamma_j + \hat u_j)\hat\beta_j + X_{-j}\hat\beta_{-j} + \hat\varepsilon = \hat u_j\hat\beta_j + X_{-j}(\hat\gamma_j\hat\beta_j + \hat\beta_{-j}) + \hat\varepsilon$$
Since ûj is orthogonal to X−j, it follows that $\hat\beta_j = \sum_i Y_i \hat u_{ij} / \sum_i \hat u_{ij}^2$, which, since the mean of ûj is equal to 0 (X−j contains
the intercept), is identical to the OLS estimate in a regression of Y on ûj alone.
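This theorem (often called the Frisch–Waugh–Lovell theorem in the literature) can be checked numerically; a sketch, assuming numpy (data and column index are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
Y = X @ np.array([1.0, 2.0, 3.0]) + rng.standard_normal(n)

# 1) full regression of Y on all the columns of X
beta_full = np.linalg.lstsq(X, Y, rcond=None)[0]

# 2) regress column j on the other columns (intercept included), take the
#    residuals, then regress Y on those residuals alone
j = 2
X_minus = np.delete(X, j, axis=1)
gamma = np.linalg.lstsq(X_minus, X[:, j], rcond=None)[0]
u = X[:, j] - X_minus @ gamma          # residuals of the auxiliary regression
beta_j = (Y @ u) / (u @ u)             # regression of Y on u alone

# the two coefficients coincide (an exact algebraic identity)
```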
A similar result is directly valid for the regression function (if we suppose all re-
gressions to be linear, otherwise a similar but more general result is valid). The result
is valid without considering estimates, but directly properties of theoretical linear re-
gression functions. In this case the statement of the theorem becomes:
Theorem 9.6. Any given linear regression coefficient βj in the linear regression E(Y|X) = Xβ
is identical to the coefficient of the regression of Y on Xj − E(Xj|X−j), if we suppose
E(Xj|X−j) = X−jγ_{Xj|X−j}, that is: linear in X−j.
Proof. The proof mimics the proof based on estimates of βj and goes as follows:

$$E(Y|X_{-j}) = E_{X_j|X_{-j}}(E(Y|X)) = X_{-j}\beta_{-j} + E(X_j|X_{-j})\beta_j = X_{-j}\beta_{-j} + X_{-j}\gamma_{X_j|X_{-j}}\beta_j = X_{-j}(\beta_{-j} + \gamma_{X_j|X_{-j}}\beta_j)$$

Now write E(Y|X) as X−jβ−j + Xjβj and add and subtract X−jγ_{Xj|X−j}βj (that is,
E(Xj|X−j)βj):

$$E(Y|X) = X_{-j}\beta_{-j} + X_j\beta_j = X_{-j}\beta_{-j} + (X_j - X_{-j}\gamma_{X_j|X_{-j}})\beta_j + X_{-j}\gamma_{X_j|X_{-j}}\beta_j$$

By the basic properties of regression, (Xj − X−jγ_{Xj|X−j}) (the forecasting error of Xj
forecasted with X−j) is orthogonal to X−j, so that a linear regression of Y on both
(Xj − X−jγ_{Xj|X−j}) and X−j, or on each of them separately, shall give the same coefficients.
We have, then:

$$E_{X_j - X_{-j}\gamma_{X_j|X_{-j}}}(E(Y|X)) = E(Y\,|\,X_j - X_{-j}\gamma_{X_j|X_{-j}}) = (X_j - X_{-j}\gamma_{X_j|X_{-j}})\beta_j$$
In words: βj is both the regression coefficient of the column Xj in the full
regression AND the regression coefficient of (Xj − X−jγ_{Xj|X−j}) in the regression of Y
only on (Xj − X−jγ_{Xj|X−j}) (that is: on the residuals of the regression of Xj on X−j).
You should notice that, in this proof, regressions are required to be linear, while the
proof concerning estimates only requires that the estimates come from the use of
OLS in linear models.
Notice, moreover, that the first proof is based on the algebraic properties of OLS
estimates: weak or strong OLS hypotheses are not required. In practice, the only
property used in the proof is that of orthogonality (with intercept included) which
directly comes from OLS.
This result, in both versions, is relevant from the point of view of interpreting the
meaning of a βj or of its estimate.
In fact, the result implies that each βj is not connected with some "relationship"
between the j-th column of X and Y, but only with the relationship between Y and the
part of the j-th column of X which is (linearly) regressively independent of the other
columns of X.
In other words: the linear regression model does not measure, in any sense, the
“effect” of a given column of X.
Whatever the definition of such “effect”, this has only to do with the part of this
column variance which is uncorrelated with the other columns.
As a consequence: the meaning of a regression coefficient for the same variable
depends on which other variables are in the regression and both the coefficient and the
meaning change if we change the other variables in the regression.
This is completely natural, as in any regression I make a forecast conditional to a
different set of information.
We used the term: “effect”. While a regression may have a causal interpretation,
this is by no means necessary or even common. It is then important, when we speak
of “effect” to avoid the impression of speaking in causal terms.
We shall then define and measure the “effect of a variable” in a regression for what
it is and for what it is implied to be in the partial regression model.
For us, this is just the marginal “effect” or “contribution” of a column of X in
reducing the mean square error or, equivalently, improving the forecast performance,
when the other columns are accounted for in the sense of the above theorem.
This “effect” is to be better understood in “informational” term as the ability to
improve the quality of a fit adding information to a given information set.
If the inferential extension from fit to forecast is justified (see above for the hypotheses
which justify a forecast), the same holds in terms of quality of forecast.
When the intercept is in the model, this is measured by the increase of R2 you get
if you add the Xj column to the model, or, equivalently, by how much the R2 decreases
if you drop such column from the model.
This quantitative measure, specific to Xj as used in a regression with a GIVEN set
of other variables X−j, is called the "semi partial R²".
It is quite frequent, at least in the social sciences milieu, that even the
overall R² is not reported.
We have a way out of this which can almost always (in simple OLS setting) be
implemented.
A “folk” and simple result of OLS algebra (we give it here without proof, but see
further on for a proof, not required for the exam, in a footnote) allows us to determine
the marginal contribution of each column in X to the R2 even if we only know the
T −ratios for the single parameters and the size n of the data set.
Lemma 9.7. Suppose we are using OLS and we drop the column Xj from the matrix
X. The decrease of the overall R² corresponding to the dropped column (call this
R² − R²₋ⱼ) is equal to tⱼ²(1 − R²)/(n − k), where tⱼ² is the square of the T-ratio for
the variable in question, n is the number of observations and k is the number of columns in
the full regressors matrix. This quantity is called the "semi partial R²" for Xj and is nothing but
the R² of the regression of Y on the residuals of the partial regression of Xj on X−j.
Here the T −ratio is assumed to be computed with the standard formula we gave
in the section about OLS.
An interesting point in this result is that it allows us to “recycle” a quantity which
we considered as just a measure of statistical reliability, as a useful way for reading
the R2 . This is just an algebraic result, that is: it is valid in any sample and does
not require either weak or strong OLS hypotheses to be valid. It is just a numerical
identity.
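Because the lemma is a numerical identity, it can be verified exactly on any simulated sample; a sketch assuming numpy (data, sizes and column index are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 120, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
Y = X @ np.array([1.0, 0.5, -0.3, 0.8]) + rng.standard_normal(n)

def r2(X, Y):
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ b
    return 1 - (e @ e) / ((Y - Y.mean()) ** 2).sum()

j = 3
R2_full = r2(X, Y)
R2_drop = r2(np.delete(X, j, axis=1), Y)   # R^2 without column j

# T-ratio of beta_j in the full regression
b = np.linalg.lstsq(X, Y, rcond=None)[0]
e = Y - X @ b
s2 = e @ e / (n - k)
t_j = b[j] / np.sqrt(s2 * np.linalg.inv(X.T @ X)[j, j])

# R2_full - R2_drop equals t_j^2 (1 - R2_full) / (n - k) exactly
```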
Beyond allowing us to compute the semi partial R², this result has other interesting
implications. As we stressed before, with a big sample size it is very difficult for
a T-ratio not to be "statistically significant", as even a very small number can be
distinguished from zero if the sampling variance is small enough (the denominator of
the T-ratio roughly scales as one over the square root of n − k). However, when n − k is big, it
is quite possible that the estimate is "significant" while, at the same time, the
relevance of the variable is totally negligible, as the added contribution of the
variable to the explanation of the variance of Y could be negligible.
Suppose you have, say, n = 10000 (not an uncommon size for a sample in social
sciences and in Finance). Suppose you have 10 columns in the X matrix and the
T-ratio for a given explanatory variable is of the order of 4, so that the P-value shall
be well below .01: "very" statistically significant! True, but the above lemma tells
us that the contribution of this variable to the overall R² is at most (that is: even
for an overall R² very near to 0) approximately 16/10000, that is: less than two
thousandths. Hardly relevant from any practical point of view! If I drop the variable, the
overall explanatory power of the model drops by far less than 1%.
Another way to see the same point is this: how big should the T-ratio be, under the
previous hypotheses, so that the marginal contribution of the regressor to the overall
R² is, say, 10%? From the above formula we have that the T-ratio should be of the
order of 32 (the square root of 1000 is about 31.62).
This makes even clearer the fact that, in general, "statistical significance" (that
is: the estimate is precise enough that we are able to distinguish it from 0) and
"relevance" (here measured by the contribution to the forecasting power of the model
as measured by R²) are very different concepts.
It is now important to understand how semi partial R2 works across different
columns of X.
If we compute this quantity for each column of X, we measure the marginal contri-
bution of each of these columns in "explaining" (a frequently used but not so correct term)
the variance of Y. As stated above, here "marginal" means: how much the introduction
of the variable improves the R² when the other variables are already all in the model.
This means that, each time we compute this quantity for a different column Xj, the
"other variables" left in X−j are different. For this reason, while we can split the overall
R² into a part due to the introduction of a new variable and a part already "explained"
by the other variables, we cannot add in any meaningful way the semi partial R² of
different variables, except in the case when all the columns of X are uncorrelated (not
very interesting in our field).
Summary up to this point: a regression is a forecast which minimizes the mean
square error (if it is linear with intercept, it minimizes the variance of the error). This is
the purpose of a regression, and it should be evaluated insofar as the quality of the forecast
is sufficient for our purpose (the specific purpose is going to enter the evaluation).
While in a forecasting setting there is no big role for speaking of "the effect" of this
or that column of X, it is possible (partial regression theorem) to define the marginal
contribution of each column of X to the overall R². The quantitative measure of this
contribution is given by the semi partial R².
We now turn to the reading of the results of a linear regression.
In this introduction we summarize the topic in a few points.
First: regression=forecast.
A regression function is, as we stated at the beginning of this chapter, a tool for
forecasting Y on the basis of X. The particular forecasting tool called “regression
function” is just the conditional expectation E(Y |X).
Linearity, that is E(Y |X) = Xβ is an hypothesis we add to our analysis.
As with any hypothesis this restricts the set of possible regression function models,
and this reduces generality, but makes easier the use of the restricted set of models.
In these handouts we do not consider the important topic of the “model functional
form” choice.
Suppose we know E(Y |X) that is, for instance in the linear case, we know β.
In this case the use of this function is straightforward: if you think the model applies
to a phenomenon where you know the values of the random variables in X, but not
those of the random variable Y , and you want to forecast Y , then use E(Y |X) = Xβ.
Obviously, you can “forecast” Y in a circumstance where you already know its value,
too. This may seem strange but, sometimes, it can be useful.
Since the term “forecasting” may (wrongly) be understood as implying that, in
some sense, X is “before” Y or even X causes Y , it is important to stress that this is
sometimes true but not required in any way.
It is frequently the case that Y could be something which happens before X but
you cannot observe Y while you observe X and use X to “forecast” Y .
Moreover, it may be that Y “causes” X (in some sense) but we can observe X and
not Y so we try to forecast Y on the basis of X. This is what doctors and detectives
and scientists do all the time: you have symptoms, clues, observations (X) which are
“determined” by “causes” which are not directly observable or not easily observable:
an illness, a culpable, a particular theory (Y ), and what you try to do is to “forecast”
these.
It may also be that there is no “causal” connection between Y and X but that both
of these are somewhat “determined” by some unobserved Z. In this case the knowledge
of X may help in forecasting Y .
In short, it is all a matter of information. You put in X the information you have
and use this for forecasting Y, and you could switch one for the other if the information
ordering changes.
As a consequence, the overall quality of the forecast is measured by the size of R²,
and the "importance" (for the quality of the forecast) of each element in X is measured
by its marginal contribution to the overall R². This is, simply, the difference between
the R² of a model where the full X is used and that of a model where the element of interest
is dropped (linearity here is important).
While there may be other measures of relevance, connected with the specific problem
under analysis, this is the only measure connected with the general nature of the
regression function. Hence, this is the only purely statistical measure, in that it is
valid for any regression application in any field.
This is not to say that, in specific fields, due to the role and meaning of the variables
in the model, we cannot use other measures of relevance.
For instance: in many settings, it could be that variables contributing just a little
to the overall R² are judged more "relevant" than others whose contribution is higher.
This may happen, e.g., when you can somewhat act (without changing the regres-
sion, see above) on the smaller-contribution variables but not on the high-contribution
variables.
For instance, in many sports, training is "relevant" for performance, but typically
less so than, say, age, stature and other biological characteristics and qualities you
cannot act on. The contribution of training to the overall R² could be small, but you
can choose ("act on") your training amount and method, not your age or stature.
In what follows we shall see how to estimate both R2 and the marginal contribution
to R2 of each element of X.
The regression function estimated from observational data could be completely different from the regression function relevant when we forecast after an
intervention.
In medicine this is simply stated by the fact that, while symptoms (X) are useful
to diagnose (“forecast”) an illness (Y ), a cure of the symptoms (changing X) is very
infrequently a cure of the illness (Y ).
More simply: if you see out of your window, leaves and branches shaking, probably
there is a strong wind, but if you shake leaves and branches you cannot instigate a
strong wind.
In both cases, to forecast Y on the basis of X under simple observation of X or
under action on X would imply the use of completely different regression functions.
A simple, if paradoxical, example is this: what you see in the tachometer of your
car is a very good estimate of the true instantaneous speed of the car (on a straight
road with no slippage). This means that, if Y is the true speed and X the tachometer
value, and you observe X during your trip, E(Y |X) shall be a good “forecast” of Y on
the basis of X. But now, paradoxically, suppose your car still has an analog tachometer
with a hand, and that you “act” on X by moving the hand with your finger. Most likely
this shall break the tachometer but, surely, the use of the same E(Y |X) which worked
very well if applied to observation without intervention, shall be now unwarranted.
Obviously, if you want to change your speed, you may act on the gas pedal, or on the
brake, or on the gear.
To be able to distinguish what we can get from observation and how this may be
connected with action, is a very important topic in any field that is both observational
and, at least sometimes, allows intervention.
It is the case of Economics, Medicine, Biology, Physics.
It is NOT the case in Cosmology, most of Demography and other purely observa-
tional sciences.
In Economics, under the names of “intervention analysis” or (worse) “causality” this
topic generated lots of interest and research.
In some fields a regression function useful for intervention analysis cannot be evalu-
ated using observational data but it can be evaluated by specific intervention procedures
called “experiments”.
This is possible if all or most relevant variables (variables which we would put in
X) can be chosen, observed or controlled by the researchers.
In fields like Economics, where some variables can be acted on but others are deter-
mined by the "economic system" which, for the most part, is unobservable, the study
of a regression function relevant for intervention is very difficult.
A typical consequence of forgetting this important point can be seen in the naive
reading of (x1 − x2)β as "how much E(Y|X) changes if x2 becomes x1".
This is not incorrect, if we intend it as a simple comparison between potential forecasts
given possible values of X. It is in general completely wrong if, using the same
regression, we intend it as a comparison of forecasts when we "act" and change x2 into
x1. This is also true in reverse.
A point by point summary on how to read a regression and a list of suggested
readings close the chapter.
The suggested readings are "suggested" and are NOT IN ANY WAY required for
the exam, while they could be useful for those interested in some more information on linear
models AFTER the exam.
Why is this discussion of optimal forecasts relevant for understanding the results
of a linear model?
Since a regression is a way to make forecasts by minimizing a measure of error,
the first, and always valid, "reading" of a regression must be, first of all, based on
summarizing "how good" this minimization was.
In the context of linear regression this implies a first, simple question: "how big"
is the R²?
The second question, usually, is: “with this R2 , is the regression relevant?” The
term “relevant” creates many problems as, clearly, the answer shall depend on the
specific context. For this reason, while researchers do propose, e.g., reference values for
R2 under which a regression should be considered irrelevant, we suggest here a more
cautious path which tries to merge an “absolute” evaluation of the R2 with a more
specific connection to the specific practical context of the analysis.
We shall discuss this point further on, with examples.
However we cannot and do not stop here.
It is almost always the case that we look for some further “decomposition” of the
“explained variance” in terms of each single “explanatory variable”.
We want to define a variable by variable measure of relevance.
We shall be able to correctly understand a regression, if we shall be able to precisely
set the bounds under which this question has a meaning and if we shall be able to answer
to this question from within these bounds.
In the section about the partial regression theorem we already defined such a measure: the semi partial R2. In what follows we shall analyze some more properties of this measure:

   (semi partial R2)_j = t_j^2 (1 − R2)/(n − k) = Var(Xj|X−j) β̂_j^2 / Var(Y)

where the first term is the definition of the semi partial R2 for the j-th column of X, the first equality was presented in the partial regression theorem section and the second equality is the new result we wish to discuss now.
Let us take the square root of the second and third terms:

   √( t_j^2 (1 − R2)/(n − k) ) = √Var(Xj|X−j) |β̂_j| / √Var(Y)
This is an interesting formula, from the point of view of a variable by variable
interpretation of the quality of a regression.
The square root of the semi partial R2 for Xj is, in units of the standard deviation of the data on Y (the denominator of the rhs), the “change” in the conditional expectation of Y given by a “reasonable” change in Xj. Reasonable here means equal to the CONDITIONAL standard deviation of Xj GIVEN the values of the other columns in X, that is an amount of √Var(Xj|X−j). This measures how much, on average, Xj may change (in a standard deviation sense) if the rest of the columns of X are kept constant. When we multiply this by |β̂_j| we translate it into a contribution to the standard deviation of Y, and if we divide by the standard deviation of Y we have a contribution in units of this standard deviation.
The idea of a first assessment of the “relevance” of a variable in a regression by computing “how much the conditional expected value of Y changes if we change Xj by one unit of standard deviation” is quite common in papers where linear regressions are used.
Our analysis gives us a way to correctly perform such analysis and shows that, when
correct, the result is just a simple transform of the semi partial R2 .
We can summarize this in several, equivalent, ways:
1. The square root of the semi partial R2 for a given column Xj is the ratio between the fraction of the standard deviation of Xj that is not correlated with the other columns of X and the standard deviation of Y, times the absolute value of β̂_j.
2. The absolute value of β̂_j is the ratio between the fraction of the standard deviation of Y “explained” by Xj alone (the square root of the semi partial R2 times the standard deviation of Y) and the amount of standard deviation of Xj that is uncorrelated with the other columns of X.
3. The square of β̂_j is the ratio between what “is to be explained” (the variance of Y) and what is “left in Xj to explain Y” (the conditional variance of Xj), times the fraction of the variance of Y “explained” by Xj given the other columns of X (i.e. the semi partial R2).
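As a numerical sanity check, the identity between the t-ratio form and the conditional-variance form of the semi partial R2 can be verified on simulated data. This is a minimal sketch, assuming only numpy; all variable names and the simulated numbers are ours:

```python
import numpy as np

# Simulated data: n observations, k = 4 columns of X (intercept included)
rng = np.random.default_rng(0)
n, k = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
X[:, 2] += 0.5 * X[:, 1]                       # make regressors correlated
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=2.0, size=n)

# OLS quantities
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)                   # sigma^2 estimate
tss = np.sum((y - y.mean()) ** 2)
R2 = 1.0 - resid @ resid / tss
t = beta_hat / np.sqrt(s2 * np.diag(XtX_inv))  # t-ratios

j = 2                                          # look at one column of X
lhs = t[j] ** 2 * (1.0 - R2) / (n - k)         # t-ratio form

# Var(Xj | X_-j): residual variance of Xj regressed on the other columns
X_mj = np.delete(X, j, axis=1)
g = np.linalg.lstsq(X_mj, X[:, j], rcond=None)[0]
var_cond = np.sum((X[:, j] - X_mj @ g) ** 2) / n
rhs = var_cond * beta_hat[j] ** 2 / (tss / n)  # conditional-variance form

print(lhs, rhs)                                # the two coincide
```

Both expressions give the same number: the semi partial R2 of column j.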
9.14.4 Forecasting vs intervention and regression coefficients interpretation
A forecast is only about information, it is about using what we know in order to say
something about what we do not know.
Sometimes, but this is just a particular case, the “information linkage” between
variables may have something to do with some causal (in any intuitive sense) connection
between variables: “if I know the cause I should know something about the effect”.
However, it is also true that if I know the “effect” I can say something about the “cause” (just recall, again, how an MD or a police detective works, deriving from symptoms/clues hypotheses on the medical condition/culprit).
In conditional expectations terms, forecasting has nothing to do with the “direction”
of such possible, but not necessary, causal connection: we can try and forecast the
“effect” given the “cause” or the “cause” given the “effect”. The point is: what do we
know, and what do we want to forecast.
Let us go back to the example concerning the speed of a car. Let us say that the
“true speed” of your car is that measured by a roadside Doppler radar while what we
can observe is the car’s tachometer. If we suppose that both tools are well calibrated,
and the conditions of the road reasonable, we expect both tools to give similar values to
“the speed of the car”. You can then use any of the two in order to “forecast” the other
and the forecast should be quite good (meaning: high R2 ). You choose which forecast
to use on the basis of available info. As the driver of the car, you may be interested
in the forecast you can make using your tachometer in order to avoid breaking speed
limits.
It is also clear that there is no “causal” connection between the two measures, at
least not in the sense that, by altering the reading of one of the two instruments you
can alter the other. If, for instance, you break the plastic on the instrument panel of the car and stop the tachometer arrow with your finger (we suppose analogue dials), this is NOT going to alter the radar measurement, and vice versa.
In this case you have very good forecasts, provided you do not mess with the
instruments. You can make forecasts conditioning both ways, according to what you
know. But such forecasts do not imply any causal connection, at least in the sense that
you could alter one measure controlling the other.
Since we know what is happening (an economist would say: “we know the structure of the economy”), we know the reason for this. The two dials measure correlated phenomena. The radar measures the speed of the car wrt the radar itself, the tachometer
measures the rolling rate of the tire. If the car is running on a reasonably non skidding
surface (not on ice) the two should be highly correlated, hence our forecast ability.
It is interesting to notice that, if we are only interested in forecasting, we may do without such understanding and only suppose the informative relationship to be stable, that is, to apply to repeated instances of the phenomenon at hand.
In principle, under stability, we may then forecast even if we do not “understand”,
in the sense that “we have no idea whence the correlation comes”. This informal idea
of “stability” has several names in Statistics. In a very simple and constrained sense,
it is called stationarity. The idea of i.i.d. random variables is a particular case of this.
More in general the relevant idea is that of “ergodicity” which is quite beyond what we
do in this course.
We can go further: it is clear, in an intuitive sense, that the rolling rate of the
tire “causes” both the tachometer measure (even on skidding surfaces) and the Doppler
radar measure (non skidding surfaces) in the sense that if I alter the rate, maybe acting
on the gas pedal or on the brake pedal, I expect the dial of the tachometer and of the
radar to move in a precise direction. It is also clear that this is not true in reverse: I cannot speed up by moving any of the dials with my finger or other tools.
Notwithstanding this, we are using the “effect” (the position of the dial) in order to “forecast” the “value” of the “cause” (the rolling speed of the tire). This is perfectly sensible and is going to work, obviously, if I do not tamper with the dial.
As mentioned in the introduction of this section, using the “effect” in order to “forecast” the “cause” is a quite common procedure. Consider a case where the “cause” is in the past while the “effect” (as usual) comes later.
The information we have about extinct living beings comes from their fossilized
remains which are available today.
We can say something about the shape and behaviour of extinct living beings
“conditioning” on the information we can derive from what today is a fossil.
However, in no sensible meaning do fossils “cause” the existence in the past of now extinct living beings. I may destroy all fossils today; this would be very foolish, and it would not alter the past of life on our planet. Maybe it would alter our understanding of this past, and this could be the objective of the (reasonably mad) fundamentalist/paleo-terrorist involved in the destruction. This, however, is another story.
Once we understand that a forecast has only to do with an information linkage between variables, under which there may or may not be a causal relationship, we may consider a second point and try to shed more light on the difference between simple forecasting and attempts to intervene on the result of a phenomenon.
When you compute E(Y|X) for a given subject for which you observe X, a set of variables, you simply put the observed X in the function φ(X) = E(Y|X) and get your forecast. If X is made of many different measures, there is not much interest in measuring the “contribution to the forecast” of each variable in X.
To make things simple, suppose Y is a single variable and X is made of X1 and
X2 , suppose E(Y |X1 , X2 ) = α + β1 X1 + β2 X2 where, for instance α = 0, β1 = 1 and
β2 = −1.
This simply means that, if in a subject you observe X1 = .4 and X2 = 1 your
forecast for Y in that unit is -.6, and if you observe, in another unit X1 = .4 and
X2 = 2 your forecast is -1.6.
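With these (illustrative) numbers the arithmetic is immediate:

```python
# Hypothetical coefficients from the example above
alpha, beta1, beta2 = 0.0, 1.0, -1.0

def forecast(x1, x2):
    # conditional expectation E(Y | X1 = x1, X2 = x2)
    return alpha + beta1 * x1 + beta2 * x2

print(forecast(0.4, 1.0))   # -0.6
print(forecast(0.4, 2.0))   # -1.6
```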
You can surely say that, if you consider two different subjects with the same, say, X1 and two different values of X2, where the difference between the two values is 1, the two forecasts shall differ by β2, that is by -1. But this cannot be read in the sense that, if in a unit you “change” the value of X2, increasing it by 1, then you are going to get a forecast -1 smaller than before.
This is wrong for many reasons. Consider the tachometer example above: let us say that to change X2 means to alter the position of the dial with your finger. It would be foolish, in this case, to use as forecast function the conditional expectation computed by observing data where the dial is not tampered with. If I actually did the experiment of comparing the speed as measured by the radar with that of the untampered and the tampered tachometer, I would see that, while a change in the untampered tachometer dial corresponds (with some approximation) to a change in speed as measured by the radar, this does not happen if the dial position is altered because it was tampered with.
This is totally obvious but contains a very important teaching: there are ways in
which, if I “change” some variable in the system I observe, the informative role of such
altered variable changes wrt the role it has in the untampered system.
For this reason, by itself, the observation of the untampered system, while useful for
forecasting, cannot tell me anything, in principle, about the ”effect” of me tampering
with the system.
If our notion of “cause” is based, as usually is in Economics and Finance, on the
idea of “intervention”, we can simply say that forecast on the basis of information (be
it made using regression functions or other tools) in general tells us nothing about any
causal relationship.
This is even more evident when, as in the case of fossils, we observe X a long time after Y happened. While the observation of a fossil fish may indicate that, in the past, a sea existed where the fossil was found, and the observation of a fossil sloth would imply that, in the past, some forest/savannah environment existed where the fossil is found, if I swap the fossils I cannot expect to alter the past environments.
Again, this is obvious and, being obvious, it should always be in the mind of any researcher using regression.
We conclude this (too short, given the relevance of the topic) section by quoting a wonderfully strong, precise, effective and simple passage from what could be considered the best ever book on linear models and their interpretation: Frederick Mosteller and John W. Tukey, “Data analysis and regression: a second course in statistics”, Addison-Wesley, 1977.
The quotation is from the masterful Chapter 13, “Woes of Regression Coefficients”, whose Goethean title is, probably, not chosen by chance, as it has to do with the impossibility of really making oneself known and understood. In what follows the comments written in square brackets [...] are ours.
“We have been careful to point out using x and x2
[in a regression of y on x and x2 ]
that it does not generally make sense to try to interpret the coefficients of x in terms
of what "would happen if the other x’s were held constant"
[try and keep x2 constant when changing x and viceversa].
In this section, we try to go ahead a little, sounding a few of the most necessary
warnings.
Polynomial fits. When it comes to fitting polynomials, whether as simple as b1 x + b2 x2 or as complex as b0 + b1 x + b2 x2 + b3 x3 + b4 x4 + b5 x5, it rarely pays to try to interpret coefficients. Pictures of the fits--or of the difference in two fits to two sets of data--can be very helpful, but the coefficients themselves are rarely worth a hard look.
Unrelated x’s. If the x’s are not closely related, either functionally or statistically,
we may be able to get away with interpreting bi as the "effect of xi changing while the
other x’s keep their same values."
[this is the case where the coefficients for regressions on single variables equate or
almost equate the coefficients for the same variables in the full regression]
If we want to tap expert judgment about the value of bi , some set of words like those
in quotes may be the best we can use.
[it is difficult, even for experts, to have opinions of the coefficient of a given vari-
able conditional to different sets of other variables. Opinions shall usually apply to
univariate regression coefficients]
In practical or policy situations, however, we need to recognize how large a difference
there can be between:
1. x changing while the other x’s are not otherwise disturbed or clamped,
and
2. changing xi while holding the other x’s fast.
[we should add 3. observing the different values of a x in nature. Here this is
subsumed by 1.]
Such differences are not only possible but likely in social and economic problems,
because the x’s we are working with there are usually neither the most fundamental
variables in the situation nor the complete set of variables.
[i.e. they depend on a full substrate of generally unobservable variables which freely act “in nature”]
Consider the example of performance on tests of cognitive achievement as related
to parents’ education, socioeconomic status, and years of schooling. Note that we have
no measures of innate intelligence, attention paid in school, parental or teachers’ en-
couragement, or even hours spent on the subject matter being tested, to say nothing of
physical handicaps.
Our regression works so long as x’s and y’s together are driven by the fundamental
variables acting as they had acted, at particular times and places, before we collected
the data.
[that is: if we do not “act”]
If we interfere with their activity, we are likely to change the regression and find
that the effect of the change cannot be predicted by either of the two regressions listed
above.
Holding all but one fixed and changing that one, if we can indeed do this, is likely
to interfere with the underlying pattern of variability and covariability, thus changing
the regression. In most policy situations this danger is very real.
[what is useful in forecasting, could be useless in deciding an action]
Experiments, Closed Systems, and Physical versus Social Sciences
George Box has (almost) said (1966): "The only way to find out what will happen
when a complex system is disturbed is to disturb the system, not merely to observe it
passively."
These words of caution about "natural experiments" are uncomfortably strong. Yet
in today’s world we see no alternative to accepting them as, if anything, too weak.
Regression is probably the most powerful technique we have for analyzing data. Cor-
respondingly, it often seems to tell us more of what we want to know than our data
possibly could provide. Such seemings are, of course, wholly misleading.
Some examples of what can happen may help us all to understand Box’s point, which covers these examples as well as many others.
First, suppose that what we would like to do is measure people (or items) in a
population and use the regression coefficients to assess how much a unit change in
a background variable (say x1 ) will change a response variable (say y). Since the
regression coefficient of x1 depends upon what other variables are used in the forecast,
we cannot hope to buy the information about the quantitative effect of x1 so cheaply.
These remarks do not deny the potential use of forecasting the value of y from several
variables x1 , x2 , and so on, in the population as it now exists.
[again: one thing is a forecast for a left to itself phenomenon, another a forecast of
the effect of an action]
What they do cast grave doubts on is the use to forecast a change in y when x1
is changed for an individual (class, city, state, country), without verification from a
controlled trial making such a change.
(Strictly speaking, but unrealistically and impractically, if we want to verify what
happens when only x1 changes, the controlled trial should be made so that x1 changes
and there is no chance for the other variables to change the way they naturally would
when the underlying variables are manipulated to change x1 . This sort of study may
not be feasible, and it may not yield what we need to know. We ordinarily want to know
what will actually happen when we change x1 )
[that is: when we change x1 but cannot control the other variables. This is impor-
tant if we want, for instance, compare the effect of a medical treatment as measured
in a controlled experiment with the “real world” effect where no control, e.g. double
blind treatment/control, is possible]
When such issues are raised, proponents of observational studies plus regression
analysis are likely to cite the physical sciences for illustrations of the success of the
method. The idea that such regression-as-measurement methods are successful in the
physical sciences is seriously misleading for a variety of reasons.
First, because so many physical-science applications of regression-as-measurement
are to experimental data.
And second, because the relatively few useful applications that remain involve sys-
tems in which "the variables" are:
- few in number,
- well clarified,
and
- measured with small error”.
If well understood and kept in mind, this passage is more than enough, jointly with the above analysis, to avoid pitfalls and produce a reasonable interpretation of regression results.
A curiosity. Notice the point about interference with the phenomenon and the
change of regression function.
This topic, quite obvious once we think about it, was fully known to (good) statis-
ticians decades before its “reinvention” by Robert Lucas in the field of Economics with
his “Lucas critique”, where he describes a particular setting where the warning you read
above shall apply.
It is usually the case that the effect of such warnings is bigger when empirical analysis has a real practical purpose, hence is strongly constrained by its practical implications, and smaller when the main reason for empirical analysis is more “paper publishing for the sake of it” oriented: a current fad wrongly mistaken by some for science.
In empirical Economics and Finance the main misunderstanding has to do with the
concept of “statistical significance”.
Technically: the estimate of a parameter βj is “significant at a given level α” if it
falls within a size α rejection region for a test whose null hypothesis is H0 : βj = 0.
This is often stated, in really imprecise words, as: a parameter is “statistically significant” if its estimated value, compared with its sampling standard deviation, makes it unlikely that in other samples the estimate may change sign.
In the standard regression setting, the most frequently used statistical index is the T−ratio, and the “significance” of an estimated βj is usually measured in terms of the P−value of its T−ratio.
(We repeat for reference: the P−value of a test is the α of a critical region whose boundary corresponds with the observed value of the test statistic. So, for instance, if you observe a t−ratio of, say, -3, you need to compute the probability, under the null hypothesis that βj = 0, that a sample gives you a t−ratio outside the interval between -3 and +3. Notice the “two tailed” region, which corresponds to the fact that, as a rule, the alternative hypothesis is βj ≠ 0, that is: a two tailed hypothesis.)
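For n − k large the T distribution is close to a Gaussian, so the two tailed P−value of an observed t−ratio can be sketched as follows (a Gaussian approximation, not the exact T computation; the function name is ours):

```python
from math import erfc, sqrt

def two_tailed_p(t_obs):
    # P(|T| >= |t_obs|) under H0, Gaussian approximation:
    # 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
    return erfc(abs(t_obs) / sqrt(2.0))

print(round(two_tailed_p(-3.0), 4))   # 0.0027
```

So a t−ratio of -3 corresponds, two tailed, to a P−value of roughly 0.3%.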
Does a small P−value imply that a parameter is “relevant” in any sense, beyond the fact that you observed a value outside the interval of values which are likely to be observed under the null hypothesis?
As already discussed, the answer is “absolutely not”.
We already commented on this when considering the semi partial R2. There is an even more striking way to present the point: suppose the parameter is known and is different from zero (so that its P−value is 0: it cannot be more significant than this!). The actual relevance of the corresponding regressor could still be absolutely negligible if the semi partial R2 is small. Here, by relevance, we mean the ability of the corresponding Xj to “explain” an amount of variance of Y (improve the forecast) which is big w.r.t. the total variance of Y.
“Statistically significant” only means (this statement is approximate but justifiable) that the statistical quality (precision) of the estimate is such that the estimate should not change sign if we change the sample.
In iid samples, if n is big, typically all parameter estimates become “statistically significant”.
This is because the sampling standard deviation decreases at speed √n, so that even a practically negligible βj can be estimated with enough precision to allow us to distinguish it from zero.
In no way does this imply that βj is “relevant” in any practical sense. What happens here is that, with n big enough, we can reliably assess that an irrelevant effect is actually not 0, but still irrelevant.
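The phenomenon is easy to reproduce by simulation: with a big enough iid sample, even a practically negligible slope gives a “highly significant” t−ratio while the R2 stays tiny. A sketch (the numbers and the univariate setting are ours, chosen only for illustration):

```python
import numpy as np

# Univariate regression with a practically negligible true slope
rng = np.random.default_rng(1)
n = 1_000_000
x = rng.normal(size=n)
y = 0.01 * x + rng.normal(size=n)      # true beta = 0.01, noise sd = 1

b = np.cov(x, y, bias=True)[0, 1] / x.var()   # OLS slope
a = y.mean() - b * x.mean()                   # OLS intercept
resid = y - a - b * x
se_b = np.sqrt(resid.var() / (n * x.var()))   # sampling sd of the slope
t = b / se_b
R2 = 1.0 - resid.var() / y.var()

print(t)    # around 10: "highly significant"
print(R2)   # around 1e-4: practically irrelevant
```

The t−ratio is far beyond any conventional threshold, yet the variable explains about one hundredth of one percent of the variance of Y.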
It is frequent to see papers published in major journals where linear models with tens of regressors and tens of thousands of observations result in statistically significant coefficients, with an overall R2 in the range of a few percentage points and semi partial R2 of fractions of 1%. Whatever the notion of “relevance” (forecasting, always available, or causal, requiring many hypotheses), it is difficult to conceive of any practical setting where such results could be termed “relevant”, if not because they give relevant support to statements of “irrelevance” of the corresponding effects.
This would not be so important if the same papers did not spend most of their length discussing the meanings and the practical relevance of the effects supposedly “found”.
This misunderstanding between “statistical significance” and “relevance” must be
avoided. If models were used for practical purposes (say for forecasting or controlling
variables) the misunderstanding would quickly disappear: an estimate can be as sig-
nificant as I like but, if the R2 is small, the quality of the forecast shall be awful all
the same.
When models are only used for academic purposes (appear in published papers) the
misunderstanding may continue unscathed, sometimes with hilarious consequences.
Now go back to the semi partial R2 and consider, again, how it depends on the square of the t−ratio (numerator) BUT ALSO on the number of degrees of freedom (n − k). This clearly suggests the nature of the problem discussed here. A “highly significant t−ratio”, say of absolute value greater than 3, shall NOT correspond to a high contribution to the quality of the forecast if the sample is big. In other words, in big samples you may expect to observe “big and significant” t−ratios corresponding to practically irrelevant (from the point of view of forecasting accuracy) variables.
Summary: first assess the relevance of the regression and the parameters of interest
in terms of explained variance as if parameters were known and not estimated. Then
look at the statistical stability of the results. An irrelevant parameter is still irrelevant if
it is “significant” while a parameter which could be relevant can be put under discussion
if its sampling variance is too big (this usually happens if the sample is small).
AveDrv=the average drive length in yards
DrvAcc=the percentage of drives to the fairway
GrnReg=the percentage of times the player reaches the green in the “regular” num-
ber of strokes
AvePutt=the average number of putts per hole (should be less than 2)
SavePct=the percentage of saved points
Events=the number of events the player competed in
This dataset is an observational dataset: in the same time frame, 2004, we observe the money results (average winnings) of a set of players and their gaming abilities according to some indexes.
Indexes and results differ between players because players are different.
We are not observing the same players over time to see how changes in their characteristics are correlated with changes in results, and we are not observing, say, a set of players randomly assigned to a treatment or a placebo group, where the treatment may be an increase of training for this or that aspect of the game.
For this reason, the analysis of this regression only has a forecasting purpose: suppose you randomly draw a player from the population whence these players come. If you know the game characteristics of this player, the model allows you to forecast his overall result.
Players can obviously act, in many ways, in order to change their game characteristics. However, there is no guarantee that, if a player is able, by training, tactics, control etc., to reproduce the characteristics already present in another player, the best forecast for the results of the first player shall be the result (or the forecast of the result) of the second.
In the microcosm of golfing, we have something similar to economics: what we ob-
serve are equilibria where many different components are required in a precise amount
in order to yield a result. In such cases, altering the equilibrium mix can have com-
pletely unforecastable effects. If you double the flour in your cake, you do not get
double the cake, you get a mess. Many players destroyed their game just because they
tried to “improve” this or that aspect of their game and, by doing so, they broke the
equilibrium that gave them their results.
This has very practical consequences in reading the results. For instance: be careful in assessing your expectations for, say, the signs of the parameter estimates we shall observe. If your expectation comes from the knowledge you have about which training is more likely to improve your results, such an expectation can be irrelevant for this model.
Instead, if your expectations come from your experience about the characteristics
of the player which, in the past, got more money from the tour, these could be relevant
for this model.
So, for instance, if you say that it is reasonable to assume that expected average money is positively correlated with AveDrv, DrvAcc, GrnReg and SavePct, and negatively with AvePutt, you should understand well where these opinions of yours come from. Moreover, you should understand how these opinions, as a rule, are connected with bivariate correlations, NOT with conditional correlations, while the regression model parameters derive from conditional correlations (see the point on experts in the Mosteller-Tukey quote).
We understand that it is tempting to try and answer a very reasonable question: “if a player trains and improves some aspects of the quality of his game, how much more money could he expect to make?”
This is the “comparison of expectation after intervention” we mentioned with the
name “causal” analysis or “intervention” analysis.
It is important to state that, if we do not make further hypotheses, this dataset does not in any case allow us to answer this question, undoubtedly very interesting for golfers.
Let us start with some descriptive statistics and a simple correlation matrix:
From this correlation matrix we see that at least one of our expectations is apparently not true: the correlation with driving accuracy is negative. However, we also see that, and this could be expected, the correlation between AveDrv and DrvAcc is rather strong and negative (longer means riskier).
We’ll see that this has an interesting implication on the overall regression.
Let us now run the regression:
It could be shown that, for n − k not too small, X′X with a determinant not too near to 0, and an xf not “too far” from the observed rows of X, this can be approximated by

   [ xf β̂_OLS ± z_(1−α/2) σ ]

Under the same hypotheses we can freely put σ̂ in the place of σ and still use the Gaussian in place of the T distribution. With this approximation, the point forecast
interval is the same for each xf and its width is 2 z_(1−α/2) σ̂. If we use the plus/minus two sigma rule this, with our data, becomes 4 times 41432, that is, the forecast plus/minus 2 times 41432. If we stick to the Gaussian hypothesis (or believe a central limit theorem can be applied in our case) this interval should contain the true value of yf with a probability of more than 95%.
If we go back to the descriptive statistics, we see that the standard deviation of AveWng is 54990. This means that, without the regression, our forecast would be the same for each observation and equal to the average (46548), and the corresponding forecast interval would be 46548 plus/minus 2 times 54990. With the regression our forecast is xf β̂_OLS, so it varies with the observed xf, and this variability, “captured” by the regression, is “subtracted” from the marginal standard deviation, so that the point forecast interval shall be narrower: the point forecast plus/minus 2 times 41432.
You should notice that, with an R2 of about .45, the width of the forecast interval is reduced only by about one quarter, not by 45%. This is not surprising: the R2 is in terms of variance while the interval is in terms of standard deviations. Variances (explained and unexplained by the regression) sum, standard deviations do not (the square root of a sum is not the sum of the square roots). For this reason the term “subtracted” above is put under quotes.
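The shrinkage arithmetic can be checked in a few lines (the numbers are those quoted above):

```python
from math import sqrt

sd_y, sd_resid, R2 = 54990.0, 41432.0, 0.45

# Two-sigma forecast interval half-widths, without and with the regression
print(2 * sd_y)                 # 109980.0
print(2 * sd_resid)             # 82864.0

# The reduction acts on standard deviations, not on variances:
print(1 - sd_resid / sd_y)      # about 0.25, much less than R2 = 0.45
print(1 - sqrt(1 - R2))         # about 0.26, consistent with the above
```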
We may then question the statistical precision of our estimates, in particular the statistical precision of our R2 estimate. In the output we do not have a specific test for this, but we have something which is largely equivalent: the F−test table.
The F−table implies rejection of the null hypothesis that there is no regression effect, meaning: all parameters are jointly equal to zero with the possible exception of the intercept.
Notice that, with few observations, even a sizable R2, like the one we got for this model, could be fully due to randomness. The F−test tells us that this does not seem to be the case. This is not a direct “evaluation” of the statistical precision of our R2 estimate. However, implicitly, since there exists a direct link between the value of the F−test and R2, it tells us that an estimate of R2 like the one we found is very unlikely if there is no regression effect.
From the point of view of forecasting, this is all. We may like or not the results but
this is what we find in the data and, if we just suppose some “stability” of the model
(see the comments above) this is the precision of the forecast we can make.
What follows can be seen as an “anatomy” of the forecast in terms of each column
of X. This can be useful for forecasting use but, obviously, it is much more relevant if
the setting is such that we are able to hold a causal interpretation of the regression.
If we go to the last column of the regression output (we added this to the standard Excel output) we find the semi partial R squares. We see that only three variables have a sizable marginal contribution to R2 as measured by their semi partial R2: GrnReg, AvePutt and Events. This means that these are the variables whose addition to X most improves the forecast. Can we go a little bit further and say that, barring the Events variable, on which we shall comment further on, an increase of GrnReg and a decrease of AvePutt are the aspects of the game that, if improved, would imply a greater and more reliable increase in AveWng?
This is a causal interpretation: is it reasonable in our setting? We cannot exclude
it; we can only say it is very unlikely to hold.
Why? We repeat: the data is the summary of a season. It describes a set of “ability”
indicators for each player and some other variable.
Let us concentrate on the abilities.
Let us take, for instance, Age. This is a typical variable you cannot intervene on.
Notwithstanding this, the variable changes in time. The possible causal interpre-
tation would then be: each year the conditional expected value of gains goes down by
almost 600 dollars. Is this the “effect” of age?
Even if we do not consider that what we observe is a cross section of players, and not
a time series of results for a single player (in which we might observe the action of Age),
we must answer “beware”. If a causal interpretation were possible, the β of Age times a
change in Age would be the expected change in AveWng if all the other variables were
constant, that is: if only Age acted and the golfer’s abilities, as expressed by the other
variables, did not change.
Is it reasonable that the natural evolution of Age does not change the other abilities
of any given player?
Quite unlikely. In any case this point should be assessed with theory and empirical
data in order for a causal interpretation to be possible (the methods for doing this are
not the object of this course).
Let us now consider a variable on which we can think we could “act”: AvePutt.
We cannot arbitrarily set this to a lower number (compare this with “changing
interest rates”) but we may conceive of increasing the time dedicated to putting green
training. If this reduces the number of putts, even by just 1/100, we should improve
our winnings (i.e. change their conditional expected value) “on average” by almost 700
dollars (69000 times 0.01).
Is this the “effect” we can expect? It depends. Golfing is an equilibrium game.
What counts is the overall result and trying to improve a part of the game may have
bad results on other parts of the game.
By training more on the green maybe we worsen (or maybe improve?) our game
under other points of view: length, precision from distance etc.
Moreover: the model was estimated on a sample of players with a given “equilibrium
mix” of abilities.
Is it still going to be valid if we alter such characteristics? Again: we do not
know this and, with no answer, any attempt to use the model in this sense would be
unwarranted.
Notice that here we hinted at three different problems, the same problems we
hinted at a number of times above.
The first is that it can be difficult or impossible to act on an Xj while, at the opposite
extreme, some Xj is bound to change by itself.
The second is that it may be difficult to intervene on one Xj without altering other
Xj -s and, if this happens, we should model this interaction to have an idea about the
“effect” on the dependent variable.
The third is that any action on one or more Xj could alter the conditional expec-
tation itself and we should model this alteration.
All these problems have been and still are discussed by econometricians. In fact,
as we mentioned, these problems are at the origin of Econometrics and are still its
central problem: they are what makes Econometrics a sister, but not a twin sister, of
Statistics.
Following the general approach of this section we do not develop the “causal”
discussion further and, for a moment, heroically suppose that we can improve our
AvePutt by decreasing it, without altering the other variables or the regression function
itself.
Is it reasonable to assume an improvement of 1/100 if we suppose this does not alter
the other indicators? Since we see that AvePutt is correlated with the other variables,
this cannot be more than an approximation. However, if this correlation is not too big,
it may be that the reasonable values of AvePutt, conditional on the other variables
being constant, have a standard deviation big enough to allow for “changes” of 1/100
in AvePutt.
The marginal standard deviation of AvePutt is about .023. Notice that this is a
standard deviation across players, so it does not directly concern our problem.
1/100 is less than one half of the marginal standard deviation of AvePutt. This
means that it is quite easy to find different players with such a difference in this statistic.
With a little bit of unwarranted logic, let us assume that this is also true for the single
player, if we do not condition on his other statistics.
This is the crucial point: both for different players and for the single player we
must recall that we are within a regression and that we are evaluating the possibility
of changing AvePutt by 1/100 while the other variables do not change.
This means that, as stated above, we must consider the conditional standard devi-
ation not the marginal standard deviation.
Recall the formula

√(t_j^2 (1 − R2)/(n − k)) = |β̂_j| √(Var(Xj|X−j)) / √(Var(Y))
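This formula can be checked numerically on simulated data (a sketch: the variables and numbers below are invented and only illustrate the identity, they are not the golf dataset):

```python
import numpy as np

# Simulate a small regression with correlated regressors.
rng = np.random.default_rng(0)
n, k = 500, 3                          # n observations, k parameters (incl. intercept)
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)     # x2 correlated with x1
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

# OLS fit, t statistic of x2 and overall R^2.
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
s2 = resid @ resid / (n - k)
t2 = beta[2] / np.sqrt(s2 * np.linalg.inv(X.T @ X)[2, 2])
R2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# Conditional variance of x2 given the other columns: the variance of the
# residual of the partial regression of x2 on [1, x1].
Z = np.column_stack([np.ones(n), x1])
u = x2 - Z @ np.linalg.lstsq(Z, x2, rcond=None)[0]

lhs = np.sqrt(t2 ** 2 * (1 - R2) / (n - k))
rhs = abs(beta[2]) * np.sqrt(u.var()) / np.sqrt(y.var())
print(lhs, rhs)    # the two sides coincide
```

The identity is exact, not approximate: both sides equal β̂_j² times the residual sum of squares of the partial regression, divided by the total sum of squares of Y.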
Using the data in the output we find that the standard deviation of AvePutt (our
Xj) conditional on the other variables (X−j) is .021, obviously smaller than the uncon-
ditional standard deviation but still more than twice the hypothesized change of .01.
This implies that, even conditionally on the other variables, different players (and
maybe the same player) could still easily show such different values of AvePutt.
For the above mentioned reasons, this does not, by itself, justify a causal interpre-
tation. However, if such an interpretation were available, an expected effect of the size
of 700 dollars (69000 times .01), or even more, would not be unreasonable.
On the other hand, an improvement of, say, .04 in AvePutt would probably be
unlikely, both marginally and, what matters more for us, conditionally on given values
of the other X−j.
If this causal analysis is viable, then, we may expect that work on the putting
green which does not alter the rest of “the game” could give a golfer a reasonable
improvement of 700 dollars in AveWng (roughly 1.5% of AveWng).
Let us now consider other aspects of the estimates.
A possibly puzzling point is given by the signs of AveDrv and DrvAcc, which are
both negative.
The semi partial R2 of AveDrv is almost 0 while that of DrvAcc is a little more than
2%.
In most practical contexts we could then avoid discussing the parameter estimates
for these variables.
As an exercise, however, let us try to use what we know about partial regressions to
unravel the puzzle. We anticipate that most of the puzzle comes from confusing
bivariate or “marginal” correlation with conditional correlation.
Begin by comparing the simple correlations with AveWng and the signs of the
parameters estimated in the linear regression. Notice that the sign of the parameter
for DrvAcc is the same as that of its correlation with the dependent variable, while the
parameter of AveDrv is negative despite a positive correlation.
A negative simple correlation between AveWng and DrvAcc may not be surprising
and we may attempt an explanation which, as always in these cases, is implicitly based
on some causal interpretation of the parameters.
The possible interpretation is this: it could simply be that, in order to be precise
with the drive, a player tends to be too cautious and this may harm his overall result.
There are many alternatives to this interpretation, each depending on some strand
of causal reading of the parameters. The choice among these depends on further and
more complex analysis and on more structured hypotheses about how the performance
of the golfer is connected to each of the statistics.
For a forecasting interpretation this is irrelevant.
Now let us consider AveDrv: the correlation of this variable with AveWng is positive
and not small, while the regression coefficient estimate is negative and the semi partial
R2 is virtually 0 (much smaller than the same statistic for DrvAcc, whose correlation
with AveWng, in absolute value, was roughly 1/2 of that of AveDrv).
To understand what is happening, let us consider the result of the regression of
AveDrv on the other columns of the X matrix:
60% of the variance of AveDrv is captured by its regression on the other columns
of X. More than one half of this (38% semi partial R2) has to do with its negative
dependence on DrvAcc.
Also GrnReg shows a sizable semi partial R2 (14%) and a positive regressive depen-
dence.
As we know, only what is left as the residual of this regression is involved in the
estimation of the AveDrv parameter in the original regression. This is the part of the
AveDrv variance which is not correlated with DrvAcc and GrnReg (and the other
variables in the partial regression).
We know that GrnReg is the single most important variable in the overall regression
(in the sense that it shows the highest semi partial R2 ).
Based on this we may attempt an interpretation (again: many are possible): the
“equilibrium player” represented by the regression tends to have a higher AveWng if the
percentage in GrnReg is higher. On the other hand, a higher percentage in GrnReg
tends to imply a bigger AveDrv. For this reason, marginally, AveDrv is positively
correlated with AveWng. However, the part of AveDrv in excess of what is correlated
with GrnReg seems to be harmful to the overall game and from this comes the negative
coefficient in the overall model.
Now compute, as we did above, the conditional standard deviation of AveDrv.
According to our formula this is equal to √0.0000812 × 54990/94.76 = 5.23, to be
compared with a marginal standard deviation of 8.27. If we hypothesize a change in
this variable (conditional on the other columns of X) equivalent to that hypothesized
above for AvePutt (less than 1/2 of its conditional standard deviation), that is, equal
to 2.5, the overall expected “effect” would be a decrease of AveWng of roughly 200
dollars. You would need a very big change, of twice the conditional standard deviation
(about 10), to have a negative effect comparable with an AvePutt change of 1/2 of a
conditional standard deviation.
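The arithmetic can be checked directly. Reading the formula, 0.0000812 plays the role of the semi partial R2 of AveDrv, 54990 that of the standard deviation of Y and 94.76 that of |β|; this attribution is our reading of the output, which is not reproduced here:

```python
import math

# Conditional standard deviation of AveDrv, using
# sd(Xj|X-j) = sqrt(semi partial R^2) * sd(Y) / |beta_j|,
# with the numbers quoted in the text (our interpretation of the output).
cond_sd = math.sqrt(0.0000812) * 54990 / 94.76
print(round(cond_sd, 2))   # 5.23, vs a marginal standard deviation of 8.27
```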
Again a matter of care: these evaluations are borderline causal!
In the end, what would be the most proper use of such a regression?
Suppose you want to bet on how much, on average, a randomly chosen player is going
to win. You know the characteristics of the player; you are betting on the results.
The estimated regression would be a nice starting point.
Now change players into stocks, winnings into returns and use market returns, price
to book value, size, and so on as indicators as in the Fama and French model or in the
style analysis model. In which stock would you invest? To which fund manager would
you give your money?
These are clearly relevant questions and the regression model would be fit for them
even without any causal interpretation.
causal) it seems that the role of x, while existing (we know that β is not 0), is not so
relevant (at least in terms of a “good fit”).
But suppose that, with no change in the regression, either the variance of x becomes
higher or the variance of ε decreases, or both, that is: suppose the joint distribution of
y and x changes, for some reason32. For instance, suppose V(x) = 100. If all the rest
remains unchanged the new R2 shall be equal to 25/35 = 0.71.
In this case the contribution of x to the quality of the forecast of y becomes
considerable.
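The arithmetic of this example can be reproduced assuming β = 0.5 and V(ε) = 10 (a guess consistent with 25/35; the original parameters are not fully visible here):

```python
# Single-regressor model y = a + b*x + eps:
# R^2 = b^2 V(x) / (b^2 V(x) + V(eps)).
def r2_one_regressor(b, var_x, var_eps):
    explained = b ** 2 * var_x
    return explained / (explained + var_eps)

b, var_eps = 0.5, 10.0                       # assumed values
print(r2_one_regressor(b, 100.0, var_eps))   # 25/35, about 0.71
```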
Is this reasonable, is it relevant? For instance: in an observational setting, can we
suppose that the data we use for the forecast are so different w.r.t. those used for the
estimation? In a causal setting: is such a big alteration of the behaviour of x possible
(and, in a multivariate regression: is such a change possible CONDITIONAL on the
other columns of X)?
This, obviously, cannot be assessed in general and can only be evaluated on a case
by case basis.
The important point is, again, to fully understand that even a simple and standard
method like linear regression can never be handled in a ritual/cookbook way. Only
a full understanding of the method and of the circumstances of its specific application
can (and does) yield useful results. Barring this, its use can only be understood as a
kowtow to pseudo-scientific ritualism or, worse, misleading rhetoric.
Let us consider a case in which a variable may have a “relevant effect” even if it
does NOT explain a big chunk of the dependent variable variance.
Suppose for instance that you have a dataset whose observations are the heights
of a population of adult men and women. The sample is very unbalanced: it
contains, say, 1000 men and 20 women. For this reason most of the observed variance
in height shall be due to variance across men. If we regress heights on a constant
and a dummy which is equal to 1 if the subject is a woman, we shall find, in all
likelihood, a statistically significant negative parameter for the dummy (something like
−10 centimeters) but an almost zero R2. This does not mean that the difference in
height between men and women is irrelevant (it is not): it means that, since most
of the sample is made of men, this difference does not explain a big chunk of the
variance of THIS sample; most of the variance in this sample is due not to sex but to
variance in height among males.
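A quick simulation of this point (the heights and group sizes below are invented but plausible):

```python
import numpy as np

rng = np.random.default_rng(1)
# 1000 men, 20 women; women about 10 cm shorter on average (assumed numbers).
men = rng.normal(177.0, 7.0, size=1000)
women = rng.normal(167.0, 7.0, size=20)
y = np.concatenate([men, women])
d = np.concatenate([np.zeros(1000), np.ones(20)])   # dummy: 1 if woman

# OLS of height on a constant and the dummy.
X = np.column_stack([np.ones_like(d), d])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(beta[1], r2)   # coefficient near -10 cm, R^2 only a few percent
```

The dummy coefficient recovers the roughly 10 cm difference, yet the R2 stays tiny because the 20 women contribute very little to the total variance of this unbalanced sample.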
Now, suppose you apply this result to a balanced sample where 50% of the subjects
are women and 50% men. In this new sample most of the variance shall come from
the difference in sex. In other words: we are forecasting in a setting where the
distribution of X is quite different w.r.t. that valid for the estimation sample.
More generally: it may be that the role of a variable, in a forecast or, if reasonable,
in causal terms, is “big” while its partial R2 evaluated in the estimation sample is
small. If this happens it is usually due to the fact that, conditional on the other
explanatory variables (and maybe even unconditionally), this variable varies very little
in the estimation sample and so does not determine a relevant part of the dependent
variable variance.
32
It is the same as saying that the joint distribution of x and y changes.
It may be that, for some reason, the observed sample is unbalanced with respect
to the population. If, in a more balanced sample, the explanatory variable we are
considering is expected to have higher variance, it may be that its contribution to
explaining the variance of the dependent variable increases, so that it becomes
interesting to study its behaviour. However, if this is not the case and the sample is
representative of the population we are interested in, the “relevant” parameter shall be
interesting only if we compare the (few) sample points where the explanatory variable
presents very different values.
A second very simple example: suppose you are interested in the expected life of a
sample of patients after a given medical treatment. A small subsample of patients was
using a given drug, say A. A new drug, B, is given to all the subjects in the sample
and you observe a huge variation in mean survival time across different subjects, say a
standard deviation of 10 years over a mean of 5. You also observe that the subsample
previously treated with A has the same standard deviation but a mean of 10. Since
this subsample is small, the difference between the means shall contribute very little to
the overall variance (the partial R2 shall be small); however, it would be very proper to
suggest the use of A jointly with B. Notice that the more you increase the subpopulation
using A, the more the explained variance due to use/non use of A shall increase.
This, however, is true only up to the point where the fraction of the sample using A is
1/2. If, for instance, everybody uses A, there will be no “variation” of life span due to
use/non use of A, but there will still be the “effect” of A in the 5 years gained, on
average, by its use.
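This last remark can be made precise: with a binary use/non-use variable taken by a fraction p of the sample, a mean difference delta contributes p(1 − p)·delta^2 to the outcome variance, which vanishes at p = 0 and p = 1 and is maximal at p = 1/2, while the “effect” delta itself does not depend on p. A sketch with the 5 year gain of the example:

```python
# Between-group contribution to the variance of the outcome when a
# fraction p of the sample uses drug A and the mean gain is delta.
def explained_by_use(p, delta):
    return p * (1 - p) * delta ** 2

delta = 5.0   # 5 years gained on average, as in the text
for p in (0.0, 0.02, 0.5, 1.0):
    print(p, explained_by_use(p, delta))
# Explained variance is zero when nobody or everybody uses A and
# largest when half the sample uses it; the effect delta is always 5.
```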
Notice that in this example the reasoning is based on our ability to change the
proportion of the population using A. Suppose instead that A is not the use of a
medicine but the fact of having one blue eye and one brown eye. In this case, observing
the same results, we would have very little to suggest, except that B seems a very
useful medicine for the few (in this case lucky) people with eyes of different colors.
So: beware of unbalanced samples.
In other settings it may be that we can purposely alter the behaviour of some x not
just in terms of level but also of variance.
The number of possible cases is huge and this is not the place to go further into them.
One last comment: in the example above we have a case where an irrelevant result
in terms of R2 gives us the relevant suggestion that we could assign both drugs A and B
to all patients and hence alter the distribution of X. This is a real possibility: we can
give both drugs (at least if their combination is not harmful) and from this comes the
relevance of the result. Suppose instead that the difference is in terms of some other
characteristic, say the color of the eyes. In this case we cannot change the percentage
of the population with such characteristics, or “give both colors” to each element of
the population. In this case, while interesting, the result is in any case “irrelevant”.
Since all estimates and statistics could be identical in both cases, this implies that
“relevance” is not something that can be fully resolved only on the basis of Statistics:
it requires accurate analysis of the specific problem.
It is also easy to show examples where a big partial R2, while important in
forecasting terms and maybe also in “causal” terms, is in practice not directly of any use
(beyond forecasting). Suppose we select a population of women of different ages, ac-
cording to the marginal distribution by age of women, and attribute to each individual
the number of children she gave birth to in the last 5 years. It is clear that age
shall be relevant (in terms of partial R2) in “explaining” the variance of the dependent
variable. This is expected and cannot be of use, as we cannot change the age of the
elements in the sample. However, it is going to be important to keep the variable in the
regression if we wish to assess the separate effect of other, less obvious but potentially
relevant variables on which we can act (for instance, the amounts of vitamins in the
blood of different subjects), when these variables are correlated with age.
3. Understand that Y is NOT E(Y|X) but Y = E(Y|X) + ε, and Var(Y) =
Var(E(Y|X)) + Var(ε).
4. Quantify how much of Var(Y) is due to Var(E(Y|X)) and how much to Var(ε), that
is: compute R2.
5. You MUST do this because the purpose of a linear OLS model (with intercept)
is that of maximizing R2
6. Moreover, when you discuss the importance of each βj you are “partitioning” R2 .
8. Hence βj only pertains to the “effect” (contribution to the forecast) on E(Y|X)
of what in Xj can change conditional on the other X being constant, NOT to the
effect of a generic change in Xj.
9. Be careful about the meaning of “effect”: strictly speaking, all you can say is that,
if you build forecasts for Y given X using E(Y|X), the said “effect” is that, if it
happens that “Nature” gives you two new vectors of observations on X where the
difference between the two vectors is just a different value of only Xj, then
the difference between your forecasts is given by the difference in the two values
of Xj times the corresponding βj (or its estimate, if you are using an estimated
conditional expectation). In other words: this tells you nothing, by itself, about
the possible change in Y given an act on your part to change some value of Xj.
The (by no means easy) study of such a “causal” interpretation has always been
very much in the mind of econometricians, who evolved structural Econometrics
as an attempt to answer the question (very interesting for obvious policy reasons)
of assessing the possible results of a change in a variable not just given by “Nature”
but acted on by a policy maker. The obvious difference between the two cases is
that the act could fail to respect the “natural” joint distribution of observables and
make your previous study of this useless as a source of answers. Just think of the
obvious difference between observing, say, interest rate changes induced by market
dynamics and imposing an interest rate change by policy: the laws concerning
the effects of the second act could be completely different from the laws concerning
the “natural” evolution of rates in the market. On the contrary, the “forecast
change” interpretation is always a good one if the observed change happens without
interference.
10. Once you understand the meaning of the word “effect”, quantify it, in your
sample (that is: for a given joint distribution of Y and X), with the semi partial
R2 due to the j-th regressor, t_j^2 (1 − R2)/(n − k) (if you are reading a paper and,
bad sign, R2 is not available, use the same formula with R2 = 0: this shall give you
an overvaluation of the semi partial R2).
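A minimal helper for this computation (the numbers t = 2.5, R2 = 0.6, n = 200, k = 8 are hypothetical, standing for whatever a paper reports):

```python
# Semi-partial R^2 of regressor j from published regression output:
# t_j^2 (1 - R^2) / (n - k). With R^2 unknown, r2 = 0 gives an upper bound.
def semi_partial_r2(t_j, r2, n, k):
    return t_j ** 2 * (1 - r2) / (n - k)

# Hypothetical paper: t = 2.5, R^2 = 0.6, n = 200 observations, k = 8 parameters.
exact = semi_partial_r2(2.5, 0.6, 200, 8)
bound = semi_partial_r2(2.5, 0.0, 200, 8)   # R^2 not reported
print(exact, bound)   # the bound is never smaller than the exact value
```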
11. Evaluate the practical significance of this “effect” (while doing this, ask yourself
whether the sample is balanced with respect to the explanatory variable and, if not,
consider whether a balanced version of it is a sensible possibility: see above the examples
involving the heights of males and females and the experiment with two medicines).
In general this depends on the specific case. However, a first rough idea can be
gained by computing the change in the conditional expectation of Y induced by a
“reasonable” change of Xj. Since this must be a “reasonable” change “which leaves
the other explanatory variables unchanged” (recall the meaning of βj induced by
the partial regression theorem), it could be measured by the conditional standard
deviation of Xj given the other regressors. It could be useful, then, to compute
the ratio |βj| √(Var(Xj|X−j)) / √(Var(Y)), which expresses this “effect” in units of the
standard deviation of Y (the modulus around βj comes from the fact that we get
the formula by taking the square root of a square). A quick proof shows that
this is exactly identical to the square root of the semi partial R2 for Xj, and this
confirms the centrality of this quantity. Beware! Do not be deceived by a quantity
which bears some resemblance to this: |βj| √(Var(Xj)) / √(Var(Y)), which
shall obviously be bigger (or, more precisely, not smaller) and so, maybe, more gratifying.
The point is that, by not conditioning the variance of Xj, it violates the interpretation
of the coefficient. By the way: it may well happen that this quantity is bigger
than 1, which, obviously, is absurd33.
12. Then do Statistics (namely: consider that you must estimate β and evaluate the
quality of the estimate).
13. Remember: an estimate is “statistically significant” if the ratio of its value to its
sampling standard deviation is big enough to say that you can reliably distinguish
it from zero.
33
Most of the time this choice is made when the correct measure of relevance would give as a result a very
small value, that is: the practical irrelevance of the “effect”. In this case the use of the unconditional
standard deviation inflates the result but, since the starting point is very small, the inflated value
is smaller than 1 and, apparently, you do not get absurd results. The inconsistency is in any case
evident: for instance, you get a semi partial R2 of, say, .0001 for a given Xj and then you find written
in the paper that “a change of Xj equal to its standard deviation implies a change of (the conditional
expected value of) Y (but sometimes this too is forgotten) equal to 1/2 of its standard deviation”. These
two pieces of information are evidently conflicting, and the solution is that, since the “effect” measure given by
βj only has to do with “a change in Xj with the rest of the regressors constant”, you cannot use the
unconditional standard deviation of Xj as a measure of a “normal” change in Xj (it could be, but in an
unconditional setting): you must use the conditional standard deviation. Just take the square root of
the semi partial R2 and you get that the correct measure of the change in the conditional expectation
of Y, for a change of Xj “reasonable conditional on X−j” (i.e. equal to its conditional standard deviation),
is, in units of the standard deviation of Y, equal to .01. A completely different picture. What
is happening? Just take the ratio of this to the previous number: .01/(1/2) = .02; this shall be the
ratio between the conditional and unconditional standard deviation of Xj. Today’s
common use of very big samples, which can be exploited only by adding a huge number of “fixed effects”, makes
this event quite common.
15. In any case, remember the difference between significance and relevance. E.g.:
beware of the use of large datasets. If, in the comments to their results, authors
using large datasets stick too much to “statistical significance” and do not deal
with practical relevance, it is most likely the case that their results can be sum-
marized as “a very precise estimate of irrelevant effects”, so that the reading of
the main results of the paper can usually be changed into something like: “our
data point strongly to the irrelevance of the effect under study”. By the way:
while not currently fashionable, such a finding could be of great interest.
16. Finally: Beware of unbalanced samples (this is the same as 11 but it is very
important, so I repeat).
Mosteller, F. and Tukey, J. W. (1977). “Data Analysis and Regression: A Second
Course in Statistics”. Addison-Wesley, Reading, MA. In particular, see ch. 13 with the
meaningful title: “Woes of Regression Coefficients”.
A good and more concise summary can be found in:
Sanford Weisberg (2014) “Applied Linear Regression”, III ed., Wiley. In particular,
see Ch. 4.
A short paper by a great statistician which contains, in simple and condensed form,
most of what was discussed here, is: George E. P. Box (1966) “Use and Abuse of
Regression”, Technometrics, Vol. 8, No. 4 (Nov., 1966), pp. 625-629
For the maths of the semi partial R2, joint with a keen discussion of “effect sizes”, you
may see:
Jacob Cohen et al. (2013) “Applied Multiple Regression/Correlation Analysis for
the Behavioral Sciences”, III ed., Routledge.
For those interested in reading something more about the different interpretations
of a linear model (e.g. forecast vs causal), which make it, arguably, a very tricky and
slippery field to walk on, the following books could be useful:
J. D. Angrist and J. S. Pischke (2009) “Mostly Harmless Econometrics”, Princeton
University Press.
J. Pearl (with Madelyn Glymour and Nicholas P. Jewell) (2016) “Causal Inference
in Statistics: a Primer”, Wiley.
Examples
Exercise 9-Linear Regression.xls
10 Style analysis
Style analysis is interesting both from the point of view of practitioner’s finance and
as an application of the linear regression model.
The current version of the model was elaborated by William F. Sharpe in a series
of papers beginning in 1989. In this summary we shall refer to the 1992 paper (as of
November 2018 you may download it at http://www.stanford.edu/~wfsharpe/art/sa/sa.htm).
In order to understand the origin of the model we must recall the intense debate
that developed during the eighties about the validity of the CAPM, its possible
replacement with a multifactor model, and the evaluation of the performance of fund
managers.
In a nutshell (we come back to this in some more detail in the next chapter): a factor
model is a tool for connecting expected returns of securities or security portfolios to
the exposure of these securities to non-diversifiable risk factors. The CAPM
asserts that a single risk factor, the “market”, or, better, the random change in the
“wealth” of all agents invested in the market, is priced in terms of a (possible) excess
expected return. This factor is empirically represented by the market portfolio, that
is: the sum of all traded securities. The expected return of a security in excess of the
risk free rate (remember that we are considering single period models) is proportional
to the amount of correlation between the security and the market factor. The
proportionality factor is the same for all securities and is called the price of risk.
Multifactor models, such as the APT, suggest the existence of multiple risk factors
(not necessarily traded) with different prices of risk, so that the cross section of ex-
pected security (or security portfolio) excess returns is “explained” by the set of the
security exposures to each factor. Classical implementations of the APT were based
on economic factors, some tradable, like the slope of the term structure of inter-
est rates, some, at least at the time, non tradable, such as GNP growth and inflation. At
the turn of the nineties Fama and French, followed by others, produced a number of
papers where factors were represented by spread portfolios. The most frequently used
factors were based on the price to book value ratio, on the size of the firm and on some
measure of market “momentum” (relative recent gain or loss of the stock w.r.t. the
market). These factors were represented, in empirical analysis, by spread portfolios.
For instance: the price to book value factor was represented by the p&l of a portfolio
invested, at time zero, in a zero net value position, long in a set of high price to book
value stocks and short in a set of low price to book value stocks. Fama and French
asserted that the betas w.r.t. this kind of factor mimicking portfolios were “priced by
the market”, that is, the correlation of a stock return with such portfolios implied a
non null risk premium.
Consider now the problem of evaluating the performance of a fund manager. A
preliminary problem is to understand for which reason you, the fund subscriber, should
pay the fund manager. Obviously, you should not pay the fund manager, beyond
implementation costs (administrative, market transactions etc.), for any strategy which
is known to you at the moment you subscribe to (or do not withdraw from) the fund,
if this strategy gives “normal” returns and you can implement it by yourself.
Suppose, for instance, that the asset allocation of the fund manager is known to
you before you subscribe to the fund. Since the subscription is your choice,
the fund manager should not be paid for the fund results due to asset allocation or,
better, should not be paid for them beyond implementation costs. A bigger fee could be
justified only if, by implementing management decisions you cannot forecast
on the basis of what you know, the fund manager earns some “non normal” return.
This is the reason why index funds should (and, in markets populated by knowl-
edgeable investors, usually do) ask for small management fees. What we say here is
that the same should hold for any fund managed with some, say, algorithm replicable
on the basis of a style model: for instance, funds which follow asset selection
procedures based on variants of the Fama and French approach (that is: stock picking
based on observable characteristics of the firms issuing the equity such as, for instance,
accounting ratios, momentum etc.). While implementing such models requires some
care and a lot of good data management, the reader should be aware that
nothing magic or secret is required for the implementation of these algorithms.
The fund manager’s contribution with a possible value for you, if any, should be
something you cannot replicate, that is: either something arising from abilities or
information of the manager unavailable to you or, maybe, from some monopolistic or
oligopolistic situation involving the manager. Let us suppose (a very naive idea!) that
the second hypothesis is not relevant. A formal way to say that the manager’s ability
is not available to you is to say that you cannot replicate his contribution to the fund
return with a strategy conceived on the basis of your knowledge.
Notice that for this reasoning to be valid it is not required that you actually perform
any analysis of the fund strategy before buying it. Perhaps we could agree on the fact
that you should perform such an analysis before buying anything. A mystery of finance
is that people spend a lot of money to buy something whose properties are
unknown to them. People wouldn’t behave this way when buying, say, a car
or even a sandwich. However, any lack of analysis simply means that something more,
unexpected by you, shall become (in your opinion) a merit or fault of the fund manager.
It is important to understand that, according to this view, the evaluation of the
performance of a fund manager is, first of all, subjective. It is the addition of hypotheses
on the set of information used by subscribers and on their willingness to optimize using
this information that can convert the subjective evaluation into an economic model.
The problem here is, obviously, to define what we mean by “normal return” and
“known strategy”.
Here a market model, representing efficient financial use of public information, could
be a sensible solution. Were the market model and the asset manager’s effective asset
allocation available, the first could be used to define the efficiency of the second and,
by difference, possible over or under performances, unexpected by the model, on the
part of the fund manager.
Alas, for reasons that shall be discussed in the following sections, satisfactory empirical
versions of market models have yet to appear or, at least, versions of market models,
together with statistical estimates of the relevant parameters, strong enough to be agreed
upon by everybody and so useful in an inter-subjective performance analysis.
A less ambitious and more empirically oriented alternative is return based style
analysis. This alternative yields a (model dependent) subjective statement about the
quality of the fund. We shall return to this point, but we stress the fact that, if the
purpose of the method is to let a potential subscriber, or someone already invested in
the fund, judge the fund manager’s performance, and not to let some agency award
prizes, the subjective component of the method is by no means a drawback.
Return based style analysis can be seen as a specific choice of “normal return” and
“known strategy” definitions. The “known strategy” is the investment in a set of tradable
assets (typically total return indexes) according to a constant relative proportion
strategy; the “normal return” is the out of sample return of this strategy, previously
tuned in order to replicate the historical returns of the fund. This point has to be
hammered in, so we repeat: the strategy is not chosen in order to yield “optimal” returns (in
any case the lack of a market model would impede this) but only in order to replicate
as well as possible, in the least squares sense, the returns of the fund strategy.
In order to estimate the replica weights, the returns $R^{\Pi}_t$ of the fund under
investigation are fitted to a constant relative proportion strategy with weights $\beta_j$ invested in
a set of $k$ predetermined indexes with returns $R_{jt}$:

$$R^{\Pi}_t = \sum_{j=1}^{k} \beta_j R_{jt} + \epsilon_t$$
The term “constant relative weights strategy” indicates, as usual, a strategy where
the proportion of wealth invested in any given index is kept constant over time. This
implies that, when some index over performs the other indexes, a part of the investment
in the over performing index must be liquidated and invested in the under performing
indexes.
For the sake of comparison, other possible strategies could be the buy and hold
strategy, where a constant number of shares is kept for each index, and the trend
following strategy, where shares of “loser” indexes are sold to buy shares of “winner”
indexes. Both these strategies have time varying weights and could reasonably be used
as reference strategies.
There exist variants of the constant relative proportion strategy itself. In a
constrained version the weights could be required to be non negative (short positions are
not allowed). In another version weights could be allowed to change over time (in this
case we should assume that the sum of all weights is constant over time).
In typical implementations the model has no intercept and the sum of the betas is
constrained to be one. The constant is dropped because it is usually interpreted as a
constant return and, over more than one period, a constant return cannot be achieved
even by a risk free investment. The assumption that the sum of all weights is one is
required for the interpretation of the weights as relative exposures and, in the case of
a multi period strategy, in order for the portfolio to be self financing.
While both interpretations and both constraints could be challenged, in our
applications we shall stick to the common use. We only mention the fact that, sometimes,
instead of imposing the “sum to one” constraint explicitly at estimation time35, it is
implemented on an a posteriori basis by renormalizing the estimated coefficients. The two
methods do not yield the same results.
35
The $\sum_j \beta_j = 1$ constraint can be imposed on the OLS model in a very simple way. First choose
any $R_{jt}$ series, say $R_{1t}$. Typically the choice falls on some series representing returns from a short
term bond, but any choice will do. Second, compute $\tilde{R}_t = R_t - R_{1t}$ and $\tilde{R}_{jt} = R_{jt} - R_{1t}$ for $j = 2, \dots, k$.
Now regress $\tilde{R}_t$ on the $\tilde{R}_{jt}$ for $j = 2, \dots, k$. After running the regression, the coefficient for $R_{1t}$, which
you do not directly estimate, shall be equal to $1 - \sum_{j=2}^{k} \beta_j$.
A relevant point in the choice of the reference strategy is that it should not cost
too much. In this sense the constant relative proportion strategy is open to criticism,
as it can imply non negligible transaction costs. The reason for its use in style analysis
seems to lean more on tradition than on suitability.
Notice that in no instance are we supposing that the fund under analysis actually
follows a constant relative proportion strategy invested in the provided set of indexes.
We are NOT trying to discover the true investment of the fund but only to replicate
its returns as best we can with some simple model. This point has to be underlined
because, at least in the first paper on the topic, Sharpe himself seems to state that
the purpose of the analysis is to find the actual composition of the fund. This is
obviously impossible unless the fund is actually invested, with a constant relative
proportion strategy, in the indexes used in the analysis.
In fact, the actual discovery of the composition of the fund and its evolution over
time would hardly add anything to the purpose of identifying the part of the fund’s
strategy not forecastable by the fund subscriber. A model would still be needed in
order to divide what is forecastable from what is unforecastable in the fund evolution.
Let us go back to the identity:

$$R^{\Pi}_t = \sum_{j=1}^{k} \beta_j R_{jt} + \epsilon_t$$
Up to now this is not an estimable model but, as said above, an identity. In order to
convert it into a model we must assume something about $\epsilon_t$. A way of doing this is to recall
the chapter on linear regression. The style model is clearly similar to a linear model,
in particular to a linear model where both the dependent and independent
variables are stochastic. In this case we know that a minimal hypothesis for the OLS
estimate to work is that $E(\epsilon|R_I) = 0$, where $\epsilon$ is the vector containing the observations
on the $n$ $\epsilon_t$-s and $R_I$ is the matrix containing the $n$ observations on the returns of the
$k$ indexes. The second, less relevant, hypothesis is the usual $E(\epsilon\epsilon'|R_I) = \sigma^2_{\epsilon} I_n$.
The hypothesis $E(\epsilon|R_I) = 0$ has a sensible financial meaning: we are supposing
that any error in our replication of the fund’s returns is uncorrelated with the returns
of the indexes used in our replication.
Sharpe’s suggestion for the use of the model in fund performance evaluation is as
follows: given a set of observations (typically with a weekly or lower frequency; Sharpe
uses monthly data) from time $t = 1$ to time $t = n$, fit the style model from $t = 1$ to
$t = m < n$ and use the estimated coefficients for forecasting $R_{m+1}$; then add to the
estimating set the observation $m + 1$ and (in most implementations) drop observation
1. Forecast $R_{m+2}$ and so on. These forecasts represent the fund’s performances as due
to its “style”, where the term “style” indicates our replicating model. The important
point is that this “style” result is forecastable and, in principle, replicable by us. The
possible contribution of the fund manager, at least with respect to our replication
strategy, must be found in the forecast error. The quality of the fund manager has to
be evaluated only on the basis of this error.
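The rolling window procedure just described can be sketched as follows. This is a simplified Python illustration on simulated data: it uses an unconstrained OLS without intercept, without the sum-to-one and non negativity constraints discussed above, so it shows the mechanics of the forecast, not a full style analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: monthly fund and index returns (made up for this sketch).
n, k, m = 60, 3, 36
R_idx = rng.normal(0.005, 0.03, size=(n, k))
R_fund = R_idx @ np.array([0.4, 0.4, 0.2]) + rng.normal(0, 0.005, size=n)

style_forecast = []
for t in range(m, n):
    # Fit on a rolling window of the last m observations.
    X, y = R_idx[t - m:t], R_fund[t - m:t]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    style_forecast.append(R_idx[t] @ beta)       # out of sample "style" return

style_forecast = np.array(style_forecast)
selection = R_fund[m:] - style_forecast          # forecast error: the manager's part
```

The `selection` series is the forecast error on which, according to the text, the fund manager should be judged.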
There are three possibilities:
• The fund manager’s return is similar (in some sense to be defined) to the replicating
portfolio return. In this case, since you are able to replicate the result of the fund
manager’s strategy using a “dumb” strategy, you shall be willing to pay the fund
manager only as much as the strategy costs.
• The fund manager’s returns are lower than your replica returns. In this case you
should avoid the fund, as it can be beaten even by a dumb strategy which is
not even conceived to be optimal but only to replicate the fund returns. This is a
strong negative result. While it is true that it is possible to find alternative assets
that, when calibrated to the fund returns in a style analysis, give a positive view
of the same manager’s results, the fact that a simple strategy exists that beats the
fund returns is enough to call into question any fund manager’s ability.
• The fund manager’s returns are better than your replica strategy’s. In this case it
seems that the manager adds to the fund strategy something which you cannot
replicate. This is a hint in favor of the fund manager’s ability. It is a weak hint,
for the same reason the negative result is a strong one. The negative result is
strong because a simple strategy beats the fund manager’s; the positive result
is weak because the fund manager beats one simple strategy, but others could exist
which equal or even beat the fund manager’s strategy. In any case this is at least
a necessary condition for paying a fee greater than what the simple strategy costs.
The important point to remember here is that the result is relative to the strategy
and the asset classes used. No attempt is made to build optimal portfolios with the
given asset classes; only replica portfolios are built. The reader should think about the
possible extensions of procedures like style analysis, were a market model available.
A simple example of style analysis using my version of Sharpe’s data and three
well known US funds is in the worksheet style analysis.xls.
inside the same family. In a sense the comparison strategy is implicitly considered as
a mean of the strategy in the same asset class.
Another shadow of this can be found in the frequently stressed idea that the result
of any fund management must be divided between asset allocation and stock picking.
In common language this partitioning is not well defined, and asset allocation may
mean many different things: for instance, the choice of the market, the choice of
some sector, the choice of some index. Moreover there is no precise definition of how
to distinguish between asset allocation and stock picking. But it is clear that this
distinction, again, hints at some normal return, derived from asset allocation, and some
residual: stock picking.
The “benchmarking” idea is another crude version of the same: you try to separate
the fund manager’s ability from the overall market performance by devising a
benchmark which should summarize the market part of the fund manager’s strategy.
Market models can be seen as a step up the ladder. Here the benchmark idea is
expressed in a less naive way. Under the hypothesis that the market model holds and
is known, and that the beta (CAPM) or betas (APT) of the fund are known, the part of
the result due to the market factor(s) is to be ascribed to the overall fund strategic
positioning and, as such, its consequences are in principle a choice of the investor. Any
other over or under performance can be ascribed to the fund manager’s abilities and
private information.
As we mentioned above, this use of market models is greatly hampered by the fact
that the proposition “...the market model holds and is known and the beta (CAPM)
or betas (APT) of the fund are known” simply does not hold.
Now a few words on comparison criteria.
The classical Sharpe ratio considers the ratio of a portfolio’s return in excess of
a risk free rate to its standard deviation. Even in this form the Sharpe ratio is a
relative index: the fund performance is compared with a riskless investment. In general
this comparison is not a useful one. Typically our interest shall be to compare the
fund performance with a specific strategy which, in some instances, could be the best
possible replication of the fund’s returns accomplished using information available to
the investor. In many cases this reference strategy shall be a passive strategy (this
does not mean that the strategy is a buy and hold strategy, but that the strategy can
be performed by a computer following a predefined program).
As considered before, such a strategy could be provided, for instance, by some asset
pricing model (CAPM, APT etc.). In other cases the reference strategy could simply
be represented by the choice of a benchmark, used either in the unsophisticated way
where, implicitly, a beta of one is supposed (that is, at the numerator of the Sharpe
ratio take the difference between the returns of the fund and those of the benchmark)
or in the more sophisticated way of computing the alpha of a regression of the
return of the fund on the return of the benchmark.
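The three summaries just mentioned (Sharpe ratio, implicit beta-of-one excess return, regression alpha) can be sketched in Python. All numbers below are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative monthly data (made up): fund and benchmark returns in
# excess of the risk free rate.
n = 120
bench = rng.normal(0.004, 0.04, size=n)
fund = 0.001 + 0.9 * bench + rng.normal(0, 0.01, size=n)

# Classical Sharpe ratio: mean excess return over its standard deviation.
sharpe = fund.mean() / fund.std(ddof=1)

# Unsophisticated benchmark use: implicit beta of one.
naive_excess = (fund - bench).mean()

# More sophisticated use: alpha of a regression of fund on benchmark.
X = np.column_stack([np.ones(n), bench])
alpha, beta = np.linalg.lstsq(X, fund, rcond=None)[0]
```

With a true beta of 0.9, the naive beta-of-one comparison and the regression alpha give different verdicts on the same data, which is the point of distinguishing the two uses.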
Otherwise the reference strategy could be based on an ad hoc analysis of the history
of the fund under investigation. Style analysis is a way to implement this analysis.
Two relevant final points.
First: the comparison strategy should always be a choice of the investor. It is rather
easy, from the fund’s point of view, to choose as comparison a strategy or a benchmark
with respect to which the strategy of the fund is superior, at least in terms of alpha. This
is known as “Roll’s critique”. While the fact that the strategy chosen by the investor as
comparison is dominated by the fund strategy is admissible, as usually the fund does
not tune its strategy to this or that subscriber’s comparison strategy (at least this is true
if the subscriber is not big!), when it is the fund that chooses the comparison strategy a
conflict of interest is almost certain.
Second: once the part of the strategy due to the fund manager’s intervention has been
identified, a summary of it based on the Sharpe ratio or on Jensen’s alpha is only one
of the possible choices and strongly depends on the subscriber’s opinion on what is a
proper measure of risk and return.
strategy which shall attribute to the fund manager a positive contribution to the fund
result. On the contrary, temporary deviations of the return of one index from the
returns of the others shall result, in the comparison of the strategies, in favor of the
constant proportion strategy.36
A second critique, of theoretical interest but hardly relevant in practice, is connected
with Roll’s critiques of CAPM tests and, more generally, of CAPM based performance
evaluation. If the constant proportion strategy does not contain all the indexes required
for composing an efficient portfolio, any investment by the fund manager in
the relevant excluded indexes shall result in an over performance. This would be
relevant only if the evaluated fund manager knew, ex ante, the style model with which
his/her strategy shall be evaluated AND if the fund manager had more thorough
information on the structure of the efficient portfolio.
The point is that, while it is rather easy to compute an efficient portfolio ex post,
this is not so easy ex ante. Moreover, if we accept the idea that the style decomposition
depends on the information of the analyzer, this critique loses much of its force.
A third, and more subtle, critique can be raised against style analysis, as well as
against any OLS based factor model used for performance evaluation. If the model is
fitted to the fund returns, the variance (or sum of squares, if no intercept is used) of
the replicating strategy shall always be less than or equal to that of the fund returns.
In a CAPM or APT logic this is not a problem, since only non diversifiable risk should
be priced by the market. However, as stressed above, we are NOT in a CAPM or APT
world. With this lack of variance we are giving a possible advantage to the fund. Ways
of correcting this problem can be suggested and, in fact, performance indexes which
take this problem into account do exist. However, since, as we saw above, the positive
(for the fund) result is already a weak result in style analysis, this undervaluation of the
variance is only another step in the same direction: negative valuations are strong,
neutral or positive valuations could be challenged.
A last word of warning. Many data providers and financial consulting firms sell style
analysis. As far as I know, the advertising of commercial style models invariably asserts
the ability of such models to discover the true composition of the fund portfolio, and
most reports produced by style analysis programs concentrate on the time evolution
(estimated by some rolling window OLS regression) of portfolio compositions. This is
quite misleading (Sharpe is somewhat responsible, as in the original papers he seems
to share this opinion) and can be accepted only if interpreted as a loose way of stating
the true purpose of the method, that is: return replication. As far as I know,
the typical seller and user of style analysis, if not warned, tends to believe the “fund
composition” story. This false idea usually disappears after some debate, provided at
least that the user or seller is even marginally literate in simple quantitative methods.
36
In the case of a positive trend of, say, one index with respect to the rest of the portfolio, a buy
and hold strategy does not rebalance by selling some of that index and buying the rest of the
portfolio. In case of a further over performance of the index, the buy and hold portfolio shall over
perform the rebalanced portfolio. In the case of a negative trend of some index with respect to the
rest of the portfolio, the constant proportion strategy must buy some of the under performing index,
selling some of the rest of the portfolio; if the under performance continues, this implies
an over performance of the buy and hold strategy with respect to the constant relative proportion
strategy. On the contrary, a strategy investing in temporary losers (after the loss!) or disinvesting in
temporary winners shall outperform a buy and hold strategy in an oscillating market.
Examples
Exercise 10-Style Analysis.xls
We can see asset pricing models as tools devised to answer the kind of puzzles which
plots like this one may raise.
Among these are two of the oldest and most relevant questions of Finance:
1. In the market we see securities whose prices evolve in completely different ways.
There may even be securities with both lower mean returns and higher standard
deviations of returns than other securities. Why are all these securities,
with such apparently clashing statistical behaviours, still traded in equilibrium?
(Do not be puzzled by the fact that we speak of asset pricing models and write returns.
Given the price at time 0, the return between time 0 and time 1 determines the price
at time 1.)
We anticipate here the answers to these two questions given by asset pricing models:
1. Securities prices can be understood only when securities are considered within a
portfolio. Completely different (in terms, say, of means and variances of returns)
securities are traded because they contribute to improving the overall quality of
the portfolio (in the classic mean variance setting this boils down to the usual
diversification argument). What is relevant is not the total standard deviation of
each security but how much of it cannot be diversified away in a big portfolio; for
this reason the expected return of a security should be compared not with
its total standard deviation but only with the part of this standard deviation which
cannot be diversified away;
These are not the only observed properties of asset prices/returns that asset pricing models
try to account for. Another striking property is as follows: while thousands of securities
are quoted, there seems to be a very high correlation, on average, among their returns.
In a sense it is as if those many securities were “noisy” versions of a much smaller
number of “underlying” securities.
For instance, the 83 stocks of the S&P 100 displayed above show an average (simple)
correlation (over 20 years!) of .31. If we recall the discussion connected with the
spectral theorem and compute the eigenvalues of the covariance matrix of these returns,
while we see no eigenvalue equal to 0, the sum of the first 5 eigenvalues is greater than
50% of the sum of all eigenvalues, while the last, say, 50 eigenvalues account for about
15% of the total. The first eigenvalue alone is about 1/3 of the total. This suggests the
idea that, while not singular, the overall covariance matrix can be well approximated
by a singular covariance matrix.
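The eigenvalue computation just described can be sketched in Python. Since the actual S&P data are not reproduced here, the sketch uses simulated returns with a single strong common factor; the qualitative outcome (a dominant first eigenvalue) is the point, not the specific figures quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated returns with a strong common factor: a stand-in for the
# S&P data described in the text, which is not reproduced here.
n, m = 1000, 83
factor = rng.normal(0, 0.02, size=(n, 1))
returns = factor @ np.ones((1, m)) + rng.normal(0, 0.03, size=(n, m))

V = np.cov(returns, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(V))[::-1]   # decreasing order
shares = eigenvalues / eigenvalues.sum()

top1 = shares[0]          # share of total variance on the first component
top5 = shares[:5].sum()
```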
It should be clear that answering these questions and modeling the high
average correlation of returns is important for any asset manager; in fact, asset
pricing models are central to any asset management style not purely based on gut
feelings.
We can deal with these problems within a simple class of asset pricing models known
as “linear (risk) factor models”. Here we give some hints of how this is done in practice.
An asset pricing model begins with a “market model”, that is, a model which describes
asset returns (usually linear returns) as a function of “common factors” and
“idiosyncratic noise”. These models are, most frequently, linear models, and a typical
market model for the $1 \times m$ vector of excess returns $r_t$ observed at time $t$, the $1 \times k$
vector $f_t$ of “common risk factors” observed at time $t$ and the $1 \times m$ vector of errors $\epsilon_t$
is:

$$r_t = \alpha + f_t B + \epsilon_t$$

where $B$ is a $k \times m$ matrix of “factor weights” and $\alpha$ is a $1 \times m$ vector of constants.
We suppose we observe the vectors $r_t$ and $f_t$ for $n$ time periods. Stacking the
$n$ vectors of observations for $r_t$ and $f_t$ in the $n \times m$ matrix $R$ and the $n \times k$ matrix
$F$, and stacking the corresponding error vectors in the $n \times m$ matrix $\epsilon$, we suppose:
$E(\epsilon|F) = 0$, $V(\epsilon_t|F) = \Omega$ and $E(\epsilon_t' \epsilon_{t'}|F) = 0 \;\forall t \neq t'$. In order to give meaning to the
term “idiosyncratic”, the contemporaneous covariance matrix $\Omega$ is, as a rule, supposed
to be diagonal, typically with non equal variances.
It is relevant to stress the fact that such a time series model can be a good
explanation of the data on $R$ (for instance it may show a high $R^2$ for each return series)
while, at the same time, no asset pricing model is valid.
Let us recall that, if we estimate the market model with OLS (this may be done
security by security or even jointly), the OLS estimate of $\alpha$ can be written in a compact
way as

$$\hat{\alpha} = \bar{r} - \bar{f}\hat{B}$$

where $\bar{r}$ is the $1 \times m$ vector of average excess returns (one for each security, averaged
over time), $\bar{f}$ is the $1 \times k$ vector of average common factor values (again averaged
over time) and $\hat{B}$ is the $k \times m$ matrix of OLS estimated factor weights (one for each
factor for each security).
The expected value of this, under the above hypotheses, is:

$$E(\hat{\alpha}) = E(r) - E(f)B$$

As we shall see in a moment, an asset pricing model is valid if, supposing $\Omega$ diagonal,
we have that $\alpha = 0$.
This is usually written as:

$$E(r) = \lambda B$$

where $\lambda = E(f)$ is a $1 \times k$ vector of “prices of risk”; in a moment we shall see why
this name is used.
It is now important to stress that this restriction may hold, so that the asset pricing
model is valid, even though the time series model offers a very poor fit of $r$; or, on the
contrary, the fit could be very good and $\alpha \neq 0$.37
For asset management purposes, however, a good fit of the time series model
with $k << m$ could be very useful even when the asset pricing model does not
hold.
Suppose, for instance, you want to use a Markowitz model for your asset allocation.
In order to do this you need to estimate the variance covariance matrix of returns.
This requires the estimation of $m(m + 1)/2$ unknown parameters using $n$ observations
on returns. With a moderately big $m$ this could be a hopeless task.
Suppose now the market model works, at least in the time series sense, meaning
that the $R^2$ of each of the $m$ linear models is big. In this case the variances of the
errors are small and:

$$V(r_t) = B'V(f_t)B + \Omega \approx B'V(f_t)B$$
Let us now count the parameters we need to estimate the varcov matrix of the
excess returns with and without the market model. Without the market model, the
estimation of $V(r_t)$ requires the estimation of $m(m + 1)/2$ parameters, while with
the factor model it requires the estimation of $k \times m + k \times (k + 1)/2$ parameters, that is,
$B$ and $V(f_t)$. Suppose for instance $m = 500$ and $k = 10$: the direct estimation of $V(r_t)$
implies the estimation of 125250 parameters, while the (approximate) estimate based
on the factor model requires “only” 5000+55 parameters.
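The parameter count can be checked with a few lines of code; the formulas are the ones given in the text.

```python
# Parameter counts for estimating the covariance matrix of m excess returns,
# directly versus through a k factor model (formulas from the text).
def direct_params(m):
    return m * (m + 1) // 2              # symmetric m x m matrix

def factor_params(m, k):
    return k * m + k * (k + 1) // 2      # B plus V(f_t)

m, k = 500, 10
```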
The reader should notice that, even if the above assumptions on $V(r_t)$ are right,
the use of $B'V(f_t)B$ in place of the full covariance matrix shall imply an
underestimation of the variance of each asset return, which is going to be negligible only if all
the $R^2$ are big.
Let us move one step further. We must remember that our aim is the construction of
portfolios of securities with weights $w$ and excess returns $r_t w$.
In this case we are not necessarily interested in the full $V(r_t)$ but in the variance of the
portfolio

$$V(r_t w) = w'B'V(f_t)Bw + w'\Omega w$$

It is well possible that $w'\Omega w$ is small, so that $w'B'V(f_t)Bw$ is a good approximation
of $V(r_t w)$, even if it is not true that all the $R^2$ are big and, as a consequence, the diagonal
elements of $\Omega$ small.
Suppose that the weights $w$ of the different securities in this portfolio are all of the
order of $1/m$. This simply means that no single security dominates the portfolio.
37
Beware: what we just described as a possible test of an asset pricing model is useful for
understanding the loose interplay between the time series model and the asset pricing model, but it is,
typically, not a very efficient way, from the statistical point of view, to test the validity of an asset
pricing model.
We have, then,

$$w'\Omega w = \sum_{i=1}^{m} w_i^2 \omega_i \approx \frac{1}{m}\sum_{i=1}^{m} \frac{\omega_i}{m}$$

and this, with bounded, but not necessarily small, diagonal elements $\omega_i$ of $\Omega$, goes to
0 as $m$ goes to infinity.
This means that, for large, well diversified portfolios, “forgetting” $\Omega$ is irrelevant
even if its diagonal elements are not small. The hypothesis of a diagonal $\Omega$, that is,
of idiosyncratic $\epsilon_t$, is fundamental for this result.
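A quick numerical sketch of this diversification argument, with made-up bounded variances on the diagonal of $\Omega$ and equal weights of order $1/m$:

```python
import numpy as np

rng = np.random.default_rng(5)

# Bounded, not necessarily small, idiosyncratic variances; equally
# weighted portfolios of growing size (illustrative values).
def idiosyncratic_portfolio_variance(m):
    omega = rng.uniform(0.01, 0.05, size=m)   # diagonal of Omega
    w = np.full(m, 1.0 / m)                   # weights of order 1/m
    return float(np.dot(w * w, omega))        # w' Omega w for diagonal Omega

small = idiosyncratic_portfolio_variance(10)
large = idiosyncratic_portfolio_variance(10_000)
```

Even though the individual variances stay in the same range, the idiosyncratic part of the portfolio variance shrinks roughly like $1/m$.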
From this result we can shed some light on the reason why we should have $E(r) =
E(f)B = \lambda B$, that is: why an asset pricing model should hold.
In order to understand this, it is enough to compute the expected value and the
variance of our well diversified portfolio (notice the approximation sign for the variance):

$$E(r_t w) = (\alpha + \lambda B)w, \qquad V(r_t w) \approx w'B'V(f_t)Bw$$
Suppose now $\alpha \neq 0$. Recall that $B$ is a $k \times m$ matrix with (supposedly) $k << m$, and
that we can always suppose that the rank of $B$ is $k$ (if this is not the case we can reduce
the number of factors).
This implies that the matrix $B'V(f_t)B$ is an $m \times m$ matrix of rank $k < m$. The matrix
$B'V(f_t)B$ is then positive SEMI definite; this implies that there exist $m - k$ orthogonal
vectors $z$ such that $z'z = 1$ and $z'B'V(f_t)Bz = 0$.
According to what was discussed in the matrix algebra section and in the presentation
of the spectral theorem, under conditions we do not specify here, we can always build
from these a set of weights $w_\$$ such that $w_\$'\mathbf{1} = 1$ and $\alpha w_\$ > 0$.
You should understand the reason for the dollar sign: the vector $w_\$$
defines a zero risk portfolio (zero variance) with positive excess return $\alpha w_\$$ (since the
variance is zero, the expected excess return becomes the excess return).
In other words, we have created a risk free security (the portfolio) which yields a return
(arbitrarily) greater than the risk free rate. This is an “arbitrage”, as one could borrow
any amount of money at the risk free rate and invest it in the portfolio with a positive
profit and no risk (hence the $). Provided all the financial operations involved (building
the portfolio, borrowing money etc.) are possible, this should not happen if traders are
“reasonable” (and if they know of the existence of the factor model).
The only way to unconditionally (that is: whatever the choice of $w_\$$) avoid this is
that $\alpha = 0$, so that

$$E(r) = E(f)B = \lambda B$$
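The arbitrage construction can be sketched numerically: take vectors $z$ in the null space of $B$ and combine them into weights $w_\$$ with $w_\$'\mathbf{1} = 1$ and $\alpha w_\$ > 0$. The sketch below uses made-up values of $B$, $V(f_t)$ and $\alpha$, and imposes the two linear conditions by least squares (one of many possible ways to pick such a combination).

```python
import numpy as np

rng = np.random.default_rng(6)

# Sketch of the arbitrage argument: with alpha != 0 and rank(B) = k < m
# we can build a zero variance portfolio with positive excess return.
m, k = 8, 2
B = rng.normal(size=(k, m))                  # factor weights, rank k
Vf = np.diag([0.04, 0.02])                   # V(f_t), positive definite
alpha = rng.normal(0, 0.01, size=m)          # non zero alphas (made up)

# Null space of B: the m - k right singular vectors beyond the rank.
_, _, vt = np.linalg.svd(B)
N = vt[k:].T                                 # m x (m - k), so B @ N = 0

# Choose coefficients c so that w = N c sums to one and alpha w > 0:
# two linear conditions on c, solved by least squares.
A = np.vstack([np.ones(m) @ N, alpha @ N])
c, *_ = np.linalg.lstsq(A, np.array([1.0, 0.005]), rcond=None)
w = N @ c

portfolio_variance = w @ B.T @ Vf @ B @ w    # zero up to rounding
excess_return = alpha @ w                    # positive by construction
```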
Let us now give a “financial interpretation” of this result.
Since each element $\beta_{ji}$ of $B$ represents the “amount” of non diversifiable factor $f_j$
in the excess return of security $i$, and $E(f_j)$ represents the expected excess return of
a security which has a “beta 1” with respect to the $j$-th factor and zero with respect to
the others (if the factor $f_j$ is the excess return of a security, this could simply be the
expected excess return of that security, but this is not required), we may understand the
name “price of risk for factor $j$” used for $E(f_j) = \lambda_j$ and the name “risk premium for
factor $j$” given to the “price times quantity” product $\lambda_j \beta_{ji}$.
Now that we have a rough idea of how an asset pricing model works, it could be
useful to go back to the questions with which this section began and think a little
about how the answers come from the asset pricing model.
We should first notice that the approximation
give us a unified framework to precisely quantify and test the equilibrium price system
and to transform the statistical results into asset management tools.
Current practitioner models, widely used in the asset management industry for
asset allocation, risk management and budgeting, and performance evaluation, include,
in my experience, roughly 10 to 15 risk factors and are tuned to specific
asset classes, so that they do not pretend to be general market models.
All these models can, in principle, be dealt with by regression methods.
There is, however, a different attitude toward factor modeling.
This attitude attempts a representation of underlying unobserved factors based on
portfolios of securities which are not defined a priori but jointly estimated with the
model by optimizing some “best fit” criterion.
In order to do this, we need a joint estimation of $F$, the matrix of observations on
all factors, and $B$, the factor weights matrix.
A common starting point is to require the factors $f_t$ to be linear combinations
of excess returns: $f_t = r_t L$.
In principle there exist infinitely many choices for $L$. A unique solution can be chosen
only by imposing further constraints. Each choice of constraints identifies a different set
of factors.
Most frequently, factor models of this kind are based on the principal components
method or on variants of it.
The principal components method is a classic data reduction method of Multivariate
Statistics which has received a lot of new interest with the growth of “big data”.
In Finance, principal components have been used at least since the nineteen
sixties/seventies.
We can describe the procedure of “factor extraction” that is: the unique identifica-
tion/estimation of factors, in two different but equivalent ways.
Both methods require, implicitly or explicitly, an a priori, maybe very rough, esti-
mate of V (rt ). For this to be possible a fundamental assumption is that V (rt ) = V (r)
that is: the variance covariance matrix of excess total returns is time independent.
When this is not assumed to hold, more complex methods than simple principal
components are available but are well beyond the scope of these notes.
product and recalling that Λ is diagonal, we have:

XΛX' = Σ_i x_i x_i' λ_i

However:

x_j' V(r) = x_j' XΛX' = x_j' Σ_i x_i x_i' λ_i = λ_j x_j'

and

V(f_j) = V(r x_j) = x_j' XΛX' x_j = λ_j

so that:

β_j = x_j'

Let us now find V(r − f_j β_j):

V(r − f_j β_j) = XΛX' − λ_j x_j x_j' = Σ_{i≠j} x_i x_i' λ_i = X_{−j} Λ_{−j} X_{−j}'
[39] Here the regression is to be interpreted as the best approximation of r_i by means of a linear transformation of f_j. The intercept is included; see the next note.
[40] Notice that the definition of β_j employed here implies the use of an intercept. We have not mentioned it, since we are interested in the variance-covariance matrix of r, which is unaffected by the constant. In any case, the value of the constant 1 × m vector α is E(r) − E(f)β = 0.
where X_{−j} and Λ_{−j} are, respectively, the X matrix with column j dropped and the Λ matrix with row and column j dropped.
In other words, the covariance matrix of the “residuals” r − f_j β_j has the same eigenvectors and eigenvalues as the original covariance matrix, with the exception of the eigenvector and eigenvalue involved in the computation of f_j.[41]
This result is due to the orthogonality of factors[42] and has several interesting implications. We mention just three of these.
First: one-by-one “factor extraction”, that is, the computation of the f’s and corresponding residuals, yields the same results whether performed in batch or one by one.
Second: the result is invariant to the order of computation.
Third: once all factors are considered the residual variance is 0.
This last obvious result can be written as follows. If we set F = rX we have r = F X'. Grouping in F_q and X_q the first q factors and columns of X, and in F_{m−q} and X_{m−q} the rest of the factors and columns of X, we have:

r = Σ_{i=1}^m f_i x_i' = Σ_{i=1}^q f_i x_i' + Σ_{i=q+1}^m f_i x_i' = F_q X_q' + F_{m−q} X_{m−q}'
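The three implications above can be checked numerically in a few lines (a sketch on hypothetical simulated data; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
T, m = 400, 5
r = rng.normal(size=(T, m)) @ rng.normal(size=(m, m))

V = np.cov(r, rowvar=False)
lam, X = np.linalg.eigh(V)
order = np.argsort(lam)[::-1]
lam, X = lam[order], X[:, order]

F = r @ X                                 # all m factors
q = 2
resid_batch = r - F[:, :q] @ X[:, :q].T   # batch extraction of q factors

# One-by-one extraction, deliberately in a different order.
resid_seq = r.copy()
for j in (1, 0):
    resid_seq = resid_seq - np.outer(r @ X[:, j], X[:, j])

# With all m factors extracted, the residual is zero: r = F X'.
resid_all = r - F @ X.T
```

Batch and sequential extraction coincide (in any order), and the full-rank residual vanishes.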
number of components in F_1, their regression coefficients shall always be the same and correspond to the transposes of their eigenvectors (the first statement is a direct consequence of non-correlation and the second was demonstrated in the text). In matrix terms, the “linear model” estimated with OLS, r = F_1 B̂_1 + Û_1, holds with B̂_1 = X_1' and Û_1 = F_2 X_2'.
[42] Orthogonality here means that the factors are uncorrelated.
[43] It could be argued here that the expectation of e is not zero. Recall, on the other hand, that expected returns are typically nearer to zero than most observed returns, due to high volatility. This is particularly true when daily data are considered. Moreover, the non-zero mean effect is damped down by the “small” matrix X_{m−q}. Hence the expected value of e can be considered negligible.
to build such a representation of r. The question is whether, given a pre-specified model r = fB + ε, the above described method shall identify f, B and ε. The answer is: “in general, not”.
In fact the two formulas are only apparently similar and become identical only under some hypotheses. These are:
1. The dimension of f is q.
2. V(f) is diagonal.
3. BB' = I.
4. The rank of V(ε) is m − q and the maximum eigenvalue of V(ε) is smaller than the minimum element on the diagonal of V(f).
To these hypotheses we must add the already mentioned requirement that f and ε are orthogonal.
For any given fB the second and third hypotheses can always be satisfied if V(fB) is of full rank. In fact, in this case, it is always possible, using the procedure described above, to write fB = f̃B̃ where the required conditions hold for f̃B̃ (remember that, if the f are unobservable, there is a degree of arbitrariness in the representation).
Hypothesis 1 is more problematic: all we observe is r, and we do not know, a priori, the value of q.
But the most relevant (and interesting) hypothesis is that the rank of V(ε) is m − q and its eigenvalues are all less than the eigenvalues of V(fB). This may well not be the case, and in fact we could consider examples where ε is a vector orthogonal to the elements of f but V(ε) is of full rank and/or its eigenvalues are not all smaller than those of V(fB).
For instance: in classical asset pricing models (CAPM, APT and the like) the main
difference between residuals and factors is not that the variance contributed by the
factors to the returns is bigger than the variance contributed by “residuals” but that
factors are common to different securities, so that they generate correlation of returns, while residuals are idiosyncratic, that is: they should be uncorrelated across securities.
While principal component analysis guarantees zero correlation across different fac-
tors, residuals in the principal component method are by no means constrained to be
uncorrelated across different securities.
In fact, since the varcov matrix of residuals is not of full rank, some correlation between residuals must exist, and it shall in general be higher if many factors are used in the analysis.[44]
[44] Suppose the row vector z of k random variables has a varcov matrix A such that Rank(A) = h. Then at most h linear combinations of the elements of z can be uncorrelated. The proof is easy. Suppose a generic number g of uncorrelated linear combinations of z exist, and let these g linear combinations
While this is not the place for a detailed analysis of this important point, it is useful to introduce it as a way of remembering that r = F_q X_q' + E is, first of all, a representation of r and only under (typically non-testable) hypotheses an estimate of a factor model.
In our setting we need the representation in order to simplify the estimation of V(r); while the interpretation of the result as the estimate of a factor model is very useful when possible, the simple representation shall be enough for our purposes.
It should always be remembered that our purpose is not the precise estimation of each element of V(r). What we really hope for is a sensible estimate of the variance of reasonably diversified portfolios made with the returns in r. In this case, even if the estimate of V(r) is rough, it may well be that the estimate of a well diversified portfolio's variance is fairly precise since, by itself, diversification shall erase most of the idiosyncratic components in the variance covariance matrix.
This intuitive reasoning can be made precise but it is above the purpose of our
introductory course.
A last point of warning is required. If we use enough principal components, then F_q X_q' behaves almost as r (the R² of the regression is big). The “almost” clause is important. Suppose you invest in a portfolio with weights x_{q+1}/(x_{q+1}' 1_m), that is, a portfolio with correlation 1 with the first excluded component (the denominator of the weights is there in order to have the portfolio weights sum to 1). By construction the variance of this portfolio is λ_{q+1}/(x_{q+1}' 1_m)². However, the covariance of this portfolio with the included components is zero. In other words: if we measure the risk of any portfolio by computing its covariance with the set of q principal components included in the approximation of V(r), we shall assign zero risk to a portfolio correlated with one (or many) excluded components.
The practical implications of this are quite relevant, but a thorough discussion is outside the purpose of these handouts. However: beware!
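The warning can be made concrete with a small simulation (hypothetical data; numpy assumed): a portfolio proportional to the first excluded eigenvector has strictly positive variance, yet zero covariance with every included component.

```python
import numpy as np

rng = np.random.default_rng(2)
T, m, q = 600, 6, 3
r = rng.normal(size=(T, m)) @ rng.normal(size=(m, m))

V = np.cov(r, rowvar=False)
lam, X = np.linalg.eigh(V)
order = np.argsort(lam)[::-1]
lam, X = lam[order], X[:, order]

# Weights proportional to the first EXCLUDED eigenvector, summing to one.
x_next = X[:, q]
w = x_next / x_next.sum()

var_w = w @ V @ w                          # = lam[q] / (x_next' 1)^2 > 0
F_incl = r @ X[:, :q]                      # the q included components
cov_incl = np.cov(np.column_stack([r @ w, F_incl]), rowvar=False)[0, 1:]
```

A covariance-based risk measure built only on the q included components would therefore rate this (risky) portfolio as riskless.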
The question now is: we introduced the factors/components F in a somewhat arbitrary way, deriving them from the spectral theorem. Are there other justifications for them?
Here we follow a different path: we characterize each principal component (suitably renormalized) as a particular “maximum risk” portfolio, with the constraints that each component must be orthogonal to every other component and that the sum of squared weights be equal to one.
Linear combinations of returns are (up to a multiplicative constant) returns of (constant relative weights) portfolios.[45] Given a set of returns it is interesting to answer the question: what are the weights of the maximum variance linear combination of returns? (We repeat: this is not the same as the maximum variance portfolio.)
This problem is not well defined as the variance of any portfolio (provided it is not
0) can be set to any value by multiplying its weight by a constant.
One could suggest constraining the sum of weights to one; however, this does not solve the problem. Again, by considering multiples of the different positions, the requirement can be satisfied and the variance set to any number, at least if weights are allowed to be both positive and negative.
A possible solution is to set the sum of the absolute values of the weights to one. This would both solve the problem and have a financial meaning. Alas, this can be done, but only numerically.
Suppose instead we set the sum of squared weights to 1. This solves the bounding problem, with the inconvenience that the resulting linear combination shall in general not be a portfolio. But this choice yields an analytic solution.
Let us set up the mathematical problem. The Lagrangian is

L = θ'V(r)θ − λ[θ'θ − 1]

and setting to zero its derivative with respect to θ gives

V(r)θ − λθ = 0
[45] As hinted at in several places in these handouts, given a linear combination of returns there exist at least two ways of converting it into the return of a portfolio. If we only want the required portfolio to be perfectly correlated with the given linear combination, all that is needed is to renormalize the weights by dividing them by their sum (provided this is not zero). If we wish for a portfolio with the same weights (on risky assets) and the same variance as those of the linear combination, we must simply add to the linear combination the return of a risk free security, with weight equal to the difference between one and the sum of the linear combination’s weights. Notice that in this second case, while the (one time period) variance of the linear combination shall be the same as the variance of the portfolio return (the risk free security has no variance over a single time period), the expected value shall be different. In fact, if the weight of the risk free security is greater than zero, the expected value of the portfolio return shall be (with a positive return assumed for the risk free security) greater than the expected value of the linear combination of returns, and the opposite in case of a negative weight.
and

θ'θ = 1

Rearranging and using the spectral theorem we have:

[XΛX' − λI]θ = 0

We see that, if we set θ = x_j and λ to the corresponding λ_j, for any j we have a solution of the problem. Since V(r x_j) = λ_j, the solution to the maximum variance problem is given by the pair x_1 and λ_1, where, as usual, we suppose the eigenvalues sorted by size.
From what was discussed in the previous section, the other solutions can be seen as maximum variance linear combinations of returns, where the maximum is taken with the added constraint of orthogonality to the previously computed linear combinations.
We see that the components defined in a somewhat arbitrary way in the previous
section become now orthogonal (conditional) maximum variance linear combinations.
Σ_j x_j x_j' λ_j and, for the error, Σ_j e_j e_j' η_j. Our hope is that the highest of the error eigenvalues η_j is smaller than at least some of the eigenvalues of V(r). In this case the estimation error shall affect the overall quality of the estimate V̂(r), but only with respect to the lowest-eigenvalue components.
In summary: the principal components are defined as f_t = r_t X, where X contains the eigenvectors of the return covariance matrix. The principal components are uncorrelated return portfolios (recall that a constant coefficients linear combination of returns is the return of a constant relative proportion strategy; recall moreover that the sum of weights in the principal component portfolios is not one). The variances of the principal components are the eigenvalues corresponding to the eigenvectors which constitute the portfolio weights. We can derive a solution to the problem r_t = f_t B by simply setting B = X'. The percentage of variance of the j−th return due to each principal component can be computed by taking the squares of the elements of the j−th row of X, weighting each square by the corresponding eigenvalue, and dividing each element of the resulting vector by the sum of the vector itself.
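This summary translates into a few lines of numpy (simulated data; the variance shares use eigenvalue-weighted squared loadings, since Var(r_j) = Σ_i λ_i X[j, i]²):

```python
import numpy as np

rng = np.random.default_rng(4)
T, m = 500, 6
r = rng.normal(size=(T, m)) @ rng.normal(size=(m, m))

V = np.cov(r, rowvar=False)
lam, X = np.linalg.eigh(V)
order = np.argsort(lam)[::-1]
lam, X = lam[order], X[:, order]

f = r @ X                          # factors; B = X' solves r_t = f_t B

# Variance decomposition: Var(r_j) = sum_i lam_i * X[j, i]^2.
contrib = (X ** 2) * lam           # (j, i) entry: variance of r_j from component i
shares = contrib / contrib.sum(axis=1, keepdims=True)
```

Each row of `shares` sums to one and gives the fraction of that return's variance explained by each component.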
A simple PC analysis on a set of 6 stock return series can be found in the file “principal components.xls”. A more interesting dataset, containing total return indexes for 49 of the 50 components of the Eurostoxx50 index (weekly data), can be found in the file “eurostoxx50.xls”. Principal components were computed using the add-in MATRIX.
Examples
Exercise 11 - Principal Components.xls Exercise 11b - PC, Eurostoxx50.xls
12 Appendix: Some matrix algebra
12.1 Definition of matrix
A matrix A is an n-row, m-column array of elements; the elements are indicated by a_{i,j}, where the first index stands for the row and the second for the column. n and m are called the row and column dimensions (sometimes shortened to “the dimensions”) or sizes of the matrix A. Sometimes we write: A is an n×m matrix.
Sometimes a matrix is indicated as A ≡ {a_{ij}}.
When n = m we say the matrix is square.
When the matrix is square and aij = aji we say the matrix is symmetric.
When a matrix is made of just one row or one column it is called a row (column)
vector.
3. Matrix product. The product C = AB of an n×m matrix and a q×k matrix is defined if and only if m = q. If this is the case, C is an n×k matrix and c_{ij} = Σ_l a_{il} b_{lj}. In the matrix case it may well be that AB is defined but BA is not. An important property is C' = B'A' or, what is the same, (AB)' = B'A'. Provided the products and sums involved in what follows are defined, we have (A + B)C = AC + BC.
12.4 Some special matrix
1. A square matrix A with elements a_{ij} = 0 for i ≠ j is called a diagonal matrix.
2. A diagonal matrix with ones on the diagonal is called the identity and is indicated with I. IA = A and AI = A (whenever the product is defined).
x'Ax > 0 for all non-null x

If a matrix A can be written as A = C'C for some matrix C, then A is surely at least psd. In fact x'Ax = x'C'Cx, and this is the product of the row vector x'C' with its own transpose, hence a sum of squares, which cannot be negative. It is also possible to show that any psd matrix can be written as C'C for some C.
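A quick numerical check of this claim (random C; a sketch, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
C = rng.normal(size=(7, 4))
A = C.T @ C                    # psd by construction

# x'Ax = (Cx)'(Cx), a sum of squares, hence never negative.
quad_vals = [x @ A @ x for x in rng.normal(size=(1000, 4))]
min_quad = min(quad_vals)

# Equivalently, all eigenvalues of A are non-negative.
eig_min = np.linalg.eigvalsh(A).min()
```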
• The varcov matrix is symmetric; on the diagonal we have the variances (V(X_i) = σ²_{X_i}) of each element of the vector, while in the upper and lower triangles we have the covariances (Cov(X_i; X_j)).

• For a constant vector A and a constant matrix B: V(A + BX) = BV(X)B'
• From this property we deduce that varcov matrices are always (semi) positive
definite. In fact if A = V (z) and x is a (non random) column vector of the same
size as z, then V (x0 z) = x0 Ax which cannot be negative for any possible x.
• The correlation matrix ϱ(x) of the vector x of random variables is simply the matrix of correlation coefficients or, what is the same, the varcov matrix of the vector of standardized X_i.
• Zero correlation between two random variables is sometimes called linear independence or orthogonality. The reader should be careful using these terms, as they also exist in the setting of linear algebra but their meaning, even if connected, is slightly different. Stochastic independence implies zero correlation; the reverse proposition is not true.
∂x'Ax/∂x = 2Ax

and

∂x'q/∂x = q
The proof of these two formulas is quite simple. In both cases we give a proof for
a generic element k of the derivative column vector.
For the linear combination we have

x'q = Σ_j x_j q_j

so that

∂x'q/∂x_k = q_k
For the quadratic form we have

x'Ax = Σ_i Σ_j x_i x_j a_{i,j}

and, for the generic element k:

∂(Σ_i Σ_j x_i x_j a_{i,j})/∂x_k = Σ_{j≠k} x_j a_{k,j} + Σ_{i≠k} x_i a_{i,k} + 2x_k a_{k,k} = Σ_{j≠k} x_j a_{k,j} + Σ_{j≠k} x_j a_{k,j} + 2x_k a_{k,k} = 2A_{k,·} x

(equivalently, differentiating with respect to the row vector x', ∂x'Ax/∂x' = 2x'A), where A_{k,·} means the k−th row of A and we used the fact that A is a symmetric matrix.
An important point to stress is that the derivative of a function with respect to a vector always has the same dimensions as the vector with respect to which the derivative is taken, in this case x; so, for instance,

∂x'Ax/∂x = 2Ax

and not

∂x'Ax/∂x = 2x'A

(remember that A is symmetric).
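Both derivative rules can be checked against central finite differences (a sketch with random data; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
M = rng.normal(size=(n, n))
A = M + M.T                    # symmetric, as the text assumes
q = rng.normal(size=n)
x = rng.normal(size=n)

def num_grad(f, x, h=1e-6):
    """Central finite-difference gradient, one coordinate at a time."""
    g = np.zeros_like(x)
    for k in range(len(x)):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

g_quad = num_grad(lambda z: z @ A @ z, x)   # should equal 2Ax
g_lin = num_grad(lambda z: z @ q, x)        # should equal q
```

Note that both numerical gradients come out as vectors of the same dimension as x, as the text stresses.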
We have

(Ax − b)'(Ax − b) = x'A'Ax + b'b − 2b'Ax

Now let us take the derivative of this w.r.t. x:

∂/∂x [x'A'Ax + b'b − 2b'Ax] = 2A'Ax − 2A'b

(remember the rule about the size of a derivative vector). We now create a new linear system by equating these derivatives to 0:

A'Ax = A'b
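In numpy terms (a hypothetical overdetermined system), the normal equations reproduce the library least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 50, 3
A = rng.normal(size=(n, k))    # overdetermined: more equations than unknowns
b = rng.normal(size=n)

# Normal equations A'Ax = A'b (A'A is invertible since A has full column rank).
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from numpy's least-squares routine.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
```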
min_{w: 1'w = 1} w'V(r)w
In order to do this we define the Lagrangian of the problem, given by

w'V(r)w − 2λ(1'w − 1)

In this function the value of the unconstrained objective function is summed with the value of the constraint multiplied by a dummy parameter 2λ. We now take the derivatives of the Lagrangian w.r.t. w and λ:

∂/∂w (w'V(r)w − 2λ(1'w − 1)) = 2V(r)w − 2λ1

∂/∂λ (w'V(r)w − 2λ(1'w − 1)) = −2(1'w − 1)

Equating these to zero we get:

V(r)w = λ1

1'w = 1

Notice the difference between the 1-s. In the first equation 1 is a column vector, which is required because we cannot equate a vector to a scalar. The same holds for 1' in the second equation, where the r.h.s. is the scalar one (for dimension compatibility with the l.h.s.). We do not stress this by using, e.g., boldface for the vector 1, because the meaning follows unambiguously from the context.
It is clear that the second equation is satisfied if and only if w satisfies the constraint.
What is the meaning of the first equation (or, better, set of equations)? The unconstrained optimality condition would have been V(r)w = 0, whose only solution (since V(r) is invertible) would be w = 0. But this solution does not satisfy the constraint. What we shall be able to get is V(r)w = λ1, for some λ chosen in such a way that the constraint is satisfied.
To find this λ, simply put together the result of the first set of equations, w = λV(r)⁻¹1, and the equation expressing the constraint, 1'w = 1. Both are satisfied if and only if

λ = 1/(1'V(r)⁻¹1)
We now know λ; that is, we know by exactly how much we must violate the unconstrained optimization condition (first set of equations) in order to satisfy the constraint (second equation).
In the end, putting this value of λ in the solution for the first set of equations, we get

w = V(r)⁻¹1 / (1'V(r)⁻¹1)
It is to be noticed that these are only necessary conditions but, for our purposes,
this is enough.
What we got is the one-period “minimum variance portfolio” made of securities whose return covariance matrix is V(r).
What is the variance of this portfolio?
V(w'r) = w'V(r)w = [1'V(r)⁻¹/(1'V(r)⁻¹1)] V(r) [V(r)⁻¹1/(1'V(r)⁻¹1)] = 1'V(r)⁻¹1/(1'V(r)⁻¹1)² = 1/(1'V(r)⁻¹1)
The expected value shall be

E(w'r) = w'E(r) = 1'V(r)⁻¹E(r)/(1'V(r)⁻¹1)
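The whole computation can be sketched numerically (simulated invertible covariance; numpy assumed): build w = V⁻¹1/(1'V⁻¹1), check its variance formula, and verify that random weight vectors summing to one never do better.

```python
import numpy as np

rng = np.random.default_rng(8)
m = 5
C = rng.normal(size=(m, m))
V = C @ C.T + 0.1 * np.eye(m)      # an invertible covariance matrix

ones = np.ones(m)
Vinv1 = np.linalg.solve(V, ones)
w = Vinv1 / (ones @ Vinv1)         # minimum variance weights, summing to one

var_w = w @ V @ w                  # should equal 1 / (1'V^{-1}1)

# Competing random weight vectors, also renormalized to sum to one.
competitors = []
for _ in range(2000):
    u = rng.normal(size=m)
    u = u / u.sum()
    competitors.append(u @ V @ u)
```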
If V(r) is only psd, then it shall not be invertible, so that the system V(r)w = λ1 cannot be solved by simple inversion. In this case, however, there shall exist non-null vectors w* such that w*'V(r)w* = 0 and, using such w*, it shall be possible to build portfolios of the securities with (linear) return vector r, and maybe the risk free security, such that the return of such portfolios is risk free (zero variance) even if its components are risky. Such a riskless return must be equal to the risk free rate for no arbitrage to hold.
(y − Xb)'(y − Xb) = y'y + b'X'Xb − 2y'Xb
This simple application of the rule for the approximate solution of an overdetermined system yields the most famous formula in applied (multivariate) Statistics. When this problem, for the moment just a best fit problem, is immersed in the appropriate statistical setting, our b shall become the Ordinary Least Squares parameter vector, of paramount relevance in a wide range of applications to Economics and Finance.
A quick check
The following simple example deals with the relations and differences between probability concepts and statistical concepts.
Let us start from two simple concepts: the mean and the expected value.
You know that an expected value has to do with a probability model: you cannot
compute it if you do not know the possible values of a random variable and their
probabilities.
On the other hand an average or mean is a simpler concept involving just a set of
numerical values: you take the sum of the values and divide by their number.
Sometimes, if certain assumptions hold (e.g. iid data), an expected value can be
estimated using a mean computed over a given dataset.
Moreover, when a mean is seen not as an actual number (the sum of actually observed data divided by the number of summands) but as a sum of still unobserved, hence random, data divided by their number, the mean becomes a random quantity: being a function of random variables, it enters the field of Probability and needs a probability model for its description. As such, it is reasonable to ask for its probability distribution, expected value and variance. In fact, this is the study of “sampling variability” for an estimate. At the opposite end, probability distribution, expected value and variance have no interesting meaning for the mean of a given set of numbers, which has one and only one possible value.
This dualism between quantities computed on numbers and functions of random
variables is true for all other statistical quantities.
In the case of the mean/average vs. the expected value, we use (but not always) different names to stress the different roles of the objects we speak of. The same is done (usually) when we distinguish “frequency” and “probability”. To apply this to each statistical concept would be a little cumbersome and, in fact, it is not done in most cases. A variance is called a variance both when used in the probability setting and as a computation on
numbers; the same for moments, covariances, etc. Even the word “mean” is often used to indicate both expected values and averages. This is a useful shortcut but should not trick us into believing that the use of the same name implies identity of properties. Care must be used.
In the experience of any teacher of Statistics, the potential misunderstandings which can derive from an incomplete understanding of this basic point are at the origin of most of the problems students run into when confronted with statistical concepts.
Consider the following example and, even if you judge it trivial, dedicate some time
to really repeat and understand all its steps.
Suppose you observe the numbers 1,0,1. The mean of these is, obviously, 2/3. Is it
meaningful to ask questions about the expected value of each of these three numbers
or of the mean? Not at all, except in the very trivial case where the answer to this
question coincides with the actual observed numbers.
However, in most relevant cases the numbers we may observe are not predetermined.
They are obviously known after we observe data, but it is usually the case that we also
want to say something about their values in future possible observations (e.g. we must decide about taking some line of action whose actual result depends on the future values of observables; this is the basic setting in financial investments). We cannot do this
without the proper language, we need a model, written in the language of Probability,
able to describe the “future possible observations”.
For instance, we could think it sensible to assume that each single number we observe can only be either 0 or 1, that each possible observation has the same probability distribution over the possible results (P for 1 and 1 − P for 0), and that observations are independent, that is: the probability of each possible string of results is nothing but the product of the probabilities of each result in the string.
Since we only know that P is a number between 0 and 1, the mean computed above using data from the phenomenon so modeled (in this case equivalent to the “relative frequency” of 1) has a new role: it could be useful as an “estimate” of P.
Under our hypotheses, however, it is clear that the value 2/3 is only the value of
our mean for the observed data, it is NOT the value of P which is still an unknown
constant. We need something connecting the two.
The first step is to consider the possible values that the mean could have had on
other possible “samples” of three observations.
By enumeration these are 0, 1/3, 2/3, 1.
We can also compute, under our hypotheses the probabilities of these values. Since
a mean of 1 can happen only when we observe three ones, and since the three results
are independent and with the same probability P , we have that the probability of
observing a mean of 1 is P P P = P 3 . On the other hand a mean of 0 can only be
observed when we only observe zeroes, that is with probability (1 − P )3 . A mean of
1/3 can be obtained if we observe one 1 and two 0s. There are three possibilities for this: 1,0,0; 0,1,0 and 0,0,1. The respective probabilities (under our hypotheses) are
P(1 − P)(1 − P), (1 − P)P(1 − P) and (1 − P)(1 − P)P. The three possibilities exclude each other, so we can sum up the probabilities. In the end we have 3P(1 − P)². The same reasoning gives us the probability of observing a mean of 2/3, that is: 3P²(1 − P).
What we just did is to compute the “sampling distribution” of the mean seen as an
“operator” which we can apply to any possible sample.
This sampling distribution of the mean gives us all its (four) possible values (on
n = 3 samples) and their probabilities as functions of P .
Since we now have both the possible values and their probabilities we can compute
the expected value and the variance of this mean. This is the second step to take in
order to connect the estimate to the “parameter” P .
These computations shall give us information about how good an “estimate” the
mean can be of the unknown P. We would like the mean to have expected value P (unbiasedness) and as small a variance as possible, so as to be, “with high probability”, “near” the true but unknown value of P. What this last sentence means is, simply, that the probability of observing samples where the mean has a value near P should be big.
Formally, an expected value is very similar to a mean, with the difference that each value of the (now) random variable is multiplied by its probability, not its frequency.
The expected value shall be:
0 · (1 − P)³ + (1/3) · 3P(1 − P)² + (2/3) · 3P²(1 − P) + 1 · P³ = P
Notice the difference with respect to a mean computed on a given sample and be
careful not to mistake the point. The difference is not that the result is not a number
but an unknown “parameter”. It could well be that P is known and, say, equal to 1/3 so
that the result would be a number. The difference is that this quantity, the expected
value of the sample mean, is a probability quantity, has nothing to do with actual
observations and frequencies and has everything to do with potential observations and
probability. In fact, on each given sample we have a given value of the mean so its
expected value has a meaning only because we consider the variability of the values
of this mean on the POSSIBLE samples. However, the result is very useful both if P
is unknown and if it is known. When P is unknown, it tells us that the mean of the observed data shall be unbiased as an estimate of P. When P is known to be, say, 1/3, it gives P an “empirical connection” to an observable quantity, by asserting that the expected value of the mean of the observed data shall be 1/3.
The question, however, is: OK, this for the expected value of the sample mean. But how probable is it that the actual observed mean is “near” P?
Well, suppose for instance P = 1/3; we immediately see that the probability of observing a mean equal to 1/3 is 4/9, quite high, while the probability of observing a mean between 0 and 2/3 is 1 − 1/27, very near to 1.
For independent observations this is going to improve if n increases in the sense
that we shall find more and more probability in a given interval around 1/3.
However this computation, while straightforward, is quite cumbersome for big samples and requires some not fully elementary approximations such as, in this case, the central limit theorem.
A simpler, if less specific, answer to this question requires the computation of the (sampling) variance of our mean. By using the definition of variance and the probabilities already computed, we get for the variance of the mean:
P (1 − P )/3
The general case, for a sample of size not 3 but n shall be:
P (1 − P )/n
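The whole enumeration can be carried out exactly with rational arithmetic (a sketch for P = 1/3 and n = 3; the same code handles any rational P and small n):

```python
from fractions import Fraction
from itertools import product

P = Fraction(1, 3)
n = 3

# Sampling distribution of the mean over all 2^n possible samples.
dist = {}
for sample in product([0, 1], repeat=n):
    prob = Fraction(1)
    for s in sample:
        prob *= P if s == 1 else 1 - P
    mean = Fraction(sum(sample), n)
    dist[mean] = dist.get(mean, Fraction(0)) + prob

# Expected value and variance of the sample mean.
e_mean = sum(p * v for v, p in dist.items())
var_mean = sum(p * (v - e_mean) ** 2 for v, p in dist.items())
```

The exact results reproduce the text: E(mean) = P (unbiasedness), Var(mean) = P(1 − P)/n, and P(mean = 1/3) = 4/9 when P = 1/3.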
frequencies into information on probabilities) and the second “forecasting” (assessing
the probabilities of observations still to be made).
This longish example is not intended to teach you any new concept: with the possible exception of the Chebyshev inequality, all of this should already be known after a BA in Economics.
You can take it as follows: if all the concepts and steps in the example seem clear, even trivial, fine! All that follows in this course shall be quite easy.
On the other hand, if any step seems fuzzy or inconsequential, dedicate some more
time to a quick rehearsal of what you already did during the BA concerning Probability
and Statistics.
And for any problem ask your teachers.
13 Probability
13.1 Probability: a Language
• Probability is a language for building decision models.
• As all languages, it does not offer or guarantee ready-made splendid works of art (that is: right decisions) but simply a grammar and a syntax whose purpose is avoiding inconsistencies. We call this grammar and this syntax “Probability calculus”.
• On the other hand, any language makes it simple to “say” something, difficult to say something else, and there are concepts that cannot even be thought in any given language. So, no analysis of what we write in a language is independent of the structure of the language itself, and this is true for Probability too.
• The language is useful to deduce probabilities of certain events when other prob-
abilities are given, but the language itself tells us nothing about how to choose
such probabilities.
• For more general uncertainty situations, probability assessments may have some-
thing to do with prices paid for bets, provided you are not directly involved in
the result of the bet, except with regard to a very small sum of money.
• At different degrees of confidence, students in these fields would admit that, in principle, it could be possible to attempt a specific modeling of each instance of the phenomena they observe but that, in practice, such a model would require such impossible precision in the measurement of initial conditions and parameters as to be useless. Moreover, computations for solving such models would be unwieldy even in simple cases.
• For these reasons students in these fields are satisfied with a theory that avoids a
case by case description, but directly models possible frequency distributions for
collectives of observations and uses the probability language for these models.
13.5 Finance
• Finance people would admit that days are by no means all the same and that prices are not due to “luck” but to a very complex interplay of news, opinions, sentiments, etc. However, they admit that modeling this with useful precision is impossible and that, at a first level of approximation, days can be seen as similar, so that it is interesting to be able to “forecast” the frequency distribution of returns over a sizable set of days.
• The attitude is similar to that of Statistical Physics where, however, hypotheses of homogeneity of the underlying micro behaviours are easier to sustain. Moreover, while we could model a few particles in an exact way, we cannot do the same even with a single human agent.
(conditional) modeling of the distribution of lifespan in a big population and in
matching this with their insured population.
• Very different and much more interesting is the case of betting (horse racing, football matches, political elections, etc.). In this case the repeatability of events under similar conditions cannot be invoked as a justification for the use of probabilities, and this implies a different and interesting interpretation of probability which is beyond the scope of this summary.
• Weather forecasters, as all sensible forecasters (as opposed to foretellers), phrase their statements in terms of probabilities of basic events (sunny day, rain, thunderstorms, floods, snow, etc.). In countries where this is routinely done and weather forecasts are actually made in terms of probabilities (as in the UK and USA but not frequently in Italy), over time the meaning of, say, “60% probability of rain” and the usefulness of the concept have come to be understood by the general public (probability is not and should not be a mathematicians-only concept).
• Risk managers in any field (the financial one is a very recent example) aim at
controlling the probability of adverse events.
• Any big general store chain must determine the procedures for replenishing the
inventories given a randomly varying demand. This problem is routinely solved
by probability models.
• These are just examples of the applied fields where probability models and Statis-
tics are applied with success to the solution of practical problems of paramount
relevance.
• Probability models do not attempt a precise description of each observable instance
of a phenomenon but try, with the use of the non-empirical concept of probability,
to describe, directly and at the same time fuzzily, aggregate results: collective events.
• For this simple reason they are useful inasmuch as the decision payout depends, in
some sense, on collectives of events.
• They are not useful for predicting the result of the next coin toss but they are
useful for describing coin TOSSING.
• Sometimes, probability models are used in cases when the relevant event shall
happen only once or a few times.
• In this case the model shall be useful more for organizing our decision process than
for describing its outcome. “Correct” in this case means: “a good and consistent
summary of our opinions”.
• Points represent “atomic” verifiable propositions which, at least for the purposes
of the analysis at hand, shall not be derived from simpler verifiable propositions.
• Sets of such points simply represent propositions which are true any time any
one of the (atomic) propositions within each set is true.
• Notice that, while points must always be defined, it may well be the case that we
only deal with sets of these and that some or all of these points, while elements
of these sets, are not considered as sets by themselves. For instance, in rolling a standard
die we have 6 possible “atomic” results but we could be interested only in the
probability of non atomic events the like of “the result is an even number” or
“the result is bigger than 3”. Since probabilities shall be assigned to a chosen class
of sets of points and we shall call these sets “events”, it may well be that these
“events” do not include atomic propositions (which in common language would
graduate to the name “event”).
• Sets of points are indicated by capital letters: A, B, C, .... The “universe” set (rep-
resenting the sure event) is indicated with Ω and the empty set (the impossible
event) with ∅.
• Finite or enumerably infinite collections of sets are usually indicated with {A_i}_{i=1}^n
and with {A_i}_{i=1}^∞.
• Correct use of basic Probability requires the knowledge of the basic set theoretical
operations: A ∩ B (intersection), A ∪ B (union), A \ B (set difference), A^c
(complement, or negation) and their basic properties. The same is true for finite and
enumerably infinite unions and intersections: ∪_{i=1...n} A_i, ∪_{i=1...∞} A_i and so on.
• When the class of sets contains more than a finite number of sets, usually also
enumerably infinite unions of sets in the class are required to be sets in the class
itself (and so enumerable intersections, why?). In this case the class is called a
σ-algebra. The name “Event” is from now on used to indicate a set in an algebra
or σ-algebra.
13.12 Basic Results
• A basic result, implied in the above axioms, is that for any pair of events we
have: P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
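As a quick numerical illustration, the identity can be checked by enumeration on a finite sample space; the fair die and the events A (“even result”) and B (“result bigger than 3”) below are illustrative choices, not taken from the text:

```python
# Sketch of P(A ∪ B) = P(A) + P(B) - P(A ∩ B), checked by enumeration on a
# fair-die sample space with equiprobable atomic results.

omega = {1, 2, 3, 4, 5, 6}           # the universe set
A = {2, 4, 6}                        # "the result is even"
B = {4, 5, 6}                        # "the result is bigger than 3"

def P(event):
    return len(event) / len(omega)

lhs = P(A | B)                       # probability of the union
rhs = P(A) + P(B) - P(A & B)         # inclusion-exclusion
```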
• Another basic result is that if we have a collection of disjoint events {A_i}_{i=1}^n
(A_i ∩ A_j = ∅ ∀i ≠ j) and another event B such that B = ∪_{i=1}^n (A_i ∩ B), then we
can write: P(B) = Σ_{i=1}^n P(B ∩ A_i).
• If we require, and we usually do, the conditioning event to have positive probabil-
ity: P(B) ≠ 0, this solution is unique and we have: P(A|B) = P(A ∩ B)/P(B).
• If, moreover, ∪_{i=1...n} A_i = Ω, we have:

P(A_i|B) = P(B|A_i)P(A_i) / Σ_{i=1}^n P(B|A_i)P(A_i)
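A minimal sketch of this formula (Bayes' theorem) for a finite partition {A_1, ..., A_n} of Ω; the priors P(A_i) and likelihoods P(B|A_i) below are made-up illustrative numbers:

```python
# Bayes' theorem on a three-element partition of Omega.

priors = [0.5, 0.3, 0.2]            # P(A_i): must sum to 1
likelihoods = [0.9, 0.5, 0.1]       # P(B | A_i)

# denominator: total probability P(B) = sum_i P(B|A_i) P(A_i)
p_b = sum(l * p for l, p in zip(likelihoods, priors))

# posterior P(A_i | B) for each element of the partition
posteriors = [l * p / p_b for l, p in zip(likelihoods, priors)]
```

By construction the posteriors sum to 1, since the denominator is exactly the total probability of B.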
• Not all such functions are considered random variables. For X(.) to be a random
variable we require that for any real number t the set Bt given by the points ω
168
in Ω such that X(ω) ≤ t is also an event, that is: an element of the algebra (or
σ-algebra).
• The reason for this requirement (whose technical name is: “measurability”) is that
a basic tool for modeling the probability of values of X is the “probability distri-
bution function” (PDF) (sometimes “cumulative distribution function” CDF) of
X defined for all real numbers t as: F_X(t) = P({ω : X(ω) ≤ t}) = P(B_t) and,
obviously, in order for this definition to have a meaning, we need all Bt to be
events (that is: a probability P (Bt ) must be assessed for each of them).
13.19 Density
In the absolutely continuous case we define the probability density function of X as:
f_X(t) = ∂F_X(s)/∂s |_{s=t} where this derivative exists, and we complete this function in an
arbitrary way where it does not. Any choice of completion shall have the property:

F_X(t) = ∫_{−∞}^{t} f_X(s) ds.
13.20 Probability Function
In the discrete case we call “support” of X the at most enumerable set of values x_i
corresponding to discontinuities of F; we indicate this set with Supp(X) and define
the probability function P_X(x_i) = F_X(x_i) − lim_{h↑0} F_X(x_i + h) for all x_i ∈ Supp(X),
with the agreement that such a function is zero on all other real numbers. In simpler
but less precise words, P_X(.) is equal to the “jumps” of F_X(.) on the points x_i where these
jumps happen and zero everywhere else.
and

E(G) = Σ_{x_i ∈ Supp(X)} G(x_i) P_X(x_i)
If G is the identity function G(t) = t the expected value of G is simply called the
“expected value”, “mathematical expectation”, “mean”, “average” of X.
• If G(X) is the function I(X ∈ A) for a given set A, which is equal to 1 if X =
x ∈ A and 0 otherwise (the indicator function of A), then E(G(X)) = P(X ∈ A).
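The two expectations above can be sketched numerically; the fair six-sided die and the set A = {even results} below are illustrative assumptions:

```python
# E(G(X)) as a sum over the support of a discrete X.

support = [1, 2, 3, 4, 5, 6]
p = {x: 1 / 6 for x in support}                    # P_X(x_i), fair die

def expect(G):
    return sum(G(x) * p[x] for x in support)

mean = expect(lambda x: x)                         # identity G: E(X) = 3.5
p_even = expect(lambda x: 1 if x % 2 == 0 else 0)  # indicator of A = {2, 4, 6}
# p_even equals P(X in A) = 1/2, as stated in the bullet above
```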
13.23 Variance
• The “variance” of G(X) is defined as V(G(X)) = E((G(X) − E(G(X)))²) =
E(G(X)²) − E(G(X))².
• The square root of the variance is called “standard deviation”. For these two
quantities the symbols σ² and σ are often used (with or without the subscripted
name of the variable).
• Since the inequality is sharp, that is: it is possible to find a distribution for
which the inequality becomes an equality, this implies that, for instance, 99%
probability could require a ± “10 σ” interval.
• These simple points have a great relevance when tail probabilities are computed
in risk management applications.
• In popular literature about extreme risks, and also in some applied work, it is
common to ask for a “six sigma” interval. For such an interval the Tchebicev
bound is 1 − 1/36 ≈ 97.2%.
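The bound behind these figures is simply 1 − 1/k² for a ±kσ interval; a minimal sketch:

```python
# The Tchebicev lower bound P(|X - mu| < k*sigma) >= 1 - 1/k**2,
# evaluated at a few multiples of sigma.

def tchebicev_lower_bound(k):
    return 1 - 1 / k ** 2

six_sigma = tchebicev_lower_bound(6)    # 1 - 1/36, about 0.972
ten_sigma = tchebicev_lower_bound(10)   # 0.99: 99% guaranteed only at +/- 10 sigma
```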
13.26 *Gauss Inequality
This result is an extension of a result by Gauss, who stated that if m is the mode (mind:
not the expected value; in this lies the V-P extension) of a unimodal random variable,
then

P(|X − m| < λτ) ≥ 1 − 4/(9λ²)   if λ ≥ 2/√3
P(|X − m| < λτ) ≥ λ/√3          if 0 ≤ λ ≤ 2/√3

where τ² = E((X − m)²).
13.28 Quantiles
• The “α-quantile” of X is defined as the value q_α such that the following conditions
are simultaneously valid:

Pr[X < q_α] ≤ α
Pr[X ≤ q_α] ≥ α

• Notice that in the case of a random variable with continuous F_X(.) this definition
could be written as q_α ≡ inf(t : F_X(t) = α) and, in the case of a continuous
strictly increasing F_X(.), it becomes q_α ≡ t : F_X(t) = α.
• For a non continuous F_X(.), in case α is NOT one of the values taken by F_X(.), the
above definition corresponds to a value x of X with F_X(x) > α.
• Due to applications in the definition of VaR it is more proper to use as quantile,
in this case, a q_α defined as the maximum value x of X with positive probability
and with F_X(x) ≤ α.
• The formal definition of this is rather cumbersome:

q_α ≡ max {x : [Pr[X ≤ x] > Pr[X < x]] ∩ [Pr[X ≤ x] ≤ α]}

• which reads: “q_α is the greatest value of x such that it has a positive probability
and such that F_X(x) ≤ α”.
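A sketch of the two defining conditions on a small discrete distribution (the support and probabilities below are made up):

```python
# The alpha-quantile of a discrete X via the conditions
# Pr[X < q] <= alpha and Pr[X <= q] >= alpha.

support = [0, 1, 2, 3]
probs = [0.1, 0.4, 0.3, 0.2]

def cdf(t):
    return sum(p for x, p in zip(support, probs) if x <= t)

def quantile(alpha):
    # smallest support point with F(x) >= alpha; since F jumps only on the
    # support, the condition Pr[X < q] <= alpha is then satisfied as well
    for x in support:
        if cdf(x) >= alpha:
            return x

median = quantile(0.5)   # F(0) = 0.1 < 0.5 and F(1) = 0.5 >= 0.5, so q = 1
```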
13.29 Median
• If α = 0.5 we call the corresponding quantile the “median” of X and use for it,
usually, the symbol Md .
13.30 Mode
• A mode in a discrete probability distribution (or frequency distribution) is any
value of x ∈ Supp(X) where the probability (frequency) has a local maximum.
• In the case of densities, the same definition is applied in terms of density instead
of probability (frequency).
• Poisson: P(x) = λ^x e^{−λ}/x!, x = 0, 1, 2, ...; 0 ≤ λ. We have: E(X) = λ; V(X) = λ.
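A quick numerical check of E(X) = V(X) = λ; λ = 2.5 is an arbitrary choice, and the infinite support is truncated where the tail is astronomically negligible:

```python
# Numerical check of the Poisson mean and variance.
from math import exp, factorial

lam = 2.5
xs = range(60)                                    # truncation of 0, 1, 2, ...
p = [lam ** x * exp(-lam) / factorial(x) for x in xs]

mean = sum(x * px for x, px in zip(xs, p))
var = sum((x - mean) ** 2 * px for x, px in zip(xs, p))
# both mean and var come out equal to lam up to truncation/rounding error
```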
• The name “random vector” is better than the name “vector of random variables” in
that, while each element of a random vector is, in fact, a random variable, a simple
vector of random variables could fail to be a random vector if the arguments ωi
of the different random variables are not constrained to always coincide.
• (If you understand this apparently useless subtlety you are well on your road to
understanding random vectors, random sequences and stochastic processes).
13.37 Distribution Function for a Random Vector
• Notions of measurability analogous to the one dimensional case are required for
random vectors, but we do not mention these here.
• If we wish to find the distribution of all the elements of the vector minus, say,
the i-th element, we simply sum (or integrate) the joint probability function
(density) over all possible values of the i-th element.
13.40 Conditioning
• Conditional probability functions and conditional densities are defined just like
conditional probabilities for events.
• Obviously, the definition should be justified in a rigorous way but this is not
necessary, for now!
• The conditional probability function of, say, the first i elements in a random
vector given, say, the other n − i elements shall be defined as:

P(x_1, ..., x_i | x_{i+1}, ..., x_n) = P(x_1, ..., x_n) / P(x_{i+1}, ..., x_n)

f(x_1, ..., x_i | x_{i+1}, ..., x_n) = f(x_1, ..., x_n) / f(x_{i+1}, ..., x_n)
• We write this for the density case; for the probability function it is the same:
• This must be true for all the possible values of the n elements of the vector.
• Again, this must be true for all possible (x_1, ..., x_n). (Notice: the big subscript
added to the uni dimensional density to distinguish among the variables, and
the lower-case x_i which are possible values of the variables).
13.43 Conditional Expectation
• Given a conditional probability function P (x1 , ..., xi |xi+1 , ...xn ) or a conditional
density f (x1 , ..., xi |xi+1 , ...xn ) we can define conditional expected values of, in
general, vector valued functions of the conditioned random variables.
• Something the like of E(g(x_1, ..., x_i)|x_{i+1}, ..., x_n) (the expected value is defined
exactly as in the uni dimensional case by a proper sum/series or integral opera-
tor).
• Where, in order to understand the formula, we must remember that the outer
expected value in the left hand side of the identity is with respect to (wrt) the
marginal distribution of the conditioning variables vector: (xi+1 , ...xn ), while the
inner expected value of the same side of the identity is wrt the conditional distri-
bution. Notice that, in general, this inner expected value: E(g(x1 , ..., xi )|xi+1 , ...xn )
is a function of the conditioning variables (the conditioned variables are “inte-
grated out” in the operation of taking the conditional expectation) so that it is
meaningful to take its expected value with respect to the conditioning variables.
• The expected value on the right hand side, however, is with respect to the
marginal distribution of the conditioned variables (x1 , ..., xi ).
• In the simplest case of two vectors we have: EY (EX|Y (X|Y)) = EX (X). For the
conditional expectation value, wrt the conditioned vector, all the properties of
the marginal expectation hold.
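The identity EY (EX|Y (X|Y)) = EX (X) can be verified on a small joint probability table; the two binary variables and all the entries below are made-up illustrative numbers:

```python
# Law of iterated expectations on a 2x2 joint distribution.
# Keys are (x, y) pairs; values are joint probabilities P(X = x, Y = y).

joint = {(0, 0): 0.2, (0, 1): 0.1,
         (1, 0): 0.3, (1, 1): 0.4}

xs = {x for x, _ in joint}
ys = {y for _, y in joint}

p_y = {y: sum(joint[x, y] for x in xs) for y in ys}       # marginal of Y

# the "catalog" of conditional expectations E(X|Y = y), one entry per y
e_x_given_y = {y: sum(x * joint[x, y] for x in xs) / p_y[y] for y in ys}

lhs = sum(e_x_given_y[y] * p_y[y] for y in ys)            # E_Y(E(X|Y))
rhs = sum(x * sum(joint[x, y] for y in ys) for x in xs)   # E(X) from the marginal
```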
• A tricky topic: conditional expectation is, in general, a “static” concept. For any
GIVEN set of values of, say, Y you compute EX|Y (X|Y). However, implicitly,
the term “regression function” implies the possibility of varying the values of the
conditioning vector (or variable). This must be taken with the utmost care as it
is at the origin of many misunderstandings, in particular with regard to “causal
interpretations” of conditional expectations. The best, if approximate, idea to
start with is that EX|Y (X|Y) gives us a “catalog” of expected values each valid
under given “conditions” Y, be it or not be it possible or meaningful to “pass”
from one set of values of Y to another set.
• From the above definition we get that, for any set of constants a, b, c, d:
Cov(a + bX, c + dY) = bd Cov(X, Y).
• An important result (the Cauchy inequality) allows us to show that |Cov(X, Y)| ≤
√(V(X)V(Y)). From this we derive a “standardized covariance” called the “correlation
coefficient”: Cor(X, Y) = Cov(X, Y)/√(V(X)V(Y)).
• We have Cor(a + bX, c + dY) = Sign(bd)Cor(X, Y).
• The square of the correlation coefficient is usually called R square or rho square.
• Notice that regressive independence, even only unilateral, implies zero covariance
and zero correlation; the converse, however, is in general not true.
• Then F_{X(1)}(t) = ∏_{i=1}^{n} F_{X_i}(t) and F_{X(n)}(t) = 1 − ∏_{i=1}^{n} (1 − F_{X_i}(t)) and, in
the identically distributed case,

F_{X(n)}(t) = 1 − ∏_{i=1}^{n} (1 − F_{X_i}(t)) = 1 − (1 − F(t))^n.
• But the max is less than or equal to t if and only if each random variable is less
than or equal to t.
• Since they are independent, this is given by the product of the F_{X_i}, each computed
at the same point t, that is F_{X(1)}(t) = ∏_{i=1}^{n} F_{X_i}(t).
• For the min: 1 − F_{X(n)}(t) is the probability that the min is greater than t. But
this is true if and only if each of the n random variables has a value greater than t,
and for each random variable this probability is 1 − F_{X_i}(t). They are independent,
so...
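A Monte Carlo sketch of both formulas for n iid Uniform(0,1) variables, where F(t) = t, so the max has CDF t^n and the min has CDF 1 − (1 − t)^n; the sample size, number of replications and evaluation point are arbitrary choices:

```python
# Monte Carlo check of the distributions of the max and the min.
import random

random.seed(0)
n, reps, t = 5, 100_000, 0.7

count_max = count_min = 0
for _ in range(reps):
    sample = [random.random() for _ in range(n)]
    count_max += max(sample) <= t
    count_min += min(sample) <= t

# count_max / reps is close to t**n, count_min / reps to 1 - (1 - t)**n
```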
13.51 Distribution of the sum of independent random variables
and central limit theorem
• Let {X_1, ..., X_n} be independent random variables. Let S_n = Σ_{i=1}^{n} X_i be their
sum.
• To be more precise: the first property is always valid, whatever the dependence,
provided the expected values exist, while the second only requires zero correlation
(provided the variances exist).
• If we knew the distributions of the Xi we could (but this could be quite cumber-
some) compute the distribution of the sum.
• However, if we do not know (better: do not make hypotheses on) the distributions
of the X_i, we can still prove a powerful and famous result which, in its
simplest form, states:
• In practice this means that, under the hypotheses of this theorem, if “n is big
enough” (a sentence whose meaning should be, and can be, made precise) we
can approximate F_{S_n}(s) with Φ((s/n − µ)/(σ/√n)).
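A sketch of this approximation for the sum of n fair-die rolls, compared with a Monte Carlo estimate; all the numbers below (n, the number of replications, the evaluation point s) are illustrative choices:

```python
# CLT approximation F_Sn(s) vs. a Monte Carlo estimate for dice sums.
import math
import random

random.seed(1)
n, reps, s = 100, 20_000, 360
mu, sigma = 3.5, math.sqrt(35 / 12)      # mean and sd of a single fair-die roll

def phi(z):                              # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

clt = phi((s / n - mu) / (sigma / math.sqrt(n)))
mc = sum(sum(random.randint(1, 6) for _ in range(n)) <= s
         for _ in range(reps)) / reps
# clt and mc are close (the gap shrinks further with a continuity correction)
```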
• This result is fundamental in statistical applications where confidence levels for
confidence intervals or sizes of errors for tests must be computed in non standard
settings.
Statistical inference
13.54 Why Statistics
• Probabilities are useful when we can specify their values. As we saw above,
sometimes, in finite settings (coin flipping, dice rolling, card games, roulette, etc.),
it is possible to reduce all probability statements to simple statements judged, by
symmetry properties, equiprobable.
• In these cases we say we “know” probabilities (at least in the sense that we agree on
their values and, as a first approximation, do not look for some “discovery rule” for
probabilities) and use these for making decisions (meaning: betting). In other
circumstances we are not so lucky.
• However, from the didactic point of view, it is useful to show that the ”problem”
is there even with simple physical “randomizing devices” when their “shape” does
not allow for simple symmetry statements.
• Consider for instance rolling a pyramidal “die”: this is a five sided object with
four triangular sides and one square side. In this case what is the probability
for each single side to be the down side? For some news on dice see
http://en.wikipedia.org/wiki/Dice
• Just by observing different pyramidal dice we could surmise that the relative
probability of the square face and of the four triangular faces depends, also, on
the effective shape of the triangular faces. We could hypothesize, perhaps, that
the greater the height of such faces, the bigger the probability for a triangular
face to be the down one in comparison to the probability for the square face.
• For instance we could observe, not simply hypothesize, that (for a pyramid made
of some homogeneous substance) the more peaked are the triangular sides (and
so the bigger their area for a given square basis of the pyramid) the smaller the
probability for the square side to be the one down after the throw. We could also
observe, directly or by mind experiment, that the “degenerate” pyramid having
height equal to zero is, essentially, a square coin so that the probability of each side
(the square one and the one which shall transform into the four triangles) should
be 1/2. From these two observations and some continuity argument we could
conclude that there should be some unknown height such that the probability of
falling on a triangular side is, say, 1 > c ≥ 1/2 and, by symmetry, the probability
for each triangular side, is c/4. This provided there is no cheating in throwing
the die so that the throw is “chaotic” enough. So, beware of magicians!
• Conversely, suppose you know, from previous analysis, that, for a pyramid made
of homogeneous material, a good approximation is c proportional to the height.
In this case a good test to assess the homogeneity of the material with which the
pyramid is made could be that of throwing several pyramids of different height
and seeing if the ratio between the frequency of a triangular face and the height of
the pyramid is constant.
13.57 No Symmetry
• Consider now a different example: horse racing. Here the event whose probability
we are interested in is, to be simple, the name of the winner.
• It is “clear” that symmetry arguments here are useless. Moreover, in this case
even the use of past data cannot mimic the case of the pyramid: while the observation
of past race results could be relevant, the idea of repeating the same race a
number of times in order to derive some numerical evaluation of probability is
both impractical and, perhaps, even irrelevant.
13.58 No Symmetry
• What we may deem useful are data on past races of the contenders, but these
data regard different track conditions, different tracks and different opponents.
• Moreover they regard different times, hence, a different age of the horse(s), a
different period in the years, a different level of training, and so on.
• History, in short.
• This notwithstanding, people have bet, and bet hard, on such events since time
immemorial. Where do their probabilities come from?
• An interesting point to be made is that, in antiquity, while betting was even more
common than it is today (in many cultures it had a religious content: looking for
the favor of the gods), betting tools, like dice, existed in a very rudimentary form
with respect to today. We know examples of fantastically “good” dice made of
glass or amber (many of these being not used for actual gambling but as offerings
to the Deity). These are very rare. The most commonly used die came from a
roughly cubic bone of a goat or a sheep. In this case symmetry arguments were
impossible and experience could be useful.
• This is the typical case in simple games of chance (not in the, albeit still simple,
pyramidal dice case).
• If we want to use probability when numerical values for probability are not eas-
ily derived, we are going to be uncertain both on uncertain results and on the
probability of such results.
• We can do nothing (legal) about the results of the game, but we may do something
for building some reasonable way for assessing probabilities. In a nutshell this is
the purpose of Statistics.
• The basic idea of Statistics is that, in some cases, we can “learn” probabilities from
repeated observations of the phenomena we are interested in.
• We may agree that the probability for each face to be the down one in repeated
rollings of the die is constant, unknown but constant.
• Moreover, we may accept that the order with which results are recorded is, for
us, irrelevant as “experiments” (rolls of the dice) are made always in the same
conditions.
• We, perhaps, shall also agree that the probability of each triangular face is the
same.
• These are going to be non negative numbers (Probability Theory requires this);
moreover, if we agree with the statement about their identity, each of these values
must be equal to the same θ, so the total probability for a triangular face to be
the down one shall be 4θ.
• By the rules of probability, the probability for the square face is going to be 1 − 4θ
and, since this cannot be negative, we need θ ≤ .25 (where we perhaps shall avoid
the equality in the ≤ sign).
13.62 Pyramidal Die Constraints
• All these statements come from Probability Theory together with our assumptions
on the phenomenon we are observing.
• In other, more formal, words we specify a probability model for each roll of the
die and state this:
• In each roll we can have a result in the range 1,2,3,4,5;
• The probability of each of the first four values is θ and this must be a number
not greater than .25.
• With just these words we have hypothesized that the probability distribution of
each result in a single toss is an element of a simple but infinite and very specific
set of probability distributions completely characterized by the numerical value
of the “parameter” θ which could be any number in the “parameter space” given
by the real numbers between 1/8 and 1/4 (left extreme included if you like).
• The joint probability of the observed sample is thus (4θ)^60 (1 − 4θ)^40.
• Let us forget, for the moment, this subtlety which is going to be relevant in what
follows. We have the probability of the observed sample, since the sample is
given, the only thing in the formula which can change value is the parameter θ.
• Notice that this value maximizes (4θ)^60 (1 − 4θ)^40, the probability of observing
the given sample (or any given specific sample containing 40 5s and 60 non 5s),
but it also maximizes C(100, 40) (4θ)^60 (1 − 4θ)^40, that is: the probability of observing A
sample in the set of samples containing 40 5s and 60 non 5s (here C(n, k) denotes
the binomial coefficient). (Be careful in understanding the difference between “the
given sample” and “A sample in the set”; moreover notice that C(100, 40) = C(100, 60).)
• The suggested estimate is the value θb which maximizes the joint probability
of observing the sample actually observed. In other words we estimated the
unknown parameter according to the maximum likelihood method.
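A sketch of this maximization on a grid of admissible θ values; the analytic maximum of (4θ)^60 (1 − 4θ)^40 is at θ̂ = 60/400 = 0.15, the number of non-5s divided by 4n:

```python
# Grid search for the maximum likelihood estimate of theta.

def likelihood(theta):
    return (4 * theta) ** 60 * (1 - 4 * theta) ** 40

# admissible parameter values: 0 < theta < 0.25
grid = [k / 100_000 for k in range(1, 25_000)]
theta_hat = max(grid, key=likelihood)        # 0.15, matching 60/400
```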
• In fact, if we roll the die another 100 times and compute the estimate with the
same procedure, most likely a different estimate shall come up; and for another
sample, another one, and so on.
• Statisticians do not only find estimates; most importantly they study the worst
enemy of someone who must decide under uncertainty and unknown probabil-
ities: sampling variability.
• Since in each roll we either get a 5 or a non 5 the total number of these possible
samples is 2100 .
• On each of these samples our estimate could take a different value; consider,
however, that the value of the estimate only depends on how many 5s and non 5s
were observed in the specific sample (the estimate is the number of non 5s divided
by 4 times 100).
• So the probability of observing a given value of the estimate is the same as the
probability of the set of samples with the corresponding number of 5s.
• If the number of 5s is, say, k we find that the probability of the generic sample
with k 5s and 100 − k non 5s is (see above): (4θ)^{100−k} (1 − 4θ)^k.
• We can insert the first 5 in any of 100 places, the second in any of 99 and so on.
• The reader shall realize that, for each given value of θ the a priori (of sampling)
most probable value of the estimate is the one corresponding to the integer num-
ber of 5s nearest to 100(1 − 4θ) (which in general shall not be integer).
• Since our procedure is to estimate θ with (100 − k)/400, this immediately implies that, in
the case the observed sample is not the most probable for that given θ, the value
of the estimate shall NOT be equal to θ; in other words it shall be “wrong”, and
the reason for this is the possibility of observing many different samples for each
given “true” θ, that is: sampling variability.
• In general, using the results above, for any given θ, the probability of observing a
sample of size n which gives as an estimate (n − k)/(4n) is (as above)
C(n, k) (4θ)^{n−k} (1 − 4θ)^k, so that values of the estimate different than θ can be observed
with non negligible probability.
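The resulting sampling distribution can be tabulated directly from the binomial probabilities of k, the number of 5s; θ = 0.15 below is an arbitrary "true" value:

```python
# Sampling distribution of the estimate (n - k)/(4n) for a given true theta.
from math import comb

n, theta = 100, 0.15
p5 = 1 - 4 * theta                      # probability of a 5 in a single roll

dist = {}                               # estimate value -> probability
for k in range(n + 1):
    est = (n - k) / (4 * n)
    dist[est] = comb(n, k) * (4 * theta) ** (n - k) * p5 ** k

total = sum(dist.values())              # 1 up to rounding: it is a distribution
```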
• For instance, using again our example, we see clearly that there does not exist a
single “sampling distribution” of the estimate as there is one for each value of the
parameter.
• On one hand this is good, because otherwise the estimate would give us quite
poor information about θ: the information we get from the estimate comes exactly
from the fact that for different values of θ different values of the estimate are more
likely to be observed.
• On the other hand, it does not allow us to say which is the “sampling distribution” of
the estimate, but only gives us a family of such distributions.
13.77 Sampling Variability
• However, even if we do not know the value of the parameter we may study several
aspects of the sampling distribution.
• For instance, for each θ we can compute, given that θ, the expected value of the
estimate under the distribution of the estimate for that particular value of θ. In
other words we could compute

Σ_{k=0}^{n} [(n − k)/(4n)] C(n, k) (4θ)^{n−k} (1 − 4θ)^k

and by doing this computation we would see that the result is θ itself, no matter
the value of θ. So we say that the estimate is unbiased.
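This unbiasedness claim can be checked numerically for a few values of θ (the sample size and the θ values below are arbitrary):

```python
# Numerical check that the expected value of the estimate is theta itself.
from math import comb

def expected_estimate(theta, n=100):
    return sum((n - k) / (4 * n) * comb(n, k)
               * (4 * theta) ** (n - k) * (1 - 4 * theta) ** k
               for k in range(n + 1))

# each pair below has both entries equal, up to rounding
checks = [(th, expected_estimate(th)) for th in (0.05, 0.10, 0.20)]
```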
• Moreover, as n grows, the probability that

(n − k)/(4n) ∈ [θ ± c],

that is: of observing a value of the estimate different from θ by at most c, goes to 1
for ANY c > 0, no matter the value of θ. This is called “mean square consistency”.
• For instance, in iid sampling from an unknown distribution with unknown ex-
pected value µ and known standard deviation σ, the usual estimate of µ, the
arithmetic mean of the data, has a sampling variance equal to σ²/n, which does not
depend on unknown parameters (repeat: we assumed σ known).
• V̂(θ̂) = 4θ̂(1 − 4θ̂)/(16n) and P̂(θ̂ = (n − k)/(4n)) = C(n, k) (4θ̂)^{n−k} (1 − 4θ̂)^k, and
always remember to notice the “hats” on V and P.
13.82 Principle 1
• The first obvious principle to follow in order to be able to do this is: “do not
forget it”.
• An estimate is an estimate is an estimate, it is not the “true” θ.
• This seems obvious but errors of this kind are quite common: it seems the human
brain does not like uncertainty and, if not properly conditioned, it shall try, in
any possible way, to make us wrongly believe that we are sure about something on which
we only possess some clue.
13.83 Principle 2
• The second principle is “measure it”.
• At the very least information about sampling standard deviation should be added.
Reporting in the form of confidence intervals could be quite useful.
• This and not point estimation is the most important contribution Statistics may
give to your decisions under uncertainty.
13.84 Principle 3
• The third principle is “do not be upset by it”.
• Results of decisions may upset you even under certainty. This is obviously much
more likely when chance is present, even if probabilities are known.
• We are at the third level: no certainty, chance is present, probabilities are un-
known!
• The best Statistics can only guarantee an efficient and logically coherent use of
available information.
• It does not guarantee Luck in “getting the right estimates” and obviously it cannot
guarantee that, even if probabilities are estimated well, something very unlikely
does not happen! (And no matter what, people shall always expect, forgive the
joke, that what is most probable is much more likely than it is probable).
13.86 Statistical Model
• This is made of two ingredients.
• The first is a probability model for a random variable (or more generally a random
vector, but here we shall consider only the one dimensional case).
• The simplest example of this is the case of independent and identically distributed
observations (simple random sampling).
• Sometimes we do not fully specify P but simply ask, for instance, for X to have
a certain expected value or a certain variance.
13.88 Statistic
• A fundamental concept is that of “estimate” or “statistic”. Given a sample X,
an estimate is simply a function of the sample and nothing else: T(X).
• In other words, it cannot depend on unknowns the like of parameters in the model.
Once the sample is observed, the estimate becomes a number.
• This may give the false impression that statisticians are interested in parameter
values.
• Sometimes this may be so but, really, statisticians are interested in assessing
probabilities for (future) values of X, parameters are just “middlemen” in this
endeavor.
• Point estimation;
• Interval estimation;
• Hypothesis testing.
13.92 Unbiasedness
• An estimate T(X) is unbiased for θ iff E_θ(T(X)) = θ, ∀θ ∈ Θ. In order to
understand the definition (and the concept of sampling distribution) it is important
to realize that, in general, the statistic T has a potentially different expected value
for each different value of θ (hence each different distribution of the sample).
• What the definition asks for is that this expected value always corresponds to the
θ which indexes the distribution used for computing the expected value itself.
• Notice how, in this definition, we stress the point that the M SE is a function of
θ (just like the expected value of T ).
• We recall the simple result: MSE_θ(T) = V_θ(T) + (E_θ(T) − θ)².
• Obviously, for an unbiased estimate, M SE and sampling variance are the same.
• We state that T1 is not less efficient than T2 if and only if M SEθ (T1 ) ≤ M SEθ (T2 )
∀θ ∈ Θ.
• As in the case of unbiasedness, the most important point is to notice the “for all”
quantifier (∀).
• This implies, for instance, that we cannot be sure, given two estimates, whether
one is not worse than the other under this definition.
• In fact it may well happen that mean square errors, as functions of the parameter
“cross”, so that one estimate is “better” for some set of parameter values while
the other for a different set.
• In other words, the order induced on estimates by this definition is only partial.
• Obviously, since an estimate is defined on a given sample, this new setting requires
the definition of a sequence of estimates and the property we are about to state
is not a property of an estimate but of a sequence of estimates.
13.97 Mean Square Consistency
• A sequence {T_n} of estimates is termed “mean square consistent” if and only if
lim_{n→∞} MSE_θ(T_n) = 0, ∀θ ∈ Θ.
• You should notice again the quantifier on the values of the parameter.
• Again, using Tchebicev, we understand that the requirement implies that, for
any given value of the parameter, the probability of observing a value of the
estimate in any given interval containing θ goes to 1 if the size of the sample goes
to infinity.
• In general we shall have E(X^m) = g_m(θ), that is: the moments are functions of
the unknown parameters.
• Under iid sampling the corresponding empirical moment is immediately seen to be
unbiased, while its MSE (the variance, in this case) is the sampling variance of the
empirical m-th moment, which (if it exists) obviously goes to 0 if the size n of the
sample goes to infinity.
13.101 Inverting the Moment Equation
• The idea of the method of moments is simple. Suppose for the moment that θ is
one dimensional.
• Choose any g_m and suppose it is invertible (if the model is sensible, this should
be true. Why?). Then estimate θ by inverting the moment equation at the
corresponding empirical moment.
• In the case of k parameters, just solve, with respect to the unknown parameter
vector, a system of k equations connecting the parameter vector with k moments,
estimated with the corresponding empirical moments.
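A sketch of this inversion for the negative exponential model, where E(X) = 1/λ and E(X²) = 2/λ²; the data below are made up for illustration:

```python
# Method of moments for the negative exponential model.

data = [0.8, 1.3, 0.2, 2.1, 0.6, 1.0, 0.4, 1.6]

m1 = sum(data) / len(data)               # empirical first moment
lambda_hat_1 = 1 / m1                    # invert E(X) = 1/lambda

# inverting the second moment equation E(X^2) = 2/lambda**2 instead gives,
# in general, a different estimate: the non-uniqueness problem of the method
m2 = sum(x * x for x in data) / len(data)
lambda_hat_2 = (2 / m2) ** 0.5
```

Running this on the sample above gives two different estimates of λ, which is exactly the non-uniqueness issue discussed in the next subsection.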
13.102 Problems
• This procedure is intuitively alluring. However, we have at least two problems.
The first is that any different choice of moments is going to give us, in general,
a different estimate (consider for instance the negative exponential model and
estimate its parameter using different moments).
• The Generalized Method of Moments tries to solve this problem (do not worry!
this is something you may ignore, for the moment).
• The second is that, while under iid sampling empirical moments are unbiased
estimates of the corresponding theoretical moments, this is usually not true
for method of moments estimates. This is due to the fact that the g_m we use are
typically not linear.
• Under suitable hypotheses we can show that method of moments estimates are
mean square consistent, but this is usually all we can say.
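The negative exponential case suggested above can be sketched directly. With rate λ we have E(X) = 1/λ and E(X²) = 2/λ², so the first and the second moment give two different estimates of the same parameter (λ = 1.5 below is a hypothetical value):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.5                                   # hypothetical true rate
x = rng.exponential(1 / lam, size=200)

# First moment:  E(X)   = 1/lam      ->  lam1 = 1 / m1
# Second moment: E(X^2) = 2/lam^2    ->  lam2 = sqrt(2 / m2)
lam1 = 1 / x.mean()
lam2 = np.sqrt(2 / (x ** 2).mean())

# Both are mean square consistent, but on any finite sample they disagree:
# the choice of the moment to invert matters.
```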
13.103 Maximum Likelihood
• Here the idea is clear if we are in a discrete setting (i.e. if we consider a model
of a probability function).
• The first step in the maximum likelihood method is to build the joint distribution
of the sample.
• In the context described above (iid sample) we have P (X; θ) = ∏_i P (X_i ; θ).
• Now, observe the sample and change the random variables in this formula (the X_i)
into the corresponding observations (the x_i).
• The resulting P (x; θ) cannot be seen as a probability of the sample (the proba-
bility of the observed sample is, obviously, 1), but can be seen as a function of θ
given the observed sample: Lx (θ) = P (x; θ).
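For a Bernoulli model the construction above can be sketched as follows (the sample is a hypothetical observation):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0])    # hypothetical observed iid sample

def likelihood(theta):
    # L_x(theta) = prod_i P(x_i; theta), with P(1; theta) = theta
    # and P(0; theta) = 1 - theta.
    return float(np.prod(theta ** x * (1 - theta) ** (1 - x)))

# Seen as a function of theta for the fixed, observed sample:
grid = np.linspace(0.01, 0.99, 99)
L = np.array([likelihood(t) for t in grid])
theta_hat = float(grid[np.argmax(L)])      # close to the sample mean 5/8
```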
13.105 Interpretation
• If P is a probability (discrete case), the idea of the maximum likelihood method
is that of finding the value of the parameter which maximizes the probability of
observing the sample actually observed.
• While for each given value of the parameter we may observe, in general, many
different samples, a set of these (not necessarily just one single sample: many
different samples may have the same probability) has the maximum probability
of being observed given the value of the parameter.
13.106 Interpretation
• We observe the sample and do not know the parameter value so, as an estimate,
we choose that value for which the specific sample we observe is among the most
probable samples.
• Obviously, if, given the parameter value, the sample we observe is not among the
most probable, we are going to make a mistake; but we hope this is not the most
common case, and we can show, under proper hypotheses, that the probability of
such a case goes to zero as the sample size increases to infinity.
13.107 Interpretation
• A more satisfactory interpretation of maximum likelihood is available in a particular case.
• Suppose the parameter θ has a finite set of, say, m possible values and suppose
that, before observing the sample, the statistician considers the probability of
each of these values to be the same (that is, 1/m).
• Using Bayes theorem, the posterior probability of a given value of the parameter
given the observed sample shall be:
P (θ_j |x) = P (x|θ_j )(1/m) / Σ_j P (x|θ_j )(1/m) = h(x) L_x (θ_j ).
13.108 Interpretation
• In words: if we consider the different values of the parameter a priori (of sample
observation) as equiprobable, then the likelihood function is proportional to the
posterior (given the sample) probability of the values of the parameter.
• So that, in this case, the maximum likelihood estimate is the same as the maxi-
mum posterior probability estimate.
• In this case, then, while the likelihood is not the probability of a parameter
value (it is only proportional to it), to maximize the likelihood means to choose the
parameter value which has the maximum probability given the sample.
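A numerical check of the equivalence, in the discrete case, with a hypothetical Bernoulli sample and a finite parameter set of m = 3 equiprobable values:

```python
import numpy as np

x = np.array([1, 1, 0, 1, 1, 1, 0, 1])     # hypothetical observed sample
thetas = np.array([0.2, 0.5, 0.8])         # finite set of parameter values
prior = np.full(3, 1 / 3)                  # a priori equiprobable: 1/m each

# Likelihood of the observed sample at each parameter value.
lik = np.array([np.prod(t ** x * (1 - t) ** (1 - x)) for t in thetas])

# Bayes theorem: posterior proportional to likelihood times (uniform) prior.
post = lik * prior / np.sum(lik * prior)

# With a uniform prior the posterior is the normalized likelihood, so the
# maximum likelihood and maximum posterior probability estimates coincide.
```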
13.109 Maximum Likelihood for Densities
• However, given that we are maximizing a joint density and not a joint probability,
the simple interpretation just summarized is not directly available.
• If we use the first moment for the estimation of θ we have θ̂_1 = x̄ but, if we choose
the second moment, we have θ̂_2 = (−1 + √(1 + 4 m_2))/2, where m_2 here indicates
the empirical second moment (the average of the squares).
• For a given θ this probability does not depend on the specific values of each
single observation but only on the sum of the observations and the product of
the factorials of the observations.
• The value of θ which maximizes the likelihood is θ̂_ml = x̄, which coincides with
the method of moments estimate if we use the first moment as the function to
invert.
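This is the Poisson model, where E(X) = θ and E(X²) = θ + θ². A sketch (θ = 2 is a hypothetical true value) comparing the two method of moments estimates and the maximum likelihood one:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 2.0                                  # hypothetical true parameter
x = rng.poisson(theta, size=500)

m1 = x.mean()
m2 = (x ** 2).mean()

theta1 = m1                                  # inverts E(X) = theta
theta2 = (-1 + np.sqrt(1 + 4 * m2)) / 2      # inverts E(X^2) = theta + theta^2
theta_ml = m1                                # the Poisson MLE is the sample mean
```

The two moment-based estimates are close but not identical on a finite sample, while the maximum likelihood estimate coincides with the first of them.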
• The following topics are barely touched in standard US-style undergraduate
Economics curricula, and only scantly covered in other systems.
• They are, actually, very important, but only vague notions of them can be required
of a student as a prerequisite.
• A common procedure is to report the point estimate together with some measure
related to its sampling standard deviation.
• We say “related” because, in the vast majority of cases, the sampling standard
deviation depends on unknown parameters, hence it can only be reported in an
“estimated” version.
13.115 Sampling Variance of the Mean
• The simplest example is this.
• By recourse to the usual properties of the variance operator we find that the
variance of the arithmetic mean is σ 2 /n.
• If (as is very frequently the case) σ 2 is unknown, even after observing the
sample we cannot give the value of the sampling standard deviation.
13.117 nσ Rules
• In order to give a direct joint picture of an estimate and its (estimated) standard
deviation, nσ “rules” are often followed by practitioners.
• They typically report “intervals” of the form Point Estimate ± n Estimated Stan-
dard Deviation. A popular value of n outside Finance is 2; in Finance we see values
of up to 6.
• A way of understanding this use is as follows: if we accept the two false premises
that the estimate is equal to its expected value, this being equal to the unknown
parameter, and that the estimated sampling variance is the true variance of the
estimate, then the Tchebicev inequality assigns a probability of at least .75 to
observations of the estimate, in other similar samples, falling inside the “±2σ”
interval (more than .97 for the “±6σ” interval).
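A sketch of the whole procedure on a hypothetical Gaussian sample (true mean 10 and σ = 3, both unknown to the “statistician”):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(10.0, 3.0, size=400)        # hypothetical observed sample

x_bar = x.mean()                            # point estimate of the mean
s = x.std(ddof=1)                           # estimated sigma
se = s / np.sqrt(len(x))                    # estimated sampling std. dev. of x_bar

# n-sigma "rule": point estimate +/- n estimated standard deviations.
two_sigma = (x_bar - 2 * se, x_bar + 2 * se)
six_sigma = (x_bar - 6 * se, x_bar + 6 * se)
```

Under the two false premises above, Tchebicev gives at least 1 − 1/4 = .75 for the first interval and 1 − 1/36 ≈ .97 for the second.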
13.118 Confidence Intervals
• A slightly more refined, but theoretically much more demanding, practice is that of
computing “confidence intervals” for parameter estimates.
• The proper definition is usually not even given, and only one or two simple
examples are reported, with no precise statement of the required hypotheses.
• In fact it turns out that, if α is equal to .05, the z in the first interval is equal to
1.96, and, for n greater than, say, 30, the t in the second formula is roughly 2.
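The Gaussian quantile can be checked with the standard library (the Student t quantile for 30 degrees of freedom, about 2.04, needs a table or a statistical package):

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # the z of the first interval

# z is about 1.96; for n greater than roughly 30 the corresponding
# t_{n-1, 1-alpha/2} quantile is already close to 2.
```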
• Consequently, almost the full set of normative tools of statistical decision theory
has been applied to financial problems, and with considerable success when used
as normative tools (much less success, if any, was encountered by attempts to use
such tools in the description of actual empirical human behavior; but this was to
be expected).
• So, for instance, if you are considering a Gaussian model and your two hypotheses
are that the expected value is either 1 or 2, this means, implicitly, that no other
values are allowed.
• The reason for the names lies in the fact that, in the traditional setting where
testing theory was developed, the “null” hypothesis corresponds to some conser-
vative statement whose acceptance would not imply a change of behavior in the
researcher, while the “alternative” hypothesis would imply, if accepted, a
change of behavior.
13.125 Example
• The simplest example is that of testing a new medicine or medical treatment.
• In a very stylized setting, let us suppose we are considering replacing an
already established and reasonably working treatment for some illness with a
new one.
• This is to be done on the basis of the observation of some clinical parameter in
a population.
• We know enough as to be able to state that the observed characteristic is dis-
tributed in a given way if the new treatment is not better than the old one and
in a different way if this is not the case.
• In this example the distribution under the hypothesis that the new treatment is
not better than the old shall be taken as the null hypothesis and the other as the
alternative.
1. x ∈ C but the true hypothesis is H0 , this is called error of the first kind;
2. x ∈ A but the true hypothesis is H1 , this is called error of the second kind.
• We should like to avoid these errors; however, obviously, we do not even know
(except in toy situations) whether we are committing them, just as we do not
know how wrong our point estimates are.
• Proceeding in a way similar to what we did in estimation theory, we define some
measure of error.
• We would like, ideally, the power function Π_C (θ) = P_θ (x ∈ C) to be near 1 for
θ ∈ Θ1 and near 0 for θ ∈ Θ0 .
• We define α = sup_{θ∈Θ0} Π_C (θ), the (maximum) size of the error of the first kind,
and β = sup_{θ∈Θ1} (1 − Π_C (θ)), the (maximum) size of the error of the second kind.
• The reason for this choice is to be found in the traditional setting described above.
If accepting the null means continuing some standard and reasonably successful
therapy, it could be sensible to require a small probability of rejecting this
hypothesis when it is true, while a possibly big error of the second kind could
be considered acceptable.
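These sizes can be computed explicitly in a Gaussian example with known variance (all the numbers below are hypothetical): H0 : µ ≤ a against H1 : µ ≥ b, with critical region C = {x : x̄ > k}.

```python
from math import sqrt
from statistics import NormalDist

sigma, n = 1.0, 25
a, b, k = 0.0, 0.5, 0.3                # hypothetical hypotheses and threshold

def power(theta):
    # Pi_C(theta) = P_theta(x_bar > k), with x_bar ~ N(theta, sigma^2 / n).
    return 1 - NormalDist(theta, sigma / sqrt(n)).cdf(k)

# power is increasing in theta, so both sups are attained at the boundary:
alpha = power(a)        # sup of Pi_C over Theta0 = (-inf, a]
beta = 1 - power(b)     # sup of 1 - Pi_C over Theta1 = [b, +inf)
```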
13.130 Asymmetry
The reader should consider the fact that this very asymmetric setting is not the most
common in applications.
• We want to test H0 : µ ≤ a against H1 : µ ≥ b, where a ≤ b are two given
real numbers. It is reasonable to expect that a critical region of the shape
C = {x : x̄ > k} should be a good one.
• The problem is to find k.
• When the variance is unknown the critical region is of the same shape, but
k = a + (σ̂/√n) t_{n−1,1−α}, where σ̂ and t are as defined above.
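In the known-variance case the threshold uses the Gaussian quantile instead of the Student t. A sketch with hypothetical numbers (the unknown-variance version replaces σ by σ̂ and z by t_{n−1,1−α}):

```python
from math import sqrt
from statistics import NormalDist

sigma, n, a, alpha = 1.0, 25, 0.0, 0.05   # hypothetical values
z = NormalDist().inv_cdf(1 - alpha)       # one-sided quantile z_{1-alpha}

k = a + sigma / sqrt(n) * z               # reject H0: mu <= a when x_bar > k
```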
• Suppose we have H0 : µ = µ0 and H1 : µ ≠ µ0 for some given µ0 . The above
recalled property of the confidence interval implies that the probability with
which
x̄ ± z_{1−α/2} σ/√n
contains µ0 , when H0 is true, is 1 − α.
• Build the analogous region in the case of unknown variance and consider the
setting where you swap the hypotheses.
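In the known-variance case, this duality gives the test directly (all numbers hypothetical): accept H0 : µ = µ0 exactly when µ0 falls inside the interval.

```python
from math import sqrt
from statistics import NormalDist

sigma, n, alpha, mu0 = 2.0, 100, 0.05, 5.0   # hypothetical values
x_bar = 5.3                                   # hypothetical observed mean

z = NormalDist().inv_cdf(1 - alpha / 2)
half = z * sigma / sqrt(n)
lo, hi = x_bar - half, x_bar + half           # the 1 - alpha confidence interval

reject = not (lo <= mu0 <= hi)                # here mu0 is inside: accept H0
```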
Contents
1 Returns 3
1.1 Return definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Price and return data: a cautionary tale . . . . . . . . . . . . . . . . . 10
1.3 Some empirical “facts” . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Volatility estimation 30
3.1 Is it easier to estimate µ or σ 2 ? . . . . . . . . . . . . . . . . . . . . . . 36
6 Matrix algebra 64
9 Linear regression 72
9.1 What is a regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
9.2 Weak OLS hypotheses. X non random . . . . . . . . . . . . . . . . . . 74
9.3 Weak OLS hypotheses. X random . . . . . . . . . . . . . . . . . . . . . 75
9.4 The OLS estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
9.5 Basic statistical properties of the OLS estimate . . . . . . . . . . . . . 77
9.6 The Gauss Markoff theorem . . . . . . . . . . . . . . . . . . . . . . . . 78
9.7 Fit and errors of fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9.8 R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.9 Statistical properties of Ŷ and ˆ . . . . . . . . . . . . . . . . . . . . . . 83
9.10 Strong OLS hypotheses, confidence intervals and testing linear hypothe-
ses in the linear model. . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.11 “Forecasts” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
9.12 a note on P-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.13 The partial regression theorem. . . . . . . . . . . . . . . . . . . . . . . 90
9.14 The interpretation of estimated coefficients . . . . . . . . . . . . . . . 95
13.10 Classes of Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
13.11 Probability as a Set Function . . . . . . . . . . . . . . . . . . . . . 167
13.12 Basic Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
13.13 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . 168
13.14 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
13.15 Stochastic Independence . . . . . . . . . . . . . . . . . . . . . . . . 168
13.16 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
13.17 Properties of the PDF . . . . . . . . . . . . . . . . . . . . . . . . . 169
13.18 Density and Probability Function . . . . . . . . . . . . . . . . . . . 169
13.19 Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
13.20 Probability Function . . . . . . . . . . . . . . . . . . . . . . . . . . 170
13.21 Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
13.22 Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
13.23 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
13.24 Tchebicev Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . 171
13.25 *Vysochanskij–Petunin Inequality . . . . . . . . . . . . . . . . . . . 171
13.26 *Gauss Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
13.27 *Cantelli One Sided Inequality . . . . . . . . . . . . . . . . . . . . . 172
13.28 Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
13.29 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
13.30 Subsection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
13.31 Univariate Distributions Models . . . . . . . . . . . . . . . . . . . . 173
13.32 Some Univariate Discrete Distributions . . . . . . . . . . . . . . . . 173
13.33 Some Univariate Continuous Distributions . . . . . . . . . . . . . . 174
13.34 Some Univariate Continuous Distributions . . . . . . . . . . . . . . 174
13.35 Some Univariate Continuous Distributions . . . . . . . . . . . . . . 174
13.36 Random Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
13.37 Distribution Function for a Random Vector . . . . . . . . . . . . . . 175
13.38 Density and Probability Function . . . . . . . . . . . . . . . . . . . 175
13.39 Marginal Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.40 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.41 Stochastic Independence . . . . . . . . . . . . . . . . . . . . . . . . 176
13.42 Mutual Independence . . . . . . . . . . . . . . . . . . . . . . . . . . 176
13.43 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . 177
13.44 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . 177
13.45 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . 177
13.46 Law of Iterated Expectations . . . . . . . . . . . . . . . . . . . . . . 177
13.47 Regressive Dependence . . . . . . . . . . . . . . . . . . . . . . . . . 178
13.48 Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . . . 178
13.49 Distribution of the max and the min for independent random variables 179
13.50 Distribution of the max and the min for independent random variables 179
13.51 Distribution of the sum of independent random variables and central
limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
13.52 Distribution of the sum of independent random variables and central
limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
13.53 Distribution of the sum of independent random variables and central
limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
13.54 Why Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
13.55 Unknown Probabilities and Symmetry . . . . . . . . . . . . . . . . 181
13.56 Unknown Probabilities and Symmetry . . . . . . . . . . . . . . . . 182
13.57 No Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
13.58 No Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
13.59 Learning Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . 183
13.60 Pyramidal Die . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
13.61 Pyramidal Die Model . . . . . . . . . . . . . . . . . . . . . . . . . . 184
13.62 Pyramidal Die Constraints . . . . . . . . . . . . . . . . . . . . . . . 185
13.63 Many Rolls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
13.64 Probability of Observing a Sample . . . . . . . . . . . . . . . . . . 185
13.65 Pre or Post Observation? . . . . . . . . . . . . . . . . . . . . . . . . 186
13.66 Maximize the Probability of the Observed Sample . . . . . . . . . . 186
13.67 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 186
13.68 Sampling Variability . . . . . . . . . . . . . . . . . . . . . . . . . . 187
13.69 Possibly Different Samples . . . . . . . . . . . . . . . . . . . . . . . 187
13.70 The Probability of Our Sample . . . . . . . . . . . . . . . . . . . . 187
13.71 The Probability of a Similar Estimate . . . . . . . . . . . . . . . . . 188
13.72 The Probability of a Similar Estimate . . . . . . . . . . . . . . . . . 188
13.73 The Probability of a Similar Estimate . . . . . . . . . . . . . . . . . 188
13.74 The Estimate in Other Possible Samples . . . . . . . . . . . . . . . 188
13.75 The Estimate in Other Possible Samples . . . . . . . . . . . . . . . 189
13.76 Sampling Variability . . . . . . . . . . . . . . . . . . . . . . . . . . 189
13.77 Sampling Variability . . . . . . . . . . . . . . . . . . . . . . . . . . 190
13.78 Sampling Variability . . . . . . . . . . . . . . . . . . . . . . . . . . 190
13.79 Sampling Variability . . . . . . . . . . . . . . . . . . . . . . . . . . 190
13.80 Estimated Sampling Variability . . . . . . . . . . . . . . . . . . . . 191
13.81 Quantifying Sampling Variability . . . . . . . . . . . . . . . . . . . 191
13.82 Principle 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
13.83 Principle 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
13.84 Principle 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
13.85 The Questions of Statistics . . . . . . . . . . . . . . . . . . . . . . . 192
13.86 Statistical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
13.87 Specification of a Parametric Model . . . . . . . . . . . . . . . . . . 193
13.88 Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
13.89 Parametric Inference . . . . . . . . . . . . . . . . . . . . . . . . . . 193
13.90 Different Inferential Tools . . . . . . . . . . . . . . . . . . . . . . . 194
13.91 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
13.92 Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
13.93 Mean Square Error . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
13.94 Mean Square Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . 195
13.95 Meaning of Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . 195
13.96 Mean Square Consistency . . . . . . . . . . . . . . . . . . . . . . . 195
13.97 Mean Square Consistency . . . . . . . . . . . . . . . . . . . . . . . 196
13.98 Methods for Building Estimates . . . . . . . . . . . . . . . . . . . . 196
13.99 Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . 196
13.100 Estimation of Moments . . . . . . . . . . . . . . . . . . . . . . . . 196
13.101 Inverting the Moment Equation . . . . . . . . . . . . . . . . . . . . 197
13.102 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
13.103 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 197
13.104 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 198
13.105 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
13.106 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
13.107 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
13.108 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
13.109 Maximum Likelihood for Densities . . . . . . . . . . . . . . . . . . 199
13.110 Example (Discrete Case) . . . . . . . . . . . . . . . . . . . . . . . . 199
13.111 Example Method of Moments . . . . . . . . . . . . . . . . . . . . . 199
13.112 Example Maximum likelihood . . . . . . . . . . . . . . . . . . . . . 200
13.113 More Advanced Topics . . . . . . . . . . . . . . . . . . . . . . . . . 200
13.114 Sampling Standard Deviation and Confidence Intervals . . . . . . . 200
13.115 Sampling Variance of the Mean . . . . . . . . . . . . . . . . . . . . 201
13.116 Estimation of the Sampling Variance . . . . . . . . . . . . . . . . . 201
13.117 nσ Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
13.118 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 202
13.119 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 202
13.120 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 202
13.121 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
13.122 Parametric Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . 203
13.123 Two Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
13.124 Simple and Composite . . . . . . . . . . . . . . . . . . . . . . . . . 203
13.125 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
13.126 Critical Region, Acceptance Region . . . . . . . . . . . . . . . . . . 204
13.127 Errors of First and Second Kind . . . . . . . . . . . . . . . . . . . . 204
13.128 Power Function and Size of the Errors . . . . . . . . . . . . . . . . 205
13.129 Testing Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
13.130 Asymmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
13.131 Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
13.132 Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
13.133 Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
13.134 Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
13.135 Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
13.136 Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207