
Handouts for the course

"Financial Econometrics and Empirical Finance - Module I"
Francesco Corielli
September 1, 2021

It is perhaps not to be wondered at, since fortune is ever changing her course and time is infinite, that the same incidents should occur many times, spontaneously. For, if the multitude of elements is unlimited, fortune has in the abundance of her material an ample provider of coincidences; and if, on the other hand, there is a limited number of elements from which events are interwoven, the same things must happen many times, being brought to pass by the same agencies.
Plutarch, Parallel Lives, Life of Sertorius.
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
Ronald Fisher

It is true that M. Fourier had the opinion that the principal aim of mathematics was public utility and the explanation of natural phenomena; but a philosopher like him should have known that the sole aim of science is the honor of the human mind, and that under this title a question about numbers is worth as much as a question about the system of the world.
Carl Gustav Jacobi (letter to Adrien-Marie Legendre, from Königsberg, July 2nd, 1830; translated from the French)

¹ Among the first examples of least squares: Roger Cotes with Robert Smith, ed., "Harmonia mensurarum" (Cambridge, England: 1722), chapter: "Aestimatio errorum in mixta mathesis per variationes partium trianguli plani et sphaerici", pag. 22.
A hypothesis should be made explicit: no systematic bias in the measuring instruments. Thomas Simpson points this out in: "An attempt to shew the advantage arising by taking the mean of a number of observations in practical astronomy" (from: "Miscellaneous Tracts on Some Curious Subjects ...", London, 1757). In modern terms this shall become E(ε|X) = 0 (see section 9).

1 Returns
1.1 Return definitions
Returns come in two versions. Let P_{it} be the price of the i-th stock at time t.
The linear or simple return (in the future we shall deal with dividends and with total returns) between times t_{j-1} and t_j is defined as:

r_{it_j} = P_{it_j}/P_{it_{j-1}} − 1

The log return is defined as:

r*_{it_j} = ln(P_{it_j}/P_{it_{j-1}})

In both these definitions of return we do not consider possible dividends. There exist corresponding definitions of total return where, in the case a dividend D_j is accrued between times t_{j-1} and t_j, the numerator of both ratios becomes P_{t_j} + D_j.
Moreover, here we do not apply any accrual convention to our returns, that is: we just consider period returns and do not transform, say, daily computed returns (i.e. t_j − t_{j-1} = one day) onto a yearly basis.
Notice that such transforms are, instead, the rule in most databases and return reporting and, being totally notional, are based on many different "accrual conventions", each arising in, and appropriate to, a specific field. You should be quite attentive to the specific definitions and formulas applied in the fields of your interest.
While P_{t_j} means "price at time t_j", the symbol r_{t_j} is a shorthand for "return between time t_{j-1} and t_j", so the notation is not really complete and its interpretation depends on the context. When needed, for clarity's sake, we shall specify returns as indexed by the beginning and the end point of the time interval in which they are computed, as, for instance, in r_{t_{j-1};t_j}.
Now, some obvious but important comments.
We begin by saying that only the linear return can be termed “percentage” return,
as it is defined as the ratio of two (positive) quantities minus 1. The log return is on
a different scale. The use of a % sign or of the “percentage” term for log returns is,
strictly speaking, wrong, even if it is quite common.
In fact: the two definitions of return, log and linear, yield different numbers when
applied to the same prices, except in the case when the two prices are identical.
Let’s study this in some detail.
We can easily show that ln(x) ≤ x − 1, where the equality holds only for x = 1 (no change in prices); moreover, the difference between linear and log returns shall be an increasing function of |x − 1|.
In fact, x − 1 is equal and tangent to ln(x) at x = 1. Moreover, since the second derivative of (x − 1) − ln(x), that is 1/x², is always positive, the difference between linear and log return shall be always positive, except at x = 1, and shall be bigger the more |x − 1| is bigger than 0.
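This inequality is easy to check numerically. The sketch below (Python; the price ratios are illustrative, not data from the text) tabulates both returns and their difference:

```python
import math

# Price ratios x = P_t / P_{t-1}, from a large drop to a large gain (illustrative values).
for x in [0.5, 0.9, 0.99, 1.0, 1.01, 1.1, 1.5]:
    linear = x - 1              # linear return
    log_ret = math.log(x)       # log return
    # The difference linear - log is always >= 0 and grows with |x - 1|.
    print(f"x={x:<4} linear={linear:+.4f} log={log_ret:+.4f} diff={linear - log_ret:.4f}")
```

For x near 1 the two returns agree almost exactly, while for x = 1.5 the gap is almost ten percentage points.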
There are several implications of this simple result. An important, if obvious, one
is that, if one kind of return is mistaken for the other, the inevitable “approximation”
errors shall be all of the same sign.
In Finance the ratio of two prices of the same security separated by a not too long amount of time (this ratio is sometimes called "total return" and may be corrected by taking into account accruals, as mentioned above) is often modeled as a random variable
with an expected value very near 1. This implies that the two definitions of returns,
linear and log, shall yield very different values with sizable probability only when the
variance (or more in general a dispersion measure) of the price ratio distribution is
non negligible, so that observations far from the expected value have non negligible
probability.
Since standard models in Finance assume that variance of returns increases when
the time between prices for which the return is computed increases, this also implies
that the two definitions shall more likely imply different values when applied to long
term returns.
Why two definitions? After all, the corresponding prices are the same and this
implies that both definitions, if not swapped by error, give us the same information.
The point is that each definition is useful, in the sense of making computations
simpler, in different cases.
You should also ask yourself: "why bother with returns and not simply discuss prices?". Again: the only reason is that using returns in several circumstances simplifies computations and makes it easier to build models for the dynamics of prices.
We now see which properties of linear and log returns make their use a useful
simplification in different cases.
From now on, for simplicity, let us only consider times t and t − 1.
Let the value of a buy and hold portfolio, composed of k stocks each in a nominal quantity n_i, at time t be:

Σ_{i=1..k} n_i P_{it}

Notice that we do not require each ni to have the same sign.


It is easy to see that the linear return of the portfolio, if it can be defined, that is:
if the denominator of the following formula is not equal to 0, shall be a linear function
of the returns of each stock. In fact:
r_t = (Σ_{i=1..k} n_i P_{it}) / (Σ_{j=1..k} n_j P_{jt-1}) − 1 = Σ_{i=1..k} [n_i P_{it} / Σ_{j=1..k} n_j P_{jt-1}] − 1 =

= Σ_{i=1..k} [n_i P_{it-1} / Σ_{j=1..k} n_j P_{jt-1}] (P_{it}/P_{it-1}) − 1 =

= Σ_{i=1..k} w_{it}(r_{it} + 1) − 1 = Σ_{i=1..k} w_{it} r_{it} + Σ_{i=1..k} w_{it} − 1 = Σ_{i=1..k} w_{it} r_{it} + 1 − 1 = Σ_{i=1..k} w_{it} r_{it}

where w_{it} = n_i P_{it-1} / Σ_{j=1..k} n_j P_{jt-1} are terms summing to 1. Notice that, while the time index of w_{it} is t, the value of w_{it} depends on terms which are all known at time t − 1.
When the w_{it} are all non negative (long only portfolio), they are typically called "weights" and represent the percentage of the portfolio invested in the i-th stock at time t − 1.
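The aggregation identity is easy to verify numerically. The sketch below uses made-up prices and quantities (note the short position, which the result allows) and checks that the directly computed portfolio return equals the weighted sum:

```python
n = [10, 5, -2]                       # nominal quantities; signs may differ (one short position)
p_prev = [100.0, 50.0, 20.0]          # prices P_{it-1}
p_now = [105.0, 49.0, 22.0]           # prices P_{it}

v_prev = sum(ni * p for ni, p in zip(n, p_prev))   # portfolio value at t-1
v_now = sum(ni * p for ni, p in zip(n, p_now))     # portfolio value at t

r_direct = v_now / v_prev - 1                      # portfolio linear return, computed directly

w = [ni * p / v_prev for ni, p in zip(n, p_prev)]  # terms w_{it}, summing to 1
r = [b / a - 1 for a, b in zip(p_prev, p_now)]     # linear return of each stock
r_agg = sum(wi * ri for wi, ri in zip(w, r))       # weighted sum of stock returns

assert abs(sum(w) - 1) < 1e-12
assert abs(r_direct - r_agg) < 1e-12
```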
This simple “aggregation” result is very useful. Suppose, for instance, that returns
are stochastic and you know, at time t − 1, the expected values for the linear returns
between time t − 1 and t. You are at time t − 1 and want to compute the expected
value of your portfolio return between t − 1 and t.
Since the expected value is a linear operator (the expected value of a sum is the sum of the expected values; moreover, additive and multiplicative constants can be taken out of the expected value) and the weights w_{it} are known at time t − 1, hence non stochastic when you make your computations, if we are at time t − 1 we can easily compute the expected return of the portfolio as:

E(r_t) = Σ_{i=1..k} w_{it} E(r_{it})

Moreover, if we know all the covariances between r_{it} and r_{jt} (if i = j we simply have a variance) we can find the variance of the portfolio return as:

V(r_t) = Σ_{i=1..k} Σ_{j=1..k} w_{it} w_{jt} Cov(r_{it}; r_{jt})

Notice, in the end, that this is true no matter how big an amount of time passes
between t − 1 and t, provided you do your computations at time t − 1.
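A minimal sketch of both formulas, with hypothetical weights and moments (the numbers are illustrative, not estimates from any dataset):

```python
w = [0.5, 0.3, 0.2]                  # weights w_{it}, known at t-1
mu = [0.05, 0.03, 0.08]              # expected linear returns E(r_{it})
cov = [[0.04, 0.01, 0.00],           # covariance matrix Cov(r_{it}; r_{jt})
       [0.01, 0.09, 0.02],
       [0.00, 0.02, 0.16]]

k = len(w)
e_port = sum(w[i] * mu[i] for i in range(k))                               # E(r_t)
v_port = sum(w[i] * w[j] * cov[i][j] for i in range(k) for j in range(k))  # V(r_t)

print(e_port, v_port)
```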
This breaks down if we are, say, at time t − 1 and consider now the portfolio return between time t and time t + 1. If we suppose the n_i unchanged (buy and hold portfolio) the aggregation formula shall be valid but with w_{it+1} ≠ w_{it}. This is because w_{it+1} shall depend on security prices at time t, while w_{it} depends on security prices at time t − 1.
If we are at time t − 1, then, while w_{it} is non stochastic, w_{it+1} is stochastic as it depends on prices available in the future, at time t. This means that the easy expected value and variance formulas cannot be used now, because we cannot "take the w_{it+1} outside" the expected value and variance operators.
In formulas, we could not make steps such as, e.g.:

E(r_{t+1}) = Σ_{i=1..k} w_{it+1} E(r_{it+1})

We should be satisfied by the almost useless

E(r_{t+1}) = Σ_{i=1..k} E(w_{it+1} r_{it+1})

The same problem arises for the computation of the variance.
There exists an investing strategy which keeps w_{it} constant over time by changing the n_i. This is called a "constant relative weights" strategy.
This is NOT a buy and hold strategy: if you are long all securities, it requires selling a proper amount of those securities which outperform the other securities and buying a proper amount of the underperforming securities, so that the net investment is 0 and the weights are kept constant.
(In the real world such a strategy shall imply possibly huge transaction costs).
With such a strategy, we could use the expected value and variance formulas for
returns on periods which begin in our future (provided we know the expected values,
variances and covariances of each return included in the formula).
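A sketch of such a rebalancing (hypothetical prices and target weights; transaction costs ignored): at each date the quantities are reset so that the weights return to their targets, while the rebalancing trade itself is self-financing because it leaves the portfolio value unchanged.

```python
target_w = [0.6, 0.4]                                        # constant relative weights to keep
price_path = [[100.0, 50.0], [110.0, 45.0], [105.0, 55.0]]   # hypothetical prices at 3 dates

value = 1000.0
n = [value * wi / p for wi, p in zip(target_w, price_path[0])]  # initial quantities

for p in price_path[1:]:
    value = sum(ni * pi for ni, pi in zip(n, p))          # value drifts, so weights drift too
    n = [value * wi / pi for wi, pi in zip(target_w, p)]  # sell winners / buy losers
    w_now = [ni * pi / value for ni, pi in zip(n, p)]     # weights are back on target
    assert all(abs(a - b) < 1e-12 for a, b in zip(w_now, target_w))
```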
Now, log returns.
For log returns the aggregation result in a portfolio, valid for linear returns, does
not apply. In fact we have:
r*_t = ln[(Σ_{i=1..k} n_i P_{it}) / (Σ_{j=1..k} n_j P_{jt-1})] = ln[Σ_{i=1..k} (n_i P_{it-1} / Σ_{j=1..k} n_j P_{jt-1}) (P_{it}/P_{it-1})] = ln[Σ_{i=1..k} w_{it} exp(r*_{it})]

The log return of the portfolio is not a linear function of the log (and also of the
linear) returns of the components. In this case assumptions on the expected values
and covariances of the (log) returns of each security in the portfolio cannot be (easily)
translated into assumptions on the expected value and the variance of the portfolio
return by simple use of basic “expected value of the sum” and “variance of the sum”
formulas.
Think how difficult this could make it to perform any standard portfolio optimization procedure, as, for instance, the Markowitz mean/variance model, using log returns.
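A numeric check with made-up prices: the true portfolio log return matches ln(Σ w_{it} exp(r*_{it})) exactly, but not the weighted sum of the components' log returns.

```python
import math

n = [10, 5]                  # illustrative quantities and prices
p_prev = [100.0, 50.0]
p_now = [120.0, 45.0]

v_prev = sum(ni * p for ni, p in zip(n, p_prev))
v_now = sum(ni * p for ni, p in zip(n, p_now))

log_port = math.log(v_now / v_prev)                       # true portfolio log return

w = [ni * p / v_prev for ni, p in zip(n, p_prev)]
log_each = [math.log(b / a) for a, b in zip(p_prev, p_now)]

exact = math.log(sum(wi * math.exp(li) for wi, li in zip(w, log_each)))  # ln(sum w e^{r*})
naive = sum(wi * li for wi, li in zip(w, log_each))       # WRONG: linear aggregation of log returns

assert abs(log_port - exact) < 1e-12
assert abs(log_port - naive) > 1e-3                       # the naive formula misses
```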
While log returns create problems for aggregation over portfolios, log returns are much easier to use than linear returns when we aim at describing the evolution of the price of a single security through several time intervals.
Suppose we observe the price P_{t_i} at times t_1, ..., t_n; the log return between t_1 and t_n shall be the sum of the intermediate log returns:

r*_{t_1,t_n} = ln(P_{t_n}/P_{t_1}) = ln[(P_{t_n}/P_{t_{n-1}})(P_{t_{n-1}}/P_{t_1})] = ... = ln Π_{i=2..n} (P_{t_i}/P_{t_{i-1}}) = Σ_{i=2..n} r*_{t_i}
It is then easy, for instance, given the expected values and the covariances of the sub period returns, to compute the expected value and the variance of the full period return (from t_1 to t_n).
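For instance, with a short hypothetical price path, the full-period log return equals the sum of the sub-period log returns exactly:

```python
import math

prices = [100.0, 104.0, 99.0, 107.0, 103.0]   # P_{t_1}, ..., P_{t_5} (illustrative)

full = math.log(prices[-1] / prices[0])                          # r*_{t_1,t_5}
subs = [math.log(prices[i] / prices[i - 1]) for i in range(1, len(prices))]

assert abs(full - sum(subs)) < 1e-12
```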

On the other hand, linear returns over a time interval are not the sum of sub-period
linear returns.
We have:

r_{t_1,t_n} = P_{t_n}/P_{t_1} − 1 = (P_{t_n}/P_{t_{n-1}})(P_{t_{n-1}}/P_{t_1}) − 1 = ... = Π_{i=2..n} (P_{t_i}/P_{t_{i-1}}) − 1 = Π_{i=2..n} (r_{t_i} + 1) − 1

We see that, while it is possible, and in fact easy, to connect a period linear return
with subperiod returns, the function connecting subperiod linear returns to the period
linear return is not a sum but a product.
The expected value of a product is difficult to evaluate, even if we know each single
subperiod expected value, as it does not depend (in general) only on the expected
values of the terms.
A notable special case in which this is possible is that of non-correlation over time among the terms: then the expected value of the product is the product of the expected values.
For the computation of the variance, the problem is even worse, even in the case of non-correlation over time.
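Both points can be checked with a short sketch (illustrative prices and parameters): the first part verifies the product formula, the second uses a Monte Carlo draw of independent sub-period returns to check that the expected full-period gross return is the product of the sub-period expected gross returns.

```python
import random

# 1) Full-period linear return is a product of sub-period gross returns, minus 1.
prices = [100.0, 104.0, 99.0, 107.0]
full = prices[-1] / prices[0] - 1
prod = 1.0
for i in range(1, len(prices)):
    prod *= prices[i] / prices[i - 1]
assert abs(full - (prod - 1)) < 1e-12

# 2) With independence over time, E(product) = product of expectations (Monte Carlo check
#    with two independent Gaussian sub-period returns; means 1%, 2%, sd 5% are made up).
random.seed(42)
trials = 200_000
mean_prod = sum(
    (1 + random.gauss(0.01, 0.05)) * (1 + random.gauss(0.02, 0.05)) for _ in range(trials)
) / trials
assert abs(mean_prod - 1.01 * 1.02) < 0.002   # tolerance for sampling error
```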
So we have: linear returns are useful for portfolios over single time intervals, log returns are useful for single securities over time.
It is clear that, when problems involving the modeling of portfolio evolution over
time are considered, no single definition of return is fully satisfactory.
In these cases we either see approximations or, simply, models are directly expressed
in terms of prices.
You should keep in mind that standard "introductory" portfolio allocation models
are one period models, hence usually based on linear returns (but always read the
details, sometimes you’ll be surprised).
To sum up: the two definitions of returns yield different values when the ratio between consecutive prices is not equal to 1. The linear definition works very well for portfolios over a single period and conditionally on the knowledge of prices at time t − 1: expected values and variances of portfolios can be derived from expected values, variances and covariances of the components, as the portfolio linear return over a time period is a linear combination of the returns of the portfolio components.
For analogous reasons the log definition works very well for single securities over
time.
We conclude this section with three warnings which expand on what was already written about accrual conventions.
These warnings should be obvious, but experience teaches the opposite.
First. Many other definitions of return exist, and each one originates either from traditional accounting practice (and typically is connected with some specific asset class) or from specific computational needs. These are usually based on linear returns but use different conventions for computing the number of days between two prices and the accrual of possible dividends and coupons.

Second. No single definition is the "correct" or the "wrong" one. In fact such a statement has no meaning. The correctness in the use of a definition depends on the context in which it is applied (accounting uses are to be satisfied) and, obviously, on avoiding naive errors, such as exponentiating linear returns in order to derive prices, or summing log returns over different securities in order to get portfolio returns.
For instance: the fact that, for a price ratio near to 1, the two definitions give similar values should not lead the reader into the following reasoning: "if I break a sizable period of time into many short sub periods, such that prices in consecutive times are likely to be very similar, I am going to make a very small error if I use, say, the linear return in the accrual formula for the log return". This is wrong: in any single sub period the error is going to be small, but, as mentioned above, this error always has the same sign, so that it shall sum up and not cancel, and on the full time interval the total error shall be the same no matter how many sub periods we consider.
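The sketch below illustrates this with made-up prices: splitting a 50% gross move into more and more sub-periods, the sum of the linear sub-period returns converges to the full-period log return, not to the full-period linear return, so the mistake never washes out.

```python
import math

p0, p1 = 100.0, 150.0                # illustrative start and end prices
true_linear = p1 / p0 - 1            # 0.5
true_log = math.log(p1 / p0)         # about 0.4055

for m in [1, 10, 100, 1000]:
    ratio = (p1 / p0) ** (1.0 / m)   # equal price ratio in each of the m sub-periods
    sum_linear = m * (ratio - 1)     # summing linear returns as if they were log returns
    # Each per-period error is tiny, but all errors share the same sign and add up.
    print(m, sum_linear, true_linear - sum_linear)
```

As m grows, the gap between sum_linear and the true linear return approaches (x − 1) − ln(x) for x = 1.5, about 0.094, however fine the partition.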
Third: this multiplicity of definitions requires that, when we speak about any
properties of “returns”, it should be made clear which return definition we have in
mind. For instance: the expected value of log returns must not be confused with the
expected value of linear returns. The probability distribution of log returns shall not
be the same as the probability distribution of linear returns, and so on.
Practitioners are very precise in specifying such definitions in financial contracts. The common imprecisions found in financial newspapers can be justified in view of the descriptive purposes of these publications.

[Figure: r and r* as functions of P_t/P_{t-1}. The linear return r lies above the log return r* for every value of the price ratio, the two curves being tangent at P_t/P_{t-1} = 1, where both returns equal 0.]
1.2 Price and return data: a cautionary tale
Finance is "full of numbers": price data and related statistics are gathered for commercial and institutional reasons and are readily available in both free and costly/commercial databases. This has been true for many years and, for some relevant markets, databases have been reconstructed back to the nineteenth century. In some cases even before.
As in any field where data are so overwhelmingly available and not directly created by the researcher through experiments, any researcher must be cautious before using data and follow at least some very simple rules, which could be summarized in the sentence: "KNOW YOUR DATA BEFORE USING IT!".
What does the number mean? How was it recorded? Did it always mean the same thing? These are three very simple questions which should get an answer before any analysis is attempted. Failure to do so could taint results in such a way as to make them irrelevant or even ridiculous.
Moreover: the abundance of numbers should not be meant to imply that Finance
is necessarily amenable to mathematical analysis.
Mathematics, actually, builds models and theories that may not even require numbers or quantities. However, it requires the existence of simple and stable relationships between well defined "objects", which may or may not be numbers or quantities.
For this reason, understanding the numbers of Finance and its quantities is, above all, the first necessary step toward understanding whether some of these, or some proper modifications of these, can be the object of useful mathematical modelling.
But understanding numbers in Finance is not easy.
This is true even at a very basic level, by far lower than what is required in order
to assess the possibility of useful mathematical modelling.
This is not the place for a detailed discussion, but it could be useful for us to try and analyze what, at first sight, should be a very simple example.
Suppose you wish to answer the following question: "how did the US stock market behave during its history?".
This seems a quite straightforward question, it also is a quite obviously important question, and you may think a simple and clear answer should be readily available.
You browse the Internet and run a search for literature on the topic, expecting such a simple, clear and unanimous answer.
Suppose you are able to fend off conspiracy theorists, finance fanatics, quack doctors and snake oil sellers, Ponzi scheme advertising and the like.
(All these abound in the field).
Let us say that you concentrate on academic and academically linked literature (beware: this by no means assures you of fully avoiding conspiracy theorists, fanatics, snake oil sellers and Ponzi schemers).
At the onset, you could be puzzled by the fact that, in the overwhelming majority of papers and books, the performance of markets where thousands of securities, not always the same, are traded, and traded in different historical moments and under different institutional rules, is summarized by a single number, an index. For the moment, we put this point aside and follow this path.
You find a whole jungle of academic and non academic references, among which you choose, e.g., two frequently quoted expository books by famous academics: "Irrational exuberance" by Robert J. Shiller (of Yale) and "Stocks for the long run" by Jeremy J. Siegel (of Wharton)¹.
You browse through the first chapter of both books.
You first look at Figure 1-1 of Siegel, which tells you that 1 dollar invested in stock
in 1802 would have become 7,500,000 dollars by 1997. Moreover you read that 1 dollar
of 1802 is equivalent (according to Siegel) to 12 dollars in 1997. You divide 7,500,000
by 12 and get a real return of about 625000 times (62,500,000% !)
On the other hand, Figure 1.1 of Shiller’s book gives the following information:
between 1871 and 2000 the S&P composite index corrected by inflation grew from
(roughly) 70 to (roughly) 1400 with a real return of roughly 20 times (2000%).
Both numbers are big, but also quite different.
Now you are puzzled.
Sure: a part of the difference is due to the different time basis.
Looking at Siegel's picture, you see that the dollar value of the investment around 1870 was about 200. Even exaggerating inflation, attributing the full 12 times devaluation to the 1870-2000 period, and assessing these 200 dollars of 1870 to be worth 2400 dollars in 1997, we would have a real increase of 3125 times, which is still more than 150 times Shiller's number.
The difference, obviously, cannot come from the difference in terminal years of the
sample, as the period 1997-2000 was a bull market period and should reduce, not
increase, the difference.
Now, both Authors are famous Finance professors and at least one of them (Shiller)
is one of the gurus of the present crisis. So the problem must be in the reader (us).
Let us try and improve our understanding by reading the details. First we notice that Siegel quotes as source for the raw data the Cowles series, as reprinted in Shiller's book "Market volatility", for the 1871-1926 period, and the CRSP data for the following period. Shiller, on the other hand, mentions the S&P composite index.
Reading with care we see another difference: Shiller speaks about a “price” index
while Siegel about a reinvested dividends total return index. Is this the trick?
Browsing the Internet we see that Shiller’s data are actually available for down-
loading (http://www.econ.yale.edu/∼shiller/data.htm).
With these data we can compute the total return for Shiller's series between 1871 and 1997: the real increase now is from 1 dollar to 3654 dollars in real terms.
¹ The connection between the two authors and the two books is clearly stated by Shiller in his Acknowledgments.

We also see that the CPI passed from 12 to 154 in the same time interval, so that the "12 times" rule for the value of the dollar used by Siegel seems a good approximation².
There is still some disagreement between the numbers (Siegel 3125, but with exag-
gerated inflation, and Shiller 3654, but with 3 added years of bull market). However,
we think that, at least for answering our question, we have enough understanding.
In particular, we understand (most of) the reason for the, apparent, huge difference
between the statements of the two Authors as initially considered.
In this very short analysis we did learn some important things.
First: understand your question. “How did the US market behave during its history”
is, now we understand, not quite a well specified question.
Are we looking for a summary of the history of prices, or for the history of one
dollar invested in the market? The two different questions have two different answers
and require different data.
Second: understand your data. Price data? Total return data? Raw or inflation
corrected?
There are many subtle but relevant points that should be made; we only mention the Survivorship Bias problem, which taints any ex post analysis of financial series.
We stop here, for the moment, and do not mention the fact that much discussion has taken place about the relevance of the question and of the answers and their interpretation.
A final interesting fact is this: while Siegel and Shiller start with data which, once understood, are very similar, they reach quite different conclusions. At least, this is their opinion of their works.
We can reconcile the data: we understand they are using the same data in two
different ways.
However, a puzzle remains: why does each of them draw a different conclusion and, moreover, why do they perfectly understand this and, still, "agree to disagree"?
When we deal with rational and clever people, as they are, this implies a deep disagreement not about the data, but about the correct way of modelling and interpreting them, so as to be able to draw conclusions.

1.3 Some empirical “facts”


While we realize, if not fully understand, these differences of opinion, this could be the right place to state several empirical "facts" that underlie much of the discussion about the long run behaviour of the US stock market.
² Beware of long term inflation indexes. The underlying hypothesis is that the basket of consumption goods be comparable through time. As an anecdotal hint of the contrary: a very good riding horse in 1865 could cost 200 dollars, while a "comparable status" car costs, today, 50,000 dollars. If we, quite heroically, compare the two "goods" on the basis of their use, we see an increase in price not of 12 times but of 250 times. If we use the "12 times" rule, we get 2400 dollars, which might be the price of a scooter. Which is the right comparison?

Again, be warned that, as mentioned above, while researchers broadly agree on the
following numbers, they broadly disagree on their interpretation.
We do this with the yearly Shiller dataset (widely used in the academic literature). We shall concentrate on the total log return series.
The dataset starts in 1871 and is updated each year; since, in the latest available version (at the time this chapter was written), Shiller's dataset runs up to 2013 included, we shall limit our computations to the interval 1871-2013.
During this time interval the average real log total return of the index was 6.33%.
In the same period the average real one year interest rate was 1.03%, so that the
“risk premium” was about 5.3%.
The standard deviation of the real log total return was 17.09% while the same
statistic for one year real interest rates was 6.54%.
The 5.3% average real log total return in excess of the yearly rate (which was even higher up to year 2000), compared with the 17.09% standard deviation (which was even smaller up to 2000), did generate a literature concerned with the "equity premium puzzle".
The average of the real dividend yield (up to 2011 only) is 4.45% and the standard
deviation of the same is 1.5%.
The average real log price return was 2.16% and the standard deviation of the same
17.68%.
While we can only approximately sum these two results and compare them with
the total real log return, we see that most of the equity premium is associated with the
dividend yield.
Notice that the correlation coefficient between real dividend yield and real log price return is .10 (positive but small); this explains why the standard deviation of the total real log return is even smaller than the sum of the standard deviations of log real price return and real dividend yield (diversification effect).
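The diversification effect can be checked approximately with the variance-of-a-sum formula (only approximately, because the log total return is not exactly the sum of the two components; the standard deviations and correlation below are the ones quoted in the text):

```python
import math

sd_dy, sd_pr, rho = 0.015, 0.1768, 0.10   # std devs and correlation quoted in the text

# Variance of a sum: V(a + b) = V(a) + V(b) + 2 * rho * sd(a) * sd(b).
sd_sum = math.sqrt(sd_dy**2 + sd_pr**2 + 2 * rho * sd_dy * sd_pr)

print(sd_sum)            # about 0.179, well below sd_dy + sd_pr = 0.1918
```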
On the other hand this small correlation is, by itself, a puzzle. This is because the value of a stock is commonly interpreted as some kind of expected value of future discounted dividends.
A last piece of simple data: the 1 year autocorrelation of the real total log return series is very small: 2.29%. This is a first simple piece of evidence of the fact that it is very difficult to forecast future returns on the basis of past returns.
Some of these empirical facts are at the basis of the simple stock price evolution
model we shall introduce in the next chapter: the log random walk.

[FIGURE 1-1: Total Nominal Return Indexes, 1802-1997]
Examples: Exercise 1a - returns.xls, Exercise 1b - returns.xls

2 Logarithmic (log) random walk


In the previous paragraphs we defined two kinds of returns and discussed some historical financial data.
Here we are taking a big step forward.
We are going to specify a mathematical model for the evolution of returns/prices through time as a "stochastic process", which means, for us here at least, a sequence of random variables.
In the light of this model, observed returns shall be seen as "observations" of the random variables which constitute the process.
Our mathematical model, the log random walk, does not try to describe “how”
prices are “made” in the market.
The limited purpose of the model is to fit some observed statistical characteristics of price evolution.
The log random walk (LRW) hypothesis on the evolution of prices, in its simplest version, states that, if we abstract from dividends and accruals, prices, intended as random variables, evolve from time t − ∆ to time t according to the stochastic difference equation:

ln P_t = ln P_{t-∆} + ε_t

where the "innovations" ε_t are assumed to be uncorrelated across time (cov(ε_t; ε_{t'}) = 0 ∀t ≠ t'), with constant expected value µ∆ and constant variance σ²∆ (note: ∆ > 0).
Since ln P_t − ln P_{t-∆} = r*_t, the LRW is equivalent to the assumption that log returns are uncorrelated random variables with constant expected value µ∆ and constant variance σ²∆.
A specific probability distribution for ε_t is not required at this introductory level.
It is, however, the case that, often, the log random walk hypothesis is presented from scratch assuming the vector of the ε_{t_i} at different times t_i to be jointly distributed according to a Gaussian distribution. In this case, the assumption of non correlation becomes equivalent to the assumption of independence.

Notice that from the model assumptions we have P_t = P_{t-∆} e^{r*_t} = P_{t-∆} e^{ε_t}, so, if ε_t is assumed Gaussian, P_t shall be lognormally distributed.
A linear (that is: without logs) random walk in prices of the kind P_t = P_{t-∆} + ε_t was sometimes considered in the earliest times of quantitative financial research. This does not seem a good model for prices of limited liability securities, since a sequence of negative ε_t may result in negative prices.
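A small simulation contrasts the two models (the drift, volatility and horizon are illustrative, not calibrated to any market): the log random walk keeps prices positive by construction, while the linear random walk does not.

```python
import math
import random

random.seed(1)
mu, sigma, steps = 0.0003, 0.01, 250      # illustrative daily drift and volatility

# Log random walk: ln P_t = ln P_{t-Delta} + eps_t, so P_t = P_0 * exp(sum of eps).
log_p = math.log(100.0)
for _ in range(steps):
    log_p += random.gauss(mu, sigma)
price = math.exp(log_p)
assert price > 0                          # guaranteed, whatever the draws

# Linear random walk: P_t = P_{t-Delta} + eps_t; nothing stops the path from
# crossing zero, which is nonsense for a limited liability security.
p = 1.0
negative_seen = False
for _ in range(steps):
    p += random.gauss(0.0, 0.5)
    negative_seen = negative_seen or p < 0
print(negative_seen)                      # typically True for a path this volatile
```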

Moreover, while the hypothesis of constant variance for (log) returns may be a good
first order approximation of what we observe in markets, at least in the short run, the
same hypothesis for prices is not empirically sound: in general price changes tend to
have a variance which is an increasing function of the price level.
A couple of points to stress.
First: ∆ is the “fraction of time” over which the return is defined. This may be
expressed in any unit of time measurement: ∆ = 1 may mean one year, one month,
one day, at the choice of the user.
However, care must be taken so that µ and σ² are assigned consistently with the choice of the unit of measurement of ∆. In fact, µ and σ² represent the expected value and variance of the log return over a horizon of length ∆ = 1, and they shall be completely different if 1 means, say, one year (as it usually does) or one day (we shall discuss in what follows a particular convention for translating the values of µ and σ² between different units of measurement of time; this convention is one of the consequences of the log random walk model).
Second: suppose the model is valid for a time interval of ∆ and consider what
happens over a time span of, say, 2∆.
By simply iterating the model twice we have:

ln P_t = ln P_{t-2∆} + ε_t + ε_{t-∆} = ln P_{t-2∆} + u_t

having set u_t = ε_t + ε_{t-∆}. The model appears similar to the single ∆ one, and in fact it is, but it must be noticed that the u_t, while uncorrelated on a time span of 2∆ (due to the hypothesis on the ε_t), shall indeed be correlated on a time span of ∆. This means, roughly, that the log random walk model can be aggregated over time if we "drop" the observations (just one in this case) in between each aggregated interval (in our example the model shall be valid if we drop every other original observation).
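A simulation sketch of this point (sample size and innovation scale are illustrative): aggregated innovations u_t = ε_t + ε_{t−∆} are strongly correlated at lag ∆ (theoretical correlation 0.5, since consecutive u's share one ε) but uncorrelated when sampled every 2∆.

```python
import random

random.seed(0)
eps = [random.gauss(0.0, 1.0) for _ in range(100_000)]   # i.i.d. innovations
u = [eps[i] + eps[i - 1] for i in range(1, len(eps))]    # u_t = eps_t + eps_{t-Delta}

def corr(x, y):
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / m
    vx = sum((a - mx) ** 2 for a in x) / m
    vy = sum((b - my) ** 2 for b in y) / m
    return cxy / (vx * vy) ** 0.5

lag_delta = corr(u[:-1], u[1:])               # consecutive u's share one eps: corr ~ 0.5
thinned = u[0::2]                             # keep every other u: no shared eps
lag_2delta = corr(thinned[:-1], thinned[1:])  # corr ~ 0
print(lag_delta, lag_2delta)
```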
This is going to be relevant in what follows.
The LRW was, and in part still is, a traditional standard model for the evolution of stock prices.
It is obviously a wrong model, if understood as stating that prices are dictated by "chance". Whatever the meaning of the word "chance", prices come from market trading, where people decide to post bids and offers for reasons which, at least for them, are (hopefully) perfectly clear and reasonable. "Chance" here, or its semblance, comes (Aristotle would be happy with this) from the messy and chaotic interaction of perfectly determined chains of decisions. In this sense, the model completely fails in describing the actual process of price formation.
If, on the other hand, we analyze the model by comparing the statistical properties (in frequency terms) of observed data with the probabilistic implications of the model, it can be considered as a good descriptive model, in the sense that, even if it does not take into account the actual process of price creation, it reproduces to some degree some observed "large scale" (i.e. ∆ not too small) statistical properties of prices.
Even from this point of view, the model is useful for introductory and simple pur-
poses only. The weight of empirical analysis during the last thirty years has led most
researchers to consider this model as a very approximate probabilistic description of
stock price behavior.
In a nutshell: while no consensus has been reached on an alternative standard
model, there is general agreement about the fact that some sort of (very weak)
dependence of today’s returns on the full, or at least recent, history of returns exists.
Moreover, the constancy of the expected value and variance of the innovation term
has been strongly called into question, both in terms of slow variations of these
parameters and of possible sudden “variance explosions”.
By far the weakest aspect of the model is the Gaussian assumption (when required)
and we shall discuss this point in some detail when dealing with value at risk.
In any case, the LRW still underlies many conventions regarding the presentation
of market statistics. Moreover, the LRW is perhaps the most important justification
for the commonly held equivalence between the intuitive term "volatility" and the
statistical entity "variance" (or better, "standard deviation").
An important example of the influence of the log random walk model on market
practice concerns the “annualization” of expected value and variance.
We are used to the fact that, often, the rate of return of an investment over a given
time period is reported in an “annualized” way. The precise conversion from a period
rate to a yearly rate depends on accrual conventions. For instance, for an investment
of less than one year in length, the most frequent convention is to multiply the period
rate by the ratio between the (properly measured according to the relevant accrual
conventions) length of one year and the length of the investment. So, for instance, if
we have an investment which lasts three months and yields a rate of 1% in these three
months, the rate on a yearly basis shall be 4%.
It is clear that this is just a convention: the rate for an investment of one year
in length shall NOT, in general, be equal to 4%; this is just the annualized rate for
our three months investment. The two would coincide, for instance, if the term structure
of interest rates were constant. However, such a convention can be useful for comparison
across investment horizons.
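The convention just described can be sketched in a few lines (the function name and numbers are ours, purely illustrative):

```python
def annualize_rate(period_rate, period_length_in_years):
    """Linear annualization convention: scale the period rate by
    (length of one year) / (length of the investment)."""
    return period_rate / period_length_in_years

# A three-month investment yielding 1% annualizes to 4%.
print(annualize_rate(0.01, 0.25))  # 0.04
```

Remember that the output is a convention for comparison, not a forecast of the one-year rate.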
In a similar way, when we speak of the expected return or the standard devia-
tion/variance of an investment, it is common to report the number in an annualized way,
even if we speak of returns for periods of less or more than one year. The actual
annualization procedure is based on a convention which is very similar to the one used
in the case of interest rates. As in that case, the convention is “true”, that is: annualized
values of expected value and variance correspond to per annum expected values and
variances, only in particular cases. The specific case on which the convention
used in practice is based is the LRW hypothesis.
If we assume the LRW and consider a sequence of $n$ log returns $r^*_t$ at times $t, t-1, t-2, \ldots, t-n+1$ (just for the sake of simplicity in notation we suppose each time
interval $\Delta$ to be of length 1 and drop the generic $\Delta$) we have that:

$$E(r^*_{t-n,t}) = E\Big(\sum_{i=0,\ldots,n-1} r^*_{t-i}\Big) = \sum_{i=0,\ldots,n-1} E(r^*_{t-i}) = n\mu$$

$$V(r^*_{t-n,t}) = V\Big(\sum_{i=0,\ldots,n-1} r^*_{t-i}\Big) = \sum_{i=0,\ldots,n-1} V(r^*_{t-i}) = n\sigma^2$$

This obvious result, which is a direct consequence of the assumption of constant
expected value and variance and of the non correlation of innovations at different times,
is typically applied, for annualization purposes, even when the LRW is not considered
to be valid.
So, for instance, given an evaluation of $\sigma^2$ on daily data, this evaluation is annualized
by multiplying it by, say, 256 (or any other number representing open market days:
different conventions exist); it is put on a monthly basis by multiplying it by, say, 25,
and on a weekly basis by multiplying it by, usually, 5.
As we stressed before, this is not just a convention but the correct procedure, if
the LRW model holds. In this case, in fact, the variance over $n$ time periods is equal to
$n$ times the variance over one time period. If the LRW model is not believed to hold,
for instance if the expected value and/or the variance of returns is not constant over
time, or if we have correlation among the $\epsilon_t$, this procedure may still be applied, but just
as a convention.³
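As a sketch of the scaling rules just described (the 256-day count is one of several conventions, and the daily volatility figure is a made-up example):

```python
import math

def annualize_variance(per_period_var, periods_per_year):
    # Under the LRW, variance scales linearly with the number of periods.
    return per_period_var * periods_per_year

def annualize_volatility(per_period_vol, periods_per_year):
    # Standard deviation therefore scales with the square root of time.
    return per_period_vol * math.sqrt(periods_per_year)

daily_vol = 0.0113  # hypothetical daily standard deviation of log returns
print(annualize_volatility(daily_vol, 256))  # 16 * 0.0113 = 0.1808, about 18% p.a.
```

Note the asymmetry: variance is multiplied by 256, but volatility only by √256 = 16.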
The fact that, under the LRW, the expected value grows linearly with the length of
the time period while the standard deviation (square root of the variance) grows with
the square root of the number of observations, has generated a lot of discussion about the
existence of some time horizon beyond which it is always proper to hold a stock port-
folio. This problem, conventionally called “time diversification” and, more popularly,
“stocks for the long run”, has been discussed at length both on the positive side (commonly
supported by fund managers) and the negative side (more rooted in academia: Paul
Samuelson is a non negligible opponent of the idea). We shall consider it in the next
section.
To get an idea of the empirical implications of the LRW hypothesis (plus that of a
Gaussian distribution) for returns, we plot in the following figures an aggregated index
of the US stock market in the 20th century, together with 100 simulations describing
possible alternate histories of the US market in the same period, under the hypothe-
sis that the index evolution follows a LRW with yearly expected value and standard
deviation of log return identical to the historical average and standard deviation:
resp. 5.36% and 18.1% (the use of %, as usual with log returns, is quite improper, if
³ Empirical computations of variances over different time intervals typically result in sequences
which tend to increase less than linearly with respect to the increase of the time interval between consecutive
observations. This could be interpreted as the existence of (small) on-average negative correlations
between returns.
common). Data is presented both in price scale (starting value 100) and in log price
scale. The reason is simple. Consider the distribution of the log return after 100 years under
our hypotheses. This is going to be the distribution of the sum of 100 iid Gaussian RVs,
each with expected value 5.36% and standard deviation 18.1%. Using known results
we have that this distribution shall be Gaussian with expected value 536% and stan-
dard deviation 181%. So, a standard $\pm 2\sigma$ interval for the terminal value of this sum
is $536\% \pm 362\%$, or, in price terms, $100e^{5.36 \pm 3.62}$: that is, an interval with lower extreme
569 and upper extreme 794263. This means that under our hypotheses the possible
histories can be quite different. No problem in this if we recall the unconditional nature
of the model.
To get a quick idea: the actual historical evolution of the market as measured by
our index gave a final value of the index equal to about 21000, which corresponds, as
said, to a sum of log returns of 536(%). This is, by construction, smack in the middle of
the distribution of the summed log returns, and is the median of the price distribution.
However, due to the exponentiation or, if you prefer, due to the power of compound
interest, the distribution of final values is highly asymmetric (it is Lognormal), so that
the range of possible values above the median of prices is much bigger than the range
below it. We only simulated 100 possible histories. Even with such a limited sample
we have a top terminal price of more than 2000000 (in a very lucky, for long investors,
world. We wonder what studying Finance would be like in such a world...) and a bottom
terminal price below 100 (again: in a world so unlucky that, had we lived in it, we
likely would not talk about the stock market)⁴.
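The simulation behind the figures can be reproduced, in spirit, with a short sketch (plotting omitted; µ and σ are the values estimated in the text, everything else is our illustrative choice):

```python
import math
import random

def simulate_lrw(n_paths=100, n_years=100, mu=0.0536, sigma=0.181,
                 p0=100.0, seed=0):
    """Simulate Gaussian LRW price paths: ln P_t = ln P_{t-1} + eps_t."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        log_p = math.log(p0)
        path = [p0]
        for _ in range(n_years):
            log_p += rng.gauss(mu, sigma)
            path.append(math.exp(log_p))
        paths.append(path)
    return paths

paths = simulate_lrw()
finals = sorted(path[-1] for path in paths)
# Terminal prices typically span several orders of magnitude.
print(finals[0], finals[len(finals) // 2], finals[-1])
```

Even with only 100 paths, the spread between the smallest and largest terminal price illustrates the heterogeneity of the "possible histories" discussed above.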
This result could be puzzling as the “possible histories” seem very heterogeneous.
This is an immediate consequence of the log random walk hypothesis. If we estimate
⁴ Compare this with the Siegel-Shiller data we discussed in section 1, then think about the result of
our simulation in such extreme worlds. For instance, with the historical mean and standard deviation
of the extremely depressed version of the 20th century, the simulation I would show you in this possible world,
provided you and I were still interested in this topic, would be quite different from what you see here.
And all the same, this possible story is a result totally compatible (under Gaussian LRW) with what
we did actually see in our real history. Spend a little time thinking about this point. It could be
“illuminating”.
Think also about the economic sustainability of such extreme worlds: such extreme market behaviours
cannot happen by themselves (this is not the plot of some lucky or unlucky casino guy, it is the market
value of an economy, which should sustain such values, provided investors are not totally dumb), and
they appear so absurd just because they underline the possible absurd extreme conclusions we
can derive from a simple LRW model.
Last but not least, remember that all this comes from the analysis of the stock market in a, up to now,
very successful country: the USA. But we analyze it so much also because it was successful
(and so, for instance, most Finance schools, journals and researchers are USA based). This biases
our conclusions if we wish to apply them to the rest of the world or, even, to the future
of the USA. Maybe a more balanced view could be gained by comparing this result with the evolution of
stock markets all around the world (this is not a new idea: Robert J. Barro, for instance, did this in
“Rare Disasters and Asset Markets in the Twentieth Century” (2006), Quarterly Journal of Economics,
121(3): 823–66).
µ and σ out of a long series of data (one century), we are using data from a very
heterogeneous set of economic and historical conditions. Then we use these numbers in
order to simulate “possible histories” without conditioning on any particular evolution
of the historical or economic variables which could, and shall, influence the stock price.
In other words: we are using the log random walk model as a “marginal” model.
That is: it is unconditional on everything you may know or suppose about the evolution
of other variables connected with the evolution of the modeled stock price.
This point is quite relevant if we wish to understand the sometimes surprising
implications of this simple model.
In the above example, according to the model and the historically estimated pa-
rameters, we get the $\pm 2\sigma$ interval $536\% \pm 362\%$ (beware the % sign: these are log
returns), or, in price terms, $100e^{5.36 \pm 3.62}$: that is, an interval with lower extreme 569 and
upper extreme 794263⁵. It must be clear that such a wide set of histories is possible,
with non negligible probability, only because we assumed nothing about the (century
long) evolution of all the variables that shall influence prices. Only under this “igno-
rance” assumption can such a heterogeneous set of trajectories have non negligible
probability.
If we are puzzled by the result, this is because, while the model describes the possible
evolution of prices “in whatever conditions”, unconditional on anything (in fact, we
estimate expected return and standard deviation using a long history, during which
many different things happened), when we see the implications of the model we, almost
invariably, shall be conditioned by our recent memories and recall recent events or,
unconsciously, shall make some hypothesis about the future: for instance, that
economic growth shall be, on average, similar to what was recently seen. Since the estimates
of µ and σ we use (or even our assumption of zero correlation of log returns and, more
in general, the structure of the model itself, which contains no other variables but a
single price) are NOT conditional on such (implicit) hypotheses, it is not surprising that
the model gives us such wide variation bounds with respect to what we would expect.
This misunderstanding is quite common and is to be kept in mind whenever
discussing results of applications of the log random walk model⁶.
⁵ By the way: this should be enough to understand why we should not use the term % when
speaking of log returns.
⁶ There exists a wide body of literature, both from the applied and the academic sides, that suggests
ways of “conditioning” the model.
[Figure: 100 years of simulated log random walk data, 100 simulated paths (mean log return 5.35%, dev.st. 18.1%); price scale.]

[Figure: 100 years of simulated log random walk data (range subset), compared with USA stock market in the 20th century (mean log return 5.35%, dev.st. 18.1%).]

[Figure: 100 years of simulated log random walk data, log scale, compared with USA stock market in the 20th century (mean log return 5.35%, dev.st. 18.1%).]
2.1 "Stocks for the long run" and time diversification
These are very interesting and popular topics, part of the lore of the financial milieu. A
short discussion shall be useful to clarify some issues connected with the LRW hypoth-
esis, together with some implicit assumptions underlying much financial advertising.
We have three flavors of these arguments. Two are “a priori” arguments, depending
on the log random walk hypothesis or something equivalent to it; the third is an a
posteriori argument based on historical data.
It is quite important to have a clear idea of the different weight and meaning of these
arguments. In fact, most of the “puzzling” statements you may find in investment
industry advertising of “stocks for the long run” depend on a wrong “mix” of the
arguments.⁷

2.1.1 First version


The basic idea of the first version of the argument can be sketched as follows.
Suppose single period (log) returns have (positive) expected value µ (in excess of
the risk free rate; in what follows we shall omit repeating “in excess...”) and variance
σ².
Moreover, suppose for simplicity that the investor requires a Sharpe ratio of, say, $S$
out of his/her investment.
Under the above hypotheses, plus the log random walk hypothesis, the Sharpe ratio
over $n$ time periods is given by

$$\frac{n\mu}{\sqrt{n}\,\sigma} = \sqrt{n}\,\frac{\mu}{\sigma}$$
so that, if $n$ is big enough, any required value can be reached. Another way of phrasing
the same argument, when we add the hypothesis of normality of returns, is that, if we
choose any probability $\alpha$, the probability of the investment yielding an $n$ periods return
greater than

$$n\mu - \sqrt{n}\,z_{1-\alpha}\,\sigma$$

is equal to $1 - \alpha$.
(In fact, this is the lower bound of the $\alpha$ probability interval for the $n$ period log
return under the above hypotheses.)
But this is increasing in $n$ (take the derivative and work through the algebra) for

$$\sqrt{n} > \frac{1}{2}\,\frac{z_{1-\alpha}\,\sigma}{\mu}$$

and it is also unbounded (above), so that for any $\alpha$ and any chosen value $C$ there
exists an $n$ such that, from that $n$ onward, the probability of an $n$ period return less
than $C$ is less than $\alpha$.
⁷ As an example, read the Vanguard document “Time Diversification and Horizon-Based Asset
Allocations”, available at http://www.vanguard.com/pdf/icrtd.pdf?2210045172
You can choose a very small $\alpha$ and a very big $C$; what happens is that the required
$n$ shall be bigger.
The investment suggestion could be: if your time horizon is an undetermined
number $n$ of years, then choose the investment that has the highest expected return
per unit of standard deviation, even if the standard deviation is very high. Even if this
investment may seem too risky in the "short run", there is always a time horizon such
that, for that horizon, the probability of any given loss is as small as you like or, which
is the same, the Sharpe ratio is as big as you like.
Typically, such high return (and high volatility) investments are stocks, so: "stocks
for the long run".
Now, the “run” can really be “long”.
The value of $n$ for which this lower bound crosses a given level $C$ is the solution of

$$n\mu - \sqrt{n}\,z_{1-\alpha}\,\sigma \ge C$$

In particular, for $C = 0$ the solution is

$$\sqrt{n} \ge \frac{z_{1-\alpha}\,\sigma}{\mu}$$

The typical stock has a $\sigma/\mu$ ratio for one year of the order of about 6 or more. So,
even allowing for a big $\alpha$, so that $z_{1-\alpha}$ is near one (check by yourself the corresponding
$\alpha$), the required $n$ shall be in the range of 36 years, which is only slightly shorter than the
average working life.
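The numbers just quoted can be checked directly (a sketch: the function name is ours, and µ and σ are expressed in percentage points per year so that only their ratio matters):

```python
import math

def horizon_for_nonnegative_bound(mu, sigma, z):
    """Smallest n with n*mu - sqrt(n)*z*sigma >= 0, i.e. sqrt(n) >= z*sigma/mu."""
    return math.ceil((z * sigma / mu) ** 2)

# A sigma/mu ratio of about 6 and z_{1-alpha} near one give n around 36 years.
print(horizon_for_nonnegative_bound(mu=3.0, sigma=18.0, z=1.0))  # 36
```

A more demanding α (larger z) makes the required horizon grow with the square of z.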
Under the stated hypotheses these arguments are all formally correct.
The question is whether the result is relevant and the investment suggestions reasonable.
Let us consider some possible critiques.
Whether this investment suggestion is reasonable or not depends on the investor’s cri-
terion of choice. This, for instance, could be the full period expected return given some
probability of a given loss, or the Sharpe ratio for the full $n$ periods or, for instance, the
per period Sharpe ratio (which obviously is a constant) or, again, the absolute volatility
over the full period of investment (which obviously increases without bound), and so
on.
For instance, a typical critique of the statement is phrased like this: "Why should
we consider as good a given investment for n time periods, if we do not consider it
good for each single one of those periods?"
This critique, shared by economists the likes of Samuelson, is correct if we believe
that the investor takes into account the per period Sharpe ratio, or some measure of
probable loss and expected return per period.
In other words the critique is correct if, very reasonably, we believe the investor does
not consider equivalent investments with identical Sharpe ratios but over different time
spans.
Another frequent critique is: "It is true: the expected value of the investment
increases without bound, but so does its volatility; so, in the end, over the long run
I am, in absolute terms, much more uncertain about my investment result" (the mean-
standard deviation ratio goes up only because the numerator grows faster than the
denominator).
This is reasonable as a critique if we believe the investor to decide on the basis of
the absolute volatility of the investment over the full time period.
We should also point out that choosing a single asset class only because, by itself, it
has the highest Sharpe ratio can always be criticized on the basis of diversification
arguments.
In the end, acceptance or refusal, on an a priori basis, of the investment suggestions
implied by this argument (by itself formally correct) depends on how we model the
investor’s decision making.

2.1.2 Second version


The second version of the argument, again based on the log random walk hypothesis,
is a real fallacy (that is: it is impossible to justify it in any reasonable way) and is the
so called "time diversification" argument.
Note: each single mathematical passage of this argument is perfectly correct, as
in the previous case, given the hypotheses. What is wrong, and unjustifiable, is the
interpretation given to the result.
There is an enticing similarity, under the log random walk hypothesis, between
an investment for one year in, say, 10 uncorrelated securities with identical expected
returns and volatilities (this last hypothesis is just for simplicity: the argument can be
extended to different expected returns and volatilities), and a 10 year investment in a
single security with the same expected value and volatility.
To be precise, in order for the result to hold we must forget the difference between
linear and log returns; moreover, the comparison implicitly requires zero interest rates.
But let’s do it: such an “approximate” way of thinking is very common in any field
where some Mathematics is used for practical purposes, and it is a sound way to proceed
provided the user is able to understand the cases where his/her “approximations” do
not work.
In this case, the expected return and standard deviation of the return correspond-
ing to the first strategy (which could be tagged as the "average per security" return)
are $\mu$ and $\sigma/\sqrt{n}$, just the same as the expected value and standard deviation of the
"average per year" return of the second strategy.
Mathematically this is true (under the hypotheses).
However: why should we care?
First: the two investments cannot be directly compared since they are investments
of the same amount of money, but on different time periods. Question: what happens
to the first investment in the other nine years?
Second: we are using two different return measures. In the first case we use the
return of the (one year) investment, in the second case the “average per year” return.
But, while I can pocket the first, I cannot pocket the second.
All that I can derive from the second investment is the distribution of returns over
the ten year period which, obviously, has ten times the expected value and ten times
the variance of the distribution of the average return (which, we stress again, is not
the return I could get from the investment).
Dividing by ten the return of the second investment does not make it comparable
with the return of the first AND gives us a “measure of return” which is, actually, not
a measure of return.
And notice: the apparent “time diversification” fully comes from the “divide by n”
implied by the computation of the average return.
So, no time diversification exists but only a wrong comparison between different
investments using a true measure of return and an average per year return.
Comparable investments could be, e.g., a ten year investment in the diversified
portfolio and a ten year investment in the single security.
A possible correct comparison criterion could be the comparison between the ten
year expected return and return variance of the two investments.
However, in this case the diversified investment is seen to yield the same expected
value as the undiversified investment, but with one tenth of the variance, so that these
two investments, now comparable, are by no means equivalent, and the single security
investment is seen, in the mean variance sense, as inferior to the diversified investment.
Analogously, we could ask which investment in a single security over ten years has
the same return mean and variance as the one year diversified investment. The obvious
answer is: an investment of one tenth the size of the diversified investment. In other
words: in order to have the same effective (that is: you can get it from an investment)
return distribution, the two investments must be not only over different time periods but
also of different sizes.
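The fallacy can be made tangible with a back-of-the-envelope sketch (µ, σ and n are hypothetical round numbers of ours):

```python
import math

mu, sigma, n = 0.05, 0.20, 10  # per-period moments; 10 securities / 10 years

# One year in an equal-weighted portfolio of n uncorrelated securities:
# the pocketable return has mean mu and st.dev. sigma / sqrt(n).
port_mean, port_sd = mu, sigma / math.sqrt(n)

# Ten years in a single security: the pocketable total return has
# mean n*mu and st.dev. sqrt(n)*sigma ...
total_mean, total_sd = n * mu, math.sqrt(n) * sigma

# ... while the "average per year" return (which nobody can pocket)
# has the same moments as the portfolio return: the whole apparent equivalence.
avg_mean, avg_sd = total_mean / n, total_sd / n

print(port_mean, port_sd)    # mean 0.05, st.dev. about 0.063
print(avg_mean, avg_sd)      # the same numbers, but not a real return
print(total_mean, total_sd)  # mean 0.5, st.dev. about 0.63: what you face
```

The "equivalence" holds only between a real return and a per-year average; the pocketable ten-year return has √10 times the standard deviation of the diversified one-year return, scaled up together with the mean.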
While the first version of the argument could be argued for, at least under some
hypothetical, maybe unlikely but coherent, setting, this second version of the argument
is a true fallacy.

2.1.3 Third version


The third version of the stocks for the long run argument is the soundest, as it can be
argued for without relying on unlikely assumptions or even blatant logical errors.
It is to be noticed that this third version is not an a priori argument, based on
assumptions concerning the stochastic behavior of prices and the decision model of
agents (and, maybe, some logical error). Instead, it is an "a posteriori" or "historical"
version of the argument. As such, its acceptance or rejection entirely depends on the
way we study historical data. Its use as a motivation for investment suggestions wholly
depends on the hypothesis that, to some extent, the future behaviour of markets shall
resemble the past.
In short this argument states that, based on the analysis of historical prices, stocks
were always, or at least quite frequently, a good long run investment.
Being a historical argument, even if true (and here is not the place to argue for
or against this point), it does not imply that the past behavior should replicate itself in
the future.
While apparently held by the majority of financial journalists (provided they do
not weight too much, say, the post 1990 period in Japan, or the financial crisis and Covid19
periods for most of the rest of the world), and broadly popular in trouble free times (at
least as popular as the, historically false, argument about real estate as the most sure,
if not the best, investment), and so quite popular for most time periods, at least in
the USA and during the first thirty and the last fifty years of the past century, this
argument is quite controversial among researchers.
The two very famous and quite readable books we quoted in the chapter about
returns: Robert Shiller’s "Irrational Exuberance" vs Jeremy Siegel’s "Stocks for the
Long Run" share (sic!) opposite views on the topic (derived, as we hinted at but do
not have the time to fully discuss, from different readings of the same data).
While this is not the place for discussing the point, we would suggest the reader, just for
the sake of amusement, to consider a basic fault of such "in the long run it was ..."
arguments.
We have a typical example of the case where the very fact of considering the
argument, or even the phenomenon itself to which the argument applies, depends on
the fact that the phenomenon happened, that is: something "was good in the
long run".
In fact, we could doubt the possibility for an institution (the stock market)
which has survived in its modern form, at least in the USA, since, say, the second half
of the nineteenth century, to survive up to today without at least giving a sustainable
impression of offering some opportunities.
Such arguments, if not accompanied by something else to support them, become
somewhat empty, as would the analogous surprise at observing that the dish I
most frequently eat is also among those I like the most or, more in the extreme, that
old people did not die young or, again, that when we are in one of many queues, we
recall spending most time in the slowest queue.
Sometimes, however, the "opportunity" of some institution and how to connect this
with its survival can manifest in strange, revealing ways.
For instance, games of chance have existed from time immemorial with the only "long
run" property of making the bank holder richer, together with the occasional random
lucky player. The overall population of players is made, as a whole, poorer. So, while it
is clear here what the "opportunity" of this institution is (both for the, usually, steadily
enriched bank holder and the available, albeit unlikely, hope of a quick enrichment), the
survival of such an institution based on such opportunities tells us something interesting
about man’s mind.
This should not puzzle the reader as it is the bitter bread and butter of any research
field where we decide to use Probability and Statistics for writing and testing models,
but only observational data are available and no (relevant) experiments are possible.

Examples
Exercise 2 - IBM random walk.xls

3 Volatility estimation
In applied Finance the term “volatility” has many connected meanings. We mention
here just the main three:

1. Volatility may simply mean the attitude of market prices, rates, returns etc. to
   change in an unpredictable and unjustified manner, without connection to
   any formal definition of “change”, “unpredictable” or “unjustified”. Here volatility
   is tantamount to chance, luck, unknown destiny, etc. Usually the term has a nega-
   tive undertone and is mainly used in bear markets. In bullish markets the term
   is not frequently used and is typically replaced by more “positive” synonyms:
   a volatile bull market is “exuberant”, “tonic” or “lively”.

2. More formally, and mostly for risk managers, volatility has something to do with
the standard deviation of (should be log) returns and, sometimes, is estimated
using historical data (hence the name “Historical Volatility”).

3. For derivative traders, and frequently for risk managers too, “volatility” is the
   name of one (or more) parameters in derivative models which, under the hy-
   potheses that make the models “true”, are connected with the standard deviation
   of underlying variables. However, in the understanding that these hypotheses are
   never valid in practice, such parameters are not estimated from historical data
   on the underlying variables (say, using time series of stock returns) but directly
   backed out from quoted prices of derivatives, using the pricing model as a fitting
   formula (take the parameter value which, in the given formula, fits the observed
   derivative price). This is in accord with the strange, but widely held and, in fact,
   formally justifiable, notion that models may be useful even if the hypotheses un-
   derlying them are false. This is “Implied Volatility”. If you do not know the term,
   memorize it. It shall come up many times during your career.
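As an illustration of “backing out” a parameter from a quoted price, here is a minimal sketch using the Black-Scholes formula for a European call and plain bisection (all numbers are hypothetical, and the round-trip is a self-check, not a market calibration):

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(s, k, t, r, sigma):
    """Black-Scholes price of a European call option."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)

def implied_vol(price, s, k, t, r, lo=1e-4, hi=5.0, tol=1e-8):
    """Back the volatility out of a quoted call price by bisection
    (the call price is increasing in sigma)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(s, k, t, r, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Round-trip check: price a call at 20% volatility, then recover the volatility.
p = bs_call(s=100, k=100, t=1.0, r=0.01, sigma=0.20)
print(round(implied_vol(p, s=100, k=100, t=1.0, r=0.01), 4))  # 0.2
```

Note that nothing in this computation uses historical return data: the parameter is fitted to the derivative price alone, exactly as described above.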

In what follows we shall introduce a standard and widely applied method for estimating
volatility on the basis of historical data on returns, that is, we consider the second
meaning of volatility.
Under the LRW hypothesis a sensible estimate of $\sigma^2$ is:

$$S^2 = \sum_{i=0,\ldots,n} \left(r^*_{t-i} - \bar r^*\right)^2 / n$$

where $\bar r^*$ is the sample mean.


This is the standard unbiased estimate of the variance for uncorrelated random
variables with identical expected values and variances (the simple empirical variance of
the data, where the denominator is taken as the actual number of observations $n + 1$,
could be used without problems, as in standard applications the sample size is quite
big).
Notice that each data point is given the same weight: under the hypothesis,
any new observation should improve the estimate quality by the same amount.
By the way, “quality” of an estimate here means its sampling standard error or sampling
mean square error.
Please: do not be fooled by names. Any estimate of anything is a function of the
sample and, if the sample is random, the estimate is a random variable too. Hence it
has (if we rule out divergent cases) a standard error, a variance, an expected value and
so on. So, in a way, we can speak, for instance, of the “variance of the variance”: this
is shorthand for the more precise statement “the sampling variance of the estimate
of the population variance”. This is basic to understanding what it means to estimate
parameters and to quantify the quality of the estimates.
Now, back to our topic.
The log random walk would justify such an estimate.
In practice, use of this estimate is infrequent.
A common choice is the exponential smoothing estimate which, while already quite old
when suggested by J. P. Morgan in the RiskMetrics context, is commonly known
in the field as the RiskMetrics estimate:
$$V_t = \frac{\sum_{i=0,\ldots,n} \lambda^i\, r^{*2}_{t-i}}{\sum_{i=0,\ldots,n} \lambda^i}$$

From a statistician’s point of view this is called an exponentially smoothed estimate of
the variance (actually, of the second moment, but more on this in a moment), where $\lambda$
is the smoothing parameter: $0 < \lambda < 1$.
Common values of the smoothing parameter are 0.95-0.99.
Users of such an estimate do not consider each data point as equally relevant. Old
observations are less relevant than new ones.
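The formula above translates directly into code (the sample returns below are made up; returns are ordered from newest to oldest, matching the $r^*_{t-i}$ indexing):

```python
def riskmetrics_variance(returns, lam=0.95):
    """Exponentially smoothed second moment of returns,
    with returns[0] = r_t (newest) and returns[i] = r_{t-i}."""
    num = sum(lam ** i * r ** 2 for i, r in enumerate(returns))
    den = sum(lam ** i for i in range(len(returns)))
    return num / den

# Hypothetical daily log returns, newest first.
rets = [0.010, -0.020, 0.015, -0.005, 0.030]
print(riskmetrics_variance(rets))
# With a single observation the estimate is just its square:
print(riskmetrics_variance([0.1]))  # approximately 0.01
```

Note that the sample mean never appears: this is an estimate of the second moment, with the justification discussed below.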

We could show that this is suboptimal in terms of the sampling variance of the estimate.
Implicitly, then, while we “believe” the log random walk when “annualizing” volatility,
we do not believe it when estimating volatility, if we use this estimate.
Moreover, it shall be noticed that the sampling mean of returns does not appear in
this estimate. This is a choice which can be justified in two ways. First, we can
assume the expected return µ over a small time interval to be very small. With a non
negligible variance, it is quite likely that an estimate of the expected value of returns
shows a sampling variability higher than the likely size of µ itself, and so it could undermine
the statistical stability of the variance estimate⁸. Second, an estimate
of the variance where the expected value is set to 0, that is, as written above, an
estimate of the variance which is, actually, an estimate of the second moment, tends
to overestimate, not to underestimate, the variance.
In fact, variance equals the mean of squares (second moment) minus the squared
mean. If you set the squared mean to 0 you tend to exaggerate the estimate.
For institutional investors, traditionally long the market, this could be seen as
a conservative estimate. Obviously this may not be a reasonable choice for hedged
investors and derivative traders.
The apparent truncation at n should be briefly commented. As we have just seen
the standard estimate should be based on the full set of available observations. This
could be applied as a convention also to the RiskMetrics estimate. On the other hand
consider the fact that, e.g., a λ = 0.95 raised to the power of 256 (conventionally one
year of daily data) is less than 0.000002. So, at least with daily data, to truncate n
after one year of data (or even before) is substantially the same as considering the full
data set.
Using this idea and the known identity:

$$\sum_{i=0}^{N} \lambda^i = \frac{1 - \lambda^{N+1}}{1 - \lambda}$$

so that (for 0 < λ < 1)

$$\sum_{i=0,\dots,\infty} \lambda^i = 1/(1 - \lambda)$$
8
A simple “back of the envelope” computation: say the standard deviation for stock returns over
one year is in the range of 30%. Even in the simple case where data on returns are i.i.d., if we estimate
the expected return over one year with the sample mean we need about 30 observations (years!) in
order to reduce the sampling standard deviation of the mean to about 5.5% so to be able to estimate
reliably risk premia (this is financial jargon: the expected value of return is commonly called ’risk
premium’ implying some kind of APT and even if it also contains the risk free rate) of the size of at
least (usual 2σ rule) 8%-10% per year (quite big indeed!). Notice that things do not improve if we use
monthly or weekly or daily data (why?). It is clear that any direct approach to the estimate of risk
premia is doomed to failure. A connected argument shall be considered at the end of this chapter.
We can then approximate the $V_t$ estimate as:

$$V_t = (1 - \lambda) \sum_{i=0,\dots,n} \lambda^i r^{*2}_{t-i}$$
In order to understand the meaning of this estimate it is useful to write it in a recursive
form (this is also useful for computational purposes).
We begin by doing this with the original denominator.
We can directly check that:

$$V_t = \lambda V_{t-1} + \frac{r^{*2}_t}{\sum_{i=0,\dots,n} \lambda^i} - \frac{\lambda^{n+1} r^{*2}_{t-n-1}}{\sum_{i=0,\dots,n} \lambda^i}$$

(Proof not required) In fact, since

$$V_{t-1} = \frac{\sum_{i=0,\dots,n} \lambda^i r^{*2}_{t-1-i}}{\sum_{i=0,\dots,n} \lambda^i}$$

we have

$$V_t = \lambda \frac{\sum_{i=0,\dots,n} \lambda^i r^{*2}_{t-1-i}}{\sum_{i=0,\dots,n} \lambda^i} + \frac{r^{*2}_t}{\sum_{i=0,\dots,n} \lambda^i} - \frac{\lambda^{n+1} r^{*2}_{t-n-1}}{\sum_{i=0,\dots,n} \lambda^i} =$$

$$= \frac{\sum_{i=0,\dots,n} \lambda^{i+1} r^{*2}_{t-1-i}}{\sum_{i=0,\dots,n} \lambda^i} + \frac{r^{*2}_t}{\sum_{i=0,\dots,n} \lambda^i} - \frac{\lambda^{n+1} r^{*2}_{t-n-1}}{\sum_{i=0,\dots,n} \lambda^i} =$$

$$= \frac{r^{*2}_t + \sum_{i=0,\dots,n-1} \lambda^{i+1} r^{*2}_{t-1-i}}{\sum_{i=0,\dots,n} \lambda^i} = \frac{\sum_{i=0,\dots,n} \lambda^i r^{*2}_{t-i}}{\sum_{i=0,\dots,n} \lambda^i}$$

which is the definition of $V_t$. (End of not required proof.)
For the standard range of values of λ and n the last term can be approximated with 0.⁹
Using the approximate value of the denominator we have:

$$V_t = \lambda V_{t-1} + (1 - \lambda) r^{*2}_t$$
In practice the new estimate at time t: Vt is a weighted mean of the old estimate
at time t − 1: Vt−1 (recall: the weight λ is usually big) and of the latest squared log
return (the weight: 1 − λ is usually small).
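The recursion is straightforward to implement. A minimal Python sketch (function names and the simulated Gaussian "returns" are made up for illustration) that also checks the recursive form against the truncated weighted-sum definition:

```python
import random

def riskmetrics_update(v_prev, r, lam=0.95):
    # One step of the recursion: V_t = lam * V_{t-1} + (1 - lam) * r_t^2
    # (the mean is set to 0, as in the text).
    return lam * v_prev + (1.0 - lam) * r * r

def riskmetrics_direct(returns, lam=0.95):
    # Weighted-sum definition: sum(lam^i * r_{t-i}^2) / sum(lam^i),
    # with i = 0 for the most recent observation.
    num = sum(lam ** i * r * r for i, r in enumerate(reversed(returns)))
    den = sum(lam ** i for i in range(len(returns)))
    return num / den

random.seed(0)
rets = [random.gauss(0.0, 0.01) for _ in range(1500)]

v = rets[0] ** 2                      # start the recursion somewhere
for r in rets[1:]:
    v = riskmetrics_update(v, r)

# With 1500 daily points and lam = 0.95 the truncation (and the starting
# value) are irrelevant: the two forms agree.
print(v, riskmetrics_direct(rets))
```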
A simple consequence of this (and of the fact that the estimate does not consider
the mean return) is the following.
9 $\frac{\lambda^{n+1} r^{*2}_{t-n-1}}{\sum_{i=0,\dots,n} \lambda^i} \simeq (1-\lambda)\lambda^{n+1} r^{*2}_{t-n-1}$, and for 0 < λ < 1, big n and any squared return "not too big", this shall be approximately 0.
Since the squared return is always non negative and λ is usually near one, this
formula implies that, even if the new return is 0, $V_t$ is still going to be equal to $\lambda V_{t-1}$,
so that the estimated variance can at most decrease by a fraction 1 − λ. This, in
the hypothetical case of an $r^*_t$ equal or very near to 0.
On the other hand, it can increase, in principle, by any amount when abnormally
big squared returns are observed. This implies an asymmetric behavior: following any
shock, that is a big positive or big negative return, we have an abrupt jump in $V_t$, while
a sequence of "normally small" values for returns shall reduce the estimated value in a
smoothed way, the faster the smaller is λ (but this is never small in the applications).
The reader should remember that this behavior of estimated volatility is purely a
feature of the formula used for the estimate.
Compare this with the classic "equally weighted" estimate (again: µ set to 0):

$$S^2_t = \frac{\sum_{i=0,\dots,n} r^{*2}_{t-i}}{n+1} = S^2_{t-1} + \frac{1}{n+1} r^{*2}_t - \frac{1}{n+1} r^{*2}_{t-n-1}$$
The big difference, here, is not in the first and second term, but in the third. In
this case the third (and oldest) term weight is not negligible, it counts as much as the
most recent term.
With the classic estimate once a “big shock” exits the range of the estimate, from t
to t − n, we observe a big downward jump in the value of the estimate. This because
of the absence of smoothing.
Hence, the main reason for smoothing is to avoid that today’s estimate be affected,
in a relevant way, by old observations dropping off the sample range.
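The contrast can be seen with hypothetical numbers: one large 10% return followed by flat days. In the sketch below, the rolling equal-weight estimate stays put and then collapses in one step when the shock drops out of the window, while the smoothed estimate decays geometrically:

```python
lam = 0.95      # smoothing parameter
window = 20     # length of the rolling equal-weight window
rets = [0.10] + [0.0] * 40   # one big shock, then calm (hypothetical)

ewma = [rets[0] ** 2]
rolling = []
for t in range(1, len(rets)):
    ewma.append(lam * ewma[-1] + (1 - lam) * rets[t] ** 2)
    recent = rets[max(0, t - window + 1): t + 1]
    rolling.append(sum(r * r for r in recent) / len(recent))

# rolling[18] is the last day the shock is still inside the window,
# rolling[19] the first day after it drops out: an abrupt jump to 0.
print(rolling[18], rolling[19])
# With zero returns the smoothed estimate just shrinks by lam per day.
print(ewma[5] / ewma[4])
```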
Let us now go back to

$$V_t = \lambda V_{t-1} + (1 - \lambda) r^{*2}_t$$
Is there any hypothesis on the “evolution of the return variance”, that makes this
behaviour not only a behaviour of the estimate evolution but also a behaviour of the
variance evolution?
If we want this, we imply that we do not totally agree with the standard, constant
σ 2 , version LRW hypothesis, as written above, as we are implying a time evolution of
the variance of returns.
The recursive formula we just found:

$$V_t = \lambda V_{t-1} + (1 - \lambda) r^{*2}_t$$

is the empirical analogue of an auto regressive model for the variance of returns the
like of:

$$\sigma^2_t = \gamma \sigma^2_{t-1} + \epsilon^2_t$$
which is a particular case of a class of dynamic models for conditional volatility
(ARCH: Auto Regressive Conditional Heteroskedasticity) of considerable fortune in the
econometric literature.
We do not go further in this but you shall discuss this topic in more advanced
econometrics courses.
The above discussion, involving the smoothed estimate for the return variance, is
by no means just a fancy theoretical analysis or a curiosity related to RiskMetrics. It
is the basis of current regulations.
Here, as an instance, I reproduce a paragraph of the EBA (European Banking
Authority) paper EBA/CP/2015/27.

Article 38 Observation period


1. Where competent authorities verify that the VaR numbers are com-
puted using an effective historical observation period of at least one year, in
accordance with point (d) of Article 365(1) of Regulation (EU) No 575/2013,
competent authorities shall verify that a minimum of 250 business days is
used. Where institutions use a weighting scheme in calculating their VaR,
competent authorities shall verify that the weighted average time lag of the
individual observations is not less than 125 business days.
2. Where, according to point (d) of Article 365(1) of Regulation (EU)
No 575/2013 the calculation of the VaR is subject to an effective historical
observation period of less than one year, competent authorities shall verify
that the institution has in place procedures to ensure that the application
of a shorter period results in daily VaR numbers greater than daily VaR
numbers computed using an effective historical observation period of at least
one year.

The quoted Article 365(1) of Regulation (EU) No 575/2013 (On prudential require-
ments for credit institutions and investment firms), is as follows:

Article 365
VaR and stressed VaR Calculation
1. The calculation of the value-at-risk number referred to in Article 364
shall be subject to the following requirements:
(a) daily calculation of the value-at-risk number;
(b) a 99th percentile, one-tailed confidence interval;
(c) a 10-day holding period;
(d) an effective historical observation period of at least one year except
where a shorter observation period is justified by a significant upsurge in
price volatility;
(e) at least monthly data set updates.
The institution may use value-at-risk numbers calculated according to
shorter holding periods than 10 days scaled up to 10 days by an appropriate
methodology that is reviewed periodically.
2. In addition, the institution shall at least weekly calculate a “stressed
value-at-risk” of the current portfolio, in accordance with the requirements
set out in the first paragraph, with value-at-risk model inputs calibrated to
historical data from a continuous 12-month period of significant financial
stress relevant to the institution’s portfolio. The choice of such historical
data shall be subject to at least annual review by the institution, which shall
notify the outcome to the competent authorities. EBA shall monitor the
range of practices for calculating stressed value at risk and shall, in accor-
dance with Article 16 of Regulation (EU) No 1093/2010, issue guidelines
on such practices.

The language is a little bureaucratically contrived. However, the meaning of this rule
is that, if you use the exponentially smoothed estimate truncated at N, so that your
(daily data) weights are

$$w_{t-i} = \lambda^i / \sum_{j=0}^{N} \lambda^j = \lambda^i \frac{1-\lambda}{1-\lambda^{N+1}}$$

it must be that the "weighted average time lag of the individual observations", that is
$\sum_{i=0}^{N} i w_{t-i}$, be at least 125 (days).
This, for given N requires a specific choice, or range of choices, of λ.
Notice that if N = 250 the only possible choice is λ = 1. In order to decrease the λ,
so you really have a smoothed estimate, and respect the rule, you must increase N.¹⁰
But there is more.
Since for N → ∞ the weighted average time lag is λ/(1 − λ) the requirement asks
in any case (that is: whatever be N ) for a value λ > 125/126 = .992063. An even
bigger number shall be needed for moderate N . This is much bigger than what used
to be the common case in the past. The examples in the “classic” edition of the Risk
Metrics Technical Document (iv edition 1996) use λ = .94 which, even with very big
N , corresponds to a weighted average time lag of .94/.06 = 15.(6) by far too small
according to the new rules.
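The compliance check described here is easy to automate. In the sketch below the helper names are made up; the closed form is the one given in the footnote:

```python
def avg_time_lag(lam, N):
    # Weighted average time lag sum_i i * w_i with
    # w_i = lam^i / sum_j lam^j, i = 0..N (daily data).
    den = sum(lam ** i for i in range(N + 1))
    return sum(i * lam ** i for i in range(N + 1)) / den

def avg_time_lag_closed(lam, N):
    # Partially explicit solution:
    # lam/(1-lam) - (N+1) * lam^(N+1) / (1 - lam^(N+1))
    return lam / (1 - lam) - (N + 1) * lam ** (N + 1) / (1 - lam ** (N + 1))

# The classic RiskMetrics lam = .94 gives ~15.7 days even for huge N:
print(avg_time_lag_closed(0.94, 10 ** 4))   # far below the required 125
# A lam above 125/126 can comply, given a large enough N:
print(avg_time_lag_closed(0.995, 10 ** 4))  # above 125
```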

3.1 Is it easier to estimate µ or σ²?
It is useful to end this small chapter discussing a widely held belief, supported by some
empirical and theoretical results, according to which the estimation of variances (and to a
lesser degree of covariances) is an easier task than the estimation of expected returns,
10 This simple arithmetical analysis can easily be done with Excel or similar software. There is also
a partially explicit solution. In fact, using some algebra, we see that the required "weighted average
time lag" is $E(i) = \sum_{i=0}^{N} i\lambda^i \frac{1-\lambda}{1-\lambda^{N+1}} = \frac{\lambda}{1-\lambda} - \frac{(N+1)\lambda^{N+1}}{1-\lambda^{N+1}}$. The problem becomes that of choosing λ
and N such that this value is at least equal to 125.
at least in the sense that the percentage standard error in the estimate of the variance
shall be smaller that in the case of expected return estimation.
The educated heuristics underlying such a belief are as follows11 .
Consider log returns from a typical stock, let them be iid with expected value (on
a yearly basis) of .07 and standard deviation .3.
Suppose log random walk with constant expected value and variance.
The usual estimate of the expected value, that is the arithmetic mean, shall be
unbiased and with a sampling standard deviation of $.3/\sqrt{n}$, where n is the number of
years used in the estimation.
Hence, the ratio of the expected value with the standard error of its estimate, a measure
of how precise we may expect the estimate to be, under these hypotheses, shall
be $.07/(.3/\sqrt{n})$, that is, roughly $\sqrt{n}/4$.
Hence, for a t-ratio of 2 we need $\sqrt{n} = 8$, that is n = 64 (years!). If we want
a standard error equal to 1/4 of µ (a t-ratio of 4) we need $\sqrt{n} = 16$ and n = 256.
This means you need 256 years of data for a 2σ confidence interval that still implies a
possible error of 50% in the estimate of µ (as twice the sampling standard deviation would
be of the order of 1/2 of the estimate of the mean).
This simple back of the envelope computation explains why we know so little about
expected returns: if our a priori beliefs are correct then it is very difficult to estimate them.
There could be a way out. Do not use yearly data but, say, monthly data.
Alas, for log returns and under log random walk this does not work.
Keep n constant and use any k sub periods per year (of length 1/k in yearly terms)
such that the number of observations in n years (for returns over the sub periods) is
kn. The strategy could be that of estimating the sub period expected value µk = µ/k
(the equality is due to the log random walk hypothesis) and then get an estimate of
the yearly expected value by multiplying the monthly estimate by k. If we indicate
with $r_{ki}$ the log returns for the sub periods, with $\sigma^2_k = \sigma^2/k$ as variance, we would have:

$$V(\hat\mu_k) = V\Big(\sum_{i=1}^{kn} r_{ki}/kn\Big) = \sigma^2_k/kn = \sigma^2/k^2 n$$

This seems much better than before, but it is an illusion: we do not need an estimate
of $\mu_k$, we need an estimate of $\mu = k\mu_k$, that is: $k\hat\mu_k$. We must then compute $V(k\hat\mu_k)$
and this is

$$V(k\hat\mu_k) = k^2 V(\hat\mu_k) = \sigma^2/n$$
Exactly the same as with “aggregated” data. This should not surprise us: in fact the
arithmetic mean of log returns is simply given by the log of the ratio of the last to the
first price divided by the required number of data points. In other words it only changes
11 This point is discussed in many papers and book chapters. Among the most illustrious examples, see
Appendix A in: Merton, R.C., 1980. "On estimating the expected return on the market: an exploratory
investigation". J. Financ. Econ. 8, 323–361.
because of the denominator: n for a yearly mean and kn for a sub period of length k
mean. No information is added by using sub period data, hence no improvement in
the variance of the estimate.
In even simpler words: the estimates of µ based on yearly data or on any subperiod
data (same time interval, obviously) are always identical, so their sampling variances
are always the same.
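The telescoping argument can be checked directly on simulated log prices (the drift .07 and volatility .3 are the illustrative numbers used above; the simulation itself is made up):

```python
import math
import random

random.seed(1)
k, n = 12, 10                      # 10 "years", 12 sub periods each
mu, sigma = 0.07, 0.30             # yearly drift and volatility

# Simulated log prices under the log random walk (illustrative only)
log_p = [0.0]
for _ in range(k * n):
    log_p.append(log_p[-1] + random.gauss(mu / k, sigma / math.sqrt(k)))

yearly = [log_p[(i + 1) * k] - log_p[i * k] for i in range(n)]
sub = [log_p[i + 1] - log_p[i] for i in range(k * n)]

mu_hat_yearly = sum(yearly) / n
mu_hat_from_sub = k * sum(sub) / (k * n)

# Both reduce to (log P_last - log P_first) / n: identical estimates,
# hence identical sampling variance.
print(mu_hat_yearly, mu_hat_from_sub)
```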
In summary: the expected return is difficult to estimate for two reasons.
First: σ is expected to be much bigger than µ and the t-ratio, hence the precision
of the estimate, depends on the ratio of these.
Second: even if we increase the frequency of observations, nothing changes for the
estimate of the (yearly) µ so that its sampling variance stays the same.
Now, let us do a similar analysis for the variance. In order to make things simple
we shall suppose that µ is known and data are Gaussian. This allows us to quickly
find some useful results.
The general case (unknown µ) is given below, but nothing relevant intervenes when
we remove the two simplifying hypotheses.
We start with the classic estimate of the variance, at the end of this section we shall
also consider the smoothed estimate.
Let us compute the sampling variance of our variance estimate (known µ) and let
$r^*_i$ be the yearly log return:

$$V(\hat\sigma^2) = V\Big(\sum_{i=1}^{n} \frac{(r^*_i - \mu)^2}{n}\Big) = \frac{1}{n} V((r^*-\mu)^2) = \frac{1}{n}\big(E((r^*-\mu)^4) - E((r^*-\mu)^2)^2\big) = \frac{1}{n}(\mu_{c4} - \sigma^4)$$

where $\mu_{c4} = E((r^*-\mu)^4)$ is the fourth centered moment and, without further hypotheses,
could be any non negative constant.
If the $r^*_i$ are Gaussian we have $\mu_{c4} = 3\sigma^4$ (take it for granted, no proof required)
and the resulting variance of the sampling variance is

$$V\Big(\sum_i \frac{(r^*_i - \mu)^2}{n}\Big) = \frac{2}{n}\sigma^4$$
So that the sampling standard deviation of the estimated variance shall be, with our
numbers, $\sqrt{2}\,\sigma^2/\sqrt{n}$.
The ratio of σ² with the sampling standard deviation of its estimate shall be

$$\frac{\sigma^2 \sqrt{n}}{\sigma^2 \sqrt{2}} \approx .7\sqrt{n}$$

In order to get a ratio equal to 2 we need $\sqrt{n} > 2/.7$ and we get there with just n = 9
instead of 64 as in the expected value case. For a ratio of 4 or greater we now need
$\sqrt{n} > 4/.7$ and for this n = 33 suffices (instead of n = 256 for the above discussed case
of the expected value).
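A quick check of these thresholds, using the rounded ratios $\sqrt{n}/4$ (mean) and $.7\sqrt{n}$ (variance) from the text:

```python
import math

t_mean = lambda n: math.sqrt(n) / 4     # rounded t-ratio for the mean
t_var = lambda n: 0.7 * math.sqrt(n)    # rounded ratio for the variance

# Smallest number of years giving each target ratio:
n_mean_2 = min(n for n in range(1, 10000) if t_mean(n) >= 2)
n_var_2 = min(n for n in range(1, 10000) if t_var(n) >= 2)
n_var_4 = min(n for n in range(1, 10000) if t_var(n) >= 4)
print(n_mean_2, n_var_2, n_var_4)  # 64, 9, 33 (years)
```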
But there is much more: for estimating the variance the use of higher frequency
data improves the result.
Let our strategy be that of estimating the yearly variance as k times the estimated
variance for a sub period of length 1/k in yearly terms (the prime sign indicates that
this is a new estimate):

$$\hat\sigma'^2 = k\hat\sigma^2_k$$

Using the same notation and hypotheses as above we get

$$V(\hat\sigma^2_k) = V\Big(\sum_{i=1}^{kn} \frac{(r^*_{ki} - \mu_k)^2}{kn}\Big) = \frac{1}{kn} V((r^*_k - \mu_k)^2) = \frac{1}{kn}\big(E((r^*_k - \mu_k)^4) - E((r^*_k - \mu_k)^2)^2\big) =$$

$$= \frac{1}{kn}(\mu_{kc4} - \sigma^4_k) = \frac{2}{kn}\sigma^4_k = \frac{2}{kn}\frac{\sigma^4}{k^2}$$

where we used the Gaussian hypothesis. Then

$$V(\hat\sigma'^2) = k^2 V(\hat\sigma^2_k) = \frac{2}{kn}\sigma^4$$

And we see that now k is in the formula: using sub period data improves the estimate.
Now the ratio for the variance estimate is equal to

$$\frac{\sigma^2 \sqrt{kn}}{\sigma^2 \sqrt{2}} \approx .7\sqrt{kn}$$
So that the use of k sub periods per year has an effect identical to that of multiplying
the number of years by k.
With, say, monthly data, we need less than one year (actually 9 months) for a ratio
of 2 and slightly less than 3 years (just 33 months) for a ratio of 4, so that, with monthly
data, the ratio for the variance becomes greater than 4 in 3 years instead of the 33 years
needed with yearly data¹².
Estimating σ 2 is then easier than estimating µ for two reasons:
First: the ratio between σ² and the standard deviation of its estimate is bigger than
that for the expected value, whatever be the n. This comes from the empirical, and
theoretical, idea that expected returns are much smaller than volatilities.
Second: even if the first reason was not true, we still have the fact that using higher
frequency data improves (dramatically) the quality of the estimate for σ 2 while it is
irrelevant for the estimate of µ.
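The $2\sigma^4/(kn)$ formula can be checked with a small Monte Carlo (the parameters σ = .3 and n = 5 years are illustrative; known µ = 0 and Gaussian returns, as assumed above):

```python
import math
import random

random.seed(2)
sigma, n = 0.30, 5   # yearly volatility, number of "years" (illustrative)

def sampling_var_of_estimate(k, trials=4000):
    # Monte Carlo sampling variance of sigma'^2 = k * (second-moment
    # estimate over k*n sub-period returns with variance sigma^2 / k).
    sk = sigma / math.sqrt(k)
    ests = []
    for _ in range(trials):
        rs = [random.gauss(0.0, sk) for _ in range(k * n)]
        ests.append(k * sum(r * r for r in rs) / (k * n))
    m = sum(ests) / trials
    return sum((e - m) ** 2 for e in ests) / trials

mc = {k: sampling_var_of_estimate(k) for k in (1, 12)}
for k in (1, 12):
    print(k, mc[k], 2 * sigma ** 4 / (k * n))  # simulated vs theoretical
```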
12 We see that if we decrease the observation interval, so that the frequency of observation per unit
period k increases, in the limit we get a sampling standard deviation of the variance equal to zero.
This should not be taken too seriously: the log random walk model, which underlies this result, may
be a good approximation for time intervals which are both not too long and not too short. Below the
1 day horizon we enter the world of intraday, trade by trade data which cannot be summarized in the
simple log random walk hypothesis.
As mentioned above all our formulas hold for known µ and Gaussian log returns.
For the general case we have the following result:
with i.i.d. log returns, not necessarily Gaussian, for the estimate

$$S^2 = \sum_{i=1}^{n} (r^*_i - \bar r^*)^2/(n-1)$$

we get

$$V(S^2) = \frac{\mu_{c4}}{n} - \frac{\sigma^4}{n}\frac{(n-3)}{(n-1)}$$

which, for not too small n and a fourth centered moment not very different from
the Gaussian case, gives us the same result as the above formula (indeed, setting
$\mu_{c4} = 3\sigma^4$ the expression collapses to $2\sigma^4/(n-1) \approx 2\sigma^4/n$).
Notice that in all these cases the sampling variance of the estimate of the variance
(as that of the estimate of the expected value) goes to 0 with n going to infinity.
Let us conclude with the case of the smoothed estimate.
We are going to use the approximation for the denominator given by $\sum_{i=0,\dots,n} \lambda^i \approx \sum_{i=0,\dots,\infty} \lambda^i = 1/(1-\lambda)$. The variance of the smoothed estimate is

$$V(V_t) = V\Big((1-\lambda)\sum_{i=0,\dots,n} \lambda^i r^{*2}_{t-i}\Big) = (1-\lambda)^2 \sum_{i=0,\dots,n} \lambda^{2i} V(r^{*2}_{t-i}) =$$

$$= \frac{(1-\lambda)^2}{1-\lambda^2}(\mu_4 - \mu_2^2) = \frac{1-\lambda}{(1+\lambda)} 2\sigma^4$$

where the last equality is true if the expected value is zero (as assumed in RiskMetrics)
and log returns are Gaussian (and recall: $1-\lambda^2 = (1+\lambda)(1-\lambda)$).
Here it is meaningless to compare this with the quality of the estimate for µ because
this is assumed equal to zero.
It is, however, interesting to compare the result with a result based on sub periods
of length 1/k. Everything depends on the choice of λ for the sub periods. If we set it to
$\lambda^{1/k}$ we have

$$V_{kt} = (1-\lambda^{1/k}) \sum_{i=0,\dots,kn} \lambda^{i/k} r^{*2}_{kt-i}$$

so that, following the same steps as for $V(V_t)$,

$$V(V_{kt}) = \frac{1-\lambda^{1/k}}{(1+\lambda^{1/k})} 2\frac{\sigma^4}{k^2}$$

and we have, for the estimate $V'_t = kV_{kt}$ (we use the prime sign because this is
different with respect to the estimate using aggregated data)

$$V(kV_{kt}) = k^2 \frac{1-\lambda^{1/k}}{1+\lambda^{1/k}} 2\frac{\sigma^4}{k^2} = \frac{1-\lambda^{1/k}}{1+\lambda^{1/k}} 2\sigma^4$$
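A quick numerical check that the factor $(1-\lambda^{1/k})/(1+\lambda^{1/k})$ is decreasing in k (λ = 0.95 and the list of k values are illustrative):

```python
lam = 0.95

def var_factor(k):
    # Factor multiplying 2 * sigma**4 in V(k * V_kt)
    lk = lam ** (1.0 / k)
    return (1 - lk) / (1 + lk)

factors = [var_factor(k) for k in (1, 4, 12, 52, 256)]
print(factors)  # strictly decreasing: higher frequency helps
```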
This, for 0 < λ < 1 and k > 1, is always smaller than the variance computed using
only full period data (k = 1)13 .
A last observation. What we did see here may seem to be strictly connected with
stock returns.
Actually, this is not the case.
The properties we discussed are valid for any series of at least approximately
iid (and not too un-Gaussian) random variables, where you must estimate µ
and σ² and you are interested in the quality of the two estimates. The only case when
we can expect the quality of the estimate of µ to be better than the quality of the
estimate of σ² is when we can assume µ much bigger than σ. Otherwise it shall always
be simpler to get a good estimate of σ² (and σ) than of µ, in particular if you can
increase the frequency of observations.

Examples
Exercise 2 - volatility.xls Exercise 3 - risk premium.xls Exercise 3a - exp smoothing.xls
Exercise 3b - historical and implied volatility.xls Exercise 3c - volatility.xls

4 Non Gaussian returns


It can be argued that a reasonable decision maker should be interested in the probability
distribution of returns implied by the strategy the decision maker chooses. This should
be true even if in common academic analysis of decision under uncertainty the use of
polynomial utility functions tend to overweight the role of the moments of the return
distribution and in particular of the expected value and variance.14
In some cases, as for instance in the Gaussian case, the simple knowledge of expected
value and variance is equivalent to the knowledge of the full probability distribution. In
this case the expected value of any utility function shall only depend on the expected
value and the variance of the distribution (these being the only parameters of the
distribution).
Another way to say the same is that, in this case, if we are interested in the prob-
ability with which a random variable X can show values less than or equal to a given
value k, it is enough to possess the tables of the standard Gaussian cumulative density
function and compute:
13
Notice that, with the smoothed estimate, the sampling variance of the estimate does not go to 0
for n going to infinity. On the other hand the bigger k the smaller the sampling variance. Remember,
however, as said above, that k cannot be taken as big as you wish as the log random walk hypothesis
becomes untenable for very short time intervals between observations.
14 Due to linearity of the expected value, the expected value of a polynomial utility function (that is,
a linear combination of powers of the relevant variable) is a weighted sum of moments: $E(\sum_i \alpha_i X^i) = \sum_i \alpha_i E(X^i)$.
$$\Phi\Big(\frac{k-\mu}{\sigma}\Big)$$
Notice that, for distributions characterized by more than two parameters, as for instance
a non standardized version of the T distribution, this property is obviously no longer
valid.
It is then of real interest to find good distribution models for stock returns and, in
particular, to evaluate whether the simplest and most tractable model: the Gaussian
distribution, can do the job.
A better understanding of the problem can be achieved if we consider that, in most
applications, we are not interested in the overall fit of the Gaussian distribution to
observed returns but only in the quality of fit for hot spots of the distribution, mainly
tails.
In Finance the biggest losses are usually connected to extreme, negative, observa-
tions (for an unhedged institutional investor). We shall see that the Gaussian distri-
bution while being, overall, not such a bad approximation of the underlying return
distribution, is not so for the extreme, say 1-2%, tails15 .
When studying stock returns, we observe extreme events, mainly negative, in the
order of µ minus 5 σ and more with a frequency which is incompatible with the prob-
ability of such or more negative events under the hypothesis of Gaussianity.
In these evaluations µ and σ are estimates using a long record of data.
While quite rare (do not be fooled by the fact that extreme events always make the
news and so become memorable) such extreme events are much more frequent than
should be compatible with a Gaussian calibrated on the expected value and variance
of observed data.
For instance, let us consider events where returns are more negative than µ − 5σ.
In a Gaussian, the probability of such events is less than 0.00000028 (use the Excel
function which computes such probability).
Is the frequency of similar events actually so small?
Obviously, we can only observe frequencies, not probabilities, and we can only
estimate µ and σ but, with the inevitable statistical approximation, a simple empirical
analysis shall be useful.
Let us consider an example based on a long series of I.B.M. daily returns.
Between Jan 2nd 1962 and Dec 29th 2005 the IBM daily return shows a standard
deviation of 0.0164¹⁶. In this time period for 14 times the return was below −5σ
15
The Gaussian distribution can be a good approximation of many different unimodal distributions
if we are interested (as is true in many applications of Statistics) in the behaviour of a random variable
near its median. For modeling extreme events, having to do with system failures, breakdowns, crisis
and similar phenomena, a totally different kind of distribution may be required.
16
(data are in excel Exercise 2- IBM random walk).
(suppose a µ of 0, if you use the historical mean the number of extreme events is even
bigger).
The number of observations is 11013 so the observed frequency of a −5σ event is
0.00127, this is, obviously, small but it is more than 4500 times the probability of such
observations for a Gaussian with the same standard deviation!
This is true for a very “mature” and “conservative” stock the like of I.B.M.
While a frequency of 0.00127 is very small, the events on which it is computed
(big crashes) are those which are remembered in the history of the market. It is
quite clear that, in this case, a Gaussian distribution hypothesis could imply a gross
underestimation of the probability of such events and belief in the Gaussian distribution
could imply taking not well measured risk.
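The ratio quoted above is easy to reproduce with the exact Gaussian tail (via math.erf). Note that the "more than 4500" in the text uses the rounded probability 0.00000028; with the exact Φ(−5) the ratio comes out somewhat above 4400:

```python
import math

def phi(z):
    # Standard Gaussian CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_5sigma = phi(-5.0)   # ~2.87e-7
freq = 14 / 11013      # 14 returns below -5 sigma in 11013 IBM days
print(p_5sigma, freq, freq / p_5sigma)
```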
The observed behaviour of the empirical distribution of returns of, basically, any
stock, can be summarized in the motto: fat tails, thin shoulders, tall head.
In other words, given a set of (typically daily) returns over a long enough time period
(we need to estimate tails and this requires lots of data) we can plot the histogram of
our data on the density of a Gaussian distribution with the same mean and standard
deviation. What we observe is that, while overall the Gaussian interpolation of the
histogram is good, if we zoom in the extreme tails: say, first and last two percent
of data, we see that the tails of the histogram decrease at a slower rate than those
of the Gaussian distribution. Moreover, toward the center of the distribution, we see
how the “shoulders” of the histogram are thinner than those of the Gaussian and,
correspondingly, the histogram is more peaked around the mean.
The following plots are from the excel file “Exercise 4 - Non Gaussian returns”, in
this worksheet we use data from May 19th 1995 to Sep 28th 2005 on the same I.B.M.
series as before.
The first plot compares the interpolated Histogram of empirical data (blue) with a
Gaussian density with the same mean and variance as the data (magenta). You can
clearly see the mentioned “fat tails, thin shoulders”.
Since tails, fat or not, are tails, that is: they are thin, in the second plot we focus
on the extreme left tail; at this scale the difference between the empirical and the
Gaussian distribution becomes visible. The x axis is scaled in terms of standard deviation
units (1 means 1 standard deviation) and we see that, moving to the left starting at, roughly, 2,
the empirical tail is above the Gaussian tail: extreme observations are more frequent
than what we would expect in a Gaussian distribution with the same mean and variance
as the data.

[Figure: Empirical VS Gaussian density, I.B.M. data — interpolated relative frequency histogram vs. standard Gaussian density.]
[Figure: Left tail, empirical VS Gaussian CDF, I.B.M. data — empirical CDF vs. Gaussian CDF; x axis from −6 to 0 standard deviation units.]
Another way to compare the empirical distribution with a Gaussian model (or any
model you may choose) is the Quantile-Quantile (QQ) plot. In the worksheet you find
the standardized version of the plot. In order to build a standardized QQ plot from
data you must first choose a comparison distribution, in our case the Gaussian. The
second step is that of standardizing the data, using some estimate of the data expected
value and variance. The standardized dataset is then sorted in increasing order and the
observations in this dataset shall be the X coordinate in the plot. For each observation
of the standardized returns dataset, compute the relative frequency with which smaller
than or equal values were observed. Compute then, using some software version of the
standard Gaussian CDF tables, the value of the standard Gaussian which leaves on its
left exactly the same probability as the relative frequency left on its left by the X data,
this shall be the corresponding Y coordinate in the plot.
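The construction just described can be sketched in a few lines. The sample below is simulated Gaussian data (so the points should hug the bisecting line); the (i+1)/(n+1) plotting position is one common convention that keeps the largest observation away from probability 1:

```python
import random
from statistics import NormalDist, mean, pstdev

random.seed(3)
data = [random.gauss(0.0005, 0.016) for _ in range(500)]  # illustrative

# Step 1: standardize with the sample mean and standard deviation,
# then sort: these are the X coordinates.
m, s = mean(data), pstdev(data)
x = sorted((d - m) / s for d in data)

# Step 2: for each sorted observation take the fraction of the sample at
# or below it, then the standard Gaussian quantile leaving that
# probability on its left: these are the Y coordinates.
nd = NormalDist()
n = len(x)
y = [nd.inv_cdf((i + 1) / (n + 1)) for i in range(n)]

# (x[i], y[i]) are the QQ plot points; y[i] - x[i] against x[i] gives the
# "detrended" version.
print(x[0], y[0], x[-1], y[-1])
```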

[Figure: Quantile-Quantile plot, I.B.M. data — standardized sorted returns (X) vs. standard Gaussian equivalent observations (Y).]
In the end what you see is a curve of coordinates X, Y. If the curve is a bisecting
straight line, your empirical CDF is well approximated by a Gaussian CDF. Departures
from the bisecting line are hints of possible non Gaussianity. To facilitate the reading of
the plot, a bisecting line is added to the picture. In a second, equivalent, version of the
plot the X coordinate is the same but on the Y axis we plot the difference between the
Y as computed for the previous plot and the bisecting line. This is called a "detrended"
QQ plot.
For the I.B.M. data we see how, on the left tail, observed data are above the
diagonal, meaning a left tail heavier than the Gaussian. On the opposite side of the
plot we see how the QQ plot lies below the bisecting line. Again: this means that
we are observing data far from the mean and on the right side of it with a higher
frequency than compatible with the Gaussian hypothesis.
Since the data are standardized, the scale of the plot is in terms of number of
standard deviations. We see that, on the left tail, we even observe data near and
beyond -6 times the standard deviation. The tail from minus infinity to -6 times the
standard deviation contains a probability of the order of 5 divided by one billion for the
standard Gaussian distribution. We also observe 10 data points on the leftmost −5σ
tail. Since our dataset is based on 10 years of data, roughly 2600 observations, if we
read our data as the result of independent extractions from the same Gaussian, these
observations, while possible, are by no means expected, as the probability of observing
10 times, in 2600 independent draws, something which has in each draw a probability
of 0.00000028 of being observed is virtually 0¹⁷.
We can also follow a different, strongly related, line of thought. We see that in this
dataset made of about 2600 daily observations we observe an extreme negative return
of around −8σ. This is the most negative return, hence the minimum observed value.
Now, let us ask the following question: what is the probability of observing so
negative a minimum if data come from a Gaussian?
Suppose data are iid and distributed according to a (standardized) Gaussian. In
this case the probability of observing data below the minus 8 sigma level is Φ(−8), and
this for each of the 2600 observations.
However, the probability of observing AT LEAST one value less than or equal to this
is 1 minus the probability of never observing such a value, that is, due to iid:
$1 - (1 - \Phi(-8))^{2600}$. It is clear that $1 - \Phi(-8)$ is almost 1, but $(1 - \Phi(-8))^{2600}$ could be much smaller.
17 To understand this use the binomial distribution. Question: suppose the probability of observing a
−5σ in each of 2600 independent "draws" is 0.00000028. What is the probability of observing 10 such
events? The answer, computed with Excel, is: $\binom{2600}{10} 0.00000028^{10} (1 - 0.00000028)^{2590} = 0.0000\dots$
Meaning that, at the precision level of Excel, we have a 0! While the exact number is not 0, this
means that, at least in Excel, the actual rounding error could be quite bigger than the result. For all
purposes the answer is 0. Question: in this section we evaluated the "un-likelihood" of −5σ results in
two different ways: first with a ratio between frequency and Gaussian based probability, then using
the binomial distribution and, again, the Gaussian based probability. What is the connection between
these two, different, arguments?
Is it small enough to make 1 − (1 − Φ(−8))^2600 big enough, so that a minimum value of
−8σ over 2600 iid observations from a standard Gaussian be not termed “anomalous”?
The computation is not so simple, as Φ(−8) is a VERY small number and the
precision of the Excel routine for its computation cannot be guaranteed. However,
using Excel we get (1 − Φ(−8))^2600 = 0.999999999998268, so that even if we take into
account the 2600 observations, the probability of observing as minimum of the sample
a −8σ data point is still not really different from 0. I checked the result using Matlab
(whose numerical routines should be more precise than Excel’s) getting a very similar
result.
In order to get 1 − (1 − Φ(−8))^n in the range of 0.01 (still very unlikely) we would
need n = 15,000,000,000,000. These are open market days and would correspond
roughly to 59 billion years. This is a time period roughly 4 times the current
estimate of the age of our universe. (Again: beware of roundings!)
In any sense, observing even a single −8σ value during the full history of the stock
is quite unlikely if data come from a standard Gaussian.
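These orders of magnitude are easy to check numerically. The following sketch (in Python, which is an assumption of these notes and not the tool used in the worksheets) computes Φ(−8) via the complementary error function and then the probability that the minimum of 2600 iid standard Gaussian draws falls at or below −8σ:

```python
from math import erfc, sqrt

def gaussian_cdf(x):
    # Standard Gaussian CDF written via the complementary error function.
    return 0.5 * erfc(-x / sqrt(2.0))

n = 2600                      # roughly ten years of daily data
p_tail = gaussian_cdf(-8.0)   # P(R <= -8 sigma) for a single draw
# Probability that the minimum of n iid standard Gaussians is <= -8 sigma:
p_min = 1.0 - (1.0 - p_tail) ** n

print(p_tail)  # on the order of 1e-15
print(p_min)   # on the order of 1e-12: "virtually 0"
```

Note that p_min is essentially n·Φ(−8), which is why roughly n = 0.01/Φ(−8) draws are needed before the event stops being anomalous.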
It should be noticed, as a comparison, that for a T distribution with, say, 3 degrees
of freedom the probability of never observing a return of −8σ over 2600 days is only
0.347165227, so that the observed minimum (or a still smaller value) has a probability
of 0.652834773, that is: by no means unlikely (in doing this computation recall that
the Student's T distribution variance is ν/(ν − 2), where ν is the number of degrees
of freedom, so that the quantile corresponding to −8 in a standard Gaussian is, now,
−8·√(ν/(ν − 2))).
As you can see, while at first sight similar to the Gaussian, the T distribution is
VERY un-Gaussian when tail behaviour is what interests us.
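The same check can be repeated for the T distribution with 3 degrees of freedom, whose CDF happens to have a closed form. The sketch below (Python again, an assumption of these notes) rescales −8 “standard deviations” by √(ν/(ν − 2)) = √3, as explained above:

```python
from math import atan, pi, sqrt

def t3_cdf(x):
    # Closed-form CDF of the Student's T with 3 degrees of freedom:
    # F(x) = 1/2 + (1/pi) * [ u/(1+u^2) + atan(u) ],  with u = x/sqrt(3).
    u = x / sqrt(3.0)
    return 0.5 + (u / (1.0 + u * u) + atan(u)) / pi

n = 2600
# A T(3) has variance 3, so "-8 standard deviations" is the point -8*sqrt(3):
p_tail = t3_cdf(-8.0 * sqrt(3.0))
p_min = 1.0 - (1.0 - p_tail) ** n

print(p_tail)  # roughly 4e-4 per draw
print(p_min)   # roughly 0.65: by no means unlikely
```

The per-draw tail probability is about 4·10⁻⁴ instead of 6·10⁻¹⁶: this is the whole difference between the two models at −8σ.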
We cannot dedicate more space to this problem. A vast literature exists on the
pitfalls of using the Gaussian distribution for extreme returns.
In the following section we shall consider the relevance of these empirical facts from
the point of view of VaR estimation.

Examples
Exercise 4 - Non normal returns.xls Exercise 4b - Non normal returns.xls

5 Three different ways for computing the VaR

First, it is necessary to define what VaR (Value at Risk) is (see footnote 18).

18 For this section I refer to the worksheets: exercise 4, 4b and 5.
Suppose a sum W is invested in a portfolio at time t0 and we are interested in the
p&l (profit and loss) between t0 and t1, that is: Wt1 − Wt0. In all the (for us) relevant
cases, that is, when some “risk” is involved, this p&l shall be stochastic due to the fact
that Wt1, as seen at t0, is stochastic. Our purpose is to give a simple summary of such
stochastic behaviour of the p&l, aimed at quantifying our “risk” in a possibly immediate
way.
Many such measures can be (and have been) suggested. The RiskMetrics procedure
chose as its basis the so called VaR “Value at Risk”.
Given a probability level α (usually very small: 1% to 5% as a rule), the VaR is
defined as the α-quantile of the distribution of Wt1 − Wt0.
The definition of the α-quantile xα for the distribution of a random variable X is easy
to write down and understand when the distribution function of X, FX(x), is continuous and
strictly increasing at least in an interval (xl, xu) such that FX(xl) < α < FX(xu).
In this case we simply have

xα ≡ x : P(X ≤ x) = FX(x) = α

and

xα = FX⁻¹(α)

where the inverse of FX(x), that is FX⁻¹(α), is defined in a unique way, continuous
and strictly increasing at least for FX(xl) < α < FX(xu).
Here, the α-quantile is nothing but the value of X which corresponds to a cumulated
probability exactly equal to α, and we indicate such a value with xα.
In the case of a cumulative distribution function with jumps (corresponding to
probability masses concentrated in specific values of x) there may be no x such that
FX (x) = α for a given α.
In this case the convention we use here is that of setting xα equal to the maximum
of the values x of X such that x has positive probability and FX(x) ≤ α.
Barring this possibility, it is correct to say that the VaR, at level α, for a time
horizon between t0 and t1 , of your investment, is that value of the profit and loss such
that the probability of observing a worse one is equal to α.
This definition seems to imply that we are required to directly compute a quantile
of the p&l. This is not the case.
In fact what is required is a quantile of the return distribution.
Indeed we have

Wt1 − Wt0 = Wt0·rt0,t1

and

Wt1 − Wt0 = Wt0·(e^(r̃t0,t1) − 1)

where rt0,t1 and r̃t0,t1 are, respectively, the linear and the log return in the time
interval from t0 to t1 (we write r̃ for the log return only here, to distinguish the two).
Since the functions return → p&l are both continuous and strictly increasing, the
problem of finding the required quantile of the p&l is equivalent to the problem of
finding the same quantile in the distribution of returns and transforming it back to p&l.
In this section we shall consider three different estimates of the VaR which rely on
different sets of hypotheses.
Each estimate shall be presented in a very simple form; the reader is warned that
actual implementation of any of these estimates requires a detailed analysis of the avail-
able data and, in some cases, is subject to detailed regulation.
You’ll see more about this in more advanced courses of the master.

5.1 Gaussian VaR


5.1.1 Point estimate of the Gaussian VaR
A word of notice: in what follows we shall use R as the symbol for the random
variable (log) return and r for a possible value of such random variable. To avoid heavy
notation we shall not indicate the kind of return we are speaking about or the time
interval considered. Both pieces of information shall be clear from the context.
Moreover, we shall limit ourselves to the computation of quantiles of the return
distribution. Transforming these into VaR requires a statement of the initial sum at risk
(see above).
Gaussian VaR is the most restrictive setting used in practice. We suppose that
R is distributed according to a Gaussian density with expected value and variance
(µ, σ²) which are either known or estimated in such a way as to minimize sampling error
problems. A typical attitude is that of setting µ = 0 and estimating σ², for instance,
using the smoothed estimate described above (see footnote 19).
The important point to remember with the Gaussian density is that, under this
hypothesis, knowledge of mean and variance is equivalent to the knowledge of any
quantile.
Under the Gaussian Hypothesis, the CDF is continuous so we can find a quantile
with exactly α probability on its left for any α.
The procedure is simple: we must find rα such that:

P (R ≤ rα ) = α

And proceeding with the usual argument, already well known from confidence intervals
theory, we get:

P (R ≤ rα ) = α = P ((R − µ)/σ ≤ (rα − µ)/σ) = Φ((rα − µ)/σ) = Φ(zα )

Where zα is the usual α quantile for the standard Gaussian CDF Φ(.).
19 This hypothesis is not reasonable for linear returns, which are bounded below. It is, however, sometimes used in this case too.
We have, then,

(rα − µ)/σ = zα

so that

rα = µ + σzα
This is quite easy. The problem is that, for small values of α, we are considering
quantiles very far on the left tail, and our previous empirical analysis has shown how
the Gaussian hypothesis for returns (overall not so bad) is inadequate for extreme tails.
Typically the problem of fat tails shall imply a dangerous undervaluation of the
VaR, in the sense that the estimate shall tend to be less negative than it should.
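As an illustration, the point estimate rα = µ + σzα can be computed as follows (a Python sketch, not part of the course worksheets; the 1.64% daily volatility is the I.B.M.-based figure used later in the text):

```python
from statistics import NormalDist

def gaussian_return_quantile(alpha, mu=0.0, sigma=1.0):
    # r_alpha = mu + sigma * z_alpha, with z_alpha the alpha-quantile
    # of the standard Gaussian CDF.
    return mu + sigma * NormalDist().inv_cdf(alpha)

# Hypothetical inputs: mu = 0, daily sigma = 1.64%, alpha = 2.5%:
r = gaussian_return_quantile(0.025, mu=0.0, sigma=0.0164)
print(round(r, 4))  # -0.0321
```

Here z_0.025 ≈ −1.96, so the result reproduces the −0.0321 quantile computed in the numerical example of the next subsection.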

5.1.2 Approximate confidence interval for the VaR


Now a problem: we know neither µ nor σ. We must estimate them. The usual
RiskMetrics procedure sets µ = 0 and estimates σ with the smoothed estimate intro-
duced above. The estimate of the quantile of the return distribution, that is rα = σzα,
shall then be r̂α = σ̂zα.
According to sound statistical practice we should implement this with a measure of
sampling variability.
Here we show a possible approximate and simple way to do so by computing a lower
confidence bound.
Under the assumptions of uncorrelated observations with constant variance and
zero expected value, it is easy to compute the variance of r̂α² = σ̂²zα².
In fact we have

V(r̂α²) = zα⁴·V(σ̂²)

During the discussion about the different precision in estimating E(r) = µ and
V(r) = σ² we derived, for the Gaussian zero-µ case, the formula

V(σ̂²) = [Σ_{i=0,...,n} λ^(2i) / (Σ_{i=0,...,n} λ^i)²]·2σ⁴

Using the approximation

Σ_{i=0,...,n} λ^(2i) / (Σ_{i=0,...,n} λ^i)² ≈ (1 − λ)/(1 + λ)

this becomes

V(σ̂²) = 2σ⁴·(1 − λ)/(1 + λ)
so that

V(r̂α²) = zα⁴·2σ⁴·(1 − λ)/(1 + λ)
We can then estimate the σ⁴ term by taking the square of the estimate of the variance
and get

V̂(r̂α²) = zα⁴·2σ̂⁴·(1 − λ)/(1 + λ)
A possible approximate lower confidence bound for the quantile estimate, with the usual
“two sigma” rule, is given by (minus) the square root of a two-sigma one-sided interval for r̂α².
Indeed

r̂α² + 2√(V̂(r̂α²)) = r̂α² + 2σ̂²zα²√(2(1 − λ)/(1 + λ)) = r̂α²·[1 + 2√(2(1 − λ)/(1 + λ))]

is an (upper) confidence bound for the square of the quantile estimate. In order to
convert it into a (lower) bound for the quantile estimate we simply take

r̂α·√(1 + 2√(2(1 − λ)/(1 + λ)))

(since r̂α is negative, this is more negative than the point estimate).

Now let us see some numbers.


Suppose an estimate of the (daily) σ such as the one obtained above from I.B.M. data
between 1962 and 2005: 0.0164. You are computing a daily Gaussian VaR with α =
0.025. In this case z_0.025 = −1.96.
This gives a quantile point estimate equal to

r̂α = σ̂zα = 0.0164·(−1.96) = −0.0321

Let us assume that our variance estimate comes from a typical implementation of
the smoothed estimate formula with daily data, n = 256 (meaning roughly one year of
data) and λ = 0.95.
In this case we have √(1 + 2√(2(1 − λ)/(1 + λ))) = 1.2053, and the bound shall be, roughly,
20% more negative than the point estimate of the quantile, that is

r̂α·√(1 + 2√(2(1 − λ)/(1 + λ))) = −0.0321·1.2053 = −0.0387

Notice that √(1 + 2√(2(1 − λ)/(1 + λ))) = 1.2053 only depends on the choice of λ, so that it
can be precomputed for any estimate sharing the same choice of λ.
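The factor √(1 + 2√(2(1 − λ)/(1 + λ))) and the resulting bound can be checked with a few lines (a Python sketch, assuming λ = 0.95 and the point estimate from the text):

```python
from math import sqrt

def bound_factor(lam):
    # sqrt(1 + 2*sqrt(2*(1 - lam)/(1 + lam))): the ratio between the lower
    # confidence bound and the point estimate of the quantile.
    return sqrt(1.0 + 2.0 * sqrt(2.0 * (1.0 - lam) / (1.0 + lam)))

factor = bound_factor(0.95)
print(round(factor, 4))          # about 1.2054
r_hat = -0.0321                  # point estimate of the quantile, from the text
print(round(r_hat * factor, 4))  # -0.0387
```

As noted, the factor depends on λ only, so it can be tabulated once per choice of λ.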
What we found is a confidence bound for the quantile of the (log) return.
In order to transform this into a confidence bound for the VaR we need to know
the amount invested W at time t0 .
The bound on the VaR shall be W·(e^(−.0387) − 1) = W·(−.0380), that is, a loss of
3.8% (and here the use of % is correct, why?)
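The conversion from the bound on the log return to the bound on the p&l can be sketched as (Python, with a hypothetical W = 1):

```python
from math import exp

def pnl_bound(W, log_return_bound):
    # Convert a bound on the log return into a bound on the p&l: W*(e^r - 1).
    return W * (exp(log_return_bound) - 1.0)

print(round(pnl_bound(1.0, -0.0387), 4))  # -0.038, a 3.8% loss per unit invested
```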
In order to further understand the consequences of using the smoothed estimate,
consider the case of the “classic” estimate, that is λ = 1, in

V(σ̂²) = [Σ_{i=0,...,n} λ^(2i) / (Σ_{i=0,...,n} λ^i)²]·2σ⁴

so that

V(σ̂²) = 2σ⁴/(n + 1)

and the bound shall be

r̂α·√(1 + 2√(2/(n + 1)))

Due to n at the denominator, the extreme of the interval quickly becomes almost identical
to the point estimate. For n = 256 we already get r̂α·√(1 + 2√2/√257) = r̂α·1.084 =
−0.0348.

5.2 Non parametric VaR


5.2.1 Point estimate
The non parametric VaR estimate stands, in some sense, at the opposite end with respect
to the Gaussian VaR. In the non parametric case we suppose only that returns are i.i.d., but we avoid
assuming anything about the underlying distribution.
However, in order to find the VaR we need an estimate of the unknown theoretical
distribution.
In standard parametric settings, where we assume, e.g., normality, this is done by
estimating parameters and then computing the required probabilities and finding the
required quantiles using the parametric model with estimated parameters. Since we
now are making no specific assumption about the return distribution we need to find
an estimate of it which is “good” whatever the unknown distribution be.
The starting point of all non parametric procedures is to estimate the theoretical
distribution using the empirical distribution function. Suppose we have a sample of n
i.i.d. returns with common distribution F(.) which yield observed values {r1, r2, ..., rn};
then our estimate of F(.) shall be:

P̂(R ≤ r) = F̂R(r) = #{ri ≤ r}/n = Σ_{i=1,...,n} I(ri ≤ r)/n

where #{ri ≤ r} means the number of observed returns less than or equal to r,
and I(ri ≤ r) is a function which is equal to 1 if ri ≤ r and 0 otherwise.
Under our hypothesis of i.i.d. returns with unknown distribution F (.) the above
defined estimate works quite well in the sense that
E( Σ_{i=1,...,n} I(ri ≤ r)/n ) = (n/n)·E(I(ri ≤ r)) = P(ri ≤ r) = F(r)

and

V( Σ_{i=1,...,n} I(ri ≤ r)/n ) = (n/n²)·V(I(ri ≤ r)) = F(r)(1 − F(r))/n
where the last passage depends on the fact that, for given r, I(ri ≤ r) is a Bernoulli
random variable with P = F (r).
Given this estimate of F , the non parametric VaR is, in principle, very easy to
compute.
Order the observed ri in increasing order, then define r̂α as the smallest ri such
that the observed frequency of data less than or equal to it is α, if such an ri exists. This
ri does not exist if α is not one of the observed values of the cumulative frequencies, that
is: if there exists no ri such that F̂R(ri) = α. In this case we make an exception with
respect to the common definition of the empirical quantile and define r̂α as the biggest
observed ri such that F̂R(ri) < α.
(Linear or other interpolations between consecutive observations are frequently used,
but we shall consider this in the “semi parametric VaR” section.) This is nothing but
a possible definition of the inversion of the empirical CDF (see footnote 20).
The problem with this estimate is that, if α is small, we are considering areas of the
tail where, probably, we made very few observations. In this case the estimate could
be quite unstable and unreliable. The reader should compare this estimate with the
estimate of a quantile in the Gaussian case. There, we estimate quantiles by
inverting the CDF which, in its turn, is estimated indirectly, by estimating µ and σ,
the unknown parameters. This implies that any data point tells us something about
any point of the distribution (maybe very far from the observed point) as it contributes
20
Our choice does not correspond to some definitions of the empirical quantile you may find in Statistics
books. In particular, in the case where no ri exists such that F̂R(ri) = α, the empirical quantile is
sometimes defined as the smallest observed ri such that F̂R(ri) > α. This would not be proper for our
purpose, which is to estimate the size of a possible loss and, if needed, exaggerate it on the safe side.
to the estimate of both parameters. In other terms, a parametric hypothesis allows
us to estimate the shape of the distribution in regions where we do not make any
observations. Instead, in the non parametric case, each data point, in some sense, has
only a “local” story to tell. To be more precise: the non parametric estimate of the
CDF at a given point r does not change if we change in any way the values of our data
provided we keep constant the number of observations smaller than and greater than
r.
So, we use very little information from the data in a non parametric estimate, while
the influence of any data point on a parametric estimate is big. An unwritten law of
Statistics is that if you use little information you are going to get an estimate which is
robust to many possible hypotheses on the data distribution but has a high sampling
variability; on the other hand, if you use a lot of the information in your data, as you do in
a parametric model, you are going to have an estimate which is not robust but has a
smaller sampling variability. This is what happens in the case of non parametric VaR
when compared to, say, Gaussian VaR.
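A minimal sketch of the empirical CDF and of this quantile convention (Python, assuming the integer-part rule described above and no ties in the data):

```python
from math import floor

def empirical_cdf(sample, r):
    # F_hat(r): fraction of observations less than or equal to r.
    return sum(1 for x in sample if x <= r) / len(sample)

def nonparametric_quantile(sample, alpha):
    # r_hat_alpha = r_([n*alpha]): the order statistic whose index is the
    # integer part of n*alpha (order statistics are 1-indexed in the text).
    ordered = sorted(sample)
    j = max(1, floor(len(ordered) * alpha))
    return ordered[j - 1]

# With 100 equally spaced hypothetical "returns", the 5% quantile is
# the 5th smallest observation:
print(nonparametric_quantile(list(range(1, 101)), 0.05))  # 5
```

Note the “local” character of the estimate: changing any observation without moving it across r leaves empirical_cdf(sample, r) unchanged.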

5.2.2 Confidence interval for the non parametric quantile estimate


Let us study these properties a little bit by computing a one-sided confidence interval
for the α-quantile rα on the basis of a simple random sample of size n.
Suppose we order our data from the smallest to the biggest value (that is: we
compute the “order statistics” of the sample). Call the j-th order statistic r(j).
Using the “integer part” notation, where [c] is the largest integer smaller than or
equal to c, the above defined estimate of rα can be simply written as r̂α = r([nα]).
This means that our estimate is the nα-th ordered observation if nα is an
integer, or the ordered observation corresponding to the largest integer smaller than
nα.
This is a sensible choice but, as usual, due to sampling variability, the estimate
could be either more or less negative than the “true” (and unobservable) rα.
We would be quite worried if rα ≤ r([nα]) (the “=” is here for cautiousness' sake).
A possible strategy in order to lower the probability of this event is that of building a
lower bound for the estimate based on an r(j) with j < [nα].
In order to choose this j we must answer the following question: what is the prob-
ability that the “true” α-quantile, say rα, is smaller than or equal to a given j-th
order statistic (ordered observation) r(j) in the sample? If this event happens and we
use this order statistic as estimate, the estimate shall be wrong in the sense that we shall
undervalue the possible loss (as noted above, the “equal” part is put in as an added
guarantee).
This error is going to happen if, by chance, we make at most j observations smaller
than, or equal to, rα. In fact when, for instance, the number of observations less than
rα is, say, j − 1, the j-th empirical quantile shall be bigger (less negative) than rα (or
equal to it, see the previous sentence).
Since observations are iid and, supposing a continuous underlying distribution, the
probability of observing a return less than or equal to rα is, by definition, α, the
probability of making exactly i observations less than (or equal to) rα (and so n − i
bigger than rα) is

C(n, i)·α^i·(1 − α)^(n−i)
We then have that the probability of making at most j observations less than (or equal
to) rα , that is, the probability that r(j) be greater than or equal to rα is equal to the
sum of the probabilities of observing exactly i returns smaller that or equal to rα for
i = 0, 1, 2, ..., j. For i = 0 all observations are greater than rα ; for i = 1 only the
smallest observation is smaller that or equal to rα and so on up to i = j where we have
exactly j observations smaller than or equal to rα (we are including the case r(j) = rα
because we want to be on the ”safe side” and avoid a possible undervaluation of the
risk). Obviously, from i = j + 1 onward, we have j + 1 or more observations smaller
than or equal to rα , so that r(j) shall be, supposing the probability of “ties” (identical
observations) equal to 0 as in the case of a continuous F , strictly smaller that rα .
In the end, the probability of “making a mistake” in the sense of undervaluing the
possible loss, that is the probability of choosing an empirical quantile r(j) greater than
rα, is given by:

P(r(j) ≥ rα) = Σ_{i=0,...,j} C(n, i)·α^i·(1 − α)^(n−i)
Now the confidence limit: to be conservative, we want to estimate rα with an empirical
quantile r(j) such that we have a small probability β that the true quantile rα is
smaller than its estimate. This, again, is because we are willing to overstate and not
to understate risk; hence, we “prefer” to choose an estimate more negative than rα
rather than a less negative one. Obviously, we would also like not to exaggerate on the safe side.
Our strategy shall be as follows: we choose an r(j) such that P(r(j) ≥ rα) ≤ β for
a given β which represents, with its size, how much we are willing to accept an under-
estimation of the risk (the smaller the β, the more averse we are to underestimating
rα). On the other hand, we do not want j to be smaller (that is, r(j) more negative)
than required. Summarizing, we must solve the problem

max(j) : P (r(j) ≥ rα ) ≤ β

This is going to be the extreme of our one tail confidence interval.
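The binomial probability involved can be computed exactly (a Python sketch; the numbers n = 1000 and α = 2% anticipate the worked example below):

```python
from math import comb

def prob_undervalue(n, alpha, j):
    # P(r_(j) >= r_alpha) = sum_{i=0}^{j} C(n,i) * alpha^i * (1-alpha)^(n-i):
    # the probability of making at most j observations below the true quantile.
    return sum(comb(n, i) * alpha**i * (1 - alpha)**(n - i) for i in range(j + 1))

# With n = 1000 and alpha = 2%, the "expected" index j = 20 leaves roughly
# even odds of undervaluing the loss, while j = 11 is much more prudent:
print(prob_undervalue(1000, 0.02, 20))  # a bit above one half
print(prob_undervalue(1000, 0.02, 11))  # around 2%
```

This exact computation is what the central limit approximation used below replaces.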


Notice this: the expected value of the random variable i with probability function
C(n, i)·α^i·(1 − α)^(n−i) is equal to nα, so that we could “expect” the empirical quantile cor-
responding to the index j just smaller than or equal to nα to be the “right choice”
(and exactly this choice of point estimate was made in the previous paragraph). E.g.
if α = .01 and n = 2000, intuitively we could use as an estimate of rα the empirical
quantile r(20).
However, if we make this choice, we are going (for n and α not too small) to have
roughly fifty fifty probability that the true quantile is on the left or on the right of the
estimate. This is due to the central limit theorem, according to which

Σ_{i=0,...,j} C(n, i)·α^i·(1 − α)^(n−i) ≈ Φ_{nα; nα(1−α)}(j)

If the approximation works for our n and α, we see that nα becomes the mean of
(almost) a Gaussian, hence the probability on the right and on the left of it becomes
.5.
For reasons of prudence fifty/fifty is not good for us, we go for a smaller probability
that the chosen quantile be bigger than rα , that is for a β smaller than .5.
For this reason we choose an empirical quantile corresponding to a j smaller than the
j just smaller than (or equal to) nα, and we do this according to the above rule.
The just quoted central limit theorem, if n is big and α not too small, simplifies
our computations with the following approximation:

P(r(j) ≥ rα) = Σ_{i=0,...,j} C(n, i)·α^i·(1 − α)^(n−i) ≈ Φ_{nα; nα(1−α)}(j) = Φ_{0;1}((j − nα)/√(nα(1 − α)))

With this approximation, we want to solve

max(j) : Φ_{0;1}((j − nα)/√(nα(1 − α))) ≤ β

So that our solution is given by the biggest (integer) j such that (j − nα)/√(nα(1 − α)) ≤ zβ or,
which is the same, the biggest (integer) j such that j ≤ nα + √(nα(1 − α))·zβ.
Using the more compact “integer part” notation and calling r̂α,β our lower bound,
we have:

r̂α,β = r([nα + √(nα(1−α))·zβ])

Notice that [nα + √(nα(1 − α))·zβ] does not depend on the observed data but on
α, β, n only. Hence the solution, in terms of j, that is: which ordered observation to
use (obviously NOT in terms of r(j)), is known before sampling.
Suppose, for instance, you have 1000 observations and look for the 2% VaR. The
most obvious empirical estimate of the 2% quantile is the 20th ordered observation,
but, according to the central limit theorem, the probability that the true 2% quantile
is on its left (as on its right, but this is not important for us) is 50%.
To be conservative you wish for a quantile which has only a 2.5% probability of being
on the right of the 2% quantile. Hence you choose a β of 2.5% (zβ = −1.96) and you
get

nα + zβ·√(nα(1 − α)) = 1000·.02 − 1.96·√(1000·.02·.98) = 11.32

According to this result your choice for the lower (97.5%) confidence bound for the
(2%) VaR is given by the [11.32] = 11-th ordered observation, that is, roughly, the 1%
empirical quantile.
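Since the index j depends on α, β and n only, it can be computed once and for all; a sketch (Python, using the numbers of the example):

```python
from math import floor, sqrt
from statistics import NormalDist

def bound_index(n, alpha, beta):
    # j = [n*alpha + z_beta * sqrt(n*alpha*(1-alpha))]: the index of the
    # ordered observation giving the lower confidence bound for r_alpha.
    z_beta = NormalDist().inv_cdf(beta)
    return floor(n * alpha + z_beta * sqrt(n * alpha * (1 - alpha)))

print(bound_index(1000, 0.02, 0.025))  # 11: use the 11th ordered observation
```

Whatever the sample turns out to be, the bound is the 11th smallest observed return.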
Beware: do not mistake α for β. The first defines the quantile you want to estimate
(rα ) and the second the confidence level of the confidence interval.
Is this prudential estimate much different w.r.t. the simple “expected” quantile?
It depends on the distance between ranked observations on the tail, for observed cu-
mulative frequencies of value about α. If the tail goes down quickly the distance is
small and the difference between, in this case, the 11th and the 20th ordered observation shall not
be big. On the contrary, with heavy tails the difference between the 1% and the 2%
empirical quantile can be quite big.
As an example consider the case of the I.B.M. data between May 19th 1995 to Sep
28th 2005 discussed above.
The point estimates are: −4.105% for the 2.5% empirical quantile (ranked obs 66)
and −5.57% for the 1% empirical quantile (ranked obs 26).
These point estimates correspond to 97.5% confidence bounds of −4.67% (obs 50) and
−6.53% (obs 16): in the first case roughly .5% more negative than the point estimate,
in the second case 1%. The reason for the difference is that around the 1% empirical
quantile observations are more “rarefied”, hence with larger intervals in between, than
around the 2.5% empirical quantile.
With the same data, a Gaussian VaR estimate using, for comparison, the full
sample standard deviation as estimate of σ (value 0.021421), gives, for the 2.5%
VaR, a point estimate of -4.14%, to be compared with the 2.5% empirical quantile
-4.105% (confidence bound -4.67%). However, in the Gaussian case the (approximate
2σ) lower confidence limit, given the more than 2600 observations and the unsmoothed
estimate, is -4.23%: very similar to the point estimate. As we saw a moment ago,
this is not true for the empirical VaR (.5% of difference between the estimate and the
confidence limit).
Things are worse on more extreme quantiles.
If we compute the 1% quantile in the Gaussian case, we get -4.93% with a (two σ)
bound of -5.02%, to be compared with the non parametric -5.57% and the corresponding
bound of -6.53% (see footnote 21).
21
These estimates may change very much if we change the sample. For instance, with a longer
stretch of data, between 1962 and 2005, the standard deviation is 0.0164 and the 2% Gaussian VaR is
-3.37%, while the 2% empirical quantile is -3.4%.
When we are evaluating extreme quantiles, two “negative” forces sum up. First, the
empirical distribution is very “granular” in the tails (very few observations). Second,
the empirically observed heavy tails imply the possibility of considerable differences
between contiguous quantiles, bigger than expected in the case of Gaussian data.
Non parametric VaR, sometimes dubbed “historical” VaR because it uses the ob-
served history of returns in order to estimate the empirical CDF, is probably the most
frequently used in practice. Again, confidence limits are often ignored, and this could be
due to their dismally “big” size.
The problem of a big sampling variance for such estimates is very well known.
Applied VaR practitioners and academics have suggested, in the last years, an amazing
quantity of possible strategies for improving the quality of the non parametric tail
estimate. Most of these suggestions fall into two categories: semi parametric modeling of
the distribution tails and filtered resampling algorithms.
In the following subsection we shall consider a simple example of semi parametric
model. The resampling approach is left for more advanced courses.

5.3 Semi parametric VaR


Semi parametric VaR mixes a non parametric estimate with a parametric model of the
left tail.
The aim is that of conjugating the robustness of the non parametric approach with
the greater efficiency of a parametric approach.
As we saw in the previous section, the non parametric approach can result in
good evaluations of the VaR for values of α which are not very small. For small α, its sampling
variability may be non negligible even for big samples. However, in VaR computation
we look for a quantile estimate for small α. The idea of a semi parametric approach is
that of using a parametric model just for the tail of the distribution beyond a small,
but not too small, α-quantile. The plug-in point of the parametric model is estimated in
a non parametric way; from that point onward, a parametric model is used in order to
estimate the required extreme quantile.
The reason why this may work is that, on the basis of arguments akin to the central
limit but considering not means but extreme order statistics, we can prove that, while
we may have many different parametric models for probability distributions, the tails
of such distributions behave in a way that can be approximated with few (typically
three) different parametric models.
This is not the place to introduce the very interesting topic of “extreme value
theory”, a recent fad in the quantitative Finance milieu, so we shall not be able to
fully justify our choice of tail parametric model. Be it sufficient to say that a rigorous
justification is possible.
We suppose that, for r negative enough:

P(R ≤ r) = L(r)·|r|^(−a)

where, for such negative enough r, L(.) is a slowly varying function for r → −∞
(formally this means lim_{r→−∞} L(λr)/L(r) = 1 for every λ > 0, and you can understand this as implying
that the function L is approximately a constant for big negative values of r) and a
is the speed with which the tail of the CDF goes to zero at a polynomial rate.
This is sometimes called a “Pareto” tail because a famous density showing this tail
behaviour bears the name of Vilfredo Pareto.
This choice of tail behaviour could be justified on the basis of limit theory, as hinted
at before, or on the basis of good empirical fitting to data.
Notice that the Gaussian CDF has exponential tails, which go to 0 much faster
than polynomial tails. Pareto tails are, thus, a model for “heavy” tails.
Provided we know where to plug in the model (that is: which value of r is negative
enough) our first task is that of estimating a, the only parameter in the model. In
order to do so we take the logarithm of the previous expression and we get:

log(P (R ≤ r)) = log(L(r)) − a log(|r|)

We then assume that, maybe with an error, log(L(r)) can be approximated by a con-
stant C:
log(P (R ≤ r)) ≈ C − a log(|r|)
This expression begins to look like a linear model. In fact if, in correspondence of
any observed ri, we estimate log(P(R ≤ ri)) with log(F̂R(ri)) and summarize the
various approximations in an error term ui, we have:

log(F̂R (ri )) = C − a log(|ri |) + ui

A linear regression based on this model shall not work for the full dataset of returns,
but it shall work for a properly chosen subset of extreme negative returns. A simple
way to find the proper subset of observations is that of plotting log(F̂R (ri )) against
log(|ri |) for the left tail of the distribution. Typically this plot shall show a parabolic
region (consistent with the Gaussian hypothesis) followed by a linear region (consistent
with the polynomial hypothesis). The regression shall be run with data from the second
region.
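A bare-bones version of this regression can be sketched as follows (Python; the data are synthetic, with order statistics placed exactly on a Pareto tail so that the known index a = 3 is recovered — real data would of course be noisy, and the subset of tail observations must be chosen with care):

```python
from math import log

def tail_index(returns, n_tail):
    # OLS slope of log(F_hat(r_(i))) on log(|r_(i)|) over the n_tail most
    # negative observations; the tail index estimate is minus that slope.
    # Assumes no ties, so F_hat at the i-th most negative return is i/n.
    ordered = sorted(returns)[:n_tail]   # the n_tail most negative returns
    n = len(returns)
    xs = [log(abs(r)) for r in ordered]
    ys = [log((i + 1) / n) for i in range(n_tail)]
    mx, my = sum(xs) / n_tail, sum(ys) / n_tail
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return -sxy / sxx

# Synthetic check: quantiles of P(R <= r) = |r|^(-3) (hypothetical, a = 3):
n, a_true = 1000, 3.0
sample = [-((i + 1) / n) ** (-1.0 / a_true) for i in range(n)]
print(round(tail_index(sample, 50), 3))  # recovers 3.0
```

On this exact-Pareto sample the log-log plot is a straight line, so the slope is recovered exactly; on real returns only the region beyond the parabolic (Gaussian-looking) part should enter the regression.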
Suppose we now have an estimate of a: how do we get an estimate of the quantile?
The problem is that of plugging the parametric tail into the non parametric estimate
of the CDF.
The solution is simple if we suppose to have a good non parametric estimate for
the quantile rα1 where α1 is too big for this quantile estimate to be used as VaR.
[Figure: log-log plot of the extreme negative (in absolute value) data. Note how the linear hypothesis seems to work on the left of −3/−4 sigma.]
What we need is an estimate of rα2 for α2 < α1. If we suppose that the tail model is
approximately true for both quantiles, we have:

α1/α2 = [L(rα1)/L(rα2)]·(rα2/rα1)^a

But the ratio L(rα1)/L(rα2) should be very near to 1 (the same slowly varying function computed
at not very far away points), so that we can directly solve for rα2:

rα2 = rα1·(α1/α2)^(1/a)

Given the non parametrically estimated rα1 , an estimate of a (based on the above
described regression) and for a chosen α2 we are then able to estimate the quantile rα2 .
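The extrapolation formula is a one-liner; as a sketch (Python, with hypothetical inputs: the −4.105% figure from the earlier I.B.M. example as rα1 and an assumed tail index a = 3):

```python
def extrapolate_quantile(r_alpha1, alpha1, alpha2, a):
    # Pareto-tail extrapolation: r_alpha2 = r_alpha1 * (alpha1/alpha2)^(1/a).
    return r_alpha1 * (alpha1 / alpha2) ** (1.0 / a)

# Hypothetical inputs: a 2.5% quantile of -4.105% and a tail index a = 3:
r2 = extrapolate_quantile(-0.04105, 0.025, 0.01, 3.0)
print(round(r2, 4))  # about -0.0557
```

Since a multiplies nothing but the exponent, a smaller estimated a (a heavier tail) pushes the extrapolated quantile further out, as it should.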
The computation of the confidence interval of the semi parametric VaR is slightly
more difficult than for the other two VaR estimation methods, hence we shall not
discuss it here.
(However, you can find the formulas for the computation of this confidence interval
in the Excel file where the three methods are compared on the same dataset).

A detailed discussion of this method, with suggestions for the choice of the subset of
data on which to estimate the tail index and formulas for more sophisticated confidence
intervals, may be found in the paper cited in footnote 22.
However, the reading of this paper is not required for the course.
A comparison of Gaussian, non parametric and semi parametric VaR is shown in
detail in the Excel worksheet Exercise 5 VaR.xls.

Examples
Exercise 5 - VaR.xls Exercise 5b

Required Probability and Statistics concepts. Sections 6-12. A quick check.
The main difference between the first and the second part of the course is due to the
fact that in the second part we are mainly interested in vectors of returns. This is the
22
Jan Beirlant, Petra Vynckier, Jozef L. Teugels, “Tail Index Estimation, Pareto Quantile Plots,
and Regression Diagnostics”, Journal of the American Statistical Association, Vol. 91, No. 436 (Dec.
1996), pp. 1659-1667. JSTOR link: http://www.jstor.org/stable/2291593

obvious point of view of any professional involved in asset pricing, asset management
and risk management.
To begin with, we need a compact and reasonably clear notation for dealing with
vectors of returns, both from the mathematical and the probabilistic/statistical point
of view. For this reason the second part of these handouts opens with two chapters,
6 and 7, dedicated to a quick introduction to the basic notation. You can find something
more in the appendixes of these handouts (section 12).
Most of the second part of these handouts is centered on the study of the general
linear model (section 9) and of its applications in Finance. I expect this topic to be new
for most of the class, hence the handouts contain a rather detailed and complete, if
simple, introduction to it. Another important tool introduced in the following chapters
is principal component analysis (section 11), in the context of linear asset pricing models.
Most of what follows is self contained; however, some basic concepts and results of
Probability and Statistics are required. You can find these in the appendix (section 13).
Among the most important of these concepts and results, to be added to those
already summarized in the first part of these handouts, I would point out: conditional
expectation and regressive dependence, point estimation, unbiasedness, efficiency.
Again, a short summary of these can be found in section 13. Among the most important
points see: from 13.43 to 13.47 and from 13.91 to 13.104.

6 Matrix algebra

I suppose the Reader knows what a matrix and a vector are, the basic rules for
multiplication of matrices by matrices and by scalars, and for sums of matrices. I also
suppose the Reader knows the meaning and basic properties of a matrix inverse and of
a quadratic form. This very short section only recalls a small number of matrix results
and presents a very useful result called the “spectral decomposition” or
“eigendecomposition” theorem. Moreover, we consider some differentiation rules.
In what follows I’ll write sums and products without declaring matrix dimensions.
I’ll always suppose the matrices to have the correct dimensions.
The inverse of a square matrix A is indicated by A−1 , with A−1 A = I = AA−1 . A
property of the inverse is that, if A and B have inverses, then (AB)−1 = B −1 A−1 . Notice
that, if A and B are square invertible matrices and AB = I then, since (AB)−1 = I =
B −1 A−1 , by multiplying on the left by B and on the right by A we have BB −1 A−1 A =
BA = I.
The rank of a matrix A (no matter if square or not), Rank(A), is the maximum
number of linearly independent rows or columns in A. Put in a different way, the
rank of a matrix is the order of the biggest (square) matrix that can be obtained
from A by deleting rows and/or columns and whose determinant is not zero. Obviously,
then, the rank of a matrix cannot be bigger than its smaller dimension.
A fundamental property of the rank of the product is this:

Rank(AB) ≤ min(Rank(A); Rank(B))

If B is a q × k matrix of rank q then Rank(AB) = Rank(A). Analogously, if A is
a h × q matrix of rank q then Rank(AB) = Rank(B). Applying this we have that:
Rank(AA′ ) = Rank(A).
A symmetric matrix A is called positive semi definite (psd) iff for any column vector
x we have x′ Ax ≥ 0. If the inequality is strict (>) for all the vectors x not identically
null, then the matrix A is called “positive definite” (pd).
Often we must compute derivatives of functions of the kind x′ Ax (a quadratic form)
or x′ q (a linear combination of the elements of the vector q with weights x) with respect
to the vector x.
In both cases we are considering a (column) vector of derivatives of a scalar function
w.r.t. a (column) vector of variables (commonly called a “gradient”). There is a useful
matrix notation for such derivatives which, in these two cases, is simply given by:

∂(x′ Ax)/∂x = 2Ax

and

∂(x′ q)/∂x = q
The proof of these two formulas is quite simple. We give a proof for a generic element
k of the derivative column vector:

x′ Ax = Σi Σj xi xj ai,j

∂(Σi Σj xi xj ai,j )/∂xk = Σj≠k xj ak,j + Σi≠k xi ai,k + 2xk ak,k = Σj≠k xj ak,j + Σj≠k xj ak,j + 2xk ak,k = 2Ak,· x

where Ak,· means the k-th row of A and we used the fact that A is a symmetric
matrix. Moreover:

x′ q = Σj xj qj

∂(x′ q)/∂xk = qk

An important point to stress is that the derivative of a function with respect to a
vector always has the same dimension as the vector, so, for instance (remember that
A is symmetric):

∂(x′ Ax)/∂x′ = 2x′ A
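The two gradient formulas can be verified numerically against finite differences (a Python/NumPy sketch with arbitrary random data, not part of the course material):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = A + A.T                      # the formula d(x'Ax)/dx = 2Ax needs A symmetric
x = rng.normal(size=4)

def quad(v):
    # the quadratic form v'Av
    return v @ A @ v

# central finite differences, one coordinate at a time
h = 1e-6
numeric = np.array([(quad(x + h * e) - quad(x - h * e)) / (2 * h)
                    for e in np.eye(4)])

analytic = 2 * A @ x             # the matrix formula derived in the text
```

The two gradients should agree up to the finite-difference error.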
A multi purpose fundamental result in matrix algebra is the so called “spectral
theorem”:

Theorem 6.1. If A is a (n × n) symmetric, pd matrix then there exist a (n × n)
orthonormal matrix X and a diagonal matrix Λ such that:

A = XΛX ′ = Σj λj xj xj′

where xj is the j-th column of X and is called the j-th eigenvector of A; the elements
λj on the diagonal of Λ are called the eigenvalues of A. These are positive, if A is pd,
and can always be arranged (rearranging also the corresponding columns of X) in non
increasing order.

The formula A = XΛX ′ is called the “spectral decomposition” of A.


If the matrix A is only psd with rank m < n, a similar theorem holds, but the matrix
X has only m columns and the matrix Λ is a square m × m matrix.
Notice that the spectral theorem implies that the rank of a p(s)d matrix A is equal
to the number of non null eigenvalues of A.
A property of the eigenvectors of a pd matrix A is that XX ′ = I (and, since in the
pd case X is square, we also have X ′ X = I). That is: the eigenvectors are orthonormal.
A nice result directly implied by this theorem when A is pd is this: A−1 = XΛ−1 X ′ .
In fact A−1 = (XΛX ′ )−1 = X ′−1 Λ−1 X −1 = XΛ−1 X ′ .
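Both the decomposition and the inversion formula are easy to check numerically (a Python/NumPy sketch on a randomly built pd matrix; not course material):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
A = B @ B.T + 5.0 * np.eye(5)      # symmetric and pd by construction

# eigh returns eigenvalues in ascending order and orthonormal eigenvectors
lam, X = np.linalg.eigh(A)

A_rebuilt = X @ np.diag(lam) @ X.T          # A = X Λ X'
A_inverse = X @ np.diag(1.0 / lam) @ X.T    # A^{-1} = X Λ^{-1} X'
```

Note that the numerical routine does all the hard work of solving the characteristic equation discussed below.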
The Reader must notice that computing the spectral decomposition of a matrix,
while straightforward from the numerical point of view, is by no means easy to do by
hand. In order to understand this we can post multiply A by the generic xi :

Axi = XΛX ′ xi = Σj λj xj xj′ xi

Each term in the sum is going to be equal to 0 (orthonormal xj vectors) except the i-th,
which is going to be equal to xi λi , so that xi solves the equation (A − λi I)xi = 0.
Notice that xi′ xi = 1, so that the “trivial” solution xi = 0 is NOT a solution of this
problem. Hence, any feasible solution requires |A − λi I| = 0, and we see that the
λi -s are the roots of the equation |A − λI| = 0 (the so called “characteristic equation”
for A). If this determinant equation is written down in full it shall be seen that it is
a polynomial equation in the variable λ of degree equal to the rank k of A (its size,
if A is of full rank). As is well known since high school days, polynomial equations
have k real and/or complex roots (the so called “fundamental theorem of algebra”).
However, an explicit formula for finding such roots only exists (in the general case) for
k ≤ 4. On the other hand, finding the roots of this equation is so relevant a problem
in applied Mathematics that numerical algorithms for computing them have existed at
least from the times of Newton.
The representation A = Σj λj xj xj′ makes obvious many classic matrix algebra
results. For instance, we know that Az = 0 may have nontrivial solutions only if A
is non invertible. In the case of a symmetric psd matrix this implies that the number
of positive eigenvalues is smaller than the size of the matrix. In this case, writing
Az = Σj λj xj xj′ z = 0 immediately shows that the solution(s) to the homogeneous
linear system must be found among those (non null) vectors z which are orthogonal to
each eigenvector xj . Such vectors, obviously, cannot exist if A is invertible23 .
A last useful result is the so called “matrix inversion lemma”

(A − BD−1 C)−1 = A−1 − A−1 B(CA−1 B − D)−1 CA−1
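The lemma, in the exact form stated above, can be verified on random matrices (a Python/NumPy sketch; the matrix sizes and the diagonal shifts, used only to keep A and D well conditioned, are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 5, 2
A = rng.normal(size=(n, n)) + n * np.eye(n)   # generically invertible
B = rng.normal(size=(n, k))
C = rng.normal(size=(k, n))
D = rng.normal(size=(k, k)) + k * np.eye(k)

Ai = np.linalg.inv(A)
# left hand side: direct inversion
lhs = np.linalg.inv(A - B @ np.linalg.inv(D) @ C)
# right hand side: the matrix inversion lemma
rhs = Ai - Ai @ B @ np.linalg.inv(C @ Ai @ B - D) @ C @ Ai
```

The lemma is useful because the matrix inverted on the right hand side, C A−1 B − D, is only k × k, which can be much smaller than the n × n matrix on the left.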

7 Matrix algebra and Statistics

We use both random matrices and random vectors. A random matrix is simply a
matrix of random variables, and the same holds for a random vector.
The expected value of a random matrix or vector Q is simply the matrix or vector
of the expected values of each variable in the matrix or vector, and is indicated as E(Q).
For a random (column) vector z we define the variance covariance matrix, indicated
with V (z) but sometimes with Cov(z) or C(z), as:

V (z) = E(zz ′ ) − E(z)E(z ′ )

For the expected value of a matrix or a vector we have a result which generalizes
the linearity property of the scalar expected value.
23
For the lovers of formal language: k orthonormal vectors of size k (k-vectors) “span” a
k-dimensional space, in the sense that any vector in the space can be written as a linear combination
of such orthonormal vectors. For this reason, the only k-vector orthogonal to all k orthonormal
vectors (which means that the vector is not a linear combination of them) is the null vector. On
the other hand, given q < k orthonormal k-vectors, these span a q-dimensional subspace and there
exist other k − q orthonormal k-vectors which are orthogonal to the first q and span the “orthogonal
complement” of the space spanned by the q vectors. This is simply the space of all k-vectors which
cannot be written as linear combinations of the q vectors, equivalently: the space of all vectors which
are orthogonal to the q vectors. You see how a k-dimensional pd matrix implicitly defines a full
orthonormal basis for a k-dimensional space. Moreover, the knowledge of its eigenvectors allows us to
split this space into orthogonal subspaces.

Let A1 and A2 be random matrices (of any dimension, including vectors) and
B, C, D, G, F non random matrices. We have:

E(BA1 C + DA2 G + F ) = BE(A1 )C + DE(A2 )G + F

where, as anticipated, we suppose that all the products and sums have meaning, that
is: the dimensions are correct and the expected values exist.
The covariance matrix has a very important property which generalizes the well
known result about the variance of a sum of random variables.
Let z be a random (column) vector, H a non random matrix and L a non random
vector; then:

V (Hz + L) = HV (z)H ′

Suppose for instance that z is a 2 × 1 vector and H has a single row made of ones; in
this case the above result yields the usual formula for the variance of the sum of 2
random variables.
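The 2 × 1 example can be made concrete (a Python/NumPy sketch; the covariance numbers are illustrative):

```python
import numpy as np

# a known 2x2 covariance matrix for z (illustrative numbers)
Vz = np.array([[1.0, 0.5],
               [0.5, 2.0]])
H = np.array([[1.0, 1.0]])   # one row of ones: Hz is the sum z1 + z2
# L, an additive non random vector, drops out of the variance entirely

V_sum = H @ Vz @ H.T         # V(Hz + L) = H V(z) H'

# classic formula: Var(z1 + z2) = Var(z1) + Var(z2) + 2 Cov(z1, z2)
classic = Vz[0, 0] + Vz[1, 1] + 2.0 * Vz[0, 1]
```

Both routes give the same number, which is the point of the matrix formula: it packages the scalar rule for any H and any dimension.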

7.1 A varcov matrix is at least psd


An important property of a covariance matrix is that it is at least psd. In fact, if A
is the covariance matrix, say, of the vector z, then the quantity x′ Ax is simply the
variance of the linear combination of random variables x′ z, where x is a non random
vector. But a variance cannot be negative, hence x′ Ax ≥ 0 for all possible choices of x
and this means A is at least psd.
Suppose now that A is the (suppose pd) covariance matrix of the random vector
z. The spectral decomposition theorem tells us that we can write A as A = XΛX ′ ,
with X orthogonal (i.e. XX ′ = X ′ X = I) and Λ diagonal (Λ = diag(λ1 , ..., λn )). The
columns of X and the diagonal elements of Λ are the eigenvectors and eigenvalues of
A, respectively (i.e. Axi = λi xi ).
Let p = X ′ z. Then p is a vector of non correlated random variables with covariance
matrix Λ. In fact V (p) = X ′ V (z)X = X ′ AX = Λ. Moreover z = Xp, since Xp =
XX ′ z = z.
In other words, we can always write any random vector with pd covariance matrix
as a set of linear combinations of uncorrelated random variables. Notice that E(p) =
X ′ E(z) and E(z) = XE(p). When the covariance matrix of z is only psd, a similar
result holds, but the dimension of the vector p shall be equal to the rank of V (z) and
so smaller than the dimension of z.
This in particular implies that any pd matrix can be interpreted as a covariance
matrix, that is: there exists a random vector of which the given matrix is the covariance
matrix (this is also true for psd matrices, but the proof is slightly more involved).
Recall that, if A is only psd, say of size k and rank q < k, then it has q eigenvectors
xl corresponding to positive eigenvalues.

However, there exist k − q vectors zj∗ such that zj∗′ zj∗ = 1, zj∗′ zj∗′ = 0 for j ≠ j ′ and
zj∗′ xl = 0 for any vector xl which is an eigenvector of A.
Using any of these zj∗ , or any non zero scalar multiple of them, it is then possible to
build linear combinations of the random variables whose varcov matrix is A such that
these linear combinations have zero variance.
If A is a variance covariance matrix of linear returns for financial securities, this
implies that it is possible to build a portfolio24 of such securities and, possibly, the risk
free rate, such that the single securities returns are random but the overall position
return is non random.
By no arbitrage, any such position, being risk free, must yield the same risk free
rate (otherwise you can borrow at the lower rate and invest at the higher rate for a
sure, possibly unbounded, profit).
This very important property is central in asset pricing theory and, more in general,
in asset management.
(Compare this with the example regarding the constrained minimization of a quadratic
form in the Appendix).
As we shall see in the section dedicated to factor models and principal components,
it is often the case that covariance matrices of returns for large sets of securities are
approximately not of full rank, that is: it may be that all eigenvalues are non zero but
many of these are almost zero. In this case it is possible to build portfolios of risky
securities whose return is “almost” riskless. This has important applications in (hedge)
fund management and, more in general, in trading and asset pricing.
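An “almost riskless” portfolio of risky securities can be exhibited directly (a Python/NumPy sketch; the loadings and covariances below are invented for illustration, not course data):

```python
import numpy as np

# Three securities whose third return nearly replicates an average of the
# first two: loadings B on two common factors plus a tiny idiosyncratic term.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
Sf = np.array([[0.04, 0.01],
               [0.01, 0.09]])
S = B @ Sf @ B.T + np.diag([0.0, 0.0, 1e-8])   # almost singular varcov matrix

lam, X = np.linalg.eigh(S)     # eigenvalues in ascending order
w = X[:, 0]                    # eigenvector of the smallest eigenvalue
port_var = w @ S @ w           # variance of the portfolio with weights w
```

The single securities all have non trivial variances, yet the portfolio variance equals the smallest eigenvalue, which is almost zero: the portfolio is almost riskless.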

7.2 Note
These are the barely essential matrix results for this course. Many more useful results
of matrix algebra exist, both in general and applied to Statistics and Econometrics.
For the interested student the Internet offers a number of useful resources.
We limit ourselves to quoting a “matrix cookbook” you can download from the
Internet; the title is “The Matrix Cookbook”25 .

Examples
Exercise 6-Matrix Algebra.xls

24
Or, alternatively, a long short position.
25
A possible link is: http://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf; this
worked when I last checked it in August 2019, but I cannot guarantee stability of the link.

8 The deFinetti, Markowitz and Roy model for asset
allocation

This section on the deFinetti-Markowitz-Roy model is here both as an exercise in
matrix calculus and because it shall be useful to us when we shall briefly consider risk
factor models.
Suppose one investor is considering a buy and hold portfolio strategy for a single
period of any fixed length (the length is not a decision variable in the asset allocation)
from time t to time T . Let us indicate with R the random vector of linear total returns
from a given set of (k) stocks for the time period. We suppose the investor knows (not
“estimates”) both µR = E(R) and Σ = V (R). We suppose Σ to be invertible.
Moreover, there exists a security which can be bought at t and whose price at T is
known at t (typically a non defaultable bond) called “risk free security”. Let rf be the
(linear) non random return from this investment over the period.
The fund manager’s strategy is to invest in the risk free security and in the stocks
at time t and then liquidate the investment at time T . The relative amounts of each
investment in stocks are in the column vector w, while 1 − w′ 1 is the relative amount
invested in the risk free security.
The return of the portfolio between t and T is:

RΠ = (1 − w′ 1)rf + w′ R

so that the expected value and the variance of the portfolio return are

E(RΠ ) = (1 − w′ 1)rf + w′ µR

and

V (RΠ ) = w′ Σw
The problem for the fund manager is to choose w so that, for a given expected value
c of the portfolio return, the variance of the portfolio return is minimized. In formulas:

min w′ Σw
w: (1−w′ 1)rf +w′ µR =c

Equivalently the fund manager could fix the variance and choose w such that the
expected return is maximized.
In both problems it would be sensible to use an inequality constraint. For instance,
in the first problem, we could look for

min w′ Σw
w: (1−w′ 1)rf +w′ µR ≥c

We choose the = version just to allow direct use of Lagrange multipliers, as we’ll
see in what follows.
Notice that we do not assume the sum of the elements of w to be 1. This shall
be true only if no risk free investment is made. However, obviously, if we complement
the vector w with the fraction of the portfolio invested in the risk free security, the sum
of all the portfolio fractions is 1. Moreover, we do not require each element of w to
be positive. This can be done, but not in the straightforward way we are going to
follow26 .
In order to solve the problem we consider its Lagrangian function (notice the
rearranged constraint):

L = w′ Σw − 2λ(rf − c + w′ (µR − 1rf ))

and differentiate this with respect to the vector w and the scalar λ (remember the
differentiation rules):

∂L/∂w = 2Σw − 2λ(µR − 1rf )

∂L/∂λ = −2(rf − c + w′ (µR − 1rf ))

We then define the system of “normal equations” as:

Σw − λ(µR − 1rf ) = 0
rf − c + w′ (µR − 1rf ) = 0

It is easy to solve the first sub system for w as:

w = λΣ−1 (µR − 1rf )


26
Before going on with the solution of our problem, it is proper to discuss an interesting property
of the mean variance criterion.
of the mean variance criterion.
The mean variance criterion may seem sensible and, actually, it usually is sensible. However it is
easy to build examples where the results are counter intuitive.
Suppose only two possible scenarios exist, both with probability 1/2. You are choosing between two
securities: A and B. In the first state of the world the return of both securities is 0, in the second it
is 1 for the first and 4 for the second. So the expected returns are .5 and 2 and the variances .25 and
4. Suppose now you wish to minimize the variance for getting at least an expected return of .5. Both
investments yield at least that expected return, however, according to the mean variance criterion,
you would choose the first as the variance of the second is bigger. But the second is going to yield
you a return greater or equal to the first in both states of the world, in other words you are trading
more for less. Notice that, since the two investments are perfectly correlated, you are going to invest
just in one of them. The reason of the paradox is simple: you are considering "variance" as bad, but
variance may be due both to bigger losses and to bigger gains. Since with the mean variance criterion
you may end up trading more for less we can conclude that, in general, the mean variance criterion
does not satisfy no arbitrage.

At this point we already notice that the required solution is a scalar multiple (λ) of a
vector which does not depend on c. In other words, the relative weights of the stocks
in the portfolio are already known and do not depend on c. What is still not known is
the relative weight of the portfolio of stocks with respect to the risk free security.
This is a first instance of a “separation theorem”: the amount of expected return
we want to achieve only influences the allocation between the risk free security and
the stock portfolio but does not influence the allocation among different stocks (the
optimal risky portfolio is uniquely determined).
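The solution is easy to compute numerically; here λ is pinned down by substituting w = λΣ−1 (µR − 1rf ) back into the expected return constraint (a Python/NumPy sketch; the means, covariances, rates and target are illustrative numbers, not course data):

```python
import numpy as np

# illustrative inputs (known, not estimated): two stocks
mu = np.array([0.08, 0.12])               # expected linear returns
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])          # return covariance matrix
rf, c = 0.02, 0.07                        # risk free rate, target expected return

excess = mu - rf                          # mu_R - 1 rf
base = np.linalg.solve(Sigma, excess)     # Sigma^{-1}(mu_R - 1 rf): fixed risky mix
lam = (c - rf) / (excess @ base)          # lambda from the return constraint
w = lam * base                            # optimal stock weights
w_rf = 1.0 - w.sum()                      # fraction in the risk free security

expected_return = w_rf * rf + w @ mu      # should equal the target c
variance = w @ Sigma @ w
```

Changing c rescales λ and hence w, but leaves the proportions inside the risky part unchanged, which is exactly the separation property described above.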
As a second comment we see that, had our objective been that of solving

max (1 − w′ 1)rf + w′ µR − (1/(2λ)) w′ Σw
w

that is: had we wished to maximize some “mean variance” utility criterion, our result
would have been exactly the same. Since the (negative) weight of the variance in this
criterion is given by −1/(2λ), as a rule 1/λ is termed the “risk aversion parameter”.

9 Linear regression
9.1 What is a regression
In what follows we shall consider linear models as models of linear regressions.
In other words, we shall suppose that a “set” of random variables (in general a
matrix) Z (maybe only partially observable) has a probability distribution P (Z); that
we want to study the conditional expected value (aka regression function) of one vector
of Z, that we indicate with Y , conditional on a matrix of other elements of Z, that we
call X (so that, obviously, we suppose such conditional expectation to exist); and that
we suppose this conditional expectation to be expressed by E(Y |X) = Xβ, where β is
a vector of non random parameters.
This is read as a “linear” regression function (the important point is that this is
linear in the β vector).
It should be obvious that, out of a given Z, we could compute and be interested in
many conditional expectations which we could derive from P (Z): the joint probability
distribution of Z. It may also be that not all the variables in Z are involved in the
computation. As an example, we could be interested in the conditional expected value
of one single element of Z given another single element.
Why do we compute conditional expectations?
Basically because we want to make “forecasts”, that is: we suppose to know the
values of some of the variables in Z and on the basis of these we want to forecast the
values of other variables in Z.
And why should we use a conditional expectation in order to make a prediction?
And why should we use a conditional expectation in order to make a prediction?

The main reason for this choice is that, if we evaluate the error in a forecast in a
specific way, the mean square error, then E(Y |X) is the function of X which minimizes
this measure of error.
I.e.: you cannot make a better (in MSE sense) choice than E(Y |X).
In what follows we give a brief summary of the main properties of a regression
function. More detail on this is found in the appendix (section 13), but it is not
required for the exam.
We begin with general properties which only require the regression function to exist,
be it linear or not.
Here we suppose Y to be a column vector.
1. Suppose Y = g(X). That is: Y is a function of X. Then E(Y |X) = g(X).
2. E(E(Y |X)) = E(Y ). This is sometimes called the “law of iterated expectations”.
3. E(Y − E(Y |X)) = 0. This is a corollary of the second property.
4. E((Y − E(Y |X))E(Y |X)′ ) = 0. In words: the covariance between forecasts,
E(Y |X), and forecast errors, Y − E(Y |X), is a matrix of zeroes.
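The properties can be checked by simulation in a toy model where E(Y |X) is known exactly (a Python/NumPy sketch; the model Y = 2X + ε is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

# toy model with known regression function: E(Y|X) = 2X
X = rng.normal(size=n)
eps = rng.normal(size=n)
Y = 2.0 * X + eps

forecast = 2.0 * X              # E(Y|X)
error = Y - forecast            # forecast error Y - E(Y|X)

mean_error = error.mean()                  # property 3: should be near 0
cross_moment = np.mean(error * forecast)   # property 4: should be near 0
iterated = forecast.mean()                 # property 2: should be near E(Y)
```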
Now we are ready to state (the proof is not required) the result about the optimal
property of the regression function as a forecast.
We do this in a way similar to the one we shall follow further on in order to prove
the Gauss Markov theorem.
Consider any (vector) function h(X) (the vector has the same dimension as Y ).
Suppose you want to “forecast” Y using h(X) and you want this forecast to be “the
best possible”.
The regression function is a possible candidate, as it IS a function of X (call it
E(Y |X) = φ(X)), but there exist, in general, infinite other possibilities.
You measure the “expected forecast error” “size” with its mean square error matrix27 :
E((Y − h(X))(Y − h(X))′ ).
We would like this to be as “small” as possible: this seems a sensible idea.
However, this is a matrix, so that we must define “small” in a non trivial way.
Here we shall look for a choice of h(X) = h∗ (X) such that any other choice would
yield an MSE matrix whose difference from that corresponding to h∗ (X) is (at least)
PSD, that is:

E((Y − h(X))(Y − h(X))′ ) = E((Y − h∗ (X))(Y − h∗ (X))′ ) + H

where H is PSD.
Theorem 9.1. For any h(X) ≠ φ(X) we have E((Y − h(X))(Y − h(X))′ ) = E((Y −
φ(X))(Y − φ(X))′ ) + H, where H is (at least) PSD.
27
In general this is not the variance covariance matrix as we are not requiring E(h(X)) = E(Y ).

In this sense you cannot choose a better function of X in order to forecast Y than
φ(X) = E(Y |X).
Note: in basic courses of statistics this theorem is stated in the common and
simpler case where Y is a scalar random variable, so that the MSE matrix becomes a
simple MSE.
It is important to stress the point that the regression function is the best choice
if “best” is measured by means of an MSE; different choices of objective functions shall
yield different results.
Verbal summary of the result: the regression function is the “best” (in this
particular sense) function of X if your aim is to forecast Y “on the basis of ”
X.
Note: If, instead of E((Y − h(X))(Y − h(X))′ ), we decide to minimize E((Y −
h(X))′ (Y − h(X))), that is: the expected sum of squared errors of forecast, a proof
following the same steps as the above proof, just changing the position of the transpose
sign, shows that the regression function minimizes the expected sum of squared errors
of forecast. In this case the objective function is a scalar, so the term “minimizes” has
the usual sense.
All these results do not require the regression function to be linear.
We need linearity in order to state and prove the following important property.
Suppose X1 is a subset of the columns of X and suppose (linearity) that E(Y |X) =
Xβ and E(X|X1 ) = X1 G, where I use an uppercase letter here (G) because X is a
matrix, so that E(X|X1 ) is a matrix of regression functions.
We then have E(Y |X1 ) = E(E(Y |X)|X1 ) = E(Xβ|X1 ) = X1 Gβ.
As stated above, given a choice of Y , many regressions are possible if we condition Y
on different sets of X. However, there shall be a connection between these regressions.
This simple result gives you the required connection (for the linear case).
Using this result we can compute the coefficients of the regression of Y on X1 when
we know the coefficients of the regression of Y on X and X1 is a subset of X.
A more general version of this result, the partial regression theorem, shall be
discussed in what follows.

9.2 Weak OLS hypotheses. X non random


Now let us pass to the study of the linear model, where “linear model” means “linear
model of the regression function”.
We begin with the so called “weak” OLS hypotheses with non stochastic X.
Let Y be an n × 1 random vector containing a sample of n observations on a
“dependent” variable y. Let X be a non random n × k matrix of rank k, β a k × 1 non
random vector and ε a n × 1 random vector.
Let:

Y = Xβ + ε

and suppose E(ε) = 0 and V (ε) = σε² In .
These hypotheses are best understood in a statistical (that is: estimation in
repeated samples) setting. Each sample we are going to draw shall be given by a
realization of Y . In each sample X (observable) and β (unobservable) shall be the same.
What makes Y “random”, that is: changing in a (partially) unpredictable (for us) way
from sample to sample, is the random “innovation” or “error” vector ε, which we cannot
observe.
As for the “partially” clause: under the assumed hypotheses it is clear that

E(Y ) = E(Y |X) = Xβ

so that the expected value of the random Y is not random (while it is unknown, as
β is unknown).
In this sense we may say that we are modeling the regression function of Y on
X. However, since the matrix X is non random, which means that the probability
of observing that particular X is one or, equivalently, as observed above, that in any
sample X (and β) shall always be the same (so that Y is random just due to the effect
of the random element ε; in fact V (Y ) = V (Xβ + ε) = V (ε) = σε² In ), the conditional
expectation shall be the same as the unconditional expectation.

9.3 Weak OLS hypotheses. X random


In applications of the linear model to Economics and Finance only infrequently can
we accept the hypothesis of a non random X matrix.
Typically the X shall be as random (that is: unknown before observation and
variable between samples) as Y . Just think of the CAPM “beta” regression, where
the “dependent” variable is the excess return of a stock and the “independent” variable
is the contemporaneous excess return of a market index which contains the same stock.
If X is random, the results we gave above about β̂OLS are no longer valid, in general.
Under the hypothesis of a stochastic matrix X we can follow many ways of extending
the OLS results. Each of these ways means the addition of an hypothesis to the
standard weak OLS setting. Here I choose a very simple way which shall be enough
for our purposes.
We shall extend the weak OLS hypotheses in this way:

E(ε|X) = 0

V (ε|X) = E(εε′ |X) = σε² In


In other words, we shall assume that, conditionally on X, what we assumed
unconditionally in the weak OLS setting is true. It is clear that our new hypotheses
imply the old ones, not vice versa. Notice that the equality between conditional
covariance matrix and conditional second moment matrix is true only because we
assumed that the conditional expectation of ε is zero. (See by yourself what happens
otherwise).
An immediate result is that:

E(Y |X) = E(Xβ + ε|X) = Xβ

and this property fully justifies the name “linear regression” for our model.
With a stochastic X and our added hypothesis, our model becomes a linear model
for a conditional expectation: a regression function.

9.4 The OLS estimate


The problem is that we do not know β, and we must estimate it.
Under the above hypotheses, different estimation procedures all lead to the same
estimate: the Ordinary Least Squares estimate β̂OLS .
The simplest way of deriving β̂OLS is through its namesake, that is: find the value
β̂OLS that minimizes ε′ ε, the sum of squared errors.
The objective function shall be:

ε′ ε = (Y − Xβ)′ (Y − Xβ)

The first derivatives with respect to β are:

∂(Y − Xβ)′ (Y − Xβ)/∂β = 2X ′ Xβ − 2X ′ Y

The system of normal equations28 (where we look for the β that sets the first
derivatives to zero) is:

X ′ Xβ = X ′ Y

Since (by assumption) the rank of X is k, X ′ X can be inverted and the (unique)
solution of the system is:

β̂OLS = (X ′ X)−1 X ′ Y

As usual we do not check the second order conditions (we should!). Informally, we see
that we are minimizing a sum of squares which may go to plus infinity, not to minus
infinity, so that our stationary point should be a minimum, not a maximum (this is by
no means rigorous, but it could be made so).
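The closed form is straightforward to compute (a Python/NumPy sketch on simulated data; the "true" coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # constant + 2 regressors
beta = np.array([1.0, 0.5, -2.0])      # illustrative "true" coefficients
Y = X @ beta + rng.normal(size=n)      # Y = X beta + eps

# beta_hat = (X'X)^{-1} X'Y, via a linear solve (numerically safer than inv)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# cross-check against the library least squares routine
beta_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]
```

Note that the normal equations say exactly that the residual vector Y − X β̂OLS is orthogonal to every column of X.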
28
Here, as in the Markowitz model, the term “normal” does not mean “usual” or “standard”. As we
are going to see in a moment, the first order conditions in systems like these require that each of a
set of products between vectors be 0. This has to do with requiring that vectors be perpendicular,
and the term “normal” derives from the Latin “normalis”, meaning “done according to a carpenter’s square
(“norma” in Latin)”; a carpenter’s square, as is well known, is made of two rulers crossed at 90° (the
triangular square of high school has an added side). By extension the term came to mean “done
according to rules” (in effect a square is two rule(r)s...) and from this the today most common, non
mathematical, usage of the word.

9.5 Basic statistical properties of the OLS estimate
This estimate has been derived as a best approximation. In deriving it we assumed
nothing from the statistical point of view.
However, we shall use the estimate in a setting where our observations are just a
“sample” of all possible observations, in a setting, that is, where parts of the model are
stochastic/random.
We already have two sets of hypotheses for describing such randomness: the weak
OLS hypotheses with non random and with random X.
Under these hypotheses we shall now see how the OLS estimate behaves in a
statistical sense.
It is easy to show that β̂OLS is unbiased for β. In fact, in the non random X case:

E(β̂OLS ) = (X ′ X)−1 X ′ E(Y ) = (X ′ X)−1 X ′ Xβ = β

where in the first passage we use the fact that X is non random and in the second one
we use the hypothesis that β is non random and that E(ε) = 0.
It is also easy to compute V (β̂OLS ):

V (β̂OLS ) = (X ′ X)−1 X ′ V (Y )X(X ′ X)−1 = (X ′ X)−1 X ′ σε² In X(X ′ X)−1 =

= σε² (X ′ X)−1 X ′ X(X ′ X)−1 = σε² (X ′ X)−1 .
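Both results can be checked by a repeated-samples simulation with a fixed X, redrawing only the innovation in each sample (a Python/NumPy sketch; sample size, number of replications and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # the SAME X in every sample
beta = np.array([1.0, 2.0])
XtXi = np.linalg.inv(X.T @ X)

# draw many samples: only eps changes from sample to sample
reps = 20_000
eps = sigma * rng.normal(size=(reps, n))
Y = X @ beta + eps                      # each row is one sample of Y
betas = Y @ X @ XtXi                    # each row is one OLS estimate (XtXi symmetric)

mean_hat = betas.mean(axis=0)           # should be close to beta (unbiasedness)
cov_hat = np.cov(betas, rowvar=False)   # should be close to sigma^2 (X'X)^{-1}
```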


Let us now consider the case of random X.
For the case of random X we essentially follow the same steps, with the trick of conditioning on X and using the "law of iterated expectations": E(ε) = E_X(E_{ε|X}(ε|X)).

E(β̂_OLS) = E(β + (X'X)^{-1}X'ε) = β + E_X(E_{ε|X}((X'X)^{-1}X'ε|X)) =

= β + E_X((X'X)^{-1}X'E_{ε|X}(ε|X)) = β

Notice the use of the iterated expectation rule.
Now compute the variance covariance matrix.

V(β̂_OLS) = V(β + (X'X)^{-1}X'ε) = V((X'X)^{-1}X'ε) =

= E((X'X)^{-1}X'εε'X(X'X)^{-1}) − E((X'X)^{-1}X'ε)E(ε'X(X'X)^{-1})

But the second term was just shown to be equal to 0, so:

= E((X'X)^{-1}X'εε'X(X'X)^{-1}) = E_X((X'X)^{-1}X'E_{ε|X}(εε'|X)X(X'X)^{-1})

Now the term E_{ε|X}(εε'|X) is, by hypothesis, equal to σ²I_n, so that:

E_X((X'X)^{-1}X'E_{ε|X}(εε'|X)X(X'X)^{-1}) = σ²E_X((X'X)^{-1}X'X(X'X)^{-1}) = σ²E_X((X'X)^{-1})

In short: with a stochastic X and the new OLS hypotheses, β̂_OLS is still unbiased, but now its covariance matrix is fully unknown, as it depends on the expected value of (X'X)^{-1}.
The results with stochastic X imply those with non stochastic X but require more steps; hence the choice of teaching both, with the first as an introduction to the second.

9.6 The Gauss Markoff theorem


The choice of β̂_OLS as an estimate of β based only on the minimization of ε'ε could be disputed: in which sense should this be a "good" estimate from a statistical point of view?
A strong result in favor of β̂_OLS as a good estimate of β is the Gauss-Markoff theorem.
Definition 9.2. If β∗ and β∗∗ are both unbiased estimates of β, we say that β∗ is not worse than β∗∗ iff D = V(β∗∗) − V(β∗) is at least a psd matrix.
You should notice how this definition is similar to that of MSE which is minimised
by the regression function.
Notice, moreover, that in the univariate case this definition boils down to the standard definition. Moreover, suppose that what you really want is to estimate a set of linear functions of β, say Hβ, where H is any nonrandom matrix (of the right size, so that it can pre-multiply β). Suppose you know that β∗ is not worse than β∗∗ according to the previous definition. In this case it is easy to show that Hβ∗ is not worse than Hβ∗∗, according to the same definition, as an estimate of Hβ. In fact:

V(Hβ∗∗) − V(Hβ∗) = H(V(β∗∗) − V(β∗))H' = HDH'

And if D is at least psd then HDH' is at least psd (why?).


This “invariance to linear transform” property is the strongest argument in favor of
this definition of “not worse” estimator.
As an exercise find a proof of the fact that on the diagonal of D all the elements
must be non negative. In other terms: all the variances of the * estimate are not bigger
than the variances of the ** estimate.
Obviously, this definition of "not worse" would amount to little if the estimates were allowed to be biased: in that case any vector of constants would be not worse, in terms of variance, than any other estimate. This is why we then require unbiasedness.29

29 As an alternative we could use the concept of the mean square error matrix in place of the variance covariance matrix.

We are now just a step short of being able to state an important result in OLS theory.
We would like to prove a theorem of the kind: the best unbiased estimate of β is β̂_OLS.
Alas, this is actually not true in this generality. The theorem turns out to be true if we further require the class of competing estimates to be linear in Y, that is: each competing estimate must be of the form HY with H a known nonrandom matrix.

Definition 9.3. We say that β̂ is a linear estimate of β iff β̂ = HY where H is a non random matrix.

We thus arrive at the celebrated Gauss-Markoff theorem.

Theorem 9.4. Under the weak OLS hypotheses, β̂_OLS is the Best Linear Unbiased Estimate (BLUE) of β.

Proof. Any linear estimate of β can be written as β̂ = ((X'X)^{-1}X' + C)Y with an arbitrary C. Since the estimate must be unbiased we have:

E(β̂) = ((X'X)^{-1}X' + C)Xβ = β + CXβ = β

and this is possible, for every β, only if CX = 0. Let us now compute V(β̂):

V(β̂) = σ²((X'X)^{-1}X' + C)((X'X)^{-1}X' + C)' =

= σ²((X'X)^{-1} + CC' + (X'X)^{-1}X'C' + CX(X'X)^{-1})

but, since CX = 0, the last two terms in the above expression are both equal to 0. In the end we have:

V(β̂) = σ²((X'X)^{-1} + CC')

and this is V(β̂_OLS) plus σ²CC', which is an at least psd matrix (why?). We have shown that the covariance matrix of any linear unbiased estimate of β can be written as the covariance matrix of β̂_OLS plus a matrix which is at least psd. To summarize: we have shown that β̂_OLS is BLUE.
As we shall see in the "stochastic X" section, the Gauss-Markoff theorem still holds, under suitable modifications of the weak OLS hypotheses, in the case of stochastic X.
There is an equivalent theorem which is valid when V(ε) = Σ with Σ any non random (pd) matrix known up to a multiplicative constant. In this case the BLUE estimate is β̂_GLS = (X'Σ^{-1}X)^{-1}X'Σ^{-1}Y, where GLS stands for Generalized Least Squares, but we are not going to use this in this course. Notice that if Σ is not known up to a multiplicative constant the above is not an estimate.

The proof begins by recalling that any pd matrix has a pd inverse. Moreover, any pd matrix A can be written as A = PP' with P invertible. We then have Σ = PP' and Σ^{-1} = (PP')^{-1} = P'^{-1}P^{-1}.
Multiply the model Y = Xβ + ε by P^{-1}: P^{-1}Y = P^{-1}Xβ + P^{-1}ε.
Now call:

• Y* = P^{-1}Y

• X* = P^{-1}X

• ε* = P^{-1}ε.

We have that Y* = X*β + ε* satisfies the weak OLS hypotheses and, in particular,

V(ε*) = V(P^{-1}ε) = P^{-1}ΣP'^{-1} = P^{-1}PP'P'^{-1} = I

We can then follow the standard proof up to the result: (X*'X*)^{-1}X*'Y* is BLUE.
In terms of the original data this is equal to

(X'P'^{-1}P^{-1}X)^{-1}X'P'^{-1}P^{-1}Y = (X'Σ^{-1}X)^{-1}X'Σ^{-1}Y = β̂_GLS

where GLS stands for Generalized Least Squares.


The result seems very general and, in a sense, it is so. However, we should take into account that the above proof requires Σ, possibly non diagonal, to be KNOWN: otherwise we could not compute P and the estimate.
Most cases of a linear model with correlated residuals do not allow for a "known" Σ, and the GLS estimate cannot be used directly.
Most estimates used in practice in these cases can be seen as versions of the GLS estimate where Σ is itself "estimated" in some way.
We do not consider this (interesting and very relevant) topic due to the introductory nature of this course.
The above proof deserves some further consideration.
Under the standard weak OLS hypotheses, both with non stochastic and with
stochastic X, the OLS estimate works fantastically well: it minimizes the sum of
squares of the errors, is unbiased, is BLUE. This is perhaps too much and we should
surmise that some of this bonanza strictly depends on a clever choice of the hypotheses.
This is exactly the case.
We have just proved that, even in the case of non stochastic X, when the covariance matrix of the residuals is NOT σ²I, the BLUE estimate is NOT the OLS estimate. In this case "minimizing the sum of squared errors" is not equivalent to finding the "best" estimate in the Gauss-Markoff sense.

9.6.1 A note on the case of stochastic X
We proved the Gauss-Markoff theorem using the weak OLS hypotheses with non stochastic X. The more general, stochastic X case can be easily dealt with using the iterated expectation rule.
In the case of stochastic X, the proof of the theorem (using the OLS hypotheses for the case of stochastic X) goes, working conditional on X, exactly as before. The last statement then becomes

V(β̂|X) = σ²((X'X)^{-1} + CC')

We must then find V(β̂). In the general case, this is NOT E(V(β̂|X)). In fact we have: V(β̂) = E(V(β̂|X)) + V(E(β̂|X)). But, here, we have E(β̂|X) = β (due to unbiasedness), so the second term in the formula is equal to 0.
We then have

V(β̂) = E(V(β̂|X)) = σ²(E((X'X)^{-1}) + E(CC'))

This is the varcov matrix of the OLS estimate when X is stochastic, plus the expected value of an at least psd matrix. If the second term is at least psd too, we have our proof.
This is easy: we must show that for any non stochastic vector z we have z'E(CC')z ≥ 0. But, by the basic properties of the expected value operator, z'E(CC')z = E(z'CC'z). We already know that CC' is at least psd, which is equivalent to saying that, whatever z, we have z'CC'z ≥ 0. The expected value of a non negative number cannot be negative (internality property of the expected value) and we have our proof.
You should also notice that, in this proof, we allow C to be stochastic UNCONDITIONALLY on X, provided it is non stochastic CONDITIONALLY on X.

9.7 Fit and errors of fit


We call Ŷ = Xβ̂_OLS the "fitted" values of Y. In fact Ŷ is to be understood as an estimate of E(Y) = Xβ.
We call ε̂ = Y − Ŷ the "errors of fit".
Notice that, by the first order conditions of least squares, we have:

X'Xβ̂_OLS = X'Y

X'(Xβ̂_OLS − Y) = 0

X'ε̂ = 0

This in particular implies

β̂'_OLS X'ε̂ = Ŷ'ε̂ = 0

This result is independent of the OLS hypotheses and depends only on the fact that β̂_OLS minimizes the sum of squared errors.
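These orthogonality conditions can be verified on any dataset. A minimal sketch, assuming NumPy (data invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

# OLS via the normal equations X'X beta_hat = X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
e_hat = Y - Y_hat

print(X.T @ e_hat)    # a vector of (numerical) zeros: X' eps_hat = 0
print(Y_hat @ e_hat)  # 0: fitted values orthogonal to the errors of fit
print(e_hat.mean())   # 0: the intercept column forces mean-zero errors of fit
```

No statistical hypothesis is used here: the zeros come purely from the first order conditions of least squares.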

9.8 R²
This result, joined with the assumption that the first column of X is a column of ones, allows us to define an index of "goodness of fit" (read: how much did I minimize the squares?) called the R².
In fact:

Y'Y = (Ŷ + ε̂)'(Ŷ + ε̂) = Ŷ'Ŷ + ε̂'ε̂

where the last equality comes from the fact that Ŷ'ε̂ = 0.
Moreover, if X contains as its first column a column of ones, X'ε̂ = 0 implies that the sum, hence the arithmetic average, of the vector ε̂ is equal to zero. So the mean of Ŷ is the same as the mean of Y:

Ȳ = mean(Ŷ) + mean(ε̂) = mean(Ŷ)

and:

Y'Y − nȲ² = Ŷ'Ŷ − n·mean(Ŷ)² + ε̂'ε̂

where n is the length of the vectors (the number of observations). In other words, indicating with Var(Y) the numerical variance of the vector Y (that is: the mean of the squares minus the squared mean), we have:

Var(Y) = Var(Ŷ) + Var(ε̂)

We see that the variance of Y neatly decomposes into two non negative parts. There is no covariance! This is totally peculiar to the use of least squares and it suggests the definition of a very natural measure of "goodness of fit":

R² = Var(Ŷ)/Var(Y) = 1 − Var(ε̂)/Var(Y)

Notice that, in order to be meaningful, the R² requires the presence of a column of ones (or of any constant, in fact) in X. Otherwise the mean of the errors of fit may be different from 0 and the passage from sums of squares to variances is no longer possible (the mean of Ŷ shall not, in general, be the same as the mean of Y).
Due to sampling error we may expect to observe a positive R² even when there is no "regression" between Y and X. We can also say something about the expected size of an R² when it should actually be 0. Suppose that, conditionally on an n × k matrix X of regressors, Y is a vector of n iid random variables of variance σ², so that the regression of Y on X should yield an R² of 0. The expected value of the sampling variance of the elements of Y (that is: of the denominator of the R²) is σ², while we can show that the expected value of the sampling variance of the elements of the error of fit vector ε̂ is σ²(n − k)/n, so that the expected value of the sampling variance of Ŷ in the same regression is going to be σ²k/n (because Y = Ŷ + ε̂), and this is the numerator of the R².
While we know that the expected value of a ratio is not the ratio of the expected values, this could still be a good approximation, so we can say that, in the case of no regression at all between Y and X (that is: theoretical value of the R² equal to 0), the expected value of the R² is approximated by (σ²k/n)/σ² = k/n, and this number could be quite big if you use many variables and do not have many observations.
This simple fact should make us wary of using regression as an "exploratory" tool for finding the "most relevant variables" in a wide set of potential candidates. Such an attitude has been common, and was criticized, many times in the past and, today, is back in fashion within the "data mining" movement. "To be wary" does not mean "to utterly avoid": taken with care, such procedures and, more in general, exploratory data analysis, may be useful.

9.9 Statistical properties of Ŷ and ε̂


Let us now study some statistical properties of Ŷ and ε̂, that is: properties depending on the OLS hypotheses.
As usual we start with the weak OLS hypotheses in the non stochastic X case.
First we compute the expected values and covariance matrices of both vectors.

E(ε̂) = E(Y − Xβ̂_OLS) = Xβ − Xβ = 0

V(ε̂) = V(Y − Xβ̂_OLS) = V(Y − X(X'X)^{-1}X'Y) = V((I − X(X'X)^{-1}X')Y) =

= σ²(I − X(X'X)^{-1}X')(I − X(X'X)^{-1}X') = σ²(I − X(X'X)^{-1}X')

this because, by direct computation, we see that

(I − X(X'X)^{-1}X')(I − X(X'X)^{-1}X') = (I − X(X'X)^{-1}X')

that is: (I − X(X'X)^{-1}X') is an idempotent (and symmetric) matrix.
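Idempotency is easy to verify directly; a sketch assuming NumPy (the toy X is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

H = X @ np.linalg.inv(X.T @ X) @ X.T  # "projection" matrix: Y_hat = H Y
M = np.eye(n) - H                     # residual maker: eps_hat = M Y

print(np.allclose(M @ M, M))            # True: M is idempotent
print(np.allclose(H @ H, H))            # True: H is idempotent too
print(np.allclose(M @ X, 0, atol=1e-8)) # True: M annihilates the columns of X
```

The last line is the matrix form of X'ε̂ = 0 seen in the previous section.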


With similar passages we find that:

E(Ŷ ) = Xβ

V(Ŷ) = σ²X(X'X)^{-1}X'

In summary: we see that Ŷ is indeed an unbiased estimate of E(Y). On the other hand, we see that ε̂ shows a non diagonal covariance matrix even if (or, better, just because) the vector ε is made of uncorrelated errors.
This property of the estimated residuals is, in some sense, unsatisfactory and led some researchers to define a different estimate of the residuals (non OLS based) with the property of being uncorrelated under the OLS hypotheses. This different estimate, which we do not discuss here, is known in the literature as the BLUS residuals (where the ending S stands for "scalar", that is: with diagonal covariance matrix).
If X is stochastic, as usual, we work conditionally on X and under the weak OLS hypotheses with stochastic X.

E(ε̂|X) = E(Y − Xβ̂_OLS|X) = Xβ − Xβ = 0

and, since E(ε̂) = E(E(ε̂|X)), we have E(ε̂) = 0 (notice that, for simplicity, we now drop the suffixes of the expected values).
A similar result holds for V(ε̂). It must be said, however, that in this case only the conditional values, and not the marginal values, are usually of interest, as we usually condition on the observed X.

9.10 Strong OLS hypotheses, confidence intervals and testing linear hypotheses in the linear model
This is a very short section for two very relevant topics. We are not going to deal with confidence interval theory or with the general testing of linear hypotheses, but only with those tests which are routinely found in the output of standard OLS regression computer programs.
Let us begin by introducing the strong OLS hypotheses. In short, these are the same as the weak OLS hypotheses (with non stochastic X or stochastic X) plus the hypothesis that the ε vector is not only made of uncorrelated, zero expected value and constant variance random variables, but is also distributed (conditional on X, if X is stochastic) as an n dimensional Gaussian with expectation vector made of zeros and diagonal variance covariance matrix.30
30 A k dimensional random vector z̃ is distributed according to a k dimensional Gaussian density with expected values vector µ and variance covariance matrix Σ if and only if the density at any vector z of possible values for z̃ is given by

f(z; µ, Σ) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(z − µ)'Σ^{−1}(z − µ))

(As usual in the text, we shall omit tildes for distinguishing between RVs and their values when this should not create confusion.)

We can do this both with non stochastic X and with stochastic X. However, since in practice, and always in what follows, we shall condition on X, there shall practically be no difference.
Why this hypothesis? When we wish to test hypotheses we need to find distributions of sample functions; for instance, we are going to need the distribution of β̂_OLS.
Up to now we know that, under the weak OLS hypotheses, β̂_OLS has expected value vector β and variance covariance matrix σ²(X'X)^{-1}. Moreover we know that β̂_OLS = β + (X'X)^{-1}X'ε, that is: it is a linear function of ε (and β and X are non stochastic or, if X is stochastic, we condition on it). With the added hypothesis we can then conclude that β̂_OLS has a (possibly conditional on X) Gaussian distribution with expected value vector β and variance covariance matrix σ²(X'X)^{-1}.
This may seem too expedient: OK, computations are now simple, but why Gaussian errors? In fact it often is too expedient, and the pros and cons of the hypothesis could be (and are) discussed at length. For the moment we shall take it as a beginner's usage in the econometric world we live in, a usage to be handled with much care.
First, confidence intervals.
Under the above hypotheses we have (no proof required)

(β̂_j − β_j) / √(σ²{(X'X)^{-1}}_{jj}) ≈> N(0, 1)
If we remember the properties of determinants and inverses of diagonal matrices, we see from this formula that, in the case of a diagonal covariance matrix, the density becomes a product of k uni dimensional Gaussian densities (one for each element of the vector z̃). So, in the Gaussian case, non correlation and independence are the same. In fact, if Σ is a diagonal matrix with diagonal terms σ_i², we have |Σ| = ∏_{i=1}^k σ_i² and Σ^{−1} = diag(1/σ_1², ..., 1/σ_k²), so that

f(z; µ, Σ) = (2π)^{−k/2} (∏_{i=1}^k 1/σ_i) exp(−(1/2) ∑_{i=1}^k (z_i − µ_i)²/σ_i²) =

= ∏_{i=1}^k (2πσ_i²)^{−1/2} exp(−(1/2)(z_i − µ_i)²/σ_i²) = ∏_{i=1}^k f(z_i; µ_i, σ_i²)

In words: with a diagonal covariance matrix the joint density is the product of the marginal densities, which is the definition of independence.
An important property of a k dimensional Gaussian distribution is that, if A and B are non stochastic matrices (of dimensions such that A + Bz̃ is meaningful), then the distribution of A + Bz̃ is Gaussian with expected value vector A + Bµ and variance covariance matrix BΣB'. Linear transforms of Gaussian random vectors are Gaussian random vectors. This, for instance, implies that, if z̃ is a Gaussian random vector, then each z̃_i is Gaussian, as we stated a moment ago in the discussion of the equivalence between non correlation and independence for the Gaussian distribution. This is easy to see: just write z̃_i = 1_i'z̃, where 1_i is a k dimensional vector with null elements except a 1 in the i-th place, and apply the linearity property.

where the distribution is to be intended conditional on X, if X is stochastic.
We then have that

P(β_j ∈ [β̂_j ± z_{1−α/2} √(σ²{(X'X)^{-1}}_{jj})]) = 1 − α

Hence

[β̂_j ± z_{1−α/2} √(σ²{(X'X)^{-1}}_{jj})]

is a 1 − α confidence interval for β_j on the basis of its estimate β̂_j, where z_{1−α/2} is the usual 1 − α/2 quantile of the standard Gaussian distribution.
In the case of unknown σ², we can estimate it with

σ̂² = ε̂'ε̂ / (n − k)

and, again without proof, we have that

[β̂_j ± t_{n−k,1−α/2} √(σ̂²{(X'X)^{-1}}_{jj})]

is a 1 − α confidence interval for β_j on the basis of its estimate β̂_j, where t_{n−k,1−α/2} is the 1 − α/2 quantile of the T distribution with n − k degrees of freedom (n is the number of rows and k the number of columns of X).
Second, testing statistical hypotheses.
We suppose everybody knows what a statistical hypothesis is (see the Appendix in case). We now define a "linear" statistical hypothesis.
A linear hypothesis on β can be written as Rβ = c, where R is a matrix of known constants and c a vector of known constants. For the purpose of this summary we shall concentrate on two particular choices of R: a 1 × k vector R where only the j-th element is 1 and the others are zeros, and a (k − 1) × k matrix where the first column is of zeros and the remaining (k − 1) × (k − 1) square matrix is an identity. In both cases c is made of zeros (in the first case a single 0 and in the second a (k − 1) vector of zeros).
The first kind of hypothesis is simply that the j-th β is zero (while all the other parameters are free to take any value); the second kind of hypothesis is simply that all the parameters are jointly zero (with the possible exception of the intercept). For (non trivial) historical reasons these hypotheses are considered so frequently relevant that any program for OLS regression tests them. Whether these hypotheses are of interest to you is for you to evaluate.
I shall not detail the procedure for the test of the hypothesis that all the parameters except, possibly, the intercept are jointly equal to zero. I only mention the fact that the result of this test is usually displayed in any OLS program output. The name of this test is the F test.
A little more detail on the univariate test.

The standard test for the hypothesis H0: β_j = 0 against H1: β_j ≠ 0 (complete the specification of the hypotheses by yourself) requires the distribution of β̂_OLS, and this requires, as we wrote above, a strengthening of the OLS hypotheses, which takes the form of the assumption that ε is distributed according to an n-variate Gaussian: ε ≈> N_n(0, σ²I_n).
We do not discuss here the reasons for and against this hypothesis.
Under this hypothesis, as seen above, we can show that β̂_OLS ≈> N_k(β, σ²(X'X)^{-1}) (possibly conditional on X, if this is stochastic). Hence the ratio:

(β̂_j − β_j) / √(σ²{(X'X)^{-1}}_{jj})

(I drop the subscript OLS from the estimate in order to avoid double subscript problems) is distributed according to a standard Gaussian (i.e. N_1(0, 1)).
Suppose now we set β_j = 0 in the above ratio. In this case the distribution of the ratio shall be a standard Gaussian only if H0: β_j = 0 is true. This allows us to define a reasonable rejection region for our test.
Reject H0: β_j = 0 with a size of the error of the first kind equal to α iff:

β̂_j / √(σ²{(X'X)^{-1}}_{jj}) ∉ [−z_{1−α/2}; +z_{1−α/2}]

(which can be written in many equivalent forms).


In the above formula {(X'X)^{-1}}_{jj} is the j-th element on the diagonal of (X'X)^{-1}, and z_{1−α/2} is the quantile of the standard Gaussian which leaves on its left a probability of 1 − α/2.
This solves the problem if σ² is known. In case it is unknown, as above, we estimate it with:

σ̂² = ε̂'ε̂ / (n − k)

and use as critical region:

β̂_j / √(σ̂²{(X'X)^{-1}}_{jj}) ∉ [−t_{n−k,1−α/2}; +t_{n−k,1−α/2}]

where t_{n−k,1−α/2} is the quantile of a T distribution with n − k degrees of freedom which leaves on its left a probability equal to 1 − α/2.
The use of the T distribution is the reason for the name given to this test: the T test.
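The whole procedure can be sketched in a few lines, assuming NumPy and SciPy (the data, true coefficients and names are invented for illustration; the last true coefficient really is 0):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, k = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([0.5, 1.5, 0.0])  # the last coefficient really is 0
Y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
e_hat = Y - X @ beta_hat
sigma2_hat = e_hat @ e_hat / (n - k)

# T statistics for H0: beta_j = 0, one per coefficient
t_stat = beta_hat / np.sqrt(sigma2_hat * np.diag(XtX_inv))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
print(t_stat)
print(np.abs(t_stat) > t_crit)  # True where H0: beta_j = 0 is rejected
```

For the coefficient whose true value is 1.5 the statistic is far outside the acceptance region; for the zero coefficient the test rejects only with probability α (an error of the first kind).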

As in the case of the T test, there exist many different hypotheses which can be tested using an F test, but the standard hypothesis tested by the universally reported F test is:
H0: all the betas corresponding to non constant regressors are jointly equal to 0
H1: at least one of the above mentioned betas is not 0
The idea is that, if the null is accepted (i.e. big P-value), no "regression" exists (provided we did not make an error of the second kind, obviously). Hence the popularity of the test.
A last detail on the stochastic X case. As we see here, working with non stochastic X or conditionally on X yields the same results.
We should stress, however, that many properties hold unconditionally on X. For instance, since:

P(β_j ∈ [β̂_j ± z_{1−α/2} √(σ²{(X'X)^{-1}}_{jj})]) = 1 − α

when this is seen as a probability conditional on X, in the case of stochastic X, we also see that it does not depend on X. So this is also the unconditional probability of being in the interval.
In practice, however, we shall always condition on the observed X, as we could not compute the interval if we did not observe X.

9.11 “Forecasts”
Here we intend the term "forecast" in a very restricted meaning.
Suppose you estimated β using a sample of Y and X; let us say that the sample is of n observations (rows).
Now suppose a new set of m rows of X is given to you and you are asked to assess what could be the corresponding Y.
Stated like this, the question does not allow for an answer. We need to assume some connection between the old rows and the new rows of data. A possibility is as follows.
Let the model for the n rows of data used to estimate β with β̂_OLS be

Y = Xβ + ε

Suppose we now have data for m more rows of the variables in X and call these X_f. Let the model for the corresponding new "potential" observations be

Y_f = X_f β + ε_f

and suppose (we consider here the general case where X can be stochastic) that E(ε|X, X_f) = 0 = E(ε_f|X, X_f), V(ε|X, X_f) = σ²I_n, V(ε_f|X, X_f) = σ²I_m and E(εε_f'|X, X_f) = 0.
(Notice the double conditioning on both X and X_f.)

In this case the obvious (BLUE) estimate for E(Y_f|X, X_f) is Ŷ_f = X_f β̂, with expected value X_f β and variance covariance matrix σ²X_f(X'X)^{-1}X_f'.
If we define the "point forecast error" as Y_f − Ŷ_f, its expected value shall be 0 and its variance covariance matrix σ²(I_m + X_f(X'X)^{-1}X_f').
Be very careful not to mistake these formulas for the corresponding ones for Ŷ.
On the basis of these formulas, and working either under the strong OLS hypotheses or using Tchebicev, it is possible to derive (exact or approximate) confidence intervals for the estimate of the expected value of each element in the new set of observations and for the corresponding point forecast errors.
For instance, under the Gaussian hypothesis, the two tails 1 − α confidence interval for the expected value of a single observation in the forecasting sample, under the hypothesis of a known error variance and corresponding to a row of values of X_f equal to x_f, is given by:

[x_f β̂_OLS ± z_{1−α/2} σ √(x_f(X'X)^{-1}x_f')]

The corresponding confidence interval for the point forecast, that is, the forecast interval which takes into account the point forecast error, is:

[x_f β̂_OLS ± z_{1−α/2} σ √(1 + x_f(X'X)^{-1}x_f')]

In the case the error variance is not known, it is estimated with the unbiased estimate

σ̂² = ε̂'ε̂ / (n − k)

as described above; the only changes to be made in the formulas are: σ substituted with its estimate σ̂ = √σ̂², and z_{1−α/2} substituted with t_{n−k,1−α/2}, that is: the 1 − α/2 quantile of a T distribution having as degrees of freedom the difference between n and k, the number of rows and columns of X, the regressors matrix in the estimation sample.
It is easy to see that the second interval shall always be bigger than the first, as it takes into account not only the sampling uncertainty in estimating β but also the uncertainty added by ε_f.

9.12 A note on P-values
The standard procedure for a test is:

• Choose H0 and H1 (they should be exclusive and exhaustive of the parameter space).

• Choose a size of the error of the first kind: α. Be careful: too small an α usually implies a big error of the second kind. Your choice should be based on a careful analysis of the costs implied by the two kinds of error.

• Find a critical region for which the maximum size of the error of the first kind is α and, possibly, with a sensible size of the error of the second kind.

• Reject H0 if your sample falls in the critical region, otherwise do not reject H0 ("do not reject" is more precise than the term "accept").

This procedure typically requires the availability of statistical tables.


When computer programs for performing tests came to the fore, two alternative paths were possible.
The first: let the user input the α for each test and, as output, state whether the null hypothesis is accepted or rejected on the observed dataset with that α.
The second: let the user input nothing, and give as output the value of α for which the observed data would have been exactly on the brink of the rejection region. This value is called the P-value.
With this information the user, knowledgeable about his or her preferred α, can state whether the null hypothesis is accepted or rejected by simply comparing the preferred α with the value given by the program. If the researcher's α is smaller than the value given by the program, the observed data is inside the acceptance region as the researcher would have computed it, so that H0 is accepted. If the researcher's α is bigger than the value given by the program, the observed data is outside the acceptance region as the researcher would have computed it, so that H0 is rejected.
Conceived for the simple reason of avoiding the use of tables, the P-value has become, in the hands of Statistics illiterates, the source of numberless misunderstandings and of sometimes amusing bad behaviors.
The typical example is the use of terms like "highly significant" for null hypotheses rejected with small P-values, or the use of "stars" for indicating pictorially the "significance" of a hypothesis. A strange attitude is that of considering optimal, a posteriori, a small P-value which, a priori, would never be considered a proper α value.
Sometimes, worst of all, the P-value is taken as an estimate of the probability of the null hypothesis being true given the data. A magnificent error, since in testing theory we only compute the probability of falling in the rejection region given the hypothesis, and not the probability of the hypothesis given that the data falls in the rejection region.
Please avoid this and other trivial errors: the fact that such errors are widespread, even in the scientific community, does not make them less wrong.
Something more on this point in what follows.

9.13 The partial regression theorem.


What we have done up to now is a basic analysis of the statistical properties of the linear model and of the OLS estimates.
We now need to develop tools which allow us to really understand, and correctly read, the results of a linear model.
The two main (and connected) tools for this purpose are the partial regression theorem and the definition of the semi partial R².
There exist two versions of the partial regression theorem. They are very similar because the proof is based on the strong mathematical similarity between two completely different objects: frequencies and probabilities.
We first prove the "frequency based" version, that is: the partial regression theorem valid for OLS estimates.
The second version has to do with "theoretical" regression functions, that is with probability, and can be seen as a direct application of the law of iterated expectations to a linear regression.
While quite obvious in terms of proof, the partial regression theorem tells us something which, maybe, is a priori unexpected: any given coefficient in a linear regression is NOT a derivative with respect to the corresponding variable, in the common sense of the term.
In fact, what a coefficient in a linear regression really is, is something of much more interest, and understanding this is fundamental in order to correctly interpret the result of a regression.

Theorem 9.5. The estimate of any given linear regression coefficient βj in the model
E(Y |X) = Xβ can be computed in two different ways yielding exactly the same result:
1) by regressing Y on all the columns of X,
2) by first regressing the j−th column of X on all the other columns of X, computing
the residuals of this regression and then by regressing Y on these residuals.

Proof. Write the model as Y = X_j β_j + X_{−j} β_{−j} + ε, where you isolate the j-th column of X in X_j and put the rest in X_{−j}. To make things simple, suppose the intercept is in X_{−j}.
You estimate it with OLS and get:

Y = X_j β̂_j + X_{−j} β̂_{−j} + ε̂

Now write the auxiliary regression X_j = X_{−j} γ_j + u_j and estimate it with OLS to get X_j = X_{−j} γ̂_j + û_j.
Substitute this in the original OLS estimated model:

Y = (X_{−j} γ̂_j + û_j)β̂_j + X_{−j} β̂_{−j} + ε̂ = û_j β̂_j + X_{−j}(γ̂_j β̂_j + β̂_{−j}) + ε̂

By orthogonality of û_j with both X_{−j} and ε̂ we get

∑_i Y_i û_{ij} = ∑_i (û_{ij} β̂_j + X_{i,−j}(γ̂_j β̂_j + β̂_{−j}) + ε̂_i) û_{ij} = ∑_i û_{ij}² β̂_j

so that β̂_j = ∑_i Y_i û_{ij} / ∑_i û_{ij}², which, since the mean of û_j is equal to 0 (X_{−j} contains the intercept), is identical to the OLS estimate in a regression of Y on û_j alone.
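The theorem can be checked numerically on any dataset; a sketch assuming NumPy (the toy data and column choice are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
Y = X @ np.array([1.0, 0.3, -0.7, 2.0]) + rng.normal(size=n)

j = 2  # pick one column (not the intercept)
X_j = X[:, j]
X_mj = np.delete(X, j, axis=1)  # all the other columns, intercept included

# 1) Coefficient of X_j in the full regression of Y on X
beta_full = np.linalg.lstsq(X, Y, rcond=None)[0][j]

# 2) Regress X_j on X_mj, take the residuals u_hat,
#    then regress Y on u_hat alone
gamma = np.linalg.lstsq(X_mj, X_j, rcond=None)[0]
u_hat = X_j - X_mj @ gamma
beta_partial = (Y @ u_hat) / (u_hat @ u_hat)

print(beta_full, beta_partial)  # identical up to rounding
```

The equality holds exactly (up to floating point error) for any data: it is an algebraic property of OLS, not a statistical one.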
A similar result is directly valid for the regression function (if we suppose all the regressions involved to be linear; otherwise a similar but more general result holds). The result is valid without considering estimates, directly as a property of theoretical linear regression functions. In this case the statement of the theorem becomes:

Theorem 9.6. Any given linear regression coefficient β_j in the linear regression E(Y|X) = Xβ is identical to the coefficient of the regression of Y on X_j − E(X_j|X_{−j}), if we suppose E(X_j|X_{−j}) = X_{−j} γ_{X_j|X_{−j}}, that is: linear in X_{−j}.

Proof. The proof mimics the proof based on estimates of βj and goes as this:

E(Y |X−j ) = EXj |X−j (E(Y |X)) = X−j β−j + E(Xj |X−j )βj =
= X−j β−j + X−j γXj |X−j βj = X−j (β−j + γXj |X−j βj )
now, write E(Y |X) as X−j β−j + Xj βj and add and subtract X−j γXj |X−j βj (that is
E(Xj |X−j )βj )

E(Y |X) = X−j β−j + Xj βj = X−j β−j + (Xj − X−j γXj |X−j )βj + X−j γXj |X−j βj

By the basic properties of regression, (Xj − X−j γXj |X−j ) (the forecasting error of Xj
forecasted with X−j ) is orthogonal to X−j so that, a linear regression of Y on both
(Xj − X−j γXj |X−j ) and X−j or on each of them shall give the same coefficients.
We have, then:

E(E(Y |X) | Xj − X−j γXj |X−j ) = E(Y |Xj − X−j γXj |X−j ) = (Xj − X−j γXj |X−j )βj

In words: βj is both the regression coefficient of the column Xj in the full regression
AND the regression coefficient of (Xj − X−j γXj |X−j ) in the regression of Y only on
(Xj − X−j γXj |X−j ) (that is: on the residuals of the regression of Xj on X−j ).
You should notice that, in this proof, regressions are required to be linear, while the
proof concerning estimates only requires that the estimates come from the use of OLS
in linear models.
Notice, moreover, that the first proof is based on the algebraic properties of OLS
estimates: weak or strong OLS hypotheses are not required. In practice, the only
property used in the proof is that of orthogonality (with intercept included) which
directly comes from OLS.
This result, in both versions, is relevant from the point of view of interpreting the
meaning of a βj or of its estimate.

In fact, the result implies that each βj is not connected with some “relationship”
between the j − th column of X and Y , but only with the relationship between Y and
the part of the j − th column of X which is (linearly) regressively independent of the
other columns of X.
In other words: the linear regression model does not measure, in any sense, the
“effect” of a given column of X.
Whatever the definition of such “effect”, this has only to do with the part of this
column variance which is uncorrelated with the other columns.
As a consequence: the meaning of a regression coefficient for the same variable
depends on which other variables are in the regression and both the coefficient and the
meaning change if we change the other variables in the regression.
This is completely natural, as in any regression I make a forecast conditional on a
different set of information.
We used the term: “effect”. While a regression may have a causal interpretation,
this is by no means necessary or even common. It is then important, when we speak
of “effect”, to avoid the impression of speaking in causal terms.
We shall then define and measure the “effect of a variable” in a regression for what
it is and for what it is implied to be in the partial regression model.
For us, this is just the marginal “effect” or “contribution” of a column of X in
reducing the mean square error or, equivalently, improving the forecast performance,
when the other columns are accounted for in the sense of the above theorem.
This “effect” is better understood in “informational” terms, as the ability to improve
the quality of a fit by adding information to a given information set or, if the inferential
extension of fit to forecast is justified (see above for the hypotheses which justify a
forecast), the quality of a forecast.
When the intercept is in the model, this is measured by the increase of R2 you get
if you add the Xj column to the model, or, equivalently, by how much the R2 decreases
if you drop such column from the model.
This quantitative measure, specific to Xj as used in a regression with a GIVEN set
of other variables X−j , is called the “semi partial R2 ”.

9.13.1 Semi Partial R2


We define this as the “marginal” change of R2 due to each column in X after accounting
for the other columns.
This seems to imply that its computation is, if not difficult, quite long: run the
regression of Y on the full X, compute the corresponding R2 and then drop in turn
each single column in X, one at a time, and measure the corresponding change in R2 .
This is not only long to do, it could be impossible if we are evaluating regression
results as we read them in a paper or a report, as we would need to work on the
original data to perform the computations.

It is quite frequent, at least in the social sciences milieu, that even the overall R2
is not reported.
We have a way out of this which can almost always (in simple OLS setting) be
implemented.
A “folk” and simple result of OLS algebra (we give it here without proof, but see
further on for a proof, not required for the exam, in a footnote) allows us to determine
the marginal contribution of each column in X to the R2 even if we only know the
T −ratios for the single parameters and the size n of the data set.

Lemma 9.7. Suppose we are using OLS and we drop the column Xj from the matrix
X. The decrease of the overall R2 corresponding to the dropped column (call this
R2 − R2−j ) is equal to t2j (1 − R2 )/(n − k), where t2j is the square of the T −ratio for
the dropped variable, n is the number of observations and k is the number of columns in
the full regressors matrix. This is called the “semi partial R2 ” for Xj and is nothing but
the R2 of the regression of Y on the residuals of the partial regression of Xj on X−j .

Here the T −ratio is assumed to be computed with the standard formula we gave
in the section about OLS.
An interesting point in this result is that it allows us to “recycle” a quantity which
we considered as just a measure of statistical reliability, as a useful way for reading
the R2 . This is just an algebraic result, that is: it is valid in any sample and does
not require either weak or strong OLS hypotheses to be valid. It is just a numerical
identity.
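A minimal sketch of this identity on made-up simulated data (Python/NumPy; the variable names and data-generating values are our own assumptions, not from the text): it computes the drop in R2 by refitting without the column and compares it with the t2j (1 − R2 )/(n − k) formula.

```python
import numpy as np

# Simulated data (hypothetical values, for illustration only).
rng = np.random.default_rng(1)
n, k = 200, 4                              # n observations, k columns (incl. intercept)
Z = rng.normal(size=(n, k - 1))
X = np.column_stack([np.ones(n), Z])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=n)

def r_squared(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return 1.0 - (e @ e) / np.sum((y - y.mean()) ** 2)

j = 2                                      # column whose contribution we measure
R2_full = r_squared(X, y)
R2_drop = r_squared(np.delete(X, j, axis=1), y)

# Standard T-ratio for beta_j: beta_j / sqrt(s^2 [(X'X)^-1]_jj)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
s2 = (e @ e) / (n - k)
t_j = beta[j] / np.sqrt(s2 * np.linalg.inv(X.T @ X)[j, j])

semi_partial_direct = R2_full - R2_drop
semi_partial_formula = t_j ** 2 * (1.0 - R2_full) / (n - k)
assert np.isclose(semi_partial_direct, semi_partial_formula)
```

The assertion holds in any sample, as the lemma states: no weak or strong OLS hypotheses are involved, only OLS algebra.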
Beyond allowing us to compute the semi partial R2 , this result has other interesting
implications. As we stressed before, with a big sample size it is very difficult for
a T −ratio not to be “statistically significant” as even a very small number can be
distinguished from zero if the sampling variance is small enough (the denominator of
the T -ratio is divided by the square root of n − k). However, when n − k is big, it
is quite possible that the estimate could be “significant” while, at the same time, the
relevance of the variable could be totally negligible, as the added contribution of the
variable to the explanation of Y variance could be negligible.
Suppose you have, say, n = 10000 (not uncommon a size for a sample in social
sciences and in Finance). Suppose you have 10 columns in the X matrix and the
T −ratio for a given explanatory variable is of the order of 4, so that the P −value shall
be well below .01: “very” statistically significant! True, but the above lemma tells
us that the contribution of this variable to the overall R2 is at most (that is: even
for an overall R2 very near to 0) approximately 16/10000, that is: less than two
thousandths. Hardly relevant from any practical point of view! If I drop the variable the
overall explanatory power of the model drops by way less than 1%.
Another way to see the same point is this: how big should be the T −ratio, under the
previous hypotheses, so that the marginal contribution of the regressor to the overall

R2 is, say, 10%? From the above formula we have that the T −ratio should be of the
order of 32 (the square root of 1000 is about 31.62).
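The back-of-the-envelope numbers above can be checked directly (a sketch using the hypothetical values from the text):

```python
# Hypothetical values used in the text's example: n = 10000 observations,
# k = 10 regressors, T-ratio of 4.
n, k = 10000, 10
t = 4.0

# Upper bound for the semi partial R^2 (it is largest when the overall
# R^2 is near 0): t^2 (1 - R^2) / (n - k)
max_contribution = t ** 2 * (1.0 - 0.0) / (n - k)   # about 0.0016, i.e. < 2/1000

# T-ratio needed for a 10% marginal contribution (again with R^2 near 0)
t_needed = (0.10 * (n - k) / (1.0 - 0.0)) ** 0.5    # about 31.6
```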
This makes even more clear the fact that, in general, “statistical significance” (that
is: the estimate is precise enough so that we are able to distinguish it from 0) and
“relevance” (here measured by the contribution to the forecasting power of the model
as measured by R2 ) are very different concepts.
It is now important to understand how semi partial R2 works across different
columns of X.
If we compute this quantity for each column of X, we measure the marginal contri-
bution of each of these columns in “explaining” (a frequently used but not so correct term)
the variance of Y . As stated above, here “marginal” means: how much the introduction
of the variable improves the R2 when the other variables are already all in the model.
This means that, each time we compute this quantity for a different column Xj , the
“other variables” left in X−j are different. For this reason, while we can split the overall
R2 in a part due to the introduction of a new variable and a part already “explained”
by the other variables, we cannot add in any meaningful way the semi partial R2 of
different variables except in the case when all the columns of X are uncorrelated (not
very interesting in our field).
Summary up to this point: a regression is a forecast which minimizes the mean
square error (if it is linear with intercept it minimizes the variance of the error). This is
the purpose of a regression and it should be evaluated in so far the quality of the forecast
is sufficient to our purpose (the specific purpose is going to enter in the evaluation).
While in a forecasting setting there is no big role for speaking of “the effect” of this
or that column of X, it is possible (partial regression theorem) to define the marginal
contribution of each column of X to the overall R2 . The quantitative measure of this
contribution is given by the semi partial R2 .

9.14 The interpretation of estimated coefficients


9.14.1 Introduction and summary
In order to properly use a linear regression, we must understand its meaning both as
a probabilistic object, that is: as a regression function and as a statistical tool, that is:
when parameters are estimated from data.
This, in principle, is not difficult. A possible problem comes from the success itself
of linear regression as a data analysis tool.
In fact, the widespread use of linear regression in almost all fields of empirical
research, sometimes by researchers lacking a proper statistical training, did create a
number of very approximate and sometimes completely wrong “readings” of the model,
idiosyncratic to the specific field.
In what follows we shall try to introduce the simple basis for a correct understanding

of a linear regression.
In this introduction we summarize the topic in a few points.

First: regression=forecast.
A regression function is, as we stated at the beginning of this chapter, a tool for
forecasting Y on the basis of X. The particular forecasting tool called “regression
function” is just the conditional expectation E(Y |X).
Linearity, that is E(Y |X) = Xβ is an hypothesis we add to our analysis.
As with any hypothesis this restricts the set of possible regression function models,
and this reduces generality, but makes easier the use of the restricted set of models.
In these handouts we do not consider the important topic of the “model functional
form” choice.
Suppose we know E(Y |X) that is, for instance in the linear case, we know β.
In this case the use of this function is straightforward: if you think the model applies
to a phenomenon where you know the values of the random variables in X, but not
those of the random variable Y , and you want to forecast Y , then use E(Y |X) = Xβ.
Obviously, you can “forecast” Y in a circumstance where you already know its value,
too. This may seem strange but, sometimes, it can be useful.
Since the term “forecasting” may (wrongly) be understood as implying that, in
some sense, X is “before” Y or even X causes Y , it is important to stress that this is
sometimes true but not required in any way.
It is frequently the case that Y could be something which happens before X but
you cannot observe Y while you observe X and use X to “forecast” Y .
Moreover, it may be that Y “causes” X (in some sense) but we can observe X and
not Y so we try to forecast Y on the basis of X. This is what doctors and detectives
and scientists do all the time: you have symptoms, clues, observations (X) which are
“determined” by “causes” which are not directly observable or not easily observable:
an illness, a culpable, a particular theory (Y ), and what you try to do is to “forecast”
these.
It may also be that there is no “causal” connection between Y and X but that both
of these are somewhat “determined” by some unobserved Z. In this case the knowledge
of X may help in forecasting Y .
In short, it is all a matter of information. You put in X the information you have
and use this for forecasting Y , and you could switch one for the other if the information
ordering changes.

Second: how good is the forecast? How relevant is a given element of X?


Here we have clear cut answers, if we stick to the properties of a regression function.
A regression function minimizes the MSE, a linear regression function (with inter-
cept) does this by maximizing the R2 .

By consequence, the overall quality of the forecast is measured by the size of R2
and the “importance” (for the quality of the forecast) of each element in X is measured
by its marginal contribution to the overall R2 . This is, simply, the difference between
the R2 of a model where the full X is used and a model where the element of interest
is dropped (linearity here is important).
While there may be other measures of relevance, connected with the specific problem
under analysis, this is the only measure connected with the general nature of the
regression function. Hence, this is the only purely statistical measure, in that it is
valid for any regression application in any field.
This is not to say that, in specific fields, due to the role and meaning of the variables
in the model, we cannot use other measures of relevance.
For instance: in many settings, it could be that variables contributing just a little
to the overall R2 are judged more “relevant” than others whose contribution is higher.
This may happen, e. g., when you can somewhat act (without changing the regres-
sion, see above) on the smaller contribution variables but not on the high contribution
variables.
For instance, in many sports, training is “relevant” for performance, but typically
less so than, say, age, stature and other biological characteristics and qualities you
cannot act on. The contribution of training to the overall R2 could be small, but you
can choose (“act on”) your training amount and method, not your age or stature.
In what follows we shall see how to estimate both R2 and the marginal contribution
to R2 of each element of X.

Third: comparison between hypothetical forecasts.


Sometimes, we make “hypothetical” forecasts in the sense that we compare E(Y |X =
x1 ) with E(Y |X = x2 ). Here we use x1 , x2 to stress that these are two different values
of the same random entity X (be it a random matrix, a random vector or a random
variable). Had we used a notation the like of X1 and X2 you could be led to think that
these are different random entities.
This comparison, in the linear case, boils down to computing (x1 − x2 )β.
In principle there is nothing bad in this. The correct meaning of this is that I want
to evaluate the difference between the forecasts if I observe X = x1 or X = x2 .
However, in many cases, underlying this analysis there is something that requires
understanding, to avoid errors.
Suppose, for instance, I can either observe the value of X “determined by Nature”
or I can try to “intervene” and force X to be equal one of the two values with some
“action”. The comparison between forecasts, in this second case, is driven by the fact
that I want to “choose” which “action” to perform based on the implied forecast.
It is quite relevant to state very clearly that, even if Y and X may have the same
“names”, the regression function to be used when forecasting without “intervention”

could be completely different from the regression function when we forecast after an
intervention.
In medicine this is simply stated by the fact that, while symptoms (X) are useful
to diagnose (“forecast”) an illness (Y ), a cure of the symptoms (changing X) is very
infrequently a cure of the illness (Y ).
More simply: if you see out of your window, leaves and branches shaking, probably
there is a strong wind, but if you shake leaves and branches you cannot instigate a
strong wind.
In both cases, to forecast Y on the basis of X under simple observation of X or
under action on X would imply the use of completely different regression functions.
A simple, if paradoxical, example is this: what you see in the tachometer of your
car is a very good estimate of the true instantaneous speed of the car (on a straight
road with no slippage). This means that, if Y is the true speed and X the tachometer
value, and you observe X during your trip, E(Y |X) shall be a good “forecast” of Y on
the basis of X. But now, paradoxically, suppose your car still has an analog tachometer
with a hand, and that you “act” on X by moving the hand with your finger. Most likely
this shall break the tachometer but, surely, the use of the same E(Y |X) which worked
very well if applied to observation without intervention, shall be now unwarranted.
Obviously, if you want to change your speed, you may act on the gas pedal, or on the
brake, or on the gear.
To be able to distinguish what we can get from observation and how this may be
connected with action, is a very important topic in any field that is both observational
and, at least sometimes, allows intervention.
It is the case of Economics, Medicine, Biology, Physics.
It is NOT the case in Cosmology, most of Demography and other purely observa-
tional sciences.
In Economics, under the names of “intervention analysis” or (worse) “causality” this
topic generated lots of interest and research.
In some fields a regression function useful for intervention analysis cannot be evalu-
ated using observational data but it can be evaluated by specific intervention procedures
called “experiments”.
This is possible if all or most relevant variables (variables which we would put in
X) can be chosen, observed or controlled by the researchers.
In fields like Economics, where some variables can be acted on but others are deter-
mined by the “economic system” which, for the most part, is unobservable, the study
of a regression function relevant for intervention is very difficult.
A typical consequence of forgetting this important point can be seen in the naive
reading of (x1 − x2 )β as “how much E(Y |X) changes if x2 becomes x1 ”.
This is not incorrect, if we intend as a simple comparison between potential forecast
given possible values of X. This is in general completely wrong if, using the same
regression, we intend it as a comparison of forecasts when we “act” and change x2 into

x1 . This is also true in reverse.

Fourth and last: enter statistics, how to estimate a regression function.


As a rule, in most applications, we do not know the regression function, we must
estimate it.
We need to find estimation methods and we need to evaluate the quality of the
estimates. Then, we need to quantify the sampling uncertainty of our estimates.
Estimation is of the essence but, under the point of view of results interpretation,
it changes very little.
Basically, statistical inference allows us to estimate the parameters of the model
and to assess numerical boundaries to the precision of these estimates.
If this is understood, the interpretation of the estimates of the regression function is
the same as the interpretation of the regression function itself, as above, with the added
measure of sampling variability.

Summary of what follows


In the following paragraphs we shall not further discuss the first point (regression=forecast)
as it is very simple and clearcut and already dealt with.
We shall begin by considering in some more detail the second point: how to measure
the quality of a forecast and the contribution of each column of X to this quality.
We shall then discuss the comparison between hypothetical forecasts, and its enticing
but dangerous connection with “intervention” or “causal” analysis.
Following this, we shall introduce some further topics related to the “estimation
phase” of the model (fourth point).
This has already been discussed in detail in the first part of this chapter dedicated
to linear regression. We already know the probabilistic hypotheses underlying the
evaluation of estimates and in particular OLS estimates. Under these hypotheses
we derived the expected value vector and the variance covariance matrix of the OLS
estimate. Under strong hypotheses we derived the distribution of the OLS vector and,
from this, confidence and forecasting intervals and some test critical regions.
What we shall add here is a small but important warning: how to avoid a mistaken
mix of variable relevance analysis (dealt with in point two) and parameter significance
(a purely statistical concept).
We shall then perform a full analysis of a linear model in a purely observational
setting.
As a last topic we shall go back to the idea of R2 as a measure of relevance to show
that this idea, if quite relevant, must in any case be considered with care.
This is the rule with any statistical tool: to use it you must know its properties
and always discuss the opportunity of using it in your specific context. Nothing is
“automatically correct”.

A point by point summary on how to read a regression and a list of suggested
readings close the chapter.
The suggested readings are “suggested” and are NOT IN ANY WAY required for
the exam while could be useful readings for those interested in some more info on linear
models AFTER the exam.

9.14.2 Understanding a linear regression model as a forecasting tool. The
central role of R2

A linear model of the kind Y = Xβ +  is not always describing a (linear) regression.
It does describe a regression if we assume, in some way, that E(Y |X) = Xβ.
So, for instance, if we do not suppose E(ε|X) = 0 the model is still a linear model
but we are NOT interpreting the model as a regression.
It could, then, be very interesting to analyze the nature of a linear model when it
does NOT model a regression, but we shall concern ourselves only with the case where
the linear model is the model for a regression.
A regression is, first and above all, a conditional expectation of one random variable,
given other random variables.
In our case we consider E(Y |X) or, better (row by row) E(yi |xi ) and in particular
we consider the linear case where we suppose E(yi |xi ) = xi β. (Here yi and xi are the
i − th rows of Y and X).
A regression is an optimal forecast of a variable given other variables.
We know that “optimal” here means “minimizing mean square error”.
Let us recall the result: a regression is a (vector) function of xi , φ(xi ) = E(yi |xi ),
which minimizes the “mean square error” E((yi − η(xi ))2 ) over all the functions
η(xi ).
As we did stress, this is very general and does not require the regression function
to be linear.
Given xi (be it a single variable or a vector of variables) there is no better way, in
the mean square error sense, to “forecast” yi than using the regression function.
Since the purpose of a regression is to minimize the mean square error, it should
be of interest to know how much such mean square error has been minimized in each
specific case.
We are in the setting of linear regressions; in the empirical version of a linear
regression (using data and not the theoretical distribution), minimizing the MSE becomes
minimizing the sum of squared residuals. This is equivalent (when the intercept is
included) to maximizing the R2 .
This is the meaning of the variance decomposition result from which we derived the
R2 .31
31 We should discuss the difference between the theoretical and the empirical variance, but this is
not of much relevance here.

Why is this discussion of optimal forecasts relevant for understanding the results
of a linear model?
Since a regression is a way to make forecast by minimizing a measure of error
the first, and always valid, “reading” of a regression must be first of all based on
summarizing ”how good” this minimization was.
In the context of linear regression this implies a first, simple, question: “how big”
is R2 ?
The second question, usually, is: “with this R2 , is the regression relevant?” The
term “relevant” creates many problems as, clearly, the answer shall depend on the
specific context. For this reason, while researchers do propose, e.g., reference values for
R2 under which a regression should be considered irrelevant, we suggest here a more
cautious path which tries to merge an “absolute” evaluation of the R2 with a more
specific connection to the specific practical context of the analysis.
We shall discuss this point further on, with examples.
However we cannot and do not stop here.
It is almost always the case that we look for some further “decomposition” of the
“explained variance” in terms of each single “explanatory variable”.
We want to define a variable by variable measure of relevance.
We shall be able to correctly understand a regression, if we shall be able to precisely
set the bounds under which this question has a meaning and if we shall be able to answer
to this question from within these bounds.
In the section about the partial regression theorem we already defined such a measure:
the semi partial R2 . In the following section we shall analyze some more properties of
this measure.

9.14.3 The “contribution” of a variable to the quality of a forecast


The semi partial R2 , or, better, its estimate, fully answers our quest for a measure of
variable by variable contribution to the quality of a forecast.
The role of this measure, however, does not stop here.
A relevant result (proof not required) is as follows:

R2 − R2−j = t2j (1 − R2 )/(n − k) = Var(Xj |X−j ) β̂j2 / Var(Y )

where the first term is the definition of semi partial R2 for the j − th column of
X, the first equality was presented in the partial regression theorem section and the
second equality is the new result we wish to discuss now.
Let us take the square root of the second and third terms:

√(t2j (1 − R2 )/(n − k)) = √Var(Xj |X−j ) |β̂j | / √Var(Y )

This is an interesting formula, from the point of view of a variable by variable
interpretation of the quality of a regression.
The square root of the semi partial R2 for Xj is, in units of the standard deviation
of the data on Y (the denominator of the rhs), the “change” in the conditional expectation
of Y given by a “reasonable” change in Xj . Reasonable here means equal to the
CONDITIONAL standard deviation of Xj GIVEN the values of the other columns
in X, that is an amount of √Var(Xj |X−j ).
This measures how much, on average, Xj may change (in a standard deviation
sense) when the other columns X−j are kept constant, hence measured with the
CONDITIONAL standard deviation of Xj GIVEN X−j . When we multiply this by |β̂j |
we translate it into a contribution to the standard deviation of Y , and if we divide by
this standard deviation we have a contribution in units of this standard deviation.
The idea of a first assessment of the “relevance” of a variable in a regression by
computing “how much the conditional expected value of Y changes if we change Xj of
one unit of standard deviation” is quite common in papers where linear regressions are
used.
Our analysis gives us a way to correctly perform such an analysis and shows that, when
correct, the result is just a simple transform of the semi partial R2 .
We can summarize this in several, equivalent, ways:

1. The square root of the semi partial R2 for a given column Xj is the ratio
between the fraction of the standard deviation of Xj that is not correlated with
the other columns of X and the standard deviation of Y , times the absolute value
of β̂j .

2. The absolute value of β̂j is the ratio between the fraction of the standard deviation
of Y “explained” by Xj alone (the square root of the semi partial R2 times the
standard deviation of Y ) and the amount of standard deviation of Xj that is
uncorrelated with the other columns of X.

3. β̂j in absolute value is the ratio between a “reasonable” movement of Y (one
sigma) and a “reasonable” movement of Xj conditional on the other columns of
X (one CONDITIONAL sigma), multiplied by the square root of the semi partial
R2 .

4. The square of β̂j is the ratio between what “is to be explained” (the variance of
Y ) and what is “left in Xj to explain Y ” (the conditional variance of Xj ), times
the fraction of the variance of Y “explained” by Xj given the other columns of X
(i.e. the semi partial R2 ).
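The equality between the refitting definition of the semi partial R2 and the Var(Xj |X−j ) β̂j2 / Var(Y ) expression can be verified numerically. The following sketch uses made-up simulated data (Python/NumPy; the variable names and data-generating values are our own assumptions, not from the text):

```python
import numpy as np

# Simulated data (hypothetical values, for illustration only).
rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
y = 2.0 + 1.2 * x1 + 0.8 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def r_squared(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return 1.0 - (e @ e) / np.sum((y - y.mean()) ** 2)

# Residuals of x2 regressed on (1, x1): their variance plays the
# role of Var(Xj | X_-j).
X_minus = np.column_stack([np.ones(n), x1])
g, *_ = np.linalg.lstsq(X_minus, x2, rcond=None)
u = x2 - X_minus @ g

semi_partial = r_squared(X, y) - r_squared(X_minus, y)
assert np.isclose(semi_partial, np.var(u) * beta[2] ** 2 / np.var(y))
# ... and its square root matches sqrt(Var(Xj|X_-j)) |beta_j| / sqrt(Var(Y))
assert np.isclose(np.sqrt(semi_partial),
                  np.std(u) * abs(beta[2]) / np.std(y))
```

Note that both variances must be computed with the same normalization (here NumPy's default, dividing by n) for the identity to hold exactly.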

9.14.4 Forecasting vs intervention and regression coefficients interpretation
A forecast is only about information, it is about using what we know in order to say
something about what we do not know.
Sometimes, but this is just a particular case, the “information linkage” between
variables may have something to do with some causal (in any intuitive sense) connection
between variables: “if I know the cause I should know something about the effect”.
However, it is also true that if I know the “effect” I can say something about the
“cause” (just recall, again, about how a MD or a police detective works deriving from
symptoms/clues hypotheses on the medical condition/culpable).
In conditional expectations terms, forecasting has nothing to do with the “direction”
of such possible, but not necessary, causal connection: we can try and forecast the
“effect” given the “cause” or the “cause” given the “effect”. The point is: what do we
know, and what do we want to forecast.
Let us go back to the example concerning the speed of a car. Let us say that the
“true speed” of your car is that measured by a roadside Doppler radar while what we
can observe is the car’s tachometer. If we suppose that both tools are well calibrated,
and the conditions of the road reasonable, we expect both tools to give similar values to
“the speed of the car”. You can then use any of the two in order to “forecast” the other
and the forecast should be quite good (meaning: high R2 ). You choose which forecast
to use on the basis of available info. As the driver of the car, you may be interested
in the forecast you can make using your tachometer in order to avoid breaking speed
limits.
It is also clear that there is no “causal” connection between the two measures, at
least not in the sense that, by altering the reading of one of the two instruments you
can alter the other. If, for instance, you break the plastic on the instrument panel of
the car and stop the tachometer arrow with our finger (we suppose analogue dials) this
is NOT going to limit the scale of the radar measurement, and vice versa.
In this case you have very good forecasts, provided you do not mess with the
instruments. You can make forecasts conditioning both ways, according to what you
know. But such forecasts do not imply any causal connection, at least in the sense that
you could alter one measure controlling the other.
Since we know what is happening, an economist would say: “we know the structure
of the economy”, we know the reason for this. The two dials measure correlated phe-
nomena. The radar measures the speed of the car wrt the radar itself, the tachometer
measures the rolling rate of the tire. If the car is running on a reasonably non skidding
surface (not on ice) the two should be highly correlated, hence our forecast ability.
It is interesting to notice that, if we are only interested in forecasting, we may do
without such understanding and only suppose the informative relationship to be stable.
That is, to apply to repeated instances of the phenomenon at hand.

In principle, under stability, we may then forecast even if we do not “understand”,
in the sense that “we have no idea whence the correlation comes”. This informal idea
of “stability” has several names in Statistics. In a very simple and constrained sense,
it is called stationarity. The idea of i.i.d. random variables is a particular case of this.
More generally, the relevant idea is that of “ergodicity”, which is quite beyond what we
do in this course.
We can go further: it is clear, in an intuitive sense, that the rolling rate of the
tire “causes” both the tachometer measure (even on skidding surfaces) and the Doppler
radar measure (non skidding surfaces) in the sense that if I alter the rate, maybe acting
on the gas pedal or on the brake pedal, I expect the dial of the tachometer and of the
radar to move in a precise direction. It is also clear that this is not true in reverse: I
cannot speed up the car by moving any of the dials with my finger or other tools.
This notwithstanding, we are using the “effect” (the position of the dial) in order
to “forecast” the “value” of the “cause” (the rolling speed of the tire). This is perfectly
sensible and is going to work, obviously, if I do not tamper with the dial.
As mentioned in the introduction of this section, to use “effect” in order to “forecast”
“causes” is a quite common procedure. Consider a case where “cause” is in the past
while “effect” (as usual) is in the future.
The information we have about extinct living beings comes from their fossilized
remains which are available today.
We can say something about the shape and behaviour of extinct living beings
“conditioning” on the information we can derive from what today is a fossil.
However, in no sensible meaning do fossils “cause” the existence, in the past, of now
extinct living beings. I may destroy all fossils today: this would be very foolish, and it
would not alter the past of life on our planet. Maybe it would alter our understanding of
this past and this could be the objective of the (reasonably mad) fundamentalist/paleo-
terrorist involved in the destruction. This, however, is another story.
Once we understand that a forecast has only to do with an information linkage
between variables, under which there may or may not be a causal relationship, we may
consider a second point and try to shed more light on the difference between simple
forecasting and an attempt to intervene on the result of a phenomenon.
When you compute E(Y |X) for a given subject for which you observe X, a set of
variables, you simply put the observed X in the function φ(X) = E(Y |X) and get
your forecast. If X is made of many different measures, there is not much interest in
measuring the “contribution to the forecast” of each variable in X.
To make things simple, suppose Y is a single variable and X is made of X1 and
X2 , suppose E(Y |X1 , X2 ) = α + β1 X1 + β2 X2 where, for instance α = 0, β1 = 1 and
β2 = −1.
This simply means that, if in a subject you observe X1 = .4 and X2 = 1 your
forecast for Y in that unit is -.6, and if you observe, in another unit X1 = .4 and
X2 = 2 your forecast is -1.6.
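As a quick check, the two forecasts above are just evaluations of the same linear function with α = 0, β1 = 1, β2 = −1:

```python
def forecast(x1, x2, alpha=0.0, beta1=1.0, beta2=-1.0):
    """Conditional expectation E(Y|X1=x1, X2=x2) with the parameters in the text."""
    return alpha + beta1 * x1 + beta2 * x2

print(forecast(0.4, 1))  # -0.6
print(forecast(0.4, 2))  # -1.6
```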
You can surely say that, if you consider two different subjects with the same, say,
X1 and two different values of X2, where the difference between the two values is 1, the
two forecasts shall differ by β2, that is by -1. But this cannot be read in the sense that,
if in a unit you “change” the value of X2, increasing it by 1, then you are going to get
a forecast -1 smaller than before.
This is wrong for many reasons. Consider the tachometer example above: let us say
that to change X2 means to alter the position of the dial with your finger. It would be
foolish, in this case, to use as forecast function the conditional expectation computed
by observing data where the dial is not tampered with. If I actually did the experiment
of comparing the speed as measured by the radar with that of the non tampered and
tampered tachometer, I would see that, while a change in the untampered tachometer
dial corresponds (with some approximation) to a change in speed as measured by the
radar, this does not happen if the dial position is altered because it was tampered with.
This is totally obvious but contains a very important teaching: there are ways in
which, if I “change” some variable in the system I observe, the informative role of such
altered variable changes wrt the role it has in the untampered system.
For this reason, by itself, the observation of the untampered system, while useful for
forecasting, cannot tell me anything, in principle, about the ”effect” of me tampering
with the system.
If our notion of “cause” is based, as usually is in Economics and Finance, on the
idea of “intervention”, we can simply say that forecast on the basis of information (be
it made using regression functions or other tools) in general tells us nothing about any
causal relationship.
This is even more evident when, as in the case of fossils, we observe X a long time
after Y did happen. While the observation of a fossil fish may be the indication that,
in the past, a sea existed where the fossil was found, while the observation of a fossil
sloth would imply that, in the past, some forest/savannah environment existed where
the fossil is found, if I swap the fossils I cannot expect to alter the past environments.
Again, this is obvious and, being obvious it should be always in the mind of any
researcher using regression.
We conclude this (too short, given the relevance of the topic) section by quoting a
wonderfully strong, precise, effective and simple passage from what could be considered
the best ever book on linear models and their interpretation: Frederick Mosteller and
John W. Tukey, “Data analysis and regression: a second course in statistics”, Addison-
Wesley 1977.
The quotation is from the masterful Chapter 13 “Woes of Regression Coefficients”
whose Goethean title is, probably, not chosen by chance, as it has to do with the
impossibility of really making oneself known and understood. In what follows the
comments written in square brackets [...] are ours.
“We have been careful to point out using x and x2
[in a regression of y on x and x2 ]
that it does not generally make sense to try to interpret the coefficients of x in terms
of what "would happen if the other x’s were held constant"
[try and keep x2 constant when changing x and viceversa].
In this section, we try to go ahead a little, sounding a few of the most necessary
warnings.
Polynomial fits. When it comes to fitting polynomials, whether as simple as
b1 x + b2 x2 or as complex as b0 + b1 x + b2 x2 + b3 x3 + b4 x4 + b5 x5 , it rarely pays to try to
interpret coefficients. Pictures of the fits--or of the difference in two fits to two sets of
data--can be very helpful, but the coefficients themselves are rarely worth a hard look.
Unrelated x’s. If the x’s are not closely related, either functionally or statistically,
we may be able to get away with interpreting bi as the "effect of xi changing while the
other x’s keep their same values."
[this is the case where the coefficients for regressions on single variables equate or
almost equate the coefficients for the same variables in the full regression]
If we want to tap expert judgment about the value of bi , some set of words like those
in quotes may be the best we can use.
[it is difficult, even for experts, to have opinions about the coefficient of a given
variable conditional on different sets of other variables. Opinions shall usually apply to
univariate regression coefficients]
In practical or policy situations, however, we need to recognize how large a difference
there can be between:
1. xi changing while the other x’s are not otherwise disturbed or clamped,
and
2. changing xi while holding the other x’s fast.
[we should add: 3. observing the different values of an x in nature. Here this is
subsumed by 1.]
Such differences are not only possible but likely in social and economic problems,
because the x’s we are working with there are usually neither the most fundamental
variables in the situation nor the complete set of variables.
[i.e. they depend on a full substrate of, generally unobservable, variables which
freely act “in nature”]
Consider the example of performance on tests of cognitive achievement as related
to parents’ education, socioeconomic status, and years of schooling. Note that we have
no measures of innate intelligence, attention paid in school, parental or teachers’ en-
couragement, or even hours spent on the subject matter being tested, to say nothing of
physical handicaps.
Our regression works so long as x’s and y’s together are driven by the fundamental
variables acting as they had acted, at particular times and places, before we collected
the data.
[that is: if we do not “act”]
If we interfere with their activity, we are likely to change the regression and find
that the effect of the change cannot be predicted by either of the two regressions listed
above.
Holding all but one fixed and changing that one, if we can indeed do this, is likely
to interfere with the underlying pattern of variability and covariability, thus changing
the regression. In most policy situations this danger is very real.
[what is useful in forecasting, could be useless in deciding an action]
Experiments, Closed Systems, and Physical versus Social Sciences
George Box has (almost) said (1966): "The only way to find out what will happen
when a complex system is disturbed is to disturb the system, not merely to observe it
passively."
These words of caution about "natural experiments" are uncomfortably strong. Yet
in today’s world we see no alternative to accepting them as, if anything, too weak.
Regression is probably the most powerful technique we have for analyzing data. Cor-
respondingly, it often seems to tell us more of what we want to know than our data
possibly could provide. Such seemings are, of course, wholly misleading.
Some examples of what can happen may help us all to understand Box’s point, which
covers these examples as well as many others.
First, suppose that what we would like to do is measure people (or items) in a
population and use the regression coefficients to assess how much a unit change in
a background variable (say x1 ) will change a response variable (say y). Since the
regression coefficient of x1 depends upon what other variables are used in the forecast,
we cannot hope to buy the information about the quantitative effect of x1 so cheaply.
These remarks do not deny the potential use of forecasting the value of y from several
variables x1 , x2 , and so on, in the population as it now exists.
[again: one thing is a forecast for a left to itself phenomenon, another a forecast of
the effect of an action]
What they do cast grave doubts on is the use to forecast a change in y when x1
is changed for an individual (class, city, state, country), without verification from a
controlled trial making such a change.
(Strictly speaking, but unrealistically and impractically, if we want to verify what
happens when only x1 changes, the controlled trial should be made so that x1 changes
and there is no chance for the other variables to change the way they naturally would
when the underlying variables are manipulated to change x1 . This sort of study may
not be feasible, and it may not yield what we need to know. We ordinarily want to know
what will actually happen when we change x1 )
[that is: when we change x1 but cannot control the other variables. This is impor-
tant if we want, for instance, compare the effect of a medical treatment as measured
in a controlled experiment with the “real world” effect where no control, e.g. double
blind treatment/control, is possible]
When such issues are raised, proponents of observational studies plus regression
analysis are likely to cite the physical sciences for illustrations of the success of the
method. The idea that such regression-as-measurement methods are successful in the
physical sciences is seriously misleading for a variety of reasons.
First, because so many physical-science applications of regression-as-measurement
are to experimental data.
And second, because the relatively few useful applications that remain involve sys-
tems in which "the variables" are:
- few in number,
- well clarified,
and
- measured with small error”.
If well understood and kept in mind, this passage is more than enough, jointly
with the above analysis, to avoid pitfalls and produce a reasonable interpretation of
regression results.
A curiosity. Notice the point about interference with the phenomenon and the
change of regression function.
This topic, quite obvious once we think about it, was fully known to (good) statis-
ticians decades before its “reinvention” by Robert Lucas in the field of Economics with
his “Lucas critique”, where he describes a particular setting where the warning you read
above shall apply.

9.14.5 Enter: Statistics. “Statistically significant” vs relevant


We have offered some hint of how to interpret a regression when parameters are known.
In all standard applications, β is not known and must be estimated.
This is the fourth point in interpreting results we mentioned above.
What changes? Actually not much: the problem is only to quantify how much we
really know about β, since we can only estimate it, and it is easy to do this by studying
its sampling variability.
Although this should be simple (and in fact it is simple) it may create some new
interpretation problems we discussed above, when we compared “statistical significance”
with “relevance”.
It is an empirical observation that, while the pitfalls and interpretation errors we
are going to summarize here are warned against in most good books of Statistics,
this common advice seems to work in some empirical fields while it is almost of no
consequence in others.
This point is well summarized in a fragment of the long quotation we made from
Mosteller and Tukey’s book:
“Regression is probably the most powerful technique we have for analyzing data.
Correspondingly, it often seems to tell us more of what we want to know than our data
possibly could provide. Such seemings are, of course, wholly misleading”.
It is usually the case that the effect of such warnings is bigger when empirical
analysis has real practical purpose, hence, it is strongly constrained by its practical
implications, and smaller when the main reason for empirical analysis is more “paper
publishing for the sake of it” oriented: a current fad wrongly mistaken by some for
science.
In empirical Economics and Finance the main misunderstanding has to do with the
concept of “statistical significance”.
Technically: the estimate of a parameter βj is “significant at a given level α” if it
falls within a size α rejection region for a test whose null hypothesis is H0 : βj = 0.
This is often stated, in really imprecise words, as: a parameter is “statistically sig-
nificant” if its estimated value, compared with its sampling standard deviation, makes
it unlikely that in other samples the estimate may change sign.
In the standard regression setting, the most frequently used statistical index is the
T−ratio, and an estimated βj has a “significance” which is usually measured in terms
of the P−value of its T−ratio.
(We repeat for reference: the P−value of a test is the α of a critical region whose
boundary corresponds with the observed value of the test. So, for instance, if you
observe a t−ratio of, say, -3, you need to compute the probability that a sample gives
you a t−ratio outside the interval between -3 and +3, computed under the null hypothesis
that βj = 0. Notice the “two tailed” region, which corresponds to the fact that, as a rule,
the alternative hypothesis is βj ≠ 0, that is: a two tailed hypothesis.)
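Under the Gaussian approximation to the null distribution of the t−ratio (an assumption, reasonable when n − k is large), the two tailed P−value for an observed t−ratio of −3 can be computed directly:

```python
import math

def two_tailed_p(t):
    """Two tailed P-value under H0: beta_j = 0, Gaussian approximation of the t-ratio."""
    phi = 0.5 * (1.0 + math.erf(-abs(t) / math.sqrt(2)))  # P(Z <= -|t|)
    return 2.0 * phi

print(round(two_tailed_p(-3), 4))  # about 0.0027
```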
Does a small P −value imply that a parameter is “relevant” in any sense, except
the fact that you observed a value outside the interval of values which are likely to be
observed under the null hypothesis?
As already discussed, the answer is “absolutely not”.
We already commented on this when considering the semi partial R2. There is an
even more striking way to present the point: suppose the parameter is known and is
different from zero (so that its P−value is 0: it cannot be more significant than this!);
still, the actual relevance of the corresponding regressor could be absolutely negligible if the
semi partial R2 is small. Here, by relevance, we mean the ability of the corresponding
Xj to “explain” an amount of variance of Y (improve the forecast) which is big w.r.t.
the total variance of Y .
“Statistically significant” only means, this statement is approximate but justifiable,
that the statistical quality (precision) of the estimate is such that the estimate should
not change sign if we change the sample.
In iid samples, if n is big, typically all parameter estimates become “statistically
significant”.
This is because the sampling standard deviation decreases at speed 1/√n, so that even
a practically negligible βj can be estimated with enough precision to allow us to
distinguish it from zero.
In no way does this imply βj to be “relevant” in any practical sense. What happens here
is that, with n big enough, we can reliably assess that an irrelevant effect is actually
not 0 but still irrelevant.
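The point can be illustrated with a small simulation (all numbers invented): a practically negligible slope, estimated on a big sample, yields a t−ratio far beyond any conventional significance threshold while the R2 stays at a small fraction of one percent.

```python
import math
import random

random.seed(1)
n, beta = 100_000, 0.02            # negligible slope, big sample
x = [random.gauss(0, 1) for _ in range(n)]
y = [beta * xi + random.gauss(0, 1) for xi in x]

mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
syy = sum((yi - my) ** 2 for yi in y)

b = sxy / sxx                      # OLS slope estimate
resid_var = (syy - b * sxy) / (n - 2)
t = b / math.sqrt(resid_var / sxx) # t-ratio of the slope
r2 = b * sxy / syy                 # R-squared

print(round(t, 1), round(r2, 4))   # "significant" t-ratio, negligible R2
```

The t−ratio is reliably above 3 (the effect is distinguishable from zero), yet the variable explains a tiny share of the variance of Y: significant, irrelevant.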
It is frequent to see published papers in major journals where linear models with
tens of regressors and tens of thousands of observations result in statistically significant
coefficients with an overall R2 in the range of a few percentage points and semi partial
R2 of fractions of 1%. Whatever the notion of “relevance” (forecasting, always available,
or causal, requiring many hypotheses), it is difficult to conceive of any practical setting
where such results could be termed “relevant”, except that they give relevant support
to the statements of “irrelevance” of the corresponding effects.
This would not be so important if the same papers did not spend most of their
length discussing the meanings and the practical relevance of the effects supposedly
“found”.
This misunderstanding between “statistical significance” and “relevance” must be
avoided. If models were used for practical purposes (say for forecasting or controlling
variables) the misunderstanding would quickly disappear: an estimate can be as sig-
nificant as I like but, if the R2 is small, the quality of the forecast shall be awful all
the same.
When models are only used for academic purposes (appear in published papers) the
misunderstanding may continue unscathed, sometimes with hilarious consequences.
Now go back to the semi partial R2 and consider, again, how this depends on the
square of the t−ratio (numerator) BUT ALSO on the number of degrees of freedom
(n − k). This clearly suggests the nature of the problem discussed here. A “highly
significant t−ratio”, say of absolute value greater than 3, shall NOT correspond to a
high contribution to the quality of the forecast if the sample is big. In other words, in
big samples you may expect to observe “big and significant” t−ratios corresponding to
practically irrelevant (from the point of view of forecasting accuracy) variables.
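The arithmetic, with invented but typical “big sample” numbers: a t−ratio of 3 in a regression with R2 = .05, n = 10000 and k = 20 corresponds to a semi partial R2 well below one tenth of one percent.

```python
def semi_partial_r2(t, r2, n, k):
    """Semi partial R2 implied by a t-ratio: t^2 (1 - R2) / (n - k)."""
    return t ** 2 * (1 - r2) / (n - k)

print(semi_partial_r2(3, 0.05, 10_000, 20))  # about 0.00086
```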
Summary: first assess the relevance of the regression and the parameters of interest
in terms of explained variance as if parameters were known and not estimated. Then
look at the statistical stability of the results. An irrelevant parameter is still irrelevant if
it is “significant” while a parameter which could be relevant can be put under discussion
if its sampling variance is too big (this usually happens if the sample is small).

9.14.6 A golfing example


In the following example we try to determine how much of the average per tournament
gain of the most important competitors in the 2004 PGA events can be captured by a
linear regression on some ability indexes and other possibly relevant variables.
The dependent variable: AveWng, is the average gain.
The columns of X are:
a constant,
Age=the age of the player
AveDrv=the average drive length in yards
DrvAcc=the percentage of drives to the fairway
GrnReg=the percentage of times the player reaches the green in the “regular” num-
ber of strokes
AvePutt=the average number of putts per hole (should be less than 2)
SavePct=the percentage of saved points
Events=the number of events the player competed in
This dataset is an observational dataset: in the same time frame, 2004, we observe
the money results (average winnings) of a set of players and their gaming abilities
according to some indexes.
Indexes and results differ between players because players are different.
We are not observing the same players over time to see how changes in their
characteristics are correlated with changes in results, and we are not observing, say, a
set of players randomly assigned to a treatment or a placebo group, where the treatment
may be an increase of training for this or that aspect of the game.
For this reason, the analysis of this regression only has a forecasting purpose: sup-
pose you randomly draw a player from the population whence these players come.
If you know the game characteristics of this player, the model allows you to forecast
his overall result.
Players can obviously act, in many ways, in order to change their game character-
istics. However, there is no guarantee that, if a player is able, say by training, tactics,
control etc., to reproduce the characteristics already present in another player, the best
forecast for the results of the first player shall be the result (or the forecast of the
result) of the second.
In the microcosm of golfing, we have something similar to economics: what we ob-
serve are equilibria where many different components are required in a precise amount
in order to yield a result. In such cases, altering the equilibrium mix can have com-
pletely unforecastable effects. If you double the flour in your cake, you do not get
double the cake, you get a mess. Many players destroyed their game just because they
tried to “improve” this or that aspect of their game and, by doing so, they broke the
equilibrium that gave them their results.
This has very practical consequences in reading the results. For instance: be careful
in assessing your expectations for, say, the sign of the parameter estimates we shall
observe. If your expectation comes from the knowledge you have about which training
is more likely to improve your results, such expectations can be irrelevant for this
model.
Instead, if your expectations come from your experience about the characteristics
of the player which, in the past, got more money from the tour, these could be relevant
for this model.
So, for instance, if you say that it is reasonable to assume that expected average
money is positively correlated with AveDrv, DrvAcc, GrnReg and SavePct, and nega-
tively with AvePutt, you should understand well where these opinions of yours come
from. Moreover, you should understand how these opinions, as a rule, are connected
with bivariate correlations, NOT with conditional correlations, while the regression
model parameters derive from conditional correlations (see the point on experts in the
Mosteller Tukey quote).
We understand that it is tempting to try and answer a very reasonable question:
“if a player trains and improves some aspects of the quality of his game, how much
more money could he expect to make?”
This is the “comparison of expectation after intervention” we mentioned with the
name “causal” analysis or “intervention” analysis.
It is important to state that, if we do not make further hypotheses, this dataset
does not in any case allow us to answer this question, undoubtedly very interesting
for golfers.
Let us start with some descriptive statistics and a simple correlation matrix:

From this correlation matrix we see that at least one of our expectations is appar-
ently not true: correlation with driving accuracy is negative. However we also see that,
and this could be expected, correlation between AveDrv and DrvAcc is rather strong
and negative (longer means riskier).
We’ll see that this has an interesting implication on the overall regression.
Let us now run the regression:

Do not jump to the coefficient estimates!


First read the overall R-square. It is .45, that is: 45% of Y variance is due to its
regression on X. It is important to keep this in mind: anything we’ll further say about
coefficients, partial R square etc. lies within this percentage. More than 50% of the
dependent variable variance is not “forecasted” by its regression on X.
Another way to read the same result is to compute a confidence interval for the
(point) forecasts. As seen above this interval is given by
xf β̂OLS ± z(1−α/2) σ √(1 + xf (X'X)^(−1) xf')

It could be shown that, for n − k not too small, X'X with a determinant not too near
to 0, and an xf not “too far” from the observed rows of X, this can be approximated
by
xf β̂OLS ± z(1−α/2) σ
Under the same hypotheses we can freely put σ̂ in the place of σ and still use the
Gaussian in place of the T distribution. With this approximation, the point forecast
interval is the same for each xf and its width is 2z(1−α/2) σ̂. If we allow for the plus/minus
two sigma rule this, with our data, becomes 4 times 41432 or, forecast plus/minus 2
times 41432. If we stick to the Gaussian hypothesis (or believe a central limit theorem
can be applied in our case) this interval should contain the true value of yf with a
probability of more than 95%.
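The exact point forecast interval is easy to compute with matrix tools. Since the golf dataset is not reproduced in these notes, here is a minimal sketch on simulated data (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

# OLS estimates and residual standard deviation.
XtX_inv = np.linalg.inv(X.T @ X)
b_ols = XtX_inv @ X.T @ y
resid = y - X @ b_ols
sigma_hat = np.sqrt(resid @ resid / (n - k))

xf = np.array([1.0, 0.3, -0.2])   # new point to forecast
z = 1.96                          # Gaussian 95% quantile
half = z * sigma_hat * np.sqrt(1 + xf @ XtX_inv @ xf)
point = xf @ b_ols
print(point - half, point + half) # forecast interval for y_f
```

Note that the `1` under the square root is the residual part of the forecast error: it dominates the `xf @ XtX_inv @ xf` term when n is large, which is why the interval width is nearly constant in xf.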
If we go back to the descriptive statistics, we see that the standard deviation of
AveWng is 54990. This means that, without regression, our forecast would be the same
for each observation and equal to the average (46548) and the corresponding forecast
interval would be 46548 plus/minus 2 times 54990. With the regression our forecast
is xf β̂OLS , so it varies with the observations on xf , and this variability “captured” by
the regression, is “subtracted” from the marginal standard deviation so that the point
forecast interval shall be narrower: the point forecast plus/minus 2 times 41432.
You should notice that, with an R2 of about .45, the width of the forecast interval
is reduced by only about one quarter. This is not surprising: the R2 is in terms of variance
while the interval is in terms of standard deviations. Variances (explained and unex-
plained by the regression) sum, standard deviations do not (the square root of a sum
is not the sum of the square roots). For this reason the term “subtracted” above is put
under quotes.
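The arithmetic, using the two numbers from the output: the residual standard deviation is about 75% of the marginal one, so the interval width shrinks by roughly one quarter even though, in variance terms, the reduction is the R2, about 45%.

```python
sd_y, sigma_hat = 54990.0, 41432.0   # numbers from the regression output above
ratio = sigma_hat / sd_y             # about 0.75: interval widths are in sd units
print(round(1 - ratio, 3))           # width reduction, roughly 0.25
print(round(1 - ratio ** 2, 3))      # variance reduction, roughly the R2
```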
We may then question the statistical precision of our estimates, in particular the
statistical precision of our R2 estimate. In the output we do not have a specific test
for this but we have something which is largely equivalent: the F−test table.
The F −tables imply rejection of the null hypothesis that there is no regression
effect, meaning: all parameters are jointly equal to zero with the possible exception of
the intercept.
Notice that, with few observations, even a sizable R2, like the one we got for this
model, could fully be due to randomness. The F−test tells us that this does not
seem to be the case. This is not a direct “evaluation” of the statistical precision of our
R2 estimate. However, implicitly, since there exists a direct link between the value of the
F−test and R2, it tells us that an estimate of R2 like the one we found is very unlikely
if there is no regression effect.
From the point of view of forecasting, this is all. We may like the results or not, but
this is what we find in the data and, if we just suppose some “stability” of the model
(see the comments above), this is the precision of the forecast we can make.
What follows can be seen as an “anatomy” of the forecast in terms of each column
of X. This can be useful for forecasting use but, obviously, it is much more relevant if
the setting is such that we are able to hold a causal interpretation of the regression.
If we go to the last column of the regression output (we added this to the standard
Excel output) we find the semi partial R squares. We see that only three variables have
a sizable marginal contribution on R2 as measured by their semi partial R2 : GrnReg,
AvePutt and Events. This means that these are the variables whose addition to X
most improves the forecast. Can we go a little bit further and say that, barring the
Events variable, on which we shall comment further, an increase of GrnReg and a decrease
of AvePutt are the aspects of the game that, if improved, would imply a greater and
more reliable increase in AveWng?
This is a causal interpretation: is it reasonable in our setting? We cannot exclude
this; we can only say it is very unlikely to hold.
Why? We repeat: the data is the summary of a season. It describes a set of “ability”
indicators for each player and some other variable.
Let us concentrate on the abilities.
Let us take, for instance, Age. This is a typical variable you cannot intervene on.
This notwithstanding, the variable changes in time. The possible causal interpre-
tation would then be: each year the conditional expected value of gains goes down by
almost 600 dollars. Is this the “effect” of age?
Even if we do not consider that what we observe is a cross section of players and not
a time series of results for a single player (so that we may observe the action of Age),
we must answer “beware”. If a causal interpretation was possible, the β of Age times a
change of Age would be the expected change in AveWng if all the other variables are
constant, that is: if only Age acts and the golfer’s abilities as expressed by the other
variables do not change.
Is it reasonable that the natural evolution of Age does not change the other abilities
of any given player?
Quite unlikely. In any case this point should be assessed by theory and empirical
data in order for a causal interpretation to be possible (the methods for doing this are
not object of this course).
Let us now consider a variable on which we can think we could “act”: AvePutt.
We cannot arbitrarily set this to a lower number (compare this with “changing
interest rates”) but we may conceive of increasing the time dedicated to putting green
training. If this reduces the number of putts, even by just 1/100, we should improve
(change of the conditional expected value) our wins “on average” by almost 700 dollars
(69000 times 0.01).
Is this the “effect” we can expect? It depends. Golfing is an equilibrium game.
What counts is the overall result and trying to improve a part of the game may have
bad results on other parts of the game.
By training more on the green maybe we worsen (or maybe improve?) our game
under other points of view: length, precision from distance etc.
Moreover: the model was estimated on a sample of players with a given “equilibrium
mix” of abilities.
Is it still going to be valid if we alter such characteristics? Again: we do not
know this and, with no answer, any attempt to use the model in this sense would be
unwarranted.
Notice that here we hinted at three different problems, the same problems we
hinted at a number of times above.
The first is that it can be difficult or impossible to act on an Xj while, conversely,
some Xj is bound to change by itself.
The second is that it may be difficult to intervene on one Xj without altering other
Xj -s and, if this happens, we should model this interaction to have an idea about the
“effect” on the dependent variable.
The third is that any action on one or more Xj could alter the conditional expec-
tation itself and we should model this alteration.
All these problems have been discussed and are being discussed by econometricians.
In fact, as we mentioned, these problems are at the origin of Econometrics and are still
the central problem of Econometrics: what makes Econometrics sister but not a twin
sister of Statistics.
Following the general approach of this section we do not further develop the “causal”
discussion and, for a moment, heroically suppose that we can improve our AvePutt
decreasing it without altering other variables or the regression function itself.
Is it reasonable to assume an improvement of 1/100 if we suppose this does not alter
the other indicators? Since we see that AvePutt is correlated with the other variables,
this cannot be more than an approximation. However, if this correlation is not too big,
it may be that the reasonable values of AvePutt, conditional on the other variables
being constant, have a standard deviation big enough to allow for “changes” of 1/100
in AvePutt.
The marginal standard deviation of AvePutt is about .023. Notice that this is a
standard deviation across the players, so it does not directly concern our problem.
1/100 is less than one half of the marginal standard deviation of AvePutt. This
means that it is quite easy to find different players with such a difference in this statistic.
With a little bit of unwarranted logic, let us assume that this is also true for the single
player, if we do not condition on his other statistics.
This is the crucial point: both for different players and for the single player we
must recall that we are within a regression and that we are evaluating the possibility
of changing AvePutt by 1/100 while the other variables do not change.
This means that, as stated above, we must consider the conditional standard devi-
ation not the marginal standard deviation.
Recall the formula

√(tj²(1 − R²)/(n − k)) = |β̂j| √Var(Xj|X−j) / √Var(Y)
Using the data in the output we find that the standard deviation of AvePutt (our
Xj) conditional on the other variables (X−j) is .021, obviously smaller than the uncon-
ditional standard deviation but still more than twice the hypothesized change of .01.
This implies that, even conditionally to the other variables, different players (and
maybe the same player) still could easily show such different values of AvePutt.
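The conditional standard deviation used in this argument can be computed directly from the auxiliary regression of Xj on the other columns of X. A minimal sketch (the golf data are not reproduced here, so simulated stand-in variables with hypothetical names are used):

```python
# Sketch: conditional standard deviation of one regressor given the others.
# The golf data are not available here, so we simulate stand-in variables;
# names (ave_putt, grn_reg) and numbers are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 200
grn_reg = rng.normal(0.65, 0.05, n)                        # hypothetical regressor
ave_putt = 1.78 - 0.1 * grn_reg + rng.normal(0, 0.02, n)   # correlated with grn_reg
X_others = np.column_stack([np.ones(n), grn_reg])

# Regress ave_putt on the other explanatory variables and keep the residuals:
coef = np.linalg.lstsq(X_others, ave_putt, rcond=None)[0]
resid = ave_putt - X_others @ coef

marginal_sd = ave_putt.std(ddof=1)
conditional_sd = resid.std(ddof=1)   # sd of ave_putt given the other columns of X
print(marginal_sd, conditional_sd)   # the conditional sd cannot exceed the marginal sd
```

The residual standard deviation is exactly the “conditional standard deviation of Xj given X−j” used in the text.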
For the above mentioned reasons, this does not justify, by itself, a causal interpre-
tation. However if such an interpretation were available, an expected effect of the size
of 700 dollars (69000 times .01) or even more would not be unreasonable.
On the other hand an improvement of, say, .04 in AvePutt would probably be
unlikely both marginally and, what is more important for us, conditionally on given
values of the other X−j.
If this causal analysis is viable, then, we may expect that a work on the putting
green which does not alter the rest of “the game” could give a golfer a reasonable
improvement of 700 dollars in the AveWng (roughly 1.5% of AveWng).
Let us now consider other aspects of the estimates.
A possibly puzzling point is given by the signs of the coefficients of AveDrv and
DrvAcc, which are both negative.
The semi partial R2 of AveDrv is almost 0 while that of DrvAcc is a little more than
2%.
In most practical contexts we could then avoid discussing the estimates of the pa-
rameters for these variables.
As an exercise, however, let us try to use what we know about partial regressions to
unravel the puzzle. We anticipate that most of the puzzle comes from confusing
bivariate or “marginal” correlation with conditional correlation.
Begin by comparing the simple correlations with AveWng and the signs of the
parameters estimated in the linear regression. Notice that the sign of the parameter
for DrvAcc is the same as its correlation with the dependent variable, while the
parameter for AveDrv is negative even though its correlation is positive.
A negative simple correlation between AveWng and DrvAcc may not be surprising
and we may try an explanation, which, as always in these cases, is implicitly based on
some causal interpretation of the parameters.
The possible interpretation is this: it could simply be that, in order to be precise
with the drive, a player tends to be too cautious and this may harm his overall result.
There are many alternatives to this interpretation, each depending on some strand
of causal reading of the parameters. The choice among these depends on further and
more complex analysis and on more structured hypotheses about how the performance
of the golfer is connected to each of the statistics.
For a forecasting interpretation this is irrelevant.
Now let us consider AveDrv: the correlation of this variable with AveWng is positive
and not small, while the regression coefficient estimate is negative and the semi partial
R² is virtually 0 (much smaller than the same statistic for DrvAcc, whose correlation
with AveWng, in absolute value, was roughly 1/2 of that of AveDrv).
To understand what is happening, here we see the result of the regression of AveDrv
on the other columns of the X matrix:
60% of the variance of AveDrv is captured by its regression on the other columns
of X. More than one half of this (38% semi partial R²) has to do with its negative
dependence on DrvAcc.
Also GrnReg shows a sizable semi partial R² (14%) and a positive regressive depen-
dence.
As we know, only what is left as residual of this regression is involved in the
estimation of the AveDrv parameter in the original regression. This is the part of AveDrv
variance which is not correlated with DrvAcc and GrnReg (and the other variables in
the partial regression).
We know that GrnReg is the single most important variable in the overall regression
(in the sense that it shows the highest semi partial R2 ).
Based on this we may attempt an interpretation (again: many are possible): the
“equilibrium player” represented by the regression tends to have a higher AveWng if the
percentage in GrnReg is higher. On the other hand, a higher percentage of GrnReg
tends to imply a bigger AveDrv. For this reason, marginally, AveDrv is positively
correlated with AveWng. However the part of AveDrv in excess of what is correlated
with GrnReg seems to be harmful to the overall game and, from this, the negative
coefficient in the overall model.
Now compute, as we did above, the conditional standard deviation of AveDrv.
According to our formula this is equal to √0.0000812 × 54990/94.76 = 5.23, to be
compared with a marginal standard deviation of 8.27. If we hypothesize a change in
this variable (conditional on the other columns of X) equivalent to that hypothesized
above for AvePutt (less than 1/2 of its conditional standard deviation), that is, equal
to 2.5, the overall expected “effect” should be a decrease of AveWng of roughly 200
dollars. You would need a very big change of twice the conditional standard deviation
(about 10) to have a negative effect comparable with an AvePutt change of 1/2
conditional standard deviation.
Again a matter of care: these evaluations are borderline causal!
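The 5.23 figure above can be checked with a one-line computation. Following the formula recalled earlier, the conditional standard deviation is the square root of the semi partial R² times the standard deviation of Y divided by |β̂j| (the three numbers come from the regression output quoted in the text):

```python
# Check of the conditional sd of AveDrv quoted in the text:
# sd(Xj | X-j) = sqrt(semi partial R^2) * sd(Y) / |beta_j|
import math

cond_sd = math.sqrt(0.0000812) * 54990 / 94.76
print(round(cond_sd, 2))  # 5.23, against a marginal sd of 8.27
```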
In the end, what would be the most proper use of such a regression?
Suppose you want to bet on how much, on average, a randomly chosen player is going
to win. You know the characteristics of the player; you are betting on the results.
The estimated regression would be a nice starting point.
Now change players into stocks, winnings into returns and use market returns, price
to book value, size, and so on as indicators as in the Fama and French model or in the
style analysis model. In which stock would you invest? To which fund manager would
you give your money?
These are clearly relevant questions and the regression model would be fit for them
even without any causal interpretation.

9.14.7 Big/small partial R2 and relevance, a critical view


A last relevant consideration: as we have seen, in order for a variable to “explain” a
big chunk of the dependent variable variance it is necessary (but not sufficient) that this
variable has some variance left when regressed on the other explanatory variables.
This is a rather generic statement and a more precise analysis should be carried out
case by case.
Now we must also stress that the analysis considered here takes as “given” the
joint distribution of X and, by consequence, the conditional distribution of each column
of X given the other columns.
In this setting, an analysis of the “relevance” of a variable in a regression based on the
semi partial R2 is well justified. This seems the most relevant case in an observational
setting as the setting most common in applications in Economics and Finance.
Suppose, on the other hand, that there is the possibility that, keeping constant
the regression function E(Y|X) = Xβ, the joint distribution of Y and X may change
(perhaps because it is “acted on” by some policy decision or simply because of new
circumstances). In this case the overall R² and each semi partial R² would, in general,
change.
For simplicity, just consider the “univariate” model y_i = α + βx_i + ε_i, under the
hypothesis E(y_i|x_i) = α + x_iβ. In this case the R² and the semi partial R² of x are the
same and equal R² = β²Var(x)/(β²Var(x) + Var(ε)).
To fix the ideas, suppose β = .5, Var(x) = 1 and Var(ε) = 10. In this case
R² = .25/10.25 ≈ 0.024. Whatever the interpretation of the regression (forecast or
causal) it seems that the role of x, while existing (we know that β is not 0), is not so
relevant (at least in terms of a “good fit”).
But suppose that, with no change in the regression, either the variance of x becomes
higher or the variance of ε decreases, or both; that is: suppose the joint distribution of
y and x changes, for some reason³². For instance, suppose Var(x) = 100. If all the rest
remains unchanged the new R² shall be equal to 25/35 ≈ 0.71.
In this case the contribution of x to the quality of the forecast of y becomes con-
siderable.
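The arithmetic of the two scenarios can be written down explicitly:

```python
# R^2 in the univariate model of the text:
# R^2 = beta^2 Var(x) / (beta^2 Var(x) + Var(eps))
def r_squared(beta, var_x, var_eps):
    explained = beta ** 2 * var_x
    return explained / (explained + var_eps)

print(round(r_squared(0.5, 1, 10), 3))    # 0.024: x looks almost irrelevant
print(round(r_squared(0.5, 100, 10), 2))  # 0.71: same beta, very different "fit"
```

The β is identical in the two cases; only the joint distribution of y and x has changed.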
Is this reasonable, is it relevant? For instance: in an observational setting, can we
suppose that the data we use for the forecast are so different w.r.t. those used for the
estimation? In a causal setting: is such a big alteration of the behaviour of x possible
(and, in a multivariate regression: is such a change possible CONDITIONAL on the
other columns of X)?
This, obviously, cannot be assessed in general and can only be evaluated on a case
by case basis.
The important point is, again, to fully understand that even a simple and standard
method like linear regression can never be dealt with in a ritual/cookbook way. Only
a full understanding of the method and of the circumstances of its specific application
can (and does) yield useful results. Failing this, its use can only be understood as a
kowtow to pseudo-scientific ritualism or, worse, misleading rhetoric.
Let us consider a case in which a variable may have a “relevant effect” even if it
does NOT explain a big chunk of the dependent variable variance.
Suppose for instance that you have a dataset where the observations are the heights
of a population of adult men and women. The sample is very unbalanced and it
contains, say, 1000 men and 20 women. For this reason most of the observed variance
in height shall be due to variance across men. If we regress heights on a constant
and a dummy which is equal to 1 if the subject is a woman we shall find, in all
likelihood, a statistically significant negative parameter for the dummy (something like
-10 centimeters) but an almost zero R². This does not mean that the difference in
height between men and women is irrelevant (it is relevant), but that, due to the fact
that most of the sample is made of men, this difference does not explain a big chunk
of the variance of THIS sample: most of the variance in this sample is not due to sex,
but to variance in height among males.
Now, suppose you apply this result to a balanced sample where 50% of the subjects
are women and 50% are men. In this new sample most of the variance shall come from
the different sex. In other words: we do forecast in a setting where the distribution of
X is quite different w.r.t. that valid for the estimation sample.
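The unbalanced-sample phenomenon is easy to reproduce by simulation; the numbers below (1000 men, 20 women, roughly a 10 cm mean difference) mirror the example and are purely illustrative:

```python
# Unbalanced sample: a sizable, significant "effect" with an almost zero R^2.
import numpy as np

rng = np.random.default_rng(1)
men = rng.normal(176, 7, 1000)     # illustrative heights in cm
women = rng.normal(166, 7, 20)     # about 10 cm shorter on average
height = np.concatenate([men, women])
female = np.concatenate([np.zeros(1000), np.ones(20)])

X = np.column_stack([np.ones(height.size), female])
beta = np.linalg.lstsq(X, height, rcond=None)[0]
fitted = X @ beta
r2 = 1 - ((height - fitted) ** 2).sum() / ((height - height.mean()) ** 2).sum()

print(beta[1])  # close to -10: the "effect" is sizable...
print(r2)       # ...but R^2 is tiny, because women are only 2% of the sample
```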
More in general: it may be that the role of a variable, in a forecast or, if reasonable,
in causal terms, is “big” while its partial R2 evaluated in the estimation sample is
small. If this happens this is usually due to the fact that, conditional on the other
32
It is the same as saying that the joint distribution of x and ε changes.
explanatory variables (and maybe even unconditionally) this variable varies very little
in the estimation sample and does not determine a relevant part of the dependent
variable variance.
It may be that, for some reason, the observed sample is unbalanced with respect
to the population. If, in a more balanced sample, the explanatory variable we are
considering is expected to have higher variance, it may be that its contribution to
explaining the variance of the dependent variable increases, so that it becomes interesting
to study its behaviour. However, if this is not the case and the sample is representative
of the population we are interested in, the “relevant” parameter shall be interesting
only if we compare the (few) sample points where the explanatory variable presents
very different values.
A second very simple example: suppose you are interested in the expected life of a
sample of patients after a given medical treatment. A small subsample of patients was
using a given drug, say A. A new drug, B, is given to all the subjects in the sample
and you observe a huge variation in survival time across different subjects, say a
standard deviation of 10 years over a mean of 5. You also observe that the subsample
previously treated with A has the same standard deviation but a mean of 10. Since
this subsample is small, the difference between the means shall contribute very little to
the overall variance (the partial R² shall be small); however it would be very proper to
suggest the use of A jointly with B. Notice that the more you increase the subpopulation
which is using A, the more the explained variance due to use/no use of A shall increase.
This, however, is true only up to the point when the fraction of the sample using A is 1/2.
If, for instance, everybody uses A, there will be no “variation” of life span due to
use/non use of A, but there will still be the “effect” of A in the 5 years on average gained
by its use.
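This last remark can be made precise: for a binary indicator taking value 1 on a fraction p of the sample and shifting the mean by δ, the variance “explained” by the indicator is p(1−p)δ², which vanishes at p = 0 and p = 1 and is maximal at p = 1/2, while the “effect” δ is the same throughout. A small sketch, with the 5-year gain of the example:

```python
# Variance "explained" by a binary treatment with mean effect delta is
# p*(1-p)*delta**2, where p is the treated share: maximal at p = 1/2,
# zero at p = 0 and p = 1, even though the effect delta never changes.
delta = 5.0  # mean survival gain, in years (from the example in the text)

def explained_var(p, delta=delta):
    return p * (1 - p) * delta ** 2

for p in (0.02, 0.25, 0.5, 0.9, 1.0):
    print(p, explained_var(p))
```

At p = 1 the explained variance is exactly 0 although every patient still gains 5 years on average.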
Notice that in this example the reasoning is based on our ability to change the
proportion of the population using A. Suppose instead that A is not the use of a medicine
but the fact that your eyes are one blue and one brown. In this case, observing the
same results, we would have very little to suggest, except the fact that B seems a very
useful medicine for the few lucky (in this case) people with eyes of different colors.
So: beware of unbalanced samples.
In other settings it may be that we can purposely alter the behaviour of some x not
just in terms of level but also of variance.
The number of possible cases is huge and this is not the place to go further into them.
One last comment: in the example above we have a case where an irrelevant result
in terms of R² gives us the relevant suggestion that we could assign both drugs A and B
to all patients and hence alter the distribution of X. This is a real possibility: we can give
both drugs (at least, if their combination is not harmful) and from this comes the relevance
of the result. Suppose instead that the difference is in terms of some other characteristic,
say: the color of the eyes. In this case we cannot change the percentage of the population
with such characteristics, or “give both colors” to each element of the population. In
this case, while interesting, the result is in any case “irrelevant”.
Since all estimates and statistics could be identical in both cases, this implies that
“relevance” is not something that can be fully resolved only on the basis of Statistics:
it requires accurate analysis of the specific problem.
It is also easy to show examples where a big partial R², while important in forecasting
terms and maybe also in “causal” terms, is in practice not directly of any use
(beyond forecasting). Suppose we select a population of women of different ages,
according to the marginal distribution by age of women, and attribute to each individual
the number of children she gave birth to in the last 5 years. It is clear that the age
shall be relevant (in terms of partial R2 ) in “explaining” the variance of the dependent
variable. This is expected and cannot be of use as we cannot change the age of the
elements in the sample. However, it is going to be important to keep the variable in the
regression if we wish to assess the separate effect of other, less obvious but potentially
relevant variables on which we can act, as, for instance, the amounts of vitamins in the
blood of different subjects, when these variables are correlated with age.

Points to remember in reading a regression.


To conclude this section let us summarize the steps in reading regression results.
Before beginning remember: it may not be necessary for you to discuss “effects”
of variables. This is really relevant only if you intend to use the model for a “policy”
action. If your purpose is data summary or forecasting, “effects” (in the “policy sense”)
are not the relevant aspect of regression to be studied.
On the other hand, if “effects” are of interest for you, regression shall not be able to
evaluate these by itself and you’ll need accessory hypotheses in order to be able to
assess them.
Among the possible accessory hypotheses, an experimental setting may sometimes
be useful (if possible).
A “structural” approach is another possibility and this is, from the historical point
of view, the approach chosen by Econometrics.
Neither of these is covered in this introductory course.
This said, here is the list:

1. Divide the analysis: A) known parameters B) statistical estimate. B) is easy (if


you know your Statistics) A) is the tricky part.

2. Understand that your model is a model of a conditional expectation E(Y|X) = Xβ.


3. Understand that Y is NOT E(Y |X) but Y = E(Y |X) + ε and V ar(Y ) =
V ar(E(Y |X)) + V ar(ε)
4. Quantify V ar(Y ) due to V ar(E(Y |X)) and V ar(Y ) due to V ar(ε) that is: com-
pute R2

5. You MUST do this because the purpose of a linear OLS model (with intercept)
is that of maximizing R2

6. Moreover, when you discuss the importance of each βj you are “partitioning” R2 .

7. Understand that the βj of a given Xj can be computed in two ways (partial


regression theorem): A) from the overall regression B) first regressing Xj on the
other columns of X and then regressing Y on the residuals of this regression

8. Hence βj only pertains to the “effect” (contribution to the forecast) on E(Y|X)
of what in Xj can change conditionally on the other X being constant, NOT to the
effect of a generic change in Xj.

9. Be careful about the meaning of “effect”: strictly speaking, all you can say is that,
if you build forecasts for Y given X using E(Y|X), the said “effect” is that, if it
happens that “Nature” gives you two new vectors of observations on X where the
difference between the two vectors is just in a different value of only Xj, then
the difference between your forecasts is given by the difference in the two values
of Xj times the corresponding βj (or its estimate if you are using an estimated
conditional expectation). In other words: this tells you nothing, by itself, about
the possible change in Y given an act on your part to change some value of Xj.
The (by no means easy) study of such a “causal” interpretation has always been
very much in the mind of econometricians, who evolved structural Econometrics
as an attempt to answer the (very interesting, due to obvious policy reasons) question
of assessing the possible results of a change in a variable not just given by “Nature”
but acted by a policy maker. The obvious difference between the two cases is
that the act could not respect the “Natural” joint distribution of observables and
make your previous study of this useless as a source of answers. Just think of the
obvious difference between observing, say, interest rate changes induced by market
dynamics and imposing by policy an interest rate change: the laws concerning
the effects of the second act could be completely different from the laws concerning
the “natural” evolution of rates in the market. On the contrary the “forecast
change” is always a good interpretation if the observed change happens without
interference.

10. Once you understand the meaning of the word “effect”, quantify it, in your
sample (that is: for a given joint distribution of Y and X), with the semi partial
R² due to the j−th regressor, tj²(1 − R²)/(n − k) (if you are reading a paper and, bad
sign, R² is not available, use the same formula with R² = 0. This shall give you
an overvaluation of the partial R²).
11. Evaluate the practical significance of this “effect” (while doing this, ask yourself
if the sample is balanced with respect to the explanatory variable and, if not, con-
sider whether a balanced version of it is a sensible possibility: see above the examples
involving the heights of males and females and the experiment with two medicines).
In general this depends on the specific case. However a first rough idea can be
gained by computing the change in the conditional expectation of Y induced by a
“reasonable” change of Xj. Since this must be a “reasonable” change “which leaves
the other explanatory variables unchanged” (recall the meaning of βj induced by
the partial regression theorem), it could be measured by the conditional stan-
dard deviation of Xj given the other regressors. It could be useful, then, to compute
the ratio |βj|√Var(Xj|X−j)/√Var(Y), which expresses this “effect” in units of the
standard deviation of Y (the modulus around βj comes from the fact that we get
the formula by taking the square root of a square). A quick proof shows that
this is exactly identical to the square root of the semi partial R² for Xj, and this
confirms the centrality of this quantity. Beware! Do not be deceived by a quantity
which bears some resemblance to this: |βj|√Var(Xj)/√Var(Y), which
shall obviously be bigger (actually not smaller) and so, maybe, gratifying. The
point is that, by not conditioning the variance of Xj, it violates the interpretation
of the coefficient. By the way: it may well happen that this quantity is bigger
than 1, which, obviously, is absurd³³.

12. Then do Statistics (namely: consider that you must estimate β and evaluate the
quality of the estimate).

13. Remember: an estimate is “statistically significant” if the ratio of its value to its
sampling standard deviation is big enough to say that you can reliably distinguish
33
Most of the time, this choice is made when the correct measure of relevance would give as result a very
small value, that is: the practical irrelevance of the “effect”. In this case the use of the unconditional
standard deviation inflates the result but, since the starting point is very small, the inflated value
is smaller than 1 and, apparently, you do not get absurd results. The inconsistency is in any case
evident: for instance, you get a semi partial R² of, say, .0001 for a given Xj and then you find written
in the paper that “a change of Xj equal to its standard deviation implies a change of (the conditional
expected value of - but sometimes this too is forgotten) Y equal to 1/2 of its standard deviation”. These
two pieces of information are evidently conflicting and the solution is that, since the “effect” measure given by
βj only has to do with “a change in Xj with the rest of the regressors constant”, you cannot use the
unconditional standard deviation of Xj as a measure of a “normal” change in Xj (it could be, but in an
unconditional setting): you must use the conditional standard deviation. Just take the square root of
the semi partial R² and you get that the correct measure of the change in the conditional expectation
of Y, given a “reasonable conditional on X−j” change of Xj (its conditional standard deviation),
is, in units of the standard deviation of Y, equal to .01. A completely different picture. What
is happening? Just take the ratio of this with the previous number: .01/(1/2) = .02; this shall be the
ratio between the conditional and unconditional standard deviation of Xj. It happens that today’s
common use of very big samples, which can be used only by adding a huge number of “fixed effects”, makes
this event quite common.
it from zero.

14. Remember: a “statistically significant” estimate could well be practically irrele-


vant if it corresponds to a small semi partial R2 . Moreover: with enough data
you can get any precision you like out of an estimate and distinguish the estimate
from zero even if its value is almost exactly zero. If you just play with the sample
size you shall see that the semi partial R2 is very little affected by the sample size
when this changes from good, to huge and maybe to amazing. However, you are
free to use the above suggested (correct) statistical measure of practical relevance,
but... what is “small” and “big” in terms of practical relevance depends on the
specific purpose of the analysis, not on Statistics. In a good empirical analysis the
researcher should pre-specify which size of an effect is practically relevant and
which precision is required for its estimate. This would allow the researcher to
choose (when practically possible) a sample size big enough to give estimates
of the parameters precise enough to assess the size of the effect to the required
precision³⁴.

15. In any case remember the difference between significance and relevance. E.g.:
beware the use of large datasets. If, in the comments to their results, Authors
using large datasets stick too much to “statistical significance” and do not deal
with practical relevance, most likely their results can be summarized as “a very
precise estimate of irrelevant effects”, so that the reading of the main results of the
paper can usually be changed into something like: “our data point strongly to the
irrelevance of the effect under study”. By the way:
while not currently fashionable, such a finding could be of great interest.

16. Finally: Beware of unbalanced samples (this is the same as 11 but it is very
important, so I repeat).
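Points 7, 10 and 11 can all be verified numerically on simulated data: the coefficient from the partial regression coincides with the one from the full regression, and tj²(1 − R²)/(n − k) coincides with (|β̂j|·sd(Xj|X−j)/sd(Y))². A sketch (variable names and data are illustrative, not from any dataset used in these notes):

```python
# Numerical check of the partial regression theorem (point 7) and of the
# semi partial R^2 identity of points 10-11, on simulated data.
import numpy as np

rng = np.random.default_rng(2)
n, k = 500, 3
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)        # deliberately correlated regressors
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# (A) Partial regression theorem: residual of x2 on [1, x1], then regress y on it.
Xo = np.column_stack([np.ones(n), x1])
r_x2 = x2 - Xo @ np.linalg.lstsq(Xo, x2, rcond=None)[0]
beta_partial = (r_x2 @ y) / (r_x2 @ r_x2)
assert abs(beta[2] - beta_partial) < 1e-8   # same coefficient, two routes

# (B) Semi partial R^2 computed two ways.
e = y - X @ beta
sst = (y - y.mean()) @ (y - y.mean())
R2 = 1 - (e @ e) / sst
s2 = (e @ e) / (n - k)
t_j = beta[2] / np.sqrt(s2 / (r_x2 @ r_x2))     # usual t statistic for beta_2
sp_R2_from_t = t_j ** 2 * (1 - R2) / (n - k)
sp_R2_direct = beta[2] ** 2 * (r_x2 @ r_x2) / sst
print(sp_R2_from_t, sp_R2_direct)               # the two numbers coincide
```

The identity holds exactly, not just approximately: both expressions reduce to βj²·Σ(residuals of Xj)²/SST.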

9.14.8 Further readings (Absolutely Not Required for the exam)


We already met, during this chapter, what is arguably the best book on linear
regression from the point of view of interpreting results, with strong, detailed state-
ments about the difference between forecasting and causal analysis, lots of examples
and hindsight, and with a minimum of Mathematics (probably because it was written
by very good mathematicians):
34
In Finance we have a very good example of this. Factor models break down the overall variance of
a return into components correlated with “factors”. The most relevant one is the “market level” but, over
time, it has been supplemented by other factors like “value”, “size”, “momentum” etc. These new
factors “explain” a very tiny fraction of the variance of returns, the more so if compared to the market
factor. However their study is not irrelevant because you could in practice invest in a portfolio whose
only (systematic) change in returns is correlated with just the chosen “factor”. In other words: “even a
small variance is relevant if you can isolate it from the overall variance”.
Mosteller, F. and Tukey, J. W. (1977). “Data Analysis and Regression: A Second
Course in Statistics”. Addison-Wesley, Reading, MA. In particular, see ch. 13 with the
meaningful title: “Woes of Regression Coefficients”.
A good and more concise summary can be found in:
Sanford Weisberg (2014) “Applied Linear Regression”, III ed., Wiley. In particular,
see Ch. 4.
A short paper by a great statistician which contains, in simple and condensed form,
most of what was discussed here, is: George E. P. Box (1966) “Use and Abuse of
Regression”, Technometrics, Vol. 8, No. 4 (Nov., 1966), pp. 625-629
For the maths of semi partial R2 joint with a keen discussion of “effect sizes” you
may see:
Jacob Cohen et al. (2013) “Applied Multiple Regression/Correlation Analysis for
the Behavioral Sciences”, III ed., Routledge.
For those interested in reading something more about the different interpretations
of a linear model (e.g. forecast VS causal), which make for, arguably, a very tricky and
slippery field to walk on, the following books could be useful:
J. D. Angrist and J. S. Pischke (2009) “Mostly Harmless Econometrics”, Princeton
University Press.
J. Pearl (with Madelyn Glymour and Nicholas P. Jewell) (2016) “Causal Inference
in Statistics: a Primer”, Wiley.

Examples
Exercise 9-Linear Regression.xls

10 Style analysis

Style analysis is interesting both from the point of view of practitioner’s finance and
as an application of the linear regression model.
The current version of the model was elaborated by William F. Sharpe in a series
of papers beginning in 1989. In this summary we shall refer to the 1992 paper (as of
November 2018 you may download it at http://www.stanford.edu/~wfsharpe/art/sa/sa.htm).
In order to understand the origin of the model we must recall the intense debate
developing during the eighties about the validity of the CAPM model, its possible
substitution with a multifactor model and the evaluation of the performance of fund
managers.
In a nutshell (back to this in some more detail in the next chapter): a factor
model is a tool for connecting expected returns of securities or security portfolios to
the exposure of these securities to non-diversifiable risk factors. The CAPM model
asserts that a single risk factor, the “market”, or, better, the random change in the
“wealth” of all agents invested in the market, is priced in terms of a (possible) excess
expected return. This factor is empirically represented by the market portfolio, that
is: the sum of all traded securities. The expected return of a security in excess of the
risk free rate (remember that we are considering single period models) is proportional
to the amount of the correlation between the security and the market factor. The
proportionality factor is the same for all securities and is called price of risk.
Multifactor models, such as the APT, suggest the existence of multiple risk factors
(not necessarily traded) with different prices of risk, so that the cross section of ex-
pected security (or security portfolio) excess returns is “explained” by the set of the
security exposures to each factor. Classical implementations of the APT were based
on economic factors; some were tradable, like the slope of the term structure of inter-
est rates, some, at least at the time, non tradable, such as GNP growth and inflation. At
the turn of the nineties Fama and French, followed by others, produced a number of
papers where factors were represented by spread portfolios. The most frequently used
factors were based on the price to book value ratio, on the size of the firm and on some
measure of market “momentum” (relative recent gain or loss of the stock w.r.t. the
market). These factors were represented, in empirical analysis, by spread portfolios.
As an instance: the price to book value ratio was represented by the p&l of a portfolio
invested, at time zero, in a zero net value position long in a set of high price to book
value stocks and short in a set of low price to book value stocks. Fama and French
asserted that the betas w.r.t. this kind of factor mimicking portfolios were “priced by
the market”, that is, the correlation of a stock return with such portfolios implied a
non null risk premium.
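The mechanics of such a spread portfolio can be sketched in a few lines. Everything below (data, the 30%/70% cutoffs, variable names) is hypothetical and only illustrates the idea of a zero net value long/short position sorted on a characteristic:

```python
# Sketch of a "spread portfolio" factor return: long the stocks with a high
# value of a characteristic (here a hypothetical price-to-book array), short
# the ones with a low value, zero net investment. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n_stocks = 100
price_to_book = rng.lognormal(0.5, 0.4, n_stocks)    # hypothetical characteristic
next_returns = rng.normal(0.01, 0.05, n_stocks)      # hypothetical one-period returns

lo, hi = np.quantile(price_to_book, [0.3, 0.7])      # illustrative sorting cutoffs
long_leg = next_returns[price_to_book >= hi].mean()
short_leg = next_returns[price_to_book <= lo].mean()
factor_return = long_leg - short_leg                 # p&l of the zero net value position
print(factor_return)
```

Real implementations repeat the sort at every rebalancing date on actual accounting data; nothing in the construction requires more than careful data management.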
Consider now the problem of evaluating the performance of a fund manager. A
preliminary problem is to understand for which reason you, the fund subscriber, should
pay the fund manager. Obviously, you should not pay the fund manager beyond
implementation costs (administrative, market transactions etc) for any strategy which
is known to you at the moment you subscribe to (or do not withdraw from) the fund
if this strategy gives “normal” returns and if you can implement it by yourself.
Suppose, for instance, that the asset allocation of the fund manager is known to
you before subscribing to the fund. Since the subscription of the fund is your choice,
the fund manager should not be paid for the fund results due to asset allocation, or,
better, should not be paid for this beyond implementation costs. A bigger fee could be
justified only if, by the implementation of management decisions you cannot forecast
on the basis of what you know, the fund manager earns some “non normal” return.
This is the reason why index funds should (and, in markets populated by knowl-
edgeable investors, usually do) ask for small management fees. What we say here is
that this should be the same for any fund managed with some, say, algorithm, replica-
ble on the basis of a style model like, for instance, funds which follow asset selection
procedures based on variants of the Fama and French approach (that is: stock picking
based on observable characteristics of the firms issuing the equity as, for instance,
accounting ratios, momentum etc). While implementing such models requires some
care and a lot of good data management, the reader should be aware of the fact that
nothing magic or secret is required for the implementation of these algorithms.
The fund manager contribution, with a possible value for you, if any, should be
something you cannot replicate, that is: either something arising from (unavailable
to you) abilities or information of the manager or, maybe, from some monopolistic or
oligopolistic situation involving the manager. Let us suppose (a very naive idea!) that
the second hypothesis is not relevant. A formal way to say that the manager ability
is not available to you is to say that you cannot replicate its contribution to the fund
return with a strategy conceived on the basis of your knowledge.
Notice that for this reasoning to be valid it is not required that you actually perform
any analysis of the fund strategy before buying it. Perhaps we could agree on the fact
that you should perform such an analysis, before buying anything. A mystery of finance
is that people spend a lot of money in order to buy something whose properties are
unknown to the buyer. People wouldn’t behave in this way when buying, say, a car
or even a sandwich. However, any lack of analysis simply means that something more,
unexpected by you, shall become (in your opinion) a merit or fault of the fund manager.
It is important to understand that, according to this view, the evaluation of the
performance of a fund manager is, first of all, subjective. It is the addition of hypotheses
on the set of information used by subscribers and on their willingness to optimize using
this information that can convert the subjective evaluation into an economic model.
The problem here is, obviously, to define what we mean by “normal return” and
“known strategy”.
Here a market model, representing efficient financial use of public information, could
be the sensible solution. Were the market model and the effective asset manager’s asset
allocation available, the first could be used to define the efficiency of the second and,
by difference, possible unexpected (by the model) over or under performances on the
part of the fund manager.
Alas, for reasons that shall be discussed in following sections, satisfactory empirical
versions of market models still have to appear or, at least, versions of market models,
and statistical estimates of the relative parameters, strong enough to be agreed upon
by everybody and so useful in an inter-subjective performance analysis.
A less ambitious and more empirically oriented alternative is return based style
analysis. This alternative yields a (model dependent) subjective statement about the
quality of the fund. We shall return on this point but we stress the fact that, if the
purpose of the method is for a potential subscriber or for someone already invested in
the fund to judge the fund manager performance and not for some agency to award
prizes, the subjective component of the method is by no means a drawback.
Return based style analysis can be seen as a specific choice of “normal return” and
“known strategy” definitions. The “known strategy” is the investment in a set of
tradable assets (typically total return indexes) according to a constant relative
proportion strategy; the “normal return” is the out of sample return of this strategy,
previously tuned in order to replicate the historical returns of the fund. This point
has to be hammered in, so we repeat: the strategy is not chosen in order to yield “optimal” returns (in
any case the lack of a market model would impede this) but only in order to replicate
as well as possible, in the least squares sense, the returns of the fund strategy.
In order to estimate the replica weights, the returns $R_t^\Pi$ of the fund under
investigation are fitted to a constant relative proportion strategy with weights $\beta_j$
invested in a set of k predetermined indexes with returns $R_{jt}$:
$$R_t^\Pi = \sum_{j=1,\dots,k} \beta_j R_{jt} + \epsilon_t$$

The term “constant relative weights strategy” indicates, as usual, a strategy where
the proportion of wealth invested in any given index is kept constant over time. This
implies that, when some index over performs other indexes, a part of the investment
in the over performing index must be liquidated and invested in the under performing
indexes.
For the sake of comparison other possible strategies could be the buy and hold
strategy where a constant number of shares is kept for each index and the trend follow-
ing strategy, where shares of “loser” indexes are sold to buy shares of “winner” indexes.
Both these strategies have variable weights on returns and could reasonably be used
as reference strategies.
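The difference between the two weighting schemes can be made concrete with a small simulation. The following is a minimal numpy sketch: the two-index setup and all the returns are invented, purely for illustration.

```python
import numpy as np

# Invented gross returns (1 + linear return) for two indexes over 4 periods:
# index 1 trends upward, index 2 is flat.
growth = np.array([[1.10, 1.00],
                   [1.10, 1.00],
                   [1.10, 1.00],
                   [1.10, 1.00]])
w = np.array([0.5, 0.5])  # target relative proportions

# Constant relative proportions: rebalance back to w every period,
# so each period's portfolio gross return is the weighted average.
cm_wealth = np.prod(growth @ w)

# Buy and hold: fix the initial number of "shares" and never rebalance,
# so the weights drift with relative performance.
bh_wealth = (w * np.prod(growth, axis=0)).sum()

print(cm_wealth, bh_wealth)
```

With a persistently trending index the buy and hold portfolio ends up ahead of the rebalanced one; in an oscillating market the ranking reverses.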
There exist variants of the constant relative proportions strategy itself. In a con-
strained version the weights could be required to be non negative (short positions are
not allowed). In another version weights could be allowed to change over time (in this
case we should assume that the sum of all weights is constant over time).
In typical implementations there is no intercept in the model and the sum of the betas
is constrained to be one. The constant is dropped because it is usually interpreted as a
constant return and, over more than one period, a constant return cannot be achieved
even from a risk free investment. The assumption that the sum of all weights is one is
an assumption required for the interpretation of the weights as relative exposures and,
in the case of a multi period strategy, in order for the portfolio to be self financing.
While both interpretations and both constraints could be challenged, in our applications
we shall stick to the common use. We only note that, sometimes, instead of imposing
the “sum to one” constraint explicitly at estimation time³⁵, this is implemented on an
a posteriori basis by renormalizing the estimated coefficients. The two methods do not
yield the same results.

³⁵ The $\sum_j \beta_j = 1$ constraint can be imposed on the OLS model in a very simple way. First
choose any $R_{jt}$ series, say $R_{1t}$. Typically the choice falls on some series representing returns
from a short term bond, but any choice will do. Second, compute $\tilde R_t = R_t - R_{1t}$ and
$\tilde R_{jt} = R_{jt} - R_{1t}$ for $j = 2, \dots, k$. Now regress $\tilde R_t$ on the $\tilde R_{jt}$ for
$j = 2, \dots, k$. The coefficient for $R_{1t}$, which you do not directly estimate, shall be equal to
$1 - \sum_{j=2}^k \hat\beta_j$.
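The two routes can be compared directly. Below is a minimal numpy sketch (simulated data, invented weights) of the constrained fit via the differencing trick of the footnote versus the a posteriori renormalization of plain OLS coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
R = rng.normal(0.0, 0.01, size=(n, k))        # index returns (simulated)
true_b = np.array([0.2, 0.5, 0.3])            # true weights, summing to one
Rp = R @ true_b + rng.normal(0.0, 0.002, n)   # fund returns

# Constrained fit: regress (Rp - R1) on (Rj - R1) for j = 2..k;
# the weight of index 1 is then 1 minus the sum of the other estimates.
y = Rp - R[:, 0]
X = R[:, 1:] - R[:, [0]]
b_rest, *_ = np.linalg.lstsq(X, y, rcond=None)
b_constrained = np.concatenate([[1.0 - b_rest.sum()], b_rest])

# A posteriori renormalization of the unconstrained OLS coefficients.
b_ols, *_ = np.linalg.lstsq(R, Rp, rcond=None)
b_renorm = b_ols / b_ols.sum()

print(b_constrained, b_renorm)  # both sum to one, but they differ
```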
A relevant point in the choice of the reference strategy is that it should not cost
too much. In this sense the constant relative proportions strategy could be amenable
to criticism as it can imply non negligible transaction costs. The reason for its use in
style analysis seems to rest more on tradition than on suitability.
Notice that in no instance are we supposing that the fund under analysis actually
follows a constant relative proportion strategy invested in the provided set of indexes.
We are NOT trying to discover the true investment of the fund but only to replicate
its returns as best as we can with some simple model. This point has to be underlined
because, at least in the first paper on the topic, Sharpe himself seems to state that
the purpose of the analysis is to find the actual composition of the fund. This is
obviously impossible if it is not the case that the fund is invested, with a constant
relative proportions strategy, in the indexes used in the analysis.
In fact, the actual discovery of the composition of the fund and its evolution over
time would hardly add anything to the purpose of identifying the part of the fund’s
strategy not forecastable by the fund subscriber. A model would still be needed in
order to divide what is forecastable from what is unforecastable in the fund evolution.
Let us go back to the identity:
$$R_t^\Pi = \sum_{j=1,\dots,k} \beta_j R_{jt} + \epsilon_t$$

Up to now this is not an estimable model but, as said above, an identity. In order to
convert it into a model we must assume something on $\epsilon_t$. A way of doing this is to recall
the chapter on linear regression. The style model is clearly similar to a linear model.
In particular it is similar to a linear model where both the dependent and independent
variables are stochastic. In this case we know that a minimal hypothesis for the OLS
estimate to work is that $E(\epsilon|R_I) = 0$, where $\epsilon$ is the vector containing the $n$
observations on the $\epsilon_t$'s and $R_I$ is the matrix containing the $n$ observations on the
returns of the k indexes. The second, less relevant, hypothesis is the usual $E(\epsilon\epsilon'|R_I) = \sigma^2_\epsilon I_n$.
The hypothesis $E(\epsilon|R_I) = 0$ has a sensible financial meaning: we are supposing
that any error in our replication of the fund’s returns is uncorrelated with the returns
of the indexes used in our replication.
Sharpe’s suggestion for the use of the model in fund performance evaluation is as
follows: given a set of observations (typically with a weekly or lower frequency, Sharpe
uses monthly data) from time t = 1 to time t = n fit the style model from t = 1 to
$t = m < n$ and use the estimated coefficients for forecasting $R^\Pi_{m+1}$; then add
observation m + 1 to the estimating set and (in most implementations) drop observation
1. Forecast $R^\Pi_{m+2}$ and so on. These forecasts represent the fund’s performances as due
to its “style”, where the term “style” indicates our replicating model. The important
point is that this “style” result is forecastable and, in principle, replicable by us. The
possible contribution of the fund manager, at least with respect to our replication
strategy, must be found in the forecast error. The quality of the fund manager has to
be evaluated only on the basis of this error.
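The rolling procedure just described can be sketched as follows. This is a minimal numpy illustration on simulated data; the window length, frequency and weights are invented, and the sum-to-one constraint is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 120, 3, 60                        # 120 months, 60-month window
R = rng.normal(0.005, 0.04, size=(n, k))    # index returns (simulated)
fund = R @ np.array([0.3, 0.4, 0.3]) + rng.normal(0.001, 0.01, n)

style = np.empty(n - m)
for t in range(m, n):
    # Fit the style weights on the rolling window [t-m, t) ...
    b, *_ = np.linalg.lstsq(R[t - m:t], fund[t - m:t], rcond=None)
    # ... and use them to forecast the fund return at time t.
    style[t - m] = R[t] @ b

# The out-of-sample forecast error is the part attributed to the manager.
manager = fund[m:] - style
print(manager.mean(), manager.std())
```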
There are three possibilities:
• The fund manager return is similar (in some sense to be defined) to the replicating
portfolio return. In this case, since you are able to replicate the result of the fund
manager strategy using a “dumb” strategy, you shall be willing to pay the fund
manager only as much as the strategy costs.
• The fund manager returns are less than your replica returns. In this case you
should avoid the fund as it can be beaten even using a dumb strategy which is
not even conceived to be optimal but only to replicate the fund returns. This is a
strong negative result. While it is true that it is possible to find alternative assets
that, when calibrated to the fund returns in a style analysis, give a positive view
of the same manager results, the fact that a simple strategy exists that beats the
fund returns is enough to call into question any fund manager’s ability.
• The fund manager returns are better than your replica strategy. In this case it
seems that the manager adds to the fund strategy something which you cannot
replicate. This is a hint in favor of the fund manager's ability. It is a weak hint,
for the same reason the negative result is a strong hint. The negative result is
strong because a simple strategy beats the fund manager’s one; the positive result
is weak because the fund manager beats a simple strategy, but others could exist
which equate or even beat the fund manager's strategy. In any case this is at
least a necessary condition for paying a fee greater than the simple strategy costs.
The important point to remember, here, is that the result is relative to the strategy
and the asset classes used. No attempt is made to build optimal portfolios with the
given asset classes; only replica portfolios are built. The reader should think about the
possible extensions of procedures like style analysis, were a market model available.
A simple example of style analysis using my version of Sharpe’s data and three well
known US funds is in the worksheet style analysis.xls.

10.1 Traditional approaches with some connection to style analysis
The idea that you should find some “normal” return with which to compare a fund
return and that this definition of “normal” return is to be connected with the return of
some “simple strategy” related with the fund’s strategy is so basic that many empirical
attitudes are informally justified by it.
On a first level, we observe very rough fund classifications in “families” of funds,
defined by broad asset classes. This suggests comparisons of funds to be made only
inside the same family. In a sense the comparison strategy is implicitly taken to be
an average of the strategies within the same asset class.
Another shadow of this can be found in the frequently stressed idea that the result
of any fund management must be divided between asset allocation and stock picking.
In common language this partitioning is not well defined and asset allocation may
mean many different things as, for instance, the choice of the market, the choice of
some sector, the choice of some index. Moreover there is no precise definition of how
to distinguish between asset allocation and stock picking. But it is clear that this
distinction, again, hints at some normal return, derived by asset allocation, and some
residual: stock picking.
The “benchmarking” idea is another crude version of the same: you try to sep-
arate the fund manager’s ability from the overall market performance by devising a
benchmark which should summarize the market part of the fund manager strategy.
Market models can be seen as a step up the ladder. Here the benchmark idea is
expressed in a less naive way. Under the hypothesis that the market model holds and
is known and the beta (CAPM) or betas (APT) of the fund are known, the part of
the result due to the market factor(s) is to be ascribed to the overall fund strategic
positioning and, as such, its consequences are in principle a choice of the investor. Any
other over or under performances can be ascribed to the fund manager abilities and
private information.
As we mentioned above, this use of market model is greatly hampered by the fact
that the proposition “...the market model holds and is known and the beta (CAPM)
or betas (APT) of the fund are known” simply does not hold.
Now a few words on comparison criteria.
The classical Sharpe ratio considers the ratio of a portfolio's return in excess of
a risk free rate to its standard deviation. Even in this form the Sharpe ratio is a
relative index: the fund performance is compared to a riskless investment. In general
this comparison is not a useful one. Typically our interest shall be to compare the
fund performance with a specific strategy, which, in some instance, could be the best
possible replication of the fund’s returns accomplished using information available to
the investor. In many cases this reference strategy shall be a passive strategy (this
does not mean that the strategy is a buy and hold strategy but that the strategy can
be performed by a computer following a predefined program).
As considered before, such a strategy could be provided, for instance, by some asset
pricing model (CAPM, APT etc.). In other cases the reference strategy could simply
be represented in the choice of a benchmark used either in the unsophisticated way
where, implicitly, a beta of one is supposed (that is, at the numerator of the Sharpe
ratio take the difference between the returns of the fund and those of the benchmark)
or in the more sophisticated way of computing the alpha of a regression between the
return of the fund and the return of the benchmark.
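The two benchmark uses just mentioned, together with the Sharpe ratio, can be sketched in a few lines. The data below are simulated and the fund's "true" alpha and beta are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 260                                      # weekly excess returns (simulated)
bench = rng.normal(0.002, 0.02, n)           # benchmark excess returns
fund = 0.001 + 1.2 * bench + rng.normal(0.0, 0.01, n)

# Classical Sharpe ratio: mean excess return over its standard deviation.
sharpe = fund.mean() / fund.std(ddof=1)

# Unsophisticated benchmark use: implicitly assumes a beta of one.
naive_excess = (fund - bench).mean()

# More sophisticated use: Jensen's alpha from a regression on the benchmark.
X = np.column_stack([np.ones(n), bench])
alpha, beta = np.linalg.lstsq(X, fund, rcond=None)[0]
print(sharpe, naive_excess, alpha, beta)
```

When the true beta differs from one, the naive difference mixes the manager's contribution with the extra market exposure, while the regression alpha separates the two.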
Otherwise the reference strategy could be based on an ad hoc analysis of the history
of the fund under investigation. Style analysis is a way to implement this analysis.
Two relevant final points.
First: the comparison strategy should always be a choice of the investor. It is rather
easy, from the fund’s point of view, to choose as comparison a strategy or a benchmark
with respect to which the strategy of the fund is superior, at least in terms of alpha. This
is known as “Roll’s critique”. While the fact that the strategy chosen by the investor as
comparison is dominated by the fund strategy is admissible as, usually, the fund does
not tune its strategy to this or that subscriber comparison strategy (at least this is true
if the subscriber is not big!), when it is the fund that chooses the comparison strategy
a conflict of interests is almost certain.
Second: once the part of the strategy due to the fund manager's intervention has been
identified, a summary of it based on the Sharpe ratio or on Jensen’s alpha is only one
of the possible choices and strongly depends on the subscriber’s opinion on what is a
proper measure of risk and return.

10.2 Critiques of style analysis
Under the hypotheses and the interpretation described in the previous section style
analysis can be considered a useful performance evaluation tool. However, at least in
the version suggested by Sharpe, it lends itself to some strong critiques.
A first very simple critique concerns the choice of the replicating strategy. While
the use of indexes does not create big problems, at least when these indexes can be
reproduced with some actual trading strategy, a big puzzle lies in the choice of a
constant relative proportion strategy. This is both an unlikely and a costly strategy,
due to portfolio rebalancing. The typical simple strategy is the buy and hold strategy,
most indexes are, in principle, buy and hold strategies and the market portfolio of
CAPM is a buy and hold strategy. As seen in chapter 1 the buy and hold strategy
is NOT a constant relative proportion strategy. Moreover, a buy and hold strategy,
typically, implies very small costs (the reinvestment of dividends is the main source of
costs if there is no inflow or outflow of capital from the fund) while a constant relative
proportion strategy implies frequent rebalancing of the portfolio.
Now, the replicating strategy is a free choice of the analyzer; however, if we simply
suppose that the fund follows a buy and hold strategy in the same indexes used by
the style analyzer, we end up with a strange, if perfectly natural, result. Obviously the R²
of the model shall not be 1, except in the case of identical returns for all the indexes
involved in the strategy; moreover, the analysis shall point out as “unforecastable”,
and so due to the fund manager's action, any return of the fund due to the lack of
rebalancing implied in a buy and hold strategy.
Suppose, for instance, that some index frequently outperforms (or underperforms)
the rest of the indexes used in the analysis during the analysis period. This shall
result in a forecast error for the strategy fitted using a constant relative proportion
strategy which shall attribute to the fund manager a positive contribution to the fund
result. On the contrary, temporary deviations of the return of one index from the
returns of the others shall result, in the comparison of the strategies, in favor of the
constant proportion strategy.³⁶
A second critique, of theoretical interest but hardly relevant in practice, is connected
with Roll’s critiques of CAPM tests and, more in general, of CAPM based performance
evaluation. If the constant proportion strategy does not contain all the indexes
required for composing an efficient portfolio, any investment by the fund manager into
the relevant excluded indexes shall result in an over performance. This would be
relevant only if the evaluated fund manager knew, ex ante, the style model with which
his/her strategy shall be evaluated AND if the fund manager had more thorough
information on the structure of the efficient portfolio.
The point is that, while it is rather easy to compute an efficient portfolio ex post,
this is not so easy ex ante. Moreover, if we accept the idea that the style decomposition
depends on the information of the analyzer, this critique loses much of its force.
A third, and more subtle, critique can be raised to style analysis as well as to any
OLS based factor model used for performance evaluation. If the model is fitted to the
fund returns, the variance (or sum of squares, if no intercept is used) of the replicating
strategy shall always be less than or equal to that of the fund returns. In a CAPM
or APT logic this is not a problem, since only non diversifiable risk should be priced
in the market. However, as stressed above, we are NOT in a CAPM or APT world.
With this lack of variance we are giving a possible advantage to the fund. Ways for
correcting this problem can be suggested and, in fact, performance indexes which take
into account this problem do exist. However, since, as we saw above, the positive (for
the fund) result is already a weak result in style analysis, this undervaluation of the
variance is only another step in the same direction: negative valuations are strong,
neutral or positive valuations could be challenged.
A last word of warning. Many data providers and financial consulting firms sell style
analysis. As far as I know, the advertising of commercial style models invariably asserts
the ability of such models to discover the true composition of the fund portfolio and
most reports produced by style analysis programs concentrate on the time evolution
(estimated by some rolling window OLS regression) of portfolio compositions. This is
quite misleading (Sharpe is somewhat responsible as in the original papers he seems
³⁶ In the case of a positive trend of, say, an index with respect to the rest of the portfolio, a buy
and hold strategy does not rebalance by selling some of the same index and buying the rest of the
portfolio. In case of a further over performance of the index the buy and hold portfolio shall over
perform the rebalanced portfolio. In the case of a negative trend of some index with respect to the
rest of the portfolio the constant proportion strategy must buy some of the under performing index,
selling some of the rest of the portfolio; if the under performance continues this strategy shall imply
an over performance of the buy and hold strategy with respect to the constant relative proportion
strategy. On the contrary, a strategy investing in temporary losers (after the loss!) or disinvesting in
temporary winners shall outperform a buy and hold strategy in an oscillating market.
to share this opinion) and can be accepted only if interpreted as a loose way to
state the true purpose of the strategy, that is: return replication. As far as I know,
the typical seller and user of style analysis, if not warned, tends to believe the “fund
composition” story. This false idea usually disappears after some debate, provided, at
least, that the user or seller is even marginally literate in simple quantitative methods.

Examples
Exercise 10-Style Analysis.xls

11 Factor models and principal components

11.1 A very short introduction to linear asset pricing models


11.1.1 What is a linear asset pricing model
Let us begin by considering the following plot. Here you find yearly excess total linear
return averages and standard deviations for those stocks which were in the S&P 100
index from 2000 to 2019 (weekly data, 83 stocks).
As you can see, stocks with similar average total returns show very different standard
deviations and vice versa. We know that the statistical error in the estimate of expected
returns using average returns may be big; however, if we believe that average return
has anything to do with expected return and standard deviation with risk, the plot is
puzzling (and these are 83 BIG companies).

[Figure: scatter plot of average yearly excess return vs. standard deviation, one point per stock]
We can see asset pricing models as tools devised to answer the kind of puzzles which
plots like this one may raise.
Among these are two of the oldest and most relevant questions of Finance:

1. in the market we see securities whose prices evolve in completely different ways.
There may even be securities that have both mean returns lower and standard
deviations of returns higher than other securities. Why are all these securities,
with such apparently clashing statistical behaviours, still traded in equilibrium?

2. which are the “right” equilibrium relative prices of traded securities?

(Do not be puzzled by the fact that we speak of asset pricing models and write returns.
Given the price at time 0, the return between time 0 and time 1 determines the price
at time 1.)
We anticipate here the answers to these two questions given by asset pricing models:

1. securities prices can be understood only when securities are considered within a
portfolio. Completely different (in terms, say, of means and variances of returns)
securities are traded because they contribute in improving the overall quality of
the portfolio (in the classic mean variance setting this boils down to the usual
diversification argument), what is relevant is not the total standard deviation of
each security but how much of this cannot be diversified out in a big portfolio; for
this reason the expected return of a security should not be compared with
its standard deviation but only with the part of this standard deviation which
cannot be diversified;

2. the right (excess) expected returns of different securities should be proportional
to the non diversifiable risks “contained” in the returns; to equal amounts of
the same risk should correspond equal amounts of expected return.

These are not the only observed properties of asset prices/returns that asset pricing models
try to account for. Another striking property is as follows: while thousands of securities
are quoted, there seems to be a very high correlation, on average, among their returns.
In a sense it is as if those many securities were “noisy” versions of much less numerous
“underlying” securities.
For instance, the 83 stocks of the S&P 100 displayed above show an average (simple)
correlation (over 20 years!) of .31. If we recall the discussion connected with the
spectral theorem and compute the eigenvalues of the covariance matrix of these returns,
while we see no eigenvalue equal to 0, the sum of the first 5 eigenvalues is greater than
50% of the sum of all eigenvalues, whereas the last, say, 50 eigenvalues account for about
15% of the total. The first eigenvalue alone is about 1/3 of the total. This suggests the
idea that, while not singular, the overall covariance matrix can be well approximated
by a singular covariance matrix.
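This eigenvalue concentration is easy to reproduce on simulated data. The sketch below generates returns from a few common factors plus idiosyncratic noise; all numbers are invented and are not the actual S&P 100 data.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 1000, 83, 5
F = rng.normal(0.0, 0.02, size=(n, k))           # k common factors
B = rng.normal(1.0, 0.5, size=(k, m))            # factor loadings
R = F @ B + rng.normal(0.0, 0.03, size=(n, m))   # returns: factors + noise

S = np.cov(R, rowvar=False)
eig = np.sort(np.linalg.eigvalsh(S))[::-1]       # eigenvalues, largest first
share = eig / eig.sum()

# A handful of eigenvalues carries most of the variance; none is exactly 0.
print(share[:k].sum(), share[-50:].sum(), eig.min() > 0)
```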

It should be clear that answering these questions, and modeling the high
average correlation of returns, is important for any asset manager and, in fact, asset
pricing models are central to any asset management style not purely based on gut
feelings.
We can deal with these problems within a simple class of asset pricing models known
as “linear (risk) factor models”. Here we show some hints of how this is done in practice.
An asset pricing model begins with a “market model”, that is, a model which
describes asset returns (usually linear returns) as a function of “common factors” and
“idiosyncratic noise”. These models are, most frequently, linear models, and a typical
market model for the 1 × m vector of excess returns $r_t$ observed at time t, the 1 × k
vector $f_t$ of “common risk factors” observed at time t and the 1 × m vector of errors
$\epsilon_t$ is:
$$r_t = \alpha + f_t B + \epsilon_t$$
where B is a k × m matrix of “factor weights” and α is a 1 × m vector of constants.
We suppose to observe the vectors $r_t$ and $f_t$ for n time periods t. Stacking the
n vectors of observations for $r_t$ and $f_t$ in the n × m matrix R and the n × k matrix
F, and stacking the corresponding error vectors in the n × m matrix $\epsilon$, we suppose:
$E(\epsilon|F) = 0$, $V(\epsilon_t|F) = \Omega$ and $E(\epsilon_t' \epsilon_{t'}|F) = 0$ for all $t \neq t'$. In order to give meaning to the
term “idiosyncratic”, the contemporaneous covariance matrix Ω is, as a rule, supposed
to be diagonal, typically with non equal variances.
It is relevant to stress the fact that such a time series model can be a good
explanation of the data on R (for instance it may show a high R² for each return series)
while, at the same time, no asset pricing model is valid.
Let us recall that, if we estimate the market model with OLS (this may be done
security by security or even jointly), the OLS estimate of α can be written in a compact
way as
$$\hat\alpha = \bar r - \bar f \hat B$$
where $\bar r$ is the 1 × m vector of average excess returns (one for each security, averaged
over time), $\bar f$ is the 1 × k vector of average common factor values (again: averaged
over time) and $\hat B$ is the matrix of OLS estimated factor weights (one for each
factor for each security: k × m).
The expected value of this, under the above hypotheses, is:
$$E(\hat\alpha) = E(r) - E(f)B$$
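The compact expression for the OLS alpha is an exact algebraic identity of OLS with an intercept, which can be checked numerically. This is a sketch on simulated data; dimensions and parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, m = 500, 2, 4
F = rng.normal(0.001, 0.02, size=(n, k))          # n x k factor observations
B = rng.normal(0.8, 0.3, size=(k, m))             # true k x m factor weights
alpha = np.array([0.0005, 0.0, -0.0003, 0.001])   # true 1 x m alphas
R = alpha + F @ B + rng.normal(0.0, 0.01, size=(n, m))

# Joint OLS of all m excess return series on a constant plus the k factors.
X = np.column_stack([np.ones(n), F])
coef = np.linalg.lstsq(X, R, rcond=None)[0]
alpha_hat, B_hat = coef[0], coef[1:]

# The compact formula: alpha_hat = rbar - fbar @ B_hat, exactly.
check = R.mean(axis=0) - F.mean(axis=0) @ B_hat
print(np.max(np.abs(alpha_hat - check)))
```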
As we shall see in a moment, an asset pricing model is valid if, supposing Ω diagonal,
we have α = 0.
This is usually written as:
$$E(r) = \lambda B$$
where $\lambda = E(f)$ is a 1 × k vector of “prices of risk”; in a moment we shall see why
this name is used.

It is now important to stress that this restriction may hold, so that the asset pricing
model is valid, even though the time series model could offer a very poor
fit of r; or, on the contrary, the fit could be very good and yet α ≠ 0.³⁷
For asset management purposes, however, a possible good fit for the time series
model with a k << m could be very useful even when the asset pricing model does not
hold.
Suppose, for instance, you want to use a Markowitz model for your asset allocation.
In order to do this you need to estimate the variance covariance matrix of returns.
This requires the estimation of m(m + 1)/2 unknown parameters using n observations
on returns. With a moderately big m this could be a hopeless task.
Suppose now that the market model works, at least in the time series sense, that is,
that the R² of each of the m linear models is big. In this case the variances of the
errors are small and:
$$V(r_t) = B'V(f_t)B + \Omega \cong B'V(f_t)B$$
Let us now count the parameters we need to estimate the varcov matrix of the
excess returns with and without the market model. Without the market model, the
estimation of V (rt ) would require the estimation of m(m + 1)/2 parameters, while with
the factor model it requires the estimation of k × m + k × (k + 1)/2 parameters, that is
B and $V(f_t)$. Suppose, for instance, m = 500 and k = 10: the direct estimation of $V(r_t)$
implies the estimation of 125250 parameters, while the (approximate) estimate based
on the factor model requires “only” 5000 + 55 parameters.
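The parameter counting can be written down explicitly (a trivial sketch; the function names are ours):

```python
# Number of free parameters in an m x m symmetric covariance matrix
# versus in a k-factor approximation (loadings B plus factor covariance).
def direct_params(m: int) -> int:
    return m * (m + 1) // 2

def factor_params(m: int, k: int) -> int:
    return k * m + k * (k + 1) // 2

print(direct_params(500), factor_params(500, 10))  # 125250 vs 5055
```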
The reader should notice that, even if the above assumptions for $V(r_t)$ are right,
the use of $B'V(f_t)B$ in place of the full covariance matrix shall imply an underestimation
of the variance of each asset return, which is going to be negligible only if all
the R² are big.
Let us move on a step. We must remember that our aim is the construction of
portfolios of securities with weights w and excess returns $r_t w$.
In this case we are not necessarily interested in the full $V(r_t)$ but in the variance of the
portfolio
$$V(r_t w) = w'B'V(f_t)Bw + w'\Omega w$$
It is quite possible that $w'\Omega w$ is small, so that $w'B'V(f_t)Bw$ is a good approximation
of $V(r_t w)$, even if it is not true that all the R² are big and, by consequence, the diagonal
elements of Ω small.
Suppose that the weights w of the different securities in this portfolio are all of the
order of 1/m. This simply means that no single security dominates the portfolio.
³⁷ Beware: what was just described as a possible test of an asset pricing model is useful for the
understanding of the loose interplay between the time series model and the asset pricing model but
it is, typically, not a very efficient way, from the statistical point of view, to test the validity of an
asset pricing model.
We have, then:

w′Ωw = ∑_{i=1}^m w_i² ω_i ≈ (1/m) ∑_{i=1}^m ω_i/m

and this, with bounded, but not necessarily small, diagonal elements ω_i of Ω, goes to
0 as m goes to infinity.
This means that, for large, well diversified, portfolios, “forgetting” Ω is irrelevant
even if its diagonal elements are not small. The hypothesis of a diagonal Ω, that is, of
idiosyncratic ε_t, is fundamental for this result.
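A quick numeric illustration of this limit (the idiosyncratic variances ω_i below are invented, drawn between 0.5 and 2, so bounded but not small):

```python
# With equal weights 1/m, w' Omega w shrinks like 1/m even when the
# idiosyncratic variances omega_i stay bounded away from zero.
import numpy as np

rng = np.random.default_rng(0)
idio_var = {}
for m in (10, 100, 1000):
    omega = rng.uniform(0.5, 2.0, size=m)   # bounded, not small
    w = np.full(m, 1.0 / m)                 # equal weights: no security dominates
    idio_var[m] = w @ np.diag(omega) @ w    # w' Omega w, roughly mean(omega)/m
print(idio_var)
```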
From this result, we can shed some light on why we should have E(r) =
E(f)B = λB, that is: why an asset pricing model should hold.
In order to understand this, it is enough to compute the expected value and the
variance of our well diversified portfolio (notice the approximation sign for the variance):

E(r_t w) = E(r_t)w = αw + E(f_t)Bw

V(r_t w) ≈ w′B′V(f_t)Bw
Suppose now α ≠ 0, recall that B is a k × m matrix with (supposedly) k ≪ m, and
note that we can always suppose the rank of B to be k (if this is not the case we can reduce
the number of factors).
This implies that the matrix B′V(f_t)B is an m × m matrix of rank k < m. The matrix
B′V(f_t)B is then only SEMI positive definite, which implies that there exist m − k orthogonal
vectors z such that z′z = 1 and z′B′V(f_t)Bz = 0.
According to what was discussed in the matrix algebra section and in the presentation
of the spectral theorem, under conditions we do not specify here, we can always build
from these a set of weights w_$ such that w_$′1 = 1 and αw_$ > 0.
You should now understand the reason for the dollar sign. The vector w_$ defines
a zero risk portfolio (zero variance) with positive excess return αw_$ (since the
variance is zero, the expected excess return coincides with the excess return).
In other words, we have created a risk free security (the portfolio) which yields a return
(arbitrarily) greater than the risk free rate. This is an “arbitrage”, as one could borrow
any amount of money at the risk free rate and invest it in the portfolio with a positive
profit and no risk (hence the $). Provided all the financial operations involved (building
the portfolio, borrowing money, etc.) are possible, this should not happen if traders are
“reasonable” (and if they know of the existence of the factor model).
The only way to unconditionally (that is: whatever the choice of w_$) avoid this is
that α = 0, so that

E(r) = E(f)B = λB
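The construction of such a w_$ can be sketched numerically. Everything below (B, α, the random seed) is invented for illustration: the null space of B collects all zero-factor-variance weight vectors, and within it we solve for weights summing to one with αw > 0.

```python
# Hedged numeric sketch of the arbitrage construction: with alpha != 0 and
# B of rank k < m, find w with B @ w = 0 (zero factor risk), sum(w) = 1
# and alpha @ w > 0.
import numpy as np

rng = np.random.default_rng(1)
k, m = 2, 6
B = rng.normal(size=(k, m))             # hypothetical k x m loadings, rank k
alpha = rng.normal(scale=0.01, size=m)  # hypothetical non-zero alphas

# Orthonormal basis N of the null space of B (B @ N = 0): any w = N @ a
# has zero factor variance w' B' V(f) B w.
_, _, Vt = np.linalg.svd(B)
N = Vt[k:].T                            # m x (m - k)

g = N.T @ np.ones(m)                    # null-space coordinates of the budget constraint
h = N.T @ alpha                         # null-space coordinates of alpha
h_perp = h - (h @ g) / (g @ g) * g      # part of h orthogonal to g
s = (1.0 + abs(h @ g) / (g @ g)) / (h_perp @ h_perp)
a = g / (g @ g) + s * h_perp            # keeps sum(w) = 1, pushes alpha @ w above 0
w = N @ a                               # the "arbitrage" weights w_$

print(B @ w, w.sum(), alpha @ w)        # ~0 vector, 1, strictly positive
```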
Let us now give a “financial interpretation” of this result.
Since each element β_ji of B represents the “amount” of non diversifiable factor f_j
in the excess return of security i, and E(f_j) represents the expected excess return of
a security which has a “beta one” with respect to the j-th factor and zero with respect to
the others (if the factor f_j is the excess return of a security, this could simply be the
expected excess return of that security, but this is not required), we may understand the name
“price of risk for factor j” used for E(f_j) = λ_j and “risk premium for factor j” given to
the “price times quantity” product λ_j β_ji.
Now that we have a rough idea of how an asset pricing model works, it could be
useful to go back to the questions with which this section began and think a little
about how the answers follow from the asset pricing model.
We should first notice that the approximation

V(r_t w) = w′B′V(f_t)Bw + w′Ωw ≈ w′B′V(f_t)Bw

is a formal interpretation of the empirical fact that correlations among quoted
securities returns are, on average, high.
The interpretation is based on the idea that all returns depend (in different ways)
on the same underlying factors and what is “not factor” is uncorrelated across returns.
For this reason, well diversified portfolios of securities tend to show returns whose
variance only depends on that of factors.
As a consequence, it shall be difficult to build many such well diversified portfolios
which are not correlated with one another.
In fact, if we assume the above approximation to be exact, and V(f_t) to be non
singular, exactly k such portfolios can be built, the choice being unique modulo an
orthogonal transform. In this case it is quite tempting to interpret any choice of such
k uncorrelated portfolios as a “factor estimate”. Some aspects of this idea shall be
reconsidered when presenting the “principal component” approach to risk factor estimation.
Asset pricing models give a very precise answer to the puzzle of why securities are
traded in the market even when they show, at the same time, lower average returns
and higher standard deviations than other traded securities. This is a possible
equilibrium because what is relevant is not the “absolute risk” (marginal standard
deviation of returns) of a security, but its contribution to the risk/return mix
in a well diversified portfolio. For this reason, we may see a relatively low average
return together with a high standard deviation simply because the security showing these
statistics has little correlation with systematic risk factors. The model tells us many
other interesting things regarding this point. For instance: it tells us that if we see two
securities with, say, the same average returns and very different return standard
deviations, the correlation between the returns of these securities should be small.
Last: asset pricing models give us formulas for measuring the “right mix” of expected
returns and correlation with systematic risk factors (betas), and this answers the
question about the right equilibrium relative prices. On this basis, asset pricing models
give us a unified framework to precisely quantify and test the equilibrium price system
and to transform the statistical results into asset management tools.

11.2 Estimates for B and F


When the factors F are observable variables, the matrix B can be estimated using OLS
(in fact a slightly better estimator exists, but it is outside the scope of these notes).
This, in principle, is what we did with the style model which could, with some
indulgence, be considered as the “market model” part of an asset pricing model.
In fact, for the style analysis method to work it is not strictly necessary that the
style model corresponds to a full market model. This is due to the fact that, in style
analysis, the model is used as a reference benchmark only.
The joint use of a benchmark model which is also a market model would, in any
case, be in theory a more coherent choice.
We also discussed this in the case of the CAPM. In the CAPM there exists a
single common factor, represented by the wealth of agents, intended as everything that
impacts agents’ utility, as risked on the market.
This cannot be directly observed and is proxied in practice by some market index,
plus m idiosyncratic factors supposed to be uncorrelated with the common factor and
among themselves. If we believe in the quality of the proxy for the wealth of agents,
an OLS estimate shall work in this case too.
The typical asset pricing model uses as factors some CAPM-like index plus ob-
servable macroeconomic variables. The Fama and French model is a CAPM plus two
long/short portfolios for value stocks (low against high price to book value) and size
stocks (later a momentum portfolio was added).
A huge academic industry of “finding relevant risk factors” to “explain the cross sec-
tion of stock returns” (recall the second stage regression) arose from this, with hundreds
of papers and suggested risk factors38.
38 For those interested, read: Campbell R. Harvey, Yan Liu, Heqing Zhu, “. . . and the Cross-Section
of Expected Returns”, The Review of Financial Studies, Volume 29, Issue 1, January 2016, Pages 5–68.
In this very interesting and funny paper the Authors attempt a wide review of the risk factors suggested
for market models in published papers up to 2015. They consider 313 papers and 316 different, but
often correlated, factors. The Authors are very clear about the fact that this is, actually, not a complete
review of the published and unpublished research on the topic. The Authors summarize the results
and stress the important statistical implications due to the fact that, using, in the vast majority, data
on the US stock market or on markets correlated with this, these papers are not based on independent
experiments or observational data, but on what are, in essence, different time sections of the same
dataset. This is a classic case of the “data mining”, “multiple testing” or “exhausted data” problem,
sometimes also called “pretesting bias”. In this case many, in general dependent, tests are run on the
same dataset. Often, tests are chosen and run conditional on the result of other tests. This requires a
very careful assessment of the joint P-value of the testing procedure, which cannot be reduced to a test
by test analysis. The result of such an assessment is that individual tests should be run under increasingly
stringent “significance” requirements as new hypotheses are tested in addition to old ones. This
quickly makes it impossible to test new hypotheses on the same “exhausted” dataset.
Current practitioner models, widely used in the asset management industry for
asset allocation, risk management and budgeting, and performance evaluation, include,
in my experience, roughly from 10 to 15 risk factors and are tuned to specific
asset classes, so that they do not pretend to be general market models.
All these models can, in principle, be dealt with by regression methods.
There is, however, a different attitude toward factor modeling.
This attitude attempts a representation of underlying unobserved factors based on
portfolios of securities which are not defined a priori but are jointly estimated with the
model by optimizing some “best fit” criterion.
In order to do this, we need a joint estimation of F , the matrix of observation on
all factors, and B the factor weights matrix.
A common starting point is to require the factors f_t to be linear combinations
of excess returns: f_t = r_t L.
In principle there are infinitely many choices for L. A unique solution can be chosen only
by imposing further constraints. Each choice of constraints identifies a different set of
factors.
Most frequently, factor models of this kind are based on the principal components
method or on variants of this.
The principal components method is a classic data reduction method for Multivari-
ate Statistics which has received a lot of new interest with the growth of “big data”.
In Finance, principal components have been used since at least the 1960s and 1970s.
We can describe the procedure of “factor extraction”, that is, the unique identifica-
tion/estimation of factors, in two different but equivalent ways.
Both methods require, implicitly or explicitly, an a priori, maybe very rough, esti-
mate of V(r_t). For this to be possible, a fundamental assumption is that V(r_t) = V(r),
that is: the variance covariance matrix of excess total returns is time independent.
When this is not assumed to hold, more complex methods than simple principal
components are available, but they are well beyond the scope of these notes.

11.2.1 Principal components as factors


As a starting point, suppose that the variance covariance matrix V(r) of a 1 × m vector of
returns r is known.
We introduce the principal components, at first, in an arbitrary way. In the follow-
ing subsection we shall justify the choice.
From the spectral theorem we know that V(r) = XΛX′. By the rules of matrix
product, and recalling that Λ is diagonal, we have:

XΛX′ = ∑_i x_i x_i′ λ_i

where x_i is the i-th column of X and the sum runs from 1 to m.


Notice that, in general, only k eigenvalues λ_j are greater than 0 while the others
are equal to 0; here k is the rank of V(r). For simplicity, in the following formulas we
suppose k = m, but with proper changes of indexes the formulas are correct in general.
Define the “principal components” as the “factors” (and remember: principal com-
ponents are linear combinations of returns) f_j = r x_j and regress r on f_j^39. These are
m univariate regressions and the “betas” (one for each return in r) of these regressions
are, as usual^40:

β_j = Cov(f_j; r)/V(f_j) = [E(x_j′r′r) − E(x_j′r′)E(r)]/V(f_j) = x_j′V(r)/V(f_j)

However:

x_j′V(r) = x_j′XΛX′ = ∑_i x_j′ x_i x_i′ λ_i = λ_j x_j′

and

V(f_j) = V(r x_j) = x_j′XΛX′x_j = λ_j

so that:

β_j = x_j′
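The identity β_j = x_j′ is easy to verify numerically; the covariance matrix below is invented for illustration:

```python
# Check that regressing returns on the j-th principal component f_j = r x_j
# gives beta_j equal to the j-th eigenvector, as derived above.
import numpy as np

rng = np.random.default_rng(2)
m = 4
C = rng.normal(size=(m, m))
V = C @ C.T                       # a generic (invented) covariance matrix V(r)
lam, X = np.linalg.eigh(V)        # V = X Lambda X', eigenvalues ascending

j = m - 1                         # take the largest-eigenvalue component
xj = X[:, j]
beta = (xj @ V) / lam[j]          # beta_j = x_j' V(r) / lambda_j
print(np.allclose(beta, xj))      # True: beta_j = x_j'
```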
Let us now find V(r − f_j β_j):

V(r − f_j β_j) = V(r − r x_j x_j′) = [I − x_j x_j′] V(r) [I − x_j x_j′] =

= [I − x_j x_j′] XΛX′ [I − x_j x_j′] = [XΛX′ − λ_j x_j x_j′][I − x_j x_j′] =

= XΛX′ − λ_j x_j x_j′ − λ_j x_j x_j′ + λ_j x_j x_j′ =

= XΛX′ − λ_j x_j x_j′ = ∑_{i≠j} x_i x_i′ λ_i = X_{−j} Λ_{−j} X_{−j}′

39 Here the regression is to be intended as the best approximation of r_i by means of a linear trans-
formation of f_j. The intercept is included, see next note.
40 Notice that the definition of β_j employed here implies the use of an intercept. We have not
mentioned it, since we are interested in the variance covariance matrix of r, which is unaffected
by the constant. In any case, the value of the constant 1 × m vector α is E(r) − E(f)β = 0.
where X_{−j} and Λ_{−j} are, respectively, the X matrix with column j dropped and the Λ
matrix with row and column j dropped.
In other words, the covariance matrix of the “residuals” r − f_j β_j has the same
eigenvectors and eigenvalues as the original covariance matrix, with the exception of
the eigenvector and eigenvalue involved in the computation of f_j.^41
This result is due to the orthogonality of factors^42 and has several interesting im-
plications. We mention just three of these.
First: “factor extraction”, that is, the computation of the f’s and the corre-
sponding residuals, yields the same results whether performed in batch or one component at a time.
Second: the result is invariant to the order of computation.
Third: once all factors are considered the residual variance is 0.
This last, obvious, result can be written as follows. If we set F = rX we have
r = FX′. Grouping in F_q and X_q the first q factors and columns of X, and in F_{m−q}
and X_{m−q} the rest of the factors and columns of X, we have:

r = ∑_{i=1}^m f_i x_i′ = ∑_{i=1}^q f_i x_i′ + ∑_{i=q+1}^m f_i x_i′ = F_q X_q′ + F_{m−q} X_{m−q}′

which we are tempted to write as:

r = F_q X_q′ + e
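The split above can be sketched on simulated data (all numbers invented); with all m components the representation is exact, and the residual e is exactly the part carried by the excluded components:

```python
# Decompose (centered) simulated returns into F_q X_q' plus a residual.
import numpy as np

rng = np.random.default_rng(3)
n, m, q = 200, 5, 2
R = rng.normal(size=(n, m))              # n observations on m invented returns
Rc = R - R.mean(axis=0)                  # centered data
V = np.cov(R, rowvar=False)
lam, X = np.linalg.eigh(V)
X = X[:, np.argsort(lam)[::-1]]          # eigenvectors by decreasing variance

F = Rc @ X                               # all m principal components
Fq, Xq = F[:, :q], X[:, :q]
e = Rc - Fq @ Xq.T                       # residual of the first-q representation
print(np.allclose(F @ X.T, Rc))          # True: with all components, r = F X' exactly
```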
Now recall the initial factor model (we drop the t suffix for the moment):

r = fB + ε

It is tempting to equate F_q to f and X_q′ to B for some q. At the same time it is
tempting to equate e with ε^43. Now, given the above construction, it is always possible
41 A fully matrix notation makes the derivations even simpler. If we suppose V(r) invertible with
eigenvector matrix X, we have rX = F and immediately r = FX′, so principal components are linear
combinations of returns and vice versa. Moreover V(F) = V(rX) = X′V(r)X = X′XΛX′X = Λ, that is:
principal components are uncorrelated and each has as variance the corresponding eigenvalue. Then,
if we split X vertically into two sub-matrices X = [X_1 X_2], we have rX = F = [rX_1 rX_2] = [F_1 F_2]
and r = FX′ = F_1X_1′ + F_2X_2′, where V(F_2X_2′) = X_2Λ_2X_2′. Since principal components are
uncorrelated, this implies that, whatever the number of components in F_1, their regression coefficients
shall always be the same and correspond to the transposes of their eigenvectors (the first statement is a
direct consequence of non correlation and the second was demonstrated in the text). In matrix terms:
the “linear model” estimated with OLS, r = F_1B̂_1 + Û_1, holds with B̂_1 = X_1′ and Û_1 = F_2X_2′.
42 Orthogonality here means that the factors are uncorrelated.
43 It could be argued here that the expectation of e is not zero. Recall, on the other hand, that the
expected returns are typically nearer to zero than most observed returns, due to high volatility. This is
particularly true when daily data are considered. Moreover, the non zero mean effect is damped down
by the “small” matrix X_{m−q}. Hence the expected value of e can be considered negligible.
to build such a representation of r. The question is whether, given a pre-specified
model r = fB + ε, the above described method shall identify f, B and ε. The answer
is: “in general, no”.
In fact the two formulas are only apparently similar, and they become identical only
under some hypotheses. These are:

1. The dimension of f is q.

2. V(f) is diagonal.

3. BB′ = I.

4. The rank of V(ε) is m − q and the maximum eigenvalue of V(ε) is smaller than
the minimum element on the diagonal of V(f).

To these hypotheses we must add the already mentioned requirement that f and ε are
orthogonal.
For any given fB the second and third hypotheses can always be satisfied if V(fB)
is of full rank. In fact, in this case, it is always possible, using the procedure described
above, to write fB = f̃B̃ where the required conditions are true for f̃ and B̃ (remember
that, if the f are unobservable, there is a degree of arbitrariness in the representation).
Hypothesis one is more problematic: all we observe is r and we do not know, a
priori, the value of q.
But the most relevant (and interesting) hypothesis is that the rank of V(ε) is m − q
and its eigenvalues are all smaller than the eigenvalues of V(fB).
This may well not be the case: in fact we could consider examples where ε is a
vector orthogonal to the elements of f but V(ε) is of full rank and/or its eigenvalues
are not all smaller than those of V(fB).
For instance: in classical asset pricing models (CAPM, APT and the like) the main
difference between residuals and factors is not that the variance contributed by the
factors to the returns is bigger than the variance contributed by “residuals”, but that
factors are common to different securities, so that they generate correlation of returns,
while residuals are idiosyncratic, that is: they should be uncorrelated across securities.
While principal component analysis guarantees zero correlation across different fac-
tors, residuals in the principal component method are by no means constrained to be
uncorrelated across different securities.
In fact, since the varcov matrix of residuals is not of full rank, some correlation
between residuals must exist, and it shall in general be higher if many factors are used in
the analysis^44.
44 Suppose the row vector z of k random variables has a varcov matrix A such that Rank(A) = h. Then
at most h linear combinations of the elements of z can be uncorrelated. The proof is easy. Suppose
a generic number g of uncorrelated linear combinations of z exists, let these g linear combinations be
equal to u = zG, and suppose, without loss of generality, that by a proper choice of the G weights the
variance of each u is 1. Since the u are uncorrelated we have V(u) = G′AG = I_g. Since the rank of
I_g is g, and the rank of a product is less than or equal to the minimum rank of the involved matrices,
the rank of A must be greater than or equal to g; but, by hypothesis, we know it to be equal to h, so
g cannot be bigger than h (we could go on and show that it is in fact equal to h, but we only wanted
to show that AT MOST h linear combinations can be uncorrelated).
While this is not the place for a detailed analysis of this important point, it is
useful to introduce it as a way of remembering that r = F_q X_q′ + e is, before all, a
representation of r, and only under (typically non testable) hypotheses an estimate
of a factor model.
In our setting we need the representation in order to simplify the estimation of
V(r); while the interpretation of the result as the estimation of a factor model is very
useful when possible, the simple representation shall be enough for our purposes.
It should always be remembered that our purpose is not the precise estimation of
each element of V(r). What we really hope for is a sensible estimate of the variance
of reasonably diversified portfolios made with the returns in r. In this case, even if
the estimate of V(r) is rough, it may well be that the estimate of a well diversified
portfolio variance is fairly precise since, by itself, diversification shall erase most of the
idiosyncratic components in the variance covariance matrix.
This intuitive reasoning can be made precise, but this is beyond the purpose of our
introductory course.
A last point of warning is required. If we use enough principal components, then
F_q X_q′ behaves almost as r (the R² of the regression is big). The “almost” clause is
important. Suppose you invest in a portfolio with weights x_{q+1}/(x_{q+1}′1_m), that is, a
portfolio with correlation 1 with the first excluded component (the denominator of the
weights is there in order to have the portfolio weights sum to 1). By construction the
variance of this portfolio is λ_{q+1}/(x_{q+1}′1_m)². However, the covariance of this portfolio
with the included components is zero. In other words: if we measure the risk of any
portfolio by computing its covariance with the set of q principal components included
in the approximation of V(r), we shall assign zero risk to a portfolio correlated with one
(or many) excluded components.
The practical implications of this are quite relevant, but a thorough discussion is
outside the purpose of these handouts. However: beware!
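A numeric check of this warning on simulated (invented) data: a combination built on the first excluded eigenvector has clearly positive variance, yet zero sample covariance with every included component:

```python
# A portfolio proportional to the first excluded eigenvector is "invisible"
# to a risk measure based on covariances with the included components.
import numpy as np

rng = np.random.default_rng(4)
n, m, q = 300, 5, 2
R = rng.normal(size=(n, m))
Rc = R - R.mean(axis=0)
V = np.cov(R, rowvar=False)
lam, X = np.linalg.eigh(V)
X = X[:, np.argsort(lam)[::-1]]        # eigenvectors by decreasing eigenvalue

F_incl = Rc @ X[:, :q]                 # the q included components
p = Rc @ X[:, q]                       # combination on the first EXCLUDED eigenvector
cov = F_incl.T @ p / (n - 1)           # sample covariances with included components
print(cov, p.var(ddof=1))              # covariances ~0, variance clearly positive
```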
We introduced the factors/components F in a somewhat arbitrary way, deriving
them from the spectral theorem. The question now is: are there other justifications for them?

11.3 Maximum variance factors


In the preceding section we derived a principal component representation of a return
vector by comparing the spectral theorem with the general assumptions of a linear
factor model.
Here we follow a different path: we characterize each principal component (suitably
renormalized) as a particular “maximum risk” portfolio with the constraint that each
component must be orthogonal to each other component and that the sum of squared
weights should be equal to one.
Linear combinations of returns are (up to a multiplicative constant) returns of
(constant relative weights) portfolios^45. Given a set of returns it is interesting to answer
the question: which are the weights of the maximum variance linear combination of
returns? (We repeat: this is not the same as the maximum variance portfolio.)
This problem is not well defined, as the variance of any portfolio (provided it is not
0) can be set to any value by multiplying its weights by a constant.
It could be suggested to constrain the sum of the weights to one; however, this does
not solve the problem: by considering multiples of the different positions the
requirement can be satisfied and the variance set to any number, at least if weights
are allowed to be both positive and negative.
A possible solution is to set the sum of the absolute values of the weights to one. This would
both solve the problem and have a financial meaning. Alas, this can be done, but only
numerically.
Suppose instead we set the sum of squared weights to 1. This solves the bounding
problem, with the inconvenience that the resulting linear combination shall in general
not be a portfolio. But this choice yields an analytic solution.
Let us set up the mathematical problem:

max_{θ: θ′θ=1} V(rθ) = max_{θ: θ′θ=1} θ′V(r)θ

The Lagrangian for this problem is:

L = θ′V(r)θ − λ[θ′θ − 1]

so that the first order conditions are:

V(r)θ − λθ = 0
45 As hinted at in several places of these handouts, given a linear combination of returns, there exist
at least two ways of converting it into the return of a portfolio. If we only want the required portfolio
to be perfectly correlated with the given linear combination, all that is needed is to renormalize the
weights by dividing them by their sum (provided this is not zero). If we wish for a portfolio with
the same weights (on risky assets) and the same variance as those of the linear combination, we must
simply add to the linear combination the return of a risk free security with, as weight, the difference
between one and the sum of the linear combination’s weights. Notice that in this second case, while
the (one time period) variance of the linear combination shall be the same as the variance of the
portfolio return (the risk free security has no variance for a single time period), the expected value
shall be different. In fact, if the weight of the risk free security is greater than zero, the expected value
of the portfolio return shall be (with a positive return assumed for the risk free security) greater than
the expected value of the linear combination of returns; the opposite holds in case of a negative weight.
and

θ′θ = 1
Rearranging and using the spectral theorem we have:

[XΛX′ − λI]θ = 0

We see that, if we set θ = x_j and λ to the corresponding λ_j, for any j we have a solution
of the first order conditions. Since V(rx_j) = λ_j, the solution to the maximum variance problem is
given by the pair x_1 and λ_1 where, as usual, we suppose the eigenvalues sorted by decreasing size.
From what was discussed in the previous section, the other solutions can be seen as the
maximum variance linear combinations of returns, where the maximum is taken with
the added constraint of being orthogonal to the previously computed linear combina-
tions.
We see that the components defined in a somewhat arbitrary way in the previous
section become now orthogonal (conditional) maximum variance linear combinations.
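A quick check, on an invented covariance matrix, that no unit vector beats the top eigenvector:

```python
# Over unit vectors theta, theta' V theta never exceeds the top eigenvalue,
# which is attained at the top eigenvector.
import numpy as np

rng = np.random.default_rng(5)
C = rng.normal(size=(4, 4))
V = C @ C.T                               # an invented covariance matrix
lam, X = np.linalg.eigh(V)                # ascending eigenvalues
x1, lam1 = X[:, -1], lam[-1]              # top eigenpair

thetas = rng.normal(size=(1000, 4))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # random unit vectors
vals = np.einsum("ij,jk,ik->i", thetas, V, thetas)       # theta' V theta per row
print(vals.max(), lam1)                   # the random maximum never exceeds lam1
```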

11.4 Bad covariance and good components?


Suppose now that V(r) is not known. In particular, our problem is to estimate such a
matrix when m, the number of stocks, is big (say 500-2000). What we wrote up to this
point suggests a way of simplifying a given variance covariance matrix using principal
components. What happens when the variance covariance matrix is not given and
must be estimated?
Obviously we could start with some standard estimate of V(r). For instance,
suppose we stack in the n × m matrix R our data on returns and estimate V̂(r) =
R′R/n − R′1_n 1_n′R/n², where 1_n is a column vector of n ones. Then we could proceed
by extracting the principal components from V̂(r).
It could puzzle the reader that, in order to estimate the factor
model, whose purpose is to make a sensible estimate of the covariance matrix possible,
we need some a priori estimate of that same matrix. A complete answer to this question
is outside the scope of these notes (this sentence appears an annoying number of times,
doesn’t it?); however, the intuition underlying a possible explanation is connected with
the fact that, in principle, the principal components could be computed without an
explicit a priori estimate of V(r). Given a sample R of n observations on r_t,
all that is needed, for instance in the case of the first component, is to find the vector
x_1 of weights such that the numerical variance of f_1 = Rx_1 is maximum (with the
usual constraint x_1′x_1 = 1). This can be done iteratively for all components. The idea
is that, even if the full V(r) is difficult to estimate, it may be possible to estimate the
highest variance components well, while the estimation problems are concentrated on the
lowest variance components.
More formally: we estimate V(r) with some V̂(r) = V(r) + E, where E is a positive
definite error matrix. Write the spectral decompositions of both matrices as V(r) =
∑_j x_j x_j′ λ_j and E = ∑_j e_j e_j′ η_j. Our hope is that the highest of the error eigenvalues η_j is
smaller than at least some of the eigenvalues of V(r). In this case the estimation error shall affect
the overall quality of the estimate V̂(r), but only with respect to the lowest eigenvalue
components.
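A rough numeric illustration of this hope (all numbers invented): a small error matrix barely rotates the top, well separated, component, while the crowded low-variance components are less stable:

```python
# Perturb a covariance matrix with a small positive definite error and
# compare the eigenvectors before and after.
import numpy as np

rng = np.random.default_rng(6)
m = 4
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))     # random orthonormal eigenvectors
V = Q @ np.diag([10.0, 1.0, 0.50, 0.45]) @ Q.T   # big gap on top, crowded bottom

A = rng.normal(scale=0.05, size=(m, m))
E = A @ A.T                                      # small positive definite "error"
_, X = np.linalg.eigh(V)
_, Xhat = np.linalg.eigh(V + E)

align = np.abs(np.diag(X.T @ Xhat))              # |cos| between matched eigenvectors
print(align)                                     # last entry (top component) is ~1
```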
In summary: the principal components are defined as f_t = r_t X, where X contains the
eigenvectors of the return covariance matrix. The principal components are uncorre-
lated return portfolios (recall that a constant coefficients linear combination of returns
is the return of a constant relative proportion strategy; moreover, recall that the sum of
weights in the principal component portfolios is not one). The variances of the princi-
pal components are the eigenvalues corresponding to the eigenvectors which constitute
the portfolio weights. We can derive a solution to the problem r_t = f_t B by simply
setting B = X′. The percentage of the variance of the j-th return due to the i-th principal
component can be computed as λ_i x_{ji}², the squared element of the j-th column of X′
times the corresponding eigenvalue, divided by the sum of these products over all
components (that is, by the variance of the j-th return).
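A sketch of the variance-share computation just described: since component i has variance λ_i, its contribution to Var(r_j) is λ_i x_{ji}², and the shares sum to one (the covariance matrix below is invented):

```python
# Share of Var(r_j) explained by each principal component.
import numpy as np

rng = np.random.default_rng(7)
C = rng.normal(size=(5, 5))
V = C @ C.T                              # invented covariance matrix of 5 returns
lam, X = np.linalg.eigh(V)               # V = X Lambda X'

j = 0
shares = lam * X[j, :] ** 2 / V[j, j]    # Var(r_j) = sum_i lambda_i * x_ji^2
print(shares, shares.sum())              # non-negative shares summing to one
```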
A simple PC analysis on a set of 6 stock return series can be found in the file
“principal components.xls”. A more interesting dataset, containing total return indexes
for 49 of the 50 components of the Eurostoxx50 index (weekly data), can be found
in the file “eurostoxx50.xls”. Principal components were computed using the add-in
MATRIX.

Examples
Exercise 11 - Principal Components.xls Exercise 11b - PC, Eurostoxx50.xls
12 Appendix: Some matrix algebra
12.1 Definition of matrix
A matrix A is an n-rows, m-columns array of elements; the elements are indicated by
a_{i,j}, where the first index stands for row and the second for column. n and m are called
the row and column dimensions (sometimes shortened to “the dimensions”) or sizes of
the matrix A. Sometimes we write: A is an n × m matrix.
Sometimes a matrix is indicated as A ≡ {a_{ij}}.
When n = m we say the matrix is square.
When the matrix is square and aij = aji we say the matrix is symmetric.
When a matrix is made of just one row or one column it is called a row (column)
vector.

12.2 Matrix operations


1. Transpose: A′ = {a_{ji}}. A′′ = A. If A is symmetric then A′ = A.

2. Matrix sum. The sum of two matrices C = A + B is defined if and only if
the dimensions of the two matrices are identical. In this case C has the same
dimensions as A and B and c_{ij} = a_{ij} + b_{ij}. Clearly A + B = B + A and (A + B)′ =
A′ + B′.

3. Matrix product. The product C = AB of two matrices n × m and q × k is defined
if and only if m = q. If this is the case, C is an n × k matrix and c_{ij} = ∑_l a_{il} b_{lj}. In
the matrix case it may well be that AB is defined but BA is not. An important
property is C′ = B′A′ or, which is the same, (AB)′ = B′A′. Provided the products
and sums involved in what follows are defined, we have (A + B)C = AC + BC.

12.3 Rank of a matrix


A row vector x is said to be linearly dependent on the row vectors of a matrix A if
it is possible to find a row vector z such that x = zA. The same for a column vector.
r(A) (or rank(A)), the rank of a matrix A, is defined as the number of linearly
independent rows or (the number is the same) the number of linearly independent
columns of A.
A square matrix A of size n is called non singular if r(A) = n.
If B is any n × k matrix, then r(AB) ≤ min(r(A), r(B)).
If B is an n × k matrix of rank n, then r(AB) = r(A).
If C is an l × n matrix of rank n, then r(CA) = r(A).
12.4 Some special matrix
1. A square matrix A with elements aij = 0, i 6= j is called a diagonal matrix.

2. A diagonal matrix with the diagonal of ones is called identity and indicated with
I. IA = A and AI = A (if the product is defined).

3. A matrix which solves the equation AA = A is called idempotent.

12.5 Determinants and Inverse


There are several alternative definitions for the determinant of a square matrix.
The Leibniz formula for the determinant of an n × n matrix A is:

det(A) = |A| = ∑_{σ∈S_n} sgn(σ) ∏_{i=1}^n A_{i,σ_i}

Here the sum is computed over all permutations σ of the set {1, 2, ..., n} and sgn(σ)
denotes the signature of σ; it is +1 for even σ and −1 for odd σ. Evenness or oddness
can be defined as follows: the permutation is even (odd) if the new sequence can be
obtained by an even number (odd, respectively) of switches of elements in the set.
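The Leibniz formula can be transcribed directly into code (hopelessly inefficient beyond small n, but a faithful rendering of the definition), and checked against NumPy's determinant on a small invented matrix:

```python
# Leibniz formula: det(A) = sum over permutations of sgn(sigma) * prod A[i, sigma(i)].
import itertools
import numpy as np

def sgn(perm):
    """Signature of a permutation of 0..n-1: +1 if even, -1 if odd."""
    s = 1
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            if perm[i] > perm[j]:      # count inversions
                s = -s
    return s

def leibniz_det(A):
    n = len(A)
    return sum(sgn(p) * np.prod([A[i, p[i]] for i in range(n)])
               for p in itertools.permutations(range(n)))

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(leibniz_det(A))                  # 8.0, matching np.linalg.det(A)
```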
The inverse of a square matrix A is the solution A−1 (or inv(A)) to the equations
A−1 A = I = AA−1 .
If A is invertible then (A0 )−1 = (A−1 )0
The inverse of a square matrix A exists if and only if the matrix is non singular
that is if the size and the rank of A are the same.
A square matrix is non singular if and only if it has non null determinant.
det(A−1 ) = 1/ det(A)
If the products and inversions in the following formula are well defined (that is
dimensions agree and the inverse exists), then (AB)−1 = B −1 A−1 .
Inversion has to do with the solution of linear non homogeneous systems.
Problem: find a column vector x such that Ax = b with A and b given.
If A is square and invertible then the unique solution is x = A−1 b.
If A is n × k with n > k but r(A) = k then the system Ax = b has in general no
exact solution, however the system A0 Ax = A0 b has the solution x = (A0 A)−1 A0 b.
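Both cases can be checked numerically; the sketch below (NumPy assumed, matrices made up for the example) solves a square system and then the normal equations A'Ax = A'b of an overdetermined one:

```python
import numpy as np

# Square, non singular case: the unique solution of Ax = b is x = A^{-1}b
# (np.linalg.solve computes it without forming the inverse explicitly).
A = np.array([[2., 1.],
              [1., 3.]])
b = np.array([1., 2.])
x = np.linalg.solve(A, b)
assert np.allclose(A @ x, b)

# Overdetermined case: n = 3 > k = 2, full column rank. Ax = b has no
# exact solution in general, but A'Ax = A'b always does.
A2 = np.array([[1., 0.],
               [1., 1.],
               [1., 2.]])
b2 = np.array([0., 1., 1.])
x2 = np.linalg.solve(A2.T @ A2, A2.T @ b2)
assert np.allclose(A2.T @ (A2 @ x2), A2.T @ b2)   # the normal equations hold
```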

12.6 Quadratic forms


A quadratic form with coefficient matrix given by the symmetric matrix A and variables
vector given by the column vector x (with size of A equal to the number of rows of x)
is the scalar given by:

x'Ax = Σ_i Σ_j a_ij x_i x_j

A symmetric matrix A is called semi positive definite if and only if


x0 Ax ≥ 0 for all x
It is called positive definite if and only if

x0 Ax > 0 for all non null x
If a matrix A can be written as A = C'C for some matrix C, then A is surely at least
psd. In fact x'Ax = x'C'Cx, but this is the product of the row vector x'C' times its
own transpose, hence a sum of squares, and this cannot be negative. It is also possible
to show that any psd matrix can be written as C'C for some C.
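A small numerical illustration of the C'C argument (a sketch; the random matrix C is arbitrary and made up for the example):

```python
import numpy as np

# Sketch: any matrix of the form A = C'C is at least psd, because
# x'Ax = (Cx)'(Cx) is a sum of squares.
rng = np.random.default_rng(0)
C = rng.standard_normal((5, 3))
A = C.T @ C

for _ in range(100):
    x = rng.standard_normal(3)
    # tiny negative tolerance only to absorb floating point rounding
    assert x @ A @ x >= -1e-12
```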

12.7 Random Vectors and Matrices (see the following appendix for more details)
A random vector, resp matrix, is simply a vector (matrix) whose elements are random
variables.

12.8 Functions of Random Vectors (or Matrices)


• A function of a random vector (matrix) is simply a vector (or scalar) function of
the components of the random vector (matrix).
• Simple examples are: the sum of the elements of the vector, the determinant of
a random matrix, sums or products of matrices and vectors and so on.
• We shall be interested in functions of the vector (matrix) X of the kind: Y =
A + BXC where A, B and C are non stochastic matrices of dimensions such that
the sum and the products in the formula are well defined.
• A quadratic form x'Ax with a non stochastic coefficient matrix A and stochastic
vector x is an example of a non linear, scalar function of a random vector.

12.9 Expected Values of Random Vectors


• These are simply the vectors (matrices) containing the expected values of each
element in the random vector (matrix).
• E(X 0 ) = E(X)0
• An important result which generalizes the linear property of the scalar version
of the operator E(.) for the general linear function defined above, is this E(A +
BXC) = A + BE(X)C.

12.10 Variance Covariance Matrix


• For random column vectors, and here we mean vectors only, we define the variance
covariance matrix of a column vector X as:
V (X) = V (X 0 ) = E(XX 0 ) − E(X)E(X 0 ) = E((X − E(X))(X − E(X))0 )

• The Varcov matrix is symmetric: on the diagonal we have the variances (V(Xi) =
σ²_Xi) of each element of the vector, while in the upper and lower triangles we have
the covariances (Cov(Xi; Xj)).

• The most relevant property of this operator is:

V(A + BX) = BV(X)B'

• From this property we deduce that varcov matrices are always (semi) positive
definite. In fact if A = V(z) and x is a (non random) column vector of the same
size as z, then V(x'z) = x'Ax, which cannot be negative for any possible x.
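The property V(A + BX) = BV(X)B' can be checked by simulation; the sketch below draws a large sample with a known covariance matrix and compares the empirical covariance of the transformed vector with BV(X)B' (Monte Carlo, so the match is only approximate; all the numbers are made up for the example):

```python
import numpy as np

# Monte Carlo sketch of V(a + BX) = B V(X) B'.
rng = np.random.default_rng(42)
V = np.array([[2.0, 0.5],
              [0.5, 1.0]])
L = np.linalg.cholesky(V)
X = rng.standard_normal((100_000, 2)) @ L.T    # rows have varcov L L' = V

a = np.array([1.0, -1.0, 0.0])                 # the additive constant
B = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
Y = a + X @ B.T                                # each row is a + B x

V_hat = np.cov(Y, rowvar=False)                # empirical varcov of Y
assert np.allclose(V_hat, B @ V @ B.T, atol=0.1)
```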

12.11 Correlation Coefficient


• The correlation coefficient between two random variables is defined as:

ρ_Xi;Xj = Cov(Xi; Xj) / (σ_Xi σ_Xj)

The correlation matrix ρ(x) of the vector x of random variables is simply the
matrix of correlation coefficients or, that is the same, the Varcov matrix of the
vector of standardized Xi.

• A zero correlation between two random variables is sometimes called linear
independence or orthogonality. The reader should be careful using these terms
as they also exist in the setting of linear algebra but their meaning, even if
connected, is slightly different. Stochastic independence implies zero correlation;
the reverse proposition is not true.

12.12 Derivatives of linear functions and quadratic forms


Often we must compute derivatives of functions of the kind x0 Ax (a quadratic form)
or x0 q (a linear combination of elements in the vector q with weights x) with respect
to the vector x.
In both cases we are considering a (column) vector of derivatives of a scalar function
w.r.t. a (column) vector of variables (commonly called a “gradient”). There is a useful
matrix notation for such derivatives which, in these two cases, is simply given by:

∂(x'Ax)/∂x = 2Ax

and

∂(x'q)/∂x = q

The proof of these two formulas is quite simple. In both cases we give a proof for
a generic element k of the derivative column vector.
For the linear combination we have

x'q = Σ_j x_j q_j

∂(x'q)/∂x_k = q_k

For the quadratic form we have

x'Ax = Σ_i Σ_j x_i x_j a_i,j

∂(Σ_i Σ_j x_i x_j a_i,j)/∂x_k = Σ_{j≠k} x_j a_k,j + Σ_{i≠k} x_i a_i,k + 2x_k a_k,k = Σ_{j≠k} x_j a_k,j + Σ_{j≠k} x_j a_k,j + 2x_k a_k,k = 2Ak,. x

Where Ak,. means the k − th row of A and we used the fact that A is a symmetric
matrix.
An important point to stress is that the derivative of a function with respect to a
vector always has the same dimension as the vector w.r.t. which the derivative is taken,
in this case x, so, for instance

∂(x'Ax)/∂x = 2Ax

and not

∂(x'Ax)/∂x = 2x'A

(remember that A is symmetric).
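The gradient formula can be checked against central finite differences; the sketch below uses an arbitrary symmetric matrix and vector (both made up for the example):

```python
import numpy as np

# Sketch: check the gradient formula 2Ax for the quadratic form x'Ax
# against central finite differences.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                  # the formula assumes A symmetric
x = rng.standard_normal(4)

f = lambda v: v @ A @ v            # the quadratic form

h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                    for e in np.eye(4)])
assert np.allclose(grad_fd, 2 * A @ x, atol=1e-5)
```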

12.13 Minimization of a PD quadratic form, approximate solution of over determined linear systems
Now let us go back to the linear system Ax = b with A an n × k matrix of rank k.
If n > k this system has, in general, no solution. However, let us try to solve a similar
problem. In solving a system we wish for Ax − b = 0; in our case this is not possible,
so let us change the problem to min_x (Ax − b)'(Ax − b). In words: try to minimize
the sum of squared differences between Ax and b if you cannot make it equal to 0.

We have
(Ax − b)0 (Ax − b) = x0 A0 Ax + b0 b − 2b0 Ax
Now let us take the derivative of this w.r.t. x

∂(x'A'Ax + b'b − 2b'Ax)/∂x = 2A'Ax − 2A'b
(remember the rule about the size of a derivatives vector). We now create a new
linear system equating these derivatives to 0.

A0 Ax = A0 b

And the solution is


x = (A0 A)−1 A0 b
This is the “least squares” approximate solution of a (over determined) linear system.
(see the Appendix on least squares and Gauss Markov model).

12.14 Minimization of a PD quadratic form under constraints. Simple applications to Finance
Suppose we are given a column vector r where rj is the random (linear) return for the
stock j.
Suppose we are holding these returns in a portfolio for one time period and that the
(known) relative amount of each stock in our portfolio is given by the column vector
w such that 10 w = 1 where 1 indicates a column vector of ones of the same size as w.
Then the random linear return of the portfolio over the same time period is given
by rπ = w0 r.
Since w is known we have E(w0 r) = w0 E(r) and V (w0 r) = w0 V (r)w.
The fact that, over one period of time, the expected linear return and the variance
of the linear return of a portfolio only depend on the expected values and the covariance
matrix of the single returns and the weight vector is what allows us to implement a
simple optimization method. For the moment let us suppose that the problem is

min w0 V (r)w
w:10 w=1

In this problem we want to minimize a quadratic form under a linear constraint.


It is to be noticed that, without the constraint, the problem would be solved by
w = 0 (no investment). The constraint does not allow for this.
Such problems can be solved with the Lagrange multiplier method.
The idea is to artificially express, in a single function, both the need of minimizing
the original function and the need to do this with respect to the constraint 10 w = 1.

In order to do this we define the Lagrangian of the problem given by

L(w, λ) = w0 V (r)w − 2λ(10 w − 1)

In this function the value of the unconstrained objective function is summed with the
value of the constraint multiplied by a dummy parameter −2λ.
We now take the derivatives of the Lagrangian w.r.t. w and λ.

∂(w'V(r)w − 2λ(1'w − 1))/∂w = 2V(r)w − 2λ1

∂(w'V(r)w − 2λ(1'w − 1))/∂λ = −2(1'w − 1)

If we set both these to zero we get, supposing V (r) invertible

V(r)w = λ1
1'w = 1
Notice the difference between the 1-s. In the first equation 1 is a column vector, which
is required because we cannot equate a vector to a scalar. The same holds for 1' in the
second equation, while the r.h.s. is the scalar one (for dimension compatibility with the
l.h.s.). We do not stress this using, e.g., boldface for the vector 1 because the meaning
follows unambiguously from the context.
It is clear that the second equation is satisfied if and only if w satisfies the constraint.
What is the meaning of the first equation (or, better, set of equations)? The
unconstrained equation would have been

V (r)w = 0

whose only solution (due to the fact that V (r) is invertible) would be w = 0. But this
solution does not satisfy the constraint. What we shall be able to get is V(r)w = λ1,
for some λ chosen in such a way that the constraint is satisfied.
To find this λ, simply put together the result of the first set of equations, w =
λV(r)⁻¹1, and the equation expressing the constraint, 1'w = 1. Both equations are
satisfied if and only if

λ = 1/(1'V(r)⁻¹1)
We now know λ, that is we know of exactly how much we must violate the unconstrained
optimization condition (first set of equations) in order to satisfy the constraint (second
equation).
In the end, putting this value of λ in the solution for the first set of equations, we
get

w = V(r)⁻¹1 / (1'V(r)⁻¹1)
It is to be noticed that these are only necessary conditions but, for our purposes,
this is enough.

What we got is the one period “minimum variance portfolio” made of securities
whose returns covariance matrix is V(r).
What is the variance of this portfolio?
V(w'r) = w'V(r)w = [1'V(r)⁻¹ V(r) V(r)⁻¹1] / (1'V(r)⁻¹1)² = (1'V(r)⁻¹1) / (1'V(r)⁻¹1)² = 1/(1'V(r)⁻¹1)
The expected value shall be

E(w'r) = w'E(r) = (1'V(r)⁻¹E(r)) / (1'V(r)⁻¹1)
If V(r) is only spd, then it shall not be invertible, so that the system V(r)w = λ1
cannot be solved by simple inversion. In this case, however, there shall exist non null
vectors w∗ such that w∗'V(r)w∗ = 0 and, using such w∗, it shall be possible to build
portfolios of the securities with (linear) return vector r, and maybe the risk free asset,
such that the return of such portfolios is risk free (zero variance) even if its components
are risky. Such riskless return must be equal to the risk free rate for no arbitrage to hold.
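The closed form derived above is easy to verify numerically; in the sketch below the 3-asset covariance matrix is made up purely for illustration and NumPy is an assumption:

```python
import numpy as np

# Sketch: minimum variance portfolio w = V^{-1}1 / (1'V^{-1}1).
V = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])
ones = np.ones(3)

Vinv1 = np.linalg.solve(V, ones)       # V^{-1} 1
w = Vinv1 / (ones @ Vinv1)

assert np.isclose(w.sum(), 1.0)                      # constraint 1'w = 1
var_w = w @ V @ w
assert np.isclose(var_w, 1.0 / (ones @ Vinv1))       # = 1/(1'V^{-1}1)

# Spot check optimality: random weight vectors summing to one never do better.
rng = np.random.default_rng(0)
for _ in range(1000):
    u = rng.standard_normal(3)
    u = u / u.sum()
    assert u @ V @ u >= var_w - 1e-12
```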

12.15 The linear model in matrix notation


Suppose you have a matrix X of dimensions n × k containing n observations on each
of k variables. You also have a n × 1 vector y containing n observations on another
variable.
You would like to approximate y with a linear function of X that is: Xb for some
k×1 vector b.
In general, if n > k it shall not be possible to exactly fit Xb to y, so that the
approximation shall imply a vector of errors ε = y − Xb.
We would like to minimize ε but this is a vector: we must define some scalar
function of it which we wish to minimize.
A possible choice is ε'ε, that is: the sum of squares of the errors.
We then wish to minimize

ε'ε = (y − Xb)'(y − Xb) = y'y + b'X'Xb − 2y'Xb

If we take the derivative of this w.r.t. b we get

∂(y'y + b'X'Xb − 2y'Xb)/∂b = 2X'Xb − 2X'y

(again remember the size rule and remember that y'Xb = b'X'y: each is the transpose
of the other but both are scalars).
The solution of this is

b = (X'X)⁻¹X'y

This simple application of the rule for the approximate solution of an over determined
system yields the most famous formula in applied (multivariate) Statistics. When this
problem, for the moment just a best fit problem, shall be immersed in the appropriate
statistical setting, our b shall become the Ordinary Least Squares parameter vector
and shall be of paramount relevance in a wide range of applications to Economics and
Finance.
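As a sketch of the formula in practice (NumPy assumed; the data are simulated purely for illustration), the normal-equations solution agrees with a library least squares routine, and the error vector is orthogonal to the columns of X:

```python
import numpy as np

# Sketch on simulated data: b = (X'X)^{-1} X'y via the normal equations.
rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.standard_normal((n, k))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

b = np.linalg.solve(X.T @ X, X.T @ y)          # normal equations
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(b, b_lstsq)                 # same answer as the library routine

# The error vector e = y - Xb is orthogonal to the columns of X.
e = y - X @ b
assert np.allclose(X.T @ e, 0.0, atol=1e-8)
```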

13 Appendix: What you cannot ignore about Probability and Statistics

A quick check
The following simple example deals with the relations and differences between proba-
bility concepts and statistical concepts.
Let us start from two simple concepts: The mean and the expected value.
You know that an expected value has to do with a probability model: you cannot
compute it if you do not know the possible values of a random variable and their
probabilities.
On the other hand an average or mean is a simpler concept involving just a set of
numerical values: you take the sum of the values and divide by their number.
Sometimes, if certain assumptions hold (e.g. iid data), an expected value can be
estimated using a mean computed over a given dataset.
Moreover, when a mean is seen not as an actual number, involving the sum of
actually observed data divided by the number of summands, but as a sum of still
unobserved, hence random, data divided by their number, the mean becomes a random
quantity: being a function of random variables it enters the field of Probability and
needs a probability model for its description. As such, it is reasonable to ask for
its probability distribution, expected value and variance. In fact this is the study
of “sampling variability” for an estimate. At the opposite, probability distribution,
expected value and variance have no interesting meaning for the mean of a given set of
numbers, which has one and only one possible value.
This dualism between quantities computed on numbers and functions of random
variables is true for all other statistical quantities.
In the case of the mean/average vs expected value, we use (but not always) different
names to stress the different role of the objects we speak of. The same is done (usually)
when we distinguish “frequency” and “probability”. To apply this to each statistical
concept would be a little cumbersome and, in fact, is not done in most cases. A variance
is called a variance both when used in the probability setting and as a computation on
numbers; the same for moments, covariances etc. Even the word “mean” is often used
to indicate both expected values and averages. This is a useful shortcut but should
not trick us into believing that the use of the same name implies identity of properties.
Care must be used.
In the experience of any teacher of Statistics, the potential misunderstandings which
can derive from an incomplete understanding of this basic point are at the origin of
most of the problems students run into when confronted with statistical concepts.
Consider the following example and, even if you judge it trivial, dedicate some time
to really repeat and understand all its steps.
Suppose you observe the numbers 1,0,1. The mean of these is, obviously, 2/3. Is it
meaningful to ask questions about the expected value of each of these three numbers
or of the mean? Not at all, except in the very trivial case where the answer to this
question coincides with the actual observed numbers.
However, in most relevant cases the numbers we may observe are not predetermined.
They are obviously known after we observe data, but it is usually the case that we also
want to say something about their values in future possible observations (e.g. we must
decide about taking some line of action whose actual result depends on the future values
of observables: this is the basic setting in financial investments). We cannot do this
without the proper language: we need a model, written in the language of Probability,
able to describe the “future possible observations”.
For instance, we could think it sensible to assume that each single number we observe
can only be either 0 or 1, that each possible observation has the same probability
distribution for the possible results (P for 1 and 1 − P for 0) and that observations are
independent, that is: the probability of each possible string of results is nothing but
the product of the probabilities of each result in the string.
Since we only know that P is a number between 0 and 1, the mean computed above
using data from the phenomenon so modeled (in this case equivalent to the “relative
frequency” of 1) has a new role: it could be useful as an “estimate” of P .
Under our hypotheses, however, it is clear that the value 2/3 is only the value of
our mean for the observed data, it is NOT the value of P which is still an unknown
constant. We need something connecting the two.
The first step is to consider the possible values that the mean could have had on
other possible “samples” of three observations.
By enumeration these are 0, 1/3, 2/3, 1.
We can also compute, under our hypotheses the probabilities of these values. Since
a mean of 1 can happen only when we observe three ones, and since the three results
are independent and with the same probability P , we have that the probability of
observing a mean of 1 is P P P = P 3 . On the other hand a mean of 0 can only be
observed when we only observe zeroes, that is with probability (1 − P )3 . A mean of
1/3 can be obtained if we observe a 1 and two 0s. There are three possibilities for
this: 1,0,0; 0,1,0 and 0,0,1. The respective probabilities (under our hypotheses) are
P (1 − P )(1 − P ); (1 − P )P (1 − P ) and (1 − P )(1 − P )P. The three possibilities exclude
each other so we can sum up the probabilities. In the end we have 3P (1 − P )2 . The
same reasoning gives us the probability of observing a mean of 2/3 that is: 3P 2 (1 − P ).
What we just did is to compute the “sampling distribution” of the mean seen as an
“operator” which we can apply to any possible sample.
This sampling distribution of the mean gives us all its (four) possible values (on
n = 3 samples) and their probabilities as functions of P .
Since we now have both the possible values and their probabilities we can compute
the expected value and the variance of this mean. This is the second step to take in
order to connect the estimate to the “parameter” P .
These computations shall give us information about how good an “estimate” the
mean can be of the unknown P . We would like the mean to have expected value P
(unbiasedness) and as small a variance as possible, so as to be, “with high probability”,
“near” the true but unknown value of P . What this last sentence means is, simply,
that the probability of observing samples where the mean has a value near P should
be big.
Formally an expected value is very similar to a mean with the difference that each
value of the (now) random variable is multiplied by its probability and not its frequency.
The expected value shall be:

0 ∗ (1 − P )3 + 1/3 ∗ 3 ∗ P (1 − P )2 + 2/3 ∗ 3 ∗ P 2 (1 − P ) + 1 ∗ P 3 = P
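The whole enumeration above can be reproduced mechanically; the sketch below (plain Python with exact rational arithmetic) rebuilds the sampling distribution of the mean for P = 1/3 and checks unbiasedness and the variance formula P (1 − P )/3:

```python
from fractions import Fraction
from itertools import product

# Sketch: enumerate the 8 possible samples of three 0/1 observations and
# build the sampling distribution of the mean, here with P fixed at 1/3.
P = Fraction(1, 3)

dist = {}                                   # value of the mean -> probability
for sample in product([0, 1], repeat=3):
    prob = Fraction(1)
    for s in sample:                        # independence: probabilities multiply
        prob *= P if s == 1 else 1 - P
    m = Fraction(sum(sample), 3)
    dist[m] = dist.get(m, Fraction(0)) + prob

assert sum(dist.values()) == 1
assert dist[Fraction(1, 3)] == Fraction(4, 9)    # 3 P (1-P)^2 at P = 1/3
# Unbiasedness: the expected value of the sample mean is P ...
assert sum(v * p for v, p in dist.items()) == P
# ... and its variance is P(1 - P)/3.
var = sum((v - P) ** 2 * p for v, p in dist.items())
assert var == P * (1 - P) / 3
```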

Notice the difference with respect to a mean computed on a given sample and be
careful not to mistake the point. The difference is not that the result is not a number
but an unknown “parameter”. It could well be that P is known and, say, equal to 1/3 so
that the result would be a number. The difference is that this quantity, the expected
value of the sample mean, is a probability quantity, has nothing to do with actual
observations and frequencies and has everything to do with potential observations and
probability. In fact, on each given sample we have a given value of the mean so its
expected value has a meaning only because we consider the variability of the values
of this mean on the POSSIBLE samples. However, the result is very useful both if P
is unknown and if it is known. When P is unknown it tells us that the mean of the
observed data shall be unbiased as an estimate for P . When P is known to be, say,
1/3 it shall give P an “empirical connection” to an observable quantity, by assessing
that the expected value of the mean of the observed data shall be 1/3.
The question, however is: OK, this for the expected value of the sample mean. But:
how much is it probable that the actual observed mean be “near” P ?
Well, suppose for instance P = 1/3: we immediately see that the probability of
observing a mean equal to 1/3 is 4/9, quite high, while the probability of observing a
mean between 0 and 2/3 is 1 − 1/27, very near to 1.
For independent observations this is going to improve if n increases in the sense
that we shall find more and more probability in a given interval around 1/3.

However this computation, while straightforward, is quite cumbersome for big sam-
ples and requires some not fully elementary approximations as, in this case, the central
limit theorem.
A simpler, if less specific, answer to this question requires the computation of
the (sampling) variance of our mean. By using the definition of variance and the
probabilities already computed we get for the variance of the mean:

P (1 − P )/3

The general case, for a sample of size not 3 but n shall be:

P (1 − P )/n

Clearly the bigger n the smaller the variance.


This, again, for unknown P is an unknown number. However we can say much
about its value. In fact, since P is between 0 and 1, P (1 − P ) has a maximum value
of 1/4 (and this is the exact value when P = 1/2).
How is this connected with the probability of observing a mean “near” to P ?
The answer is given by Tchebicev inequality. This says that for any random variable
X (hence also for the sample mean) we have:

Prob(E(X) − k√V(X) < X < E(X) + k√V(X)) ≥ 1 − 1/k²

(for any positive k). This implies, for instance, that if P = 1/3 and n = 3, there
is at least a probability of .75 that the sample mean be observed between the values
1/3 − .5443 and 1/3 + .5443.
This is already a very useful information, but think what happens when the sample
size is not 3 but, say, 100. In this case the above interval becomes much narrower:
1/3 − .0943 to 1/3 + .0943. Even if P is unknown, with n = 100 the interval for at least
a probability of .75 shall never be wider than ±.1 (= ±2 · (1/(4 · 100))^(1/2) = ±2 · .05)
around the “true” P . As stated above, results the like of the central limit theorem allow
us to be even more precise, but this is outside the scope of this exercise.
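We can verify on the exact sampling distribution that the Tchebicev bound indeed holds for k = 2 (a sketch in plain Python; P = 1/3 and n = 3 as in the text):

```python
from fractions import Fraction
from math import sqrt

# Exact sampling distribution of the mean of three Bernoulli(P) draws.
P = Fraction(1, 3)
dist = {Fraction(0): (1 - P) ** 3,
        Fraction(1, 3): 3 * P * (1 - P) ** 2,
        Fraction(2, 3): 3 * P ** 2 * (1 - P),
        Fraction(1): P ** 3}

mean = sum(v * p for v, p in dist.items())               # = P
var = sum((v - mean) ** 2 * p for v, p in dist.items())  # = P(1 - P)/3

k = 2
sd = sqrt(var)
inside = sum(p for v, p in dist.items()
             if mean - k * sd < v < mean + k * sd)
# Tchebicev promises at least 1 - 1/k^2 = 3/4; here we actually get 26/27.
assert inside >= Fraction(3, 4)
```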
Beyond the numbers what this boils down to is that, by studying the sample mean
as a random variable, random due to the fact that the sample values are random before
observing them, we are able to connect a parameter in the model: P to an observable
quantity, the sample mean (relative frequency) by the fact that, the bigger n, the bigger
is the probability that the observed value of the mean be in a given interval around P .
By converse, we also understand the empirical role of P in determining the proba-
bility of different possible observations of the sample mean: the value of P determines
the interval where the probability of observing the sample mean is higher.
These are simple instances of two basic points in Statistics as applied to any sci-
ence: we call the first “(statistical) inference” (transforming information on observed

frequencies into information on probabilities) and the second “forecasting” (assessing
the probabilities of observations still to be made).
This longish example is not intended to teach you any new concept: with the
possible exception of Tchebicev inequality all this should be already known after a BA
in Economics.
You can take it as follows: if you see all the concepts and steps in the example as
clear, even trivial, fine! All that follows in this course shall be quite easy.
On the other hand, if any step seems fuzzy or inconsequential, dedicate some more
time to a quick rehearsal of what you already did during the BA concerning Probability
and Statistics.
And for any problem ask your teachers.

How should you use what follows?


In the following paragraphs you shall find a quick summary of basic Probability and
Statistics concepts.
A good understanding of basic concepts in modern Finance and Economics as ap-
plied to the fields of asset pricing, asset management, risk management and Corporate
Finance (that is: what you do in the two-year master) would require a full knowledge
of what follows. As far as this course is concerned, a good understanding of the strictly
required statistical and Probability concepts (in fact really basic!) can be derived by
simply examining the questions asked in past exams. Moreover, before section 1 and
section 6 of these handouts you can find a short list of concepts that are essential to
the understanding of the first and the second part of these handouts.
In what follows a small number of less essential points, preceded by an asterisk, can
be left out.
The following summary is (obviously) NOT an attempt to write an introductory
text of Probability and Statistics. It should be used as a quick summary check: browse
through it and check whether most of the concepts are familiar.
In the unlikely case the answer is no (this could be the case for students coming
from different-field BAs) you should dedicate some time to upgrading your basic notions
of Probability and Statistics.
For any problem and suggestion ask your teachers.

Probability
13.1 Probability: a Language
• Probability is a language for building decision models.

• As all languages, it does not offer or guarantee ready made splendid works of
art (that is: right decisions) but simply a grammar and a syntax whose purpose
is avoiding inconsistencies. We call this grammar and this syntax “Probability
calculus”.

• On the other hand, any language makes it simple to “say” something, difficult to
say something else, and there are concepts that cannot even be thought in a
given language. So, no analysis of what we write in a language is independent of
the structure of the language itself. And this is true for Probability too.

• The language is useful to deduce probabilities of certain events when other prob-
abilities are given, but the language itself tells us nothing about how to choose
such probabilities.

13.2 Interpretations of Probability


• A lot of (often quite cheap) philosophy on the empirical meaning of probability
boils down to two very weak suggestions:

• For results of replicable experiments, it may be that probability assessments have


to do with long run (meaning what?) frequency;

• For more general uncertainty situations, probability assessments may have some-
thing to do with prices paid for bets, provided you are not directly involved in
the result of the bet, except with regard to a very small sum of money.

• In simple situations, where some symmetry statement is possible, as in the stan-


dard setting of “games of chance” where probability as a concept was born, the
probability of relevant events can be reduced to some sum of probabilities of
“elementary events” you may accept as “equiprobable”.

13.3 Probability and Randomness


• Probability is, at least in its classical applications, introduced when we wish to
model a collective “random” phenomenon, that is an instance where we agree that
something is happening “under constant conditions” and, this not withstanding,
the result is not fully determined by these conditions and, a priori, unknown to
us.

• Traders are interested in returns from securities, actuaries in mortality rates,


physicists in describing gases or subatomic particles, gamblers in assessing the
outcomes of a given gamble.

• At different degrees of confidence, students in these fields would admit that, in
principle, it could be possible to attempt a specific modeling for each instance of
the phenomena they observe but that, in practice, such a model would require
such an impossible precision in the measurement of initial conditions and param-
eters as to be useless. Moreover, computations for solving such models would be
unwieldy even in simple cases.
• For these reasons students in these fields are satisfied with a theory that avoids a
case by case description, but directly models possible frequency distributions for
collectives of observations and uses the probability language for these models.

13.4 Different Fields: Physics


• Quantum Physics seems the only field where the “in principle” clause is usually
not considered valid.
• In Statistical Physics a similar attitude is held but for a different reason. Statis-
tical Physics describes pressure as the result of “random” hits of gas molecules on
the surface of a container. In doing this it refrains from using standard arguments
of the mechanics of single particles, not because this would be in principle impossible
but because the resulting model would be in practice useless (for instance its
solution would depend on a precise measurement of position and momentum of each
gas molecule, something impossible to accomplish in practice).

13.5 Finance
• Finance people would admit that days are by no means the same and that prices
are not due to “luck” but to a very complex interplay of news, opinions, sentiments
etc. However, they admit that to model this with useful precision is impossible
and, at a first level of approximation, days can be seen as similar and that it is
interesting to be able to “forecast” the frequency distribution of returns over a
sizable set of days.
• The attitude is similar to Statistical Physics where, however, hypotheses of ho-
mogeneity of underlying micro behaviours are easier to sustain. Moreover,
while we could model in an exact way a few particles, we cannot do the same even
with a single human agent.

13.6 Other fields


• Actuaries do not try to forecast with ad hoc models the lifespan of this or that
insured person (while they condition their models on some relevant character-
istics the like of age, sex, smoker/non smoker and so on); they are satisfied with a
(conditional) modeling of the distribution of lifespan in a big population and with
matching this with their insured population.

• Gamblers compute probabilities, and sometimes collect frequencies. They would
like to be able to forecast each single result but their problem, when the result
depends on some physical randomizing device (roulette, die, coin, shuffled deck
of cards etc.), is exactly the same as the physicist’s problem, at least when the
gamble result depends on the physics of the randomizing device.

• Very different and much more interesting is the case of betting (horse racing,
football matches, political elections etc.). In this case the repeatability of events
under similar conditions cannot be called as a justification in the use of proba-
bilities and this implies a different and interesting interpretation of probability
which is beyond the scope of this summary.

• Weather forecasters, as all sensible forecasters (as opposed to foretellers), phrase
their statements in terms of probabilities of basic events (sunny day, rain, thun-
derstorms, floods, snow, etc.). In countries where this is routinely done and
weather forecasts are actually made in terms of probabilities (as in the UK and USA
but not frequently in Italy), over time the meaning of, say, “60% probability of
rain” and the usefulness of the concept have come to be understood by the general
public (probability is not and should not be a mathematicians-only concept).

• Risk managers in any field (the financial one is a very recent example) aim at
controlling the probability of adverse events.

• Any big general store chain must determine the procedures for replenishing the
inventories given a randomly varying demand. This problem is routinely solved
by probability models.

• A similar problem (with similar solutions) is encountered when (random) demand
and (less random) supply of energy must be matched in a given energy grid; channels
must be allocated in a communication network; turnpikes must be opened or
closed to control traffic, etc.

• These are just examples of the applied fields where probability models and Statis-
tics are applied with success to the solution of practical problems of paramount
relevance.

13.7 Wrong Models


• As we have already seen, in a sense all probability models are “wrong”. With the
exception (perhaps) of Quantum Mechanics, they do not describe the behaviour
of each observable instance of a phenomenon but try, with the use of the non
empirical concept of probability, to directly and at the same time fuzzily describe
aggregate results: collective events.

• For this simple reason they are useful inasmuch as the decision payout depends, in
some sense, on collectives of events.

• They are not useful for predicting the result of the next coin toss but they are
useful for describing coin tossing as a process.

13.8 Meaning of Correct


• A good use of probability only guarantees the model to be self-consistent. It
cannot guarantee it to be successful.

• When the term “correct” is applied to a probability model (it would be better to
call it “satisfactory”) what is usually meant is that its probability statements are
well matched by empirical frequencies (the term “well calibrated” is also used in
this sense).

• Sometimes, probability models are used in cases where the relevant event shall
happen only once or a few times.

• In this case the model shall be useful more for organizing our decision process than
for describing its outcome. “Correct” in this case means: “a good and consistent
summary of our opinions”.

13.9 Events and Sets


• Probabilities are stated for “events”, which are propositions concerning facts whose
truth value can reasonably be assessed at a given future time. However, for-
mally, probabilities are numbers associated with sets of points.

• Points represent “atomic” verifiable propositions which, at least for the purposes
of the analysis at hand, shall not be derived from simpler verifiable propositions.

• Sets of such points simply represent propositions which are true any time any
one of the (atomic) propositions within each set is true.

• Notice that, while points must always be defined, it may well be the case that we
only deal with sets of these points, and that some or all of the single points are
not considered as events by themselves. For instance, in rolling a standard
die we have 6 possible “atomic” results but we could be interested only in the
probability of non atomic events like “the result is an even number” or
“the result is bigger than 3”. Since probabilities shall be assigned to a chosen class
of sets of points, and we shall call these sets “events”, it may well be that these
“events” do not include atomic propositions (which in common language would
graduate to the name “event”).

• Sets of points are indicated by capital letters: A, B, C, .... The “universe” set (rep-
resenting the sure event) is indicated with Ω and the empty set (the impossible
event) with ∅.

• Finite or enumerably infinite collections of sets are usually indicated with {A_i}_{i=1}^n
and {A_i}_{i=1}^∞.

• Correct use of basic Probability requires the knowledge of the basic set theoretic
operations: A ∩ B (intersection), A ∪ B (union), A \ B (set difference), Ā
(negation/complement) and their basic properties. The same is true for finite and
enumerably infinite unions and intersections: ∪_{i=1...n} A_i, ∪_{i=1...∞} A_i and so on.

13.10 Classes of Events


• Probabilities are assigned to events (sets) in classes of events which are usually
assumed closed with regard to some set operations.

• The basic class is an Algebra, usually indicated with an uppercase calligraphic
letter: A. An algebra is a class of sets which includes Ω and is closed under finite
intersection and negation of its elements, that is: if two sets are in the class, their
intersection and their negations are also in the class. This implies that the finite
union is also in the class, and so is the symmetric difference (why?).

• When the class of sets contains more than a finite number of sets, usually also
enumerably infinite unions of sets in the class are required to be sets in the class
itself (and so enumerable intersections, why?). In this case the class is called a
σ-algebra. The name “Event” is from now on used to indicate a set in an algebra
or σ-algebra.

13.11 Probability as a Set Function


• A probability is a set function P defined on the elements of an algebra such that:
P(Ω) = 1, P(Ā) = 1 − P(A) and, for any finite number of disjoint events {A_i}_{i=1}^n
(A_i ∩ A_j = ∅ ∀i ≠ j), we have: P(∪_{i=1...n} A_i) = ∑_{i=1}^n P(A_i).

• If the probability is defined on a σ-algebra we require the above additivity prop-


erty to be valid also for enumerable unions of disjoint events.

13.12 Basic Results
• A basic result, implied by the above axioms, is that for any pair of events we
have: P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

• Another basic result is that if we have a collection of disjoint events {A_i}_{i=1}^n
(A_i ∩ A_j = ∅ ∀i ≠ j) and another event B such that B = ∪_{i=1}^n (A_i ∩ B), then we
can write: P(B) = ∑_{i=1}^n P(B ∩ A_i).

13.13 Conditional Probability


• For any pair of events we may define the conditional probability of one to the
other, say: P (A|B) as a solution to the equation P (A|B)P (B) = P (A ∩ B).

• If we require, and we usually do, the conditioning event to have positive probabil-
ity, P(B) ≠ 0, this solution is unique and we have: P(A|B) = P(A ∩ B)/P(B).

13.14 Bayes Theorem


Using the definition of conditional probability and the above two results we can prove
Bayes Theorem.
Let {A_i}_{i=1}^n be a partition of Ω into events, that is: A_i ∩ A_j = ∅ ∀i ≠ j and
∪_{i=1...n} A_i = Ω. We have:

P(A_i|B) = P(B|A_i)P(A_i) / ∑_{j=1}^n P(B|A_j)P(A_j)
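As a quick numerical illustration, the formula above can be computed directly for a finite partition; the priors and likelihoods below are made-up numbers, not taken from the text:

```python
# Bayes theorem on a finite partition {A_i} of Omega:
# P(A_i | B) = P(B | A_i) P(A_i) / sum_j P(B | A_j) P(A_j)

def bayes_posterior(priors, likelihoods):
    """priors: P(A_i); likelihoods: P(B | A_i). Returns P(A_i | B) for each i."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    p_b = sum(joint)                 # P(B), by the total probability formula
    return [j / p_b for j in joint]

priors = [0.5, 0.3, 0.2]             # P(A_i): a partition of Omega (illustrative)
likelihoods = [0.9, 0.5, 0.1]        # P(B | A_i) (illustrative)
posterior = bayes_posterior(priors, likelihoods)
```

Note that the posterior probabilities sum to one by construction, since the denominator is exactly P(B).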

13.15 Stochastic Independence


• We say that two events are “independent in the probability sense”, “stochastically
independent” or, simply, when no misunderstandings are possible, “independent”
if P (A ∩ B) = P (A)P (B).

• If we recall the definition of conditional probability, we see that, in this case,
the conditional probability of each event given the other is again the “marginal”
probability of the same event.

13.16 Random Variables


• These are functions X(.) from Ω to the real axis R.

• Not all such functions are considered random variables. For X(.) to be a random
variable we require that for any real number t the set Bt given by the points ω

in Ω such that X(ω) ≤ t is also an event, that is: an element of the algebra (or
σ-algebra).
• The reason for this requirement (whose technical name is: “measurability”) is that
a basic tool for modeling the probability of values of X is the “probability distri-
bution function” (PDF) (sometimes “cumulative distribution function”, CDF) of
X, defined for all real numbers t as: F_X(t) = P({ω : X(ω) ≤ t}) = P(B_t) and,
obviously, in order for this definition to have a meaning, we need all B_t to be
events (that is: a probability P(B_t) must be assessed for each of them).

13.17 Properties of the PDF


• From its definition we can deduce some noticeable properties of F_X:
1. it is a non decreasing function;
2. its limit for t going to −∞ is 0 and its limit for t going to +∞ is 1;
3. we have lim_{h↓0} F_X(t + h) = F_X(t), but this is in general not true for h ↑ 0, so that
the function may be discontinuous.
• We may have at most an enumerable set of such discontinuities (as they are dis-
continuities of the first kind).
• Each of these discontinuities is to be understood as a probability mass concen-
trated on the value t where the discontinuity appears. Elsewhere F_X is continuous.

13.18 Density and Probability Function


• In order to specify probability models for random variables we usually do not
specify F directly but other functions which are easier to manipulate.
• We usually consider two cases as most relevant (while interesting mixtures of these
may appear):
1. the absolutely continuous case, that is: where F shows no discontinuity and can
be differentiated with the possible exception of a set of isolated points;
2. the discrete case, where F only increases by jumps.

13.19 Density
In the absolutely continuous case we define the probability density function of X as:
f_X(t) = ∂F_X(s)/∂s |_{s=t} where this derivative exists, and we complete this function in an
arbitrary way where it does not. Any choice of completion shall have the property:
F_X(t) = ∫_{−∞}^{t} f_X(s) ds.

13.20 Probability Function
In the discrete case we call “support” of X the at most enumerable set of values x_i
corresponding to discontinuities of F; we indicate this set with Supp(X) and define
the probability function P_X(x_i) = F_X(x_i) − lim_{h↑0} F_X(x_i + h) for all x_i ∈ Supp(X),
with the agreement that such a function is zero on all other real numbers. In simpler
but less precise words: P_X(.) is equal to the “jump” of F_X(.) at each point x_i where a
jump happens, and zero everywhere else.

13.21 Expected Value


The “expected value” of (in general) a function G(X) is then defined, in the continuous
and discrete cases respectively, as

E(G) = ∫_{−∞}^{+∞} G(s) f_X(s) ds

and

E(G) = ∑_{x_i ∈ Supp(X)} G(x_i) P_X(x_i)

If G is the identity function G(t) = t, the expected value of G is simply called the
“expected value”, “mathematical expectation”, “mean” or “average” of X.
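A minimal sketch of the discrete-case formula, summing the probability function over a truncated support; the use of a Poisson with λ = 3 and the truncation point are our illustrative choices:

```python
import math

def poisson_pf(x, lam):
    # Poisson probability function: P(x) = lam^x e^{-lam} / x!
    return lam ** x * math.exp(-lam) / math.factorial(x)

def expected_value(g, lam, terms=100):
    # Truncated version of E(G) = sum over Supp(X) of G(x_i) P_X(x_i);
    # for lam = 3 the neglected tail beyond x = 100 is negligible.
    return sum(g(x) * poisson_pf(x, lam) for x in range(terms))

lam = 3.0
mean = expected_value(lambda x: x, lam)                # theory: lam
second_moment = expected_value(lambda x: x ** 2, lam)  # theory: lam + lam^2
```

The same sums recover E(X) = λ and the second moment λ + λ², matching the Poisson formulas listed later in these notes.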

13.22 Expected Value


• If G is a non-negative integer power, G(X) = X^k, we speak of “the k-th moment
of X” and usually indicate this with m_k or µ_k.

• If G(X) is the function I(X ∈ A), for a given set A, which is equal to 1 if X =
x ∈ A and 0 otherwise (the indicator function of A), then E(G(X)) = P(X ∈ A).

• In general, when the probability distribution of X is NOT degenerate (concen-
trated on a single value x), E(G(X)) ≠ G(E(X)). There is a noticeable ex-
ception: if G(X) = aX + b with a and b constants. In this case we have
E(aX + b) = aE(X) + b.

• Sometimes the expected value of X is indicated with µ_X or simply µ.

13.23 Variance
• The “variance” of G(X) is defined as V(G(X)) = E((G(X) − E(G(X)))²) =
E(G(X)²) − E(G(X))².

• A noticeable property of the variance is that V(aG(X) + b) = a²V(G(X)).

• The square root of the variance is called “standard deviation”. For these two
quantities the symbols σ 2 and σ are often used (with or without the underscored
name of the variable).

13.24 Tchebicev Inequality


• A fundamental inequality which connects probabilities with means and variances
is the so called “Tchebicev inequality”:
P(|X − E(X)| < λσ) ≥ 1 − 1/λ²

• As an example: if λ is set to 2, the inequality gives a probability of at least 75%
for X to be between its expected value minus and plus 2 times its standard
deviation.

• Since the inequality is tight, that is: it is possible to find a distribution for
which the inequality becomes an equality, this implies that, for instance, 99%
probability could require a ± “10 σ” interval.

• For comparison, 99% of the probability of a Gaussian distribution is contained


in the interval µ ± 2.576σ.

• These simple points have a great relevance when tail probabilities are computed
in risk management applications.

• In popular literature about extreme risks, and also in some applied work, it is
common to ask for a “six sigma” interval. For such an interval the Tchebicev
bound is 1 − 1/36 = 97.(2)%.
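A small numeric check of the points above, comparing the distribution-free Tchebicev bound with the exact Gaussian probability (computed through the error function) at the two λ values quoted in the text:

```python
import math

def chebyshev_lower_bound(lam):
    # P(|X - E(X)| < lam * sigma) >= 1 - 1/lam^2, for any finite-variance X
    return 1.0 - 1.0 / lam ** 2

def gaussian_central_prob(lam):
    # Exact P(|Z| < lam) for a standard Gaussian Z, via the error function
    return math.erf(lam / math.sqrt(2.0))

# lam = 2: Tchebicev guarantees 75%, the Gaussian actually gives about 95.45%
# lam = 6: Tchebicev guarantees 97.2%, the Gaussian is essentially 1
bounds = {lam: (chebyshev_lower_bound(lam), gaussian_central_prob(lam))
          for lam in (2.0, 6.0)}
```

The gap between the two columns is the price of making no distributional hypothesis: the Tchebicev bound must hold for the worst possible finite-variance distribution.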

13.25 *Vysochanskij–Petunin Inequality


Tchebicev's inequality can be refined by the Vysochanskij–Petunin inequality which,
with the added hypothesis that the distribution be unimodal, states that, for any
λ > √(8/3) ≈ 1.633:

P(|X − µ| < λσ) ≥ 1 − 4/(9λ²)

more than halving the probability outside the given interval with respect to Tchebicev: the
75% for λ = 2 becomes now 1 − 1/9, that is 88.(8)%.
Obviously, this gain in precision is huge only if λ is not too big. The fabled “six
sigma” interval according to this inequality contains at least 1 − 4/324 ≈ 98.77% probability, just
about 1.5% more than Tchebicev.

13.26 *Gauss Inequality
This result is an extension of a result by Gauss, who stated that if m is the mode (mind:
not the expected value; replacing the mode with the mean is the V-P extension) of a unimodal random variable,
then

P(|X − m| < λτ) ≥ 1 − 4/(9λ²)  if λ ≥ 2/√3
P(|X − m| < λτ) ≥ λ/√3         if 0 ≤ λ ≤ 2/√3

where τ² = E((X − m)²).

13.27 *Cantelli One Sided Inequality


A less well known but useful inequality is the Cantelli, or one sided Tchebicev, inequal-
ity which, phrased in a way useful for left-tail sensitive risk managers (that is, for λ < 0), becomes:

P(X − µ ≥ λσ) ≥ λ²/(1 + λ²)

and for λ = −2 this means that at least 4/5 of the probability (80%) is above the µ − 2σ
lower boundary.
For “minus six sigma” this becomes 36/37 ≈ 97.(297)%.

13.28 Quantiles
• The “α-quantile” of X is defined as the value q_α such that the following conditions
are simultaneously valid:

Pr[X < q_α] ≤ α
Pr[X ≤ q_α] ≥ α

• Notice that in the case of a random variable with continuous F_X(.) this definition
could be written as q_α ≡ inf(t : F_X(t) = α) and, in the case of a continuous
strictly increasing F_X(.), this becomes q_α ≡ t : F_X(t) = α.

• For a non continuous F_X(.), in case α is NOT one of the values taken by F_X(.), the
above definition corresponds to a value x of X with F_X(x) > α.

• Due to applications in the definition of VaR it is more proper to use as quantile,
in this case, a q_α defined as the maximum value x of X with positive probability
and with F_X(x) ≤ α.

• The formal definition of this is rather cumbersome:

q_α ≡ max{x : [Pr[X ≤ x] > Pr[X < x]] ∩ [Pr[X ≤ x] ≤ α]}

• which reads: “q_α is the greatest value of x such that x has positive probability
and such that F_X(x) ≤ α”.
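The two defining conditions can be checked mechanically on a discrete distribution. A sketch (the fair-die example is ours; exact fractions are used because α = 1/2 sits exactly on a cumulated value and floating point would blur the comparison):

```python
from fractions import Fraction

def quantile(support_probs, alpha):
    """Smallest support point q with P(X <= q) >= alpha; then P(X < q) <= alpha
    holds automatically, so both conditions in the definition are satisfied.
    support_probs: list of (value, probability) pairs sorted by value."""
    cdf = Fraction(0)
    for x, p in support_probs:
        cdf += p
        if cdf >= alpha:
            return x
    return support_probs[-1][0]

die = [(k, Fraction(1, 6)) for k in range(1, 7)]   # a fair six-sided die
median = quantile(die, Fraction(1, 2))
# median = 3: P(X < 3) = 1/3 <= 1/2 and P(X <= 3) = 1/2 >= 1/2
```

Running the same function with α = 9/10 returns 6, again because it is the first support point whose cumulated probability reaches α.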

13.29 Median
• If α = 0.5 we call the corresponding quantile the “median” of X and use for it,
usually, the symbol Md .

• It may be interesting to notice that, if G is continuous and increasing, we have


Md (G(X)) = G(Md (X)).

13.30 Mode
• A mode in a discrete probability distribution (or frequency distribution) is any
value of x ∈ Supp(X) where the probability (frequency) has a local maximum.

• “The mode”, usually, is the global maximum.

• In the case of densities, the same definition is applied in terms of density instead
of probability (frequency).

13.31 Univariate Distributions Models


• Models for univariate distributions come in two kinds: non parametric and para-
metric.

• A parametric model is a family of functions indexed by a finite set of parame-


ters (real numbers) and such that for any value of the parameters in a predefined
parameter space the functions are probability densities (continuous case) or prob-
ability functions (discrete case).

• A non parametric model is a model where the family of distributions cannot be
indexed by a finite set of real numbers.

• It should be noticed that, in many applications, we are not interested in a full


model of the distribution but in modeling only an aspect of it as, for instance,
the expected value, the variance, some quantile and so on.

13.32 Some Univariate Discrete Distributions


• Bernoulli: P(x) = θ for x = 1; P(x) = 1 − θ for x = 0; 0 ≤ θ ≤ 1. You should
notice the convention: the function is explicitly defined only on the support of
the random variable. For the Bernoulli we have: E(X) = θ, V(X) = θ(1 − θ).

• Binomial: P(x) = C(n, x) θ^x (1 − θ)^{n−x}, x = 0, 1, 2, ..., n; 0 ≤ θ ≤ 1, where
C(n, x) = n!/(x!(n − x)!) is the binomial coefficient. We have: E(X) = nθ;
V(X) = nθ(1 − θ).

• Poisson: P(x) = λ^x e^{−λ}/x!, x = 0, 1, 2, ..., ∞; λ > 0. We have: E(X) = λ;
V(X) = λ.

• Geometric: P(x) = (1 − θ)^{x−1} θ, x = 1, 2, ..., ∞; 0 ≤ θ ≤ 1. We have E(X) = 1/θ;
V(X) = (1 − θ)/θ².
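The moment formulas above can be checked by brute-force summation of the probability function over a truncated support; a sketch for the geometric (θ = 0.3 and the truncation point are our arbitrary choices):

```python
def geometric_pf(x, theta):
    # Geometric probability function: P(x) = (1 - theta)^(x - 1) * theta, x = 1, 2, ...
    return (1 - theta) ** (x - 1) * theta

theta = 0.3
support = range(1, 500)   # truncated support; the neglected tail is of order 0.7^499
mean = sum(x * geometric_pf(x, theta) for x in support)
var = sum(x ** 2 * geometric_pf(x, theta) for x in support) - mean ** 2
# theory: mean = 1/theta, var = (1 - theta)/theta^2
```

The same three lines, with the probability function swapped, verify the Bernoulli, Binomial and (truncated) Poisson moments as well.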

13.33 Some Univariate Continuous Distributions


Negative exponential: f(x) = θe^{−θx}, x > 0, θ > 0. We have: E(X) = 1/θ; V(X) =
1/θ². (Here you should notice that, as is often the case for distributions with con-
strained support, the variance and the expected value are functionally related).

13.34 Some Univariate Continuous Distributions


Gaussian: f(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}, x ∈ R, µ ∈ R, σ² > 0. We have E(X) = µ, V(X) =
σ². A very important property of this random variable is that, if a and b are constants,
then Y = aX + b is a Gaussian if X is a Gaussian.
By the above recalled rules on the E and V operators we also have E(Y) =
aµ + b; V(Y) = a²σ². In particular, the transform Z = (X − µ)/σ is distributed as a “standard”
(expected value 0, variance 1) Gaussian.

13.35 Some Univariate Continuous Distributions


The distribution function of a standard Gaussian random variable is usually indicated
with Φ, so Φ(x) is the probability of observing values of the random variable X which
are smaller than or equal to the number x, in short: Φ(x) = P(X ≤ x). With
z_{1−α} = Φ^{−1}(1 − α) we indicate the inverse function of Φ, that is: the value of the
standard Gaussian which leaves on its left a given amount of probability. Obviously
Φ(Φ^{−1}(1 − α)) = 1 − α.
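Φ has no closed form, but it is available through the error function; a minimal sketch, also checking the 2.576 figure quoted in the Tchebicev section:

```python
import math

def Phi(x):
    # Standard Gaussian distribution function, Phi(x) = P(X <= x)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Mass of the interval mu ± 2.576 sigma: should be about 99%
central_99 = Phi(2.576) - Phi(-2.576)
```

By symmetry Φ(0) = 1/2, and Φ(−x) = 1 − Φ(x), which the function above reproduces to machine precision.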

13.36 Random Vector


• A random vector X of size n is an n-dimensional vector function from Ω to R^n,
that is: a function which assigns to each ω ∈ Ω a vector of n real numbers.

• The name “random vector” is better than the name “vector of random variables” in
that, while each element of a random vector is, in fact, a random variable, a simple
vector of random variables could fail to be a random vector if the arguments ωi
of the different random variables are not constrained to always coincide.

• (If you understand this apparently useless subtlety you are well on your road to
understanding random vectors, random sequences and stochastic processes).

13.37 Distribution Function for a Random Vector
• Notions of measurability analogous to the one dimensional case are required for
random vectors but we do not mention these here.

• Just as in the case of a random variable, we can define probability distribution
functions for random vectors as F_X(t_1, t_2, ..., t_n) = P({ω : X_1(ω) ≤ t_1, X_2(ω) ≤
t_2, ..., X_n(ω) ≤ t_n}), where the commas in this formula can be read as logical “and”
and, please, notice again that the ω for each element of the vector is always the
same.

13.38 Density and Probability Function


Just as in the one dimensional case, we usually do not model a random vector by
specifying its probability distribution function but rather its probability function P(x_1, ..., x_n)
or its density f(x_1, ..., x_n), depending on the case.

13.39 Marginal Distributions


• In the case of random vectors we may be interested in “marginal” distributions,
that is: probability or density functions of a subset of the original elements in
the vector.

• If we wish to find the distribution of all the elements of the vector minus, say,
the i-th element we simply work like this:

• in the discrete case:

P(x_1, ..., x_{i−1}, x_{i+1}, ..., x_n) = ∑_{x_i ∈ Supp(X_i)} P(x_1, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n)

• and in the continuous case:

f(x_1, ..., x_{i−1}, x_{i+1}, ..., x_n) = ∫_{x_i ∈ Supp(X_i)} f(x_1, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n) dx_i

• We iterate the same procedures for finding other marginal distributions.
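For a discrete joint distribution stored as a table, marginalization is exactly the sum above; a sketch with a made-up joint probability function on {0,1} × {0,1,2}:

```python
# Joint probability function P(x1, x2) as a dict; the numbers are illustrative.
joint = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
         (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.30}

def marginal(joint, keep):
    """Sum P over the coordinate NOT kept; `keep` is the index (0 or 1) to retain."""
    out = {}
    for point, p in joint.items():
        out[point[keep]] = out.get(point[keep], 0.0) + p
    return out

p_x1 = marginal(joint, 0)   # marginal probability function of the first coordinate
p_x2 = marginal(joint, 1)   # marginal probability function of the second coordinate
```

Both marginals sum to one, as they must, since together they exhaust the whole joint table.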

13.40 Conditioning
• Conditional probability functions and conditional densities are defined just like
conditional probabilities for events.

• Obviously, the definition should be justified in a rigorous way but this is not
necessary, for now!

• The conditional probability function of, say, the first i elements in a random
vector given, say, the other n − i elements shall be defined as:

P (x1 , ..., xn )
P (x1 , ..., xi |xi+1 , ...xn ) =
P (xi+1 , ...xn )

• For the conditional density we have:

f (x1 , ..., xn )
f (x1 , ..., xi |xi+1 , ...xn ) =
f (xi+1 , ...xn )

• In both formulas we suppose denominators to be non zero.

13.41 Stochastic Independence


• Two sub vectors of a random vector, say the first i and the other n − i random
variables, are said to be stochastically independent if the joint distribution is the
same as the product of the marginals or, what is the same under our definition, if
the conditional and marginal distributions coincide.

• We write this for the density case; for the probability function it is the same:

f (x1 , ..., xn ) = f (x1 , ..., xi )f (xi+1 , ..., xn )

f (x1 , ..., xi |xi+1 , ...xn ) = f (x1 , ..., xi )

• This must be true for all the possible values of the n elements of the vector.

13.42 Mutual Independence


• A relevant particular case is that of a vector of mutually independent (or simply
independent) random variables. In this case:
Y
f (x1 , ..., xn ) = fXi (xi )
i=1,...,n

• Again, this must be true for all possible (x_1, ..., x_n). (Notice the subscript X_i
added to the one dimensional densities, to distinguish among the variables, and
the lower case x_i, which denote possible values of the variables).

13.43 Conditional Expectation
• Given a conditional probability function P (x1 , ..., xi |xi+1 , ...xn ) or a conditional
density f (x1 , ..., xi |xi+1 , ...xn ) we can define conditional expected values of, in
general, vector valued functions of the conditioned random variables.

• Something like E(g(x_1, ..., x_i)|x_{i+1}, ..., x_n) (the expected value is defined
exactly as in the one dimensional case by a proper sum/series or integral opera-
tor).

13.44 Conditional Expectation


• It is to be understood that such an expected value is a function of the conditioning
variables. If we understand this, it should not be a surprise that we can take
the expected value of a conditional expected value. In this case the following
property is of paramount relevance:

E(E(g(x_1, ..., x_i)|x_{i+1}, ..., x_n)) = E(g(x_1, ..., x_i))

• Where, in order to understand the formula, we must remember that the outer
expected value in the left hand side of the identity is with respect to (wrt) the
marginal distribution of the conditioning variables vector: (xi+1 , ...xn ), while the
inner expected value of the same side of the identity is wrt the conditional distri-
bution. Notice that, in general, this inner expected value: E(g(x1 , ..., xi )|xi+1 , ...xn )
is a function of the conditioning variables (the conditioned variables are “inte-
grated out” in the operation of taking the conditional expectation) so that it is
meaningful to take its expected value with respect to the conditioning variables.

• The expected value on the right hand side, however, is with respect to the
marginal distribution of the conditioned variables (x1 , ..., xi ).

13.45 Conditional Expectation


• To be really precise we must say that the notation we use (small printed letters
for both the values and the names of the random variables) is approximate: we
should use capital letters for variables and small letters for values. However we
follow the practice that usually leaves the distinction to the discerning reader.

13.46 Law of Iterated Expectations


• The above property is called “law of iterated expectations” and can be written in
much more general ways.

• In the simplest case of two vectors we have: E_Y(E_{X|Y}(X|Y)) = E_X(X). For the
conditional expected value, wrt the conditioned vector, all the properties of
the marginal expectation hold.
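The law can be verified mechanically on a small discrete joint distribution; the table below is a made-up example with keys (x, y):

```python
# Check E_Y( E(X|Y) ) = E(X) on a discrete joint probability function.
joint = {(0, 0): 0.10, (0, 1): 0.20,
         (1, 0): 0.30, (1, 1): 0.40}

p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p        # marginal of the conditioning variable

def cond_exp_x(y):
    # E(X | Y = y) = sum_x x * P(x, y) / P(Y = y)
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y[y]

lhs = sum(cond_exp_x(y) * py for y, py in p_y.items())   # E_Y( E(X|Y) )
rhs = sum(x * p for (x, _), p in joint.items())          # E(X)
```

Here E(X|Y = 0) = 0.75 and E(X|Y = 1) = 2/3 differ, yet their average weighted by the marginal of Y reproduces E(X) = 0.7 exactly.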

13.47 Regressive Dependence


• Regression function and regressive dependence.

• Being a function of Y, the conditional expectation EX|Y (X|Y) is also called


“regression function” of X on Y. Analogously, EY|X (Y|X) is the regression func-
tion of Y on X. If EX|Y (X|Y) is constant wrt Y we say that X is regressively
independent on Y.

• If EY|X (Y|X) is independent of X we say that Y is regressively independent on


X.

• Regressive dependence/independence is not a symmetric concept: it can hold on


a side only.

• Moreover, stochastic independence implies two sided regressive independence,


again, the converse is not true.

• A tricky topic: conditional expectation is, in general, a “static” concept. For any
GIVEN set of values of, say, Y you compute EX|Y (X|Y). However, implicitly,
the term “regression function” implies the possibility of varying the values of the
conditioning vector (or variable). This must be taken with the utmost care as it
is at the origin of many misunderstandings, in particular with regard to “causal
interpretations” of conditional expectations. The best, if approximate, idea to
start with is that EX|Y (X|Y) gives us a “catalog” of expected values each valid
under given “conditions” Y, be it or not be it possible or meaningful to “pass”
from one set of values of Y to another set.

13.48 Covariance and Correlation


• The covariance between two random variables X and Y is defined as: Cov(X, Y) =
E(XY) − E(X)E(Y).

• From the above definition we get that, for any set of constants a, b, c, d: Cov(a + bX, c + dY) =
bd Cov(X, Y).

• An important result (the Cauchy inequality) allows us to show that |Cov(X, Y)| ≤
√(V(X)V(Y)). From this we derive a “standardized covariance” called the “correlation
coefficient”: Cor(X, Y) = Cov(X, Y)/√(V(X)V(Y)).

• We have Cor(a + bX, c + dY) = Sign(bd)Cor(X, Y).

• The square of the correlation coefficient is usually called R square or rho square.

• Notice that, regressive independence, even only unilateral, implies zero covariance
and zero correlation, the converse, however, is in general not true.
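A sketch computing Cov and Cor for a discrete random vector and checking the affine rule Cor(a + bX, c + dY) = Sign(bd) Cor(X, Y); the three-point distribution is invented for the example:

```python
import math

# (x, y, probability) triples of a discrete random vector; illustrative numbers.
points = [(1.0, 2.0, 0.2), (2.0, 1.0, 0.5), (3.0, 4.0, 0.3)]

def E(f):
    # Expected value of f(X, Y) under the discrete distribution above
    return sum(f(x, y) * p for x, y, p in points)

def cov(fx, fy):
    # Cov = E(fx * fy) - E(fx) * E(fy)
    return E(lambda x, y: fx(x, y) * fy(x, y)) - E(fx) * E(fy)

def cor(fx, fy):
    return cov(fx, fy) / math.sqrt(cov(fx, fx) * cov(fy, fy))

r = cor(lambda x, y: x, lambda x, y: y)
r_flipped = cor(lambda x, y: 5 - 2 * x, lambda x, y: 1 + 3 * y)  # b*d < 0: sign flips
```

Since b = −2 and d = 3 have opposite signs, r_flipped equals −r, while its absolute value is unchanged: correlation is invariant (up to sign) under affine transforms.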

13.49 Distribution of the max and the min for independent


random variables
• Let {X1 , ..., Xn } be independent random variables with distribution functions
FXi (.).

• Let X(1) = max{X1 , ..., Xn } and X(n) = min{X1 , ..., Xn }.

• Then F_{X(1)}(t) = ∏_{i=1}^n F_{X_i}(t) and F_{X(n)}(t) = 1 − ∏_{i=1}^n (1 − F_{X_i}(t)).

• If the random variables are also identically distributed we have

F_{X(1)}(t) = ∏_{i=1}^n F_{X_i}(t) = F^n(t)

and

F_{X(n)}(t) = 1 − ∏_{i=1}^n (1 − F_{X_i}(t)) = 1 − (1 − F(t))^n.

13.50 Distribution of the max and the min for independent


random variables
• Why? Consider the case of the max. F_{X(1)}(t) is, by definition, the probability
that the value of the max among the n random variables is less than or equal to
t.

• But the max is less than or equal to t if and only if each random variable is less
than or equal to t.

• Since they are independent, this is given by the product of the F_{X_i}, each computed
at the same point t, that is F_{X(1)}(t) = ∏_{i=1}^n F_{X_i}(t).

• For the min: 1 − F_{X(n)}(t) is the probability that the min is greater than t. But
this is true if and only if each of the n random variables has a value greater than t,
and for each random variable this probability is 1 − F_{X_i}(t). They are independent,
so...
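The F^n formula can be sanity-checked by simulation with Uniform(0,1) variables, for which F(t) = t on [0,1]; the choices of n, t, the number of trials and the seed are ours:

```python
import random

random.seed(42)                     # fixed seed: the check is reproducible
n, trials, t = 5, 200_000, 0.7
hits = sum(max(random.random() for _ in range(n)) <= t
           for _ in range(trials))
empirical = hits / trials           # Monte Carlo estimate of P(max <= t)
exact = t ** n                      # F(t)^n for n iid Uniform(0,1) variables
```

With 200,000 trials the Monte Carlo standard error is well below 0.01, so the empirical frequency lands very close to 0.7^5 ≈ 0.168.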

13.51 Distribution of the sum of independent random variables
and central limit theorem
• Let {X_1, ..., X_n} be independent random variables. Let S_n = ∑_{i=1}^n X_i be their
sum.

• We know that, if each random variable has expected value µ_i and variance σ_i²,
then E(S_n) = ∑_{i=1}^n µ_i and V(S_n) = ∑_{i=1}^n σ_i².

• To be more precise: the first property is always valid, whatever the dependence,
provided the expected values exist, while the second only requires zero correlation
(provided the variances exist).

• Can we say something about the distribution of Sn ?

• If we knew the distributions of the Xi we could (but this could be quite cumber-
some) compute the distribution of the sum.

• However, if we do not know (better: do not make hypotheses on) the distributions
of the Xi we still can give proof to a powerful and famous result which, in its
simplest form, states:

13.52 Distribution of the sum of independent random variables


and central limit theorem
• Let {X_1, ..., X_n} be iid random variables with expected value µ and variance σ².
Then

lim_{n→∞} Pr( (S_n/n − µ)/(σ/√n) ≤ t ) = Φ(t)

where, as specified above, Φ(.) is the PDF of a standard Gaussian.

• In practice this means that, under the hypotheses of this theorem, if “n is big
enough” (a sentence whose meaning should be, and can be, made precise) we
can approximate F_{S_n}(s) with Φ((s/n − µ)/(σ/√n)).

13.53 Distribution of the sum of independent random variables


and central limit theorem
• More general versions of this theorem exist, with not necessarily identically distributed
or even not independent X_i.

• This result is fundamental in statistical applications where confidence levels for
confidence intervals or error sizes for tests must be computed in non standard
settings.

Statistical inference
13.54 Why Statistics
• Probabilities are useful when we can specify their values. As we saw above,
sometimes, in finite settings (coin flipping, dice rolling, card games, roulette, etc.)
it is possible to reduce all probability statements to simple statements judged, by
symmetry properties, equiprobable.

• In these cases we say we “know” probabilities (at least in the sense that we agree on
their values and, as a first approximation, do not look for some “discovery rule” for
probabilities) and use these for making decisions (meaning: betting). In other
circumstances we are not so lucky.

• This is obvious when we consider betting on horse racing, computing insurance


premia, investing in financial securities. In all these fields “symmetry” statements
are not reasonable.

• However, from the didactic point of view, it is useful to show that the “problem”
is there even with simple physical “randomizing devices” when their “shape” does
not allow for simple symmetry statements.

• Consider for instance rolling a pyramidal “die”: this is a five sided object with
four triangular sides and one square side. In this case, what is the probability
for each single side to be the down side? For some news on dice see http:
//en.wikipedia.org/wiki/Dice

13.55 Unknown Probabilities and Symmetry


• The sides are not identical, so the classical argument for equiprobability does not
hold. We may agree that the probability of each triangular face is the same, as
the die is clearly symmetric if seen with the square side down. But then: what
is the total value of these four probabilities? Or, what is the same, what is the
probability for the square face to be the down one?

• Just by observing different pyramidal dice we could surmise that the relative
probability of the square face and of the four triangular faces depends, also, on
the effective shape of the triangular faces. We could hypothesize, perhaps, that
the greater the height of such faces, the bigger the probability for a triangular
face to be the down one in comparison to the probability for the square face.

13.56 Unknown Probabilities and Symmetry


• With skillful physical arguments we could come up with some quantitative hy-
potheses; we understand, however, that this shall not be simple. In all
likelihood a direct observation of the results from a series of actual rolls of this
die could be very useful.

• For instance we could observe, not simply hypothesize, that (for a pyramid made
of some homogeneous substance) the more peaked are the triangular sides (and
so the bigger their area for a given square basis of the pyramid) the smaller the
probability for the square side to be the one down after the throw. We could also
observe, directly or by mind experiment, that the “degenerate” pyramid having
height equal to zero is, essentially, a square coin so that the probability of each side
(the square one and the one which shall transform in the four triangles) should
be 1/2. From these two observations and some continuity argument we could
conclude that there should be some unknown height such that the probability of
falling on a triangular side is, say, 1 > c ≥ 1/2 and, by symmetry, the probability
for each triangular side, is c/4. This provided there is no cheating in throwing
the die so that the throw is “chaotic” enough. So, beware of magicians!

• What is interesting is that this “mental+empirical” analysis gives us a possible


probability model for the result of throwing our pyramidal die. Moreover, this
model could be enriched by some law connecting c with the height of the pyramid.
Is c proportional to the height? Proportional to the square of the height? To the
square root of the height? As we shall see in what follows Statistics could be a
tool for choosing among these hypotheses.

• Conversely, suppose you know, from previous analysis, that, for a pyramid made
of homogeneous material, a good approximation is c proportional to the height.
In this case a good test to assess the homogeneity of the material with which the
pyramid is made could be that of throwing several pyramids of different heights
and seeing if the ratio between the frequency of a triangular face and the height of
the pyramid is constant.

13.57 No Symmetry
• Consider now a different example: horse racing. Here the event whose probability
we are interested in is, to be simple, the name of the winner.

• It is “clear” that symmetry arguments here are useless. Moreover, in this case
even the use of past data cannot mimic the case of the pyramid: while observation
of past race results could be relevant, the idea of repeating the same race a
number of times in order to derive some numerical evaluation of probability is
both unpractical and, perhaps, even irrelevant.

13.58 No Symmetry
• What we may deem useful are data on past races of the contenders, but these
data regard different track conditions, different tracks and different opponents.

• Moreover they regard different times, hence, a different age of the horse(s), a
different period in the years, a different level of training, and so on.

• History, in short.

• This not withstanding, people bet, and bet hard on such events since immemorial
past. Where do their probabilities come from?

• An interesting point to be made is that, in antiquity, while betting was even more
common than it is today (in many cultures it had a religious content: looking for
the favor of the gods), betting tools, like dice existed in a very rudimentary form
with respect to today. We know examples of fantastically “good” dice made of
glass or amber (many of these being not used for actual gambling but as offers
to the Deity). These are very rare. The most commonly used die came from a
roughly cubic bone off a goat or a sheep. In this case symmetry arguments were
impossible and experience could be useful.

• An interesting anthropological fact is that in classical times gambling was very
common, and the concepts of chance and luck were so widespread as to merit
specific deities. However, no hint of any kind of “uncertainty quantification” is
known, with the exception of some side comments. Why this is the case is a
mystery. It may be that the religious content mentioned above made the idea of
quantifying chance in some sense blasphemous, but this is only a hypothesis.

13.59 Learning Probabilities


• Let us sum up: probability is useful for taking decision (betting) when the only
unknown is the result of the game.

• This is the typical case in simple games of chance (not in the, albeit still simple,
pyramidal dice case).

• If we want to use probability when numerical values for probabilities are not
easily derived, we are going to be uncertain both about the results and about
the probabilities of such results.

• We can do nothing (legal) about the results of the game, but we may do something
for building some reasonable way for assessing probabilities. In a nutshell this is
the purpose of Statistics.

• The basic idea of Statistics is that, in some cases, we can “learn” probabilities from
repeated observations of the phenomena we are interested in.

• The problem is that for “learning” probabilities we need ... probabilities!

13.60 Pyramidal Die


• Let us work at an intuitive level on a specific problem. Consider this set of basic
assumptions concerning the pyramidal die problem.

• We may agree that the probability for each face to be the down one in repeated
rollings of the die is constant, unknown but constant.

• Moreover, we may accept that the order with which results are recorded is, for
us, irrelevant as “experiments” (rolls of the dice) are made always in the same
conditions.

• We, perhaps, shall also agree that the probability of each triangular face is the
same.

13.61 Pyramidal Die Model


• Well: we now have a “statistical model”. Let us call θi , i = 1, 2, 3, 4 the probabil-
ities of each triangular face.

• These are going to be non-negative numbers (Probability Theory requires this);
moreover, if we agree with the statement about their identity, each of these values
must be equal to the same θ, so the total probability for a triangular face to be
the down one shall be 4θ.

• By the rules of probability, the probability for the square face is going to be 1−4θ
and, since this cannot be negative, we need θ ≤ .25 (where we perhaps shall avoid
the equal part in the ≤ sign).

• If we recall the previous analysis we should also require θ ≥ 1/8.

13.62 Pyramidal Die Constraints
• All these statements come from Probability Theory joint with our assumptions
on the phenomenon we are observing.
• In other, more formal, words we specify a probability model for each roll of the
die and state this:
• In each roll we can have a result in the range 1,2,3,4,5;
• The probability of each of the first four values is θ and this must be a number
not greater than .25.
• With just these words we have hypothesized that the probability distribution of
each result in a single toss is an element of a simple but infinite and very specific
set of probability distributions completely characterized by the numerical value
of the “parameter” θ which could be any number in the “parameter space” given
by the real numbers between 1/8 and 1/4 (left extreme included if you like).

13.63 Many Rolls


• This is a model for a single rolling. But, exploiting our hypotheses, we can easily
go on to a model for any set of rollings of the dice.
• In fact, we have supposed that each sequence of results of a given length has
a probability which only depends on the number of triangular and square faces
observed in the series (in technical terms we say that the observation process
produces an “exchangeable” sequence of results, that is: sequences of results
containing the same number of 5s and non 5s have the same probability).
• Just for simplicity in computation let us move on a step: we shall strengthen our
hypothesis and actually state that the results of different rollings are stochasti-
cally independent (this is a particular case of exchangeability, that is: it implies but
is not implied by exchangeability).

13.64 Probability of Observing a Sample


• Under this hypothesis and the previously stated probability model for each single
roll, the joint probability of a sample of size n, where we only record 5s and non
5s, is just the product of the probabilities for each observation.
• In our example: suppose we roll the dice 100 times and observe 40 times 5
(square face down) and 60 times either 1 or 2 or 3 or 4, since each of these
faces is incompatible with the other and each has probability θ, the probability
of “either 1 or 2 or 3 or 4” is 4θ.

• The joint probability of the observed sample is thus (4θ)^60 (1 − 4θ)^40.

13.65 Pre or Post Observation?


But here there is a catch, and we must understand this well: are we computing the
probability of a possible sample before observation, or the probability of the observed
sample? In the first case no problems, the answer is correct, but, in the second, we
must realize that the probability of observing the observed sample is actually one, after
all we DID observe it!

• Let us forget, for the moment, this subtlety which is going to be relevant in what
follows. We have the probability of the observed sample; since the sample is
given, the only thing in the formula which can change value is the parameter θ.

• The probability of observing the given sample shall, in general, be a function of
this parameter.

13.66 Maximize the Probability of the Observed Sample


• The value which maximizes the probability of the observed sample among the
possible values of θ is (check it!) θ̂ = 60/400 = 3/20 = .15

• Notice that this value maximizes (4θ)^60 (1 − 4θ)^40, the probability of observing
the given sample (or any given specific sample containing 40 5s and 60 non 5s),
but it also maximizes C(100, 40) (4θ)^60 (1 − 4θ)^40, that is: the probability of
observing A sample in the set of samples containing 40 5s and 60 non 5s. (Be
careful in understanding the difference between “the given sample” and “A sample
in the set”; moreover notice that C(100, 40) = C(100, 60), where C(n, k) denotes
the binomial coefficient.)
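This maximization can be checked numerically. The sketch below is not part of
the original derivation and the grid step is an arbitrary choice; it scans the
parameter space 1/8 ≤ θ < 1/4 for the maximizer of the log of the observed-sample
probability:

```python
from math import log

# Log of (4*theta)^60 * (1 - 4*theta)^40, the observed-sample probability
def log_lik(theta):
    return 60 * log(4 * theta) + 40 * log(1 - 4 * theta)

# Grid over the interior of the parameter space [1/8, 1/4); step is arbitrary
step = 0.125 / 100_000
grid = [0.125 + i * step for i in range(1, 100_000)]
theta_hat = max(grid, key=log_lik)

print(round(theta_hat, 4))   # agrees with the analytical maximizer 60/400 = .15
```

The grid maximizer matches the analytical value θ̂ = .15 up to the grid step.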


13.67 Maximum Likelihood


• Stop for a moment and fix some points. What did we do, after all? Our problem
was to find a probability for each face of the pyramidal dice. The only thing we
could say a priori was that the probability of each triangular face was the same.
From this and simple probability rules we derived a probability model for the
random variable X whose values are 1, 2, 3, 4 when the down face is triangular,
and 5 when it is square.

• We then added an assumption on the sampling process: observations are iid
(independent and identically distributed as X). The two assumptions constitute
a “statistical model” for X and are enough for deriving a strategy for “estimating”
θ (the probability of any given triangular face).

• The suggested estimate is the value θ̂ which maximizes the joint probability
of observing the sample actually observed. In other words we estimated the
unknown parameter according to the maximum likelihood method.

13.68 Sampling Variability


• At this point we have an estimate of θ and the first important point is to under-
stand that this actually is just an estimate, it is not to be taken as the “true”
value of θ.

• In fact, if we roll the dice another 100 times and compute the estimate with the
same procedure, most likely, a different estimate shall come up and for another
sample, another one and so on and on.

• Statisticians do not only find estimates, most importantly they study the worst
enemy of someone who must decide under uncertainty and unknown probabilities:
sampling variability.

13.69 Possibly Different Samples


• The point is simple: consider all possible different samples of size 100. Since, as
we assumed before, the specific value of a non 5 is irrelevant, let us suppose, for
simplicity, that all that is recorded in a sample is a sequence of 5s and non 5s.

• Since in each roll we either get a 5 or a non 5 the total number of these possible
samples is 2^100.

• On each of these samples our estimate could take a different value, consider,
however, that the value of the estimate only depends on how many 5s and non 5s
were observed in the specific sample (the estimate is the number of non 5s divided
by 4 times 100).

• So the probability of observing a given value of the estimate is the same as the
probability of the set of samples with the corresponding number of 5s.

13.70 The Probability of Our Sample


• But it is easy to compute this probability: since by our assumptions on the
statistical model, every sample containing the same number of 5s (and so of non
5s) has the same probability, in order to find this probability we can simply
compute the probability of a generic sample of this kind and multiply it times
the number of possible samples with the same number of 5s.

• If the number of 5s is, say, k, we find that the probability of the generic sample
with k 5s and 100-k non 5s is (see above): (4θ)^(100−k) (1 − 4θ)^k.

13.71 The Probability of a Similar Estimate


• This is the same for any sample with k 5s and 100-k non 5s. There are many
samples of this kind, depending on the order of results. The number of possible
samples of this kind can be computed in this simple way: we must put k 5s in a
sequence of 100 possible places.

• We can insert the first 5 in any of 100 places, the second in any of 99 and so on.

• We get 100 ∗ 99 ∗ ... ∗ (100 − k + 1) = 100!/(100 − k)!. However, there are k ways
to choose which 5 was inserted first, k − 1 for the second, and so on up to 1 for
the k-th; for all these k! orderings the sample is always the same, so the number
of different samples (“combinations”) is 100!/(k!(100 − k)!) = C(100, k), the
binomial coefficient.

13.72 The Probability of a Similar Estimate


• This is the number of different sequences of “strings” of 100 elements each con-
taining k 5s and 100-k non 5s.

• Summing up: the probability of observing k 5s on 100 rolls, hence of computing
an estimate of θ equal to (100 − k)/400, is precisely: C(100, k) (4θ)^(100−k) (1 − 4θ)^k
(which is a trivial modification of the binomial).
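The whole sampling distribution of the estimate can be tabulated from this formula.
In the sketch below the “true” value θ = 0.15 is an assumption made only for
illustration:

```python
from math import comb

theta = 0.15        # assumed "true" value, chosen only for illustration
n = 100

# pmf maps each possible value (100 - k)/400 of the estimate to its
# probability C(100, k) (4*theta)^(100-k) (1 - 4*theta)^k
pmf = {}
for k in range(n + 1):          # k = number of 5s (square face down)
    est = (n - k) / (4 * n)
    pmf[est] = comb(n, k) * (4 * theta) ** (n - k) * (1 - 4 * theta) ** k

most_probable = max(pmf, key=pmf.get)
print(most_probable)            # a priori most probable value of the estimate
```

The printed value is the binomial mode, in agreement with the discussion that
follows.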

13.73 The Probability of a Similar Estimate


• So, before sampling, for any possible “true” value of θ we have a different
probability for each of the (101, in this case) possible values of the estimate.

• The reader shall realize that, for each given value of θ the a priori (of sampling)
most probable value of the estimate is the one corresponding to the integer num-
ber of 5s nearest to 100(1 − 4θ) (which in general shall not be integer).

13.74 The Estimate in Other Possible Samples


• Obviously, since this is just the most probable value of the estimate if the prob-
ability is computed with this θ, it is quite possible, it is in fact very likely, that
a different sample is observed.

• Since our procedure is to estimate θ with (100 − k)/400, this immediately implies
that, in the case the observed sample is not the most probable for that given θ,
the value of the estimate shall NOT be equal to θ; in other words it shall be
“wrong”, and the reason of this is the possibility of observing many different
samples for each given “true” θ, that is: sampling variability.

• In general, using the results above, for any given θ, the probability of observing
a sample of size n which gives as an estimate (n − k)/(4n) is (as above)
C(n, k) (4θ)^(n−k) (1 − 4θ)^k.

13.75 The Estimate in Other Possible Samples


• So, for instance, the probability, given this value of θ, of observing a sample
such that the estimate (n − k)/(4n) is equal to the parameter value is, if we
suppose that the value for θ which we use in computing this probability can be
written as (n − k)/(4n) (otherwise the probability is 0 and we must use intervals
of values):

C(n, k) (4(n − k)/(4n))^(n−k) (1 − 4(n − k)/(4n))^k = C(n, k) ((n − k)/n)^(n−k) (1 − (n − k)/n)^k

• Due to what we did see above, the value (n − k)/(4n) is the most probable value
of the estimate when θ = (n − k)/(4n), but many other values may have sizable
probability, so that, even if the “true value” is θ = (n − k)/(4n), it is possible to
observe estimates different from (n − k)/(4n) with non negligible probability.

13.76 Sampling Variability


• The study of the distribution of the estimate given θ is called the study of the
“sampling variability” of the estimate: the tendency of the estimate to change
across different samples. This study can be done in several different ways.

• For instance, using again our example, we see clearly that there does not exist a
single “sampling distribution” of the estimate as there is one for each value of the
parameter.

• On one hand this is good, because otherwise the estimate would give us quite
poor information about θ: the information we get from the estimate comes exactly
from the fact that for different values of θ different values of the estimate are more
likely to be observed.

• On the other hand, it does not allow us to say which is the “sampling distribution”
of the estimate, but only gives us a family of such distributions.

13.77 Sampling Variability
• However, even if we do not know the value of the parameter we may study several
aspects of the sampling distribution.

• For instance, for each θ we can compute, given that θ, the expected value of the
estimate under the distribution of the estimate with that particular value of θ.
In other words we could compute

Σ_{k=0}^{n} ((n − k)/(4n)) C(n, k) (4θ)^(n−k) (1 − 4θ)^k

and by doing this computation we would see that the result is θ itself, no matter
which value θ has. So we say that the estimate is unbiased.
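The unbiasedness claim can be checked numerically; the θ values below are
arbitrary points of the parameter space, chosen only for illustration:

```python
from math import comb

def expected_estimate(theta, n=100):
    # E[(n - k)/(4n)] under the distribution C(n, k)(4*theta)^(n-k)(1-4*theta)^k
    return sum((n - k) / (4 * n)
               * comb(n, k) * (4 * theta) ** (n - k) * (1 - 4 * theta) ** k
               for k in range(n + 1))

for theta in (0.13, 0.15, 0.22):       # arbitrary values in the parameter space
    print(theta, expected_estimate(theta))   # the two numbers coincide
```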

13.78 Sampling Variability


• Again, for each θ we can compute the variance of the estimate under the
distribution of the estimate with that particular value of θ. That is, we could
compute

Σ_{k=0}^{n} ((n − k)/(4n))^2 C(n, k) (4θ)^(n−k) (1 − 4θ)^k − θ^2 = 4θ(1 − 4θ)/(16n)

the “sampling variance” of the estimate, and see that, while this is a function of
θ (whose value is unknown to us), for any value of θ it goes to 0 when n goes
to infinity. This, joint with the above unbiasedness result, implies (Tchebicev
inequality) that the probability of having

(n − k)/(4n) ∈ [θ − c, θ + c]

that is: of observing a value of the estimate that differs from θ by at most c, goes
to 1 for ANY c > 0, no matter the value of θ. This is called “mean square
consistency”.
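This shrinking of the sampling variance can be checked numerically. In the sketch
below θ = 0.15 is an arbitrary assumed “true” value:

```python
from math import comb

# Compute the sampling variance of the estimate (n - k)/(4n) directly
# from its distribution, for an assumed theta and growing n
def sampling_variance(theta, n):
    probs = [comb(n, k) * (4 * theta) ** (n - k) * (1 - 4 * theta) ** k
             for k in range(n + 1)]
    ests = [(n - k) / (4 * n) for k in range(n + 1)]
    mean = sum(e * p for e, p in zip(ests, probs))
    return sum(e * e * p for e, p in zip(ests, probs)) - mean ** 2

theta = 0.15                                  # arbitrary assumed "true" value
for n in (10, 100, 1000):
    print(n, sampling_variance(theta, n))     # shrinks roughly as 1/n
```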

13.79 Sampling Variability


• A curiosity. In typical applications the sampling variance depends on the un-
known parameter(s).

• While any reasonable estimate must have a sampling distribution depending on
the unknown parameter(s), there are cases where the sampling variance could be
independent of the unknown parameter(s).

• For instance, in iid sampling from an unknown distribution with unknown ex-
pected value µ and known standard deviation σ, the usual estimate of µ, the
arithmetic mean of the data, has a sampling variance equal to σ^2/n which does
not depend on unknown parameters (repeat: we assumed σ known).

13.80 Estimated Sampling Variability


• In the end, suppose we wish for some “number” for the sampling variance when,
as in our case, it depends on the unknown parameter through the formula
4θ(1 − 4θ)/(16n), or for some specific distribution in the place of the family of
distributions C(n, k) (4θ)^(n−k) (1 − 4θ)^k. We could then “estimate” these by
substituting the estimate θ̂ = (n − k)/(4n) for the unknown value of θ and get:

• V̂(θ̂) = 4θ̂(1 − 4θ̂)/(16n) and P̂(θ̂ = (n − k)/(4n)) = C(n, k) (4θ̂)^(n−k) (1 − 4θ̂)^k;
always remember to notice the “hats” on V and P.

13.81 Quantifying Sampling Variability


• Whatever method we use for dealing with sampling variability the point is to
face it
• We could find different procedures for computing our estimate; however, for the
same reason (for each given true value of θ many different samples are possible)
any reasonable estimate always has a sampling distribution (in reasonable cases
depending on θ), so we would in any case face the same problem: sampling
variability.
• The point is not to avoid sampling variability but to live with it. In order to do
this it is better to follow some simple principles.
• Simple, yes, but so often forgotten, even by professionals, as to create most
problems encountered in practical applications of Statistics.

13.82 Principle 1
• The first obvious principle to follow in order to be able to do this is: “do not
forget it”.
• An estimate is an estimate is an estimate, it is not the “true” θ.
• This seems obvious, but errors of this kind are quite common: it seems the human
brain does not like uncertainty and, if not properly conditioned, it shall try, in
any possible way, to make us wrongly believe that we are sure about something
on which we only possess some clue.

13.83 Principle 2
• The second principle is “measure it”.

• An estimate (point estimate) by itself is almost completely useless; it should
always be supplemented with information about sampling variability.

• At the very least information about sampling standard deviation should be added.
Reporting in the form of confidence intervals could be quite useful.

• This and not point estimation is the most important contribution Statistics may
give to your decisions under uncertainty.

13.84 Principle 3
• The third principle is “do not be upset by it”.

• Results of decisions may upset you even under certainty. This is obviously much
more likely when chance is present even if probabilities are known.

• We are at the third level: no certainty, chance is present, probabilities are un-
known!

• The best Statistics can only guarantee an efficient and logically coherent use of
available information.

• It does not guarantee Luck in “getting the right estimates” and obviously it cannot
guarantee that, even if probabilities are estimated well, something very unlikely
does not happen! (And no matter what, People shall always expect, forgive the
joke, that what is most probable is much more likely than it is probable).

13.85 The Questions of Statistics


• This long discussion should be useful as an introduction to the statistical problem:

• Why do we need to do inference rather than simply use Probability?

• What can we expect from inference?

• Now let us be a little more precise.

13.86 Statistical Model
• This is made of two ingredients.

• The first is a probability model for a random variable (or more generally a random
vector, but here we shall consider only the one dimensional case).

• This is simply a set of distributions (probability functions or densities) for the
random variable of interest. The set can be indexed by a finite set of numbers
(parameters) and in this case we speak of a parametric model. Otherwise we
speak of a non parametric model.

• The second ingredient is a sampling model, that is: a probabilistic assessment
about the joint distribution of repeated observations on the variable of interest.

• The simplest example of this is the case of independent and identically distributed
observations (simple random sampling).

13.87 Specification of a Parametric Model


• Typically a parametric model is specified by choosing some functional form for
the probability or density function (here we use the symbol P for both) of the
random variable X, say: X ∼ P(X; θ), and a set of possible values for θ: θ ∈ Θ
(in the case of a parametric model).

• Sometimes we do not fully specify P but simply ask, for instance, for X to have
a certain expected value or a certain variance.

13.88 Statistic
• A fundamental concept is that of “estimate” or “statistic”. Given a sample X,
an estimate is simply a function of the sample and nothing else: T(X).

• In other words it cannot depend on unknowns, like the parameters of the model.
Once the sample is observed the estimate becomes a number.

13.89 Parametric Inference


• When we have a parametric model we typically speak about “parametric infer-
ence”, and we are going to do so here.

• This may give the false impression that statisticians are interested in parameter
values.

• Sometimes this may be so but, really, statisticians are interested in assessing
probabilities for (future) values of X, parameters are just “middlemen” in this
endeavor.

13.90 Different Inferential Tools


• Traditionally parametric inference is divided into three (interconnected) sections:

• Point estimation;

• Interval estimation;

• Hypothesis testing.

13.91 Point Estimation


• In point estimation we try to find an estimate T (X) for the unknown parameter
θ (the case of a multidimensional parameter is completely analogous).

• In principle, any statistic could be an estimate, so we discriminate between good
and bad estimates by studying the sampling properties of these estimates.

• In other words we try to assess whether a given estimate's sampling distribution
(that is, as we did see before, the probability distribution of the possible values
of the statistic as induced by the probabilities of the different possible samples)
enjoys or not a set of properties we believe useful for a good estimate.

13.92 Unbiasedness
• An estimate T (X) is unbiased for θ iff Eθ (T (X)) = θ, ∀θ ∈ Θ. In order to
understand the definition (and the concept of sampling distribution) it is important
to realize that, in general, the statistic T has a potentially different expected value
for each different value of θ (hence each different distribution of the sample).

• What the definition asks for is that this expected value always corresponds to the
θ which indexes the distribution used for computing the expected value itself.

13.93 Mean Square Error


• We define the mean square error of an estimate T as: MSEθ(T) = Eθ((T − θ)^2).

• Notice how, in this definition, we stress the point that the MSE is a function of
θ (just like the expected value of T ).

• We recall the simple result:

Eθ((T − θ)^2) = Eθ((T − Eθ(T) + Eθ(T) − θ)^2) =

= Eθ((T − Eθ(T))^2) + (Eθ(T) − θ)^2

where the first term in the sum is the sampling variance of the estimate and the
second is the squared “bias”.

• Obviously, for an unbiased estimate, MSE and sampling variance are the same.
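The decomposition can be verified on simulated data. In this toy sketch every
number (the known mean 2, the artificial bias 0.1, the sample sizes) is an
assumption made up only for the check:

```python
import random

# Deliberately biased estimate T = sample mean + 0.1 of a known mean theta = 2
random.seed(0)
theta = 2.0
estimates = []
for _ in range(5_000):                            # many simulated samples
    sample = [random.gauss(theta, 1.0) for _ in range(30)]
    estimates.append(sum(sample) / len(sample) + 0.1)

m = len(estimates)
mean_T = sum(estimates) / m
mse = sum((t - theta) ** 2 for t in estimates) / m
variance = sum((t - mean_T) ** 2 for t in estimates) / m
bias_sq = (mean_T - theta) ** 2

print(abs(mse - (variance + bias_sq)) < 1e-9)     # the decomposition holds
```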

13.94 Mean Square Efficiency


• Suppose we are comparing two estimates for θ, say: T1 and T2 .

• We state that T1 is not less efficient than T2 if and only if MSEθ(T1) ≤ MSEθ(T2),
∀θ ∈ Θ.

• As in the case of unbiasedness, the most important point is to notice the “for all”
quantifier (∀).

• This implies, for instance, that we cannot be sure, given two estimates, whether
one is not worse than the other under this definition.

• In fact it may well happen that the mean square errors, as functions of the
parameter, “cross”, so that one estimate is “better” for some set of parameter
values while the other is better for a different set.

• In other words, the order induced on estimates by this definition is only partial.

13.95 Meaning of Efficiency


If an estimate T1 satisfies this definition with respect to another estimate T2, this
means (use Tchebicev inequality and the above decomposition of the mean square
error) that it shall have a bigger (better: not smaller) probability than T2 of being
“near” θ, for any value of this parameter.

13.96 Mean Square Consistency


• Here we introduce a variation. Up to now the properties considered only fixed
sample sizes. Here, on the contrary, we consider the sample size n as a variable.

• Obviously, since an estimate is defined on a given sample, this new setting requires
the definition of a sequence of estimates and the property we are about to state
is not a property of an estimate but of a sequence of estimates.

13.97 Mean Square Consistency
• A sequence {T_n} of estimates is termed “mean square consistent” if and only if
lim_{n→∞} MSEθ(T_n) = 0, ∀θ ∈ Θ.

• You should notice again the quantifier on the values of the parameter.

• Given the above decomposition of the MSE the property is equivalent to the joint
request: lim_{n→∞} Eθ(T_n) = θ, ∀θ ∈ Θ and lim_{n→∞} Vθ(T_n) = 0, ∀θ ∈ Θ.

• Again, using Tchebicev, we understand that the requirement implies that, for
any given value of the parameter, the probability of observing a value of the
estimate in any given interval containing θ goes to 1 if the size of the sample goes
to infinity.

13.98 Methods for Building Estimates


We could proceed by trial and error: this would be quite time consuming. Better
to devise some “machinery” for creating estimates which we can reasonably expect
to be “good” in at least some of the senses defined above.

13.99 Method of Moments


• Suppose we have an iid (to be simple) sample X from a random variable X
distributed according to some (probability or density) P(X; θ), θ ∈ Θ, where the
parameter is, in general, a vector of k components.

• Suppose, moreover, X has got, say, n moments E(X^m) with m = 1, ..., n.

• In general we shall have E(X^m) = g_m(θ), that is: the moments are functions of
the unknown parameters.

13.100 Estimation of Moments


• Now, under iid sampling, it is very easy to estimate moments in a way that is, at
least, unbiased and mean square consistent (and also, under proper hypotheses,
efficient).

• In fact the estimate Ê(X^m) = Σ_{i=1,...,n} X_i^m / n, that is: the m-th empirical
moment, is immediately seen to be unbiased, while its MSE (the variance, in this
case) is V(X^m)/n, which (if it exists) obviously goes to 0 if the size n of the
sample goes to infinity.

13.101 Inverting the Moment Equation
• The idea of the method of moment is simple. Suppose for the moment that θ is
one dimensional.

• Choose any g_m and suppose it is invertible (if the model is sensible, this should
be true. Why?).

• Estimate the corresponding moment of order m with the empirical moment of
the same order and take as an estimate of θ the function
θ̂_m = g_m^(−1)(Σ_{i=1,...,n} X_i^m / n).

• In the case of k parameters, just solve, with respect to the unknown parameter
vector, a system of k equations connecting the parameter vector with k moments
estimated by the corresponding empirical moments.

13.102 Problems
• This procedure is intuitively alluring. However we have at least two problems.
The first is that any different choice of moments is going to give us, in general,
a different estimate (consider for instance the negative exponential model and
estimate its parameter using different moments).

• The Generalized Method of Moments tries to solve this problem (do not worry!
this is something you may ignore, for the moment).

• The second is that, while empirical moments under iid sampling are, for instance,
unbiased estimates of corresponding theoretical moments, this is usually not true
for method of moments estimates. This is due to the fact that the g_m we use are
typically not linear.

• Under suitable hypotheses we can show that method of moments estimates are
mean square consistent, but this is usually all we can say.

13.103 Maximum Likelihood


• Maximum likelihood method (one of the many inventions of Sir R. A. Fisher: the
creator of modern mathematical Statistics and modern mathematical genetics).

• Here the idea is clear if we are in a discrete setting (i.e. if we consider a model
of a probability function).

• The first step in the maximum likelihood method is to build the joint distribution
of the sample.

• In the context described above (iid sample) we have P(X; θ) = Π_i P(X_i; θ).

• Now, observe the sample and change the random variables in this formula (Xi )
into the corresponding observations (xi ).

• The resulting P (x; θ) cannot be seen as a probability of the sample (the proba-
bility of the observed sample is, obviously, 1), but can be seen as a function of θ
given the observed sample: Lx (θ) = P (x; θ).

13.104 Maximum Likelihood


• We call this function the “likelihood”.

• It is by no means a probability, either of the sample or of θ, hence the new name.

• The maximum likelihood method suggests the choice, as an estimate of θ, of the
value that maximizes the likelihood function given the observed sample; formally:
θ̂_ml = arg max_{θ∈Θ} L_x(θ).

13.105 Interpretation
• If P is a probability (discrete case) the idea of the maximum likelihood method
is that of finding the value of the parameter which maximizes the probability of
observing the sample actually observed (a posteriori).

• The reasoning is exactly as in the example at the beginning of this section.

• While for each given value of the parameter we may observe, in general, many
different samples, a set of these (not necessarily just one single sample: many
different samples may have the same probability) has the maximum probability
of being observed given the value of the parameter.

13.106 Interpretation
• We observe the sample and do not know the parameter value so, as an estimate,
we choose that value for which the specific sample we observe is among the most
probable samples.

• Obviously, if , given the parameter value, the sample we observe is not among the
most probable, we are going to make a mistake, but we hope this is not the most
common case and we can show, under proper hypotheses, that the probability of
such a case goes to zero if the sample size increases to infinity.

13.107 Interpretation
• A more satisfactory interpretation of maximum likelihood in a particular case.

• Suppose the parameter θ has a finite set (say m) of possible values and suppose
that, a priori of knowing the sample, the statistician considers the probability of
each of these values to be the same (that is, 1/m).

• Using Bayes theorem, the posterior probability of a given value of the parameter,
given the observed sample, shall be:

P(θ_j | x) = P(x | θ_j)(1/m) / Σ_{i=1,...,m} P(x | θ_i)(1/m) = h(x) L_x(θ_j).
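A small sketch of this equivalence (the finite set of θ values and the observed
counts are invented for illustration): with a uniform prior, the likelihood and the
posterior are maximized at the same parameter value.

```python
from math import comb

# Invented setup: m equiprobable parameter values, 40 fives out of 100 rolls
theta_values = [0.13, 0.15, 0.17, 0.19, 0.21, 0.23]
n, k = 100, 40

# Likelihood of the observed counts at each candidate value of theta
lik = {t: comb(n, k) * (4 * t) ** (n - k) * (1 - 4 * t) ** k
       for t in theta_values}

# Posterior via Bayes' theorem with the uniform prior 1/m
m = len(theta_values)
evidence = sum(v / m for v in lik.values())
post = {t: (lik[t] / m) / evidence for t in theta_values}

print(max(lik, key=lik.get), max(post, key=post.get))   # the same value twice
```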

13.108 Interpretation
• In words: if we consider the different values of the parameter a priori (of sample
observation) as equiprobable, then the likelihood function is proportional to the
posterior (given the sample) probability of the values of the parameter.

• So that, in this case, the maximum likelihood estimate is the same as the maxi-
mum posterior probability estimate.

• In this case, then, while the likelihood is not the probability of a parameter
value (it is proportional to it) to maximize the likelihood means to choose the
parameter value which has the maximum probability given the sample.

13.109 Maximum Likelihood for Densities


• In the continuous case the interpretation is less straightforward. Here the like-
lihood function is the joint density of the observed sample as a function of the
unknown parameter and the estimate is computed by maximizing it.

• However, given that we are maximizing a joint density and not a joint probability
the simple interpretation just summarized is not directly available.

13.110 Example (Discrete Case)


Example of the two methods. Let X be distributed according to the Poisson distribution, that is: P (x; θ) = θ^x e^{−θ} / x!, x = 0, 1, 2, ... Suppose we have a simple random sample of size n.

13.111 Example Method of Moments


• For this distribution all moments exist and, for instance, E(X) = θ, E(X^2 ) = θ^2 + θ.

• If we use the first moment for the estimation of θ we have θ̂1 = x̄ but, if we choose the second moment, we have: θ̂2 = (−1 + √(1 + 4x̄2 ))/2 where x̄2 here indicates the empirical second moment (the average of the squares).
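A quick numerical check of the two moment-based estimates (the sample below is made up for illustration): the first comes directly from the sample mean, the second from inverting m2 = θ^2 + θ, and in general they differ:

```python
import math

sample = [2, 4, 3, 5, 1, 3]           # hypothetical Poisson data
n = len(sample)

m1 = sum(sample) / n                  # empirical first moment (the mean)
m2 = sum(x * x for x in sample) / n   # empirical second moment (mean of squares)

theta_1 = m1                                 # from E(X) = theta
theta_2 = (-1 + math.sqrt(1 + 4 * m2)) / 2   # from E(X^2) = theta^2 + theta

# theta_2 solves m2 = theta^2 + theta exactly, yet theta_1 != theta_2
print(theta_1, theta_2)
```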

13.112 Example Maximum likelihood


• The joint probability of a given Poisson sample is: Lx (θ) = ∏i θ^{xi} e^{−θ} / xi ! = θ^{Σi xi} e^{−nθ} / ∏i xi !.

• For a given θ this probability does not depend on the specific values of each
single observation but only on the sum of the observations and the product of
the factorials of the observations.

• The value of θ which maximizes the likelihood is θ̂ml = x̄, which coincides with the method of moments estimate if we use the first moment as the function to invert.
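The closed-form result θ̂ml = x̄ can be verified with a crude numerical maximization of the Poisson log-likelihood (the sample and the search grid below are made up for illustration):

```python
import math

sample = [2, 4, 3, 5, 1, 3]   # hypothetical Poisson data
n = len(sample)
s = sum(sample)

def log_lik(theta):
    # log L_x(theta) = (sum_i x_i) log(theta) - n*theta, dropping the
    # constant -sum(log(x_i!)) which does not affect the argmax
    return s * math.log(theta) - n * theta

# crude grid search for theta in [0.5, 6.0] with step 0.001
grid = [0.5 + 0.001 * i for i in range(5501)]
theta_ml = max(grid, key=log_lik)

print(theta_ml)   # numerically equal to the sample mean s/n
```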

13.113 More Advanced Topics


• Sampling standard deviation, confidence intervals, tests, a preliminary comment.

• The following topics are almost untouched in standard US-style undergraduate Economics curricula, and only scantly covered in other systems.

• They are, actually, very important, but only vague notions of them can be required of a student as a prerequisite.

• In the following paragraphs such vague notions are shortly described.

13.114 Sampling Standard Deviation and Confidence Intervals


• As stated above, a point estimate is useless if it is not provided with some measure
of sampling error.

• A common procedure is to report the point estimate together with some measure related to the sampling standard deviation.

• We say “related” because, in the vast majority of cases, the sampling standard
deviation depends on unknown parameters, hence it can only be reported in an
“estimated” version.

13.115 Sampling Variance of the Mean
• The simplest example is this.

• Suppose we have n iid observations from an unknown distribution about which


we only know that it possesses expected value µ and variance σ 2 (by the way, are
we considering here a parametric or a non parametric model?)

• In this setting we know that the arithmetic mean is an unbiased estimate of µ.

• By recourse to the usual properties of the variance operator we find that the
variance of the arithmetic mean is σ 2 /n.

• If (as it is very frequently the case) σ 2 is unknown, even after observing the
sample we cannot give the value of the sampling standard deviation.

13.116 Estimation of the Sampling Variance


• We may estimate the numerator of the sampling variance: σ 2 (typically using the
sample variance, with n or better n − 1 as a denominator) and we usually report
the square root of the estimated sampling variance.

• Remember: this is an estimate of the sampling standard deviation, hence it too is affected by sampling error (in widely used statistical software packages we invariably see the label “standard deviation of the estimate” in place of “estimated standard deviation of the estimate”: this is not due to ignorance on the part of the software authors, just to the need for brevity, but it could be misleading for less knowledgeable software users).
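The two steps above (estimate σ^2 with the n − 1 denominator, then report the square root of the estimated sampling variance) can be sketched with the Python standard library; the data are made up for illustration:

```python
from statistics import mean, variance
from math import sqrt

sample = [2, 4, 3, 5, 1, 3]   # hypothetical data

n = len(sample)
xbar = mean(sample)
s2 = variance(sample)          # sample variance with n - 1 as denominator
se_hat = sqrt(s2 / n)          # estimated standard deviation of the mean

# the point estimate and its *estimated* standard error
print(xbar, se_hat)
```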

13.117 nσ Rules
• In order to give a direct joint picture of an estimate and its (estimated) standard deviation, nσ “rules” are often followed by practitioners.

• They typically report “intervals” of the form: Point Estimate ± n Estimated Standard Deviation. A popular value of n outside Finance is 2; in Finance we see values of up to 6.

• A way of understanding this use is as follows: if we accept the two false premises that the estimate is equal to its expected value, which in turn is equal to the unknown parameter, and that the estimated sampling variance is the true variance of the estimate, then the Tchebicev inequality assigns a probability of at least .75 to observations of the estimate, in other similar samples, falling inside the “±2σ” interval (more than .97 for the “±6σ” interval).
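Under those (false) premises the Tchebicev bound behind the rule is immediate to compute; a minimal sketch:

```python
def tchebicev_lower_bound(n_sigma):
    # P(|X - E(X)| <= n*sigma) >= 1 - 1/n^2, for any distribution
    # with finite variance and any n > 1 (Tchebicev inequality)
    return 1.0 - 1.0 / (n_sigma * n_sigma)

print(tchebicev_lower_bound(2))   # 0.75 for the "+-2 sigma" interval
print(tchebicev_lower_bound(6))   # about 0.972 for the "+-6 sigma" interval
```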

13.118 Confidence Intervals
• A slightly more refined but theoretically much more demanding behavior is that of computing “confidence intervals” for parameter estimates.

• The theory of confidence intervals typically developed in undergraduate courses


of Statistics is quite scant.

• The proper definition is usually not even given and only one or two simple ex-
amples are reported but with no precise statement of the required hypotheses.

13.119 Confidence Intervals


• These examples are usually derived in the context of simple random sampling
(iid observations) from a Gaussian distribution and confidence intervals for the
unknown expected value are provided which are valid in the two cases of known
and unknown variance.

• In the first case the formula is

(x̄ ± z1−α/2 σ/√n)

and in the second

(x̄ ± tn−1,1−α/2 σ̂/√n)

where z1−α/2 is the quantile of the standard Gaussian distribution which leaves on its left a probability of 1 − α/2 and tn−1,1−α/2 is the analogous quantile for the T distribution with n − 1 degrees of freedom.

13.120 Confidence Intervals


• With the exception of the more specific choice for the “sigma multiplier” these two
intervals are very similar to the “rule of thumb” intervals we introduced above.

• In fact it turns out that, if α is equal to .05, the z in the first interval is equal to
1.96, and, for n greater than, say, 30, the t in the second formula is roughly 2.
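The known-variance interval can be sketched with the Python standard library alone (the data and the assumed known σ below are made up for illustration); note that the quantile for α = .05 is indeed about 1.96:

```python
from statistics import NormalDist, mean
from math import sqrt

sample = [2, 4, 3, 5, 1, 3]   # hypothetical data
sigma = 1.5                   # assumed *known* standard deviation
alpha = 0.05

# quantile of the standard Gaussian leaving 1 - alpha/2 on its left
z = NormalDist().inv_cdf(1 - alpha / 2)

half = z * sigma / sqrt(len(sample))
ci = (mean(sample) - half, mean(sample) + half)

print(round(z, 2), ci)   # z is about 1.96
```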

13.121 Hypothesis testing


• The need to choose actions whose consequences are only partly known is pervasive in any human endeavor. However, few fields display this need in such a simple and clear way as the field of finance.

• Consequently, almost the full set of normative tools of statistical decision theory has been applied to financial problems, and with considerable success, when used as normative tools (much less success, if any, was met by attempts to use such tools in the description of actual empirical human behavior. But this was to be expected).

13.122 Parametric Hypothesis


• Statistical hypothesis testing is a very specific and simple decision procedure. It is appropriate in some contexts, and the most important thing to learn, apart from technicalities, is the kind of context it is appropriate for.

• Statistical hypothesis. Here we consider only parametric hypotheses. Given a parametric model, a parametric hypothesis is simply the assumption that the parameter of interest θ lies in some subset Θi ⊆ Θ.

13.123 Two Hypotheses


• In a standard hypothesis test, we confront two hypotheses of this kind (θ ∈ Θ0 , θ ∈ Θ1 ) with the requirement that, wrt the parameter space, they should be exclusive (they cannot both be true at the same time) and exhaustive (they cover the full parameter space).

• So, for instance, if you are considering a Gaussian model and your two hypotheses
are that the expected value is either 1 or 2, this means, implicitly, that no other
values are allowed.

13.124 Simple and Composite


• A statistical hypothesis is called “simple” if it completely specifies the distribution of the observables; it is called “composite” if it specifies a set of possible distributions. The two hypotheses are termed the “null” hypothesis (H0 ) and the “alternative” hypothesis (H1 ).

• The reason for the names lies in the fact that, in the traditional setting where testing theory was developed, the “null” hypothesis corresponds to some conservative statement whose acceptance would not imply a change of behavior in the researcher, while accepting the “alternative” hypothesis would imply a change of behavior.

13.125 Example
• The simplest example is that of testing a new medicine or medical treatment.
• In a very stylized setting, let us suppose we are considering substituting an already established and reasonably working treatment for some illness with a new one.
• This is to be made on the basis of the observation of some clinical parameter in
a population.
• We know enough as to be able to state that the observed characteristic is dis-
tributed in a given way if the new treatment is not better than the old one and
in a different way if this is not the case.
• In this example the distribution under the hypothesis that the new treatment is
not better than the old shall be taken as the null hypothesis and the other as the
alternative.

13.126 Critical Region, Acceptance Region


• The solution to a testing problem is a partition of the set of possible samples into two subsets. If the actually observed sample falls in the acceptance region (x ∈ A) we are going to accept the null; if it falls in the rejection or critical region (x ∈ C) we reject it.
• We assume that the union of the two regions covers the full set of possible samples (the sample space) while their intersection is empty (they are exclusive). This is similar to what is asked of the hypotheses wrt the parameter space, but has nothing to do with it.
• The critical region stands to testing theory in the same relation as the estimate stands to estimation theory.

13.127 Errors of First and Second Kind


• Two errors are possible:

1. x ∈ C but the true hypothesis is H0 , this is called error of the first kind;
2. x ∈ A but the true hypothesis is H1 , this is called error of the second kind.

• We would like to avoid these errors; however, obviously, we do not even know (except in toy situations) whether we are committing them, just as we do not know how wrong our point estimates are.

• Proceeding in a similar way as we did in estimation theory we define some measure
of error.

13.128 Power Function and Size of the Errors


• Power function and size of the two errors. Given a critical region C, for each
θ ∈ Θ0 ∪ Θ1 (which sometimes but not always corresponds to the full parameter
space Θ) we compute ΠC (θ) = P (x ∈ C; θ) that is the probability, as a function
of θ, of observing a sample in the critical region, so that we reject H0 .

• We would like, ideally, this function to be near 1 for θ ∈ Θ1 while we would like
this to be near 0 for θ ∈ Θ0 .

• We define α = sup_{θ∈Θ0} ΠC (θ), the (maximum) size of the error of the first kind, and β = sup_{θ∈Θ1} (1 − ΠC (θ)), the (maximum) size of the error of the second kind.

13.129 Testing Strategy


• There are many reasonable requirements on the size of the two errors that we would like the critical region to satisfy.

• The choice made in standard testing theory is somewhat strange: we set α to an arbitrary (typically small) value and we try to find the critical region that, given that (or a smaller) size of the error of the first kind, minimizes (among the possible critical regions) the size of the error of the second kind.

• The reason for this choice is to be found in the traditional setting described above. If accepting the null means continuing some standard and reasonably successful therapy, it could be sensible to require a small probability of rejecting this hypothesis when it is true, while a possibly big error of the second kind could be considered acceptable.

13.130 Asymmetry
The reader should consider the fact that this very asymmetric setting is not the most
common in applications.

13.131 Some Tests


• One sided hypotheses for the expected value in the Gaussian setting. Suppose
we have an iid sample from a Gaussian random variable with expected value µ
and standard deviation σ.

• We want to test H0 : µ ≤ a against H1 : µ ≥ b, where a ≤ b are two given real numbers. It is reasonable to expect that a critical region of the shape C : {x : x̄ > k} should be a good one.
• The problem is to find k.

13.132 Some Tests


• Suppose first σ is known. The power function of this critical region is (we use the properties of the Gaussian under standardization):

ΠC (θ) = P (x ∈ C; θ) = P (x̄ > k; µ, σ) = 1 − P ( (x̄ − µ)/(σ/√n) ≤ (k − µ)/(σ/√n) ) = 1 − Φ( (k − µ)/(σ/√n) )

• Where Φ is the usual cumulative distribution function of the standard Gaussian distribution.
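A small sketch of this power function (the values of k, σ and n below are made up for illustration): it is computed directly from Φ and, as argued next, it is increasing in µ:

```python
from statistics import NormalDist
from math import sqrt

k, sigma, n = 1.66, 2.0, 25   # hypothetical threshold and known parameters

def power(mu):
    # Pi_C(mu) = 1 - Phi((k - mu) / (sigma / sqrt(n)))
    return 1 - NormalDist().cdf((k - mu) / (sigma / sqrt(n)))

# the power function is increasing in mu
values = [power(mu) for mu in (0.5, 1.0, 1.5, 2.0)]
print(values)
```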

13.133 Some Tests


• Since Φ( (k − µ)/(σ/√n) ) is decreasing in µ, the power function is increasing in µ; hence, its maximum value over the null hypothesis region is attained at µ = a.

• We want to set this maximum size of the error of the first kind to a given value α, so we require 1 − Φ( (k − a)/(σ/√n) ) = α, that is (k − a)/(σ/√n) = z1−α , so that k = a + (σ/√n) z1−α .

• When the variance is unknown the critical region is of the same shape but k = a + (σ̂/√n) tn−1,1−α , where σ̂ and t are as defined above.
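The threshold k can be checked numerically in the known-variance case (the values of a, σ, n and α below are made up for illustration): by construction, under µ = a the probability that x̄ exceeds k is exactly α:

```python
from statistics import NormalDist
from math import sqrt

a, sigma, n, alpha = 1.0, 2.0, 25, 0.05   # hypothetical values

z = NormalDist().inv_cdf(1 - alpha)       # z_{1-alpha}
k = a + sigma / sqrt(n) * z

# under mu = a the sample mean is N(a, sigma^2 / n), so the
# rejection probability P(xbar > k) must equal alpha
reject_prob = 1 - NormalDist(a, sigma / sqrt(n)).cdf(k)
print(k, reject_prob)
```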

13.134 Some Tests


The reader should solve the same problem when the hypotheses are reversed and com-
pare the solutions.

13.135 Some Tests


• Two sided hypotheses for the expected value in the Gaussian setting and confidence intervals.

• By construction the confidence interval for µ (with known variance):

(x̄ ± z1−α/2 σ/√n)

contains µ with probability (independent of µ) equal to 1 − α.

• Suppose we have H0 : µ = µ0 and H1 : µ ≠ µ0 for some given µ0 . The above recalled property of the confidence interval implies that the probability with which

(x̄ ± z1−α/2 σ/√n)

contains µ0 , when H0 is true, is 1 − α.

13.136 Some Tests


• The critical region C : { x : µ0 ∉ (x̄ ± z1−α/2 σ/√n) } or, that is the same, C : { x : x̄ ∉ (µ0 ± z1−α/2 σ/√n) } has only α probability of rejecting H0 when H0 is true.

• Build the analogous region in the case of unknown variance and consider the
setting where you swap the hypotheses.
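The resulting two-sided test in the known-variance case can be sketched as follows (the data, the assumed known σ and the values of µ0 are made up for illustration):

```python
from statistics import NormalDist, mean
from math import sqrt

def two_sided_z_test(sample, mu0, sigma, alpha=0.05):
    # reject H0: mu = mu0 when xbar falls outside
    # mu0 +- z_{1-alpha/2} * sigma / sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sigma / sqrt(len(sample))
    return abs(mean(sample) - mu0) > half

data = [2, 4, 3, 5, 1, 3]   # hypothetical data with assumed known sigma = 1.5
print(two_sided_z_test(data, mu0=3.0, sigma=1.5))   # False: H0 accepted
print(two_sided_z_test(data, mu0=5.0, sigma=1.5))   # True: H0 rejected
```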

Contents
1 Returns 3
1.1 Return definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Price and return data: a cautionary tale . . . . . . . . . . . . . . . . . 10
1.3 Some empirical “facts” . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 Logarithmic (log) random walk 16


2.1 "Stocks for the long run" and time diversification . . . . . . . . . . . . 25

3 Volatility estimation 30
3.1 Is it easier to estimate µ or σ 2 ? . . . . . . . . . . . . . . . . . . . . . . 36

4 Non Gaussian returns 41

5 Three different ways for computing the VaR 49


5.1 Gaussian VaR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Non parametric VaR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Semi parametric VaR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

6 Matrix algebra 64

7 Matrix algebra and Statistics 67


7.1 A varcov matrix is at least psd . . . . . . . . . . . . . . . . . . . . . . . 68
7.2 Note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

8 The deFinetti, Markowitz and Roy model for asset allocation 70

9 Linear regression 72
9.1 What is a regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
9.2 Weak OLS hypotheses. X non random . . . . . . . . . . . . . . . . . . 74
9.3 Weak OLS hypotheses. X random . . . . . . . . . . . . . . . . . . . . . 75
9.4 The OLS estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
9.5 Basic statistical properties of the OLS estimate . . . . . . . . . . . . . 77
9.6 The Gauss Markoff theorem . . . . . . . . . . . . . . . . . . . . . . . . 78
9.7 Fit and errors of fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9.8 R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.9 Statistical properties of Ŷ and ε̂ . . . . . . . . . . . . . . . . . . . . . . 83
9.10 Strong OLS hypotheses, confidence intervals and testing linear hypothe-
ses in the linear model. . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9.11 “Forecasts” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
9.12 a note on P-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9.13 The partial regression theorem. . . . . . . . . . . . . . . . . . . . . . . 90

9.14 The interpretation of estimated coefficients . . . . . . . . . . . . . . . 95

10 Style analysis 126


10.1 Traditional approaches with some connection to style analysis . . . . . 131
10.2 Critiques to style analysis . . . . . . . . . . . . . . . . . . . . . . . . . 133

11 Factor models and principal components 135


11.1 A very short introduction to linear asset pricing models . . . . . . . . . 135
11.2 Estimates for B and F . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
11.3 Maximum variance factors . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.4 Bad covariance and good components? . . . . . . . . . . . . . . . . . . 148

12 Appendix: Some matrix algebra 150


12.1 Definition of matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
12.2 Matrix operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
12.3 Rank of a matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
12.4 Some special matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
12.5 Determinants and Inverse . . . . . . . . . . . . . . . . . . . . . . . . . 151
12.6 Quadratic forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
12.7 Random Vectors and Matrices (see the following appendix for more details) . 152
12.8 Functions of Random Vectors (or Matrices) . . . . . . . . . . . . . . . . 152
12.9 Expected Values of Random Vectors . . . . . . . . . . . . . . . . . . . 152
12.10Variance Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . 152
12.11Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
12.12Derivatives of linear functions and quadratic forms . . . . . . . . . . . 153
12.13Minimization of a PD quadratic form, approximate solution of over de-
termined linear systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
12.14Minimization of a PD quadratic form under constraints. Simple appli-
cations to Finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
12.15The linear model in matrix notation . . . . . . . . . . . . . . . . . . . . 157

13 Appendix: What you cannot ignore about Probability and Statistics 158
13.1 Probability: a Language . . . . . . . . . . . . . . . . . . . . . . . . . . 162
13.2 Interpretations of Probability . . . . . . . . . . . . . . . . . . . . . . . 163
13.3 Probability and Randomness . . . . . . . . . . . . . . . . . . . . . . . . 163
13.4 Different Fields: Physics . . . . . . . . . . . . . . . . . . . . . . . . . . 164
13.5 Finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
13.6 Other fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
13.7 Wrong Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
13.8 Meaning of Correct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
13.9 Events and Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

13.10Classes of Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
13.11Probability as a Set Function . . . . . . . . . . . . . . . . . . . . . . . 167
13.12Basic Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
13.13Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
13.14Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
13.15Stochastic Independence . . . . . . . . . . . . . . . . . . . . . . . . . . 168
13.16Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
13.17Properties of the PDF . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
13.18Density and Probability Function . . . . . . . . . . . . . . . . . . . . . 169
13.19Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
13.20Probability Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
13.21Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
13.22Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
13.23Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
13.24Tchebicev Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
13.25*Vysochanskij–Petunin Inequality . . . . . . . . . . . . . . . . . . . . . 171
13.26*Gauss Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
13.27*Cantelli One Sided Inequality . . . . . . . . . . . . . . . . . . . . . . . 172
13.28Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
13.29Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
13.30Subsection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
13.31Univariate Distributions Models . . . . . . . . . . . . . . . . . . . . . . 173
13.32Some Univariate Discrete Distributions . . . . . . . . . . . . . . . . . . 173
13.33Some Univariate Continuous Distributions . . . . . . . . . . . . . . . . 174
13.34Some Univariate Continuous Distributions . . . . . . . . . . . . . . . . 174
13.35Some Univariate Continuous Distributions . . . . . . . . . . . . . . . . 174
13.36Random Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
13.37Distribution Function for a Random Vector . . . . . . . . . . . . . . . . 175
13.38Density and Probability Function . . . . . . . . . . . . . . . . . . . . . 175
13.39Marginal Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.40Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.41Stochastic Independence . . . . . . . . . . . . . . . . . . . . . . . . . . 176
13.42Mutual Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
13.43Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 177
13.44Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 177
13.45Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 177
13.46Law of Iterated Expectations . . . . . . . . . . . . . . . . . . . . . . . 177
13.47Regressive Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
13.48Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . . . . 178
13.49Distribution of the max and the min for independent random variables 179
13.50Distribution of the max and the min for independent random variables 179

13.51Distribution of the sum of independent random variables and central
limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
13.52Distribution of the sum of independent random variables and central
limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
13.53Distribution of the sum of independent random variables and central
limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
13.54Why Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
13.55Unknown Probabilities and Symmetry . . . . . . . . . . . . . . . . . . 181
13.56Unknown Probabilities and Symmetry . . . . . . . . . . . . . . . . . . 182
13.57No Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
13.58No Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
13.59Learning Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
13.60Pyramidal Die . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
13.61Pyramidal Die Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
13.62Pyramidal Die Constraints . . . . . . . . . . . . . . . . . . . . . . . . . 185
13.63Many Rolls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
13.64Probability of Observing a Sample . . . . . . . . . . . . . . . . . . . . 185
13.65Pre or Post Observation? . . . . . . . . . . . . . . . . . . . . . . . . . . 186
13.66Maximize the Probability of the Observed Sample . . . . . . . . . . . . 186
13.67Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
13.68Sampling Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
13.69Possibly Different Samples . . . . . . . . . . . . . . . . . . . . . . . . . 187
13.70The Probability of Our Sample . . . . . . . . . . . . . . . . . . . . . . 187
13.71The Probability of a Similar Estimate . . . . . . . . . . . . . . . . . . . 188
13.72The Probability of a Similar Estimate . . . . . . . . . . . . . . . . . . . 188
13.73The Probability of a Similar Estimate . . . . . . . . . . . . . . . . . . . 188
13.74The Estimate in Other Possible Samples . . . . . . . . . . . . . . . . . 188
13.75The Estimate in Other Possible Samples . . . . . . . . . . . . . . . . . 189
13.76Sampling Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
13.77Sampling Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
13.78Sampling Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
13.79Sampling Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
13.80Estimated Sampling Variability . . . . . . . . . . . . . . . . . . . . . . 191
13.81Quantifying Sampling Variability . . . . . . . . . . . . . . . . . . . . . 191
13.82Principle 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
13.83Principle 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
13.84Principle 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
13.85The Questions of Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 192
13.86Statistical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
13.87Specification of a Parametric Model . . . . . . . . . . . . . . . . . . . . 193
13.88Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

13.89Parametric Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
13.90Different Inferential Tools . . . . . . . . . . . . . . . . . . . . . . . . . 194
13.91Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
13.92Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
13.93Mean Square Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
13.94Mean Square Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
13.95Meaning of Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
13.96Mean Square Consistency . . . . . . . . . . . . . . . . . . . . . . . . . 195
13.97Mean Square Consistency . . . . . . . . . . . . . . . . . . . . . . . . . 196
13.98Methods for Building Estimates . . . . . . . . . . . . . . . . . . . . . . 196
13.99Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
13.100Estimation of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
13.101Inverting the Moment Equation . . . . . . . . . . . . . . . . . . . . . . 197
13.102Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
13.103Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
13.104Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
13.105Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
13.106Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
13.107Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
13.108Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
13.109Maximum Likelihood for Densities . . . . . . . . . . . . . . . . . . . . . 199
13.110Example (Discrete Case) . . . . . . . . . . . . . . . . . . . . . . . . . . 199
13.111Example Method of Moments . . . . . . . . . . . . . . . . . . . . . . . 199
13.112Example Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . 200
13.113More Advanced Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
13.114Sampling Standard Deviation and Confidence Intervals . . . . . . . . . 200
13.115Sampling Variance of the Mean . . . . . . . . . . . . . . . . . . . . . . 201
13.116Estimation of the Sampling Variance . . . . . . . . . . . . . . . . . . . 201
13.117nσ Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
13.118Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
13.119Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
13.120Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
13.121Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
13.122Parametric Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
13.123Two Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
13.124Simple and Composite . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
13.125Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
13.126Critical Region, Acceptance Region . . . . . . . . . . . . . . . . . . . . 204
13.127Errors of First and Second Kind . . . . . . . . . . . . . . . . . . . . . . 204
13.128Power Function and Size of the Errors . . . . . . . . . . . . . . . . . . 205
13.129Testing Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

13.130Asymmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
13.131Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
13.132Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
13.133Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
13.134Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
13.135Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
13.136Some Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

