
Chapter 1 Empirical finance

Figure 1: Predicting returns over alternative horizons.

“There is no way to predict whether the price of stocks and bonds will go up or down
over the next few days or weeks. But it is quite possible to foresee the broad course of
the prices of these assets over longer time periods, such as the next three to five years...”
2013 Nobel Prize Committee


Multiple linear regression with pre-determined regressors

We consider the model


Y = Xβ + ε,

where X is a fixed (non-stochastic) n × k matrix of regressors (n observations and k regressors) and ε is an n-vector of error terms.
Assumptions:

1. E(ε) = 0. On average, the errors are zero.

2. Var(ε) = σ²I_n, where I_n is the identity matrix of dimension n. (a) The errors are homoskedastic (same variances on the diagonal of Var(ε)) and (b) uncorrelated (the off-diagonal elements, i.e., the covariances, are equal to zero).

More explicitly, write

$$
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{21} & \dots & x_{k1} \\
x_{12} & x_{22} & \dots & x_{k2} \\
x_{13} & x_{23} & \dots & x_{k3} \\
\vdots & \vdots &  & \vdots \\
x_{1n} & x_{2n} & \dots & x_{kn}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{pmatrix}
$$
or

$$
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
\beta_1 x_{11} + \beta_2 x_{21} + \dots + \beta_k x_{k1} \\
\beta_1 x_{12} + \beta_2 x_{22} + \dots + \beta_k x_{k2} \\
\beta_1 x_{13} + \beta_2 x_{23} + \dots + \beta_k x_{k3} \\
\vdots \\
\beta_1 x_{1n} + \beta_2 x_{2n} + \dots + \beta_k x_{kn}
\end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{pmatrix},
$$
where, again, E(ε) = 0 and V ar(ε) = σ 2 In . All we are saying is that the Y observations
are linear combinations of the k regressors contained in the fixed matrix X. On top of


the linear combination, there is an error vector. The error vector has mean zero and a variance-covariance matrix given by

$$
Var(\varepsilon) = E(\varepsilon\varepsilon') = \sigma^2 I_n =
\begin{pmatrix}
\sigma^2 & 0 & \dots & \dots & 0 \\
0 & \sigma^2 & 0 & \dots & \vdots \\
\vdots & 0 & \sigma^2 & 0 & \vdots \\
\vdots & \dots & \ddots & \ddots & 0 \\
0 & \dots & \dots & 0 & \sigma^2
\end{pmatrix}.
$$
Note: the first column of the X matrix could just be a column of ones. This is the case
when there is an intercept in the regression (β1 would therefore be the intercept).
Given the assumptions, it is easy to show that E(Y) = Xβ and Var(Y) = σ²I_n. In fact,

E(Y) = E(Xβ + ε) = E(Xβ) + E(ε) = Xβ,

Var(Y) = Var(Xβ + ε) = Var(ε) = σ²I_n.

1 The ordinary least squares (OLS) method


We need to estimate the parameter vector β.

$$
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i'\beta)^2
= \arg\min_{\beta}\, (Y - X\beta)'(Y - X\beta)
= (X'X)^{-1}X'Y
$$


$$
\hat{\beta} =
\begin{pmatrix}
\sum_{i=1}^{n} x_{1i}^2 & \sum_{i=1}^{n} x_{1i}x_{2i} & \dots & \dots \\
\sum_{i=1}^{n} x_{1i}x_{2i} & \sum_{i=1}^{n} x_{2i}^2 & \dots & \dots \\
\vdots & \vdots & \ddots & \vdots \\
\dots & \dots & \dots & \sum_{i=1}^{n} x_{ki}^2
\end{pmatrix}^{-1}
\begin{pmatrix}
\sum_{i=1}^{n} x_{1i}y_i \\
\sum_{i=1}^{n} x_{2i}y_i \\
\vdots \\
\sum_{i=1}^{n} x_{ki}y_i
\end{pmatrix}.
$$

The least squares method chooses the vector β by minimizing the squared differences around the multivariate line x'β.
Proof.

C(β) = (Y − Xβ)'(Y − Xβ)

Hence,

$$
\frac{\partial C(\beta)}{\partial \beta} = 0 \;\Rightarrow\; -2X'(Y - X\beta) = 0
$$

$$
\frac{\partial^2 C(\beta)}{\partial \beta \partial \beta'} = 2X'X > 0.
$$

By setting the first derivative equal to zero, we obtain the OLS βb estimator. The second
derivative is positive. Hence, we confirm that we have a minimum.
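The derivation above can be checked numerically. The following is a minimal sketch (not part of the original notes) that computes β̂ = (X'X)⁻¹X'Y via the normal equations on simulated data and cross-checks it against NumPy's built-in least-squares routine; all data and parameter values are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
# First column of X is a column of ones (the intercept case discussed above)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])          # hypothetical true parameters
eps = rng.normal(scale=0.1, size=n)        # homoskedastic, mean-zero errors
Y = X @ beta + eps

# Normal equations: beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse, which is the numerically preferred way to implement the closed-form formula.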

1.1 Definitions: fitted values and estimated residuals


Fitted values (Ŷ). These are the values on the estimated line:

$$
\hat{Y} =
\begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \\ \vdots \\ \hat{y}_n \end{pmatrix}
=
\begin{pmatrix}
\hat{\beta}_1 x_{11} + \hat{\beta}_2 x_{21} + \dots + \hat{\beta}_k x_{k1} \\
\hat{\beta}_1 x_{12} + \hat{\beta}_2 x_{22} + \dots + \hat{\beta}_k x_{k2} \\
\hat{\beta}_1 x_{13} + \hat{\beta}_2 x_{23} + \dots + \hat{\beta}_k x_{k3} \\
\vdots \\
\hat{\beta}_1 x_{1n} + \hat{\beta}_2 x_{2n} + \dots + \hat{\beta}_k x_{kn}
\end{pmatrix}
= X\hat{\beta}.
$$

Residuals (ε̂). These are the differences between the true Y values and the fitted values Ŷ:

$$
\hat{\varepsilon} = Y - \hat{Y} =
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
-
\begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \\ \vdots \\ \hat{y}_n \end{pmatrix}
= Y - X\hat{\beta}.
$$

1.2 Properties
From the first-order conditions of the minimization problem we can write

$$
X'(Y - X\hat{\beta}) = 0 \;\Rightarrow\; X'\hat{\varepsilon} = 0
$$

or, equivalently (less compactly, but more intelligibly),

$$
\begin{pmatrix}
\sum_{i=1}^{n} x_{1i}\hat{\varepsilon}_i \\
\sum_{i=1}^{n} x_{2i}\hat{\varepsilon}_i \\
\vdots \\
\sum_{i=1}^{n} x_{ki}\hat{\varepsilon}_i
\end{pmatrix} = 0.
$$

This is like saying that the residuals are orthogonal to the X observations! Since the fitted values are linear combinations of the X observations (see above), it also says that the residuals are orthogonal to the fitted values. In other words,

X'ε̂ = 0

and

Ŷ'ε̂ = 0.

Important: Note that, if there is an intercept in the regression, then the first element in X'ε̂ becomes ∑_{i=1}^{n} ε̂_i = 0. In other words, the sample mean of the residuals is zero, if there is an intercept in the regression!

1.3 Properties: continued


Write

Ŷ = Xβ̂ = X(X'X)⁻¹X'Y = PY

and

ε̂ = Y − Ŷ = Y − X(X'X)⁻¹X'Y = (I − X(X'X)⁻¹X')Y = MY,

where M and P are symmetric and idempotent matrices. They are symmetric, since M = M' and P = P' (try showing it ...). They are idempotent, since M = MM and P = PP (try showing it ...).
Geometrically, P is the matrix which projects Y on the space spanned by the columns of X (recall, Ŷ is a linear combination of the regressors). M is the matrix which projects Y on the space orthogonal to the space spanned by the columns of X (recall, ε̂ and Ŷ are orthogonal).
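The properties of P and M can be verified numerically. Below is a minimal sketch (not part of the original notes) with simulated data: it builds P and M, checks symmetry and idempotency, and confirms the orthogonality results of the previous subsection.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = rng.normal(size=(n, k))   # arbitrary simulated regressors
Y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X
M = np.eye(n) - P                      # projection onto the orthogonal complement

# Symmetry: P = P', M = M'
assert np.allclose(P, P.T) and np.allclose(M, M.T)
# Idempotency: PP = P, MM = M
assert np.allclose(P @ P, P) and np.allclose(M @ M, M)

Y_hat = P @ Y       # fitted values
resid = M @ Y       # residuals
assert np.allclose(X.T @ resid, 0)    # residuals orthogonal to X
assert np.allclose(Y_hat @ resid, 0)  # fitted values orthogonal to residuals
```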

1.4 Partitioned matrices


Write

X = [X1 | X2],

where X1 is n × k1, X2 is n × k2, and k = k1 + k2. We are simply separating the full matrix X into two sub-matrices.
Then,


$$
\hat{\beta} = (X'X)^{-1}X'Y =
\begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1}
\begin{pmatrix} X_1'Y \\ X_2'Y \end{pmatrix}.
$$

There are two cases.


(1) Assume X1'X2 = 0 (the regressors in the first block and in the second block are orthogonal). Then,

$$
\hat{\beta} =
\begin{pmatrix} X_1'X_1 & 0 \\ 0 & X_2'X_2 \end{pmatrix}^{-1}
\begin{pmatrix} X_1'Y \\ X_2'Y \end{pmatrix}
=
\begin{pmatrix} (X_1'X_1)^{-1} & 0 \\ 0 & (X_2'X_2)^{-1} \end{pmatrix}
\begin{pmatrix} X_1'Y \\ X_2'Y \end{pmatrix}
=
\begin{pmatrix} (X_1'X_1)^{-1}X_1'Y \\ (X_2'X_2)^{-1}X_2'Y \end{pmatrix}
=
\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix}.
$$

Thus, we can run two separate regressions (one on X1 and one on X2 ) to obtain the least
squares estimates.
(2) Assume X1'X2 ≠ 0. Then, we can show that

$$
\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix}
=
\begin{pmatrix} (X_1'M_2X_1)^{-1}X_1'M_2Y \\ (X_2'M_1X_2)^{-1}X_2'M_1Y \end{pmatrix},
$$

where M2 = I_n − X2(X2'X2)⁻¹X2' and M1 = I_n − X1(X1'X1)⁻¹X1'.
Intuition: Focus on β̂1. This is like running several regressions. First, regress the first column of X1 on X2 and obtain the residuals. Next, regress the second column of X1 on X2 and obtain the residuals. Keep going until you reach the last column of X1. Collect the k1 columns of residuals in a new matrix X̃1 = M2X1. Finally, regress Y on the matrix of residuals to obtain β̂1.


$$
\hat{\beta}_1 = (\tilde{X}_1'\tilde{X}_1)^{-1}\tilde{X}_1'Y
= (X_1'M_2X_1)^{-1}X_1'M_2Y.
$$

In English, first you want to purge X1 of the effect of X2 and compute the component of X1 which is orthogonal to X2 (the residual matrix); then you want to regress Y on the residual matrix.
Important: multiple regression does this automatically. You never really go through these steps. If you know X2 and only care about β1, just put X2 in your regression!
Proof.

$$
X\beta = [X_1 | X_2]\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = X_1\beta_1 + X_2\beta_2
= (M_2 + P_2)X_1\beta_1 + X_2\beta_2
$$
$$
= M_2X_1\beta_1 + P_2X_1\beta_1 + X_2\beta_2
= M_2X_1\beta_1 + X_2\left(\beta_2 + (X_2'X_2)^{-1}X_2'X_1\beta_1\right)
$$
$$
= M_2X_1\beta_1 + X_2c
= [M_2X_1 | X_2]\begin{pmatrix} \beta_1 \\ c \end{pmatrix}.
$$

Notice that β1 has not changed. However, the regressors (and the second set of parameters) have changed. The two blocks are now orthogonal! Hence, I can find β1 by just running a regression of Y on M2X1 = X̃1:

$$
\hat{\beta}_1 = (\tilde{X}_1'\tilde{X}_1)^{-1}\tilde{X}_1'Y
= (X_1'M_2'M_2X_1)^{-1}X_1'M_2'M_2Y
= (X_1'M_2M_2X_1)^{-1}X_1'M_2M_2Y
= (X_1'M_2X_1)^{-1}X_1'M_2Y,
$$

by the symmetry and idempotency of M2 .
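The partitioned-regression (Frisch-Waugh-Lovell) result above can be illustrated numerically. A minimal sketch with simulated data (not part of the original notes): the coefficients on X1 from the full regression coincide with the coefficients from regressing Y on the purged regressors M2X1.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X1 = rng.normal(size=(n, 2))                         # block of interest
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])  # controls (with intercept)
X = np.hstack([X1, X2])
Y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(size=n)

# beta_1 from the full multiple regression
beta_full = np.linalg.solve(X.T @ X, X.T @ Y)[:2]

# Purge X1 of the effect of X2, then regress Y on the residual matrix
M2 = np.eye(n) - X2 @ np.linalg.inv(X2.T @ X2) @ X2.T
X1_tilde = M2 @ X1
beta_fwl = np.linalg.solve(X1_tilde.T @ X1_tilde, X1_tilde.T @ Y)
```

The two estimates agree to machine precision, exactly as the algebra predicts.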

2 The statistical properties of OLS

Write

β̂ = (X'X)⁻¹X'Y = (X'X)⁻¹X'(Xβ + ε) = (X'X)⁻¹(X'X)β + (X'X)⁻¹X'ε = β + (X'X)⁻¹X'ε.

(1) The expected value of β̂.

E(β̂) = E(β + (X'X)⁻¹X'ε) = β + (X'X)⁻¹X'E(ε) = β.

The OLS estimator is unbiased. Interpret: whatever the true parameter β is, if the model is true (i.e., if Y = Xβ + ε), then the OLS estimator will deliver the right parameter (on average). This said, there is some sampling variation around the expectation. Hence, we need to talk about the variance of β̂.


(2) The variance of β̂.

$$
Var(\hat{\beta}) = E[(\hat{\beta} - E(\hat{\beta}))(\hat{\beta} - E(\hat{\beta}))']
= E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)']
$$
$$
= E\left[(X'X)^{-1}X'\varepsilon \left((X'X)^{-1}X'\varepsilon\right)'\right]
= E\left[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\right]
$$
$$
= (X'X)^{-1}X'E(\varepsilon\varepsilon')X(X'X)^{-1}
= \sigma^2(X'X)^{-1}X'I_nX(X'X)^{-1}
= \sigma^2(X'X)^{-1}X'X(X'X)^{-1}
= \sigma^2(X'X)^{-1}.
$$

We can also write

$$
Var(\hat{\beta}) = \frac{\sigma^2}{n}\left(\frac{X'X}{n}\right)^{-1}.
$$

Interpret. The variance of β̂ depends directly on the variance of the error terms σ² and inversely on the "variability" of the X observations, i.e., X'X/n. It also depends inversely on the number of observations. Notice that, when the number of observations increases without bound (i.e., when n → ∞), the distribution of the β̂ estimator becomes more and more concentrated around the expected value β. We will return to this idea in Chapter 2.
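The two results above, E(β̂) = β and Var(β̂) = σ²(X'X)⁻¹, can be checked with a small Monte Carlo experiment. The sketch below (not part of the original notes, all values simulated) holds X fixed across replications, redraws the errors each time, and compares the sampling moments of β̂ with the theoretical formulas.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 50, 2.0
# Fixed (non-stochastic) regressor matrix with an intercept
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 0.5])                 # hypothetical true parameters
V_theory = sigma**2 * np.linalg.inv(X.T @ X)

R = 5000
draws = np.empty((R, 2))
for r in range(R):
    Y = X @ beta + rng.normal(scale=sigma, size=n)  # new error draw, same X
    draws[r] = np.linalg.solve(X.T @ X, X.T @ Y)

mean_mc = draws.mean(axis=0)   # should be close to beta (unbiasedness)
V_mc = np.cov(draws.T)         # should be close to sigma^2 (X'X)^{-1}
```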

2.1 The Gauss-Markov Theorem


The OLS estimator β̂ is BLUE (best linear unbiased estimator). For any estimator β̃ which is linear (in the observations Y) and unbiased, it turns out that

Var(β̃) ≥ Var(β̂).

Proof. Consider the generic estimator β̃ = AY, where A is a k × n matrix. Note that β̃ is linear in Y. Let us compute its expected value.


E(β̃) = E(AY) = E(AXβ + Aε) = AXβ + AE(ε) = AXβ.

Thus, AX = I_k for β̃ to be unbiased. So, β̃ = AY (with the restriction AX = I_k) is a generally specified unbiased and linear estimator of β. Note:

Var(β̃) = E((β̃ − E(β̃))(β̃ − E(β̃))') = E((β̃ − β)(β̃ − β)') = E(Aεε'A') = σ²AI_nA' = σ²AA'.

Now, we need to show that

Var(β̃) ≥ Var(β̂).

Write

AA' − (X'X)⁻¹ = AA' − AX(X'X)⁻¹X'A' = A(I_n − X(X'X)⁻¹X')A' = AMA' = AMMA' = AMM'A',


by the fact that AX = I_k and M is symmetric and idempotent. Is AMM'A' positive semi-definite? Write

$$
z'AM(AM)'z = \tilde{z}'\tilde{z} = \sum_{i=1}^{n} \tilde{z}_i^2 \ge 0,
$$

for any real vector z ∈ R^k. Because z'AMM'A'z ≥ 0 for any conformable real z, AMM'A' is positive semi-definite and AA' − (X'X)⁻¹ ≥ 0.

2.2 Estimation of σ²

We (almost) use the empirical variance of the estimated residuals:

$$
\hat{\sigma}^2 = \frac{\sum_{i=1}^{n}\hat{\varepsilon}_i^2}{n-k} = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n-k}.
$$

Statistical property: σ̂² is unbiased for σ² (i.e., E(σ̂²) = σ²).
Proof. Recall

ε̂ = MY

or

ε̂ = M(Xβ + ε) = Mε.

Thus,

$$
E(\hat{\sigma}^2) = \frac{1}{n-k}E(\hat{\varepsilon}'\hat{\varepsilon}) = \frac{1}{n-k}E(\varepsilon'M'M\varepsilon)
= \frac{1}{n-k}E(\varepsilon'M\varepsilon) = \frac{1}{n-k}E(\mathrm{tr}(\varepsilon'M\varepsilon))
$$
$$
= \frac{1}{n-k}E(\mathrm{tr}(M\varepsilon\varepsilon')) = \frac{1}{n-k}\mathrm{tr}\left(M E(\varepsilon\varepsilon')\right)
= \frac{\sigma^2}{n-k}\mathrm{tr}(M I_n) = \frac{\sigma^2}{n-k}\mathrm{tr}(M)
$$
$$
= \frac{\sigma^2}{n-k}\mathrm{tr}\left(I_n - X(X'X)^{-1}X'\right)
= \frac{\sigma^2}{n-k}\left(\mathrm{tr}(I_n) - \mathrm{tr}(X(X'X)^{-1}X')\right)
$$
$$
= \frac{\sigma^2}{n-k}\left(n - \mathrm{tr}\left((X'X)^{-1}X'X\right)\right)
= \frac{\sigma^2}{n-k}\left(n - \mathrm{tr}(I_k)\right)
= \frac{\sigma^2}{n-k}(n-k) = \sigma^2.
$$

The result relies on the symmetry and idempotency of M . It also relies on the properties
of the trace (for a review, refer to Chapter 0).
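The unbiasedness of σ̂² can also be seen by simulation. A minimal sketch (not part of the original notes, all values invented): with fixed X, draw the errors many times, compute ε̂'ε̂/(n − k) each time, and average. The key step ε̂ = Mε from the proof is used directly.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma2 = 30, 4, 3.0
X = rng.normal(size=(n, k))
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # residual-maker matrix

estimates = []
for _ in range(20000):
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    e_hat = M @ eps                                # residuals: e_hat = M * eps
    estimates.append(e_hat @ e_hat / (n - k))      # sigma_hat^2 with n-k correction

mean_est = np.mean(estimates)   # should be close to sigma2 across replications
```

Dividing by n instead of n − k would bias the estimator downward, which is exactly what the trace argument above quantifies.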

3 Exact inferential theory: testing


Write

Y = Xβ + ε,

where

ε →ᵈ N(0, σ²I_n).

The symbol →ᵈ signifies "distributed as". Because the error terms are normally distributed and Y is a linear combination of normal random variables (with Xβ deterministic), we have

Y →ᵈ N(Xβ, σ²I_n).

Note: we are imposing strong restrictions on the error terms. Not only are we saying that they are mean zero, homoskedastic (same variance) and uncorrelated, we are also saying that they are normally distributed. This will lead to an exact inferential theory. We will see later what we mean by "exact". In Chapter 2, normality will be relaxed. In Chapter 3, we will relax normality, homoskedasticity and uncorrelatedness.


3.1 Classical testing problems

(1) Single linear restriction:

H0 : c'β = γ

or, equivalently,

$$
H_0 : \sum_{j=1}^{k} c_j\beta_j = \gamma.
$$

Example: Standard t-test on the j-th parameter.

H0 : βj = 0.

Write:

$$
c = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \text{ (this is the } j\text{th spot)} \\ \vdots \\ 0 \end{pmatrix}
$$
and γ = 0.
(2) Multiple linear restrictions:

H0 : Rβ = r,  with R (q × k), β (k × 1) and r (q × 1),

with q ≤ k. Here q is the number of restrictions; k is, as always, the number of parameters.

Example: Standard F-test on the slope parameters (excluding the intercept).

H0 : β2 = β3 = ... = βk = 0.

Write:

$$
R_{(k-1)\times k} = \begin{pmatrix}
0 & 1 & 0 & \dots &  \\
0 & 0 & 1 & 0 & \dots \\
\vdots &  &  & \ddots &  \\
0 & \dots &  & 0 & 1
\end{pmatrix}
$$

and

$$
r_{(k-1)\times 1} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
$$

3.2 Implementation

Recall:

Y →ᵈ N(Xβ, σ²I_n).

Write

β̂ = (X'X)⁻¹X'Y.

Since β̂ is a linear combination of normal random variables (the Y observations), it is also a normal random variable. Hence,

β̂ →ᵈ N(β, σ²(X'X)⁻¹).

3.2.1 Single linear restriction

H0 : c'β = γ.

Construction of the test:

c'β̂ →ᵈ N(c'β, σ²c'(X'X)⁻¹c)


or

c'β̂ − c'β →ᵈ N(0, σ²c'(X'X)⁻¹c)

or

$$
\frac{c'\hat{\beta} - c'\beta}{\sigma\sqrt{c'(X'X)^{-1}c}} \;\overset{d}{\to}\; N(0, 1)
$$

and, under the null hypothesis H0 : c'β = γ,

$$
\frac{c'\hat{\beta} - \gamma}{\sigma\sqrt{c'(X'X)^{-1}c}} \;\overset{d}{\to}\; N(0, 1).
$$

This would be our test statistic, if we knew σ. If we knew σ, we could test the null hypothesis c'β = γ by checking whether this ratio falls in the tails of the normal distribution (i.e., whether it is larger than 2, or smaller than -2, for a 5% level test). Unfortunately, we do not know σ. We estimate σ using σ̂ = √(ε̂'ε̂/(n − k)).
We will show that, when we replace σ with σ̂, the ratio is not standard normal anymore. It is t-distributed with n − k degrees of freedom. The following result will lead to the finding.

► First aside:

$$
\frac{\hat{\varepsilon}'\hat{\varepsilon}}{\sigma^2} \;\overset{d}{\to}\; \chi^2_{n-k},
$$

where χ²_{n−k} is a chi-squared random variable with n − k degrees of freedom.
Proof:
Recall, ε̂ = Mε. Hence,

$$
\frac{\hat{\varepsilon}'\hat{\varepsilon}}{\sigma^2} = \frac{\varepsilon'M'M\varepsilon}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2} = \frac{\varepsilon'Q\Lambda Q'\varepsilon}{\sigma^2}
$$

by the Jordan decomposition of the idempotent matrix M (please refer to Chapter 0). Q is the matrix containing the eigenvectors of M. Note that QQ' = I_n. Λ is the matrix containing the eigenvalues of M on the diagonal and zeros everywhere else. By


idempotency, the eigenvalues of M are either one or zero. Since the trace of M is n − k, it turns out that the number of ones is n − k. Now, notice that

$$
\frac{Q'\varepsilon}{\sigma} \;\overset{d}{\to}\; N\!\left(0, \frac{1}{\sigma^2}Q'(\sigma^2 I_n)Q\right) = N(0, I_n).
$$

Thus, call Q'ε/σ = Z →ᵈ N(0, I_n). This implies

$$
\frac{\varepsilon'Q\Lambda Q'\varepsilon}{\sigma^2} = Z'\Lambda Z = \sum_{i=1}^{n-k} z_i^2 \;\overset{d}{\to}\; \chi^2_{n-k},
$$

since the sum of n − k squared independent standard normal random variables is a chi-squared random variable with n − k degrees of freedom. ◄

► Second aside: Consider a standard normal random variable. Consider a chi-squared random variable with n − k degrees of freedom. Assume the two random variables are independent. Then,

$$
\frac{N(0,1)}{\sqrt{\chi^2_{n-k}/(n-k)}} = t_{n-k},
$$

a t distribution with n − k degrees of freedom. ◄

Let us now go back to

$$
\frac{c'\hat{\beta} - \gamma}{\sigma\sqrt{c'(X'X)^{-1}c}} \;\overset{d}{\to}\; N(0, 1).
$$

Write

$$
\frac{c'\hat{\beta} - \gamma}{\hat{\sigma}\sqrt{c'(X'X)^{-1}c}}
= \frac{\dfrac{c'\hat{\beta} - \gamma}{\sigma\sqrt{c'(X'X)^{-1}c}}}{\sqrt{\dfrac{\hat{\varepsilon}'\hat{\varepsilon}}{\sigma^2(n-k)}}}
\;\overset{d}{\to}\; \frac{N(0,1)}{\sqrt{\chi^2_{n-k}/(n-k)}} = t_{n-k}.
$$

Interpret. When we replace σ with its estimator σ̂, we effectively compute the ratio between a normal random variable and the square root of an independent chi-squared random variable divided by its number of degrees of freedom (n − k, in this case). As in


the second aside, this ratio is distributed as a Student's t with n − k degrees of freedom. Thus,

$$
\frac{c'\hat{\beta} - \gamma}{\hat{\sigma}\sqrt{c'(X'X)^{-1}c}} \;\overset{d}{\to}\; t_{n-k}.
$$

We now test the null hypothesis c'β = γ by checking whether this ratio falls in the tails of the t distribution with n − k degrees of freedom (i.e., whether it is larger than t_{0.025,n−k} - i.e., slightly larger than 2 - or smaller than −t_{0.025,n−k} - i.e., slightly smaller than -2 - for a 5% level test).

Example: Classical t-test (H0 : βj = 0).

The relevant statistic is

$$
t = \frac{\hat{\beta}_j - 0}{\hat{\sigma}\sqrt{(X'X)^{-1}_{jj}}},
$$

where (X'X)^{-1}_{jj} is the j-th element on the diagonal of the matrix (X'X)⁻¹.
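The t statistic above is easy to compute from scratch. A minimal sketch (not part of the original notes, simulated data): estimate β̂ and σ̂, then form t = β̂_j / (σ̂ √((X'X)⁻¹_jj)) for one slope coefficient.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
# Hypothetical truth: second regressor matters (beta = 1), third does not
Y = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - k))   # sigma_hat with n-k correction

j = 1  # test H0: beta_2 = 0 (0-indexed position 1)
t_stat = beta_hat[j] / (sigma_hat * np.sqrt(XtX_inv[j, j]))
# |t_stat| greater than roughly 2 rejects H0 at the 5% level (t with n-k dof)
```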

3.2.2 Multiple linear restrictions

H0 : Rβ = r,  with R (q × k), β (k × 1) and r (q × 1).

Construction of the test:

Rβ̂ →ᵈ N(Rβ, σ²R(X'X)⁻¹R')

or

Rβ̂ − Rβ →ᵈ N(0, σ²R(X'X)⁻¹R')

or

$$
Z_q = \sigma^{-1}\left(R(X'X)^{-1}R'\right)^{-1/2}(R\hat{\beta} - R\beta) \;\overset{d}{\to}\; N(0, I_q).
$$

This implies that


$$
Z_q'Z_q = \sigma^{-2}(R\hat{\beta} - R\beta)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - R\beta) \;\overset{d}{\to}\; \chi^2_q
$$

and, under the null hypothesis,

$$
Z_q'Z_q = \sigma^{-2}(R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r) \;\overset{d}{\to}\; \chi^2_q.
$$

This last result is immediate: the sum of squares of q independent standard normal random variables is a chi-squared random variable with q degrees of freedom.
At this point, we could use the 95th percentile of the chi-squared distribution with q degrees of freedom (χ²_{0.95,q}) to test the null hypothesis. If

$$
\sigma^{-2}(R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r) \ge \chi^2_{0.95,q},
$$

then we would reject the null hypothesis. The problem, again, is that we do not know σ.
Just like earlier, we will show that when we replace σ with σ̂, the distribution of the test statistic changes. In this case, it changes to that of an F random variable with q and n − k degrees of freedom (when the test statistic is also divided by the number of restrictions q).

► Third aside: Consider a chi-squared random variable with q degrees of freedom, χ²_q. Consider a chi-squared random variable with n − k degrees of freedom, χ²_{n−k}. Assume the two random variables are independent. Then,

$$
\frac{\chi^2_q/q}{\chi^2_{n-k}/(n-k)} \;\overset{d}{\to}\; F_{q,n-k},
$$

an F distribution with q degrees of freedom in the numerator and n − k degrees of freedom in the denominator. ◄

Now, write

$$
\frac{\sigma^{-2}(R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r)/q}{\dfrac{\hat{\varepsilon}'\hat{\varepsilon}}{\sigma^2(n-k)}}
= \hat{\sigma}^{-2}(R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r)/q \;\overset{d}{\to}\; F_{q,n-k}.
$$

Thus, when we replace σ with σ̂ (and divide by q), rather than using the 95th percentile of the chi-squared distribution, we use the 95th percentile of the F distribution to test.
Example: Classical F-test with an intercept (H0 : β2 = β3 = ... = βk = 0).
The relevant statistic is

$$
\hat{\sigma}^{-2}(R\hat{\beta})'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta})/(k-1) \;\overset{d}{\to}\; F_{k-1,n-k}
$$

with

$$
R_{(k-1)\times k} = \begin{pmatrix}
0 & 1 & 0 & \dots &  \\
0 & 0 & 1 & 0 & \dots \\
\vdots &  &  & \ddots &  \\
0 & \dots &  & 0 & 1
\end{pmatrix}.
$$

Note: All of these tests are “exact” in the sense that they are valid for any number of
observations n. Exact tests can be derived only by imposing strong restrictions (like
normality) on the error terms. Without normality, this testing framework would not
hold. In Chapter 2, we will abandon normality of the error terms and derive tests which
are not “exact” (and are, therefore, not valid for any n) but “asymptotic”, i.e., they are
valid only when the number of observations goes off to infinity.
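The F statistic of the previous subsection can be computed directly. A minimal sketch (not part of the original notes, simulated data): test that the two slope coefficients are jointly zero with q = 2 restrictions, replacing σ with σ̂.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
# Hypothetical truth: both slopes are nonzero, so H0 should be rejected
Y = X @ np.array([0.3, 1.5, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k)

# H0: R beta = r, with q = 2 restrictions on the slopes
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
r = np.zeros(2)
q = R.shape[0]

diff = R @ beta_hat - r
F_stat = (diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff)) / (q * sigma2_hat)
# Compare F_stat with the 95th percentile of F(q, n-k); here it rejects easily
```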

4 Regression analysis, liquidity and asymmetric information

We are interested in the relation between the average bid-ask spreads on stocks and the characteristics of the corresponding companies (Stoll, 2000). Download the file spreads-microstructure.xls. The file contains information for the 100 stocks in the S&P 100 index. Our variable of interest (the Y variable) is the bid-ask spread (constructed as an average over the day) - or tradecost - of the S&P 100 stocks. The explanatory, or X, variables are:

1. log volatility - The log of the daily return standard deviation

2. log size - The log of the size of the stock. Size is total outstanding number of shares
multiplied by share price. Size is measured in thousands of dollars

3. log trades - This is the log of the average number of trades per day

4. log turn - This is the log of the ratio of the average number of shares traded per
day (in dollars) over the number of shares outstanding (in dollars)

5. NumberAnalysts - This is the number of analysts following the stock

The same data is used in Bandi and Russell (2007). Consider the following theories of
the determinants of bid-ask spreads.
1. Asymmetric information. Stocks with greater degrees of asymmetry in infor-
mation (regarding their fundamental value) tend to have wider bid-ask spreads. The
number of analysts following a stock is viewed as an asymmetric information proxy. The
larger it is, the lower private information, the smaller the spreads. Log turn-over is, also,
seen as an asymmetric information proxy. The larger it is, the larger private information,
the larger the spreads. (As Stoll, 1989, points out, without informed trading, stocks
would be traded in proportion to their shares outstanding. Trading rates in excess of this
proportion should be associated with informed trading.)
2. Liquidity. Stocks that trade more frequently and have larger market capitalization
(i.e., more liquid stocks) tend to have lower bid-ask spreads. The larger log trades and
log size, the larger liquidity, the smaller the spreads. Log turn-over is, also, sometimes
seen as a liquidity proxy. The larger it is, the larger liquidity, the smaller the spreads.
3. Fundamental volatility. Stocks that have a higher volatility of fundamental
values tend to have larger bid-ask spreads. Higher uncertainty about the underlying
stock’s value implies higher potential for adverse price moves and, hence, higher inventory
risk, mostly in the presence of large imbalances to offset (Ho and Stoll, 1981).


Asymmetric information is a term used to describe a situation where some market participants are better informed than others about the value of the asset. The greater the degree of asymmetric information, the wider the spreads should be, as the market makers (who are not fully informed) charge a higher price when selling (raising the ask) and a lower price when buying (lowering the bid) to insulate themselves from losing money to potentially better informed traders. We do not get to see the degree of asymmetric information between market participants, but we do know the following: 1) The larger the number of analysts following a stock, the lower we expect asymmetric information to be for that stock. Analysts provide information about the stock, thereby uncovering its fundamental value. 2) The greater the turn-over of the stock, the higher we expect asymmetric information to be for that stock. Intuitively, the greater the turnover, the faster individuals are getting in and out of investment positions in the stock. They do this more when they believe the current price does not accurately reflect the fundamental value of the stock - that is, when they believe they possess asymmetric (superior) information. As indicated, turn-over is also viewed, by some, as a liquidity proxy.

1. Generate a histogram of the bid-ask spreads. What do you notice?

[Histogram of TRADECOST, sample 1-100, 100 observations. Summary statistics: mean 0.000643, median 0.000533, maximum 0.002967, minimum 0.000319, std. dev. 0.000399, skewness 3.886, kurtosis 21.202, Jarque-Bera 1632.208 (probability 0.000000).]

Figure 2: Histogram of the spreads

There is apparent skewness.


2. Take a logarithmic transformation of the bid-ask spreads. Plot a histogram again. What do you notice now?

[Histogram of LOG_TRADECOST, sample 1-100, 100 observations. Summary statistics: mean -7.451748, median -7.536629, maximum -5.820198, minimum -8.049928, std. dev. 0.403460, skewness 1.625, kurtosis 6.614, Jarque-Bera 98.402 (probability 0.000000).]

Figure 3: Histogram of the log(spreads)

A bit more "normal" or "Gaussian".

3. Run a least-squares regression of the log bid-ask spread on the 5 explanatory variables.


Dependent Variable: LOG TRADECOST
Method: Least Squares (Gauss-Newton / Marquardt steps)
Sample: 1 100
Included observations: 100
LOG TRADECOST = C(1) + C(2)*LOG SIZE + C(3)*LOG VOLATILITY + C(4)*NUMBERANALYSTS + C(5)*LOG TRADES + C(6)*LOG TURN

            Coefficient   Std. Error   t-Statistic   Prob.
C(1)        -0.829315     0.442769     -1.873022     0.0642
C(2)        -0.140288     0.023910     -5.867276     0.0000
C(3)         1.022999     0.053245     19.21317      0.0000
C(4)         5.04E-05     0.002913      0.017313     0.9862
C(5)        -0.169363     0.035596     -4.757966     0.0000
C(6)        -0.098714     0.032502     -3.037139     0.0031

R-squared             0.882467    Mean dependent var     -7.451748
Adjusted R-squared    0.876216    S.D. dependent var      0.403460
S.E. of regression    0.141950    Akaike info criterion  -1.008566
Sum squared resid     1.894069    Schwarz criterion      -0.852256
Log likelihood        56.42829    Hannan-Quinn criter.   -0.945304
F-statistic           141.1555    Durbin-Watson stat      1.894852
Prob(F-statistic)     0.000000

4. Interpret the coefficients economically (in light of the three theories above) and statistically. Do you find your results surprising?
The theories have some empirical validation. Importantly, all variables are statistically significant, with the exception of the number of analysts. Take volatility, for instance. To test statistical significance, one could run the following "single restriction" test. We are testing whether c(3) = 0. In other words, we are testing the null

H0 : β3 = 0.

Write

$$
c = \begin{pmatrix} 0 \\ 0 \\ 1 \text{ (this is the 3rd spot)} \\ \vdots \\ 0 \end{pmatrix}
$$

and γ = 0. Thus,

$$
t = \frac{\hat{\beta}_3 - 0}{\hat{\sigma}\sqrt{(X'X)^{-1}_{3,3}}}
= \frac{1.022999 - 0}{0.053245} = 19.21,
$$

where (X'X)^{-1}_{3,3} is the 3rd element on the diagonal of the matrix (X'X)⁻¹.
Since |19.21| > 2, the parameter associated with volatility is "statistically different" from zero. In fact, volatility is the most statistically significant of all assumed predictors.
Note: if one wished to use the critical values of the t distribution with n − k degrees of freedom (in our case, 100 − 6 = 94), these critical values would be close to -2 and 2 (since the t density function would be very similar to the normal density function).
Alternatively, one could use a "multiple restriction" test to test a single restriction. This would effectively amount to using a "one-sided" test rather than a "two-sided" test. Write

H0 : Rβ = r,  with R (q × k), β (k × 1) and r (q × 1),

where q = 1, the vector R is the same as the vector c above (transposed, of course), and the scalar r is the same as the scalar γ above, namely 0. Hence,

H0 : c'β = γ = 0.

The statistic would be:

$$
\hat{\sigma}^{-2}\,\hat{\beta}_3\left((X'X)^{-1}_{3,3}\right)^{-1}\hat{\beta}_3/1 \;\overset{d}{\to}\; F_{1,94}.
$$

Notice that the F statistic would now effectively be the square of the t statistic.
Here is the output.
Wald Test:
Equation: Untitled

Test Statistic    Value       df        Probability
t-statistic       19.21317    94        0.0000
F-statistic       369.1457    (1, 94)   0.0000
Chi-square        369.1457    1         0.0000

Null Hypothesis: C(3)=0
Null Hypothesis Summary:

Normalized Restriction (= 0)    Value       Std. Err.
C(3)                            1.022999    0.053245

Restrictions are linear in coefficients.

In the output above, one could ignore - for the time being - the remaining test (called Chi-square). We will return to it in the next chapter.

5. Test the assumption that the coefficient associated with log volatility
is equal to 1. If this is the case, how would you interpret the relation
between daily volatility and bid-ask spreads?


Again, this is a single restriction test. We are testing whether c(3) = 1. In other words:

H0 : β3 = 1.

Write

$$
c = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \text{ (this is the 3rd spot)} \\ \vdots \\ 0 \end{pmatrix}
$$

and γ = 1. Specifically, write

$$
t = \frac{\hat{\beta}_3 - 1}{\hat{\sigma}\sqrt{(X'X)^{-1}_{3,3}}}
= \frac{1.022999 - 1}{0.053245} = 0.43.
$$

Hence, we "fail" to reject. The true volatility slope could be 1. Of course, as earlier, we could have used a one-sided F test as well.
What does it mean to have a slope equal to 1? The slope is

$$
\frac{\partial \log(\text{tradecosts})}{\partial \log(\text{volatility})}
= \frac{\partial \text{tradecosts}/\text{tradecosts}}{\partial \text{volatility}/\text{volatility}} = 1.
$$

Hence, since the regression is log/log ("logarithm on logarithm"), the slope has an interpretation in terms of elasticity. A 1% increase in volatility translates into a 1% increase in tradecosts.


6. Test the assumption that the coefficients associated with log size and log trades are equal to each other. Be as precise as possible.
We use a classical F test with 1 restriction. See below. We "fail" to reject. Clear from the p-value, right?

Wald Test:
Equation: Untitled

Test Statistic    Value       df        Probability
t-statistic       0.672016    94        0.5032
F-statistic       0.451606    (1, 94)   0.5032
Chi-square        0.451606    1         0.5016

Null Hypothesis: C(2)=C(5)
Null Hypothesis Summary:

Normalized Restriction (= 0)    Value       Std. Err.
C(2) - C(5)                     0.029075    0.043265

Restrictions are linear in coefficients.

7. Test the assumption that the coefficients associated with log turnover and number of analysts are jointly equal to zero.
We use a classical F test with 2 restrictions. See below. We reject. Again, look at the p-value ...

Wald Test:
Equation: Untitled

Test Statistic    Value       df        Probability
F-statistic       4.979229    (2, 94)   0.0088
Chi-square        9.958458    2         0.0069

Null Hypothesis: C(4)=0, C(6)=0
Null Hypothesis Summary:

Normalized Restriction (= 0)    Value       Std. Err.
C(4)                            5.04E-05    0.002913
C(6)                           -0.098714    0.032502

Restrictions are linear in coefficients.

8. Let us use our model to predict what the spread will look like tomor-
row. Use a regression which does not include the number of analysts
to predict. Consider a stock which has a log size of 10.5. Suppose that
for this stock we expect that for the following day the log turnover will
be -1.1, the log of the number of trades will be 7.6, and the log of the
standard deviation will be -3.5. Predict what the spread will be for this
stock tomorrow. (Note that since the regression is run with log spreads
you will have to make a transformation to convert your prediction for
the log spread into a prediction for the actual spread ...).
Here is the regression output excluding the number of analysts. All other parameter
estimates are robust to this exclusion, i.e., similar to previous results.


Dependent Variable: LOG TRADECOST
Method: Least Squares (Gauss-Newton / Marquardt steps)
Sample: 1 100
Included observations: 100
LOG TRADECOST = C(1) + C(2)*LOG SIZE + C(3)*LOG VOLATILITY + C(4)*LOG TRADES + C(5)*LOG TURN

            Coefficient   Std. Error   t-Statistic   Prob.
C(1)        -0.828920     0.439847     -1.884563     0.0625
C(2)        -0.140277     0.023775     -5.900212     0.0000
C(3)         1.023106     0.052606     19.44864      0.0000
C(4)        -0.169296     0.035201     -4.809382     0.0000
C(5)        -0.098864     0.031164     -3.172391     0.0020

R-squared             0.882467    Mean dependent var     -7.451748
Adjusted R-squared    0.877518    S.D. dependent var      0.403460
S.E. of regression    0.141201    Akaike info criterion  -1.028563
Sum squared resid     1.894075    Schwarz criterion      -0.898304
Log likelihood        56.42813    Hannan-Quinn criter.   -0.975845
F-statistic           178.3208    Durbin-Watson stat      1.895160
Prob(F-statistic)     0.000000

Let us do the prediction:

$$
\widehat{\log(\text{tradecosts})} = -0.829 - 0.14 \times 10.5 + 1.023 \times (-3.5) - 0.17 \times 7.6 - 0.099 \times (-1.1) = -7.0626.
$$

Now, to obtain a prediction for tradecosts rather than for log(tradecosts), we need to exponentiate. We write

$$
\widehat{\text{tradecosts}} = e^{-7.0626} = 0.00085655,
$$

which is a value larger than the historical mean (see the histogram in point 1 above).
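The prediction step can be scripted directly. A minimal sketch (not part of the original notes) using the rounded coefficient estimates quoted in the calculation above; the regressor values are the ones given in the exercise.

```python
import math

# Rounded coefficient estimates from the regression without the number of analysts
c = [-0.829, -0.14, 1.023, -0.17, -0.099]
log_size, log_vol, log_trades, log_turn = 10.5, -3.5, 7.6, -1.1

log_spread = (c[0] + c[1] * log_size + c[2] * log_vol
              + c[3] * log_trades + c[4] * log_turn)
spread = math.exp(log_spread)   # undo the log transformation
```

Note that exponentiating the predicted log spread is the simple back-transformation used in the text; it ignores the Jensen-inequality adjustment that a formal conditional-mean prediction would require.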


5 Appendix I

5.1 Another useful idempotent matrix

Consider

L = I_n − i(i'i)⁻¹i',

where i is an n × 1 column vector of ones. L is, of course, symmetric and idempotent. More explicitly,

L = I_n − (ii')/n

or

$$
L = \begin{pmatrix}
1 & 0 & \dots & \dots \\
 & 1 &  &  \\
 &  & \ddots &  \\
 &  &  & 1
\end{pmatrix}
- \frac{1}{n}\begin{pmatrix}
1 & 1 & \dots & 1 \\
1 & 1 & \dots & 1 \\
\vdots &  &  & \vdots \\
1 & 1 & \dots & 1
\end{pmatrix}.
$$
The matrix L transforms any n × 1 column vector y into deviations from the mean. In fact,

$$
Ly = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
- \frac{1}{n}\begin{pmatrix}
1 & 1 & \dots & 1 \\
1 & 1 & \dots & 1 \\
\vdots &  &  & \vdots \\
1 & 1 & \dots & 1
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
- \begin{pmatrix} \bar{Y} \\ \bar{Y} \\ \vdots \\ \bar{Y} \end{pmatrix}
= \begin{pmatrix} y_1 - \bar{Y} \\ y_2 - \bar{Y} \\ \vdots \\ y_n - \bar{Y} \end{pmatrix}.
$$


5.2 More on partitioned regressions (the scalar case with an intercept)

Consider the case

Y = Xβ + ε,

where

$$
X = \begin{pmatrix} 1 & x_{11} \\ 1 & x_{12} \\ \vdots & \vdots \\ 1 & x_{1n} \end{pmatrix}
\quad\text{and}\quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.
$$

This is the standard scalar case. What is β̂2? It is

β̂2 = (X2'M1X2)⁻¹X2'M1Y

but, in this case, M1 = L. Hence,

$$
\hat{\beta}_2 = (X_2'LX_2)^{-1}(X_2'LY)
= (X_2'LLX_2)^{-1}(X_2'LLY)
= (X_2'L'LX_2)^{-1}(X_2'L'LY)
$$
$$
= \left((LX_2)'(LX_2)\right)^{-1}(LX_2)'(LY)
= \frac{\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{\sum_{i=1}^{n}(x_i - \bar{X})^2},
$$


which has a very familiar form from univariate regression analysis, right? Also, this is
how you compute the beta of an asset ... recall?
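The equivalence between the partitioned formula and the familiar covariance-over-variance slope can be checked on simulated data (an illustrative sketch; the data and seed are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.5 + 2.0 * x + rng.normal(size=100)

# Full OLS with an intercept: beta_hat = (X'X)^{-1} X'Y
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Slope via the demeaned (partitioned) formula
slope = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)

print(np.isclose(beta_hat[1], slope))  # True: the two slopes coincide
```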

6 Appendix II: Miscellanea


6.1 The R2
Write

$$\sum_{i=1}^{n} (y_i - \bar{Y})^2 = (LY)'(LY),$$

but

$$Y = \hat{Y} + \hat{\varepsilon}.$$

Hence,

$$\begin{aligned}
(LY)'(LY) &= (L(\hat{Y} + \hat{\varepsilon}))'(L(\hat{Y} + \hat{\varepsilon})) \\
&= (L\hat{Y})'(L\hat{Y}) + (L\hat{\varepsilon})'(L\hat{\varepsilon}) + 2(L\hat{Y})'(L\hat{\varepsilon}) \\
&= (L\hat{Y})'(L\hat{Y}) + \hat{\varepsilon}'\hat{\varepsilon} + 2(L\hat{Y})'\hat{\varepsilon},
\end{aligned}$$

since, with an intercept, the mean of the residuals is zero. In addition

$$(L\hat{Y})'\hat{\varepsilon} = \hat{Y}'L'\hat{\varepsilon} = \hat{Y}'L\hat{\varepsilon} = \hat{Y}'\hat{\varepsilon} = 0.$$

Thus,

$$\sum_{i=1}^{n} (y_i - \bar{Y})^2 = (LY)'(LY) = (L\hat{Y})'(L\hat{Y}) + \hat{\varepsilon}'\hat{\varepsilon} = \sum_{i=1}^{n} (\hat{y}_i - \bar{Y})^2 + \sum_{i=1}^{n} \hat{\varepsilon}_i^2.$$

We expressed the total variation of the Y observations (the total sum of squares, or SST) as the sum of the variation of the fitted values (the regression sum of squares, or SSR) and the variation of the residuals (the residual sum of squares, or SSE). Now write

$$1 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (y_i - \bar{Y})^2} + \frac{\sum_{i=1}^{n} \hat{\varepsilon}_i^2}{\sum_{i=1}^{n} (y_i - \bar{Y})^2} = \frac{SSR}{SST} + \frac{SSE}{SST}.$$

Define

$$R^2 = \frac{SSR}{SST}.$$

Naturally,

$$0 \le R^2 \le 1.$$

The closer the R2 is to 1 the better the fit (the larger the variance of Y that is explained
by the regression or, equivalently, the smaller SSE, the better the fit). The closer the
R2 is to 0, the worse the fit (the larger is SSE).
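The decomposition and the resulting $R^2$ are straightforward to verify on simulated data (an illustrative sketch, not the tradecosts data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
resid = y - y_hat

SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)  # with an intercept, fitted values have the same mean as y
SSE = np.sum(resid ** 2)

print(np.isclose(SST, SSR + SSE))  # True: SST = SSR + SSE
print(SSR / SST, 1 - SSE / SST)    # two identical ways of writing R^2
```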
(1) First alternative (and equivalent) way to write the R2 :

$$R^2 = 1 - \frac{SSE}{SST}.$$
This expression is useful. Note: $SSE = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}' x_i\right)^2$. Recall, OLS finds $\hat{\beta}$ which minimizes SSE. Hence, SSE from a regression with k regressors can never be smaller than SSE from a regression with k + 1 regressors. Why? Well, because I could always set the extra parameter equal to zero and, at the very least, achieve the value that I would obtain with only k regressors. Hence, the $R^2$ from a regression with k + 1 regressors can never be smaller than the $R^2$ from a regression with k regressors. This is mechanical. It is also problematic since the increase might not have anything to do with the actual explanatory power of the extra regressor. In sum: just adding regressors (irrespective of their explanatory power) weakly increases the $R^2$. Thus, the $R^2$ cannot be a perfect measure of goodness of fit.

(2) Second alternative (and equivalent) way to write the R2 :


$$R^2 = \frac{SSR}{SST} = \frac{\frac{1}{n-1}\sum_{i=1}^{n} (\hat{y}_i - \bar{Y})^2}{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{Y})^2} = \frac{s_{\hat{y}}^2}{s_y^2} = \frac{s_{\hat{y}}^2 s_{\hat{y}y}}{s_y^2 s_{\hat{y}y}},$$

where $s_{\hat{y}}^2$ is the sample variance of the fitted values, $s_y^2$ is the sample variance of the y observations, and $s_{\hat{y}y}$ is the sample covariance between fitted values and true values. But,

$$s_{\hat{y}y} = cov(Y, \hat{Y}) = cov(\hat{Y} + \hat{\varepsilon}, \hat{Y}) = var(\hat{Y}) = s_{\hat{y}}^2.$$

Hence,

$$R^2 = \frac{s_{\hat{y}}^2 s_{\hat{y}y}}{s_y^2 s_{\hat{y}y}} = \frac{s_{\hat{y}}^2 s_{\hat{y}y}}{s_y^2 s_{\hat{y}}^2} = \frac{s_{\hat{y}y}^2}{s_y^2 s_{\hat{y}}^2} = r_{\hat{y}y}^2.$$
The $R^2$ is just the squared empirical correlation between the fitted values and the Y values. In the scalar case, $R^2 = r_{yx}^2$, the squared empirical correlation between the regressor x and the regressand y. This observation is also useful: the $R^2$ is just a descriptive statistic. In the scalar case, it does not contain more information than the correlation between Y and X. In fact, it is the correlation between Y and X raised to the second power and could, therefore, be computed before inference begins.
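Both correlation identities can be checked numerically (simulated data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=150)
y = 0.3 + 1.2 * x + rng.normal(size=150)

# OLS fit with an intercept
X = np.column_stack([np.ones_like(x), x])
y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)

R2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r_yhat_y = np.corrcoef(y_hat, y)[0, 1]  # correlation between fitted and actual
r_yx = np.corrcoef(y, x)[0, 1]          # correlation between y and x

print(np.isclose(R2, r_yhat_y ** 2))  # True
print(np.isclose(R2, r_yx ** 2))      # True in the scalar case
```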


6.2 The adjusted R2


As we discussed, the value of the R2 increases mechanically with the number of regressors. Idea: penalize the goodness-of-fit measure for using a large number of regressors k, which might spuriously increase the R2. The adjusted R2 contains a penalty. It is defined as

$$\bar{R}^2 = 1 - \frac{SSE/(n-k)}{SST/(n-1)} = 1 - \frac{\hat{\sigma}^2}{s_y^2}.$$

A larger k decreases SSE. We discussed this effect earlier. However, if $k \uparrow$, then $n - k \downarrow$, $SSE/(n-k) \uparrow$, and $\bar{R}^2 \downarrow$, keeping SSE fixed. The two effects might, at least partially, compensate each other. Thus, dividing by $n - k$ introduces a penalty.
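A minimal sketch of the two definitions, on simulated data and assuming k counts all regressors including the intercept (illustrative, not from the text):

```python
import numpy as np

def r2_and_adjusted(y, X):
    """OLS R^2 and adjusted R^2; X must include the column of ones."""
    n, k = X.shape
    resid = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    SSE = resid @ resid
    SST = np.sum((y - y.mean()) ** 2)
    return 1 - SSE / SST, 1 - (SSE / (n - k)) / (SST / (n - 1))

rng = np.random.default_rng(3)
n = 100
y = rng.normal(size=n)
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = np.column_stack([X1, rng.normal(size=n)])  # add a useless regressor

r2_small, adj_small = r2_and_adjusted(y, X1)
r2_big, adj_big = r2_and_adjusted(y, X2)

# R^2 never falls when a regressor is added; the adjusted R^2 can
print(r2_small, adj_small)
print(r2_big, adj_big)
```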

6.3 Re-writing hypothesis testing in a friendlier (for empirical work) format
Assume one wants to test the general restriction H0 : Rβ = r.
Theorem.

$$\hat{\sigma}^{-2}(R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r)/q = \frac{\left(\hat{\varepsilon}^{*\prime}\hat{\varepsilon}^{*} - \hat{\varepsilon}'\hat{\varepsilon}\right)/q}{\hat{\varepsilon}'\hat{\varepsilon}/(n-k)} \xrightarrow{d} F_{q,n-k},$$

where the $\hat{\varepsilon}^{*}$'s are the estimated residuals from a regression which imposes the restriction $R\beta = r$.
Proof.


We want to show that $\hat{\sigma}^{-2}(R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r)/q$ and $\frac{\left(\hat{\varepsilon}^{*\prime}\hat{\varepsilon}^{*} - \hat{\varepsilon}'\hat{\varepsilon}\right)/q}{\hat{\varepsilon}'\hat{\varepsilon}/(n-k)}$ are identical. Note that

$$\hat{\sigma}^{-2}(R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r)/q = \frac{(R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r)/q}{\hat{\varepsilon}'\hat{\varepsilon}/(n-k)}.$$

Hence, we only need to show that

$$(R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r) = \hat{\varepsilon}^{*\prime}\hat{\varepsilon}^{*} - \hat{\varepsilon}'\hat{\varepsilon}.$$

First, compute the constrained OLS estimator $\beta^{*}$:

$$\beta^{*} = \arg\min_{\beta} C(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) + \lambda'(R\beta - r).$$

The first-order conditions are:

$$\frac{\partial C(\beta, \lambda)}{\partial \beta} = 0 \Rightarrow -X'(Y - X\beta^{*}) + R'\lambda = 0,$$
$$\frac{\partial C(\beta, \lambda)}{\partial \lambda} = 0 \Rightarrow R\beta^{*} = r.$$

Thus,

$$R'\lambda = X'Y - (X'X)\beta^{*}$$

and

$$(X'X)^{-1}R'\lambda = (X'X)^{-1}X'Y - (X'X)^{-1}(X'X)\beta^{*}$$

or

$$(X'X)^{-1}R'\lambda = \hat{\beta} - \beta^{*}. \quad (1)$$


In addition,

$$R(X'X)^{-1}R'\lambda = R(\hat{\beta} - \beta^{*}) = R\hat{\beta} - r$$

and

$$\lambda = \left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r). \quad (2)$$

Now, consider

$$\begin{aligned}
\hat{\varepsilon}^{*\prime}\hat{\varepsilon}^{*} &= (Y - X\beta^{*})'(Y - X\beta^{*}) \\
&= (Y - X\hat{\beta} + X\hat{\beta} - X\beta^{*})'(Y - X\hat{\beta} + X\hat{\beta} - X\beta^{*}) \\
&= (Y - X\hat{\beta})'(Y - X\hat{\beta}) + (\hat{\beta} - \beta^{*})'X'X(\hat{\beta} - \beta^{*}) + 2(\hat{\beta} - \beta^{*})'X'(Y - X\hat{\beta}) \\
&= \hat{\varepsilon}'\hat{\varepsilon} + (\hat{\beta} - \beta^{*})'X'X(\hat{\beta} - \beta^{*}) + 2(\hat{\beta} - \beta^{*})'X'\hat{\varepsilon} \\
&= \hat{\varepsilon}'\hat{\varepsilon} + (\hat{\beta} - \beta^{*})'X'X(\hat{\beta} - \beta^{*}).
\end{aligned}$$

Thus,

$$\begin{aligned}
\hat{\varepsilon}^{*\prime}\hat{\varepsilon}^{*} - \hat{\varepsilon}'\hat{\varepsilon} &= (\hat{\beta} - \beta^{*})'X'X(\hat{\beta} - \beta^{*}) \\
&= \lambda' R(X'X)^{-1} X'X (X'X)^{-1} R'\lambda \\
&= \lambda' R(X'X)^{-1} R'\lambda \\
&= (R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1} R(X'X)^{-1}R' \left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r) \\
&= (R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r).
\end{aligned}$$

Done.
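The identity proved above can be verified numerically. The sketch below tests the (made-up) restriction that both slopes are zero on simulated data, so the restricted regression contains only an intercept:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.4, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat                 # unrestricted residuals

# Restriction R beta = r selecting the two slopes, with r = 0
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r = np.zeros(2)

# Left side: (R beta_hat - r)' (R (X'X)^{-1} R')^{-1} (R beta_hat - r)
d = R @ beta_hat - r
lhs = d @ np.linalg.solve(R @ XtX_inv @ R.T, d)

# Right side: restricted SSE (intercept-only regression) minus unrestricted SSE
e_star = y - y.mean()
rhs = e_star @ e_star - e @ e

print(np.isclose(lhs, rhs))  # True
```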
Theorem.
We can write


$$\hat{\sigma}^{-2}(R\hat{\beta} - r)'\left(R(X'X)^{-1}R'\right)^{-1}(R\hat{\beta} - r)/q = \frac{\left(\hat{\varepsilon}^{*\prime}\hat{\varepsilon}^{*} - \hat{\varepsilon}'\hat{\varepsilon}\right)/q}{\hat{\varepsilon}'\hat{\varepsilon}/(n-k)} = \frac{\left(R^2 - R^{*2}\right)/q}{\left(1 - R^2\right)/(n-k)} \xrightarrow{d} F_{q,n-k},$$

where the $\hat{\varepsilon}^{*}$'s are the estimated residuals from a regression which imposes the restriction $R\beta = r$ and $R^{*2}$ is the $R^2$ from the same regression.
Proof. Recall,

$$R^2 = 1 - \frac{\hat{\varepsilon}'\hat{\varepsilon}}{SST}, \qquad R^{*2} = 1 - \frac{\hat{\varepsilon}^{*\prime}\hat{\varepsilon}^{*}}{SST},$$

and

$$\hat{\varepsilon}'\hat{\varepsilon} = SST - (SST)R^2, \qquad \hat{\varepsilon}^{*\prime}\hat{\varepsilon}^{*} = SST - (SST)R^{*2}.$$

Then

$$\frac{\left(\hat{\varepsilon}^{*\prime}\hat{\varepsilon}^{*} - \hat{\varepsilon}'\hat{\varepsilon}\right)/q}{\hat{\varepsilon}'\hat{\varepsilon}/(n-k)} = \frac{\left(SST - (SST)R^{*2} - SST + (SST)R^2\right)/q}{\left(SST - (SST)R^2\right)/(n-k)} = \frac{\left(1 - R^{*2} - 1 + R^2\right)/q}{\left(1 - R^2\right)/(n-k)} = \frac{\left(R^2 - R^{*2}\right)/q}{\left(1 - R^2\right)/(n-k)}.$$

Done.
Note: it is trivial to compute $R^2$ and $R^{*2}$ from virtually all unrestricted and restricted regressions (provided the restrictions are linear). Hence, this is a very useful way to re-express our general tests. The classical F test for $H_0: \beta_2 = \beta_3 = ... = \beta_k = 0$, for example, is

$$\frac{R^2/(k-1)}{\left(1 - R^2\right)/(n-k)} \xrightarrow{d} F_{k-1,n-k}.$$

Why? $R^{*2}$ from the restricted regression is zero (the restricted regression only contains an intercept). The number of restrictions q is, of course, equal to $k - 1$.
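The SSE-based and $R^2$-based forms of this joint significance test coincide by construction, which is easy to confirm on simulated data (illustrative sketch; no p-value is computed to keep the example self-contained):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 150, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 0.3, -0.2, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
SST = np.sum((y - y.mean()) ** 2)
R2 = 1 - (e @ e) / SST

q = k - 1  # number of restrictions: all slopes equal to zero

# SSE form: the restricted model is intercept-only, so restricted SSE = SST
F_sse = ((SST - e @ e) / q) / ((e @ e) / (n - k))

# R^2 form: R*^2 = 0 for the intercept-only regression
F_r2 = (R2 / q) / ((1 - R2) / (n - k))

print(np.isclose(F_sse, F_r2))  # True
```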

