Statistics
Antonis Demos
(Athens University of Economics and Business)
October 2002
Part I
Probability Theory
Chapter 1
INTRODUCTION
A set is defined as any collection of objects, which are called points or elements.
The biggest possible collection of points under consideration is called the space,
universe, or universal set. For Probability Theory the space is called the sample
space.
A set A is called a subset of B (we write A ⊆ B or B ⊇ A) if every element
of A is also an element of B. A is called a proper subset of B (we write A ⊂ B or
B ⊃ A) if every element of A is also an element of B and there is at least one element
of B which does not belong to A.
Two sets A and B are called equivalent sets or equal sets (we write A = B)
if A ⊆ B and B ⊆ A.
If a set has no points, it will be called the empty or null set and denoted by
φ.
The complement of a set A with respect to the space Ω, denoted by Ā, Ac ,
or Ω − A, is the set of all points that are in Ω but not in A.
The intersection of two sets A and B is a set that consists of the common
elements of the two sets and it is denoted by A ∩ B or AB.
The union of two sets A and B is a set that consists of all points that are in
A or B or both (but only once) and it is denoted by A ∪ B.
The set difference of two sets A and B, denoted by A − B, is the set that consists of all points in A that are not in B.
Some more sophisticated discrete problems require counting techniques. The sample space in such problems, although discrete, can be quite large, and it is not feasible to write out all possible outcomes.
1. Duplication is permissible and Order is important (Multiple Choice Arrangement), i.e. the element AA is permitted and AB is a different element from BA. In this case, where we want to arrange n objects in x places, the number of possible outcomes is given by $M_x^n = n^x$.
Example: Find all possible combinations of the letters A, B, C, and D when
duplication is allowed and order is important.
The result according to the formula is: n = 4 and x = 2; consequently the possible number of arrangements is $M_2^4 = 4^2 = 16$. To find the result we can also use
a tree diagram.
2. Duplication is not permissible and Order is important (Permutation Arrangement), i.e. the element AA is not permitted and AB is a different element from BA. In this case, where we want to permute n objects in x places, the number of possible outcomes is given by

$P_x^n \text{ or } P(n, x) = n \times (n-1) \times \cdots \times (n-x+1) = \dfrac{n!}{(n-x)!}.$
Example: Find all possible permutations of the letters A, B, C, and D when
duplication is not allowed and order is important.
The result according to the formula is: n = 4 and x = 2; consequently the possible number of permutations is $P_2^4 = \dfrac{4!}{(4-2)!} = \dfrac{2\cdot3\cdot4}{2} = 12$.
3. Duplication is not permissible and Order is not important (Combination Arrangement), i.e. the element AA is not permitted and AB is not a different element from BA. In this case, where we want the combinations of n objects in x places, the number of possible outcomes is given by

$C_x^n \text{ or } C(n, x) = \dfrac{P(n, x)}{x!} = \dfrac{n!}{(n-x)!\,x!} = \binom{n}{x}$
Example: Find all possible combinations of the letters A, B, C, and D when
duplication is not allowed and order is not important.
The result according to the formula is: n = 4 and x = 2; consequently the possible number of combinations is $C_2^4 = \dfrac{4!}{2!\,(4-2)!} = \dfrac{2\cdot3\cdot4}{2\cdot2} = 6$.
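The three counting rules can be checked by direct enumeration. The following sketch in Python (an illustration added here, not part of the original notes) lists the arrangements of the letters A, B, C, D taken two at a time under each rule.

from itertools import product, permutations, combinations

letters = ["A", "B", "C", "D"]

# 1. Duplication allowed, order important: n^x arrangements
multiple_choice = list(product(letters, repeat=2))   # 4^2 = 16

# 2. No duplication, order important: n!/(n-x)! permutations
perms = list(permutations(letters, 2))               # 4*3 = 12

# 3. No duplication, order unimportant: n!/((n-x)! x!) combinations
combs = list(combinations(letters, 2))               # 6

print(len(multiple_choice), len(perms), len(combs))  # 16 12 6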
Let us now define probability rigorously.
$\bigcap_{\alpha\in\Gamma} A_\alpha = \{x \in S : x \in A_\alpha \text{ for all } \alpha \in \Gamma\}$
Note that since A is non-empty, (1) and (2) ⇒ φ ∈ A and S ∈ A. Note also
that $\bigcap_{n=1}^{\infty} A_n \in \mathcal{A}$. The largest σ-algebra is the set of all subsets of S, denoted by
P (S), and the smallest is {φ, S}. We can generate a σ-algebra from any collection
of subsets by adding to the set the complements and the unions of its elements. For
example, let S = R and let B be the collection of all open intervals; then A = σ(B) consists of all intervals and countable unions of intervals and complements thereof. This is called the Borel σ-algebra and is the usual σ-algebra we work with when S = R. The σ-algebra A ⊂ P(R), i.e., there are sets in P(R) not
in A. These are some pretty nasty ones like the Cantor set. We can alternatively
construct the Borel σ-algebra by considering J the set of all intervals of the form
(−∞, x], x ∈ R. We can prove that σ (J ) = σ (B). We can now give the definition
of probability measure which is due to Kolmogorov.
Definition 2 Given a sample space S and a σ-algebra A of subsets of S, a probability measure is a mapping P : A → R such that

1. P(A) ≥ 0 for all A ∈ A

2. P(S) = 1

3. if A1, A2, ... are pairwise disjoint, i.e., Ai ∩ Aj = φ for all i ≠ j, then

$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$
For example, a uniform probability on an interval assigns P((a, b)) ∝ b − a.
Properties of P
1. P (φ) = 0
2. P (A) ≤ 1
3. P (Ac ) = 1 − P (A)
4. P (B ∩ Ac ) = P (B) − P (B ∩ A)
5. If A ⊂ B ⇒ P (A) ≤ P (B)
6. P(B ∪ A) = P(A) + P(B) − P(A ∩ B). More generally, for events A1, A2, ..., An ∈ A we have:

$P\left[\bigcup_{i=1}^{n} A_i\right] = \sum_{i=1}^{n} P[A_i] - \sum\sum_{i<j} P[A_i A_j] + \sum\sum\sum_{i<j<k} P[A_i A_j A_k] - \cdots + (-1)^{n+1} P[A_1 \cdots A_n].$

7. $P\left(\bigcup_{i=1}^{\infty} A_i\right) \le \sum_{i=1}^{\infty} P(A_i)$
Proofs involve manipulating the sets to obtain disjoint sets and then applying the axioms.
From the above formula it is evident that P[AB] = P[A|B]P[B] = P[B|A]P[A] if both P[A] and P[B] are nonzero. Notice that when speaking of conditional probabilities we are conditioning on some given event B; that is, we are assuming that the experiment has resulted in some outcome in B. B, in effect, then becomes our "new" sample space. All probability properties of the previous section apply to conditional probabilities as well, i.e. P(·|B) is a probability measure. In particular:
1. P (A|B) ≥ 0
2. P (S|B) = 1
3. $P\left(\bigcup_{i=1}^{\infty} A_i \,\middle|\, B\right) = \sum_{i=1}^{\infty} P(A_i|B)$ for any pairwise disjoint events $\{A_i\}_{i=1}^{\infty}$.
P (A) = P (A ∩ B) + P (A ∩ B c )
$P[A] = \sum_{i=1}^{n} P[A|B_i]P[B_i]$
$P[B_j|A] = \dfrac{P[A|B_j]P[B_j]}{\sum_{i=1}^{n} P[A|B_i]P[B_i]}$
Notice that for events A and B ∈ A which satisfy P [A] > 0 and P [B] > 0 we have:
$P(B|A) = \dfrac{P(A|B)P(B)}{P(A|B)P(B) + P(A|B^c)P(B^c)}.$
This follows from the definition of conditional probability and the law of total probability. The probability P(B) is a prior probability and P(A|B) is frequently a likelihood, while P(B|A) is the posterior.
Finally the Multiplication Rule states:
Given a probability space (Ω, A, P [.]), if A1 , A2 , ..., An are events in A for
which P [A1 A2 ......An−1 ] > 0 then:
$P[A_1 A_2 \cdots A_n] = P[A_1]\,P[A_2|A_1]\,P[A_3|A_1A_2]\cdots P[A_n|A_1A_2\cdots A_{n-1}]$
Example: A plant has two machines. Machine A produces 60% of the total output with a fraction defective of 0.02. Machine B produces the rest of the output with a fraction defective of 0.04. If a single unit of output is observed to be defective, what is the probability that this unit was produced by machine A?
Let A be the event that the unit was produced by machine A, B the event that the unit was produced by machine B, and D the event that the unit is defective. Then we ask what is P[A|D]. But P[A|D] = P[AD]/P[D]. Now P[AD] = P[D|A]P[A] = 0.02 × 0.6 = 0.012. Also P[D] = P[D|A]P[A] + P[D|B]P[B] = 0.012 + 0.04 × 0.4 = 0.028. Consequently, P[A|D] = 0.012/0.028 = 0.429. Notice that P[B|D] = 1 − P[A|D] = 0.571. We can also use a tree diagram to evaluate P[AD] and P[BD].
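As a numerical check, the total-probability and Bayes steps above can be coded directly; a minimal sketch in Python:

p_A, p_B = 0.6, 0.4                     # machine shares of output
p_D_given_A, p_D_given_B = 0.02, 0.04   # fraction defective for each machine

p_AD = p_D_given_A * p_A                # P[AD] = 0.012
p_BD = p_D_given_B * p_B                # P[BD] = 0.016
p_D = p_AD + p_BD                       # P[D]  = 0.028

print(p_AD / p_D)                       # P[A|D] ≈ 0.429
print(p_BD / p_D)                       # P[B|D] ≈ 0.571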
Example: A marketing manager believes the market demand potential of a
new product to be high with a probability of 0.30, or average with probability of 0.50,
or to be low with a probability of 0.20. From a sample of 20 employees, 14 indicated
a very favorable reception to the new product. In the past such an employee response
(14 out of 20 favorable) has occurred with the following probabilities: if the actual
demand is high, the probability of favorable reception is 0.80; if the actual demand is
average, the probability of favorable reception is 0.55; and if the actual demand is low,
the probability of the favorable reception is 0.30. Thus given a favorable reception,
what is the probability of actual high demand?
Again what we ask is P[H|F] = P[HF]/P[F]. Now P[F] = P[H]P[F|H] + P[A]P[F|A] + P[L]P[F|L] = 0.24 + 0.275 + 0.06 = 0.575. Also P[HF] = P[F|H]P[H] = 0.24. Hence P[H|F] = 0.24/0.575 = 0.4174.
Example: There are five boxes and they are numbered 1 to 5. Each box
contains 10 balls. Box i has i defective balls and 10−i non-defective balls, i = 1, 2, .., 5.
Consider the following random experiment: First a box is selected at random, and
then a ball is selected at random from the selected box. 1) What is the probability
that a defective ball will be selected? 2) If we have already selected the ball and noted that it is defective, what is the probability that it came from box 5?
Let A denote the event that a defective ball is selected and Bi the event
that box i is selected, i = 1, 2, .., 5. Note that P [Bi ] = 1/5, for i = 1, 2, .., 5, and
P [A|Bi ] = i/10. Question 1) is what is P [A]? Using the theorem of total probabilities
we have:
$P[A] = \sum_{i=1}^{5} P[A|B_i]P[B_i] = \sum_{i=1}^{5} \frac{i}{10}\,\frac{1}{5} = 3/10.$ Notice that the total number of defective balls is 15 out of 50. Hence in this case we can say that P[A] = 15/50 = 3/10.
This is true as the probabilities of choosing each of the 5 boxes is the same. Question
2) asks what is P [B5 |A]. Since box 5 contains more defective balls than box 4, which
contains more defective balls than box 3 and so on, we expect to find that P [B5 |A] >
P [B4 |A] > P [B3 |A] > P [B2 |A] > P [B1 |A]. We apply Bayes’ theorem:
$P[B_5|A] = \dfrac{P[A|B_5]P[B_5]}{\sum_{i=1}^{5} P[A|B_i]P[B_i]} = \dfrac{\frac{5}{10}\,\frac{1}{5}}{\frac{3}{10}} = \dfrac{1}{3}$
Similarly $P[B_j|A] = \dfrac{P[A|B_j]P[B_j]}{\sum_{i=1}^{5} P[A|B_i]P[B_i]} = \dfrac{\frac{j}{10}\,\frac{1}{5}}{\frac{3}{10}} = \dfrac{j}{15}$ for j = 1, 2, ..., 5. Notice that unconditionally all $B_i$'s were equally likely.
If two events A and B are mutually exclusive, then P[AB] = 0. If they are also independent then P[AB] = P[A]P[B], and consequently P[A]P[B] = 0, which means that either P[A] = 0 or P[B] = 0. Hence two mutually exclusive events are independent if P[A] = 0 or P[B] = 0. On the other hand, if P[A] ≠ 0 and P[B] ≠ 0, then if A and B are independent they cannot be mutually exclusive, and conversely, if they are mutually exclusive they cannot be independent. Also notice that independence is not transitive, i.e., A independent of B and B independent of C does not imply that A is independent of C.
Example: Consider tossing two dice. Let A denote the event of an odd total,
B the event of an ace on the first die, and C the event of a total of seven. We ask
the following:
(i) Are A and B independent?
(ii) Are A and C independent?
(iii) Are B and C independent?
(i) P[A|B] = 1/2, P[A] = 1/2, hence P[A|B] = P[A] and consequently A and B are independent.

(ii) P[A|C] = 1 ≠ P[A] = 1/2, hence A and C are not independent.

(iii) P[C|B] = 1/6 = P[C], hence B and C are independent.
Notice that although A and B are independent and C and B are independent
A and C are not independent.
Let us extend the independence of two events to several ones:
For a given probability space (Ω, A, P [.]), let A1 , A2 , ..., An be n events in A.
Events A1 , A2 , ..., An are defined to be independent if and only if:
P[A_i A_j] = P[A_i]P[A_j] for i ≠ j

P[A_i A_j A_k] = P[A_i]P[A_j]P[A_k] for i ≠ j, i ≠ k, k ≠ j

and so on

$P\left[\bigcap_{i=1}^{n} A_i\right] = \prod_{i=1}^{n} P[A_i]$
Notice that pairwise independence does not imply independence, as the fol-
lowing example shows.
Example: Consider tossing two dice. Let A1 denote the event of an odd face
in the first die, A2 the event of an odd face in the second die, and A3 the event of
an odd total. Then we have: $P[A_1]P[A_2] = \frac{1}{2}\cdot\frac{1}{2} = P[A_1A_2]$, $P[A_1]P[A_3] = \frac{1}{2}\cdot\frac{1}{2} = P[A_3|A_1]P[A_1] = P[A_1A_3]$, and $P[A_2A_3] = \frac{1}{4} = P[A_2]P[A_3]$; hence $A_1, A_2, A_3$ are pairwise independent. However, notice that $P[A_1A_2A_3] = 0 \ne \frac{1}{8} = P[A_1]P[A_2]P[A_3]$. Hence $A_1, A_2, A_3$ are not independent.
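The pairwise-but-not-joint independence can be verified by enumerating the 36 equally likely outcomes of the two dice; a sketch added here for illustration:

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))      # 36 equally likely pairs
p = lambda event: sum(1 for o in outcomes if event(o)) / 36.0

A1 = lambda o: o[0] % 2 == 1           # odd face on the first die
A2 = lambda o: o[1] % 2 == 1           # odd face on the second die
A3 = lambda o: (o[0] + o[1]) % 2 == 1  # odd total

both = lambda e, f: (lambda o: e(o) and f(o))
print(p(both(A1, A2)), p(A1) * p(A2))          # 0.25 0.25
print(p(both(A1, A3)), p(A1) * p(A3))          # 0.25 0.25
print(p(both(A2, A3)), p(A2) * p(A3))          # 0.25 0.25
print(p(lambda o: A1(o) and A2(o) and A3(o)))  # 0.0, not 1/8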
Chapter 2
The probability space (S, A, P ) is not particularly easy to work with. In practice, we
often need to work with spaces with some structure (metric spaces). It is convenient
therefore to work with a cardinalization of S by using the notion of random variable.
Formally, a random variable X is just a mapping from the sample space to
the real line, i.e.,
X : S −→ R,
$\mathcal{A}_X = \{A \subset S : X(A) \in B\} = \left\{X^{-1}(B) : B \in \mathcal{B}\right\} \subseteq \mathcal{A},$

$P_X(X \in B) = P\left(X^{-1}(B)\right).$
{φ, S, A, Ac }
FX (x) = PX (X ≤ x)
defined for all x ∈ R. This function effectively replaces PX . Note that we can
reconstruct PX from FX .
EXAMPLE. S = {H, T } , X (H) = 1, X (T ) = 0, (p = 1/2).
If x < 0, FX (x) = 0
If 0 ≤ x < 1, FX (x) = 1/2
If x ≥ 1, FX (x) = 1.
EXAMPLE. The logit c.d.f. is

$F_X(x) = \dfrac{1}{1 + e^{-x}}$
Any function f(.) with domain the real line and counterdomain [0, 1] is defined to be a discrete density function if, for some countable set x1, x2, ..., xn, ..., it has the following properties:

i) f(xj) > 0 for j = 1, 2, ...

ii) f(x) = 0 for x ≠ xj; j = 1, 2, ...

iii) $\sum_j f(x_j) = 1$, where the summation is over the points x1, x2, ..., xn, ....
2.0.3 Continuous Random Variables
A random variable X is called continuous if there exist a function fX (.) such that
$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$ for every real number x. In such a case $F_X(x)$ is the cumulative
distribution and the function fX (.) is the density function.
Notice that according to the above definition the density function is not uniquely determined. The idea is that if a function changes value at a few points, its integral is unchanged. Furthermore, notice that $f_X(x) = dF_X(x)/dx$ wherever the derivative exists.
The notations for discrete and continuous density functions are the same,
yet they have different interpretations. We know that for discrete random variables
fX (x) = P [X = x], which is not true for continuous random variables. Further-
more, for discrete random variables fX (.) is a function with domain the real line and
counterdomain the interval [0, 1], whereas, for continuous random variables fX (.) is a
function with domain the real line and counterdomain the interval [0, ∞). Note that
for a continuous r.v.
$P(X = x) \le P(x - \varepsilon \le X \le x) = F_X(x) - F_X(x - \varepsilon) \to 0$

as ε → 0; hence P(X = x) = 0 for every x, and P(a ≤ X ≤ b) = P(a < X < b) for any a, b.
Intuitively, the mean of a discrete random variable is a weighted average of the values the random variable takes on, where each value is weighted by the probability that the random variable takes this value. Values that are more probable receive more weight. The same is true in the integral form in (ii). There the value x is multiplied by the approximate probability that X equals the value x, i.e. $f_X(x)dx$, and then integrated over all values.
Notice that in the definition of a mean of a random variable, only density
functions or cumulative distributions were used. Hence we have really defined the
mean for these functions without reference to random variables. We then call the
defined mean the mean of the cumulative distribution or the appropriate density
function. Hence, we can speak of the mean of a distribution or density function as
well as the mean of a random variable.
Notice that E[X] is the center of gravity (or centroid) of the unit mass that
is determined by the density function of X. So the mean of X is a measure of where
the values of the random variable are centered or located, i.e. it is a measure of central location.
Example: Consider the experiment of tossing two dice. Let X denote the
total of the upturned faces. Then for this case we have:
$E[X] = \sum_{i=2}^{12} i\, f_X(i) = 7$
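The value 7 can be confirmed by enumerating the 36 outcomes; a small Python sketch, added for illustration:

from itertools import product

totals = [i + j for i, j in product(range(1, 7), repeat=2)]
mean = sum(totals) / 36.0   # each outcome has probability 1/36
print(mean)                 # 7.0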
Example: Consider a random variable X that can take only two possible values, 1 and -1, each
with probability 0.5. Then the mean of X is:
E[X] = 1 ∗ 0.5 + (−1) ∗ 0.5 = 0
Notice that the mean in this case is not one of the possible values of X.
Example: Consider a continuous random variable X with density function
fX (x) = λe−λx for x ∈ [0, ∞). Then
$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_{0}^{\infty} x \lambda e^{-\lambda x}\,dx = 1/\lambda$
Example: Consider a continuous random variable X with density function
fX (x) = x−2 for x ∈ [1, ∞). Then
$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_{1}^{\infty} x\, x^{-2}\,dx = \lim_{b\to\infty} \log b = \infty$
$F_X(m) = \dfrac{1}{2}.$
Since in this case $F_X^{-1}(\cdot)$ exists, we can alternatively write $m = F_X^{-1}(\frac{1}{2})$. For a discrete r.v., there may be many m that satisfy this, or there may be none. Suppose
$X = \begin{cases} 0 & \text{with probability } 1/3 \\ 1 & \text{with probability } 1/3 \\ 2 & \text{with probability } 1/3 \end{cases}$
$E(g(X)) \ge g(E(X))$ (Jensen's inequality for convex g).
An Interpretation of Expectation
We claim that E (X) is the unique minimizer of E (X − θ)2 with respect to θ, assum-
ing that the second moment of X is finite.
Theorem 7 Suppose that E (X 2 ) exists and is finite, then E (X) is the unique min-
imizer of E (X − θ)2 with respect to θ.
This Theorem says that the expectation is the constant closest to X in mean squared error.
3.0.5 Variance
Notice that the variance of a random variable, like the mean, depends only on the density function or cumulative distribution function of the random variable; consequently, variance can be defined in terms of these functions without reference to a random variable.
Notice that variance is a measure of spread since if the values of the random
variable X tend to be far from their mean, the variance of X will be larger than the
variance of a comparable random variable whose values tend to be near their mean.
It is clear from (i), (ii) and (iii) that the variance is a nonnegative number.
If X is a random variable with variance $\sigma_X^2$, then the standard deviation of X, denoted by $\sigma_X$, is defined as $\sqrt{\mathrm{var}(X)}$.
The standard deviation of a random variable, like the variance, is a measure
of spread or dispersion of the values of a random variable. In many applications it
is preferable to the variance since it will have the same measurement units as the
random variable itself.
Example: Consider the experiment of tossing two dice. Let X denote the
total of the upturned faces. Then for this case we have (μX = 7):
$\mathrm{var}[X] = \sum_{i=2}^{12} (i - \mu_X)^2 f_X(i) = 210/36$
Example: Consider a random variable X that can take only two possible values, 1 and -1, each
with probability 0.5. Then the variance of X is (μX = 0):
var[X] = 0.5 ∗ 12 + 0.5 ∗ (−1)2 = 1
Example: Consider a random variable X that can take only two possible values, 10 and -10,
each with probability 0.5. Then we have:
μX = E[X] = 10 ∗ 0.5 + (−10) ∗ 0.5 = 0
var[X] = 0.5 ∗ 102 + 0.5 ∗ (−10)2 = 100
Notice that in examples 2 and 3 the two random variables have the same mean
but different variance, larger being the variance of the random variable with values
further away from the mean.
Example: Consider a continuous random variable X with density function
fX (x) = λe−λx for x ∈ [0, ∞). Then (μX = 1/λ):
$\mathrm{var}[X] = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\,dx = \int_{0}^{\infty} (x - 1/\lambda)^2 \lambda e^{-\lambda x}\,dx = \dfrac{1}{\lambda^2}$
Example: Consider a continuous random variable X with density function
fX (x) = x−2 for x ∈ [1, ∞). Then we know that the mean of X does not exist.
Consequently, we can not define the variance.
Notice that

$\mathrm{Var}(X) = E\left[(X - E(X))^2\right] = E(X^2) - E^2(X)$

and that

$\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X), \quad SD = \sqrt{\mathrm{Var}}, \quad SD(aX + b) = |a|\,SD(X),$
$\mu'_r = E[X^r]$

if this expectation exists. Notice that $\mu'_1 = E[X] = \mu_X$, the mean of X.
If X is a random variable, the rth central moment of X about α is defined
as E[(X −α)r ]. If α = μX , we have the rth central moment of X about μX , denoted
by μr , which is:
μr = E[(X − μX )r ]
$\int_{-\infty}^{\mathrm{med}(X)} f_X(x)\,dx = \dfrac{1}{2} = \int_{\mathrm{med}(X)}^{\infty} f_X(x)\,dx$
so the median of X is any number that has half the mass of X to its right and the
other half to its left. The median and the mean are measures of central location.
While a particular moment, or a few of the moments, may give little information about a distribution, the entire set of moments will (under suitable conditions) determine the distribution exactly. In applied statistics the first two moments are of great importance, but the third and fourth are also useful.
Finally we turn to the moment generating function (mgf) and characteristic Function
(cf). The mgf is defined as
$M_X(t) = E\left(e^{tX}\right) = \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx$
for any real t, provided this integral exists in some neighborhood of 0. It is the
Laplace transform of the function fX (·) with argument −t. We have the useful
inversion formula

$f_X(x) = \int_{-\infty}^{\infty} M_X(t)\, e^{-tx}\,dt$
The mgf is of limited use, since it does not exist for many r.v.; the cf is applicable more generally, since it always exists:

$\varphi_X(t) = E\left(e^{itX}\right) = \int_{-\infty}^{\infty} e^{itx} f_X(x)\,dx = \int_{-\infty}^{\infty} \cos(tx) f_X(x)\,dx + i\int_{-\infty}^{\infty} \sin(tx) f_X(x)\,dx$
This essentially is the Fourier transform of the function fX (·) and there is a well
defined inversion formula
$f_X(x) = \dfrac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx} \varphi_X(t)\,dt$
$\dfrac{d^r}{dt^r}\varphi_X(0) = E\left(i^r X^r e^{itX}\right)\Big|_{t=0} = i^r E(X^r), \quad r = 1, 2, 3, \ldots$
Thus the moments of X are related to the derivative of the cf at the origin.
If

$c(t) = \int_{-\infty}^{\infty} \exp(itx)\,dF(x)$

notice that

$\dfrac{d^r c(t)}{dt^r} = \int_{-\infty}^{\infty} (ix)^r \exp(itx)\,dF(x)$

and

$\left.\dfrac{d^r c(t)}{dt^r}\right|_{t=0} = \int_{-\infty}^{\infty} (ix)^r\,dF(x) = i^r \mu'_r \;\Rightarrow\; \mu'_r = (-i)^r \left.\dfrac{d^r c(t)}{dt^r}\right|_{t=0}$
a1 = k1
a2 = k2 + a1 k1
a3 = k3 + 2a1 k2 + a2 k1
a4 = k4 + 3a1 k3 + 3a2 k2 + a3 k1 .
These recursive formulas can be used to calculate the a's efficiently from the k's, and vice versa. When X has mean 0, that is, when a1 = 0 = k1, aj becomes

$\mu_j = E\left((X - E(X))^j\right),$

$\mu_2 = k_2, \qquad \mu_3 = k_3, \qquad \mu_4 = k_4 + 3k_2^2.$
Let $f(X, Y) = \frac{X}{Y}$, $E(X) = \mu_X$ and $E(Y) = \mu_Y$. Then, expanding $f(X, Y) = \frac{X}{Y}$ around $(\mu_X, \mu_Y)$ up to second order,

$f(X, Y) \approx \dfrac{\mu_X}{\mu_Y} + \dfrac{1}{\mu_Y}(X - \mu_X) - \dfrac{\mu_X}{\mu_Y^2}(Y - \mu_Y) + \dfrac{\mu_X}{\mu_Y^3}(Y - \mu_Y)^2 - \dfrac{1}{\mu_Y^2}(X - \mu_X)(Y - \mu_Y)$

since $\frac{\partial f}{\partial X} = \frac{1}{Y}$, $\frac{\partial f}{\partial Y} = -\frac{X}{Y^2}$, $\frac{\partial^2 f}{\partial X^2} = 0$, $\frac{\partial^2 f}{\partial X \partial Y} = \frac{\partial^2 f}{\partial Y \partial X} = -\frac{1}{Y^2}$, and $\frac{\partial^2 f}{\partial Y^2} = 2\frac{X}{Y^3}$. Taking expectations we have

$E\left(\dfrac{X}{Y}\right) \approx \dfrac{\mu_X}{\mu_Y} + \dfrac{\mu_X}{\mu_Y^3}\,\mathrm{Var}(Y) - \dfrac{1}{\mu_Y^2}\,\mathrm{Cov}(X, Y).$

For the variance, take again the variance of the Taylor expansion and, keeping only terms up to order 2, we have:

$\mathrm{Var}\left(\dfrac{X}{Y}\right) \approx \dfrac{\mu_X^2}{\mu_Y^2}\left[\dfrac{\mathrm{Var}(X)}{\mu_X^2} + \dfrac{\mathrm{Var}(Y)}{\mu_Y^2} - 2\dfrac{\mathrm{Cov}(X, Y)}{\mu_X \mu_Y}\right].$
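These second-order approximations can be checked by simulation. The sketch below uses an arbitrarily chosen bivariate normal; the means, variances and covariance are illustrative assumptions, not values from the notes.

import numpy as np

rng = np.random.default_rng(0)
mu_x, mu_y = 2.0, 5.0                      # illustrative means
cov = np.array([[0.25, 0.05],
                [0.05, 0.16]])             # illustrative covariance matrix

x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=1_000_000).T
ratio = x / y

var_x, var_y, cov_xy = cov[0, 0], cov[1, 1], cov[0, 1]
approx_mean = mu_x / mu_y + mu_x / mu_y**3 * var_y - cov_xy / mu_y**2
approx_var = (mu_x / mu_y)**2 * (var_x / mu_x**2 + var_y / mu_y**2
                                 - 2 * cov_xy / (mu_x * mu_y))

print(ratio.mean(), approx_mean)   # simulated vs approximate mean
print(ratio.var(), approx_var)     # simulated vs approximate variance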
Chapter 4
UNIFORM:
$P(X = x_j) = \dfrac{1}{n}, \quad j = 1, \ldots, n$
Bernoulli
A random variable whose outcome have been classified into two categories, called
“success” and “failure”, represented by the letters s and f, respectively, is called a
Bernoulli trial. If a random variable X is defined as 1 if a Bernoulli trial results in
success and 0 if the same Bernoulli trial results in failure, then X has a Bernoulli
distribution with parameter p = P [success]. The definition of this distribution is:
A random variable X has a Bernoulli distribution if the discrete density of X
is given by:
$f_X(x) = f_X(x; p) = \begin{cases} p^x (1-p)^{1-x} & \text{for } x = 0, 1 \\ 0 & \text{otherwise} \end{cases}$
where p = P [X = 1]. For the above defined random variable X we have that:
BINOMIAL:
Mgf
$M_X(t) = \left[p e^t + (1 - p)\right]^n.$
Example: Consider a stock with value S = 50. Each period the stock moves up or down, independently, in discrete steps of 5. The probability of going up is p = 0.7 and of going down 1 − p = 0.3. What is the expected value and the variance of the value of the stock after 3 periods?
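A direct enumeration over the eight up/down paths, weighted by their probabilities, answers the question; the numerical results below come from this sketch rather than from the notes themselves.

from itertools import product

S0, step, p = 50, 5, 0.7
values, probs = [], []
for path in product([+1, -1], repeat=3):          # all up/down paths over 3 periods
    prob = 1.0
    for move in path:
        prob *= p if move == +1 else (1 - p)
    values.append(S0 + step * sum(path))
    probs.append(prob)

mean = sum(v * q for v, q in zip(values, probs))
var = sum((v - mean) ** 2 * q for v, q in zip(values, probs))
print(mean, var)   # 56.0 and 63.0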
Hypergeometric
Let X denote the number of defective balls in a sample of size n when sampling is done
without replacement from a box containing M balls out of which K are defective.
Then X has a hypergeometric distribution. The definition of this distribution is:
$E[X] = n\dfrac{K}{M} \quad\text{and}\quad \mathrm{var}[X] = n\dfrac{K}{M}\,\dfrac{M-K}{M}\,\dfrac{M-n}{M-1}$
Notice the difference between the binomial and the hypergeometric: for the binomial distribution we have Bernoulli trials, i.e. independent trials with a fixed probability of success or failure, whereas in the hypergeometric the probability of success or failure changes in each trial depending on the previous results.
Geometric
where p is the probability of success in each Bernoulli trial. For this distribution we
have that:
$E[X] = \dfrac{1-p}{p} \quad\text{and}\quad \mathrm{var}[X] = \dfrac{1-p}{p^2}$
POISSON:
$P(X = x \mid \lambda) = \dfrac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, 2, 3, \ldots$
In calculations with the Poisson distribution we may use the fact that
$e^t = \sum_{j=0}^{\infty} \dfrac{t^j}{j!} \quad \text{for any } t.$
The Poisson distribution provides a realistic model for many random phenomena.
Since the values of a Poisson random variable are nonnegative integers, any random phenomenon for which a count of some sort is of interest is a candidate for modeling by assuming a Poisson distribution. Such a count might be the number of fatal traffic accidents per week in a given place, the number of telephone calls per hour arriving at a company's switchboard, the number of pieces of information arriving per hour, etc.
Example: It is known that the average number of daily changes in excess
of 1%, for a specific stock Index, occurring in each six-month period is 5. What is
the probability of having one such a change within the next 6 months? What is the
probability of at least 3 changes within the same period?
We model the number of changes in excess of 1%, X, within the next 6 months as a Poisson process. We know that E[X] = λ = 5. Hence $f_X(x) = \frac{e^{-\lambda}\lambda^x}{x!} = \frac{e^{-5}5^x}{x!}$, for x = 0, 1, 2, .... Then $P[X = 1] = f_X(1) = \frac{e^{-5}5^1}{1!} = 0.0337$. Also

$P[X \ge 3] = 1 - P[X < 3] = 1 - P[X = 0] - P[X = 1] - P[X = 2] = 1 - \frac{e^{-5}5^0}{0!} - \frac{e^{-5}5^1}{1!} - \frac{e^{-5}5^2}{2!} = 0.875.$
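The same probabilities can be obtained directly from the Poisson mass function; a short sketch:

from math import exp, factorial

lam = 5.0
pmf = lambda x: exp(-lam) * lam ** x / factorial(x)

print(pmf(1))                            # P[X = 1] ≈ 0.0337
print(1 - (pmf(0) + pmf(1) + pmf(2)))    # P[X >= 3] ≈ 0.875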
We can approximate the Binomial with Poisson. The approximation is better
the smaller the p and the larger the n.
A very simple distribution for a continuous random variable is the uniform distribu-
tion. Its density function is:
$f(x \mid a, b) = \begin{cases} \dfrac{1}{b-a} & \text{if } x \in [a, b] \\ 0 & \text{otherwise} \end{cases},$

and

$F(x \mid a, b) = \int_{a}^{x} f(z \mid a, b)\,dz = \dfrac{x - a}{b - a}, \quad a \le x \le b,$
where −∞ < a < b < ∞. Then the random variable X is defined to be uniformly
distributed over the interval [a, b]. Now if X is uniformly distributed over [a, b] then $E[X] = \frac{a+b}{2}$ and $\mathrm{var}[X] = \frac{(b-a)^2}{12}$.
Exponential Distribution
The stable distributions are a natural generalization of the normal in that, as their
name suggests, they are stable under addition, i.e. a sum of stable random variables
is also a random variable of the same type. However, nonnormal stable distributions
have more probability mass in the tail areas than the normal. In fact, the nonnormal
stable distributions are so fat-tailed that their variance and all higher moments are
infinite.
Closed form expressions for the density functions of stable random variables
are available for only the cases of normal and Cauchy.
If a random variable X has a density function given by:
$f_X(x) = f_X(x; \gamma, \delta) = \dfrac{1}{\pi}\,\dfrac{\gamma}{\gamma^2 + (x - \delta)^2} \quad\text{for } -\infty < x < \infty$
where −∞ < δ < ∞ and 0 < γ < ∞, then X is defined to have a Cauchy distribution. Notice that for this random variable even the mean does not exist.
Normal or Gaussian:
Lognormal Distribution
Let X be a positive random variable, and let a new random variable Y be defined
as Y = log X. If Y has a normal distribution, then X is said to have a lognormal
distribution. The density function of a lognormal distribution is given by
$f_X(x; \mu, \sigma^2) = \dfrac{1}{x\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\log x - \mu)^2}{2\sigma^2}} \quad\text{for } 0 < x < \infty$

where μ and σ² are parameters such that −∞ < μ < ∞ and σ² > 0. We have

$E[X] = e^{\mu + \frac{1}{2}\sigma^2} \quad\text{and}\quad \mathrm{var}[X] = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}$
Gamma-χ²

$f(x \mid \alpha, \beta) = \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha-1} e^{-x/\beta}, \quad 0 < x < \infty, \; \alpha, \beta > 0$

α is a shape parameter, β is a scale parameter. Here $\Gamma(\alpha) = \int_{0}^{\infty} t^{\alpha-1} e^{-t}\,dt$ is the Gamma function, with $\Gamma(n) = (n-1)!$ for integer n. The $\chi^2_k$ is the special case with $\alpha = k/2$ and $\beta = 2$.
Notice that we can approximate the Poisson and Binomial functions by the
normal, in the sense that if a random variable X is distributed as Poisson with
parameter λ, then $\frac{X - \lambda}{\sqrt{\lambda}}$ is distributed approximately as standard normal. On the other hand, if $Y \sim \mathrm{Binomial}(n, p)$ then $\frac{Y - np}{\sqrt{np(1-p)}} \approx N(0, 1)$.
The standard normal is an important distribution for another reason, as well.
Assume that we have a sample of n independent random variables, x1 , x2 , ..., xn , which
are coming from the same distribution with mean m and variance s2 , then we have
the following:
$\dfrac{1}{\sqrt{n}} \sum_{i=1}^{n} \dfrac{x_i - m}{s} \sim N(0, 1)$
This is the well known Central Limit Theorem for independent observa-
tions.
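A simulation sketch of the Central Limit Theorem, using an exponential population purely as an illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
m, s = 1.0, 1.0                  # mean and sd of an Exponential(1) population
n, reps = 50, 100_000

x = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - m) / s   # standardized sample means

print(z.mean(), z.std())          # close to 0 and 1
print((z <= 1.96).mean())         # close to Phi(1.96) ≈ 0.975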
4.1 Multivariate Random Variables
X = (X1 , X2 , .., Xk ) ∈ Rk
$F_X(x) = P(X_1 \le x_1, \ldots, X_k \le x_k) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_k} f_X(z_1, z_2, \ldots, z_k)\,dz_1 \cdots dz_k$
The multivariate c.d.f. has similar coordinate-wise properties to a univariate c.d.f.
For continuously differentiable c.d.f.’s
$f_X(x) = \dfrac{\partial^k F_X(x)}{\partial x_1 \partial x_2 \cdots \partial x_k}$
4.1.1 Conditional Distributions and Independence
We defined conditional probability P(A|B) = P(A ∩ B)/P(B) for events with P(B) ≠
0. We now want to define conditional distributions of Y |X. In the discrete case there
is no problem
$f_{Y|X}(y|x) = P(Y = y \mid X = x) = \dfrac{f(y, x)}{f_X(x)}$
In the continuous case we define, by analogy,

$f_{Y|X}(y|x) = \dfrac{f(y, x)}{f_X(x)}$
in terms of the joint and marginal densities. It turns out that fY |X (y|x) has the
properties of p.d.f.
1) $f_{Y|X}(y|x) \ge 0$

2) $\int_{-\infty}^{\infty} f_{Y|X}(y|x)\,dy = \dfrac{\int_{-\infty}^{\infty} f(y, x)\,dy}{f_X(x)} = \dfrac{f_X(x)}{f_X(x)} = 1.$
We can define Expectations within the conditional distribution
$E(Y \mid X = x) = \int_{-\infty}^{\infty} y\, f_{Y|X}(y|x)\,dy = \dfrac{\int_{-\infty}^{\infty} y\, f(y, x)\,dy}{\int_{-\infty}^{\infty} f(y, x)\,dy}$
4.1.2 Independence
P (Y ∈ A, X ∈ B) = P (Y ∈ A)P (X ∈ B)
for all events A, B, in the relevant sigma-algebras. This is equivalent to the cdf’s
version which is simpler to state and apply.
for all y, x, z.
Multivariate Normal
where
$\mu_{X_1|X_2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{X_1|X_2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$
so that

$f_X(x \mid \mu, \Sigma) = \prod_{j=1}^{k} \dfrac{1}{\sqrt{2\pi\sigma_{jj}}} \exp\left(-\dfrac{1}{2}\left(\dfrac{x_j - \mu_j}{\sigma_{jj}}\right)^2\right)$
We now consider the relationship between two, or more, r.v. when they are not inde-
pendent. In this case, conditional density fY |X and c.d.f. FY |X is in general varying
with the conditioning point x. Likewise for conditional mean E (Y |X), conditional
median M(Y|X), conditional variance V(Y|X), conditional cf $E\left(e^{itY} \mid X\right)$, and other
functionals, all of which characterize the relationship between Y and X. Note that
this is a directional concept, unlike covariance, and so for example E (Y |X) can be
very different from E (X|Y ).
Regression Models:
We start with random variable (Y, X). We can write for any such random variable
$Y = \underbrace{E(Y|X)}_{\text{systematic part } m(X)} + \underbrace{Y - E(Y|X)}_{\text{random part } \varepsilon}$
E (Y |X) = α + βX
V ar (Y |X) = σ2
and in fact ε ⊥⊥ X.
We have the following result about conditional expectations
Covariance
$\rho_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
Note that
ρaX+b,cY +d = sign (a) × sign (c) × ρXY
The properties are very useful for evaluating the expected return and stan-
dard deviation of a portfolio. Assume ra and rb are the returns on assets A and B,
and their variances are σ2a and σ 2b , respectively. Assume that we form a portfolio of
the two assets with weights wa and wb , respectively. If the correlation of the returns
of these assets is ρ, find the expected return and standard deviation of the portfolio.
From the above example we can see that $\mathrm{var}[aX + bY] = a^2\mathrm{var}[X] + b^2\mathrm{var}[Y] + 2ab\,\mathrm{cov}[X, Y]$ for random variables X and Y and constants a and b. In fact we can generalize the formula above for several random variables $X_1, X_2, \ldots, X_n$ and constants $a_1, a_2, \ldots, a_n$, i.e.

$\mathrm{var}[a_1X_1 + a_2X_2 + \cdots + a_nX_n] = \sum_{i=1}^{n} a_i^2\,\mathrm{var}[X_i] + 2\sum_{i<j} a_i a_j\,\mathrm{cov}[X_i, X_j]$
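Applied to the portfolio question above, these formulas give the expected return and standard deviation directly. The sketch below uses illustrative weights, means, standard deviations and correlation; these numbers are assumptions, not values from the notes.

import numpy as np

w = np.array([0.6, 0.4])            # assumed portfolio weights w_a, w_b
mu = np.array([0.08, 0.12])         # assumed expected returns of A and B
sd = np.array([0.15, 0.25])         # assumed standard deviations
rho = 0.3                           # assumed correlation of the returns

cov = np.array([[sd[0]**2, rho * sd[0] * sd[1]],
                [rho * sd[0] * sd[1], sd[1]**2]])

port_mean = w @ mu                  # w_a E[r_a] + w_b E[r_b]
port_var = w @ cov @ w              # w_a^2 var_a + w_b^2 var_b + 2 w_a w_b cov_ab
print(port_mean, np.sqrt(port_var))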
4.2 Inequalities
This section gives some inequalities that are useful in establishing a variety of prob-
abilistic results.
4.2.1 Markov
Let Y be a random variable and consider a function g(.) such that g(y) ≥ 0 for all y ∈ R. Assume that E[g(Y)] exists. Then, for every c > 0,

$P[g(Y) \ge c] \le \dfrac{E[g(Y)]}{c}.$
Proof:
Assume that Y is a continuous random variable (the discrete case follows analogously) with p.d.f. f(.). Define A1 = {y | g(y) ≥ c} and A2 = {y | g(y) < c}. Then
$E[g(Y)] = \int_{A_1} g(y) f(y)\,dy + \int_{A_2} g(y) f(y)\,dy \ge \int_{A_1} g(y) f(y)\,dy \ge \int_{A_1} c f(y)\,dy = c\,P[g(Y) \ge c].$
4.2.3 Minkowski
Let Y and Z be random variables such that $E(|Y|^{\alpha}) < \infty$ and $E(|Z|^{\alpha}) < \infty$ for some 1 ≤ α < ∞. Then

$\left[E(|Y + Z|^{\alpha})\right]^{1/\alpha} \le \left[E(|Y|^{\alpha})\right]^{1/\alpha} + \left[E(|Z|^{\alpha})\right]^{1/\alpha}.$
4.2.4 Triangle
4.2.5 Cauchy-Schwarz

$E^2(XY) \le E(X^2)\,E(Y^2)$

and, for sequences of real numbers,

$\left(\sum a_j b_j\right)^2 \le \left(\sum a_j^2\right)\left(\sum b_j^2\right)$
Proof:
Let $0 \le h(t) = E\left[(tX - Y)^2\right] = t^2 E(X^2) + E(Y^2) - 2tE(XY)$. Then the function h(t) is a quadratic function in t which is increasing as t → ±∞. It has a unique minimum at $h'(t) = 0 \Rightarrow 2tE(X^2) - 2E(XY) = 0 \Rightarrow t = \frac{E(XY)}{E(X^2)}$. Hence

$0 \le h\left(\frac{E(XY)}{E(X^2)}\right) \Rightarrow E^2(XY) \le E(X^2)\,E(Y^2). \;\blacksquare$
For p, q > 1 with 1/p + 1/q = 1 (Hölder's inequality),

$E|XY| \le \left(E|X|^p\right)^{1/p}\left(E|Y|^q\right)^{1/q}$
Let X be a random variable with mean E[X], and let g(.) be a convex function. Then
E[g(X)] ≥ g(E[X]).
Now a continuous function g(.) with domain and counterdomain the real line is called convex if for any x0 on the real line, there exists a line which goes through the point (x0, g(x0)) and lies on or under the graph of the function g(.). Also if $g''(x_0) \ge 0$ for all x0 then g(.) is convex.
Part II
Statistical Inference
Chapter 5
SAMPLING THEORY
$M'_r = \dfrac{1}{n}\sum_{i=1}^{n} X_i^r.$
$s^2 = \dfrac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2, \qquad s = \sqrt{s^2}$

$s_*^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2.$
$F_n(x) = \dfrac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left(X_i \le x\right)$
$\varphi_n(t) = \dfrac{1}{n}\sum_{i=1}^{n} e^{itX_i} = \dfrac{1}{n}\sum_{i=1}^{n} \cos(tX_i) + \dfrac{i}{n}\sum_{i=1}^{n} \sin(tX_i)$
These are analogues of the corresponding population characteristics and will be shown to be similar to them when n is large. We calculate the properties of these variables: (1) exact properties; (2) asymptotic properties.
Theorem 10 Let X1 , X2 , ..., Xk be a random sample from the density f (.). The
expected value of the rth sample moment is equal to the rth population moment, i.e.
the rth sample moment is an unbiased estimator of the rth population moment (Proof
omitted).
Theorem 11 Let X1 , X2 , ..., Xn be a random sample from a density f (.), and let
$\bar{X}_n = \dfrac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean. Then

$E[\bar{X}_n] = \mu \quad\text{and}\quad \mathrm{var}[\bar{X}_n] = \dfrac{1}{n}\sigma^2$
where μ and σ² are the mean and variance of f(.), respectively. Notice that this is true for any distribution f(.), provided that σ² is not infinite.
Proof

$E[\bar{X}_n] = E\left[\dfrac{1}{n}\sum_{i=1}^{n} X_i\right] = \dfrac{1}{n}\sum_{i=1}^{n} E[X_i] = \dfrac{1}{n}\sum_{i=1}^{n}\mu = \dfrac{1}{n}n\mu = \mu.$ Also

$\mathrm{var}[\bar{X}_n] = \mathrm{var}\left[\dfrac{1}{n}\sum_{i=1}^{n} X_i\right] = \dfrac{1}{n^2}\sum_{i=1}^{n}\mathrm{var}[X_i] = \dfrac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \dfrac{1}{n^2}n\sigma^2 = \dfrac{1}{n}\sigma^2$
Theorem 12 Let X1 , X2 , ..., Xn be a random sample from a density f (.), and let s2∗
defined as above. Then
$E[s_*^2] = \sigma^2 \quad\text{and}\quad \mathrm{var}[s_*^2] = \dfrac{1}{n}\left(\mu_4 - \dfrac{n-3}{n-1}\sigma^4\right)$
where σ 2 and μ4 are the variance and the 4th central moment of f (.), respectively.
Notice that this is true for any distribution f (.), provided that μ4 is not infinite.
Proof

We shall prove first the following identity, which will be used later:

$\sum_{i=1}^{n}(X_i - \mu)^2 = \sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2 + n\left(\bar{X}_n - \mu\right)^2$

since

$\sum(X_i - \mu)^2 = \sum\left[\left(X_i - \bar{X}_n\right) + \left(\bar{X}_n - \mu\right)\right]^2 = \sum\left[\left(X_i - \bar{X}_n\right)^2 + 2\left(X_i - \bar{X}_n\right)\left(\bar{X}_n - \mu\right) + \left(\bar{X}_n - \mu\right)^2\right]$
$= \sum\left(X_i - \bar{X}_n\right)^2 + 2\left(\bar{X}_n - \mu\right)\sum\left(X_i - \bar{X}_n\right) + n\left(\bar{X}_n - \mu\right)^2 = \sum\left(X_i - \bar{X}_n\right)^2 + n\left(\bar{X}_n - \mu\right)^2.$

Using the above identity we obtain:

$E[s_*^2] = E\left[\dfrac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2\right] = \dfrac{1}{n-1}E\left[\sum_{i=1}^{n}(X_i - \mu)^2 - n\left(\bar{X}_n - \mu\right)^2\right]$
$= \dfrac{1}{n-1}\left[\sum_{i=1}^{n}E(X_i - \mu)^2 - nE\left(\bar{X}_n - \mu\right)^2\right] = \dfrac{1}{n-1}\left[n\sigma^2 - n\,\mathrm{var}\left(\bar{X}_n\right)\right] = \dfrac{1}{n-1}\left[n\sigma^2 - n\dfrac{1}{n}\sigma^2\right] = \sigma^2$

The derivation of the variance of $s_*^2$ is omitted.
Proof

$E(F_n(x)) = E\left(\dfrac{1}{n}\sum_{i=1}^{n}\mathbf{1}(X_i \le x)\right) = E(\mathbf{1}(X_i \le x)) = F(x).$ Also

$\mathrm{Var}(F_n(x)) = E[F_n(x) - F(x)]^2 = E\left\{\dfrac{1}{n}\sum_{i=1}^{n}\mathbf{1}[X_i \le x] - F(x)\right\}^2$
$= E\left\{\dfrac{1}{n^2}\sum_{i=1}^{n}\{\mathbf{1}[X_i \le x] - F(x)\}^2 + \dfrac{1}{n^2}\sum_{i \ne j}\{\mathbf{1}[X_i \le x] - F(x)\}\{\mathbf{1}[X_j \le x] - F(x)\}\right\}$
$= \dfrac{1}{n}E\{\mathbf{1}[X_i \le x] - F(x)\}^2 = \dfrac{1}{n}\left\{E(\mathbf{1}[X_i \le x]) - F^2(x)\right\} = \dfrac{1}{n}F(x)[1 - F(x)]$
Theorem 14 Let $\bar{X}_n$ denote the sample mean of a random sample of size n from a normal distribution with mean μ and variance σ². Then

(1) $\bar{X} \sim N(\mu, \sigma^2/n)$.

(2) $\bar{X}$ and $s_*^2$ are independent.

(3) $\dfrac{(n-1)s_*^2}{\sigma^2} \sim \chi^2_{n-1}$

(4) $\dfrac{\bar{X} - \mu}{s_*/\sqrt{n}} \sim t_{n-1}.$
Proof
and if t is an integer then Γ(t + 1) = t!. Also if t is again an integer then $\Gamma\left(t + \tfrac{1}{2}\right) = \dfrac{1\cdot3\cdot5\cdots(2t-1)}{2^t}\sqrt{\pi}$. Finally $\Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}$.
Recall that if X is a random variable with density
$f_X(x) = \dfrac{1}{\Gamma(k/2)}\left(\dfrac{1}{2}\right)^{k/2} x^{\frac{k}{2}-1} e^{-\frac{1}{2}x} \quad \text{for } 0 < x < \infty$
where Γ(.) is the gamma function, then X is defined to have a chi-square distrib-
ution with k degrees of freedom.
Notice that if X is distributed as above then E[X] = k and var[X] = 2k.

Furthermore,
where Γ(.) is the gamma function, then X is defined to have a F distribution with
m and n degrees of freedom.
Notice that if X is distributed as above then:
$E[X] = \dfrac{n}{n-2} \quad\text{and}\quad \mathrm{var}[X] = \dfrac{2n^2(m+n-2)}{m(n-2)^2(n-4)}$
$f_X(x) = \dfrac{\Gamma[(k+1)/2]}{\Gamma(k/2)}\,\dfrac{1}{\sqrt{k\pi}}\,\dfrac{1}{\left[1 + x^2/k\right]^{(k+1)/2}} \quad \text{for } -\infty < x < \infty$
where Γ(.) is the gamma function, then X is defined to have a t distribution with
k degrees of freedom.
Notice that if X is distributed as above then:
$E[X] = 0 \quad\text{and}\quad \mathrm{var}[X] = \dfrac{k}{k-2}$
The above Theorems are very useful especially to get the distribution of various
tests and construct confidence intervals.
Chapter 6
Point estimation involves two problems. The first is to devise some means of obtaining a statistic to use as an estimator. The second is to select criteria and techniques to compare the resulting estimators.
Any statistic (known function of observable random variables that is itself a random
variable) whose values are used to estimate τ (θ), where τ (.) is some function of the
parameter θ, is defined to be an estimator of τ (θ).
Notice that for specific values of the realized random sample the estimator
takes a specific value called estimate.
Let the solution to these equations be $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_k$. We say that these k estimators are the estimators of $\theta_1, \theta_2, \ldots, \theta_k$ obtained by the method of moments.
Example: Let X1, X2, ..., Xn be a random sample from a normal distribution with mean μ and variance σ². Let (θ1, θ2) = (μ, σ²). Estimate the parameters μ and σ by the method of moments. Recall that $\sigma^2 = \mu'_2 - (\mu'_1)^2$ and $\mu = \mu'_1$. The method of moments equations become:

$\dfrac{1}{n}\sum_{i=1}^{n} X_i = \bar{X} = M'_1 = \mu'_1 = \mu'_1(\mu, \sigma^2) = \mu$

$\dfrac{1}{n}\sum_{i=1}^{n} X_i^2 = M'_2 = \mu'_2 = \mu'_2(\mu, \sigma^2) = \sigma^2 + \mu^2$

Solving the two equations for μ and σ we get:

$\hat{\mu} = \bar{X}, \qquad \hat{\sigma} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2},$

which are the method-of-moments estimators of μ and σ.
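A sketch of the method-of-moments computation on simulated normal data (the simulated sample is only an illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # sample with mu = 2, sigma = 3

m1 = x.mean()                       # first sample moment
m2 = (x ** 2).mean()                # second sample moment

mu_hat = m1
sigma_hat = np.sqrt(m2 - m1 ** 2)   # equals sqrt(mean((x - x.mean())**2))
print(mu_hat, sigma_hat)            # close to 2 and 3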
Consider the following estimation problem. Suppose that a box contains a number of black and a number of white balls, and suppose that it is known that the ratio of the numbers is 3:1 but it is not known whether the black or the white balls are more numerous, i.e. the probability of drawing a black ball is either 1/4 or 3/4. If n balls are drawn with replacement from the box, the distribution of X, the number of black balls, is given by the binomial distribution
$f(x; p) = \binom{n}{x} p^x (1-p)^{n-x} \quad \text{for } x = 0, 1, 2, \ldots, n$
where p is the probability of drawing a black ball. Here p = 1/4 or p = 3/4. We shall draw a sample of three balls, i.e. n = 3, with replacement and attempt to estimate the unknown parameter p of the distribution. The estimation is simple in this case, as we have to choose only between the two numbers 1/4 = 0.25 and 3/4 = 0.75. The possible outcomes and their probabilities are given below:
outcome x:       0       1       2       3
f(x; 0.75):    1/64    9/64   27/64   27/64
f(x; 0.25):   27/64   27/64    9/64    1/64
The estimator thus selects for every possible x the value of p, say $\hat{p}$, which maximizes the probability of the observed outcome. Suppose now that p may be any value in [0, 1] and that we observe x = 2, so that $f(2; p) = 3p^2(1 - p)$, and choose as our estimate that value of p which maximizes f(2; p). The position of the maximum of the function above is found by setting equal to zero the first derivative with respect to p, i.e. $\frac{d}{dp}f(2; p) = 6p - 9p^2 = 3p(2 - 3p) = 0 \Rightarrow p = 0$ or p = 2/3. The second derivative is $\frac{d^2}{dp^2}f(2; p) = 6 - 18p$. Hence $\frac{d^2}{dp^2}f(2; 0) = 6$ and the value p = 0 represents a minimum, whereas $\frac{d^2}{dp^2}f(2; \frac{2}{3}) = -6$ and consequently $p = \frac{2}{3}$ represents the maximum. Hence $\hat{p} = \frac{2}{3}$ is our estimate, which has the property
Also L(θ) and logL(θ) have their maxima at the same value of θ, and it is sometimes
easier to find the maximum of the logarithm of the likelihood. Notice also that if
the likelihood function contains k parameters then we find the estimator from the
solution of the k first order conditions.
Example: Let a random sample of size n be drawn from the Bernoulli distribution
f (x; p) = px (1 − p)1−x
where 0 ≤ p ≤ 1. The sample values x1 , x2 , ..., xn will be a sequence of 0s and 1s, and
the likelihood function is
$L(p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n - \sum x_i}$
Letting $y = \sum x_i$ we obtain

$\log L(p) = y \log p + (n - y)\log(1 - p)$

and
$\dfrac{d \log L(p)}{dp} = \dfrac{y}{p} - \dfrac{n-y}{1-p}$
Setting this expression equal to zero we get
$\hat{p} = \dfrac{y}{n} = \dfrac{1}{n}\sum x_i = \bar{x}$
which is intuitively what the estimate for this parameter should be.
Example: Let a random sample of size n be drawn from the normal distribution with density
$f(x; \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
The likelihood function is
$L(\mu, \sigma^2) = \prod_{i=1}^{n} \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \left(\dfrac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left[-\dfrac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right]$
$\log L(\mu, \sigma^2) = -\dfrac{n}{2}\log 2\pi - \dfrac{n}{2}\log \sigma^2 - \dfrac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$

$\dfrac{\partial \log L}{\partial \mu} = \dfrac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)$

and

$\dfrac{\partial \log L}{\partial \sigma^2} = -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2$

and putting these derivatives equal to 0 and solving the resulting equations we find the estimates

$\hat{\mu} = \dfrac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$

and

$\hat{\sigma}^2 = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
which turn out to be the sample moments corresponding to μ and σ 2 .
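The closed-form estimates can be checked against a direct numerical maximization of the log-likelihood; a sketch, with simulated data used purely for illustration:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=2.0, size=500)

def neg_loglik(params):
    mu, log_sigma2 = params                 # parametrize sigma^2 on the log scale
    sigma2 = np.exp(log_sigma2)
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + ((x - mu) ** 2).sum() / (2 * sigma2)

res = minimize(neg_loglik, x0=[0.0, 0.0])
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma2_hat)        # numerical MLE
print(x.mean(), x.var())         # closed-form MLE: sample mean and 1/n variance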
One needs to define criteria so that various estimators can be compared. One of these is unbiasedness. An estimator $T = t(X_1, X_2, \ldots, X_n)$ is defined to be an unbiased estimator of τ(θ) if and only if $E[T] = \tau(\theta)$ for all θ.
In practice estimates are often given in the form of the estimate plus or minus a certain amount, e.g. the cost per volume of a book could be 83 ± 4.5 per cent, which means that the actual cost will lie somewhere between 78.5% and 87.5% with high
probability. Let us consider a particular example. Suppose that a random sample
(1.2, 3.4, .6, 5.6) of four observations is drawn from a normal population with unknown
mean μ and a known variance 9. The maximum likelihood estimate of μ is the sample
mean of the observations:
x = 2.7
We wish to determine upper and lower limits which are rather certain to
contain the true unknown parameter value between them. We know that the sample
mean, $\bar{x}$, is distributed as normal with mean μ and variance 9/n, i.e. $\bar{x} \sim N(\mu, \sigma^2/n)$. Hence we have

$Z = \dfrac{\bar{X} - \mu}{3/2} \sim N(0, 1)$
Hence Z is standard normal. Consequently we can find the probability that Z will
be between two arbitrary values. For example we have that
$P[-1.96 < Z < 1.96] = \int_{-1.96}^{1.96} \phi(z)\,dz = 0.95$
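For the four observations above (known σ² = 9), the 95 percent interval is $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$; a short sketch:

import numpy as np

x = np.array([1.2, 3.4, 0.6, 5.6])
sigma = 3.0                       # known population standard deviation
z = 1.96                          # 95% two-sided critical value

xbar = x.mean()                   # 2.7
half_width = z * sigma / np.sqrt(len(x))
print(xbar - half_width, xbar + half_width)   # approximately (-0.24, 5.64)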
Let X1 , X2 , ...., Xn be a random sample from the density f (•; θ). Let T1 =
t1 (X1 , X2 , ...., Xn ) be a statistic for which Pθ [T1 < τ (θ)] = γ. Then T1 is called a one-
sided lower confidence interval for τ (θ). Similarly, let T2 = t2 (X1 , X2 , ...., Xn )
be a statistic for which Pθ [τ (θ) < T2 ] = γ. Then T2 is called a one-sided upper
confidence interval for τ (θ).
Example: Let $X_1, X_2, \ldots, X_n$ be a random sample from the density $f(x; \theta) = \phi_{\theta,9}(x)$. Set $T_1 = t_1(X_1, X_2, \ldots, X_n) = \bar{X} - 6/\sqrt{n}$ and $T_2 = t_2(X_1, X_2, \ldots, X_n) = \bar{X} + 6/\sqrt{n}$. Then $(T_1, T_2)$ constitutes a random interval and is a confidence interval for τ(θ) = θ, with confidence coefficient $\gamma = P[\bar{X} - 6/\sqrt{n} < \theta < \bar{X} + 6/\sqrt{n}] = P\left[-2 < \frac{\bar{X} - \theta}{3/\sqrt{n}} < 2\right] = \Phi(2) - \Phi(-2) = 0.9772 - 0.0228 = 0.9544$. Hence if the random sample of 25 observations has a sample mean of, say, 17.5, then the interval $(17.5 - 6/\sqrt{25},\; 17.5 + 6/\sqrt{25})$ is also called a confidence interval for θ.
Let X1 , X2 , ..., Xn be a random sample from the normal distribution with mean μ
and variance σ 2 . If σ 2 is unknown then θ = (μ, σ 2 ), the unknown parameters and
τ (θ) = μ, the parameter we want to estimate by interval estimation. We know that
$\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$
However, the problem with this statistic is that it involves both unknown parameters. Consequently we cannot construct an interval from it alone. Hence we look for a statistic that involves only the parameter we want to estimate, i.e. μ. Notice that
$\dfrac{(\bar{X} - \mu)/\frac{\sigma}{\sqrt{n}}}{\sqrt{\sum(X_i - \bar{X})^2/(n-1)\sigma^2}} = \dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$
This statistic involves only the parameter we want to estimate. Hence we have
$\left\{q_1 < \dfrac{\bar{X} - \mu}{S/\sqrt{n}} < q_2\right\} \Leftrightarrow \left\{\bar{X} - q_2\left(S/\sqrt{n}\right) < \mu < \bar{X} - q_1\left(S/\sqrt{n}\right)\right\}$
where q1 , q2 are such that
$P\left[q_1 < \dfrac{\bar{X} - \mu}{S/\sqrt{n}} < q_2\right] = \gamma$
Hence the interval $\left(\bar{X} - q_2(S/\sqrt{n}),\; \bar{X} - q_1(S/\sqrt{n})\right)$ is the 100γ percent confidence interval for μ. It can be proved that if $q_1, q_2$ are symmetric around 0, then the length of the interval is minimized.
Alternatively, if we want to find a confidence interval for σ 2 , when μ is un-
known, then we use the statistic
$\dfrac{\sum(X_i - \bar{X})^2}{\sigma^2} = \dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$
Hence we have
$\left\{q_1 < \dfrac{(n-1)S^2}{\sigma^2} < q_2\right\} \Leftrightarrow \left\{\dfrac{(n-1)S^2}{q_2} < \sigma^2 < \dfrac{(n-1)S^2}{q_1}\right\}$
HYPOTHESIS TESTING
Example Let X1, X2, ..., Xn be a random sample from f(x; θ) = φ_{θ,25}(x). The statistical hypothesis that the mean of the normal population is less than or equal to 17 is denoted by H : θ ≤ 17. Such a hypothesis is composite, as it does not completely specify the distribution. On the other hand, the hypothesis H : θ = 17 is simple since it completely specifies the distribution.
Example Let X1, X2, ..., Xn be a random sample from f(x; θ) = φ_{θ,25}(x). Consider H : θ ≤ 17. One possible test Y is as follows: Reject H if and only if $\bar{X} > 17 + 5/\sqrt{n}$.
The size of a Type I error is defined to be the probability that a Type I error is made, and similarly the size of a Type II error is defined to be the probability that a Type II error is made.
Significance level or size of a test, denoted by α, is the supremum of the probability of rejecting H0 when H0 is correct, i.e. it is the supremum of the probability of a Type I error. To perform a test we fix the size at a prespecified value, typically 10%, 5% or 1%.
Example Let X1, X2, ..., Xn be a random sample from f(x; θ) = φ_{θ,25}(x). Consider H0 : θ ≤ 17 and the test Y: Reject H0 if and only if $\bar{X} > 17 + 5/\sqrt{n}$. Then the size of the test is

$\sup_{\theta \le 17} P\left[\bar{X} > 17 + 5/\sqrt{n}\right] = \sup_{\theta \le 17} P\left[\dfrac{\bar{X} - \theta}{5/\sqrt{n}} > \dfrac{17 + 5/\sqrt{n} - \theta}{5/\sqrt{n}}\right]$
$= \sup_{\theta \le 17}\left\{1 - P\left[\dfrac{\bar{X} - \theta}{5/\sqrt{n}} \le \dfrac{17 + 5/\sqrt{n} - \theta}{5/\sqrt{n}}\right]\right\} = \sup_{\theta \le 17}\left\{1 - P\left[Z \le \dfrac{17 + 5/\sqrt{n} - \theta}{5/\sqrt{n}}\right]\right\}$
$= \sup_{\theta \le 17}\left\{1 - \Phi\left(\dfrac{17 + 5/\sqrt{n} - \theta}{5/\sqrt{n}}\right)\right\} = 1 - \Phi(1) = 0.159$
Let us establish a test procedure via an example. Assume that n = 64, X = 9.8 and
σ 2 = 0.04. We would like to test the hypothesis that μ = 10.
1. Formulate the null hypothesis:
H0 : μ = 10
2. Formulate the alternative:
H1 : μ 6= 10
3. Select the level of significance:
α = 0.01
From tables find the critical values for Z, denoted by cZ = 2.58.
4. Establish the rejection limits:
Reject H0 if Z < −2.58 or Z > 2.58.
5. Calculate Z:

$Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{9.8 - 10}{0.2/\sqrt{64}} = -8$

6. Since Z = −8 < −2.58, H0 is rejected.
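The five steps can be coded directly; a sketch of the same test:

import numpy as np
from scipy.stats import norm

n, xbar, sigma = 64, 9.8, 0.2     # sigma^2 = 0.04
mu0, alpha = 10.0, 0.01

z = (xbar - mu0) / (sigma / np.sqrt(n))   # -8.0
z_crit = norm.ppf(1 - alpha / 2)          # about 2.576

print(z, z_crit)
print(abs(z) > z_crit)            # True: reject H0 at the 1% level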
To find the appropriate test for the mean we have to consider the following
cases:
1. Normal population and known population variance (or standard deviation).
In this case the statistic we use is:
$Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$
2. Large samples, where the population variance (or standard deviation) is unknown and is estimated by the sample standard deviation S. In this case the statistic we use is:

$Z = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim N(0, 1)$
3. Small samples from a normal population where the population variance (or
standard deviation) is unknown.
In this case the statistic we use is:
$t = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1}$
4. Tests about a population proportion π. In this case the statistic we use is:

$Z = \dfrac{p - \pi_0}{S/\sqrt{n}} \sim N(0, 1) \quad\text{where } S^2 = \pi_0(1 - \pi_0)$
Example: Mr. X believes that he will get more than 60% of the votes. However, in a sample of 400 voters 252 indicate that they will vote for X. At a significance level of 5% test Mr. X's belief.
$p = \dfrac{252}{400} = 0.63$, $S^2 = 0.6(1 - 0.6) = 0.24$. The null is $H_0 : \pi = \pi_0 = 0.6$ and the alternative is $H_1 : \pi > 0.6$. The critical value is 1.64. Now $Z = \dfrac{p - \pi_0}{S/\sqrt{n}} = \dfrac{0.63 - 0.6}{0.489/\sqrt{400}} = 1.22$. Consequently, the null is not rejected as Z < 1.64; Mr. X's belief is not supported by the data.
                   H0 is accepted             H1 is accepted
H0 is correct      Correct decision (1 − α)   Type I error (α)
H1 is correct      Type II error (β)          Correct decision (1 − β)
Asymptotic Theory
Chapter 8
MODES OF CONVERGENCE
Tn = T (X1 , ..., Xn ) ,
and we would like to know what happens to Tn as n → ∞. It turns out that the limit
is easier to work with than Tn itself. The plan is to use the limit as approximation
device. We think of a sequence T1 , T2 , ... which have distribution functions F1 , F2 , ....
Definition A sequence of random variables T1, T2, ... converges in probability to a random variable T (denoted by $T_n \xrightarrow{p} T$) if, for every ε > 0,

$\lim_{n\to\infty} P\left[|T_n - T| \ge \varepsilon\right] = 0.$

A sufficient condition is convergence in mean square, $\lim_{n\to\infty} E(T_n - T)^2 = 0$, since by Chebyshev's inequality

$P\left[|T_n - T| \ge \varepsilon\right] \le \dfrac{E|T_n - T|^2}{\varepsilon^2} \to 0.$
Similarly,

$P[T_n \ge \varepsilon] \le \dfrac{E(T_n^2)}{\varepsilon^2} \quad\text{and, for nonnegative } T_n,\quad P[T_n \ge \varepsilon] \le \dfrac{E(T_n)}{\varepsilon},$

so that if $\frac{E(T_n)}{\varepsilon} \to 0$, this is sufficient for $T_n \xrightarrow{p} 0$.
But the converse of the theorem is not necessarily true. To see this consider
the following random variable
$T_n = \begin{cases} n & \text{with probability } \frac{1}{n} \\ 0 & \text{with probability } 1 - \frac{1}{n} \end{cases}, \qquad T = 0.$

Then $P[T_n \ge \varepsilon] = \frac{1}{n}$ for any ε > 0 (and n large enough), so $P[T_n \ge \varepsilon] \to 0$. But $E(T_n^2) = n^2\,\frac{1}{n} = n \to \infty$.
A famous consequence of the theorem is the (Weak) Law of Large Numbers: if $X_1, \ldots, X_n$ are i.i.d. with mean μ and variance σ², and $T_n = \bar{X}_n$, then

$\lim_{n\to\infty} P\left[|T_n - \mu| > \varepsilon\right] = 0, \quad\text{i.e., } T_n \xrightarrow{p} \mu,$

since

$E\left[(T_n - \mu)^2\right] = \dfrac{\sigma^2}{n} \to 0,$
as we have shown. In fact, the result can be proved with only the hypothesis that
E|X| < ∞ by using a truncation argument.
Another application of the previous theorem is to the empirical distribution
function, i.e.,
$F_n(x) = \dfrac{1}{n}\sum_{i=1}^{n} \mathbf{1}(X_i \le x) \xrightarrow{p} F(x).$
The next result is very important for applying the Law of Large Numbers
beyond the simple sum of iid’s.
Theorem 21 (CONTINUOUS MAPPING THEOREM) If $T_n \xrightarrow{p} \mu$, a constant, and g(.) is a continuous function at μ, then

$g(T_n) \xrightarrow{p} g(\mu).$
By continuity, for every ε > 0 there is an η > 0 such that |Tn − μ| < η implies |g(Tn) − g(μ)| < ε. Let $A_n = \{|T_n - \mu| < \eta\}$ and $B_n = \{|g(T_n) - g(\mu)| < \varepsilon\}$. Then when $A_n$ is true so is $B_n$, i.e., $A_n \subset B_n$. Since $P(A_n) \to 1$, we must have that $P(B_n) \to 1$.
We know that:

$\dfrac{1}{n}\sum_{i=1}^{n} X_i^2 \xrightarrow{p} E(X_i^2) = \sigma^2 + \mu^2$

and

$\dfrac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{p} \mu \;\Rightarrow\; \left(\dfrac{1}{n}\sum_{i=1}^{n} X_i\right)^2 \xrightarrow{p} \mu^2$

by the continuous mapping theorem. Combining these two results we get

$s^2 \xrightarrow{p} \sigma^2.$
Finally, notice that when dealing with a vector $T_n = (T_{n1}, \ldots, T_{nk})'$, we have that

$\|T_n - T\| \xrightarrow{p} 0,$

where $\|x\| = (x'x)^{1/2}$ is the Euclidean norm, if and only if

$|T_{nj} - T_j| \xrightarrow{p} 0$

for all j = 1, ..., k. The "if" part is no surprise and follows from the continuous mapping theorem. The "only if" part follows since $|T_{nj} - T_j| \le \|T_n - T\|$ for each j.
Definition A sequence of random variables T1, T2, ... converges almost surely to a random variable T (denoted by $T_n \xrightarrow{as} T$) if

$P\left[\lim_{n\to\infty} T_n = T\right] = 1.$

The Strong Law of Large Numbers states that, for an i.i.d. sample with $E|X| < \infty$, the sample mean converges almost surely:

$\bar{X}_n(\omega) \xrightarrow{as} E(X).$
ASYMPTOTIC THEORY 2
We can now establish the convergence in distribution and the central limit theorem,
which is of great importance.
Definition A sequence of random variables T1, T2, ... converges in distribution to a random variable T (denoted by $T_n \xrightarrow{D} T$) if

$\lim_{n\to\infty} P[T_n \le x] = P[T \le x]$

at every point x at which the c.d.f. of T is continuous.
$T_n \xrightarrow{P} T \;\Rightarrow\; T_n \xrightarrow{D} T$

but not vice versa, except when the limit is nonrandom, i.e.,

$T_n \xrightarrow{D} \alpha \;\Rightarrow\; T_n \xrightarrow{P} \alpha.$
Theorem 23 A sequence of random vectors $(T_{n1}, T_{n2}, \ldots, T_{nk}) \xrightarrow{D} (T_1, T_2, \ldots, T_k)$ iff

$c'T_n \xrightarrow{D} c'T$

for every fixed vector c.
This is known as the Cramér − Wold device. The main result is the following:
Let $X_1, X_2, \ldots, X_n$ be i.i.d. with $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then

$\sqrt{n}\left(\bar{X} - \mu\right) \xrightarrow{D} N\left(0, \sigma^2\right)$

or, equivalently,

$T_n = \dfrac{\sqrt{n}\left(\bar{X} - \mu\right)}{\sigma} \xrightarrow{D} N(0, 1).$

The vector version: Let $X_1, X_2, \ldots, X_n$ be i.i.d. with $E(X_i) = \mu$ and $E\left[(X_i - \mu)(X_i - \mu)'\right] = \Sigma$ where $0 < \Sigma < \infty$. Then

$T_n = \sqrt{n}\left(\bar{X} - \mu\right) \xrightarrow{D} N(0, \Sigma).$
EXAMPLE. Let $X_i$ take the values +1 and −1, each with probability 1/2, so that $E(X_i) = 0$ and $\mathrm{Var}(X_i) = 1$. We know that $T_n \xrightarrow{D} N(0, 1)$. For small n the exact distribution of $T_n$ is discrete:

$n = 2: \quad \dfrac{X_1 + X_2}{\sqrt{2}} = \begin{cases} 2/\sqrt{2} & \text{with probability } 1/4 \\ 0 & \text{with probability } 1/2 \\ -2/\sqrt{2} & \text{with probability } 1/4 \end{cases}$

$n = 3: \quad \dfrac{X_1 + X_2 + X_3}{\sqrt{3}} = \begin{cases} 3/\sqrt{3} & \text{with probability } 1/8 \\ 1/\sqrt{3} & \text{with probability } 3/8 \\ -1/\sqrt{3} & \text{with probability } 3/8 \\ -3/\sqrt{3} & \text{with probability } 1/8 \end{cases}$
and additionally

$\left(\sum_{i=1}^{n} \sigma_i^2\right)^{-1/2} \left(\sum_{i=1}^{n} m_{3i}\right)^{1/3} \to 0,$

e.g. if

$\dfrac{1}{n}\sum_{i=1}^{n} \sigma_i^2 \to \sigma^2 \quad\text{and}\quad \dfrac{1}{n}\sum_{i=1}^{n} m_{3i} \to m_3,$

then

$\sqrt{n}\left(\bar{X} - \mu\right) \xrightarrow{D} N\left(0, \sigma^2\right), \qquad T_n = \dfrac{\bar{X} - E\left(\bar{X}\right)}{\sqrt{\mathrm{Var}\left(\bar{X}\right)}} \xrightarrow{D} N(0, 1)$

is satisfied.
The key condition for the above CLT is the Lindeberg condition, which basically ensures that no one term is so relatively large as to dominate the entire sample in the limit. The following CLT gives some more transparent conditions that are sufficient for the Lindeberg condition to hold.
Theorem 27 Let $x_i$ be independent with mean $\mu_i$ and variance $\sigma_i^2$, and let $\bar{\sigma}_n^2 = \dfrac{1}{n}\sum_{i=1}^{n} \sigma_i^2$. If $E\left(|x_i|^{2+\delta}\right)$ is uniformly bounded for some δ > 0, then the Lindeberg condition holds and the CLT applies.
The above condition, although sufficient, is not necessary. To see this assume that $y_t = 1$ with probability $\frac{1}{2}\left(1 - \frac{1}{t^2}\right)$, 0 with the same probability, and t with probability $\frac{1}{t^2}$. In this case $y_t$ tends to a Bernoulli random variable, and the CLT certainly applies. Yet the condition of the above theorem is not satisfied.

Furthermore, let us assume that $y_t = 0$ with probability $1 - \frac{1}{t^2}$ and t with probability $\frac{1}{t^2}$. Then $E(y_t) = 1/t \to 0$, and $\mathrm{Var}(y_t) = 1 - \frac{1}{t^2} \to 1$. Hence $\bar{\sigma}_n^2 \to 1$. Despite this, it is clear that $y_t$ is converging to a degenerate random variable which takes the value of 0 with probability 1 (in fact it is $x_t = y_t - E(y_t)$ that is degenerate). However, it can be verified that $E\left(|x_t|^{2+\delta}\right)$ grows with t, and consequently for any δ > 0 the condition of the above theorem must fail for n large enough.
CLTs for dependent random variables are available, too.
Combination Properties
Theorem (Slutsky) If $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{p} c$, a constant, then

$Y_n X_n \xrightarrow{D} cX$

$Y_n + X_n \xrightarrow{D} c + X$

$X_n / Y_n \xrightarrow{D} X/c \quad\text{if } c \text{ is nonzero.}$

For example, $\sqrt{n}\left(\bar{X} - \mu\right) \xrightarrow{D} N\left(0, \sigma^2\right)$ and $s_X \xrightarrow{p} \sigma$; therefore,

$\dfrac{\sqrt{n}\left(\bar{X} - \mu\right)}{s_X} \xrightarrow{D} \dfrac{N(0, \sigma^2)}{\sigma} = N(0, 1).$
Theorem: Suppose $T_n = (T_{n1}, T_{n2}, \ldots, T_{nk}) \xrightarrow{D} T = (T_1, T_2, \ldots, T_k)$ and $g : \mathbb{R}^k \to \mathbb{R}^q$ is continuous. Then

$g(T_n) \xrightarrow{D} g(T).$

EXAMPLE.

$Y_n + X_n \xrightarrow{D} X + Y, \qquad Y_n X_n \xrightarrow{D} YX,$

but notice that the assumption requires the joint convergence of $(Y_n, X_n)$.
Theorem 30 (Cramer) Assume that $X_n \xrightarrow{d} N(\mu, \Sigma)$, and $A_n$ is a conformable matrix with $\mathrm{plim}\, A_n = A$. Then $A_n X_n \xrightarrow{d} N\left(A\mu, A\Sigma A'\right)$.
Notice that if

$\dfrac{\sqrt{n}\left(\bar{X} - \mu\right)}{s_X} \xrightarrow{D} N(0, 1)$

then

$\dfrac{n\left(\bar{X} - \mu\right)^2}{s_X^2} \xrightarrow{D} \chi^2_1.$
Suppose that

$\sqrt{n}\left(\hat{\theta} - \theta_0\right) \xrightarrow{D} X$

and that g is continuously differentiable with q × p derivative matrix $\dfrac{\partial g}{\partial \theta'}(\theta_0)$. Therefore,

$\sqrt{n}\left(g\left(\hat{\theta}\right) - g(\theta_0)\right) \xrightarrow{D} \dfrac{\partial g}{\partial \theta'}(\theta_0)\cdot X,$

as required.
For example, consider $\sin \bar{X}$ when μ = 0 and $\sqrt{n}\,\bar{X} \xrightarrow{D} N(0, 1)$. Now $(\sin x)' = \cos x$ and $\cos 0 = 1$. Hence $\sqrt{n}\sin \bar{X} \xrightarrow{D} N(0, 1)$.
In fact we can state the following theorem.
$\dfrac{g(x_n) - g(\mu)}{\frac{1}{m!}\, g^{(m)}(\mu)\, \sigma_n^m} \xrightarrow{d} \left[N(0, 1)\right]^m.$

For example,

$\dfrac{\log^2(1 + X_n)}{\sigma_n^2} \xrightarrow{d} \chi^2_1.$

To see this apply the above theorem with $g(x) = \log^2(1 + x)$, m = 2 and μ = 0.
Chapter 10
In most applications k = 1/2, although it can be larger than this for models containing deterministic trend terms. There are also cases where k > 1/2 but the limiting distribution is not normal, e.g. when there are stochastic trends. Asymptotic normality is an important property to establish for an estimator, as it is often the only basis for constructing interval estimates and tests of hypotheses.
Let C denote the class of CAN estimators of $\theta_0$, and write $\hat{\theta}_n \in C$ to denote that the estimator belongs to this class.

Definition 34 $\hat{\theta}_n \in C$ is said to be best asymptotically normal for $\theta_0$ (BAN) in the class if $AVar\left(\tilde{\theta}_n\right) - AVar\left(\hat{\theta}_n\right)$ is positive semi-definite for every $\tilde{\theta}_n \in C$.
$E(u) = 0 \text{ a.s.}, \qquad E(uu') = \sigma^2 I_n \text{ a.s.},$

and

$E\left|\lambda' x_t u_t\right|^{2+\delta} \le B < \infty, \quad \delta > 0, \;\forall \text{ fixed } \lambda.$
The first additional condition, $\mathrm{plim}\, n^{-1} X'X = M_{xx}$, has two components. The weak law of large numbers must apply to the squares and the cross-products of the elements of $x_t$, and $M_{xx}$ must have full rank. The latter can fail even if the matrix X has rank k for every finite n. To see this take the fairly trivial example $x_t = 1/t$. Then $\lim_{n\to\infty}\sum_{t=1}^{n}\frac{1}{t^2} = \frac{\pi^2}{6}$. Hence $\lim_{n\to\infty} n^{-1}\sum_{t=1}^{n}\frac{1}{t^2} = 0$.
The least squares estimator can be written as

$\hat{\beta} = \left(\sum_{t=1}^{n} x_t x_t'\right)^{-1}\sum_{t=1}^{n} x_t y_t = \beta + \left(\sum_{t=1}^{n} x_t x_t'\right)^{-1}\sum_{t=1}^{n} x_t u_t$
Consider now the k × 1 vector $x_t u_t$. Since $E(u_t|x_t) = 0$ and $E(u_t^2|x_t) = \sigma^2$, the Law of Iterated Expectations gives

$\mathrm{Var}(x_t u_t) = E\left[E\left(u_t^2 x_t x_t' \mid x_t\right)\right] = \sigma^2 E\left(x_t x_t'\right) < \infty.$
Furthermore, the $u_t$'s are independent; hence the Weak Law of Large Numbers can be applied to $x_t u_t$, i.e.

$\mathrm{plim}\,\dfrac{1}{n}\sum_{t=1}^{n} x_t u_t = 0$
$\lim_{n\to\infty}\dfrac{1}{n}\sum_{t=1}^{n}\mathrm{Var}\left(\lambda' x_t u_t\right) = \sigma^2 \lambda' M_{xx}\lambda.$
Since $M_{xx}$ is positive definite, $0 < \sigma^2\lambda' M_{xx}\lambda < \infty$. Hence the denominator of the condition of Theorem 20 is bounded and bigger than 0. Furthermore, $E\left|\lambda' x_t u_t\right|^{2+\delta} \le B$ ensures the condition of the same Theorem, and consequently we have that

$\dfrac{1}{\sqrt{n}}\sum_{t=1}^{n}\lambda' x_t u_t \xrightarrow{d} N\left(0, \sigma^2\lambda' M_{xx}\lambda\right)$
and therefore, by the Cramér–Wold device,

$\dfrac{1}{\sqrt{n}} X'u \xrightarrow{d} N\left(0, \sigma^2 M_{xx}\right).$

Finally

$\sqrt{n}\left(\hat{\beta} - \beta\right) = \left(\dfrac{1}{n}X'X\right)^{-1}\dfrac{1}{\sqrt{n}}X'u \xrightarrow{d} N\left(0, \sigma^2 M_{xx}^{-1}\right).$
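A small simulation sketch of this asymptotic normality result for the stochastic-regressor model; the data-generating choices below are illustrative assumptions, not part of the notes.

import numpy as np

rng = np.random.default_rng(0)
n, reps, beta, sigma = 200, 5000, 1.0, 1.0

stats = []
for _ in range(reps):
    x = rng.normal(size=n)                    # stochastic regressor, Mxx = E[x^2] = 1
    u = rng.normal(scale=sigma, size=n)       # errors independent of x
    y = beta * x + u
    b_hat = (x @ y) / (x @ x)                 # least squares slope
    stats.append(np.sqrt(n) * (b_hat - beta))

stats = np.array(stats)
print(stats.mean(), stats.var())              # close to 0 and sigma^2 * Mxx^{-1} = 1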
Part IV
Likelihood Function
Chapter 11
Let the observations be $x = (x_1, x_2, \ldots, x_n)$, and let the Likelihood Function be denoted by $L(\theta) = d(x; \theta)$, $\theta \in \Theta \subset \mathbb{R}^k$. Then the Maximum Likelihood Estimator (MLE) is:

$\hat{\theta} = \arg\max_{\theta\in\Theta} d(x; \theta) \;\Leftrightarrow\; d\left(x; \hat{\theta}\right) \ge d(x; \theta) \;\forall\theta\in\Theta.$
If $\hat{\theta}$ is unique, then the model is identified. Let $\ell(\theta) = \ln d(x; \theta)$; then for local identification we have the following Lemma.

Lemma 35 The model is Locally Identified iff the Hessian is negative definite with probability 1, i.e.

$\Pr\left[H\left(\hat{\theta}\right) = \dfrac{\partial^2 \ell\left(\hat{\theta}\right)}{\partial\theta\partial\theta'} < 0\right] = 1.$
Assume that the log-Likelihood Function can be written in the following form:

$\ell(\theta) = \sum_{i=1}^{n} \ln d(x_i; \theta).$
The vector

$s(x; \theta) = \dfrac{\partial \ell(x; \theta)}{\partial\theta},$

which is k × 1, is called the score vector. We can state the following Lemma.

Lemma 36

$E(s) = 0, \qquad E\left(ss'\right) = -E\left[\dfrac{\partial^2 \ell}{\partial\theta\partial\theta'}\right].$
Proof: Since

$\int_C L(x, \theta)\,dx = 1,$

differentiating both sides with respect to θ gives*

$\dfrac{\partial}{\partial\theta}\int_C L(x, \theta)\,dx = 0 \Rightarrow \int_C \dfrac{\partial L(x, \theta)}{\partial\theta}\,dx = 0 \Rightarrow \int_C \dfrac{\partial L}{\partial\theta}\dfrac{1}{L}L\,dx = 0 \Rightarrow \int_C \dfrac{\partial \ln L}{\partial\theta}L\,dx = 0$
$\Rightarrow \int_C \dfrac{\partial \ell}{\partial\theta}L\,dx = 0 \Rightarrow \int_C sL\,dx = 0 \Rightarrow E(s) = 0.$
* In case C depends on θ, we would have to apply the Second Fundamental Theorem of Analysis to find the derivative of the integral.
Hence we also have that $E(s') = 0$, and taking derivatives with respect to θ we have

$\dfrac{\partial}{\partial\theta}\int_C s'L\,dx = 0 \Rightarrow \int_C \dfrac{\partial}{\partial\theta}\left(\dfrac{\partial\ell}{\partial\theta'}L\right)dx = 0 \Rightarrow \int_C\left[\dfrac{\partial^2\ell}{\partial\theta\partial\theta'}L + \dfrac{\partial\ell}{\partial\theta'}\dfrac{\partial L}{\partial\theta}\right]dx = 0$

$\Rightarrow \int_C\left[\dfrac{\partial^2\ell}{\partial\theta\partial\theta'}L + \dfrac{\partial\ell}{\partial\theta'}\left(\dfrac{\partial L}{\partial\theta}\dfrac{1}{L}\right)L\right]dx = 0 \Rightarrow \int_C\left[\dfrac{\partial^2\ell}{\partial\theta\partial\theta'} + \dfrac{\partial\ell}{\partial\theta}\dfrac{\partial\ell}{\partial\theta'}\right]L\,dx = 0$

$\Rightarrow \int_C\dfrac{\partial\ell}{\partial\theta}\dfrac{\partial\ell}{\partial\theta'}L\,dx = -\int_C\dfrac{\partial^2\ell}{\partial\theta\partial\theta'}L\,dx \Rightarrow \int_C ss'L\,dx = -\int_C\dfrac{\partial^2\ell}{\partial\theta\partial\theta'}L\,dx \Rightarrow E\left(ss'\right) = -E\left(\dfrac{\partial^2\ell}{\partial\theta\partial\theta'}\right)$
The matrix
$$J(\theta) = E\left(ss'\right) = -E\left[\frac{\partial^2\ell}{\partial\theta\,\partial\theta'}\right]$$
is called the (Fisher) Information Matrix and is a measure of the information that the sample $x$ contains about the parameters in $\theta$. In case that $\ell(x, \theta)$ can be written as $\ell(x, \theta) = \sum_{i=1}^{n}\ell(x_i, \theta)$ we have that
$$J(\theta) = -E\left(\frac{\partial^2}{\partial\theta\,\partial\theta'}\sum_{i=1}^{n}\ell(x_i, \theta)\right) = -\sum_{i=1}^{n}E\left(\frac{\partial^2\ell(x_i, \theta)}{\partial\theta\,\partial\theta'}\right) = -nE\left(\frac{\partial^2\ell(x_i, \theta)}{\partial\theta\,\partial\theta'}\right),$$
i.e.
$$J(\theta) = O(n).$$
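The information-matrix equality $E(ss') = -E(H)$ and the order $J(\theta) = O(n)$ can be checked numerically. The following sketch uses an assumed Bernoulli$(p)$ model (my own illustration, not from the notes), for which the per-observation information is $1/(p(1-p))$.

```python
# Numerical check of E(s s') = -E(H) for a Bernoulli(p) sample, and of
# J(theta) = O(n): per observation the information is 1/(p(1-p)),
# so the total information is J = n/(p(1-p)).
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 0.3, 200, 20000

ss, hh = [], []
for _ in range(reps):
    x = rng.binomial(1, p, size=n)
    score = np.sum(x / p - (1 - x) / (1 - p))            # d logL / dp
    hess = np.sum(-x / p**2 - (1 - x) / (1 - p)**2)      # d^2 logL / dp^2
    ss.append(score**2)
    hh.append(hess)

print("E[s^2]     ~", np.mean(ss))
print("-E[H]      ~", -np.mean(hh))
print("n/(p(1-p)) =", n / (p * (1 - p)))
```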
Now we can state the following Lemma that will be needed in the sequel.
Lemma 37 Under assumptions A1. and A2. we have that for any unbiased estimator of $\theta$, say $\tilde{\theta}$,
$$E\left(\tilde{\theta}s'\right) = I_k.$$
Proof: As $\tilde{\theta}$ is an unbiased estimator of $\theta$, we have that
$$E\left(\tilde{\theta}\right) = \theta \;\Rightarrow\; \int_C\tilde{\theta}\,L(x; \theta)\,dx = \theta.$$
Differentiating both sides with respect to $\theta'$, and noting that $\tilde{\theta}$ does not depend on $\theta$, gives
$$\int_C\tilde{\theta}\,\frac{\partial L(x;\theta)}{\partial\theta'}\,dx = I_k \;\Rightarrow\; \int_C\tilde{\theta}\,s'\,L\,dx = I_k \;\Rightarrow\; E\left(\tilde{\theta}s'\right) = I_k.$$
Theorem 38 Under the regularity assumptions A1. and A2., for any unbiased estimator $\tilde{\theta}$ we have that
$$J^{-1}(\theta) \le V\left(\tilde{\theta}\right).$$
Proof: Let us define the following $2k \times 1$ vector $\xi' = \left(\delta', s'\right) = \left(\tilde{\theta}' - \theta', s'\right)$. Now we have that
$$V(\xi) = E\left(\xi\xi'\right) = \begin{bmatrix} E\left(\delta\delta'\right) & E\left(\delta s'\right) \\ E\left(s\delta'\right) & E\left(ss'\right) \end{bmatrix} = \begin{bmatrix} V\left(\tilde{\theta}\right) & I_k \\ I_k & J(\theta) \end{bmatrix}.$$
For the above result we took into consideration that $\tilde{\theta}$ is unbiased, i.e. $E(\tilde{\theta}) = \theta$, the above Lemma, i.e. $E(\tilde{\theta}s') = I_k$, $E(s) = 0$, and $E(ss') = J(\theta)$. It is known
that all variance-covariance matrices are positive semi-definite. Hence $V(\xi) \ge 0$. Let us define the following matrix
$$B = \begin{bmatrix} I_k & -J^{-1}(\theta) \end{bmatrix}.$$
Then
$$B\,V(\xi)\,B' = V\left(\tilde{\theta}\right) - J^{-1}(\theta) \ge 0,$$
which gives the result.
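As a quick illustration of the bound (with an assumed normal model, not from the notes), the sample mean of a $N(\mu, \sigma^2)$ sample with known $\sigma^2$ is unbiased and attains $J^{-1}(\mu) = \sigma^2/n$:

```python
# Cramer-Rao illustration: for x_i ~ N(mu, sigma^2) with sigma^2 known,
# J(mu) = n / sigma^2, and the unbiased estimator x_bar attains the bound.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 1.0, 2.0, 50, 20000

xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
print("Var(x_bar) ~", xbars.var())
print("CRLB       =", sigma**2 / n)
```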
$$z(\theta) \le z(\theta_0),$$
where the inequality is due to Jensen's inequality. The inequality is strict when the ratio $\frac{d(x;\theta)}{d(x;\theta_0)}$ is not constant with probability 1.
$$\ell(\theta) = \sum_{i=1}^{n} z_i, \qquad \text{where } z_i = \ln d(x_i; \theta),$$
and the $z_i$ random variables are independent and have the same distribution with mean $E(z_i) = z(\theta)$. Then from the Weak Law of Large Numbers we have that
$$\operatorname{plim}\,\frac{1}{n}\sum_{i=1}^{n} z_i = \operatorname{plim}\,\frac{1}{n}\ell(\theta) = z(\theta).$$
However, the above is true under weaker assumptions, e.g. for dependent observations or for non-identically distributed random variables, etc. To avoid a lengthy exposition of the various cases we make the following assumption:
$$\operatorname{plim}\,\frac{1}{n}\ell(\theta) = z(\theta) \qquad \forall \theta \in \Theta.$$
Theorem 40 Under the above assumption and if the statistical model is identified we have that
$$\hat{\theta} \overset{p}{\to} \theta_0,$$
where $\hat{\theta}$ is the MLE and $\theta_0$ the true parameter values.
$$\max_{\theta\in A} z(\theta)$$
Hence
$$\hat{\theta} \notin A \;\Rightarrow\; \hat{\theta} \notin \bar{N}\cap\Theta \;\Rightarrow\; \hat{\theta} \in N\cap\Theta \;\Rightarrow\; \hat{\theta} \in N.$$
Consequently we have shown that when $T_\delta$ is true then $\hat{\theta} \in N$. This implies that
$$\Pr\left(\hat{\theta} \in N\right) \ge \Pr\left(T_\delta\right).$$
$$\ell(\theta) = \sum_{i=1}^{n}\ln d(x_i; \theta),$$
$$s(\theta) = \frac{\partial\ell(\theta)}{\partial\theta} = \sum_{i=1}^{n}\frac{\partial\ln d(x_i; \theta)}{\partial\theta},$$
$$H(\theta) = \frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta'} = \sum_{i=1}^{n}\frac{\partial^2\ln d(x_i; \theta)}{\partial\theta\,\partial\theta'}.$$
Given now that the observations $x_i$ are independent and have the same distribution, with density $d(x_i; \theta)$, the same is true for the vectors $\frac{\partial\ln d(x_i;\theta)}{\partial\theta}$ and the matrices $\frac{\partial^2\ln d(x_i;\theta)}{\partial\theta\,\partial\theta'}$. Consequently we can apply a Central Limit Theorem to get:
$$n^{-1/2}s(\theta) = n^{-1/2}\sum_{i=1}^{n}\frac{\partial\ln d(x_i; \theta)}{\partial\theta} \overset{d}{\to} N\left(0, \bar{J}(\theta)\right),$$
where
$$\bar{J}(\theta) = n^{-1}J(\theta) = -n^{-1}E\left(\frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta'}\right),$$
i.e., the average Information matrix. However, the two asymptotic results apply even if the observations are dependent or not identically distributed. To avoid a lengthy exposition of the various cases we make the following assumptions. As $n \to \infty$ we have:
Assumption A4.
$$n^{-1/2}s(\theta) \overset{d}{\to} N\left(0, \bar{J}(\theta)\right)$$
and
Assumption A5.
$$n^{-1}H(\theta) \overset{p}{\to} -\bar{J}(\theta),$$
where
$$\bar{J}(\theta) = J(\theta)/n = E\left(ss'\right)/n = -E(H)/n.$$
Theorem 41 Under assumptions A2. and A3., the above two assumptions (A4. and A5.), and identification we have:
$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) \overset{d}{\to} N\left(0, \left[\bar{J}(\theta_0)\right]^{-1}\right),$$
where $\hat{\theta}$ is the MLE and $\theta_0$ the true parameter values.
Proof: As $\hat{\theta}$ maximises the Likelihood Function we have from the first order conditions that
$$s\left(\hat{\theta}\right) = \frac{\partial\ell\left(\hat{\theta}\right)}{\partial\theta} = 0.$$
From the Mean Value Theorem, around $\theta_0$, we have that
$$s(\theta_0) + H(\theta_*)\left(\hat{\theta} - \theta_0\right) = 0, \qquad (11.3)$$
where $\theta_* \in \left[\hat{\theta}, \theta_0\right]$, i.e. $\left\|\theta_* - \theta_0\right\| \le \left\|\hat{\theta} - \theta_0\right\|$.
Under Assumptions A4. and A5. the above equation implies that
$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) = \left[\bar{J}(\theta_0)\right]^{-1} n^{-1/2}s(\theta_0) + o_p(1) \overset{d}{\to} N\left(0, \left[\bar{J}(\theta_0)\right]^{-1}\right).$$
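A short Monte Carlo sketch of Theorem 41 for an assumed exponential model with density $\theta e^{-\theta x}$ (my own illustration, not from the notes): the MLE is $1/\bar{x}$ and $\bar{J}(\theta) = 1/\theta^2$, so the asymptotic variance of $\sqrt{n}(\hat{\theta} - \theta_0)$ is $\theta_0^2$.

```python
# Monte Carlo check of the asymptotic normality of the MLE for an exponential
# sample with density theta*exp(-theta*x): mle = 1/x_bar, Jbar = 1/theta^2,
# so sqrt(n)*(mle - theta0) should have variance close to theta0^2.
import numpy as np

rng = np.random.default_rng(3)
theta0, n, reps = 2.0, 400, 5000

draws = rng.exponential(scale=1.0 / theta0, size=(reps, n))
mle = 1.0 / draws.mean(axis=1)
z = np.sqrt(n) * (mle - theta0)
print("simulated variance  :", z.var())
print("Jbar^{-1} = theta0^2:", theta0**2)
```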
Example: Let $y_1, \dots, y_T$ be a random sample from $N(\mu, \sigma^2)$. The log-likelihood function is
$$\ell(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln\left(\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{t=1}^{T}\left(y_t - \mu\right)^2,$$
where $\theta = \begin{pmatrix}\mu \\ \sigma^2\end{pmatrix}$. Now
$$\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\sum_{t=1}^{T}\left(y_t - \mu\right)$$
and
$$\frac{\partial\ell}{\partial\sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{t=1}^{T}\left(y_t - \mu\right)^2.$$
Now
$$\frac{\partial\ell\left(\hat{\theta}\right)}{\partial\mu} = \frac{\partial\ell\left(\hat{\theta}\right)}{\partial\sigma^2} = 0.$$
Hence
$$\hat{\mu} = \frac{1}{T}\sum_{t=1}^{T} y_t \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T}\left(y_t - \hat{\mu}\right)^2.$$
Now
$$H(\theta) = \frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta'} = \begin{bmatrix} -\frac{T}{\sigma^2} & -\frac{1}{\sigma^4}\sum_{t=1}^{T}\left(y_t - \mu\right) \\ -\frac{1}{\sigma^4}\sum_{t=1}^{T}\left(y_t - \mu\right) & \frac{T}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{t=1}^{T}\left(y_t - \mu\right)^2 \end{bmatrix},$$
and consequently, evaluating $H(\theta)$ at $\hat{\theta}$, we have
$$H\left(\hat{\theta}\right) = \begin{bmatrix} -\frac{T}{\hat{\sigma}^2} & 0 \\ 0 & -\frac{T}{2\hat{\sigma}^4} \end{bmatrix}.$$
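In practice $\hat{\mu}$, $\hat{\sigma}^2$ and their standard errors can be read directly off these formulas. A minimal numeric sketch with simulated data (the values below are assumed for the illustration) follows.

```python
# Normal example: compute mu_hat, sigma2_hat and read standard errors off the
# inverse of -H(theta_hat), which is diag(sigma2_hat/T, 2*sigma2_hat^2/T).
import numpy as np

rng = np.random.default_rng(4)
T = 200
y = rng.normal(1.5, 2.0, size=T)

mu_hat = y.mean()
sig2_hat = ((y - mu_hat) ** 2).mean()

H = np.array([[-T / sig2_hat, 0.0],
              [0.0, -T / (2 * sig2_hat**2)]])     # Hessian evaluated at the MLE
avar = np.linalg.inv(-H)                          # estimated variance of (mu_hat, sig2_hat)
print("mu_hat, sigma2_hat:", mu_hat, sig2_hat)
print("standard errors   :", np.sqrt(np.diag(avar)))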
Chapter 12
RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION
Suppose that the parameter vector satisfies $r$ constraints
$$\varphi(\theta) = 0,$$
and let
$$F(\theta) = \frac{\partial\varphi(\theta)}{\partial\theta'}$$
denote the $r \times k$ matrix of derivatives, which has rank $r$, i.e. there are no redundant constraints.
Under these conditions the (unrestricted) MLE is still consistent but not asymptotically efficient, as it ignores the information in the constraints. To get an asymptotically efficient estimator we have to take into consideration the information of the constraints.
Hence we form the Lagrangian
$$\mathcal{L}(\theta, \lambda) = \ell(\theta) + \lambda'\varphi(\theta),$$
where $\lambda$ is the $r \times 1$ vector of Lagrange Multipliers. The first order conditions are:
$$\frac{\partial\mathcal{L}}{\partial\theta} = \frac{\partial\ell}{\partial\theta} + F'(\theta)\lambda = s(\theta) + F'(\theta)\lambda = 0$$
$$\frac{\partial\mathcal{L}}{\partial\lambda} = \frac{\partial\lambda'}{\partial\lambda}\varphi(\theta) = I_r\,\varphi(\theta) = \varphi(\theta) = 0.$$
Let $\tilde{\theta}$ and $\tilde{\lambda}$ be the constrained ML estimators, i.e. the solution of the above first order conditions:
$$s\left(\tilde{\theta}\right) + F'\left(\tilde{\theta}\right)\tilde{\lambda} = 0 \qquad (12.1)$$
$$\varphi\left(\tilde{\theta}\right) = 0. \qquad (12.2)$$
Applying the Mean Value Theorem around $\theta_0$ to $s\left(\tilde{\theta}\right)$ and $\varphi\left(\tilde{\theta}\right)$ we get
$$s\left(\tilde{\theta}\right) = s(\theta_0) + \left[\frac{\partial s(\theta_*)}{\partial\theta'}\right]\left(\tilde{\theta} - \theta_0\right) = s(\theta_0) + H(\theta_*)\left(\tilde{\theta} - \theta_0\right) \qquad (12.3)$$
$$\varphi\left(\tilde{\theta}\right) = \varphi(\theta_0) + \left[\frac{\partial\varphi(\theta_{**})}{\partial\theta'}\right]\left(\tilde{\theta} - \theta_0\right) = \varphi(\theta_0) + F(\theta_{**})\left(\tilde{\theta} - \theta_0\right) \qquad (12.4)$$
where $\theta_*$ and $\theta_{**}$ are defined by
$$\left\|\theta_* - \theta_0\right\| \le \left\|\tilde{\theta} - \theta_0\right\|, \qquad \left\|\theta_{**} - \theta_0\right\| \le \left\|\tilde{\theta} - \theta_0\right\|.$$
Since $\varphi\left(\tilde{\theta}\right) = 0$ and, under the null, $\varphi(\theta_0) = 0$, (12.4) implies
$$F(\theta_{**})\left(\tilde{\theta} - \theta_0\right) = 0.$$
Hence we get
$$\frac{1}{\sqrt{n}}s(\theta_0) + \frac{1}{n}H(\theta_*)\sqrt{n}\left(\tilde{\theta} - \theta_0\right) + \frac{1}{\sqrt{n}}F'\left(\tilde{\theta}\right)\tilde{\lambda} = 0 \qquad (12.5)$$
and
$$\sqrt{n}\,F(\theta_{**})\left(\tilde{\theta} - \theta_0\right) = 0. \qquad (12.6)$$
Now $\tilde{\theta}$ is consistent and so are $\theta_*$ and $\theta_{**}$, i.e.
$$\tilde{\theta} = \theta_0 + o_p(1), \qquad \theta_* = \theta_0 + o_p(1), \qquad \theta_{**} = \theta_0 + o_p(1),$$
and
$$\frac{1}{n}H(\theta_*) = -\bar{J}(\theta_0) + o_p(1),$$
where $\bar{J}(\theta_0)$ is the average information matrix.
Hence, equations (12.5) and (12.6) become:
$$\frac{s(\theta_0)}{\sqrt{n}} - \bar{J}(\theta_0)\sqrt{n}\left(\tilde{\theta} - \theta_0\right) + F'(\theta_0)\frac{\tilde{\lambda}}{\sqrt{n}} = o_p(1) \qquad (12.7)$$
and
$$F(\theta_0)\sqrt{n}\left(\tilde{\theta} - \theta_0\right) = o_p(1). \qquad (12.8)$$
Theorem 42 Under assumptions A4. and A5., model identification, and provided that the true parameter values satisfy the constraints, we have
$$\frac{\tilde{\lambda}}{\sqrt{n}} \overset{d}{\to} N\left(0, \left[P(\theta_0)\right]^{-1}\right)$$
and
$$\sqrt{n}\left(\tilde{\theta} - \theta_0\right) \overset{d}{\to} N\left(0, \left[\bar{J}(\theta_0)\right]^{-1} - A\right),$$
where $P(\theta_0) = F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)$ and
$$A = \left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1} \ge 0.$$
where
$$\Omega_2 = \left\{\left[I_k - \left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\right]\left[\bar{J}(\theta_0)\right]^{-1/2}\right\}\left\{\left[I_k - \left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\right]\left[\bar{J}(\theta_0)\right]^{-1/2}\right\}'$$
$$= \left[\bar{J}(\theta_0)\right]^{-1} - \left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1} - \left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}$$
$$\quad + \left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}$$
$$= \left[\bar{J}(\theta_0)\right]^{-1} - \left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}.$$
Let $x = (x_1, x_2, \dots, x_n)$ be a random sample having density function $d(x, \theta)$, where $\theta \in \Theta \subset \mathbb{R}^k$. We would like to test the simple hypothesis
$$H_0: \theta = \theta_0$$
against the simple alternative
$$H_1: \theta = \theta_1 \in \Theta - \{\theta_0\}.$$
The Neyman (likelihood ratio) test rejects $H_0$ when $\lambda(x) = d_1(x)/d_0(x) \ge c_\alpha$, i.e. it has rejection region $R = \{x \in S : \lambda(x) \ge c_\alpha\}$, where $d_0(x) = d(x; \theta_0)$, $d_1(x) = d(x; \theta_1)$, and $c_\alpha > 0$ is a constant such that the Type I Error (Size) of the test is equal to $\alpha \in (0, 1/2)$, i.e.
$$P(R|\theta = \theta_0) = \int_R d_0(x)\,dx = \alpha,$$
i.e. the probability of rejecting the null when it is correct. Notice that the Power of the above test, $\pi_R(\alpha)$, is
$$\pi_R(\alpha) = P(R|\theta = \theta_1) = \int_R d_1(x)\,dx,$$
i.e. the probability of rejecting the null when the alternative is correct.
Let $A$ be the rejection region of any other test of $H_0$ with size $\alpha$. Then the Power of this test is less than or equal to the Power of the Neyman Test, i.e.
$$P(A|\theta = \theta_1) \le P(R|\theta = \theta_1).$$
Definition 45 Let $A$ be the rejection region of a test. Then if $\pi_A(\alpha, \theta) \ge \alpha$ for all $\theta \in \Theta$ the test is unbiased.
The comparison of unbiased tests is very difficult, as one test may be more powerful in one region of the parameter space and another test in a different region.
Definition 46 If for any test, say $B$, we have that $\pi_A(\alpha, \theta) \ge \pi_B(\alpha, \theta)$ for all $\theta \in \Theta$, the test $A$ is called the uniformly most powerful test.
Theorem 47 Let $\pi_A(\alpha, \theta)$ be the power function of a test of size $\alpha$ of the null hypothesis $H_0: \theta = \theta_0$ versus the alternative $H_1: \theta \in \Theta - \{\theta_0\}$. If $e_\alpha(\theta)$ is the Envelope Power Function we have that
$$\pi_A(\alpha, \theta) \le e_\alpha(\theta) \qquad \forall \theta \in \Theta.$$
For hypothesis testing the Envelope Power Function is the analogue of the Cramer-Rao bound in estimation. It is obvious that if the power function of a test is identical to the Envelope Power Function then this test is uniformly most powerful in the class of unbiased tests. Hence we have the following Corollary.
Corollary 48 If the rejection region $R$ of the Neyman test does not depend on $\theta_1$, then the test is uniformly most powerful.
The proof is based on the fact that if the rejection region $R$ does not depend on $\theta_1$, the power of the test is identical to the Envelope Power Function.
Example 49 Let $x = (x_1, x_2, \dots, x_n)$ be a random sample from a $N(\mu, \sigma^2)$ with unknown $\mu$ and known $\sigma^2$. We want to test the hypothesis $H_0: \mu = \mu_0$ versus the alternative $H_1: \mu = \mu_1 > \mu_0$. The condition $\lambda(x) \ge c_\alpha$ is equivalent to
$$\sum_{i=1}^{n} x_i \ge \frac{\sigma^2}{\mu_1 - \mu_0}\ln c_\alpha + \frac{n(\mu_1 + \mu_0)}{2} \;\iff\; \bar{x} \ge c_\alpha' = \frac{\sigma^2\ln c_\alpha}{n(\mu_1 - \mu_0)} + \frac{\mu_1 + \mu_0}{2},$$
where the constants $c_\alpha$ and $c_\alpha'$ are determined by the size of the test, which should be $\alpha$, i.e.
$$P\left(\bar{x} \ge c_\alpha' \mid \mu = \mu_0\right) = \alpha.$$
Under $H_0$, we have that $\bar{x} \sim N\left(\mu_0, \frac{\sigma^2}{n}\right)$ and $\sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma} \sim N(0, 1)$. The above probability is equivalent to
$$P\left(\sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma} \ge \sqrt{n}\,\frac{c_\alpha' - \mu_0}{\sigma}\;\Big|\;\mu = \mu_0\right) = \alpha.$$
Let $z_\alpha$ be the critical value that leaves probability $\alpha$ in the upper tail of the $N(0, 1)$ distribution. Hence
$$P\left(\sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma} \ge z_\alpha\;\Big|\;\mu = \mu_0\right) = \alpha.$$
Hence we have shown that
$$\lambda(x) \ge c_\alpha \;\iff\; \sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma} \ge z_\alpha.$$
The rejection region of the Neyman ratio test is
$$R = \{x \in S : \lambda(x) \ge c_\alpha\} = \left\{x \in S : \sqrt{n}\,\frac{\bar{x} - \mu_0}{\sigma} \ge z_\alpha\right\},$$
which is independent of $\mu_1$. Consequently, the test is Uniformly Most Powerful.
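A small sketch (with assumed numbers, not from the notes) of this UMP test: the rejection rule uses $z_\alpha$, and the power at any $\mu_1 > \mu_0$ follows from the normal distribution of $\bar{x}$.

```python
# One-sided z test of H0: mu = mu0 with known sigma, and its power function
# power(mu1) = 1 - Phi(z_alpha - sqrt(n)*(mu1 - mu0)/sigma).
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
z_alpha = norm.ppf(1 - alpha)          # critical value leaving alpha in the upper tail

def power(mu1):
    return 1 - norm.cdf(z_alpha - np.sqrt(n) * (mu1 - mu0) / sigma)

rng = np.random.default_rng(5)
x = rng.normal(0.3, sigma, size=n)     # one sample drawn away from the null
test_stat = np.sqrt(n) * (x.mean() - mu0) / sigma
print("reject H0:", test_stat >= z_alpha)
print("size:", power(mu0), " power at mu1 = 0.5:", power(0.5))
```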
The Neyman ratio test can be generalised to the testing of a composite null versus a composite alternative. Let $x = (x_1, x_2, \dots, x_n)$ be a sample with Likelihood Function
$$d(x; \theta), \qquad \theta \in \Theta \subset \mathbb{R}^k,$$
and suppose we want to test $H_0: \theta \in \Omega$ against $H_1: \theta \in \Theta - \Omega$, where $\Omega \subset \Theta$. The generalised ratio is
$$\lambda(x) = \frac{\sup_{\theta\in\Theta} d(x; \theta)}{\sup_{\theta\in\Omega} d(x; \theta)},$$
with rejection region
$$R = \{x \in S : \lambda(x) \ge c_\alpha\},$$
where $c_\alpha > 0$ is a constant such that the probability of Type I Error is $\alpha$, i.e.
$$\sup_{\theta\in\Omega} P(R|\theta) = \sup_{\theta\in\Omega}\int_R d(x; \theta)\,dx = \alpha.$$
Example 50 Let $x = (x_1, x_2, \dots, x_n)$ be a random sample from a $N(\mu, \sigma^2)$ with unknown $\mu$ and $\sigma^2$. We want to test the hypothesis $H_0: \mu = \mu_0$ versus the alternative $H_1: \mu \ne \mu_0$. The vector of parameters is
$$\theta' = \left(\mu, \sigma^2\right) \in \Theta = \mathbb{R}\times\mathbb{R}_+.$$
Under the null, $\sup_{\theta\in\Omega} d(x; \theta) = \sup_{\sigma^2} d(x; \mu_0, \sigma^2)$. The maximum of $d(x; \mu_0, \sigma^2)$ is at
$$\sigma^2 = s_0^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu_0\right)^2.$$
Hence
$$\sup_{\theta\in\Omega} d(x; \theta) = \left(2\pi s_0^2\right)^{-n/2}\exp\left[-\frac{n}{2}\right].$$
With the same reasoning we find that
$$\sup_{\theta\in\Theta} d(x; \theta) = \left(2\pi s^2\right)^{-n/2}\exp\left[-\frac{n}{2}\right], \qquad s^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2.$$
Hence
$$\lambda(x) = \left(\frac{s_0^2}{s^2}\right)^{n/2}.$$
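A minimal numeric sketch (assumed data, my own illustration) of Example 50: compute $s_0^2$, $s^2$ and the generalised ratio $\lambda(x) = (s_0^2/s^2)^{n/2}$.

```python
# Generalised likelihood ratio for H0: mu = mu0 with unknown sigma^2.
import numpy as np

rng = np.random.default_rng(6)
mu0, n = 0.0, 40
x = rng.normal(0.4, 1.0, size=n)     # data generated away from the null

s2_0 = ((x - mu0) ** 2).mean()       # restricted (H0) variance estimate
s2 = ((x - x.mean()) ** 2).mean()    # unrestricted variance estimate
lam = (s2_0 / s2) ** (n / 2)
print("lambda(x) =", lam)            # large values are evidence against H0
```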
χ-SQUARE TESTS
Theorem 51 For a Consistent Asymptotically Normal (CAN) estimator $\hat{\theta}$ with asymptotic variance $V$, i.e. $\sqrt{n}\left(\hat{\theta} - \theta\right) \overset{d}{\to} N(0, V)$, assuming that $\operatorname{rank}[F(\theta)] = r$ and under $H_0: \varphi(\theta) = 0$, we have that
$$n\,\varphi\left(\hat{\theta}\right)'\left(FVF'\right)^{-1}\varphi\left(\hat{\theta}\right) \overset{d}{\to} \chi^2_r.$$
Proof. From the Delta Method we have that if $\sqrt{n}\left(\hat{\theta} - \theta\right) \overset{d}{\to} X$ then $\sqrt{n}\left(\varphi\left(\hat{\theta}\right) - \varphi(\theta)\right) \overset{d}{\to} F(\theta)\cdot X$. But under $H_0$, $\varphi(\theta) = 0$. Hence $\sqrt{n}\,\varphi\left(\hat{\theta}\right) \overset{d}{\to} F(\theta)\cdot X \Rightarrow \sqrt{n}\,\varphi\left(\hat{\theta}\right) \overset{d}{\to} N\left(0, FVF'\right)$. It follows that
$$n\,\varphi\left(\hat{\theta}\right)'\left(FVF'\right)^{-1}\varphi\left(\hat{\theta}\right) \overset{d}{\to} \chi^2_r,$$
where $r = \operatorname{rank}\left(FVF'\right)$. In case that the matrices $F$ and $V$ are functions of the unknown parameter vector $\theta$, they can be substituted by consistent estimators, e.g. $F\left(\hat{\theta}\right)$ and $V\left(\hat{\theta}\right)$.
The $\chi^2$ test does not necessarily have the optimal properties of the likelihood ratio test; however, it has the great advantage that we do not have to know the exact distribution of the sample. What is necessary is a CAN estimator. Furthermore, it is fairly easy to construct and evaluate such a test. Consequently, it is not a surprise that $\chi^2$ tests are the most popular tests in statistics.
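The following sketch constructs such a $\chi^2$ (Wald-type) test for a CAN estimator: OLS coefficients with a nonlinear restriction $\varphi(\beta) = \beta_1\beta_2 - 1 = 0$, with $F$ obtained by differentiation as in Theorem 51. The data-generating values and the particular restriction are assumptions made for the illustration.

```python
# Chi-square (Wald-type) test based on a CAN estimator (OLS).
# With V_hat = estimated Var(beta_hat) = (asymptotic variance)/n, the statistic
# phi' [F V_hat F']^{-1} phi equals n*phi'(F V F')^{-1} phi of Theorem 51.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([2.0, 0.5])            # satisfies b1*b2 = 1
y = X @ beta_true + rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sig2_hat = resid @ resid / n
V_hat = sig2_hat * np.linalg.inv(X.T @ X)    # estimated Var(beta_hat)

phi = beta_hat[0] * beta_hat[1] - 1.0        # restriction phi(beta)
F = np.array([[beta_hat[1], beta_hat[0]]])   # dphi/dbeta'
W = phi**2 / (F @ V_hat @ F.T).item()
print("statistic:", W, " 5% critical value:", chi2.ppf(0.95, df=1))
```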
Chapter 14
THE CLASSICAL TESTS
$$\Omega = \{\theta \in \Theta : \varphi(\theta) = 0\},$$
where $\theta$ is the vector of parameters and $\varphi(\theta) = 0$ are the restrictions. Consequently the Neyman ratio test is given by:
$$\lambda(x) = \frac{\sup_{\theta\in\Theta} L(\theta)}{\sup_{\theta\in\Omega} L(\theta)} = \frac{L\left(\hat{\theta}\right)}{L\left(\tilde{\theta}\right)},$$
where $L(\theta)$ is the Likelihood function. As the $\ln(\cdot)$ is a monotonic, strictly increasing function, an equivalent test can be based on
$$LR = 2\ln(\lambda(x)) = 2\left[\ell\left(\hat{\theta}\right) - \ell\left(\tilde{\theta}\right)\right],$$
where $LR$ is the well known Likelihood Ratio test and $\ell(\theta)$ is the log-likelihood function.
Using a Taylor expansion of $\ell\left(\tilde{\theta}\right)$ around $\hat{\theta}$ and employing the Mean Value Theorem we get:
$$\ell\left(\tilde{\theta}\right) = \ell\left(\hat{\theta}\right) + \frac{\partial\ell\left(\hat{\theta}\right)}{\partial\theta'}\left(\tilde{\theta} - \hat{\theta}\right) + \frac{1}{2}\left(\tilde{\theta} - \hat{\theta}\right)'\frac{\partial^2\ell(\theta_*)}{\partial\theta\,\partial\theta'}\left(\tilde{\theta} - \hat{\theta}\right),$$
where $\left\|\theta_* - \hat{\theta}\right\| \le \left\|\tilde{\theta} - \hat{\theta}\right\|$. Now, $\frac{\partial\ell\left(\hat{\theta}\right)}{\partial\theta'} = s\left(\hat{\theta}\right)' = 0$ due to the fact that the first order conditions are satisfied by the ML estimator $\hat{\theta}$. Consequently, the LR test is
given by:
$$LR = 2\left[\ell\left(\hat{\theta}\right) - \ell\left(\tilde{\theta}\right)\right] = -\left(\tilde{\theta} - \hat{\theta}\right)'\frac{\partial^2\ell(\theta_*)}{\partial\theta\,\partial\theta'}\left(\tilde{\theta} - \hat{\theta}\right).$$
Now we know that
$$\sqrt{n}\left(\tilde{\theta} - \theta_0\right) = \left\{I_k - \left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\right\}\left[\bar{J}(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt{n}} + o_p(1)$$
and
$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) = \left[\bar{J}(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt{n}} + o_p(1).$$
Hence
$$\sqrt{n}\left(\tilde{\theta} - \hat{\theta}\right) = -\left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt{n}} + o_p(1),$$
and consequently
$$LR = \left(\left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt{n}}\right)'\left(-\frac{1}{n}\frac{\partial^2\ell(\theta_*)}{\partial\theta\,\partial\theta'}\right)\left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt{n}} + o_p(1).$$
Hence
$$LR = \left(\frac{s(\theta_0)}{\sqrt{n}}\right)'\left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\left[P(\theta_0)\right]^{-1}F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}\frac{s(\theta_0)}{\sqrt{n}} + o_p(1).$$
Theorem 52 Under the usual assumptions and under the null hypothesis we have that
$$LR = 2\left[\ell\left(\hat{\theta}\right) - \ell\left(\tilde{\theta}\right)\right] \overset{d}{\to} \chi^2_r.$$
Proof: Writing $P(\theta_0) = F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0) = Z_0'Z_0$, the previous expression becomes $LR = \xi_0'\,Z_0\left(Z_0'Z_0\right)^{-1}Z_0'\,\xi_0 + o_p(1)$, where
$$\left[\bar{J}(\theta_0)\right]^{-1/2}\frac{s(\theta_0)}{\sqrt{n}} = \xi_0 \qquad \text{and} \qquad Z_0 = \left[\bar{J}(\theta_0)\right]^{-1/2}F'(\theta_0).$$
Now
$$\left[\bar{J}(\theta_0)\right]^{-1/2}\frac{s(\theta_0)}{\sqrt{n}} \overset{d}{\to} N(0, I_k)$$
and $Z_0\left(Z_0'Z_0\right)^{-1}Z_0'$ is symmetric idempotent. Hence
$$\operatorname{rank}\left(Z_0\left(Z_0'Z_0\right)^{-1}Z_0'\right) = \operatorname{tr}\left(Z_0\left(Z_0'Z_0\right)^{-1}Z_0'\right) = \operatorname{tr}\left(\left(Z_0'Z_0\right)^{-1}Z_0'Z_0\right) = \operatorname{tr}(I_r) = r,$$
so that $LR$ is asymptotically a quadratic form of a $N(0, I_k)$ vector in a symmetric idempotent matrix of rank $r$, i.e. $LR \overset{d}{\to} \chi^2_r$.
The Wald test is based on the idea that if the restrictions are correct the vector $\varphi\left(\hat{\theta}\right)$ should be close to zero. Expanding $\varphi\left(\hat{\theta}\right)$ around $\varphi(\theta_0)$ we get:
$$\varphi\left(\hat{\theta}\right) = \varphi(\theta_0) + \frac{\partial\varphi(\theta_*)}{\partial\theta'}\left(\hat{\theta} - \theta_0\right) = F(\theta_*)\left(\hat{\theta} - \theta_0\right),$$
as under the null $\varphi(\theta_0) = 0$. Hence
$$\sqrt{n}\,\varphi\left(\hat{\theta}\right) = \sqrt{n}\,F(\theta_*)\left(\hat{\theta} - \theta_0\right),$$
and consequently
$$\sqrt{n}\,\varphi\left(\hat{\theta}\right) = \sqrt{n}\,F(\theta_0)\left(\hat{\theta} - \theta_0\right) + o_p(1).$$
Hence,
$$\sqrt{n}\,\varphi\left(\hat{\theta}\right) \overset{d}{\to} N\left(0, F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\right).$$
In the Wald statistic the asymptotic variance is evaluated at $\hat{\theta}$, with $J\left(\hat{\theta}\right)$ the estimated information matrix. In case that $J\left(\hat{\theta}\right)$ does not have an explicit formula it can be substituted by a consistent estimator, e.g. by
$$\hat{J} = -\sum_{i=1}^{n}\frac{\partial^2\ell_i\left(\hat{\theta}\right)}{\partial\theta\,\partial\theta'} \qquad \text{or} \qquad \hat{J} = \sum_{i=1}^{n}\frac{\partial\ell_i\left(\hat{\theta}\right)}{\partial\theta}\frac{\partial\ell_i\left(\hat{\theta}\right)}{\partial\theta'},$$
where $\ell_i(\theta) = \ln d(x_i; \theta)$.
Theorem 53 Under the usual regularity assumptions and the null hypothesis we have that
$$W = \varphi\left(\hat{\theta}\right)'\left[F\left(\hat{\theta}\right)\hat{J}^{-1}F'\left(\hat{\theta}\right)\right]^{-1}\varphi\left(\hat{\theta}\right) \overset{d}{\to} \chi^2_r.$$
Furthermore,
$$\sqrt{n}\,\varphi\left(\hat{\theta}\right) \overset{d}{\to} N\left(0, F(\theta_0)\left[\bar{J}(\theta_0)\right]^{-1}F'(\theta_0)\right)$$
and
$$\frac{\tilde{\lambda}}{\sqrt{n}} \overset{d}{\to} N\left(0, \left[P(\theta_0)\right]^{-1}\right).$$
Theorem 54 Under the usual regularity assumptions and the null hypothesis we have
$$LM = \tilde{\lambda}'\,\tilde{F}\,\tilde{J}^{-1}\tilde{F}'\,\tilde{\lambda} \overset{d}{\to} \chi^2_r,$$
where $\tilde{F} = F\left(\tilde{\theta}\right)$ and $\tilde{J} = J\left(\tilde{\theta}\right)$.
Proof: For any consistent estimator of $\theta_0$, such as the restricted MLE $\tilde{\theta}$, we have that
$$LM = \tilde{\lambda}'\,\tilde{F}\,\tilde{J}^{-1}\tilde{F}'\,\tilde{\lambda} = \left(\frac{\tilde{\lambda}}{\sqrt{n}}\right)'P(\theta_0)\left(\frac{\tilde{\lambda}}{\sqrt{n}}\right) + o_p(1),$$
and by the asymptotic distribution of the Lagrange Multipliers we get the result.
Now the Restricted MLE satisfies the first order conditions of the Lagrangian, i.e.
$$s\left(\tilde{\theta}\right) + F'\left(\tilde{\theta}\right)\tilde{\lambda} = 0,$$
so that $s\left(\tilde{\theta}\right) = -F'\left(\tilde{\theta}\right)\tilde{\lambda}$ and the statistic can also be written as $LM = s\left(\tilde{\theta}\right)'\tilde{J}^{-1}s\left(\tilde{\theta}\right)$. Rao suggested finding the score vector and the information matrix of the unrestricted model and evaluating them at the restricted MLE. In this form the LM statistic is called the efficient score statistic, as it measures the distance of the score vector, evaluated at the restricted MLE, from zero.
14.1 The Linear Regression
Consider the linear regression model
$$y = X\beta + u, \qquad u|X \sim N\left(0, \sigma^2 I_n\right),$$
with log-likelihood
$$\ell(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\left(\sigma^2\right) - \frac{1}{2}\frac{(y - X\beta)'(y - X\beta)}{\sigma^2}.$$
The first order conditions are
$$\frac{\partial\ell(\theta)}{\partial\beta} = \frac{X'(y - X\beta)}{\sigma^2} = 0$$
and
$$\frac{\partial\ell(\theta)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2}\frac{(y - X\beta)'(y - X\beta)}{\sigma^4} = 0.$$
Solving these gives $\hat{\beta} = (X'X)^{-1}X'y$ and $\hat{\sigma}^2 = \hat{u}'\hat{u}/n$ with $\hat{u} = y - X\hat{\beta}$. Notice that the MLE of $\beta$ is the same as the OLS estimator, something which is not true for the MLE of $\sigma^2$.
The Hessian is
$$H(\theta) = \frac{\partial^2\ell(\theta)}{\partial\theta\,\partial\theta'} = \begin{bmatrix} \frac{\partial^2\ell(\theta)}{\partial\beta\,\partial\beta'} & \frac{\partial^2\ell(\theta)}{\partial\beta\,\partial\sigma^2} \\ \frac{\partial^2\ell(\theta)}{\partial\sigma^2\,\partial\beta'} & \frac{\partial^2\ell(\theta)}{\partial(\sigma^2)^2} \end{bmatrix} = \begin{bmatrix} -\frac{1}{\sigma^2}X'X & -\frac{1}{\sigma^4}X'u \\ -\frac{1}{\sigma^4}u'X & \frac{n}{2\sigma^4} - \frac{u'u}{\sigma^6} \end{bmatrix}.$$
Notice that under normality of the errors, the OLS estimator is asymptotically efficient.
Let us now consider $r$ linear constraints on the parameter vector $\beta$, i.e.
$$\varphi(\beta) = Q\beta - q = 0, \qquad (14.1)$$
where $Q$ is the $r \times k$ matrix of the restrictions (with $r < k$) and $q$ a known $r \times 1$ vector. Let us now form the Lagrangian, i.e.
$$\mathcal{L}\left(\beta, \sigma^2, \lambda\right) = \ell(\theta) + \lambda'(Q\beta - q),$$
where $\lambda$ is the vector of the $r$ Lagrange Multipliers. The first order conditions are:
$$\frac{\partial\mathcal{L}}{\partial\beta} = \frac{\partial\ell(\theta)}{\partial\beta} + Q'\lambda = \frac{X'(y - X\beta)}{\sigma^2} + Q'\lambda = 0 \qquad (14.2)$$
$$X'y = X'X\beta - \sigma^2 Q'\lambda \qquad (14.5)$$
Hence
$$Q\left(X'X\right)^{-1}X'y = Q\beta - \sigma^2 Q\left(X'X\right)^{-1}Q'\lambda.$$
It follows that
$$Q\hat{\beta} = Q\beta - QVQ'\lambda,$$
where
$$\hat{\beta} = \left(X'X\right)^{-1}X'y \qquad \text{and} \qquad V = \sigma^2\left(X'X\right)^{-1}.$$
Using the constraint $Q\beta = q$ and substituting out $\lambda$ from (14.5), solving for $\beta$ we get:
$$\tilde{\beta} = \hat{\beta} - \left(X'X\right)^{-1}Q'\left[Q\left(X'X\right)^{-1}Q'\right]^{-1}\left(Q\hat{\beta} - q\right).$$
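A minimal sketch (assumed data and restriction, my own illustration) of the formula just derived for the restricted estimator $\tilde{\beta}$:

```python
# Restricted least squares:
# beta_tilde = beta_hat - (X'X)^{-1} Q' [Q (X'X)^{-1} Q']^{-1} (Q beta_hat - q).
import numpy as np

rng = np.random.default_rng(8)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

Q = np.array([[0.0, 1.0, 1.0]])      # restriction: beta_2 + beta_3 = 1
q = np.array([1.0])
adj = XtX_inv @ Q.T @ np.linalg.inv(Q @ XtX_inv @ Q.T) @ (Q @ beta_hat - q)
beta_tilde = beta_hat - adj

print("unrestricted:", beta_hat)
print("restricted  :", beta_tilde)
print("Q beta_tilde:", Q @ beta_tilde)   # equals q by construction
```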
$$\tilde{\sigma}^2 = \frac{\tilde{u}'\tilde{u}}{n}, \qquad \tilde{u} = y - X\tilde{\beta}, \qquad \hat{u} = y - X\hat{\beta}.$$
Hence
$$\tilde{u} = \hat{u} + X\left(\hat{\beta} - \tilde{\beta}\right),$$
and consequently, as $X'\hat{u} = 0$ by the OLS normal equations, we have that
$$\tilde{u}'\tilde{u} = \hat{u}'\hat{u} + \left(\hat{\beta} - \tilde{\beta}\right)'X'X\left(\hat{\beta} - \tilde{\beta}\right).$$
It follows that
$$\tilde{u}'\tilde{u} - \hat{u}'\hat{u} = \left(Q\hat{\beta} - q\right)'\left[Q\left(X'X\right)^{-1}Q'\right]^{-1}\left(Q\hat{\beta} - q\right).$$
where $r = \dfrac{\tilde{u}'\tilde{u}}{\hat{u}'\hat{u}} \ge 1$. Now we know that $\ln(x) \ge \dfrac{x-1}{x}$, and the result follows by considering $x = r$ and $x = 1/r$.
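The inequality above is the usual route to the ordering $W \ge LR \ge LM$ of the three classical tests in the normal regression model. The sketch below assumes the standard closed forms $W = n(r-1)$, $LR = n\ln r$ and $LM = n(1 - 1/r)$ with $r = \tilde{u}'\tilde{u}/\hat{u}'\hat{u}$ (these expressions are not derived in the excerpt shown here) and checks the ordering on simulated data.

```python
# W >= LR >= LM in the normal linear model, using the assumed closed forms
# W = n(r-1), LR = n*ln(r), LM = n(1 - 1/r), r = RSS_restricted / RSS_unrestricted.
import numpy as np

rng = np.random.default_rng(9)
n = 150
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 0.3, 0.0]) + rng.standard_normal(n)

# unrestricted and restricted (beta_3 = 0) residual sums of squares
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss_u = np.sum((y - X @ beta_hat) ** 2)
beta_r = np.linalg.lstsq(X[:, :2], y, rcond=None)[0]
rss_r = np.sum((y - X[:, :2] @ beta_r) ** 2)

r = rss_r / rss_u
W, LR, LM = n * (r - 1), n * np.log(r), n * (1 - 1 / r)
print("W >= LR >= LM :", W, LR, LM)
```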
14.2 Autocorrelation
Apply the LM test to test the hypothesis that $\rho = 0$ in the following model:
$$y_t = x_t'\beta + u_t, \qquad u_t = \rho u_{t-1} + \varepsilon_t, \qquad \varepsilon_t \overset{i.i.d.}{\sim} N\left(0, \sigma^2\right).$$
Discuss the advantages of this LM test over the Wald and LR tests of this hypothesis.
First notice that from $u_t = \rho u_{t-1} + \varepsilon_t$ we get that
$$\operatorname{Var}(u_t) = E\left(u_t^2\right) = \rho^2 E\left(u_{t-1}^2\right) + E\left(\varepsilon_t^2\right) + 2\rho E\left(u_{t-1}\varepsilon_t\right) = \rho^2 E\left(u_{t-1}^2\right) + \sigma^2,$$
as the first equality follows from the fact that $E(u_t) = 0$, and the last from the fact that
$$E\left(u_{t-1}\varepsilon_t\right) = E\left[u_{t-1}E\left(\varepsilon_t|I_{t-1}\right)\right] = E\left[u_{t-1}\cdot 0\right] = 0,$$
where $I_{t-1}$ is the information set at time $t-1$, i.e. the sigma-field generated by $\{\varepsilon_{t-1}, \varepsilon_{t-2}, \dots\}$. Hence
$$E\left(u_t^2\right) - \rho^2 E\left(u_{t-1}^2\right) = \sigma^2 \;\Rightarrow\; E\left(u_t^2\right) = \frac{\sigma^2}{1-\rho^2},$$
as $E\left(u_t^2\right) = E\left(u_{t-1}^2\right)$, independent of $t$.
Substituting out $u_t$ we get
$$y_t = x_t'\beta + \rho u_{t-1} + \varepsilon_t,$$
and observing that $u_{t-1} = y_{t-1} - x_{t-1}'\beta$ we get
$$y_t = x_t'\beta + \rho\left(y_{t-1} - x_{t-1}'\beta\right) + \varepsilon_t \;\Rightarrow\; \varepsilon_t = y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho,$$
where by assumption the $\varepsilon_t$'s are i.i.d.
Hence the log-likelihood function is
$$l(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln\left(\sigma^2\right) - \sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)^2}{2\sigma^2},$$
where we assume that $y_0 = 0$ and $x_0 = 0$, as we do not have any observations for $t = 0$. In any case, given that $|\rho| < 1$, the first observation will not affect the asymptotic distribution of the LM test, as it is based on asymptotic theory, i.e. $T \to \infty$. The first order conditions are:
$$\frac{\partial l}{\partial\beta} = \sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)\left(x_t - x_{t-1}\rho\right)}{\sigma^2},$$
$$\frac{\partial l}{\partial\rho} = \sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)\left(y_{t-1} - x_{t-1}'\beta\right)}{\sigma^2} = \sum_{t=1}^{T}\frac{\varepsilon_t u_{t-1}}{\sigma^2},$$
$$\frac{\partial l}{\partial\sigma^2} = -\frac{T}{2\sigma^2} + \sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)^2}{2\sigma^4} = -\frac{T}{2\sigma^2} + \sum_{t=1}^{T}\frac{\varepsilon_t^2}{2\sigma^4}.$$
The second derivatives are
$$\frac{\partial^2 l}{\partial\rho^2} = -\sum_{t=1}^{T}\frac{\left(y_{t-1} - x_{t-1}'\beta\right)^2}{\sigma^2} = -\sum_{t=1}^{T}\frac{u_{t-1}^2}{\sigma^2},$$
$$\frac{\partial^2 l}{\partial\left(\sigma^2\right)^2} = \frac{T}{2\sigma^4} - \sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)^2}{\sigma^6} = \frac{T}{2\sigma^4} - \sum_{t=1}^{T}\frac{\varepsilon_t^2}{\sigma^6},$$
$$\frac{\partial^2 l}{\partial\beta\,\partial\rho} = -\sum_{t=1}^{T}\frac{\left(y_{t-1} - x_{t-1}'\beta\right)\left(x_t - x_{t-1}\rho\right) + \left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)x_{t-1}}{\sigma^2} = -\sum_{t=1}^{T}\frac{u_{t-1}\left(x_t - x_{t-1}\rho\right) + \varepsilon_t x_{t-1}}{\sigma^2},$$
$$\frac{\partial^2 l}{\partial\rho\,\partial\sigma^2} = -\sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)\left(y_{t-1} - x_{t-1}'\beta\right)}{\sigma^4} = -\sum_{t=1}^{T}\frac{\varepsilon_t u_{t-1}}{\sigma^4},$$
$$\frac{\partial^2 l}{\partial\beta\,\partial\sigma^2} = -\sum_{t=1}^{T}\frac{\left(y_t - x_t'\beta - \rho y_{t-1} + x_{t-1}'\beta\rho\right)\left(x_t - x_{t-1}\rho\right)}{\sigma^4} = -\sum_{t=1}^{T}\frac{\varepsilon_t\left(x_t - x_{t-1}\rho\right)}{\sigma^4}.$$
Notice now that the Information Matrix $J$ is
$$J(\theta) = -E\left[H(\theta)\right] = \begin{bmatrix} \sum_{t=1}^{T}\frac{\left(x_t - x_{t-1}\rho\right)\left(x_t' - x_{t-1}'\rho\right)}{\sigma^2} & 0 & 0 \\ 0 & \frac{T}{1-\rho^2} & 0 \\ 0 & 0 & \frac{T}{2\sigma^4} \end{bmatrix},$$
as
$$E\left[\frac{u_{t-1}\left(x_t' - x_{t-1}'\rho\right) + \varepsilon_t x_{t-1}'}{\sigma^2}\right] = 0, \qquad E\left[\frac{\varepsilon_t\left(x_t' - x_{t-1}'\rho\right)}{\sigma^4}\right] = 0, \qquad E\left[\sum_{t=1}^{T}\frac{\varepsilon_t u_{t-1}}{\sigma^4}\right] = 0,$$
$$E\left[\frac{\varepsilon_t^2}{\sigma^6}\right] = \frac{1}{\sigma^4}, \qquad E\left[\frac{u_{t-1}^2}{\sigma^2}\right] = \frac{E\left(u_{t-1}^2\right)}{\sigma^2} = \frac{1}{1-\rho^2},$$
i.e. the matrix is block diagonal between $\beta$, $\rho$, and $\sigma^2$.
Consequently the LM test has the form
$$LM = s_\rho'\,J_{\rho\rho}^{-1}\,s_\rho = \frac{\left(s_\rho\right)^2}{J_{\rho\rho}},$$
as $s_\rho = \sum_{t=1}^{T}\frac{\varepsilon_t u_{t-1}}{\sigma^2}$ and $J_{\rho\rho} = \frac{T}{1-\rho^2}$, with all these quantities evaluated under the null. Hence under $H_0: \rho = 0$ we have that
$$J_{\rho\rho} = T, \qquad u_t = \varepsilon_t,$$
à T !−1 PT
X X
T
e/ u
u e e2t
u
e= and σe2 =
/ t=1
β xt xt xt yt , = ,
t=1 t=1
T T
³P ´2 Ã T !2 Ã T !−2
T u et−1
et u
t=1 σ f2 X X
LM = =T et u
u et−1 e2t
u .
T t=1 t=1
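A minimal sketch (assumed data, my own illustration) of this LM test for autocorrelation: estimate the static regression by OLS, form the residuals $\tilde{u}_t$, and compute $LM = T\left(\sum\tilde{u}_t\tilde{u}_{t-1}\right)^2/\left(\sum\tilde{u}_t^2\right)^2$, to be compared with a $\chi^2_1$ critical value.

```python
# LM test for rho = 0: OLS residuals under the null, then the statistic above.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(10)
T = 300
x = np.column_stack([np.ones(T), rng.standard_normal(T)])
# generate data with AR(1) errors, rho = 0.3, so the test should tend to reject
u = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(1, T):
    u[t] = 0.3 * u[t - 1] + eps[t]
y = x @ np.array([1.0, 2.0]) + u

beta_tilde = np.linalg.lstsq(x, y, rcond=None)[0]
res = y - x @ beta_tilde
LM = T * (res[1:] @ res[:-1]) ** 2 / (res @ res) ** 2
print("LM:", LM, " 5% critical value:", chi2.ppf(0.95, df=1))
```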