
Stat 544, Lecture 4

Testing the Fit of
Restricted Models
Reading: Agresti Section 14.3
Last time, we derived the limiting distribution of the ML estimate for a restricted multinomial model. The restricted model assumes that
$$X = (X_1, \ldots, X_k)^T \sim \mathrm{Mult}(n, \pi),$$
where the elements of $\pi$ are functions of unknown parameters $\theta = (\theta_1, \ldots, \theta_t)^T$ and $t \le k - 1$. We demonstrated that, under suitable regularity conditions, the ML estimate $\hat\theta$ has the property
$$\sqrt{n}\,(\hat\theta - \theta_0) \;\xrightarrow{D}\; N\!\left(0,\, (A^T A)^{-1}\right),$$
where
$$A = \mathrm{Diag}(\pi_0)^{-1/2} \left.\frac{\partial \pi}{\partial \theta}\right|_{\theta = \theta_0}.$$
An estimated covariance matrix for $\hat\theta$ is $n^{-1}(\hat{A}^T \hat{A})^{-1}$,
where $\hat{A} = \mathrm{Diag}(\hat\pi)^{-1/2} \left.\dfrac{\partial \pi}{\partial \theta}\right|_{\theta = \hat\theta}$.
Today we will discuss how to evaluate the fit of the model. If the restricted estimate $\hat\pi$ is far from the sample proportions $p$, we have evidence that the model does not fit well. We will discuss two measures of discrepancy between $\hat\pi$ and $p$, the deviance and Pearson's $X^2$, and describe goodness-of-fit tests based on these statistics. That is, we will show how to test the null hypothesis that the restricted model is true, versus the alternative that the saturated model is true.
Fitted cell probabilities. The ML estimate of $\pi$ under the saturated model is the vector of sample proportions $p = n^{-1} X$. We already know that, under the saturated model, the limiting distribution of $p$ is
$$\sqrt{n}\,(p - \pi) \;\xrightarrow{D}\; N\!\left(0,\, \mathrm{Diag}(\pi) - \pi\pi^T\right),$$
by the multivariate version of the Central Limit Theorem. An estimated covariance matrix for $p$ is
$$\hat{V}(p) = \frac{1}{n}\left[\,\mathrm{Diag}(p) - p\,p^T\,\right].$$
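As an illustration, here is a minimal Python sketch (the function name `saturated_cov` is my own, not from the notes) that computes $p$ and $\hat V(p)$ from a vector of cell counts; the counts used are those from the $2\times 2$ example later in these notes.

```python
import numpy as np

def saturated_cov(counts):
    """Sample proportions p and estimated covariance of p
    under the saturated multinomial model."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts / n                                # ML estimate of pi
    V_hat = (np.diag(p) - np.outer(p, p)) / n     # (1/n)[Diag(p) - p p^T]
    return p, V_hat

p, V_hat = saturated_cov([41, 28, 19, 12])
print(p)        # [0.41 0.28 0.19 0.12]
print(V_hat)    # singular: each row sums to zero because the p_i sum to 1
```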
Because every restricted model is a special case of the saturated model, this result holds under the null hypothesis as well. Under the null hypothesis, however, the ML estimate of $\pi$ is not $p$ but $\hat\pi = \pi(\hat\theta)$.

What can we say about $\hat\pi$? The asymptotic distribution of $\hat\pi$ is
$$\sqrt{n}\,(\hat\pi - \pi_0) \;\xrightarrow{D}\; N\!\left[\,0,\; \frac{\partial \pi}{\partial \theta}\Big|_0\,(A^T A)^{-1}\left(\frac{\partial \pi}{\partial \theta}\Big|_0\right)^{\!T}\,\right],$$
where $\partial\pi/\partial\theta\,|_0$ is shorthand for $\partial\pi/\partial\theta$ evaluated at $\theta = \theta_0$. Thus an estimated covariance matrix for $\hat\pi$ is
$$\hat{V}(\hat\pi) = \frac{1}{n}\,\frac{\partial \pi}{\partial \theta}\Big|_{\hat\theta}\,(\hat{A}^T \hat{A})^{-1}\left(\frac{\partial \pi}{\partial \theta}\Big|_{\hat\theta}\right)^{\!T}.$$
This result, which is given in Section 14.2.2, follows immediately from the limiting distribution of $\hat\theta$ by the $\delta$-method.
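As a concrete illustration, here is a hedged Python sketch of this delta-method covariance estimate, written for the $2\times 2$ independence model used in the example later in these notes; the estimates $\hat\alpha = .69$, $\hat\beta = .60$ are taken from that example, while the variable names and code layout are my own.

```python
import numpy as np

# Delta-method covariance of pi_hat:
# V(pi_hat) = (1/n) * Jhat (Ahat^T Ahat)^{-1} Jhat^T,
# where Jhat = d pi / d theta at theta_hat and Ahat = Diag(pi_hat)^{-1/2} Jhat.
n = 100
a, b = 0.69, 0.60                       # alpha_hat, beta_hat from the example
pi_hat = np.array([a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)])
Jhat = np.array([[b,         a        ],
                 [1 - b,    -a        ],
                 [-b,        1 - a    ],
                 [-(1 - b), -(1 - a)  ]])
Ahat = np.diag(pi_hat ** -0.5) @ Jhat
V_pi_hat = Jhat @ np.linalg.inv(Ahat.T @ Ahat) @ Jhat.T / n
print(np.round(np.sqrt(np.diag(V_pi_hat)), 4))   # standard errors of the pi_hat_i
```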
Agresti shows that, when $\pi$ contains fewer than $k-1$ free parameters, $V(g(\hat\pi))$ is less than or equal to $V(g(p))$ for any smooth function $g$. That is, reducing the number of free parameters in the model makes the estimate of any parameter less noisy. If the null hypothesis is not true, however, $g(\hat\pi)$ may be a biased estimate of $g(\pi)$, whereas $g(p)$ is always
asymptotically unbiased. If the null hypothesis is true, then $g(\hat\pi)$ is definitely better than $g(p)$. If the null hypothesis is not true, however, $\hat\pi$ could be badly biased and inconsistent. The unrestricted estimate $p$ is a safe estimate of $\pi$ because it is always unbiased, regardless of whether the assumed model restrictions hold. But if the assumed model does hold, $\hat\pi$ performs better than $p$. Because of this fact, large discrepancies between $\hat\pi$ and $p$ suggest that the assumed model could be false.

The distance between $\hat\pi$ and $p$ forms the basis for goodness-of-fit testing, which is described in Section 14.3. We will summarize the results of this section in the same order that Agresti presents them.
Joint asymptotic normality of $p$ and $\hat\pi$
Section 14.3.1

This result is necessary to describe the behavior of goodness-of-fit measures, which calculate the distance between $\hat\pi$ and $p$. We already knew that

- $p$ is asymptotically normal;
- $\hat\pi$ is asymptotically normal; and
- $\hat\theta$ is assumed to be a smooth function of $p$, which means that $\hat\pi = \pi(\hat\theta)$ is also a smooth function of $p$.

The last point indicates that $\hat\pi$ is a locally linear function of $p$. Therefore, $p$ and $\hat\pi$ are jointly asymptotically normal. Their joint asymptotic covariance matrix is rank-deficient. In fact, it should have rank $k-1$, because the rank of $V(p)$ is $k-1$ and $\hat\pi$ is a function of $p$.

The formula for the joint covariance matrix is a bit messy, so we won't write it here.
Asymptotic distribution of Pearson residuals
Section 14.3.2

The best-known measure of distance between $\hat\pi$ and $p$ is
$$X^2 = \sum_{i=1}^{k} \frac{(\mathrm{observed}_i - \mathrm{expected}_i)^2}{\mathrm{expected}_i},$$
where $\mathrm{observed}_i = X_i$ is the observed count in cell $i$, and $\mathrm{expected}_i = n\hat\pi_i$ is the ML estimate of $E(X_i)$ under the assumed model.
This can also be written as $X^2 = \sum_i e_i^2$, where
$$e_i = \frac{\sqrt{n}\,(p_i - \hat\pi_i)}{\sqrt{\hat\pi_i}}$$
is the Pearson residual for cell $i$.

The Pearson residual behaves somewhat like a standardized residual in linear regression. If $|e_i|$ is larger than 2 or 3, it indicates a potentially serious lack of fit in cell $i$. Examining the Pearson residuals $e = (e_1, \ldots, e_k)^T$ may help us to diagnose if and where a model fits poorly.
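Here is a small Python sketch (the function names are mine, not from the notes) that computes the Pearson residuals and $X^2$ from a vector of observed counts and fitted probabilities; the numbers are from the $2\times 2$ independence example later in the notes.

```python
import numpy as np

def pearson_residuals(counts, pi_hat):
    """Pearson residuals e_i = sqrt(n) (p_i - pihat_i) / sqrt(pihat_i)."""
    counts = np.asarray(counts, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    n = counts.sum()
    p = counts / n
    return np.sqrt(n) * (p - pi_hat) / np.sqrt(pi_hat)

def pearson_X2(counts, pi_hat):
    """X^2 as the sum of squared Pearson residuals."""
    return float(np.sum(pearson_residuals(counts, pi_hat) ** 2))

counts = [41, 28, 19, 12]
pi_hat = [0.414, 0.276, 0.186, 0.124]
print(np.round(pearson_residuals(counts, pi_hat), 3))   # all well below 2 in magnitude
print(round(pearson_X2(counts, pi_hat), 4))             # about 0.031
```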
It's easy to show that
$$\frac{\partial e}{\partial p^T} = \sqrt{n}\,\mathrm{Diag}(\hat\pi)^{-1/2}, \qquad
\frac{\partial e}{\partial \hat\pi^T} = -\frac{\sqrt{n}}{2}\left[\,\mathrm{Diag}(p) + \mathrm{Diag}(\hat\pi)\,\right]\mathrm{Diag}(\hat\pi)^{-3/2}.$$
If the assumed model is true, then $p$ and $\hat\pi$ are jointly normally distributed about $\pi_0$. Applying the $\delta$-method gives, after some algebra,
$$e \;\xrightarrow{D}\; N\!\left(\,0,\; I - \pi_0^{1/2}\bigl(\pi_0^{1/2}\bigr)^T - A(A^TA)^{-1}A^T\,\right),$$
where $\pi_0^{1/2}$ is the vector of the square roots of the elements of $\pi_0$. Thus the $e_i$'s are less variable than standard normals and are negatively intercorrelated.

Note the similarity to the distribution of the residuals $\hat\epsilon = y - X\hat\beta$ in normal linear regression,
$$\sigma^{-1}\hat\epsilon \;\sim\; N\!\left[\,0,\; I - X(X^TX)^{-1}X^T\,\right],$$
which are also less variable than standard normals and negatively intercorrelated.
Asymptotic distribution of Pearson's $X^2$
Section 14.3.3

A standard result in multivariate analysis is that
$$Y \sim N(\mu, \Sigma) \;\Longrightarrow\; (Y - \mu)^T \Sigma^{-1} (Y - \mu) \sim \chi^2_\nu,$$
where $\nu = \dim(Y)$, because $\Sigma^{-1/2}(Y - \mu)$ is a vector of standard normals. We might expect $X^2 = e^T e$ to be something like $\chi^2$ but with degrees of freedom less than $k$, because the $e_i$'s are less variable than standard normals.
Rao (1973) gives a more general result: if $Y \sim N(\mu, \Sigma)$, then
$$(Y - \mu)^T C\,(Y - \mu) \;\sim\; \chi^2_\nu$$
if and only if
$$\Sigma C \Sigma C \Sigma = \Sigma C \Sigma,$$
where $\nu = \mathrm{rank}(C\Sigma)$.

Taking $C = \Sigma^{-1}$ gives the standard result. The more general result is useful when $\Sigma$ is rank-deficient and can't be inverted.
Agresti uses Rao's result to derive the asymptotic distribution of $X^2$ as follows:

- $e$ is approaching $N(\mu, \Sigma)$ with $\mu = 0$ and
$$\Sigma = I - \pi_0^{1/2}\bigl(\pi_0^{1/2}\bigr)^T - A(A^TA)^{-1}A^T;$$
- taking $C = I$, the necessary and sufficient condition $\Sigma C \Sigma C \Sigma = \Sigma C \Sigma$ holds, because $\Sigma$ is idempotent ($\Sigma\Sigma = \Sigma$);
- therefore,
$$X^2 = \sum_{i=1}^{k} e_i^2 = e^T I\, e$$
is approaching a chi-square distribution with degrees of freedom equal to
$$\begin{aligned}
\mathrm{rank}(\Sigma) &= \mathrm{tr}\,\Sigma \qquad (\Sigma\ \text{symmetric, idempotent})\\
&= \mathrm{tr}\, I \;-\; \mathrm{tr}\, \pi_0^{1/2}\bigl(\pi_0^{1/2}\bigr)^T \;-\; \mathrm{tr}\, A(A^TA)^{-1}A^T\\
&= k \;-\; \mathrm{tr}\,\bigl(\pi_0^{1/2}\bigr)^T \pi_0^{1/2} \;-\; \mathrm{tr}\, A^TA\,(A^TA)^{-1}\\
&= k - 1 - t.
\end{aligned}$$
Thus, as $n \to \infty$, $X^2$ approaches a chi-square distribution with degrees of freedom equal to $k-1$ minus the number of free parameters in the restricted model, $t = \dim(\theta)$.
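For readers who like to see this numerically, here is a hedged Python sketch verifying that $\Sigma$ is idempotent with trace $k - 1 - t$; the parameterization is the $2\times 2$ independence model from the example later in the notes, and the particular values of $\alpha$ and $\beta$ and all variable names are my own choices.

```python
import numpy as np

# Numerical check that Sigma = I - sqrt(pi0) sqrt(pi0)^T - A (A^T A)^{-1} A^T
# is idempotent with trace k - 1 - t (= 4 - 1 - 2 = 1 here).
alpha, beta = 0.69, 0.60                       # any interior values work
pi0 = np.array([alpha * beta, alpha * (1 - beta),
                (1 - alpha) * beta, (1 - alpha) * (1 - beta)])
J = np.array([[beta,         alpha      ],     # d pi / d theta, theta = (alpha, beta)
              [1 - beta,    -alpha      ],
              [-beta,        1 - alpha  ],
              [-(1 - beta), -(1 - alpha)]])
A = np.diag(pi0 ** -0.5) @ J
root = np.sqrt(pi0).reshape(-1, 1)
Sigma = np.eye(4) - root @ root.T - A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(Sigma @ Sigma, Sigma))       # True: Sigma is idempotent
print(round(float(np.trace(Sigma)), 6))        # 1.0 = k - 1 - t
```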
Asymptotic distribution of deviance $G^2$
Section 14.3.4

A useful alternative measure of the distance from $\hat\pi$ to $p$ is the deviance $G^2$, which is the likelihood-ratio statistic for testing

$H_0$: restricted model is true
$H_A$: saturated model is true

The statistic is
$$G^2 = 2\left[\, l(p; X) - l(\hat\pi; X) \,\right]
= 2\left[\,\sum_{i=1}^{k} X_i \log p_i - \sum_{i=1}^{k} X_i \log \hat\pi_i\,\right]
= 2\sum_{i=1}^{k} X_i \log \frac{X_i}{n\hat\pi_i}.$$
Under the null hypothesis, the limiting distribution of $G^2$ is $\chi^2$ with degrees of freedom given by the number of free parameters under the alternative ($k-1$) minus the number of free parameters under the null ($t$). Thus $G^2$ has the same limiting distribution as $X^2$.
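Here is a minimal Python sketch of the deviance (the function name `deviance_G2` is mine, not from the notes); it assumes all observed counts are positive, a point we return to below when discussing zero cells.

```python
import numpy as np

def deviance_G2(counts, pi_hat):
    """G^2 = 2 * sum_i X_i * log(X_i / (n * pihat_i)), assuming all X_i > 0."""
    counts = np.asarray(counts, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    n = counts.sum()
    return float(2.0 * np.sum(counts * np.log(counts / (n * pi_hat))))

# 2x2 independence example from later in the notes:
print(round(deviance_G2([41, 28, 19, 12], [0.414, 0.276, 0.186, 0.124]), 4))  # about 0.031
```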
But in Section 14.3.4, Agresti gives a stronger result: the difference between $X^2$ and $G^2$ approaches zero in probability as $n \to \infty$ when the model is true. In large samples, the actual values of $X^2$ and $G^2$ will be close if the restricted model is true. If the model is not true, then $X^2$ and $G^2$ will grow unboundedly, and the difference between them will also grow.
Example: Testing independence in a $2 \times 2$ table

Let's use these results to test the hypothesis of row-column independence in the table
$$\begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix}
= \begin{bmatrix} 41 & 28 \\ 19 & 12 \end{bmatrix}.$$
Recall that, under independence,
$$\pi = \begin{bmatrix} \pi_{11} \\ \pi_{12} \\ \pi_{21} \\ \pi_{22} \end{bmatrix}
= \begin{bmatrix} \alpha\beta \\ \alpha(1-\beta) \\ (1-\alpha)\beta \\ (1-\alpha)(1-\beta) \end{bmatrix}$$
for some $\alpha$ and $\beta$. The ML estimates of these parameters are
$$\hat\alpha = \frac{41 + 28}{100} = .69, \qquad \hat\beta = \frac{41 + 19}{100} = .60.$$
The estimated probabilities are
$$\begin{bmatrix} \hat\pi_{11} \\ \hat\pi_{12} \\ \hat\pi_{21} \\ \hat\pi_{22} \end{bmatrix}
= \begin{bmatrix} .69 \times .60 \\ .69 \times .40 \\ .31 \times .60 \\ .31 \times .40 \end{bmatrix}
= \begin{bmatrix} .414 \\ .276 \\ .186 \\ .124 \end{bmatrix}.$$
These estimates seem very close to the sample proportions $p = (.41, .28, .19, .12)^T$, so the model appears to fit well. The goodness-of-fit statistics are:
$$X^2 = \sum_{i,j} \frac{n(p_{ij} - \hat\pi_{ij})^2}{\hat\pi_{ij}}
= 100\left[\frac{(.414 - .41)^2}{.414} + \frac{(.276 - .28)^2}{.276} + \frac{(.186 - .19)^2}{.186} + \frac{(.124 - .12)^2}{.124}\right] = 0.0312,$$
$$G^2 = 2\sum_{i,j} X_{ij} \log\frac{X_{ij}}{n\hat\pi_{ij}}
= 2\left[\,41\log\frac{41}{41.4} + 28\log\frac{28}{27.6} + 19\log\frac{19}{18.6} + 12\log\frac{12}{12.4}\,\right] = 0.0312.$$
The degrees of freedom are $4 - 1 - 2 = 1$ and the p-value is $P(\chi^2_1 \ge 0.0312) = 0.86$, so the model fits well.
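To reproduce these numbers, here is a hedged Python sketch (the function `independence_fit` and its layout are my own, not from the notes) that fits the independence model to a $2\times 2$ table and computes $X^2$, $G^2$, and the p-values.

```python
import numpy as np
from scipy.stats import chi2

def independence_fit(table):
    """Fit the independence model to a 2x2 table of counts and return
    the fitted probabilities, X^2, G^2, df, and the two p-values."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    alpha_hat = table[0, :].sum() / n              # estimated P(row 1)
    beta_hat = table[:, 0].sum() / n               # estimated P(column 1)
    pi_hat = np.outer([alpha_hat, 1 - alpha_hat], [beta_hat, 1 - beta_hat])
    expected = n * pi_hat
    X2 = float(np.sum((table - expected) ** 2 / expected))
    G2 = float(2 * np.sum(table * np.log(table / expected)))   # assumes all counts > 0
    df = 1                                         # (k - 1) - t = 3 - 2 for a 2x2 table
    return pi_hat, X2, G2, df, chi2.sf(X2, df), chi2.sf(G2, df)

pi_hat, X2, G2, df, pX, pG = independence_fit([[41, 28], [19, 12]])
print(np.round(pi_hat, 3))                         # [[0.414 0.276] [0.186 0.124]]
print(round(X2, 4), round(G2, 4), round(pX, 2))    # about 0.031, 0.031, 0.86
```

Running the same function on the second table below, `[[41, 20], [12, 27]]`, should reproduce $X^2 \approx 12.7$ and $G^2 \approx 12.9$.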
In this example, $X^2$ and $G^2$ agree closely because the sample is large enough (we'll say what that means
shortly) and because the model fits well. Let's try another example where the model does not fit well:
$$\begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix}
= \begin{bmatrix} 41 & 20 \\ 12 & 27 \end{bmatrix}.$$
The estimated probabilities are
$$\begin{bmatrix} \hat\pi_{11} \\ \hat\pi_{12} \\ \hat\pi_{21} \\ \hat\pi_{22} \end{bmatrix}
= \begin{bmatrix} .61 \times .53 \\ .61 \times .47 \\ .39 \times .53 \\ .39 \times .47 \end{bmatrix}
= \begin{bmatrix} .3233 \\ .2867 \\ .2067 \\ .1833 \end{bmatrix},$$
and the goodness-of-fit statistics are $X^2 = 12.68$, $G^2 = 12.94$. The p-values are 0.0004 and 0.0003, respectively. The model clearly does not fit.

In this example, $X^2$ and $G^2$ are still rather close. In other examples with poor fit, the two may be very far apart. But the implied p-values are both close to zero, so both would lead us to the same conclusion.
Effects of zero cell counts

If an $X_i$ is zero, $X^2$ can be calculated without any problems, provided that the $\hat\pi_i$'s are all positive.
But $G^2$ has problems. If $X_i = 0$, then the deviance contribution for that cell is undefined, and if we try to use the formula
$$G^2 = 2\sum_{i=1}^{k} X_i \log \frac{X_i}{n\hat\pi_i},$$
an error will result. But if we write the deviance as
$$G^2 = 2\log\frac{L(p; X)}{L(\hat\pi; X)} = 2\log \prod_{i=1}^{k} \left(\frac{X_i/n}{\hat\pi_i}\right)^{X_i},$$
a cell with $X_i = 0$ contributes 1 to the product and may be ignored. Thus we may calculate $G^2$ as
$$G^2 = 2\sum_{i\,:\,X_i > 0} X_i \log \frac{X_i}{n\hat\pi_i}.$$
If any element of $\hat\pi$ is zero, then $X^2$ and $G^2$ both break down.
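Here is a small Python sketch of this zero-safe computation (the function name `deviance_G2_safe` and the example counts and probabilities are my own, purely hypothetical); it simply drops cells with zero observed counts, as described above, and still requires every $\hat\pi_i > 0$.

```python
import numpy as np

def deviance_G2_safe(counts, pi_hat):
    """Deviance G^2 that skips cells with X_i = 0, per the formula above.
    Requires every fitted probability pihat_i > 0."""
    counts = np.asarray(counts, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    n = counts.sum()
    pos = counts > 0                # cells with X_i = 0 contribute nothing
    return float(2.0 * np.sum(counts[pos] * np.log(counts[pos] / (n * pi_hat[pos]))))

# Hypothetical data with a zero cell:
print(round(deviance_G2_safe([30, 0, 50, 20], [0.25, 0.05, 0.45, 0.25]), 3))
```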
How large is large?

As $n$ becomes large, $X^2$ and $G^2$ both approach the same limiting $\chi^2$ distribution, and either one may be used to assess model fit. But how large is large?

- Old rule of thumb: the $\chi^2$ approximation works well if $n\hat\pi_i \ge 5$ for all $i = 1, \ldots, k$.
- More lenient rule: the approximation is okay if no more than 20% of the cells have $n\hat\pi_i < 5$, and if no $n\hat\pi_i$ is less than 1.
- With sparse tables (e.g. $n/k < 5$) the $\chi^2$ approximation will be poor. There is no easy way to assess fit, in an absolute sense, when the table is sparse. But we may still be able to compare the fit of one restricted model to another reasonably well, using differences in $X^2$ or $G^2$. More on this later.
The closeness of $X^2$ to $G^2$ may also be indicative of how well the approximation is working. If the p-values $P(\chi^2_\nu \ge X^2)$ and $P(\chi^2_\nu \ge G^2)$ lead to similar conclusions, then we may be more confident of the result.
Remember that $X^2 - G^2$ approaches zero if the model is true. If the model is not true, then $X^2$ and $G^2$ are both large and they may be far from each other. But even in that situation, the implied p-values from both statistics are close to zero and the results will agree in the sense that we reach the same conclusion.
Finally, note that these asymptotic results assume $n \to \infty$, which implies that the expected cell counts
$n\pi_i$ are all approaching $\infty$ at the same rate. If the data are distributed across the cells in a very uneven fashion (i.e. if some regions of the table are sparse), then the $\chi^2$ approximation may be poor even if the overall $n$ is very large.
Next time: We will start skimming over material from Agresti (2002), Chapters 2-3, regarding the analysis of two-way tables. We will carefully discuss the usual test for independence in an $r \times c$ table and measures of association between binary and nominal variables.
