TABLE 1.7
The function z = (1.5 − x(1 − y))² + (2.25 − x(1 − y²))² + (2.625 − x(1 − y³))².
Numerical derivatives. 25 random starting points in the cube −2 ≤ x, y ≤ 2.

                                  Simple          Modified
                                  quadratic       quadratic                                     Hooke and
                                  hill-climbing   hill-climbing   Davidon        Powell        Jeeves
Time (seconds)      Mean          0.012           0.014           0.018          0.019         0.013
                    Stand. dev.   0.003           0.003           0.002          0.004         0.002
Number of function  Mean          118             118             204            225           323
evaluations         Stand. dev.   27              26              24             47            47
Function value      Mean          0.83 × 10⁻¹²    0.74 × 10⁻¹⁰    0.76 × 10⁻¹¹   0.28 × 10⁻¹¹  0.32 × 10⁻⁸
Distance from       Mean          0.17 × 10⁻⁵     0.22 × 10⁻⁵     0.41 × 10⁻⁵    0.18 × 10⁻⁵   0.92 × 10⁻⁴
true extremum
Failure rate                      0.160           0.160           0.240          0.120         0.200
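The experimental protocol behind tables 1.7-1.10 is straightforward to reproduce in outline. The following is a minimal sketch using Beale's function above and scipy's Powell routine as a stand-in for the five implementations actually benchmarked; the failure threshold, the random seed, and the modern optimizer are assumptions, not the authors' settings.

```python
# Sketch of the benchmarking protocol: 25 random starts, summary statistics.
import time
import numpy as np
from scipy.optimize import minimize

def beale(p):
    x, y = p
    return ((1.5 - x * (1 - y)) ** 2
            + (2.25 - x * (1 - y ** 2)) ** 2
            + (2.625 - x * (1 - y ** 3)) ** 2)

rng = np.random.default_rng(0)
starts = rng.uniform(-2.0, 2.0, size=(25, 2))   # 25 random starting points
values, evals, times = [], [], []
for x0 in starts:
    t0 = time.perf_counter()
    res = minimize(beale, x0, method="Powell")
    times.append(time.perf_counter() - t0)
    values.append(res.fun)
    evals.append(res.nfev)

# Beale's function has its minimum of 0 at (3, 0.5); count a run as a
# failure if the final function value is not close to it (assumed criterion).
failures = sum(v > 1e-6 for v in values)
print(f"mean time {np.mean(times):.4f} s, mean evals {np.mean(evals):.0f}, "
      f"failure rate {failures / 25:.3f}")
```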
TABLE 1.8
The function z = 100(x_3 − x_2²)² + 100(x_2 − x_1²)² + (1 − x_1)².
Numerical derivatives. 25 random starting points in the hypercube −2 ≤ x_1, x_2, x_3 ≤ 2.

                                  Simple          Modified
                                  quadratic       quadratic                                     Hooke and
                                  hill-climbing   hill-climbing   Davidon        Powell        Jeeves
Time (seconds)      Mean          0.038           0.049           0.067          0.052         0.051
                    Stand. dev.   0.016           0.020           0.015          0.021         0.057
Number of function  Mean          425             402             802            659           1430
evaluations         Stand. dev.   173             147             185            225           1593
Function value      Mean          0.12 × 10⁻⁹     0.24 × 10⁻¹⁰    0.20 × 10⁻⁸    0.53 × 10⁻⁷   0.27 × 10⁻⁵
Distance from       Mean          0.33 × 10⁻⁵     0.13 × 10⁻⁵     0.53 × 10⁻⁴    0.50 × 10⁻⁴   0.29 × 10⁻²
true extremum
Failure rate                      0.0             0.0             0.040          0.040         0.0
(2) These two methods have a smaller overall failure rate than the other three methods and are quite substantially better than Davidon and Powell in this respect.37

37 Some limited experimentation with analytically evaluated derivatives indicates modest improvement in the time required by the quadratic hill-climbing and the Davidon algorithms. Davidon's method also seems to perform very much better in the case of numerically evaluated derivatives if Stewart's modification is employed. See G. W. Stewart, III, op. cit.
TABLE 1.9
The function z = (1/(Q(1 + x²))) exp{−(y − x²)²/(Q(1 + x²))²} for Q = 1.
Numerical derivatives. 25 random starting points in the cube −5 ≤ x, y ≤ 5.

                                  Simple          Modified
                                  quadratic       quadratic                                     Hooke and
                                  hill-climbing   hill-climbing   Davidon        Powell        Jeeves
Time (seconds)      Mean          0.012           0.015           0.042          0.015         0.36
                    Stand. dev.   0.030           0.004           0.034          0.010         1.72
Number of function  Mean          104             111             393            142           6070
evaluations         Stand. dev.   24              30              328            91            29369
Function value      Mean          1.00000         1.00000         0.99999        1.00000       1.00000
Distance from       Mean          0.11 × 10⁻⁵     0.18 × 10⁻⁵     0.12 × 10⁻²    0.50 × 10⁻³   0.48 × 10⁻³
true extremum
Failure rate                      0.0             0.0             0.040          0.360         0.0
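The "Function value" rows of tables 1.9 and 1.10 can be checked against the known extremum of the ridge function. Since the exponential factor never exceeds one,

```latex
z(x,y) = \frac{1}{Q(1+x^2)}
         \exp\!\left\{-\frac{(y-x^2)^2}{Q^2(1+x^2)^2}\right\}
       \;\le\; \frac{1}{Q(1+x^2)} \;\le\; \frac{1}{Q},
```

with equality only at (x, y) = (0, 0), so the true maximum is 1/Q: 1 for Q = 1 (table 1.9), 2 for Q = 0.5 (table 1.10), and 10 for Q = 0.1 (table 1.11).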
TABLE 1.10
The function z = (1/(Q(1 + x²))) exp{−(y − x²)²/(Q(1 + x²))²}. Q = 0.5.
Numerical derivatives. 10 random starting points in the cube −5 ≤ x, y ≤ 5.

                                  Simple          Modified
                                  quadratic       quadratic                                     Hooke and
                                  hill-climbing   hill-climbing   Davidon        Powell        Jeeves
Time (seconds)      Mean          0.018           0.022           0.047          0.020         0.013
                    Stand. dev.   0.004           0.007           0.029          0.022         0.003
Number of function  Mean          146             154             414            208           224
evaluations         Stand. dev.   31              42              207            224           51
Function value      Mean          1.99999         1.99999         1.99999        2.00000       2.00000
Distance from       Mean          0.10 × 10⁻⁵     0.15 × 10⁻⁵     0.13 × 10⁻²    0.44 × 10⁻³   0.45 × 10⁻³
true extremum
Failure rate                      0.0             0.0             0.400          0.920         0.0
(3) They usually require fewer function evaluations than the other three methods, sometimes by substantial margins. However, for 10-variable or larger problems (not reported here), Powell is often the best method.
(4) Their superiority in terms of time is less uniform, this being due to the greater expense of time per iteration.
(5) Davidon, Powell, and Hooke and Jeeves each become worst performers with respect to time for some case, with the latter achieving the overall worst record.
TABLE 1.11
The function z = (1/(Q(1 + x²))) exp{−(y − x²)²/(Q(1 + x²))²}. Q = 0.1.
Numerical derivatives. 25 random starting points in the cube −5 ≤ x, y ≤ 5.
(6) The parabolic Gaussian ridge seems to be a severe test, causing Davidon and Powell to have 100% failure rates for Q = 0.1 and causing excessive computation time for Hooke and Jeeves and some failures for the quadratic hill-climbing algorithms.
We also employed some of the methods on some more realistic examples. The modified quadratic hill-climbing method and Powell's method were used for maximizing the likelihood function associated with a small econometric model of the U.S. economy (Klein's Model I). After eliminating identities, the equations of this model are:38
u_{1i} = −y_{1i} + β_{12}y_{2i} + β_{13}y_{3i} + β_{12}x_{1i} + γ_{12}x_{2i} − β_{13}x_{3i} − β_{13}x_{5i} + β_{13}x_{6i}, ...   (1.30)
where (i) x_{1i}, x_{2i}, ..., x_{7i} are assumed to be nonstochastic, (ii) u_{1i}, u_{2i} and u_{3i} are (unobservable) stochastic terms assumed to be jointly normally
38 G. C. Chow, Two methods of computing full information maximum likelihood estimates in simultaneous stochastic equations. International Economic Review 9 (1968), 100-12; and L. R. Klein, Economic Fluctuations in the United States. Cowles Commission Monograph No. 11. New York, Wiley, 1950.
TABLE 1.12
The function L = −½ log((1/T) det(U′U)) + log(det B).

                                  Modified
                                  quadratic       Powell
                                  hill-climbing
Time (seconds)        Start 1     41.55           64.48
                      Start 2     23.52           63.78
Number of function    Start 1     3289            6324
evaluations           Start 2     1830            6233
Function value        Start 1     −2.75562        −2.75797
                      Start 2     −2.75581        −2.75625
distributed with mean vector μ = 0 and covariance matrix Σ; (iii) observations are available39 on all y_{1i}, y_{2i}, y_{3i}, x_{1i}, ..., x_{7i}, i = 1, ..., 21; and (iv) the β's, γ's and the elements of Σ are to be estimated. The system given by (1.30) can be written more compactly as

U = YB + XA,   (1.31)

39 See L. R. Klein, op. cit.
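A minimal sketch of evaluating this likelihood, assuming the concentrated form shown in table 1.12; the `unpack` routine, which maps the free β's and γ's into (B, A) under the model's cross-equation restrictions, is hypothetical and model-specific. Any of the hill-climbing routines of this appendix could then be applied to the result.

```python
# Sketch: concentrated log-likelihood for the system U = YB + XA (eq. 1.31).
import numpy as np

def concentrated_loglik(theta, Y, X, unpack):
    """L = -1/2 log((1/T) det(U'U)) + log(det B), as in table 1.12."""
    B, A = unpack(theta)               # hypothetical: theta -> (B, A)
    U = Y @ B + X @ A                  # structural residuals, one row per period
    T = Y.shape[0]
    _, logdet = np.linalg.slogdet(U.T @ U / T)
    return -0.5 * logdet + np.log(abs(np.linalg.det(B)))
```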
TABLE 1.14
The function S = Σ_{i=1}^{n} [(x_i − x)² + (y_i − y)²]^{k/2}.

                                      x         y         S          Number of     Iterations
                                                                     function
                                                                     evaluations
Problem 1-1, n = 10, k = 1.0
Simple quadratic hill-climbing      28.814    49.051    303.261         86             6
Modified quadratic hill-climbing    28.814    49.051    303.261         86             6
Davidon                             28.819    49.082    303.261        107             5
Powell                              28.837    49.053    303.261        108             3
Hooke and Jeeves                    28.808    49.045    303.261        147            29
Cooper                              28.814    49.053    303.261          -            27

Problem 1-2, n = 10, k = 2.5
Simple quadratic hill-climbing      32.604    56.201    82459.0         41             3
Modified quadratic hill-climbing    32.604    56.201    82459.0         41             3
Davidon                             32.601    56.201    82459.0         86             5
Powell                              32.610    56.185    82459.0        110             3
Hooke and Jeeves                    32.597    56.193    82458.9        144            28
Cooper                              32.604    56.201    82459.0          -             9

Problem 2-1, n = 20, k = 1.0
Simple quadratic hill-climbing      52.180    52.312    793.067         71             5
Modified quadratic hill-climbing    52.180    52.312    793.067         71             5
Davidon                             52.158    52.338    793.065         66             3
Powell                              52.162    52.316    793.065        133             3
Hooke and Jeeves                    52.174    52.329    793.065        164            31
Cooper                              52.178    52.318    793.065          -            20

Problem 2-2, n = 20, k = 2.5
Simple quadratic hill-climbing      45.271    49.476    255069.4        41             3
Modified quadratic hill-climbing    45.271    49.476    255069.4        41             3
Davidon                             45.275    49.480    255068.8       100             5
Powell                              45.270    49.469    255068.8       137             4
Hooke and Jeeves                    45.280    49.463    255068.9       135            28
Cooper                              45.271    49.476    255069.0         -             8
than Powell in terms of time, which is not surprising since it is often true that Powell requires more function evaluations. In terms of accuracy, both seem to give results that compare well with the independently computed results of Chow.
A second test case was Hartley's function given by

S = Σ_{i=1}^{6} (y_i − θ_1 − θ_2 e^{θ_3 x_i})²,

where the six x and y values are (−5, −3, −1, 1, 3, 5) and (127, 151, 379, 421, 460, 426) respectively.41 This function was minimized by all five methods discussed in this appendix, and the results are contrasted with the independently
obtained results of Vitale and Taylor.42 Table 1.13 reveals the following conclusions: (a) the two versions of the quadratic hill-climbing method do not do as well as the others in terms of accuracy; (b) Powell and Davidon perform as well as Vitale and Taylor's modified Davidon method using analytic derivatives; (c) using numerical derivatives the latter method is worse than Davidon or Powell.
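Hartley's problem is fully specified by the six data points above, so a modern least-squares fit can serve as a check; this sketch uses scipy rather than any of the routines compared in the text, and the starting values are an arbitrary assumption.

```python
# Sketch: fitting Hartley's function S = sum (y_i - t1 - t2*exp(t3*x_i))^2.
import numpy as np
from scipy.optimize import least_squares

x = np.array([-5.0, -3.0, -1.0, 1.0, 3.0, 5.0])
y = np.array([127.0, 151.0, 379.0, 421.0, 460.0, 426.0])

def residuals(theta):
    t1, t2, t3 = theta
    return y - t1 - t2 * np.exp(t3 * x)

# The starts are a guess; the text does not report the starting points used.
fit = least_squares(residuals, x0=[400.0, -100.0, -0.5])
print(fit.x, (fit.fun ** 2).sum())   # estimates and minimized sum of squares
```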
A third class of problem investigated is the generalized Weber problem, which consists of minimizing

S = Σ_{i=1}^{n} [(x_i − x)² + (y_i − y)²]^{1/2}

with respect to x and y, where the x_i and y_i are given numbers. These problems have most often been solved by ad hoc methods that are basically Newton-Raphson type approaches to solving the equations representing the first-order conditions for a minimum.43 Four of the many problems solved by Cooper were solved by the methods discussed here. All methods do very well, with the two quadratic hill-climbing methods holding the edge in terms of the number of function evaluations. Unfortunately, none of them can be compared easily with Cooper's procedure, for the latter measures work performed in terms of the number of iterations, which is an ambiguous measure.
41 P. Vitale and G. Taylor, A note on the application of Davidon's method to nonlinear regression problems. Technometrics 10 (1968), 843-49.
42 The results of Vitale and Taylor do not include the number of times the function was evaluated, but only the number of iterations performed.
43 See H. W. Kuhn and R. E. Kuenne, An efficient algorithm for the numerical solution of the generalized Weber problem in spatial economics. Journal of Regional Science 4 (1962), 21-34; and L. Cooper, An extension of the generalized Weber problem. Journal of Regional Science 8 (1968), 181-98.
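A sketch of the generalized Weber objective and its direct minimization by a derivative-free method; the point set is illustrative, not one of Cooper's problems, and k = 1 reproduces the summed-distance case given above.

```python
# Sketch: minimize S = sum_i [(x_i - x)^2 + (y_i - y)^2]^(k/2) over (x, y).
import numpy as np
from scipy.optimize import minimize

pts = np.array([[10.0, 20.0], [35.0, 5.0], [60.0, 40.0],
                [25.0, 70.0], [80.0, 55.0]])     # illustrative destinations

def weber(p, k=1.0):
    d2 = ((pts - p) ** 2).sum(axis=1)            # squared distances to p
    return (d2 ** (k / 2)).sum()

res = minimize(weber, x0=pts.mean(axis=0), args=(1.0,), method="Nelder-Mead")
print(res.x, res.fun)                            # optimal location and S
```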
CHAPTER 2
2.1. Introduction
The previous chapter dealt with the problem of optimizing a given objective function and presented several alternative algorithms for use in numerical analysis. The present chapter is concerned with estimation and hypothesis testing in econometric models, and its focus is on the steps preceding and following the actual optimization process. The step prior to estimation consists of adopting a criterion for estimation. That is, one must develop an objective function whose optimization will produce estimates of unknown parameters with desirable statistical properties. The present chapter considers two of these criteria, least squares and maximum likelihood. The step following estimation consists of testing statistical hypotheses about the true parameter values or providing confidence intervals for them. As we shall see, in situations involving nonlinearities this may be quite complex and we shall be forced to examine several alternative approximate methods.

The discussion in the first parts of the chapter focuses on the general regression model of the form

y = f(x, θ) + u,   (2.1)
13 Preliminary testing seemed to give better results for the Gaussian than for either
Newton-Cotes or Romberg quadrature.
of the parameters were obtained by three methods: (1) suppressing v_t and performing a linear regression on the logarithms of the variables; (2) suppressing u_t and directly minimizing the relevant sum of squares; (3) maximizing the likelihood function (5.14). The results are displayed in table 5.1, where the first two rows correspond to one x-sample and the second two to the other. Although the limited nature of these experiments makes them only suggestive, the finite sample properties of the estimates by the mixed method (Method 3) seem to be reasonable and in some cases startlingly better than the estimates from the other methods.14
6.1. Introduction

The present chapter is devoted to comparing alternative parameter estimators for a model involving complex nonlinearities. Its main purpose is to examine and evaluate the conditions under which reliable estimators exist and the standard variance approximations are accurate.

The basic estimating equations involve integrals which contain parameters not only in the integrand but also in the lower limits. The chapter discusses the determination of the lower limits, and it emerges that certain of the parameters are nearly indeterminate. The standard covariance approximations are then computed, the degree to which they reflect the indeterminacy is checked, and they are compared to covariance matrices based on a higher-order approximation.

We next consider the question of whether certain data should be added, in the form of a nonlinear constraint. It is possible to approximate both the efficiency gained by imposing such a constraint in case it is correct and the possible bias introduced by imposing it when it is incorrect. In the present case, the question is the addition of wage data (and of the assumption that the wage rate equals labor's marginal product) to a model of production and technical change.

In general, the possible gains and losses from adding a constraint depend upon both the true parameter values and the degree of discrepancy between them and the imposed constraint. In the present example the discrepancies often tend to follow a definite pattern, a systematic positive or negative bias, which results from the possession of market power in the labor market by either labor unions or firms. Thus we adapt the approximations which produce the covariances to approximate the effects of biases in the wage data that are added as a constraint.
* This chapter was prepared by D.E.Smallwood.
since the average product of labor on the newest idle vintage is labor's marginal product. Thus the same conclusion applies in either case: when the allocation of labor is efficient, all vintages younger than a certain age (the 'marginal vintage') are fully in use, the marginal vintage itself may be partially utilized, and all older vintages are completely idle.
If we make the transition, purely for reasons of convenience, to a continuous model in which a vector of capital goods is replaced by a stream

I(v),   −∞ < v ≤ t,

where each infinitesimal of capital with vintage between v and (v + dv) operates with characteristics (6.1) and (6.2), then there are no partially utilized vintages. Production efficiency at time t implies that there is an interval (t − m_t, t) of fully utilized vintages, where m_t is the age of the marginal vintage at time t, while all capital older than m_t will lie idle.

Let L_{v,t} be the labor employed on vintage v capital at time t and Q_{v,t} the output produced on it. Total output and labor are the integrals of these over the interval (t − m_t, t):

Q_t = ∫_{t−m_t}^{t} Q_{v,t} dv,   L_t = ∫_{t−m_t}^{t} L_{v,t} dv,   (6.5)
which we can rearrange and then write the basic estimating equations for the vintage model as follows:

Q_t = a e^{rt} ∫_{t−m_t}^{t} e^{gv} I(v) dv,   (6.6)

L_t = (a/b) e^{(r−s)t} ∫_{t−m_t}^{t} e^{(g−h)v} I(v) dv.   (6.7)
With these equations, the parameters of the model (a, b, g, h, r, and s) can be estimated along with the simultaneous estimation of the marginal vintage for each observation period, using only Q_t and L_t data.
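A minimal sketch of evaluating (6.6) and (6.7) for given parameter values, a marginal-vintage age m, and an investment stream I(v); the quadrature routine and the illustrative exponential stream are implementation assumptions.

```python
# Sketch: compute Q_t and L_t from the vintage-model estimating equations.
import numpy as np
from scipy.integrate import quad

def q_and_l(t, m, a, b, g, h, r, s, invest):
    # Q_t = a e^{rt} * integral of e^{gv} I(v) over the vintages still in use
    q_int, _ = quad(lambda v: np.exp(g * v) * invest(v), t - m, t)
    # L_t = (a/b) e^{(r-s)t} * integral of e^{(g-h)v} I(v) over the same range
    l_int, _ = quad(lambda v: np.exp((g - h) * v) * invest(v), t - m, t)
    return a * np.exp(r * t) * q_int, (a / b) * np.exp((r - s) * t) * l_int

# Illustrative call with a smooth exponential investment stream:
Q, L = q_and_l(t=25.0, m=20.0, a=1.0, b=0.5, g=0.03, h=0.02, r=0.01, s=0.01,
               invest=lambda v: np.exp(0.05 * v))
```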
If T observations on the pair (Q_t, L_t) are available, (T + 6) parameters are estimated. Since two equations are fitted for each observation, T must be at least 6 in order that 2T ≥ T + 6. In fact, estimating the parameters on the basis of (6.6) and (6.7) seems rather dubious even for large T. Not only is a parameter added for each observation, but the basis for identifying the parameters is purely the differential effect of embodied technical change on productivity as the m_t are changed. Neither equation can be estimated by itself; the determination of the m_t depends on differences in the patterns of the Q_t and L_t.
Yet estimation based on just (6.6) and (6.7) provides very significant advantages. First, wage data are not used. In fact, there is no assumption that the marginal product of labor is related in any way to a wage rate (or even to the marginal cost of adding labor, if that is different). The derivation of (6.6) and (6.7) depends only on the specification of the production relations and on the assumption that whatever output is produced, it is produced by efficiently allocating labor over the different capital vintages. Thus the estimators are not based on the questionable assumption of competitive labor and product markets.
Second, the utilization of the capital goods is jointly estimated, rather than ignored or handled with a questionable 'adjustment' factor. In fact, as long as embodied labor augmentation (h) is positive, no theoretically valid method of aggregating capital exists. Thus a significant advantage of the vintage model is that the utilization of capital, indeed the utilization of each vintage of capital, is simultaneously estimated rather than dealt with in rather crude and questionable ways.
We note that the ability to estimate the parameters without wage data or without even an assumption about wage rates is distinct from the simultaneous determination of capital utilization. Since the marginal product of labor is simply labor's average product on the marginal vintage, an assumption of competitive behavior (in both the labor and product markets) implies that:

w_t/p_t = b e^{st + h(t − m_t)}.   (6.8)

This constraint can be added to (6.6) and (6.7); whether it should be or not depends upon a comparison between the resulting increase in the efficiency of the parameter estimators and the possible deleterious effects of the specification error committed if the competitive assumptions do not (approximately) hold.
Problems in computing maximum likelihood estimates
The conceptual framework outlined so far has three major advantages:
(1) The fixed proportions vintage model is more realistic than traditional models with respect to the utilization of capital goods.
(2) The joint estimation of capital utilization, using eqs. (6.6) and (6.7), is more satisfactory than the standard adjustments.
(3) The model can be estimated, at least in principle, without invoking assumptions of competitive behavior in either labor or product markets.
Furthermore, the present model could be the building block for more complex, highly realistic models involving several labor and capital categories and perhaps more than one production process.
Assume that we have discrete observations on the variables Q and L, indexed by the subscript i = 1, ..., T, the observations being made at times t_i, at which time the oldest capital still in use has age m_i. We postulate errors ε_i and u_i in the observations. Since the Q_i and L_i normally grow over time, and there is no plausible reason for the errors to become relatively less important over time, we postulate proportional errors2:

Q_i = a e^{rt_i} ∫_{t_i−m_i}^{t_i} e^{gv} I(v) dv · {1 + ε_i} = F_i {1 + ε_i},   (6.9)

L_i = (a/b) e^{(r−s)t_i} ∫_{t_i−m_i}^{t_i} e^{(g−h)v} I(v) dv · {1 + u_i} = G_i {1 + u_i},   (6.10)
3 where F_i and G_i are defined in (6.9) and (6.10). If we derive a concentrated (log) likelihood function by inserting the estimator for σ² into the logarithm of the likelihood function, the result is:

K − T log {Σ_i (ε_i² + u_i²)} + Σ_i {log(1 + ε_i) + log(1 + u_i)},

where K is a constant. We have ignored the third term throughout the approximations, which we justify on two grounds. First, the partials with respect to a particular error, say ε_i, of the second and third terms are:

−2Tε_i / Σ_i (ε_i² + u_i²)   and   1/(1 + ε_i),

respectively. For small ε_i and u_i, these are of order 1/σ and 1 respectively. We take σ = 0.01 below. Thus the change in the second term will tend to be 100 times as large as the change in the third term for a given change in a residual; maximizing the likelihood function is essentially equivalent to maximizing the second term. Second, including the term has no effect on any of the linear approximations derived below. The quadratic approximations would be altered slightly; these are completely valid for least squares estimators, but the latter are not, strictly speaking, maximum likelihood estimators for this model. Basically the quadratic approximations reflect the nonlinearities that enter because the F_i and G_i are not linear functions of the parameters, but ignore the nonlinearities that enter because of the error specification, which are much less significant in this case. Thus the quadratic approximations reflect the essential nonlinearities in the model, and are still sensible approximations for precise maximum likelihood estimators.
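A sketch of the objective this footnote describes: with F_i and G_i computed as in (6.9) and (6.10), the proportional residuals give the concentrated log-likelihood with the sum-of-logs term dropped, as in the text; the constant K is omitted.

```python
# Sketch: concentrated log-likelihood for the vintage model, third term ignored.
import numpy as np

def concentrated_loglik(Q, L, F, G):
    eps = Q / F - 1.0          # proportional errors implied by (6.9)
    u = L / G - 1.0            # and by (6.10)
    T = len(Q)
    return -T * np.log(np.sum(eps ** 2 + u ** 2))   # K - T log{sum(eps^2 + u^2)}
```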
Suppose, then, that data generated according to (6.13) and (6.14) are available, and that arbitrary â, ĝ, and r̂ are suggested, subject only to the restriction that â is not 'too small'. Specifically, let â satisfy

â > max_i [ Q_i / ( e^{r̂t_i} ∫_{−∞}^{t_i} e^{ĝv} I(v) dv ) ],   (6.16)

where r̂ and ĝ are arbitrary. Then a set of m̂_i can be found, by inverting (6.13), which makes the Q_i equations fit perfectly (the restriction (6.16) merely insures that finite m̂_i can be found). If b̂ and ŝ are then taken to satisfy (6.15), thus equal to the true b and s, then the set {â, b̂, ĝ, ĥ, r̂, ŝ, m̂_1, ..., m̂_T} automatically satisfies (6.14) for each observation, since (6.13) and (6.15) together imply (6.14). Thus if h = 0, data generated by the vintage model can be explained with essentially any a, g and r by choosing a suitable set of marginal vintages. Only b and s can be inferred.
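The inversion used in this argument can be made concrete by root-finding on m; a sketch, with `invest` the investment stream and `m_max` an assumed upper bound on the marginal age whose existence is what (6.16) guarantees.

```python
# Sketch: invert the Q-equation to find the m_hat_i that fits Q_i exactly.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def m_hat(Q_i, t_i, a, g, r, invest, m_max=200.0):
    def gap(m):
        # residual of the Q-equation as a function of the marginal age m
        integral, _ = quad(lambda v: np.exp(g * v) * invest(v), t_i - m, t_i)
        return a * np.exp(r * t_i) * integral - Q_i
    # gap is -Q_i < 0 at m near 0; (6.16) guarantees a sign change by m_max
    return brentq(gap, 1e-9, m_max)
```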
While this dependence upon the effect of h is ominous, it is not clear whether, when h is relatively large, the parameters remain nearly indeterminate. The following result and the related computations suggest that they do.
Suppose Q_i and L_i data are given (not necessarily generated as in (6.6) and (6.7)) and that arbitrary α and β are chosen. We let:

δ_i = log(Q_i/L_i) − (α + βt_i),   (6.17)

so that the δ_i can be considered the residuals when the natural log of (Q_i/L_i) is fitted to α + βt_i.

Again let â, ĝ and r̂ be arbitrary, subject only to the restriction that â is not too small, as above. If we choose m̂_i's such that

Q_i = â e^{r̂t_i} ∫_{t_i−m̂_i}^{t_i} e^{ĝv} I(v) dv,   (6.18)

then we can rearrange (6.17) and (6.18) to give:

L_i = e^{−δ_i} (â/e^α) e^{(r̂−β)t_i} ∫_{t_i−m̂_i}^{t_i} e^{ĝv} I(v) dv.   (6.19)

Let us assume for the moment that the δ_i are small enough so that

e^{−δ_i} ≈ 1 − δ_i

is a reasonable approximation. Then (6.18) and (6.19) show that if the m̂_i are chosen correctly, arbitrary â, ĝ and r̂ can 'explain' the Q_i data perfectly and
the L_i data with percentage errors of −δ_i, since, except for the term e^{−δ_i}, (6.19) is precisely in the form of (6.14). Obviously, if the m̂_i choices are adjusted, the discrepancies can be allocated between the Q_i and L_i; in fact, if m̄_i are chosen to solve the equation:

∫_{t_i−m̄_i}^{t_i} e^{ĝv} I(v) dv = ½ [ ∫_{t_i−m_i*}^{t_i} e^{ĝv} I(v) dv + ∫_{t_i−m̂_i}^{t_i} e^{ĝv} I(v) dv ],

where the m_i* are defined by fitting the L_i data perfectly, then simple manipulations show that the discrepancies in the Q_i and L_i data become δ_i/2 and −δ_i/2. Thus we have the following result:

Let Q_i and L_i data be given, and let the δ_i be defined by:

log(Q_i/L_i) = α + βt_i + δ_i,

where α and β may be chosen to minimize some function of the δ_i or may just be arbitrary. Then, independent of the investment stream given, the vintage model can be fitted to the Q_i and L_i data with h = 0 and (essentially) arbitrary â, ĝ and r̂, with percentage errors of at most |δ_i/2| in both the Q_i and L_i equations.
Obviously, a relevant question is how small the δ_i can be for optimally chosen α and β, such as the coefficients in a least squares fit. Mathematically, our result does not imply that the parameters of the model are necessarily nearly indeterminate; Q_t and L_t data can be generated for which the δ_i are very large for every α and β. However, Q_t and L_t data generated according to (6.6) and (6.7), using investment streams, marginal vintage sequences, and parameter values which appear to be at least generally plausible, are not arbitrary. For such data, the ratio (Q_i/L_i) almost inevitably closely follows an exponential time trend, implying that the determination of the vintage model parameters, except for b and s, is extremely tenuous.
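The δ_i of (6.17) for a least squares choice of α and β take one line of linear algebra; a sketch:

```python
# Sketch: residuals of log(Q/L) around its fitted time trend, as in (6.17).
import numpy as np

def deltas(Q, L, t):
    z = np.log(Q / L)
    beta, alpha = np.polyfit(t, z, 1)   # least squares fit of z on alpha + beta*t
    return z - (alpha + beta * t)       # the delta_i; |delta_i/2| bounds the fit error
```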
This assertion is supported by the size of the δ_i's computed with a selection of illustrative examples. Twenty examples, produced by combining each of four investment series with each of five m_t sequences, were used in all the computations. In each example, 20 'observations' were assumed. The choices were meant to represent plausible examples with a bias toward high variability. Two of the investment series are essentially exponential with significant variance around the trend, while two do not even approximate an exponential trend. The m_t sequences have averages around 20 with absolute deviations from the trend which typically average 4 or 5.
For the set of 20 examples, which each reflect 20 observations, the (δ_i/2) that resulted from a least squares fit appeared to be proportional to h, averaging only 7.3% for h = 0.04 and 1.6% for h = 0.01! Even for large h, the basis for determining the parameters with this specification is extremely tenuous.
In fact, competitive forces will typically operate to keep log(Q_t/L_t) near its time trend. The competitive marginal vintage can be derived from (6.8):

5 For more detailed support of these assertions, see D. Smallwood, op. cit.
than estimators based on just output and labor data, since the specification of perfectly constant rates obviously represents a sacrifice of realism for feasibility.
6.4. Linear and quadratic variance approximations

We now assume that the specification, in eqs. (6.9) and (6.10), of errors in the Q_i and L_i which are independent, normally distributed, and have equal variances, is correct. We further assume that maximum likelihood estimators are applied to Q_i and L_i samples and to a given (errorless) investment series, and approximate the sampling variances of the corresponding parameter estimators.

The derivation of the maximum likelihood equation employs all of the assumptions of the error specification; for instance, it is not only assumed that the ε_i and u_i have equal variances but also that the estimation method exploits this correct assumption. Obviously the resulting variances should be considered as lower bounds, probably significant underestimates, of the true variances in a realistic situation. If different variances for the ε_i and u_i, correlation between the ε_i and u_i, or autocorrelation among the ε_i and u_i were allowed for in the estimation, the sampling variances would be considerably increased even if those conditions did not obtain in reality. The assumption of uniformly favorable conditions is necessary to keep the computation of the approximations feasible.
The standard variance approximations for nonlinear maximum likelihood estimators, generally known as the Cramér-Rao variances and derived in chapter 2, can be written:6

Var(β̂) ≈ [ −E ( ∂² log L(β) / ∂β ∂β′ ) ]^{−1},   (6.21)

where L(β) is the likelihood function. Using the definition of S(β) in (6.12) and the error properties assumed, continuing to ignore a term as explained in footnote 3, we can write:

L(β) ≈ (2πσ²)^{−T} e^{−S(β)/2σ²}.   (6.22)

6 The matrix in (6.21) should include partials with respect to σ², which is not a component of β. However, in this case the same covariance matrix for β results since the cross partials of log L with respect to σ² and the components of β are all zero when evaluated at the true parameter point.
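A sketch of how the linearized approximation can be computed in practice: numerical partials of the F_i and G_i, scaled as the proportional-error specification suggests, form the matrix of 'local independent variables' discussed below, and the approximate covariance is σ²(X′X)⁻¹. The scaling, the step size, and the use of a pseudoinverse are implementation choices, not the authors' code.

```python
# Sketch: linearized (Cramer-Rao style) covariance via numerical derivatives.
# `model` is a hypothetical function returning the stacked vector
# (F_1..F_T, G_1..G_T) for a parameter vector beta (an ndarray).
import numpy as np

def linearized_cov(model, beta0, sigma, step=1e-6):
    f0 = model(beta0)
    X = np.empty((f0.size, beta0.size))
    for j in range(beta0.size):
        b = beta0.copy()
        b[j] += step
        X[:, j] = (model(b) - f0) / (step * f0)   # numerical d(log F_i)/d(beta_j)
    # Near-singularity of X'X (e.g. h near 0) shows up as exploding entries.
    return sigma ** 2 * np.linalg.pinv(X.T @ X)
```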
(1/Q_i) ∂F_i/∂r = (t_i/Q_i) a e^{rt_i} ∫_{t_i−m_i}^{t_i} e^{gv} I(v) dv,   (6.31)

and so forth. By writing:

∫_{t_i−m_i}^{t_i} v e^{gv} I(v) dv = t_i ∫_{t_i−m_i}^{t_i} e^{gv} I(v) dv − ∫_{t_i−m_i}^{t_i} (t_i − v) e^{gv} I(v) dv,
The vectors

x_j = ( ∂F_1/∂β_j, ..., ∂F_T/∂β_j, ∂G_1/∂β_j, ..., ∂G_T/∂β_j )′   (j = 1, ..., T + 6)   (6.36)

play a role analogous to that of the independent variables in the linear regression context; they form the columns of the 2T × (T + 6) matrix X.

Thus singularity of the matrix V corresponds to perfect multicollinearity in the linear regression context. The matrix V clearly has rank less than T + 6 if and only if the matrix X has rank less than T + 6; that is, when these 'local independent variables' are linearly dependent. It is not surprising in the light of the previous analysis that when h = 0, so that A_i(g) = A_i(g − h), the columns of X can be shown to be linearly dependent. Since the primary determinant of log(Q_i/L_i), given the parameters, is the average age of capital in use, it is also not surprising that when either

A_i(g) = α + βt_i   or   A_i(g − h) = α + βt_i

for all i and some α and β, the columns of X are again linearly dependent.

These results emerge from the previous analysis, but further examination of X produces new insights into the determination of the parameters. For
instance, manipulation shows that X has less than full rank if:

M_i(g)/M_i(g − h) = α + βt_i   (i = 1, ..., T)   (6.37)

for some α and β. But the expansion of this ratio in h yields:

M_i(g)/M_i(g − h) = 1 − h(m_i − A_i(g)) + O(h²),   (6.38)

where O(h²) represents terms in h² and higher. Thus if

m_i − A_i(g) = α + βt_i,

then the matrix X could be said to have rank less than T + 6 'to order h'.

This result suggests that even substantial fluctuations in the average age around its time trend may not be sufficient to produce reasonable estimator variances, since deviations in the A_i will usually correspond to deviations in the m_i around their trend in about the same proportion.
In the linear case, the singularity of the cross-product matrix of the independent variables, (X′X), implies that the likelihood function has no unique maximum and, therefore, that coefficient estimators are indeterminate. In the nonlinear case, the likelihood function may have a unique maximum even though V is singular; the latter condition implies merely that in some direction in parameter space not only the first derivative but also the second derivative of the likelihood function is zero. While this condition does not translate easily into a statement about variance, the parameter combination corresponding to that direction will be close to indeterminate in normal cases.
We now employ a different approach to the motivation of (6.23) as the variance approximation, one which is useful in extending the approximation. The maximum likelihood estimator, β̂, is implicitly defined by the T + 6 first-order conditions, which from (6.12), (6.25), and (6.26) are:

Σ_i [ ((Q_i − F_i(β̂))/Q_i) (1/Q_i) (∂F_i(β̂)/∂β_j) + ((L_i − G_i(β̂))/L_i) (1/L_i) (∂G_i(β̂)/∂β_j) ] = 0   (j = 1, ..., T + 6).   (6.39)

This set of equations defines β̂ as an implicit function of the Q_i and L_i. Thus, if partials in each of these equations are taken with respect to Q_k, for instance, the result is a set of T + 6 linear equations in:
∂β̂_j/∂Q_k   (j = 1, ..., T + 6),

which can be calculated for a particular choice of β; the natural choice is obviously the true parameter point. Since, according to (6.9),

Q_k = F_k (1 + ε_k),

the partial derivatives of the β̂_j with respect to the ε_k can be computed. Similarly, partials with respect to the u_k can be calculated. Thus if we write:

β̂_j ≈ β_j + Σ_k [ (∂β̂_j/∂ε_k) ε_k + (∂β̂_j/∂u_k) u_k ],   (6.40)

then we can obtain an approximation to

E(β̂_j − β_j)²

by invoking the assumptions about the error distributions. Covariance terms can similarly be approximated.
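A sketch of (6.40) computed by brute force: the sensitivities ∂β̂_j/∂ε_k and ∂β̂_j/∂u_k are taken by finite differences through a hypothetical routine `solve` that returns β̂ for given error vectors, and the squared sensitivities are then summed.

```python
# Sketch: linear variance approximation of E(beta_hat_j - beta_j)^2 via (6.40).
import numpy as np

def linear_variance(solve, T, sigma, j=0, step=1e-6):
    base = solve(np.zeros(T), np.zeros(T))     # beta_hat at eps = u = 0
    total = 0.0
    for k in range(T):
        eps = np.zeros(T)
        eps[k] = step
        d_eps = (solve(eps, np.zeros(T))[j] - base[j]) / step
        u = np.zeros(T)
        u[k] = step
        d_u = (solve(np.zeros(T), u)[j] - base[j]) / step
        total += d_eps ** 2 + d_u ** 2
    return sigma ** 2 * total                  # E(eps_k^2) = E(u_k^2) = sigma^2
```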
These computations result in the same 'linearly approximated' variances as those produced by linearly expanding the F_i and G_i around the true parameter point.8 The extensions to higher terms, however, do not give the same approximations. That is, (6.35) could be expanded to include quadratic terms:

F_i(β̂) ≈ F_i(β⁰) + (β̂ − β⁰)′ [∂F_i/∂β]_{β=β⁰} + ½ (β̂ − β⁰)′ [∂²F_i/∂β ∂β′]_{β=β⁰} (β̂ − β⁰),   (6.41)

and similarly for the G_i(β̂), and the exact variances of the estimators based on fitting these quadratic functions could be determined.
But as long as the numerous interrelations of the coefficients in (6.41) are acknowledged, this is no solution at all. A linear regression subject to numerous nonlinear constraints is still a nonlinear problem, with no obviously easy way to approximate the covariance matrix. Applied to this new nonlinear problem, the Cramér-Rao solution simply generates again the linearized approximation to the covariance matrix. An easily computable alternative would be to compute the covariance matrix of the linear coefficients in (6.41) when the quadratic terms are included but the coefficient restrictions are ignored (thus producing an ordinary linear regression). This would certainly increase the approximated variances, but the procedure has no clear justification; if the method is extended by adding further terms but still ignoring the
8 D. Smallwood, op. cit.
β̂_j ≈ β_j + Σ_k [ (∂β̂_j/∂ε_k) ε_k + (∂β̂_j/∂u_k) u_k ]
      + ½ Σ_i Σ_k [ (∂²β̂_j/∂ε_i ∂ε_k) ε_i ε_k + (∂²β̂_j/∂u_i ∂u_k) u_i u_k ]
      + Σ_i Σ_k (∂²β̂_j/∂ε_i ∂u_k) ε_i u_k.   (6.42)
By applying the assumptions about the errors, a 'quadratically approximated' covariance matrix for the parameters can be evaluated. Whereas only the assumptions that the errors are independent, have zero means, and have a common variance of σ² were actually invoked in the linear approximations, the evaluation of the variance and covariance terms using (6.42) exploits more fully the assumption of normality, since the assumptions that E(ε_i³) = 0 and that E(ε_i⁴) = 3[E(ε_i²)]² are used.

Including the quadratic terms in (6.42) does not produce the exact sampling variances of estimators based on a quadratic expansion of the original function, as in (6.41). This follows from the fact that the cubic and higher terms in an extension of (6.42) are not generally zero for estimators derived from a quadratic function of the parameters. The class of functions for which (6.42) produces the exact sampling variances is not clear.
6.5. Computed covariance matrices

Since the approximated variances are taken around the true parameter values rather than the means of their estimators, their square roots are root mean square errors rather than standard deviations. We will abbreviate the linear and quadratic approximations as RMSE_L and RMSE_Q. As they represent an extended approximation, the quadratic approximations are presumed to be more accurate. Some indirect evidence in support of this presumption is given below, where we argue that the patterns of the RMSE_Q across different h are more plausible, given the analytic results of section 6.3, than the patterns of the RMSE_L.

When E(β̂_j − β_j)² is computed using (6.42), the terms which have a nonzero expectation are either of the order σ² or σ⁴. The terms involving σ² reproduce the linear approximations; the new terms are all of order σ⁴. The ratio of the RMSE_Q to the RMSE_L thus approaches 1 as σ approaches 0.

For this model, the range of σ for which the RMSE_L and RMSE_Q are nearly equal is extremely small and depends crucially upon the size of h. Table 6.1 exhibits the RMSE_L and the ratio (RMSE_Q/RMSE_L) for g, h, and

TABLE 6.1
Linearly approximated root mean square errors and the ratio of quadratic to linear, averaged over 20 examples

(g + r), averaged over the 20 examples described above. With h = 0.04 and 1% errors in the Q_i and L_i, the RMSE_Q exceed the RMSE_L by 50-80% on average. For smaller h or larger σ, the discrepancy becomes enormous. If high accuracy is desired, the inadequacy of the RMSE_L is striking. For the RMSE_L
These are an order of magnitude smaller than the RMSE_L. A smaller h still implies higher variances, but the introduction of wage data has drastically reduced their level.
The addition of wage data reduces the variance due to deviations in the rates, but not to the extent that it reduces the variance due to errors of observation in the data. Thus the relative increase is greater, particularly for parameters with small RMSEs. For σ_tc = 0.005, the increases range from about 50 to over 140%, while for σ_tc = 0.010 the average RMSEs are multiplied by factors ranging from 2 to 5. The standard variance approximations, computed on the assumption of perfectly constant rates, imply confidence intervals which are much too small even for σ_tc = 0.005 and are severe underestimates for σ_tc = 0.010.

A 20% bias in the wage data raises the prediction RMSE to over 50%. Even with good wage data, the effect of deviations in the rates of technical change, with σ_tc = 0.010, is to raise the standard deviation of the predictor to roughly 15%. With h = 0.01, a bias in the wage data has a severe effect. For small h, the specification based on just Q_t and L_t data is hopeless, and only high quality wage data together with highly stable rates of technical change will allow accurate prediction with this specification.
Even for h = 0.04, it appears justifiable to ignore wage data only if the Q_t and L_t data have very small errors and the bias in the wage data is large. Furthermore, the quadratic approximations indicate, not unexpectedly, that the constraint on the size of the errors in the Q_t and L_t is stringent.
Given the previous analysis, the inadequacy of the standard (i.e., linearly approximated) confidence intervals is not surprising for the specification based on only output and labor data. But even with wage data added, the calculations imply that the standard approximations are an order of magnitude too small when either a modest bias in the wage data or modest variations in the true rates of technical change are present. Furthermore, the derivation of the variance added by deviations in the rates of technical change suggests that adding quadratic terms would produce an increase comparable to that in the variance attributable to the errors of observation.
10 D. Smallwood, Estimation and prediction with the fixed coefficients vintage model. Unpublished Ph.D. thesis, Yale University, 1970.