You are on page 1of 10

Title stata.

com
hausman — Hausman specification test

Description Quick start Menu Syntax


Options Remarks and examples Stored results Methods and formulas
Acknowledgment References Also see

Description
hausman performs Hausman’s (1978) specification test.

Quick start
Hausman test for stored models consistent and efficient
hausman consistent efficient
As above, but compare fixed-effects and random-effects linear regression models
hausman fixed random, sigmamore
Endogeneity test after ivprobit and probit with estimates stored in iv and noiv
hausman iv noiv, equations(1:1)
Test of independence of irrelevant alternatives for model with all alternatives all and model with
omitted alternative omitted
hausman omitted all, alleqs constant

Menu
Statistics > Postestimation

1
2 hausman — Hausman specification test

Syntax
   
hausman name-consistent name-efficient , options

name-consistent and name-efficient are names under which estimation results were stored via esti-
mates store; see [R] estimates store. A period (.) may be used to refer to the last estimation
results, even if these were not already stored. Not specifying name-efficient is equivalent to
specifying the last estimation results as “.”.

options Description
Main
constant include estimated intercepts in comparison; default is to exclude
alleqs use all equations to perform test; default is first equation only
skipeqs(eqlist) skip specified equations when performing test
equations(matchlist) associate/compare the specified (by number) pairs of equations
force force performance of test, even though assumptions are not met
df(#) use # degrees of freedom
sigmamore base both (co)variance matrices on disturbance variance
estimate from efficient estimator
sigmaless base both (co)variance matrices on disturbance variance
estimate from consistent estimator
Advanced
tconsistent(string) consistent estimator column header
tefficient(string) efficient estimator column header
collect is allowed; see [U] 11.1.10 Prefix commands.

Options

 Main
constant specifies that the estimated intercept(s) be included in the model comparison; by default,
they are excluded. The default behavior is appropriate for models in which the constant does not
have a common interpretation across the two models.
alleqs specifies that all the equations in the models be used to perform the Hausman test; by default,
only the first equation is used.
skipeqs(eqlist) specifies in eqlist the names of equations to be excluded from the test. Equation
numbers are not allowed in this context, because the equation names, along with the variable
names, are used to identify common coefficients.
equations(matchlist) specifies, by number, the pairs of equations that are to be compared.
The matchlist in equations() should follow the syntax
  
#c :#e ,#c :#e ,. . .
where #c (#e ) is an equation number of the always-consistent (efficient under H0 ) estimator. For
instance, equations(1:1), equations(1:1, 2:2), or equations(1:2).
If equations() is not specified, then equations are matched on equation names.
hausman — Hausman specification test 3

equations() handles the situation in which one estimator uses equation names and the other
does not. For instance, equations(1:2) means that equation 1 of the always-consistent estimator
is to be tested against equation 2 of the efficient estimator. equations(1:1, 2:2) means that
equation 1 is to be tested against equation 1 and that equation 2 is to be tested against equation 2.
If equations() is specified, the alleqs and skipeqs options are ignored.
force specifies that the Hausman test be performed, even though the assumptions of the Hausman
test seem not to be met, for example, because the estimators were pweighted or the data were
clustered.
df(#) specifies the degrees of freedom for the Hausman test. The default is the matrix rank of the
variance of the difference between the coefficients of the two estimators.
sigmamore and sigmaless specify that the two covariance matrices used in the test be based on a
common estimate of disturbance variance (σ 2 ).
sigmamore specifies that the covariance matrices be based on the estimated disturbance variance
from the efficient estimator. This option provides a proper estimate of the contrast variance for
so-called tests of exogeneity and overidentification in instrumental-variables regression.
sigmaless specifies that the covariance matrices be based on the estimated disturbance variance
from the consistent estimator.
These options can be specified only when both estimators store e(sigma) or e(rmse), or with
the xtreg command. e(sigma e) is stored after the xtreg command with the fe or mle option.
e(rmse) is stored after the xtreg command with the re option.
sigmamore or sigmaless are recommended when comparing fixed-effects and random-effects
linear regression because they are much less likely to produce a non–positive-definite-differenced
covariance matrix (although the tests are asymptotically equivalent whether or not one of the
options is specified).

 Advanced
tconsistent(string) and tefficient(string) are formatting options. They allow you to specify
the headers of the columns of coefficients that default to the names of the models. These options
will be of interest primarily to programmers.

Remarks and examples stata.com


hausman is a general implementation of Hausman’s (1978) specification test, which compares an
estimator θb1 that is known to be consistent with an estimator θb2 that is efficient under the assumption
being tested. The null hypothesis is that the estimator θb2 is indeed an efficient (and consistent)
estimator of the true parameters. If this is the case, there should be no systematic difference between
the two estimators. If there exists a systematic difference in the estimates, you have reason to doubt
the assumptions on which the efficient estimator is based.
The assumption of efficiency is violated if the estimator is pweighted or the data are clustered,
so hausman cannot be used. The test can be forced by specifying the force option with hausman.
For an alternative to using hausman in these cases, see [R] suest.
To use hausman, you
. (compute the always-consistent estimator)
. estimates store name-consistent
. (compute the estimator that is efficient under H 0 )
. hausman name-consistent .
4 hausman — Hausman specification test

Alternatively, you can turn this around:


. (compute the estimator that is efficient under H 0 )
. estimates store name-efficient
. (fit the less-efficient model )
. (compute the always-consistent estimator)
. hausman . name-efficient

You can, of course, also compute and store both the always-consistent and efficient-under-H0
estimators and perform the Hausman test with
. hausman name-consistent name-efficient

Example 1
We are studying the factors that affect the wages of young women in the United States between
1968 and 1988, and we have a panel-data sample of individual women over that time span.
. use https://www.stata-press.com/data/r17/nlswork4
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)
. describe
Contains data from https://www.stata-press.com/data/r17/nlswork4.dta
Observations: 28,534 National Longitudinal Survey of
Young Women, 14-24 years old in
1968
Variables: 6 29 Jan 2020 16:35
(_dta has notes)

Variable Storage Display Value


name type format label Variable label

idcode int %8.0g NLS ID


year byte %8.0g Interview year
age byte %8.0g Age in current year
msp byte %8.0g 1 if married, spouse present
ttl_exp float %9.0g Total work experience
ln_wage float %9.0g ln(wage/GNP deflator)

Sorted by: idcode year

We believe that a random-effects specification is appropriate for individual-level effects in our model.
We fit a fixed-effects model that will capture all temporally constant individual-level effects.
hausman — Hausman specification test 5

. xtreg ln_wage age msp ttl_exp, fe


Fixed-effects (within) regression Number of obs = 28,494
Group variable: idcode Number of groups = 4,710
R-squared: Obs per group:
Within = 0.1373 min = 1
Between = 0.2571 avg = 6.0
Overall = 0.1800 max = 15
F(3,23781) = 1262.01
corr(u_i, Xb) = 0.1476 Prob > F = 0.0000

ln_wage Coefficient Std. err. t P>|t| [95% conf. interval]

age -.005485 .000837 -6.55 0.000 -.0071256 -.0038443


msp .0033427 .0054868 0.61 0.542 -.0074118 .0140971
ttl_exp .0383604 .0012416 30.90 0.000 .0359268 .0407941
_cons 1.593953 .0177538 89.78 0.000 1.559154 1.628752

sigma_u .37674223
sigma_e .29751014
rho .61591044 (fraction of variance due to u_i)

F test that all u_i=0: F(4709, 23781) = 7.76 Prob > F = 0.0000

We assume that this model is consistent for the true parameters and store the results by using
estimates store under a name, fixed:
. estimates store fixed

Now we fit a random-effects model as a fully efficient specification of the individual effects
under the assumption that they are random and follow a normal distribution. We then compare these
estimates with the previously stored results by using the hausman command.
. xtreg ln_wage age msp ttl_exp, re
Random-effects GLS regression Number of obs = 28,494
Group variable: idcode Number of groups = 4,710
R-squared: Obs per group:
Within = 0.1373 min = 1
Between = 0.2552 avg = 6.0
Overall = 0.1797 max = 15
Wald chi2(3) = 5100.33
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

ln_wage Coefficient Std. err. z P>|z| [95% conf. interval]

age -.0069749 .0006882 -10.13 0.000 -.0083238 -.0056259


msp .0046594 .0051012 0.91 0.361 -.0053387 .0146575
ttl_exp .0429635 .0010169 42.25 0.000 .0409704 .0449567
_cons 1.609916 .0159176 101.14 0.000 1.578718 1.641114

sigma_u .32648519
sigma_e .29751014
rho .54633481 (fraction of variance due to u_i)
6 hausman — Hausman specification test

. hausman fixed ., sigmamore


Coefficients
(b) (B) (b-B) sqrt(diag(V_b-V_B))
fixed . Difference Std. err.

age -.005485 -.0069749 .0014899 .0004803


msp .0033427 .0046594 -.0013167 .0020596
ttl_exp .0383604 .0429635 -.0046031 .0007181

b = Consistent under H0 and Ha; obtained from xtreg.


B = Inconsistent under Ha, efficient under H0; obtained from xtreg.
Test of H0: Difference in coefficients not systematic
chi2(3) = (b-B)’[(V_b-V_B)^(-1)](b-B)
= 260.40
Prob > chi2 = 0.0000

Under the current specification, our initial hypothesis that the individual-level effects are adequately
modeled by a random-effects model is resoundingly rejected. This result is based on the rest of our
model specification, and random effects might be appropriate for some alternate model of wages.


Jerry Allen Hausman (1946– ) is an American economist and econometrician. He was born in
West Virginia and went on to study economics at Brown and Oxford. He joined the MIT faculty
in 1972 and continues to teach there. He currently researches new goods and their effects on
consumer welfare and its measurement in the Consumer Price Index along with regulation and
competition in the telecommunications industry.
Hausman is best known for his many contributions to econometrics. In 1978, he published his
now famous paper giving the Hausman specification test. The work remains one of the most
widely cited econometrics papers. He has also done extensive work in applied microeconomics
pertaining to governments role in the economy, including antitrust regulation, public finance, and
taxation.
In 1980, Hausman received the Frisch Medal, a biennial award from the Econometric Society
recognizing exceptional applied work, for his paper with David Wise on attrition bias. In 1985,
he won the John Bates Clark Award from the American Economics Association, which is given
for outstanding contributions to economics by an economist under 40 years of age. In 2012, the
Advances in Econometrics book series devoted an entire volume to Hausman and his contributions
to econometrics.
 

Example 2
A stringent assumption of multinomial and conditional logit models is that outcome categories
for the model have the property of independence of irrelevant alternatives (IIA). Stated simply, this
assumption requires that the inclusion or exclusion of categories does not affect the relative risks
associated with the regressors in the remaining categories.
One classic example of a situation in which this assumption would be violated involves the choice
of transportation mode; see McFadden (1974). For simplicity, postulate a transportation model with
the four possible outcomes: rides a train to work, takes a bus to work, drives the Ford to work, and
drives the Chevrolet to work. Clearly, “drives the Ford” is a closer substitute to “drives the Chevrolet”
than it is to “rides a train” (at least for most people). This means that excluding “drives the Ford”
from the model could be expected to affect the relative risks of the remaining options and that the
model would not obey the IIA assumption.
hausman — Hausman specification test 7

Using the data presented in [R] mlogit, we will use a simplified model to test for IIA. The choice
of insurance type among indemnity, prepaid, and uninsured is modeled as a function of age and
gender. The indemnity category is allowed to be the base category, and the model including all three
outcomes is fit. The results are then stored under the name allcats.
. use https://www.stata-press.com/data/r17/sysdsn3
(Health insurance data)
. mlogit insure age male
Iteration 0: log likelihood = -555.85446
Iteration 1: log likelihood = -551.32973
Iteration 2: log likelihood = -551.32802
Iteration 3: log likelihood = -551.32802
Multinomial logistic regression Number of obs = 615
LR chi2(4) = 9.05
Prob > chi2 = 0.0598
Log likelihood = -551.32802 Pseudo R2 = 0.0081

insure Coefficient Std. err. z P>|z| [95% conf. interval]

Indemnity (base outcome)

Prepaid
age -.0100251 .0060181 -1.67 0.096 -.0218204 .0017702
male .5095747 .1977893 2.58 0.010 .1219147 .8972346
_cons .2633838 .2787575 0.94 0.345 -.2829708 .8097383

Uninsure
age -.0051925 .0113821 -0.46 0.648 -.0275011 .0171161
male .4748547 .3618462 1.31 0.189 -.2343508 1.18406
_cons -1.756843 .5309602 -3.31 0.001 -2.797506 -.7161803

. estimates store allcats

Under the IIA assumption, we would expect no systematic change in the coefficients if we excluded
one of the outcomes from the model. (For an extensive discussion, see Hausman and McFadden
[1984].) We reestimate the parameters, excluding the uninsured outcome, and perform a Hausman
test against the fully efficient full model.
. mlogit insure age male if insure != "Uninsure":insure
Iteration 0: log likelihood = -394.8693
Iteration 1: log likelihood = -390.4871
Iteration 2: log likelihood = -390.48643
Iteration 3: log likelihood = -390.48643
Multinomial logistic regression Number of obs = 570
LR chi2(2) = 8.77
Prob > chi2 = 0.0125
Log likelihood = -390.48643 Pseudo R2 = 0.0111

insure Coefficient Std. err. z P>|z| [95% conf. interval]

Indemnity (base outcome)

Prepaid
age -.0101521 .0060049 -1.69 0.091 -.0219214 .0016173
male .5144003 .1981735 2.60 0.009 .1259874 .9028133
_cons .2678043 .2775563 0.96 0.335 -.276196 .8118046
8 hausman — Hausman specification test

. hausman . allcats, alleqs constant


Coefficients
(b) (B) (b-B) sqrt(diag(V_b-V_B))
. allcats Difference Std. err.

age -.0101521 -.0100251 -.0001269 .


male .5144003 .5095747 .0048256 .0123338
_cons .2678043 .2633838 .0044205 .

b = Consistent under H0 and Ha; obtained from mlogit.


B = Inconsistent under Ha, efficient under H0; obtained from mlogit.
Test of H0: Difference in coefficients not systematic
chi2(3) = (b-B)’[(V_b-V_B)^(-1)](b-B)
= 0.08
Prob > chi2 = 0.9944
(V_b-V_B is not positive definite)

The syntax of the if condition on the mlogit command simply identified the "Uninsured"
category with the insure value label; see [U] 12.6.3 Value labels. On examining the output from
hausman, we see that there is no evidence that the IIA assumption has been violated.
Because the Hausman test is a standardized comparison of model coefficients, using it with
mlogit requires that the base outcome be the same in both competing models. In particular, if the
most-frequent category (the default base outcome) is being removed to test for IIA, you must use the
baseoutcome() option in mlogit to manually set the base outcome to something else. Or you can
use the equation() option of the hausman command to align the equations of the two models.
Having the missing values for the square root of the diagonal of the covariance matrix of the
differences is not comforting, but it is also not surprising. This covariance matrix is guaranteed to be
positive definite only asymptotically (it is a consequence of the assumption that one of the estimators
is efficient), and assurances are not made about the diagonal elements. Negative values along the
diagonal are possible, and the fourth column of the table is provided mainly for descriptive use.
We can also perform the Hausman IIA test against the remaining alternative in the model:
. mlogit insure age male if insure != "Prepaid":insure
Iteration 0: log likelihood = -132.59913
Iteration 1: log likelihood = -131.78009
Iteration 2: log likelihood = -131.76808
Iteration 3: log likelihood = -131.76807
Multinomial logistic regression Number of obs = 338
LR chi2(2) = 1.66
Prob > chi2 = 0.4356
Log likelihood = -131.76807 Pseudo R2 = 0.0063

insure Coefficient Std. err. z P>|z| [95% conf. interval]

Indemnity (base outcome)

Uninsure
age -.0041055 .0115807 -0.35 0.723 -.0268033 .0185923
male .4591074 .3595663 1.28 0.202 -.2456296 1.163844
_cons -1.801774 .5474476 -3.29 0.001 -2.874752 -.7287968
hausman — Hausman specification test 9

. hausman . allcats, alleqs constant


Coefficients
(b) (B) (b-B) sqrt(diag(V_b-V_B))
. allcats Difference Std. err.

age -.0041055 -.0051925 .001087 .0021355


male .4591074 .4748547 -.0157473 .
_cons -1.801774 -1.756843 -.0449311 .1333421

b = Consistent under H0 and Ha; obtained from mlogit.


B = Inconsistent under Ha, efficient under H0; obtained from mlogit.
Test of H0: Difference in coefficients not systematic
chi2(3) = (b-B)’[(V_b-V_B)^(-1)](b-B)
= -0.18
Warning: chi2 < 0 ==> model fitted on these data
fails to meet the asymptotic assumptions
of the Hausman test; see suest for a
generalized test.

Here the χ2 statistic is actually negative. We might interpret this result as strong evidence that
we cannot reject the null hypothesis. Such a result is not an unusual outcome for the Hausman test,
particularly when the sample is relatively small — there are only 45 uninsured individuals in this
dataset.
Are we surprised by the results of the Hausman test in this example? Not really. Judging from
the z statistics on the original multinomial logit model, we were struggling to identify any structure
in the data with the current specification. Even when we were willing to assume IIA and computed
the efficient estimator under this assumption, few of the effects could be identified as statistically
different from those on the base category. Trying to base a Hausman test on a contrast (difference)
between two poor estimates is just asking too much of the existing data.

In example 2, we encountered a case in which the Hausman was not well defined. Unfortunately,
in our experience this happens fairly often. Stata provides an alternative to the Hausman test that
overcomes this problem through an alternative estimator of the variance of the difference between
the two estimators. This other estimator is guaranteed to be positive semidefinite. This alternative
estimator also allows a widening of the scope of problems to which Hausman-type tests can be applied
by relaxing the assumption that one of the estimators is efficient. For instance, you can perform
Hausman-type tests to clustered observations and survey estimators. See [R] suest for details.

Stored results
hausman stores the following in r():
Scalars
r(chi2) χ2
r(df) degrees of freedom for the statistic
r(p) p-value for the χ2
r(rank) rank of (V b-V B)^(-1)
10 hausman — Hausman specification test

Methods and formulas


The Hausman statistic is distributed as χ2 and is computed as

H = (βc − βe )0 (Vc − Ve )−1 (βc − βe )

where
βc is the coefficient vector from the consistent estimator
βe is the coefficient vector from the efficient estimator
Vc is the covariance matrix of the consistent estimator
Ve is the covariance matrix of the efficient estimator

When the difference in the variance matrices is not positive definite, a Moore–Penrose generalized
inverse is used. As noted in Gourieroux and Monfort (1995, 125–128), the choice of generalized
inverse is not important asymptotically.
The number of degrees of freedom for the statistic is the rank of the difference in the variance
matrices. When the difference is positive definite, this is the number of common coefficients in the
models being compared.

Acknowledgment
Portions of hausman are based on an early implementation by Jeroen Weesie of the Department
of Sociology at Utrecht University, The Netherlands.

References
Baltagi, B. H. 2011. Econometrics. 5th ed. Berlin: Springer.
Gourieroux, C. S., and A. Monfort. 1995. Statistics and Econometric Models, Vol 2: Testing, Confidence Regions,
Model Selection, and Asymptotic Theory. Trans. Q. Vuong. Cambridge: Cambridge University Press.
Hausman, J. A. 1978. Specification tests in econometrics. Econometrica 46: 1251–1271.
https://doi.org/10.2307/1913827.
Hausman, J. A., and D. L. McFadden. 1984. Specification tests for the multinomial logit model. Econometrica 52:
1219–1240. https://doi.org/10.2307/1910997.
McFadden, D. L. 1974. Measurement of urban travel demand. Journal of Public Economics 3: 303–328.
https://doi.org/10.1016/0047-2727(74)90003-6.

Also see
[R] lrtest — Likelihood-ratio test after estimation
[R] suest — Seemingly unrelated estimation
[R] test — Test linear hypotheses after estimation
[XT] xtreg — Fixed-, between-, and random-effects and population-averaged linear models

You might also like