Econometrics
with Stata
Carl Moody
Economics Department
College of William and Mary
2009
1
Table of Contents
1 AN OVERVIEW OF STATA ......................................................................................... 5
Transforming variables ................................................................................................... 7
Continuing the example .................................................................................................. 9
Reading Stata Output ...................................................................................................... 9
2 STATA LANGUAGE ................................................................................................... 12
Basic Rules of Stata ...................................................................................................... 12
Stata functions ............................................................................................................... 14
Missing values .......................................................................................................... 16
Time series operators. ............................................................................................... 16
Setting panel data. ..................................................................................................... 17
Variable labels. ......................................................................................................... 18
System variables ....................................................................................................... 18
3 DATA MANAGEMENT............................................................................................... 19
Getting your data into Stata .......................................................................................... 19
Using the data editor ................................................................................................. 19
Importing data from Excel ........................................................................................ 19
Importing data from comma separated values (.csv) files ........................................ 20
Reading Stata files: the use command ...................................................................... 22
Saving a Stata data set: the save command ................................................................... 22
Combining Stata data sets: the Append and Merge commands .................................... 22
Looking at your data: list, describe, summarize, and tabulate commands ................... 24
Culling your data: the keep and drop commands .......................................................... 27
Transforming your data: the generate and replace commands ..................................... 27
4 GRAPHS ........................................................................................................................ 29
5 USING DOFILES......................................................................................................... 34
6 USING STATA HELP .................................................................................................. 37
7 REGRESSION ............................................................................................................... 41
Linear regression ........................................................................................................... 43
Correlation and regression ............................................................................................ 46
How well does the line fit the data? .......................................................................... 48
Why is it called regression? .......................................................................................... 49
The regression fallacy ................................................................................................... 50
Horace Secrist ........................................................................................................... 53
Some tools of the trade ................................................................................................. 54
2
Summation and deviations .................................................................................... 54
Expected value .......................................................................................................... 55
Expected values, means and variance ....................................................................... 56
8 THEORY OF LEAST SQUARES ................................................................................ 57
Method of least squares ................................................................................................ 57
Properties of estimators................................................................................................. 58
Small sample properties ............................................................................................ 58
Bias ....................................................................................................................... 58
Efficiency .............................................................................................................. 58
Mean square error ................................................................................................. 59
Large sample properties ............................................................................................ 59
Consistency ........................................................................................................... 59
Mean of the sampling distribution of
ˆ
β ................................................................... 63
Variance of
ˆ
β ............................................................................................................ 63
Consistency of OLS .................................................................................................. 64
Proof of the GaussMarkov Theorem ........................................................................... 65
Inference and hypothesis testing ................................................................................... 66
Normal, Student’s t, Fisher’s F, and Chisquare ....................................................... 66
Normal distribution ................................................................................................... 67
Chisquare distribution.............................................................................................. 68
Fdistribution............................................................................................................. 70
tdistribution .............................................................................................................. 71
Asymptotic properties ................................................................................................... 73
Testing hypotheses concerning þ .................................................................................. 73
Degrees of Freedom ...................................................................................................... 74
Estimating the variance of the error term ..................................................................... 75
Chebyshev’s Inequality ................................................................................................. 78
Law of Large Numbers ................................................................................................. 79
Central Limit Theorem ................................................................................................. 80
Method of maximum likelihood ................................................................................... 84
Likelihood ratio test .................................................................................................. 86
Multiple regression and instrumental variables ............................................................ 87
Interpreting the multiple regression coefficient. ....................................................... 90
Multiple regression and omitted variable bias .............................................................. 92
The omitted variable theorem ....................................................................................... 93
Target and control variables: how many regressors? .................................................... 96
Proxy variables.............................................................................................................. 97
Dummy variables .......................................................................................................... 98
Useful tests .................................................................................................................. 102
Ftest ....................................................................................................................... 102
Chow test ................................................................................................................ 102
Granger causality test ............................................................................................. 105
Jtest for nonnested hypotheses ............................................................................. 107
LM test .................................................................................................................... 108
9 REGRESSION DIAGNOSTICS ................................................................................. 111
3
Influential observations ............................................................................................... 111
DFbetas ................................................................................................................... 113
Multicollinearity ......................................................................................................... 114
Variance inflation factors ........................................................................................ 114
10 HETEROSKEDASTICITY ....................................................................................... 117
Testing for heteroskedasticity ..................................................................................... 117
BreuschPagan test .................................................................................................. 117
White test ................................................................................................................ 120
Weighted least squares ................................................................................................ 120
Robust standard errors and tratios ............................................................................. 123
11 ERRORS IN VARIABLES ....................................................................................... 127
Cure for errors in variables ......................................................................................... 128
Two stage least squares ............................................................................................... 129
HausmanWu test ........................................................................................................ 129
12 SIMULTANEOUS EQUATIONS............................................................................. 132
Example: supply and demand ..................................................................................... 133
Indirect least squares ................................................................................................... 135
The identification problem .......................................................................................... 136
Illustrative example ..................................................................................................... 137
Diagnostic tests ........................................................................................................... 139
Tests for over identifying restrictions ..................................................................... 140
Test for weak instruments ....................................................................................... 143
HausmanWu test .................................................................................................... 143
Seemingly unrelated regressions................................................................................. 146
Three stage least squares ............................................................................................. 146
Types of equation systems .......................................................................................... 148
Strategies for dealing with simultaneous equations .................................................... 149
Another example ......................................................................................................... 150
Summary ..................................................................................................................... 152
13 TIME SERIES MODELS .......................................................................................... 154
Linear dynamic models ............................................................................................... 154
ADL model ............................................................................................................. 154
Lag operator ............................................................................................................ 155
Static model ............................................................................................................ 156
AR model ................................................................................................................ 156
Random walk model ............................................................................................... 156
First difference model ............................................................................................. 156
Distributed lag model .............................................................................................. 157
Partial adjustment model......................................................................................... 157
Error correction model ............................................................................................ 157
CochraneOrcutt model ........................................................................................... 158
14 AUTOCORRELATION ............................................................................................ 159
Effect of autocorrelation on OLS estimates ................................................................ 160
Testing for autocorrelation .......................................................................................... 161
The Durbin Watson test .......................................................................................... 161
The LM test for autocorrelation .............................................................................. 162
4
Testing for higher order autocorrelation ................................................................. 164
Cure for autocorrelation .............................................................................................. 165
The CochraneOrcutt method ................................................................................. 165
Curing autocorrelation with lags ............................................................................. 168
Heteroskedastic and autocorrelation consistent standard errors ................................. 169
Summary ..................................................................................................................... 170
15 NONSTATIONARITY, UNIT ROOTS, AND RANDOM WALKS ....................... 171
Random walks and ordinary least squares .................................................................. 172
Testing for unit roots ................................................................................................... 173
Choosing the number of lags with F tests ............................................................... 176
Digression: model selection criteria........................................................................ 178
Choosing lags using model selection criteria.......................................................... 179
DFGLS test ................................................................................................................ 181
Trend stationarity vs. difference stationarity .............................................................. 182
Why unit root tests have nonstandard distributions .................................................... 183
16 ANALYSIS OF NONSTATIONARY DATA .......................................................... 185
Cointegration............................................................................................................... 188
Dynamic ordinary least squares .................................................................................. 191
Error correction model ................................................................................................ 191
17 PANEL DATA MODELS ......................................................................................... 194
The Fixed Effects Model ............................................................................................ 194
Time series issues ....................................................................................................... 202
Linear trends ........................................................................................................... 202
Unit roots and panel data ........................................................................................ 206
Clustering .................................................................................................................... 212
Other Panel Data Models ............................................................................................ 214
Digression: the between estimator .......................................................................... 214
The Random Effects Model .................................................................................... 214
Choosing between the Random Effects Model and the Fixed Effects Model. ....... 215
HausmanWu test again .......................................................................................... 216
The Random Coefficients Model. ........................................................................... 216
Index ............................................................................................................................... 218
5
1 AN OVERVIEW OF STATA
Stata is a computer program that allows the user to perform a wide variety of statistical analyses. In this
chapter we will take a short tour of Stata to get an appreciation of what it can do.
Starting Stata
Stata is available on the server. When invoked, the screen should look something like this. (This is the current
version on my computer. The version on the server may be different.)
There are many ways to do things in Stata. The simplest way is to enter commands interactively, allowing
Stata to execute each command immediately. The commands are entered in the “Stata Command” window
(along the bottom in this view). Results are shown on the right. After the command has been entered it appears
in the “Review” window in the upper left. If you want to reenter the command you can double click on it. A
single click moves it to the command line, where you can edit it before submitting. You can also use the
buttons on the toolbar at the top. I recommend using “do” files consisting of a series of Stata commands for
6
complex jobs (see Chapter 5 below). However, the way to learn a statistical language is to do it. Different
people will find different ways of doing the same task.
An econometric project consists of several steps. The first is to choose a topic. Then you do library
research to find out what research has already been done in the area. Next, you collect data. Once you have the
data, you are ready to do the econometric analysis. The econometric part consists of four steps, which may be
repeated several times. (1) Get the data into Stata. (2) Look at the data by printing it out, graphing it, and
summarizing it. (3) Transform the data. For example, you might want to divide by population to get per capita
values, or divide by a price index to get real dollars, or take logs. (4) Analyze the transformed data using
regression or some other procedure. The final step in the project is to write up the results. In this chapter we
will do a minianalysis illustrating the four econometric steps.
Your data set should be written as a matrix (a rectangular array) of numbers, with columns
corresponding to variables and rows corresponding to observations. In the example below, we have 10
observations on 4 variables. The data matrix will therefore have four columns and 10 rows. Similarly, you
could have observations on consumption, income and wealth for the United States annually for the years
19481998. The data matrix will consist of three columns corresponding to the variables consumption,
income, and wealth, while there will be 51 rows (observations) corresponding to the years 1948 to 1998.
Consider the following data set. It corresponds to some data collected at a local recreation
association and consists of the year, the number of members, the annual membership dues, and the consumer
price index. The easiest way to get these data into Stata is to use Stata’s Data Editor. Click on Data on the
task bar then click on Data Editor in the resulting drop down menu, or click on the Data Editor button. The
result should look like this.
Type the following numbers into the corresponding columns of the data editor. Don’t type the
variable names, just the numbers. You can type across the rows or down the columns. If you type across the
rows, use tab to enter the number. If you type down the columns, use the enter key after typing each value.
Year Members Dues CPI
78 126 135 .652
79 130 145 .751
80 123 160 .855
81 122 165 .941
82 115 190 1.000
83 112 195 1.031
84 139 175 1.077
85 127 180 1.115
7
86 133 180 1.135
87 112 190 1.160
When you are done typing your data editor screen should look like this.
Now rename the variables. Double click on “var1” at the top of column one and type “year” into the resulting
dialog box. Repeat for the rest of the variables (members, dues, cpi). Notice that the new variable names are in
the variables box in the lower left. Click on “preserve” and then the red x on the data editor header to exit the
data editor.
Transforming variables
Since we have seen the data, we don’t have to look at it any more. Let’s move on to the transformations.
Suppose we want Stata to compute the logarithms of members and dues and create a dummy variable for an
advertising campaign that took place between 1984 and 1987. Type the following commands one at a time
into the “Stata command” box along the bottom. Hit <enter> at the end of each line to execute the command.
Gen is short for “generate” and is probably the command you will use most. The list command shows you the
current data values.
gen logmem=log(members)
gen logdues=log(dues)
gen dum=(year>83)
list
[output suppressed]
We almost always want to transform the variables in our dataset before analyzing them. In the
example above, we transformed the variables members and dues into logarithms. For a Phillips curve, you
might want the rate of inflation as the dependent variable while using the inverse of the unemployment rate as
the independent variable. All transformations like these are accomplished using the gen command. We have
already seen how to tell Stata to take the logarithms of variables. The logarithm function is one of several in
8
Stata's function library. Some of the functions commonly used in econometric work are listed in Chapter 2,
below. Along with the usual addition, subtraction, multiplication, and division, Stata has exponentiation,
absolute value, square root, logarithms, lags, differences and comparisons. For example, to compute the
inverse of the unemployment rate, we would write,
gen invunem=1/unem
where unem is the name of the unemployment variable.
As another example, suppose we want to take the first difference of the time series variable GNP
(defined as GNP(t)GNP(t1)), To do this we first have to tell Stata that you have a time series data set using
the tsset command.
tsset year
Now you can use the “d.” operator to take first differences (
t
y ∆ ).
gen dgnp=d.gnp
I like to use capital letters for these operators to make them more visible.
gen dcpi=D.cpi
The second difference
2
1 1
( )
t t t t t t
cpi cpi cpi cpi cpi cpi
− −
∆∆ = ∆ = ∆ − = ∆ − ∆ be calculated as,
gen d2cpi=D.dcpi
or
gen d2cpi=DD.cpi
or
gen d2cpi=D2.cpi
Lags can also be calculated. For example, the one period lag of the CPI (cpi(t1)) can be calculated
as,
gen cpi_1=L.cpi
These operators can be combined. The lagged difference
1 t
cpi
−
∆ can be written as
gen dcpi_1=LD.cpi
It is often necessary to make comparisons between variables. In the example above we created a
dummy variable, called "DUM," by the command
gen dum=(year>83)
This works as follows: for each line of the data, Stata reads the value in the column corresponding to the
variable YEAR. It then compares it to the value 83. If YEAR is greater than 83 then the variable "DUM" is
given the value 1 (“true”). If YEAR is not greater than 83, "DUM" is given the value 0 (“false”).
9
Continuing the example
Let’s continue the example by doing some more transformations and then some analyses. Type the following
commands (indicated by the courier new typeface), one at a time, ending with <enter> into the Stata
command box.
replace cpi=cpi/1.160
(Whenever you redefine a variable, you must use replace instead of generate.)
gen rdues=dues/cpi
gen logrdues=log(rdues);
gen trend=year78
list
summarize
Variable  Obs Mean Std. Dev. Min Max
+
year  10 82.5 3.02765 78 87
members  10 123.9 8.999383 112 139
dues  10 171.5 20.00694 135 195
cpi  10 .9717 .1711212 .652 1.16
logmem  10 4.817096 .0727516 4.718499 4.934474
+
logdues  10 5.138088 .1219735 4.905275 5.273
dum  10 .4 .5163978 0 1
trend  10 5.5 3.02765 1 10
(Note: rdues is real (inflationadjusted) dues, the log function produces natural logs, and the trend is simply a
counter (0, 1, 2, etc. for each year, usually called “t” in theoretical discussions.) The summarize command
produces means, standard deviations, etc.
We are now ready to do the analysis, a simple ordinary least squares (OLS) regression of members on real
dues.
regress members rdues
This command produces the following output.
Source  SS d1 MS uumber o1 obs = 10
+ I{ 1, 8) = 0.69
Mode1  58.1762689 1 58.1762689 Þrob > I = 0.4290
8esìdua1  670.722621 8 82.8404529 8squared = 0.0798
+ ^dj 8squared = 0.0252
1ota1  728.9 9 80.9888889 8oot MSL = 9.1564

members  Coe1. Std. Lrr. t Þ>t ]95x Con1. 1nterva11
+
rdues  .1210586 .1572227 0.82 0.429 .4928684 .2217512
_cons  151.0824 22.76126 4.61 0.002 75.52582 226.621

Reading Stata Output
10
Most of this output is selfexplanatory. The regression output immediately above is read as follows. The
upper left section is called the Analysis of Variance. Under SS is the model sum of squares (58.1763689).
This number is referred to in a variety of ways. Each textbook writer chooses his or her own convention.
For example, the model sum of squares is also known as the explained sum of squares, ESS (or, sometimes
SSE), or the regression sum of squares, RSS (or SSR). Mathematically, it is
2
1
ˆ
N
i
i
y
=
¯
where the circumflex
indicates that y is predicted by the regression. The fact that y is in lower case indicates that it is expressed
as deviations from the sample mean The Residual SS (670.723631) is the residual sum of squares (a.k.a.
RSS, SSR unexplained sum of squares, SSU, USS, and error sum of squares, ESS, SSE). I will refer to the
model sum of squares as the regression sum of squares, RSS and the residual sum of squares as the error
sum of squares, ESS. The sum of the RSS and the ESS is the total sum of squares, TSS. The next column
contains the degrees of freedom. The model degrees of freedom is the number of independent variables, not
including the constant term. The residual degrees of freedom is the number of observations, or sample size,
N (=10) minus the number of parameters estimated, k (=2), so that Nk=8. The third column is the “mean
square” or the sum of squares divided by the degrees of freedom. The residual mean square is also known
as the “mean square error” (MSE), the residual variance, or the error variance. Mathematically it is
2 2 2
1
ˆ ˆ ( )
N
u i
i
e N k σ σ
=
 
= = −

\ ¹
¯
Where e is the residual from the regression (the difference between the actual value of the dependent
variable and the predicted value from the regression:
ˆ ˆ
ˆ
ˆ
ˆ
ˆ
i i
i i i
i
Y X
Y Y e
X e
α β
α β
= +
= +
= + +
ˆ
ˆ
i i i
e Y X α β = − −
The panel on the top right has the number of observations, the overall Fstatistic that tests the null
hypothesis that the Rsquared is equal to zero, the probvalue corresponding to this Fratio, Rsquared, the
Rsquared adjusted for the number of explanatory variables (a.k.a. Rbar squared), and the root mean
square error (the square root of the residual sum of squares divided by its degrees of freedom, also known
as the standard error of the estimate, SEE, or the standard error of the regression, SER). Since the MSE is
the error variance, the SEE is the corresponding standard deviation (square root). The bottom table shows
the variable names, the corresponding estimated coefficients,
ˆ
β , the standard errors,
ˆ
S
β
the tratios, the
probvalues, and the confidence interval around the estimates.
The estimated coefficient for rdues is
2
ˆ
( .1310586) xy x β = ¯ ¯ = − .
{The other coefficient is the intercept or constant term,
ˆ
ˆ ( 151.0834) Y X α β = − = .}
The corresponding standard error for
ˆ
β is
2
ˆ
2
ˆ
( .1573327) S
x
β
σ
= =
¯
.
The resulting tratio for the null hypothesis that þ=0 is
ˆ
ˆ
( 0.83) T S
β
β = = − .
The probvalue for the ttest of no significance is the probability that the test statistic T would be observed
if the null hypothesis is true, using Student’s t distribution, twotailed. If this value is smaller than .05 then
11
we “reject the null hypothesis that þ=0 at the five percent significance level.” We usually just say that þ is
significant.
Let’s add the time trend and the dummy for the advertising campaign and do a multiple regression, using the
logarithm of members as the dependent variable and the log of real dues as the price.
regress logmem logrdues trend dum
Source  SS df MS Number of obs = 10
+ F( 3, 6) = 8.34
Model  .038417989 3 .012805996 Prob > F = 0.0146
Residual  .009217232 6 .001536205 Rsquared = 0.8065
+ Adj Rsquared = 0.7098
Total  .047635222 9 .005292802 Root MSE = .03919

logmem  Coef. Std. Err. t P>t [95% Conf. Interval]
+
logrdues  .6767872 .36665 1.85 0.114 1.573947 .2203729
trend  .0432308 .0094375 4.58 0.004 .0663234 .0201382
dum  .155846 .0608159 2.56 0.043 .007035 .3046571
_cons  8.557113 1.989932 4.30 0.005 3.687924 13.4263

The coefficient on logrdues is the elasticity of demand for pool membership. (Mathematically, the
derivative of a log with respect to a log is an elasticity.) Is the demand curve downward sloping? Is price
significant in the demand equation at the five percent level? At the ten percent level? The trend is the
percent change in pool membership per year. (The derivative of a log with respect to a nonlogged value is
a percent change.) According to this estimate, the pool can expect to lose four percent of its members every
year, everything else staying the same (ceteris paribus). On the other hand, continuing the advertising
campaign will generate a 16 percent increase in pool memberships every year. Note that, since this is a
multiple regression, the estimated coefficients, or parameters, are the change in the dependent variable for a
oneunit change in the corresponding independent variable, holding everything else constant.
Saving your first Stata analysis
Let’s assume that you have created a directory called “project” on the H: drive. To save the data for further
analysis, type
save “H:\Project\test.dta”
will create a Stata data set called “test.dta” in the directory H:\Project. You don’t really need the quotes unless
there are spaces in the pathname. However, it is good practice. You can read the data into Stata later with the
command,
use “H:\Project\test.dta”
You can also save the data by clicking on File, Save and browsing to the Project folder or by clicking on the
little floppy disk icon on the button bar.
You can save your output by going to the top of the output window (use the slider on the right hand side),
clicking on the first bit of output you want to save, then drag the cursor, holding the mouse key down, to the
bottom of the output window. Once you have blocked the text you want to save. You can type <control>c for
copy (or click on edit, copy text), go to your favorite word processor, and type <control>v for paste (or click
on edit, paste). You could also create a log file containing your results. We discuss how to create log files
below.
You can then write up your results around the Stata output.
12
2 STATA LANGUAGE
This section is designed to provide a more detailed introduction to the language of Stata. However, it
is still only an introduction, for a full development you must refer to the relevant Stata manual, or use Stata
Help (see Chapter 6).
Basic Rules of Stata
1. Rules for composing variable names in Stata.
a. Every name must begin with a letter or an underscore. However, I recommend that you not use
underscore at the beginning of any variable name, since Stata frequently uses that convention to
make its own internal variables. For example, _n is Stata’s variable containing the observation
number.
b. Subsequent characters may be letters, numbers, or underscores, e.g., var1. You cannot use special
characters (&,#, etc.).
c. The maximum number of characters in a name is 32, but shorter names work best.
d. Case matters. The names Var, VAR, and var are all considered to be different variables in
Stata notation. I recommend that you use lowercase only for variable names.
e. The following variable names are reserved. You may not use these names for your variables.
_all double long _rc _b float
_n _se byte if _N _skip
_coeff in _pi using _cons int
_pred with e
2. Command syntax.
The typical Stata command looks like this.
[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [, options]
13
where things in a square bracket are optional. Actually the varlist is almost always required.
A varlist is a list of variable names separated by spaces. If no varlist is specified, then, usually, all the
variables are used. For example, in the simple program above, we used the command “list” with no varlist
attached. Consequently, Stata printed out all the data for all the variables currently in memory. We could have
said, list members dues rdues to get a list of those three variables only.
The “=exp” option is used for the “generate” and “replace” commands, e.g., gen logrdues=log(dues) and
replace cpi=cpi/1.160.
The “if exp” option restricts the command to a certain subset of observations. For example, we could have
written
list year members rdues if year == 83.
This would have produced a printout of the data for year, members, and rdues for 1983. Note that Stata
requires two equal signs to denote equality because one equal sign indicates assignment in the generate and
replace commands. (Obviously, the equal sign in the expression replace cpi=cpi/1.160 indicates that
the variable cpi is to take the value of the old cpi divided by 1.160. It is not testing for the equality of the left
and right hand sides of the expression.)
The “in range” option also restricts the command to a certain range. It takes the form in
number1/number2. For example,
list year members rdues in 5/10
would list the observations from 5 to 10. We can also use the letter f (first) and l (last) in place of either
number1 or number 2.
The weight option is always in square brackets and takes the form
[weightword=exp]
where the weight word is one of the following (default weight for each particular command)
aweight (analytical weights: inverse of the variance of the observation. This is the most commonly
used weight by economists. Each observation in our typical sample is an average across counties,
states, etc. The aweight is the variance divided by the number of elements, usually population, that
comprise the county, state, etc. For most Stata commands, the aweight is rescaled to sum to the
number of observations in the sample.)
fweight (frequency weights: indicates duplicate observations, an fweight of 100 indicates that
there are really 100 identical observations)
pweight (sampling weights: the inverse of the probability that this observation is included in the
sample. A pweight of 100 indicates that this observation is representative of 100 subjects in the
population.)
iweight (importance weight: indicates the relative importance of each observation. The definition
of iweight depends on how it is defined in each command that uses them).
Options. Each Stata command takes commandspecific options. Options must be separated from the rest of the
command by a comma. For example, the “reg” command that we used above has the option “robust” which
produces heteroskedasticconsistent standard errors. That is,
14
reg members rdues trend, robust
by varlist: Almost all Stata commands can be repeated automatically for each value of the variable or
variables in the by varlist. This option must precede the command and be separated from the rest of the
command by a colon. The data must be sorted by the varlist. For example the commands,
sort dum
by dum: summarize members rdues
will produce means, etc. for the years without an advertising campaign (dum=0) and the years with an
advertising campaign (dum=1). In this case the sort command is not necessary, since the data are already
sorted by dum, but it never hurts to sort before using the by varlist option.
Arithmetic operators are:
+ add
 subtract
* multiply
/ divide
^ raise to a power
Comparison operators
== equal to
~= not equal to
> greater than
< less than
>= greater than or equal to
<= less than or equal to
Logical operators
& and
 or
~ not
Comparison and logical operators return a value of 1 for true and 0 for false. The order of operations is: ~, ^, 
(negation), /, *, (subtraction), +, ~=, >, <, <=, >=, ==, &, .
Stata functions
These are only a few of the many functions in Stata, see STATA USER’S GUIDE for a comprehensive list.
Mathematical functions include:
abs(x) absolute value
max(x) returns the largest value
min(x) returns the smallest value
sqrt(x) square root
exp(x) raises e (2.17828) to a specified power
log(x) natural logarithm
Some useful statistical functions are the following.
Chi2(df, x) returns the cumulative value of the chisquare with df degrees of freedom for a value of x.
15
For example, suppose we know that x=2.05 is distributed according to chisquare with 2 degrees of freedom.
What is the probability of observing a number as large as 2.05 if the true value of x is zero? We can compute
the following in Stata. Because we are not generating variables, we use the scalar command.
. scalar prob1=chi2(2,2.05)
. scalar list prob1
prob1 = .64120353
We apparently have a pretty good chance of observing a number as large as 2.05.
To find out if x=2.05 is significantly different from zero in the chisquare with 2 degrees of freedom we can
compute the following to get the prob value (the area in the right hand tail of the distribution).
. scalar prob=1chi2(2,2.05)
. scalar list prob
prob = .35879647
Since this value is not smaller than .05 it is not significantly different from zero at the .05 level.
Chi2tail(df,x) returns the upper tail of the same distribution. We should get the same answer as above using
this function.
. scalar prob2=chi2tail(2,2.05)
. scalar list prob2
prob2 = .35879647
Invchi2(df,p) returns the value of x for which the probability is p and the degrees of freedom are df.
. scalar x1=invchi2(2,.641)
. scalar list x1
x1 = 2.0488658
Invchi2tail is the inverse function for chi2tail.
. scalar x2=invchi2tail(2,.359)
. scalar list x2
x2 = 2.0488658
Norm(x) is the cumulative standard normal. To get the probability of observing a number as large as 1.65 if
the mean of the distribution is zero (and the standard deviation is one), i.e., P(z<1.65) is
. scalar prob=norm(1.65)
. scalar list prob
prob = .9505285
Invnorm(prob) is the inverse normal distribution so that, if prob=norm(z) then z=invnorm(prob).
Wooldridge uses the Stata functions norm, invttail, invFtail, and invchi2tail to generate the normal, t, F, and
chiisquare tables in the back of his textbook, which means that we don’t have to look in the back of textbooks
for statistical distributions any more.
16
Uniform() returns uniformly distributed pseudorandom numbers between zero and one. Even though
uniform() takes no parameters, the parentheses are part of the name of the function and must be typed. For
example to generate a sample of uniformly distributed random numbers use the command
. gen v=uniform()
. summarize v
Variable  Obs Mean Std. Dev. Min Max
+
v  51 .4819048 .2889026 .0445188 .9746088
To generate a sample of normally distributed random numbers with mean 2 and standard deviation 10, use the
command,
. gen z=2+10*invnorm(uniform())
. summarize z
Variable  Obs Mean Std. Dev. Min Max
+
z  51 1.39789 8.954225 15.87513 21.83905
Note that both these functions return 51 observations on the pseudorandom variable. Unless you tell Stata
otherwise, it will create the number of observations equal to the number of observations in the data set. I used
the crime1990.dta data set which has 51 observations, one for each state and DC.
The pseudo random number generator uniform() generates the same sequence of random numbers in each
session of Stata, unless you reinitialize the seed. To do that, type,
set seed #
If you set the seed to # you get a different sequence of random numbers. This makes it difficult to be sure you
are doing what you think you are doing. Unless I am doing a series of Monte Carlo exercises, I leave the seed
alone.
Missing values
Missing values are denoted by a single period (.) Any arithmetic operation on a missing value results in a
missing value. Stata considers missing values as larger than any other number. So, sorting by a variable
which has missing values will put all observations corresponding to missing at the end of the data set.
When doing statistical analyses, Stata ignores (drops) observations with missing values. So, for example if
you did a regression of crime on police and the unemployment rate across states, Stata will drop any
observations for which there are missing values in either police or unemployment (“case wise deletion”). In
other commands such as the summarize command, for example, Stata will simply ignore any missing
values for each variable separately when computing the mean, standard deviations, etc.
Time series operators.
The tsset command must be used to indicate that the data are time series. For example.
sort year
17
tsset year
Lags: the oneperiod lag of a variable such as the money supply, ms, can be written as
L.ms
So that L.ms is equal to ms(t1). Similarly, the second period lag could be written as
L2.ms
or, equivalently, as
LL.ms, etc.
For example, the command
gen rdues_1 = L.rdues
will generate a new variable called rdues_1 which is the oneperiod lag of rdues. Note that this will produce
a missing value for the first observation of rdues_1.
Leads: the oneperiod lead is written as
F.ms (=ms(t+1)).
The twoperiod lead is written as F2.ms or FF.ms.
Differences: the firstdifference is written as
D.ms (=ms(t)  ms(t1)).
The second difference (difference of difference) is
D2.ms or DD.ms. (=(ms(t)ms(t1)) – (ms(t1)ms(t2)))
Seasonal difference: the first seasonal difference is written
S.ms
and is the same as
D.ms.
The second seasonal difference is
S2.ms
and is equal to ms(t)ms(t2).
Time series operators can be combined. For example, for monthly data, the difference of the 12month
difference would be written as DS12.ms. Time series operators can be written in upper or lowercase. I
recommend upper case so that it is more easily readable as an operator as opposed to a variable name (which I
like to keep in lowercase).
Setting panel data.
If you have a panel of 50 states (150) across 20 years (19771996), you can use time series operators, but you
must first tell Stata that you have a panel (pooled time series and crosssection). The following commands sort
18
the data, first by state, then by year, define state and year as the crosssection and time series identifiers, and
generate the lag of the variable x.
sort state year
tsset state year
gen x_1=L.x
Variable labels.
Each variable automatically has an 80 character label associated with it, initially blank. You can use the
“label variable” command to define a new variable label. For example,
Label variable dues “Annual dues per family”
Label variable members “Number of members”
Stata will use variable labels, as long as there is room, in all subsequent output. You can also add variable
labels by invoking the Data Editor, then double clicking on the variable name at the head of each column
and typing the label into the appropriate space in the dialog box.
System variables
System variables are automatically produced by Stata. Their names begin with an underscore.
_n is the number of the current observation
You can create a counter or time trend if you have time series data by using the command
gen t=_n
_N is the total number of observations in the data set.
_pi is the value of a to machine precision.
19
3 DATA MANAGEMENT
Getting your data into Stata
Using the data editor
We already know how to use the data editor from the first chapter.
Importing data from Excel
Generally speaking, the best way to get data into Stata is through a spreadsheet program such as Excel. For
example, data presented in columns in a text file or Web page can be cut and pasted into a text editor or word
processor and saved as a text file. This file is not readable as data because such files are organized as rows
rather than columns. We need to separate the data into columns corresponding to variables.
Text files can be opened and read into Excel using the Text Import Wizard. Click on File, Open and browse to
the location of the file. Follow the Text Import Wizard’s directions. This creates separate columns of numbers,
which will eventually correspond to variables in Stata. After the data have been read in to Excel, the data can
be saved as a text file. Clicking on File, Save As in Excel reveals the following dialogue box.
20
If the data have been read from a .txt file, Excel will assume you want to save it in the same file. Clicking on
Save will save the data set with tabs between data values. Once the tab delimited text file has been created, it
can be read into Stata using the insheet command.
Insheet using filename
For example,
insheet using “H:\Project\gnp.txt”
The insheet command will figure out that the file is tab delimited and that the variable names are at the top.
The insheet command can be invoked by clicking on File, Import, and “Ascii data created by a spreadsheet.”
The following dialog box will appear.
Clicking on Browse allows you to navigate to the location of the file. Once you get to the right directory, click
on the downarrow next to “Files of type” (the default file type is “Raw,” whatever that is) to reveal
Choose .txt files if your files are tab delimited.
Importing data from comma separated values (.csv) files
The insheet command can be used to read other file formats such as csv (comma separated values) files. Csv
files are frequently used to transport data. For example, the Bureau of Labor Statistics allows the public to
download data from its website. The following screen was generated by navigating the BLS website
http://www.bls.gov/.
21
Note that I have selected the years 19562004, annual frequency, column view, and text, comma delimited
(although I could have chosen tab delimited and the insheet command would read that too). Clicking on
Retrieve Data yields the following screen.
Use the mouse (or hold down the shift key and use the down arrow on the keyboard) to highlight the data
starting at Series Id, copy (controlc) the selected text, invoke Notepad, WordPad, or Word, and paste
(controlv) the selected text. Now Save As a csv file, e.g., Save As cpi.csv. (Before you save, you should
delete the comma after Value, which indicates that there is another variable after Value. This creates a
22
variable, v5, with nothing but missing values in Stata. You don’t really have to do this because the variable
can be dropped in Stata, but it is neater if you do.) The Series Id and period can be dropped once the data are
in Stata.
You can read the data file with the insheet command, e.g.,
insheet using “H:\Project\cpi.csv”
or you can click on Data, Import, and Ascii data created by a spreadsheet, as above.
Reading Stata files: the use command
If the data are already in a Stata data set (with a .dta suffix), it can be read with the “use” command. For
example, suppose we have saved a Stata file called project.dta in the H:\Project directory. The command to
read the data into Stata is
use “h:\Project\project.dta”, clear
The “clear” option is to remove any unwanted or leftover data from previous analyses. Alternatively you can
click on the “open (use)’ button on the taskbar or click on File, Open.
Saving a Stata data set: the save command
Data in memory can be saved in a Stata (.dta) data set with the “save” command. For example, the cpi data
can be saved by typing
save h:\Project\cpi.dta”
If you want the data set to be readable by older versions of Stata use the “saveold” command,
saveold “h:\Project\cpi.dta”
The quotes are not required. However, you might have trouble saving (or using) files without the quotes if
your pathname contains spaces.
If a file by the name of cpi.dta already exists, e.g., you are updating a file, you must tell Stata that you want to
overwrite the old file. This is done by adding the option, replace.
save “h:\Project\cpi.dta”, replace.
Combining Stata data sets: the Append and
Merge commands
Suppose we want to combine the data in two different Stata data sets. There are two ways in which data can
be combined. The first is to stack one data set on top of the other, so that some or all the series have
additional observations. The “append” command is used for this purpose. For example, suppose we have a
23
data set called “project.dta” which has data on three variables (gnp, ms, cpi) from 1977 to 1998 and another
called “project2.dta.” which has data on two of the three variables (ms and cpi) for the years 19992002. To
combine the data sets, we first read the early data into Stata with the use command
use “h:\Project\project.dta”
The early data are now residing in memory. We append the newer data to the early data as follows
append using “h:\Project\project2.dta”
At this point you probably want to list the data and perhaps sort by year or some other identifying variable.
Then save the expanded data set with the save command. If you do not get the result you were expecting,
simply exit Stata without saving the data.
The other way to combine data sets is to merge two sets together. This is concatenating horizontally instead of
vertically. It makes the data set wider in the sense that, if one data set has variables that the other does not, the
combined data set will have variables from both. (Of course, there could be some overlap in the variables, so
one data set will overwrite the data in the other. This is another way of updating a data set.)
There are two kinds of merges. The first is the onetoone merge where the first observation of the first data
set is matched with the first observation of the second data set. You must be very careful that each data set has
its observations sorted exactly the same way. I do not recommend one to one merges as a general principle.
The preferred method is a “match merge,” where the two datasets have one or more variables in common.
These variables act as identifiers so that Stata can match the correct observations. For example, suppose you
have data on the following variables: year pop crmur crrap crrob crass crbur from 1929 to 1998 in a data set
called crime1.dta in the h:\Project directory. Suppose that you also have a data set called arrests.dta
containing year pop armur arrap arrob arass arbur (corresponding arrest rates) for the years 19521998. You
can combine these two data sets with the following commands
Read and sort the first data set.
use “H:\Project\crime1.dta”, clear
sort year
list
save “H:\Project\crime1.dta”, replace
Read and sort the second data set.
use “H:\Project\arrests.dta”, clear
sort year
list
save “H:\Project\arrests.dta”, replace
Read the first data set back into memory, clearing out the data from the second file. This is the “master” data
set.
use “H:\Project\crime1.dta”, clear
Merge with the second data set by year. This is the “using” data set.
merge year using “H:\Project\arrests.dta”
Look at the data, just to make sure.
list
24
Assuming it looks ok, save the combined data into a new data set.
save “H:\Project\crime2.dta”, replace
The resulting data set will contain data from 19291998, however the arrest variables will have missing values
for 19291951.
The data set that is in memory (crime1) is called the “master.” The data set mentioned in the merge command
(arrests) is called the “using” data set. Each time Stata merges two data sets, it creates a variable called
_merge. This variable takes the value 1 if the value only appeared in the master data set; 2 if the value
occurred only in the using data set; and 3 if an observation from the master is joined with one from the using.
Note also that the variable pop occurs in both data sets. Stata uses the convention that, in the case of duplicate
variables, the one that is in memory (the master) is retained. So, the version of pop that is in the crime1 data
set is stored in the new crime2 data set.
If you want to update the values of a variable with revised values, you can use the merge command with the
update replace options.
merge year using h:\Project\arrests.dta, update replace
Here, data in the using file overwrite data in the master file. In this case the pop values from the arrest data set
will be retained in the combined data set. The _merge variable can take two additional values if the update
option is invoked: _merge equals 4 if missing values in the master were replaced with values from the using
and _merge equals 5 if some values of the master disagree with the corresponding values in the using (these
correspond to the revised values). You should always list the _merge variable after a merge.
If you just want to overwrite missing values in the master data set with new observations from the using data
set (so that you retain all existing data in the master), then don’t use the replace option:
merge year using h:\Project\arrests.dta, update
list _merge
Looking at your data: list, describe, summarize,
and tabulate commands
It is important to look at your data to make sure it is what you think it is. The simplest command is
list varlist
Where varlist is a list of variables separated by spaces. If you omit the varlist, Stata will assume you want all
the variables listed. If you have too many variables to fit in a table, Stata will print out the data by observation.
This is very difficult to read. I always use a varlist to make sure I get an easy to read table.
The “describe” command is best used without a varlist or options.
describe
It produces a list of variables, including any string variables. Here is the resulting output from the crime and
arrest data set we created above.
25
. describe
Contains data from C:\Project\crime1.dta
obs: 70
vars: 13 16 May 2004 20:12
size: 3,430 (99.9% of memory free)

storage display value
variable name type format label variable label

year int %8.0g
pop float %9.0g POP
crmur int %8.0g CRMUR
crrap long %12.0g CRRAP
crrob long %12.0g CRROB
crass long %12.0g CRASS
crbur long %12.0g CRBUR
armur float %9.0g ARMUR
arrap float %9.0g ARRAP
arrob float %9.0g ARROB
arass float %9.0g ARASS
arbur float %9.0g ARBUR
_merge byte %8.0g

Sorted by:
Note: dataset has changed since last saved
We can get a list of the numerical variables and summary data for each one with the “summarize” command.
Summarize varlist
This command produces the number of observations, the mean, standard deviation, min, and max for every
variable in the varlist. If you omit the varlist, Stata will generate these statistics for all the variables in the data
set. For example,
. summarize
Variable  Obs Mean Std. Dev. Min Max
+
year  70 1963.5 20.35109 1929 1998
pop  70 186049 52068.33 121769 270298
crmur  64 14231.02 6066.007 7508 24700
crrap  64 44291.63 36281.72 5981 109060
crrob  64 288390.6 218911.6 58323 687730
crass  64 414834.3 363602.9 62186 1135610
crbur  64 1736091 1214590 318225 3795200
armur  46 7.516304 1.884076 4.4 10.3
arrap  40 12.1675 3.158983 7 16
arrob  46 53.39348 17.57911 26.5 80.9
arass  46 113.1174 54.79689 49.3 223
arbur  46 170.5326 44.00138 97.5 254.1
_merge  70 2.371429 .9952267 1 5
If you add the detail option,
Summarize varlist, detail
You will also get the variance, skewness, kurtosis, four smallest, four largest, and a variety of percentiles,
but the printout is not as neat.
Don’t forget, as with any Stata command, you can use qualifiers to limit the command to a subset of the
data. For example, the command
26
summarize if year >1990
Will yield summary statistics only for the years 19911998.
Finally, you can use the “tabulate” command to generate a table of frequencies. This is particularly useful
for large data sets with too many observations to list comfortably.
tabulate varname
For example, suppose we think we have a panel data set on crime in nine Florida counties for the years
19781997. Just to make sure, we use the tabulate command:
. tabulate county
county  Freq. Percent Cum.
+
1  20 11.11 11.11
2  20 11.11 22.22
3  20 11.11 33.33
4  20 11.11 44.44
5  20 11.11 55.56
6  20 11.11 66.67
7  20 11.11 77.78
8  20 11.11 88.89
9  20 11.11 100.00
+
Total  180 100.00
. tabulate year
year  Freq. Percent Cum.
+
78  9 5.00 5.00
79  9 5.00 10.00
80  9 5.00 15.00
81  9 5.00 20.00
82  9 5.00 25.00
83  9 5.00 30.00
84  9 5.00 35.00
85  9 5.00 40.00
86  9 5.00 45.00
87  9 5.00 50.00
88  9 5.00 55.00
89  9 5.00 60.00
90  9 5.00 65.00
91  9 5.00 70.00
92  9 5.00 75.00
93  9 5.00 80.00
94  9 5.00 85.00
95  9 5.00 90.00
96  9 5.00 95.00
97  9 5.00 100.00
+
Total  180 100.00
Nine counties each with 20 years and 20 years each with 9 counties. Looks ok.
The tabulate command will also do another interesting trick. It can generate dummy variables corresponding
to the categories in the table. So, for example, suppose we want to create dummy variables for each of the
counties and each of the years. We can do it easily with the tabulate command and the generate() option.
Tabulate county, generate(cdum)
Tabulate year, generate(yrdum)
27
This will create nine county dummies (cdum1cdum9) and twenty year dummies (yrdum1yrdum20).
Culling your data: the keep and drop commands
Occasionally you will want to drop variables or observations from your data set. The command
Drop varlist
Will eliminate the variables listed in the varlist. The command
Drop if year <1978
Will drop observations corresponding to years before 1978. The command
Drop in 1/10
will drop the first ten observations.
Alternatively, you can use the keep command to achieve the same result. The command
Keep varlist
Will drop all variables not in the varlist.
Keep if year >1977
will keep all observations after 1977.
Transforming your data: the generate and replace
commands
As we have seen above, we use the “generate” command to transform our raw data into the variables we need
for our statistical analyses. Suppose we want to create a per capita murder variable from the raw data in the
crime2 data set we created above. We would use the command,
gen murder=crmur/pop*1000
I always abbreviate generate to gen. This operation divides crmur by pop then (remember the order of
operations) multiplies the resulting ratio by 1000. This produces a reasonable number for statistical operations.
According to the results of the summarize command above, the maximum value for crmur is 24,700. The
corresponding value for pop is 270,298. However, we know that the population is about 270 million, so our
population variable is in thousands. If we divide 24,700 by 270 million, we get the probability of being
murdered (.000091). This number is too small to be useful in statistical analyses. Dividing 24,700 by 270,298
yields .09138 (the number of murders per thousand population) which is still too small. Multiplying the ratio
of crmur to our measure of pop by 1000 yields the number of murders per million population (91.38), which is
a reasonable order of magnitude for further analysis.
If you try to generate a variable called, say, widget and a variable by that name already exists in the data set,
Stata will refuse to execute the command and tell you, “widget already defined.” If you want to replace widget
with new values, you must use the replace command. We used the replace command in our sample program in
the last chapter to rebase our cpi index.
28
replace cpi=cpi/1.160
29
4 GRAPHS
Continuing with our theme of looking at your data before you make any crucial decisions, obviously one of
the best ways of looking is to graph the data. The first kind of graph is the line graph, which is especially
useful for time series.
Here is a graph of the major crime rate in Virginia from 1960 to 1998. The command used to produce the
graph is
. graph twoway line crmaj year
which produces
It is interesting that crime peaked in the mid1970’s and has been declining fairly steadily since the early
1980’s. I wonder why. Hmmm.
Note that I am able to paste the graph into this Word document by generating the graph in State and
clicking on Edit, Copy Graph. Then, in Word, click on Edit, Paste (or controlv if you prefer keyboard
shortcuts).
6
0
0
0
8
0
0
0
1
0
0
0
0
1
2
0
0
0
1
4
0
0
0
1
6
0
0
0
c
r
m
a
j
1960 1970 1980 1990 2000 2010
year
30
The second type of graph frequently used by economists is the scatter diagram, which relates one variable
to another. Here is the scatter diagram relating the arrest rate to the crime rate.
. graph twoway scatter crmaj aomaj
We also occasionally like to see the scatter diagram with the regression line superimposed.
graph twoway (scatter crmaj aomaj) (lfit crmaj aomaj)
This graph seems to indicate that the higher the arrest rate, the lower the crime rate. Does deterrence work?
6
0
0
0
8
0
0
0
1
0
0
0
0
1
2
0
0
0
1
4
0
0
0
1
6
0
0
0
c
r
m
a
j
20 25 30
aomaj
6
0
0
0
8
0
0
0
1
0
0
0
0
1
2
0
0
0
1
4
0
0
0
1
6
0
0
0
c
r
m
a
j
/
F
i
t
t
e
d
v
a
l
u
e
s
20 25 30
aomaj
crmaj Fitted values
31
Note that we used socalled “parentheses binding” or “() binding” to overlay the graphs. The first
parenthesis contains the commands that created the original scatter diagram, the second parenthesis
contains the command that generated the fitted line (lfit is short for linear fit). Here is the lfit part by itself.
. graph twoway lfit crmaj aomaj
These graphs can be produced interactively. Here is another version of the above graph, with some more
bells and whistles.
This was produced with the command,
1
0
5
0
0
1
1
0
0
0
1
1
5
0
0
1
2
0
0
0
1
2
5
0
0
F
i
t
t
e
d
v
a
l
u
e
s
20 25 30
aomaj
6
0
0
0
8
0
0
0
1
0
0
0
0
1
2
0
0
0
1
4
0
0
0
1
6
0
0
0
C
r
i
m
e
R
a
t
e
20 25 30
Arrest Rate
Fitted values crmaj
Virginia 19601998
Crime Rates and Arrest Rates
32
. twoway (lfit crmaj aomaj, sort) (scatter crmaj aomaj), ytitle(Crime Rate) x
> title(Arrest Rate) title(Crime Rates and Arrest Rates) subtitle(Virginia 19
> 601998)
However, I did not have to enter that complicated command, and neither do you. Click on Graphics to
reveal the pulldown menu.
Click on “overlaid twoway graphs” to reveal the following dialog box with lots of tabs. Clicking on the tabs
reveals further dialog boxes. Note that only the first plot (plot 1) had “lfit” as an option. Also, I sorted by
the independent variable (aomaj) in this graph. The final command is created automatically by filling in
dialog boxes and clicking on things.
33
Here is another common graph in economics, the histogram.
This was produced by the following command, again using the dialog boxes.
. histogram crmaj, bin(10) normal yscale(off) xlabel(, alternate) title(Crime
> Rate) subtitle(Virginia)
(bin=10, start=5740.0903, width=945.2186)
There are lots of other cool graphs available to you in Stata. You should take some time to experiment with
some of them. The graphics dialog boxes make it easy to produce great graphs.
6000
8000
10000
12000
14000
16000
crmaj
Virginia
Crime Rate
34
5 USING DOFILES
A dofile is a list of Stata commands collected together in a text file. It is also known as a Stata program or
batch file. The text file must have a “.do” filename extension. The dofile is executed either with the “do”
command or by clicking on File, Do…, and browsing to the dofile. Except for very simple tasks, I
recommend always using dofiles. The reason is that complicated analyses can be difficult to reproduce. It
is possible to do a complicated analysis interactively, executing each command in turn, before writing and
executing the next command. You can wind up rummaging around in the data doing various regression
models and data transformations, get a result, and then be unable to remember the sequence of events that
led up to the final version of the model. Consequently, you may never be able to get the exact result again.
It may also make it difficult for other people to reproduce your results. However, if you are working with a
dofile, you make changes in the program, save it, execute it, and repeat. The final model can always be
reproduced, as long as the data doesn’t change, by simply running the final dofile again.
If you insist on running Stata interactively, you can “log” your output automatically to a “log file” by
clicking on the “Begin Log” button (the one that looks like a scroll) on the button bar or by clicking on
File, Log, Begin. You will be presented with a dialog box where you can enter the name of the log file and
change the default directory where the file will be stored. When you write up your results you can insert
parts of the log file to document your findings. If you want to replicate your results, you can edit the log
file, remove the output, and turn the resulting file into a do file.
Because economics typically does not use laboratories where experimental results can be replicated by
other researchers, we need to make our data and programs readily available to other economists, otherwise
how can we be sure the results are legitimate? After all, we can edit the Stata output and data input and type
in any values we want to. When the data and programs are readily available it is easy for other researchers
to take the programs and do the analysis themselves and check the data against the original sources, to
make sure we on the up and up. They can also do various tests and alternative analyses to find out if the
results are “fragile,” that is the results do not hold up under relatively small changes in the modeling
technique or data. For these reasons, I recommend that any serious econometric analysis be done with do
files which can be made available to other researchers.
You can create a dofile with the builtin Stata dofile editor, any text editor, such as Notepad or WordPad,
or with any word processing program like Word or WordPerfect. If you use a word processor, be sure to
“save as” a text (.txt) file. However, the first time you save the dofile, Word, for example, will insist on
adding a .txt filename extension. So, you wind up with a dofile called sample.do.txt. You have to rename
the file by dropping the .txt extension. You can do it from inside Word by clicking on Open, browsing to
the right directory, and then rightclicking on the dofile with the txt extension, click on Rename, and then
rename the file.
It is probably easiest to use the builtin Stata dofile editor. It is a full service text editor. It has mouse
control; cut, copy and paste; find and replace; and undo. To fire up the dofile editor, click on the button
35
that looks like an open envelope (with a pencil or something sticking out of it) on the button bar or click on
Window, Dofile Editor.
Here is a simple dofile.
/***********************************************
***** Sample DoFile *****
***** This is a comment *****
************************************************/
/* some preliminaries */
#delimit; * use semicolon to delimit commands;
set mem 10000; * make sure you have enough memory;
set more off; * turn off the more feature;
set matsize 150; * make sure you have enough room for variables;
/* tell Stata where to store the output */
/* I usually use the dofile name, with a .log extension */
/* use a full pathname */
log using "H:\Project\example.log", replace;
/* get the data */
use "H:\Project\crimeva.dta" , clear;
/*************************************/
/*** summary statistics ***/
/*************************************/
summarize crbur aobur metpct ampct unrate;
regress crbur aobur metpct ampct unrate;
log close;
The beginning block of statements should probably be put at the start of any dofile. The #delimit;
command sets the semicolon instead of the carriage return as the symbol signaling the end of the command.
This allows us to write multiline commands.
More conditions: in the usual interactive mode, when the results screen fills up, it stops scrolling and
more—
appears at the bottom of the screen. To see the next screen, you must hit the spacebar. This can be annoying
in dofiles because it keeps stopping the program. Besides, you will be able to examine the output by
opening the log file, so set more off.
For small projects you won’t have to increase memory or matsize, but if you do run out of space, these are
the commands you will need.
Don’t forget to set the log file with the “log using filename, replace” command. The “replace” part is
necessary if you edit the dofile and run it again. If you forget the replace option, Stata will not let you
overwrite the existing log file. If you want Stata to put the log file in your “project” directory, you will need
to use a complete pathname for the file, e.g., “H:\project\sample.log.”
36
Next you have to get the data with the “use” command. Finally, you do the analysis.
Don’t forget to close the log file.
To see the log file, you can open it in the dofile editor, or any other editor or word processor. I use Word
to review the log file, because I use Word to write the final paper..
The log file looks like this.

log: H:\project\sample.log
log type: text
opened on: 4 Jun 2004, 08:16:47
. /* get the data */
>. use "crimeva.dta" , clear;
. /*************************************/
> /*** summary statistics ***/
> /*************************************/
>
> summarize crbur aobur metpct ampct unrate;
Variable  Obs Mean Std. Dev. Min Max
+
crbur  40 7743.378 2166.439 3912.193 11925.49
aobur  22 16.99775 1.834964 12.47 19.6
metpct  28 71.46429 1.394291 70 74.25
ampct  28 18.76679 .1049885 18.5 18.9
unrate  30 4.786667 1.162261 2.4 7.7
. regress crbur aobur metpct ampct unrate;
Source  SS df MS Number of obs = 21
+ F( 4, 16) = 28.74
Model  56035517.4 4 14008879.3 Prob > F = 0.0000
Residual  7800038.17 16 487502.386 Rsquared = 0.8778
+ Adj Rsquared = 0.8473
Total  63835555.6 20 3191777.78 Root MSE = 698.21

crbur  Coef. Std. Err. t P>t [95% Conf. Interval]
+
aobur  31.52691 132.9329 0.24 0.816 313.3321 250.2783
metpct  986.7367 246.7202 4.00 0.001 1509.76 463.7131
ampct  5689.995 7001.924 0.81 0.428 9153.42 20533.41
unrate  71.66737 246.0449 0.29 0.775 449.9246 593.2593
_cons  27751.14 148088.1 0.19 0.854 341683.8 286181.6

. log close;
log: C:\Stata\sample.log
log type: text
closed on: 4 Jun 2004, 08:16:48

For complicated jobs, using lots of “banner” style comments (like the “summary statistics” comment
above) makes the output more readable and reminds you, or tells someone else who might be reading the
output, what you’re doing and why.
37
6 USING STATA HELP
This little manual can’t give you all the details of each Stata command. It also can’t list all the Stata
commands that you might need. Finally, the official Stata manual consists of several volumes and is very
expensive. So where does a student turn for help with Stata. Answer: the Stata Help menu.
Clicking on Help yields the following submenu.
Clicking on Contents allows you to browse through the list of Stata commands, arranged according to
function.
38
One could rummage around under various topics and find all kinds of neat commands this way. Clicking on
any of the hyperlinks brings up the documentation for that command.
Now suppose you want information on how Stata handles a certain econometric topic. For example,
suppose we want to know how Stata deals with heteroskedasticity. Clicking on Help, Search, yields the
following dialog box.
Clicking on the “Search all” radio button and entering heteroskedasticity yields
39
On the other hand, if you know that you need more information on a given command, you can click on
Help, Stata command, yields
For example, entering “regress” to get the documentation on the regress commands yields.
40
This is the best way to find out more about the commands in this book. Most of the commands have options
that are not documented here. There are also lots of examples and links to related commands. You should
certainly spend a little time reviewing the regress help documentation because it will probably be the
command you will use most over the course of the semester.
41
7 REGRESSION
Let’s start with a simple correlation. Use the data on pool memberships from the first chapter. Use
summarize to compute the means of the variables.
. summarize
Variable  Obs Mean Std. Dev. Min Max
+
year  10 82.5 3.02765 78 87
members  10 123.9 8.999383 112 139
dues  10 171.5 20.00694 135 195
cpi  10 .9717 .1711212 .652 1.16
rdues  10 178.8055 16.72355 158.5903 207.0552
+
logdues  10 5.138088 .1219735 4.905275 5.273
logmem  10 4.817096 .0727516 4.718499 4.934474
dum  10 .4 .5163978 0 1
trend  10 4.5 3.02765 0 9
Plot the number of members on the real dues (rdues), sorting by rdues. This can be done by typing
twoway (scatter members rdues), yline(124) xline(179)
Or by clicking on Graphics, Two way graph, Scatter plot and filling in the blanks. Click on Y axis, Major
Tick Options, and Additional Lines, to reveal the dialog box that allows you to put a line at the mean of
members. Similarly, clicking on X axis, Major Tick Options, allows you to put a line at the mean of rdues.
Either way, the result is a nice graph.
42
Note that six out of the ten dots are in the upper left and lower right quadrants. This means that high values
of rdues are associated with low membership. We need a measure of this association. Define the deviation
from the mean for X and Y as
i i
i i
x X X
y Y Y
= −
= −
Note that, in the upper right quadrant, ,
i i
X X Y Y > > so that 0, 0
i i
x y > > . Moving clockwise, in the
lower right hand quadrant, 0, 0
i i
x y > < . In the lower left quadrant, 0, 0
i i
x y < < . Finally in the upper left
quadrant, 0, 0
i i
x y < > .
So, for points in the lower left and upper right hand quadrants, if we multiply the deviations together and
sum we get, 0
i i
x y ¯ > . This quantity is called the sum of corrected crossproducts, or the covariation. If we
compute the sum of corrected crossproducts for the points in the upper left and lower right quadrants we
find that 0
i i
x y ¯ < . This means that, if most of the points are in the upper left and lower right quadrants,
then the variation, computed over the entire sample will be negative, indicating a negative association
between X and Y. Similarly, if most of the points lie in the lower left and upper right quadrants, then the
covariation will be positive, indicating a positive relationship between X and Y.
There are two problems with using covariation as our measure of association. The first is that it depends on
the number of observations. The larger the number of observations, the larger the (positive or negative)
covariation. We can fix this problem by dividing the covariation by the sample size, N. The resulting
quantity is the average covariation, known as the covariance.
1
1
0
1
2
0
1
3
0
1
4
0
m
e
m
b
e
r
s
160 170 180 190 200 210
rdues
43
1 1
( )( )
( , )
N N
i i i i
i i
x y X X Y Y
Cov x y
N N
= =
− −
= =
¯ ¯
The second problem is that the covariance depends on the unit of measure of the two variables. You get
different values of the covariance if the two variables are height in inches and weight in pounds versus
height in miles and weight in milligrams. To solve this problem, we divide each observation by its standard
deviation to eliminate the units of measure and then compute the covariance using the resulting standard
scores. The standard score of X is
i
X i x
z x S = where
2
1
N
i
i
x
x
S
N
=
=
¯
is the standard deviation of X. The
standard score of Y is
2
1
i
i
Y
N
i
i
y
z
x N
=
=
¯
. The covariance of the standard scores is,
1
1 1
1 1
i i
N
X Y N N
i i i
XY i i
i i X Y X Y
z z
x y
r x y
N N S S NS S
=
= =
= = =
¯
¯ ¯
This is the familiar simple correlation coefficient.
The correlation coefficient for our pool sample can be computed in Stata by typing
correlate members rdues
(obs=10)
 members rdues
+
members  1.0000
rdues  0.2825 1.0000
The correlation coefficient is negative, as suspected.
With respect to regression analysis, the Stata command that corresponds to ordinary least squares (OLS) is
the regress command. For example
Regress members rues
Produces a simple regression of members on rdues. That output was reported in the first chapter.
Linear regression
Sometimes we want to know more about an association between two variables than if it is positive or
negative. Consider a policy analysis concerning the effect of putting more police on the street. Does it
really reduce crime? If the answer is yes, the policy at least doesn’t make things worse, but we also have to
know by how much it can be expected to reduce crime, in order to justify the expense. If adding 100,000
police nationwide only reduces crime by a few crimes, it fails the costbenefit test. So, we have to estimate
the expected change in crime from a change in the number of police.
44
Suppose we hypothesize that there is a linear relationship between crime and police such that
Y X α β = +
where Y is crime, X is police and 0 β < . If this is true and we could estimate þ, we could estimate the
effect of adding 100,000 police as follows. Since þ is the slope of the function,
Y
X
β
∆
=
∆
which means that
(100, 000)
Y X β
β
∆ = ∆
=
This is the counterfactual (since we haven’t actually added the police yet) that we need to estimate the
benefit of the policy. The cost is determined elsewhere.
So, we need an estimate of the slope of a line relating Y to X, not just the sign. A line is determined by two
points. We will use the slope, β , and the intercept u as the two points we need.
As statisticians, as opposed to mathematicians, we know that real world data do not line up on straight
lines. We can summarize the relationship between Y and X as,
i i i
Y X U α β = + +
where Ui is the error associated with observation i.
Why do we need an error term? Why don’t observations line up on straight lines? There are two main
reasons.
1. The model is incomplete. We have left out some of the other factors that affect crime. At this point
let’s assume that the factors that we left out are “irrelevant” in the sense that if we included them
the basic form of the line would not change, that is, that the values of the estimated slope and
intercept would not change. Then the error term is simply the sum of a whole bunch of variables
that are each irrelevant, but taken together cause the error term to vary randomly.
2. The line might not be straight. That is, the line may be curved. We can get out of this either by
transforming the line (taking logs, etc.), or simply assume that the line is approximately linear for
our purposes.
So what are the values of u and þ (the parameters) that create a “good fit,” that is, which values of the
parameters best describe the cloud of data in the figure above?
Let’s define our estimate of u as ˆ α and our estimate of þ as
ˆ
β . Then for any observation, our predicted
value of Y is
ˆ ˆ
ˆ
i i
Y X α β = +
If we are wrong, the error, or residual, is the difference between what we thought Y would be and what is
actually was,
ˆ ˆ
ˆ ( )
i i i i i
e Y Y Y X α β = − = − + .
45
Clearly, we want to minimize the errors. We can’t get zero errors because that would require driving the
line through each point. Since they don’t line up, we would get one line per point, not a summary statistic.
We could minimize the sum of the errors, but that would mean that large positive errors could offset large
negative errors and we really couldn’t tell a good fit, with no error, from a bad fit, with huge positive and
negative errors that offset each other.
To solve this last problem we could square the errors or take their absolute values. If we minimize the sum
of squares of error, we are doing least squares. If we minimize the sum of absolute values we are doing
least absolute value estimation, which is beyond our scope. The least squares procedure has been around
since at least 1805 and it is the workhorse of applied statistics.
So, we want to minimize the sum of squares of error, or
2 2
ˆ
ˆ ,
ˆ
ˆ ( )
Min
i i i
e Y X
α β
α β ¯ = ¯ − −
We could take derivatives with respect to u and þ, set them equal to zero and solve the resulting equations,
but that would be too easy. Instead, let’s do it without calculus.
Suppose we have a quadratic formula and we want to minimize it by choosing a value for b.
2
2 1 0
( ) f b c b c b c = + +
Recall from high school the formula for completing the square.
2
2
1 1
2 0
2 2
( )
2 4
c c
f b c b c
c c
   
= + + −
 
\ ¹ \ ¹
Multiply it out if you are unconvinced. Since we are minimizing with respect to our choice of b and b only
occurs in the squared term, the minimum occurs at
1
2
2
c
b
c
−
=
That is, the minimum is the ratio of the negative of the coefficient on the first power to two times the
coefficient on the second power of the quadratic.
We want to minimize the sum of squares of error with respect to ˆ α and
ˆ
β , that is
2 2 2 2 2 2
2 2 2 2
ˆ ˆ ˆ ˆ
ˆ ˆ ˆ ˆ Min ( ) ( 2 2 2 )
ˆ ˆ ˆ
ˆ ˆ ˆ 2 2 2
e Y X Y Y X X XY
Y Y N X X XY
α β α α β αβ β
α α β αβ β
¯ = ¯ − − = ¯ − + + − −
= ¯ − ¯ + + ¯ − ¯ − ¯
Note that there are two quadratics in ˆ α and
ˆ
β . The one in ˆ α is
2
2
ˆ
ˆ ˆ ˆ ˆ ( ) 2 2
ˆ
ˆ ˆ(2 2 )
f N Y X C
N Y X
α α α αβ
α α β
= − ¯ − ¯ +
= − ¯ − ¯
where C is a constant consisting of terms not involving ˆ α . Minimizing ˆ ( ) f α requires setting ˆ α equal to
the negative of the coefficient of the first power divided by two times the coefficient of the second power,
namely,
46
( )
ˆ
2 ˆ
ˆ
ˆ
2
Y X
Y X
Y X
N N N
β
β
α β
¯ −
¯
= = − = −
The quadratic form in
ˆ
β is
2 2 2 2
ˆ ˆ ˆ ˆ ˆ ˆ
ˆ ˆ ( ) 2 2 2( ) * f X XY X X XY X C β β β αβ β β α = ¯ − ¯ + ¯ = ¯ − ¯ + ¯ +
where C* is another constant not involving
ˆ
β .
Setting
ˆ
β equal to the ratio of the negative of the first power divided by two times the coefficient of the
second power yields,
2 2
ˆ ˆ 2( )
ˆ
2
XY X XY X
X X
α α
β
¯ + ¯ ¯ + ¯
= =
¯ ¯
Substituting the formula for ˆ α ,
2 2
ˆ ˆ
( )
ˆ
XY Y X X XY Y Y X X
X X
β β
β
¯ + − ¯ ¯ + ¯ − ¯
= =
¯ ¯
Multiplying both sides by
2
X ¯ yields,
2
ˆ ˆ
ˆ
X XY Y X X X
XY YNX XNX
β β
β
¯ = ¯ − ¯ + ¯
= ¯ − +
2 2
2 2
ˆ ˆ
ˆ
( )
X NX XY NXY
X NX XY NXY
β β
β
¯ − = ¯ −
¯ − = ¯ −
Solving for
ˆ
β
2 2 2
ˆ
XY NXY xy
X NX x
β
¯ − ¯
= =
¯ − ¯
Correlation and regression
Let’s express the regression line in deviations.
Y X α β = +
Since
ˆ
ˆ Y X α β = − the regression line goes through the means:
Y X α β = +
Subtract the mean of Y from both sides of the regression line,
47
Y Y X X α β α β − = + − −
Or,
( ) y X X x β β = − =
Note that, by subtracting off the means and expressing x and y in deviations, we have “translated the axes”
so that the intercept is zero when the regression is expressed in deviations. The reason is, as we have seen
above, the regression line goes through the means. Here is the graph of the pool membership data with the
regression line superimposed.
Substituting
ˆ
β yields the predicted value of y (“Fitted values” in the graph):
ˆ
ˆ y x β =
We know that the correlation coefficient between x and y is
x y
xy
r
NS S
¯
=
Multiply both sides by S
y
/S
x
yields,
2
x x
y x y y y
S S xy xy
r
S NS S S NS
¯ ¯
= =
1
1
0
1
2
0
1
3
0
1
4
0
160 170 180 190 200 210
rdues
Fitted values members
48
But,
2
2 2
2 2
y
y y
NS N N y
N N
 
¯ ¯
 = = = ¯

\ ¹
Therefore,
2
ˆ
x
y
S xy
r
S y
β
¯
= =
¯
So,
ˆ
β is a linear transformation of the correlation coefficient; Since 0 and 0
x y
S S > >
ˆ
must have the same sign as r β . Since the regression line does everything that the correlation coefficient
does, plus yield the rate of change of y to x, we will concentrate on regression from now on.
How well does the line fit the data?
We need a measure of the goodness of fit of the regression line. Start with the notion that each observation
of y is equal to the predicted value plus the residual:
ˆ y y e = +
Squaring both sides,
2 2 2
ˆ 2 y y ye e = + +
Summing,
2 2 2
ˆ 2 y y ye e ¯ = ¯ + ¯ + ¯
The middle term on the right hand side is equal to zero.
ˆ ˆ
ˆ ye xe xe β β ¯ = ¯ = ¯
Which is zero if 0. xe ¯ =
( )
2
2
2
ˆ ˆ
0
xe x y x xy x
xy
xy x xy xy
x
β β ¯ = ¯ − = ¯ − ¯
¯
= ¯ − ¯ = ¯ − ¯ =
¯
This property of linear regression, namely that x is uncorrelated with the residual, turns out to be useful in
several contexts. Nevertheless, the middle term is zero and thus,
2 2 2
y y e ¯ = ¯ + ¯
Or,
TSS=RSS+ESS
49
where TSS is the total sum of squares,
2
, y ¯ RSS is the regression (explained) sum of squares,
2
ˆ , y ¯ and
ESS is the error (unexplained, residual) sum of squares,
2
. e ¯
Suppose we take the ratio of the explained sum of squares to the unexplained sum of squares as our
measure of the goodness of fit.
2 2 2 2
2
2 2 2
ˆ
ˆ
ˆ
RSS y x x
TSS y y y
β
β
¯ ¯ ¯
= = =
¯ ¯ ¯
But, recall that
2
2
ˆ
y
x
y
S
N
r r
S
x
N
β
¯
= =
¯
Squaring both sides yields,
2
2
2 2 2
2 2
ˆ
y
y
N
r r
x x
N
β
¯
¯
= =
¯ ¯
Therefore,
2
2 2
2
ˆ
RSS x
r
TSS y
β
¯
= =
¯
The ratio RSS/TSS is the proportion of the variance of Y explained by the regression and is usually referred
to as the coefficient of determination. It is usually denoted R
2
. We know why now, because the coefficient
of determination is, in simple regressions, equal to the correlation coefficient, squared. Unfortunately, this
neat relationship doesn’t hold in multiple regressions (because there are several correlation coefficients
associating the dependent variable with each of the independent variables). However, the ratio of the
regression sum of squares to the total sum of squares is always denoted R
2
.
Why is it called regression?
In 1886 Sir Francis Galton, cousin of Darwin and discoverer of fingerprints among other things, published
a paper entitled, “Regression toward mediocrity in hereditary stature.” Galton’s hypothesis was that tall
parents should have tall children, but not as tall as they are, and short parents should produce short
offspring, but not as short as they are. Consequently, we should eventually all be the same, mediocre,
height. He knew that, to prove his thesis, he needed more than a positive association. He needed the slope
of a line relating the two heights to be less than one.
He plotted the deviations from the mean heights of adult children on the deviations from the mean heights
of their parents (multiplying the mother’s height by 1.08). Taking the resulting scatter diagram, he
eyeballed a straight line through the data and noted that the slope of the line was positive, but less than one.
He claimed that, “…we can define the law of regression very briefly; It is that the heightdeviate of the
offspring is, on average, twothirds of the heightdeviate of its midparentage.” (Galton, 1886,
50
Anthropological Miscellanea, p. 352.) Midparentage is the average height of the two parents, minus the
overall mean height, 68.5 inches. This is the regression toward mediocrity, now commonly called
regression to the mean. Galton called the line through the scatter of data a “regression line.” The name has
stuck.
The original data collected by Galton is available on the web. (Actually it has a few missing values and
some data entered as “short” “medium,” etc. have been dropped.) The scatter diagram is shown below, with
the regression line (estimated by OLS, not eyeballed) superimposed.
However, another question arises, namely, when did a line estimated by least squares become a regression
line? We know that least squares as a descriptive device was invented by Legendre in 1805 to describe
astronomical data. Galton was apparently unaware of this technique, although it was well known among
mathematicians. Galton’s younger colleague Karl Pearson, later refined Galton’s graphical methods into
the now universally used correlation coefficient. (Frequently referred to in textbooks as the Pearsonian
productmoment correlation coefficient.) However, least squares was still not being used. In 1897 George
Udny Yule, another, even younger colleague of Galton’s, published two papers in which he estimated what
he called a “line of regression” using ordinary least squares. It has been called the “regression line” ever
since. In the second paper he actually estimates a multiple regression model of poverty as a function of
welfare payments including control variables. The first econometric policy analysis.
The regression fallacy
Although Galton was in his time and is even now considered a major scientific figure, he was completely
wrong about the law of heredity he claimed to have discovered. According to Galton, the heights of fathers
of exceptionally tall children tend to be tall, but not as tall as their offspring, implying that there is a natural
0
5
1
0
1
5
0 5 10 15
parents
meankids Fitted values
51
tendency to mediocrity. This theorem was supposedly proved by the fact that the regression line has a slope
less than one.
Using Galton’s data from the scatter diagram above (the actual data is in Galton.dta). We can see that the
regression line does, in fact, have a slope less than one.
. regress meankids parents [fweight=number]
1
Source  SS df MS Number of obs = 899
+ F( 1, 897) = 376.21
Model  1217.79278 1 1217.79278 Prob > F = 0.0000
Residual  2903.59203 897 3.23700338 Rsquared = 0.2955
+ Adj Rsquared = 0.2947
Total  4121.38481 898 4.58951538 Root MSE = 1.7992

meankids  Coef. Std. Err. t P>t [95% Conf. Interval]
+
parents  .640958 .0330457 19.40 0.000 .5761022 .7058138
_cons  2.386932 .2333127 10.23 0.000 1.92903 2.844835

The slope is ..64, almost exactly that calculated by Galton, which confirms his hypothesis. Nevertheless,
the fact that the law is bogus is easy to see. Mathematically, the explanation is simple. We know that
ˆ
y
x
S
r
S
β =
If
y x
S S ≅ and r > 0, then
ˆ
0 1 r β < ≅ <
This is the regression fallacy. The regression line slope estimate is always less than one for any two
variables with approximately equal variances.
Let’s see if this is true for the Galton data. The means and standard deviations are produced by the
summarize command. The correlation coefficient is produced by the correlate command.
. summarize meankids parents
Variable  Obs Mean Std. Dev. Min Max
+
meankids  204 6.647265 2.678237 0 15
parents  197 6.826122 1.916343 2 13.03
. correlate meankids parents
(obs=197)
 meankids parents
+
meankids  1.0000
parents  0.4014 1.0000
So, the standard deviations are very close and the correlation coefficient is positive. We therefore expect
that the slope of the regression line will be positive. Galton was simply demonstrating the mathematical
theorem with these data.
1
We weight by the number of kids because some families have lots of kids (up to 15) and therefore
represent 15 observations, not one.
52
To take another example, if you do extremely well on the first exam in a course, you can expect to do well
on the second exam, but probably not quite as well. Similarly, if you do poorly, you can expect to do better.
The underlying theory is that ability is distributed randomly with a constant mean. Suppose ability is
distributed uniformly with mean 50. However, on each test, you experience random error with is distributed
normally with mean zero and standard deviation 10. Thus,
1 1
2 2
1 2 1 2 1 2
( ) 50, ( ) 0, ( ) 0, cov( , ) cov( , ) cov( , ) 0
x a e
x a e
E a E e E e a e a e e e
= +
= +
= = = = = =
where x
1
is the grade on the first test, x
2
is the grade on the second test, a is ability, and e
1
and e
2
are the
random errors.
1 1 1
( ) ( ) ( ) ( ) ( ) 50 E x E a e E a E e E a = + = + = =
2 2 2
( ) ( ) ( ) ( ) ( ) 50 E x E a e E a E e E a = + = + = =
So, if you scored below 50 on the first exam, you should score closer to 50 on the second. If you scored
higher than 50, you should regress toward mediocrity. (Don’t be alarmed, these grades are not scaled, 50 is
a B.)
We can use Stata to prove this result with a Monte Carlo demonstration.
* choose a large number of observations
. set obs 10000
obs was 0, now 10000
* create a measure of underlying ability
. gen a=100*uniform()
* create a random error term for the first test
. gen x1=a+10*invnorm(uniform())
* now a random error for the second test
. gen x2=a+10*invnorm(uniform())
. summarize
Variable  Obs Mean Std. Dev. Min Max
+
a  10000 50.01914 28.88209 .0044471 99.98645
x1  10000 50.00412 30.5206 29.68982 127.2455
x2  10000 49.94758 30.49167 25.24831 131.0626
* note that the means are all 50 and that the standard deviations are about the
same
. correlate x1 x2
(obs=10000)
 x1 x2
+
x1  1.0000
x2  0.8912 1.0000
. regress x2 x1
Source  SS df MS Number of obs = 10000
+ F( 1, 9998) =38583.43
53
Model  7383282.55 1 7383282.55 Prob > F = 0.0000
Residual  1913206.37 9998 191.358909 Rsquared = 0.7942
+ Adj Rsquared = 0.7942
Total  9296488.92 9999 929.741866 Root MSE = 13.833
x2  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x1  .890335 .0045327 196.43 0.000 .8814501 .8992199
_cons  5.427161 .2655312 20.44 0.000 4.906666 5.947655
* the regression slope is equal to the correlation coefficient, as expected
Does this actually work in practice? I took the grades from the first two tests in a statistics course I taught
years ago and created a Stata data set called tests.dta. You can download the data set and look at the scores.
Here are the results.
. summarize
Variable  Obs Mean Std. Dev. Min Max
+
test1  41 74.87805 17.0839 30 100
test2  41 77.78049 13.54163 47 100
. correlate test2 test1
(obs=41)
 test2 test1
+
test2  1.0000
test1  0.3851 1.0000
. regress test2 test1
Source  SS df MS Number of obs = 41
+ F( 1, 39) = 6.79
Model  1087.97122 1 1087.97122 Prob > F = 0.0129
Residual  6247.05317 39 160.180851 Rsquared = 0.1483
+ Adj Rsquared = 0.1265
Total  7335.02439 40 183.37561 Root MSE = 12.656

test2  Coef. Std. Err. t P>t [95% Conf. Interval]
+
test1  .3052753 .1171354 2.61 0.013 .0683465 .542204
_cons  54.92207 8.99083 6.11 0.000 36.7364 73.10774

Horace Secrist
Secrist was Professor of Statistics at Northwestern University in 1933 when, after ten years of effort
collecting and analyzing data on profit rates of businesses, he published his magnum opus, “The Triumph
of Mediocrity in Business.” In this 486 page tome he analyzed mountains of data and showed that, in every
industry, businesses with high profits tended, over time, to see their profits erode, while businesses with
lower than average profits showed increases. Secrist thought he had found a natural law of competition
(and maybe a fundamental cause of the depression, it was 1933 after all). The book initially received
favorable reviews in places like the Royal Statistical Society, the Journal of Political Economy, and the
Annals of the American Academy of Political and Social Science. Secrist was riding high.
54
It all came crashing down when Harold Hotelling, a well known economist and mathematical statistician,
published a review in the Journal of the American Statistical Association. In the review Hotelling pointed
out that if the data were arranged according to the values taken at the end of the period, instead of the
beginning, the conclusions would be reversed. “The seeming convergence is a statistical fallacy, resulting
from the method of grouping. These diagrams really prove nothing more than that the ratios in question
have a tendency to wander about.” (Hotelling, JASA, 1933.) According to Hotelling the real test would be a
reduction in the variance of the series over time (not true of Secrist’s series). Secrist replied with a letter to
the editor of the JASA, but Hotelling was having none of it. In his reply to Secrist, Hotelling wrote,
This theorem is proved by simple mathematics. It is illustrated by genetic, astronomical, physical,
sociological, and other phenomena. To “prove” such a mathematical result by a costly and
prolonged numerical study of many kinds of business profit and expense ratios is analogous to
proving the multiplication table by arranging elephants in rows and columns, and then doing the
same for numerous other kinds of animals. The performance, though perhaps entertaining, and
having a certain pedagogical value, is not an important contribution to either zoology or to
mathematics. (Hotelling, JASA, 1934.)
Even otherwise competent economists commit this error. See Milton Friedman, 1992, “Do old fallacies
ever die?” Journal of Economic Literature 30:21292132. By the way, the “real test’ of smaller variances
over time might need to be taken with a grain of salt if measurement errors are getting smaller over time.
Also, in the same article, Friedman, who was a student of Hotelling’s when Hotelling was engaged in his
exchange with Secrist, tells how he was inspired to avoid the regression fallacy in his own work. The result
was the permanent income hypothesis, which can be derived from the test examples above by simply
replacing test scores with consumption, ability with permanent income, and the random errors with
transitory income.
Some tools of the trade
We will be using these concepts routinely throughout the semester.
Summation and deviations
Let the lowercase variable
i i
x X X = − where X is the raw data.
(1)
1
( )
N
i i i
i
x X X X NX
=
= ¯ − = ¯ −
¯
0
i
i i i
X
X N X X
N
¯  
= ¯ − = ¯ − ¯ =

\ ¹
From now on, I will suppress the subscripts (which are always i=1,…, N where N is the sample size).
(2)
( )
2 2 2 2
2 2
( ) 2
2
x X X X XX X
X X X NX
¯ = ¯ − = ¯ − +
= ¯ − ¯ +
Note that
X
X N NX
N
¯  
¯ = =

\ ¹
so that
2 2 2 2 2 2 2
2 2 2 x X X X NX X NX NX X NX ¯ = ¯ − ¯ + = ¯ − + = ¯ −
55
(3)
( )
( )( ) xy X X Y Y XY XY YX XY
XY Y X X Y NXY
XY YNX XNY NXY XY NXY
¯ = ¯ − − = ¯ − − +
= ¯ − ¯ − ¯ +
= ¯ − − + = ¯ −
(4)
( )
( ) xY X X Y XY XY
XY X Y XY NXY xy
¯ = ¯ − = ¯ −
= ¯ − ¯ = ¯ − = ¯
(5)
( ) Xy X Y Y XY Y X
XY NXY xy xY
¯ = ¯ − = ¯ − ¯
= ¯ − = ¯ = ¯
Expected value
The expected value of an event is the most probable outcome, what might be expected. It is the average
result over repeated trials. It is also a weighted average of several events where the weight attached to each
outcome is the probability of the event.
1 1 2 2
1 1
( ) ... , 1
N N
N N i i x i
i i
E X p X p X p X p X p µ
= =
= + + + = = =
¯ ¯
For example, if you play roulette in Monte Carlo, the slots are numbered 036. However, the payoff if your
number comes up is 35 to 1. So, what is the expected value of a one dollar bet? There are only two relevant
outcomes, namely your number comes up and you win $35 or it doesn’t and you lose your dollar.
1 1 2 2
36 1
( ) ( 1) (35)
37 37
.973( 1) .027(35) .973 .946 .027
E X p X p X
   
= + = − +
 
\ ¹ \ ¹
= − + = − + = −
So, you can expect to lose about 2.7 cents every time you play. Of course, when you play you never lose
2.7 cents. You either win $35 or lose a dollar. However, over the long haul, you will eventually lose. If you
play 1000 times, you will win some and lose some, but on average, you will lose $27. At the same time, the
casino can expect to make $27 for every 1000 bets. This is known as the house edge and is how the casinos
make a profit. If the expected value of the game is nonzero, as this one is, the game is said not to be fair. A
fair game is one for which the expected value is zero. For example, we can make roulette a fair game by
increasing the payoff to $36
1 1 2 2
36 1
( ) ( 1) (36) 0
37 37
E X p X p X
   
= + = − + =
 
\ ¹ \ ¹
Let’s assume that all the outcomes are equally likely, so that 1/
i
p N = then
1
1 2
1 1 1
( ) ...
N
i
N x
X
E X X X X
N N N N
µ
=
= + + + = =
¯
Note that the expected value of X is the true (population) mean,
x
µ not the sample mean, X . The expected
value of X is also called the first moment of X.
Operationally, to find an expected value, just take the mean.
56
Expected values, means and variance
(1) If b is a constant, then ( ) E b b = .
(2)
1 2
1
...
( ) / ( )
N
N
i
i
bX bX bX
E bX b X N bE X
N
=
+ + +
= = =
¯
(3) ( ) ( ) E a bX a bE X + = +
(4) ( ) ( ) ( )
x y
E X Y E X E Y µ µ + = + = +
(5) ( ) ( ) ( )
x y
E aX bY aE X bE Y a b µ µ + = + = +
(6)
2 2
( ) ( ) E bX b E X =
(7) The variance of X is
2
2
( )
( ) ( )
x
x
X
Var X E X
N
µ
µ
¯ −
= − =
also written as
2
( ) ( ( )) Var X E X E X = − The variance of X is also called the second moment of X.
(8)
2
( , ) [( )( )]
x y
Cov X Y E X Y µ µ = − −
(9)
2 2
( ) [( ) ( )] [( ) ( )]
x y x y
Var X Y E X Y E X Y µ µ µ µ + = + − + = − + −
2 2
[( ) ( ) 2( )( )]
( ) ( ) 2 ( , )
x y x y
E X Y X Y
Var X Var Y Cov X Y
µ µ µ µ = − + − + − −
= + +
(10) If X and Y are independent, then E(XY)=E(X)E(Y).
(11) If X and Y are independent,
( , ) ( )( ) [ ]
( )
( ) ( ) ( ) 0
x y x y x y
y x x y x y
Cov X Y E X Y E XY Y X
E XY
E XY E X E Y
µ µ µ µ µ µ
µ µ µ µ µ µ
= − − = − − +
= − − +
= − =
57
8 THEORY OF LEAST SQUARES
Ordinary least squares is almost certainly the most widely used statistical method ever created, aside maybe
from the mean. The GaussMarkov theorem is the theoretical basis for its ubiquity.
Method of least squares
Least squares was first developed to help explain astronomical measurements. In the late 17’th Century,
scientists were looking at the stars through telescopes and trying to figure out where they were. They knew
that the stars didn’t move, but when they tried to measure where they were, different scientists got different
locations. So which observations are right? According to Adrien Marie Legendre in 1805, “Of all the
principles that can be proposed for this purpose [analyzing large amounts of inexact astronomical data], I
think there is none more general more exact, or easier to apply, than…making the sum of the squares of the
errors a minimum. By this method a kind of equilibrium is established among the errors which, since it
prevents the extremes from dominating, is appropriate for revealing the state of the system which most
nearly approaches the truth.”
Gauss published the method of least squares in his book on planetary orbits four years later in 1809. Gauss
claimed that he had been using the technique since 1795, thereby greatly annoying Legendre. It remains
unclear who thought of the method first. Certainly Legendre beat Gauss to print. Gauss did show that least
squares is preferred if the errors are distributed normally (“Gaussian”) using a Bayesian argument.
Gauss published another derivation in 1823, which has since become the standard. This is the “Gauss” part
of the GaussMarkov theorem. Markov published his version of the theorem in 1912, using different
methods, containing nothing that Gauss hadn’t done a hundred years earlier. Neyman, in the 1920’s,
inexplicably unaware of the 1823 Gauss paper, calls Markov’s results, “Markov’s Theorem.” Somehow,
even though he was almost 100 years late, Markov’s name is still on the theorem.
We will investigate the GaussMarkov theorem below. But first, we need to know some of the properties of
estimators, so we can evaluate the least squares estimator.
58
Properties of estimators
An estimator is a formula for estimating the value of a parameter. For example the formula for computing
the mean
/ X X N = Σ
is an estimator. The resulting value is the estimate. The formula for estimating the
slope of a regression line is
2
ˆ
/ xy x β = Σ Σ
and the resulting value is the estimate. So is
ˆ
β
a good estimator?
The answer depends on the distribution of the estimator. Since we get a different value for
ˆ
β each time we
take a different sample,
ˆ
β is a random variable. The distribution of
ˆ
β across different samples is called the
sampling distribution of
ˆ
β . The properties of the sampling distribution of
ˆ
β will determine if
ˆ
β is a good
estimator, or not.
There are two types of such properties. The so called “small sample” properties are true for all sample
sizes, no matter how small. The large sample properties are true only as the sample size approaches
infinity.
Small sample properties
There are three important small sample properties: bias, efficiency, and mean square error.
Bias
Bias is the difference between the expected value of the estimator and the true value of the parameter.
Bias=
( )
ˆ
E β β −
Therefore, if
( )
ˆ
E β β = the estimator is said to be unbiased. Obviously, unbiased estimators are preferred.
Efficiency
If for a given sample size, the variance of
ˆ
β is smaller than the variance of any other unbiased estimator,
than
ˆ
β is said to be efficient. Efficiency is important because the more efficient the estimator, the lower its
variance and the smaller the variance the more precise the estimate. Efficient estimators allow us to make
more powerful statistical statements concerning the true value of the parameter being estimated. You can
visualize this by imagining a distribution with zero variance. How confident can we be in our estimate of
the parameter in this case?
The drawback to this definition of efficiency is that we are forced to compare unbiased estimators. It is
frequently the case, as we shall see throughout the semester that we must compare estimators that may be
biased, at least in small samples. For this reason, we will use the more general concept of relative
efficiency.
The estimator
ˆ
β is efficient relative to the estimator β
if
( ) ( )
ˆ
Var Var β β <
.
59
From now on, when we say an estimator is efficient, we mean efficient relative to some other estimator.
Mean square error
We will see that we often have to compare the properties of two estimators, one of which is unbiased but
inefficient while the other is biased but efficient. The concept of mean square error is useful here. Mean
square error is defined as the variance of the estimator, plus its squared bias, that is,
( ) ( ) ( )
2
ˆ ˆ ˆ
mse Var Bias β β β
= +
Large sample properties
Large sample properties are also known as asymptotic properties because they tell us what happens to the
estimator as the sample size goes to infinity. Ideally, we want the estimator to approach the truth, with
maximum precision, as N →∞ .
Define the probability limit as the limit of a probability as the sample size goes to infinity, that is
plim lim Pr
N→∞
=
Consistency
A consistent estimator is one for which
( ) ( )
ˆ ˆ
plim Pr 1
lim
N
β β ε β β ε
→∞
− < = − < =
That is, as the sample size goes to infinity, the probability that the estimator will be infinitesimally different
from the truth is one, a certainty. This requires that the estimator be centered on the truth, with no variance.
A distribution with no variance is said to be degenerate, because it is no longer really a distribution, it is a
number.
60
A biased estimator can be consistent if the bias gets smaller as the sample size increases.
Since we require that both the bias and the variance go to zero for a consistent estimator, an alternative
definition is mean square error consistency. That is, an estimator is consistent if its mean square error goes
to zero as the sample size goes to infinity.
All of the small sample properties have large sample analogs. An estimator with bias that goes to zero as
N →∞ is asymptotically unbiased. An estimator can be asymptotically efficient relative to another
estimator. Finally, an estimator can be mean square error consistent.
Note that a consistent estimator is, in the limit, a number. As such it can be used in further derivations. For
example, the ratio of two consistent estimators is consistent. The property of consistency is said to “carry
over” to the ratio. This is not true of unbiasedness. All we know about an unbiased estimate is that its mean
is equal to the truth. We do not know if our estimate is equal to the truth. If we compute the ratio of two
unbiased estimates we don’t know anything about the properties of the ratio.
GaussMarkov theorem
Consider the following simple linear model.
i i i
Y X u α β = + +
where u, and þ, are true, constant, but unknown parameters and
i
u is a random variable.
The GaussMarkov assumptions are (1) the observations on X are a set of fixed numbers, (2) Y
t
is
distributed with a mean of
i
X α β + for each observation i, (3) the variance of
i
Y is constant, and (4)
the observations on Y are independent from each other.
The basic idea behind these assumptions is that the model describes a laboratory experiment. The
researcher sets the value of X without error and then observes the value of Y, with some error. The
observations of Y come from a simple random sample, which means that the observation on Y
1
is
independent of Y
2
, which is independent of Y
3
, etc. The mean and the variance of Y does not change
between observations.
The GaussMarkov assumptions, which are currently expressed in terms of the distribution of Y, can be
rephrased in terms of the distribution of the error term, u. We can do this because, under the GaussMarkov
2
( ) σ
61
assumptions, everything in the model is constant except Y and u. Therefore, the variance of u is equal to
the variance of Y.
Also,
i i i
u Y X α β = − −
so that the mean is
( ) ( ) 0
i i i
E u E Y X α β = − − =
because ( )
i i
E X X = which means that ( )
i i
E Y X α β = + by assumption (1).
Therefore, if the GaussMarkov assumptions are true, the error term u is distributed with mean zero and
variance
2
σ .In fact, the GM assumptions can be succinctly written as
2
, ~ (0, )
i i i t
Y X u u iid α β σ = + +
where the tilde means “is distributed as,” and iid means “independently and identically distributed.” The
independent assumption refers to the simple random sample while the identical assumption means that the
mean and variance (and any other parameters) are constant.
Assumption (1) is very useful for proving the theorem, but it is not strictly necessary. The crucial
requirement is that X must be independent of u, that is the covariance between X and u must be zero, i.e.,
E(X,u)=0. Assumption (2) is crucial. Without it, least squares doesn’t work. Assumption (3) can be relaxed,
as we shall see in Chapter 9 below. Assumption (4) can also be relaxed, as we shall see in Chapter 12
below.
OK, so why is least squares so good? Here is the GaussMarkov theorem: if the GM assumptions hold, then
the ordinary least squares estimates of the parameters are unbiased, consistent, and efficient relative to any
other linear unbiased estimator. The proof of the unbiased and consistent parts are easy.
We know that the mean of u is zero. So,
( )
i i
E Y X α β = +
The “empirical analog” of this (using the sample observations instead of the theoretical model) is
where is the sample mean of Y and is the sample mean of X. Y X Y X α β = + Therefore, we can subtract
the means to get,
Let ,
i i i i
y Y Y x X X = − = −
where the lowercase letters indicate deviations from the mean. Then,
i i i
y x u β = +
(recall that the mean of u is zero so that u u u − = ). What we have done is translate the axes so that the
regression line goes through the origin.
Define e as the error. We want to minimize the sum of squares of error.
i i i
e y x β = −
( )
2
2
i i i
e y x β = −
i i
Y Y X X α β α β − = + − −
( )
2
2
N N
i i i
i i
e y x β = −
¯ ¯
62
We want to minimize the sum of squares of error by choosing þ, so take the derivative with respect to þ and
set it equal to zero.
( ) ( )
( ) ( ) ( )
( )
2
2 2 2 2
2 2 2 2 2 2 2
1 1 1 1 2 2 2 2
2 2 2
1 1 1 2 2 2
2
2 2 ... 2
2 2 2 2 ... 2 2
2 0
i i i i i i i
N N N N
N N N
i i i
d d d
e y x y x y x
d d d
d d d
y x y x y x y x y x y x
d d d
x y x x y x x y x
x y x
β β β
β β β
β β β β β β
β β β
β β β
β
¯ = ¯ − = ¯ − +
= − + + − + + + − +
= − + − + − − +
= − ¯ − =
Multiply both sides by 2, then eliminate the parentheses.
Solve for þ and let the solution value be :
This is the famous OLS estimator. Once we get an estimate of þ, we can back out the OLS estimate for u
using the means.
The first property of the OLS estimate
ˆ
β , is that it is a linear function of the observations on Y.
2 2
1
ˆ
i i
i i i i
i i
x y
x y w y
x x
β
¯
= = ¯ = ¯
¯ ¯
where
2
i
i
i
x
w
x
=
¯
The formula for
ˆ
β is linear because the w are constants. (Remember, by GaussMarkov assumption (1), the
x’s are constant.)
is an estimator. It is a formula for deriving an estimate. When we plug in the values of x and y from the
sample, we can compute the estimate, which is a numerical value.
Note that so far, we have only described the data with a linear function. We have not used our estimates to
make any inferences concerning the true value of u or þ.
Sampling distribution of
ˆ
β
Because y is a random variable and we now know that the OLS estimator
ˆ
β is a linear function of y, it
must be that
ˆ
β is a random variable. If
ˆ
β is a random variable, it has a distribution with a mean and a
variance. We call this distribution the sampling distribution of
ˆ
β . We can use this sampling distribution to
2
2
0
i i i
i i i
x y x
x x y
β
β
¯ − ¯ =
¯ = ¯
ˆ
β
2
ˆ
i i
i
x y
x
β
¯
=
¯
ˆ
ˆ
ˆ
ˆ
Y X
Y X
Y X
α β
α β
α β
= +
= +
= −
ˆ
β
63
make inferences concerning the true value of þ. There is also a sampling distribution of ˆ α , but we seldom
want to make inferences concerning the intercept, so we will ignore it here.
Mean of the sampling distribution of
ˆ
β
The formula for is
( )
2
i
2
2 2
2 2 2 2
ˆ
substitute y
ˆ
i i
i
i i
i i i
i
i i i i i i i i
i i i i
x y
x
x u
x x u
x
x x u x x u x u
x x x x
β
β
β
β
β β
β
¯
=
¯
= +
¯ +
=
¯
¯ + ¯ ¯ ¯ ¯
= = + = +
¯ ¯ ¯ ¯
The mean of
ˆ
β is its expected value,
( )
( )
2 2
2 2
ˆ
( )
i i i i
i i
i i i i
i i
x y x u
E E E
x x
x u x E u
E E
x x
β β
β β β
    ¯ ¯
= = +
 
¯ ¯
\ ¹ \ ¹
  ¯ ¯
= + = + =

¯ ¯
\ ¹
because ( ) 0
i
E u = by GM assumption (1).
The fact that the mean of the sampling distribution of
ˆ
β is equal to the truth means that the ordinary least
squares estimator is unbiased and thus a good basis on which to make inferences concerning the true þ.
Note that the unbiasedness property does not depend on any assumption concerning the form of the
distribution of the error terms. We have not, for example, assumed that the errors are distributed normally.
However, we will assume normally distributed errors below. It also relies only on two of the GM
assumptions, that the x’s are fixed and that ( ) 0
i
E u = . So we really don’t need the independent and
identical assumptions for unbiasedness.
However, the independent and identical assumptions are useful in deriving the property that OLS is
efficient. That is, if these two assumptions, along with the other, are true, then the OLS estimator is the
Best Linear Unbiased Estimator (BLUE). It can be shown that the OLS estimator has the smallest variance
of any linear unbiased estimator.
Variance of
ˆ
β
The variance of
ˆ
β can be derived as follows.
( ) ( ) ( ) ( )
2 2
ˆ ˆ ˆ ˆ
Var E E E β β β β β = − = −
We know that
2
ˆ
i i
i
x u
x
β β
¯
= +
¯
ˆ
β
64
So,
2
ˆ
i i
i i
i
x u
wu
x
β β
¯
− = = ¯
¯
and
( )
( )
2
2
2
2
ˆ
i i
i i
i
x u
E E E wu
x
β β
  ¯
− = = ¯

¯
\ ¹
because the covariances among the errors are zero due to the independence assumption, that is,,
( ) 0 for i j
i
j
E u u = ≠ . Therefore,
( )
2
2 2 2 2 2 2 2 2 2 2 2 2
1 1 2 2 1 1 2 2
( ) ( ) ... ( ) ...
i i N N N N
E wu w E u w E u w E u w w w σ σ σ ¯ = + + + = + + +
However,
2 2 2 2
1 2
...
N
σ σ σ σ = = = = because of the identical assumption. Thus,
( ) ( )
( )
( )
( )
2 2 2 2 2 2 2 2 2 2
1 2 1 2
2 2 2
2 2 2 1 2
2 2 2
2 2 2 2
1 2 2
2
2 2
2 2
2 2 2
2
ˆ
... ...
...
1
...
1
N N
N
i i i
N
i
i
i i
i
Var w w w w w w
x x x
x x x
x x x
x
x
x x
x
β σ σ σ σ
σ σ σ
σ
σ
σ σ
= + + + = + + +
     
= + + +
  
¯ ¯ ¯
\ ¹ \ ¹ \ ¹
= + + +
¯
¯
= = =
¯ ¯
¯
Consistency of OLS
To be consistent, the mean square error of
ˆ
β must go to zero as the sample size, N, goes to infinity. The
mean square error of
ˆ
β is the sum of the variance plus the squared bias.
2
ˆ ˆ ˆ
( ) ( ) ( ) mse Var E β β β β
= + −
Obviously the bias is zero since the OLS estimate is unbiased. Therefore the consistency of the ordinary
least estimator
ˆ
β depends on the variance of
ˆ
β ,
2 2
1
ˆ
( )
N
i
i
Var x β σ
=
=
¯
. As N goes to infinity,
2
1
1
N
i
x
=
¯
goes to
infinity because we keep adding more and more squared terms. Therefore, the variance of
ˆ
β goes to zero as
N goes to infinity, the mean square error goes to zero, and the OLS estimate is consistent.
( ) ( )
( )
( )
2 2
1 1 2 2
2 2 2 2 2 2
1 1 2 2 1 2 1 2 1 3 1 3 1 1 1 1
2 2 2 2 2 2
1 1 2 2
...
... ... ...
...
i i N N
N N N N N N N N
N N
E wu wu w u w u
E w u w u w u w w u u w w u u w w u u w w u u
E w u w u w u
− −
¯ = + + +
= + + + + + + + +
= + + +
65
Proof of the GaussMarkov Theorem
According to the GaussMarkov theorem, if the GM assumptions are true, the ordinary least squares
estimator is the Best Linear Unbiased Estimator (BLUE). That is, of all linear unbiased estimators, the OLS
estimator has the smallest variance.
Stated another way, any other linear unbiased estimator will have a larger variance than the OLS estimator.
The GaussMarkov assumptions are as follows.
(1) The true relationship between y and x (expressed as deviations from the mean) is
i i i
y x u β = +
(2) The residuals are independent and the variance is constant:
2
~ ( , ) u iid o
i
σ .
(3) The x’s are fixed numbers.
We know that the OLS estimator is a linear estimator
2
ˆ
xy
wy
x
β
¯
= = ¯
¯
Where
2
/ w x x = ¯ .
ˆ
β is unbiased and has a variance equal to
2 2
/ x σ ¯ .
Consider any other linear unbiased estimator,
( ) cy c x u cx cu β β β = ¯ = ¯ + = ¯ + ¯
Taking expected values,
( ) ( ) E cx cE u
cx
β β
β
= ¯ + ¯
= ¯
Since E(u)=0 as before. This estimator is unbiased if cx ¯ =1.
Note, since cx ¯ =1
and
cu
cu
β β
β β
= + ¯
− = ¯
Find the variance of β
.
[ ]
2 2
2
1 1 2 2 2
2 2 2 2 2 2
1 2
2 2
( )
...
...
N
N
Var E E cu
E c u c u c u
c c c
c
β β β
σ σ σ
σ
= − =
= + + +
= + + +
=
¯
¯
66
According to the theorem, this variance must be greater than or equal to the variance of β
.
Here is a useful identity.
2 2 2
2 2 2
( )
( ) 2 ( )
( ) 2 ( )
c w c w
c w c w w c w
c w c w w c w
= + −
= + − + −
¯ = ¯ + ¯ − + ¯ −
The last term on the RHS is equal to zero.
( )
( )
2
2
2 2
2
2
2 2
2
2 2
( )
1 1
0
w c w wc w
cx x
x
x
cx x
x
x
x x
¯ − = ¯ − ¯
= −
¯
¯
¯ ¯
= −
¯
¯
= − =
¯ ¯
¯ ¯
Since cx ¯ =1.
Therefore,
( )
2 2
2 2 2
2 2 2 2
2
2 2 2
2
2
2
2 2
2
2 2
( )
( ( ) )
( )
( )
( )
ˆ
( ) ( )
Var c
w c w
w c w
x
c w
x
c w
x
Var c w
β σ
σ
σ σ
σ σ
σ
σ
β σ
= ¯
= ¯ + −
= ¯ + ¯ −
¯
= + ¯ −
¯
= + ¯ −
¯
= + ¯ −
So
2 2
ˆ
( ) ( ) because ( ) 0 Var Var c w β β σ ≥ ¯ − ≥
. The only way the two variances are equal is if
c=w, which is the OLS estimator.
Inference and hypothesis testing
Normal, Student’s t, Fisher’s F, and Chisquare
Before we can make inferences or test hypotheses concerning the true value of þ, we need to review some
famous statistical distributions. We will be using all of the following distributions throughout the semester.
67
Normal distribution
The normal distribution is the familiar bell shaped curve. With mean µ and variance o
2
, it has the formula,
( )
2
2
( )
2
2
1
 ,
2
X
f X e
µ
σ
µ σ
πσ
− −
=
or simply
( )
2
~ , X N µ σ where the tilde is read “is distributed as.”
If X is distributed normally, then any linear transformation of X is also distributed normally. That is, if
( )
2
~ , X N µ σ then
2 2
( ) ~ ( , ) a bX N a b b µ σ + + This turns out to be extremely useful, Consider the linear
function of X, z a bX = + . Let a µ σ = − and 1 b σ = then ( ) z a bX X X µ σ σ µ σ = + = − + = − which
is the formula for a standard score, or zscore.
If we express X as a standard score, z, then z has a standard normal distribution because the mean of z is
zero and its standard deviation is one. That is,
( )
2
2
1
2
z
z e ϕ
π
−
=
Here is the graph.
We can generate a normal variable with 100,000 observations as follows.
. set obs 100000
obs was 0, now 100000
. gen y=invnorm(uniform())
. summarize y
Variable  Obs Mean Std. Dev. Min Max
+
y  100000 .0042645 1.002622 4.518918 4.448323
. histogram y
(bin=50, start=4.518918, width=.17934483)
68
The Student’s t, Fisher’s F, and chisquare are derived from the normal and arise as distributions of sums of
variables. They have associated with them their socalled degrees of freedom which are the number of
terms in the summation.
Chisquare distribution
If z is distributed standard normal, then its square is distributed as chisquare with one degree of freedom.
Since chisquare is a distribution of squared values, and, being a probability distribution, the area under the
curve must sum to one, it has to be a skewed distribution of positive values, asymptotic to the horizontal
axis.
(1) If z ~ N(0,1), then
2 2
~ (1) z χ which has a mean of 1 and a variance of 2. Here is the graph.
0
.
1
.
2
.
3
.
4
D
e
n
s
i
t
y
4 2 0 2 4
y
69
(2) If
1 2
, ,...,
n
X X X are independent
2
(1) χ variables, then
2
1
~ ( )
N
i
i
X n χ
=
¯
with mean=n and variance = 2n
(3) Therefore, if
1 2
, ,...,
n
z z z are independent standard normal, N(0,1), variables, then
2 2
1
~ ( )
n
i
i
z n χ
=
¯
.
Here are the graphs for n=2,3,and 4.
(4) If
2
n
2 2
1 2
i=1
, ,..., are independent N(0, ) variables then ~ ( ).
i
n
z
z z z n σ χ
σ
 

\ ¹
¯
(5) If
2 2 2
1 1 2 2 1 2 1 2 1 2
~ ( ) and ~ ( ) and and are independent, then ~ ( ). X n X n X X X X n n χ χ χ + +
We can generate a chisquare variable by squaring a bunch of normal variables.
. gen z1=invnorm(uniform())
. set obs 10000
obs was 0, now 10000
. gen z1=invnorm(uniform())
. gen z2=invnorm(uniform())
. gen z3=invnorm(uniform())
. gen z4=invnorm(uniform())
. gen z5=invnorm(uniform())
. gen v=z1+z2+z3+z4+z5
. gen z6=invnorm(uniform())
. gen z7=invnorm(uniform())
. gen z8=invnorm(uniform())
. gen z9=invnorm(uniform())
. gen z10=invnorm(uniform())
. gen v=z1^2+z2^2+z3^2+z4^2+z5^2+z6^2+z7^2+z8^2+z9^2+z10^2
.*this creates a chisquare variable with n=4 degrees of freedom
. histogram v
(bin=40, start=1.0842115, width=.91651339)
70
Fdistribution
(6) If
2 2
1 1 2 2 1 2
~ ( ) and ~ ( ) and and are independent, then X n X n X X χ χ
1
1
1 2
2
2
~ ( , )
X
n
F n n
X
n
This is Fisher’s Famous Ftest. It is the ratio of two squared quantities, so it is positive and therefore
skewed, like the chisquare. Here are graphs of some typical F distributions.
We can create a variable with an F distribution is follows: generate five standard normal variables, then
sum their squares to make a chisquare variable (v) with five degrees of freedom; finally, divide v by z to
make a variable distributed as F with 5 degrees of freedom in the numerator and 10 degrees of freedom in
the denominator.
. gen x1=invnorm(uniform())
0
.
0
2
.
0
4
.
0
6
.
0
8
.
1
D
e
n
s
i
t
y
0 10 20 30 40
v
71
. gen x2=invnorm(uniform())
. gen x3=invnorm(uniform())
. gen x4=invnorm(uniform())
. gen x5=invnorm(uniform())
. gen v1=x1^2+x2^2+x3^3+x4^2+x5^2
. gen f=v1/v
. * f has 5 and 10 degrees of freedom
. summarize v1 v f
Variable  Obs Mean Std. Dev. Min Max
+
v1  10000 3.962596 4.791616 62.87189 66.54086
v  10000 10.0526 4.478675 1.084211 37.74475
f  10000 .4877415 .7094975 9.901778 14.72194
. histogram f, if f<3 & f>=0
tdistribution
(7) If z ~ N(0,1) and
2
~ ( ) X n χ and z and X are independent, then the ratio ~ ( )
z
T t n
X n
= . This is
Student’s t distribution. It can be positive or negative and is symmetric, like the normal distribution. It
looks like the normal distribution but with “fat tails,” that is, more probability in the tails of the distribution,
compared to the standard normal.
0
.
5
1
1
.
5
D
e
n
s
i
t
y
0 1 2 3
f
72
(8) If ~ ( ) T t n then
2
2
~ (1, )
z
T F n
X n
= because the squared tratio is the ratio of two chisquared
variables.
We can create a variable with a t distribution by dividing a standard normal variable by a chisquare
variable.
. set obs 100000
obs was 0, now 100000
. gen z=invnorm(uniform())
. gen y1=invnorm(uniform())
. gen y2=invnorm(uniform())
. gen y3=invnorm(uniform())
. gen y4=invnorm(uniform())
. gen y5=invnorm(uniform())
. gen v=y1^2+y2^2+y3^2+y4^2+y5^2
. gen t=z/sqrt(v)
. histogram t if abs(t)<2, normal
(bin=49, start=1.9989421, width=.08160601)
0
.
2
.
4
.
6
.
8
D
e
n
s
i
t
y
2 1 0 1 2
t
73
Asymptotic properties
(9) As N goes to infinity, (0,1) t N →
(10) As
2 1 1
2 1 1 1
2 2
, ( )
X n
n F X n n
X n
χ → ∞ = →
because as
2 2
, n X →∞ → ∞, so that the ratio goes to one.
(11) As
2 2
, N NR χ → ∞ →
where
2
/
/
RSS RSS N
R
TSS TSS N
= =
and RSS is the regression (model) sum of squares.
2
/
/
RSS N
NR N
TSS N
=
As N →∞ , / 1 TSS N → so that
2 2
/
~
1
RSS N
NR N RSS χ
→ =
(12) As
2
, N NF χ → ∞ . See (10) above.
Testing hypotheses concerning
So, we can summarize the sampling distribution of as
( )
2 2
ˆ
~ ,
i
iid x β β σ ¯ . If we want to actually use
this distribution to make inferences, we need to make an assumption concerning the form of the function.
Let’s make the usual assumption that the errors in the original regression model are distributed normally.
That is, where the N stands for normal. This assumption can also be written as
2
~ (0, )
i
u IN σ which is translated as “u
i
is distributed independent normal with mean zero and variance
2
σ
.” (We don’t need the identical assumption because the normal distribution has a constant variance.) Since
ˆ
β is a linear function of the error terms and the error terms are distributed normally, then
ˆ
β is distributed
normally, that is,
( )
2 2
ˆ
~ ,
i
IN x β β σ ¯ . We are now in position to make inferences concerning the true
value of þ.
To test the null hypothesis that
0 0
where β β β = is some hypothesized value of þ, construct the test statistic,
2
0
ˆ
2
ˆ
ˆ
ˆ
where
i
T S
S x
β
β
β β σ −
= =
¯
is the standard error of
ˆ
β , the square root of its variance.
2
2 1
ˆ
N k
i
i
e
N k
σ
−
=
=
−
¯
where
2
i
e is the squared residual corresponding to the i’th observation, N is the sample size,
and k is the number of parameters in the regression. In this simple regression, k=2 because we have two
parameters, u and þ.
2
ˆ σ is the estimated error variance. It is equal to the error sum of squares divided by
the degrees of freedom (Nk) and is also known as the mean square error.
ˆ
β
( )
2
~ 0,
i
u iidN σ
74
Notice that the test statistic, also known as a tratio, is the ratio of a normally distributed random variable,
ˆ
β , to a chisquare variable (the sum of a bunch of squared, normally distributed, error terms).
Consequently, it is distributed as student’s t with Nk degrees of freedom.
By far the most common test is the significance test, namely that X and Y are unrelated. To do this test set
0
0 β = so that the test statistic reduces to the ratio of the estimated parameter to its standard error.
ˆ
ˆ
T
S
β
β
= . This is the “t” reported in the regress output in Stata.
Degrees of Freedom
Karl Pearson, who discovered the chisquare distribution around 1900, applied it to a variety of problems
but couldn’t get the parameter value quite right. Pearson, and a number of other statisticians, kept requiring
ad hoc adjustments to their chisquare applications. Sir Ronald Fisher solved the problem in 1922 and
called the parameter degrees of freedom. The best explanation of the concept that I know of is due to
Gerrard Dallal (http://www.tufts.edu/~gdallal/dof.htm). Suppose we have a data set with N observations.
We can use the data in two ways, to estimate the parameters or to estimate the variance. If we use data
points to estimate the parameters, we can’t use them to estimate the variance.
Suppose we are trying to estimate the mean of a data set with X=(1,2,3). The sum is 6 and the mean is 2.
Once we know this, we can find any data point knowing only the other two. That is, we used up one degree
of freedom to estimate the mean and we only have two left to estimate the variance. To take another
example, suppose we are estimating a simple regression model with ordinary least squares. We estimate the
slope with the formula
2
ˆ
/ xy x β = ¯ ¯ and the intercept with the formula
ˆ
ˆ Y X α β = − , using up two data
points. This leaves the remaining N2 data points to estimate the variance. We can also look at the
regression problem this way: it takes two points to determine a straight line. If we only have two points, we
can solve for the slope and intercept, but this is a mathematical problem, not a statistical one. If we have
three data points, then there are many lines that could be defined as “fitting” the data. There is one degree
of freedom that could be used to estimate the variance.
Our simple rule is that the number of degrees of freedom is the sample size minus the number of
parameters estimated. In the example of the mean, we have one parameter to estimate, yielding N1 data
points to estimate the variance. Therefore, the formula for the sample variance of X (we have necessarily
already used the data once to compute the sample mean) is
2
2
( )
1
x
X X
S
N
¯ −
=
−
where we divide by the number of degrees of freedom (N=1), not the sample size (N). To prove this, we
have to demonstrate that this is an unbiased estimator for the variance of X (and therefore, dividing by N
yields a biased estimator).
2
2
2 2
2 2
( ) ( ) ( )
( ) 2( )( ) ( )
( ) 2( ) ( ) ( )
X X X X
X X X X
X X X N X
µ µ
µ µ µ µ
µ µ µ µ
¯ − = ¯ − − −
= ¯ − − − − + −
= ¯ − − − ¯ − + −
because and X µ are both constants. Also,
( ) ( ) X X N NX N N X µ µ µ µ ¯ − = ¯ − = − = −
75
which means that the middle term in the quadratic form above is
2
2( ) ( ) 2( ) ( ) 2 ( ) X X X N X N X µ µ µ µ µ − − ¯ − = − − − = − −
and thefore,
2 2 2 2
2 2
( ) ( ) 2 ( ) ( )
( ) ( )
X X X N X N X
X N X
µ µ µ
µ µ
¯ − = ¯ − − − + −
= ¯ − − −
Now we can take the expected value of the proposed formula,
2
2 2
2 2
( ) 1
( ) ( )
1 1 1
1
( ) ( )
1 1
X X N
E E X X
N N N
N
E X E X
N N
µ µ
µ µ
  ¯ −  
= ¯ − − −
 
− − −
\ ¹
\ ¹
= ¯ − − −
− −
Note that the second term is the variance of the mean of X:
2
2
( )
( ) ( )
X
E X Var X
N
µ
µ
¯ −
− = =
But,
1 2 2
2
2
2
1 1
( ) ( ... )
1
N
x
x
Var X Var X Var X X X
N N
N
N N
σ
σ
 
= ¯ = + + +

\ ¹
= =
Therefore,
2
2 2
2
2 2 2
2 2
( ) 1
( ) ( )
1 1 1
1
1 1 1 1
1
1
x
x x x
x x
X X N
E E X X
N N N
N N N
N N N N N
N
N
µ µ
σ
σ σ σ
σ σ
  ¯ −  
= ¯ − − −
 
− − −
\ ¹
\ ¹
= − = −
− − − −
−
= =
−
Note (for later reference),
( ) ( )
2 2 2
( ) ( 1)
x
E x E X X N σ ¯ = ¯ − = − .
Estimating the variance of the error term
We want to prove that the formula
2
2
ˆ
2
e
N
σ
¯
=
−
where e is the residual from a simple regression model, yields an unbiased estimate of the true error
variance.
76
In deviations,
ˆ
e y x β = −
But,
ˆ
y x u β = +
So,
( )
ˆ ˆ
e x u x u x β β β β = + − = − −
Squaring both sides,
( ) ( )
2
2 2 2
ˆ ˆ
2 e u xu x β β β β = − − + −
Summing,
( ) ( )
2
2 2 2
ˆ ˆ
2 e u xu x β β β β ¯ = ¯ − ¯ − + − ¯
Taking expected values,
( ) ( )
2
2 2 2
ˆ ˆ
2 E e E u E xu E x β β β β
¯ = ¯ − − ¯ + − ¯
Note that the variance of
ˆ
β is
( ) ( )
2
2
2
ˆ ˆ
Var E
x
σ
β β β = − =
¯
Which means that the third term of the expectation expression is
( )
2
2
2 2 2
2
ˆ
E x x
x
σ
β β σ
¯ − ¯ = ¯ =
¯
We know from the theory of least squares that
2
ˆ
xu
x
β β
¯
− =
¯
Which implies that
( )
2
ˆ
xu x β β ¯ = − ¯
Therefore the second term in the expectation expression is
77
( )( ) ( )
2
2 2
2
2 2
2
ˆ ˆ ˆ
2 2
2 2
E x E x
x
x
β β β β β β
σ
σ
− − − ¯ = − − ¯
= − ¯ = −
¯
The first term in the expectation expression is
( )
2 2
1 E u N σ ¯ = −
because the unbiased estimator of the variance of x is
2 2
/( 1)
x
S x N = ¯ − so that the expected value of
( )
2 2
( 1)
x
E x N S ¯ = − . In the current application x=e and
2 2
x
S σ = .
Therefore, putting all three terms together, we get
2 2 2 2 2 2 2 2 2
( 1) 2 2 ( 2) E e N N N σ σ σ σ σ σ σ σ ¯ = − + − = − + − = −
Which implies that the unbiased estimate of the error variance is
2
2
ˆ
2
e
N
σ
¯
=
−
2
2 2
2 2
( 2)
ˆ
2 2 2
E e
e N
E E
N N N
σ
σ σ
¯
¯ −
= = = =
− − −
If we think of each of the squared residuals as an estimate of the variance of the error term, then the
formula tells us to take the average. However, we only have N2 independent observations on which to
base our estimate of the variance, so we divide by N2 when we take the average.
78
Chebyshev’s Inequality
Arguably the most remarkable theorem in statistics, Chebyshev’s inequality states that, for any variable x,
from any arbitrary distribution, the probability that an observation lies more than some distance, c, from
the mean must be less than or equal to
2 2
/
x
σ ε . That is,
2
2
(  )
x
P X
σ
µ ε
ε
− ≥ ≤
The following diagram might be helpful.
Proof
The variance of x can be written as
2
2 2
( ) 1
( )
x
X
X
N N
µ
σ µ
¯ −
= = ¯ −
if all observations have the same probability (1/N) of being observed.
More generally,
2 2
( ) ( )
x
X P X σ µ = ¯ −
Suppose we limit the summation on the right hand side to those values of X that lie outside the ε ± box.
Then the new summation could not be larger than the original summation, which is the variance of x,
because we are omitting some (positive, squared) terms.
2 2
2
 
( ) ( )
( ) ( )
x
X
X P X
X P X
µ ε
σ µ
µ
− ≥
= ¯ −
≥ −
¯
We also know that all of the X’s in the limited summation are at least c away from the mean. (Some of
them are much further than c away.) Therefore, replacing ( ) X µ − with c,
2 2
 
2 2 2
   
( ) ( )
( ) ( ) (  )
x
X
X X
X P X
P X P X P X
µ ε
µ ε µ ε
σ µ
ε ε ε µ ε
− ≥
− ≥ − ≥
≥ −
≥ = = − ≥
¯
¯ ¯
Xi µ
X
÷ ÷
79
So,
2 2
(  )
x
P X σ ε µ ε ≥ − ≥
Dividing both sides by
2
ε
2
2
(  )
x
P X
σ
µ ε
ε
≥ − ≥
Turning it around we get,
2
2
(  )
x
P X
σ
µ ε
ε
− ≥ ≤
QED
The theorem is often stated as follows.
For any set of data and any constant k>1, at least
2
1
1
k
− of the observations must lie within k standard
deviations of the mean.
To prove this version, let
x
k ε σ =
2
2 2 2
1
(  )
x
x
x
P X k
k k
σ
µ σ
σ
− ≥ ≤ =
Since this is the probability that the observations lie more than k standard deviations from the mean, the
probability that the observations lie less than k sd from the mean is one minus this probability. That is,
2
1
(  ) 1
x
P X k
k
µ σ − ≤ ≤ −
So, if we have any data whatever, and we set k=2, we know that
1 3
1 75%
4 4
− = = of the data lie within two standard deviations of the mean.
(We can be more precise if we know the distribution. For example, 95% of normally distributed data lie
within two standard deviations of the mean.)
Law of Large Numbers
The law of large numbers states that if a situation is repeated again and again, the proportion of successful
outcomes will tend to approach the constant probability that any one of the outcomes will be a success.
For example, suppose we have been challenged to guess how many thumbtacks will end up face down if
100 are dropped off a table. (Answer: 50.)
How would we prepare for such a challenge? Toss a thumbtack into the air and record how many times it
ends up face down. Divide that into the number of trials and you have the proportion. Multiply by 100 and
that is your guess.
Suppose we are presented with a large box full of black and white balls. We want to know the proportion of
black balls, but we are not allowed to look inside the box. What do we do? Answer: sample with
replacement. The number of black balls that we draw divided by the total number of balls drawn will give
us an estimate. The more balls we draw, the better the estimate.
Mathematically, the theorem is,
Let
1 2
, ,...,
N
X X X be an independent trials process where ( )
i
E X µ = and
2
( )
x i
Var X σ = .
80
Let
1 2
...
N
S X X X = + + + . Then, for any c > 0
0
N
S
P
N
µ ε
→∞
 
− ≥ →

\ ¹
Equivalently,
1
N
S
P
N
µ ε
→∞
 
− < →

\ ¹
Proof,
We know that
2
1 2
( ) ( ... )
N x
Var S Var X X X Nσ = + + + =
Because there is no covariance between the X’s. Therefore,
2 2
1 2 2
(( ... ) / )
x x
N
N S
Var Var X X X N
N N N
σ σ
 
= + + + = =

\ ¹
Also,
1 2
...
N
X X X S N
E E
N N N
µ
µ
+ + +    
= = =
 
\ ¹ \ ¹
By Chebyshev’s Inequality we know that, for any c > 0,
( )
2
2 2
0
x
N
S
Var
S
N
P
N N
σ
µ ε
ε ε
→∞
 
− ≥ ≤ = →

\ ¹
Equivalently, the probability that the difference between S/N and the true mean is less thanε is one
minus this. That is,
1
N
S
P
N
µ ε
→∞
 
− < →

\ ¹
Since S/N is an average, the law of large numbers is often called the law of averages.
Central Limit Theorem
This is arguably the most amazing theorem in mathematics.
Statement:
Let x be a random variable distributed according to f(x) with mean µ and variance
2
X
σ . Let X be the mean
of a random sample of size N from f(x).
Let
2
X
X
z
N
µ
σ
−
= .
The density of z will approach the standard normal (mean zero and variance one) as N goes to infinity.
The incredible part of this theorem is that no restriction is placed on the distribution of x. No matter how x
is distributed (as long as it has a finite variance), the sample mean of a large sample will be distributed
normally.
81
Not only that, but the sample doesn’t really have to be that large. It used to be that a sample of 30 was
thought to be enough. Nowadays we usually require 100 or more. Infinity comes quickly for the normal
distribution.
The CLT is the reason that the normal distribution is so important.
The theorem was first proved by DeMoivre in 1733 but was promptly forgotten. It was resurrected in 1812
by LaPlace, who derived the normal approximation to the binomial distribution, but was still mostly
ignored until Lyapunov in 1901 generalized it and showed how it worked mathematically. Nowadays, the
central limit theorem is considered to be the unofficial sovereign of probability theory.
One of the few people who didn’t ignore it in the 19’th Century was Sir Francis Galton, who was a big fan.
He wrote in Natural Inheritance, 1889:
I know of scarcely anything so apt to impress the imagination as the wonderful form of
cosmic order expressed by the "Law of Frequency of Error". The law would have been
personified by the Greeks and deified, if they had known of it. It reigns with serenity and
in complete selfeffacement, amidst the wildest confusion. The huger the mob, and the
greater the apparent anarchy, the more perfect is its sway. It is the supreme law of
Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled
in the order of their magnitude, an unsuspected and most beautiful form of regularity
proves to have been latent all along.
We can’t prove this theorem here. However, we can prove it for certain cases using Monte Carlo methods.
For example, the binomial distribution is decidedly nonnormal since values must be either zero or one.
However, according to the theorem, the mean of a large number of sample means will be distributed
normally.
Here is the histogram of the binomial distribution with an equal probability of generating a one or a zero..
82
Not normal.
I asked Stata to generate 10,000 samples of size N=100 from this distribution. I then computed the mean
and variance of the resulting 10,000 observations.
Variable  Obs Mean Std. Dev. Min Max
+
sum  10000 50.0912 4.981402 31 72
I then computed the zscores by taking each observation, subtracting the mean (50.0912) and dividing by
the standard deviation. The histogram of the resulting data is:
This is the standard normal distribution with mean zero and standard deviation one.
. summarize z
0
1
2
3
4
5
D
e
n
s
i
t
y
0 .2 .4 .6 .8 1
x
0
.
1
.
2
.
3
.
4
D
e
n
s
i
t
y
4 2 0 2 4
z
83
Variable  Obs Mean Std. Dev. Min Max
+
z  10000 3.57e07 1 3.832496 4.398119
Here is another example. This is the distribution of the population of cities in the United States with
popularion of 100,000 or more.
It is obviously not normal being highly skewed. There are very few very large cities and a great number of
relatively small cities. Suppose we treat this distribution as the population. I told Stata to take 10,000
samples of size N=200 from this data set. I then computed the mean and standard deviation of the resulting
data.
summarize mx
Variable  Obs Mean Std. Dev. Min Max
+
mx  10000 287657 41619.64 193870.2 497314.6
I then computed the corresponding zscores. According to the Central Limit Theorem, the zscores should
have a zero mean and unit variance.
. summarize z
Variable  Obs Mean Std. Dev. Min Max
+
z  10000 1.37e07 1 2.253426 5.03747
As expected, the mean is zero and the standard deviation is one. Here is the histogram.
0
5
.
0
e

0
7
1
.
0
e

0
6
1
.
5
e

0
6
2
.
0
e

0
6
D
e
n
s
i
t
y
0 2000000 4000000 6000000 8000000
population
84
It still looks a little skewed, but it is very close to the standard normal.
Method of maximum likelihood
Suppose we have the following simple regression model
2
~ (0, )
i i i
i
Y X U
U IN
α β
σ
= + +
So, the Y are distributed normally. The probability of observing the first Y is
{ } { }
2 2
1 1 1
2 2
1 1
( )
2 2
1
2 2
1 1
Pr( )
2 2
U Y X
Y e e
α β
σ σ
πσ πσ
− − − −
= =
The probability of observing the second Y is exactly the same, except for the subscript. The sample consists
of N observations. The probability of observing the sample is the probability of observing Y
1
and Y
2
, etc.
1 2 1 2
Pr( , ,..., ) Pr( ) Pr( ) Pr( )
N N
Y Y Y Y Y Y =
because of the independence assumption. Therefore,
( )
{ }
2
2
1
2
1
( )
2
1 2 1
2
1
Pr( , , ..., ) e L( , , )
2
i i
Y X
N
N
Y Y Y
α β
σ
α β σ
πσ
− − −
= =
∏
This is the likelihood function, which gives the probability of observing the sample that we actually
observed. It is a function of the unknown parameters, , , and α β σ . The symbol
1
N
∏
means multiply the
items indexed from one to N.
0
.
1
.
2
.
3
.
4
D
e
n
s
i
t
y
2 0 2 4 6
z
85
The basic idea behind maximum likelihood is that we probably observed the most likely sample, not a rare
or unusual sample, therefore, the best guess of the values of the parameters are those that generate the most
likely sample. Mathematically, we maximize the likelihood of the sample by taking derivatives with respect
to , , and α β σ , setting the derivatives equal to zero, and solving.
Let’s take logs first to make the algebra easier.
2 2 1
2 2
1
1
Log L log(2 ) ( )
2
N
i i
i
Y X πσ α β
σ
=
= − − − −
¯
This function has a lot of stuff that is not a function of the parameters.
2
2
log L log
2 2
N Q
C σ
σ
= − −
where
log(2 )
2
N
C π
 
= −

\ ¹
which is constant with respect to the unknown parameters, and
2
1
( )
N
i i
i
Q Y X α β
=
= − −
¯
where Q is the error sum of squares. Since Q is multiplied by a negative, to maximize the likelihood, we
have to minimize the error sum of squares with respect to u and þ. This is nothing more than ordinary least
squares. Therefore, OLS is also a maximum likelihood estimator. We already know the OLS formulas, so
the only thing we need now is the maximum likelihood estimator for the variance,
2
σ ,
Substitute the OLS values for u and þ:
ˆ
ˆ and α β :
2
2
ˆ
logL( ) log( )
2 2
N Q
C σ σ
σ
= − −
where
2
1
ˆ ˆ
ˆ ( )
N
i i
i
Q Y X α β
=
= − −
¯
is the residual sum of squares, ESS.
Now we have to differentiate with respect to o and set the derivative equal to zero
4
ˆ
log L( ) 4
0
4
d N Q
d
σ σ
σ σ σ
= − + =
3
ˆ
N Q
σ σ
=
2
ˆ
Q
N
σ
=
2
ˆ
ˆ
Q ESS
N N
σ = =
86
Now that we have estimates for , , and α β σ , we can substitute
ˆ
ˆ ˆ , , and α β σ into the log likelihood function
to get its maximum value:
2
2
ˆ
ˆ ˆ , ,
ˆ ˆ
Max logL log( ) log
2 2 2 2
N Q N Q N
C C
N
α β σ
σ
σ
 
= − − = − −


\ ¹
Note that
2
2 2
ˆ
ˆ
ˆ ˆ 2 2
Q Nσ
σ σ
= .
So,
( )
( )
ˆ
ˆ ˆ , ,
ˆ
Max logL log log
2 2 2
N N N
C Q N
α β σ
= − + −
Since N is constant,
( )
ˆ
ˆ ˆ , ,
ˆ
Max logL log
2
N
const Q
α β σ
= −
where
( ) log
2 2
N N
const C N = + −
Therefore,
2 2 ˆ
* *( )
N N
MaxL const Q const ESS
− −
= =
Likelihood ratio test
Let L be the likelihood function. We can use this function to test hypotheses such as 0 β = or 10 α β + = as
restrictions on the model. As in the Ftest, if the restriction is false, we expect that the residual sum of
squares (ESS) for the restricted model will be higher than the ESS for the unrestricted model. This means
that the likelihood function will be smaller (not maximized) if the restriction is false. If the hypothesis is
true, then the two values of ESS will be approximately equal and the two values of the likelihood functions
will be approximately equal.
Consider the likelihood ratio,
under the restriction
1
without the restriction
MaxL
MaxL
λ = ≤
The test statistic is
2
2log( ) ( ) p λ χ −
where p is the number of restrictions.
To do this test, do two regressions, one with the restriction, saving the residual sum of squares, ESS(R), and
one without the restriction, saving ESS(U). So,
87
2
2
2
*( ( )) ( )
( )
*( ( ))
N
N
N
const ESS R ESS R
ESS U
const ESS U
λ
−
−
−
 
= =

\ ¹
( ) log( ) log( ( ) log( ( )
2
N
ESS R ESS U λ = − −
The test statistic is
( )
2
2log( ) log( ( ) log( ( ) ( ) N ESS R ESS U p λ χ − = −
This is a large sample test. Like all large sample tests, its significance is not well known in small samples.
Usually we just assume that the small sample significance is about the same as it would be in a large
sample.
Multiple regression and instrumental variables
Suppose we have a model with two explanatory variables. For example,
2
, ~ (0, )
i i i i i
Y X Z u u iid α β γ σ = + + +
We want to estimate the parameters u, þ, and v. We define the error as
i i i i
e Y X Z α β γ = − − −
At this point we could square the errors, sum them, take derivatives with respect to the parameters, set the
derivatives equal to zero and solve. This would generate the least squares estimators for this multiple
regression. However, there is another approach, known as instrumental variables, that is both illustrative
and easier. To see how instrumental variables works, let’s first derive the OLS estimators for the simple
regression model,
i i i
Y X u α β = + +
The crucial GaussMarkov assumptions can be written as
(A1) ( ) 0
(A2) ( ) 0
i
i i
E u
E x u
=
=
A1 states that the model must be correct on average, so the (true) mean of the errors is zero. A2 is the
alternative to the assumption that the x’s are a set of fixed numbers. It says that the x’s cannot be correlated
with the error term, that is, the covariance between the explanatory variable and the error term must be
zero.
Since we do not observe u
i
, we have to make the assumptions operational. We therefore define the
following empirical analogs of A1 and A2.
i
i
(a1) 0 0
x
(a2) 0 x 0
i
i
i
i
e
e e
N
e
e
N
= = ¬ =
= ¬ =
¯
¯
¯
¯
88
(a1) says that the mean of the residuals is zero and (a2) says that the covariance between x and the residuals
is zero.
Define the residuals,
i i i
e Y X α β = − −
Sum the residuals,
( )
1 1 1 1
0 by (a1).
N N N N
i i i i i
i i i i
e Y X Y N X α β α β
= = = =
= − − = − − =
¯ ¯ ¯ ¯
Divide both sides by N.
1 1
0
N N
i i
i i
Y X
N N
α β
= =
− − =
¯ ¯
Or,
0
ˆ
Y X
Y X
α β
α β
− − =
= −
which is the OLS estimator for u.
We can use (a2) to derive the OLS estimator for β .
( )
i i i i i i i
i i i
y Y Y X u X Y X X u
y x u
α β α β β
β
= − = + + − − = − − +
= +
The predicted value of y is
ˆ
ˆ
ˆ
i i
i i i
y x
y y e
β =
= +
The observed value of y is the sum of the predicted value from the regression and the residual, the
difference between the observed value and the predicted value.
ˆ
i i i
y x e β = +
Multiply both sides by x
i
,
2
ˆ
i i i i
x y x x e β = +
Sum,
2
ˆ
i i i i
x y x x e β ¯ = ¯ + ¯
But
0
i i
x e ¯ =
by (a2). So,
2
ˆ
i i
x y x β ¯ = ¯
Solve for
ˆ
β ,
2
ˆ
i i
x y
x
β
¯
=
¯
which is the OLS estimator.
Now that we know the we can derive the OLS estimators using instrumental variables, let’s do it for the
multiple regression above.
i i i i
Y X Z u α β γ = + + +
89
We have to make three assumptions (with their empirical analogs), since we have three parameters.
(A1) ( ) 0 0
i
e
E u
N
¯
= ¬ =
(A2) ( ) 0 0 0
i i
i i i i
x e
E x u x e
N
¯
= ¬ = ¬ ¯ =
(A3) ( ) 0 0 0
i i
i i i i
z e
E z u z e
N
¯
= ¬ = ¬ ¯ =
From (A1) we get the estimate of the intercept.
0
0
i i i i
i i i i
e Y N X Z
e Y X Z
N N N N
α β γ
α β γ
¯ = ¯ − − ¯ − ¯ =
¯ ¯ ¯ ¯
= − − − =
Or,
0
ˆ
Y X Z
Y X Z
α β γ
α β γ
− − − =
= − −
which is a pretty straightforward generalization of the simple regression estimator for u.
Now we can work in deviations.
i i i i i
i i i i
y Y Y X Z u X Z
y x z u
α β γ α β γ
β γ
= − = + + + − − +
= + +
The predicted values of y are
ˆ
ˆ ˆ
i i i
y x z β γ = +
So that y is equal to the predicted value plus the residual.
ˆ
ˆ
i i i i
y x z e β γ = + +
Multiply by x and sum to get,
2
,
2
,
ˆ
ˆ
ˆ
ˆ
i i i i i i
i i i i i i
x y x x z x e
x y x x z x e
β γ
β γ
= + +
¯ = ¯ + ¯ + ¯
But, 0
i i
x e ¯ = by (A2), so
(N1)
2
,
ˆ
ˆ
i i i i
x y x x z β γ ¯ = ¯ + ¯
Now repeat the operation with z:
2
2
ˆ
ˆ
ˆ
ˆ
i i i i i i i
i i i i i i
z y z x z z e
z y zx z z e
β γ
β γ
= + +
¯ = ¯ + ¯ + ¯
But, 0
i i
z e ¯ = by (A3), so
(N2)
2
ˆ
ˆ
i i i i
z y zx z β γ ¯ = ¯ + ¯
We have two linear equations (N1 and N2) in two unknowns,
ˆ
β and ˆ γ . Even though these equations look
complicated with all the summations, these summations are just numbers derived from the sample
observations on x, y, and z. There are a variety of ways to solve two equations in two unknowns. Back
substitution from high school algebra will work just fine. Solve for ˆ γ from N2.
2
2 2
ˆ
ˆ
ˆ
ˆ
i i i i
i i i
i i
z z y zx
z y zx
z z
γ β
γ β
¯ = ¯ − ¯
¯ ¯
= −
¯ ¯
Substitute into N1
90
2
, 2 2
ˆ ˆ
i i i
i i i i
i i
z y zx
x y x x z
z z
β β
  ¯ ¯
¯ = ¯ + − ¯

¯ ¯
\ ¹
Multiply through by
2
i
z ¯
( )
2 2 2 2
, 2 2
2 2 2
,
2 2 2
,
ˆ ˆ
ˆ ˆ
ˆ ˆ
i i i
i i i i i i i
i i
i i i i i i i i i
i i i i i i i i i i i
z y zx
x y z x z z x z
z z
x y z x z z y zx x z
x y z x z z y x z zx x z
β β
β β
β β
  ¯ ¯
¯ ¯ = ¯ ¯ + ¯ − ¯

¯ ¯
\ ¹
¯ ¯ = ¯ ¯ + ¯ − ¯ ¯
¯ ¯ = ¯ ¯ + ¯ ¯ − ¯ ¯
Collect terms in
ˆ
β and move them to the left hand side.
( )
( )
2 2 2
,
2 2 2
,
2
2 2
,
2
2
2 2
,
ˆ ˆ
ˆ
ˆ
ˆ
i i i i i i i i i i i
i i i i i i i i i i i
i i i i i i i
i i i i
i i i i i i i
i i i
x z zx x z x y z z y x z
x z zx x z x y z z y x z
x y z z y x z
x z zx x z
x y z z y x z
x z x z
β β
β
β
β
¯ ¯ − ¯ ¯ = ¯ ¯ − ¯ ¯
¯ ¯ − ¯ ¯ = ¯ ¯ − ¯ ¯
¯ ¯ − ¯ ¯
=
¯ ¯ − ¯ ¯
¯ ¯ − ¯ ¯
=
¯ ¯ − ¯
because
i i i i
x z z x ¯ = ¯ .
This is the OLS estimator for
ˆ
β .
The OLS estimator for ˆ γ is derived similarly.
( )
2
2
2 2
,
ˆ
i i i i i i i
i i i
z y x x y x z
x z x z
γ
¯ ¯ − ¯ ¯
=
¯ ¯ − ¯
Interpreting the multiple regression coefficient.
We might be able to make some sense out of these formulas if we translate them into simple regression
coefficients. The formula for a simple correlation coefficient between x and y is
2 2 2 2
i i i i i i
xy
x y
i i i i
x y x y x y
r
NS S
x y x y
N
N N
¯ ¯ ¯
= = =
¯ ¯ ¯ ¯
Or,
2 2
i i xy i i
x y r x y ¯ = ¯ ¯
And,
2 2
i i xz i i
x z r x z ¯ = ¯ ¯
2 2
i i zy i i
z y r z y ¯ = ¯ ¯
Substituting into the formula for
ˆ
β
( )
2 2 2 2 2 2 2
2
2 2 2 2
,
ˆ
xy i i i zy i i xz i i
i xz i i
r x y z r z y r x z
x z r x z
β
¯ ¯ ¯ − ¯ ¯ ¯ ¯
=
¯ ¯ − ¯ ¯
Cancel out
2
i
z ¯ which appears in every term.
91
( )
2 2 2 2 2 2 2 2
2 2 2 2
2 2
,
,
ˆ
xy i i zy i xz i xy i i zy i xz i
xz i
xz i
r x y r y r x r x y r y r x
x r x
x r x
β
¯ ¯ − ¯ ¯ ¯ ¯ − ¯ ¯
= =
¯ − ¯
¯ − ¯
Factor out the common terms on the top and bottom.
2 2 2
2 2 2
2
,
2
2 2
2
ˆ
1 1
1 1
xy zy xz xy zy xz i i i
xz xz
i
xy zy xz xy zy xz y i
x xz xz
i
r r r r r r x y y
r x r
x
r r r r r r S y N
S r r
x N
β
   
− − ¯ ¯ ¯
  == =
  − ¯ −
¯
\ ¹ \ ¹
 
− − ¯  

= =

 − −
¯ \ ¹
\ ¹
The corresponding expression for ˆ γ is
2
ˆ
1
zy xy xz y
z xz
r r r S
S r
γ
−  
=

−
\ ¹
The term in parentheses is simply a positive scaling factor. Let’s concentrate on the ratio. The denominator
is
2
1
xz
r − . When x and z are perfectly correlated, r
xz
= 1 and the formula does not compute. This is the case
of perfect collinearity. X and Y are said to be collinear because one is an exact function of the other, with
no error term. This could happen if X is total wages and Y is total income minus nonwage payments.
The multiple regression coefficient þ is the effect on Y of a change in X, holding Z constant. If X and Z are
perfectly collinear, then whenever X changes, Z has to change. Therefore it is impossible to hold Z constant
to find the separate effect of X. Stata will simply drop Z from the regression and estimate a simple
regression of Y on X.
The simple correlation coefficient r
xy
in the numerator of the formula for
ˆ
β is the total effect of X on Y,
holding nothing constant. The second term, r
zy
r
xz
is the effect of X on Y through Z. In other words, X is
correlated with Z, so when X varies, Z varies. This causes Y to change because Y is correlated with Z.
Therefore, the numerator of the formula for
ˆ
β starts with the total effect on X on Y, but then subtracts off
the indirect effect of X on Y through Z. It therefore measures the effect of X on Y, while statistically
holding Z constant, the socalled partial effect of X on Y.
The formula for ˆ γ is analogous. It is the total effect of Z on Y minus the indirect effect of Z on Y through
X. It partials out the effect of X on Y.
Note that, if there is no correlation between X and Z, then X and Z are said to be orthogonal and r
xz
= 0. In
this case, the multiple regression estimators collapse to simple regression estimators.
2
ˆ
1
xy zy xz y y
xy
x x xz
r r r S S
r
S S r
β
−    
= =
 
−
\ ¹ \ ¹
2
ˆ
1
zy xy xz y y
zy
z z xz
r r r S S
r
S S r
γ
−    
= =
 
−
\ ¹ \ ¹
These are the results we would get if we did two separate simple regressions, Y on X, and Y on Z.
92
Multiple regression and omitted variable bias
The reason we use multiple regression is that most real world relationships have more than one explanatory
variable. Suppose we type the following data into the data editor in Stata.
Let’s do a simple regression of crop yield on rainfall.
regress yield rain
Source  SS df MS Number of obs = 8
+ F( 1, 6) = 0.17
Model  33.3333333 1 33.3333333 Prob > F = 0.6932
Residual  1166.66667 6 194.444444 Rsquared = 0.0278
+ Adj Rsquared = 0.1343
Total  1200 7 171.428571 Root MSE = 13.944

yield  Coef. Std. Err. t P>t [95% Conf. Interval]
+
rain  1.666667 4.025382 0.41 0.693 11.51642 8.183089
_cons  76.66667 40.5546 1.89 0.108 22.56687 175.9002

According to our analysis, the coefficient on rain is 1.67 implying that rain causes crop yield to decline (!),
although not significantly. The problem is that we have an omitted variable, temperature. Let’s do a
multiple regression with both rain and temp.
regress yield rain temp
Source  SS df MS Number of obs = 8
93
+ F( 2, 5) = 0.60
Model  231.448763 2 115.724382 Prob > F = 0.5853
Residual  968.551237 5 193.710247 Rsquared = 0.1929
+ Adj Rsquared = 0.1300
Total  1200 7 171.428571 Root MSE = 13.918

yield  Coef. Std. Err. t P>t [95% Conf. Interval]
+
rain  .7243816 4.661814 0.16 0.883 11.25919 12.70796
temp  1.366313 1.351038 1.01 0.358 2.106639 4.839266
_cons  14.02238 98.38747 0.14 0.892 266.9354 238.8907

Now both rain and warmth increase crop yields. The coefficients are not significant, probably because we
only have eight observations. Nevertheless, remember that omitting an important explanatory variable can
bias your estimates on the included variables.
The omitted variable theorem
The reason we do multiple regressions is that most things in economics are functions of more than one
variable. If we make a mistake and leave one of the important variables out, we cause the remaining
coefficient to be biased (and inconsistent).
Suppose the true model is the multiple regression (in deviations),
i i i i
y x z u β γ = + +
However, we omit z and run a simple regression of y on x. The coefficient on x will be
( )
2 2
ˆ
i i i i i i
i i
x x z u x y
x x
β γ
β
¯ + + ¯
= =
¯ ¯
The expected value of
ˆ
β is,
( )
2
2 2 2
2
ˆ
i i i i i
i i i
i i
i
x x z x u
E
x x x
x z
x
β γ
β
β γ
¯ ¯ ¯
= + +
¯ ¯ ¯
¯
= +
¯
Because 0
i i
x u ¯ = by A2. (The explanatory variables are uncorrelated with the error term.) We can write
the expression for the expected value of
ˆ
β in a number of ways.
( )
2 2 2
2 2
2
ˆ
xz i i i i i
xz
i i
i
z
xz xz
x
r x z z N x z
E r
x x
x N
S
r b
S
β β γ β γ β γ
β γ β γ
¯ ¯ ¯ ¯
= + = + = +
¯ ¯
¯
 
= + = +

\ ¹
where
xz
b is the coefficient from the simple regression of Z on X.
Assume that Z is relevant in the sense that v is not zero. If
xz
0 then 0 and b 0
i i xz
x z r ¯ ≠ ≠ ≠ in which case
ˆ
β is biased (and inconsistent since the bias does not go away as N goes to infinity). The direction of bias
depends on the signs of and .
xz
r γ If and .
xz
r γ have the same sign, then
ˆ
β is biased upwards. If they have
94
opposite signs, then
ˆ
β is biased downward. This was the case with rainfall and temperature in our crop
yield example in Chapter 7. Finally, note that if X and Z are orthogonal, then 0
xz
r = and
ˆ
β is unbiased.
The expected value of ˆ γ if we omit X is analogous.
( )
x
zx zx
z
S
E r b
S
γ γ β γ β
 
= + = +

\ ¹
where
zx
b is the coefficient from the simple regression of X on Z.
Now let’s consider a more complicated model. Suppose the true model is
0 1 1 2 2 3 3 4 4 i i i i i i
Y X X X X u β β β β β = + + + + +
But we estimate
0 1 1 2 2 i i i i
Y X X u β β β = + + +
We know we have omitted variable bias, but how much bias is there and in what direction?
Consider a set of auxiliary (and imaginary if we don’t have data on X
3
and X
4
) regressions of each of the
omitted variables on all of the included variables.
3 0 1 1 2 2
4 0 1 1 2 2
i i i
i i i
X a a X a X
X b b X b X
= + +
= + +
Note if there are more included variables, simply add them to the right hand side of each auxiliary
regression. If there are more omitted variables, add more equations.
It can be shown that, as a simple extension of the omitted variable theorem above,
( )
( )
( )
0 0 0 3 0 4
1 1 1 3 1 4
2 2 2 3 2 4
ˆ
ˆ
ˆ
E a b
E a b
E a b
β β β β
β β β β
β β β β
= + +
= + +
= + +
Here is a nice example in Stata. In the Data Editor, create the following variable.
95
Create some additional variables.
. gen x2=x^2
. gen x3=x^3
. gen x4=x^4
Now create y.
. gen y=1+x+x2+x3+x4
Regress y on all the x’s (true model)
. regress y x x2 x3 x4
Source  SS df MS Number of obs = 7
+ F( 4, 2) = .
Model  11848 4 2962 Prob > F = .
Residual  0 2 0 Rsquared = 1.0000
+ Adj Rsquared = 1.0000
Total  11848 6 1974.66667 Root MSE = 0

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  1 . . . . .
x2  1 . . . . .
x3  1 . . . . .
x4  1 . . . . .
_cons  1 . . . . .

Now omit X3 and X4.
. regress y x x2
Source  SS df MS Number of obs = 7
+ F( 2, 4) = 33.44
Model  11179.4286 2 5589.71429 Prob > F = 0.0032
Residual  668.571429 4 167.142857 Rsquared = 0.9436
+ Adj Rsquared = 0.9154
Total  11848 6 1974.66667 Root MSE = 12.928

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  8 2.443233 3.27 0.031 1.216498 14.7835
x2  10.57143 1.410601 7.49 0.002 6.654972 14.48789
_cons  9.285714 7.4642 1.24 0.281 30.00966 11.43823

Do the auxiliary regressions.
. regress x3 x x2
Source  SS df MS Number of obs = 7
+ F( 2, 4) = 12.70
Model  1372 2 686 Prob > F = 0.0185
Residual  216 4 54 Rsquared = 0.8640
+ Adj Rsquared = 0.7960
Total  1588 6 264.666667 Root MSE = 7.3485

x3  Coef. Std. Err. t P>t [95% Conf. Interval]
96
+
x  7 1.38873 5.04 0.007 3.144267 10.85573
x2  0 .8017837 0.00 1.000 2.226109 2.226109
_cons  0 4.242641 0.00 1.000 11.77946 11.77946

. regress x4 x x2
Source  SS df MS Number of obs = 7
+ F( 2, 4) = 34.01
Model  7695.42857 2 3847.71429 Prob > F = 0.0031
Residual  452.571429 4 113.142857 Rsquared = 0.9445
+ Adj Rsquared = 0.9167
Total  8148 6 1358 Root MSE = 10.637

x4  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  0 2.010178 0.00 1.000 5.581149 5.581149
x2  9.571429 1.160577 8.25 0.001 6.34915 12.79371
_cons  10.28571 6.141196 1.67 0.169 27.33641 6.764979

Compute the expected values from the formulas above.
( )
( )
( )
0 0 0 3 0 4
1 1 1 3 1 4
2 2 2 3 2 4
ˆ
1 0(1) 10.29(1) 9.29
ˆ
1 7(1) 0(1) 8
ˆ
1 0(1) 9.57(1) 10.57
E a b
E a b
E a b
β β β β
β β β β
β β β β
= + + = + − = −
= + + = + + =
= + + = + + =
These are the values of the parameters in the regression with the omitted variables. These numbers are
exact because there is no random error in the model. You can see that the theorem works and both
estimates are biased upward by the omitted variables.
Target and control variables: how many
regressors?
When we estimate a regression model, we usually have one or two parameters that we are primarily
interested in. These variables associated with those parameters are called target variables. In a demand
curve we are typically concerned with the coefficients on price and income. These coefficients tell us if the
demand curve downward sloping and whether the good is normal or inferior. We are primarily interested in
the coefficient on the interest rate in a demand for money equation. In a policy study the target variable is
frequently a dummy variable (see below) that is equal to one if the policy is in effect and zero otherwise.
To get a good estimate of the coefficient on the target variable or variables, we want to avoid omitted
variable bias. To that end we include a list of control variables. The only purpose of the control variables
is to try to guarantee that we don’t bias the coefficient on the target variables. For example, if we are
studying the effect of threestrikes law on crime, our target variable is the threestrikes dummy. We include
in the crime equation all those variables which might cause crime aside from the threestrikes law (e.g.,
percent urban, unemployment, percent in certain age groups, etc.)
What happens if, in our attempt to avoid omitted variable bias, we include too many control variables?
Instead of omitting relevant variables, suppose we include irrelevant variables. It turns out that including
97
irrelevant variables will not bias the coefficients on the target variables. However, including irrelevant
variables will make the estimates inefficient relative to estimates including only relevant variables. It will
also increase the standard errors and underestimate the tratios on all the coefficients in the model,
including the coefficients on the target variables. Thus, including too many control variables will tend to
make the target variables appear insignificant, even when they are truly significant.
We can summarize the effect of too many or too few variables as follows.
Omitting a relevant variable biases the coefficients on all the remaining variables, but decreases the
variance (increases the efficiency) of all the remaining coefficients.
Discarding a variable whose true coefficient is less than its true (theoretical) standard error decreases the
mean square error (the sum of variance plus bias squared) of all the remaining coefficients. So if
1
S
β
β
τ = <
then we are better off, in terms of mean square error, by omitting the variable, even though its parameter is
not zero. Unfortunately, we don’t know any of these true values, so this is not very helpful, but it does
demonstrate that we shouldn’t keep insignificant control variables.
What is the best practice? I recommend the general to specific modeling strategy. After doing your library
research and reviewing all previous studies on the issue, you will have comprised a list of all the control
variables that previous researchers have used. You may also come up with some new control variables.
Start with a general model, including all the potentially relevant controls, and remove the insignificant
ones. Use ttests and Ftests to justify your actions. You can proceed sequentially, dropping one or two
variables at a time, if that is convenient. After you get down to a parsimonious model including only
significant control variables, do one more Ftest to make sure that you can go from the general model to the
final model in one step. Sets of dummy variables should be treated as groups, including all or none for each
group, so that you might have some insignificant controls in the final model, but not a lot. At this point you
should be able to do valid hypothesis tests concerning the coefficients on the target variables.
Proxy variables
It frequently happens that researchers face a dilemma. Data on a potentially important control variable is
not available. However, we may be able to obtain data on a variable that is known or suspected to be highly
correlated with the unavailable variable. Such variables are known as proxy variables, or proxies. The
dilemma is this: if we omit the proxy we get omitted variable bias, if we include the proxy we get
measurement error. As we see in a later chapter, measurement error causes biased and inconsistent
estimates, but so does omitting a relevant variable. What is a researcher to do?
Monte Carlo studies have shown that the bias tends to be smaller if we include a proxy than if we omit a
variable entirely. However, the bias that results from including a proxy is directly related to how highly
correlated the proxy is with the unavailable variable. It is better to omit a poor proxy. Unfortunately, it is
impossible to know how highly correlated they are because we can’t observe the unavailable variable. It
might be possible, in some cases, to see how the two variables are related in other contexts, in other studies
for example, but generally we just have to hope.
The bottom line is that we should include proxies for control variables, but drop them if they are not
significant.
An interesting problem arises if the proxy variable is the target variable. In that case, we are stuck with
measurement error. We are also faced with the fact that the coefficient estimate is a compound of the true
98
parameter linking the dependent variable and the unavailable variable and the parameter relating the proxy
to the unavailable variable.
Suppose we know that
Y X α β = +
But we can’t get data on X. So we use Z, which is available and is related to X.
X a bZ = +
Substituting for X in the original model yields,
( ) ( ) Y X a bZ a bZ α β α β α β β = + = + + = + +
So if we estimate the model,
Y c dZ = +
the coefficient on Z is d b β = , the product of the true coefficient relating X to Y and the coefficient
relating X to Z. Unless we know the value of b, we have no measure of the effect of X on Y.
However, we do know if the coefficient d is significant or not and we do know if the sign is as expected.
That is, if we know that b is positive, then we can test the hypothesis that þ is positive using the estimated
coefficient,
ˆ
d . Therefore, if the target variable is a proxy variable, the estimated coefficient can only be
used to determine sign and significance.
An illustration of this problem occurs in studies of the relationship between guns and crime. There is no
good measure of the number of guns, so researchers have to use proxies. As we have just seen, the
coefficient on the proxy for guns cannot be used to make inferences concerning the elasticity of crime with
respect to guns. Nevertheless two studies, one by Cook and Ludwig and one by Duggan, both make this
mistake. See http://papers.ssrn.com/sol3/papers.cfm?abstract_id=473661 for a discussion of this issue.
Dummy variables
Dummy variables are variables that consist of one’s and zeroes. They are also known as binary variables.
They are extremely useful. For example, I happen to have a data set consisting of the salaries of the faculty
of a certain nameless university (salaries.dta). The average salary at the time of the survey was as follows.
. summarize salary
Variable  Obs Mean Std. Dev. Min Max
+
salary  350 72024.92 25213.02 20000 173000
Now suppose we create a dummy variable called female, that takes the value 1 if the faculty member is
female and 0 if male. We can then find the average female salary.
. summarize salary if female
Variable  Obs Mean Std. Dev. Min Max
+
salary  122 61050 18195.34 29700 130000
Similarly, we can create a dummy variable called male which is 1 if male and zero if female. The average
male salary is somewhat higher.
. summarize salary if male
Variable  Obs Mean Std. Dev. Min Max
99
+
salary  228 77897.46 26485.87 20000 173000
The difference between these two means is substantial. We can use the scalar command to find the
difference in salary between males and females.
. scalar salarydif=6105077897
. scalar list salarydif
salarydif = 16847
So, women earn almost $17,000 less than men on average. Is this difference significant given the variance
in salary? Let’s do a traditional test of the difference between two means by computing the variances and
using the formula in most introductory statistics books. First, create two variables for men’s salaries and
women’s salaries.
. gen salaryfem=salary if female==1
(228 missing values generated)
. gen salarymale=salary if female==0
(122 missing values generated)
. summarize salary*
Variable  Obs Mean Std. Dev. Min Max
+
salary  350 72024.92 25213.02 20000 173000
salaryfem  122 61050 18195.34 29700 130000
salarymale  228 77897.46 26485.87 20000 173000
Now test for the difference between these two means using the Stata command ttest.
. ttest salarymale=salaryfem, unpaired
Twosample t test with equal variances

Variable  Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
+
salary~e  228 77897.46 1754.069 26485.87 74441.12 81353.8
salary~m  122 61050 1647.328 18195.34 57788.68 64311.32
+
combined  350 72024.92 1347.693 25213.02 69374.3 74675.54
+
diff  16847.46 2684.423 11567.73 22127.2

Degrees of freedom: 348
Ho: mean(salarymale)  mean(salaryfem) = diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
t = 6.2760 t = 6.2760 t = 6.2760
P < t = 1.0000 P > t = 0.0000 P > t = 0.0000
There appears to be a significant difference between male and female salaries.
It is somewhat more elegant to use regression analysis with dummy variables to achieve the same goal.
. regress salary female
100
Source  SS df MS Number of obs = 350
+ F( 1, 348) = 39.39
Model  2.2558e+10 1 2.2558e+10 Prob > F = 0.0000
Residual  1.9930e+11 348 572701922 Rsquared = 0.1017
+ Adj Rsquared = 0.0991
Total  2.2186e+11 349 635696291 Root MSE = 23931

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
female  16847.46 2684.423 6.28 0.000 22127.2 11567.73
_cons  77897.46 1584.882 49.15 0.000 74780.31 81014.61

Note that the coefficient on female is the difference in average salary attributable to being female while the
intercept is the corresponding male salary. The tratio on female tests the null hypothesis that this salary
difference is equal to zero. This is exactly the same results we got using the standard ttest of the difference
between two means. So, there appears to be significant salary discrimination against women at this
university.
Since male=1female, we can generate the bonus attributable to being male by doing the following
regression.
. regress salary male
Source  SS df MS Number of obs = 350
+ F( 1, 348) = 39.39
Model  2.2558e+10 1 2.2558e+10 Prob > F = 0.0000
Residual  1.9930e+11 348 572701922 Rsquared = 0.1017
+ Adj Rsquared = 0.0991
Total  2.2186e+11 349 635696291 Root MSE = 23931

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
male  16847.46 2684.423 6.28 0.000 11567.73 22127.2
_cons  61050 2166.628 28.18 0.000 56788.67 65311.33

Now the female salary is captured by the intercept and the coefficient on male is the (significant) bonus that
attends maleness.
Since male=1female, female=1male, and the intercept is simply 1, male+female=intercept, so we can’t
use all three terms in the same regression. This is the dummy variable trap.
regress salary male female
Source  SS df MS Number of obs = 350
+ F( 1, 348) = 39.39
Model  2.2558e+10 1 2.2558e+10 Prob > F = 0.0000
Residual  1.9930e+11 348 572701922 Rsquared = 0.1017
+ Adj Rsquared = 0.0991
Total  2.2186e+11 349 635696291 Root MSE = 23931

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
male  16847.46 2684.423 6.28 0.000 11567.73 22127.2
101
female  (dropped)
_cons  61050 2166.628 28.18 0.000 56788.67 65311.33

Stata has saved us the embarrassment of pointing out that the regression we asked it to run is impossible by
simply dropping one of the variables from the regression. The result is we get the same regression we got
with male as the only independent variable.
It is possible to force Stata to drop the intercept term instead of one of the other variables,
. regress salary male female, noconstant
Source  SS df MS Number of obs = 350
+ F( 2, 348) = 1604.86
Model  1.8382e+12 2 9.1911e+11 Prob > F = 0.0000
Residual  1.9930e+11 348 572701922 Rsquared = 0.9022
+ Adj Rsquared = 0.9016
Total  2.0375e+12 350 5.8215e+09 Root MSE = 23931

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
male  77897.46 1584.882 49.15 0.000 74780.31 81014.61
female  61050 2166.628 28.18 0.000 56788.67 65311.33

Now the coefficients are the mean salaries for the two groups. This formulation is less useful because it is
more difficult to test the null hypothesis that the two salaries are different.
Perhaps this salary difference is due to a difference in the amount of experience of the two groups. Maybe
the men have more experience and that is what is causing the apparent salary discrimination. If so, the
previous analysis suffers from omitted variable bias.
. regress salary exp female
Source  SS df MS Number of obs = 350
+ F( 2, 347) = 79.49
Model  6.9709e+10 2 3.4854e+10 Prob > F = 0.0000
Residual  1.5215e+11 347 438471159 Rsquared = 0.3142
+ Adj Rsquared = 0.3103
Total  2.2186e+11 349 635696291 Root MSE = 20940

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
exp  1069.776 103.1619 10.37 0.000 866.8753 1272.678
female  10310.35 2431.983 4.24 0.000 15093.63 5527.065
_cons  60846.72 2150.975 28.29 0.000 56616.14 65077.31

Well, the difference has been reduced to $10K, so the men are apparently somewhat older. But it is still
significant and negative. Also, the average raise is a pitiful $1000 per year. It is also possible that women
receive lower raises than men do. We can test this hypothesis by creating an interaction variable by
multiplying experience by female and including it as an additional regressor.
. gen fexp=female*exp
. regress salary exp female fexp
Source  SS df MS Number of obs = 350
+ F( 3, 346) = 53.04
Model  6.9888e+10 3 2.3296e+10 Prob > F = 0.0000
Residual  1.5197e+11 346 439219336 Rsquared = 0.3150
102
+ Adj Rsquared = 0.3091
Total  2.2186e+11 349 635696291 Root MSE = 20958

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
exp  1038.895 113.9854 9.11 0.000 814.7038 1263.087
female  12189.86 3816.229 3.19 0.002 19695.79 4683.936
fexp  172.0422 269.0422 0.64 0.523 357.1217 701.2061
_cons  61338.93 2286.273 26.83 0.000 56842.18 65835.67

In fact, women appear to receive slightly higher, but not significantly higher, raises than men. There is a
significant salary penalty associated with being female, but it is not caused by discrimination in raises.
Useful tests
Ftest
We have seen many applications of the ttest. However, the Ftest is also extremely useful. This test was
developed by Sir Ronald Fisher (the F in Ftest), a British statistical biologist in the early 1900’s. Suppose
we want to test the null hypothesis that the coefficients on female and fexp are jointly equal to zero. The
way to test this hypothesis is to run the regression as it appears above with both female and fexp included,
and note the residual sum of squares (1.5e+11). Then run the regression again without the two variables
(assuming the null hypothesis is true) and seeing if the two residual sum of squares are significantly
different. If they are, then the null hypothesis is false. If they are not significantly different, then the two
variables do not help explain the variance of the dependent variable and the null hypothesis is true (cannot
be rejected).
Stata allows us to do this test very easily with the test command. This command is used after the regress
command. Refer to the coefficients by the corresponding variable name.
. test female fexp
( 1) female = 0
( 2) fexp = 0
F( 2, 346) = 9.18
Prob > F = 0.0001
This tests the null hypothesis that the coefficient on female and the coefficient on male are both equal to
zero. According the Fratio, we can firmly reject this hypothesis. The numbers in parentheses in the F ratio
are the degrees of freedom of the numerator (2 because there are two restrictions, one on female and one on
fexp) and the degrees of freedom of the denominator (nk: 3504=346).
Chow test
This test is frequently referred to as a Chow test, after Gregory Chow, a Princeton econometrician who
developed a slightly different version. Let’s do a slightly more complicated version of the Ftest above. I
also have a dummy variable called “admin” which is equal to one whenever a faculty member is also a
member of the administration (assistant to the President, assistant to the assistant to the President, dean,
associate dean, assistant dean, deanlet, etc., etc.). Maybe female administrators are discriminated against in
103
terms of salary. We want to know if the regression model explaining women’s salaries is different from the
regression model explaining men’s salaries. This is a Chow test.
. gen fadmin=female*admin
. regress salary exp female fexp admin fadmin
Source  SS df MS Number of obs = 350
+ F( 5, 344) = 35.33
Model  7.5271e+10 5 1.5054e+10 Prob > F = 0.0000
Residual  1.4659e+11 344 426123783 Rsquared = 0.3393
+ Adj Rsquared = 0.3297
Total  2.2186e+11 349 635696291 Root MSE = 20643

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
exp  1005.451 112.843 8.91 0.000 783.5015 1227.4
female  11682.84 3791.92 3.08 0.002 19141.11 4224.577
fexp  156.5447 266.3948 0.59 0.557 367.423 680.5124
admin  11533.69 3905.342 2.95 0.003 3852.334 19215.04
fadmin  2011.575 7884.295 0.26 0.799 13495.92 17519.07
_cons  60202.64 2284.564 26.35 0.000 55709.17 64696.11
. test female fexp fadmin
( 1) female = 0
( 2) fexp = 0
( 3) fadmin = 0
F( 3, 344) = 5.67
Prob > F = 0.0008
We can soundly reject the null hypothesis that women have the same salary function as men. However,
because the ttests on fexp and fadmin are not significant, we know that the difference is not due to raises
or differences paid to women administrators.
I also have data on a number of other variables that might explain the difference in salaries between men
and women faculty: chprof (=1 if a chancellor professor), asst (=1 if an assistant prof), assoc (=1 if an
associate prof), and prof (=1 if a full professor). Let’s see if these variables make a difference.
. regress salary exp female asst assoc prof admin chprof
Source  SS df MS Number of obs = 350
+ F( 7, 342) = 72.14
Model  1.3227e+11 7 1.8896e+10 Prob > F = 0.0000
Residual  8.9587e+10 342 261948866 Rsquared = 0.5962
+ Adj Rsquared = 0.5879
Total  2.2186e+11 349 635696291 Root MSE = 16185

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
exp  195.7694 102.5207 1.91 0.057 5.881204 397.42
female  5090.278 1952.174 2.61 0.010 8930.056 1250.5
asst  11182.61 4175.19 2.68 0.008 2970.325 19394.89
assoc  22847.23 4079.932 5.60 0.000 14822.31 30872.15
prof  47858.19 4395.737 10.89 0.000 39212.11 56504.28
admin  8043.421 2719.69 2.96 0.003 2693.997 13392.85
chprof  14419.67 5962.966 2.42 0.016 2690.965 26148.37
_cons  42395.53 4042.146 10.49 0.000 34444.94 50346.13
104

Despite all these highly significant control variables, the coefficient on the target variable, female is
negative and significant, indicating significant salary discrimination against women faculty. I have one
more set of dummy variables to try. I have a dummy for each department. So, dept1=1 if the faculty person
is in the first department, zero otherwise, dept2=1 if in the second department, etc.
. regress salary exp female admin chprof prof assoc asst dept2dept30
Source  SS df MS Number of obs = 350
+ F( 33, 316) = 20.18
Model  1.5047e+11 33 4.5597e+09 Prob > F = 0.0000
Residual  7.1388e+10 316 225910511 Rsquared = 0.6782
+ Adj Rsquared = 0.6446
Total  2.2186e+11 349 635696291 Root MSE = 15030

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
exp  283.8026 99.07088 2.86 0.004 88.88067 478.7245
female  2932.928 1915.524 1.53 0.127 6701.72 835.865
admin  6767.088 2650.416 2.55 0.011 1552.396 11981.78
chprof  14794.4 5622.456 2.63 0.009 3732.222 25856.58
prof  46121.73 4324.557 10.67 0.000 37613.17 54630.29
assoc  22830.1 3930.037 5.81 0.000 15097.75 30562.44
asst  11771.36 4036.021 2.92 0.004 3830.495 19712.23
dept2  2500.519 4457.494 0.56 0.575 6269.598 11270.63
dept3  19527.87 4019.092 4.86 0.000 11620.31 27435.43
dept4  1278.9 4886.804 0.26 0.794 8335.885 10893.68
dept5  9213.681 9271.124 0.99 0.321 9027.251 27454.61
dept6  1969.66 3659.108 0.54 0.591 9168.953 5229.633
dept7  3525.149 4881.219 0.72 0.471 6078.646 13128.94
dept8  462.3493 4317.267 0.11 0.915 8031.871 8956.57
dept9  3304.275 6448.824 0.51 0.609 9383.782 15992.33
dept10  24647.11 4618.156 5.34 0.000 15560.89 33733.33
dept11  13185.66 3726.702 3.54 0.000 5853.38 20517.95
dept12  3601.157 3230.741 1.11 0.266 9957.638 2755.323
dept14  167.8798 5985.794 0.03 0.978 11609.17 11944.93
dept16  4854.208 6978.675 0.70 0.487 18584.75 8876.331
dept17  1185.402 4293.443 0.28 0.783 9632.748 7261.945
dept18  5932.65 3473.475 1.71 0.089 901.411 12766.71
dept19  18642.71 15803.41 1.18 0.239 12450.48 49735.9
dept20  8166.382 8906.207 0.92 0.360 9356.576 25689.34
dept21  1839.744 5208.921 0.35 0.724 12088.29 8408.806
dept22  2940.911 11064.36 0.27 0.791 18828.21 24710.03
dept23  1540.594 4902.64 0.31 0.754 11186.54 8105.348
dept24  777.8447 3916.975 0.20 0.843 8484.491 6928.802
dept25  14089.64 15411.33 0.91 0.361 44411.41 16232.14
dept27  1893.797 3397.766 0.56 0.578 8578.9 4791.307
dept28  4073.063 5408.214 0.75 0.452 14713.72 6567.596
dept29  2279.7 10822.38 0.21 0.833 19013.33 23572.73
dept30  80.38544 4879.813 0.02 0.987 9681.414 9520.643
_cons  38501.85 4173.332 9.23 0.000 30290.82 46712.88

Whoops, the coefficient on female isn’t significant any more. Maybe women tend to be overrepresented in
departments that pay lower salaries to everyone, males and females (e.g., modern language, English,
theatre, dance, classical studies, kinesiology). This conclusion rests on the department dummy variables
being significant as a group. We can test this hypothesis, using an Ftest, with the testparm command. We
would like to list the variables as dept1dept29, but that looks like we want to test hypotheses on the
105
difference between the coefficient on dept1 and the coefficient on dept29. That is where the testparm
command comes in.
. testparm dept2dept30
( 1) dept2 = 0
( 2) dept3 = 0
( 3) dept4 = 0
( 4) dept5 = 0
( 5) dept6 = 0
( 6) dept7 = 0
( 7) dept8 = 0
( 8) dept9 = 0
( 9) dept10 = 0
(10) dept11 = 0
(11) dept12 = 0
(12) dept14 = 0
(13) dept16 = 0
(14) dept17 = 0
(15) dept18 = 0
(16) dept19 = 0
(17) dept20 = 0
(18) dept21 = 0
(19) dept22 = 0
(20) dept23 = 0
(21) dept24 = 0
(22) dept25 = 0
(23) dept27 = 0
(24) dept28 = 0
(25) dept29 = 0
(26) dept30 = 0
F( 26, 316) = 3.10
Prob > F = 0.0000
The department dummies are highly significant as a group. So there is no significant salary discrimination
against women, once we control for field of study. Female physicists, economists, chemists, computer
scientists, etc. apparently earn just as much as their male counterparts. Unfortunately, the same is true for
female French teachers and phys ed instructors. By the way, the departments that are significantly overpaid
are dept 3 (Chemistry), dept 10 (Computer Science), and dept 11 (Economics).
It is important to remember that, when testing groups of dummy variables for significance, that you must
drop or retain all members of the group. If the group is not significant, even if one or two are significant,
you should drop them all. If the group is significant, even if only a few are, then all should be retained.
Granger causality test
Another very useful test is available only for time series. In 1969 C.W.J. Granger published an article
suggesting a way to test causality.
2
Suppose we are interesting in testing whether crime causes prison (that
is, people in prison), prison causes (deters) crime, both, or neither. Granger suggested running a regression
of prison on lagged prison and lagged crime. If lagged crime is significant, then crime causes prison. Then
do a regression of crime on lagged crime and lagged prison. If lagged prison is significant, then prison
2
Granger, C.W.J., “Investigating causal relations by econometric models and crossspectral methods,”
Econometrica 36, 1969, 424438.
106
causes crime (deters crime if the coefficients sum to a negative number, causes crime if the coefficients
sum to a positive number).
I have data on crime (crmaj, major crime, the kind you get sent to prison for: murder, rape, robbery, assault,
and burglary) and prison population per capita for Virginia from 1972 to 1999 (crimeva.dta).
. tsset year
(You have to tsset year to tell Stata that year is the time variable.) Let’s use two lags.
. regress crmaj L.crmaj LL.crmaj L.prison LL.prison
Source  SS df MS Number of obs = 24
+ F( 4, 19) = 31.36
Model  63231512.2 4 15807878.1 Prob > F = 0.0000
Residual  9578520.75 19 504132.671 Rsquared = 0.8684
+ Adj Rsquared = 0.8407
Total  72810033 23 3165653.61 Root MSE = 710.02

crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
crmaj 
L1  1.003715 .2095943 4.79 0.000 .5650287 1.442401
L2  .4399737 .206103 2.13 0.046 .8713522 .0085951
prison 
L1  1087.249 1324.262 0.82 0.422 1684.463 3858.962
L2  1772.896 1287.003 1.38 0.184 4466.625 920.8333
_cons  6309.665 2689.46 2.35 0.030 680.5613 11938.77

. test L.prison LL.prison
( 1) L.prison = 0
( 2) L2.prison = 0
F( 2, 19) = 3.61
Prob > F = 0.0468
So, lagged prison is significant in the crime equation. Does prison cause or deter crime? The sum of the coefficients is
negative, so it deters crime if the sum is significantly different from zero.
. test L.prison+L2.prison=0
( 1) L.prison + L2.prison = 0
F( 1, 19) = 5.26
Prob > F = 0.0333
So, prison deters crime. Does crime cause prison?
regress prison L.prison LL.prison L.crmaj LL.crmaj
Source  SS df MS Number of obs = 24
+ F( 4, 19) = 510.43
Model  23.1503896 4 5.78759741 Prob > F = 0.0000
Residual  .215433065 19 .011338582 Rsquared = 0.9908
+ Adj Rsquared = 0.9888
Total  23.3658227 23 1.01590533 Root MSE = .10648
107

prison  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison 
L1  1.545344 .1986008 7.78 0.000 1.129668 1.96102
L2  .5637167 .193013 2.92 0.009 .9676977 .1597358
crmaj 
L1  .0000191 .0000314 0.61 0.551 .0000849 .0000467
L2  .0000159 .0000309 0.52 0.612 .0000488 .0000806
_cons  .133937 .4033407 0.33 0.743 .7102647 .9781387

. test L.crmaj LL.crmaj
( 1) L.crmaj = 0
( 2) L2.crmaj = 0
F( 2, 19) = 0.20
Prob > F = 0.8237
.
Apparently, prison deters crime, but crime does not cause prisons, at least in Virginia.
The Granger causality test is best done with control variables included in the regression as well as the lags
of the target variables. However, if the control variables are unavailable or the analysis is just preliminary,
then the lagged dependent variables will serve as proxies for the omitted control variables.
Jtest for nonnested hypotheses
Suppose we are having a debate about crime with some sociologists. We believe that crime is deterred by
punishment, mainly imprisonment, while the sociologists think that putting people in prison does no good
at all. The sociologists think that crime is caused by income inequality. We don’t think so. We both agree
that unemployment and income have an effect on crime. How can we resolve this debate? We don’t put
income inequality in our regressions and they don’t put prison in theirs. The key is to create a supermodel
with all of the variables included. We then test all the coefficients. If prison is significant and income
inequality is not, then we win. If the reverse is true then the sociologists win. If neither or both are
significant, then it is a tie. You can get ties in tests of nonnested hypotheses. You don’t get ties in tests of
nested hypotheses. By the way, the J in Jtest is for joint hypothesis.
I have time series data for the US which has data on major crime (murder, rape, robbery, assault, and
burglary), real per capita income, the unemployment rate, and income inequality measured by the Gini
coefficient from 19471998 (gini.csv). The Gini coefficient is a number between zero and one. Zero means
perfect equality, everyone has the same level of income. One means perfect inequality, one person has all
the money, everyone else has nothing. We create the supermodel by regressing the log of major crime per
capita on all these variables.
. regress lcrmajpc prisonpc unrate income gini
Source  SS df MS Number of obs = 52
+ F( 4, 47) = 3286.33
Model  1.55121091 4 .387802727 Prob > F = 0.0000
Residual  .005546219 47 .000118005 Rsquared = 0.9964
+ Adj Rsquared = 0.9961
Total  1.55675713 51 .03052465 Root MSE = .01086

108
lcrmajpc  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prisonpc  .0186871 .0044176 4.23 0.000 .0098001 .0275741
unrate  .0391091 .0072343 5.41 0.000 .0245556 .0536626
income  .3201013 .0061851 51.75 0.000 .3076585 .3325442
gini  .757686 .2507801 3.02 0.004 1.26219 .2531816
_cons  5.200647 .175871 29.57 0.000 4.84684 5.554454
r
Well, both prison and gini are significant in the crime equation. It appears to be a draw. However, the
coefficient on gini is negative, indicating that, as the gini coefficient goes up (more inequality, less
equality) crime goes down. Whoops.
If there were more than one variable for each of the competing models (e.g., prison and arrests for the
economists’ model and giini and the divorce rate for the sociologists) then we would use Ftests rather than
ttests. We would test the joint significance of prison and arrests versus the joint significance of gini and
divorce with Ftests.
LM test
Suppose we want to test the joint hypothesis that rtpi and unrate are not significant in the above regression,
even if we already know that they are highly significant. We know we can test that hypothesis with an F
test. However, an alternative is to use the LM (for Lagrangian Multiplier, never mind) test. The primary
advantage of the LM test is that you only need to estimate the main equation once. Under the Ftest you
have to estimate the model with all the variables, record the error sum of squares, then estimate the model
with the variables being tested excluded, record the error sum of squares, and finally compute the Fratio.
For example, use the crime1990.dta data set to estimate the following crime equation.
. regress lcrmaj prison metpct p1824 p2534
Source  SS df MS Number of obs = 51
+ F( 4, 46) = 19.52
Model  5.32840267 4 1.33210067 Prob > F = 0.0000
Residual  3.1393668 46 .068247104 Rsquared = 0.6293
+ Adj Rsquared = 0.5970
Total  8.46776947 50 .169355389 Root MSE = .26124

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .1439967 .0282658 5.09 0.000 .0871006 .2008929
metpct  .0086418 .0020265 4.26 0.000 .0045627 .0127208
p1824  .9140642 5.208719 0.18 0.861 11.39867 9.570543
p2534  1.604708 3.935938 0.41 0.685 9.527341 6.317924
_cons  9.092154 .673677 13.50 0.000 7.736113 10.4482

Suppose we want to know if the age groups are jointly significant. The Ftest is easy in Stata.
. test p1824 p2534
( 1) p1824 = 0
( 2) p2534 = 0
F( 2, 46) = 0.12
Prob > F = 0.8834
To do the LM test for the same hypothesis, estimate the model without the age group variables and save the
residuals using the predict command.
109
. regress lcrmaj prison metpct
Source  SS df MS Number of obs = 51
+ F( 2, 48) = 40.39
Model  5.31143197 2 2.65571598 Prob > F = 0.0000
Residual  3.1563375 48 .065757031 Rsquared = 0.6273
+ Adj Rsquared = 0.6117
Total  8.46776947 50 .169355389 Root MSE = .25643

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .1384944 .0248865 5.57 0.000 .0884566 .1885321
metpct  .0081552 .0017355 4.70 0.000 .0046656 .0116447
_cons  8.767611 .1142346 76.75 0.000 8.537927 8.997295

. predict e, resid
Now regress the residuals on all the variables including the ones we left out. If they are truly irrelevant,
they will not be significant and the Rsquare for the regression will be approximately zero. The LM statistic
is N times the Rsquare, which is distributed as chisquare with, in this case, two degrees of freedom
because there are two restrictions (two variables left out).
. regress e prison metpct p1824 p2534
Source  SS df MS Number of obs = 51
+ F( 4, 46) = 0.06
Model  .016970705 4 .004242676 Prob > F = 0.9926
Residual  3.13936681 46 .068247105 Rsquared = 0.0054
+ Adj Rsquared = 0.0811
Total  3.15633751 50 .06312675 Root MSE = .26124

e  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .0055023 .0282658 0.19 0.847 .0513938 .0623985
metpct  .0004866 .0020265 0.24 0.811 .0035924 .0045657
p1824  .9140643 5.208719 0.18 0.861 11.39867 9.570543
p2534  1.604708 3.935938 0.41 0.685 9.527341 6.317924
_cons  .3245431 .673677 0.48 0.632 1.031498 1.680585

To compute the LM test statistic, we need to get the Rsquare and the number of observations. (Although
the fact that both variables have very low tratios is a clue.)
Whenever we run regress Stata creates a bunch of output that is available for further analysis. This output
goes under the generic heading of “e” output. To see this output, use the ereturn list command.
. ereturn list
scalars:
e(N) = 51
e(df_m) = 4
e(df_r) = 46
e(F) = .0621663918252367
e(r2) = .0053767079385045
e(rmse) = .2612414678691742
e(mss) = .0169707049657037
e(rss) = 3.139366808584275
e(r2_a) = .0811122739798864
e(ll) = 1.276850278809347
110
e(ll_0) = 1.414326247391365
macros:
e(depvar) : "e"
e(cmd) : "regress"
e(predict) : "regres_p"
e(model) : "ols"
matrices:
e(b) : 1 x 5
e(V) : 5 x 5
functions:
e(sample)
Since the numbers we want are scalars, not variables, we need the scalar command to create the test
statistic.
. scalar n=e(N)
. scalar R2=e(r2)
. scalar LM=n*R2
Now that we have the test statistic, we need to see if it is significant. We can compute the probvalue using
Stata’s chi2 (chisquare) function.
. scalar prob=1chi2(2,LM)
. scalar list
prob = .87187776
LM = .2742121
R2 = .00537671
n = 51
The Rsquare is almost equal to zero, so the LM statistic is very low. The probvalue is .87. We cannot
reject the null hypothesis that the age groups are not related to crime. OK, so the Ftest is easier.
Nevertheless, we will use the LM test later, especially for diagnostic testing of regression models.
One modification of this test that is easier to apply in Stata is the socalled Fform of the LM test which
corrects for degrees of freedom and therefore has somewhat better small sample properties.
3
To do the F
form, just apply the Ftest to the auxiliary regression. That is, after regressing the residuals on all the
variables, including the age group variables, use Stata’s test function to test the joint hypothesis that the
coefficients on both variables are zero.
. test p1824 p2534
( 1) p1824 = 0
( 2) p2534 = 0
F( 2, 46) = 0.12
Prob > F = 0.8834
The results are identical to the original Ftest above showing that the LM test is equivalent to the Ftest.
The results are also similar to the nRsquare version of the test, as expected. I routinely use the Fform of
the LM test because it is easy to do in Stata and corrects for degrees of freedom.
3
See Kiviet, J.F., Testing Linear Econometric Models, Amsterdam: University of Amsterdam, 1987.
111
9 REGRESSION DIAGNOSTICS
Influential observations
It would be unfortunate if an outlier dominated our regression in such a way that one observation
determined the entire regression result.
. gen y=10 + 5*invnorm(uniform())
. gen x=10*invnorm(uniform())
. summarize y x
Variable  Obs Mean Std. Dev. Min Max
+
y  51 9.736584 4.667585 1.497518 19.76657
x  51 .6021098 8.954225 17.87513 19.83905
Let’s graph these data. We know that they are completely independent, so there should not be a significant
relationship between them.
. twoway (scatter y x) (lfit y x)
The regression confirms no significant relationship.
112
. regress y x
Source  SS df MS Number of obs = 51
+ F( 1, 49) = 1.94
Model  41.4133222 1 41.4133222 Prob > F = 0.1703
Residual  1047.90398 49 21.3857955 Rsquared = 0.0380
+ Adj Rsquared = 0.0184
Total  1089.3173 50 21.786346 Root MSE = 4.6245

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  .1016382 .0730381 1.39 0.170 .0451374 .2484138
_cons  9.797781 .649048 15.10 0.000 8.49347 11.10209

Now let’s create an outlier.
. replace y=y+30 if state==30
(1 real change made)
. twoway (scatter y x) (lfit y x)
Let’s try that regression again.
. regress y x
Source  SS df MS Number of obs = 51
+ F( 1, 49) = 6.83
Model  259.874802 1 259.874802 Prob > F = 0.0119
Residual  1864.52027 49 38.0514342 Rsquared = 0.1223
+ Adj Rsquared = 0.1044
Total  2124.39508 50 42.4879015 Root MSE = 6.1686

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  .2546063 .0974255 2.61 0.012 .0588225 .4503901
_cons  10.47812 .8657642 12.10 0.000 8.738302 12.21794

113
So, one observation is determining whether we have a significant relationship or not. This observation is
called “influential.” State 30, which happens to be New Hampshire, is said to dominate the regression. It is
sometimes easy to find outliers and influential observations, but not always. If you have a large data set, or
lots of variables, you could easily miss an influential observation. You do not want your conclusions to be
driven by a single observation or even a few observations. So how do we avoid this influential observation
trap?
Stata has a function called dfbeta which helps you avoid this problem. It works like this suppose we simply
ran the regression with and without New Hampshire. With NH we get a significant result and a relatively
large coefficient on x. Without NH we should get a smaller and insignificant coefficient.
. regress y x if state ~=30
Source  SS df MS Number of obs = 50
+ F( 1, 48) = 1.61
Model  35.0543824 1 35.0543824 Prob > F = 0.2112
Residual  1047.6542 48 21.8261292 Rsquared = 0.0324
+ Adj Rsquared = 0.0122
Total  1082.70858 49 22.0960935 Root MSE = 4.6718

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  .0989157 .0780517 1.27 0.211 .0580178 .2558493
_cons  9.785673 .6653936 14.71 0.000 8.447809 11.12354

By dropping NH we are back to the original result reflecting the (true) relationship for the other 50
observations. We could do this 51 times, dropping each state in turn and seeing what happens, but that
would be tedious. Instead we can use matrix algebra and computer technology to achieve the same result.
DFbetas
Belsley, Kuh, and Welsch
4
suggest dfbetas (scaled dfbeta) as the best measure of influence.
ˆ
ˆ
ˆ ˆ
ˆ ˆ
where is the standard error of and
i
i
DFbetas S
S
β
β
β β
β β
−
= is the estimated value of β omitting
observation i.
BKW suggest, as a rule of thumb, that an observation is influential if
Dfbetas > 2 N
where N is the sample size, although some researchers suggest that DFbetas should be greater than one
(that is greater than one standard error).
After regress, invoke the dfbeta command.
. dfbeta
DFx: DFbeta(x)
4
Belsley, D. A., Kuh, E. and Welsch, R. E. Regression Diagnostics. John Wiley and Sons, New York, NY, 1980.
114
This produces a new variable called DFx. Now list the observations that might be influential.
. list state stnm y x DFx if DFx>2/sqrt(51)
++
 state stnm y x DFx 

51.  30 NH 42.282 19.83905 2.110021 
++
There is only one observation for which DFbetas is greater than the threshold value. Clearly NH is
influential, increasing the value of the coefficient by two standard errors. At this point it would behoove us
to look more closely at the NH value to make sure it is not a typo or something.
What do we do about influential observations? We hope we don’t have any. If we do, we hope they don’t
affect the main conclusions. We might find simple coding errors. If we have some and they are not coding
errors, we might take logarithms (which has the advantage of squashing variance) or weight the regression
to reduce their influence. If none of this works, we have to live with the results and confess the truth to the
reader.
Multicollinearity
We know that if one variable in a multiple regression is an exact linear combination of one or more
variables in the regression, that the regression will fail. Stata will be forced to drop the collinear variable
from the model. But what happens if a regressor is not an exact linear combination of some other variable
or variables, but almost? This is called multicollinearity. It has the effect of increasing the standard errors
of the OLS estimates. This means that variables that are truly significant appear to be insignificant in the
collinear model Also, the OLS estimates are unstable and fragile (small changes in the data or model can
make a big difference in the regression estimates). It would be nice to know if we have multicollinearity.
Variance inflation factors
Stata has a diagnostic known as the variance inflation factor , VIF, which indicates the presence of
multicollearity. The rule of thumb is that multicollinearity is a problem if the VIF is over 30.
Here is an example.
. use http://www.ats.ucla.edu/stat/stata/modules/reg/multico, clear
. regress y x1 x2 x3 x4
Source  SS df MS Number of obs = 100
+ F( 4, 95) = 16.37
Model  5995.66253 4 1498.91563 Prob > F = 0.0000
Residual  8699.33747 95 91.5719733 Rsquared = 0.4080
+ Adj Rsquared = 0.3831
Total  14695 99 148.434343 Root MSE = 9.5693

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x1  1.118277 1.024484 1.09 0.278 .9155807 3.152135
115
x2  1.286694 1.042406 1.23 0.220 .7827429 3.356131
x3  1.191635 1.05215 1.13 0.260 .8971469 3.280417
x4  .8370988 1.038979 0.81 0.422 2.899733 1.225535
_cons  31.61912 .9709127 32.57 0.000 29.69161 33.54662

. vif
Variable  VIF 1/VIF
+
x4  534.97 0.001869
x3  175.83 0.005687
x1  91.21 0.010964
x2  82.69 0.012093
+
Mean VIF  221.17
Wow. All the VIF are over 30. Let’s look at the correlations among the independent variables.
. correlate x1x4
(obs=100)
 x1 x2 x3 x4
+
x1  1.0000
x2  0.3553 1.0000
x3  0.3136 0.2021 1.0000
x4  0.7281 0.6516 0.7790 1.0000
x4 appears to be the major problem here. It is highly correlated with all the other variables. Perhaps we
should drop it and try again.
. regress y x1 x2 x3
Source  SS df MS Number of obs = 100
+ F( 3, 96) = 21.69
Model  5936.21931 3 1978.73977 Prob > F = 0.0000
Residual  8758.78069 96 91.2372989 Rsquared = 0.4040
+ Adj Rsquared = 0.3853
Total  14695 99 148.434343 Root MSE = 9.5518

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x1  .298443 .1187692 2.51 0.014 .0626879 .5341981
x2  .4527284 .1230534 3.68 0.000 .2084695 .6969874
x3  .3466306 .0838481 4.13 0.000 .1801934 .5130679
_cons  31.50512 .9587921 32.86 0.000 29.60194 33.40831

. vif
Variable  VIF 1/VIF
+
x1  1.23 0.812781
x2  1.16 0.864654
x3  1.12 0.892234
+
Mean VIF  1.17
116
Much better, the variables are all significant and there is no multicollinearity at all. If we had not checked
for multicollinearity, we might have concluded that none of the variables were related to y. So doing a VIF
check is generally a good idea.
However, dropping variables is not without its dangers. It could be that x4 is a relevant variable and the
result of dropping it is that the remaining coefficients are biased and inconsistent. Be careful.
117
10 HETEROSKEDASTICITY
The GaussMarkov assumptions can be summarized as
2
~ (0, )
i i i
i
Y X u
u iid
α β
σ
= + +
where ~iid(0,o
2
) means “is independently and identically distributed with mean zero and variance o
2
”. In
this chapter we learn how to deal with the possibility that the “identically” assumption is violated. Another
way of looking at the GM assumptions is to glance at the expression for the variance of the error term, o
2
,
if there is no i subscript then the variance is the same for all observations. Thus, for each observation the
error term, u
i
, comes from a distribution with exactly the same mean and variance, i.e., is identically
distributed.
When the identically distributed assumption is violated, it means that the error variance differs across
observations. This nonconstant variance is called heteroskedasticity (hetero = different, skedastic, from the
Greek skedastikos, scattering). Constant variance must be homoskedasticity.
If the data are heteroskedastic, ordinary least squares estimates are still unbiased, but they are not efficient.
This means that we have less confidence in our estimates. However, even worse, perhaps, our standard
errors and tratios are no longer correct. It depends on the exact kind of heteroskedasticity whether we are
over or underestimating our tratios, but in either case, we will make mistakes in our hypothesis tests. So,
it would be nice to know if our data suffer from this problem.
Testing for heteroskedasticity
The first thing to do is not really a test, but simply good practice: look at the evidence. The residuals from
the OLS regression, e
i
are estimates of the unknown error term u
i
, so the first thing to to is graph the
residuals and see if they appear to be getting more or less spread out as Y and X increase.
BreuschPagan test
There are two widely used tests for heteroskdeasticity. The first is due to Breusch and Pagan. It is an LM
test. The basic idea is as follows.
Consider the following multiple regression model.
0 1 1 2 2 i i i i
Y X X u β β β = + + +
118
What we want to know is if the variance of the error term is increasing with one of the variables in the
model (or some other variable). So we need an estimate of the variance of the error term at each
observation. We use the OLS residual, e, as our estimate of the error term, u. The usual estimator of the
variance of e is
2
1
( )
n
i
i
e e n
=
−
¯
. However, we know that 0 e = in any OLS regression and if we want the
variance of e for each observation, we have to set n=1. Therefore our estimate of the variance of u
i
is
2
i
e ,
the square of the residual. If there is no heteroskedasticity, the squared residuals will be unrelated to any of
the independent variables. So an auxiliary regression of the form,
2 2
0 1 1 2 2
ˆ /
i i i
e a a X a X σ = + +
where
2
ˆ σ is the estimated error variance from the original OLS regression, should have an Rsquare of
zero. In fact,
2 2
2
nR χ where the degrees of freedom is the number of variables on the right hand side of the
auxiliary test equation. Alternatively, we could test whether the squared residual increases with the
predicted value of y or some other suspected variable.
As an example, consider the following regression of major crime per capita across 51 states (including DC)
in 1990 (hetero.dta) on prison population per capita, the percent of the population living in metropolitan
(metpct) areas, and real per capita personal income.
. regress crmaj prison metpct rpcpi
Source  SS df MS Number of obs = 51
+ F( 3, 47) = 43.18
Model  1805.01389 3 601.671298 Prob > F = 0.0000
Residual  654.970383 47 13.9355401 Rsquared = 0.7338
+ Adj Rsquared = 0.7168
Total  2459.98428 50 49.1996855 Root MSE = 3.733

crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  2.952673 .3576379 8.26 0.000 2.233198 3.672148
metpct  .1418042 .0328781 4.31 0.000 .0756619 .2079464
rpcpi  .0004919 .0003134 1.57 0.123 .0011224 .0001386
_cons  6.57136 3.349276 1.96 0.056 .1665135 13.30923

Let’s plot the residuals on population (which we suspect might be a problem variable).
. predict e, resid
. twoway (scatter e pop)
119
Hard to say. If we ignore the outlier at 3000, it looks like the variance of the residuals is getting larger with
larger population. Let’s see if the BreuschPagan test can help us.
The BreuschPagan test is invoked in Stata with the hettest command after regress. The default, if you don’t
include a varlist is to use the predicted value of the dependent variable.
. hettest
BreuschPagan / CookWeisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of crmaj
chi2(1) = 0.17
Prob > chi2 = 0.6819
Nothing appears to be wrong. Let’s see if there is a problem with any of the independent variables.
. hettest prison metpct rpcpi
BreuschPagan / CookWeisberg test for heteroskedasticity
Ho: Constant variance
Variables: prison metpct rpcpi
chi2(3) = 4.73
Prob > chi2 = 0.1923
Still nothing. Let’s see if there is a problem with population (since the dependent variable is measured as a
per capita variable).
. hettest pop
BreuschPagan / CookWeisberg test for heteroskedasticity
Ho: Constant variance
Variables: pop
chi2(1) = 3.32
Prob > chi2 = 0.0683
Ah ha! Population does seem to be a problem. We will consider what to do about it below.

5
0
5
1
0
1
5
R
e
s
id
u
a
ls
0 10000 20000 30000
pop
120
White test
The other widely used test for autocorrelation is due to Halbert White (econometrician from UCSD). White
modifies the BreuschPagan auxiliary regression as follows.
2 2 2
0 1 1 2 2 3 1 4 2 5 1 2 i i i i i i i
e a a X a X a X a X a X X = + + + + +
The squared and interaction terms are used to capture any nonlinear relationships that might exist. The
number of regressors in the auxiliary test equation is k(k+1)/2 where k is the number of parameters
(including the intercept) in the original OLS regression. Here k=3 so, 3(4)/2 1=61=5. The test statistic is
2 2
( 1) / 2 1
nR
k k
χ
+ −
.
To invoke the White test use imtest, white. The im stands for information matrix.
. imtest, white
White's test for Ho: homoskedasticity
against Ha: unrestricted heteroskedasticity
chi2(9) = 6.63
Prob > chi2 = 0.6757
Cameron & Trivedi's decomposition of IMtest

Source  chi2 df p
+
Heteroskedasticity  6.63 9 0.6757
Skewness  4.18 3 0.2422
Kurtosis  2.28 1 0.1309
+
Total  13.09 13 0.4405

Ignore Cameron and Trivedi’s decomposition. All we care about is heteroskedasticity, which doesn’t
appear to be a problem according to White. The White test has a slight advantage over the BrueshPagan
test in that it is less reliant on the assumption of normality. However, because it computes the squares and
crossproducts of all the variables in the regression, it sometimes runs out of space or out of degrees of
freedom and will not compute. Finally, the White test doesn’t warn us of the possibility of
heteroskedasticity with a variable that is not in the model, like population. (Remember, there is no
heteroskdeasticity associated with prison, metpct, or rpcpi.) For these reasons, I prefer the BreuschPagan
test.
Weighted least squares
Suppose we know what is causing the heteroskedasticity. For example, suppose the model is (in deviations)
i i i
y x u β = + where
2 2
var( ) u z
i i
σ =
121
Consider the weighted regression, where we divide both sides by z
i
/ ( / ) /
i i i i i i
y z x z u z β = +
What is the variance of the new error term, u
i
/z
i
.
2 2
2 2
2
var( / ) var( ) where 1/
var( / ) var( )
i i i i i i
i
i i i i
i
u z c u c z
z
u z c u
z
σ
σ
= =
= = =
So, weighting by 1/z
i
causes the variance to be constant. Applying OLS to this weighted regression will be
unbiased, consistent and efficient (relative to unweighted least squares).
The problem with this approach is that you have to know the exact form of the heteroskedasticity. If you
think you know you have heteroskedasticity and correct it with weighted least squares, you’d better be
right. If you were wrong and there was no heteroskedasticity, you have created it. You could make things
worse by using WLS inappropriately.
There is one case where weighted least squares is appropriate. Suppose that the true relationship between
two variables is (in deviations) is,
2
, (0, )
T T T T
i i i i
y x u u IN β σ = +
However, suppose the data is only available at the state level, where it is aggregated across n individuals.
Summing the relationship across the state’s population, n, yields,
1 1 1
n n n
T T
i i i
i i i
y x u β
= = =
= +
¯ ¯ ¯
or, for each state,
. y x u β = +
We want to estimate the model as close as possible to the truth, so we divide the state totals by the state
population to produce per capita values.
.
y x u
n n n
β = +
What is the variance of the error term?
1
2 2
2 2
2
1
( ( ) ( ) ... ( ))
n
T
i
T T T i
i n
u
u
Var Var Var u Var u Var u
n n n
n
n n
σ σ
=
 

 

= = + + +

 \ ¹

\ ¹
= =
¯
So, we should weight by the square root of population.
122
.
y x u
n n n
n n n
β = +
2
2
nu u
Var nVar n
n n n
σ
σ
 
 
= = =



\ ¹
\ ¹
which is homoskedastic.
This does not mean that it is always correct to weight by the square root of population. After all, the error
term might have some heteroskedasticity in addition to being an average. The most common practice in
crime studies is to weight by population, not the square root, although some researchers weight by the
square root of population.
We can do weighted least squares in Stata with the [weight=variable] option in the regress command. Put
the [weight=variable] in square brackets after the varlist.
. regress crmaj prison metpct rpcpi [weight=pop]
(analytic weights assumed)
(sum of wgt is 2.4944e+05)
Source  SS df MS Number of obs = 51
+ F( 3, 47) = 19.35
Model  1040.92324 3 346.974412 Prob > F = 0.0000
Residual  842.733646 47 17.9305031 Rsquared = 0.5526
+ Adj Rsquared = 0.5241
Total  1883.65688 50 37.6731377 Root MSE = 4.2344

crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  3.470689 .7111489 4.88 0.000 2.040042 4.901336
metpct  .1887946 .058493 3.23 0.002 .0711219 .3064672
rpcpi  .0006482 .0004682 1.38 0.173 .0015902 .0002938
_cons  4.546826 4.851821 0.94 0.353 5.213778 14.30743

Now let’s see if there is any heteroskedasticity.
. hettest pop
BreuschPagan / CookWeisberg test for heteroskedasticity
Ho: Constant variance
Variables: pop
chi2(1) = 0.60
Prob > chi2 = 0.4386
. hettest
BreuschPagan / CookWeisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of crmaj
chi2(1) = 1.07
Prob > chi2 = 0.3010
123
. hettest prison metpct rpcpi
BreuschPagan / CookWeisberg test for heteroskedasticity
Ho: Constant variance
Variables: prison metpct rpcpi
chi2(3) = 2.93
Prob > chi2 = 0.4021
The heteroskedasticity seems to have been taken care of. If the correction had made the heteroskedasticity
worse instead of better, then we would have tried weighting by 1/pop, the inverse of population, rather than
by population itself.
Robust standard errors and tratios
Since heteroskedasticity does not bias the coefficient estimates, we could simply correct the standard errors
and resulting tratios so our hypothesis tests are valid.
Recall how we derived the mean and variance of the sampling distribution of
ˆ
β . We started with a simple
regression model in deviations.
i i i
y x u β = +
2
2 2 2 2 2
( )
ˆ
i i i i i i i i i i
i i i i i
x y x x u x x u x u
x x x x x
β β
β β
¯ ¯ + ¯ ¯ ¯
= = = + = +
¯ ¯ ¯ ¯ ¯
The mean is
2 2
( )
ˆ
( )
i i i i
i i
x u x E u
E E
x x
β β β β
  ¯ ¯
= + = + =

¯ ¯
\ ¹
because ( ) 0
i
E u = for all i.
The variance of
ˆ
β is the expected value of the square of the deviation of
ˆ
β ,
2
ˆ
( ) E β β −
2
ˆ
i i
i
x u
x
β β
¯
− =
¯
124
( )
( )
( )
( )
2
ˆ ˆ
( ) ( )
The x's are fixed numbers so ( ) , which means
2
1 2
ˆ
( )
1 1 2 2
2 2
2
1
2 2 2 2 2 2
ˆ
( )
1 2 1 2 1 3 1 3 1 4 1 4
1 1 2 2
2
2
2
We know that (
Var E
E x x
x u
i i
Var E E x u x u x u
n n
x
i x
i
Var E x u x u x u x x u u x x u u x x u u
n n
x
i
E u
i
β β β
β
β
= −
=
 
¯

= = + + +

¯
\ ¹ ¯
= + + + + + + +
¯
( )
( )
( )
( )
( )
2 2 2
2 2 2 2
1 2 2 2 2
2 2
2
) (identical) and ( ) 0 for i j (independent), so
1
2 2 2 2 2 2
ˆ
( )
1 2
2
2
2
Factor out the constant variance, , to get
1
i
n
i
i i
E u u
i j
Var x x x
n
x
i
x
x x x
x
x x
σ
β σ σ σ
σ
σ σ
σ
= = ≠
= + + +
¯
¯
= + + + = =
¯
¯ ¯
In the case of heteroskedasticity we cannot factor out
2
σ . So,
( )
( )
( )
( )
( )
1
2 2 2 2 2 2
ˆ
( )
1 2 1 2 1 3 1 3 1 4 1 4
1 1 2 2 2
2
It is still true that ( ) 0 for i j (independent), so
2 2
1
2 2 2 2 2 2
ˆ
( )
1 1 2 2
2 2
2 2
Var E x u x u x u x x u u x x u u x x u u
n n
x
i
E u u
i j
x
i i
Var x x x
n n
x x
i i
β
σ
β σ σ σ
= + + + + + + +
¯
= ≠
¯
= + + + =
¯ ¯
To make this formula operational, we need an estimate of
2
i
σ . As above, we will use the square of the
residual,
2
e
i
.
( )
2 2
2
ˆ
2
2
ˆ
i i
i
x e
x
β
σ
¯
=
¯
Take the square root to produce the robust standard error of beta.
( )
2 2
ˆ
2
2
i i
i
x e
S
x
β
¯
=
¯
This is also known as the heteroskedastic consistent standard error (HCSE), or White, Huber, Eicker, or
sandwich standard errors. We will call them robust standard errors. Dividing
ˆ
β by the robust standard error
yields the robust tratio.
It is easy to get robust standard errors in Stata. For example, in the crime equation above we can see the
robust standard and tratios by adding the option robust to the regress command.
. regress crmaj prison metpct rpcpi, robust
125
Regression with robust standard errors Number of obs = 51
F( 3, 47) = 131.72
Prob > F = 0.0000
Rsquared = 0.7338
Root MSE = 3.733

 Robust
crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  2.952673 .1667412 17.71 0.000 2.617233 3.288113
metpct  .1418042 .03247 4.37 0.000 .0764828 .2071255
rpcpi  .0004919 .0002847 1.73 0.091 .0010646 .0000807
_cons  6.57136 3.006661 2.19 0.034 .5227393 12.61998

Compare this with the original model.
Source  SS df MS Number of obs = 51
+ F( 3, 47) = 43.18
Model  1805.01389 3 601.671298 Prob > F = 0.0000
Residual  654.970383 47 13.9355401 Rsquared = 0.7338
+ Adj Rsquared = 0.7168
Total  2459.98428 50 49.1996855 Root MSE = 3.733

crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  2.952673 .3576379 8.26 0.000 2.233198 3.672148
metpct  .1418042 .0328781 4.31 0.000 .0756619 .2079464
rpcpi  .0004919 .0003134 1.57 0.123 .0011224 .0001386
_cons  6.57136 3.349276 1.96 0.056 .1665135 13.30923

It is possible to weight and use robust standard errors. In fact, it is the standard methodology in crime
regressions.
. regress crmaj prison metpct rpcpi [weight=pop], robust
(analytic weights assumed)
(sum of wgt is 2.4944e+05)
Regression with robust standard errors Number of obs = 51
F( 3, 47) = 17.28
Prob > F = 0.0000
Rsquared = 0.5526
Root MSE = 4.2344

 Robust
crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  3.470689 .6700652 5.18 0.000 2.122692 4.818686
metpct  .1887946 .0726166 2.60 0.012 .0427089 .3348802
rpcpi  .0006482 .0005551 1.17 0.249 .0017649 .0004684
_cons  4.546826 4.683806 0.97 0.337 4.875776 13.96943

What is the bottom line with respect to our treatment of heteroskedasticity? Robust standard errors yield
consistent estimates of the true standard error no matter what the form of the heteroskedasticity. Using
them allows us to avoid having to discover the particular form of the problem. Finally, if there is no
heteroskedasticity, we get the usual standard errors anyway. For these reasons, I recommend always using
126
robust standard errors. That is, always add “, robust” to the regress command. This is becoming standard
procedure among applied researchers. If you know that a variable like population is a problem, then you
should weight in addition to using robust standard errors. The existence of a problem variable will become
obvious in your library research when you find that most of the previous studies weight by a certain
variable.
127
11 ERRORS IN VARIABLES
One of the GaussMarkov assumptions is that the independent variable, x, is fixed on repeated sampling.
This is a very convenient assumption for proving that the OLS parameter estimates are unbiased However,
it is only necessary that x be independent of the error term, that is, that there be no covariance between the
error term and the regressor or regressors, that is, ( ) 0 E x u
i i
= If this assumption is violated, then ordinary
least squares is biased and inconsistent.
This could happen if there are errors of measurement in the independent variable or variables. For example,
suppose the true model, expressed in deviations is
T
y x β ε = +
where x
T
is the true value of x. However, we only have the observed x=x
T
f where f is random
measurement error.
T
x x f = −
( )
( )
y x f
x f
β ε
β ε β
= − +
= + −
We have a problem if x is correlated with the error term (cþf).
cov( , ) [( )( )]
2 2
( ) ( )
2
0 if f has any variance.
T
x f E x f f
T T
E x f x f f E f
f
ε β ε β
ε ε β β β
βσ
− = + −
= + − − =−
= − ≠
So, x is correlated with the error term. This means that the OLS estimate is biased.
( ) ( )( )
ˆ
2 2 2 2
( ) 2
2 2
( )
ˆ
lim( ) lim lim
2 2 2 2
2
T T T
xy x f y x f x
T T T
x x f x x f f
T T T T
x x x f fy x
P P P
T T T
x x f f x f
β ε
β
β ε β β
β
¯ ¯ + ¯ + +
= = =
¯ ¯ + ¯ + ¯ +¯
   
¯ + + + ¯
  = =
 
¯ + ¯ +¯ ¯ +¯
\ ¹ \ ¹
128
(We are assuming that lim( ) lim( ) lim( ) lim( ) 0) So,
2
dividing top and bottom by ,
ˆ
lim( ) lim
2 var( )
1
1
2 var( )
T T
P fy P x f P x f P fy
T
x
P P
f
f
T
T x
x
β β
β
= = = =
¯
 


= =

¯ 
+
+


¯ \ ¹
Since the two variances cannot be negative, we can conclude that
ˆ
lim if var( ) 0 p f β β ≠ ≠ which means
that the OLS estimate is biased and inconsistent if the measurement error has any variance at all. In this
case, the OLS estimate is biased downward, since the true value of beta is divided by a number greater than
one.
Cure for errors in variables
There are only two things we can do to fix errors in variables. The first is to use the true value of x. Since
this is usually not possible, we have to fall back on statistics. Suppose we know of a variable, z, that is (1)
highly correlated with x
T
, but (2) uncorrelated with the error term. This variable, called an instrument or
instrumental variable, allows us to derive consistent (not unbiased) estimates.
where (  )
Multiply both sides by z.
The error is
ˆ
y x v v f
zy zx zv
e zy zx
β ε β
β
β
= + =
= +
= −
The sum of squares of error is
2 2
ˆ
( )
ˆ
To minimize the sum of squares of error, we take the derivative with respect to and set it equal to zero.
2
ˆ
2 ( ) 0 (note chain rule).
ˆ
e zy zx
d e
zx zy zx
d
β
β
β
β
¯ =¯ −
¯
=− ¯ − =
The derivative is equal to zero if and only if the expression is parentheses is zero.
ˆ
0 which implies that
ˆ
zy zx
zy
zx
β
β
¯ − ¯ =
¯
=
¯
This is the instrumental variable (IV) estimator. It collapses to the OLS estimator if x can be an instrument
for itself, which would happen if x is not measured with error. Then x would be highly correlated with
itself, and not correlated with the error term, making it the perfect instrument.
This IV estimator is consistent because z is not correlated with v.
129
( )
( )
( )
( )
( )
ˆ
lim lim lim
lim )
lim
lim
since lim 0 because z is uncorrelated with the error term.
yz x v z
p p p
xz xz
p vz xz vz
p
xz p xz
p vz
β
β
β
β
β
    ¯ ¯ +
= =
 
¯ ¯ \ ¹ \ ¹
¯ ¯ +¯  
= = +

¯ ¯ \ ¹
= ¯ =
Two stage least squares
Another way to get IV estimates is called two stage least squares (TSLS or 2sls). In the first stage we
regress the variable with the measurement error, x, on the instrument, z, using ordinary least squares,
saving the predicted values. In the second stage, we substitute the predicted values of x for the actual x and
apply ordinary least squares again. The result is the consistent IV estimate above.
First stage: regress x on z:
x z w γ = +
Save the predicted values.
ˆ ˆ x z γ =
Note that ˆ x is a linear combination of z. Since z is independent of the error term, so is ˆ x .
Second stage: substitute ˆ x for x and apply OLS again.
ˆ y x v β = +
The resulting estimate of beta is the instrumental variable estimate above.
ˆ ˆ ˆ ( )
ˆ
2 2 2 2 2
ˆ ˆ ˆ ˆ ( )
ˆ from the first stage regression. So,
2
, the instrumental variable estimator
2
2
.
xy z y zy zy
x z z z
xz
z
zy zy
zx
zx
z
z
γ γ
β
γ γ γ
γ
¯ ¯ ¯ ¯
= = = =
¯ ¯ ¯ ¯
¯
=
¯
¯ ¯
= =
¯
¯
¯
¯
HausmanWu test
We can test for the presence of errors in variables if we have an instrument. Assume that the relationship
between the instrument z and the possibly mismeasured variable, x is linear.
x z w γ = +
We can estimate this relationship using ordinary least squares.
ˆ ˆ x z γ =
130
So,
ˆ ˆ x x w = +
where ˆ wis the residual from the regression of x on z.
And,
ˆ ˆ ( )
ˆ ˆ
y x v x w v
y x w v
β β
β β
= + = + +
= + +
Since ˆ x is simple a linear combination of z, it is uncorrelated with the error term. Therefore, applying OLS
to this equation will yield a consistent estimate of þ as the parameter multiplying ˆ x . If there is
measurement error in x it is all in the residual ˆ w , so the estmate of þ from that parameter will be
inconsistent. If these two estimates are significantly different, it must be due to measurement error. Let’s
call the coefficient on ˆ w o. So we need an easy way to test whether β δ = .
ˆ ˆ ˆ ˆ but , so
ˆ ˆ ( )
ˆ ( )
y x w v x x w
y x w w v
y x w v
β δ
β δ
β δ β
= + + = −
= − + +
= + − +
The null hypothesis that β δ = can be tested with a standard ttest on the coefficient multiplying ˆ w . If it is
significant, then the two parameter estimates are different and there is significant errors in variables bias. If
it is not significant, then the two parameters are equal and there is no significant bias.
We can implement this test in Stata as follows. Wooldridge has a nice example and the data set is available
on the web. The data are in the Stata data set called errorsinvars.dta. The dependent variable is the log of
the wage earned by working women. The independent variable is education. The estimated coefficient is
the return to education. First let’s do the regression with ordinary least squares.
. regress lwage educ
Source  SS df MS Number of obs = 428
+ F( 1, 426) = 56.93
Model  26.3264237 1 26.3264237 Prob > F = 0.0000
Residual  197.001028 426 .462443727 Rsquared = 0.1179
+ Adj Rsquared = 0.1158
Total  223.327451 427 .523015108 Root MSE = .68003

lwage  Coef. Std. Err. t P>t [95% Conf. Interval]
+
educ  .1086487 .0143998 7.55 0.000 .0803451 .1369523
_cons  .1851969 .1852259 1.00 0.318 .5492674 .1788735

The return to education is a nice ten percent. However, the high wages earned by highly educated women
might be due to more intelligence. Brighter women will probably do both, go to college and earn high
wages. So how much of the return to education is due to the college education? Let’s use father’s education
as an instrument. It is probably correlated with his daughter’s brains and is independent of her wages. We
can implement instrumental variables with ivreg. The first stage regression varlist is put in parentheses.
. ivreg lwage (educ=fatheduc)
Instrumental variables (2SLS) regression
131
Source  SS df MS Number of obs = 428
+ F( 1, 426) = 2.84
Model  20.8673618 1 20.8673618 Prob > F = 0.0929
Residual  202.460089 426 .475258426 Rsquared = 0.0934
+ Adj Rsquared = 0.0913
Total  223.327451 427 .523015108 Root MSE = .68939

lwage  Coef. Std. Err. t P>t [95% Conf. Interval]
+
educ  .0591735 .0351418 1.68 0.093 .0098994 .1282463
_cons  .4411035 .4461018 0.99 0.323 .4357311 1.317938

Instrumented: educ
Instruments: fatheduc

. regress educ fatheduc
Source  SS df MS Number of obs = 428
+ F( 1, 426) = 88.84
Model  384.841983 1 384.841983 Prob > F = 0.0000
Residual  1845.35428 426 4.33181756 Rsquared = 0.1726
+ Adj Rsquared = 0.1706
Total  2230.19626 427 5.22294206 Root MSE = 2.0813

educ  Coef. Std. Err. t P>t [95% Conf. Interval]
+
fatheduc  .2694416 .0285863 9.43 0.000 .2132538 .3256295
_cons  10.23705 .2759363 37.10 0.000 9.694685 10.77942

. predict what , resid
. regress lwage educ what
Source  SS df MS Number of obs = 428
+ F( 2, 425) = 29.80
Model  27.4648913 2 13.7324457 Prob > F = 0.0000
Residual  195.86256 425 .460853082 Rsquared = 0.1230
+ Adj Rsquared = 0.1189
Total  223.327451 427 .523015108 Root MSE = .67886

lwage  Coef. Std. Err. t P>t [95% Conf. Interval]
+
educ  .0591735 .0346051 1.71 0.088 .008845 .1271919
what  .0597931 .0380427 1.57 0.117 .0149823 .1345684
_cons  .4411035 .439289 1.00 0.316 .4223459 1.304553

The instrumental variables regression shows a lower and less significant return to education. In fact,
education is only significant at the .10 level. Also, there might still be some bias because the coefficient on
the residual what is almost significant at the .10 level. More research is needed.
132
12 SIMULTANEOUS EQUATIONS
We know that whenever an explanatory variable is correlated with the error term in a regression, the
resulting ordinary least squares estimates are biased and inconsistent. For example, consider the simple
Keynesian system.
C Y u
Y C I G X
α β = + +
= + + +
where C is consumption, Y is national income (e.g. GDP), I is investment, G is government spending, and
X is net exports.
Before we go on, we need to define some terms. The above simultaneous equation system of two equations
in two unknowns (C and Y) is known as the structural model. It is the theoretical model from the textbook
or journal article, with an error term added for statistical estimation. The variables C and Y are known as
the endogenous variables because their values are determined within the system. I, G, and X are
exogenous variables because their values are determined outside the system. They are “taken as given” by
the structural model. This structural model has two types of equations. The first equation (the familiar
consumption function) is known as a behavioral equation. Behavioral equations have error terms,
indicating some randomness. The second equation is an identity because it simply defines national income
as the sum of all spending (endogenous plus exogenous). The terms u and þ are the parameters of the
system.
If we knew the values of the parameters (u and þ) and the exogenous variables (I, G, X, and the exogenous
error term u), we could solve for the corresponding values of C and Y, as follows. Substituting the second
equation into the first yields the value for C.
( )
(1 )
(1 )
1 1
(RF1)
(1 ) (1 ) (1 ) (1 ) (1 )
C C I G X u C I G X u
C C I G X u
C I G X u
I G X u
C
C I G X u
α β α β β β β
β α β β β
β α β β β
α β β β
β
β β β
α
β β β β β
= + + + + + = + + + + +
− = + + + +
− = + + + +
+ + + +
=
−
= + + + +
− − − − −
The resulting equation is called the reduced form equation for C. I have labeled it RF1 for “reduced form
1.” Note that it is a behavioral equation and that, in this case, it is linear. We could, if we had the data,
estimate this equation with ordinary least squares.
133
We can derive the reduced form equation for Y by substituting the consumption function into the national
income identity.
(1 )
(1 )
1 1 1 1 1
(RF2)
(1 ) (1 ) (1 ) (1 ) (1 )
Y Y u I G X
Y Y I G X u
Y I G X u
I G X u
Y
Y I G X u
α β
β α
β α
α
β
α
β β β β β
= + + + + +
− = + + + +
− = + + + +
+ + + +
=
−
= + + + +
− − − − −
The two reduced form equations look very similar. They are both linear, with almost the same parameters.
Also, 1/(1þ) or þ/(1þ). The parameters are functions of the parameters of the structural model. Both
equations have the same variables on the right hand side, namely all of the exogenous variables.
Because of the fact that the reduced form equations have the same list of variables on the right hand side,
all we need to write them down in general form is the list of endogenous variables (because we have one
reduced form equation for each endogenous variable) and the list of exogenous variables. In the system
above the endogenous variables are C and Y and the exogenous variables are I, G, X (and the random error
term u), so the reduced form equations can be written as
10 11 12 13 1
Y
20 21 22 23 2
C I G X v
I G X v
π π π π
π π π π
= + + + +
= + + + +
where we are using the “pi’s” to indicate the reduced from parameters. Obviously,
20
1 (1 ) π β = − , etc.
We are now in position to show that the endogenous variable, Y, on the right hand side of the consumption
function is correlated with the error term in the consumption equation. Look at the reduced form equation
for Y (RF2) above. Note that Y is a function of u. Clearly, variation in u will cause variation in Y.
Therefore, cov(Y,u) is not zero. As a result, the OLS estimate of þ in the consumption function is biased
and inconsistent.
On the other hand, OLS estimates of the reduced form equation are unbiased and consistent because the
exogenous variables are taken as given and therefore uncorrelated with the error term, v
1
or v
2
. The bad
news is that the parameters from the reduced form are “scrambled” versions of the structural form
parameters. In the case of the simple Keynesian model, the coefficients from the reduced form equation for
Y correspond to multipliers, giving the change in Y for a oneunit change in I or G or X.
Example: supply and demand
Consider the following structural supply and demand model (in deviations).
(S)
(D)
1 2
q p
q p y u
α ε
β β
= +
= + +
where q is the equilibrium quantity supplied and demanded, p is the price, and y is income. There are two
endogenous variables (p and q) and one exogenous variable, y. We can derive the reduced form equations
by solving by back substitution.
134
Solve for p:
Substitute into the demand equation,
1 2
Multiply by to get rid of fractions.
( )
1 2 1 1 2
1 1 2
( )
1 2 1
q p
p q
q
p
q
q y u
q q y u q y u
q q y u
q y u
q
α ε
α ε
ε
α
ε
β β
α
α
α β ε αβ α β β ε αβ α
α β β ε αβ α
α β αβ α β ε
= +
− = − +
−
=
−  
= + +

\ ¹
= − + + = − + +
− = − + +
− = + + −
11
2 1
( )
1
2 1
(RF1)
( ) ( )
1 1
1
y u
u
q y
q y v
αβ α β ε
α β
αβ α β ε
α β α β
π
+ −
=
−
−
= +
− −
= +
This is RF1 in general form. Now we solve for p.
1 2
1 2
( )
1 2
2
( )
1
2
(RF2)
( ) ( )
1 1
21 2
p p y u
p p y u
p y u
y u
p
u
p y
p y v
α ε β β
α β β ε
α β β ε
β ε
α β
β ε
α β α β
π
+ = + +
− = + −
− = + −
+ −
=
−
−
= +
− −
= +
This is RF2 in general form. RF2 reveals that price is a function of u, so applying OLS to either the supply
or demand curve will result in biased and inconsistent estimates.
However, we can again appeal to instrumental variables (2sls) to yield consistent estimates. The good news
is that the structural model itself produces the instruments. Remember, instruments are required to be
highly correlated with the problem variable (p), but uncorrelated with the error term. The exogenous
variable(s) in the model satisfy these conditions. In the supply and demand model, income is correlated
with price (see RF2), but independent of the error term because it is exogenous.
Suppose we choose the supply curve as our target equation. We can get a consistent estimate of u with the
IV estimator,
ˆ
yq
yp
α
¯
=
¯
This estimator can be derived as follows.
p q α ε = +
Multiply both sides by y
135
yp yq y α ε = +
Now sum
yp yq y α ε ¯ = ¯ + ¯
But, the instrument is not correlated with the error term,
0 yε ¯ =
So,
yp yq α ¯ = ¯
and
/ yq yp α = ¯ ¯
We can also use two stage least squares. In the first stage we regress the problem variable, p, on all the
exogenous variables in the model. Here there is only one exogenous variable, y. In other words, estimate
the reduced form equation, RF2. The resulting estimate of a
21
is
21 2
ˆ
py
y
π
¯
=
¯
and the predicted value of p is
21
ˆ ˆ p y π = . In the second stage we regress q on ˆ p . The resulting
estimator is
21
2 2 2 2
2
21 21
2
ˆ ˆ
ˆ
ˆ ˆ ˆ
qy qp qy qy qy
py
py p y y
y
y
π
α
π π
¯ ¯ ¯ ¯ ¯
= = = = =
¯
¯ ¯ ¯ ¯
¯
¯
which is the same as the IV estimator above.
Indirect least squares
Yet another method for deriving a consistent estimate of u is indirect least squares. Suppose we estimate
the two reduced form equations above. Then we get unbiased and consistent estimates of a
11
and a
21
. If we
take the probability limit of the two estimators, we get the IV estimator again.
2
11
1
2
12
1
11 2 1
12 1 2
11
12
11 12
From the reduced form equations,
which implies that
So,
ˆ
lim
ˆ
ˆ ˆ Because both and are consistent estimates and the ratio of two consisten
p
αβ
π
α β
β
π
α β
π αβ α β
α
π α β β
π
α
π
π π
=
−
=
−
−
= =
−
 
=

\ ¹
t estimates is consistent.
(Remember, consistency "carries over.")
Unfortunately, indirect least squares does not always work. In fact, we cannot get a consistent estimate of
the demand equation because, no matter how we manipulate a
11
and a
12
, we keep getting u instead of þ
1
or
þ
2
.
136
Suppose we expand the model above to add another explanatory variable to the demand equation: w for
weather.
1 2 3
q p
q p y w w
α ε
β β β
= +
= + + +
Solving by back substitution, as before, yields the reduced form equations.
3 2 1
1 1 1
11 12 1
3 2
1 1 1
21 22 2
u
q y w
q y w v
u
p y w
p y w v
αβ αβ α β ε
α β α β α β
π π
β β ε
α β α β α β
π π
−
= + +
− − −
= + +
−
= + +
− − −
= + +
We can use indirect least squares to estimate the slope of the supply curve as before.
11 2 1
21 1 2
3 12 1
22 1 3
as before.
But also,
(again).
π αβ α β
α
π α β β
αβ π α β
α
π α β β
−
= =
−
−
= =
−
Whoops. Now we have two estimates of u. Given normal sampling error, we won’t get exactly the same
estimate, so which is right? Or which is more right?
So, indirect least squares yielded a unique estimate in the case of u, no estimate at all of the parameters of
the demand curve, and multiple estimates of u with a minor change in the specification of the structural
model. What is going on?
The identification problem
An equation is identified if we can derive estimates of the structural parameters from the reduced form. If
we cannot, it is said to be unidentified. The demand curve was unidentified in the above examples because
we could not derive any estimates of þ
1
or þ
2
from the reduced form equations. If an equation is identified,
it can be “just” identified, which means the reduced form equations yield a unique estimate of the
parameters of the equation of interest, or it can be “over” identified, in which case the reduce form yields
multiple estimates of the same parameters of the equation of interest. The supply curve was just identified
in the first example, but over identified when we added the weather variable.
It would be nice to know if the structural equation we are trying to estimate is identified. A simple rule, is
the socalled order condition for identification. To apply the order condition, we first have to choose the
equation of interest. Let G be the number of endogenous variables in the equation of interest (including the
dependent variable on the left hand side). Let K be the number of exogenous variables that are in the
structural model, but excluded from the equation of interest. To be identified, the following relationship
must hold.
1 K G ≥ −
Let’s apply this rule to the supply curve in the first supply and demand model. G=2 (q and p) and K=1 (y is
excluded from the supply curve), so 1=21 and the supply equation is just identified. This is why we got a
137
unique estimate of u from indirect least squares. Choosing the demand curve as the equation of interest, we
find that G=2 as before, but K=0 because there are no exogenous variables excluded from the demand
curve. Finally, let’s analyze the supply curve from the second supply and demand model. Again, G=2, but
now K=2 (y and w are both excluded from the supply curve). So 2>21 and the supply curve is over
identified. That is why we got two estimates from indirect least squares.
Two notes concerning the order condition. First, it is a necessary, but not sufficient condition for
identification. That is, no equation that fails the order condition will be identified. However, there are rare
cases where the order condition is satisfied but the equation is still not identified. However, this is not a
major problem because, if we try to estimate an equation using two stage least squares or some other
instrumental variable technique, the model will simply fail to yield an estimate. At that point we would
know that the equation was unidentified. The good news is that it won’t generate a wrong or biased
estimate. A more complicated condition, called the rank condition, is necessary and sufficient, but the order
condition is usually good enough.
Second, there is always the possibility that the model will be identified according to the order condition, but
fail to yield an estimate because the exogenous variables excluded from the equation of interest are too
highly correlated with each other. Say we require that two variables be excluded from the equation of
interest and we have two exogenous variables in the model but not in the equation. Looks good. However,
if the two variables are really so highly correlated that one is simply a linear combination of the other, then
we don’t really have two independent variables. This problem of “statistical identification” is extremely
rare.
Illustrative example
Dennis Epple and Bennett McCallum, from Carnegie Mellon, have developed a nice example of a
simultaneous equation model using real data.
5
The market they analyze is the market for broiler chickens.
The demand for broiler chicken meat is assumed to be a function of the price of chicken meat, real income,
and the price of beef, a close substitute. Supply is assumed to be a function of price and the price of chicken
feed (corn). Epple and McCallum have collected data for the US as a whole, from 1950 to 2001. They
recommend estimating the model in logarithms. The data are available in DandS.dta.
First we try ordinary least squares on the demand curve. For reasons that will be discussed in a later
chapter, we estimate the chicken demand model in first differences. This is a short run demand equation
relating the change in price to the change in consumption.
1 2 3 beef
q p y p u β β β ∆ = ∆ + ∆ + ∆ +
Since all the variables are logged, these first differences of logs are growth rates (percent changes). To
estimate this equation in Stata, we have to tell Stata that we have a time variable with the tsset year
command. We can then use the D. operator to take first differences. We will use the noconstant option to
avoid adding a time trend to the demand curve.
. tsset year
time variable: year, 1950 to 2001
. regress D.lq D.ly D.lpbeef D.lp , noconstant
5
Epple, Dennis and Bennett T. McCallum, Simultaneous equation econometrics: the missing example,
http://littlehurt.gsia.cmu.edu/gsiadoc/WP/2004E6.pdf
138
Source  SS df MS Number of obs = 51
+ F( 3, 48) = 19.48
Model  .053118107 3 .017706036 Prob > F = 0.0000
Residual  .043634876 48 .00090906 Rsquared = 0.5490
+ Adj Rsquared = 0.5208
Total  .096752983 51 .001897117 Root MSE = .03015

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ly 
D1  .7582548 .1677874 4.52 0.000 .4208957 1.095614
lpbeef 
D1  .305005 .0612757 4.98 0.000 .181802 .4282081
lp 
D1  .3553965 .0641272 5.54 0.000 .4843329 .2264601

Lq is the log of quantity (per capita chicken consumption), lp is the log of price, ly is the log of income, and
lpbeef is the log of the price of beef. This is not a bad estimated demand function. The demand curve is
downward sloping, but very inelastic, and significant at the .10 level. The elasticity of income is positive,
indicating that chicken is not an inferior good, which makes sense. The coefficient on the price of beef is
positive, which is consistent with the price of a substitute. All the explanatory variables are highly
significant.
Now let’s try the supply curve.
. regress D.lq D.lpfeed D.lp, noconstant
Source  SS df MS Number of obs = 39
+ F( 2, 37) = 0.28
Model  .00083485 2 .000417425 Prob > F = 0.7595
Residual  .055731852 37 .001506266 Rsquared = 0.0148
+ Adj Rsquared = 0.0385
Total  .056566702 39 .001450428 Root MSE = .03881

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lpfeed 
D1  .0392005 .060849 0.64 0.523 .1624923 .0840913
lp 
D1  .0038389 .0910452 0.04 0.967 .1806362 .188314

We have a problem. The coefficient on price is virtually zero. This means that the supply of chicken does
not respond to changes in price. We need to use instrumental variables.
Let’s first try instrumental variables on the demand equation. The exogenous variables are lpfeed, ly, and
lpbeef.
. ivreg D.lq D.ly D.lpbeef (D.lp= D.lpfeed D.ly D.lpbeef ), noconstant
Instrumental variables (2SLS) regression
Source  SS df MS Number of obs = 39
+ F( 3, 36) = .
Model  .032485142 3 .010828381 Prob > F = .
Residual  .024081561 36 .000668932 Rsquared = .
+ Adj Rsquared = .
Total  .056566702 39 .001450428 Root MSE = .02586
139

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lp 
D1  .4077264 .1410341 2.89 0.006 .6937568 .1216961
ly 
D1  .9360443 .1991184 4.70 0.000 .5322135 1.339875
lpbeef 
D1  .3159692 .0977788 3.23 0.003 .1176646 .5142739

Instrumented: D.lp
Instruments: D.ly D.lpbeef D.lpfeed

Not much different, but the estimate of the income elasticity of demand is now virtually one, indicating that
chicken demand will grow at about the same rate as income. Note that the demand curve is just identified
(by lpfeed).
Let’s see if we can get a better looking supply curve.
. ivreg D.lq D.lpfeed (D.lp=D.lpbeef D.ly), noconstant
Instrumental variables (2SLS) regression
Source  SS df MS Number of obs = 39
+ F( 2, 37) = .
Model  .079162381 2 .03958119 Prob > F = .
Residual  .135729083 37 .003668354 Rsquared = .
+ Adj Rsquared = .
Total  .056566702 39 .001450428 Root MSE = .06057

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lp 
D1  .6673431 .2617257 2.55 0.015 .1370364 1.19765
lpfeed 
D1  .2828285 .1246235 2.27 0.029 .5353398 .0303173

Instrumented: D.lp
Instruments: D.lpfeed D.lpbeef D.ly

This is much better. The coefficient on price is now positive and significant, indicating that the supply of
chickens will increase in response to an increase in price. Also, increases in chickenfeed prices will reduce
the supply of chickens. The supply curve is over identified (by the omission of ly and lpbeef).
We could have asked for robust standard errors, but they didn’t make any difference in the levels of
significance, so we used the default homoskedastic only standard errors.
Note to the reader. I have taken a few liberties with the original example developed by Professors Epple
and McCallum. They actually develop a slightly more sophisticated, and more believable, model, where the
coefficient on price in the supply equation starts out negative and significant in the supply equation and
eventually winds up positive and significant. However, they have to complicate the model in a number of
ways to make it work. This is a cleaner example, using real data, but it could suffer from omitted variable
bias. On the other hand, the Epple and McCallum demand curve probably suffers from omitted variable
bias too, so there. Students are encouraged to peruse the original paper, available on the web.
Diagnostic tests
140
There are three diagnostic tests that should be done whenever you do instrumental variables. The first is to
justify the identifying restrictions.
Tests for over identifying restrictions
In the above example, we identified the supply equation by omitting income and the price of beef. This
seems quite reasonable, but it is nevertheless necessary to justify omitting these variables by showing that
they would not be significant if we included them in the regression. There are several versions of this test
by Sargan, Basmann, and others. It is simply an LM test where we save the residuals from the instrumental
variables regression and then regress them against all the instruments. If the omitted instrument does not
belong in the equation, then this regression should have an Rsquare of zero. We test this null hypothesis
with the usual
2 2
( ) nR m k χ − where m is the number of instruments excluded from the equation and k is
the number of endogenous variables on the right hand side of the equation.
We can implement this test easily in Stata. First, let’s do the ivreg again.
. ivreg D.lq D.lpfeed (D.lp=D.lpbeef D.ly), noconstant
Instrumental variables (2SLS) regression
Source  SS df MS Number of obs = 39
+ F( 2, 37) = .
Model  .079162381 2 .03958119 Prob > F = .
Residual  .135729083 37 .003668354 Rsquared = .
+ Adj Rsquared = .
Total  .056566702 39 .001450428 Root MSE = .06057

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lp 
D1  .6673431 .2617257 2.55 0.015 .1370364 1.19765
lpfeed 
D1  .2828285 .1246235 2.27 0.029 .5353398 .0303173

Instrumented: D.lp
Instruments: D.lpfeed D.lpbeef D.ly

If your version of Stata has the overid command installed, then you can run the test with a single command.
. overid
Tests of overidentifying restrictions:
Sargan N*Rsq test 0.154 Chisq(1) Pvalue = 0.6948
Basmann test 0.143 Chisq(1) Pvalue = 0.7057
The tests are not significant, indicating that these variables do not belong in the equation of interest. Therefore, they are
properly used as identifying variables. Whew!
If there is no overid command after ivreg, then you can install it by clicking on help, search, overid, and clicking on the
link that says, “click to install.” It only takes a few seconds. If you are using Stata on the server, you might not be able
to install new commands. In that case, you can do it yourself as follows. First, save the residuals from the ivreg.
. predict es, resid
(13 missing values generated)
141
Now regress the residuals on all the instruments. In our case we are doing the regression in first and the
constant term was not included. So use the noconstant option.
. regress es D.lpfeed D.lpbeef D.ly, noconstant
Source  SS df MS Number of obs = 39
+ F( 3, 36) = 0.05
Model  .000535634 3 .000178545 Prob > F = 0.9860
Residual  .135193453 36 .003755374 Rsquared = 0.0039
+ Adj Rsquared = 0.0791
Total  .135729086 39 .003480233 Root MSE = .06128

es  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lpfeed 
D1  .005058 .0863394 0.06 0.954 .1700464 .1801625
lpbeef 
D1  .0536104 .1686351 0.32 0.752 .3956183 .2883974
ly 
D1  .1264734 .3897588 0.32 0.747 .6639941 .9169409
We can use the Fform of the LM test by invoking Stata’s test command on the identifying variables.
. test D.lpbeef D.ly
( 1) D.lpbeef = 0
( 2) D.ly = 0
F( 2, 36) = 0.07
Prob > F = 0.9313
The results are the same, namely, that the identifying variables are not significant and therefore can be used
to identify the equation of interest. However, the probvaluses are not identical to the large sample chi
square versions reported by the overid command.
To replicate the overid analysis, let’s start by reminding ourselves what output is available for further
analysis.
. ereturn list
scalars:
e(N) = 39
e(df_m) = 3
e(df_r) = 36
e(F) = .0475437411474742
e(r2) = .00394634310271
e(rmse) = .0612811038678276
e(mss) = .0005356335440678
e(rss) = .1351934528853412
e(r2_a) = .0790581283053975
e(ll) = 55.12129587257286
macros:
e(depvar) : "es"
e(cmd) : "regress"
e(predict) : "regres_p"
e(model) : "ols"
matrices:
e(b) : 1 x 3
e(V) : 3 x 3
functions:
e(sample)
142
. scalar r2=e(r2)
. scalar n=e(N)
. scalar nr2=n*r2
We know there is one degree of freedom for the resulting chisquare statistic because m=2 omitted
instruments (ly and lpbeef) and k=1 endogenous regressor (lp).
. scalar prob=1chi2(1,nr2)
. scalar list r2 n nr2 prob
r2 = .00394634
n = 39
nr2 = .15390738
prob = .69482895
This is identical to the Sargan form reported by the overid command above. An equivalent test of this
hypothesis is the Jtest for overidentifying restrictions. Instead of using the nRsquare statistic, the Jtest
uses the j=mF statistic where F is the Ftest for the auxiliary regression. From the output above. J is
distributed as chisquare with the same mk degrees of freedom.
. scalar J=2*e(F)
. scalar probJ=1chi2(2,J)
. scalar list J probJ
J = .09508748
probJ = .95356876
All of these tests agree that the identifying restrictions are valid. The fact that they are not significant means
that income and beef prices do not belong in the supply of chicken equation.
We cannot do the tests for over identifying restrictions on the demand curve because it is just identified.
The problem is that the residual from the ivreg regression will be an exact linear combination of the
instruments. If we forget and try to do the LM test we get a regression that looks like this (ed is the residual
from the ivreg estimate of the demand curve above.)
. regress ed D.ly D.lpbeef D.lpfeed, noconstant
Source  SS df MS Number of obs = 39
+ F( 3, 36) = 0.00
Model  0 3 0 Prob > F = 1.0000
Residual  .024081561 36 .000668932 Rsquared = 0.0000
+ Adj Rsquared = 0.0833
Total  .024081561 39 .000617476 Root MSE = .02586

ed  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ly 
D1  5.70e10 .1644979 0.00 1.000 .3336173 .3336173
lpbeef 
D1  3.94e10 .0711725 0.00 1.000 .1443446 .1443446
lpfeed 
D1  1.02e11 .0364396 0.00 1.000 .0739029 .0739029

So, if you see output like this, you know you can’t do a test for over identifying restrictions, because your
equation is not over identified.
143
Test for weak instruments
To be an instrument a variable must satisfy two conditions. It must be (1) highly correlated with the
problem variable and (2) it must be uncorrelated with the error term. What happens if (1) is not true? That
is, is there a problem when the instrument is only vaguely associated with the endogenous regressor?
Bound, Jaeger
6
, and Baker have demonstrated that instrumental variable estimates are biased by about 1/F*
where F* is the Ftest on the (excluded) identifying variables in the reduced form equation for the
endogenous regressor. This is the percent of the OLS bias that remains after applying two stage least
squares. For example, if F*=2, then 2SLS is still half as biased as OLS. In the case of our chicken supply
curve, that would be a regression of lp on ly, lpbeef, and lpfeed and the F* test would be on lpbeef and ly.
. regress lp ly lpbeef lpfeed, noconstant
Source  SS df MS Number of obs = 40
+ F( 3, 37) =39090.05
Model  801.425968 3 267.141989 Prob > F = 0.0000
Residual  .252858576 37 .006834016 Rsquared = 0.9997
+ Adj Rsquared = 0.9997
Total  801.678827 40 20.0419707 Root MSE = .08267

lp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ly  .1264946 .0226887 5.58 0.000 .0805228 .1724664
lpbeef  .628012 .0764761 8.21 0.000 .4730568 .7829672
lpfeed  .1196143 .1046508 1.14 0.260 .0924285 .331657

. test ly lpbeef
( 1) ly = 0
( 2) lpbeef = 0
F( 2, 37) = 35.75
Prob > F = 0.0000
Obviously, these are not weak instruments and two stage least squares is much less biased than OLS.
Actually we should have predicted this by the difference between the OLS and 2sls versions of the
equations above.
Anyway, the BJB F test is very easy to do and it should be routinely reported for any simultaneous equation
estimation. Note that it has the same problem as the tests for over identifying restrictions, namely, it does
not work where the equation is just identified. So we cannot use it for the demand function.
HausmanWu test
We have already encountered the HausmanWu test in the chapter on errors in variables. However, it is also
relevant in the case of simultaneous equations.
Consider the following equation system.
6
Yes, that is our own Professor Jaeger.
144
1 12 2 13 1 1
2 21 1 22 2 23 3 2
1 12 2 13 1 1
2 22 2 23 3 2
1 2
Model A (simultaneous)
(A1)
(A2)
Model B (recursive)
(B1)
(B2)
where cov( ) 0.
y y z u
y y z z u
y y z u
y z z u
u u
β β
β β β
β β
β β
= + +
= + + +
= + +
= + +
=
Suppose we are interested in getting a consistent estimate of
12
β . If we apply OLS to equation (A1) we get
biased and inconsistent estimates. If we apply OLS to equation (B1) we get unbiased, consistent, and
efficient estimates. And it is the same equation! We need to know about the second equation, not only the
equation of interest.
To apply the HausmanWu test to see if we need to use 2sls on the supply equation, we estimate the
reduced form equation for Alp and save the residuals.
. regress D.lp D.ly D.lpbeef D.lpfeed, noconstant
Source  SS df MS Number of obs = 39
+ F( 3, 36) = 12.37
Model  .132113063 3 .044037688 Prob > F = 0.0000
Residual  .128161345 36 .003560037 Rsquared = 0.5076
+ Adj Rsquared = 0.4666
Total  .260274407 39 .006673703 Root MSE = .05967

D.lp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ly 
D1  .7530405 .3794868 1.98 0.055 .0165944 1.522675
lpbeef 
D1  .3437728 .1641908 2.09 0.043 .0107785 .6767671
lpfeed 
D1  .2583745 .084064 3.07 0.004 .0878849 .4288641

. predict dlperror, resid
(13 missing values generated)
Now, add the error term to the original regression. If it is significant, then there is a significant difference
between the OLS estimate and the 2sls estimate. In which case, OLS is biased and we need to use
instrumental variables. If it is not significant, then we need to check the magnitude of the coefficient. If it is
large and the standard error is even larger, then we cannot be sure that OLS is unbiased. In that event, we
should do it both ways, OLS and IV, and hope we get essentially the same results.
. regress D.lq D.lp dlperror D.lpfeed, noconstant
Source  SS df MS Number of obs = 39
+ F( 3, 36) = 18.43
Model  .034261762 3 .011420587 Prob > F = 0.0000
Residual  .022304941 36 .000619582 Rsquared = 0.6057
+ Adj Rsquared = 0.5728
Total  .056566702 39 .001450428 Root MSE = .02489

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lp 
145
D1  .6673431 .1075623 6.20 0.000 .4491967 .8854896
dlperror  .94075 .1280782 7.35 0.000 1.200505 .6809953
lpfeed 
D1  .2828285 .051217 5.52 0.000 .3867013 .1789557

You can also do the HausmanWu test using the predicted value of the problem variable instead of the
residual from the reduced form equation. The following HW test equation was performed after saving the
predicted value. The test coefficient is simply the negative of the coefficient on the residuals.
. regress D.lq D.lp dlphat D.lpfeed, noconstant
Source  SS df MS Number of obs = 39
+ F( 3, 36) = 18.43
Model  .034261763 3 .011420588 Prob > F = 0.0000
Residual  .02230494 36 .000619582 Rsquared = 0.6057
+ Adj Rsquared = 0.5728
Total  .056566702 39 .001450428 Root MSE = .02489

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lp 
D1  .2734069 .0695298 3.93 0.000 .4144198 .132394
dlphat  .94075 .1280782 7.35 0.000 .6809953 1.200505
lpfeed 
D1  .2828285 .051217 5.52 0.000 .3867013 .1789557

Applying the HausmanWu test to the demand equation yields the following test equation.
. regress D.lq D.ly D.lpbeef D.lp dlperror, noconstant
Source  SS df MS Number of obs = 39
+ F( 4, 35) = 13.99
Model  .034797396 4 .008699349 Prob > F = 0.0000
Residual  .021769306 35 .00062198 Rsquared = 0.6152
+ Adj Rsquared = 0.5712
Total  .056566702 39 .001450428 Root MSE = .02494

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ly 
D1  .9360443 .1920032 4.88 0.000 .546257 1.325832
lpbeef 
D1  .3159692 .0942849 3.35 0.002 .1245608 .5073777
lp 
D1  .4077264 .1359945 3.00 0.005 .6838099 .131643
dlperror  .1343196 .1527992 0.88 0.385 .1758793 .4445185

The test reveals that the OLS estimate is not significantly different from the instrumental variables
estimate.
146
Seemingly unrelated regressions
Consider the following system of equations relating local government expenditures on fire and police (FP),
schools (S) and other (O) to population density (D), income (Y), and the number of school age children (C).
The equations relate the proportion of the total budget (T) in each of these categories to the explanatory
variables, D, Y, and C.
0 1 2 1
0 1 2 2
0 1 2 2 3
/
/
/
FP T a a D a Y U
S T b bY b C U
O T c c D c Y c C U
= + + +
= + + +
= + + + +
These equations are not simultaneous, so OLS can be expected to yield unbiased and consistent estimates.
However, because the proportions have to sum to one (FP/T+S/T+O/T=1), any random increase in
expenditures on fire and police must imply a random decrease in schools and/or other components.
Therefore, the error terms must be correlated across equations, i.e., ( ) 0
i j
E U U ≠ for i=j. This is information
that could be used to increase the efficiency of the estimates. The procedure is to estimate the equations
using OLS, save the residuals, compute the covariances between U
1
, U
2
, etc. and use these estimated
covariances to improve the OLS estimates. This technique is known as seemingly unrelated regressions
(SUR) and Zellner efficient least squares (ZELS) for Alan Zellner from the University of Chicago who
developed the technique.
In Stata, we can use the sureg command to estimate the above system. Let FPT be the proportion spent on
fire and police and ST be the proportion spent on schools. Because the proportion of the budget going to
other things is simply one minus the sum of expenditures on fire, police, and schools, we drop the third
equation. In the sureg command, we simply put the varlist for each equation in parentheses.
Sureg (FPT D Y) (ST Y C)
Note: if the equations have the same regressors, then the SUR estimates will be identical to the OLS
estimates and there is no efficiency gain. Also, if there is specification bias in one equation, say omitted
variables, then, because the residuals are correlated across equations, the SUR estimates of the other
equation are biased. The specification error is “propagated” across equations. For this reason, it is not clear
that we should use SUR for all applications for which it might be employed. That being said, we should use
SUR in those cases where the dependent variables sum to a constant, such as in the above example.
Three stage least squares
Whenever we have a system of simultaneous equations, we automatically have equations whose errors
could be correlated. Thus, two stage least squares estimates of a multiequation system could potentially be
made more efficient by incorporating the information contained in the covariances of the residuals.
Applying SUR to two stage least squares results in three stage least squares.
Three stage least squares can be implemented in Stata with the reg3 command. Like the sureg command,
the equation varlists are put in equations. In this example I use equation labels to identify the demand and
supply equations.
147
In the first example, I replicate the ivreg above exactly by specifying each equation with the noconstant
option (in the parentheses) and adding , noconstant to the model as a whole to make sure that the constant is
not used as an instrument in the reduced form equations.
. reg3 (Demand:D.lq D.lp D.ly D.lpbeef, noconstant) (Supply: D.lq D.lp D.lpfeed
> , noconstant), noconstant
Threestage least squares regression

Equation Obs Parms RMSE "Rsq" chi2 P

Demand 39 3 .0265163 0.5152 36.31 0.0000
Supply 39 2 .0379724 0.0059 0.15 0.9258


 Coef. Std. Err. z P>z [95% Conf. Interval]
+
Demand 
lp 
D1  .194991 .0549483 3.55 0.000 .3026876 .0872944
ly 
D1  .5204546 .1255017 4.15 0.000 .2744758 .7664334
lpbeef 
D1  .1600626 .0546049 2.93 0.003 .053039 .2670861
+
Supply 
lp 
D1  .0250497 .0835039 0.30 0.764 .1887144 .138615
lpfeed 
D1  .0040577 .0468383 0.09 0.931 .0958591 .0877437

Endogenous variables: D.lq
Exogenous variables: D.lp D.ly D.lpbeef D.lpfeed

Yuck. The supply function is terrible. Let’s try adding the intercept term..
. reg3 (Demand:D.lq D.lp D.ly D.lpbeef) (Supply: D.lq D.lp D.lpfeed)
Threestage least squares regression

Equation Obs Parms RMSE "Rsq" chi2 P

Demand 39 3 .0236988 0.2754 15.13 0.0017
Supply 39 2 .0249881 0.1944 9.56 0.0084


 Coef. Std. Err. z P>z [95% Conf. Interval]
+
Demand 
lp 
D1  .1871072 .0484276 3.86 0.000 .2820235 .0921909
ly 
D1  .0948342 .109993 0.86 0.389 .1207482 .3104166
lpbeef 
D1  .0491061 .0336213 1.46 0.144 .0167905 .1150026
_cons  .0274115 .0044904 6.10 0.000 .0186104 .0362125
+
Supply 
lp 
148
D1  .1627361 .0547593 2.97 0.003 .2700624 .0554097
lpfeed 
D1  .0018805 .0196841 0.10 0.924 .0366997 .0404607
_cons  .0306582 .0042607 7.20 0.000 .0223074 .039009

Endogenous variables: D.lq
Exogenous variables: D.lp D.ly D.lpbeef D.lpfeed

Better, at least for the estimates of the price elasticities. None of the other variables are significant. Three
stage least squares has apparently made things worse. This is very unusual. Normally the three stage least
squares estimates are almost the same as the two stage least squares estimates.
Types of equation systems
We are already familiar with a fully simultaneous equation system. The supply and demand system
(S)
(D)
1 2
q p
q p y u
α ε
β β
= +
= + +
is fully simultaneous in that quantity is a function of price and price is a function of quantity. Now consider
the following general three equation model.
1 0 1 1 2 2 1
2 0 1 1 2 1 3 2 2
3 0 1 1 2 2 3 1 4 2 3
Y X X u
Y Y X X u
Y Y Y X X u
α α α
β β β β
γ γ γ γ γ
= + + +
= + + + +
= + + + + +
This equation system is simultaneous, but not completely simultaneous. Note that Y
3
is a function of Y
2
and Y
1
, but neither of them is a function of Y
3
. Y
2
is a function of Y
1
, but Y
1
is not a function of Y
2
. The
model is said to be triangular because of this structure. If we also assume that there is no covariance
among the error terms, that is,
1 2 1 3 2 3
( ) ( ) ( ) 0 E u u E u u E u u = = = , then this is a recursive system. Since
there is no feedback from any right hand side endogenous variable to the dependent variable in the same
equation, there is no correlation between the error term and any right hand side variable in the same
equation, therefore there is no simultaneous equation bias in this model. OLS applied to any of the
equations will yield unbiased, consistent, and efficient estimates (assuming the equations have no other
problems).
Finally, consider the following model.
1 0 1 2 2 1 3 2 1
2 0 1 1 2 1 3 2 2
3 0 1 1 2 2 3 1 4 2 3
Y Y X X u
Y Y X X u
Y Y Y X X u
α α α α
β β β β
γ γ γ γ γ
= + + + +
= + + + +
= + + + + +
Now Y
1
is a function of Y
2
and Y
2
is a function of Y
1
. These two equations are fully simultaneous. OLS
applied to either will yield biased and inconsistent estimates. This part of the model is called the
simultaneous core. This model system is known as block recursive and is very common in large
simultaneous macroeconomic models. If we assume that
1 3 2 3
( ) ( ) 0 E u u E u u = = then OLS applied to the
third (recursive) equation yields unbiased, consistent, and efficient estimates.
149
Strategies for dealing with simultaneous equations
The obvious strategy when faced with a simultaneous equation system is to use two stage least squares. I
don’t recommend three stage least squares because of possible propagation of errors across equations.
However, there is always the problem of identification and/or weak instruments. Some critics have even
suggested that, since the basic model of all economics is the general equilibrium model, all variables are
functions of all other variables and no equation can ever be identified. Even if we can identify the equation
of interest with some reasonable exclusion assumptions and we have good instruments, the best we can
hope for is a consistent, but inefficient estimate of the parameters.
For many situations, the best solution may be to estimate one or more reduced form equations, rather than
trying to estimate equations in the structural model. For example, suppose we are trying to estimate the
effect of a certain policy, say threestrikes laws, on crime. We may have police in the crime equation and
police and crime may be simultaneously determined, but we do not have to know the coefficient on the
police variable in the crime equation to determine the effect of the threestrikes law on crime. That
coefficient is available from the reduced form equation for crime. OLS applied to the reduced form
equations are unbiased, consistent, and efficient.
However, it may be the case that we can’t retreat to the reduced form equations. In the above example,
suppose we want to estimate the potential effect of a policy that puts 100,000 cops on the streets. In this
case we need to know the coefficient linking the police to the crime rate, so that we can multiply it by 100K
in order to estimate the effect of 100,000 more police. We have no choice but to use instrumental variables
if we are to generate consistent estimates of the parameter of interest.
A strategy that is frequently adopted when large systems of simultaneous equations are being estimated is
to plow ahead with OLS despite the fact that the estimates are biased and inconsistent. The estimates may
be biased and inconsistent, but they are efficient and the bias might not be too bad (and 2sls is also biased
in small samples). So OLS might not be too bad. Usually, researchers adopting this strategy insist that this
is just the first step and the OLS estimates are “preliminary.” The temptation is to never get around to the
next step.
A strategy that is available to researchers that have time series data is to use lags to generate instruments or
to avoid simultaneity altogether. Suppose we start with the simple demand and supply model above where
we add time subscripts.
(S)
(D)
1 2
q p
t t t
q p y u
t t t t
α ε
β β
= +
= + +
Note that this model is fully simultaneous and the only way to consistently estimate the supply function is
to use two stage least squares. However, suppose this is agricultural data. Then the supply is based on last
year’s price while demand is based on this year’s price. (This is the famous cobweb model.)
(S)
1
(D)
1 2
q p
t t t
q p y u
t t t t
α ε
β β
= +
−
= + +
Presto! Assuming that ( ) 0
t t
E u ε = and assuming we have no autocorrelation (see Chapter 14 below), we
now have a recursive system. OLS applied to either equation will yield unbiased, consistent, and efficient
estimates.
Even if we can’t assume that current supply is based on last year’s price, we can at least generate a bunch
of instruments by assuming that the equations are also functions of lagged variables. For example, the
supply equation could be a function of last years supply as well as last year’s price. In that case, we have
the following equation system.
150
(S)
1 2 1 3 1
(D)
1 2
q p p q
t t t t t
q p y u
t t t t
α α α ε
β β
= + + +
− −
= + +
Note that we have two new instruments (p
t1
and q
t1
) which are called lagged endogenous variables. Since
time does not go backwards, this year’s quantity and price cannot affect last year’s values, these are valid
instruments (again assuming no autocorrelation). Now both equations are identified and both can be
consistently estimated using two stage least squares.
Time series analysts frequently estimate vector autoregression (VAR) models in which the dependent
variable is a function of its own lags and lags of all the other variables in the simultaneous equation model.
This is essentially the reduced form equation approach for time series data where the lagged endogenous
variables are the exogenous variables.
Another example
Pinydyck and Rubinfeld have an example where state and local government expenditures (exp) are taken to
be a function of the amount of federal government grants (aid), and, as control variables, population and
income
The data are in a Stata data set called EX73 (for example 7.3 from Pindyck and Rubinfeld). The OLS
regression model is
. regress exp aid inc pop
Source  SS df MS Number of obs = 50
+ F( 3, 46) = 2185.50
Model  925135805 3 308378602 Prob > F = 0.0000
Residual  6490681.38 46 141101.769 Rsquared = 0.9930
+ Adj Rsquared = 0.9926
Total  931626487 49 19012785.4 Root MSE = 375.64

exp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
aid  3.233747 .238054 13.58 0.000 2.754569 3.712925
inc  .0001902 .0000235 8.10 0.000 .000143 .0002375
pop  .5940753 .1043992 5.69 0.000 .80422 .3839306
_cons  45.69833 84.39169 0.54 0.591 215.57 124.1733

However, many grants are tied to the amount of money the state is spending. So the aid variable could be a function of
the dependent variable, exp. Therefore aid is a potentially endogenous variable. Let’s do a HausmanWu test to see if
we need to use two stage least squares. First we need an instrument. Many openended grants (that are functions of the
amount the state spends, like matching grants) are for education. A variable that is correlated with grants but
independent of the error term is the number of school children, ps. We regress aid on ps, and the other exogenous
variables, and save the residuals, ˆ w .
. regress aid ps inc pop
Source  SS df MS Number of obs = 50
+ F( 3, 46) = 220.70
Model  32510247.4 3 10836749.1 Prob > F = 0.0000
Residual  2258713.81 46 49102.4741 Rsquared = 0.9350
+ Adj Rsquared = 0.9308
Total  34768961.2 49 709570.636 Root MSE = 221.59
151

aid  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ps  .8328735 .3838421 2.17 0.035 1.605508 .0602394
inc  .0000365 .0000133 2.75 0.008 9.79e06 .0000633
pop  .1738793 .1248854 1.39 0.171 .0775019 .4252605
_cons  42.40924 49.68255 0.85 0.398 57.59654 142.415

. predict what, resid
Now add the residual, what (what), to the regression. If it is significant, ordinary least squares is biased.
. regress exp aid what inc pop
Source  SS df MS Number of obs = 50
+ F( 4, 45) = 1716.17
Model  925559177 4 231389794 Prob > F = 0.0000
Residual  6067310.05 45 134829.112 Rsquared = 0.9935
+ Adj Rsquared = 0.9929
Total  931626487 49 19012785.4 Root MSE = 367.19

exp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
aid  4.522657 .7636841 5.92 0.000 2.984518 6.060796
what  1.420832 .8018143 1.77 0.083 3.035769 .194105
inc  .0001275 .0000422 3.02 0.004 .0000424 .0002125
pop  .5133494 .1117587 4.59 0.000 .738443 .2882557
_cons  90.1253 86.22021 1.05 0.301 263.7817 83.53111

Apparently there is significant measurement error, at the recommended 10 percent significance level for specification
tests. It turns out that the coefficient on aid in the HausmanWu test equation is exactly the same as we would have
gotten if we had used instrumental variables. To see this, let’s use IV to derive consistent estimates for this model.
. ivreg exp (aid=ps) inc pop
Instrumental variables (2SLS) regression
Source  SS df MS Number of obs = 50
+ F( 3, 46) = 1304.09
Model  920999368 3 306999789 Prob > F = 0.0000
Residual  10627118.5 46 231024.316 Rsquared = 0.9886
+ Adj Rsquared = 0.9878
Total  931626487 49 19012785.4 Root MSE = 480.65

exp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
aid  4.522657 .9996564 4.52 0.000 2.510453 6.534861
inc  .0001275 .0000553 2.31 0.026 .0000162 .0002387
pop  .5133494 .1462913 3.51 0.001 .8078184 .2188803
_cons  90.1253 112.8616 0.80 0.429 317.3039 137.0533

Instrumented: aid
Instruments: inc pop ps

Note that ivreg uses all the variables that are not indicated as problem variables (inc and pop) as well as the
designated instrument, ps, as instruments. Note that the coefficient is, in fact, the same as in the Hausman
Wu test equation.
152
If we want to see the first stage regression, add the option “first.” We can also request robust standard
errors.
. ivreg exp (aid=ps) inc pop, first robust
Firststage regressions

Source  SS df MS Number of obs = 50
+ F( 3, 46) = 220.70
Model  32510247.4 3 10836749.1 Prob > F = 0.0000
Residual  2258713.81 46 49102.4741 Rsquared = 0.9350
+ Adj Rsquared = 0.9308
Total  34768961.2 49 709570.636 Root MSE = 221.59

aid  Coef. Std. Err. t P>t [95% Conf. Interval]
+
inc  .0000365 .0000133 2.75 0.008 9.79e06 .0000633
pop  .1738793 .1248854 1.39 0.171 .0775019 .4252605
ps  .8328735 .3838421 2.17 0.035 1.605508 .0602394
_cons  42.40924 49.68255 0.85 0.398 57.59654 142.415

IV (2SLS) regression with robust standard errors Number of obs = 50
F( 3, 46) = 972.40
Prob > F = 0.0000
Rsquared = 0.9886
Root MSE = 480.65

 Robust
exp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
aid  4.522657 1.515 2.99 0.005 1.47312 7.572194
inc  .0001275 .0000903 1.41 0.165 .0000544 .0003093
pop  .5133494 .1853334 2.77 0.008 .8864062 .1402925
_cons  90.1253 86.34153 1.04 0.302 263.9218 83.67117

Instrumented: aid
Instruments: inc pop ps

Because the equation is just identified, we can’t test the over identification restriction. However, ps is
significant in the reduced form equation, which means that it is not a weak instrument, although the sign
doesn’t look right to me.
Summary
The usual development of a research project is that the analyst specifies an equation of interest where the
dependent variable is a function of the target variable or variables and a list of control variables. Before
estimating with ordinary least squares, the analyst should think about possible reverse causation with
respect to each right hand side variable in turn. If there is a good argument for simultaneity, the analyst
should specify another equation, or at least a list of possible instruments, for each of the potentially
153
endogenous variables. Sometimes this process is shortened by the review of the literature where previous
analysts may have specified simultaneous equation models.
Given that there may be simultaneity, what is the best way to deal with it? We know that OLS is biased,
inconsistent, but relatively efficient compared to instrumental variables (because changing the list of
instruments will cause the estimates to change, creating additional variance of the estimates). On the other
hand, instrumental variables is consistent, but still biased (presumably less biased than OLS), and
inefficient relative to OLS. So we are swapping consistency for efficiency when we use two stage least
squares. Is this a good deal for analysts? The consensus is that OLS applied to an equation that has an
endogenous regressor results in substantial bias, so much so that the loss of efficiency is a small price to
pay.
There are a couple of ways to avoid this tradeoff. The first is to estimate the reduced form equation only.
We know that OLS applied to the reduced form equations results in estimates that are unbiased, consistent,
and efficient. For many policy analyses, the target variable is an exogenous policy variable. For example,
suppose we are interested in the effect of three strikes laws on crime. The crime equation will be a function
of arrests, among other things, a number of control variables, and the target variable, a dummy variable that
is one for states with threestrikes laws. The arrest variable is likely to be endogenous. We could specify an
arrest equation, with the attendant data collection to get the additional variables, and employ two or three
stage least squares, which are relatively inefficient. Alternatively, we can estimate the reduced form
equation, derived by simply dropping the arrest variable. The resulting estimated coefficient on the three
strikes variable is unbiased, consistent, and efficient. Of course, if arrests are not simultaneous, these
estimates are biased by omitted variables. The safest thing to do is to estimate the original crime equation in
OLS and the reduced form version, dropping arrests, and see if the resulting estimated coefficient on the
target variable are much different. If the results are the same in either case, then we can be confident we
have a good estimate of the effect of threestrikes laws on crime.
Another way around the problem is available if the data are time series. Suppose that there is a significant
lag between the time a crime is committed and the resulting prison incarceration of the guilty party. Then
prison is not simultaneously determined with crime. Similarly, it usually takes some time to get a law, like
the threestrikes law, passed and implemented. In that case, crime this year is a function of the law, but the
law is a function of crime in previous years, hence not simultaneous. Finally, if a variable Y is known to be
a function of lagged X, then, since time cannot go backwards, the two variables cannot be simultaneous.
So, lagging the potentially endogenous variable allows us to sidestep the specification of a multiequation
mode. I have seen large simultaneous macroeconomic models estimated on quarterly data avoid the use of
two stage least squares by simply lagging the endogenous variables, so that consumption this quarter is a
function of income in the previous quarter.These arguments require that the residuals be independent, that
is, not autocorrelated. We deal with autocorrelation below.
Should we use three stage least squares? Generally, I recommend against it. Usually, the other equations
are only there to generate consistent estimates of a target parameter in the equation of interest. I don’t want
specification errors in those equations causing inconsistency in the equation of interest after I have gone to
all the trouble to specify a multiequation model.
Finally, be sure to report the results of the HausmanWu test, the test for over identifying restrictions, and
the Bound, Jaeger, Baker F test for weak instruments whenever you estimate a simultaneous equation
model, assuming your model is not just identified.
154
13 TIME SERIES MODELS
Time series data are different from cross section data in that it has an additional source of information. In a
cross section it doesn’t matter how the data are arranged, The fifty states could be listed alphabetically, or
by region, or any number of ways. However, a time series must be arrayed in order, 1998 before 1999, etc.
The fact that time’s arrow only flies in one direction is extremely useful. It means, for example, that
something that happened in 1999 cannot cause something that happened in 1998. The reverse, of course, is
quite possible. This raises the possibility of using lags as a source of information, something that is not
possible in cross sections.
Time series models should take dynamics into account: lags, momentum, reaction time, etc. This means
that time series data have all the problems of cross section data, plus some more. Nevertheless, time series
data have a unique advantage because of time’s arrow. Simultaneity is a serious problem for cross section
data, but lags can help a time series analysis avoid the pitfalls of simultaneous equations. This is a good
reason to prefer time series data.
An interesting question arises with respect to time series data, namely, can we treat a time series, say GDP
from 1950 to 2000 as a random sample? It seems a little silly to argue that GDP in 1999 was a random
draw from a normal population of potential GDP’s ranging from positive to negative infinity. Yet we can
treat any tme series as a random sample if we condition on history. The value of GDP in 1999 was almost
certainly a function of GDP in 1998, 1997, etc. Let’s assume that GDP today is a function of a bunch of
past values of GDP,
0 1 1
0 1 1
so that the error term is
t t p t p t
t t t p t p
Y a a Y a Y
Y a a Y a Y
ε
ε
− −
− −
= + + + +
= − − − −
So the error term is simply the difference from what the value GDP today is compared to what it should
have been if the true relationship had held exactly. So a time series regression of the form
0 1 1 t t p t p t
Y a a Y a Y ε
− −
= + + + +
has a truly random error term.
Linear dynamic models
ADL model
The mother of all time series models is the autoregressivedistributed lag, ADL, model. It has the form,
assuming one dependent variable (Y) and one independent variable (X),
155
0 1 1 0 1 1 t t p t p t t p t p t
Y a a Y a Y b X b X b X ε
− − − −
= + + + + + + + +
Lag operator
Just for convenience, we will resort to a shorthand notation that is particularly useful for dynamic models.
Using the lag operator, we can write the ADL model very conveniently and compactly as
2
0 1 2
0
2
0 1 2
0
1
2
1 2
3 2
( ) ( )
where
( ) and ( ) are "lag polynomials."
( )
( )
The "lag operator" works like this.
( ) ( )
(
t t t
p
p i
p i
i
k
k j
k j
j
t t
t t t t
t
a L Y b L X
a L b L
a L a a L a L a L a L
b L b b L b L b L b L
LX X
L X L LX L X X
L X L L X
ε
=
=
−
− −
= +
= + + + + =
= + + + + =
=
= = =
=
¯
¯
2 3
1
) ( )
A first difference can be written as
(1 )
t t t
k
t t k
t t t t
L X X
L X X
X L X X X
− −
−
−
= =
=
∆ = − = −
So, using lag notation, the ADL(p,p) model
0 1 1 0 1 1 t t p t p t t p t p t
Y a a Y a Y b X b X b X ε
− − − −
= + + + + + + + +
can be written as
( ) ( )
0 1 1 0 1 1
0 1 0 1
1
( ) ( )
t t p t p t t p t p t
p p
p t p t t
t t
Y a a Y a Y b X b X b X
a a L a L Y b b L b L X
a L Y b L X
ε
ε
− − − −
− − − − = + + + +
− − − − = + + + +
=
Mathematically, the ADL model is a difference equation, the discrete analog of a differential equation. For
the model to be stable (so that it does not explode to positive or negative infinity), the roots of the lag
polynomial must sum to a number less than one in absolute value. Since we do not observe explosive
behavior in economic systems, we will have to make sure all our dynamic models are stable. So, for a
model like the following ADL(p,p) model to be stable
0 1 1 0 1 1 t t p t p t t p t p t
Y a a Y a Y b X b X b X ε
− − − −
= + + + + + + + +
we require that
1 2
1
p
a a a + + + <
The coefficients on the X’s, and the values of the X’s, are not constrained because the X’s are the inputs to
the difference equation and do not depend on their own past values.
All dynamic models can be derived from the ADL model. Let’s start with the simple ADL(1,1) model.
156
0 1 1 0 1 1 t t t t t
Y a a Y b X b X ε
− −
= + + + +
Static model
The static model (no dynamics) can be derived from this model by setting
1 1
0 a b = =
0 0 t t t
Y a b X ε = + +
AR model
The univariate autoregressive (AR) model is
0 1 1 0 1
( 0)
t t t
Y a a Y b b ε
−
= + + = =
Random walk model
Note that, if
1 0
1
1
1 and 0 then Y is a "random walk"
Y is whatever it was in the previous period, plus a random error term.
So, the movements of Y are completely random.
t t t
t t t t
a a
Y Y
Y Y Y
ε
ε
−
−
= =
= +
− = ∆ =
The random walk model describes the movement of a drop of water across a hot skillet. Also, because
(1 )
(1 ) ( )
t t
t t t
Y L Y
L Y a L Y ε
∆ = −
− = =
Y is said to have a unit root because the root of the lag polynomial is equal to one.
0 t 0 t1 t t t1 0
t 0
0
If 0, then Y Y + and Y Y , so
Y
This is a random walk, "with drift," because if is positive, Y drifts up, if it is negative, Y drifts down.
t
t
a a a
a
a
ε ε
ε
≠ = + − = +
∆ = +
First difference model
The first difference model can be derived by setting
1 0 1
1, a b b = = −
0 1 1 0 1 1
0 1 0 0 1
1 0 0 1
0 0
0
( )
Note that there is "drift" or trend if 0.
t t t t t
t t t t t
t t t t t
t t t
Y a a Y b X b X
Y a Y b X b X
Y Y a b X X
Y a b X
a
ε
ε
ε
ε
− −
− −
− −
= + + + +
= + + − +
− = + − +
∆ = + ∆ +
≠
157
Distributed lag model
The finite distributed lag model is
1
( 0) a =
0 0 1 1 t t t t
Y a b X b X ε
−
= + + +
Partial adjustment model
The partial adjustment model is (b
1
=0)
0 1 1 0 t t t t
Y a a Y b X ε
−
= + + +
Error correction model
The error correction model is derived as follows. Start by writing the ADL(1,1) model in deviations,
1 1 0 1 1
1
1 1 1 0 1 1 1
0 1
1 1 1 0 1 1 1 0 1 0 1
1 1 0
(1)
Subtract from both sides.
Add and subtract .
Collect terms.
(2) ( 1)
t t t t t
t
t t t t t t t
t
t t t t t t t t t
t t t
y a y b x b x
y
y y a y b x b x y
b x
y y a y b x b x y b x b x
y a y b x
ε
ε
ε
− −
−
− − − −
−
− − − − − −
−
= + + +
− = + + + −
− = + + + − + −
∆ = − + ∆
0 1 1
( )
t t
b b x ε
−
+ + +
1 1 0 1 1
1 1
1 0 1
1 0 1
0 1 0 1
1 1
Rewrite equation (1) as
In long run equilibrium, , , and 0.
(1 ) ( )
( ) ( )
where
(1 ) (1 )
In long run equilibriu
t t t t t
t t t t t
t t t t
t t
t t t
y a y b x b x
y y x x
y a y b x b x
y a b b x
b b b b
y x kx k
a a
ε
ε
− −
− = + +
= = =
− = +
− = +
+ +
= = =
− −
1
m, so
t t
y y =
0 1 0 1
1 1 1 1
1 1
1
1 1 0 1 1
0 1 1
1 1 1 1 0
( ) ( )
(1 ) ( 1)
Multiply both sides by ( 1) to get
( 1) ( )
Substituting for ( ) in (2) yields
( 1) ( 1)
Substitut
t t t t
t t
t
t t t t t
b b b b
y x x kx
a a
a
a y b b x
b b x
y a y a y b x ε
− − − −
− −
−
− −
+ +
= = − = −
− −
−
− − = +
+
∆ = − − − + ∆ +
1
0 1
1 1 1 1 0
1
ing for 
( )
( 1) ( 1)
( 1)
t
t t t t t
y
b b
y a y a x b x
a
ε
−
− −
+
∆ = − − − + ∆ +
−
158
( )
( )
1
0 1
0 1 1 1
1
0 1 1 1
0 1
1
Collect terms in ( 1)
( )
( 1)
( 1)
Or,
(3) ( 1)
Sometimes written as,
(3 ) ( 1)
t t t t t
t t t t t
t t t
t
a
b b
y b x a y x
a
y b x a y kx
a y b x a y kx
ε
ε
ε
− −
− −
−
−
  +
∆ = ∆ + − − +


−
\ ¹
∆ = ∆ + − − +
∆ = ∆ + − − +
The term (y
t1
kx
t1
) is the difference between the equilibrium value of y (kx) and the actual value of y in
the previous period. The coefficient k is the long run response of y to a change in x. The coefficient
1
1 a − is
the speed of adjustment of y to errors in the form of deviations from the long run equilibrium. Together, the
coefficient and the error are called the error correction mechanism. The coefficient þ
0
is the impact on y of
a change in x.
CochraneOrcutt model
Another commonly used time series model is the CochraneOrcutt or common factor model. It imposes
the restriction that
1 0 1
b b a = − .
0 1 1 0 1 1
0 1 1 0 0 1 1
1 1 0 0 1 1
1 0 0 1
( )
(1 ) (1 )
t t t t t
t t t t t
t t t t t
t t t
Y a a Y b X b X
Y a a Y b X b a X
Y a Y a b X a X
a L Y a b a L X
ε
ε
ε
ε
− −
− −
− −
= + + + +
= + + − +
− = + − +
− = + − +
Y and X are said to have common factors because both lag polynomials have the same root, a
1
. The terms
1 1 1 1
and ( )
t t t t
Y a Y X a X
− −
− − are called generalized first differences because they are differences, but the
coefficient on the lag polynomial is a
1
, instead of 1 which would be the coefficient for a first difference.
159
14 AUTOCORRELATION
We know that the GaussMarkov assumptions require that
2
where (0, )
t t t t
Y X u u iid α β σ = + + ∼ . The
first i in iid is independent. This actually refers to the sampling method and assumes a simple random
sample where the probability of observing u
t
is independent of observing u
t1
. When this assumption, and
the other assumptions, are true, ordinary least squares applied to this model is unbiased, consistent, and
efficient.
As we have seen above, it is frequently necessary in time series modeling to add lags. Adding a lag of X is
not a problem. However adding a lagged dependent variable changes this dramatically. In the model
2
1
, (0, )
t t t t t
Y X Y u u iid α β γ σ
−
= + + + ∼
the explanatory variables, which now include
1 t
Y
−
, cannot be assumed to be a set of constants, fixed on
repeated sampling because, since Y is a random number, the lag of Y is also a random number. It turns out
that the best we can get out of ordinary least squares with a lagged dependent variable is that OLS is
consistent and asymptotically efficient. It is no longer unbiased and efficient.
To see this, assume a very simple AR model (no x) expressed in deviations.
( )
( )
1 1
1 1 1 1 1
1 1 2 2 2
1 1 1
1 1
1 1 1 2
1 1
1 1 1
ˆ
( )
ˆ
( )
Since it is likely that ( ) 0 at least in small samples, whi
t t t
t t t t t t t
t t t
t t t t
t t
t t t t t
y a y
y a y y y y
a a
y y y
y Cov y
E a a E a
Var y y
y a y Cov y
ε
ε ε
ε ε
ε ε
−
− − − −
− − −
− −
− −
− −
= +
¯ + ¯ ¯
= = = +
¯ ¯ ¯
  ¯
= + = +

¯
\ ¹
= − ≠
( )
1 1
1
ch means that
ˆ and OLS is biased. However, in large samples, as the sample size goes to infinity, we expect that
( ) 0
which implies that OLS is at least consistent.
t t
E a a
Cov y ε
−
≠
=
It is unfortunate that all we can hope for in most dynamic models is consistency, since most time series are
quite small relative to cross sections. We will just have to live with it.
If the errors are not independent, then they are said to be “autocorrelated” (i.e., correlated with themselves,
lagged) or “serially correlated.” The most common type of autocorrelation is socalled “first order” serial
correlation,
160
1
2
t
where 1 1 is the autocorrelation coefficient and (0, )
t t t
u u
iid
ρ ε
ρ ε σ
−
= +
≤ ≤
It is possible to have autocorrelation of any order, the most general form would be q’th order
autocorrelation,
1 1 2 2 t t t q t q t
u u u u ρ ρ ρ ε
− − −
= + + + +
Autocorrelation is also known as moving average errors.
Autocorrelation is caused primarily by omitting variables with time components. Sometimes we simply
don’t have all the variables we need and we have to leave out some variables that are unobtainable. The
impact of these variables goes into the error term. If these variables have time trends or cycles or other
timedependent behavior, then the error term will be autocorrelated. It turns out that autocorrelation is so
prevalent that we can expect some form of autocorrelation in any time series analysis.
Effect of autocorrelation on OLS estimates
The effect that autocorrelation has on OLS estimators depends on whether the model has a lagged
dependent variable or not. If there is no lagged dependent variable, then autocorrelation causes OLS to be
inefficient. However, there is a “second order bias” in the sense that OLS standard errors are
underestimated and the corresponding tratios are overestimated in the usual case of positive
autocorrelation. This means that our hypothesis tests can be wildly wrong.
7
If there is a lagged dependent variable, then OLS is biased, inconsistent, and inefficient, and the tratios are
overestimated. There is nothing good about OLS under these conditions. Unfortunately, as we have seen
above, most dynamic models use a lagged dependent variable, so we have to make sure we eliminate any
autocorrelation.
We can prove the inconsistency of OLS in the presence of autocorrelation and a lagged dependent variable
as follows. Recall our simple AR(1) model above.
1 1
1
t t t
t t t
y a y u
u u ρ ε
−
−
= +
= +
The ordinary least squares estimator of a is,
( )
1 1 1 1 1
1 1 2 2 2
1 1 1
ˆ
t t t t t t t
t t t
y a y u y y y u
a a
y y y
− − − −
− − −
¯ + ¯ ¯
= = = +
¯ ¯ ¯
The expected value is
( )
1 1
1 1 1 2
1 1
( )
ˆ
( )
t t t t
t t
y u Cov y u
E a a E a
Var y y
− −
− −
  ¯
= + = +

¯
\ ¹
The covariance between the explanatory variable and the error term is
7
In the case of negative autocorrelation, it is not clear whether the tratios will be overestimated or underestimated.
However, the tratios will be “wrong.”
161
[ ]
[ ]
1 1 1 2 1 1
2 2
1 1 2 1 2 1 1 1 1 2 1
( ) ( ) ( )( )
t t t t t t t t
t t t t t t t t t t
Cov y u E y u E a y u u
E a u y a y u u a E u y E u
ρ ε
ρ ε ρ ε ρ ρ
− − − − −
− − − − − − − −
= = + +
= + + + = +
[ ] [ ]
[ ] [ ]
2 1
2 2
1 1 2 1 1
2
1 2 1 1 2 1
2
1 2 1 1 2 1
Since ( ) ( ) 0 because is truly random. Now,
( ) ( ) ( ) and ( ) ( ) ( ) which implies
t t t t t
t t t t t t t t t
t t t t t
t t t t t
E y E u
E u E u Var u E u y E u y Cov y u
E u y a E u y E u
E u y a E u y E u
ε ε ε
ρ ρ
ρ ρ
− −
− − − − −
− − − − −
− − − − −
= =
= = = =
= +
− =
[ ]
2
1
1 2
1
1
1
1
( )
( )
1
t
t t
t
t t
E u
E u y
a
Var u
Cov y u
a
ρ
ρ
ρ
ρ
−
− −
−
=
−
=
−
So, if 0 ρ ≠ , the explanatory variable is correlated with the error term and OLS is biased. Since p is
constant, it doesn’t go away as the sample size goes to infinity. Therefore, OLS is inconsistent.
Since OLS is inefficient and has overestimated tratios in the best case and is biased, inconsistent,
inefficient, with overestimated tratios in the worse case, we should find out if we have autocorrelation or
not.
Testing for autocorrelation
The Durbin Watson test
The most famous test for autocorrelation is the DurbinWatson statistic. It has the formula
2
1
2
2
( )
where is the residual from the OLS regression.
Under the null hypothesis of no autocorrelation, we can derive the expected value as follows.
Note that, under the null
(0, ) whi
t t
t
t
t
e e
d e
e
u IN σ
−
¯ −
=
¯
2 2
1
2 2
1 1 1
2
1
ch implies that ( ) 0, ( ) , ( ) 0.
The empirical analog is that the residual, e, has the following properties.
( ) ( ) ( ) 0 and ( ) ( ) . Therefore,
( )
( )
(
t t t t
t t t t t t
t t
E u E u E u u
E e E e E e e E e E e
E e e
E d
E
σ
σ
−
− − −
−
= = =
= = = = =
¯ −
=
¯
2 2 2 2 2
1 1 1 1
2 2 2
2 2
2 2
1
2
2
1
( 2 ) ( ( ) 2 ( ) ( ))
) ( ) ( )
( )
2
where T is the sample size (in a time series t=1...T).
t t t t t t t t
t t t
T
t
T
t
E e e e e E e E e e E e
e E e E e
T T
T
σ σ
σ σ
σ
σ
− − − −
=
=
¯ − + ¯ − +
= =
¯ ¯
+
+
= = =
¯
¯
Now assume autocorrelation so that
1
( ) 0
t t
E e e
−
≠ . What happens to the expected value of the Durbin
Watson statistic? As a preliminary, let’s review the concept of a correlation coefficient.
162
2 2
1
But,
1
cov( , ) ( )
So,
cov( , )
x y
x y
xy
r
T S S
x y E xy xy
T
x y
r
σ σ
¯
=
= = ¯
=
In this case,
2 2 2
1
, and
t t x y
x e y e σ σ σ
−
= = = = which means that the autocorrelation coefficient is
1 1
2
2 2
cov( ) ( )
t t t t
e e E e e
ρ
σ
σ σ
− −
= =
and
2
1
( )
t t
E e e ρσ
−
=
The expected value of the Durbin Watson statistic is
2 2 2 2 2 2
1 1 1 1 1
2 2 2
2 2 2
2 2 2
1
2
2
1
( ) ( 2 ) ( ( ) 2 ( ) ( ))
( )
( ) ( ) ( )
( 2 )
2
2 2
t t t t t t t t t t
t t t
T
t
T
t
E e e E e e e e E e E e e E e
E d
E e E e E e
T T T
T
σ ρσ σ
σ ρσ σ
ρ
σ
σ
− − − − −
=
=
¯ − ¯ − + ¯ − +
= = =
¯ ¯ ¯
− +
− +
= = = −
¯
¯
Since p is a correlation coefficient it has to less than or equal to one in absolute value. The usual case in
economics is positive autocorrelation, so that p is positive. Clearly, as p approaches one, the Durbin
Watson statistic approaches zero. On the other hand, for negative autocorrelation, as p goes to negative one,
the DurbinWatson statistic goes to four. As we already know, when p=0, the DurbinWatson statistic is
equal to two.
The DurbinWatson test, despite its fame, has a number of drawbacks. The first is that the test is not exact
and there is an indeterminate range in the DW tables. For example, suppose you get a DW statistic of 1.35
and you want to know if you have autocorrelation. If you look up the critical value corresponding to a
sample size of 25 and one explanatory variable for the five percent significance level, you will find two
values, d
l
=1.29 and d
u
=1.45. If your value is below 1.29 you have significant autocorrelation, according to
DW. If your value is above 1.45, you cannot reject the null hypothesis of no autocorrelation. Since your
value falls between these two, what do you conclude? Answer: test fails! Now what are you going to do?
Actually, this is not a serious problem. Be conservative and simply ignore the lower value. That is, reject
the null hypothesis of no autocorrelation whenever the DW statistic is below d
U
.
8
The second problem is fatal. Whenever the model contains a lagged dependent variable, the DW test is
biased toward two. That is, when the test is most needed, it not only fails, it tells us we have no problem
when we could have a very serious problem. For this reason, we will need another test for autocorrelation.
The LM test for autocorrelation
8
If the DW statistic is negative, then the test statistic is 4DW.
163
Breusch and Godfrey independently developed an LM test for autocorrelation. Like all LM tests, we run
the regression and then save the residuals for further analysis. For example, consider the following model.
0 0
1
t t t
t t t
Y a b X u
u u ρ ε
−
= + +
= +
which implies that
0 0 1 t t t t
Y a b X u ρ ε
−
= + + +
It would be nice to simply estimate this model, get an estimate of p and test the null hypothesis that p=0
with a ttest. Unfortunately, we don’t observe the error term, so we can’t do it that way. However, we can
estimate the error term with the residuals from the OLS regression of Y on X and lag it to get an estimate of
1 t
u
−
We can then run the test equation,
0 0 1 t t t t
Y a b X e ρ ε
−
= + + +
where
1 t
e
−
is the lagged residual. We can test null hypothesis with either the tratio on the lagged residual
or, because this, like all LM tests, is a large sample test, we could use the Fratio for the null hypothesis that
the coefficient on the lagged residual is zero. The Fratio is distributed according to chisquare with one
degree of freedom. Both the DurbinWatson and BreuschGodfrey LM tests are automatic in Stata.
I have a data set on crime and prison population in Virginia from 1972 to 1999 (crimeva.dta). The first
thing we have to do is tell Stata that this is a time series and the time variable is year.
. tsset year
time variable: year, 1956 to 2005
Now let’s do a simple regression of the log of major crime on prison.
. regress lcrmaj prison
Source  SS df MS Number of obs = 28
+ F( 1, 26) = 83.59
Model  .570667328 1 .570667328 Prob > F = 0.0000
Residual  .177496135 26 .006826774 Rsquared = 0.7628
+ Adj Rsquared = 0.7536
Total  .748163462 27 .027709758 Root MSE = .08262

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .1345066 .0147116 9.14 0.000 .1647467 .1042665
_cons  9.663423 .0379523 254.62 0.000 9.585411 9.741435

First we want the DurbinWatson statistic. It is ok since we don’t have any lagged dependent variables.
. dwstat
DurbinWatson dstatistic( 2, 28) = .6915226
Now let’s get the BreuschGodfrey test statistic.
. bgodfrey
BreuschGodfrey LM test for autocorrelation

lags(p)  chi2 df Prob > chi2
+
1  10.184 1 0.0014

H0: no serial correlation
164
OK, the DW statistic is uncomfortably close to zero. (The dl is 1.32 for k=1 and T=27 at the five percent level. So we
reject the null hypothesis of no autocorrelation.) And the BG test is also highly significant. So we apparently have
serious autocorrelation. Just for fun, let’s do the LM test ourselves. First we have to save the residuals.
. predict e, resid
(22 missing values generated)
Now we add the lagged residuals to the regression.
. regress lcrmaj prison L.e
Source  SS df MS Number of obs = 27
+ F( 2, 24) = 78.55
Model  .64494128 2 .32247064 Prob > F = 0.0000
Residual  .098530434 24 .004105435 Rsquared = 0.8675
+ Adj Rsquared = 0.8564
Total  .743471714 26 .028595066 Root MSE = .06407

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .1451148 .0118322 12.26 0.000 .1695353 .1206944
e 
L1  .6516013 .1631639 3.99 0.001 .3148475 .9883552
_cons  9.689539 .0308821 313.76 0.000 9.625801 9.753277

The ttest on the lagged e is highly significant, as expected. What is the corresponding Fratio?
. test L.e
( 1) L.e = 0
F( 1, 24) = 15.95
Prob > F = 0.0005
It is certainly significant with the small sample correction. It must be incredibly significant asymptotically.
. scalar prob=1chi2(1, 15.95)
. scalar list prob
prob = .00006504
Wow. Now that’s significant.
We apparently have very serious autocorrelation. What should we do about it? We discuss possible cures
below.
Testing for higher order autocorrelation
The BreuschGodfrey test can be used to test for higher order autocorrelation. To test for second order
serial correlation, for example, use the lags option of the bgodfrey command.
. bgodfrey, lags(1 2)
BreuschGodfrey LM test for autocorrelation

lags(p)  chi2 df Prob > chi2
+
1  10.184 1 0.0014
165
2  14.100 2 0.0009

H0: no serial correlation
Hmmm, look’s bad. Let’s see if there is even higher order autocorrelation.
. bgodfrey, lags(1 2 3 4)
BreuschGodfrey LM test for autocorrelation

lags(p)  chi2 df Prob > chi2
+
1  10.184 1 0.0014
2  14.100 2 0.0009
3  14.412 3 0.0024
4  15.090 4 0.0045

H0: no serial correlation
It gets worse and worse. These data apparently have severe high order serial correlation. We could go on
adding lags to the test equation, but we’ll stop here.
Cure for autocorrelation
.
There are two approaches to fixing autocorrelation. The classic textbook solution is to use CochraneOrcutt
generalized first differences, also known as common factors.
The CochraneOrcutt method
Suppose we have the model above with first order autocorrelation.
0 0
1
t t t
t t t
Y a b X u
u u ρ ε
−
= + +
= +
Assume for the moment that we knew the value of the autocorrelation coefficient. Lag the first equation
and multiply by p.
1 0 0 1 1 t t t
Y a b X u ρ ρ ρ ρ
− − −
= + +
Now subtract this equation from the first equation.
1 0 0 0 0 1 1
1 0 0 1
1
(1 ) ( )
Since
t t t t t t
t t t t t
t t t
Y Y a a b X b X u u
Y Y a b X X
u u
ρ ρ ρ ρ
ρ ρ ρ ε
ρ ε
− − −
− −
−
− = − + − + −
− = − + − +
− =
The problem is that we don’ t know the value of p. In fact we are going to estimate
0 0
, , and a b ρ jointly.
Cochrane and Orcutt suggest an iterative method.
(1) Estimate the original regression
0 0 t t t
Y a b X u = + + with OLS and save the residuals, e
t
.
(2) Estimate the auxiliary regression
1 t t t
e e v ρ
−
= + with OLS, save the estimate of ˆ , . ρ ρ
(3) Transform the equation using generalized first differences using the estimate of p.
166
1 0 0 1
ˆ ˆ ˆ (1 ) ( )
t t t t t
Y Y a b X X ρ ρ ρ ε
− −
− = − + − +
(4) Estimate the generalized first difference regression with OLS, save the residuals.
(5) Go to (2).
Obviously, we need a stopping rule to avoid an infinite loop. The usual rule is to stop when successive
estimates of p are within a certain amount, say, .01.
The CochraneOrcutt method can be implemented in Stata using the prais command The prais command
uses the PraisWinston technique, which is a modification of the CochraneOrcutt method that preserves
the first observation so that you don’t get a missing value because of lagging the residual.
. prais lcrmaj prison, corc
Iteration 0: rho = 0.0000
Iteration 1: rho = 0.6353
Iteration 2: rho = 0.6448
Iteration 3: rho = 0.6462
Iteration 4: rho = 0.6465
Iteration 5: rho = 0.6465
Iteration 6: rho = 0.6465
Iteration 7: rho = 0.6465
CochraneOrcutt AR(1) regression  iterated estimates
Source  SS df MS Number of obs = 27
+ F( 1, 25) = 28.34
Model  .111808732 1 .111808732 Prob > F = 0.0000
Residual  .098637415 25 .003945497 Rsquared = 0.5313
+ Adj Rsquared = 0.5125
Total  .210446147 26 .008094083 Root MSE = .06281

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .1619739 .0304269 5.32 0.000 .2246394 .0993085
_cons  9.736981 .0864092 112.68 0.000 9.559018 9.914944
+
rho  .6465167

DurbinWatson statistic (original) 0.691523
DurbinWatson statistic (transformed) 1.311269
The CochraneOrcutt approach has improved things, but the DW statistic still indicates some
autocorrelation in the residuals.
Note that the CochraneOrcutt regression is also known as feasible generalized least squares or FGLS. We
have seen above that it is a common factor model, and can be derived from a more general ADL model. It
is therefore true only if the restrictions of the ADL are true. We can test these restrictions to see if the CO
model is a legitimate reduction from the ADL.
Start with the CO model.
1 1
* ( )
t t t t t
Y Y a b X X ρ ρ ε
− −
− = + − +
where
0
* (1 ) a a ρ = − .
Rewrite the equation as
1 1
(A) *
t t t t t
Y a Y bX bX ρ ρ ε
− −
= + + − +
Now for something completely different, namely the ADL model,
167
0 1 1 0 1 1
(B)
t t t t t
Y a a Y b X b X ε
− −
= + + − +
Model (A) is actually model (B) with
1 0 1
b b a = − since
1 1 0
and a b b ρ ρ = = . The CochraneOrcutt model
(A) is true only if
1 0 1 1 0 1
or 0 b b a b b a = − + = .This is a testable hypothesis.
Because the restriction multiplies two parameters, we can’t use a simple Ftest. Instead we use a likelihood
ratio test. The likelihood ratio is a function of the residual sum of squares. So, we run the CochraneOrcutt
model (using the ,corc option because we don’t want to keep the first observation) and save the residual
sum of squares, ESS(A). Then we run the ADL model and keep its residual sum of squares, ESS(B). If the
CochraneOrcutt model is true, then ESS(A)=ESS(B). This implies that the ratio, ESS(B)/ESS(A)=1 and
the log of this ratio, ln(ESS(A)/ESS(B))=0. This is the likelihood ratio. It turns out that
2
( )
ln ~
( )
q
ESS B
L T
ESS A
χ
=
where q=K(B)K(A), K(B) is the number of parameters in model (A) and K(B) is the number of parameters
in model (B). If this statistic is not significant, then the CochraneOrcutt model is true. Otherwise, you
should use the ADL model.
We have already estimated the CochranOrcutt model (model A). The residual sum of squares is .0986, but
we can get Stata to remember this by using scalars.
. scalar ESSA=e(rss)
Now let’s estimate the ADL model (model B).
. regress lcrmaj prison L.lcrmaj L.prison
Source  SS df MS Number of obs = 27
+ F( 3, 23) = 50.18
Model  .64494144 3 .21498048 Prob > F = 0.0000
Residual  .098530274 23 .004283925 Rsquared = 0.8675
+ Adj Rsquared = 0.8502
Total  .743471714 26 .028595066 Root MSE = .06545

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .1457706 .1082056 1.35 0.191 .369611 .0780698
lcrmaj 
L1  .6515369 .1670076 3.90 0.001 .3060554 .9970184
prison 
L1  .0883118 .1116564 0.79 0.437 .1426671 .3192906
_cons  3.393491 1.611994 2.11 0.046 .0588285 6.728154

Now do the likelihood ratio test.
. scalar ESSA=e(rss)
. scalar L=27*log(ESSB/ESSA)
.. scalar list
L = .02934373
ESSA = .09863742
ESSB = .09853027
. scalar prob=1chi2(2,.029)
168
. scalar list prob
prob = 1
Since ESSA and ESSB are virtually identical, the test statistic is not significant and CochraneOrcutt is
justified in this case.
Curing autocorrelation with lags
Despite the fact that CochraneOrcutt appears to be justified in the above example, I do not recommend
using it. The problem is that it is a purely mechanical solution. It has nothing to do with economics. As we
noted above, the most common cause of autocorrelation is omitted time varying variables. It is much more
satisfying to specify the problem allowing for lags, momentum, reaction times, etc., that are very likely to
be causing dynamic behavior than using a simple statistical fix. Besides, in the above example, the
CochraneOrcutt method did not solve the problem. The DurbinWatson statistic still showed some
autocorrelation even after FGLS. The reason, almost certainly, is that the model omits too many variables,
all of which are likely to be timevarying. For these reasons, most modern time series analysts do not
recommend CochraneOrcutt. Instead they recommend adding variables and lags until the autocorrelation is
eliminated.
We have seen above that the ADL model is a generalization of the CochraneOrcutt model. Since using
CochraneOrcutt involves restricting the ADL to have a common factor, a restriction that may or may not
be true, why not simply avoid the problem by estimating the ADL directly? Besides, the ADL model allows
us to model the dynamic behavior directly.
So, I recommend adding lags, especially lags of the dependent variable until the LM test indicates no
significant autocorrelation. Use the ten percent significance level when using this, or any other,
specification test. For example, adding two lags of crime to the static crime equation eliminates the
autocorrelation.
. regress lcrmaj prison L.lcrmaj LL.lcrmaj
Source  SS df MS Number of obs = 28
+ F( 3, 24) = 59.71
Model  .659768412 3 .219922804 Prob > F = 0.0000
Residual  .08839505 24 .003683127 Rsquared = 0.8819
+ Adj Rsquared = 0.8671
Total  .748163462 27 .027709758 Root MSE = .06069

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .0635699 .0208728 3.05 0.006 .1066493 .0204905
lcrmaj 
L1  .9444664 .2020885 4.67 0.000 .5273762 1.361557
L2  .3882994 .1902221 2.04 0.052 .7808986 .0042997
_cons  4.293413 1.520191 2.82 0.009 1.155892 7.430934

. bgodfrey
BreuschGodfrey LM test for autocorrelation

lags(p)  chi2 df Prob > chi2
+
1  0.747 1 0.3875
169

H0: no serial correlation
. bgodfrey, lags (1 2 3 4)
BreuschGodfrey LM test for autocorrelation

lags(p)  chi2 df Prob > chi2
+
1  0.747 1 0.3875
2  2.220 2 0.3296
3  2.284 3 0.5156
4  6.046 4 0.1957

H0: no serial correlation. regress lcrmaj prison
Heteroskedastic and autocorrelation consistent
standard errors
It is possible to generalize the concept of robust standard errors to include a correction for autocorrelation.
In the chapter on heteroskedasticity we derived the formula for heteroskedastic consistent standard errors.
( )
( )
1
2 2 2 2 2 2
ˆ
( )
1 1 2 2 1 2 1 2 1 3 1 3 1 4 1 4 2
2
.
Var E x u x u x u x x u u x x u u x x u u
T T
x
t
β = + + + + + + +
¯
In the absence of autocorrelation, ( ) 0 E u u
i j
= for i j ≠ (independent). The OLS estimator ignores these
terms and is unbiased. Note that the formula can be rewritten as follows.
( )
1 2
2 2 2 2
( ) ( ) ( )
1
1 2 ˆ
( )
2
( ) ( ) ( )
2
1 2 1 2 1 3 1 3 1 4 1 4
T
x Var u x Var u x Var u u
n T
Var
x x Cov u u x x Cov u u x x Cov u u
x
t
β
 
+ + +

=

 + + + +
\ ¹ ¯
If the covariances are positive (positive autocorrelation) and the x’s are positively correlated, which is the
usual case, then ignoring them will cause the standard error to be underestimated. If the covariances are
negative while the x’s are positively correlated, then ignoring them will cause the standard error to be
overestimated. If some are positive and some negative, the direction of bias of the standard error will be
unknown. Things are made more complicated if the observations on x are negatively correlated. The one
thing we do know is that if we ignore the covariances when they are nonzero, then ols estimates of the
standard errors in no longer unbiased. This is the famous second order bias of the standard errors under
autocorrelation. Typically, everything is positively correlated, the standard errors are underestimated and
the corresponding tratios are overestimated, making variables appear to be related when they are in fact
independent.
So, if we have autocorrelation, then ( ) 0
i j
E u u ≠ and we cannot ignore these terms. The problem is, if we
use all these terms, the standard errors are not consistent because, as the sample size goes to infinity, the
number of terms goes to infinity. Since each tem contains error, the amount of error goes to infinity and the
estimate is not consistent. We could use a small number of terms, but that doesn’t do it either, because it
ignores higher order autocorrelation. Newey and West suggested a formula that increased the number of
terms as the sample size increased, but at a much slower rate. The result is that the estimate is consistent
170
because the number of terms goes to infinity, but always with fewer than T terms. The formula relies on
something known as the Bartlett kernel, but that is beyond the scope of this manual.
We can make this formula operational by using the residuals from the OLS regression to estimate u. To get
heteroskedasticity and autocorrelation (HAC aka NeweyWest) consistent standard errors and tratios, use
the newey command.
. newey lcrmaj prison, lag(2)
Regression with NeweyWest standard errors Number of obs = 28
maximum lag: 2 F( 1, 26) = 60.80
Prob > F = 0.0000

 NeweyWest
lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .1345066 .0172496 7.80 0.000 .1699636 .0990496
_cons  9.663423 .0498593 193.81 0.000 9.560936 9.765911

Note that you must specify the number of lags in the newey command. We can also use NeweyWest HAC
standard errors in our ADL equation, even after autocorrelation has presumably been eliminated.
. newey lcrmaj prison L.lcrmaj LL.crmaj, lag(2)
Regression with NeweyWest standard errors Number of obs = 28
maximum lag: 2 F( 3, 24) = 66.77
Prob > F = 0.0000

 NeweyWest
lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .0608961 .0193394 3.15 0.004 .1008106 .0209816
lcrmaj 
L1  .9678325 .1822947 5.31 0.000 .5915948 1.34407
crmaj 
L2  .0000335 .0000103 3.25 0.003 .0000548 .0000123
_cons  .8272894 1.67426 0.49 0.626 2.628212 4.282791

Summary
David Hendry, one of the leading time series econometricians, suggests the following general to specific
methodogy. Start with a very general ADL model with lot’s of lags. A good rule of thumb is at least two
lags for annual data, five lags for quarterly data, and 13 lags for monthly data. Check the BreuschGodfrey
test to make sure there is no autocorrelation. If there is, add more lags. Once you have no autocorrelation,
you can begin reducing the model to a parsimonious, interpretable model. Delete insignificant variables.
Justify your deletions with ttests and Ftests. Make sure that dropping variables does not cause
autocorrelation. If it does, then keep the variable even if it is not significant. When you get down to a
parsimonious model with mostly significant variables, make sure that all of the omitted variables are not
significant with an overall Ftest. Make sure that there is no autocorrelation in the final version of the
model. Use NeweyWest heteroskedastic and autocorrelation robust tratios.
171
15 NONSTATIONARITY, UNIT
ROOTS, AND RANDOM WALKS
The GaussMarkov assumptions require that the error term is distributed as
2
(0, ) iid σ . We have seen how
we can avoid problems associated with nonindependence (autocorrelation) and nonidentical
(heteroskedasticity). However, we have maintained the assumption that the mean of the distribution is
constant (and equal to zero). In this section we confront the possibility that the error mean is not constant.
A time series is stationary if its probability distribution does not change over time. The probability of
observing any given y is conditional on all past values,
0 1 1
Pr(  , ,..., )
t t
y y y y
−
. A stationary process is one
where this distribution doesn’t change over time:
1
Pr( , ,..., ) Pr( ,..., )
t t t p t q t p q
y y y y y
+ + + + +
=
for any t, p, or q. If the distribution is constant over time, then its mean and variance are constant over
time. Finally, the covariances of the series must not be dependent on time, that is
( , ) ( , )
t t p t q t p q
Cov y y Cov y y
+ + + +
= .
If a series is stationary, we can infer its properties from examination of its histogram, its sample mean, its
sample variance, and its sample autocovariances. On the other hand, if a series is not stationary, then we
can’t do any of these things. If a time series is nonstationary then we cannot consider a sample of
observations over time as representative of the true distribution, because the truth at time t is different from
the truth at time t+p.
Consider a simple deterministic time trend. It is called deterministic because it can be forecast with
complete accuracy.
t
y t α δ = +
Is y stationary? The mean of y is
( )
t t
E y t µ α δ = = +
Clearly, the mean changes over time. Therefore, a linear time trend is not stationary.
Now consider a simple random walk.
172
2
1
, ~ (0, )
t t t t
y y iid ε ε σ
−
= +
For the sake of argument, let’s assume that y=0 when t=0. Therefore,
1 1
2 1 2
3 1 2 3
t 1 2
0
0
0
so that
y ...
t
y
y
y
ε
ε ε
ε ε ε
ε ε ε
= +
= + +
= + + +
= + + +
Thus, the variance of y
t
is
2 2 2 2 2
t 1 2 1 2
E(y ) ( ... ) ( ) ( ) ... ( )
t t
E E E E ε ε ε ε ε ε = + + + = + + +
because of the independence assumption. Also, because of the identical assumption,
2 2 2
1 2
...
t
σ σ σ = = =
which implies that
2 2
( ) ( )
t t
E Var t ε ε σ = = .
Obviously, the variance of y depends on time and y cannot be stationary. Also, note that the variance of y
goes to infinity as t goes to infinity. A random walk has potentially infinite variance.
A random walk is also known as a stochastic trend. Unlike the deterministic trend above, we cannot
forecast future values of y with certainty. Also unlike deterministic trends, stochastic trends can exhibit
booms and busts. Just because a random walk is increasing for several time periods does not mean that it
will necessarily continue to increase. Similarly, just because it is falling does not mean we can expect it to
continue to fall. Many economic time series appear to behave like stochastic trends. As a result most time
series analysts mean stochastic trend when they say trend.
As we saw in the section on linear dynamic models, a random walk can be written as
1
(1 )
t t t t t
y y y L y ε
−
− = ∆ = − =
The lag polynomial is (1L), so the root of the lag polynomial is one and y is said to have a unit root. Unit
root and stochastic trend are used interchangeably.
Random walks and ordinary least squares
If we estimate a regression where the independent variable is a random walk, then the usual t and Ftests
are invalid. The reason is that if the regressor is nonstationary, it has no stable distribution, so the standard
errors and tratios are not derived from a stable distribution. There is no distribution from which the
statistics can be derived. The problem is similar to autocorrelation, only worse, because the
autocorrelation coefficient is equal to one, extreme autocorrelation.
Granger and Newbold published an article in 1974 where they did some Monte Carlo exercises. They
generated two random walks completely independently of one another and then regressed one on the other.
Even though the true relationship between the two variables was zero, Granger and Newbold got tratios
around 4.5. They called this phenomenon “spurious regression.”
We can replicate their findings in Stata. In the crimeva.dta data set we first generate a random walk called
y1. Start it out at y1=0 in the first year, 1956.
. gen y1=0
This creates a series of zero from 1956 to 2005. Now create the random walk by making y1 equal to itself
lagged plus a random error term. We use the replace command to replace the zeroes. We have to start in
1957 so we have a previous value (zero) to start from. The function 5*invnorm(uniform()) produces a
random number from a normal distribution with a mean of zero and a standard deviation of 5.
173
. replace y1=L.y1+5*invnorm(uniform()) if year > 1956
(49 real changes made)
Repeat for y2.
. gen y2=0
. replace y2=L.y2+5*invnorm(uniform()) if year > 1956
(49 real changes made)
Obviously, y1 and y2 are independent of each other. So if we regressed y1 on y2 there should be no
significant relationship.
. regress y1 y2
Source  SS df MS Number of obs = 50
+ F( 1, 48) = 55.23
Model  9328.23038 1 9328.23038 Prob > F = 0.0000
Residual  8107.68518 48 168.910108 Rsquared = 0.5350
+ Adj Rsquared = 0.5253
Total  17435.9156 49 355.835011 Root MSE = 12.997

y1  Coef. Std. Err. t P>t [95% Conf. Interval]
+
y2  .9748288 .1311767 7.43 0.000 1.238577 .7110805
_cons  14.82044 3.009038 4.93 0.000 20.87052 8.770368
This is a spurious regression. It looks like y1 is negatively (and highly significantly) associated with y2
even though they are completely independent.
Therefore, to avoid spurious regressions, we need to know if our variables have unit roots.
Testing for unit roots
The most widely used test for a unit root is the Augmented DickeyFuller (ADF) test. The idea is very
simple. Suppose we start with a simple AR(1) model.
0 1 1 t t t
y a a y ε
−
= + +
Subtract
1 t
y
−
from both sides,
1 0 1 1 1 t t t t t
y y a a y y ε
− − −
− = + − +
or
0 1 1
( 1)
t t t
y a a y ε
−
∆ = + − +
This is the (nonaugmented) DickeyFuller test equation, known as a DF equation. Note that if the AR
model is stationary,
1
1 a < , in which case
1
1 0 a − < . On the other hand, if y has a unit root then
1 1
1 and 1 0 a a = − = . So we can test for a unit root using a ttest on the lagged y.
Since most time series are autocorrelated, and autocorrelation creates overestimated tratios, we should do
something to correct for autocorrelation. Consider the following AR(1) model with first order
autocorrelation.
1 1
2
1
, ~ (0, )
t t t
t t t t
y a y u
u u iid ρ ε ε σ
−
−
= +
= +
174
But the error term u
t
is also
1 1
1 1 1 2
1 1 1 2
t t t
t t t
t t t
u y a y
u y a y
u y a y ρ ρ ρ
−
− − −
− − −
= −
= −
= −
Therefore,
1 1
1 1 1 1 2
t t t
t t t t t
y a y u
y a y y a y ρ ρ ε
−
− − −
= +
= + − +
Subtract
1 t
y
−
from both sides to get
1 1 1 1 1 1 2
1 1
1 1 1 1 1 1 1 1 1 2
Add and subtract
t t t t t t t
t
t t t t t t t t
y y a y y y a y
a y
y a y y a y a y y a y
ρ ρ ε
ρ
ρ ρ ρ ρ ε
− − − − −
−
− − − − − −
− = − + − +
∆ = − + − + − +
Collect terms
1 1 1 1 1 2
1 1 1 1
1 1
( 1 ) ( )
( 1)(1 )
t t t t t
t t t t
t t t t
y a a y a y y
y a y a y
y y y
ρ ρ ρ ε
ρ ρ ε
γ δ ε
− − −
− −
− −
∆ = − + − + − +
∆ = − − + ∆ +
∆ = + ∆ +
If
1
1 a = (unit root) then v=0. We can test this hypothesis with a ttest. Note that the lagged difference has
eliminated the autocorrelation, generating an iid error term. If there is higher order autocorrelation, simply
add more lagged differences. All we are really doing is adding lags of the dependent variable to the test
equation. We know from our analysis of autocorrelation that this is a cure for autocorrelation. The lagged
differences are said to “augment” the DickeyFuller test.
The ADF in general form is
0 1 1 1 1 2 2
( 1) ...
t t t t p t p t
y a a y c y c y c y ε
− − − −
∆ = + − + ∆ + ∆ + + ∆ +
I keep saying that we test the hypothesis that
1
( 1) 0 a − = with the usual ttest. The ADF test equation is
estimated using OLS. The tratio is computed as usual, the estimated parameter divided by its standard
error. However, while the tratio is standard, its distribution is anything but standard. For reasons that are
explained in the last section of this chapter, the tratio in the DickeyFuller test equation has a nonstandard
distribution. Dickey and Fuller, and other researchers, have tabulated critical values of the DF tratio. These
critical values are derived from Monte Carlo exercises. The DF critical values are tabulated in many
econometrics books and are automatically available in Stata.
As an example, we can do our own mini Monte Carlo exercise using y
1
and y
2
, the random walks we
created above in the crimeva.dta data set. DickeyFuller unit root tests can be done with the dfuller
command. Since we know there is no autocorrelation, we will not add any lags.
. dfuller y1, lags(0)
DickeyFuller test for unit root Number of obs = 49
 Interpolated DickeyFuller 
Test 1% Critical 5% Critical 10% Critical
Statistic Value Value Value

Z(t) 1.761 3.587 2.933 2.601

* MacKinnon approximate pvalue for Z(t) = 0.4001
Note that the critical values are much higher (in absolute value) than the usual t distribution. We cannot
reject the null hypothesis of a unit root, as expected. We should always have an intercept in any real world
175
application, but we know that we did not use an intercept when we constructed y1, so let’s rerun the DF test
with no constant.
. dfuller y1, noconstant lags(0)
DickeyFuller test for unit root Number of obs = 49
 Interpolated DickeyFuller 
Test 1% Critical 5% Critical 10% Critical
Statistic Value Value Value

Z(t) 0.143 2.622 1.950 1.610
The tratio is approximately zero, as expected. Note the critical values. They are much higher than the
corresponding criticial values for the standard tdistribution.
So far we have tested for a unit root where the alternative hypothesis is that the variable is stationary
around a constant value. If a variable has a distinct trend, and many economic variables, like GDP, have a
distinct upward trend, then we need to be able to test for a unit root against the alternative that y is a
stationary deterministic trend variable. To do this we simply add a deterministic trend term to the ADF test
equation.
0 1 1 1 1 2 2
( 1) ...
t t t t p t p t
y a a y t c y c y c y δ ε
− − − −
∆ = + − + + ∆ + ∆ + + ∆ +
The tratio on
1
( 1) a − is again the test statistic. However, the critical values are different for test equations
with deterministic trends. Fortunately, Stata knows all this and adjusts accordingly. Since it makes a
difference whether the variable we are testing for a unit root has a trend or not, it is a good idea to graph the
variable and look at it. If it has a distinct trend, include a trend in the test equation. For example, we might
want to test the prison population (in the crimeva.dta data set) for stationarity. The first thing we do is
graph it.
This variable has a distinct trend. Is it a deterministic trend or a stationary trend? Let’s test for a unit root
using an ADF test equation with a deterministic trend. We choose two lags to start with and add the regress
option to see the test equation.
. dfuller prison, lags(4) regress trend
Augmented DickeyFuller test for unit root Number of obs = 23
176
 Interpolated DickeyFuller 
Test 1% Critical 5% Critical 10% Critical
Statistic Value Value Value

Z(t) 1.328 4.380 3.600 3.240

MacKinnon approximate pvalue for Z(t) = 0.8809

D.prison  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison 
L1  .1454225 .1095418 1.33 0.203 .3776407 .0867956
LD  .9004267 .2221141 4.05 0.001 .429566 1.371288
L2D  .4960933 .3426802 1.45 0.167 1.222543 .2303562
L3D  .3033914 .2941294 1.03 0.318 .3201351 .9269179
L4D  .1248609 .3584488 0.35 0.732 .8847383 .6350165
_trend  .0214251 .0127737 1.68 0.113 .0056539 .048504
_cons  .0723614 .0643962 1.12 0.278 .0641524 .2088751

How many lags should we include? The rule of thumb is at least two for annual data, five for quarterly
data, etc. Since these data are annual, we start with two. It is of some importance to get the number of lags
right because too few lags causes autocorrelation, which means that our OLS estimates are biased and
inconsistent, with overestimated tratios. Too many lags means that our estimates are inefficient with
underestimated tratios. Which is worse? Monte Carlo studies have shown that including too few lags is
much worse. There are three approaches to this problem. The first is to use ttests and Ftests on the lags to
decide whether to drop them or not. The second is to use model selection criteria. The third approach is to
do it all possible ways and see if we get the same answer. The problem with the third approach is that, if we
get different answers, we don’t know which to believe.
Choosing the number of lags with F tests
I recommend using Ftests to choose the lag length. Look at the tratios on the lags and drop the
insignificant lags (use a .15 significance level to be sure not to drop too many lags). The t and Fratios on
the lags are standard, so use the standard critical values provided by Stata. It appears that the last three lags
are not significant. Let’s test this hypothesis with an Ftest.
. test L4D.prison L3D.prison L2D.prison
( 1) L4D.prison = 0
( 2) L3D.prison = 0
( 3) L2D.prison = 0
F( 3, 16) = 0.86
Prob > F = 0.4835
So, the last three lags are indeed not significant and we settle on one lag as the correct model.
. dfuller prison, lags(1) regress trend
Augmented DickeyFuller test for unit root Number of obs = 26
 Interpolated DickeyFuller 
Test 1% Critical 5% Critical 10% Critical
177
Statistic Value Value Value

Z(t) 2.352 4.371 3.596 3.238

MacKinnon approximate pvalue for Z(t) = 0.4055

D.prison  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison 
L1  .1608155 .0683769 2.35 0.028 .3026204 .0190106
LD  .6308502 .15634 4.04 0.001 .3066209 .9550795
_trend  .0211228 .0091946 2.30 0.031 .0020544 .0401913
_cons  .1145497 .0493077 2.32 0.030 .0122918 .2168075

Note that the lag length apparently doesn’t matter because the DF test statistic is not significant in either
model. Prison population in Virginia appears to be a random walk.
In some cases it is not clear whether to use a trend or not. Here is the graph of the log of major crime in
Virginia.
Does it have a trend, two trends (one up, one down), or no trend? Let’s start with a trend and four lags.
. dfuller lcrmaj, lags(4) regress trend
Augmented DickeyFuller test for unit root Number of obs = 35
 Interpolated DickeyFuller 
Test 1% Critical 5% Critical 10% Critical
Statistic Value Value Value

Z(t) 1.832 4.288 3.560 3.216

MacKinnon approximate pvalue for Z(t) = 0.6889

D.lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lcrmaj 
L1  .1243068 .0678357 1.83 0.078 .2632618 .0146483
LD  .0308934 .1746865 0.18 0.861 .3887225 .3269358
178
L2D  .2231895 .171358 1.30 0.203 .5742004 .1278214
L3D  .1747988 .1704449 1.03 0.314 .5239392 .1743417
L4D  .0359288 .173253 0.21 0.837 .3908215 .3189639
_trend  .0053823 .0018615 2.89 0.007 .0091955 .0015691
_cons  1.282789 .640328 2.00 0.055 .0288636 2.594441

The trend is significant, so we retain it. Let’s test for the significance of the lags.
. test L4D.lcrmaj L3D.lcrmaj L2D.lcrmaj LD.lcrmaj
( 1) L4D.lcrmaj = 0
( 2) L3D.lcrmaj = 0
( 3) L2D.lcrmaj = 0
( 4) LD.lcrmaj = 0
F( 4, 28) = 0.72
Prob > F = 0.5879
None of the lags are significant, so we drop them all and run the test again.
. dfuller lcrmaj, lags(0) regress trend
DickeyFuller test for unit root Number of obs = 39
 Interpolated DickeyFuller 
Test 1% Critical 5% Critical 10% Critical
Statistic Value Value Value

Z(t) 1.314 4.251 3.544 3.206

MacKinnon approximate pvalue for Z(t) = 0.8845

D.lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lcrmaj 
L1  .0620711 .0472515 1.31 0.197 .1579016 .0337594
_trend  .0030963 .0010268 3.02 0.005 .0051789 .0010138
_cons  .6447613 .4310014 1.50 0.143 .2293501 1.518873

Again, the number of lags does not make any real difference. The ADF test statistics are all insignificant,
indicating that crime in Virginia appears to be a random walk.
Digression: model selection criteria
One way to think of the problem is to hypothesize that the correct model should have the highest Rsquare
and the lowest error variance compared to the alternative models. Recall that the Rsquare measures the
variation explained by the regression to the total variation of the dependent variable. The problem with this
approach is that the Rsquare can be maximized by simply adding more independent variables.
One way around this problem is to correct the Rsquare for the number of explanatory variables used in the
model. The adjusted Rsquare also known as the
2
R (Rbarsquare) is the ratio of the variance (instead of
variation) explained by the regression to the total variance.
2 2
( ) 1
1 1 (1 )
( )
Var e N
R R
Var y T k
−
= − = − −
−
179
where T is the sample size and k is the number of independent variables. So, as we add independent
variables, R
2
increases but so does Nk. Consequently,
2
R can go up or down. One rule might be to choose
the number of lags such that
2
R is a maximum.
The Akaike information criterion (AIC) has been proposed as an improvement over the
2
R on the theory
that the
2
R does not penalize heavily enough for adding explanatory variables. The formula is
2
2
log
i
e
k
AIC
T T
 
= + 

\ ¹
¯
The goal is to choose the number of explanatory variables (lags in the context of a unit root test) that
minimizes the AIC.
The Bayesian Information Criterion (also known as the Schwartz Criterion),
2
log( )
log
i
e
T
BIC k
T T
 
= + 

\ ¹
¯
The BIC penalizes additional explanatory variables even more heavily than the AIC. Again, the goal is to
minimize the BIC.
Note that these information criteria are not formal statistical tests, however they have been shown to be
useful in the determination of the lag length in unit root models. I usually use the BIC, but usually they
both give the same answer.
Choosing lags using model selection criteria
Let’s adopt the information criterion approach. If your version of Stata has the fitstat command, invoke
fitstat after the dfuller command. If your version does not have fitstat, you can install it by doing a keyword
search for fitstat and clicking on “click here to install.” Fitstat generates a variety of goodness of fit tests,
including the BIC. Then run the test several times with different lags and check the information criterion.
The minimum (biggest negative) is the one to go with.
. dfuller prison, lags(2) trend
Augmented DickeyFuller test for unit root Number of obs = 25
 Interpolated DickeyFuller 
Test 1% Critical 5% Critical 10% Critical
Statistic Value Value Value

Z(t) 1.528 4.380 3.600 3.240

* MacKinnon approximate pvalue for Z(t) = 0.8182
. fitstat
Measures of Fit for regress of D.prison
LogLik Intercept Only: 18.085 LogLik Full Model: 27.328
D(20): 54.656 LR(4): 18.486
Prob > LR: 0.001
R2: 0.523 Adjusted R2: 0.427
AIC: 1.786 AIC*n: 44.656
BIC: 119.034 BIC': 5.610
180
. dfuller prison, lags(1) trend
Augmented DickeyFuller test for unit root Number of obs = 26
 Interpolated DickeyFuller 
Test 1% Critical 5% Critical 10% Critical
Statistic Value Value Value

Z(t) 2.352 4.371 3.596 3.238

* MacKinnon approximate pvalue for Z(t) = 0.4069
. fitstat
Measures of Fit for regress of D.prison
LogLik Intercept Only: 18.727 LogLik Full Model: 27.580
D(22): 55.160 LR(3): 17.706
Prob > LR: 0.001
R2: 0.494 Adjusted R2: 0.425
AIC: 1.814 AIC*n: 47.160
BIC: 126.838 BIC': 7.931
. dfuller prison, lags(0) trend
DickeyFuller test for unit root Number of obs = 27
 Interpolated DickeyFuller 
Test 1% Critical 5% Critical 10% Critical
Statistic Value Value Value

Z(t) 1.647 4.362 3.592 3.235

* MacKinnon approximate pvalue for Z(t) = 0.7726
. fitstat
Measures of Fit for regress of D.prison
LogLik Intercept Only: 19.427 LogLik Full Model: 21.654
D(24): 43.308 LR(2): 4.454
Prob > LR: 0.108
R2: 0.152 Adjusted R2: 0.081
AIC: 1.382 AIC*n: 37.308
BIC: 122.408 BIC': 2.138
So we go with one lag. In this case it doesn’t make any difference, so we are pretty confident that prison is
a stochastic trend.
181
DFGLS test
A soupedup version of the ADF which is somewhat more powerful has been developed by Elliott,
Rothenberg, and Stock (ERS, 1996).
9
ERS use a two step method where the data are detrended using
generalized least squares, GLS. The resulting detrended series is used in a standard ADF test.
The basic idea behind the test can be seen as follows. Suppose the null hypothesis is
0 1
H : or where (0)
t t t t t t
Y Y Y I δ ε δ ε ε
−
= + + ∆ = +
The alternative hypothesis is
1
H : or where (0)
t t t t t
Y t Y t I α β ε α β ε ε = + + − − =
If the null hypothesis is true but we assume the alternative is true and detrend,
1
( )
t t t
Y t Y t α β δ α β ε
−
− − = + − − +
Y is still a random walk.
So, the idea is to detrend Y and then test the resulting series for a unit root. If we reject the null hypothesis
of a unit root, then the series is stationary. If we cannot reject the null hypothesis of a unit root, then Y is a
nonstationary random walk.
What about autocorrelation? ERS suggest correcting for autocorrelation using a generalized least squares
procedure, similar to CochraneOrcutt. The procedure is done in two steps. In the first step we transform to
generalized first differences.
1
*
t t t
Y Y Y ρ
−
= −
where p=17/T if there is no obvious trend and p=113.5/T if there is a trend. The idea behind this choice of
p is that p should go to one as T goes to infinity. The values 7 and 13.5 were chosen with Monte Carlo
methods.
1
1
( 1)
( 1)
(1 ) ( ( 1))
t
t
t t
Y t
Y t
Y Y t t
t t
α β
ρ ρα ρβ
ρ α β ρα ρβ
α ρ β ρ
−
−
= +
= + −
− = + − + −
= − + − −
This equation is estimated with OLS.
We compute the detrended series,
9
Elliot, G., T. Rothenberg, and J. H. Stock. 1996. Efficient tests for an autoregressive unit root.
Econometrica 64: 813836.
182
ˆ
ˆ
d
t t
Y Y t α β = − −
if there is a trend and
ˆ
d
t t
Y Y α = −
if no trend.
Finally, we use the DickeyFuller test (not the augmented DF), with no intercept, to test
d
t
Y for a unit root.
The DFGLS test is available in Stata with the dfgls command. Let’s start with 4 lags just to make sure.
. dfgls lcrmaj, maxlag(4)
DFGLS for lcrmaj Number of obs = 35
DFGLS tau 1% Critical 5% Critical 10% Critical
[lags] Test Statistic Value Value Value

4 0.752 3.770 3.072 2.772
3 0.527 3.770 3.147 2.843
2 0.443 3.770 3.215 2.906
1 0.456 3.770 3.273 2.960
Opt Lag (NgPerron seq t) = 0 [use maxlag(0)]
Min SC = 5.028735 at lag 1 with RMSE .0730984
Min MAIC = 5.161388 at lag 1 with RMSE .0730984
What if we drop the trend?
. dfgls lcrmaj, maxlag(4) notrend
DFGLS for lcrmaj Number of obs = 35
DFGLS mu 1% Critical 5% Critical 10% Critical
[lags] Test Statistic Value Value Value

4 0.965 2.636 2.236 1.937
3 0.800 2.636 2.275 1.975
2 0.753 2.636 2.312 2.010
1 0.769 2.636 2.346 2.041
Opt Lag (NgPerron seq t) = 0 [use maxlag(0)]
Min SC = 5.056724 at lag 1 with RMSE .0720825
Min MAIC = 5.166232 at lag 1 with RMSE .0720825
Note that the SC is the Schwartz Criterion (BIC). The MAIC is the modified Akaike Information Criterion,
much like the BIC. Again, all tests appear to point to a unit root for the crime variable.
I recommend the DFGLS test for all your unit root testing needs.
Trend stationarity vs. difference stationarity
We know that a regression of one independent random walk on another is likely to result in a spurious
regression where the tratio is highly significant according to the usual critical values. So we need to know
if our variables are random walks or not. Unfortunately, the unit roots that have been developed so far are
183
not particularly powerful, that is, they cannot distinguish between unit roots and nearly unit roots that are
nevertheless stationary. For example, it would be difficult for an ADF test equation to tell the difference
between a
1
=1 and a
1
=.9, even though the former is a random walk and the latter is stationary. What happens
if we get it wrong?
First a bit of terminology: a stationary random variable is said to be “stationary of order zero,” of I(0). A
random walk that can be made stationary by taking first differences is said to be “integrated of order one,”
or I(1). Most economic time series are either I(0) or I(1). There is the odd time series that is I(2), which
means it has to be differenced twice to make it stationary. To avoid spurious regressions, we need the
variables in our models to be I(0). If the variable is a random walk, we take first differences, if it is already
I(0) it is said to be “stationary in levels,” and we can use it as is.
Consider the following table.
Truth Levels First differences
I(0) OK Negative autocorrelation
I(1) Spurious regression OK
Clearly, if we have a random walk and we take first differences, or we run a regression in levels with
stationary variables, we are OK. However, if we make a mistake and run a regression in levels with I(1)
variables we get a spurious regression, which is really bad. On the other hand, if we mess up and take first
differences of a stationary variable, it is still stationary. The error term is likely to have negative
autocorrelation, but that is a smaller problem. We can simply use NeweyWest HAC standard errors and t
ratios. Besides, most economic time series tend to have booms and busts and otherwise behave like
stochastic trends rather than deterministic trends. For all these reasons, most time series analysts assume a
variable is a stochastic trend unless convinced otherwise.
Why unit root tests have nonstandard
distributions
To see this, consider the nonaugmented DickeyFuller test equation with no intercept.
1 1
1
1
1 2
2
1
1
1
ˆ
1
1
Multiply both sides by T =
1
T
t t t
t t
t t
t
t
y a y
y y
y y
T
a
y
y
T
ε
−
−
−
−
−
∆ = +
¯∆
¯∆
= =
¯
¯
 




\ ¹
1
1
2
1 2
1
ˆ
1
t t
t
y y
T
Ta
y
T
−
−
¯∆
=
¯
Look at the numerator.
( )
2
2 2 2
1 1 1
2 ( )
t t t t t t t
y y y y y y y
− − −
¯ = ¯ + ∆ = ¯ + ¯ ∆ + ¯ ∆
Solve for the middle term.
184
( )
( )
2 2 2
1 1
2 2 2
1 1
2 2 2
1 1
2 ( )
1
( )
2
1 1 1
( )
2
t t t t t
t t t t t
t t t t t
y y y y y
y y y y y
y y y y y
T T
− −
− −
− −
¯ ∆ = ¯ − ¯ − ¯ ∆
¯ ∆ = ¯ − ¯ − ¯ ∆
¯ ∆ = ¯ − ¯ − ¯ ∆
The first two terms in the parentheses can be written as,
1
2 2 2
1 1
1
2 2 2
1 0
1 1
T T
t t T
t t
T T
t t
t t
y y y
y y y
−
= =
−
−
= =
 
= +

\ ¹
 
= +

\ ¹
¯ ¯
¯ ¯
Which means that the difference between these two terms is
1 1
2 2 2 2 2 2 2 2 2
1 0 0
1 1 1 1
0
Because we can conviently assume that 0.
T T T T
t t t T t T T
t t t t
y y y y y y y y y
y
− −
−
= = = =
   
− = + − + = − =
 
\ ¹ \ ¹
=
¯ ¯ ¯ ¯
Therefore,
( ) ( ) ( )
2
2 2 2
1
1 1 1 1 1
2 2
T
t t T t t
y
y y y y y
T T T T
−
 
 
¯ ∆ = − ¯ ∆ = − ¯ ∆ 


\ ¹
\ ¹
Under the null hypothesis,
t t
y ε ∆ = so that the second term in the parentheses is
( ) ( ) ( )
2 2 2
2
1 1 1
has the probability limit
t t t
y
T T T
ε ε σ ¯ ∆ = ¯ ¯ →
Under the assumption that
0
0 y = the first term is
1 1
T
t t
y
y
T T T
ε = ¯∆ = ¯
Since the sum of T normal variables is normal,
2
(0, )
T
y
N
T
σ →
Therefore,
( ) ( )
2
2 2 2
1
1
T
t
y
Z
T T
ε σ
 
− ¯ → −

\ ¹
where Z is a standard normal variable. We know that the square of a normal variable is a chisquare
variable with one degree of freedom. Therefore the numerator of the OLS estimator of the DickeyFuller
test equation is
( )
2
2
1 1
1
1
2
t t
y y
T
σ
χ
−
¯ ∆ → −
If y were stationary, the estimator would converge to a normal random variable. Because y is a random
walk the numerator converges to a chisquare variable. Unfortunately, the numerator is the good news. The
really bad news is the denominator. The sum of squares
2
1
1
t
y
T
−
¯ does not converge to a constant, but
remains a random variable even as T goes to infinity. So, the OLS estimator in the DickeyFuller test
equation converges to a ratio. The numerator is chisquare while the denominator is the distribution of a
random variable. The latter distribution is different for different dependent variables.
185
16 ANALYSIS OF
NONSTATIONARY DATA
Consider the following graph of the log of GDP and the log of consumption. The data are quarterly, from
19471 to 20041. I created the time trend t from 1 to 229 corresponding to each quarter. These data are
available in CYR.dta.
. gen t=_n
. tsset t
time variable: t, 1 to 229
Are these series random walks with drift (difference stationary stochastic trends) or are they deterministic
trends?.
The graph of the prime rate of interest is quite different.
186
Since the interest rate has to be between zero and one, we assume it cannot have a deterministic trend. It
does have the booms and busts that are characteristic of stochastic trends, so we suspect that it is a random
walk.
We apply unit root tests to these variables below.
. dfgls lny, maxlag(5)
DFGLS for lny Number of obs = 223
DFGLS tau 1% Critical 5% Critical 10% Critical
[lags] Test Statistic Value Value Value

5 1.719 3.480 2.893 2.607
4 1.892 3.480 2.901 2.614
3 2.137 3.480 2.908 2.620
2 2.408 3.480 2.914 2.626
1 2.165 3.480 2.921 2.632
Opt Lag (NgPerron seq t) = 1 with RMSE .0094098
Min SC = 9.283516 at lag 1 with RMSE .0094098
Min MAIC = 9.289291 at lag 5 with RMSE .0092561
. dfgls lnc, maxlag(5)
DFGLS for lnc Number of obs = 223
DFGLS tau 1% Critical 5% Critical 10% Critical
[lags] Test Statistic Value Value Value

5 2.042 3.480 2.893 2.607
4 2.084 3.480 2.901 2.614
3 2.487 3.480 2.908 2.620
2 2.482 3.480 2.914 2.626
1 1.818 3.480 2.921 2.632
Opt Lag (NgPerron seq t) = 4 with RMSE .0079511
Min SC = 9.572358 at lag 2 with RMSE .0080462
Min MAIC = 9.589425 at lag 4 with RMSE .0079511
. dfgls prime, maxlag(5)
DFGLS for prime Number of obs = 215
187
DFGLS tau 1% Critical 5% Critical 10% Critical
[lags] Test Statistic Value Value Value

5 2.737 3.480 2.895 2.609
4 2.179 3.480 2.903 2.616
3 2.069 3.480 2.910 2.623
2 1.768 3.480 2.917 2.629
1 1.974 3.480 2.924 2.635
Opt Lag (NgPerron seq t) = 5 with RMSE .7771611
Min SC = .390588 at lag 1 with RMSE .8022992
Min MAIC = .3949452 at lag 2 with RMSE .8005883
All of these series are apparently random walks. We can transform them to stationary series by taking first
differences. Here is the graph of the first differences of the log of income. First let’s summarize the first
differences of the logs of these variables.
. summarize D.lnc D.lny D.lnr
Variable  Obs Mean Std. Dev. Min Max
+
lnc 
D1  228 .008795 .0084876 .029891 .05018
lny 
D1  228 .0084169 .0100605 .0275517 .0402255
lnr 
D1  220 .0031507 .078973 .3407288 .365536
. twoway (tsline D.lny), yline(.0084)
Note that the series is centered at .0084 (the mean of the differenced series, which is the average rate of
growth, or “drift”). Note also that, if the series wanders too far from the mean, it tends to get drawn back to
the center quickly. This is a typical graph for a stationary series.
Once we transform to stationary variables, we can use the usual regression analyses and significance tests.
regress D.lnc LD.lnc D.lny D.lnr
Source  SS df MS Number of obs = 220
+ F( 3, 216) = 46.01
Model  .006260983 3 .002086994 Prob > F = 0.0000
Residual  .009798288 216 .000045362 Rsquared = 0.3899
+ Adj Rsquared = 0.3814
188
Total  .016059271 219 .00007333 Root MSE = .00674

D.lnc  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lnc 
LD  .1957726 .0578761 3.38 0.001 .3098469 .0816984
lny 
D1  .5741064 .049067 11.70 0.000 .4773951 .6708177
lnr 
D1  .0141193 .0059641 2.37 0.019 .0258746 .002364
_cons  .0057757 .0007042 8.20 0.000 .0043876 .0071637

The change in consumption is a function of lagged consumption, the change in income, and the change in
interest rates. The coefficient on income is positive and the coefficient on the interest rate is negative, as
expected. The negative coefficient on lagged consumption means that there is significant “reversion to the
mean,” that is, if consumption grows too rapidly in one quarter, it will tend to grow much less rapidly I the
next quarter, and vice versa. Also, in a first difference regression, the intercept term is a linear time trend.
Consumption appears to be growing exogenously at a rate of about onehalf percent per quarter.
The problem with a first difference regression is that it is purely short run. Changes in lnc are functions of
changes in lny. It doesn’t say anything about any long run relationship between consumption and income.
And economists typically don’t know anything about the short run. All our hypotheses are about equilibria,
which presumably occur in the long run. So, it would be very useful to be able estimate the long run
relationship between consumption, income, and the interest rate.
Cointegration
Engle and Granger have developed a method for determining if there is a long run relationship among a set
of nonstationary random walks. They noted that if two independent random walks were regressed on each
other, the result would simply be another random walk consisting of a linear combination of two individual
random walks. Imagine a couple of drunks who happen to exit Paul’s Deli at the same time. They are
walking randomly and we expect that they will wander off in different directions, so the distance between
them (the residual) will probably get bigger and bigger. So, for example, if x(t) and z(t) are independent
random walks, then a regression of y on x will yield z=a+bx+u, so that u=zabx is just a linear
combination of two random walks and therefore itself a random walk.
However, suppose that the two random walks are related to each other. Suppose y(t) is a random walk, but
is in fact a linear function of x(t), another random walk such that y=u +þx+u. The residual is u=y u þx.
Now, because x and y are related, there is an attraction between them, so that if u gets very large, there will
be natural tendency for x and y to adjust so that u becomes approximately zero.
Using our example of our two drunks exiting Paul’s Deli, suppose they have become best buddies after a
night of drinking, so they exit the deli arm in arm. Now, even though they are individually walking
randomly, they are linked in such a way that neither can wander too far from the other. If one does lurch,
then the other will also lurch, the distance between them remaining approximately an arm’s length.
Economists suspect that there is a consumption function such that consumption is a linear function of
income, usually a constant fraction. Therefore, even though both income and consumption are individually
random walks, we expect that they will not wander very far from each other. In fact the graph of the log of
consumption and the log of national income above shows that they stay very close to each other. If income
increases dramatically, but consumption doesn’t grow at all, the residual will become a large negative
number. Assuming that there really is a long run equilibrium relationship between consumption, then
consumption will tend to grow to its long run level consistent with the higher level of income. This implies
that the residual will tend to hover around the zero line. Large values of the residual will be drawn back to
the zero line. This is the behavior we associate with a stationary variable. So, if there is a long run
189
equilibrium relationship between two (or more) random walks, the residual will be stationary. If there is no
long run relationship between the random walks, the residual will be nonstationary.
Two random walks that are individually I(1) can nevertheless generate a stationary I(0) residual if there is a
long run relationship between them. Such a pair of random walks are said to be cointegrated.
So we can test for cointegration (i.e., a long run equilibrium relationship) by regressing consumption on
income and maybe a time trend, and then testing the resulting residual for a unit root. If it has a unit root,
then there is no long run relationship among these series. If we reject the null hypothesis of a unit root in
the residual, we accept the alternative hypothesis of cointegration.
We use the same old augmented DickeyFuller test to test for stationarity of the residual. This is the
“EngleGranger two step” test for cointegration. In the first step we estimate the long run equilibrium
relationship. This is just a static regression (no lags) in levels (no first differences) of the log of
consumption on the log of income, the log of the interest rate, and a time trend. We will drop the time trend
if it is not significant. We then save the residual and run an ADF unit root test on the residual. Because the
residual is centered on zero by construction and can have no time trend, we do not include an intercept or a
trend term in our ADF regression.
. regress lnc lny lnr t
Source  SS df MS Number of obs = 221
+ F( 3, 217) =85503.23
Model  69.7033367 3 23.2344456 Prob > F = 0.0000
Residual  .058967068 217 .000271738 Rsquared = 0.9992
+ Adj Rsquared = 0.9991
Total  69.7623038 220 .317101381 Root MSE = .01648

lnc  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lny  .8053303 .032362 24.89 0.000 .7415463 .8691144
lnr  .0047292 .003373 1.40 0.162 .0113772 .0019189
t  .0021389 .0002607 8.21 0.000 .0016252 .0026527
_cons  .9669527 .2378607 4.07 0.000 .4981397 1.435766

. predict error, resid
(8 missing values generated)
. dfuller error, noconstant lags(3) regress
Augmented DickeyFuller test for unit root Number of obs = 217
 Interpolated DickeyFuller 
Test 1% Critical 5% Critical 10% Critical
Statistic Value Value Value

Z(t) 4.841 2.584 1.950 1.618

D.error  Coef. Std. Err. t P>t [95% Conf. Interval]
+
error 
L1  .1445648 .0298631 4.84 0.000 .2034297 .0856998
LD  .2037883 .064761 3.15 0.002 .3314428 .0761338
L2D  .2077673 .0657189 3.16 0.002 .0782246 .33731
L3D  .2383771 .0635725 3.75 0.000 .1130652 .3636889

Unfortunately, we can’t use the critical values supplied by dfuller. The reason is that the dfuller test is
based on a standard random walk. We have a different kind of random walk here because it is a residual.
However, Phillips and Ouliaris (Asymptotic Properties of Residual Based Tests for Cointegration,
Econometrica 58 (1), 165193) provide critical values for the EngleGranger cointegration test.
190
Critical values for the EngleGranger ADF cointegration statistic
Intercept and trend included
NUMBER OF EXPLANATORY VARIABLES
IN STATIC REGRESSION MODEL
(EXCLUDING INTERCEPT AND TREND)
10% 5% 1%
1 3.52 3.80 4.36
2 3.84 4.16 4.65
3 4.20 4.49 5.04
4 4.46 4.74 5.36
5 4.73 5.03 5.58
Critical values for the EngleGranger ADF cointegration test statistic
Intercept included, no trend
NUMBER OF EXPLANATORY VARIABLES
IN STATIC REGRESSION MODEL
(EXCLUDING INTERCEPT)
10% 5% 1%
1 3.07 3.37 3.96
2 3.45 3.77 4.31
3 3.83 4.11 4.73
4 4.16 4.45 5.07
5 4.43 4.71 5.28
We have an intercept, a trend, and two independent variables in the long run consumption function,
yielding a critical value of 4.16 at the five percent level. Our ADF test statistic is 4.84, so we reject the
null hypothesis of a unit root in the residual from the consumption function and accept the alternative
hypothesis that consumption, income, and interest rates are cointegrated. There is, apparently, a long run
equilibrium relationship linking consumption, income, and the interest rate. Macro economists will be
relieved.
What are the properties of the estimated long run regression? It turns out that the OLS estimates of the
parameters of the static regression are consistent as the number of observations goes to infinity. Actually,
Engle and Granger show that the estimates are “superconsistent” in the sense that they converge to the true
parameter values much faster than the usual consistent estimates. An interesting question arises at this
point. Why isn’t the consumption function, estimated with OLS, inconsistent because of simultaneity? We
know that consumption is a function of national income, but national income is also a function of
consumption (Y=C+I+G). So how can we get consistent estimates with OLS?
The reason is that the explanatory variable is a random walk. We know that the bias and inconsistency
arises because of a correlation between the explanatory variables and the error term in the regression.
However, because each of the explanatory variables in the regression is a random walk, they can wander
off to infinity and have potentially infinite variance. The error term, on the other hand, is stationary, which
means it is centered on zero with a finite variance. A variable that could go to infinity cannot be correlated
with a residual that is stuck on zero. Therefore, OLS estimates of the long run relationship are consistent
even in the presence of simultaneity.
191
Dynamic ordinary least squares
Although the estimated parameters of the EngleGranger first stage regression are consistent, the estimates
are inefficient relative to a simple extension of the regression model. Further, the standard errors and t
ratios are nonstandard, so we can’t use them for hypothesis testing. To fix the efficiency problem, we use
dynamic ordinary least squares (DOLS). To correct the standard errors and tratios we use heteroskedastic
and autocorrelation consistent (HAC) standard errors.(i.e., NeweyWest standard errors).
DOLS is very easy, we simply add leads and lags of the differences of the integrated variables in the
regression model. (If your static regression has stationary variables mixed in with the nonstationary
integrated variables, just take leads and lags of the differences of the integrated variables, leaving the
stationary variables alone.) It also turns out that you don’t need more than one or two leads or lags to get
the job done. To get the HAC standard errors and tratios, we use the “newey” command.
But first we have to generate the leads and lags. We will go with two leads and two lags. We can use the
usual time series operators: F.x for leads (F for forward), L.x for lags, and D.x for differences.
This is the static long run model. Do not use a lagged dependent variable in the long run model.
. newey lnc lny lnr t FD.lny F2D.lny LD.lny L2D.lny FD.lnr F2D.lnr LD.lnr
L2D.lnr , lag(4)
Regression with NeweyWest standard errors Number of obs = 216
maximum lag: 4 F( 11, 204) = 5741.83
Prob > F = 0.0000

 NeweyWest
lnc  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lny  .8891959 .0701308 12.68 0.000 .7509218 1.0274
lnr  .0001377 .0058142 0.02 0.981 .0116013 .011325
t  .0014333 .0005705 2.51 0.013 .0003084 .002558
lny 
FD.  .379992 .1678191 2.26 0.025 .0491097 .710874
F2D.  .1907836 .131358 1.45 0.148 .0682098 .44977
LD.  .1234908 .109279 1.13 0.260 .3389519 .091970
L2D.  .0499793 .1226579 0.41 0.684 .2918191 .191860
lnr 
FD.  .0030922 .0116686 0.27 0.791 .0260987 .019914
F2D.  .0051308 .0140005 0.37 0.714 .0327351 .022473
LD.  .0154288 .0145519 1.06 0.290 .0441202 .013262
L2D.  .025613 .0133788 1.91 0.057 .0519916 .00076
The lag(4) option tells newey that there is autocorrelation of up to order four. The interest rate does not
appear to be significant in the long run consumption function.
Error correction model
We now have a way of estimating short run models with nonstationary time series, by taking first
differences. We also have a way of estimating long run equilibrium relationships (the first step of the
192
EngleGranger twostep). Is there any way to combine these techniques? It turns out that the error
correction model does just this.
We derived the error correction model from the autoregressive distributed lag (ADL) model in an earlier
chapter. The simple error correction model, also known as an equilibrium correction model or ECM can be
written as follows.
( )
0 1
1
( 1)
t t t
t
y b x a y kx ε
−
∆ = ∆ + − − +
This equation can be interpreted as follows. The first difference captures the short run dynamics. The
remaining part of the equation is called the error correction mechanism (ECM). The term in parentheses is
the difference between the long run equilibrium value of y and the realized value of y in the previous
period. It is equal to zero if there is long run equilibrium. If the system is out of equilibrium, this term will
be either positive or negative. In that case, the coefficient multiplying this value will determine how much
of the disequilibrium will be eliminated in each period, the “speed of adjustment” coefficient. As a result, y
changes from period to period both because of short run changes in x, lagged changes in y, and adjustment
to disequilibrium. Therefore, the error correction model incorporates both long run and short run
information.
Note that, if x and y are both nonstationary random walks, then their first differences are stationary. If they
are cointegrated, then the residuals from the static regression are also stationary. Thus, all the terms in the
error correction model are stationary, so all the usual tests apply.
Also, if x and y are cointegrated then there is an error correction model explaining their behavior. If there is
an error correction model relating x and y, then they are cointegrated. Further, they must be related in a
Granger causality sense because each variable adjusts to lagged changes in the other through the error
correction mechanism.
This model can be estimated using a two step procedure. In the first step, estimate the static long run
regression (don’t use DOLS). Predict the residual, which measures the disequilibrium. Lag the residual and
use it as an explanatory variable in the first difference regression.
For example, consider the consumption function above. We have already estimated the long run model and
predicted the residual. The only thing left to do is to estimate the EC model.
. regress D.lnc LD.lnc L4D.lnc D.lny D.lnr L.error
Source  SS df MS Number of obs = 220
+ F( 5, 214) = 30.99
Model  .006745058 5 .001349012 Prob > F = 0.0000
Residual  .009314213 214 .000043524 Rsquared = 0.4200
+ Adj Rsquared = 0.4065
Total  .016059271 219 .00007333 Root MSE = .0066

D.lnc  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lnc 
LD  .1823329 .0572958 3.18 0.002 .2952693 .0693965
L4D  .1036533 .0528633 1.96 0.051 .2078528 .0005462
lny 
D1  .5949995 .0485122 12.26 0.000 .4993766 .6906224
lnr 
D1  .0131099 .0059389 2.21 0.028 .0248161 .0014037
error 
L1  .0734877 .0278959 2.63 0.009 .1284737 .0185017
_cons  .0063855 .0008469 7.54 0.000 .0047162 .0080548

193
So, consumption changes because of lagged changes in consumption, changes in income, changes in the
interest rate, and lagged disequilibrium. About 7 percent of the disequilibrium is eliminated each quarter.
This is an example only. There is a lot more modeling involved in estimating even a function as simple as a
consumption function. If you were to do an LM test for autocorrelation on the above model, which you are
welcome to do, the data are in CYR.dta, you would find that there is serious autocorrelation, indicating that
the model is not properly specified. You are invited to do a better job.
194
17 PANEL DATA MODELS
A general model that pools time series and cross section data is
1
ì . ì . . ì .   . ì

` A
=
= ÷ ÷ ÷
`
where i=1,…, N (number of cross sections, e.g., states); t=1,…, T (number of time periods, e.g., years); and
K = number of explanatory variables. Note that this model gives each state its own intercept. This allows,
for example, New York to have higher crime rates, holding everything else equal, than, say, North Dakota.
The model also allows each year to have its own effect (
ì
). The model therefore also controls for national
events and trends that affect crime across all states, again holding everything else constant. Finally, the
model allows each state to have its own parameter with respect to each of the explanatory variables. This
allows, for example, the crime rate in New York to be affected differently by, say, an increase in the New
York prison population, than the crime rate in North Dakota is affected by an increase in North Dakota’s
prison population.
Different models are derived by making various assumptions concerning the parameters of this model. If
we assume that
. A
= = = ,
J
= = , and
  A 
= = = then we have the OLS
model. If we assume that the
.
and
ì
not all equal but are fixed numbers (and that the coefficients
, i k
γ
are constant across states, i) then we have the fixed effects (FE) model. This model is also called the least
squares dummy variable (LSDV) model, the covariance model, and the within estimator. If we assume
that the
.
and
ì
are random variables, still assuming that the
.
are all equal, then we have the random
effects (RE) model also known as the variance components model or the error components model.
Finally, if we assume the coefficients are constant across time, but allow the
. 
and
. 
to vary across
states and assume that
J
= = = , then we have the random coefficients model.
The Fixed Effects Model
Let’s assume that the coefficients on the explanatory variables
 . ì
. are constant across states and across
time. The model therefore reduces to
1
ì . ì . . ì   . ì

` A
=
= ÷ ÷ ÷
`
195
where the error term is assumed to be independently and identically distributed random variables with
.ì
1
l
=
l
and
. ì
1
l
=
l
l
.Note that this specification assumes that each state is independent of all other
states (no contemporaneous correlation across error terms), that each time period is independent of all the
others (no autocorrelation), and that the variance of the error term is constant (no heteroskedasticity).
This model can be conveniently estimated using dummy variables. Suppose we create a set of dummy
variables, one for each state
A
1 1 where
.
1 = if the observation is from state i and
.
1 =
otherwise. Suppose also that we similarly create a set of year dummy variables (
J
`1 `1 ). Then we
can estimate the model by regressing the dependent variable on all the explanatory variables

A , on all the
state dummies, and on all the year dummies using ordinary least squares. This is why the model is also
called the least squares dummy variable model. Under the assumptions outlined above the estimates
produced by this process are unbiased and consistent.
The fixed effects model is the most widely use panel data model. The primary reason for its wide use is that
it corrects for a particular kind of omitted variable bias, known as unobserved heterogeneity. Suppose, for
the sake of illustration, that we have data on just two states, North Dakota and New York, with three years
of data on each state. Let’s also assume that prison deters crime, so that the more criminals in prison, the
lower the crime rate. Finally, let’s assume that both the crime rate and the prison population is quite low in
North Dakota and they are both quite high in New York.
These assumptions are reflected in the following example. The true relationship between crime and prison
is
1 2
2 where 10 (ND) and 50 (NY)
it
i it
y x α α α = − = = . We set x=1,2,3 for ND and x=10,15,20 for NY.
The data set is panel.graph.dta.
. list
++
 x y state st1 st2 

1.  1 8 1 1 0 
2.  2 6 1 1 0 
3.  3 4 1 1 0 
4.  10 30 2 0 1 
5.  15 20 2 0 1 
6.  20 10 2 0 1 
++
where st1 and st2 are state dummy variables. Here is the scatter diagram with the OLS regression line
superimposed.
196
Here is the OLS regression line.
. regress y x
Source  SS df MS Number of obs = 6
+ F( 1, 4) = 0.92
Model  93.4893617 1 93.4893617 Prob > F = 0.3929
Residual  408.510638 4 102.12766 Rsquared = 0.1862
+ Adj Rsquared = 0.0172
Total  502 5 100.4 Root MSE = 10.106

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  .5531915 .578184 0.96 0.393 1.052105 2.158488
_cons  8.297872 6.416714 1.29 0.266 9.517782 26.11353
Since we don’t estimate a separate intercept for each state, the fact that the observations for New York
show both high crime and high prison population while North Dakota has low crime and low prison
population yields a positive correlation between the individual state intercepts and the explanatory variable.
The OLS model does not employ individual state dummy variables, and is therefore guilty of a kind of
omitted variable bias. In this case, the bias is so strong that it results in an incorrect sign on the estimated
coefficient for prison.
Here is the fixed effects model including the state dummy variables (but no intercept because of the dummy
variable trap).
. regress y x st1 st2, noconstant
5
1
0
1
5
2
0
2
5
3
0
0 5 10 15 20
x
Fitted values y
197
Source  SS df MS Number of obs = 6
+ F( 3, 3) = .
Model  1516 3 505.333333 Prob > F = .
Residual  0 3 0 Rsquared = 1.0000
+ Adj Rsquared = 1.0000
Total  1516 6 252.666667 Root MSE = 0

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  2 . . . . .
st1  10 . . . . .
st2  50 . . . . .

The fixed effects model fits the data perfectly.
If we include the intercept, we are in the dummy variable trap. Stata will drop the last dummy variable,
making the overall intercept equal to the coefficient on the New York dummy. North Dakota’s intercept is
now measured relative to New York’s. I recommend always keeping the overall interecept, and dropping
one of the state dummies.
. regress y x st1 st2
Source  SS df MS Number of obs = 6
+ F( 2, 3) = .
Model  502 2 251 Prob > F = .
Residual  0 3 0 Rsquared = 1.0000
+ Adj Rsquared = 1.0000
Total  502 5 100.4 Root MSE = 0

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  2 . . . . .
st1  40 . . . . .
st2  (dropped)
_cons  50 . . . . .

Since different states can be wildly different from each other for a variety of reasons, the assumption of
unobserved heterogeneity across states makes perfect sense. Thus, the fixed effects model is the correct
model for panels of states. The same is true for panels of countries, cities, and counties.
This procedure becomes awkward if there is a large number of cross section or time series observations. If
your sample consists of the 3000 or so counties in the United States over several years, then 3000 dummy
variables is a bit much. Luckily, there is an alternative approach that yields identical estimates. If we
subtract the state means from each observation, we get deviations from the state means. We can then use
ordinary least squares to regress the resulting “absorbed” dependent variable against the K “absorbed”
explanatory variables. This is the covariance estimation method. The parameter estimates will be identical
to the LSDV estimates.
The model is estimated with the “areg” (absorbed regression) command in Stata. It yields the same
coefficients as the LSDV method.
. areg y x, absorb(state)
Number of obs = 6
F( 0, 3) = .
198
Prob > F = .
Rsquared = 1.0000
Adj Rsquared = 1.0000
Root MSE = 0

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  2 . . . . .
_cons  30 . . . . .
+
state  F(1, 3) = . . (2 categories)
One drawback to this approach is that we will not be able to see the separate state intercepts without some
further effort
10
. Usually, we are not particularly interested in the numerical value of each of these estimated
coefficients as they are very difficult to interpret. Since each state intercept is a combination of all the
influences on crime that are specific to each state, constant across time and independent of the explanatory
variables, it is difficult to know exactly what the coefficient is capturing. You might think of it as a measure
of the “culture” of each particular state. The fact that New York’s intercept is larger, and perhaps
significantly larger, than North Dakota’s intercept is interesting and perhaps important, but the scale is
unknown. All we know is that if the state intercepts are different from each other and we have included all
the relevant explanatory variables, the intercepts capture the effect of unobserved heterogeneity across
states.
The way to test for the significance of the state and year effects is to conduct a test of their joint
significance as a group with Ftests corresponding to each of the following hypotheses (
.
= for all i,
ì
= for all t). In my experience, the state dummies are always significant. The year dummies are
usually significant. Do not be tempted to use the tratios to drop the insignificant dummies and retain the
significant ones. The year and state dummies should be dropped or retained as a group. The reason is that
different parameterizations of the same problem can result in different dummies being omitted. This creates
an additional source of modeling error that is best avoided.
We can use the panel.dta data set to illustrate these procedures. It is a subset of the crimext.dta data set
consisting of the first five states for the years 19771997. First let’s create the state and year dummy
variables for the LSDV procedure.
. tabulate state, gen(st)
state 
number  Freq. Percent Cum.
+
1  21 20.00 20.00
2  21 20.00 40.00
3  21 20.00 60.00
4  21 20.00 80.00
5  21 20.00 100.00
+
Total  105 100.00
. tabulate year, gen(yr)
year  Freq. Percent Cum.
+
1977  5 4.76 4.76
1978  5 4.76 9.52
1979  5 4.76 14.29
10
See Judge, et.al. (1988), pp. 470472 for the procedure.
199
1980  5 4.76 19.05
1981  5 4.76 23.81
1982  5 4.76 28.57
1983  5 4.76 33.33
1984  5 4.76 38.10
1985  5 4.76 42.86
1986  5 4.76 47.62
1987  5 4.76 52.38
1988  5 4.76 57.14
1989  5 4.76 61.90
1990  5 4.76 66.67
1991  5 4.76 71.43
1992  5 4.76 76.19
1993  5 4.76 80.95
1994  5 4.76 85.71
1995  5 4.76 90.48
1996  5 4.76 95.24
1997  5 4.76 100.00
+
Total  105 100.00
. regress lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534 st2st5 yr2yr21
Source  SS df MS Number of obs = 105
+ F( 30, 74) = 27.19
Model  5.24465133 30 .174821711 Prob > F = 0.0000
Residual  .475757707 74 .006429158 Rsquared = 0.9168
+ Adj Rsquared = 0.8831
Total  5.72040904 104 .055003933 Root MSE = .08018

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lprison  .4886959 .1576294 3.10 0.003 .8027794 .1746125
lmetpct  .0850076 .8021396 0.11 0.916 1.51329 1.683306
lblack  1.006078 .4363459 2.31 0.024 1.875516 .1366394
lrpcpi  .9622313 .3844 2.50 0.015 .1962975 1.728165
lp1824  .5760403 .4585362 1.26 0.213 1.489694 .337613
lp2534  1.058662 .6269312 1.69 0.095 2.30785 .1905254
st2  2.174764 .7955754 2.73 0.008 3.759982 .589545
st3  1.984824 .9657397 2.06 0.043 3.909102 .0605453
st4  .7511301 .420334 1.79 0.078 1.588664 .0864037
st5  1.214557 .6007863 2.02 0.047 2.41165 .0174643
yr2  .0368809 .0529524 0.70 0.488 .068629 .1423909
yr3  .1114482 .0549662 2.03 0.046 .0019256 .2209707
yr4  .3081141 .0660811 4.66 0.000 .1764446 .4397836
yr5  .3655603 .0749165 4.88 0.000 .2162859 .5148348
yr6  .3656063 .0808562 4.52 0.000 .2044969 .5267157
yr7  .3213356 .0905921 3.55 0.001 .1408268 .5018443
yr8  .3170379 .1026446 3.09 0.003 .1125141 .5215617
yr9  .3554716 .1176801 3.02 0.003 .1209889 .5899542
yr10  .4264209 .1358064 3.14 0.002 .1558207 .6970211
yr11  .3886578 .1501586 2.59 0.012 .0894603 .6878553
yr12  .3874652 .1672654 2.32 0.023 .0541816 .7207488
yr13  .4056992 .1871167 2.17 0.033 .032861 .7785375
yr14  .4489243 .2073026 2.17 0.034 .0358647 .8619838
yr15  .5139933 .2261039 2.27 0.026 .0634715 .9645151
yr16  .4322245 .2514655 1.72 0.090 .0688314 .9332805
yr17  .387142 .2775931 1.39 0.167 .1659743 .9402583
yr18  .3239752 .304477 1.06 0.291 .2827085 .9306589
yr19  .2460049 .3344635 0.74 0.464 .4204281 .9124379
yr20  .1592112 .3602613 0.44 0.660 .558625 .8770475
yr21  .1163704 .3901125 0.30 0.766 .6609458 .8936866
200
_cons  7.362542 4.453038 1.65 0.102 1.51033 16.23541

* test for significance of the state dummies
. testparm st2st5
( 1) st2 = 0
( 2) st3 = 0
( 3) st4 = 0
( 4) st5 = 0
F( 4, 74) = 6.32
Prob > F = 0.0002
* test for significance of the year dummies
. testparm yr2yr21
( 1) yr2 = 0
( 2) yr3 = 0
( 3) yr4 = 0
( 4) yr5 = 0
( 5) yr6 = 0
( 6) yr7 = 0
( 7) yr8 = 0
( 8) yr9 = 0
( 9) yr10 = 0
(10) yr11 = 0
(11) yr12 = 0
(12) yr13 = 0
(13) yr14 = 0
(14) yr15 = 0
(15) yr16 = 0
(16) yr17 = 0
(17) yr18 = 0
(18) yr19 = 0
(19) yr20 = 0
(20) yr21 = 0
F( 20, 74) = 4.49
Prob > F = 0.0000
As expected, both sets of dummy variables are highly significant. Let’s do the same regression by
absorbing the state means.
. areg lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534 yr2yr21,
absorb(state)
Number of obs = 105
F( 26, 74) = 6.12
Prob > F = 0.0000
Rsquared = 0.9168
Adj Rsquared = 0.8831
Root MSE = .08018

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lprison  .4886959 .1576294 3.10 0.003 .8027794 .1746125
lmetpct  .0850076 .8021396 0.11 0.916 1.51329 1.683306
lblack  1.006078 .4363459 2.31 0.024 1.875516 .1366394
lrpcpi  .9622313 .3844 2.50 0.015 .1962975 1.728165
lp1824  .5760403 .4585362 1.26 0.213 1.489694 .337613
201
lp2534  1.058662 .6269312 1.69 0.095 2.30785 .1905254
yr2  .0368809 .0529524 0.70 0.488 .068629 .1423909
yr3  .1114482 .0549662 2.03 0.046 .0019256 .2209707
yr4  .3081141 .0660811 4.66 0.000 .1764446 .4397836
yr5  .3655603 .0749165 4.88 0.000 .2162859 .5148348
yr6  .3656063 .0808562 4.52 0.000 .2044969 .5267157
yr7  .3213356 .0905921 3.55 0.001 .1408268 .5018443
yr8  .3170379 .1026446 3.09 0.003 .1125141 .5215617
yr9  .3554716 .1176801 3.02 0.003 .1209889 .5899542
yr10  .4264209 .1358064 3.14 0.002 .1558207 .6970211
yr11  .3886578 .1501586 2.59 0.012 .0894603 .6878553
yr12  .3874652 .1672654 2.32 0.023 .0541816 .7207488
yr13  .4056992 .1871167 2.17 0.033 .032861 .7785375
yr14  .4489243 .2073026 2.17 0.034 .0358647 .8619838
yr15  .5139933 .2261039 2.27 0.026 .0634715 .9645151
yr16  .4322245 .2514655 1.72 0.090 .0688314 .9332805
yr17  .387142 .2775931 1.39 0.167 .1659743 .9402583
yr18  .3239752 .304477 1.06 0.291 .2827085 .9306589
yr19  .2460049 .3344635 0.74 0.464 .4204281 .9124379
yr20  .1592112 .3602613 0.44 0.660 .558625 .8770475
yr21  .1163704 .3901125 0.30 0.766 .6609458 .8936866
_cons  6.137487 4.477591 1.37 0.175 2.784307 15.05928
+
state  F(4, 74) = 6.315 0.000 (5 categories)
Note that the coefficients are identical to the LSDV method. Also, the Ftest on the state dummies is
automatically produced by areg.
There is yet another fixed effects estimation command in Stata, namely xtreg. However, with xtreg, you
must specify the fixed effects model with the option, fe.
. xtreg lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534 yr2yr21, fe
Fixedeffects (within) regression Number of obs = 105
Group variable (i): state Number of groups = 5
Rsq: within = 0.6824 Obs per group: min = 21
between = 0.2352 avg = 21.0
overall = 0.2127 max = 21
F(26,74) = 6.12
corr(u_i, Xb) = 0.9696 Prob > F = 0.0000

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lprison  .4886959 .1576294 3.10 0.003 .8027794 .1746125
lmetpct  .0850076 .8021396 0.11 0.916 1.51329 1.683306
lblack  1.006078 .4363459 2.31 0.024 1.875516 .1366394
lrpcpi  .9622313 .3844 2.50 0.015 .1962975 1.728165
lp1824  .5760403 .4585362 1.26 0.213 1.489694 .337613
lp2534  1.058662 .6269312 1.69 0.095 2.30785 .1905254
yr2  .0368809 .0529524 0.70 0.488 .068629 .1423909
yr3  .1114482 .0549662 2.03 0.046 .0019256 .2209707
yr4  .3081141 .0660811 4.66 0.000 .1764446 .4397836
yr5  .3655603 .0749165 4.88 0.000 .2162859 .5148348
yr6  .3656063 .0808562 4.52 0.000 .2044969 .5267157
yr7  .3213356 .0905921 3.55 0.001 .1408268 .5018443
yr8  .3170379 .1026446 3.09 0.003 .1125141 .5215617
yr9  .3554716 .1176801 3.02 0.003 .1209889 .5899542
yr10  .4264209 .1358064 3.14 0.002 .1558207 .6970211
yr11  .3886578 .1501586 2.59 0.012 .0894603 .6878553
202
yr12  .3874652 .1672654 2.32 0.023 .0541816 .7207488
yr13  .4056992 .1871167 2.17 0.033 .032861 .7785375
yr14  .4489243 .2073026 2.17 0.034 .0358647 .8619838
yr15  .5139933 .2261039 2.27 0.026 .0634715 .9645151
yr16  .4322245 .2514655 1.72 0.090 .0688314 .9332805
yr17  .387142 .2775931 1.39 0.167 .1659743 .9402583
yr18  .3239752 .304477 1.06 0.291 .2827085 .9306589
yr19  .2460049 .3344635 0.74 0.464 .4204281 .9124379
yr20  .1592112 .3602613 0.44 0.660 .558625 .8770475
yr21  .1163704 .3901125 0.30 0.766 .6609458 .8936866
_cons  6.137487 4.477591 1.37 0.175 2.784307 15.05928
+
sigma_u  .89507954
sigma_e  .08018203
rho  .99203915 (fraction of variance due to u_i)

F test that all u_i=0: F(4, 74) = 6.32 Prob > F = 0.0002
Again, the coefficients are identical and the Ftest is automatic.
Another advantage of the fixed effects model is that, since the estimation procedure is really ordinary least
squares with a bunch of dummy variables, problems of autocorrelation and heteroskedasticity, among
others can be dealt with relatively simply. Heteroskedasticity can be handled with weighting the regression,
usually with population, or by using heteroskedasticconsistent standard errors, or both. As we argued in
the section on time series analysis, autocorrelation is best dealt with by adding lagged variables.
I recommend either the regress (LSDV) or areg (covariance) methods because they both allow robust
standard errors and tratios, as well as a wide variety of postestimation diagnostics. Heteroskedasticity can
be addressed by using the BreuschPagan or White test or corrected using weighting and robust standard
errors. The xtreg command does not allow robust standard errors.
Time series issues
Since we are combining crosssection and timeseries data, we can expect to run into both crosssection and
timeseries problems. We saw above that one major crosssection problem, heteroskedasticity, is routinely
handled with this model. Further, our model of choice is usually the fixed effects model, ignores all
variation between states and is therefore a pure time series model, As a result, we have to address all the
usual timeseries issues.
The first time series topic that we need to deal with is the possibility of deterministic trends in the data.
Linear trends
One suggestion for panel data studies is to include an overall trend, as well as the year dummies. The trend
captures the effect of omitted slowly changing time related variables. Since we never know if we have
included all the relevant variables, I recommend always including a trend in any time series model. If it
turns out to be insignificant, then it can be dropped. The year dummies do not capture the same effects as
the overall trend. The year dummies capture the effects of common factors that affect the dependent
variable over all the states at the same time. For example, an amazingly effective federal law could reduce
crime uniformly in all the states in the years after its passage. This would be captured by year dummies, not
a linear trend.
The model becomes
203
1
ì . ì . . ì   . ì

` ì A
=
= ÷ ÷ ÷ ÷
`
A linear trend is simply the numbers 0,1,2,…,T. We can create an overall trend by simply subtracting the
first year from all years. For example, if that data set starts with 1977, we can create a linear trend by
subtracting 1977 from each value of year. Note that adding a linear trend will put us in the dummy variable
trap because the trend is simply a linear combination of the year dummies with the coefficients equal to 0,
1, 2, etc. As a result, I recommend deleting one of the year dummies when adding a linear trend.
We can illustrate these facts with the panel.dta data set.
. gen t=year1977
. list state stnm year t yr1yr5 if state==1
++
 state stnm year t yr1 yr2 yr3 yr4 yr5 

1.  1 AL 1977 0 1 0 0 0 0 
2.  1 AL 1978 1 0 1 0 0 0 
3.  1 AL 1979 2 0 0 1 0 0 
4.  1 AL 1980 3 0 0 0 1 0 
5.  1 AL 1981 4 0 0 0 0 1 

6.  1 AL 1982 5 0 0 0 0 0 
7.  1 AL 1983 6 0 0 0 0 0 
8.  1 AL 1984 7 0 0 0 0 0 
9.  1 AL 1985 8 0 0 0 0 0 
10.  1 AL 1986 9 0 0 0 0 0 

11.  1 AL 1987 10 0 0 0 0 0 
12.  1 AL 1988 11 0 0 0 0 0 
13.  1 AL 1989 12 0 0 0 0 0 
14.  1 AL 1990 13 0 0 0 0 0 
15.  1 AL 1991 14 0 0 0 0 0 

16.  1 AL 1992 15 0 0 0 0 0 
17.  1 AL 1993 16 0 0 0 0 0 
18.  1 AL 1994 17 0 0 0 0 0 
19.  1 AL 1995 18 0 0 0 0 0 
20.  1 AL 1996 19 0 0 0 0 0 

21.  1 AL 1997 20 0 0 0 0 0 
++
Now let’s estimate the corresponding fixed effects model.
. areg lcrmaj lprison lmetpct lblack lp1824 lp2534 t yr2yr20 , absorb(state)
Number of obs = 105
F( 25, 75) = 5.71
Prob > F = 0.0000
Rsquared = 0.9098
Adj Rsquared = 0.8749
Root MSE = .08295

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lprison  .6904558 .1401393 4.93 0.000 .9696276 .4112841
lmetpct  .2764064 .8260439 0.33 0.739 1.369157 1.92197
204
lblack  .9933732 .4513743 2.20 0.031 1.892557 .0941895
lp1824  .0528402 .3968248 0.13 0.894 .7376754 .8433558
lp2534  .1393306 .5256308 0.27 0.792 1.186441 .9077796
t  .0431765 .012993 3.32 0.001 .0172932 .0690598
yr2  .0135131 .0542071 0.25 0.804 .0944729 .1214992
yr3  .0296471 .0611756 0.48 0.629 .092221 .1515151
yr4  .1321815 .0773008 1.71 0.091 .0218097 .2861727
yr5  .1534079 .0914021 1.68 0.097 .0286743 .3354901
yr6  .1403835 .0946307 1.48 0.142 .0481306 .3288975
yr7  .1052747 .099064 1.06 0.291 .0920709 .3026204
yr8  .1169556 .0997961 1.17 0.245 .0818484 .3157595
yr9  .1689011 .0983383 1.72 0.090 .0269988 .364801
yr10  .2511994 .0968453 2.59 0.011 .0582736 .4441253
yr11  .1998446 .0940428 2.13 0.037 .0125017 .3871874
yr12  .195804 .093125 2.10 0.039 .0102895 .3813186
yr13  .2179579 .0911012 2.39 0.019 .0364749 .3994409
yr14  .2592705 .0842792 3.08 0.003 .0913777 .4271632
yr15  .3172979 .0757261 4.19 0.000 .1664438 .468152
yr16  .2495013 .068577 3.64 0.001 .1128888 .3861138
yr17  .2169851 .0623196 3.48 0.001 .092838 .3411321
yr18  .1670167 .0573777 2.91 0.005 .0527144 .281319
yr19  .1087532 .0535365 2.03 0.046 .002103 .2154034
yr20  .0263492 .0520785 0.51 0.614 .0773964 .1300948
_cons  10.67773 4.235064 2.52 0.014 2.241046 19.11441
+
state  F(4, 75) = 4.585 0.002 (5 categories)
The trend is significant, and should be retained.
Another suggestion is that each state should have its own trend term as well as its own intercept. The model
becomes
1
ì . ì . . . ì   . ì

j ì .
=
= ÷ ÷ ÷ ÷
`
We can create these state trends by multiplying the overall trend by the state dummies.
. gen tr1=t*st1
. gen tr2=t*st2
. gen tr3=t*st3
. gen tr4=t*st4
. gen tr5=t*st5
Let’s look at a couple of these series.
. list state stnm year t st1 tr1 st2 tr2 if state <=2
++
 state stnm year t st1 tr1 st2 tr2 

1.  1 AL 1977 0 1 0 0 0 
2.  1 AL 1978 1 1 1 0 0 
3.  1 AL 1979 2 1 2 0 0 
4.  1 AL 1980 3 1 3 0 0 
5.  1 AL 1981 4 1 4 0 0 

6.  1 AL 1982 5 1 5 0 0 
7.  1 AL 1983 6 1 6 0 0 
8.  1 AL 1984 7 1 7 0 0 
9.  1 AL 1985 8 1 8 0 0 
205
10.  1 AL 1986 9 1 9 0 0 

11.  1 AL 1987 10 1 10 0 0 
12.  1 AL 1988 11 1 11 0 0 
13.  1 AL 1989 12 1 12 0 0 
14.  1 AL 1990 13 1 13 0 0 
15.  1 AL 1991 14 1 14 0 0 

16.  1 AL 1992 15 1 15 0 0 
17.  1 AL 1993 16 1 16 0 0 
18.  1 AL 1994 17 1 17 0 0 
19.  1 AL 1995 18 1 18 0 0 
20.  1 AL 1996 19 1 19 0 0 

21.  1 AL 1997 20 1 20 0 0 
22.  2 AK 1977 0 0 0 1 0 
23.  2 AK 1978 1 0 0 1 1 
24.  2 AK 1979 2 0 0 1 2 
25.  2 AK 1980 3 0 0 1 3 

26.  2 AK 1981 4 0 0 1 4 
27.  2 AK 1982 5 0 0 1 5 
Now let’s estimate the implied fixed effects model.
. areg lcrmaj lprison lmetpct lblack lp1824 lp2534 tr1tr5 yr2yr20 ,
absorb(state)
Number of obs = 105
F( 29, 71) = 9.04
Prob > F = 0.0000
Rsquared = 0.9442
Adj Rsquared = 0.9183
Root MSE = .06706

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lprison  .4922259 .1500995 3.28 0.002 .7915157 .192936
lmetpct  6.408965 3.265479 1.96 0.054 .1022141 12.92014
lblack  .1648299 1.133181 0.15 0.885 2.094669 2.424329
lp1824  .0461248 .3258571 0.14 0.888 .6958654 .6036158
lp2534  .4839899 .4826101 1.00 0.319 1.446287 .4783073
tr1  .0051039 .017904 0.29 0.776 .0408036 .0305957
tr2  .033832 .0311964 1.08 0.282 .028372 .0960359
tr3  .0140482 .016859 0.83 0.407 .0476641 .0195677
tr4  .0214053 .0127019 1.69 0.096 .0039216 .0467323
tr5  .0197035 .0137888 1.43 0.157 .0077906 .0471976
yr2  .0292319 .0449536 0.65 0.518 .060403 .1188667
yr3  .0680641 .0533424 1.28 0.206 .0382976 .1744258
yr4  .1931018 .0702593 2.75 0.008 .0530088 .3331948
yr5  .2151175 .0845614 2.54 0.013 .0465069 .383728
yr6  .1876612 .0899632 2.09 0.041 .0082798 .3670427
yr7  .1441216 .0959832 1.50 0.138 .0472634 .3355067
yr8  .151261 .097835 1.55 0.127 .0438165 .3463386
yr9  .1996939 .0973446 2.05 0.044 .0055943 .3937936
yr10  .2782618 .0968381 2.87 0.005 .085172 .4713515
yr11  .2211901 .0952123 2.32 0.023 .0313422 .411038
yr12  .2153056 .0944229 2.28 0.026 .0270317 .4035796
yr13  .2356666 .0920309 2.56 0.013 .0521622 .4191709
yr14  .2714963 .0856649 3.17 0.002 .1006853 .4423072
yr15  .3271032 .0761548 4.30 0.000 .1752548 .4789515
206
yr16  .259055 .0667641 3.88 0.000 .1259312 .3921789
yr17  .2258548 .0579972 3.89 0.000 .1102116 .341498
yr18  .1751649 .0509189 3.44 0.001 .0736355 .2766942
yr19  .1152277 .045183 2.55 0.013 .0251353 .20532
yr20  .0302166 .0425423 0.71 0.480 .0546102 .1150435
_cons  17.61505 13.88093 1.27 0.209 45.29283 10.06274
+
state  F(4, 71) = 1.246 0.299 (5 categories)
. testparm tr1tr5
( 1) tr1 = 0
( 2) tr2 = 0
( 3) tr3 = 0
( 4) tr4 = 0
( 5) tr5 = 0
F( 5, 71) = 12.13
Prob > F = 0.0000
Note that we have to drop the overall trend if we include all the state trends because of collinearity. Also,
the individual trends are significant as a group and should be retained. We could also test whether the state
trends all have the same coefficient and could be combined into a single, overall trend.
. test tr1=tr2=tr3=tr4=tr5
( 1) tr1  tr2 = 0
( 2) tr1  tr3 = 0
( 3) tr1  tr4 = 0
( 4) tr1  tr5 = 0
F( 4, 71) = 10.94
Prob > F = 0.0000
Apparently the trends are significantly different across states and cannot be combined into a single overall
trend.
Estimating fixed effects models with individual state trends is becoming common practice.
Unit roots and panel data
With respect to the time series issues, we need to know whether each individual time series has a unit root.
For example, pooling states across years means that each state forms a time series of annual data. We need
to know if the typical state series has a unit root. There are panel data unit root tests, but none are officially
available in Stata. It is possible to download user written Stata modules, however, you cannot use use these
modules on the server.
Even if you had easily available panel unit root tests, what would you do with the information generated by
such tests? If you have unit roots, then you will want to estimate a short run model with first differences,
estimate a long run equilibrium model, or estimate a error correction model. If you do not have unit roots,
you probably want to estimate a model in levels. Because unit root tests are not particularly powerful, it is
probably a good idea to estimate both short run (first differences) and a long run model in levels and hope
that the main results are consistent across both specifications.
Pooled time series and cross section panel models are ideal for estimating the long run average
relationships among nonstationary variables. The pooled panel estimator yields consistent estimates with a
207
normal limit distribution. These results hold in the presence of individual fixed effects. The coefficients are
analogous to the population (not sample) regression coefficients in conventional regressions.
11
Thus, the
static fixed effects model estimated above can be interpreted as the long run average crime equation (for the
five states in the sample).
.
It turns out that another way to estimate the fixed effects model is to take first differences. The first
differencing “sweeps out” the state dummies because a state dummy is constant across time, so that the
value of the dummy (=1) in year 2 is the same as its value (=1) in year 1, so the difference is zero. The year
dummies are also altered. The dummy for year 2 is equal to one for year 2, zero otherwise. Thus, the first
difference, year2year1=10=1. However the difference, year3year2=01=1. All the remaining values are
zero. So the first differenced year dummy equals zero for the years before the year of the dummy, then 1
for the year in question, then negative one for the year after, then zero again.
Let’s use the panel.dta data set to illustrate this proposition.
. tsset state year
panel variable: state, 1 to 5
time variable: year, 1977 to 1997
. gen dst1=D.st1
(5 missing values generated)
. gen dyr5=D.yr5
(5 missing values generated)
. list state year stnm st1 dst1 yr5 dyr5 if state==1
++
 state year stnm st1 dst1 yr5 dyr5 

1.  1 1977 AL 1 . 0 . 
2.  1 1978 AL 1 0 0 0 
3.  1 1979 AL 1 0 0 0 
4.  1 1980 AL 1 0 0 0 
5.  1 1981 AL 1 0 1 1 

6.  1 1982 AL 1 0 0 1 
7.  1 1983 AL 1 0 0 0 
8.  1 1984 AL 1 0 0 0 
9.  1 1985 AL 1 0 0 0 
10.  1 1986 AL 1 0 0 0 

11.  1 1987 AL 1 0 0 0 
12.  1 1988 AL 1 0 0 0 
13.  1 1989 AL 1 0 0 0 
14.  1 1990 AL 1 0 0 0 
15.  1 1991 AL 1 0 0 0 

16.  1 1992 AL 1 0 0 0 
17.  1 1993 AL 1 0 0 0 
18.  1 1994 AL 1 0 0 0 
19.  1 1995 AL 1 0 0 0 
20.  1 1996 AL 1 0 0 0 

21.  1 1997 AL 1 0 0 0 
++
11
Phillips, P.C.B. and H.R. Moon, “Linear Regression Limit Theory for Nonstationary Panel Data,” Econometrica, 67,
1999, 10571112.
208
Note that if you did include the state dummies in the first difference model, it is equivalent to estimating a
fixed effects in levels with individual state trends because the first difference of a trend is a series of ones.
. gen dt=D.t
(5 missing values generated)
. list state year stnm st1 t dt if state==1
++
 state year stnm st1 t dt 

1.  1 1977 AL 1 0 . 
2.  1 1978 AL 1 1 1 
3.  1 1979 AL 1 2 1 
4.  1 1980 AL 1 3 1 
5.  1 1981 AL 1 4 1 

6.  1 1982 AL 1 5 1 
7.  1 1983 AL 1 6 1 
8.  1 1984 AL 1 7 1 
9.  1 1985 AL 1 8 1 
10.  1 1986 AL 1 9 1 

11.  1 1987 AL 1 10 1 
12.  1 1988 AL 1 11 1 
13.  1 1989 AL 1 12 1 
14.  1 1990 AL 1 13 1 
15.  1 1991 AL 1 14 1 

16.  1 1992 AL 1 15 1 
17.  1 1993 AL 1 16 1 
18.  1 1994 AL 1 17 1 
19.  1 1995 AL 1 18 1 
20.  1 1996 AL 1 19 1 

21.  1 1997 AL 1 20 1 
++
Since we can estimate a fixed effects model using either levels and state dummies (absorbed or not) or
using first differences with no state dummies, shouldn’t we get the same answer either way? The short
answer is “no.” The estimates will not be identical for several reasons. If the variables are stationary, then
the first difference approach, when it subtracts last year’s value, also subtracts last year’s measurement
error, if there is any. This is likely to leave a smaller measurement error than if we subtract of the mean,
which is what we do if we use state dummies or absorb the state dummies. Thus, if there is any
measurement error we could get different parameter estimates by estimating in levels and then estimating in
first differences.
The second reason is that, if the variables are nonstationary, and not cointegrated, then taking first
differences usually creates stationary variables, which can be expected to have different parameter values.
If they are nonstationary, but cointegrated, the static model in levels is the long run equilibrium model,
which can be expected to be somewhat different from the short run model. Finally, we lose one year of data
when we take first differences, so the sample is not the same.
Let’s estimate the fixed effects model above using first differences. We use the “D.” operator to generate
first differenced series. Don’t forget to “tsset state year” to tell Stata that it is a panel data set sorted by state
and year.
. regress dlcrmaj dlprison dlmetpct dlblack dlrpcpi dlp1824 dlp2534 dyr2dyr21
Source  SS df MS Number of obs = 100
209
+ F( 25, 74) = 3.55
Model  .267228983 25 .010689159 Prob > F = 0.0000
Residual  .22282077 74 .003011091 Rsquared = 0.5453
+ Adj Rsquared = 0.3917
Total  .490049753 99 .004949998 Root MSE = .05487

dlcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
dlprison  .2317252 .1700677 1.36 0.177 .5705926 .1071422
dlmetpct  1.631629 2.166495 0.75 0.454 2.685207 5.948465
dlblack  .4023363 .8714137 0.46 0.646 2.138666 1.333993
dlrpcpi  .1796915 .3049804 0.59 0.558 .4279951 .7873781
dlp1824  .9338755 .4775285 1.96 0.054 1.885372 .0176207
dlp2534  .8105992 .7116841 1.14 0.258 2.228661 .6074623
dyr2  .0442992 .0332805 1.33 0.187 .0220136 .1106119
dyr3  .0971729 .0595598 1.63 0.107 .0215027 .2158485
dyr4  .2581997 .0972473 2.66 0.010 .0644303 .4519692
dyr5  .2852936 .1208029 2.36 0.021 .0445887 .5259986
dyr6  .2441781 .1275799 1.91 0.059 .0100303 .4983865
dyr7  .1863416 .1329834 1.40 0.165 .0786336 .4513169
dyr8  .1780758 .1339427 1.33 0.188 .088811 .4449625
dyr9  .2112703 .1327145 1.59 0.116 .0531691 .4757097
dyr10  .2753427 .1314994 2.09 0.040 .0133244 .537361
dyr11  .2083532 .1298586 1.60 0.113 .0503956 .467102
dyr12  .1976981 .1288571 1.53 0.129 .0590553 .4544515
dyr13  .2160628 .1250786 1.73 0.088 .0331617 .4652874
dyr14  .2431188 .1160254 2.10 0.040 .0119332 .4743043
dyr15  .2966003 .1038212 2.86 0.006 .0897319 .5034686
dyr16  .2321504 .089506 2.59 0.011 .0538057 .4104951
dyr17  .2068382 .0745739 2.77 0.007 .0582465 .35543
dyr18  .1602132 .0600055 2.67 0.009 .0406496 .2797767
dyr19  .1017896 .0437182 2.33 0.023 .0146792 .1888999
dyr20  .0210332 .0283287 0.74 0.460 .035413 .0774795
dyr21  (dropped)
_cons  .014203 .0211906 0.67 0.505 .0564261 .0280202

We know that the constant term captures the effect of an overall linear trend. If we want to replace the
overall trend with individual state trends, we simply add the state dummies back into the model.
These results are different from the fixed effects model estimated in levels. So, I recommend that you
estimate the model in levels, with no lags, and call that the long run regression. Then estimate the model in
first differences, with as many lags as might be necessary to generate residuals without any serial
correlation (use the LM test with saved residuals). Call that the short run dynamic model.
Let’s see if we have autocorrelation in the short run model above.
. predict e, resid
(5 missing values generated)
. gen e_1=L.e
(10 missing values generated)
. regress e e_1 dlprison dlmetpct dlblack dlrpcpi dlp1824 dlp2534 dyr2dyr21
Source  SS df MS Number of obs = 95
+ F( 25, 69) = 0.22
Model  .015042094 25 .000601684 Prob > F = 1.0000
Residual  .19261245 69 .002791485 Rsquared = 0.0724
+ Adj Rsquared = 0.2636
210
Total  .207654544 94 .002209091 Root MSE = .05283

e  Coef. Std. Err. t P>t [95% Conf. Interval]
+
e_1  .2340898 .1181197 1.98 0.051 .0015526 .4697323
dlprison  .0796463 .1829466 0.44 0.665 .4446148 .2853223
dlmetpct  .8967098 2.120876 0.42 0.674 5.127742 3.334322
dlblack  .6180148 .8976362 0.69 0.493 2.40875 1.172721
dlrpcpi  .2284817 .3190218 0.72 0.476 .4079494 .8649128
dlp1824  .1962703 .4761782 0.41 0.681 1.14622 .7536792
dlp2534  .1558151 .72768 0.21 0.831 1.607497 1.295867
dyr2  (dropped)
dyr3  .0056397 .0377379 0.15 0.882 .0696455 .0809248
dyr4  .0226855 .0787475 0.29 0.774 .1344115 .1797825
dyr5  .0342735 .104947 0.33 0.745 .17509 .2436371
dyr6  .0459182 .1152692 0.40 0.692 .1840377 .275874
dyr7  .0486212 .1227 0.40 0.693 .1961586 .293401
dyr8  .0458445 .1247895 0.37 0.714 .2031038 .2947929
dyr9  .0422207 .1245735 0.34 0.736 .2062966 .2907381
dyr10  .0392704 .1244473 0.32 0.753 .2089952 .2875359
dyr11  .0431052 .1250605 0.34 0.731 .2063837 .2925941
dyr12  .0427233 .1252572 0.34 0.734 .2071579 .2926046
dyr13  .0404449 .1222374 0.33 0.742 .2034122 .2843019
dyr14  .0399949 .1146092 0.35 0.728 .1886443 .2686341
dyr15  .0380374 .1033664 0.37 0.714 .1681729 .2442476
dyr16  .030283 .0887014 0.34 0.734 .1466714 .2072374
dyr17  .0234023 .0735115 0.32 0.751 .1232491 .1700537
dyr18  .0163755 .058807 0.28 0.781 .1009412 .1336922
dyr19  .0080579 .0423675 0.19 0.850 .076463 .0925788
dyr20  .0045028 .0274865 0.16 0.870 .0503314 .0593369
dyr21  (dropped)
_cons  .0029871 .0218047 0.14 0.891 .0405121 .0464863

. test e_1
( 1) e_1 = 0
F( 1, 69) = 3.93
Prob > F = 0.0515
Looks like we have autocorrelation. Let’s add a lagged dependent variable.
. regress dlcrmaj L.dlcrmaj dlprison dlmetpct dlblack dlrpcpi dlp1824 dlp2534
dyr
> 2dyr21
Source  SS df MS Number of obs = 95
+ F( 25, 69) = 4.12
Model  .285675804 25 .011427032 Prob > F = 0.0000
Residual  .191239715 69 .00277159 Rsquared = 0.5990
+ Adj Rsquared = 0.4537
Total  .476915519 94 .005073569 Root MSE = .05265

dlcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
dlcrmaj 
L1  .2414129 .1144276 2.11 0.039 .013136 .4696897
dlprison  .275926 .1823793 1.51 0.135 .6397627 .0879108
dlmetpct  .5640736 2.123772 0.27 0.791 3.672735 4.800882
dlblack  .9904829 .8941283 1.11 0.272 2.77422 .7932542
211
dlrpcpi  .3920387 .318366 1.23 0.222 .243084 1.027161
dlp1824  1.14118 .4742399 2.41 0.019 2.087263 .1950977
dlp2534  .9256109 .7272952 1.27 0.207 2.376525 .5253034
dyr2  (dropped)
dyr3  .05788 .0377758 1.53 0.130 .0174806 .1332407
dyr4  .2300886 .0792884 2.90 0.005 .0719126 .3882646
dyr5  .2429812 .1087098 2.24 0.029 .0261112 .4598513
dyr6  .2159285 .1187687 1.82 0.073 .0210087 .4528656
dyr7  .1775044 .1243088 1.43 0.158 .0704849 .4254938
dyr8  .1855627 .1247856 1.49 0.142 .0633777 .4345031
dyr9  .2184576 .1244927 1.75 0.084 .0298986 .4668138
dyr10  .271181 .1253369 2.16 0.034 .0211407 .5212212
dyr11  .1919514 .1280677 1.50 0.138 .0635368 .4474395
dyr12  .1984084 .1264836 1.57 0.121 .0539195 .4507363
dyr13  .2187841 .1232808 1.77 0.080 .0271544 .4647226
dyr14  .2418713 .1163528 2.08 0.041 .0097539 .4739887
dyr15  .2866917 .1063984 2.69 0.009 .0744327 .4989507
dyr16  .2004976 .0946694 2.12 0.038 .0116374 .3893579
dyr17  .1826651 .077764 2.35 0.022 .0275302 .3378001
dyr18  .1345287 .0629234 2.14 0.036 .0089999 .2600575
dyr19  .078249 .045497 1.72 0.090 .012515 .1690131
dyr20  .0059974 .029325 0.20 0.839 .0525043 .0644992
dyr21  (dropped)
_cons  .0137706 .0216976 0.63 0.528 .0570561 .0295148

The lagged dependent variable is highly significant, indicating an omitted time dependent variable. Let’s
rerun the LM test and see if we still have autocorrelation.
. drop e
. predict e, resid
(10 missing values generated)
. regress e e_1 L.dlcrmaj dlprison dlmetpct dlblack dlrpcpi dlp1824 dlp2534
dyr2
> dyr21
Source  SS df MS Number of obs = 95
+ F( 26, 68) = 0.00
Model  .000021653 26 8.3281e07 Prob > F = 1.0000
Residual  .191218062 68 .00281203 Rsquared = 0.0001
+ Adj Rsquared = 0.3822
Total  .191239715 94 .002034465 Root MSE = .05303

e  Coef. Std. Err. t P>t [95% Conf. Interval]
+
e_1  .0351238 .4002693 0.09 0.930 .8338488 .7636012
dlcrmaj 
L1  .0326156 .3891471 0.08 0.933 .7439154 .8091466
dlprison  .005018 .1923999 0.03 0.979 .3789099 .388946
dlmetpct  .0134529 2.144696 0.01 0.995 4.293127 4.266221
dlblack  .0047841 .9022764 0.01 0.996 1.79568 1.805249
dlrpcpi  .0015706 .3211793 0.00 0.996 .642474 .6393327
dlp1824  .0021095 .4782917 0.00 0.996 .9565257 .9523067
dlp2534  .0017255 .7328458 0.00 0.998 1.460646 1.464097
dyr2  (dropped)
dyr3  .0002407 .0381491 0.01 0.995 .0763661 .0758847
dyr4  .0011592 .08095 0.01 0.989 .1626924 .1603739
dyr5  .0048292 .1225515 0.04 0.969 .2493768 .2397185
dyr6  .0047852 .1314744 0.04 0.971 .2671382 .2575679
dyr7  .0028128 .1292503 0.02 0.983 .2607277 .2551021
212
dyr8  .000533 .1258393 0.00 0.997 .2516413 .2505754
dyr9  .0003932 .1254777 0.00 0.998 .25078 .2499937
dyr10  .001839 .1279756 0.01 0.989 .2572104 .2535324
dyr11  .0043207 .1380765 0.03 0.975 .2798481 .2712066
dyr12  .0022677 .1299976 0.02 0.986 .261674 .2571385
dyr13  .0020183 .1262891 0.02 0.987 .2540242 .2499876
dyr14  .0028452 .121601 0.02 0.981 .2454962 .2398058
dyr15  .0041166 .1169896 0.04 0.972 .2375657 .2293326
dyr16  .0063943 .1200121 0.05 0.958 .2458748 .2330862
dyr17  .0048495 .0958629 0.05 0.960 .196141 .1864419
dyr18  .0044938 .0814841 0.06 0.956 .1670927 .1581052
dyr19  .0034812 .0606133 0.06 0.954 .1244332 .1174709
dyr20  .0022393 .0390345 0.06 0.954 .0801315 .0756529
dyr21  (dropped)
_cons  .0001079 .0218898 0.00 0.996 .0437884 .0435726

. test e_1
( 1) e_1 = 0
F( 1, 68) = 0.01
Prob > F = 0.9303
The autocorrelation is gone. This is an acceptable short run model.
The only thing you miss under this scheme is the speed of adjustment to disequilibrium. This is beyond our
scope because, so far, there are no good tests for cointegration in panel data.
Estimating first difference models has several advantages. Autocorrelation is seldom a problem (unless
there are misspecified dynamics and you have to add more lags). Also, variables that are in first
differences are, usually, almost orthogonal. So, if they have very low correlations, omitted variables is
much less of a problem than in regressions in levels, because, as we know from the omitted variable
theorem, omitting a variable that is uncorrelated with the remaining variables in a regression will not bias
the coefficients on those remaining variables. For the same reason, dropping insignificant variables from a
first difference regression seldom makes a noticeable difference in the estimated coefficients on the
remaining variables.
Clustering
A recent investigation of the effect of socalled “shall issue” laws on crime was done by Lott and Mustard
12
Shallissue laws are state laws that make it easier for ordinary citizens to acquire concealed weapons
permits. If criminals suspect that more of their potential victims are armed, they may be dissuaded from
attacking at all. In fact, Lott and Mustard found that shallissue laws do result in lower crime rates. Lott and
Mustard estimated a host of alternative models, but the model most people focus on is a fixed effects model
on all 3000 counties in the United States for the years 1977 to 1993. The target variable was a dummy
variable for those states in those years for which a shallissue law was in effect.
The clustering issue arises in this case because the shallissue law pertains to all the counties in each state.
So all the counties in Virginia, which has had a shallissue law since 1989, get a one for the shallissue
dummy. There is no withinstate variation for the dummy variable across the counties in the years with the
law in effect. The shall issue dummy is said to be “clustered” within the states. In such cases, OLS
12
Lott, John and David Mustard, “Crime, Deterrence and RighttoCarry Concealed Handguns,” Journal of Legal
Studies, 26, 1997, 168.
213
underestimates the standard errors and overestimates the tratios. This is a kind of second order bias similar
to autocorrelation. Lott has since reestimated his model correcting for clustering.
If the data set consists of observations on states and we cluster on states, we get standard errors and tratios
adjusted for both heteroskedasticity and autocorrelation (because the only observations within states are
across years, i.e., a time series). The standard errors are not consistent as T →∞because as T goes to
infinity, the number of covariances within each state goes to infinity and the computation cumulates all the
errors. That is why Newey and West weight the long lags at zero so as to limit the number of
autocorrelations that they include in their calculations. However, since T will always be finite, we do not
have an infinite number of autocorrelations within each state, so we might try clustering on states to see if
we can generate heteroskedasticity and autocorrelationcorrected standard errors.
. areg lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534 yr2yr21,
absorb(state)
> cluster(state)
Regression with robust standard errors Number of obs = 105
F( 3, 74) = 9.79
Prob > F = 0.0000
Rsquared = 0.9168
Adj Rsquared = 0.8831
Root MSE = .08018
(standard errors adjusted for clustering on state)

 Robust
lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lprison  .4886959 .4061591 1.20 0.233 1.297986 .3205937
lmetpct  .0850076 1.391452 0.06 0.951 2.687522 2.857537
lblack  1.006078 .7206686 1.40 0.167 2.442041 .4298858
lrpcpi  .9622313 .303612 3.17 0.002 .3572712 1.567191
lp1824  .5760403 .1872276 3.08 0.003 .9490994 .2029813
lp2534  1.058662 .5089163 2.08 0.041 2.0727 .0446244
_cons  6.137487 8.352663 0.73 0.465 10.50556 22.78053
+
state  absorbed (5 categories)
The coefficients on the year dummies have been suppressed to save space. The standard errors are much
different, and much less significant. This probably means we had serious autocorrelation in the static
model.
Monte Carlo studies show that standard errors clustered by the crosssection identifier in a fixed effect
panel data regression (e.g. state in a stateyear panel) are approximately correct. This is true for any
arbitrary pattern of heteroskedasticity. It is also true for stationary variables with any kind of
autocorrelation. Finally, clustered standard errors are approximately correct for nonstationary random
walks whether they are independent or cointegrated. In summary, use standard errors clustered at the cross
section level for any panel data study.
214
Other Panel Data Models
While the fixed effects model is by far the most widely used panel data model, there are others that have
been used in the past and are occasionally used even today. But first, a digression.
Digression: the between estimator
13
Consider another very simple model, called the “between” model. In this approach we simply compute
averages of each state over time to create a single cross section of overall state means.
1
.  .
. 

` A ·
=
= ÷
`
where
. i
Y is the mean for state i. and
, . k i
X are the corresponding state means for the independent variables.
Note that the period indicates that the means have been taken over time so that the time subscript is
suppressed. This model is consistent if the pooled (OLS) estimator is consistent. It has the property that it is
robust to measurement error in the explanatory variables and is sometimes used to alleviate such problems.
The compression of the data into averages results in the discarding of potentially useful information. For
this reason the model is not efficient. The model ignores all variation that occurs within the state (across
time) and utilizes only variation across (between) states. By contrast, the fixed effects model can be
estimated by subtracting off the state means. Therefore, the fixed effect model uses only the information
discarded by the between estimator. It completely ignores the cross section information utilized by the
between estimator. For that reason, the fixed effects model is not efficient. The fixed effects model is a pure
time series model. It utilizes only the time series variation for each state. This is the reason that the FE
model is also known as the “within” model.
Because the between model uses only the cross section information and the within model uses only the time
series information, the two models can be combined. The OLS model is a weighted sum of the between and
within estimators.
The Random Effects Model
The main competitor to the fixed effects model is the random effects model. Suppose we assume that the
individual state intercepts are independent random variables with mean α and variance
. This means
that we are treating the individual state intercepts as a sample from a larger population of intercepts and we
are trying to infer the true population values from the estimated coefficients. The model can be summarized
as follows.
1
ì . ì . ì   . ì

` A
=
= ÷ ÷ ÷
`
where
. ì . . ì
= ÷ . Note that the intercepts are now in the error term and are omitted from the model.
13
See Johnston, J. and J. Dinardo, Econometric Methods, 4
th
Edition, McGrawHill, 1997, pp. 388411, for an
excellent discussion of panel data issues.
215
The random effects model can be derived as a feasible generalized least squares (FGLS) estimator applied
to the panel data set. While the FE model subtracts off the group means from each observation, the RE
model subtracts off a fraction of the group means, the fraction depending on the variance of the state
dummies, the variance of the overall error term, and the number of time periods. The random effects model
is efficient relative to the fixed effect model (because the latter ignores between state variation). The
random effects model has the same problem as the OLS estimator, namely, it is biased in the presence of
unobserved heterogeneity because it omits the individual state intercepts. However, the random effects
model is consistent as N (the number of cross sections, e.g., states) goes to infinity (T constant) .
The random effects model can also be considered a weighted average of the between and within estimators.
The weight depends on the same variances. If the weight is zero then the RE model collapses to the pooled
OLS model. If the weight equals one, the RE model becomes the fixed effects model. Also, as the number
of time periods, T, goes to infinity, the RE model approaches the FE model asymptotically.
The random effects model cannot be used if the data are nonstationary.
Choosing between the Random Effects Model and the Fixed
Effects Model.
The main consideration in choosing between the fixed effects model and the random effects model is
whether you think there is unobserved heterogeneity among the cross sections. If so, choosing the random
effects model omits the individual state intercepts and suffers from omitted variable bias. In this case the
only choice is the fixed effects model. Since I cannot imagine a situation where I would be willing to
assume away the possibility of unobserved heterogeneity, I always choose the fixed effects model. If the
crosssection dummies turn out to be insignificant as a group, I might be willing to consider the random
effects model.
If you are not worried about unobserved heterogeneity, the next consideration is to decide if the
observations can be considered a random sample from a larger population. This would make sense, for
example, if the data set consisted of observations on individual people who are resurveyed at various
intervals. However, if the observations are on states, cities, counties, or countries it doesn’t seem to make
much sense to assume that Oklahoma, for example, has been randomly selecting from a larger population
of potential but unrealized states. There are only fifty states and we have data on all fifty. Again, I lean
toward the FE model.
Another way of considering this question is to ask yourself if you are comfortable with generating
estimates that are conditional on the individual states. This is appropriate if we are particularly interested in
the individuals that are in the sample. Clearly, if we have data on individual people, we might be
uncomfortable making inferences conditional on these particular people.
On the other hand, individuals can have unique unobserved heterogeneity. Suppose we are estimating an
earnings equation with wages as the dependent variable and education, among other things, as explanatory
variables. Suppose there is a an unmeasured attribute called “spunk:.” Spunky people are likely to get more
education and earn more money than people without spunk. The random effects model will overestimate
the effect of education on earnings because the omitted variable, spunk, is positively correlated with both
earnings and education. If people are endowed with a given amount of spunk that doesn’t change over time,
then spunk is a fixed effect and the fixed effects model is appropriate.
With respect to the econometric properties of the FE and RE models, the critical assumption of the RE
model is not whether the effects are fixed. Both models assume that they are in fact fixed. The crucial
assumption for the random effects model is that the individual intercepts are not correlated with any of the
explanatory variables, i.e.,
.  . ì
1 .
l
=
l
l
for all i,k. Clearly this is a strong assumption.
216
The fixed effects model is the model of choice if there is a reasonable likelihood of significant fixed effects
that are correlated with the independent variables. It is difficult to think of a case where correlated fixed
effects can be ruled out a priori. I use the fixed effects model for all panel data analyses unless there is a good
reason to do otherwise. I have never found a good reason to do otherwise.
HausmanWu test again
The HausmanWu test is (also) applicable here. If there is no unobserved heterogeneity, the RE model is
correct and RE is consistent and efficient. If there is significant unobserved heterogeneity, the FE model is
correct, there are correlated fixed effects, and the FE model is consistent and efficient while the RE model is
now inconsistent. If the FE model is correct, then the estimated coefficients from the FE model will be
unbiased and consistent while the estimated coefficients from the RE model will be biased and inconsistent.
Consequently, the coefficients will be different from each other, causing the test statistic to be a large
positive number. On the other hand, if the RE model is correct, then both models yield unbiased and
consistent estimates, so that the coefficient estimates will be similar. (The RE estimates will be more
efficient.) The HW statistic should be close to zero. Thus, if the test is not significantly different from zero,
the RE model is correct. If the test statistic is significant, then the FE model is correct.
This test has one serious drawback. It frequently lacks power to distinguish the two hypotheses. In other
words, there are two ways this model could generate an insignificant outcome. One way is to for the
estimated coefficients to be very close to each other with low variance. This is the ideal outcome because
the result is a precisely measured zero. In this case one can be confident in employing the RE model. On
the other hand, the coefficients can be very different from each other but the test is insignificant because of
high variance associated with the estimates. In this case it is not clear that the RE model is superior.
Unfortunately, this is a very common outcome of the HausmanWu test.
My personal opinion is that the HausmanWu test is unnecessary for most applications of panel data in
economics. Because states, counties, cities, etc. are not random samples from a larger population, and
because correlated fixed effects are so obvious and important, the fixed effects model is to be preferred on
a priori grounds. However, if the sample is a panel of individuals which could be reasonably considered a
sample from a larger population, then the HausmanWu test is appropriate and is available in Stata using
the “hausman” command. Type “help hausman” in Stata for more information.
The Random Coefficients Model.
In this model we allow each state to have its own regression. The model was originally suggested by
Hildreth and Houck and Swamy.
14
The basic idea is that the coefficients are random variables such that
, , i k k i k
v γ γ = + where the error term
, i k
v has a zero mean and constant variance. The model is estimated in
two steps. In the first step we use ordinary least squares to estimate a separate equation for each state. We
then compute the mean group estimator
,
1
ˆ
N
k i k
i
γ γ
=
=
¯
. In the second step we use the residuals from the state
regressions to compute the variances of the estimates. The resulting generalized least squares estimate of
ˆ
k
γ is a weighted average of the original ordinary least squares estimates, where the weights are the inverses
of the estimated variances.
14
Hildreth, C. and C. Houck, "Some Estimators for a Linear model with Random Coefficients," Journal of the
American statistical Association, 63, 1968, 584595. Swamy, P. "Efficient Inference in a Random Coefficient
Regression Model," Econometrica, 38, 1970, 31132.
217
As a byproduct of the estimation procedure, we can test the null hypothesis that the coefficients are
constant across states. The test statistic is based on the difference between the state by state OLS
coefficients and the weighted average estimate.
This method is seldom employed. It tends to be very inefficient simply because we usually do not have a
long enough panel to generate precise estimates of the individual state coefficients. As a result, the
coefficients have high standard errors and low tratios. Also, even if the test indicates that the coefficients
vary across states, it is not clear whether the variance is really caused by different parameters or simple
random variation across parameters that are really the same. Finally, the model does not allow us to
incorporate a time dependent effect,
t
β , which would control for common effects across all states. These
disadvantages of the random coefficients model outweigh its advantages. I prefer the fixed effects model.
The random coefficients model is available in Stata using the xtrchh procedure (crosssection timeseries
random coefficient HildrethHouck procedure). We can use this technique to estimate the same crime
equation.
. xtrchh lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534
HildrethHouck randomcoefficients regression Number of obs = 1070
Group variable (i): state Number of groups = 51
Obs per group: min = 20
avg = 21.0
max = 21
Wald chi2(6) = 24.35
Prob > chi2 = 0.0004

lcrmaj  Coef. Std. Err. z P>z [95% Conf. Interval]
+
lprison  .1703223 .1173487 1.45 0.147 .4003215 .059677
lmetpct  52.61075 49.73406 1.06 0.290 150.0877 44.86622
lblack  .0916276 7.147835 0.01 0.990 13.91787 14.10113
lrpcpi  .4169877 .1691149 2.47 0.014 .7484469 .0855285
lp1824  1.471787 .3471744 4.24 0.000 2.152237 .7913379
lp2534  .4542666 .3208659 1.42 0.157 .1746191 1.083152
_cons  243.5776 228.6247 1.07 0.287 204.5186 691.6737

Test of parameter constancy: chi2(350) = 37470.75 Prob > chi2 = 0.0000
This model is similar to the others estimated above. The fact that the coefficients are significantly different
across states should not be a major concern, because we already know that the state intercepts are different.
For that reason we use the fixed effects model. We really don’t care about the coefficients on the control
variables (metpct, black, income, and the age groups) so we don’t need to allow their coefficients to vary
across states. If we truly believe that a target variable (prison in this case) might have different effects
across states, we could multiply prison by each of the state dummies, create 51 prisonstate variables and
include them in the fixed effects model. This combines the best attributes of the random coefficients model
and the fixed effects model.
Stick with the fixed effects model for all your panel data estimation needs.
218
Index
#delimit;, 36
_n, 13, 19, 178
ADL, model, 147
AIC, 172, 173
Akaike information criterion, 172
append, 24
Augmented DickeyFuller, 166
Bayesian Information Criterion, 172
bgodfrey
Stata command, 156, 157, 158
bias
definition, 59
BIC, 172, 173, 175
block recursive equation system, 141
BreuschGodfrey, 156
BreuschPagan test, 110
Chi2(df, x), 15
Chi2tail(df,x), 16
Chisquare distribution
definition, 67
Chow test, 95
CochraneOrcutt, 151, 158, 159, 160,
161
Cointegration, 181
comma separated values, 21
consistency
definition, 60
consistency of the ordinary least
estimator, 65
control variable
definition, 89
Copy Graph, 30
correlate, 44
correlation coefficient, 83
definition, 44
covariance
definition, 43
csv, 21
data editor, 6
describe, 25
dfbeta, 106
dfgls
Stata command, 175, 179
DFGLS, 174, 175, 179, 180
dfuller
Stata command, 167, 168, 169, 172,
173, 182, 183
DickeyFuller, 166, 167, 168, 172, 173,
176, 177, 182
dofile, 35
dofile editor, 35
drop, 28
dummy variable, 7, 8, 91, 95, 145, 186,
187, 204
dummy variable trap, 93
Dummy variables, 91
DurbinWatson, 154, 155, 156, 161
dwstat
Stata command, 156
dynamic ordinary least squares (DOLS),
184
e(N), 134
e(r2), 134
efficiency
definition, 59
empirical analog, 80
219
endogenous variable, 126, 143, 146
ereturn, 102, 134
error correction model, 150, 185, 198
Excel, 20, 21
exogenous variables, 125, 126, 128, 129,
130, 131, 143
Fdistribution
definition, 69
first difference, 8, 149, 151, 159, 181,
185, 199, 200, 204
firstdifference, 18
fitstat
Stata command, 173
Fixed Effects Model, 187
Ftest, 94, 95, 97, 100, 103, 135, 160,
163
GaussMarkov assumptions, 61, 80, 110,
120, 152, 164
GaussMarkov theorem, 58, 62
Gen, 7
general to specific, 163
general to specific modeling strategy, 89
generate, 7, 28
Granger causality test, 98
graph twoway, 30
HausmanWu test, 122, 136, 137, 138,
143, 144, 146, 208
heteroskedastic consistent standard error,
117
hettest
Stata command, 112
iid, 62, 110, 152, 167
Influential observations, 104
insheet, 21, 22, 23
instrumental variables, 80
derivation, 121
Invchi2(df,p), 16
Invchi2tail, 16
Invnorm(prob), 16
irrelevant variables, 89
ivreg
Stata command, 123, 131, 132, 133,
135, 139, 144
Jtest, 100
Jtest for overidentifying restrictions,
135
keep, 28
label variable, 19
Lag, 148, 158, 175, 179, 180, 185
lagged endogenous variables, 142
Lags, 18
Leads, 18
likelihood ratio test, 160
line graph, 30
list, 7, 9, 14, 15, 16, 24, 25, 26, 27, 38,
91, 102, 107, 126, 134, 135, 145, 157
LM test, 100, 102, 103, 110, 133, 135,
155, 156, 157, 158, 161, 186
log file, 35
match merge, 24
mean square error
definition, 60
Missing values, 17
multicollinearity, 107
newey
Stata command, 162, 163, 184
Norm(x), 16
normal distribution, 66
omitted variable bias, 84, 93, 132, 188
omitted variable theorem, 85
onetoone merge, 24
order condition for identification, 129
overid
Stata command, 133
population
scaling per capita values, 28
proxy variables, 90
Random Coefficients Model, 208
Random Effects Model, 206
random numbers, 17
random walk, 149, 164, 165, 175, 176,
177, 179, 181, 183
recursive system, 141
reduced form, 125, 126, 128, 129, 135,
137, 139, 145
reg3
Stata command, 139, 140
regress, 9, 11, 36, 37, 40, 41, 44, 84,
85, 92, 93, 94, 95, 96, 98, 99, 102,
105, 106, 112, 115, 117, 118, 119,
122, 123, 124, 128, 130, 131, 133,
134, 135, 136, 137, 138, 143, 156,
220
157, 166, 168, 169, 172, 173, 180,
182, 185, 189
relative efficiency
definition, 59
replace, 9, 14, 23, 24, 25, 28, 29, 36,
105, 165, 166
robust
Regrsss option, 117
Robust standard errors and tratios, 116
sampling distribution, 63
sampling distribution of, 59
save, 11, 21, 22, 23, 24, 25, 35, 133,
137, 139, 143, 156, 157, 158, 159, 160
saveold, 23
scalar, 16, 91, 102, 134, 135, 157,
160
scatter diagram, 31
Schwartz Criterion, 172, 175
Seasonal difference, 18
Seemingly unrelated regressions, 138
set matsize, 36
set mem, 36
set more off, 36
sort, 17
spurious regression, 165, 166, 175, 176
stationary, 164
sum of corrected crossproducts, 43
summarize, 9, 15, 17, 25, 26, 27, 28,
36, 37, 42, 91, 104, 180
sureg, 139
System variables, 19
tabulate, 27
target variable
definition, 89
tdistribution
definition, 71
test
Stata command, 95
Testing for unit roots, 166
testparm
Stata command, 97
Tests for over identifying restrictions,
133
three stage least squares, 139
too many control variables, 89
tratio, 73
tsset, 8, 17, 18, 19, 98, 130, 156, 178
two stage least squares, 122, 128, 130,
136, 139, 140, 143, 145, 146
Types of equation systems, 140
Uniform(), 17
unobserved heterogeneity, 189, 190, 208
use command, 23
Variance inflation factors, 107
varlist
definition, 14
weak instruments, 135
Weight, 14
weighted least squares, 113
White test, 113
xtrchh
Stata command, 209
221
Table of Contents
1 AN OVERVIEW OF STATA ......................................................................................... 5 Transforming variables ................................................................................................... 7 Continuing the example .................................................................................................. 9 Reading Stata Output ...................................................................................................... 9 2 STATA LANGUAGE ................................................................................................... 12 Basic Rules of Stata ...................................................................................................... 12 Stata functions ............................................................................................................... 14 Missing values .......................................................................................................... 16 Time series operators. ............................................................................................... 16 Setting panel data. ..................................................................................................... 17 Variable labels. ......................................................................................................... 18 System variables ....................................................................................................... 18 3 DATA MANAGEMENT............................................................................................... 19 Getting your data into Stata .......................................................................................... 19 Using the data editor ................................................................................................. 19 Importing data from Excel ........................................................................................ 19 Importing data from comma separated values (.csv) files ........................................ 20 Reading Stata files: the use command ...................................................................... 22 Saving a Stata data set: the save command ................................................................... 22 Combining Stata data sets: the Append and Merge commands .................................... 22 Looking at your data: list, describe, summarize, and tabulate commands ................... 24 Culling your data: the keep and drop commands.......................................................... 27 Transforming your data: the generate and replace commands ..................................... 27 4 GRAPHS ........................................................................................................................ 29 5 USING DOFILES......................................................................................................... 34 6 USING STATA HELP .................................................................................................. 37 7 REGRESSION ............................................................................................................... 41 Linear regression ........................................................................................................... 43 Correlation and regression ............................................................................................ 46 How well does the line fit the data? .......................................................................... 48 Why is it called regression? .......................................................................................... 49 The regression fallacy ................................................................................................... 50 Horace Secrist ........................................................................................................... 53 Some tools of the trade ................................................................................................. 54
1
Summation and deviations .................................................................................... 54 Expected value .......................................................................................................... 55 Expected values, means and variance ....................................................................... 56 8 THEORY OF LEAST SQUARES ................................................................................ 57 Method of least squares ................................................................................................ 57 Properties of estimators................................................................................................. 58 Small sample properties ............................................................................................ 58 Bias ....................................................................................................................... 58 Efficiency .............................................................................................................. 58 Mean square error ................................................................................................. 59 Large sample properties ............................................................................................ 59 Consistency ........................................................................................................... 59
ˆ Mean of the sampling distribution of β ................................................................... 63 ˆ Variance of β ............................................................................................................ 63 Consistency of OLS .................................................................................................. 64 Proof of the GaussMarkov Theorem ........................................................................... 65 Inference and hypothesis testing ................................................................................... 66 Normal, Student’s t, Fisher’s F, and Chisquare ....................................................... 66 Normal distribution ................................................................................................... 67 Chisquare distribution.............................................................................................. 68 Fdistribution............................................................................................................. 70 tdistribution.............................................................................................................. 71 Asymptotic properties ................................................................................................... 73 Testing hypotheses concerning .................................................................................. 73 Degrees of Freedom ...................................................................................................... 74 Estimating the variance of the error term ..................................................................... 75 Chebyshev’s Inequality................................................................................................. 78 Law of Large Numbers ................................................................................................. 79 Central Limit Theorem ................................................................................................. 80 Method of maximum likelihood ................................................................................... 84 Likelihood ratio test .................................................................................................. 86 Multiple regression and instrumental variables ............................................................ 87 Interpreting the multiple regression coefficient. ....................................................... 90 Multiple regression and omitted variable bias .............................................................. 92 The omitted variable theorem ....................................................................................... 93 Target and control variables: how many regressors?.................................................... 96 Proxy variables.............................................................................................................. 97 Dummy variables .......................................................................................................... 98 Useful tests .................................................................................................................. 102 Ftest ....................................................................................................................... 102 Chow test ................................................................................................................ 102 Granger causality test ............................................................................................. 105 Jtest for nonnested hypotheses ............................................................................. 107 LM test .................................................................................................................... 108 9 REGRESSION DIAGNOSTICS ................................................................................. 111
2
Influential observations ............................................................................................... 111 DFbetas ................................................................................................................... 113 Multicollinearity ......................................................................................................... 114 Variance inflation factors ........................................................................................ 114 10 HETEROSKEDASTICITY ....................................................................................... 117 Testing for heteroskedasticity ..................................................................................... 117 BreuschPagan test .................................................................................................. 117 White test ................................................................................................................ 120 Weighted least squares ................................................................................................ 120 Robust standard errors and tratios ............................................................................. 123 11 ERRORS IN VARIABLES ....................................................................................... 127 Cure for errors in variables ......................................................................................... 128 Two stage least squares ............................................................................................... 129 HausmanWu test ........................................................................................................ 129 12 SIMULTANEOUS EQUATIONS............................................................................. 132 Example: supply and demand ..................................................................................... 133 Indirect least squares ................................................................................................... 135 The identification problem .......................................................................................... 136 Illustrative example..................................................................................................... 137 Diagnostic tests ........................................................................................................... 139 Tests for over identifying restrictions ..................................................................... 140 Test for weak instruments ....................................................................................... 143 HausmanWu test .................................................................................................... 143 Seemingly unrelated regressions................................................................................. 146 Three stage least squares ............................................................................................. 146 Types of equation systems .......................................................................................... 148 Strategies for dealing with simultaneous equations .................................................... 149 Another example ......................................................................................................... 150 Summary ..................................................................................................................... 152 13 TIME SERIES MODELS .......................................................................................... 154 Linear dynamic models ............................................................................................... 154 ADL model ............................................................................................................. 154 Lag operator ............................................................................................................ 155 Static model ............................................................................................................ 156 AR model ................................................................................................................ 156 Random walk model ............................................................................................... 156 First difference model ............................................................................................. 156 Distributed lag model .............................................................................................. 157 Partial adjustment model......................................................................................... 157 Error correction model ............................................................................................ 157 CochraneOrcutt model ........................................................................................... 158 14 AUTOCORRELATION ............................................................................................ 159 Effect of autocorrelation on OLS estimates ................................................................ 160 Testing for autocorrelation .......................................................................................... 161 The Durbin Watson test .......................................................................................... 161 The LM test for autocorrelation .............................................................................. 162
3
... 218 4 ............................... 215 HausmanWu test again ................................................... 214 Digression: the between estimator ............................................................. 216 The Random Coefficients Model............... 191 Error correction model ...................................................................................................................................................................... 206 Clustering .......... 182 Why unit root tests have nonstandard distributions .... 188 Dynamic ordinary least squares .......................................... 171 Random walks and ordinary least squares ........ 165 The CochraneOrcutt method ...................................... 202 Linear trends ......................... 185 Cointegration...................................................................................................................................................... 194 The Fixed Effects Model ....... 170 15 NONSTATIONARITY.................................................................................................................. UNIT ROOTS......................................................... 191 17 PANEL DATA MODELS ............................. 194 Time series issues ........................................................................................................................................................................................................................................................................................................................................... 168 Heteroskedastic and autocorrelation consistent standard errors ...................... difference stationarity ...................................................................................................................... 212 Other Panel Data Models ............................................................................. 181 Trend stationarity vs.................................................................................................................................................................................. AND RANDOM WALKS ........................................................ 183 16 ANALYSIS OF NONSTATIONARY DATA ..... 169 Summary .................. 165 Curing autocorrelation with lags .......................................................................................................................................................................................... 214 The Random Effects Model .............................................................................................................................................................................................................. 164 Cure for autocorrelation ........................................................... 173 Choosing the number of lags with F tests ........................................................................................................................................................................................................................................ ................................. 179 DFGLS test .......................................................... 178 Choosing lags using model selection criteria.......................... ....................................... 172 Testing for unit roots ................ 216 Index ............................ 176 Digression: model selection criteria...... 202 Unit roots and panel data ........Testing for higher order autocorrelation .. 214 Choosing between the Random Effects Model and the Fixed Effects Model........................................................
I recommend using “do” files consisting of a series of Stata commands for 5 . Starting Stata Stata is available on the server. The simplest way is to enter commands interactively. (This is the current version on my computer. Results are shown on the right. If you want to reenter the command you can double click on it. The commands are entered in the “Stata Command” window (along the bottom in this view). You can also use the buttons on the toolbar at the top. The version on the server may be different.) There are many ways to do things in Stata. When invoked. allowing Stata to execute each command immediately. In this chapter we will take a short tour of Stata to get an appreciation of what it can do. where you can edit it before submitting. After the command has been entered it appears in the “Review” window in the upper left. the screen should look something like this. A single click moves it to the command line.1 AN OVERVIEW OF STATA Stata is a computer program that allows the user to perform a wide variety of statistical analyses.
or click on the Data Editor button.115 6 . (2) Look at the data by printing it out. Don’t type the variable names. However. You can type across the rows or down the columns. the number of members. and wealth. In the example below. we have 10 observations on 4 variables. graphing it. The easiest way to get these data into Stata is to use Stata’s Data Editor. Similarly. (1) Get the data into Stata. (4) Analyze the transformed data using regression or some other procedure.855 . or divide by a price index to get real dollars. the annual membership dues. while there will be 51 rows (observations) corresponding to the years 1948 to 1998. you collect data.031 1. Next. Consider the following data set. use tab to enter the number. The data matrix will consist of three columns corresponding to the variables consumption. you might want to divide by population to get per capita values.000 1. In this chapter we will do a minianalysis illustrating the four econometric steps. or take logs. The first is to choose a topic.751 . income and wealth for the United States annually for the years 19481998. use the enter key after typing each value. Once you have the data. Year 78 79 80 81 82 83 84 85 Members 126 130 123 122 115 112 139 127 Dues 135 145 160 165 190 195 175 180 CPI . The econometric part consists of four steps. It corresponds to some data collected at a local recreation association and consists of the year. Different people will find different ways of doing the same task. you could have observations on consumption. Click on Data on the task bar then click on Data Editor in the resulting drop down menu.652 . Type the following numbers into the corresponding columns of the data editor. just the numbers. If you type across the rows. the way to learn a statistical language is to do it. and the consumer price index. you are ready to do the econometric analysis. with columns corresponding to variables and rows corresponding to observations. Then you do library research to find out what research has already been done in the area. The result should look like this. (3) Transform the data. and summarizing it. For example. income. Your data set should be written as a matrix (a rectangular array) of numbers. If you type down the columns. The data matrix will therefore have four columns and 10 rows.941 1.complex jobs (see Chapter 5 below). which may be repeated several times. An econometric project consists of several steps.077 1. The final step in the project is to write up the results.
Gen is short for “generate” and is probably the command you will use most. Hit <enter> at the end of each line to execute the command. In the example above. gen logmem=log(members) gen logdues=log(dues) gen dum=(year>83) list [output suppressed] We almost always want to transform the variables in our dataset before analyzing them. cpi).135 1. Transforming variables Since we have seen the data. The logarithm function is one of several in 7 . Double click on “var1” at the top of column one and type “year” into the resulting dialog box.160 When you are done typing your data editor screen should look like this. Click on “preserve” and then the red x on the data editor header to exit the data editor. For a Phillips curve. we don’t have to look at it any more. Type the following commands one at a time into the “Stata command” box along the bottom. dues. Notice that the new variable names are in the variables box in the lower left. We have already seen how to tell Stata to take the logarithms of variables. All transformations like these are accomplished using the gen command. Now rename the variables. we transformed the variables members and dues into logarithms. The list command shows you the current data values. Repeat for the rest of the variables (members. you might want the rate of inflation as the dependent variable while using the inverse of the unemployment rate as the independent variable. Suppose we want Stata to compute the logarithms of members and dues and create a dummy variable for an advertising campaign that took place between 1984 and 1987. Let’s move on to the transformations.86 87 133 112 180 190 1.
cpi Lags can also be calculated. absolute value. For example. and division. In the example above we created a dummy variable. 8 . Some of the functions commonly used in econometric work are listed in Chapter 2. The lagged difference ∆cpit −1 can be written as gen dcpi_1=LD. Stata has exponentiation. below. gen d2cpi=D. If YEAR is greater than 83 then the variable "DUM" is given the value 1 (“true”). For example. As another example. subtraction. the one period lag of the CPI (cpi(t1)) can be calculated as. gen cpi_1=L.” operator to take first differences ( ∆yt ). "DUM" is given the value 0 (“false”). we would write. called "DUM. square root. gen dgnp=d. multiplication.cpi It is often necessary to make comparisons between variables. If YEAR is not greater than 83." by the command gen dum=(year>83) This works as follows: for each line of the data. lags. Stata reads the value in the column corresponding to the variable YEAR.cpi These operators can be combined. To do this we first have to tell Stata that you have a time series data set using the tsset command. differences and comparisons.cpi or gen d2cpi=D2. suppose we want to take the first difference of the time series variable GNP (defined as GNP(t)GNP(t1)). Along with the usual addition. logarithms. gen invunem=1/unem where unem is the name of the unemployment variable.Stata's function library. gen dcpi=D.gnp I like to use capital letters for these operators to make them more visible. tsset year Now you can use the “d. to compute the inverse of the unemployment rate. It then compares it to the value 83.dcpi or gen d2cpi=DD.cpi The second difference ∆∆cpit = ∆ 2 cpit = ∆ (cpit − cpit −1 ) = ∆cpit − ∆cpit −1 be calculated as.
a simple ordinary least squares (OLS) regression of members on real dues.02765 78 87 members  10 123.160 (Whenever you redefine a variable. 2. standard deviations. Min Max +year  10 82.934474 +logdues  10 5. Dev. the log function produces natural logs.0727516 4. and the trend is simply a counter (0.16 logmem  10 4.00694 135 195 cpi  10 . We are now ready to do the analysis. Reading Stata Output 9 . regress members rdues This command produces the following output. you must use replace instead of generate.5 3.5 20.) gen rdues=dues/cpi gen logrdues=log(rdues).4 . one at a time. usually called “t” in theoretical discussions.905275 5. gen trend=year78 list summarize Variable  Obs Mean Std.Continuing the example Let’s continue the example by doing some more transformations and then some analyses.273 dum  10 .1711212 .718499 4. ending with <enter> into the Stata command box. etc.138088 .02765 1 10 (Note: rdues is real (inflationadjusted) dues.5163978 0 1 trend  10 5.9717 . Type the following commands (indicated by the courier new typeface).5 3.9 8. for each year.) The summarize command produces means.1219735 4. 1.817096 .999383 112 139 dues  10 171. replace cpi=cpi/1. etc.652 1.
If this value is smaller than . using Student’s t distribution. the overall Fstatistic that tests the null hypothesis that the Rsquared is equal to zero. RSS and the residual sum of squares as the error sum of squares. RSS. the standard errors. Mathematically it is N ˆ ˆ σ 2 = σ u2 = ei2 ( N − k ) i =1 Where e is the residual from the regression (the difference between the actual value of the dependent variable and the predicted value from the regression: ˆ ˆ ˆ Yi = α + β X i ˆ Y =Y +e i i i ˆ ˆ = α + β X + ei ˆ ˆ e = Y −α − β X i i i The panel on the top right has the number of observations. the probvalues. S βˆ the tratios. and the root mean square error (the square root of the residual sum of squares divided by its degrees of freedom. The estimated coefficient for rdues is ˆ β = xy x 2 (= −. N (=10) minus the number of parameters estimated. or sample size. sometimes SSE). I will refer to the model sum of squares as the regression sum of squares. SSU. also known as the standard error of the estimate. The residual mean square is also known as the “mean square error” (MSE). the model sum of squares is also known as the explained sum of squares. the residual variance. The model degrees of freedom is the number of independent variables.a. RSS (or SSR).k. ESS (or. or the error variance.Most of this output is selfexplanatory. the corresponding estimated coefficients. ESS. TSS.k.1310586) .83) . The probvalue for the ttest of no significance is the probability that the test statistic T would be observed if the null hypothesis is true. The bottom table shows ˆ the variable names. the Rsquared adjusted for the number of explanatory variables (a. Under SS is the model sum of squares (58.} ˆ The corresponding standard error for β is S βˆ = ˆ σ2 x2 (= . The next column contains the degrees of freedom. α = Y − β X (= 151. Each textbook writer chooses his or her own convention. The fact that y is in lower case indicates that it is expressed as deviations from the sample mean The Residual SS (670.a. β . Rsquared. The residual degrees of freedom is the number of observations. The upper left section is called the Analysis of Variance. SSE).05 then 10 . The regression output immediately above is read as follows.0834) .1573327) . ESS. it is ˆ y i =1 N 2 i where the circumflex indicates that y is predicted by the regression. the probvalue corresponding to this Fratio. ˆ ˆ {The other coefficient is the intercept or constant term. not including the constant term. or the standard error of the regression. so that Nk=8. SEE. k (=2). or the regression sum of squares. SSR unexplained sum of squares.1763689). twotailed. and error sum of squares. The resulting tratio for the null hypothesis that =0 is ˆ T = β S βˆ ( = −0. USS.723631) is the residual sum of squares (a. Mathematically. The third column is the “mean square” or the sum of squares divided by the degrees of freedom. For example. the SEE is the corresponding standard deviation (square root). The sum of the RSS and the ESS is the total sum of squares. This number is referred to in a variety of ways. Since the MSE is the error variance. and the confidence interval around the estimates. SER). Rbar squared).
038417989 3 . the derivative of a log with respect to a log is an elasticity. Once you have blocked the text you want to save. type save “H:\Project\test.0432308 . the pool can expect to lose four percent of its members every year. clicking on the first bit of output you want to save. Saving your first Stata analysis Let’s assume that you have created a directory called “project” on the H: drive. Interval] +logrdues  . using the logarithm of members as the dependent variable and the log of real dues as the price. everything else staying the same (ceteris paribus). Save and browsing to the Project folder or by clicking on the little floppy disk icon on the button bar.989932 4. You don’t really need the quotes unless there are spaces in the pathname.012805996 Residual  .2203729 trend  .7098 .34 0. You could also create a log file containing your results. We discuss how to create log files below. and type <control>v for paste (or click on edit. 6) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 10 8. You can save your output by going to the top of the output window (use the slider on the right hand side). To save the data for further analysis.30 0.” We usually just say that is significant. On the other hand. (Mathematically. go to your favorite word processor.dta” You can also save the data by clicking on File. Let’s add the time trend and the dummy for the advertising campaign and do a multiple regression. You can read the data into Stata later with the command.dta” will create a Stata data set called “test.009217232 6 .dta” in the directory H:\Project. holding the mouse key down.005292802 Number of obs F( 3.0094375 4. Std.007035 . regress logmem logrdues trend dum Source  SS df MS +Model  . However.001536205 +Total  .6767872 .56 0. You can then write up your results around the Stata output. then drag the cursor.03919 logmem  Coef.8065 0.36665 1. t P>t [95% Conf. copy text). paste). to the bottom of the output window.004 .85 0.043 .557113 1.0146 0. Note that.) Is the demand curve downward sloping? Is price significant in the demand equation at the five percent level? At the ten percent level? The trend is the percent change in pool membership per year.0663234 . or parameters.4263  The coefficient on logrdues is the elasticity of demand for pool membership.0608159 2. since this is a multiple regression. the estimated coefficients.) According to this estimate.0201382 dum  . 11 . continuing the advertising campaign will generate a 16 percent increase in pool memberships every year.005 3. are the change in the dependent variable for a oneunit change in the corresponding independent variable.155846 . (The derivative of a log with respect to a nonlogged value is a percent change. You can type <control>c for copy (or click on edit.573947 .047635222 9 . use “H:\Project\test. holding everything else constant. Err.we “reject the null hypothesis that =0 at the five percent significance level.687924 13. it is good practice.3046571 _cons  8.58 0.114 1.
Every name must begin with a letter or an underscore. The names Var. For example. [by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [. a. _n is Stata’s variable containing the observation number. d.2 STATA LANGUAGE This section is designed to provide a more detailed introduction to the language of Stata.). However. You may not use these names for your variables. Basic Rules of Stata 1.g. Command syntax. The typical Stata command looks like this. for a full development you must refer to the relevant Stata manual. b. since Stata frequently uses that convention to make its own internal variables. e.. c. Case matters. The maximum number of characters in a name is 32. or underscores. etc. var1. numbers. it is still only an introduction. I recommend that you use lowercase only for variable names. Rules for composing variable names in Stata. and var are all considered to be different variables in Stata notation.#. _all _n _coeff _pred double _se in with long byte _pi e _rc if using _b _N _cons float _skip int 2. However. but shorter names work best. The following variable names are reserved. options] 12 . or use Stata Help (see Chapter 6). e. Subsequent characters may be letters. You cannot use special characters (&. VAR. I recommend that you not use underscore at the beginning of any variable name.
) fweight (frequency weights: indicates duplicate observations. We could have said. list year members rdues in 5/10 would list the observations from 5 to 10. members. etc. If no varlist is specified. usually population.where things in a square bracket are optional.160. states. It takes the form in number1/number2. the aweight is rescaled to sum to the number of observations in the sample. Options must be separated from the rest of the command by a comma. Note that Stata requires two equal signs to denote equality because one equal sign indicates assignment in the generate and replace commands. and rdues for 1983. e. It is not testing for the equality of the left and right hand sides of the expression. an fweight of 100 indicates that there are really 100 identical observations) pweight (sampling weights: the inverse of the probability that this observation is included in the sample. Consequently. A pweight of 100 indicates that this observation is representative of 100 subjects in the population. gen logrdues=log(dues) and replace cpi=cpi/1. For example. This is the most commonly used weight by economists. Each Stata command takes commandspecific options. usually. the “reg” command that we used above has the option “robust” which produces heteroskedasticconsistent standard errors.160. For most Stata commands. Stata printed out all the data for all the variables currently in memory. The weight option is always in square brackets and takes the form [weightword=exp] where the weight word is one of the following (default weight for each particular command) aweight (analytical weights: inverse of the variance of the observation. The definition of iweight depends on how it is defined in each command that uses them). the equal sign in the expression replace cpi=cpi/1. in the simple program above.g. 13 . Each observation in our typical sample is an average across counties. That is. The aweight is the variance divided by the number of elements. The “if exp” option restricts the command to a certain subset of observations. For example. etc.. all the variables are used. we used the command “list” with no varlist attached. (Obviously. that comprise the county. then. list members dues rdues to get a list of those three variables only. The “=exp” option is used for the “generate” and “replace” commands. For example. For example.) iweight (importance weight: indicates the relative importance of each observation.160 indicates that the variable cpi is to take the value of the old cpi divided by 1.) The “in range” option also restricts the command to a certain range. Actually the varlist is almost always required. We can also use the letter f (first) and l (last) in place of either number1 or number 2. Options. A varlist is a list of variable names separated by spaces. we could have written list year members rdues if year == 83. This would have produced a printout of the data for year. state.
14 . /. The data must be sorted by the varlist. (subtraction). see STATA USER’S GUIDE for a comprehensive list. *. but it never hurts to sort before using the by varlist option. etc. In this case the sort command is not necessary. Mathematical functions include: abs(x) max(x) min(x) sqrt(x) exp(x) log(x) absolute value returns the largest value returns the smallest value square root raises e (2. for the years without an advertising campaign (dum=0) and the years with an advertising campaign (dum=1). robust by varlist: Almost all Stata commands can be repeated automatically for each value of the variable or variables in the by varlist. Stata functions These are only a few of the many functions in Stata. ~=. +. Chi2(df.17828) to a specified power natural logarithm Some useful statistical functions are the following. <=. . ^. since the data are already sorted by dum. sort dum by dum: summarize members rdues will produce means. &. ==. >=.reg members rdues trend. >. This option must precede the command and be separated from the rest of the command by a colon. For example the commands. x) returns the cumulative value of the chisquare with df degrees of freedom for a value of x. Arithmetic operators are: + * / ^ Comparison operators == ~= > < >= <= Logical operators &  ~ and or not equal to not equal to greater than less than greater than or equal to less than or equal to add subtract multiply divide raise to a power Comparison and logical operators return a value of 1 for true and 0 for false. The order of operations is: ~. (negation). <.
2.0488658 Invchi2tail is the inverse function for chi2tail. .2..05. Wooldridge uses the Stata functions norm.. scalar list x1 x1 = 2. scalar list prob2 prob2 = . . t. P(z<1. scalar prob1=chi2(2.0488658 Norm(x) is the cumulative standard normal.For example.65) . To find out if x=2. F.359) . Chi2tail(df.64120353 We apparently have a pretty good chance of observing a number as large as 2. which means that we don’t have to look in the back of textbooks for statistical distributions any more.65 if the mean of the distribution is zero (and the standard deviation is one).05 it is not significantly different from zero at the .35879647 Since this value is not smaller than .p) returns the value of x for which the probability is p and the degrees of freedom are df.641) .65) is .05 if the true value of x is zero? We can compute the following in Stata. invFtail. scalar x2=invchi2tail(2. suppose we know that x=2.9505285 Invnorm(prob) is the inverse normal distribution so that. What is the probability of observing a number as large as 2. . We should get the same answer as above using this function.e. scalar list prob prob = . scalar list prob1 prob1 = . scalar prob=1chi2(2.05) . and chiisquare tables in the back of his textbook. scalar x1=invchi2(2. 15 .05 level. scalar list x2 x2 = 2. invttail. scalar list prob prob = . Because we are not generating variables. scalar prob=norm(1.x) returns the upper tail of the same distribution.05 is distributed according to chisquare with 2 degrees of freedom.35879647 Invchi2(df. scalar prob2=chi2tail(2. we use the scalar command. . i.. To get the probability of observing a number as large as 1.2.05) .05) . if prob=norm(z) then z=invnorm(prob). and invchi2tail to generate the normal.05 is significantly different from zero in the chisquare with 2 degrees of freedom we can compute the following to get the prob value (the area in the right hand tail of the distribution). .
sort year 16 .4819048 . Stata will simply ignore any missing values for each variable separately when computing the mean. The pseudo random number generator uniform() generates the same sequence of random numbers in each session of Stata. In other commands such as the summarize command. Missing values Missing values are denoted by a single period (. use the command. etc. Min Max +v  51 . type. Even though uniform() takes no parameters. Dev.83905 Note that both these functions return 51 observations on the pseudorandom variable. . the parentheses are part of the name of the function and must be typed. I used the crime1990.39789 8. This makes it difficult to be sure you are doing what you think you are doing. standard deviations. For example.954225 15.) Any arithmetic operation on a missing value results in a missing value. Stata will drop any observations for which there are missing values in either police or unemployment (“case wise deletion”).9746088 To generate a sample of normally distributed random numbers with mean 2 and standard deviation 10. To do that. Dev. So. The tsset command must be used to indicate that the data are time series.87513 21. gen z=2+10*invnorm(uniform()) . Stata ignores (drops) observations with missing values.2889026 . sorting by a variable which has missing values will put all observations corresponding to missing at the end of the data set. summarize z Variable  Obs Mean Std.0445188 . I leave the seed alone. for example. Unless I am doing a series of Monte Carlo exercises. Unless you tell Stata otherwise. summarize v Variable  Obs Mean Std. Time series operators. So. one for each state and DC. Stata considers missing values as larger than any other number. it will create the number of observations equal to the number of observations in the data set.dta data set which has 51 observations.Uniform() returns uniformly distributed pseudorandom numbers between zero and one. For example to generate a sample of uniformly distributed random numbers use the command . Min Max +z  51 1. When doing statistical analyses. set seed # If you set the seed to # you get a different sequence of random numbers. gen v=uniform() . for example if you did a regression of crime on police and the unemployment rate across states. unless you reinitialize the seed.
Time series operators can be combined.ms or FF. The twoperiod lead is written as F2.ms (=ms(t) . Similarly. ms. the second period lag could be written as L2. etc. Differences: the firstdifference is written as D. but you must first tell Stata that you have a panel (pooled time series and crosssection). the command gen rdues_1 = L.ms (=ms(t+1)). the difference of the 12month difference would be written as DS12.ms. (=(ms(t)ms(t1)) – (ms(t1)ms(t2))) Seasonal difference: the first seasonal difference is written S.ms and is the same as D. For example. can be written as L.ms or. I recommend upper case so that it is more easily readable as an operator as opposed to a variable name (which I like to keep in lowercase). Time series operators can be written in upper or lowercase.ms. The second seasonal difference is S2.ms. The second difference (difference of difference) is D2. The following commands sort 17 . Setting panel data.rdues will generate a new variable called rdues_1 which is the oneperiod lag of rdues. as LL. you can use time series operators. For example.ms.tsset year Lags: the oneperiod lag of a variable such as the money supply. If you have a panel of 50 states (150) across 20 years (19771996). Leads: the oneperiod lead is written as F.ms.ms and is equal to ms(t)ms(t2). equivalently.ms So that L.ms or DD.ms(t1)). for monthly data. Note that this will produce a missing value for the first observation of rdues_1.ms is equal to ms(t1).
the data, first by state, then by year, define state and year as the crosssection and time series identifiers, and generate the lag of the variable x.
sort state year tsset state year gen x_1=L.x
Variable labels.
Each variable automatically has an 80 character label associated with it, initially blank. You can use the “label variable” command to define a new variable label. For example, Label variable dues “Annual dues per family” Label variable members “Number of members” Stata will use variable labels, as long as there is room, in all subsequent output. You can also add variable labels by invoking the Data Editor, then double clicking on the variable name at the head of each column and typing the label into the appropriate space in the dialog box.
System variables
System variables are automatically produced by Stata. Their names begin with an underscore. _n is the number of the current observation You can create a counter or time trend if you have time series data by using the command
gen t=_n
_N is the total number of observations in the data set. _pi is the value of to machine precision.
18
3 DATA MANAGEMENT
Getting your data into Stata
Using the data editor
We already know how to use the data editor from the first chapter.
Importing data from Excel
Generally speaking, the best way to get data into Stata is through a spreadsheet program such as Excel. For example, data presented in columns in a text file or Web page can be cut and pasted into a text editor or word processor and saved as a text file. This file is not readable as data because such files are organized as rows rather than columns. We need to separate the data into columns corresponding to variables. Text files can be opened and read into Excel using the Text Import Wizard. Click on File, Open and browse to the location of the file. Follow the Text Import Wizard’s directions. This creates separate columns of numbers, which will eventually correspond to variables in Stata. After the data have been read in to Excel, the data can be saved as a text file. Clicking on File, Save As in Excel reveals the following dialogue box.
19
If the data have been read from a .txt file, Excel will assume you want to save it in the same file. Clicking on Save will save the data set with tabs between data values. Once the tab delimited text file has been created, it can be read into Stata using the insheet command.
Insheet using filename
For example,
insheet using “H:\Project\gnp.txt”
The insheet command will figure out that the file is tab delimited and that the variable names are at the top. The insheet command can be invoked by clicking on File, Import, and “Ascii data created by a spreadsheet.” The following dialog box will appear.
Clicking on Browse allows you to navigate to the location of the file. Once you get to the right directory, click on the downarrow next to “Files of type” (the default file type is “Raw,” whatever that is) to reveal
Choose .txt files if your files are tab delimited.
Importing data from comma separated values (.csv) files
The insheet command can be used to read other file formats such as csv (comma separated values) files. Csv files are frequently used to transport data. For example, the Bureau of Labor Statistics allows the public to download data from its website. The following screen was generated by navigating the BLS website http://www.bls.gov/.
20
Note that I have selected the years 19562004, annual frequency, column view, and text, comma delimited (although I could have chosen tab delimited and the insheet command would read that too). Clicking on Retrieve Data yields the following screen.
Use the mouse (or hold down the shift key and use the down arrow on the keyboard) to highlight the data starting at Series Id, copy (controlc) the selected text, invoke Notepad, WordPad, or Word, and paste (controlv) the selected text. Now Save As a csv file, e.g., Save As cpi.csv. (Before you save, you should delete the comma after Value, which indicates that there is another variable after Value. This creates a
21
e. v5. suppose we have saved a Stata file called project. Reading Stata files: the use command If the data are already in a Stata data set (with a . as above.g. The command to read the data into Stata is use “h:\Project\project. For example. suppose we have a 22 .variable. Alternatively you can click on the “open (use)’ button on the taskbar or click on File. This is done by adding the option. There are two ways in which data can be combined..dta” The quotes are not required. you are updating a file. You can read the data file with the insheet command.dta already exists. Import. Open. you might have trouble saving (or using) files without the quotes if your pathname contains spaces. The “append” command is used for this purpose. clear The “clear” option is to remove any unwanted or leftover data from previous analyses.dta) data set with the “save” command.) The Series Id and period can be dropped once the data are in Stata.g. it can be read with the “use” command. Saving a Stata data set: the save command Data in memory can be saved in a Stata (.. You don’t really have to do this because the variable can be dropped in Stata. so that some or all the series have additional observations. If a file by the name of cpi. e.dta in the H:\Project directory. save “h:\Project\cpi.dta”. saveold “h:\Project\cpi. replace. and Ascii data created by a spreadsheet.dta” If you want the data set to be readable by older versions of Stata use the “saveold” command. but it is neater if you do. you must tell Stata that you want to overwrite the old file. with nothing but missing values in Stata.dta suffix).csv” or you can click on Data. The first is to stack one data set on top of the other. insheet using “H:\Project\cpi. For example. replace. the cpi data can be saved by typing save h:\Project\cpi. For example. Combining Stata data sets: the Append and Merge commands Suppose we want to combine the data in two different Stata data sets.dta”. However.
if one data set has variables that the other does not. (Of course. This is concatenating horizontally instead of vertically. just to make sure.data set called “project. clearing out the data from the second file. list 23 . cpi) from 1977 to 1998 and another called “project2. This is another way of updating a data set. use “H:\Project\crime1.dta”.dta”.dta”. ms. These variables act as identifiers so that Stata can match the correct observations.dta” Look at the data.) There are two kinds of merges.dta” which has data on three variables (gnp. simply exit Stata without saving the data. clear Merge with the second data set by year.dta”. You can combine these two data sets with the following commands Read and sort the first data set.” which has data on two of the three variables (ms and cpi) for the years 19992002. use “H:\Project\crime1. the combined data set will have variables from both. We append the newer data to the early data as follows append using “h:\Project\project2. we first read the early data into Stata with the use command use “h:\Project\project.dta” At this point you probably want to list the data and perhaps sort by year or some other identifying variable.dta. This is the “master” data set. merge year using “H:\Project\arrests. clear sort year list save “H:\Project\crime1. If you do not get the result you were expecting. You must be very careful that each data set has its observations sorted exactly the same way. The other way to combine data sets is to merge two sets together.dta containing year pop armur arrap arrob arass arbur (corresponding arrest rates) for the years 19521998.dta” The early data are now residing in memory. replace Read and sort the second data set. I do not recommend one to one merges as a general principle. so one data set will overwrite the data in the other. use “H:\Project\arrests. clear sort year list save “H:\Project\arrests. To combine the data sets. The preferred method is a “match merge. This is the “using” data set. suppose you have data on the following variables: year pop crmur crrap crrob crass crbur from 1929 to 1998 in a data set called crime1. replace Read the first data set back into memory. Suppose that you also have a data set called arrests.dta in the h:\Project directory.” where the two datasets have one or more variables in common.dta”. Then save the expanded data set with the save command. For example. there could be some overlap in the variables. The first is the onetoone merge where the first observation of the first data set is matched with the first observation of the second data set. It makes the data set wider in the sense that.
The simplest command is list varlist Where varlist is a list of variables separated by spaces. Stata will print out the data by observation. save the combined data into a new data set. I always use a varlist to make sure I get an easy to read table. This variable takes the value 1 if the value only appeared in the master data set. The _merge variable can take two additional values if the update option is invoked: _merge equals 4 if missing values in the master were replaced with values from the using and _merge equals 5 if some values of the master disagree with the corresponding values in the using (these correspond to the revised values).dta. the version of pop that is in the crime1 data set is stored in the new crime2 data set. update list _merge Looking at your data: list. then don’t use the replace option: merge year using h:\Project\arrests.” The data set mentioned in the merge command (arrests) is called the “using” data set. in the case of duplicate variables. So. you can use the merge command with the update replace options. Each time Stata merges two data sets. however the arrest variables will have missing values for 19291951. it creates a variable called _merge. and 3 if an observation from the master is joined with one from the using. This is very difficult to read. including any string variables. save “H:\Project\crime2. If you have too many variables to fit in a table. 24 .Assuming it looks ok. update replace Here. replace The resulting data set will contain data from 19291998. Stata will assume you want all the variables listed. If you just want to overwrite missing values in the master data set with new observations from the using data set (so that you retain all existing data in the master). merge year using h:\Project\arrests. summarize. data in the using file overwrite data in the master file. describe. 2 if the value occurred only in the using data set. You should always list the _merge variable after a merge. and tabulate commands It is important to look at your data to make sure it is what you think it is. Note also that the variable pop occurs in both data sets. Stata uses the convention that. If you omit the varlist. If you want to update the values of a variable with revised values. Here is the resulting output from the crime and arrest data set we created above. In this case the pop values from the arrest data set will be retained in the combined data set.dta. describe It produces a list of variables. The data set that is in memory (crime1) is called the “master. The “describe” command is best used without a varlist or options.dta”. the one that is in memory (the master) is retained.
but the printout is not as neat.0g ARRAP arrob float %9.6 58323 687730 crass  64 414834.4 10.5 80.63 36281. describe Contains data from C:\Project\crime1.0g CRBUR armur float %9.33 121769 270298 crmur  64 14231.0g CRASS crbur long %12.430 (99.0g pop float %9. min. and a variety of percentiles.0g ARASS arbur float %9.9952267 1 5 If you add the detail option. Min Max +year  70 1963. the mean. standard deviation. as with any Stata command.5 20.0g ARROB arass float %9.02 6066.. If you omit the varlist.1174 54. Stata will generate these statistics for all the variables in the data set.3 223 arbur  46 170.0g CRMUR crrap long %12. kurtosis.57911 26. you can use qualifiers to limit the command to a subset of the data.9 arass  46 113.72 5981 109060 crrob  64 288390.9% of memory free) storage display value variable name type format label variable label year int %8. For example.0g ARBUR _merge byte %8.5 254.884076 4. skewness.39348 17. detail You will also get the variance.0g CRROB crass long %12. Summarize varlist This command produces the number of observations. Dev.00138 97.1 _merge  70 2. .1675 3.516304 1. Don’t forget.79689 49. the command 25 .6 218911.158983 7 16 arrob  46 53.35109 1929 1998 pop  70 186049 52068.3 363602.0g CRRAP crrob long %12.5326 44.9 62186 1135610 crbur  64 1736091 1214590 318225 3795200 armur  46 7.0g Sorted by: Note: dataset has changed since last saved We can get a list of the numerical variables and summary data for each one with the “summarize” command.0g POP crmur int %8. and max for every variable in the varlist.007 7508 24700 crrap  64 44291.371429 . Summarize varlist. summarize Variable  Obs Mean Std.0g ARMUR arrap float %9. four largest.dta obs: 70 vars: 13 16 May 2004 20:12 size: 3. four smallest.3 arrap  40 12. For example.
you can use the “tabulate” command to generate a table of frequencies. for example. suppose we want to create dummy variables for each of the counties and each of the years.11 33.00 Nine counties each with 20 years and 20 years each with 9 counties.00 91  9 5.00 89  9 5.00 55. This is particularly useful for large data sets with too many observations to list comfortably. Percent Cum.00 +Total  180 100.00 85  9 5.67 7  20 11.00 25.11 100.44 5  20 11.00 +Total  180 100. tabulate varname For example.00 .00 80. tabulate county county  Freq.summarize if year >1990 Will yield summary statistics only for the years 19911998. generate(yrdum) 26 . +1  20 11. tabulate year year  Freq.00 80  9 5.00 15. We can do it easily with the tabulate command and the generate() option.00 95  9 5.11 55.33 4  20 11. Finally. Tabulate county.00 100. we use the tabulate command: . It can generate dummy variables corresponding to the categories in the table.00 87  9 5.00 82  9 5.11 44.00 92  9 5.00 94  9 5.11 22.00 93  9 5. generate(cdum) Tabulate year. The tabulate command will also do another interesting trick. Just to make sure.00 65.00 79  9 5.00 83  9 5.00 81  9 5.22 3  20 11.00 97  9 5.00 5.00 86  9 5.00 75.00 84  9 5.78 8  20 11.00 45.00 85. suppose we think we have a panel data set on crime in nine Florida counties for the years 19781997.00 40.00 35.11 66.00 50.00 70.11 2  20 11.00 90  9 5.00 95.00 30.89 9  20 11. +78  9 5. So.11 77.00 20.00 10.00 96  9 5.00 60.11 11.00 88  9 5. Percent Cum.00 90.56 6  20 11. Looks ok.11 88.
09138 (the number of murders per thousand population) which is still too small.This will create nine county dummies (cdum1cdum9) and twenty year dummies (yrdum1yrdum20).” If you want to replace widget with new values. Suppose we want to create a per capita murder variable from the raw data in the crime2 data set we created above. you can use the keep command to achieve the same result. “widget already defined.38). Culling your data: the keep and drop commands Occasionally you will want to drop variables or observations from your data set. Alternatively. widget and a variable by that name already exists in the data set. If we divide 24. The command Drop if year <1978 Will drop observations corresponding to years before 1978. we use the “generate” command to transform our raw data into the variables we need for our statistical analyses.700 by 270. which is a reasonable order of magnitude for further analysis. The corresponding value for pop is 270. This operation divides crmur by pop then (remember the order of operations) multiplies the resulting ratio by 1000. Dividing 24. the maximum value for crmur is 24. Stata will refuse to execute the command and tell you. The command Keep varlist Will drop all variables not in the varlist. The command Drop in 1/10 will drop the first ten observations. According to the results of the summarize command above. The command Drop varlist Will eliminate the variables listed in the varlist. We used the replace command in our sample program in the last chapter to rebase our cpi index. We would use the command. we know that the population is about 270 million.298. Keep if year >1977 will keep all observations after 1977. Transforming your data: the generate and replace commands As we have seen above. gen murder=crmur/pop*1000 I always abbreviate generate to gen. say. This number is too small to be useful in statistical analyses. so our population variable is in thousands.700. Multiplying the ratio of crmur to our measure of pop by 1000 yields the number of murders per million population (91.298 yields .000091). If you try to generate a variable called.700 by 270 million. However. This produces a reasonable number for statistical operations. 27 . we get the probability of being murdered (. you must use the replace command.
160 28 .replace cpi=cpi/1.
which is especially useful for time series. The first kind of graph is the line graph. 29 . I wonder why. Hmmm. obviously one of the best ways of looking is to graph the data. The command used to produce the graph is . Here is a graph of the major crime rate in Virginia from 1960 to 1998. Note that I am able to paste the graph into this Word document by generating the graph in State and clicking on Edit. in Word.4 GRAPHS Continuing with our theme of looking at your data before you make any crucial decisions. Copy Graph. graph twoway line crmaj year which produces 6000 8000 crmaj 10000 12000 14000 16000 1960 1970 1980 year 1990 2000 2010 It is interesting that crime peaked in the mid1970’s and has been declining fairly steadily since the early 1980’s. Paste (or controlv if you prefer keyboard shortcuts). Then. click on Edit.
The second type of graph frequently used by economists is the scatter diagram. graph twoway scatter crmaj aomaj 16000 6000 20 8000 crmaj 10000 12000 14000 25 aomaj 30 We also occasionally like to see the scatter diagram with the regression line superimposed. the lower the crime rate. Here is the scatter diagram relating the arrest rate to the crime rate. graph twoway (scatter crmaj aomaj) (lfit crmaj aomaj) 16000 6000 20 crmaj 8000 crmaj/Fitted values 10000 12000 14000 25 aomaj Fitted values 30 This graph seems to indicate that the higher the arrest rate. Does deterrence work? 30 . which relates one variable to another. .
Crime Rates and Arrest Rates Crime Rate 8000 10000 12000 14000 16000 Virginia 19601998 6000 20 25 Arrest Rate Fitted values crmaj 30 This was produced with the command. The first parenthesis contains the commands that created the original scatter diagram. with some more bells and whistles. the second parenthesis contains the command that generated the fitted line (lfit is short for linear fit).Note that we used socalled “parentheses binding” or “() binding” to overlay the graphs. 31 . Here is the lfit part by itself. . Here is another version of the above graph. graph twoway lfit crmaj aomaj 12500 10500 20 11000 Fitted values 11500 12000 25 aomaj 30 These graphs can be produced interactively.
Also. 32 . twoway (lfit crmaj aomaj. Clicking on the tabs reveals further dialog boxes. The final command is created automatically by filling in dialog boxes and clicking on things. sort) (scatter crmaj aomaj).. ytitle(Crime Rate) x > title(Arrest Rate) title(Crime Rates and Arrest Rates) subtitle(Virginia 19 > 601998) However. I did not have to enter that complicated command. Click on Graphics to reveal the pulldown menu. I sorted by the independent variable (aomaj) in this graph. Note that only the first plot (plot 1) had “lfit” as an option. Click on “overlaid twoway graphs” to reveal the following dialog box with lots of tabs. and neither do you.
You should take some time to experiment with some of them. start=5740. Crime Rate Virginia 6000 8000 10000 12000 crmaj 14000 16000 This was produced by the following command. again using the dialog boxes.2186) There are lots of other cool graphs available to you in Stata.Here is another common graph in economics. 33 . the histogram. alternate) title(Crime > Rate) subtitle(Virginia) (bin=10. bin(10) normal yscale(off) xlabel(. . The graphics dialog boxes make it easy to produce great graphs. histogram crmaj.0903. width=945.
Except for very simple tasks. you can edit the log file. we need to make our data and programs readily available to other economists.” that is the results do not hold up under relatively small changes in the modeling technique or data. They can also do various tests and alternative analyses to find out if the results are “fragile. we can edit the Stata output and data input and type in any values we want to. executing each command in turn. for example. To fire up the dofile editor. The final model can always be reproduced. and then rename the file. Begin. If you use a word processor. It is possible to do a complicated analysis interactively. I recommend always using dofiles. will insist on adding a . For these reasons. before writing and executing the next command. Word. The reason is that complicated analyses can be difficult to reproduce. click on Rename. otherwise how can we be sure the results are legitimate? After all. You can create a dofile with the builtin Stata dofile editor. and undo. If you insist on running Stata interactively. any text editor.txt) file. It is a full service text editor. save it. you make changes in the program. You can do it from inside Word by clicking on Open. and then rightclicking on the dofile with the txt extension. and then be unable to remember the sequence of events that led up to the final version of the model.txt. if you are working with a dofile. to make sure we on the up and up. Because economics typically does not use laboratories where experimental results can be replicated by other researchers. You can wind up rummaging around in the data doing various regression models and data transformations. click on the button 34 . and repeat. It has mouse control. The text file must have a “. You will be presented with a dialog box where you can enter the name of the log file and change the default directory where the file will be stored. as long as the data doesn’t change. Log. browsing to the right directory. You have to rename the file by dropping the .do” filename extension. remove the output. When you write up your results you can insert parts of the log file to document your findings. you wind up with a dofile called sample. you may never be able to get the exact result again.txt extension. cut. by simply running the final dofile again. the first time you save the dofile. find and replace. Do…. and browsing to the dofile. such as Notepad or WordPad. copy and paste. When the data and programs are readily available it is easy for other researchers to take the programs and do the analysis themselves and check the data against the original sources. or with any word processing program like Word or WordPerfect. However.5 USING DOFILES A dofile is a list of Stata commands collected together in a text file. It is also known as a Stata program or batch file. The dofile is executed either with the “do” command or by clicking on File. If you want to replicate your results. you can “log” your output automatically to a “log file” by clicking on the “Begin Log” button (the one that looks like a scroll) on the button bar or by clicking on File. get a result. and turn the resulting file into a do file. It may also make it difficult for other people to reproduce your results. So. However. I recommend that any serious econometric analysis be done with dofiles which can be made available to other researchers. It is probably easiest to use the builtin Stata dofile editor. Consequently.do.txt filename extension. be sure to “save as” a text (. execute it.
To see the next screen. e.log extension */ /* use a full pathname */ log using "H:\Project\example. * set mem 10000.dta" . If you want Stata to put the log file in your “project” directory.log". you must hit the spacebar. replace. clear. these are the commands you will need. Dofile Editor. Besides. when the results screen fills up. * */ use semicolon to delimit commands. /*********************************************** ***** Sample DoFile ***** ***** This is a comment ***** ************************************************/ /* some preliminaries #delimit.that looks like an open envelope (with a pencil or something sticking out of it) on the button bar or click on Window. with a . Don’t forget to set the log file with the “log using filename. command sets the semicolon instead of the carriage return as the symbol signaling the end of the command.g. /* tell Stata where to store the output */ /* I usually use the dofile name. More conditions: in the usual interactive mode. “H:\project\sample. This allows us to write multiline commands. log close. For small projects you won’t have to increase memory or matsize. If you forget the replace option. turn off the more feature. it stops scrolling and more— appears at the bottom of the screen. /* get the data */ use "H:\Project\crimeva.” 35 . The “replace” part is necessary if you edit the dofile and run it again. /*************************************/ /*** summary statistics ***/ /*************************************/ summarize crbur aobur metpct ampct unrate. * set more off. The #delimit. you will be able to examine the output by opening the log file. This can be annoying in dofiles because it keeps stopping the program. so set more off. Stata will not let you overwrite the existing log file. but if you do run out of space. Here is a simple dofile.log. The beginning block of statements should probably be put at the start of any dofile. regress crbur aobur metpct ampct unrate. replace” command. you will need to use a complete pathname for the file. make sure you have enough room for variables.. * set matsize 150. make sure you have enough memory.
8778 0. or tells someone else who might be reading the output.78 Number of obs F( 4.816 313.19 0. To see the log file.0449 0.76679 . what you’re doing and why.386 +Total  63835555.7131 ampct  5689. you do the analysis.6 .21 crbur  Coef.193 11925.41 unrate  71.3 Residual  7800038. Variable  Obs Mean Std. Dev.001 1509.4 7.786667 1. 16) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 21 28.162261 2.834964 12.995 7001. Finally. t P>t [95% Conf.394291 70 74.0000 0.25 ampct  28 18.2593 _cons  27751.log log type: text opened on: 4 Jun 2004.81 0.17 16 487502.924 0. 08:16:47 . I use Word to review the log file.6 metpct  28 71. Min Max +crbur  40 7743.Next you have to get the data with the “use” command.log log type: text closed on: 4 Jun 2004. you can open it in the dofile editor.66737 246. 08:16:48  For complicated jobs.428 9153. /* get the data */ >. use "crimeva.439 3912.99775 1. because I use Word to write the final paper.14 148088.7 . clear. log: C:\Stata\sample. log: H:\project\sample.9246 593.5 18.3321 250. 36 .378 2166.7202 4.8 286181.46429 1.1049885 18.854 341683.76 463.4 4 14008879.8473 698. The log file looks like this.9 unrate  30 4.1 0. or any other editor or word processor.47 19.6 20 3191777. log close..24 0.7367 246.9329 0.74 0.775 449.29 0. > > > > /*************************************/ /*** summary statistics ***/ /*************************************/ summarize crbur aobur metpct ampct unrate. regress crbur aobur metpct ampct unrate. Source  SS df MS +Model  56035517.00 0.52691 132. using lots of “banner” style comments (like the “summary statistics” comment above) makes the output more readable and reminds you. Err.2783 metpct  986. Interval] +aobur  31.49 aobur  22 16. .42 20533.dta" . Don’t forget to close the log file. Std.
Clicking on Contents allows you to browse through the list of Stata commands. It also can’t list all the Stata commands that you might need.6 USING STATA HELP This little manual can’t give you all the details of each Stata command. Answer: the Stata Help menu. arranged according to function. Clicking on Help yields the following submenu. the official Stata manual consists of several volumes and is very expensive. Finally. 37 . So where does a student turn for help with Stata.
For example. Clicking on Help. Now suppose you want information on how Stata handles a certain econometric topic. Search.One could rummage around under various topics and find all kinds of neat commands this way. Clicking on any of the hyperlinks brings up the documentation for that command. Clicking on the “Search all” radio button and entering heteroskedasticity yields 38 . suppose we want to know how Stata deals with heteroskedasticity. yields the following dialog box.
you can click on Help. entering “regress” to get the documentation on the regress commands yields. Stata command. yields For example. 39 .On the other hand. if you know that you need more information on a given command.
Most of the commands have options that are not documented here. There are also lots of examples and links to related commands.This is the best way to find out more about the commands in this book. 40 . You should certainly spend a little time reviewing the regress help documentation because it will probably be the command you will use most over the course of the semester.
817096 . Scatter plot and filling in the blanks. Major Tick Options.5 20.5163978 0 1 trend  10 4.02765 78 87 members  10 123. to reveal the dialog box that allows you to put a line at the mean of members.905275 5.5 3. Dev. Use summarize to compute the means of the variables.273 logmem  10 4.718499 4.0727516 4. the result is a nice graph.02765 0 9 Plot the number of members on the real dues (rdues).8055 16. This can be done by typing twoway (scatter members rdues). yline(124) xline(179) Or by clicking on Graphics. Two way graph.5 3. . clicking on X axis. 41 . Use the data on pool memberships from the first chapter. Min Max +year  10 82. Either way. allows you to put a line at the mean of rdues.934474 dum  10 .5903 207.1219735 4. sorting by rdues.16 rdues  10 178.7 REGRESSION Let’s start with a simple correlation.1711212 .138088 .9717 . summarize Variable  Obs Mean Std.999383 112 139 dues  10 171.652 1. Click on Y axis. and Additional Lines.0552 +logdues  10 5.4 .9 8.72355 158. Similarly.00694 135 195 cpi  10 . Major Tick Options.
The larger the number of observations. computed over the entire sample will be negative. then the variation. There are two problems with using covariation as our measure of association. for points in the lower left and upper right hand quadrants. then the covariation will be positive. So. known as the covariance. We can fix this problem by dividing the covariation by the sample size. in the lower right hand quadrant. xi < 0. In the lower left quadrant. Define the deviation from the mean for X and Y as xi = X i − X yi = Yi − Y Note that. N. 42 . yi < 0 . X i > X . yi > 0 . The resulting quantity is the average covariation. Yi > Y so that xi > 0. xi > 0. Similarly. xi < 0. indicating a negative association between X and Y. yi < 0 . The first is that it depends on the number of observations. This quantity is called the sum of corrected crossproducts. if most of the points lie in the lower left and upper right quadrants. This means that. We need a measure of this association. If we compute the sum of corrected crossproducts for the points in the upper left and lower right quadrants we find that xi yi < 0 . Moving clockwise. in the upper right quadrant. if most of the points are in the upper left and lower right quadrants. the larger the (positive or negative) covariation. or the covariation. indicating a positive relationship between X and Y. yi > 0 . This means that high values of rdues are associated with low membership. xi yi > 0 .members 110 160 120 130 140 170 180 rdues 190 200 210 Note that six out of the ten dots are in the upper left and lower right quadrants. Finally in the upper left quadrant. if we multiply the deviations together and sum we get.
The . the policy at least doesn’t make things worse. 43 .2825 1. For example Regress members rues Produces a simple regression of members on rdues. y ) = i =1 N = i =1 The second problem is that the covariance depends on the unit of measure of the two variables.0000 rdues  0. To solve this problem. Linear regression Sometimes we want to know more about an association between two variables than if it is positive or negative. the Stata command that corresponds to ordinary least squares (OLS) is the regress command. Does it really reduce crime? If the answer is yes. With respect to regression analysis. we have to estimate the expected change in crime from a change in the number of police. x y (X i i N N i − X )(Yi − Y ) N Cov( x.0000 The correlation coefficient is negative. The correlation coefficient for our pool sample can be computed in Stata by typing correlate members rdues (obs=10)  members rdues +members  1. Consider a policy analysis concerning the effect of putting more police on the street. That output was reported in the first chapter. N x i =1 2 i z rXY = i =1 N Xi zYi = N 1 N S i =1 N xi yi 1 = NS X SY X SY x y i i =1 N i This is the familiar simple correlation coefficient. in order to justify the expense. If adding 100. The standard score of X is z X i = xi S x where S x = standard score of Y is zYi = yi N N 2 i i =1 N is the standard deviation of X. So.000 police nationwide only reduces crime by a few crimes. but we also have to know by how much it can be expected to reduce crime. it fails the costbenefit test. You get different values of the covariance if the two variables are height in inches and weight in pounds versus height in miles and weight in milligrams. The covariance of the standard scores is. we divide each observation by its standard deviation to eliminate the units of measure and then compute the covariance using the resulting standard x scores. as suspected.
that the values of the estimated slope and intercept would not change. The model is incomplete. As statisticians. Since is the slope of the function. or simply assume that the line is approximately linear for our purposes. the error. β= ∆Y ∆X which means that ∆Y = β∆X = β (100. β . we could estimate the effect of adding 100. but taken together cause the error term to vary randomly.” that is. not just the sign. So what are the values of and (the parameters) that create a “good fit. Yi = α + β X i + U i where Ui is the error associated with observation i. we know that real world data do not line up on straight lines. ˆ ˆ ˆ ei = Yi − Yi = Yi − (α + β X i ) . That is. the line may be curved. as opposed to mathematicians. Then the error term is simply the sum of a whole bunch of variables that are each irrelevant. 000) This is the counterfactual (since we haven’t actually added the police yet) that we need to estimate the benefit of the policy.). that is. our predicted value of Y is ˆ ˆ ˆ Yi = α + β X i If we are wrong. So. and the intercept as the two points we need. The line might not be straight. or residual. If this is true and we could estimate . 1. is the difference between what we thought Y would be and what is actually was. The cost is determined elsewhere. We can summarize the relationship between Y and X as. 44 . Why do we need an error term? Why don’t observations line up on straight lines? There are two main reasons. We will use the slope.Suppose we hypothesize that there is a linear relationship between crime and police such that Y =α +βX where Y is crime. Then for any observation. etc. we need an estimate of the slope of a line relating Y to X. We can get out of this either by transforming the line (taking logs. We have left out some of the other factors that affect crime. At this point let’s assume that the factors that we left out are “irrelevant” in the sense that if we included them the basic form of the line would not change. X is police and β < 0 .000 police as follows. which values of the parameters best describe the cloud of data in the figure above? ˆ ˆ Let’s define our estimate of as α and our estimate of as β . 2. A line is determined by two points.
If we minimize the sum of absolute values we are doing least absolute value estimation. we are doing least squares. We can’t get zero errors because that would require driving the line through each point. f (b) = c2 b 2 + c1b + c0 Recall from high school the formula for completing the square. which is beyond our scope. ˆ ˆ We want to minimize the sum of squares of error with respect to α and β . Minimizing f (α ) requires setting α equal to the negative of the coefficient of the first power divided by two times the coefficient of the second power. Since they don’t line up. but that would be too easy. c c2 f (b) = c2 b + 1 + c0 − 1 2c2 4c2 Multiply it out if you are unconvinced. The least squares procedure has been around since at least 1805 and it is the workhorse of applied statistics. with huge positive and negative errors that offset each other. Suppose we have a quadratic formula and we want to minimize it by choosing a value for b.β We could take derivatives with respect to and . we want to minimize the sum of squares of error. we would get one line per point. Instead. 45 . from a bad fit. the minimum is the ratio of the negative of the coefficient on the first power to two times the coefficient on the second power of the quadratic. If we minimize the sum of squares of error. namely. the minimum occurs at 2 b= −c1 2c2 That is. To solve this last problem we could square the errors or take their absolute values. that is ˆ ˆ ˆ ˆ ˆ ˆ ˆˆ Min e 2 = (Y − α − β X ) 2 = (Y 2 − 2α Y + α 2 + β 2 X 2 − 2αβ X − 2β XY ) 2 2 2 2 ˆ ˆ ˆ ˆ ˆˆ = Y − 2α Y + N α + β X − 2αβ X − 2β XY ˆ ˆ ˆ Note that there are two quadratics in α and β . not a summary statistic. but that would mean that large positive errors could offset large negative errors and we really couldn’t tell a good fit. So. let’s do it without calculus. we want to minimize the errors. set them equal to zero and solve the resulting equations. We could minimize the sum of the errors. or 2 2 ˆ ˆ Min ei = (Yi − α − β X i ) ˆ ˆ α . Since we are minimizing with respect to our choice of b and b only occurs in the squared term. The one in α is ˆ ˆ ˆ ˆˆ f (α ) = N α 2 − 2α Y − 2αβ X + C ˆ ˆ ˆ = N α 2 − α (2 Y − 2β X ) ˆ ˆ ˆ where C is a constant consisting of terms not involving α . with no error.Clearly.
46 . ˆ β= ˆ ˆ 2( XY + α X ) XY + α X = 2 2 X X2 ˆ Substituting the formula for α .ˆ α= ˆ 2 Y − β X 2N ( ) = Y − βˆ X = Y − βˆ X N N ˆ The quadratic form in β is ˆ ˆ ˆ ˆ ˆ ˆˆ ˆ f ( β ) = β 2 X 2 − 2β XY + 2αβ X = β 2 X 2 − β 2( XY + α X ) + C * ˆ where C* is another constant not involving β . Y =α +βX ˆ ˆ Since α = Y − β X the regression line goes through the means: Y =α +βX Subtract the mean of Y from both sides of the regression line. ˆ β= ˆ ˆ XY + (Y − β X ) X XY + Y Y − β X X = 2 2 X X 2 Multiplying both sides by X yields. ˆ ˆ β X 2 = XY − Y X + β X X ˆ = XY − YNX + β XNX ˆ ˆ β X 2 − β NX 2 = XY − NXY ˆ β ( X 2 − NX 2 ) = XY − NXY ˆ Solving for β ˆ β= XY − NXY xy = X 2 − NX 2 x 2 Correlation and regression Let’s express the regression line in deviations. ˆ Setting β equal to the ratio of the negative of the first power divided by two times the coefficient of the second power yields.
as we have seen above. Here is the graph of the pool membership data with the regression line superimposed. Sx xy S x xy = = 2 S y NS x S y S y NS y r 47 . the regression line goes through the means. we have “translated the axes” so that the intercept is zero when the regression is expressed in deviations. 110 160 120 130 140 170 180 rdues 190 members 200 210 Fitted values ˆ Substituting β yields the predicted value of y (“Fitted values” in the graph): ˆ ˆ y = βx We know that the correlation coefficient between x and y is r= xy NS x S y Multiply both sides by Sy/Sx yields.Y −Y = α + β X −α − β X Or. The reason is. y = β(X − X ) = β x Note that. by subtracting off the means and expressing x and y in deviations.
ˆ y 2 = y 2 + 2 ye + e 2 Summing. ˆ y 2 = y 2 + 2 ye + e 2 The middle term on the right hand side is equal to zero. Since S x > 0 and S y > 0 ˆ r must have the same sign as β . we will concentrate on regression from now on. β is a linear transformation of the correlation coefficient. How well does the line fit the data? We need a measure of the goodness of fit of the regression line. Start with the notion that each observation of y is equal to the predicted value plus the residual: ˆ y = y+e Squaring both sides. ˆ ˆ ˆ ye = β xe = β xe Which is zero if xe = 0. Since the regression line does everything that the correlation coefficient does. ˆ ˆ xe = x y − β x = xy − β x 2 xy x 2 = xy − xy = 0 x2 This property of linear regression. the middle term is zero and thus. y2 NS = N N 2 y y2 =N = y2 N 2 Therefore. = xy − y 2 = y 2 + e2 ( ) Or. plus yield the rate of change of y to x. Nevertheless. r S x xy ˆ = =β S y y2 ˆ So.But. TSS=RSS+ESS 48 . namely that x is uncorrelated with the residual. turns out to be useful in several contexts.
ˆ where TSS is the total sum of squares, y 2 , RSS is the regression (explained) sum of squares, y 2 , and
ESS is the error (unexplained, residual) sum of squares, e 2 . Suppose we take the ratio of the explained sum of squares to the unexplained sum of squares as our measure of the goodness of fit.
2 ˆ ˆ RSS y 2 β 2 x 2 ˆ x = = = β2 2 2 TSS y y y2
But, recall that
ˆ β =r
Sy Sx
=r
y2 N x2 N
Squaring both sides yields,
y2 y2 ˆ β 2 = r2 N 2 = r2 x x2 N Therefore,
2 RSS ˆ x = r2 = β2 TSS y2
The ratio RSS/TSS is the proportion of the variance of Y explained by the regression and is usually referred 2 to as the coefficient of determination. It is usually denoted R . We know why now, because the coefficient of determination is, in simple regressions, equal to the correlation coefficient, squared. Unfortunately, this neat relationship doesn’t hold in multiple regressions (because there are several correlation coefficients associating the dependent variable with each of the independent variables). However, the ratio of the 2 regression sum of squares to the total sum of squares is always denoted R .
Why is it called regression?
In 1886 Sir Francis Galton, cousin of Darwin and discoverer of fingerprints among other things, published a paper entitled, “Regression toward mediocrity in hereditary stature.” Galton’s hypothesis was that tall parents should have tall children, but not as tall as they are, and short parents should produce short offspring, but not as short as they are. Consequently, we should eventually all be the same, mediocre, height. He knew that, to prove his thesis, he needed more than a positive association. He needed the slope of a line relating the two heights to be less than one. He plotted the deviations from the mean heights of adult children on the deviations from the mean heights of their parents (multiplying the mother’s height by 1.08). Taking the resulting scatter diagram, he eyeballed a straight line through the data and noted that the slope of the line was positive, but less than one. He claimed that, “…we can define the law of regression very briefly; It is that the heightdeviate of the offspring is, on average, twothirds of the heightdeviate of its midparentage.” (Galton, 1886,
49
Anthropological Miscellanea, p. 352.) Midparentage is the average height of the two parents, minus the overall mean height, 68.5 inches. This is the regression toward mediocrity, now commonly called regression to the mean. Galton called the line through the scatter of data a “regression line.” The name has stuck. The original data collected by Galton is available on the web. (Actually it has a few missing values and some data entered as “short” “medium,” etc. have been dropped.) The scatter diagram is shown below, with the regression line (estimated by OLS, not eyeballed) superimposed.
0 0
5
10
15
5 meankids
parents
10 Fitted values
15
However, another question arises, namely, when did a line estimated by least squares become a regression line? We know that least squares as a descriptive device was invented by Legendre in 1805 to describe astronomical data. Galton was apparently unaware of this technique, although it was well known among mathematicians. Galton’s younger colleague Karl Pearson, later refined Galton’s graphical methods into the now universally used correlation coefficient. (Frequently referred to in textbooks as the Pearsonian productmoment correlation coefficient.) However, least squares was still not being used. In 1897 George Udny Yule, another, even younger colleague of Galton’s, published two papers in which he estimated what he called a “line of regression” using ordinary least squares. It has been called the “regression line” ever since. In the second paper he actually estimates a multiple regression model of poverty as a function of welfare payments including control variables. The first econometric policy analysis.
The regression fallacy
Although Galton was in his time and is even now considered a major scientific figure, he was completely wrong about the law of heredity he claimed to have discovered. According to Galton, the heights of fathers of exceptionally tall children tend to be tall, but not as tall as their offspring, implying that there is a natural
50
tendency to mediocrity. This theorem was supposedly proved by the fact that the regression line has a slope less than one. Using Galton’s data from the scatter diagram above (the actual data is in Galton.dta). We can see that the regression line does, in fact, have a slope less than one.
. regress meankids parents [fweight=number]
1
Source  SS df MS +Model  1217.79278 1 1217.79278 Residual  2903.59203 897 3.23700338 +Total  4121.38481 898 4.58951538
Number of obs F( 1, 897) Prob > F Rsquared Adj Rsquared Root MSE
= = = = = =
899 376.21 0.0000 0.2955 0.2947 1.7992
meankids  Coef. Std. Err. t P>t [95% Conf. Interval] +parents  .640958 .0330457 19.40 0.000 .5761022 .7058138 _cons  2.386932 .2333127 10.23 0.000 1.92903 2.844835 
The slope is ..64, almost exactly that calculated by Galton, which confirms his hypothesis. Nevertheless, the fact that the law is bogus is easy to see. Mathematically, the explanation is simple. We know that ˆ β =r Sy Sx
If S y ≅ S x and r > 0, then
ˆ 0 < β ≅ r <1
This is the regression fallacy. The regression line slope estimate is always less than one for any two variables with approximately equal variances. Let’s see if this is true for the Galton data. The means and standard deviations are produced by the summarize command. The correlation coefficient is produced by the correlate command.
. summarize meankids parents Variable  Obs Mean Std. Dev. Min Max +meankids  204 6.647265 2.678237 0 15 parents  197 6.826122 1.916343 2 13.03 . correlate meankids parents (obs=197)  meankids parents +meankids  1.0000 parents  0.4014 1.0000
So, the standard deviations are very close and the correlation coefficient is positive. We therefore expect that the slope of the regression line will be positive. Galton was simply demonstrating the mathematical theorem with these data.
We weight by the number of kids because some families have lots of kids (up to 15) and therefore represent 15 observations, not one.
1
51
To take another example, if you do extremely well on the first exam in a course, you can expect to do well on the second exam, but probably not quite as well. Similarly, if you do poorly, you can expect to do better. The underlying theory is that ability is distributed randomly with a constant mean. Suppose ability is distributed uniformly with mean 50. However, on each test, you experience random error with is distributed normally with mean zero and standard deviation 10. Thus, x1 = a + e1
x2 = a + e2 E (a) = 50, E (e1 ) = 0, E (e2 ) = 0, cov(a, e1 ) = cov(a, e2 ) = cov(e1 , e2 ) = 0
where x1 is the grade on the first test, x2 is the grade on the second test, a is ability, and e1 and e2 are the random errors.
E ( x1 ) = E (a + e1 ) = E (a) + E (e1 ) = E (a) = 50 E ( x2 ) = E (a + e2 ) = E (a ) + E (e2 ) = E (a) = 50
So, if you scored below 50 on the first exam, you should score closer to 50 on the second. If you scored higher than 50, you should regress toward mediocrity. (Don’t be alarmed, these grades are not scaled, 50 is a B.)
We can use Stata to prove this result with a Monte Carlo demonstration.
* choose a large number of observations . set obs 10000 obs was 0, now 10000 * . * . * . create a measure of underlying ability gen a=100*uniform() create a random error term for the first test gen x1=a+10*invnorm(uniform()) now a random error for the second test gen x2=a+10*invnorm(uniform())
. summarize Variable  Obs Mean Std. Dev. Min Max +a  10000 50.01914 28.88209 .0044471 99.98645 x1  10000 50.00412 30.5206 29.68982 127.2455 x2  10000 49.94758 30.49167 25.24831 131.0626 * note that the means are all 50 and that the standard deviations are about the same . correlate x1 x2 (obs=10000)  x1 x2 +x1  1.0000 x2  0.8912 1.0000
. regress x2 x1 Source  SS df MS +Number of obs = 10000 F( 1, 9998) =38583.43
52
Err. he published his magnum opus.37561 Number of obs F( 1.Model  7383282.8814501 . Std. Std.2655312 20.11 0.7942 13. summarize Variable  Obs Mean Std.0000 0.180851 +Total  7335.0839 30 100 test2  41 77. .000 36. Dev.79 0.1265 12.55 1 7383282.78049 13. Err.358909 +Total  9296488.92207 8.dta.1171354 2. The book initially received favorable reviews in places like the Royal Statistical Society.02439 40 183. businesses with high profits tended.97122 Residual  6247. “The Triumph of Mediocrity in Business.99083 6.7364 73.” In this 486 page tome he analyzed mountains of data and showed that. and the Annals of the American Academy of Political and Social Science. Interval] +test1  . in every industry.37 9998 191.97122 1 1087.05317 39 160.8992199 _cons  5. regress test2 test1 Source  SS df MS +Model  1087.55 Residual  1913206. Here are the results.54163 47 100 .61 0.10774  Horace Secrist Secrist was Professor of Statistics at Northwestern University in 1933 when.1483 0. it was 1933 after all).890335 . Min Max +test1  41 74. over time.947655 * the regression slope is equal to the correlation coefficient. while businesses with lower than average profits showed increases.3851 1.427161 . t P>t [95% Conf.0129 0.013 .0045327 196.87805 17. You can download the data set and look at the scores.0000 .7942 0.0683465 .43 0. 39) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 41 6. Secrist was riding high.741866 Prob > F Rsquared Adj Rsquared Root MSE = = = = 0.000 4. correlate test2 test1 (obs=41)  test2 test1 +test2  1. the Journal of Political Economy. after ten years of effort collecting and analyzing data on profit rates of businesses. Secrist thought he had found a natural law of competition (and maybe a fundamental cause of the depression.656 test2  Coef.92 9999 929.833 x2  Coef.0000 test1  0. 53 . Interval] +x1  . t P>t [95% Conf. to see their profits erode.906666 5.542204 _cons  54.3052753 . as expected Does this actually work in practice? I took the grades from the first two tests in a statistics course I taught years ago and created a Stata data set called tests.44 0.000 .
These diagrams really prove nothing more than that the ratios in question have a tendency to wander about. “The seeming convergence is a statistical fallacy. Also.” (Hotelling. and other phenomena. JASA. It is illustrated by genetic.) According to Hotelling the real test would be a reduction in the variance of the series over time (not true of Secrist’s series). and the random errors with transitory income. resulting from the method of grouping. The performance.…. though perhaps entertaining. Friedman. Hotelling wrote. Summation and deviations Let the lowercase variable xi = X i − X where X is the raw data. a well known economist and mathematical statistician.It all came crashing down when Harold Hotelling. To “prove” such a mathematical result by a costly and prolonged numerical study of many kinds of business profit and expense ratios is analogous to proving the multiplication table by arranging elephants in rows and columns. Some tools of the trade We will be using these concepts routinely throughout the semester. which can be derived from the test examples above by simply replacing test scores with consumption. tells how he was inspired to avoid the regression fallacy in his own work. See Milton Friedman. the conclusions would be reversed. but Hotelling was having none of it. and then doing the same for numerous other kinds of animals. JASA. the “real test’ of smaller variances over time might need to be taken with a grain of salt if measurement errors are getting smaller over time. N where N is the sample size). Secrist replied with a letter to the editor of the JASA. published a review in the Journal of the American Statistical Association. physical.) Even otherwise competent economists commit this error. 1933. astronomical. “Do old fallacies ever die?” Journal of Economic Literature 30:21292132. 1992. (1) x i =1 N i = ( X i − X ) = X i − NX Xi = Xi − N N = Xi − Xi = 0 From now on. in the same article. instead of the beginning. In the review Hotelling pointed out that if the data were arranged according to the values taken at the end of the period. and having a certain pedagogical value. is not an important contribution to either zoology or to mathematics. In his reply to Secrist. who was a student of Hotelling’s when Hotelling was engaged in his exchange with Secrist. The result was the permanent income hypothesis. (Hotelling. I will suppress the subscripts (which are always i=1. ability with permanent income. This theorem is proved by simple mathematics. By the way. sociological. (2) x 2 = ( X − X )2 = ( X 2 − 2 XX + X 2 ) = X 2 − 2 X X + NX 2 X Note that X = N = NX so that N x 2 = X 2 − 2 X X + NX 2 = X 2 − 2 NX 2 + NX 2 = X 2 − 2 NX 54 . 1934.
so that pi = 1/ N then E( X ) = 1 1 1 X 1 + X 2 + . namely your number comes up and you win $35 or it doesn’t and you lose your dollar.973(−1) + . At the same time. xy = ( X − X )(Y − Y ) = ( XY − XY − YX + XY ) (3) = XY − Y X − X Y + NXY = XY − YNX − XNY + NXY = XY − NXY (4) xY = ( X − X )Y = ( XY − XY ) = XY − X Y = XY − NXY = xy Xy = X (Y − Y ) = XY − Y X = XY − NXY = xy = xY (5) Expected value The expected value of an event is the most probable outcome. when you play you never lose 2..027 So. If you play 1000 times.. what might be expected. It is the average result over repeated trials. you will win some and lose some. The expected value of X is also called the first moment of X. However. µ x not the sample mean. to find an expected value. just take the mean.027(35) = −..973 + . you will eventually lose.. + X N = i =1 N N N N X N = µx Note that the expected value of X is the true (population) mean. You either win $35 or lose a dollar. we can make roulette a fair game by increasing the payoff to $36 36 1 E ( X ) = p1 X 1 + p2 X 2 = (−1) + (36) = 0 37 37 Let’s assume that all the outcomes are equally likely.7 cents every time you play. over the long haul. It is also a weighted average of several events where the weight attached to each outcome is the probability of the event. you can expect to lose about 2. the game is said not to be fair. the payoff if your number comes up is 35 to 1. the casino can expect to make $27 for every 1000 bets. 36 1 E ( X ) = p1 X 1 + p2 X 2 = (−1) + (35) 37 37 = . X . If the expected value of the game is nonzero. Of course.946 = −.7 cents. but on average. Operationally. you will lose $27. A fair game is one for which the expected value is zero. the slots are numbered 036. as this one is. + pN X N = pi X i = µ x . So. This is known as the house edge and is how the casinos make a profit. if you play roulette in Monte Carlo. E ( X ) = p1 X 1 + p2 X 2 + . For example. However. what is the expected value of a one dollar bet? There are only two relevant outcomes. pi = 1 i =1 i =1 N N For example. 55 .
If X and Y are independent. N bX 1 + bX 2 + . Y ) = E[( X − µ x )(Y − µ y )]2 Var ( X + Y ) = E[( X + Y ) − ( µ x + µ y )]2 = E[( X − µ x ) + (Y − µ y )]2 = E[( X − µ x )2 + (Y − µ y ) 2 + 2( X − µ x )(Y − µ y )] = Var ( X ) + Var (Y ) + 2Cov( X .Expected values.. then E (b) = b . Y ) (10) (11) If X and Y are independent.. Y ) = E ( X − µ x )(Y − µ y ) = E[ XY − Y µ x − X µ y + µ x µ y ] = E ( XY ) − µ y µ x − µ x µ y + µ x µ y = E ( XY ) − E ( X ) E (Y ) = 0 56 . means and variance (1) (2) (3) (4) (5) (6) (7) If b is a constant. then E(XY)=E(X)E(Y). Cov( X . + bX N = b X i / N = bE ( X ) N i =1 E (a + bX ) = a + bE ( X ) E (bX ) = E ( X + Y ) = E ( X ) + E (Y ) = µ x + µ y E ( aX + bY ) = aE ( X ) + bE (Y ) = a µ x + b µ y E (bX ) 2 = b 2 E ( X ) The variance of X is ( X − µ x ) 2 N also written as Var ( X ) = E ( X − E ( X )) 2 The variance of X is also called the second moment of X. Var ( X ) = E ( X − µ x ) 2 = (8) (9) Cov( X .
8 THEORY OF LEAST SQUARES Ordinary least squares is almost certainly the most widely used statistical method ever created. “Markov’s Theorem. Method of least squares Least squares was first developed to help explain astronomical measurements. or easier to apply. containing nothing that Gauss hadn’t done a hundred years earlier. They knew that the stars didn’t move. The GaussMarkov theorem is the theoretical basis for its ubiquity. even though he was almost 100 years late. Markov published his version of the theorem in 1912. Certainly Legendre beat Gauss to print. Gauss claimed that he had been using the technique since 1795. so we can evaluate the least squares estimator. 57 . I think there is none more general more exact. since it prevents the extremes from dominating. Gauss published another derivation in 1823. But first.” Somehow. In the late 17’th Century. It remains unclear who thought of the method first. Gauss did show that least squares is preferred if the errors are distributed normally (“Gaussian”) using a Bayesian argument. which has since become the standard. We will investigate the GaussMarkov theorem below. “Of all the principles that can be proposed for this purpose [analyzing large amounts of inexact astronomical data]. we need to know some of the properties of estimators. but when they tried to measure where they were. aside maybe from the mean. This is the “Gauss” part of the GaussMarkov theorem. So which observations are right? According to Adrien Marie Legendre in 1805. using different methods. in the 1920’s. is appropriate for revealing the state of the system which most nearly approaches the truth. scientists were looking at the stars through telescopes and trying to figure out where they were. Markov’s name is still on the theorem. inexplicably unaware of the 1823 Gauss paper. than…making the sum of the squares of the errors a minimum. calls Markov’s results.” Gauss published the method of least squares in his book on planetary orbits four years later in 1809. Neyman. By this method a kind of equilibrium is established among the errors which. thereby greatly annoying Legendre. different scientists got different locations.
Since we get a different value for β each time we ˆ ˆ take a different sample. the lower its variance and the smaller the variance the more precise the estimate. or not. For example the formula for computing the mean X = ΣX / N is an estimator. ˆ than β is said to be efficient. Efficient estimators allow us to make more powerful statistical statements concerning the true value of the parameter being estimated. For this reason. no matter how small. Bias Bias is the difference between the expected value of the estimator and the true value of the parameter.Properties of estimators An estimator is a formula for estimating the value of a parameter. ˆ Bias= E β − β ˆ Therefore. Efficiency is important because the more efficient the estimator. ˆ ˆ The estimator β is efficient relative to the estimator β if Var β < Var β . You can visualize this by imagining a distribution with zero variance. Small sample properties There are three important small sample properties: bias. How confident can we be in our estimate of the parameter in this case? The drawback to this definition of efficiency is that we are forced to compare unbiased estimators. The properties of the sampling distribution of β will determine if β is a good estimator. β is a random variable. Obviously. The resulting value is the estimate. if E β = β the estimator is said to be unbiased. at least in small samples. we will use the more general concept of relative efficiency. The so called “small sample” properties are true for all sample sizes. There are two types of such properties. ( ) ( ) Efficiency ˆ If for a given sample size. the variance of β is smaller than the variance of any other unbiased estimator. ( ) ( ) 58 . The formula for estimating the ˆ ˆ β = Σxy / Σx 2 and the resulting value is the estimate. The distribution of β across different samples is called the ˆ ˆ ˆ sampling distribution of β . So is β a good estimator? slope of a regression line is ˆ The answer depends on the distribution of the estimator. It is frequently the case. as we shall see throughout the semester that we must compare estimators that may be biased. and mean square error. efficiency. unbiased estimators are preferred. The large sample properties are true only as the sample size approaches infinity.
with maximum precision. Define the probability limit as the limit of a probability as the sample size goes to infinity. when we say an estimator is efficient. the probability that the estimator will be infinitesimally different from the truth is one. The concept of mean square error is useful here. A distribution with no variance is said to be degenerate. ˆ ˆ ˆ mse β = Var β + Bias β ( ) ( ) ( ) 2 Large sample properties Large sample properties are also known as asymptotic properties because they tell us what happens to the estimator as the sample size goes to infinity. we mean efficient relative to some other estimator. Mean square error We will see that we often have to compare the properties of two estimators. Ideally. because it is no longer really a distribution. that is. Mean square error is defined as the variance of the estimator. as the sample size goes to infinity. a certainty. one of which is unbiased but inefficient while the other is biased but efficient. This requires that the estimator be centered on the truth. plus its squared bias. 59 . we want the estimator to approach the truth. it is a number. as N → ∞ . that is plim = lim Pr N →∞ Consistency A consistent estimator is one for which ˆ ˆ plim β − β < ε = lim Pr β − β < ε = 1 N →∞ ( ) ( ) That is. with no variance.From now on.
This is not true of unbiasedness. are true. That is. The mean and the variance of Y does not change between observations. If we compute the ratio of two unbiased estimates we don’t know anything about the properties of the ratio. All of the small sample properties have large sample analogs. an estimator is consistent if its mean square error goes to zero as the sample size goes to infinity. but unknown parameters and ui is a random variable. Finally. Note that a consistent estimator is. an estimator can be mean square error consistent. under the GaussMarkov 60 . As such it can be used in further derivations. (2) Yt is distributed with a mean of α + β X i for each observation i. The observations of Y come from a simple random sample. An estimator with bias that goes to zero as N → ∞ is asymptotically unbiased. constant. Yi = α + β X i + ui where .A biased estimator can be consistent if the bias gets smaller as the sample size increases. Since we require that both the bias and the variance go to zero for a consistent estimator. The GaussMarkov assumptions. with some error. u. a number. etc. GaussMarkov theorem Consider the following simple linear model. An estimator can be asymptotically efficient relative to another estimator. which means that the observation on Y1 is independent of Y2. We can do this because. in the limit. The GaussMarkov assumptions are (1) the observations on X are a set of fixed numbers. which are currently expressed in terms of the distribution of Y. The researcher sets the value of X without error and then observes the value of Y. (3) the variance of Yi (σ 2 ) is constant. and (4) the observations on Y are independent from each other. the ratio of two consistent estimators is consistent. For example. can be rephrased in terms of the distribution of the error term. We do not know if our estimate is equal to the truth. and . All we know about an unbiased estimate is that its mean is equal to the truth. an alternative definition is mean square error consistency. which is independent of Y3. The basic idea behind these assumptions is that the model describes a laboratory experiment. The property of consistency is said to “carry over” to the ratio.
least squares doesn’t work. Assumption (4) can also be relaxed. Therefore.” and iid means “independently and identically distributed.” The independent assumption refers to the simple random sample while the identical assumption means that the mean and variance (and any other parameters) are constant. ei = yi − β xi ei2 = ( yi − β xi ) N N 2 ei2 = ( yi − β xi ) i i 2 61 . we can subtract the means to get. Then. consistent. Therefore. the variance of u is equal to the variance of Y. that is the covariance between X and u must be zero. Yi − Y = α + β X i − α − β X Let yi = Yi − Y . as we shall see in Chapter 9 below. as we shall see in Chapter 12 below. E(X. Without it. then the ordinary least squares estimates of the parameters are unbiased. if the GaussMarkov assumptions are true.In fact. The proof of the unbiased and consistent parts are easy. Therefore. OK. i. E (Yi ) = α + β X i The “empirical analog” of this (using the sample observations instead of the theoretical model) is Y = α + β X where Y is the sample mean of Y and X is the sample mean of X. We want to minimize the sum of squares of error. everything in the model is constant except Y and u. The crucial requirement is that X must be independent of u. ui = Yi − α − β X i so that the mean is E (ui ) = E (Yi ) − α − β X i = 0 because E ( X i ) = X i which means that E (Yi ) = α + β X i by assumption (1).. the error term u is distributed with mean zero and variance σ 2 . We know that the mean of u is zero. ut ~ iid (0. So. σ 2 ) where the tilde means “is distributed as.assumptions. so why is least squares so good? Here is the GaussMarkov theorem: if the GM assumptions hold. but it is not strictly necessary. Assumption (1) is very useful for proving the theorem. What we have done is translate the axes so that the regression line goes through the origin. xi = X i − X where the lowercase letters indicate deviations from the mean. Assumption (2) is crucial. and efficient relative to any other linear unbiased estimator. Also. the GM assumptions can be succinctly written as Yi = α + β X i + ui . Assumption (3) can be relaxed. yi = β xi + ui (recall that the mean of u is zero so that u − u = u ). Define e as the error.e.u)=0.
. When we plug in the values of x and y from the sample. it has a distribution with a mean and a ˆ variance. by GaussMarkov assumption (1).. so take the derivative with respect to and set it equal to zero. We have not used our estimates to make any inferences concerning the true value of or . it ˆ ˆ must be that β is a random variable. ˆ Sampling distribution of β ˆ Because y is a random variable and we now know that the OLS estimator β is a linear function of y.. It is a formula for deriving an estimate. + ddβ ( yN − 2β xN yN + β 2 xN ) dβ 2 2 = −2 x1 y1 + 2 β x12 − 2 x2 y2 + 2 β x2 − . We can use this sampling distribution to 62 .. xi yi 1 = xi yi = wi yi 2 xi xi2 where x wi = i 2 xi ˆ The formula for β is linear because the w are constants. d d d 2 ei2 = ( yi − β xi ) = ( yi2 − 2 β xi yi + β 2 xi2 ) dβ dβ dβ d 2 = ( y12 − 2β x1 y1 + β 2 x12 ) + ddβ ( y2 − 2β x2 y2 + β 2 x22 ) + . xi yi − β xi2 = 0 β xi2 = xi yi ˆ Solve for and let the solution value be β : ˆ β= xi yi xi2 This is the famous OLS estimator. is that it is a linear function of the observations on Y. which is a numerical value. then eliminate the parentheses.We want to minimize the sum of squares of error by choosing .) ˆ β is an estimator. we can back out the OLS estimate for using the means. we have only described the data with a linear function. Y =α +βX ˆ ˆ Y =α +βX ˆ ˆ α =Y −βX ˆ The first property of the OLS estimate β . We call this distribution the sampling distribution of β . the ˆ β= x’s are constant. we can compute the estimate. − 2 xN y N + 2 β xN = −2 xi ( yi − β xi ) = 0 Multiply both sides by 2. (Remember. Note that so far. Once we get an estimate of . If β is a random variable.
ˆ ˆ ˆ Var β = E β − E β We know that xi ui ˆ β =β+ xi2 ( ) ( ( )) 2 ˆ = E β −β ( ) 2 63 . It can be shown that the OLS estimator has the smallest variance of any linear unbiased estimator. However. are true. ˆ The fact that the mean of the sampling distribution of β is equal to the truth means that the ordinary least squares estimator is unbiased and thus a good basis on which to make inferences concerning the true . so we will ignore it here. we will assume normally distributed errors below. that the x’s are fixed and that E ( ui ) = 0 . Note that the unbiasedness property does not depend on any assumption concerning the form of the distribution of the error terms. the independent and identical assumptions are useful in deriving the property that OLS is efficient. There is also a sampling distribution of want to make inferences concerning the intercept. assumed that the errors are distributed normally. That is. ˆ Variance of β ˆ The variance of β can be derived as follows. However. It also relies only on two of the GM assumptions. We have not. along with the other. then the OLS estimator is the Best Linear Unbiased Estimator (BLUE). So we really don’t need the independent and identical assumptions for unbiasedness. for example. xi yi xi ui ˆ E β = E = Eβ + 2 xi2 xi xi ui xi E (ui ) = E (β ) + E =β+ =β 2 xi2 xi ( ) because E (ui ) = 0 by GM assumption (1). but we seldom ˆ Mean of the sampling distribution of β ˆ The formula for β is ˆ β= xi yi xi2 xi ( β xi + ui ) xi2 x 2 i substitute yi = β xi + ui ˆ β= = β xi2 + xi ui = β xi2 x 2 i + xi ui xi ui =β+ 2 xi xi2 ˆ The mean of β is its expected value. ˆ α .make inferences concerning the true value of . if these two assumptions.
+ wN u N ) 2 2 2 2 = E ( w12u12 + w2 u2 + . 2 2 2 2 ˆ Var β = w12σ 2 + w2 σ 2 + . = σ N = σ 2 because of the identical assumption. E (u i u j ) = 0 for i ≠ j ... ˆ β −β = and xi ui = wi ui xi2 ˆ E β −β ( ) 2 xi ui 2 = E = E ( wi ui ) xi2 2 2 2 E ( wi ui ) = ( w1u1 + w2u2 + .. + wN ) σ 2 ( ) x x x = 1 2 σ 2 + 2 2 σ 2 + .. goes to infinity. σ 12 = σ 2 = . + wN E (u N ) = w12σ 12 + w2 σ 2 + ... the mean square error of β must go to zero as the sample size.......... + xN ) σ 2 2 2 ( xi ) = xi2 2 2 2 ( x ) 2 i σ2 = 2 1 σ2 σ2 = xi2 xi2 Consistency of OLS ˆ To be consistent.. x i =1 N 2 1 goes to ˆ infinity because we keep adding more and more squared terms. + wN u N ) 2 2 2 2 = E ( w12u12 + w2 u2 + . N.. As N goes to infinity.. Therefore. Therefore.. + N 2 σ 2 xi xi xi 1 2 = ( x12 + x22 + .So.wN −1wN u N −1uN ) because the covariances among the errors are zero due to the independence assumption. Therefore the consistency of the ordinary ˆ ˆ ˆ least estimator β depends on the variance of β . the mean square error goes to zero.. + wN σ 2 = ( w12 + w2 + . 64 . and the OLS estimate is consistent. ˆ ˆ ˆ mse( β ) = Var ( β ) + E ( β − β ) 2 Obviously the bias is zero since the OLS estimate is unbiased.. Thus. 2 2 2 2 2 2 2 2 E ( wi ui ) = w12 E (u12 ) + w2 E (u2 ) + . the variance of β goes to zero as N goes to infinity.. + w1wN u1u N + . that is. Var ( β ) = σ 2 x i =1 N 2 i .. + wN σ N 2 2 2 However. + wN u N + w1w2u1u2 + w1w3u1u3 + ... The ˆ mean square error of β is the sum of the variance plus the squared bias.
if the GM assumptions are true. + cN σ 2 65 . The GaussMarkov assumptions are as follows. β = cy = c( β x + u ) = β cx + cu Taking expected values. This estimator is unbiased if Note. β = β + cu and β − β = cu Find the variance of β . the OLS estimator has the smallest variance. + cN u2 ] = σ 2 c2 2 2 2 = c12σ 2 + c2 σ 2 + .. of all linear unbiased estimators. any other linear unbiased estimator will have a larger variance than the OLS estimator. Stated another way. 2 2 Var ( β ) = E β − β = E cu = E [ c1u1 + c2u2 + . That is. since cx =1 cx =1.Proof of the GaussMarkov Theorem According to the GaussMarkov theorem. We know that the OLS estimator is a linear estimator ˆ β= xy = wy x2 Where w = x / x2 . E ( β ) = β cx + cE (u ) = β cx Since E(u)=0 as before.σ 2 ) .. the ordinary least squares estimator is the Best Linear Unbiased Estimator (BLUE).. Consider any other linear unbiased estimator. ˆ β is unbiased and has a variance equal to σ 2 / x 2 . (1) The true relationship between y and x (expressed as deviations from the mean) is yi = β xi + ui (2) The residuals are independent and the variance is constant: ui ~ iid ( o. (3) The x’s are fixed numbers..
this variance must be greater than or equal to the variance of β . w(c − w) = wc − w2 = cx x2 − 2 x2 ( x2 ) cx x2 = − x 2 ( x 2 )2 = Since 1 1 − =0 2 x x2 cx =1. and Chisquare Before we can make inferences or test hypotheses concerning the true value of . According to the theorem. Therefore. we need to review some famous statistical distributions. Here is a useful identity. σ 2 (c − w) 2 ≥ 0 . 66 . Student’s t. We will be using all of the following distributions throughout the semester. Var ( β ) = σ 2 c 2 = σ 2 ( w2 + (c − w)2 ) = σ 2 w2 + σ 2 (c − w) 2 =σ2 x2 ( x ) 2 2 2 + σ 2 (c − w) 2 = σ2 x + σ 2 (c − w) 2 ˆ = Var ( β ) + σ 2 (c − w) 2 ˆ So Var ( β ) ≥ Var ( β ) because c=w. Fisher’s F. c = w + (c − w) c 2 = w2 + (c − w) 2 + 2 w(c − w) c 2 = w2 + (c − w) 2 + 2 w(c − w) The last term on the RHS is equal to zero. The only way the two variances are equal is if Inference and hypothesis testing Normal. which is the OLS estimator.
.0042645 1. Min Max +y  100000 . If we express X as a standard score. We can generate a normal variable with 100. σ 2 ) then ( a + bX ) ~ N ( a + bµ .17934483) 67 . f ( X  µ .002622 4. gen y=invnorm(uniform()) . z. start=4. now 100000 . width=. or zscore. That is.σ 2 )= 1 − ( X − µ )2 2 ϕ ( z) = 1 2π e − z2 2 Here is the graph.” If X is distributed normally. histogram y (bin=50. Consider the linear function of X. z = a + bX . it has the formula.000 observations as follows. That is. With mean and variance 2.Normal distribution The normal distribution is the familiar bell shaped curve.518918. σ 2 ) where the tilde is read “is distributed as.518918 4. e 2σ 2πσ or simply X ~ N ( µ .448323 . Let a = − µ σ and b = 1 σ then z = a + bX = − µ σ + X σ = ( X − µ ) σ which is the formula for a standard score. set obs 100000 obs was 0. summarize y Variable  Obs Mean Std. then any linear transformation of X is also distributed normally. if X ~ N ( µ . b 2σ 2 ) This turns out to be extremely useful. Dev. then z has a standard normal distribution because the mean of z is zero and its standard deviation is one.
4 4 2 0 y 2 4 The Student’s t. Chisquare distribution If z is distributed standard normal. the area under the curve must sum to one. then z 2 ~ χ 2 (1) which has a mean of 1 and a variance of 2.0 . and. 68 . Here is the graph. asymptotic to the horizontal axis.1). then its square is distributed as chisquare with one degree of freedom. Since chisquare is a distribution of squared values. They have associated with them their socalled degrees of freedom which are the number of terms in the summation. it has to be a skewed distribution of positive values. being a probability distribution.1 Density .2 . Fisher’s F. (1) If z ~ N(0. and chisquare are derived from the normal and arise as distributions of sums of variables.3 .
gen z9=invnorm(uniform()) .. then Here are the graphs for n=2.. now 10000 . then X i =1 N i ~ χ 2 (n) with mean=n and variance = 2n (3) Therefore. gen z1=invnorm(uniform()) . i=1 σ n 2 2 (5) If X 1 ~ χ 2 (n1 ) and X 2 ~ χ 2 ( n2 ) and X 1 and X 2 are independent. gen z3=invnorm(uniform()) .and 4. z2 . set obs 10000 obs was 0.91651339) 69 .. width=.σ ) variables then i ~ χ 2 ( n).. gen z8=invnorm(uniform()) . gen z2=invnorm(uniform()) . N(0...*this creates a chisquare variable with n=4 degrees of freedom .. variables.. X 2 .1). z2 .(2) If X 1 . then X 1 + X 2 ~ χ 2 ( n1 + n2 ). gen z5=invnorm(uniform()) . z i =1 n 2 i ~ χ 2 (n) . gen z7=invnorm(uniform()) . z (4) If z1 . gen z1=invnorm(uniform()) .. if z1 . gen v=z1^2+z2^2+z3^2+z4^2+z5^2+z6^2+z7^2+z8^2+z9^2+z10^2 . gen v=z1+z2+z3+z4+z5 ... start=1. histogram v (bin=40. zn are independent standard normal. gen z4=invnorm(uniform()) ..0842115. We can generate a chisquare variable by squaring a bunch of normal variables. X n are independent χ 2 (1) variables. gen z10=invnorm(uniform()) . . gen z6=invnorm(uniform()) .3. zn are independent N(0.
gen x1=invnorm(uniform()) 70 . then sum their squares to make a chisquare variable (v) with five degrees of freedom. finally. Here are graphs of some typical F distributions. like the chisquare.02 Density .04 . X2 ~ F (n1 . . n2 ) We can create a variable with an F distribution is follows: generate five standard normal variables.0 0 . so it is positive and therefore skewed. then 2 2 n1 n2 This is Fisher’s Famous Ftest.08 .06 . divide v by z to make a variable distributed as F with 5 degrees of freedom in the numerator and 10 degrees of freedom in the denominator.1 10 20 v 30 40 Fdistribution X1 (6) If X 1 ~ χ ( n1 ) and X 2 ~ χ ( n2 ) and X 1 and X 2 are independent. It is the ratio of two squared quantities.
7094975 9. This is X n Student’s t distribution.1) and X ~ χ 2 ( n) and z and X are independent. . Dev. .962596 4. compared to the standard normal.5 1 1 f 2 3 tdistribution ~ t ( n) . if f<3 & f>=0 1.901778 14. .084211 37.0526 4. . It looks like the normal distribution but with “fat tails. gen x2=invnorm(uniform()) gen x3=invnorm(uniform()) gen x4=invnorm(uniform()) gen x5=invnorm(uniform()) gen v1=x1^2+x2^2+x3^3+x4^2+x5^2 gen f=v1/v * f has 5 and 10 degrees of freedom summarize v1 v f Variable  Obs Mean Std.. It can be positive or negative and is symmetric. histogram f. like the normal distribution.74475 f  10000 .4877415 . more probability in the tails of the distribution. Min Max +v1  10000 3. then the ratio T = z 71 .” that is.478675 1.5 Density 0 0 . (7) If z ~ N(0. .54086 v  10000 10. .72194 .791616 62.87189 66. .
gen t=z/sqrt(v) .(8) If T ~ t (n) then T 2 = variables.9989421. gen z=invnorm(uniform()) .8 1 0 t 1 2 72 . z2 ~ F (1. histogram t if abs(t)<2. gen y4=invnorm(uniform()) . gen y2=invnorm(uniform()) . gen y3=invnorm(uniform()) .2 Density . width=. normal (bin=49. set obs 100000 obs was 0.6 . gen y5=invnorm(uniform()) .4 . n) because the squared tratio is the ratio of two chisquared X n We can create a variable with a t distribution by dividing a standard normal variable by a chisquare variable. gen v=y1^2+y2^2+y3^2+y4^2+y5^2 . . start=1.08160601) 0 2 . now 100000 . gen y1=invnorm(uniform()) .
Let’s make the usual assumption that the errors in the original regression model are distributed normally. X 2 → ∞ . NF χ 2 . In this simple regression. k=2 because we have two ˆ σ2 = e 2 i ˆ parameters.” (We don’t need the identical assumption because the normal distribution has a constant variance. 2 73 . the square root of its variance. σ 2 ) which is translated as “ui is distributed independent normal with mean zero and variance σ 2 . That is. β ~ IN ( β . we need to make an assumption concerning the form of the function. then β is distributed ˆ normally. we can summarize the sampling distribution of β as β ~ iid ( β . N is the sample size. i =1 where ei2 is the squared residual corresponding to the i’th observation.) Since ˆ ˆ β is a linear function of the error terms and the error terms are distributed normally. σ 2 xi2 ) . If we want to actually use this distribution to make inferences. construct the test statistic. t → N (0. RSS / N NR 2 = N TSS / N RSS / N 2 As N → ∞ . ui ~ iidN 0. TSS / N → 1 so that NR 2 → N = RSS ~ χ 1 (12) As N → ∞. σ 2 xi2 ) . σ is the estimated error variance. σ 2 where the N stands for normal. (11) As N → ∞. that is. so that the ratio goes to one. N −k and k is the number of parameters in the regression. This assumption can also be written as ( ) ui ~ IN (0.1) (10) As n2 → ∞. We are now in position to make inferences concerning the true value of . T= ˆ β − β0 S βˆ N −k where Sβˆ = x ˆ σ2 2 i ˆ is the standard error of β . See (10) above. Testing hypotheses concerning ˆ ˆ So. To test the null hypothesis that β = β 0 where β 0 is some hypothesized value of . It is equal to the error sum of squares divided by the degrees of freedom (Nk) and is also known as the mean square error. F = X 1 n1 → X 1 n1 X 2 n2 χ 2 (n1 ) because as n2 → ∞. NR 2 → χ 2 where R 2 = RSS RSS / N = TSS TSS / N and RSS is the regression (model) sum of squares. and .Asymptotic properties (9) As N goes to infinity.
If we use data points to estimate the parameters. we can solve for the slope and intercept. This is the “t” reported in the regress output in Stata. we can find any data point knowing only the other two. we can’t use them to estimate the variance. There is one degree of freedom that could be used to estimate the variance. we have one parameter to estimate. We estimate the ˆ ˆ ˆ slope with the formula β = xy / x 2 and the intercept with the formula α = Y − β X . The best explanation of the concept that I know of is due to Gerrard Dallal (http://www.htm). ( X − µ ) = X − N µ = NX − N µ = N ( X − µ ) 74 . If we have three data points. Suppose we are trying to estimate the mean of a data set with X=(1. ˆ β . yielding N1 data points to estimate the variance. This leaves the remaining N2 data points to estimate the variance. Suppose we have a data set with N observations. and a number of other statisticians. namely that X and Y are unrelated. We can also look at the regression problem this way: it takes two points to determine a straight line. to a chisquare variable (the sum of a bunch of squared. We can use the data in two ways. That is. who discovered the chisquare distribution around 1900. also known as a tratio. then there are many lines that could be defined as “fitting” the data.edu/~gdallal/dof. it is distributed as student’s t with Nk degrees of freedom. Consequently.2. not a statistical one. using up two data points.tufts. but this is a mathematical problem. suppose we are estimating a simple regression model with ordinary least squares. error terms). we have to demonstrate that this is an unbiased estimator for the variance of X (and therefore. Once we know this. not the sample size (N). is the ratio of a normally distributed random variable. ˆ β . ( X − X ) 2 = ( X − µ ) − ( X − µ ) 2 = ( X − µ )2 − 2( X − µ )( X − µ ) + ( X − µ ) 2 = ( X − µ ) 2 − 2( X − µ ) ( X − µ ) + N ( X − µ )2 because X and µ are both constants. Therefore. the formula for the sample variance of X (we have necessarily already used the data once to compute the sample mean) is S x2 = ( X − X ) 2 N −1 where we divide by the number of degrees of freedom (N=1). Also. To do this test set β 0 = 0 so that the test statistic reduces to the ratio of the estimated parameter to its standard error. Pearson.3). To prove this. to estimate the parameters or to estimate the variance. Sir Ronald Fisher solved the problem in 1922 and called the parameter degrees of freedom. To take another example. T= S βˆ Degrees of Freedom Karl Pearson. By far the most common test is the significance test. dividing by N yields a biased estimator). kept requiring ad hoc adjustments to their chisquare applications. applied it to a variety of problems but couldn’t get the parameter value quite right. we used up one degree of freedom to estimate the mean and we only have two left to estimate the variance. The sum is 6 and the mean is 2.Notice that the test statistic. If we only have two points. normally distributed. In the example of the mean. Our simple rule is that the number of degrees of freedom is the sample size minus the number of parameters estimated.
( X − X ) 2 N 1 E ( X − µ ) 2 − ( X − µ )2 = E N −1 N −1 N −1 = 1 N E ( X − µ )2 − E ( X − µ )2 N −1 N −1 Note that the second term is the variance of the mean of X: E ( X − µ )2 = But. E ( x 2 ) = E ( ( X − X ) 2 ) = ( N − 1)σ x2 . 75 . yields an unbiased estimate of the true error variance. ( X − µ ) 2 = Var ( X ) N 1 1 Var ( X ) = Var X = 2 Var ( X 1 + X 2 + .. ( X − X ) 2 = ( X − µ ) 2 − 2 N ( X − µ ) 2 + N ( X − µ ) 2 = ( X − µ ) 2 − N ( X − µ ) 2 Now we can take the expected value of the proposed formula.which means that the middle term in the quadratic form above is −2( X − µ ) ( X − µ ) = −2( X − µ ) N ( X − µ ) = −2 N ( X − µ ) 2 and thefore. ( X − X ) 2 N 1 E ( X − µ )2 − ( X − µ )2 = E N −1 N −1 N −1 2 N N σx N 1 = σ x2 − = σ x2 − σ x2 N −1 N −1 N N −1 N −1 N −1 2 = σ x = σ x2 N −1 Note (for later reference). + X N ) N N σ2 1 2 = 2 Nσ x = x N N Therefore.. Estimating the variance of the error term We want to prove that the formula ˆ σ2 = e2 N −2 where e is the residual from a simple regression model.
ˆ y = βx+u So. ( ) ( ) 2 x2 ˆ ˆ e2 = u 2 − 2 β − β xu + β − β Taking expected values. ˆ ˆ e = βx+u −βx = u − β −β x Squaring both sides. ( ) ( ) 2 x2 ˆ ˆ E e 2 = E u 2 − 2 E β − β xu + E β − β ( ) ( ) 2 x2 ˆ Note that the variance of β is ˆ ˆ Var β = E β − β x2 Which means that the third term of the expectation expression is ˆ E β − β ( ) ( ( ) 2 = σ2 ) 2 x2 = x2 = σ 2 x2 σ2 We know from the theory of least squares that ˆ β −β = xu x2 Which implies that ˆ xu = β − β x 2 Therefore the second term in the expectation expression is ( ) 76 . ( ) ˆ ˆ e 2 = u 2 − 2 β − β xu + β − β Summing. ˆ e = y−βx But.In deviations.
In the current application x=e and S x2 = σ 2 . 77 . so we divide by N2 when we take the average.ˆ −2 E β − β ( )( βˆ − β ) x = −2E ( βˆ − β ) 2 2 x2 = −2 σ2 x2 x 2 = −2σ 2 The first term in the expectation expression is E u 2 = ( N − 1) σ 2 because the unbiased estimator of the variance of x is S x2 = x 2 /( N − 1) so that the expected value of E ( x 2 ) = ( N − 1) S x2 . we get E e 2 = ( N − 1)σ 2 + σ 2 − 2σ 2 = N σ 2 − σ 2 + σ 2 − 2σ 2 = ( N − 2)σ 2 Which implies that the unbiased estimate of the error variance is ˆ σ2 = e2 N −2 2 e 2 E e ( N − 2)σ 2 = ˆ E σ 2 = E = =σ2 N − 2 N −2 N −2 If we think of each of the squared residuals as an estimate of the variance of the error term. Therefore. then the formula tells us to take the average. putting all three terms together. However. we only have N2 independent observations on which to base our estimate of the variance.
That is. (Some of them are much further than away. Chebyshev’s inequality states that. σ x2 = ( X − µ ) 2 P( X ) ≥  X − ≥ ε ( X −µ ) µ 2 P( X ) We also know that all of the X’s in the limited summation are at least away from the mean. from any arbitrary distribution.) Therefore. because we are omitting some (positive. replacing ( X − µ ) with . the probability that an observation lies more than some distance. which is the variance of x. squared) terms. for any variable x. More generally. σ x2 ≥ ≥  X − ≥ ε ( X −µ ) µ εε µ 2 2 P( X ) P( X ) = ε 2  X − ≥  X − ≥ ε P( X ) = ε µ 2 P ( X − µ ≥ ε ) 78 . µ Proof The variance of x can be written as Xi X σ x2 = ( X − µ )2 1 = ( X − µ ) 2 N N if all observations have the same probability (1/N) of being observed. from the mean must be less than or equal to σ x2 / ε 2 . σ x2 = ( X − µ ) 2 P ( X ) Suppose we limit the summation on the right hand side to those values of X that lie outside the ±ε box. .Chebyshev’s Inequality Arguably the most remarkable theorem in statistics. Then the new summation could not be larger than the original summation. σ x2 P ( X − µ ≥ ε ) ≤ 2 ε The following diagram might be helpful.
but we are not allowed to look inside the box.. P ( X − µ ≤ kσ x ) ≤ 1 − 1− So. That is. the probability that the observations lie less than k sd from the mean is one minus this probability. the better the estimate. X 2 . if we have any data whatever. To prove this version. For any set of data and any constant k>1. the theorem is. 95% of normally distributed data lie within two standard deviations of the mean. Suppose we are presented with a large box full of black and white balls.. 4 4 (We can be more precise if we know the distribution. Mathematically. For example. let k2 ε = kσ x P ( X − µ ≥ kσ x ) ≤ σ x2 1 = 2 2 2 k σx k 1 k2 Since this is the probability that the observations lie more than k standard deviations from the mean. Let X 1 .So. Divide that into the number of trials and you have the proportion. suppose we have been challenged to guess how many thumbtacks will end up face down if 100 are dropped off a table. Multiply by 100 and that is your guess. we know that 1 3 = = 75% of the data lie within two standard deviations of the mean. X N be an independent trials process where E ( X i ) = µ and σ x2 = Var ( X i ) .. and we set k=2. at least 1 − 1 deviations of the mean. What do we do? Answer: sample with replacement. the proportion of successful outcomes will tend to approach the constant probability that any one of the outcomes will be a success.) How would we prepare for such a challenge? Toss a thumbtack into the air and record how many times it ends up face down. 79 . The number of black balls that we draw divided by the total number of balls drawn will give us an estimate. σ x2 ≥ ε 2 P ( X − µ ≥ ε ) 2 Dividing both sides by ε σ x2 ≥ P ( X − µ ≥ ε ) ε2 Turning it around we get. We want to know the proportion of black balls. For example.) Law of Large Numbers The law of large numbers states that if a situation is repeated again and again. The more balls we draw. (Answer: 50. P ( X − µ ≥ ε ) ≤ σ x2 ε2 of the observations must lie within k standard QED The theorem is often stated as follows..
Then. The incredible part of this theorem is that no restriction is placed on the distribution of x.. for any > 0 S P − µ ≥ ε 0 → N →∞ N Equivalently.. Let 2 σX . 80 . Central Limit Theorem This is arguably the most amazing theorem in mathematics. S E N X 1 + X 2 + . + X N ) = Nσ x Because there is no covariance between the X’s.. the law of large numbers is often called the law of averages.. Therefore.. + X N ) / N ) = 2x = x N N N Also.. Statement: Let x be a random variable distributed according to f(x) with mean and variance of a random sample of size N from f(x). Let X be the mean z= X −µ 2 σX . No matter how x is distributed (as long as it has a finite variance). the sample mean of a large sample will be distributed normally.. + X N . That is. for any > 0. Nσ 2 σ 2 S Var = Var (( X 1 + X 2 + . S σ2 S Var N P − µ ≥ ε ≤ = x 2 0 → N →∞ ε2 Nε N Equivalently. + X N = E N Nµ =µ = N By Chebyshev’s Inequality we know that.Let S = X 1 + X 2 + . S P − µ < ε 1 → N →∞ N Since S/N is an average. We know that 2 Var ( S ) = Var ( X 1 + X 2 + . S P − µ < ε 1 → N →∞ N Proof. the probability that the difference between S/N and the true mean is less than ε ( ) is one minus this.. N The density of z will approach the standard normal (mean zero and variance one) as N goes to infinity.
One of the few people who didn’t ignore it in the 19’th Century was Sir Francis Galton. For example. an unsuspected and most beautiful form of regularity proves to have been latent all along. according to the theorem. It was resurrected in 1812 by LaPlace. Infinity comes quickly for the normal distribution. It is the supreme law of Unreason. Here is the histogram of the binomial distribution with an equal probability of generating a one or a zero. It used to be that a sample of 30 was thought to be enough. the mean of a large number of sample means will be distributed normally. The huger the mob. if they had known of it. The CLT is the reason that the normal distribution is so important.Not only that. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude. However. and the greater the apparent anarchy. the more perfect is its sway. but was still mostly ignored until Lyapunov in 1901 generalized it and showed how it worked mathematically. 1889: I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error".. It reigns with serenity and in complete selfeffacement. He wrote in Natural Inheritance. who was a big fan. We can’t prove this theorem here. The theorem was first proved by DeMoivre in 1733 but was promptly forgotten. Nowadays we usually require 100 or more. amidst the wildest confusion. but the sample doesn’t really have to be that large. The law would have been personified by the Greeks and deified. the central limit theorem is considered to be the unofficial sovereign of probability theory. we can prove it for certain cases using Monte Carlo methods. the binomial distribution is decidedly nonnormal since values must be either zero or one. However. 81 . who derived the normal approximation to the binomial distribution. Nowadays.
The histogram of the resulting data is: . Dev.3 2 0 z 2 4 This is the standard normal distribution with mean zero and standard deviation one.0 1 2 Density 3 4 5 0 . subtracting the mean (50.4 x .2 . I then computed the mean and variance of the resulting 10.000 samples of size N=100 from this distribution.0912 4.981402 31 72 I then computed the zscores by taking each observation.1 Density .8 1 Not normal.2 . summarize z 82 . .0912) and dividing by the standard deviation. Variable  Obs Mean Std.4 0 4 . I asked Stata to generate 10.6 . Min Max +sum  10000 50.000 observations.
000 samples of size N=200 from this data set.398119 Here is another example. 83 .000 or more. Here is the histogram.2 497314.0e07 Density 1. Min Max +mx  10000 287657 41619. I then computed the mean and standard deviation of the resulting data. Min Max +z  10000 1. summarize z Variable  Obs Mean Std.Variable  Obs Mean Std.0e06 0 0 5.37e07 1 2. Dev. 2. Dev. This is the distribution of the population of cities in the United States with popularion of 100. Suppose we treat this distribution as the population.57e07 1 3.5e06 2000000 4000000 population 6000000 8000000 It is obviously not normal being highly skewed. the mean is zero and the standard deviation is one.0e06 1. There are very few very large cities and a great number of relatively small cities.253426 5. According to the Central Limit Theorem.6 I then computed the corresponding zscores. the zscores should have a zero mean and unit variance. summarize mx Variable  Obs Mean Std.832496 4. .64 193870.03747 As expected. Dev. Min Max +z  10000 3. I told Stata to take 10.
σ ) This is the likelihood function. Y2 .0 2 .2 . etc.1 Density . YN ) = ∏ N 1 1 ( 2πσ ) 2 1 2 e { 1 − 2σ 2 (Y −α − β X ) i i 2 } = L(α . σ 2 ) So. α . the Y are distributed normally. and σ . The probability of observing the sample is the probability of observing Y1 and Y2. Therefore.. Pr(Y1 .. N 1 84 .. Method of maximum likelihood Suppose we have the following simple regression model Yi = α + β X i + U i U i ~ IN (0. The probability of observing the first Y is 1 2πσ 2 Pr(Y1 ) = { e − 1 2σ 2 U1 2 }= 1 2πσ 2 { e − 1 2σ 2 ( Y1 − α − β X 1 ) 2 } The probability of observing the second Y is exactly the same.. β . Y2 . but it is very close to the standard normal..4 0 z 2 4 6 It still looks a little skewed. YN ) = Pr(Y1 ) Pr(Y2 ) Pr(YN ) because of the independence assumption. The sample consists of N observations. Pr(Y1 . β .3 . which gives the probability of observing the sample that we actually observed. . except for the subscript. The symbol ∏ means multiply the items indexed from one to N.. It is a function of the unknown parameters..
we maximize the likelihood of the sample by taking derivatives with respect to α . OLS is also a maximum likelihood estimator. and solving. log L = C − where N Q log σ 2 − 2 2σ 2 N C = − log(2π ) 2 which is constant with respect to the unknown parameters. σ 2 . the best guess of the values of the parameters are those that generate the most likely sample. ˆ ˆ Substitute the OLS values for and : α and β : log L(σ ) = C − where ˆ N Q log(σ 2 ) − 2 2 2σ ˆ ˆ ˆ Q = (Yi − α − β X i ) 2 i =1 N is the residual sum of squares. β . not a rare or unusual sample. we have to minimize the error sum of squares with respect to and . Therefore. We already know the OLS formulas. therefore. This is nothing more than ordinary least squares. Since Q is multiplied by a negative. to maximize the likelihood. N 1 Log L = − 1 log(2πσ 2 ) − (Yi − α − β X i ) 2 2 2σ 2 i =1 This function has a lot of stuff that is not a function of the parameters. setting the derivatives equal to zero. and σ . ESS. Now we have to differentiate with respect to and set the derivative equal to zero σ ˆ d log L(σ ) N 4σ Q =− + =0 4 dσ σ 4σ ˆ N Q = 3 σ N= ˆ Q σ2 ˆ Q ESS = N N ˆ σ2 = 85 . and Q = (Yi − α − β X i ) 2 i =1 N where Q is the error sum of squares. Mathematically. Let’s take logs first to make the algebra easier.The basic idea behind maximum likelihood is that we probably observed the most likely sample. so the only thing we need now is the maximum likelihood estimator for the variance.
β .σ 2 2 2 Max logL = C − ( ) Since N is constant. Consider the likelihood ratio. We can use this function to test hypotheses such as β = 0 or α + β = 10 as restrictions on the model. β . do two regressions. β .ˆ ˆ ˆ Now that we have estimates for α . we expect that the residual sum of squares (ESS) for the restricted model will be higher than the ESS for the unrestricted model. ESS(R). β . This means that the likelihood function will be smaller (not maximized) if the restriction is false. ˆ 2 2σ 2 ˆ 2σ So. if the restriction is false. and one without the restriction. λ= MaxL under the restriction ≤1 MaxL without the restriction The test statistic is −2 log(λ ) χ 2 ( p) where p is the number of restrictions. then the two values of ESS will be approximately equal and the two values of the likelihood functions will be approximately equal. 86 . As in the Ftest. we can substitute α .σ 2 2σ 2 2 ˆ ˆ Q Nσ Note that = . β . N N log ( N ) − 2 2 −N ˆ−N MaxL = const * Q 2 = const *( ESS ) 2 Likelihood ratio test Let L be the likelihood function. To do this test. If the hypothesis is true.and σ into the log likelihood function to get its maximum value: ˆ ˆ Q N N Q N log(σ 2 ) − 2 = C − log − N 2 ˆ ˆ ˆ α . Max logL = const − ˆ ˆ ˆ α . So. one with the restriction. saving ESS(U).and σ . N N N ˆ Max logL = C − log Q + log ( N ) − ˆ ˆ ˆ α .σ N ˆ log Q 2 ( ) where const = C + Therefore. saving the residual sum of squares.
so the (true) mean of the errors is zero. We therefore define the following empirical analogs of A1 and A2. ui ~ iid (0. For example. . sum them. that is both illustrative and easier. its significance is not well known in small samples. there is another approach. Yi = α + β X i + ui The crucial GaussMarkov assumptions can be written as (A1) E (ui ) = 0 (A2) E ( xi ui ) = 0 A1 states that the model must be correct on average. (a1) e = (a2) e i N x i ei N = 0 ei = 0 = 0 x i ei = 0 87 . However. It says that the x’s cannot be correlated with the error term. Multiple regression and instrumental variables Suppose we have a model with two explanatory variables. σ 2 ) We want to estimate the parameters . A2 is the alternative to the assumption that the x’s are a set of fixed numbers. set the derivatives equal to zero and solve. We define the error as ei = Yi − α − β X i − γ Zi At this point we could square the errors. Yi = α + β X i + γ Z i + ui . that is. the covariance between the explanatory variable and the error term must be zero. To see how instrumental variables works.λ= const *( ESS ( R )) −N 2 −N 2 const *( ESS (U )) log(λ ) = − ESS ( R) = ESS (U ) − N 2 N ( log( ESS ( R) − log( ESS (U ) ) 2 The test statistic is −2 log(λ ) = N ( log( ESS ( R) − log( ESS (U ) ) χ 2 ( p) This is a large sample test. This would generate the least squares estimators for this multiple regression. we have to make the assumptions operational. Like all large sample tests. Usually we just assume that the small sample significance is about the same as it would be in a large sample. known as instrumental variables. and . Since we do not observe ui. let’s first derive the OLS estimators for the simple regression model. take derivatives with respect to the parameters.
Y i =1 N i N N Or. ˆ β= Now that we know the we can derive the OLS estimators using instrumental variables. ˆ xi yi = β x 2 + xi ei But xi ei = 0 by (a2). ˆ xi yi = β x 2 ˆ Solve for β . So. We can use (a2) to derive the OLS estimator for β . the difference between the observed value and the predicted value. Define the residuals.(a1) says that the mean of the residuals is zero and (a2) says that the covariance between x and the residuals is zero. ˆ xi yi = β x 2 + xi ei Sum. yi = Yi − Y = α + β X i + ui − α − β X = Yi − β ( X i − X ) + ui yi = β xi + ui The predicted value of y is ˆ ˆ yi = β xi ˆ yi = yi + ei The observed value of y is the sum of the predicted value from the regression and the residual. ˆ yi = β xi + ei Multiply both sides by xi. Yi = α + β X i + γ Zi + ui 88 . e = (Y − α − β X ) = Y − Nα − β X i i i i i =1 i =1 i =1 i =1 N N N N i = 0 by (a1). Y −α − β X = 0 ˆ α =Y −βX −α − β X i =1 N i =0 which is the OLS estimator for . ei = Yi − α − β X i Sum the residuals. let’s do it for the multiple regression above. Divide both sides by N. xi yi x2 which is the OLS estimator.
We have to make three assumptions (with their empirical analogs). Back substitution from high school algebra will work just fine. Now we can work in deviations. y.2 + γˆ xi zi + xi ei But. There are a variety of ways to solve two equations in two unknowns. Y −α − β X − γ Z = 0 ˆ α = Y − β X −γ Z which is a pretty straightforward generalization of the simple regression estimator for . ˆ yi = β xi + γˆ zi + ei Multiply by x and sum to get. Even though these equations look complicated with all the summations. these summations are just numbers derived from the sample observations on x. ˆ γˆ zi2 = zi yi − β zxi zi yi ˆ zxi −β zi2 zi2 Substitute into N1 γˆ = 89 . zi ei = 0 by (A3). since we have three parameters. so ˆ ˆ (N2) z y = β zx + γ z 2 i i i i ˆ We have two linear equations (N1 and N2) in two unknowns. ˆ xi yi = β x. and z. e (A1) E ( ui ) = 0 =0 N xi ei (A2) E ( xi ui ) = 0 = 0 xi ei = 0 N zi ei (A3) E ( zi ui ) = 0 = 0 zi ei = 0 N From (A1) we get the estimate of the intercept. xi ei = 0 by (A2). yi = Yi − Y = α + β X i + γ Z i + ui − α − β X + γ Z yi = β xi + γ zi + ui The predicted values of y are ˆ ˆ yi = β xi + γˆ zi So that y is equal to the predicted value plus the residual. so ˆ (N1) x y = β x 2 + γˆ x z i i . i i Now repeat the operation with z: ˆ zi yi = β zi xi + γˆ zi2 + zi ei ˆ zi yi = β zxi + γˆ zi2 + zi ei But. Solve for γˆ from N2.2 + γˆ xi zi + xi ei ˆ xi yi = β x. β and γˆ . ei = Yi − N α − β X i − γ Z i = 0 ei Yi Xi Zi = −α − β −γ =0 N N N N Or.
γˆ = zi yi xi2 − xi yi xi zi x. We might be able to make some sense out of these formulas if we translate them into simple regression coefficients.2 zi2 − zxi xi zi ) = xi yi zi2 − zi yi xi zi ˆ β= ˆ β= xi yi zi2 − zi yi xi zi x. The OLS estimator for γˆ is derived similarly.2 zi2 − ( xi zi ) 2 Interpreting the multiple regression coefficient.2 zi2 + zi2 −β xi zi 2 zi2 zi ˆ ˆ x y z 2 = β x 2 z 2 + z y − β zx x z i i i . xi zi = rxz xi2 zi2 zi yi = rzy zi2 yi2 ˆ Substituting into the formula for β ˆ β= rxy xi2 yi2 zi2 − rzy zi2 yi2 rxz xi2 zi2 x.2 + −β 2 zi2 zi Multiply through by zi2 xi zi zi yi ˆ zxi ˆ xi yi zi2 = β x. zi yi ˆ zxi ˆ xi yi = β x.2 zi2 − rxz xi2 zi2 ( ) 2 Cancel out zi2 which appears in every term. ˆ This is the OLS estimator for β . ˆ ˆ β x 2 z 2 − β zx x z = x y z 2 − z y x z . xi yi = rxy xi2 yi2 And. i i i i i i i i i i i ˆ β ( x.2 zi2 − zxi xi zi xi yi zi2 − zi yi xi zi x.2 zi2 − ( xi zi ) 2 because xi zi = zi xi .2 zi2 + zi yi xi zi − β zxi xi zi 2 i ˆ Collect terms in β and move them to the left hand side. 90 . i ( i i i ) i i ˆ ˆ xi yi z = β x. The formula for a simple correlation coefficient between x and y is xi yi xi yi xi yi rxy = = = 2 2 NS x S y xi yi xi2 yi2 N N N Or.
the numerator of the formula for β starts with the total effect on X on Y. Y on X. then X and Z are said to be orthogonal and rxz = 0. In this case. S ˆ r −r r S β = xy zy xz y = rxy y 2 1 − rxz S x Sx Sy = rzy Sz These are the results we would get if we did two separate simple regressions. X is correlated with Z. γˆ = rzy − rxy rxz S y 2 1 − rxz S z 91 . Therefore it is impossible to hold Z constant to find the separate effect of X.2 − rxz xi2 Factor out the common terms on the top and bottom. The formula for γˆ is analogous. Z has to change. Note that. If X and Z are perfectly collinear. but then subtracts off the indirect effect of X on Y through Z.ˆ β= rxy xi2 yi2 − rzy yi2 rxz xi2 x. holding Z constant. It is the total effect of Z on Y minus the indirect effect of Z on Y through X. and Y on Z. rzyrxz is the effect of X on Y through Z. Z varies. rxz = 1 and the formula does not compute. The second term. γˆ = The multiple regression coefficient is the effect on Y of a change in X. if there is no correlation between X and Z. It partials out the effect of X on Y. When x and z are perfectly correlated. This could happen if X is total wages and Y is total income minus nonwage payments. while statistically holding Z constant. with no error term. Let’s concentrate on the ratio. holding nothing constant. In other words.2 − rxz xi2 ( ) 2 = rxy xi2 yi2 − rzy yi2 rxz xi2 2 x. ˆ The simple correlation coefficient rxy in the numerator of the formula for β is the total effect of X on Y. Stata will simply drop Z from the regression and estimate a simple regression of Y on X.2 rxy − rzy rxz = 2 1 − rxz Sy Sx y2 i x2 i = rxy − rzy rxz yi2 N rxy − rzy rxz = 2 2 1 − rxz xi2 N 1 − rxz The corresponding expression for γˆ is rzy − rxy rxz S y 2 1 − rxz S z The term in parentheses is simply a positive scaling factor. X and Y are said to be collinear because one is an exact function of the other. so when X varies. ˆ β == rxy − rzy rxz xi2 yi2 2 1 − rxz x. ˆ Therefore. the multiple regression estimators collapse to simple regression estimators. then whenever X changes. This causes Y to change because Y is correlated with Z. This is the case of perfect collinearity. The denominator 2 is 1 − rxz . It therefore measures the effect of X on Y. the socalled partial effect of X on Y.
108 22.3333333 Residual  1166. Std.41 0.17 = 0. temperature. although not significantly.3333333 1 33.666667 4.0278 = 0. Let’s do a multiple regression with both rain and temp.56687 175. regress yield rain Source  SS df MS +Model  33. Suppose we type the following data into the data editor in Stata. Interval] +rain  1. the coefficient on rain is 1. The problem is that we have an omitted variable.6932 = 0.5546 1.Multiple regression and omitted variable bias The reason we use multiple regression is that most real world relationships have more than one explanatory variable.944 yield  Coef.67 implying that rain causes crop yield to decline (!).428571 Number of obs F( 1.66667 40.9002  According to our analysis. Let’s do a simple regression of crop yield on rainfall.89 0. regress yield rain temp Source  SS df MS Number of obs = 8 92 .1343 = 13.66667 6 194.025382 0.51642 8. Err. t P>t [95% Conf.444444 +Total  1200 7 171. 6) Prob > F Rsquared Adj Rsquared Root MSE = 8 = 0.183089 _cons  76.693 11.
14 0.724382 Residual  968.1300 = 13. If xi zi ≠ 0 then rxz ≠ 0 and b xz ≠ 0 in which case ˆ β is biased (and inconsistent since the bias does not go away as N goes to infinity). The coefficient on x will be ˆ xi yi = xi ( β xi + γ zi + ui ) β= xi2 xi2 ˆ The expected value of β is.710247 +Total  1200 7 171.02238 98.551237 5 193.661814 0. Interval] +rain  . The coefficients are not significant. probably because we only have eight observations.5853 = 0.) We can write ˆ the expression for the expected value of β in a number of ways. The direction of bias ˆ depends on the signs of rxz and γ .428571 F( 2.358 2.839266 _cons  14.918 yield  Coef.60 = 0. t P>t [95% Conf. then β is biased upwards.1929 = 0.892 266. If they have 93 . Err.883 11. The omitted variable theorem The reason we do multiple regressions is that most things in economics are functions of more than one variable.70796 temp  1.16 0.38747 0. β xi2 γ xi zi xi ui ˆ E β = + + xi2 xi2 xi2 ( ) = β +γ xi zi xi2 Because xi ui = 0 by A2. yi = β xi + γ zi + ui However. r xi2 zi2 xi zi ˆ E β = β +γ = β + γ xz = β + γ rxz 2 xi xi2 ( ) zi2 N xi2 N S = β + γ rxz z = β + γ bxz Sx where bxz is the coefficient from the simple regression of Z on X. have the same sign.448763 2 115. we omit z and run a simple regression of y on x.9354 238. (The explanatory variables are uncorrelated with the error term. If we make a mistake and leave one of the important variables out.+Model  231.7243816 4. If rxz and γ . Assume that Z is relevant in the sense that is not zero.25919 12. Suppose the true model is the multiple regression (in deviations). Nevertheless.106639 4. Std. we cause the remaining coefficient to be biased (and inconsistent). 5) Prob > F Rsquared Adj Rsquared Root MSE = 0.351038 1.01 0.8907  Now both rain and warmth increase crop yields.366313 1. remember that omitting an important explanatory variable can bias your estimates on the included variables.
This was the case with rainfall and temperature in our crop ˆ yield example in Chapter 7. Suppose the true model is Yi = β 0 + β1 X 1i + β 2 X 2i + β3 X 3i + β 4 X 4i + ui But we estimate Yi = β 0 + β1 X 1i + β 2 X 2i + ui We know we have omitted variable bias. Now let’s consider a more complicated model. then r = 0 and β is unbiased. simply add them to the right hand side of each auxiliary regression. as a simple extension of the omitted variable theorem above.ˆ opposite signs. It can be shown that. S E ( γ ) = γ + β rzx x = γ + β bzx Sz where bzx is the coefficient from the simple regression of X on Z. In the Data Editor. ˆ E β = β +a β +b β ( ) ˆ E (β ) = β + a β + b β ˆ E (β ) = β + a β + b β 0 0 0 3 0 1 1 1 3 1 4 2 2 2 3 2 4 4 Here is a nice example in Stata. add more equations. Finally. If there are more omitted variables. but how much bias is there and in what direction? Consider a set of auxiliary (and imaginary if we don’t have data on X3 and X4) regressions of each of the omitted variables on all of the included variables. 94 . create the following variable. xz The expected value of γˆ if we omit X is analogous. note that if X and Z are orthogonal. then β is biased downward. X 3i = a0 + a1 X 1i + a2 X 2i X 4i = b0 + b1 X 1i + b2 X 2i Note if there are more included variables.
410601 7.0000 0 y  Coef. .57143 1. .0032 0. x2  1 .70 0. 2) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 7 . x4  1 . . Std.0000 1.9154 12.142857 +Total  11848 6 1974. 1.  Now omit X3 and X4. . gen x2=x^2 . . .27 0. . t P>t [95% Conf. Std.216498 14.66667 Number of obs F( 4.00966 11. .3485 x3  Coef. .443233 3. gen x4=x^4 Now create y. . .571429 4 167.031 1.9436 0.8640 0. t P>t [95% Conf.66667 Number of obs F( 2. . regress y x x2 Source  SS df MS +Model  11179.002 6.24 0. Err. . . regress y x x2 x3 x4 Source  SS df MS +Model  11848 4 2962 Residual  0 2 0 +Total  11848 6 1974.48789 _cons  9. . . 4) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 7 33. .928 y  Coef. .49 0.0185 0. Std. Interval] 95 .Create some additional variables. .654972 14. .44 0. Interval] +x  8 2. t P>t [95% Conf. .4642 1. _cons  1 . 4) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 7 12.4286 2 5589. gen x3=x^3 . .7835 x2  10. Interval] +x  1 . Err.7960 7. regress x3 x x2 Source  SS df MS +Model  1372 2 686 Residual  216 4 54 +Total  1588 6 264. gen y=1+x+x2+x3+x4 Regress y on all the x’s (true model) .71429 Residual  668.285714 7.281 30.43823  Do the auxiliary regressions. Err. x3  1 . . . .666667 Number of obs F( 2.
we want to avoid omitted variable bias.571429 1. we include too many control variables? Instead of omitting relevant variables.226109 2.g. in our attempt to avoid omitted variable bias.00 1. if we are studying the effect of threestrikes law on crime. we usually have one or two parameters that we are primarily interested in.010178 0.34915 12.142857 +Total  8148 6 1358 Number of obs F( 2. These numbers are exact because there is no random error in the model.67 0.007 3.141196 1.000 11.581149 x2  9. suppose we include irrelevant variables.242641 0.9445 0.226109 _cons  0 4.571429 4 113.. It turns out that including 96 . percent urban.29 These are the values of the parameters in the regression with the omitted variables.000 2. We are primarily interested in the coefficient on the interest rate in a demand for money equation.85573 x2  0 . Interval] +x  0 2. ( ) ˆ E ( β ) = β + a β + b β = 1 + 7(1) + 0(1) = 8 ˆ E ( β ) = β + a β + b β = 1 + 0(1) + 9.764979  Compute the expected values from the formulas above. For example. Err.144267 10. 4) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 7 34. Std.8017837 0. Target and control variables: how many regressors? When we estimate a regression model.77946 . The only purpose of the control variables is to try to guarantee that we don’t bias the coefficient on the target variables. These coefficients tell us if the demand curve downward sloping and whether the good is normal or inferior. To that end we include a list of control variables.0031 0.29(1) = −9.42857 2 3847.57(1) = 10.160577 8.01 0. You can see that the theorem works and both estimates are biased upward by the omitted variables. In a policy study the target variable is frequently a dummy variable (see below) that is equal to one if the policy is in effect and zero otherwise. In a demand curve we are typically concerned with the coefficients on price and income. To get a good estimate of the coefficient on the target variable or variables.28571 6.38873 5.) What happens if. regress x4 x x2 Source  SS df MS +Model  7695.000 5. percent in certain age groups.00 1.00 1.57 1 1 1 3 1 4 2 2 2 3 2 4 ˆ E β 0 = β 0 + a0 β3 + b0 β 4 = 1 + 0(1) − 10.79371 _cons  10. These variables associated with those parameters are called target variables.9167 10.+x  7 1.581149 5.25 0.77946 11.637 x4  Coef. etc. We include in the crime equation all those variables which might cause crime aside from the threestrikes law (e.001 6. our target variable is the threestrikes dummy. unemployment.04 0. t P>t [95% Conf.33641 6.71429 Residual  452.169 27.
but so does omitting a relevant variable. but not a lot. Discarding a variable whose true coefficient is less than its true (theoretical) standard error decreases the mean square error (the sum of variance plus bias squared) of all the remaining coefficients. in terms of mean square error. In that case. Unfortunately. including too many control variables will tend to make the target variables appear insignificant. Proxy variables It frequently happens that researchers face a dilemma. Such variables are known as proxy variables. so that you might have some insignificant controls in the final model. You may also come up with some new control variables. It might be possible. in some cases. so this is not very helpful. We can summarize the effect of too many or too few variables as follows. The bottom line is that we should include proxies for control variables. measurement error causes biased and inconsistent estimates. but drop them if they are not significant. if that is convenient. by omitting the variable. An interesting problem arises if the proxy variable is the target variable. including all the potentially relevant controls. even when they are truly significant. It will also increase the standard errors and underestimate the tratios on all the coefficients in the model. in other studies for example. However. However. We are also faced with the fact that the coefficient estimate is a compound of the true 97 . At this point you should be able to do valid hypothesis tests concerning the coefficients on the target variables. Use ttests and Ftests to justify your actions. or proxies. including all or none for each group. the bias that results from including a proxy is directly related to how highly correlated the proxy is with the unavailable variable. but generally we just have to hope. After you get down to a parsimonious model including only significant control variables. After doing your library research and reviewing all previous studies on the issue. So if τ = β Sβ <1 then we are better off. it is impossible to know how highly correlated they are because we can’t observe the unavailable variable. Sets of dummy variables should be treated as groups. including the coefficients on the target variables. As we see in a later chapter. What is a researcher to do? Monte Carlo studies have shown that the bias tends to be smaller if we include a proxy than if we omit a variable entirely. we are stuck with measurement error. do one more Ftest to make sure that you can go from the general model to the final model in one step. The dilemma is this: if we omit the proxy we get omitted variable bias. Unfortunately. Omitting a relevant variable biases the coefficients on all the remaining variables. to see how the two variables are related in other contexts. It is better to omit a poor proxy. dropping one or two variables at a time. However. including irrelevant variables will make the estimates inefficient relative to estimates including only relevant variables. Thus. and remove the insignificant ones. but decreases the variance (increases the efficiency) of all the remaining coefficients. Start with a general model. you will have comprised a list of all the control variables that previous researchers have used. if we include the proxy we get measurement error.irrelevant variables will not bias the coefficients on the target variables. we don’t know any of these true values. even though its parameter is not zero. What is the best practice? I recommend the general to specific modeling strategy. You can proceed sequentially. we may be able to obtain data on a variable that is known or suspected to be highly correlated with the unavailable variable. Data on a potentially important control variable is not available. but it does demonstrate that we shouldn’t keep insignificant control variables.
Min Max +salary  122 61050 18195. Suppose we know that Y =α +βX But we can’t get data on X. the product of the true coefficient relating X to Y and the coefficient relating X to Z. Min Max +salary  350 72024. . Dev.ssrn. See http://papers. summarize salary if male Variable  Obs Mean Std. That is. d . so researchers have to use proxies. if we know that b is positive. which is available and is related to X. that takes the value 1 if the faculty member is female and 0 if male.com/sol3/papers.34 29700 130000 Similarly. if the target variable is a proxy variable. X = a + bZ Substituting for X in the original model yields. then we can test the hypothesis that is positive using the estimated ˆ coefficient. I happen to have a data set consisting of the salaries of the faculty of a certain nameless university (salaries. Y = c + dZ the coefficient on Z is d = β b . Unless we know the value of b. we do know if the coefficient d is significant or not and we do know if the sign is as expected. Therefore. Dev. The average salary at the time of the survey was as follows. Min Max 98 . An illustration of this problem occurs in studies of the relationship between guns and crime. . both make this mistake. the coefficient on the proxy for guns cannot be used to make inferences concerning the elasticity of crime with respect to guns. Dev. summarize salary if female Variable  Obs Mean Std. We can then find the average female salary. the estimated coefficient can only be used to determine sign and significance. For example. we have no measure of the effect of X on Y. we can create a dummy variable called male which is 1 if male and zero if female. Dummy variables Dummy variables are variables that consist of one’s and zeroes. They are extremely useful. There is no good measure of the number of guns. So we use Z.cfm?abstract_id=473661 for a discussion of this issue. Nevertheless two studies. . However. summarize salary Variable  Obs Mean Std. The average male salary is somewhat higher.02 20000 173000 Now suppose we create a dummy variable called female. As we have just seen.dta). They are also known as binary variables.parameter linking the dependent variable and the unavailable variable and the parameter relating the proxy to the unavailable variable. Y = α + β X = α + β ( a + bZ ) = (α + β a) + β bZ So if we estimate the model. one by Cook and Ludwig and one by Duggan.92 25213.
2 Degrees of freedom: 348 Ho: mean(salarymale) . Dev.34 29700 130000 salarymale  228 77897.328 18195.0000 Ha: diff != 0 t = 6.87 20000 173000 The difference between these two means is substantial. scalar salarydif=6105077897 .87 20000 173000 Now test for the difference between these two means using the Stata command ttest.693 25213.32 +combined  350 72024.0000 There appears to be a significant difference between male and female salaries.46 26485. create two variables for men’s salaries and women’s salaries.423 11567. Dev. First.02 20000 173000 salaryfem  122 61050 18195. Err.87 74441. regress salary female 99 . unpaired Twosample t test with equal variances Variable  Obs Mean Std. summarize salary* Variable  Obs Mean Std. . .68 64311.8 salary~m  122 61050 1647. ttest salarymale=salaryfem.2760 P > t = 0. Min Max +salary  350 72024.3 74675.73 22127. It is somewhat more elegant to use regression analysis with dummy variables to achieve the same goal.0000 Ha: diff > 0 t = 6. Is this difference significant given the variance in salary? Let’s do a traditional test of the difference between two means by computing the variances and using the formula in most introductory statistics books. .46 2684. gen salarymale=salary if female==0 (122 missing values generated) . scalar list salarydif salarydif = 16847 So.02 69374. women earn almost $17. Std. gen salaryfem=salary if female==1 (228 missing values generated) . [95% Conf.+salary  228 77897. Interval] +salary~e  228 77897.mean(salaryfem) = diff = 0 Ha: diff < 0 t = 6. We can use the scalar command to find the difference in salary between males and females. .46 1754.92 25213.34 57788.000 less than men on average.2760 P < t = 1.54 +diff  16847.92 1347.069 26485.46 26485.2760 P > t = 0.12 81353.
Err.2558e+10 1 2.28 0.9930e+11 348 572701922 +Total  2.423 6.73 _cons  77897.46 2684. Std.423 6.2186e+11 349 635696291 Number of obs F( 1.423 6.628 28.000 56788.1017 0.000 22127.882 49. female=1male.0991 23931 salary  Coef. Interval] +female  16847.46 1584.000 11567.2558e+10 Residual  1.0000 0. Interval] +male  16847.46 2684.2558e+10 Residual  1.2 _cons  61050 2166.2558e+10 Residual  1.18 0. and the intercept is simply 1.67 65311. regress salary male Source  SS df MS +Model  2.46 2684.39 0.1017 0.39 0.61  Note that the coefficient on female is the difference in average salary attributable to being female while the intercept is the corresponding male salary. So. Std. 348) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 350 39.000 11567.0991 23931 salary  Coef. we can generate the bonus attributable to being male by doing the following regression. so we can’t use all three terms in the same regression.0991 23931 salary  Coef. 348) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 350 39.28 0.73 22127. Err.2558e+10 1 2.31 81014. Interval] +male  16847.2 100 .33  Now the female salary is captured by the intercept and the coefficient on male is the (significant) bonus that attends maleness.9930e+11 348 572701922 +Total  2.2 11567.39 0. t P>t [95% Conf. .73 22127.1017 0.2558e+10 1 2.2186e+11 349 635696291 Number of obs F( 1. Err. t P>t [95% Conf.9930e+11 348 572701922 +Total  2. This is the dummy variable trap. t P>t [95% Conf.28 0. Std. Since male=1female. Since male=1female. there appears to be significant salary discrimination against women at this university. This is exactly the same results we got using the standard ttest of the difference between two means.2186e+11 349 635696291 Number of obs F( 1.0000 0.000 74780. The tratio on female tests the null hypothesis that this salary difference is equal to zero.0000 0. male+female=intercept.Source  SS df MS +Model  2.15 0. 348) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 350 39. regress salary male female Source  SS df MS +Model  2.
9888e+10 3 2. 346) Prob > F Rsquared = = = = 350 53.18 0. the difference has been reduced to $10K.065 _cons  60846.8215e+09 Number of obs F( 2. Maybe the men have more experience and that is what is causing the apparent salary discrimination.female  (dropped) _cons  61050 2166. Std. regress salary exp female fexp Source  SS df MS +Model  6.2186e+11 349 635696291 Number of obs F( 2. It is possible to force Stata to drop the intercept term instead of one of the other variables.975 28. Also.14 65077.15 0.3150 101 .1619 10.000 866.46 1584. t P>t [95% Conf. . Err. Interval] +male  77897. .000 56616.5215e+11 347 438471159 +Total  2.29 0.628 28.628 28. This formulation is less useful because it is more difficult to test the null hypothesis that the two salaries are different.67 65311.8382e+12 2 9.33  Now the coefficients are the mean salaries for the two groups.31  Well. But it is still significant and negative.5197e+11 346 439219336 Number of obs F( 3. gen fexp=female*exp .9709e+10 2 3. The result is we get the same regression we got with male as the only independent variable.000 15093.0000 = 0.9022 = 0. . If so.4854e+10 Residual  1.9016 = 23931 salary  Coef.0000 0.33  Stata has saved us the embarrassment of pointing out that the regression we asked it to run is impossible by simply dropping one of the variables from the regression.49 0. It is also possible that women receive lower raises than men do.882 49.86 = 0.04 0. t P>t [95% Conf.000 74780.3103 20940 salary  Coef. Interval] +exp  1069.3142 0.1911e+11 Residual  1. the average raise is a pitiful $1000 per year. Std.61 female  61050 2166.35 2431.3296e+10 Residual  1. so the men are apparently somewhat older.9930e+11 348 572701922 +Total  2.24 0.18 0. 347) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 350 79. We can test this hypothesis by creating an interaction variable by multiplying experience by female and including it as an additional regressor.8753 1272.0000 0.67 65311.0375e+12 350 5. regress salary male female.678 female  10310.72 2150.776 103. noconstant Source  SS df MS +Model  1. regress salary exp female Source  SS df MS +Model  6. Perhaps this salary difference is due to a difference in the amount of experience of the two groups. Err.63 5527.983 4.000 56788. the previous analysis suffers from omitted variable bias.000 56788.37 0.31 81014. 348) Prob > F Rsquared Adj Rsquared Root MSE = 350 = 1604.
+Total  2.18 0. Stata allows us to do this test very easily with the test command. The way to test this hypothesis is to run the regression as it appears above with both female and fexp included. etc. This command is used after the regress command.93 2286. According the Fratio. but not significantly higher. If they are.002 19695. There is a significant salary penalty associated with being female.1217 701.0001 This tests the null hypothesis that the coefficient on female and the coefficient on male are both equal to zero. If they are not significantly different. I also have a dummy variable called “admin” which is equal to one whenever a faculty member is also a member of the administration (assistant to the President.79 4683.67  In fact. However.0422 269. Err.000 56842. then the null hypothesis is false.523 357. after Gregory Chow. Let’s do a slightly more complicated version of the Ftest above.11 0. . 346) = Prob > F = 9. Suppose we want to test the null hypothesis that the coefficients on female and fexp are jointly equal to zero.273 26. Then run the regression again without the two variables (assuming the null hypothesis is true) and seeing if the two residual sum of squares are significantly different.).0422 0.3091 20958 salary  Coef. t P>t [95% Conf.19 0. etc. a British statistical biologist in the early 1900’s. Useful tests Ftest We have seen many applications of the ttest. a Princeton econometrician who developed a slightly different version. test female fexp ( 1) ( 2) female = 0 fexp = 0 F( 2. Chow test This test is frequently referred to as a Chow test.86 3816. This test was developed by Sir Ronald Fisher (the F in Ftest). Std. associate dean.229 3.5e+11).000 814. we can firmly reject this hypothesis. assistant to the assistant to the President.2061 _cons  61338. raises than men.64 0.83 0..9854 9. The numbers in parentheses in the F ratio are the degrees of freedom of the numerator (2 because there are two restrictions. deanlet. the Ftest is also extremely useful. women appear to receive slightly higher.18 65835. assistant dean. but it is not caused by discrimination in raises.936 fexp  172.2186e+11 349 635696291 Adj Rsquared = Root MSE = 0. then the two variables do not help explain the variance of the dependent variable and the null hypothesis is true (cannot be rejected). Refer to the coefficients by the corresponding variable name.7038 1263. Interval] +exp  1038. Maybe female administrators are discriminated against in 102 . dean.087 female  12189. and note the residual sum of squares (1. one on female and one on fexp) and the degrees of freedom of the denominator (nk: 3504=346).895 113.
19 4395.28 admin  8043.5 asst  11182.7694 102.557 367.5271e+10 5 1. test female fexp fadmin ( 1) ( 2) ( 3) female = 0 fexp = 0 fadmin = 0 F( 3.008 2970.08 0.8896e+10 Residual  8.421 2719. I also have data on a number of other variables that might explain the difference in salaries between men and women faculty: chprof (=1 if a chancellor professor).5879 16185 salary  Coef.11 .67 5962.5447 266.000 783.3948 0.997 13392.4 female  11682. 342) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 350 72. This is a Chow test.67 0.295 0.843 8.146 10.91 0.91 0.15 prof  47858.19 2.965 26148.9587e+10 342 261948866 +Total  2.564 26.49 0.5015 1227.2186e+11 349 635696291 Number of obs F( 5.577 fexp  156. t P>t [95% Conf.42 0. .000 34444.737 10.92 17519.69 3905.11 56504.42 female  5090.61 4175.334 19215.325 19394. because the ttests on fexp and fadmin are not significant.2186e+11 349 635696291 Number of obs F( 7.016 2690.0000 0.3227e+11 7 1.451 112. 344) = Prob > F = 5.59 0.4659e+11 344 426123783 +Total  2.057 5.5207 1.95 0.17 64696.5054e+10 Residual  1.37 _cons  42395.0008 We can soundly reject the null hypothesis that women have the same salary function as men.11 4224.000 14822. Std.85 chprof  14419.68 0.84 3791.26 0. gen fadmin=female*admin .33 0. However.003 3852. Err. Interval] +exp  195.89 assoc  22847. asst (=1 if an assistant prof). we know that the difference is not due to raises or differences paid to women administrators.881204 397.575 7884.94 50346.423 680.terms of salary.003 2693. and prof (=1 if a full professor).000 39212.056 1250.3393 0.278 1952. We want to know if the regression model explaining women’s salaries is different from the regression model explaining men’s salaries.61 0.000 55709.342 2. Std.14 0.60 0. Err.53 4042.3297 20643 salary  Coef.010 8930.35 0.96 0. t P>t [95% Conf. .0000 0.13 103 . Let’s see if these variables make a difference. 344) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 350 35.5124 admin  11533. Interval] +exp  1005.04 fadmin  2011.174 2.5962 0.92 3.966 2. regress salary exp female fexp admin fadmin Source  SS df MS +Model  7.31 30872. assoc (=1 if an associate prof).89 0.69 2.002 19141.799 13495.07 _cons  60202. regress salary exp female asst assoc prof admin chprof Source  SS df MS +Model  1.23 4079.932 5.64 2284.
575 6269.17 11944.088 2650.475 1.34 dept21  1839.633 dept7  3525.64 0. Std.53 0.58 prof  46121.609 9383.17 54630.124 0.9 dept20  8166. 316) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 350 20..43 dept4  1278.766 0.127 6701. English.797 3397. Err.519 4457.73 4324.411 12766.945 dept18  5932.000 5853.36 4036.14 dept27  1893.092 4.987 9681.108 0.843 8484.51 0.66 3659.29 8408.524 1. Despite all these highly significant control variables.41 16232.67 0.594 4902.000 37613.88  Whoops. dept1=1 if the faculty person is in the first department.557 10. the coefficient on female isn’t significant any more.323 dept14  167.360 9356.20 0.18 0.675 0.g.72 6567.8798 5985.321 9027. This conclusion rests on the department dummy variables being significant as a group.54 0.86 0.681 9271.396 11981.491 6928. kinesiology).88067 478.596 dept29  2279.0000 0.865 admin  6767.348 dept24  777.063 5408.239 12450. I have one more set of dummy variables to try.824 0.332 9.27 0. . indicating significant salary discrimination against women faculty.23 0.07088 2.18 0.471 6078.754 11186.02 0.149 4881.81 0.724 12088.598 11270.794 0.64 15411.86 0.72 0.65 3473.643 _cons  38501.8026 99.748 7261.494 0. dance.1388e+10 316 225910511 +Total  2.38544 4879.911 11064.7245 female  2932.72 835. using an Ftest.23 dept2  2500.70 0.8447 3916.68 dept5  9213.871 8956.004 3830.267 0.63 0.48 49735. Maybe women tend to be overrepresented in departments that pay lower salaries to everyone.000 30290.009 3732.41 1. males and females (e.82 46712.000 15097.9 4791.5047e+11 33 4. So.75 30562.92 0.61 dept6  1969. female is negative and significant.928 1915.802 dept25  14089.31 0.75 0.11 4618.75 8876.208 6978.73 dept30  80.87 4019. but that looks like we want to test hypotheses on the 104 .35 0.783 9632.578 8578.11 0.7 10822. theatre.219 0.31 27435.66 3726.744 5208.03 dept23  1540.21 24710.011 1552.157 3230.55 0.251 27454.54 0.63 dept3  19527. regress salary exp female admin chprof prof assoc asst dept2dept30 Source  SS df MS +Model  1.3493 4317.92 0.782 15992.089 901.5597e+09 Residual  7. classical studies.791 18828.71 15803.402 4293.89 33733.56 0.975 0. I have a dummy for each department.36 0.71 0. t P>t [95% Conf.361 44411.94 dept8  462.004 88.921 0.91 0.307 dept28  4073.29 assoc  22830.78 chprof  14794.2186e+11 349 635696291 Number of obs F( 33.576 25689.702 3.33 dept10  24647. the coefficient on the target variable.646 13128.638 2755.806 dept22  2940.000 11620.266 9957.915 8031.021 2.443 0.54 8105.44 asst  11771.71 dept19  18642.33 dept11  13185.56 0.26 0.953 5229.38 20517.495 19712.03 0. Interval] +exp  283.456 2.000 15560.34 0.85 4173.804 0.591 9168.833 19013.275 6448.93 dept16  4854.794 8335.813 0.214 0.21 0.4 5622.9 4886.452 14713.414 9520. modern language. zero otherwise.99 0.037 5.207 0.38 0.95 dept12  3601.487 18584.6782 0.156 5.57 dept9  3304. dept2=1 if in the second department.416 2. We can test this hypothesis.382 8906. We would like to list the variables as dept1dept29.885 10893.741 1.331 dept17  1185.33 0. etc.11 0. with the testparm command.33 23572.28 0.222 25856.978 11609.1 3930.6446 15030 salary  Coef.
. and dept 11 (Economics). If the group is not significant. So there is no significant salary discrimination against women. apparently earn just as much as their male counterparts.W. 1969. people in prison). even if one or two are significant. then crime causes prison.10 0.W. even if only a few are. that you must drop or retain all members of the group. the same is true for female French teachers and phys ed instructors.0000 The department dummies are highly significant as a group. Female physicists.J.J. then all should be retained. By the way. once we control for field of study. prison causes (deters) crime.difference between the coefficient on dept1 and the coefficient on dept29. or neither. you should drop them all. economists.. If lagged prison is significant. Granger causality test Another very useful test is available only for time series. It is important to remember that. Granger published an article suggesting a way to test causality.2 Suppose we are interesting in testing whether crime causes prison (that is. the departments that are significantly overpaid are dept 3 (Chemistry). Then do a regression of crime on lagged crime and lagged prison. 424438. then prison Granger. C. etc.” Econometrica 36. 316) = Prob > F = 3. If the group is significant. computer scientists. “Investigating causal relations by econometric models and crossspectral methods. That is where the testparm command comes in. both. Granger suggested running a regression of prison on lagged prison and lagged crime. 2 105 . dept 10 (Computer Science). In 1969 C. testparm dept2dept30 ( 1) ( 2) ( 3) ( 4) ( 5) ( 6) ( 7) ( 8) ( 9) (10) (11) (12) (13) (14) (15) (16) (17) (18) (19) (20) (21) (22) (23) (24) (25) (26) dept2 = 0 dept3 = 0 dept4 = 0 dept5 = 0 dept6 = 0 dept7 = 0 dept8 = 0 dept9 = 0 dept10 = 0 dept11 = 0 dept12 = 0 dept14 = 0 dept16 = 0 dept17 = 0 dept18 = 0 dept19 = 0 dept20 = 0 dept21 = 0 dept22 = 0 dept23 = 0 dept24 = 0 dept25 = 0 dept27 = 0 dept28 = 0 dept29 = 0 dept30 = 0 F( 26. chemists. If lagged crime is significant. Unfortunately. when testing groups of dummy variables for significance.
8684 0.262 0.4399737 . I have data on crime (crmaj.prison + L2.crmaj L.61 0.1 Residual  9578520. 19) = Prob > F = 3. Does prison cause or deter crime? The sum of the coefficients is negative. prison deters crime. Err. tsset year (You have to tsset year to tell Stata that year is the time variable.896 1287.0000 0. Does crime cause prison? regress prison L. lagged prison is significant in the crime equation.422 1684.9888 .prison = 0 F( 2.prison = 0 L2.5613 11938.) Let’s use two lags.8333 _cons  6309. rape.77 .dta).5650287 1.01590533 Number of obs F( 4.215433065 19 .0333 So.10648 106 .43 0.046 .2095943 4.249 1324. 19) = Prob > F = 5.prison=0 ( 1) L.crmaj LL.prison Source  SS df MS +Model  63231512.79 0.prison LL.36 0.causes crime (deters crime if the coefficients sum to a negative number.0000 0.2 4 15807878.184 4466. .671 +Total  72810033 23 3165653. causes crime if the coefficients sum to a positive number).prison LL.442401 L2  . so it deters crime if the sum is significantly different from zero. assault.625 920.prison = 0 F( 1.463 3858. regress crmaj L.3658227 23 1.0085951 prison  L1  1087.665 2689.003 1.9908 0. .011338582 +Total  23.38 0.003715 . Interval] +crmaj  L1  1.61 Number of obs F( 4.75 19 504132.0468 So.prison ( 1) ( 2) L.78759741 Residual  . major crime.206103 2.prison L. test L.prison+L2. 19) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 24 31.1503896 4 5. t P>t [95% Conf.030 680.8713522 . the kind you get sent to prison for: murder.82 0.13 0.8407 710. Std.46 2.prison LL. 19) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 24 510. and burglary) and prison population per capita for Virginia from 1972 to 1999 (crimeva.crmaj LL. .26 0.02 crmaj  Coef.35 0. robbery. test L.962 L2  1772.000 .crmaj Source  SS df MS +Model  23.
However.4033407 0. then it is a tie. then we win.92 0. Zero means perfect equality.0000309 0. then the lagged dependent variables will serve as proxies for the omitted control variables. The key is to create a supermodel with all of the variables included.9676977 . real per capita income.33 0.0000467 L2  . Std. We create the supermodel by regressing the log of major crime per capita on all these variables.743 .8237 .612 . and burglary). Apparently. 19) = Prob > F = 0. You can get ties in tests of nonnested hypotheses.96102 L2  . If prison is significant and income inequality is not.9961 = . prison deters crime.129668 1.52 0. Interval] +prison  L1  1. the J in Jtest is for joint hypothesis. Err. . We believe that crime is deterred by punishment.61 0.009 .prison  Coef.0000191 .5637167 .000118005 +Total  1. regress lcrmajpc prisonpc unrate income gini Source  SS df MS +Model  1. You don’t get ties in tests of nested hypotheses. We don’t think so. but crime does not cause prisons.7102647 . If neither or both are significant.133937 .005546219 47 .03052465 Number of obs F( 4. rape. at least in Virginia.000 1. How can we resolve this debate? We don’t put income inequality in our regressions and they don’t put prison in theirs.9781387 .1986008 7. We then test all the coefficients. everyone else has nothing.csv).545344 . The Granger causality test is best done with control variables included in the regression as well as the lags of the target variables. assault. everyone has the same level of income.387802727 Residual  .crmaj = 0 F( 2.0000314 0. The Gini coefficient is a number between zero and one. We both agree that unemployment and income have an effect on crime.55121091 4 .crmaj ( 1) ( 2) L. if the control variables are unavailable or the analysis is just preliminary.0000488 .33 = 0.20 0. Jtest for nonnested hypotheses Suppose we are having a debate about crime with some sociologists.01086  107 .crmaj = 0 L2. The sociologists think that crime is caused by income inequality. mainly imprisonment. and income inequality measured by the Gini coefficient from 19471998 (gini. I have time series data for the US which has data on major crime (murder. the unemployment rate.193013 2.0000 = 0.crmaj LL. One means perfect inequality. robbery.9964 = 0. t P>t [95% Conf.0000806 _cons  .78 0.55675713 51 .1597358 crmaj  L1  .551 .0000159 . 47) Prob > F Rsquared Adj Rsquared Root MSE = 52 = 3286. By the way. If the reverse is true then the sociologists win. one person has all the money.0000849 . test L. while the sociologists think that putting people in prison does no good at all.
the coefficient on gini is negative.570543 p2534  1.23 0.3325442 gini  . estimate the model without the age group variables and save the residuals using the predict command. We would test the joint significance of prison and arrests versus the joint significance of gini and divorce with Ftests. prison and arrests for the economists’ model and giini and the divorce rate for the sociologists) then we would use Ftests rather than ttests.0086418 . It appears to be a draw.068247104 +Total  8.554454 r Well. However. If there were more than one variable for each of the competing models (e. 46) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 51 19. record the error sum of squares. Interval] +prison  .736113 10.092154 .0098001 .000 .0871006 .g. both prison and gini are significant in the crime equation.169355389 Number of obs F( 4.4482  Suppose we want to know if the age groups are jointly significant.208719 0. Err. .26124 lcrmaj  Coef. less equality) crime goes down.317924 _cons  9.39867 9. LM test Suppose we want to test the joint hypothesis that rtpi and unrate are not significant in the above regression.0061851 51. The primary advantage of the LM test is that you only need to estimate the main equation once.000 4. then estimate the model with the variables being tested excluded.32840267 4 1.757686 .02 0.200647 . Interval] +prisonpc  . Std.52 0.57 0.0045627 .8834 To do the LM test for the same hypothesis.1439967 .12 0.000 .09 0.000 .604708 3. and finally compute the Fratio.3201013 .004 1. Std. as the gini coefficient goes up (more inequality. test p1824 p2534 ( 1) ( 2) p1824 = 0 p2534 = 0 F( 2.26219 .175871 29. Err.0275741 unrate  .lcrmajpc  Coef.5970 .26 0. indicating that.0186871 .861 11.dta data set to estimate the following crime equation.6293 0. The Ftest is easy in Stata. . 108 .75 0. For example.0020265 4.18 0..2507801 3.000 7.41 0. an alternative is to use the LM (for Lagrangian Multiplier. t P>t [95% Conf. regress lcrmaj prison metpct p1824 p2534 Source  SS df MS +Model  5.685 9.50 0.3076585 .33210067 Residual  3.84684 5. Under the Ftest you have to estimate the model with all the variables.2008929 metpct  . t P>t [95% Conf.000 .0044176 4.0072343 5. However.46776947 50 . never mind) test.0282658 5.0391091 .41 0.9140642 5.673677 13.0245556 . Whoops.0127208 p1824  . 46) = Prob > F = 0. record the error sum of squares.0536626 income  .1393668 46 .2531816 _cons  5.527341 6. use the crime1990. We know we can test that hypothesis with an Ftest.935938 0.0000 0.000 . even if we already know that they are highly significant.
000 .2612414678691742 .0020265 0.41 0.06 = 0. If they are truly irrelevant.767611 . 48) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 51 40. The LM statistic is N times the Rsquare.000 .9926 = 0. Err.6117 .169355389 Number of obs F( 2.1384944 .004242676 Residual  3.18 0. we need to get the Rsquare and the number of observations.3245431 .0035924 .811 .0055023 .139366808584275 . they will not be significant and the Rsquare for the regression will be approximately zero.0623985 metpct  .0811 = . Std.48 0.57 0. regress lcrmaj prison metpct Source  SS df MS +Model  5.0081552 .1563375 48 .632 1.570543 p2534  1.0282658 0. Std. t P>t [95% Conf.25643 lcrmaj  Coef. (Although the fact that both variables have very low tratios is a clue.0169707049657037 3.39867 9.75 0.068247105 +Total  3.26124 e  Coef.861 11.1142346 76.0811122739798864 1. Err.19 0.0116447 _cons  8.065757031 +Total  8. which is distributed as chisquare with. Interval] +prison  .000 8. predict e. To see this output.276850278809347 109 .935938 0..9140643 5. two degrees of freedom because there are two restrictions (two variables left out).1885321 metpct  .65571598 Residual  3.13936681 46 .0046656 .208719 0. 46) Prob > F Rsquared Adj Rsquared Root MSE = 51 = 0.527341 6.847 .0004866 . This output goes under the generic heading of “e” output. resid Now regress the residuals on all the variables including the ones we left out.031498 1. Interval] +prison  .317924 _cons  .537927 8. .016970705 4 .673677 0.39 0.0621663918252367 .685 9.997295 .0513938 .15633751 50 .680585  To compute the LM test statistic.) Whenever we run regress Stata creates a bunch of output that is available for further analysis. ereturn list scalars: e(N) e(df_m) e(df_r) e(F) e(r2) e(rmse) e(mss) e(rss) e(r2_a) e(ll) = = = = = = = = = = 51 4 46 .0248865 5.70 0.0884566 .0053767079385045 .0045657 p1824  .604708 3. .6273 0.06312675 Number of obs F( 4. t P>t [95% Conf.0017355 4.46776947 50 .24 0.0000 0. in this case.0054 = 0. use the ereturn list command. regress e prison metpct p1824 p2534 Source  SS df MS +Model  .31143197 2 2.
Nevertheless. scalar n=e(N) . We can compute the probvalue using Stata’s chi2 (chisquare) function. including the age group variables.e(ll_0) = macros: e(depvar) e(cmd) e(predict) e(model) matrices: e(b) : e(V) : functions: e(sample) : : : : 1. I routinely use the Fform of the LM test because it is easy to do in Stata and corrects for degrees of freedom. Amsterdam: University of Amsterdam. Testing Linear Econometric Models. not variables. That is.8834 The results are identical to the original Ftest above showing that the LM test is equivalent to the Ftest. test p1824 p2534 ( 1) ( 2) p1824 = 0 p2534 = 0 F( 2. so the LM statistic is very low. scalar list prob = . . The results are also similar to the nRsquare version of the test. 1987. . 110 . The probvalue is . scalar R2=e(r2) . scalar LM=n*R2 Now that we have the test statistic. 3 See Kiviet. after regressing the residuals on all the variables.F. J. 46) = Prob > F = 0. especially for diagnostic testing of regression models.87187776 LM = . OK. we need the scalar command to create the test statistic. so the Ftest is easier.2742121 R2 = .LM) .00537671 n = 51 The Rsquare is almost equal to zero.87. . as expected.3 To do the Fform.. scalar prob=1chi2(2. just apply the Ftest to the auxiliary regression. One modification of this test that is easier to apply in Stata is the socalled Fform of the LM test which corrects for degrees of freedom and therefore has somewhat better small sample properties. we will use the LM test later.12 0. We cannot reject the null hypothesis that the age groups are not related to crime. we need to see if it is significant.414326247391365 "e" "regress" "regres_p" "ols" 1 x 5 5 x 5 Since the numbers we want are scalars. use Stata’s test function to test the joint hypothesis that the coefficients on both variables are zero.
gen y=10 + 5*invnorm(uniform()) .83905 Let’s graph these data.497518 19. 111 . gen x=10*invnorm(uniform()) . Min Max +y  51 9. . twoway (scatter y x) (lfit y x) The regression confirms no significant relationship. so there should not be a significant relationship between them.9 REGRESSION DIAGNOSTICS Influential observations It would be unfortunate if an outlier dominated our regression in such a way that one observation determined the entire regression result. Dev.954225 17.87513 19.736584 4. summarize y x Variable  Obs Mean Std.6021098 8.667585 1. We know that they are completely independent. .76657 x  51 .
0451374 .10 0.738302 12.39 0. 49) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 51 6.6245 y  Coef.797781 .012 .000 8. Interval] +x  .4879015 Number of obs F( 1.49347 11.10 0.0730381 1.39508 50 42.21794  112 .0974255 2.3857955 +Total  1089.4133222 1 41.90398 49 21. twoway (scatter y x) (lfit y x) Let’s try that regression again..0184 4.649048 15. t P>t [95% Conf.170 .2546063 .8657642 12. . regress y x Source  SS df MS +Model  259. Std.4503901 _cons  10.52027 49 38.0119 0.1044 6.10209  Now let’s create an outlier. Std.1703 0.4133222 Residual  1047.874802 1 259.61 0.1686 y  Coef.3173 50 21.47812 . regress y x Source  SS df MS +Model  41.83 0.1016382 .786346 Number of obs F( 1. Err.000 8. .0380 0.1223 0. Err.0588225 . t P>t [95% Conf.874802 Residual  1864.2484138 _cons  9. replace y=y+30 if state==30 (1 real change made) . 49) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 51 1.0514342 +Total  2124. Interval] +x  .94 0.
0122 4.61 0. We could do this 51 times. DFbetas = ˆ ˆ β − βi S βˆ ˆ ˆ where S βˆ is the standard error of β and β i is the estimated value of β omitting observation i. but not always. Instead we can use matrix algebra and computer technology to achieve the same result.447809 11. although some researchers suggest that DFbetas should be greater than one (that is greater than one standard error). E. After regress.0580178 .2558493 _cons  9. This observation is called “influential. is said to dominate the regression. or lots of variables.211 . Regression Diagnostics.So. that an observation is influential if Dfbetas > 2 N where N is the sample size. New York. Err. If you have a large data set.” State 30. E. So how do we avoid this influential observation trap? Stata has a function called dfbeta which helps you avoid this problem.0960935 Number of obs F( 1. invoke the dfbeta command.71 0. Without NH we should get a smaller and insignificant coefficient.2112 0. A. dropping each state in turn and seeing what happens.8261292 +Total  1082. which happens to be New Hampshire. t P>t [95% Conf. R. D.12354  By dropping NH we are back to the original result reflecting the (true) relationship for the other 50 observations. 113 . It works like this suppose we simply ran the regression with and without New Hampshire. BKW suggest. Interval] +x  .785673 . It is sometimes easy to find outliers and influential observations.000 8.27 0.6542 48 21. Kuh. Kuh. With NH we get a significant result and a relatively large coefficient on x. John Wiley and Sons. 48) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 50 1. DFbetas Belsley. . but that would be tedious. and Welsch.0989157 . you could easily miss an influential observation. and Welsch4 suggest dfbetas (scaled dfbeta) as the best measure of influence. You do not want your conclusions to be driven by a single observation or even a few observations. as a rule of thumb. 1980..0543824 Residual  1047.0324 0.0780517 1. Std. NY.6718 y  Coef. . one observation is determining whether we have a significant relationship or not. regress y x if state ~=30 Source  SS df MS +Model  35.6653936 14. dfbeta DFx: 4 DFbeta(x) Belsley.0543824 1 35.70858 49 22.
110021  ++ There is only one observation for which DFbetas is greater than the threshold value. the OLS estimates are unstable and fragile (small changes in the data or model can make a big difference in the regression estimates). use http://www.118277 1. This means that variables that are truly significant appear to be insignificant in the collinear model Also. Variance inflation factors Stata has a diagnostic known as the variance inflation factor . Interval] +x1  1. We might find simple coding errors.5719733 +Total  14695 99 148.91563 Residual  8699. 95) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 100 16. Std. list state stnm y x DFx if DFx>2/sqrt(51) ++  state stnm y x DFx   51. Err. clear . It has the effect of increasing the standard errors of the OLS estimates. regress y x1 x2 x3 x4 Source  SS df MS +Model  5995.3831 9. t P>t [95% Conf. At this point it would behoove us to look more closely at the NH value to make sure it is not a typo or something. but almost? This is called multicollinearity.  30 NH 42. If we have some and they are not coding errors.282 19.0000 0. Stata will be forced to drop the collinear variable from the model. increasing the value of the coefficient by two standard errors. VIF.278 .152135 114 . The rule of thumb is that multicollinearity is a problem if the VIF is over 30. Here is an example. But what happens if a regressor is not an exact linear combination of some other variable or variables. Multicollinearity We know that if one variable in a multiple regression is an exact linear combination of one or more variables in the regression. What do we do about influential observations? We hope we don’t have any.434343 Number of obs F( 4.ats. Clearly NH is influential.83905 2. If none of this works.09 0. . we hope they don’t affect the main conclusions. If we do.ucla. we might take logarithms (which has the advantage of squashing variance) or weight the regression to reduce their influence. Now list the observations that might be influential.4080 0. It would be nice to know if we have multicollinearity.9155807 3.37 0.This produces a new variable called DFx. .66253 4 1498.edu/stat/stata/modules/reg/multico.024484 1.5693 y  Coef.33747 95 91. which indicates the presence of multicollearity. that the regression will fail. we have to live with the results and confess the truth to the reader.
13 0.225535 _cons  31.16 0.3136 0.60194 33.13 0. Perhaps we should drop it and try again.40831 .05215 1.012093 +Mean VIF  221.83 0.5130679 _cons  31.298443 . 96) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 100 21.69 0. regress y x1 x2 x3 Source  SS df MS +Model  5936. Let’s look at the correlations among the independent variables.17 Wow.434343 Number of obs F( 3.0838481 4.21 0. Err.812781 x2  1.0000 0.61912 . . It is highly correlated with all the other variables.78069 96 91.038979 0.5341981 x2  .422 2.191635 1.x2  1.8370988 1.1230534 3.1187692 2.2084695 .23 0.17 115 . vif Variable  VIF 1/VIF +x4  534.21931 3 1978.7827429 3.73977 Residual  8758. correlate x1x4 (obs=100)  x1 x2 x3 x4 +x1  1.12 0.4527284 .000 29.001869 x3  175.0000 x4  0.280417 x4  .1801934 .57 0.010964 x2  82.000 29. All the VIF are over 30.3466306 .23 0.0000 x2  0.220 .50512 .356131 x3  1. . Std.2372989 +Total  14695 99 148.3853 9. Interval] +x1  .97 0.2021 1.51 0. t P>t [95% Conf.042406 1.005687 x1  91.68 0. vif Variable  VIF 1/VIF +x1  1.260 .6969874 x3  .892234 +Mean VIF  1.3553 1.0626879 .899733 1.9709127 32.0000 x4 appears to be the major problem here.54662 .0000 x3  0.6516 0.9587921 32.286694 1.864654 x3  1.4040 0.81 0.000 .7281 0.7790 1.8971469 3.69 0.86 0.69161 33.5518 y  Coef.000 .014 .
we might have concluded that none of the variables were related to y. the variables are all significant and there is no multicollinearity at all.Much better. So doing a VIF check is generally a good idea. Be careful. If we had not checked for multicollinearity. It could be that x4 is a relevant variable and the result of dropping it is that the remaining coefficients are biased and inconsistent. However. 116 . dropping variables is not without its dangers.
However. This nonconstant variance is called heteroskedasticity (hetero = different. comes from a distribution with exactly the same mean and variance. σ ) 2 where ~iid(0. BreuschPagan test There are two widely used tests for heteroskdeasticity. we will make mistakes in our hypothesis tests. scattering). Testing for heteroskedasticity The first thing to do is not really a test. The residuals from the OLS regression.or underestimating our tratios. but simply good practice: look at the evidence. ordinary least squares estimates are still unbiased. perhaps. Thus. ui. but they are not efficient. Another way of looking at the GM assumptions is to glance at the expression for the variance of the error term. It is an LM test. When the identically distributed assumption is violated.2) means “is independently and identically distributed with mean zero and variance 2”. even worse. Consider the following multiple regression model. for each observation the error term. So. it would be nice to know if our data suffer from this problem. is identically distributed. so the first thing to to is graph the residuals and see if they appear to be getting more or less spread out as Y and X increase. ei are estimates of the unknown error term ui. our standard errors and tratios are no longer correct. Yi = β 0 + β1 X 1i + β 2 X 2i + ui 117 .e. The basic idea is as follows. 2 . The first is due to Breusch and Pagan. Constant variance must be homoskedasticity.10 HETEROSKEDASTICITY The GaussMarkov assumptions can be summarized as Yi = α + β X i + u i u i ~ iid (0. This means that we have less confidence in our estimates. from the Greek skedastikos. skedastic. It depends on the exact kind of heteroskedasticity whether we are over. In this chapter we learn how to deal with the possibility that the “identically” assumption is violated. i. but in either case. if there is no i subscript then the variance is the same for all observations.. it means that the error variance differs across observations. If the data are heteroskedastic.
In fact.18 0. regress crmaj prison metpct rpcpi Source  SS df MS +Model  1805.01389 3 601. 47) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 51 43. Std.9355401 +Total  2459. ˆ ei2 / σ 2 = a0 + a1 X i1 + a2 X i 2 ˆ where σ 2 is the estimated error variance from the original OLS regression.952673 .0003134 1.0756619 .0004919 . However. Alternatively. as our estimate of the error term. t P>t [95% Conf.96 0.970383 47 13. and real per capita personal income.1996855 Number of obs F( 3.2079464 rpcpi  . We use the OLS residual.57136 3.57 0. nR 2 χ2 where the degrees of freedom is the number of variables on the right hand side of the auxiliary test equation. we could test whether the squared residual increases with the predicted value of y or some other suspected variable. The usual estimator of the variance of e is (e i =1 n i − e) 2 n .1665135 13. consider the following regression of major crime per capita across 51 states (including DC) in 1990 (hetero.672148 metpct  . If there is no heteroskedasticity. predict e.0001386 _cons  6.26 0. Therefore our estimate of the variance of ui is ei . Err.000 .7338 0. So we need an estimate of the variance of the error term at each observation.349276 1. we know that e =0 in any OLS regression and if we want the 2 variance of e for each observation.671298 Residual  654.000 2.056 .0000 0.3576379 8. the squared residuals will be unrelated to any of the independent variables. resid . As an example. the percent of the population living in metropolitan (metpct) areas. u.1418042 .31 0.What we want to know is if the variance of the error term is increasing with one of the variables in the model (or some other variable).30923  Let’s plot the residuals on population (which we suspect might be a problem variable). the square of the residual.0328781 4. .0011224 .7168 3. Interval] +prison  2. e.dta) on prison population per capita. So an auxiliary regression of the form.123 . . should have an Rsquare of 2 zero.233198 3.98428 50 49. we have to set n=1.733 crmaj  Coef. twoway (scatter e pop) 118 .
5 0
0
Residuals 5
10
15
10000 pop
20000
30000
Hard to say. If we ignore the outlier at 3000, it looks like the variance of the residuals is getting larger with larger population. Let’s see if the BreuschPagan test can help us.
The BreuschPagan test is invoked in Stata with the hettest command after regress. The default, if you don’t include a varlist is to use the predicted value of the dependent variable.
. hettest BreuschPagan / CookWeisberg test for heteroskedasticity Ho: Constant variance Variables: fitted values of crmaj chi2(1) Prob > chi2 = = 0.17 0.6819
Nothing appears to be wrong. Let’s see if there is a problem with any of the independent variables.
. hettest prison metpct rpcpi BreuschPagan / CookWeisberg test for heteroskedasticity Ho: Constant variance Variables: prison metpct rpcpi chi2(3) Prob > chi2 = = 4.73 0.1923
Still nothing. Let’s see if there is a problem with population (since the dependent variable is measured as a per capita variable).
. hettest pop BreuschPagan / CookWeisberg test for heteroskedasticity Ho: Constant variance Variables: pop chi2(1) Prob > chi2 = = 3.32 0.0683
Ah ha! Population does seem to be a problem. We will consider what to do about it below.
119
White test
The other widely used test for autocorrelation is due to Halbert White (econometrician from UCSD). White modifies the BreuschPagan auxiliary regression as follows.
ei2 = a0 + a1 X i1 + a2 X i 2 + a3 X i2 + a4 X i22 + a5 X i1 X i 2 1
The squared and interaction terms are used to capture any nonlinear relationships that might exist. The number of regressors in the auxiliary test equation is k(k+1)/2 where k is the number of parameters (including the intercept) in the original OLS regression. Here k=3 so, 3(4)/2 1=61=5. The test statistic is
2 nR 2 χ k ( k +1) / 2−1 .
To invoke the White test use imtest, white. The im stands for information matrix.
. imtest, white White's test for Ho: homoskedasticity against Ha: unrestricted heteroskedasticity chi2(9) Prob > chi2 = = 6.63 0.6757
Cameron & Trivedi's decomposition of IMtest Source  chi2 df p +Heteroskedasticity  6.63 9 0.6757 Skewness  4.18 3 0.2422 Kurtosis  2.28 1 0.1309 +Total  13.09 13 0.4405 
Ignore Cameron and Trivedi’s decomposition. All we care about is heteroskedasticity, which doesn’t appear to be a problem according to White. The White test has a slight advantage over the BrueshPagan test in that it is less reliant on the assumption of normality. However, because it computes the squares and crossproducts of all the variables in the regression, it sometimes runs out of space or out of degrees of freedom and will not compute. Finally, the White test doesn’t warn us of the possibility of heteroskedasticity with a variable that is not in the model, like population. (Remember, there is no heteroskdeasticity associated with prison, metpct, or rpcpi.) For these reasons, I prefer the BreuschPagan test.
Weighted least squares
Suppose we know what is causing the heteroskedasticity. For example, suppose the model is (in deviations)
yi = β xi + ui where var(u ) = z 2σ 2 i i
120
Consider the weighted regression, where we divide both sides by zi
yi / zi = β ( xi / zi ) + ui / zi
What is the variance of the new error term, ui/zi.
var(ui / zi ) = var(ci ui ) where ci = 1/ zi var(ui / zi ) = ci2 var(ui ) = zi2σ 2 =σ2 zi2
So, weighting by 1/zi causes the variance to be constant. Applying OLS to this weighted regression will be unbiased, consistent and efficient (relative to unweighted least squares). The problem with this approach is that you have to know the exact form of the heteroskedasticity. If you think you know you have heteroskedasticity and correct it with weighted least squares, you’d better be right. If you were wrong and there was no heteroskedasticity, you have created it. You could make things worse by using WLS inappropriately. There is one case where weighted least squares is appropriate. Suppose that the true relationship between two variables is (in deviations) is,
yiT = β xiT + uiT , uiT IN (0, σ 2 )
However, suppose the data is only available at the state level, where it is aggregated across n individuals. Summing the relationship across the state’s population, n, yields,
y
i =1
n
T i
= β xiT + ui
i =1 i =1
n
n
or, for each state,
y = β x + u.
We want to estimate the model as close as possible to the truth, so we divide the state totals by the state population to produce per capita values. y x u =β + . n n n What is the variance of the error term?
n T ui u Var = Var i =1 n n nσ 2 σ 2 = 2 = n n
1 T T = 2 (Var (uiT ) + Var (u2 ) + ... + Var (un )) n
So, we should weight by the square root of population.
121
n
y x u =β n + n . n n n
nu σ2 u Var =σ2 = nVar = n n n n
which is homoskedastic. This does not mean that it is always correct to weight by the square root of population. After all, the error term might have some heteroskedasticity in addition to being an average. The most common practice in crime studies is to weight by population, not the square root, although some researchers weight by the square root of population. We can do weighted least squares in Stata with the [weight=variable] option in the regress command. Put the [weight=variable] in square brackets after the varlist.
. regress crmaj prison metpct rpcpi [weight=pop] (analytic weights assumed) (sum of wgt is 2.4944e+05) Source  SS df MS +Model  1040.92324 3 346.974412 Residual  842.733646 47 17.9305031 +Total  1883.65688 50 37.6731377 Number of obs F( 3, 47) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 51 19.35 0.0000 0.5526 0.5241 4.2344
crmaj  Coef. Std. Err. t P>t [95% Conf. Interval] +prison  3.470689 .7111489 4.88 0.000 2.040042 4.901336 metpct  .1887946 .058493 3.23 0.002 .0711219 .3064672 rpcpi  .0006482 .0004682 1.38 0.173 .0015902 .0002938 _cons  4.546826 4.851821 0.94 0.353 5.213778 14.30743 
Now let’s see if there is any heteroskedasticity.
. hettest pop
BreuschPagan / CookWeisberg test for heteroskedasticity Ho: Constant variance Variables: pop chi2(1) Prob > chi2 . hettest BreuschPagan / CookWeisberg test for heteroskedasticity Ho: Constant variance Variables: fitted values of crmaj chi2(1) Prob > chi2 = = 1.07 0.3010 = = 0.60 0.4386
122
93 0. yi = β xi + ui ˆ β= xi yi xi ( β xi + ui ) β xi2 xi ui xi ui = = + =β+ xi2 xi2 xi2 xi2 xi2 The mean is xi ui xi E (ui ) ˆ E (β ) = E β + =β+ = β because E (ui ) = 0 for all i. Robust standard errors and tratios Since heteroskedasticity does not bias the coefficient estimates. the inverse of population. ˆ Recall how we derived the mean and variance of the sampling distribution of β . We started with a simple regression model in deviations.. we could simply correct the standard errors and resulting tratios so our hypothesis tests are valid. rather than by population itself.4021 The heteroskedasticity seems to have been taken care of. E ( β − β ) ˆ β −β = xi ui xi2 123 . then we would have tried weighting by 1/pop. If the correction had made the heteroskedasticity worse instead of better. hettest prison metpct rpcpi BreuschPagan / CookWeisberg test for heteroskedasticity Ho: Constant variance Variables: prison metpct rpcpi chi2(3) Prob > chi2 = = 2. 2 xi xi2 2 ˆ ˆ ˆ The variance of β is the expected value of the square of the deviation of β .
σ 2 . Dividing β by the robust standard error yields the robust tratio. in the crime equation above we can see the robust standard and tratios by adding the option robust to the regress command. robust 124 . So. which means xiui ˆ Var ( β ) = E x2 i ˆ Var ( β ) = 1 2 1 = E ( x1u1 + x2u2 ++ xnun )2 2 2 xi ( ) ( ) 1 2 xi2 2 2 2 2 2 2 E x1 u1 + x2 u2 ++ xn un + x1x2u1u2 + x1x3u1u3 + x1x4u1u4 + ( ) We know that E (ui2 ) =σ 2 (identical) and E (uiu j ) = 0 for i ≠ j (independent). ( ) 1 2 xi2 2 2 2 2 2 2 E x1 u1 + x2 u2 ++ xn un + x1x2u1u2 + x1x3u1u3 + x1x4u1u4 + ( ) It is still true that E (uiu j ) =0 for i ≠ j (independent). or ˆ sandwich standard errors. We will call them robust standard errors. Eicker. As above. It is easy to get robust standard errors in Stata. we need an estimate of σ i2 . or White. so ˆ Var ( β ) = ( ) 2 2 2 x1 σ 2 + x2σ 2 ++ xnσ 2 2 2 xi ( ) σ 2 xi2 = Factor out the constant variance. to get = 1 ( x ) 2 2 i (x 2 1 2 2 + x2 + + xn ) σ 2 = σ2 xi2 ( x ) 2 2 i In the case of heteroskedasticity we cannot factor out ˆ Var ( β ) = 1 σ 2 . Sβˆ = xi2 ei2 ( x ) 2 2 i This is also known as the heteroskedastic consistent standard error (HCSE). ei . regress crmaj prison metpct rpcpi. we will use the square of the residual. For example.2 ˆ ˆ Var ( β ) = E ( β − β ) The x's are fixed numbers so E ( x )= x . 2 ˆ2 σ βˆ = xi2 ei2 ( x ) 2 2 i Take the square root to produce the robust standard error of beta. Huber. so ˆ Var ( β )= 2 2 x σ x 2σ 2 + x 2σ 2 ++ xnσ n )= i i 2( 1 1 2 2 2 ( xi2 ) ( xi2 ) 2 2 To make this formula operational. .
30923  It is possible to weight and use robust standard errors.012 .875776 13.98428 50 49. it is the standard methodology in crime regressions.0726166 2.2071255 rpcpi  . 47) = Prob > F = Rsquared = Root MSE = 51 131. . 47) = Prob > F = Rsquared = Root MSE = 51 17.57136 3.1996855 Number of obs F( 3.2079464 rpcpi  . Using them allows us to avoid having to discover the particular form of the problem.60 0.1667412 17.952673 .1418042 .0328781 4.672148 metpct  .006661 2.470689 . 47) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 51 43. robust (analytic weights assumed) (sum of wgt is 2.0004919 . Std.57136 3.818686 metpct  . I recommend always using 125 . t P>t [95% Conf.349276 1.0006482 .18 0.72 0.01389 3 601.0017649 .26 0. Std. Interval] +prison  2.61998  Compare this with the original model.0000 0.96943  What is the bottom line with respect to our treatment of heteroskedasticity? Robust standard errors yield consistent estimates of the true standard error no matter what the form of the heteroskedasticity.0764828 .6700652 5.0004684 _cons  4.122692 4. t P>t [95% Conf.0005551 1.0000807 _cons  6.952673 . Source  SS df MS +Model  1805.0004919 .000 2.3576379 8.683806 0.57 0.03247 4.3348802 rpcpi  .5227393 12. regress crmaj prison metpct rpcpi [weight=pop]. Err.056 .37 0.31 0.1887946 .96 0.0000 0. Interval] +prison  3.288113 metpct  .249 .970383 47 13.0010646 .000 2.7168 3.97 0. t P>t [95% Conf.18 0. Interval] +prison  2.19 0.1665135 13.5526 4.671298 Residual  654.9355401 +Total  2459.733  Robust crmaj  Coef.71 0.0756619 . if there is no heteroskedasticity.7338 3.0427089 .0003134 1. Err.73 0.2344  Robust crmaj  Coef. Std.0000 0. Err. Finally.0001386 _cons  6.000 .4944e+05) Regression with robust standard errors Number of obs = F( 3.1418042 .337 4. In fact.0002847 1.034 . For these reasons. we get the usual standard errors anyway.000 2.28 0.091 .0011224 .733 crmaj  Coef.Regression with robust standard errors Number of obs = F( 3.17 0.233198 3.123 .546826 4.000 .7338 0.617233 3.
This is becoming standard procedure among applied researchers.robust standard errors. then you should weight in addition to using robust standard errors. robust” to the regress command. If you know that a variable like population is a problem. That is. The existence of a problem variable will become obvious in your library research when you find that most of the previous studies weight by a certain variable. always add “. 126 .
f So.11 ERRORS IN VARIABLES One of the GaussMarkov assumptions is that the independent variable. it is only necessary that x be independent of the error term. This means that the OLS estimate is biased. x. that there be no covariance between the error term and the regressor or regressors. we only have the observed x=xTf where f is random measurement error. This is a very convenient assumption for proving that the OLS parameter estimates are unbiased However. that is. expressed in deviations is y = β xT + ε where xT is the true value of x. xT = x − f y = β (x − f ) + ε = β x + (ε − β f ) We have a problem if x is correlated with the error term (f). E ( xiui )=0 If this assumption is violated. ε − β f ) = E [( x T + f )(ε − β f )] = E ( xT ε + f ε − xT β f − β f 2 )=− β E ( f 2 ) = − βσ 2 ≠ 0 if f has any variance. cov( x . is fixed on repeated sampling. x is correlated with the error term. This could happen if there are errors of measurement in the independent variable or variables. ˆ β = xy ( xT + f ) y ( xT + f )( β xT +ε ) = = x 2 ( xT + f )2 xT 2 + 2 xT f + f 2 ( β xT 2 + xT ε + β xT f + fy ) β xT 2 ˆ = P lim P lim( β )= P lim xT 2 + 2 xT f + f 2 xT 2 + f 2 127 . For example. that is. suppose the true model. However. then ordinary least squares is biased and inconsistent.
(We are assuming that P lim( fy ) = P lim( xT f ) = P lim( xT f ) = P lim( fy ) = 0) So. Cure for errors in variables There are only two things we can do to fix errors in variables. Then x would be highly correlated with itself. Suppose we know of a variable. The first is to use the true value of x. that is (1) highly correlated with xT. z. but (2) uncorrelated with the error term. This IV estimator is consistent because z is not correlated with v. allows us to derive consistent (not unbiased) estimates. called an instrument or instrumental variable. ˆ zy − β zx =0 which implies that ˆ β = zy zx This is the instrumental variable (IV) estimator. d e2 ˆ =−2 zx ( zy − β zx ) =0 (note chain rule). making it the perfect instrument. In this case.β f ) Multiply both sides by z. which would happen if x is not measured with error. we have to fall back on statistics. dividing top and bottom by xT 2 . This variable. 128 . ˆ dβ The derivative is equal to zero if and only if the expression is parentheses is zero. the OLS estimate is biased downward. It collapses to the OLS estimator if x can be an instrument for itself. zy = β zx + zv The error is ˆ e = zy − β zx The sum of squares of error is ˆ e 2 = ( zy − β zx )2 ˆ To minimize the sum of squares of error. y = β x + v where v = (ε . since the true value of beta is divided by a number greater than one. and not correlated with the error term. we can conclude that p lim β ≠ β if var( f ) ≠ 0 which means that the OLS estimate is biased and inconsistent if the measurement error has any variance at all. we take the derivative with respect to β and set it equal to zero. β ˆ P lim( β ) = P lim f2 1+ xT 2 β = var( f ) 1+ var( xT ) ˆ Since the two variances cannot be negative. Since this is usually not possible.
ˆ β = ˆ xy (γˆ z ) y γˆ zy zy = = = 2 (γˆ z )2 γˆ 2 z 2 γˆ z 2 ˆ x from the first stage regression. x. zx 2 zx z z2 HausmanWu test We can test for the presence of errors in variables if we have an instrument. ˆ y = βx+v The resulting estimate of beta is the instrumental variable estimate above. the instrumental variable estimator. In the second stage. So. Two stage least squares Another way to get IV estimates is called two stage least squares (TSLS or 2sls). ˆ ˆ x =γz 129 . Assume that the relationship between the instrument z and the possibly mismeasured variable. The result is the consistent IV estimate above. on the instrument. x is linear. using ordinary least squares. x =γz+w We can estimate this relationship using ordinary least squares. Since z is independent of the error term. ˆ Second stage: substitute x for x and apply OLS again. yz ( β x+v) z ˆ p lim β = p lim = p lim xz xz ( ) p lim( vz ) β xz + vz ) = p lim = β + xz p lim( xz ) = β since p lim ( vz ) = 0 because z is uncorrelated with the error term. In the first stage we regress the variable with the measurement error. z. γˆ = = xz z2 zy zy = . so is x . ˆ ˆ x =γz ˆ ˆ Note that x is a linear combination of z. we substitute the predicted values of x for the actual x and apply ordinary least squares again. First stage: regress x on z: x =γz+w Save the predicted values. saving the predicted values.
If these two estimates are significantly different. it must be due to measurement error. So we need an easy way to test whether β =δ . It is probably correlated with his daughter’s brains and is independent of her wages. If it is not significant. If it is significant. so ˆ ˆ y = β ( x − w) + δ w + v ˆ y = β x + (δ − β ) w + v ˆ The null hypothesis that β =δ can be tested with a standard ttest on the coefficient multiplying w . First let’s do the regression with ordinary least squares.1086487 . The estimated coefficient is the return to education.1788735  The return to education is a nice ten percent.55 0. then the two parameter estimates are different and there is significant errors in variables bias.318 . 426) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 428 56.523015108 Number of obs F( 1.1852259 1. ivreg lwage (educ=fatheduc) Instrumental variables (2SLS) regression 130 .1179 0. so the estmate of from that parameter will be measurement error in x it is all in the residual w inconsistent.dta. Brighter women will probably do both.462443727 +Total  223. ˆ w is the residual from the regression of x on z.001028 426 .0143998 7. The dependent variable is the log of the wage earned by working women.0000 0. it is uncorrelated with the error term. . So how much of the return to education is due to the college education? Let’s use father’s education as an instrument.1369523 _cons  . We can implement this test in Stata as follows. regress lwage educ Source  SS df MS +Model  26. ˆ ˆ ˆ ˆ y = β x + δ w + v but x = x − w. We can implement instrumental variables with ivreg. Wooldridge has a nice example and the data set is available on the web. the high wages earned by highly educated women might be due to more intelligence. Std. .1158 .0803451 . ˆ ˆ x = x+w where And.93 0. Interval] +educ  . then the two parameters are equal and there is no significant bias. Let’s ˆ call the coefficient on w . The data are in the Stata data set called errorsinvars. go to college and earn high wages.3264237 1 26. The independent variable is education. Err.5492674 . However. If there is ˆ . t P>t [95% Conf. Therefore. The first stage regression varlist is put in parentheses.68003 lwage  Coef. ˆ ˆ y = β x + v = β ( x + w) + v ˆ ˆ y = βx + βw+v ˆ Since x is simple a linear combination of z.So.327451 427 . applying OLS ˆ to this equation will yield a consistent estimate of as the parameter multiplying x .1851969 .000 .00 0.3264237 Residual  197.
0000 0.088 .0285863 9.0098994 . Err.22294206 Number of obs F( 1.71 0.008845 .523015108 Number of obs F( 1.0597931 .2759363 37.841983 Residual  1845. Interval] +fatheduc  .4357311 1. predict what .Source  SS df MS +Model  20. Std.323 . Interval] +educ  .000 9. Err.439289 1.0813 educ  Coef.00 0.0929 0. More research is needed.0000 0. there might still be some bias because the coefficient on the residual what is almost significant at the .841983 1 384. 426) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 428 88.80 0.10 level.7324457 Residual  195.23705 . t P>t [95% Conf.10 level.0149823 . In fact. 425) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 428 29.316 .317938 Instrumented: educ Instruments: fatheduc .1189 . education is only significant at the .0591735 .77942 .093 . Std. 131 .99 0. Std.1726 0.0934 0.33181756 +Total  2230.4411035 .117 .475258426 +Total  223.460089 426 . t P>t [95% Conf.523015108 Number of obs F( 2.000 .3256295 _cons  10.4411035 .327451 427 .0913 .0591735 . Also.84 0.86256 425 .2694416 .10 0.8673618 1 20. regress lwage educ what Source  SS df MS +Model  27. regress educ fatheduc Source  SS df MS +Model  384.43 0.460853082 +Total  223. 426) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 428 2. t P>t [95% Conf.1271919 what  .0380427 1. Err.4648913 2 13.1706 2.1345684 _cons  .35428 426 4.84 0.67886 lwage  Coef.8673618 Residual  202.304553  The instrumental variables regression shows a lower and less significant return to education.68 0.0346051 1.694685 10.1230 0. Interval] +educ  .4461018 0.0351418 1. resid .68939 lwage  Coef.1282463 _cons  .327451 427 .19626 427 5.4223459 1.57 0.2132538 .
I is investment. Before we go on. I have labeled it RF1 for “reduced form 1. It is the theoretical model from the textbook or journal article. 132 . If we knew the values of the parameters ( and ) and the exogenous variables (I. The first equation (the familiar consumption function) is known as a behavioral equation. Substituting the second equation into the first yields the value for C. it is linear. and X are exogenous variables because their values are determined outside the system. consider the simple Keynesian system. The variables C and Y are known as the endogenous variables because their values are determined within the system. in this case. G. X. Behavioral equations have error terms. with an error term added for statistical estimation. For example. we need to define some terms.” Note that it is a behavioral equation and that. as follows. The terms and are the parameters of the system.12 SIMULTANEOUS EQUATIONS We know that whenever an explanatory variable is correlated with the error term in a regression. Y is national income (e. We could. The second equation is an identity because it simply defines national income as the sum of all spending (endogenous plus exogenous). estimate this equation with ordinary least squares. This structural model has two types of equations. GDP). the resulting ordinary least squares estimates are biased and inconsistent. and the exogenous error term u). The above simultaneous equation system of two equations in two unknowns (C and Y) is known as the structural model. C = α + βY + u Y = C + I +G+ X where C is consumption. I. G is government spending. and X is net exports. C = α + β ( C + I +G + X ) + u = α + β C + β I + β G + β X + u C − β C =α + β I + β G + β X +u C (1− β )=α + β I + β G + β X +u C= α + β I + β G + β X +u (1− β ) 1 1 β β β I+ G+ X + u α+ (1− β ) (1− β ) (1− β ) (1− β ) (1− β ) (RF1) C = The resulting equation is called the reduced form equation for C. if we had the data.g. They are “taken as given” by the structural model. indicating some randomness. G. we could solve for the corresponding values of C and Y.
We can derive the reduced form equations by solving by back substitution. Because of the fact that the reduced form equations have the same list of variables on the right hand side. 133 . p is the price. namely all of the exogenous variables. on the right hand side of the consumption function is correlated with the error term in the consumption equation. Therefore.u) is not zero. Look at the reduced form equation for Y (RF2) above. As a result. π 20 = 1 (1− β ) .We can derive the reduced form equation for Y by substituting the consumption function into the national income identity. We are now in position to show that the endogenous variable. In the system above the endogenous variables are C and Y and the exogenous variables are I. G. Example: supply and demand Consider the following structural supply and demand model (in deviations). v1 or v2. In the case of the simple Keynesian model. all we need to write them down in general form is the list of endogenous variables (because we have one reduced form equation for each endogenous variable) and the list of exogenous variables. Also. cov(Y. the coefficients from the reduced form equation for Y correspond to multipliers. The parameters are functions of the parameters of the structural model. (S) q = α p + ε (D) q = β p + β y + u 1 2 where q is the equilibrium quantity supplied and demanded. X (and the random error term u). the OLS estimate of in the consumption function is biased and inconsistent. On the other hand. The bad news is that the parameters from the reduced form are “scrambled” versions of the structural form parameters. There are two endogenous variables (p and q) and one exogenous variable. variation in u will cause variation in Y. with almost the same parameters. giving the change in Y for a oneunit change in I or G or X. Y. Y = α + βY + u + I + G + X Y − β Y =α + I +G + X +u Y (1− β )=α + I +G + X +u Y= α + I +G + X +u (1− β ) 1 1 1 1 1 α+ I+ G+ X + u (1− β ) (1− β ) (1− β ) (1− β ) (1− β ) (RF2) Y = The two reduced form equations look very similar. They are both linear. so the reduced form equations can be written as C =π 10 20 +π 11 I +π 12 G +π 13 X +v 1 X +v 2 Y =π +π 21 I +π 22 G +π 23 where we are using the “pi’s” to indicate the reduced from parameters. Both equations have the same variables on the right hand side. 1/(1) or /(1). y. Obviously. etc. Clearly. OLS estimates of the reduced form equation are unbiased and consistent because the exogenous variables are taken as given and therefore uncorrelated with the error term. Note that Y is a function of u. and y is income.
RF2 reveals that price is a function of u. α p + ε = β1 p + β 2 y + u α p − β1 p = β 2 y +u −ε (α − β1 ) p = β 2 y +u −ε β y +u −ε p= 2 (α − β1 ) p= β2 u −ε y+ (RF2) (α − β1 ) (α − β1 ) 21 y+v 2 p=π This is RF2 in general form. The good news is that the structural model itself produces the instruments. we can again appeal to instrumental variables (2sls) to yield consistent estimates. so applying OLS to either the supply or demand curve will result in biased and inconsistent estimates. Remember.q =αp+ε Solve for p: −α p = − q + ε p= q −ε α Substitute into the demand equation. but independent of the error term because it is exogenous. In the supply and demand model. However. The exogenous variable(s) in the model satisfy these conditions. income is correlated with price (see RF2). We can get a consistent estimate of with the IV estimator. α q = β1 ( q − ε ) + αβ 2 y + α u = β1q − β1ε + αβ 2 y + α u α q − β1q = − β1ε + αβ 2 y + α u (α − β ) q = +αβ y + α u − β ε 1 2 1 αβ 2 y +α u − β1ε q= (α − β1 ) q= αβ 2 α u − β1ε y+ (RF1) (α − β1) (α − β1 ) q = π 11 y + v 1 This is RF1 in general form. q −ε q=β + β2 y + u 1 α Multiply by α to get rid of fractions. Now we solve for p. but uncorrelated with the error term. ˆ α = yq yp This estimator can be derived as follows. instruments are required to be highly correlated with the problem variable (p). Suppose we choose the supply curve as our target equation. p = αq + ε Multiply both sides by y 134 .
In fact. the instrument is not correlated with the error term. If we take the probability limit of the two estimators. indirect least squares does not always work. p. consistency "carries over. estimate the reduced form equation. we get the IV estimator again. In the second stage we regress q on p . αβ 2 α − β1 β2 π 12 = α − β1 π 11 = which implies that π11 αβ 2 α − β1 = =α π 12 α − β1 β 2 So.yp = α yq + yε Now sum yp = α yq + yε But. RF2.") Unfortunately. The resulting estimate of 21 is ˆ π 21 = py y2 ˆ ˆ ˆ and the predicted value of p is p = π 21 y . The resulting estimator is ˆ ˆ qp π 21 qy qy qy qy = 2 = = = 2 2 2 py ˆ ˆ ˆ py p π 21 y π 21 y y2 y2 which is the same as the IV estimator above. no matter how we manipulate 11 and 12. we keep getting instead of 1 or 2. In the first stage we regress the problem variable. ˆ α= Indirect least squares Yet another method for deriving a consistent estimate of is indirect least squares. Suppose we estimate the two reduced form equations above. (Remember. 135 . From the reduced form equations. yε = 0 So. we cannot get a consistent estimate of the demand equation because. In other words. on all the exogenous variables in the model. y. Then we get unbiased and consistent estimates of 11 and 21. Here there is only one exogenous variable. yp = α yq and α = yq / yp We can also use two stage least squares. π ˆ p lim 11 = α ˆ π12 ˆ ˆ Because both π 11 and π 12 are consistent estimates and the ratio of two consistent estimates is consistent.
K ≥ G −1 Let’s apply this rule to the supply curve in the first supply and demand model. we won’t get exactly the same estimate. π 21 α − β1 β 2 But also. If we cannot. or it can be “over” identified. To apply the order condition. αβ 3 α − β1 π 12 = = α (again). so 1=21 and the supply equation is just identified. Let K be the number of exogenous variables that are in the structural model. It would be nice to know if the structural equation we are trying to estimate is identified.Suppose we expand the model above to add another explanatory variable to the demand equation: w for weather. no estimate at all of the parameters of the demand curve. and multiple estimates of with a minor change in the specification of the structural model. The supply curve was just identified in the first example. yields the reduced form equations. it can be “just” identified. π 11 αβ 2 α − β1 = = α as before. is the socalled order condition for identification. but over identified when we added the weather variable. Now we have two estimates of . If an equation is identified. A simple rule. it is said to be unidentified. This is why we got a 136 . which means the reduced form equations yield a unique estimate of the parameters of the equation of interest. as before. but excluded from the equation of interest. the following relationship must hold. in which case the reduce form yields multiple estimates of the same parameters of the equation of interest. Given normal sampling error. What is going on? The identification problem An equation is identified if we can derive estimates of the structural parameters from the reduced form. so which is right? Or which is more right? So. we first have to choose the equation of interest. Let G be the number of endogenous variables in the equation of interest (including the dependent variable on the left hand side). indirect least squares yielded a unique estimate in the case of . To be identified. q =α p+ε q = β1 p + β 2 y + β3 w + w Solving by back substitution. G=2 (q and p) and K=1 (y is excluded from the supply curve). The demand curve was unidentified in the above examples because we could not derive any estimates of 1 or 2 from the reduced form equations. π 22 α − β1 β3 Whoops. αβ 2 αβ 3 α u − β1ε y+ w+ α − β1 α − β1 α − β1 q = π 11 y + π 12 w + v1 β2 β3 u −ε p= y+ w+ α − β1 α − β1 α − β1 p = π 21 y + π 22 w + v2 q= We can use indirect least squares to estimate the slope of the supply curve as before.
To estimate this equation in Stata. and the price of beef. there are rare cases where the order condition is satisfied but the equation is still not identified. A more complicated condition. They recommend estimating the model in logarithms. ∆q = β1∆p + β 2 ∆y + β 3 ∆pbeef + u Since all the variables are logged. from Carnegie Mellon. Say we require that two variables be excluded from the equation of interest and we have two exogenous variables in the model but not in the equation. This is a short run demand equation relating the change in price to the change in consumption. but K=0 because there are no exogenous variables excluded from the demand curve. let’s analyze the supply curve from the second supply and demand model.lq D. we have to tell Stata that we have a time variable with the tsset year command. Illustrative example Dennis Epple and Bennett McCallum. Dennis and Bennett T. a close substitute.gsia.cmu. However. Choosing the demand curve as the equation of interest.ly D. The demand for broiler chicken meat is assumed to be a function of the price of chicken meat. tsset year time variable: year. have developed a nice example of a simultaneous equation model using real data. no equation that fails the order condition will be identified. McCallum. from 1950 to 2001. The good news is that it won’t generate a wrong or biased estimate. For reasons that will be discussed in a later chapter. if we try to estimate an equation using two stage least squares or some other instrumental variable technique. However. but fail to yield an estimate because the exogenous variables excluded from the equation of interest are too highly correlated with each other. First we try ordinary least squares on the demand curve. we find that G=2 as before. is necessary and sufficient. We can then use the D. these first differences of logs are growth rates (percent changes). First. the model will simply fail to yield an estimate. but now K=2 (y and w are both excluded from the supply curve).dta.edu/gsiadoc/WP/2004E6. Finally. The data are available in DandS. Second. Epple and McCallum have collected data for the US as a whole. G=2. called the rank condition.lp . operator to take first differences. but not sufficient condition for identification. At that point we would know that the equation was unidentified. . That is why we got two estimates from indirect least squares. but the order condition is usually good enough. So 2>21 and the supply curve is over identified.5 The market they analyze is the market for broiler chickens. if the two variables are really so highly correlated that one is simply a linear combination of the other. We will use the noconstant option to avoid adding a time trend to the demand curve. real income. Simultaneous equation econometrics: the missing example. Two notes concerning the order condition. we estimate the chicken demand model in first differences. 1950 to 2001 . regress D.unique estimate of from indirect least squares. This problem of “statistical identification” is extremely rare. http://littlehurt. That is. this is not a major problem because. it is a necessary. noconstant 5 Epple. there is always the possibility that the model will be identified according to the order condition. Again.pdf 137 . However.lpbeef D. Supply is assumed to be a function of price and the price of chicken feed (corn). then we don’t really have two independent variables. Looks good.
Source  SS df MS +Model  .053118107 3 .017706036 Residual  .043634876 48 .00090906 +Total  .096752983 51 .001897117
Number of obs F( 3, 48) Prob > F Rsquared Adj Rsquared Root MSE
= = = = = =
51 19.48 0.0000 0.5490 0.5208 .03015
D.lq  Coef. Std. Err. t P>t [95% Conf. Interval] +ly  D1  .7582548 .1677874 4.52 0.000 .4208957 1.095614 lpbeef  D1  .305005 .0612757 4.98 0.000 .181802 .4282081 lp  D1  .3553965 .0641272 5.54 0.000 .4843329 .2264601 
Lq is the log of quantity (per capita chicken consumption), lp is the log of price, ly is the log of income, and lpbeef is the log of the price of beef. This is not a bad estimated demand function. The demand curve is downward sloping, but very inelastic, and significant at the .10 level. The elasticity of income is positive, indicating that chicken is not an inferior good, which makes sense. The coefficient on the price of beef is positive, which is consistent with the price of a substitute. All the explanatory variables are highly significant. Now let’s try the supply curve.
. regress D.lq D.lpfeed D.lp, noconstant Source  SS df MS +Model  .00083485 2 .000417425 Residual  .055731852 37 .001506266 +Total  .056566702 39 .001450428 Number of obs F( 2, 37) Prob > F Rsquared Adj Rsquared Root MSE = 39 = 0.28 = 0.7595 = 0.0148 = 0.0385 = .03881
D.lq  Coef. Std. Err. t P>t [95% Conf. Interval] +lpfeed  D1  .0392005 .060849 0.64 0.523 .1624923 .0840913 lp  D1  .0038389 .0910452 0.04 0.967 .1806362 .188314 
We have a problem. The coefficient on price is virtually zero. This means that the supply of chicken does not respond to changes in price. We need to use instrumental variables. Let’s first try instrumental variables on the demand equation. The exogenous variables are lpfeed, ly, and lpbeef.
. ivreg D.lq D.ly D.lpbeef (D.lp= D.lpfeed D.ly D.lpbeef ), noconstant Instrumental variables (2SLS) regression Source  SS df MS +Model  .032485142 3 .010828381 Residual  .024081561 36 .000668932 +Total  .056566702 39 .001450428 Number of obs F( 3, 36) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 39 . . . . .02586
138
D.lq  Coef. Std. Err. t P>t [95% Conf. Interval] +lp  D1  .4077264 .1410341 2.89 0.006 .6937568 .1216961 ly  D1  .9360443 .1991184 4.70 0.000 .5322135 1.339875 lpbeef  D1  .3159692 .0977788 3.23 0.003 .1176646 .5142739 Instrumented: D.lp Instruments: D.ly D.lpbeef D.lpfeed 
Not much different, but the estimate of the income elasticity of demand is now virtually one, indicating that chicken demand will grow at about the same rate as income. Note that the demand curve is just identified (by lpfeed). Let’s see if we can get a better looking supply curve.
. ivreg D.lq D.lpfeed (D.lp=D.lpbeef D.ly), noconstant Instrumental variables (2SLS) regression Source  SS df MS +Model  .079162381 2 .03958119 Residual  .135729083 37 .003668354 +Total  .056566702 39 .001450428 Number of obs F( 2, 37) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 39 . . . . .06057
D.lq  Coef. Std. Err. t P>t [95% Conf. Interval] +lp  D1  .6673431 .2617257 2.55 0.015 .1370364 1.19765 lpfeed  D1  .2828285 .1246235 2.27 0.029 .5353398 .0303173 Instrumented: D.lp Instruments: D.lpfeed D.lpbeef D.ly 
This is much better. The coefficient on price is now positive and significant, indicating that the supply of chickens will increase in response to an increase in price. Also, increases in chickenfeed prices will reduce the supply of chickens. The supply curve is over identified (by the omission of ly and lpbeef). We could have asked for robust standard errors, but they didn’t make any difference in the levels of significance, so we used the default homoskedastic only standard errors. Note to the reader. I have taken a few liberties with the original example developed by Professors Epple and McCallum. They actually develop a slightly more sophisticated, and more believable, model, where the coefficient on price in the supply equation starts out negative and significant in the supply equation and eventually winds up positive and significant. However, they have to complicate the model in a number of ways to make it work. This is a cleaner example, using real data, but it could suffer from omitted variable bias. On the other hand, the Epple and McCallum demand curve probably suffers from omitted variable bias too, so there. Students are encouraged to peruse the original paper, available on the web.
Diagnostic tests
139
There are three diagnostic tests that should be done whenever you do instrumental variables. The first is to justify the identifying restrictions.
Tests for over identifying restrictions
In the above example, we identified the supply equation by omitting income and the price of beef. This seems quite reasonable, but it is nevertheless necessary to justify omitting these variables by showing that they would not be significant if we included them in the regression. There are several versions of this test by Sargan, Basmann, and others. It is simply an LM test where we save the residuals from the instrumental variables regression and then regress them against all the instruments. If the omitted instrument does not belong in the equation, then this regression should have an Rsquare of zero. We test this null hypothesis with the usual nR 2 χ 2 ( m − k ) where m is the number of instruments excluded from the equation and k is the number of endogenous variables on the right hand side of the equation. We can implement this test easily in Stata. First, let’s do the ivreg again.
. ivreg D.lq D.lpfeed (D.lp=D.lpbeef D.ly), noconstant Instrumental variables (2SLS) regression Source  SS df MS +Model  .079162381 2 .03958119 Residual  .135729083 37 .003668354 +Total  .056566702 39 .001450428 Number of obs F( 2, 37) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 39 . . . . .06057
D.lq  Coef. Std. Err. t P>t [95% Conf. Interval] +lp  D1  .6673431 .2617257 2.55 0.015 .1370364 1.19765 lpfeed  D1  .2828285 .1246235 2.27 0.029 .5353398 .0303173 Instrumented: D.lp Instruments: D.lpfeed D.lpbeef D.ly 
If your version of Stata has the overid command installed, then you can run the test with a single command.
. overid Tests of overidentifying restrictions: Sargan N*Rsq test 0.154 Chisq(1) Basmann test 0.143 Chisq(1)
Pvalue = 0.6948 Pvalue = 0.7057
The tests are not significant, indicating that these variables do not belong in the equation of interest. Therefore, they are properly used as identifying variables. Whew! If there is no overid command after ivreg, then you can install it by clicking on help, search, overid, and clicking on the link that says, “click to install.” It only takes a few seconds. If you are using Stata on the server, you might not be able to install new commands. In that case, you can do it yourself as follows. First, save the residuals from the ivreg.
. predict es, resid (13 missing values generated)
140
Now regress the residuals on all the instruments. In our case we are doing the regression in first and the constant term was not included. So use the noconstant option.
. regress es D.lpfeed D.lpbeef D.ly, noconstant Source  SS df MS +Model  .000535634 3 .000178545 Residual  .135193453 36 .003755374 +Total  .135729086 39 .003480233 Number of obs F( 3, 36) Prob > F Rsquared Adj Rsquared Root MSE = 39 = 0.05 = 0.9860 = 0.0039 = 0.0791 = .06128
es  Coef. Std. Err. t P>t [95% Conf. Interval] +lpfeed  D1  .005058 .0863394 0.06 0.954 .1700464 .1801625 lpbeef  D1  .0536104 .1686351 0.32 0.752 .3956183 .2883974 ly  D1  .1264734 .3897588 0.32 0.747 .6639941 .9169409
We can use the Fform of the LM test by invoking Stata’s test command on the identifying variables.
. test D.lpbeef D.ly ( 1) ( 2) D.lpbeef = 0 D.ly = 0 F( 2, 36) = Prob > F = 0.07 0.9313
The results are the same, namely, that the identifying variables are not significant and therefore can be used to identify the equation of interest. However, the probvaluses are not identical to the large sample chisquare versions reported by the overid command. To replicate the overid analysis, let’s start by reminding ourselves what output is available for further analysis.
. ereturn list scalars: e(N) e(df_m) e(df_r) e(F) e(r2) e(rmse) e(mss) e(rss) e(r2_a) e(ll) macros: e(depvar) e(cmd) e(predict) e(model) matrices: e(b) : e(V) : functions: e(sample) 1 x 3 3 x 3 : : : : "es" "regress" "regres_p" "ols" = = = = = = = = = = 39 3 36 .0475437411474742 .00394634310271 .0612811038678276 .0005356335440678 .1351934528853412 .0790581283053975 55.12129587257286
141
J) . scalar nr2=n*r2 We know there is one degree of freedom for the resulting chisquare statistic because m=2 omitted instruments (ly and lpbeef) and k=1 endogenous regressor (lp).024081561 39 . An equivalent test of this hypothesis is the Jtest for overidentifying restrictions.94e10 .1443446 lpfeed  D1  1.09508748 probJ = . J is distributed as chisquare with the same mk degrees of freedom.00 = 1.3336173 lpbeef  D1  3. Std.000617476 Number of obs F( 3.0833 = .95356876 All of these tests agree that the identifying restrictions are valid. Interval] +ly  D1  5. scalar r2=e(r2) . Instead of using the nRsquare statistic.lpfeed.nr2) . scalar list r2 = n = nr2 = prob = r2 n nr2 prob .0364396 0.3336173 .02586 ed  Coef. 142 .1644979 0. . scalar n=e(N) .0711725 0.000 . scalar J=2*e(F) . . The fact that they are not significant means that income and beef prices do not belong in the supply of chicken equation.000668932 +Total  .000 . scalar prob=1chi2(1.00394634 39 . noconstant Source  SS df MS +Model  0 3 0 Residual  .02e11 .0739029  So.) .000 . the Jtest uses the j=mF statistic where F is the Ftest for the auxiliary regression.0000 = 0. scalar probJ=1chi2(2.0739029 .ly D.00 1. 36) Prob > F Rsquared Adj Rsquared Root MSE = 39 = 0. If we forget and try to do the LM test we get a regression that looks like this (ed is the residual from the ivreg estimate of the demand curve above.00 1.00 1. We cannot do the tests for over identifying restrictions on the demand curve because it is just identified.69482895 This is identical to the Sargan form reported by the overid command above. if you see output like this.0000 = 0. From the output above. regress ed D. The problem is that the residual from the ivreg regression will be an exact linear combination of the instruments.024081561 36 . you know you can’t do a test for over identifying restrictions.70e10 .1443446 .. scalar list J probJ J = .15390738 .lpbeef D. because your equation is not over identified. t P>t [95% Conf. Err.
lpbeef. regress lp ly lpbeef lpfeed. then 2SLS is still half as biased as OLS.331657 .75 0. t P>t [95% Conf. For example.1196143 . Note that it has the same problem as the tests for over identifying restrictions.1264946 . it is also relevant in the case of simultaneous equations.260 . So we cannot use it for the demand function. that would be a regression of lp on ly.1724664 lpbeef  .0000 = 0.9997 = 0. In the case of our chicken supply curve.1046508 1.0419707 Number of obs F( 3. it does not work where the equation is just identified. namely.628012 .08267 lp  Coef.0000 Obviously.14 0. is there a problem when the instrument is only vaguely associated with the endogenous regressor? Bound. Std.4730568 .0226887 5. Anyway.252858576 37 .05 = 0.141989 Residual  . . Actually we should have predicted this by the difference between the OLS and 2sls versions of the equations above. test ly lpbeef ( 1) ( 2) ly = 0 lpbeef = 0 F( 2.0764761 8. Jaeger6.58 0. HausmanWu test We have already encountered the HausmanWu test in the chapter on errors in variables.0924285 .006834016 +Total  801.Test for weak instruments To be an instrument a variable must satisfy two conditions. the BJB F test is very easy to do and it should be routinely reported for any simultaneous equation estimation. and lpfeed and the F* test would be on lpbeef and ly. It must be (1) highly correlated with the problem variable and (2) it must be uncorrelated with the error term.21 0. and Baker have demonstrated that instrumental variable estimates are biased by about 1/F* where F* is the Ftest on the (excluded) identifying variables in the reduced form equation for the endogenous regressor. What happens if (1) is not true? That is.678827 40 20. that is our own Professor Jaeger.0805228 . noconstant Source  SS df MS +Model  801. 37) = Prob > F = 35. 6 Yes. 37) Prob > F Rsquared Adj Rsquared Root MSE = 40 =39090. Err.000 .9997 = . However. if F*=2. Consider the following equation system. these are not weak instruments and two stage least squares is much less biased than OLS. Interval] +ly  .425968 3 267. This is the percent of the OLS bias that remains after applying two stage least squares. 143 .000 .7829672 lpfeed  .
4666 .034261762 3 .lq D.011420587 Residual  . OLS and IV.522675 lpbeef  D1  . If we apply OLS to equation (B1) we get unbiased. and efficient estimates.43 0.09 0.0000 0.lpfeed. If it is large and the standard error is even larger.4288641 . 36) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 39 12. 36) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 39 18.000619582 +Total  . we should do it both ways.132113063 3 . we estimate the reduced form equation for lp and save the residuals.0000 0. regress D. OLS is biased and we need to use instrumental variables.003560037 +Total  .056566702 39 . resid (13 missing values generated) Now. Std.128161345 36 . If it is not significant.0878849 .02489 D.043 .006673703 Number of obs F( 3.022304941 36 .lpbeef D. .6767671 lpfeed  D1  . consistent.5076 0.5728 .05967 D.lp  Coef.37 0. then we need to check the magnitude of the coefficient. Err. Interval] +lp  144 .3437728 . noconstant Source  SS df MS +Model  .lpfeed. t P>t [95% Conf.0107785 .98 0. add the error term to the original regression.6057 0. regress D. Interval] +ly  D1  . Suppose we are interested in getting a consistent estimate of β12 .084064 3.1641908 2.lp D.Model A (simultaneous) (A1) y1 = β12 y2 + β13 z1 + u1 (A2) y2 = β 21 y1 + β 22 z2 + β 23 z3 + u2 Model B (recursive) (B1) y1 = β12 y2 + β13 z1 + u1 (B2) y2 = β 22 z2 + β 23 z3 + u2 where cov(u1u2 ) = 0. If it is significant.2583745 .0165944 1. If we apply OLS to equation (A1) we get biased and inconsistent estimates. Err. and hope we get essentially the same results.lp dlperror D. To apply the HausmanWu test to see if we need to use 2sls on the supply equation.044037688 Residual  .004 .260274407 39 . In that event.7530405 . then there is a significant difference between the OLS estimate and the 2sls estimate. . t P>t [95% Conf. And it is the same equation! We need to know about the second equation. In which case. Std.001450428 Number of obs F( 3. not only the equation of interest. noconstant Source  SS df MS +Model  .ly D.3794868 1.07 0.055 . predict dlperror. then we cannot be sure that OLS is unbiased.lq  Coef.
lp dlperror.2828285 .3867013 .1075623 6. Std.1920032 4.lq  Coef.lpfeed.93 0.1280782 7.000 .88 0.056566702 39 .9360443 .lp dlphat D.000 .200505 lpfeed  D1  .2734069 .94075 .4144198 .200505 .000 . 145 . The test coefficient is simply the negative of the coefficient on the residuals.034797396 4 .6057 0.lpbeef D. noconstant Source  SS df MS +Model  .00 0.4445185  The test reveals that the OLS estimate is not significantly different from the instrumental variables estimate.02494 D.88 0.000 1.0000 0.02489 D.8854896  .132394 dlphat  . Interval] +ly  D1  .5728 . noconstant Source  SS df MS +Model  . t P>t [95% Conf.001450428 Number of obs F( 4.005 .35 0.5712 .1789557 dlperror lpfeed You can also do the HausmanWu test using the predicted value of the problem variable instead of the residual from the reduced form equation.1280782 7.1359945 3.5073777 lp  D1  .4491967 .546257 1. 36) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 39 18.051217 5.0000 0.1758793 . regress D.385 .056566702 39 .051217 5. Err. regress D.35 0.52 0.001450428 Number of obs F( 3.6838099 .0695298 3.000 .1527992 0.1789557  Applying the HausmanWu test to the demand equation yields the following test equation. .3159692 .002 .D1  .6809953  D1  .000 .lq D.008699349 Residual  .0942849 3.021769306 35 .99 0.6809953 1.011420588 Residual  .94075 .00062198 +Total  .2828285 .ly D.000619582 +Total  .6152 0. The following HW test equation was performed after saving the predicted value.1343196 .034261763 3 . Interval] +lp  D1  .43 0.6673431 .lq  Coef.131643 dlperror  . Std. Err. .1245608 .35 0.02230494 36 . 35) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 39 13.4077264 .lq D.325832 lpbeef  D1  .000 .52 0. t P>t [95% Conf.3867013 .20 0.
because the proportions have to sum to one (FP/T+S/T+O/T=1). then. any random increase in expenditures on fire and police must imply a random decrease in schools and/or other components. However. because the residuals are correlated across equations. That being said. we simply put the varlist for each equation in parentheses. so OLS can be expected to yield unbiased and consistent estimates. Applying SUR to two stage least squares results in three stage least squares. we should use SUR in those cases where the dependent variables sum to a constant. Three stage least squares can be implemented in Stata with the reg3 command. then the SUR estimates will be identical to the OLS estimates and there is no efficiency gain. and C. Three stage least squares Whenever we have a system of simultaneous equations. Because the proportion of the budget going to other things is simply one minus the sum of expenditures on fire. In the sureg command.. Like the sureg command. schools (S) and other (O) to population density (D). E (U iU j ) ≠ 0 for ij. it is not clear that we should use SUR for all applications for which it might be employed. police. The specification error is “propagated” across equations. 146 . Thus. and use these estimated covariances to improve the OLS estimates. In this example I use equation labels to identify the demand and supply equations. For this reason. compute the covariances between U1. the SUR estimates of the other equation are biased. Y. and schools. the error terms must be correlated across equations. we can use the sureg command to estimate the above system. income (Y). the equation varlists are put in equations. if there is specification bias in one equation. such as in the above example. Also. FP / T = a0 + a1 D + a2Y + U1 S / T = b0 + b1Y + b2 C + U 2 O / T = c0 + c1 D + c2Y + c2 C + U 3 These equations are not simultaneous. Sureg (FPT D Y) (ST Y C) Note: if the equations have the same regressors. The equations relate the proportion of the total budget (T) in each of these categories to the explanatory variables. The procedure is to estimate the equations using OLS.e. This technique is known as seemingly unrelated regressions (SUR) and Zellner efficient least squares (ZELS) for Alan Zellner from the University of Chicago who developed the technique. etc. In Stata. save the residuals. two stage least squares estimates of a multiequation system could potentially be made more efficient by incorporating the information contained in the covariances of the residuals. Let FPT be the proportion spent on fire and police and ST be the proportion spent on schools. say omitted variables. D. This is information that could be used to increase the efficiency of the estimates.Seemingly unrelated regressions Consider the following system of equations relating local government expenditures on fire and police (FP). and the number of school age children (C). we drop the third equation. i. U2. Therefore. we automatically have equations whose errors could be correlated.
reg3 (Demand:D.109993 0.0484276 3. Err.1600626 .2744758 .30 0.0044904 6. Std.9258  Coef.In the first example.0040577 . I replicate the ivreg above exactly by specifying each equation with the noconstant option (in the parentheses) and adding . Let’s try adding the intercept term.93 0. Std.0958591 .0265163 0.lp D.0017 Supply 39 2 .0336213 1.lp D.lq D.lp D.931 .0167905 .ly D.1944 9.2820235 .1150026 _cons  .0549483 3.lpbeef) (Supply: D.5204546 .0546049 2.86 0.lp D.56 0.0000 Supply 39 2 . Interval] +Demand  lp  D1  .0468383 0.0249881 0.1207482 .3026876 . z P>z [95% Conf.764 . noconstant to the model as a whole to make sure that the constant is not used as an instrument in the reduced form equations.0084  Coef.003 .389 .138615 lpfeed  D1  .86 0. .0186104 .0250497 .0059 0.0379724 0.000 .lpfeed Yuck.5152 36.144 .lp D.1871072 . noconstant Threestage least squares regression Equation Obs Parms RMSE "Rsq" chi2 P Demand 39 3 . z P>z [95% Conf. Interval] +Demand  lp  D1  .0872944 ly  D1  .lq Exogenous variables: D.1255017 4.0491061 .0877437 Endogenous variables: D.lq D.000 .. noconstant).0948342 .10 0.0274115 .053039 .15 0.7664334 lpbeef  D1  .0921909 ly  D1  .55 0.15 0.lpbeef D.3104166 lpbeef  D1  .000 .09 0.lq D.ly D.lpfeed > .0835039 0. reg3 (Demand:D.0362125 +Supply  lp  147 .31 0.000 . noconstant) (Supply: D.194991 . .lq D. Err.lpfeed) Threestage least squares regression Equation Obs Parms RMSE "Rsq" chi2 P Demand 39 3 .2670861 +Supply  lp  D1  .lpbeef.13 0.1887144 .ly D.46 0. The supply function is terrible.2754 15.0236988 0.
0018805 . consistent.039009 Endogenous variables: D. but neither of them is a function of Y3.0042607 7.ly D.1627361 .lq Exogenous variables: D.lpfeed lpfeed Better. Note that Y3 is a function of Y2 and Y1.0404607 _cons  . This model system is known as block recursive and is very common in large simultaneous macroeconomic models. that is. but not completely simultaneous. therefore there is no simultaneous equation bias in this model. 148 .10 0. then this is a recursive system.2700624 .0223074 .0554097  D1  . consider the following model. and efficient estimates (assuming the equations have no other problems). Now consider the following general three equation model. Y1 = α 0 + α1 X 1 + α 2 X 2 + u1 Y2 = β 0 + β1Y1 + β 2 X 1 + β3 X 2 + u2 Y3 = γ 0 + γ 1Y1 + γ 2Y2 + γ 3 X 1 + γ 4 X 2 + u3 This equation system is simultaneous.97 0.0196841 0. there is no correlation between the error term and any right hand side variable in the same equation.003 . If we also assume that there is no covariance among the error terms. Since there is no feedback from any right hand side endogenous variable to the dependent variable in the same equation. Y1 = α 0 + α1Y2 + α 2 X 1 + α 3 X 2 + u1 Y2 = β 0 + β1Y1 + β 2 X 1 + β3 X 2 + u2 Y3 = γ 0 + γ 1Y1 + γ 2Y2 + γ 3 X 1 + γ 4 X 2 + u3 Now Y1 is a function of Y2 and Y2 is a function of Y1.0306582 .0366997 . None of the other variables are significant. Normally the three stage least squares estimates are almost the same as the two stage least squares estimates. The model is said to be triangular because of this structure. E (u1u2 ) = E (u1u3 ) = E (u2 u3 ) = 0 . These two equations are fully simultaneous. If we assume that E (u1u3 ) = E (u2 u3 ) = 0 then OLS applied to the third (recursive) equation yields unbiased.000 . but Y1 is not a function of Y2.20 0.0547593 2.D1  . Types of equation systems We are already familiar with a fully simultaneous equation system. This part of the model is called the simultaneous core. OLS applied to either will yield biased and inconsistent estimates. Y2 is a function of Y1. consistent. The supply and demand system (S) q = α p + ε (D) q = β p + β y + u 1 2 is fully simultaneous in that quantity is a function of price and price is a function of quantity. Three stage least squares has apparently made things worse.lp D.lpbeef D. This is very unusual. at least for the estimates of the price elasticities. and efficient estimates.924 . OLS applied to any of the equations will yield unbiased. Finally.
and efficient estimates. consistent. suppose this is agricultural data. (This is the famous cobweb model.000 more police. For example. In this case we need to know the coefficient linking the police to the crime rate. Then the supply is based on last year’s price while demand is based on this year’s price. I don’t recommend three stage least squares because of possible propagation of errors across equations. we can at least generate a bunch of instruments by assuming that the equations are also functions of lagged variables. OLS applied to the reduced form equations are unbiased. say threestrikes laws. A strategy that is frequently adopted when large systems of simultaneous equations are being estimated is to plow ahead with OLS despite the fact that the estimates are biased and inconsistent. on crime. the supply equation could be a function of last years supply as well as last year’s price. suppose we want to estimate the potential effect of a policy that puts 100. suppose we are trying to estimate the effect of a certain policy. OLS applied to either equation will yield unbiased. it may be the case that we can’t retreat to the reduced form equations.” The temptation is to never get around to the next step. Even if we can’t assume that current supply is based on last year’s price. but inefficient estimate of the parameters.000 cops on the streets. Usually. For example. there is always the problem of identification and/or weak instruments. We have no choice but to use instrumental variables if we are to generate consistent estimates of the parameter of interest. For many situations. In that case. We may have police in the crime equation and police and crime may be simultaneously determined. all variables are functions of all other variables and no equation can ever be identified. 149 .Strategies for dealing with simultaneous equations The obvious strategy when faced with a simultaneous equation system is to use two stage least squares. In the above example. but they are efficient and the bias might not be too bad (and 2sls is also biased in small samples). However. The estimates may be biased and inconsistent. That coefficient is available from the reduced form equation for crime. researchers adopting this strategy insist that this is just the first step and the OLS estimates are “preliminary. we now have a recursive system. Even if we can identify the equation of interest with some reasonable exclusion assumptions and we have good instruments.) (S) q = α p +ε t t −1 t (D) q = β p + β y + u t 1 t 2 t t Presto! Assuming that E (ut ε t ) = 0 and assuming we have no autocorrelation (see Chapter 14 below). we have the following equation system. A strategy that is available to researchers that have time series data is to use lags to generate instruments or to avoid simultaneity altogether. and efficient. the best we can hope for is a consistent. the best solution may be to estimate one or more reduced form equations. Some critics have even suggested that. (S) q = α p + ε t t t (D) q = β p + β y + u t 1 t 2 t t Note that this model is fully simultaneous and the only way to consistently estimate the supply function is to use two stage least squares. rather than trying to estimate equations in the structural model. So OLS might not be too bad. However. However. since the basic model of all economics is the general equilibrium model. so that we can multiply it by 100K in order to estimate the effect of 100. Suppose we start with the simple demand and supply model above where we add time subscripts. consistent. but we do not have to know the coefficient on the police variable in the crime equation to determine the effect of the threestrikes law on crime.
First we need an instrument. this year’s quantity and price cannot affect last year’s values.000143 .000 2. 46) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 50 220. Many openended grants (that are functions of the amount the state spends.0000 = 0.4741 +Total  34768961.0000 0. Therefore aid is a potentially endogenous variable.57 124. exp. Now both equations are identified and both can be consistently estimated using two stage least squares.58 0. Err.636 Number of obs F( 3. Std.59 150 .80422 .3839306 _cons  45. t P>t [95% Conf. Time series analysts frequently estimate vector autoregression (VAR) models in which the dependent variable is a function of its own lags and lags of all the other variables in the simultaneous equation model. and save the residuals.9930 = 0. The OLS regression model is . This is essentially the reduced form equation approach for time series data where the lagged endogenous variables are the exogenous variables.0001902 .769 +Total  931626487 49 19012785. Since time does not go backwards.39169 0.(S) q = α p + α p +α q +ε t 1 t 2 t −1 3 t −1 t (D) q = β p + β y + u t 1 t 2 t t Note that we have two new instruments (pt1 and qt1) which are called lagged endogenous variables. . regress exp aid inc pop Source  SS df MS +Model  925135805 3 308378602 Residual  6490681. regress aid ps inc pop Source  SS df MS +Model  32510247.1733 However. w .000 .000 . We regress aid on ps. as control variables. these are valid instruments (again assuming no autocorrelation). like matching grants) are for education. population and income The data are in a Stata data set called EX73 (for example 7.69833 84.9926 = 375.4 3 10836749.233747 .754569 3. 46) Prob > F Rsquared Adj Rsquared Root MSE = 50 = 2185. many grants are tied to the amount of money the state is spending.50 = 0.1 Residual  2258713.3 from Pindyck and Rubinfeld).38 46 141101.9350 0.5940753 . A variable that is correlated with grants but independent of the error term is the number of school children.81 46 49102.10 0.69 0. So the aid variable could be a function of the dependent variable. Let’s do a HausmanWu test to see if we need to use two stage least squares. ps.4 Number of obs F( 3.0002375 pop  . and the other exogenous ˆ variables.1043992 5.238054 13. and.712925 inc  .54 0. Interval] +aid  3.591 215.9308 221. Another example Pinydyck and Rubinfeld have an example where state and local government expenditures (exp) are taken to be a function of the amount of federal government grants (aid).2 49 709570.70 0.0000235 8.64 exp  Coef.
316 +Total  931626487 49 19012785. Std.02 0. resid Now add the residual.738443 .0775019 .429 317.0000 = 0. t P>t [95% Conf.31 0. To see this.40924 49.301 263.7636841 5. Std. Err. Std.19 exp  Coef.22021 1.5 46 231024. regress exp aid what inc pop Source  SS df MS +Model  925559177 4 231389794 Residual  6067310.77 0. Interval] +ps  . in fact.060796 what  1.2188803 _cons  90.1253 112. ordinary least squares is biased. 45) Prob > F Rsquared Adj Rsquared Root MSE = 50 = 1716. the same as in the HausmanWu test equation.75 0.000 . 151 .0000162 .605508 . . at the recommended 10 percent significance level for specification tests.035 1.05 0.3039 137.05 45 134829.9886 = 0.0533 Instrumented: aid Instruments: inc pop ps  Note that ivreg uses all the variables that are not indicated as problem variables (inc and pop) as well as the designated instrument.8018143 1.0000133 2.4 Number of obs F( 4. Interval] +aid  4.0001275 .09 = 0.510453 6.0000553 2.1117587 4.415 . let’s use IV to derive consistent estimates for this model.39 0. .9878 = 480.53111 Apparently there is significant measurement error. It turns out that the coefficient on aid in the HausmanWu test equation is exactly the same as we would have gotten if we had used instrumental variables.51 0.026 .0000 = 0.0000422 3.9996564 4. to the regression.0001275 .17 = 0. t P>t [95% Conf. ivreg exp (aid=ps) inc pop Instrumental variables (2SLS) regression Source  SS df MS +Model  920999368 3 306999789 Residual  10627118.0000424 .0000633 pop  .65 exp  Coef.083 3.1248854 1. Interval] +aid  4.59 0. Note that the coefficient is. If it is significant.5133494 . Err. as instruments.522657 .4 Number of obs F( 3.17 0. predict what.2882557 _cons  90.001 .000 2.4252605 _cons  42.3838421 2.5133494 . what (what).420832 .8328735 .1462913 3.0000365 .035769 .194105 inc  . t P>t [95% Conf.1738793 .79e06 .68255 0.7817 83.008 9.112 +Total  931626487 49 19012785.8616 0.8078184 .1253 86.80 0.522657 . Err.85 0.9935 = 0.52 0.59654 142.92 0.0002387 pop  . ps.171 .534861 inc  . 46) Prob > F Rsquared Adj Rsquared Root MSE = 50 = 1304.984518 6.398 57.0602394 inc  .004 .0002125 pop  .9929 = 367.aid  Coef.000 2.
for each of the potentially 152 .522657 1.605508 .5133494 . which means that it is not a weak instrument. first robust Firststage regressions Source  SS df MS +Model  32510247.81 46 49102.17 0.2 49 709570.0000903 1.79e06 .9350 0.8864062 . the analyst should think about possible reverse causation with respect to each right hand side variable in turn.59654 142.41 0. 46) = Prob > F = Rsquared = Root MSE = 50 972.04 0.0000633 pop  .035 1. Interval] +inc  . t P>t [95% Conf.415  IV (2SLS) regression with robust standard errors Number of obs = F( 3.40 0. Std. although the sign doesn’t look right to me.85 0. Std.0775019 .77 0. t P>t [95% Conf.008 9.636 Number of obs F( 3.75 0.0003093 pop  .515 2.68255 0.0000 0.0000365 .1248854 1.4741 +Total  34768961.0000544 . If there is a good argument for simultaneity.0000 0.302 263.9308 221.67117 Instrumented: aid Instruments: inc pop ps  Because the equation is just identified.59 aid  Coef.” We can also request robust standard errors.165 . However. 46) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 50 220. add the option “first. Summary The usual development of a research project is that the analyst specifies an equation of interest where the dependent variable is a function of the target variable or variables and a list of control variables.1738793 .99 0. we can’t test the over identification restriction.1253 86.1853334 2.9886 480.171 . ps is significant in the reduced form equation.9218 83. Err.65  Robust exp  Coef.70 0.4252605 ps  .008 . ivreg exp (aid=ps) inc pop.34153 1.4 3 10836749.0000133 2. or at least a list of possible instruments.005 1.39 0.3838421 2.40924 49. Interval] +aid  4.398 57.0602394 _cons  42. the analyst should specify another equation.0001275 .1 Residual  2258713.1402925 _cons  90.8328735 .572194 inc  .If we want to see the first stage regression. Before estimating with ordinary least squares. Err. .47312 7.
if arrests are not simultaneous. The safest thing to do is to estimate the original crime equation in OLS and the reduced form version. among other things. Usually. a number of control variables. which are relatively inefficient. derived by simply dropping the arrest variable. like the threestrikes law. the two variables cannot be simultaneous. For example. The crime equation will be a function of arrests. lagging the potentially endogenous variable allows us to sidestep the specification of a multiequation mode. If the results are the same in either case. creating additional variance of the estimates). these estimates are biased by omitted variables. so much so that the loss of efficiency is a small price to pay. The resulting estimated coefficient on the threestrikes variable is unbiased. The arrest variable is likely to be endogenous. Suppose that there is a significant lag between the time a crime is committed and the resulting prison incarceration of the guilty party. Then prison is not simultaneously determined with crime. since time cannot go backwards. Jaeger. I have seen large simultaneous macroeconomic models estimated on quarterly data avoid the use of two stage least squares by simply lagging the endogenous variables. Should we use three stage least squares? Generally. There are a couple of ways to avoid this tradeoff. the target variable is an exogenous policy variable. crime this year is a function of the law. that is. Of course. and see if the resulting estimated coefficient on the target variable are much different. then. On the other hand. and employ two or three stage least squares. consistent. and the target variable. the other equations are only there to generate consistent estimates of a target parameter in the equation of interest. suppose we are interested in the effect of three strikes laws on crime. assuming your model is not just identified. and efficient. but relatively efficient compared to instrumental variables (because changing the list of instruments will cause the estimates to change. Given that there may be simultaneity.These arguments require that the residuals be independent. Finally. and efficient.endogenous variables. The first is to estimate the reduced form equation only. I recommend against it. Is this a good deal for analysts? The consensus is that OLS applied to an equation that has an endogenous regressor results in substantial bias. Baker F test for weak instruments whenever you estimate a simultaneous equation model. Similarly. Finally. Another way around the problem is available if the data are time series. we can estimate the reduced form equation. passed and implemented. So. consistent. not autocorrelated. be sure to report the results of the HausmanWu test. the test for over identifying restrictions. with the attendant data collection to get the additional variables. Alternatively. but still biased (presumably less biased than OLS). instrumental variables is consistent. We deal with autocorrelation below. inconsistent. and the Bound. what is the best way to deal with it? We know that OLS is biased. it usually takes some time to get a law. then we can be confident we have a good estimate of the effect of threestrikes laws on crime. I don’t want specification errors in those equations causing inconsistency in the equation of interest after I have gone to all the trouble to specify a multiequation model. We know that OLS applied to the reduced form equations results in estimates that are unbiased. 153 . hence not simultaneous. Sometimes this process is shortened by the review of the literature where previous analysts may have specified simultaneous equation models. So we are swapping consistency for efficiency when we use two stage least squares. so that consumption this quarter is a function of income in the previous quarter. In that case. if a variable Y is known to be a function of lagged X. but the law is a function of crime in previous years. For many policy analyses. and inefficient relative to OLS. dropping arrests. We could specify an arrest equation. a dummy variable that is one for states with threestrikes laws.
model. ADL. etc. Linear dynamic models ADL model The mother of all time series models is the autoregressivedistributed lag.13 TIME SERIES MODELS Time series data are different from cross section data in that it has an additional source of information. say GDP from 1950 to 2000 as a random sample? It seems a little silly to argue that GDP in 1999 was a random draw from a normal population of potential GDP’s ranging from positive to negative infinity. etc. An interesting question arises with respect to time series data. reaction time. Simultaneity is a serious problem for cross section data. This raises the possibility of using lags as a source of information. momentum. assuming one dependent variable (Y) and one independent variable (X). It has the form. This means that time series data have all the problems of cross section data. that something that happened in 1999 cannot cause something that happened in 1998. Nevertheless. The fact that time’s arrow only flies in one direction is extremely useful. plus some more. This is a good reason to prefer time series data. Time series models should take dynamics into account: lags. The reverse. The value of GDP in 1999 was almost certainly a function of GDP in 1998. of course. is quite possible. In a cross section it doesn’t matter how the data are arranged. Yt = a0 + a1Yt −1 + + a pYt − p + ε t so that the error term is ε t = Yt − a0 − a1Yt −1 − − a pYt − p So the error term is simply the difference from what the value GDP today is compared to what it should have been if the true relationship had held exactly. However. or by region. or any number of ways. namely. for example. Yet we can treat any tme series as a random sample if we condition on history. etc. 1997. The fifty states could be listed alphabetically. 154 . something that is not possible in cross sections. a time series must be arrayed in order. It means. Let’s assume that GDP today is a function of a bunch of past values of GDP. So a time series regression of the form Yt = a0 + a1Yt −1 + + a pYt − p + ε t has a truly random error term. but lags can help a time series analysis avoid the pitfalls of simultaneous equations. 1998 before 1999. time series data have a unique advantage because of time’s arrow. can we treat a time series.
LX t = X t −1 L2 X t = L( LX t ) = L( X t −1 ) = X t − 2 L3 X t = L( L2 X t ) = L( X t − 2 ) = X t − 3 Lk X t = X t − k A first difference can be written as ∆X t = (1 − L ) X t = X t − X t −1 So.Yt = a0 + a1Yt −1 + + a pYt − p + b0 X t + b1 X t −1 + + bp X t − p + ε t Lag operator Just for convenience.p) model to be stable Yt = a0 + a1Yt −1 + + a pYt − p + b0 X t + b1 X t −1 + + bp X t − p + ε t we require that a1 + a2 + + a p < 1 The coefficients on the X’s. For the model to be stable (so that it does not explode to positive or negative infinity). So." a ( L) = a0 + a1 L + a2 L2 + + a p Lp = ai Li i =0 p b( L) = b0 + b1 L + b2 L2 + + bk Lk = b j L j j =0 k The "lag operator" works like this. the roots of the lag polynomial must sum to a number less than one in absolute value. the ADL(p. we will resort to a shorthand notation that is particularly useful for dynamic models. the discrete analog of a differential equation. are not constrained because the X’s are the inputs to the difference equation and do not depend on their own past values. using lag notation.1) model. for a model like the following ADL(p. and the values of the X’s. the ADL model is a difference equation. we can write the ADL model very conveniently and compactly as a ( L)Yt = b( L) X t + ε t where a ( L) and b( L) are "lag polynomials. Let’s start with the simple ADL(1. we will have to make sure all our dynamic models are stable. 155 . Using the lag operator.p) model Yt = a0 + a1Yt −1 + + a pYt − p + b0 X t + b1 X t −1 + + bp X t − p + ε t can be written as Yt − a0 − a1Yt −1 − − a p Yt − p = b0 X t + b1 X t −1 + + bp X t − p + ε t (1 − a 0 − a1 L − − a p Lp ) Yt = ( b0 + b1 L + + bp Lp ) X t + ε t a ( L )Yt = b( L) X t Mathematically. All dynamic models can be derived from the ADL model. Since we do not observe explosive behavior in economic systems.
if a1 = 1 and a0 = 0 then Y is a "random walk" Yt = Yt −1 + ε t Y is whatever it was in the previous period.Yt = a0 + a1Yt −1 + b0 X t + b1 X t −1 + ε t Static model The static model (no dynamics) can be derived from this model by setting a1 = b1 = 0 Yt = a0 + b0 X t + ε t AR model The univariate autoregressive (AR) model is Yt = a0 + a1Yt −1 + ε t (b0 = b1 = 0) Random walk model Note that. First difference model The first difference model can be derived by setting a1 = 1. Y drifts up. If a0 ≠ 0. Yt − Yt −1 = ∆Yt = ε t So. The random walk model describes the movement of a drop of water across a hot skillet. if it is negative. plus a random error term. Y drifts down. Also. because ∆Yt = (1 − L)Yt (1 − L)Yt = a ( L)Yt = ε t Y is said to have a unit root because the root of the lag polynomial is equal to one. the movements of Y are completely random. 156 . so ∆Yt = a0 + ε t This is a random walk. b0 = −b1 Yt = a0 + a1Yt −1 + b0 X t + b1 X t −1 + ε t Yt = a0 + Yt −1 + b0 X t − b0 X t −1 + ε t Yt − Yt −1 = a0 + b0 ( X t − X t −1 ) + ε t ∆Yt = a0 + b0 ∆X t + ε t Note that there is "drift" or trend if a0 ≠ 0. "with drift. then Yt = a0 + Yt1 +ε t and Yt − Yt1 = a0 + ε t ." because if a0 is positive.
yt − a1 yt = b0 xt + b1 xt yt (1 − a1 ) = (b0 + b1 ) xt yt = (b0 + b1 ) (b + b ) xt = kxt where k = 0 1 (1 − a1 ) (1 − a1 ) In long run equilibrium. xt = xt 1 . yt = yt 1 so (b + b ) (b0 + b1 ) xt −1 = − 0 1 xt −1 = − kxt −1 (1 − a1 ) (a1 − 1) Multiply both sides by (a1 − 1) to get yt −1 = −(a1 − 1) yt −1 = (b0 + b1 ) xt −1 Substituting for (b0 + b1 ) xt −1 in (2) yields ∆yt = (a1 − 1) yt −1 − (a1 − 1) yt −1 + b0 ∆xt + ε t Substituting for yt −1 (b + b ) ∆yt = (a1 − 1) yt −1 − (a1 − 1) 0 1 xt −1 + b0 ∆xt + ε t (a1 − 1) 157 .1) model in deviations. and ε t = 0. (1) yt = a1 yt −1 + b0 xt + b1 xt −1 + ε t Subtract yt −1 from both sides. yt − yt −1 = a1 yt −1 + b0 xt + b1 xt −1 + ε t − yt −1 + b0 xt −1 − b0 xt −1 Collect terms.Distributed lag model The finite distributed lag model is (a1 = 0) Yt = a0 + b0 X t + b1 X t −1 + ε t Partial adjustment model The partial adjustment model is (b1=0) Yt = a0 + a1Yt −1 + b0 X t + ε t Error correction model The error correction model is derived as follows. (2) ∆yt = (a1 − 1) yt −1 + b0 ∆xt + (b0 + b1 ) xt −1 + ε t Rewrite equation (1) as yt − a1 yt −1 = b0 xt + b1 xt −1 + ε t In long run equilibrium. yt − yt −1 = a1 yt −1 + b0 xt + b1 xt −1 + ε t − yt −1 Add and subtract b0 xt −1 . Start by writing the ADL(1. yt = yt 1 .
Collect terms in (a1 − 1) (b + b ) ∆yt = b0 ∆xt + (a1 − 1) yt −1 − 0 1 xt −1 + ε t (a1 − 1) Or. the coefficient and the error are called the error correction mechanism. CochraneOrcutt model Another commonly used time series model is the CochraneOrcutt or common factor model. The terms Yt − a1Yt −1 and ( X t − a1 X t −1 ) are called generalized first differences because they are differences. (3) ∆yt = b0 ∆xt + (a1 − 1) ( yt −1 − kxt −1 ) + ε t Sometimes written as. 158 . The coefficient 0 is the impact on y of a change in x. Together. but the coefficient on the lag polynomial is a1. (3a) ∆yt = b0 ∆xt + (a1 − 1) ( y − kx )t −1 + ε t The term (yt1kxt1) is the difference between the equilibrium value of y (kx) and the actual value of y in the previous period. Yt = a0 + a1Yt −1 + b0 X t + b1 X t −1 + ε t Yt = a0 + a1Yt −1 + b0 X t − b0 a1 X t −1 + ε t Yt − a1Yt −1 = a0 + b0 ( X t − a1 X t −1 ) + ε t (1 − a1 L )Yt = a0 + b0 (1 − a1 L) X t + ε t Y and X are said to have common factors because both lag polynomials have the same root. a1. The coefficient k is the long run response of y to a change in x. instead of 1 which would be the coefficient for a first difference. It imposes the restriction that b1 = −b0 a1 . The coefficient a1 − 1 is the speed of adjustment of y to errors in the form of deviations from the long run equilibrium.
it is frequently necessary in time series modeling to add lags. since Y is a random number. the lag of Y is also a random number. In the model Yt = α + β X t + γ Yt −1 + ut . σ 2 ) the explanatory variables. It turns out that the best we can get out of ordinary least squares with a lagged dependent variable is that OLS is consistent and asymptotically efficient. ut ∼ iid (0. cannot be assumed to be a set of constants. lagged) or “serially correlated. in large samples. However.. It is unfortunate that all we can hope for in most dynamic models is consistency. 159 . However adding a lagged dependent variable changes this dramatically.14 AUTOCORRELATION We know that the GaussMarkov assumptions require that Yt = α + β X t + ut where ut ∼ iid (0. which means that ˆ E ( a1 ) ≠ a1 and OLS is biased. consistent. are true. When this assumption. This actually refers to the sampling method and assumes a simple random sample where the probability of observing ut is independent of observing ut1. which now include Yt −1 . since most time series are quite small relative to cross sections. If the errors are not independent.e. As we have seen above. Adding a lag of X is not a problem. assume a very simple AR model (no x) expressed in deviations. and the other assumptions. and efficient. ordinary least squares applied to this model is unbiased. correlated with themselves. we expect that Cov ( yt −1ε t ) = 0 which implies that OLS is at least consistent. as the sample size goes to infinity.” The most common type of autocorrelation is socalled “first order” serial correlation. To see this. σ 2 ) . then they are said to be “autocorrelated” (i. yt = a1 yt −1 + ε t ˆ a1 = yt −1 yt yt −1 ( a1 yt −1 + ε t ) yt −1ε t = = a1 + 2 2 yt −1 yt −1 yt2−1 yt −1ε t Cov( yt −1ε t ) ˆ E ( a1 ) = a1 + E = a1 + 2 Var ( yt −1 ) yt −1 Since ε t = yt − a1 yt −1 it is likely that Cov( yt −1ε t ) ≠ 0 at least in small samples. fixed on repeated sampling because. The first i in iid is independent. It is no longer unbiased and efficient. We will just have to live with it.
Unfortunately. yt = a1 yt −1 + ut ut = ρ ut −1 + ε t The ordinary least squares estimator of a is. inconsistent. Autocorrelation is caused primarily by omitting variables with time components. This means that our hypothesis tests can be wildly wrong. It turns out that autocorrelation is so prevalent that we can expect some form of autocorrelation in any time series analysis. and inefficient. However. If these variables have time trends or cycles or other timedependent behavior. then the error term will be autocorrelated. The impact of these variables goes into the error term.” 7 160 .ut = ρ ut −1 + ε t where 1 ≤ ρ ≤ 1 is the autocorrelation coefficient and ε t iid (0. If there is no lagged dependent variable. ut = ρ1ut −1 + ρ 2 ut − 2 + + ρ q ut − q + ε t Autocorrelation is also known as moving average errors. σ 2 ) It is possible to have autocorrelation of any order. so we have to make sure we eliminate any autocorrelation. then autocorrelation causes OLS to be inefficient. and the tratios are overestimated. There is nothing good about OLS under these conditions. yt −1 yt yt −1 ( a1 yt −1 + ut ) yt −1ut = = a1 + 2 2 yt −1 yt −1 yt2−1 The expected value is ˆ a1 = yt −1ut Cov( yt −1ut ) ˆ E ( a1 ) = a1 + E = a1 + 2 Var ( yt −1 ) yt −1 The covariance between the explanatory variable and the error term is In the case of negative autocorrelation. We can prove the inconsistency of OLS in the presence of autocorrelation and a lagged dependent variable as follows. there is a “second order bias” in the sense that OLS standard errors are underestimated and the corresponding tratios are overestimated in the usual case of positive autocorrelation. the tratios will be “wrong. as we have seen above.7 If there is a lagged dependent variable. Effect of autocorrelation on OLS estimates The effect that autocorrelation has on OLS estimators depends on whether the model has a lagged dependent variable or not. then OLS is biased. Recall our simple AR(1) model above. Sometimes we simply don’t have all the variables we need and we have to leave out some variables that are unobtainable. However. the most general form would be q’th order autocorrelation. it is not clear whether the tratios will be overestimated or underestimated. most dynamic models use a lagged dependent variable.
OLS is inconsistent. Therefore. Since is constant. Note that. has the following properties. E (ut2 ) = E (ut2−1 ) = Var (ut ) and E (ut −1 yt − 2 ) = E (ut yt −1 ) = Cov( yt −1ut ) which implies E [ut −1 yt − 2 ] = a1 ρ E [ut −1 yt − 2 ] + ρ E ut2−1 E [ut −1 yt − 2 ] − a1 ρ E [ut −1 yt − 2 ] = ρ E ut2−1 E [ut −1 yt − 2 ] = Cov( yt −1ut ) = ρ E ut2−1 1 − a1 ρ ρVar (ut ) 1 − a1 ρ So. Now assume autocorrelation so that E (et et −1 ) ≠ 0 . σ 2 ) which implies that E (ut ) = 0. What happens to the expected value of the Durbin Watson statistic? As a preliminary. E (ut2 ) = σ 2 .. we can derive the expected value as follows. the explanatory variable is correlated with the error term and OLS is biased. under the null ut IN (0. Since OLS is inefficient and has overestimated tratios in the best case and is biased. et2 Under the null hypothesis of no autocorrelation. The empirical analog is that the residual. e. we should find out if we have autocorrelation or not. E (ut ut −1 ) = 0. Testing for autocorrelation The Durbin Watson test The most famous test for autocorrelation is the DurbinWatson statistic. It has the formula d= (et − et −1 )2 where et is the residual from the OLS regression. E (d ) = E (et − et −1 ) 2 E (et2 − 2et et −1 + et2−1 )2 ( E (et2 ) − 2 E (et et −1 ) + E (et2−1 )) = = E (et2 ) E (et2 ) E (et2 ) 2 (σ = t =1 T T +σ 2) = 2 σ t =1 Tσ 2 + Tσ 2 =2 Tσ 2 where T is the sample size (in a time series t=1. E (et ) = E (et −1 ) = E (et et −1 ) = 0 and E (et ) = E (et2−1 ) = σ 2 . it doesn’t go away as the sample size goes to infinity. inefficient.Cov ( yt −1ut ) = E ( yt −1ut ) = E [ ( a1 yt − 2 + ut −1 )( ρ ut −1 + ε t ) ] = E a1 ρ ut −1 yt − 2 + a1 yt − 2ε t + ρ ut2−1 + ut −1ε t = a1 ρ E [ut −1 yt − 2 ] + ρ E ut2−1 Since E ( yt − 2ε t ) = E (ut −1ε t ) = 0 because ε t is truly random. let’s review the concept of a correlation coefficient. Now. Therefore. 161 .. with overestimated tratios in the worse case. if ρ ≠ 0 .T). inconsistent.
suppose you get a DW statistic of 1. As we already know. when =0. you cannot reject the null hypothesis of no autocorrelation. dl=1. when the test is most needed.35 and you want to know if you have autocorrelation. despite its fame. cov( x. Be conservative and simply ignore the lower value. as goes to negative one. Clearly. That is. That is. For this reason. the DurbinWatson statistic goes to four. what do you conclude? Answer: test fails! Now what are you going to do? Actually.45. so that is positive.45. you will find two values. has a number of drawbacks. then the test statistic is 4DW. y = et −1 and σ x = σ y = σ 2 which means that the autocorrelation coefficient is ρ= cov(et et −1 ) σ 2 σ 2 = E (et et −1 ) σ2 and ρσ 2 = E (et et −1 ) The expected value of the Durbin Watson statistic is E (et − et −1 ) 2 E (et2 − 2et et −1 + et2−1 ) 2 ( E (et2 ) − 2 E (et et −1 ) + E (et2−1 )) E (d ) = = = E (et2 ) E (et2 ) E (et2 ) (σ = t =1 T 2 − 2 ρσ 2 + σ 2 ) σ t =1 T = 2 T σ 2 − 2T ρσ 2 + T σ 2 = 2 − 2ρ Tσ 2 Since is a correlation coefficient it has to less than or equal to one in absolute value. for negative autocorrelation. the DW test is biased toward two. the DurbinWatson statistic is equal to two.29 and du=1. it tells us we have no problem when we could have a very serious problem. For example. The LM test for autocorrelation 8 If the DW statistic is negative. this is not a serious problem. If your value is below 1. y ) = E ( xy ) = So. as approaches one. r= cov( x. If you look up the critical value corresponding to a sample size of 25 and one explanatory variable for the five percent significance level. Whenever the model contains a lagged dependent variable. it not only fails. On the other hand. The DurbinWatson test. reject the null hypothesis of no autocorrelation whenever the DW statistic is below dU. Since your value falls between these two.29 you have significant autocorrelation. according to DW. the DurbinWatson statistic approaches zero. y ) 2 σ x2 σ y 2 2 In this case. 162 .8 The second problem is fatal. The usual case in economics is positive autocorrelation. we will need another test for autocorrelation. If your value is above 1.r= 1 xy T Sx S y 1 xy T But. The first is that the test is not exact and there is an indeterminate range in the DW tables. x = et .
570667328 Residual  . we can estimate the error term with the residuals from the OLS regression of Y on X and lag it to get an estimate of ut −1 We can then run the test equation. Yt = a0 + b0 X t + ut ut = ρ ut −1 + ε t which implies that Yt = a0 + b0 X t + ρ ut −1 + ε t It would be nice to simply estimate this model.08262 lcrmaj  Coef.184 1 0. 26) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 28 83.663423 .000 .0000 0.59 0.000 9. regress lcrmaj prison Source  SS df MS +Model  .1647467 .0014 H0: no serial correlation 163 .1345066 . we run the regression and then save the residuals for further analysis.177496135 26 . dwstat DurbinWatson dstatistic( 2.7536 . Std. consider the following model. Unfortunately. because this. Err. The Fratio is distributed according to chisquare with one degree of freedom. For example. I have a data set on crime and prison population in Virginia from 1972 to 1999 (crimeva. 28) = .570667328 1 . 1956 to 2005 Now let’s do a simple regression of the log of major crime on prison.027709758 Number of obs F( 1. The first thing we have to do is tell Stata that this is a time series and the time variable is year.62 0. Interval] +prison  .7628 0. We can test null hypothesis with either the tratio on the lagged residual or. Yt = a0 + b0 X t + ρ et −1 + ε t where et −1 is the lagged residual. Both the DurbinWatson and BreuschGodfrey LM tests are automatic in Stata.0147116 9.6915226 Now let’s get the BreuschGodfrey test statistic.585411 9. we don’t observe the error term.Breusch and Godfrey independently developed an LM test for autocorrelation. .dta). get an estimate of and test the null hypothesis that =0 with a ttest. is a large sample test. .006826774 +Total  .14 0. tsset year time variable: year. like all LM tests. we could use the Fratio for the null hypothesis that the coefficient on the lagged residual is zero.1042665 _cons  9.748163462 27 . However. t P>t [95% Conf. It is ok since we don’t have any lagged dependent variables.0379523 254. so we can’t do it that way. . . Like all LM tests. bgodfrey BreuschGodfrey LM test for autocorrelation lags(p)  chi2 df Prob > chi2 +1  10.741435 First we want the DurbinWatson statistic.
It must be incredibly significant asymptotically. So we apparently have serious autocorrelation.0000 0. t P>t [95% Conf. To test for second order serial correlation. 24) = Prob > F = 15.6516013 . What is the corresponding Fratio? .55 0.64494128 2 .8675 0.8564 .e = 0 F( 1.0308821 313. predict e.) And the BG test is also highly significant. as expected. scalar prob=1chi2(1. .1631639 3.e Source  SS df MS +Model  .32247064 Residual  . Err. regress lcrmaj prison L.e ( 1) L.26 0. for example.9883552 _cons  9.001 .098530434 24 . lags(1 2) BreuschGodfrey LM test for autocorrelation lags(p)  chi2 df Prob > chi2 +1  10.689539 .1451148 .99 0.000 .000 9. the DW statistic is uncomfortably close to zero. Now that’s significant. 24) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 27 78.0118322 12.028595066 Number of obs F( 2.184 1 0. let’s do the LM test ourselves. test L. So we reject the null hypothesis of no autocorrelation. use the lags option of the bgodfrey command. We apparently have very serious autocorrelation.95 0.76 0. Std. What should we do about it? We discuss possible cures below.3148475 .753277  The ttest on the lagged e is highly significant.00006504 Wow.0014 164 .625801 9. bgodfrey.1695353 .743471714 26 . (The dl is 1. .1206944 e  L1  .004105435 +Total  .06407 lcrmaj  Coef.32 for k=1 and T=27 at the five percent level. . Testing for higher order autocorrelation The BreuschGodfrey test can be used to test for higher order autocorrelation. First we have to save the residuals.OK. Interval] +prison  .95) . scalar list prob prob = .0005 It is certainly significant with the small sample correction. resid (22 missing values generated) Now we add the lagged residuals to the regression. . 15. Just for fun.
ˆ (2) Estimate the auxiliary regression et = ρ et −1 + vt with OLS. bgodfrey. The classic textbook solution is to use CochraneOrcutt generalized first differences. Yt − ρYt −1 = a0 − ρ a0 + b0 X t − b0 ρ X t −1 + ut − ρ ut −1 Yt − ρYt −1 = a0 (1 − ρ ) + b0 ( X t − ρ X t −1 ) + ε t Since ut − ρ ut −1 = ε t The problem is that we don’ t know the value of . Cure for autocorrelation . . The CochraneOrcutt method Suppose we have the model above with first order autocorrelation. There are two approaches to fixing autocorrelation. and ρ jointly. also known as common factors.0024 4  15. ρ . Yt = a0 + b0 X t + ut ut = ρ ut −1 + ε t Assume for the moment that we knew the value of the autocorrelation coefficient. look’s bad.100 2 0. but we’ll stop here. We could go on adding lags to the test equation.100 2 0. Lag the first equation and multiply by .412 3 0. et. save the estimate of ρ . ρYt −1 = ρ a0 + b0 ρ X t −1 + ρ ut −1 Now subtract this equation from the first equation.2  14.090 4 0. b0 . 165 . In fact we are going to estimate a0 .184 1 0. lags(1 2 3 4) BreuschGodfrey LM test for autocorrelation lags(p)  chi2 df Prob > chi2 +1  10.0014 2  14. (1) Estimate the original regression Yt = a0 + b0 X t + ut with OLS and save the residuals. (3) Transform the equation using generalized first differences using the estimate of .0009 H0: no serial correlation Hmmm.0009 3  14. Cochrane and Orcutt suggest an iterative method. Let’s see if there is even higher order autocorrelation. These data apparently have severe high order serial correlation.0045 H0: no serial correlation It gets worse and worse.
0864092 112. Obviously.311269 The CochraneOrcutt approach has improved things.210446147 26 .6448 0. save the residuals.003945497 +Total  . Interval] +prison  .0000 0. namely the ADL model.68 0.736981 . and can be derived from a more general ADL model. The usual rule is to stop when successive estimates of are within a certain amount.2246394 . Std.06281 lcrmaj  Coef. We have seen above that it is a common factor model. which is a modification of the CochraneOrcutt method that preserves the first observation so that you don’t get a missing value because of lagging the residual. We can test these restrictions to see if the CO model is a legitimate reduction from the ADL.914944 +rho  . The CochraneOrcutt method can be implemented in Stata using the prais command The prais command uses the PraisWinston technique. .6465 0.6462 0. Start with the CO model. It is therefore true only if the restrictions of the ADL are true.5313 0. 25) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 27 28.691523 DurbinWatson statistic (transformed) 1.01.ˆ ˆ ˆ Yt − ρYt −1 = a0 (1 − ρ ) + b0 ( X t − ρ X t −1 ) + ε t (4) Estimate the generalized first difference regression with OLS.0304269 5.098637415 25 . corc Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration 0: 1: 2: 3: 4: 5: 6: 7: rho rho rho rho rho rho rho rho = = = = = = = = 0. we need a stopping rule to avoid an infinite loop.34 0.5125 .111808732 Residual  .6465 0. . Rewrite the equation as (A) Yt = a * + ρYt −1 + bX t − ρ bX t −1 + ε t Now for something completely different.000 9.6465167 DurbinWatson statistic (original) 0.008094083 Number of obs F( 1. but the DW statistic still indicates some autocorrelation in the residuals.111808732 1 .559018 9.0000 0.6353 0.iterated estimates Source  SS df MS +Model  .1619739 . Err.000 .6465 CochraneOrcutt AR(1) regression .0993085 _cons  9. prais lcrmaj prison.6465 0. t P>t [95% Conf. Note that the CochraneOrcutt regression is also known as feasible generalized least squares or FGLS. 166 . Yt − ρYt −1 = a * +b( X t − ρ X t −1 ) + ε t where a* = a0 (1 − ρ ) . (5) Go to (2).32 0. say.
79 0. scalar list L = .191 . then the CochraneOrcutt model is true. ln(ESS(A)/ESS(B))=0.0000 0.09863742 ESSB = . Interval] +prison  . regress lcrmaj prison L.028595066 Number of obs F( 3.743471714 26 . . ESS(B)/ESS(A)=1 and the log of this ratio.02934373 ESSA = . Otherwise. If this statistic is not significant.0588285 6.lcrmaj L.8502 . ESS(B). Then we run the ADL model and keep its residual sum of squares.6515369 .09853027 .9970184 prison  L1  . The CochraneOrcutt model (A) is true only if b1 = −b0 a1 or b1 + b0 a1 = 0 .437 .8675 0.18 0.1426671 .(B) Yt = a0 + a1Yt −1 + b0 X t − b1 X t −1 + ε t Model (A) is actually model (B) with b1 = −b0 a1 since a1 = ρ and b1 = ρ b0 .046 .This is a testable hypothesis. scalar ESSA=e(rss) .393491 1.0780698 lcrmaj  L1  . This implies that the ratio. The residual sum of squares is . This is the likelihood ratio.1457706 ..728154 Now do the likelihood ratio test.0883118 . The likelihood ratio is a function of the residual sum of squares.. Err. If the CochraneOrcutt model is true. It turns out that ESS ( B) 2 L = T ln ~ χq ESS ( A) where q=K(B)K(A). Std. we run the CochraneOrcutt model (using the .21498048 Residual  . you should use the ADL model.11 0. We have already estimated the CochranOrcutt model (model A). 23) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 27 50.098530274 23 . So. t P>t [95% Conf.1670076 3. K(B) is the number of parameters in model (A) and K(B) is the number of parameters in model (B).06545 lcrmaj  Coef. scalar ESSA=e(rss) Now let’s estimate the ADL model (model B).3192906 _cons  3.1116564 0. . scalar prob=1chi2(2.corc option because we don’t want to keep the first observation) and save the residual sum of squares. ESS(A).004283925 +Total  . scalar L=27*log(ESSB/ESSA) . Instead we use a likelihood ratio test.3060554 .001 .0986.369611 . we can’t use a simple Ftest.611994 2. .35 0.prison Source  SS df MS +Model  .64494144 3 .90 0. Because the restriction multiplies two parameters.029) 167 . but we can get Stata to remember this by using scalars.1082056 1. then ESS(A)=ESS(B).
71 0.0000 0.8819 0.3875 168 .7808986 .5273762 1. most modern time series analysts do not recommend CochraneOrcutt. adding two lags of crime to the static crime equation eliminates the autocorrelation. bgodfrey BreuschGodfrey LM test for autocorrelation lags(p)  chi2 df Prob > chi2 +1  0. It is much more satisfying to specify the problem allowing for lags. For these reasons.361557 L2  .9444664 .8671 . a restriction that may or may not be true. The problem is that it is a purely mechanical solution.05 0. The DurbinWatson statistic still showed some autocorrelation even after FGLS. Curing autocorrelation with lags Despite the fact that CochraneOrcutt appears to be justified in the above example. Err.027709758 Number of obs F( 3. specification test.293413 1.003683127 +Total  .1902221 2.04 0. momentum. 24) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 28 59. etc. the most common cause of autocorrelation is omitted time varying variables. .659768412 3 . Use the ten percent significance level when using this. I do not recommend using it. reaction times.0204905 lcrmaj  L1  . in the above example. especially lags of the dependent variable until the LM test indicates no significant autocorrelation.06069 lcrmaj  Coef.748163462 27 .009 1. For example. Std. scalar list prob prob = 1 Since ESSA and ESSB are virtually identical. I recommend adding lags. that are very likely to be causing dynamic behavior than using a simple statistical fix.0042997 _cons  4.219922804 Residual  .430934 .3882994 . It has nothing to do with economics. t P>t [95% Conf.747 1 0.82 0. almost certainly.2020885 4.lcrmaj LL. Interval] +prison  .155892 7. As we noted above. or any other.006 .1066493 . Instead they recommend adding variables and lags until the autocorrelation is eliminated. The reason.000 . the ADL model allows us to model the dynamic behavior directly.08839505 24 ..0208728 3.. Since using CochraneOrcutt involves restricting the ADL to have a common factor. We have seen above that the ADL model is a generalization of the CochraneOrcutt model.67 0. the test statistic is not significant and CochraneOrcutt is justified in this case. why not simply avoid the problem by estimating the ADL directly? Besides. is that the model omits too many variables.052 . regress lcrmaj prison L. So.lcrmaj Source  SS df MS +Model  . the CochraneOrcutt method did not solve the problem. all of which are likely to be timevarying. Besides.0635699 .520191 2.
1957 H0: no serial correlation. bgodfrey. because it ignores higher order autocorrelation. Typically. ˆ Var ( β ) = 1 ( ) 2 x2 t E x 2u 2 + x 2u 2 + + x 2 u 2 + x x u u + x x u u + x x u u + 1 1 2 2 T T 1 2 1 2 1 31 3 1 4 1 4 ( ) . Things are made more complicated if the observations on x are negatively correlated. then ignoring them will cause the standard error to be overestimated.220 2 0. as the sample size goes to infinity.747 1 0. the standard errors are underestimated and the corresponding tratios are overestimated. regress lcrmaj prison Heteroskedastic and autocorrelation consistent standard errors It is possible to generalize the concept of robust standard errors to include a correction for autocorrelation.284 3 0. If some are positive and some negative.5156 4  6. Note that the formula can be rewritten as follows. if we have autocorrelation. everything is positively correlated. This is the famous second order bias of the standard errors under autocorrelation. The OLS estimator ignores these i j terms and is unbiased. then E (ui u j ) ≠ 0 and we cannot ignore these terms. but at a much slower rate. then ols estimates of the standard errors in no longer unbiased. the amount of error goes to infinity and the estimate is not consistent. but that doesn’t do it either. making variables appear to be related when they are in fact independent. Since each tem contains error. We could use a small number of terms. then ignoring them will cause the standard error to be underestimated.046 4 0. The result is that the estimate is consistent 169 . lags (1 2 3 4) BreuschGodfrey LM test for autocorrelation lags(p)  chi2 df Prob > chi2 +1  0. If the covariances are negative while the x’s are positively correlated. The one thing we do know is that if we ignore the covariances when they are nonzero. Newey and West suggested a formula that increased the number of terms as the sample size increased.3296 3  2. if we use all these terms. x 2Var (u ) + x 2Var (u ) + + x 2Var (u )u 2 1 2 T 1 2 n T 2 2 + x1x2Cov(u1u2 ) + x1x3Cov(u1u3 ) + x1x4Cov(u1u4 ) + x t If the covariances are positive (positive autocorrelation) and the x’s are positively correlated. ˆ Var ( β ) = 1 ( ) So. the number of terms goes to infinity. which is the usual case. the direction of bias of the standard error will be unknown. E (u u ) = 0 for i ≠ j (independent).3875 2  2. In the chapter on heteroskedasticity we derived the formula for heteroskedastic consistent standard errors.H0: no serial correlation . The problem is. the standard errors are not consistent because. In the absence of autocorrelation.
five lags for quarterly data. newey lcrmaj prison L. 26) Prob > F = = = 28 60. even after autocorrelation has presumably been eliminated.003 .0990496 _cons  9. newey lcrmaj prison. Once you have no autocorrelation. Use NeweyWest heteroskedastic and autocorrelation robust tratios. . . 24) Prob > F = = = 28 66. Interval] +prison  .49 0. you can begin reducing the model to a parsimonious.0193394 3.0000335 . but always with fewer than T terms.000 . When you get down to a parsimonious model with mostly significant variables. t P>t [95% Conf.9678325 .15 0. Err. one of the leading time series econometricians.628212 4. lag(2) Regression with NeweyWest standard errors maximum lag: 2 Number of obs F( 1. To get heteroskedasticity and autocorrelation (HAC aka NeweyWest) consistent standard errors and tratios.80 0.0000548 . t P>t [95% Conf.0000  NeweyWest lcrmaj  Coef. The formula relies on something known as the Bartlett kernel. Start with a very general ADL model with lot’s of lags.25 0.0608961 .282791  Summary David Hendry.80 0.5915948 1.0209816 lcrmaj  L1  . then keep the variable even if it is not significant.0000103 3.663423 . We can also use NeweyWest HAC standard errors in our ADL equation. use the newey command.34407 crmaj  L2  .1822947 5.31 0.8272894 1. We can make this formula operational by using the residuals from the OLS regression to estimate u. suggests the following general to specific methodogy. interpretable model. Delete insignificant variables.crmaj.0000  NeweyWest lcrmaj  Coef.1008106 . add more lags. lag(2) Regression with NeweyWest standard errors maximum lag: 2 Number of obs F( 3.lcrmaj LL. If it does.1345066 . Interval] +prison  .0000123 _cons  . Std.000 9.1699636 .0172496 7. Std.77 0. Make sure that dropping variables does not cause autocorrelation. but that is beyond the scope of this manual. Make sure that there is no autocorrelation in the final version of the model.626 2.81 0. Justify your deletions with ttests and Ftests.67426 0.765911  Note that you must specify the number of lags in the newey command.560936 9.004 . Check the BreuschGodfrey test to make sure there is no autocorrelation. Err. make sure that all of the omitted variables are not significant with an overall Ftest.000 . A good rule of thumb is at least two lags for annual data. 170 .because the number of terms goes to infinity.0498593 193. If there is. and 13 lags for monthly data.
the mean changes over time. It is called deterministic because it can be forecast with complete accuracy. σ 2 ) . 171 . because the truth at time t is different from the truth at time t+p. its sample mean.. or q. and its sample autocovariances. Finally. yt + p + q ) .15 NONSTATIONARITY. Therefore. the covariances of the series must not be dependent on time. y1 .. yt −1 ) . we have maintained the assumption that the mean of the distribution is constant (and equal to zero). yt + p + q ) for any t. If a series is stationary. On the other hand. yt + p ) = Cov ( yt + q . we can infer its properties from examination of its histogram. AND RANDOM WALKS The GaussMarkov assumptions require that the error term is distributed as iid (0. However.. Now consider a simple random walk.. UNIT ROOTS. that is Cov ( yt .. Consider a simple deterministic time trend. its sample variance. yt + p ) = Pr( yt + q . If the distribution is constant over time.. If a time series is nonstationary then we cannot consider a sample of observations over time as representative of the true distribution. A stationary process is one where this distribution doesn’t change over time: Pr( yt . then we can’t do any of these things. a linear time trend is not stationary. p. then its mean and variance are constant over time.. yt +1 ...... Pr( yt  y0 . We have seen how we can avoid problems associated with nonindependence (autocorrelation) and nonidentical (heteroskedasticity). A time series is stationary if its probability distribution does not change over time. In this section we confront the possibility that the error mean is not constant. The probability of observing any given y is conditional on all past values. if a series is not stationary. yt = α + δ t Is y stationary? The mean of y is E ( yt ) = µt = α + δ t Clearly.
We use the replace command to replace the zeroes. Obviously. extreme autocorrelation. A random walk has potentially infinite variance. A random walk is also known as a stochastic trend...” We can replicate their findings in Stata. then the usual t. The function 5*invnorm(uniform()) produces a random number from a normal distribution with a mean of zero and a standard deviation of 5. Also..dta data set we first generate a random walk called y1. Therefore. = σ t2 which implies that E (ε t ) 2 = Var (ε t ) = tσ 2 . ε t ~ iid (0. gen y1=0 This creates a series of zero from 1956 to 2005. it has no stable distribution.. Just because a random walk is increasing for several time periods does not mean that it will necessarily continue to increase. In the crimeva. Also unlike deterministic trends. The reason is that if the regressor is nonstationary. The problem is similar to autocorrelation. just because it is falling does not mean we can expect it to continue to fall. note that the variance of y goes to infinity as t goes to infinity.. Random walks and ordinary least squares If we estimate a regression where the independent variable is a random walk.. because the autocorrelation coefficient is equal to one. Granger and Newbold published an article in 1974 where they did some Monte Carlo exercises. There is no distribution from which the statistics can be derived. σ 2 ) For the sake of argument. Similarly.yt = yt −1 + ε t . Unlike the deterministic trend above. They called this phenomenon “spurious regression. Now create the random walk by making y1 equal to itself lagged plus a random error term.. . + E (ε t ) 2 2 because of the independence assumption. + ε t ) 2 = E (ε 1 ) 2 + E (ε 2 ) 2 + . σ 12 = σ 2 = . the variance of yt is E(y t ) 2 = E (ε1 + ε 2 + . y1 = 0 + ε1 y2 = 0 + ε 1 + ε 2 y3 = 0 + ε1 + ε 2 + ε 3 so that y t = ε1 + ε 2 + . let’s assume that y=0 when t=0. stochastic trends can exhibit booms and busts. because of the identical assumption. Unit root and stochastic trend are used interchangeably.5. As we saw in the section on linear dynamic models.and Ftests are invalid. Start it out at y1=0 in the first year. We have to start in 1957 so we have a previous value (zero) to start from.. 172 . only worse. a random walk can be written as yt − yt −1 = ∆yt = (1 − L) yt = ε t The lag polynomial is (1L). so the standard errors and tratios are not derived from a stable distribution. As a result most time series analysts mean stochastic trend when they say trend. Also. so the root of the lag polynomial is one and y is said to have a unit root. Many economic time series appear to behave like stochastic trends. Even though the true relationship between the two variables was zero. Granger and Newbold got tratios around 4. the variance of y depends on time and y cannot be stationary. They generated two random walks completely independently of one another and then regressed one on the other. 1956. we cannot forecast future values of y with certainty. + ε t Thus.
Suppose we start with a simple AR(1) model.5253 12. ε t ~ iid (0.7110805 _cons  14.23 0. gen y2=0 . a1 < 1 . The idea is very simple.87052 8. Testing for unit roots The most widely used test for a unit root is the Augmented DickeyFuller (ADF) test. to avoid spurious regressions.000 1. Err..23038 1 9328. Std.y2+5*invnorm(uniform()) if year > 1956 (49 real changes made) Obviously. known as a DF equation. So if we regressed y1 on y2 there should be no significant relationship.997 y1  Coef.43 0. yt − yt −1 = a0 + a1 yt −1 − yt −1 + ε t or ∆yt = a0 + (a1 − 1) yt −1 + ε t This is the (nonaugmented) DickeyFuller test equation. y1 and y2 are independent of each other. Since most time series are autocorrelated. in which case a1 − 1 < 0 . we should do something to correct for autocorrelation. σ 2 ) 173 .835011 Number of obs F( 1. if y has a unit root then a1 = 1 and a1 − 1 = 0 . replace y1=L. Consider the following AR(1) model with first order autocorrelation.000 20. . yt = a0 + a1 yt −1 + ε t Subtract yt −1 from both sides.23038 Residual  8107.009038 4.9748288 . .82044 3.93 0. It looks like y1 is negatively (and highly significantly) associated with y2 even though they are completely independent. Therefore. and autocorrelation creates overestimated tratios.0000 0. So we can test for a unit root using a ttest on the lagged y. yt = a1 yt −1 + ut ut = ρ ut −1 + ε t .770368 This is a spurious regression. Note that if the AR model is stationary. replace y2=L.9156 49 355. t P>t [95% Conf. On the other hand. 48) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 50 55. we need to know if our variables have unit roots.238577 .y1+5*invnorm(uniform()) if year > 1956 (49 real changes made) Repeat for y2. Interval] +y2  .910108 +Total  17435. regress y1 y2 Source  SS df MS +Model  9328.68518 48 168.1311767 7.5350 0.
The tratio is computed as usual. the random walks we created above in the crimeva. + c p ∆yt − p + ε t I keep saying that we test the hypothesis that (a1 − 1) = 0 with the usual ttest. We can test this hypothesis with a ttest. . Since we know there is no autocorrelation.But the error term ut is also ut = yt − a1 yt −1 ut −1 = yt −1 − a1 yt − 2 ρ ut −1 = ρ yt −1 − ρ a1 yt − 2 Therefore.4001 Note that the critical values are much higher (in absolute value) than the usual t distribution. However. We cannot reject the null hypothesis of a unit root.601 * MacKinnon approximate pvalue for Z(t) = 0. For reasons that are explained in the last section of this chapter. the tratio in the DickeyFuller test equation has a nonstandard distribution. and other researchers. We know from our analysis of autocorrelation that this is a cure for autocorrelation. If there is higher order autocorrelation.761 3. the estimated parameter divided by its standard error. The DF critical values are tabulated in many econometrics books and are automatically available in Stata. as expected.933 2. As an example. All we are really doing is adding lags of the dependent variable to the test equation. its distribution is anything but standard. have tabulated critical values of the DF tratio. Dickey and Fuller. generating an iid error term. The ADF in general form is ∆yt = a0 + ( a1 − 1) yt −1 + c1∆yt −1 + c2 ∆yt − 2 + .587 2.Interpolated DickeyFuller Test 1% Critical 5% Critical 10% Critical Statistic Value Value Value Z(t) 1. lags(0) DickeyFuller test for unit root Number of obs = 49 .. The lagged differences are said to “augment” the DickeyFuller test. Note that the lagged difference has eliminated the autocorrelation. yt = a1 yt −1 + ut yt = a1 yt −1 + ρ yt −1 − ρ a1 yt − 2 + ε t Subtract yt −1 from both sides to get yt − yt −1 = a1 yt −1 − yt −1 + ρ yt −1 − ρ a1 yt − 2 + ε t Add and subtract ρ a1 yt −1 ∆yt = a1 yt −1 − yt −1 + ρ a1 yt −1 − ρ a1 yt −1 + ρ yt −1 − ρ a1 yt − 2 + ε t Collect terms ∆yt = (a1 − 1 + ρ − ρ a1 ) yt −1 + ρ a1 ( yt −1 − yt − 2 ) + ε t ∆yt = (a1 − 1)(1 − ρ ) yt −1 + ρ a1∆yt −1 + ε t ∆yt = γ yt −1 + δ∆yt −1 + ε t If a1 = 1 (unit root) then =0. We should always have an intercept in any real world 174 . DickeyFuller unit root tests can be done with the dfuller command. These critical values are derived from Monte Carlo exercises. dfuller y1. while the tratio is standard. The ADF test equation is estimated using OLS. simply add more lagged differences.dta data set. we can do our own mini Monte Carlo exercise using y1 and y2. we will not add any lags..
. The first thing we do is graph it.dta data set) for stationarity. If it has a distinct trend. then we need to be able to test for a unit root against the alternative that y is a stationary deterministic trend variable. They are much higher than the corresponding criticial values for the standard tdistribution. To do this we simply add a deterministic trend term to the ADF test equation. have a distinct upward trend. and many economic variables. as expected. noconstant lags(0) DickeyFuller test for unit root Number of obs = 49 .610 The tratio is approximately zero.143 2. so let’s rerun the DF test with no constant.. This variable has a distinct trend. Fortunately. like GDP. it is a good idea to graph the variable and look at it. . . + c p ∆yt − p + ε t The tratio on (a1 − 1) is again the test statistic.application. So far we have tested for a unit root where the alternative hypothesis is that the variable is stationary around a constant value. Since it makes a difference whether the variable we are testing for a unit root has a trend or not. For example. If a variable has a distinct trend. ∆yt = a0 + ( a1 − 1) yt −1 + δ t + c1 ∆yt −1 + c2 ∆yt − 2 + . include a trend in the test equation. Is it a deterministic trend or a stationary trend? Let’s test for a unit root using an ADF test equation with a deterministic trend. dfuller y1. Note the critical values. dfuller prison. We choose two lags to start with and add the regress option to see the test equation. However. lags(4) regress trend Augmented DickeyFuller test for unit root Number of obs = 23 175 . Stata knows all this and adjusts accordingly. the critical values are different for test equations with deterministic trends.950 1.Interpolated DickeyFuller Test 1% Critical 5% Critical 10% Critical Statistic Value Value Value Z(t) 0.622 1. we might want to test the prison population (in the crimeva. but we know that we did not use an intercept when we constructed y1.
There are three approaches to this problem.8847383 . .429566 1.9004267 .3033914 .600 3.0723614 . which means that our OLS estimates are biased and inconsistent. The first is to use ttests and Ftests on the lags to decide whether to drop them or not.3426802 1. Interval] +prison  L1  . Which is worse? Monte Carlo studies have shown that including too few lags is much worse.prison ( 1) ( 2) ( 3) L4D.318 .222543 .3584488 0. The second is to use model selection criteria. dfuller prison.167 1.1248609 . 16) = Prob > F = 0.3201351 .2941294 1.12 0.1454225 .0056539 .203 .prison L3D.35 0. etc. Err. the last three lags are indeed not significant and we settle on one lag as the correct model. Look at the tratios on the lags and drop the insignificant lags (use a .001 .0214251 .6350165 _trend  .0867956 LD  .2303562 L3D  .0641524 .113 .4960933 .732 . Too many lags means that our estimates are inefficient with underestimated tratios.prison  Coef.4835 So. with overestimated tratios. . if we get different answers.3776407 .048504 _cons  . It is of some importance to get the number of lags right because too few lags causes autocorrelation..Interpolated DickeyFuller Test 1% Critical 5% Critical 10% Critical Statistic Value Value Value Z(t) 1.86 0. lags(1) regress trend Augmented DickeyFuller test for unit root Number of obs = 26 Test .278 .15 significance level to be sure not to drop too many lags).8809 D. Let’s test this hypothesis with an Ftest.03 0.240 MacKinnon approximate pvalue for Z(t) = 0. It appears that the last three lags are not significant. test L4D. Std. Choosing the number of lags with F tests I recommend using Ftests to choose the lag length.328 4. The third approach is to do it all possible ways and see if we get the same answer.1095418 1. so use the standard critical values provided by Stata.371288 L2D  . Since these data are annual.0127737 1. we start with two.prison = 0 F( 3.prison L2D.Interpolated DickeyFuller 1% Critical 5% Critical 10% Critical 176 .prison = 0 L3D.0643962 1.9269179 L4D  .2221141 4.05 0.33 0.45 0. five for quarterly data. t P>t [95% Conf.68 0.prison = 0 L2D. we don’t know which to believe.and Fratios on the lags are standard.2088751  How many lags should we include? The rule of thumb is at least two for annual data. The problem with the third approach is that. The t.380 3.
028 .030 . Here is the graph of the log of major crime in Virginia.0493077 2. one down).3887225 .288 3. lags(4) regress trend Augmented DickeyFuller test for unit root Number of obs = 35 . t P>t [95% Conf.0122918 .04 0.0091946 2.1608155 .1746865 0.3066209 . or no trend? Let’s start with a trend and four lags. .9550795 _trend  .83 0.078 .15634 4. t P>t [95% Conf.861 .Interpolated DickeyFuller Test 1% Critical 5% Critical 10% Critical Statistic Value Value Value Z(t) 1.32 0.352 4.30 0. Err.prison  Coef.596 3. Prison population in Virginia appears to be a random walk.238 MacKinnon approximate pvalue for Z(t) = 0.0190106 LD  . Does it have a trend.2168075  Note that the lag length apparently doesn’t matter because the DF test statistic is not significant in either model.371 3. Interval] +lcrmaj  L1  .35 0.1145497 .3269358 177 .1243068 .560 3. Std.832 4.0683769 2.4055 D.0401913 _cons  .6308502 .2632618 . two trends (one up.216 MacKinnon approximate pvalue for Z(t) = 0.0678357 1.0211228 .3026204 . In some cases it is not clear whether to use a trend or not.18 0.Statistic Value Value Value Z(t) 2.0308934 .031 .0146483 LD  . Std.0020544 .6889 D.lcrmaj  Coef. Err.001 . dfuller lcrmaj. Interval] +prison  L1  .
50 0.0010268 3.0010138 _cons  .1747988 .206 MacKinnon approximate pvalue for Z(t) = 0.0288636 2. test L4D.lcrmaj = 0 L2D.5742004 .171358 1. R2 = 1− Var (e) N −1 = 1 − (1 − R 2 ) Var ( y ) T −k 178 . Interval] +lcrmaj  L1  .005 .lcrmaj = 0 L3D.0472515 1. the number of lags does not make any real difference. Recall that the Rsquare measures the variation explained by the regression to the total variation of the dependent variable. Err. Let’s test for the significance of the lags.0018615 2. The ADF test statistics are all insignificant.314 .0337594 _trend  .518873  Again.lcrmaj ( ( ( ( 1) 2) 3) 4) L4D.03 0. Digression: model selection criteria One way to think of the problem is to hypothesize that the correct model should have the highest Rsquare and the lowest error variance compared to the alternative models. so we drop them all and run the test again.lcrmaj  Coef.72 0.251 3.2293501 1.173253 0.5879 None of the lags are significant.0359288 . so we retain it.197 .02 0.3908215 .89 0.1278214 L3D  .594441  The trend is significant.lcrmaj LD.055 .203 .lcrmaj = 0 LD. t P>t [95% Conf.1579016 . One way around this problem is to correct the Rsquare for the number of explanatory variables used in the model. dfuller lcrmaj.0053823 . indicating that crime in Virginia appears to be a random walk.0015691 _cons  1. 28) = Prob > F = 0. lags(0) regress trend DickeyFuller test for unit root Number of obs = 39 .1743417 L4D  .lcrmaj = 0 F( 4.282789 .837 .143 .31 0.8845 D. .0620711 .1704449 1.L2D  .314 4.lcrmaj L2D. The adjusted Rsquare also known as the R 2 (Rbarsquare) is the ratio of the variance (instead of variation) explained by the regression to the total variance.6447613 .4310014 1.21 0.640328 2.Interpolated DickeyFuller Test 1% Critical 5% Critical 10% Critical Statistic Value Value Value Z(t) 1.544 3.007 . Std. .lcrmaj L3D.3189639 _trend  .0030963 .0091955 .0051789 . The problem with this approach is that the Rsquare can be maximized by simply adding more independent variables.00 0.30 0.5239392 .2231895 .
you can install it by doing a keyword search for fitstat and clicking on “click here to install. Consequently. The formula is 2 ei2 2k AIC = log + T T The goal is to choose the number of explanatory variables (lags in the context of a unit root test) that minimizes the AIC.656 0.” Fitstat generates a variety of goodness of fit tests. invoke fitstat after the dfuller command. lags(2) trend Augmented DickeyFuller test for unit root Number of obs = 25 . If your version does not have fitstat.034 LogLik Full Model: LR(4): Prob > LR: Adjusted R2: AIC*n: BIC': 27. The Akaike information criterion (AIC) has been proposed as an improvement over the R on the theory that the R 2 does not penalize heavily enough for adding explanatory variables.001 0. however they have been shown to be useful in the determination of the lag length in unit root models. One rule might be to choose the number of lags such that R 2 is a maximum. The minimum (biggest negative) is the one to go with. dfuller prison. Then run the test several times with different lags and check the information criterion. Again.786 119. Note that these information criteria are not formal statistical tests.528 4.486 0. I usually use the BIC. including the BIC. So. ei2 log(T ) BIC = log +k T T The BIC penalizes additional explanatory variables even more heavily than the AIC.8182 .610 179 .656 5.where T is the sample size and k is the number of independent variables.380 3.Interpolated DickeyFuller Test 1% Critical 5% Critical 10% Critical Statistic Value Value Value Z(t) 1.600 3. as we add independent variables.085 54.523 1.240 * MacKinnon approximate pvalue for Z(t) = 0.prison LogLik Intercept Only: D(20): R2: AIC: BIC: 18.328 18. Choosing lags using model selection criteria Let’s adopt the information criterion approach. the goal is to minimize the BIC. R2 increases but so does Nk. fitstat Measures of Fit for regress of D. but usually they both give the same answer. If your version of Stata has the fitstat command. . R 2 can go up or down. The Bayesian Information Criterion (also known as the Schwartz Criterion).427 44.
592 3. fitstat Measures of Fit for regress of D.108 0.235 * MacKinnon approximate pvalue for Z(t) = 0.7726 .238 * MacKinnon approximate pvalue for Z(t) = 0. In this case it doesn’t make any difference.308 0.152 1.454 0.Interpolated DickeyFuller Test 1% Critical 5% Critical 10% Critical Statistic Value Value Value Z(t) 1.Interpolated DickeyFuller Test 1% Critical 5% Critical 10% Critical Statistic Value Value Value Z(t) 2. dfuller prison.654 4.814 126.931 .427 43.425 47.408 LogLik Full Model: LR(2): Prob > LR: Adjusted R2: AIC*n: BIC': 21.706 0.352 4.382 122.138 So we go with one lag.580 17.160 0. lags(1) trend Augmented DickeyFuller