Econometrics
with Stata
Carl Moody
Economics Department
College of William and Mary
2009
Table of Contents
1 AN OVERVIEW OF STATA ......................................................................................... 5
Transforming variables ................................................................................................... 7
Continuing the example .................................................................................................. 9
Reading Stata Output ...................................................................................................... 9
2 STATA LANGUAGE ................................................................................................... 12
Basic Rules of Stata ...................................................................................................... 12
Stata functions ............................................................................................................... 14
Missing values .......................................................................................................... 16
Time series operators. ............................................................................................... 16
Setting panel data. ..................................................................................................... 17
Variable labels. ......................................................................................................... 18
System variables ....................................................................................................... 18
3 DATA MANAGEMENT............................................................................................... 19
Getting your data into Stata .......................................................................................... 19
Using the data editor ................................................................................................. 19
Importing data from Excel ........................................................................................ 19
Importing data from comma separated values (.csv) files ........................................ 20
Reading Stata files: the use command ...................................................................... 22
Saving a Stata data set: the save command ................................................................... 22
Combining Stata data sets: the Append and Merge commands .................................... 22
Looking at your data: list, describe, summarize, and tabulate commands ................... 24
Culling your data: the keep and drop commands .......................................................... 27
Transforming your data: the generate and replace commands ..................................... 27
4 GRAPHS ........................................................................................................................ 29
5 USING DO-FILES ......................................................................................... 34
6 USING STATA HELP .................................................................................................. 37
7 REGRESSION ............................................................................................................... 41
Linear regression ........................................................................................................... 43
Correlation and regression ............................................................................................ 46
How well does the line fit the data? .......................................................................... 48
Why is it called regression? .......................................................................................... 49
The regression fallacy ................................................................................................... 50
Horace Secrist ........................................................................................................... 53
Some tools of the trade ................................................................................................. 54
Summation and deviations .................................................................................... 54
Expected value .......................................................................................................... 55
Expected values, means and variance ....................................................................... 56
8 THEORY OF LEAST SQUARES ................................................................................ 57
Method of least squares ................................................................................................ 57
Properties of estimators................................................................................................. 58
Small sample properties ............................................................................................ 58
Bias ....................................................................................................................... 58
Efficiency .............................................................................................................. 58
Mean square error ................................................................................................. 59
Large sample properties ............................................................................................ 59
Consistency ........................................................................................................... 59
Mean of the sampling distribution of β̂ ................................................................... 63
Variance of β̂ ............................................................................................................ 63
Consistency of OLS .................................................................................................. 64
Proof of the Gauss-Markov Theorem ........................................................................... 65
Inference and hypothesis testing ................................................................................... 66
Normal, Student’s t, Fisher’s F, and Chi-square ....................................................... 66
Normal distribution ................................................................................................... 67
Chi-square distribution.............................................................................. 68
F-distribution............................................................................................. 70
t-distribution .............................................................................................. 71
Asymptotic properties ................................................................................................... 73
Testing hypotheses concerning β .................................................................................. 73
Degrees of Freedom ...................................................................................................... 74
Estimating the variance of the error term ..................................................................... 75
Chebyshev’s Inequality ................................................................................................. 78
Law of Large Numbers ................................................................................................. 79
Central Limit Theorem ................................................................................................. 80
Method of maximum likelihood ................................................................................... 84
Likelihood ratio test .................................................................................................. 86
Multiple regression and instrumental variables ............................................................ 87
Interpreting the multiple regression coefficient. ....................................................... 90
Multiple regression and omitted variable bias .............................................................. 92
The omitted variable theorem ....................................................................................... 93
Target and control variables: how many regressors? .................................................... 96
Proxy variables.............................................................................................................. 97
Dummy variables .......................................................................................................... 98
Useful tests .................................................................................................................. 102
F-test ....................................................................................................... 102
Chow test ................................................................................................................ 102
Granger causality test ............................................................................................. 105
J-test for non-nested hypotheses ............................................................................. 107
LM test .................................................................................................................... 108
9 REGRESSION DIAGNOSTICS ................................................................................. 111
Influential observations ............................................................................................... 111
DF-betas ................................................................................................... 113
Multicollinearity ......................................................................................................... 114
Variance inflation factors ........................................................................................ 114
10 HETEROSKEDASTICITY ....................................................................................... 117
Testing for heteroskedasticity ..................................................................................... 117
Breusch-Pagan test .................................................................................................. 117
White test ................................................................................................................ 120
Weighted least squares ................................................................................................ 120
Robust standard errors and t-ratios ............................................................................. 123
11 ERRORS IN VARIABLES ....................................................................................... 127
Cure for errors in variables ......................................................................................... 128
Two stage least squares ............................................................................................... 129
Hausman-Wu test ........................................................................................................ 129
12 SIMULTANEOUS EQUATIONS............................................................................. 132
Example: supply and demand ..................................................................................... 133
Indirect least squares ................................................................................................... 135
The identification problem .......................................................................................... 136
Illustrative example ..................................................................................................... 137
Diagnostic tests ........................................................................................................... 139
Tests for overidentifying restrictions ..................................................................... 140
Test for weak instruments ....................................................................................... 143
Hausman-Wu test .................................................................................................... 143
Seemingly unrelated regressions................................................................................. 146
Three stage least squares ............................................................................................. 146
Types of equation systems .......................................................................................... 148
Strategies for dealing with simultaneous equations .................................................... 149
Another example ......................................................................................................... 150
Summary ..................................................................................................................... 152
13 TIME SERIES MODELS .......................................................................................... 154
Linear dynamic models ............................................................................................... 154
ADL model ............................................................................................................. 154
Lag operator ............................................................................................................ 155
Static model ............................................................................................................ 156
AR model ................................................................................................................ 156
Random walk model ............................................................................................... 156
First difference model ............................................................................................. 156
Distributed lag model .............................................................................................. 157
Partial adjustment model......................................................................................... 157
Error correction model ............................................................................................ 157
Cochrane-Orcutt model ........................................................................................... 158
14 AUTOCORRELATION ............................................................................................ 159
Effect of autocorrelation on OLS estimates ................................................................ 160
Testing for autocorrelation .......................................................................................... 161
The Durbin-Watson test .......................................................................................... 161
The LM test for autocorrelation .............................................................................. 162
Testing for higher order autocorrelation ................................................................. 164
Cure for autocorrelation .............................................................................................. 165
The Cochrane-Orcutt method ................................................................................. 165
Curing autocorrelation with lags ............................................................................. 168
Heteroskedastic and autocorrelation consistent standard errors ................................. 169
Summary ..................................................................................................................... 170
15 NONSTATIONARITY, UNIT ROOTS, AND RANDOM WALKS ....................... 171
Random walks and ordinary least squares .................................................................. 172
Testing for unit roots ................................................................................................... 173
Choosing the number of lags with F tests ............................................................... 176
Digression: model selection criteria........................................................................ 178
Choosing lags using model selection criteria.......................................................... 179
DF-GLS test ................................................................................................................ 181
Trend stationarity vs. difference stationarity .............................................................. 182
Why unit root tests have nonstandard distributions .................................................... 183
16 ANALYSIS OF NONSTATIONARY DATA .......................................................... 185
Cointegration............................................................................................................... 188
Dynamic ordinary least squares .................................................................................. 191
Error correction model ................................................................................................ 191
17 PANEL DATA MODELS ......................................................................................... 194
The Fixed Effects Model ............................................................................................ 194
Time series issues ....................................................................................................... 202
Linear trends ........................................................................................................... 202
Unit roots and panel data ........................................................................................ 206
Clustering .................................................................................................................... 212
Other Panel Data Models ............................................................................................ 214
Digression: the between estimator .......................................................................... 214
The Random Effects Model .................................................................................... 214
Choosing between the Random Effects Model and the Fixed Effects Model. ....... 215
Hausman-Wu test again .......................................................................................... 216
The Random Coefficients Model. ........................................................................... 216
Index ............................................................................................................................... 218
1 AN OVERVIEW OF STATA
Stata is a computer program that allows the user to perform a wide variety of statistical analyses. In this
chapter we will take a short tour of Stata to get an appreciation of what it can do.
Starting Stata
Stata is available on the server. When invoked, the screen should look something like this. (This is the current
version on my computer. The version on the server may be different.)
There are many ways to do things in Stata. The simplest way is to enter commands interactively, allowing
Stata to execute each command immediately. The commands are entered in the “Stata Command” window
(along the bottom in this view). Results are shown on the right. After the command has been entered it appears
in the “Review” window in the upper left. If you want to re-enter the command you can double click on it. A
single click moves it to the command line, where you can edit it before submitting. You can also use the
buttons on the toolbar at the top. I recommend using “do” files consisting of a series of Stata commands for
complex jobs (see Chapter 5 below). However, the way to learn a statistical language is to do it. Different
people will find different ways of doing the same task.
An econometric project consists of several steps. The first is to choose a topic. Then you do library
research to find out what research has already been done in the area. Next, you collect data. Once you have the
data, you are ready to do the econometric analysis. The econometric part consists of four steps, which may be
repeated several times. (1) Get the data into Stata. (2) Look at the data by printing it out, graphing it, and
summarizing it. (3) Transform the data. For example, you might want to divide by population to get per capita
values, or divide by a price index to get real dollars, or take logs. (4) Analyze the transformed data using
regression or some other procedure. The final step in the project is to write up the results. In this chapter we
will do a mini-analysis illustrating the four econometric steps.
Your data set should be written as a matrix (a rectangular array) of numbers, with columns
corresponding to variables and rows corresponding to observations. In the example below, we have 10
observations on 4 variables. The data matrix will therefore have four columns and 10 rows. Similarly, you
could have observations on consumption, income and wealth for the United States annually for the years
1948-1998. The data matrix will consist of three columns corresponding to the variables consumption,
income, and wealth, while there will be 51 rows (observations) corresponding to the years 1948 to 1998.
Consider the following data set. It corresponds to some data collected at a local recreation
association and consists of the year, the number of members, the annual membership dues, and the consumer
price index. The easiest way to get these data into Stata is to use Stata’s Data Editor. Click on Data on the
task bar then click on Data Editor in the resulting drop down menu, or click on the Data Editor button. The
result should look like this.
Type the following numbers into the corresponding columns of the data editor. Don’t type the
variable names, just the numbers. You can type across the rows or down the columns. If you type across the
rows, use tab to enter the number. If you type down the columns, use the enter key after typing each value.
Year Members Dues CPI
78 126 135 .652
79 130 145 .751
80 123 160 .855
81 122 165 .941
82 115 190 1.000
83 112 195 1.031
84 139 175 1.077
85 127 180 1.115
86 133 180 1.135
87 112 190 1.160
When you are done typing your data editor screen should look like this.
Now rename the variables. Double click on “var1” at the top of column one and type “year” into the resulting
dialog box. Repeat for the rest of the variables (members, dues, cpi). Notice that the new variable names are in
the variables box in the lower left. Click on “preserve” and then the red x on the data editor header to exit the
data editor.
Transforming variables
Since we have seen the data, we don’t have to look at it any more. Let’s move on to the transformations.
Suppose we want Stata to compute the logarithms of members and dues and create a dummy variable for an
advertising campaign that took place between 1984 and 1987. Type the following commands one at a time
into the “Stata command” box along the bottom. Hit <enter> at the end of each line to execute the command.
Gen is short for “generate” and is probably the command you will use most. The list command shows you the
current data values.
gen logmem=log(members)
gen logdues=log(dues)
gen dum=(year>83)
list
[output suppressed]
We almost always want to transform the variables in our dataset before analyzing them. In the
example above, we transformed the variables members and dues into logarithms. For a Phillips curve, you
might want the rate of inflation as the dependent variable while using the inverse of the unemployment rate as
the independent variable. All transformations like these are accomplished using the gen command. We have
already seen how to tell Stata to take the logarithms of variables. The logarithm function is one of several in
Stata's function library. Some of the functions commonly used in econometric work are listed in Chapter 2,
below. Along with the usual addition, subtraction, multiplication, and division, Stata has exponentiation,
absolute value, square root, logarithms, lags, differences and comparisons. For example, to compute the
inverse of the unemployment rate, we would write,
gen invunem=1/unem
where unem is the name of the unemployment variable.
As another example, suppose we want to take the first difference of the time series variable GNP
(defined as GNP(t) − GNP(t−1)). To do this we first have to tell Stata that we have a time series data set using
the tsset command.
tsset year
Now you can use the “d.” operator to take first differences (Δy(t)).
gen dgnp=d.gnp
I like to use capital letters for these operators to make them more visible.
gen dcpi=D.cpi
The second difference, ΔΔcpi(t) = Δ²cpi(t) = Δ(cpi(t) − cpi(t−1)) = Δcpi(t) − Δcpi(t−1), can be calculated as,
gen d2cpi=D.dcpi
or
gen d2cpi=DD.cpi
or
gen d2cpi=D2.cpi
Lags can also be calculated. For example, the one-period lag of the CPI (cpi(t−1)) can be calculated
as,
gen cpi_1=L.cpi
These operators can be combined. The lagged difference Δcpi(t−1) can be written as
gen dcpi_1=LD.cpi
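Other combinations follow the same pattern. A sketch (the operator forms L2., F., and LD. are standard Stata time-series syntax; the gnp variable here is just an illustration, not part of the example data set):

```stata
gen cpi_2 = L2.cpi    // two-period lag of the cpi
gen fcpi  = F.cpi     // one-period lead (forward value) of the cpi
gen ldgnp = LD.gnp    // lagged first difference of gnp
```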
It is often necessary to make comparisons between variables. In the example above we created a
dummy variable, called "DUM," by the command
gen dum=(year>83)
This works as follows: for each line of the data, Stata reads the value in the column corresponding to the
variable YEAR. It then compares it to the value 83. If YEAR is greater than 83 then the variable "DUM" is
given the value 1 (“true”). If YEAR is not greater than 83, "DUM" is given the value 0 (“false”).
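A comparison can also combine several conditions with the logical operators & (and) and | (or). As a sketch (this is equivalent to dum in our example only because the sample ends in 1987):

```stata
gen campaign = (year >= 84 & year <= 87)   // 1 during the 1984-87 advertising campaign, 0 otherwise
```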
Continuing the example
Let’s continue the example by doing some more transformations and then some analyses. Type the following
commands (indicated by the courier new typeface), one at a time, ending with <enter> into the Stata
command box.
replace cpi=cpi/1.160
(Whenever you redefine a variable, you must use replace instead of generate.)
gen rdues=dues/cpi
gen logrdues=log(rdues)
gen trend=year-78
list
summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        year |        10        82.5     3.02765         78         87
     members |        10       123.9    8.999383        112        139
        dues |        10       171.5    20.00694        135        195
         cpi |        10       .9717    .1711212       .652       1.16
      logmem |        10    4.817096    .0727516   4.718499   4.934474
-------------+--------------------------------------------------------
     logdues |        10    5.138088    .1219735   4.905275      5.273
         dum |        10          .4    .5163978          0          1
       trend |        10         5.5     3.02765          1         10
(Note: rdues is real (inflation-adjusted) dues, the log function produces natural logs, and the trend is simply a
counter (0, 1, 2, etc.) for each year, usually called “t” in theoretical discussions.) The summarize command
produces means, standard deviations, etc.
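summarize also accepts a variable list and options; a sketch (the detail option is standard Stata, shown here on the members variable from the example):

```stata
summarize rdues logrdues        // summary statistics for selected variables only
summarize members, detail       // adds percentiles, skewness, and kurtosis
```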
We are now ready to do the analysis, a simple ordinary least squares (OLS) regression of members on real
dues.
regress members rdues
This command produces the following output.
      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =    0.69
       Model |  58.1763689     1  58.1763689           Prob > F      =  0.4290
    Residual |  670.723631     8  83.8404539           R-squared     =  0.0798
-------------+------------------------------           Adj R-squared = -0.0352
       Total |       728.9     9  80.9888889           Root MSE      =  9.1564

------------------------------------------------------------------------------
     members |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       rdues |  -.1310586   .1573327    -0.83   0.429    -.4938684    .2317512
       _cons |   151.0834   32.76126     4.61   0.002     75.53582     226.631
------------------------------------------------------------------------------
Reading Stata Output
Most of this output is self-explanatory. The regression output immediately above is read as follows. The
upper left section is called the Analysis of Variance. Under SS is the model sum of squares (58.1763689).
This number is referred to in a variety of ways. Each textbook writer chooses his or her own convention.
For example, the model sum of squares is also known as the explained sum of squares, ESS (or, sometimes
SSE), or the regression sum of squares, RSS (or SSR). Mathematically, it is Σŷᵢ² (summed over the N
observations), where the circumflex indicates that y is predicted by the regression. The fact that y is in lower
case indicates that it is expressed as deviations from the sample mean. The Residual SS (670.723631) is the
residual sum of squares (a.k.a. RSS, SSR, unexplained sum of squares, SSU, USS, and error sum of squares,
ESS, SSE). I will refer to the model sum of squares as the regression sum of squares, RSS, and the residual
sum of squares as the error sum of squares, ESS. The sum of the RSS and the ESS is the total sum of
squares, TSS. The next column
contains the degrees of freedom. The model degrees of freedom is the number of independent variables, not
including the constant term. The residual degrees of freedom is the number of observations, or sample size,
N (= 10) minus the number of parameters estimated, k (= 2), so that N − k = 8. The third column is the “mean
square” or the sum of squares divided by the degrees of freedom. The residual mean square is also known
as the “mean square error” (MSE), the residual variance, or the error variance. Mathematically it is
σ̂² = σ̂ᵤ² = (Σ eᵢ²)/(N − k)

where eᵢ is the residual from the regression (the difference between the actual value of the dependent
variable and the predicted value from the regression):

Ŷᵢ = α̂ + β̂Xᵢ
Yᵢ = Ŷᵢ + eᵢ = α̂ + β̂Xᵢ + eᵢ
eᵢ = Yᵢ − α̂ − β̂Xᵢ
The panel on the top right has the number of observations, the overall F-statistic that tests the null
hypothesis that the R-squared is equal to zero, the prob-value corresponding to this F-ratio, R-squared, the
R-squared adjusted for the number of explanatory variables (a.k.a. R-bar squared), and the root mean
square error (the square root of the residual sum of squares divided by its degrees of freedom, also known
as the standard error of the estimate, SEE, or the standard error of the regression, SER). Since the MSE is
the error variance, the SEE is the corresponding standard deviation (square root). The bottom table shows
the variable names, the corresponding estimated coefficients, β̂, the standard errors, S(β̂), the t-ratios, the
prob-values, and the confidence interval around the estimates.
The estimated coefficient for rdues is β̂ = Σxy/Σx² (= −.1310586). {The other coefficient is the intercept or
constant term, α̂ = Ȳ − β̂X̄ (= 151.0834).} The corresponding standard error for β̂ is
S(β̂) = √(σ̂²/Σx²) (= .1573327).
The resulting t-ratio for the null hypothesis that β = 0 is T = β̂/S(β̂) (= −0.83).
The prob-value for the t-test of no significance is the probability that the test statistic T would be observed
if the null hypothesis is true, using Student’s t distribution, two-tailed. If this value is smaller than .05 then
we “reject the null hypothesis that β = 0 at the five percent significance level.” We usually just say that β is
significant.
Let’s add the time trend and the dummy for the advertising campaign and do a multiple regression, using the
logarithm of members as the dependent variable and the log of real dues as the price.
regress logmem logrdues trend dum
      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  3,     6) =    8.34
       Model |  .038417989     3  .012805996           Prob > F      =  0.0146
    Residual |  .009217232     6  .001536205           R-squared     =  0.8065
-------------+------------------------------           Adj R-squared =  0.7098
       Total |  .047635222     9  .005292802           Root MSE      =  .03919

------------------------------------------------------------------------------
      logmem |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    logrdues |  -.6767872     .36665    -1.85   0.114    -1.573947    .2203729
       trend |  -.0432308   .0094375    -4.58   0.004    -.0663234   -.0201382
         dum |    .155846   .0608159     2.56   0.043      .007035    .3046571
       _cons |   8.557113   1.989932     4.30   0.005     3.687924     13.4263
------------------------------------------------------------------------------
The coefficient on logrdues is the elasticity of demand for pool membership. (Mathematically, the
derivative of a log with respect to a log is an elasticity.) Is the demand curve downward sloping? Is price
significant in the demand equation at the five percent level? At the ten percent level? The trend is the
percent change in pool membership per year. (The derivative of a log with respect to a non-logged value is
a percent change.) According to this estimate, the pool can expect to lose four percent of its members every
year, everything else staying the same (ceteris paribus). On the other hand, continuing the advertising
campaign will generate a 16 percent increase in pool memberships every year. Note that, since this is a
multiple regression, the estimated coefficients, or parameters, are the change in the dependent variable for a
oneunit change in the corresponding independent variable, holding everything else constant.
Saving your first Stata analysis
Let's assume that you have created a directory called "project" on the H: drive. To save the data for further analysis, type
save "H:\Project\test.dta"
This will create a Stata data set called "test.dta" in the directory H:\Project. You don't really need the quotes unless there are spaces in the pathname. However, it is good practice. You can read the data into Stata later with the
command,
use “H:\Project\test.dta”
You can also save the data by clicking on File, Save and browsing to the Project folder or by clicking on the
little floppy disk icon on the button bar.
You can save your output by going to the top of the output window (use the slider on the right hand side), clicking on the first bit of output you want to save, then dragging the cursor, holding the mouse key down, to the bottom of the output window. Once you have blocked the text you want to save, you can type <control>c for copy (or click on Edit, Copy Text), go to your favorite word processor, and type <control>v for paste (or click on Edit, Paste). You could also create a log file containing your results. We discuss how to create log files below.
You can then write up your results around the Stata output.
2 STATA LANGUAGE
This section is designed to provide a more detailed introduction to the language of Stata. However, it
is still only an introduction, for a full development you must refer to the relevant Stata manual, or use Stata
Help (see Chapter 6).
Basic Rules of Stata
1. Rules for composing variable names in Stata.
a. Every name must begin with a letter or an underscore. However, I recommend that you not use
underscore at the beginning of any variable name, since Stata frequently uses that convention to
make its own internal variables. For example, _n is Stata’s variable containing the observation
number.
b. Subsequent characters may be letters, numbers, or underscores, e.g., var1. You cannot use special
characters (&,#, etc.).
c. The maximum number of characters in a name is 32, but shorter names work best.
d. Case matters. The names Var, VAR, and var are all considered to be different variables in
Stata notation. I recommend that you use lowercase only for variable names.
e. The following variable names are reserved. You may not use these names for your variables.
_all double long _rc _b float
_n _se byte if _N _skip
_coeff in _pi using _cons int
_pred with e
2. Command syntax.
The typical Stata command looks like this.
[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [, options]
where things in a square bracket are optional. Actually the varlist is almost always required.
A varlist is a list of variable names separated by spaces. If no varlist is specified, then, usually, all the
variables are used. For example, in the simple program above, we used the command “list” with no varlist
attached. Consequently, Stata printed out all the data for all the variables currently in memory. We could have
said, list members dues rdues to get a list of those three variables only.
The "=exp" option is used for the "generate" and "replace" commands, e.g., gen logrdues=log(rdues) and replace cpi=cpi/1.160.
The “if exp” option restricts the command to a certain subset of observations. For example, we could have
written
list year members rdues if year == 83.
This would have produced a printout of the data for year, members, and rdues for 1983. Note that Stata
requires two equal signs to denote equality because one equal sign indicates assignment in the generate and
replace commands. (Obviously, the equal sign in the expression replace cpi=cpi/1.160 indicates that
the variable cpi is to take the value of the old cpi divided by 1.160. It is not testing for the equality of the left
and right hand sides of the expression.)
The “in range” option also restricts the command to a certain range. It takes the form in
number1/number2. For example,
list year members rdues in 5/10
would list the observations from 5 to 10. We can also use the letters f (first) and l (last) in place of either number1 or number2.
The weight option is always in square brackets and takes the form
[weightword=exp]
where the weight word is one of the following (each command has its own default weight):
aweight (analytical weights: inverse of the variance of the observation. This is the most commonly
used weight by economists. Each observation in our typical sample is an average across counties,
states, etc. The aweight is the variance divided by the number of elements, usually population, that
comprise the county, state, etc. For most Stata commands, the aweight is rescaled to sum to the
number of observations in the sample.)
fweight (frequency weights: indicates duplicate observations, an fweight of 100 indicates that
there are really 100 identical observations)
pweight (sampling weights: the inverse of the probability that this observation is included in the
sample. A pweight of 100 indicates that this observation is representative of 100 subjects in the
population.)
iweight (importance weight: indicates the relative importance of each observation. The precise definition of iweight depends on the command that uses it.)
Options. Each Stata command takes command-specific options. Options must be separated from the rest of the command by a comma. For example, the "reg" command that we used above has the option "robust" which produces heteroskedasticity-consistent standard errors. That is,
reg members rdues trend, robust
by varlist: Almost all Stata commands can be repeated automatically for each value of the variable or
variables in the by varlist. This option must precede the command and be separated from the rest of the
command by a colon. The data must be sorted by the varlist. For example the commands,
sort dum
by dum: summarize members rdues
will produce means, etc. for the years without an advertising campaign (dum=0) and the years with an
advertising campaign (dum=1). In this case the sort command is not necessary, since the data are already
sorted by dum, but it never hurts to sort before using the by varlist option.
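The by varlist: logic (sort first, then compute within each group) can be sketched in Python with itertools.groupby, which likewise requires sorted input. The dum and members numbers below are made up:

```python
# by dum: summarize, sketched with itertools.groupby. As in Stata, the data
# must be sorted by the grouping variable before grouping.
from itertools import groupby
from statistics import mean

rows = [(0, 420), (0, 435), (1, 470), (1, 490), (1, 505)]  # (dum, members)

rows.sort(key=lambda r: r[0])  # the "sort dum" step
results = {}
for dum, group in groupby(rows, key=lambda r: r[0]):
    values = [members for _, members in group]
    results[dum] = (len(values), mean(values))  # Obs and Mean, per group

print(results)
```

Without the sort, groupby (like Stata's by prefix) would treat non-adjacent repeats of the same dum value as separate groups.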
Arithmetic operators are:
+ add
- subtract
* multiply
/ divide
^ raise to a power
Comparison operators
== equal to
~= not equal to
> greater than
< less than
>= greater than or equal to
<= less than or equal to
Logical operators
& and
| or
~ not
Comparison and logical operators return a value of 1 for true and 0 for false. The order of operations is: ~, ^, - (negation), /, *, - (subtraction), +, ~=, >, <, <=, >=, ==, &, |.
Stata functions
These are only a few of the many functions in Stata; see the STATA USER'S GUIDE for a comprehensive list.
Mathematical functions include:
abs(x) absolute value
max(x) returns the largest value
min(x) returns the smallest value
sqrt(x) square root
exp(x) raises e (2.71828) to the power x
log(x) natural logarithm
Some useful statistical functions are the following.
chi2(df,x) returns the cumulative value of the chi-square with df degrees of freedom for a value of x.
For example, suppose we know that x=2.05 is distributed according to chisquare with 2 degrees of freedom.
What is the probability of observing a number as large as 2.05 if the true value of x is zero? We can compute
the following in Stata. Because we are not generating variables, we use the scalar command.
. scalar prob1=chi2(2,2.05)
. scalar list prob1
prob1 = .64120353
We apparently have a pretty good chance of observing a number as large as 2.05.
To find out if x=2.05 is significantly different from zero in the chisquare with 2 degrees of freedom we can
compute the following to get the prob value (the area in the right hand tail of the distribution).
. scalar prob=1chi2(2,2.05)
. scalar list prob
prob = .35879647
Since this value is not smaller than .05 it is not significantly different from zero at the .05 level.
chi2tail(df,x) returns the upper tail of the same distribution. We should get the same answer as above using this function.
. scalar prob2=chi2tail(2,2.05)
. scalar list prob2
prob2 = .35879647
invchi2(df,p) returns the value of x for which the probability is p and the degrees of freedom are df.
. scalar x1=invchi2(2,.641)
. scalar list x1
x1 = 2.0488658
invchi2tail is the inverse function for chi2tail.
. scalar x2=invchi2tail(2,.359)
. scalar list x2
x2 = 2.0488658
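For the two-degrees-of-freedom case you can verify these functions by hand, because the chi-square distribution with 2 degrees of freedom has the simple closed form P(X ≤ x) = 1 − e^(−x/2). A Python sketch (this shortcut is specific to df = 2; other degrees of freedom need the incomplete gamma function):

```python
import math

x = 2.05
df2_cdf = 1 - math.exp(-x / 2)        # chi2(2, 2.05)
df2_tail = math.exp(-x / 2)           # chi2tail(2, 2.05)
inv_tail = -2 * math.log(0.35879647)  # invchi2tail(2, .35879647)

print(round(df2_cdf, 8))   # matches the .64120353 in the Stata example
print(round(df2_tail, 8))  # matches .35879647
print(round(inv_tail, 4))  # recovers x = 2.05 (approximately)
```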
norm(x) is the cumulative standard normal. To get the probability of observing a number as large as 1.65 if the mean of the distribution is zero (and the standard deviation is one), i.e., P(z<1.65), type
. scalar prob=norm(1.65)
. scalar list prob
prob = .9505285
invnorm(prob) is the inverse normal distribution so that, if prob=norm(z) then z=invnorm(prob).
Wooldridge uses the Stata functions norm, invttail, invFtail, and invchi2tail to generate the normal, t, F, and chi-square tables in the back of his textbook, which means that we don't have to look in the back of textbooks for statistical distributions any more.
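If you want to check norm() and invnorm() outside Stata, the standard normal CDF can be written in terms of the error function, and Python's standard library supplies the inverse. A short sketch:

```python
import math
from statistics import NormalDist


def norm_cdf(z):
    # Standard normal CDF, the analogue of Stata's norm() function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


p = norm_cdf(1.65)           # P(z < 1.65), as in the Stata example
z = NormalDist().inv_cdf(p)  # the analogue of invnorm(prob)
print(round(p, 7), round(z, 2))
```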
uniform() returns uniformly distributed pseudo-random numbers between zero and one. Even though uniform() takes no parameters, the parentheses are part of the name of the function and must be typed. For example, to generate a sample of uniformly distributed random numbers, use the command
. gen v=uniform()
. summarize v
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           v |        51    .4819048    .2889026   .0445188   .9746088
To generate a sample of normally distributed random numbers with mean 2 and standard deviation 10, use the
command,
. gen z=2+10*invnorm(uniform())
. summarize z
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           z |        51     1.39789    8.954225  -15.87513   21.83905
Note that both these functions return 51 observations on the pseudorandom variable. Unless you tell Stata
otherwise, it will create the number of observations equal to the number of observations in the data set. I used
the crime1990.dta data set which has 51 observations, one for each state and DC.
The pseudo-random number generator uniform() generates the same sequence of random numbers in each session of Stata, unless you reinitialize the seed. To do that, type
set seed #
where # is some integer. Each seed produces a different (but reproducible) sequence of random numbers, so changing seeds between runs makes it difficult to be sure you are doing what you think you are doing. Unless I am doing a series of Monte Carlo exercises, I leave the seed alone.
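The 2+10*invnorm(uniform()) trick above is inverse-CDF sampling: push uniform draws through the inverse normal CDF, then rescale. Here is the same idea in Python (the seed and sample size are arbitrary choices, not taken from the text):

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(12345)  # like Stata's "set seed #": fixes the pseudo-random sequence

# 2 + 10*invnorm(uniform()): uniform draws pushed through the inverse
# standard normal CDF, then rescaled to mean 2 and standard deviation 10.
u = [random.random() for _ in range(10_000)]
z = [2 + 10 * NormalDist().inv_cdf(ui) for ui in u]

print(round(mean(z), 2), round(stdev(z), 2))  # close to 2 and 10
```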
Missing values
Missing values are denoted by a single period (.). Any arithmetic operation on a missing value results in a missing value. Stata considers missing values to be larger than any other number, so sorting by a variable which has missing values will put all observations corresponding to missing at the end of the data set.
When doing statistical analyses, Stata ignores (drops) observations with missing values. So, for example, if you did a regression of crime on police and the unemployment rate across states, Stata would drop any observations for which there are missing values in either police or unemployment ("casewise deletion"). In other commands, such as the summarize command, Stata will simply ignore any missing values for each variable separately when computing the mean, standard deviation, etc.
Time series operators.
The tsset command must be used to indicate that the data are time series. For example.
sort year
tsset year
Lags: the one-period lag of a variable such as the money supply, ms, can be written as
L.ms
so that L.ms is equal to ms(t−1). Similarly, the second-period lag could be written as
L2.ms
or, equivalently, as
LL.ms, etc.
For example, the command
gen rdues_1 = L.rdues
will generate a new variable called rdues_1 which is the one-period lag of rdues. Note that this will produce
a missing value for the first observation of rdues_1.
Leads: the one-period lead is written as
F.ms (=ms(t+1)).
The two-period lead is written as F2.ms or FF.ms.
Differences: the first difference is written as
D.ms (=ms(t) − ms(t−1)).
The second difference (difference of the difference) is
D2.ms or DD.ms (=(ms(t)−ms(t−1)) − (ms(t−1)−ms(t−2))).
Seasonal difference: the first seasonal difference is written
S.ms
and is the same as
D.ms.
The second seasonal difference is
S2.ms
and is equal to ms(t)−ms(t−2).
Time series operators can be combined. For example, for monthly data, the difference of the 12-month difference would be written as DS12.ms. Time series operators can be written in upper or lowercase. I recommend upper case so that it is more easily readable as an operator as opposed to a variable name (which I like to keep in lowercase).
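For readers who want to see exactly what these operators compute, here is a Python sketch of L., F., D., and D2. on a made-up series, with None standing in for Stata's missing value:

```python
# Lags, leads, and differences on a toy series, mirroring L., F., D., D2.
# None plays the role of Stata's missing value at the ends of the sample.
ms = [100, 104, 103, 110, 115]

lag1 = [None] + ms[:-1]    # L.ms : ms(t-1)
lead1 = ms[1:] + [None]    # F.ms : ms(t+1)

# D.ms : ms(t) - ms(t-1)
d1 = [None] + [ms[t] - ms[t - 1] for t in range(1, len(ms))]
# D2.ms : difference of the difference
d2 = [None, None] + [d1[t] - d1[t - 1] for t in range(2, len(ms))]

print(lag1)   # [None, 100, 104, 103, 110]
print(d1)     # [None, 4, -1, 7, 5]
print(d2)     # [None, None, -5, 8, -2]
```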
Setting panel data.
If you have a panel of 50 states (1–50) across 20 years (1977–1996), you can use time series operators, but you must first tell Stata that you have a panel (pooled time series and cross-section). The following commands sort the data, first by state, then by year, define state and year as the cross-section and time-series identifiers, and generate the lag of the variable x.
sort state year
tsset state year
gen x_1=L.x
Variable labels.
Each variable automatically has an 80-character label associated with it, initially blank. You can use the "label variable" command to define a new variable label. For example,
label variable dues "Annual dues per family"
label variable members "Number of members"
Stata will use variable labels, as long as there is room, in all subsequent output. You can also add variable
labels by invoking the Data Editor, then double clicking on the variable name at the head of each column
and typing the label into the appropriate space in the dialog box.
System variables
System variables are automatically produced by Stata. Their names begin with an underscore.
_n is the number of the current observation
You can create a counter or time trend if you have time series data by using the command
gen t=_n
_N is the total number of observations in the data set.
_pi is the value of π to machine precision.
3 DATA MANAGEMENT
Getting your data into Stata
Using the data editor
We already know how to use the data editor from the first chapter.
Importing data from Excel
Generally speaking, the best way to get data into Stata is through a spreadsheet program such as Excel. For example, data presented in columns in a text file or Web page can be cut and pasted into a text editor or word processor and saved as a text file. Such a file is not directly readable as data because it is organized as lines of text rather than delimited columns. We need to separate the data into columns corresponding to variables.
Text files can be opened and read into Excel using the Text Import Wizard. Click on File, Open and browse to
the location of the file. Follow the Text Import Wizard’s directions. This creates separate columns of numbers,
which will eventually correspond to variables in Stata. After the data have been read in to Excel, the data can
be saved as a text file. Clicking on File, Save As in Excel reveals the following dialogue box.
If the data have been read from a .txt file, Excel will assume you want to save it in the same file. Clicking on
Save will save the data set with tabs between data values. Once the tab delimited text file has been created, it
can be read into Stata using the insheet command.
insheet using filename
For example,
insheet using “H:\Project\gnp.txt”
The insheet command will figure out that the file is tab delimited and that the variable names are at the top.
The insheet command can be invoked by clicking on File, Import, and “Ascii data created by a spreadsheet.”
The following dialog box will appear.
Clicking on Browse allows you to navigate to the location of the file. Once you get to the right directory, click
on the downarrow next to “Files of type” (the default file type is “Raw,” whatever that is) to reveal
Choose .txt files if your files are tab delimited.
Importing data from comma separated values (.csv) files
The insheet command can be used to read other file formats such as csv (comma separated values) files. Csv
files are frequently used to transport data. For example, the Bureau of Labor Statistics allows the public to
download data from its website. The following screen was generated by navigating the BLS website
http://www.bls.gov/.
Note that I have selected the years 1956–2004, annual frequency, column view, and text, comma delimited (although I could have chosen tab delimited and the insheet command would read that too). Clicking on Retrieve Data yields the following screen.
Use the mouse (or hold down the shift key and use the down arrow on the keyboard) to highlight the data starting at Series Id, copy (control-c) the selected text, invoke Notepad, WordPad, or Word, and paste (control-v) the selected text. Now Save As a csv file, e.g., Save As cpi.csv. (Before you save, you should delete the comma after Value, which indicates that there is another variable after Value. This creates a variable, v5, with nothing but missing values in Stata. You don't really have to do this because the variable can be dropped in Stata, but it is neater if you do.) The Series Id and period can be dropped once the data are in Stata.
You can read the data file with the insheet command, e.g.,
insheet using “H:\Project\cpi.csv”
or you can click on Data, Import, and Ascii data created by a spreadsheet, as above.
Reading Stata files: the use command
If the data are already in a Stata data set (with a .dta suffix), they can be read with the "use" command. For example, suppose we have saved a Stata file called project.dta in the H:\Project directory. The command to read the data into Stata is
use "h:\Project\project.dta", clear
The "clear" option removes any unwanted or leftover data from previous analyses. Alternatively, you can click on the "open (use)" button on the taskbar or click on File, Open.
Saving a Stata data set: the save command
Data in memory can be saved in a Stata (.dta) data set with the “save” command. For example, the cpi data
can be saved by typing
save "h:\Project\cpi.dta"
If you want the data set to be readable by older versions of Stata use the “saveold” command,
saveold “h:\Project\cpi.dta”
The quotes are not required. However, you might have trouble saving (or using) files without the quotes if
your pathname contains spaces.
If a file by the name of cpi.dta already exists, e.g., you are updating a file, you must tell Stata that you want to
overwrite the old file. This is done by adding the option, replace.
save "h:\Project\cpi.dta", replace
Combining Stata data sets: the Append and
Merge commands
Suppose we want to combine the data in two different Stata data sets. There are two ways in which data can
be combined. The first is to stack one data set on top of the other, so that some or all the series have
additional observations. The "append" command is used for this purpose. For example, suppose we have a data set called "project.dta" which has data on three variables (gnp, ms, cpi) from 1977 to 1998 and another called "project2.dta," which has data on two of the three variables (ms and cpi) for the years 1999–2002. To combine the data sets, we first read the early data into Stata with the use command
use “h:\Project\project.dta”
The early data are now residing in memory. We append the newer data to the early data as follows
append using “h:\Project\project2.dta”
At this point you probably want to list the data and perhaps sort by year or some other identifying variable.
Then save the expanded data set with the save command. If you do not get the result you were expecting,
simply exit Stata without saving the data.
The other way to combine data sets is to merge two sets together. This is concatenating horizontally instead of
vertically. It makes the data set wider in the sense that, if one data set has variables that the other does not, the
combined data set will have variables from both. (Of course, there could be some overlap in the variables, so
one data set will overwrite the data in the other. This is another way of updating a data set.)
There are two kinds of merges. The first is the one-to-one merge, where the first observation of the first data set is matched with the first observation of the second data set. You must be very careful that each data set has its observations sorted exactly the same way. I do not recommend one-to-one merges as a general principle.
The preferred method is a "match merge," where the two datasets have one or more variables in common. These variables act as identifiers so that Stata can match the correct observations. For example, suppose you have data on the following variables: year pop crmur crrap crrob crass crbur from 1929 to 1998 in a data set called crime1.dta in the h:\Project directory. Suppose that you also have a data set called arrests.dta containing year pop armur arrap arrob arass arbur (corresponding arrest rates) for the years 1952–1998. You can combine these two data sets with the following commands
Read and sort the first data set.
use “H:\Project\crime1.dta”, clear
sort year
list
save “H:\Project\crime1.dta”, replace
Read and sort the second data set.
use “H:\Project\arrests.dta”, clear
sort year
list
save “H:\Project\arrests.dta”, replace
Read the first data set back into memory, clearing out the data from the second file. This is the “master” data
set.
use “H:\Project\crime1.dta”, clear
Merge with the second data set by year. This is the “using” data set.
merge year using “H:\Project\arrests.dta”
Look at the data, just to make sure.
list
Assuming it looks ok, save the combined data into a new data set.
save “H:\Project\crime2.dta”, replace
The resulting data set will contain data from 1929–1998; however, the arrest variables will have missing values for 1929–1951.
The data set that is in memory (crime1) is called the “master.” The data set mentioned in the merge command
(arrests) is called the “using” data set. Each time Stata merges two data sets, it creates a variable called
_merge. This variable takes the value 1 if the value only appeared in the master data set; 2 if the value
occurred only in the using data set; and 3 if an observation from the master is joined with one from the using.
Note also that the variable pop occurs in both data sets. Stata uses the convention that, in the case of duplicate
variables, the one that is in memory (the master) is retained. So, the version of pop that is in the crime1 data
set is stored in the new crime2 data set.
If you want to update the values of a variable with revised values, you can use the merge command with the
update replace options.
merge year using h:\Project\arrests.dta, update replace
Here, data in the using file overwrite data in the master file. In this case the pop values from the arrest data set
will be retained in the combined data set. The _merge variable can take two additional values if the update
option is invoked: _merge equals 4 if missing values in the master were replaced with values from the using
and _merge equals 5 if some values of the master disagree with the corresponding values in the using (these
correspond to the revised values). You should always list the _merge variable after a merge.
If you just want to overwrite missing values in the master data set with new observations from the using data
set (so that you retain all existing data in the master), then don’t use the replace option:
merge year using h:\Project\arrests.dta, update
list _merge
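The logic of a match merge and the _merge codes can be sketched outside Stata. Here is a toy Python version keyed on year (the data values are made up; Stata's convention of keeping the master's copy of a duplicate variable is mimicked by the update order):

```python
# A toy match merge on "year", mimicking Stata's merge and its _merge codes:
# 1 = master only, 2 = using only, 3 = matched. All data values are made up.
master = {1929: {"pop": 121769, "crmur": 7508},
          1952: {"pop": 157553, "crmur": 12860}}
using = {1952: {"pop": 157600, "armur": 8.1},
         1999: {"pop": 272691, "armur": 4.4}}

merged = {}
for year in sorted(set(master) | set(using)):
    row = {}
    # Without update replace, Stata keeps the master's version of any
    # variable that appears in both data sets (pop here), so the master's
    # values are written last and win.
    row.update(using.get(year, {}))
    row.update(master.get(year, {}))
    if year in master and year in using:
        row["_merge"] = 3
    elif year in master:
        row["_merge"] = 1
    else:
        row["_merge"] = 2
    merged[year] = row

print(merged[1952]["pop"], merged[1952]["_merge"])  # master's pop wins
```

Swapping the two update calls would mimic the update replace behavior, where the using file's values overwrite the master's.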
Looking at your data: list, describe, summarize,
and tabulate commands
It is important to look at your data to make sure it is what you think it is. The simplest command is
list varlist
where varlist is a list of variables separated by spaces. If you omit the varlist, Stata will assume you want all
the variables listed. If you have too many variables to fit in a table, Stata will print out the data by observation.
This is very difficult to read. I always use a varlist to make sure I get an easy to read table.
The “describe” command is best used without a varlist or options.
describe
It produces a list of variables, including any string variables. Here is the resulting output from the crime and
arrest data set we created above.
. describe

Contains data from C:\Project\crime1.dta
  obs:            70
 vars:            13                          16 May 2004 20:12
 size:         3,430 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
year            int    %8.0g
pop             float  %9.0g                  POP
crmur           int    %8.0g                  CRMUR
crrap           long   %12.0g                 CRRAP
crrob           long   %12.0g                 CRROB
crass           long   %12.0g                 CRASS
crbur           long   %12.0g                 CRBUR
armur           float  %9.0g                  ARMUR
arrap           float  %9.0g                  ARRAP
arrob           float  %9.0g                  ARROB
arass           float  %9.0g                  ARASS
arbur           float  %9.0g                  ARBUR
_merge          byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved
We can get a list of the numerical variables and summary data for each one with the “summarize” command.
summarize varlist
This command produces the number of observations, the mean, standard deviation, min, and max for every
variable in the varlist. If you omit the varlist, Stata will generate these statistics for all the variables in the data
set. For example,
. summarize
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        year |        70      1963.5    20.35109       1929       1998
         pop |        70      186049    52068.33     121769     270298
       crmur |        64    14231.02    6066.007       7508      24700
       crrap |        64    44291.63    36281.72       5981     109060
       crrob |        64    288390.6    218911.6      58323     687730
       crass |        64    414834.3    363602.9      62186    1135610
       crbur |        64     1736091     1214590     318225    3795200
       armur |        46    7.516304    1.884076        4.4       10.3
       arrap |        40     12.1675    3.158983          7         16
       arrob |        46    53.39348    17.57911       26.5       80.9
       arass |        46    113.1174    54.79689       49.3        223
       arbur |        46    170.5326    44.00138       97.5      254.1
      _merge |        70    2.371429    .9952267          1          5
If you add the detail option,
summarize varlist, detail
you will also get the variance, skewness, kurtosis, four smallest, four largest, and a variety of percentiles, but the printout is not as neat.
Don’t forget, as with any Stata command, you can use qualifiers to limit the command to a subset of the
data. For example, the command
summarize if year >1990
will yield summary statistics only for the years 1991–1998.
Finally, you can use the “tabulate” command to generate a table of frequencies. This is particularly useful
for large data sets with too many observations to list comfortably.
tabulate varname
For example, suppose we think we have a panel data set on crime in nine Florida counties for the years 1978–1997. Just to make sure, we use the tabulate command:
. tabulate county
     county |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         20       11.11       11.11
          2 |         20       11.11       22.22
          3 |         20       11.11       33.33
          4 |         20       11.11       44.44
          5 |         20       11.11       55.56
          6 |         20       11.11       66.67
          7 |         20       11.11       77.78
          8 |         20       11.11       88.89
          9 |         20       11.11      100.00
------------+-----------------------------------
      Total |        180      100.00
. tabulate year
       year |      Freq.     Percent        Cum.
------------+-----------------------------------
         78 |          9        5.00        5.00
         79 |          9        5.00       10.00
         80 |          9        5.00       15.00
         81 |          9        5.00       20.00
         82 |          9        5.00       25.00
         83 |          9        5.00       30.00
         84 |          9        5.00       35.00
         85 |          9        5.00       40.00
         86 |          9        5.00       45.00
         87 |          9        5.00       50.00
         88 |          9        5.00       55.00
         89 |          9        5.00       60.00
         90 |          9        5.00       65.00
         91 |          9        5.00       70.00
         92 |          9        5.00       75.00
         93 |          9        5.00       80.00
         94 |          9        5.00       85.00
         95 |          9        5.00       90.00
         96 |          9        5.00       95.00
         97 |          9        5.00      100.00
------------+-----------------------------------
      Total |        180      100.00
Nine counties each with 20 years and 20 years each with 9 counties. Looks ok.
The tabulate command will also do another interesting trick. It can generate dummy variables corresponding
to the categories in the table. So, for example, suppose we want to create dummy variables for each of the
counties and each of the years. We can do it easily with the tabulate command and the generate() option.
tabulate county, generate(cdum)
tabulate year, generate(yrdum)
This will create nine county dummies (cdum1–cdum9) and twenty year dummies (yrdum1–yrdum20).
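What the generate() option does is mechanical: one 0/1 indicator per category, numbered in sorted order. A Python sketch on a made-up county variable:

```python
# What "tabulate county, generate(cdum)" does: one 0/1 dummy per category.
# The county values here are made up for illustration.
county = [1, 1, 2, 3, 3, 3]

categories = sorted(set(county))
dummies = {f"cdum{i}": [1 if c == cat else 0 for c in county]
           for i, cat in enumerate(categories, start=1)}

print(dummies["cdum1"])  # [1, 1, 0, 0, 0, 0]
print(dummies["cdum3"])  # [0, 0, 0, 1, 1, 1]
```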
Culling your data: the keep and drop commands
Occasionally you will want to drop variables or observations from your data set. The command
drop varlist
will eliminate the variables listed in the varlist. The command
drop if year <1978
will drop observations corresponding to years before 1978. The command
drop in 1/10
will drop the first ten observations.
Alternatively, you can use the keep command to achieve the same result. The command
keep varlist
will drop all variables not in the varlist.
keep if year >1977
will keep all observations after 1977.
Transforming your data: the generate and replace
commands
As we have seen above, we use the “generate” command to transform our raw data into the variables we need
for our statistical analyses. Suppose we want to create a per capita murder variable from the raw data in the
crime2 data set we created above. We would use the command,
gen murder=crmur/pop*1000
I always abbreviate generate to gen. This operation divides crmur by pop then (remember the order of
operations) multiplies the resulting ratio by 1000. This produces a reasonable number for statistical operations.
According to the results of the summarize command above, the maximum value for crmur is 24,700. The
corresponding value for pop is 270,298. However, we know that the population is about 270 million, so our
population variable is in thousands. If we divide 24,700 by 270 million, we get the probability of being
murdered (.000091). This number is too small to be useful in statistical analyses. Dividing 24,700 by 270,298
yields .09138 (the number of murders per thousand population) which is still too small. Multiplying the ratio
of crmur to our measure of pop by 1000 yields the number of murders per million population (91.38), which is
a reasonable order of magnitude for further analysis.
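The rescaling arithmetic in this paragraph is easy to verify. A Python sketch using the maximum values quoted above:

```python
# The rescaling behind gen murder = crmur/pop*1000, using the maximum
# values from the summarize output above (pop is in thousands).
crmur = 24700   # murders
pop = 270298    # population, in thousands

per_person = crmur / (pop * 1000)   # probability of being murdered
per_thousand = crmur / pop          # murders per thousand population
per_million = crmur / pop * 1000    # murders per million population

print(round(per_person, 6), round(per_thousand, 5), round(per_million, 2))
```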
If you try to generate a variable called, say, widget and a variable by that name already exists in the data set,
Stata will refuse to execute the command and tell you, “widget already defined.” If you want to replace widget
with new values, you must use the replace command. We used the replace command in our sample program in
the last chapter to rebase our cpi index.
replace cpi=cpi/1.160
4 GRAPHS
Continuing with our theme of looking at your data before you make any crucial decisions, obviously one of
the best ways of looking is to graph the data. The first kind of graph is the line graph, which is especially
useful for time series.
Here is a graph of the major crime rate in Virginia from 1960 to 1998. The command used to produce the
graph is
. graph twoway line crmaj year
which produces
[Figure: line graph of crmaj against year, Virginia, 1960–1998]
It is interesting that crime peaked in the mid-1970s and has been declining fairly steadily since the early 1980s. I wonder why. Hmmm.
Note that I am able to paste the graph into this Word document by generating the graph in Stata and clicking on Edit, Copy Graph. Then, in Word, click on Edit, Paste (or control-v if you prefer keyboard shortcuts).
The second type of graph frequently used by economists is the scatter diagram, which relates one variable
to another. Here is the scatter diagram relating the arrest rate to the crime rate.
. graph twoway scatter crmaj aomaj
[Figure: scatter plot of crmaj against aomaj]
We also occasionally like to see the scatter diagram with the regression line superimposed.
graph twoway (scatter crmaj aomaj) (lfit crmaj aomaj)
[Figure: scatter plot of crmaj against aomaj with the fitted regression line superimposed]
This graph seems to indicate that the higher the arrest rate, the lower the crime rate. Does deterrence work?
Note that we used so-called "parentheses binding" or "() binding" to overlay the graphs. The first parenthesis contains the command that created the original scatter diagram, the second parenthesis contains the command that generated the fitted line (lfit is short for linear fit). Here is the lfit part by itself.
. graph twoway lfit crmaj aomaj
These graphs can be produced interactively. Here is another version of the above graph, with some more
bells and whistles.
This was produced with the command,
[Figure: lfit of crmaj on aomaj; y-axis: Fitted values (10,500 to 12,500); x-axis: aomaj (20 to 30)]

[Figure: "Crime Rates and Arrest Rates," subtitle "Virginia 1960-1998"; y-axis: Crime Rate (6,000 to 16,000); x-axis: Arrest Rate (20 to 30); legend: Fitted values, crmaj]
. twoway (lfit crmaj aomaj, sort) (scatter crmaj aomaj), ytitle(Crime Rate)
>     xtitle(Arrest Rate) title(Crime Rates and Arrest Rates)
>     subtitle(Virginia 1960-1998)
However, I did not have to enter that complicated command, and neither do you. Click on Graphics to
reveal the pull-down menu.
Click on “overlaid twoway graphs” to reveal the following dialog box with lots of tabs. Clicking on the tabs
reveals further dialog boxes. Note that only the first plot (plot 1) had “lfit” as an option. Also, I sorted by
the independent variable (aomaj) in this graph. The final command is created automatically by filling in
dialog boxes and clicking on things.
Here is another common graph in economics, the histogram.
This was produced by the following command, again using the dialog boxes.
. histogram crmaj, bin(10) normal yscale(off) xlabel(, alternate) title(Crime
> Rate) subtitle(Virginia)
(bin=10, start=5740.0903, width=945.2186)
There are lots of other cool graphs available to you in Stata. You should take some time to experiment with
some of them. The graphics dialog boxes make it easy to produce great graphs.
[Figure: histogram of crmaj (6,000 to 16,000) with a normal density overlay; title "Crime Rate," subtitle "Virginia"]
5 USING DO-FILES
A do-file is a list of Stata commands collected together in a text file. It is also known as a Stata program or
batch file. The text file must have a ".do" filename extension. The do-file is executed either with the "do"
command or by clicking on File, Do…, and browsing to the do-file. Except for very simple tasks, I
recommend always using do-files. The reason is that complicated analyses can be difficult to reproduce. It
is possible to do a complicated analysis interactively, executing each command in turn, before writing and
executing the next command. You can wind up rummaging around in the data doing various regression
models and data transformations, get a result, and then be unable to remember the sequence of events that
led up to the final version of the model. Consequently, you may never be able to get the exact result again.
It may also make it difficult for other people to reproduce your results. However, if you are working with a
do-file, you make changes in the program, save it, execute it, and repeat. The final model can always be
reproduced, as long as the data doesn't change, by simply running the final do-file again.
If you insist on running Stata interactively, you can “log” your output automatically to a “log file” by
clicking on the “Begin Log” button (the one that looks like a scroll) on the button bar or by clicking on
File, Log, Begin. You will be presented with a dialog box where you can enter the name of the log file and
change the default directory where the file will be stored. When you write up your results you can insert
parts of the log file to document your findings. If you want to replicate your results, you can edit the log
file, remove the output, and turn the resulting file into a do-file.
Because economics typically does not use laboratories where experimental results can be replicated by
other researchers, we need to make our data and programs readily available to other economists, otherwise
how can we be sure the results are legitimate? After all, we can edit the Stata output and data input and type
in any values we want to. When the data and programs are readily available, it is easy for other researchers
to take the programs, do the analysis themselves, and check the data against the original sources, to
make sure we are on the up and up. They can also do various tests and alternative analyses to find out if the
results are "fragile," that is, whether the results fail to hold up under relatively small changes in the modeling
technique or data. For these reasons, I recommend that any serious econometric analysis be done with
do-files, which can be made available to other researchers.
You can create a do-file with the built-in Stata do-file editor; with any text editor, such as Notepad or WordPad;
or with any word processing program like Word or WordPerfect. If you use a word processor, be sure to
"save as" a text (.txt) file. However, the first time you save the do-file, Word, for example, will insist on
adding a .txt filename extension. So, you wind up with a do-file called sample.do.txt. You have to rename
the file by dropping the .txt extension. You can do it from inside Word by clicking on Open, browsing to
the right directory, right-clicking on the do-file with the .txt extension, clicking on Rename, and then
renaming the file.
It is probably easiest to use the built-in Stata do-file editor. It is a full-service text editor. It has mouse
control; cut, copy, and paste; find and replace; and undo. To fire up the do-file editor, click on the button
that looks like an open envelope (with a pencil or something sticking out of it) on the button bar or click on
Window, Do-file Editor.
Here is a simple do-file.
/***********************************************
*****          Sample Do-File             *****
*****        This is a comment            *****
************************************************/
/* some preliminaries */
#delimit; * use semicolon to delimit commands;
set mem 10000; * make sure you have enough memory;
set more off; * turn off the more feature;
set matsize 150; * make sure you have enough room for variables;
/* tell Stata where to store the output */
/* I usually use the dofile name, with a .log extension */
/* use a full pathname */
log using "H:\Project\example.log", replace;
/* get the data */
use "H:\Project\crimeva.dta" , clear;
/*************************************/
/*** summary statistics ***/
/*************************************/
summarize crbur aobur metpct ampct unrate;
regress crbur aobur metpct ampct unrate;
log close;
The beginning block of statements should probably be put at the start of any do-file. The #delimit;
command sets the semicolon, instead of the carriage return, as the symbol signaling the end of a command.
This allows us to write multi-line commands.
The --more-- condition: in the usual interactive mode, when the results screen fills up, it stops scrolling and
--more--
appears at the bottom of the screen. To see the next screen, you must hit the spacebar. This can be annoying
in do-files because it keeps stopping the program. Besides, you will be able to examine the output by
opening the log file, so set more off.
For small projects you won’t have to increase memory or matsize, but if you do run out of space, these are
the commands you will need.
Don’t forget to set the log file with the “log using filename, replace” command. The “replace” part is
necessary if you edit the dofile and run it again. If you forget the replace option, Stata will not let you
overwrite the existing log file. If you want Stata to put the log file in your “project” directory, you will need
to use a complete pathname for the file, e.g., “H:\project\sample.log.”
Next you have to get the data with the “use” command. Finally, you do the analysis.
Don’t forget to close the log file.
To see the log file, you can open it in the do-file editor, or any other editor or word processor. I use Word
to review the log file, because I use Word to write the final paper.
The log file looks like this.
-------------------------------------------------------------------------------
       log:  H:\project\sample.log
  log type:  text
 opened on:  4 Jun 2004, 08:16:47

. /* get the data */
. use "crimeva.dta" , clear;

. /*************************************/
> /*** summary statistics ***/
> /*************************************/
> summarize crbur aobur metpct ampct unrate;

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       crbur |        40    7743.378    2166.439   3912.193   11925.49
       aobur |        22    16.99775    1.834964      12.47       19.6
      metpct |        28    71.46429    1.394291         70      74.25
       ampct |        28    18.76679    .1049885       18.5       18.9
      unrate |        30    4.786667    1.162261        2.4        7.7

. regress crbur aobur metpct ampct unrate;

      Source |       SS       df       MS              Number of obs =      21
-------------+------------------------------           F(  4,    16) =   28.74
       Model |  56035517.4     4  14008879.3           Prob > F      =  0.0000
    Residual |  7800038.17    16  487502.386           R-squared     =  0.8778
-------------+------------------------------           Adj R-squared =  0.8473
       Total |  63835555.6    20  3191777.78           Root MSE      =  698.21

------------------------------------------------------------------------------
       crbur |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       aobur |  -31.52691   132.9329    -0.24   0.816    -313.3321    250.2783
      metpct |  -986.7367   246.7202    -4.00   0.001     -1509.76   -463.7131
       ampct |   5689.995   7001.924     0.81   0.428     -9153.42    20533.41
      unrate |   71.66737   246.0449     0.29   0.775    -449.9246    593.2593
       _cons |  -27751.14   148088.1    -0.19   0.854    -341683.8    286181.6
------------------------------------------------------------------------------

. log close;
       log:  C:\Stata\sample.log
  log type:  text
 closed on:  4 Jun 2004, 08:16:48
-------------------------------------------------------------------------------
For complicated jobs, using lots of “banner” style comments (like the “summary statistics” comment
above) makes the output more readable and reminds you, or tells someone else who might be reading the
output, what you’re doing and why.
6 USING STATA HELP
This little manual can’t give you all the details of each Stata command. It also can’t list all the Stata
commands that you might need. Finally, the official Stata manual consists of several volumes and is very
expensive. So where does a student turn for help with Stata. Answer: the Stata Help menu.
Clicking on Help yields the following submenu.
Clicking on Contents allows you to browse through the list of Stata commands, arranged according to
function.
One could rummage around under various topics and find all kinds of neat commands this way. Clicking on
any of the hyperlinks brings up the documentation for that command.
Now suppose you want information on how Stata handles a certain econometric topic. For example,
suppose we want to know how Stata deals with heteroskedasticity. Clicking on Help, Search, yields the
following dialog box.
Clicking on the “Search all” radio button and entering heteroskedasticity yields
On the other hand, if you know that you need more information on a given command, you can click on
Help, Stata Command, which yields the following dialog box.
For example, entering "regress" to get the documentation on the regress command yields the following.
This is the best way to find out more about the commands in this book. Most of the commands have options
that are not documented here. There are also lots of examples and links to related commands. You should
certainly spend a little time reviewing the regress help documentation because it will probably be the
command you will use most over the course of the semester.
7 REGRESSION
Let’s start with a simple correlation. Use the data on pool memberships from the first chapter. Use
summarize to compute the means of the variables.
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        year |        10        82.5     3.02765         78         87
     members |        10       123.9    8.999383        112        139
        dues |        10       171.5    20.00694        135        195
         cpi |        10       .9717    .1711212       .652       1.16
       rdues |        10    178.8055    16.72355   158.5903   207.0552
-------------+--------------------------------------------------------
     logdues |        10    5.138088    .1219735   4.905275      5.273
      logmem |        10    4.817096    .0727516   4.718499   4.934474
         dum |        10          .4    .5163978          0          1
       trend |        10         4.5     3.02765          0          9
Plot the number of members on the real dues (rdues), sorting by rdues. This can be done by typing
twoway (scatter members rdues), yline(124) xline(179)
Or by clicking on Graphics, Two way graph, Scatter plot and filling in the blanks. Click on Y axis, Major
Tick Options, and Additional Lines, to reveal the dialog box that allows you to put a line at the mean of
members. Similarly, clicking on X axis, Major Tick Options, allows you to put a line at the mean of rdues.
Either way, the result is a nice graph.
Note that six out of the ten dots are in the upper left and lower right quadrants. This means that high values
of rdues are associated with low membership. We need a measure of this association. Define the deviation
from the mean for X and Y as

$$x_i = X_i - \bar{X}, \qquad y_i = Y_i - \bar{Y}$$
Note that, in the upper right quadrant, $X_i > \bar{X}$ and $Y_i > \bar{Y}$, so that $x_i > 0,\ y_i > 0$. Moving clockwise, in the
lower right quadrant, $x_i > 0,\ y_i < 0$. In the lower left quadrant, $x_i < 0,\ y_i < 0$. Finally, in the upper left
quadrant, $x_i < 0,\ y_i > 0$.
So, for points in the lower left and upper right quadrants, if we multiply the deviations together and
sum, we get $\sum x_i y_i > 0$. This quantity is called the sum of corrected cross-products, or the covariation. If we
compute the sum of corrected cross-products for the points in the upper left and lower right quadrants, we
find that $\sum x_i y_i < 0$. This means that, if most of the points are in the upper left and lower right quadrants,
then the covariation, computed over the entire sample, will be negative, indicating a negative association
between X and Y. Similarly, if most of the points lie in the lower left and upper right quadrants, then the
covariation will be positive, indicating a positive relationship between X and Y.
There are two problems with using covariation as our measure of association. The first is that it depends on
the number of observations. The larger the number of observations, the larger the (positive or negative)
covariation. We can fix this problem by dividing the covariation by the sample size, N. The resulting
quantity is the average covariation, known as the covariance.
[Figure: scatter of members (110 to 140) against rdues (160 to 210), with reference lines at the means (members = 124, rdues = 179)]
$$\mathrm{Cov}(x,y) = \frac{\sum_{i=1}^{N} x_i y_i}{N} = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{N}$$
The second problem is that the covariance depends on the unit of measure of the two variables. You get
different values of the covariance if the two variables are height in inches and weight in pounds versus
height in miles and weight in milligrams. To solve this problem, we divide each observation by its standard
deviation to eliminate the units of measure and then compute the covariance using the resulting standard
scores. The standard score of X is $z_{X_i} = x_i / S_x$, where

$$S_x = \sqrt{\frac{\sum_{i=1}^{N} x_i^2}{N}}$$

is the standard deviation of X. The standard score of Y is $z_{Y_i} = y_i / S_y$. The covariance of the standard scores is

$$r_{XY} = \frac{\sum_{i=1}^{N} z_{X_i} z_{Y_i}}{N} = \frac{1}{N}\sum_{i=1}^{N}\frac{x_i}{S_x}\frac{y_i}{S_y} = \frac{\sum_{i=1}^{N} x_i y_i}{N S_x S_y}$$
This is the familiar simple correlation coefficient.
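Although this manual works in Stata, the algebra above is language-independent, and it is easy to check numerically. Here is a minimal sketch in Python with made-up numbers (not the pool data): it computes the covariance from the deviations and the correlation as the covariance of the standard scores, and confirms the result matches NumPy's built-in correlation.

```python
import numpy as np

# Hypothetical data for illustration only (not the pool data from the text)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([9.0, 7.0, 8.0, 4.0, 3.0])
N = len(X)

# Deviations from the means
x = X - X.mean()
y = Y - Y.mean()

# Covariation, covariance, and (population) standard deviations, dividing by N as in the text
covariation = np.sum(x * y)
covariance = covariation / N
Sx = np.sqrt(np.sum(x**2) / N)
Sy = np.sqrt(np.sum(y**2) / N)

# Correlation = covariance of the standard scores
r = covariance / (Sx * Sy)
print(r)
print(np.corrcoef(X, Y)[0, 1])  # agrees with the formula above
```

Since these made-up Y values fall as X rises, the covariation (and hence r) comes out negative, exactly the quadrant logic described above.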
The correlation coefficient for our pool sample can be computed in Stata by typing

correlate members rdues
(obs=10)

             |  members    rdues
-------------+------------------
     members |   1.0000
       rdues |  -0.2825   1.0000

The correlation coefficient is negative, as suspected.
With respect to regression analysis, the Stata command that corresponds to ordinary least squares (OLS) is
the regress command. For example,
regress members rdues
produces a simple regression of members on rdues. That output was reported in the first chapter.
Linear regression
Sometimes we want to know more about an association between two variables than whether it is positive or
negative. Consider a policy analysis concerning the effect of putting more police on the street. Does it
really reduce crime? If the answer is yes, the policy at least doesn’t make things worse, but we also have to
know by how much it can be expected to reduce crime, in order to justify the expense. If adding 100,000
police nationwide only reduces crime by a few crimes, it fails the costbenefit test. So, we have to estimate
the expected change in crime from a change in the number of police.
Suppose we hypothesize that there is a linear relationship between crime and police such that

$$Y = \alpha + \beta X$$

where Y is crime, X is police, and $\beta < 0$. If this is true and we could estimate $\beta$, we could estimate the
effect of adding 100,000 police as follows. Since $\beta$ is the slope of the function,

$$\beta = \frac{\Delta Y}{\Delta X}$$

which means that

$$\Delta Y = \beta\,\Delta X = \beta(100{,}000)$$

This is the counterfactual (since we haven't actually added the police yet) that we need to estimate the
benefit of the policy. The cost is determined elsewhere.
So, we need an estimate of the slope of the line relating Y to X, not just its sign. A line is determined by two
parameters: we will use the slope, $\beta$, and the intercept, $\alpha$, as the two parameters we need.
As statisticians, as opposed to mathematicians, we know that real-world data do not line up on straight
lines. We can summarize the relationship between Y and X as

$$Y_i = \alpha + \beta X_i + U_i$$

where $U_i$ is the error associated with observation i.
Why do we need an error term? Why don’t observations line up on straight lines? There are two main
reasons.
1. The model is incomplete. We have left out some of the other factors that affect crime. At this point
let’s assume that the factors that we left out are “irrelevant” in the sense that if we included them
the basic form of the line would not change, that is, that the values of the estimated slope and
intercept would not change. Then the error term is simply the sum of a whole bunch of variables
that are each irrelevant, but taken together cause the error term to vary randomly.
2. The line might not be straight. That is, the line may be curved. We can get out of this either by
transforming the line (taking logs, etc.) or by simply assuming that the line is approximately linear for
our purposes.
So what are the values of $\alpha$ and $\beta$ (the parameters) that create a "good fit"? That is, which values of the
parameters best describe the cloud of data in the figure above?
Let’s define our estimate of u as ˆ α and our estimate of þ as
ˆ
β . Then for any observation, our predicted
value of Y is
ˆ ˆ
ˆ
i i
Y X α β = +
If we are wrong, the error, or residual, is the difference between what we thought Y would be and what is
actually was,
ˆ ˆ
ˆ ( )
i i i i i
e Y Y Y X α β = − = − + .
Clearly, we want to minimize the errors. We can’t get zero errors because that would require driving the
line through each point. Since they don’t line up, we would get one line per point, not a summary statistic.
We could minimize the sum of the errors, but that would mean that large positive errors could offset large
negative errors and we really couldn’t tell a good fit, with no error, from a bad fit, with huge positive and
negative errors that offset each other.
To solve this last problem we could square the errors or take their absolute values. If we minimize the sum
of squares of error, we are doing least squares. If we minimize the sum of absolute values we are doing
least absolute value estimation, which is beyond our scope. The least squares procedure has been around
since at least 1805 and it is the workhorse of applied statistics.
So, we want to minimize the sum of squares of error, or

$$\min_{\hat\alpha,\hat\beta}\ \sum e_i^2 = \sum\left(Y_i - \hat\alpha - \hat\beta X_i\right)^2$$
We could take derivatives with respect to $\hat\alpha$ and $\hat\beta$, set them equal to zero, and solve the resulting equations,
but that would be too easy. Instead, let's do it without calculus.
Suppose we have a quadratic formula and we want to minimize it by choosing a value for b:

$$f(b) = c_2 b^2 + c_1 b + c_0$$

Recall from high school the formula for completing the square:

$$f(b) = c_2\left(b + \frac{c_1}{2c_2}\right)^2 + \left(c_0 - \frac{c_1^2}{4c_2}\right)$$
Multiply it out if you are unconvinced. Since we are minimizing with respect to our choice of b, and b only
occurs in the squared term, the minimum occurs at

$$b = -\frac{c_1}{2c_2}$$

That is, the minimizing value of b is the ratio of the negative of the coefficient on the first power to two times the
coefficient on the second power of the quadratic.
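As a quick numerical sanity check of the completing-the-square result (a Python sketch with arbitrary made-up coefficients, not anything from the text), the formula $b = -c_1/(2c_2)$ does beat every nearby candidate value:

```python
# Minimize f(b) = c2*b^2 + c1*b + c0 and check the completing-the-square formula.
# The coefficients are arbitrary illustrative values.
c2, c1, c0 = 3.0, -12.0, 5.0

def f(b):
    return c2 * b**2 + c1 * b + c0

b_star = -c1 / (2 * c2)  # the formula derived by completing the square
print(b_star)            # 2.0 for these coefficients

# Brute-force check: no candidate on a fine grid does better
candidates = [i / 1000 for i in range(-5000, 5001)]
best = min(candidates, key=f)
print(best)              # also 2.0
```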
We want to minimize the sum of squares of error with respect to $\hat\alpha$ and $\hat\beta$, that is,

$$\min_{\hat\alpha,\hat\beta}\ \sum e^2 = \sum\left(Y - \hat\alpha - \hat\beta X\right)^2 = \sum\left(Y^2 + \hat\alpha^2 + \hat\beta^2 X^2 - 2\hat\alpha Y - 2\hat\beta XY + 2\hat\alpha\hat\beta X\right)$$
$$= \sum Y^2 - 2\hat\alpha\sum Y + N\hat\alpha^2 + \hat\beta^2\sum X^2 - 2\hat\beta\sum XY + 2\hat\alpha\hat\beta\sum X$$

Note that there are two quadratics, one in $\hat\alpha$ and one in $\hat\beta$. The one in $\hat\alpha$ is

$$f(\hat\alpha) = N\hat\alpha^2 - \hat\alpha\left(2\sum Y - 2\hat\beta\sum X\right) + C$$

where C is a constant consisting of terms not involving $\hat\alpha$. Minimizing $f(\hat\alpha)$ requires setting $\hat\alpha$ equal to
the negative of the coefficient on the first power divided by two times the coefficient on the second power,
namely,
$$\hat\alpha = \frac{2\left(\sum Y - \hat\beta\sum X\right)}{2N} = \frac{\sum Y}{N} - \hat\beta\,\frac{\sum X}{N} = \bar{Y} - \hat\beta\bar{X}$$
The quadratic form in $\hat\beta$ is

$$f(\hat\beta) = \hat\beta^2\sum X^2 - 2\hat\beta\sum XY + 2\hat\alpha\hat\beta\sum X = \hat\beta^2\sum X^2 - \hat\beta\,2\left(\sum XY - \hat\alpha\sum X\right) + C^*$$

where $C^*$ is another constant not involving $\hat\beta$. Setting $\hat\beta$ equal to the negative of the coefficient on the
first power divided by two times the coefficient on the second power yields

$$\hat\beta = \frac{2\left(\sum XY - \hat\alpha\sum X\right)}{2\sum X^2} = \frac{\sum XY - \hat\alpha\sum X}{\sum X^2}$$
Substituting the formula for $\hat\alpha$,

$$\hat\beta = \frac{\sum XY - \left(\bar{Y} - \hat\beta\bar{X}\right)\sum X}{\sum X^2} = \frac{\sum XY - \bar{Y}\sum X + \hat\beta\bar{X}\sum X}{\sum X^2}$$

Multiplying both sides by $\sum X^2$, and using $\sum X = N\bar{X}$, yields

$$\hat\beta\sum X^2 = \sum XY - \bar{Y}\sum X + \hat\beta\bar{X}\sum X = \sum XY - N\bar{X}\bar{Y} + \hat\beta N\bar{X}^2$$

$$\hat\beta\sum X^2 - \hat\beta N\bar{X}^2 = \sum XY - N\bar{X}\bar{Y}$$

$$\hat\beta\left(\sum X^2 - N\bar{X}^2\right) = \sum XY - N\bar{X}\bar{Y}$$

Solving for $\hat\beta$,

$$\hat\beta = \frac{\sum XY - N\bar{X}\bar{Y}}{\sum X^2 - N\bar{X}^2} = \frac{\sum xy}{\sum x^2}$$
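As a numerical check of the least squares formulas just derived, here is a Python sketch (rather than Stata, and with made-up data) showing that $\hat\beta = \sum xy / \sum x^2$ and $\hat\alpha = \bar{Y} - \hat\beta\bar{X}$ agree with NumPy's own degree-one least squares fit:

```python
import numpy as np

# Hypothetical data for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Deviations from the means
x = X - X.mean()
y = Y - Y.mean()

# The formulas derived in the text
beta_hat = np.sum(x * y) / np.sum(x**2)
alpha_hat = Y.mean() - beta_hat * X.mean()

# Compare to NumPy's least squares polynomial fit of degree 1
slope, intercept = np.polyfit(X, Y, 1)
print(beta_hat, alpha_hat)
print(slope, intercept)  # the same pair of numbers
```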
Correlation and regression
Let’s express the regression line in deviations.
Y X α β = +
Since
ˆ
ˆ Y X α β = − the regression line goes through the means:
Y X α β = +
Subtract the mean of Y from both sides of the regression line,
47
Y Y X X α β α β − = + − −
Or,
( ) y X X x β β = − =
Note that, by subtracting off the means and expressing x and y in deviations, we have "translated the axes"
so that the intercept is zero when the regression is expressed in deviations. The reason is, as we have seen
above, that the regression line goes through the means. Here is the graph of the pool membership data with the
regression line superimposed.

[Figure: scatter of members (110 to 140) against rdues (160 to 210) with the fitted regression line superimposed; legend: Fitted values, members]
Substituting $\hat\beta$ yields the predicted value of y ("Fitted values" in the graph):

$$\hat{y} = \hat\beta x$$
We know that the correlation coefficient between x and y is

$$r = \frac{\sum xy}{N S_x S_y}$$

Multiplying both sides by $S_y / S_x$ yields

$$r\,\frac{S_y}{S_x} = \frac{S_y}{S_x}\cdot\frac{\sum xy}{N S_x S_y} = \frac{\sum xy}{N S_x^2}$$
But,

$$N S_x^2 = N\left(\sqrt{\frac{\sum x^2}{N}}\right)^2 = \sum x^2$$

Therefore,

$$r\,\frac{S_y}{S_x} = \frac{\sum xy}{\sum x^2} = \hat\beta$$
So, $\hat\beta$ is a linear transformation of the correlation coefficient. Since $S_x > 0$ and $S_y > 0$, $\hat\beta$
must have the same sign as r. Since the regression line does everything that the correlation coefficient
does, plus yields the rate of change of y with respect to x, we will concentrate on regression from now on.
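The identity $\hat\beta = r\,S_y/S_x$ is also easy to confirm numerically. Here is a Python sketch with made-up data (population standard deviations, dividing by N, are assumed throughout, as in the text):

```python
import numpy as np

# Hypothetical data for illustration only
X = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
Y = np.array([10.0, 9.0, 7.0, 4.0, 1.0])
N = len(X)

x = X - X.mean()
y = Y - Y.mean()

# Population standard deviations (divide by N), as defined in the text
Sx = np.sqrt(np.sum(x**2) / N)
Sy = np.sqrt(np.sum(y**2) / N)
r = np.sum(x * y) / (N * Sx * Sy)

# OLS slope from the deviations formula
beta_hat = np.sum(x * y) / np.sum(x**2)
print(beta_hat)
print(r * Sy / Sx)  # the same number, and with the same (here negative) sign as r
```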
How well does the line fit the data?
We need a measure of the goodness of fit of the regression line. Start with the notion that each observation
of y is equal to the predicted value plus the residual:
$$y = \hat{y} + e$$

Squaring both sides,

$$y^2 = \hat{y}^2 + 2\hat{y}e + e^2$$

Summing,

$$\sum y^2 = \sum\hat{y}^2 + 2\sum\hat{y}e + \sum e^2$$
The middle term on the right-hand side is equal to zero:

$$\sum\hat{y}e = \sum\hat\beta xe = \hat\beta\sum xe$$

which is zero if $\sum xe = 0$. And indeed,

$$\sum xe = \sum x\left(y - \hat\beta x\right) = \sum xy - \hat\beta\sum x^2 = \sum xy - \frac{\sum xy}{\sum x^2}\sum x^2 = \sum xy - \sum xy = 0$$
This property of linear regression, namely that x is uncorrelated with the residual, turns out to be useful in
several contexts. In any case, the middle term is zero, and thus

$$\sum y^2 = \sum\hat{y}^2 + \sum e^2$$

Or,

$$\text{TSS} = \text{RSS} + \text{ESS}$$
where TSS is the total sum of squares, $\sum y^2$; RSS is the regression (explained) sum of squares, $\sum\hat{y}^2$; and
ESS is the error (unexplained, residual) sum of squares, $\sum e^2$.
Suppose we take the ratio of the explained sum of squares to the total sum of squares as our
measure of the goodness of fit:

$$\frac{\text{RSS}}{\text{TSS}} = \frac{\sum\hat{y}^2}{\sum y^2} = \frac{\hat\beta^2\sum x^2}{\sum y^2}$$
But, recall that

$$\hat\beta = r\,\frac{S_y}{S_x} = r\,\frac{\sqrt{\sum y^2 / N}}{\sqrt{\sum x^2 / N}}$$
Squaring both sides yields

$$\hat\beta^2 = r^2\,\frac{\sum y^2 / N}{\sum x^2 / N} = r^2\,\frac{\sum y^2}{\sum x^2}$$
Therefore,

$$\frac{\text{RSS}}{\text{TSS}} = \frac{\hat\beta^2\sum x^2}{\sum y^2} = r^2$$
The ratio RSS/TSS is the proportion of the variance of Y explained by the regression and is usually referred
to as the coefficient of determination. It is usually denoted R². We know why now: in simple regressions, the
coefficient of determination is equal to the correlation coefficient, squared. Unfortunately, this
neat relationship doesn't hold in multiple regressions (because there are several correlation coefficients
associating the dependent variable with each of the independent variables). However, the ratio of the
regression sum of squares to the total sum of squares is always denoted R².
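For simple regression, both the decomposition TSS = RSS + ESS and the identity R² = r² can be verified numerically. Here is a Python sketch with made-up data:

```python
import numpy as np

# Hypothetical data for illustration only
X = np.array([1.0, 3.0, 4.0, 6.0, 8.0, 9.0, 11.0, 14.0])
Y = np.array([1.0, 2.0, 4.0, 4.0, 5.0, 7.0, 8.0, 9.0])

x = X - X.mean()
y = Y - Y.mean()

beta_hat = np.sum(x * y) / np.sum(x**2)
y_hat = beta_hat * x                 # predictions in deviation form
e = y - y_hat                        # residuals

TSS = np.sum(y**2)
RSS = np.sum(y_hat**2)
ESS = np.sum(e**2)

R2 = RSS / TSS
r = np.corrcoef(X, Y)[0, 1]

print(R2, r**2)                      # the same number
print(TSS, RSS + ESS)                # the decomposition holds
```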
Why is it called regression?
In 1886 Sir Francis Galton, cousin of Darwin and discoverer of fingerprints among other things, published
a paper entitled, “Regression toward mediocrity in hereditary stature.” Galton’s hypothesis was that tall
parents should have tall children, but not as tall as they are, and short parents should produce short
offspring, but not as short as they are. Consequently, we should eventually all be the same, mediocre,
height. He knew that, to prove his thesis, he needed more than a positive association. He needed the slope
of a line relating the two heights to be less than one.
He plotted the deviations from the mean heights of adult children on the deviations from the mean heights
of their parents (multiplying the mother’s height by 1.08). Taking the resulting scatter diagram, he
eyeballed a straight line through the data and noted that the slope of the line was positive, but less than one.
He claimed that, "…we can define the law of regression very briefly: It is that the height-deviate of the
offspring is, on average, two-thirds of the height-deviate of its mid-parentage." (Galton, 1886,
Anthropological Miscellanea, p. 352.) Mid-parentage is the average height of the two parents, minus the
overall mean height, 68.5 inches. This is the regression toward mediocrity, now commonly called
regression to the mean. Galton called the line through the scatter of data a "regression line." The name has
stuck.
The original data collected by Galton are available on the web. (Actually, the data have a few missing values, and
some observations entered as "short," "medium," etc. have been dropped.) The scatter diagram is shown below, with
the regression line (estimated by OLS, not eyeballed) superimposed.

[Figure: scatter of meankids (0 to 15) against parents (0 to 15), with the fitted regression line; legend: meankids, Fitted values]
However, another question arises, namely, when did a line estimated by least squares become a regression
line? We know that least squares as a descriptive device was invented by Legendre in 1805 to describe
astronomical data. Galton was apparently unaware of this technique, although it was well known among
mathematicians. Galton’s younger colleague Karl Pearson, later refined Galton’s graphical methods into
the now universally used correlation coefficient. (Frequently referred to in textbooks as the Pearsonian
productmoment correlation coefficient.) However, least squares was still not being used. In 1897 George
Udny Yule, another, even younger colleague of Galton’s, published two papers in which he estimated what
he called a “line of regression” using ordinary least squares. It has been called the “regression line” ever
since. In the second paper he actually estimates a multiple regression model of poverty as a function of
welfare payments including control variables. The first econometric policy analysis.
The regression fallacy
Although Galton was in his time and is even now considered a major scientific figure, he was completely
wrong about the law of heredity he claimed to have discovered. According to Galton, the heights of fathers
of exceptionally tall children tend to be tall, but not as tall as their offspring, implying that there is a natural
tendency to mediocrity. This theorem was supposedly proved by the fact that the regression line has a slope
less than one.
Using Galton’s data from the scatter diagram above (the actual data is in Galton.dta). We can see that the
regression line does, in fact, have a slope less than one.
. regress meankids parents [fweight=number] [1]

      Source |       SS       df       MS              Number of obs =     899
-------------+------------------------------           F(  1,   897) =  376.21
       Model |  1217.79278     1  1217.79278           Prob > F      =  0.0000
    Residual |  2903.59203   897  3.23700338           R-squared     =  0.2955
-------------+------------------------------           Adj R-squared =  0.2947
       Total |  4121.38481   898  4.58951538           Root MSE      =  1.7992

------------------------------------------------------------------------------
    meankids |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     parents |    .640958   .0330457    19.40   0.000     .5761022    .7058138
       _cons |   2.386932   .2333127    10.23   0.000      1.92903    2.844835
------------------------------------------------------------------------------
The slope is .64, almost exactly that calculated by Galton, which confirms his hypothesis. Nevertheless,
the fact that the law is bogus is easy to see. Mathematically, the explanation is simple. We know that

$$\hat\beta = r\,\frac{S_y}{S_x}$$

If $S_y \cong S_x$ and $r > 0$, then

$$0 < \hat\beta \cong r < 1$$

This is the regression fallacy. The regression line slope estimate is always less than one for any two
variables with approximately equal variances.
Let’s see if this is true for the Galton data. The means and standard deviations are produced by the
summarize command. The correlation coefficient is produced by the correlate command.
. summarize meankids parents

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    meankids |       204    6.647265    2.678237          0         15
     parents |       197    6.826122    1.916343          2      13.03

. correlate meankids parents
(obs=197)

             | meankids  parents
-------------+------------------
    meankids |   1.0000
     parents |   0.4014   1.0000
So, the standard deviations are roughly comparable and the correlation coefficient is positive. We therefore expect
the slope of the regression line to be positive but less than one. Galton was simply demonstrating the mathematical
theorem with these data.
[1] We weight by the number of kids because some families have lots of kids (up to 15) and therefore
represent up to 15 observations, not one.
To take another example, if you do extremely well on the first exam in a course, you can expect to do well
on the second exam, but probably not quite as well. Similarly, if you do poorly, you can expect to do better.
The underlying theory is that ability is distributed randomly with a constant mean. Suppose ability is
distributed uniformly with mean 50, but on each test you experience a random error which is distributed
normally with mean zero and standard deviation 10. Thus,

$$x_1 = a + e_1$$
$$x_2 = a + e_2$$
$$E(a) = 50,\quad E(e_1) = 0,\quad E(e_2) = 0,\quad \mathrm{cov}(a,e_1) = \mathrm{cov}(a,e_2) = \mathrm{cov}(e_1,e_2) = 0$$
where $x_1$ is the grade on the first test, $x_2$ is the grade on the second test, a is ability, and $e_1$ and $e_2$ are the
random errors.

$$E(x_1) = E(a + e_1) = E(a) + E(e_1) = E(a) = 50$$
$$E(x_2) = E(a + e_2) = E(a) + E(e_2) = E(a) = 50$$
So, if you scored below 50 on the first exam, you should score closer to 50 on the second. If you scored
higher than 50, you should regress toward mediocrity. (Don’t be alarmed, these grades are not scaled, 50 is
a B.)
We can use Stata to prove this result with a Monte Carlo demonstration.
* choose a large number of observations
. set obs 10000
obs was 0, now 10000
* create a measure of underlying ability
. gen a=100*uniform()
* create a random error term for the first test
. gen x1=a+10*invnorm(uniform())
* now a random error for the second test
. gen x2=a+10*invnorm(uniform())
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           a |     10000    50.01914    28.88209   .0044471   99.98645
          x1 |     10000    50.00412     30.5206  -29.68982   127.2455
          x2 |     10000    49.94758    30.49167  -25.24831   131.0626

* note that the means are all 50 and that the standard deviations are about the same

. correlate x1 x2
(obs=10000)

             |       x1       x2
-------------+------------------
          x1 |   1.0000
          x2 |   0.8912   1.0000
. regress x2 x1

      Source |       SS       df       MS              Number of obs =   10000
-------------+------------------------------           F(  1,  9998) =38583.43
       Model |  7383282.55     1  7383282.55           Prob > F      =  0.0000
    Residual |  1913206.37  9998  191.358909           R-squared     =  0.7942
-------------+------------------------------           Adj R-squared =  0.7942
       Total |  9296488.92  9999  929.741866           Root MSE      =  13.833

------------------------------------------------------------------------------
          x2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |    .890335   .0045327   196.43   0.000     .8814501    .8992199
       _cons |   5.427161   .2655312    20.44   0.000     4.906666    5.947655
------------------------------------------------------------------------------
* the regression slope is equal to the correlation coefficient, as expected
Does this actually work in practice? I took the grades from the first two tests in a statistics course I taught
years ago and created a Stata data set called tests.dta. You can download the data set and look at the scores.
Here are the results.
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       test1 |        41    74.87805     17.0839         30        100
       test2 |        41    77.78049    13.54163         47        100

. correlate test2 test1
(obs=41)

             |    test2    test1
-------------+------------------
       test2 |   1.0000
       test1 |   0.3851   1.0000

. regress test2 test1

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  1,    39) =    6.79
       Model |  1087.97122     1  1087.97122           Prob > F      =  0.0129
    Residual |  6247.05317    39  160.180851           R-squared     =  0.1483
-------------+------------------------------           Adj R-squared =  0.1265
       Total |  7335.02439    40   183.37561           Root MSE      =  12.656

------------------------------------------------------------------------------
       test2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       test1 |   .3052753   .1171354     2.61   0.013     .0683465     .542204
       _cons |   54.92207    8.99083     6.11   0.000      36.7364    73.10774
------------------------------------------------------------------------------
Horace Secrist
Secrist was Professor of Statistics at Northwestern University in 1933 when, after ten years of effort
collecting and analyzing data on profit rates of businesses, he published his magnum opus, “The Triumph
of Mediocrity in Business.” In this 486-page tome he analyzed mountains of data and showed that, in every
industry, businesses with high profits tended, over time, to see their profits erode, while businesses with
lower than average profits showed increases. Secrist thought he had found a natural law of competition
(and maybe a fundamental cause of the depression, it was 1933 after all). The book initially received
favorable reviews in places like the Royal Statistical Society, the Journal of Political Economy, and the
Annals of the American Academy of Political and Social Science. Secrist was riding high.
It all came crashing down when Harold Hotelling, a well-known economist and mathematical statistician,
published a review in the Journal of the American Statistical Association. In the review Hotelling pointed
out that if the data were arranged according to the values taken at the end of the period, instead of the
beginning, the conclusions would be reversed. “The seeming convergence is a statistical fallacy, resulting
from the method of grouping. These diagrams really prove nothing more than that the ratios in question
have a tendency to wander about.” (Hotelling, JASA, 1933.) According to Hotelling the real test would be a
reduction in the variance of the series over time (not true of Secrist’s series). Secrist replied with a letter to
the editor of the JASA, but Hotelling was having none of it. In his reply to Secrist, Hotelling wrote,
This theorem is proved by simple mathematics. It is illustrated by genetic, astronomical, physical,
sociological, and other phenomena. To “prove” such a mathematical result by a costly and
prolonged numerical study of many kinds of business profit and expense ratios is analogous to
proving the multiplication table by arranging elephants in rows and columns, and then doing the
same for numerous other kinds of animals. The performance, though perhaps entertaining, and
having a certain pedagogical value, is not an important contribution to either zoology or to
mathematics. (Hotelling, JASA, 1934.)
Even otherwise competent economists commit this error. See Milton Friedman, 1992, “Do old fallacies
ever die?” Journal of Economic Literature 30:2129-2132. By the way, the “real test” of smaller variances
over time might need to be taken with a grain of salt if measurement errors are getting smaller over time.
Also, in the same article, Friedman, who was a student of Hotelling’s when Hotelling was engaged in his
exchange with Secrist, tells how he was inspired to avoid the regression fallacy in his own work. The result
was the permanent income hypothesis, which can be derived from the test examples above by simply
replacing test scores with consumption, ability with permanent income, and the random errors with
transitory income.
Some tools of the trade
We will be using these concepts routinely throughout the semester.
Summation and deviations
Let the lowercase variable $x_i = X_i - \bar{X}$, where $X$ is the raw data and $\bar{X}$ is the sample mean. Then:

(1) $\sum_{i=1}^{N} x_i = \sum (X_i - \bar{X}) = \sum X_i - N\bar{X} = \sum X_i - N\left(\frac{\sum X_i}{N}\right) = 0$

From now on, I will suppress the subscripts (which are always i = 1, ..., N, where N is the sample size).
(2) $\sum x^2 = \sum (X - \bar{X})^2 = \sum (X^2 - 2X\bar{X} + \bar{X}^2) = \sum X^2 - 2\bar{X}\sum X + N\bar{X}^2$

Note that $\sum X = N\left(\frac{\sum X}{N}\right) = N\bar{X}$, so that

$\sum x^2 = \sum X^2 - 2\bar{X}(N\bar{X}) + N\bar{X}^2 = \sum X^2 - 2N\bar{X}^2 + N\bar{X}^2 = \sum X^2 - N\bar{X}^2$
(3) $\sum xy = \sum (X - \bar{X})(Y - \bar{Y}) = \sum (XY - X\bar{Y} - \bar{X}Y + \bar{X}\bar{Y})$
$= \sum XY - \bar{Y}\sum X - \bar{X}\sum Y + N\bar{X}\bar{Y}$
$= \sum XY - N\bar{X}\bar{Y} - N\bar{X}\bar{Y} + N\bar{X}\bar{Y} = \sum XY - N\bar{X}\bar{Y}$

(4) $\sum xY = \sum (X - \bar{X})Y = \sum XY - \bar{X}\sum Y = \sum XY - N\bar{X}\bar{Y} = \sum xy$

(5) $\sum Xy = \sum X(Y - \bar{Y}) = \sum XY - \bar{Y}\sum X = \sum XY - N\bar{X}\bar{Y} = \sum xy = \sum xY$
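These identities are easy to verify numerically. Here is a quick cross-check in Python rather than Stata; the data are made up purely for illustration:

```python
# Numerical check of the deviation identities:
#   (2) sum(x^2) = sum(X^2) - N*Xbar^2
#   (3) sum(x*y) = sum(X*Y) - N*Xbar*Ybar
#   (4) sum(x*Y) = sum(x*y)   (5) sum(X*y) = sum(x*y)
# The data below are arbitrary, chosen only for illustration.
X = [2.0, 5.0, 7.0, 11.0]
Y = [1.0, 4.0, 6.0, 9.0]
N = len(X)
Xbar = sum(X) / N
Ybar = sum(Y) / N
x = [Xi - Xbar for Xi in X]          # deviations from the mean
y = [Yi - Ybar for Yi in Y]

assert abs(sum(x)) < 1e-12                                         # identity (1)
assert abs(sum(xi**2 for xi in x)
           - (sum(Xi**2 for Xi in X) - N * Xbar**2)) < 1e-9        # identity (2)
sxy = sum(xi * yi for xi, yi in zip(x, y))
assert abs(sxy - (sum(Xi * Yi for Xi, Yi in zip(X, Y))
                  - N * Xbar * Ybar)) < 1e-9                       # identity (3)
assert abs(sxy - sum(xi * Yi for xi, Yi in zip(x, Y))) < 1e-9      # identity (4)
assert abs(sxy - sum(Xi * yi for Xi, yi in zip(X, y))) < 1e-9      # identity (5)
print("all identities hold")
```

Any other data set would do; the identities are algebraic, not statistical.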
Expected value
The expected value of a random variable is the average result over repeated trials. It is a weighted average of the possible outcomes, where the weight attached to each outcome is the probability of that outcome:

$E(X) = p_1 X_1 + p_2 X_2 + \dots + p_N X_N = \sum_{i=1}^{N} p_i X_i = \mu_x, \qquad \sum_{i=1}^{N} p_i = 1$
For example, if you play roulette in Monte Carlo, the slots are numbered 0-36. However, the payoff if your
number comes up is 35 to 1. So, what is the expected value of a one dollar bet? There are only two relevant
outcomes, namely your number comes up and you win $35 or it doesn’t and you lose your dollar.

$E(X) = p_1 X_1 + p_2 X_2 = \left(\frac{36}{37}\right)(-1) + \left(\frac{1}{37}\right)(35) = .973(-1) + .027(35) = -.973 + .946 = -.027$

So, you can expect to lose about 2.7 cents every time you play. Of course, when you play you never lose
2.7 cents. You either win $35 or lose a dollar. However, over the long haul, you will eventually lose. If you
play 1000 times, you will win some and lose some, but on average, you will lose $27. At the same time, the
casino can expect to make $27 for every 1000 bets. This is known as the house edge and is how the casinos
make a profit. If the expected value of the game is nonzero, as this one is, the game is said not to be fair. A
fair game is one for which the expected value is zero. For example, we can make roulette a fair game by
increasing the payoff to $36:

$E(X) = p_1 X_1 + p_2 X_2 = \left(\frac{36}{37}\right)(-1) + \left(\frac{1}{37}\right)(36) = 0$
Let’s assume that all the outcomes are equally likely, so that $p_i = 1/N$. Then

$E(X) = \frac{1}{N}X_1 + \frac{1}{N}X_2 + \dots + \frac{1}{N}X_N = \frac{\sum_{i=1}^{N} X_i}{N} = \mu_x$

Note that the expected value of X is the true (population) mean, $\mu_x$, not the sample mean, $\bar{X}$. The expected
value of X is also called the first moment of X.
Operationally, to find an expected value, just take the mean.
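The roulette calculation can be checked by simulation. Below is a sketch in Python rather than Stata; the chosen number, seed, and number of trials are all arbitrary:

```python
import random

random.seed(12345)

# European roulette: 37 slots (0-36). A $1 straight-up bet pays $35 if your
# number hits; otherwise you lose the dollar. Expected value: -1/37, about -0.027.
trials = 1_000_000
my_number = 17                     # any single number; the choice is irrelevant
total = 0.0
for _ in range(trials):
    spin = random.randint(0, 36)   # each slot equally likely
    total += 35.0 if spin == my_number else -1.0

average_winnings = total / trials
print(round(average_winnings, 3))  # close to -0.027: about a 2.7-cent loss per play
```

The per-play outcomes are always +35 or -1, but the long-run average settles near the expected value, just as the text says.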
Expected values, means and variance
(1) If b is a constant, then $E(b) = b$.

(2) $E(bX) = \frac{bX_1 + bX_2 + \dots + bX_N}{N} = b\sum_{i=1}^{N} X_i / N = bE(X)$

(3) $E(a + bX) = a + bE(X)$

(4) $E(X + Y) = E(X) + E(Y) = \mu_x + \mu_y$

(5) $E(aX + bY) = aE(X) + bE(Y) = a\mu_x + b\mu_y$

(6) $E\left[(bX)^2\right] = b^2 E(X^2)$

(7) The variance of X is

$Var(X) = E(X - \mu_x)^2 = \frac{\sum (X - \mu_x)^2}{N}$

also written as $Var(X) = E\left(X - E(X)\right)^2$. The variance of X is also called the second moment of X.

(8) $Cov(X, Y) = E\left[(X - \mu_x)(Y - \mu_y)\right]$

(9) $Var(X + Y) = E\left[(X + Y) - (\mu_x + \mu_y)\right]^2 = E\left[(X - \mu_x) + (Y - \mu_y)\right]^2$
$= E\left[(X - \mu_x)^2 + (Y - \mu_y)^2 + 2(X - \mu_x)(Y - \mu_y)\right]$
$= Var(X) + Var(Y) + 2\,Cov(X, Y)$

(10) If X and Y are independent, then $E(XY) = E(X)E(Y)$.

(11) If X and Y are independent,
$Cov(X, Y) = E\left[(X - \mu_x)(Y - \mu_y)\right] = E\left[XY - \mu_y X - \mu_x Y + \mu_x\mu_y\right]$
$= E(XY) - \mu_y\mu_x - \mu_x\mu_y + \mu_x\mu_y$
$= E(XY) - E(X)E(Y) = 0$
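These rules can be spot-checked numerically. Here is a sketch in Python; the sample size and distributions are arbitrary, and the variance and covariance functions below use the population (divide-by-N) formulas from the text:

```python
import random

random.seed(7)

N = 200_000
X = [random.gauss(0.0, 1.0) for _ in range(N)]
Y = [random.gauss(0.0, 2.0) for _ in range(N)]   # generated independently of X

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

S = [xi + yi for xi, yi in zip(X, Y)]

# Rule (9): Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
# With the same divide-by-N formulas this holds exactly in any sample.
lhs = var(S)
rhs = var(X) + var(Y) + 2 * cov(X, Y)
assert abs(lhs - rhs) < 1e-6

# Rules (10)-(11): with X and Y independent, Cov(X, Y) is near zero
# (up to sampling error).
assert abs(cov(X, Y)) < 0.05
print("variance and covariance rules check out")
```

Rule (9) is an algebraic identity, so it holds exactly; the independence results hold only up to sampling noise.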
8 THEORY OF LEAST SQUARES
Ordinary least squares is almost certainly the most widely used statistical method ever created, aside maybe
from the mean. The Gauss-Markov theorem is the theoretical basis for its ubiquity.
Method of least squares
Least squares was first developed to help explain astronomical measurements. In the late 18th century,
scientists were looking at the stars through telescopes and trying to figure out where they were. They knew
that the stars didn’t move, but when they tried to measure where they were, different scientists got different
locations. So which observations are right? According to Adrien-Marie Legendre in 1805, “Of all the
principles that can be proposed for this purpose [analyzing large amounts of inexact astronomical data], I
think there is none more general, more exact, or easier to apply, than…making the sum of the squares of the
errors a minimum. By this method a kind of equilibrium is established among the errors which, since it
prevents the extremes from dominating, is appropriate for revealing the state of the system which most
nearly approaches the truth.”
Gauss published the method of least squares in his book on planetary orbits four years later in 1809. Gauss
claimed that he had been using the technique since 1795, thereby greatly annoying Legendre. It remains
unclear who thought of the method first. Certainly Legendre beat Gauss to print. Gauss did show, using a
Bayesian argument, that least squares is preferred if the errors are distributed normally (“Gaussian”).
Gauss published another derivation in 1823, which has since become the standard. This is the “Gauss” part
of the Gauss-Markov theorem. Markov published his version of the theorem in 1912, using different
methods, containing nothing that Gauss hadn’t done a hundred years earlier. Neyman, in the 1920s,
inexplicably unaware of the 1823 Gauss paper, called Markov’s results “Markov’s Theorem.” Somehow,
even though he was almost 100 years late, Markov’s name is still on the theorem.
We will investigate the GaussMarkov theorem below. But first, we need to know some of the properties of
estimators, so we can evaluate the least squares estimator.
Properties of estimators
An estimator is a formula for estimating the value of a parameter. For example, the formula for computing
the mean, $\bar{X} = \sum X / N$, is an estimator. The resulting value is the estimate. The formula for estimating the
slope of a regression line is $\hat{\beta} = \sum xy / \sum x^2$, and the resulting value is the estimate. So is $\hat{\beta}$ a good estimator?
The answer depends on the distribution of the estimator. Since we get a different value for $\hat{\beta}$ each time we
take a different sample, $\hat{\beta}$ is a random variable. The distribution of $\hat{\beta}$ across different samples is called the
sampling distribution of $\hat{\beta}$. The properties of the sampling distribution of $\hat{\beta}$ will determine whether $\hat{\beta}$ is a good
estimator or not.
There are two types of such properties. The so-called “small sample” properties are true for all sample
sizes, no matter how small. The large sample properties are true only as the sample size approaches
infinity.
Small sample properties
There are three important small sample properties: bias, efficiency, and mean square error.
Bias
Bias is the difference between the expected value of the estimator and the true value of the parameter:

$Bias = E(\hat{\beta}) - \beta$

Therefore, if $E(\hat{\beta}) = \beta$, the estimator is said to be unbiased. Obviously, unbiased estimators are preferred.
Efficiency
If, for a given sample size, the variance of $\hat{\beta}$ is smaller than the variance of any other unbiased estimator,
then $\hat{\beta}$ is said to be efficient. Efficiency is important because the more efficient the estimator, the lower its
variance, and the smaller the variance, the more precise the estimate. Efficient estimators allow us to make
more powerful statistical statements concerning the true value of the parameter being estimated. You can
visualize this by imagining a distribution with zero variance. How confident can we be in our estimate of
the parameter in this case?
The drawback to this definition of efficiency is that we are forced to compare unbiased estimators. It is
frequently the case, as we shall see throughout the semester, that we must compare estimators that may be
biased, at least in small samples. For this reason, we will use the more general concept of relative
efficiency.
The estimator $\hat{\beta}$ is efficient relative to the estimator $\tilde{\beta}$ if $Var(\hat{\beta}) < Var(\tilde{\beta})$.
From now on, when we say an estimator is efficient, we mean efficient relative to some other estimator.
Mean square error
We will see that we often have to compare the properties of two estimators, one of which is unbiased but
inefficient while the other is biased but efficient. The concept of mean square error is useful here. Mean
square error is defined as the variance of the estimator plus its squared bias, that is,

$mse(\hat{\beta}) = Var(\hat{\beta}) + \left[Bias(\hat{\beta})\right]^2$
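A classic illustration: for normal data, the variance estimator that divides by N is biased, while the divide-by-(N-1) version is unbiased but has a larger variance, and mean square error lets us compare them. A Monte Carlo sketch in Python (sample size, seed, and replication count are arbitrary):

```python
import random

random.seed(42)

true_var = 1.0          # variance of the population we sample from
n = 10                  # small sample, so the bias is visible
reps = 200_000

est_n, est_n1 = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(sample) / n
    ss = sum((xi - m) ** 2 for xi in sample)
    est_n.append(ss / n)         # biased: divides by N
    est_n1.append(ss / (n - 1))  # unbiased: divides by N - 1

def mse(estimates, truth):
    # mean square error = average squared distance from the truth,
    # which equals the variance of the estimator plus its squared bias
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

print("mse dividing by N:  ", round(mse(est_n, true_var), 4))
print("mse dividing by N-1:", round(mse(est_n1, true_var), 4))
# For normal data the biased /N estimator has the smaller mean square error.
```

This is exactly the situation described above: the biased estimator can win on the mean-square-error criterion because its variance is smaller.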
Large sample properties
Large sample properties are also known as asymptotic properties because they tell us what happens to the
estimator as the sample size goes to infinity. Ideally, we want the estimator to approach the truth, with
maximum precision, as $N \to \infty$.
Define the probability limit (plim) as the limit of a probability as the sample size goes to infinity, that is,

$\text{plim} = \lim_{N \to \infty} \Pr$

Consistency
A consistent estimator is one for which

$\text{plim}\,\hat{\beta} = \beta, \quad \text{that is,} \quad \lim_{N \to \infty} \Pr\left(\left|\hat{\beta} - \beta\right| < \varepsilon\right) = 1$

That is, as the sample size goes to infinity, the probability that the estimator will be infinitesimally different
from the truth is one, a certainty. This requires that the estimator be centered on the truth, with no variance.
A distribution with no variance is said to be degenerate, because it is no longer really a distribution, it is a
number.
A biased estimator can be consistent if the bias gets smaller as the sample size increases.
Since we require that both the bias and the variance go to zero for a consistent estimator, an alternative
definition is mean square error consistency. That is, an estimator is consistent if its mean square error goes
to zero as the sample size goes to infinity.
All of the small sample properties have large sample analogs. An estimator with bias that goes to zero as
N →∞ is asymptotically unbiased. An estimator can be asymptotically efficient relative to another
estimator. Finally, an estimator can be mean square error consistent.
Note that a consistent estimator is, in the limit, a number. As such it can be used in further derivations. For
example, the ratio of two consistent estimators is consistent. The property of consistency is said to “carry
over” to the ratio. This is not true of unbiasedness. All we know about an unbiased estimate is that its mean
is equal to the truth. We do not know if our estimate is equal to the truth. If we compute the ratio of two
unbiased estimates we don’t know anything about the properties of the ratio.
Gauss-Markov theorem
Consider the following simple linear model:

$Y_i = \alpha + \beta X_i + u_i$

where $\alpha$ and $\beta$ are true, constant, but unknown parameters and $u_i$ is a random variable.
The Gauss-Markov assumptions are (1) the observations on X are a set of fixed numbers, (2) $Y_i$ is
distributed with a mean of $\alpha + \beta X_i$ for each observation i, (3) the variance of $Y_i$ is constant ($\sigma^2$), and (4)
the observations on Y are independent of each other.
The basic idea behind these assumptions is that the model describes a laboratory experiment. The
researcher sets the value of X without error and then observes the value of Y, with some error. The
observations of Y come from a simple random sample, which means that the observation on $Y_1$ is
independent of $Y_2$, which is independent of $Y_3$, etc. The mean and the variance of Y do not change
between observations.
The Gauss-Markov assumptions, which are currently expressed in terms of the distribution of Y, can be
rephrased in terms of the distribution of the error term, u. We can do this because, under the Gauss-Markov
assumptions, everything in the model is constant except Y and u. Therefore, the variance of u is equal to
the variance of Y.
Also,

$u_i = Y_i - \alpha - \beta X_i$

so that the mean is

$E(u_i) = E(Y_i) - \alpha - \beta X_i = 0$

because $E(X_i) = X_i$ by assumption (1), which means that $E(Y_i) = \alpha + \beta X_i$ by assumption (2).
Therefore, if the Gauss-Markov assumptions are true, the error term u is distributed with mean zero and
variance $\sigma^2$. In fact, the GM assumptions can be succinctly written as

$Y_i = \alpha + \beta X_i + u_i, \qquad u_i \sim iid(0, \sigma^2)$

where the tilde means “is distributed as,” and iid means “independently and identically distributed.” The
independence assumption refers to the simple random sample, while the identical assumption means that the
mean and variance (and any other parameters) are constant.
Assumption (1) is very useful for proving the theorem, but it is not strictly necessary. The crucial
requirement is that X must be independent of u; that is, the covariance between X and u must be zero, i.e.,
$E(Xu) = 0$. Assumption (2) is crucial. Without it, least squares doesn’t work. Assumption (3) can be relaxed,
as we shall see in Chapter 9 below. Assumption (4) can also be relaxed, as we shall see in Chapter 12
below.
OK, so why is least squares so good? Here is the Gauss-Markov theorem: if the GM assumptions hold, then
the ordinary least squares estimates of the parameters are unbiased, consistent, and efficient relative to any
other linear unbiased estimator. The proofs of the unbiased and consistent parts are easy.
We know that the mean of u is zero. So,

$E(Y_i) = \alpha + \beta X_i$

The “empirical analog” of this (using the sample observations instead of the theoretical model) is

$\bar{Y} = \alpha + \beta\bar{X}$

where $\bar{Y}$ is the sample mean of Y and $\bar{X}$ is the sample mean of X. Therefore, we can subtract the
means to get

$Y_i - \bar{Y} = \alpha + \beta X_i + u_i - \alpha - \beta\bar{X}$

Let $y_i = Y_i - \bar{Y}$ and $x_i = X_i - \bar{X}$, where the lowercase letters indicate deviations from the mean. Then

$y_i = \beta x_i + u_i$

(recall that the mean of u is zero, so that $u_i - \bar{u} = u_i$). What we have done is translate the axes so that the
regression line goes through the origin.
Define e as the error. We want to minimize the sum of squared errors:

$e_i = y_i - \beta x_i$

$e_i^2 = \left(y_i - \beta x_i\right)^2$

$\sum_{i}^{N} e_i^2 = \sum_{i}^{N} \left(y_i - \beta x_i\right)^2$
We want to minimize the sum of squared errors by choosing $\beta$, so take the derivative with respect to $\beta$ and
set it equal to zero:

$\frac{d}{d\beta}\sum e_i^2 = \frac{d}{d\beta}\sum \left(y_i - \beta x_i\right)^2 = \frac{d}{d\beta}\sum \left(y_i^2 - 2\beta x_i y_i + \beta^2 x_i^2\right) = \sum\left(-2x_i y_i + 2\beta x_i^2\right) = -2\sum x_i y_i + 2\beta\sum x_i^2 = 0$
Divide both sides by 2 and rearrange:

$\sum x_i y_i - \beta\sum x_i^2 = 0$

$\beta\sum x_i^2 = \sum x_i y_i$

Solve for $\beta$ and let the solution value be $\hat{\beta}$:

$\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}$

This is the famous OLS estimator. Once we have an estimate of $\beta$, we can back out the OLS estimate of $\alpha$
using the means:

$\hat{Y} = \hat{\alpha} + \hat{\beta}X \implies \bar{Y} = \hat{\alpha} + \hat{\beta}\bar{X} \implies \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$

The first property of the OLS estimate $\hat{\beta}$ is that it is a linear function of the observations on Y:

$\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2} = \sum \frac{x_i}{\sum x_i^2}\, y_i = \sum w_i y_i, \qquad w_i = \frac{x_i}{\sum x_i^2}$

The formula for $\hat{\beta}$ is linear because the $w_i$ are constants. (Remember, by Gauss-Markov assumption (1), the
x’s are constant.)
$\hat{\beta}$ is an estimator. It is a formula for deriving an estimate. When we plug in the values of x and y from the
sample, we can compute the estimate, which is a numerical value.
Note that so far, we have only described the data with a linear function. We have not used our estimates to
make any inferences concerning the true value of $\alpha$ or $\beta$.
Sampling distribution of $\hat{\beta}$
Because y is a random variable and we now know that the OLS estimator $\hat{\beta}$ is a linear function of y, it
must be that $\hat{\beta}$ is a random variable. If $\hat{\beta}$ is a random variable, it has a distribution with a mean and a
variance. We call this distribution the sampling distribution of $\hat{\beta}$. We can use this sampling distribution to
make inferences concerning the true value of $\beta$. There is also a sampling distribution of $\hat{\alpha}$, but we seldom
want to make inferences concerning the intercept, so we will ignore it here.
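The deviation formulas for the slope and intercept can be verified in a few lines of Python; the data below are invented for illustration:

```python
# OLS slope and intercept from the deviation formulas:
#   beta_hat = sum(x*y) / sum(x^2),  alpha_hat = Ybar - beta_hat * Xbar
# Data are made up for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]
N = len(X)
Xbar, Ybar = sum(X) / N, sum(Y) / N
x = [Xi - Xbar for Xi in X]          # deviations from the mean
y = [Yi - Ybar for Yi in Y]

beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi**2 for xi in x)
alpha_hat = Ybar - beta_hat * Xbar
print(beta_hat, alpha_hat)           # slope about 1.99, intercept about 0.05
```

Any regression routine (Stata's regress, for example) applied to the same data would return the same two numbers; the deviation formulas are all there is to simple OLS.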
Mean of the sampling distribution of $\hat{\beta}$
The formula for $\hat{\beta}$ is

$\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}$

Substitute $y_i = \beta x_i + u_i$:

$\hat{\beta} = \frac{\sum x_i\left(\beta x_i + u_i\right)}{\sum x_i^2} = \frac{\beta\sum x_i^2 + \sum x_i u_i}{\sum x_i^2} = \beta + \frac{\sum x_i u_i}{\sum x_i^2}$

The mean of $\hat{\beta}$ is its expected value,

$E(\hat{\beta}) = E\left(\beta + \frac{\sum x_i u_i}{\sum x_i^2}\right) = \beta + \frac{\sum x_i E(u_i)}{\sum x_i^2} = \beta$

because $E(u_i) = 0$ under the GM assumptions.
The fact that the mean of the sampling distribution of $\hat{\beta}$ is equal to the truth means that the ordinary least
squares estimator is unbiased and thus a good basis on which to make inferences concerning the true $\beta$.
Note that the unbiasedness property does not depend on any assumption concerning the form of the
distribution of the error terms. We have not, for example, assumed that the errors are distributed normally.
(However, we will assume normally distributed errors below.) It also relies on only two of the GM
assumptions: that the x’s are fixed and that $E(u_i) = 0$. So we really don’t need the independence and
identical assumptions for unbiasedness.
However, the independence and identical assumptions are useful in deriving the property that OLS is
efficient. That is, if these two assumptions, along with the others, are true, then the OLS estimator is the
Best Linear Unbiased Estimator (BLUE). It can be shown that the OLS estimator has the smallest variance
of any linear unbiased estimator.
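We can see unbiasedness in action with a small Monte Carlo, in the spirit of the Stata demonstrations earlier in the chapter. The sketch below is in Python; the fixed x's, the true beta, and the number of replications are all arbitrary:

```python
import random

random.seed(2009)

# Fixed x's (GM assumption 1), already in deviation form; true beta is 2.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
beta = 2.0
reps = 100_000

draws = []
for _ in range(reps):
    u = [random.gauss(0.0, 1.0) for _ in x]       # fresh errors each sample
    y = [beta * xi + ui for xi, ui in zip(x, u)]  # y = beta*x + u
    b_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi**2 for xi in x)
    draws.append(b_hat)

mean_b = sum(draws) / reps
print(round(mean_b, 3))   # centered on the true beta = 2
```

Each individual estimate misses the truth, but the average of the estimates across samples sits on top of it: that is what unbiasedness means.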
Variance of $\hat{\beta}$
The variance of $\hat{\beta}$ can be derived as follows.

$Var(\hat{\beta}) = E\left(\hat{\beta} - E(\hat{\beta})\right)^2 = E\left(\hat{\beta} - \beta\right)^2$

We know that

$\hat{\beta} = \beta + \frac{\sum x_i u_i}{\sum x_i^2}$

So,

$\hat{\beta} - \beta = \frac{\sum x_i u_i}{\sum x_i^2} = \sum w_i u_i$

Expanding the square,

$E\left(\sum w_i u_i\right)^2 = E\left(w_1 u_1 + w_2 u_2 + \dots + w_N u_N\right)^2$
$= E\left(w_1^2 u_1^2 + w_2^2 u_2^2 + \dots + w_N^2 u_N^2 + 2w_1 w_2 u_1 u_2 + 2w_1 w_3 u_1 u_3 + \dots + 2w_{N-1} w_N u_{N-1} u_N\right)$
$= E\left(w_1^2 u_1^2 + w_2^2 u_2^2 + \dots + w_N^2 u_N^2\right)$

because the covariances among the errors are zero due to the independence assumption, that is,
$E(u_i u_j) = 0$ for $i \neq j$. Therefore,

$E\left(\sum w_i u_i\right)^2 = w_1^2 E(u_1^2) + w_2^2 E(u_2^2) + \dots + w_N^2 E(u_N^2) = w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + \dots + w_N^2\sigma_N^2$

However, $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_N^2 = \sigma^2$ because of the identical assumption. Thus,

$Var(\hat{\beta}) = \sigma^2\left(w_1^2 + w_2^2 + \dots + w_N^2\right) = \sigma^2\sum\left(\frac{x_i}{\sum x_i^2}\right)^2 = \sigma^2\,\frac{\sum x_i^2}{\left(\sum x_i^2\right)^2} = \frac{\sigma^2}{\sum x_i^2}$

Consistency of OLS
To be consistent, the mean square error of $\hat{\beta}$ must go to zero as the sample size, N, goes to infinity. The
mean square error of $\hat{\beta}$ is the sum of the variance plus the squared bias:

$mse(\hat{\beta}) = Var(\hat{\beta}) + \left[E(\hat{\beta}) - \beta\right]^2$

Obviously the bias is zero, since the OLS estimate is unbiased. Therefore the consistency of the ordinary
least squares estimator $\hat{\beta}$ depends on the variance of $\hat{\beta}$,

$Var(\hat{\beta}) = \sigma^2 \Big/ \sum_{i=1}^{N} x_i^2$

As N goes to infinity, $\sum_{i=1}^{N} x_i^2$ goes to infinity because we keep adding more and more squared terms.
Therefore, the variance of $\hat{\beta}$ goes to zero as N goes to infinity, the mean square error goes to zero, and the
OLS estimate is consistent.
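Consistency is easy to illustrate numerically: as N grows, the spread of the slope estimates across samples collapses toward zero. A Python sketch (the design, true beta, and replication counts are arbitrary; here the x's are drawn once per sample rather than held fixed, which is enough for illustration):

```python
import random

random.seed(1)

def var_of_ols_slope(N, reps=2000, beta=2.0):
    # Draw `reps` samples of size N from y = beta*x + u and return the
    # variance of the OLS slope estimates across samples.
    draws = []
    for _ in range(reps):
        X = [random.uniform(-1.0, 1.0) for _ in range(N)]
        Xbar = sum(X) / N
        x = [Xi - Xbar for Xi in X]                    # deviations
        u = [random.gauss(0.0, 1.0) for _ in range(N)]
        y = [beta * xi + ui for xi, ui in zip(x, u)]
        draws.append(sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x))
    m = sum(draws) / reps
    return sum((d - m) ** 2 for d in draws) / reps

v20, v500 = var_of_ols_slope(20), var_of_ols_slope(500)
print(v20, v500)   # the sampling variance shrinks sharply as N grows
```

With 25 times the data, the sampling variance falls by roughly a factor of 25, exactly the sigma-squared-over-sum-of-x-squared behavior derived above.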
Proof of the Gauss-Markov Theorem
According to the Gauss-Markov theorem, if the GM assumptions are true, the ordinary least squares
estimator is the Best Linear Unbiased Estimator (BLUE). That is, of all linear unbiased estimators, the OLS
estimator has the smallest variance.
Stated another way, any other linear unbiased estimator will have a larger variance than the OLS estimator.
The Gauss-Markov assumptions are as follows.
(1) The true relationship between y and x (expressed as deviations from the mean) is

$y_i = \beta x_i + u_i$

(2) The residuals are independent and the variance is constant: $u_i \sim iid(0, \sigma^2)$.
(3) The x’s are fixed numbers.
We know that the OLS estimator is a linear estimator,

$\hat{\beta} = \frac{\sum xy}{\sum x^2} = \sum wy$

where $w = x/\sum x^2$. $\hat{\beta}$ is unbiased and has a variance equal to $\sigma^2/\sum x^2$.
Consider any other linear unbiased estimator,

$\tilde{\beta} = \sum cy = \sum c\left(\beta x + u\right) = \beta\sum cx + \sum cu$

Taking expected values,

$E(\tilde{\beta}) = \beta\sum cx + \sum c\,E(u) = \beta\sum cx$

since E(u) = 0, as before. This estimator is unbiased if $\sum cx = 1$.
Note, since $\sum cx = 1$,

$\tilde{\beta} = \beta + \sum cu \quad \text{and} \quad \tilde{\beta} - \beta = \sum cu$

Find the variance of $\tilde{\beta}$:

$Var(\tilde{\beta}) = E\left(\tilde{\beta} - \beta\right)^2 = E\left(\sum cu\right)^2 = E\left(c_1^2 u_1^2 + c_2^2 u_2^2 + \dots + c_N^2 u_N^2\right) = \left(c_1^2 + c_2^2 + \dots + c_N^2\right)\sigma^2 = \sigma^2\sum c^2$
According to the theorem, this variance must be greater than or equal to the variance of $\hat{\beta}$.
Here is a useful identity:

$c = w + (c - w)$

$c^2 = w^2 + (c - w)^2 + 2w(c - w)$

$\sum c^2 = \sum w^2 + \sum(c - w)^2 + 2\sum w(c - w)$

The last term on the RHS is equal to zero:

$\sum w(c - w) = \sum wc - \sum w^2 = \frac{\sum cx}{\sum x^2} - \frac{\sum x^2}{\left(\sum x^2\right)^2} = \frac{1}{\sum x^2} - \frac{1}{\sum x^2} = 0$

since $\sum cx = 1$.
Therefore,

$Var(\tilde{\beta}) = \sigma^2\sum c^2 = \sigma^2\left(\sum w^2 + \sum(c - w)^2\right) = \frac{\sigma^2}{\sum x^2} + \sigma^2\sum(c - w)^2 = Var(\hat{\beta}) + \sigma^2\sum(c - w)^2$

So $Var(\tilde{\beta}) \geq Var(\hat{\beta})$ because $\sigma^2\sum(c - w)^2 \geq 0$. The only way the two variances can be equal is if
c=w, which is the OLS estimator.
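The proof can be illustrated by comparing the OLS weights w with any other weights c satisfying the unbiasedness condition. A Python sketch (the x's and the alternative weights are arbitrary choices):

```python
# Compare Var(beta_hat) = sigma^2 * sum(w^2) with the variance of another
# linear unbiased estimator, sigma^2 * sum(c^2), where sum(c*x) = 1.
sigma2 = 1.0
x = [-2.0, -1.0, 0.0, 1.0, 2.0]            # fixed regressors, deviation form
sxx = sum(xi * xi for xi in x)             # sum(x^2) = 10

w = [xi / sxx for xi in x]                 # OLS weights

# An arbitrary alternative: weight only the two extreme observations.
# It is still unbiased: sum(c*x) = (-0.25)(-2) + (0.25)(2) = 1.
c = [-0.25, 0.0, 0.0, 0.0, 0.25]
assert abs(sum(ci * xi for ci, xi in zip(c, x)) - 1.0) < 1e-12

var_ols = sigma2 * sum(wi * wi for wi in w)    # = sigma^2 / sum(x^2) = 0.1
var_alt = sigma2 * sum(ci * ci for ci in c)    # = 0.125
gap = sigma2 * sum((ci - wi) ** 2 for ci, wi in zip(c, w))
assert abs(var_alt - (var_ols + gap)) < 1e-12  # the decomposition in the proof
print(var_ols, var_alt)                        # OLS has the smaller variance
```

The gap term is exactly the sigma-squared-times-sum-of-(c-w)-squared penalty derived above; it vanishes only when c equals the OLS weights.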
Inference and hypothesis testing
Normal, Student’s t, Fisher’s F, and Chi-square
Before we can make inferences or test hypotheses concerning the true value of $\beta$, we need to review some
famous statistical distributions. We will be using all of the following distributions throughout the semester.
Normal distribution
The normal distribution is the familiar bell-shaped curve. With mean $\mu$ and variance $\sigma^2$, it has the formula

$f\left(X \mid \mu, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(X - \mu)^2}{2\sigma^2}}$

or simply $X \sim N(\mu, \sigma^2)$, where the tilde is read “is distributed as.”
If X is distributed normally, then any linear transformation of X is also distributed normally. That is, if
$X \sim N(\mu, \sigma^2)$, then $(a + bX) \sim N(a + b\mu,\, b^2\sigma^2)$. This turns out to be extremely useful. Consider the linear
function of X, $z = a + bX$. Let $a = -\mu/\sigma$ and $b = 1/\sigma$; then $z = a + bX = \frac{X - \mu}{\sigma}$, which
is the formula for a standard score, or z-score.
If we express X as a standard score, z, then z has a standard normal distribution because the mean of z is
zero and its standard deviation is one. That is,

$\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}$

Here is the graph.
We can generate a normal variable with 100,000 observations as follows.
. set obs 100000
obs was 0, now 100000
. gen y=invnorm(uniform())
. summarize y

    Variable |     Obs       Mean    Std. Dev.         Min        Max
    ---------+------------------------------------------------------
           y |  100000   .0042645    1.002622   -4.518918   4.448323

. histogram y
(bin=50, start=-4.518918, width=.17934483)
The Student’s t, Fisher’s F, and chi-square distributions are derived from the normal and arise as distributions of sums of
variables. They have associated with them their so-called degrees of freedom, which are the number of
terms in the summation.
Chi-square distribution
If z is distributed standard normal, then its square is distributed as chi-square with one degree of freedom.
Since chi-square is a distribution of squared values, and, being a probability distribution, the area under the
curve must sum to one, it has to be a skewed distribution of positive values, asymptotic to the horizontal
axis.
(1) If $z \sim N(0,1)$, then $z^2 \sim \chi^2(1)$, which has a mean of 1 and a variance of 2. Here is the graph.
[Figure: histogram of y, with density (0 to .4) on the vertical axis and y running from -4 to 4]
(2) If $X_1, X_2, \dots, X_n$ are independent $\chi^2(1)$ variables, then

$\sum_{i=1}^{n} X_i \sim \chi^2(n)$

with mean n and variance 2n.
(3) Therefore, if $z_1, z_2, \dots, z_n$ are independent standard normal, N(0,1), variables, then

$\sum_{i=1}^{n} z_i^2 \sim \chi^2(n)$

Here are the graphs for n = 2, 3, and 4.
(4) If $z_1, z_2, \dots, z_n$ are independent $N(0, \sigma^2)$ variables, then $\sum_{i=1}^{n}\left(\frac{z_i}{\sigma}\right)^2 \sim \chi^2(n)$.
(5) If $X_1 \sim \chi^2(n_1)$ and $X_2 \sim \chi^2(n_2)$ and $X_1$ and $X_2$ are independent, then $X_1 + X_2 \sim \chi^2(n_1 + n_2)$.
We can generate a chi-square variable by squaring a bunch of normal variables.
. set obs 10000
obs was 0, now 10000
. gen z1=invnorm(uniform())
. gen z2=invnorm(uniform())
. gen z3=invnorm(uniform())
. gen z4=invnorm(uniform())
. gen z5=invnorm(uniform())
. gen z6=invnorm(uniform())
. gen z7=invnorm(uniform())
. gen z8=invnorm(uniform())
. gen z9=invnorm(uniform())
. gen z10=invnorm(uniform())
. gen v=z1^2+z2^2+z3^2+z4^2+z5^2+z6^2+z7^2+z8^2+z9^2+z10^2
. * this creates a chi-square variable with n=10 degrees of freedom
. histogram v
(bin=40, start=1.0842115, width=.91651339)
F-distribution
(6) If $X_1 \sim \chi^2(n_1)$ and $X_2 \sim \chi^2(n_2)$ and $X_1$ and $X_2$ are independent, then

$\frac{X_1/n_1}{X_2/n_2} \sim F(n_1, n_2)$

This is Fisher’s famous F distribution. It is the ratio of two squared quantities, so it is positive and therefore
skewed, like the chi-square. Here are graphs of some typical F distributions.
We can create a variable with an F distribution as follows: generate five standard normal variables, then
sum their squares to make a chi-square variable (v1) with five degrees of freedom; finally, divide v1/5 by
v/10 to make a variable distributed as F with 5 degrees of freedom in the numerator and 10 degrees of
freedom in the denominator.
. gen x1=invnorm(uniform())
. gen x2=invnorm(uniform())
. gen x3=invnorm(uniform())
. gen x4=invnorm(uniform())
. gen x5=invnorm(uniform())
. gen v1=x1^2+x2^2+x3^2+x4^2+x5^2
. gen f=(v1/5)/(v/10)
. * f has 5 and 10 degrees of freedom
. histogram f if f<3 & f>=0
[Figure: histogram of f for 0 <= f < 3, with density on the vertical axis]
t-distribution
(7) If $z \sim N(0,1)$ and $X \sim \chi^2(n)$ and z and X are independent, then the ratio

$T = \frac{z}{\sqrt{X/n}} \sim t(n)$

This is Student’s t distribution. It can be positive or negative and is symmetric, like the normal distribution. It
looks like the normal distribution but with “fat tails,” that is, more probability in the tails of the distribution,
compared to the standard normal.
(8) If $T \sim t(n)$, then

$T^2 = \frac{z^2}{X/n} \sim F(1, n)$

because the squared t-ratio is the ratio of two chi-square variables, each divided by its degrees of freedom.
We can create a variable with a t distribution by dividing a standard normal variable by the square root of
an independent chi-square variable divided by its degrees of freedom.
. set obs 100000
obs was 0, now 100000
. gen z=invnorm(uniform())
. gen y1=invnorm(uniform())
. gen y2=invnorm(uniform())
. gen y3=invnorm(uniform())
. gen y4=invnorm(uniform())
. gen y5=invnorm(uniform())
. gen v=y1^2+y2^2+y3^2+y4^2+y5^2
. gen t=z/sqrt(v/5)
. histogram t if abs(t)<2, normal
(bin=49, start=-1.9989421, width=.08160601)
[Figure: histogram of t for |t| < 2, with density on the vertical axis and a normal density overlaid]
Asymptotic properties
(9) As n goes to infinity, $t(n) \to N(0,1)$.
(10) As $n_2 \to \infty$,

$n_1 F(n_1, n_2) = \frac{X_1}{X_2/n_2} \to \chi^2(n_1)$

because as $n_2 \to \infty$, $X_2/n_2 \to 1$, so that the denominator goes to one.
(11) As $N \to \infty$, $NR^2 \to \chi^2$, where

$R^2 = \frac{RSS}{TSS} = \frac{RSS/N}{TSS/N}$

and RSS is the regression (model) sum of squares. Then

$NR^2 = \frac{RSS}{TSS/N}$

As $N \to \infty$, $TSS/N$ converges to the variance of Y, $\sigma^2$, so that

$NR^2 \to \frac{RSS}{\sigma^2} \sim \chi^2$

(12) As $n_2 \to \infty$, $n_1 F \to \chi^2$. See (10) above.
Testing hypotheses concerning
So, we can summarize the sampling distribution of as
( )
2 2
ˆ
~ ,
i
iid x β β σ ¯ . If we want to actually use
this distribution to make inferences, we need to make an assumption concerning the form of the function.
Let’s make the usual assumption that the errors in the original regression model are distributed normally.
That is, where the N stands for normal. This assumption can also be written as
2
~ (0, )
i
u IN σ which is translated as “u
i
is distributed independent normal with mean zero and variance
2
σ
.” (We don’t need the identical assumption because the normal distribution has a constant variance.) Since
ˆ
β is a linear function of the error terms and the error terms are distributed normally, then
ˆ
β is distributed
normally, that is,
( )
2 2
ˆ
~ ,
i
IN x β β σ ¯ . We are now in position to make inferences concerning the true
value of þ.
To test the null hypothesis that
0 0
where β β β = is some hypothesized value of þ, construct the test statistic,
2
0
ˆ
2
ˆ
ˆ
ˆ
where
i
T S
S x
β
β
β β σ −
= =
¯
is the standard error of
ˆ
β , the square root of its variance.
2
2 1
ˆ
N k
i
i
e
N k
σ
−
=
=
−
¯
where
2
i
e is the squared residual corresponding to the i’th observation, N is the sample size,
and k is the number of parameters in the regression. In this simple regression, k=2 because we have two
parameters, u and þ.
2
ˆ σ is the estimated error variance. It is equal to the error sum of squares divided by
the degrees of freedom (Nk) and is also known as the mean square error.
ˆ
β
( )
2
~ 0,
i
u iidN σ
74
Notice that the test statistic, also known as a tratio, is the ratio of a normally distributed random variable,
ˆ
β , to a chisquare variable (the sum of a bunch of squared, normally distributed, error terms).
Consequently, it is distributed as student’s t with Nk degrees of freedom.
By far the most common test is the significance test, namely that X and Y are unrelated. To do this test, set $\beta_0 = 0$ so that the test statistic reduces to the ratio of the estimated parameter to its standard error:

$$ T = \frac{\hat\beta}{S_{\hat\beta}} $$

This is the "t" reported in the regress output in Stata.
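The arithmetic behind the t-ratio can be sketched in a few lines. This is an illustrative Python translation of the formulas above (the book itself works in Stata); the data are invented for the example.

```python
# Illustrative sketch (not the book's Stata code) of the t-ratio for a simple
# regression y = alpha + beta*x + u, using the formulas in the text.

def t_ratio(x, y):
    n = len(x)
    k = 2                                    # two parameters: alpha and beta
    xbar = sum(x) / n
    ybar = sum(y) / n
    xd = [xi - xbar for xi in x]             # deviations from the mean
    yd = [yi - ybar for yi in y]
    beta = sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)
    alpha = ybar - beta * xbar
    resid = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)   # mean square error
    se_beta = (sigma2 / sum(a * a for a in xd)) ** 0.5
    return beta, se_beta, beta / se_beta

# invented data for the example
beta, se, t = t_ratio([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
print(round(beta, 3), round(t, 1))
```

A t-ratio this large would lead us to reject the null hypothesis that β = 0 at any conventional significance level.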
Degrees of Freedom
Karl Pearson, who discovered the chisquare distribution around 1900, applied it to a variety of problems
but couldn’t get the parameter value quite right. Pearson, and a number of other statisticians, kept requiring
ad hoc adjustments to their chisquare applications. Sir Ronald Fisher solved the problem in 1922 and
called the parameter degrees of freedom. The best explanation of the concept that I know of is due to
Gerrard Dallal (http://www.tufts.edu/~gdallal/dof.htm). Suppose we have a data set with N observations.
We can use the data in two ways, to estimate the parameters or to estimate the variance. If we use data
points to estimate the parameters, we can’t use them to estimate the variance.
Suppose we are trying to estimate the mean of a data set with X = (1, 2, 3). The sum is 6 and the mean is 2. Once we know this, we can find any data point knowing only the other two. That is, we used up one degree of freedom to estimate the mean and we only have two left to estimate the variance. To take another example, suppose we are estimating a simple regression model with ordinary least squares. We estimate the slope with the formula $\hat\beta = \sum xy/\sum x^2$ and the intercept with the formula $\hat\alpha = \bar Y - \hat\beta\bar X$, using up two data points. This leaves the remaining N−2 data points to estimate the variance. We can also look at the regression problem this way: it takes two points to determine a straight line. If we only have two points, we can solve for the slope and intercept, but this is a mathematical problem, not a statistical one. If we have three data points, then there are many lines that could be defined as "fitting" the data. There is one degree of freedom that could be used to estimate the variance.
Our simple rule is that the number of degrees of freedom is the sample size minus the number of parameters estimated. In the example of the mean, we have one parameter to estimate, yielding N−1 data points to estimate the variance. Therefore, the formula for the sample variance of X (we have necessarily already used the data once to compute the sample mean) is

$$ S_x^2 = \frac{\sum (X - \bar X)^2}{N-1} $$

where we divide by the number of degrees of freedom (N−1), not the sample size (N). To prove this, we have to demonstrate that this is an unbiased estimator for the variance of X (and therefore, dividing by N yields a biased estimator).
$$ (X - \bar X)^2 = \big((X - \mu) - (\bar X - \mu)\big)^2 $$

$$ \sum(X - \bar X)^2 = \sum(X - \mu)^2 - 2(\bar X - \mu)\sum(X - \mu) + N(\bar X - \mu)^2 $$

because $\mu$ and $\bar X$ are both constants. Also,

$$ \sum(X - \mu) = \sum X - N\mu = N\bar X - N\mu = N(\bar X - \mu) $$

which means that the middle term in the quadratic form above is

$$ -2(\bar X - \mu)\sum(X - \mu) = -2(\bar X - \mu)\,N(\bar X - \mu) = -2N(\bar X - \mu)^2 $$

and therefore,

$$ \sum(X - \bar X)^2 = \sum(X - \mu)^2 - 2N(\bar X - \mu)^2 + N(\bar X - \mu)^2 = \sum(X - \mu)^2 - N(\bar X - \mu)^2 $$
Now we can take the expected value of the proposed formula,

$$ E\!\left[\frac{\sum(X - \bar X)^2}{N-1}\right] = \frac{1}{N-1}E\!\left[\sum(X - \mu)^2\right] - \frac{N}{N-1}E\!\left[(\bar X - \mu)^2\right] $$

Note that the second term is the variance of the mean of X:

$$ E(\bar X - \mu)^2 = Var(\bar X) $$

But,

$$ Var(\bar X) = Var\!\left(\frac{1}{N}\sum X\right) = \frac{1}{N^2}Var(X_1 + X_2 + \dots + X_N) = \frac{N\sigma_x^2}{N^2} = \frac{\sigma_x^2}{N} $$

Therefore,

$$ E\!\left[\frac{\sum(X - \bar X)^2}{N-1}\right] = \frac{N}{N-1}\sigma_x^2 - \frac{N}{N-1}\frac{\sigma_x^2}{N} = \frac{N}{N-1}\sigma_x^2 - \frac{1}{N-1}\sigma_x^2 = \frac{N-1}{N-1}\sigma_x^2 = \sigma_x^2 $$

Note (for later reference),

$$ E\!\left(\sum x^2\right) = E\!\left(\sum(X - \bar X)^2\right) = (N-1)\sigma_x^2. $$
Estimating the variance of the error term

We want to prove that the formula

$$ \hat\sigma^2 = \frac{\sum e^2}{N-2} $$

where e is the residual from a simple regression model, yields an unbiased estimate of the true error variance.
In deviations,

$$ e = y - \hat\beta x $$

But,

$$ y = \beta x + u $$

So,

$$ e = \beta x + u - \hat\beta x = u - (\hat\beta - \beta)x $$

Squaring both sides,

$$ e^2 = u^2 - 2(\hat\beta - \beta)xu + (\hat\beta - \beta)^2 x^2 $$

Summing,

$$ \sum e^2 = \sum u^2 - 2(\hat\beta - \beta)\sum xu + (\hat\beta - \beta)^2 \sum x^2 $$

Taking expected values,

$$ E\sum e^2 = E\sum u^2 - 2E\!\left[(\hat\beta - \beta)\sum xu\right] + E\!\left[(\hat\beta - \beta)^2\right]\sum x^2 $$

Note that the variance of $\hat\beta$ is

$$ Var(\hat\beta) = E(\hat\beta - \beta)^2 = \frac{\sigma^2}{\sum x^2} $$

which means that the third term of the expectation expression is

$$ E\!\left[(\hat\beta - \beta)^2\right]\sum x^2 = \frac{\sigma^2}{\sum x^2}\sum x^2 = \sigma^2 $$

We know from the theory of least squares that

$$ \hat\beta - \beta = \frac{\sum xu}{\sum x^2} $$

which implies that

$$ \sum xu = (\hat\beta - \beta)\sum x^2 $$
Therefore the second term in the expectation expression is
$$ -2E\!\left[(\hat\beta - \beta)\sum xu\right] = -2E\!\left[(\hat\beta - \beta)^2\sum x^2\right] = -2\frac{\sigma^2}{\sum x^2}\sum x^2 = -2\sigma^2 $$

The first term in the expectation expression is

$$ E\sum u^2 = (N-1)\sigma^2 $$

because the unbiased estimator of the variance of x is $S_x^2 = \sum x^2/(N-1)$, so that the expected value of $\sum x^2$ is $E\sum x^2 = (N-1)S_x^2$. In the current application the variable playing the role of x is u (in deviations from its mean) and $S_x^2 = \sigma^2$.
Therefore, putting all three terms together, we get

$$ E\sum e^2 = (N-1)\sigma^2 - 2\sigma^2 + \sigma^2 = (N-2)\sigma^2 $$

which implies that the unbiased estimate of the error variance is

$$ \hat\sigma^2 = \frac{\sum e^2}{N-2} $$

since

$$ E\hat\sigma^2 = E\!\left[\frac{\sum e^2}{N-2}\right] = \frac{E\sum e^2}{N-2} = \frac{(N-2)\sigma^2}{N-2} = \sigma^2 $$

If we think of each of the squared residuals as an estimate of the variance of the error term, then the formula tells us to take the average. However, we only have N−2 independent observations on which to base our estimate of the variance, so we divide by N−2 when we take the average.
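This unbiasedness result can also be checked by simulation. The following Python sketch (illustrative, not from the book; the true parameters are arbitrary) averages the estimate over many simulated regressions.

```python
# Monte Carlo check: in a simple regression, the residual sum of squares
# divided by N-2 is an unbiased estimate of the error variance.
import random

random.seed(2)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
N = len(x)
alpha, beta, sigma2 = 2.0, 0.5, 9.0        # arbitrary true parameters
xbar = sum(x) / N
xd = [xi - xbar for xi in x]
sxx = sum(d * d for d in xd)

total = 0.0
REPS = 100_000
for _ in range(REPS):
    y = [alpha + beta * xi + random.gauss(0, sigma2 ** 0.5) for xi in x]
    ybar = sum(y) / N
    b = sum(d * (yi - ybar) for d, yi in zip(xd, y)) / sxx
    a = ybar - b * xbar
    ess = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    total += ess / (N - 2)
print(round(total / REPS, 1))              # close to the true variance, 9
```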
Chebyshev's Inequality

Arguably the most remarkable theorem in statistics, Chebyshev's inequality states that, for any variable x, from any arbitrary distribution, the probability that an observation lies more than some distance, ε, from the mean must be less than or equal to $\sigma_x^2/\varepsilon^2$. That is,

$$ P(|X - \mu| \ge \varepsilon) \le \frac{\sigma_x^2}{\varepsilon^2} $$
The following diagram might be helpful.

[Diagram omitted: observations $X_i$ scattered around the mean $\mu$, with the interval $\mu \pm \varepsilon$ marked.]
Proof

The variance of x can be written as

$$ \sigma_x^2 = \frac{\sum(X - \mu)^2}{N} = \frac{1}{N}\sum(X - \mu)^2 $$

if all observations have the same probability (1/N) of being observed. More generally,

$$ \sigma_x^2 = \sum(X - \mu)^2 P(X) $$

Suppose we limit the summation on the right-hand side to those values of X that lie outside the interval $\mu \pm \varepsilon$. Then the new summation could not be larger than the original summation, which is the variance of x, because we are omitting some (positive, squared) terms.

$$ \sigma_x^2 = \sum(X - \mu)^2 P(X) \ge \sum_{|X-\mu|\ge\varepsilon}(X - \mu)^2 P(X) $$
We also know that all of the X's in the limited summation are at least ε away from the mean. (Some of them are much further than ε away.) Therefore, replacing $(X - \mu)^2$ with $\varepsilon^2$,

$$ \sigma_x^2 \ge \sum_{|X-\mu|\ge\varepsilon}(X - \mu)^2 P(X) \ge \sum_{|X-\mu|\ge\varepsilon}\varepsilon^2 P(X) = \varepsilon^2\sum_{|X-\mu|\ge\varepsilon} P(X) = \varepsilon^2 P(|X - \mu| \ge \varepsilon) $$
So,

$$ \sigma_x^2 \ge \varepsilon^2 P(|X - \mu| \ge \varepsilon) $$

Dividing both sides by $\varepsilon^2$,

$$ \frac{\sigma_x^2}{\varepsilon^2} \ge P(|X - \mu| \ge \varepsilon) $$

Turning it around, we get

$$ P(|X - \mu| \ge \varepsilon) \le \frac{\sigma_x^2}{\varepsilon^2} $$

QED
The theorem is often stated as follows. For any set of data and any constant k > 1, at least $1 - 1/k^2$ of the observations must lie within k standard deviations of the mean.

To prove this version, let $\varepsilon = k\sigma_x$:

$$ P(|X - \mu| \ge k\sigma_x) \le \frac{\sigma_x^2}{k^2\sigma_x^2} = \frac{1}{k^2} $$

Since this is the probability that the observations lie more than k standard deviations from the mean, the probability that the observations lie less than k sd from the mean is one minus this probability. That is,

$$ P(|X - \mu| \le k\sigma_x) \ge 1 - \frac{1}{k^2} $$

So, if we have any data whatever, and we set k = 2, we know that at least

$$ 1 - \frac{1}{4} = \frac{3}{4} = 75\% $$

of the data lie within two standard deviations of the mean.
(We can be more precise if we know the distribution. For example, 95% of normally distributed data lie
within two standard deviations of the mean.)
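The bound is easy to check on any sample, no matter how non-normal. This Python sketch (illustrative, not from the book) draws from a deliberately skewed exponential distribution and verifies the k = 2 bound.

```python
# Check Chebyshev's inequality empirically: for k = 2, at least 75% of the
# observations must lie within two standard deviations of the mean.
import random

random.seed(3)
data = [random.expovariate(1.0) for _ in range(10_000)]   # skewed distribution
n = len(data)
mean = sum(data) / n
sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
k = 2
within = sum(abs(x - mean) <= k * sd for x in data) / n
print(within >= 1 - 1 / k**2)              # True: the Chebyshev bound holds
```

For this particular distribution the actual share within two standard deviations is well above the 75% guaranteed by the theorem, which is exactly the point: Chebyshev is a worst-case bound.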
Law of Large Numbers
The law of large numbers states that if a situation is repeated again and again, the proportion of successful
outcomes will tend to approach the constant probability that any one of the outcomes will be a success.
For example, suppose we have been challenged to guess how many thumbtacks will end up face down if
100 are dropped off a table. (Answer: 50.)
How would we prepare for such a challenge? Toss a thumbtack into the air many times and record how many times it ends up face down. Divide that count by the number of trials and you have the proportion. Multiply by 100 and that is your guess.
Suppose we are presented with a large box full of black and white balls. We want to know the proportion of
black balls, but we are not allowed to look inside the box. What do we do? Answer: sample with
replacement. The number of black balls that we draw divided by the total number of balls drawn will give
us an estimate. The more balls we draw, the better the estimate.
Mathematically, the theorem is:

Let $X_1, X_2, \dots, X_N$ be an independent trials process where $E(X_i) = \mu$ and $Var(X_i) = \sigma_x^2$. Let $S_N = X_1 + X_2 + \dots + X_N$. Then, for any $\varepsilon > 0$,

$$ P\!\left(\left|\frac{S_N}{N} - \mu\right| \ge \varepsilon\right) \to 0 \quad\text{as } N \to \infty $$

Equivalently,

$$ P\!\left(\left|\frac{S_N}{N} - \mu\right| < \varepsilon\right) \to 1 \quad\text{as } N \to \infty $$
Proof:

We know that

$$ Var(S_N) = Var(X_1 + X_2 + \dots + X_N) = N\sigma_x^2 $$

because there is no covariance between the X's. Therefore,

$$ Var\!\left(\frac{S_N}{N}\right) = Var\big((X_1 + X_2 + \dots + X_N)/N\big) = \frac{N\sigma_x^2}{N^2} = \frac{\sigma_x^2}{N} $$

Also,

$$ E\!\left(\frac{S_N}{N}\right) = E\!\left(\frac{X_1 + X_2 + \dots + X_N}{N}\right) = \frac{N\mu}{N} = \mu $$

By Chebyshev's inequality we know that, for any $\varepsilon > 0$,

$$ P\!\left(\left|\frac{S_N}{N} - \mu\right| \ge \varepsilon\right) \le \frac{Var(S_N/N)}{\varepsilon^2} = \frac{\sigma_x^2}{N\varepsilon^2} \to 0 \quad\text{as } N \to \infty $$

Equivalently, the probability that the difference between $S_N/N$ and the true mean is less than $\varepsilon$ is one minus this. That is,

$$ P\!\left(\left|\frac{S_N}{N} - \mu\right| < \varepsilon\right) \to 1 \quad\text{as } N \to \infty $$
Since S/N is an average, the law of large numbers is often called the law of averages.
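The coin-flip version of the law of averages can be simulated directly. This Python sketch (illustrative, not from the book) shows the proportion of heads settling near the true probability.

```python
# Law of large numbers: the proportion of heads in fair coin flips approaches
# the true probability 0.5 as the number of trials grows.
import random

random.seed(4)
N = 100_000
heads = sum(random.random() < 0.5 for _ in range(N))
prop = heads / N
print(round(prop, 2))                      # close to the true probability, 0.5
```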
Central Limit Theorem

This is arguably the most amazing theorem in mathematics.

Statement:

Let x be a random variable distributed according to f(x) with mean $\mu$ and variance $\sigma_X^2$. Let $\bar X$ be the mean of a random sample of size N from f(x). Let

$$ z = \frac{\bar X - \mu}{\sigma_X/\sqrt{N}} $$

The density of z will approach the standard normal (mean zero and variance one) as N goes to infinity.

The incredible part of this theorem is that no restriction is placed on the distribution of x. No matter how x is distributed (as long as it has a finite variance), the sample mean of a large sample will be distributed normally.
Not only that, but the sample doesn’t really have to be that large. It used to be that a sample of 30 was
thought to be enough. Nowadays we usually require 100 or more. Infinity comes quickly for the normal
distribution.
The CLT is the reason that the normal distribution is so important.
The theorem was first proved by DeMoivre in 1733 but was promptly forgotten. It was resurrected in 1812
by LaPlace, who derived the normal approximation to the binomial distribution, but was still mostly
ignored until Lyapunov in 1901 generalized it and showed how it worked mathematically. Nowadays, the
central limit theorem is considered to be the unofficial sovereign of probability theory.
One of the few people who didn't ignore it in the 19th century was Sir Francis Galton, who was a big fan.
He wrote in Natural Inheritance, 1889:
I know of scarcely anything so apt to impress the imagination as the wonderful form of
cosmic order expressed by the "Law of Frequency of Error". The law would have been
personified by the Greeks and deified, if they had known of it. It reigns with serenity and
in complete self-effacement, amidst the wildest confusion. The huger the mob, and the
greater the apparent anarchy, the more perfect is its sway. It is the supreme law of
Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled
in the order of their magnitude, an unsuspected and most beautiful form of regularity
proves to have been latent all along.
We can't prove this theorem here. However, we can prove it for certain cases using Monte Carlo methods. For example, the binomial distribution is decidedly non-normal since values must be either zero or one. However, according to the theorem, the means of a large number of samples will be distributed normally.

Here is the histogram of the binomial distribution with an equal probability of generating a one or a zero.
[Histogram of x omitted: all of the density is at 0 and 1.]
Not normal.

I asked Stata to generate 10,000 samples of size N=100 from this distribution. I then computed the mean and variance of the resulting 10,000 observations.

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         sum |     10000     50.0912    4.981402         31         72

I then computed the z-scores by taking each observation, subtracting the mean (50.0912) and dividing by the standard deviation. The histogram of the resulting data is:

[Histogram of z omitted: bell-shaped and centered at zero.]

This is the standard normal distribution with mean zero and standard deviation one.

. summarize z
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           z |     10000    3.57e-07           1   -3.832496   4.398119
Here is another example. This is the distribution of the population of cities in the United States with population of 100,000 or more.

[Histogram of city populations omitted: highly skewed, with populations ranging up to about 8,000,000.]

It is obviously not normal, being highly skewed. There are very few very large cities and a great number of relatively small cities. Suppose we treat this distribution as the population. I told Stata to take 10,000 samples of size N=200 from this data set. I then computed the mean and standard deviation of the resulting data.

. summarize mx

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          mx |     10000      287657    41619.64    193870.2   497314.6

I then computed the corresponding z-scores. According to the Central Limit Theorem, the z-scores should have a zero mean and unit variance.

. summarize z

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           z |     10000    1.37e-07           1   -2.253426    5.03747

As expected, the mean is zero and the standard deviation is one. Here is the histogram.

[Histogram of z omitted.]
It still looks a little skewed, but it is very close to the standard normal.
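The same experiment is easy to reproduce. This Python sketch (an illustration, rather than the book's Stata runs) draws 10,000 samples of size 200 from a skewed exponential distribution, standardizes the sample means, and checks that they behave like a standard normal.

```python
# Central limit theorem by simulation: standardized means of samples from a
# skewed distribution should be approximately standard normal.
import random

random.seed(5)
N, SAMPLES = 200, 10_000
means = []
for _ in range(SAMPLES):
    draw = [random.expovariate(1.0) for _ in range(N)]   # skewed population
    means.append(sum(draw) / N)
m = sum(means) / SAMPLES
sd = (sum((x - m) ** 2 for x in means) / SAMPLES) ** 0.5
z = [(x - m) / sd for x in means]
# for a standard normal, roughly 95% of observations lie within two sd of zero
share = sum(abs(v) < 2 for v in z) / SAMPLES
print(round(share, 2))
```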
Method of maximum likelihood

Suppose we have the following simple regression model:

$$ Y_i = \alpha + \beta X_i + U_i, \qquad U_i \sim IN(0, \sigma^2) $$

So, the Y are distributed normally. The probability of observing the first Y is

$$ \Pr(Y_1) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{U_1^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(Y_1 - \alpha - \beta X_1)^2}{2\sigma^2}} $$
The probability of observing the second Y is exactly the same, except for the subscript. The sample consists of N observations. The probability of observing the sample is the probability of observing $Y_1$ and $Y_2$, etc.:

$$ \Pr(Y_1, Y_2, \dots, Y_N) = \Pr(Y_1)\Pr(Y_2)\cdots\Pr(Y_N) $$

because of the independence assumption. Therefore,

$$ \Pr(Y_1, Y_2, \dots, Y_N) = \prod_{1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(Y_i - \alpha - \beta X_i)^2}{2\sigma^2}} = L(\alpha, \beta, \sigma^2) $$
This is the likelihood function, which gives the probability of observing the sample that we actually observed. It is a function of the unknown parameters, α, β, and σ. The symbol $\prod_1^N$ means multiply the items indexed from one to N.
The basic idea behind maximum likelihood is that we probably observed the most likely sample, not a rare or unusual sample; therefore, the best guess of the values of the parameters are those that generate the most likely sample. Mathematically, we maximize the likelihood of the sample by taking derivatives with respect to α, β, and σ, setting the derivatives equal to zero, and solving.

Let's take logs first to make the algebra easier.

$$ \log L = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(Y_i - \alpha - \beta X_i)^2 $$
This function has a lot of stuff that is not a function of the parameters:

$$ \log L = C - \frac{N}{2}\log\sigma^2 - \frac{Q}{2\sigma^2} $$

where

$$ C = -\frac{N}{2}\log(2\pi) $$

which is constant with respect to the unknown parameters, and

$$ Q = \sum_{i=1}^{N}(Y_i - \alpha - \beta X_i)^2 $$

where Q is the error sum of squares. Since Q is multiplied by a negative, to maximize the likelihood we have to minimize the error sum of squares with respect to α and β. This is nothing more than ordinary least squares. Therefore, OLS is also a maximum likelihood estimator. We already know the OLS formulas, so the only thing we need now is the maximum likelihood estimator for the variance, $\sigma^2$.
Substitute the OLS values $\hat\alpha$ and $\hat\beta$:

$$ \log L(\sigma^2) = C - \frac{N}{2}\log(\sigma^2) - \frac{\hat Q}{2\sigma^2} $$

where

$$ \hat Q = \sum_{i=1}^{N}(Y_i - \hat\alpha - \hat\beta X_i)^2 $$

is the residual sum of squares, ESS.

Now we have to differentiate with respect to σ and set the derivative equal to zero:

$$ \frac{d \log L(\sigma)}{d\sigma} = -\frac{N}{\sigma} + \frac{\hat Q}{\sigma^3} = 0 $$

$$ N\sigma^2 = \hat Q $$

$$ \hat\sigma^2 = \frac{\hat Q}{N} = \frac{ESS}{N} $$
Now that we have estimates for α, β, and σ, we can substitute $\hat\alpha$, $\hat\beta$, and $\hat\sigma$ into the log likelihood function to get its maximum value:

$$ \operatorname*{Max}_{\hat\alpha,\hat\beta,\hat\sigma} \log L = C - \frac{N}{2}\log(\hat\sigma^2) - \frac{\hat Q}{2\hat\sigma^2} $$

Note that

$$ \frac{\hat Q}{2\hat\sigma^2} = \frac{N\hat\sigma^2}{2\hat\sigma^2} = \frac{N}{2}. $$

So,

$$ \operatorname*{Max}_{\hat\alpha,\hat\beta,\hat\sigma} \log L = C - \frac{N}{2}\log\!\left(\frac{\hat Q}{N}\right) - \frac{N}{2} = C + \frac{N}{2}\log(N) - \frac{N}{2} - \frac{N}{2}\log(\hat Q) $$

Since N is constant,

$$ \operatorname*{Max}_{\hat\alpha,\hat\beta,\hat\sigma} \log L = const - \frac{N}{2}\log(\hat Q) $$

where

$$ const = C + \frac{N}{2}\log(N) - \frac{N}{2} $$

Therefore,

$$ \operatorname{Max} L = const^* \cdot \hat Q^{-N/2} = const^* \cdot (ESS)^{-N/2} $$
Likelihood ratio test

Let L be the likelihood function. We can use this function to test hypotheses such as $\beta = 0$ or $\alpha + \beta = 1$ as restrictions on the model. As in the F-test, if the restriction is false, we expect that the residual sum of squares (ESS) for the restricted model will be higher than the ESS for the unrestricted model. This means that the likelihood function will be smaller (not maximized) if the restriction is false. If the hypothesis is true, then the two values of ESS will be approximately equal and the two values of the likelihood functions will be approximately equal.

Consider the likelihood ratio,

$$ \lambda = \frac{\operatorname{Max} L \text{ under the restriction}}{\operatorname{Max} L \text{ without the restriction}} \le 1 $$

The test statistic is

$$ -2\log(\lambda) \sim \chi^2(p) $$

where p is the number of restrictions.
To do this test, do two regressions, one with the restriction, saving the residual sum of squares, ESS(R), and one without the restriction, saving ESS(U). So,

$$ \lambda = \frac{const^* \cdot ESS(R)^{-N/2}}{const^* \cdot ESS(U)^{-N/2}} = \left(\frac{ESS(R)}{ESS(U)}\right)^{-N/2} $$

$$ \log(\lambda) = -\frac{N}{2}\big(\log(ESS(R)) - \log(ESS(U))\big) $$

The test statistic is

$$ -2\log(\lambda) = N\big(\log(ESS(R)) - \log(ESS(U))\big) \sim \chi^2(p) $$

This is a large sample test. Like all large sample tests, its significance is not well known in small samples. Usually we just assume that the small sample significance is about the same as it would be in a large sample.
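The two-regression recipe can be sketched directly. This Python example (illustrative, not from the book; the data are invented) tests H0: β = 0 in a simple regression, where the restricted model is just $y = \alpha + u$, so ESS(R) is the total sum of squares of y.

```python
# Likelihood ratio test of beta = 0 using LR = N*(log ESS(R) - log ESS(U)).
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.8, 2.1, 3.2, 3.0, 4.1, 4.3, 5.2, 5.1]
N = len(x)
xbar, ybar = sum(x) / N, sum(y) / N

# Unrestricted model: OLS of y on x
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
ess_u = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

# Restricted model (beta = 0): the fitted value is just the mean of y
ess_r = sum((yi - ybar) ** 2 for yi in y)

lr = N * (math.log(ess_r) - math.log(ess_u))
print(lr > 3.84)   # exceeds the 5% chi-square(1) critical value -> reject H0
```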
Multiple regression and instrumental variables

Suppose we have a model with two explanatory variables. For example,

$$ Y_i = \alpha + \beta X_i + \gamma Z_i + u_i, \qquad u_i \sim iid(0, \sigma^2) $$

We want to estimate the parameters α, β, and γ. We define the error as

$$ e_i = Y_i - \alpha - \beta X_i - \gamma Z_i $$

At this point we could square the errors, sum them, take derivatives with respect to the parameters, set the derivatives equal to zero and solve. This would generate the least squares estimators for this multiple regression. However, there is another approach, known as instrumental variables, that is both illustrative and easier. To see how instrumental variables works, let's first derive the OLS estimators for the simple regression model,

$$ Y_i = \alpha + \beta X_i + u_i $$

The crucial Gauss-Markov assumptions can be written as

$$ \text{(A1)}\quad E(u_i) = 0 $$

$$ \text{(A2)}\quad E(x_i u_i) = 0 $$

A1 states that the model must be correct on average, so the (true) mean of the errors is zero. A2 is the alternative to the assumption that the x's are a set of fixed numbers. It says that the x's cannot be correlated with the error term; that is, the covariance between the explanatory variable and the error term must be zero.
Since we do not observe $u_i$, we have to make the assumptions operational. We therefore define the following empirical analogs of A1 and A2:

$$ \text{(a1)}\quad \frac{\sum e_i}{N} = 0 \ \Rightarrow\ \sum e_i = 0 $$

$$ \text{(a2)}\quad \frac{\sum x_i e_i}{N} = 0 \ \Rightarrow\ \sum x_i e_i = 0 $$

(a1) says that the mean of the residuals is zero and (a2) says that the covariance between x and the residuals is zero.
Define the residuals,

$$ e_i = Y_i - \alpha - \beta X_i $$

Sum the residuals,

$$ \sum_{i=1}^{N} e_i = \sum_{i=1}^{N} Y_i - N\alpha - \beta\sum_{i=1}^{N} X_i = 0 \text{ by (a1).} $$

Divide both sides by N:

$$ \frac{\sum Y_i}{N} - \alpha - \beta\frac{\sum X_i}{N} = 0 $$

Or,

$$ \bar Y - \alpha - \beta\bar X = 0 \quad\Rightarrow\quad \hat\alpha = \bar Y - \hat\beta\bar X $$

which is the OLS estimator for α.
We can use (a2) to derive the OLS estimator for β. In deviations from means,

$$ y_i = Y_i - \bar Y = \alpha + \beta X_i + u_i - (\alpha + \beta\bar X + \bar u) = \beta x_i + u_i $$

(where $u_i$ is now also measured as a deviation from its mean). The predicted value of y is

$$ \hat y_i = \hat\beta x_i $$

The observed value of y is the sum of the predicted value from the regression and the residual, the difference between the observed value and the predicted value:

$$ y_i = \hat y_i + e_i = \hat\beta x_i + e_i $$

Multiply both sides by $x_i$:

$$ x_i y_i = \hat\beta x_i^2 + x_i e_i $$

Sum:

$$ \sum x_i y_i = \hat\beta\sum x_i^2 + \sum x_i e_i $$

But $\sum x_i e_i = 0$ by (a2). So,

$$ \sum x_i y_i = \hat\beta\sum x_i^2 $$

Solve for $\hat\beta$:

$$ \hat\beta = \frac{\sum x_i y_i}{\sum x_i^2} $$

which is the OLS estimator.
Now that we know that we can derive the OLS estimators using instrumental variables, let's do it for the multiple regression above.
$$ Y_i = \alpha + \beta X_i + \gamma Z_i + u_i $$

We have to make three assumptions (with their empirical analogs), since we have three parameters.

$$ \text{(A1)}\quad E(u_i) = 0 \ \Rightarrow\ \frac{\sum e_i}{N} = 0 $$

$$ \text{(A2)}\quad E(x_i u_i) = 0 \ \Rightarrow\ \frac{\sum x_i e_i}{N} = 0 \ \Rightarrow\ \sum x_i e_i = 0 $$

$$ \text{(A3)}\quad E(z_i u_i) = 0 \ \Rightarrow\ \frac{\sum z_i e_i}{N} = 0 \ \Rightarrow\ \sum z_i e_i = 0 $$

From (A1) we get the estimate of the intercept:

$$ \sum e_i = \sum Y_i - N\alpha - \beta\sum X_i - \gamma\sum Z_i = 0 $$

$$ \frac{\sum e_i}{N} = \bar Y - \alpha - \beta\bar X - \gamma\bar Z = 0 $$

Or,

$$ \hat\alpha = \bar Y - \hat\beta\bar X - \hat\gamma\bar Z $$

which is a pretty straightforward generalization of the simple regression estimator for α.
Now we can work in deviations:

$$ y_i = Y_i - \bar Y = \beta x_i + \gamma z_i + u_i $$

The predicted values of y are

$$ \hat y_i = \hat\beta x_i + \hat\gamma z_i $$

so that y is equal to the predicted value plus the residual:

$$ y_i = \hat\beta x_i + \hat\gamma z_i + e_i $$

Multiply by x and sum to get

$$ x_i y_i = \hat\beta x_i^2 + \hat\gamma x_i z_i + x_i e_i $$

$$ \sum x_i y_i = \hat\beta\sum x_i^2 + \hat\gamma\sum x_i z_i + \sum x_i e_i $$

But $\sum x_i e_i = 0$ by (A2), so

$$ \text{(N1)}\qquad \sum x_i y_i = \hat\beta\sum x_i^2 + \hat\gamma\sum x_i z_i $$

Now repeat the operation with z:

$$ z_i y_i = \hat\beta z_i x_i + \hat\gamma z_i^2 + z_i e_i $$

$$ \sum z_i y_i = \hat\beta\sum z_i x_i + \hat\gamma\sum z_i^2 + \sum z_i e_i $$

But $\sum z_i e_i = 0$ by (A3), so

$$ \text{(N2)}\qquad \sum z_i y_i = \hat\beta\sum z_i x_i + \hat\gamma\sum z_i^2 $$
We have two linear equations (N1 and N2) in two unknowns, $\hat\beta$ and $\hat\gamma$. Even though these equations look complicated with all the summations, these summations are just numbers derived from the sample observations on x, y, and z. There are a variety of ways to solve two equations in two unknowns. Back substitution from high school algebra will work just fine. Solve for $\hat\gamma$ from N2:

$$ \hat\gamma\sum z_i^2 = \sum z_i y_i - \hat\beta\sum z_i x_i $$

$$ \hat\gamma = \frac{\sum z_i y_i}{\sum z_i^2} - \hat\beta\frac{\sum z_i x_i}{\sum z_i^2} $$
Substitute into N1:

$$ \sum x_i y_i = \hat\beta\sum x_i^2 + \left(\frac{\sum z_i y_i}{\sum z_i^2} - \hat\beta\frac{\sum z_i x_i}{\sum z_i^2}\right)\sum x_i z_i $$

Multiply through by $\sum z_i^2$:

$$ \sum x_i y_i \sum z_i^2 = \hat\beta\sum x_i^2\sum z_i^2 + \sum z_i y_i \sum x_i z_i - \hat\beta\left(\sum x_i z_i\right)^2 $$
Collect terms in $\hat\beta$ and move them to the left-hand side:

$$ \hat\beta\left(\sum x_i^2\sum z_i^2 - \left(\sum x_i z_i\right)^2\right) = \sum x_i y_i \sum z_i^2 - \sum z_i y_i \sum x_i z_i $$

$$ \hat\beta = \frac{\sum x_i y_i \sum z_i^2 - \sum z_i y_i \sum x_i z_i}{\sum x_i^2\sum z_i^2 - \left(\sum x_i z_i\right)^2} $$

because $\sum x_i z_i = \sum z_i x_i$. This is the OLS estimator for β.
The OLS estimator for γ is derived similarly:

$$ \hat\gamma = \frac{\sum z_i y_i \sum x_i^2 - \sum x_i y_i \sum x_i z_i}{\sum x_i^2\sum z_i^2 - \left(\sum x_i z_i\right)^2} $$
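These closed-form solutions can be applied mechanically. This Python sketch (illustrative, not from the book; the data are invented) computes $\hat\beta$ and $\hat\gamma$ from the formulas above and checks that the implied residuals satisfy the conditions $\sum x_i e_i = 0$ and $\sum z_i e_i = 0$ used to derive them.

```python
# Two-regressor OLS from the closed-form solutions to N1 and N2.
Y = [3.1, 4.0, 5.2, 6.1, 7.4, 8.0, 9.3]
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Z = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0]
N = len(Y)
Ybar, Xbar, Zbar = sum(Y) / N, sum(X) / N, sum(Z) / N
y = [v - Ybar for v in Y]                  # deviations from means
x = [v - Xbar for v in X]
z = [v - Zbar for v in Z]

S = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))   # sum of products
den = S(x, x) * S(z, z) - S(x, z) ** 2
beta = (S(x, y) * S(z, z) - S(z, y) * S(x, z)) / den
gamma = (S(z, y) * S(x, x) - S(x, y) * S(x, z)) / den
alpha = Ybar - beta * Xbar - gamma * Zbar

e = [Yi - alpha - beta * Xi - gamma * Zi for Yi, Xi, Zi in zip(Y, X, Z)]
print(abs(S(x, e)) < 1e-9 and abs(S(z, e)) < 1e-9)      # True
```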
Interpreting the multiple regression coefficient.

We might be able to make some sense out of these formulas if we translate them into simple regression coefficients. The formula for the simple correlation coefficient between x and y is

$$ r_{xy} = \frac{\sum x_i y_i}{N S_x S_y} = \frac{\sum x_i y_i}{N\sqrt{\dfrac{\sum x_i^2}{N}}\sqrt{\dfrac{\sum y_i^2}{N}}} = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2\sum y_i^2}} $$

Or,

$$ \sum x_i y_i = r_{xy}\sqrt{\sum x_i^2\sum y_i^2} $$

And,

$$ \sum x_i z_i = r_{xz}\sqrt{\sum x_i^2\sum z_i^2} \qquad\qquad \sum z_i y_i = r_{zy}\sqrt{\sum z_i^2\sum y_i^2} $$
Substituting into the formula for $\hat\beta$:

$$ \hat\beta = \frac{r_{xy}\sqrt{\sum x_i^2\sum y_i^2}\,\sum z_i^2 - r_{zy}\sqrt{\sum z_i^2\sum y_i^2}\; r_{xz}\sqrt{\sum x_i^2\sum z_i^2}}{\sum x_i^2\sum z_i^2 - r_{xz}^2\sum x_i^2\sum z_i^2} $$

Cancel out $\sum z_i^2$, which appears in every term:

$$ \hat\beta = \frac{r_{xy}\sqrt{\sum x_i^2\sum y_i^2} - r_{zy}r_{xz}\sqrt{\sum x_i^2\sum y_i^2}}{\sum x_i^2 - r_{xz}^2\sum x_i^2} $$

Factor out the common terms on the top and bottom:

$$ \hat\beta = \left(\frac{r_{xy} - r_{zy}r_{xz}}{1 - r_{xz}^2}\right)\sqrt{\frac{\sum y_i^2}{\sum x_i^2}} = \left(\frac{r_{xy} - r_{zy}r_{xz}}{1 - r_{xz}^2}\right)\frac{S_y}{S_x} $$
The corresponding expression for $\hat\gamma$ is

$$ \hat\gamma = \left(\frac{r_{zy} - r_{xy}r_{xz}}{1 - r_{xz}^2}\right)\frac{S_y}{S_z} $$
The ratio of standard deviations is simply a positive scaling factor. Let's concentrate on the term in parentheses. Its denominator is $1 - r_{xz}^2$. When x and z are perfectly correlated, $r_{xz} = 1$ and the formula does not compute. This is the case of perfect collinearity. X and Z are said to be collinear because one is an exact function of the other, with no error term. This could happen if X is total wages and Z is total income minus non-wage payments.

The multiple regression coefficient β is the effect on Y of a change in X, holding Z constant. If X and Z are perfectly collinear, then whenever X changes, Z has to change. Therefore it is impossible to hold Z constant to find the separate effect of X. Stata will simply drop Z from the regression and estimate a simple regression of Y on X.
The simple correlation coefficient $r_{xy}$ in the numerator of the formula for $\hat\beta$ is the total effect of X on Y, holding nothing constant. The second term, $r_{zy}r_{xz}$, is the effect of X on Y through Z. In other words, X is correlated with Z, so when X varies, Z varies. This causes Y to change because Y is correlated with Z. Therefore, the numerator of the formula for $\hat\beta$ starts with the total effect of X on Y, but then subtracts off the indirect effect of X on Y through Z. It therefore measures the effect of X on Y while statistically holding Z constant, the so-called partial effect of X on Y.

The formula for $\hat\gamma$ is analogous. It is the total effect of Z on Y minus the indirect effect of Z on Y through X. It partials out the effect of X on Y.
Note that, if there is no correlation between X and Z, then X and Z are said to be orthogonal and $r_{xz} = 0$. In this case, the multiple regression estimators collapse to simple regression estimators:

$$ \hat\beta = \left(\frac{r_{xy} - r_{zy}r_{xz}}{1 - r_{xz}^2}\right)\frac{S_y}{S_x} = r_{xy}\frac{S_y}{S_x} $$

$$ \hat\gamma = \left(\frac{r_{zy} - r_{xy}r_{xz}}{1 - r_{xz}^2}\right)\frac{S_y}{S_z} = r_{zy}\frac{S_y}{S_z} $$

These are the results we would get if we did two separate simple regressions, Y on X, and Y on Z.
Multiple regression and omitted variable bias

The reason we use multiple regression is that most real world relationships have more than one explanatory variable. Suppose we type the following data into the data editor in Stata.

Let's do a simple regression of crop yield on rainfall.

. regress yield rain

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  1,     6) =    0.17
       Model |  33.3333333     1  33.3333333           Prob > F      =  0.6932
    Residual |  1166.66667     6  194.444444           R-squared     =  0.0278
-------------+------------------------------           Adj R-squared = -0.1343
       Total |        1200     7  171.428571           Root MSE      =  13.944

------------------------------------------------------------------------------
       yield |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        rain |  -1.666667   4.025382    -0.41   0.693    -11.51642    8.183089
       _cons |   76.66667    40.5546     1.89   0.108    -22.56687    175.9002
------------------------------------------------------------------------------
According to our analysis, the coefficient on rain is −1.67, implying that rain causes crop yield to decline (!), although not significantly. The problem is that we have an omitted variable, temperature. Let's do a multiple regression with both rain and temp.
. regress yield rain temp

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =    0.60
       Model |  231.448763     2  115.724382           Prob > F      =  0.5853
    Residual |  968.551237     5  193.710247           R-squared     =  0.1929
-------------+------------------------------           Adj R-squared = -0.1300
       Total |        1200     7  171.428571           Root MSE      =  13.918

------------------------------------------------------------------------------
       yield |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        rain |   .7243816   4.661814     0.16   0.883    -11.25919    12.70796
        temp |   1.366313   1.351038     1.01   0.358    -2.106639    4.839266
       _cons |  -14.02238   98.38747    -0.14   0.892    -266.9354    238.8907
------------------------------------------------------------------------------
Now both rain and warmth increase crop yields. The coefficients are not significant, probably because we
only have eight observations. Nevertheless, remember that omitting an important explanatory variable can
bias your estimates on the included variables.
The omitted variable theorem
The reason we do multiple regressions is that most things in economics are functions of more than one
variable. If we make a mistake and leave one of the important variables out, we cause the remaining
coefficient to be biased (and inconsistent).
Suppose the true model is the multiple regression (in deviations),

$$ y_i = \beta x_i + \gamma z_i + u_i $$

However, we omit z and run a simple regression of y on x. The coefficient on x will be

$$ \hat\beta = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{\sum x_i(\beta x_i + \gamma z_i + u_i)}{\sum x_i^2} $$
The expected value of $\hat\beta$ is

$$ E(\hat\beta) = \beta\frac{\sum x_i^2}{\sum x_i^2} + \gamma\frac{\sum x_i z_i}{\sum x_i^2} + \frac{E\sum x_i u_i}{\sum x_i^2} = \beta + \gamma\frac{\sum x_i z_i}{\sum x_i^2} $$

because $E\sum x_i u_i = 0$ by A2. (The explanatory variables are uncorrelated with the error term.) We can write the expression for the expected value of $\hat\beta$ in a number of ways:
$$ E(\hat\beta) = \beta + \gamma\frac{r_{xz}\sqrt{\sum x_i^2\sum z_i^2}}{\sum x_i^2} = \beta + \gamma r_{xz}\sqrt{\frac{\sum z_i^2}{\sum x_i^2}} = \beta + \gamma r_{xz}\frac{S_z}{S_x} = \beta + \gamma b_{xz} $$

where $b_{xz}$ is the coefficient from the simple regression of Z on X.
Assume that Z is relevant in the sense that γ is not zero. If $\sum x_i z_i \ne 0$, then $r_{xz} \ne 0$ and $b_{xz} \ne 0$, in which case $\hat\beta$ is biased (and inconsistent, since the bias does not go away as N goes to infinity). The direction of bias depends on the signs of $r_{xz}$ and $\gamma$. If $r_{xz}$ and $\gamma$ have the same sign, then $\hat\beta$ is biased upwards. If they have opposite signs, then $\hat\beta$ is biased downward. This was the case with rainfall and temperature in our crop yield example. Finally, note that if X and Z are orthogonal, then $r_{xz} = 0$ and $\hat\beta$ is unbiased.
The expected value of $\hat\gamma$ if we omit X is analogous:

$$ E(\hat\gamma) = \gamma + \beta r_{zx}\frac{S_x}{S_z} = \gamma + \beta b_{zx} $$

where $b_{zx}$ is the coefficient from the simple regression of X on Z.
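The bias formula can be checked by simulation. This Python sketch (illustrative, not from the book; the parameters and data are invented) regresses y on x alone and shows that the average estimate settles on $\beta + \gamma b_{xz}$, where $b_{xz}$ is the slope from regressing z on x.

```python
# Omitted variable theorem by simulation: E(beta_hat) = beta + gamma*b_xz.
import random

random.seed(6)
beta, gamma = 2.0, 1.5
x = [float(i) for i in range(-5, 6)]       # fixed in repeated samples
z = [0.8 * xi + ((i % 2) - 0.5) for i, xi in enumerate(x)]   # correlated with x

xbar, zbar = sum(x) / len(x), sum(z) / len(z)
sxx = sum((xi - xbar) ** 2 for xi in x)
b_xz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z)) / sxx

total = 0.0
REPS = 20_000
for _ in range(REPS):
    y = [beta * xi + gamma * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]
    ybar = sum(y) / len(y)
    bhat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    total += bhat
print(round(total / REPS, 2), round(beta + gamma * b_xz, 2))   # nearly equal
```

Here γ and $r_{xz}$ are both positive, so, as the theorem predicts, the short-regression estimate is biased upward.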
Now let's consider a more complicated model. Suppose the true model is

$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + u_i $$

but we estimate

$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i $$

We know we have omitted variable bias, but how much bias is there and in what direction?

Consider a set of auxiliary (and imaginary, if we don't have data on $X_3$ and $X_4$) regressions of each of the omitted variables on all of the included variables:

$$ X_{3i} = a_0 + a_1 X_{1i} + a_2 X_{2i} $$

$$ X_{4i} = b_0 + b_1 X_{1i} + b_2 X_{2i} $$
Note if there are more included variables, simply add them to the right hand side of each auxiliary
regression. If there are more omitted variables, add more equations.
It can be shown that, as a simple extension of the omitted variable theorem above,

$$ E(\hat\beta_0) = \beta_0 + a_0\beta_3 + b_0\beta_4 $$

$$ E(\hat\beta_1) = \beta_1 + a_1\beta_3 + b_1\beta_4 $$

$$ E(\hat\beta_2) = \beta_2 + a_2\beta_3 + b_2\beta_4 $$
Here is a nice example in Stata. In the Data Editor, create the following variable.
Create some additional variables.
. gen x2=x^2
. gen x3=x^3
. gen x4=x^4
Now create y.
. gen y=1+x+x2+x3+x4
Regress y on all the x’s (true model)
. regress y x x2 x3 x4

      Source |       SS       df       MS              Number of obs =       7
-------------+------------------------------           F(  4,     2) =       .
       Model |       11848     4        2962           Prob > F      =       .
    Residual |           0     2           0           R-squared     =  1.0000
-------------+------------------------------           Adj R-squared =  1.0000
       Total |       11848     6  1974.66667           Root MSE      =       0

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          1          .        .       .            .           .
          x2 |          1          .        .       .            .           .
          x3 |          1          .        .       .            .           .
          x4 |          1          .        .       .            .           .
       _cons |          1          .        .       .            .           .
------------------------------------------------------------------------------
Now omit X3 and X4.
. regress y x x2

      Source |       SS       df       MS              Number of obs =       7
-------------+------------------------------           F(  2,     4) =   33.44
       Model |  11179.4286     2  5589.71429           Prob > F      =  0.0032
    Residual |  668.571429     4  167.142857           R-squared     =  0.9436
-------------+------------------------------           Adj R-squared =  0.9154
       Total |       11848     6  1974.66667           Root MSE      =  12.928

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          8   2.443233     3.27   0.031     1.216498     14.7835
          x2 |   10.57143   1.410601     7.49   0.002     6.654972    14.48789
       _cons |  -9.285714     7.4642    -1.24   0.281    -30.00966    11.43823
------------------------------------------------------------------------------
Do the auxiliary regressions.
. regress x3 x x2
Source  SS df MS Number of obs = 7
+ F( 2, 4) = 12.70
Model  1372 2 686 Prob > F = 0.0185
Residual  216 4 54 Rsquared = 0.8640
+ Adj Rsquared = 0.7960
Total  1588 6 264.666667 Root MSE = 7.3485

x3  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  7 1.38873 5.04 0.007 3.144267 10.85573
x2  0 .8017837 0.00 1.000 -2.226109 2.226109
_cons  0 4.242641 0.00 1.000 -11.77946 11.77946

. regress x4 x x2
Source  SS df MS Number of obs = 7
+ F( 2, 4) = 34.01
Model  7695.42857 2 3847.71429 Prob > F = 0.0031
Residual  452.571429 4 113.142857 Rsquared = 0.9445
+ Adj Rsquared = 0.9167
Total  8148 6 1358 Root MSE = 10.637

x4  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  0 2.010178 0.00 1.000 -5.581149 5.581149
x2  9.571429 1.160577 8.25 0.001 6.34915 12.79371
_cons  -10.28571 6.141196 -1.67 0.169 -27.33641 6.764979

Compute the expected values from the formulas above.
\[ E(\hat{\beta}_0) = \beta_0 + a_0\beta_3 + b_0\beta_4 = 1 + 0(1) - 10.29(1) = -9.29 \]
\[ E(\hat{\beta}_1) = \beta_1 + a_1\beta_3 + b_1\beta_4 = 1 + 7(1) + 0(1) = 8 \]
\[ E(\hat{\beta}_2) = \beta_2 + a_2\beta_3 + b_2\beta_4 = 1 + 0(1) + 9.57(1) = 10.57 \]
These are the values of the parameters in the regression with the omitted variables. These numbers are
exact because there is no random error in the model. You can see that the theorem works: both slope estimates are biased upward by the omitted variables, while the intercept is biased downward.
Target and control variables: how many
regressors?
When we estimate a regression model, we usually have one or two parameters that we are primarily interested in. The variables associated with those parameters are called target variables. In a demand curve we are typically concerned with the coefficients on price and income. These coefficients tell us whether the demand curve is downward sloping and whether the good is normal or inferior. We are primarily interested in
the coefficient on the interest rate in a demand for money equation. In a policy study the target variable is
frequently a dummy variable (see below) that is equal to one if the policy is in effect and zero otherwise.
To get a good estimate of the coefficient on the target variable or variables, we want to avoid omitted
variable bias. To that end we include a list of control variables. The only purpose of the control variables
is to try to guarantee that we don’t bias the coefficients on the target variables. For example, if we are studying the effect of a three-strikes law on crime, our target variable is the three-strikes dummy. We include in the crime equation all those variables which might cause crime aside from the three-strikes law (e.g., percent urban, unemployment, percent in certain age groups, etc.).
What happens if, in our attempt to avoid omitted variable bias, we include too many control variables?
Instead of omitting relevant variables, suppose we include irrelevant variables. It turns out that including
irrelevant variables will not bias the coefficients on the target variables. However, including irrelevant
variables will make the estimates inefficient relative to estimates including only relevant variables. It will also inflate the standard errors and depress the t-ratios on all the coefficients in the model,
including the coefficients on the target variables. Thus, including too many control variables will tend to
make the target variables appear insignificant, even when they are truly significant.
We can summarize the effect of too many or too few variables as follows.
Omitting a relevant variable biases the coefficients on all the remaining variables, but decreases the
variance (increases the efficiency) of all the remaining coefficients.
Discarding a variable whose true coefficient is less than its true (theoretical) standard error decreases the
mean square error (the sum of variance plus bias squared) of all the remaining coefficients. So if
\[ \tau = \frac{\beta}{S_\beta} < 1 \]
then we are better off, in terms of mean square error, by omitting the variable, even though its parameter is
not zero. Unfortunately, we don’t know any of these true values, so this is not very helpful, but it does
demonstrate that we shouldn’t keep insignificant control variables.
What is the best practice? I recommend the general to specific modeling strategy. After doing your library
research and reviewing all previous studies on the issue, you will have compiled a list of all the control
variables that previous researchers have used. You may also come up with some new control variables.
Start with a general model, including all the potentially relevant controls, and remove the insignificant
ones. Use t-tests and F-tests to justify your actions. You can proceed sequentially, dropping one or two
variables at a time, if that is convenient. After you get down to a parsimonious model including only
significant control variables, do one more Ftest to make sure that you can go from the general model to the
final model in one step. Sets of dummy variables should be treated as groups, including all or none for each
group, so that you might have some insignificant controls in the final model, but not a lot. At this point you
should be able to do valid hypothesis tests concerning the coefficients on the target variables.
Proxy variables
It frequently happens that researchers face a dilemma. Data on a potentially important control variable is
not available. However, we may be able to obtain data on a variable that is known or suspected to be highly
correlated with the unavailable variable. Such variables are known as proxy variables, or proxies. The
dilemma is this: if we omit the proxy we get omitted variable bias; if we include the proxy we get
measurement error. As we see in a later chapter, measurement error causes biased and inconsistent
estimates, but so does omitting a relevant variable. What is a researcher to do?
Monte Carlo studies have shown that the bias tends to be smaller if we include a proxy than if we omit a
variable entirely. However, the bias that results from including a proxy is directly related to how highly
correlated the proxy is with the unavailable variable. It is better to omit a poor proxy. Unfortunately, it is
impossible to know how highly correlated they are because we can’t observe the unavailable variable. It
might be possible, in some cases, to see how the two variables are related in other contexts, in other studies
for example, but generally we just have to hope.
The bottom line is that we should include proxies for control variables, but drop them if they are not
significant.
An interesting problem arises if the proxy variable is the target variable. In that case, we are stuck with
measurement error. We are also faced with the fact that the coefficient estimate is a compound of the true
parameter linking the dependent variable and the unavailable variable and the parameter relating the proxy
to the unavailable variable.
Suppose we know that
\[ Y = \alpha + \beta X \]
But we can’t get data on X. So we use Z, which is available and is related to X.
\[ X = a + bZ \]
Substituting for X in the original model yields,
\[ Y = \alpha + \beta X = \alpha + \beta(a + bZ) = (\alpha + \beta a) + \beta b Z \]
So if we estimate the model,
\[ Y = c + dZ \]
the coefficient on Z is \(d = \beta b\), the product of the true coefficient relating X to Y and the coefficient
relating X to Z. Unless we know the value of b, we have no measure of the effect of X on Y.
However, we do know if the coefficient d is significant or not and we do know if the sign is as expected.
That is, if we know that b is positive, then we can test the hypothesis that \(\beta\) is positive using the estimated coefficient \(\hat{d}\). Therefore, if the target variable is a proxy variable, the estimated coefficient can only be used to determine sign and significance.
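The algebra \(d = \beta b\) is easy to confirm numerically. The following Python sketch uses made-up values of the parameters (hypothetical numbers, not from any data set):

```python
import numpy as np

# A minimal numerical sketch of the proxy-variable algebra: if X = a + b*Z
# exactly, regressing Y on Z recovers d = beta*b, not beta itself.
alpha, beta = 2.0, 3.0      # true model: Y = alpha + beta*X (made-up values)
a, b = 1.0, 0.5             # proxy relation: X = a + b*Z (made-up values)
Z = np.linspace(0.0, 10.0, 50)
X = a + b * Z
Y = alpha + beta * X

# Regress Y on Z (with a constant)
design = np.column_stack([np.ones_like(Z), Z])
(c, d), *_ = np.linalg.lstsq(design, Y, rcond=None)
print(round(c, 6), round(d, 6))   # c = alpha + beta*a, d = beta*b
```

With these numbers the slope on Z is 1.5, which is \(\beta b = 3 \times 0.5\), not the true \(\beta = 3\).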
An illustration of this problem occurs in studies of the relationship between guns and crime. There is no
good measure of the number of guns, so researchers have to use proxies. As we have just seen, the
coefficient on the proxy for guns cannot be used to make inferences concerning the elasticity of crime with
respect to guns. Nevertheless, two studies, one by Cook and Ludwig and one by Duggan, both make this
mistake. See http://papers.ssrn.com/sol3/papers.cfm?abstract_id=473661 for a discussion of this issue.
Dummy variables
Dummy variables are variables that consist of ones and zeroes. They are also known as binary variables.
They are extremely useful. For example, I happen to have a data set consisting of the salaries of the faculty
of a certain nameless university (salaries.dta). The average salary at the time of the survey was as follows.
. summarize salary
Variable  Obs Mean Std. Dev. Min Max
+
salary  350 72024.92 25213.02 20000 173000
Now suppose we create a dummy variable called female, that takes the value 1 if the faculty member is
female and 0 if male. We can then find the average female salary.
. summarize salary if female
Variable  Obs Mean Std. Dev. Min Max
+
salary  122 61050 18195.34 29700 130000
Similarly, we can create a dummy variable called male which is 1 if male and zero if female. The average
male salary is somewhat higher.
. summarize salary if male
Variable  Obs Mean Std. Dev. Min Max
+
salary  228 77897.46 26485.87 20000 173000
The difference between these two means is substantial. We can use the scalar command to find the
difference in salary between males and females.
. scalar salarydif=61050-77897
. scalar list salarydif
salarydif = -16847
So, women earn almost $17,000 less than men on average. Is this difference significant given the variance
in salary? Let’s do a traditional test of the difference between two means by computing the variances and
using the formula in most introductory statistics books. First, create two variables for men’s salaries and
women’s salaries.
. gen salaryfem=salary if female==1
(228 missing values generated)
. gen salarymale=salary if female==0
(122 missing values generated)
. summarize salary*
Variable  Obs Mean Std. Dev. Min Max
+
salary  350 72024.92 25213.02 20000 173000
salaryfem  122 61050 18195.34 29700 130000
salarymale  228 77897.46 26485.87 20000 173000
Now test for the difference between these two means using the Stata command ttest.
. ttest salarymale=salaryfem, unpaired
Two-sample t test with equal variances

Variable  Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
+
salary~e  228 77897.46 1754.069 26485.87 74441.12 81353.8
salary~m  122 61050 1647.328 18195.34 57788.68 64311.32
+
combined  350 72024.92 1347.693 25213.02 69374.3 74675.54
+
diff  16847.46 2684.423 11567.73 22127.2

Degrees of freedom: 348
Ho: mean(salarymale) - mean(salaryfem) = diff = 0
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
t = 6.2760 t = 6.2760 t = 6.2760
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000
There appears to be a significant difference between male and female salaries.
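The same t statistic can be recomputed from the summary statistics alone, using the pooled-variance formula found in introductory statistics books. A quick check in Python (the numbers are copied from the output above):

```python
import math

# Two-sample t statistic from the summary statistics, pooled-variance formula.
n_m, mean_m, sd_m = 228, 77897.46, 26485.87   # male salaries
n_f, mean_f, sd_f = 122, 61050.00, 18195.34   # female salaries

df = n_m + n_f - 2                                             # 348
pooled_var = ((n_m - 1) * sd_m**2 + (n_f - 1) * sd_f**2) / df  # = Root MSE^2
se_diff = math.sqrt(pooled_var * (1 / n_m + 1 / n_f))
t = (mean_m - mean_f) / se_diff
print(round(se_diff, 1), round(t, 3))   # matches Stata's 2684.4 and 6.276
```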
It is somewhat more elegant to use regression analysis with dummy variables to achieve the same goal.
. regress salary female
Source  SS df MS Number of obs = 350
+ F( 1, 348) = 39.39
Model  2.2558e+10 1 2.2558e+10 Prob > F = 0.0000
Residual  1.9930e+11 348 572701922 Rsquared = 0.1017
+ Adj Rsquared = 0.0991
Total  2.2186e+11 349 635696291 Root MSE = 23931

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
female  -16847.46 2684.423 -6.28 0.000 -22127.2 -11567.73
_cons  77897.46 1584.882 49.15 0.000 74780.31 81014.61

Note that the coefficient on female is the difference in average salary attributable to being female while the
intercept is the corresponding male salary. The t-ratio on female tests the null hypothesis that this salary difference is equal to zero. These are exactly the same results we got using the standard t-test of the difference
between two means. So, there appears to be significant salary discrimination against women at this
university.
Since male=1-female, we can generate the bonus attributable to being male by doing the following
regression.
. regress salary male
Source  SS df MS Number of obs = 350
+ F( 1, 348) = 39.39
Model  2.2558e+10 1 2.2558e+10 Prob > F = 0.0000
Residual  1.9930e+11 348 572701922 Rsquared = 0.1017
+ Adj Rsquared = 0.0991
Total  2.2186e+11 349 635696291 Root MSE = 23931

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
male  16847.46 2684.423 6.28 0.000 11567.73 22127.2
_cons  61050 2166.628 28.18 0.000 56788.67 65311.33

Now the female salary is captured by the intercept and the coefficient on male is the (significant) bonus that
attends maleness.
Since male=1-female, female=1-male, and male+female equals the column of ones that represents the intercept, we can’t use all three terms in the same regression. This is the dummy variable trap.
. regress salary male female
Source  SS df MS Number of obs = 350
+ F( 1, 348) = 39.39
Model  2.2558e+10 1 2.2558e+10 Prob > F = 0.0000
Residual  1.9930e+11 348 572701922 Rsquared = 0.1017
+ Adj Rsquared = 0.0991
Total  2.2186e+11 349 635696291 Root MSE = 23931

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
male  16847.46 2684.423 6.28 0.000 11567.73 22127.2
female  (dropped)
_cons  61050 2166.628 28.18 0.000 56788.67 65311.33

Stata has saved us the embarrassment of running an impossible regression by simply dropping one of the variables. The result is that we get the same regression we got with male as the only independent variable.
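One way to see why the regression is impossible: with male, female, and the constant all included, the design matrix has a perfectly collinear column, so it is not of full rank. A small Python illustration with made-up data:

```python
import numpy as np

# The dummy variable trap as a rank problem: the constant column equals
# male + female, so the design matrix is rank-deficient (illustrative data).
female = np.array([1, 0, 1, 0, 0, 1, 0, 0])
male = 1 - female
const = np.ones_like(female)

X = np.column_stack([const, male, female])    # 3 columns...
print(np.linalg.matrix_rank(X))               # ...but only rank 2
```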
It is possible to force Stata to drop the intercept term instead of one of the other variables,
. regress salary male female, noconstant
Source  SS df MS Number of obs = 350
+ F( 2, 348) = 1604.86
Model  1.8382e+12 2 9.1911e+11 Prob > F = 0.0000
Residual  1.9930e+11 348 572701922 Rsquared = 0.9022
+ Adj Rsquared = 0.9016
Total  2.0375e+12 350 5.8215e+09 Root MSE = 23931

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
male  77897.46 1584.882 49.15 0.000 74780.31 81014.61
female  61050 2166.628 28.18 0.000 56788.67 65311.33

Now the coefficients are the mean salaries for the two groups. This formulation is less useful because it is
more difficult to test the null hypothesis that the two salaries are different.
Perhaps this salary difference is due to a difference in the amount of experience of the two groups. Maybe
the men have more experience and that is what is causing the apparent salary discrimination. If so, the
previous analysis suffers from omitted variable bias.
. regress salary exp female
Source  SS df MS Number of obs = 350
+ F( 2, 347) = 79.49
Model  6.9709e+10 2 3.4854e+10 Prob > F = 0.0000
Residual  1.5215e+11 347 438471159 Rsquared = 0.3142
+ Adj Rsquared = 0.3103
Total  2.2186e+11 349 635696291 Root MSE = 20940

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
exp  1069.776 103.1619 10.37 0.000 866.8753 1272.678
female  -10310.35 2431.983 -4.24 0.000 -15093.63 -5527.065
_cons  60846.72 2150.975 28.29 0.000 56616.14 65077.31

Well, the difference has been reduced to about $10,000, so the men apparently have more experience. But it is still
significant and negative. Also, the average raise is a pitiful $1000 per year. It is also possible that women
receive lower raises than men do. We can test this hypothesis by creating an interaction variable by
multiplying experience by female and including it as an additional regressor.
. gen fexp=female*exp
. regress salary exp female fexp
Source  SS df MS Number of obs = 350
+ F( 3, 346) = 53.04
Model  6.9888e+10 3 2.3296e+10 Prob > F = 0.0000
Residual  1.5197e+11 346 439219336 Rsquared = 0.3150
+ Adj Rsquared = 0.3091
Total  2.2186e+11 349 635696291 Root MSE = 20958

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
exp  1038.895 113.9854 9.11 0.000 814.7038 1263.087
female  -12189.86 3816.229 -3.19 0.002 -19695.79 -4683.936
fexp  172.0422 269.0422 0.64 0.523 -357.1217 701.2061
_cons  61338.93 2286.273 26.83 0.000 56842.18 65835.67

In fact, women appear to receive slightly higher, but not significantly higher, raises than men. There is a
significant salary penalty associated with being female, but it is not caused by discrimination in raises.
Useful tests
F-test
We have seen many applications of the t-test. However, the F-test is also extremely useful. This test was developed by Sir Ronald Fisher (the F in F-test), a British statistician and geneticist, in the early 1900s. Suppose
we want to test the null hypothesis that the coefficients on female and fexp are jointly equal to zero. The
way to test this hypothesis is to run the regression as it appears above with both female and fexp included,
and note the residual sum of squares (1.5e+11). Then run the regression again without the two variables (imposing the null hypothesis) and see whether the two residual sums of squares are significantly different. If they are, then the null hypothesis is false. If they are not significantly different, then the two variables do not help explain the variance of the dependent variable and the null hypothesis cannot be rejected.
Stata allows us to do this test very easily with the test command. This command is used after the regress
command. Refer to the coefficients by the corresponding variable name.
. test female fexp
( 1) female = 0
( 2) fexp = 0
F( 2, 346) = 9.18
Prob > F = 0.0001
This tests the null hypothesis that the coefficient on female and the coefficient on fexp are both equal to zero. According to the F-ratio, we can firmly reject this hypothesis. The numbers in parentheses in the F-ratio are the degrees of freedom of the numerator (2, because there are two restrictions, one on female and one on fexp) and the degrees of freedom of the denominator (n-k: 350-4=346).
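The mechanics behind the test command can be sketched directly: fit the unrestricted and restricted models, then form the F-ratio from the two residual sums of squares. A Python sketch with simulated data (all numbers are made up for illustration):

```python
import numpy as np

# F-test from residual sums of squares, on simulated data where x3 is
# truly irrelevant and x2 matters.
rng = np.random.default_rng(0)
n, q = 350, 2                       # observations; restrictions being tested
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1 + 2 * x1 + 0.5 * x2 + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares from an OLS fit with a constant."""
    X = np.column_stack([np.ones(len(y))] + list(X))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return float(e @ e)

rss_u = rss([x1, x2, x3], y)        # unrestricted: all regressors
rss_r = rss([x1], y)                # restricted: x2 and x3 excluded
k = 4                               # parameters in the unrestricted model
F = ((rss_r - rss_u) / q) / (rss_u / (n - k))
print(round(F, 2))
```

Dropping x2 and x3 raises the residual sum of squares a lot here (x2 really belongs), so the F statistic is large and the null is rejected.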
Chow test
This test is frequently referred to as a Chow test, after Gregory Chow, a Princeton econometrician who
developed a slightly different version. Let’s do a slightly more complicated version of the F-test above. I
also have a dummy variable called “admin” which is equal to one whenever a faculty member is also a
member of the administration (assistant to the President, assistant to the assistant to the President, dean,
associate dean, assistant dean, deanlet, etc., etc.). Maybe female administrators are discriminated against in
terms of salary. We want to know if the regression model explaining women’s salaries is different from the
regression model explaining men’s salaries. This is a Chow test.
. gen fadmin=female*admin
. regress salary exp female fexp admin fadmin
Source  SS df MS Number of obs = 350
+ F( 5, 344) = 35.33
Model  7.5271e+10 5 1.5054e+10 Prob > F = 0.0000
Residual  1.4659e+11 344 426123783 Rsquared = 0.3393
+ Adj Rsquared = 0.3297
Total  2.2186e+11 349 635696291 Root MSE = 20643

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
exp  1005.451 112.843 8.91 0.000 783.5015 1227.4
female  -11682.84 3791.92 -3.08 0.002 -19141.11 -4224.577
fexp  156.5447 266.3948 0.59 0.557 -367.423 680.5124
admin  11533.69 3905.342 2.95 0.003 3852.334 19215.04
fadmin  2011.575 7884.295 0.26 0.799 -13495.92 17519.07
_cons  60202.64 2284.564 26.35 0.000 55709.17 64696.11
. test female fexp fadmin
( 1) female = 0
( 2) fexp = 0
( 3) fadmin = 0
F( 3, 344) = 5.67
Prob > F = 0.0008
We can soundly reject the null hypothesis that women have the same salary function as men. However,
because the t-tests on fexp and fadmin are not significant, we know that the difference is not due to raises
or differences paid to women administrators.
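The logic of the Chow test can be illustrated with simulated data: a regression with a full set of female interactions fits the two groups separately within one equation, so its residual sum of squares equals the sum of the residual sums of squares from separate male and female regressions. A hypothetical sketch in Python:

```python
import numpy as np

# Chow-test logic: a fully interacted pooled regression has the same RSS
# as two separate group regressions (all data here is simulated).
rng = np.random.default_rng(1)
n = 200
female = (rng.random(n) < 0.4).astype(float)
exp_ = rng.uniform(0, 30, n)
salary = 60000 + 1000 * exp_ - 10000 * female + rng.normal(0, 5000, n)

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return float(e @ e)

ones = np.ones(n)
# Pooled regression with full interactions (like exp, female, fexp)
X_int = np.column_stack([ones, exp_, female, female * exp_])
rss_pooled = rss(X_int, salary)

# Separate regressions for each group
m, f = female == 0, female == 1
rss_sep = (rss(np.column_stack([ones[m], exp_[m]]), salary[m]) +
           rss(np.column_stack([ones[f], exp_[f]]), salary[f]))
print(np.isclose(rss_pooled, rss_sep))   # True
```

The F-test on the interaction terms therefore asks whether the separate-group fit is significantly better than a single common equation.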
I also have data on a number of other variables that might explain the difference in salaries between men
and women faculty: chprof (=1 if a chancellor professor), asst (=1 if an assistant prof), assoc (=1 if an
associate prof), and prof (=1 if a full professor). Let’s see if these variables make a difference.
. regress salary exp female asst assoc prof admin chprof
Source  SS df MS Number of obs = 350
+ F( 7, 342) = 72.14
Model  1.3227e+11 7 1.8896e+10 Prob > F = 0.0000
Residual  8.9587e+10 342 261948866 Rsquared = 0.5962
+ Adj Rsquared = 0.5879
Total  2.2186e+11 349 635696291 Root MSE = 16185

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
exp  195.7694 102.5207 1.91 0.057 -5.881204 397.42
female  -5090.278 1952.174 -2.61 0.010 -8930.056 -1250.5
asst  11182.61 4175.19 2.68 0.008 2970.325 19394.89
assoc  22847.23 4079.932 5.60 0.000 14822.31 30872.15
prof  47858.19 4395.737 10.89 0.000 39212.11 56504.28
admin  8043.421 2719.69 2.96 0.003 2693.997 13392.85
chprof  14419.67 5962.966 2.42 0.016 2690.965 26148.37
_cons  42395.53 4042.146 10.49 0.000 34444.94 50346.13

Despite all these highly significant control variables, the coefficient on the target variable, female, is negative and significant, indicating significant salary discrimination against women faculty. I have one
more set of dummy variables to try. I have a dummy for each department. So, dept1=1 if the faculty person
is in the first department, zero otherwise, dept2=1 if in the second department, etc.
. regress salary exp female admin chprof prof assoc asst dept2-dept30
Source  SS df MS Number of obs = 350
+ F( 33, 316) = 20.18
Model  1.5047e+11 33 4.5597e+09 Prob > F = 0.0000
Residual  7.1388e+10 316 225910511 Rsquared = 0.6782
+ Adj Rsquared = 0.6446
Total  2.2186e+11 349 635696291 Root MSE = 15030

salary  Coef. Std. Err. t P>t [95% Conf. Interval]
+
exp  283.8026 99.07088 2.86 0.004 88.88067 478.7245
female  -2932.928 1915.524 -1.53 0.127 -6701.72 835.865
admin  6767.088 2650.416 2.55 0.011 1552.396 11981.78
chprof  14794.4 5622.456 2.63 0.009 3732.222 25856.58
prof  46121.73 4324.557 10.67 0.000 37613.17 54630.29
assoc  22830.1 3930.037 5.81 0.000 15097.75 30562.44
asst  11771.36 4036.021 2.92 0.004 3830.495 19712.23
dept2  2500.519 4457.494 0.56 0.575 -6269.598 11270.63
dept3  19527.87 4019.092 4.86 0.000 11620.31 27435.43
dept4  1278.9 4886.804 0.26 0.794 -8335.885 10893.68
dept5  9213.681 9271.124 0.99 0.321 -9027.251 27454.61
dept6  -1969.66 3659.108 -0.54 0.591 -9168.953 5229.633
dept7  3525.149 4881.219 0.72 0.471 -6078.646 13128.94
dept8  462.3493 4317.267 0.11 0.915 -8031.871 8956.57
dept9  3304.275 6448.824 0.51 0.609 -9383.782 15992.33
dept10  24647.11 4618.156 5.34 0.000 15560.89 33733.33
dept11  13185.66 3726.702 3.54 0.000 5853.38 20517.95
dept12  -3601.157 3230.741 -1.11 0.266 -9957.638 2755.323
dept14  167.8798 5985.794 0.03 0.978 -11609.17 11944.93
dept16  -4854.208 6978.675 -0.70 0.487 -18584.75 8876.331
dept17  -1185.402 4293.443 -0.28 0.783 -9632.748 7261.945
dept18  5932.65 3473.475 1.71 0.089 -901.411 12766.71
dept19  18642.71 15803.41 1.18 0.239 -12450.48 49735.9
dept20  8166.382 8906.207 0.92 0.360 -9356.576 25689.34
dept21  -1839.744 5208.921 -0.35 0.724 -12088.29 8408.806
dept22  2940.911 11064.36 0.27 0.791 -18828.21 24710.03
dept23  -1540.594 4902.64 -0.31 0.754 -11186.54 8105.348
dept24  -777.8447 3916.975 -0.20 0.843 -8484.491 6928.802
dept25  -14089.64 15411.33 -0.91 0.361 -44411.41 16232.14
dept27  -1893.797 3397.766 -0.56 0.578 -8578.9 4791.307
dept28  -4073.063 5408.214 -0.75 0.452 -14713.72 6567.596
dept29  2279.7 10822.38 0.21 0.833 -19013.33 23572.73
dept30  -80.38544 4879.813 -0.02 0.987 -9681.414 9520.643
_cons  38501.85 4173.332 9.23 0.000 30290.82 46712.88

Whoops, the coefficient on female isn’t significant any more. Maybe women tend to be overrepresented in
departments that pay lower salaries to everyone, males and females (e.g., modern language, English,
theatre, dance, classical studies, kinesiology). This conclusion rests on the department dummy variables
being significant as a group. We can test this hypothesis, using an F-test, with the testparm command. We would like to list the variables as dept2-dept30, but to the test command that looks like we want to test a hypothesis on the difference between the coefficient on dept2 and the coefficient on dept30. That is where the testparm command comes in.
. testparm dept2dept30
( 1) dept2 = 0
( 2) dept3 = 0
( 3) dept4 = 0
( 4) dept5 = 0
( 5) dept6 = 0
( 6) dept7 = 0
( 7) dept8 = 0
( 8) dept9 = 0
( 9) dept10 = 0
(10) dept11 = 0
(11) dept12 = 0
(12) dept14 = 0
(13) dept16 = 0
(14) dept17 = 0
(15) dept18 = 0
(16) dept19 = 0
(17) dept20 = 0
(18) dept21 = 0
(19) dept22 = 0
(20) dept23 = 0
(21) dept24 = 0
(22) dept25 = 0
(23) dept27 = 0
(24) dept28 = 0
(25) dept29 = 0
(26) dept30 = 0
F( 26, 316) = 3.10
Prob > F = 0.0000
The department dummies are highly significant as a group. So there is no significant salary discrimination
against women, once we control for field of study. Female physicists, economists, chemists, computer
scientists, etc. apparently earn just as much as their male counterparts. Unfortunately, the same is true for
female French teachers and phys ed instructors. By the way, the departments that are significantly overpaid
are dept 3 (Chemistry), dept 10 (Computer Science), and dept 11 (Economics).
It is important to remember, when testing groups of dummy variables for significance, that you must drop or retain all members of the group. If the group is not significant, you should drop them all, even if one or two individual dummies are significant. If the group is significant, then all should be retained, even if only a few are individually significant.
Granger causality test
Another very useful test is available only for time series. In 1969, C.W.J. Granger published an article suggesting a way to test causality.² Suppose we are interested in testing whether crime causes prison (that is, people in prison), prison causes (deters) crime, both, or neither. Granger suggested running a regression of prison on lagged prison and lagged crime. If lagged crime is significant, then crime causes prison. Then do a regression of crime on lagged crime and lagged prison. If lagged prison is significant, then prison causes crime (deters crime if the coefficients sum to a negative number, causes crime if the coefficients sum to a positive number).
² Granger, C.W.J., “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica 36, 1969, 424-438.
I have data on crime (crmaj, major crime, the kind you get sent to prison for: murder, rape, robbery, assault,
and burglary) and prison population per capita for Virginia from 1972 to 1999 (crimeva.dta).
. tsset year
(You have to tsset year to tell Stata that year is the time variable.) Let’s use two lags.
. regress crmaj L.crmaj LL.crmaj L.prison LL.prison
Source  SS df MS Number of obs = 24
+ F( 4, 19) = 31.36
Model  63231512.2 4 15807878.1 Prob > F = 0.0000
Residual  9578520.75 19 504132.671 Rsquared = 0.8684
+ Adj Rsquared = 0.8407
Total  72810033 23 3165653.61 Root MSE = 710.02

crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
crmaj 
L1  1.003715 .2095943 4.79 0.000 .5650287 1.442401
L2  -.4399737 .206103 -2.13 0.046 -.8713522 -.0085951
prison 
L1  1087.249 1324.262 0.82 0.422 -1684.463 3858.962
L2  -1772.896 1287.003 -1.38 0.184 -4466.625 920.8333
_cons  6309.665 2689.46 2.35 0.030 680.5613 11938.77

. test L.prison LL.prison
( 1) L.prison = 0
( 2) L2.prison = 0
F( 2, 19) = 3.61
Prob > F = 0.0468
So, lagged prison is significant in the crime equation. Does prison cause or deter crime? The sum of the coefficients is
negative, so it deters crime if the sum is significantly different from zero.
. test L.prison+L2.prison=0
( 1) L.prison + L2.prison = 0
F( 1, 19) = 5.26
Prob > F = 0.0333
So, prison deters crime. Does crime cause prison?
. regress prison L.prison LL.prison L.crmaj LL.crmaj
Source  SS df MS Number of obs = 24
+ F( 4, 19) = 510.43
Model  23.1503896 4 5.78759741 Prob > F = 0.0000
Residual  .215433065 19 .011338582 Rsquared = 0.9908
+ Adj Rsquared = 0.9888
Total  23.3658227 23 1.01590533 Root MSE = .10648

prison  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison 
L1  1.545344 .1986008 7.78 0.000 1.129668 1.96102
L2  -.5637167 .193013 -2.92 0.009 -.9676977 -.1597358
crmaj 
L1  -.0000191 .0000314 -0.61 0.551 -.0000849 .0000467
L2  .0000159 .0000309 0.52 0.612 -.0000488 .0000806
_cons  .133937 .4033407 0.33 0.743 -.7102647 .9781387

. test L.crmaj LL.crmaj
( 1) L.crmaj = 0
( 2) L2.crmaj = 0
F( 2, 19) = 0.20
Prob > F = 0.8237
Apparently, prison deters crime, but crime does not cause prisons, at least in Virginia.
The Granger causality test is best done with control variables included in the regression as well as the lags
of the target variables. However, if the control variables are unavailable or the analysis is just preliminary,
then the lagged dependent variables will serve as proxies for the omitted control variables.
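The mechanics of the test are easy to reproduce with simulated data, where we build in one-way causality and then F-test the lagged coefficients. A Python sketch (all series here are artificial):

```python
import numpy as np

# Granger-causality sketch: x helps predict y but not vice versa.
# Regress y on two lags of itself and two lags of x, then F-test the x lags.
rng = np.random.default_rng(42)
T = 300
x = np.zeros(T)
y = np.zeros(T)
for t in range(2, T):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.4 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()   # x causes y

def rss(X, dep):
    beta, *_ = np.linalg.lstsq(X, dep, rcond=None)
    e = dep - X @ beta
    return float(e @ e)

dep = y[2:]
ones = np.ones(T - 2)
lags_y = np.column_stack([ones, y[1:-1], y[:-2]])           # restricted
lags_yx = np.column_stack([lags_y, x[1:-1], x[:-2]])        # unrestricted

n, k, q = len(dep), lags_yx.shape[1], 2
F = ((rss(lags_y, dep) - rss(lags_yx, dep)) / q) / (rss(lags_yx, dep) / (n - k))
print(F > 3.0)   # True: lagged x is clearly significant
```

Running the same F-test in the other direction (lags of y in the x equation) would not be significant, since y does not feed back into x in this simulation.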
J-test for non-nested hypotheses
Suppose we are having a debate about crime with some sociologists. We believe that crime is deterred by
punishment, mainly imprisonment, while the sociologists think that putting people in prison does no good
at all. The sociologists think that crime is caused by income inequality. We don’t think so. We both agree
that unemployment and income have an effect on crime. How can we resolve this debate? We don’t put
income inequality in our regressions and they don’t put prison in theirs. The key is to create a supermodel with all of the variables included. We then test all the coefficients. If prison is significant and income inequality is not, then we win. If the reverse is true, then the sociologists win. If neither or both are significant, then it is a tie. You can get ties in tests of non-nested hypotheses; you don’t get ties in tests of nested hypotheses. By the way, the J in J-test is for joint hypothesis.
I have time series data for the US which has data on major crime (murder, rape, robbery, assault, and
burglary), real per capita income, the unemployment rate, and income inequality measured by the Gini
coefficient from 1947-1998 (gini.csv). The Gini coefficient is a number between zero and one. Zero means
perfect equality, everyone has the same level of income. One means perfect inequality, one person has all
the money, everyone else has nothing. We create the supermodel by regressing the log of major crime per
capita on all these variables.
. regress lcrmajpc prisonpc unrate income gini
Source  SS df MS Number of obs = 52
+ F( 4, 47) = 3286.33
Model  1.55121091 4 .387802727 Prob > F = 0.0000
Residual  .005546219 47 .000118005 Rsquared = 0.9964
+ Adj Rsquared = 0.9961
Total  1.55675713 51 .03052465 Root MSE = .01086

lcrmajpc  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prisonpc  .0186871 .0044176 4.23 0.000 .0098001 .0275741
unrate  .0391091 .0072343 5.41 0.000 .0245556 .0536626
income  .3201013 .0061851 51.75 0.000 .3076585 .3325442
gini  -.757686 .2507801 -3.02 0.004 -1.26219 -.2531816
_cons  5.200647 .175871 29.57 0.000 4.84684 5.554454
Well, both prison and gini are significant in the crime equation. It appears to be a draw. However, the
coefficient on gini is negative, indicating that, as the gini coefficient goes up (more inequality, less
equality) crime goes down. Whoops.
If there were more than one variable for each of the competing models (e.g., prison and arrests for the
economists’ model and gini and the divorce rate for the sociologists), then we would use F-tests rather than t-tests. We would test the joint significance of prison and arrests versus the joint significance of gini and divorce with F-tests.
LM test
Suppose we want to test the joint hypothesis that income and unrate are not significant in the above regression, even though we already know that they are highly significant. We know we can test that hypothesis with an F-test. However, an alternative is to use the LM (for Lagrange Multiplier, never mind) test. The primary advantage of the LM test is that you only need to estimate the main equation once. Under the F-test you have to estimate the model with all the variables, record the error sum of squares, then estimate the model with the variables being tested excluded, record the error sum of squares, and finally compute the F-ratio.
For example, use the crime1990.dta data set to estimate the following crime equation.
. regress lcrmaj prison metpct p1824 p2534
Source  SS df MS Number of obs = 51
+ F( 4, 46) = 19.52
Model  5.32840267 4 1.33210067 Prob > F = 0.0000
Residual  3.1393668 46 .068247104 Rsquared = 0.6293
+ Adj Rsquared = 0.5970
Total  8.46776947 50 .169355389 Root MSE = .26124

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .1439967 .0282658 5.09 0.000 .0871006 .2008929
metpct  .0086418 .0020265 4.26 0.000 .0045627 .0127208
p1824  -.9140642 5.208719 -0.18 0.861 -11.39867 9.570543
p2534  -1.604708 3.935938 -0.41 0.685 -9.527341 6.317924
_cons  9.092154 .673677 13.50 0.000 7.736113 10.4482

Suppose we want to know if the age groups are jointly significant. The F-test is easy in Stata.
. test p1824 p2534
( 1) p1824 = 0
( 2) p2534 = 0
F( 2, 46) = 0.12
Prob > F = 0.8834
To do the LM test for the same hypothesis, estimate the model without the age group variables and save the
residuals using the predict command.
. regress lcrmaj prison metpct
Source  SS df MS Number of obs = 51
+ F( 2, 48) = 40.39
Model  5.31143197 2 2.65571598 Prob > F = 0.0000
Residual  3.1563375 48 .065757031 Rsquared = 0.6273
+ Adj Rsquared = 0.6117
Total  8.46776947 50 .169355389 Root MSE = .25643

lcrmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .1384944 .0248865 5.57 0.000 .0884566 .1885321
metpct  .0081552 .0017355 4.70 0.000 .0046656 .0116447
_cons  8.767611 .1142346 76.75 0.000 8.537927 8.997295

. predict e, resid
Now regress the residuals on all the variables, including the ones we left out. If they are truly irrelevant,
they will not be significant and the R-square for the regression will be approximately zero. The LM statistic
is N times the R-square, which is distributed as chi-square with, in this case, two degrees of freedom
because there are two restrictions (two variables left out).
. regress e prison metpct p1824 p2534
Source  SS df MS Number of obs = 51
+ F( 4, 46) = 0.06
Model  .016970705 4 .004242676 Prob > F = 0.9926
Residual  3.13936681 46 .068247105 Rsquared = 0.0054
+ Adj Rsquared = -0.0811
Total  3.15633751 50 .06312675 Root MSE = .26124

e  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  .0055023 .0282658 0.19 0.847 -.0513938 .0623985
metpct  .0004866 .0020265 0.24 0.811 -.0035924 .0045657
p1824  -.9140643 5.208719 -0.18 0.861 -11.39867 9.570543
p2534  -1.604708 3.935938 -0.41 0.685 -9.527341 6.317924
_cons  .3245431 .673677 0.48 0.632 -1.031498 1.680585

To compute the LM test statistic, we need to get the R-square and the number of observations. (Although
the fact that both variables have very low t-ratios is a clue.)
Whenever we run regress, Stata creates a bunch of output that is available for further analysis. This output
goes under the generic heading of "e" output. To see this output, use the ereturn list command.
. ereturn list
scalars:
e(N) = 51
e(df_m) = 4
e(df_r) = 46
e(F) = .0621663918252367
e(r2) = .0053767079385045
e(rmse) = .2612414678691742
e(mss) = .0169707049657037
e(rss) = 3.139366808584275
e(r2_a) = -.0811122739798864
e(ll) = -1.276850278809347
e(ll_0) = -1.414326247391365
macros:
e(depvar) : "e"
e(cmd) : "regress"
e(predict) : "regres_p"
e(model) : "ols"
matrices:
e(b) : 1 x 5
e(V) : 5 x 5
functions:
e(sample)
Since the numbers we want are scalars, not variables, we need the scalar command to create the test
statistic.
. scalar n=e(N)
. scalar R2=e(r2)
. scalar LM=n*R2
Now that we have the test statistic, we need to see if it is significant. We can compute the prob-value using
Stata's chi2 (chi-square) function.
. scalar prob=1-chi2(2,LM)
. scalar list
prob = .87187776
LM = .2742121
R2 = .00537671
n = 51
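The scalar steps above can be collected into one short block, run immediately after the auxiliary regression of e on all the variables. This is a sketch of a convenience, not part of the text's procedure; chi2tail() is Stata's right-tail counterpart of 1-chi2().

```stata
* Sketch: compute the LM statistic and its prob-value in one block.
scalar LM = e(N)*e(r2)
scalar prob = chi2tail(2, LM)   // = 1 - chi2(2, LM); 2 restrictions
scalar list LM prob
```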
The R-square is almost equal to zero, so the LM statistic is very low. The prob-value is .87. We cannot
reject the null hypothesis that the age groups are not related to crime. OK, so the F-test is easier.
Nevertheless, we will use the LM test later, especially for diagnostic testing of regression models.
One modification of this test that is easier to apply in Stata is the so-called F-form of the LM test, which
corrects for degrees of freedom and therefore has somewhat better small sample properties.³ To do the F-form,
just apply the F-test to the auxiliary regression. That is, after regressing the residuals on all the
variables, including the age group variables, use Stata's test command to test the joint hypothesis that the
coefficients on both variables are zero.
. test p1824 p2534
( 1) p1824 = 0
( 2) p2534 = 0
F( 2, 46) = 0.12
Prob > F = 0.8834
The results are identical to the original F-test above, showing that the LM test is equivalent to the F-test.
The results are also similar to the nR-square version of the test, as expected. I routinely use the F-form of
the LM test because it is easy to do in Stata and corrects for degrees of freedom.
³ See Kiviet, J.F., Testing Linear Econometric Models, Amsterdam: University of Amsterdam, 1987.
9 REGRESSION DIAGNOSTICS
Influential observations
It would be unfortunate if an outlier dominated our regression in such a way that one observation
determined the entire regression result.
. gen y=10 + 5*invnorm(uniform())
. gen x=10*invnorm(uniform())
. summarize y x
Variable  Obs Mean Std. Dev. Min Max
+
y  51 9.736584 4.667585 1.497518 19.76657
x  51 .6021098 8.954225 -17.87513 19.83905
Let’s graph these data. We know that they are completely independent, so there should not be a significant
relationship between them.
. twoway (scatter y x) (lfit y x)
The regression confirms no significant relationship.
. regress y x
Source  SS df MS Number of obs = 51
+ F( 1, 49) = 1.94
Model  41.4133222 1 41.4133222 Prob > F = 0.1703
Residual  1047.90398 49 21.3857955 Rsquared = 0.0380
+ Adj Rsquared = 0.0184
Total  1089.3173 50 21.786346 Root MSE = 4.6245

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  .1016382 .0730381 1.39 0.170 -.0451374 .2484138
_cons  9.797781 .649048 15.10 0.000 8.49347 11.10209

Now let’s create an outlier.
. replace y=y+30 if state==30
(1 real change made)
. twoway (scatter y x) (lfit y x)
Let’s try that regression again.
. regress y x
Source  SS df MS Number of obs = 51
+ F( 1, 49) = 6.83
Model  259.874802 1 259.874802 Prob > F = 0.0119
Residual  1864.52027 49 38.0514342 Rsquared = 0.1223
+ Adj Rsquared = 0.1044
Total  2124.39508 50 42.4879015 Root MSE = 6.1686

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  .2546063 .0974255 2.61 0.012 .0588225 .4503901
_cons  10.47812 .8657642 12.10 0.000 8.738302 12.21794

So, one observation is determining whether we have a significant relationship or not. This observation is
called “influential.” State 30, which happens to be New Hampshire, is said to dominate the regression. It is
sometimes easy to find outliers and influential observations, but not always. If you have a large data set, or
lots of variables, you could easily miss an influential observation. You do not want your conclusions to be
driven by a single observation or even a few observations. So how do we avoid this influential observation
trap?
Stata has a command called dfbeta which helps you avoid this problem. It works like this: suppose we simply
ran the regression with and without New Hampshire. With NH we get a significant result and a relatively
large coefficient on x. Without NH we should get a smaller and insignificant coefficient.
. regress y x if state ~=30
Source  SS df MS Number of obs = 50
+ F( 1, 48) = 1.61
Model  35.0543824 1 35.0543824 Prob > F = 0.2112
Residual  1047.6542 48 21.8261292 Rsquared = 0.0324
+ Adj Rsquared = 0.0122
Total  1082.70858 49 22.0960935 Root MSE = 4.6718

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x  .0989157 .0780517 1.27 0.211 -.0580178 .2558493
_cons  9.785673 .6653936 14.71 0.000 8.447809 11.12354

By dropping NH we are back to the original result reflecting the (true) relationship for the other 50
observations. We could do this 51 times, dropping each state in turn and seeing what happens, but that
would be tedious. Instead we can use matrix algebra and computer technology to achieve the same result.
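The drop-one-observation idea can also be brute-forced with a loop, which makes the logic behind dfbeta concrete. This is a sketch only, assuming state is coded 1 through 51 as in this data set.

```stata
* Sketch: re-run the regression 51 times, omitting each state in turn,
* and display the slope coefficient on x each time.
forvalues s = 1/51 {
    quietly regress y x if state != `s'
    display "omitting state `s': b_x = " _b[x]
}
```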
DFbetas
Belsley, Kuh, and Welsch⁴ suggest dfbetas (scaled dfbeta) as the best measure of influence.

DFbetas_i = (β̂ − β̂_(i)) / S_β̂(i)

where S_β̂(i) is the standard error of β̂_(i), and β̂_(i) is the estimated value of β omitting
observation i.
BKW suggest, as a rule of thumb, that an observation is influential if

|DFbetas| > 2/√N

where N is the sample size, although some researchers suggest that DFbetas should be greater than one
(that is, greater than one standard error).
After regress, invoke the dfbeta command.
. dfbeta
DFx: DFbeta(x)
⁴ Belsley, D. A., Kuh, E., and Welsch, R. E., Regression Diagnostics, John Wiley and Sons, New York, NY, 1980.
This produces a new variable called DFx. Now list the observations that might be influential.
. list state stnm y x DFx if DFx>2/sqrt(51)
++
 state stnm y x DFx 

51.  30 NH 42.282 19.83905 2.110021 
++
There is only one observation for which DFbetas is greater than the threshold value. Clearly NH is
influential, increasing the value of the coefficient by two standard errors. At this point it would behoove us
to look more closely at the NH value to make sure it is not a typo or something.
What do we do about influential observations? We hope we don’t have any. If we do, we hope they don’t
affect the main conclusions. We might find simple coding errors. If we have some and they are not coding
errors, we might take logarithms (which has the advantage of squashing variance) or weight the regression
to reduce their influence. If none of this works, we have to live with the results and confess the truth to the
reader.
Multicollinearity
We know that if one variable in a multiple regression is an exact linear combination of one or more other
variables in the regression, the regression will fail. Stata will be forced to drop the collinear variable
from the model. But what happens if a regressor is not an exact linear combination of some other variable
or variables, but almost? This is called multicollinearity. It has the effect of increasing the standard errors
of the OLS estimates. This means that variables that are truly significant appear to be insignificant in the
collinear model. Also, the OLS estimates are unstable and fragile (small changes in the data or model can
make a big difference in the regression estimates). It would be nice to know if we have multicollinearity.
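The exact-collinearity case is easy to see for yourself. A hedged sketch, with x5 a made-up name for a manufactured regressor:

```stata
* Sketch: Stata drops an exactly collinear regressor automatically.
generate x5 = x1 + 2*x2    // exact linear combination of x1 and x2
regress y x1 x2 x5         // Stata omits one of the collinear variables
```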
Variance inflation factors
Stata has a diagnostic known as the variance inflation factor, VIF, which indicates the presence of
multicollinearity. The rule of thumb is that multicollinearity is a problem if the VIF is over 30.
Here is an example.
. use http://www.ats.ucla.edu/stat/stata/modules/reg/multico, clear
. regress y x1 x2 x3 x4
Source  SS df MS Number of obs = 100
+ F( 4, 95) = 16.37
Model  5995.66253 4 1498.91563 Prob > F = 0.0000
Residual  8699.33747 95 91.5719733 Rsquared = 0.4080
+ Adj Rsquared = 0.3831
Total  14695 99 148.434343 Root MSE = 9.5693

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x1  1.118277 1.024484 1.09 0.278 -.9155807 3.152135
x2  1.286694 1.042406 1.23 0.220 -.7827429 3.356131
x3  1.191635 1.05215 1.13 0.260 -.8971469 3.280417
x4  -.8370988 1.038979 -0.81 0.422 -2.899733 1.225535
_cons  31.61912 .9709127 32.57 0.000 29.69161 33.54662

. vif
Variable  VIF 1/VIF
+
x4  534.97 0.001869
x3  175.83 0.005687
x1  91.21 0.010964
x2  82.69 0.012093
+
Mean VIF  221.17
Wow. All the VIF are over 30. Let’s look at the correlations among the independent variables.
. correlate x1-x4
(obs=100)
 x1 x2 x3 x4
+
x1  1.0000
x2  0.3553 1.0000
x3  0.3136 0.2021 1.0000
x4  0.7281 0.6516 0.7790 1.0000
x4 appears to be the major problem here. It is highly correlated with all the other variables. Perhaps we
should drop it and try again.
. regress y x1 x2 x3
Source  SS df MS Number of obs = 100
+ F( 3, 96) = 21.69
Model  5936.21931 3 1978.73977 Prob > F = 0.0000
Residual  8758.78069 96 91.2372989 Rsquared = 0.4040
+ Adj Rsquared = 0.3853
Total  14695 99 148.434343 Root MSE = 9.5518

y  Coef. Std. Err. t P>t [95% Conf. Interval]
+
x1  .298443 .1187692 2.51 0.014 .0626879 .5341981
x2  .4527284 .1230534 3.68 0.000 .2084695 .6969874
x3  .3466306 .0838481 4.13 0.000 .1801934 .5130679
_cons  31.50512 .9587921 32.86 0.000 29.60194 33.40831

. vif
Variable  VIF 1/VIF
+
x1  1.23 0.812781
x2  1.16 0.864654
x3  1.12 0.892234
+
Mean VIF  1.17
Much better, the variables are all significant and there is no multicollinearity at all. If we had not checked
for multicollinearity, we might have concluded that none of the variables were related to y. So doing a VIF
check is generally a good idea.
However, dropping variables is not without its dangers. It could be that x4 is a relevant variable and the
result of dropping it is that the remaining coefficients are biased and inconsistent. Be careful.
10 HETEROSKEDASTICITY
The Gauss-Markov assumptions can be summarized as

Y_i = α + βX_i + u_i,   u_i ~ iid(0, σ²)

where ~ iid(0, σ²) means "is independently and identically distributed with mean zero and variance σ²." In
this chapter we learn how to deal with the possibility that the "identically" assumption is violated. Another
way of looking at the GM assumptions is to glance at the expression for the variance of the error term, σ²:
if there is no i subscript then the variance is the same for all observations. Thus, for each observation the
error term, u_i, comes from a distribution with exactly the same mean and variance, i.e., is identically
distributed.
When the identically distributed assumption is violated, it means that the error variance differs across
observations. This nonconstant variance is called heteroskedasticity (hetero = different; skedastic, from the
Greek skedastikos, scattering). Constant variance is homoskedasticity.
If the data are heteroskedastic, ordinary least squares estimates are still unbiased, but they are not efficient.
This means that we have less confidence in our estimates. However, even worse, perhaps, our standard
errors and t-ratios are no longer correct. Whether we are over- or under-estimating our t-ratios depends on
the exact kind of heteroskedasticity, but in either case we will make mistakes in our hypothesis tests. So,
it would be nice to know if our data suffer from this problem.
Testing for heteroskedasticity
The first thing to do is not really a test, but simply good practice: look at the evidence. The residuals from
the OLS regression, e_i, are estimates of the unknown error term, u_i, so the first thing to do is graph the
residuals and see if they appear to be getting more or less spread out as Y and X increase.
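That eyeball check takes only a few lines. A sketch, using the crime regression introduced below and a residual name (ehat) chosen to avoid clobbering variables used later:

```stata
* Sketch: visual check for heteroskedasticity after an OLS fit.
regress crmaj prison metpct rpcpi
predict ehat, resid
twoway (scatter ehat metpct)   // repeat for any other suspect variable
```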
Breusch-Pagan test
There are two widely used tests for heteroskedasticity. The first is due to Breusch and Pagan. It is an LM
test. The basic idea is as follows.
Consider the following multiple regression model.
Y_i = β_0 + β_1 X_1i + β_2 X_2i + u_i

What we want to know is whether the variance of the error term is increasing with one of the variables in the
model (or some other variable). So we need an estimate of the variance of the error term at each
observation. We use the OLS residual, e, as our estimate of the error term, u. The usual estimator of the
variance of e is Σ_i (e_i − ē)²/n. However, we know that ē = 0 in any OLS regression, and if we want the
variance of e for each observation, we have to set n = 1. Therefore our estimate of the variance of u_i is e_i²,
the square of the residual. If there is no heteroskedasticity, the squared residuals will be unrelated to any of
the independent variables. So an auxiliary regression of the form

e_i²/σ̂² = a_0 + a_1 X_1i + a_2 X_2i

where σ̂² is the estimated error variance from the original OLS regression, should have an R-square of
zero. In fact, nR² ~ χ²_2, where the degrees of freedom is the number of variables on the right-hand side of the
auxiliary test equation. Alternatively, we could test whether the squared residual increases with the
predicted value of y or some other suspected variable.
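The auxiliary-regression logic above can be carried out by hand as a check on what hettest does. This is a sketch of the nR² version described in the text, with made-up variable names (uhat, uhat2):

```stata
* Manual Breusch-Pagan sketch: regress squared residuals on the
* regressors and treat n*R-squared as chi-square.
quietly regress crmaj prison metpct rpcpi
predict uhat, resid
generate uhat2 = uhat^2
quietly regress uhat2 prison metpct rpcpi
scalar LM = e(N)*e(r2)
display "LM = " LM "   prob = " chi2tail(3, LM)   // 3 regressors
```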
As an example, consider the following regression of major crime per capita across 51 states (including DC)
in 1990 (hetero.dta) on prison population per capita, the percent of the population living in metropolitan
areas (metpct), and real per capita personal income.
. regress crmaj prison metpct rpcpi
Source  SS df MS Number of obs = 51
+ F( 3, 47) = 43.18
Model  1805.01389 3 601.671298 Prob > F = 0.0000
Residual  654.970383 47 13.9355401 Rsquared = 0.7338
+ Adj Rsquared = 0.7168
Total  2459.98428 50 49.1996855 Root MSE = 3.733

crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  2.952673 .3576379 8.26 0.000 2.233198 3.672148
metpct  .1418042 .0328781 4.31 0.000 .0756619 .2079464
rpcpi  -.0004919 .0003134 -1.57 0.123 -.0011224 .0001386
_cons  6.57136 3.349276 1.96 0.056 -.1665135 13.30923

Let’s plot the residuals on population (which we suspect might be a problem variable).
. predict e, resid
. twoway (scatter e pop)
Hard to say. If we ignore the outlier at 3000, it looks like the variance of the residuals is getting larger with
larger population. Let’s see if the BreuschPagan test can help us.
The BreuschPagan test is invoked in Stata with the hettest command after regress. The default, if you don’t
include a varlist is to use the predicted value of the dependent variable.
. hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of crmaj
chi2(1) = 0.17
Prob > chi2 = 0.6819
Nothing appears to be wrong. Let’s see if there is a problem with any of the independent variables.
. hettest prison metpct rpcpi
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: prison metpct rpcpi
chi2(3) = 4.73
Prob > chi2 = 0.1923
Still nothing. Let’s see if there is a problem with population (since the dependent variable is measured as a
per capita variable).
. hettest pop
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: pop
chi2(1) = 3.32
Prob > chi2 = 0.0683
Ah ha! Population does seem to be a problem. We will consider what to do about it below.

[Figure: scatter plot of the regression residuals (vertical axis, roughly −5 to 15) against pop (horizontal axis, 0 to 30,000)]
White test
The other widely used test for heteroskedasticity is due to Halbert White (an econometrician from UCSD). White
modifies the Breusch-Pagan auxiliary regression as follows.

e_i² = a_0 + a_1 X_1i + a_2 X_2i + a_3 X_1i² + a_4 X_2i² + a_5 X_1i X_2i

The squared and interaction terms are used to capture any nonlinear relationships that might exist. The
number of regressors in the auxiliary test equation is k(k+1)/2 − 1, where k is the number of parameters
(including the intercept) in the original OLS regression. Here k = 3, so 3(4)/2 − 1 = 6 − 1 = 5. The test statistic is

nR² ~ χ²_{k(k+1)/2 − 1}
To invoke the White test use imtest, white. The im stands for information matrix.
. imtest, white
White's test for Ho: homoskedasticity
against Ha: unrestricted heteroskedasticity
chi2(9) = 6.63
Prob > chi2 = 0.6757
Cameron & Trivedi's decomposition of IMtest

Source  chi2 df p
+
Heteroskedasticity  6.63 9 0.6757
Skewness  4.18 3 0.2422
Kurtosis  2.28 1 0.1309
+
Total  13.09 13 0.4405

Ignore Cameron and Trivedi's decomposition. All we care about is heteroskedasticity, which doesn't
appear to be a problem according to White. The White test has a slight advantage over the Breusch-Pagan
test in that it is less reliant on the assumption of normality. However, because it computes the squares and
cross-products of all the variables in the regression, it sometimes runs out of space or out of degrees of
freedom and will not compute. Finally, the White test doesn't warn us of the possibility of
heteroskedasticity with a variable that is not in the model, like population. (Remember, there is no
heteroskedasticity associated with prison, metpct, or rpcpi.) For these reasons, I prefer the Breusch-Pagan
test.
Weighted least squares
Suppose we know what is causing the heteroskedasticity. For example, suppose the model is (in deviations)

y_i = βx_i + u_i   where   var(u_i) = σ² z_i²

Consider the weighted regression, where we divide both sides by z_i:

y_i/z_i = β(x_i/z_i) + u_i/z_i

What is the variance of the new error term, u_i/z_i?

var(u_i/z_i) = var(c_i u_i) = c_i² var(u_i)   where c_i = 1/z_i
             = (1/z_i²) σ² z_i² = σ²

So, weighting by 1/z_i causes the variance to be constant. Applying OLS to this weighted regression will be
unbiased, consistent, and efficient (relative to unweighted least squares).
The problem with this approach is that you have to know the exact form of the heteroskedasticity. If you
think you know you have heteroskedasticity and correct it with weighted least squares, you’d better be
right. If you were wrong and there was no heteroskedasticity, you have created it. You could make things
worse by using WLS inappropriately.
There is one case where weighted least squares is appropriate. Suppose that the true relationship between
two variables (in deviations) is

y_i^T = β x_i^T + u_i^T,   u_i^T ~ IN(0, σ²)

However, suppose the data are only available at the state level, where they are aggregated across n
individuals. Summing the relationship across the state's population, n, yields

Σ_i y_i^T = β Σ_i x_i^T + Σ_i u_i^T

or, for each state,

y = βx + u.

We want to estimate the model as close as possible to the truth, so we divide the state totals by the state
population to produce per capita values.

y/n = β(x/n) + u/n

What is the variance of the error term?

Var(u/n) = Var((1/n) Σ_i u_i^T) = (1/n²)(Var(u_1^T) + Var(u_2^T) + ... + Var(u_n^T)) = nσ²/n² = σ²/n

So, we should weight by the square root of population.

√n (y/n) = β √n (x/n) + √n (u/n)

Var(√n · u/n) = n Var(u/n) = n (σ²/n) = σ²

which is homoskedastic.
This does not mean that it is always correct to weight by the square root of population. After all, the error
term might have some heteroskedasticity in addition to being an average. The most common practice in
crime studies is to weight by population, not the square root, although some researchers weight by the
square root of population.
We can do weighted least squares in Stata with the [weight=variable] option in the regress command. Put
the [weight=variable] in square brackets after the varlist.
. regress crmaj prison metpct rpcpi [weight=pop]
(analytic weights assumed)
(sum of wgt is 2.4944e+05)
Source  SS df MS Number of obs = 51
+ F( 3, 47) = 19.35
Model  1040.92324 3 346.974412 Prob > F = 0.0000
Residual  842.733646 47 17.9305031 Rsquared = 0.5526
+ Adj Rsquared = 0.5241
Total  1883.65688 50 37.6731377 Root MSE = 4.2344

crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  3.470689 .7111489 4.88 0.000 2.040042 4.901336
metpct  .1887946 .058493 3.23 0.002 .0711219 .3064672
rpcpi  -.0006482 .0004682 -1.38 0.173 -.0015902 .0002938
_cons  4.546826 4.851821 0.94 0.353 -5.213778 14.30743

Now let’s see if there is any heteroskedasticity.
. hettest pop
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: pop
chi2(1) = 0.60
Prob > chi2 = 0.4386
. hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of crmaj
chi2(1) = 1.07
Prob > chi2 = 0.3010
. hettest prison metpct rpcpi
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: prison metpct rpcpi
chi2(3) = 2.93
Prob > chi2 = 0.4021
The heteroskedasticity seems to have been taken care of. If the correction had made the heteroskedasticity
worse instead of better, then we would have tried weighting by 1/pop, the inverse of population, rather than
by population itself.
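That fallback can be sketched directly; invpop is a made-up name for the constructed weight variable:

```stata
* Sketch: if weighting by pop had made things worse, try the inverse.
generate invpop = 1/pop
regress crmaj prison metpct rpcpi [weight=invpop]
hettest pop   // re-check for heteroskedasticity related to population
```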
Robust standard errors and t-ratios
Since heteroskedasticity does not bias the coefficient estimates, we could simply correct the standard errors
and resulting t-ratios so our hypothesis tests are valid.
Recall how we derived the mean and variance of the sampling distribution of β̂. We started with a simple
regression model in deviations.

y_i = βx_i + u_i

β̂ = Σx_i y_i / Σx_i² = Σx_i(βx_i + u_i) / Σx_i² = β + Σx_i u_i / Σx_i²

The mean is

E(β̂) = β + E(Σx_i u_i / Σx_i²) = β + Σx_i E(u_i) / Σx_i² = β

because E(u_i) = 0 for all i.
The variance of β̂ is the expected value of the square of the deviation of β̂, E(β̂ − β)².

β̂ − β = Σx_i u_i / Σx_i²

Var(β̂) = E(β̂ − β)² = E(Σx_i u_i / Σx_i²)²
       = (1/(Σx_i²)²) E(x_1u_1 + x_2u_2 + ... + x_nu_n)²
       = (1/(Σx_i²)²) E(x_1²u_1² + x_2²u_2² + ... + x_n²u_n² + 2x_1x_2u_1u_2 + 2x_1x_3u_1u_3 + 2x_1x_4u_1u_4 + ...)

(The x's are fixed numbers, so they pass through the expectation operator.) We know that E(u_i²) = σ²
(identical) and E(u_iu_j) = 0 for i ≠ j (independent), so

Var(β̂) = (1/(Σx_i²)²)(x_1²σ² + x_2²σ² + ... + x_n²σ²)

Factor out the constant variance, σ², to get

Var(β̂) = σ² Σx_i² / (Σx_i²)² = σ² / Σx_i²

In the case of heteroskedasticity we cannot factor out σ². It is still true that E(u_iu_j) = 0 for i ≠ j
(independent), so

Var(β̂) = (1/(Σx_i²)²)(x_1²σ_1² + x_2²σ_2² + ... + x_n²σ_n²) = Σx_i²σ_i² / (Σx_i²)²

To make this formula operational, we need an estimate of σ_i². As above, we will use the square of the
residual, e_i².

Vâr(β̂) = Σx_i²e_i² / (Σx_i²)²

Take the square root to produce the robust standard error of beta.

S_β̂ = √( Σx_i²e_i² / (Σx_i²)² )

This is also known as the heteroskedastic consistent standard error (HCSE), or the White, Huber, Eicker, or
sandwich standard error. We will call them robust standard errors. Dividing β̂ by the robust standard error
yields the robust t-ratio.
It is easy to get robust standard errors in Stata. For example, in the crime equation above we can see the
robust standard errors and t-ratios by adding the option robust to the regress command.
. regress crmaj prison metpct rpcpi, robust
Regression with robust standard errors Number of obs = 51
F( 3, 47) = 131.72
Prob > F = 0.0000
Rsquared = 0.7338
Root MSE = 3.733

 Robust
crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  2.952673 .1667412 17.71 0.000 2.617233 3.288113
metpct  .1418042 .03247 4.37 0.000 .0764828 .2071255
rpcpi  -.0004919 .0002847 -1.73 0.091 -.0010646 .0000807
_cons  6.57136 3.006661 2.19 0.034 .5227393 12.61998

Compare this with the original model.
Source  SS df MS Number of obs = 51
+ F( 3, 47) = 43.18
Model  1805.01389 3 601.671298 Prob > F = 0.0000
Residual  654.970383 47 13.9355401 Rsquared = 0.7338
+ Adj Rsquared = 0.7168
Total  2459.98428 50 49.1996855 Root MSE = 3.733

crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  2.952673 .3576379 8.26 0.000 2.233198 3.672148
metpct  .1418042 .0328781 4.31 0.000 .0756619 .2079464
rpcpi  -.0004919 .0003134 -1.57 0.123 -.0011224 .0001386
_cons  6.57136 3.349276 1.96 0.056 -.1665135 13.30923

It is possible to weight and use robust standard errors. In fact, it is the standard methodology in crime
regressions.
. regress crmaj prison metpct rpcpi [weight=pop], robust
(analytic weights assumed)
(sum of wgt is 2.4944e+05)
Regression with robust standard errors Number of obs = 51
F( 3, 47) = 17.28
Prob > F = 0.0000
Rsquared = 0.5526
Root MSE = 4.2344

 Robust
crmaj  Coef. Std. Err. t P>t [95% Conf. Interval]
+
prison  3.470689 .6700652 5.18 0.000 2.122692 4.818686
metpct  .1887946 .0726166 2.60 0.012 .0427089 .3348802
rpcpi  -.0006482 .0005551 -1.17 0.249 -.0017649 .0004684
_cons  4.546826 4.683806 0.97 0.337 -4.875776 13.96943

What is the bottom line with respect to our treatment of heteroskedasticity? Robust standard errors yield
consistent estimates of the true standard error no matter what the form of the heteroskedasticity. Using
them allows us to avoid having to discover the particular form of the problem. Finally, if there is no
heteroskedasticity, we get the usual standard errors anyway. For these reasons, I recommend always using
robust standard errors. That is, always add “, robust” to the regress command. This is becoming standard
procedure among applied researchers. If you know that a variable like population is a problem, then you
should weight in addition to using robust standard errors. The existence of a problem variable will become
obvious in your library research when you find that most of the previous studies weight by a certain
variable.
11 ERRORS IN VARIABLES
One of the Gauss-Markov assumptions is that the independent variable, x, is fixed in repeated sampling.
This is a very convenient assumption for proving that the OLS parameter estimates are unbiased. However,
it is only necessary that x be independent of the error term; that is, that there be no covariance between the
error term and the regressor or regressors: E(x_i u_i) = 0. If this assumption is violated, then ordinary
least squares is biased and inconsistent.
This could happen if there are errors of measurement in the independent variable or variables. For example,
suppose the true model, expressed in deviations, is

y = βx^T + ε

where x^T is the true value of x. However, we only have the observed x = x^T + f, where f is random
measurement error, so

x^T = x − f

y = β(x − f) + ε = βx + (ε − βf)

We have a problem if x is correlated with the error term (ε − βf).

cov(x, ε − βf) = E[(x^T + f)(ε − βf)]
              = E(x^T ε − βx^T f + fε − βf²)
              = −βσ_f² ≠ 0 if f has any variance.

So, x is correlated with the error term. This means that the OLS estimate is biased.
β̂ = Σxy/Σx² = Σ(x^T + f)(βx^T + ε) / Σ(x^T + f)²
  = (β Σ(x^T)² + Σx^T ε + β Σx^T f + Σfε) / (Σ(x^T)² + 2Σx^T f + Σf²)

(We are assuming that plim(Σx^T ε) = plim(Σx^T f) = plim(Σfε) = 0.) So, dividing top and bottom by
Σ(x^T)²,

plim(β̂) = β / (1 + Σf²/Σ(x^T)²) = β / (1 + var(f)/var(x^T))

Since the two variances cannot be negative, we can conclude that plim β̂ ≠ β if var(f) ≠ 0, which means
that the OLS estimate is biased and inconsistent if the measurement error has any variance at all. In this
case, the OLS estimate is biased downward, since the true value of beta is divided by a number greater than
one.
Cure for errors in variables
There are only two things we can do to fix errors in variables. The first is to use the true value of x. Since
this is usually not possible, we have to fall back on statistics. Suppose we know of a variable, z, that is (1)
highly correlated with x^T, but (2) uncorrelated with the error term. This variable, called an instrument or
instrumental variable, allows us to derive consistent (not unbiased) estimates.
y = βx + v   where   v = (ε − βf)

Multiply both sides by z.

zy = βzx + zv

The error is

e = zy − β̂zx

The sum of squares of error is

Σe² = Σ(zy − β̂zx)²

To minimize the sum of squares of error, we take the derivative with respect to β̂ and set it equal to zero.

dΣe²/dβ̂ = −2Σzx(zy − β̂zx) = 0   (note chain rule)

The derivative is equal to zero if and only if the expression in parentheses is zero.

Σzy − β̂Σzx = 0, which implies that β̂ = Σzy / Σzx

This is the instrumental variable (IV) estimator. It collapses to the OLS estimator if x can be an instrument
for itself, which would happen if x is not measured with error. Then x would be highly correlated with
itself, and not correlated with the error term, making it the perfect instrument.
This IV estimator is consistent because z is not correlated with v.
plim(β̂) = plim(Σyz/Σxz) = plim(Σ(βx + v)z / Σxz)
        = plim((βΣxz + Σvz)/Σxz) = β + plim(Σvz)/plim(Σxz)
        = β

since plim(Σvz) = 0, because z is uncorrelated with the error term.
Two stage least squares
Another way to get IV estimates is called two stage least squares (TSLS or 2sls). In the first stage we
regress the variable with the measurement error, x, on the instrument, z, using ordinary least squares,
saving the predicted values. In the second stage, we substitute the predicted values of x for the actual x and
apply ordinary least squares again. The result is the consistent IV estimate above.
First stage: regress x on z:
x z w γ = +
Save the predicted values.
ˆ ˆ x z γ =
Note that ˆ x is a linear combination of z. Since z is independent of the error term, so is ˆ x .
Second stage: substitute x̂ for x and apply OLS again.

y = βx̂ + v
The resulting estimate of beta is the instrumental variable estimate above.

β̂ = Σx̂y/Σx̂² = Σ(γ̂z)y/Σ(γ̂z)² = γ̂Σzy/(γ̂²Σz²) = Σzy/(γ̂Σz²)

But γ̂ = Σzx/Σz² from the first stage regression. So,

β̂ = Σzy/((Σzx/Σz²)Σz²) = Σzy/Σzx

the instrumental variable estimator.
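The algebraic claim that the two stages collapse to Σzy/Σzx can be verified numerically. This is an illustrative check with made-up numbers (not data from the text), for regressions through the origin as in the derivation.

```python
# Numerical check that two stage least squares reproduces the IV estimator
# sum(zy)/sum(zx) exactly, for regressions through the origin.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.2, 1.9, 3.3, 3.8, 5.1]
y = [2.1, 4.2, 6.0, 8.1, 9.7]

gamma_hat = sum(zi * xi for zi, xi in zip(z, x)) / sum(zi * zi for zi in z)  # first stage
x_hat = [gamma_hat * zi for zi in z]                                          # predicted values
beta_2sls = sum(xh * yi for xh, yi in zip(x_hat, y)) / sum(xh * xh for xh in x_hat)
beta_iv = sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * xi for zi, xi in zip(z, x))

assert abs(beta_2sls - beta_iv) < 1e-12   # identical up to rounding
```

The equality is exact, not approximate, because γ̂ cancels out of the second-stage ratio.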
Hausman-Wu test
We can test for the presence of errors in variables if we have an instrument. Assume that the relationship
between the instrument z and the possibly mismeasured variable x is linear.

x = γz + w

We can estimate this relationship using ordinary least squares.

x̂ = γ̂z
So,

x = x̂ + ŵ

where ŵ is the residual from the regression of x on z.
And,

y = βx + v = β(x̂ + ŵ) + v
y = βx̂ + βŵ + v
Since x̂ is simply a linear combination of z, it is uncorrelated with the error term. Therefore, applying OLS
to this equation will yield a consistent estimate of β as the parameter multiplying x̂. If there is
measurement error in x, it is all in the residual ŵ, so the estimate of β from that parameter will be
inconsistent. If these two estimates are significantly different, it must be due to measurement error. Let’s
call the coefficient on ŵ δ. So we need an easy way to test whether β = δ.
y = βx̂ + δŵ + v, but x̂ = x − ŵ, so
y = β(x − ŵ) + δŵ + v
y = βx + (δ − β)ŵ + v
The null hypothesis that β = δ can be tested with a standard t-test on the coefficient multiplying ŵ. If it is
significant, then the two parameter estimates are different and there is significant errors-in-variables bias. If
it is not significant, then the two parameters are equal and there is no significant bias.
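The mechanics of the augmented regression can be sketched numerically. This is my own illustration with hypothetical data, not the Wooldridge example: because x = x̂ + ŵ and the two pieces are orthogonal, the coefficient on x in the regression of y on x and ŵ equals the IV estimate Σzy/Σzx, while the coefficient on ŵ is the quantity (δ − β) that the t-test examines.

```python
# Hausman-Wu augmented regression y = beta*x + (delta - beta)*w_hat + v,
# with made-up data and regressions through the origin.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.4, 1.8, 3.5, 3.7, 5.2]
y = [2.0, 4.5, 5.8, 8.3, 9.6]

gamma_hat = sum(zi * xi for zi, xi in zip(z, x)) / sum(zi * zi for zi in z)
w_hat = [xi - gamma_hat * zi for xi, zi in zip(x, z)]   # first-stage residual

# OLS of y on x and w_hat (no constant): solve the 2x2 normal equations.
sxx = sum(xi * xi for xi in x)
sxw = sum(xi * wi for xi, wi in zip(x, w_hat))
sww = sum(wi * wi for wi in w_hat)
sxy = sum(xi * yi for xi, yi in zip(x, y))
swy = sum(wi * yi for wi, yi in zip(w_hat, y))
det = sxx * sww - sxw * sxw
b_x = (sww * sxy - sxw * swy) / det        # coefficient on x
b_w = (sxx * swy - sxw * sxy) / det        # coefficient on w_hat = (delta - beta)

beta_iv = sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * xi for zi, xi in zip(z, x))
assert abs(b_x - beta_iv) < 1e-9           # coefficient on x is the IV estimate
```

This mirrors what the Stata `regress lwage educ what` step below does after saving the first-stage residual.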
We can implement this test in Stata as follows. Wooldridge has a nice example and the data set is available
on the web. The data are in the Stata data set called errorsinvars.dta. The dependent variable is the log of
the wage earned by working women. The independent variable is education. The estimated coefficient is
the return to education. First let’s do the regression with ordinary least squares.
. regress lwage educ
Source  SS df MS Number of obs = 428
+ F( 1, 426) = 56.93
Model  26.3264237 1 26.3264237 Prob > F = 0.0000
Residual  197.001028 426 .462443727 Rsquared = 0.1179
+ Adj Rsquared = 0.1158
Total  223.327451 427 .523015108 Root MSE = .68003

lwage  Coef. Std. Err. t P>t [95% Conf. Interval]
+
educ  .1086487 .0143998 7.55 0.000 .0803451 .1369523
_cons  -.1851969 .1852259 -1.00 0.318 -.5492674 .1788735

The return to education is a nice ten percent. However, the high wages earned by highly educated women
might be due to more intelligence. Brighter women will probably do both, go to college and earn high
wages. So how much of the return to education is due to the college education? Let’s use father’s education
as an instrument. It is probably correlated with his daughter’s education but independent of the error term
in her wage equation. We can implement instrumental variables with ivreg. The first stage regression varlist is put in parentheses.
. ivreg lwage (educ=fatheduc)
Instrumental variables (2SLS) regression
Source  SS df MS Number of obs = 428
+ F( 1, 426) = 2.84
Model  20.8673618 1 20.8673618 Prob > F = 0.0929
Residual  202.460089 426 .475258426 Rsquared = 0.0934
+ Adj Rsquared = 0.0913
Total  223.327451 427 .523015108 Root MSE = .68939

lwage  Coef. Std. Err. t P>t [95% Conf. Interval]
+
educ  .0591735 .0351418 1.68 0.093 -.0098994 .1282463
_cons  .4411035 .4461018 0.99 0.323 -.4357311 1.317938

Instrumented: educ
Instruments: fatheduc

. regress educ fatheduc
Source  SS df MS Number of obs = 428
+ F( 1, 426) = 88.84
Model  384.841983 1 384.841983 Prob > F = 0.0000
Residual  1845.35428 426 4.33181756 Rsquared = 0.1726
+ Adj Rsquared = 0.1706
Total  2230.19626 427 5.22294206 Root MSE = 2.0813

educ  Coef. Std. Err. t P>t [95% Conf. Interval]
+
fatheduc  .2694416 .0285863 9.43 0.000 .2132538 .3256295
_cons  10.23705 .2759363 37.10 0.000 9.694685 10.77942

. predict what , resid
. regress lwage educ what
Source  SS df MS Number of obs = 428
+ F( 2, 425) = 29.80
Model  27.4648913 2 13.7324457 Prob > F = 0.0000
Residual  195.86256 425 .460853082 Rsquared = 0.1230
+ Adj Rsquared = 0.1189
Total  223.327451 427 .523015108 Root MSE = .67886

lwage  Coef. Std. Err. t P>t [95% Conf. Interval]
+
educ  .0591735 .0346051 1.71 0.088 -.008845 .1271919
what  .0597931 .0380427 1.57 0.117 -.0149823 .1345684
_cons  .4411035 .439289 1.00 0.316 -.4223459 1.304553

The instrumental variables regression shows a lower and less significant return to education. In fact,
education is only significant at the .10 level. Also, there might still be some bias because the coefficient on
the residual what is almost significant at the .10 level. More research is needed.
12 SIMULTANEOUS EQUATIONS
We know that whenever an explanatory variable is correlated with the error term in a regression, the
resulting ordinary least squares estimates are biased and inconsistent. For example, consider the simple
Keynesian system.
C = α + βY + u
Y = C + I + G + X
where C is consumption, Y is national income (e.g. GDP), I is investment, G is government spending, and
X is net exports.
Before we go on, we need to define some terms. The above simultaneous equation system of two equations
in two unknowns (C and Y) is known as the structural model. It is the theoretical model from the textbook
or journal article, with an error term added for statistical estimation. The variables C and Y are known as
the endogenous variables because their values are determined within the system. I, G, and X are
exogenous variables because their values are determined outside the system. They are “taken as given” by
the structural model. This structural model has two types of equations. The first equation (the familiar
consumption function) is known as a behavioral equation. Behavioral equations have error terms,
indicating some randomness. The second equation is an identity because it simply defines national income
as the sum of all spending (endogenous plus exogenous). The terms α and β are the parameters of the
system.
If we knew the values of the parameters (α and β) and the exogenous variables (I, G, X, and the exogenous
error term u), we could solve for the corresponding values of C and Y, as follows. Substituting the second
equation into the first yields the value for C.
C = α + β(C + I + G + X) + u = α + βC + βI + βG + βX + u
C − βC = α + βI + βG + βX + u
(1 − β)C = α + βI + βG + βX + u
C = (α + βI + βG + βX + u)/(1 − β)
C = α/(1 − β) + [β/(1 − β)]I + [β/(1 − β)]G + [β/(1 − β)]X + u/(1 − β)    (RF1)
The resulting equation is called the reduced form equation for C. I have labeled it RF1 for “reduced form
1.” Note that it is a behavioral equation and that, in this case, it is linear. We could, if we had the data,
estimate this equation with ordinary least squares.
We can derive the reduced form equation for Y by substituting the consumption function into the national
income identity.
Y = α + βY + u + I + G + X
Y − βY = α + I + G + X + u
(1 − β)Y = α + I + G + X + u
Y = (α + I + G + X + u)/(1 − β)
Y = α/(1 − β) + [1/(1 − β)]I + [1/(1 − β)]G + [1/(1 − β)]X + u/(1 − β)    (RF2)
The two reduced form equations look very similar. They are both linear, with almost the same parameters.
The slope coefficients are either 1/(1 − β) or β/(1 − β); all of them are functions of the parameters of the structural model. Both
equations have the same variables on the right hand side, namely all of the exogenous variables.
Because of the fact that the reduced form equations have the same list of variables on the right hand side,
all we need to write them down in general form is the list of endogenous variables (because we have one
reduced form equation for each endogenous variable) and the list of exogenous variables. In the system
above the endogenous variables are C and Y and the exogenous variables are I, G, X (and the random error
term u), so the reduced form equations can be written as
C = π10 + π11 I + π12 G + π13 X + v1
Y = π20 + π21 I + π22 G + π23 X + v2

where we are using the “pi’s” to indicate the reduced form parameters. Obviously,
π20 = α/(1 − β), etc.
We are now in position to show that the endogenous variable, Y, on the right hand side of the consumption
function is correlated with the error term in the consumption equation. Look at the reduced form equation
for Y (RF2) above. Note that Y is a function of u. Clearly, variation in u will cause variation in Y.
Therefore, cov(Y,u) is not zero. As a result, the OLS estimate of β in the consumption function is biased
and inconsistent.
On the other hand, OLS estimates of the reduced form equation are unbiased and consistent because the
exogenous variables are taken as given and therefore uncorrelated with the error terms, v1 and v2. The bad
news is that the parameters from the reduced form are “scrambled” versions of the structural form
parameters. In the case of the simple Keynesian model, the coefficients from the reduced form equation for
Y correspond to multipliers, giving the change in Y for a one-unit change in I or G or X.
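The reduced form algebra above can be verified with a quick numeric check. The parameter values below are my own illustration, not from the text: we solve the structural system directly and compare with the RF1/RF2 formulas, then confirm the multiplier interpretation.

```python
# Numerical check of the Keynesian reduced forms.
alpha, beta = 50.0, 0.8
I, G, X, u = 100.0, 120.0, 30.0, 5.0

# Solve the two structural equations C = alpha + beta*Y + u, Y = C + I + G + X.
Y = (alpha + u + I + G + X) / (1 - beta)
C = alpha + beta * Y + u

# Reduced form predictions.
C_rf = (alpha + beta * I + beta * G + beta * X + u) / (1 - beta)
Y_rf = (alpha + I + G + X + u) / (1 - beta)

assert abs(C - C_rf) < 1e-9
assert abs(Y - Y_rf) < 1e-9

# The multiplier on I in RF2 is 1/(1 - beta) = 5: raising I by one unit raises Y by 5.
Y_new = (alpha + u + (I + 1) + G + X) / (1 - beta)
assert abs((Y_new - Y) - 1 / (1 - beta)) < 1e-9
```

With β = 0.8 the spending multiplier is 5, which is exactly the change in Y when I rises by one unit.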
Example: supply and demand
Consider the following structural supply and demand model (in deviations).
q = αp + ε    (S)
q = β1p + β2y + u    (D)
where q is the equilibrium quantity supplied and demanded, p is the price, and y is income. There are two
endogenous variables (p and q) and one exogenous variable, y. We can derive the reduced form equations
by solving by back substitution.
Solve for p:

q = αp + ε
q − ε = αp
p = (q − ε)/α

Substitute into the demand equation,

q = β1[(q − ε)/α] + β2y + u

Multiply by α to get rid of fractions.

αq = β1q − β1ε + αβ2y + αu
αq − β1q = αβ2y + αu − β1ε
(α − β1)q = αβ2y + αu − β1ε
q = (αβ2y + αu − β1ε)/(α − β1)
q = [αβ2/(α − β1)]y + (αu − β1ε)/(α − β1)
q = π11y + v1    (RF1)
This is RF1 in general form. Now we solve for p.
αp + ε = β1p + β2y + u
αp − β1p = β2y + u − ε
(α − β1)p = β2y + u − ε
p = (β2y + u − ε)/(α − β1)
p = [β2/(α − β1)]y + (u − ε)/(α − β1)
p = π21y + v2    (RF2)
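The two reduced forms can be checked numerically. The parameter values below are hypothetical (my own illustration): we solve the structural supply and demand system directly and confirm that RF1 and RF2 reproduce the solution and satisfy both structural equations.

```python
# Check that the reduced form formulas solve the structural system
# q = alpha*p + eps (supply), q = beta1*p + beta2*y + u (demand).
alpha, beta1, beta2 = 1.5, -0.8, 0.6
y, u, eps = 2.0, 0.3, -0.2

# Solve the two equations directly for p and q.
p = (beta2 * y + u - eps) / (alpha - beta1)
q = alpha * p + eps

# Reduced form values from RF1 and RF2.
q_rf = (alpha * beta2 * y + alpha * u - beta1 * eps) / (alpha - beta1)
p_rf = (beta2 * y + (u - eps)) / (alpha - beta1)

assert abs(q - q_rf) < 1e-12
assert abs(p - p_rf) < 1e-12
# Both structural equations hold at the reduced form solution.
assert abs(q_rf - (alpha * p_rf + eps)) < 1e-12
assert abs(q_rf - (beta1 * p_rf + beta2 * y + u)) < 1e-12
```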
This is RF2 in general form. RF2 reveals that price is a function of u, so applying OLS to either the supply
or demand curve will result in biased and inconsistent estimates.
However, we can again appeal to instrumental variables (2sls) to yield consistent estimates. The good news
is that the structural model itself produces the instruments. Remember, instruments are required to be
highly correlated with the problem variable (p), but uncorrelated with the error term. The exogenous
variable(s) in the model satisfy these conditions. In the supply and demand model, income is correlated
with price (see RF2), but independent of the error term because it is exogenous.
Suppose we choose the supply curve as our target equation. We can get a consistent estimate of α with the
IV estimator,

α̂ = Σyq / Σyp
This estimator can be derived as follows.

q = αp + ε

Multiply both sides by y

yq = αyp + yε

Now sum

Σyq = αΣyp + Σyε

But, the instrument is not correlated with the error term,

Σyε = 0

So,

Σyq = αΣyp

and

α̂ = Σyq / Σyp
We can also use two stage least squares. In the first stage we regress the problem variable, p, on all the
exogenous variables in the model. Here there is only one exogenous variable, y. In other words, estimate
the reduced form equation, RF2. The resulting estimate of π21 is

π̂21 = Σpy / Σy²

and the predicted value of p is p̂ = π̂21y. In the second stage we regress q on p̂. The resulting
estimator is
α̂ = Σqp̂/Σp̂² = π̂21Σqy/(π̂21²Σy²) = Σqy/(π̂21Σy²) = Σqy/((Σpy/Σy²)Σy²) = Σqy/Σpy

which is the same as the IV estimator above.
Indirect least squares
Yet another method for deriving a consistent estimate of α is indirect least squares. Suppose we estimate
the two reduced form equations above. Then we get unbiased and consistent estimates of π11 and π21. If we
take the probability limit of the ratio of the two estimators, we get the IV estimator again.
From the reduced form equations,

π11 = αβ2/(α − β1)  and  π21 = β2/(α − β1)

which implies that

π11/π21 = α

So,

plim(π̂11/π̂21) = α

Because both π̂11 and π̂21 are consistent estimates and the ratio of two consistent estimates is consistent.
(Remember, consistency "carries over.")
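The indirect least squares ratio can be confirmed with the same hypothetical parameter values used earlier (an illustration of mine, not from the text): the common factor 1/(α − β1) cancels, leaving α.

```python
# Indirect least squares: the ratio of the two reduced form slopes on y
# recovers alpha, the slope of the supply curve.
alpha, beta1, beta2 = 1.5, -0.8, 0.6

pi11 = alpha * beta2 / (alpha - beta1)   # slope on y in the RF for q
pi21 = beta2 / (alpha - beta1)           # slope on y in the RF for p

assert abs(pi11 / pi21 - alpha) < 1e-12  # the ratio is exactly alpha
```

Note there is no way to combine π11 and π21 that isolates β1 or β2 separately, which is the identification failure for the demand curve discussed next.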
Unfortunately, indirect least squares does not always work. In fact, we cannot get a consistent estimate of
the demand equation because, no matter how we manipulate π̂11 and π̂21, we keep getting α instead of β1 or
β2.
Suppose we expand the model above to add another explanatory variable to the demand equation: w for
weather.
q = αp + ε
q = β1p + β2y + β3w + u
Solving by back substitution, as before, yields the reduced form equations.
q = [αβ2/(α − β1)]y + [αβ3/(α − β1)]w + (αu − β1ε)/(α − β1)
q = π11y + π12w + v1

p = [β2/(α − β1)]y + [β3/(α − β1)]w + (u − ε)/(α − β1)
p = π21y + π22w + v2
We can use indirect least squares to estimate the slope of the supply curve as before.
π11/π21 = [αβ2/(α − β1)] / [β2/(α − β1)] = α, as before.

But also,

π12/π22 = [αβ3/(α − β1)] / [β3/(α − β1)] = α (again).
Whoops. Now we have two estimates of α. Given normal sampling error, we won’t get exactly the same
estimate, so which is right? Or which is more right?
So, indirect least squares yielded a unique estimate in the case of α, no estimate at all of the parameters of
the demand curve, and multiple estimates of α with a minor change in the specification of the structural
model. What is going on?
The identification problem
An equation is identified if we can derive estimates of the structural parameters from the reduced form. If
we cannot, it is said to be unidentified. The demand curve was unidentified in the above examples because
we could not derive any estimates of β1 or β2 from the reduced form equations. If an equation is identified,
it can be “just” identified, which means the reduced form equations yield a unique estimate of the
parameters of the equation of interest, or it can be “over” identified, in which case the reduced form yields
multiple estimates of the same parameters of the equation of interest. The supply curve was just identified
in the first example, but over identified when we added the weather variable.
in the first example, but over identified when we added the weather variable.
It would be nice to know if the structural equation we are trying to estimate is identified. A simple rule is
the so-called order condition for identification. To apply the order condition, we first have to choose the
equation of interest. Let G be the number of endogenous variables in the equation of interest (including the
dependent variable on the left hand side). Let K be the number of exogenous variables that are in the
structural model, but excluded from the equation of interest. To be identified, the following relationship
must hold.
K ≥ G − 1
Let’s apply this rule to the supply curve in the first supply and demand model. G=2 (q and p) and K=1 (y is
excluded from the supply curve), so 1 = 2 − 1 and the supply equation is just identified. This is why we got a
unique estimate of α from indirect least squares. Choosing the demand curve as the equation of interest, we
find that G=2 as before, but K=0 because there are no exogenous variables excluded from the demand
curve. Since 0 < 2 − 1, the demand curve is unidentified. Finally, let’s analyze the supply curve from the second supply and demand model. Again, G=2, but
now K=2 (y and w are both excluded from the supply curve). So 2 > 2 − 1 and the supply curve is over
identified. That is why we got two estimates from indirect least squares.
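The counting rule is mechanical enough to write down directly. The helper below is my own sketch (not a Stata command), applied to the three cases just worked out.

```python
# A small helper implementing the order condition K >= G - 1
# for a chosen equation of interest.
def order_condition(G, K):
    """G: endogenous variables in the equation (incl. the left-hand-side variable);
    K: exogenous variables in the model but excluded from the equation."""
    if K < G - 1:
        return "unidentified"
    if K == G - 1:
        return "just identified"
    return "over identified"

# The three cases from the text:
assert order_condition(G=2, K=1) == "just identified"   # supply, first model
assert order_condition(G=2, K=0) == "unidentified"      # demand curve
assert order_condition(G=2, K=2) == "over identified"   # supply with weather added
```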
Two notes concerning the order condition. First, it is a necessary, but not sufficient condition for
identification. That is, no equation that fails the order condition will be identified. However, there are rare
cases where the order condition is satisfied but the equation is still not identified. However, this is not a
major problem because, if we try to estimate an equation using two stage least squares or some other
instrumental variable technique, the model will simply fail to yield an estimate. At that point we would
know that the equation was unidentified. The good news is that it won’t generate a wrong or biased
estimate. A more complicated condition, called the rank condition, is necessary and sufficient, but the order
condition is usually good enough.
Second, there is always the possibility that the model will be identified according to the order condition, but
fail to yield an estimate because the exogenous variables excluded from the equation of interest are too
highly correlated with each other. Say we require that two variables be excluded from the equation of
interest and we have two exogenous variables in the model but not in the equation. Looks good. However,
if the two variables are really so highly correlated that one is simply a linear combination of the other, then
we don’t really have two independent variables. This problem of “statistical identification” is extremely
rare.
Illustrative example
Dennis Epple and Bennett McCallum, from Carnegie Mellon, have developed a nice example of a
simultaneous equation model using real data.⁵
The market they analyze is the market for broiler chickens.
The demand for broiler chicken meat is assumed to be a function of the price of chicken meat, real income,
and the price of beef, a close substitute. Supply is assumed to be a function of price and the price of chicken
feed (corn). Epple and McCallum have collected data for the US as a whole, from 1950 to 2001. They
recommend estimating the model in logarithms. The data are available in DandS.dta.
First we try ordinary least squares on the demand curve. For reasons that will be discussed in a later
chapter, we estimate the chicken demand model in first differences. This is a short run demand equation
relating the change in price to the change in consumption.
Δlq = β1Δlp + β2Δly + β3Δlpbeef + u
Since all the variables are logged, these first differences of logs are growth rates (percent changes). To
estimate this equation in Stata, we have to tell Stata that we have a time variable with the tsset year
command. We can then use the D. operator to take first differences. We will use the noconstant option to
avoid adding a time trend to the demand curve.
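Before running it, a quick check of the growth-rate interpretation. The quantities below are made up for illustration (not data from DandS.dta): the first difference of logs, which is what `D.lq` computes, is approximately the percent change for small changes.

```python
import math

# Why first differences of logs are (approximately) percent changes.
q_last_year, q_this_year = 100.0, 103.0

dlq = math.log(q_this_year) - math.log(q_last_year)   # what D.lq computes
growth = (q_this_year - q_last_year) / q_last_year    # exact percent change

assert abs(growth - 0.03) < 1e-12     # exact growth rate: 3 percent
assert abs(dlq - growth) < 0.001      # log difference is close for small changes
```

The approximation deteriorates for large changes, but year-to-year movements in these series are small.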
. tsset year
time variable: year, 1950 to 2001
. regress D.lq D.ly D.lpbeef D.lp , noconstant
⁵ Epple, Dennis and Bennett T. McCallum, Simultaneous equation econometrics: the missing example,
http://littlehurt.gsia.cmu.edu/gsiadoc/WP/2004E6.pdf
Source  SS df MS Number of obs = 51
+ F( 3, 48) = 19.48
Model  .053118107 3 .017706036 Prob > F = 0.0000
Residual  .043634876 48 .00090906 Rsquared = 0.5490
+ Adj Rsquared = 0.5208
Total  .096752983 51 .001897117 Root MSE = .03015

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ly 
D1  .7582548 .1677874 4.52 0.000 .4208957 1.095614
lpbeef 
D1  .305005 .0612757 4.98 0.000 .181802 .4282081
lp 
D1  -.3553965 .0641272 -5.54 0.000 -.4843329 -.2264601

lq is the log of quantity (per capita chicken consumption), lp is the log of price, ly is the log of income, and
lpbeef is the log of the price of beef. This is not a bad estimated demand function. The demand curve is
downward sloping, but very inelastic. The elasticity of income is positive,
indicating that chicken is not an inferior good, which makes sense. The coefficient on the price of beef is
positive, which is consistent with the price of a substitute. All the explanatory variables are highly
significant.
Now let’s try the supply curve.
. regress D.lq D.lpfeed D.lp, noconstant
Source  SS df MS Number of obs = 39
+ F( 2, 37) = 0.28
Model  .00083485 2 .000417425 Prob > F = 0.7595
Residual  .055731852 37 .001506266 Rsquared = 0.0148
+ Adj Rsquared = -0.0385
Total  .056566702 39 .001450428 Root MSE = .03881

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lpfeed 
D1  -.0392005 .060849 -0.64 0.523 -.1624923 .0840913
lp 
D1  .0038389 .0910452 0.04 0.967 -.1806362 .188314

We have a problem. The coefficient on price is virtually zero. This means that the supply of chicken does
not respond to changes in price. We need to use instrumental variables.
Let’s first try instrumental variables on the demand equation. The exogenous variables are lpfeed, ly, and
lpbeef.
. ivreg D.lq D.ly D.lpbeef (D.lp= D.lpfeed D.ly D.lpbeef ), noconstant
Instrumental variables (2SLS) regression
Source  SS df MS Number of obs = 39
+ F( 3, 36) = .
Model  .032485142 3 .010828381 Prob > F = .
Residual  .024081561 36 .000668932 Rsquared = .
+ Adj Rsquared = .
Total  .056566702 39 .001450428 Root MSE = .02586

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lp 
D1  -.4077264 .1410341 -2.89 0.006 -.6937568 -.1216961
ly 
D1  .9360443 .1991184 4.70 0.000 .5322135 1.339875
lpbeef 
D1  .3159692 .0977788 3.23 0.003 .1176646 .5142739

Instrumented: D.lp
Instruments: D.ly D.lpbeef D.lpfeed

Not much different, but the estimate of the income elasticity of demand is now virtually one, indicating that
chicken demand will grow at about the same rate as income. Note that the demand curve is just identified
(by lpfeed).
Let’s see if we can get a better looking supply curve.
. ivreg D.lq D.lpfeed (D.lp=D.lpbeef D.ly), noconstant
Instrumental variables (2SLS) regression
Source  SS df MS Number of obs = 39
+ F( 2, 37) = .
Model  .079162381 2 .03958119 Prob > F = .
Residual  .135729083 37 .003668354 Rsquared = .
+ Adj Rsquared = .
Total  .056566702 39 .001450428 Root MSE = .06057

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lp 
D1  .6673431 .2617257 2.55 0.015 .1370364 1.19765
lpfeed 
D1  -.2828285 .1246235 -2.27 0.029 -.5353398 -.0303173

Instrumented: D.lp
Instruments: D.lpfeed D.lpbeef D.ly

This is much better. The coefficient on price is now positive and significant, indicating that the supply of
chickens will increase in response to an increase in price. Also, increases in chicken-feed prices will reduce
the supply of chickens. The supply curve is over identified (by the omission of ly and lpbeef).
We could have asked for robust standard errors, but they didn’t make any difference in the levels of
significance, so we used the default homoskedastic-only standard errors.
Note to the reader. I have taken a few liberties with the original example developed by Professors Epple
and McCallum. They actually develop a slightly more sophisticated, and more believable, model, where the
coefficient on price in the supply equation starts out negative and significant in the supply equation and
eventually winds up positive and significant. However, they have to complicate the model in a number of
ways to make it work. This is a cleaner example, using real data, but it could suffer from omitted variable
bias. On the other hand, the Epple and McCallum demand curve probably suffers from omitted variable
bias too, so there. Students are encouraged to peruse the original paper, available on the web.
Diagnostic tests
There are three diagnostic tests that should be done whenever you do instrumental variables. The first is to
justify the identifying restrictions.
Tests for over identifying restrictions
In the above example, we identified the supply equation by omitting income and the price of beef. This
seems quite reasonable, but it is nevertheless necessary to justify omitting these variables by showing that
they would not be significant if we included them in the regression. There are several versions of this test
by Sargan, Basmann, and others. It is simply an LM test where we save the residuals from the instrumental
variables regression and then regress them against all the instruments. If the omitted instrument does not
belong in the equation, then this regression should have an Rsquare of zero. We test this null hypothesis
with the usual nR² statistic, which is distributed χ²(m − k), where m is the number of instruments excluded from the equation and k is
the number of endogenous variables on the right hand side of the equation.
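The arithmetic of the statistic is simple enough to reproduce by hand. The sketch below uses the n and auxiliary-regression R² reported later in this section; for one degree of freedom the chi-square upper tail has a closed form, P(χ²(1) > x) = erfc(√(x/2)), so no tables are needed.

```python
import math

# Sargan N*R-squared test, reproducing the numbers from the text:
# n = 39 observations, R-squared = 0.00394634 from regressing the 2SLS
# residuals on all the instruments, with m - k = 1 degree of freedom.
n, r2 = 39, 0.00394634
nr2 = n * r2

# Chi-square(1) upper-tail probability via the complementary error function.
p_value = math.erfc(math.sqrt(nr2 / 2))

print(nr2, p_value)
```

The result matches the Sargan line of the `overid` output shown below (statistic 0.154, p-value 0.6948).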
We can implement this test easily in Stata. First, let’s do the ivreg again.
. ivreg D.lq D.lpfeed (D.lp=D.lpbeef D.ly), noconstant
Instrumental variables (2SLS) regression
Source  SS df MS Number of obs = 39
+ F( 2, 37) = .
Model  .079162381 2 .03958119 Prob > F = .
Residual  .135729083 37 .003668354 Rsquared = .
+ Adj Rsquared = .
Total  .056566702 39 .001450428 Root MSE = .06057

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lp 
D1  .6673431 .2617257 2.55 0.015 .1370364 1.19765
lpfeed 
D1  -.2828285 .1246235 -2.27 0.029 -.5353398 -.0303173

Instrumented: D.lp
Instruments: D.lpfeed D.lpbeef D.ly

If your version of Stata has the overid command installed, then you can run the test with a single command.
. overid
Tests of overidentifying restrictions:
Sargan N*R-sq test 0.154 Chi-sq(1) P-value = 0.6948
Basmann test 0.143 Chi-sq(1) P-value = 0.7057
The tests are not significant, indicating that these variables do not belong in the equation of interest. Therefore, they are
properly used as identifying variables. Whew!
If there is no overid command after ivreg, then you can install it by clicking on help, search, overid, and clicking on the
link that says, “click to install.” It only takes a few seconds. If you are using Stata on the server, you might not be able
to install new commands. In that case, you can do it yourself as follows. First, save the residuals from the ivreg.
. predict es, resid
(13 missing values generated)
Now regress the residuals on all the instruments. In our case we are doing the regression in first differences
and the constant term was not included. So use the noconstant option.
. regress es D.lpfeed D.lpbeef D.ly, noconstant
Source  SS df MS Number of obs = 39
+ F( 3, 36) = 0.05
Model  .000535634 3 .000178545 Prob > F = 0.9860
Residual  .135193453 36 .003755374 Rsquared = 0.0039
+ Adj Rsquared = -0.0791
Total  .135729086 39 .003480233 Root MSE = .06128

es  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lpfeed 
D1  .005058 .0863394 0.06 0.954 -.1700464 .1801625
lpbeef 
D1  -.0536104 .1686351 -0.32 0.752 -.3956183 .2883974
ly 
D1  .1264734 .3897588 0.32 0.747 -.6639941 .9169409
We can use the Fform of the LM test by invoking Stata’s test command on the identifying variables.
. test D.lpbeef D.ly
( 1) D.lpbeef = 0
( 2) D.ly = 0
F( 2, 36) = 0.07
Prob > F = 0.9313
The results are the same, namely, that the identifying variables are not significant and therefore can be used
to identify the equation of interest. However, the prob-values are not identical to the large-sample chi-square
versions reported by the overid command.
To replicate the overid analysis, let’s start by reminding ourselves what output is available for further
analysis.
. ereturn list
scalars:
e(N) = 39
e(df_m) = 3
e(df_r) = 36
e(F) = .0475437411474742
e(r2) = .00394634310271
e(rmse) = .0612811038678276
e(mss) = .0005356335440678
e(rss) = .1351934528853412
e(r2_a) = -.0790581283053975
e(ll) = 55.12129587257286
macros:
e(depvar) : "es"
e(cmd) : "regress"
e(predict) : "regres_p"
e(model) : "ols"
matrices:
e(b) : 1 x 3
e(V) : 3 x 3
functions:
e(sample)
. scalar r2=e(r2)
. scalar n=e(N)
. scalar nr2=n*r2
We know there is one degree of freedom for the resulting chisquare statistic because m=2 omitted
instruments (ly and lpbeef) and k=1 endogenous regressor (lp).
. scalar prob=1-chi2(1,nr2)
. scalar list r2 n nr2 prob
r2 = .00394634
n = 39
nr2 = .15390738
prob = .69482895
This is identical to the Sargan form reported by the overid command above. An equivalent test of this
hypothesis is the J-test for overidentifying restrictions. Instead of using the nR² statistic, the J-test
uses the statistic J = mF, where F is the F-test for the auxiliary regression, from the output above. J is
distributed as chi-square with the same m − k degrees of freedom.
. scalar J=2*e(F)
. scalar probJ=1-chi2(2,J)
. scalar list J probJ
J = .09508748
probJ = .95356876
All of these tests agree that the identifying restrictions are valid. The fact that they are not significant means
that income and beef prices do not belong in the supply of chicken equation.
We cannot do the tests for over identifying restrictions on the demand curve because it is just identified.
The problem is that the residual from the ivreg regression will be an exact linear combination of the
instruments. If we forget and try to do the LM test we get a regression that looks like this (ed is the residual
from the ivreg estimate of the demand curve above.)
. regress ed D.ly D.lpbeef D.lpfeed, noconstant
Source  SS df MS Number of obs = 39
+ F( 3, 36) = 0.00
Model  0 3 0 Prob > F = 1.0000
Residual  .024081561 36 .000668932 Rsquared = 0.0000
+ Adj Rsquared = -0.0833
Total  .024081561 39 .000617476 Root MSE = .02586

ed  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ly 
D1  5.70e-10 .1644979 0.00 1.000 -.3336173 .3336173
lpbeef 
D1  3.94e-10 .0711725 0.00 1.000 -.1443446 .1443446
lpfeed 
D1  1.02e-11 .0364396 0.00 1.000 -.0739029 .0739029

So, if you see output like this, you know you can’t do a test for over identifying restrictions, because your
equation is not over identified.
Test for weak instruments
To be an instrument a variable must satisfy two conditions. It must be (1) highly correlated with the
problem variable and (2) it must be uncorrelated with the error term. What happens if (1) is not true? That
is, is there a problem when the instrument is only vaguely associated with the endogenous regressor?
Bound, Jaeger, and Baker⁶ have demonstrated that instrumental variable estimates are biased by about 1/F*
where F* is the Ftest on the (excluded) identifying variables in the reduced form equation for the
endogenous regressor. This is the percent of the OLS bias that remains after applying two stage least
squares. For example, if F*=2, then 2SLS is still half as biased as OLS. In the case of our chicken supply
curve, that would be a regression of lp on ly, lpbeef, and lpfeed and the F* test would be on lpbeef and ly.
. regress lp ly lpbeef lpfeed, noconstant
Source  SS df MS Number of obs = 40
+ F( 3, 37) =39090.05
Model  801.425968 3 267.141989 Prob > F = 0.0000
Residual  .252858576 37 .006834016 Rsquared = 0.9997
+ Adj Rsquared = 0.9997
Total  801.678827 40 20.0419707 Root MSE = .08267

lp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ly  .1264946 .0226887 5.58 0.000 .0805228 .1724664
lpbeef  .628012 .0764761 8.21 0.000 .4730568 .7829672
lpfeed  .1196143 .1046508 1.14 0.260 -.0924285 .331657

. test ly lpbeef
( 1) ly = 0
( 2) lpbeef = 0
F( 2, 37) = 35.75
Prob > F = 0.0000
Obviously, these are not weak instruments and two stage least squares is much less biased than OLS.
Actually we should have predicted this by the difference between the OLS and 2sls versions of the
equations above.
Anyway, the BJB F test is very easy to do and it should be routinely reported for any simultaneous equation
estimation. Note that it has the same problem as the tests for over identifying restrictions, namely, it does
not work where the equation is just identified. So we cannot use it for the demand function.
Hausman-Wu test
We have already encountered the Hausman-Wu test in the chapter on errors in variables. However, it is also
relevant in the case of simultaneous equations.
Consider the following equation system.
⁶ Yes, that is our own Professor Jaeger.
Model A (simultaneous)
y1 = β12y2 + β13z1 + u1    (A1)
y2 = β21y1 + β22z2 + β23z3 + u2    (A2)

Model B (recursive)
y1 = β12y2 + β13z1 + u1    (B1)
y2 = β22z2 + β23z3 + u2    (B2)

where cov(u1, u2) = 0.
Suppose we are interested in getting a consistent estimate of β12. If we apply OLS to equation (A1) we get
biased and inconsistent estimates. If we apply OLS to equation (B1) we get unbiased, consistent, and
efficient estimates. And it is the same equation! We need to know about the second equation, not only the
equation of interest.
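The contrast between the two models can be demonstrated by simulation. The sketch below is my own illustration, simplified to one regressor per equation (not the models A and B above verbatim): in the simultaneous version, y2 feeds back on y1 and is therefore correlated with u1, so OLS on the first equation is biased; in the recursive version with cov(u1, u2) = 0, the same OLS regression is consistent.

```python
import random

random.seed(3)
n = 20000
b12, b21 = 0.5, 0.4

def ols_slope(x, y):
    # OLS slope through the origin.
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

z = [random.gauss(0, 1) for _ in range(n)]
u1 = [random.gauss(0, 1) for _ in range(n)]
u2 = [random.gauss(0, 1) for _ in range(n)]

# Simultaneous version: y1 = b12*y2 + u1 and y2 = b21*y1 + z + u2,
# solved jointly, so y2 depends on u1 through y1.
y1_a = [(b12 * (zi + e2) + e1) / (1 - b12 * b21) for zi, e1, e2 in zip(z, u1, u2)]
y2_a = [b21 * ya + zi + e2 for ya, zi, e2 in zip(y1_a, z, u2)]

# Recursive version: y2 = z + u2 is determined without y1, and cov(u1,u2)=0.
y2_b = [zi + e2 for zi, e2 in zip(z, u2)]
y1_b = [b12 * yb + e1 for yb, e1 in zip(y2_b, u1)]

b12_a = ols_slope(y2_a, y1_a)   # biased away from 0.5
b12_b = ols_slope(y2_b, y1_b)   # close to 0.5
print(b12_a, b12_b)
```

Only the recursive estimate lands near the true value of 0.5; the simultaneous one converges to a different number entirely, even with a large sample.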
To apply the Hausman-Wu test to see if we need to use 2sls on the supply equation, we estimate the
reduced form equation for Δlp and save the residuals.
. regress D.lp D.ly D.lpbeef D.lpfeed, noconstant
Source  SS df MS Number of obs = 39
+ F( 3, 36) = 12.37
Model  .132113063 3 .044037688 Prob > F = 0.0000
Residual  .128161345 36 .003560037 Rsquared = 0.5076
+ Adj Rsquared = 0.4666
Total  .260274407 39 .006673703 Root MSE = .05967

D.lp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ly 
D1  .7530405 .3794868 1.98 0.055 -.0165944 1.522675
lpbeef 
D1  .3437728 .1641908 2.09 0.043 .0107785 .6767671
lpfeed 
D1  .2583745 .084064 3.07 0.004 .0878849 .4288641

. predict dlperror, resid
(13 missing values generated)
Now, add the error term to the original regression. If it is significant, then there is a significant difference
between the OLS estimate and the 2sls estimate, in which case OLS is biased and we need to use
instrumental variables. If it is not significant, then we need to check the magnitude of the coefficient. If it is
large and the standard error is even larger, then we cannot be sure that OLS is unbiased. In that event, we
should do it both ways, OLS and IV, and hope we get essentially the same results.
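The mechanics of the test are the same in any matrix language. Here is a minimal Python sketch with simulated data (the data-generating process and coefficient values are invented for illustration, not taken from the poultry model): regress the suspect variable on the instruments, save the residual, and add it to the structural equation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated system (made-up coefficients): p is endogenous in the q equation
z = rng.normal(size=n)                        # instrument
e = rng.normal(size=n)                        # structural error
p = 0.7 * z + 0.5 * e + rng.normal(size=n)    # p correlated with e
q = 1.0 + 2.0 * p + e                         # true coefficient on p is 2.0

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# Step 1: reduced form -- regress p on the instrument, save the residuals
Z = np.column_stack([ones, z])
p_resid = p - Z @ ols(Z, p)

# Step 2: add the residual to the original regression
X_aug = np.column_stack([ones, p, p_resid])
b = ols(X_aug, q)
print("coef on p       :", round(b[1], 3))  # close to the true 2.0
print("coef on residual:", round(b[2], 3))  # nonzero, signaling endogeneity
```

A significantly nonzero coefficient on the residual is the Hausman-Wu signal that OLS on the original equation is biased.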
. regress D.lq D.lp dlperror D.lpfeed, noconstant
Source  SS df MS Number of obs = 39
+ F( 3, 36) = 18.43
Model  .034261762 3 .011420587 Prob > F = 0.0000
Residual  .022304941 36 .000619582 Rsquared = 0.6057
+ Adj Rsquared = 0.5728
Total  .056566702 39 .001450428 Root MSE = .02489

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lp 
145
D1  .6673431 .1075623 6.20 0.000 .4491967 .8854896
dlperror  -.94075 .1280782 -7.35 0.000 -1.200505 -.6809953
lpfeed 
D1  -.2828285 .051217 -5.52 0.000 -.3867013 -.1789557

You can also do the Hausman-Wu test using the predicted value of the problem variable instead of the
residual from the reduced form equation. The following HW test equation was performed after saving the
predicted value. The test coefficient is simply the negative of the coefficient on the residuals.
. regress D.lq D.lp dlphat D.lpfeed, noconstant
Source  SS df MS Number of obs = 39
+ F( 3, 36) = 18.43
Model  .034261763 3 .011420588 Prob > F = 0.0000
Residual  .02230494 36 .000619582 Rsquared = 0.6057
+ Adj Rsquared = 0.5728
Total  .056566702 39 .001450428 Root MSE = .02489

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
lp 
D1  -.2734069 .0695298 -3.93 0.000 -.4144198 -.132394
dlphat  .94075 .1280782 7.35 0.000 .6809953 1.200505
lpfeed 
D1  -.2828285 .051217 -5.52 0.000 -.3867013 -.1789557

Applying the Hausman-Wu test to the demand equation yields the following test equation.
. regress D.lq D.ly D.lpbeef D.lp dlperror, noconstant
Source  SS df MS Number of obs = 39
+ F( 4, 35) = 13.99
Model  .034797396 4 .008699349 Prob > F = 0.0000
Residual  .021769306 35 .00062198 Rsquared = 0.6152
+ Adj Rsquared = 0.5712
Total  .056566702 39 .001450428 Root MSE = .02494

D.lq  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ly 
D1  .9360443 .1920032 4.88 0.000 .546257 1.325832
lpbeef 
D1  .3159692 .0942849 3.35 0.002 .1245608 .5073777
lp 
D1  -.4077264 .1359945 -3.00 0.005 -.6838099 -.131643
dlperror  .1343196 .1527992 0.88 0.385 -.1758793 .4445185

The test reveals that the OLS estimate is not significantly different from the instrumental variables
estimate.
Seemingly unrelated regressions
Consider the following system of equations relating local government expenditures on fire and police (FP),
schools (S) and other (O) to population density (D), income (Y), and the number of school age children (C).
The equations relate the proportion of the total budget (T) in each of these categories to the explanatory
variables, D, Y, and C.
FP/T = a_0 + a_1 D + a_2 Y + U_1
S/T = b_0 + b_1 Y + b_2 C + U_2
O/T = c_0 + c_1 D + c_2 Y + c_3 C + U_3
These equations are not simultaneous, so OLS can be expected to yield unbiased and consistent estimates.
However, because the proportions have to sum to one (FP/T+S/T+O/T=1), any random increase in
expenditures on fire and police must imply a random decrease in schools and/or other components.
Therefore, the error terms must be correlated across equations, i.e., E(U_i U_j) ≠ 0 for i ≠ j. This is information
that could be used to increase the efficiency of the estimates. The procedure is to estimate the equations
using OLS, save the residuals, compute the covariances between U_1, U_2, etc., and use these estimated
covariances to improve the OLS estimates. This technique is known as seemingly unrelated regressions
(SUR) or Zellner efficient least squares (ZELS), after Arnold Zellner of the University of Chicago, who
developed the technique.
In Stata, we can use the sureg command to estimate the above system. Let FPT be the proportion spent on
fire and police and ST be the proportion spent on schools. Because the proportion of the budget going to
other things is simply one minus the sum of expenditures on fire, police, and schools, we drop the third
equation. In the sureg command, we simply put the varlist for each equation in parentheses.
. sureg (FPT D Y) (ST Y C)
Note: if the equations have the same regressors, then the SUR estimates will be identical to the OLS
estimates and there is no efficiency gain. Also, if there is specification bias in one equation, say omitted
variables, then, because the residuals are correlated across equations, the SUR estimates of the other
equation are biased. The specification error is “propagated” across equations. For this reason, it is not clear
that we should use SUR for all applications for which it might be employed. That being said, we should use
SUR in those cases where the dependent variables sum to a constant, such as in the above example.
Three stage least squares
Whenever we have a system of simultaneous equations, we automatically have equations whose errors
could be correlated. Thus, two stage least squares estimates of a multiequation system could potentially be
made more efficient by incorporating the information contained in the covariances of the residuals.
Applying SUR to two stage least squares results in three stage least squares.
Three stage least squares can be implemented in Stata with the reg3 command. Like the sureg command,
the equation varlists are put in parentheses. In this example I use equation labels to identify the demand and
supply equations.
In the first example, I replicate the ivreg above exactly by specifying each equation with the noconstant
option (in the parentheses) and adding , noconstant to the model as a whole to make sure that the constant is
not used as an instrument in the reduced form equations.
. reg3 (Demand:D.lq D.lp D.ly D.lpbeef, noconstant) (Supply: D.lq D.lp D.lpfeed
> , noconstant), noconstant
Three-stage least squares regression

Equation Obs Parms RMSE "Rsq" chi2 P

Demand 39 3 .0265163 0.5152 36.31 0.0000
Supply 39 2 .0379724 0.0059 0.15 0.9258


 Coef. Std. Err. z P>z [95% Conf. Interval]
+
Demand 
lp 
D1  -.194991 .0549483 -3.55 0.000 -.3026876 -.0872944
ly 
D1  .5204546 .1255017 4.15 0.000 .2744758 .7664334
lpbeef 
D1  .1600626 .0546049 2.93 0.003 .053039 .2670861
+
Supply 
lp 
D1  -.0250497 .0835039 -0.30 0.764 -.1887144 .138615
lpfeed 
D1  -.0040577 .0468383 -0.09 0.931 -.0958591 .0877437

Endogenous variables: D.lq
Exogenous variables: D.lp D.ly D.lpbeef D.lpfeed

Yuck. The supply function is terrible. Let's try adding the intercept term.
. reg3 (Demand:D.lq D.lp D.ly D.lpbeef) (Supply: D.lq D.lp D.lpfeed)
Three-stage least squares regression

Equation Obs Parms RMSE "Rsq" chi2 P

Demand 39 3 .0236988 0.2754 15.13 0.0017
Supply 39 2 .0249881 0.1944 9.56 0.0084


 Coef. Std. Err. z P>z [95% Conf. Interval]
+
Demand 
lp 
D1  -.1871072 .0484276 -3.86 0.000 -.2820235 -.0921909
ly 
D1  .0948342 .109993 0.86 0.389 -.1207482 .3104166
lpbeef 
D1  .0491061 .0336213 1.46 0.144 -.0167905 .1150026
_cons  .0274115 .0044904 6.10 0.000 .0186104 .0362125
+
Supply 
lp 
D1  -.1627361 .0547593 -2.97 0.003 -.2700624 -.0554097
lpfeed 
D1  .0018805 .0196841 0.10 0.924 -.0366997 .0404607
_cons  .0306582 .0042607 7.20 0.000 .0223074 .039009

Endogenous variables: D.lq
Exogenous variables: D.lp D.ly D.lpbeef D.lpfeed

Better, at least for the estimates of the price elasticities. None of the other variables are significant. Three
stage least squares has apparently made things worse. This is very unusual. Normally the three stage least
squares estimates are almost the same as the two stage least squares estimates.
Types of equation systems
We are already familiar with a fully simultaneous equation system. The supply and demand system
(S)  q = α p + ε
(D)  q = β_1 p + β_2 y + u
is fully simultaneous in that quantity is a function of price and price is a function of quantity. Now consider
the following general three equation model.
Y_1 = α_0 + α_1 X_1 + α_2 X_2 + u_1
Y_2 = β_0 + β_1 Y_1 + β_2 X_1 + β_3 X_2 + u_2
Y_3 = γ_0 + γ_1 Y_1 + γ_2 Y_2 + γ_3 X_1 + γ_4 X_2 + u_3
This equation system is simultaneous, but not completely simultaneous. Note that Y_3 is a function of Y_2
and Y_1, but neither of them is a function of Y_3. Y_2 is a function of Y_1, but Y_1 is not a function of Y_2. The
model is said to be triangular because of this structure. If we also assume that there is no covariance
among the error terms, that is, E(u_1 u_2) = E(u_1 u_3) = E(u_2 u_3) = 0, then this is a recursive system. Since
there is no feedback from any right hand side endogenous variable to the dependent variable in the same
equation, there is no correlation between the error term and any right hand side variable in the same
equation; therefore there is no simultaneous equation bias in this model. OLS applied to any of the
equations will yield unbiased, consistent, and efficient estimates (assuming the equations have no other
problems).
Finally, consider the following model.
Y_1 = α_0 + α_1 Y_2 + α_2 X_1 + α_3 X_2 + u_1
Y_2 = β_0 + β_1 Y_1 + β_2 X_1 + β_3 X_2 + u_2
Y_3 = γ_0 + γ_1 Y_1 + γ_2 Y_2 + γ_3 X_1 + γ_4 X_2 + u_3

Now Y_1 is a function of Y_2 and Y_2 is a function of Y_1. These two equations are fully simultaneous. OLS
applied to either will yield biased and inconsistent estimates. This part of the model is called the
simultaneous core. This model system is known as block recursive and is very common in large
simultaneous macroeconomic models. If we assume that E(u_1 u_3) = E(u_2 u_3) = 0, then OLS applied to the
third (recursive) equation yields unbiased, consistent, and efficient estimates.
Strategies for dealing with simultaneous equations
The obvious strategy when faced with a simultaneous equation system is to use two stage least squares. I
don’t recommend three stage least squares because of possible propagation of errors across equations.
However, there is always the problem of identification and/or weak instruments. Some critics have even
suggested that, since the basic model of all economics is the general equilibrium model, all variables are
functions of all other variables and no equation can ever be identified. Even if we can identify the equation
of interest with some reasonable exclusion assumptions and we have good instruments, the best we can
hope for is a consistent, but inefficient estimate of the parameters.
For many situations, the best solution may be to estimate one or more reduced form equations, rather than
trying to estimate equations in the structural model. For example, suppose we are trying to estimate the
effect of a certain policy, say three-strikes laws, on crime. We may have police in the crime equation, and
police and crime may be simultaneously determined, but we do not have to know the coefficient on the
police variable in the crime equation to determine the effect of the three-strikes law on crime. That
coefficient is available from the reduced form equation for crime. OLS applied to the reduced form
equations yields unbiased, consistent, and efficient estimates.
However, it may be the case that we can’t retreat to the reduced form equations. In the above example,
suppose we want to estimate the potential effect of a policy that puts 100,000 cops on the streets. In this
case we need to know the coefficient linking the police to the crime rate, so that we can multiply it by 100K
in order to estimate the effect of 100,000 more police. We have no choice but to use instrumental variables
if we are to generate consistent estimates of the parameter of interest.
A strategy that is frequently adopted when large systems of simultaneous equations are being estimated is
to plow ahead with OLS despite the fact that the estimates are biased and inconsistent. The estimates may
be biased and inconsistent, but they are efficient and the bias might not be too bad (and 2sls is also biased
in small samples). So OLS might not be too bad. Usually, researchers adopting this strategy insist that this
is just the first step and the OLS estimates are “preliminary.” The temptation is to never get around to the
next step.
A strategy that is available to researchers that have time series data is to use lags to generate instruments or
to avoid simultaneity altogether. Suppose we start with the simple demand and supply model above where
we add time subscripts.
(S)  q_t = α p_t + ε_t
(D)  q_t = β_1 p_t + β_2 y_t + u_t
Note that this model is fully simultaneous and the only way to consistently estimate the supply function is
to use two stage least squares. However, suppose this is agricultural data. Then the supply is based on last
year’s price while demand is based on this year’s price. (This is the famous cobweb model.)
(S)  q_t = α p_{t-1} + ε_t
(D)  q_t = β_1 p_t + β_2 y_t + u_t

Presto! Assuming that E(u_t ε_t) = 0 and assuming we have no autocorrelation (see Chapter 14 below), we
now have a recursive system. OLS applied to either equation will yield unbiased, consistent, and efficient
estimates.
Even if we can’t assume that current supply is based on last year’s price, we can at least generate a bunch
of instruments by assuming that the equations are also functions of lagged variables. For example, the
supply equation could be a function of last years supply as well as last year’s price. In that case, we have
the following equation system.
(S)  q_t = α_1 p_t + α_2 p_{t-1} + α_3 q_{t-1} + ε_t
(D)  q_t = β_1 p_t + β_2 y_t + u_t
Note that we have two new instruments (p_{t-1} and q_{t-1}), which are called lagged endogenous variables. Since
time does not go backwards, this year's quantity and price cannot affect last year's values, so these are valid
instruments (again assuming no autocorrelation). Now both equations are identified and both can be
consistently estimated using two stage least squares.
Time series analysts frequently estimate vector autoregression (VAR) models in which the dependent
variable is a function of its own lags and lags of all the other variables in the simultaneous equation model.
This is essentially the reduced form equation approach for time series data where the lagged endogenous
variables are the exogenous variables.
Another example
Pindyck and Rubinfeld have an example where state and local government expenditures (exp) are taken to
be a function of the amount of federal government grants (aid) and, as control variables, population and
income.
The data are in a Stata data set called EX73 (for example 7.3 from Pindyck and Rubinfeld). The OLS
regression model is
. regress exp aid inc pop
Source  SS df MS Number of obs = 50
+ F( 3, 46) = 2185.50
Model  925135805 3 308378602 Prob > F = 0.0000
Residual  6490681.38 46 141101.769 Rsquared = 0.9930
+ Adj Rsquared = 0.9926
Total  931626487 49 19012785.4 Root MSE = 375.64

exp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
aid  3.233747 .238054 13.58 0.000 2.754569 3.712925
inc  .0001902 .0000235 8.10 0.000 .000143 .0002375
pop  -.5940753 .1043992 -5.69 0.000 -.80422 -.3839306
_cons  -45.69833 84.39169 -0.54 0.591 -215.57 124.1733

However, many grants are tied to the amount of money the state is spending. So the aid variable could be a function of
the dependent variable, exp. Therefore aid is a potentially endogenous variable. Let's do a Hausman-Wu test to see if
we need to use two stage least squares. First we need an instrument. Many open-ended grants (that are functions of the
amount the state spends, like matching grants) are for education. A variable that is correlated with grants but
independent of the error term is the number of school children, ps. We regress aid on ps and the other exogenous
variables, and save the residuals, ŵ.
. regress aid ps inc pop
Source  SS df MS Number of obs = 50
+ F( 3, 46) = 220.70
Model  32510247.4 3 10836749.1 Prob > F = 0.0000
Residual  2258713.81 46 49102.4741 Rsquared = 0.9350
+ Adj Rsquared = 0.9308
Total  34768961.2 49 709570.636 Root MSE = 221.59

aid  Coef. Std. Err. t P>t [95% Conf. Interval]
+
ps  -.8328735 .3838421 -2.17 0.035 -1.605508 -.0602394
inc  .0000365 .0000133 2.75 0.008 9.79e-06 .0000633
pop  .1738793 .1248854 1.39 0.171 -.0775019 .4252605
_cons  42.40924 49.68255 0.85 0.398 -57.59654 142.415

. predict what, resid
Now add the residual, what, to the regression. If it is significant, ordinary least squares is biased.
. regress exp aid what inc pop
Source  SS df MS Number of obs = 50
+ F( 4, 45) = 1716.17
Model  925559177 4 231389794 Prob > F = 0.0000
Residual  6067310.05 45 134829.112 Rsquared = 0.9935
+ Adj Rsquared = 0.9929
Total  931626487 49 19012785.4 Root MSE = 367.19

exp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
aid  4.522657 .7636841 5.92 0.000 2.984518 6.060796
what  -1.420832 .8018143 -1.77 0.083 -3.035769 .194105
inc  .0001275 .0000422 3.02 0.004 .0000424 .0002125
pop  -.5133494 .1117587 -4.59 0.000 -.738443 -.2882557
_cons  -90.1253 86.22021 -1.05 0.301 -263.7817 83.53111

Apparently there is significant endogeneity, at the recommended 10 percent significance level for specification
tests. It turns out that the coefficient on aid in the Hausman-Wu test equation is exactly the same as we would have
gotten if we had used instrumental variables. To see this, let's use IV to derive consistent estimates for this model.
. ivreg exp (aid=ps) inc pop
Instrumental variables (2SLS) regression
Source  SS df MS Number of obs = 50
+ F( 3, 46) = 1304.09
Model  920999368 3 306999789 Prob > F = 0.0000
Residual  10627118.5 46 231024.316 Rsquared = 0.9886
+ Adj Rsquared = 0.9878
Total  931626487 49 19012785.4 Root MSE = 480.65

exp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
aid  4.522657 .9996564 4.52 0.000 2.510453 6.534861
inc  .0001275 .0000553 2.31 0.026 .0000162 .0002387
pop  -.5133494 .1462913 -3.51 0.001 -.8078184 -.2188803
_cons  -90.1253 112.8616 -0.80 0.429 -317.3039 137.0533

Instrumented: aid
Instruments: inc pop ps

Note that ivreg uses all the variables that are not indicated as problem variables (inc and pop), as well as the
designated instrument, ps, as instruments. Note that the coefficient is, in fact, the same as in the Hausman-Wu
test equation.
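The exact equality of the two coefficients is not a coincidence of this data set; it holds algebraically. A quick check with simulated data (a Python sketch; all the numbers are invented) confirms it:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Made-up just-identified setup: x endogenous, z the single extra instrument, w exogenous
z, w, e = rng.normal(size=(3, n))
x = 1.5 * z + 0.8 * w + 0.5 * e + rng.normal(size=n)
y = 0.3 + 2.0 * x - 1.0 * w + e

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# First stage: regress x on all exogenous variables plus the instrument
Z = np.column_stack([ones, z, w])
xhat = Z @ ols(Z, x)

# 2SLS: replace x by its fitted value
b_2sls = ols(np.column_stack([ones, xhat, w]), y)[1]

# Hausman-Wu test equation: original regressors plus the first-stage residual
b_hw = ols(np.column_stack([ones, x, x - xhat, w]), y)[1]

print(b_2sls, b_hw)  # identical up to floating point
```

Adding the first-stage residual as a control absorbs exactly the endogenous part of x, so the remaining coefficient on x reproduces the 2SLS estimate.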
If we want to see the first stage regression, add the option “first.” We can also request robust standard
errors.
. ivreg exp (aid=ps) inc pop, first robust
Firststage regressions

Source  SS df MS Number of obs = 50
+ F( 3, 46) = 220.70
Model  32510247.4 3 10836749.1 Prob > F = 0.0000
Residual  2258713.81 46 49102.4741 Rsquared = 0.9350
+ Adj Rsquared = 0.9308
Total  34768961.2 49 709570.636 Root MSE = 221.59

aid  Coef. Std. Err. t P>t [95% Conf. Interval]
+
inc  .0000365 .0000133 2.75 0.008 9.79e-06 .0000633
pop  .1738793 .1248854 1.39 0.171 -.0775019 .4252605
ps  -.8328735 .3838421 -2.17 0.035 -1.605508 -.0602394
_cons  42.40924 49.68255 0.85 0.398 -57.59654 142.415

IV (2SLS) regression with robust standard errors Number of obs = 50
F( 3, 46) = 972.40
Prob > F = 0.0000
Rsquared = 0.9886
Root MSE = 480.65

 Robust
exp  Coef. Std. Err. t P>t [95% Conf. Interval]
+
aid  4.522657 1.515 2.99 0.005 1.47312 7.572194
inc  .0001275 .0000903 1.41 0.165 -.0000544 .0003093
pop  -.5133494 .1853334 -2.77 0.008 -.8864062 -.1402925
_cons  -90.1253 86.34153 -1.04 0.302 -263.9218 83.67117

Instrumented: aid
Instruments: inc pop ps

Because the equation is just identified, we can't test the overidentification restriction. However, ps is
significant in the reduced form equation, which means that it is not a weak instrument, although the sign
doesn't look right to me.
Summary
The usual development of a research project is that the analyst specifies an equation of interest where the
dependent variable is a function of the target variable or variables and a list of control variables. Before
estimating with ordinary least squares, the analyst should think about possible reverse causation with
respect to each right hand side variable in turn. If there is a good argument for simultaneity, the analyst
should specify another equation, or at least a list of possible instruments, for each of the potentially
endogenous variables. Sometimes this process is shortened by the review of the literature where previous
analysts may have specified simultaneous equation models.
Given that there may be simultaneity, what is the best way to deal with it? We know that OLS is biased,
inconsistent, but relatively efficient compared to instrumental variables (because changing the list of
instruments will cause the estimates to change, creating additional variance of the estimates). On the other
hand, instrumental variables is consistent, but still biased (presumably less biased than OLS), and
inefficient relative to OLS. So we are trading efficiency for consistency when we use two stage least
squares. Is this a good deal for analysts? The consensus is that OLS applied to an equation that has an
endogenous regressor results in substantial bias, so much so that the loss of efficiency is a small price to
pay.
There are a couple of ways to avoid this tradeoff. The first is to estimate the reduced form equation only.
We know that OLS applied to the reduced form equations results in estimates that are unbiased, consistent,
and efficient. For many policy analyses, the target variable is an exogenous policy variable. For example,
suppose we are interested in the effect of three-strikes laws on crime. The crime equation will be a function
of arrests, among other things, a number of control variables, and the target variable, a dummy variable that
is one for states with three-strikes laws. The arrest variable is likely to be endogenous. We could specify an
arrest equation, with the attendant data collection to get the additional variables, and employ two or three
stage least squares, which are relatively inefficient. Alternatively, we can estimate the reduced form
equation, derived by simply dropping the arrest variable. The resulting estimated coefficient on the three-strikes
variable is unbiased, consistent, and efficient. Of course, if arrests are not simultaneous, these
estimates are biased by omitted variables. The safest thing to do is to estimate the original crime equation by
OLS and the reduced form version, dropping arrests, and see if the resulting estimated coefficients on the
target variable are much different. If the results are the same in either case, then we can be confident we
have a good estimate of the effect of three-strikes laws on crime.
Another way around the problem is available if the data are time series. Suppose that there is a significant
lag between the time a crime is committed and the resulting prison incarceration of the guilty party. Then
prison is not simultaneously determined with crime. Similarly, it usually takes some time to get a law, like
the three-strikes law, passed and implemented. In that case, crime this year is a function of the law, but the
law is a function of crime in previous years, hence not simultaneous. Finally, if a variable Y is known to be
a function of lagged X, then, since time cannot go backwards, the two variables cannot be simultaneous.
So, lagging the potentially endogenous variable allows us to sidestep the specification of a multi-equation
model. I have seen large simultaneous macroeconomic models estimated on quarterly data avoid the use of
two stage least squares by simply lagging the endogenous variables, so that consumption this quarter is a
function of income in the previous quarter. These arguments require that the residuals be independent, that
is, not autocorrelated. We deal with autocorrelation below.
Should we use three stage least squares? Generally, I recommend against it. Usually, the other equations
are only there to generate consistent estimates of a target parameter in the equation of interest. I don't want
specification errors in those equations causing inconsistency in the equation of interest after I have gone to
all the trouble of specifying a multi-equation model.
Finally, be sure to report the results of the Hausman-Wu test, the test for overidentifying restrictions, and
the Bound, Jaeger, Baker F test for weak instruments whenever you estimate a simultaneous equation
model, assuming your model is not just identified.
13 TIME SERIES MODELS
Time series data are different from cross section data in that they have an additional source of information. In a
cross section it doesn't matter how the data are arranged: the fifty states could be listed alphabetically, or
by region, or any number of ways. However, a time series must be arrayed in order, 1998 before 1999, etc.
The fact that time's arrow only flies in one direction is extremely useful. It means, for example, that
something that happened in 1999 cannot cause something that happened in 1998. The reverse, of course, is
quite possible. This raises the possibility of using lags as a source of information, something that is not
possible in cross sections.
Time series models should take dynamics into account: lags, momentum, reaction time, etc. This means
that time series data have all the problems of cross section data, plus some more. Nevertheless, time series
data have a unique advantage because of time’s arrow. Simultaneity is a serious problem for cross section
data, but lags can help a time series analysis avoid the pitfalls of simultaneous equations. This is a good
reason to prefer time series data.
An interesting question arises with respect to time series data, namely, can we treat a time series, say GDP
from 1950 to 2000, as a random sample? It seems a little silly to argue that GDP in 1999 was a random
draw from a normal population of potential GDPs ranging from positive to negative infinity. Yet we can
treat any time series as a random sample if we condition on history. The value of GDP in 1999 was almost
certainly a function of GDP in 1998, 1997, etc. Let's assume that GDP today is a function of a bunch of
past values of GDP,
Y_t = a_0 + a_1 Y_{t-1} + … + a_p Y_{t-p} + ε_t

so that the error term is

ε_t = Y_t − a_0 − a_1 Y_{t-1} − … − a_p Y_{t-p}
So the error term is simply the difference between the value of GDP today and what it would have been if
the true relationship had held exactly. So a time series regression of the form

Y_t = a_0 + a_1 Y_{t-1} + … + a_p Y_{t-p} + ε_t

has a truly random error term.
Linear dynamic models
ADL model
The mother of all time series models is the autoregressivedistributed lag, ADL, model. It has the form,
assuming one dependent variable (Y) and one independent variable (X),
Y_t = a_0 + a_1 Y_{t-1} + … + a_p Y_{t-p} + b_0 X_t + b_1 X_{t-1} + … + b_p X_{t-p} + ε_t
Lag operator
Just for convenience, we will resort to a shorthand notation that is particularly useful for dynamic models.
Using the lag operator, we can write the ADL model very conveniently and compactly as
a(L) Y_t = b(L) X_t + ε_t

where a(L) and b(L) are "lag polynomials":

a(L) = a_0 + a_1 L + a_2 L^2 + … + a_p L^p = Σ_{i=0}^{p} a_i L^i
b(L) = b_0 + b_1 L + b_2 L^2 + … + b_k L^k = Σ_{j=0}^{k} b_j L^j

The "lag operator" works like this:

L X_t = X_{t-1}
L^2 X_t = L(L X_t) = L X_{t-1} = X_{t-2}
L^3 X_t = L(L^2 X_t) = X_{t-3}
L^k X_t = X_{t-k}

A first difference can be written as

ΔX_t = (1 − L) X_t = X_t − X_{t-1}
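The lag operator corresponds to pandas' shift, and the first difference (1 − L) to diff. A quick Python check (the series values are illustrative):

```python
import pandas as pd

x = pd.Series([3.0, 5.0, 9.0, 11.0])

lag1 = x.shift(1)      # L x_t   = x_{t-1}
lag2 = x.shift(2)      # L^2 x_t = x_{t-2}
diff = x - x.shift(1)  # (1 - L) x_t = x_t - x_{t-1}

print(diff.tolist())           # first element undefined (no x_0 available)
assert diff.equals(x.diff())   # pandas' diff() is exactly (1 - L)
```

The leading NaN in each lagged series reflects the observations lost at the start of the sample, just as Stata's D. and L. operators generate missing values there.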
So, using lag notation, the ADL(p,p) model
Y_t = a_0 + a_1 Y_{t-1} + … + a_p Y_{t-p} + b_0 X_t + b_1 X_{t-1} + … + b_p X_{t-p} + ε_t

can be written as

Y_t − a_1 Y_{t-1} − … − a_p Y_{t-p} = a_0 + b_0 X_t + b_1 X_{t-1} + … + b_p X_{t-p} + ε_t
(1 − a_1 L − … − a_p L^p) Y_t = a_0 + (b_0 + b_1 L + … + b_p L^p) X_t + ε_t
a(L) Y_t = a_0 + b(L) X_t + ε_t

where now a(L) = 1 − a_1 L − … − a_p L^p.
Mathematically, the ADL model is a difference equation, the discrete analog of a differential equation. For
the model to be stable (so that it does not explode to positive or negative infinity), the coefficients on the
lagged dependent variable must sum to a number less than one in absolute value. Since we do not observe
explosive behavior in economic systems, we will have to make sure all our dynamic models are stable. So,
for a model like the following ADL(p,p) model to be stable,

Y_t = a_0 + a_1 Y_{t-1} + … + a_p Y_{t-p} + b_0 X_t + b_1 X_{t-1} + … + b_p X_{t-p} + ε_t

we require that

|a_1 + a_2 + … + a_p| < 1
The coefficients on the X’s, and the values of the X’s, are not constrained because the X’s are the inputs to
the difference equation and do not depend on their own past values.
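A small simulation illustrates the stability requirement. In this Python sketch (the coefficients are invented), an AR(2) whose coefficients sum to 0.8 stays bounded, while one whose coefficients sum to 1.1 explodes:

```python
import numpy as np

def simulate_ar(coefs, n=200, seed=0):
    """Simulate Y_t = a1*Y_{t-1} + ... + ap*Y_{t-p} + e_t, starting from zeros."""
    rng = np.random.default_rng(seed)
    p = len(coefs)
    y = np.zeros(n + p)
    for t in range(p, n + p):
        y[t] = sum(a * y[t - i - 1] for i, a in enumerate(coefs)) + rng.normal()
    return y[p:]

stable = simulate_ar([0.5, 0.3])     # 0.5 + 0.3 = 0.8 < 1 : stable
explosive = simulate_ar([0.7, 0.4])  # 0.7 + 0.4 = 1.1 > 1 : explodes

print(abs(stable).max())     # stays moderate
print(abs(explosive).max())  # grows enormous
```

After 200 periods the explosive series is many orders of magnitude larger than the stable one, which is why explosive coefficient estimates in an estimated dynamic model are a red flag.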
All dynamic models can be derived from the ADL model. Let’s start with the simple ADL(1,1) model.
Y_t = a_0 + a_1 Y_{t-1} + b_0 X_t + b_1 X_{t-1} + ε_t
Static model
The static model (no dynamics) can be derived from this model by setting a_1 = b_1 = 0:

Y_t = a_0 + b_0 X_t + ε_t
AR model
The univariate autoregressive (AR) model sets b_0 = b_1 = 0:

Y_t = a_0 + a_1 Y_{t-1} + ε_t
Random walk model
Note that, if a_1 = 1 and a_0 = 0, then Y is a "random walk":

Y_t = Y_{t-1} + ε_t

Y is whatever it was in the previous period, plus a random error term, so the movements of Y are
completely random:

Y_t − Y_{t-1} = ΔY_t = ε_t

The random walk model describes the movement of a drop of water across a hot skillet. Also, because

ΔY_t = (1 − L) Y_t = a(L) Y_t = ε_t

Y is said to have a unit root because the root of the lag polynomial is equal to one.

If a_0 ≠ 0, then Y_t = a_0 + Y_{t-1} + ε_t and Y_t − Y_{t-1} = a_0 + ε_t, so

ΔY_t = a_0 + ε_t

This is a random walk "with drift," because if a_0 is positive, Y drifts up; if it is negative, Y drifts down.
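Both cases are easy to simulate. In this Python sketch, the cumulative sum of the errors is the random walk; adding a constant a_0 to each increment produces the drift (the drift value is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
eps = rng.normal(size=n)

a0 = 0.5                        # made-up drift term
rw = np.cumsum(eps)             # Y_t = Y_{t-1} + e_t        (pure random walk)
rw_drift = np.cumsum(a0 + eps)  # Y_t = a0 + Y_{t-1} + e_t   (random walk with drift)

print(rw[-1])        # order sqrt(n): wanders, no systematic trend
print(rw_drift[-1])  # roughly a0 * n = 500, plus the same wandering
```

The drifting series ends up near a_0·t, while the pure random walk stays on the order of the square root of the sample size, with no tendency to return to any level.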
First difference model
The first difference model can be derived by setting a_1 = 1 and b_1 = −b_0:

Y_t = a_0 + a_1 Y_{t-1} + b_0 X_t + b_1 X_{t-1} + ε_t
Y_t = a_0 + Y_{t-1} + b_0 X_t − b_0 X_{t-1} + ε_t
Y_t − Y_{t-1} = a_0 + b_0 (X_t − X_{t-1}) + ε_t
ΔY_t = a_0 + b_0 ΔX_t + ε_t

Note that there is "drift" or trend if a_0 ≠ 0.
Distributed lag model
The finite distributed lag model sets a_1 = 0:

Y_t = a_0 + b_0 X_t + b_1 X_{t-1} + ε_t
Partial adjustment model
The partial adjustment model sets b_1 = 0:

Y_t = a_0 + a_1 Y_{t-1} + b_0 X_t + ε_t
Error correction model
The error correction model is derived as follows. Start by writing the ADL(1,1) model in deviations,

(1)  y_t = a_1 y_{t-1} + b_0 x_t + b_1 x_{t-1} + ε_t

Subtract y_{t-1} from both sides:

y_t − y_{t-1} = a_1 y_{t-1} + b_0 x_t + b_1 x_{t-1} − y_{t-1} + ε_t

Add and subtract b_0 x_{t-1}:

y_t − y_{t-1} = a_1 y_{t-1} + b_0 x_t + b_1 x_{t-1} − y_{t-1} + b_0 x_{t-1} − b_0 x_{t-1} + ε_t

Collect terms:

(2)  Δy_t = (a_1 − 1) y_{t-1} + b_0 Δx_t + (b_0 + b_1) x_{t-1} + ε_t

Rewrite equation (1) as

y_t − a_1 y_{t-1} = b_0 x_t + b_1 x_{t-1} + ε_t

In long run equilibrium, y_t = y_{t-1}, x_t = x_{t-1}, and ε_t = 0, so

y_t (1 − a_1) = (b_0 + b_1) x_t
y_t = [(b_0 + b_1)/(1 − a_1)] x_t = k x_t    where    k = (b_0 + b_1)/(1 − a_1)

In long run equilibrium, y_t = y_{t-1}, so

y_{t-1} = [(b_0 + b_1)/(1 − a_1)] x_{t-1} = k x_{t-1}

Multiply both sides by (a_1 − 1) to get

(a_1 − 1) k x_{t-1} = −(b_0 + b_1) x_{t-1}

Substituting for (b_0 + b_1) x_{t-1} in (2) yields

Δy_t = (a_1 − 1) y_{t-1} − (a_1 − 1) k x_{t-1} + b_0 Δx_t + ε_t

Substituting for k and collecting terms in (a_1 − 1):

(3)  Δy_t = b_0 Δx_t + (a_1 − 1) { y_{t-1} − [(b_0 + b_1)/(1 − a_1)] x_{t-1} } + ε_t

Sometimes written as

(3′)  Δy_t = b_0 Δx_t + (a_1 − 1) (y_{t-1} − k x_{t-1}) + ε_t
The term (y_{t-1} − k x_{t-1}) is the difference between the equilibrium value of y (kx) and the actual value of y in the previous period. The coefficient k is the long run response of y to a change in x. The coefficient (a_1 − 1) is the speed of adjustment of y to errors in the form of deviations from the long run equilibrium. Together, the coefficient and the error are called the error correction mechanism. The coefficient b_0 is the impact on y of a change in x.
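The error correction algebra is easy to check numerically. Here is a quick sketch in Python (rather than Stata); the ADL coefficient values are made up for illustration, and we verify that iterating the ADL with x held fixed converges to the long run equilibrium y = kx.

```python
# Numerical check of the long run multiplier k = (b0 + b1) / (1 - a1).
# Coefficient values are hypothetical; any |a1| < 1 works.
a1, b0, b1 = 0.5, 0.3, 0.1

k = (b0 + b1) / (1 - a1)           # long run response of y to x
print(k)                           # 0.8

# Iterate y_t = a1*y_{t-1} + b0*x_t + b1*x_{t-1} with x fixed at 2.0:
# in equilibrium x_t = x_{t-1} = x, and y should converge to k*x = 1.6.
x, y = 2.0, 0.0
for _ in range(200):
    y = a1 * y + b0 * x + b1 * x
print(y)                           # converges to k * x
```

The iteration settles on y = 1.6, exactly the long run value implied by k.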
Cochrane-Orcutt model

Another commonly used time series model is the Cochrane-Orcutt or common factor model. It imposes the restriction that b_1 = −b_0 a_1.

  Y_t = a_0 + a_1 Y_{t-1} + b_0 X_t + b_1 X_{t-1} + ε_t
  Y_t = a_0 + a_1 Y_{t-1} + b_0 X_t − b_0 a_1 X_{t-1} + ε_t
  Y_t − a_1 Y_{t-1} = a_0 + b_0 X_t − b_0 a_1 X_{t-1} + ε_t
  (1 − a_1 L) Y_t = a_0 + b_0 (1 − a_1 L) X_t + ε_t

Y and X are said to have common factors because both lag polynomials have the same root, a_1. The terms Y_t − a_1 Y_{t-1} and (X_t − a_1 X_{t-1}) are called generalized first differences because they are differences, but the coefficient on the lag polynomial is a_1, instead of 1, which would be the coefficient for a first difference.
14 AUTOCORRELATION
We know that the Gauss-Markov assumptions require that

  Y_t = α + β X_t + u_t,  where u_t ~ iid(0, σ²).

The first i in iid is independent. This actually refers to the sampling method and assumes a simple random sample where the probability of observing u_t is independent of observing u_{t-1}. When this assumption, and the other assumptions, are true, ordinary least squares applied to this model is unbiased, consistent, and efficient.
As we have seen above, it is frequently necessary in time series modeling to add lags. Adding a lag of X is not a problem. However, adding a lagged dependent variable changes this dramatically. In the model

  Y_t = α + β X_t + γ Y_{t-1} + u_t,  u_t ~ iid(0, σ²)

the explanatory variables, which now include Y_{t-1}, cannot be assumed to be a set of constants, fixed in repeated sampling, because, since Y is a random variable, the lag of Y is also a random variable. It turns out that the best we can get out of ordinary least squares with a lagged dependent variable is that OLS is consistent and asymptotically efficient. It is no longer unbiased and efficient.
To see this, assume a very simple AR model (no x) expressed in deviations.

  y_t = a_1 y_{t-1} + ε_t

The OLS estimator of a_1 is

  â_1 = Σ y_t y_{t-1} / Σ y_{t-1}² = Σ (a_1 y_{t-1} + ε_t) y_{t-1} / Σ y_{t-1}² = a_1 + Σ y_{t-1} ε_t / Σ y_{t-1}²

with expected value

  E(â_1) = a_1 + E( Σ y_{t-1} ε_t / Σ y_{t-1}² ) = a_1 + Cov(y_{t-1}, ε_t) / Var(y_{t-1})

Since it is likely that Cov(y_{t-1}, ε_t) ≠ 0, at least in small samples, this means that E(â_1) ≠ a_1 and OLS is biased. However, in large samples, as the sample size goes to infinity, we expect that Cov(y_{t-1}, ε_t) = 0, which implies that OLS is at least consistent.
It is unfortunate that all we can hope for in most dynamic models is consistency, since most time series are
quite small relative to cross sections. We will just have to live with it.
If the errors are not independent, then they are said to be "autocorrelated" (i.e., correlated with themselves, lagged) or "serially correlated." The most common type of autocorrelation is so-called "first order" serial correlation,
  u_t = ρ u_{t-1} + ε_t

where −1 ≤ ρ ≤ 1 is the autocorrelation coefficient and ε_t ~ iid(0, σ_ε²).
It is possible to have autocorrelation of any order; the most general form would be q'th order autocorrelation,

  u_t = ρ_1 u_{t-1} + ρ_2 u_{t-2} + ... + ρ_q u_{t-q} + ε_t
Autocorrelation is also known as moving average errors.
Autocorrelation is caused primarily by omitting variables with time components. Sometimes we simply don't have all the variables we need and we have to leave out some variables that are unobtainable. The impact of these variables goes into the error term. If these variables have time trends or cycles or other time-dependent behavior, then the error term will be autocorrelated. It turns out that autocorrelation is so prevalent that we can expect some form of autocorrelation in any time series analysis.
Effect of autocorrelation on OLS estimates
The effect that autocorrelation has on OLS estimators depends on whether the model has a lagged dependent variable or not. If there is no lagged dependent variable, then autocorrelation causes OLS to be inefficient. However, there is a "second order bias" in the sense that OLS standard errors are underestimated and the corresponding t-ratios are overestimated in the usual case of positive autocorrelation. This means that our hypothesis tests can be wildly wrong.
7
If there is a lagged dependent variable, then OLS is biased, inconsistent, and inefficient, and the t-ratios are overestimated. There is nothing good about OLS under these conditions. Unfortunately, as we have seen above, most dynamic models use a lagged dependent variable, so we have to make sure we eliminate any autocorrelation.
We can prove the inconsistency of OLS in the presence of autocorrelation and a lagged dependent variable
as follows. Recall our simple AR(1) model above.
  y_t = a_1 y_{t-1} + u_t
  u_t = ρ u_{t-1} + ε_t

The ordinary least squares estimator of a_1 is,

  â_1 = Σ y_t y_{t-1} / Σ y_{t-1}² = Σ (a_1 y_{t-1} + u_t) y_{t-1} / Σ y_{t-1}² = a_1 + Σ y_{t-1} u_t / Σ y_{t-1}²
The expected value is

  E(â_1) = a_1 + E( Σ y_{t-1} u_t / Σ y_{t-1}² ) = a_1 + Cov(y_{t-1}, u_t) / Var(y_{t-1})
The covariance between the explanatory variable and the error term is
7 In the case of negative autocorrelation, it is not clear whether the t-ratios will be overestimated or underestimated. However, the t-ratios will be "wrong."
  Cov(y_{t-1}, u_t) = E(y_{t-1} u_t) = E[(a_1 y_{t-2} + u_{t-1})(ρ u_{t-1} + ε_t)]
                    = E[a_1 ρ y_{t-2} u_{t-1} + a_1 y_{t-2} ε_t + ρ u_{t-1}² + u_{t-1} ε_t]
                    = a_1 ρ E(u_{t-1} y_{t-2}) + ρ E(u_{t-1}²)

since E(y_{t-2} ε_t) = E(u_{t-1} ε_t) = 0 because ε is truly random. Now, E(u_{t-1}²) = Var(u) and, by stationarity, E(u_t y_{t-1}) = E(u_{t-1} y_{t-2}) = Cov(y_{t-1}, u_t), which implies

  E(u_t y_{t-1}) = a_1 ρ E(u_t y_{t-1}) + ρ E(u_{t-1}²)
  E(u_t y_{t-1}) − a_1 ρ E(u_t y_{t-1}) = ρ E(u_{t-1}²)

so that

  E(u_t y_{t-1}) = ρ E(u_{t-1}²) / (1 − a_1 ρ)

  Cov(y_{t-1}, u_t) = ρ Var(u) / (1 − a_1 ρ)
So, if ρ ≠ 0, the explanatory variable is correlated with the error term and OLS is biased. Since ρ is constant, it doesn't go away as the sample size goes to infinity. Therefore, OLS is inconsistent.

Since OLS is inefficient and has overestimated t-ratios in the best case, and is biased, inconsistent, and inefficient, with overestimated t-ratios, in the worst case, we should find out if we have autocorrelation or not.
Testing for autocorrelation
The Durbin-Watson test

The most famous test for autocorrelation is the Durbin-Watson statistic. It has the formula

  d = Σ_{t=2}^{T} (e_t − e_{t-1})² / Σ_{t=1}^{T} e_t²

where e_t is the residual from the OLS regression. Under the null hypothesis of no autocorrelation, we can derive the expected value as follows. Note that, under the null,

  u_t ~ IN(0, σ²)
which implies that E(u_t) = 0, E(u_t²) = σ², and E(u_t u_{t-1}) = 0. The empirical analog is that the residual, e, has the following properties: E(e_t) = E(e_{t-1}) = E(e_t e_{t-1}) = 0 and E(e_t²) = E(e_{t-1}²) = σ². Therefore,

  E(d) = E( Σ (e_t − e_{t-1})² / Σ e_t² )
       = Σ ( E(e_t²) − 2 E(e_t e_{t-1}) + E(e_{t-1}²) ) / Σ E(e_t²)
       = T (σ² + σ²) / (T σ²)
       = 2

where T is the sample size (in a time series, t = 1, ..., T).
Now assume autocorrelation, so that E(e_t e_{t-1}) ≠ 0. What happens to the expected value of the Durbin-Watson statistic? As a preliminary, let's review the concept of a correlation coefficient.
With x and y in deviations,

  r = Σ xy / (T s_x s_y)

But,

  cov(x, y) = E(xy) = (1/T) Σ xy

So,

  r = cov(x, y) / (s_x s_y)

In this case, x = e_t, y = e_{t-1}, and σ_x² = σ_y² = σ², which means that the autocorrelation coefficient is

  ρ = cov(e_t, e_{t-1}) / √(σ² σ²) = E(e_t e_{t-1}) / σ²

and

  E(e_t e_{t-1}) = ρ σ²
The expected value of the Durbin-Watson statistic is then

  E(d) = E( Σ (e_t − e_{t-1})² / Σ e_t² )
       = Σ ( E(e_t²) − 2 E(e_t e_{t-1}) + E(e_{t-1}²) ) / Σ E(e_t²)
       = T (σ² − 2ρσ² + σ²) / (T σ²)
       = 2 − 2ρ = 2(1 − ρ)
Since ρ is a correlation coefficient, it has to be less than or equal to one in absolute value. The usual case in economics is positive autocorrelation, so that ρ is positive. Clearly, as ρ approaches one, the Durbin-Watson statistic approaches zero. On the other hand, for negative autocorrelation, as ρ goes to negative one, the Durbin-Watson statistic goes to four. As we already know, when ρ = 0, the Durbin-Watson statistic is equal to two.
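We can verify the relationship E(d) = 2(1 − ρ) by simulation. The following quick sketch in Python (illustrative only; simulated data, not the Stata examples in this manual) generates an AR(1) series with ρ = 0.6 and compares the Durbin-Watson statistic to 2(1 − ρ̂).

```python
import random

# Simulate e_t = rho*e_{t-1} + noise and compare the Durbin-Watson
# statistic with 2*(1 - rho_hat).  Illustrative; rho = 0.6 is hypothetical.
random.seed(1)

rho = 0.6
e = [random.gauss(0.0, 1.0)]
for _ in range(4999):
    e.append(rho * e[-1] + random.gauss(0.0, 1.0))

num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
den = sum(v * v for v in e)
d = num / den                                        # Durbin-Watson statistic

rho_hat = sum(e[t] * e[t - 1] for t in range(1, len(e))) / den
print(round(d, 2), round(2 * (1 - rho_hat), 2))      # the two should be close
```

With ρ = 0.6 the statistic lands near 0.8, just as the formula predicts.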
The Durbin-Watson test, despite its fame, has a number of drawbacks. The first is that the test is not exact and there is an indeterminate range in the DW tables. For example, suppose you get a DW statistic of 1.35 and you want to know if you have autocorrelation. If you look up the critical value corresponding to a sample size of 25 and one explanatory variable for the five percent significance level, you will find two values, d_L = 1.29 and d_U = 1.45. If your value is below 1.29, you have significant autocorrelation, according to DW. If your value is above 1.45, you cannot reject the null hypothesis of no autocorrelation. Since your value falls between these two, what do you conclude? Answer: the test fails! Now what are you going to do? Actually, this is not a serious problem. Be conservative and simply ignore the lower value. That is, reject the null hypothesis of no autocorrelation whenever the DW statistic is below d_U.
8
The second problem is fatal. Whenever the model contains a lagged dependent variable, the DW test is
biased toward two. That is, when the test is most needed, it not only fails, it tells us we have no problem
when we could have a very serious problem. For this reason, we will need another test for autocorrelation.
The LM test for autocorrelation
8 If the autocorrelation is negative (DW above two), then the test statistic is 4 − DW.
Breusch and Godfrey independently developed an LM test for autocorrelation. Like all LM tests, we run
the regression and then save the residuals for further analysis. For example, consider the following model.
  Y_t = a_0 + b_0 X_t + u_t
  u_t = ρ u_{t-1} + ε_t

which implies that

  Y_t = a_0 + b_0 X_t + ρ u_{t-1} + ε_t
It would be nice to simply estimate this model, get an estimate of ρ, and test the null hypothesis that ρ = 0 with a t-test. Unfortunately, we don't observe the error term, so we can't do it that way. However, we can estimate the error term with the residuals from the OLS regression of Y on X and lag it to get an estimate of u_{t-1}. We can then run the test equation,

  Y_t = a_0 + b_0 X_t + ρ e_{t-1} + ε_t
where e_{t-1} is the lagged residual. We can test the null hypothesis with either the t-ratio on the lagged residual or, because this, like all LM tests, is a large sample test, we could use the F-ratio for the null hypothesis that the coefficient on the lagged residual is zero. The F-ratio is distributed according to chi-square with one degree of freedom. Both the Durbin-Watson and Breusch-Godfrey LM tests are automatic in Stata.
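The mechanics of the test are easy to reproduce outside Stata. The following Python sketch (simulated data, not the crimeva.dta example; all numbers are hypothetical) runs the two steps by hand: OLS of Y on X, then a second regression that adds the lagged residual.

```python
import random

def ols(y, cols):
    """OLS coefficients via the normal equations (Gaussian elimination)."""
    k = len(cols)
    a = [[sum(ci * cj for ci, cj in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(ci * yi for ci, yi in zip(cols[i], y))] for i in range(k)]
    for p in range(k):                         # forward elimination
        for r in range(p + 1, k):
            f = a[r][p] / a[p][p]
            for c in range(p, k + 1):
                a[r][c] -= f * a[p][c]
    b = [0.0] * k
    for p in range(k - 1, -1, -1):             # back substitution
        b[p] = (a[p][k] - sum(a[p][c] * b[c] for c in range(p + 1, k))) / a[p][p]
    return b

random.seed(2)
T = 400
x = [random.gauss(0.0, 1.0) for _ in range(T)]
u = [random.gauss(0.0, 1.0)]
for t in range(1, T):                          # AR(1) errors with rho = 0.7
    u.append(0.7 * u[-1] + random.gauss(0.0, 1.0))
y = [1.0 + 2.0 * x[t] + u[t] for t in range(T)]

b = ols(y, [[1.0] * T, x])                     # step 1: OLS of Y on X
e = [y[t] - b[0] - b[1] * x[t] for t in range(T)]

# step 2: rerun the regression with the lagged residual added
y2, x2, e1 = y[1:], x[1:], e[:-1]
b2 = ols(y2, [[1.0] * (T - 1), x2, e1])
print(round(b2[2], 2))    # coefficient on the lagged residual, near rho = 0.7
```

With autocorrelated errors the lagged residual picks up a coefficient close to the true ρ, which is exactly what the t- or F-test on that coefficient detects.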
I have a data set on crime and prison population in Virginia from 1972 to 1999 (crimeva.dta). The first
thing we have to do is tell Stata that this is a time series and the time variable is year.
. tsset year
time variable: year, 1956 to 2005
Now let’s do a simple regression of the log of major crime on prison.
. regress lcrmaj prison
      Source |       SS       df       MS              Number of obs =      28
-------------+------------------------------           F(  1,    26) =   83.59
       Model |  .570667328     1  .570667328           Prob > F      =  0.0000
    Residual |  .177496135    26  .006826774           R-squared     =  0.7628
-------------+------------------------------           Adj R-squared =  0.7536
       Total |  .748163462    27  .027709758           Root MSE      =  .08262

------------------------------------------------------------------------------
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |  -.1345066   .0147116    -9.14   0.000    -.1647467   -.1042665
       _cons |   9.663423   .0379523   254.62   0.000     9.585411    9.741435
------------------------------------------------------------------------------
First we want the Durbin-Watson statistic. It is OK since we don't have any lagged dependent variables.
. dwstat
Durbin-Watson d-statistic(  2,    28) =  .6915226
Now let’s get the BreuschGodfrey test statistic.
. bgodfrey
BreuschGodfrey LM test for autocorrelation

lags(p)  chi2 df Prob > chi2
+
1  10.184 1 0.0014

H0: no serial correlation
OK, the DW statistic is uncomfortably close to zero. (The d_L is 1.32 for k = 1 and T = 27 at the five percent level, so we reject the null hypothesis of no autocorrelation.) And the BG test is also highly significant. So we apparently have serious autocorrelation. Just for fun, let's do the LM test ourselves. First we have to save the residuals.
. predict e, resid
(22 missing values generated)
Now we add the lagged residuals to the regression.
. regress lcrmaj prison L.e
      Source |       SS       df       MS              Number of obs =      27
-------------+------------------------------           F(  2,    24) =   78.55
       Model |   .64494128     2   .32247064           Prob > F      =  0.0000
    Residual |  .098530434    24  .004105435           R-squared     =  0.8675
-------------+------------------------------           Adj R-squared =  0.8564
       Total |  .743471714    26  .028595066           Root MSE      =  .06407

------------------------------------------------------------------------------
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |  -.1451148   .0118322   -12.26   0.000    -.1695353   -.1206944
           e |
         L1. |   .6516013   .1631639     3.99   0.001     .3148475    .9883552
       _cons |   9.689539   .0308821   313.76   0.000     9.625801    9.753277
------------------------------------------------------------------------------
The t-test on the lagged e is highly significant, as expected. What is the corresponding F-ratio?
. test L.e
( 1) L.e = 0
F( 1, 24) = 15.95
Prob > F = 0.0005
It is certainly significant with the small sample correction. It must be incredibly significant asymptotically.
. scalar prob=1-chi2(1, 15.95)
. scalar list prob
prob = .00006504
Wow. Now that’s significant.
We apparently have very serious autocorrelation. What should we do about it? We discuss possible cures
below.
Testing for higher order autocorrelation
The Breusch-Godfrey test can be used to test for higher order autocorrelation. To test for second order serial correlation, for example, use the lags option of the bgodfrey command.
. bgodfrey, lags(1 2)
Breusch-Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |         10.184               1                   0.0014
       2     |         14.100               2                   0.0009
---------------------------------------------------------------------------
                        H0: no serial correlation
Hmmm, look’s bad. Let’s see if there is even higher order autocorrelation.
. bgodfrey, lags(1 2 3 4)
Breusch-Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |         10.184               1                   0.0014
       2     |         14.100               2                   0.0009
       3     |         14.412               3                   0.0024
       4     |         15.090               4                   0.0045
---------------------------------------------------------------------------
                        H0: no serial correlation
It gets worse and worse. These data apparently have severe high order serial correlation. We could go on
adding lags to the test equation, but we’ll stop here.
Cure for autocorrelation

There are two approaches to fixing autocorrelation. The classic textbook solution is to use Cochrane-Orcutt generalized first differences, also known as common factors.
The Cochrane-Orcutt method

Suppose we have the model above with first order autocorrelation.

  Y_t = a_0 + b_0 X_t + u_t
  u_t = ρ u_{t-1} + ε_t
Assume for the moment that we knew the value of the autocorrelation coefficient. Lag the first equation and multiply by ρ.

  ρ Y_{t-1} = ρ a_0 + ρ b_0 X_{t-1} + ρ u_{t-1}

Now subtract this equation from the first equation.

  Y_t − ρ Y_{t-1} = a_0 − ρ a_0 + b_0 X_t − ρ b_0 X_{t-1} + u_t − ρ u_{t-1}
  Y_t − ρ Y_{t-1} = a_0 (1 − ρ) + b_0 (X_t − ρ X_{t-1}) + ε_t

since u_t − ρ u_{t-1} = ε_t.
The problem is that we don't know the value of ρ. In fact, we are going to estimate a_0, b_0, and ρ jointly. Cochrane and Orcutt suggest an iterative method.

(1) Estimate the original regression Y_t = a_0 + b_0 X_t + u_t with OLS and save the residuals, e_t.

(2) Estimate the auxiliary regression e_t = ρ e_{t-1} + v_t with OLS and save the estimate of ρ, ρ̂.

(3) Transform the equation using generalized first differences using the estimate of ρ.
  Y_t − ρ̂ Y_{t-1} = a_0 (1 − ρ̂) + b_0 (X_t − ρ̂ X_{t-1}) + ε_t
(4) Estimate the generalized first difference regression with OLS, save the residuals.
(5) Go to (2).
Obviously, we need a stopping rule to avoid an infinite loop. The usual rule is to stop when successive estimates of ρ are within a certain amount, say, .01.
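The iteration in steps (1) through (5) can be sketched in a few lines. Here is a minimal Python version on simulated data (this is an illustration of the textbook algorithm, not Stata's prais implementation; the true ρ = 0.6 and slope 1.5 are made up):

```python
import random

def slope_intercept(y, x):
    """Simple OLS of y on a constant and x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

random.seed(3)
T, rho_true = 300, 0.6
x = [random.gauss(0.0, 1.0) for _ in range(T)]
u = [random.gauss(0.0, 1.0)]
for t in range(1, T):
    u.append(rho_true * u[-1] + random.gauss(0.0, 1.0))
y = [2.0 + 1.5 * x[t] + u[t] for t in range(T)]

a, b = slope_intercept(y, x)                        # step (1)
rho = 0.0
for _ in range(50):                                 # steps (2)-(5)
    e = [y[t] - a - b * x[t] for t in range(T)]
    _, rho_new = slope_intercept(e[1:], e[:-1])     # step (2): e_t on e_{t-1}
    ys = [y[t] - rho_new * y[t - 1] for t in range(1, T)]  # step (3):
    xs = [x[t] - rho_new * x[t - 1] for t in range(1, T)]  # generalized diffs
    astar, b = slope_intercept(ys, xs)              # step (4)
    a = astar / (1.0 - rho_new)                     # recover a from a*(1-rho)
    if abs(rho_new - rho) <= 0.01:                  # stopping rule
        break
    rho = rho_new

print(round(rho_new, 2), round(b, 2))   # estimates near 0.6 and 1.5
```

In practice the loop settles within a handful of iterations, which is why Stata's iteration log above converges so quickly.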
The Cochrane-Orcutt method can be implemented in Stata using the prais command. The prais command uses the Prais-Winsten technique, which is a modification of the Cochrane-Orcutt method that preserves the first observation so that you don't get a missing value because of lagging the residual.
. prais lcrmaj prison, corc
Iteration 0: rho = 0.0000
Iteration 1: rho = 0.6353
Iteration 2: rho = 0.6448
Iteration 3: rho = 0.6462
Iteration 4: rho = 0.6465
Iteration 5: rho = 0.6465
Iteration 6: rho = 0.6465
Iteration 7: rho = 0.6465
Cochrane-Orcutt AR(1) regression -- iterated estimates

      Source |       SS       df       MS              Number of obs =      27
-------------+------------------------------           F(  1,    25) =   28.34
       Model |  .111808732     1  .111808732           Prob > F      =  0.0000
    Residual |  .098637415    25  .003945497           R-squared     =  0.5313
-------------+------------------------------           Adj R-squared =  0.5125
       Total |  .210446147    26  .008094083           Root MSE      =  .06281

------------------------------------------------------------------------------
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |  -.1619739   .0304269    -5.32   0.000    -.2246394   -.0993085
       _cons |   9.736981   .0864092   112.68   0.000     9.559018    9.914944
-------------+----------------------------------------------------------------
         rho |   .6465167
------------------------------------------------------------------------------
Durbin-Watson statistic (original)    0.691523
Durbin-Watson statistic (transformed) 1.311269
The Cochrane-Orcutt approach has improved things, but the DW statistic still indicates some autocorrelation in the residuals.

Note that the Cochrane-Orcutt regression is also known as feasible generalized least squares or FGLS. We have seen above that it is a common factor model, and can be derived from a more general ADL model. It is therefore true only if the restrictions of the ADL are true. We can test these restrictions to see if the CO model is a legitimate reduction from the ADL.
Start with the CO model.

  Y_t − ρ Y_{t-1} = a* + b (X_t − ρ X_{t-1}) + ε_t

where a* = a_0 (1 − ρ). Rewrite the equation as

  (A)  Y_t = a* + ρ Y_{t-1} + b X_t − ρ b X_{t-1} + ε_t
Now for something completely different, namely the ADL model,

  (B)  Y_t = a_0 + a_1 Y_{t-1} + b_0 X_t + b_1 X_{t-1} + ε_t

Model (A) is actually model (B) with b_1 = −b_0 a_1, since a_1 = ρ and b_0 = b. The Cochrane-Orcutt model (A) is true only if b_1 = −b_0 a_1, or b_1 + b_0 a_1 = 0. This is a testable hypothesis.
Because the restriction multiplies two parameters, we can't use a simple F-test. Instead we use a likelihood ratio test. The likelihood ratio is a function of the residual sum of squares. So, we run the Cochrane-Orcutt model (using the corc option because we don't want to keep the first observation) and save the residual sum of squares, ESS(A). Then we run the ADL model and keep its residual sum of squares, ESS(B). If the Cochrane-Orcutt model is true, then ESS(A) = ESS(B). This implies that the ratio ESS(A)/ESS(B) = 1 and the log of this ratio, ln(ESS(A)/ESS(B)) = 0. This is the basis of the likelihood ratio test. It turns out that

  L = T ln( ESS(A)/ESS(B) ) ~ χ²_q

where q = K(B) − K(A), K(A) is the number of parameters in model (A), and K(B) is the number of parameters in model (B). If this statistic is not significant, then the Cochrane-Orcutt model is true. Otherwise, you should use the ADL model.
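The likelihood ratio arithmetic is easy to reproduce outside Stata. In the Python sketch below, the ESS values are the ones reported in the Stata output in this section, and the chi-square(2) tail probability is computed with the exact formula exp(−L/2) (valid only for two degrees of freedom).

```python
import math

# Likelihood ratio test L = T*ln(ESS(A)/ESS(B)), using the residual sums
# of squares reported by Stata in this section (T = 27 usable years).
T = 27
ess_a = 0.09863742        # restricted Cochrane-Orcutt model (A)
ess_b = 0.09853027        # unrestricted ADL model (B)

L = T * math.log(ess_a / ess_b)      # ~ chi-square with q degrees of freedom
p = math.exp(-L / 2.0)               # chi-square(2) upper-tail probability
print(round(L, 4))                   # 0.0293
print(round(p, 2))                   # 0.99
```

The p-value is essentially one, matching the `scalar list prob` result below: the common factor restriction is nowhere near rejection.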
We have already estimated the Cochrane-Orcutt model (model A). The residual sum of squares is .0986, but we can get Stata to remember this by using scalars.

. scalar ESSA=e(rss)
Now let’s estimate the ADL model (model B).
. regress lcrmaj prison L.lcrmaj L.prison
      Source |       SS       df       MS              Number of obs =      27
-------------+------------------------------           F(  3,    23) =   50.18
       Model |   .64494144     3   .21498048           Prob > F      =  0.0000
    Residual |  .098530274    23  .004283925           R-squared     =  0.8675
-------------+------------------------------           Adj R-squared =  0.8502
       Total |  .743471714    26  .028595066           Root MSE      =  .06545

------------------------------------------------------------------------------
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |  -.1457706   .1082056    -1.35   0.191     -.369611    .0780698
      lcrmaj |
         L1. |   .6515369   .1670076     3.90   0.001     .3060554    .9970184
      prison |
         L1. |   .0883118   .1116564     0.79   0.437    -.1426671    .3192906
       _cons |   3.393491   1.611994     2.11   0.046     .0588285    6.728154
------------------------------------------------------------------------------
Now do the likelihood ratio test.

. scalar ESSB=e(rss)
. scalar L=27*log(ESSA/ESSB)
. scalar list
          L =  .02934373
       ESSA =  .09863742
       ESSB =  .09853027

. scalar prob=1-chi2(2,.029)
. scalar list prob
prob = 1
Since ESSA and ESSB are virtually identical, the test statistic is not significant and Cochrane-Orcutt is justified in this case.
Curing autocorrelation with lags

Despite the fact that Cochrane-Orcutt appears to be justified in the above example, I do not recommend using it. The problem is that it is a purely mechanical solution. It has nothing to do with economics. As we noted above, the most common cause of autocorrelation is omitted time-varying variables. It is much more satisfying to specify the problem allowing for lags, momentum, reaction times, etc., that are very likely to be causing dynamic behavior than to use a simple statistical fix. Besides, in the above example, the Cochrane-Orcutt method did not solve the problem. The Durbin-Watson statistic still showed some autocorrelation even after FGLS. The reason, almost certainly, is that the model omits too many variables, all of which are likely to be time-varying. For these reasons, most modern time series analysts do not recommend Cochrane-Orcutt. Instead they recommend adding variables and lags until the autocorrelation is eliminated.

We have seen above that the ADL model is a generalization of the Cochrane-Orcutt model. Since using Cochrane-Orcutt involves restricting the ADL to have a common factor, a restriction that may or may not be true, why not simply avoid the problem by estimating the ADL directly? Besides, the ADL model allows us to model the dynamic behavior directly.

So, I recommend adding lags, especially lags of the dependent variable, until the LM test indicates no significant autocorrelation. Use the ten percent significance level when using this, or any other, specification test. For example, adding two lags of crime to the static crime equation eliminates the autocorrelation.
. regress lcrmaj prison L.lcrmaj LL.lcrmaj
      Source |       SS       df       MS              Number of obs =      28
-------------+------------------------------           F(  3,    24) =   59.71
       Model |  .659768412     3  .219922804           Prob > F      =  0.0000
    Residual |   .08839505    24  .003683127           R-squared     =  0.8819
-------------+------------------------------           Adj R-squared =  0.8671
       Total |  .748163462    27  .027709758           Root MSE      =  .06069

------------------------------------------------------------------------------
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |  -.0635699   .0208728    -3.05   0.006    -.1066493   -.0204905
      lcrmaj |
         L1. |   .9444664   .2020885     4.67   0.000     .5273762    1.361557
         L2. |  -.3882994   .1902221    -2.04   0.052    -.7808986    .0042997
       _cons |   4.293413   1.520191     2.82   0.009     1.155892    7.430934
------------------------------------------------------------------------------
. bgodfrey
Breusch-Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |          0.747               1                   0.3875
---------------------------------------------------------------------------
                        H0: no serial correlation
. bgodfrey, lags (1 2 3 4)
Breusch-Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |          0.747               1                   0.3875
       2     |          2.220               2                   0.3296
       3     |          2.284               3                   0.5156
       4     |          6.046               4                   0.1957
---------------------------------------------------------------------------
                        H0: no serial correlation
Heteroskedastic and autocorrelation consistent
standard errors
It is possible to generalize the concept of robust standard errors to include a correction for autocorrelation. In the chapter on heteroskedasticity we derived the formula for heteroskedasticity consistent standard errors.

  Var(β̂) = E( x_1² u_1² + x_2² u_2² + ... + x_T² u_T²
            + 2 x_1 x_2 u_1 u_2 + 2 x_1 x_3 u_1 u_3 + 2 x_1 x_4 u_1 u_4 + ... ) / (Σ x_t²)²

In the absence of autocorrelation, E(u_i u_j) = 0 for i ≠ j (independence). The OLS estimator ignores these terms and is unbiased. Note that the formula can be rewritten as follows.

  Var(β̂) = [ x_1² Var(u_1) + x_2² Var(u_2) + ... + x_T² Var(u_T)
            + 2 x_1 x_2 Cov(u_1, u_2) + 2 x_1 x_3 Cov(u_1, u_3) + 2 x_1 x_4 Cov(u_1, u_4) + ... ] / (Σ x_t²)²
If the covariances are positive (positive autocorrelation) and the x's are positively correlated, which is the usual case, then ignoring them will cause the standard error to be underestimated. If the covariances are negative while the x's are positively correlated, then ignoring them will cause the standard error to be overestimated. If some are positive and some negative, the direction of bias of the standard error will be unknown. Things are made more complicated if the observations on x are negatively correlated. The one thing we do know is that if we ignore the covariances when they are nonzero, then the OLS estimate of the standard errors is no longer unbiased. This is the famous second order bias of the standard errors under autocorrelation. Typically, everything is positively correlated, the standard errors are underestimated, and the corresponding t-ratios are overestimated, making variables appear to be related when they are in fact independent.

So, if we have autocorrelation, then E(u_i u_j) ≠ 0 and we cannot ignore these terms. The problem is, if we use all these terms, the standard errors are not consistent because, as the sample size goes to infinity, the number of terms goes to infinity. Since each term contains error, the amount of error goes to infinity and the estimate is not consistent. We could use a small number of terms, but that doesn't do it either, because it ignores higher order autocorrelation. Newey and West suggested a formula that increases the number of terms as the sample size increases, but at a much slower rate. The result is that the estimate is consistent
because the number of terms goes to infinity, but always with fewer than T terms. The formula relies on
something known as the Bartlett kernel, but that is beyond the scope of this manual.
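For the curious, here is a rough sketch of the Newey-West calculation with Bartlett weights w_l = 1 − l/(m+1), for the slope of a simple regression without a constant. This is a simplification for illustration, not Stata's exact implementation, and the data are simulated.

```python
import random

# Rough Newey-West (HAC) standard error for the slope of y on x (no
# constant), with Bartlett kernel weights.  Simulated data; m = 2 lags
# to mirror the lag(2) option used with newey below.
random.seed(4)
T, m = 200, 2
x = [random.gauss(0.0, 1.0) for _ in range(T)]
u = [random.gauss(0.0, 1.0)]
for t in range(1, T):                         # AR(1) errors, rho = 0.5
    u.append(0.5 * u[-1] + random.gauss(0.0, 1.0))
y = [x[t] + u[t] for t in range(T)]           # true slope = 1

sxx = sum(v * v for v in x)
b = sum(xi * yi for xi, yi in zip(x, y)) / sxx        # OLS slope
e = [y[t] - b * x[t] for t in range(T)]               # residuals estimate u

s = sum((x[t] * e[t]) ** 2 for t in range(T))         # lag-0 term
for l in range(1, m + 1):
    w = 1.0 - l / (m + 1.0)                           # Bartlett weight
    s += 2.0 * w * sum(x[t] * e[t] * x[t - l] * e[t - l] for t in range(l, T))

se_hac = (s / sxx ** 2) ** 0.5
print(round(b, 2), round(se_hac, 3))
```

The declining Bartlett weights are what keep the estimate both consistent and positive, even though only m of the T possible autocovariance terms are used.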
We can make this formula operational by using the residuals from the OLS regression to estimate u. To get heteroskedasticity and autocorrelation consistent (HAC, aka Newey-West) standard errors and t-ratios, use the newey command.
. newey lcrmaj prison, lag(2)
Regression with Newey-West standard errors      Number of obs  =        28
maximum lag: 2                                  F(  1,    26)  =     60.80
                                                Prob > F       =    0.0000

------------------------------------------------------------------------------
             |             Newey-West
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |  -.1345066   .0172496    -7.80   0.000    -.1699636   -.0990496
       _cons |   9.663423   .0498593   193.81   0.000     9.560936    9.765911
------------------------------------------------------------------------------
Note that you must specify the number of lags in the newey command. We can also use Newey-West HAC standard errors in our ADL equation, even after autocorrelation has presumably been eliminated.
. newey lcrmaj prison L.lcrmaj LL.crmaj, lag(2)
Regression with Newey-West standard errors      Number of obs  =        28
maximum lag: 2                                  F(  3,    24)  =     66.77
                                                Prob > F       =    0.0000

------------------------------------------------------------------------------
             |             Newey-West
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |  -.0608961   .0193394    -3.15   0.004    -.1008106   -.0209816
      lcrmaj |
         L1. |   .9678325   .1822947     5.31   0.000     .5915948     1.34407
       crmaj |
         L2. |  -.0000335   .0000103    -3.25   0.003    -.0000548   -.0000123
       _cons |   .8272894    1.67426     0.49   0.626    -2.628212    4.282791
------------------------------------------------------------------------------
Summary

David Hendry, one of the leading time series econometricians, suggests the following general-to-specific methodology. Start with a very general ADL model with lots of lags. A good rule of thumb is at least two lags for annual data, five lags for quarterly data, and 13 lags for monthly data. Check the Breusch-Godfrey test to make sure there is no autocorrelation. If there is, add more lags. Once you have no autocorrelation, you can begin reducing the model to a parsimonious, interpretable model. Delete insignificant variables. Justify your deletions with t-tests and F-tests. Make sure that dropping variables does not cause autocorrelation. If it does, then keep the variable even if it is not significant. When you get down to a parsimonious model with mostly significant variables, make sure that all of the omitted variables are jointly insignificant with an overall F-test. Make sure that there is no autocorrelation in the final version of the model. Use Newey-West heteroskedasticity and autocorrelation robust t-ratios.
15 NONSTATIONARITY, UNIT
ROOTS, AND RANDOM WALKS
The Gauss-Markov assumptions require that the error term is distributed as iid(0, σ²). We have seen how we can avoid problems associated with non-independence (autocorrelation) and non-identical variances (heteroskedasticity). However, we have maintained the assumption that the mean of the distribution is constant (and equal to zero). In this section we confront the possibility that the error mean is not constant.
A time series is stationary if its probability distribution does not change over time. The probability of observing any given y is conditional on all past values, Pr(y_t | y_{t-1}, ..., y_1, y_0). A stationary process is one where this distribution doesn't change over time:

  Pr(y_t, y_{t+1}, ..., y_{t+p}) = Pr(y_{t+q}, y_{t+1+q}, ..., y_{t+p+q})

for any t, p, or q. If the distribution is constant over time, then its mean and variance are constant over time. Finally, the covariances of the series must not be dependent on time, that is,

  Cov(y_t, y_{t+p}) = Cov(y_{t+q}, y_{t+p+q})
If a series is stationary, we can infer its properties from examination of its histogram, its sample mean, its
sample variance, and its sample autocovariances. On the other hand, if a series is not stationary, then we
can’t do any of these things. If a time series is nonstationary then we cannot consider a sample of
observations over time as representative of the true distribution, because the truth at time t is different from
the truth at time t+p.
Consider a simple deterministic time trend. It is called deterministic because it can be forecast with complete accuracy.

  y_t = α + δt

Is y stationary? The mean of y is

  E(y_t) = μ_t = α + δt
Clearly, the mean changes over time. Therefore, a linear time trend is not stationary.
Now consider a simple random walk.
  y_t = y_{t-1} + ε_t,  ε_t ~ iid(0, σ²)
For the sake of argument, let's assume that y = 0 when t = 0. Therefore,

  y_1 = 0 + ε_1
  y_2 = 0 + ε_1 + ε_2
  y_3 = 0 + ε_1 + ε_2 + ε_3

so that

  y_t = ε_1 + ε_2 + ... + ε_t
Thus, the variance of y_t is

  Var(y_t) = E(y_t²) = E(ε_1 + ε_2 + ... + ε_t)² = E(ε_1²) + E(ε_2²) + ... + E(ε_t²)

because of the independence assumption. Also, because of the identical assumption,

  σ_1² = σ_2² = ... = σ²

which implies that

  Var(y_t) = tσ²
Obviously, the variance of y depends on time and y cannot be stationary. Also, note that the variance of y
goes to infinity as t goes to infinity. A random walk has potentially infinite variance.
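A quick Monte Carlo check of Var(y_t) = tσ² (an illustrative Python sketch, not part of the Stata examples; the parameter values are arbitrary):

```python
import random

# Simulate many random walks of length T and check that the variance of
# y_T across replications is close to T * sigma^2.
random.seed(5)
sigma, T, reps = 1.0, 50, 4000

ys = []                                   # value of y_T in each replication
for _ in range(reps):
    y = 0.0
    for _ in range(T):
        y += random.gauss(0.0, sigma)     # y_t = y_{t-1} + eps_t
    ys.append(y)

mean = sum(ys) / reps
var = sum((v - mean) ** 2 for v in ys) / reps
print(round(var, 1))    # should be close to T * sigma^2 = 50
```

The simulated variance comes out near 50, and doubling T would roughly double it, which is the sense in which a random walk's variance grows without bound.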
A random walk is also known as a stochastic trend. Unlike the deterministic trend above, we cannot
forecast future values of y with certainty. Also unlike deterministic trends, stochastic trends can exhibit
booms and busts. Just because a random walk is increasing for several time periods does not mean that it
will necessarily continue to increase. Similarly, just because it is falling does not mean we can expect it to
continue to fall. Many economic time series appear to behave like stochastic trends. As a result most time
series analysts mean stochastic trend when they say trend.
As we saw in the section on linear dynamic models, a random walk can be written as

  y_t − y_{t-1} = Δy_t = (1 − L) y_t = ε_t

The lag polynomial is (1 − L), so the root of the lag polynomial is one and y is said to have a unit root. Unit root and stochastic trend are used interchangeably.
Random walks and ordinary least squares
If we estimate a regression where the independent variable is a random walk, then the usual t- and F-tests are invalid. The reason is that if the regressor is nonstationary, it has no stable distribution, so the standard errors and t-ratios are not derived from a stable distribution. There is no distribution from which the statistics can be derived. The problem is similar to autocorrelation, only worse, because the autocorrelation coefficient is equal to one: extreme autocorrelation.
Granger and Newbold published an article in 1974 where they did some Monte Carlo exercises. They
generated two random walks completely independently of one another and then regressed one on the other.
Even though the true relationship between the two variables was zero, Granger and Newbold got t-ratios around 4.5. They called this phenomenon "spurious regression."
We can replicate their findings in Stata. In the crimeva.dta data set we first generate a random walk called
y1. Start it out at y1=0 in the first year, 1956.
. gen y1=0
This creates a series of zero from 1956 to 2005. Now create the random walk by making y1 equal to itself
lagged plus a random error term. We use the replace command to replace the zeroes. We have to start in
1957 so we have a previous value (zero) to start from. The function 5*invnorm(uniform()) produces a
random number from a normal distribution with a mean of zero and a standard deviation of 5.
. replace y1=L.y1+5*invnorm(uniform()) if year > 1956
(49 real changes made)
Repeat for y2.
. gen y2=0
. replace y2=L.y2+5*invnorm(uniform()) if year > 1956
(49 real changes made)
Obviously, y1 and y2 are independent of each other. So if we regressed y1 on y2 there should be no
significant relationship.
. regress y1 y2
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   55.23
       Model |  9328.23038     1  9328.23038           Prob > F      =  0.0000
    Residual |  8107.68518    48  168.910108           R-squared     =  0.5350
-------------+------------------------------           Adj R-squared =  0.5253
       Total |  17435.9156    49  355.835011           Root MSE      =  12.997

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          y2 |  -.9748288   .1311767    -7.43   0.000    -1.238577   -.7110805
       _cons |  -14.82044   3.009038    -4.93   0.000    -20.87052   -8.770368
------------------------------------------------------------------------------
This is a spurious regression. It looks like y1 is negatively (and highly significantly) associated with y2
even though they are completely independent.
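One run could be a fluke, so it is worth repeating the Granger-Newbold experiment many times. Below is a sketch using Stata's simulate command; the program name spurious and the use of rnormal() are my own choices (the text's 5*invnorm(uniform()) draws the same shocks). Each replication builds two fresh, independent random walks, regresses one on the other, and saves the t-ratio; the absolute t-ratios will typically be far larger than 2.

```stata
capture program drop spurious
program define spurious, rclass
    drop _all
    set obs 50
    gen y1 = sum(5*rnormal())      // independent random walk 1
    gen y2 = sum(5*rnormal())      // independent random walk 2
    regress y1 y2
    return scalar t = _b[y2]/_se[y2]
end
simulate t=r(t), reps(500) nodots: spurious
summarize t, detail
```

The detailed summary shows how often |t| exceeds the usual 5 percent critical value of about 2, even though the true relationship is always zero.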
Therefore, to avoid spurious regressions, we need to know if our variables have unit roots.
Testing for unit roots
The most widely used test for a unit root is the augmented Dickey-Fuller (ADF) test. The idea is very simple. Suppose we start with a simple AR(1) model.

y_t = a_0 + a_1 y_{t−1} + ε_t

Subtract y_{t−1} from both sides,

y_t − y_{t−1} = a_0 + a_1 y_{t−1} − y_{t−1} + ε_t

or

Δy_t = a_0 + (a_1 − 1) y_{t−1} + ε_t

This is the (nonaugmented) Dickey-Fuller test equation, known as a DF equation. Note that if the AR model is stationary, |a_1| < 1, in which case a_1 − 1 < 0. On the other hand, if y has a unit root then a_1 = 1 and a_1 − 1 = 0. So we can test for a unit root using a t-test on the lagged y.
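Because the DF equation is just an OLS regression, we can also run it by hand; the point estimate and t-ratio match what dfuller reports (only the critical values differ from the usual t table). A sketch, assuming the random walk y1 from the crimeva.dta example is in memory and the data are tsset:

```stata
* Dickey-Fuller test equation estimated directly by OLS
* (regress includes a constant by default, matching dfuller's default)
regress D.y1 L.y1
* the DF statistic is just the t-ratio on L.y1
display _b[L.y1]/_se[L.y1]
```

The displayed ratio should equal the Z(t) statistic that dfuller y1, lags(0) prints.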
Since most time series are autocorrelated, and autocorrelation creates overestimated t-ratios, we should do something to correct for autocorrelation. Consider the following AR(1) model with first order autocorrelation.

y_t = a_1 y_{t−1} + u_t
u_t = ρ u_{t−1} + ε_t,   ε_t ~ iid(0, σ²)
But the error term u_t is also

u_t = y_t − a_1 y_{t−1}
u_{t−1} = y_{t−1} − a_1 y_{t−2}
ρ u_{t−1} = ρ y_{t−1} − ρ a_1 y_{t−2}

Therefore,

y_t = a_1 y_{t−1} + u_t
y_t = a_1 y_{t−1} + ρ y_{t−1} − ρ a_1 y_{t−2} + ε_t
Subtract y_{t−1} from both sides to get

y_t − y_{t−1} = a_1 y_{t−1} − y_{t−1} + ρ y_{t−1} − ρ a_1 y_{t−2} + ε_t

Add and subtract ρ a_1 y_{t−1},

Δy_t = (a_1 − 1) y_{t−1} + ρ y_{t−1} − ρ a_1 y_{t−1} + ρ a_1 y_{t−1} − ρ a_1 y_{t−2} + ε_t

Collect terms,

Δy_t = (a_1 + ρ − ρ a_1 − 1) y_{t−1} + ρ a_1 Δy_{t−1} + ε_t
Δy_t = (a_1 − 1)(1 − ρ) y_{t−1} + ρ a_1 Δy_{t−1} + ε_t
Δy_t = γ y_{t−1} + δ Δy_{t−1} + ε_t

where γ = (a_1 − 1)(1 − ρ) and δ = ρ a_1.
If a_1 = 1 (unit root) then γ = 0. We can test this hypothesis with a t-test. Note that the lagged difference has eliminated the autocorrelation, generating an iid error term. If there is higher order autocorrelation, simply add more lagged differences. All we are really doing is adding lags of the dependent variable to the test equation. We know from our analysis of autocorrelation that this is a cure for autocorrelation. The lagged differences are said to “augment” the Dickey-Fuller test.
The ADF in general form is

Δy_t = a_0 + (a_1 − 1) y_{t−1} + c_1 Δy_{t−1} + c_2 Δy_{t−2} + ... + c_p Δy_{t−p} + ε_t
I keep saying that we test the hypothesis that (a_1 − 1) = 0 with the usual t-test. The ADF test equation is estimated using OLS. The t-ratio is computed as usual, the estimated parameter divided by its standard error. However, while the t-ratio is standard, its distribution is anything but standard. For reasons that are explained in the last section of this chapter, the t-ratio in the Dickey-Fuller test equation has a nonstandard distribution. Dickey and Fuller, and other researchers, have tabulated critical values of the DF t-ratio. These critical values are derived from Monte Carlo exercises. The DF critical values are tabulated in many econometrics books and are automatically available in Stata.
As an example, we can do our own mini Monte Carlo exercise using y1 and y2, the random walks we created above in the crimeva.dta data set. Dickey-Fuller unit root tests can be done with the dfuller command. Since we know there is no autocorrelation, we will not add any lags.
. dfuller y1, lags(0)
Dickey-Fuller test for unit root                   Number of obs   =        49

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -1.761            -3.587            -2.933            -2.601
------------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.4001
Note that the critical values are much higher (in absolute value) than those of the usual t distribution. We cannot reject the null hypothesis of a unit root, as expected. We should always have an intercept in any real-world application, but we know that we did not use an intercept when we constructed y1, so let's rerun the DF test with no constant.
. dfuller y1, noconstant lags(0)
Dickey-Fuller test for unit root                   Number of obs   =        49

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)              0.143            -2.622            -1.950            -1.610
------------------------------------------------------------------------------
The t-ratio is approximately zero, as expected. Note the critical values. They are much higher in absolute value than the corresponding critical values for the standard t-distribution.
So far we have tested for a unit root where the alternative hypothesis is that the variable is stationary around a constant value. If a variable has a distinct trend, and many economic variables, like GDP, have a distinct upward trend, then we need to be able to test for a unit root against the alternative that y is stationary around a deterministic trend. To do this we simply add a deterministic trend term to the ADF test equation.
Δy_t = a_0 + (a_1 − 1) y_{t−1} + δ t + c_1 Δy_{t−1} + c_2 Δy_{t−2} + ... + c_p Δy_{t−p} + ε_t
The t-ratio on (a_1 − 1) is again the test statistic. However, the critical values are different for test equations with deterministic trends. Fortunately, Stata knows all this and adjusts accordingly. Since it makes a difference whether the variable we are testing for a unit root has a trend or not, it is a good idea to graph the variable and look at it. If it has a distinct trend, include a trend in the test equation. For example, we might want to test the prison population (in the crimeva.dta data set) for stationarity. The first thing we do is graph it.
This variable has a distinct trend. Is it a deterministic trend or a stochastic trend? Let's test for a unit root using an ADF test equation with a deterministic trend. We choose four lags to start with and add the regress option to see the test equation.
. dfuller prison, lags(4) regress trend
Augmented Dickey-Fuller test for unit root         Number of obs   =        23

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -1.328            -4.380            -3.600            -3.240
------------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.8809

------------------------------------------------------------------------------
    D.prison |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |
         L1. |  -.1454225   .1095418    -1.33   0.203    -.3776407    .0867956
         LD. |   .9004267   .2221141     4.05   0.001      .429566    1.371288
        L2D. |  -.4960933   .3426802    -1.45   0.167    -1.222543    .2303562
        L3D. |   .3033914   .2941294     1.03   0.318    -.3201351    .9269179
        L4D. |  -.1248609   .3584488    -0.35   0.732    -.8847383    .6350165
      _trend |   .0214251   .0127737     1.68   0.113    -.0056539     .048504
       _cons |   .0723614   .0643962     1.12   0.278    -.0641524    .2088751
------------------------------------------------------------------------------
How many lags should we include? The rule of thumb is at least two for annual data, five for quarterly data, etc. Since these data are annual, we start with four, a bit more than the rule of thumb requires, to be safe. It is of some importance to get the number of lags right because too few lags causes autocorrelation, which means that our OLS estimates are biased and inconsistent, with overestimated t-ratios. Too many lags means that our estimates are inefficient, with underestimated t-ratios. Which is worse? Monte Carlo studies have shown that including too few lags is much worse. There are three approaches to this problem. The first is to use t-tests and F-tests on the lags to decide whether to drop them or not. The second is to use model selection criteria. The third approach is to do it all possible ways and see if we get the same answer. The problem with the third approach is that, if we get different answers, we don't know which to believe.
Choosing the number of lags with F tests
I recommend using F-tests to choose the lag length. Look at the t-ratios on the lags and drop the insignificant lags (use a .15 significance level to be sure not to drop too many lags). The t- and F-ratios on the lags are standard, so use the standard critical values provided by Stata. It appears that the last three lags are not significant. Let's test this hypothesis with an F-test.
. test L4D.prison L3D.prison L2D.prison
( 1) L4D.prison = 0
( 2) L3D.prison = 0
( 3) L2D.prison = 0
F( 3, 16) = 0.86
Prob > F = 0.4835
So, the last three lags are indeed not significant and we settle on one lag as the correct model.
. dfuller prison, lags(1) regress trend
Augmented Dickey-Fuller test for unit root         Number of obs   =        26

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -2.352            -4.371            -3.596            -3.238
------------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.4055

------------------------------------------------------------------------------
    D.prison |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |
         L1. |  -.1608155   .0683769    -2.35   0.028    -.3026204   -.0190106
         LD. |   .6308502     .15634     4.04   0.001     .3066209    .9550795
      _trend |   .0211228   .0091946     2.30   0.031     .0020544    .0401913
       _cons |   .1145497   .0493077     2.32   0.030     .0122918    .2168075
------------------------------------------------------------------------------
Note that the lag length apparently doesn’t matter because the DF test statistic is not significant in either
model. Prison population in Virginia appears to be a random walk.
In some cases it is not clear whether to use a trend or not. Here is the graph of the log of major crime in
Virginia.
Does it have a trend, two trends (one up, one down), or no trend? Let’s start with a trend and four lags.
. dfuller lcrmaj, lags(4) regress trend
Augmented Dickey-Fuller test for unit root         Number of obs   =        35

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -1.832            -4.288            -3.560            -3.216
------------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.6889

------------------------------------------------------------------------------
    D.lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lcrmaj |
         L1. |  -.1243068   .0678357    -1.83   0.078    -.2632618    .0146483
         LD. |  -.0308934   .1746865    -0.18   0.861    -.3887225    .3269358
        L2D. |  -.2231895    .171358    -1.30   0.203    -.5742004    .1278214
        L3D. |  -.1747988   .1704449    -1.03   0.314    -.5239392    .1743417
        L4D. |  -.0359288    .173253    -0.21   0.837    -.3908215    .3189639
      _trend |  -.0053823   .0018615    -2.89   0.007    -.0091955   -.0015691
       _cons |   1.282789    .640328     2.00   0.055    -.0288636    2.594441
------------------------------------------------------------------------------
The trend is significant, so we retain it. Let’s test for the significance of the lags.
. test L4D.lcrmaj L3D.lcrmaj L2D.lcrmaj LD.lcrmaj
( 1) L4D.lcrmaj = 0
( 2) L3D.lcrmaj = 0
( 3) L2D.lcrmaj = 0
( 4) LD.lcrmaj = 0
F( 4, 28) = 0.72
Prob > F = 0.5879
None of the lags are significant, so we drop them all and run the test again.
. dfuller lcrmaj, lags(0) regress trend
Dickey-Fuller test for unit root                   Number of obs   =        39

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -1.314            -4.251            -3.544            -3.206
------------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.8845

------------------------------------------------------------------------------
    D.lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lcrmaj |
         L1. |  -.0620711   .0472515    -1.31   0.197    -.1579016    .0337594
      _trend |  -.0030963   .0010268    -3.02   0.005    -.0051789   -.0010138
       _cons |   .6447613   .4310014     1.50   0.143    -.2293501    1.518873
------------------------------------------------------------------------------
Again, the number of lags does not make any real difference. The ADF test statistics are all insignificant,
indicating that crime in Virginia appears to be a random walk.
Digression: model selection criteria
One way to think of the problem is to hypothesize that the correct model should have the highest R-square and the lowest error variance compared to the alternative models. Recall that the R-square measures the ratio of the variation explained by the regression to the total variation of the dependent variable. The problem with this approach is that the R-square can be maximized by simply adding more independent variables.
One way around this problem is to correct the R-square for the number of explanatory variables used in the model. The adjusted R-square, also known as R̄² (R-bar-square), is the ratio of the variance (instead of variation) explained by the regression to the total variance.

R̄² = 1 − Var(e)/Var(y) = 1 − (1 − R²)(T − 1)/(T − k)

where T is the sample size and k is the number of independent variables. So, as we add independent variables, R² increases, but so does the penalty factor (T − 1)/(T − k). Consequently, R̄² can go up or down. One rule might be to choose the number of lags such that R̄² is a maximum.
The Akaike information criterion (AIC) has been proposed as an improvement over R̄², on the theory that R̄² does not penalize heavily enough for adding explanatory variables. The formula is

AIC = log( Σ e_i² / T ) + 2k/T

The goal is to choose the number of explanatory variables (lags in the context of a unit root test) that minimizes the AIC.
The Bayesian information criterion (BIC, also known as the Schwarz criterion) is

BIC = log( Σ e_i² / T ) + k log(T)/T

The BIC penalizes additional explanatory variables even more heavily than the AIC. Again, the goal is to minimize the BIC.
Note that these information criteria are not formal statistical tests; however, they have been shown to be useful in determining the lag length in unit root models. I usually use the BIC, but in most cases they both give the same answer.
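In more recent versions of Stata (9 and later), official Stata also reports these criteria after any estimation command via estat ic. One caveat: estat ic uses the likelihood-based scaling AIC = −2lnL + 2k and BIC = −2lnL + k·ln(N), not the per-observation formulas above, but the ranking of candidate models is what matters. A sketch (the regression shown is just an illustrative ADF-style equation for the crimeva.dta data, with year serving as the deterministic trend):

```stata
* fit an ADF-style test regression by hand, then ask for AIC and BIC
quietly regress D.prison L.prison LD.prison year
estat ic
```

Refitting with different numbers of lagged differences and comparing the reported BIC values gives the same kind of comparison done with fitstat below.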
Choosing lags using model selection criteria
Let's adopt the information criterion approach. If your version of Stata has the fitstat command, invoke fitstat after the dfuller command. If your version does not have fitstat, you can install it by doing a keyword search for fitstat and clicking on “click here to install.” Fitstat generates a variety of goodness-of-fit statistics, including the BIC. Then run the test several times with different lags and check the information criterion. The minimum (biggest negative) value is the one to go with.
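The repeated runs below can also be automated with a forvalues loop. A sketch, assuming crimeva.dta is in memory and tsset (fitstat works after dfuller because dfuller leaves its test regression in memory, as noted above):

```stata
* run the ADF test for each lag length and report fit statistics
forvalues p = 0/4 {
    display as text "ADF with lags(`p')"
    dfuller prison, lags(`p') trend
    fitstat
}
```

You would then scan the BIC values in the loop's output for the minimum.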
. dfuller prison, lags(2) trend
Augmented Dickey-Fuller test for unit root         Number of obs   =        25

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -1.528            -4.380            -3.600            -3.240
------------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.8182

. fitstat

Measures of Fit for regress of D.prison

Log-Lik Intercept Only:       18.085   Log-Lik Full Model:         27.328
D(20):                       -54.656   LR(4):                      18.486
                                       Prob > LR:                   0.001
R2:                            0.523   Adjusted R2:                 0.427
AIC:                          -1.786   AIC*n:                     -44.656
BIC:                        -119.034   BIC':                       -5.610
. dfuller prison, lags(1) trend
Augmented Dickey-Fuller test for unit root         Number of obs   =        26

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -2.352            -4.371            -3.596            -3.238
------------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.4069

. fitstat

Measures of Fit for regress of D.prison

Log-Lik Intercept Only:       18.727   Log-Lik Full Model:         27.580
D(22):                       -55.160   LR(3):                      17.706
                                       Prob > LR:                   0.001
R2:                            0.494   Adjusted R2:                 0.425
AIC:                          -1.814   AIC*n:                     -47.160
BIC:                        -126.838   BIC':                       -7.931
. dfuller prison, lags(0) trend
Dickey-Fuller test for unit root                   Number of obs   =        27

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -1.647            -4.362            -3.592            -3.235
------------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.7726

. fitstat

Measures of Fit for regress of D.prison

Log-Lik Intercept Only:       19.427   Log-Lik Full Model:         21.654
D(24):                       -43.308   LR(2):                       4.454
                                       Prob > LR:                   0.108
R2:                            0.152   Adjusted R2:                 0.081
AIC:                          -1.382   AIC*n:                     -37.308
BIC:                        -122.408   BIC':                        2.138
So we go with one lag. In this case it doesn’t make any difference, so we are pretty confident that prison is
a stochastic trend.
DFGLS test
A soupedup version of the ADF which is somewhat more powerful has been developed by Elliott,
Rothenberg, and Stock (ERS, 1996).
9
ERS use a two step method where the data are detrended using
generalized least squares, GLS. The resulting detrended series is used in a standard ADF test.
The basic idea behind the test can be seen as follows. Suppose the null hypothesis is

H_0: Y_t = δ + Y_{t−1} + ε_t, or ΔY_t = δ + ε_t, where ε_t ~ I(0)

The alternative hypothesis is

H_1: Y_t = α + βt + ε_t, or Y_t − α − βt = ε_t, where ε_t ~ I(0)

If the null hypothesis is true but we assume the alternative is true and detrend,

Y_t − α − βt = δ + (Y_{t−1} − α − βt) + ε_t

Y is still a random walk.
So, the idea is to detrend Y and then test the resulting series for a unit root. If we reject the null hypothesis of a unit root, then the series is stationary. If we cannot reject the null hypothesis of a unit root, then Y is a nonstationary random walk.
What about autocorrelation? ERS suggest correcting for autocorrelation using a generalized least squares procedure, similar to Cochrane-Orcutt. The procedure is done in two steps. In the first step we transform to generalized first differences,

Y*_t = Y_t − ρ Y_{t−1}

where ρ = 1 − 7/T if there is no obvious trend and ρ = 1 − 13.5/T if there is a trend. The idea behind this choice of ρ is that ρ should go to one as T goes to infinity. The values 7 and 13.5 were chosen with Monte Carlo methods.
Under the alternative, the deterministic part transforms the same way:

Y_t = α + βt
ρ Y_{t−1} = ρα + ρβ(t − 1)
Y_t − ρ Y_{t−1} = α + βt − (ρα + ρβ(t − 1))
              = α(1 − ρ) + β(t − ρ(t − 1))

This equation is estimated with OLS.
We compute the detrended series,

Y^d_t = Y_t − α̂ − β̂t

if there is a trend and

Y^d_t = Y_t − α̂

if no trend.
Finally, we use the Dickey-Fuller test (not the augmented DF), with no intercept, to test Y^d_t for a unit root.

9 Elliott, G., T. Rothenberg, and J. H. Stock. 1996. Efficient tests for an autoregressive unit root. Econometrica 64: 813–836.
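The two ERS steps can also be carried out by hand before reaching for the canned command. This is a sketch under stated assumptions: the series is called y and has been tsset with no gaps, and the variable names cstar, tstar, ystar, and yd are mine (trend case shown):

```stata
* hand-rolled version of the ERS two-step procedure (trend case)
local rho = 1 - 13.5/_N
gen cstar = 1 - `rho'                 // quasi-differenced intercept
gen tstar = _n - `rho'*(_n - 1)       // quasi-differenced trend
gen ystar = y - `rho'*L.y             // quasi-differenced series
replace cstar = 1 in 1                // first observation enters in levels
replace tstar = 1 in 1
replace ystar = y in 1
* step 1: GLS estimates of the trend (no constant: cstar plays that role)
regress ystar cstar tstar, noconstant
* step 2: detrend with the GLS estimates, then DF test with no constant
gen yd = y - _b[cstar] - _b[tstar]*_n
dfuller yd, noconstant lags(0)
```

The dfgls command below packages these steps (and the lag selection) automatically.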
The DF-GLS test is available in Stata with the dfgls command. Let's start with 4 lags just to make sure.
. dfgls lcrmaj, maxlag(4)
DF-GLS for lcrmaj                                        Number of obs =    35

               DF-GLS tau      1% Critical    5% Critical    10% Critical
  [lags]     Test Statistic        Value          Value          Value
------------------------------------------------------------------------------
    4            -0.752           -3.770         -3.072         -2.772
    3            -0.527           -3.770         -3.147         -2.843
    2            -0.443           -3.770         -3.215         -2.906
    1            -0.456           -3.770         -3.273         -2.960

Opt Lag (Ng-Perron seq t) =  0 [use maxlag(0)]
Min SC   = -5.028735 at lag  1 with RMSE  .0730984
Min MAIC = -5.161388 at lag  1 with RMSE  .0730984
What if we drop the trend?
. dfgls lcrmaj, maxlag(4) notrend
DF-GLS for lcrmaj                                        Number of obs =    35

               DF-GLS mu       1% Critical    5% Critical    10% Critical
  [lags]     Test Statistic        Value          Value          Value
------------------------------------------------------------------------------
    4            -0.965           -2.636         -2.236         -1.937
    3            -0.800           -2.636         -2.275         -1.975
    2            -0.753           -2.636         -2.312         -2.010
    1            -0.769           -2.636         -2.346         -2.041

Opt Lag (Ng-Perron seq t) =  0 [use maxlag(0)]
Min SC   = -5.056724 at lag  1 with RMSE  .0720825
Min MAIC = -5.166232 at lag  1 with RMSE  .0720825
Note that the SC is the Schwarz criterion (BIC). The MAIC is the modified Akaike information criterion, which behaves much like the BIC. Again, all tests appear to point to a unit root for the crime variable.
I recommend the DF-GLS test for all your unit root testing needs.
Trend stationarity vs. difference stationarity
We know that a regression of one independent random walk on another is likely to result in a spurious regression where the t-ratio is highly significant according to the usual critical values. So we need to know if our variables are random walks or not. Unfortunately, the unit root tests that have been developed so far are not particularly powerful; that is, they cannot distinguish between unit roots and near unit roots that are nevertheless stationary. For example, it would be difficult for an ADF test equation to tell the difference between a_1 = 1 and a_1 = .9, even though the former is a random walk and the latter is stationary. What happens if we get it wrong?
First a bit of terminology: a stationary random variable is said to be “integrated of order zero,” or I(0). A random walk that can be made stationary by taking first differences is said to be “integrated of order one,” or I(1). Most economic time series are either I(0) or I(1). There is the odd time series that is I(2), which means it has to be differenced twice to make it stationary. To avoid spurious regressions, we need the variables in our models to be I(0). If the variable is a random walk, we take first differences; if it is already I(0), it is said to be “stationary in levels,” and we can use it as is.
Consider the following table.
Truth Levels First differences
I(0) OK Negative autocorrelation
I(1) Spurious regression OK
Clearly, if we have a random walk and we take first differences, or we run a regression in levels with stationary variables, we are OK. However, if we make a mistake and run a regression in levels with I(1) variables, we get a spurious regression, which is really bad. On the other hand, if we mess up and take first differences of a stationary variable, it is still stationary. The error term is likely to have negative autocorrelation, but that is a smaller problem. We can simply use Newey-West HAC standard errors and t-ratios. Besides, most economic time series tend to have booms and busts and otherwise behave like stochastic trends rather than deterministic trends. For all these reasons, most time series analysts assume a variable is a stochastic trend unless convinced otherwise.
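The “negative autocorrelation” cell of the table is easy to verify by simulation: if x_t is white noise, then Δx_t = ε_t − ε_{t−1} has a first-order autocorrelation of about −0.5. A sketch with simulated data (rnormal() assumes Stata 10 or later):

```stata
clear
set obs 200
gen t = _n
tsset t
gen x = rnormal()        // a stationary series (white noise)
corrgram D.x, lags(5)    // first autocorrelation should be near -0.5
```

The higher-order autocorrelations of D.x should be near zero, which is why Newey-West standard errors handle this case easily.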
Why unit root tests have nonstandard distributions
To see this, consider the nonaugmented Dickey-Fuller test equation with no intercept,

Δy_t = γ y_{t−1} + ε_t,   γ = a_1 − 1

The OLS estimator of γ is

γ̂ = Σ y_{t−1} Δy_t / Σ y_{t−1}²

Multiply both sides by T:

T γ̂ = [ (1/T) Σ y_{t−1} Δy_t ] / [ (1/T²) Σ y_{t−1}² ]
Look at the numerator. Since y_t = y_{t−1} + Δy_t,

Σ y_t² = Σ (y_{t−1} + Δy_t)² = Σ y_{t−1}² + 2 Σ y_{t−1} Δy_t + Σ (Δy_t)²
Solve for the middle term.

2 Σ y_{t−1} Δy_t = Σ y_t² − Σ y_{t−1}² − Σ (Δy_t)²

Σ y_{t−1} Δy_t = (1/2) ( Σ y_t² − Σ y_{t−1}² − Σ (Δy_t)² )

(1/T) Σ y_{t−1} Δy_t = (1/2) (1/T) ( Σ y_t² − Σ y_{t−1}² − Σ (Δy_t)² )
The first two terms in the parentheses can be written as

Σ_{t=1}^{T} y_t² = Σ_{t=1}^{T−1} y_t² + y_T²

Σ_{t=1}^{T} y_{t−1}² = Σ_{t=0}^{T−1} y_t² = y_0² + Σ_{t=1}^{T−1} y_t²
which means that the difference between these two terms is

Σ_{t=1}^{T} y_t² − Σ_{t=1}^{T} y_{t−1}² = y_T² − y_0² = y_T²

because we can conveniently assume that y_0 = 0.
Therefore,

(1/T) Σ y_{t−1} Δy_t = (1/2) ( y_T²/T − (1/T) Σ (Δy_t)² )
Under the null hypothesis, Δy_t = ε_t, so the second term in the parentheses,

(1/T) Σ (Δy_t)² = (1/T) Σ ε_t²,

has probability limit σ².
Under the assumption that y_0 = 0, the first term contains

y_T/√T = (1/√T) Σ Δy_t = (1/√T) Σ ε_t

Since the sum of T normal variables is normal,

y_T/√T → N(0, σ²)
Therefore,

y_T²/T − (1/T) Σ ε_t² → σ² (Z² − 1)
where Z is a standard normal variable. We know that the square of a standard normal variable is a chi-square variable with one degree of freedom. Therefore the numerator of the OLS estimator of the Dickey-Fuller test equation satisfies

(1/T) Σ y_{t−1} Δy_t → (σ²/2) (χ²_1 − 1)
If y were stationary, the estimator would converge to a normal random variable. Because y is a random walk, the numerator converges to a chi-square variable. Unfortunately, the numerator is the good news. The really bad news is the denominator. The sum of squares (1/T²) Σ y_{t−1}² does not converge to a constant, but remains a random variable even as T goes to infinity. So, the OLS estimator in the Dickey-Fuller test equation converges to a ratio. The numerator is chi-square, while the denominator is the distribution of a random variable. The latter distribution is different for different dependent variables.
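This result can be checked with a small Monte Carlo: tabulate the t-ratio from the no-intercept DF regression on freshly generated random walks and compare its percentiles with the normal ones. A sketch (the program name dfmc and the use of rnormal() are my own choices):

```stata
capture program drop dfmc
program define dfmc, rclass
    drop _all
    set obs 100
    gen t = _n
    tsset t
    gen y = sum(rnormal())          // a fresh random walk each replication
    regress D.y L.y, noconstant     // DF test equation with no intercept
    return scalar t = _b[L.y]/_se[L.y]
end
simulate t=r(t), reps(1000) nodots: dfmc
summarize t, detail
```

The 5th percentile of the simulated t-ratios should land near the tabulated 5 percent critical value of about -1.95, well below the -1.65 a standard normal table would suggest.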
16 ANALYSIS OF
NONSTATIONARY DATA
Consider the following graph of the log of GDP and the log of consumption. The data are quarterly, from 1947:1 to 2004:1. I created the time trend t from 1 to 229 corresponding to each quarter. These data are available in CYR.dta.
. gen t=_n
. tsset t
time variable: t, 1 to 229
Are these series random walks with drift (difference stationary stochastic trends) or are they deterministic trends?
The graph of the prime rate of interest is quite different.
Since the interest rate has to be between zero and one, we assume it cannot have a deterministic trend. It
does have the booms and busts that are characteristic of stochastic trends, so we suspect that it is a random
walk.
We apply unit root tests to these variables below.
. dfgls lny, maxlag(5)
DF-GLS for lny                                           Number of obs =   223

               DF-GLS tau      1% Critical    5% Critical    10% Critical
  [lags]     Test Statistic        Value          Value          Value
------------------------------------------------------------------------------
    5            -1.719           -3.480         -2.893         -2.607
    4            -1.892           -3.480         -2.901         -2.614
    3            -2.137           -3.480         -2.908         -2.620
    2            -2.408           -3.480         -2.914         -2.626
    1            -2.165           -3.480         -2.921         -2.632

Opt Lag (Ng-Perron seq t) =  1 with RMSE  .0094098
Min SC   = -9.283516 at lag  1 with RMSE  .0094098
Min MAIC = -9.289291 at lag  5 with RMSE  .0092561
. dfgls lnc, maxlag(5)
DF-GLS for lnc                                           Number of obs =   223

               DF-GLS tau      1% Critical    5% Critical    10% Critical
  [lags]     Test Statistic        Value          Value          Value
------------------------------------------------------------------------------
    5            -2.042           -3.480         -2.893         -2.607
    4            -2.084           -3.480         -2.901         -2.614
    3            -2.487           -3.480         -2.908         -2.620
    2            -2.482           -3.480         -2.914         -2.626
    1            -1.818           -3.480         -2.921         -2.632

Opt Lag (Ng-Perron seq t) =  4 with RMSE  .0079511
Min SC   = -9.572358 at lag  2 with RMSE  .0080462
Min MAIC = -9.589425 at lag  4 with RMSE  .0079511
. dfgls prime, maxlag(5)
DF-GLS for prime                                         Number of obs =   215

               DF-GLS tau      1% Critical    5% Critical    10% Critical
  [lags]     Test Statistic        Value          Value          Value
------------------------------------------------------------------------------
    5            -2.737           -3.480         -2.895         -2.609
    4            -2.179           -3.480         -2.903         -2.616
    3            -2.069           -3.480         -2.910         -2.623
    2            -1.768           -3.480         -2.917         -2.629
    1            -1.974           -3.480         -2.924         -2.635

Opt Lag (Ng-Perron seq t) =  5 with RMSE  .7771611
Min SC   = -.390588 at lag  1 with RMSE  .8022992
Min MAIC = -.3949452 at lag  2 with RMSE  .8005883
All of these series are apparently random walks. We can transform them to stationary series by taking first differences. First let's summarize the first differences of the logs of these variables; then we will graph the first difference of the log of income.
. summarize D.lnc D.lny D.lnr

    Variable |       Obs        Mean    Std. Dev.        Min         Max
-------------+-----------------------------------------------------------
         lnc |
         D1. |       228     .008795    .0084876    -.029891      .05018
         lny |
         D1. |       228    .0084169    .0100605   -.0275517    .0402255
         lnr |
         D1. |       220    .0031507     .078973   -.3407288     .365536
. twoway (tsline D.lny), yline(.0084)
Note that the series is centered at .0084 (the mean of the differenced series, which is the average rate of
growth, or “drift”). Note also that, if the series wanders too far from the mean, it tends to get drawn back to
the center quickly. This is a typical graph for a stationary series.
Once we transform to stationary variables, we can use the usual regression analyses and significance tests.
. regress D.lnc LD.lnc D.lny D.lnr

      Source |       SS       df       MS              Number of obs =     220
-------------+------------------------------           F(  3,   216) =   46.01
       Model |  .006260983     3  .002086994           Prob > F      =  0.0000
    Residual |  .009798288   216  .000045362           R-squared     =  0.3899
-------------+------------------------------           Adj R-squared =  0.3814
       Total |  .016059271   219   .00007333           Root MSE      =  .00674

------------------------------------------------------------------------------
       D.lnc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         lnc |
         LD. |  -.1957726   .0578761    -3.38   0.001    -.3098469   -.0816984
         lny |
         D1. |   .5741064    .049067    11.70   0.000     .4773951    .6708177
         lnr |
         D1. |  -.0141193   .0059641    -2.37   0.019    -.0258746    -.002364
       _cons |   .0057757   .0007042     8.20   0.000     .0043876    .0071637
------------------------------------------------------------------------------
The change in consumption is a function of its own lagged change, the change in income, and the change in interest rates. The coefficient on income is positive and the coefficient on the interest rate is negative, as expected. The negative coefficient on the lagged change in consumption means that there is significant “reversion to the mean”; that is, if consumption grows rapidly in one quarter, it will tend to grow much less rapidly in the next quarter, and vice versa. Also, in a first difference regression, the intercept corresponds to a linear time trend in the levels. Consumption appears to be growing exogenously at a rate of about one-half percent per quarter.
The problem with a first difference regression is that it is purely short run. Changes in lnc are functions of changes in lny. It doesn't say anything about any long run relationship between consumption and income. And economists typically don't know anything about the short run. All our hypotheses are about equilibria, which presumably occur in the long run. So, it would be very useful to be able to estimate the long run relationship between consumption, income, and the interest rate.
Cointegration
Engle and Granger have developed a method for determining if there is a long run relationship among a set of nonstationary random walks. They noted that if two independent random walks were regressed on each other, the residual would simply be another random walk, a linear combination of the two individual random walks. Imagine a couple of drunks who happen to exit Paul's Deli at the same time. They are walking randomly, and we expect that they will wander off in different directions, so the distance between them (the residual) will probably get bigger and bigger. So, for example, if z(t) and x(t) are independent random walks, then a regression of z on x, z = a + bx + u, yields a residual u = z − a − bx that is just a linear combination of two random walks and is therefore itself a random walk.
However, suppose that the two random walks are related to each other. Suppose y(t) is a random walk, but is in fact a linear function of x(t), another random walk, such that y = α + βx + u. The residual is u = y − α − βx. Now, because x and y are related, there is an attraction between them, so that if u gets very large, there will be a natural tendency for x and y to adjust so that u becomes approximately zero.
Using our example of our two drunks exiting Paul’s Deli, suppose they have become best buddies after a
night of drinking, so they exit the deli arm in arm. Now, even though they are individually walking
randomly, they are linked in such a way that neither can wander too far from the other. If one does lurch,
then the other will also lurch, the distance between them remaining approximately an arm’s length.
Economists suspect that there is a consumption function such that consumption is a linear function of income, usually a constant fraction. Therefore, even though both income and consumption are individually random walks, we expect that they will not wander very far from each other. In fact, the graph of the log of consumption and the log of national income above shows that they stay very close to each other. If income increases dramatically, but consumption doesn't grow at all, the residual will become a large negative number. Assuming that there really is a long run equilibrium relationship between consumption and income, consumption will tend to grow to its long run level consistent with the higher level of income. This implies that the residual will tend to hover around the zero line. Large values of the residual will be drawn back to the zero line. This is the behavior we associate with a stationary variable. So, if there is a long run equilibrium relationship between two (or more) random walks, the residual will be stationary. If there is no long run relationship between the random walks, the residual will be nonstationary.
Two random walks that are individually I(1) can nevertheless generate a stationary I(0) residual if there is a
long run relationship between them. Such a pair of random walks are said to be cointegrated.
So we can test for cointegration (i.e., a long run equilibrium relationship) by regressing consumption on
income and maybe a time trend, and then testing the resulting residual for a unit root. If it has a unit root,
then there is no long run relationship among these series. If we reject the null hypothesis of a unit root in
the residual, we accept the alternative hypothesis of cointegration.
We use the same old augmented Dickey-Fuller test to test for stationarity of the residual. This is the “Engle-Granger two-step” test for cointegration. In the first step we estimate the long run equilibrium relationship. This is just a static regression (no lags) in levels (no first differences) of the log of consumption on the log of income, the log of the interest rate, and a time trend. We will drop the time trend if it is not significant. We then save the residual and run an ADF unit root test on the residual. Because the residual is centered on zero by construction and can have no time trend, we do not include an intercept or a trend term in our ADF regression.
. regress lnc lny lnr t
Source  SS df MS Number of obs = 221
+ F( 3, 217) =85503.23
Model  69.7033367 3 23.2344456 Prob > F = 0.0000
Residual  .058967068 217 .000271738 R-squared = 0.9992
+ Adj R-squared = 0.9991
Total  69.7623038 220 .317101381 Root MSE = .01648

lnc  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
lny  .8053303 .032362 24.89 0.000 .7415463 .8691144
lnr  -.0047292 .003373 -1.40 0.162 -.0113772 .0019189
t  .0021389 .0002607 8.21 0.000 .0016252 .0026527
_cons  .9669527 .2378607 4.07 0.000 .4981397 1.435766

. predict error, resid
(8 missing values generated)
. dfuller error, noconstant lags(3) regress
Augmented Dickey-Fuller test for unit root Number of obs = 217
 Interpolated Dickey-Fuller 
Test 1% Critical 5% Critical 10% Critical
Statistic Value Value Value

Z(t) -4.841 -2.584 -1.950 -1.618

D.error  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
error 
L1  -.1445648 .0298631 -4.84 0.000 -.2034297 -.0856998
LD  -.2037883 .064761 -3.15 0.002 -.3314428 -.0761338
L2D  .2077673 .0657189 3.16 0.002 .0782246 .33731
L3D  .2383771 .0635725 3.75 0.000 .1130652 .3636889

Unfortunately, we can’t use the critical values supplied by dfuller. The reason is that the dfuller test is
based on a standard random walk. We have a different kind of random walk here because it is a residual.
However, Phillips and Ouliaris (“Asymptotic Properties of Residual-Based Tests for Cointegration,”
Econometrica 58 (1), 165–193) provide critical values for the Engle-Granger cointegration test.
Critical values for the Engle-Granger ADF cointegration test statistic
Intercept and trend included
NUMBER OF EXPLANATORY VARIABLES
IN STATIC REGRESSION MODEL
(EXCLUDING INTERCEPT AND TREND)
10% 5% 1%
1 -3.52 -3.80 -4.36
2 -3.84 -4.16 -4.65
3 -4.20 -4.49 -5.04
4 -4.46 -4.74 -5.36
5 -4.73 -5.03 -5.58
Critical values for the Engle-Granger ADF cointegration test statistic
Intercept included, no trend
NUMBER OF EXPLANATORY VARIABLES
IN STATIC REGRESSION MODEL
(EXCLUDING INTERCEPT)
10% 5% 1%
1 -3.07 -3.37 -3.96
2 -3.45 -3.77 -4.31
3 -3.83 -4.11 -4.73
4 -4.16 -4.45 -5.07
5 -4.43 -4.71 -5.28
We have an intercept, a trend, and two independent variables in the long run consumption function,
yielding a critical value of -4.16 at the five percent level. Our ADF test statistic is -4.84, which is more
negative than the critical value, so we reject the null hypothesis of a unit root in the residual from the
consumption function and accept the alternative
hypothesis that consumption, income, and interest rates are cointegrated. There is, apparently, a long run
equilibrium relationship linking consumption, income, and the interest rate. Macro economists will be
relieved.
What are the properties of the estimated long run regression? It turns out that the OLS estimates of the
parameters of the static regression are consistent as the number of observations goes to infinity. Actually,
Engle and Granger show that the estimates are “superconsistent” in the sense that they converge to the true
parameter values much faster than the usual consistent estimates. An interesting question arises at this
point. Why isn’t the consumption function, estimated with OLS, inconsistent because of simultaneity? We
know that consumption is a function of national income, but national income is also a function of
consumption (Y=C+I+G). So how can we get consistent estimates with OLS?
The reason is that the explanatory variable is a random walk. We know that the bias and inconsistency
arises because of a correlation between the explanatory variables and the error term in the regression.
However, because each of the explanatory variables in the regression is a random walk, they can wander
off to infinity and have potentially infinite variance. The error term, on the other hand, is stationary, which
means it is centered on zero with a finite variance. A variable that could go to infinity cannot be correlated
with a residual that is stuck on zero. Therefore, OLS estimates of the long run relationship are consistent
even in the presence of simultaneity.
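We can illustrate these ideas with a small simulation. The sketch below is illustrative only (the variable names and parameter values are invented for the example, not taken from the chapter's data sets): it builds a random walk x, ties y to it with a stationary error, and then tests the residual from the static regression.

```stata
* Minimal simulation sketch: two cointegrated I(1) series
clear
set seed 12345
set obs 500
gen t = _n
tsset t
gen e = rnormal()              // stationary error term
gen x = sum(rnormal())         // x is a random walk (cumulative sum of shocks)
gen y = 0.8*x + e              // y is tied to x, so (y - 0.8x) is stationary
regress y x
predict r, resid
dfuller r, noconstant          // compare to the Engle-Granger critical values,
                               // not the ones dfuller reports
```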
Dynamic ordinary least squares
Although the estimated parameters of the EngleGranger first stage regression are consistent, the estimates
are inefficient relative to a simple extension of the regression model. Further, the standard errors and t
ratios are nonstandard, so we can’t use them for hypothesis testing. To fix the efficiency problem, we use
dynamic ordinary least squares (DOLS). To correct the standard errors and t-ratios, we use heteroskedasticity-
and autocorrelation-consistent (HAC) standard errors (i.e., Newey-West standard errors).
DOLS is very easy: we simply add leads and lags of the differences of the integrated variables to the
regression model. (If your static regression has stationary variables mixed in with the nonstationary
integrated variables, just take leads and lags of the differences of the integrated variables, leaving the
stationary variables alone.) It also turns out that you don’t need more than one or two leads or lags to get
the job done. To get the HAC standard errors and t-ratios, we use the “newey” command.
But first we have to generate the leads and lags. We will go with two leads and two lags. We can use the
usual time series operators: F.x for leads (F for forward), L.x for lags, and D.x for differences.
This is still the static long run model; the leads and lags of differences only adjust the estimates. Do not use a lagged dependent variable in the long run model.
. newey lnc lny lnr t FD.lny F2D.lny LD.lny L2D.lny FD.lnr F2D.lnr LD.lnr
L2D.lnr , lag(4)
Regression with Newey-West standard errors Number of obs = 216
maximum lag: 4 F( 11, 204) = 5741.83
Prob > F = 0.0000

Newey-West
lnc  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
lny  .8891959 .0701308 12.68 0.000 .7509218 1.0274
lnr  -.0001377 .0058142 -0.02 0.981 -.0116013 .011325
t  .0014333 .0005705 2.51 0.013 .0003084 .002558
lny 
FD.  .379992 .1678191 2.26 0.025 .0491097 .710874
F2D.  .1907836 .131358 1.45 0.148 -.0682098 .44977
LD.  -.1234908 .109279 -1.13 0.260 -.3389519 .091970
L2D.  -.0499793 .1226579 -0.41 0.684 -.2918191 .191860
lnr 
FD.  -.0030922 .0116686 -0.27 0.791 -.0260987 .019914
F2D.  -.0051308 .0140005 -0.37 0.714 -.0327351 .022473
LD.  -.0154288 .0145519 -1.06 0.290 -.0441202 .013262
L2D.  -.025613 .0133788 -1.91 0.057 -.0519916 .00076
The lag(4) option tells newey that there is autocorrelation of up to order four. The interest rate does not
appear to be significant in the long run consumption function.
Error correction model
We now have a way of estimating short run models with nonstationary time series, by taking first
differences. We also have a way of estimating long run equilibrium relationships (the first step of the
Engle-Granger two-step). Is there any way to combine these techniques? It turns out that the error
correction model does just this.
We derived the error correction model from the autoregressive distributed lag (ADL) model in an earlier
chapter. The simple error correction model, also known as an equilibrium correction model or ECM can be
written as follows.
Δy_t = b_0 Δx_t + a (y_{t-1} − k x_{t-1}) + ε_t
This equation can be interpreted as follows. The first difference captures the short run dynamics. The
remaining part of the equation is called the error correction mechanism (ECM). The term in parentheses is
the difference between the long run equilibrium value of y and the realized value of y in the previous
period. It is equal to zero if there is long run equilibrium. If the system is out of equilibrium, this term will
be either positive or negative. In that case, the coefficient multiplying this value determines how much
of the disequilibrium is eliminated in each period; it is the “speed of adjustment” coefficient. As a result, y
changes from period to period because of short run changes in x, lagged changes in y, and adjustment
to disequilibrium. Therefore, the error correction model incorporates both long run and short run
information.
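To see what the speed of adjustment means numerically, here is a small illustrative sketch. The value a = -0.25 is hypothetical, chosen only for the example: with a = -0.25, one quarter of any remaining disequilibrium is removed each period, so a gap decays geometrically.

```stata
* Hypothetical illustration: the gap shrinks by 25% per period when a = -0.25
clear
set obs 10
gen t = _n
gen gap = 1 in 1                      // start one unit out of equilibrium
replace gap = 0.75*gap[_n-1] in 2/10  // each period keeps 75% of last period's gap
list t gap                            // 1, .75, .5625, .421875, ...
```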
Note that, if x and y are both nonstationary random walks, then their first differences are stationary. If they
are cointegrated, then the residuals from the static regression are also stationary. Thus, all the terms in the
error correction model are stationary, so all the usual tests apply.
Also, if x and y are cointegrated then there is an error correction model explaining their behavior. If there is
an error correction model relating x and y, then they are cointegrated. Further, they must be related in a
Granger causality sense because each variable adjusts to lagged changes in the other through the error
correction mechanism.
This model can be estimated using a two step procedure. In the first step, estimate the static long run
regression (don’t use DOLS). Predict the residual, which measures the disequilibrium. Lag the residual and
use it as an explanatory variable in the first difference regression.
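The two step procedure just described can be sketched with this chapter's variable names. (The exact lag structure is a modeling choice; the version estimated below also adds lagged differences of lnc.)

```stata
* Step 1: static long run regression (levels, no lags)
regress lnc lny lnr t
predict error, resid
* Step 2: first difference regression with the lagged residual
regress D.lnc D.lny D.lnr L.error
```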
For example, consider the consumption function above. We have already estimated the long run model and
predicted the residual. The only thing left to do is to estimate the EC model.
. regress D.lnc LD.lnc L4D.lnc D.lny D.lnr L.error
Source  SS df MS Number of obs = 220
+ F( 5, 214) = 30.99
Model  .006745058 5 .001349012 Prob > F = 0.0000
Residual  .009314213 214 .000043524 R-squared = 0.4200
+ Adj R-squared = 0.4065
Total  .016059271 219 .00007333 Root MSE = .0066

D.lnc  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
lnc 
LD  -.1823329 .0572958 -3.18 0.002 -.2952693 -.0693965
L4D  -.1036533 .0528633 -1.96 0.051 -.2078528 .0005462
lny 
D1  .5949995 .0485122 12.26 0.000 .4993766 .6906224
lnr 
D1  -.0131099 .0059389 -2.21 0.028 -.0248161 -.0014037
error 
L1  -.0734877 .0278959 -2.63 0.009 -.1284737 -.0185017
_cons  .0063855 .0008469 7.54 0.000 .0047162 .0080548

So, consumption changes because of lagged changes in consumption, changes in income, changes in the
interest rate, and lagged disequilibrium. About 7 percent of the disequilibrium is eliminated each quarter.
This is an example only. There is a lot more modeling involved in estimating even a function as simple as a
consumption function. If you were to run an LM test for autocorrelation on the above model (you are
welcome to do so; the data are in CYR.dta), you would find that there is serious autocorrelation, indicating
that the model is not properly specified. You are invited to do a better job.
17 PANEL DATA MODELS
A general model that pools time series and cross section data is
y_it = α_i + λ_t + Σ(k=1 to K) β_ik x_kit + ε_it
where i=1,…, N (number of cross sections, e.g., states); t=1,…, T (number of time periods, e.g., years); and
K = number of explanatory variables. Note that this model gives each state its own intercept. This allows,
for example, New York to have higher crime rates, holding everything else equal, than, say, North Dakota.
The model also allows each year to have its own effect (λ_t). The model therefore also controls for national
events and trends that affect crime across all states, again holding everything else constant. Finally, the
model allows each state to have its own parameter with respect to each of the explanatory variables. This
allows, for example, the crime rate in New York to be affected differently by, say, an increase in the New
York prison population, than the crime rate in North Dakota is affected by an increase in North Dakota’s
prison population.
Different models are derived by making various assumptions concerning the parameters of this model. If
we assume that α_1 = α_2 = … = α_N = α, that λ_1 = λ_2 = … = λ_T = 0, and that
β_1k = β_2k = … = β_Nk = β_k for each k, then we have the OLS model. If we assume that the α_i and λ_t
are not all equal but are fixed numbers (and that the coefficients β_ik are constant across states, i),
then we have the fixed effects (FE) model. This model is also called the least
squares dummy variable (LSDV) model, the covariance model, and the within estimator. If we assume
that the α_i and λ_t are random variables, still assuming that the β_ik are all equal, then we have the random
effects (RE) model, also known as the variance components model or the error components model.
Finally, if we assume the coefficients are constant across time, but allow the α_i and β_ik to vary across
states, and assume that λ_1 = λ_2 = … = λ_T = 0, then we have the random coefficients model.
The Fixed Effects Model
Let’s assume that the coefficients on the explanatory variables, β_k, are constant across states and across
time. The model therefore reduces to
y_it = α_i + λ_t + Σ(k=1 to K) β_k x_kit + ε_it
where the error term ε_it is assumed to consist of independently and identically distributed random
variables with E(ε_it) = 0 and Var(ε_it) = σ². Note that this specification assumes that each state is independent of all other
states (no contemporaneous correlation across error terms), that each time period is independent of all the
others (no autocorrelation), and that the variance of the error term is constant (no heteroskedasticity).
This model can be conveniently estimated using dummy variables. Suppose we create a set of dummy
variables, one for each state: S_1, …, S_N, where S_i = 1 if the observation is from state i and S_i = 0
otherwise. Suppose also that we similarly create a set of year dummy variables (Y_1, …, Y_T). Then we
can estimate the model by regressing the dependent variable on all the explanatory variables x_k, on all the
state dummies, and on all the year dummies using ordinary least squares. This is why the model is also
called the least squares dummy variable model. Under the assumptions outlined above the estimates
produced by this process are unbiased and consistent.
The fixed effects model is the most widely used panel data model. The primary reason for its wide use is that
it corrects for a particular kind of omitted variable bias, known as unobserved heterogeneity. Suppose, for
the sake of illustration, that we have data on just two states, North Dakota and New York, with three years
of data on each state. Let’s also assume that prison deters crime, so that the more criminals in prison, the
lower the crime rate. Finally, let’s assume that both the crime rate and the prison population are quite low in
North Dakota and that both are quite high in New York.
These assumptions are reflected in the following example. The true relationship between crime and prison
is
y_it = α_i − 2 x_it, where α_i = 10 (ND) and α_i = 50 (NY). We set x=1,2,3 for ND and x=10,15,20 for NY.
The data set is panel.graph.dta.
. list
++
 x y state st1 st2 

1.  1 8 1 1 0 
2.  2 6 1 1 0 
3.  3 4 1 1 0 
4.  10 30 2 0 1 
5.  15 20 2 0 1 
6.  20 10 2 0 1 
++
where st1 and st2 are state dummy variables. Here is the scatter diagram with the OLS regression line
superimposed.
Here is the OLS regression line.
. regress y x
Source  SS df MS Number of obs = 6
+ F( 1, 4) = 0.92
Model  93.4893617 1 93.4893617 Prob > F = 0.3929
Residual  408.510638 4 102.12766 R-squared = 0.1862
+ Adj R-squared = -0.0172
Total  502 5 100.4 Root MSE = 10.106

y  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
x  .5531915 .578184 0.96 0.393 -1.052105 2.158488
_cons  8.297872 6.416714 1.29 0.266 -9.517782 26.11353
Since we don’t estimate a separate intercept for each state, the fact that the observations for New York
show both high crime and high prison population while North Dakota has low crime and low prison
population yields a positive correlation between the individual state intercepts and the explanatory variable.
The OLS model does not employ individual state dummy variables, and is therefore guilty of a kind of
omitted variable bias. In this case, the bias is so strong that it results in an incorrect sign on the estimated
coefficient for prison.
Here is the fixed effects model including the state dummy variables (but no intercept because of the dummy
variable trap).
. regress y x st1 st2, noconstant
Source  SS df MS Number of obs = 6
+ F( 3, 3) = .
Model  1516 3 505.333333 Prob > F = .
Residual  0 3 0 Rsquared = 1.0000
+ Adj Rsquared = 1.0000
Total  1516 6 252.666667 Root MSE = 0

y  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
x  -2 . . . . .
st1  10 . . . . .
st2  50 . . . . .

The fixed effects model fits the data perfectly.
If we include the intercept, we are in the dummy variable trap. Stata will drop the last dummy variable,
making the overall intercept equal to the coefficient on the New York dummy. North Dakota’s intercept is
now measured relative to New York’s. I recommend always keeping the overall intercept and dropping
one of the state dummies.
. regress y x st1 st2
Source  SS df MS Number of obs = 6
+ F( 2, 3) = .
Model  502 2 251 Prob > F = .
Residual  0 3 0 Rsquared = 1.0000
+ Adj Rsquared = 1.0000
Total  502 5 100.4 Root MSE = 0

y  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
x  -2 . . . . .
st1  -40 . . . . .
st2  (dropped)
_cons  50 . . . . .

Since different states can be wildly different from each other for a variety of reasons, the assumption of
unobserved heterogeneity across states makes perfect sense. Thus, the fixed effects model is the correct
model for panels of states. The same is true for panels of countries, cities, and counties.
This procedure becomes awkward if there is a large number of cross section or time series observations. If
your sample consists of the 3000 or so counties in the United States over several years, then 3000 dummy
variables is a bit much. Luckily, there is an alternative approach that yields identical estimates. If we
subtract the state means from each observation, we get deviations from the state means. We can then use
ordinary least squares to regress the resulting “absorbed” dependent variable against the K “absorbed”
explanatory variables. This is the covariance estimation method. The parameter estimates will be identical
to the LSDV estimates.
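The demeaning just described can also be done by hand. Here is a minimal sketch, assuming the variables y, x, and state from the toy data set above:

```stata
* Within (covariance) transformation by hand
bysort state: egen ybar = mean(y)   // state means of y
bysort state: egen xbar = mean(x)   // state means of x
gen ydm = y - ybar                  // deviations from state means
gen xdm = x - xbar
regress ydm xdm, noconstant         // slope matches the LSDV estimate of x
```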
The model is estimated with the “areg” (absorbed regression) command in Stata. It yields the same
coefficients as the LSDV method.
. areg y x, absorb(state)
Number of obs = 6
F( 0, 3) = .
Prob > F = .
R-squared = 1.0000
Adj R-squared = 1.0000
Root MSE = 0

y  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
x  -2 . . . . .
_cons  30 . . . . .
+
state  F(1, 3) = . . (2 categories)
One drawback to this approach is that we will not be able to see the separate state intercepts without some
further effort.10 Usually, we are not particularly interested in the numerical value of each of these estimated
coefficients as they are very difficult to interpret. Since each state intercept is a combination of all the
influences on crime that are specific to each state, constant across time and independent of the explanatory
variables, it is difficult to know exactly what the coefficient is capturing. You might think of it as a measure
of the “culture” of each particular state. The fact that New York’s intercept is larger, and perhaps
significantly larger, than North Dakota’s intercept is interesting and perhaps important, but the scale is
unknown. All we know is that if the state intercepts are different from each other and we have included all
the relevant explanatory variables, the intercepts capture the effect of unobserved heterogeneity across
states.
The way to test for the significance of the state and year effects is to conduct a test of their joint
significance as a group, with F-tests corresponding to the hypotheses α_i = α for all i, and
λ_t = 0 for all t. In my experience, the state dummies are always significant. The year dummies are
usually significant. Do not be tempted to use the t-ratios to drop the insignificant dummies and retain the
significant ones. The year and state dummies should be dropped or retained as a group. The reason is that
different parameterizations of the same problem can result in different dummies being omitted. This creates
an additional source of modeling error that is best avoided.
We can use the panel.dta data set to illustrate these procedures. It is a subset of the crimext.dta data set
consisting of the first five states for the years 1977–1997. First let’s create the state and year dummy
variables for the LSDV procedure.
. tabulate state, gen(st)
state 
number  Freq. Percent Cum.
+
1  21 20.00 20.00
2  21 20.00 40.00
3  21 20.00 60.00
4  21 20.00 80.00
5  21 20.00 100.00
+
Total  105 100.00
. tabulate year, gen(yr)
year  Freq. Percent Cum.
+
1977  5 4.76 4.76
1978  5 4.76 9.52
1979  5 4.76 14.29
10 See Judge, et al. (1988), pp. 470–472, for the procedure.
1980  5 4.76 19.05
1981  5 4.76 23.81
1982  5 4.76 28.57
1983  5 4.76 33.33
1984  5 4.76 38.10
1985  5 4.76 42.86
1986  5 4.76 47.62
1987  5 4.76 52.38
1988  5 4.76 57.14
1989  5 4.76 61.90
1990  5 4.76 66.67
1991  5 4.76 71.43
1992  5 4.76 76.19
1993  5 4.76 80.95
1994  5 4.76 85.71
1995  5 4.76 90.48
1996  5 4.76 95.24
1997  5 4.76 100.00
+
Total  105 100.00
. regress lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534 st2-st5 yr2-yr21
Source  SS df MS Number of obs = 105
+ F( 30, 74) = 27.19
Model  5.24465133 30 .174821711 Prob > F = 0.0000
Residual  .475757707 74 .006429158 R-squared = 0.9168
+ Adj R-squared = 0.8831
Total  5.72040904 104 .055003933 Root MSE = .08018

lcrmaj  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
lprison  -.4886959 .1576294 -3.10 0.003 -.8027794 -.1746125
lmetpct  .0850076 .8021396 0.11 0.916 -1.51329 1.683306
lblack  -1.006078 .4363459 -2.31 0.024 -1.875516 -.1366394
lrpcpi  .9622313 .3844 2.50 0.015 .1962975 1.728165
lp1824  -.5760403 .4585362 -1.26 0.213 -1.489694 .337613
lp2534  -1.058662 .6269312 -1.69 0.095 -2.30785 .1905254
st2  -2.174764 .7955754 -2.73 0.008 -3.759982 -.589545
st3  -1.984824 .9657397 -2.06 0.043 -3.909102 -.0605453
st4  -.7511301 .420334 -1.79 0.078 -1.588664 .0864037
st5  -1.214557 .6007863 -2.02 0.047 -2.41165 -.0174643
yr2  .0368809 .0529524 0.70 0.488 -.068629 .1423909
yr3  .1114482 .0549662 2.03 0.046 .0019256 .2209707
yr4  .3081141 .0660811 4.66 0.000 .1764446 .4397836
yr5  .3655603 .0749165 4.88 0.000 .2162859 .5148348
yr6  .3656063 .0808562 4.52 0.000 .2044969 .5267157
yr7  .3213356 .0905921 3.55 0.001 .1408268 .5018443
yr8  .3170379 .1026446 3.09 0.003 .1125141 .5215617
yr9  .3554716 .1176801 3.02 0.003 .1209889 .5899542
yr10  .4264209 .1358064 3.14 0.002 .1558207 .6970211
yr11  .3886578 .1501586 2.59 0.012 .0894603 .6878553
yr12  .3874652 .1672654 2.32 0.023 .0541816 .7207488
yr13  .4056992 .1871167 2.17 0.033 .032861 .7785375
yr14  .4489243 .2073026 2.17 0.034 .0358647 .8619838
yr15  .5139933 .2261039 2.27 0.026 .0634715 .9645151
yr16  .4322245 .2514655 1.72 0.090 -.0688314 .9332805
yr17  .387142 .2775931 1.39 0.167 -.1659743 .9402583
yr18  .3239752 .304477 1.06 0.291 -.2827085 .9306589
yr19  .2460049 .3344635 0.74 0.464 -.4204281 .9124379
yr20  .1592112 .3602613 0.44 0.660 -.558625 .8770475
yr21  .1163704 .3901125 0.30 0.766 -.6609458 .8936866
_cons  7.362542 4.453038 1.65 0.102 -1.51033 16.23541

* test for significance of the state dummies
. testparm st2-st5
( 1) st2 = 0
( 2) st3 = 0
( 3) st4 = 0
( 4) st5 = 0
F( 4, 74) = 6.32
Prob > F = 0.0002
* test for significance of the year dummies
. testparm yr2-yr21
( 1) yr2 = 0
( 2) yr3 = 0
( 3) yr4 = 0
( 4) yr5 = 0
( 5) yr6 = 0
( 6) yr7 = 0
( 7) yr8 = 0
( 8) yr9 = 0
( 9) yr10 = 0
(10) yr11 = 0
(11) yr12 = 0
(12) yr13 = 0
(13) yr14 = 0
(14) yr15 = 0
(15) yr16 = 0
(16) yr17 = 0
(17) yr18 = 0
(18) yr19 = 0
(19) yr20 = 0
(20) yr21 = 0
F( 20, 74) = 4.49
Prob > F = 0.0000
As expected, both sets of dummy variables are highly significant. Let’s do the same regression by
absorbing the state means.
. areg lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534 yr2-yr21,
absorb(state)
Number of obs = 105
F( 26, 74) = 6.12
Prob > F = 0.0000
R-squared = 0.9168
Adj R-squared = 0.8831
Root MSE = .08018

lcrmaj  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
lprison  -.4886959 .1576294 -3.10 0.003 -.8027794 -.1746125
lmetpct  .0850076 .8021396 0.11 0.916 -1.51329 1.683306
lblack  -1.006078 .4363459 -2.31 0.024 -1.875516 -.1366394
lrpcpi  .9622313 .3844 2.50 0.015 .1962975 1.728165
lp1824  -.5760403 .4585362 -1.26 0.213 -1.489694 .337613
lp2534  -1.058662 .6269312 -1.69 0.095 -2.30785 .1905254
yr2  .0368809 .0529524 0.70 0.488 -.068629 .1423909
yr3  .1114482 .0549662 2.03 0.046 .0019256 .2209707
yr4  .3081141 .0660811 4.66 0.000 .1764446 .4397836
yr5  .3655603 .0749165 4.88 0.000 .2162859 .5148348
yr6  .3656063 .0808562 4.52 0.000 .2044969 .5267157
yr7  .3213356 .0905921 3.55 0.001 .1408268 .5018443
yr8  .3170379 .1026446 3.09 0.003 .1125141 .5215617
yr9  .3554716 .1176801 3.02 0.003 .1209889 .5899542
yr10  .4264209 .1358064 3.14 0.002 .1558207 .6970211
yr11  .3886578 .1501586 2.59 0.012 .0894603 .6878553
yr12  .3874652 .1672654 2.32 0.023 .0541816 .7207488
yr13  .4056992 .1871167 2.17 0.033 .032861 .7785375
yr14  .4489243 .2073026 2.17 0.034 .0358647 .8619838
yr15  .5139933 .2261039 2.27 0.026 .0634715 .9645151
yr16  .4322245 .2514655 1.72 0.090 -.0688314 .9332805
yr17  .387142 .2775931 1.39 0.167 -.1659743 .9402583
yr18  .3239752 .304477 1.06 0.291 -.2827085 .9306589
yr19  .2460049 .3344635 0.74 0.464 -.4204281 .9124379
yr20  .1592112 .3602613 0.44 0.660 -.558625 .8770475
yr21  .1163704 .3901125 0.30 0.766 -.6609458 .8936866
_cons  6.137487 4.477591 1.37 0.175 -2.784307 15.05928
+
state  F(4, 74) = 6.315 0.000 (5 categories)
Note that the coefficients are identical to the LSDV method. Also, the Ftest on the state dummies is
automatically produced by areg.
There is yet another fixed effects estimation command in Stata, namely xtreg. However, with xtreg, you
must specify the fixed effects model with the option, fe.
. xtreg lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534 yr2-yr21, fe
Fixed-effects (within) regression Number of obs = 105
Group variable (i): state Number of groups = 5
R-sq: within = 0.6824 Obs per group: min = 21
between = 0.2352 avg = 21.0
overall = 0.2127 max = 21
F(26,74) = 6.12
corr(u_i, Xb) = -0.9696 Prob > F = 0.0000

lcrmaj  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
lprison  -.4886959 .1576294 -3.10 0.003 -.8027794 -.1746125
lmetpct  .0850076 .8021396 0.11 0.916 -1.51329 1.683306
lblack  -1.006078 .4363459 -2.31 0.024 -1.875516 -.1366394
lrpcpi  .9622313 .3844 2.50 0.015 .1962975 1.728165
lp1824  -.5760403 .4585362 -1.26 0.213 -1.489694 .337613
lp2534  -1.058662 .6269312 -1.69 0.095 -2.30785 .1905254
yr2  .0368809 .0529524 0.70 0.488 -.068629 .1423909
yr3  .1114482 .0549662 2.03 0.046 .0019256 .2209707
yr4  .3081141 .0660811 4.66 0.000 .1764446 .4397836
yr5  .3655603 .0749165 4.88 0.000 .2162859 .5148348
yr6  .3656063 .0808562 4.52 0.000 .2044969 .5267157
yr7  .3213356 .0905921 3.55 0.001 .1408268 .5018443
yr8  .3170379 .1026446 3.09 0.003 .1125141 .5215617
yr9  .3554716 .1176801 3.02 0.003 .1209889 .5899542
yr10  .4264209 .1358064 3.14 0.002 .1558207 .6970211
yr11  .3886578 .1501586 2.59 0.012 .0894603 .6878553
yr12  .3874652 .1672654 2.32 0.023 .0541816 .7207488
yr13  .4056992 .1871167 2.17 0.033 .032861 .7785375
yr14  .4489243 .2073026 2.17 0.034 .0358647 .8619838
yr15  .5139933 .2261039 2.27 0.026 .0634715 .9645151
yr16  .4322245 .2514655 1.72 0.090 -.0688314 .9332805
yr17  .387142 .2775931 1.39 0.167 -.1659743 .9402583
yr18  .3239752 .304477 1.06 0.291 -.2827085 .9306589
yr19  .2460049 .3344635 0.74 0.464 -.4204281 .9124379
yr20  .1592112 .3602613 0.44 0.660 -.558625 .8770475
yr21  .1163704 .3901125 0.30 0.766 -.6609458 .8936866
_cons  6.137487 4.477591 1.37 0.175 -2.784307 15.05928
+
sigma_u  .89507954
sigma_e  .08018203
rho  .99203915 (fraction of variance due to u_i)

F test that all u_i=0: F(4, 74) = 6.32 Prob > F = 0.0002
Again, the coefficients are identical and the Ftest is automatic.
Another advantage of the fixed effects model is that, since the estimation procedure is really ordinary least
squares with a bunch of dummy variables, problems of autocorrelation and heteroskedasticity, among
others, can be dealt with relatively simply. Heteroskedasticity can be handled by weighting the regression,
usually with population, or by using heteroskedasticity-consistent standard errors, or both. As we argued in
the section on time series analysis, autocorrelation is best dealt with by adding lagged variables.
I recommend either the regress (LSDV) or areg (covariance) methods because they both allow robust
standard errors and t-ratios, as well as a wide variety of postestimation diagnostics. Heteroskedasticity can
be detected with the Breusch-Pagan or White test and corrected using weighting and robust standard
errors. The xtreg command does not allow robust standard errors.
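Following that recommendation, robust standard errors are just the robust option on either command; here is a sketch using this chapter's variable names:

```stata
* LSDV with heteroskedasticity-robust standard errors
regress lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534 st2-st5 yr2-yr21, robust
* Covariance (absorbed) version with the same correction
areg lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534 yr2-yr21, absorb(state) robust
```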
Time series issues
Since we are combining cross-section and time-series data, we can expect to run into both cross-section and
time-series problems. We saw above that one major cross-section problem, heteroskedasticity, is routinely
handled with this model. Further, our model of choice, usually the fixed effects model, ignores all
variation between states and is therefore a pure time series model. As a result, we have to address all the
usual time-series issues.
The first time series topic that we need to deal with is the possibility of deterministic trends in the data.
Linear trends
One suggestion for panel data studies is to include an overall trend, as well as the year dummies. The trend
captures the effect of omitted, slowly changing, time-related variables. Since we never know if we have
included all the relevant variables, I recommend always including a trend in any time series model. If it
turns out to be insignificant, then it can be dropped. The year dummies do not capture the same effects as
the overall trend. The year dummies capture the effects of common factors that affect the dependent
variable over all the states at the same time. For example, an amazingly effective federal law could reduce
crime uniformly in all the states in the years after its passage. This would be captured by year dummies, not
a linear trend.
The model becomes
y_it = α_i + λ_t + θ t + Σ(k=1 to K) β_k x_kit + ε_it
A linear trend is simply the numbers 0, 1, 2, …, T. We can create an overall trend by simply subtracting the
first year from all years. For example, if the data set starts with 1977, we can create a linear trend by
subtracting 1977 from each value of year. Note that adding a linear trend will put us in the dummy variable
trap because the trend is simply a linear combination of the year dummies with the coefficients equal to 0,
1, 2, etc. As a result, I recommend deleting one of the year dummies when adding a linear trend.
We can illustrate these facts with the panel.dta data set.
. gen t=year-1977
. list state stnm year t yr1-yr5 if state==1
++
 state stnm year t yr1 yr2 yr3 yr4 yr5 

1.  1 AL 1977 0 1 0 0 0 0 
2.  1 AL 1978 1 0 1 0 0 0 
3.  1 AL 1979 2 0 0 1 0 0 
4.  1 AL 1980 3 0 0 0 1 0 
5.  1 AL 1981 4 0 0 0 0 1 

6.  1 AL 1982 5 0 0 0 0 0 
7.  1 AL 1983 6 0 0 0 0 0 
8.  1 AL 1984 7 0 0 0 0 0 
9.  1 AL 1985 8 0 0 0 0 0 
10.  1 AL 1986 9 0 0 0 0 0 

11.  1 AL 1987 10 0 0 0 0 0 
12.  1 AL 1988 11 0 0 0 0 0 
13.  1 AL 1989 12 0 0 0 0 0 
14.  1 AL 1990 13 0 0 0 0 0 
15.  1 AL 1991 14 0 0 0 0 0 

16.  1 AL 1992 15 0 0 0 0 0 
17.  1 AL 1993 16 0 0 0 0 0 
18.  1 AL 1994 17 0 0 0 0 0 
19.  1 AL 1995 18 0 0 0 0 0 
20.  1 AL 1996 19 0 0 0 0 0 

21.  1 AL 1997 20 0 0 0 0 0 
++
Now let’s estimate the corresponding fixed effects model.
. areg lcrmaj lprison lmetpct lblack lp1824 lp2534 t yr2-yr20 , absorb(state)
Number of obs = 105
F( 25, 75) = 5.71
Prob > F = 0.0000
R-squared = 0.9098
Adj R-squared = 0.8749
Root MSE = .08295

lcrmaj  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
lprison  -.6904558 .1401393 -4.93 0.000 -.9696276 -.4112841
lmetpct  .2764064 .8260439 0.33 0.739 -1.369157 1.92197
lblack  -.9933732 .4513743 -2.20 0.031 -1.892557 -.0941895
lp1824  .0528402 .3968248 0.13 0.894 -.7376754 .8433558
lp2534  -.1393306 .5256308 -0.27 0.792 -1.186441 .9077796
t  .0431765 .012993 3.32 0.001 .0172932 .0690598
yr2  .0135131 .0542071 0.25 0.804 -.0944729 .1214992
yr3  .0296471 .0611756 0.48 0.629 -.092221 .1515151
yr4  .1321815 .0773008 1.71 0.091 -.0218097 .2861727
yr5  .1534079 .0914021 1.68 0.097 -.0286743 .3354901
yr6  .1403835 .0946307 1.48 0.142 -.0481306 .3288975
yr7  .1052747 .099064 1.06 0.291 -.0920709 .3026204
yr8  .1169556 .0997961 1.17 0.245 -.0818484 .3157595
yr9  .1689011 .0983383 1.72 0.090 -.0269988 .364801
yr10  .2511994 .0968453 2.59 0.011 .0582736 .4441253
yr11  .1998446 .0940428 2.13 0.037 .0125017 .3871874
yr12  .195804 .093125 2.10 0.039 .0102895 .3813186
yr13  .2179579 .0911012 2.39 0.019 .0364749 .3994409
yr14  .2592705 .0842792 3.08 0.003 .0913777 .4271632
yr15  .3172979 .0757261 4.19 0.000 .1664438 .468152
yr16  .2495013 .068577 3.64 0.001 .1128888 .3861138
yr17  .2169851 .0623196 3.48 0.001 .092838 .3411321
yr18  .1670167 .0573777 2.91 0.005 .0527144 .281319
yr19  .1087532 .0535365 2.03 0.046 .002103 .2154034
yr20  .0263492 .0520785 0.51 0.614 .0773964 .1300948
_cons  10.67773 4.235064 2.52 0.014 2.241046 19.11441
+
state  F(4, 75) = 4.585 0.002 (5 categories)
The trend is significant, and should be retained.
Another suggestion is that each state should have its own trend term as well as its own intercept. The model
becomes

Y_it = α_i + δ_i t + Σ_{k=1}^{K} β_k X_kit + ε_it
We can create these state trends by multiplying the overall trend by the state dummies.
. gen tr1=t*st1
. gen tr2=t*st2
. gen tr3=t*st3
. gen tr4=t*st4
. gen tr5=t*st5
Let’s look at a couple of these series.
. list state stnm year t st1 tr1 st2 tr2 if state <=2
++
 state stnm year t st1 tr1 st2 tr2 

1.  1 AL 1977 0 1 0 0 0 
2.  1 AL 1978 1 1 1 0 0 
3.  1 AL 1979 2 1 2 0 0 
4.  1 AL 1980 3 1 3 0 0 
5.  1 AL 1981 4 1 4 0 0 

6.  1 AL 1982 5 1 5 0 0 
7.  1 AL 1983 6 1 6 0 0 
8.  1 AL 1984 7 1 7 0 0 
9.  1 AL 1985 8 1 8 0 0 
10.  1 AL 1986 9 1 9 0 0 

11.  1 AL 1987 10 1 10 0 0 
12.  1 AL 1988 11 1 11 0 0 
13.  1 AL 1989 12 1 12 0 0 
14.  1 AL 1990 13 1 13 0 0 
15.  1 AL 1991 14 1 14 0 0 

16.  1 AL 1992 15 1 15 0 0 
17.  1 AL 1993 16 1 16 0 0 
18.  1 AL 1994 17 1 17 0 0 
19.  1 AL 1995 18 1 18 0 0 
20.  1 AL 1996 19 1 19 0 0 

21.  1 AL 1997 20 1 20 0 0 
22.  2 AK 1977 0 0 0 1 0 
23.  2 AK 1978 1 0 0 1 1 
24.  2 AK 1979 2 0 0 1 2 
25.  2 AK 1980 3 0 0 1 3 

26.  2 AK 1981 4 0 0 1 4 
27.  2 AK 1982 5 0 0 1 5 
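The interaction logic behind `gen tr1=t*st1` is just elementwise multiplication; a small Python sketch with made-up two-state data shows the trend switching on only within each state:

```python
# Made-up two-state panel (state, year), five years per state.
rows = [(s, yr) for s in (1, 2) for yr in range(1977, 1982)]

table = []
for state, year in rows:
    t = year - 1977
    st1, st2 = int(state == 1), int(state == 2)
    # tr1 = t*st1, tr2 = t*st2: each state trend is the overall trend
    # multiplied by that state's dummy, zero for all other states.
    table.append((state, year, t, t * st1, t * st2))
```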
Now let’s estimate the implied fixed effects model.
. areg lcrmaj lprison lmetpct lblack lp1824 lp2534 tr1-tr5 yr2-yr20, absorb(state)
Number of obs = 105
F( 29, 71) = 9.04
Prob > F = 0.0000
R-squared = 0.9442
Adj R-squared = 0.9183
Root MSE = .06706

lcrmaj  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
lprison  .4922259 .1500995 3.28 0.002 .7915157 .192936
lmetpct  6.408965 3.265479 1.96 0.054 .1022141 12.92014
lblack  .1648299 1.133181 0.15 0.885 2.094669 2.424329
lp1824  .0461248 .3258571 0.14 0.888 .6958654 .6036158
lp2534  .4839899 .4826101 1.00 0.319 1.446287 .4783073
tr1  .0051039 .017904 0.29 0.776 .0408036 .0305957
tr2  .033832 .0311964 1.08 0.282 .028372 .0960359
tr3  .0140482 .016859 0.83 0.407 .0476641 .0195677
tr4  .0214053 .0127019 1.69 0.096 .0039216 .0467323
tr5  .0197035 .0137888 1.43 0.157 .0077906 .0471976
yr2  .0292319 .0449536 0.65 0.518 .060403 .1188667
yr3  .0680641 .0533424 1.28 0.206 .0382976 .1744258
yr4  .1931018 .0702593 2.75 0.008 .0530088 .3331948
yr5  .2151175 .0845614 2.54 0.013 .0465069 .383728
yr6  .1876612 .0899632 2.09 0.041 .0082798 .3670427
yr7  .1441216 .0959832 1.50 0.138 .0472634 .3355067
yr8  .151261 .097835 1.55 0.127 .0438165 .3463386
yr9  .1996939 .0973446 2.05 0.044 .0055943 .3937936
yr10  .2782618 .0968381 2.87 0.005 .085172 .4713515
yr11  .2211901 .0952123 2.32 0.023 .0313422 .411038
yr12  .2153056 .0944229 2.28 0.026 .0270317 .4035796
yr13  .2356666 .0920309 2.56 0.013 .0521622 .4191709
yr14  .2714963 .0856649 3.17 0.002 .1006853 .4423072
yr15  .3271032 .0761548 4.30 0.000 .1752548 .4789515
yr16  .259055 .0667641 3.88 0.000 .1259312 .3921789
yr17  .2258548 .0579972 3.89 0.000 .1102116 .341498
yr18  .1751649 .0509189 3.44 0.001 .0736355 .2766942
yr19  .1152277 .045183 2.55 0.013 .0251353 .20532
yr20  .0302166 .0425423 0.71 0.480 .0546102 .1150435
_cons  17.61505 13.88093 1.27 0.209 45.29283 10.06274
+
state  F(4, 71) = 1.246 0.299 (5 categories)
. testparm tr1-tr5
( 1) tr1 = 0
( 2) tr2 = 0
( 3) tr3 = 0
( 4) tr4 = 0
( 5) tr5 = 0
F( 5, 71) = 12.13
Prob > F = 0.0000
Note that we have to drop the overall trend if we include all the state trends because of collinearity. Also,
the individual trends are significant as a group and should be retained. We could also test whether the state
trends all have the same coefficient and could be combined into a single, overall trend.
. test tr1=tr2=tr3=tr4=tr5
( 1) tr1 - tr2 = 0
( 2) tr1 - tr3 = 0
( 3) tr1 - tr4 = 0
( 4) tr1 - tr5 = 0
F( 4, 71) = 10.94
Prob > F = 0.0000
Apparently the trends are significantly different across states and cannot be combined into a single overall
trend.
Estimating fixed effects models with individual state trends is becoming common practice.
Unit roots and panel data
With respect to the time series issues, we need to know whether each individual time series has a unit root.
For example, pooling states across years means that each state forms a time series of annual data. We need
to know if the typical state series has a unit root. There are panel data unit root tests, but none are officially
available in Stata. It is possible to download user-written Stata modules; however, you cannot use these
modules on the server.
Even if you had easily available panel unit root tests, what would you do with the information generated by
such tests? If you have unit roots, then you will want to estimate a short run model with first differences,
estimate a long run equilibrium model, or estimate an error correction model. If you do not have unit roots,
you probably want to estimate a model in levels. Because unit root tests are not particularly powerful, it is
probably a good idea to estimate both a short run model in first differences and a long run model in levels,
and hope that the main results are consistent across both specifications.
Pooled time series and cross section panel models are ideal for estimating the long run average
relationships among nonstationary variables. The pooled panel estimator yields consistent estimates with a
normal limit distribution. These results hold in the presence of individual fixed effects. The coefficients are
analogous to the population (not sample) regression coefficients in conventional regressions.[11] Thus, the
static fixed effects model estimated above can be interpreted as the long run average crime equation (for the
five states in the sample).
It turns out that another way to estimate the fixed effects model is to take first differences. The first
differencing “sweeps out” the state dummies because a state dummy is constant across time, so that the
value of the dummy (=1) in year 2 is the same as its value (=1) in year 1, so the difference is zero. The year
dummies are also altered. The dummy for year 2 is equal to one for year 2, zero otherwise. Thus, the first
difference is year2 - year1 = 1 - 0 = 1. However, the difference year3 - year2 = 0 - 1 = -1. All the remaining values are
zero. So the first-differenced year dummy equals zero for the years before the year of the dummy, then 1
for the year in question, then -1 for the year after, then zero again.
Let’s use the panel.dta data set to illustrate this proposition.
. tsset state year
panel variable: state, 1 to 5
time variable: year, 1977 to 1997
. gen dst1=D.st1
(5 missing values generated)
. gen dyr5=D.yr5
(5 missing values generated)
. list state year stnm st1 dst1 yr5 dyr5 if state==1
++
 state year stnm st1 dst1 yr5 dyr5 

1.  1 1977 AL 1 . 0 . 
2.  1 1978 AL 1 0 0 0 
3.  1 1979 AL 1 0 0 0 
4.  1 1980 AL 1 0 0 0 
5.  1 1981 AL 1 0 1 1 

6.  1 1982 AL 1 0 0 1 
7.  1 1983 AL 1 0 0 0 
8.  1 1984 AL 1 0 0 0 
9.  1 1985 AL 1 0 0 0 
10.  1 1986 AL 1 0 0 0 

11.  1 1987 AL 1 0 0 0 
12.  1 1988 AL 1 0 0 0 
13.  1 1989 AL 1 0 0 0 
14.  1 1990 AL 1 0 0 0 
15.  1 1991 AL 1 0 0 0 

16.  1 1992 AL 1 0 0 0 
17.  1 1993 AL 1 0 0 0 
18.  1 1994 AL 1 0 0 0 
19.  1 1995 AL 1 0 0 0 
20.  1 1996 AL 1 0 0 0 

21.  1 1997 AL 1 0 0 0 
++
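The differencing arithmetic shown in the listing can be reproduced in a few lines of Python (hypothetical series, same 1977-1997 span):

```python
def first_diff(series):
    # Stata's D. operator: x_t - x_(t-1), with a missing value (None)
    # for the first observation.
    return [None] + [b - a for a, b in zip(series, series[1:])]

years = list(range(1977, 1998))          # 1977..1997
st1 = [1] * len(years)                   # state dummy: constant over time
yr5 = [int(y == 1981) for y in years]    # dummy for the fifth year

d_st1 = first_diff(st1)  # zeros after the first obs: the dummy is swept out
d_yr5 = first_diff(yr5)  # +1 in 1981, -1 in 1982, zero elsewhere
```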
[11] Phillips, P.C.B. and H.R. Moon, "Linear Regression Limit Theory for Nonstationary Panel Data," Econometrica, 67, 1999, 1057-1112.
Note that if you did include the state dummies in the first difference model, it would be equivalent to estimating a
fixed effects model in levels with individual state trends, because the first difference of a trend is a series of ones.
. gen dt=D.t
(5 missing values generated)
. list state year stnm st1 t dt if state==1
++
 state year stnm st1 t dt 

1.  1 1977 AL 1 0 . 
2.  1 1978 AL 1 1 1 
3.  1 1979 AL 1 2 1 
4.  1 1980 AL 1 3 1 
5.  1 1981 AL 1 4 1 

6.  1 1982 AL 1 5 1 
7.  1 1983 AL 1 6 1 
8.  1 1984 AL 1 7 1 
9.  1 1985 AL 1 8 1 
10.  1 1986 AL 1 9 1 

11.  1 1987 AL 1 10 1 
12.  1 1988 AL 1 11 1 
13.  1 1989 AL 1 12 1 
14.  1 1990 AL 1 13 1 
15.  1 1991 AL 1 14 1 

16.  1 1992 AL 1 15 1 
17.  1 1993 AL 1 16 1 
18.  1 1994 AL 1 17 1 
19.  1 1995 AL 1 18 1 
20.  1 1996 AL 1 19 1 

21.  1 1997 AL 1 20 1 
++
Since we can estimate a fixed effects model using either levels and state dummies (absorbed or not) or
using first differences with no state dummies, shouldn’t we get the same answer either way? The short
answer is “no.” The estimates will not be identical for several reasons. If the variables are stationary, then
the first difference approach, when it subtracts last year’s value, also subtracts last year’s measurement
error, if there is any. This is likely to leave a smaller measurement error than if we subtract off the mean,
which is what we do if we use state dummies or absorb the state dummies. Thus, if there is any
measurement error we could get different parameter estimates by estimating in levels and then estimating in
first differences.
The second reason is that, if the variables are nonstationary, and not cointegrated, then taking first
differences usually creates stationary variables, which can be expected to have different parameter values.
If they are nonstationary, but cointegrated, the static model in levels is the long run equilibrium model,
which can be expected to be somewhat different from the short run model. Finally, we lose one year of data
when we take first differences, so the sample is not the same.
Let’s estimate the fixed effects model above using first differences. We use the “D.” operator to generate
first differenced series. Don’t forget to “tsset state year” to tell Stata that it is a panel data set sorted by state
and year.
. regress dlcrmaj dlprison dlmetpct dlblack dlrpcpi dlp1824 dlp2534 dyr2-dyr21
Source  SS df MS Number of obs = 100
+ F( 25, 74) = 3.55
Model  .267228983 25 .010689159 Prob > F = 0.0000
Residual  .22282077 74 .003011091 R-squared = 0.5453
+ Adj R-squared = 0.3917
Total  .490049753 99 .004949998 Root MSE = .05487

dlcrmaj  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
dlprison  .2317252 .1700677 1.36 0.177 .5705926 .1071422
dlmetpct  1.631629 2.166495 0.75 0.454 2.685207 5.948465
dlblack  .4023363 .8714137 0.46 0.646 2.138666 1.333993
dlrpcpi  .1796915 .3049804 0.59 0.558 .4279951 .7873781
dlp1824  .9338755 .4775285 1.96 0.054 1.885372 .0176207
dlp2534  .8105992 .7116841 1.14 0.258 2.228661 .6074623
dyr2  .0442992 .0332805 1.33 0.187 .0220136 .1106119
dyr3  .0971729 .0595598 1.63 0.107 .0215027 .2158485
dyr4  .2581997 .0972473 2.66 0.010 .0644303 .4519692
dyr5  .2852936 .1208029 2.36 0.021 .0445887 .5259986
dyr6  .2441781 .1275799 1.91 0.059 .0100303 .4983865
dyr7  .1863416 .1329834 1.40 0.165 .0786336 .4513169
dyr8  .1780758 .1339427 1.33 0.188 .088811 .4449625
dyr9  .2112703 .1327145 1.59 0.116 .0531691 .4757097
dyr10  .2753427 .1314994 2.09 0.040 .0133244 .537361
dyr11  .2083532 .1298586 1.60 0.113 .0503956 .467102
dyr12  .1976981 .1288571 1.53 0.129 .0590553 .4544515
dyr13  .2160628 .1250786 1.73 0.088 .0331617 .4652874
dyr14  .2431188 .1160254 2.10 0.040 .0119332 .4743043
dyr15  .2966003 .1038212 2.86 0.006 .0897319 .5034686
dyr16  .2321504 .089506 2.59 0.011 .0538057 .4104951
dyr17  .2068382 .0745739 2.77 0.007 .0582465 .35543
dyr18  .1602132 .0600055 2.67 0.009 .0406496 .2797767
dyr19  .1017896 .0437182 2.33 0.023 .0146792 .1888999
dyr20  .0210332 .0283287 0.74 0.460 .035413 .0774795
dyr21  (dropped)
_cons  .014203 .0211906 0.67 0.505 .0564261 .0280202

We know that the constant term captures the effect of an overall linear trend. If we want to replace the
overall trend with individual state trends, we simply add the state dummies back into the model.
These results are different from the fixed effects model estimated in levels. So, I recommend that you
estimate the model in levels, with no lags, and call that the long run regression. Then estimate the model in
first differences, with as many lags as might be necessary to generate residuals without any serial
correlation (use the LM test with saved residuals). Call that the short run dynamic model.
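The LM test invoked here is just an auxiliary regression of the residual on its own lag (plus the regressors). As a sanity check outside Stata, the toy Python sketch below uses a made-up residual series with autocorrelation 0.8 built in exactly, and shows the auxiliary slope recovering it:

```python
# Made-up residuals constructed so that e_t = 0.8 * e_(t-1) exactly.
e = [1.0]
for _ in range(29):
    e.append(0.8 * e[-1])

# Auxiliary regression of e_t on its lag e_(t-1); the OLS slope is the
# estimated autocorrelation coefficient rho.
y, x = e[1:], e[:-1]
mx, my = sum(x) / len(x), sum(y) / len(y)
rho = (sum((a - mx) * (b - my) for a, b in zip(x, y))
       / sum((a - mx) ** 2 for a in x))
# A t-test (or F-test) that rho = 0 is the LM test for AR(1) residuals.
```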
Let’s see if we have autocorrelation in the short run model above.
. predict e, resid
(5 missing values generated)
. gen e_1=L.e
(10 missing values generated)
. regress e e_1 dlprison dlmetpct dlblack dlrpcpi dlp1824 dlp2534 dyr2-dyr21
Source  SS df MS Number of obs = 95
+ F( 25, 69) = 0.22
Model  .015042094 25 .000601684 Prob > F = 1.0000
Residual  .19261245 69 .002791485 R-squared = 0.0724
+ Adj R-squared = -0.2636
Total  .207654544 94 .002209091 Root MSE = .05283

e  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
e_1  .2340898 .1181197 1.98 0.051 .0015526 .4697323
dlprison  .0796463 .1829466 0.44 0.665 .4446148 .2853223
dlmetpct  .8967098 2.120876 0.42 0.674 5.127742 3.334322
dlblack  .6180148 .8976362 0.69 0.493 2.40875 1.172721
dlrpcpi  .2284817 .3190218 0.72 0.476 .4079494 .8649128
dlp1824  .1962703 .4761782 0.41 0.681 1.14622 .7536792
dlp2534  .1558151 .72768 0.21 0.831 1.607497 1.295867
dyr2  (dropped)
dyr3  .0056397 .0377379 0.15 0.882 .0696455 .0809248
dyr4  .0226855 .0787475 0.29 0.774 .1344115 .1797825
dyr5  .0342735 .104947 0.33 0.745 .17509 .2436371
dyr6  .0459182 .1152692 0.40 0.692 .1840377 .275874
dyr7  .0486212 .1227 0.40 0.693 .1961586 .293401
dyr8  .0458445 .1247895 0.37 0.714 .2031038 .2947929
dyr9  .0422207 .1245735 0.34 0.736 .2062966 .2907381
dyr10  .0392704 .1244473 0.32 0.753 .2089952 .2875359
dyr11  .0431052 .1250605 0.34 0.731 .2063837 .2925941
dyr12  .0427233 .1252572 0.34 0.734 .2071579 .2926046
dyr13  .0404449 .1222374 0.33 0.742 .2034122 .2843019
dyr14  .0399949 .1146092 0.35 0.728 .1886443 .2686341
dyr15  .0380374 .1033664 0.37 0.714 .1681729 .2442476
dyr16  .030283 .0887014 0.34 0.734 .1466714 .2072374
dyr17  .0234023 .0735115 0.32 0.751 .1232491 .1700537
dyr18  .0163755 .058807 0.28 0.781 .1009412 .1336922
dyr19  .0080579 .0423675 0.19 0.850 .076463 .0925788
dyr20  .0045028 .0274865 0.16 0.870 .0503314 .0593369
dyr21  (dropped)
_cons  .0029871 .0218047 0.14 0.891 .0405121 .0464863

. test e_1
( 1) e_1 = 0
F( 1, 69) = 3.93
Prob > F = 0.0515
Looks like we have autocorrelation. Let’s add a lagged dependent variable.
. regress dlcrmaj L.dlcrmaj dlprison dlmetpct dlblack dlrpcpi dlp1824 dlp2534 dyr2-dyr21
Source  SS df MS Number of obs = 95
+ F( 25, 69) = 4.12
Model  .285675804 25 .011427032 Prob > F = 0.0000
Residual  .191239715 69 .00277159 R-squared = 0.5990
+ Adj R-squared = 0.4537
Total  .476915519 94 .005073569 Root MSE = .05265

dlcrmaj  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
dlcrmaj 
L1  .2414129 .1144276 2.11 0.039 .013136 .4696897
dlprison  .275926 .1823793 1.51 0.135 .6397627 .0879108
dlmetpct  .5640736 2.123772 0.27 0.791 3.672735 4.800882
dlblack  .9904829 .8941283 1.11 0.272 2.77422 .7932542
dlrpcpi  .3920387 .318366 1.23 0.222 .243084 1.027161
dlp1824  1.14118 .4742399 2.41 0.019 2.087263 .1950977
dlp2534  .9256109 .7272952 1.27 0.207 2.376525 .5253034
dyr2  (dropped)
dyr3  .05788 .0377758 1.53 0.130 .0174806 .1332407
dyr4  .2300886 .0792884 2.90 0.005 .0719126 .3882646
dyr5  .2429812 .1087098 2.24 0.029 .0261112 .4598513
dyr6  .2159285 .1187687 1.82 0.073 .0210087 .4528656
dyr7  .1775044 .1243088 1.43 0.158 .0704849 .4254938
dyr8  .1855627 .1247856 1.49 0.142 .0633777 .4345031
dyr9  .2184576 .1244927 1.75 0.084 .0298986 .4668138
dyr10  .271181 .1253369 2.16 0.034 .0211407 .5212212
dyr11  .1919514 .1280677 1.50 0.138 .0635368 .4474395
dyr12  .1984084 .1264836 1.57 0.121 .0539195 .4507363
dyr13  .2187841 .1232808 1.77 0.080 .0271544 .4647226
dyr14  .2418713 .1163528 2.08 0.041 .0097539 .4739887
dyr15  .2866917 .1063984 2.69 0.009 .0744327 .4989507
dyr16  .2004976 .0946694 2.12 0.038 .0116374 .3893579
dyr17  .1826651 .077764 2.35 0.022 .0275302 .3378001
dyr18  .1345287 .0629234 2.14 0.036 .0089999 .2600575
dyr19  .078249 .045497 1.72 0.090 .012515 .1690131
dyr20  .0059974 .029325 0.20 0.839 .0525043 .0644992
dyr21  (dropped)
_cons  .0137706 .0216976 0.63 0.528 .0570561 .0295148

The lagged dependent variable is highly significant, indicating an omitted time-dependent variable. Let's
rerun the LM test and see if we still have autocorrelation.
. drop e
. predict e, resid
(10 missing values generated)
. regress e e_1 L.dlcrmaj dlprison dlmetpct dlblack dlrpcpi dlp1824 dlp2534 dyr2-dyr21
Source  SS df MS Number of obs = 95
+ F( 26, 68) = 0.00
Model  .000021653 26 8.3281e07 Prob > F = 1.0000
Residual  .191218062 68 .00281203 R-squared = 0.0001
+ Adj R-squared = -0.3822
Total  .191239715 94 .002034465 Root MSE = .05303

e  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
e_1  .0351238 .4002693 0.09 0.930 .8338488 .7636012
dlcrmaj 
L1  .0326156 .3891471 0.08 0.933 .7439154 .8091466
dlprison  .005018 .1923999 0.03 0.979 .3789099 .388946
dlmetpct  .0134529 2.144696 0.01 0.995 4.293127 4.266221
dlblack  .0047841 .9022764 0.01 0.996 1.79568 1.805249
dlrpcpi  .0015706 .3211793 0.00 0.996 .642474 .6393327
dlp1824  .0021095 .4782917 0.00 0.996 .9565257 .9523067
dlp2534  .0017255 .7328458 0.00 0.998 1.460646 1.464097
dyr2  (dropped)
dyr3  .0002407 .0381491 0.01 0.995 .0763661 .0758847
dyr4  .0011592 .08095 0.01 0.989 .1626924 .1603739
dyr5  .0048292 .1225515 0.04 0.969 .2493768 .2397185
dyr6  .0047852 .1314744 0.04 0.971 .2671382 .2575679
dyr7  .0028128 .1292503 0.02 0.983 .2607277 .2551021
dyr8  .000533 .1258393 0.00 0.997 .2516413 .2505754
dyr9  .0003932 .1254777 0.00 0.998 .25078 .2499937
dyr10  .001839 .1279756 0.01 0.989 .2572104 .2535324
dyr11  .0043207 .1380765 0.03 0.975 .2798481 .2712066
dyr12  .0022677 .1299976 0.02 0.986 .261674 .2571385
dyr13  .0020183 .1262891 0.02 0.987 .2540242 .2499876
dyr14  .0028452 .121601 0.02 0.981 .2454962 .2398058
dyr15  .0041166 .1169896 0.04 0.972 .2375657 .2293326
dyr16  .0063943 .1200121 0.05 0.958 .2458748 .2330862
dyr17  .0048495 .0958629 0.05 0.960 .196141 .1864419
dyr18  .0044938 .0814841 0.06 0.956 .1670927 .1581052
dyr19  .0034812 .0606133 0.06 0.954 .1244332 .1174709
dyr20  .0022393 .0390345 0.06 0.954 .0801315 .0756529
dyr21  (dropped)
_cons  .0001079 .0218898 0.00 0.996 .0437884 .0435726

. test e_1
( 1) e_1 = 0
F( 1, 68) = 0.01
Prob > F = 0.9303
The autocorrelation is gone. This is an acceptable short run model.
The only thing you miss under this scheme is the speed of adjustment to disequilibrium. This is beyond our
scope because, so far, there are no good tests for cointegration in panel data.
Estimating first difference models has several advantages. Autocorrelation is seldom a problem (unless
the dynamics are misspecified and you have to add more lags). Also, variables in first
differences are usually almost orthogonal. Because they have very low correlations, omitted variables are
much less of a problem than in regressions in levels: as we know from the omitted variable
theorem, omitting a variable that is uncorrelated with the remaining variables in a regression will not bias
the coefficients on those remaining variables. For the same reason, dropping insignificant variables from a
first difference regression seldom makes a noticeable difference in the estimated coefficients on the
remaining variables.
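The omitted variable theorem appealed to here is easy to verify numerically. In the hypothetical Python example below, x2 has mean zero and is exactly orthogonal to centered x1, so dropping it leaves the coefficient on x1 untouched:

```python
# Hypothetical data: x2 is orthogonal to centered x1.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, -1.0, -1.0, 1.0]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]   # true model: y = 2*x1 + 3*x2

def slope(x, y):
    # Simple (one-regressor) OLS slope with an intercept.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Regress y on x1 alone, omitting x2 entirely: the slope is still 2,
# because the omitted variable is uncorrelated with the included one.
b1 = slope(x1, y)
```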
Clustering
A recent investigation of the effect of so-called "shall-issue" laws on crime was done by Lott and
Mustard.[12] Shall-issue laws are state laws that make it easier for ordinary citizens to acquire concealed weapons
permits. If criminals suspect that more of their potential victims are armed, they may be dissuaded from
attacking at all. In fact, Lott and Mustard found that shall-issue laws do result in lower crime rates. Lott and
Mustard estimated a host of alternative models, but the model most people focus on is a fixed effects model
on all 3000 counties in the United States for the years 1977 to 1993. The target variable was a dummy
variable for those states in those years for which a shall-issue law was in effect.
The clustering issue arises in this case because the shall-issue law pertains to all the counties in each state.
So all the counties in Virginia, which has had a shall-issue law since 1989, get a one for the shall-issue
dummy. There is no within-state variation for the dummy variable across the counties in the years with the
law in effect. The shall-issue dummy is said to be "clustered" within the states. In such cases, OLS
[12] Lott, John and David Mustard, "Crime, Deterrence and Right-to-Carry Concealed Handguns," Journal of Legal Studies, 26, 1997, 1-68.
underestimates the standard errors and overestimates the t-ratios. This is a kind of second-order bias similar
to autocorrelation. Lott has since re-estimated his model correcting for clustering.
If the data set consists of observations on states and we cluster on states, we get standard errors and t-ratios
adjusted for both heteroskedasticity and autocorrelation (because the only observations within states are
across years, i.e., a time series). The standard errors are not consistent as T → ∞, because as T goes to
infinity the number of covariances within each state goes to infinity and the computation cumulates all the
errors. That is why Newey and West give zero weight to long lags, so as to limit the number of
autocorrelations they include in their calculations. However, since T will always be finite, we do not
have an infinite number of autocorrelations within each state, so we might try clustering on states to see if
we can generate heteroskedasticity- and autocorrelation-corrected standard errors.
. areg lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534 yr2-yr21, absorb(state) cluster(state)
Regression with robust standard errors Number of obs = 105
F( 3, 74) = 9.79
Prob > F = 0.0000
R-squared = 0.9168
Adj R-squared = 0.8831
Root MSE = .08018
(standard errors adjusted for clustering on state)

 Robust
lcrmaj  Coef. Std. Err. t P>|t| [95% Conf. Interval]
+
lprison  .4886959 .4061591 1.20 0.233 1.297986 .3205937
lmetpct  .0850076 1.391452 0.06 0.951 2.687522 2.857537
lblack  1.006078 .7206686 1.40 0.167 2.442041 .4298858
lrpcpi  .9622313 .303612 3.17 0.002 .3572712 1.567191
lp1824  .5760403 .1872276 3.08 0.003 .9490994 .2029813
lp2534  1.058662 .5089163 2.08 0.041 2.0727 .0446244
_cons  6.137487 8.352663 0.73 0.465 10.50556 22.78053
+
state  absorbed (5 categories)
The coefficients on the year dummies have been suppressed to save space. The standard errors are much
larger, and the coefficients much less significant. This probably means we had serious autocorrelation in the static
model.
Monte Carlo studies show that standard errors clustered by the crosssection identifier in a fixed effect
panel data regression (e.g., state in a state-year panel) are approximately correct. This is true for any
arbitrary pattern of heteroskedasticity. It is also true for stationary variables with any kind of
autocorrelation. Finally, clustered standard errors are approximately correct for nonstationary random
walks, whether they are independent or cointegrated. In summary, use standard errors clustered at the
cross-section level for any panel data study.
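The mechanics behind clustered standard errors can be sketched for the simplest case: one regressor, no intercept, and made-up data in which residuals are perfectly correlated within clusters. The cluster-robust (Liang-Zeger style) "meat" squares cluster sums of x*e rather than individual terms, so within-cluster correlation inflates the variance:

```python
# Made-up (x, residual) pairs for three clusters; residuals are identical
# within each cluster -- extreme positive within-cluster correlation.
clusters = [
    [(1.0, 0.5), (2.0, 0.5), (1.5, 0.5)],
    [(1.0, -0.4), (0.5, -0.4), (2.0, -0.4)],
    [(1.5, 0.2), (1.0, 0.2), (0.5, 0.2)],
]
sxx = sum(x * x for g in clusters for x, _ in g)

# Heteroskedasticity-robust (White) variance: sum x_i^2 e_i^2 over observations.
v_robust = sum((x * e) ** 2 for g in clusters for x, e in g) / sxx ** 2

# Cluster-robust variance: square the *cluster sums* of x_i e_i, so the
# within-cluster error covariances are counted as well.
v_cluster = sum(sum(x * e for x, e in g) ** 2 for g in clusters) / sxx ** 2

# With positive within-cluster correlation, v_cluster > v_robust:
# ignoring clustering would understate the standard errors.
```

This is only a sketch of the sandwich idea, not the full formula Stata uses (which includes degrees-of-freedom corrections).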
Other Panel Data Models
While the fixed effects model is by far the most widely used panel data model, there are others that have
been used in the past and are occasionally used even today. But first, a digression.
Digression: the between estimator[13]
Consider another very simple model, called the “between” model. In this approach we simply compute
averages of each state over time to create a single cross section of overall state means.
Ȳ_i. = α + Σ_{k=1}^{K} β_k X̄_k,i. + ε̄_i.

where Ȳ_i. is the mean for state i and the X̄_k,i. are the corresponding state means for the independent variables.
Note that the period indicates that the means have been taken over time, so the time subscript is
suppressed. This model is consistent if the pooled (OLS) estimator is consistent. It has the property that it is
robust to measurement error in the explanatory variables and is sometimes used to alleviate such problems.
The compression of the data into averages results in the discarding of potentially useful information. For
this reason the model is not efficient. The model ignores all variation that occurs within the state (across
time) and utilizes only variation across (between) states. By contrast, the fixed effects model can be
estimated by subtracting off the state means. Therefore, the fixed effect model uses only the information
discarded by the between estimator. It completely ignores the cross section information utilized by the
between estimator. For that reason, the fixed effects model is not efficient. The fixed effects model is a pure
time series model. It utilizes only the time series variation for each state. This is the reason that the FE
model is also known as the “within” model.
Because the between model uses only the cross section information and the within model uses only the time
series information, the two models can be combined. The OLS model is a weighted sum of the between and
within estimators.
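The between/within split described above is just the standard decomposition of total variation, which a quick Python check with made-up panel data confirms:

```python
# Made-up balanced panel: one variable for three states, four years each.
panel = {
    "AL": [1.0, 2.0, 3.0, 4.0],
    "AK": [5.0, 5.5, 6.0, 6.5],
    "VA": [2.0, 2.0, 4.0, 4.0],
}
allx = [v for xs in panel.values() for v in xs]
grand = sum(allx) / len(allx)

# Within variation: deviations from each state's own mean (used by FE).
within = sum((v - sum(xs) / len(xs)) ** 2
             for xs in panel.values() for v in xs)
# Between variation: state means around the grand mean (used by the
# between estimator).
between = sum(len(xs) * (sum(xs) / len(xs) - grand) ** 2
              for xs in panel.values())
# Total variation (what pooled OLS works with) is exactly their sum.
total = sum((v - grand) ** 2 for v in allx)
```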
The Random Effects Model
The main competitor to the fixed effects model is the random effects model. Suppose we assume that the
individual state intercepts are independent random variables with mean α and variance σ_α². This means
that we are treating the individual state intercepts as a sample from a larger population of intercepts and we
are trying to infer the true population values from the estimated coefficients. The model can be summarized
as follows.
Y_it = Σ_{k=1}^{K} β_k X_kit + u_it

where u_it = ε_it + α_i. Note that the intercepts are now in the error term and are omitted from the model.
[13] See Johnston, J. and J. DiNardo, Econometric Methods, 4th Edition, McGraw-Hill, 1997, pp. 388-411, for an excellent discussion of panel data issues.
The random effects model can be derived as a feasible generalized least squares (FGLS) estimator applied
to the panel data set. While the FE model subtracts off the group means from each observation, the RE
model subtracts off a fraction of the group means, the fraction depending on the variance of the state
dummies, the variance of the overall error term, and the number of time periods. The random effects model
is efficient relative to the fixed effect model (because the latter ignores between state variation). The
random effects model has the same problem as the OLS estimator, namely, it is biased in the presence of
unobserved heterogeneity because it omits the individual state intercepts. However, the random effects
model is consistent as N (the number of cross sections, e.g., states) goes to infinity (T constant).
The random effects model can also be considered a weighted average of the between and within estimators.
The weight depends on the same variances. If the weight is zero then the RE model collapses to the pooled
OLS model. If the weight equals one, the RE model becomes the fixed effects model. Also, as the number
of time periods, T, goes to infinity, the RE model approaches the FE model asymptotically.
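One standard way to write the RE quasi-demeaning fraction makes both limiting claims concrete. The formula below is the textbook GLS weight, stated here as an assumption rather than taken from the text:

```python
import math

def theta(sigma_e2, sigma_a2, T):
    # Fraction of the group mean subtracted by the RE (quasi-demeaning)
    # transform. Standard textbook formula, assumed for illustration:
    # theta = 1 - sqrt(sigma_e^2 / (sigma_e^2 + T * sigma_a^2))
    return 1.0 - math.sqrt(sigma_e2 / (sigma_e2 + T * sigma_a2))

# theta = 0 when there is no state-level variance: RE collapses to pooled OLS.
# theta -> 1 as T grows: RE approaches the fixed effects (within) model.
```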
The random effects model cannot be used if the data are nonstationary.
Choosing between the Random Effects Model and the Fixed
Effects Model.
The main consideration in choosing between the fixed effects model and the random effects model is
whether you think there is unobserved heterogeneity among the cross sections. If so, choosing the random
effects model omits the individual state intercepts and suffers from omitted variable bias. In this case the
only choice is the fixed effects model. Since I cannot imagine a situation where I would be willing to
assume away the possibility of unobserved heterogeneity, I always choose the fixed effects model. If the
crosssection dummies turn out to be insignificant as a group, I might be willing to consider the random
effects model.
If you are not worried about unobserved heterogeneity, the next consideration is to decide if the
observations can be considered a random sample from a larger population. This would make sense, for
example, if the data set consisted of observations on individual people who are resurveyed at various
intervals. However, if the observations are on states, cities, counties, or countries it doesn’t seem to make
much sense to assume that Oklahoma, for example, has been randomly selected from a larger population
of potential but unrealized states. There are only fifty states and we have data on all fifty. Again, I lean
toward the FE model.
Another way of considering this question is to ask yourself if you are comfortable with generating
estimates that are conditional on the individual states. This is appropriate if we are particularly interested in
the individuals that are in the sample. Clearly, if we have data on individual people, we might be
uncomfortable making inferences conditional on these particular people.
On the other hand, individuals can have unique unobserved heterogeneity. Suppose we are estimating an
earnings equation with wages as the dependent variable and education, among other things, as explanatory
variables. Suppose there is an unmeasured attribute called "spunk." Spunky people are likely to get more
education and earn more money than people without spunk. The random effects model will overestimate
the effect of education on earnings because the omitted variable, spunk, is positively correlated with both
earnings and education. If people are endowed with a given amount of spunk that doesn’t change over time,
then spunk is a fixed effect and the fixed effects model is appropriate.
With respect to the econometric properties of the FE and RE models, the critical assumption of the RE
model is not whether the effects are fixed. Both models assume that they are in fact fixed. The crucial
assumption for the random effects model is that the individual intercepts are not correlated with any of the
explanatory variables, i.e., E[α_i X_kit] = 0 for all i and k. Clearly this is a strong assumption.
The fixed effects model is the model of choice if there is a reasonable likelihood of significant fixed effects
that are correlated with the independent variables. It is difficult to think of a case where correlated fixed
effects can be ruled out a priori. I use the fixed effects model for all panel data analyses unless there is a good
reason to do otherwise. I have never found a good reason to do otherwise.
Hausman-Wu test again
The Hausman-Wu test is (also) applicable here. If there is no unobserved heterogeneity, the RE model is
correct and RE is consistent and efficient. If there is significant unobserved heterogeneity, the FE model is
correct, there are correlated fixed effects, and the FE model is consistent and efficient while the RE model is
now inconsistent. If the FE model is correct, then the estimated coefficients from the FE model will be
unbiased and consistent while the estimated coefficients from the RE model will be biased and inconsistent.
Consequently, the coefficients will be different from each other, causing the test statistic to be a large
positive number. On the other hand, if the RE model is correct, then both models yield unbiased and
consistent estimates, so that the coefficient estimates will be similar. (The RE estimates will be more
efficient.) The HW statistic should be close to zero. Thus, if the test is not significantly different from zero,
the RE model is correct. If the test statistic is significant, then the FE model is correct.
This test has one serious drawback: it frequently lacks the power to distinguish the two hypotheses. In other
words, there are two ways this test could generate an insignificant outcome. One way is for the
estimated coefficients to be very close to each other with low variance. This is the ideal outcome because
the result is a precisely measured zero. In this case one can be confident in employing the RE model. On
the other hand, the coefficients can be very different from each other while the test is still insignificant
because of the high variance associated with the estimates. In this case it is not clear that the RE model is
superior. Unfortunately, this is a very common outcome of the Hausman-Wu test.
My personal opinion is that the Hausman-Wu test is unnecessary for most applications of panel data in
economics. Because states, counties, cities, etc. are not random samples from a larger population, and
because correlated fixed effects are so obvious and important, the fixed effects model is to be preferred on
a priori grounds. However, if the sample is a panel of individuals that could reasonably be considered a
sample from a larger population, then the Hausman-Wu test is appropriate and is available in Stata using
the “hausman” command. Type “help hausman” in Stata for more information.
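The mechanics are straightforward. The following sketch uses the crime-equation variables from this chapter and assumes the data have already been declared a panel by state and year; it estimates both models and then runs the test:

```stata
* fixed effects model; save its results under the name "fixed"
xtreg lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534, fe
estimates store fixed

* random effects model (becomes the current estimates)
xtreg lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534, re

* Hausman-Wu test: FE (consistent under both hypotheses)
* against RE (efficient under the null); "." means the current estimates
hausman fixed .
```

A significant chi-square statistic rejects the random effects model in favor of fixed effects; an insignificant one should be read with the power caveat discussed above.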
The Random Coefficients Model.
In this model we allow each state to have its own regression. The model was originally suggested by
Hildreth and Houck, and by Swamy.14
The basic idea is that the coefficients are random variables such that
γ_ik = γ_k + v_ik,
where the error term v_ik has a zero mean and constant variance. The model is estimated in
two steps. In the first step we use ordinary least squares to estimate a separate equation for each state. We
then compute the mean group estimator, the simple average of the state-by-state estimates,
γ̂_k = (1/N) Σ_i γ̂_ik.
In the second step we use the residuals from the state
regressions to compute the variances of the estimates. The resulting generalized least squares estimate of
γ_k is a weighted average of the original ordinary least squares estimates, where the weights are the inverses
of the estimated variances.
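The first step can be carried out by hand with statsby, which reruns the regression for each state and saves the coefficients; averaging them gives the mean group estimator. This is only a sketch of step one (the xtrchh command shown below performs the full GLS second step for you):

```stata
* step 1: state-by-state OLS; coefficients are saved in betas.dta
* as variables named _b_lprison, _b_lmetpct, ..., _b_cons
statsby _b, by(state) saving(betas, replace): ///
    regress lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534

* the simple average across states is the mean group estimator
use betas, clear
summarize _b_lprison
```

The mean reported by summarize is γ̂_k for the prison coefficient; its dispersion across states previews the parameter-constancy test discussed next.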
14
Hildreth, C. and C. Houck, "Some Estimators for a Linear Model with Random Coefficients," Journal of the
American Statistical Association, 63, 1968, 584-595. Swamy, P., "Efficient Inference in a Random Coefficient
Regression Model," Econometrica, 38, 1970, 311-323.
As a byproduct of the estimation procedure, we can test the null hypothesis that the coefficients are
constant across states. The test statistic is based on the difference between the state by state OLS
coefficients and the weighted average estimate.
This method is seldom employed. It tends to be very inefficient simply because we usually do not have a
long enough panel to generate precise estimates of the individual state coefficients. As a result, the
coefficients have high standard errors and low t-ratios. Also, even if the test indicates that the coefficients
vary across states, it is not clear whether the variance is really caused by different parameters or simple
random variation across parameters that are really the same. Finally, the model does not allow us to
incorporate a time-dependent effect, β_t, which would control for common effects across all states. These
disadvantages of the random coefficients model outweigh its advantages. I prefer the fixed effects model.
The random coefficients model is available in Stata using the xtrchh procedure (the cross-sectional
time-series Hildreth-Houck random-coefficients procedure). We can use this technique to estimate the same
crime equation.
. xtrchh lcrmaj lprison lmetpct lblack lrpcpi lp1824 lp2534

Hildreth-Houck random-coefficients regression   Number of obs      =      1070
Group variable (i): state                       Number of groups   =        51
                                                Obs per group: min =        20
                                                               avg =      21.0
                                                               max =        21
                                                Wald chi2(6)       =     24.35
                                                Prob > chi2        =    0.0004
------------------------------------------------------------------------------
      lcrmaj |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lprison |  -.1703223   .1173487    -1.45   0.147    -.4003215     .059677
     lmetpct |  -52.61075   49.73406    -1.06   0.290    -150.0877    44.86622
      lblack |   .0916276   7.147835     0.01   0.990    -13.91787    14.10113
      lrpcpi |  -.4169877   .1691149    -2.47   0.014    -.7484469    -.0855285
      lp1824 |  -1.471787   .3471744    -4.24   0.000    -2.152237    -.7913379
      lp2534 |   .4542666   .3208659     1.42   0.157    -.1746191    1.083152
       _cons |   243.5776   228.6247     1.07   0.287    -204.5186    691.6737
------------------------------------------------------------------------------
Test of parameter constancy:  chi2(350) = 37470.75    Prob > chi2 = 0.0000
This model is similar to the others estimated above. The fact that the coefficients are significantly different
across states should not be a major concern, because we already know that the state intercepts are different.
For that reason we use the fixed effects model. We really don’t care about the coefficients on the control
variables (metpct, black, income, and the age groups) so we don’t need to allow their coefficients to vary
across states. If we truly believe that a target variable (prison in this case) might have different effects
across states, we could multiply prison by each of the state dummies, create 51 prison-state variables and
include them in the fixed effects model. This combines the best attributes of the random coefficients model
and the fixed effects model.
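Under the hypothetical assumption that state is coded 1 through 51 and the state dummies do not already exist, one way to build these interactions is:

```stata
* one dummy per state, then interact each with the target variable
tab state, gen(st)
forvalues i = 1/51 {
    generate prison_st`i' = lprison*st`i'
}

* fixed effects regression with a state-specific prison coefficient
* (the interactions replace the common lprison term)
xtreg lcrmaj prison_st* lmetpct lblack lrpcpi lp1824 lp2534, fe
```

A test that the 51 prison coefficients are equal (testparm prison_st*) then asks the random-coefficients question for the one variable we actually care about.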
Stick with the fixed effects model for all your panel data estimation needs.
Index
#delimit;, 36
_n, 13, 19, 178
ADL, model, 147
AIC, 172, 173
Akaike information criterion, 172
append, 24
Augmented Dickey-Fuller, 166
Bayesian Information Criterion, 172
bgodfrey
Stata command, 156, 157, 158
bias
definition, 59
BIC, 172, 173, 175
block recursive equation system, 141
Breusch-Godfrey, 156
Breusch-Pagan test, 110
Chi2(df, x), 15
Chi2tail(df,x), 16
Chi-square distribution
definition, 67
Chow test, 95
Cochrane-Orcutt, 151, 158, 159, 160, 161
Cointegration, 181
comma separated values, 21
consistency
definition, 60
consistency of the ordinary least squares
estimator, 65
control variable
definition, 89
Copy Graph, 30
correlate, 44
correlation coefficient, 83
definition, 44
covariance
definition, 43
csv, 21
data editor, 6
describe, 25
dfbeta, 106
dfgls
Stata command, 175, 179
DF-GLS, 174, 175, 179, 180
dfuller
Stata command, 167, 168, 169, 172,
173, 182, 183
Dickey-Fuller, 166, 167, 168, 172, 173, 176, 177, 182
do-file, 35
do-file editor, 35
drop, 28
dummy variable, 7, 8, 91, 95, 145, 186,
187, 204
dummy variable trap, 93
Dummy variables, 91
Durbin-Watson, 154, 155, 156, 161
dwstat
Stata command, 156
dynamic ordinary least squares (DOLS), 184
e(N), 134
e(r2), 134
efficiency
definition, 59
empirical analog, 80
endogenous variable, 126, 143, 146
ereturn, 102, 134
error correction model, 150, 185, 198
Excel, 20, 21
exogenous variables, 125, 126, 128, 129,
130, 131, 143
F-distribution
definition, 69
first difference, 8, 149, 151, 159, 181,
185, 199, 200, 204
first-difference, 18
fitstat
Stata command, 173
Fixed Effects Model, 187
F-test, 94, 95, 97, 100, 103, 135, 160, 163
Gauss-Markov assumptions, 61, 80, 110, 120, 152, 164
Gauss-Markov theorem, 58, 62
Gen, 7
general to specific, 163
general to specific modeling strategy, 89
generate, 7, 28
Granger causality test, 98
graph twoway, 30
Hausman-Wu test, 122, 136, 137, 138, 143, 144, 146, 208
heteroskedastic consistent standard error,
117
hettest
Stata command, 112
iid, 62, 110, 152, 167
Influential observations, 104
insheet, 21, 22, 23
instrumental variables, 80
derivation, 121
Invchi2(df,p), 16
Invchi2tail, 16
Invnorm(prob), 16
irrelevant variables, 89
ivreg
Stata command, 123, 131, 132, 133,
135, 139, 144
J-test, 100
J-test for overidentifying restrictions, 135
keep, 28
label variable, 19
Lag, 148, 158, 175, 179, 180, 185
lagged endogenous variables, 142
Lags, 18
Leads, 18
likelihood ratio test, 160
line graph, 30
list, 7, 9, 14, 15, 16, 24, 25, 26, 27, 38,
91, 102, 107, 126, 134, 135, 145, 157
LM test, 100, 102, 103, 110, 133, 135,
155, 156, 157, 158, 161, 186
log file, 35
match merge, 24
mean square error
definition, 60
Missing values, 17
multicollinearity, 107
newey
Stata command, 162, 163, 184
Norm(x), 16
normal distribution, 66
omitted variable bias, 84, 93, 132, 188
omitted variable theorem, 85
onetoone merge, 24
order condition for identification, 129
overid
Stata command, 133
population
scaling per capita values, 28
proxy variables, 90
Random Coefficients Model, 208
Random Effects Model, 206
random numbers, 17
random walk, 149, 164, 165, 175, 176,
177, 179, 181, 183
recursive system, 141
reduced form, 125, 126, 128, 129, 135,
137, 139, 145
reg3
Stata command, 139, 140
regress, 9, 11, 36, 37, 40, 41, 44, 84,
85, 92, 93, 94, 95, 96, 98, 99, 102,
105, 106, 112, 115, 117, 118, 119,
122, 123, 124, 128, 130, 131, 133,
134, 135, 136, 137, 138, 143, 156,
157, 166, 168, 169, 172, 173, 180,
182, 185, 189
relative efficiency
definition, 59
replace, 9, 14, 23, 24, 25, 28, 29, 36,
105, 165, 166
robust
regress option, 117
Robust standard errors and tratios, 116
sampling distribution, 63
sampling distribution of, 59
save, 11, 21, 22, 23, 24, 25, 35, 133,
137, 139, 143, 156, 157, 158, 159, 160
saveold, 23
scalar, 16, 91, 102, 134, 135, 157,
160
scatter diagram, 31
Schwarz criterion, 172, 175
Seasonal difference, 18
Seemingly unrelated regressions, 138
set matsize, 36
set mem, 36
set more off, 36
sort, 17
spurious regression, 165, 166, 175, 176
stationary, 164
sum of corrected crossproducts, 43
summarize, 9, 15, 17, 25, 26, 27, 28,
36, 37, 42, 91, 104, 180
sureg, 139
System variables, 19
tabulate, 27
target variable
definition, 89
t-distribution
definition, 71
test
Stata command, 95
Testing for unit roots, 166
testparm
Stata command, 97
Tests for over identifying restrictions,
133
three stage least squares, 139
too many control variables, 89
t-ratio, 73
tsset, 8, 17, 18, 19, 98, 130, 156, 178
two stage least squares, 122, 128, 130,
136, 139, 140, 143, 145, 146
Types of equation systems, 140
Uniform(), 17
unobserved heterogeneity, 189, 190, 208
use command, 23
Variance inflation factors, 107
varlist
definition, 14
weak instruments, 135
Weight, 14
weighted least squares, 113
White test, 113
xtrchh
Stata command, 209
A single click moves it to the command line. I recommend using “do” files consisting of a series of Stata commands for 5 . When invoked. where you can edit it before submitting. If you want to reenter the command you can double click on it. The version on the server may be different. (This is the current version on my computer. The commands are entered in the “Stata Command” window (along the bottom in this view). allowing Stata to execute each command immediately. Starting Stata Stata is available on the server. After the command has been entered it appears in the “Review” window in the upper left. The simplest way is to enter commands interactively. the screen should look something like this. In this chapter we will take a short tour of Stata to get an appreciation of what it can do.1 AN OVERVIEW OF STATA Stata is a computer program that allows the user to perform a wide variety of statistical analyses. Results are shown on the right.) There are many ways to do things in Stata. You can also use the buttons on the toolbar at the top.
graphing it. income and wealth for the United States annually for the years 19481998. In this chapter we will do a minianalysis illustrating the four econometric steps. income. (3) Transform the data. and summarizing it. The data matrix will therefore have four columns and 10 rows. (2) Look at the data by printing it out. The econometric part consists of four steps. Don’t type the variable names. while there will be 51 rows (observations) corresponding to the years 1948 to 1998. or divide by a price index to get real dollars. the number of members.077 1. or take logs. with columns corresponding to variables and rows corresponding to observations. we have 10 observations on 4 variables. If you type across the rows.000 1. Different people will find different ways of doing the same task. The final step in the project is to write up the results. which may be repeated several times. The easiest way to get these data into Stata is to use Stata’s Data Editor. use the enter key after typing each value.941 1. If you type down the columns. the annual membership dues. you could have observations on consumption. Type the following numbers into the corresponding columns of the data editor. (1) Get the data into Stata. You can type across the rows or down the columns. you are ready to do the econometric analysis.855 . Similarly. use tab to enter the number.652 . For example. (4) Analyze the transformed data using regression or some other procedure. Consider the following data set. Next. and the consumer price index. The data matrix will consist of three columns corresponding to the variables consumption. the way to learn a statistical language is to do it. Once you have the data. Then you do library research to find out what research has already been done in the area. Year 78 79 80 81 82 83 84 85 Members 126 130 123 122 115 112 139 127 Dues 135 145 160 165 190 195 175 180 CPI .115 6 . Click on Data on the task bar then click on Data Editor in the resulting drop down menu. 
and wealth. or click on the Data Editor button. An econometric project consists of several steps. you collect data.031 1.complex jobs (see Chapter 5 below).751 . you might want to divide by population to get per capita values. just the numbers. The result should look like this. Your data set should be written as a matrix (a rectangular array) of numbers. The first is to choose a topic. However. It corresponds to some data collected at a local recreation association and consists of the year. In the example below.
 86     133    180  1.135
 87     112    190  1.160

When you are done typing, your data editor screen should look like this. Now rename the variables. Double click on "var1" at the top of column one and type "year" into the resulting dialog box. Repeat for the rest of the variables (members, dues, cpi). Notice that the new variable names are in the variables box in the lower left. The result should look like this. Click on "preserve" and then the red x on the data editor header to exit the data editor.

Transforming variables

Since we have seen the data, we don't have to look at it any more. Let's move on to the transformations. We almost always want to transform the variables in our dataset before analyzing them. For a Phillips curve, for example, you might want the rate of inflation as the dependent variable while using the inverse of the unemployment rate as the independent variable. Suppose we want Stata to compute the logarithms of members and dues and create a dummy variable for an advertising campaign that took place between 1984 and 1987. All transformations like these are accomplished using the gen command. Gen is short for "generate" and is probably the command you will use most. Type the following commands one at a time into the "Stata command" box along the bottom. Hit <enter> at the end of each line to execute the command.

gen logmem=log(members)
gen logdues=log(dues)
gen dum=(year>83)
list
[output suppressed]

The list command shows you the current data values. In the example above, we transformed the variables members and dues into logarithms. The logarithm function is one of several in Stata's function library. Some of the functions commonly used in econometric work are listed in Chapter 2, below. Along with the usual addition, subtraction, multiplication, and division, Stata has exponentiation, square root, absolute value, logarithms, lags, differences, and comparisons.

It is often necessary to make comparisons between variables. In the example above we created a dummy variable, called "dum," by the command

gen dum=(year>83)

This works as follows: for each line of the data, Stata reads the value in the column corresponding to the variable year. It then compares it to the value 83. If year is greater than 83, the variable "dum" is given the value 1 ("true"). If year is not greater than 83, "dum" is given the value 0 ("false").

As another example, to compute the inverse of the unemployment rate, we would write

gen invunem=1/unem

where unem is the name of the unemployment variable.

Now suppose we want to take the first difference of the time series variable GNP (defined as GNP(t) − GNP(t−1)). To do this we first have to tell Stata that we have a time series data set, using the tsset command:

tsset year

Now you can use the "d." operator to take first differences (Δy(t)):

gen dgnp=d.gnp

I like to use capital letters for these operators to make them more visible:

gen dcpi=D.cpi

Lags can also be calculated. For example, the one-period lag of the CPI (cpi(t−1)) can be calculated as

gen cpi_1=L.cpi

These operators can be combined. The second difference, ΔΔcpi(t) = Δ²cpi(t) = Δ(cpi(t) − cpi(t−1)) = Δcpi(t) − Δcpi(t−1), can be calculated as

gen d2cpi=D.dcpi
or
gen d2cpi=DD.cpi
or
gen d2cpi=D2.cpi

The lagged difference, Δcpi(t−1), can be written as

gen dcpi_1=LD.cpi
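Comparison expressions can also be combined with the logical operators described in Chapter 2. The sketch below is an illustration only; the variable camp is hypothetical and not part of the example.

```stata
* dummy equal to 1 only in the campaign years 1984-1987
* (comparisons return 1 for true, 0 for false)
gen camp = (year>83 & year<88)
```

Because each comparison evaluates to 0 or 1, the parenthesized expression can be assigned directly, with no if-then logic needed.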
Continuing the example

Let's continue the example by doing some more transformations and then some analyses. Type the following commands (indicated by the courier new typeface), one at a time, ending with <enter>, into the Stata command box.

gen rdues=dues/cpi
gen logrdues=log(rdues)
gen trend=year-78

(Note: rdues is real (inflation-adjusted) dues, and the log function produces natural logs. The trend is simply a counter (0, 1, 2, etc.) for each year, usually called "t" in theoretical discussions.) If you want to redefine a variable that already exists, for example to rebase the price index,

replace cpi=cpi/1.160

(Whenever you redefine a variable, you must use replace instead of generate.)

list
summarize

The summarize command produces means, standard deviations, etc.

Variable |  Obs      Mean   Std. Dev.       Min       Max
---------+-----------------------------------------------
    year |   10      82.5    3.02765         78        87
 members |   10     123.9   8.999383        112       139
    dues |   10     171.5   20.00694        135       195
     cpi |   10     .9717   .1711212       .652      1.16
  logmem |   10  4.817096   .0727516   4.718499  4.934474
 logdues |   10  5.138088   .1219735   4.905275     5.273
     dum |   10        .4   .5163978          0         1
   trend |   10       4.5    3.02765          0         9

We are now ready to do the analysis: a simple ordinary least squares (OLS) regression of members on real dues.

regress members rdues

This command produces the following output.

Reading Stata Output
Most of this output is self-explanatory. The upper left section is called the Analysis of Variance. Under SS is the model sum of squares (58.1763689), or the regression sum of squares. This number is referred to in a variety of ways: the model sum of squares is also known as the explained sum of squares, ESS (or, sometimes, SSE). Mathematically, it is

Σ ŷ(i)²,  summed over i = 1, ..., N

where the circumflex indicates that y is predicted by the regression. The fact that y is in lower case indicates that it is expressed as deviations from the sample mean. The Residual SS (670.723631) is the residual sum of squares (a.k.a. the unexplained sum of squares, SSR, RSS, USS, SSU, or the error sum of squares, sometimes also SSE). Each textbook writer chooses his or her own convention. I will refer to the model sum of squares as the regression sum of squares, RSS (or SSR), and the residual sum of squares as the error sum of squares, ESS (or, sometimes, SSE). The sum of the RSS and the ESS is the total sum of squares, TSS.

The next column contains the degrees of freedom. The model degrees of freedom is the number of independent variables, not including the constant term. The residual degrees of freedom is the number of observations, or sample size, N (=10), minus the number of parameters estimated, k (=2), so that N−k=8.

The third column is the "mean square," or the sum of squares divided by the degrees of freedom. The residual mean square is also known as the "mean square error" (MSE), the residual variance, or the error variance. Mathematically it is

σ̂² = σ̂u² = Σ e(i)² / (N − k),  summed over i = 1, ..., N

where e is the residual from the regression (the difference between the actual value of the dependent variable and the predicted value from the regression):

Ŷ(i) = α̂ + β̂X(i)
Y(i) = Ŷ(i) + e(i) = α̂ + β̂X(i) + e(i)
e(i) = Y(i) − α̂ − β̂X(i)

The panel on the top right has the number of observations; the overall F-statistic, which tests the null hypothesis that the R-squared is equal to zero; the prob-value corresponding to this F-ratio; the R-squared; the R-squared adjusted for the number of explanatory variables (a.k.a. R-bar squared); and the root mean square error (the square root of the residual sum of squares divided by its degrees of freedom), also known as the standard error of the estimate, SEE, or the standard error of the regression, SER. Since the MSE is the error variance, the SEE is the corresponding standard deviation (square root).

The bottom table shows the variable names, the corresponding estimated coefficients, the standard errors, the t-ratios, the prob-values, and the confidence interval around the estimates. The estimated coefficient for rdues is

β̂ = Σxy / Σx²  (= −.1573327)

{The other coefficient is the intercept or constant term, α̂ = Ȳ − β̂X̄ (= 151.0834).} The corresponding standard error for β̂ is

S(β̂) = sqrt(σ̂² / Σx²)  (= .1310586)

The resulting t-ratio for the null hypothesis that β=0 is

T = β̂ / S(β̂)  (= −1.20)

The prob-value for the t-test of no significance is the probability, computed using Student's t distribution, two-tailed, that the test statistic T would be observed if the null hypothesis is true. If this value is smaller than .05, then
we "reject the null hypothesis that β=0 at the five percent significance level." We usually just say that β is significant.

Let's add the time trend and the dummy for the advertising campaign and do a multiple regression, using the logarithm of members as the dependent variable and the log of real dues as the price.

regress logmem logrdues trend dum

      Source |       SS       df       MS        Number of obs =      10
-------------+------------------------------    F(  3,     6)  =    8.34
       Model |  .038417989     3  .012805996    Prob > F       =  0.0146
    Residual |  .009217232     6  .001536205    R-squared      =  0.8065
-------------+------------------------------    Adj R-squared  =  0.7098
       Total |  .047635222     9  .005292802    Root MSE       =  .03919

      logmem |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
    logrdues |  -.6767872     .36665   -1.85   0.114    -1.573947    .2203729
       trend |  -.0432308   .0094375   -4.58   0.004    -.0663234   -.0201382
         dum |    .155846   .0608159    2.56   0.043      .007035    .3046571
       _cons |   8.989932    .687924   13.07   0.000     7.306633    10.67323

The coefficient on logrdues is the elasticity of demand for pool membership. (Mathematically, the derivative of a log with respect to a log is an elasticity.) Is the demand curve downward sloping? Is price significant in the demand equation at the five percent level? At the ten percent level? The trend coefficient is the percent change in pool membership per year, holding everything else constant. (The derivative of a log with respect to a non-logged value is a percent change.) According to this estimate, the pool can expect to lose four percent of its members every year, everything else staying the same (ceteris paribus). On the other hand, continuing the advertising campaign will generate a 16 percent increase in pool memberships every year, holding everything else constant. Note that, since this is a multiple regression, the estimated coefficients, or parameters, are the change in the dependent variable for a one-unit change in the corresponding independent variable, everything else staying the same.

Saving your first Stata analysis

You can save your output by going to the top of the output window (use the slider on the right hand side), clicking on the first bit of output you want to save, then dragging the cursor, holding the mouse key down, to the bottom of the output window. Once you have blocked the text you want to save, type <control>c for copy (or click on edit, copy text), go to your favorite word processor, and type <control>v for paste (or click on edit, paste). You can then write up your results around the Stata output. You could also create a log file containing your results. We discuss how to create log files below.

Let's assume that you have created a directory called "project" on the H: drive. To save the data for further analysis, type

save "H:\Project\test.dta"

This will create a Stata data set called "test.dta" in the directory H:\Project. You don't really need the quotes unless there are spaces in the pathname; however, it is good practice. You can also save the data by clicking on File, Save and browsing to the Project folder, or by clicking on the little floppy disk icon on the button bar. You can read the data into Stata later with the command

use "H:\Project\test.dta"
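You could also let Stata capture the whole session in a log file. The mechanics are covered later; as a preview, a minimal sketch looks like this (the filename is hypothetical):

```stata
* open a log, rerun the regression, close the log
log using "H:\Project\first_analysis.log", replace
regress logmem logrdues trend dum
log close
```

Everything printed to the output window between log using and log close lands in the file, ready to paste into a write-up.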
2 STATA LANGUAGE

This section is designed to provide a more detailed introduction to the language of Stata. However, it is still only an introduction; for a full development you must refer to the relevant Stata manual, or use Stata Help (see Chapter 6).

Basic Rules of Stata

1. Rules for composing variable names in Stata.
a. The maximum number of characters in a name is 32, but shorter names work best.
b. Every name must begin with a letter or an underscore. Subsequent characters may be letters, numbers, or underscores.
c. You cannot use special characters (&, #, etc.).
d. Case matters. The names Var, VAR, and var are all considered to be different variables in Stata notation. I recommend that you use lowercase only for variable names.
e. I recommend that you not use underscore at the beginning of any variable name, since Stata frequently uses that convention to make its own internal variables. For example, _n is Stata's variable containing the observation number.

The following variable names are reserved. You may not use these names for your variables.

_all  _b  _coef  _cons  _n  _N  _pi  _pred  _rc  _se  _skip  byte  double  e  float  if  in  int  long  using  with

2. Command syntax. The typical Stata command looks like this:

[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [, options]
where things in a square bracket are optional.

A varlist is a list of variable names separated by spaces. If no varlist is specified, all the variables are used. Actually, the varlist is almost always required. For example, in the simple program above, we used the command "list" with no varlist attached. Consequently, Stata printed out all the data for all the variables currently in memory. We could have said

list members dues rdues

to get a list of those three variables only.

The "=exp" option is used for the "generate" and "replace" commands, e.g., gen logrdues=log(dues) and replace cpi=cpi/1.160. That is, the equal sign in the expression replace cpi=cpi/1.160 indicates that the variable cpi is to take the value of the old cpi divided by 1.160. It is not testing for the equality of the left and right hand sides of the expression. Note that Stata requires two equal signs to denote equality, because one equal sign indicates assignment in the generate and replace commands.

The "if exp" option restricts the command to a certain subset of observations. For example, we could have written

list year members rdues if year == 83

This would have produced a printout of the data for year, members, and rdues for 1983.

The "in range" option also restricts the command to a certain range. It takes the form in number1/number2. For example,

list year members rdues in 5/10

would list the observations from 5 to 10. We can also use the letters f (first) and l (last) in place of either number1 or number2.

The weight option is always in square brackets and takes the form [weightword=exp], where the weightword is one of the following (there is a default weight for each particular command):

aweight (analytical weights: the inverse of the variance of the observation. Each observation in our typical sample is an average across the counties, states, etc., that comprise it, and the variance of such an average is the overall variance divided by the number of elements, usually population. The aweight is rescaled to sum to the number of observations in the sample.)

fweight (frequency weights: indicates duplicate observations. For example, an fweight of 100 indicates that there are really 100 identical observations.)

pweight (sampling weights: the inverse of the probability that this observation is included in the sample. A pweight of 100 indicates that this observation is representative of 100 subjects in the population. This is the most commonly used weight by economists.)

iweight (importance weight: indicates the relative importance of each observation. The definition of iweight depends on how it is defined in each command that uses them.)

Options. Options must be separated from the rest of the command by a comma. Each Stata command takes command-specific options. For example, the "reg" command that we used above has the option "robust," which produces heteroskedastic-consistent standard errors:

reg members rdues trend, robust

by varlist: Almost all Stata commands can be repeated automatically for each value of the variable or variables in the by varlist. This option must precede the command and be separated from the rest of the command by a colon. The data must be sorted by the varlist. For example, the commands

sort dum
by dum: summarize members rdues

will produce means, standard deviations, etc.
That is, you get one summary for the years without an advertising campaign (dum=0) and one for the years with an advertising campaign (dum=1). In this case the sort command is not necessary, since the data are already sorted by dum, but it never hurts to sort before using the by varlist option.

Stata functions

These are only a few of the many functions in Stata; see the STATA USER'S GUIDE for a comprehensive list.

Arithmetic operators:
+   add
-   subtract
*   multiply
/   divide
^   raise to a power

Comparison operators:
==  equal to
~=  not equal to
>   greater than
<   less than
>=  greater than or equal to
<=  less than or equal to

Logical operators:
&   and
|   or
~   not

Comparison and logical operators return a value of 1 for true and 0 for false. The order of operations is: ~ (negation), ^, - (negation), /, *, - (subtraction), +, ~=, >, <, <=, >=, ==, &, |.

Mathematical functions include:
abs(x)   absolute value
max(x)   returns the largest value
min(x)   returns the smallest value
sqrt(x)  square root
exp(x)   raises e (2.71828) to a specified power
log(x)   natural logarithm

Some useful statistical functions are the following. Chi2(df,x) returns the cumulative value of the chi-square distribution with df degrees of freedom for a value of x.
For example, suppose we know that x=2.05 is distributed according to the chi-square with 2 degrees of freedom. What is the probability of observing a number as large as 2.05 if the true value of x is zero? We can compute the following in Stata. (Because we are not generating variables, we use the scalar command.)

. scalar prob1=chi2(2,2.05)
. scalar list prob1
     prob1 =  .64120353

To get the prob value (the area in the right-hand tail of the distribution), compute

. scalar prob=1-chi2(2,2.05)
. scalar list prob
      prob =  .35879647

We apparently have a pretty good chance of observing a number as large as 2.05 even if the true value of x is zero. Since this value is not smaller than .05, x=2.05 is not significantly different from zero at the .05 level.

Chi2tail(df,x) returns the upper tail of the same distribution, so we should get the same answer as above using this function.

. scalar prob2=chi2tail(2,2.05)
. scalar list prob2
     prob2 =  .35879647

Invchi2(df,p) returns the value of x for which the probability is p and the degrees of freedom are df.

. scalar x1=invchi2(2,.641)
. scalar list x1
        x1 =  2.0488658

Invchi2tail is the inverse function for chi2tail.

. scalar x2=invchi2tail(2,.359)
. scalar list x2
        x2 =  2.0488658

Norm(z) is the cumulative standard normal. To get the probability of observing a number as small as 1.65 when the mean of the distribution is zero (and the standard deviation is one), type

. scalar prob=norm(1.65)
. scalar list prob
      prob =  .9505285

Invnorm(prob) is the inverse normal distribution, so that if prob=norm(z) then z=invnorm(prob).

Wooldridge uses the Stata functions norm, invttail, invFtail, and invchi2tail to generate the normal, t, F, and chi-square tables in the back of his textbook, which means that we don't have to look in the back of textbooks for statistical distributions any more.
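The same inverse functions give critical values directly. A sketch; the degrees of freedom here are illustrative, not tied to any particular regression in the text:

```stata
* two-tailed 5 percent critical value of t with 8 degrees of freedom
scalar tcrit = invttail(8,.025)
* 5 percent critical value of F with 3 and 6 degrees of freedom
scalar fcrit = invFtail(3,6,.05)
scalar list tcrit fcrit
```

These should reproduce the familiar table values (roughly 2.31 for the t and 4.76 for the F).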
Uniform() returns uniformly distributed pseudorandom numbers between zero and one. Even though uniform() takes no parameters, the parentheses are part of the name of the function and must be typed. When used with gen, it will create a number of observations equal to the number of observations in the data set. In the example below, I used the crime1990.dta data set, which has 51 observations, one for each state and DC. To generate a sample of uniformly distributed random numbers, use the command

. gen v=uniform()
. summarize v

Variable |  Obs      Mean   Std. Dev.       Min       Max
---------+-----------------------------------------------
       v |   51  .4819048   .2889026   .0445188  .9746088

To generate a sample of normally distributed random numbers with mean 2 and standard deviation 10, type

. gen z=2+10*invnorm(uniform())
. summarize z

Variable |  Obs      Mean   Std. Dev.        Min       Max
---------+------------------------------------------------
       z |   51  1.954225    9.87513  -21.39789  15.83905

Note that both these functions return 51 observations on the pseudorandom variable. The pseudo random number generator uniform() generates the same sequence of random numbers in each session of Stata, unless you reinitialize the seed. To do that, use the command

set seed #

If you set the seed to a new #, you get a different sequence of random numbers. Unless I am doing a series of Monte Carlo exercises, I leave the seed alone.

Missing values

Missing values are denoted by a single period (.). Any arithmetic operation on a missing value results in a missing value. When doing statistical analyses, Stata ignores (drops) observations with missing values. In commands such as the summarize command, Stata will simply ignore any missing values for each variable separately when computing the mean, standard deviations, etc. In other commands the whole observation is dropped: for example, if you did a regression of crime on police and the unemployment rate across states, Stata would drop any observations for which there are missing values in either police or unemployment ("case wise deletion"). Unless you tell Stata otherwise, Stata considers missing values as larger than any other number. So, for example, sorting by a variable which has missing values will put all observations corresponding to missing at the end of the data set. This makes it difficult to be sure you are doing what you think you are doing.

Time series operators.

The tsset command must be used to indicate that the data are time series. To do that, type

sort year
tsset year

Lags: the one-period lag of a variable such as the money supply, ms, can be written as L.ms, so L.ms is equal to ms(t−1). Similarly, the second-period lag could be written as L2.ms or, equivalently, as LL.ms. For example, the command

gen rdues_1 = L.rdues

will generate a new variable called rdues_1, which is the one-period lag of rdues. Note that this will produce a missing value for the first observation of rdues_1.

Leads: the one-period lead is written as F.ms (= ms(t+1)). The two-period lead is written as F2.ms or FF.ms.

Differences: the first difference is written as D.ms (= ms(t) − ms(t−1)). The second difference (the difference of a difference) is D2.ms or DD.ms (= (ms(t) − ms(t−1)) − (ms(t−1) − ms(t−2))).

Seasonal differences: the first seasonal difference is written S.ms and is the same as D.ms. The second seasonal difference is S2.ms and is equal to ms(t) − ms(t−2). For monthly data, the difference of the 12-month difference would be written as DS12.ms.

Time series operators can be combined, and can be written in upper or lowercase. I recommend upper case so that an operator is more easily distinguished from a variable name (which I like to keep in lowercase).

Setting panel data.

You can also use time series operators with panel data, but you must first tell Stata that you have a panel (pooled time series and cross-section). Suppose you have a panel of 50 states (1-50) across 20 years (1977-1996). The following commands sort
the data, first by state, then by year, define state and year as the cross-section and time-series identifiers, and generate the lag of the variable x.
sort state year
tsset state year
gen x_1=L.x
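Once the panel is declared, the time-series operators work within each state automatically. A short sketch (the names dx and x are illustrative):

```stata
* first difference of x within each state; the first year of every
* state is set to missing rather than differenced across states
gen dx = D.x
```

This within-panel behavior is the reason to prefer the operators over hand-built subscript arithmetic.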
Variable labels.
Each variable automatically has an 80-character label associated with it, initially blank. You can use the "label variable" command to define a new variable label. For example,

label variable dues "Annual dues per family"
label variable members "Number of members"

(Note that the command is typed in lowercase; case matters in Stata.) Stata will use variable labels, as long as there is room, in all subsequent output. You can also add variable labels by invoking the Data Editor, then double clicking on the variable name at the head of each column and typing the label into the appropriate space in the dialog box.
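A quick way to confirm that labels took hold is describe. This sketch assumes the two labels from the example above have been set:

```stata
* the variable label column of the output should now show
* "Annual dues per family" and "Number of members"
describe dues members
```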
System variables
System variables are automatically produced by Stata. Their names begin with an underscore. _n is the number of the current observation. You can create a counter or time trend, if you have time series data, by using the command
gen t=_n
_N is the total number of observations in the data set. _pi is the value of π to machine precision.
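The two system variables can be combined inside generate statements. A sketch (the new variable names are illustrative only):

```stata
gen t    = _n           // counter: 1, 2, ..., _N
gen last = (_n == _N)   // dummy marking the final observation
```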
3 DATA MANAGEMENT
Getting your data into Stata
Using the data editor
We already know how to use the data editor from the first chapter.
Importing data from Excel
Generally speaking, the best way to get data into Stata is through a spreadsheet program such as Excel. For example, data presented in columns in a text file or Web page can be cut and pasted into a text editor or word processor and saved as a text file. Such a file is not yet readable as data, because it is organized as rows of text rather than as columns of variables; we need to separate the data into columns corresponding to variables. Text files can be opened and read into Excel using the Text Import Wizard. Click on File, Open and browse to the location of the file. Follow the Text Import Wizard's directions. This creates separate columns of numbers, which will eventually correspond to variables in Stata. After the data have been read into Excel, the data can be saved as a text file. Clicking on File, Save As in Excel reveals the following dialog box.
If the data have been read from a .txt file, Excel will assume you want to save it in the same file. Clicking on Save will save the data set with tabs between data values. Once the tab delimited text file has been created, it can be read into Stata using the insheet command.
insheet using filename
For example,
insheet using “H:\Project\gnp.txt”
The insheet command will figure out that the file is tab delimited and that the variable names are at the top. The insheet command can be invoked by clicking on File, Import, and “Ascii data created by a spreadsheet.” The following dialog box will appear.
Clicking on Browse allows you to navigate to the location of the file. Once you get to the right directory, click on the downarrow next to “Files of type” (the default file type is “Raw,” whatever that is) to reveal
Choose .txt files if your files are tab delimited.
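Putting the pieces together, a hypothetical sequence for a tab-delimited gnp.txt holding a year column and a gnp column:

```stata
* read the text file, declare the time dimension, difference the series
insheet using "H:\Project\gnp.txt", clear
tsset year
gen dgnp = D.gnp    // first difference, as in Chapter 2
```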
Importing data from comma separated values (.csv) files
The insheet command can be used to read other file formats such as csv (comma separated values) files. Csv files are frequently used to transport data. For example, the Bureau of Labor Statistics allows the public to download data from its website. The following screen was generated by navigating the BLS website http://www.bls.gov/.
Note that I have selected the years 1956-2004, annual frequency, column view, and text, comma delimited (although I could have chosen tab delimited and the insheet command would read that too). Clicking on Retrieve Data yields the following screen.
Use the mouse (or hold down the shift key and use the down arrow on the keyboard) to highlight the data starting at Series Id, copy (control-c) the selected text, invoke Notepad, WordPad, or Word, and paste (control-v) the selected text. Now Save As a csv file, e.g., Save As cpi.csv. (Before you save, you should delete the comma after Value, which indicates that there is another variable after Value. This creates a
variable, with nothing but missing values, in Stata. You don't really have to do this, because the variable can be dropped in Stata.) The Series Id and period can be dropped once the data are in Stata.

You can read the data file with the insheet command,

insheet using "H:\Project\cpi.csv"

or you can click on File, Import, and "Ascii data created by a spreadsheet."

Saving a Stata data set: the save command

Data in memory can be saved in a Stata (.dta) data set with the "save" command. For example, the cpi data can be saved by typing

save h:\Project\cpi.dta

If a file by the name of cpi.dta already exists, e.g., if you are updating a file, you must tell Stata that you want to overwrite the old file. This is done by adding the option, replace:

save "h:\Project\cpi.dta", replace

If you want the data set to be readable by older versions of Stata, use the "saveold" command:

saveold "h:\Project\cpi.dta", replace

Reading Stata files: the use command

If the data are already in a Stata data set (with a .dta suffix), it can be read with the "use" command. For example, suppose we have saved a Stata file called project.dta in the H:\Project directory. The command to read the data into Stata is

use "h:\Project\project.dta", clear

The "clear" option is to remove any unwanted or leftover data from previous analyses. The quotes are not required, but it is neater if you use them; you might have trouble saving (or using) files without the quotes if your pathname contains spaces. Alternatively you can click on the "open (use)" button on the taskbar, or click on File, Open.

Combining Stata data sets: the append and merge commands

Suppose we want to combine the data in two different Stata data sets. There are two ways in which data can be combined. The first is to stack one data set on top of the other. The "append" command is used for this purpose. For example, suppose we have a
data set called "project.dta," which has data on three variables (gnp, ms, cpi) from 1977 to 1998, and another called "project2.dta," which has data on two of the three variables (ms and cpi) for the years 1999-2002. To combine the data sets, we first read the early data into Stata with the use command:

use "h:\Project\project.dta", clear

The early data are now residing in memory. We append the newer data to the early data as follows:

append using "h:\Project\project2.dta"

At this point you probably want to list the data, and perhaps sort by year or some other identifying variable, just to make sure. Note that, if one data set has variables that the other does not, the combined data set will have variables from both, with the observations that lack them filled in with missing values. Then save the expanded data set with the save command. If you do not get the result you were expecting, simply exit Stata without saving the data.

The other way to combine data sets is to merge two sets together. This is concatenating horizontally instead of vertically: it makes the data set wider, in the sense that the combined data set will have variables from both. There are two kinds of merges. The first is the one-to-one merge, where the first observation of the first data set is matched with the first observation of the second data set. You must be very careful that each data set has its observations sorted exactly the same way; I do not recommend one-to-one merges as a general principle. The preferred method is a "match merge," where the two datasets have one or more variables in common. These variables act as identifiers, so that Stata can match the correct observations. (Of course, there could be some overlap in the variables, so one data set will overwrite the data in the other.)

For example, suppose you have data on the following variables: year pop crmur crrap crrob crass crbur from 1929 to 1998 in a data set called crime1.dta. Suppose that you also have a data set called arrests.dta containing year pop armur arrap arrob arass arbur (corresponding arrest rates) for the years 1952-1998. You can combine these two data sets with the following commands.

Read and sort the first data set. This is the "master" data set.

use "H:\Project\crime1.dta", clear
sort year
list
save "H:\Project\crime1.dta", replace

Read and sort the second data set. This is the "using" data set.

use "H:\Project\arrests.dta", clear
sort year
list
save "H:\Project\arrests.dta", replace

Read the first data set back into memory, clearing out the data from the second file.

use "H:\Project\crime1.dta", clear

Merge with the second data set by year.

merge year using "H:\Project\arrests.dta"

Look at the data.

list
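After a merge it is worth checking how the observations matched, using the _merge variable described below. A sketch:

```stata
* one-way table of the match codes created by merge
* 1 = master only, 2 = using only, 3 = matched
tabulate _merge
```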
describe It produces a list of variables. The data set that is in memory (crime1) is called the “master. If you omit the varlist. You should always list the _merge variable after a merge. Stata uses the convention that. Here is the resulting output from the crime and arrest data set we created above. merge year using h:\Project\arrests. including any string variables. then don’t use the replace option: merge year using h:\Project\arrests. This variable takes the value 1 if the value only appeared in the master data set.dta. replace The resulting data set will contain data from 19291998. If you just want to overwrite missing values in the master data set with new observations from the using data set (so that you retain all existing data in the master). The simplest command is list varlist Where varlist is a list of variables separated by spaces. update replace Here. 2 if the value occurred only in the using data set. Note also that the variable pop occurs in both data sets. The _merge variable can take two additional values if the update option is invoked: _merge equals 4 if missing values in the master were replaced with values from the using and _merge equals 5 if some values of the master disagree with the corresponding values in the using (these correspond to the revised values).dta”. I always use a varlist to make sure I get an easy to read table. Stata will assume you want all the variables listed. If you have too many variables to fit in a table. describe. you can use the merge command with the update replace options. This is very difficult to read. Each time Stata merges two data sets. and tabulate commands It is important to look at your data to make sure it is what you think it is. 24 . So.Assuming it looks ok. and 3 if an observation from the master is joined with one from the using.dta. the one that is in memory (the master) is retained. the version of pop that is in the crime1 data set is stored in the new crime2 data set. 
Stata will print out the data by observation. however the arrest variables will have missing values for 19291951. If you want to update the values of a variable with revised values. in the case of duplicate variables.” The data set mentioned in the merge command (arrests) is called the “using” data set. save the combined data into a new data set. In this case the pop values from the arrest data set will be retained in the combined data set. The “describe” command is best used without a varlist or options. save “H:\Project\crime2. update list _merge Looking at your data: list. summarize. it creates a variable called _merge. data in the using file overwrite data in the master file.
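The whole match-merge workflow can be collected in one short sequence. This is only a sketch, reusing the paths from the example above; the tabulate _merge line is an addition that summarizes the match results in a single table rather than a long listing:

```stata
* Match-merge crime1 and arrests by year, then inspect the result
use "H:\Project\crime1.dta", clear
sort year
merge year using "H:\Project\arrests.dta"
tabulate _merge                      // how many 1s, 2s, and 3s?
list year _merge if _merge != 3      // observations that did not match
save "H:\Project\crime2.dta", replace
```

The if qualifier on the list command keeps the listing short by showing only the problem observations.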
. describe

Contains data from C:\Project\crime1.dta
  obs:            70
 vars:            13                        16 May 2004 20:12
 size:         3,430 (99.9% of memory free)

              storage  display    value
variable name   type   format     label     variable label
-----------------------------------------------------------
year            int    %8.0g
pop             float  %9.0g                POP
crmur           int    %8.0g                CRMUR
crrap           long   %12.0g               CRRAP
crrob           long   %12.0g               CRROB
crass           long   %12.0g               CRASS
crbur           long   %12.0g               CRBUR
armur           float  %9.0g                ARMUR
arrap           float  %9.0g                ARRAP
arrob           float  %9.0g                ARROB
arass           float  %9.0g                ARASS
arbur           float  %9.0g                ARBUR
_merge          byte   %8.0g
-----------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved

We can get a list of the numerical variables and summary data for each one with the "summarize" command:

 summarize varlist

This command produces the number of observations, the mean, standard deviation, min, and max for every variable in the varlist. If you omit the varlist, Stata will generate these statistics for all the variables in the data set:

. summarize

    Variable |   Obs        Mean    Std. Dev.       Min       Max
-------------+----------------------------------------------------
        year |    70      1963.5    20.35109       1929      1998
       crbur |    64     1736091     1214590     318225   3795200
      _merge |    70    2.371429    .9952267          1         5
  (remaining rows omitted)

Don't forget, as with any Stata command, you can use qualifiers to limit the command to a subset of the data. For example, the command

 summarize if year > 1990

will yield summary statistics only for the years 1991-1998. If you add the detail option,

 summarize varlist, detail

you will also get the variance, skewness, kurtosis, the four smallest and four largest values, and a variety of percentiles.

Finally, you can use the "tabulate" command to generate a table of frequencies:

 tabulate varname

This is particularly useful for large data sets with too many observations to list comfortably. For example, suppose we think we have a panel data set on crime in nine Florida counties for the years 1978-1997. Just to make sure, we use the tabulate command:

. tabulate county

     county |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         20       11.11       11.11
          2 |         20       11.11       22.22
          3 |         20       11.11       33.33
          4 |         20       11.11       44.44
          5 |         20       11.11       55.56
          6 |         20       11.11       66.67
          7 |         20       11.11       77.78
          8 |         20       11.11       88.89
          9 |         20       11.11      100.00
------------+-----------------------------------
      Total |        180      100.00

. tabulate year

       year |      Freq.     Percent        Cum.
------------+-----------------------------------
         78 |          9        5.00        5.00
         79 |          9        5.00       10.00
  (years 80 through 96, 9 observations each, omitted)
         97 |          9        5.00      100.00
------------+-----------------------------------
      Total |        180      100.00

Nine counties each with 20 years, and 20 years each with 9 counties. Looks ok.

The tabulate command will also do another interesting trick: it can generate dummy variables corresponding to the categories in the table. So, suppose we want to create dummy variables for each of the counties and each of the years. We can do it easily with the tabulate command and the generate() option:

 tabulate county, generate(cdum)
 tabulate year, generate(yrdum)
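Once the dummies exist, they can be used like any other variables. A hypothetical sketch (the dependent variable crime is an assumption, not a variable in the data set above; one dummy in each set is omitted to avoid perfect collinearity with the constant):

```stata
* County and year effects "by hand" with tabulate-generated dummies
tabulate county, generate(cdum)
tabulate year, generate(yrdum)
regress crime cdum2-cdum9 yrdum2-yrdum20   // crime is a hypothetical variable
```

The hyphenated varlists (cdum2-cdum9) refer to consecutive variables in the data set, so all the dummies do not have to be typed out.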
This will create nine county dummies (cdum1-cdum9) and twenty year dummies (yrdum1-yrdum20).

Culling your data: the keep and drop commands

Occasionally you will want to drop variables or observations from your data set. The command

 drop varlist

will eliminate the variables listed in the varlist. The command

 drop if year < 1978

will drop observations corresponding to years before 1978, and the command

 drop in 1/10

will drop the first ten observations. Alternatively, you can use the keep command to achieve the same result. The command

 keep varlist

will drop all variables not in the varlist, and

 keep if year > 1977

will keep all observations after 1977.

Transforming your data: the generate and replace commands

As we have seen above, we use the "generate" command to transform our raw data into the variables we need for our statistical analyses. Suppose we want to create a per capita murder variable from the raw data in the crime2 data set we created above. According to the results of the summarize command above, the maximum value for crmur is 24,700, and the corresponding value for pop is 270,298. We know that the population is about 270 million, so our population variable is in thousands. If we divide 24,700 by 270 million, we get the probability of being murdered (.000091). This number is too small to be useful in statistical analyses. Dividing 24,700 by 270,298 yields .09138 (the number of murders per thousand population), which is still too small. Multiplying the ratio of crmur to our measure of pop by 1000 yields the number of murders per million population (91.38), which is a reasonable order of magnitude for further analysis. We would use the command

 gen murder = crmur/pop*1000

(I always abbreviate generate to gen.) This operation divides crmur by pop and then (remember the order of operations) multiplies the resulting ratio by 1000. This produces a reasonable number for statistical operations.

If you try to generate a variable called, say, widget, and a variable by that name already exists in the data set, Stata will refuse to execute the command and tell you, "widget already defined." If you want to replace widget with new values, you must use the replace command. We used the replace command in our sample program in the last chapter to rebase our cpi index:

 replace cpi = cpi/1.160
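The gen/replace pair works the same way for any transformation. A short sketch reusing the murder-rate example (the rescaling to murders per 100,000 is purely illustrative):

```stata
gen murder = crmur/pop*1000      // murders per million population
* running the same gen line again fails: "murder already defined"
replace murder = crmur/pop*100   // rescale: murders per 100,000 population
```

Since pop is in thousands, dividing by pop gives murders per thousand people, and multiplying by 100 converts that to the conventional rate per 100,000.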
4 GRAPHS

Continuing with our theme of looking at your data before you make any crucial decisions, obviously one of the best ways of looking is to graph the data. The first kind of graph is the line graph, which is especially useful for time series. Here is a graph of the major crime rate in Virginia from 1960 to 1998. The command used to produce the graph is

 graph twoway line crmaj year

[Graph: line plot of crmaj (roughly 6000 to 16000) against year, 1960-1998.]

It is interesting that crime peaked in the mid-1970s and has been declining fairly steadily since the early 1980s. Hmmm. I wonder why.

Note that I am able to paste the graph into this Word document by generating the graph in Stata and clicking on Edit, Copy Graph. Then, in Word, click on Edit, Paste (or Ctrl-V if you prefer keyboard shortcuts).
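Copying and pasting is not the only way to get a graph into a document; graph export writes the current graph to a file. A sketch (the file name is an assumption, and the available formats depend on your operating system; on Windows, .emf and .eps always work):

```stata
graph twoway line crmaj year
graph export "H:\Project\crmaj.emf", replace
```

The exported file can then be inserted into Word with Insert, Picture.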
The second type of graph frequently used by economists is the scatter diagram, which relates one variable to another. Here is the scatter diagram relating the arrest rate to the crime rate.

 graph twoway scatter crmaj aomaj

[Graph: scatter of crmaj against aomaj.]

We also occasionally like to see the scatter diagram with the regression line superimposed.

 graph twoway (scatter crmaj aomaj) (lfit crmaj aomaj)

[Graph: the same scatter with the fitted line overlaid.]

This graph seems to indicate that the higher the arrest rate, the lower the crime rate. Does deterrence work?
Note that we used so-called "parentheses binding" or "() binding" to overlay the graphs. The first parenthesis contains the command that created the original scatter diagram; the second parenthesis contains the command that generated the fitted line (lfit is short for linear fit). Here is the lfit part by itself.

 graph twoway lfit crmaj aomaj

[Graph: the fitted line alone.]

These graphs can be produced interactively. Here is another version of the above graph, with some more bells and whistles.

[Graph: "Crime Rates and Arrest Rates, Virginia 1960-1998," crime rate plotted against arrest rate with fitted values superimposed.]

This was produced with the command
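A closely related overlay is lfitci, which draws the fitted line together with a confidence band; it combines with a scatter in exactly the same parentheses-binding style:

```stata
* Fitted line with a confidence band, scatter plotted on top
graph twoway (lfitci crmaj aomaj) (scatter crmaj aomaj)
```

Putting lfitci first keeps the shaded band behind the data points.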
 twoway (lfit crmaj aomaj, sort) (scatter crmaj aomaj), ytitle(Crime Rate)
 > xtitle(Arrest Rate) title(Crime Rates and Arrest Rates) subtitle(Virginia 1960-1998)

Note that only the first plot (plot 1) had "lfit" as an option. Also, I sorted by the independent variable (aomaj) in this graph. However, I did not have to enter that complicated command, and neither do you. The final command is created automatically by filling in dialog boxes and clicking on things. Click on Graphics to reveal the pull-down menu, then click on "overlaid twoway graphs" to reveal a dialog box with lots of tabs. Clicking on the tabs reveals further dialog boxes.

Here is another common graph in economics: the histogram.

[Graph: histogram of crmaj with a normal density overlaid, titled "Crime Rate, Virginia."]

This was produced by the following command, again using the dialog boxes.

 histogram crmaj, bin(10) normal yscale(off) xlabel(, alternate) title(Crime Rate) subtitle(Virginia)
 (bin=10, start=5740.2186, width=945.0903)

There are lots of other cool graphs available to you in Stata, and the graphics dialog boxes make it easy to produce great graphs. You should take some time to experiment with some of them.
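Instead of the normal overlay, a kernel density estimate can be superimposed with the kdensity option, which often tracks the shape of the data more closely:

```stata
histogram crmaj, bin(10) kdensity
```

The kdensity option accepts the same title and axis options shown above.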
5 USING DO-FILES

A do-file is a list of Stata commands collected together in a text file. It is also known as a Stata program or batch file. Except for very simple tasks, I recommend always using do-files. The reason is that complicated analyses can be difficult to reproduce. It is possible to do a complicated analysis interactively, executing each command in turn and looking at the result before writing and executing the next command. You can wind up rummaging around in the data, doing various regression models and data transformations, and then be unable to remember the sequence of events that led up to the final version of the model. If you want to replicate your results, you may never be able to get the exact result again. It may also make it difficult for other people to reproduce your results.

Because economics typically does not use laboratories where experimental results can be replicated by other researchers, we need to make our data and programs readily available to other economists; otherwise how can we be sure the results are legitimate? After all, we can edit the Stata output and data input and type in any values we want to. When the data and programs are readily available, it is easy for other researchers to take the programs, do the analysis themselves, and check the data against the original sources. They can also do various tests and alternative analyses to find out if the results are "fragile," that is, if the results do not hold up under relatively small changes in the modeling technique or data. To make sure we are on the up and up, any serious econometric analysis should be done with do-files which can be made available to other researchers. If you are working with a do-file, you make changes in the program, execute it, save it, and repeat; the final model can always be reproduced, as long as the data doesn't change, by simply running the final do-file again.

You can create a do-file with the built-in Stata do-file editor, with any text editor such as Notepad or WordPad, or with any word processing program like Word or WordPerfect. The text file must have a ".do" filename extension. If you use a word processor, be sure to "save as" a text (.txt) file. Word, for example, will insist on adding a .txt extension, so you have to rename the file by dropping the .txt filename extension. You can do it from inside Word by clicking on Open, browsing to the right directory, right-clicking on the do-file with the .txt extension, clicking on Rename, and dropping the .txt. So you wind up with a do-file called sample.do.

It is probably easiest to use the built-in Stata do-file editor. It is a full-service text editor, with mouse control, cut, copy and paste, find and replace, and undo. To fire up the do-file editor, click on the button that looks like an open envelope (with a pencil or something sticking out of it) on the button bar, or click on Window, Do-file Editor. The do-file is executed either with the "do" command or by clicking on File, Do….

If you insist on running Stata interactively, you can "log" your output automatically to a "log file" by clicking on the "Begin Log" button (the one that looks like a scroll) on the button bar or by clicking on File, Log, Begin. You will be presented with a dialog box where you can enter the name of the log file and change the default directory where the file will be stored. When you write up your results, you can insert parts of the log file to document your findings. You can also edit the log file, remove the output, and turn the resulting file into a do-file.

Here is a simple do-file.

 /***********************************************
 *****           Sample Do-File            *****
 *****         This is a comment           *****
 ************************************************/
 /* some preliminaries */
 #delimit ;               /* use semicolon to delimit commands */
 set more off ;
 * set mem 10000 ;        /* make sure you have enough memory */
 * set matsize 150 ;      /* make sure you have enough room for variables */
 /* tell Stata where to store the output */
 /* I usually use the do-file name, with a .log extension */
 /* use a full pathname */
 log using "H:\Project\sample.log", replace ;
 /* get the data */
 use "H:\Project\crimeva.dta", clear ;
 /*************************************/
 /***      summary statistics       ***/
 /*************************************/
 summarize crbur aobur metpct ampct unrate ;
 regress crbur aobur metpct ampct unrate ;
 log close ;

The beginning block of statements should probably be put at the start of any do-file. The #delimit ; command sets the semicolon, instead of the carriage return, as the symbol signaling the end of a command. This allows us to write multi-line commands. For small projects you won't have to increase memory or matsize, but if you do run out of space, these are the commands you will need (here they are commented out with a leading *).

Don't forget to set the log file with the "log using filename, replace" command. If you want Stata to put the log file in your "project" directory, you will need to use a complete pathname for the file, e.g., "H:\Project\sample.log". The "replace" part is necessary if you edit the do-file and run it again; if you forget the replace option, Stata will not let you overwrite the existing log file. Either way, you will be able to examine the output by opening the log file.

One more preliminary, the more condition: in the usual interactive mode, when the results screen fills up, it stops scrolling and --more-- appears at the bottom of the screen. To see the next screen, you must hit the spacebar. This can be annoying in do-files because it keeps stopping the program, so turn off the more feature with set more off.

Next you have to get the data with the "use" command. Then you do the analysis. Finally, don't forget to close the log file. The log file looks like this.

 ----------------------------------------------------------------------
       log:  H:\Project\sample.log
  log type:  text
 opened on:  4 Jun 2004, 08:16:47

 . use "H:\Project\crimeva.dta", clear ;

 . summarize crbur aobur metpct ampct unrate ;
   (output omitted)

 . regress crbur aobur metpct ampct unrate ;

       Source |       SS       df       MS          Number of obs =      21
 -------------+------------------------------      F(  4,    16) =   28.74
        Model |  56035517.4     4  14008879.4      Prob > F      =  0.0000
     Residual |  7800038.17    16  487502.386      R-squared     =  0.8778
 -------------+------------------------------      Adj R-squared =  0.8473
        Total |  63835555.6    20  3191777.78      Root MSE      =  698.21

   (coefficient table omitted)

 . log close ;
       log:  H:\Project\sample.log
  log type:  text
 closed on:  4 Jun 2004, 08:16:48
 ----------------------------------------------------------------------

For complicated jobs, I use Word to review the log file, because I use Word to write the final paper; you can also open it in the do-file editor or any other editor or word processor. Finally, using lots of "banner" style comments (like the "summary statistics" comment above) makes the output more readable and reminds you, or tells someone else who might be reading the output, what you're doing and why.
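Once saved, the do-file is run in its entirety with the do command. A sketch, assuming the file was saved in the project directory as above:

```stata
do "H:\Project\sample.do"
* when it finishes, open H:\Project\sample.log to review the full session
```

If the do-file stops with an error, fix the offending line in the editor and run the do command again; the replace option on log using makes the rerun painless.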
6 USING STATA HELP

This little manual can't give you all the details of each Stata command, and it can't list all the Stata commands that you might need. The official Stata manual consists of several volumes and is very expensive. So where does a student turn for help with Stata? Answer: the Stata Help menu. Clicking on Help yields a submenu; clicking on Contents allows you to browse through the list of Stata commands, arranged according to function.
One could rummage around under various topics and find all kinds of neat commands this way. Now suppose you want information on how Stata handles a certain econometric topic. For example, suppose we want to know how Stata deals with heteroskedasticity. Clicking on Help, Search, yields a search dialog box; clicking on the "Search all" radio button and entering heteroskedasticity yields a list of relevant entries. Clicking on any of the hyperlinks brings up the documentation for that command.
On the other hand, if you know that you need more information on a given command, you can click on Help, Stata command. For example, entering "regress" yields the documentation on the regress command.
This is the best way to find out more about the commands in this book. Most of the commands have options that are not documented here, and there are also lots of examples and links to related commands. You should certainly spend a little time reviewing the regress help documentation, because regress will probably be the command you use most over the course of the semester.
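The same help system can be reached from the Command window, which is often faster than the menus:

```stata
help regress               // full documentation for the regress command
search heteroskedasticity  // keyword search, like Help, Search
```

help requires an exact command name, while search looks through titles and keywords.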
7 REGRESSION

Let's start with a simple correlation, using the data on pool memberships from the first chapter. Use summarize to compute the means of the variables:

. summarize

    Variable |   Obs        Mean   Std. Dev.       Min       Max
-------------+---------------------------------------------------
        year |    10        82.5    3.02765        78        87
       trend |    10         4.5    3.02765         0         9
  (remaining rows omitted)

Plot the number of members against the real dues (rdues), with a horizontal line at the mean of members (124) and a vertical line at the mean of rdues (179). This can be done by typing

 twoway (scatter members rdues), yline(124) xline(179)

or by clicking on Graphics, Twoway graph, Scatter plot, and filling in the blanks. Clicking on Y axis, Major Tick Options, Additional Lines allows you to put a line at the mean of members; similarly, clicking on X axis, Major Tick Options, Additional Lines allows you to put a line at the mean of rdues. Either way, the result is a nice graph.
[Graph: scatter of members against rdues, with lines at the two means dividing the plot into four quadrants.]

Note that six out of the ten dots are in the upper left and lower right quadrants. This means that high values of rdues are associated with low membership. We need a measure of this association. Define the deviation from the mean for X and Y as

  x_i = X_i - \bar{X}
  y_i = Y_i - \bar{Y}

Note that, in the upper right quadrant, X_i > \bar{X} and Y_i > \bar{Y}, so that x_i > 0 and y_i > 0. Moving clockwise, in the lower right quadrant, x_i > 0 and y_i < 0. In the lower left quadrant, x_i < 0 and y_i < 0. Finally, in the upper left quadrant, x_i < 0 and y_i > 0.

This means that, for points in the lower left and upper right quadrants, x_i y_i > 0. Similarly, for points in the upper left and lower right quadrants, x_i y_i < 0. So, if we multiply the deviations together and sum, we get \sum x_i y_i. This quantity is called the sum of corrected cross-products, or the covariation. If most of the points lie in the lower left and upper right quadrants, the covariation will be positive, indicating a positive relationship between X and Y; if most of the points are in the upper left and lower right quadrants, the covariation, computed over the entire sample, will be negative, indicating a negative association between X and Y. If we compute the sum of corrected cross-products for the points in our graph, which are mostly in the upper left and lower right quadrants, we find that \sum x_i y_i < 0.

There are two problems with using covariation as our measure of association. The first is that it depends on the number of observations, N: the larger the number of observations, the larger the (positive or negative) covariation. We can fix this problem by dividing the covariation by the sample size. The resulting quantity is the average covariation, known as the covariance.
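The covariation and covariance can be computed directly in Stata. A sketch, assuming the pool data are in memory; r(mean), r(sum), and r(N) are results saved by summarize:

```stata
* Deviations from the means, their products, and the (co)variation
quietly summarize rdues
generate x = rdues - r(mean)
quietly summarize members
generate y = members - r(mean)
generate xy = x*y
quietly summarize xy
display "covariation = " r(sum) ",  covariance = " r(sum)/r(N)
```

For the pool data both numbers should come out negative, matching the quadrant count above.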
  \mathrm{Cov}(x,y) = \frac{\sum x_i y_i}{N} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N}

The second problem is that the covariance depends on the units of measure of the two variables. You get different values of the covariance if the two variables are height in inches and weight in pounds versus height in miles and weight in milligrams. To solve this problem, we divide each observation by its standard deviation to eliminate the units of measure, and then compute the covariance using the resulting standard scores. The standard score of X is

  z_{X_i} = \frac{x_i}{S_X},  where  S_X = \sqrt{\frac{\sum x_i^2}{N}}

is the standard deviation of X; the standard score of Y is z_{Y_i} = y_i / S_Y. The covariance of the standard scores is

  r_{XY} = \frac{1}{N}\sum z_{X_i} z_{Y_i} = \frac{1}{N}\sum \frac{x_i}{S_X}\frac{y_i}{S_Y} = \frac{\sum x_i y_i}{N S_X S_Y}

This is the familiar simple correlation coefficient. The correlation coefficient for our pool sample can be computed in Stata by typing

. correlate members rdues
(obs=10)

             |  members    rdues
-------------+------------------
     members |   1.0000
       rdues |  -0.2825   1.0000

The correlation coefficient is negative, as suspected.

Linear regression

Sometimes we want to know more about an association between two variables than whether it is positive or negative. Consider a policy analysis concerning the effect of putting more police on the street. Does it really reduce crime? If the answer is yes, then in order to justify the expense we have to estimate the expected change in crime from a change in the number of police; we have to know not just the sign, but also by how much crime can be expected to fall. If adding 100,000 police nationwide only reduces crime by a few crimes, the policy at least doesn't make things worse, but it fails the cost-benefit test.

With respect to regression analysis, the Stata command that corresponds to ordinary least squares (OLS) is the regress command. For example,

 regress members rdues

produces a simple regression of members on rdues. That output was reported in the first chapter.
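The standard-score version can be checked with egen's std() function. One caveat, noted here as an implementation detail: Stata's std() and correlate use the sample (N-1) denominator rather than N, so dividing the product-sum by N-1 reproduces r:

```stata
* Correlation as the average product of standard scores
egen zx = std(rdues)         // standard scores, N-1 denominator
egen zy = std(members)
generate zxy = zx*zy
quietly summarize zxy
display "r = " r(sum)/(r(N)-1)
correlate members rdues      // should agree with the display above
```

Either denominator convention gives the same r, because the N (or N-1) cancels between the covariance and the standard deviations.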
Suppose we hypothesize that there is a linear relationship between crime and police such that

  Y = \alpha + \beta X

where Y is crime, X is police, and \beta < 0. If this is true and we could estimate \beta, we could estimate the effect of adding 100,000 police as follows. Since \beta is the slope of the function,

  \beta = \frac{\Delta Y}{\Delta X}

which means that

  \Delta Y = \beta \Delta X = \beta (100{,}000)

This is the counterfactual (since we haven't actually added the police yet) that we need to estimate the benefit of the policy. (The cost is determined elsewhere.)

As statisticians, as opposed to mathematicians, we know that real-world data do not line up on straight lines. We can summarize the relationship between Y and X as

  Y_i = \alpha + \beta X_i + U_i

where U_i is the error associated with observation i. Why do we need an error term? Why don't observations line up on straight lines? There are two main reasons.

1. The line might not be straight; that is, the line may be curved. We can get out of this either by transforming the line (taking logs, etc.) or by simply assuming that the line is approximately linear for our purposes.

2. The model is incomplete. We have left out some of the other factors that affect crime. At this point let's assume that the factors we left out are "irrelevant" in the sense that, if we included them, the basic form of the line would not change; that is, the values of the estimated slope and intercept would not change. Then the error term is simply the sum of a whole bunch of variables that are each irrelevant, but taken together cause the error term to vary randomly.

A line is determined by two points; here we will use the slope, \beta, and the intercept, \alpha, as the two parameters we need. So what are the values of \alpha and \beta (the parameters) that create a "good fit," that is, which values of the parameters best describe the cloud of data in the figure above? Let's define our estimate of \alpha as \hat{\alpha} and our estimate of \beta as \hat{\beta}. Then for any observation, our predicted value of Y is

  \hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i

If we are wrong, the error, or residual,

  e_i = Y_i - \hat{Y}_i = Y_i - (\hat{\alpha} + \hat{\beta} X_i)

is the difference between what we thought Y would be and what it actually was.

Clearly, we want to minimize the errors. We could minimize the sum of the errors, but that would mean that large positive errors could offset large negative errors, and we really couldn't tell a good fit from a bad fit with huge positive and negative errors that offset each other. To solve this problem we could square the errors or take their absolute values. If we minimize the sum of absolute values, we are doing least absolute value estimation. If we minimize the sum of squares of error, we are doing least squares. (We can't simply drive the errors to zero: since the points don't line up, that would require driving the line through each point, giving one line per point with no error, which is a description of the data rather than a summary of it.) The least squares procedure has been around since at least 1805, and it is the workhorse of applied statistics.

So, we want to minimize the sum of squares of error:

  \min_{\hat{\alpha},\hat{\beta}} \sum e_i^2 = \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2

We could take derivatives with respect to \hat{\alpha} and \hat{\beta}, set them equal to zero, and solve the resulting equations, but that is beyond our scope. Instead, let's do it without calculus. Suppose we have a quadratic formula and we want to minimize it by choosing a value for b:

  f(b) = c_2 b^2 + c_1 b + c_0

Recall from high school the formula for completing the square:

  f(b) = c_2 \left( b + \frac{c_1}{2 c_2} \right)^2 + c_0 - \frac{c_1^2}{4 c_2}

Multiply it out if you are unconvinced. Since we are minimizing with respect to our choice of b, and b occurs only in the squared term, the minimum occurs at

  b = \frac{-c_1}{2 c_2}

That is, the minimum is the ratio of the negative of the coefficient on the first power to two times the coefficient on the second power of the quadratic. Now apply this to the sum of squares of error:

  \sum e^2 = \sum (Y - \hat{\alpha} - \hat{\beta} X)^2
           = \sum Y^2 - 2\hat{\alpha}\sum Y + N\hat{\alpha}^2 + \hat{\beta}^2 \sum X^2 + 2\hat{\alpha}\hat{\beta}\sum X - 2\hat{\beta}\sum XY

Note that there are two quadratics here, one in \hat{\alpha} and one in \hat{\beta}. The one in \hat{\alpha} is

  f(\hat{\alpha}) = N\hat{\alpha}^2 - \hat{\alpha}\left( 2\sum Y - 2\hat{\beta}\sum X \right) + C

where C is a constant consisting of terms not involving \hat{\alpha}.
Minimizing f(\hat{\alpha}) requires setting \hat{\alpha} equal to the negative of the coefficient on the first power divided by two times the coefficient on the second power:

  \hat{\alpha} = \frac{2\sum Y - 2\hat{\beta}\sum X}{2N} = \frac{\sum Y}{N} - \hat{\beta}\frac{\sum X}{N} = \bar{Y} - \hat{\beta}\bar{X}

The quadratic form in \hat{\beta} is

  f(\hat{\beta}) = \hat{\beta}^2 \sum X^2 - 2\hat{\beta}\sum XY + 2\hat{\alpha}\hat{\beta}\sum X
               = \hat{\beta}^2 \sum X^2 - 2\hat{\beta}\left( \sum XY - \hat{\alpha}\sum X \right) + C^*

where C^* is another constant, not involving \hat{\beta}. Setting \hat{\beta} equal to the ratio of the negative of the coefficient on the first power to two times the coefficient on the second power yields

  \hat{\beta} = \frac{\sum XY - \hat{\alpha}\sum X}{\sum X^2}

Substituting the formula for \hat{\alpha},

  \hat{\beta} = \frac{\sum XY - (\bar{Y} - \hat{\beta}\bar{X})\sum X}{\sum X^2}

Multiplying both sides by \sum X^2 and using \sum X = N\bar{X} yields

  \hat{\beta}\sum X^2 = \sum XY - \bar{Y} N \bar{X} + \hat{\beta}\bar{X} N \bar{X}

so that

  \hat{\beta}\left( \sum X^2 - N\bar{X}^2 \right) = \sum XY - N\bar{X}\bar{Y}

Solving for \hat{\beta},

  \hat{\beta} = \frac{\sum XY - N\bar{X}\bar{Y}}{\sum X^2 - N\bar{X}^2} = \frac{\sum xy}{\sum x^2}

Correlation and regression

Let's express the regression line in deviations. Since \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}, the regression line goes through the means:

  \bar{Y} = \hat{\alpha} + \hat{\beta}\bar{X}

Subtracting the mean of Y from both sides of the regression line,

  Y - \bar{Y} = \hat{\alpha} + \hat{\beta}X - \hat{\alpha} - \hat{\beta}\bar{X}

or

  y = \hat{\beta}(X - \bar{X}) = \hat{\beta}x

Note that, by subtracting off the means and expressing x and y in deviations, we have "translated the axes" so that the intercept is zero when the regression is expressed in deviations. The reason is that, as we have seen above, the regression line goes through the means. Here is the graph of the pool membership data with the regression line superimposed.

[Graph: scatter of members against rdues with the fitted regression line ("Fitted values") superimposed.]

Substituting \hat{\beta} yields the predicted value of y ("Fitted values" in the graph):

  \hat{y} = \hat{\beta}x

We know that the correlation coefficient between x and y is

  r = \frac{\sum xy}{N S_x S_y}

Multiplying both sides by S_y / S_x yields

  r\frac{S_y}{S_x} = \frac{S_y}{S_x}\cdot\frac{\sum xy}{N S_x S_y} = \frac{\sum xy}{N S_x^2} = \frac{\sum xy}{\sum x^2} = \hat{\beta}

So \hat{\beta} is a linear transformation of the correlation coefficient, and since S_x > 0 and S_y > 0, r must have the same sign as \hat{\beta}. Since the regression line does everything that the correlation coefficient does, we will concentrate on regression from now on.

How well does the line fit the data? We need a measure of the goodness of fit of the regression line. Start with the notion that each observation of y is equal to the predicted value plus the residual:

  y = \hat{y} + e

Squaring both sides and summing,

  \sum y^2 = \sum \hat{y}^2 + 2\sum \hat{y}e + \sum e^2

The middle term on the right-hand side is equal to zero, because

  \sum \hat{y}e = \sum \hat{\beta}xe = \hat{\beta}\sum xe

which is zero if \sum xe = 0, and indeed

  \sum xe = \sum x(y - \hat{\beta}x) = \sum xy - \hat{\beta}\sum x^2 = \sum xy - \frac{\sum xy}{\sum x^2}\sum x^2 = 0

This property of linear regression, namely that x is uncorrelated with the residual, turns out to be useful in several contexts. Since the middle term is zero,

  \sum y^2 = \sum \hat{y}^2 + \sum e^2

or

  TSS = RSS + ESS

where TSS is the total sum of squares, \sum y^2; RSS is the regression (explained) sum of squares, \sum \hat{y}^2; and ESS is the error (unexplained, residual) sum of squares, \sum e^2. Suppose we take the ratio of the explained sum of squares to the total sum of squares as our measure of the goodness of fit:

  \frac{RSS}{TSS} = \frac{\sum \hat{y}^2}{\sum y^2} = \frac{\hat{\beta}^2 \sum x^2}{\sum y^2}

But recall that

  \hat{\beta} = r\frac{S_y}{S_x} = r\frac{\sqrt{\sum y^2 / N}}{\sqrt{\sum x^2 / N}}

Squaring both sides yields

  \hat{\beta}^2 = r^2\frac{\sum y^2}{\sum x^2}

Therefore,

  \frac{RSS}{TSS} = \hat{\beta}^2\frac{\sum x^2}{\sum y^2} = r^2

The ratio RSS/TSS is the proportion of the variance of Y explained by the regression and is usually referred to as the coefficient of determination, denoted R^2. We know why now: in a simple regression, the coefficient of determination is equal to the correlation coefficient, squared. Unfortunately, this neat relationship doesn't hold in multiple regressions (because there are several correlation coefficients associating the dependent variable with each of the independent variables). However, the ratio of the regression sum of squares to the total sum of squares is always denoted R^2.
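All of this algebra can be verified numerically in Stata. A sketch using the pool data; _b[rdues] is the estimated slope saved by regress, r(rho) the correlation saved by correlate, and e(r2) the R-squared saved by regress:

```stata
* Slope from the deviation sums, checked against regress
quietly summarize rdues
generate xdev = rdues - r(mean)
quietly summarize members
generate ydev = members - r(mean)
generate xydev = xdev*ydev
generate xxdev = xdev*xdev
quietly summarize xydev
scalar sumxy = r(sum)
quietly summarize xxdev
scalar sumxx = r(sum)
display "slope from sums    = " sumxy/sumxx

quietly regress members rdues
display "slope from regress = " _b[rdues]

* In a simple regression, R-squared equals the squared correlation
quietly correlate members rdues
display "r^2 = " r(rho)^2 ",  R-squared = " e(r2)
```

The two displayed slopes should agree exactly, and the last line illustrates the RSS/TSS = r^2 result derived above.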
Why is it called regression?
In 1886 Sir Francis Galton, cousin of Darwin and discoverer of fingerprints among other things, published a paper entitled, “Regression toward mediocrity in hereditary stature.” Galton’s hypothesis was that tall parents should have tall children, but not as tall as they are, and short parents should produce short offspring, but not as short as they are. Consequently, we should eventually all be the same, mediocre, height. He knew that, to prove his thesis, he needed more than a positive association. He needed the slope of a line relating the two heights to be less than one. He plotted the deviations from the mean heights of adult children on the deviations from the mean heights of their parents (multiplying the mother’s height by 1.08). Taking the resulting scatter diagram, he eyeballed a straight line through the data and noted that the slope of the line was positive, but less than one. He claimed that, “…we can define the law of regression very briefly; It is that the heightdeviate of the offspring is, on average, twothirds of the heightdeviate of its midparentage.” (Galton, 1886,
Anthropological Miscellanea, p. 352.) Mid-parentage is the average height of the two parents, minus the overall mean height, 68.5 inches. This is the regression toward mediocrity, now commonly called regression to the mean. Galton called the line through the scatter of data a “regression line.” The name has stuck. The original data collected by Galton are available on the web. (Actually the data set has a few missing values, and some data entered as “short,” “medium,” etc. have been dropped.) The scatter diagram is shown below, with the regression line (estimated by OLS, not eyeballed) superimposed.
[Figure: scatter diagram of meankids against parents, with the fitted OLS regression line superimposed. Both axes run from 0 to 15.]
However, another question arises, namely, when did a line estimated by least squares become a regression line? We know that least squares as a descriptive device was invented by Legendre in 1805 to describe astronomical data. Galton was apparently unaware of this technique, although it was well known among mathematicians. Galton’s younger colleague Karl Pearson later refined Galton’s graphical methods into the now universally used correlation coefficient (frequently referred to in textbooks as the Pearsonian product-moment correlation coefficient). However, least squares was still not being used. In 1897 George Udny Yule, another, even younger colleague of Galton’s, published two papers in which he estimated what he called a “line of regression” using ordinary least squares. It has been called the “regression line” ever since. In the second paper he actually estimates a multiple regression model of poverty as a function of welfare payments, including control variables: the first econometric policy analysis.
The regression fallacy
Although Galton was in his time and is even now considered a major scientific figure, he was completely wrong about the law of heredity he claimed to have discovered. According to Galton, the heights of fathers of exceptionally tall children tend to be tall, but not as tall as their offspring, implying that there is a natural
tendency to mediocrity. This theorem was supposedly proved by the fact that the regression line has a slope less than one. Using Galton’s data from the scatter diagram above (the actual data are in Galton.dta), we can see that the regression line does, in fact, have a slope less than one.
. regress meankids parents [fweight=number]

      Source |       SS       df       MS              Number of obs =     899
-------------+------------------------------           F(  1,   897) =  376.21
       Model |  1217.79278     1  1217.79278           Prob > F      =  0.0000
    Residual |  2903.59203   897  3.23700338           R-squared     =  0.2955
-------------+------------------------------           Adj R-squared =  0.2947
       Total |  4121.38481   898  4.58951538           Root MSE      =  1.7992

    meankids |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     parents |    .640958   .0330457    19.40   0.000     .5761022    .7058138
       _cons |   2.386932   .2333127    10.23   0.000      1.92903    2.844835
The slope is .64, almost exactly that calculated by Galton, which confirms his hypothesis. Nevertheless, the fact that the law is bogus is easy to see. Mathematically, the explanation is simple. We know that

    β̂ = r(Sy/Sx)

If Sy ≅ Sx and r > 0, then

    0 < β̂ ≅ r < 1

This is the regression fallacy. The regression line slope estimate is always less than one for any two positively correlated variables with approximately equal standard deviations. Let’s see if this is true for the Galton data. The means and standard deviations are produced by the summarize command. The correlation coefficient is produced by the correlate command.
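A quick simulation makes the point. This sketch is mine (Python rather than the book's Stata, with made-up data): two variables with equal variances and a positive correlation always produce a slope equal to r, below one.

```python
import numpy as np

# Two hypothetical variables with the same variance, positively correlated
# through a shared component a (like ability in the test-score example below).
rng = np.random.default_rng(1)
n = 100_000
a = rng.normal(size=n)
x1 = a + rng.normal(size=n)
x2 = a + rng.normal(size=n)

slope = np.cov(x2, x1)[0, 1] / np.var(x1, ddof=1)   # OLS slope of x2 on x1
r = np.corrcoef(x1, x2)[0, 1]
print(slope, r)                                     # both near 0.5, below one
```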
. summarize meankids parents

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    meankids |     204    6.647265    2.678237          0         15
     parents |     197    6.826122    1.916343          2      13.03

. correlate meankids parents
(obs=197)

             | meankids  parents
-------------+------------------
    meankids |   1.0000
     parents |   0.4014   1.0000
So, the standard deviations are roughly comparable and the correlation coefficient is positive. We therefore expect the slope of the regression line to be positive but less than one. Galton was simply demonstrating the mathematical theorem with these data.
1. We weight by the number of kids because some families have lots of kids (up to 15) and therefore represent up to 15 observations, not one.
To take another example, if you do extremely well on the first exam in a course, you can expect to do well on the second exam, but probably not quite as well. Similarly, if you do poorly, you can expect to do better. The underlying theory is that ability is distributed randomly with a constant mean. Suppose ability is distributed uniformly with mean 50, and that on each test you experience a random error which is distributed normally with mean zero and standard deviation 10. Thus,

    x1 = a + e1
    x2 = a + e2
    E(a) = 50,  E(e1) = E(e2) = 0,  cov(a,e1) = cov(a,e2) = cov(e1,e2) = 0

where x1 is the grade on the first test, x2 is the grade on the second test, a is ability, and e1 and e2 are the random errors. Then

    E(x1) = E(a + e1) = E(a) + E(e1) = E(a) = 50
    E(x2) = E(a + e2) = E(a) + E(e2) = E(a) = 50
So, if you scored below 50 on the first exam, you should score closer to 50 on the second. If you scored higher than 50, you should regress toward mediocrity. (Don’t be alarmed, these grades are not scaled, 50 is a B.)
We can use Stata to prove this result with a Monte Carlo demonstration.
* choose a large number of observations
. set obs 10000
obs was 0, now 10000
* create a measure of underlying ability
. gen a=100*uniform()
* create a random error term for the first test
. gen x1=a+10*invnorm(uniform())
* now a random error for the second test
. gen x2=a+10*invnorm(uniform())
. summarize

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           a |   10000    50.01914    28.88209   .0044471   99.98645
          x1 |   10000    50.00412     30.5206  -29.68982   127.2455
          x2 |   10000    49.94758    30.49167  -25.24831   131.0626

* note that the means are all 50 and that the standard deviations are about the same

. correlate x1 x2
(obs=10000)

             |       x1       x2
-------------+------------------
          x1 |   1.0000
          x2 |   0.8912   1.0000
. regress x2 x1

      Source |       SS       df       MS              Number of obs =   10000
-------------+------------------------------           F(  1,  9998) =38583.43
       Model |  7383282.55     1  7383282.55           Prob > F      =  0.0000
    Residual |  1913206.37  9998  191.358909           R-squared     =  0.7942
-------------+------------------------------           Adj R-squared =  0.7942
       Total |  9296488.92  9999  929.741866           Root MSE      =  13.833

          x2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |    .890335   .0045327   196.43   0.000     .8814501    .8992199

* the regression slope is equal to the correlation coefficient, as expected

Does this actually work in practice? I took the grades from the first two tests in a statistics course I taught years ago and created a Stata data set called tests.dta. You can download the data set and look at the scores. Here are the results.

. correlate test2 test1
(obs=41)

             |    test2    test1
-------------+------------------
       test2 |   1.0000
       test1 |   0.3851   1.0000

. regress test2 test1

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  1,    39) =    6.79
       Model |  1087.97122     1  1087.97122           Prob > F      =  0.0129
    Residual |  6247.05317    39  160.180851           R-squared     =  0.1483
-------------+------------------------------           Adj R-squared =  0.1265
       Total |  7335.02439    40  183.37561            Root MSE      =  12.656

       test2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       test1 |   .3052753   .1171354     2.61   0.013     .0683465     .542204

Horace Secrist

Secrist was Professor of Statistics at Northwestern University in 1933 when, after ten years of effort collecting and analyzing data on profit rates of businesses, he published his magnum opus, “The Triumph of Mediocrity in Business.” In this 486 page tome he analyzed mountains of data and showed that, in every industry, over time, businesses with high profits tended to see their profits erode, while businesses with lower than average profits showed increases. Secrist thought he had found a natural law of competition (and maybe a fundamental cause of the depression, it was 1933 after all). The book initially received favorable reviews in places like the Royal Statistical Society, the Journal of Political Economy, and the Annals of the American Academy of Political and Social Science. Secrist was riding high.

It all came crashing down when Harold Hotelling, a well known economist and mathematical statistician, published a review in the Journal of the American Statistical Association (Hotelling, JASA, 1933). In the review Hotelling pointed out that if the data were arranged according to the values taken at the end of the period, instead of the beginning, the conclusions would be reversed. Secrist replied with a letter to the editor of the JASA, but Hotelling was having none of it. In his reply to Secrist, Hotelling wrote, “The seeming convergence is a statistical fallacy, resulting from the method of grouping. These diagrams really prove nothing more than that the ratios in question have a tendency to wander about. … This theorem is proved by simple mathematics. It is illustrated by genetic, astronomical, physical, sociological, and other phenomena. To ‘prove’ such a mathematical result by a costly and prolonged numerical study of many kinds of business profit and expense ratios is analogous to proving the multiplication table by arranging elephants in rows and columns, and then doing the same for numerous other kinds of animals. The performance, though perhaps entertaining, and having a certain pedagogical value, is not an important contribution either to zoology or to mathematics.” (Hotelling, JASA, 1934.) According to Hotelling, the real test would be a reduction in the variance of the series over time (not true of Secrist’s series). By the way, the “real test” of smaller variances over time might need to be taken with a grain of salt if measurement errors are getting smaller over time.

Even otherwise competent economists commit this error. See Milton Friedman, “Do old fallacies ever die?” Journal of Economic Literature 30:2129-2132, 1992. Friedman, who was a student of Hotelling’s when Hotelling was engaged in his exchange with Secrist, tells how he was inspired to avoid the regression fallacy in his own work. The result was the permanent income hypothesis, which can be derived from the test examples above by simply replacing test scores with consumption, ability with permanent income, and the random errors with transitory income.

Some tools of the trade

We will be using these concepts routinely throughout the semester.

Summation and deviations

Let the lowercase variable xi = Xi − X̄, where Xi is the raw data and X̄ is the sample mean.

(1) Σxi = Σ(Xi − X̄) = ΣXi − NX̄ = ΣXi − N(ΣXi/N) = ΣXi − ΣXi = 0

From now on, I will suppress the subscripts (which are always i = 1,…,N, where N is the sample size).

(2) Σx² = Σ(X − X̄)² = Σ(X² − 2X̄X + X̄²) = ΣX² − 2X̄ΣX + NX̄²

Note that X̄ = ΣX/N, so that ΣX = NX̄ and

    Σx² = ΣX² − 2NX̄² + NX̄² = ΣX² − NX̄²
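These summation identities are easy to verify numerically. A small sketch (mine, in Python rather than the book's Stata; the data are arbitrary made-up numbers):

```python
import numpy as np

# Any batch of numbers will do for checking identities (1) and (2).
rng = np.random.default_rng(2)
X = rng.uniform(0, 100, size=50)
N = len(X)
x = X - X.mean()                          # deviations from the mean

sum_x = x.sum()                           # identity (1): should be zero
lhs = (x ** 2).sum()                      # identity (2), left-hand side
rhs = (X ** 2).sum() - N * X.mean() ** 2  # identity (2), right-hand side
print(sum_x, lhs, rhs)
```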
(3) Σxy = Σ(X − X̄)(Y − Ȳ) = Σ(XY − X̄Y − ȲX + X̄Ȳ) = ΣXY − X̄ΣY − ȲΣX + NX̄Ȳ = ΣXY − NX̄Ȳ

(4) ΣxY = Σ(X − X̄)Y = ΣXY − X̄ΣY = ΣXY − NX̄Ȳ = Σxy

(5) ΣXy = ΣX(Y − Ȳ) = ΣXY − ȲΣX = ΣXY − NX̄Ȳ = Σxy = ΣxY

Expected value

The expected value of a random variable is the average result over repeated trials: what might be expected, on average. It is a weighted average of several events, where the weight attached to each outcome is the probability of the event:

    E(X) = p1X1 + p2X2 + … + pNXN = ΣpiXi = μx,   with Σpi = 1

If all the outcomes are equally likely, so that pi = 1/N, then

    E(X) = (1/N)X1 + (1/N)X2 + … + (1/N)XN = ΣXi/N = μx

Operationally, to find an expected value, just take the mean. Note that the expected value of X is the true (population) mean, μx, not the sample mean, X̄. The expected value of X is also called the first moment of X.

For example, if you play roulette in Monte Carlo, the slots are numbered 0 through 36, and the payoff if your number comes up is 35 to 1. What is the expected value of a one dollar bet? There are only two relevant outcomes: either your number comes up and you win $35, or it doesn't and you lose your dollar. So

    E(X) = p1X1 + p2X2 = (36/37)(−1) + (1/37)(35) = −.973 + .946 = −.027

So you can expect to lose about 2.7 cents every time you play. Of course, when you play you never lose 2.7 cents; you will win some and lose some. But on average, over the long haul, you will eventually lose. If you play 1000 times, you can expect to lose $27, and the casino can expect to make $27 for every 1000 bets. This is known as the house edge and is how the casinos make a profit.

A fair game is one for which the expected value is zero. If the expected value of the game is nonzero, the game is said not to be fair. However, we can make roulette a fair game by increasing the payoff to $36:

    E(X) = p1X1 + p2X2 = (36/37)(−1) + (1/37)(36) = 0
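The roulette arithmetic is easy to reproduce exactly. A Python sketch (mine; the book works these numbers by hand):

```python
from fractions import Fraction

# European wheel: 37 slots (0 through 36); a $1 straight-up bet pays 35 to 1.
p_win = Fraction(1, 37)
p_lose = Fraction(36, 37)

ev_35 = p_lose * (-1) + p_win * 35   # expected value with the 35-to-1 payoff
ev_36 = p_lose * (-1) + p_win * 36   # raise the payoff to $36: a fair game
print(float(ev_35), float(ev_36))    # about -0.027, and exactly 0
```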
Expected values, means and variance

(1) If b is a constant, then E(b) = b.

(2) E(bX) = (bX1 + bX2 + … + bXN)/N = bΣXi/N = bE(X)

(3) E(a + bX) = a + bE(X)

(4) E(X + Y) = E(X) + E(Y) = μx + μy

(5) E(aX + bY) = aE(X) + bE(Y) = aμx + bμy

(6) E[(bX)²] = b²E(X²)

(7) The variance of X is

    Var(X) = E(X − μx)² = Σ(X − μx)²/N

also written as Var(X) = E[X − E(X)]². The variance of X is also called the second moment of X.

(8) The covariance of X and Y is

    Cov(X,Y) = E[(X − μx)(Y − μy)]

(9) Var(X + Y) = E[(X + Y) − (μx + μy)]² = E[(X − μx) + (Y − μy)]²
              = E[(X − μx)² + (Y − μy)² + 2(X − μx)(Y − μy)]
              = Var(X) + Var(Y) + 2Cov(X,Y)

(10) If X and Y are independent, then E(XY) = E(X)E(Y).

(11) If X and Y are independent, then

    Cov(X,Y) = E[XY − Yμx − Xμy + μxμy] = E(XY) − μyμx − μxμy + μxμy
             = E(XY) − E(X)E(Y) = 0
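Rule (9) can be verified on simulated data. A sketch (mine, in Python with made-up data), using population (ddof=0) moments so the sample identity holds exactly:

```python
import numpy as np

# Hypothetical correlated variables for checking
# Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y).
rng = np.random.default_rng(3)
X = rng.normal(size=10_000)
Y = 0.5 * X + rng.normal(size=10_000)

lhs = np.var(X + Y)                                   # population variance (ddof=0)
rhs = np.var(X) + np.var(Y) + 2 * np.cov(X, Y, ddof=0)[0, 1]
print(lhs, rhs)                                       # identical up to rounding
```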
8 THEORY OF LEAST SQUARES

Ordinary least squares is almost certainly the most widely used statistical method ever created, aside maybe from the mean. The Gauss-Markov theorem is the theoretical basis for its ubiquity. We will investigate the Gauss-Markov theorem below. But first, we need to know some of the properties of estimators, so we can evaluate the least squares estimator.

Method of least squares

Least squares was first developed to help explain astronomical measurements. In the late 17th century, scientists were looking at the stars through telescopes and trying to figure out where they were. They knew that the stars didn’t move, but when they tried to measure where they were, using different methods, different scientists got different locations. So which observations are right? According to Adrien Marie Legendre in 1805, “Of all the principles that can be proposed for this purpose [analyzing large amounts of inexact astronomical data], I think there is none more general, more exact, or easier to apply, than…making the sum of the squares of the errors a minimum. By this method a kind of equilibrium is established among the errors which, since it prevents the extremes from dominating, is appropriate for revealing the state of the system which most nearly approaches the truth.”

Gauss published the method of least squares in his book on planetary orbits four years later, in 1809. Gauss claimed that he had been using the technique since 1795, thereby greatly annoying Legendre. It remains unclear who thought of the method first, but certainly Legendre beat Gauss to print. Gauss did show that least squares is preferred if the errors are distributed normally (“Gaussian”), using a Bayesian argument. Gauss published another derivation in 1823. This is the “Gauss” part of the Gauss-Markov theorem. Markov published his version of the theorem in 1912, inexplicably unaware of the 1823 Gauss paper; it contained nothing that Gauss hadn’t done a hundred years earlier. Somehow, in the 1920’s, Neyman called Markov’s results “Markov’s Theorem,” and Markov’s name is still on the theorem, even though he was almost 100 years late.
no matter how small. The large sample properties are true only as the sample size approaches infinity. For example the formula for computing the mean X = ΣX / N is an estimator. Efficiency is important because the more efficient the estimator. ˆ Bias= E β − β ˆ Therefore. How confident can we be in our estimate of the parameter in this case? The drawback to this definition of efficiency is that we are forced to compare unbiased estimators. For this reason. There are two types of such properties. It is frequently the case. The resulting value is the estimate. β is a random variable. or not. efficiency. ˆ ˆ The estimator β is efficient relative to the estimator β if Var β < Var β . You can visualize this by imagining a distribution with zero variance. The distribution of β across different samples is called the ˆ ˆ ˆ sampling distribution of β . ( ) ( ) 58 . Bias Bias is the difference between the expected value of the estimator and the true value of the parameter. and mean square error. Obviously. Since we get a different value for β each time we ˆ ˆ take a different sample. The properties of the sampling distribution of β will determine if β is a good estimator. Small sample properties There are three important small sample properties: bias. So is β a good estimator? slope of a regression line is ˆ The answer depends on the distribution of the estimator. the variance of β is smaller than the variance of any other unbiased estimator. ˆ than β is said to be efficient. the lower its variance and the smaller the variance the more precise the estimate. unbiased estimators are preferred. The so called “small sample” properties are true for all sample sizes. if E β = β the estimator is said to be unbiased.Properties of estimators An estimator is a formula for estimating the value of a parameter. at least in small samples. Efficient estimators allow us to make more powerful statistical statements concerning the true value of the parameter being estimated. 
we will use the more general concept of relative efficiency. ( ) ( ) Efficiency ˆ If for a given sample size. The formula for estimating the ˆ ˆ β = Σxy / Σx 2 and the resulting value is the estimate. as we shall see throughout the semester that we must compare estimators that may be biased.
Define the probability limit as the limit of a probability as the sample size goes to infinity. we mean efficient relative to some other estimator. we want the estimator to approach the truth. as the sample size goes to infinity. Mean square error We will see that we often have to compare the properties of two estimators. it is a number. A distribution with no variance is said to be degenerate. because it is no longer really a distribution. This requires that the estimator be centered on the truth. the probability that the estimator will be infinitesimally different from the truth is one. Mean square error is defined as the variance of the estimator. The concept of mean square error is useful here. when we say an estimator is efficient. with no variance. as N → ∞ .From now on. 59 . plus its squared bias. Ideally. with maximum precision. that is plim = lim Pr N →∞ Consistency A consistent estimator is one for which ˆ ˆ plim β − β < ε = lim Pr β − β < ε = 1 N →∞ ( ) ( ) That is. one of which is unbiased but inefficient while the other is biased but efficient. that is. ˆ ˆ ˆ mse β = Var β + Bias β ( ) ( ) ( ) 2 Large sample properties Large sample properties are also known as asymptotic properties because they tell us what happens to the estimator as the sample size goes to infinity. a certainty.
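Consistency in the mean square error sense is easy to see in a simulation. A sketch (mine, in Python; the Uniform(0,100) setup mirrors the earlier test-score example) shows the mse of the sample mean shrinking as N grows:

```python
import numpy as np

# Sample mean of Uniform(0, 100) draws: unbiased for mu = 50, with variance
# (100^2/12)/N, so its mean square error falls toward zero as N grows.
rng = np.random.default_rng(9)
mu = 50.0
mses = []
for N in (10, 100, 10_000):
    means = rng.uniform(0, 100, size=(500, N)).mean(axis=1)  # 500 sample means
    mses.append(((means - mu) ** 2).mean())                  # estimated mse
print(mses)                                                  # decreasing toward zero
```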
A biased estimator can be consistent if the bias gets smaller as the sample size increases. An estimator with bias that goes to zero as N → ∞ is asymptotically unbiased. All of the small sample properties have large sample analogs: for example, an estimator can be mean square error consistent, and an estimator can be asymptotically efficient relative to another estimator. Note that a consistent estimator is, in the limit, a number, a constant. As such it can be used in further derivations. For example, the ratio of two consistent estimators is consistent; the property of consistency is said to “carry over” to the ratio. This is not true of unbiasedness. All we know about an unbiased estimate is that its mean is equal to the truth; we do not know if our estimate is equal to the truth. If we compute the ratio of two unbiased estimates, we don’t know anything about the properties of the ratio.

GaussMarkov theorem

Consider the following simple linear model,

    Yi = α + βXi + ui

where α and β are fixed but unknown parameters and ui is a random variable. The GaussMarkov assumptions are (1) the observations on X are a set of fixed numbers; (2) Yi is distributed with a mean of α + βXi for each observation i; (3) the variance of Yi (σ²) is constant; and (4) the observations on Y are independent from each other, which means that the observation on Y1 is independent of Y2, which is independent of Y3, and so on.

The basic idea behind these assumptions is that the model describes a laboratory experiment. The researcher sets the value of X without error and then observes the value of Y, with some error. The observations of Y come from a simple random sample. The mean and the variance of Y do not change between observations.

The GaussMarkov assumptions, which are currently expressed in terms of the distribution of Y, can be rephrased in terms of the distribution of the error term, u. Since

    ui = Yi − α − βXi

the mean of the error is

    E(ui) = E(Yi) − α − βXi = 0

because E(Yi) = α + βXi by assumption (2), with the X’s fixed by assumption (1). Also, the variance of u is equal to the variance of Y. Therefore, if the GaussMarkov assumptions are true, the error term u is distributed with mean zero and variance σ², and the GM assumptions can be succinctly written as

    Yi = α + βXi + ui,    ui ~ iid(0, σ²)

where the tilde means “is distributed as,” and iid means “independently and identically distributed.” The independent assumption refers to the simple random sample, while the identical assumption means that the mean and variance (and any other parameters) are constant.

OK, so why is least squares so good? Here is the GaussMarkov theorem: if the GM assumptions hold, then the ordinary least squares estimates of the parameters are unbiased, consistent, and efficient relative to any other linear unbiased estimator. In fact, if the GaussMarkov assumptions are true, then the OLS estimator is the Best Linear Unbiased Estimator (BLUE). The proofs of the unbiased and consistent parts are easy, as we shall see. Assumption (1) is very useful for proving the theorem, but it is not strictly necessary. The crucial requirement is that X must be independent of u; that is, the covariance between X and u must be zero, E(Xu) = 0. Assumption (2) is crucial: without it, least squares doesn’t work. Assumption (3) can be relaxed, as we shall see in Chapter 9 below. Assumption (4) can also be relaxed, as we shall see in Chapter 12 below.

We know that E(Yi) = α + βXi. The “empirical analog” of this (using the sample observations instead of the theoretical model) is Ȳ = α + βX̄, where Ȳ is the sample mean of Y and X̄ is the sample mean of X. Therefore, we can subtract the means to get

    Yi − Ȳ = α + βXi + ui − α − βX̄ = β(Xi − X̄) + ui

Let yi = Yi − Ȳ and xi = Xi − X̄, where the lowercase letters indicate deviations from the mean. Then

    yi = βxi + ui

(recall that the mean of u is zero, so that ui − ū = ui). What we have done is translate the axes so that the regression line goes through the origin. Note that everything in the model is constant except Y and u, by GaussMarkov assumption (1).

Define e as the error,

    ei = yi − βxi

so that the sum of squared errors is

    Σei² = Σ(yi − βxi)²
We want to minimize the sum of squared errors by choosing β, so take the derivative with respect to β and set it equal to zero:

    d/dβ Σei² = d/dβ Σ(yi² − 2βxiyi + β²xi²) = Σ(−2xiyi + 2βxi²) = −2Σxi(yi − βxi) = 0

Divide both sides by −2 and rearrange:

    Σxiyi − βΣxi² = 0
    βΣxi² = Σxiyi

Solve for β and let the solution value be β̂:

    β̂ = Σxiyi / Σxi²

This is the famous OLS estimator. β̂ is an estimator: a formula for deriving an estimate. When we plug in the values of x and y from the sample, we can compute the estimate, which is a numerical value. Note that so far we have only described the data with a linear function; we have not used our estimates to make any inferences concerning the true values of α or β.

Once we get an estimate of β, we can back out the OLS estimate of α using the means:

    Ȳ = α̂ + β̂X̄
    α̂ = Ȳ − β̂X̄

The first property of the OLS estimate β̂ is that it is a linear function of the observations on Y:

    β̂ = Σxiyi/Σxi² = Σ(xi/Σxi²)yi = Σwiyi,   where   wi = xi/Σxi²

The formula for β̂ is linear because the w’s are constants. (Remember, the x’s are constant, by GaussMarkov assumption (1).)

Sampling distribution of β̂

Because y is a random variable, and we now know that the OLS estimator β̂ is a linear function of y, it must be that β̂ is a random variable. If β̂ is a random variable, it has a distribution with a mean and a variance. We call this distribution the sampling distribution of β̂. We can use this sampling distribution to make inferences concerning the true value of β. There is also a sampling distribution of α̂, but we seldom want to make inferences concerning the intercept, so we will ignore it here.

Mean of the sampling distribution of β̂

The formula for β̂ is β̂ = Σxiyi/Σxi². Substitute yi = βxi + ui:

    β̂ = Σxi(βxi + ui)/Σxi² = βΣxi²/Σxi² + Σxiui/Σxi² = β + Σxiui/Σxi²

The mean of β̂ is its expected value:

    E(β̂) = E(β + Σxiui/Σxi²) = β + ΣxiE(ui)/Σxi² = β

because the x’s are fixed and E(ui) = 0. The fact that the mean of the sampling distribution of β̂ is equal to the truth means that the ordinary least squares estimator is unbiased, and thus a good basis on which to make inferences concerning the true β. Note that the unbiasedness property does not depend on any assumption concerning the form of the distribution of the error terms. We have not, for example, assumed that the errors are distributed normally (we will assume normally distributed errors below, when we turn to inference). Unbiasedness relies on only two of the GM assumptions: that the x’s are fixed and that E(ui) = 0. So we really don’t need the independent and identical assumptions for unbiasedness. However, the independent and identical assumptions are useful in deriving the property that OLS is efficient: it can be shown that the OLS estimator has the smallest variance of any linear unbiased estimator. That is, if these two assumptions, along with the others, are true, then the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
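The slope formula and its unbiasedness are easy to see in a simulation. This sketch is mine (Python rather than the book's Stata, with made-up fixed x's): β̂ = Σxy/Σx² is computed for many replicated samples, and the average estimate centers on the true β.

```python
import numpy as np

# Fixed x's (GM assumption (1)), true beta = 1.5, fresh iid errors each sample.
rng = np.random.default_rng(5)
beta = 1.5
x = np.linspace(-1.0, 1.0, 50)             # deviation form: already mean zero

estimates = []
for _ in range(2000):
    u = rng.normal(0.0, 1.0, size=50)      # u ~ iid(0, sigma^2)
    y = beta * x + u
    estimates.append((x * y).sum() / (x * x).sum())  # beta-hat = sum(xy)/sum(x^2)
estimates = np.asarray(estimates)

print(estimates.mean())                    # centers on the true beta
```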
Variance of β̂

The variance of β̂ can be derived as follows:

    Var(β̂) = E[β̂ − E(β̂)]² = E(β̂ − β)²

We know that β̂ = β + Σxiui/Σxi², so that

    β̂ − β = Σxiui/Σxi² = Σwiui

and

    E(β̂ − β)² = E(Σwiui)² = E(w1u1 + w2u2 + … + wNuN)²
              = E(w1²u1² + w2²u2² + … + wN²uN² + w1w2u1u2 + w1w3u1u3 + … + wN−1wNuN−1uN)

The covariances among the errors are zero due to the independence assumption, E(uiuj) = 0 for i ≠ j. Therefore,

    E(Σwiui)² = w1²E(u1²) + w2²E(u2²) + … + wN²E(uN²) = w1²σ1² + w2²σ2² + … + wN²σN²

However, σ1² = σ2² = … = σN² = σ² because of the identical assumption. Therefore,

    Var(β̂) = (w1² + w2² + … + wN²)σ² = σ²Σwi² = σ²Σxi²/(Σxi²)² = σ²/Σxi²

Consistency of OLS

To be consistent, the mean square error of β̂ must go to zero as the sample size, N, goes to infinity. The mean square error of β̂ is the sum of the variance plus the squared bias,

    mse(β̂) = Var(β̂) + [E(β̂) − β]²

Obviously the bias is zero, since the OLS estimator is unbiased. Therefore the consistency of the ordinary least squares estimator β̂ depends on the variance,

    Var(β̂) = σ²/Σxi²

As N goes to infinity, Σxi² goes to infinity, because we keep adding more and more squared terms. Therefore, the variance of β̂ goes to zero as N goes to infinity, the mean square error goes to zero, and the OLS estimator is consistent.

Proof of the GaussMarkov Theorem

According to the GaussMarkov theorem, if the GM assumptions are true, then of all linear unbiased estimators, the OLS estimator has the smallest variance. Stated another way, any other linear unbiased estimator will have a larger variance than the OLS estimator. That is, the ordinary least squares estimator is the Best Linear Unbiased Estimator (BLUE). The GaussMarkov assumptions are as follows: (1) the true relationship between y and x (expressed as deviations from the mean) is yi = βxi + ui; (2) the residuals are independent and the variance is constant, ui ~ iid(0, σ²); (3) the x’s are fixed numbers.

We know that the OLS estimator is a linear estimator,

    β̂ = Σxy/Σx² = Σwy,   where   w = x/Σx²

Consider any other linear unbiased estimator,

    β̃ = Σcy = Σc(βx + u) = βΣcx + Σcu

Taking expected values,

    E(β̃) = βΣcx + ΣcE(u) = βΣcx

since E(u) = 0 as before. This estimator is unbiased if Σcx = 1. Note, since Σcx = 1,

    β̃ = β + Σcu   and   β̃ − β = Σcu

Find the variance of β̃:

    Var(β̃) = E(β̃ − β)² = E(Σcu)² = E[c1u1 + c2u2 + … + cNuN]²
            = c1²σ² + c2²σ² + … + cN²σ² = σ²Σc²

Here is a useful identity:

    c = w + (c − w)
    c² = w² + (c − w)² + 2w(c − w)
    Σc² = Σw² + Σ(c − w)² + 2Σw(c − w)

The last term on the right-hand side is equal to zero:

    Σw(c − w) = Σwc − Σw² = Σcx/Σx² − Σx²/(Σx²)² = 1/Σx² − 1/Σx² = 0

since Σcx = 1. Therefore,

    Var(β̃) = σ²Σc² = σ²[Σw² + Σ(c − w)²] = σ²Σx²/(Σx²)² + σ²Σ(c − w)²
            = σ²/Σx² + σ²Σ(c − w)² = Var(β̂) + σ²Σ(c − w)²

So Var(β̃) ≥ Var(β̂), because σ²Σ(c − w)² ≥ 0. The only way the two variances are equal is if c = w, which is the OLS estimator.

Inference and hypothesis testing

Normal, Student’s t, Fisher’s F, and Chi-square

Before we can make inferences or test hypotheses concerning the true value of β, we need to review some famous statistical distributions. We will be using all of the following distributions throughout the semester.
We can generate a normal variable with 100. then z has a standard normal distribution because the mean of z is zero and its standard deviation is one.0042645 1. or zscore.518918. e 2σ 2πσ or simply X ~ N ( µ .518918 4. start=4. histogram y (bin=50. Consider the linear function of X.448323 . Let a = − µ σ and b = 1 σ then z = a + bX = − µ σ + X σ = ( X − µ ) σ which is the formula for a standard score. then any linear transformation of X is also distributed normally. set obs 100000 obs was 0. it has the formula.” If X is distributed normally. . Min Max +y  100000 . gen y=invnorm(uniform()) . b 2σ 2 ) This turns out to be extremely useful. That is.σ 2 )= 1 − ( X − µ )2 2 ϕ ( z) = 1 2π e − z2 2 Here is the graph. z = a + bX . Dev. z. now 100000 .17934483) 67 . If we express X as a standard score. With mean and variance 2. if X ~ N ( µ .Normal distribution The normal distribution is the familiar bell shaped curve. f ( X  µ . σ 2 ) where the tilde is read “is distributed as. σ 2 ) then ( a + bX ) ~ N ( a + bµ .002622 4.000 observations as follows. That is. summarize y Variable  Obs Mean Std. width=.
68 . the area under the curve must sum to one. Here is the graph. Since chisquare is a distribution of squared values. (1) If z ~ N(0.2 . and.3 . and chisquare are derived from the normal and arise as distributions of sums of variables. Fisher’s F.0 . being a probability distribution.4 4 2 0 y 2 4 The Student’s t. They have associated with them their socalled degrees of freedom which are the number of terms in the summation. Chisquare distribution If z is distributed standard normal.1 Density .1). it has to be a skewed distribution of positive values. then z 2 ~ χ 2 (1) which has a mean of 1 and a variance of 2. then its square is distributed as chisquare with one degree of freedom. asymptotic to the horizontal axis.
(2) If $X_1, X_2, \ldots, X_n$ are independent $\chi^2(1)$ variables, then $\sum_{i=1}^{n} X_i \sim \chi^2(n)$, with mean $n$ and variance $2n$.

(3) Therefore, if $z_1, z_2, \ldots, z_n$ are independent standard normal variables, then $\sum_{i=1}^{n} z_i^2 \sim \chi^2(n)$.

(4) If $z_1, z_2, \ldots, z_n$ are independent $N(0, \sigma^2)$ variables, then $\sum_{i=1}^{n} (z_i/\sigma)^2 \sim \chi^2(n)$.

(5) If $X_1 \sim \chi^2(n_1)$ and $X_2 \sim \chi^2(n_2)$ and $X_1$ and $X_2$ are independent, then $X_1 + X_2 \sim \chi^2(n_1 + n_2)$.

Here are the graphs for n = 2, 3, and 4. [Figure: chi-square densities for n = 2, 3, and 4.]

We can generate a chi-square variable by squaring a bunch of normal variables.

. set obs 10000
obs was 0, now 10000
. *this creates a chisquare variable with n=10 degrees of freedom
. gen z1=invnorm(uniform())
. gen z2=invnorm(uniform())
. gen z3=invnorm(uniform())
. gen z4=invnorm(uniform())
. gen z5=invnorm(uniform())
. gen z6=invnorm(uniform())
. gen z7=invnorm(uniform())
. gen z8=invnorm(uniform())
. gen z9=invnorm(uniform())
. gen z10=invnorm(uniform())
. gen v=z1^2+z2^2+z3^2+z4^2+z5^2+z6^2+z7^2+z8^2+z9^2+z10^2
. histogram v
(bin=40, start=.0842115, width=.91651339)
[Figure: histogram of v, the $\chi^2(10)$ variable just created; skewed to the right, running from 0 to about 40.]

F-distribution

(6) If $X_1 \sim \chi^2(n_1)$ and $X_2 \sim \chi^2(n_2)$ and $X_1$ and $X_2$ are independent, then

\[ \frac{X_1/n_1}{X_2/n_2} \sim F(n_1, n_2) \]

This is Fisher's famous F-test. It is the ratio of two squared quantities, so, like the chi-square, it is positive and therefore skewed. Here are graphs of some typical F distributions. [Figure: F densities for several degree-of-freedom combinations.]

We can create a variable with an F distribution as follows: generate five standard normal variables, then sum their squares to make a chi-square variable (v1) with five degrees of freedom. Finally, divide v1 by the chi-square variable v created above, dividing each chi-square by its degrees of freedom, to make a variable distributed as F with 5 degrees of freedom in the numerator and 10 degrees of freedom in the denominator.

. gen x1=invnorm(uniform())
. gen x2=invnorm(uniform())
. gen x3=invnorm(uniform())
. gen x4=invnorm(uniform())
. gen x5=invnorm(uniform())
. gen v1=x1^2+x2^2+x3^2+x4^2+x5^2
. gen f=(v1/5)/(v/10)
. * f has 5 and 10 degrees of freedom
. summarize v1 v f

The summarize output should show means near 5 for v1, near 10 for v, and near $n_2/(n_2 - 2) = 10/8 = 1.25$ for f; all three variables are positive and skewed.

. histogram f if f<3 & f>=0

[Figure: histogram of f, positive and skewed to the right.]

t-distribution

(7) If $z \sim N(0,1)$ and $X \sim \chi^2(n)$, and z and X are independent, then the ratio

\[ T = \frac{z}{\sqrt{X/n}} \sim t(n) \]

This is Student's t distribution. It can be positive or negative and is symmetric, like the normal distribution. It looks like the normal distribution but with "fat tails," that is, more probability in the tails of the distribution compared to the standard normal.
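Both constructions — the F ratio above and the t ratio just defined — are easy to verify by simulation. Here is a sketch in Python, used only as an outside cross-check of the Stata experiments (the helper name and sample sizes are illustrative):

```python
import math
import random
import statistics

random.seed(3)

def chi2(df):
    """One chi-square draw: the sum of df squared standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

# F(5, 10): ratio of two independent chi-squares, each divided by its df.
f = [(chi2(5) / 5) / (chi2(10) / 10) for _ in range(20_000)]

# t(10): standard normal divided by the square root of chi2(10)/10.
t = [random.gauss(0, 1) / math.sqrt(chi2(10) / 10) for _ in range(20_000)]

f_mean = statistics.mean(f)   # theoretical mean is n2/(n2-2) = 1.25
t_mean = statistics.mean(t)   # symmetric around zero
t_sd = statistics.stdev(t)    # sqrt(n/(n-2)) ~ 1.118 > 1: the "fat tails"
```

The t draws have a standard deviation visibly above one, which is exactly the extra tail probability relative to the standard normal.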
(8) If $T \sim t(n)$, then

\[ T^2 = \frac{z^2}{X/n} \sim F(1, n) \]

because the squared t-ratio is the ratio of two chi-squared variables, each divided by its degrees of freedom.

We can create a variable with a t distribution by dividing a standard normal variable by the square root of a chi-square variable divided by its degrees of freedom.

. set obs 100000
obs was 0, now 100000
. gen y1=invnorm(uniform())
. gen y2=invnorm(uniform())
. gen y3=invnorm(uniform())
. gen y4=invnorm(uniform())
. gen y5=invnorm(uniform())
. gen v=y1^2+y2^2+y3^2+y4^2+y5^2
. gen z=invnorm(uniform())
. gen t=z/sqrt(v/5)
. histogram t if abs(t)<2, normal
(bin=49, start=-1.9989421, width=.08160601)

[Figure: histogram of t with a normal density overlaid; t running from -2 to 2.]
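Property (8) can be checked the same way: squaring simulated t(10) draws should reproduce the mean of an F(1, 10) variable, namely $n/(n-2) = 1.25$. A Python sketch (illustrative, not part of the Stata session):

```python
import math
import random
import statistics

random.seed(4)

def chi2(df):
    # sum of df squared standard normals
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

# Draw from t(10), then square: the squares should behave like F(1, 10).
t2 = [(random.gauss(0, 1) / math.sqrt(chi2(10) / 10)) ** 2
      for _ in range(20_000)]

t2_mean = statistics.mean(t2)   # F(1, 10) has mean 10/8 = 1.25
```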
Asymptotic properties

(9) As n goes to infinity, $t(n) \to N(0, 1)$.

(10) As $n_2 \to \infty$,

\[ F = \frac{X_1/n_1}{X_2/n_2} \to \frac{X_1}{n_1} \]

because as $n_2 \to \infty$, $X_2/n_2 \to 1$, so that the ratio goes to one. Therefore $n_1 F \to \chi^2(n_1)$.

(11) As $N \to \infty$, $NR^2 \to \chi^2$, where

\[ R^2 = \frac{RSS}{TSS} = \frac{RSS/N}{TSS/N} \]

and RSS is the regression (model) sum of squares. Then

\[ NR^2 = N\,\frac{RSS/N}{TSS/N} \]

and as $N \to \infty$, $TSS/N \to 1$, so that $NR^2 \to RSS \sim \chi^2$.

(12) As $N \to \infty$, $NF \to \chi^2$. See (10) above.

Testing hypotheses concerning $\beta$

Let's make the usual assumption that the errors in the original regression model are distributed normally:

\[ u_i \sim IN(0, \sigma^2) \]

which is translated as "$u_i$ is distributed independent normal with mean zero and variance $\sigma^2$." (We don't need the "identical" assumption because the normal distribution has a constant variance.) Since $\hat\beta$ is a linear function of the error terms and the error terms are distributed normally, $\hat\beta$ is distributed normally. So, we can summarize the sampling distribution of $\hat\beta$ as

\[ \hat\beta \sim N\!\left(\beta,\; \frac{\sigma^2}{\sum x_i^2}\right) \]

We are now in a position to make inferences concerning the true value of $\beta$. To test the null hypothesis that $\beta = \beta_0$, where $\beta_0$ is some hypothesized value of $\beta$, construct the test statistic

\[ T = \frac{\hat\beta - \beta_0}{S_{\hat\beta}} \sim t(N - k) \]

where

\[ S_{\hat\beta} = \sqrt{\frac{\hat\sigma^2}{\sum x_i^2}} \]

is the standard error of $\hat\beta$, and

\[ \hat\sigma^2 = \frac{\sum e_i^2}{N - k} \]

is the estimated error variance, where $e_i^2$ is the squared residual corresponding to the i'th observation, N is the sample size, and k is the number of parameters in the regression. $\hat\sigma^2$ is equal to the error sum of squares divided by the degrees of freedom (N-k) and is also known as the mean square error. In this simple regression, k = 2 because we have two parameters.
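To make the recipe concrete, here is the full computation on a tiny made-up data set (four observations, invented purely for illustration): estimate $\hat\beta$, form $\hat\sigma^2$ with $N - k = 2$ degrees of freedom, and take the ratio.

```python
import math

# Hypothetical data: y rises by roughly 2 for each unit of x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
n, k = len(x), 2

xbar = sum(x) / n
ybar = sum(y) / n

# OLS slope and intercept.
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = sxy / sxx
a = ybar - b * xbar

# Estimated error variance and the standard error of the slope.
ess = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
sigma2_hat = ess / (n - k)
se_b = math.sqrt(sigma2_hat / sxx)

# Significance test: beta_0 = 0.
t_ratio = b / se_b
```

With these numbers $\hat\beta = 1.94$, $S_{\hat\beta} \approx 0.091$, and $T \approx 21.4$, far beyond any conventional critical value of $t(2)$.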
Therefore. To prove this. We can use the data in two ways. If we have three data points. We can also look at the regression problem this way: it takes two points to determine a straight line. This leaves the remaining N2 data points to estimate the variance. The sum is 6 and the mean is 2. ( X − µ ) = X − N µ = NX − N µ = N ( X − µ ) 74 . In the example of the mean. There is one degree of freedom that could be used to estimate the variance. then there are many lines that could be defined as “fitting” the data. but this is a mathematical problem. Sir Ronald Fisher solved the problem in 1922 and called the parameter degrees of freedom. If we use data points to estimate the parameters. yielding N1 data points to estimate the variance. Suppose we have a data set with N observations. we have to demonstrate that this is an unbiased estimator for the variance of X (and therefore. This is the “t” reported in the regress output in Stata.Notice that the test statistic. If we only have two points. we can’t use them to estimate the variance. not a statistical one. Suppose we are trying to estimate the mean of a data set with X=(1. That is. Pearson. it is distributed as student’s t with Nk degrees of freedom. ˆ β . to a chisquare variable (the sum of a bunch of squared. Also. we can find any data point knowing only the other two. We estimate the ˆ ˆ ˆ slope with the formula β = xy / x 2 and the intercept with the formula α = Y − β X . suppose we are estimating a simple regression model with ordinary least squares. also known as a tratio. error terms). Once we know this.tufts. applied it to a variety of problems but couldn’t get the parameter value quite right. To take another example. The best explanation of the concept that I know of is due to Gerrard Dallal (http://www. ˆ β . using up two data points. T= S βˆ Degrees of Freedom Karl Pearson. Our simple rule is that the number of degrees of freedom is the sample size minus the number of parameters estimated. 
Start by adding and subtracting $\mu$ inside the square and expanding the quadratic:

\[ \sum (X - \bar X)^2 = \sum \left[(X - \mu) - (\bar X - \mu)\right]^2 = \sum (X - \mu)^2 - 2(\bar X - \mu)\sum (X - \mu) + N(\bar X - \mu)^2 \]

because $\bar X$ and $\mu$ are both constants with respect to the summation.
. yields an unbiased estimate of the true error variance. ( X − X ) 2 = ( X − µ ) 2 − 2 N ( X − µ ) 2 + N ( X − µ ) 2 = ( X − µ ) 2 − N ( X − µ ) 2 Now we can take the expected value of the proposed formula. E ( x 2 ) = E ( ( X − X ) 2 ) = ( N − 1)σ x2 . ( X − µ ) 2 = Var ( X ) N 1 1 Var ( X ) = Var X = 2 Var ( X 1 + X 2 + . 75 . ( X − X ) 2 N 1 E ( X − µ ) 2 − ( X − µ )2 = E N −1 N −1 N −1 = 1 N E ( X − µ )2 − E ( X − µ )2 N −1 N −1 Note that the second term is the variance of the mean of X: E ( X − µ )2 = But. ( X − X ) 2 N 1 E ( X − µ )2 − ( X − µ )2 = E N −1 N −1 N −1 2 N N σx N 1 = σ x2 − = σ x2 − σ x2 N −1 N −1 N N −1 N −1 N −1 2 = σ x = σ x2 N −1 Note (for later reference).which means that the middle term in the quadratic form above is −2( X − µ ) ( X − µ ) = −2( X − µ ) N ( X − µ ) = −2 N ( X − µ ) 2 and thefore.. Estimating the variance of the error term We want to prove that the formula ˆ σ2 = e2 N −2 where e is the residual from a simple regression model. + X N ) N N σ2 1 2 = 2 Nσ x = x N N Therefore.
In deviations, the model is

\[ y = \beta x + u \]

and the residual is $e = y - \hat\beta x$. So,

\[ e = \beta x + u - \hat\beta x = u - (\hat\beta - \beta)x \]

Squaring both sides,

\[ e^2 = u^2 - 2(\hat\beta - \beta)xu + (\hat\beta - \beta)^2 x^2 \]

Summing,

\[ \sum e^2 = \sum u^2 - 2(\hat\beta - \beta)\sum xu + (\hat\beta - \beta)^2 \sum x^2 \]

Taking expected values,

\[ E\sum e^2 = E\sum u^2 - 2E\left[(\hat\beta - \beta)\sum xu\right] + E\left[(\hat\beta - \beta)^2\right]\sum x^2 \]

Note that the variance of $\hat\beta$ is

\[ Var(\hat\beta) = E(\hat\beta - \beta)^2 = \frac{\sigma^2}{\sum x^2} \]

which means that the third term of the expectation expression is

\[ E\left[(\hat\beta - \beta)^2\right]\sum x^2 = \frac{\sigma^2}{\sum x^2}\sum x^2 = \sigma^2 \]

We know from the theory of least squares that

\[ \hat\beta - \beta = \frac{\sum xu}{\sum x^2} \]

which implies that

\[ \sum xu = (\hat\beta - \beta)\sum x^2 \]

Therefore the second term in the expectation expression is
\[ -2E\left[(\hat\beta - \beta)\sum xu\right] = -2E\left[(\hat\beta - \beta)^2\right]\sum x^2 = -2\frac{\sigma^2}{\sum x^2}\sum x^2 = -2\sigma^2 \]

The first term in the expectation expression is

\[ E\sum u^2 = (N-1)\sigma^2 \]

because, as shown above, the unbiased estimator of the variance of any variable x is $S_x^2 = \sum x^2/(N-1)$, so that $E(\sum x^2) = (N-1)S_x^2$; in the current application the deviations u play the role of x, and $S_x^2 = \sigma^2$. Therefore, putting all three terms together, we get

\[ E\sum e^2 = (N-1)\sigma^2 + \sigma^2 - 2\sigma^2 = N\sigma^2 - \sigma^2 + \sigma^2 - 2\sigma^2 = (N-2)\sigma^2 \]

which implies that the unbiased estimate of the error variance is

\[ \hat\sigma^2 = \frac{\sum e^2}{N-2} \]

since

\[ E(\hat\sigma^2) = E\left[\frac{\sum e^2}{N-2}\right] = \frac{E\sum e^2}{N-2} = \frac{(N-2)\sigma^2}{N-2} = \sigma^2 \]

If we think of each of the squared residuals as an estimate of the variance of the error term, then the formula tells us to take the average. However, we only have N-2 independent observations on which to base our estimate of the variance, so we divide by N-2 when we take the average.
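The (N-2) result can be verified by Monte Carlo: simulate many small regressions with a known error variance and average the residual sums of squares. A Python sketch (the design below — N = 10, $\sigma^2 = 1$, fixed x's — is invented for illustration):

```python
import random

random.seed(6)

N, reps = 10, 20_000
x = [i - 4.5 for i in range(N)]      # fixed regressor, already in deviation form
total_ess = 0.0

for _ in range(reps):
    # True model: y = 2 + 0.5 x + u, with u ~ N(0, 1).
    y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]
    ybar = sum(y) / N
    sxx = sum(xi * xi for xi in x)
    b = sum(xi * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar                          # xbar = 0, so the intercept is just ybar
    total_ess += sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

mean_ess = total_ess / reps           # should approach (N - 2) * sigma^2 = 8
```

The average residual sum of squares settles near 8, not 10: two degrees of freedom were spent estimating the slope and intercept.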
Chebyshev's Inequality

Arguably the most remarkable theorem in statistics, Chebyshev's inequality states that, for any variable X, from any arbitrary distribution, the probability that an observation lies more than some distance, $\varepsilon$, from the mean must be less than or equal to $\sigma_x^2/\varepsilon^2$. That is,

\[ P(|X - \mu| \ge \varepsilon) \le \frac{\sigma_x^2}{\varepsilon^2} \]

The following diagram might be helpful. [Figure: a number line centered at $\mu$, with a box extending $\varepsilon$ on either side of $\mu$; observations $X_i$ fall both inside and outside the box.]

Proof. The variance of X can be written as

\[ \sigma_x^2 = \sum (X - \mu)^2 \frac{1}{N} \]

if all observations have the same probability (1/N) of being observed. More generally,

\[ \sigma_x^2 = \sum (X - \mu)^2 P(X) \]

Suppose we limit the summation on the right-hand side to those values of X that lie outside the $\pm\varepsilon$ box. Then the new summation could not be larger than the original summation, because we are omitting some (positive, squared) terms. That is,

\[ \sigma_x^2 = \sum (X - \mu)^2 P(X) \ge \sum_{|X - \mu| \ge \varepsilon} (X - \mu)^2 P(X) \]

We also know that all of the X's in the limited summation are at least $\varepsilon$ away from the mean. (Some of them are much further than $\varepsilon$ away.) Therefore, replacing $(X - \mu)^2$ with $\varepsilon^2$,

\[ \sigma_x^2 \ge \sum_{|X - \mu| \ge \varepsilon} (X - \mu)^2 P(X) \ge \sum_{|X - \mu| \ge \varepsilon} \varepsilon^2 P(X) = \varepsilon^2 \sum_{|X - \mu| \ge \varepsilon} P(X) = \varepsilon^2 P(|X - \mu| \ge \varepsilon) \]
So,

\[ \sigma_x^2 \ge \varepsilon^2 P(|X - \mu| \ge \varepsilon) \]

Dividing both sides by $\varepsilon^2$ and turning it around, we get

\[ P(|X - \mu| \ge \varepsilon) \le \frac{\sigma_x^2}{\varepsilon^2} \]

QED

The theorem is often stated as follows: for any set of data and any constant k > 1, at least $1 - 1/k^2$ of the observations must lie within k standard deviations of the mean. To prove this version, let $\varepsilon = k\sigma_x$:

\[ P(|X - \mu| \ge k\sigma_x) \le \frac{\sigma_x^2}{k^2\sigma_x^2} = \frac{1}{k^2} \]

Since this is the probability that the observations lie more than k standard deviations from the mean, the probability that the observations lie less than k sd from the mean is one minus this probability:

\[ P(|X - \mu| \le k\sigma_x) \ge 1 - \frac{1}{k^2} \]

So, if we have any data whatever, and we set k = 2, we know that at least $1 - \frac{1}{4} = \frac{3}{4} = 75\%$ of the data lie within two standard deviations of the mean. (We can be more precise if we know the distribution. For example, 95% of normally distributed data lie within two standard deviations of the mean.)

Law of Large Numbers

The law of large numbers states that if a situation is repeated again and again, the proportion of successful outcomes will tend to approach the constant probability that any one of the outcomes will be a success.

For example, suppose we are presented with a large box full of black and white balls. We want to know the proportion of black balls, but we are not allowed to look inside the box. What do we do? Answer: sample with replacement. The number of black balls that we draw divided by the total number of balls drawn will give us an estimate. The more balls we draw, the better the estimate.

Or suppose we have been challenged to guess how many thumbtacks will end up face down if 100 are dropped off a table. How would we prepare for such a challenge?
Toss a thumbtack into the air many times and record how many times it ends up face down. Divide that count by the number of tosses and you have the proportion. Multiply by 100 and that is your guess.
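Chebyshev's inequality — the engine behind the law of large numbers proved next — is easy to check numerically on a distribution that is nothing like the normal. A Python sketch (the exponential distribution here is an arbitrary choice for illustration):

```python
import random

random.seed(7)

# Exponential with rate 1: mean = 1, standard deviation = 1, heavily skewed.
x = [random.expovariate(1.0) for _ in range(100_000)]

mu, sd, k = 1.0, 1.0, 2
share_outside = sum(abs(xi - mu) >= k * sd for xi in x) / len(x)
chebyshev_bound = 1 / k ** 2   # = 0.25, regardless of the distribution
```

For the exponential, the true probability of landing more than two standard deviations from the mean is $e^{-3} \approx 0.05$ — comfortably below the distribution-free bound of 0.25, as the theorem guarantees.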
Mathematically, the theorem is as follows. Let $X_1, X_2, \ldots, X_N$ be an independent trials process where $E(X_i) = \mu$ and $\sigma_x^2 = Var(X_i)$. Let $S = X_1 + X_2 + \cdots + X_N$. Then, for any $\varepsilon > 0$,

\[ P\!\left(\left|\frac{S}{N} - \mu\right| \ge \varepsilon\right) \to 0 \text{ as } N \to \infty \]

Equivalently,

\[ P\!\left(\left|\frac{S}{N} - \mu\right| < \varepsilon\right) \to 1 \text{ as } N \to \infty \]

Proof. We know that

\[ Var(S) = Var(X_1 + X_2 + \cdots + X_N) = N\sigma_x^2 \]

because there is no covariance between the X's. Therefore,

\[ Var\!\left(\frac{S}{N}\right) = Var\!\left(\frac{X_1 + X_2 + \cdots + X_N}{N}\right) = \frac{N\sigma_x^2}{N^2} = \frac{\sigma_x^2}{N} \]

Also,

\[ E\!\left(\frac{S}{N}\right) = E\!\left(\frac{X_1 + X_2 + \cdots + X_N}{N}\right) = \frac{N\mu}{N} = \mu \]

By Chebyshev's inequality we know that, for any $\varepsilon > 0$,

\[ P\!\left(\left|\frac{S}{N} - \mu\right| \ge \varepsilon\right) \le \frac{Var(S/N)}{\varepsilon^2} = \frac{\sigma_x^2}{N\varepsilon^2} \to 0 \text{ as } N \to \infty \]

Equivalently, the probability that the difference between S/N and the true mean is less than $\varepsilon$ is one minus this:

\[ P\!\left(\left|\frac{S}{N} - \mu\right| < \varepsilon\right) \to 1 \text{ as } N \to \infty \]

Since S/N is an average, the law of large numbers is often called the law of averages.

Central Limit Theorem

This is arguably the most amazing theorem in mathematics. Statement: Let x be a random variable distributed according to f(x) with mean $\mu$ and variance $\sigma_X^2$. Let $\bar X$ be the mean of a random sample of size N from f(x). Let

\[ z = \frac{\bar X - \mu}{\sqrt{\sigma_X^2 / N}} \]

The density of z will approach the standard normal (mean zero and variance one) as N goes to infinity. The incredible part of this theorem is that no restriction is placed on the distribution of x. No matter how x is distributed (as long as it has a finite variance), the sample mean of a large sample will be distributed normally.
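Before the Monte Carlo demonstrations of the CLT below, the law of large numbers itself is worth watching happen: the Chebyshev bound says the probability of a deviation of at least $\varepsilon$ dies off like 1/N. A Python sketch with a fair die ($\mu = 3.5$; the choices of $\varepsilon$ and the sample sizes are arbitrary illustrations):

```python
import random

random.seed(9)

def sample_mean(n):
    """S/N for n rolls of a fair die (true mean 3.5)."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# Estimate P(|S/N - 3.5| >= 0.25) for growing sample sizes.
eps, reps = 0.25, 2_000
prob = {}
for n in (10, 100, 1000):
    hits = sum(abs(sample_mean(n) - 3.5) >= eps for _ in range(reps))
    prob[n] = hits / reps
```

As N grows from 10 to 1000, the estimated probability of a deviation of 0.25 or more collapses toward zero.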
We can't prove this theorem here. However, we can prove it for certain cases using Monte Carlo methods. But first, some history.

The theorem was first proved by DeMoivre in 1733 but was promptly forgotten. It was resurrected in 1812 by Laplace, who derived the normal approximation to the binomial distribution, but it was still mostly ignored until Lyapunov in 1901 generalized it and showed how it worked mathematically. One of the few people who didn't ignore it in the 19th century was Sir Francis Galton, who was a big fan. He wrote in Natural Inheritance, 1889:

  I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error". The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.

Nowadays, the central limit theorem is considered to be the unofficial sovereign of probability theory. The CLT is the reason that the normal distribution is so important.

Infinity comes quickly for the normal distribution; the sample doesn't really have to be that large. It used to be that a sample of 30 was thought to be enough. Nowadays we usually require 100 or more.

For example, the binomial distribution is decidedly nonnormal, since values must be either zero or one. Here is the histogram of the binomial distribution with an equal probability of generating a one or a zero.
[Figure: histogram of the 0/1 variable — two equal spikes at 0 and 1. Not normal.]

I asked Stata to generate 10,000 samples of size N = 100 from this distribution and to compute the sum of each sample. I then computed the mean and variance of the resulting 10,000 observations.

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
         sum |   10000     50.0912    4.981402         31         72

I then computed the z-scores by taking each observation, subtracting the mean (50.0912), and dividing by the standard deviation. The histogram of the resulting data is:

[Figure: histogram of z, bell-shaped, running from about -3 to 4.]

This is the standard normal distribution with mean zero and standard deviation one.

. summarize z
    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
           z |   10000    1.57e-07           1  -3.832496   4.398119

As expected, the mean is zero and the standard deviation is one.

Here is another example. This is the distribution of the population of cities in the United States with population of 100,000 or more.

[Figure: histogram of city populations, 0 to 8,000,000; heavily skewed to the right.]

It is obviously not normal, being highly skewed: there are very few very large cities and a great number of relatively small cities. Suppose we treat this distribution as the population. I told Stata to take 10,000 samples of size N = 200 from this data set and to compute the mean of each sample (mx). I then computed the mean and standard deviation of the resulting data.

. summarize mx

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
          mx |   10000      287657     41619.2   193870.64   497314.2

I then computed the corresponding z-scores. According to the Central Limit Theorem, they should be close to standard normal.

. summarize z

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
           z |   10000    3.37e-07           1  -2.253426    5.03747

Again the mean is zero and the standard deviation is one.
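The binomial experiment above is easy to replicate outside Stata. A Python sketch of the same design — 10,000 samples of 100 fair coin flips, standardized with the theoretical mean $np = 50$ and standard deviation $\sqrt{np(1-p)} = 5$:

```python
import random
import statistics

random.seed(8)

# 10,000 samples of 100 fair coin flips; standardize the count of heads.
z = []
for _ in range(10_000):
    heads = sum(random.random() < 0.5 for _ in range(100))
    z.append((heads - 50) / 5)

z_mean = statistics.mean(z)   # should be near 0
z_sd = statistics.stdev(z)    # should be near 1
```

Even though each flip is as non-normal as a variable can be, the standardized sample sums already behave like standard normal draws at N = 100.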
Here is the histogram of the z-scores.

[Figure: histogram of z for the city-population sample means, running from about -4 to 6.]

It still looks a little skewed, but it is very close to the standard normal.

Method of maximum likelihood

Suppose we have the following simple regression model:

\[ Y_i = \alpha + \beta X_i + U_i, \qquad U_i \sim IN(0, \sigma^2) \]

So, the Y are distributed normally. The sample consists of N observations. The probability of observing the first Y is

\[ \Pr(Y_1) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}U_1^2} = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(Y_1 - \alpha - \beta X_1)^2} \]

The probability of observing the second Y is exactly the same, except for the subscript, and so on. The probability of observing the sample is the probability of observing Y1 and Y2 and so forth:

\[ \Pr(Y_1, Y_2, \ldots, Y_N) = \Pr(Y_1)\Pr(Y_2)\cdots\Pr(Y_N) \]

because of the independence assumption. Therefore,

\[ \Pr(Y_1, Y_2, \ldots, Y_N) = \prod_{i=1}^{N} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-\frac{1}{2\sigma^2}(Y_i - \alpha - \beta X_i)^2} = L(\alpha, \beta, \sigma) \]

The symbol $\prod$ means multiply the items indexed from one to N. This is the likelihood function, which gives the probability of observing the sample that we actually observed. It is a function of the unknown parameters $\alpha$, $\beta$, and $\sigma$.
The basic idea behind maximum likelihood is that we probably observed the most likely sample, not a rare or unusual sample. Therefore, the best guess of the values of the parameters are those that generate the most likely sample. Mathematically, we maximize the likelihood of the sample by taking derivatives with respect to $\alpha$, $\beta$, and $\sigma$, setting the derivatives equal to zero, and solving. Let's take logs first to make the algebra easier:

\[ \log L = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(Y_i - \alpha - \beta X_i)^2 \]

This function has a lot of stuff that is not a function of the parameters. Write

\[ \log L = C - \frac{N}{2}\log\sigma^2 - \frac{Q}{2\sigma^2} \]

where

\[ C = -\frac{N}{2}\log(2\pi) \]

which is constant with respect to the unknown parameters, and

\[ Q = \sum_{i=1}^{N}(Y_i - \alpha - \beta X_i)^2 \]

where Q is the error sum of squares. Since Q is multiplied by a negative, to maximize the likelihood we have to minimize the error sum of squares with respect to $\alpha$ and $\beta$. This is nothing more than ordinary least squares. Therefore, OLS is also a maximum likelihood estimator. We already know the OLS formulas, so the only thing we need now is the maximum likelihood estimator for the variance, $\sigma^2$.

Substitute the OLS values $\hat\alpha$ and $\hat\beta$:

\[ \log L(\sigma) = C - \frac{N}{2}\log(\sigma^2) - \frac{\hat Q}{2\sigma^2} \]

where

\[ \hat Q = \sum_{i=1}^{N}(Y_i - \hat\alpha - \hat\beta X_i)^2 \]

is the residual sum of squares, ESS. Now we have to differentiate with respect to $\sigma$ and set the derivative equal to zero:

\[ \frac{d \log L(\sigma)}{d\sigma} = -\frac{N}{\sigma} + \frac{\hat Q}{\sigma^3} = 0 \quad\Longrightarrow\quad N = \frac{\hat Q}{\hat\sigma^2} \quad\Longrightarrow\quad \hat\sigma^2 = \frac{\hat Q}{N} = \frac{ESS}{N} \]
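Note that the ML variance estimator divides by N rather than N-2, so in small samples it is biased downward relative to the unbiased OLS estimator. A quick numeric contrast on a tiny invented data set (a Python sketch):

```python
# The ML estimate of the error variance divides the residual sum of squares
# by N; the unbiased OLS estimate divides by N - k. Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar
ess = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

sigma2_ml = ess / n          # maximum likelihood
sigma2_ols = ess / (n - 2)   # unbiased (mean square error)
```

With only four observations the two estimates differ by a factor of two; as N grows the factor N/(N-2) goes to one, so the ML estimator is still consistent.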
Now that we have estimates for $\alpha$, $\beta$, and $\sigma$, we can substitute $\hat\alpha$, $\hat\beta$, and $\hat\sigma$ into the log likelihood function to get its maximum value:

\[ \mathrm{Max}\,\log L = C - \frac{N}{2}\log(\hat\sigma^2) - \frac{\hat Q}{2\hat\sigma^2} \]

Note that $\frac{\hat Q}{2\hat\sigma^2} = \frac{\hat Q}{2\hat Q / N} = \frac{N}{2}$. So,

\[ \mathrm{Max}\,\log L = C - \frac{N}{2}\log\left(\frac{\hat Q}{N}\right) - \frac{N}{2} = C - \frac{N}{2}\log\hat Q + \frac{N}{2}\log N - \frac{N}{2} \]

Since N is constant,

\[ \mathrm{Max}\,\log L = \mathrm{const} - \frac{N}{2}\log\hat Q \]

where $\mathrm{const} = C + \frac{N}{2}\log N - \frac{N}{2}$. Therefore,

\[ \mathrm{Max}\,L = \mathrm{const}^* \cdot \hat Q^{-N/2} = \mathrm{const}^* \cdot (ESS)^{-N/2} \]

Likelihood ratio test

Let L be the likelihood function. We can use this function to test hypotheses such as $\beta = 0$ or $\alpha + \beta = 10$ as restrictions on the model. As in the F-test, if the restriction is false, we expect that the residual sum of squares (ESS) for the restricted model will be higher than the ESS for the unrestricted model. This means that the likelihood function will be smaller (not maximized) if the restriction is false. If the hypothesis is true, then the two values of ESS will be approximately equal and the two values of the likelihood function will be approximately equal. Consider the likelihood ratio

\[ \lambda = \frac{\mathrm{Max}\,L \text{ under the restriction}}{\mathrm{Max}\,L \text{ without the restriction}} \le 1 \]

The test statistic is

\[ -2\log(\lambda) \sim \chi^2(p) \]

where p is the number of restrictions. To do this test, do two regressions: one with the restriction, saving the residual sum of squares ESS(R), and one without the restriction, saving ESS(U).
\[ \lambda = \frac{\mathrm{const}^*(ESS(R))^{-N/2}}{\mathrm{const}^*(ESS(U))^{-N/2}} \]

\[ \log(\lambda) = -\frac{N}{2}\left[\log(ESS(R)) - \log(ESS(U))\right] \]

The test statistic is

\[ -2\log(\lambda) = N\left[\log(ESS(R)) - \log(ESS(U))\right] \sim \chi^2(p) \]

This is a large sample test. Like all large sample tests, its significance is not well known in small samples. Usually we just assume that the small sample significance is about the same as it would be in a large sample.

Multiple regression and instrumental variables

Suppose we have a model with two explanatory variables:

\[ Y_i = \alpha + \beta X_i + \gamma Z_i + u_i, \qquad u_i \sim iid(0, \sigma^2) \]

We want to estimate the parameters $\alpha$, $\beta$, and $\gamma$. We define the error as

\[ e_i = Y_i - \alpha - \beta X_i - \gamma Z_i \]

At this point we could square the errors, sum them, take derivatives with respect to the parameters, set the derivatives equal to zero, and solve. This would generate the least squares estimators for this multiple regression. However, there is another approach, known as instrumental variables, that is both illustrative and easier.

To see how instrumental variables works, let's first derive the OLS estimators for the simple regression model:

\[ Y_i = \alpha + \beta X_i + u_i \]

The crucial Gauss-Markov assumptions can be written as

(A1)  $E(u_i) = 0$
(A2)  $E(x_i u_i) = 0$

A1 states that the model must be correct on average, so the (true) mean of the errors is zero. A2 is the alternative to the assumption that the x's are a set of fixed numbers. It says that the x's cannot be correlated with the error term; that is, the covariance between the explanatory variable and the error term must be zero. Since we do not observe $u_i$, we have to make the assumptions operational. We therefore define the following empirical analogs of A1 and A2:

(a1)  $\sum e_i / N = 0 \implies \sum e_i = 0$
(a2)  $\sum x_i e_i / N = 0 \implies \sum x_i e_i = 0$
(a1) says that the mean of the residuals is zero, and (a2) says that the covariance between x and the residuals is zero. Define the residuals as

\[ e_i = Y_i - \alpha - \beta X_i \]

Sum the residuals:

\[ \sum_{i=1}^{N} e_i = \sum_{i=1}^{N}(Y_i - \alpha - \beta X_i) = \sum Y_i - N\alpha - \beta\sum X_i = 0 \]

by (a1). Divide both sides by N:

\[ \bar Y - \alpha - \beta\bar X = 0 \quad\Longrightarrow\quad \hat\alpha = \bar Y - \hat\beta\bar X \]

which is the OLS estimator for $\alpha$. We can use (a2) to derive the OLS estimator for $\beta$. Work in deviations:

\[ y_i = Y_i - \bar Y = \alpha + \beta X_i + u_i - \alpha - \beta\bar X = \beta(X_i - \bar X) + u_i \]
\[ y_i = \beta x_i + u_i \]

The predicted value of y is $\hat y_i = \hat\beta x_i$, and the observed value of y is the sum of the predicted value from the regression and the residual — the difference between the observed value and the predicted value:

\[ y_i = \hat y_i + e_i = \hat\beta x_i + e_i \]

Multiply both sides by $x_i$:

\[ x_i y_i = \hat\beta x_i^2 + x_i e_i \]

Sum:

\[ \sum x_i y_i = \hat\beta\sum x_i^2 + \sum x_i e_i \]

But $\sum x_i e_i = 0$ by (a2), so

\[ \sum x_i y_i = \hat\beta\sum x_i^2 \]

Solve for $\hat\beta$:

\[ \hat\beta = \frac{\sum x_i y_i}{\sum x_i^2} \]

which is the OLS estimator. Now that we know that we can derive the OLS estimators using instrumental variables, let's do it for the multiple regression above:

\[ Y_i = \alpha + \beta X_i + \gamma Z_i + u_i \]
We have to make three assumptions, each with its empirical analog:

(A1)  $E(u_i) = 0$   →   $\sum e_i / N = 0$, i.e. $\sum e_i = 0$
(A2)  $E(x_i u_i) = 0$   →   $\sum x_i e_i / N = 0$, i.e. $\sum x_i e_i = 0$
(A3)  $E(z_i u_i) = 0$   →   $\sum z_i e_i / N = 0$, i.e. $\sum z_i e_i = 0$

From (A1) we get the estimate of the intercept:

\[ \sum e_i = \sum Y_i - N\alpha - \beta\sum X_i - \gamma\sum Z_i = 0 \]

Dividing by N,

\[ \bar Y - \alpha - \beta\bar X - \gamma\bar Z = 0 \quad\Longrightarrow\quad \hat\alpha = \bar Y - \hat\beta\bar X - \hat\gamma\bar Z \]

which is a pretty straightforward generalization of the simple regression estimator for $\alpha$. Now we can work in deviations:

\[ y_i = Y_i - \bar Y = \beta x_i + \gamma z_i + u_i \]

The predicted values of y are $\hat y_i = \hat\beta x_i + \hat\gamma z_i$, so that y is equal to the predicted value plus the residual:

\[ y_i = \hat\beta x_i + \hat\gamma z_i + e_i \]

Multiply by x and sum:

\[ \sum x_i y_i = \hat\beta\sum x_i^2 + \hat\gamma\sum x_i z_i + \sum x_i e_i \]

But $\sum x_i e_i = 0$ by (A2), so

(N1)  $\sum x_i y_i = \hat\beta\sum x_i^2 + \hat\gamma\sum x_i z_i$

Now repeat the operation with z:

\[ \sum z_i y_i = \hat\beta\sum z_i x_i + \hat\gamma\sum z_i^2 + \sum z_i e_i \]

But $\sum z_i e_i = 0$ by (A3), so

(N2)  $\sum z_i y_i = \hat\beta\sum z_i x_i + \hat\gamma\sum z_i^2$

We have two linear equations (N1 and N2) in two unknowns, $\hat\beta$ and $\hat\gamma$. Even though these equations look complicated with all the summations, the summations are just numbers derived from the sample observations on x, y, and z. There are a variety of ways to solve two equations in two unknowns; back substitution from high school algebra will work just fine. Solve for $\hat\gamma$ from N2:

\[ \hat\gamma = \frac{\sum z_i y_i - \hat\beta\sum z_i x_i}{\sum z_i^2} \]

Substitute into N1:
90 .2 zi2 − zxi xi zi xi yi zi2 − zi yi xi zi x. i ( i i i ) i i ˆ ˆ xi yi z = β x. zi yi ˆ zxi ˆ xi yi = β x. The OLS estimator for γˆ is derived similarly. We might be able to make some sense out of these formulas if we translate them into simple regression coefficients.2 + −β 2 zi2 zi Multiply through by zi2 xi zi zi yi ˆ zxi ˆ xi yi zi2 = β x. ˆ This is the OLS estimator for β . γˆ = zi yi xi2 − xi yi xi zi x.2 zi2 − zxi xi zi ) = xi yi zi2 − zi yi xi zi ˆ β= ˆ β= xi yi zi2 − zi yi xi zi x. xi yi = rxy xi2 yi2 And.2 zi2 − rxz xi2 zi2 ( ) 2 Cancel out zi2 which appears in every term. xi zi = rxz xi2 zi2 zi yi = rzy zi2 yi2 ˆ Substituting into the formula for β ˆ β= rxy xi2 yi2 zi2 − rzy zi2 yi2 rxz xi2 zi2 x.2 zi2 + zi2 −β xi zi 2 zi2 zi ˆ ˆ x y z 2 = β x 2 z 2 + z y − β zx x z i i i .2 zi2 − ( xi zi ) 2 because xi zi = zi xi .2 zi2 + zi yi xi zi − β zxi xi zi 2 i ˆ Collect terms in β and move them to the left hand side.2 zi2 − ( xi zi ) 2 Interpreting the multiple regression coefficient. The formula for a simple correlation coefficient between x and y is xi yi xi yi xi yi rxy = = = 2 2 NS x S y xi yi xi2 yi2 N N N Or. ˆ ˆ β x 2 z 2 − β zx x z = x y z 2 − z y x z . i i i i i i i i i i i ˆ β ( x.
\[ \hat\beta = \frac{r_{xy}\sqrt{\sum x_i^2 \sum y_i^2} - r_{zy} r_{xz}\sqrt{\sum y_i^2 \sum x_i^2}}{\left(1 - r_{xz}^2\right)\sum x_i^2} \]

Factor out the common terms on the top and bottom:

\[ \hat\beta = \frac{r_{xy} - r_{zy} r_{xz}}{1 - r_{xz}^2}\cdot\frac{\sqrt{\sum y_i^2 / N}}{\sqrt{\sum x_i^2 / N}} = \left(\frac{r_{xy} - r_{zy} r_{xz}}{1 - r_{xz}^2}\right)\frac{S_y}{S_x} \]

The corresponding expression for $\hat\gamma$ is

\[ \hat\gamma = \left(\frac{r_{zy} - r_{xy} r_{xz}}{1 - r_{xz}^2}\right)\frac{S_y}{S_z} \]

The term $S_y/S_x$ is simply a positive scaling factor, so let's concentrate on the ratio. The simple correlation coefficient $r_{xy}$ in the numerator of the formula for $\hat\beta$ is the total effect of X on Y, holding nothing constant. X is correlated with Z, so when X varies, Z varies; this causes Y to change, because Y is correlated with Z. The second term, $r_{zy} r_{xz}$, is that indirect effect of X on Y through Z. Therefore, the numerator of the formula for $\hat\beta$ starts with the total effect of X on Y, but then subtracts off the indirect effect of X on Y through Z. It partials out the effect of X on Y while statistically holding Z constant — the so-called partial effect of X on Y. The formula for $\hat\gamma$ is analogous: it is the total effect of Z on Y minus the indirect effect of Z on Y through X.

Note that, if there is no correlation between X and Z, then X and Z are said to be orthogonal and $r_{xz} = 0$. In this case, the multiple regression estimators collapse to simple regression estimators:

\[ \hat\beta = \left(\frac{r_{xy} - r_{zy} r_{xz}}{1 - r_{xz}^2}\right)\frac{S_y}{S_x} = r_{xy}\frac{S_y}{S_x} \qquad\text{and}\qquad \hat\gamma = r_{zy}\frac{S_y}{S_z} \]

These are the results we would get if we did two separate simple regressions: Y on X, and Y on Z.

The denominator is $1 - r_{xz}^2$. When x and z are perfectly correlated, $r_{xz} = 1$ and the formula does not compute. This is the case of perfect collinearity: X and Z are said to be collinear because one is an exact function of the other, with no error term. This could happen if, say, X is total wages and Z is total income minus nonwage payments. If X and Z are perfectly collinear, then whenever X changes, Z has to change, so it is impossible to hold Z constant to find the separate effect of X.
The multiple regression coefficient is the effect on Y of a change in X, holding Z constant; with perfect collinearity there is no such separate effect to estimate, so Stata will simply drop Z from the regression and estimate a simple regression of Y on X.
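The two-regressor formulas can be checked on a small made-up data set: build Y as an exact linear function of X and Z, apply the summation formulas, and the known coefficients should come back exactly. A Python sketch (all numbers invented for illustration):

```python
# Hypothetical data; Y is built as exactly 10 + 2X + 3Z, so the estimators
# should recover alpha = 10, beta = 2, gamma = 3.
X = [1, 2, 3, 4, 5, 6]
Z = [2, 1, 4, 3, 6, 8]
Y = [10 + 2 * xi + 3 * zi for xi, zi in zip(X, Z)]

def dev(v):
    """Deviations from the mean."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

x, z, y = dev(X), dev(Z), dev(Y)

sxx = sum(a * a for a in x)
szz = sum(a * a for a in z)
sxz = sum(a * b for a, b in zip(x, z))
sxy = sum(a * b for a, b in zip(x, y))
szy = sum(a * b for a, b in zip(z, y))

det = sxx * szz - sxz ** 2          # nonzero as long as X, Z not collinear
beta = (sxy * szz - szy * sxz) / det
gamma = (szy * sxx - sxy * sxz) / det
alpha = sum(Y) / len(Y) - beta * sum(X) / len(X) - gamma * sum(Z) / len(Z)
```

Because X and Z are correlated here but not perfectly so, the determinant in the denominator is positive and the system solves cleanly.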
Multiple regression and omitted variable bias

The reason we use multiple regression is that most real world relationships have more than one explanatory variable. Suppose we type the following data into the data editor in Stata. [Data table: eight observations on yield, rain, and temp.] Let's do a simple regression of crop yield on rainfall.

. regress yield rain

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  1,     6) =    0.17
       Model |  33.3333333     1  33.3333333           Prob > F      =  0.6932
    Residual |  1166.66667     6  194.444444           R-squared     =  0.0278
-------------+------------------------------           Adj R-squared = -0.1343
       Total |        1200     7  171.428571           Root MSE      =  13.944

------------------------------------------------------------------------------
       yield |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        rain |  -1.666667   4.025382    -0.41   0.693    -11.51642    8.183089
       _cons |   76.66667    40.5546     1.89   0.108    -22.56687    175.9002
------------------------------------------------------------------------------

According to our analysis, the coefficient on rain is -1.67, implying that rain causes crop yield to decline (!), although not significantly. The problem is that we have an omitted variable: temperature. Let's do a multiple regression with both rain and temp.

. regress yield rain temp

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  2,     5) =    0.60
       Model |  231.448763     2  115.724382           Prob > F      =  0.5853
    Residual |  968.551237     5  193.710247           R-squared     =  0.1929
-------------+------------------------------           Adj R-squared = -0.1300
       Total |        1200     7  171.428571           Root MSE      =  13.918

(In the coefficient table, both rain and temp now enter with positive coefficients; neither is statistically significant.)

Now both rain and warmth increase crop yields. The coefficients are not significant, probably because we only have eight observations. Nevertheless, remember that omitting an important explanatory variable can bias your estimates on the included variables.

The omitted variable theorem

The reason we do multiple regressions is that most things in economics are functions of more than one variable. If we make a mistake and leave one of the important variables out, we cause the remaining coefficient to be biased (and inconsistent). Suppose the true model is the multiple regression (in deviations)

\[ y_i = \beta x_i + \gamma z_i + u_i \]

but we omit z and run a simple regression of y on x. The coefficient on x will be

\[ \hat\beta = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{\sum x_i(\beta x_i + \gamma z_i + u_i)}{\sum x_i^2} \]

The expected value of $\hat\beta$ is

\[ E(\hat\beta) = \frac{\beta\sum x_i^2}{\sum x_i^2} + \frac{\gamma\sum x_i z_i}{\sum x_i^2} + \frac{\sum x_i u_i}{\sum x_i^2} = \beta + \gamma\frac{\sum x_i z_i}{\sum x_i^2} \]

because $\sum x_i u_i = 0$ in expectation by A2 (the explanatory variables are uncorrelated with the error term). We can write this expression in a number of ways:

\[ E(\hat\beta) = \beta + \gamma\frac{\sum x_i z_i}{\sum x_i^2} = \beta + \gamma\, r_{xz}\frac{\sqrt{\sum z_i^2 / N}}{\sqrt{\sum x_i^2 / N}} = \beta + \gamma\, r_{xz}\frac{S_z}{S_x} = \beta + \gamma\, b_{xz} \]

where $b_{xz}$ is the coefficient from the simple regression of Z on X. If $\sum x_i z_i \ne 0$, then $r_{xz} \ne 0$ and $b_{xz} \ne 0$, in which case $\hat\beta$ is biased (and inconsistent, since the bias does not go away as N goes to infinity). Assume that Z is relevant in the sense that $\gamma$ is not zero. Then the direction of bias depends on the signs of $r_{xz}$ and $\gamma$. If $r_{xz}$ and $\gamma$ have the same sign, then $\hat\beta$ is biased upwards. If they have
If there are more omitted variables. Finally. then β is biased downward. In the Data Editor.ˆ opposite signs. add more equations. ˆ E β = β +a β +b β ( ) ˆ E (β ) = β + a β + b β ˆ E (β ) = β + a β + b β 0 0 0 3 0 1 1 1 3 1 4 2 2 2 3 2 4 4 Here is a nice example in Stata. It can be shown that. Suppose the true model is Yi = β 0 + β1 X 1i + β 2 X 2i + β3 X 3i + β 4 X 4i + ui But we estimate Yi = β 0 + β1 X 1i + β 2 X 2i + ui We know we have omitted variable bias. then r = 0 and β is unbiased. create the following variable. simply add them to the right hand side of each auxiliary regression. as a simple extension of the omitted variable theorem above. This was the case with rainfall and temperature in our crop ˆ yield example in Chapter 7. but how much bias is there and in what direction? Consider a set of auxiliary (and imaginary if we don’t have data on X3 and X4) regressions of each of the omitted variables on all of the included variables. note that if X and Z are orthogonal. Now let’s consider a more complicated model. 94 . X 3i = a0 + a1 X 1i + a2 X 2i X 4i = b0 + b1 X 1i + b2 X 2i Note if there are more included variables. S E ( γ ) = γ + β rzx x = γ + β bzx Sz where bzx is the coefficient from the simple regression of X on Z. xz The expected value of γˆ if we omit X is analogous.
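The theorem can also be checked outside Stata. Here is a minimal Python sketch of the same exercise; the column typed into the Data Editor is not reproduced here, so the values x = -3, ..., 3 (seven observations) are an assumption, chosen because they reproduce the Stata output that follows:

```python
import numpy as np

# Assumed data: x = -3,...,3 (seven observations); the actual column typed
# into the Data Editor is not shown in the text.
x = np.arange(-3, 4, dtype=float)
y = 1 + x + x**2 + x**3 + x**4                 # exact model, no error term

X_incl = np.column_stack([np.ones_like(x), x, x**2])   # constant, x, x2

# Short regression: omit x3 and x4.
b_short, *_ = np.linalg.lstsq(X_incl, y, rcond=None)

# Auxiliary regressions of each omitted variable on the included variables.
a, *_ = np.linalg.lstsq(X_incl, x**3, rcond=None)      # x3 on constant, x, x2
b, *_ = np.linalg.lstsq(X_incl, x**4, rcond=None)      # x4 on constant, x, x2

# Omitted variable theorem: E(beta_j-hat) = beta_j + a_j*beta3 + b_j*beta4,
# with beta0 = beta1 = beta2 = beta3 = beta4 = 1 here.
predicted = np.ones(3) + a + b

print(np.round(b_short, 4))    # estimates from the short regression
print(np.round(predicted, 4))  # theorem prediction: the same numbers
```

Because there is no error term, the two vectors agree exactly: both come out as (-9.2857, 8, 10.5714), the biased intercept and slopes produced by the Stata regressions.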
Create some additional variables.

. gen x2=x^2
. gen x3=x^3
. gen x4=x^4

Now create y.

. gen y=1+x+x2+x3+x4

Regress y on all the x's (true model).

. regress y x x2 x3 x4

      Source |       SS       df       MS
-------------+------------------------------
       Model |       11848     4        2962
    Residual |           0     2           0
-------------+------------------------------
       Total |       11848     6  1974.66667

(Number of obs = 7; R-squared = 1.0000; Adj R-squared = 1.0000; Root MSE = 0)

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          1          .        .       .            .           .
          x2 |          1          .        .       .            .           .
          x3 |          1          .        .       .            .           .
          x4 |          1          .        .       .            .           .
       _cons |          1          .        .       .            .           .

Now omit X3 and X4.

. regress y x x2

      Source |       SS       df       MS
-------------+------------------------------
       Model |  11179.4286     2  5589.71429
    Residual |  668.571429     4  167.142857
-------------+------------------------------
       Total |       11848     6  1974.66667

(Number of obs = 7; F(2, 4) = 33.44; Prob > F = 0.0032; R-squared = 0.9436; Adj R-squared = 0.9154; Root MSE = 12.928)

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          8   2.443233     3.27   0.031     1.216498    14.78349
          x2 |   10.57143   1.410601     7.49   0.002     6.654972    14.48789
       _cons |  -9.285714   7.464198    -1.24   0.281    -30.00966    11.43823

Do the auxiliary regressions.

. regress x3 x x2

      Source |       SS       df       MS
-------------+------------------------------
       Model |        1372     2         686
    Residual |         216     4          54
-------------+------------------------------
       Total |        1588     6  264.666667

(Number of obs = 7; F(2, 4) = 12.70; Prob > F = 0.0185; R-squared = 0.8640; Adj R-squared = 0.7960; Root MSE = 7.3485)

          x3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          7    1.38873     5.04   0.007     3.144267    10.85573
          x2 |          0   .8017837     0.00   1.000    -2.226109    2.226109
       _cons |          0   4.242641     0.00   1.000    -11.77946    11.77946

. regress x4 x x2

      Source |       SS       df       MS
-------------+------------------------------
       Model |  7695.42857     2  3847.71429
    Residual |  452.571429     4  113.142857
-------------+------------------------------
       Total |        8148     6        1358

(Number of obs = 7; F(2, 4) = 34.01; Prob > F = 0.0031; R-squared = 0.9445; Adj R-squared = 0.9167; Root MSE = 10.637)

          x4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |          0   2.010178     0.00   1.000    -5.581149    5.581149
          x2 |   9.571429   1.160577     8.25   0.001      6.34915    12.79371
       _cons |  -10.28571   6.141196    -1.67   0.169    -27.33641    6.764979

Compute the expected values from the formulas above.

    E(β̂₁) = β₁ + a₁β₃ + b₁β₄ = 1 + 7(1) + 0(1) = 8
    E(β̂₂) = β₂ + a₂β₃ + b₂β₄ = 1 + 0(1) + 9.57(1) = 10.57
    E(β̂₀) = β₀ + a₀β₃ + b₀β₄ = 1 + 0(1) − 10.29(1) = −9.29

These are the values of the parameters in the regression with the omitted variables. You can see that the theorem works: both slope estimates are biased upward by the omitted variables. These numbers are exact because there is no random error in the model.

Target and control variables: how many regressors?

When we estimate a regression model, we usually have one or two parameters that we are primarily interested in. In a demand curve we are typically concerned with the coefficients on price and income. These coefficients tell us whether the demand curve is downward sloping and whether the good is normal or inferior. In a demand for money equation we are primarily interested in the coefficient on the interest rate. The variables associated with those parameters are called target variables. In a policy study the target variable is frequently a dummy variable (see below) that is equal to one if the policy is in effect and zero otherwise. For example, if we are studying the effect of a three-strikes law on crime, our target variable is the three-strikes dummy.

To get a good estimate of the coefficient on the target variable or variables, we want to avoid omitted variable bias. To that end we include a list of control variables. The only purpose of the control variables is to try to guarantee that we don't bias the coefficients on the target variables. We include in the crime equation all those variables which might cause crime aside from the three-strikes law (e.g., unemployment, percent urban, percent in certain age groups, etc.).

What happens if, in our attempt to avoid omitted variable bias, we include too many control variables? Instead of omitting relevant variables, suppose we include irrelevant variables. It turns out that including irrelevant variables will not bias the coefficients on the target variables. However, including irrelevant variables will make the estimates inefficient relative to estimates including only relevant variables. It will also increase the standard errors and underestimate the t-ratios on all the coefficients in the model, including the coefficients on the target variables. Thus, including too many control variables will tend to make the target variables appear insignificant, even when they are truly significant.

We can summarize the effect of too many or too few variables as follows. Omitting a relevant variable biases the coefficients on all the remaining variables. However, discarding a variable whose true coefficient is less than its true (theoretical) standard error decreases the mean square error (the sum of variance plus bias squared) of all the remaining coefficients: omitting the variable increases bias, but decreases the variance (increases the efficiency) of all the remaining coefficients. So if τ = β/S_β < 1 then we are better off, in terms of mean square error, omitting the variable, even though its parameter is not zero. Unfortunately, we don't know any of these true values, so this is not very helpful. But it does demonstrate that we shouldn't keep insignificant control variables.

What is the best practice? I recommend the general to specific modeling strategy. After doing your library research and reviewing all previous studies on the issue, you will have compiled a list of all the control variables that previous researchers have used. You may also come up with some new control variables. Start with a general model, including all the potentially relevant controls, and remove the insignificant ones. You can proceed sequentially, dropping one or two variables at a time, if that is convenient. Use t-tests and F-tests to justify your actions. Sets of dummy variables should be treated as groups, including all or none for each group, so you might have some insignificant controls in the final model. After you get down to a parsimonious model including only significant control variables, do one more F-test to make sure that you can go from the general model to the final model in one step. At this point you should be able to do valid hypothesis tests concerning the coefficients on the target variables.

Proxy variables

It frequently happens that researchers face a dilemma. Data on a potentially important control variable is not available. In that case, we may be able to obtain data on a variable that is known or suspected to be highly correlated with the unavailable variable. Such variables are known as proxy variables. What is a researcher to do? Monte Carlo studies have shown that the bias tends to be smaller if we include a proxy than if we omit the variable entirely, but not by a lot. The bottom line is that we should include proxies for control variables, but drop them if they are not significant; it is better to omit a poor proxy. The bias that results from including a proxy is directly related to how highly correlated the proxy is with the unavailable variable. Unfortunately, in some cases it is impossible to know how highly correlated they are, because we can't observe the unavailable variable. It might be possible to see how the two variables are related in other contexts, in other studies for example, but generally we just have to hope.
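That Monte Carlo claim is easy to illustrate. The following Python sketch is not from the text: the design (the coefficients, the correlation between x and z, and the proxy's noise) is made up purely for illustration.

```python
import numpy as np

# Illustrative simulation (made-up design): compare the bias in the
# coefficient on x when a relevant variable z is omitted entirely versus
# when a noisy proxy for z is included.
rng = np.random.default_rng(0)
beta, gamma = 1.0, 1.0            # true coefficients on x and z
n, reps = 200, 2000
b_omit, b_proxy = [], []

for _ in range(reps):
    x = rng.normal(size=n)
    z = 0.5 * x + rng.normal(size=n)          # z is correlated with x
    y = beta * x + gamma * z + rng.normal(size=n)
    z_proxy = z + rng.normal(size=n)          # proxy = z plus measurement error

    b_omit.append(np.polyfit(x, y, 1)[0])     # regress y on x alone

    X = np.column_stack([np.ones(n), x, z_proxy])
    b_proxy.append(np.linalg.lstsq(X, y, rcond=None)[0][1])

bias_omit = abs(np.mean(b_omit) - beta)       # roughly gamma * b_xz = 0.5
bias_proxy = abs(np.mean(b_proxy) - beta)     # smaller, but not zero
print(round(bias_omit, 2), round(bias_proxy, 2))
```

In this design the omitted-variable bias is about 0.5, while including the proxy leaves a bias of about 0.25 — smaller, but, as the text says, not by a lot, and it shrinks as the proxy gets less noisy.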
The dilemma is this: if we omit the proxy we get omitted variable bias; if we include the proxy we get measurement error. As we see in a later chapter, measurement error causes biased and inconsistent estimates, but so does omitting a relevant variable. If we omit the variable we are stuck with omitted variable bias; if we include the proxy we are stuck with measurement error.

An interesting problem arises if the target variable is a proxy variable. We are then faced with the fact that the coefficient estimate is a compound of the true parameter linking the dependent variable and the unavailable variable and the parameter relating the proxy to the unavailable variable. Suppose we know that

    Y = α + βX

But we can't get data on X. So we use Z, which is available and is related to X. That is,

    X = a + bZ

Substituting for X in the original model yields

    Y = α + βX = α + β(a + bZ) = (α + βa) + βbZ

So if we estimate the model

    Y = c + dZ

the coefficient on Z is d = βb, the product of the true coefficient relating X to Y and the coefficient relating X to Z. Unless we know the value of b, we have no measure of the effect of X on Y. Therefore, the estimated coefficient can only be used to determine sign and significance: we do know if the coefficient d is significant or not, and we do know if the sign is as expected. For example, if we know that b is positive, then we can test the hypothesis that β is positive using the estimated coefficient.

An illustration of this problem occurs in studies of the relationship between guns and crime. There is no good measure of the number of guns, so researchers have to use proxies. As we have just seen, the coefficient on the proxy for guns cannot be used to make inferences concerning the elasticity of crime with respect to guns. Nevertheless two studies, one by Cook and Ludwig and one by Duggan, both make this mistake. See http://papers.ssrn.com/sol3/papers.cfm?abstract_id=473661 for a discussion of this issue.

Dummy variables

Dummy variables are variables that consist of ones and zeros. They are also known as binary variables. They are extremely useful. I happen to have a data set consisting of the salaries of the faculty of a certain nameless university (salaries.dta). The average salary at the time of the survey was as follows.

. summarize salary

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
      salary |       350    72024.92    25213.34      20000     173000

Now suppose we create a dummy variable called female, which takes the value 1 if the faculty member is female and 0 if male. We can then find the average female salary.

. summarize salary if female

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
      salary |       122       61050       18195      29700     130000

Similarly, we can create a dummy variable called male, which is 1 if male and zero if female. The average male salary is somewhat higher.

. summarize salary if male

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
      salary |       228    77897.46    26485.46      20000     173000
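Before running the Stata regressions, it is worth seeing what regression on a dummy does mechanically. A small Python sketch with made-up salaries (not the survey data):

```python
import numpy as np

# Made-up salaries: three men (female = 0) and three women (female = 1).
salary = np.array([70000., 80000., 90000., 55000., 60000., 65000.])
female = np.array([0., 0., 0., 1., 1., 1.])

# Regress salary on a constant and the dummy.
X = np.column_stack([np.ones_like(female), female])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)

# The intercept is the male group mean (80000 here); the slope is the
# female-minus-male difference in means (-20000 here).
print(np.round(b, 2))
```

This is exactly the structure exploited below: the dummy's coefficient is the difference in group means, so its t-ratio tests whether the two means differ.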
Create two variables for men's and women's salaries.

. gen salarymale=salary if female==0
(122 missing values generated)

. gen salaryfem=salary if female==1
(228 missing values generated)

. summarize salary*

    Variable |       Obs        Mean
-------------+----------------------
      salary |       350    72024.92
  salarymale |       228    77897.46
   salaryfem |       122       61050

We can use the scalar command to find the difference in salary between males and females.

. scalar salarydif=61050-77897
. scalar list salarydif
 salarydif = -16847

So, women earn almost $17,000 less than men on average. Is this difference significant, given the variance in salary? Let's do a traditional test of the difference between two means by computing the variances and using the formula in most introductory statistics books.

. ttest salarymale=salaryfem, unpaired

Two-sample t test with equal variances; degrees of freedom: 348.

Ho: mean(salarymale) - mean(salaryfem) = diff = 0
Ha: diff != 0        t = 6.2760        P > |t| = 0.0000

There appears to be a significant difference between male and female salaries. It is somewhat more elegant to use regression analysis with dummy variables to achieve the same goal.

. regress salary female

      salary |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -16847.46   2684.423    -6.28   0.000    -22127.73    -11567.2
       _cons |   77897.46                                  74780.31    81014.61

(Number of obs = 350; F(1, 348) = 39.39; Prob > F = 0.0000; R-squared = 0.1017; Adj R-squared = 0.0991; Root MSE = 23931)

Note that the coefficient on female is the difference in average salary attributable to being female, while the intercept is the corresponding male salary. The t-ratio on female tests the null hypothesis that this salary difference is equal to zero. These are exactly the same results we got using the standard t-test of the difference between two means. So, there appears to be significant salary discrimination against women at this university.

Since male = 1 - female, we can generate the bonus attributable to being male by doing the following regression.

. regress salary male

      salary |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   16847.46   2684.423     6.28   0.000      11567.2    22127.73
       _cons |      61050   2166.628    28.18   0.000     56788.67    65311.33

Now the female salary is captured by the intercept, and the coefficient on male is the (significant) bonus that attends maleness. Since male = 1 - female and female = 1 - male, male + female adds up to the intercept's column of ones, so we can't use all three terms in the same regression. This is the dummy variable trap.

. regress salary male female

      female |  (dropped)

Stata has saved us the embarrassment of pointing out that the regression we asked it to run is impossible, by simply dropping one of the variables from the regression: the result is the same regression we got with male as the only independent variable. It is possible to force Stata to drop the intercept term instead of one of the other variables.

. regress salary male female, noconstant

      salary |      Coef.
-------------+-------------
        male |   77897.46
      female |      61050

Now the coefficients are the mean salaries for the two groups. This formulation is less useful because it is more difficult to test the null hypothesis that the two salaries are different.

Perhaps this salary difference is due to a difference in the amount of experience of the two groups. Maybe the men have more experience, and that is what is causing the apparent salary discrimination. If so, the previous analysis suffers from omitted variable bias.

. regress salary exp female

(Number of obs = 350; F(2, 347) = 79.49; R-squared = 0.3142; Adj R-squared = 0.3103; Root MSE = 20940. Coefficients: exp about 1070 with a t-ratio above 10; female -10310, t = -4.24; _cons 60846.)

So the men are apparently somewhat older, and, controlling for experience, the male-female difference has been reduced to $10K. But it is still significant and negative. Also, the average raise is a pitiful $1,000 per year.

It is also possible that women receive lower raises than men do. We can test this hypothesis by creating an interaction variable, multiplying experience by female, and including it as an additional regressor.

. gen fexp=female*exp
. regress salary exp female fexp

(Number of obs = 350; F(3, 346) = 53.04; R-squared = 0.3150; Adj R-squared = 0.3091; Root MSE = 20958. The coefficient on fexp is about 172 with t = 0.64: positive but insignificant.)

In fact, women appear to receive slightly higher, but not significantly higher, raises than men. There is a significant salary penalty associated with being female, but it is not caused by discrimination in raises.

Useful tests

Ftest

We have seen many applications of the t-test. However, the F-test is also extremely useful. This test was developed by Sir Ronald Fisher (the F in F-test), a British statistical biologist in the early 1900's. Suppose we want to test the null hypothesis that the coefficients on female and fexp are jointly equal to zero. The way to test this hypothesis is to run the regression as it appears above, with both female and fexp included, and note the residual sum of squares (1.5e+11). Then run the regression again without the two variables (assuming the null hypothesis is true) and see whether the two residual sums of squares are significantly different. If they are, then the null hypothesis is false. If they are not significantly different, then the two variables do not help explain the variance of the dependent variable and the null hypothesis is true (cannot be rejected).

Stata allows us to do this test very easily with the test command. This command is used after the regress command. Refer to the coefficients by the corresponding variable name.

. test female fexp

 ( 1)  female = 0
 ( 2)  fexp = 0

The numbers in parentheses in the F-ratio are the degrees of freedom of the numerator (2, because there are two restrictions, one on female and one on fexp) and of the denominator (n - k: 350 - 4 = 346). Here Prob > F = 0.0001, so according to the F-ratio we can firmly reject this hypothesis.

Chow test

This test is frequently referred to as a Chow test, after Gregory Chow, a Princeton econometrician who developed a slightly different version. Let's do a slightly more complicated version of the F-test above. I also have a dummy variable called "admin" which is equal to one whenever a faculty member is also a member of the administration (assistant to the President, assistant to the assistant to the President, assistant dean, associate dean, dean, deanlet, etc.), and zero otherwise. Maybe female administrators are discriminated against in terms of salary.
60 0.94 50346.295 0.003 3852.881204 397. regress salary exp female fexp admin fadmin Source  SS df MS +Model  7.3297 20643 salary  Coef. .000 34444.07 _cons  60202.016 2690.423 680.0000 0. Interval] +exp  1005.4659e+11 344 426123783 +Total  2.42 female  5090.325 19394.5015 1227.59 0.966 2.67 0. assoc (=1 if an associate prof). and prof (=1 if a full professor).13 103 .575 7884.17 64696.932 5.577 fexp  156.0008 We can soundly reject the null hypothesis that women have the same salary function as men.33 0.64 2284.91 0.61 0.92 3. We want to know if the regression model explaining women’s salaries is different from the regression model explaining men’s salaries.2186e+11 349 635696291 Number of obs F( 7.557 367. regress salary exp female asst assoc prof admin chprof Source  SS df MS +Model  1. Interval] +exp  195.003 2693. Std.421 2719.5879 16185 salary  Coef. I also have data on a number of other variables that might explain the difference in salaries between men and women faculty: chprof (=1 if a chancellor professor).174 2.7694 102.8896e+10 Residual  8. .37 _cons  42395.2186e+11 349 635696291 Number of obs F( 5.3227e+11 7 1. gen fadmin=female*admin . 342) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 350 72.5 asst  11182.31 30872. test female fexp fadmin ( 1) ( 2) ( 3) female = 0 fexp = 0 fadmin = 0 F( 3. t P>t [95% Conf.85 chprof  14419.342 2.69 2.000 783.9587e+10 342 261948866 +Total  2.056 1250.91 0.334 19215.14 0.08 0.19 4395.19 2. t P>t [95% Conf.5447 266.5962 0. Let’s see if these variables make a difference. asst (=1 if an assistant prof).5207 1.010 8930.61 4175.terms of salary.451 112.68 0. we know that the difference is not due to raises or differences paid to women administrators.89 0. 344) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 350 35.67 5962.278 1952.799 13495. 344) = Prob > F = 5.11 56504.49 0.69 3905.11 4224.965 26148.002 19141.23 4079. because the ttests on fexp and fadmin are not significant. 
Err.96 0.0000 0.737 10.564 26.5271e+10 5 1.11 .42 0.26 0.4 female  11682.95 0.84 3791. This is a Chow test.5054e+10 Residual  1.04 fadmin  2011.000 14822.35 0.5124 admin  11533.008 2970. However.146 10.89 assoc  22847. Std.53 4042.057 5.28 admin  8043.000 55709.3948 0.997 13392.15 prof  47858.843 8.3393 0.000 39212. Err.92 17519.
783 9632. This conclusion rests on the department dummy variables being significant as a group. Despite all these highly significant control variables.416 2.20 0.31 27435.813 0.009 3732. .41 1.382 8906.93 dept16  4854.54 8105.71 0.157 3230.7245 female  2932.91 0.000 37613. We would like to list the variables as dept1dept29.724 12088. males and females (e.004 88.11 4618.748 7261.681 9271.871 8956.34 0.921 0. Interval] +exp  283.29 assoc  22830.31 0.55 0.7 10822.843 8484.48 49735.36 4036.487 18584.11 0.34 dept21  1839.591 9168.72 835.75 8876.95 dept12  3601.87 4019.35 0.791 18828.5047e+11 33 4.1 3930.14 dept27  1893.638 2755.82 46712.000 30290.18 0. theatre.88  Whoops.56 0.348 dept24  777.17 54630.63 0.011 1552.57 dept9  3304.75 0.794 8335.5597e+09 Residual  7. We can test this hypothesis.576 25689.124 0.865 admin  6767.633 dept7  3525.8798 5985.578 8578.03 0.g.71 15803..524 1. I have a dummy for each department.987 9681.21 24710.72 0.675 0.766 0.17 11944.38544 4879.9 4791.575 6269.29 8408.156 5.251 27454.332 9.266 9957.9 4886.64 15411.702 3. the coefficient on the target variable.092 4.307 dept28  4073.8447 3916.64 0.396 11981.321 9027.65 3473.207 0.73 dept30  80.596 dept29  2279.833 19013.361 44411.23 0.452 14713.28 0. kinesiology).27 0. modern language.021 2.794 0.594 4902. with the testparm command.23 dept2  2500.56 0.54 0. Err.43 dept4  1278.2186e+11 349 635696291 Number of obs F( 33.519 4457.03 dept23  1540.978 11609.609 9383.67 0. indicating significant salary discrimination against women faculty.915 8031.75 30562.089 901.414 9520.81 0.18 0.44 asst  11771.78 chprof  14794. English. zero otherwise.331 dept17  1185.802 dept25  14089.11 0.86 0.475 1.4 5622.61 dept6  1969.323 dept14  167.149 4881.70 0.219 0.38 0. using an Ftest.38 20517.598 11270.037 5.21 0.643 _cons  38501.53 0.51 0.99 0.73 4324.36 0.928 1915.953 5229. female is negative and significant.54 0.491 6928. I have one more set of dummy variables to try.214 0. 
but that looks like we want to test hypotheses on the 104 .402 4293.85 4173.0000 0.804 0.92 0.495 19712. t P>t [95% Conf.33 dept10  24647.33 23572.94 dept8  462.88067 478.275 6448.33 dept11  13185.58 prof  46121.127 6701.68 dept5  9213. regress salary exp female admin chprof prof assoc asst dept2dept30 Source  SS df MS +Model  1.911 11064.646 13128.72 6567.33 0.443 0.86 0.8026 99.41 16232.3493 4317.66 3659.108 0. dance. classical studies.797 3397.456 2.741 1.945 dept18  5932.824 0. Std.063 5408.6446 15030 salary  Coef.360 9356.66 3726.92 0.975 0.6782 0.26 0. Maybe women tend to be overrepresented in departments that pay lower salaries to everyone.004 3830.89 33733. So.222 25856.9 dept20  8166.07088 2.557 10.000 15560. the coefficient on female isn’t significant any more.000 11620.63 dept3  19527.806 dept22  2940.000 15097. dept2=1 if in the second department.02 0.754 11186.411 12766. dept1=1 if the faculty person is in the first department.239 12450.782 15992.267 0. 316) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 350 20.494 0. etc.471 6078.744 5208.71 dept19  18642.000 5853.885 10893.208 6978.1388e+10 316 225910511 +Total  2.088 2650.
difference between the coefficient on dept1 and the coefficient on dept29. That is where the testparm command comes in.

. testparm dept2-dept30

 ( 1)  dept2 = 0
 ( 2)  dept3 = 0
 ( 3)  dept4 = 0
 ( 4)  dept5 = 0
 ( 5)  dept6 = 0
 ( 6)  dept7 = 0
 ( 7)  dept8 = 0
 ( 8)  dept9 = 0
 ( 9)  dept10 = 0
 (10)  dept11 = 0
 (11)  dept12 = 0
 (12)  dept14 = 0
 (13)  dept16 = 0
 (14)  dept17 = 0
 (15)  dept18 = 0
 (16)  dept19 = 0
 (17)  dept20 = 0
 (18)  dept21 = 0
 (19)  dept22 = 0
 (20)  dept23 = 0
 (21)  dept24 = 0
 (22)  dept25 = 0
 (23)  dept27 = 0
 (24)  dept28 = 0
 (25)  dept29 = 0
 (26)  dept30 = 0

       F( 26, 316) =    3.10
            Prob > F =    0.0000

The department dummies are highly significant as a group. So there is no significant salary discrimination against women once we control for field of study. Female physicists, economists, computer scientists, chemists, etc., apparently earn just as much as their male counterparts; the same is true for female French teachers and phys ed instructors. By the way, the departments that are significantly overpaid are dept 3 (Chemistry), dept 10 (Computer Science), and dept 11 (Economics).

It is important to remember that, when testing groups of dummy variables for significance, you must drop or retain all members of the group. If the group is significant, then all should be retained, even if only a few are individually significant. If the group is not significant, you should drop them all, even if one or two are significant.

Granger causality test

Another very useful test is available only for time series. In 1969 C.W.J. Granger published an article suggesting a way to test causality.² Suppose we are interested in testing whether crime causes prison (that is, puts people in prison), prison causes (deters) crime, both, or neither. Granger suggested running a regression of prison on lagged prison and lagged crime. If lagged crime is significant, then crime causes prison. Then do a regression of crime on lagged crime and lagged prison. If lagged prison is significant, then prison causes (or deters) crime.

² Granger, C.W.J., "Investigating causal relations by econometric models and cross-spectral methods," Econometrica 36 (1969), 424-438.
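Before turning to the Stata runs, the two-regression recipe can be sketched in Python with simulated data. This is an illustration only: the series are made up (not the Virginia data), and one lag is used here instead of the two lags used in the Stata example below.

```python
import numpy as np

# Simulated series in which lagged crime drives prison, but not vice versa.
rng = np.random.default_rng(1)
T = 200
crime = np.zeros(T)
prison = np.zeros(T)
for t in range(1, T):
    crime[t] = 0.7 * crime[t-1] + rng.normal()
    prison[t] = 0.7 * prison[t-1] + 0.5 * crime[t-1] + rng.normal()

def t_on_lagged_x(y, x):
    """Regress y_t on a constant, y_{t-1}, and x_{t-1}; return the
    t-statistic on x_{t-1}."""
    Y = y[1:]
    X = np.column_stack([np.ones(T - 1), y[:-1], x[:-1]])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ b
    s2 = resid @ resid / (len(Y) - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[2, 2])
    return b[2] / se

print(round(t_on_lagged_x(prison, crime), 1))  # large: crime Granger-causes prison
print(round(t_on_lagged_x(crime, prison), 1))  # small: prison does not cause crime
```

The significant t-statistic appears only in the equation where the lagged variable really belongs, which is exactly the logic of the test.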
crmaj L. causes crime if the coefficients sum to a positive number).prison ( 1) ( 2) L.8713522 .crmaj Source  SS df MS +Model  23. the kind you get sent to prison for: murder. 19) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 24 510. Does crime cause prison? regress prison L.0000 0.8333 _cons  6309.prison = 0 F( 1.) Let’s use two lags.2095943 4. 19) = Prob > F = 5.5650287 1.8684 0. tsset year (You have to tsset year to tell Stata that year is the time variable.463 3858.prison=0 ( 1) L. and burglary) and prison population per capita for Virginia from 1972 to 1999 (crimeva.prison LL.896 1287.prison Source  SS df MS +Model  63231512.prison = 0 F( 2.4399737 . Err.02 crmaj  Coef.crmaj LL.01590533 Number of obs F( 4.011338582 +Total  23. 19) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 24 31.prison+L2. .61 Number of obs F( 4.79 0.184 4466. test L. regress crmaj L.prison LL.36 0.1503896 4 5.38 0.2 4 15807878. lagged prison is significant in the crime equation. . Std.442401 L2  . t P>t [95% Conf.0000 0.82 0. Interval] +crmaj  L1  1.dta).crmaj LL.26 0. prison deters crime.0085951 prison  L1  1087.003 1.78759741 Residual  .61 0. I have data on crime (crmaj.262 0.8407 710.43 0.causes crime (deters crime if the coefficients sum to a negative number. test L.625 920.249 1324.9888 .215433065 19 .665 2689.prison LL.1 Residual  9578520.5613 11938.46 2. robbery.13 0.962 L2  1772.0468 So.0333 So.75 19 504132.422 1684.046 .206103 2.prison + L2. so it deters crime if the sum is significantly different from zero. rape. .9908 0.35 0.003715 . Does prison cause or deter crime? The sum of the coefficients is negative. major crime.671 +Total  72810033 23 3165653.prison = 0 L2. 19) = Prob > F = 3.030 680. assault.prison L.77 .10648 106 .3658227 23 1.000 .
We then test all the coefficients.96102 L2  .33 0.55675713 51 . One means perfect inequality.crmaj LL.crmaj = 0 F( 2. real per capita income.0000309 0. while the sociologists think that putting people in prison does no good at all. and burglary). If prison is significant and income inequality is not.8237 .92 0.csv). The sociologists think that crime is caused by income inequality. The Granger causality test is best done with control variables included in the regression as well as the lags of the target variables.0000314 0. You can get ties in tests of nonnested hypotheses.61 0. Err.7102647 .9964 = 0.crmaj ( 1) ( 2) L.4033407 0.52 0. If the reverse is true then the sociologists win.009 . . We believe that crime is deterred by punishment.005546219 47 . prison deters crime.03052465 Number of obs F( 4. but crime does not cause prisons. then the lagged dependent variables will serve as proxies for the omitted control variables. I have time series data for the US which has data on major crime (murder. Std. one person has all the money.9781387 . However. everyone has the same level of income. We create the supermodel by regressing the log of major crime per capita on all these variables. mainly imprisonment.5637167 .1597358 crmaj  L1  . regress lcrmajpc prisonpc unrate income gini Source  SS df MS +Model  1. then it is a tie.0000467 L2  . if the control variables are unavailable or the analysis is just preliminary.0000159 . We both agree that unemployment and income have an effect on crime.193013 2.20 0.129668 1.9676977 .0000806 _cons  .387802727 Residual  .prison  Coef.0000 = 0. test L. Interval] +prison  L1  1. t P>t [95% Conf.55121091 4 . rape.crmaj = 0 L2.0000488 .133937 .743 .78 0.33 = 0.000 1.0000191 .545344 . We don’t think so.1986008 7. then we win. 47) Prob > F Rsquared Adj Rsquared Root MSE = 52 = 3286. and income inequality measured by the Gini coefficient from 19471998 (gini. The Gini coefficient is a number between zero and one. 
How can we resolve this debate? We don't put income inequality in our regressions, and they don't put prison in theirs. The key is to create a supermodel with all of the variables included. If prison is significant and income inequality is not, then we win. If the reverse is true, then the sociologists win. If neither or both are significant, then it is a tie. You don't get ties in tests of nested hypotheses, but you can get ties in tests of nonnested hypotheses. (By the way, the J in J-test is for joint hypothesis.)

I have time series data for the US which has data on major crime (murder, rape, robbery, assault, and burglary), real per capita income, the unemployment rate, and income inequality measured by the Gini coefficient, from 1947 to 1998 (gini.csv). The Gini coefficient is a number between zero and one. Zero means perfect equality: everyone has the same level of income. One means perfect inequality: one person has all the money, everyone else has nothing. We create the supermodel by regressing the log of major crime per capita on all these variables.

. regress lcrmajpc prisonpc unrate income gini

(Number of obs = 52; F(4, 47) = 3286.33; Prob > F = 0.0000; R-squared = 0.9964; Adj R-squared = 0.9961; Root MSE = .01086)
The Ftest is easy in Stata. For example, use the crime1990.dta data set to estimate the following crime equation.

. regress lcrmaj prison metpct p1824 p2534

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------          F(  4,    46) =   19.52
       Model |  5.32840267     4  1.33210067          Prob > F      =  0.0000
    Residual |  3.13936681    46  .068247104          R-squared     =  0.6293
-------------+------------------------------          Adj R-squared =  0.5970
       Total |  8.46776947    50  .169355389          Root MSE      =  .26124

  (coefficient table omitted)

Suppose we want to know if the age groups are jointly significant. We know we can test that hypothesis with an Ftest, using Stata's test function to test the joint hypothesis that the coefficients on both variables are zero.

. test p1824 p2534

 ( 1)  p1824 = 0
 ( 2)  p2534 = 0

       F(  2,    46) =    0.12
            Prob > F =    0.8834

We cannot reject the null hypothesis that the age groups are not related to crime. (Although the fact that both variables have very low tratios is a clue.)

LM test

Suppose we want to test the joint hypothesis that some variables do not belong in a regression. Under the Ftest you have to estimate the model with all the variables, even if we already know that they are highly significant, then estimate the model with the variables being tested excluded, record the error sum of squares, and finally compute the Fratio. An alternative is to use the LM (for Lagrangian Multiplier, never mind) test. The primary advantage of the LM test is that you only need to estimate the main equation once. To do the LM test for the same hypothesis, estimate the model without the age group variables and save the residuals using the predict command.

. regress lcrmaj prison metpct

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------          F(  2,    48) =   40.39
       Model |  5.31143197     2  2.65571598          Prob > F      =  0.0000
    Residual |   3.1563375    48  .065757031          R-squared     =  0.6273
-------------+------------------------------          Adj R-squared =  0.6117
       Total |  8.46776947    50  .169355389          Root MSE      =  .25643

  (coefficient table omitted)

. predict e, resid

Now regress the residuals on all the variables, including the ones we left out. If they are truly irrelevant, they will not be significant and the Rsquare for the regression will be approximately zero.

. regress e prison metpct p1824 p2534

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------          F(  4,    46) =    0.06
       Model |  .016970705     4  .004242676          Prob > F      =  0.9926
    Residual |  3.13936681    46  .068247105          R-squared     =  0.0054
-------------+------------------------------          Adj R-squared = -0.0811
       Total |  3.15633751    50   .06312675          Root MSE      =  .26124

  (coefficient table omitted)

The Rsquare is almost equal to zero, as expected. The LM statistic is N times the Rsquare, which is distributed as chisquare with, in this case, two degrees of freedom, because there are two restrictions (two variables left out). To compute the LM test statistic, we need to get the Rsquare and the number of observations. Whenever we run regress, Stata creates a bunch of output that is available for further analysis. This output goes under the generic heading of "e" output. To see this output, use the ereturn list command.

. ereturn list

scalars:
                  e(N) =  51
               e(df_m) =  4
               e(df_r) =  46
                  e(F) =  .0621663918252367
                 e(r2) =  .0053767079385045
               e(rmse) =  .2612414678691742
                e(mss) =  .0169707049657037
                e(rss) =  3.139366808584275
               e(r2_a) =  -.0811122739798864
                 e(ll) =  -1.276850278809347
               e(ll_0) =  -1.414326247391365

macros:
             e(depvar) : "e"
                e(cmd) : "regress"
            e(predict) : "regres_p"
              e(model) : "ols"

matrices:
                  e(b) :  1 x 5
                  e(V) :  5 x 5

functions:
           e(sample)

Since the numbers we want are scalars, not variables, we need the scalar command to create the test statistic.

. scalar n=e(N)
. scalar R2=e(r2)
. scalar LM=n*R2

Now that we have the test statistic, we need to see if it is significant. We can compute the probvalue using Stata's chi2 (chisquare) function.

. scalar prob=1-chi2(2,LM)
. scalar list

      prob =  .87187776
        LM =   .2742121
        R2 =  .00537671
         n =         51

The probvalue is .87, so the LM statistic is very low. We cannot reject the null hypothesis that the age groups are not related to crime.

One modification of this test that is easier to apply in Stata is the socalled Fform of the LM test, which corrects for degrees of freedom and therefore has somewhat better small sample properties.3 To do the Fform, just apply the Ftest to the auxiliary regression. That is, after regressing the residuals on all the variables, including the age group variables, use Stata's test function to test the joint hypothesis that the coefficients on both variables are zero.

. test p1824 p2534

 ( 1)  p1824 = 0
 ( 2)  p2534 = 0

       F(  2,    46) =    0.12
            Prob > F =    0.8834

The results are identical to the original Ftest above, showing that the LM test is equivalent to the Ftest. The results are also similar to the nRsquare version of the test. I routinely use the Fform of the LM test because it is easy to do in Stata and corrects for degrees of freedom. Nevertheless, we will use the LM test later, especially for diagnostic testing of regression models.

3 See Kiviet, J. 1987. Testing Linear Econometric Models. Amsterdam: University of Amsterdam.
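Outside Stata, the F-test/LM-test arithmetic can be sketched in a few lines. Below is a minimal pure-Python illustration on simulated data (the variable names and numbers are made up, not the crime1990 data): fit the restricted model, fit the unrestricted model, compute F from the two error sums of squares, and compute LM = n * R-square from the auxiliary regression of the restricted residuals on all the regressors.

```python
# Sketch of the F-test and the nR^2 LM test on simulated data (assumed
# setup: y depends on x1 only, x2 is truly irrelevant).
import random

random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [1.5 * a + random.gauss(0, 1) for a in x1]

def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

x1d, x2d, yd = demean(x1), demean(x2), demean(y)

def two_var_resid(yv):
    """Residuals of OLS of yv on x1d and x2d, via Cramer's rule."""
    s11 = sum(a * a for a in x1d)
    s22 = sum(a * a for a in x2d)
    s12 = sum(a * b for a, b in zip(x1d, x2d))
    s1y = sum(a * c for a, c in zip(x1d, yv))
    s2y = sum(b * c for b, c in zip(x2d, yv))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return [c - b1 * a - b2 * b for a, b, c in zip(x1d, x2d, yv)]

# Restricted model (x1 only): save residuals and their sum of squares
b_r = sum(a * c for a, c in zip(x1d, yd)) / sum(a * a for a in x1d)
e_r = [c - b_r * a for a, c in zip(x1d, yd)]
ssr_r = sum(e * e for e in e_r)
# Unrestricted model (x1 and x2)
ssr_u = sum(e * e for e in two_var_resid(yd))

q, k = 1, 3                        # one restriction; parameters incl. intercept
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))

# LM: regress the restricted residuals on ALL regressors; LM = n * R^2
aux = two_var_resid(e_r)
r2 = 1 - sum(e * e for e in aux) / sum(e * e for e in e_r)
LM = n * r2
print(round(F, 2), round(LM, 2))
```

Because x2 is irrelevant by construction, both statistics come out small; with a truly relevant excluded variable they would blow up together, which is the equivalence the text demonstrates.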
9 REGRESSION DIAGNOSTICS

Influential observations

It would be unfortunate if an outlier dominated our regression in such a way that one observation determined the entire regression result. You do not want your conclusions to be driven by a single observation or even a few observations. Let's generate two completely independent variables, so there should not be a significant relationship between them.

. gen x=10*invnorm(uniform())
. gen y=10 + 5*invnorm(uniform())
. summarize y x

  (output omitted)

Let's graph these data.

. twoway (scatter y x) (lfit y x)

The regression confirms no significant relationship.

. regress y x

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------          F(  1,    49) =    1.94
       Model |  41.4133222     1  41.4133222          Prob > F      =  0.1703
    Residual |  1047.90398    49  21.3857955          R-squared     =  0.0380
-------------+------------------------------          Adj R-squared =  0.0184
       Total |   1089.3173    50   21.786346          Root MSE      =  4.6245

  (coefficient table omitted)

Now let's create an outlier.

. replace y=y+30 if state==30
(1 real change made)

. twoway (scatter y x) (lfit y x)

Let's try that regression again.

. regress y x

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------          F(  1,    49) =    6.83
       Model |  259.874802     1  259.874802          Prob > F      =  0.0119
    Residual |  1864.52027    49  38.0514342          R-squared     =  0.1223
-------------+------------------------------          Adj R-squared =  0.1044
       Total |  2124.39508    50  42.4879015          Root MSE      =  6.1686

  (coefficient table omitted)

So, one observation is determining whether we have a significant relationship or not. This observation is called "influential." State 30, which happens to be New Hampshire, is said to dominate the regression. It works like this: suppose we simply ran the regression with and without New Hampshire. With NH we get a significant result and a relatively large coefficient on x. Without NH we should get a smaller and insignificant coefficient.

. regress y x if state ~=30

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------          F(  1,    48) =    1.61
       Model |  35.0543824     1  35.0543824          Prob > F      =  0.2112
    Residual |   1047.6542    48  21.8261292          R-squared     =  0.0324
-------------+------------------------------          Adj R-squared =  0.0122
       Total |  1082.70858    49  22.0960935          Root MSE      =  4.6718

  (coefficient table omitted)

By dropping NH we are back to the original result, reflecting the (true) relationship for the other 50 observations. It is sometimes easy to find outliers and influential observations, but not always. If you have a large data set, or lots of variables, you could easily miss an influential observation. So how do we avoid this influential observation trap? We could run the regression 51 times, dropping each state in turn and seeing what happens, but that would be tedious. Instead we can use matrix algebra and computer technology to achieve the same result.

DFbetas

Belsley, Kuh, and Welsch4 suggest dfbetas (scaled dfbeta) as the best measure of influence:

    DFbetas = (b - b(i)) / S(b)

where S(b) is the standard error of the estimated coefficient b, and b(i) is the estimated value of the coefficient omitting observation i. BKW suggest, as a rule of thumb, that an observation is influential if

    |DFbetas| > 2/sqrt(N)

where N is the sample size, although some researchers suggest that DFbetas should be greater than one (that is, greater than one standard error). Stata has a function called dfbeta which helps you avoid this problem. After regress, invoke the dfbeta command.

. dfbeta
                 DFx:  DFbeta(x)

4 Belsley, D., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics. New York, NY: John Wiley and Sons.
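Under the hood, dfbeta is doing the equivalent of the drop-one-observation loop we wanted to avoid doing by hand. Here is a small Python sketch on simulated data (the planted outlier at index 30 and all numbers are made up; as a simplification, the change in the slope is scaled by the full-sample standard error):

```python
# Drop-one sketch of DFbetas: refit the slope without each observation and
# flag observations whose removal moves the slope by more than 2/sqrt(N).
import math
import random

random.seed(1)
n = 51
x = [random.gauss(0, 10) for _ in range(n)]
y = [10 + random.gauss(0, 5) for _ in range(n)]   # no true x-y relationship
x[30] = 25.0                                      # give "state 30" high leverage
y[30] = 40.0                                      # and an outlying y, as in the text

def slope_and_se(xs, ys):
    """OLS slope of y = a + b*x and the slope's standard error."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxx = sum((a - mx) ** 2 for a in xs)
    b = sum((a - mx) * (c - my) for a, c in zip(xs, ys)) / sxx
    a0 = my - b * mx
    ssr = sum((c - a0 - b * a) ** 2 for a, c in zip(xs, ys))
    return b, math.sqrt(ssr / (m - 2) / sxx)

b_full, se_full = slope_and_se(x, y)
cutoff = 2 / math.sqrt(n)                         # BKW rule of thumb
flagged = []
for i in range(n):
    b_i, _ = slope_and_se(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    dfbeta = (b_full - b_i) / se_full             # scaled change from dropping i
    if abs(dfbeta) > cutoff:
        flagged.append(i)
print(flagged)
```

With this draw the planted observation is flagged; in general the flagged set depends on both how far the y value is from the line and how much leverage the x value has.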
This produces a new variable called DFx. Now list the observations that might be influential.

. list state stnm y x DFx if DFx>2/sqrt(51)

  (only one observation is listed: state 30, NH, with y = 42.83905 and DFx above 2)

There is only one observation for which DFbetas is greater than the threshold value, increasing the value of the coefficient by two standard errors. Clearly NH is influential. At this point it would behoove us to look more closely at the NH value to make sure it is not a typo or something.

What do we do about influential observations? We hope we don't have any. We might find simple coding errors. If we have some influential observations and they are not coding errors, we hope they don't affect the main conclusions. We might take logarithms (which has the advantage of squashing variance) or weight the regression to reduce their influence. If none of this works, we have to live with the results and confess the truth to the reader.

Multicollinearity

We know that if one variable in a multiple regression is an exact linear combination of one or more variables in the regression, the regression will fail: Stata will be forced to drop the collinear variable from the model. But what happens if a regressor is not an exact linear combination of some other variable or variables, but almost? This is called multicollinearity. It has the effect of increasing the standard errors of the OLS estimates, which means that variables that are truly significant appear to be insignificant in the collinear model. Also, the OLS estimates are unstable and fragile (small changes in the data or model can make a big difference in the regression estimates). It would be nice to know if we have multicollinearity. Here is an example.

. use http://www.ats.ucla.edu/stat/stata/modules/reg/multico, clear
. regress y x1 x2 x3 x4

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------          F(  4,    95) =   16.37
       Model |  5995.66253     4  1498.91563          Prob > F      =  0.0000
    Residual |  8699.33747    95  91.5719733          R-squared     =  0.4080
-------------+------------------------------          Adj R-squared =  0.3831
       Total |       14695    99  148.434343          Root MSE      =  9.5693

  (coefficient table omitted)

Variance inflation factors

Stata has a diagnostic known as the variance inflation factor, VIF, which indicates the presence of multicollinearity. The rule of thumb is that multicollinearity is a problem if the VIF is over 30.

. vif

  (output omitted)

Wow. All the VIF are over 30, with a mean VIF of 221.17. Let's look at the correlations among the independent variables.

. correlate x1-x4
(obs=100)

  (output omitted)

x4 appears to be the major problem here. It is highly correlated with all the other variables. Perhaps we should drop it and try again.

. regress y x1 x2 x3

      Source |       SS       df       MS              Number of obs =     100
-------------+------------------------------          F(  3,    96) =   21.69
       Model |  5936.21931     3  1978.73977          Prob > F      =  0.0000
    Residual |  8758.78069    96  91.2372989          R-squared     =  0.4040
-------------+------------------------------          Adj R-squared =  0.3853
       Total |       14695    99  148.434343          Root MSE      =  9.5518

  (coefficient table omitted)

. vif

  (output omitted)

Much better: the variables are all significant, and with a mean VIF of 1.17 there is no multicollinearity at all. If we had not checked for multicollinearity, we might have concluded that none of the variables were related to y. So doing a VIF check is generally a good idea. Be careful, however: dropping variables is not without its dangers. It could be that x4 is a relevant variable, and the result of dropping it is that the remaining coefficients are biased and inconsistent.
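The VIF arithmetic itself is simple: regress each x_j on the other regressors and compute 1/(1 - R-square). A minimal Python sketch on simulated data (made-up variables, with x3 built to be almost a linear combination of x1 and x2):

```python
# VIF = 1/(1 - R^2_j) from the auxiliary regression of each regressor on
# the others; x3 is nearly collinear with x1 and x2 by construction.
import random

random.seed(2)
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [a + b + random.gauss(0, 0.05) for a, b in zip(x1, x2)]

def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def r2_two(yv, u, w):
    """R^2 of OLS of yv on u and w (all demeaned), via Cramer's rule."""
    suu = sum(a * a for a in u)
    sww = sum(a * a for a in w)
    suw = sum(a * b for a, b in zip(u, w))
    suy = sum(a * c for a, c in zip(u, yv))
    swy = sum(b * c for b, c in zip(w, yv))
    det = suu * sww - suw * suw
    b1 = (suy * sww - swy * suw) / det
    b2 = (suu * swy - suw * suy) / det
    ssr = sum((c - b1 * a - b2 * b) ** 2 for a, b, c in zip(u, w, yv))
    return 1 - ssr / sum(c * c for c in yv)

x1d, x2d, x3d = demean(x1), demean(x2), demean(x3)
vif_x3 = 1 / (1 - r2_two(x3d, x1d, x2d))
vif_x1 = 1 / (1 - r2_two(x1d, x2d, x3d))
print(round(vif_x3), round(vif_x1))   # both far above the threshold of 30
```

Note that collinearity infects every VIF, not just the constructed variable's: because x3 is nearly x1 + x2, the auxiliary regression for x1 also fits almost perfectly.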
10 HETEROSKEDASTICITY

The GaussMarkov assumptions can be summarized as

    Yi = α + βXi + ui,   ui ~ iid(0, σ²)

where ~iid(0, σ²) means "is independently and identically distributed with mean zero and variance σ²". In this chapter we learn how to deal with the possibility that the "identically" assumption is violated, that is, that the error term, ui, does not come from a distribution with exactly the same mean and variance for each observation. Another way of looking at the GM assumptions is to glance at the expression for the variance of the error term, σ²: if there is no i subscript, then the variance is the same for all observations. Constant variance is homoskedasticity. When the identically distributed assumption is violated, it means that the error variance differs across observations. This nonconstant variance is called heteroskedasticity (hetero = different; skedastic = scattering, from the Greek skedastikos).

If the data are heteroskedastic, ordinary least squares estimates are still unbiased, but they are not efficient, and our standard errors and tratios are no longer correct. It depends on the exact kind of heteroskedasticity whether we are over- or underestimating our tratios, but in either case, we will make mistakes in our hypothesis tests. This means that we have less confidence in our estimates. So, it would be nice to know if our data suffer from this problem.

Testing for heteroskedasticity

The first thing to do is not really a test, but simply good practice: look at the evidence. The residuals ei from the OLS regression are estimates of the unknown error term ui, so the first thing to do is graph the residuals and see if they appear to be getting more or less spread out as Y and X increase.

BreuschPagan test

There are two widely used tests for heteroskedasticity. The first is due to Breusch and Pagan. It is an LM test. The basic idea is as follows. Consider the following multiple regression model.

    Yi = β0 + β1X1i + β2X2i + ui
What we want to know is if the variance of the error term is increasing with one of the variables in the model (or some other variable). So we need an estimate of the variance of the error term at each observation. We use the OLS residual, e, as our estimate of the error term, u. The usual estimator of the variance of e is

    Σ(ei - ē)²/n

However, we know that ē = 0 in any OLS regression, and if we want the variance of e for each observation, we have to set n = 1. Therefore our estimate of the variance of ui is ei². If there is no heteroskedasticity, the squared residuals will be unrelated to any of the independent variables. So an auxiliary regression of the form

    ei²/σ̂² = a0 + a1Xi1 + a2Xi2

where σ̂² is the estimated error variance from the original OLS regression, should have an Rsquare of zero. The test statistic is

    nR² ~ χ²

where the degrees of freedom is the number of variables on the right hand side of the auxiliary test equation. Alternatively, we could test whether the squared residual increases with the predicted value of y or some other suspected variable.

As an example, consider the following regression of major crime per capita across 51 states (including DC) in 1990 (hetero.dta) on prison population per capita, the percent of the population living in metropolitan areas (metpct), and real per capita personal income.

. regress crmaj prison metpct rpcpi

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------          F(  3,    47) =   43.17
       Model |  1805.01389     3  601.671298          Prob > F      =  0.0000
    Residual |  654.970383    47  13.9355401          R-squared     =  0.7338
-------------+------------------------------          Adj R-squared =  0.7168
       Total |  2459.98428    50  49.1996855          Root MSE      =   3.733

------------------------------------------------------------------------------
       crmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |   2.952673   .3576379     8.26   0.000     2.233198    3.672148
      metpct |   .1418042   .0328781     4.31   0.000     .0756619    .2079464
       rpcpi |  -.0004919   .0003134    -1.57   0.123    -.0011224    .0001386
       _cons |    6.57136   3.349276     1.96   0.056    -.1665135    13.30923
------------------------------------------------------------------------------

Let's plot the residuals on population (which we suspect might be a problem variable).

. predict e, resid
. twoway (scatter e pop)
  [Scatter plot of the residuals against pop; pop runs from 0 to about 30,000.]
Hard to say. If we ignore the outlier at a population of about 30,000, it looks like the variance of the residuals is getting larger with larger population. Let's see if the BreuschPagan test can help us.
The BreuschPagan test is invoked in Stata with the hettest command after regress. The default, if you don’t include a varlist is to use the predicted value of the dependent variable.
. hettest BreuschPagan / CookWeisberg test for heteroskedasticity Ho: Constant variance Variables: fitted values of crmaj chi2(1) Prob > chi2 = = 0.17 0.6819
Nothing appears to be wrong. Let’s see if there is a problem with any of the independent variables.
. hettest prison metpct rpcpi BreuschPagan / CookWeisberg test for heteroskedasticity Ho: Constant variance Variables: prison metpct rpcpi chi2(3) Prob > chi2 = = 4.73 0.1923
Still nothing. Let’s see if there is a problem with population (since the dependent variable is measured as a per capita variable).
. hettest pop BreuschPagan / CookWeisberg test for heteroskedasticity Ho: Constant variance Variables: pop chi2(1) Prob > chi2 = = 3.32 0.0683
Ah ha! Population does seem to be a problem. We will consider what to do about it below.
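The mechanics behind hettest are easy to reproduce by hand. Here is a minimal Python sketch on simulated data (the driver variable z plays the role of population; all names and numbers are made up): run OLS, square the residuals, regress them on the suspect variable, and form LM = n * R-square.

```python
# Breusch-Pagan sketch: squared OLS residuals regressed on a suspect
# variable; the error's standard deviation grows with z by construction.
import random

random.seed(3)
n = 400
z = [random.uniform(1.0, 10.0) for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, zi) for xi, zi in zip(x, z)]

def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

xd, yd = demean(x), demean(y)
b = sum(a * c for a, c in zip(xd, yd)) / sum(a * a for a in xd)
e2 = [(c - b * a) ** 2 for a, c in zip(xd, yd)]      # squared residuals

e2d, zd = demean(e2), demean(z)
g = sum(a * c for a, c in zip(zd, e2d)) / sum(a * a for a in zd)
ssr = sum((c - g * a) ** 2 for a, c in zip(zd, e2d))
r2 = 1 - ssr / sum(c * c for c in e2d)
LM = n * r2     # compare with chi-square, 1 degree of freedom
print(round(LM, 1))
```

Because the heteroskedasticity is built in, LM lands far beyond the 5% chi-square(1) critical value of 3.84; with homoskedastic errors it would typically sit near zero.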
White test
The other widely used test for heteroskedasticity is due to Halbert White (an econometrician at UCSD). White modifies the BreuschPagan auxiliary regression as follows.
    ei² = a0 + a1Xi1 + a2Xi2 + a3Xi1² + a4Xi2² + a5Xi1Xi2

The squared and interaction terms are used to capture any nonlinear relationships that might exist. The number of regressors in the auxiliary test equation is k(k+1)/2 - 1, where k is the number of parameters (including the intercept) in the original OLS regression. Here k = 4, so 4(5)/2 - 1 = 10 - 1 = 9. The test statistic is

    nR² ~ χ²(k(k+1)/2 - 1)
To invoke the White test use imtest, white. The im stands for information matrix.
. imtest, white White's test for Ho: homoskedasticity against Ha: unrestricted heteroskedasticity chi2(9) Prob > chi2 = = 6.63 0.6757
Cameron & Trivedi's decomposition of IMtest Source  chi2 df p +Heteroskedasticity  6.63 9 0.6757 Skewness  4.18 3 0.2422 Kurtosis  2.28 1 0.1309 +Total  13.09 13 0.4405 
Ignore Cameron and Trivedi’s decomposition. All we care about is heteroskedasticity, which doesn’t appear to be a problem according to White. The White test has a slight advantage over the BrueshPagan test in that it is less reliant on the assumption of normality. However, because it computes the squares and crossproducts of all the variables in the regression, it sometimes runs out of space or out of degrees of freedom and will not compute. Finally, the White test doesn’t warn us of the possibility of heteroskedasticity with a variable that is not in the model, like population. (Remember, there is no heteroskdeasticity associated with prison, metpct, or rpcpi.) For these reasons, I prefer the BreuschPagan test.
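The regressor count is just combinatorics: the levels of the slope variables plus all their squares and pairwise cross-products. A quick Python check, using the variable names from the regression above purely as labels:

```python
# Count White's auxiliary regressors: levels, squares, and cross-products
# of the slope variables; with k parameters (incl. intercept) this is
# k(k+1)/2 - 1.
from itertools import combinations_with_replacement

def white_terms(slopes):
    terms = list(slopes)                                       # levels
    terms += ["*".join(p) for p in combinations_with_replacement(slopes, 2)]
    return terms                                               # + squares/crosses

slopes = ["prison", "metpct", "rpcpi"]
terms = white_terms(slopes)
k = len(slopes) + 1                      # parameters including the intercept
print(len(terms))                        # 9, matching chi2(9) in the output
```

Three levels, three squares, and three cross-products give the nine degrees of freedom reported by imtest.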
Weighted least squares
Suppose we know what is causing the heteroskedasticity. For example, suppose the model is (in deviations)
    yi = βxi + ui   where var(ui) = zi²σ²
Consider the weighted regression, where we divide both sides by zi
    yi/zi = β(xi/zi) + ui/zi

What is the variance of the new error term, ui/zi?

    var(ui/zi) = var(ci·ui)   where ci = 1/zi
    var(ui/zi) = ci²·var(ui) = zi²σ²/zi² = σ²
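This algebra can be checked by simulation. A short Python sketch (σ is assumed to be 2 and z is drawn uniformly; all values are made up): draw errors whose standard deviation is proportional to z, divide by z, and confirm the result has constant spread equal to σ.

```python
# Numerical check that dividing by z_i flattens the variance when
# var(u_i) = z_i^2 * sigma^2.
import random
import statistics

random.seed(4)
sigma = 2.0
z = [random.uniform(1.0, 5.0) for _ in range(20000)]
u = [random.gauss(0, sigma * zi) for zi in z]     # var(u_i) = z_i^2 sigma^2
w = [ui / zi for ui, zi in zip(u, z)]             # weighted error u_i / z_i
print(round(statistics.pstdev(w), 1))             # close to sigma = 2.0
```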
So, weighting by 1/zi causes the variance to be constant. Applying OLS to this weighted regression will be unbiased, consistent and efficient (relative to unweighted least squares). The problem with this approach is that you have to know the exact form of the heteroskedasticity. If you think you know you have heteroskedasticity and correct it with weighted least squares, you'd better be right. If you were wrong and there was no heteroskedasticity, you have created it: you could make things worse by using WLS inappropriately. There is one case where weighted least squares is clearly appropriate. Suppose that the true relationship between two variables (in deviations) is
    yiT = β·xiT + uiT,   uiT ~ IN(0, σ²)

However, suppose the data is only available at the state level, where it is aggregated across n individuals. Summing the relationship across the state's population, n, yields

    Σ(i=1..n) yiT = β·Σ(i=1..n) xiT + Σ(i=1..n) uiT

or, for each state,

    y = βx + u.
We want to estimate the model as close as possible to the truth, so we divide the state totals by the state population to produce per capita values.

    y/n = β(x/n) + u/n

What is the variance of the error term?

    Var(u/n) = Var( Σ(i=1..n) uiT / n )
             = (1/n²)(Var(u1T) + Var(u2T) + ... + Var(unT))
             = nσ²/n² = σ²/n

So, we should weight by the square root of population:

    sqrt(n)·(y/n) = β·sqrt(n)·(x/n) + sqrt(n)·(u/n)

    Var( sqrt(n)·(u/n) ) = n·Var(u/n) = n·(σ²/n) = σ²
which is homoskedastic. This does not mean that it is always correct to weight by the square root of population. After all, the error term might have some heteroskedasticity in addition to being an average. The most common practice in crime studies is to weight by population, not the square root, although some researchers weight by the square root of population. We can do weighted least squares in Stata with the [weight=variable] option in the regress command. Put the [weight=variable] in square brackets after the varlist.
. regress crmaj prison metpct rpcpi [weight=pop] (analytic weights assumed) (sum of wgt is 2.4944e+05) Source  SS df MS +Model  1040.92324 3 346.974412 Residual  842.733646 47 17.9305031 +Total  1883.65688 50 37.6731377 Number of obs F( 3, 47) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 51 19.35 0.0000 0.5526 0.5241 4.2344
crmaj  Coef. Std. Err. t P>t [95% Conf. Interval] +prison  3.470689 .7111489 4.88 0.000 2.040042 4.901336 metpct  .1887946 .058493 3.23 0.002 .0711219 .3064672 rpcpi  .0006482 .0004682 1.38 0.173 .0015902 .0002938 _cons  4.546826 4.851821 0.94 0.353 5.213778 14.30743 
Now let’s see if there is any heteroskedasticity.
. hettest pop
BreuschPagan / CookWeisberg test for heteroskedasticity Ho: Constant variance Variables: pop chi2(1) Prob > chi2 . hettest BreuschPagan / CookWeisberg test for heteroskedasticity Ho: Constant variance Variables: fitted values of crmaj chi2(1) Prob > chi2 = = 1.07 0.3010 = = 0.60 0.4386
. hettest prison metpct rpcpi

BreuschPagan / CookWeisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: prison metpct rpcpi

         chi2(3)      =     2.93
         Prob > chi2  =   0.4021

The heteroskedasticity seems to have been taken care of. If the correction had made the heteroskedasticity worse instead of better, then we would have tried weighting by 1/pop, the inverse of population, rather than by population itself.

Robust standard errors and tratios

Since heteroskedasticity does not bias the coefficient estimates, we could simply correct the standard errors and resulting tratios so our hypothesis tests are valid. Recall how we derived the mean and variance of the sampling distribution of β̂. We started with a simple regression model in deviations,

    yi = βxi + ui

    β̂ = Σxiyi/Σxi² = Σxi(βxi + ui)/Σxi² = βΣxi²/Σxi² + Σxiui/Σxi² = β + Σxiui/Σxi²

The mean is

    E(β̂) = E(β + Σxiui/Σxi²) = β + ΣxiE(ui)/Σxi² = β   because E(ui) = 0 for all i.

The variance of β̂ is the expected value of the square of the deviation of β̂ from β:

    Var(β̂) = E(β̂ - β)²,   where β̂ - β = Σxiui/Σxi²

The x's are fixed numbers, which means

    Var(β̂) = E(Σxiui/Σxi²)² = (1/(Σxi²)²)·E(x1u1 + x2u2 + ... + xnun)²
           = (1/(Σxi²)²)·E(x1²u1² + x2²u2² + ... + xn²un² + x1x2u1u2 + x1x3u1u3 + ...)

We know that E(ui²) = σ² (identical) and E(uiuj) = 0 for i ≠ j (independent), so

    Var(β̂) = (x1²σ² + x2²σ² + ... + xn²σ²)/(Σxi²)²

Factor out the constant variance, σ², to get

    Var(β̂) = σ²Σxi²/(Σxi²)² = σ²/Σxi²

In the case of heteroskedasticity we cannot factor out σ², because E(ui²) = σi². It is still true that E(uiuj) = 0 for i ≠ j (independent), so

    Var(β̂) = (x1²σ1² + x2²σ2² + ... + xn²σn²)/(Σxi²)² = Σxi²σi²/(Σxi²)²

To make this formula operational, we need an estimate of σi². As above, we will use the square of the residual, ei². So,

    σ̂β̂² = Σxi²ei²/(Σxi²)²

Take the square root to produce the robust standard error of beta:

    Sβ̂ = sqrt( Σxi²ei²/(Σxi²)² )

This is also known as the heteroskedastic consistent standard error (HCSE), or the Eicker, Huber, or White standard error, or the sandwich standard error. We will call them robust standard errors. Dividing β̂ by the robust standard error yields the robust tratio.

It is easy to get robust standard errors in Stata. For example, in the crime equation above we can see the robust standard errors and tratios by adding the option robust to the regress command.

. regress crmaj prison metpct rpcpi, robust
it is the standard methodology in crime regressions.349276 1.0005551 1.000 2.72 0. Using them allows us to avoid having to discover the particular form of the problem. 47) = Prob > F = Rsquared = Root MSE = 51 131.0006482 .000 2.0000 0.034 .0001386 _cons  6.01389 3 601.7338 3.0000807 _cons  6.1996855 Number of obs F( 3.818686 metpct  .546826 4.57136 3.875776 13.30923  It is possible to weight and use robust standard errors.03247 4.012 .98428 50 49.19 0.233198 3.2079464 rpcpi  . robust (analytic weights assumed) (sum of wgt is 2.1667412 17.000 .1665135 13. . Err.470689 .71 0.0017649 .37 0.952673 .1887946 .733  Robust crmaj  Coef.5227393 12. In fact.0000 0.000 .672148 metpct  . Interval] +prison  3. Err. 47) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 51 43.0756619 . 47) = Prob > F = Rsquared = Root MSE = 51 17. t P>t [95% Conf. Err.1418042 .2071255 rpcpi  .091 .96943  What is the bottom line with respect to our treatment of heteroskedasticity? Robust standard errors yield consistent estimates of the true standard error no matter what the form of the heteroskedasticity.97 0.Regression with robust standard errors Number of obs = F( 3. t P>t [95% Conf. Finally.73 0.17 0.122692 4.288113 metpct  . Interval] +prison  2.249 . Std.0011224 .57 0.671298 Residual  654.000 2.0010646 .26 0.96 0.5526 4.0328781 4.31 0. if there is no heteroskedasticity.7168 3.28 0. Std.337 4.4944e+05) Regression with robust standard errors Number of obs = F( 3.0004919 .0000 0. I recommend always using 125 .0004919 .60 0.733 crmaj  Coef.18 0. For these reasons. Interval] +prison  2.2344  Robust crmaj  Coef.57136 3.6700652 5. we get the usual standard errors anyway. regress crmaj prison metpct rpcpi [weight=pop]. Std.006661 2.7338 0.056 .3348802 rpcpi  .0726166 2.952673 . t P>t [95% Conf.0004684 _cons  4.683806 0.61998  Compare this with the original model.123 .617233 3.3576379 8.970383 47 13.0427089 .18 0.1418042 .0764828 .9355401 +Total  2459.0002847 1.0003134 1. 
Source  SS df MS +Model  1805.
always add “. That is. then you should weight in addition to using robust standard errors. If you know that a variable like population is a problem. The existence of a problem variable will become obvious in your library research when you find that most of the previous studies weight by a certain variable. robust” to the regress command.robust standard errors. This is becoming standard procedure among applied researchers. 126 .
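The robust-variance formula is simple enough to compute by hand. A Python sketch on simulated heteroskedastic data (in deviations; the error's standard deviation is assumed to grow with |x|, and all numbers are made up) compares the conventional standard error s²/Σx² with the robust Σx²e²/(Σx²)²:

```python
# Conventional vs. robust (White) standard error for a simple regression
# in deviations, with variance rising in |x| by construction.
import math
import random

random.seed(6)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1 + abs(xi)) for xi in x]

mx, my = sum(x) / n, sum(y) / n
xd = [xi - mx for xi in x]
yd = [yi - my for yi in y]
sxx = sum(a * a for a in xd)
b = sum(a * c for a, c in zip(xd, yd)) / sxx
e = [c - b * a for a, c in zip(xd, yd)]

se_ols = math.sqrt(sum(ei * ei for ei in e) / (n - 2) / sxx)   # s^2 / sum x^2
se_rob = math.sqrt(sum(a * a * ei * ei for a, ei in zip(xd, e)) / sxx ** 2)
print(round(se_ols, 3), round(se_rob, 3))
```

Here the error variance rises with x², so the robust standard error comes out noticeably larger than the conventional one; using the conventional value would overstate the tratio, which is exactly the mistake the text warns about.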
11 ERRORS IN VARIABLES

One of the GaussMarkov assumptions is that the independent variable, x, is fixed on repeated sampling. This is a very convenient assumption for proving that the OLS parameter estimates are unbiased. However, it is only necessary that x be independent of the error term, that is, that there be no covariance between the error term and the regressor or regressors:

    E(xiui) = 0

If this assumption is violated, then ordinary least squares is biased and inconsistent. This could happen if there are errors of measurement in the independent variable or variables. For example, suppose the true model, expressed in deviations, is

    y = βxT + ε

where xT is the true value of x. However, we only observe x = xT + f, where f is random measurement error. So xT = x - f and

    y = β(x - f) + ε = βx + (ε - βf)

We have a problem: x is correlated with the error term (through f),

    cov(x, ε - βf) = E[(xT + f)(ε - βf)] = E(xTε + fε - βxTf - βf²) = -βE(f²) = -βσf² ≠ 0

if f has any variance. This means that the OLS estimate is biased:

    β̂ = Σxy/Σx² = Σ(xT + f)y/Σ(xT + f)²
      = (βΣxT² + ΣxTε + βΣxTf + Σfy)/(ΣxT² + 2ΣxTf + Σf²)

    plim(β̂) = plim[ βΣxT²/(ΣxT² + Σf²) ]

(We are assuming that plim Σfy = plim ΣxTf = plim ΣxTε = 0.)
Dividing top and bottom by ΣxT²,

    plim(β̂) = β / (1 + Σf²/ΣxT²) = β / (1 + var(f)/var(xT))

Since the two variances cannot be negative, the true value of beta is divided by a number greater than one, so the OLS estimate is biased downward. We can conclude that plim(β̂) ≠ β if var(f) ≠ 0, which means that the OLS estimate is biased and inconsistent if the measurement error has any variance at all.

Cure for errors in variables

There are only two things we can do to fix errors in variables. The first is to use the true value of x. Then x would be highly correlated with itself and not correlated with the error term, making it the perfect instrument. Since this is usually not possible, we have to fall back on statistics. Suppose we know of a variable, z, that is (1) highly correlated with xT, but (2) uncorrelated with the error term. This variable, called an instrument or instrumental variable, allows us to derive consistent (not unbiased) estimates. Write the model as

    y = βx + v   where v = (ε - βf)

Multiply both sides by z,

    zy = βzx + zv

The error is

    e = zy - β̂zx

The sum of squares of error is

    Σe² = Σ(zy - β̂zx)²

To minimize the sum of squares of error, we take the derivative with respect to β̂ and set it equal to zero:

    dΣe²/dβ̂ = -2Σzx(Σzy - β̂Σzx) = 0   (note the chain rule)

The derivative is equal to zero if and only if the expression in parentheses is zero:

    Σzy - β̂Σzx = 0   which implies   β̂ = Σzy/Σzx

This is the instrumental variable (IV) estimator. It collapses to the OLS estimator if x can be an instrument for itself, which would happen if x is not measured with error. This IV estimator is consistent because z is not correlated with v:

    plim(β̂) = plim(Σzy/Σzx) = plim(Σz(βx + v)/Σzx)
            = plim((βΣzx + Σzv)/Σzx) = β + plim(Σzv)/plim(Σzx) = β

since plim(Σzv) = 0 because z is uncorrelated with the error term.
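Both the attenuation result and the IV fix are easy to see in a simulation. A Python sketch (assumed setup: β = 2 and var(f) = var(xT) = 1, so plim of the OLS slope is 2/(1+1) = 1, while Σzy/Σzx stays consistent; the instrument is a made-up noisy copy of xT):

```python
# Attenuation bias versus the IV estimator sum(zy)/sum(zx), on simulated
# data in deviations.
import random

random.seed(8)
n = 50000
beta = 2.0
xT = [random.gauss(0, 1) for _ in range(n)]       # true regressor
z = [a + random.gauss(0, 0.5) for a in xT]        # instrument: tracks xT, not f
f = [random.gauss(0, 1) for _ in range(n)]        # measurement error
x = [a + b for a, b in zip(xT, f)]                # observed, error-ridden x
y = [beta * a + random.gauss(0, 1) for a in xT]

b_ols = sum(a * c for a, c in zip(x, y)) / sum(a * a for a in x)
b_iv = sum(a * c for a, c in zip(z, y)) / sum(a * c for a, c in zip(z, x))
print(round(b_ols, 2), round(b_iv, 2))   # b_ols near 1, b_iv near beta = 2
```

OLS is cut roughly in half, exactly as β/(1 + var(f)/var(xT)) predicts, while the IV estimate sits at the true value.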
To see the consistency directly,

plim(β̂) = plim(Σyz/Σxz) = plim[Σ(βx + v)z / Σxz] = plim[(βΣxz + Σvz)/Σxz] = β + plim(Σvz)/plim(Σxz) = β

since plim(Σvz) = 0 because z is uncorrelated with the error term.

Two stage least squares

Another way to get IV estimates is called two stage least squares (TSLS or 2sls). Assume that the relationship between the instrument z and the possibly mismeasured variable, x, is linear:

x = γz + w

We can estimate this relationship using ordinary least squares. In the first stage we regress the variable with the measurement error, x, on the instrument, z, using ordinary least squares, saving the predicted values:

γ̂ = Σxz/Σz² and x̂ = γ̂z

Note that x̂ is a linear combination of z. Since z is independent of the error term, so is x̂. In the second stage, we substitute the predicted values of x for the actual x and apply ordinary least squares again:

y = βx̂ + v

The resulting estimate of beta is the instrumental variable estimate above:

β̂ = Σx̂y/Σx̂² = Σ(γ̂z)y/Σ(γ̂z)² = γ̂Σzy/(γ̂²Σz²) = Σzy/(γ̂Σz²) = Σzy/Σzx

since γ̂Σz² = Σxz. The result is the consistent IV estimate above.

Hausman-Wu test

We can test for the presence of errors in variables if we have an instrument.

First stage: regress x on z, that is, x = γz + w, and save the predicted values, x̂ = γ̂z.

Second stage: substitute x̂ for x and apply OLS again.
That is, estimate

y = βx̂ + v

To see why this is a test, note that

x = x̂ + ŵ

where ŵ is the residual from the regression of x on z. So

y = βx + v = β(x̂ + ŵ) + v = βx̂ + βŵ + v

Since x̂ is simply a linear combination of z, it is uncorrelated with the error term. However, if there is measurement error in x, it is all in the residual ŵ, so the estimate of β from that parameter will be inconsistent. Let's call the coefficient on ŵ, δ:

y = βx̂ + δŵ + v

But x̂ = x − ŵ, so

y = β(x − ŵ) + δŵ + v = βx + (δ − β)ŵ + v

Applying OLS to this equation will yield a consistent estimate of β as the parameter multiplying x. The null hypothesis that β = δ can be tested with a standard t-test on the coefficient multiplying ŵ. If it is significant, then the two parameter estimates are different and there is significant errors-in-variables bias. If it is not significant, then the two parameters are equal and there is no significant bias. So we have an easy way to test whether β = δ. We can implement this test in Stata as follows.

Wooldridge has a nice example and the data set is available on the web. The data are in the Stata data set called errorsinvars.dta. The dependent variable is the log of the wage earned by working women. The independent variable is education. The estimated coefficient is the return to education. First let's do the regression with ordinary least squares.

. regress lwage educ

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  1,   426) =   56.93
       Model |  26.3264237     1  26.3264237           Prob > F      =  0.0000
    Residual |  197.001028   426  .462443727           R-squared     =  0.1179
-------------+------------------------------           Adj R-squared =  0.1158
       Total |  223.327451   427  .523015108           Root MSE      =  .68003

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1086487   .0143998     7.55   0.000     .0803451    .1369523
       _cons |  -.1851969   .1852259    -1.00   0.318    -.5492674    .1788735

The return to education is a nice ten percent. However, the high wages earned by highly educated women might be due to more intelligence. Brighter women will probably do both: go to college and earn high wages. So how much of the return to education is due to the college education? Let's use father's education as an instrument. It is probably correlated with his daughter's brains and is independent of her wages. We can implement instrumental variables with ivreg. The first stage regression varlist is put in parentheses.

. ivreg lwage (educ=fatheduc)

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  1,   426) =    2.84
       Model |  20.8673618     1  20.8673618           Prob > F      =  0.0929
    Residual |  202.460089   426  .475258426           R-squared     =  0.0934
-------------+------------------------------           Adj R-squared =  0.0913
       Total |  223.327451   427  .523015108           Root MSE      =  .68939

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0591735   .0351418     1.68   0.093    -.0098994    .1282463
       _cons |   .4411035   .4461018     0.99   0.323    -.4357311    1.317938

Instrumented:  educ
Instruments:   fatheduc

The instrumental variables regression shows a lower and less significant return to education. In fact, education is only significant at the .10 level. If the two estimates are significantly different, it must be due to measurement error. We can do the Hausman-Wu test by hand. First, run the first stage regression and save the residuals.

. regress educ fatheduc

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  1,   426) =   88.84
       Model |  384.841983     1  384.841983           Prob > F      =  0.0000
    Residual |  1845.35428   426  4.33181756           R-squared     =  0.1726
-------------+------------------------------           Adj R-squared =  0.1706
       Total |  2230.19626   427  5.22294206           Root MSE      =  2.0813

        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fatheduc |   .2694416   .0285863     9.43   0.000     .2132538    .3256295
       _cons |   10.23705   .2759363    37.10   0.000     9.694685    10.77942

. predict what, resid

Now add the residual to the original regression.

. regress lwage educ what

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  2,   425) =   29.80
       Model |  27.4648913     2  13.7324457           Prob > F      =  0.0000
    Residual |   195.86256   425  .460853082           R-squared     =  0.1230
-------------+------------------------------           Adj R-squared =  0.1189
       Total |  223.327451   427  .523015108           Root MSE      =  .67886

       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0591735   .0346051     1.71   0.088     -.008845    .1271919
        what |   .0597931   .0380427     1.57   0.117    -.0149823    .1345684
       _cons |   .4411035    .439289     1.00   0.316    -.4223459    1.304553

Note that the coefficient on educ is the instrumental variables estimate. There might still be some bias, because the coefficient on the residual what is almost significant at the .10 level. More research is needed.
12 SIMULTANEOUS EQUATIONS

We know that whenever an explanatory variable is correlated with the error term in a regression, the resulting ordinary least squares estimates are biased and inconsistent. Before we go on, we need to define some terms. Consider the simple Keynesian system:

C = α + βY + u
Y = C + I + G + X

where C is consumption, Y is national income (e.g., GDP), I is investment, G is government spending, and X is net exports. The above simultaneous equation system of two equations in two unknowns (C and Y) is known as the structural model. It is the theoretical model from the textbook or journal article, with an error term added for statistical estimation. The variables C and Y are known as the endogenous variables because their values are determined within the system. I, G, and X are exogenous variables because their values are determined outside the system. They are "taken as given" by the structural model. The terms α and β are the parameters of the system.

This structural model has two types of equations. The first equation (the familiar consumption function) is known as a behavioral equation, indicating some randomness. Behavioral equations have error terms. The second equation is an identity because it simply defines national income as the sum of all spending (endogenous plus exogenous). If we knew the values of the parameters (α and β) and the exogenous variables (I, G, X, and the exogenous error term u), we could solve for the corresponding values of C and Y. Substituting the second equation into the first yields the value for C:

C = α + β(C + I + G + X) + u = α + βC + βI + βG + βX + u
C − βC = α + βI + βG + βX + u
C(1 − β) = α + βI + βG + βX + u
C = (α + βI + βG + βX + u)/(1 − β)

(RF1)  C = [1/(1 − β)]α + [β/(1 − β)]I + [β/(1 − β)]G + [β/(1 − β)]X + [1/(1 − β)]u

The resulting equation is called the reduced form equation for C. I have labeled it RF1 for "reduced form 1." Note that it is a behavioral equation and that, in this case, it is linear. We could, if we had the data, estimate this equation with ordinary least squares.
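Solving the system numerically makes the reduced form concrete. A small Python sketch; the parameter values (α = 100, β = 0.8) and the exogenous spending figures are hypothetical:

```python
# Hypothetical structural parameters and exogenous values (not data from the text)
alpha, beta = 100.0, 0.8
I, G, X, u = 50.0, 40.0, 10.0, 0.0

# Solve the structural system directly: substitute Y = C + I + G + X into
# C = alpha + beta*Y + u and solve for C
C = (alpha + beta * (I + G + X) + u) / (1 - beta)

# RF1: same answer from the reduced form coefficients
C_rf = (alpha + beta * I + beta * G + beta * X + u) / (1 - beta)

# National income from the identity
Y = C + I + G + X
print(C, C_rf, Y)
```

Both routes give the same consumption level, which is the point of the reduced form: it expresses the endogenous variable entirely in terms of parameters and exogenous variables.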
We can derive the reduced form equation for Y by substituting the consumption function into the national income identity:

Y = α + βY + u + I + G + X
Y − βY = α + I + G + X + u
Y(1 − β) = α + I + G + X + u

(RF2)  Y = [1/(1 − β)]α + [1/(1 − β)]I + [1/(1 − β)]G + [1/(1 − β)]X + [1/(1 − β)]u

The two reduced form equations look very similar. They are both linear, with almost the same parameters, 1/(1 − β) or β/(1 − β), and both have the same variables on the right hand side, namely all of the exogenous variables. In the case of the simple Keynesian model, the coefficients from the reduced form equation for Y correspond to multipliers, giving the change in Y for a one-unit change in I or G or X. Because the reduced form equations have the same list of variables on the right hand side, all we need to write them down in general form is the list of endogenous variables (because we have one reduced form equation for each endogenous variable) and the list of exogenous variables. In the system above the endogenous variables are C and Y and the exogenous variables are I, G, and X (and the random error term u), so the reduced form equations can be written as

C = π₁₀ + π₁₁I + π₁₂G + π₁₃X + v₁
Y = π₂₀ + π₂₁I + π₂₂G + π₂₃X + v₂

where we are using the "pi's" to indicate the reduced form parameters; for example, π₂₁ = 1/(1 − β). OLS estimates of the reduced form equations are unbiased and consistent because the exogenous variables are taken as given and therefore uncorrelated with the error term. The bad news is that the parameters of the reduced form are "scrambled" versions of the structural form parameters.

We are now in position to show that the endogenous variable, Y, on the right hand side of the consumption function is correlated with the error term in the consumption equation. Look at the reduced form equation for Y (RF2) above. Note that Y is a function of u. Clearly, variation in u will cause variation in Y. Therefore, cov(Y, u) is not zero. As a result, the OLS estimate of β in the consumption function is biased and inconsistent.

Example: supply and demand

Consider the following structural supply and demand model (in deviations):

(S) q = αp + ε
(D) q = β₁p + β₂y + u

where q is the equilibrium quantity supplied and demanded, p is the price, and y is income. There are two endogenous variables (p and q) and one exogenous variable, y. We can derive the reduced form equations by solving by back substitution. Start with the supply curve,

q = αp + ε

Solve for p:

−αp = −q + ε
p = (q − ε)/α

Substitute into the demand equation:

q = β₁(q − ε)/α + β₂y + u

Multiply by α to get rid of fractions:

αq = β₁(q − ε) + αβ₂y + αu = β₁q − β₁ε + αβ₂y + αu
αq − β₁q = αβ₂y + αu − β₁ε
(α − β₁)q = αβ₂y + αu − β₁ε

(RF1)  q = [αβ₂/(α − β₁)]y + (αu − β₁ε)/(α − β₁)

or q = π₁₁y + v₁. This is RF1 in general form. Now we solve for p. Setting supply equal to demand,

αp + ε = β₁p + β₂y + u
αp − β₁p = β₂y + u − ε
(α − β₁)p = β₂y + u − ε

(RF2)  p = [β₂/(α − β₁)]y + (u − ε)/(α − β₁)

or p = π₂₁y + v₂. This is RF2 in general form. RF2 reveals that price is a function of u, so applying OLS to either the supply or demand curve will result in biased and inconsistent estimates. However, we can again appeal to instrumental variables (2sls) to yield consistent estimates. Remember, instruments are required to be highly correlated with the problem variable (p), but uncorrelated with the error term. The exogenous variable(s) in the model satisfy these conditions. In the supply and demand model, income is correlated with price (see RF2), but independent of the error term because it is exogenous. The good news is that the structural model itself produces the instruments. Suppose we choose the supply curve as our target equation. We can get a consistent estimate of α with the IV estimator:

α̂ = Σyq/Σyp

This estimator can be derived as follows. Start with the supply curve, q = αp + ε, and multiply both sides by y:
yq = αyp + yε

Now sum:

Σyq = αΣyp + Σyε

But Σyε = 0, since the instrument is not correlated with the error term. So Σyq = αΣyp and α̂ = Σyq/Σyp: we get the IV estimator again.

We can also use two stage least squares. In the first stage we regress the problem variable, p, on all the exogenous variables in the model. Here there is only one exogenous variable, y. That is, estimate the reduced form equation, RF2. The resulting estimate of π₂₁ is

π̂₂₁ = Σpy/Σy²

and the predicted value of p is p̂ = π̂₂₁y. In the second stage we regress q on p̂. The resulting estimator is

α̂ = Σqp̂/Σp̂² = π̂₂₁Σqy/(π̂₂₁²Σy²) = Σqy/(π̂₂₁Σy²) = Σqy/Σpy

which is the same as the IV estimator above.

Indirect least squares

Yet another method for deriving a consistent estimate of α is indirect least squares. Suppose we estimate the two reduced form equations above. Then we get unbiased and consistent estimates of π₁₁ and π₂₁. From the reduced form equations,

π₁₁ = αβ₂/(α − β₁) and π₂₁ = β₂/(α − β₁)

which implies that

π₁₁/π₂₁ = [αβ₂/(α − β₁)] / [β₂/(α − β₁)] = α

So,

plim(π̂₁₁/π̂₂₁) = α

because both π̂₁₁ and π̂₂₁ are consistent estimates and the ratio of two consistent estimates is consistent. (Remember, consistency "carries over.") Unfortunately, indirect least squares does not always work. In fact, we cannot get a consistent estimate of the demand equation because, no matter how we manipulate π̂₁₁ and π̂₂₁, we keep getting α instead of β₁ or β₂.
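The indirect least squares result for the supply slope can be checked by simulation. A Python sketch with hypothetical parameter values α = 1, β₁ = −1, β₂ = 0.5:

```python
import random

random.seed(5)
alpha, b1, b2 = 1.0, -1.0, 0.5      # supply slope, demand slope, income effect
n = 100_000
ys, ps, qs = [], [], []
for _ in range(n):
    y   = random.gauss(0, 1)        # exogenous income
    eps = random.gauss(0, 0.5)      # supply shock
    u   = random.gauss(0, 0.5)      # demand shock
    p = (b2 * y + u - eps) / (alpha - b1)   # RF2: equilibrium price
    q = alpha * p + eps                     # supply curve holds in equilibrium
    ys.append(y); ps.append(p); qs.append(q)

# OLS on the two reduced forms (regressions through the origin on y)
pi11 = sum(a * b for a, b in zip(ys, qs)) / sum(a * a for a in ys)   # q on y
pi21 = sum(a * b for a, b in zip(ys, ps)) / sum(a * a for a in ys)   # p on y

# Indirect least squares: the ratio recovers the supply slope alpha
print(round(pi11 / pi21, 2))
```

Each reduced-form slope converges to its π (here 0.25), and the ratio converges to α, even though OLS on the supply curve itself would be inconsistent.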
Suppose we expand the model above to add another explanatory variable to the demand equation: w for weather.

q = αp + ε
q = β₁p + β₂y + β₃w + u

Solving by back substitution, as before, yields the reduced form equations

q = [αβ₂/(α − β₁)]y + [αβ₃/(α − β₁)]w + (αu − β₁ε)/(α − β₁) = π₁₁y + π₁₂w + v₁
p = [β₂/(α − β₁)]y + [β₃/(α − β₁)]w + (u − ε)/(α − β₁) = π₂₁y + π₂₂w + v₂

We can use indirect least squares to estimate the slope of the supply curve as before:

π₁₁/π₂₁ = [αβ₂/(α − β₁)] / [β₂/(α − β₁)] = α, as before.

But also,

π₁₂/π₂₂ = [αβ₃/(α − β₁)] / [β₃/(α − β₁)] = α (again).

Whoops. Now we have two estimates of α. Given normal sampling error, we won't get exactly the same estimate, so which is right? Or which is more right? What is going on?

The identification problem

An equation is identified if we can derive estimates of the structural parameters from the reduced form. If we cannot, it is said to be unidentified. The demand curve was unidentified in the above examples because we could not derive any estimates of β₁ or β₂ from the reduced form equations. If an equation is identified, it can be "just" identified, which means the reduced form equations yield a unique estimate of the parameters of the equation of interest, or it can be "over" identified, in which case the reduced form yields multiple estimates of the same parameters of the equation of interest. The supply curve was just identified in the first example, but over identified when we added the weather variable. It would be nice to know in advance if the structural equation we are trying to estimate is identified. A simple rule, the so-called order condition for identification, does the job. To apply the order condition, we first have to choose the equation of interest. Let G be the number of endogenous variables in the equation of interest (including the dependent variable on the left hand side). Let K be the number of exogenous variables that are in the structural model, but excluded from the equation of interest. To be identified, the following relationship must hold:

K ≥ G − 1

Let's apply this rule to the supply curve in the first supply and demand model. G=2 (q and p) and K=1 (y is excluded from the supply curve), so 1 = 2 − 1 and the supply equation is just identified. Choosing the demand curve as the equation of interest, G=2, but K=0 because there are no exogenous variables excluded from the demand curve, so the demand curve is unidentified. This is why we got a unique estimate of α from indirect least squares, no estimate at all of the parameters of the demand curve, and multiple estimates of α with a minor change in the specification of the structural model.
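The order condition is mechanical enough to write down as a function. A small Python helper (an illustration, not a Stata command):

```python
def order_condition(K, G):
    """Order condition for identification.
    K: exogenous variables in the model but excluded from the equation of interest
    G: endogenous variables in the equation of interest (including the LHS variable)
    """
    if K < G - 1:
        return "unidentified"
    if K == G - 1:
        return "just identified"
    return "over identified"

# The cases worked out in the text:
print(order_condition(1, 2))   # supply curve, first model
print(order_condition(2, 2))   # supply curve, after weather enters demand
print(order_condition(0, 2))   # demand curve
```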
Finally, let's analyze the supply curve from the second supply and demand model. Choosing the supply curve as the equation of interest, we find that G=2 as before, but now K=2 (y and w are both excluded from the supply curve). So 2 > 2 − 1 and the supply curve is over identified. That is why we got two estimates from indirect least squares.

Two notes concerning the order condition. First, it is a necessary, but not sufficient, condition for identification. A more complicated condition, called the rank condition, is necessary and sufficient, but the order condition is usually good enough; no equation that fails the order condition will be identified. Second, there are rare cases where the order condition is satisfied but the equation is still not identified. Say we require that two variables be excluded from the equation of interest and we have two exogenous variables in the model but not in the equation. Looks good. However, if the two variables are really so highly correlated that one is simply a linear combination of the other, then we don't really have two independent variables. That is, there is always the possibility that the model will be identified according to the order condition, but fail to yield an estimate because the exogenous variables excluded from the equation of interest are too highly correlated with each other. This problem of "statistical identification" is extremely rare. And it is not a major problem because, if we try to estimate such an equation using two stage least squares or some other instrumental variable technique, the model will simply fail to yield an estimate. At that point we would know that the equation was unidentified. The good news is that it won't generate a wrong or biased estimate.

Illustrative example

Dennis Epple and Bennett McCallum, from Carnegie Mellon, have developed a nice example of a simultaneous equation model using real data.5 The market they analyze is the market for broiler chickens. The demand for broiler chicken meat is assumed to be a function of the price of chicken meat, real income, and the price of beef, a close substitute. Supply is assumed to be a function of price and the price of chicken feed (corn). Epple and McCallum have collected data for the US as a whole, from 1950 to 2001. The data are available in DandS.dta. They recommend estimating the model in logarithms. For reasons that will be discussed in a later chapter, we estimate the chicken demand model in first differences:

∆q = β₁∆p + β₂∆y + β₃∆pbeef + u

Since all the variables are logged, these first differences of logs are growth rates (percent changes). This is a short run demand equation relating the change in consumption to changes in price, income, and the price of beef. To estimate this equation in Stata, we have to tell Stata that we have a time variable with the tsset year command. We can then use the D. operator to take first differences. We will use the noconstant option to avoid adding a time trend to the demand curve. First we try ordinary least squares on the demand curve.

. tsset year
        time variable:  year, 1950 to 2001

. regress D.lq D.ly D.lpbeef D.lp, noconstant

5 Epple, Dennis and Bennett T. McCallum. Simultaneous equation econometrics: the missing example. http://littlehurt.gsia.cmu.edu/gsiadoc/WP/2004E6.pdf
      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  3,    48) =   19.48
       Model |  .053118107     3  .017706036           Prob > F      =  0.0000
    Residual |  .043634876    48   .00090906           R-squared     =  0.5490
-------------+------------------------------           Adj R-squared =  0.5208
       Total |  .096752983    51  .001897117           Root MSE      =  .03015

        D.lq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ly |
         D1. |   .7582548   .1677874     4.52   0.000     .4208957    1.095614
      lpbeef |
         D1. |    .305005   .0612757     4.98   0.000      .181802    .4282081
          lp |
         D1. |  -.3553965   .0641272    -5.54   0.000    -.4843329   -.2264601
lq is the log of quantity (per capita chicken consumption), lp is the log of price, ly is the log of income, and lpbeef is the log of the price of beef. This is not a bad estimated demand function. The demand curve is downward sloping, but very inelastic. The income elasticity is positive, indicating that chicken is not an inferior good, which makes sense. The coefficient on the price of beef is positive, which is consistent with the price of a substitute. All the explanatory variables are highly significant. Now let's try the supply curve.
. regress D.lq D.lpfeed D.lp, noconstant

      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    37) =    0.28
       Model |   .00083485     2  .000417425           Prob > F      =  0.7595
    Residual |  .055731852    37  .001506266           R-squared     =  0.0148
-------------+------------------------------           Adj R-squared = -0.0385
       Total |  .056566702    39  .001450428           Root MSE      =  .03881

        D.lq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lpfeed |
         D1. |  -.0392005    .060849    -0.64   0.523    -.1624923    .0840913
          lp |
         D1. |   .0038389   .0910452     0.04   0.967    -.1806362     .188314
We have a problem. The coefficient on price is virtually zero. This means that the supply of chicken does not respond to changes in price. We need to use instrumental variables. Let’s first try instrumental variables on the demand equation. The exogenous variables are lpfeed, ly, and lpbeef.
. ivreg D.lq D.ly D.lpbeef (D.lp = D.lpfeed D.ly D.lpbeef), noconstant

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    36) =       .
       Model |  .032485142     3  .010828381           Prob > F      =       .
    Residual |  .024081561    36  .000668932           R-squared     =       .
-------------+------------------------------           Adj R-squared =       .
       Total |  .056566702    39  .001450428           Root MSE      =  .02586

        D.lq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          lp |
         D1. |  -.4077264   .1410341    -2.89   0.006    -.6937568   -.1216961
          ly |
         D1. |   .9360443   .1991184     4.70   0.000     .5322135    1.339875
      lpbeef |
         D1. |   .3159692   .0977788     3.23   0.003     .1176646    .5142739

Instrumented:  D.lp
Instruments:   D.ly D.lpbeef D.lpfeed
Not much different, but the estimate of the income elasticity of demand is now virtually one, indicating that chicken demand will grow at about the same rate as income. Note that the demand curve is just identified (by lpfeed). Let’s see if we can get a better looking supply curve.
. ivreg D.lq D.lpfeed (D.lp = D.lpbeef D.ly), noconstant

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    37) =       .
       Model | -.079162381     2  -.03958119           Prob > F      =       .
    Residual |  .135729083    37  .003668354           R-squared     =       .
-------------+------------------------------           Adj R-squared =       .
       Total |  .056566702    39  .001450428           Root MSE      =  .06057

        D.lq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          lp |
         D1. |   .6673431   .2617257     2.55   0.015     .1370364     1.19765
      lpfeed |
         D1. |  -.2828285   .1246235    -2.27   0.029    -.5353398   -.0303173

Instrumented:  D.lp
Instruments:   D.lpfeed D.lpbeef D.ly
This is much better. The coefficient on price is now positive and significant, indicating that the supply of chickens will increase in response to an increase in price. Also, increases in chicken feed prices will reduce the supply of chickens. The supply curve is over identified (by the omission of ly and lpbeef). We could have asked for robust standard errors, but they didn't make any difference in the levels of significance, so we used the default homoskedastic-only standard errors.

Note to the reader: I have taken a few liberties with the original example developed by Professors Epple and McCallum. They actually develop a slightly more sophisticated, and more believable, model, where the coefficient on price in the supply equation starts out negative and significant and eventually winds up positive and significant. However, they have to complicate the model in a number of ways to make it work. This is a cleaner example, using real data, but it could suffer from omitted variable bias. On the other hand, the Epple and McCallum demand curve probably suffers from omitted variable bias too, so there. Students are encouraged to peruse the original paper, available on the web.
Diagnostic tests
There are three diagnostic tests that should be done whenever you do instrumental variables. The first is to justify the identifying restrictions.
Tests for over identifying restrictions
In the above example, we identified the supply equation by omitting income and the price of beef. This seems quite reasonable, but it is nevertheless necessary to justify omitting these variables by showing that they would not be significant if we included them in the regression. There are several versions of this test by Sargan, Basmann, and others. It is simply an LM test where we save the residuals from the instrumental variables regression and then regress them against all the instruments. If the omitted instruments do not belong in the equation, then this regression should have an R-square of zero. We test this null hypothesis with the usual nR² statistic, which is distributed χ²(m − k), where m is the number of instruments excluded from the equation and k is the number of endogenous variables on the right hand side of the equation. We can implement this test easily in Stata. First, let's do the ivreg again.
. ivreg D.lq D.lpfeed (D.lp = D.lpbeef D.ly), noconstant

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  2,    37) =       .
       Model | -.079162381     2  -.03958119           Prob > F      =       .
    Residual |  .135729083    37  .003668354           R-squared     =       .
-------------+------------------------------           Adj R-squared =       .
       Total |  .056566702    39  .001450428           Root MSE      =  .06057

        D.lq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          lp |
         D1. |   .6673431   .2617257     2.55   0.015     .1370364     1.19765
      lpfeed |
         D1. |  -.2828285   .1246235    -2.27   0.029    -.5353398   -.0303173

Instrumented:  D.lp
Instruments:   D.lpfeed D.lpbeef D.ly
If your version of Stata has the overid command installed, then you can run the test with a single command.
. overid

Tests of overidentifying restrictions:
Sargan N*R-sq test    0.154  Chi-sq(1)    P-value = 0.6948
Basmann test          0.143  Chi-sq(1)    P-value = 0.7057
The tests are not significant, indicating that these variables do not belong in the equation of interest. Therefore, they are properly used as identifying variables. Whew! If there is no overid command after ivreg, then you can install it by clicking on help, search, overid, and clicking on the link that says, “click to install.” It only takes a few seconds. If you are using Stata on the server, you might not be able to install new commands. In that case, you can do it yourself as follows. First, save the residuals from the ivreg.
. predict es, resid (13 missing values generated)
Now regress the residuals on all the instruments. In our case we are doing the regression in first differences and the constant term was not included, so use the noconstant option.
. regress es D.lpfeed D.lpbeef D.ly, noconstant

      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    36) =    0.05
       Model |  .000535634     3  .000178545           Prob > F      =  0.9860
    Residual |  .135193453    36  .003755374           R-squared     =  0.0039
-------------+------------------------------           Adj R-squared = -0.0791
       Total |  .135729086    39  .003480233           Root MSE      =  .06128

          es |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lpfeed |
         D1. |    .005058   .0863394     0.06   0.954    -.1700464    .1801625
      lpbeef |
         D1. |  -.0536104   .1686351    -0.32   0.752    -.3956183    .2883974
          ly |
         D1. |   .1264734   .3897588     0.32   0.747    -.6639941    .9169409
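The chi-square prob-value for the nR² statistic from this auxiliary regression can be computed by hand. A Python sketch (for one degree of freedom the chi-square survival function reduces to erfc, which is in the standard library):

```python
import math

def chi2_sf_1df(x):
    # Survival function of a chi-square with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x/2))
    return math.erfc(math.sqrt(x / 2))

# Numbers from the auxiliary regression above: n = 39, R-squared = 0.0039...
# df = 1 because m - k = 2 excluded instruments - 1 endogenous regressor
n, r2 = 39, 0.00394634
nr2 = n * r2
p = chi2_sf_1df(nr2)
print(round(nr2, 3), round(p, 4))
```

The prob-value matches the Sargan N*R-sq test reported by overid to four decimals.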
We can use the F-form of the LM test by invoking Stata's test command on the identifying variables.
. test D.lpbeef D.ly

 ( 1)  D.lpbeef = 0
 ( 2)  D.ly = 0

       F(  2,    36) =    0.07
            Prob > F =    0.9313
The results are the same, namely, that the identifying variables are not significant and therefore can be used to identify the equation of interest. However, the prob-values are not identical to the large sample chi-square versions reported by the overid command. To replicate the overid analysis, let's start by reminding ourselves what output is available for further analysis.
. ereturn list

scalars:
               e(N) =  39
            e(df_m) =  3
            e(df_r) =  36
               e(F) =  .0475437411474742
              e(r2) =  .00394634310271
            e(rmse) =  .0612811038678276
             e(mss) =  .0005356335440678
             e(rss) =  .1351934528853412
            e(r2_a) =  -.0790581283053975
              e(ll) =  55.12129587257286

macros:
          e(depvar) : "es"
             e(cmd) : "regress"
         e(predict) : "regres_p"
           e(model) : "ols"

matrices:
               e(b) :  1 x 3
               e(V) :  3 x 3

functions:
           e(sample)
. scalar n=e(N)
. scalar r2=e(r2)
. scalar nr2=n*r2

We know there is one degree of freedom for the resulting chi-square statistic because m=2 omitted instruments (ly and lpbeef) and k=1 endogenous regressor (lp).

. scalar prob=1-chi2(1,nr2)
. scalar list r2 n nr2 prob
        r2 =  .00394634
         n =  39
       nr2 =  .15390738
      prob =  .69482895

This is identical to the Sargan form reported by the overid command above. An equivalent test of this hypothesis is the J-test for overidentifying restrictions. Instead of using the nR-square statistic, the J-test uses the J = mF statistic, where F is the F-test from the auxiliary regression. J is distributed as chi-square.

. scalar J=2*e(F)
. scalar probJ=1-chi2(2,J)
. scalar list J probJ
         J =  .09508748
     probJ =  .95356876

All of these tests agree that the identifying restrictions are valid. The fact that they are not significant means that income and beef prices do not belong in the supply of chicken equation. Whew!

We cannot do the tests for over identifying restrictions on the demand curve because it is just identified. The problem is that the residual from the ivreg regression will be an exact linear combination of the instruments. If we forget and try to do the LM test anyway, we get a regression that looks like this (ed is the residual from the ivreg estimate of the demand curve above):

. regress ed D.ly D.lpbeef D.lpfeed, noconstant

      Source |       SS       df       MS              Number of obs =      39
-------------+------------------------------           F(  3,    36) =    0.00
       Model |           0     3          0            Prob > F      =  1.0000
    Residual |  .024081561    36  .000668932           R-squared     =  0.0000
-------------+------------------------------           Adj R-squared = -0.0833
       Total |  .024081561    39  .000617476           Root MSE      =  .02586

          ed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ly |
         D1. |   5.70e-10   .1644979     0.00   1.000    -.3336173    .3336173
      lpbeef |
         D1. |   3.02e-11   .0711725     0.00   1.000    -.1443446    .1443446
      lpfeed |
         D1. |   1.94e-10   .0364396     0.00   1.000    -.0739029    .0739029

Obviously, if you see output like this, you know you can't do a test for over identifying restrictions, because your equation is not over identified.
Test for weak instruments

To be an instrument a variable must satisfy two conditions: it must be (1) highly correlated with the problem variable and (2) uncorrelated with the error term. What happens if (1) is not true? That is, is there a problem when the instrument is only vaguely associated with the endogenous regressor? Bound, Jaeger,6 and Baker have demonstrated that instrumental variable estimates are biased by about 1/F*, where F* is the F-test on the (excluded) identifying variables in the reduced form equation for the endogenous regressor. This is the percent of the OLS bias that remains after applying two stage least squares. For example, if F* = 2, then 2SLS is still half as biased as OLS. Note that this check has the same limitation as the tests for over identifying restrictions: it does not work where the equation is just identified, so we cannot use it for the demand function. In the case of our chicken supply curve, the reduced form is a regression of lp on ly, lpbeef, and lpfeed, and the F* test is on lpbeef and ly.

. regress lp ly lpbeef lpfeed, noconstant

      Source |       SS       df       MS              Number of obs =      40
-------------+------------------------------           F(  3,    37) =39090.05
       Model |  801.425968     3  267.141989           Prob > F      =  0.0000
    Residual |  .252858576    37  .006834016           R-squared     =  0.9997
-------------+------------------------------           Adj R-squared =  0.9997
       Total |  801.678827    40  20.0419707           Root MSE      =  .08267

          lp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ly |    .628012   .0764761     8.21   0.000     .4730568    .7829672
      lpbeef |   .1264946   .0226887     5.58   0.000     .0805228    .1724664
      lpfeed |   .2256356   .0523258     4.31   0.000     .1196143     .331657

. test ly lpbeef

 ( 1)  ly = 0
 ( 2)  lpbeef = 0

       F(  2,    37) =   35.05
            Prob > F =    0.0000

So, these are not weak instruments and two stage least squares is much less biased than OLS. Actually, we should have predicted this from the difference between the OLS and 2sls versions of the equations above. Anyway, the BJB F test is very easy to do and it should be routinely reported for any simultaneous equation estimation.

Hausman-Wu test

We have already encountered the Hausman-Wu test in the chapter on errors in variables. However, it is also relevant in the case of simultaneous equations. Consider the following equation system.

6 Yes, that is our own Professor Jaeger.
If we apply OLS to equation (A1) we get biased and inconsistent estimates. we should do it both ways. predict dlperror.07 0.98 0.034261762 3 .lp dlperror D. then we cannot be sure that OLS is unbiased. Std.084064 3.044037688 Residual  .43 0. then we need to check the magnitude of the coefficient. If we apply OLS to equation (B1) we get unbiased. consistent. t P>t [95% Conf.02489 D. To apply the HausmanWu test to see if we need to use 2sls on the supply equation.2583745 . we estimate the reduced form equation for lp and save the residuals.0878849 .0165944 1. OLS is biased and we need to use instrumental variables.0107785 .260274407 39 .7530405 . Err.022304941 36 .056566702 39 .3794868 1. Interval] +lp  144 .lpfeed. resid (13 missing values generated) Now.lp  Coef.0000 0.128161345 36 .4288641 .05967 D.4666 . If it is not significant.lq  Coef. and efficient estimates. If it is significant.6057 0.132113063 3 . and hope we get essentially the same results. not only the equation of interest.055 . In which case.5076 0.522675 lpbeef  D1  . 36) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 39 12.003560037 +Total  .09 0.3437728 . add the error term to the original regression.004 . 36) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 39 18.Model A (simultaneous) (A1) y1 = β12 y2 + β13 z1 + u1 (A2) y2 = β 21 y1 + β 22 z2 + β 23 z3 + u2 Model B (recursive) (B1) y1 = β12 y2 + β13 z1 + u1 (B2) y2 = β 22 z2 + β 23 z3 + u2 where cov(u1u2 ) = 0.043 .ly D.lpfeed. Err. In that event. regress D. noconstant Source  SS df MS +Model  .lp D.0000 0.001450428 Number of obs F( 3.lq D.6767671 lpfeed  D1  . . noconstant Source  SS df MS +Model  . then there is a significant difference between the OLS estimate and the 2sls estimate. Suppose we are interested in getting a consistent estimate of β12 . . Interval] +ly  D1  . Std. And it is the same equation! We need to know about the second equation. OLS and IV.lpbeef D.37 0. 
If it is large and the standard error is even larger.1641908 2.006673703 Number of obs F( 3.5728 . t P>t [95% Conf.000619582 +Total  . regress D.011420587 Residual  .
36) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 39 18.lp dlperror.131643 dlperror  .93 0.2828285 .034797396 4 .lq  Coef.2828285 .00 0.1343196 . The following HW test equation was performed after saving the predicted value.000 1.lq  Coef. t P>t [95% Conf.200505 lpfeed  D1  . t P>t [95% Conf.0942849 3.1920032 4.88 0. The test coefficient is simply the negative of the coefficient on the residuals.9360443 .6809953 1.000 .005 .4077264 . Interval] +lp  D1  .011420588 Residual  .056566702 39 . 145 .1758793 .1789557 dlperror lpfeed You can also do the HausmanWu test using the predicted value of the problem variable instead of the residual from the reduced form equation.325832 lpbeef  D1  .99 0. .20 0. noconstant Source  SS df MS +Model  . noconstant Source  SS df MS +Model  .3159692 .001450428 Number of obs F( 4.000 .ly D.5728 .88 0.lpbeef D.35 0. Interval] +ly  D1  .02494 D.4445185  The test reveals that the OLS estimate is not significantly different from the instrumental variables estimate.000 .35 0.94075 . 35) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 39 13.056566702 39 .1359945 3.1245608 .1527992 0. Err.94075 . . regress D.3867013 .02230494 36 .lpfeed.lq D. Err.200505 .52 0.52 0.1789557  Applying the HausmanWu test to the demand equation yields the following test equation.lq D.D1  .2734069 .4144198 .051217 5.051217 5.0000 0.5712 . regress D.008699349 Residual  .00062198 +Total  .001450428 Number of obs F( 3.5073777 lp  D1  . Std.0695298 3.034261763 3 .43 0.6673431 .6809953  D1  .4491967 .000 .6057 0.002 .385 .000619582 +Total  .000 .1280782 7.132394 dlphat  . Std.546257 1.1280782 7.6838099 .02489 D.lp dlphat D.000 .0000 0.8854896  .6152 0.1075623 6.35 0.021769306 35 .3867013 .
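The mechanics of the Hausman-Wu test can be sketched outside Stata in a few lines of Python. The data here are a hypothetical simulated supply-and-demand system (the names z1 and z2 and all coefficient values are invented, not the chicken-market data above); the sketch also verifies the equivalence used in this section, that the coefficient on the suspect regressor in the augmented regression is exactly the two stage least squares estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical simultaneous system (invented coefficients):
#   supply: q =  1.0*p + 0.5*z2 + us      demand: q = -1.0*p + 0.8*z1 + ud
z1 = rng.normal(size=n)                     # demand shifter (like income)
z2 = rng.normal(size=n)                     # supply shifter (like feed price)
us = rng.normal(size=n)
ud = rng.normal(size=n)
p = (0.8 * z1 - 0.5 * z2 + ud - us) / 2.0   # reduced form for price
q = 1.0 * p + 0.5 * z2 + us                 # quantity from the supply curve

def ols(y, X):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Hausman-Wu: reduced form for the suspect regressor p, save the residual...
Z = np.column_stack([z1, z2])
v = p - Z @ ols(p, Z)

# ...and add that residual to the original (supply) regression
b_hw = ols(q, np.column_stack([p, v, z2]))

# 2SLS for comparison: replace p with its reduced-form fitted value
b_2sls = ols(q, np.column_stack([p - v, z2]))

# The coefficient on p in the augmented regression equals the 2SLS estimate;
# a large t-ratio on v would signal that plain OLS is biased
print(b_hw[0], b_2sls[0])
```

A significant coefficient on v plays the role of dlperror above. Note that the naive standard errors from this two-step computation are not the ones Stata's ivreg reports.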
Seemingly unrelated regressions

Consider the following system of equations relating local government expenditures on fire and police (FP), schools (S), and other (O) to population density (D), income (Y), and the number of school-age children (C). The equations relate the proportion of the total budget (T) in each of these categories to the explanatory variables.

FP/T = a0 + a1 D + a2 Y + U1
S/T = b0 + b1 Y + b2 C + U2
O/T = c0 + c1 D + c2 Y + c3 C + U3

These equations are not simultaneous. There is no correlation between the error term and any right hand side variable in the same equation, so OLS can be expected to yield unbiased and consistent estimates. However, because the proportions have to sum to one (FP/T + S/T + O/T = 1), any random increase in expenditures on fire and police must imply a random decrease in schools and/or other components. Thus, the error terms must be correlated across equations: E(Ui Uj) ≠ 0 for i ≠ j. This is information that could be used to increase the efficiency of the estimates.

The procedure is to estimate the equations using OLS, save the residuals, compute the covariances between U1, U2, etc., and use these estimated covariances to improve the OLS estimates. This technique is known as seemingly unrelated regressions (SUR) and Zellner efficient least squares (ZELS) for Arnold Zellner from the University of Chicago, who developed the technique.

Because the proportion of the budget going to other things is simply one minus the sum of the proportions going to fire, police, and schools, we drop the third equation. Let FPT be the proportion spent on fire and police and ST be the proportion spent on schools. In Stata, we can use the sureg command to estimate the above system; the varlist for each equation goes in parentheses.

. sureg (FPT D Y) (ST Y C)

Note: if the equations have the same regressors, then the SUR estimates will be identical to the OLS estimates and there is no efficiency gain. Also, if there is specification bias in one equation, say omitted variables, then the SUR estimates of the other equation are biased as well; the specification error is "propagated" across equations. For this reason, it is not clear that we should use SUR in all applications for which it might be employed. That being said, we should use SUR in those cases where the dependent variables sum to a constant, such as in the above example.

Three stage least squares

Whenever we have a system of simultaneous equations, we automatically have equations whose errors could be correlated. Applying SUR to two stage least squares results in three stage least squares: because the residuals are correlated across equations, two stage least squares estimates of a multiequation system could potentially be made more efficient by incorporating the information contained in the covariances of the residuals. Three stage least squares can be implemented in Stata with the reg3 command. Like the sureg command, we simply put the varlist for each equation in parentheses.
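The SUR procedure described above (equation-by-equation OLS, estimate the cross-equation residual covariance, then GLS on the stacked system) can be sketched in Python. The data below are invented stand-ins echoing the budget-share example, not real expenditure data; the sketch also confirms the claim that with identical regressors in every equation, SUR collapses to OLS.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Invented stand-ins for the budget-share example
D = rng.normal(size=n)                       # population density
Y = rng.normal(size=n)                       # income
C = rng.normal(size=n)                       # school-age children
U = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n)
fpt = 0.2 + 0.3 * D + 0.1 * Y + U[:, 0]      # fire/police share
st = 0.4 + 0.2 * Y + 0.3 * C + U[:, 1]       # schools share

def sur(ys, Xs):
    """One feasible-GLS step of seemingly unrelated regressions."""
    m, T = len(ys), len(ys[0])
    # 1. equation-by-equation OLS, save the residuals
    bs = [np.linalg.lstsq(X, y, rcond=None)[0] for y, X in zip(ys, Xs)]
    R = np.column_stack([y - X @ b for y, X, b in zip(ys, Xs, bs)])
    S = R.T @ R / T                          # 2. residual covariance matrix
    # 3. GLS on the stacked system, weighting by inv(S) kron I
    ks = [X.shape[1] for X in Xs]
    Xb = np.zeros((m * T, sum(ks)))          # block-diagonal design matrix
    for i, X in enumerate(Xs):
        Xb[i * T:(i + 1) * T, sum(ks[:i]):sum(ks[:i]) + ks[i]] = X
    W = np.kron(np.linalg.inv(S), np.eye(T))
    yb = np.concatenate(ys)
    return np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ yb)

X1 = np.column_stack([np.ones(n), D, Y])     # like sureg (FPT D Y) ...
X2 = np.column_stack([np.ones(n), Y, C])     # ... (ST Y C)
b_sur = sur([fpt, st], [X1, X2])

# Same regressors in both equations: SUR reproduces OLS exactly
Xall = np.column_stack([np.ones(n), D, Y, C])
b_same = sur([fpt, st], [Xall, Xall])
b_ols = np.linalg.lstsq(Xall, fpt, rcond=None)[0]
print(b_sur)
```

Stata's sureg iterates this FGLS step and computes proper standard errors; the sketch shows only the core computation.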
I use equation labels to identify the demand and supply equations. In the first example, I replicate the ivreg above exactly by specifying each equation with the noconstant option (inside the parentheses) and adding ", noconstant" to the model as a whole, to make sure that the constant is not used as an instrument in the reduced form equations.

. reg3 (Demand: D.lq D.lp D.ly D.lpbeef, noconstant) (Supply: D.lq D.lp D.lpfeed, noconstant), noconstant

[Three-stage least squares output omitted.]
Endogenous variables: D.lp D.lq
Exogenous variables: D.ly D.lpbeef D.lpfeed

Yuck. The supply function is terrible. Let's try adding the intercept term.

. reg3 (Demand: D.lq D.lp D.ly D.lpbeef) (Supply: D.lq D.lp D.lpfeed)

[Three-stage least squares output omitted.]

Better, but none of the other variables are significant. Three stage least squares has apparently made things worse. This is very unusual: normally the three stage least squares estimates are almost the same as the two stage least squares estimates, at least for the estimates of the price elasticities.

Types of equation systems

We are already familiar with a fully simultaneous equation system. The supply and demand system

(S) q = α p + ε
(D) q = β1 p + β2 y + u

is fully simultaneous in that quantity is a function of price and price is a function of quantity. These two equations are fully simultaneous, and OLS applied to either will yield biased and inconsistent estimates.

Now consider the following model.

Y1 = α0 + α1 X1 + α2 X2 + u1
Y2 = β0 + β1 Y1 + β2 X1 + β3 X2 + u2
Y3 = γ0 + γ1 Y1 + γ2 Y2 + γ3 X1 + γ4 X2 + u3

Here Y2 is a function of Y1, but Y1 is not a function of Y2. Y3 is a function of Y1 and Y2, but neither of them is a function of Y3. If we also assume that there is no covariance among the error terms, E(u1u2) = E(u1u3) = E(u2u3) = 0, then this is a recursive system. The model is said to be triangular because of this structure. Since there is no feedback from any right hand side endogenous variable to the dependent variable in the same equation, there is no correlation between the error term and any right hand side variable in the same equation; therefore there is no simultaneous equation bias in this model. OLS applied to any of the equations will yield unbiased, consistent, and efficient estimates (assuming the equations have no other problems).

Finally, consider the following general three equation model.

Y1 = α0 + α1 Y2 + α2 X1 + α3 X2 + u1
Y2 = β0 + β1 Y1 + β2 X1 + β3 X2 + u2
Y3 = γ0 + γ1 Y1 + γ2 Y2 + γ3 X1 + γ4 X2 + u3

This equation system is simultaneous, but not completely simultaneous. Now Y1 is a function of Y2 and Y2 is a function of Y1; this part of the model is called the simultaneous core. Note that Y3 is a function of Y1 and Y2, but neither of them is a function of Y3. If we assume that E(u1u3) = E(u2u3) = 0, then OLS applied to the third (recursive) equation yields unbiased, consistent, and efficient estimates. This model system is known as block recursive and is very common in large simultaneous macroeconomic models.
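A quick simulation (Python rather than Stata, with invented coefficient values) illustrates why recursiveness rescues OLS: in a triangular system with uncorrelated errors, no right hand side variable is correlated with its own equation's error, so equation-by-equation OLS recovers the structural coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Triangular (recursive) system; all coefficient values are invented
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
u1, u2, u3 = rng.normal(size=(3, n))         # E(ui uj) = 0 by construction
y1 = 1.0 + 0.5 * x1 - 0.3 * x2 + u1
y2 = 0.5 + 0.7 * y1 + 0.2 * x1 + 0.1 * x2 + u2
y3 = -1.0 + 0.4 * y1 - 0.6 * y2 + 0.3 * x1 + u3

def ols(y, cols):
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

# No feedback and no cross-equation error correlation, so OLS equation
# by equation recovers the true structural coefficients
b2 = ols(y2, [y1, x1, x2])       # slope on y1 should be near 0.7
b3 = ols(y3, [y1, y2, x1])       # slopes near 0.4 and -0.6
print(b2, b3)
```

If the errors u1 and u2 were correlated, or if y1 fed back on y2, the same regressions would no longer be consistent.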
Strategies for dealing with simultaneous equations

The obvious strategy when faced with a simultaneous equation system is to use two stage least squares. However, there is always the problem of identification and/or weak instruments. Some critics have even suggested that, since the basic model of all economics is the general equilibrium model, all variables are functions of all other variables and no equation can ever be identified. Even if we can identify the equation of interest with some reasonable exclusion assumptions and we have good instruments, the best we can hope for is a consistent, but inefficient, estimate of the parameters. (I don't recommend three stage least squares because of possible propagation of errors across equations.)

For many situations, rather than trying to estimate equations in the structural model, the best solution may be to estimate one or more reduced form equations. For example, suppose we are trying to estimate the effect of a certain policy, say three-strikes laws, on crime. We may have police in the crime equation, and police and crime may be simultaneously determined, but we do not have to know the coefficient on the police variable in the crime equation to determine the effect of the three-strikes law on crime.

However, it may be the case that we can't retreat to the reduced form equations. Suppose we want to estimate the potential effect of a policy that puts 100,000 cops on the streets. In this case we need to know the coefficient linking the police to the crime rate, so that we can multiply it by 100K in order to estimate the effect of 100,000 more police. We have no choice but to use instrumental variables if we are to generate consistent estimates of the parameter of interest.

A strategy that is frequently adopted when large systems of simultaneous equations are being estimated is to plow ahead with OLS despite the fact that the estimates are biased and inconsistent. Usually, researchers adopting this strategy insist that this is just the first step and the OLS estimates are "preliminary." The temptation is to never get around to the next step. The estimates may be biased and inconsistent, but they are efficient and the bias might not be too bad (and 2sls is also biased in small samples). So OLS might not be too bad.

A strategy that is available to researchers who have time series data is to use lags to generate instruments, or to avoid simultaneity altogether. Suppose we start with the simple demand and supply model above, where we add time subscripts.

(S) qt = α pt + εt
(D) qt = β1 pt + β2 yt + ut

Note that this model is fully simultaneous, and the only way to consistently estimate the supply function is to use two stage least squares. However, suppose this is agricultural data. Then the supply decision could be based on last year's price, while demand is based on this year's price. (This is the famous cobweb model.)

(S) qt = α pt−1 + εt
(D) qt = β1 pt + β2 yt + ut

Presto! Assuming that E(ut εt) = 0, and assuming we have no autocorrelation (see Chapter 14 below), we now have a recursive system. OLS applied to either equation will yield unbiased, consistent, and efficient estimates.

Even if we can't assume that current supply is based on last year's price, we can at least generate a bunch of instruments by assuming that the equations are also functions of lagged variables. For example, the supply equation could be a function of last year's supply as well as last year's price. In that event, we have the following equation system.
Let’s do a HausmanWu test to see if we need to use two stage least squares. This is essentially the reduced form equation approach for time series data where the lagged endogenous variables are the exogenous variables. regress exp aid inc pop Source  SS df MS +Model  925135805 3 308378602 Residual  6490681. as control variables.0000 = 0.712925 inc  .000143 .9926 = 375.000 .769 +Total  931626487 49 19012785.4 3 10836749.38 46 141101. exp. and.80422 .0000235 8. like matching grants) are for education. Now both equations are identified and both can be consistently estimated using two stage least squares. Since time does not go backwards. t P>t [95% Conf.39169 0.9308 221. We regress aid on ps.238054 13.3 from Pindyck and Rubinfeld). Many openended grants (that are functions of the amount the state spends. regress aid ps inc pop Source  SS df MS +Model  32510247. population and income The data are in a Stata data set called EX73 (for example 7.50 = 0.754569 3.0001902 .3839306 _cons  45.0002375 pop  .0000 0.(S) q = α p + α p +α q +ε t 1 t 2 t −1 3 t −1 t (D) q = β p + β y + u t 1 t 2 t t Note that we have two new instruments (pt1 and qt1) which are called lagged endogenous variables.000 2. and the other exogenous ˆ variables.9930 = 0.69833 84.64 exp  Coef. Another example Pinydyck and Rubinfeld have an example where state and local government expenditures (exp) are taken to be a function of the amount of federal government grants (aid). Therefore aid is a potentially endogenous variable.9350 0. ps. Interval] +aid  3.591 215.4 Number of obs F( 3.1 Residual  2258713.54 0.70 0. . many grants are tied to the amount of money the state is spending.69 0.5940753 . Time series analysts frequently estimate vector autoregression (VAR) models in which the dependent variable is a function of its own lags and lags of all the other variables in the simultaneous equation model. First we need an instrument. The OLS regression model is . and save the residuals.4741 +Total  34768961.10 0. 
Std.1733 However. A variable that is correlated with grants but independent of the error term is the number of school children.233747 . these are valid instruments (again assuming no autocorrelation).2 49 709570.1043992 5.000 . 46) Prob > F Rsquared Adj Rsquared Root MSE = = = = = = 50 220.58 0.59 150 .81 46 49102.57 124. Err. this year’s quantity and price cannot affect last year’s values. w .636 Number of obs F( 3. So the aid variable could be a function of the dependent variable. 46) Prob > F Rsquared Adj Rsquared Root MSE = 50 = 2185.
. predict what, resid

Now add the residual, what, to the regression.

. regress exp aid what inc pop

[Regression output omitted.]

The coefficient on what is significant at the recommended 10 percent significance level for specification tests: apparently there is significant endogeneity, and ordinary least squares is biased. It turns out that the coefficient on aid in the Hausman-Wu test equation is exactly the same as we would have gotten if we had used instrumental variables. To see this, let's use IV to derive consistent estimates for this model.

. ivreg exp (aid=ps) inc pop

[Instrumental variables (2SLS) output omitted.]
Instrumented: aid
Instruments: inc pop ps

Note that the coefficient on aid is, in fact, the same as in the Hausman-Wu test equation. Note also that ivreg uses all the variables that are not indicated as problem variables (inc and pop), as well as the designated instrument, as instruments.
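The ivreg logic can be mimicked by hand. The sketch below uses simulated data; the names ps, inc, pop, and aid are borrowed from the example, but the numbers and coefficients are invented, not the EX73 data. The instrument matrix contains the excluded instrument ps plus the included exogenous variables, exactly as noted above.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000

# Invented data mimicking the structure of the grants example
ps = rng.normal(size=n)
inc = rng.normal(size=n)
pop = rng.normal(size=n)
v = rng.normal(size=n)
aid = 1.0 + 0.8 * ps + 0.3 * inc + 0.2 * pop + v
# the shared v term below makes aid endogenous in the exp equation
exp_ = 2.0 + 4.0 * aid + 0.5 * inc + 0.1 * pop + 0.9 * v + rng.normal(size=n)

X = np.column_stack([np.ones(n), aid, inc, pop])   # structural regressors
Z = np.column_stack([np.ones(n), ps, inc, pop])    # instruments: ps plus
                                                   # all included exogenous vars
# First stage: project every column of X on Z; second stage: OLS on the fits
Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
b_2sls = np.linalg.lstsq(Xhat, exp_, rcond=None)[0]
b_ols = np.linalg.lstsq(X, exp_, rcond=None)[0]
print(b_2sls[1], b_ols[1])    # 2SLS near the true 4.0; OLS pushed upward by v
```

The standard errors from a literal two-step regression like this are wrong; Stata's ivreg computes the corrected ones automatically.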
If we want to see the first stage regression, add the option "first." We can also request robust standard errors.

. ivreg exp (aid=ps) inc pop, first robust

[First-stage regression and robust-standard-error output omitted.]

ps is significant in the first stage (reduced form) equation, which means that it is not a weak instrument, although the sign doesn't look right to me. Because the equation is just identified, we can't test the over identification restriction.

Summary

The usual development of a research project is that the analyst specifies an equation of interest in which the dependent variable is a function of the target variable or variables and a list of control variables. Before estimating with ordinary least squares, the analyst should think about possible reverse causation with respect to each right hand side variable in turn. Sometimes this process is shortened by the review of the literature, where previous analysts may have specified simultaneous equation models. If there is a good argument for simultaneity, the analyst should specify another equation, or at least a list of possible instruments, for each of the potentially endogenous variables.
Given that there may be simultaneity, what is the best way to deal with it? We know that OLS is biased and inconsistent, but relatively efficient compared to instrumental variables (because changing the list of instruments will cause the estimates to change, creating additional variance of the estimates). On the other hand, instrumental variables is consistent, but still biased (presumably less biased than OLS), and inefficient relative to OLS. So we are swapping consistency for efficiency when we use two stage least squares. Is this a good deal for analysts? The consensus is that OLS applied to an equation that has an endogenous regressor results in substantial bias, so much so that the loss of efficiency is a small price to pay.

There are a couple of ways to avoid this tradeoff. The first is to estimate the reduced form equation only. We know that OLS applied to the reduced form equations results in estimates that are unbiased, consistent, and efficient. For many policy analyses, the target variable is an exogenous policy variable, like the three-strikes law, a dummy variable that is one for states with three-strikes laws. So, suppose we are interested in the effect of three-strikes laws on crime. The crime equation will be a function of arrests, a number of control variables, and the target variable. The arrest variable is likely to be endogenous. We could specify an arrest equation, with the attendant data collection to get the additional variables, and employ two or three stage least squares, which are relatively inefficient. Alternatively, we can estimate the reduced form equation, derived by simply dropping the arrest variable. If arrests are not in fact simultaneous, then these estimates are biased by omitted variables. The safest thing to do is to estimate the original crime equation in OLS and the reduced form version, dropping arrests, and see if the resulting estimated coefficients on the target variable are much different. If the results are the same in either case, then we can be confident we have a good estimate of the effect of three-strikes laws on crime.

Another way around the problem is available if the data are time series. I have seen large simultaneous macroeconomic models estimated on quarterly data avoid the use of two stage least squares by simply lagging the endogenous variables, so that consumption this quarter is a function of income in the previous quarter. Similarly, if a variable Y is known to be a function of lagged X, then, since time cannot go backwards, the two variables cannot be simultaneous. In that case, lagging the potentially endogenous variable allows us to sidestep the specification of a multiequation model. The three-strikes example works the same way: it usually takes some time to get a law passed and implemented, so crime this year is a function of the law, but the law is a function of crime in previous years, hence not simultaneous. These arguments require that the residuals be independent, that is, not autocorrelated. We deal with autocorrelation below.

Should we use three stage least squares? Generally, I recommend against it. Usually, the other equations are only there to generate consistent estimates of a target parameter in the equation of interest. I don't want specification errors in those equations causing inconsistency in the equation of interest after I have gone to all the trouble to specify a multiequation model.

Finally, be sure to report the results of the Hausman-Wu test, the test for over identifying restrictions (assuming your model is not just identified), and the Bound, Jaeger, and Baker F test for weak instruments (that Jaeger, by the way, is our own Professor Jaeger) whenever you estimate a simultaneous equation model.
13 TIME SERIES MODELS

Time series data are different from cross section data in that they have an additional source of information. In a cross section it doesn't matter how the data are arranged: the fifty states could be listed alphabetically, or by region, or any number of ways. A time series, on the other hand, must be arrayed in order: 1997 before 1998, 1998 before 1999, and so on.

The fact that time's arrow only flies in one direction is extremely useful. It means, among other things, that something that happened in 1999 cannot cause something that happened in 1998. The reverse, of course, is quite possible. Simultaneity is a serious problem for cross section data, but lags can help a time series analysis avoid the pitfalls of simultaneous equations. This is a good reason to prefer time series data.

An interesting question arises with respect to time series data: can we treat a time series, say GDP from 1950 to 2000, as a random sample? It seems a little silly to argue that GDP in 1999 was a random draw from a normal population of potential GDPs ranging from positive to negative infinity. The value of GDP in 1999 was almost certainly a function of GDP in 1998. Yet we can treat any time series as a random sample if we condition on history. Let's assume that GDP today is a function of a bunch of past values of GDP:

Yt = a0 + a1Yt−1 + … + apYt−p + εt

so that the error term is

εt = Yt − a0 − a1Yt−1 − … − apYt−p

The error term is simply the difference between the value of GDP today and what it should have been if the true relationship had held exactly. So a time series regression of this form has a truly random error term. This raises the possibility of using lags as a source of information, something that is not possible in cross sections. At the same time, time series data have all the problems of cross section data, plus some more. Time series models should take dynamics into account: lags, momentum, reaction time, etc.

Linear dynamic models

ADL model

The mother of all time series models is the autoregressive distributed lag, ADL, model. It has the form, assuming one dependent variable (Y) and one independent variable (X),

Yt = a0 + a1Yt−1 + … + apYt−p + b0Xt + b1Xt−1 + … + bpXt−p + εt

Lag operator

Just for convenience, we will resort to a shorthand notation that is particularly useful for dynamic models. The "lag operator" works like this:

LXt = Xt−1
L²Xt = L(LXt) = L(Xt−1) = Xt−2
L³Xt = L(L²Xt) = L(Xt−2) = Xt−3
LkXt = Xt−k

A first difference can be written as

ΔXt = (1 − L)Xt = Xt − Xt−1

Using the lag operator (and ignoring the intercept a0 for compactness), we can write the ADL model very conveniently as

a(L)Yt = b(L)Xt + εt

where a(L) and b(L) are "lag polynomials." For the ADL(p,p) model,

Yt − a1Yt−1 − … − apYt−p = b0Xt + b1Xt−1 + … + bpXt−p + εt
(1 − a1L − … − apLp)Yt = (b0 + b1L + … + bpLp)Xt + εt
a(L)Yt = b(L)Xt + εt

with a(L) = 1 − a1L − … − apLp and b(L) = b0 + b1L + … + bpLp.

Mathematically, the ADL model is a difference equation, the discrete analog of a differential equation. All dynamic models can be derived from the ADL model. Since we do not observe explosive behavior in economic systems, we will have to make sure all our dynamic models are stable. For the model to be stable (so that it does not explode to positive or negative infinity), we require that

a1 + a2 + … + ap < 1

(more precisely, the roots of the lag polynomial a(L) must lie outside the unit circle). The coefficients on the X's are not constrained, because the X's are the inputs to the difference equation and do not depend on their own past values.
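The lag operator is nothing more than a shift of the data one period back. A toy Python version of it (in practice, pandas' shift() method plays the same role):

```python
import numpy as np

def lag(x, k=1):
    """L^k applied to a series: shift back k periods, NaN-padding the start."""
    out = np.full(len(x), np.nan)
    out[k:] = x[:-k]
    return out

x = np.array([3.0, 5.0, 4.0, 7.0, 6.0])
dx = x - lag(x)               # ΔX_t = (1 - L)X_t = X_t - X_{t-1}
print(lag(x))                 # [nan  3.  5.  4.  7.]
print(dx)                     # [nan  2. -1.  3. -1.]
```

The first k observations of a k-th lag are missing, which is why Stata reported "(13 missing values generated)" when differencing earlier in the chapter.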
Let's start with the simple ADL(1,1) model,

Yt = a0 + a1Yt−1 + b0Xt + b1Xt−1 + εt

Static model

The static model (no dynamics) can be derived from this model by setting a1 = b1 = 0:

Yt = a0 + b0Xt + εt

AR model

The univariate autoregressive (AR) model is (b0 = b1 = 0):

Yt = a0 + a1Yt−1 + εt

Random walk model

Note that if a1 = 1 and a0 = 0 (with b0 = b1 = 0), then Y is a "random walk":

Yt = Yt−1 + εt

Y is whatever it was in the previous period, plus a random error term:

Yt − Yt−1 = ΔYt = εt

So the movements of Y are completely random. The random walk model describes the movement of a drop of water across a hot skillet. Y is said to have a unit root because the root of the lag polynomial is equal to one: ΔYt = (1 − L)Yt, so (1 − L)Yt = a(L)Yt = εt. If a0 ≠ 0, then Yt = a0 + Yt−1 + εt and ΔYt = a0 + εt. This is a random walk "with drift," because if a0 is positive, Y drifts up; if it is negative, Y drifts down.

First difference model

The first difference model can be derived by setting a1 = 1 and b1 = −b0:

Yt = a0 + Yt−1 + b0Xt − b0Xt−1 + εt
Yt − Yt−1 = a0 + b0(Xt − Xt−1) + εt
ΔYt = a0 + b0ΔXt + εt

Note that there is "drift" or trend if a0 ≠ 0.
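A short simulation (invented drift value and seed) makes the random walk with drift concrete: built recursively, it is just a cumulative sum of shocks plus a deterministic trend, and the average first difference estimates the drift a0.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 1000
a0 = 0.5                                  # drift (invented for illustration)
eps = rng.normal(size=T)

# Y_t = a0 + Y_{t-1} + eps_t, with Y_0 = 0
Y = np.zeros(T)
for t in range(1, T):
    Y[t] = a0 + Y[t - 1] + eps[t]

# Closed form: Y_t = a0*t plus the sum of the shocks up through t
Y_alt = a0 * np.arange(T) + np.cumsum(eps) - eps[0]

drift_hat = np.diff(Y).mean()             # ΔY_t = a0 + eps_t, so mean ≈ a0
print(drift_hat)
```

With a0 positive the series wanders upward on average, but the period-to-period movement remains completely random.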
Distributed lag model

The finite distributed lag model is (a1 = 0):

Yt = a0 + b0Xt + b1Xt−1 + εt

Partial adjustment model

The partial adjustment model is (b1 = 0):

Yt = a0 + a1Yt−1 + b0Xt + εt

Error correction model

The error correction model is derived as follows. Start by writing the ADL(1,1) model in deviations:

(1) yt = a1yt−1 + b0xt + b1xt−1 + εt

Subtract yt−1 from both sides:

yt − yt−1 = a1yt−1 + b0xt + b1xt−1 + εt − yt−1

Add and subtract b0xt−1:

yt − yt−1 = a1yt−1 + b0xt + b1xt−1 + εt − yt−1 + b0xt−1 − b0xt−1

Collect terms:

(2) Δyt = (a1 − 1)yt−1 + b0Δxt + (b0 + b1)xt−1 + εt

Rewrite equation (1) as

yt − a1yt−1 = b0xt + b1xt−1 + εt

In long run equilibrium, yt = yt−1, xt = xt−1, and εt = 0, so

yt − a1yt = b0xt + b1xt
yt(1 − a1) = (b0 + b1)xt
yt = [(b0 + b1)/(1 − a1)] xt = kxt,  where k = (b0 + b1)/(1 − a1)

Lagging this one period, yt−1 = kxt−1, so

(b0 + b1)xt−1 = (1 − a1)kxt−1 = −(a1 − 1)kxt−1

Substituting for (b0 + b1)xt−1 in (2) yields

Δyt = (a1 − 1)yt−1 − (a1 − 1)kxt−1 + b0Δxt + εt

Collecting terms in (a1 − 1),

(3) Δyt = b0Δxt + (a1 − 1)(yt−1 − kxt−1) + εt

sometimes written as

(3a) Δyt = b0Δxt + (a1 − 1)(y − kx)t−1 + εt

The term (yt−1 − kxt−1) is the difference between the actual value of y in the previous period and its long run equilibrium value (kx). The coefficient k is the long run response of y to a change in x. The coefficient b0 is the impact on y of a change in x. The coefficient a1 − 1 is the speed of adjustment of y to errors in the form of deviations from the long run equilibrium. Together, the coefficient and the error are called the error correction mechanism.

Cochrane-Orcutt model

Another commonly used time series model is the Cochrane-Orcutt or common factor model. It imposes the restriction that b1 = −b0a1:

Yt = a0 + a1Yt−1 + b0Xt + b1Xt−1 + εt
Yt = a0 + a1Yt−1 + b0Xt − b0a1Xt−1 + εt
Yt − a1Yt−1 = a0 + b0(Xt − a1Xt−1) + εt
(1 − a1L)Yt = a0 + b0(1 − a1L)Xt + εt

Y and X are said to have common factors because both lag polynomials have the same root. The terms Yt − a1Yt−1 and Xt − a1Xt−1 are called generalized first differences because they are differences, but the coefficient in the lag polynomial is a1, instead of 1, which would be the coefficient for a first difference.
are true. consistent. It turns out that the best we can get out of ordinary least squares with a lagged dependent variable is that OLS is consistent and asymptotically efficient. correlated with themselves. we expect that Cov ( yt −1ε t ) = 0 which implies that OLS is at least consistent. It is unfortunate that all we can hope for in most dynamic models is consistency. as the sample size goes to infinity.e. the lag of Y is also a random number. The first i in iid is independent. It is no longer unbiased and efficient. which means that ˆ E ( a1 ) ≠ a1 and OLS is biased. σ 2 ) the explanatory variables. We will just have to live with it. 159 . lagged) or “serially correlated. which now include Yt −1 . ordinary least squares applied to this model is unbiased.14 AUTOCORRELATION We know that the GaussMarkov assumptions require that Yt = α + β X t + ut where ut ∼ iid (0. fixed on repeated sampling because.” The most common type of autocorrelation is socalled “first order” serial correlation. To see this. assume a very simple AR model (no x) expressed in deviations. and efficient. ut ∼ iid (0. cannot be assumed to be a set of constants. As we have seen above. If the errors are not independent. However. since Y is a random number. Adding a lag of X is not a problem. it is frequently necessary in time series modeling to add lags. However adding a lagged dependent variable changes this dramatically. then they are said to be “autocorrelated” (i. in large samples. yt = a1 yt −1 + ε t ˆ a1 = yt −1 yt yt −1 ( a1 yt −1 + ε t ) yt −1ε t = = a1 + 2 2 yt −1 yt −1 yt2−1 yt −1ε t Cov( yt −1ε t ) ˆ E ( a1 ) = a1 + E = a1 + 2 Var ( yt −1 ) yt −1 Since ε t = yt − a1 yt −1 it is likely that Cov( yt −1ε t ) ≠ 0 at least in small samples. σ 2 ) . and the other assumptions. In the model Yt = α + β X t + γ Yt −1 + ut . When this assumption. This actually refers to the sampling method and assumes a simple random sample where the probability of observing ut is independent of observing ut1.. 
since most time series are quite small relative to cross sections.
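A quick Monte Carlo sketch in Python (my own illustration, not from the text; the AR coefficient 0.8 and sample sizes are arbitrary) makes the point: the OLS estimate of a_1 is biased in small samples, but the bias shrinks as T grows, which is what "consistent but not unbiased" means in practice:

```python
import random

# OLS on y[t] = a1*y[t-1] + e[t]: average the estimate over many
# replications and compare the bias at T = 20 with the bias at T = 200.
def ols_a1(T, a1=0.8, rng=random):
    y = [0.0]
    for _ in range(T):
        y.append(a1*y[-1] + rng.gauss(0, 1))
    num = sum(y[t-1]*y[t] for t in range(1, len(y)))
    den = sum(y[t-1]**2 for t in range(1, len(y)))
    return num/den

random.seed(42)
bias_small = sum(ols_a1(20) for _ in range(2000))/2000 - 0.8
bias_large = sum(ols_a1(200) for _ in range(2000))/2000 - 0.8
print(round(bias_small, 3), round(bias_large, 3))  # small-T bias is clearly negative
```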
First order serial correlation means

u_t = ρ u_{t-1} + ε_t

where −1 ≤ ρ ≤ 1 is the autocorrelation coefficient and ε_t ~ iid(0, σ²). It is possible to have autocorrelation of any order; the most general form would be q'th order autocorrelation,

u_t = ρ_1 u_{t-1} + ρ_2 u_{t-2} + ... + ρ_q u_{t-q} + ε_t

Autocorrelation is also known as moving average errors.

Autocorrelation is caused primarily by omitting variables with time components. Sometimes we simply don't have all the variables we need and we have to leave out some variables that are unobtainable. If these variables have time trends or cycles or other time-dependent behavior, their impact goes into the error term, and the error term will be autocorrelated. It turns out that autocorrelation is so prevalent that we can expect some form of autocorrelation in any time series analysis.

Effect of autocorrelation on OLS estimates

The effect that autocorrelation has on OLS estimators depends on whether the model has a lagged dependent variable or not. If there is no lagged dependent variable, then autocorrelation causes OLS to be inefficient. There is also a "second order bias" in the sense that OLS standard errors are underestimated and the corresponding t-ratios are overestimated in the usual case of positive autocorrelation. (In the case of negative autocorrelation, unfortunately, it is not clear whether the t-ratios will be overestimated or underestimated.) This means that our hypothesis tests can be wildly wrong, so we have to make sure we eliminate any autocorrelation.

If there is a lagged dependent variable, then OLS is biased, inconsistent, and inefficient, and the t-ratios will be "wrong." There is nothing good about OLS under these conditions. However, as we have seen above, most dynamic models use a lagged dependent variable. We can prove the inconsistency of OLS in the presence of autocorrelation and a lagged dependent variable as follows. Recall our simple AR(1) model above,

y_t = a_1 y_{t-1} + u_t
u_t = ρ u_{t-1} + ε_t

The ordinary least squares estimator of a_1 is
â_1 = Σ y_{t-1} y_t / Σ y²_{t-1} = Σ y_{t-1}(a_1 y_{t-1} + u_t) / Σ y²_{t-1} = a_1 + Σ y_{t-1} u_t / Σ y²_{t-1}

The expected value is

E(â_1) = a_1 + E[ Σ y_{t-1} u_t / Σ y²_{t-1} ] ≈ a_1 + Cov(y_{t-1}, u_t) / Var(y_{t-1})

The covariance between the explanatory variable and the error term is
Cov(y_{t-1}, u_t) = E(y_{t-1} u_t) = E[ (a_1 y_{t-2} + u_{t-1})(ρ u_{t-1} + ε_t) ]
= E[ a_1 ρ u_{t-1} y_{t-2} + a_1 y_{t-2} ε_t + ρ u²_{t-1} + u_{t-1} ε_t ]
= a_1 ρ E[u_{t-1} y_{t-2}] + ρ E[u²_{t-1}]

since E(y_{t-2} ε_t) = E(u_{t-1} ε_t) = 0, because ε_t is truly random. By stationarity, E(u²_t) = E(u²_{t-1}) = Var(u_t) and E(u_{t-1} y_{t-2}) = E(u_t y_{t-1}) = Cov(y_{t-1}, u_t), which implies

Cov(y_{t-1}, u_t) = a_1 ρ Cov(y_{t-1}, u_t) + ρ Var(u_t)
Cov(y_{t-1}, u_t) − a_1 ρ Cov(y_{t-1}, u_t) = ρ Var(u_t)
Cov(y_{t-1}, u_t) = ρ Var(u_t) / (1 − a_1 ρ)

So, if ρ ≠ 0, the explanatory variable is correlated with the error term and OLS is biased. Since this covariance is constant, it doesn't go away as the sample size goes to infinity. Therefore, OLS is inconsistent as well. Since OLS is inefficient with overestimated t-ratios in the best case, and is biased, inconsistent, and inefficient, with overestimated t-ratios, in the worst case, we should find out if we have autocorrelation or not.

Testing for autocorrelation

The Durbin-Watson test

The most famous test for autocorrelation is the Durbin-Watson statistic. It has the formula

d = Σ (e_t − e_{t-1})² / Σ e²_t

where e_t is the residual from the OLS regression. Under the null hypothesis of no autocorrelation, we can derive its expected value as follows. Note that, under the null, u_t ~ IN(0, σ²), which implies that E(u_t) = 0, E(u_t u_{t-1}) = 0, and E(u²_t) = σ². The empirical analog is that the residuals satisfy E(e_t) = E(e_{t-1}) = E(e_t e_{t-1}) = 0 and E(e²_t) = E(e²_{t-1}) = σ². Then, approximately,

E(d) = E(e_t − e_{t-1})² / E(e²_t) = [ E(e²_t) − 2 E(e_t e_{t-1}) + E(e²_{t-1}) ] / E(e²_t)
     = Σ_{t=1..T} (σ² + σ²) / Σ_{t=1..T} σ² = 2Tσ² / Tσ² = 2

where T is the sample size (in a time series, t = 1, ..., T). So, under the null hypothesis of no autocorrelation, the expected value of the Durbin-Watson statistic is two.
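The statistic is easy to compute directly. A short Python sketch (my own illustration, not from the text; the AR(1) residual series is simulated with ρ = 0.65) shows d and the approximation 2(1 − ρ̂) nearly coinciding:

```python
import random

# Durbin-Watson statistic d = sum((e[t]-e[t-1])^2) / sum(e[t]^2),
# compared with the approximation d ~= 2*(1 - rho_hat).
def durbin_watson(e):
    num = sum((e[t] - e[t-1])**2 for t in range(1, len(e)))
    return num / sum(v*v for v in e)

random.seed(7)
rho = 0.65                         # positive first-order autocorrelation
e = [random.gauss(0, 1)]
for _ in range(4999):
    e.append(rho*e[-1] + random.gauss(0, 1))

rho_hat = sum(e[t-1]*e[t] for t in range(1, len(e))) / sum(v*v for v in e)
d = durbin_watson(e)
print(round(d, 2), round(2*(1 - rho_hat), 2))  # the two numbers nearly coincide
```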
has a number of drawbacks. If your value is below 1. it not only fails. despite its fame. it tells us we have no problem when we could have a very serious problem. Be conservative and simply ignore the lower value. reject the null hypothesis of no autocorrelation whenever the DW statistic is below dU. suppose you get a DW statistic of 1. according to DW. The usual case in economics is positive autocorrelation. On the other hand.45. The DurbinWatson test. the DurbinWatson statistic approaches zero. x = et .8 The second problem is fatal. the DW test is biased toward two. That is. the DurbinWatson statistic is equal to two. y ) 2 σ x2 σ y 2 2 In this case.29 and du=1. y ) = E ( xy ) = So. For this reason. For example. Whenever the model contains a lagged dependent variable. Clearly. If your value is above 1. Since your value falls between these two. the DurbinWatson statistic goes to four. That is. this is not a serious problem.29 you have significant autocorrelation.r= 1 xy T Sx S y 1 xy T But. as approaches one. The LM test for autocorrelation 8 If the DW statistic is negative. for negative autocorrelation.35 and you want to know if you have autocorrelation. 162 . r= cov( x. As we already know. y = et −1 and σ x = σ y = σ 2 which means that the autocorrelation coefficient is ρ= cov(et et −1 ) σ 2 σ 2 = E (et et −1 ) σ2 and ρσ 2 = E (et et −1 ) The expected value of the Durbin Watson statistic is E (et − et −1 ) 2 E (et2 − 2et et −1 + et2−1 ) 2 ( E (et2 ) − 2 E (et et −1 ) + E (et2−1 )) E (d ) = = = E (et2 ) E (et2 ) E (et2 ) (σ = t =1 T 2 − 2 ρσ 2 + σ 2 ) σ t =1 T = 2 T σ 2 − 2T ρσ 2 + T σ 2 = 2 − 2ρ Tσ 2 Since is a correlation coefficient it has to less than or equal to one in absolute value.45. we will need another test for autocorrelation. If you look up the critical value corresponding to a sample size of 25 and one explanatory variable for the five percent significance level. you cannot reject the null hypothesis of no autocorrelation. dl=1. when =0. 
you will find two values. when the test is most needed. The first is that the test is not exact and there is an indeterminate range in the DW tables. as goes to negative one. then the test statistic is 4DW. so that is positive. what do you conclude? Answer: test fails! Now what are you going to do? Actually. cov( x.
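The decision rule just described can be written out explicitly. The sketch below is my own (the function name and the conservative option are hypothetical conveniences, not from the text); it reproduces the dl/du logic, the 4 − d rule for the negative side, and the conservative variant that ignores dl:

```python
# Durbin-Watson decision rule: compare d (or 4 - d for the negative side)
# with the lower and upper bounds dl and du from the DW tables.
def dw_decision(d, dl, du, conservative=False):
    if d > 2:                      # negative autocorrelation: use 4 - d
        d = 4 - d
    if conservative:               # ignore dl, reject whenever d < du
        return "reject H0" if d < du else "do not reject H0"
    if d < dl:
        return "reject H0"
    if d > du:
        return "do not reject H0"
    return "inconclusive"

# The example from the text: T = 25, one regressor, dl = 1.29, du = 1.45, d = 1.35.
print(dw_decision(1.35, 1.29, 1.45))                     # inconclusive
print(dw_decision(1.35, 1.29, 1.45, conservative=True))  # reject H0
```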
The LM test for autocorrelation

Breusch and Godfrey independently developed an LM test for autocorrelation. The idea is very simple. Consider the following model,

Y_t = a_0 + b_0 X_t + u_t
u_t = ρ u_{t-1} + ε_t

which implies that

Y_t = a_0 + b_0 X_t + ρ u_{t-1} + ε_t

It would be nice to simply estimate this model, get an estimate of ρ, and test the null hypothesis that ρ = 0 with a t-test. Unfortunately, we don't observe the error term, so we can't do it that way. However, we can estimate the error term with the residuals from the OLS regression of Y on X, and lag them to get an estimate of u_{t-1}. We can then run the test equation,

Y_t = a_0 + b_0 X_t + ρ e_{t-1} + ε_t

where e_{t-1} is the lagged residual. We can test the null hypothesis with either the t-ratio on the lagged residual or the F-ratio for the null hypothesis that the coefficient on the lagged residual is zero. The F-ratio is distributed according to chi-square with one degree of freedom, because this, like all LM tests, is a large sample test.

For example, I have a data set on crime and prison population in Virginia from 1972 to 1999 (crimeva.dta). The first thing we have to do is tell Stata that this is a time series and that the time variable is year.

. tsset year
        time variable:  year, 1956 to 2005

Now let's do a simple regression of the log of major crime on prison. We run the regression and then save the residuals for further analysis.

. regress lcrmaj prison

      Source |       SS       df       MS              Number of obs =      28
-------------+------------------------------           F(  1,    26) =   83.59
       Model |  .570667328     1  .570667328           Prob > F      =  0.0000
    Residual |  .177496135    26  .006826774           R-squared     =  0.7628
-------------+------------------------------           Adj R-squared =  0.7536
       Total |  .748163462    27  .027709758           Root MSE      =  .08262

------------------------------------------------------------------------------
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |   .1345066   .0147116     9.14   0.000     .1042665    .1647467
       _cons |   9.663423   .0379523   254.62   0.000     9.585411    9.741435
------------------------------------------------------------------------------

Both the Durbin-Watson and Breusch-Godfrey LM tests are automatic in Stata. First we want the Durbin-Watson statistic. (It is OK here, since we don't have any lagged dependent variables.)

. dwstat

Durbin-Watson d-statistic(  2,    28) =  .6915226

Now let's get the Breusch-Godfrey test statistic.

. bgodfrey

Breusch-Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |         10.184               1                   0.0014
---------------------------------------------------------------------------
                        H0: no serial correlation
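The mechanics behind this output can be sketched outside Stata. The Python code below (my own illustration, not the book's; the data are simulated with AR(1) errors, and the small OLS helper is hypothetical) runs the auxiliary regression of the residuals on x and their own lag and forms the LM statistic n·R², which is chi-squared with one degree of freedom:

```python
import random

# Breusch-Godfrey LM test by hand: regress y on x, then regress the
# residuals on x and their own lag; LM = n*R^2 from the auxiliary regression.
def ols(X, y):                      # X: list of rows; returns coefficients
    k = len(X[0])
    A = [[sum(r[i]*r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i]*yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):              # Gaussian elimination on normal equations
        p = A[i][i]
        for j in range(i+1, k):
            f = A[j][i]/p
            A[j] = [a - f*c for a, c in zip(A[j], A[i])]
            b[j] -= f*b[i]
    beta = [0.0]*k
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][j]*beta[j] for j in range(i+1, k)))/A[i][i]
    return beta

random.seed(3)
T = 200
x = [random.gauss(0, 1) for _ in range(T)]
u = [random.gauss(0, 1)]
for _ in range(T-1):
    u.append(0.6*u[-1] + random.gauss(0, 1))    # AR(1) errors
y = [1.0 + 0.5*xi + ui for xi, ui in zip(x, u)]

b = ols([[1.0, xi] for xi in x], y)
e = [yi - b[0] - b[1]*xi for xi, yi in zip(x, y)]

# auxiliary regression of e[t] on a constant, x[t], and e[t-1]
Xa = [[1.0, x[t], e[t-1]] for t in range(1, T)]
ya = e[1:]
g = ols(Xa, ya)
fit = [sum(c*v for c, v in zip(g, row)) for row in Xa]
ybar = sum(ya)/len(ya)
r2 = sum((f - ybar)**2 for f in fit)/sum((v - ybar)**2 for v in ya)
lm = len(ya)*r2
print(lm > 3.84)   # exceeds the 5% chi-squared(1) critical value
```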
So, as expected, the Durbin-Watson statistic is uncomfortably close to zero (the dl is 1.32 for k=1 and T=27 at the five percent level), and the Breusch-Godfrey test is highly significant. We apparently have very serious autocorrelation. What should we do about it? We discuss possible cures below.

Just for fun, let's do the LM test ourselves. First we have to save the residuals.

. predict e, resid
(22 missing values generated)

Now we add the lagged residual to the regression.

. regress lcrmaj prison L.e

      Source |       SS       df       MS              Number of obs =      27
-------------+------------------------------           F(  2,    24) =   78.55
       Model |   .64494128     2   .32247064           Prob > F      =  0.0000
    Residual |  .098530434    24  .004105435           R-squared     =  0.8675
-------------+------------------------------           Adj R-squared =  0.8564
       Total |  .743471714    26  .028595066           Root MSE      =  .06407

------------------------------------------------------------------------------
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |   .1451148   .0118322    12.26   0.000     .1206944    .1695353
           e |
         L1. |   .6516013   .1631639     3.99   0.001     .3148475    .9883552
       _cons |   9.689539   .0308821   313.76   0.000     9.625801    9.753277
------------------------------------------------------------------------------

The t-test on the lagged e is highly significant, so we reject the null hypothesis of no autocorrelation. What is the corresponding F-ratio?

. test L.e

 ( 1)  L.e = 0

       F(  1,    24) =   15.95
            Prob > F =    0.0005

It is certainly significant with the small sample correction. Asymptotically, the F-ratio is distributed according to chi-square with one degree of freedom.

. scalar prob=1-chi2(1,15.95)
. scalar list prob
      prob =  .00006504

Wow. It must be incredibly significant asymptotically.

Testing for higher order autocorrelation

The Breusch-Godfrey test can be used to test for higher order autocorrelation. To test for second order serial correlation, use the lags option of the bgodfrey command.

. bgodfrey, lags(1 2)

Breusch-Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |         10.184               1                   0.0014
       2     |         14.100               2                   0.0009
---------------------------------------------------------------------------
                        H0: no serial correlation

Now that's significant. Let's see if there is even higher order autocorrelation.

. bgodfrey, lags(1 2 3 4)

Breusch-Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |         10.184               1                   0.0014
       2     |         14.100               2                   0.0009
       3     |         14.412               3                   0.0024
       4     |         15.090               4                   0.0045
---------------------------------------------------------------------------
                        H0: no serial correlation

Hmmm, looks bad. It gets worse and worse. We could go on adding lags to the test equation, but we'll stop here. These data apparently have severe high order serial correlation.

Cure for autocorrelation

There are two approaches to fixing autocorrelation. The classic textbook solution is to use Cochrane-Orcutt generalized first differences, also known as common factors.

The Cochrane-Orcutt method

Suppose we have the model above with first order autocorrelation,

Y_t = a_0 + b_0 X_t + u_t
u_t = ρ u_{t-1} + ε_t

Assume for the moment that we knew the value of the autocorrelation coefficient ρ. Lag the first equation and multiply by ρ,

ρ Y_{t-1} = ρ a_0 + b_0 ρ X_{t-1} + ρ u_{t-1}

Now subtract this equation from the first equation,

Y_t − ρ Y_{t-1} = a_0 − ρ a_0 + b_0 X_t − b_0 ρ X_{t-1} + u_t − ρ u_{t-1}
Y_t − ρ Y_{t-1} = a_0 (1 − ρ) + b_0 (X_t − ρ X_{t-1}) + ε_t

since u_t − ρ u_{t-1} = ε_t. The problem is that we don't know the value of ρ. In fact we are going to estimate a_0, b_0, and ρ jointly. Cochrane and Orcutt suggest an iterative method.

(1) Estimate the original regression Y_t = a_0 + b_0 X_t + u_t with OLS and save the residuals.
(2) Estimate the auxiliary regression e_t = ρ e_{t-1} + v_t with OLS, and save the estimate of ρ.
(3) Transform the equation to generalized first differences using the estimate of ρ,
Y_t − ρ̂ Y_{t-1} = a_0 (1 − ρ̂) + b_0 (X_t − ρ̂ X_{t-1}) + ε_t

(4) Estimate the generalized first difference regression with OLS.
(5) Go to (2).

Obviously, we need a stopping rule to avoid an infinite loop.
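The five-step loop can be sketched in Python (my own illustration, not the book's code; the simulated data, the helper function, and the 0.001 tolerance are assumptions for the example):

```python
import random

# Cochrane-Orcutt iteration: (1) OLS, (2) estimate rho from the residuals,
# (3)-(4) re-estimate on generalized first differences, (5) repeat until
# successive estimates of rho are within a small tolerance.
def simple_ols(x, y):               # intercept and slope by closed form
    n = len(x); mx = sum(x)/n; my = sum(y)/n
    b = sum((xi-mx)*(yi-my) for xi, yi in zip(x, y)) / sum((xi-mx)**2 for xi in x)
    return my - b*mx, b

random.seed(9)
T = 300
x = [random.gauss(0, 1) for _ in range(T)]
u = [random.gauss(0, 1)]
for _ in range(T-1):
    u.append(0.6*u[-1] + random.gauss(0, 1))     # true rho = 0.6
y = [2.0 + 1.5*xi + ui for xi, ui in zip(x, u)]  # true slope = 1.5

rho, a, b = 0.0, *simple_ols(x, y)
for _ in range(20):
    e = [yi - a - b*xi for xi, yi in zip(x, y)]
    new_rho = sum(e[t-1]*e[t] for t in range(1, T)) / sum(v*v for v in e[:-1])
    ys = [y[t] - new_rho*y[t-1] for t in range(1, T)]  # generalized differences
    xs = [x[t] - new_rho*x[t-1] for t in range(1, T)]
    a_star, b = simple_ols(xs, ys)
    a = a_star/(1 - new_rho)        # recover a0 from a0*(1 - rho)
    if abs(new_rho - rho) < 0.001:  # stopping rule
        break
    rho = new_rho
print(round(rho, 2), round(b, 2))   # close to the true 0.6 and 1.5
```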
The usual rule is to stop when successive estimates of ρ are within a certain amount, say .01. Note that the Cochrane-Orcutt regression is also known as feasible generalized least squares, or FGLS.

The Cochrane-Orcutt method can be implemented in Stata using the prais command. By default, the prais command uses the Prais-Winsten technique, which is a modification of the Cochrane-Orcutt method that preserves the first observation, so that you don't get a missing value because of lagging; the corc option requests plain Cochrane-Orcutt.

. prais lcrmaj prison, corc

Iteration 0:  rho = 0.0000
Iteration 1:  rho = 0.6353
Iteration 2:  rho = 0.6448
Iteration 3:  rho = 0.6462
Iteration 4:  rho = 0.6465
Iteration 5:  rho = 0.6465
Iteration 6:  rho = 0.6465
Iteration 7:  rho = 0.6465

Cochrane-Orcutt AR(1) regression -- iterated estimates

      Source |       SS       df       MS              Number of obs =      27
-------------+------------------------------           F(  1,    25) =   28.34
       Model |  .111808732     1  .111808732           Prob > F      =  0.0000
    Residual |  .098637415    25  .003945497           R-squared     =  0.5313
-------------+------------------------------           Adj R-squared =  0.5125
       Total |  .210446147    26  .008094083           Root MSE      =  .06281

------------------------------------------------------------------------------
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |   .1619739   .0304269     5.32   0.000     .0993085    .2246394
       _cons |   9.736981   .0864092   112.68   0.000     9.559018    9.914944
-------------+----------------------------------------------------------------
         rho |   .6465167
------------------------------------------------------------------------------
Durbin-Watson statistic (original)     0.691523
Durbin-Watson statistic (transformed)  1.311269

The Cochrane-Orcutt approach has improved things, but the DW statistic still indicates some autocorrelation in the residuals.

We have seen above that the Cochrane-Orcutt model is a common factor model and can be derived from a more general ADL model by imposing the restriction b_1 = −b_0 a_1. It is therefore valid only if that restriction is true. We can test the restriction to see if the CO model is a legitimate reduction from the ADL. Start with the CO model,

Y_t − ρ Y_{t-1} = a* + b (X_t − ρ X_{t-1}) + ε_t,  where a* = a_0 (1 − ρ)

Rewrite the equation as

(A)  Y_t = a* + ρ Y_{t-1} + b X_t − ρ b X_{t-1} + ε_t

Now for something completely different, namely the ADL model.
(B)  Y_t = a_0 + a_1 Y_{t-1} + b_0 X_t + b_1 X_{t-1} + ε_t

Model (A) is actually model (B) with b_1 = −b_0 a_1, since a_1 = ρ and b_1 = −ρ b_0. The Cochrane-Orcutt model (A) is true only if b_1 = −b_0 a_1, or b_1 + b_0 a_1 = 0. This is a testable hypothesis. Because the restriction multiplies two parameters, we can't use a simple F-test. Instead we use a likelihood ratio test. The likelihood ratio is a function of the residual sums of squares. It turns out that

L = T ln[ ESS(A)/ESS(B) ] ~ χ²(q)

where q = K(B) − K(A), K(A) is the number of parameters in model (A), and K(B) is the number of parameters in model (B). If the Cochrane-Orcutt model is true, then ESS(A) = ESS(B), so the ratio ESS(A)/ESS(B) = 1 and its log, ln(ESS(A)/ESS(B)), is zero. If the statistic is not significant, then the Cochrane-Orcutt model is acceptable; otherwise, you should use the ADL model.

We have already estimated the Cochrane-Orcutt model (model A); we ran it with the corc option because we don't want to keep the first observation. The residual sum of squares, ESS(A), is .0986. Stata does not keep this number on the screen for us, but we can get Stata to remember it by using scalars.

. scalar ESSA=e(rss)

Now let's estimate the ADL model (model B) and save its residual sum of squares, ESS(B).

. regress lcrmaj prison L.lcrmaj L.prison

      Source |       SS       df       MS              Number of obs =      27
-------------+------------------------------           F(  3,    23) =   50.18
       Model |   .64494144     3   .21498048           Prob > F      =  0.0000
    Residual |  .098530274    23  .004283925           R-squared     =  0.8675
-------------+------------------------------           Adj R-squared =  0.8502
       Total |  .743471714    26  .028595066           Root MSE      =  .06545

------------------------------------------------------------------------------
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |   .1457706   .1082056     1.35   0.191    -.0780698    .3696105
      lcrmaj |
         L1. |   .6515369   .1670076     3.90   0.001     .3060554    .9970184
      prison |
         L1. |  -.0883118   .1116564    -0.79   0.437    -.3192906    .1426671
       _cons |   3.369611   1.611994     2.09   0.048     .0349064    6.704316
------------------------------------------------------------------------------

. scalar ESSB=e(rss)

Now do the likelihood ratio test.

. scalar L=27*log(ESSA/ESSB)
. scalar list
         L =  .02934373
      ESSA =  .09863742
      ESSB =  .09853027
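The same arithmetic can be reproduced outside Stata. This Python fragment (my own restatement of the calculation above) uses the two reported residual sums of squares:

```python
import math

# Likelihood-ratio test of the common factor restriction:
# L = T * ln(ESS(A)/ESS(B)), restricted over unrestricted, so L >= 0.
essa, essb, T = .09863742, .09853027, 27
L = T * math.log(essa / essb)
crit = 3.84                        # 5% chi-squared critical value with 1 df
print(round(L, 4), L > crit)       # about .0293, nowhere near significant
```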
Since ESSA and ESSB are virtually identical, the test statistic is nowhere near the five percent chi-square critical value. The common factor restriction is not rejected, so Cochrane-Orcutt is justified in this case.

Curing autocorrelation with lags

Despite the fact that Cochrane-Orcutt appears to be justified in the above example, I do not recommend using it. For one thing, the Cochrane-Orcutt method did not really solve the problem: the Durbin-Watson statistic still showed some autocorrelation even after FGLS. The reason, almost certainly, is that the model omits too many variables. As we noted above, the most common cause of autocorrelation is omitted time-varying variables, for example, reaction times, momentum, etc., all of which are likely to be time-varying.

The deeper problem is that Cochrane-Orcutt is a purely mechanical solution; it has nothing to do with economics. It is much more satisfying to specify the problem allowing for lags that are very likely to be causing the dynamic behavior than to use a simple statistical fix. We have seen above that the ADL model is a generalization of the Cochrane-Orcutt model. Since using Cochrane-Orcutt involves restricting the ADL to have a common factor, a restriction that may or may not be true, why not simply avoid the problem by estimating the ADL directly? Besides, the ADL model allows us to model the dynamic behavior directly.

For these reasons, most modern time series analysts do not recommend Cochrane-Orcutt. Instead they recommend adding variables and lags, especially lags of the dependent variable, until the LM test indicates no significant autocorrelation. (Use the ten percent significance level when using this, or any other, specification test.) For example, adding two lags of crime to the static crime equation eliminates the autocorrelation.

. regress lcrmaj prison L.lcrmaj LL.lcrmaj

      Source |       SS       df       MS              Number of obs =      28
-------------+------------------------------           F(  3,    24) =   59.71
       Model |  .659768412     3  .219922804           Prob > F      =  0.0000
    Residual |   .08839505    24  .003683127           R-squared     =  0.8819
-------------+------------------------------           Adj R-squared =  0.8671
       Total |  .748163462    27  .027709758           Root MSE      =  .06069

------------------------------------------------------------------------------
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |   .0635699   .0208728     3.05   0.006     .0204905    .1066493
      lcrmaj |
         L1. |   .9444664   .2020885     4.67   0.000     .5273762    1.361557
         L2. |  -.3882994   .1902221    -2.04   0.052    -.7808986    .0042997
       _cons |   4.293413   1.520191     2.82   0.009     1.155892    7.430934
------------------------------------------------------------------------------

. bgodfrey

Breusch-Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |          0.747               1                   0.3875
---------------------------------------------------------------------------
                        H0: no serial correlation
To make sure, we check for higher order autocorrelation as well.

. bgodfrey, lags(1 2 3 4)

Breusch-Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |          0.747               1                   0.3875
       2     |          2.220               2                   0.3296
       3     |          2.284               3                   0.5156
       4     |          6.046               4                   0.1957
---------------------------------------------------------------------------
                        H0: no serial correlation

There is no significant autocorrelation at any lag.

Heteroskedastic and autocorrelation consistent standard errors

It is possible to generalize the concept of robust standard errors to include a correction for autocorrelation. In the chapter on heteroskedasticity we derived the formula for heteroskedasticity-consistent standard errors. Note that the variance formula can be rewritten as follows,

Var(β̂) = [ 1 / (Σ x²_t)² ] E[ x²_1 u²_1 + x²_2 u²_2 + ... + x²_T u²_T
          + x_1 x_2 u_1 u_2 + x_1 x_3 u_1 u_3 + x_1 x_4 u_1 u_4 + ... ]

or, taking expectations term by term,

Var(β̂) = [ 1 / (Σ x²_t)² ] [ x²_1 Var(u_1) + x²_2 Var(u_2) + ... + x²_T Var(u_T)
          + x_1 x_2 Cov(u_1,u_2) + x_1 x_3 Cov(u_1,u_3) + x_1 x_4 Cov(u_1,u_4) + ... ]

In the absence of autocorrelation, E(u_i u_j) = 0 for i ≠ j (independence), so the OLS estimator ignores the cross-product terms and is unbiased. The problem is, if we have autocorrelation, then E(u_i u_j) ≠ 0 and we cannot ignore these terms: the OLS estimates of the standard errors are no longer unbiased. If the covariances are positive (positive autocorrelation) and the x's are positively correlated, which is the usual case, then ignoring them will cause the standard error to be underestimated and the t-ratios to be overestimated, making variables appear to be related when they are in fact independent. This is the famous second order bias of the standard errors under autocorrelation. Things are made more complicated if the observations on x are negatively correlated. If the covariances are negative while the x's are positively correlated, then ignoring them will cause the standard error to be overestimated. If some terms are positive and some negative, the direction of bias of the standard error will be unknown. The one thing we do know is that if we ignore the covariances when they are nonzero, the standard errors are not consistent.

How should we estimate the covariance terms? We could use a small number of terms, but that doesn't do it either, because it ignores higher order autocorrelation. On the other hand, if we use all the terms, then as the sample size goes to infinity the number of terms goes to infinity; since each term contains estimation error, the amount of error goes to infinity and the estimate is not consistent. Newey and West suggested a formula that increases the number of terms as the sample size increases, but at a much slower rate. The result is that the estimate is consistent.
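The idea can be sketched numerically. The Python fragment below (my own illustration of a Bartlett-kernel HAC variance for a no-intercept slope; it is a sketch of the general approach, not Stata's internal implementation) compares the usual OLS variance with the HAC variance when both x and the errors are positively autocorrelated, the usual case in which OLS standard errors are too small:

```python
import random

# HAC (Newey-West style) variance for the slope of y = beta*x + u,
# with Bartlett-kernel weights w(l) = 1 - l/(m+1) on the first m lags.
def newey_west_var(x, e, m):
    sxx = sum(v*v for v in x)
    s = sum((xi*ei)**2 for xi, ei in zip(x, e))        # lag-0 terms
    T = len(x)
    for l in range(1, m+1):
        w = 1 - l/(m+1)                                # Bartlett kernel weight
        s += 2*w*sum(x[t]*e[t]*x[t-l]*e[t-l] for t in range(l, T))
    return s/sxx**2

random.seed(5)
T = 400
x = [random.gauss(0, 1)]
u = [random.gauss(0, 1)]
for _ in range(T-1):
    x.append(0.7*x[-1] + random.gauss(0, 1))           # autocorrelated regressor
    u.append(0.7*u[-1] + random.gauss(0, 1))           # autocorrelated errors
y = [1.0*xi + ui for xi, ui in zip(x, u)]

beta = sum(xi*yi for xi, yi in zip(x, y))/sum(xi*xi for xi in x)
e = [yi - beta*xi for xi, yi in zip(x, y)]

v_ols = (sum(v*v for v in e)/(T-1))/sum(xi*xi for xi in x)   # usual OLS variance
v_hac = newey_west_var(x, e, m=8)
print(v_hac > v_ols)   # HAC variance is larger, as the argument predicts
```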
The consistency comes from letting the number of included covariance terms grow with the sample size, but always with fewer than T terms. The formula relies on something known as the Bartlett kernel, but that is beyond the scope of this manual. We can make the formula operational by using the residuals from the OLS regression to estimate u.

To get heteroskedasticity and autocorrelation consistent (HAC, aka Newey-West) standard errors and t-ratios, use the newey command. Note that you must specify the number of lags in the newey command. A good rule of thumb is at least two lags for annual data, five lags for quarterly data, and 13 lags for monthly data.

. newey lcrmaj prison, lag(2)

Regression with Newey-West standard errors         Number of obs  =        28
maximum lag: 2                                     F(  1,    26)  =     60.77
                                                   Prob > F       =    0.0000

------------------------------------------------------------------------------
             |             Newey-West
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |   .1345066   .0172496     7.80   0.000     .0990496    .1699636
       _cons |   9.663423   .0498593   193.80   0.000     9.560936    9.765911
------------------------------------------------------------------------------

We can also use Newey-West HAC standard errors in our ADL equation.

. newey lcrmaj prison L.lcrmaj LL.crmaj, lag(2)

Regression with Newey-West standard errors         Number of obs  =        28
maximum lag: 2                                     F(  3,    24)  =     66.67
                                                   Prob > F       =    0.0000

------------------------------------------------------------------------------
             |             Newey-West
      lcrmaj |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      prison |   .0608961   .0193394     3.15   0.004     .0209816    .1008106
      lcrmaj |
         L1. |   .9678325   .1822947     5.31   0.000     .5915948    1.344070
       crmaj |
         L2. |  -.0000335   .0000103    -3.25   0.003    -.0000548   -.0000123
       _cons |   2.628212   .8272894     3.18   0.004     .9207686    4.335655
------------------------------------------------------------------------------

Summary

David Hendry, one of the leading time series econometricians, suggests the following general-to-specific methodology. Start with a very general ADL model with lots of lags. Check the Breusch-Godfrey test to make sure there is no autocorrelation. If there is, add more lags. Once you have no autocorrelation, you can begin reducing the model to a parsimonious, interpretable model. Delete insignificant variables, justifying your deletions with t-tests and F-tests, using Newey-West heteroskedasticity and autocorrelation robust t-ratios. Make sure that dropping variables does not cause autocorrelation; if it does, then keep the variable even if it is not significant. When you get down to a parsimonious model with mostly significant variables, make sure that all of the omitted variables are jointly insignificant with an overall F-test. Finally, make sure that there is no autocorrelation in the final version of the model.
15 NONSTATIONARITY, UNIT ROOTS, AND RANDOM WALKS

The Gauss-Markov assumptions require that the error term is distributed as iid(0, σ²). We have seen how we can avoid problems associated with non-independence (autocorrelation) and non-identical variances (heteroskedasticity). However, so far we have maintained the assumption that the mean of the distribution is constant (and equal to zero). In this section we confront the possibility that the error mean is not constant.

A time series is stationary if its probability distribution does not change over time. The probability of observing any given y is conditional on all past values, Pr(y_t | y_{t-1}, ..., y_1, y_0). A stationary process is one where this distribution doesn't change over time:

Pr(y_t, y_{t+1}, ..., y_{t+p}) = Pr(y_{t+q}, y_{t+q+1}, ..., y_{t+p+q})

for any t, p, or q. If a series is stationary, then its mean and variance are constant over time. Also, the covariances of the series must not be dependent on time, that is, Cov(y_t, y_{t+p}) = Cov(y_{t+q}, y_{t+p+q}).

If a series is stationary, we can infer its properties from examination of its histogram, its sample mean, its sample variance, and its sample autocovariances. If a series is not stationary, then we can't do any of these things: we cannot consider a sample of observations over time as representative of the true distribution, because the truth at time t is different from the truth at time t+p.

Consider a simple deterministic time trend,

y_t = α + δt

It is called deterministic because it can be forecast with complete accuracy. Is y stationary? The mean of y is

E(y_t) = μ_t = α + δt

Clearly, the mean changes over time. Therefore, a linear time trend is not stationary. Now consider a simple random walk.
y_t = y_{t-1} + ε_t,  ε_t ~ iid(0, σ²)

For the sake of argument, let's assume that y = 0 when t = 0. Then

y_1 = 0 + ε_1
y_2 = 0 + ε_1 + ε_2
y_3 = 0 + ε_1 + ε_2 + ε_3

so that

y_t = ε_1 + ε_2 + ... + ε_t

Thus, the variance of y_t is

Var(y_t) = E(ε_1 + ε_2 + ... + ε_t)² = E(ε_1)² + E(ε_2)² + ... + E(ε_t)²

because of the independence assumption (the cross products have zero expectation). Because of the identical assumption, σ²_1 = σ²_2 = ... = σ², which implies that

Var(y_t) = tσ²

The variance of y depends on time, so y cannot be stationary. A random walk has potentially infinite variance: the variance of y goes to infinity as t goes to infinity. There is no stable distribution from which the usual statistics can be derived.

As we saw in the section on linear dynamic models, a random walk can be written as

y_t − y_{t-1} = Δy_t = (1 − L) y_t = ε_t

The lag polynomial is (1 − L), so the root of the lag polynomial is one and y is said to have a unit root. A random walk is also known as a stochastic trend; unit root and stochastic trend are used interchangeably. Many economic time series appear to behave like stochastic trends. As a result, most time series analysts mean stochastic trend when they say trend. Unlike the deterministic trend above, stochastic trends can exhibit booms and busts. Also unlike deterministic trends, we cannot forecast future values of y with certainty. Just because a random walk has been increasing for several time periods does not mean that it will necessarily continue to increase, and just because it is falling does not mean we can expect it to continue to fall.

Random walks and ordinary least squares

If we estimate a regression where the independent variable is a random walk, then the usual t- and F-tests are invalid. The reason is that if the regressor is nonstationary, it has no stable distribution, so the standard errors and t-ratios are not derived from a stable distribution. The problem is similar to autocorrelation, only worse, because the autocorrelation coefficient is equal to one, and autocorrelation creates overestimated t-ratios.

Granger and Newbold published an article in 1974 where they did some Monte Carlo exercises. They generated two random walks completely independently of one another and then regressed one on the other. Even though the true relationship between the two variables was zero, Granger and Newbold got t-ratios around 4.5. They called this phenomenon "spurious regression." We can replicate their findings in Stata. In the crimeva.dta data set we first generate a random walk called y1, starting it out at y1=0 in the first year, 1956.

. gen y1=0

This creates a series of zeros from 1956 to 2005. Now create the random walk by making y1 equal to itself lagged plus a random error term, using the replace command to replace the zeroes. We have to start in 1957 so we have a previous value (zero) to start from. The function 5*invnorm(uniform()) produces a random number from a normal distribution with a mean of zero and a standard deviation of 5.
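Before the Stata replication, the Var(y_t) = tσ² result itself can be checked by simulation. The Python sketch below (my own illustration, not from the text) simulates many random walk paths with σ = 5 and compares the sample variance at t = 10 and t = 50 with tσ²:

```python
import random

# The variance of a random walk grows like t*sigma^2: simulate many
# paths and compare the cross-path variance at two dates with t*sigma^2.
random.seed(11)
sigma = 5.0
n_paths, T = 2000, 50
values = {10: [], 50: []}
for _ in range(n_paths):
    y = 0.0
    for t in range(1, T+1):
        y += random.gauss(0, sigma)
        if t in values:
            values[t].append(y)

var10 = sum(v*v for v in values[10])/n_paths   # mean is zero by construction
var50 = sum(v*v for v in values[50])/n_paths
print(round(var10/sigma**2, 1), round(var50/sigma**2, 1))  # near 10 and 50
```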
. replace y1=L.y1+5*invnorm(uniform()) if year > 1956
(49 real changes made)

Repeat for y2.

. gen y2=0
. replace y2=L.y2+5*invnorm(uniform()) if year > 1956
(49 real changes made)

Obviously, y1 and y2 are independent of each other, so if we regressed y1 on y2 there should be no significant relationship.

. regress y1 y2

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   55.23
       Model |  9328.23038     1  9328.23038           Prob > F      =  0.0000
    Residual |  8107.68518    48  168.910108           R-squared     =  0.5350
-------------+------------------------------           Adj R-squared =  0.5253
       Total |  17435.9156    49  355.835011           Root MSE      =  12.997

------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          y2 |  -.9748288   .1311767    -7.43   0.000    -1.238577   -.7110805
       _cons |   14.87052   2.009038     7.40   0.000     10.83108    18.90996
------------------------------------------------------------------------------

This is a spurious regression. It looks like y1 is negatively (and highly significantly) associated with y2, even though they are completely independent.

Testing for unit roots

Since most time series are autocorrelated, to avoid spurious regressions we need to know if our variables have unit roots. The most widely used test for a unit root is the Augmented Dickey-Fuller (ADF) test. The idea is very simple. Suppose we start with a simple AR(1) model,

y_t = a_0 + a_1 y_{t-1} + ε_t,  ε_t ~ iid(0, σ²)

Subtract y_{t-1} from both sides,

y_t − y_{t-1} = a_0 + a_1 y_{t-1} − y_{t-1} + ε_t

or

Δy_t = a_0 + (a_1 − 1) y_{t-1} + ε_t

This is the (non-augmented) Dickey-Fuller test equation, known as a DF equation. Note that if the AR model is stationary, a_1 < 1, in which case a_1 − 1 < 0. On the other hand, if y has a unit root, then a_1 = 1 and a_1 − 1 = 0. So we can test for a unit root using a t-test on the lagged y.

Since most time series are autocorrelated, we should also do something to correct for autocorrelation in the test equation. Consider the following AR(1) model with first order autocorrelation,

y_t = a_1 y_{t-1} + u_t
u_t = ρ u_{t-1} + ε_t,  ε_t ~ iid(0, σ²)
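A miniature version of the Granger-Newbold experiment can be run in Python as well (my own illustration, not the book's Stata run; sample size and number of replications are arbitrary): regress one pure random walk on another, independent one, and record how often |t| exceeds the usual five percent critical value.

```python
import random, math

# Spurious regression Monte Carlo: independent random walks regressed on
# each other produce "significant" t-ratios far more than 5% of the time.
def t_ratio(T, rng):
    y1, y2 = [0.0], [0.0]
    for _ in range(T-1):
        y1.append(y1[-1] + rng.gauss(0, 1))
        y2.append(y2[-1] + rng.gauss(0, 1))
    mx, my = sum(y2)/T, sum(y1)/T
    sxx = sum((v - mx)**2 for v in y2)
    b = sum((u - mx)*(w - my) for u, w in zip(y2, y1))/sxx
    a = my - b*mx
    e = [w - a - b*u for u, w in zip(y2, y1)]
    s2 = sum(v*v for v in e)/(T - 2)
    return abs(b)/math.sqrt(s2/sxx)

random.seed(2)
share = sum(t_ratio(50, random) > 2.01 for _ in range(500))/500
print(share)   # far above the nominal 5 percent rejection rate
```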
Testing for unit roots

The most widely used test for a unit root is the Augmented Dickey-Fuller (ADF) test. The idea is very simple. Suppose we start with a simple AR(1) model,

    y_t = a_0 + a_1 y_{t-1} + ε_t

Subtract y_{t-1} from both sides:

    y_t - y_{t-1} = a_0 + a_1 y_{t-1} - y_{t-1} + ε_t

or

    ∆y_t = a_0 + (a_1 - 1) y_{t-1} + ε_t

This is the (non-augmented) Dickey-Fuller test equation, known as a DF equation. Note that if the AR model is stationary, a_1 < 1, in which case a_1 - 1 < 0. On the other hand, if y has a unit root, then a_1 = 1 and a_1 - 1 = 0. So we can test for a unit root with a t-test on the coefficient of the lagged y.

Now consider the following AR(1) model with first-order autocorrelation:

    y_t = a_1 y_{t-1} + u_t
    u_t = ρ u_{t-1} + ε_t,   ε_t ~ iid(0, σ²)

But the error term u_t is also

    u_t = y_t - a_1 y_{t-1}
    u_{t-1} = y_{t-1} - a_1 y_{t-2}
    ρ u_{t-1} = ρ y_{t-1} - ρ a_1 y_{t-2}

Therefore,

    y_t = a_1 y_{t-1} + ρ y_{t-1} - ρ a_1 y_{t-2} + ε_t

Subtract y_{t-1} from both sides to get

    y_t - y_{t-1} = a_1 y_{t-1} - y_{t-1} + ρ y_{t-1} - ρ a_1 y_{t-2} + ε_t

Add and subtract ρ a_1 y_{t-1}:

    ∆y_t = a_1 y_{t-1} - y_{t-1} + ρ a_1 y_{t-1} - ρ a_1 y_{t-1} + ρ y_{t-1} - ρ a_1 y_{t-2} + ε_t

Collect terms:

    ∆y_t = (a_1 - 1 + ρ - ρ a_1) y_{t-1} + ρ a_1 (y_{t-1} - y_{t-2}) + ε_t
    ∆y_t = (a_1 - 1)(1 - ρ) y_{t-1} + ρ a_1 ∆y_{t-1} + ε_t
    ∆y_t = γ y_{t-1} + δ ∆y_{t-1} + ε_t

If a_1 = 1 (unit root), then γ = 0, and we can test this hypothesis with a t-test. Note that the lagged difference has eliminated the autocorrelation, generating an iid error term; we know from our analysis of autocorrelation that this is a cure for autocorrelation. The lagged differences are said to "augment" the Dickey-Fuller test. If there is higher-order autocorrelation, simply add more lagged differences: all we are really doing is adding lags of the dependent variable to the test equation. The ADF test equation in general form is

    ∆y_t = a_0 + (a_1 - 1) y_{t-1} + c_1 ∆y_{t-1} + c_2 ∆y_{t-2} + ... + c_p ∆y_{t-p} + ε_t

I keep saying that we test the hypothesis that (a_1 - 1) = 0 with the usual t-test. The t-ratio is computed as usual, the estimated parameter divided by its standard error, but while the computation is standard, its distribution is anything but. Dickey and Fuller, and other researchers, have tabulated critical values for the t-ratio in the Dickey-Fuller test equation, which has a nonstandard distribution. These critical values are tabulated in many econometrics books and are automatically available in Stata.

The ADF test equation is estimated using OLS, and Dickey-Fuller unit root tests can be done with the dfuller command. As an example, consider the random walks we created above in the crimeva.dta data set. Since we know there is no autocorrelation, we will not add any lags.

. dfuller y1, lags(0)

DickeyFuller test for unit root                    Number of obs   =        49

                  ---------- Interpolated DickeyFuller ---------
                   Test        1% Critical    5% Critical    10% Critical
                Statistic         Value          Value           Value
------------------------------------------------------------------------------
 Z(t)             -1.761          -3.587         -2.933          -2.601
------------------------------------------------------------------------------
* MacKinnon approximate pvalue for Z(t) = 0.4001

We cannot reject the null hypothesis of a unit root, as expected. Note that the critical values are much higher (in absolute value) than the usual t-distribution critical values. These critical values are derived from Monte Carlo exercises, and because we know exactly how y1 and y2 were generated, we can do our own mini Monte Carlo exercise using y1 and y2.

174
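The mini Monte Carlo idea can be sketched as follows. The program name (mcdf), the seed, and the replication count are my own choices, not the book's; the idea is simply to simulate many driftless random walks, run the DF regression on each, and look at the percentiles of the resulting t-ratios.

```stata
* Hedged sketch of a mini Monte Carlo for the DF t-ratio.
* The program name (mcdf), seed, and replication count are illustrative.
capture program drop mcdf
program define mcdf, rclass
    drop _all
    set obs 50
    gen t = _n
    tsset t
    gen y = 0
    replace y = L.y + invnorm(uniform()) if t > 1   // driftless random walk
    regress D.y L.y, noconstant                     // DF equation, no intercept
    return scalar tstat = _b[L.y] / _se[L.y]
end

set seed 12345
simulate tstat = r(tstat), reps(1000) nodots: mcdf
_pctile tstat, p(1 5 10)
display "1%: " r(r1) "   5%: " r(r2) "   10%: " r(r3)
```

With enough replications, the three percentiles should land near the no-constant DF critical values that Stata reports (roughly -2.6, -1.95, and -1.6), and nowhere near the usual t-table values.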
We should always have an intercept in any real-world application, but we know that we did not use an intercept when we constructed y1, so let's rerun the DF test with no constant.

. dfuller y1, noconstant lags(0)

DickeyFuller test for unit root                    Number of obs   =        49

                  ---------- Interpolated DickeyFuller ---------
                   Test        1% Critical    5% Critical    10% Critical
                Statistic         Value          Value           Value
------------------------------------------------------------------------------
 Z(t)              0.143          -2.622         -1.950          -1.610
------------------------------------------------------------------------------

The t-ratio is approximately zero, as expected, so again we cannot reject a unit root. Note the critical values: they are still much higher than the corresponding critical values for the standard t-distribution. For reasons that are explained in the last section of this chapter, the critical values are also different for test equations with deterministic trends. Fortunately, Stata knows all this and adjusts accordingly.

So far we have tested for a unit root where the alternative hypothesis is that the variable is stationary around a constant value. If a variable has a distinct trend, then we need to be able to test for a unit root against the alternative that y is stationary around a deterministic trend. To do this we simply add a deterministic trend term to the ADF test equation:

    ∆y_t = a_0 + (a_1 - 1) y_{t-1} + δ t + c_1 ∆y_{t-1} + c_2 ∆y_{t-2} + ... + c_p ∆y_{t-p} + ε_t

The t-ratio on (a_1 - 1) is again the test statistic. Since it makes a difference whether the variable we are testing for a unit root has a trend or not, it is a good idea to graph the variable and look at it. If it has a distinct trend, include a trend in the test equation. Many economic variables, like GDP, have a distinct upward trend. For example, we might want to test the prison population (in the crimeva.dta data set) for stationarity. The first thing we do is graph it.

175
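The text does not show the graphing command itself. Assuming crimeva.dta is loaded and tsset on year, one simple way is:

```stata
* A sketch of the graphing step (assumes the data are tsset on year)
tsline prison, ytitle("Prison population") xtitle("Year")
```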
This variable has a distinct trend. Is it a deterministic trend or a stochastic trend? Let's test for a unit root using an ADF test equation with a deterministic trend.

How many lags should we include? The rule of thumb is at least two for annual data, five for quarterly data, etc. It is of some importance to get the number of lags right: too few lags causes autocorrelation, with overestimated t-ratios, which means that our OLS estimates are biased and inconsistent, while too many lags means that our estimates are inefficient, with underestimated t-ratios. Which is worse? Monte Carlo studies have shown that including too few lags is much worse. There are three approaches to this problem. The first is to use t-tests and F-tests on the lags to decide whether to drop them or not. The second is to use model selection criteria. The third is to do it all possible ways and see if we get the same answer; the problem with the third approach is that, if we get different answers, we don't know which to believe.

Choosing the number of lags with F tests

I recommend using F-tests to choose the lag length: look at the t-ratios on the lags and drop the insignificant lags (use a .15 significance level to be sure not to drop too many). The t- and F-ratios on the lags are standard, so use the standard critical values provided by Stata. Since these data are annual, we start with four lags and add the regress option to see the test equation.

. dfuller prison, lags(4) regress trend

Augmented DickeyFuller test for unit root          Number of obs   =        23

                  ---------- Interpolated DickeyFuller ---------
                   Test        1% Critical    5% Critical    10% Critical
                Statistic         Value          Value           Value
------------------------------------------------------------------------------
 Z(t)             -1.318          -4.380         -3.600          -3.240
------------------------------------------------------------------------------
* MacKinnon approximate pvalue for Z(t) = 0.8809

(test-equation coefficient estimates omitted)

It appears that the last three lags are not significant. Let's test this hypothesis with an F-test.

. test L4D.prison L3D.prison L2D.prison

 ( 1)  L4D.prison = 0
 ( 2)  L3D.prison = 0
 ( 3)  L2D.prison = 0

       F(  3,    16) =    0.86
            Prob > F =    0.4835

176

So, the last three lags are indeed not significant, and we settle on one lag as the correct model.

. dfuller prison, lags(1) regress trend

Augmented DickeyFuller test for unit root          Number of obs   =        26

                  ---------- Interpolated DickeyFuller ---------
                   Test        1% Critical    5% Critical    10% Critical
                Statistic         Value          Value           Value
------------------------------------------------------------------------------
 Z(t)             -1.832          -4.371         -3.596          -3.238
------------------------------------------------------------------------------
* MacKinnon approximate pvalue for Z(t) = 0.6889

(test-equation coefficient estimates omitted)

Note that the lag length apparently doesn't matter, because the DF test statistic is not significant in either model. The prison population in Virginia appears to be a random walk.

In some cases it is not clear whether to use a trend or not. Here is the graph of the log of major crime in Virginia.

[time-series plot of lcrmaj, the log of major crime in Virginia]

Does it have a trend, two trends (one up, one down), or no trend? Let's start with a trend and four lags.

. dfuller lcrmaj, lags(4) regress trend

Augmented DickeyFuller test for unit root          Number of obs   =        35

                  ---------- Interpolated DickeyFuller ---------
                   Test        1% Critical    5% Critical    10% Critical
                Statistic         Value          Value           Value
------------------------------------------------------------------------------
 Z(t)             -2.350          -4.288         -3.560          -3.216
------------------------------------------------------------------------------
* MacKinnon approximate pvalue for Z(t) = 0.4055

(test-equation coefficient estimates omitted)

177
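Because the ADF regression is ordinary OLS, the test equation that the regress option reports can also be fit directly, which makes the follow-up lag tests transparent. A sketch (the lag count and the use of year as the trend term are my assumptions):

```stata
* Hedged sketch: the ADF test equation fit directly by OLS.
* The lag count and the use of year as the trend term are assumptions.
regress D.lcrmaj L.lcrmaj L(1/4)D.lcrmaj year
* The t-ratio on L.lcrmaj is the ADF statistic (use DF critical values);
* the t- and F-ratios on the lagged differences are standard:
test L4D.lcrmaj L3D.lcrmaj L2D.lcrmaj LD.lcrmaj
```

Up to the parameterization of the trend, this reproduces what dfuller does internally, so the coefficient on L.lcrmaj matches the reported Z(t).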
The trend is significant, so we retain it. Let's test for the significance of the lags.

. test L4D.lcrmaj L3D.lcrmaj L2D.lcrmaj LD.lcrmaj

 ( 1)  L4D.lcrmaj = 0
 ( 2)  L3D.lcrmaj = 0
 ( 3)  L2D.lcrmaj = 0
 ( 4)  LD.lcrmaj = 0

       F(  4,    28) =    0.72
            Prob > F =    0.5879

None of the lags are significant, so we drop them all and run the test again.

. dfuller lcrmaj, lags(0) regress trend

DickeyFuller test for unit root                    Number of obs   =        39

                  ---------- Interpolated DickeyFuller ---------
                   Test        1% Critical    5% Critical    10% Critical
                Statistic         Value          Value           Value
------------------------------------------------------------------------------
 Z(t)             -1.314          -4.251         -3.544          -3.206
------------------------------------------------------------------------------
* MacKinnon approximate pvalue for Z(t) = 0.8845

(test-equation coefficient estimates omitted)

Again, the number of lags does not make any real difference: the ADF test statistics are all insignificant, indicating that crime in Virginia appears to be a random walk.

Digression: model selection criteria

One way to think of the problem is to hypothesize that the correct model should have the highest R-square and the lowest error variance compared to the alternative models. Recall that the R-square measures the ratio of the variation explained by the regression to the total variation of the dependent variable. The problem with this approach is that the R-square can be maximized by simply adding more independent variables. One way around this is to correct the R-square for the number of explanatory variables used in the model. The adjusted R-square, also known as R̄² (R-bar-square), is based on the ratio of the variance (instead of the variation) explained by the regression to the total variance:

    R̄² = 1 - Var(e)/Var(y) = 1 - (1 - R²)(T - 1)/(T - k)

where T is the sample size and k is the number of independent variables.

178
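The R̄² formula can be checked against Stata's own computation after any regression; a sketch (the model fit here is illustrative):

```stata
* A sketch: recover the adjusted R-square by hand after a regression.
* The model fit here is illustrative.
quietly regress D.lcrmaj L.lcrmaj year
local k = e(df_m) + 1                     // slopes plus the intercept
display "R2    = " %6.4f e(r2)
display "Rbar2 = " %6.4f 1 - (1 - e(r2))*(e(N) - 1)/(e(N) - `k')
display "check = " %6.4f e(r2_a)          // Stata's own adjusted R-square
```

The two displayed adjusted R-squares should agree, since e(r2_a) is computed from exactly this formula.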
One rule might be to choose the number of lags such that R̄² is a maximum. As we add explanatory variables, R² increases, but so does the penalty factor (T - 1)/(T - k), so R̄² can go up or down.

The Akaike information criterion (AIC) has been proposed as an improvement over R̄², on the theory that R̄² does not penalize heavily enough for adding explanatory variables. The formula is

    AIC = log( Σe_i² / T ) + 2k/T

The goal is to choose the number of explanatory variables (lags, in the context of a unit root test) that minimizes the AIC. The Bayesian information criterion (BIC, also known as the Schwarz criterion) is

    BIC = log( Σe_i² / T ) + k log(T)/T

The BIC penalizes additional explanatory variables even more heavily than the AIC. Again, the goal is to minimize the criterion: the minimum (the biggest negative number) is the one to go with. Note that these information criteria are not formal statistical tests; however, they have been shown to be useful in determining the lag length in unit root models. I usually use the BIC, but usually both give the same answer.

Choosing lags using model selection criteria

Let's adopt the information criterion approach. If your version of Stata has the fitstat command, invoke fitstat after the dfuller command; fitstat generates a variety of goodness-of-fit statistics, including the BIC. If your version does not have fitstat, you can install it by doing a keyword search for fitstat and clicking on "click here to install." Then run the test several times with different lags and check the information criterion.

. dfuller prison, lags(2) trend
(output omitted; the test statistic is insignificant, with MacKinnon
 approximate pvalue for Z(t) = 0.8182)

. fitstat

Measures of Fit for regress of D.prison
(statistics omitted)

179
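If fitstat is unavailable, Stata's built-in estat ic postestimation command reports the AIC and BIC after fitting the test equation by hand; a sketch (using year as the trend term is my choice):

```stata
* Hedged alternative to fitstat: Stata's built-in information criteria.
* Fits the lags(2) test equation by hand, with year as the trend.
quietly regress D.prison L.prison L(1/2)D.prison year
estat ic        // reports the log likelihood, AIC, and BIC for this fit
```

Note that estat ic uses the likelihood-based definitions of the AIC and BIC, so its numbers are on a different scale from the formulas above, but the ranking across lag lengths is what matters.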
. so we are pretty confident that prison is a stochastic trend.425 47. dfuller prison.706 0.160 7.494 1.prison LogLik Intercept Only: D(22): R2: AIC: BIC: 18. fitstat Measures of Fit for regress of D. 180 .4069 . lags(0) trend DickeyFuller test for unit root Number of obs = 27 .592 3.081 37.Interpolated DickeyFuller Test 1% Critical 5% Critical 10% Critical Statistic Value Value Value Z(t) 2.371 3.427 43. dfuller prison.Interpolated DickeyFuller Test 1% Critical 5% Critical 10% Critical Statistic Value Value Value Z(t) 1. lags(1) trend Augmented DickeyFuller test for unit root Number of obs = 26 .308 2.prison LogLik Intercept Only: D(24): R2: AIC: BIC: 19.138 So we go with one lag.654 4.238 * MacKinnon approximate pvalue for Z(t) = 0.580 17.814 126.596 3.408 LogLik Full Model: LR(2): Prob > LR: Adjusted R2: AIC*n: BIC': 21.001 0.352 4.382 122.454 0.838 LogLik Full Model: LR(3): Prob > LR: Adjusted R2: AIC*n: BIC': 27.308 0.7726 .647 4. fitstat Measures of Fit for regress of D.160 0.362 3.727 55.108 0.931 .152 1. In this case it doesn’t make any difference.235 * MacKinnon approximate pvalue for Z(t) = 0.
A more powerful version of the Dickey-Fuller test was proposed by Elliott, Rothenberg, and Stock (ERS, 1996). In their procedure the series is first detrended using a coefficient close to one, 1 - 13.5/T if there is a trend, before the DF regression is run. To see the problem they address, suppose the null hypothesis is

    H0: Y_t = Y_{t-1} + δ + ε_t   or   ∆Y_t = δ + ε_t,   where ε_t is I(0)

The alternative hypothesis is

    H1: Y_t = α + β t + ε_t   or   Y_t - α - β t = ε_t,   where ε_t is I(0)

If the null hypothesis is true but we assume the alternative and remove the deterministic trend α + β t, then

    Y_t - α - β t = Y_{t-1} + (δ - α) - β t + ε_t

so the detrended Y is still a random walk.