
09/09/2011
Source: http://www.princeton.edu/~otorres/Stata/
Zeynep Ugur
Regression
Technically, linear regression estimates how much Y changes when X changes by one unit.
In Stata, use the command "regress". Type:
regress [dependent variable] [independent variable(s)]
regress y x
In a multivariate setting we type:
regress y x1 x2 x3
Before running a regression it is recommended to have a clear idea of what you
are trying to estimate (i.e. which are your outcome and predictor variables).
A regression makes sense only if there is a sound hypothesis behind it.
Regression: example
Example: Do older people report lower life satisfaction, controlling for other factors?
Outcome (Y) variable: life satisfaction (cp08a011 in the sample dataset)
Predictor (X) variables:
Age of household member (leeftijd)
Nationality (cr08a043)
Gender (geslacht)
Level of education (oplcat)
Personal monthly income in categories (nettocat)
Civil status (burgstat)

Assuming that the sample dataset is saved on the desktop, type:

use "C:\Documents and Settings\Administrator\Desktop\sample dataset.dta"
Regression: variables
It is recommended to first examine the variables in the model to check for possible errors. Type:
describe lifesatisfaction age dutch female married nevermarried netincome educ

summarize lifesatisfaction age dutch female married nevermarried netincome educ





Regression: what to look for
Let's run the regression:

regress lifesatisfaction age dutch female married nevermarried netincome educ

In the output, the outcome variable (Y) appears at the top of the coefficient table and the predictor variables (X) are listed below it. Five things to look for:

1. The p-value of the model (Prob > F). It tests whether R-squared is different from 0. Usually we need a p-value lower than 0.05 to show a statistically significant relationship between Y and the Xs.
2. R-squared shows the amount of variance of Y explained by the Xs. In this case the model explains 4% of the variance in life satisfaction.
3. Adjusted R-squared shows the same as R-squared but adjusted for the number of cases and the number of variables.
4. Two-tail p-values test the hypothesis that each coefficient is different from 0. To reject this, the p-value has to be lower than 0.05 (you could also choose an alpha of 0.10). In this case, age is not statistically significant in explaining life satisfaction.
5. The t-values test the hypothesis that the coefficient is different from 0. To reject this, you need a t-value greater than 1.96 (for 95% confidence). You can get the t-value by dividing the coefficient by its standard error. The t-values also show the relative importance of a variable in the model.
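The coefficient-to-standard-error division can be checked by hand from Stata's stored results. A minimal sketch, assuming the regression from this section has just been run (variable names as in the sample dataset):

```stata
* t-value for age, computed directly from the stored coefficient and SE
regress lifesatisfaction age dutch female married nevermarried netincome educ
display _b[age]/_se[age]    // should match the t column for age in the output
```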
Regression: with dummies
Region is entered here as a dummy variable. The easy way to add dummy variables to a regression is to use xi and the prefix i. (the interpretation is the same as before). The first category is always the reference:
xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted
NOTE: By default xi omits the first value. To select a different value, before running the regression type:
char sted[omit] 4
xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted
This will select (4) as the reference category for the dummy variables.
NOTE: Another way to create dummy variables is to type:
tab sted, gen(urban)
This will create 5 new variables (or as many as there are categories in the variable), one for each region in this case.
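With dummies created by tab, you include all but one of them in the regression yourself; the omitted one becomes the reference category. A sketch, assuming sted has five categories and urban1 is chosen as the reference:

```stata
tab sted, gen(urban)
* urban1 is left out of the regression and serves as the reference category
regress lifesatisfaction age female netincome educ dutch married nevermarried urban2 urban3 urban4 urban5
```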
Regression: ANOVA table
When you run a regression, at the top of the output you get the ANOVA table:
xi: regress csat expense percent income high college i.region
A = Model Sum of Squares (MSS); the closer MSS is to TSS, the better the fit.
B = Residual Sum of Squares (RSS)
C = Total Sum of Squares (TSS), where TSS = MSS + RSS
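After regress, these sums of squares are kept in the stored results e(), so the identity R-squared = MSS/TSS can be verified directly. A sketch (any regression in this tutorial will do):

```stata
regress lifesatisfaction age female
display e(mss) + e(rss)            // the total sum of squares
display e(mss)/(e(mss) + e(rss))   // equals e(r2), the R-squared
```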
Regression: eststo/esttab
To show the models side-by-side you can use the commands eststo and esttab (from the user-written estout package; install it with ssc install estout):
regress lifesatisfaction age female
eststo model1
regress lifesatisfaction age female netincome educ dutch married nevermarried
eststo model2
xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted
eststo model3
esttab, r2 ar2
Regression: exploring relationships
scatter lifesatisfaction age
There might be a curvilinear relationship between lifesatisfaction and age. We might want to add a squared version of the variable, in this case age:
gen age2=age*age
scatter lifesatisfaction age2
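To test the curvilinear hypothesis formally, include both the variable and its square in the regression. A sketch, using the variable names above:

```stata
gen age2 = age*age
regress lifesatisfaction age age2
* a significant coefficient on age2 suggests a curvilinear relationship
```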
Regression: getting predicted values
How good the model is will depend on how well it predicts Y, the linearity of the model and the behavior of
the residuals.
To generate the predicted values of Y (usually called Yhat) given the model, use predict immediately after running the regression:

xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted
predict lifesathat
label variable lifesathat "predicted life satisfaction"
Regression: observed vs. predicted values
For a quick assessment of the model, run a scatter plot:
scatter lifesatisfaction lifesathat
We should expect a 45-degree pattern in the data: the y-axis is the observed data and the x-axis the predicted data (Yhat).
In this case the model does not seem to be doing a good job of predicting life satisfaction.
Regression: joint test (F-test)
To test whether two coefficients are jointly different from 0, use the command test:
xi: quietly regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted
Note: quietly suppresses the regression output.
To test the null hypothesis that the two coefficients are jointly equal to 0 (i.e., have no effect on lifesatisfaction), type:
test age female
The p-value is 0.0023, so we reject the null and conclude that the two variables jointly do have a significant effect on lifesatisfaction.
Some other possible tests are:
test netincome = 1
test netincome = educ
Regression: saving regression coefficients
Stata temporarily stores the coefficients as _b[varname], so if you type:
gen age_b = _b[age]
gen constant_b = _b[_cons]
You can also save the standard errors of the variables with _se[varname]:
gen age_se = _se[age]
gen constant_se = _se[_cons]
Regression: saving regression coefficients/getting predicted values
Type help return for more details
Regression: interaction between dummies
Interaction terms are needed whenever there is reason to believe that the effect of one independent variable depends on the value of another independent variable. We will explore here the interaction between two dummy (binary) variables. In the example below it could be the case that the effect of type of dwelling on lifesatisfaction depends on the gender of the respondent.
Dependent variable (Y): lifesatisfaction
Independent variables (X):
Binary: selfowneddwelling, which is 1 if the type of dwelling (woning) is self-owned
Interaction term, in Stata: gen selfownd_f=female*selfowneddwelling
xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted selfowneddwelling selfownd_f
The effect of female on lifesatisfaction is 0.8, but given the interaction term (and assuming all coefficients are significant), the net effect is 0.8+0.4*selfowneddwelling. If selfowneddwelling is 0, the effect is 0.8 (the female coefficient); if selfowneddwelling is 1, the effect is 0.8+0.4=1.2.
In this case, the effect of being female on lifesatisfaction is more positive when women own their houses.
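The net effect of female when selfowneddwelling is 1 can be computed, with a standard error and confidence interval, using lincom immediately after the regression. A sketch, using the variable names above:

```stata
* net effect of female for respondents who own their dwelling
lincom female + selfownd_f    // 0.8 + 0.4 = 1.2 in the example above
```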
Regression: interaction between a dummy and a continuous variable
Let's explore a similar interaction, but now keeping the income variable continuous and female binary. The question remains the same.
Dependent variable (Y): lifesatisfaction
Independent variables (X):
Continuous: netincome
Binary: female
Interaction term, in Stata: gen income_f=female*netincome
xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted income_f
The effect of income on lifesatisfaction is lower for females:
If female=0, the effect of income is 0.06.
If female=1, the effect of income is 0.06-0.03=0.03.
Increasing the income category by one unit increases life satisfaction by 0.06 units for males, but has a smaller effect for females.
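The slope of income for females can likewise be obtained, with a standard error, using lincom after the regression. A sketch, using the variable names above:

```stata
* effect of income when female==1
lincom netincome + income_f
```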
Regression: interaction between two continuous variables
Let's now keep both variables continuous. The question remains the same.
Dependent variable (Y): lifesatisfaction
Independent variables (X):
Continuous: netincome
Continuous: age
Interaction term, in Stata: gen income_age=age*netincome
xi: regress lifesatisfaction age female netincome educ dutch married nevermarried i.sted income_age
The effect of the interaction term is very small. The effect of a rise in income category is 0.027 + 0.0003*age:
If age = 50, the slope of income is 0.027 + 0.0003*50 = 0.042.
If age = 70, the slope of income is 0.027 + 0.0003*70 = 0.048, about 0.05.
In the continuous case the interaction effect is very small (and not significant).
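The slope of income at a given age can be computed with lincom after the regression. A sketch, using the variable names above:

```stata
* slope of income at age 50
lincom netincome + 50*income_age
```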