(W4911Y09Paper1V2.doc)

Political Science W4911 (Spring 2009)

Professor Robert Y. Shapiro
Analysis of Political Data
Sample Assignment # 1
Prepared by TA: Narayani Lasala, January 29, 2009

Assignment 1. Using any data you wish, examine and write up a four or more variable causal
model by estimating multiple regression equations. Present the structural equations and the
path diagram. Interpret the regression coefficients (focus on the usual unstandardized
coefficients; it is not required to interpret the standardized coefficients nor to decompose any
zero-order relationships into direct, indirect, and noncausal [spurious] effects). Test for
first-order interactions (and, if you think necessary, any theoretically compelling higher order
interactions). Examine possible multicollinearity (in the correlations matrix) and provide some
analysis of residuals (i.e., heteroskedasticity, as shown below), and, as needed, outliers if you
have a small data set.

Step 1: Theory, Path Diagram, and Recoding

1) Develop a simple theoretical relationship between four (or more) variables and
present it in a flow chart.

1. Variables and Path Diagram

For this assignment, we examine the relationship between four (or more) variables, i.e.
three (or more) independent variables and one dependent variable. In this example, we
estimate the following relationships.

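[Path diagram: Race (X1, White), Education (X2), Income (X3), and Party ID (X4,
Republican) lead to Y, the attitude regarding the responsibility of government for poverty
alleviation (Not its responsibility). Paths 1a-1d run from Race, 2a-2c from Education,
3a-3b from Income, and 4a from Party ID to Y.]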

To do this we selected the following variables from GSS 2006: “race” (white, black,
other), “educ” (highest year of school completed), “income06” (total family income),
“partyid” (political party affiliation) and “helppoor” (self placement on a five point scale
that goes from “I strongly agree the government should improve living standards” to “I
strongly agree that people should take care of themselves”).1

2) Obtain the frequency distribution for the original four variables (Check the
missing values)
. tab race, miss
race of |
respondent | Freq. Percent Cum.
------------+-----------------------------------
white | 3,284 72.82 72.82
black | 634 14.06 86.87
other | 592 13.13 100.00
------------+-----------------------------------
Total | 4,510 100.00

. tab educ, miss

highest |
year of |
school |
completed | Freq. Percent Cum.
------------+-----------------------------------
0 | 22 0.49 0.49
1 | 4 0.09 0.58
2 | 28 0.62 1.20
3 | 13 0.29 1.49
4 | 11 0.24 1.73
5 | 23 0.51 2.24
6 | 69 1.53 3.77
7 | 32 0.71 4.48
8 | 85 1.88 6.36
9 | 127 2.82 9.18
10 | 152 3.37 12.55
11 | 215 4.77 17.32
12 | 1,204 26.70 44.01
13 | 422 9.36 53.37
14 | 628 13.92 67.29
15 | 212 4.70 72.00
16 | 687 15.23 87.23
17 | 167 3.70 90.93
18 | 208 4.61 95.54
19 | 78 1.73 97.27
20 | 112 2.48 99.76
dk | 2 0.04 99.80
. | 9 0.20 100.00
------------+-----------------------------------
Total | 4,510 100.00

1. I'd like to talk with you about issues some people tell us are important. Please look at CARD BC. Some people think
that the government in Washington should do everything possible to improve the standard of living of all poor
Americans; they are at Point 1 on this card. Other people think it is not the government's responsibility, and that each
person should take care of himself; they are at Point 5.

. tab income06, miss

total family |
income | Freq. Percent Cum.
-------------------+-----------------------------------
under \$1 000 | 43 0.95 0.95
\$1 000 to 2 999 | 38 0.84 1.80
\$3 000 to 3 999 | 29 0.64 2.44
\$4 000 to 4 999 | 27 0.60 3.04
\$5 000 to 5 999 | 40 0.89 3.92
\$6 000 to 6 999 | 45 1.00 4.92
\$7 000 to 7 999 | 48 1.06 5.99
\$8 000 to 9 999 | 83 1.84 7.83
\$10000 to 12499 | 142 3.15 10.98
\$12500 to 14999 | 145 3.22 14.19
\$15000 to 17499 | 126 2.79 16.98
\$17500 to 19999 | 102 2.26 19.25
\$20000 to 22499 | 157 3.48 22.73
\$22500 to 24999 | 125 2.77 25.50
\$25000 to 29999 | 212 4.70 30.20
\$30000 to 34999 | 231 5.12 35.32
\$35000 to 39999 | 217 4.81 40.13
\$40000 to 49999 | 394 8.74 48.87
\$50000 to 59999 | 332 7.36 56.23
\$60000 to 74999 | 360 7.98 64.21
\$75000 to \$89999 | 284 6.30 70.51
\$90000 to \$109999 | 229 5.08 75.59
\$110000 to \$129999 | 162 3.59 79.18
\$130000 to \$149999 | 89 1.97 81.15
\$150000 or over | 213 4.72 85.88
refused | 442 9.80 95.68
dk | 195 4.32 100.00
-------------------+-----------------------------------
Total | 4,510 100.00

. tab partyid, miss

political party |
affiliation | Freq. Percent Cum.
-------------------+-----------------------------------
strong democrat | 700 15.52 15.52
not str democrat | 736 16.32 31.84
ind,near dem | 527 11.69 43.53
independent | 997 22.11 65.63
ind,near rep | 327 7.25 72.88
not str republican | 637 14.12 87.01
strong republican | 495 10.98 97.98
other party | 65 1.44 99.42
. | 26 0.58 100.00
-------------------+-----------------------------------
Total | 4,510 100.00

. tab helppoor, miss

should govt |
improve standard |
of living? | Freq. Percent Cum.
-------------------+-----------------------------------
govt action | 369 8.18 8.18
2 | 204 4.52 12.71
agree with both | 915 20.29 32.99
4 | 261 5.79 38.78
people help selves | 209 4.63 43.41
dk | 30 0.67 44.08
. | 2,522 55.92 100.00
-------------------+-----------------------------------
Total | 4,510 100.00

3) Recode the variables if necessary and obtain the frequency distribution of the
recoded variables.

You are advised not to collapse categories of any variable unless you have a compelling
reason to do so. Recode so that the values start from “0” while retaining the original
number of categories. This makes it easier to interpret the regression results, that is, to
interpret the constant when the variables take their lowest value, 0.

Race

We recode this variable by reversing the order of the categories so that the larger value (1) is
assigned to whites, because we believe being white will have a positive impact on the
dependent variable. We also combine “other” and “black” into a non-white category, which
is coded “0”.

. recode race (1=1) (2/3=0), gen (RACE)

(1226 differences between race and RACE)
. tab RACE

RECODE of |
race (race |
of |
respondent) | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,226 27.18 27.18
1 | 3,284 72.82 100.00
------------+-----------------------------------
Total | 4,510 100.00

Education (educ→ EDUC)

For this variable, we retain the original values because they already begin with “0”, and we
treat the values 22 through 98 (which cover the “dk” responses) as missing.

. recode educ (22/98=.), gen(EDUC)

(2 differences between educ and EDUC)

. tab EDUC
RECODE of |
educ |
(highest |
year of |
school |
completed) | Freq. Percent Cum.
------------+-----------------------------------
0 | 22 0.49 0.49
1 | 4 0.09 0.58
2 | 28 0.62 1.20
3 | 13 0.29 1.49
4 | 11 0.24 1.73
5 | 23 0.51 2.24
6 | 69 1.53 3.78
7 | 32 0.71 4.49
8 | 85 1.89 6.38
9 | 127 2.82 9.20
10 | 152 3.38 12.58
11 | 215 4.78 17.36
12 | 1,204 26.76 44.12
13 | 422 9.38 53.50
14 | 628 13.96 67.46
15 | 212 4.71 72.17
16 | 687 15.27 87.44
17 | 167 3.71 91.15
18 | 208 4.62 95.78
19 | 78 1.73 97.51
20 | 112 2.49 100.00
------------+-----------------------------------
Total | 4,499 100.00

Income (income06 → INCOM)

We recode this variable so that the values start from “0” while retaining the original
number of categories.

. recode income06 (1=0)(2=1)(3=2)(4=3)(5=4)(6=5)(7=6)(8=7)(9=8)(10=9)
(11=10)(12=11)(13=12)(14=13)(15=14)(16=15)(17=16)(18=17)(19=18)(20=19)
(21=20)(22=21)(23=22)(24=23)(25=24)(26/98=.), gen(INCOM)
(4510 differences between income06 and INCOM)

. tab (INCOM)

RECODE of |
income06 |
(total |
family |
income) | Freq. Percent Cum.
------------+-----------------------------------
0 | 43 1.11 1.11
1 | 38 0.98 2.09
2 | 29 0.75 2.84
3 | 27 0.70 3.54
4 | 40 1.03 4.57
5 | 45 1.16 5.73
6 | 48 1.24 6.97
7 | 83 2.14 9.11
8 | 142 3.67 12.78
9 | 145 3.74 16.52
10 | 126 3.25 19.78
11 | 102 2.63 22.41
12 | 157 4.05 26.47
13 | 125 3.23 29.69
14 | 212 5.47 35.17
15 | 231 5.96 41.13
16 | 217 5.60 46.73
17 | 394 10.17 56.91
18 | 332 8.57 65.48
19 | 360 9.30 74.77
20 | 284 7.33 82.11
21 | 229 5.91 88.02
22 | 162 4.18 92.20
23 | 89 2.30 94.50
24 | 213 5.50 100.00
------------+-----------------------------------
Total | 3,873 100.00
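
Because this recode simply shifts every valid code down by one, the same scale can also be built arithmetically. The lines below are a minimal sketch and not part of the original handout: INCOM_alt is a hypothetical variable name, and the sketch assumes that the valid income06 codes run from 1 to 25, as the recode command above implies.

```stata
* Sketch: build the 0-24 income scale by subtraction instead of listing every code.
* INCOM_alt is a hypothetical name; codes above 25 (refused, dk) are left missing.
gen INCOM_alt = income06 - 1 if income06 >= 1 & income06 <= 25

* assert stops with an error if the two versions differ for any case.
assert INCOM_alt == INCOM
```

If the assert line runs silently, the two approaches agree; the long recode remains the more transparent choice when the codes are not consecutive.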

Party identification (partyid→ REPUBLICAN)

For this variable, we retain the original values and treat 7 (other party) as missing.

. recode partyid (7/8=.), gen(REPUBLICAN)

(65 differences between partyid and REPUBLICAN)
. tab REPUBLICAN

RECODE of |
partyid |
(political |
party |
affiliation |
) | Freq. Percent Cum.
------------+-----------------------------------
0 | 700 15.84 15.84
1 | 736 16.66 32.50
2 | 527 11.93 44.42
3 | 997 22.56 66.98
4 | 327 7.40 74.38
5 | 637 14.42 88.80
6 | 495 11.20 100.00
------------+-----------------------------------
Total | 4,419 100.00

Views regarding Government’s role in reducing inequality (helppoor → GOVRES)

We recode so that the high value (5, recoded into 4) is assigned to those who think people
should help themselves, and those who agree that it is government’s responsibility are
coded “0”. Also, “dk” is recoded to “.”.

. recode helppoor (1=0)(2=1)(3=2)(4=3)(5=4)(8=.), gen(GOVRES)

(1988 differences between helppoor and GOVRES)

. tab GOVRES

RECODE of |
helppoor |
(should |
govt |
improve |
standard of |
living?) | Freq. Percent Cum.
------------+-----------------------------------
0 | 369 18.85 18.85
1 | 204 10.42 29.26
2 | 915 46.73 76.00
3 | 261 13.33 89.33
4 | 209 10.67 100.00
------------+-----------------------------------
Total | 1,958 100.00

4) Filter observations with missing value on any variables in the model, if you are
estimating a set of equations and want all equations based on the same cases.

When you estimate a regression, Stata automatically drops observations with missing values
in any of the variables included in the model. But when you estimate more than one
regression, different observations may be dropped because different variables are
included in the different models. To make sure that exactly the same sample is used in all
regressions, use either of the following two methods.

Method #1 Drop the cases with missing value in any of the five variables

Normally, dropping observations with missing values in any of the newly recoded
variables is not recommended because once you drop them you cannot recover them. But,
for the sake of simplicity in this exercise, you can choose this method.

. drop if RACE==.| EDUC==.| INCOM==.| REPUBLICAN==.|GOVRES==.

(2849 observations deleted)

Method #2 Create an indicator for not missing value (Recommended)

This second method is highly recommended for real data analysis. The first command,
“mark newvariable”, creates a new variable named newvariable that equals 1 for all
cases. The second command, “markout newvariable variablelist”, changes the value of
newvariable from 1 to 0 for the cases in which the value of any of the variables in
variablelist (in this case RACE, EDUC, INCOM, REPUBLICAN and GOVRES) is
missing. Here we name the newvariable “nomiss”.

. mark nomiss
. markout nomiss RACE EDUC INCOM REPUBLICAN GOVRES

Then include “if nomiss==1” at the end of the regression commands you estimate.
Observations will be used in the estimation only if they have no missing values in any of
the variables that are used in this analysis. We will use method #2 in this handout.
Examples are shown below.
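
Before running any regression it can help to confirm how many cases the marker keeps. The following is a minimal sketch, assuming that nomiss was created with the mark and markout commands shown above.

```stata
* Sketch: check the marker variable created by mark/markout.
* nomiss = 1 means complete on all five variables; 0 means at least one is missing.
tab nomiss

* This count should match the "Number of obs" reported by every regression
* that is run with "if nomiss==1".
count if nomiss == 1
```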

5) Obtain the correlations

. corr RACE EDUC INCOM REPUBLICAN GOVRES if nomiss==1
(obs=1661)

             |     RACE     EDUC    INCOM REPUBL~N   GOVRES
-------------+---------------------------------------------
        RACE |   1.0000
        EDUC |   0.1997   1.0000
       INCOM |   0.2259   0.3931   1.0000
  REPUBLICAN |   0.2566  -0.0004   0.1191   1.0000
      GOVRES |   0.2043   0.1178   0.1889   0.2716   1.0000
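
The correlation matrix only shows pairwise overlap between the independent variables. As a supplementary check that the original handout does not use, Stata's “estat vif” command reports variance inflation factors after a regression. This is a sketch under the assumption that the final equation (GOVRES on all four independent variables) is the one of interest.

```stata
* Sketch: variance inflation factors for the final equation
* (a supplementary check, not part of the original handout).
reg GOVRES RACE EDUC INCOM REPUBLICAN if nomiss==1
estat vif
```

Large values flag an independent variable that is close to a linear combination of the others; the cutoff to use is a judgment call.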

6) Estimate the regressions without interactions

For this model we run four regressions. Write up the regression equation for each of them
using the estimated coefficients and t-values. Use the “beta” option to obtain the
standardized coefficients which you will use to write up a path diagram. (Note that we
include “if nomiss==1” at the end of each command.)

1) Regress X2 on X1
. reg EDUC RACE if nomiss==1, beta

Source | SS df MS Number of obs = 1661

-------------+------------------------------ F( 1, 1659) = 68.90
Model | 674.593006 1 674.593006 Prob > F = 0.0000
Residual | 16243.1216 1659 9.79091117 R-squared = 0.0399
-------------+------------------------------ Adj R-squared = 0.0393
Total | 16917.7146 1660 10.1913944 Root MSE = 3.129

------------------------------------------------------------------------------
EDUC | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | 1.44842 .1744958 8.30 0.000 .1996871
_cons | 12.40138 .149854 82.76 0.000 .
------------------------------------------------------------------------------

EDUC = 12.40138 + 1.44842 *RACE
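
To read this equation, note that the constant is the predicted value of EDUC when RACE = 0, and the RACE coefficient is the difference for RACE = 1. A minimal sketch of that calculation, using the “lincom” command right after the regression:

```stata
* Sketch: predicted years of schooling implied by the fitted equation.
reg EDUC RACE if nomiss==1

* Non-white respondents (RACE = 0): the constant alone.
lincom _cons

* White respondents (RACE = 1): constant plus the RACE coefficient.
lincom _cons + RACE
```

With the coefficients shown above, the two predictions are about 12.40 and 13.85 years of schooling.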

2) Regress X3 on X1 and X2
. reg INCOM RACE EDUC if nomiss==1, beta

Source | SS df MS Number of obs = 1661

-------------+------------------------------ F( 2, 1658) = 178.46
Model | 9339.41411 2 4669.70705 Prob > F = 0.0000
Residual | 43385.5371 1658 26.1673927 R-squared = 0.1771
-------------+------------------------------ Adj R-squared = 0.1761
Total | 52724.9512 1660 31.7620188 Root MSE = 5.1154

------------------------------------------------------------------------------
INCOM | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | 1.966103 .2911319 6.75 0.000 .1535411
EDUC | .6397802 .0401371 15.94 0.000 .3624045
_cons | 5.483277 .5547762 9.88 0.000 .
------------------------------------------------------------------------------

INCOM = 5.483277 + 1.966103*RACE + .6397802*EDUC

3) Regress X4 on X1, X2 and X3
. reg REPUBLICAN RACE EDUC INCOM if nomiss==1, beta

Source | SS df MS Number of obs = 1661

-------------+------------------------------ F( 3, 1657) = 45.66
Model | 492.36588 3 164.12196 Prob > F = 0.0000
Residual | 5956.35537 1657 3.59466226 R-squared = 0.0764
-------------+------------------------------ Adj R-squared = 0.0747
Total | 6448.72125 1660 3.88477184 Root MSE = 1.896

------------------------------------------------------------------------------
REPUBLICAN | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | 1.130313 .1093783 10.33 0.000 .2523994
EDUC | -.0549386 .0159755 -3.44 0.001 -.0889839
INCOM | .0339408 .0091024 3.73 0.000 .0970496
_cons | 2.219035 .2115915 10.49 0.000 .
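------------------------------------------------------------------------------

REPUBLICAN = 2.219035 + 1.130313*RACE - .0549386*EDUC + .0339408*INCOM

4) Regress Y on X1, X2, X3 and X4
. reg GOVRES RACE EDUC INCOM REPUBLICAN if nomiss==1, beta
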
Source | SS df MS Number of obs = 1661

-------------+------------------------------ F( 4, 1656) = 52.59
Model | 257.513083 4 64.3782709 Prob > F = 0.0000
Residual | 2027.19011 1656 1.22414862 R-squared = 0.1127
-------------+------------------------------ Adj R-squared = 0.1106
Total | 2284.70319 1660 1.37632722 Root MSE = 1.1064

------------------------------------------------------------------------------
GOVRES | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | .2903471 .0658539 4.41 0.000 .1089254
EDUC | .0183701 .0093559 1.96 0.050 .0499881
INCOM | .0244208 .0053341 4.58 0.000 .1173151
REPUBLICAN | .1367426 .014336 9.54 0.000 .2297342
_cons | .6329858 .1275091 4.96 0.000 .
------------------------------------------------------------------------------

GOVRES = .6329858 + .2903471*RACE + .0183701*EDUC + .0244208*INCOM + .1367426*REPUBLICAN

Interpret the regression coefficients. Focus on the usual unstandardized coefficients. (It is
not required to interpret the standardized coefficients nor to decompose any zero-order
relationships into direct, indirect, and noncausal [spurious] effects; they are only included
for reference.)

Review all the equations and discuss their implications.

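EDUC = 12.40138 + 1.44842*RACE

INCOM = 5.483277 + 1.966103*RACE + .6397802*EDUC

REPUBLICAN = 2.219035 + 1.130313*RACE - .0549386*EDUC + .0339408*INCOM

GOVRES = .6329858 + .2903471*RACE + .0183701*EDUC + .0244208*INCOM + .1367426*REPUBLICAN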

Complete the full path diagram with standardized coefficients (“betas”).


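[Path diagram with standardized coefficients (betas):
Race (X1) -> Education (X2): .1996871
Race (X1) -> Income (X3): .1535411
Race (X1) -> Party ID (X4): .2523994
Race (X1) -> Government’s Responsibility (Y): .1089254
Education (X2) -> Income (X3): .3624045
Education (X2) -> Party ID (X4): -.0889839
Education (X2) -> Government’s Responsibility (Y): .0499881
Income (X3) -> Party ID (X4): .0970496
Income (X3) -> Government’s Responsibility (Y): .1173151
Party ID (X4) -> Government’s Responsibility (Y): .2297342]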

Decomposition of effects (not required)

We can decompose the total effect of each of the independent variables on the dependent
variable. Calculate the decomposition tables for each of X1, X2, X3, X4 and Y,
according to the following rules. Do not round when doing the calculations. (You may
round when presenting the final result.)

Total effect = Pearson r coefficient

Direct effect = Standardized beta coefficient from the relevant regression equation

Indirect effect = Sum, over all possible indirect paths to the dependent variable, of the
product of the beta coefficients of all arrows along each path

Spurious effect = Total effect – Direct effect – Sum of all indirect effects
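
As an illustration of these rules, the relationship between Education and Party ID can be decomposed with a few lines of arithmetic. This is a minimal sketch: the local macro names are chosen here for illustration, and the numbers are copied from the correlation matrix and the “Beta” columns of the regressions above. Run the lines together from a do-file, since local macros do not persist between separately executed lines.

```stata
* Sketch: decompose the Education -> Party ID relationship.
* Total effect: Pearson r between EDUC and REPUBLICAN.
local total  = -0.0004
* Direct effect: beta of EDUC in the REPUBLICAN equation.
local direct = -0.0889839
* Only one indirect path exists: Education -> Income -> Party ID.
local indirect = 0.3624045 * 0.0970496
* Spurious effect: whatever is left of the total effect.
local spurious = `total' - `direct' - `indirect'

display "indirect effect = " %8.6f `indirect'
display "spurious effect = " %8.6f `spurious'
```

The results should agree with the Educ row of the REPUBLICAN decomposition table up to rounding, because the table uses betas rounded to fewer digits.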

Decomposition of Effects for x2 (EDUC)

Variable    Total effect   Direct effect   Indirect effect   Spurious effect
Race        0.1997         .1997           0                 0

Decomposition of Effects for x3 (INCOM)

Variable    Total effect   Direct effect   Indirect effect   Spurious effect
Race        0.2259         .1535411        0.0722914         0
Educ        0.3931         .3624045        0                 0.030696

Calculation of indirect effects:
Race: 0.1997*.3624

Decomposition of Effects for x4 (REPUBLICAN)

Variable    Total effect   Direct effect   Indirect effect   Spurious effect
Race        0.2566         .2524           0.0042            0
Educ        -0.0004        -.08899         0.035167          0.053423

Calculation of indirect effects:
Race: 0.1997*(-.0889) + (0.1997*.3624*.09704) + (.15354*.09704)
Educ: .3624*.09704

Decomposition of Effects for y (GOVRES)

Variable    Total effect   Direct effect   Indirect effect   Spurious effect
Race        0.2043         .1089254        0.09521           0
Educ        0.1178         .04999          0.0301595         0.037652
Income      0.1889         0.117315        0.02229           0.04929
Party ID    0.2716         0.2297          0                 0.0419

Calculation of indirect effects:
Race: 0.2523*0.2297 + 0.1997*0.04998 + 0.1997*(-0.0889)*0.2297
      + 0.1997*.3624*0.09704*0.2297 + 0.1997*.3624*(0.1173)
      + 0.15354*0.09704*0.2297 + (0.15354)*(0.1173)
Educ: (-.0889)*(0.2297) + 0.3624*0.09704*(0.2297) + 0.3624*(0.1173)
Income: 0.09704*0.2297

NEXT:

A) Create interaction terms between each pair of independent variables (all
first-order interactions):

1) X1 and X2
2) X1 and X3
3) X1 and X4
4) X2 and X3
5) X2 and X4
6) X3 and X4

1) . gen raceduc = RACE*EDUC
2) . gen racINCOM = RACE*INCOM
3) . gen racrepub = RACE*REPUBLICAN
4) . gen eduINCOM = EDUC*INCOM
5) . gen edurepub = EDUC*REPUBLICAN
6) . gen incomrepub = INCOM*REPUBLICAN

B) Estimate the regression with all six interaction terms

. reg GOVRES RACE EDUC INCOM REPUBLICAN raceduc racINCOM racrepub edurepub
eduINCOM incomrepub, beta

Source | SS df MS Number of obs = 1661

-------------+------------------------------ F( 10, 1650) = 24.10
Model | 291.20039 10 29.120039 Prob > F = 0.0000
Residual | 1993.5028 1650 1.20818352 R-squared = 0.1275
-------------+------------------------------ Adj R-squared = 0.1222
Total | 2284.70319 1660 1.37632722 Root MSE = 1.0992

------------------------------------------------------------------------------
GOVRES | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | .9902724 .2931196 3.38 0.001 .3715064
EDUC | .0239648 .0287538 0.83 0.405 .0652123
INCOM | .0287287 .0229921 1.25 0.212 .1380098
REPUBLICAN | -.1350117 .0698544 -1.93 0.053 -.2268262
raceduc | -.0352751 .0202833 -1.74 0.082 -.1978869
racINCOM | -.0131167 .011881 -1.10 0.270 -.094953
racrepub | -.032165 .0367194 -0.88 0.381 -.0600898
edurepub | .013037 .0051297 2.54 0.011 .3244292
eduINCOM | -.0011666 .0015857 -0.74 0.462 -.1041171
incomrepub | .0075223 .0027743 2.71 0.007 .2464588
_cons | .7505942 .368533 2.04 0.042 .
------------------------------------------------------------------------------

GOVRES = .7505942 + .9902724*RACE + .0239648*EDUC + .0287287*INCOM - .1350117*REPUBLICAN
- .0352751*raceduc - .0131167*racINCOM - .032165*racrepub + .013037*edurepub - .0011666*eduINCOM
+ .0075223*incomrepub

C) Estimate the regression without the insignificant interaction terms, i.e., with
the significant interactions.

In this section, run the regression retaining only those interaction terms that turned out to be
statistically significant in the previous section. For this example, we will see if omitting
raceduc, racINCOM, racrepub and eduINCOM makes the fit of the regression
significantly different.
. reg GOVRES RACE EDUC INCOM REPUBLICAN edurepub incomrepub, beta

Source | SS df MS Number of obs = 1661

-------------+------------------------------ F( 6, 1654) = 38.55
Model | 280.290217 6 46.7150361 Prob > F = 0.0000
Residual | 2004.41297 1654 1.2118579 R-squared = 0.1227
-------------+------------------------------ Adj R-squared = 0.1195
Total | 2284.70319 1660 1.37632722 Root MSE = 1.1008
------------------------------------------------------------------------------
GOVRES | Coef. Std. Err. t P>|t| Beta
-------------+----------------------------------------------------------------
RACE | .2845684 .0655477 4.34 0.000 .1067575
EDUC | -.0122706 .0164101 -0.75 0.455 -.0333903
INCOM | .0059682 .0091095 0.66 0.512 .0286707
REPUBLICAN | -.1310843 .0678247 -1.93 0.053 -.220228
edurepub | .0115822 .0050109 2.31 0.021 .2882262
incomrepub | .0067193 .0026649 2.52 0.012 .2201474
_cons | 1.348882 .2181173 6.18 0.000 .
------------------------------------------------------------------------------

GOVRES = 1.348882 + .2845684*RACE - .0122706*EDUC + .0059682*INCOM - .1310843*REPUBLICAN
+ .0115822*edurepub + .0067193*incomrepub
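
With interaction terms in the equation, the coefficient on REPUBLICAN is the slope only when EDUC and INCOM are both 0. At other values the slope is the REPUBLICAN coefficient plus the edurepub coefficient times EDUC plus the incomrepub coefficient times INCOM. The sketch below evaluates that slope with “lincom”; the values EDUC = 12 and INCOM = 17 (the $40000 to 49999 category) are picked here purely for illustration.

```stata
* Sketch: slope of REPUBLICAN at chosen values of EDUC and INCOM.
reg GOVRES RACE EDUC INCOM REPUBLICAN edurepub incomrepub

* Slope of REPUBLICAN = b[REPUBLICAN] + b[edurepub]*EDUC + b[incomrepub]*INCOM
* Illustrative values: EDUC = 12 (12 years of school), INCOM = 17 ($40000 to 49999).
lincom REPUBLICAN + 12*edurepub + 17*incomrepub
```

With the coefficients reported above this works out to about 0.12, so the negative REPUBLICAN coefficient on its own should not be read as the typical effect of party identification.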
8) Chow test

The Chow test tests whether a regression equation with interaction terms explains a
significantly greater amount of variance than a regression equation without interaction
terms. The null hypothesis is that the difference between the explained variance of the two
equations is zero in the population.

Rejecting the null hypothesis tells you that the interaction terms bring additional
explanatory power in a statistically significant way. The Chow test formula produces an
F-statistic to be compared to the critical values in the F-statistic table. If the F-test value
of the Chow test exceeds the critical value found in the table, you can reject the null
hypothesis.
We will compare the following two regression models using the Chow test.
Model ➀: The Regression Model with no interaction terms.

GOVRES = .6329858 +.2903471 *RACE + .0183701* EDUC +.0244208 *INCOM + .1367426* REPUBLICAN

Model ➁: Model with six interaction terms.

GOVRES = .7505942 + .9902724*RACE + .0239648*EDUC + .0287287*INCOM - .1350117*REPUBLICAN
- .0352751*raceduc - .0131167*racINCOM - .032165*racrepub + .013037*edurepub - .0011666*eduINCOM
+ .0075223*incomrepub

Manual calculation. (Everyone must do this!)

The following is the formula for the Chow Test:

F(M, N-K-M) = [(R2K+M - R2K) / M] / [(1 - R2K+M) / (N - K - M)]

where
M = Number of interaction variables included in the equation
K = Number of original independent variables +1 (for the constant)
N = Number of observations
R2K = R2 for the original regression equation (with no interaction terms)
R2K+M = R2 for the regression equation with interaction terms

F(6, 1650) = [(0.1275 - 0.1127) / 6] / [(1 - 0.1275) / (1661 - 5 - 6)] = 4.66

The critical value of the F statistic with df1=6 (degrees of freedom of the numerator, M),
df2=1650 (degrees of freedom of the denominator, N-K-M), and α = 0.05 is 2.09. Since
4.66>2.09, we reject the null hypothesis that the equation with six interaction terms and
the equation with no interaction term explain just the same amount of variance.
(Remember: rejecting the null hypothesis tells you that the interaction terms bring
additional explanatory power in a statistically significant way.)
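
The hand calculation can be reproduced in Stata with a few lines of arithmetic. This is a minimal sketch, not the handout's required procedure: the local macro names are chosen here for illustration, and the R-squared values are the rounded ones from the two regression outputs above. Run the lines together from a do-file.

```stata
* Sketch: Chow-type F statistic from the two R-squared values.
* R-squared of the model with no interaction terms
local r2_k  = 0.1127
* R-squared of the model with six interaction terms
local r2_km = 0.1275
* M = number of interaction terms; K = four independent variables + constant
local M = 6
local K = 5
local N = 1661
local df2 = `N' - `K' - `M'

local F = ((`r2_km' - `r2_k')/`M') / ((1 - `r2_km')/`df2')

display "F(`M', `df2') = " %5.2f `F'
* invFtail() gives the critical value; Ftail() gives the p-value.
display "critical value at alpha 0.05 = " %5.2f invFtail(`M', `df2', 0.05)
display "p-value = " %6.4f Ftail(`M', `df2', `F')
```

Because the inputs are rounded to four decimals, the result can differ slightly from the F statistic that Stata's “test” command reports from the unrounded sums of squares.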

Crosscheck by using Stata (Optional)

The equivalent of the Chow test can be done with Stata by typing the “test” command
right after executing the regression command. See the following examples. Compare this
F-statistic with the hand-calculated one in the previous section. (Small differences may
result from rounding.)

Test Model ➁ against Model ➀

Run the regression with the larger model (➁ in this case, with six interaction terms)
and then test the terms that are NOT in the smaller model (➀, the model with no
interaction terms).
. reg GOVRES RACE EDUC INCOM REPUBLICAN raceduc racINCOM racrepub edurepub
eduINCOM incomrepub, beta

(Output omitted)
. test raceduc racINCOM racrepub edurepub eduINCOM incomrepub
( 1) raceduc = 0
( 2) racINCOM = 0
( 3) racrepub = 0
( 4) edurepub = 0
( 5) eduINCOM = 0
( 6) incomrepub = 0
F( 6, 1650) = 4.65
Prob > F = 0.0001

Estimated F-value (6, 1650) is 4.65. Since 4.65>2.09 critical value, we reject the null
hypothesis that the equation with six interaction terms and the equation with no
interaction term explain just the same amount of variance.

***YOU CAN REPEAT VARIATIONS OF THIS TEST FOR SUBSET OF THE
INTERACTION TERMS TO HELP DETERMINE WHICH ONES TO KEEP***
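
One such variation, sketched here under the assumption that the four individually insignificant interaction terms are the candidates for removal, tests whether those four terms are jointly zero after estimating the full model:

```stata
* Sketch: are the four individually insignificant interactions jointly zero?
reg GOVRES RACE EDUC INCOM REPUBLICAN raceduc racINCOM racrepub edurepub eduINCOM incomrepub
test raceduc racINCOM racrepub eduINCOM
```

If this F test does not reject the null hypothesis, dropping the four terms does not significantly worsen the fit, which supports the smaller model with only edurepub and incomrepub.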

a) Immediately after estimating the regression equation you wish to examine,
obtain the predicted values, residuals, and (less important) standardized
residuals.

The “predict” command applies to the regression estimated right before typing it into the
command window. In this section, we will give new names for the predicted values and
estimated residuals after executing the “predict” command. When you type “predict
newvariable” without adding any option, you will obtain the predicted values of your
dependent variable. In this exercise we have named this newvariable, yhat.
So, right after the estimated regression equation that you are focusing on:
. predict yhat
(option xb assumed; fitted values)

This command can also be used to obtain residuals and standardized residuals, as
shown below, by adding the options “, resid” and “, rstandard” respectively to
“predict newvariable”. We have named the variable which contains the residuals
“e” and the variable containing the standardized residuals “std_e”.

. predict e, resid
. predict std_e, rstandard

b) (Optional) Histogram and normal probability plot of the standardized
residuals.

The command for obtaining the histogram is “hist” followed by the variable name. The
command “qnorm” followed by the variable name will give you a normal probability
plot of this variable. The option “saving (name for graph, replace)” saves the
generated images to the working directory. For example, we named the file containing
the histogram of the standard residuals, “Histogram_std_e”

Histogram
To save and display the histogram
. hist std_e, saving(Histogram_std_e, replace)
(bin=32, start=-2.2421422, width=.15336815)
(file Histogram_std_e.gph saved)

[Figure: histogram of the standardized residuals. x axis: Standardized residuals; y axis: Density.]

Normal probability plot

. qnorm std_e, saving(NPP_std_e, replace)

[Figure: normal probability plot of the standardized residuals. x axis: Inverse Normal; y axis: Standardized residuals.]

c) Plot the (unstandardized) residuals with each of the independent variables.

Useful for examining heteroskedasticity and other possible abnormalities.

The unstandardized residuals should be on the y axis and the independent variables
should be on the x axis.

RACE
. graph twoway scatter e RACE, saving(e_RACE, replace)
(file e_RACE.gph saved)

[Figure: scatterplot of the residuals against RACE. x axis: RECODE of race (race of respondent); y axis: Residuals.]

EDUC
graph twoway scatter e EDUC, saving(e_EDUC, replace)
(file e_EDUC.gph saved)
[Figure: scatterplot of the residuals against EDUC. x axis: RECODE of educ (highest year of school completed); y axis: Residuals.]

INCOM
graph twoway scatter e INCOM, saving(e_INCOM, replace)
[Figure: scatterplot of the residuals against INCOM. x axis: RECODE of income06 (total family income); y axis: Residuals.]

REPUBLICAN
graph twoway scatter e REPUBLICAN, saving(e_REPUBLICAN, replace)
[Figure: scatterplot of the residuals against REPUBLICAN. x axis: RECODE of partyid (political party affiliation); y axis: Residuals.]

d) Create new variables: squared (unstandardized) residual and the absolute
value of the (unstandardized) residual. Focus on the squared residual. Why?

Squared residuals will be named “squared_e” and the variable containing the
absolute values of the residuals will be “absolute_e”. The function
“abs(variable)” gives you the absolute value of variable.

. gen squared_e=e*e
. gen absolute_e= abs(e)

e) Obtain correlations of squared residuals and absolute value of
(unstandardized) residuals to all other variables in the model. What do we
find?

. corr GOVRES RACE EDUC INCOM REPUBLICAN squared_e absolute_e

(obs=1661)
| GOVRES RACE EDUC INCOM REPUBL~N square~e absolu~e
-------------+---------------------------------------------------------------
GOVRES | 1.0000
RACE | 0.2043 1.0000
EDUC | 0.1178 0.1997 1.0000
INCOM | 0.1889 0.2259 0.3931 1.0000
REPUBLICAN | 0.2716 0.2566 -0.0004 0.1191 1.0000
squared_e | 0.0354 -0.0420 -0.1545 -0.1468 -0.0440 1.0000
absolute_e | -0.0434 -0.0858 -0.1544 -0.1632 -0.0591 0.9565 1.0000
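
The negative correlations of squared_e with EDUC and INCOM hint that the residual variance shrinks as education and income rise. A formal counterpart to this inspection, which the original handout does not use, is the Breusch-Pagan / Cook-Weisberg test available through “estat hettest” after a regression. A minimal sketch:

```stata
* Sketch: formal test for heteroskedasticity
* (a supplementary check, not part of the original handout).
reg GOVRES RACE EDUC INCOM REPUBLICAN if nomiss==1

* The rhs option bases the test on the right-hand-side variables
* rather than on the fitted values.
estat hettest, rhs
```

A small p-value indicates that the residual variance depends on the independent variables, which is consistent with what the correlations above suggest.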

f) Examine the means of the residual (in absolute values, not shown here) and
squared residuals by different categories of independent variables. Why?

This can be done using the command “tab variableA, sum(variableB)”.
However, in order to make this analysis clearer, we collapse the independent variables
into fewer categories. In this example, we collapse the independent variables that have
many categories (EDUC, INCOM and REPUBLICAN) into three categories each and
leave RACE (and GOVRES) intact.
. recode EDUC (0/12=0)(13/16=1)(17/20=2), gen(EDUC2)
(1655 differences between EDUC and EDUC2)

. recode INCOM (0/10=0)(11/20=1)(21/24=2), gen(INCOM2)

(1641 differences between INCOM and INCOM2)

. recode REPUBLICAN (0 1=0) (2 3 4=1) (5 6=2), gen(REPUBLICAN2)

(1411 differences between REPUBLICAN and REPUBLICAN2)

We now use the command “tab independentvariable, summ(squaredresidualsvariable)” for
each independent variable. Focus on the mean of each category. Why? What do we find?

. tab RACE, summ(squared_e)

RECODE of |
race (race |
of | Summary of squared_e
respondent) | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 1.3321228 1.6271939 436
1 | 1.1807221 1.5715731 1225
------------+------------------------------------
Total | 1.2204636 1.5872673 1661

. tab EDUC2, summ(squared_e)

RECODE of |
EDUC |
(RECODE of |
educ |
(highest |
year of |
school | Summary of squared_e
completed)) | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 1.4125591 1.7571281 687
1 | 1.1320202 1.4776772 737
2 | .9386629 1.3135448 237
------------+------------------------------------
Total | 1.2204636 1.5872673 1661
. tab INCOM2, summ(squared_e)
RECODE of |
INCOM |
(RECODE of |
income06 |
(total |
family | Summary of squared_e
income)) | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 1.5703851 1.8339021 337
1 | 1.1768049 1.5589693 1020
2 | .97904393 1.3033466 304
------------+------------------------------------
Total | 1.2204636 1.5872673 1661
. tab REPUBLICAN, summ(squared_e)
RECODE of |
partyid |
(political |
party |
affiliation | Summary of squared_e
) | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 1.3544698 1.754 250
1 | 1.1022653 1.5308627 283
2 | 1.2853162 1.6759844 189
3 | 1.4062153 1.6491756 350
4 | 1.2662294 1.6828067 127
5 | .99803069 1.3768972 273
6 | 1.101894 1.398698 189
------------+------------------------------------
Total | 1.2204636 1.5872673 1661
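
The table above uses the original seven-category REPUBLICAN variable. For the collapsed three-category version created earlier, a minimal sketch with “tabstat” (an alternative to “tab ..., summ()” that is not used in the original handout) would be:

```stata
* Sketch: mean, standard deviation, and count of the squared residuals
* within each category of the collapsed party variable.
tabstat squared_e, by(REPUBLICAN2) statistics(mean sd n)
```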

Sample do-file
*Obtain the frequency distribution

tab race, miss

tab educ, miss
tab income06, miss
tab partyid, miss
tab helppoor, miss
*Recoding variables
recode race (1=1) (2/3=0), gen (RACE)
tab RACE
recode educ (22/98=.), gen(EDUC)
tab EDUC
recode income06 (1=0)(2=1)(3=2)(4=3)(5=4)(6=5)(7=6)(8=7)(9=8)(10=9) ///
    (11=10)(12=11)(13=12)(14=13)(15=14)(16=15)(17=16)(18=17)(19=18)(20=19) ///
    (21=20)(22=21)(23=22)(24=23)(25=24)(26/98=.), gen(INCOM)
tab (INCOM)
recode partyid (7/8=.), gen(REPUBLICAN)
tab REPUBLICAN
recode helppoor (1=0)(2=1)(3=2)(4=3)(5=4)(8=.), gen(GOVRES)
tab GOVRES
*Dropping missing cases, (either method)
drop if RACE==.| EDUC==.| INCOM==.| REPUBLICAN==.|GOVRES==.

** or,
mark nomiss
markout nomiss RACE EDUC INCOM REPUBLICAN GOVRES

*Obtain correlations
corr RACE EDUC INCOM REPUBLICAN GOVRES if nomiss== 1
**Regressions
reg EDUC RACE if nomiss==1, beta

reg INCOM RACE EDUC if nomiss==1, beta

reg REPUBLICAN RACE EDUC INCOM if nomiss==1, beta
reg GOVRES RACE EDUC INCOM REPUBLICAN if nomiss==1, beta

gen raceduc = RACE*EDUC

gen racINCOM = RACE*INCOM
gen racrepub = RACE*REPUBLICAN
gen edurepub= EDUC*REPUBLICAN
gen eduINCOM= EDUC*INCOM
gen incomrepub= INCOM*REPUBLICAN

reg GOVRES RACE EDUC INCOM REPUBLICAN raceduc racINCOM racrepub edurepub ///
    eduINCOM incomrepub, beta
**Chow Test

reg GOVRES RACE EDUC INCOM REPUBLICAN raceduc racINCOM racrepub edurepub ///
    eduINCOM incomrepub, beta
test raceduc racINCOM racrepub edurepub eduINCOM incomrepub

** Obtain predicted values, residuals and standardized residuals.

**Run the regression from which you want to obtain predicted values first.
reg GOVRES RACE EDUC INCOM REPUBLICAN

predict yhat
predict e, resid
predict std_e, rstandard

**Obtain histogram and normal probability plots of the standardized residuals.

hist std_e, saving(Histogram_std_e, replace)
qnorm std_e, saving(NPP_std_e, replace)
**Plot the (unstandardized) residuals with each of the independent variables
graph twoway scatter e RACE, saving(e_RACE, replace)
graph twoway scatter e EDUC, saving(e_EDUC, replace)
graph twoway scatter e INCOM, saving(e_INCOM, replace)
graph twoway scatter e REPUBLICAN, saving(e_REPUBLICAN, replace)

**Create new variables: squared (unstandardized) residual and the absolute
**value of the (unstandardized) residual.
gen squared_e=e*e
gen absolute_e= abs(e)

**Obtain correlations of squared residuals and absolute value of (unstandardized)
**residuals to all other variables in the model
corr GOVRES RACE EDUC INCOM REPUBLICAN squared_e absolute_e

**Examine the means of the residual (in absolute values, not shown here) and
**squared residuals by different categories of independent variables.
**First collapse indep. var. into fewer categories.

recode EDUC (0/12=0)(13/16=1)(17/20=2), gen(EDUC2)

recode INCOM (0/10=0)(11/20=1)(21/24=2), gen(INCOM2)
recode REPUBLICAN (0 1=0) (2 3 4=1) (5 6=2), gen(REPUBLICAN2)
tab RACE, summ(squared_e)
tab EDUC2, summ(squared_e)
tab INCOM2, summ(squared_e)
tab REPUBLICAN, summ(squared_e)
