
Part 1: Regression model with dummy variables

- Create dummy variables
- Fit the linear regression model
- Oneway ANOVA Model using Proc GLM
- Test for Equality of Variances
- Post-hoc comparison of means
The purpose of the tutorial is to create dummy variables that
can be used in a linear regression model.

1. Create dummy variables


Note: Please download the ‘data_car.csv’ file and save it in your SAS working
folder.
data car; /* create a SAS data set named car */
infile 'data_car.csv' dlm="," firstobs=2; /* read the CSV, skipping the header row */
input MPG ORIGIN $; /* ORIGIN is a character variable */
run;
proc print data = car; /* print the data */
run;
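An alternative is to let PROC IMPORT read the file and take the variable names from the header row; a minimal sketch, assuming the header contains the names MPG and ORIGIN:
/* Alternative: read the CSV with PROC IMPORT (variable names taken from the header row) */
proc import datafile='data_car.csv' out=car dbms=csv replace;
getnames=yes;
run;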
/*Data step to create dummy variables for each level of ORIGIN*/
data car1;
set car;
/*create dummy variables*/
if origin = "US" then do;
USA = 1;
European = 0;
Japanese = 0;
end;
else if origin = "Europe" then do;
USA = 0;
European = 1;
Japanese = 0;
end;
else if origin = "Japan" then do;
USA = 0;
European = 0;
Japanese = 1;
end;
run;
proc print data = car1; /* print data */
run;
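As a quick check that each value of ORIGIN maps to exactly one dummy variable equal to 1, the coding can be cross-tabulated; a small sketch using PROC FREQ:
/* Cross-tabulate ORIGIN against the three dummy variables to verify the coding */
proc freq data = car1;
tables origin*USA*European*Japanese / list;
run;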
2. Fit the linear regression model
2.1 Fit a model with USA as the reference category
We will use USA cars as the reference category in this model by
omitting the dummy variable for USA cars from our model, and
including the dummy variables for European and Japanese cars.
These two dummy variables represent a contrast between the
average MPG for European and Japanese cars vs. USA cars,
respectively.
Yi = 0 + 1European + 2Japanese + i
SAS CODE:
/*Fit a regression model with American cars as the reference category*/
proc reg data=car1;
model mpg = european japanese;
run;
- We first note that there are 394 observations included in this regression
fit. One observation is excluded because of missing values in either the
outcome variable, MPG, or the predictors (EUROPEAN, JAPANESE).

- Next, we look at the Analysis of Variance table. Always check this table
to be sure the model is set up correctly. The Corrected Total df = n-1,
which is 393 for this model. The Model df = 2, because we have two
dummy variables as predictors. The Error df = 391, which is calculated as:
Error df = Corrected Total df - Model df.

- The F test is reported as F(2, 391) = 97.57, p < 0.0001, and indicates that
we have a significant overall model.

- The Model R-square is 0.3329, indicating that about 33% of the total
variance of MPG is explained by this regression model.

- The parameter estimate (20.045) for the Intercept represents the
estimated MPG for the reference category, USA cars. The parameter
estimate for European (7.657) represents the contrast in mean MPG for
European cars vs. USA cars (the reference). That is, European cars are
estimated to have a mean value of MPG that is 7.657 units higher than
the mean for USA cars. This difference is significant, t(391) = 8.80,
p < 0.0001. The parameter estimate for Japanese (10.405) represents the
contrast in mean MPG for Japanese cars vs. USA cars. Japanese cars are
estimated to have a mean value of MPG that is 10.405 units higher than
American cars. This difference is significant, t(391) = 12.59, p < 0.0001.
- We can use the model output to calculate the mean MPG for European
cars by adding the estimated intercept plus the parameter estimate for
European (20.04553 + 7.65737 = 27.70290). We calculate the mean MPG for
Japanese cars by adding the intercept plus the parameter estimate for
Japanese (20.04553 + 10.40510 = 30.45063).
Note: The df (degrees of freedom) shown in the Parameter Estimates
table are all equal to 1. This means that we are looking at one
parameter for each of these estimates; it is not the df to use for the t-
tests. The df to use for the t-tests is the Error df, which in this case is 391.
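These fitted means can be checked against the sample means of MPG by ORIGIN; a minimal sketch using PROC MEANS on the car1 data set:
/* Sample mean MPG by origin; these should match the means implied by the regression */
proc means data = car1 n mean;
class origin;
var mpg;
run;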

2.2 Fit a new model with Japanese as the reference category
This time we include the two dummy variables, USA and EUROPEAN in
our model. The SAS commands and output are shown below:
/*Fit a regression model with Japanese cars as the reference category*/
proc reg data=car1;
model mpg = USA european;
run;

The Analysis of Variance table for this model is the same as for the previous model,
and the Model R-square is the same. However, the parameter estimates differ,
because they represent different quantities than they did in the first model. This is
because we have fit the same model, but used a different way to parameterize the
dummy variables for ORIGIN.

The intercept is now the estimated mean MPG for Japanese cars (30.45). The
parameter estimate for USA represents the contrast in the mean MPG for USA cars
vs. Japanese cars (USA cars have, on average, 10.405 lower MPG than Japanese
cars). The parameter estimate for EUROPEAN represents the contrast in the mean
MPG for European cars vs. Japanese cars (European cars have, on average, 2.747
lower MPG than Japanese cars). Note that this last contrast can also be recovered
from the first model as the difference between its European and Japanese
estimates: 7.657 - 10.405 = -2.748 (the small discrepancy is due to rounding).
3. Oneway ANOVA Model using Proc GLM
The model we investigate here is called a oneway ANOVA because
there is only one categorical predictor. You may also fit a twoway or
higher-way ANOVA, if you have two or more categorical predictors in
the model.

When we use Proc GLM, we do not have to create the dummy
variables as we did for Proc Reg. Here is sample SAS code for fitting a
oneway ANOVA model using Proc GLM. Note the class statement
specifying ORIGIN as a class variable. This causes SAS to create dummy
variables for ORIGIN automatically. SAS will use the highest formatted
level (USA in this case) of ORIGIN as the reference category. SAS also
over-parameterizes the model, including a dummy variable for each
level of ORIGIN, but setting the parameter for the highest level equal
to zero.
/*Fit an ANOVA model (USA will be the default reference category)*/
proc glm data=car1;
class origin; /* SAS creates the dummy variables for ORIGIN */
model mpg = origin / solution; /* solution: print the parameter estimates */
means origin / hovtest=levene(type=abs) tukey; /* Levene's test and Tukey comparisons */
run;
Note that the ANOVA Table results and Model R-Square here are the
same as for the Regression Model with dummy variables. The parameter
estimates are the same as those obtained in the dummy variable
regression model, with USA as the reference category. Here, USA is the
reference category, because its format has the highest level of ORIGIN
alphabetically.
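The estimated group means can also be requested directly from Proc GLM; a short sketch adding an LSMEANS statement to the same model:
proc glm data=car1;
class origin;
model mpg = origin / solution;
lsmeans origin; /* estimated mean MPG for each level of ORIGIN */
run;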

Test for Equality of Variances

We also requested Levene's test for homogeneity of variances for the
three groups of MPG. Here we are testing H0: σ²_American = σ²_European =
σ²_Japanese, and we do not reject H0. We conclude that the variances are
not significantly different from each other.
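Had Levene's test rejected H0, the MEANS statement also accepts the WELCH option, which requests Welch's variance-weighted oneway ANOVA and does not assume equal variances; a minimal sketch:
proc glm data=car1;
class origin;
model mpg = origin;
means origin / hovtest=levene(type=abs) welch; /* Welch's ANOVA alongside Levene's test */
run;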

Post-hoc comparison of means


The output below is for Tukey's studentized range test for comparing the
means of MPG for each pair of origins. There are 3 possible comparisons
of means, and the Tukey procedure controls the overall experimentwise
Type I error rate across all of them. SAS uses an experimentwise alpha
level of 0.05 by default.
We can see from the above output that all pairwise comparisons of
means are significant at the .05 level, after applying the Tukey
adjustment for multiple comparisons.
SAS shows each comparison of means twice, doing the subtraction of the
two means being compared both ways.
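The experimentwise level can be changed with the ALPHA= option on the MEANS statement; for example, a sketch requesting the Tukey comparisons at the 0.01 level instead:
proc glm data=car1;
class origin;
model mpg = origin;
means origin / tukey alpha=0.01; /* Tukey comparisons at an experimentwise alpha of 0.01 */
run;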

Part 2: proc glm to fit regression model and output PRESS statistic
/*Create data*/
data c;
input x1 x2 x3 x4 y;
x1x2 = x1*x2;
cards;
77 182 77 9.5478 180
58 161 51 9.0132 159
53 161 54 9.0704 158
68 177 70 9.4246 175
59 157 59 9.1338 155
76 170 76 9.4665 165
76 167 77 9.4618 165
69 186 73 9.5162 180
71 178 71 9.4445 175
65 171 64 9.3005 170
70 175 75 9.4823 174
166 57 56 8.0684 163
51 161 52 9.0326 158
64 168 64 9.2828 165
52 163 57 9.1368 160
65 166 66 9.3016 165
92 187 101 9.8462 185
62 168 62 9.2511 165
76 197 75 9.6007 200
61 175 61 9.2757 171
119 180 124 10.0132 178
61 170 61 9.2467 170
65 175 66 9.3544 173
66 173 70 9.4018 170
54 171 59 9.2192 168
50 166 50 9.0240 165
63 169 61 9.2408 168
58 166 60 9.2063 160
39 157 41 8.7698 153
101 183 100 9.8147 180
71 166 71 9.3747 165
75 178 73 9.4722 175
79 173 76 9.4840 173
52 164 52 9.0511 161
68 169 63 9.2730 170
64 176 65 9.3449 175
56 166 54 9.1010 165
69 174 69 9.3932 171
88 178 86 9.6361 175
65 187 67 9.4358 188
;
run;

1. Call proc glm to fit the regression model and output the PRESS statistic.
/*SAS COMMAND*/
proc glm data = c; /* proc glm can fit the same regression model as proc reg */
model y = x1 x2 x3 / p clm; /* p: predicted values and residuals (the PRESS
statistic is also printed); clm: confidence limits for the mean */
run;
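Proc Reg reports the same quantity; a sketch assuming the same data set c, where the predicted residual sum of squares (PRESS) is printed below the output statistics when predicted values are requested:
/* Alternative: proc reg also reports the Predicted Residual SS (PRESS) with the p option */
proc reg data = c;
model y = x1 x2 x3 / p clm;
run;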
2. Output MSE, SSE, Mallows' Cp, and adjusted R^2 for candidate models.

/*SAS COMMAND*/
proc rsquare data = c cp mse sse adjrsquare; /* report Cp, MSE, SSE, and adjusted R-square for all subsets */
model y = x1 x2 x3 x4 x1x2;
run;
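If PROC RSQUARE is not available in your SAS release, the same all-subsets summary can be requested through PROC REG with the SELECTION=RSQUARE option; a sketch:
/* All-subsets selection via proc reg, reporting Cp, MSE, SSE, and adjusted R-square */
proc reg data = c;
model y = x1 x2 x3 x4 x1x2 / selection=rsquare cp mse sse adjrsq;
run;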
3. Model selection: stepwise regression, forward selection,
backward elimination, and maximum r^2 improvement

/* Perform stepwise regression. The default values of alpha_entry and alpha_stay
are used here (0.5 and 0.15, respectively). */
proc stepwise data = c;
model y = x1 x2 x3 x4 x1x2 /stepwise;
run;

/* Perform forward selection. The default value of alpha_entry is used here. */
proc stepwise;
model y = x1 x2 x3 x4 x1x2/ forward;
run;

/* Perform backward elimination. The default value of alpha_stay is used here. */
proc stepwise;
model y = x1 x2 x3 x4 x1x2 / backward;
run;

/* Perform maximum R^2 improvement. The default value of alpha_stay is used here. */
proc stepwise;
model y = x1 x2 x3 x4 x1x2 / maxr;
run;

/* Perform stepwise regression with alpha_entry set equal to 0.05 (sle=0.05)
and alpha_stay set equal to 0.05 (sls=0.05). The default values for alpha_entry
and alpha_stay are 0.5 and 0.15, respectively. */

proc stepwise;
model y = x1 x2 x3 x4 x1x2 / stepwise sle=0.05 sls=0.05;
run;

SAS OUTPUT FOR STEPWISE METHOD:
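If PROC STEPWISE is not available in your SAS release, the same selection methods can be requested through PROC REG; a sketch equivalent to the final stepwise run above:
/* Stepwise selection via proc reg with entry and stay levels of 0.05 */
proc reg data = c;
model y = x1 x2 x3 x4 x1x2 / selection=stepwise slentry=0.05 slstay=0.05;
run;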
