
BUILDING REGRESSION MODELS PART 2

Topics Outline
Include/Exclude Decisions
Variable Selection Procedures
Example 1
Explaining spending amounts at HyTex
HyTex is a direct marketer of stereo equipment, personal computers, and other electronic
products. HyTex advertises entirely by mailing catalogs to its customers, and all of its orders are
taken over the telephone. The company spends a great deal of money on its catalog mailings, and
it wants to be sure that this is paying off in sales.
The file Catalog_Marketing.xlsx contains data on 1000 customers who purchased mail-order
products from the HyTex Company in the current year. For each customer there are data on the
following variables:
Age          age of the customer at the end of the current year
Gender       = 1 for males, 0 for females
OwnHome      = 1 if customer owns a home, 0 otherwise
Married      = 1 if customer is currently married, 0 otherwise
Close        = 1 if customer lives reasonably close to a shopping area that sells similar
             merchandise, 0 otherwise
Salary       combined annual salary of customer and spouse (if any)
Children     number of children living with customer
PrevCust     = 1 if customer purchased from HyTex during the previous year, 0 otherwise
PrevSpent    total amount of purchases made from HyTex during the previous year
Catalogs     number of catalogs sent to the customer this year
AmountSpent  total amount of purchases made from HyTex this year

Develop a multiple regression model that is useful for explaining current year spending amounts at HyTex.
Solution:
With this much data (1000 observations), it is possible to set aside part of the data set for validation.
Although any split can be used, let's base the regression on the first 750 observations and use the
other 250 for validation. Therefore, you should select only the range through row 751 when
defining the StatTools data set.
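Outside StatTools, the same split can be sketched in a few lines of Python. This is only an illustration: the stand-in list of customer numbers and the function name split_for_validation are assumptions, not part of the data file or of StatTools.

```python
def split_for_validation(rows, n_train=750):
    """Return (training rows, validation rows) using a simple positional split."""
    return rows[:n_train], rows[n_train:]


# Stand-in for the 1000 customer records (hypothetical; only the row count matters here).
customers = list(range(1, 1001))
train, valid = split_for_validation(customers)
print(len(train), len(valid))
```

The regression would be estimated on train only, and valid would be held back to check how well the final equation predicts customers it has not seen.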
(a) Regression 1
Run first a multiple regression with all explanatory variables.
The goal is then to exclude variables that aren't necessary, based on their t-values and P-values.
Here is the multiple regression output.

It indicates a fairly good fit. The r² value is 74.7% and s_e is about \$491. Given that the actual
amounts spent in the current year vary from a low of under \$50 to a high of over \$5500, with
a median of about \$950, a typical prediction error of around \$491 is decent but not great.
(b) Which variable(s) would you exclude from the regression equation?
From the P-value column, you can see that there are four variables, Age, Gender, OwnHome,
and Married, that have P-values well above 0.05. These are the obvious candidates for
exclusion from the equation. You could rerun the equation with all four of these variables
excluded, but it is a better practice to exclude one variable at a time. It is possible that when
one of these variables is excluded, another one of them will become significant.
(c) Rerun the regression after excluding the variables with the largest P-values one at a time.
Regression 2
The variable Married has the largest P-value. The result from rerunning the regression
without this variable shows that Age, Gender, and OwnHome still have large P-values.
Regression 3
The variable with the largest remaining P-value, Age, is excluded.
Regression 4
The variable with the largest remaining P-value, OwnHome, is excluded.
Regression 5
The variable with the largest remaining P-value, Gender, is excluded.
Here is the resulting output.

The r² and s_e values of 74.6% and \$491 are almost the same as they were with all variables
included, and all of the P-values are very small.
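The one-at-a-time exclusion used in Regressions 2 through 5 can be written as a short loop. The sketch below is not the StatTools procedure. The function pvalues_for is a hypothetical stand-in for "refit the regression and read off the P-values", and the P-values in FAKE_P are made up for illustration; they are not taken from the regression output.

```python
def backward_eliminate(variables, pvalues_for, alpha=0.05):
    """Drop the least significant variable, refit, and repeat.

    pvalues_for(vars) is supplied by the caller (hypothetical here): it fits
    the regression on `vars` and returns {variable: P-value}.
    """
    kept = list(variables)
    dropped = []
    while kept:
        pvals = pvalues_for(kept)
        worst = max(kept, key=lambda v: pvals[v])
        if pvals[worst] <= alpha:
            break  # every remaining variable is significant
        kept.remove(worst)
        dropped.append(worst)
    return kept, dropped


# Made-up P-values, for illustration only. A real run would refit the equation
# each time, so the P-values could change after every exclusion.
FAKE_P = {"Age": 0.60, "Gender": 0.25, "OwnHome": 0.40, "Married": 0.85,
          "Close": 0.0001, "Salary": 0.0001, "Children": 0.0001,
          "PrevCust": 0.0001, "PrevSpent": 0.0001, "Catalogs": 0.0001}


def fake_pvalues(variables):
    return {v: FAKE_P[v] for v in variables}


kept, dropped = backward_eliminate(list(FAKE_P), fake_pvalues)
print(dropped)
```

Refitting inside the loop is the point of the design: a variable that looks insignificant alongside another weak variable may become significant once that other variable is gone.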
(d) Interpret the coefficients of the final regression equation.
The coefficient of Close implies that an average customer living close to stores with this type
of merchandise spent about \$416 less than an average customer living far from such stores.
The coefficient of Salary implies that, on average, about 1.8 cents of every extra salary dollar
was spent on HyTex merchandise.
The coefficient of Children implies that about \$161 less was spent for every extra child living at home.
The PrevCust and PrevSpent terms are somewhat more difficult to interpret.
First, both of these terms are zero for customers who didn't purchase from HyTex in the
previous year. For those who did, the terms become
-544 + 0.27 PrevSpent
The coefficient 0.27 implies that each extra dollar spent the previous year can be expected to
contribute an extra 27 cents in the current year. The -544 literally means that if you compare a
customer who didn't purchase from HyTex last year to another customer who purchased only a
tiny amount, the latter is expected to spend about \$544 less than the former this year. However,
none of the latter customers were in the data set. A look at the data shows that of all customers
who purchased from HyTex last year, almost all spent at least \$100 and most spent considerably
more. In fact, the median amount spent by these customers last year was about \$900 (the median
of all positive values for the PrevSpent variable). If you substitute this median value into the
expression -544 + 0.27 PrevSpent, you obtain -298. Therefore, this median spender from last
year can be expected to spend about \$298 less this year than the previous year nonspender.
The coefficient of Catalogs implies that each extra catalog can be expected to generate about
\$44 in extra spending.
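The arithmetic for the combined PrevCust and PrevSpent effect can be checked with a short sketch. It assumes the rounded coefficients quoted in the text, with a negative sign on the PrevCust term (the text describes spending about \$544 less). With these rounded values the median spender comes out near -301 rather than the reported -298, presumably because the reported figure uses unrounded coefficients. The break-even amount is an extra calculation that is not in the original output.

```python
# Rounded coefficients as quoted in the text; the software works with unrounded values.
PREV_CUST_COEF = -544.0
PREV_SPENT_COEF = 0.27


def previous_customer_effect(prev_spent):
    """Combined PrevCust and PrevSpent contribution for a previous-year customer."""
    return PREV_CUST_COEF + PREV_SPENT_COEF * prev_spent


median_effect = previous_customer_effect(900)   # median PrevSpent, about 900 dollars
break_even = -PREV_CUST_COEF / PREV_SPENT_COEF  # PrevSpent at which the effect is zero
print(round(median_effect), round(break_even))
```

Under these assumptions, only customers who spent more than roughly \$2000 last year would be predicted to spend more this year than a comparable customer who bought nothing last year.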

(e) Do forward, backward, and stepwise procedures produce the same regression equation for the
amount spent in the current year?
Each of these options is found in the StatTools Regression dialog box shown to the right.
It is just a matter of choosing the appropriate option from the Regression Type dropdown list.
In each, specify AmountSpent as the dependent variable and select all of the other variables
(besides Customer) as potential independent variables.
Once you choose one of the regression types, the dialog box changes, as shown below, to include
a Parameters section and an advanced option to Include Detailed Step Information.

It turns out that each regression procedure (stepwise, forward, and backward) produces the same
final equation that we obtained previously, with all variables except Age, Gender, OwnHome,
and Married included. This often happens, but not always.
The stepwise and forward procedures add the variables in the order Salary, Catalogs, Close,
Children, PrevCust, and PrevSpent.
The backward procedure, which starts with all variables in the equation, eliminates variables
in the order Married, Age, OwnHome, and Gender.
A sample of the stepwise output appears below.


The variables that enter or exit the equation are listed at the bottom of the output. The usual
regression output for the final equation also appears. Again, however, this final equation's
output is exactly the same as when multiple regression is used with these particular
variables.
Notes:
1. If you validate this final regression equation on the other 250 customers, you will find r² and
s_e values of 73.2% and \$486. These are very promising: they are very close to the values
based on the original 750 customers.
2. We haven't tried all possibilities. We haven't tried nonlinear or interaction variables,
nor have we looked at different coding schemes (such as treating Catalogs as a categorical
variable and using dummy variables to represent it).
3. We haven't checked the regression assumptions. In particular, it turns out that the condition
for constant error variance is violated as can be seen from the fan shape of the scatterplot of
AmountSpent versus Salary:


As usual, when you see a fan shape, where the variability increases from left to right in a
scatterplot, you can try a logarithmic transformation. The reason this often works is that the
logarithmic transformation squeezes the large values closer together and pulls the small
values farther apart. The scatterplot of the log of AmountSpent versus Salary is shown below.

Clearly, the fan shape is gone. However, the logarithmic transformation appears to have
introduced some curvature into the plot. So, perhaps some other nonlinear transformations are
worth exploring in this example.
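The squeezing effect of the logarithm is easy to see numerically. The sketch below compares the same \$100 gap at the low end and at the high end of the spending range; the specific amounts are arbitrary values chosen for illustration, not values from the data set.

```python
import math


def log_gap(low, high):
    """Distance between two positive amounts on the natural-log scale."""
    return math.log(high) - math.log(low)


# The same 100-dollar gap, once among small amounts and once among large amounts.
small_end = log_gap(100, 200)
large_end = log_gap(5000, 5100)
print(f"{small_end:.3f} {large_end:.3f}")
```

On the log scale the gap among the large amounts is far smaller than the gap among the small amounts, which is why the spread at the right of a fan-shaped scatterplot tends to shrink after the transformation.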


Example 2
Possible gender discrimination in salary at Fifth National Bank of Springfield
The Fifth National Bank of Springfield is facing a gender discrimination suit.
The charge is that its female employees receive substantially smaller salaries than its male
employees. The bank's employee data are listed in the file Bank_Salaries.xlsx.
Employee  EducLev  JobGrade  YrsExper  Age  Gender  YrsPrior  PCJob  Salary
1         3        1         3         26   Male    1         No     \$32,000
2         1        1         14        38   Female  1         No     \$39,100
...       ...      ...       ...       ...  ...     ...       ...    ...
207       5        6         35        59   Male    0         No     \$94,000
208       5        6         33        62   Female  0         No     \$30,000

For each of the 208 employees, the data set includes the following variables:
EducLev   education level, a categorical variable with categories 1 (finished high school),
          2 (finished some college courses), 3 (obtained a bachelor's degree), and so on through level 5
JobGrade  a categorical variable indicating the current job level, the possible levels being 1 through 6
YrsExper  years of experience with this bank
Age       employee's age
Gender    a categorical variable with values Female and Male
YrsPrior  number of years of work experience at another bank prior to working at Fifth National
PCJob     a categorical variable (Yes or No) indicating whether the employee's job is computer-related
Salary    current annual salary

Do these data provide evidence that the bank discriminates against females in terms of salary?
A formal hypothesis test to compare the average female salary to the average male salary could
be run. Using this method, you can check that the average of all salaries is \$39,922, the female
average is \$37,210, the male average is \$45,505, and the difference between the male and female
averages is statistically significant at any reasonable level of significance.
In short, the females definitely earn less. But perhaps there is a reason for this.
They might have lower education levels, they might have been hired more recently, and so on.
The question is whether the difference between female and male salaries is still evident after
taking these other attributes into account.
Solution:


(a) Create dummy variables for the various categorical variables.

Using Excel
Create a dummy variable Female based on Gender in column J by entering the formula
=IF(F2="Female",1,0)
in cell J2 and copying it down.
Note that females are coded as 1s and males as 0s.
Create a dummy variable HasPCJob based on PCJob in column K by entering the formula
=IF(H2="Yes",1,0)
in cell K2 and copying it down.
Using StatTools
StatTools's Dummy procedure is somewhat easier, especially when there are multiple categories.
Here are the steps to create five dummies for the education levels.
Data Utilities
Dummy
Select EducLev to base the dummies on
Create One Dummy Variable for Each Distinct Category
OK
Yes
This creates five dummy columns with variable names EducLev = 1 through EducLev = 5.
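For readers working outside Excel and StatTools, the same dummy coding can be sketched in Python. The function name add_dummies is an assumption made for illustration; the four records are the employees shown in the data listing (employees 1, 2, 207, and 208), restricted to the three categorical columns.

```python
def add_dummies(row, educ_levels=(1, 2, 3, 4, 5)):
    """Return a copy of row with Female, HasPCJob, and EducLev dummies added."""
    coded = dict(row)
    coded["Female"] = 1 if row["Gender"] == "Female" else 0
    coded["HasPCJob"] = 1 if row["PCJob"] == "Yes" else 0
    for level in educ_levels:
        coded[f"EducLev = {level}"] = 1 if row["EducLev"] == level else 0
    return coded


# Employees 1, 2, 207, and 208 from the data listing.
employees = [
    {"EducLev": 3, "Gender": "Male", "PCJob": "No"},
    {"EducLev": 1, "Gender": "Female", "PCJob": "No"},
    {"EducLev": 5, "Gender": "Male", "PCJob": "No"},
    {"EducLev": 5, "Gender": "Female", "PCJob": "No"},
]
coded = [add_dummies(e) for e in employees]
print(coded[0])
```

Exactly one of the five education dummies equals 1 in every row, which is why only four of them can be used in a regression that also has an intercept.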
(b) Regression 1
Estimate a regression equation with only one explanatory variable, Female, and interpret it.
The output appears below.

The resulting equation is

Predicted Salary = 45505 - 8296 Female
To interpret regression equations with dummy variables, it is useful to rewrite the equation for
each category.

If you substitute Female = 1 into the estimated regression equation, you obtain
Predicted Salary = 45505 - 8296(1) = 37209
Because Female = 1 corresponds to females, this equation simply indicates the average female salary.
Similarly, if you substitute Female = 0 into the estimated equation, you obtain
Predicted Salary = 45505 - 8296(0) = 45505
Because Female = 0 corresponds to males, this equation indicates the average male salary.
Therefore, the interpretation of the -8296 coefficient of the Female dummy variable is straightforward.
It is the difference between the average female salary and the average salary of the reference (male) category.
In short, females get paid \$8296 less on average than males.
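The reason the intercept equals the male average and the dummy coefficient equals the difference in averages can be seen with a tiny least-squares calculation. The four salaries below are made-up toy values, not bank data, and simple_ols is a hand-written helper for one explanatory variable rather than a library routine.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for one explanatory variable."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Toy data (made up): two males averaging 45000 and two females averaging 33000.
female = [0, 0, 1, 1]
salary = [40000, 50000, 30000, 36000]
intercept, slope = simple_ols(female, salary)
print(intercept, slope)
```

With these toy values the intercept reproduces the male average and the slope reproduces the female average minus the male average, the same pattern as 45505 and -8296 in Regression 1.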
(c) Regression 2
Expand the regression equation by adding the experience variables YrsExper and YrsPrior.
Here is the output with the Female dummy variable and these two experience variables.

The corresponding regression equation is

Predicted Salary = 35492 + 988 YrsExper + 131 YrsPrior - 8080 Female
It is again useful to write this equation in two forms: one for females (substituting Female = 1)
and one for males (substituting Female = 0). After doing the arithmetic, they become
Predicted Salary = 27412 + 988 YrsExper + 131 YrsPrior   (females)
Predicted Salary = 35492 + 988 YrsExper + 131 YrsPrior   (males)
Except for the intercept term, these equations are identical. You can now interpret the
coefficient -8080 of the Female dummy variable as the average salary disadvantage for
females relative to males after controlling for job experience.
Gender discrimination still appears to be a very plausible conclusion.
Note that the r² value is only 49.2%. Perhaps there is still more to the story.

(d) Regression 3
Add education level to the equation by including any four of the five education level dummies,
for example by including EducLev = 2 through EducLev = 5. (Reminder: You should always
use one fewer dummy than the number of categories for any categorical variable.)
Here is the resulting output.

The estimated regression equation is now

Predicted Salary = 26613 + 1033 YrsExper + 362 YrsPrior - 4501 Female
+ 160 EducLev=2 + 4765 EducLev=3 + 7320 EducLev=4 + 11770 EducLev=5
Now there are two categorical variables involved, gender and education level.
However, you can still write a separate equation for each combination of categories by
setting the dummies to appropriate values. For example, the equation for females at
education level 5 is found by setting Female and EducLev=5 equal to 1, and setting the other
education dummies equal to 0. After combining terms, this equation is
Predicted Salary = 33882 + 1033 YrsExper + 362 YrsPrior
This equation can be interpreted as follows. For either gender and any education level,
the expected increase in salary for one extra year of experience with Fifth National is \$1033;
the expected increase in salary for one extra year of prior experience with another bank is \$362.
The coefficients of the education dummies indicate the average increase in salary an
employee can expect relative to the reference (lowest) education level.
For example, an employee with education level 4 can expect to earn \$7320 more than an
employee with education level 1, all else being equal.
The key coefficient, -4501 for Female, indicates the average salary disadvantage (\$4501) for females
relative to males, given that they have the same experience levels and the same education levels.
Note that the r² value is now 64.5%, quite a bit larger than when the education dummies were not
included. We appear to be getting closer to the truth. In particular, you can see that there appears to
be gender discrimination in salaries, even after accounting for job experience and education level.
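Writing a separate equation for each combination of categories amounts to adding the relevant dummy coefficients to the intercept. The sketch below does this for the Regression 3 coefficients quoted above, with a negative sign on the Female coefficient (the text describes a salary disadvantage). The function names are assumptions made for illustration.

```python
# Coefficients of Regression 3 as quoted in the text (rounded).
COEF = {
    "Intercept": 26613, "YrsExper": 1033, "YrsPrior": 362, "Female": -4501,
    "EducLev=2": 160, "EducLev=3": 4765, "EducLev=4": 7320, "EducLev=5": 11770,
}


def category_intercept(female, educ_level):
    """Intercept of the equation for one gender and education-level combination."""
    total = COEF["Intercept"] + COEF["Female"] * female
    if educ_level > 1:  # level 1 is the reference category, so it adds nothing
        total += COEF[f"EducLev={educ_level}"]
    return total


def predicted_salary(female, educ_level, yrs_exper, yrs_prior):
    """Predicted salary from the Regression 3 equation."""
    return (category_intercept(female, educ_level)
            + COEF["YrsExper"] * yrs_exper + COEF["YrsPrior"] * yrs_prior)


print(category_intercept(female=1, educ_level=5))
```

For any fixed education level and experience, the male and female predictions differ by the same 4501, because this equation contains no interaction terms.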

(e) Regression 4
Add the remaining variables: the job grade dummies JobGrade = 2 through JobGrade = 6
(the lowest job grade is used as the reference category), Age, and HasPCJob.
The regression output for this equation with all variables appears below.

The effect of age appears to be minimal, and there appears to be a bonus of close to \$5000
for having a PC-related job.
The r² value has now increased to 76.5%, and the penalty for being a female has decreased to
\$2555, still large but not as large as before.
As expected, the coefficients of the job grade dummies are all positive, and they increase as
the job grade increases: it pays to be in the higher job grades. Thus, the regression indicates
that being in lower job grades implies lower salaries, but it doesn't explain why females are
in the lower job grades in the first place.
(f) Regression 5
If you rerun the regression using the numerical explanatory variable YrsExper and the
dummy variable Female, you obtain the equation
Predicted Salary = 35824 + 981 YrsExper - 8012 Female
The r² value for this equation is 49.1%.
It is certainly plausible that the effect of YrsExper on Salary is different for males than for females.
So, it makes good sense to test for an interaction between YrsExper and Female variables.

(g) Regression 6
If an interaction variable between YrsExper and Female is added to this equation, what is its effect?
You first need to form an interaction variable that is the product of YrsExper and Female.
Using Excel
Use an Excel formula that multiplies the two variables involved.
Using StatTools
Data Utilities
Interaction
Interaction Between: Two Numeric Variables
Select YrsExper and Female
OK
Now you can run the regression. The multiple regression output appears below.

Notice that the r² value with the interaction variable has increased from 49.1% to 63.9%.
The interaction variable has definitely added to the explanatory power of the equation.
The estimated regression equation is
Predicted Salary = 30430 + 1528 YrsExper + 4098 Female - 1248 Interaction(YrsExper,Female)
The negative interaction here means that females tend to get lower raises for each extra year
of experience than the males get. To unravel the meaning of this negative interaction, it is useful
to write the above equation as two separate equations, one for females and one for males.
The female equation (Female = 1, so that Interaction(YrsExper,Female) = YrsExper) is
Predicted Salary = (30430 + 4098) + (1528 - 1248) YrsExper = 34528 + 280 YrsExper
and the male equation (Female = 0, so that Interaction(YrsExper,Female) = 0) is
Predicted Salary = 30430 + 1528 YrsExper
Graphically, these equations appear in the following figure.


The y-intercept for the female line is slightly higher (females with no experience with Fifth National
tend to start out slightly higher than males), but the slope of the female line is much smaller.
That is, males tend to move up the salary ladder much more quickly than females. This provides
another argument, although a somewhat different one, for gender discrimination against females.
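The two lines can be derived mechanically from the Regression 6 coefficients, taking the interaction coefficient as negative, as the text states. The sketch below also computes the point where the lines cross; that crossover is an extra calculation implied by the two equations, not a figure reported in the output, and the names are assumptions made for illustration.

```python
# Regression 6 coefficients as quoted in the text (rounded).
B0, B_EXPER, B_FEMALE, B_INTERACTION = 30430, 1528, 4098, -1248


def line_for(female):
    """(intercept, slope in YrsExper) of the equation for Female = 0 or 1."""
    return B0 + B_FEMALE * female, B_EXPER + B_INTERACTION * female


female_line = line_for(1)
male_line = line_for(0)

# Years of experience at which the two predicted salaries coincide.
crossover = (female_line[0] - male_line[0]) / (male_line[1] - female_line[1])
print(female_line, male_line, round(crossover, 2))
```

Under these rounded coefficients the predicted male salary overtakes the predicted female salary after a little more than three years of experience, and the gap widens every year after that.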
Notes:
1. Interaction variables can make a regression quite difficult to interpret, and they are certainly
not always necessary. However, without them, the effect of each x on y is independent of the
values of the other xs. If you believe, as in this example, that the effect of years of experience
on salary is different for males than it is for females, the only way to capture this behavior is
to include an interaction variable between years of experience and gender.
2. The product of any two variables (a numerical and a dummy variable, two dummy variables,
or even two numerical variables) can be used to create an interaction term. The easiest way to
interpret the results correctly is the way we have been doing it: by writing several separate
equations and seeing how they differ.
(h) Suppose you include the variables YrsExper, Female, and HighJob in the equation for Salary,
along with interactions between Female and YrsExper and between Female and HighJob.
Here, HighJob is a new dummy variable that is 1 for job grades 4 to 6 and is 0 for job grades 1 to 3.
(It can be calculated as the sum of the dummies JobGrade = 4 through JobGrade = 6.)
The resulting equation is
Predicted Salary = 28168 + 1261 YrsExper + 9242 HighJob + 6601 Female
- 1224 Interaction(YrsExper,Female) + 1564 Interaction(Female,HighJob)
and the r² value is now 76.6%.
Interpret the regression coefficients.


The interpretation of this equation is quite a challenge because it is really composed of four
separate equations, one for each combination of Female and HighJob.
For females in the high job category, the equation becomes
Predicted Salary = (28168 + 9242 + 6601 + 1564) + (1261 - 1224) YrsExper
= 45575 + 37 YrsExper
and for females in the low job category it is
Predicted Salary = (28168 + 6601) + (1261 - 1224) YrsExper
= 34769 + 37 YrsExper
Similarly, for males in the high job category, the equation becomes
Predicted Salary = (28168 + 9242) + 1261 YrsExper
= 37410 + 1261 YrsExper
and for males in the low job category it is
Predicted Salary = 28168 + 1261 YrsExper
Putting this into words, the various coefficients can be interpreted as follows.
The intercept 28168 is the average starting salary (that is, with no experience at Fifth National)
for males in the low job category.
The coefficient 1261 of YrsExper is the expected increase in salary per extra year of
experience for males (in either job category).
The coefficient 9242 of HighJob is the expected salary premium for males starting in the
high job category instead of the low job category.
The coefficient 6601 of Female is the expected starting salary premium for females relative
to males, given that they start in the low job category.
The coefficient -1224 of Interaction(YrsExper,Female) is the penalty per extra year of
experience for females relative to males; that is, male salaries increase this much more than
female salaries each year.
The coefficient 1564 of Interaction(Female,HighJob) is the extra premium (in addition to the
male premium) for females starting in the high job category instead of the low job category.
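The four equations can be generated in one pass by setting the two dummies to each combination of 0 and 1. This is a sketch using the rounded coefficients quoted above, with a negative sign on the YrsExper and Female interaction; the dictionary keys and function name are labels chosen for illustration.

```python
from itertools import product

# Coefficients of the equation in part (h) as quoted in the text (rounded).
B = {"Intercept": 28168, "YrsExper": 1261, "HighJob": 9242, "Female": 6601,
     "YrsExper*Female": -1224, "Female*HighJob": 1564}


def equation(female, high_job):
    """(intercept, slope in YrsExper) for one Female/HighJob combination."""
    intercept = (B["Intercept"] + B["HighJob"] * high_job + B["Female"] * female
                 + B["Female*HighJob"] * female * high_job)
    slope = B["YrsExper"] + B["YrsExper*Female"] * female
    return intercept, slope


equations = {(f, h): equation(f, h) for f, h in product((1, 0), repeat=2)}
for (f, h), (a, b) in equations.items():
    print(f"Female={f}, HighJob={h}: Predicted Salary = {a} + {b} YrsExper")
```

Each interaction term contributes only when both of its factors are nonzero, which is why the HighJob premium differs by gender while the experience slope differs only by gender.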
(i) Regression 7
A glance at the distribution of salaries of the 208 employees shows some skewness to the right:
a few employees make substantially more than the majority of employees. Therefore, it might
make more sense to use the natural logarithm of Salary as the dependent variable, not Salary.
Run a regression with Log(Salary) as the dependent variable and YrsExper and Female as
explanatory variables. How can you interpret the results?
Here are the results obtained after creating the Log(Salary) variable and running the regression.


The estimated regression equation is

Predicted Log(Salary) = 10.4907 + 0.0188 YrsExper - 0.1616 Female
The r² and s_e values are 42.4% and 0.1794.
When this same equation was estimated with Salary as the dependent variable, r² and s_e were
49.1% and 8,070. However, these measures are not directly comparable because when the logarithm
of y is used in the regression equation the units of the dependent variable are completely different.
The two r² values are percentages explained of different dependent variables, Log(Salary) and Salary.
The fact that one is smaller than the other (42.4% versus 49.1%) does not necessarily mean
that it corresponds to a worse fit. They simply are not comparable.
Each s_e is a measure of a typical residual, but the residuals in the Log(Salary) equation are in
log dollars, whereas the residuals in the Salary equation are in dollars. These units are of
totally different magnitudes. For example, the log of \$1000 is only 6.91. Therefore, it is no
surprise that s_e for the Log(Salary) equation is much smaller than s_e for the Salary equation.
If you want comparable standard error measures for the two equations, you should take antilogs
(using the EXP function in Excel) of fitted values from the Log(Salary) equation to convert them
back to dollars, subtract these from the original Salary values, and take the standard deviation of
these residuals. You can check that the resulting standard deviation is 7,774. This is somewhat
smaller than s_e = 8,080 from the Salary equation, an indication of a slightly better fit.
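The conversion back to dollars can be sketched as follows. The helper names are assumptions made for illustration, and the sample contains only the four employees shown in the data listing, so the result will not reproduce the 7,774 obtained from all 208 employees. The sketch uses the sample standard deviation from Python's statistics module, which is one reasonable reading of "the standard deviation of these residuals".

```python
import math
import statistics


def fitted_log_salary(yrs_exper, female):
    """Fitted Log(Salary) from the equation quoted in the text (rounded coefficients)."""
    return 10.4907 + 0.0188 * yrs_exper - 0.1616 * female


def dollar_residual_sd(salaries, fitted_logs):
    """Standard deviation of residuals after converting log fits back to dollars."""
    residuals = [s - math.exp(f) for s, f in zip(salaries, fitted_logs)]
    return statistics.stdev(residuals)


# Employees 1, 2, 207, and 208 from the data listing: (YrsExper, Female, Salary).
sample = [(3, 0, 32000), (14, 1, 39100), (35, 0, 94000), (33, 1, 30000)]
fits = [fitted_log_salary(x, f) for x, f, _ in sample]
sd = dollar_residual_sd([s for _, _, s in sample], fits)
print(round(sd))
```

Taking the antilog with math.exp plays the same role as the EXP function in Excel: it puts the fitted values back in dollars so that the residuals are measured in the same units as those of the Salary equation.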
To interpret the regression equation itself, recall that when the dependent variable is log(y)
and a term on the right-hand side of the equation is of the form bx, then whenever x increases
by one unit, the predicted value of y changes by a constant percentage, and this percentage is
approximately equal to b (written as a percentage). Thus, the regression coefficient for
YrsExper means that for each extra year of experience with Fifth National, an employee's
salary can be expected to increase by about 1.88%.
To interpret the Female coefficient, note that the only possible increase in Female is one unit
(from 0 for male to 1 for female). When this occurs, the expected percentage decrease in
salary is approximately 16.16%. In other words, the regression equation implies that females
can expect to make about 16% less than men for comparable years of experience.
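The rule that the coefficient approximates the percentage change works well for small coefficients and less well for larger ones. The exact percentage change implied by a coefficient b in a log(y) equation is exp(b) - 1, a standard result that the sketch below applies to the two coefficients quoted above (taking the Female coefficient as negative). The function name is an assumption made for illustration.

```python
import math


def exact_percent_change(b):
    """Exact percentage change in y for a one-unit increase in x when log(y) is regressed on x."""
    return (math.exp(b) - 1.0) * 100.0


yrs_exper_pct = exact_percent_change(0.0188)  # approximation in the text: 1.88%
female_pct = exact_percent_change(-0.1616)    # approximation in the text: -16.16%
print(f"{yrs_exper_pct:.2f}% {female_pct:.2f}%")
```

For YrsExper the exact figure is almost identical to the approximation, whereas for Female the exact decrease is closer to 15% than to 16%, so the approximation slightly overstates the gap.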

(j) In Regression 6 we regressed Salary versus the Female dummy, YrsExper, and the interaction
between Female and YrsExper, Interaction(YrsExper,Female). The output appears below.

This group of three explanatory variables,

Block1 = Female, YrsExper, Interaction(YrsExper,Female),
already explains 63.9% of the variation in Salary. Does including the following groups of
variables add significantly to the explanatory power of the equation?
Block2 = EducLev dummies, EducLev=2 to EducLev=5
Block4 = interactions between the Female dummy and the education dummies,
Interaction(Female,EducLev=2) to Interaction(Female,EducLev=5)
This question can be answered by performing several partial F tests.
With StatTools, this analysis can be done in one step.
Select the Block option from the Regression Type dropdown list. The dialog box then changes,
as shown in the figure to the right.
Number of blocks: 4
Check which variables are in which blocks.
Check Salary as dependent variable.
Specify 0.05 as the P-Value to enter, which in this case indicates how significant the block
as a whole must be to enter for the partial F test.
OK
The regression calculations are done in stages. At each stage, the partial F test checks whether
a block is significant. If it is, the variables in this block enter and the procedure goes to the
next stage. If it is not, the procedure ends; neither this block nor any later blocks enter.
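The statistic behind each stage can be sketched from the r² values of the smaller and larger equations, using the standard r² form of the partial F statistic. The function below is not the StatTools implementation. In the example call, the 208 employees and the 63.9% for Block1 come from the text, but the 0.70 for the larger equation is a made-up value for illustration only.

```python
def partial_f(r2_reduced, r2_full, n_obs, k_full, q_added):
    """Partial F statistic for adding q_added variables to a regression.

    k_full is the number of explanatory variables in the larger equation.
    """
    numerator = (r2_full - r2_reduced) / q_added
    denominator = (1.0 - r2_full) / (n_obs - k_full - 1)
    return numerator / denominator


# n_obs = 208 and r2_reduced = 0.639 (Block1, three variables) are from the text.
# r2_full = 0.70 for Block1 plus four education dummies is a made-up value.
f_stat = partial_f(r2_reduced=0.639, r2_full=0.70, n_obs=208, k_full=7, q_added=4)
print(round(f_stat, 2))
```

The block enters when the P-value of this statistic, taken from an F distribution with q_added and n_obs - k_full - 1 degrees of freedom, is below the chosen P-Value to enter. That lookup is left out here because it needs a statistics library.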
The output from this procedure appears below.

The middle part of the output shows the final regression equation.
The output in rows 34 through 37 indicates summary measures after successive blocks have entered.
Note that the final block, the interactions between Female and the education dummies,
is not in the final equation. This block did not pass the partial F test at the 5% level.

(k) Run the block procedure a second time, changing the order of the blocks:
Block3 = EducLev dummies, EducLev=2 to EducLev=5
Block4 = interactions between the Female dummy and the education dummies,
Interaction(Female,EducLev=2) to Interaction(Female,EducLev=5)
The regression output appears to
the right.
Note that neither of the last two
blocks enters the equation this