
CHAPTER 11

MULTIPLE REGRESSION

11.1 DATA ANALYSIS FOR MULTIPLE REGRESSION

Use the following to answer questions 1 and 2:

Predicting presidential elections has become a hot topic in the media these days, and numerous
people have become famous based on their predictions. In this problem we will investigate a
model considered by Ray Fair, professor of economics at Yale University. To predict the 2016
presidential election, Fair is using a multiple regression model to predict the Democratic share of
the two-party vote (VP). He considers three macroeconomic predictor variables:
• Growth rate (%) of real per capita GDP in the first three quarters of the election year (G)
• Absolute value of the growth rate (%) of the GDP deflator in the first 15 quarters prior to the
election (P)
• Number of quarters in the 15 quarters prior to the election in which the growth rate of
real per capita GDP was greater than 3.2% at an annual rate (Z)

Using his favorite statistical software, Fair obtained the following fitted model equation:

VP = 42.39 + 0.667G − 0.690P + 0.968Z

1. What is the dependent (response) variable?


A) VP
B) G
C) P
D) Z
Ans: A

2. Assume there is
I. a 3.03% growth rate of real per capita GDP in the first three quarters of 2016
II. a 1.33% growth rate of the GDP deflator in the first 15 quarters of the second
Obama administration,
III. and three quarters in the first 15 quarters of the second Obama administration in
which the growth rate of real per capita GDP is greater than 3.2 percent at an
annual rate.
What is the predicted Democratic share of the two-party vote in the 2016 presidential
election?

A) 42.4%
B) 44.1%
C) 46.4%
D) 48.2%
Ans: C
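
As a quick check of question 2, the fitted equation can be evaluated directly. This is only a sketch: the function name `predict_vp` is hypothetical, and the coefficients are simply copied from the equation above.

```python
def predict_vp(g: float, p: float, z: float) -> float:
    """Evaluate the fitted equation VP = 42.39 + 0.667*G - 0.690*P + 0.968*Z."""
    return 42.39 + 0.667 * g - 0.690 * p + 0.968 * z


# Scenario from question 2: G = 3.03, P = 1.33, Z = 3
vp_2016 = predict_vp(3.03, 1.33, 3)
print(f"Predicted Democratic share: {vp_2016:.1f}%")
```

The result rounds to 46.4%, matching option C.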

11.2 INFERENCE FOR MULTIPLE REGRESSION

3. For any regression model, R² does not indicate:


A) whether the correct regression model was used.
B) whether the F statistic for an overall test of utility is large enough to reject H0.
C) whether the explanatory variables are a true cause of the changes in the response
variable.
D) both A and B.
E) all of the above.
Ans: E

Use the following to answer questions 512:

Has the number of home runs hit by major league teams been changing over time? For the 41
years from 1960 to 2000, the average number of home runs hit per game per team for each
season was computed in order to assess any change over time. Initially, simple linear regression
was used to study the trend in home runs hit over the period 1960 to 2000 by using year to
predict the average number of home runs per game per team in that year. However, it was
pointed out that after the 1976 season the manufacturer of major league baseballs was changed
from Spalding to Rawlings. Because the change in the baseball used might affect the number
of home runs (for example, if Rawlings produces a livelier ball, this would likely lead to more
home runs), it was decided to include an additional variable, namely

Manufacturer = 0 if before 1976 (that is, the Spalding baseball was used)
Manufacturer = 1 if after 1976 (that is, the Rawlings baseball was used)

A multiple regression analysis was performed using the model:

Avg. home runs per game per team = β0 + β1(Year) + β2(Manufacturer)

The following results were obtained:

Analysis of Variance
Source    df    Sum of Squares
Model      2    0.744384
Error     38    2.539330

Parameter Estimates
Variable        df    Parameter Estimate    Standard Error
Intercept        1    −27.91180             12.8900
Year             1    0.014977              0.00650
Manufacturer     1    −0.107816             0.11573

4. Using the regression equation, the predicted average number of home runs per game per
team in 1978 would be:
A) 1.605.
B) 1.713.
C) 2.431.
D) 29.624.
Ans: A
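
The prediction in question 4 can be reproduced with a few lines of arithmetic. In this sketch the function name `predict_home_runs` is hypothetical, and coding the indicator as 1 for any season later than 1976 is an assumption based on the statement that the manufacturer changed after the 1976 season.

```python
# Parameter estimates copied from the output above.
INTERCEPT = -27.91180
B_YEAR = 0.014977
B_MANUFACTURER = -0.107816


def predict_home_runs(year: int) -> float:
    """Predicted average home runs per game per team for a given season."""
    # Assumption: seasons after 1976 used the Rawlings ball (indicator = 1).
    manufacturer = 1 if year > 1976 else 0
    return INTERCEPT + B_YEAR * year + B_MANUFACTURER * manufacturer


hr_1978 = predict_home_runs(1978)
print(f"{hr_1978:.3f}")
```

The value is about 1.605, which agrees with option A.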

5. The value of the regression standard error is:

A) 0.067.
B) 0.259.
C) 0.372.
D) 1.593.
Ans: B

SE = sqrt(SS/df)

6. The value of R² is:

A) 0.067.
B) 0.227.
C) 0.259.
D) 0.744.
Ans: B
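
Questions 5 and 6 both follow from the ANOVA table. A minimal sketch, with variable names chosen here only for illustration: the regression standard error is the square root of SSE divided by its degrees of freedom, and R² is the model sum of squares divided by the total sum of squares.

```python
import math

# Values copied from the Analysis of Variance table above.
ss_model, df_model = 0.744384, 2
ss_error, df_error = 2.539330, 38

# Regression standard error: s = sqrt(SSE / DFE)
s = math.sqrt(ss_error / df_error)

# R-squared: SSM / SST, where SST = SSM + SSE
r_squared = ss_model / (ss_model + ss_error)

print(f"s = {s:.3f}, R^2 = {r_squared:.3f}")
```

Both values round to the keyed answers, 0.259 and 0.227.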

7. A 90% confidence interval for β1, the coefficient of Year, based on these results is:

A) 0.014977 ± 0.0065.
B) 0.014977 ± 0.0110.
C) 0.014977 ± 0.0128.
D) 0.0065 ± 0.014977.
Ans: B
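
The interval in question 7 has the form estimate ± t* × SE with 38 error degrees of freedom. In the sketch below, the critical value 1.686 is an assumption taken from a standard t table (upper 5% point, 38 df); a printed table that only lists 30 or 40 df would give a slightly different value.

```python
estimate, std_error = 0.014977, 0.00650

# Assumed critical value: upper 5% point of the t distribution with 38 df.
t_star = 1.686

margin = t_star * std_error
lower, upper = estimate - margin, estimate + margin
print(f"{estimate} ± {margin:.4f}  ->  ({lower:.4f}, {upper:.4f})")
```

The margin of error is about 0.0110, consistent with option B.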
8. Researchers are interested in determining if the change in manufacturer had any effect
on the average number of home runs hit per game per team. To answer this question,
they decide to test the hypotheses

H0: β2 = 0, Ha: β2 ≠ 0.

Using the estimates obtained from our statistical software, the P-value for this test is:
A) greater than 0.05.
B) between 0.05 and 0.01.
C) between 0.01 and 0.005.
D) below 0.005.
Ans: A

9. The researchers also ran the simple linear regression model

Avg. home runs per game per team = β0 + β1(Year).

The sum of squares for error for this model:


A) will be greater than 2.53933.
B) will be less than 2.53933.
C) will be unchanged.
D) can be greater or less than 2.53933.
Ans: A

10. Suppose we wish to test the hypotheses

H0: β1 = β2 = 0, Ha: at least one of the βj is not 0

using the ANOVA F test. The value of the F statistic is:


A) 0.06.
B) 0.29.
C) 2.29.
D) 5.57.
Ans: D
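
The ANOVA F statistic in question 10 is the ratio of the model mean square to the error mean square. A short sketch using the sums of squares from the table (the helper name `anova_f` is hypothetical):

```python
def anova_f(ss_model: float, df_model: int, ss_error: float, df_error: int) -> float:
    """Overall F statistic: (SSM / DFM) / (SSE / DFE)."""
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    return ms_model / ms_error


f_baseball = anova_f(0.744384, 2, 2.539330, 38)
print(f"F = {f_baseball:.2f}")
```

This gives F of about 5.57 on 2 and 38 degrees of freedom.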
11. Based on the data and the above analyses, it would be reasonable to conclude:

A) the model fits moderately well and because we have examined 41 years, which is a
substantial fraction of the time Major League Baseball has been in existence, we
can be fairly sure of the validity of the multiple linear regression model.
B) the model fits fairly well and the observed average number of home runs per game
per team in the data can be predicted with moderate accuracy from a model using
both Year and Manufacturer.
C) both A) and B). In addition, the fact that the sign of the parameter estimate for
Manufacturer is negative indicates that the baseballs by Rawlings are less lively
than those by Spalding.
D) none of the above.
Ans: D

Use the following to answer questions 1924:

A researcher was investigating variables that might be associated with the academic performance
of high school students. He examined data from 1990 for each of the 50 states plus the District
of Columbia. The data included the average Math SAT score of all high school seniors in the
state who took the exam (labeled as the variable SAT-M), the average number of dollars per
pupil spent on education by the state (labeled as the variable $ Per Pupil), the percentage of high
school seniors in the state who took the exam (labeled as the variable % Taking), and the average
teacher salary (labeled as the variable Salary). As part of his investigation, he ran the following
multiple regression model

SAT-M = β0 + β1($ Per Pupil) + β2(% Taking) + β3(Avg Salary) + εi

where the deviations εi were assumed to be independent and Normally distributed with mean 0
and standard deviation σ. This model was fit to the data using the method of least squares. The
following results were obtained from statistical software.

Source        df    Sum of Squares    MS       F        P
Regression     3    45937             15312    52.10    0.000
Error         47    13813             294
Total         50    59750

Predictor      Coef        Stdev       t-ratio    P
Constant       518.23      16.65       31.12      0.000
$ Per Pupil    0.007098    0.003582    1.98       0.053
% Taking       -1.4860     0.1450      -10.24     0.000
Avg Salary     -0.2373     0.8636      -0.27      0.785

12. The value of MSError (residual) is:

A) 17.1.
B) 294.
C) 13813.
D) 15312.
Ans: B

13. The total sum of squares is:

A) 294.
B) 15312.
C) 13813.
D) 59750.
Ans: D

14. The proportion of the variation in the variable SAT-M that is explained by the
explanatory variables $ Per Pupil, % Taking, and teacher salary is:

A) 0.232.
B) 0.301.
C) 0.769.
D) 0.960.
Ans: C

15. Suppose we wish to test the hypotheses

H0: β1 = β2 = β3 = 0, Ha: at least one of the βj is not 0

using the ANOVA F test. The value of the F statistic is


A) 3.32.
B) 24.0.
C) 52.10.
D) 159.3.
Ans: C

16. A 95% confidence interval for β1, the coefficient of the variable $ Per Pupil, is
approximately:

A) 0.0071 ± 0.0036.
B) 0.0071 ± 0.0042.
C) 0.0071 ± 0.0059.
D) 0.0071 ± 0.0072.
Ans: D
17. The value of the t statistic for testing the hypothesis

H0: β3 = 0, Ha: β3 ≠ 0

is:
A) −.24.
B) −.27.
C) −1.48.
D) −10.24.
Ans: B
Use the following to answer questions 2527:

A researcher was investigating variables that might be associated with the academic performance
of high school students. He examined data from 1990 for each of the 50 states plus the District
of Columbia. The data included the average Math SAT score of all high school seniors in the
state who took the exam (labeled as the variable SAT-M), the average number of dollars per
pupil spent on education by the state (labeled as the variable $ Per Pupil), the percentage of high
school seniors in the state who took the exam (labeled as the variable % Taking), and the average
teacher salary (labeled as the variable Salary). As part of his investigation, he ran the following
multiple regression model

SAT-M = β0 + β1($ Per Pupil) + β2(% Taking) + β3(Avg Salary) + εi

where the deviations εi were assumed to be independent and normally distributed with mean 0
and standard deviation σ. This model was fit to the data using the method of least squares. The
following results were obtained from statistical software:

Source        df    Sum of Squares    MS       F        P
Regression     3    45937             15312    52.10    0.000
Error         47    13813             294
Total         50    59750

Predictor      Coef        Stdev       t-ratio    P
Constant       518.23      16.65       31.12      0.000
$ Per Pupil    0.007098    0.003582    1.98       0.053
% Taking       -1.4860     0.1450      -10.24     0.000
Avg Salary     -0.2373     0.8636      -0.27      0.785

Another researcher, using the same data, ran the following simple linear regression model

SAT-M = β0 + β1($ Per Pupil) + εi

where the deviations εi were assumed to be independent and normally distributed, with mean 0
and standard deviation σ. This model was fit to the data using the method of least squares. The
following results were obtained from statistical software.

Source    df    Sum of Squares
Model      1    14022.7
Error     49    45727.4

Variable       Parameter Estimate    Standard Error of Parameter Estimate
Constant       560.374               16.80
$ Per Pupil    −0.012169             0.0031

18. Based on these results, the second researcher obtains a 95% confidence interval for β1,
the coefficient of the variable $ Per Pupil, as approximately:
A) −0.012169 ± 0.0031.
B) −0.012169 ± 0.0052.
C) −0.012169 ± 0.0062.
D) −0.012169 ± 0.0083.

Ans: C

19. The first researcher concluded that because the coefficient for the variable $ Per Pupil
was positive in his results, spending additional money on students would have a positive
effect on SAT-M scores. This researcher therefore recommended more money be spent
on students. The second researcher concluded that because the coefficient for the
variable $ Per Pupil was negative in his results, spending additional money on students
would have a negative effect on SAT-M scores. This researcher therefore recommended
less money be spent on students. Even though the researchers used the same data, these
two conclusions are different because:
A) an error must have been made by one of the researchers.
B) both researchers failed to take into account that in their analyses β1, the coefficient
of the variable $ Per Pupil was not statistically significant at even the 0.10
significance level. Hence neither researcher could conclude that β1 was
significantly different from 0.
C) the researchers did not use the same set of explanatory variables in their models.
D) there must have been an influential observation in the data, rendering the analyses
inappropriate.
Ans: C

20. The value of R² when all three variables are included in the model is 0.769. The value
of R² when only $ Per Pupil is included in the model is 0.235. Using this information,
the value of the F statistic to determine if % Taking or Avg Salary helps explain SAT-M
in a model that already contains $ Per Pupil is:
A) 12.55.
B) 25.10.
C) 50.20.
D) 54.32.
Ans: D
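
Question 20 uses the F statistic for comparing nested models, F = [(R²_full − R²_reduced)/q] / [(1 − R²_full)/(n − p − 1)], where q is the number of coefficients being tested and n − p − 1 is the error degrees of freedom of the full model. The sketch below is illustrative only; the function name `partial_f` is hypothetical.

```python
def partial_f(r2_full: float, r2_reduced: float, q: int, df_error_full: int) -> float:
    """F statistic for testing that q coefficients in the full model are all zero."""
    numerator = (r2_full - r2_reduced) / q
    denominator = (1 - r2_full) / df_error_full
    return numerator / denominator


# Full model: 3 predictors, 51 observations, so 47 error df.
# Two coefficients (% Taking, Avg Salary) are being tested.
f_sat = partial_f(0.769, 0.235, q=2, df_error_full=47)
print(f"F = {f_sat:.2f}")
```

The result is approximately 54.32 on 2 and 47 degrees of freedom.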

21. In a multiple regression analysis involving seven independent variables and 235 data
points, the degrees of freedom associated with the sum of squares for residual error
(SSE) is:
A) 235.
B) 234.
C) 228.
D) 227.
Ans: D
22. In multiple regression analysis, if the model provides good fit, this indicates that:

A) the sum of squares for error will be small.


B) the value of the regression standard error will be small.
C) the squared multiple regression correlation value will be close to 1 or −1.
D) all of the above choices are correct.
Ans: D

11.3 MULTIPLE REGRESSION MODEL BUILDING

Use the following to answer questions 23 and 24:

Based on a sample of the salaries of professors at a major university, you have performed a
multiple regression relating salary to years of service and gender. The estimated multiple linear
regression model is:

Salary = $45,000 + $3000(Years) + $4000(Gender) + $1000[(Years)(Gender)]

where Gender = 1 if the professor is male and Gender = 0 if the professor is female.

23. Using the multiple linear regression equation, you would estimate the average difference
in the salaries of a male professor with three years of service and female professor with
three years of service to be:
A) $3000.
B) $4000.
C) $5000.
D) $7000.
Ans: D

24. Using the multiple linear regression equation, you would estimate the average salary of
male professors with three years of experience to be:

A) $53,000.
B) $54,000.
C) $58,000.
D) $61,000.
Ans: D
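
Because the model contains a Years × Gender interaction, the male/female gap depends on years of service: it equals $4000 + $1000 × Years. A small sketch (the function name `predicted_salary` is hypothetical) makes this visible for questions 23 and 24:

```python
def predicted_salary(years: float, gender: int) -> float:
    """Fitted model with interaction; gender = 1 for male, 0 for female."""
    return 45_000 + 3_000 * years + 4_000 * gender + 1_000 * years * gender


male_3 = predicted_salary(3, 1)
female_3 = predicted_salary(3, 0)
gap_3 = male_3 - female_3

print(f"Male, 3 years:   ${male_3:,.0f}")
print(f"Female, 3 years: ${female_3:,.0f}")
print(f"Difference:      ${gap_3:,.0f}")
```

At three years the predicted salaries are $61,000 and $54,000, a difference of $7,000.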

Use the following to answer questions 25–33:


Are more selective colleges more expensive? This is a question asked by students enrolled in a
statistics course at a liberal arts college. To answer this question, the students collected
information on a random sample of 41 liberal arts colleges across the nation. They considered the
regression model given below.

COST = β0 + β1 PRIVATE + β2 ADM_RATE

Variable Description
COST Average cost of attendance
PRIVATE Is the school private? (0 = no, 1 = yes)
ADM_RATE Admissions rate

Below is the output the students obtained from their favorite statistical software package.

Term         Estimate    Std. Error    t value
Intercept    33670.89    5486.87       6.137
PRIVATE      22713.66    3566.25       6.369
ADM_RATE     -186.60     68.73         -2.715

R² = 0.6009
Adj. R² = 0.5799
s = 8864

25. What is the response variable?


A) COST
B) PRIVATE
C) ADM_RATE
D) Not enough information
Ans: A

26. Which best describes the association between average cost and the admissions rate, based
on a correlation of r = −0.418?
A) positive, linear, moderate
B) negative, linear, moderate
C) negative, linear, strong
D) no clear association
Ans: B

27. What is the regression standard error for the fitted model?
A) 5486.87
B) 8864
C) 0.6009
D) 5799
Ans: B

28. What proportion of the variation in the average cost of attendance does this multiple
regression model explain?
A) .6009
B) .6800
C) .5799
D) .8864
Ans: A

29. What is the interpretation of the estimated slope associated with PRIVATE?
A) For each 1-unit increase in private, the average cost of attendance increases by
$22,713.66, holding the admissions rate constant.
B) The average cost of attendance for a private school is $22,713.66 higher than the
average cost for a public school, assuming that they have the same admissions
rate.
C) Private schools are $22,713.66 more expensive, on average.
D) None of the above.
Ans: B

30. What is the interpretation of the estimated slope associated with ADM_RATE?
A) A 1% increase in the admissions rate causes the average cost of attendance to
decrease by $186.60.
B) As the cost of attendance decreases by $186.60, the admissions rate increases by
1%.
C) A 1% decrease in the admissions rate is associated with a $186.60 decrease in the
average cost of attendance, holding all else constant.
D) A 1% increase in the admissions rate is associated with a $186.60 decrease in the
average cost of attendance, holding all else constant.
Ans: D

31. What do you predict the cost of attendance is for a private school that has a 17%
admission rate?
A) $30,498.69
B) $53,212.35
C) $56,197.95
D) $419,616.50
Ans: B
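
The prediction in question 31 can be checked as follows. The sketch assumes ADM_RATE is measured in percentage points (17 for a 17% admission rate), which is the scale that reproduces the keyed answer; the function name `predict_cost` is hypothetical.

```python
def predict_cost(private: int, adm_rate: float) -> float:
    """Fitted model; adm_rate is assumed to be in percentage points (17 means 17%)."""
    return 33670.89 + 22713.66 * private - 186.60 * adm_rate


cost = predict_cost(private=1, adm_rate=17)
print(f"${cost:,.2f}")
```

The predicted cost is about $53,212.35, matching option B.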

32. A plot of the residuals versus fitted values is given below. What does this plot reveal
about the appropriateness of the assumptions necessary for inference?
A) The plot is not useful since it just looks like random scatter.
B) The spread of the residuals is not constant across admissions rate.
C) The spread of the residuals is constant across admissions rate.
D) There are many troubling outliers.

Ans: C

If we wished to determine the overall utility of the model, what hypotheses should we
test?
A) H0: β2 = 0 versus Ha: β2 ≠ 0
B) H0: β1 = 0 versus Ha: β1 > 0
C) H0: β1 = β2 = 0 versus Ha: at least one βj ≠ 0
D) H0: β1 = β2 = 0 versus Ha: at least one βj > 0
Ans: C

33. A 95% confidence interval for the slope β2 of the regression line is:
A) 22713.66 ± 2.021×8864.
B) −186.60 ± 2.024×68.73.
C) 22713.66 ± 1.697×3566.25.
D) −186.60 ± 2.021×68.73.
Ans: B

Use the following to answer questions 34–39:

Determining the sale price of a home is an important task for city assessors as it helps the city
project future tax revenue. Using regression models based on the physical characteristics of a
home to predict the sale price is standard practice for many assessors. A random sample of 724 homes
sold in Ames, Iowa, between 2006 and 2010 was obtained to build such a model for the city of
Ames. The assessor considered the following variables in their initial model:

Variable Description
LotArea Lot size (in thousands of square feet)
LivingArea Living space (in thousands of square feet)
Bedrooms Number of bedrooms
Rooms Number of rooms
Fireplaces Number of fireplaces
Bath Number of bathrooms
Age Age of the home (in years)
Price Sale price of the home (in thousands of dollars)

Below is the output obtained from the statistical software.

               Estimate    Std Error    t value    Pr(>|t|)
(Intercept)    100.55      6.167        16.31      < 0.0001
LotArea        0.99        0.144        6.86       < 0.0001
LivingArea     112.97      5.600        20.17      < 0.0001
Bedrooms       −16.35      2.366        −6.91      < 0.0001
Rooms          −0.51       1.667        −0.30      0.7613
Fireplaces     12.80       2.206        5.80       < 0.0001
Bath           −13.60      3.459        −3.93      < 0.0001
Age            −0.96       0.054        −17.86     < 0.0001

Number of Observations    Residual Std Error    R²        Adjusted R²
724                       33.69                 0.7686    0.7663

34. What is the response variable?


A) lot area
B) living area
C) age
D) sale price
Ans: D

35. What proportion of the variation in sale price does this multiple regression model
explain?
A) .3369
B) .7686
C) .7663
D) 0.2314
Ans: B

36. What is the interpretation of the slope associated with Bedrooms?


A) For each additional bedroom, the sales price of the house will decrease by
$16,350.
B) For each additional bedroom, the sales price of the house will decrease by
$16,350, holding all other variables constant.
C) For each additional bedroom, the expected sales price of the house will decrease by
$16,350, holding all other variables constant.
D) For each additional $1,000 in sales price, the average number of bedrooms in the
house decreases by 16.35, holding all other variables constant.
Ans: C

37. Calculate a 98% confidence interval for the slope associated with Bedrooms.
A) −16.35 ± 1.984×2.366
B) −16.35 ± 2.081×33.69
C) 16.35 ± 2.33×2.366
D) −16.35 ± 2.364×2.366
Ans: D

38. What is the predicted sale price of a 28-year-old 2,784 ft² house on a 16,692 ft² lot with 5
bedrooms, 12 rooms, 2 fireplaces, and 3.5 bathrooms? (The options below are rounded to
the nearest dollar.)
A) $100,550
B) $294,834
C) $330,997
D) $619,534
Ans: B
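
Question 38 requires converting the areas to thousands of square feet before plugging them into the fitted equation, since that is how LotArea and LivingArea are defined. The sketch below copies the estimates from the output above; the dictionary layout is just one convenient way to organize the arithmetic. With these inputs the prediction comes out to about $294,834, which corresponds to option B.

```python
# Coefficients copied from the output above (Price is in thousands of dollars).
coef = {
    "Intercept": 100.55,
    "LotArea": 0.99,
    "LivingArea": 112.97,
    "Bedrooms": -16.35,
    "Rooms": -0.51,
    "Fireplaces": 12.80,
    "Bath": -13.60,
    "Age": -0.96,
}

# The house from question 38; areas are converted to thousands of square feet.
house = {
    "Intercept": 1.0,
    "LotArea": 16692 / 1000,
    "LivingArea": 2784 / 1000,
    "Bedrooms": 5,
    "Rooms": 12,
    "Fireplaces": 2,
    "Bath": 3.5,
    "Age": 28,
}

price_thousands = sum(coef[name] * house[name] for name in coef)
price_dollars = round(price_thousands * 1000)
print(f"Predicted sale price: ${price_dollars:,}")
```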

39. An experienced realtor believes that living area, the number of rooms, and age will
adequately describe the sales prices of a home. The model with only these variables has
R² = 0.7059 and Adj. R² = 0.7047. Do you agree?
A) No, R² is higher in the original model.
B) Yes, R² is not significantly higher in the original model.
C) No, the F statistic is (approximately) 48.5, indicating that the full model explains
significantly more of the variability in the model.
D) Yes, the F statistic is (approximately) 48.5, indicating that the reduced model is
significantly better.
Ans: C
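
The F statistic of about 48.5 in question 39 comes from comparing the full seven-variable model with the realtor's three-variable model. A minimal sketch, assuming the four dropped variables are LotArea, Bedrooms, Fireplaces, and Bath, and using the R² values quoted above:

```python
n = 724                    # number of homes
k_full, k_reduced = 7, 3   # predictors in the full and reduced models
r2_full, r2_reduced = 0.7686, 0.7059

q = k_full - k_reduced     # number of coefficients being tested
df_error = n - k_full - 1  # error degrees of freedom in the full model

f_stat = ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_error)
print(f"F = {f_stat:.1f} on {q} and {df_error} df")
```

An F this large on 4 and 716 degrees of freedom indicates that the dropped variables contribute significantly, in line with option C.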
