Chapter 9
Regression Analysis
1. a. Y = 250 + 3 X
b. Functional. For a given value of X there is one unique value of Y.
2. The model with the highest R2 might actually "overfit" the data and not provide accurate predictions. The
R2 statistic can be inflated (or made arbitrarily large) by including superfluous independent variables in the
model. If this happens the predictive ability of the model will actually be degraded since the model is
biased toward sample specific anomalies in the data that may not be characteristic of the underlying
population from which the sample was drawn.
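The inflation of R2 by superfluous variables can be illustrated with a small simulation. The sketch below is not part of the original solution: the data are synthetic, and the helper names (solve, r2_and_adjusted) are hypothetical. It fits ordinary least squares through the normal equations and adds pure-noise predictors one at a time.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def r2_and_adjusted(columns, y):
    """Fit OLS with an intercept; return (R2, adjusted R2)."""
    n, k = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = k + 1
    xtx = [[sum(r[a] * r[c] for r in rows) for c in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    ess = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares
    y_bar = sum(y) / n
    tss = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
    r2 = 1.0 - ess / tss
    return r2, 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

random.seed(0)  # fixed seed so the run is repeatable
n_obs = 30
x = [random.uniform(0, 10) for _ in range(n_obs)]
y = [250 + 3 * xi + random.gauss(0, 5) for xi in x]   # synthetic data, one real predictor
junk = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(5)]  # unrelated to y

results = [r2_and_adjusted([x] + junk[:extra], y) for extra in range(6)]
for extra, (r2, adj) in enumerate(results):
    print(f"{extra} junk variables: R2 = {r2:.4f}, adjusted R2 = {adj:.4f}")
```

R2 cannot decrease as variables are added, even though the added variables carry no information about Y. Adjusted R2 penalizes each extra variable, so it rises more slowly and can fall.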
4. The solution would be unbounded. For virtually any regression problem, the sum of the estimation errors
can be made to approach -∞ by selecting a regression function such that the estimated values are far greater
than the actual values. Even if we place a lower bound of zero on the sum of the estimation errors, a
regression function with a sum of estimation errors equal to zero will not necessarily fit the data well.
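Both points can be checked numerically. The sketch below uses a tiny data set invented for illustration (it does not come from the text) in which Y = 2X exactly, and it assumes the estimation error is defined as actual minus estimated.

```python
# Hypothetical data, invented for illustration: the points lie exactly on Y = 2X.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

def sum_errors(b0, b1):
    """Sum of estimation errors (actual minus estimated) for Y-hat = b0 + b1*X."""
    return sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))

def sum_squared_errors(b0, b1):
    """Sum of squared estimation errors for the same line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Raising the intercept pushes every estimate above the actual value,
# so the sum of errors falls without limit.
for b0 in (0.0, 10.0, 1000.0, 1e6):
    print(f"b0 = {b0:>9}: sum of errors = {sum_errors(b0, 2.0)}")

# A flat line at the mean of Y has a zero sum of errors but fits poorly.
y_bar = sum(y) / len(y)
print("flat line:", sum_errors(y_bar, 0.0), sum_squared_errors(y_bar, 0.0))
print("true line:", sum_errors(0.0, 2.0), sum_squared_errors(0.0, 2.0))
```

The flat line and the true line both have a sum of errors of zero, but only the sum of squared errors distinguishes the good fit from the bad one.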
5. You should collect data so that the average value of the X1 observations is equal to X1h.
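The reason can be seen in the standard error of a prediction for simple linear regression. The expression below is the usual textbook form rather than something stated in this solution; S_p (standard error of the prediction), S_e (standard error of the estimate), n (number of observations), and the bar notation for the sample mean are symbols introduced here for illustration.

```latex
S_p = S_e \sqrt{1 + \frac{1}{n} + \frac{(X_{1h} - \bar{X}_1)^2}{\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)^2}}
```

The last term under the square root is zero when the sample mean of X1 equals X1h, so for a given S_e and n the prediction at X1h is then as precise as it can be.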
20. a. See file: Prb9_20.xlsx. Age and Milage both have fairly strong negative linear relations with price.
The geographic variables all have fairly weak linear relations with price, with the possible exception of
the indicator variable for the Mid-Atlantic region.
b. The highest R2 value is obtained using all of the variables. R2 = 0.722. Estimated Price = 20339.05 -
1070.75 Age - 0.0645 Milage - 1073.97 West + 2156.02 South + 3318.418 Northeast - 1746.53 Mid-Atlantic
+ 2153.14 Southwest
c. Adjusted R2 = 0.672. Estimated Price = 19346.321 - 1051.9795 Age - 0.0685759 Milage + 3237.1774
South + 4354.1414 Northeast + 3229.119 Southwest
d. See file: Prb9_20.xlsx.
e. The condition of the car, whether or not it is a convertible, etc.
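As a rough illustration of how the regional indicator variables in part c enter a prediction, the sketch below evaluates that equation for a made-up car. The car's age, mileage, and region are hypothetical, the function name is an assumption, Age is assumed to be in years and Milage in miles, and the Northeast term is read as 4354.1414 × Northeast.

```python
# Coefficients copied from part c ("Milage" is spelled as in the solution).
INTERCEPT = 19346.321
AGE_COEF = -1051.9795
MILAGE_COEF = -0.0685759

# Indicator-variable coefficients. Regions absent from the part c model
# fall into the baseline and add nothing to the estimate.
REGION_EFFECT = {
    "South": 3237.1774,
    "Northeast": 4354.1414,
    "Southwest": 3229.119,
}

def estimated_price(age, milage, region):
    """Evaluate the part c regression equation for one car."""
    price = INTERCEPT + AGE_COEF * age + MILAGE_COEF * milage
    # An indicator variable is 1 for the car's own region and 0 otherwise,
    # so at most one regional coefficient is added.
    return price + REGION_EFFECT.get(region, 0.0)

# Hypothetical car: 5 years old, 60,000 miles.
south = estimated_price(5, 60000, "South")
baseline = estimated_price(5, 60000, "West")
print(f"South: {south:.2f}  baseline region: {baseline:.2f}")
print(f"difference: {south - baseline:.4f}")
```

The difference between the two estimates equals the South coefficient. An indicator coefficient is the estimated price difference relative to the regions left out of the model, holding Age and Milage fixed.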
24. a. See file: Prb9_24.xlsx. There is a fairly strong linear relation between avg outside temperature and
average heating cost. There is also a fairly strong linear relation between square footage and average
heating cost. There is somewhat of a linear relation between age of furnace and average heating cost.
There is not much of a relation between the amount of attic insulation and average heating cost.
b. Square footage has the strongest linear relation with average heating cost.
c. Attic insulation and square footage.
d. Attic insulation, age of the furnace, and square footage.
e. Ŷ = -29.218 - 1.178 X1 - 6.895 X2 + 3.213 X3 + 0.149 X4
f. Ŷ = -152.037 - 4.8923×4 + 3.698×5 + 0.1815×2500 = 300.801
Approximate Lower confidence limit = 300.801 – 2 × 34.985 = 230.84
Approximate Upper confidence limit = 300.801 + 2 × 34.985 = 370.78
We can be approximately 95% confident that the actual average heating cost for a home with 4 inches
of insulation, a 5-year old furnace, and 2500 square feet will be between about $230 and $370.
1. There doesn’t appear to be much of a linear relationship between price and color or price and clarity. There
appears to be a positive correlation between price and carats.
2. The model with X3 & X2 as independent variables has the highest adjusted R2 value (69.4%).
3. Ŷ = 42.05 - 165.35 X2 + 1688.0 X3, R2 = 72.3%, adjusted R2 = 69.4%.
4. See file: Case9_1.xlsx. The eighth set of earrings appears to be the most underpriced value.
5. The model with X3 & X2 as independent variables has the highest adjusted R2 value (71.4%).
6. Ŷ = 14.807 - 2.65 X2 + 28.13 X3, R2 = 74.1%, adjusted R2 = 71.4%.
7. See file: Case9_1.xlsx. The fourth and fifth sets of earrings appear to be the most underpriced values.
8. Step-wise regression leads to the selection of a model with X3 & X6 with an adjusted R2 of 73.3%.
9. Ŷ =-25.33+15.03X1+63.8X3-1.16X4-12.68X5, R2 = 81.6%, adjusted R2 = 77.3%.
10. See file: Case9_1.xlsx. The eighth set of earrings appears to be the most underpriced value. (In this final
analysis, it is interesting to note that the earrings selected as the best value in part d now appear to be far less
of a bargain. It is also instructive to note that step-wise regression is a heuristic that does not
necessarily lead to the best multiple regression model.)
1. 1725 votes
2. See file: Case9_2.xlsx. Palm Beach county appears to be an outlier.
3. Buchanan votes = 109.7673 + 0.002541 × Gore Votes
4. R2 = 0.630. Approximately 63% of the total variation in the votes received by Buchanan in each county is
accounted for by the number of votes received by Gore in the same county.
5. Buchanan votes = 109.7673 + 0.002541 × 268945 = 793.15
Lower confidence limit = 793.15 – 2.576 × 137.32 = 439.32
Upper confidence limit = 793.15 + 2.576 × 137.32 = 1146.87
With approximately 99% confidence, we would expect Buchanan to receive between 439 and 1147 votes in
Palm Beach county if Gore received 268945 votes.
6. See file: Case9_2.xlsx. Palm Beach county appears to be an outlier.
7. Buchanan votes = 66.09 + 0.003478 × Bush Votes
8. R2 = 0.753. Approximately 75.3% of the total variation in the votes received by Buchanan in each county
is accounted for by the number of votes received by Bush in the same county.
9. Buchanan votes = 66.09 + 0.003478 × 152846 = 597.71
Approximate Lower confidence limit = 597.71 – 2.576 × 112.18 = 308.73
Approximate Upper confidence limit = 597.71 + 2.576 × 112.18 = 886.68
With approximately 99% confidence, we would expect Buchanan to receive between 309 and 887 votes in
Palm Beach county if Bush received 152846 votes.
10. The results suggest that something quite unexpected happened with the votes in Palm Beach county. We
cannot really say much more than that, but it does appear that whatever happened may have cost Al Gore
the election. This analysis assumes that the voting patterns observed in other counties of Florida are
representative of voting patterns in Palm Beach county – which may or may not be the case.
3. Ŷ = 33.3205 + 15.0159 X1
4. R2 = 0.8735. Approximately 87.4% of the total variation in line maintenance expense is accounted for
using this model.
5. Ŷ = 33.3205 + 15.0159 (75) = 1,159.51
6. An approximate 95% confidence interval for a new line maintenance expense at this number of customers
is given by:
Approximate 95% Lower Confidence Limit = 1,159.51 - 2×187.713 = $784.09 (in 000s)
Approximate 95% Upper Confidence Limit = 1,159.51 + 2×187.713 = $1,534.95 (in 000s)
Thus, it would not be unexpected for a company with 75,000 customers to show a line maintenance charge
of $1,500,000.
7. See file: Case9_3.xlsx. There seems to be some systematic variation that is not being accounted for by
the linear model.
8. Ŷ = 707.47 - 7.392 X1 + 0.1543 (X1)^2
9. R2 = 0.9416. Approximately 94.2% of the total variation in line maintenance expense is accounted for
using this model.
10. Adjusted R2 = 0.9286. This is larger than the R2 and adjusted R2 statistics from the linear model, implying that
the addition of the quadratic term in the model served a useful purpose.
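The adjusted R2 statistic can be computed from R2 with the standard formula 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The sketch below is illustrative only: the sample size is not stated in this solution, so n = 12 is an assumption made here, and the function name is hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for a model with k independent variables fit to n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Quadratic model: R2 = 0.9416 with k = 2 terms (X1 and X1 squared).
# n = 12 is an ASSUMED sample size, not a value given in the solution.
quadratic_adj = adjusted_r2(0.9416, n=12, k=2)
print(f"adjusted R2 = {quadratic_adj:.4f}")
```

Under that assumed sample size the formula gives about 0.9286. Because the penalty factor (n - 1)/(n - k - 1) grows with k, an added term raises adjusted R2 only if it improves R2 by more than the penalty costs.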
11. Ŷ = 707.47 - 7.392(75) + 0.1543(75)^2 = 1,021.024
12. See file: Case9_3.xlsx. The quadratic model seems to fit the data much better.
13. An approximate 95% confidence interval for a new line maintenance expense at this number of customers
is given by:
Approximate 95% Lower Confidence Limit = 1,021.024 - 2×134.407 = $752.209 (in 000s)
Approximate 95% Upper Confidence Limit = 1,021.024 + 2×134.407 = $1,289.838 (in 000s)
Thus, it would appear unusual for a company with 75,000 customers to show a line maintenance charge of
$1,500,000 and Nolan might wish to investigate the cause of this further.
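The two interval calculations in this case follow the same rule of thumb, Ŷ ± 2 × (standard error). A minimal sketch is shown below, using the point predictions and standard errors quoted above; the function names are hypothetical. Small last-digit differences from the figures in the text come from rounding of the inputs.

```python
def approx_prediction_interval(y_hat, std_error, multiplier=2.0):
    """Approximate interval: y_hat plus or minus multiplier * std_error."""
    half_width = multiplier * std_error
    return y_hat - half_width, y_hat + half_width

def is_within(value, interval):
    """True if value lies inside the closed interval (lower, upper)."""
    lower, upper = interval
    return lower <= value <= upper

# Values quoted in the solution, in $000s, for 75 (thousand) customers.
linear = approx_prediction_interval(1159.51, 187.713)
quadratic = approx_prediction_interval(1021.024, 134.407)
reported = 1500.0  # the $1,500,000 line maintenance charge

for name, interval in (("linear", linear), ("quadratic", quadratic)):
    lower, upper = interval
    verdict = "inside" if is_within(reported, interval) else "outside"
    print(f"{name}: ({lower:.3f}, {upper:.3f}); {reported:.0f} is {verdict}")
```

The same charge falls inside the linear model's interval but outside the quadratic model's, which is why the conclusion about the $1,500,000 charge changes once the better-fitting model is used.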