
REVISED

M04_REND6289_10_IM_C04.QXD 5/7/08 2:49 PM Page 46

CHAPTER 4

Regression Models

TEACHING SUGGESTIONS
Teaching Suggestion 4.1: Which Is the Independent Variable? We find that students are often confused about which variable is independent and which is dependent in a regression model. For example, in Triple A's problem, clarify which variable is X and which is Y. Emphasize that the dependent variable (Y) is what we are trying to predict based on the value of the independent variable (X). Use examples such as the time required to drive to a store and the distance traveled, the total number of units sold and the selling price of a product, and the cost of a computer and the processor speed.

Teaching Suggestion 4.2: Statistical Correlation Does Not Always Mean Causality. Students should understand that a high R² doesn't always mean one variable will be a good predictor of the other. Explain that skirt lengths and stock market prices may be correlated, but raising one doesn't necessarily mean the other will go up or down. An interesting study indicated that, over a 10-year period, the salaries of college professors were highly correlated with the dollar sales volume of alcoholic beverages (both were actually correlated with inflation).

Teaching Suggestion 4.3: Give students a set of data and have them plot the data and manually draw a line through the data. A discussion of which line is "best" can help them appreciate the least squares criterion.

Teaching Suggestion 4.4: Select some randomly generated values for X and Y (you can use random numbers from the random number table in Chapter 15 or use the RAND function in Excel). Develop a regression line using Excel and discuss the coefficient of determination and the F-test. Students will see that a regression line can always be developed, but it may not necessarily be useful.

Teaching Suggestion 4.5: A discussion of the long formulas and short-cut formulas that are provided in the appendix is helpful. The long formulas provide students with a better understanding of the meaning of the SSE and SST. Since many people use computers for regression problems, it helps to see the original formulas. The short-cut formulas are helpful if students are performing the computations on a calculator.
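The exercise in Teaching Suggestion 4.4 can also be run outside Excel. The following Python sketch is one possible way to do it, not part of the text: the function name simple_regression, the seed, the sample size of 20, and the 0 to 100 range are all arbitrary choices made for illustration. It fits a least squares line to unrelated random values and reports r² and F.

```python
import random


def simple_regression(x, y):
    """Return slope, intercept, r-squared and F for the line Y = b0 + b1*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = sst - sse
    r2 = ssr / sst
    # One independent variable (k = 1), so MSR = SSR/1 and MSE = SSE/(n - 2).
    f_stat = ssr / (sse / (n - 2))
    return b1, b0, r2, f_stat


# Hypothetical classroom data: X and Y are drawn independently of each other.
rng = random.Random(4)  # fixed, arbitrary seed so the run is repeatable
x = [rng.uniform(0, 100) for _ in range(20)]
y = [rng.uniform(0, 100) for _ in range(20)]

b1, b0, r2, f_stat = simple_regression(x, y)
print(f"Y-hat = {b0:.2f} + {b1:.3f}X   r^2 = {r2:.3f}   F = {f_stat:.2f}")
```

A line is always produced, which is the point of the exercise; students can then judge from r² and F whether it is of any use.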

ALTERNATIVE EXAMPLES

Alternative Example 4.1: The sales manager of a large apartment rental complex feels the demand for apartments may be related to the number of newspaper ads placed during the previous month. She has collected the data shown in the accompanying table.

Ads purchased (X)        15    9   40   20   25   25   15   35
Apartments leased (Y)     6    4   16    6   13    9   10   16

We can find a mathematical equation by using the least squares regression approach.

Leases, Y    Ads, X    (X - X̄)²    (X - X̄)(Y - Ȳ)
    6          15          64            32
    4           9         196            84
   16          40         289           102
    6          20           9            12
   13          25           4             6
    9          25           4            -2
   10          15          64             0
   16          35         144            72
ΣY = 80    ΣX = 184    Σ(X - X̄)² = 774    Σ(X - X̄)(Y - Ȳ) = 306

Ȳ = 80/8 = 10; X̄ = 184/8 = 23
b1 = 306/774 = 0.395
b0 = 10 - 0.395(23) = 0.915

The estimated regression equation is
Ŷ = 0.915 + 0.395X
or
Apartments leased = 0.915 + 0.395(ads placed)

If the number of ads is 30, we can estimate the number of apartments leased with the regression equation
0.915 + 0.395(30) = 12.76, or 13 apartments

Alternative Example 4.2: Given the data on ads and apartment rentals in Alternative Example 4.1, find the coefficient of determination. The following have been computed in the table that follows: SST = 150; SSE = 29.02; SSR = 120.76. (Note: Round-off error may cause this to be slightly different than a computer solution.)

    Y         X       (Y - Ȳ)²    Ŷ = 0.915 + 0.395X    (Y - Ŷ)²    (Ŷ - Ȳ)²
   6.00     15.00        16             6.84              0.706       9.986
   4.00      9.00        36             4.47              0.221      30.581
  16.00     40.00        36            16.715             0.511      45.091
   6.00     20.00        16             8.815             7.924       1.404
  13.00     25.00         9            10.79              4.884       0.624
   9.00     25.00         1            10.79              3.204       0.624
  10.00     15.00         0             6.84              9.986       9.986
  16.00     35.00        36            14.74              1.588      22.468
  80.00    184.00    SST = 150.00      80.00          SSE = 29.02   SSR = 120.76

From this, the coefficient of determination is
r² = SSR/SST = 120.76/150 = 0.81

Alternative Example 4.3: For Alternative Examples 4.1 and 4.2, dealing with ads, X, and apartments leased, Y, compute the correlation coefficient. Since r² = 0.81 and the slope is positive (0.395), the positive square root of 0.81 is the correlation coefficient: r = 0.90.
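The arithmetic in Alternative Examples 4.1 through 4.3 can be checked with a few lines of Python. This is only a sketch using the ads and leases data given above; the variable names are illustrative. Because it carries unrounded coefficients, its SSE, SSR, and r² differ slightly from the hand-rounded table values, which is the round-off effect noted in Alternative Example 4.2.

```python
import math

ads = [15, 9, 40, 20, 25, 25, 15, 35]      # X, ads purchased
leases = [6, 4, 16, 6, 13, 9, 10, 16]      # Y, apartments leased

n = len(ads)
x_bar = sum(ads) / n       # 23
y_bar = sum(leases) / n    # 10

sxx = sum((x - x_bar) ** 2 for x in ads)                           # 774
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ads, leases))  # 306

b1 = sxy / sxx             # slope
b0 = y_bar - b1 * x_bar    # intercept

fitted = [b0 + b1 * x for x in ads]
sst = sum((y - y_bar) ** 2 for y in leases)                # total variation
sse = sum((y - f) ** 2 for y, f in zip(leases, fitted))    # unexplained
ssr = sum((f - y_bar) ** 2 for f in fitted)                # explained

r2 = ssr / sst
r = math.copysign(math.sqrt(r2), b1)   # r takes the sign of the slope

prediction_30_ads = b0 + b1 * 30

print(f"Y-hat = {b0:.3f} + {b1:.3f}X")
print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}")
print(f"r^2 = {r2:.2f}, r = {r:.2f}")
print(f"Predicted leases for 30 ads: {prediction_30_ads:.2f}")
```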

SOLUTIONS TO DISCUSSION QUESTIONS AND PROBLEMS

4-1. The term least-squares means that the regression line will minimize the sum of the squared errors (SSE). No other line will give a lower SSE.

4-2. Dummy variables are used when a qualitative factor such as the gender of an individual (male or female) is to be included in the model. Usually this is given a value of 1 when the condition is met (e.g., person is male) and 0 otherwise. When there are more than two levels or values for the qualitative factor, more than one dummy variable must be used. The number of dummy variables is one less than the number of possible values or categories. For example, if students are classified as freshmen, sophomores, juniors, and seniors, three dummy variables would be necessary.

4-3. The coefficient of determination (r²) is the square of the coefficient of correlation (r). Both of these give an indication of how well a regression model fits a particular set of data. An r² value of 1 would indicate a perfect fit of the regression model to the points. This would also mean that r would equal 1 or -1.

4-4. A scatter diagram is a plot of the data. This graphical image helps to determine if a linear relationship is present, or if another type of relationship would be more appropriate.

4-5. The adjusted r² value is used to help determine if a new variable should be added to a regression model. Generally, if the adjusted r² value increases when a new variable is added to a model, this new variable should be included in the model. If the adjusted r² value declines or does not increase when a new variable is added, then the variable should not be added to the model.

4-6. The F-test is used to determine if the overall regression model is helpful in predicting the value of the dependent variable (Y). If the F-value is large and the p-value or significance level is low, then we can conclude that there is a linear relationship and the model is useful, as these results would probably not occur by chance. If the significance level is high, then the model is not useful and the results in the sample could be due to random variations.

4-7. The SSE is the sum of the squared errors in a regression model. SST = SSE + SSR.

4-8. When the residuals (errors) are plotted after a regression line is found, the errors should be random and should not show any significant pattern. If a pattern does exist, then the assumptions may not be met or another model (perhaps nonlinear) would be more appropriate.

4-9.
a. Ŷ = 36 + 4.3(70) = 337
b. Ŷ = 36 + 4.3(80) = 380
c. Ŷ = 36 + 4.3(90) = 423

4-10.
a. [Scatter diagram: Demand (vertical axis, 0 to 12) versus TV Appearances (horizontal axis, 0 to 10)]


b.
TV Appearances   Demand
      X            Y      (X - X̄)²   (Y - Ȳ)²   (X - X̄)(Y - Ȳ)    Ŷ    (Y - Ŷ)²   (Ŷ - Ȳ)²
      3            3        6.25      12.25          8.75         4        1        6.25
      4            6        2.25       0.25          0.75         5        1        2.25
      7            7        2.25       0.25          0.75         8        1        2.25
      6            5        0.25       2.25         -0.75         7        4        0.25
      8           10        6.25      12.25          8.75         9        1        6.25
      5            8        0.25       2.25         -0.75         6        4        0.25
  ΣX = 33     ΣY = 39.0    17.5       29.5          17.5                  12       17.5
  X̄ = 5.5     Ȳ = 6.5                 (SST)                              (SSE)     (SSR)

SST = 29.5; SSE = 12; SSR = 17.5
b1 = 17.5/17.5 = 1
b0 = 6.5 - 1(5.5) = 1
The regression equation is Ŷ = 1 + 1X.

c. Ŷ = 1 + 1X = 1 + 1(6) = 7.

4-11. See the table for the solution to problem 4-10 to obtain some of these numbers.
MSE = SSE/(n - k - 1) = 12/(6 - 1 - 1) = 3
MSR = SSR/k = 17.5/1 = 17.5
F = MSR/MSE = 17.5/3 = 5.83
df1 = k = 1
df2 = n - k - 1 = 6 - 1 - 1 = 4
F0.05, 1, 4 = 7.71
Do not reject H0 since 5.83 < 7.71. Therefore, we cannot conclude there is a statistically significant relationship at the 0.05 level.

4-12. Using Excel, the regression equation is Ŷ = 1 + 1X. F = 5.83, and the significance level is 0.073. This is significant at the 0.10 level (0.073 < 0.10), but it is not significant at the 0.05 level. There is marginal evidence that there is a relationship between demand for drums and TV appearances.

4-13.
Fin. Ave. (Y)   Test 1 (X)   (X - X̄)²   (Y - Ȳ)²   (X - X̄)(Y - Ȳ)     Ŷ     (Y - Ŷ)²   (Ŷ - Ȳ)²
     93             98       285.235      196        236.444       91.5     2.264    156.135
     78             77        16.901        1          4.111       76       4.168      9.252
     84             88        47.457       25         34.444       84.1     0.009     25.977
     73             80         1.235       36          6.667       78.2    26.811      0.676
     84             96       221.679       25         74.444       90      36.188    121.345
     64             61       404.457      225        301.667       64.1     0.015    221.396
     64             66       228.346      225        226.667       67.8    14.592    124.994
     95             95       192.901      256        222.222       89.3    32.766    105.592
     76             69       146.679        9         36.333       70      35.528     80.291
    711            730      1544.9        998       1143                  152.341    845.659

b1 = 1143/1544.9 = 0.740
b0 = (711/9) - 0.740(730/9) = 18.99
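The slope and intercept just computed, and the sums of squares in the table, can be reproduced from the raw scores. The Python sketch below is illustrative only; the variable names are assumptions, and the critical value 5.59 is simply the tabled F(0.05; 1, 7) quoted in the solution to problem 4-14 rather than something the code derives.

```python
final_avg = [93, 78, 84, 73, 84, 64, 64, 95, 76]   # Y, final average
test1 = [98, 77, 88, 80, 96, 61, 66, 95, 69]       # X, first test score

n = len(test1)
k = 1                                   # number of independent variables
x_bar = sum(test1) / n
y_bar = sum(final_avg) / n

sxx = sum((x - x_bar) ** 2 for x in test1)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(test1, final_avg))
sst = sum((y - y_bar) ** 2 for y in final_avg)

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(test1, final_avg))
ssr = sst - sse

mse = sse / (n - k - 1)                 # mean squared error
msr = ssr / k                           # mean squared regression
f_stat = msr / mse

F_CRITICAL = 5.59                       # F(0.05; 1, 7), taken from the F table
significant = f_stat > F_CRITICAL

print(f"b1 = {b1:.3f}, b0 = {b0:.2f}")
print(f"SSE = {sse:.3f}, SSR = {ssr:.3f}, F = {f_stat:.2f}")
print("Significant at the 0.05 level" if significant else "Not significant at the 0.05 level")
```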


a. Ŷ = 18.99 + 0.74X
b. Ŷ = 18.99 + 0.74(83) = 80.41
c. r² = SSR/SST = 845.659/998 = 0.85; r = 0.92; this means that 85% of the variability in the final average can be explained by the variability in the first test score.

4-14. See the table for the solution to problem 4-13 to obtain some of these numbers.
MSE = SSE/(n - k - 1) = 152.341/(9 - 1 - 1) = 21.76
MSR = SSR/k = 845.659/1 = 845.659
F = MSR/MSE = 845.659/21.76 = 38.9
df1 = k = 1
df2 = n - k - 1 = 9 - 1 - 1 = 7
F0.05, 1, 7 = 5.59
Because 38.9 > 5.59, we can conclude (at the 0.05 level) that there is a statistically significant relationship between the first test grade and the final average.

4-15. F = 38.86; the significance level = 0.0004 (which is extremely small), so there is definitely a statistically significant relationship.

4-16.
a. Ŷ = 13,473 + 37.65(1,860) = $83,502.
b. The predicted average selling price for a house this size would be $83,502. Some will sell for more and some will sell for less. There are other factors besides size that influence the price of the house.
c. Some other variables that might be included are age of the house, number of bedrooms, and size of the lot. There are other factors in addition to these that one can identify.
d. The coefficient of determination (r²) = (0.63)² = 0.3969.

4-17. The multiple regression equation is Ŷ = $90.00 + $48.50X1 + $0.40X2.
a. Number of days on the road: X1 = 5; Distance traveled: X2 = 300 miles. The amount he may be expected to claim is
Ŷ = 90.00 + 48.50(5) + $0.40(300) = $452.50
b. The reimbursement request, according to the model, appears to be too high. However, this does not mean that it is not justified. The accountants should question Thomas Williams about his expenses to see if there are other explanations for the high cost.
c. A number of other variables should be included, such as the type of travel (air or car), conference fees if any, expenses for entertainment of customers, and other transportation (cab and limousine) expenses. In addition, the coefficient of correlation is only 0.68 and r² = (0.68)² = 0.46. Thus, about 46% of the variability in the cost of the trip is explained by this model; the other 54% is due to other factors.

4-18. Using computer software to get the regression equation, we get
Ŷ = 1.03 + 0.0034X
where Ŷ = predicted GPA and X = SAT score.
If a student scores 450 on the SAT, we get
Ŷ = 1.03 + 0.0034(450) = 2.56.
If a student scores 800 on the SAT, we get
Ŷ = 1.03 + 0.0034(800) = 3.75.

4-19.
a. A linear model is reasonable from the graph below.
[Scatter diagram: Ridership (100,000s) (vertical axis, 0 to 50) versus Tourists (Millions) (horizontal axis, 0 to 25)]
b. Ŷ = 5.060 + 1.593X
c. Ŷ = 5.060 + 1.593(10) = 20.99, or 2,099,000 people.
d. If there are no tourists, the predicted ridership would be 5.06 (100,000s) or 506,000. Because X = 0 is outside the range of values that were used to construct the regression model, this number may be questionable.

4-20. The F-value for the F-test is 52.6 and the significance level is extremely small (0.00002), which indicates that there is a statistically significant relationship between number of tourists and ridership. The coefficient of determination is 0.84, indicating that 84% of the variability in ridership from one year to the next could be explained by the variations in the number of tourists.

4-21.
a. Ŷ = 24,328 + 3026.67X1 + 6684X2
where Ŷ = predicted starting salary; X1 = GPA; X2 = 1 if business major, 0 otherwise.
b. Ŷ = 24,328 + 3026.67(3.0) + 6684(1) = $40,092.01.
c. The starting salary for business majors tends to be about $6,684 higher than for non-business majors in this sample, even after adjusting for variations in GPA.
d. The overall significance level is 0.099 and r² = 0.69. Thus, the model is significant at the 0.10 level and 69% of the variability in starting salary is explained by GPA and major. The model is useful in predicting starting salary.

4-22.
a. Let
Ŷ = predicted selling price
X1 = square footage
X2 = number of bedrooms
X3 = age
The model with square footage: Ŷ = 2367.26 + 46.60X1; r² = 0.65
The model with number of bedrooms: Ŷ = 1923.5 + 36137.76X2; r² = 0.36
The model with age: Ŷ = 147670.9 - 2424.16X3; r² = 0.78


All of these models are significant at the 0.01 level or less. The best model uses age as the independent variable. The coefficient of determination is highest for this model, and it is significant.

4-23. Ŷ = 5701.45 + 48.51X1 - 2540.39X2 and r² = 0.65.
Ŷ = 5701.45 + 48.51(2000) - 2540.39(3) = 95,100.28.
Notice the r² value is the same as it was in the previous problem with just square footage as the independent variable. Adding the number of bedrooms did not add any significant information that was not already captured by the square footage. It should not be included in the model. The r² for this model is lower than for age alone in the previous problem.

4-24. Ŷ = 82185.5 + 25.94X1 - 2151.7X2 - 1711.5X3 and r² = 0.89.
Ŷ = 82185.5 + 25.94(2000) - 2151.7(3) - 1711.5(10) = $110,495.4.

4-25. Ŷ = 3071.885 + 6.5326X where Y = DJIA and X = S&P. r = 0.84 and r² = 0.70.
Ŷ = 3071.885 + 6.5326(1100) = 10257.8 (rounded)

4-26. With one independent variable, beds, in the model, r² = 0.88. With just admissions in the model, r² = 0.974. When both variables are in the model, r² = 0.975. Thus, the model with only admissions as the independent variable is the best. Adding the number of beds had virtually no impact on r², and the adjusted r² decreased slightly. Thus, the best model is Ŷ = 1.518 + 0.6686X where Y = expense and X = admissions.

4-27. Using Excel with Y = MPG; X1 = horsepower; X2 = weight, the models are:
Ŷ = 53.87 - 0.269X1; r² = 0.77
Ŷ = 57.53 - 0.01X2; r² = 0.73.
Thus, the model with horsepower as the independent variable is better since r² is higher.

4-28. Ŷ = 57.69 - 0.17X1 - 0.005X2 where
Y = MPG
X1 = horsepower
X2 = weight
r² = 0.82.
This model is better because the coefficient of determination is much higher with both variables than it is with either one individually.

4-29. Let Y = MPG; X1 = horsepower; X2 = weight.
The model Ŷ = b0 + b1X1 + b2X1² is Ŷ = 69.93 - 0.620X1 + 0.001747X1² and has r² = 0.798.
The model Ŷ = b0 + b3X2 + b4X2² is Ŷ = 89.09 - 0.0337X2 + 0.0000039X2² and has r² = 0.800.
The model Ŷ = b0 + b1X1 + b2X1² + b3X2 + b4X2² is Ŷ = 89.2 - 0.51X1 + 0.001889X1² - 0.01615X2 + 0.00000162X2² and has r² = 0.883. This model has a higher r² value than the model in 4-28. A graph of the data would show a nonlinear relationship.

4-30. If SAT median score alone is used to predict the cost, we get
Ŷ = -7793.1 + 21.8X1 with r² = 0.22.
If both SAT and a dummy variable (X2 = 1 for private, 0 otherwise) are used to predict the cost, we get r² = 0.79. The model is
Ŷ = 7121.8 + 5.16X1 + 9354.99X2.
This says that a private school tends to be about $9,355 more expensive than a public school when the median SAT score is used to adjust for the quality of the school. The coefficient of determination indicates that about 79% of the variability in cost can be explained by these factors. The model is significant at the 0.001 level.

4-31. Ŷ = 67.8 + 0.0145X
There is a significant relationship between the number of victories (Y) and the payroll (X) at the 0.054 level, which is marginally significant. However, r² = 0.24, so the relationship is not very strong. Only about 24% of the variability in victories is explained by this model.

4-32.
a. Ŷ = 42.43 + 0.0004X
b. Ŷ = 31.54 + 0.0058X
c. The correlation coefficient for the first stock is only 0.19 while the correlation coefficient for the second is 0.96. Thus, there is a much stronger correlation between stock 2 and the DJI than there is for stock 1 and the DJI.

CASE STUDIES

SOLUTION TO NORTH-SOUTH AIRLINE CASE

Northern Airline Data
Year    Airframe Cost per Aircraft    Engine Cost per Aircraft    Average Age (Hours)
2001              51.80                        43.49                     6,512
2002              54.92                        38.58                     8,404
2003              69.70                        51.48                    11,077
2004              68.90                        58.72                    11,717
2005              63.72                        45.47                    13,275
2006              84.73                        50.26                    15,215
2007              78.74                        79.60                    18,390

Southeast Airline Data
Year    Airframe Cost per Aircraft    Engine Cost per Aircraft    Average Age (Hours)
2001              13.29                        18.86                     5,107
2002              25.15                        31.55                     8,145
2003              32.18                        40.43                     7,360
2004              31.78                        22.10                     5,773
2005              25.34                        19.69                     7,150
2006              32.78                        32.58                     9,364
2007              35.56                        38.07                     8,259

Utilizing QM for Windows, we can develop the following regression equations for the variables of interest.

Northern Airline, airframe maintenance cost:
Cost = 36.10 + 0.0025 (airframe age)
Coefficient of determination = 0.7694
Coefficient of correlation = 0.8771


Northern Airline, engine maintenance cost:
Cost = 20.57 + 0.0026 (airframe age)
Coefficient of determination = 0.6124
Coefficient of correlation = 0.7825

Southeast Airline, airframe maintenance cost:
Cost = 4.60 + 0.0032 (airframe age)
Coefficient of determination = 0.3904
Coefficient of correlation = 0.6248

Southeast Airline, engine maintenance cost:
Cost = -0.671 + 0.0041 (airframe age)
Coefficient of determination = 0.4599
Coefficient of correlation = 0.6782

The graphs below portray both the actual data and the regression lines for airframe and engine maintenance costs for both airlines. Note that the two graphs have been drawn to the same scale to facilitate comparisons between the two airlines.

Northern Airline: There seem to be modest correlations between maintenance costs and airframe age for Northern Airline. There is certainly reason to conclude, however, that airframe age is not the only important factor.

Southeast Airline: The relationships between maintenance costs and airframe age for Southeast Airline are much less well defined. It is even more obvious that airframe age is not the only important factor, and perhaps not even the most important factor.

Overall, it would seem that:
1. Northern Airline has the smallest variance in maintenance costs, indicating that the day-to-day management of maintenance is working pretty well.
2. Maintenance costs seem to be more a function of airline than of airframe age.
3. The airframe and engine maintenance costs for Southeast Airline are not only lower but more nearly similar than those for Northern Airline, but, from the graphs at least, appear to be rising more sharply with age.
4. From an overall perspective, it appears that Southeast Airline may perform more efficiently on sporadic or emergency repairs, and Northern Airline may place more emphasis on preventive maintenance.

Ms. Young's report should conclude that:
1. There is evidence to suggest that maintenance costs could be made to be a function of airframe age by implementing more effective management practices.
2. The difference between maintenance procedures of the two airlines should be investigated.
3. The data with which she is presently working do not provide conclusive results.
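For instructors without QM for Windows, the four case regressions can be approximated from the data tables with a short Python sketch such as the one below. The helper name fit_line and the dictionary layout are illustrative assumptions, not part of the case. The slopes print with more digits than the four decimals reported above, so small differences in the last digit are to be expected.

```python
def fit_line(x, y):
    """Least squares fit of y = b0 + b1*x; returns (b0, b1, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return b0, b1, r_squared


# Data copied from the two case tables (2001 through 2007).
airlines = {
    "Northern": {
        "age": [6512, 8404, 11077, 11717, 13275, 15215, 18390],
        "airframe": [51.80, 54.92, 69.70, 68.90, 63.72, 84.73, 78.74],
        "engine": [43.49, 38.58, 51.48, 58.72, 45.47, 50.26, 79.60],
    },
    "Southeast": {
        "age": [5107, 8145, 7360, 5773, 7150, 9364, 8259],
        "airframe": [13.29, 25.15, 32.18, 31.78, 25.34, 32.78, 35.56],
        "engine": [18.86, 31.55, 40.43, 22.10, 19.69, 32.58, 38.07],
    },
}

results = {}
for airline, data in airlines.items():
    for cost_type in ("airframe", "engine"):
        b0, b1, r2 = fit_line(data["age"], data[cost_type])
        results[(airline, cost_type)] = (b0, b1, r2)
        print(f"{airline} {cost_type}: Cost = {b0:.2f} + {b1:.5f}(age), r^2 = {r2:.4f}")
```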

[Figure: two panels, Northern Airline and Southeast Airline, each plotting airframe and engine maintenance Cost ($) (vertical axis, 10 to 90) against Average Airframe Age (Thousands) (horizontal axis, 5 to 19)]