# 9/27/2006

Mix and Match
1. i 2. c 3. a 4. g 5. b 6. e 7. j 8. f 9. d 10. h

True/False
11. True. It’s only possible confounding. The lurking variable must also be related to the response.
12. False. It’s another name for regression to patch up for the absence of randomization, but it’s not the same thing.
13. True
14. True
15. True
16. True
17. False. The purpose of an interaction is to allow the slopes to differ. Without an interaction, the slopes match.
18. True
19. False. We’d have to do this for every conceivable lurking factor, and we’ve not measured them all. Confounding is always possible without randomization.
20. True
21. False. It is helpful if the sizes of the two groups are similar, but this is not assumed by the model.


22. False Use comparison boxplots of the residuals grouped by the categorical variable.

23. Is this data from a randomized experiment? If not, do we know that the sales agents sell comparable products that produce similar revenue streams? Do we know the costs for the agents in the two groups are comparable, with similar supporting budgets, such as comparable levels of advertising and internal staff support? Without such balance, there are many sources of confounding that could explain the differences that we see in the figure. The lurking factor might also explain the slight difference in variation that we see in the summary.

24. The relevant lurking factor that ought to come to mind is inflation in the cost of products bought by this firm. If the prices of these purchases have risen 10% over the year, then this should be taken into account. Similarly, has the nature of the business changed over this time period? Perhaps the invoices in the 2006 year are for more expensive types of purchases or in larger quantity than those bought in 2005.

25. We combine them in order to compare the intercepts and compare the slopes. The multiple regression that combines them includes one coefficient that is the difference in the intercepts and another that is the difference between the slopes. These both come with standard errors, and hence allow us to test whether the observed differences (which are the same with either approach) are statistically significant.

26. The assumption of equal error variance in the two SRMs. You can have the SRM work in each subset, but with different error variances. When combined, the difference in error variances violates the similar variances condition of the MRM. To check this condition, we should look at the residuals grouped by the dummy variable.

27. In general, one should always try an interaction unless you have strong reason to know that the slopes are parallel in the two groups. In this context, it seems clear that the model needs an interaction. Union labor in auto plants makes more than nonunion labor, and the slope is the estimate for the cost per hour of labor. We’d expect it to be higher in the union shop.

28. The intercept is the start-up time and the slope is the time per unit (fixed and variable costs, respectively). If the robots function as before, then the main change will be the reduction in the intercept (smaller fixed costs). One might also expect less variation after the change; it might have been the case that start-up times varied widely seeing that they ran on for 20 hours. The slopes should be about the same, though we’d still check for an interaction.

29. a) The intercept is the mean salary for Group=0, namely the women (\$140,467). The slope is the difference in salaries, with men marginally making \$3,644 more than women overall (i.e., ignoring the effect of managerial grade level). b) These match (almost). The slope in the simple regression is the difference in mean salaries, so regression assigns this estimate almost the same level of significance

found in the two-sample t-test. c) That the variances in the two groups are the same. The regression approach is comparable to a two-sample t-test that requires equal variances. The t-test introduced in Chapter 18 does not require this assumption.

30. a) The intercept is the average number of mailings (about 30) for companies that were not aware. The slope is the difference on average between those that were not aware and those that are (12 more for those that are aware). b) On average, those that were aware sent statistically significantly more packages in the next month. c) The difference in variation could be due to a lurking variable. In the context of this problem, a lurking variable could spread out those that were aware more than those that were not. The larger variation could be due to the role of the hours variable. If a wider range of hours were given to those that are aware, then this could explain the visible differences in variation.

31. a) About 2. Focus on the green points. At x = 0, the average seems to be about 0. At x = 4, the average of these is near 8. That is, 8/4 = 2. A similar calculation applies to the red points and gives a similar slope near 2. b) The slope will be much flatter, closer to zero. It seems like it might be positive, but will be considerably less than 2.

32. a) Yes, the fits appear parallel because the coefficient of the interaction (D * x), which measures the difference in the slopes of the two groups, is not statistically significant (its t-statistic is within 1 of zero). b) Remove the interaction term to reduce the collinearity and force the slopes to be precisely parallel.

33. a) The intercept is set-up time, time to configure the robots used in the assembly. b) The slope is the minutes per item produced, basically the rate at which the process churns out items once started. c) Need the interaction. The two fits evidently cross over near about 40 to 50 units.

34. a) The slope for D tells you the difference in the fits of the two equations when Units = 0. The slope 52.82 means that the green employees (those with training) take about 50 minutes longer to get the production line set up for producing units. Once they get it set up, however, they are more efficient with smaller time costs for additional units. b) We need to find the point at which the two regression lines cross (they are not parallel in this example). As you can see in the following figure that shows the fits, the two lines cross near 45 units:

[Figure: fitted lines for the two groups, Minutes versus Units]

That’s probably good enough in practice, but if we want to be thorough, we need to find the units such that

26.783 + 2.062 units = (26.783 + 52.816) + (2.062 - 1.277) units

Since the baseline terms are common to both sides, we can drop these and solve for units in this equation:

0 = 52.816 - 1.277 units, so units = 52.816/1.277 ≈ 41.36

You Do It

35. Emerald diamonds

(a) In order to be a confounding variable, the weight has to be related to the price (we know this is true from previous study of these data, and common sense) and the weight has to be related to the group indicator. That is, diamonds of one clarity grade have to have different weights than those of the other. A two-sample comparison of weight by clarity shows that the average weight is almost the same in the two groups. If the two groups have comparable weights, then the effect of weight is balanced between the two. Weight is unlikely to be a confounding effect in this analysis.

| Level | Number | Mean | Std Dev |
|---|---|---|---|
| VS1 | 90 | 0.41 | 0.054 |
| VVS1 | 54 | 0.41 | 0.054 |

(b) The two-sample t-test finds a statistically significant difference, with VVS1 costing on average about \$110 more than VS1 diamonds.

VS1-VVS1, allowing unequal variances

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Difference | -112.30 | t Ratio | -2.88504 |
| Std Err Dif | 38.93 | DF | 103.4548 |
| Upper CL Dif | -35.11 | Prob > \|t\| | 0.0048 |
| Lower CL Dif | -189.50 | Prob > t | 0.9976 |
| Confidence | 0.95 | Prob < t | 0.0024 |

(c) Because the interaction is not statistically significant, we’ll remove it and refit the model without this term.

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | -52.53705 | 131.9049 | -0.40 | 0.6910 |
| Weight (carats) | 2863.4963 | 316.2582 | 9.05 | <.0001 |
| Clarity | 214.1266 | 215.9869 | 0.99 | 0.3232 |
| Clarity * Weight | -211.5379 | 522.1921 | -0.41 | 0.6860 |

Without the interaction, the fits are parallel and the estimated effect for clarity is statistically significant. Evidently, the cost of either type of diamond rises at the same rate with weight.

R2 = 0.497376, se = 161.8489, n = 144

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | -20.44887 | 105.1595 | -0.19 | 0.8461 |
| Weight (carats) | 2785.9054 | ≈251 | 11.10 | <.0001 |
| Clarity | 127.36823 | 27.89249 | 4.57 | <.0001 |

Based on the fit of this multiple regression, we see that for diamonds of comparable weight, those of clarity VVS1 cost on average about \$127 more than those of clarity VS1.

(d) From the two-sample comparison, the 95% confidence interval for the mean difference in price is \$35 to \$190 more for VVS1 diamonds. The estimated mean difference from the multiple regression is

127.36823 - 2 × 27.89249 to 127.36823 + 2 × 27.89249 ≈ \$72 to \$183

The regression interval is shorter because it removed the variation in price due to weight, providing a more precise estimate. There’s no confounding, however, because the weights are comparable in the two groups. Hence the estimated average differences in price (\$112 vs \$127) are comparable.

(e) The two groups have similar variances, but the variance increases with the price. The prices of diamonds become more variable as they get larger. We’ve seen this one before. Thus, the multiple regression does not meet the similar variances condition.

[Figure: Price (\$) Residual versus Price (\$) Predicted]
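The interval arithmetic in part (d) can be checked with a few lines of Python. This is only a sketch: the helper name `two_se_interval` is made up for illustration, the multiplier of 2 is the rough rule used in the answer rather than an exact t quantile, and the t-test limits are the rounded \$35 and \$190 quoted in the text.

```python
def two_se_interval(estimate, std_error, multiplier=2.0):
    """Return (lower, upper) for estimate +/- multiplier * std_error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width

# Clarity coefficient and standard error from the model without interaction.
reg_low, reg_high = two_se_interval(127.36823, 27.89249)

# Rounded limits of the two-sample t interval quoted in part (d).
t_low, t_high = 35.0, 190.0

reg_width = reg_high - reg_low
t_width = t_high - t_low
print(f"regression interval: {reg_low:.0f} to {reg_high:.0f} (width {reg_width:.0f})")
print(f"t-test interval:     {t_low:.0f} to {t_high:.0f} (width {t_width:.0f})")
```

The regression interval comes out narrower than the t-test interval, which is the point made in the answer: adjusting for weight removes variation from the response.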

36. Convenience shopping

(a) Previous analysis of this data has shown that volume of gasoline is related to sales in the convenience store. So, volume meets one of the conditions for a confounding variable: it’s related to the response. To be a confounder, it also has to differ between the two groups. In this data, the following summary shows that Site 1 is busier. Site 1 sells about 20% more gasoline, a statistically significant amount.

Site 1-Site 2, allowing unequal variances

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Difference | 659.251 | t Ratio | 13.81833 |
| Std Err Dif | 47.708 | DF | 506.2273 |
| Upper CL Dif | 752.982 | Prob > \|t\| | 0.0000 |
| Lower CL Dif | 565.520 | Prob > t | 0.0000 |
| Confidence | 0.95 | Prob < t | 1.0000 |

(b) A two-sample t-test finds a statistically significant difference in sales, with Site 1 selling on average about \$700 more than Site 2.

Site 1-Site 2, allowing unequal variances

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Difference | 727.208 | t Ratio | 28.71723 |
| Std Err Dif | 25.323 | DF | 551.4839 |
| Upper CL Dif | 776.949 | Prob > \|t\| | 0.0000 |
| Lower CL Dif | 677.466 | Prob > t | 0.0000 |
| Confidence | 0.95 | Prob < t | 1.0000 |

(c) The initial analysis finds no statistically significant interaction, so we’ll remove this term and refit the model.

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 688.06922 | 84.96066 | 8.10 | <.0001 |
| Volume (Gallons) | 0.2997907 | 0.031264 | 9.59 | <.0001 |
| Dummy | 460.14764 | 113.4244 | 4.06 | <.0001 |
| Dummy * Volume | 0.0208022 | 0.038283 | 0.54 | 0.5871 |

Evidently, gasoline sales produce comparable sales in both convenience stores. Without the interaction, the estimated model is

R2 = 0.735122, se = 243.5976, n = 568

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 650.91601 | 50.40009 | 12.91 | <.0001 |
| Volume (Gallons) | 0.313664 | 0.018032 | 17.39 | <.0001 |
| Dummy | 520.42454 | 23.64755 | 22.01 | <.0001 |

Adjusted for volume, the estimated difference in sales remains statistically significant, but falls to about \$520.

(d) The initial two-sample comparison estimates the mean difference in daily sales as \$677 to \$777 more at Site 1. Gasoline sales (traffic) confounds the comparison of the sales. Adjusted for gasoline sales, the multiple regression puts this difference at

520.42454 - 2 × 23.64755 to 520.42454 + 2 × 23.64755 ≈ \$473 to \$567

The estimated range from the multiple regression is shorter because the regression removes the variation from the response due to variation in gasoline sales. When “adjusted” for differences in traffic volume, Site 1 is still doing better, but not so much as suggested by the initial comparison.

(e) The model meets the similar variances condition. In this example, we can identify both groups in the plot of the residuals on the fitted values. Color-coding makes the boxplots unnecessary in this case, but it would probably be best to do both.

[Figure: Sales (Dollars) Residual versus Sales (Dollars) Predicted]

(f) Yes. By pooling, the slope is inflated, making it look as though gasoline sales have a bigger impact on sales in the convenience store.

[Figure: Sales (Dollars) versus Volume (Gallons)]

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 310.44032 | 65.31151 | 4.75 | <.0001 |
| Volume (Gallons) | 0.5131557 | 0.021225 | 24.18 | <.0001 |

This simple regression suggests that each gallon of gas sold generates \$0.51 in convenience store sales. In fact, the slope at either location is only \$0.31/gallon. The bigger difference, however, is the shift of about \$200.

37. Download

a) The file size is related to the transmission time. To be a confounding variable it must also be different in the two groups. As shown in the two-sample comparison, the file sizes are paired in the two groups. It’s the same in both samples. Because of this balance, the file size cannot be a confounding variable.

| Level | Number | Mean | Std Dev |
|---|---|---|---|
| MS | 40 | 56.9500 | 25.7014 |

| Level | Number | Mean | Std Dev |
|---|---|---|---|
| NP | 40 | 56.9500 | 25.7014 |

b) The two-sample t-test finds a very statistically significant difference in the performance of the software from the two vendors. On average, the software labeled “MS” transfers files in about 5.5 fewer seconds. (The variance is substantially larger for the files sent using the NP software.)

MS-NP, allowing unequal variances

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Difference | -5.5350 | t Ratio | -2.52682 |
| Std Err Dif | 2.1905 | DF | 58.79005 |
| Upper CL Dif | -1.1515 | Prob > \|t\| | 0.0142 |
| Lower CL Dif | -9.9185 | Prob > t | 0.9929 |
| Confidence | 0.95 | Prob < t | 0.0071 |

c) The interaction in the model is statistically significant, meaning that the two types of software have different rates of transfer (different MB per second).

R2 = 0.752229, se = 5.138168, n = 80

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 4.8929786 | 1.995934 | 2.45 | 0.0165 |
| File Size (MB) | 0.4037229 | 0.032012 | 12.61 | <.0001 |
| Vendor Dummy | 4.7633694 | 2.822677 | 1.69 | 0.0956 |
| Vendor Dummy * File Size | -0.180832 | 0.045272 | -3.99 | 0.0001 |

As shown in this plot of the fit of this model (different intercepts and slopes in the two groups), the transfer times using MS (in red) become progressively less than those obtained by the software labeled NP. The small difference in the intercepts (the coefficient of the dummy variable is not statistically significant) happens because both send small files quickly. The difference emerges only when the files get larger.

[Figure: Transfer Time (sec) versus File Size (MB)]
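To see how a dummy variable and an interaction translate into two separate fitted lines, here is a minimal sketch using the coefficients reported in part (c). It assumes the dummy codes MS as 1 and NP as 0, which is what the negative interaction and the discussion imply but is not stated outright; the function names are hypothetical.

```python
# Coefficients reported for the model with dummy variable and interaction.
INTERCEPT = 4.8929786      # baseline intercept (assumed NP, dummy = 0)
SLOPE = 0.4037229          # baseline slope, sec per MB
DUMMY = 4.7633694          # shift in intercept when dummy = 1 (assumed MS)
INTERACTION = -0.180832    # shift in slope when dummy = 1

def fitted_time(file_size_mb, dummy):
    """Fitted transfer time in seconds for dummy = 0 or 1."""
    intercept = INTERCEPT + DUMMY * dummy
    slope = SLOPE + INTERACTION * dummy
    return intercept + slope * file_size_mb

def gap(file_size_mb):
    """Fitted difference (dummy = 1 minus dummy = 0) at a given file size."""
    return fitted_time(file_size_mb, 1) - fitted_time(file_size_mb, 0)

slope_baseline = SLOPE
slope_other = SLOPE + INTERACTION
gap_at_mean = gap(56.95)   # 56.95 MB is the mean file size in both samples

for size in (20, 56.95, 100):
    print(f"{size:>6} MB: gap = {gap(size):.2f} sec")
```

Under this coding, the gap at the mean file size reproduces the difference found by the two-sample t-test, and the gap widens as files grow because the two slopes differ.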

d) The two-sample comparison finds an average difference of 5.5 seconds (range 1 to 10 seconds), with MS transferring files faster. The mean of the two-sample comparison is an “average gap” ignoring the size of the files. The analysis of covariance also identifies MS as faster, but shows that the gap becomes progressively wider as the file size increases. NP transfers files (once started) at a rate of about 0.4 sec/MB compared to about 0.2 sec/MB for MS.

e) No. You can see hints of a problem in the color-coded plot of residuals on fitted values (with MS shown in red). Similarly, the boxplots of residuals show different variances.

[Figure: Transfer Time (sec) Residual versus Transfer Time (sec) Predicted, and comparison boxplots of residuals by Vendor (MS, NP)]

38. Production costs

a) Material costs could be a confounding variable because they are related to the average cost per unit. The material costs per unit are, however, very similar in the two plants. Hence, material costs per unit are not going to confound the comparison using the two-sample test.

NEW-OLD, allowing unequal variances

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Difference | -0.22241 | t Ratio | -1.28719 |
| Std Err Dif | 0.17279 | DF | 156.2608 |
| Upper CL Dif | 0.11889 | Prob > \|t\| | 0.1999 |
| Lower CL Dif | -0.56372 | Prob > t | 0.9000 |
| Confidence | 0.95 | Prob < t | 0.1000 |

b) The average cost per unit is slightly lower in the new plant by \$1.10, but the difference found by the two-sample test is not statistically significant.

NEW-OLD, allowing unequal variances

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Difference | -1.1133 | t Ratio | -0.877 |
| Std Err Dif | 1.2694 | DF | 166.1142 |
| Upper CL Dif | 1.3930 | Prob > \|t\| | 0.3818 |
| Lower CL Dif | -3.6195 | Prob > t | 0.8091 |
| Confidence | 0.95 | Prob < t | 0.1909 |

c) Neither the interaction nor the dummy variable is statistically significant in the model with both.

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 32.83758 | 1.650956 | 19.89 | <.0001 |
| Material Cost (\$/unit) | 2.9437822 | 0.609098 | 4.83 | <.0001 |

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Plant Dummy | 1.1116877 | 2.776285 | 0.40 | 0.6893 |
| Dummy * Mat Cost/Unit | -0.716403 | 1.094844 | -0.65 | 0.5137 |

After removing the interaction term, the effect of the plant dummy alone remains not statistically significant.

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 33.372891 | 1.431873 | 23.31 | <.0001 |
| Material Cost (\$/unit) | 2.722051 | 0.505381 | 5.39 | <.0001 |
| Plant Dummy | -0.507856 | 1.255816 | -0.40 | 0.6864 |

The final model is basically the original simple regression.

d) In both analyses, we do not find a difference between the plants. The t-test indicates that the new plant costs run between \$0.55 less to \$0.12 more than the old plant. The regression gives a wider range

-0.507856 - 2 × 1.255816 to -0.507856 + 2 × 1.255816 ≈ -\$3.00 to \$2.00

Both approaches reach comparable conclusions since the material cost is not confounded between the two plants, plus there’s so much variation that we find little difference between the t-test interval and that from regression. The final model is just the original simple regression.

e) The model with a dummy variable meets the assumptions. There’s no indication of a problem in the color-coded plot of residuals on the fitted values, and the comparison boxplots agree. The residuals are slightly more variable in the old plant (one box twice the length of the other), but not so much as to indicate a problem.

[Figure: comparison boxplots of Average Cost (\$/unit) residuals by Plant (NEW, OLD)]

39. Home prices

a) There’s a clear difference, with a much steeper slope (higher fixed costs) for the data for Realtor B (shown as green crosses here)

[Figure: Price/Sq Ft versus 1/Sq Ft]

b) The model requires both a dummy variable and interaction.

R2 = 0.762904, se = 0.037308, n = 36

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 0.155721 | 0.019713 | 7.90 | <.0001 |
| 1/Sq Ft | 57.923342 | 31.2019 | 1.86 | 0.0726 |
| Realtor Dummy | -0.176852 | 0.061062 | -2.90 | 0.0068 |
| Dummy * 1/SqFt | 568.5921 | 110.8419 | 5.13 | <.0001 |

c) The data for Realtor B is much less variable around the fitted line than for Realtor A. You don’t need to see the boxplots to see the problem in this example, if you’ve got the points colored.

[Figure: Price/Sq Ft Residual versus Price/Sq Ft Predicted, and comparison boxplots of residuals by Realtor (A, B)]

d) The estimates are fine to interpret even with the evident lack of similar variances, as these reproduce the fitted equations for the separate groups. The intercept, about \$156/SqFt, is the estimated variable cost for Realtor A homes. The fixed costs for this realtor (slope for 1/SqFt) run about \$58,000. For Realtor B, the estimated intercept is near zero (0.156 - 0.177), suggesting no variable costs! Instead, the prices for this realtor seem to be all fixed costs, with an estimate near 58 + 569 = \$627,000 regardless of the size of the home!

e) No, the residuals do not meet the similar variances condition, so you cannot use these estimates of variation. The formula for the SE of a regression slope depends on the single estimate se of residual variation, and that is

inappropriate in this analysis. We need to have separate estimates of the variance for the two realtors. We can interpret the fit, but not use the tools for inference.

40. Leases

a) The locations are very distinct, with those in the city (shown as red dots) costing more than those in the suburbs (green crosses).

[Figure: Cost per Sq Foot versus 1/Sq Feet]

b) The fitted model uses both the dummy variable and interaction. Both appear statistically significant, though we need to check the conditions before going further with inference.

R2 = 0.615452, se = 1.092205, n = 223

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 15.817545 | 0.117467 | 134.66 | <.0001 |
| 1/Sq Feet | 1911.3659 | 512.8437 | 3.73 | 0.0002 |
| Location Dummy | 1.5369136 | 0.1874 | 8.20 | <.0001 |
| Dummy * 1/SqFt | 5145.1802 | 835.996 | 6.15 | <.0001 |

c) Certainly, the original plot looks straight enough. The plot of residuals shows slightly higher variation in the city. Some (shown to the right, with large estimated values) are rather expensive. The boxplots show that the variances are by-and-large comparable. A normal quantile plot of the residuals combined for both locations shows that the data are also nearly normal (not shown here).

[Figure: Cost per Sq Foot Residual versus Cost per Sq Foot Predicted, and comparison boxplots of residuals by Location (City, Suburbs)]

d) For the baseline model (the suburbs, coded by 0 in the dummy variable), the variable costs are about \$15.82 per square foot with about \$1900 in fixed costs. For the city, the variable costs are higher by about \$1.54 with higher fixed costs (about \$5150 more).

e) Yes, because this model meets the conditions for the MRM, we can build confidence intervals and tests. For example, we can estimate that the premium for locating in the city costs roughly

1.5369 - 2 × 0.1874 to 1.5369 + 2 × 0.1874 ≈ \$1.16 to \$1.91

per square foot more than a comparable location in the suburbs.

41. R&D expenses

(a) The two look very similar with the colors evenly mixed. A simple regression to both years seems reasonable.

[Figure: Log R&D Expense versus Log Assets]

R2 = 0.807597, se = 0.896963, n = 985

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | -1.192587 | 0.062477 | -19.09 | <.0001 |
| Log Assets | 0.7954859 | 0.012384 | 64.23 | 0.0000 |

(b) The residuals from the multiple regression show some skewness noted previously for 2004. Both have comparable variances, however, and share this

problem. As you can tell from the normal quantile plot, the combined data are not nearly normal, but since we are working with the slopes (which are averages) we can continue on thanks to the CLT. There’s a more serious problem, however, not seen in these plots: do you really think that the two data values from AMD or Intel, for example, are independent of each other? Or does it seem more likely that the data are dependent, calling into question any notion of using the usual formulas for standard errors? We’re voting for dependent.

[Figure: Log R&D Expense Residual versus Log R&D Expense Predicted, comparison boxplots of residuals by Year (2003, 2004), and normal quantile plot of the residuals]

(c) Here’s the summary of the multiple regression.

R2 = 0.8077, se = 0.897636, n = 985

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | -1.184021 | 0.091473 | -12.94 | <.0001 |
| Log Assets | 0.7900453 | 0.017898 | 44.14 | <.0001 |
| Year Dummy | -0.016828 | 0.125321 | -0.13 | 0.8932 |
| Dummy * Log Assets | 0.0110461 | 0.024828 | 0.44 | 0.6565 |

Neither added variable is statistically significant and the R2 has hardly moved from the simple regression.

The incremental F test that measures the change in R2 that comes with adding two explanatory variables is

F = (0.8077 - 0.807597)/(1 - 0.8077) × (985 - 1 - 3)/2 ≈ 0.26

which is not statistically significant. This agrees with the visual impression conveyed by the original scatterplot: the relationship appears to be the same in both years.

(d) Overall, a common regression model captures the relationship. The elasticity of R&D expenses with respect to assets is about 0.8: on average each 1% increase in assets comes with a 0.8% increase in R&D expenses. I have serious questions, however, about the independence of the residuals in the two years, since I have a pair of measurements on each company. It’s hard to think of these as independent.

42. Cars

a) The color-coded scatterplot shows that cars from European companies (red dots) tend to be more expensive, given their HP, than cars from domestic manufacturers (green crosses). As a result, the shown simple regression “splits the difference” between the two, compromising the slope and intercept to blend the two into one.

[Figure: Log 10 Price versus Log 10 HP]

R2 = 0.767004, se = 0.104323, n = 132

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 1.2125378 | 0.157681 | 7.69 | <.0001 |
| Log 10 HP | 1.406791 | 0.068004 | 20.69 | <.0001 |

b) The model appears fine but for some minor warts. Here is a summary of the fit of the model.

R2 = 0.869241, se = 0.078761, n = 132

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 1.5383054 | 0.158665 | 9.70 | <.0001 |
| Log 10 HP | 1.2423271 | 0.068972 | 18.01 | <.0001 |

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Import Dummy | -0.271731 | 0.245108 | -1.11 | 0.2697 |
| Import * Log HP | 0.1777922 | 0.105291 | 1.69 | 0.0937 |

The initial scatterplot of the data seems straight enough in both groups, and the plot of the residuals on fitted values suggests no problems, though perhaps a slight increase in variation as the car prices increase. The boxplots indicate similar variances, and the normal quantile plot confirms that the data are nearly normal (albeit with outliers such as the exotic Panoz on the high end of the scale and the cheap-for-its-power Ford Cobra on the low side).

[Figure: Log 10 Price Residual versus Log 10 Price Predicted, comparison boxplots of residuals by Location (Europe, US), and normal quantile plot of the residuals]

(c) The incremental F-test gives the value

F = (0.869241 - 0.767004)/(1 - 0.869241) × (132 - 1 - 3)/2 ≈ 50 >> 4

The incremental F shows that an increase in R2 from 77% to 87% by the addition of two explanatory variables is highly statistically significant.

(d) The estimates of the MRM show that neither coefficient is statistically significant taken separately. That’s collinearity at work! The VIFs for these estimates are larger than 300! Because of the collinearity, neither one appears statistically significant taken individually. As a pair, however, the combination brings a statistically significant improvement to the fit of the model.

43. Movies

a) Adult movies (red dots) appear to have consistently higher subsequent sales at

a given box-office gross than family movies. A subset of movies (kids movies, it seems) earn quite a bit less in subsequent sales for their level of box-office success. Call it the “Barney effect” … parents who endured these in theatres didn’t want these movies in the house. A common simple regression “splits the difference” between the two groups. Here’s the simple regression.

[Figure: Log 10 Subsequent Purchase versus Log 10 Gross]

R2 = 0.648668, se = 0.253298, n = 224

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | -1.305742 | 0.063479 | -20.57 | <.0001 |
| Log 10 Gross | 0.8420019 | 0.04159 | 20.25 | <.0001 |

(b) The following results summarize fitting the multiple regression with a dummy variable and interaction.

R2 = 0.75236, se = 0.213623, n = 224

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | -1.344678 | 0.104526 | -12.86 | <.0001 |
| Log 10 Gross | 0.7394797 | 0.063621 | 11.62 | <.0001 |
| Audience Dummy | -0.070228 | 0.122375 | -0.57 | 0.5666 |
| Dummy * Log Gross | 0.2358524 | 0.076836 | 3.07 | 0.0024 |

The initial scatterplot appears straight enough within groups. The fits to the two groups look linear (on this log scale) with a “fringe” of outliers. The comparison boxplots show that the variability is consistent in the two groups, and the plot of residuals on fitted values shows no deviations from the conditions. The normal quantile plot confirms that the combined residuals are nearly normal, though a bit skewed (toward the “left” and smaller values).

fit to all of the data has a smaller slope than seems appropriate to employees in a new office (green).

[Figure: Log Profit versus Log Accounts]

R2 = 0.176184, se = 0.717014, n = 464

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 8.9444533 | 0.100374 | 89.11 | <.0001 |
| Log Accounts | 0.2903952 | 0.029215 | 9.94 | <.0001 |

(b) These tables summarize the multiple regression with dummy variable and interaction. Both are statistically significant.

R2 = 0.29671, se = 0.663929, n = 464

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 8.8514041 | 0.147548 | 59.99 | <.0001 |
| Log Accounts | 0.2612048 | 0.035875 | 7.28 | <.0001 |
| Office Dummy | -0.897375 | 0.224139 | -4.00 | <.0001 |
| Office * Log Accounts | 0.432733 | 0.068611 | 6.31 | <.0001 |

The plot of the residuals on fitted values shows skewness, with the negative range about twice that above zero. The residuals in the two groups have similar variances. The normal quantile plot of all residuals together shows these to be nearly normal, though somewhat skewed as noticed. Hence, plots of the data do not suggest a problem. We would question the assumption of independence if we learned that some of these employees worked in the same office or collaborated in some way.

[Figure: Log Profit Residual versus Log Profit Predicted, comparison boxplots of residuals by Office (Existing, New), and normal quantile plot of the residuals]

(c) The incremental F-test judges the change in R2 to be statistically significant, as you would guess since both estimates are statistically significant by wide margins and the sample size is rather large (n = 464).

F = (0.29671 - 0.176184)/(1 - 0.29671) × (464 - 1 - 3)/2 ≈ 39.4

(d) These agree strongly in this example. Part of the reason for the agreement is that both the slope and intercept differ in the two groups. Also, there’s less collinearity than in many cases (such as the other exercises). The VIFs are about 10: large, but not devastating.

(e) The following plot shows the fits implied by the multiple regression. The statistically significant interaction suggests that a “one approach for all” placement procedure is not going to be the best solution. Hires that are able to generate lots of new accounts appear to do much better in new offices. Hires that do not open so many accounts appear more suited to starting work in an existing office. The crossover point in the two fits occurs where log of accounts is approximately (ratio of coefficient of dummy to the interaction) 0.897/0.433 ≈ 2.07, or about exp(2.07) ≈ 7.9, or say 8 accounts.
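The incremental F statistics quoted in these answers, and the crossover point in part (e), can be reproduced with a short script. This is a sketch under stated assumptions: the function name `incremental_f` is illustrative, `k_full` counts the explanatory variables in the larger model, `q` is the number of added variables, and the logs are taken to be natural logs because the answer back-transforms with exp.

```python
import math

def incremental_f(r2_full, r2_reduced, n, k_full, q):
    """F statistic for the gain in R2 from adding q variables.

    F = ((R2_full - R2_reduced) / q) / ((1 - R2_full) / (n - 1 - k_full))
    """
    gain_per_term = (r2_full - r2_reduced) / q
    unexplained_per_df = (1.0 - r2_full) / (n - 1 - k_full)
    return gain_per_term / unexplained_per_df

# R2 values as reported for the simple and multiple regressions.
f_rd = incremental_f(0.8077, 0.807597, n=985, k_full=3, q=2)         # R&D expenses
f_cars = incremental_f(0.869241, 0.767004, n=132, k_full=3, q=2)     # Cars
f_accounts = incremental_f(0.29671, 0.176184, n=464, k_full=3, q=2)  # Log profit

# Crossover of the two fitted lines: |dummy coefficient| / interaction coefficient.
log_crossover = 0.897 / 0.433
accounts_at_crossover = math.exp(log_crossover)

print(f"F (R&D) = {f_rd:.2f}, F (cars) = {f_cars:.1f}, F (accounts) = {f_accounts:.1f}")
print(f"crossover at log accounts = {log_crossover:.2f}, "
      f"about {accounts_at_crossover:.1f} accounts")
```

Writing the statistic as a ratio of "gain per added term" to "unexplained variation per residual degree of freedom" is algebraically the same as the formula in the answers; it just makes the two pieces easier to read.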

[Figure: fitted lines for the two offices, Log Profit versus Log Accounts]

45. Promotion

(a) A simple regression that combines the data from both locations makes a serious mistake, one that vastly overstates the effect/benefit of detailing. By fitting one line to both groups, rather than within each, the higher sales in Boston (red dots) inflate the slope.

[Figure: Market Share versus Detail Voice]

R2 = 0.310936, se = 0.038406, n = 78

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | 0.0917039 | 0.015423 | 5.95 | <.0001 |
| Detail Voice | 1.0825305 | 0.184853 | 5.86 | <.0001 |

(b) The scatterplot suggests parallel fits in the two locations, with a common slope for detailing. First, we check the interaction and find that it’s not statistically significant (the model meets the MRM conditions). With a dummy variable (Boston is 1, Portland 0), the fitted model with interaction gives a much better fit to the data. This model meets the MRM conditions.

R2 = 0.97145

se   0.007923
n    78

Term                Estimate     Std Error   t Ratio   Prob>|t|
Intercept           0.1212103    0.004774    25.39     <.0001
Detail Voice        0.1798054    0.067579    2.66      0.0096
City Dummy          0.0899672    0.007311    12.31     <.0001
Dummy * Detailing   -0.048412    0.089441    -0.54     0.5899

[Figure: Market Share Residual versus Market Share Predicted]

The residuals also do not show substantial tracking over time, with the DW statistic for both being reasonably close to 2. If we plot the residuals from one location on those from the other, there's no association here either.

[Figure: Boston Residual versus Portland Residual]

Because the interaction term is not statistically significant, we'll omit it and continue (the model continues to meet the usual conditions). Here's the summary of the model without the interaction, forcing parallel slopes in the two locations.

Term           Estimate    Std Error   t Ratio   Prob>|t|
Intercept      0.1230925   0.003255    37.81     <.0001
Detail Voice   0.1521676   0.04406     3.45      0.0009
City Dummy     0.0861738   0.002073    41.57     <.0001

(c) The effect for detailing has fallen from 1.08 down to 0.15, with a range of 0.152167 - 2 * 0.04406 = 0.064047 to 0.152167 + 2 * 0.04406 = 0.240287.
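The range quoted in part (c) is the estimate plus or minus two standard errors. A minimal sketch of that arithmetic, using the estimate and standard error from the model without the interaction (the variable names are illustrative):

```python
# Detail Voice coefficient and its standard error, as used in part (c).
estimate = 0.152167
std_error = 0.04406

# Approximate 95% range: estimate plus or minus two standard errors.
lower = estimate - 2 * std_error
upper = estimate + 2 * std_error
print(round(lower, 6), round(upper, 6))  # 0.064047 0.240287

# The t ratio reported in the table is the estimate divided by its SE.
t_ratio = estimate / std_error
print(round(t_ratio, 2))  # 3.45
```

The whole range sits far below the slope of about 1.08 from the simple regression that pooled the two cities.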

This range rounds to 0.06 to 0.24. By ignoring the effects of the two groups, the analyst inflated the effect of promotion. Rather than get a 1% gain in market share with each 1% increase in detailing voice, the model estimates a far smaller return on this promotion.

46. iTunes

(a) The scatterplot makes it clear that you need to distinguish the formats. There's a clear interaction, with the AAC files (red dots) occupying much less space than the AIFF files (green +) for a given time duration. Makes you wonder why anyone would prefer AIFF format unless it sounds a lot better.

[Figure: Megabytes (MB) versus Time (seconds)]

(b) The estimated model with dummy variable (1 for AIFF and 0 for AAC) is darn near perfect, with an R2 that's about off the charts… The only error appears to be when a song does not quite fill the allocated space. These t-statistics are about as large as they come unless you have a data set with millions of cases.

R2   0.999996
se   0.043304
n    596

Term             Estimate    Std Error   t Ratio   Prob>|t|
Intercept        0.0110338   0.007804    1.41      0.1579
Time (seconds)   0.0154106   0.000026    603.76    0.0000
Format Dummy     0.0715807   0.009385    7.63      <.0001
Dummy * Time     0.152824    0.000032    4761.9    0.0000

The residuals for the AIFF format spread to the right in this plot because they're bigger fitted sizes. The other files encoded using AAC, by comparison, are uniformly smaller. There's no problem in this plot. With such a large difference in the size of the files, it's perhaps not surprising that the AIFF residuals have more variation (not that it's big, mind you!). With such a good fit, these differences are only an issue if we need to predict one group or the other very accurately. We do better for the AAC files.
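The coefficients in the table for part (b) imply one fitted line per format: the dummy shifts the intercept and the interaction shifts the slope. The sketch below shows how the pieces combine. The function name and constants are illustrative, not part of the original output.

```python
# Estimates from the model with a format dummy (1 for AIFF, 0 for AAC).
INTERCEPT = 0.0110338
TIME = 0.0154106
DUMMY = 0.0715807
DUMMY_TIME = 0.152824


def predicted_mb(seconds, is_aiff):
    """Fitted file size in MB for a song of the given length and format."""
    d = 1 if is_aiff else 0
    return INTERCEPT + TIME * seconds + DUMMY * d + DUMMY_TIME * d * seconds


# For AAC (d = 0) only the first two terms remain. For AIFF (d = 1) the
# dummy adds to the intercept and the interaction adds to the slope.
aiff_intercept = INTERCEPT + DUMMY
aiff_slope = TIME + DUMMY_TIME
print(round(aiff_intercept, 7), round(aiff_slope, 7))  # 0.0826145 0.1682346
```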

[Figures: Megabytes (MB) Residual versus Megabytes (MB) Predicted; Megabytes (MB) Residual by Format (AAC, AIFF)]

Because of these differences in variation, the standard errors of the slopes are probably not precise. Just the same, whatever effect this violation of the similar variance condition has on the SE's, it's not enough to change those t-statistics to make the estimates of the slopes for the compression rates not statistically significant.

(c) The estimates show that songs recorded using the AAC format take about 0.01541 megabytes per second of recording time. Those recorded using the AIFF format require about 0.1528 MB additional space per second (more than 10 times the space used by AAC). Moreover, the fixed space needed by AAC (regardless of the length of the song) is about 0.011 MB, whereas AIFF requires an additional 0.072 MB to get started.

(d) Because the errors do not meet the similar variances condition and the fits are so good that we don't need to borrow strength, let's just fit two separate regression lines, one for each format. We already know the fit for AAC, but now we also get the appropriate se for the errors.

AAC: Megabytes (MB) = 0.0110338 + 0.0154106 Time (seconds)
se = 0.02115

[Figures: Residual versus Time (seconds); histogram and normal quantile plot of the residuals]

We also discover, now that we can get more details, that the data have a slight “kink” and are skewed, definitely not normal. No prediction interval for these!

For the songs stored in AIFF format, the equation is

AIFF: Megabytes (MB) = 0.0826145 + 0.1682346 Time (seconds)

se = 0.0484

[Figures: Residual versus Time (seconds); histogram and normal quantile plot of the residuals]

The SD of these residuals is about twice that for the songs coded using the AAC procedure. These residuals seem more typical and symmetric about zero, but the distribution does not tail off as one would expect for a normal distribution. They appear uniformly distributed.

Where does this leave us? We can come within 0.1 MB for the AIFF format with about 100% coverage, but it's not so easy to set the range. None of the residuals is larger than that. Perhaps we might be able to use a range like this for a song of 240 seconds: 0.0826145 + 0.1682346 * 240 = 40.4589185 ± 0.10 MB, guaranteed for the AIFF format song. For the AAC format, we get a much smaller estimate. So, we'd say the song would take 0.0110338 + 0.0154106 * 240 = 3.7095778 MB, which might overestimate by 0.06 or underestimate by 0.04… about half the size of the interval for the other format.
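The two predictions for a 240-second song can be checked with the separate fitted lines. This is only a sketch of the arithmetic in the text: the ranges come from the observed spread of the residuals rather than from a normal-theory prediction interval, and the helper name is illustrative.

```python
def fitted_mb(intercept, slope, seconds):
    """Fitted file size in MB from a simple regression line."""
    return intercept + slope * seconds


seconds = 240

# Separate fits for each format, as given in part (d).
aiff = fitted_mb(0.0826145, 0.1682346, seconds)
aac = fitted_mb(0.0110338, 0.0154106, seconds)
print(round(aiff, 7), round(aac, 7))  # 40.4589185 3.7095778

# Ad hoc ranges based on the largest residuals seen in each group.
aiff_range = (aiff - 0.10, aiff + 0.10)
# Overestimating by 0.06 means the actual size is 0.06 below the fit;
# underestimating by 0.04 means it is 0.04 above the fit.
aac_range = (aac - 0.06, aac + 0.04)

aiff_width = aiff_range[1] - aiff_range[0]
aac_width = aac_range[1] - aac_range[0]
print(round(aac_width / aiff_width, 2))  # 0.5
```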