# 4/21/2008

22 Exercises
Mix and Match
1. Use this plot to check the straight enough condition
2. Use this plot to check for dependence in data over time
3. Use this plot to check the similar variances condition
4. Use this plot to check the nearly normal condition
5. Term that describes data with unequal error variation
6. Another name for the assumption that the SRM makes about the variance of errors
7. An observation in a regression model with an unusually large or small value of x
8. Statistic used to detect dependence in sequences of residuals
9. An observation that deviates from the pattern in the rest of the data
10. Simple enough is another name for data collected in this manner

a. Normal quantile plot of residuals
b. Heteroscedasticity
c. Time plot of residuals
d. Durbin-Watson statistic
e. Leveraged
f. Plot of residuals on x
g. Random sample from a population
h. Homoscedasticity
i. Scatterplot of y on x
j. Outlier
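The Durbin-Watson statistic listed among the choices above is computed from residuals taken in time order, using the usual formula D = Σ(e_t − e_{t−1})² / Σ e_t². The following is a minimal Python sketch of that formula, not part of the exercise data: the function name `durbin_watson` and both residual sequences are made up for illustration.

```python
def durbin_watson(residuals):
    """Return D = sum((e_t - e_{t-1})^2) / sum(e_t^2) for residuals in time order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Numerator: squared changes between consecutive residuals.
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    # Denominator: sum of squared residuals.
    den = sum(e ** 2 for e in residuals)
    if den == 0:
        raise ValueError("residuals are all zero")
    return num / den

# Hypothetical residual sequences, invented for illustration only.
tracking = [1.0, 0.8, 0.9, 0.5, 0.2, -0.3, -0.6, -0.9, -0.8, -1.0]    # slow drift
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips

print("slow drift: D =", round(durbin_watson(tracking), 3))
print("alternating: D =", round(durbin_watson(alternating), 3))
```

Residuals that drift slowly give D near 0, residuals that flip sign from one period to the next push D toward 4, and uncorrelated residuals give D near 2.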

True/False
If you believe that a statement is false, briefly say why you think it is false.

11. If the SRM is used to model data that do not have constant variance, then 95% prediction intervals produced by this model are longer than needed.
12. When data do not satisfy the similar variances condition, the regression predictions tend to be too high on average, over-predicting most observations.
13. A common cause of dependent error terms is the presence of a lurking variable.
14. The Durbin-Watson test quantifies deviations from a normal population that are seen in the normal quantile plot.
15. A leveraged outlier has an unusually large or small value of the explanatory variable.
16. The presence of an outlier in the data used to fit a regression causes the estimated model to have a lower r² than it should.

Statements 17 to 22:

- The nearly normal condition is critical when using prediction intervals.
- Residuals seldom satisfy the nearly normal condition because it is silly to think that only one explanatory variable affects the value of the response.
- We should exclude from the estimation of the regression equation a case if its residual is more than 3 s_e away from the fitted line.
- If the Durbin-Watson statistic is near zero, we can conclude that the fitted model meets the no lurking variable condition, at least for time series data.
- The best plot for checking the similar variances condition is the scatterplot of y on x.
- In regression modeling, it is more important to fit a model that meets the conditions of the SRM than to maximize the value of r².

Think About It

23. Data on sales have been collected from a chain of convenience stores. Some of the stores are considerably larger (more square feet of display space) than others. In a regression of sales on square feet, can you anticipate any problems?

24. As part of locating a new factory, a company investigated the education and income of the local population. To keep costs low, the size of the survey of prospective employees was proportional to the size of the community. What possible problems for the SRM would you expect to find in a scatterplot of average income versus average education for communities of varying size?

25. A hasty analyst looked at the normal quantile plot for the residuals from the shown regression and concluded that the model could not be used because the residuals were not normally distributed. What do you think of this analysis of the problem?

[Figure: histogram (Count) and normal quantile plot of the residuals]

26. A second analyst looked at the same data as in Question 25 and concluded that the use of the SRM for prediction was fine because, on average, the fitted line clearly tracks the mean of y as the value of x increases. What do you think of this analysis?

27. a) If the observation marked with an “×” in the following plot is removed, how will the slope of the least squares line change?

b) What will happen to r² and s_e?
c) Is this observation leveraged?

28. a) If the observation marked with an “×” in the following plot is removed, how will the slope of the least squares line change?
b) What will happen to r² and s_e?
c) Is this observation leveraged?

29. a) If the observation marked with an “×” in the following plot is removed, how will the slope of the least squares line change?
b) What will happen to r² and s_e?
c) Is this observation leveraged?

30. a) If the observation marked with an “×” in the following plot is removed, how will the slope of the least squares line change?

b) What will happen to r² and s_e?
c) Is this observation leveraged?

31. Management of a retail chain has been tracking the growth of sales, regressing the company’s sales versus the number of outlets. Their data is weekly, spanning the last 65 weeks since the chain opened its first outlets. What lurking variable might introduce dependence into the errors of the SRM?

32. Supervisors of an assembly line track the output of the plant. One tool that they use is a simple regression of the count of packages shipped each day versus the number of employees who were active on the assembly line during that day, which varies from 35 to about 50. Identify a lurking variable that might violate one of the assumptions of the SRM.

33. If the Durbin-Watson statistic is near 2 for the fit of a SRM to monthly data, have we proven that the errors are independent and meet the assumption of the SRM?

34. The Durbin-Watson statistic for the fit of the least squares regression in this figure is 0.46. Should we interpret the value of D as indicating dependence, or is it really an artifact of a different problem? (Note: The value of the explanatory variable is getting larger with the sequence order of the data.)

You Do It

35. Diamond rings This data table contains the listed prices and weights of the diamonds in 48 rings offered for sale in The Singapore Times. The prices are in Singapore dollars, with the weights in carats. Use price as the response and weight as the explanatory variable. These rings hold relatively small diamonds, with weights less than ½ carat. The Hope Diamond weighs in at 45.52 carats. Its history and fame make it impossible to assign a price, and smaller stones of its quality have gone for \$600,000 per carat. Let’s say 45.52 carats × \$750,000/carat = 34,140,000 and call it \$35 million. For the

blueprint), and the manufacturer sends the customer an estimated price per unit. This cost estimate determines a price for the customer. The data for the analysis were sampled from the accounting records of 195 orders that were filled during the previous 3 months. Formulate the regression model with y as the average cost per unit and x as the material cost per unit.
a) Does the scatterplot of y on x or the plot of the residuals on x indicate a problem with the fitted equation?
b) Use the context of this regression to suggest any possible lurking variables. Recognize that the response reflects cost of all inputs to the manufacturing task, not just the materials that are used.
c) Consider a scatterplot of the residuals from this regression on the number of labor hours. Does this plot suggest that the labor input is a lurking variable?

39. Seattle homes This data table contains the listed prices (in thousands of dollars) and the number of square feet for 28 homes in or near Seattle. The data come from the web site of a realtor offering homes in the area. Use the selling price per square foot as y and the reciprocal of the number of square feet as x.
a) The data used previously for this analysis excludes a home that has 2,500 square feet, costs \$1.5 million (\$1,500 thousand), and is on a lot with 871,000 square feet. Add this case to the data table and refit the indicated model.
b) Compare the fit of the model with this large home to the fit without this home. Does the slope or intercept differ by very much between the two cases? Use one estimated model as your point of reference.
c) Which is more affected by the outlier: the estimated fixed costs or the estimated variable costs?
d) Outliers often shout “There’s a reason for me being different!” Consider the nonmissing values in the column labeled lot size. These give the number of square feet for the size of the lot that comes with the home. Does this column help explain the outlier and suggest a lurking variable?

40. Leases This data table gives annual costs of 223 commercial leases. All of these leases provide office space in a Midwestern city in the US. For the response, use the cost of the lease per square foot. As the explanatory variable, use the reciprocal of the number of square feet.
a) Identify the leases whose residuals lie outside the 95% prediction intervals for leases of their size. Does the location of these data indicate a problem with the fitted model? (Hint: Are all of these residuals on the same side – positive or negative – of the regression?)
b) Given the context of the problem (costs of leasing commercial property), list several possible lurking variables that might be responsible for the size and position of leases with large residual costs.
c) The leases with the 4 largest residuals have something in common. What is it, and does it help you identify a lurking variable?
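Exercises 39 and 40 regress a cost per square foot on the reciprocal of the number of square feet, and Exercise 39 c) speaks of estimated fixed and variable costs. One way to see the connection: if total cost were exactly a fixed cost plus a variable cost times square feet, dividing through by square feet gives cost per square foot = variable cost + fixed cost × (1 / square feet). The sketch below checks this algebra on noise-free, made-up numbers. `FIXED`, `VARIABLE`, the sizes, and the helper `least_squares` are all hypothetical and are not taken from the Seattle or lease data.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical, noise-free costs: total = FIXED + VARIABLE * square feet.
FIXED = 50.0      # made-up fixed cost (thousands of dollars)
VARIABLE = 0.15   # made-up cost per square foot (thousands of dollars)
sqft = [800, 1200, 1500, 2000, 2500, 3200]
total = [FIXED + VARIABLE * s for s in sqft]

y = [t / s for t, s in zip(total, sqft)]  # cost per square foot
x = [1 / s for s in sqft]                 # reciprocal of size

b0, b1 = least_squares(x, y)
print(f"intercept b0 = {b0:.4f}  (matches the variable cost per square foot)")
print(f"slope     b1 = {b1:.4f}  (matches the fixed cost)")
```

Under this assumed cost structure the intercept plays the role of the variable cost and the slope plays the role of the fixed cost, which is the reading that part c) of Exercise 39 relies on.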

41. R&D expenses This table contains accounting and financial data that describe 493 companies operating in technology industries: software, systems design, and semiconductor manufacturing. One column gives the expenses on research and development (R&D), and another gives the total assets of the companies. Both columns are reported in millions of dollars. Use the logs of both variables rather than the originals. (i.e., Set y to the natural log of R&D expenses, and set x to the natural log of assets. Note that the variables are recorded in millions, so 1000 = 1 billion.)
a) What problem with the use of the SRM is evident in the scatterplot of y on x as well as in the plot of the residuals from the fitted equation on x?
b) If the residuals are nearly normal, of those that lie outside the 95% prediction intervals, what proportion should be above the fitted equation?
c) Based on the property of residuals identified in “b,” can you anticipate that these residuals are not nearly normal – without needing the normal quantile plot?

42. Cars The cases that make up this data set are types of cars. For each of 223 types of cars sold in the US during the 2003 and 2004 model years, these data include the base price (in dollars) and the horsepower of the engine (HP). Use the price of the car as the response and the horsepower as the explanatory variable.
a) Compare the two plots: Price versus Horsepower and log10 Price versus log10 Horsepower. Does either seem suited, even approximately, to the SRM?
b) Would the linear equation, particularly the slope, have the same meaning in both cases considered in “a”?
c) Fit the preferred equation as identified in “a,” then use this rough test for equal variance. First, divide the plot in half at the median of the explanatory variable in your model. Second, count the number of points that lie outside the 95% prediction limits in each half of the data. Are these randomly distributed between the two halves? Are the error terms homoscedastic?

43. OECD The Organization for Economic Co-operation and Development (OECD) tracks summary statistics of the member economies. The countries are located in Europe, parts of Asia, and North America. Two variables of interest are GDP (gross domestic product per capita, a measure of the overall production in an economy per citizen) and trade balance (measured as a percentage of GDP). Exporting countries have positive trade balances; importers have negative trade balances. These data are from the 2005 report of the OECD. Formulate the SRM with GDP as the response and Trade Balance as the explanatory variable.
a) In 2005, Luxembourg reported the highest positive balance of trade, 21.5% of GDP. Fit the least squares equation both with and without Luxembourg and compare the results. Does the fitted slope change by very much?
b) Explain any differences between R² and s_e for the two fits considered in “a.”
c) Luxembourg also has the second smallest population among the countries. Does this explain the size of the difference between the two equations in “a”? Explain.
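The rough test for equal variance described in Exercise 42 c) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses ±2 s_e around the fitted line as a stand-in for the 95% prediction limits (the exact limits use a t percentile and widen away from the mean of x), and the data are made up so that the spread is ten times larger in the upper half. The names `fit_line`, `count_outside_by_half`, and `pattern` are hypothetical.

```python
from statistics import median

def fit_line(x, y):
    """Least squares fit of y on x; returns intercept, slope, residuals, s_e."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = (sum(e ** 2 for e in resid) / (n - 2)) ** 0.5
    return b0, b1, resid, s_e

def count_outside_by_half(x, y):
    """Count residuals beyond +/- 2 s_e below and above the median of x."""
    _, _, resid, s_e = fit_line(x, y)
    cut = median(x)
    low = sum(1 for xi, e in zip(x, resid) if xi <= cut and abs(e) > 2 * s_e)
    high = sum(1 for xi, e in zip(x, resid) if xi > cut and abs(e) > 2 * s_e)
    return low, high

# Hypothetical data: the same error pattern, scaled by 0.2 for x <= 20
# and by 2.0 for x > 20, so the error variation is clearly unequal.
pattern = [2.5, -0.5, 0.4, -0.3, 0.6, -2.5, 0.5, -0.4, 0.3, -0.6]
x = list(range(1, 41))
noise = [pattern[(i - 1) % 10] * (0.2 if i <= 20 else 2.0) for i in x]
y = [5 + 2 * xi + e for xi, e in zip(x, noise)]

low, high = count_outside_by_half(x, y)
print("outside the limits, lower half:", low)
print("outside the limits, upper half:", high)
```

When the points outside the limits pile up in one half, as they do for these made-up data, the counts suggest that the error variation is not the same across the range of x.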

44. Hiring A firm that operates a large, direct-to-consumer sales force would like to build a system to monitor the progress of new agents. The goal is to identify “superstar agents” as rapidly as possible, offer them incentives, and keep them with the company. A key task for agents is to open new accounts; an account is a new customer to the business. To build the system, the firm has monitored activities of new agents over the past two years. The response of interest is the profit to the firm (in dollars) of contracts sold by agents over their first year. Among the possible explanations of this performance is the number of new accounts developed by the agent during the first 3 months of work. Formulate the SRM with y given by the natural log of Profit from Sales and x given by the natural log of Number Accounts.
a) Locate the most negative residual in the data. Which case is this?
b) Explain some characteristics that distinguish this employee from the others. (Hint: Consider the data in the Early Commission and Early Selling columns. Both are measured in dollars and measure the quality of business developed in the first 3 months of working for this firm.)
c) How does the fit change if this point is set aside, excluded from the original regression? Compare the fitted model both with and without this employee.
d) Explain the magnitude of the change in the fit. Why does the fit change by so much or so little?

45. Promotion These data describe spending by a major pharmaceutical company for promoting a cholesterol-lowering drug. The data covers 39 consecutive weeks and isolates the area around Boston. The variables in this collection are shares. Marketing research often describes the level of promotion in terms of voice. In place of the level of spending, voice is the share of advertising devoted to a specific product. The column Market Share is the ratio of sales of this product divided by total sales for such drugs in the Boston area. The column Detail Voice is the ratio of detailing for this drug to the amount of detailing for all cholesterol-lowering drugs in Boston. Detailing counts the number of promotional visits made by representatives of a pharmaceutical company to doctors’ offices. Formulate the SRM with y given by the Market Share and x given by the Detailing Voice.
a) Identify the week associated with the outlying value highlighted in the figure below. (The figure shows the least squares fitted line.) Does this week have unusually large sales given the level of promotion, or unusually low levels of promotion? Take a look at the timeplots to help you decide.

[Figure: scatterplot of Market Share versus Detail Voice with the least squares fitted line]

b) How does the fitted regression equation change if this week is excluded from the analysis? Are these large changes?
c) The R² of the fit gets larger and s_e gets smaller without this week; however, the standard error for b1 increases. Why?
d) These are time series data. Do other diagnostics suggest a violation of the assumptions of the SRM?

46. Apple This dataset tracks monthly performance of stock in Apple Computer since its inception in 1980. The data includes 300 monthly returns on Apple Computer, as well as returns on the entire stock market, Treasury Bills (short-term, 30-day loans to the government), and inflation. (The column Market Return is the return on a value-weighted portfolio that purchases stock in proportion to the size of the company rather than one of each stock.) Formulate the SRM with Apple Return as the response and Market Return as the predictor.
(a) Identify the time period associated with each of the two outliers highlighted in this scatterplot. What’s special, if anything, about these two months?

[Figure: scatterplot of Apple Return versus Market Return]

(b) Which observation is more important to the statistical significance of the fit of the least squares equation? That is, if we want to keep the absolute value of the t-statistic as large as possible, which month should be kept?
(c) Explain why the month that keeps the t-statistic large is more important than the other month.
(d) Explain why removing either observation has little effect on the least squares fit.
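Exercise 45 c) describes a fit in which s_e shrinks when one observation is removed while the standard error of b1 grows. A small sketch can show how that combination is possible, using the standard formula se(b1) = s_e / sqrt(Σ(x − x̄)²): dropping a point far from the mean of x reduces the sum in the denominator. The data below are invented for illustration (ten points near a line plus one leveraged point at x = 30), and `fit_summary` is a hypothetical helper; nothing here comes from the promotion or Apple data.

```python
def fit_summary(x, y):
    """Least squares fit of y on x; returns slope, s_e, se(b1), and t for b1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s_e = (rss / (n - 2)) ** 0.5
    se_b1 = s_e / sxx ** 0.5          # standard error of the slope
    return {"b1": b1, "s_e": s_e, "se_b1": se_b1, "t": b1 / se_b1}

# Hypothetical data: ten points close to y = 2 + 0.5 x ...
wiggle = [0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.3, -0.2, 0.1, -0.2]
x_core = list(range(1, 11))
y_core = [2 + 0.5 * xi + w for xi, w in zip(x_core, wiggle)]
# ... plus one leveraged point at x = 30 that sits 1.5 above that line.
x = x_core + [30]
y = y_core + [2 + 0.5 * 30 + 1.5]

with_pt = fit_summary(x, y)
without_pt = fit_summary(x_core, y_core)
for label, fit in (("with the point", with_pt), ("without the point", without_pt)):
    print(label, {k: round(v, 4) for k, v in fit.items()})
```

For these made-up numbers, setting the leveraged point aside lowers s_e but raises se(b1), so the t-statistic for the slope is larger when the point is kept.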
