4/21/2008

22 Exercises
Mix and Match
1. Use this plot to check the straight enough condition
2. Use this plot to check for dependence in data over time
3. Use this plot to check the similar variances condition
4. Use this plot to check the nearly normal condition
5. Term that describes data with unequal error variation
6. Another name for the assumption that the SRM makes about the variance of errors
7. An observation in a regression model with an unusually large or small value of x
8. Statistic used to detect dependence in sequences of residuals
9. An observation that deviates from the pattern in the rest of the data
10. Simple enough is another name for data collected in this manner

a. Normal quantile plot of residuals
b. Heteroscedasticity
c. Time plot of residuals
d. Durbin-Watson statistic
e. Leveraged
f. Plot of residuals on x
g. Random sample from a population
h. Homoscedasticity
i. Scatterplot of y on x
j. Outlier
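
The Durbin-Watson statistic listed among the choices is computed from residuals kept in time order: D = Σ(e_t − e_{t−1})² / Σ e_t², with the numerator summed from the second residual onward. The sketch below is a minimal illustration of that formula. The function name and the two residual sequences are hypothetical and are not data from these exercises.

```python
def durbin_watson(residuals):
    """Return D = sum((e_t - e_{t-1})^2) / sum(e_t^2) for residuals in time order."""
    e = list(residuals)
    if len(e) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    denominator = sum(v * v for v in e)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator


# Hypothetical residual sequences, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips at every step
drifting = [1.0, 1.0, 1.0, 1.0]       # no change from one step to the next

print(durbin_watson(alternating))  # well above 2: adjacent residuals move in opposite directions
print(durbin_watson(drifting))     # 0: adjacent residuals are identical
```

Values of D near 2 are what uncorrelated residuals tend to produce; values near 0 point to positive dependence between adjacent residuals.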

True/False
If you believe that a statement is false, briefly say why you think it is false.
11. If the SRM is used to model data that do not have constant variance, then 95% prediction intervals produced by this model are longer than needed.
12. When data do not satisfy the similar variances condition, the regression predictions tend to be too high on average, over-predicting most observations.
13. A common cause of dependent error terms is the presence of a lurking variable.
14. The Durbin-Watson test quantifies deviations from a normal population that are seen in the normal quantile plot.
15. A leveraged outlier has an unusually large or small value of the explanatory variable.
16. The presence of an outlier in the data used to fit a regression causes the estimated model to have a lower r² than it should.

17. Residuals seldom satisfy the nearly normal condition because it is silly to think that only one explanatory variable affects the value of the response.
18. The nearly normal condition is critical when using prediction intervals.
19. If the Durbin-Watson statistic is near zero, we can conclude that the fitted model meets the no lurking variable condition, at least for time series data.
20. The best plot for checking the similar variances condition is the scatterplot of y on x.
21. In regression modeling, it is more important to fit a model that meets the conditions of the SRM than to maximize the value of r².
22. We should exclude from the estimation of the regression equation a case if its residual is more than 3 s_e away from the fitted line.

Think About It
23. Data on sales have been collected from a chain of convenience stores. Some of the stores are considerably larger (more square feet of display space) than others. In a regression of sales on square feet, can you anticipate any problems?
24. As part of locating a new factory, a company investigated the education and income of the local population. To keep costs low, the size of the survey of prospective employees was proportional to the size of the community. What possible problems for the SRM would you expect to find in a scatterplot of average income versus average education for communities of varying size?
25. A hasty analyst looked at the normal quantile plot for the residuals from the shown regression and concluded that the model could not be used because the residuals were not normally distributed. What do you think of this analysis of the problem?
[Figure: histogram (Count) and normal quantile plot of the residuals]
26. A second analyst looked at the same data as in Question 25 and concluded that the use of the SRM for prediction was fine, on average, because the fitted line clearly tracks the mean of y as the value of x increases. What do you think of this analysis?
27. a) If the observation marked with an “×” in the following plot is removed, how will the slope of the least squares line change?

b) What will happen to r² and s_e?
c) Is this observation leveraged?
28. a) If the observation marked with an “×” in the following plot is removed, how will the slope of the least squares line change?
b) What will happen to r² and s_e?
c) Is this observation leveraged?
29. a) If the observation marked with an “×” in the following plot is removed, how will the slope of the least squares line change?
b) What will happen to r² and s_e?
c) Is this observation leveraged?
30. a) If the observation marked with an “×” in the following plot is removed, how will the slope of the least squares line change?

b) What will happen to r² and s_e?
c) Is this observation leveraged?
31. Management of a retail chain has been tracking the growth of sales, regressing the company’s sales versus the number of outlets. Their data are weekly, spanning the last 65 weeks since the chain opened its first outlets. What lurking variable might introduce dependence into the errors of the SRM?
32. Supervisors of an assembly line track the output of the plant. One tool that they use is a simple regression of the count of packages shipped each day versus the number of employees who were active on the assembly line during that day, which varies from 35 to about 50. Identify a lurking variable that might violate one of the assumptions of the SRM.
33. If the Durbin-Watson statistic is near 2 for the fit of an SRM to monthly data, have we proven that the errors are independent and meet the assumption of the SRM?
34. The Durbin-Watson statistic for the fit of the least squares regression in this figure is 0.46. Should we interpret the value of D as indicating dependence, or is it really an artifact of a different problem? (Note: The value of the explanatory variable is getting larger with the sequence order of the data.)

You Do It
35. Diamond rings. This data table contains the listed prices and weights of the diamonds in 48 rings offered for sale in The Singapore Times. The prices are in Singapore dollars, with the weights in carats. Use price as the response and weight as the explanatory variable. These rings hold relatively small diamonds, with weights less than ½ carat. The Hope Diamond weighs in at 45.52 carats. Its history and fame make it impossible to assign a price, and smaller stones of its quality have gone for $600,000 per carat. Let’s say 45.52 carats × $750,000/carat = $34,140,000 and call it $35 million. For the exchange rate, assume that 1 US dollar is worth about 1.6 Singapore dollars.
(a) Add an imaginary ring with the weight and this price of the Hope Diamond (in Singapore dollars) to the data set as a 49th case. How does the addition of the Hope Diamond to these other rings change the appearance of the plot? How many points can you see?
(b) How does the fitted equation of the SRM to this data change with the addition of this one case?
(c) Explain how it can be that both R² and s_e increase with the addition of this point.
(d) Why does the addition of one point, making up only 2% of the data, have so much influence on the fitted model?
36. Convenience shopping. These data describe the sales over time at a franchise outlet of a major US oil company. Each row summarizes sales for one day at this location. This particular station sells gas, and it also has a convenience store and a car wash. (The data file has values for two stations. For this exercise, use only the 283 cases for site 1.) The column labeled Sales gives the dollar sales of the convenience store, and the column Volume gives the number of gallons of gasoline sold. Formulate the regression model with dollar sales as the response and number of gallons sold as the predictor.
a) These data are a time series, with 5 or 6 measurements per week. (The initial data collection did not monitor sales on Saturday.) Does the sequence plot of residuals from the fitted equation indicate the presence of dependence?
b) Calculate the Durbin-Watson statistic D. (Ignore the fact that the data over the weekend are not adjacent.) Does the value of D indicate the presence of dependence? Does it agree with your impression in “a”?
c) The residual for row 14 is rather large and positive. How does this outlier affect the fit of the regression of sales on gallons?
d) Should the outlier be removed from the fit?
37. Download. Before plunging into videoconferencing, a company tested the speed of its internal computer network. The tests were designed to measure how rapidly data moved through the network under a typical load. Eighty files ranging in size from 20 to 100 megabytes (MB) were transmitted over the network at various times of day, and the time to send the files recorded. Formulate the SRM with y given by Transfer time and x given by File size.
a) Plot the residuals versus the x variable and in time order, as indicated by the column labeled time. (The files were transferred at roughly equally spaced times during a 5-day period.) Does either plot suggest a problem with the SRM?
b) Compute the Durbin-Watson D statistic. Does it indicate a problem? For a series of this length, any value of D that is more than ½ away from 2 is statistically significant with p-value smaller than 0.01.
c) Explain or interpret the result of the Durbin-Watson test.
38. Production costs. A manufacturer produces custom metal blanks that are used by its customers for computer-aided machining. The customer sends a design via computer (a 3-D blueprint), and the manufacturer sends the customer an estimated price per unit. This cost estimate determines a price for the customer. The data for the analysis were sampled from the accounting records of 195 orders that were filled during the previous 3 months. Formulate the regression model with y as the average cost per unit and x as the material cost per unit.
a) Do the scatterplot of y on x or the plot of the residuals on x indicate a problem with the fitted equation?
b) Use the context of this regression to suggest any possible lurking variables. Recognize that the response reflects cost of all inputs to the manufacturing task, not just the materials that are used.
c) Consider a scatterplot of the residuals from this regression on the number of labor hours. Does this plot suggest that the labor input is a lurking variable?
39. Seattle homes. This data table contains the listed prices (in thousands of dollars) and the number of square feet for 28 homes in or near Seattle. The data come from the web site of a realtor offering homes in the area. Use the selling price per square foot as y and the reciprocal of the number of square feet as x.
a) The data used previously for this analysis exclude a home that has 2,500 square feet, costs $1.5 million ($1,500 thousand), and is on a lot with 871,000 square feet. Add this case to the data table and refit the indicated model.
b) Compare the fit of the model with this large home to the fit without this home. Does the slope or intercept differ by very much between the two cases? Use one estimated model as your point of reference.
c) Which is more affected by the outlier: the estimated fixed costs or the estimated variable costs?
d) Outliers often shout “There’s a reason for me being different!” Consider the nonmissing values in the column labeled lot size. These give the number of square feet for the size of the lot that comes with the home. Does this column help explain the outlier and suggest a lurking variable?
40. Leases. This data table gives annual costs of 223 commercial leases. All of these leases provide office space in a Midwestern city in the US. For the response, use the cost of the lease per square foot. As the explanatory variable, use the reciprocal of the number of square feet.
a) Identify the leases whose residuals lie outside the 95% prediction intervals for leases of their size. Does the location of these data indicate a problem with the fitted model? (Hint: Are all of these residuals on the same side, positive or negative, of the regression?)
b) Given the context of the problem (costs of leasing commercial property), list several possible lurking variables that might be responsible for the size and position of leases with large residual costs.
c) The leases with the 4 largest residuals have something in common. What is it, and does it help you identify a lurking variable?
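
Exercises 39 and 40 both regress a cost per square foot on the reciprocal of square feet. One way to read that model, offered here as an interpretation consistent with the "fixed costs" and "variable costs" wording of Exercise 39(c), is this: if total cost = F + V × (square feet), then dividing by square feet gives cost per square foot = V + F × (1 / square feet). The intercept then plays the role of the variable cost and the slope the role of the fixed cost. The sketch below checks that algebra on noise-free numbers. `FIXED`, `VARIABLE`, and the sizes are hypothetical values, not data from the exercises.

```python
def fit_line(x, y):
    """Least squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1


# Hypothetical values chosen for illustration only.
FIXED = 50_000.0   # fixed cost F, in dollars
VARIABLE = 120.0   # variable cost V, in dollars per square foot
sqft = [1000.0, 1500.0, 2000.0, 2500.0, 4000.0]

total = [FIXED + VARIABLE * s for s in sqft]
y = [t / s for t, s in zip(total, sqft)]  # cost per square foot
x = [1.0 / s for s in sqft]               # reciprocal of square feet

b0, b1 = fit_line(x, y)
print(b0, b1)  # intercept close to VARIABLE, slope close to FIXED
```

With real data the fit is not exact, so the intercept and slope are estimates of the variable and fixed costs rather than the values themselves.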

41. R&D expenses. This table contains accounting and financial data that describe 493 companies operating in technology industries: software, systems design, and semiconductor manufacturing. One column gives the expenses on research and development (R&D), and another gives the total assets of the companies. Both columns are reported in millions of dollars. Use the logs of both variables rather than the originals. (That is, set y to the natural log of R&D expenses, and set x to the natural log of assets. Note that the variables are recorded in millions, so 1,000 = 1 billion.)
a) What problem with the use of the SRM is evident in the scatterplot of y on x as well as in the plot of the residuals from the fitted equation on x?
b) If the residuals are nearly normal, of those that lie outside the 95% prediction intervals, what proportion should be above the fitted equation?
c) Based on the property of residuals identified in “b,” can you anticipate that these residuals are not nearly normal, without needing the normal quantile plot?
42. Cars. The cases that make up this data set are types of cars. For each of 223 types of cars sold in the US during the 2003 and 2004 model years, these data include the base price (in dollars) and the horsepower of the engine (HP). Use the price of the car as the response and the horsepower as the explanatory variable.
a) Compare the two plots: Price versus Horsepower and log10 Price versus log10 Horsepower. Does either seem suited, even approximately, to the SRM?
b) Would the linear equation, particularly the slope, have the same meaning in both cases considered in “a”?
c) Fit the preferred equation as identified in “a,” then use this rough test for equal variance. First, divide the plot in half at the median of the explanatory variable in your model. Second, count the number of points that lie outside the 95% prediction limits in each half of the data. Are these randomly distributed between the two halves? Are the error terms homoscedastic?
43. OECD. The Organization for Economic Co-operation and Development (OECD) tracks summary statistics of the member economies. The countries are located in Europe, parts of Asia, and North America. Two variables of interest are GDP (gross domestic product per capita, a measure of the overall production in an economy per citizen) and trade balance (measured as a percentage of GDP). Exporting countries have positive trade balances; importers have negative trade balances. These data are from the 2005 report of the OECD. Formulate the SRM with GDP as the response and Trade Balance as the explanatory variable.
a) In 2005, Luxembourg reported the highest positive balance of trade, 21.5% of GDP. Fit the least squares equation both with and without Luxembourg and compare the results. Does the fitted slope change by very much?
b) Explain any differences between R² and s_e for the two fits considered in “a.”
c) Luxembourg also has the second smallest population among the countries. Does this explain the size of the difference between the two equations in “a”? Explain.
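
The rough test for equal variance described in Exercise 42(c) can be sketched as follows. This is a minimal illustration under one stated assumption: the 95% prediction limits are approximated by the fitted value ± 2 s_e, which ignores the small extra terms in the exact interval. The function names and the data are hypothetical and do not come from the cars data set.

```python
import math
from statistics import median


def fit_line(x, y):
    """Least squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def outside_counts_by_half(x, y):
    """Count residuals beyond +/- 2 s_e below and above the median of x."""
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r * r for r in resid) / (len(x) - 2))
    cut = median(x)
    # Assumption: approximate 95% prediction limits are fitted value +/- 2 s_e.
    lower = sum(1 for xi, r in zip(x, resid) if xi <= cut and abs(r) > 2 * s_e)
    upper = sum(1 for xi, r in zip(x, resid) if xi > cut and abs(r) > 2 * s_e)
    return lower, upper


# Hypothetical data: y = x plus a small alternating error, with the case
# at x = 16 shifted up by 10. None of these values come from the exercises.
x = [float(i) for i in range(1, 21)]
y = [xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x, start=1)]
y[15] += 10.0  # the case with x = 16

print(outside_counts_by_half(x, y))
```

If the points outside the limits pile up in one half of the plot, that lopsided count is the signal the exercise asks you to look for.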

44. Hiring. A firm that operates a large, direct-to-consumer sales force would like to build a system to monitor the progress of new agents. The goal is to identify “superstar agents” as rapidly as possible, offer them incentives, and keep them with the company. A key task for agents is to open new accounts; an account is a new customer to the business. To build the system, the firm has monitored activities of new agents over the past two years. The response of interest is the profit to the firm (in dollars) of contracts sold by agents over their first year. Among the possible explanations of this performance is the number of new accounts developed by the agent during the first 3 months of work. Formulate the SRM with y given by the natural log of Profit from Sales and x given by the natural log of Number Accounts.
a) Locate the most negative residual in the data. Which case is this?
b) Explain some characteristics that distinguish this employee from the others. (Hint: Consider the data in the Early Commission and Early Selling columns. Both are measured in dollars and measure the quality of business developed in the first 3 months of working for this firm.)
c) How does the fit change if this point is set aside, excluded from the original regression? Compare the fitted model both with and without this employee.
d) Explain the magnitude of the change in the fit. Why does the fit change by so much or so little?
45. Promotion. These data describe spending by a major pharmaceutical company for promoting a cholesterol-lowering drug. The data cover 39 consecutive weeks and isolate the area around Boston. The variables in this collection are shares. Marketing research often describes the level of promotion in terms of voice. In place of the level of spending, voice is the share of advertising devoted to a specific product. The column Market Share is the ratio of sales of this product divided by total sales for such drugs in the Boston area. The column Detail Voice is the ratio of detailing for this drug to the amount of detailing for all cholesterol-lowering drugs in Boston. Detailing counts the number of promotional visits made by representatives of a pharmaceutical company to doctors’ offices. Formulate the SRM with y given by the Market Share and x given by the Detailing Voice.
a) Identify the week associated with the outlying value highlighted in the figure below. (The figure shows the least squares fitted line.) Does this week have unusually large sales given the level of promotion, or unusually low levels of promotion? Take a look at the timeplots to help you decide.

[Figure: scatterplot of Market Share versus Detail Voice with the least squares line]
b) How does the fitted regression equation change if this week is excluded from the analysis? Are these large changes?
c) The R² of the fit gets larger and s_e gets smaller without this week; however, the standard error for b1 increases. Why?
d) These are time series data. Do other diagnostics suggest a violation of the assumptions of the SRM?
46. Apple. This dataset tracks monthly performance of stock in Apple Computer since its inception in 1980. The data include 300 monthly returns on Apple Computer, as well as returns on the entire stock market, Treasury Bills (short-term, 30-day loans to the government), and inflation. (The column Market Return is the return on a value-weighted portfolio that purchases stock in proportion to the size of the company rather than one of each stock.) Formulate the SRM with Apple Return as the response and Market Return as the predictor.
(a) Identify the time period associated with each of the two outliers highlighted in this scatterplot. What’s special, if anything, about these two months?
[Figure: scatterplot of Apple Return versus Market Return]
(b) Which observation is more important to the statistical significance of the fit of the least squares equation? That is, if we want to keep the absolute value of the t-statistic as large as possible, which month should be kept?
(c) Explain why the month that keeps the t-statistic large is more important than the other month.
(d) Explain why removing either observation has little effect on the least squares fit.
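
Exercises 45(c) and 46(b) both turn on the standard error of the slope, which depends on the residual standard deviation and on the spread of the explanatory variable: se(b1) = s_e / sqrt(Σ(x − x̄)²), and the t-statistic is b1 / se(b1). The sketch below computes these quantities with and without one case. The function name and the five data points are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def slope_summary(x, y):
    """Return (b1, se_b1, t) for the least squares line of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(rss / (n - 2))
    se_b1 = s_e / math.sqrt(sxx)
    return b1, se_b1, b1 / se_b1


# Hypothetical data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 0.9, 2.1, 2.9, 4.0]

full = slope_summary(x, y)
trimmed = slope_summary(x[:-1], y[:-1])  # drop the case with the largest x

print(full)
print(trimmed)
```

In this toy example, dropping the case with the largest x narrows the spread of x, and the standard error of the slope goes up even though the remaining points still lie close to a line.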


4M Do Fences Make Good Neighbors?
For this exercise, you’re a real-estate developer. You’re planning a suburban housing development outside Philadelphia. The design calls for 25 homes that you expect to sell for about $450,000 each. If all goes as planned, you’ll make a profit of $50,000 per house, or $1.25 million overall. If you add a security wall around the development, you might be able to sell them for more. Gates convey the safety and low crime rates to potential homebuyers. The crime rate in the area where you are building the development is already low, about 15 incidents per 1,000 residents. A security consultant claims a gate and fence would reduce this further to 10 per 1,000 residents. If this consultant is right, is it worth adding the gate and wall? The builders say that it will cost you about $875,000 ($35,000 more per house) to add the gate and fence to the development if you do it now while construction is starting. If you wait until people move in, the costs will rise sharply. It’s now or never.
We have some data to help us decide. The data include the median selling price of homes in communities in the Philadelphia area. The data also include the crime rate in these communities, expressed in incidents per 1,000 residents. This analysis will use the reciprocal of this rate. These data appeared in the April 1996 issue of Philadelphia Magazine. Because of the soaring values of homes, let’s assume that prices have doubled since these data were measured. (These data also appear in an exercise in Chapter 20. That exercise focuses on the transformation. For this exercise, we will focus on the use of the model with the transformation.)
Motivation
(a) Assume that the addition of a gate and wall have the effect of convincing potential buyers that the crime rate of this development will “feel” like 10 crimes per 1,000 rather than 15. How much does this have to increase the value of these homes (on average) in order for building the security fence to be cost effective?
(b) If the regression model identifies a statistically significant association between the price of housing and the number of people per crime (the reciprocal of the crime rate), will this prove that lowering the crime rate will pay for the cost of constructing the security wall?
Method
(c) Plot the selling prices of homes in these communities versus the ratio of 1,000 divided by the crime rate. Does the plot seem straight enough to continue? (The variable created as 1,000 divided by the crime rate is the number of residents per crime.)
(d) Fit the linear equation to the scatterplot in part “c.” If we accept the fit of this equation, what do you think about building the wall? Be sure to take the doubling of home prices into account.

Mechanics
(e) Which communities are leveraged in this analysis? What distinguishes these communities from the others?
(f) Which communities are outliers with unusually positive or negative residuals? Identify these in the plot of the residuals on the explanatory variable.
(g) Does this model meet conditions needed for using the SRM for inference about the parameters? What about prediction intervals?
(h) If we ignore any problems noted in the form of this model, would the usual inferences lead us to tell the developer to build the wall? (Again, remember to take the doubling of prices since 1996 into account.)
Message
(i) How would you answer the question for the developer? Should the developer proceed with the wall?
(j) What could you do to improve the analysis? State your suggestions in a form that the developer would understand.
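
The figures quoted in the 4M setup can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the exercise (25 homes, $50,000 profit per house, $875,000 for the wall, crime rates of 15 and 10 per 1,000 residents). The function names are hypothetical, and the sketch stops short of using any fitted regression.

```python
HOMES = 25
PROFIT_PER_HOME = 50_000   # dollars, as stated in the exercise
WALL_COST = 875_000        # dollars, as stated in the exercise


def cost_per_home(total_cost, homes):
    """Wall cost spread evenly over the homes in the development."""
    return total_cost / homes


def residents_per_crime(rate_per_1000):
    """Explanatory variable used in the exercise: 1,000 divided by the crime rate."""
    return 1000 / rate_per_1000


per_home = cost_per_home(WALL_COST, HOMES)
total_profit = HOMES * PROFIT_PER_HOME
x_now = residents_per_crime(15)        # current crime rate
x_with_wall = residents_per_crime(10)  # rate claimed by the consultant

print(per_home)             # added cost each home must recover
print(total_profit)         # planned profit without the wall
print(x_now, x_with_wall)   # residents per crime before and after
print(x_with_wall - x_now)  # change in the explanatory variable
```

The change in residents per crime is the quantity a fitted slope would multiply in parts (d) and (h), after allowing for the doubling of prices since 1996.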
