Chapter 11
Simple Linear Regression Analysis

Chapter Outline
11.1 The Simple Linear Regression Model
11.2 The Least Squares Estimates, and Point Estimation and Prediction
11.3 Model Assumptions and the Standard Error
11.4 Testing the Significance of the Slope and y Intercept
11.5 Confidence and Prediction Intervals
11.6 Simple Coefficients of Determination and Correlation
11.7 An F Test for the Model
*11.8 Residual Analysis
*11.9 Some Shortcut Formulas

*Optional section

Managers often make decisions by studying the relationships between variables, and process improvements can often be made by understanding how changes in one or more variables affect the process output. Regression analysis is a statistical technique in which we use observed data to relate a variable of interest, which is called the dependent (or response) variable, to one or more independent (or predictor) variables. The objective is to build a regression model, or prediction equation, that can be used to describe, predict, and control the dependent variable on the basis of the independent variables. For example, a company might wish to improve its marketing process. After collecting data concerning the demand for a product, the product's price, and the advertising expenditures made to promote the product, the company might use regression analysis to develop an equation to predict demand on the basis of price and advertising expenditures.

expenditure. Predictions of demand for various price–advertising expenditure combinations can then be used to evaluate potential changes in the company’s marketing strategies. As another example, a manufacturer might use regression analysis to describe the relationship between several input variables and an important output variable. Understanding the relationships between these variables would allow the manufacturer to identify control variables that can be used to improve the process performance. In the next two chapters we give a thorough presentation of regression analysis. We begin in this chapter by presenting simple linear regression analysis. Using this technique is appropriate when we are relating a dependent variable to a single independent variable and when a straight-line model describes the relationship between these two variables. We explain many of the methods of this chapter in the context of two new cases:

The Fuel Consumption Case: A management consulting firm uses simple linear regression analysis to predict the weekly amount of fuel (in millions of cubic feet of natural gas) that will be required to heat the homes and businesses in a small city on the basis of the week's average hourly temperature. A natural gas company uses these predictions to improve its gas ordering process. One of the gas company's objectives is to reduce the fines imposed by its pipeline transmission system when the company places inaccurate natural gas orders.

The QHIC Case: The marketing department at Quality Home Improvement Center (QHIC) uses simple linear regression analysis to predict home upkeep expenditure on the basis of home value. Predictions of home upkeep expenditures are used to help determine which homes should be sent advertising brochures promoting QHIC's products and services.

11.1 ■ The Simple Linear Regression Model
The simple linear regression model assumes that the relationship between the dependent variable, which is denoted y, and the independent variable, denoted x, can be approximated by a straight line. We can tentatively decide whether there is an approximate straight-line relationship between y and x by making a scatter diagram, or scatter plot, of y versus x. First, data concerning the two variables are observed in pairs. To construct the scatter plot, each value of y is plotted against its corresponding value of x. If the y values tend to increase or decrease in a straight-line fashion as the x values increase, and if there is a scattering of the (x, y) points around the straight line, then it is reasonable to describe the relationship between y and x by using the simple linear regression model. We illustrate this in the following case study, which shows how regression analysis can help a natural gas company improve its gas ordering process.
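A scatter plot like the ones used throughout this chapter takes only a few lines of code to produce. The following is a minimal Python sketch, assuming the matplotlib library is available; it uses the eight fuel consumption observations introduced in the case study below.

```python
# A minimal scatter diagram of y versus x (assumes matplotlib is installed).
# The (x, y) pairs are the fuel consumption observations used later in the chapter.
import matplotlib.pyplot as plt

x = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]  # average hourly temperature
y = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]      # weekly fuel consumption

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Scatter plot of y versus x")
plt.show()
```

If the plotted points scatter around a straight line, as they do here, the simple linear regression model described in this section is a reasonable tentative choice.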


EXAMPLE 11.1 The Fuel Consumption Case: Reducing Natural Gas Transmission Fines
When the natural gas industry was deregulated in 1993, natural gas companies became responsible for acquiring the natural gas needed to heat the homes and businesses in the cities they serve. To do this, natural gas companies purchase natural gas from marketers (usually through long-term contracts) and periodically (daily, weekly, monthly, or the like) place orders for natural gas to be transmitted by pipeline transmission systems to their cities. There are hundreds of pipeline transmission systems in the United States, and many of these systems supply a large number of cities.

For instance, the map on pages 448 and 449 illustrates the pipelines of and the cities served by the Columbia Gas System.

To place an order (called a nomination) for an amount of natural gas to be transmitted to its city over a period of time (day, week, month), a natural gas company makes its best prediction of the city's natural gas needs for that period. The natural gas company then instructs its marketer(s) to deliver this amount of gas to its pipeline transmission system. If most of the natural gas companies being supplied by the transmission system can predict their cities' natural gas needs with reasonable accuracy, then the overnominations of some companies will tend to cancel the undernominations of other companies. As a result, the transmission system will probably have enough natural gas to efficiently meet the needs of the cities it supplies.

In order to encourage natural gas companies to make accurate transmission nominations and to help control costs, pipeline transmission systems charge, in addition to their usual fees, transmission fines. A natural gas company is charged a transmission fine if it substantially undernominates natural gas, which can lead to an excessive number of unplanned transmissions, or if it substantially overnominates natural gas, which can lead to excessive storage of unused gas. Typically, pipeline transmission systems allow a certain percentage nomination error before they impose a fine. For example, some systems do not impose a fine unless the actual amount of natural gas used by a city differs from the nomination by more than 10 percent. Beyond the allowed percentage nomination error, fines are charged on a sliding scale—the larger the nomination error, the larger the transmission fine. Furthermore, some transmission systems evaluate nomination errors and assess fines more often than others. For instance, some transmission systems do this as frequently as daily, while others do this weekly or monthly (this frequency depends on the number of storage fields to which the transmission system has access, the system's accounting practices, and other factors). In any case, each natural gas company needs a way to accurately predict its city's natural gas needs so it can make accurate transmission nominations.

Suppose we are analysts in a management consulting firm. The natural gas company serving a small city has hired the consulting firm to develop an accurate way to predict the amount of fuel (in millions of cubic feet—MMcf—of natural gas) that will be required to heat the city. Because the pipeline transmission system supplying the city evaluates nomination errors and assesses fines weekly, the natural gas company wants predictions of future weekly fuel consumptions.¹ Moreover, since the pipeline transmission system allows a 10 percent nomination error before assessing a fine, the natural gas company would like the actual and predicted weekly fuel consumptions to differ by no more than 10 percent. Our experience suggests that weekly fuel consumption substantially depends on the average hourly temperature (in degrees Fahrenheit) measured in the city during the week. Therefore, we will try to predict the dependent (response) variable weekly fuel consumption (y) on the basis of the independent (predictor) variable average hourly temperature (x) during the week. To this end, we observe values of y and x for eight weeks. The data are given in Table 11.1. In Figure 11.1 we give an Excel output of a scatter plot of y versus x. This plot shows
1 A tendency for the fuel consumption to decrease in a straight-line fashion as the temperatures increase.
2 A scattering of points around the straight line.

A regression model describing the relationship between y and x must represent these two characteristics. We now develop such a model.² We begin by considering a specific average hourly temperature x. For example, consider the average hourly temperature 28°F, which was observed in week 1, or consider the average hourly temperature 45.9°F, which was observed in week 5 (there is nothing special about these two average hourly temperatures, but we will use them throughout this example to help explain the idea of a regression model). For the specific average hourly temperature x that we consider, there are, in theory, many weeks that could have this temperature.

¹ For whatever period of time a transmission system evaluates nomination errors and charges fines, a natural gas company is free to actually make nominations more frequently. Sometimes this is a good strategy, but we will not further discuss it.

² Generally, the larger the sample size is—that is, the more combinations of values of y and x that we have observed—the more accurately we can describe the relationship between y and x. Therefore, as the natural gas company observes values of y and x in future weeks, the new data should be added to the data in Table 11.1.

However, although these weeks each have the same average hourly temperature, the weeks could have different fuel consumptions. This is because other factors that affect fuel consumption could vary from week to week. For example, these weeks might have different average hourly wind velocities, different thermostat settings, and so forth. It follows that there is a population of weekly fuel consumptions that could be observed when the average hourly temperature is x. Furthermore, this population has a mean, which we denote as μy|x (pronounced mu of y given x).

TABLE 11.1 The Fuel Consumption Data  FuelCon1

Week   Average Hourly Temperature, x (°F)   Weekly Fuel Consumption, y (MMcf)
1      28.0                                 12.4
2      28.0                                 11.7
3      32.5                                 12.4
4      39.0                                 10.8
5      45.9                                 9.4
6      57.8                                 9.5
7      58.1                                 8.0
8      62.5                                 7.5

[Figure 11.1: Excel output of a scatter plot of y (FUELCONS) versus x (TEMP).]

We can represent the straight-line tendency we observe in Figure 11.1 by assuming that μy|x is related to x by the equation

μy|x = β0 + β1x

This equation is the equation of a straight line with y-intercept β0 (pronounced beta zero) and slope β1 (pronounced beta one). To better understand the straight line and the meanings of β0 and β1, we must first realize that the values of β0 and β1 determine the precise value of the mean weekly fuel consumption μy|x that corresponds to a given value of the average hourly temperature x. We cannot know the true values of β0 and β1, and in the next section we learn how to estimate these values. However, for illustrative purposes, let us suppose that the true value of β0 is 15.77 and the true value of β1 is −.1281. It would then follow, for example, that the mean of the population of all weekly fuel consumptions that could be observed when the average hourly temperature is 28°F is

μy|28 = β0 + β1(28) = 15.77 − .1281(28) = 12.18 MMcf of natural gas

As another example, it would also follow that the mean of the population of all weekly fuel consumptions that could be observed when the average hourly temperature is 45.9°F is

μy|45.9 = β0 + β1(45.9) = 15.77 − .1281(45.9) = 9.89 MMcf of natural gas

Note that, as the average hourly temperature increases from 28°F to 45.9°F, mean weekly fuel consumption decreases from 12.18 MMcf to 9.89 MMcf of natural gas. This makes sense because we would expect to use less fuel if the average hourly temperature increases. Of course, because we do not know the true values of β0 and β1, we cannot actually calculate these mean weekly fuel consumptions. However, when we learn in the next section how to estimate β0 and β1, we will then be able to estimate the mean weekly fuel consumptions. For now, when we say that the equation μy|x = β0 + β1x is the equation of a straight line, we mean that the different mean weekly fuel consumptions that correspond to different average hourly temperatures lie exactly on a straight line. For example, consider the eight mean weekly fuel consumptions that correspond to the eight average hourly temperatures in Table 11.1. In Figure 11.2(a) we depict these mean weekly fuel consumptions as triangles that lie exactly on the straight line defined by the equation μy|x = β0 + β1x.
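The line of means is easy to evaluate once values for β0 and β1 are supposed. Here is a minimal Python sketch using the illustrative values assumed above (these are suppositions for the sake of the example, not estimates computed from data):

```python
# The line of means under the illustrative parameter values supposed in the text.
beta0 = 15.77     # supposed true y-intercept (illustration only)
beta1 = -0.1281   # supposed true slope (illustration only)

def mean_fuel_consumption(x):
    """Mean weekly fuel consumption (MMcf) when average hourly temperature is x (°F)."""
    return beta0 + beta1 * x

print(round(mean_fuel_consumption(28.0), 2))   # 12.18
print(round(mean_fuel_consumption(45.9), 2))   # 9.89
```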

[Map, pages 448–449: the pipelines of the Columbia Gas System and the communities served by Columbia companies and by companies supplied by Columbia, along with the corporate headquarters, the Cove Point LNG terminal, storage fields, distribution service territory, and independent power projects. Source: Columbia Gas System 1995 Annual Report. Reprinted courtesy of Columbia Gas System.]

Furthermore, in this figure we draw arrows pointing to the triangles that represent the previously discussed means μy|28 and μy|45.9. Sometimes we refer to the straight line defined by the equation μy|x = β0 + β1x as the line of means.

[Figure 11.2: The Simple Linear Regression Model Relating Weekly Fuel Consumption (y) to Average Hourly Temperature (x). Panel (a) shows the line of means and the error terms; panel (b) shows the slope of the line of means, the change in mean weekly fuel consumption that is associated with a one-degree increase in average hourly temperature; panel (c) shows the y-intercept of the line of means, the mean weekly fuel consumption when the average hourly temperature is 0°F.]

In order to interpret the slope β1 of the line of means, consider two different weeks. Suppose that for the first week the average hourly temperature is c. The mean weekly fuel consumption for all such weeks is

β0 + β1(c)

For the second week, suppose that the average hourly temperature is (c + 1). The mean weekly fuel consumption for all such weeks is

β0 + β1(c + 1)

It is easy to see that the difference between these mean weekly fuel consumptions is β1. Thus, as illustrated in Figure 11.2(b), the slope β1 is the change in mean weekly fuel consumption that is associated with a one-degree increase in average hourly temperature. To interpret the meaning of the y-intercept β0, consider a week having an average hourly temperature of 0°F.

The mean weekly fuel consumption for all such weeks is

β0 + β1(0) = β0

Therefore, as illustrated in Figure 11.2(c), the y-intercept β0 is the mean weekly fuel consumption when the average hourly temperature is 0°F. However, because we have not observed any weeks with temperatures near 0, we have no data to tell us what the relationship between mean weekly fuel consumption and average hourly temperature looks like for temperatures near 0. Therefore, the interpretation of β0 is of dubious practical value. More will be said about this later.

Now recall that the observed weekly fuel consumptions are not exactly on a straight line. Rather, they are scattered around a straight line. To represent this phenomenon, we use the simple linear regression model

y = μy|x + ε = β0 + β1x + ε

This model says that the weekly fuel consumption y observed when the average hourly temperature is x differs from the mean weekly fuel consumption μy|x by an amount equal to ε (pronounced epsilon). Here ε is called an error term. The error term describes the effect on y of all factors other than the average hourly temperature. Such factors would include the average hourly wind velocity and the average hourly thermostat setting in the city. For example, Figure 11.2(a) shows that the error term for the first week is positive. Therefore, the observed fuel consumption y = 12.4 in the first week was above the corresponding mean weekly fuel consumption for all weeks when x = 28. As another example, Figure 11.2(a) also shows that the error term for the fifth week was negative. Therefore, the observed fuel consumption y = 9.4 in the fifth week was below the corresponding mean weekly fuel consumption for all weeks when x = 45.9. More generally, Figure 11.2(a) illustrates that the simple linear regression model says that the eight observed fuel consumptions (the dots in the figure) deviate from the eight mean fuel consumptions (the triangles in the figure) by amounts equal to the error terms (the line segments in the figure). Of course, since we do not know the true values of β0 and β1, the relative positions of the quantities pictured in the figure are only hypothetical.

With the fuel consumption example as background, we are ready to define the simple linear regression model relating the dependent variable y to the independent variable x. We suppose that we have gathered n observations—each observation consists of an observed value of x and its corresponding value of y. Then:

The Simple Linear Regression Model

The simple linear (or straight line) regression model is

y = μy|x + ε = β0 + β1x + ε

Here
1 μy|x = β0 + β1x is the mean value of the dependent variable y when the value of the independent variable is x.
2 β0 is the y-intercept. β0 is the mean value of y when x equals 0.³
3 β1 is the slope. β1 is the change (amount of increase or decrease) in the mean value of y associated with a one-unit increase in x. If β1 is positive, the mean value of y increases as x increases. If β1 is negative, the mean value of y decreases as x increases.
4 ε is an error term that describes the effects on y of all factors other than the value of the independent variable x.

This model is illustrated in Figure 11.3 (note that x0 in this figure denotes a specific value of the independent variable x). The y-intercept β0 and the slope β1 are called regression parameters. Because we do not know the true values of these parameters, we must use the sample data to

³ As implied by the discussion of Example 11.1, if we have not observed any values of x near 0, this interpretation is of dubious practical value.
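Because the model is generative, it can be simulated. The following short Python sketch draws weekly fuel consumptions from y = β0 + β1x + ε, again using the illustrative parameter values from this section; the error standard deviation of 0.5 is purely an assumption for illustration (the error term's distribution is not specified until the assumptions of Section 11.3):

```python
# Simulating the simple linear regression model y = beta0 + beta1*x + epsilon.
# Parameter values and the error standard deviation are assumptions for illustration.
import random

beta0, beta1, sigma = 15.77, -0.1281, 0.5
random.seed(1)

for x in [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]:
    epsilon = random.gauss(0.0, sigma)   # error term: effects of all factors other than x
    y = beta0 + beta1 * x + epsilon      # observed value scatters about the line of means
    print(f"x = {x:4.1f}   mean = {beta0 + beta1 * x:5.2f}   y = {y:5.2f}")
```

Rerunning the sketch with a different seed produces a different set of error terms, and hence a different set of observed y values scattered about the same line of means.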

estimate these values. We see how this is done in the next section. In later sections we show how to use these estimates to predict y.

[Figure 11.3: The Simple Linear Regression Model (here β1 > 0). The figure shows the straight line defined by the equation μy|x = β0 + β1x, its y-intercept β0 and slope β1 (the change in the mean value of y associated with a one-unit change in x), the mean value of y when x equals x0, an observed value of y when x equals x0, and the error term separating the two. Here x0 denotes a specific value of the independent variable x.]

The fuel consumption data in Table 11.1 were observed sequentially over time (in eight consecutive weeks). When data are observed in time sequence, the data are called time series data. Many applications of regression utilize such data. Another frequently used type of data is called cross-sectional data. This kind of data is observed at a single point in time.

EXAMPLE 11.2 The QHIC Case

Quality Home Improvement Center (QHIC) operates five stores in a large metropolitan area. The marketing department at QHIC wishes to study the relationship between x, home value (in thousands of dollars), and y, yearly expenditure on home upkeep (in dollars). A random sample of 40 homeowners is taken and asked to estimate their expenditures during the previous year on the types of home upkeep products and services offered by QHIC. Public records of the county auditor are used to obtain the previous year's assessed values of the homeowners' homes. The resulting x and y values are given in Table 11.2. Because the 40 observations are for the same year (for different homes), these data are cross-sectional.

The MINITAB output of a scatter plot of y versus x is given in Figure 11.4. We see that the observed values of y tend to increase in a straight-line (or slightly curved) fashion as x increases. Assuming that μy|x and x have a straight-line relationship, it is reasonable to relate y to x by using the simple linear regression model having a positive slope (β1 > 0)

y = β0 + β1x + ε

The slope β1 is the change (increase) in mean dollar yearly upkeep expenditure that is associated with each $1,000 increase in home value. In later examples the marketing department at QHIC will use predictions given by this simple linear regression model to help determine which homes should be sent advertising brochures promoting QHIC's products and services.

We have interpreted the slope β1 of the simple linear regression model to be the change in the mean value of y associated with a one-unit increase in x. We sometimes refer to this change as the effect of the independent variable x on the dependent variable y. However, we cannot prove that a change in an independent variable causes a change in the dependent variable.

Rather, regression can be used only to establish that the two variables move together and that the independent variable contributes information for predicting the dependent variable. For instance, regression analysis might be used to establish that as liquor sales have increased over the years, college professors' salaries have also increased. However, this does not prove that increases in liquor sales cause increases in college professors' salaries. Rather, both variables are influenced by a third variable—long-run growth in the national economy.

[Table 11.2: The QHIC Upkeep Expenditure Data (QHIC): the value of the home, x (thousands of dollars), and the upkeep expenditure, y (dollars), for each of the 40 sampled homes. The home values range from 48.90 to 286.18.]

[Figure 11.4: MINITAB plot of upkeep expenditure (UPKEEP, 0 to 2000) versus value of home (VALUE, 100 to 300) for the QHIC data.]

Exercises for Section 11.1

CONCEPTS

11.1 When does the scatter plot of the values of a dependent variable y versus the values of an independent variable x suggest that the simple linear regression model

y = μy|x + ε = β0 + β1x + ε

might appropriately relate y to x?

11.2 In the simple linear regression model, define the meanings of the slope β1 and the y-intercept β0.

11.3 In the simple linear regression model, what are y, μy|x, and ε?

11.4 What is the difference between time series data and cross-sectional data?

METHODS AND APPLICATIONS

11.5 THE STARTING SALARY CASE  StartSal

The chairman of the marketing department at a large state university undertakes a study to relate starting salary (y) after graduation for marketing majors to grade point average (GPA) in major courses. To do this, records of seven recent marketing graduates are randomly selected. The data are as follows:

Marketing Graduate   GPA, x   Starting Salary, y (Thousands of Dollars)
1                    3.26     33.8
2                    2.60     29.8
3                    3.35     33.5
4                    2.86     30.4
5                    3.82     36.4
6                    2.21     27.6
7                    3.47     35.3

Using the scatter plot (from MINITAB) of y versus x, explain why the simple linear regression model

y = μy|x + ε = β0 + β1x + ε

might appropriately relate y to x.

[Scatter plot (MINITAB): starting salary (27 to 37) versus GPA (2 to 4).]

11.6 THE STARTING SALARY CASE  StartSal

Consider the simple linear regression model describing the starting salary data of Exercise 11.5.
a Explain the meaning of μy|x=4.00 = β0 + β1(4.00).
b Explain the meaning of μy|x=2.50 = β0 + β1(2.50).
c Interpret the meaning of the slope parameter β1.
d Interpret the meaning of the y-intercept β0. Why does this interpretation fail to make practical sense?
e The error term ε describes the effects of many factors on starting salary y. What are these factors? Give two specific examples.

11.7 THE SERVICE TIME CASE  SrvcTime

Accu-Copiers, Inc., sells and services the Accu-500 copying machine. As part of its standard service contract, the company agrees to perform routine service on this copier. To obtain information about the time it takes to perform routine service, Accu-Copiers has collected data for 11 service calls. The data are as follows:

Service Call   Number of Copiers Serviced, x   Number of Minutes Required, y
1              4                               109
2              2                               58
3              5                               138
4              7                               189
5              1                               37
6              3                               82
7              4                               103
8              5                               134
9              2                               68
10             4                               112
11             6                               154

[Scatter plot: minutes (0 to 200) versus copiers (0 to 8).]

11.8 THE SERVICE TIME CASE  SrvcTime

Consider the simple linear regression model describing the service time data in Exercise 11.7.
a Explain the meaning of μy|x=4 = β0 + β1(4).
b Explain the meaning of μy|x=6 = β0 + β1(6).
c Interpret the meaning of the slope parameter β1.
d Interpret the meaning of the y-intercept β0. Does this interpretation make practical sense?
e The error term ε describes the effects of many factors on service time. What are these factors? Give two specific examples.

11.9 THE FRESH DETERGENT CASE  Fresh

Enterprise Industries produces Fresh, a brand of liquid laundry detergent. In order to study the relationship between price and demand for the large bottle of Fresh, the company has gathered data concerning demand for Fresh over the last 30 sales periods (each sales period is four weeks). Here, for each sales period,

y = demand for the large bottle of Fresh (in hundreds of thousands of bottles) in the sales period
x1 = the price (in dollars) of Fresh as offered by Enterprise Industries in the sales period
x2 = the average industry price (in dollars) of competitors' similar detergents in the sales period
x4 = x2 − x1 = the "price difference" in the sales period

Note: We denote the "price difference" as x4 (rather than, for example, x3) to be consistent with other notation to be introduced in the Fresh detergent case in Chapter 12.

[Table: Fresh Detergent Demand Data (Fresh). For each of the 30 sales periods, the table lists x1, x2, the price difference x4 = x2 − x1, and the demand y.]

Using the scatter plot (from MINITAB) of y versus x4 shown below, discuss why the simple linear regression model might appropriately relate y to x4.

[Scatter plot (MINITAB): demand (roughly 7.0 to 9.0) versus price difference PriceDif (roughly −0.2 to 0.6).]

y. 11.5 shows a line that has been visually fitted to the plot of the . But how do we fit the best straight line? One approach would be to simply “eyeball” a line through the points. Chicago. a Construct a scatter plot of y versus x.9 193. Illinois. y ($100s) 71 663 381 138 861 145 493 548 251 1024 435 772 Chapter 11 11. The scatter plot of y (fuel consumption) versus x (average hourly temperature) in Figure 11.5 141 165. Copyright 1986 by the Appraisal Institute.Bowerman−O’Connell: Business Statistics in Practice. we begin with a simple example. 2003 456 Direct Labor Cost DirLab Data Direct Labor Cost. EXAMPLE 11.1 173. Does this explanation make practical sense? e What factors are represented by the error term in this model? Give two specific examples. d Explain the meaning of the intercept b0.13 THE REAL ESTATE SALES PRICE CASE RealEst A real estate agency collects data concerning y the sales price of a house (in thousands of dollars). c Explain the meaning of the slope parameter b1. Then we could read the y-intercept and slope off the visually fitted line and use these values as the estimates of b0 and b1.1. b Explain the meaning of my x 30 b0 b1(30). and the Fresh demand data of Exercise 11.05). Third Edition 11. 11. Real Estate Sales Price Data RealEst Sales Price (y) 180 98.3 The Fuel Consumption Case C Consider the fuel consumption problem of Example 11. c Explain the meaning of the slope parameter b1. b Explain the meaning of my x4 . Simple Linear Regression Analysis Text © The McGraw−Hill Companies. Does this explanation make practical sense? e What factors are represented by the error term in this model? Give two specific examples of these factors. to the price difference. d Explain the meaning of the intercept b0. x 5 62 35 12 83 14 46 52 23 100 41 75 Consider the simple linear regression model relating demand. We now wish to use the data in Table 11. a Explain the meaning of my x4 . To do this.10 b0 b1(. 11. 11.10). Figure 11.5 127. To see how this is done. For example.10 THE FRESH DETERGENT CASE Fresh Simple Linear Regression Analysis Batch Size. x4.5 172. c Explain the meaning of the slope parameter b1. it is necessary to use observed data to compute estimates of these regression parameters. Therefore. d Explain the meaning of the intercept b0. b Discuss whether the scatter plot suggests that a simple linear regression model might appropriately relate y to x.1 to estimate the intercept b0 and the slope b1 of the line of means. b Explain the meaning of my x 18 b0 b1(18).2 ■ The Least Squares Estimates.9.12 THE DIRECT LABOR COST CASE DirLab Consider the simple linear regression model describing the direct labor cost data of Exercise 11.13. Data for 12 production runs are given in the table in the margin. and Point Estimation and Prediction CHAPTER 15 The true values of the y-intercept (b0) and slope (b1) in the simple linear regression model are unknown. 11. it might be reasonable to estimate the line of means by “fitting” the “best” straight line to the plotted data in Figure 11.05 b0 b1( .5 Home Size (x) 23 11 20 17 15 21 24 13 19 25 Source: Reprinted with permission from The Real Estate Appraiser and Analyst Spring 1986 issue. a Explain the meaning of my x 60 b0 b1(60).1.11. The data are given in the margin.1 suggests that the simple linear regression model appropriately relates y to x.8 163. a Construct a scatter plot of y versus x.14 THE REAL ESTATE SALES PRICE CASE RealEst Consider the simple linear regression model describing the sales price data of Exercise 11. 
b Discuss whether the scatter plot suggests that a simple linear regression model might appropriately relate y to x. Does this explanation make practical sense? e What factors are represented by the error term in this model? Give two specific examples. a Explain the meaning of my x 20 b0 b1(20). and x the home size (in hundreds of square feet).11 THE DIRECT LABOR COST CASE DirLab An accountant wishes to predict direct labor cost (y) on the basis of the batch size (x) of a product produced in a job shop.1 136.

[Figure 11.5: Visually Fitting a Line to the Fuel Consumption Data. Figure 11.6: Using the Visually Fitted Line to Predict When x = 28.]

We see that this line intersects the y axis at y = 15, so the y-intercept of the line is 15. In addition, the figure shows that the slope of the line is

change in y / change in x = (12.8 − 13.8)/(20 − 10) = −.1

Therefore, we estimate that β0 is 15 and that β1 is −.1. Now consider using the visually fitted line to predict weekly fuel consumption. Denoting such a prediction as ŷ (pronounced y hat), a prediction of weekly fuel consumption when average hourly temperature is x is

ŷ = 15 − .1x

For instance, when temperature is 28°F, predicted fuel consumption is

ŷ = 15 − .1(28) = 12.2

Here ŷ is simply the point on the visually fitted line corresponding to x = 28 (see Figure 11.6). In order to evaluate how "good" our point estimates of β0 and β1 are, we can evaluate how well the visually determined line fits the points on the scatter plot.

We do this by comparing each observed value of y with the corresponding predicted value of y given by the fitted line and computing the deviation y − ŷ. The deviations (or prediction errors) are the vertical distances between the observed y values and the predictions obtained using the fitted line. For instance, looking at the first observation in Table 11.1 (page 447), we observed y = 12.4 and x = 28.0. Since the predicted fuel consumption when x equals 28 is ŷ = 12.2, the deviation y − ŷ equals 12.4 − 12.2 = .2. This deviation is illustrated in Figure 11.6.

To obtain an overall measure of the quality of the fit, we compute the sum of squared deviations or sum of squared errors, denoted SSE. Table 11.3 gives the values of y, x, ŷ, and y − ŷ for each observation in Table 11.1, along with the squared deviations and the SSE for our visually fitted line. We find that SSE = 4.8796.

TABLE 11.3 Calculation of SSE for a Line Visually Fitted to the Fuel Consumption Data

x      y      ŷ = 15 − .1x   y − ŷ    (y − ŷ)²
28.0   12.4   12.2           .2       .04
28.0   11.7   12.2           −.5      .25
32.5   12.4   11.75          .65      .4225
39.0   10.8   11.1           −.3      .09
45.9   9.4    10.41          −1.01    1.0201
57.8   9.5    9.22           .28      .0784
58.1   8.0    9.19           −1.19    1.4161
62.5   7.5    8.75           −1.25    1.5625
                                      SSE = 4.8796

Clearly, the line shown in Figure 11.5 is not the only line that could be fitted to the observed fuel consumption data. Different people would obtain somewhat different visually fitted lines. However, it can be shown that there is exactly one line that gives a value of SSE that is smaller than the value of SSE that would be given by any other line that could be fitted to the data. This line is called the least squares regression line or the least squares prediction equation.

To show how to find the least squares line, we first write the general form of a straight-line prediction equation as

ŷ = b0 + b1x

Here b0 (pronounced b zero) is the y-intercept and b1 (pronounced b one) is the slope of the line. In addition, ŷ denotes the predicted value of the dependent variable when the value of the independent variable is x. Now suppose we have collected n observations (x1, y1), (x2, y2), . . . , (xn, yn). If we consider a particular observation (xi, yi), the predicted value of yi is

ŷi = b0 + b1xi

Furthermore, the prediction error (also called the residual) for this observation is

ei = yi − ŷi = yi − (b0 + b1xi)

Then the least squares line is the line that minimizes the sum of the squared prediction errors (that is, the sum of squared residuals):

SSE = Σ (yi − (b0 + b1xi))²

To find this line, we find the values of the y-intercept b0 and the slope b1 that minimize SSE. These values of b0 and b1 are called the least squares point estimates of β0 and β1.
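The SSE computation in Table 11.3 is purely mechanical. A minimal Python sketch for the visually fitted line ŷ = 15 − .1x:

```python
# SSE for the visually fitted line y_hat = 15 - 0.1x (reproducing Table 11.3).
x = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
y = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]

b0, b1 = 15.0, -0.1                            # eyeballed estimates
deviations = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(d ** 2 for d in deviations)
print(round(sse, 4))                           # 4.8796
```

Trying other values of b0 and b1 in this sketch shows how SSE changes as the fitted line moves; the least squares line developed next is the one line that makes SSE as small as possible.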

Using calculus, it can be shown that these estimates are calculated as follows:⁴

The Least Squares Point Estimates

For the simple linear regression model:

1 The least squares point estimate of the slope β1 is

b1 = SSxy / SSxx

where

SSxy = Σ(xi − x̄)(yi − ȳ) = Σxiyi − (Σxi)(Σyi)/n   and   SSxx = Σ(xi − x̄)² = Σxi² − (Σxi)²/n

2 The least squares point estimate of the y-intercept β0 is

b0 = ȳ − b1x̄

where

ȳ = Σyi/n   and   x̄ = Σxi/n

Here n is the number of observations (an observation is an observed value of x and its corresponding value of y).

⁴ In order to simplify notation, we will often drop the limits on summations in this and subsequent chapters. That is, instead of writing the summation from i = 1 to n, we will simply write Σ.

The following example illustrates how to calculate these point estimates and how to use these point estimates to estimate mean values and predict individual values of the dependent variable. Note that the quantities SSxy and SSxx used to calculate the least squares point estimates are also used throughout this chapter to perform other important calculations.

EXAMPLE 11.4 The Fuel Consumption Case

Part 1: Calculating the least squares point estimates Again consider the fuel consumption problem. To compute the least squares point estimates of the regression parameters β0 and β1, we first calculate the following preliminary summations:

yi: 12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5; Σyi = 81.7
xi: 28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5; Σxi = 351.8
xi²: 784, 784, 1,056.25, 1,521, 2,106.81, 3,340.84, 3,375.61, 3,906.25; Σxi² = 16,874.76
xiyi: (28.0)(12.4) = 347.2, (28.0)(11.7) = 327.6, (32.5)(12.4) = 403, (39.0)(10.8) = 421.2, (45.9)(9.4) = 431.46, (57.8)(9.5) = 549.1, (58.1)(8.0) = 464.8, (62.5)(7.5) = 468.75; Σxiyi = 3,413.11

Using these summations, we calculate SSxy and SSxx as follows:

SSxy = Σxiyi − (Σxi)(Σyi)/n = 3,413.11 − (351.8)(81.7)/8 = −179.6475

SSxx = Σxi² − (Σxi)²/n = 16,874.76 − (351.8)²/8 = 1,404.355
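A minimal Python sketch of the formulas in the box above, applied to the fuel consumption data; it reproduces these summations and the estimates derived in the remainder of this example:

```python
# Least squares point estimates for the fuel consumption data (Example 11.4).
x = [28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5]
y = [12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5]
n = len(x)

ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
ss_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
b1 = ss_xy / ss_xx                   # least squares slope estimate
b0 = sum(y) / n - b1 * sum(x) / n    # least squares intercept estimate

print(round(ss_xy, 4), round(ss_xx, 3))   # -179.6475 1404.355
print(round(b1, 4), round(b0, 2))         # -0.1279 15.84
```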

It follows that the least squares point estimate of the slope β1 is

b1 = SSxy / SSxx = −179.6475 / 1,404.355 = −.1279

Furthermore, because

ȳ = Σyi/n = 81.7/8 = 10.2125   and   x̄ = Σxi/n = 351.8/8 = 43.98

the least squares point estimate of the y-intercept β0 is

b0 = ȳ − b1x̄ = 10.2125 − (−.1279)(43.98) = 15.84

Since b1 = −.1279, we estimate that mean weekly fuel consumption decreases (since b1 is negative) by .1279 MMcf of natural gas when average hourly temperature increases by 1 degree. Since b0 = 15.84, we estimate that mean weekly fuel consumption is 15.84 MMcf of natural gas when average hourly temperature is 0°F. However, we have not observed any weeks with temperatures near 0, so making this interpretation of b0 might be dangerous. We discuss this point more fully after this example.

Table 11.4 gives predictions of fuel consumption for each observed week obtained by using the least squares line (or prediction equation)

ŷ = b0 + b1x = 15.84 − .1279x

The table also gives each of the residuals and squared residuals and the sum of squared residuals (SSE = 2.5680112) obtained by using this prediction equation.

TABLE 11.4 Calculation of SSE Obtained by Using the Least Squares Point Estimates

xi     yi     ŷi = 15.84 − .1279xi   yi − ŷi (residual)   (yi − ŷi)²
28.0   12.4   12.2588                .1412                .0199374
28.0   11.7   12.2588                −.5588               .3122574
32.5   12.4   11.68325               .71675               .5137306
39.0   10.8   10.8519                −.0519               .0026936
45.9   9.4    9.96939                −.56939              .3242050
57.8   9.5    8.44738                1.05262              1.1080089
58.1   8.0    8.40901                −.40901              .1672892
62.5   7.5    7.84625                −.34625              .1198891
                                     SSE = 2.5680112

Notice that the SSE here, which was obtained using the least squares point estimates, is smaller than the SSE of Table 11.3, which was obtained using the visually fitted line ŷ = 15 − .1x. In general, it can be shown that the SSE obtained by using the least squares point estimates is smaller than the value of SSE that would be obtained by using any other estimates of β0 and β1. In this sense, the least squares line is the best straight line that can be fitted to the eight observed fuel consumptions. When we say that the least squares point estimates minimize SSE, we are saying that these estimates position the least squares line so as to minimize the sum of the squared distances between the observed and predicted fuel consumptions.

Figure 11.7(a) illustrates the eight observed fuel consumptions (the dots in the figure) and the eight predicted fuel consumptions (the squares in the figure) given by the least squares line. The distances between the observed and predicted fuel consumptions are the residuals. Figure 11.7(b) gives the MINITAB output of this best fit line. Note that this output gives the least squares estimates b0 = 15.8379 and b1 = −.127922. In general, we will rely on MINITAB, Excel, and MegaStat to compute the least squares estimates (and to perform many other regression calculations).

[Figure 11.7: The Least Squares Line for the Fuel Consumption Data. Panel (a) shows the observed and predicted fuel consumptions, the residuals, and the least squares line ŷ = 15.84 − .1279x; panel (b) shows the MINITAB output of the best fit line: Y = 15.8379 − 0.127922X, R-Squared = 0.899.]

Part 2: Estimating a mean fuel consumption and predicting an individual fuel consumption We define the experimental region to be the range of the previously observed values of the average hourly temperature x. Because we have observed average hourly temperatures between 28°F and 62.5°F (see Table 11.1), the experimental region consists of the range of average hourly temperatures from 28°F to 62.5°F. The simple linear regression model relates weekly fuel consumption y to average hourly temperature x for values of x that are in the experimental region. For such values of x, the least squares line is the estimate of the line of means. This implies that the point on the least squares line that corresponds to the average hourly temperature x

ŷ = b0 + b1x = 15.84 − .1279x

is the point estimate of the mean of all the weekly fuel consumptions that could be observed when the average hourly temperature is x:

μy|x = β0 + β1x

Note that ŷ is an intuitively logical point estimate of μy|x. This is because the expression b0 + b1x used to calculate ŷ has been obtained from the expression β0 + β1x for μy|x by replacing the unknown values of β0 and β1 by their least squares point estimates b0 and b1.

The quantity ŷ is also the point prediction of the individual value

y = β0 + β1x + ε

which is the amount of fuel consumed in a single week when average hourly temperature equals x. To understand why ŷ is the point prediction of y, note that y is the sum of the mean β0 + β1x and the error term ε. We have already seen that ŷ = b0 + b1x is the point estimate of β0 + β1x. We will now reason that we should predict the error term ε to be 0, which implies that ŷ is also the point prediction of y. To see why we should predict the error term to be 0, note that in the next section we discuss several assumptions concerning the simple linear regression model. One implication of these assumptions is that the error term has a 50 percent chance of being positive and a 50 percent chance of being negative. Therefore, it is reasonable to predict the error term to be 0 and to use ŷ as the point prediction of a single value of y when the average hourly temperature equals x.

Now suppose a weather forecasting service predicts that the average hourly temperature in the next week will be 40°F.

Because 40°F is in the experimental region,

ŷ = 15.84 − .1279(40) = 10.72 MMcf of natural gas

is (1) the point estimate of the mean weekly fuel consumption when the average hourly temperature is 40°F and (2) the point prediction of an individual weekly fuel consumption when the average hourly temperature is 40°F.
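One way to operationalize this logic is sketched below in Python; the explicit guard against values of x outside the experimental region is a design choice of this sketch, anticipating the discussion of extrapolation that follows:

```python
# Point estimation/prediction restricted to the experimental region (28 to 62.5 °F).
B0, B1 = 15.84, -0.1279          # least squares point estimates
X_MIN, X_MAX = 28.0, 62.5        # experimental region for the fuel data

def predict_fuel(x0):
    if not X_MIN <= x0 <= X_MAX:
        raise ValueError("x0 lies outside the experimental region")
    # Point estimate of mean fuel consumption and, predicting the error term
    # to be 0, point prediction of an individual weekly fuel consumption.
    return B0 + B1 * x0

print(round(predict_fuel(40.0), 2))   # 10.72
```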

This says that (1) we estimate that the average of all possible weekly fuel consumptions that could potentially be observed when the average hourly temperature is 40°F equals 10.72 MMcf of natural gas, and (2) we predict that the fuel consumption in a single week when the average hourly temperature is 40°F will be 10.72 MMcf of natural gas.

[Figure 11.8: Point Estimation and Point Prediction in the Fuel Consumption Problem. The figure shows the least squares line ŷ = 15.84 − .1279x and the true line of means μy|x = β0 + β1x over the experimental region (x = 28 to x = 62.5), along with the point estimate of mean fuel consumption when x = 40 (ŷ = 10.72), the true mean fuel consumption when x = 40, and an individual value of fuel consumption when x = 40.]

[Figure 11.9: The Danger of Extrapolation Outside the Experimental Region. The relationship between mean fuel consumption and x might become curved at low temperatures, so the estimated mean fuel consumption when x = 10, obtained by extrapolating the least squares line ŷ = 15.84 − .1279x, can differ badly from the true mean fuel consumption when x = 10.]

Figure 11.8 illustrates (1) the point estimate of mean fuel consumption when x is 40°F (the square on the least squares line), (2) the true mean fuel consumption when x is 40°F (the triangle on the true line of means), and (3) an individual value of fuel consumption when x is 40°F (the dot in the figure).

Of course, this figure is only hypothetical. However, it illustrates that the point estimate of the mean value of y (which is also the point prediction of the individual value of y) will (unless we are extremely fortunate) differ from both the true mean value of y and the individual value of y. Therefore, it is very likely that the point prediction ŷ = 10.72, which is the natural gas company's transmission nomination for next week, will differ from next week's actual fuel consumption. It follows that we might wish to predict the largest and smallest that y might reasonably be. We will see how to do this in Section 11.5.

To conclude this example, note that Figure 11.9 illustrates the potential danger of using the least squares line to predict outside the experimental region. In the figure, we extrapolate the least squares line far beyond the experimental region to obtain a prediction for a temperature of 10°F. As shown in Figure 11.1, for values of x in the experimental region the observed values of y tend to decrease in a straight-line fashion as the values of x increase. However, for temperatures lower than 28°F the relationship between y and x might become curved. If it does, extrapolating the straight-line prediction equation to obtain a prediction for x = 10 might badly underestimate mean weekly fuel consumption (see Figure 11.9).

The previous example illustrates that when we are using a least squares regression line, we should not estimate a mean value or predict an individual value unless the corresponding value of x is in the experimental region—the range of the previously observed values of x. Often the value x = 0 is not in the experimental region. In such a situation, it would not be appropriate to interpret the y-intercept b0 as the estimate of the mean value of y when x equals 0. For example, in the fuel consumption problem it would not be appropriate to use b0 = 15.84 as the point estimate of the mean weekly fuel consumption when average hourly temperature is 0. Therefore, because it is not meaningful to interpret the y-intercept in many regression situations, we often omit such interpretations.

We now present a general procedure for estimating a mean value and predicting an individual value:

Point Estimation and Point Prediction in Simple Linear Regression

Let b0 and b1 be the least squares point estimates of the y-intercept β0 and the slope β1 in the simple linear regression model, and suppose that x0, a specified value of the independent variable x, is inside the experimental region. Then

ŷ = b0 + b1x0

is the point estimate of the mean value of the dependent variable when the value of the independent variable is x0. In addition, predicting the error term to be 0, ŷ is the point prediction of an individual value of the dependent variable when the value of the independent variable is x0.

EXAMPLE 11.5 The QHIC Case

Consider the simple linear regression model relating yearly home upkeep expenditure, y, to home value, x. Using the data in Table 11.2 (page 453), we can calculate the least squares point estimates of the y-intercept β0 and the slope β1 to be b0 = −348.3921 and b1 = 7.2583. Since b1 = 7.2583, we estimate that mean yearly upkeep expenditure increases by $7.26 for each additional $1,000 increase in home value. Consider a home worth $220,000, and note that x0 = 220 is in the range of previously observed values of x: 48.9 to 286.18 (see Table 11.2). It follows that

ŷ = b0 + b1x0 = −348.3921 + 7.2583(220) = 1,248.43 (or $1,248.43)

is the point estimate of the mean yearly upkeep expenditure for all homes worth $220,000 and is the point prediction of a yearly upkeep expenditure for an individual home worth $220,000.
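A brief Python sketch of the QHIC point prediction, together with the inverted prediction equation used in the brochure computation that follows:

```python
# QHIC point prediction and inverse prediction (Example 11.5).
b0, b1 = -348.3921, 7.2583            # least squares point estimates

def predicted_upkeep(x0):
    """Predicted yearly upkeep expenditure ($) for a home worth x0 thousand dollars."""
    return b0 + b1 * x0

def home_value_for_upkeep(y_hat):
    """Home value (thousands of $) whose predicted upkeep expenditure is y_hat."""
    return (y_hat - b0) / b1

print(round(predicted_upkeep(220), 2))        # 1248.43
print(round(home_value_for_upkeep(500), 3))   # 116.886
```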

The marketing department at QHIC wishes to determine which homes should be sent advertising brochures promoting QHIC's products and services. The prediction equation ŷ = b0 + b1x implies that the home value x corresponding to a predicted upkeep expenditure of ŷ is

x = (ŷ − b0)/b1 = (ŷ − (−348.3921))/7.2583 = (ŷ + 348.3921)/7.2583

Therefore, for example, if QHIC wishes to send an advertising brochure to any home that has a predicted upkeep expenditure of at least $500, then QHIC should send this brochure to any home that has a value of at least

x = (500 + 348.3921)/7.2583 = 116.886 ($116,886)

Exercises for Section 11.2

CONCEPTS

11.15 What does SSE measure?

11.16 What is the least squares regression line, and what are the least squares point estimates?

11.17 How do we obtain a point estimate of the mean value of the dependent variable and a point prediction of an individual value of the dependent variable?

11.18 Why is it dangerous to extrapolate outside the experimental region?

METHODS AND APPLICATIONS

Exercises 11.19, 11.20, and 11.21 are based on the following MINITAB and Excel output. At the left is the output obtained when MINITAB is used to fit a least squares line to the starting salary data given in Exercise 11.5 (page 454). In the middle is the output obtained when Excel is used to fit a least squares line to the service time data given in Exercise 11.7 (page 454). The rightmost output is obtained when MINITAB is used to fit a least squares line to the Fresh detergent demand data given in Exercise 11.9 (page 455).

[Three fitted-line plots: a MINITAB regression plot of starting salary versus GPA (Y = 14.8156 + 5.70657X, R-Sq = 0.977); an Excel line fit plot of minutes versus copiers (Y = 11.4641 + 24.6022X); and a MINITAB regression plot of demand versus price difference PriceDif (Y = 7.81409 + 2.66522X, R-Sq = 0.792).]

11.19 THE STARTING SALARY CASE  StartSal

Using the leftmost output
a Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0 make practical sense?
b Use the least squares line to obtain a point estimate of the mean starting salary for all marketing graduates having a grade point average of 3.25 and a point prediction of the starting salary for an individual marketing graduate having a grade point average of 3.25.

11.3 11.20 THE SERVICE TIME CASE SrvcTime

Model Assumptions and the Standard Error

465

Using the middle output a Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0 make practical sense? b Use the least squares line to obtain a point estimate of the mean time to service four copiers and a point prediction of the time to service four copiers on a single call. 11.21 THE FRESH DETERGENT CASE Fresh

Using the rightmost output a Identify and interpret the least squares point estimates b0 and b1. Does the interpretation of b0 make practical sense? b Use the least squares line to obtain a point estimate of the mean demand in all sales periods when the price difference is .10 and a point prediction of the actual demand in an individual sales period when the price difference is .10. c If Enterprise Industries wishes to maintain a price difference that corresponds to a predicted demand of 850,000 bottles (that is, y 8.5), what should this price ˆ difference be? 11.22 THE DIRECT LABOR COST CASE DirLab

Consider the direct labor cost data given in Exercise 11.11 (page 456), and suppose that a simple linear regression model is appropriate. a Verify that b0 18.4880 and b1 10.1463 by using the formulas illustrated in Example 11.4 (pages 459–460). b Interpret the meanings of b0 and b1. Does the interpretation of b0 make practical sense? c Write the least squares prediction equation. d Use the least squares line to obtain a point estimate of the mean direct labor cost for all batches of size 60 and a point prediction of the direct labor cost for an individual batch of size 60. 11.23 THE REAL ESTATE SALES PRICE CASE RealEst

Consider the sales price data given in Exercise 11.13 (page 456), and suppose that a simple linear regression model is appropriate. a Verify that b0 48.02 and b1 5.7003 by using the formulas illustrated in Example 11.4 (pages 459–460). b Interpret the meanings of b0 and b1. Does the interpretation of b0 make practical sense? c Write the least squares prediction equation. d Use the least squares line to obtain a point estimate of the mean sales price of all houses having 2,000 square feet and a point prediction of the sales price of an individual house having 2,000 square feet.

11.3 ■ Model Assumptions and the Standard Error
Model assumptions In order to perform hypothesis tests and set up various types of intervals when using the simple linear regression model y my x b0 e b1x e

we need to make certain assumptions about the error term e. At any given value of x, there is a population of error term values that could potentially occur. These error term values describe the different potential effects on y of all factors other than the value of x. Therefore, these error term values explain the variation in the y values that could be observed when the independent variable is x. Our statement of the simple linear regression model assumes that my x, the mean of the population of all y values that could be observed when the independent variable is x, is b0 b1x. This model also implies that e y (b0 b1x), so this is equivalent to assuming that the mean of the corresponding population of potential error term values is 0. In total, we make four assumptions—called the regression assumptions—about the simple linear regression model.

Bowerman−O’Connell: Business Statistics in Practice, Third Edition

11. Simple Linear Regression Analysis

Text

© The McGraw−Hill Companies, 2003

466

Chapter 11

Simple Linear Regression Analysis

These assumptions can be stated in terms of potential y values or, equivalently, in terms of potential error term values. Following tradition, we begin by stating these assumptions in terms of potential error term values:

The Regression Assumptions 1 At any given value of x, the population of poten- 3
tial error term values has a mean equal to 0.

2

Constant Variance Assumption At any given value of x, the population of potential error term values has a variance that does not depend on the value of x. That is, the different populations of potential error term values corresponding to different values of x have equal variances. We denote the constant variance as 2.

Normality Assumption At any given value of x, the population of potential error term values has a normal distribution. Independence Assumption Any one value of the error term E is statistically independent of any other value of E. That is, the value of the error term E corresponding to an observed value of y is statistically independent of the value of the error term corresponding to any other observed value of y.

4

Taken together, the first three assumptions say that, at any given value of x, the population of potential error term values is normally distributed with mean zero and a variance S2 that does not depend on the value of x. Because the potential error term values cause the variation in the potential y values, these assumptions imply that the population of all y values that could be observed when the independent variable is x is normally distributed with mean B0 B1x and a variance S2 that does not depend on x. These three assumptions are illustrated in Figure 11.10 in the context of the fuel consumption problem. Specifically, this figure depicts the populations of weekly fuel consumptions corresponding to two values of average hourly temperature—32.5 and 45.9. Note that these populations are shown to be normally distributed with different means (each of which is on the line of means) and with the same variance (or spread). The independence assumption is most likely to be violated when time series data are being utilized in a regression study. Intuitively, this assumption says that there is no pattern of positive error terms being followed (in time) by other positive error terms, and there is no pattern of positive error terms being followed by negative error terms. That is, there is no pattern of higher-thanaverage y values being followed by other higher-than-average y values, and there is no pattern of higher-than-average y values being followed by lower-than-average y values. It is important to point out that the regression assumptions very seldom, if ever, hold exactly in any practical regression problem. However, it has been found that regression results are not extremely sensitive to mild departures from these assumptions. In practice, only pronounced

F I G U R E 11.10

An Illustration of the Model Assumptions
y 12.4 Observed value of y when x 32.5 32.5

The mean fuel consumption when x The mean fuel consumption when x 45.9 Population of y values when x 32.5 Population of y values when x 45.9 32.5 45.9 9.4

Observed value of y when x 45.9 The straight line defined by the equation y |x 0 1x (the line of means) x

Bowerman−O’Connell: Business Statistics in Practice, Third Edition

11. Simple Linear Regression Analysis

Text

© The McGraw−Hill Companies, 2003

11.3

Model Assumptions and the Standard Error

467

departures from these assumptions require attention. In optional Section 11.8 we show how to check the regression assumptions. Prior to doing this, we will suppose that the assumptions are valid in our examples. In Section 11.2 we stated that, when we predict an individual value of the dependent variable, we predict the error term to be 0. To see why we do this, note that the regression assumptions state that, at any given value of the independent variable, the population of all error term values that can potentially occur is normally distributed with a mean equal to 0. Since we also assume that successive error terms (observed over time) are statistically independent, each error term has a 50 percent chance of being positive and a 50 percent chance of being negative. Therefore, it is reasonable to predict any particular error term value to be 0. The mean square error and the standard error To present statistical inference formulas in later sections, we need to be able to compute point estimates of s2 and s, the constant variance and standard deviation of the error term populations. The point estimate of s2 is called the mean square error and the point estimate of s is called the standard error. In the following box, we show how to compute these estimates:

The Mean Square Error and the Standard Error

I

f the regression assumptions are satisfied and SSE is the sum of squared residuals: The point estimate of s2 is the mean square error s2 SSE n 2

1

2

The point estimate of s is the standard error s

SSE 2 Bn

In order to understand these point estimates, recall that s2 is the variance of the population of ˆ y values (for a given value of x) around the mean value my x. Because y is the point estimate of this mean, it seems natural to use SSE a (yi yi)2 ˆ

to help construct a point estimate of s2. We divide SSE by n 2 because it can be proven that doing so makes the resulting s2 an unbiased point estimate of s2. Here we call n 2 the number of degrees of freedom associated with SSE.

EXAMPLE 11.6 The Fuel Consumption Case
Consider the fuel consumption situation, and recall that in Table 11.4 (page 460) we have calculated the sum of squared residuals to be SSE 2.568. It follows, because we have observed n 8 fuel consumptions, that the point estimate of s2 is the mean square error s2 2s2 SSE n 2 2.428 2.568 8 2 .428

C

This implies that the point estimate of s is the standard error s .6542

As another example, it can be verified that the standard error for the simple linear regression model describing the QHIC data is s 146.8970. To conclude this section, note that in optional Section 11.9 we present a shortcut formula for calculating SSE. The reader may study Section 11.9 now or at any later point.

calculate s2 Refer to the starting salary data of Exercise 11. For example.25 What is estimated by the mean square error.3 CONCEPTS 11. 110.29 THE DIRECT LABOR COST CASE DirLab Refer to the direct labor cost data of Exercise 11. 11. 13 and 14. 6. 87. calculate Refer to the Fresh detergent data of Exercise 11.Bowerman−O’Connell: Business Statistics in Practice. we test the null hypothesis H0: b1 0 which says that there is no change in the mean value of y associated with an increase in x.13 (page 456). 9. and what is estimated by the standard error? 11.8059. versus the alternative hypothesis Ha: b1 0 which says that there is a (positive or negative) change in the mean value of y associated with an increase in x. 12. calculate 896.11 (page 456). 98. and 130. The advertising expenditures (in units of $10. 110.4303. consider the fuel consumption problem. b1 4. Third Edition 11. Assuming that the simple linear regression model is appropriate. 11.31 Ten sales regions of equal sales potential for a company were randomly selected. A different sample of n observed y values would yield a different least squares point estimate b1. respectively. 5. recall that we compute the least squares point estimate b1 of the true slope b1 by using a sample of n observed values of the dependent variable y. In order to judge the significance of the relationship between y and x. 8. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.2121. Corresponding to each temperature there is a (theoretically) infinite population of fuel consumptions that could potentially be observed at that temperature [see Table 11.26 THE STARTING SALARY CASE StartSal 1. Given that SSE s2 and s.4 ■ Testing the Significance of the Slope and y Intercept Testing the significance of the slope A simple linear regression model is not likely to be useful unless there is a significant relationship between y and x.000) were then recorded for the 10 sales regions and found to be. It would be reasonable to conclude that x is significantly related to y if we can be quite certain that we should reject H0 in favor of Ha. 11. Given that SSE and s. 11. calculate s2 11. 747. Given that SSE calculate s2 and s.000) in these 10 sales regions were purposely set during July of last year at. 7. and recall that we have observed eight average hourly temperatures. 126. SalesVol Calculate s2 and s.7 (page 454).26. respectively. In order to test these hypotheses.27 THE SERVICE TIME CASE SrvcTime 191. The sales volumes (in units of $10. 89. 2003 468 Chapter 11 Simple Linear Regression Analysis Exercises for Section 11.9 (page 455). 11.5(a)]. Given that SSE s2 and s.5 (page 454). 11. Refer to the service time data in Exercise 11.8242. Given that SSE and s. 114.30 THE REAL ESTATE SALES PRICE CASE RealEst Refer to the sales price data of Exercise 11. and SSE 222.28 THE FRESH DETERGENT CASE Fresh 2.31 METHODS AND APPLICATIONS 11.438.70166. 10. 11.5(b) is the sample of eight fuel consumptions that we have actually observed from these populations (these are the same fuel consumptions originally given . it can be shown that b0 66. 103.24 What four assumptions do we make about the simple linear regression model? 11. Sample 1 in Table 11. 116.8.

In general.5 8.8 12.0 28.1285 b1 .0 28. then the population of all possible values of the test statistic t b1 sb1 b1 . Third Edition 11. Furthermore.5 7.0666 (d) Three Possible Values of b0 b0 15. then the population of all values of b1 sb1 has a t distribution with n 2 degrees of freedom.47 s2 . If the regression assumptions hold.4 11. s2.8 58.686 s . Samples 2 and 3 in Table 11.0 7.8 9.7 10.85 2 b0 12.5 39.1).5 8.4 10.5 39.0 32.0 45.5 8.0 32. if the regression assumptions hold. if the null hypothesis H0: b1 0 is true.5(c)–(f)].9 8.3 11.5(b) are two other samples that we could have observed.0 11. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.5 9.4 Testing the Significance of the Slope and y Intercept 469 T A B L E 11.5 Population of Potential Weekly Fuel Consumptions Population of fuel consumptions when x Population of fuel consumptions when x Population of fuel consumptions when x Population of fuel consumptions when x Population of fuel consumptions when x Population of fuel consumptions when x Population of fuel consumptions when x Population of fuel consumptions when x 28. Because each sample yields its own unique values of b1.8 58.0 45.8 9.6542 s . then the population of all possible values of b1 is normally distributed with a mean of b1 and with a standard deviation of sb1 s 1SSxx s 1SSxx The standard error s is the point estimate of s.0667 (f) Three Possible Values of s s .1 9.2 10.5 9. so it follows that a point estimate of sb1 is sb1 which is called the standard error of the estimate b1. It follows that.4 9.2 Sample 3 y1 y2 y3 y4 y5 y6 y7 y8 10.Bowerman−O’Connell: Business Statistics in Practice. and s [see Table 11.1 62.84 b0 15.5 (b) Three Possible Samples Sample 1 y1 y2 y3 y4 y5 y6 y7 y8 12.1 62. b0.2582 in Table 11. there is an infinite population of potential values of each of these estimates. 2003 11. an infinite number of such samples could be observed.7 12.0 (c) Three Possible Values of b1 b1 .5 Sample 2 y1 y2 y3 y4 y5 y6 y7 y8 12.9 57.9 57.5 Three Samples in the Fuel Consumption Case (a) The Eight Populations of Fuel Consumptions Week 1 2 3 4 5 6 7 8 Average Hourly Temperature x 28.1279 b1 .428 s2 .2 8.44 (e) Three Possible Values of s s2 .

For example.01 significance level. there is little practical difference between using the one-sided or two-sided alternative. tA. in the fuel consumption problem we can say that if the slope b1 is not 0.6 (page 467)]. For these reasons we will emphasize the two-sided test. sometimes a one-sided alternative is appropriate. the stronger is the evidence that the regression relationship is significant. this is usually regarded as very strong evidence that the regression relationship is significant.01746 .01746 7.1279 . Therefore.33 s 1SSxx y b0 . Simple Linear Regression Analysis Text © The McGraw−Hill Companies.7 The Fuel Consumption Case Again consider the fuel consumption model For this model SSxx 1.404. and s .05 significance level. Alternative Hypothesis Ha: b1 Ha: b1 Ha: b1 0 0 0 t t Rejection Point Condition: Reject H0 if t ta ta ta 2 p-Value Twice the area under the t curve to the right of t The area under the t curve to the right of t The area under the t curve to the left of t Here tA/2. 2 3 EXAMPLE 11. However. we can test the significance of the regression relationship as follows: Testing the Significance of the Regression Relationship: Testing the Significance of the Slope D efine the test statistic t b1 sb1 where sb1 s 1SSxx and suppose that the regression assumptions hold. or.4 (pages 459–460) . then we have concluded that x is significantly related to y by using a test that allows only a .355.6542 11.Bowerman−O’Connell: Business Statistics in Practice. Furthermore. This is usually regarded as strong evidence that the regression relationship is significant. b1 and 11. equivalently. Third Edition 11. We usually use the two-sided alternative Ha: b1 0 for this test of significance.404. and all p-values are based on n 2 degrees of freedom.1279. Then we can reject H0: b1 0 in favor of a particular alternative hypothesis at significance level a (that is. it would be appropriate to decide that x is significantly related to y if we can reject H0: b1 0 in favor of the onesided alternative Ha: b1 0.05 probability of concluding that x is significantly related to y when it is not. Because of this. If we can decide that the slope is significant at the . The smaller the significance level a at which H0 can be rejected. the corresponding p-value is less than a. then it must be negative. 2003 470 Chapter 11 Simple Linear Regression Analysis has a t distribution with n 2 degrees of freedom.355 b1x e C . A negative b1 would say that mean fuel consumption decreases as temperature x increases.6542 [see Examples 11. Although this test would be slightly more effective than the usual two-sided test. Therefore sb1 and t b1 sb1 . by setting the probability of a Type I error equal to a) if and only if the appropriate rejection point condition holds. It should also be noted that 1 If we can decide that the slope is significant at the . computer packages (such as MINITAB and Excel) present results for testing a two-sided alternative hypothesis.

721 a j 1 6 7 Stdev Fitp 0.32768f P-Valueg 1. we can reject H0 in favor of Ha at level of significance .241 9 5 . and t 7. and MINITAB has rounded this p-value to . 5 6 8k 25.84.00033 Lower 95% 13. or .127921715b b0 bb1 csb0 dsb1 et for testing H0: b0 0 ft for testing H0: b1 0 Explained variation kSSE Unexplained variation lTotal variation m p-values for t statistics hs standard error ir2 F(model) statistic np-value for F(model) o95% confidence interval for b1 To test the significance of the slope we compare t with ta degrees of freedom. Other quantities on the MINITAB and Excel outputs will be discussed later.001).75353e 7.000 R-Sq = 89.98081629j 2.0% PIr (9. 0 0 0n Analysis of Variance Source DF Regression Error Total F i to 10.948413871 0.05. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.9%i SS 22. . 11.427989 F 53.7 .12792b Stdev 0. Note that b0 15.3% MS 22.447 2 based on n 2 8 2 6 we can reject H0: b1 0 in favor of Ha: b1 0 at level of significance .05. sb1 .128 TEMP Predictor Constant TEMP S = 0.170638294o Upper 95% 17.11 presents the MINITAB and Excel outputs of a simple linear regression analysis of the fuel consumption data. Figure 11.33 under the curve of the t distribution having n 2 6 degrees of freedom.981j 2 .4 Testing the Significance of the Slope and y Intercept 471 F I G U R E 11.00033. Since this p-value can be shown to be . 0 0 0g 0.8 .654208646h 8 ANOVA df Regression Residual Total 1 6 7 SS 22.08520514o 15.6542.Bowerman−O’Connell: Business Statistics in Practice.428) b0 bb1 csb0 dsb1 et for testing H0: b0 0 ft for testing H0: b1 0 gp-values for t statistics hs standard error ir2 ˆ Explained variation kSSE Unexplained variation lTotal variation mF(model) statistic np-value for F(model) oy psy ˆ q 95% confidence interval when x 40 r95% prediction interval when x 40 (b) The Excel output Regression Statistics Multiple R R Square Adjusted R Square Standard Error Observations 0.54875l MS 22. Because t 7. 7 5e .130.6542h Coef 15.549l R-Sq(adj) = 88.001.79972765 0.00033.87598718 0.428 F 53.69488m Significance F 0.8018c 0.33 (each of which has been previously calculated) are given on these outputs. b1 . Also note that Excel gives the p-value of .014. 2003 11. 3 3f p 0 .01746.1279.01745733d g t Stat 19.83785741a 0. Third Edition 11. The p-value for testing H0 versus Ha is twice the area to the right of t 7.98082 0.69m p 0 .000 (which means less than .981 0.11 MINITAB and Excel Output of a Simple Linear Regression Analysis of the Fuel Consumption Data (a) The MINITAB output Regression Analysis The regression equation is FUELCONS = 15. We therefore have extremely strong evidence that x is significantly related to y and that the regression relationship is significant.882737016 0.025 2.01.899488871i 0.000330052n Coefficients Intercept Temp a j Standard Error 0.567933713k 25.0. 0 % C Iq (10. 12.8379a -0.312) 95.01746d T 1 9 .33 t. .09E-06 0.801773385c 0. s .

75. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.42 and $8.01746.3931 . This interval says we are 95 percent confident that mean yearly upkeep expenditure increases by between $6. Note that the 95 percent confidence interval for b1 is given on the Excel output but not on the MINITAB output. Here ta 2 is based on n 2 degrees of freedom. Testing the significance of the y-intercept We can also test the significance of the y-intercept b0.11.05. EXAMPLE 11.10 for each additional $1. because the 95 percent confidence interval for b1 does not contain 0.8018.1706.2583 s 146. To carry out this test we use the test statistic t b0 sb0 where sb0 s 1 Bn x2 SSxx Here the rejection point and p-value conditions for rejecting H0 are the same as those given previously for testing the significance of the slope.000.447(. it is often useful to calculate a confidence interval for b1. if average hourly temperature increases by one degree. . we can reject H0: b1 0 in favor of Ha: b1 0 at the . we see that b0 15.897 . Because t 19. 2.4170.05. We show how this is done in the following box: A Confidence Interval for the Slope I f the regression assumptions hold.0995]. 2003 472 Chapter 11 Simple Linear Regression Analysis In addition to testing the significance of the slope. a 100(1 A) percent confidence interval for the true slope B1 is [b1 ta 2 sb1].0852 MMcf of natural gas and by at most . 8.447 and p-value .Bowerman−O’Connell: Business Statistics in Practice. Also.001.01746)] .12 presents the MegaStat output of a simple linear regression analysis of the QHIC data.001 17.75 t.11 tell us that b1 .0852] This interval says we are 95 percent confident that. EXAMPLE 11. t 19. because t. for instance. The MegaStat output also tells us that a 95 percent confidence interval for the true slope b1 is [6.1279 [ . a 95 percent confidence interval for b1 is [b1 t.466 p-value for t Since the p-value for testing the significance of the slope is less than .8 The Fuel Consumption Case C . Third Edition 11.8379.001 level of significance. then mean weekly fuel consumption will decrease (because both the lower bound and the upper bound of the interval are negative) by at least . sb0 .1279 and sb1 Thus.1706 MMcf of natural gas.000 increase in home value. The MINITAB and Excel outputs in Figure 11.025 2.05 level of significance. except that t is calculated as b0 sb0. we can reject H0: b1 0 in favor of Ha: b1 0 at level of significance . we can reject H0: b0 0 in favor of Ha: b0 0 at the .4156 t b1 sb1 b1 7.025 based on n 2 8 2 6 degrees of freedom equals 2. if we consider the fuel consumption problem and the MINITAB output in Figure 11.447.9 The QHIC Case C Figure 11. Below we summarize some important quantities from the output (we discuss the other quantities later): b0 sb1 348. We do this by testing the null hypothesis H0: b0 0 versus the alternative hypothesis Ha: b0 0.025 sb1] [ . and p-value . For example. It follows that we have extremely strong evidence that the regression relationship is significant.

759.38. Show how t has been calculated by using b1 and sb1. In fact. If.187.01? METHODS AND APPLICATIONS In Exercises 11.3921a 7.40 11.466f p-valueg 4.8301 F 305.01.4156d t (df 38) 4.38.943 Std. Third Edition 11. if we fail to conclude that the intercept is significant at a level of significance of . and Excel output of simple linear regression analyses of the data sets related to the five case studies introduced in the exercises for Section 11. What do you conclude about the relationship between y and x? e Using the t statistic and appropriate rejection points. it might be reasonable to drop the y-intercept from the model. b Identify SSE. Using the p-value. and . Error 146.889i r 0. since the p-value . determine whether we can reject H0 by setting a equal to . in the fuel consumption problem.4 Testing the Significance of the Slope and y Intercept 473 F I G U R E 11. a Identify the least squares point estimates b0 and b1 of b0 and b1.582. 2003 11.78943 1. 11. c Identify sb1 and the t statistic for testing the significance of the slope. .001 level of significance. when in doubt. s2.576e 17.00 a j Predicted 1.2399l df 1 38 39 MS 6. d Using the t statistic and appropriate rejection points. logically speaking. However.402.05.759.10.92878 1. we can also reject H0 at the .995.248.6972 21. to include the intercept b0. This provides extremely strong evidence that the y-intercept b0 does not equal 0 and that we should include b0 in the fuel consumption model.4170o 194. MegaStat. the mean value of y would not equal 0 when x equals 0 (for example. error 76. Using the appropriate output for each case study.5315 6.309. Discuss one practical application of this interval.92317 Leverages 0.6972j 819. What do you conclude about the relationship between y and x? g Calculate the 95 percent confidence interval for b1.4 CONCEPTS 11.551.12 MegaStat Output of a Simple Linear Regression Analysis of the QHIC Data Regression Analysis r2 0.582. .1410c 0. experience suggests that it is definitely safest. test H0: b1 0 versus Ha: b1 0 by setting a equal to .33 Give an example of a practical application of the confidence interval for b1.49E-20n Regression output variables Intercept Value coefficients 348.001.49E-20 confidence interval 95% lower 95% upper 502. remember that b0 equals the mean value of y when x equals 0. and s.1. .2583b std. What do you conclude about the relationship between y and x? f Identify the p-value for testing H0: b1 0 versus Ha: b1 0. Exercises for Section 11.897h n 40 k 1 Dep.34.001.05.Bowerman−O’Connell: Business Statistics in Practice.32 What do we conclude if we can reject H0: b1 0 in favor of Ha: b1 a a equal to . 11.755. In general. Var. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.01. 0 by setting 11.06251 95% Prediction Intervalr lower upper 944.578. mean fuel consumption would not equal 0 when the average hourly temperature is 0).2527 8. we refer to MINITAB.42597 P 95% Confidence Intervalq lower upper 1. it is common practice to include the y-intercept whether or not H0: b0 0 is rejected.042 b0 bb1 csb0 dsb1 et for testing H0: b0 0 ft for testing H0: b1 0 gp-values for t statistics hs standard error ir2 Explained variation kSSE Unexplained variation lTotal variation mF(model) statistic np-value for F(model) o95% confidence interval for b1 p ˆ y q95% confidence interval when x 220 r95% prediction interval when x 220 sdistance value In fact. Upkeep ANOVA table Source Regression Residual Total SS 6.0995o Predicted values for: Upkeep Value 220.05.95E-05 9.05? 
b a equal to .06m p-value 9.5427k 7. test H0: b1 0 versus Ha: b1 0 by setting a equal to .34 through 11.

11 1 Minutes ANOVA table Source Regression Residual Total SS 19.127 0.35 THE SERVICE TIME CASE SrvcTime The MegaStat output of a simple linear regression analysis of the data set for this case (see Exercise 11.4641 24.2% MS 59.233 42. determine whether we can reject H0 by setting a equal to .357 88.10. error 3.188 72. .528 171.224 74.14 Regression Analysis MegaStat Output of a Simple Linear Regression Analysis of the Service Time Data r2 0.627 195.942 1.918. show how sb0 and sb1 have been calculated.8 + 5.091 0.990 r 0.334 30.71 GPA Predictor Constant GPA S = 0.0087 2.7823 19.126 95% Prediction Intervals lower upper 23.05.2437 26.3953 T 12.942 0.15 p-value 2.410 48.288 F 208.0% CI ( 32.6022 std. Var.4221 3.13.001.000 Analysis of Variance Source DF Regression 1 Error 5 Total 6 Fit 33.580 Predicted values for: Minutes Copiers 1 2 3 4 5 6 7 Predicted 36. j Identify the p-value for testing H0: b0 0 versus Ha: b0 0.7 on page 454) is given in Figure 11. 34. What do you conclude? k Using the appropriate data set. 11.475 159.827 113.907 55.00 14.4390 0. Error 4.967 123. 2003 474 Chapter 11 Simple Linear Regression Analysis h Calculate the 99 percent confidence interval for b1. and .846) F I G U R E 11.Bowerman−O’Connell: Business Statistics in Practice.235 0.348 0.813.5363 Coef 14.7066 StDev 1.8438 21.911) 95. Hint: Calculate SSxx.680 95% Confidence Intervals lower upper 29.09E-10 Regression output variables Intercept Copiers Coefficients 11.300 120. Show how t has been calculated by using b0 and sb0.391 147.14.362 StDev Fit 0.34 THE STARTING SALARY CASE StartSal The MINITAB output of a simple linear regression analysis of the data set for this case (see Exercise 11.7017 20.944 49. Third Edition 11.077 183.271 109.918.09E-10 confidence interval 95% lower 95% upper 3.01.8438 191.116 0.025 138.3002 F 935.224 0.721 130.669 85.110.980 81.139 177. i Identify sb0 and the t statistic for testing the significance of the y intercept.000 0.13 MINITAB Output of a Simple Linear Regression Analysis of the Starting Salary Data The regression equation is SALARY = 14.950 Leverage 0.779 145.753 154.213 95. Using the p-value. 33. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.381 . Recall that a labeled MegaStat regression output is on page 473. .559 170.615 n k Dep.197 164.816 5.5 on page 454) is given in Figure 11.878.44 P 0. Recall that a labeled MINITAB regression output is on page 471. 11.5455 df 1 9 10 MS 19.438 61.873 134.113 96.380 R-Sq(adj) = 97.066 60.241 98.39 P 0.202 0.000 R-Sq = 97.7% SS 59.8045 t (df 9) p-value .226 65.715 106.0% PI (31.6845 22.995 Std. F I G U R E 11.016 190.

787514479 0.2% R-Sq(adj) = 78.37 THE DIRECT LABOR COST CASE DirLab The Excel and MegaStat output of a simple linear regression analysis of the data set for this case (see Exercise 11.999271693 0.67 PriceDif Predictor Constant PriceDif S = 0. 8.0% PI ( 7.096613291 Upper 95% 0.9 on page 455) is given in Figure 11. Third Edition 11.1400) F I G U R E 11.000840801 t Stat 3. Recall that labeled Excel and MegaStat regression outputs are on pages 471 and 473.2134) ( 8.248599895 9952.003647528 5.8208.851387097 12 ANOVA df Regression Residual Total Intercept LaborCost 1 10 11 SS 9945. 11.4 F I G U R E 11.666667 MS 9945. Recall that a labeled MINITAB regression output is on page 471.772336617 117.098486713 Standard Error 0.653 2.0806 8.82 10. 9.31 P 0.6004) 95.843313988 0.104 11.4% Analysis of Variance Source Regression Error Total Fit 8.000 R-Sq = 79.39 Find and interpret a 95 percent confidence interval for the slope b1 of the simple linear regression model describing the sales volume data in Exercise 11.31 (page 468).15 Testing the Significance of the Slope and y Intercept MINITAB Output of a Simple Linear Regression Analysis of the Fresh Detergent Demand Data 475 The regression equation is Demand = 7. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.Bowerman−O’Connell: Business Statistics in Practice.4804 StDev Fit 0.73171497 0.07988 0.38 THE REAL ESTATE SALES PRICE CASE RealEst The MINITAB output of a simple linear regression analysis of the data set for this case (see Exercise 11.418067 0.100360134 (b) Prediction using MegaStat Predicted values for: LaborCost BatchSize 60 Predicted 627.000 0.15. 2003 11.16 Excel and MegaStat Output of a Simple Linear Regression Analysis of the Direct Labor Cost Data (a) The Excel output Regression Statistics Multiple R R Square Adjusted R Square Standard Error Observations 0.054 633.11 on page 456) is given in Figure 11.6652 StDev 0.2585 T 97.04436E-17 Coefficients 1. 11.81409 2.1344001 P-value 0. 8.032 647.0648 0.653 0.806 13.36 THE FRESH DETERGENT CASE Fresh The MINITAB output of a simple linear regression analysis of the data set for this case (see Exercise 11.0% CI ( 7.46769 Significance F 5.9478.100 F 106.13 on page 456) is given in Figure 11.263 95% Confidence Interval lower upper 621.17 on page 476.418067 7.30 P 0. 11.494 Leverage 0. 8.4186.472 95% Prediction Interval lower upper 607.459 MS 10.724859989 F 13720.3604.7427) ( 7. Recall that a labeled MINITAB regression output is on page 471. SalesVol .999198862 0.0586 DF 1 28 29 SS 10.000 95.81 + 2.99963578 0.473848084 0.16.3166 Coef 7.04436E-17 Lower 95% 2.

2003 476 F I G U R E 11.02) 95.988541971 Upper 95% 0. The mean ratings given by the 406 individuals are given in the following table: Mean Taste 3. Here.974728196 0. 4.0% R-Sq(adj) = 86.9661 3.47 SS 6550.17 Chapter 11 Simple Linear Regression Analysis MINITAB Output of a Simple Linear Regression Analysis of the Sales Price Data The regression equation is SPrice = 48.5 MS 6550.0895 1.680702508 1.7457 T 3. and then ranked the restaurants from 1 through 6 on the basis of overall preference.000 R-Sq = 88.43 P 0. mean preference is the dependent variable and mean taste is the independent variable. 170.7 112. or 6 on the basis of taste. 1 is the best rating and 6 the worst.160197061 1.27312365 Standard Error 0.8 7447.0% PI ( 136.0 + 5. Hardee’s.33 7. Each of 406 randomly selected individuals gave each restaurant a rating of 1.98728324 0.7003 StDev 14.03 DF 1 8 9 StDev Fit 3.2552 4.73) F I G U R E 11.00109663 0.0052 2.1 F 58. 5.302868523 0.000 95.02 5.2429 2. In each case.5351 4. Recall that a labeled Excel regression output is given on page 471.59 Coef 48.7812 Figure 11.968410245 0.010 0.04.329 2.Bowerman−O’Connell: Business Statistics in Practice.8061 Restaurant Borden Burger Hardee’s Burger King McDonald’s Wendy’s White Castle Mean Preference 4. 2.42091641 P-Value 0.0% CI ( 154.557705329 11.5% Analysis of Variance Source Regression Error Total Fit 162. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.033586417 F 154.41 0.18 Excel Output of a Simple Linear Regression Analysis of the Fast-Food Restaurant Rating Data Regression Statistics Multiple R R Square Adjusted R Square Standard Error Observations 0.7 896.18 gives the Excel output of a simple linear regression analysis of this data. Wendy’s. McDonald’s.000241546 Coefficients Intercept MEANTASTE 0.4231 2.5659 3.624842601 0. .528932686 12.181684399 0.64 P 0. 187.70 HomeSize Predictor Constant HomeSize S = 10. Burger King.134345669 5.0911 3. 3.181684399 0.40 THE FAST-FOOD RESTAURANT RATING CASE FastFood In the early 1990s researchers at The Ohio State University studied consumer ratings of six fast-food restaurants: Borden Burger.2791645 Significance F 0.000241546 Lower 95% 1.183265974 6 ANOVA df Regression Residual Total 1 4 5 SS 5. and White Castle.102498367 t Stat 0.33.316030068 MS 5. Third Edition 11.

we call this mean value my x0). 477 . Simple Linear Regression Analysis Text © The McGraw−Hill Companies. Identify the p-value for testing H0: b1 0 versus Ha: b1 0. which can be regarded as the center of the experimental region. of b0 and b1. We now consider establishing a confidence interval for the mean value of y when x equals a particular value x0 (for later reference. Would we reject H0 at a At a . if the regression assumptions hold. and s. then the population of all possible values of y is normally distributed with mean my x0 and standard deviation ˆ sy ˆ The point estimate of sy is ˆ sy ˆ s2Distance value s2Distance value which is called the standard error of the estimate y. a 100(1 A) percent confidence interval for the mean value of y when the value of the independent variable is x0 is ˆ [y tA 2s2Distance value] Here ta 2 is based on n 2 degrees of freedom.5 a b c d e Confidence and Prediction Intervals Identify the least squares point estimates b0 and b1. Using this standard error. We can do this by calculating a confidence interval for the ˆ mean value of y and a prediction interval for an individual value of y.05? 11. 2003 11. Identify the t statistic for testing H0: b1 0 versus Ha: b1 0. we need to place bounds on how far y might be from these values. Both of these intervals employ a quantity called the distance value. Therefore. Identify sb1. different samples give different values of the point estimate y ˆ b0 b1x0 It can be shown that. y will not exactly equal either the mean value of y when x equals x0 or ˆ a particular individual value of y when x equals x0. Because each possible sample of n values of the dependent variable gives values of b0 and b1 that differ from the values given by other samples.001? f Identify and interpret a 95 percent confidence interval for b1. the larger is the distance value. For simple linear regression this quantity is calculated as follows: The Distance Value for Simple Linear Regression I n simple linear regression the distance value for a particular value x0 of x is Distance value 1 n (x0 x )2 SSxx This quantity is given its name because it is a measure of the distance between the value x0 of x and x.Bowerman−O’Connell: Business Statistics in Practice. Third Edition 11.5 ■ Confidence and Prediction Intervals The point on the least squares line corresponding to a particular value x0 of the independent variable x is y ˆ b0 b1x0 CHAPTER 15 Unless we are very lucky. Identify SSE.01? At a . The significance of this fact will become apparent shortly. we form a conˆ fidence interval as follows: A Confidence Interval for a Mean Value of y I f the regression assumptions hold. the average of the previously observed values of x. Notice from the above formula that the farther x0 is from x. s2. .

4.1362 Since s . we could observe any one of an infinite number of different individual values of y (because of different possible error terms).59] 2.31] This interval says we are 95 percent confident that the mean (or average) of all the weekly fuel consumptions that would be observed in all weeks having an average hourly temperature of 40°F is between 10.72 .84 b1x0 .355 .6542 (see Example 11. After observing each possible sample and calculating the ˆ point prediction based on that sample.447(.10 The Fuel Consumption Case C In the fuel consumption problem.Bowerman−O’Connell: Business Statistics in Practice.13. the point estimate of this mean is y ˆ b0 15. a 100(1 A) percent prediction interval for an individual value of y when the value of the independent variable is x0 is ˆ [y tA 2s21 Distance value] Here ta 2 is based on n 2 degrees of freedom. we compute Distance value 1 n 1 8 (x0 x)2 SSxx (40 43. Therefore. suppose we wish to compute a 95 percent confidence interval for the mean value of weekly fuel consumption when the average hourly temperature is x0 40°F. it follows that the desired 95 percent confidence interval is [y ˆ ta 2s2Distance value] [10. Using this quantity we obtain a prediction interval as follows: A Prediction Interval for an Individual Value of y I f the regression assumptions hold.1362 ] 6 [10. using the information in Example 11. 11.13 MMcf of natural gas and 11.447. there are an infinite number of different prediction errors that could be observed.6 on page 467) and since ta 2 t.31 MMcf of natural gas. .404. We develop an interval for an individual value of y when x equals a particular value x0 by considering the prediction error y y.98)2 1.72 [10. If the regression assumptions hold. Third Edition 11.025 based on n 2 8 2 degrees of freedom equals 2. From Example 11.4 (pages 459 and 460). 2003 478 Chapter 11 Simple Linear Regression Analysis EXAMPLE 11. it can be shown that the population of all possible prediction errors is normally distributed with mean 0 and standard deviation s(y The point estimate of s(y y ˆ) y ˆ) s21 Distance value is s(y y ˆ) s21 Distance value which is called the standard error of the prediction error. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.6542)2.1279(40) 10.72 MMcf of natural gas Furthermore.

which is (1. In Chapter 12 we use a multiple regression model to substantially reduce the natural gas company’s percentage nomination errors. Recalling that y ˆ 10.5 Confidence and Prediction Intervals 479 EXAMPLE 11.72. and the 95 percent prediction interval for an individual value of y when x equals 40.721 Stdev Fit 0.428) Although the MINITAB output does not directly give the distance value. Fit 10. we are not confident that the company’s percentage nomination error will be within the 10 percent allowance granted by the pipeline transmission system. First. It follows that we are 95 percent confident that the actual amount of natural gas that will be used by the city next week will differ from the natural gas company’s transmission nomination by no more than 15. note that the half-length of the 95 percent prediction interval given by our model is 1.10. This distance value is (within rounding) equal to the distance value that we hand calculated in Example 11. and also supposes that in a future week when x equals 40 the error term will equal 1.241 95.312) 95.91% of the transmission nomination.01.014.1362 ] Distance value] C 1.025 is again based on n 2 6 degrees of freedom.43] Here ta 2 t.05—no human being would actually know these facts. 2003 11. Below we repeat the bottom of the MINITAB output in Figure 11. 11.72 [10.11 The Fuel Consumption Case In the fuel consumption problem.0% PI (9.72. That is. Although this does not imply that the natural gas company is likely to make a terribly inaccurate nomination. because sy .91 percent. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.241/.77 and . 12.0% CI (10. it follows that the distance value equals (.72 given by our model is the natural gas company’s transmission nomination for next week.71.1281(40) 10.71 10.6542 [see the ˆ ˆ MINITAB output in Figure 11. Because the weather forecasting service has predicted that the average hourly temperature in the next week will be 40°F. while it probably will not equal 0). we are 95 percent confident that the natural gas company’s percentage nomination error will be less than or equal to 15. it does give sy s1Distance value under the heading “Stdev Fit”.72 when x0 40.6542) 21.77 .Bowerman−O’Connell: Business Statistics in Practice. the natural gas company may be assessed a transmission fine.6542)2 .43 MMcf of natural gas.241 and s . we can use the prediction interval to evaluate how well our regression model is likely to predict next week’s fuel consumption and to evaluate whether the natural gas company will be assessed a transmission fine. This interval says we are 95 percent confident that the individual fuel consumption in a future single week having an average hourly temperature of 40°F will be between 9. the prediction interval is longer than the confidence interval. This output gives the point estimate and prediction y ˆ 10. This is because the formula for the prediction interval has an “extra 1 under the radical.447(.19 illustrates and compares the 95 percent confidence interval for the mean value of y when x equals 40 and the 95 percent prediction interval for an individual value of y when ˆ x equals 40.72 2. suppose we wish to compute a 95 percent prediction interval for an individual weekly fuel consumption when average hourly temperature equals 40°F.11(a). Specifically.01 MMcf of natural gas and 12.65 .71] [9.130. the mean value of y when x equals 40 would be b0 b1(40) 15. A little algebra shows that this implies ˆ that the distance value equals ( sy s)2. Therefore. 
However.72)100% 15. Figure 11. Third Edition 11. Also. the 95 percent confidence interval for the mean value of y when x equals 40.11(a) on page 471].91 percent.1281. 12. Assuming these values. recall that the point prediction y ˆ 10. Figure 11. We see that both intervals are centered at y 10.” which accounts for the added uncertainty introduced by our not knowing the value of the error term (which we nevertheless predict to be 0. the desired interval is [y ˆ ta 2s21 [10.1357.19 hypothetically supposes that the true values of b0 and b1 are 15.

59 10. remember.65 1.65) A hypothetical individual value of y when x equals 40 ( 11.43 F I G U R E 11. Third Edition 11. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.31 1. the 95 percent confidence interval contains the mean 10. the 95 percent prediction interval is long enough and does contain the individual value 11. the relative positions of the quantities shown in Figure 11. it is not meant to contain this individual value of y. . this figure emphasizes that we must be careful to include the extra 1 under the radical when computing a prediction interval for an individual value of y.Bowerman−O’Connell: Business Statistics in Practice.7.19 Chapter 11 Simple Linear Regression Analysis A Comparison of a Confidence Interval for the Mean Value of y When x Equals 40 and a Prediction Interval for an Individual Value of y When x Equals 40 Scale of values of fuel consumption A hypothetical true mean value of y when x equals 40 ( 10.20 MINITAB Output of 95% Confidence and Prediction Intervals for the Fuel Consumption Case and in the future week the individual value of y when x equals 40 would be b0 b1(40) e 10. this interval is not long enough to contain the individual value 11. In contrast.05 11.72 ^ y .7.13 11. Of course.01 95% prediction interval 10.72 ^ y 12. However.71 9.7 As illustrated in the figure.19 will vary in different situations.7) 95% confidence interval 10. However.65. 2003 480 F I G U R E 11.

042.187. since s equals 146. Although it is not important to estimate a mean value in the fuel consumption problem.43.06. There are many homes worth roughly $220. the MegaStat output also tells us that the distance value.551. A confidence interval is useful if it is important to estimate the mean value.025(146.43) This predicted value is given at the bottom of the MegaStat output in Figure 11.12 The QHIC Case Consider a home worth $220.025s11 [1.20 illustrates the MINITAB output of the 95 percent confidence and prediction intervals corresponding to all values of x in the experimental region. This interval says that QHIC is 95 percent confident that the mean upkeep expenditure for all homes worth $220.248. Third Edition 11.06251 95% Prediction Interval lower upper 944. The MegaStat output tells us that a 95 percent confidence interval for this mean upkeep expenditure is [1187.187.248. which we repeat here: Predicted values for: Upkeep Value 220.5 that the predicted yearly upkeep expenditure for such a home is y ˆ b0 b1x0 348.92878 1. These longer intervals are undesirable because they give us less information about mean and individual values of y.025 is based on n 2 40 2 38 degrees of freedom.042 In addition to giving y ˆ 1.000 is calculated as follows: [ˆ y t. Therefore.92].79 and is no more than $1.4 does not give t.79. 1309.12. $1. MegaStat uses the appropriate t point and finds that the 95 percent prediction interval is [944. note that Figure 11. it might be important to estimate the mean value if we will observe and are affected by a very large number of values of the dependent variable when the independent variable equals a particular value. We illustrate this in the following example.042] Here t.248.42597 95% Confidence Interval lower upper 1. EXAMPLE 11.5 Confidence and Prediction Intervals 481 To conclude this example.43 Distance value] t. To understand this.98 can be regarded as the center of the experimental region. Notice that the farther x is from x 43. it is important to estimate a mean value in other situations.93.897)11.000. the larger is the distance value and.3921 7.000 in the metropolitan area.000 is at least $1. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.897 (see Figure 11. as in the fuel consumption problem. In general.551. Here x 43. it is important to predict an individual value of the dependent variable.248.025 for 38 degrees of freedom. therefore.78943 1. We have seen in Example 11.2583(220) C 1.Bowerman−O’Connell: Business Statistics in Practice. the longer are the 95 percent confidence and prediction intervals. the prediction interval is useful if.06]. which is given under the heading “Leverage” on the output.43 (that is.309.248.92317 Leverage 0.309. Therefore. 2003 11. Although the t table in Table A. it follows that a 95 percent prediction interval for the yearly upkeep expenditure of an individual home worth $220.12 on page 473). recall that the mean value is the average of all the values of the dependent variable that could potentially be observed when the independent variable equals a particular value. equals . 1.00 Predicted 1. This interval is given on the MegaStat output. QHIC is more interested in the mean upkeep expenditure for all such homes than in the individual upkeep expenditure for one such home.98. .

202 0. and some will be above.15 (page 475) corresponds to future sales periods in which the price difference will be .827 113.721 130.907 55.5 CONCEPTS 11.14 on page 474.271 109. 11. b Report a point prediction of and a 95 percent prediction interval for the starting salary of an individual marketing graduate having a grade point average of 3. note that a person making service calls will (in.197 164. However.8.48.348 0.391 147. Third Edition 11. Some of the person’s individual service times will be below.066 60. The 95 percent confidence intervals for the mean service times on these calls might be used to schedule future service calls. and 7 copiers.950 0. respectively.016 190.944 49. Therefore. 2.873 134.116 0. the corresponding mean service times.753 154.10 and . a Report a point estimate of and a 95 percent confidence interval for the mean starting salary of all marketing graduates having a grade point average of 3.381 a Report a point estimate of and a 95 percent confidence interval for the mean time to service four copiers.410 48. 11. .46 THE FRESH DETERGENT CASE Fresh The information at the bottom of the MINITAB output in Figure 11. 138.475 159.13 (page 474) relates to marketing graduates having a grade point average of 3.41 What does the distance value measure? 11.188 72.224 74. 11. 5.967 123.127 0.44.241 98. b Report a point prediction of and a 95 percent prediction interval for the actual demand for Fresh in an individual sales period when the price difference is .2 minutes.2]. it seems fair to both the efficiency of the company and to the person making service calls to schedule service calls by using estimates of the mean service times. Simple Linear Regression Analysis Text © The McGraw−Hill Companies. 2003 482 Chapter 11 Simple Linear Regression Analysis Exercises for Section 11.139 177. Predicted values for: Minutes Copiers 1 2 3 4 5 6 7 Predicted 36. it would seem fair to allow 138 minutes to make the service call.42 What is the difference between a confidence interval and a prediction interval? 11.669 85.715 106.25.45 THE SERVICE TIME CASE SrvcTime Below we repeat the bottom portion of the MegaStat output in Figure 11.Bowerman−O’Connell: Business Statistics in Practice. Determine how many minutes to allow for the service call. say. 3.10.43 Discuss how the distance value affects the length of a confidence interval and a prediction interval.25.077 183. To understand this. b Report a point prediction of and a 95 percent prediction interval for the time to service four copiers on a single call.25.627 195. 4. since the very large number of individual service times will average out to the mean service times. c If we examine the service time data.357 88.126 95% Prediction Intervals lower upper Leverage 23.113 96.091 0. a year or more) make a very large number of service calls.226 65.559 170.980 81. Examining the MegaStat output.10.300 120. we see that there was at least one call on which AccuCopiers serviced each of 1.025 138. Since the mean time might be 138. Now suppose we wish to schedule a call to service four copiers.528 171.224 0.680 95% Confidence Intervals lower upper 29. 6.49 11. suppose we wish to schedule a call to service five copiers.779 145. METHODS AND APPLICATIONS 11.25. 11. a Report a point estimate of and a 95 percent confidence interval for the mean demand for Fresh in all sales periods when the price difference is .233 42. we see that a 95 percent confidence interval for the mean time to service five copiers is [130. 
c Verify the results in a and b (within rounding) by hand calculation.44 THE STARTING SALARY CASE StartSal The information at the bottom of the MINITAB output in Figure 11.

For each company the authors recorded values of x. The heat losses at the outdoor temperature 20 F were 86. 11. the mean yearly accounting rate for the period 1959 to 1974. b.16(b) (page 475) relates to a batch of size 60. 2003 11. Fifty-four companies were selected. 38.9661. The heat losses at the outdoor temperature 60 F were 33.6 PriceDif d Find 99 percent confidence and prediction intervals for the mean and actual demands referred to in parts a and b. 11. the mean yearly market return rate for the period 1959 to 1974.2 0.21 gives the MINITAB output of a simple linear regression analysis of the fast-food restaurant rating data in Exercise 11. Find a point prediction of and a 95 percent prediction interval for the mean preference ranking (given by 406 randomly selected individuals) of the restaurant.4 0.11.5 Confidence and Prediction Intervals c Locate the confidence interval and prediction interval you found in parts a and b on the following MINITAB output: 483 Y = 7. Three different windows were randomly assigned to each of three different outdoor temperatures.81409 + 2. a Report a point estimate of and a 95 percent confidence interval for the mean sales price of all houses having 2. Third Edition 11. Hint: Find the distance value on the MegaStat output.49 THE FAST-FOOD RESTAURANT RATING CASE FastFood Figure 11. Benzion Barlev and Haim Levy consider relating accounting rates on stocks and market returns. and 75. b Report a point estimate of and a 95 percent prediction interval for the actual direct labor cost of an individual batch of size 60. c. and y. c Find 99 percent confidence and prediction intervals for the mean and actual direct labor costs referred to in parts a and b.2 -0. The data in .0 0.000 square feet. and d for sales periods in which the price difference is .000 square feet. b Report a point estimate of and a 95 percent prediction interval for the sales price of an individual house having 2.47 THE DIRECT LABOR COST CASE DirLab The information on the MegaStat output in Figure 11. 84. 11. The information at the bottom of the output relates to a fast-food restaurant that has a mean taste rating of 1.3 0. 80.1 0.40 (page 476).48 THE REAL ESTATE SALES PRICE CASE RealEst The information at the bottom of the MINITAB output in Figure 11.17 (page 476) relates to a house that will go on the market in the future and have 2. The heat losses at the outdoor temperature 40 F were 78. and 77. e Repeat parts a.66522X R-Sq = 0. a Report a point estimate of and a 95 percent confidence interval for the mean direct labor cost of all batches of size 60.50 Ott (1987) presents a study of the amount of heat loss for a certain brand of thermal pane window. and 43. Hint: Solve for the distance value—see Example 11.1 0.5 0. For each trial the indoor window temperature was controlled at 68 F and 50 percent relative humidity.Bowerman−O’Connell: Business Statistics in Practice. Use the simple linear regression model to find a point prediction of and a 95 percent prediction interval for the heat loss of an individual window when the outdoor temperature is HeatLoss a 20°F b 30°F c 40°F d 50°F e 60°F 11.25.51 In an article in the Journal of Accounting Research. 11. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.000 square feet.792 10 Regression 95% CI 95% PI Demand 9 8 7 -0.

76 8.3029 0.3160 MS 5.41 19.96 14. J.00.60 4.37 3.19 15.34 15.Bowerman−O’Connell: Business Statistics in Practice.65 4.54 3.23 10.28 P 0.97 6.3429 DF 1 4 5 StDev Fit 0.95 9.24 4.42 P 0.6721) 95.19 19.1833 Coef 0.67 10.63 16.89 10.45 10.38 13.59 20.69 12.71 13.1817 0.74 12.53 12.37 15.66 14.30 Accounting Rate 17.81 AcctRet Market Rate 5.00 16.09 17.6 were obtained.160 + 1.11 11.625 0.47 18. Steel Bethlehem Steel Armco Steel Texaco Shell Oil Standard Oil (Indiana) Owens Illinois Gulf Oil Tenneco Inland Steel Kraft Accounting Rate 13.88 1.53 9. 305–315.66 3.11 6. Grace Ford Motors Textron Lockheed Aircraft Getty Oil Atlantic Richfield Radio Corporation of America Westinghouse Electric Johnson & Johnson Champion International R.15 5.13 6.68 14.58 11. 2.20 10.40 . Third Edition 11.70 9.” Journal of Accounting Research (Autumn 1979).74 6.10 17.32 8.73 6. pp.41 8.90 9.94 9.70 11.02 11.88 6.96 6.67 13. .42 12.56 10.35 5.32 1.27 Source: Reprinted by permission from Benzion Barlev and Haim Levy.90 2.68 12.6 Company McDonnell Douglas NCR Honeywell TRW Raytheon W.44 32.46 14.63 14.1817 0.49 17. R.22 .78 9.17 9.0% CI ( 2.58 14.2731 StDev 0.40 11.78 4.92 12.27 MEANTASTE Predictor Constant MEANTAST S = 0.26 2.74 23. Reynolds General Dynamics Colgate-Palmolive Coca-Cola International Business Machines Allied Chemical Uniroyal Greyhound Cities Service Philip Morris General Motors Philips Petroleum Accounting Rates on Stocks and Market Returns for 54 Companies Market Rate 17.43 10.96 8.25 7.02 7.11 6.5% R-Sq(adj) = 9 6 . “On the Variability of Accounting Income Numbers.06 9.34 18.30 . 8 % Analysis of Variance Source Regression Error Total Fit 2.1602 1. Table 11.53 12.16 23.1025 T 0.000 R-Sq = 97.73 4.00. 2003 484 F I G U R E 11.06 Company FMC Caterpillar Tractor Georgia Pacific Minnesota Mining & Manufacturing Standard Oil (Ohio) American Brands Aluminum Company of America General Electric General Tire Borden American Home Products Standard Oil (California) International Paper National Steel Republic Steel Warner Lambert U.30 17. b Find a point prediction of and a 95 percent prediction interval for the market return rate of an individual stock having an accounting rate of 15.69 4.86 10. Simple Linear Regression Analysis Text © The McGraw−Hill Companies.17 6.66 9.03 6. Here the accounting rate can be interpreted to represent input into investment and therefore is a logical predictor of market return.S. 2.21 Chapter 11 Simple Linear Regression Analysis MINITAB Output of a Simple Linear Regression Analysis of the Fast-Food Restaurant Rating Data The regression equation is MEANPREF = -0.60 7.9491) T A B L E 11.51 17.00 21.96 8. Use the simple linear regression model and a computer to AcctRet a Find a point estimate of and a 95 percent confidence interval for the mean market return rate of all stocks having an accounting rate of 15.1343 5.0336 F 154.0136.35 16.0% PI ( 1.000 95.97 7.49 10.1186 SS 5.90 5.78 9.12 14.12 6.62 16.90 9.7367.05 12.11 12.43 19.

11.6 ■ Simple Coefficients of Determination and Correlation

The simple coefficient of determination
The simple coefficient of determination is a measure of the usefulness of a simple linear regression model. To introduce this quantity, which is denoted r² (pronounced r squared), suppose we have observed n values of the dependent variable y. However, we choose to predict y without using a predictor (independent) variable x. In such a case the only reasonable prediction of a specific value of y, say yi, would be ȳ, which is simply the average of the n observed values y1, y2, . . . , yn. Here the error of prediction in predicting yi would be yi − ȳ. For example, Figure 11.22(a) illustrates the prediction errors obtained for the fuel consumption data when we do not use the information provided by the independent variable x, average hourly temperature.

Next, suppose we decide to employ the predictor variable x and observe the values x1, x2, . . . , xn corresponding to the observed values of y. In this case the prediction of yi is

ŷi = b0 + b1xi

and the error of prediction is yi − ŷi. For example, Figure 11.22(b) illustrates the prediction errors obtained in the fuel consumption problem when we use the predictor variable x. Together, Figures 11.22(a) and (b) show the reduction in the prediction errors accomplished by employing the predictor variable x (and the least squares line).

F I G U R E 11.22   The Reduction in the Prediction Errors Accomplished by Employing the Predictor Variable x
[(a) Prediction errors for the fuel consumption problem when we do not use the information contributed by x: the observed y values are plotted against x around the horizontal line ȳ = 10.21. (b) Prediction errors for the fuel consumption problem when we use the information contributed by x by using the least squares line ŷ = 15.84 − 0.1279x.]

Using the predictor variable x decreases the prediction error in predicting yi from (yi − ȳ) to (yi − ŷi), or by an amount equal to

(ŷi − ȳ) = (yi − ȳ) − (yi − ŷi)

It can be shown that in general

Σ(yi − ȳ)² = Σ(yi − ŷi)² + Σ(ŷi − ȳ)²

The sum of squared prediction errors obtained when we do not employ the predictor variable x, Σ(yi − ȳ)², is called the total variation. Intuitively, this quantity measures the total amount of variation exhibited by the observed values of y. The sum of squared prediction errors obtained when we use the predictor variable x, Σ(yi − ŷi)², is called the unexplained variation (this is another name for SSE). Intuitively, this quantity measures the amount of variation in the values of y that is not explained by the predictor variable. The quantity Σ(ŷi − ȳ)² is called the explained variation. Using these definitions and the above equation, we see that

Total variation = Unexplained variation + Explained variation

It follows that the explained variation is the reduction in the sum of squared prediction errors that has been accomplished by using the predictor variable x to predict y. It also follows that

Explained variation = Total variation − Unexplained variation

Intuitively, this equation implies that the explained variation represents the amount of the total variation in the observed values of y that is explained by the predictor variable x (and the simple linear regression model).

We now define the simple coefficient of determination to be

r² = Explained variation / Total variation

That is, r² is the proportion of the total variation in the n observed values of y that is explained by the simple linear regression model. Neither the explained variation nor the total variation can be negative (both quantities are sums of squares). Therefore, r² is greater than or equal to 0. Because the explained variation must be less than or equal to the total variation, r² cannot be greater than 1. The nearer r² is to 1, the larger is the proportion of the total variation that is explained by the model, and the greater is the utility of the model in predicting y. If the value of r² is not reasonably close to 1, the independent variable in the model does not provide accurate predictions of y. In such a case, a different predictor variable must be found in order to accurately predict y. It is also possible that no regression model employing a single predictor variable will accurately predict y. In this case the model must be improved by including more than one independent variable. We see how to do this in Chapter 12. In the following box we summarize the results of this section:

The Simple Coefficient of Determination, r²

For the simple linear regression model
1 Total variation = Σ(yi − ȳ)²
2 Explained variation = Σ(ŷi − ȳ)²
3 Unexplained variation = Σ(yi − ŷi)²
4 Total variation = Explained variation + Unexplained variation
5 The simple coefficient of determination is
  r² = Explained variation / Total variation
6 r² is the proportion of the total variation in the n observed values of the dependent variable that is explained by the simple linear regression model.
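This decomposition translates directly into a few lines of code. The sketch below (Python with NumPy) computes the three variations and r²; the eight temperature–fuel observations are the fuel consumption case values as we read them from the chapter's earlier data table, so treat them as an assumption if you are checking against the text.

    import numpy as np

    # Fuel consumption case (assumed from the chapter's earlier table):
    # x = average hourly temperature, y = weekly fuel consumption
    x = np.array([28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5])
    y = np.array([12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5])

    # Least squares estimates b1 = SSxy/SSxx and b0 = ybar - b1*xbar
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()          # about 15.84 and -0.1279
    y_hat = b0 + b1 * x

    total_var = np.sum((y - y.mean()) ** 2)          # about 25.549
    unexplained_var = np.sum((y - y_hat) ** 2)       # about 2.568 (SSE)
    explained_var = np.sum((y_hat - y.mean()) ** 2)  # about 22.981
    r_sq = explained_var / total_var                 # about 0.899
    print(b0, b1, total_var, unexplained_var, explained_var, r_sq)

The printed values agree with the MINITAB output quoted in Example 11.13 below, which is a useful cross-check on the decomposition.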

EXAMPLE 11.13  The Fuel Consumption Case

Below we present the middle portion of the MINITAB output in Figure 11.11(a):

S = 0.6542    R-Sq = 89.9%    R-Sq(adj) = 88.3%

Analysis of Variance
Source        DF    SS        MS        F        P
Regression     1    22.981    22.981    53.69    0.000
Error          6     2.568     0.428
Total          7    25.549

This output tells us that the explained variation is 22.981 (see “SS Regression”), the unexplained variation is 2.568 (see “SS Error”), and the total variation is 25.549 (see “SS Total”). It follows that

r² = Explained variation / Total variation = 22.981 / 25.549 = .899

This value of r² says that the regression model explains 89.9 percent of the total variation in the eight observed fuel consumptions. Note that r² is given on the MINITAB output (see “R-Sq”) and is expressed as a percentage. We explain the meaning of “R-Sq(adj)” in Chapter 12. Also note that the quantities discussed here are given in the Excel output in Figure 11.11(b) on page 471.

EXAMPLE 11.14  The QHIC Case

Below we present the top portion of the MegaStat output in Figure 11.12:

r²   0.889    r   0.943
Std. Error   146.897    n   40    k   1    Dep. Var.   Upkeep

ANOVA table
Source        SS                df    MS                F         p-value
Regression    6,582,759.6972     1    6,582,759.6972    305.06    9.49E-20
Residual        819,995.5427    38       21,578.8301
Total         7,402,755.2399    39

This output gives the explained variation, the unexplained variation, and the total variation under the respective headings “SS Regression,” “SS Residual,” and “SS Total.” The output also tells us that r² equals .889. Therefore, the simple linear regression model that employs home value as a predictor variable explains 88.9 percent of the total variation in the 40 observed home upkeep expenditures. Before continuing, note that in optional Section 11.9 we present some shortcut formulas for calculating the total, explained, and unexplained variations.

The simple correlation coefficient, r
People often claim that two variables are correlated. For example, a college admissions officer might feel that the academic performance of college students (measured by grade point average) is correlated with the students’ scores on a standardized college entrance examination. This means that college students’ grade point averages are related to their college entrance exam scores. One measure of the relationship between two variables y and x is the simple correlation coefficient, denoted by r. This correlation coefficient measures the strength of the linear relationship between y and x. We define this quantity as follows:

The Simple Correlation Coefficient

The simple correlation coefficient between y and x is
r = +√r² if b1 is positive, and r = −√r² if b1 is negative
where b1 is the slope of the least squares line relating y to x.

Figure 11. which indicates that y and x are negatively correlated. the method given in the previous box provides the easiest way to calculate r. EXAMPLE 11. A value of r near 0 implies little linear relationship between y and x. If we have computed the least squares slope b1 and r 2.948 2r 2 2. that y and x are highly related and positively correlated. .1279 and r . therefore. in the fuel consumption problem. For instance. this formula for r automatically gives r the correct ( or ) sign.23 y Chapter 11 Simple Linear Regression Analysis An Illustration of Different Values of the Simple Correlation Coefficient y y x (a) r 1: perfect positive correlation y x (b) Positive correlation (positive r): y increases as x increases in a straight-line fashion y x (c) Little correlation (r near 0): little linear relationship between y and x x (d) Negative correlation (negative r): y decreases as x increases in a straight-line fashion (e) r x 1: perfect negative correlation Because r2 is always between 0 and 1. and SSyy denotes the total variation.2 on page 459. y and x have a perfect linear relationship with a negative slope. whereas when r 1. A value of r close to 1 says that y and x have a strong tendency to move together in a straight-line fashion with a negative slope and. SSxy 179. the correlation coefficient r is between 1 and 1.23 illustrates these relationships. which has been defined in this section. y and x have a perfect linear relationship with a positive slope. It follows that the simple correlation coefficient between y (weekly fuel consumption) and x (average hourly temperature) is r .1 (page 447). therefore. The simple correlation coefficient can also be calculated using the formula SSxy r 2SSxx SSyy Here SSxy and SSxx have been defined in Section 11. A value of r close to 1 says that y and x have a strong tendency to move together in a straight-line fashion with a positive slope and. Furthermore.899 C 2 This simple correlation coefficient says that x and y have a strong tendency to move together in a linear fashion with a negative slope. Third Edition 11.15 The Fuel Consumption Case In the fuel consumption problem we have previously found that b1 . Simple Linear Regression Analysis Text © The McGraw−Hill Companies.6475. Notice that when r 1.Bowerman−O’Connell: Business Statistics in Practice. 2003 488 F I G U R E 11. We have seen this tendency in Figure 11. that y and x are highly related and negatively correlated.899.

Furthermore, in the fuel consumption problem, SSxy = −179.6475, SSxx = 1,404.355, and SSyy = 25.549 (see Examples 11.4 on page 459 and 11.13 on page 487). Therefore

r = SSxy / √(SSxx · SSyy) = −179.6475 / √(1,404.355)(25.549) = −.948

It is important to make a couple of points. First, the value of the simple correlation coefficient is not the slope of the least squares line. If we wish to find this slope, we should use the previously given formula for b1. (It can be shown that b1 and r are related by the equation b1 = √(SSyy/SSxx) · r. Essentially, the difference between r and b1 is a change of scale.) Second, high correlation does not imply that a cause-and-effect relationship exists. When r indicates that y and x are highly correlated, this says that y and x have a strong tendency to move together in a straight-line fashion. The correlation does not mean that changes in x cause changes in y. Instead, some other variable (or variables) could be causing the apparent relationship between y and x. For example, suppose that college students’ grade point averages and college entrance exam scores are highly positively correlated. This does not mean that earning a high score on a college entrance exam causes students to receive a high grade point average. Rather, other factors such as intellectual ability, study habits, and attitude probably determine both a student’s score on a college entrance exam and a student’s college grade point average. In general, while the simple correlation coefficient can show that variables tend to move together in a straight-line fashion, scientific theory must be used to establish cause-and-effect relationships.

Testing the significance of the population correlation coefficient
Thus far we have seen that the simple correlation coefficient measures the linear relationship between the observed values of x and the observed values of y that make up the sample. A similar coefficient of linear correlation can be defined for the population of all possible combinations of observed values of x and y. We call this coefficient the population correlation coefficient and denote it by the symbol ρ (pronounced rho). We use r as the point estimate of ρ. In addition, we can carry out a hypothesis test. Here we test the null hypothesis H0: ρ = 0, which says there is no linear relationship between x and y, against the alternative Ha: ρ ≠ 0, which says there is a positive or negative linear relationship between x and y. This test can be done by using r to compute the test statistic

t = r√(n − 2) / √(1 − r²)

The test is based on the assumption that the population of all possible observed combinations of values of x and y has a bivariate normal probability distribution. (See Wonnacott and Wonnacott (1981) for a discussion of this distribution.) If the bivariate normal distribution assumption for the test concerning ρ is badly violated, we can use a nonparametric approach to correlation. One such approach is Spearman’s rank correlation coefficient. This approach is discussed in optional Section 15.5.

It can be shown that the preceding test statistic t and the p-value used to test H0: ρ = 0 versus Ha: ρ ≠ 0 are equal to, respectively, the test statistic t = b1/s_b1 and the p-value used to test H0: β1 = 0 versus Ha: β1 ≠ 0, where β1 is the slope in the simple linear regression model. Keep in mind, however, that although the mechanics involved in these hypothesis tests are the same, these tests are based on different assumptions (remember that the test for significance of the slope is based on the regression assumptions).
EXAMPLE 11.16  The Fuel Consumption Case

Again consider testing the significance of the slope in the fuel consumption problem. Recall that in Example 11.7 we found that t = −7.33 and that the p-value related to this t statistic is less than .001. We therefore (if the regression assumptions hold) can reject H0: β1 = 0 at level of significance .05, .01, or .001, and we have extremely strong evidence that x is significantly related to y. This also implies (if the population of all possible observed combinations of x and y has a bivariate normal probability distribution) that we can reject H0: ρ = 0 in favor of Ha: ρ ≠ 0 at level of significance .05, .01, or .001. It follows that we have extremely strong evidence of a linear relationship, or correlation, between x and y. Furthermore, because we have previously calculated r to be −.948, we estimate that x and y are negatively correlated.
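To see the numbers behind Example 11.16, the sketch below computes the test statistic t = r√(n − 2)/√(1 − r²) and its two-sided p-value for the values quoted in the text (r = −.948, n = 8). SciPy is used only for the t distribution, so treat that import as an assumption about your environment.

    import math
    from scipy import stats

    r, n = -0.948, 8

    # Test statistic for H0: rho = 0 versus Ha: rho != 0
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # about -7.3

    # Two-sided p-value from the t distribution with n - 2 df
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)        # well below .001
    print(t_stat, p_value)

The small difference from the t = −7.33 quoted in the text comes only from rounding r to three decimal places.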

Exercises for Section 11.6

CONCEPTS
11.52 Discuss the meanings of the total variation, the explained variation, and the unexplained variation.
11.53 What does the simple coefficient of determination measure?

METHODS AND APPLICATIONS
In Exercises 11.54 through 11.59 we refer to MINITAB, MegaStat, and Excel output of simple linear regression analyses of the data sets related to six previously discussed case studies. Using the appropriate output or data set, find the total variation, the unexplained variation, the explained variation, the simple coefficient of determination (r²), and the simple correlation coefficient (r). Interpret r².

11.54 THE STARTING SALARY CASE   StartSal
Use the MINITAB output in Figure 11.13 (page 474).
11.55 THE SERVICE TIME CASE   SrvcTime
Use the MINITAB output in Figure 11.14 (page 474).
11.56 THE FRESH DETERGENT CASE   Fresh
Use the MINITAB output in Figure 11.15 (page 475).
11.57 THE DIRECT LABOR COST CASE   DirLab
Use the MegaStat output in Figure 11.16 (page 475).
11.58 THE REAL ESTATE SALES PRICE CASE   RealEst
Use the MINITAB output in Figure 11.17 (page 476).
11.59 THE FAST-FOOD RESTAURANT RATING CASE   FastFood
Use the Excel output in Figure 11.18 (page 476) or the MINITAB output in Figure 11.21 (page 484).

In Exercises 11.60 through 11.62 we refer to data sets contained in other exercises of this chapter.
11.60 Use the sales volume data in Exercise 11.31 (page 468) and a computer.   SalesVol
11.61 Use the accounting rate data in Exercise 11.51 (page 483) and a computer.   AcctRet
11.62 Use the heat loss data in Exercise 11.50 (page 483) and a computer.   HeatLoss

11.63 In the Quality Management Journal (Fall 1994), B. Yavas and T. Burrows present the following table, which gives the percentage of 100 U.S. managers and the percentage of 96 Asian managers who agree with each of 13 randomly selected statements concerning quality. (Note: All managers are in the electronics manufacturing industry.)   MgmtOp

Statement   U.S. Managers Agreeing (%)   Asian Managers Agreeing (%)
1           36                           38
2           31                           42
3           28                           43
4           27                           48
5           78                           58
6           74                           49
7           43                           46
8           50                           56
9           31                           65
10          66                           58
11          18                           21
12          61                           69
13          53                           45

Source: B. Yavas and T. Burrows, “A Comparative Study of Attitudes of U.S. and Asian Managers toward Product Quality,” Quality Management Journal, Fall 1994, p. 49 (Table 5). © 1994 American Society for Quality. Reprinted by permission.

If we calculate the simple correlation coefficient, r, for these data, we find that r = .570. Using a t statistic involving r and appropriate rejection points, test H0: ρ = 0 versus Ha: ρ ≠ 0 by setting α equal to .05 and by setting α equal to .01. What do you conclude? What does this indicate about the similarity of U.S. and Asian managers’ views about quality?

11.7 ■ An F Test for the Model

In this section we discuss an F test that can be used to test the significance of the regression relationship between x and y. Sometimes people refer to this as testing the significance of the simple linear regression model. For simple linear regression, this test is another way to test the null hypothesis H0: β1 = 0 (the relationship between x and y is not significant) versus Ha: β1 ≠ 0 (the relationship between x and y is significant). If we can reject H0 at level of significance α, we often say that the simple linear regression model is significant at level of significance α.

An F Test for the Simple Linear Regression Model

Suppose that the regression assumptions hold, and define the overall F statistic to be

F(model) = Explained variation / [(Unexplained variation) / (n − 2)]

Also define the p-value related to F(model) to be the area under the curve of the F distribution (having 1 numerator and n − 2 denominator degrees of freedom) to the right of F(model)—see Figure 11.24(b). We can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at level of significance α if either of the following equivalent conditions holds:
1 F(model) > Fα
2 p-value < α
Here the point Fα is based on 1 numerator and n − 2 denominator degrees of freedom.

The first condition in the box says we should reject H0: β1 = 0 (and conclude that the relationship between x and y is significant) when F(model) is large. This is intuitive because a large overall F statistic would be obtained when the explained variation is large compared to the unexplained variation. This would occur if x is significantly related to y, which would imply that the slope β1 is not equal to 0.

F I G U R E 11.24   An F Test for the Simple Linear Regression Model
[(a) The rejection point Fα based on setting the probability of a Type I error equal to α: under the curve of the F distribution having 1 and n − 2 degrees of freedom, if F(model) ≤ Fα we do not reject H0 in favor of Ha, and if F(model) > Fα we reject H0 in favor of Ha. (b) The p-value is the area to the right of F(model); if the p-value is smaller than α, then F(model) > Fα and we reject H0.]

Figure 11.24(a) illustrates that we reject H0 when F(model) is greater than Fα.

Figure 11.24(b) illustrates that the second condition in the box (p-value < α) is an equivalent way to carry out this test. When F(model) is large, the related p-value is small. When the p-value is small enough [resulting from an F(model) statistic that is large enough], we reject H0.

Testing the significance of the regression relationship between y and x by using the overall F statistic and its related p-value is equivalent to doing this test by using the t statistic and its related p-value. Specifically, it can be shown that (t)² = F(model) and that (t_{α/2})² based on n − 2 degrees of freedom equals Fα based on 1 numerator and n − 2 denominator degrees of freedom. It follows that the rejection point conditions |t| > t_{α/2} and F(model) > Fα are equivalent. Furthermore, the p-values related to t and F(model) can be shown to be equal. Because these tests are equivalent, it would be logical to ask why we have presented the F test. There are two reasons. First, most standard regression computer packages include the results of the F test as a part of the regression output. Second, the F test has a useful generalization in multiple regression analysis (where we employ more than one predictor variable). The F test in multiple regression is not equivalent to a t test. This is further explained in Chapter 12.

EXAMPLE 11.17  The Fuel Consumption Case

Consider the fuel consumption problem and the MINITAB output in Example 11.13 (page 487) of the simple linear regression model relating weekly fuel consumption y to average hourly temperature x. Looking at this output, we see that the explained variation is 22.981 and the unexplained variation is 2.568. It follows that

F(model) = Explained variation / [(Unexplained variation) / (n − 2)] = 22.981 / [2.568/(8 − 2)] = 53.69

Note that this overall F statistic is given on the MINITAB output (it is labeled as “F”). If we wish to test the significance of the regression relationship with level of significance α = .05, we use the rejection point F.05 based on 1 numerator and 6 denominator degrees of freedom. Using Table A.6 (page 818), we find that F.05 = 5.99. Since F(model) = 53.69 > F.05 = 5.99, we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 at level of significance .05. Alternatively, the p-value related to F(model) is the area to the right of 53.69 under the curve of the F distribution having 1 numerator and 6 denominator degrees of freedom. This p-value is also given on the MINITAB output (labeled “p”). Here MINITAB tells us that the p-value is .000 (which means less than .001). Because the p-value is smaller than .05, .01, and .001, we can reject H0 at level of significance .05, .01, or .001. Therefore, we have extremely strong evidence that H0: β1 = 0 should be rejected and that the regression relationship between x and y is significant. That is, we might say that we have extremely strong evidence that the simple linear model relating y to x is significant.

As another example, the MegaStat output in Example 11.14 (page 487) tells us that for the QHIC simple linear regression model F(model) is 305.06 and the related p-value is less than .001. Since the p-value is less than .001, we have extremely strong evidence that the regression relationship is significant.
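A short sketch of the F(model) computation for Example 11.17 (Python; the explained and unexplained variations are the values quoted from the MINITAB output, and SciPy is assumed for the F distribution):

    from scipy import stats

    explained, unexplained, n = 22.981, 2.568, 8

    # Overall F statistic: explained variation over unexplained/(n - 2)
    f_model = explained / (unexplained / (n - 2))   # about 53.69

    # p-value: area to the right of F(model) under the F(1, n - 2) curve
    p_value = stats.f.sf(f_model, 1, n - 2)         # about .0003, i.e., < .001

    # Rejection point F.05 with 1 and 6 degrees of freedom (about 5.99)
    f_05 = stats.f.ppf(0.95, 1, n - 2)
    print(f_model, p_value, f_05)

Squaring the t statistic from Example 11.16 (about −7.3) reproduces f_model up to rounding, illustrating the equivalence of the two tests.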
Exercises for Section 11.7

CONCEPTS
11.64 What are the null and alternative hypotheses for the F test in simple linear regression?
11.65 The F test in simple linear regression is equivalent to what other test?

METHODS AND APPLICATIONS
In Exercises 11.66 through 11.71 we refer to MINITAB, MegaStat, and Excel output of simple linear regression analyses of the data sets related to the six previously discussed case studies. Using the appropriate computer output,
a Calculate the F(model) statistic by using the explained variation, the unexplained variation, and other relevant quantities.
b Utilize the F(model) statistic and the appropriate rejection point to test H0: β1 = 0 versus Ha: β1 ≠ 0 by setting α equal to .05. What do you conclude about the relationship between y and x?
c Utilize the F(model) statistic and the appropriate rejection point to test H0: β1 = 0 versus Ha: β1 ≠ 0 by setting α equal to .01. What do you conclude about the relationship between y and x?
d Find the p-value related to F(model). Using the p-value, determine whether we can reject H0: β1 = 0 in favor of Ha: β1 ≠ 0 by setting α equal to .10, .05, .01, and .001. What do you conclude?
e Show that the F(model) statistic is the square of the t statistic for testing H0: β1 = 0 versus Ha: β1 ≠ 0. Also, show that the F.05 rejection point is the square of the t.025 rejection point.

11.66 THE STARTING SALARY CASE   StartSal
Use the MINITAB output in Figure 11.13 (page 474).
11.67 THE SERVICE TIME CASE   SrvcTime
Use the MINITAB output in Figure 11.14 (page 474).
11.68 THE FRESH DETERGENT CASE   Fresh
Use the MINITAB output in Figure 11.15 (page 475).
11.69 THE DIRECT LABOR COST CASE   DirLab
Use the MegaStat output in Figure 11.16 (page 475).
11.70 THE REAL ESTATE SALES PRICE CASE   RealEst
Use the MINITAB output in Figure 11.17 (page 476).
11.71 THE FAST-FOOD RESTAURANT RATING CASE   FastFood
Use the Excel output in Figure 11.18 (page 476) or the MINITAB output in Figure 11.21 (page 484).

*11.8 ■ Residual Analysis

In this section we explain how to check the validity of the regression assumptions. The required checks are carried out by analyzing the regression residuals. The residuals are defined as follows:

For any particular observed value of y, the corresponding residual is

e = y − ŷ = (observed value of y − predicted value of y)

where the predicted value of y is calculated using the least squares prediction equation

ŷ = b0 + b1x

The linear regression model y = β0 + β1x + ε implies that the error term ε is given by the equation ε = y − (β0 + β1x). Since ŷ in the previous box is clearly the point estimate of β0 + β1x, we see that the residual e = y − ŷ is the point estimate of the error term ε. If the regression assumptions are valid, then, for any given value of the independent variable, the population of potential error term values will be normally distributed with mean 0 and variance σ² (see the regression assumptions in Section 11.3 on page 466). Furthermore, the different error terms will be statistically independent. Because the residuals provide point estimates of the error terms, it follows that if the regression assumptions hold, the residuals should look like they have been randomly and independently selected from normally distributed populations having mean 0 and variance σ².
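In code, a residual is just the gap between each observed y and its fitted value. A minimal sketch (Python; the helper and its names are ours, and the illustration values are the QHIC numbers quoted in Example 11.18 just below):

    import numpy as np

    def residuals(x, y, b0, b1):
        """Residuals e = y - y_hat for the least squares line y_hat = b0 + b1*x."""
        y_hat = b0 + b1 * np.asarray(x)
        return np.asarray(y) - y_hat

    # First QHIC home: y = 1412.08, x = 237.00, with least squares
    # estimates b0 = -348.3921 and b1 = 7.2583 (values quoted in the text)
    print(residuals([237.00], [1412.08], -348.3921, 7.2583))  # about [40.26]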

In any real regression problem, the regression assumptions will not hold exactly. In fact, it is important to point out that mild departures from the regression assumptions do not seriously hinder our ability to use a regression model to make statistical inferences. Therefore, we are looking for pronounced, rather than subtle, departures from the regression assumptions. Because of this, we will require that the residuals only approximately fit the description just given.

Residual plots
One useful way to analyze residuals is to plot them versus various criteria. The resulting plots are called residual plots. To construct a residual plot, we compute the residual for each observed y value. The calculated residuals are then plotted versus some criterion. To validate the regression assumptions, we make residual plots against (1) values of the independent variable x, (2) values of ŷ, the predicted value of the dependent variable, and (3) the time order in which the data have been observed (if the regression data are time series data). We next look at an example of constructing residual plots.

EXAMPLE 11.18  The QHIC Case

Figure 11.25(a) presents the predicted home upkeep expenditures and residuals that are given by the simple linear regression model describing the QHIC data. Here each residual is computed as

e = y − ŷ = y − (b0 + b1x) = y − (−348.3921 + 7.2583x)

For instance, for the first observation (home), when y = 1,412.08 and x = 237.00 (see Table 11.2 on page 453), the residual is

e = 1,412.08 − (−348.3921 + 7.2583(237)) = 1,412.08 − 1,371.816 = 40.264

The MINITAB output in Figures 11.25(b) and (c) gives plots of the residuals for the QHIC simple linear regression model against values of x and ŷ. To understand how these plots are constructed, recall that for the first observation (home) y = 1,412.08, x = 237.00, ŷ = 1,371.816, and the residual is 40.264. It follows that the point plotted in Figure 11.25(b) corresponding to the first observation has a horizontal axis coordinate of the x value 237.00 and a vertical axis coordinate of the residual 40.264. It also follows that the point plotted in Figure 11.25(c) corresponding to the first observation has a horizontal axis coordinate of the ŷ value 1,371.816 and a vertical axis coordinate of the residual 40.264. Finally, note that the QHIC data are cross-sectional data, not time series data. Therefore, we cannot make a residual plot versus time.

The constant variance assumption
To check the validity of the constant variance assumption, we examine plots of the residuals against values of x, ŷ, and time (if the regression data are time series data). When we look at these plots, the pattern of the residuals’ fluctuation around 0 tells us about the validity of the constant variance assumption. A residual plot that “fans out” [as in Figure 11.26(a)] suggests that the error terms are becoming more spread out as the horizontal plot value increases and that the constant variance assumption is violated. Here we would say that an increasing error variance exists. A residual plot that “funnels in” [as in Figure 11.26(b)] suggests that the spread of the error terms is decreasing as the horizontal plot value increases and that again the constant variance assumption is violated. In this case we would say that a decreasing error variance exists. A residual plot with a “horizontal band appearance” [as in Figure 11.26(c)] suggests that the spread of the error terms around 0 is not changing much as the horizontal plot value increases. Such a plot tells us that the constant variance assumption (approximately) holds.
As an example, consider the QHIC case and the residual plot in Figure 11.25(b). This plot appears to fan out as x increases, indicating that the spread of the error terms is increasing as x increases. That is, an increasing error variance exists. This is equivalent to saying that the variance of the population of potential yearly upkeep expenditures for houses worth x (thousand dollars) appears to increase as x increases. The reason is that the model y = β0 + β1x + ε says that the variation of y is the same as the variation of ε. For example, the variance of the population of potential yearly upkeep expenditures for houses worth $200,000 would be larger than the variance of the population of potential yearly upkeep expenditures for houses worth $100,000.

F I G U R E 11.25   MegaStat and MINITAB Output of the Residuals and Residual Plots for the QHIC Simple Linear Regression Model
[(a) MegaStat output listing, for each of the 40 homes, the observed upkeep expenditure, the predicted upkeep expenditure, and the residual. (b) MINITAB output of the residual plot versus x (VALUE). (c) MINITAB output of the residual plot versus ŷ (fitted value).]

F I G U R E 11.26   Residual Plots and the Constant Variance Assumption
[(a) Increasing error variance: the residuals fan out. (b) Decreasing error variance: the residuals funnel in. (c) Constant error variance: the residuals form a horizontal band.]

Increasing variance makes some intuitive sense because people with more expensive homes generally have more discretionary income. These people can choose to spend either a substantial amount or a much smaller amount on home upkeep, thus causing a relatively large variation in upkeep expenditures.

Another residual plot showing the increasing error variance in the QHIC case is Figure 11.25(c). This plot tells us that the residuals appear to fan out as ŷ (predicted y) increases, which is logical because ŷ is an increasing function of x. Also, note that the original scatter plot of upkeep expenditure, y, versus home value, x, in Figure 11.4 (page 453) shows the increasing error variance—the y values appear to fan out as x increases. In fact, one might ask why we need to consider residual plots when we can simply look at scatter plots. One answer is that, in general, because of possible differences in scaling between residual plots and scatter plots, one of these types of plots might be more informative in a particular situation. Therefore, we should always consider both types of plots.

The assumption of correct functional form
If the functional form of a regression model is incorrect, the residual plots constructed by using the model often display a pattern suggesting the form of a more appropriate model. For instance, if we use a simple linear regression model when the true relationship between y and x is curved, the residual plot will have a curved appearance. For example, the scatter plot of upkeep expenditure, y, versus home value, x, in Figure 11.4 (page 453) has either a straight-line or slightly curved appearance. We used a simple linear regression model to describe the relationship between y and x, but both the scatter plot and residual plots indicate that there might be a slightly curved relationship between y and x. Later in this section we discuss one way to model curved relationships.

The normality assumption
If the normality assumption holds, a histogram and/or stem-and-leaf display of the residuals should look reasonably bell-shaped and reasonably symmetric about 0. Figure 11.27(a) gives the MINITAB output of a stem-and-leaf display of the residuals from the simple linear regression model describing the QHIC data. The stem-and-leaf display looks fairly bell-shaped and symmetric about 0, but note that there is a “dip” in the display and that the tails look somewhat long and “heavy” or “thick,” indicating a possible violation of the normality assumption.

Another way to check the normality assumption is to construct a normal plot of the residuals. To make a normal plot, we first arrange the residuals in order from smallest to largest. Letting the ordered residuals be denoted as e(1), e(2), . . . , e(n), we denote the ith residual in the ordered listing as e(i). We plot e(i) on the vertical axis against a point called z(i) on the horizontal axis. Here z(i) is defined to be the point on the horizontal axis under the standard normal curve so that the area under this curve to the left of z(i) is (3i − 1)/(3n + 1). For example, recall in the QHIC case that there are n = 40 residuals given in Figure 11.25(a). It follows that, when i = 1,

(3i − 1)/(3n + 1) = (3(1) − 1)/(3(40) + 1) = 2/121 = .0165
This implies that the area under the standard normal curve between z(1) and 0 is .5 − .0165 = .4835. Therefore, as illustrated in Figure 11.27(b), z(1) equals −2.13. Because the smallest residual in Figure 11.25(a) is −289.044, the first point plotted is e(1) = −289.044 on the vertical scale versus z(1) = −2.13 on the horizontal scale. When i = 2, it can be verified that (3i − 1)/(3n + 1) equals .0413 and thus that z(2) = −1.74. Therefore, because the second-smallest residual in Figure 11.25(a) is −259.958, the second point plotted is e(2) = −259.958 on the vertical scale versus z(2) = −1.74 on the horizontal scale. This process is continued until the entire normal plot is constructed. The MINITAB output of this plot is given in Figure 11.27(c). It can be proven that, if the normality assumption holds, then the expected value of the ith ordered residual e(i) is proportional to z(i). Therefore, a plot of the e(i) values on the vertical scale versus the z(i) values on the horizontal scale should have a straight-line appearance. That is, if the normality assumption holds, then the normal plot should have a straight-line appearance.
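The z(i) values are just standard normal quantiles evaluated at the areas (3i − 1)/(3n + 1). A small sketch (Python; SciPy is assumed for the inverse normal CDF, and the helper name is ours):

    import numpy as np
    from scipy import stats

    def normal_plot_points(residuals):
        """Return (z, e) pairs for a normal plot of the residuals."""
        e = np.sort(np.asarray(residuals, dtype=float))
        n = len(e)
        i = np.arange(1, n + 1)
        areas = (3 * i - 1) / (3 * n + 1)
        z = stats.norm.ppf(areas)   # point with the given area to its left
        return z, e

    # For n = 40 the first area is 2/121 = .0165, so z(1) is about -2.13:
    z, _ = normal_plot_points(np.zeros(40))
    print(z[0])   # about -2.13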

F I G U R E 11.27   Stem-and-Leaf Display and a Normal Plot of the Residuals from the Simple Linear Regression Model Describing the QHIC Data
[(a) MINITAB output of the stem-and-leaf display of the residuals (N = 40, leaf unit = 10). (b) Calculating z(1) for a normal plot: under the standard normal curve, the area to the left of z(1) = −2.13 is (3(1) − 1)/(3(40) + 1) = 2/121 = .0165, so the area between z(1) and 0 is .4835. (c) MINITAB output of the normal probability plot of the residuals.]

A normal plot that does not look like a straight line (admittedly, a subjective decision) indicates that the normality assumption is violated. Since the normal plot in Figure 11.27(c) has some curvature (particularly in the upper right portion), there is a possible violation of the normality assumption.

It is important to realize that violations of the constant variance and correct functional form assumptions can often cause a histogram and/or stem-and-leaf display of the residuals to look nonnormal and can cause the normal plot to have a curved appearance. Because of this, it is usually a good idea to use residual plots to check for nonconstant variance and incorrect functional form before making any final conclusions about the normality assumption. Later in this section we discuss a procedure that sometimes remedies simultaneous violations of the constant variance, correct functional form, and normality assumptions.

The independence assumption
The independence assumption is most likely to be violated when the regression data are time series data—that is, data that have been collected in a time sequence. For such data the time-ordered error terms can be autocorrelated. Intuitively, we say that error terms occurring over time have positive autocorrelation if a positive error term in time period i tends to produce, or be followed by, another positive error term in time period i + k (some later time period) and if a negative error term in time period i tends to produce, or be followed by, another negative error term in time period i + k. In other words, positive autocorrelation exists when positive error terms tend to be followed over time by positive error terms and when negative error terms tend to be followed over time by negative error terms. Positive autocorrelation in the error terms is depicted in Figure 11.28, which illustrates that positive autocorrelation can produce a cyclical error term pattern over time. The simple linear regression model implies that a positive error term produces a greater-than-average value of y and a negative error term produces a smaller-than-average value of y. It follows that positive

autocorrelation in the error terms means that greater-than-average values of y tend to be followed by greater-than-average values of y, and smaller-than-average values of y tend to be followed by smaller-than-average values of y.

F I G U R E 11.28   Positive Autocorrelation in the Error Terms: Cyclical Pattern
[Plot of the error terms versus time showing a cyclical pattern.]
F I G U R E 11.29   Negative Autocorrelation in the Error Terms: Alternating Pattern
[Plot of the error terms versus time showing an alternating pattern.]

An example of positive autocorrelation could hypothetically be provided by a simple linear regression model relating demand for a product to advertising expenditure. Here we assume that the data are time series data observed over a number of consecutive sales periods. One of the factors included in the error term of the simple linear regression model is competitors’ advertising expenditure for their similar products. If, for the moment, we assume that competitors’ advertising expenditure significantly affects the demand for the product, then a higher-than-average competitors’ advertising expenditure probably causes demand for the product to be lower than average and hence probably causes a negative error term. On the other hand, a lower-than-average competitors’ advertising expenditure probably causes the demand for the product to be higher than average and hence probably causes a positive error term. If, then, competitors tend to spend money on advertising in a cyclical fashion—spending large amounts for several consecutive sales periods (during an advertising campaign) and then spending lesser amounts for several consecutive sales periods—a negative error term in one sales period will tend to be followed by a negative error term in the next sales period, and a positive error term in one sales period will tend to be followed by a positive error term in the next sales period. In this case the error terms would display positive autocorrelation, and thus these error terms would not be statistically independent.

Intuitively, error terms occurring over time have negative autocorrelation if a positive error term in time period i tends to produce, or be followed by, a negative error term in time period i + k and if a negative error term in time period i tends to produce, or be followed by, a positive error term in time period i + k. In other words, negative autocorrelation exists when positive error terms tend to be followed over time by negative error terms and negative error terms tend to be followed over time by positive error terms. An example of negative autocorrelation in the error terms is depicted in Figure 11.29, which illustrates that negative autocorrelation in the error terms can produce an alternating pattern over time.
It follows that negative autocorrelation in the error terms means that greater-than-average values of y tend to be followed by smaller-than-average values of y and smaller-than-average values of y tend to be followed by greater-than-average values of y. An example of negative autocorrelation might be provided by a retailer’s weekly stock orders. Here a larger-than-average stock order one week might result in an oversupply and hence a smaller-than-average order the next week. In this case the error terms are negatively autocorrelated, and the independence assumption is violated.

The independence assumption basically says that the time-ordered error terms display no positive or negative autocorrelation. This says that the error terms occur in a random pattern over time. Such a random pattern would imply that the error terms (and their corresponding y values) are statistically independent.

Because the residuals are point estimates of the error terms, a residual plot versus time is used to check the independence assumption. If a residual plot versus the data’s time sequence has a cyclical appearance, the error terms are positively autocorrelated, and the independence assumption is violated. If a plot of the time-ordered residuals has an alternating pattern, the error terms are negatively autocorrelated, and again the independence assumption is violated. However, if a plot of the time-ordered residuals displays a random pattern, the error terms have little or no autocorrelation. In such a case, it is reasonable to conclude that the independence assumption holds.

EXAMPLE 11.19

Figure 11.30(a) presents data concerning weekly sales at Pages’ Bookstore (Sales), Pages’ weekly advertising expenditure (Adver), and the weekly advertising expenditure of Pages’ main competitor (Compadv). Here the sales values are expressed in thousands of dollars, and the advertising expenditure values are expressed in hundreds of dollars. Figure 11.30(a) also gives the residuals that are obtained when MegaStat is used to perform a simple linear regression analysis relating Pages’ sales to Pages’ advertising expenditure. These residuals are plotted versus time in Figure 11.30(b). We see that the residual plot has a cyclical pattern. This tells us that the error terms for the model are positively autocorrelated and the independence assumption is violated. Furthermore, there tend to be positive residuals when the competitor’s advertising expenditure is lower (in weeks 1 through 8 and weeks 14, 15, and 16) and negative residuals when the competitor’s advertising expenditure is higher (in weeks 9 through 13). Therefore, the competitor’s advertising expenditure seems to be causing the positive autocorrelation.

F I G U R E 11.30   Pages’ Bookstore Sales and Advertising Data, and Residual Analysis   BookSales
(a) The data and the MegaStat output of the residuals from a simple linear regression relating Pages’ sales to Pages’ advertising expenditure

Observation   Adver   Compadv   Sales
 1            18      10        22
 2            20      10        27
 3            20      15        23
 4            25      15        31
 5            28      15        45
 6            29      20        47
 7            29      20        45
 8            28      25        42
 9            30      35        37
10            31      35        39
11            34      35        45
12            35      30        52
13            36      30        57
14            38      25        62
15            41      20        73
16            45      20        84

[The MegaStat output also lists the predicted sales value and the residual for each week and reports a Durbin–Watson statistic of 0.65.]

(b) MegaStat output of a plot of the residuals in Figure 11.30(a) versus time
[The plot shows the 16 time-ordered residuals fluctuating around 0 in a cyclical pattern.]
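A sketch of how a plot like Figure 11.30(b) could be produced (Python; the data arrays are transcribed from the table above, matplotlib is assumed, and the plotting details are our own choices rather than MegaStat's):

    import numpy as np
    import matplotlib.pyplot as plt

    adver = np.array([18, 20, 20, 25, 28, 29, 29, 28, 30, 31, 34, 35, 36, 38, 41, 45])
    sales = np.array([22, 27, 23, 31, 45, 47, 45, 42, 37, 39, 45, 52, 57, 62, 73, 84])

    # Simple linear regression of sales on Pages' advertising expenditure
    b1 = np.sum((adver - adver.mean()) * (sales - sales.mean())) / np.sum((adver - adver.mean()) ** 2)
    b0 = sales.mean() - b1 * adver.mean()
    resid = sales - (b0 + b1 * adver)

    # Time-ordered residuals; a cyclical pattern suggests positive autocorrelation
    weeks = np.arange(1, 17)
    plt.plot(weeks, resid, marker="o")
    plt.axhline(0)
    plt.xlabel("Week")
    plt.ylabel("Residual")
    plt.show()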

To conclude this example, note that the simple linear regression model relating Pages’ sales to Pages’ advertising expenditure has a standard error, s, of 5.038. The MegaStat residual plot in Figure 11.30(b) includes grid lines that are placed one and two standard errors above and below the residual mean of 0. All MegaStat residual plots use such grid lines to help better diagnose potential violations of the regression assumptions.

When the independence assumption is violated, various remedies can be employed. One approach is to identify which independent variable left in the error term (for example, competitors’ advertising expenditure) is causing the error terms to be autocorrelated. We can then remove this independent variable from the error term and insert it directly into the regression model, forming a multiple regression model. (Multiple regression models are discussed in Chapter 12.)

The Durbin–Watson test
One type of positive or negative autocorrelation is called first-order autocorrelation. It says that ε_t, the error term in time period t, is related to ε_{t−1}, the error term in time period t − 1. To check for first-order autocorrelation, we can use the Durbin–Watson statistic

d = Σt=2…n (e_t − e_{t−1})² / Σt=1…n e_t²

where e1, e2, . . . , en are the time-ordered residuals. Intuitively, small values of d lead us to conclude that there is positive autocorrelation. This is because, if d is small, the differences (e_t − e_{t−1}) are small. This indicates that the adjacent residuals e_t and e_{t−1} are of the same magnitude, which in turn says that the adjacent error terms ε_t and ε_{t−1} are positively correlated. Using the residuals in Figure 11.30(a), the Durbin–Watson statistic for the simple linear regression model relating Pages’ sales to Pages’ advertising expenditure is calculated to be

d = [(e2 − e1)² + (e3 − e2)² + · · · + (e16 − e15)²] / [e1² + e2² + · · · + e16²] = .65

A MegaStat output of the Durbin–Watson statistic is given at the bottom of Figure 11.30(a).

Consider testing the null hypothesis H0 that the error terms are not autocorrelated versus the alternative hypothesis Ha that the error terms are positively autocorrelated. Durbin and Watson have shown that there are points (denoted dL,α and dU,α) such that, if α is the probability of a Type I error, then
1 If d < dL,α, we reject H0.
2 If d > dU,α, we do not reject H0.
3 If dL,α ≤ d ≤ dU,α, the test is inconclusive.
So that the Durbin–Watson test may be easily done, tables containing the points dL,α and dU,α have been constructed. These tables give the appropriate dL,α and dU,α points for various values of α; k, the number of independent variables used by the regression model; and n, the number of observations. Tables A.10, A.11, and A.12 (pages 827–829) give these points for α = .05, α = .025, and α = .01. A portion of Table A.10 is given in Table 11.7. Note that when we are considering a simple linear regression model, which uses one independent variable, we look up the points dL,α and dU,α under the heading “k = 1.” Other values of k are used when we study multiple regression models in Chapter 12.
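The Durbin–Watson statistic is one line of NumPy. The sketch below defines it as a helper (our own, not a library function) and checks it on a deliberately alternating series:

    import numpy as np

    def durbin_watson(e):
        """d = sum of squared successive differences over sum of squared residuals."""
        e = np.asarray(e, dtype=float)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    # Applied to the 'resid' array from the Pages' Bookstore sketch above,
    # this returns a value near the 0.65 reported on the MegaStat output.
    # Quick self-check with an alternating series (negative autocorrelation):
    print(durbin_watson([1, -1, 1, -1, 1, -1]))  # about 3.33, well above 2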

For example, suppose that we set the probability of a Type I error equal to .05 and wish to test for positive autocorrelation in the Pages’ sales model. Here we note that there are n = 16 observations and that the regression model uses k = 1 independent variable. Looking in Table 11.7, we find that dL,.05 = 1.10 and dU,.05 = 1.37. Since d = .65 is less than dL,.05 = 1.10, we reject the null hypothesis of no autocorrelation. That is, we conclude (at an α of .05) that there is positive (first-order) autocorrelation.

T A B L E 11.7   Critical Values for the Durbin–Watson d Statistic (α = .05)

        k = 1         k = 2         k = 3         k = 4
n      dL    dU      dL    dU      dL    dU      dL    dU
15    1.08  1.36    0.95  1.54    0.82  1.75    0.69  1.97
16    1.10  1.37    0.98  1.54    0.86  1.73    0.74  1.93
17    1.13  1.38    1.02  1.54    0.90  1.71    0.78  1.90
18    1.16  1.39    1.05  1.53    0.93  1.69    0.82  1.87
19    1.18  1.40    1.08  1.53    0.97  1.68    0.86  1.85
20    1.20  1.41    1.10  1.54    1.00  1.68    0.90  1.83

It can be shown that the Durbin–Watson statistic d is always between 0 and 4. Large values of d (and hence small values of 4 − d) lead us to conclude that there is negative autocorrelation, because if d is large, this indicates that the differences (e_t − e_{t−1}) are large. This says that the adjacent error terms ε_t and ε_{t−1} are negatively autocorrelated. Consider testing the null hypothesis H0 that the error terms are not autocorrelated versus the alternative hypothesis Ha that the error terms are negatively autocorrelated. Durbin and Watson have shown that, based on setting the probability of a Type I error equal to α, the points dL,α and dU,α are such that
1 If (4 − d) < dL,α, we reject H0.
2 If (4 − d) > dU,α, we do not reject H0.
3 If dL,α ≤ (4 − d) ≤ dU,α, the test is inconclusive.
As an example, for the Pages’ sales simple linear regression model, we see that (4 − d) = (4 − .65) = 3.35 is greater than dU,.05 = 1.37. Therefore, on the basis of setting α equal to .05, we do not reject the null hypothesis of no autocorrelation. That is, there is no evidence of negative (first-order) autocorrelation.

We can also use the Durbin–Watson statistic to test for positive or negative autocorrelation. Consider testing the null hypothesis H0 that the error terms are not autocorrelated versus the alternative hypothesis Ha that the error terms are positively or negatively autocorrelated. Durbin and Watson have shown that, based on setting the probability of a Type I error equal to α, the points dL,α/2 and dU,α/2 are such that
1 If d < dL,α/2 or if (4 − d) < dL,α/2, we reject H0.
2 If d > dU,α/2 and if (4 − d) > dU,α/2, we do not reject H0.
3 Otherwise, the test is inconclusive.
For example, consider testing for positive or negative autocorrelation in the Pages’ sales model. If we set α equal to .05, then α/2 = .025, and we need to find the points dL,.025 and dU,.025 when n = 16 and k = 1. Looking up these points in Table A.11 (page 828), we find that dL,.025 = .98 and dU,.025 = 1.24. Since d = .65 is less than dL,.025 = .98, we reject the null hypothesis of no autocorrelation. That is, we conclude (at an α of .05) that there is first-order autocorrelation.

Although we have used the Pages’ sales model in these examples to demonstrate the Durbin–Watson tests for (1) positive autocorrelation, (2) negative autocorrelation, and (3) positive or negative autocorrelation, we must in practice choose one of these Durbin–Watson tests in a particular situation. Since positive autocorrelation is more common in real time series data than negative autocorrelation, the Durbin–Watson test for positive autocorrelation is used more often than the other two tests.
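The three-part decision rule is easy to wrap in a small helper; a sketch (the dL and dU values must be supplied from a Durbin–Watson table such as Table 11.7, and the function is our own illustration):

    def dw_positive_test(d, d_l, d_u):
        """Durbin-Watson test for positive autocorrelation: returns a verdict."""
        if d < d_l:
            return "reject H0: evidence of positive autocorrelation"
        if d > d_u:
            return "do not reject H0"
        return "inconclusive"

    # Pages' sales model: d = .65, with dL = 1.10 and dU = 1.37 (n = 16, k = 1)
    print(dw_positive_test(0.65, 1.10, 1.37))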
Also, note that each Durbin–Watson test assumes that the population of all possible residuals at any time t has a normal distribution.

Transforming the dependent variable: A possible remedy for violations of the constant variance, correct functional form, and normality assumptions
In general, if a data or residual plot indicates that the error variance of a regression model increases as an independent variable or the predicted value of the dependent variable increases, then we can sometimes remedy the situation by transforming the dependent variable. One transformation that works well is to take each y value to a fractional power. As an example, we might use a transformation in which we take the square root (or one-half power) of each y value. Letting y* denote the value obtained when the transformation is applied to y, we would write the square root transformation as

y* = y^.5 = √y

Another commonly used transformation is the quartic root transformation. Here we take each y value to the one-fourth power. That is,

y* = y^.25

If we consider a transformation that takes each y value to a fractional power (such as .5, .25, or the like), then as the power approaches 0, the transformed value y* approaches the natural logarithm of y (commonly written ln y). Because of this, we sometimes use the logarithmic transformation

y* = ln y

which takes the natural logarithm of each y value. In general, when we take a fractional power (including the natural logarithm) of the dependent variable, the transformation not only tends to equalize the error variance but also tends to “straighten out” certain types of nonlinear data plots. Specifically, if a data plot indicates that the dependent variable is increasing at an increasing rate (as in Figure 11.4 on page 453), then a fractional power transformation tends to straighten out the data plot. A fractional power transformation can also help to remedy a violation of the normality assumption. Because we cannot know which fractional power to use before we actually take the transformation, we recommend taking all of the square root, quartic root, and natural logarithm transformations and seeing which one best equalizes the error variance and (possibly) straightens out a nonlinear data plot.

EXAMPLE 11.20  The QHIC Case

Consider the QHIC upkeep expenditures. In Figures 11.31, 11.32, and 11.33 we show the plots that result when we take the square root, quartic root, and natural logarithmic transformations of the upkeep expenditures and plot the transformed values versus the home values. The square root transformation seems to best equalize the error variance and straighten out the curved data plot in Figure 11.4. Note that the natural logarithm transformation seems to “overtransform” the data—the error variance tends to decrease as the home value increases and the data plot seems to “bend down.” The plot of the quartic roots indicates that the quartic root transformation also seems to overtransform the data (but not by as much as the logarithmic transformation). In general, as the fractional power gets smaller, the transformation gets stronger. Different fractional powers are best in different situations.

F I G U R E 11.31   MINITAB Plot of the Square Roots of the Upkeep Expenditures versus the Home Values
[SRUPKEEP plotted against VALUE.]
F I G U R E 11.32   MINITAB Plot of the Quartic Roots of the Upkeep Expenditures versus the Home Values
[QRUPKEEP plotted against VALUE.]
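A sketch of trying the three transformations side by side (Python with matplotlib; the arrays below are simulated stand-ins that mimic the QHIC pattern of upkeep rising, and spreading out, with home value—they are not the actual 40 observations):

    import numpy as np
    import matplotlib.pyplot as plt

    # Simulated stand-in for the QHIC data: mean upkeep rises with value,
    # and the noise standard deviation grows with value (increasing variance)
    rng = np.random.default_rng(1)
    value = np.sort(rng.uniform(50, 300, 40))
    upkeep = np.maximum(-348.4 + 7.26 * value + rng.normal(0.0, 0.8, 40) * value, 1.0)

    transforms = [("square root", np.sqrt),
                  ("quartic root", lambda y: y ** 0.25),
                  ("natural log", np.log)]
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, (label, f) in zip(axes, transforms):
        ax.scatter(value, f(upkeep), s=12)
        ax.set_xlabel("VALUE")
        ax.set_ylabel(label)
    plt.show()

Comparing the three panels by eye—which plot is straightest with the most even vertical spread—is exactly the informal selection rule the text recommends.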

In general, as the fractional power gets smaller, the transformation gets stronger. Different fractional powers are best in different situations.

Since the plot in Figure 11.31 of the square roots of the upkeep expenditures versus the home values has a straight-line appearance, we consider the model

    y* = β0 + β1x + ε   where y* = √y

The MINITAB output of a regression analysis using this transformed model is given in Figure 11.34, and the MINITAB output of an analysis of the model's residuals is given in Figure 11.35. Note that the residual plot versus x for the transformed model in Figure 11.35(a) has a horizontal band appearance. It can also be verified that the transformed model's residual plot versus ŷ, which we do not give here, has a similar horizontal band appearance. Therefore, we conclude that the constant variance and the correct functional form assumptions approximately hold for the transformed model.

F I G U R E 11.33  MINITAB Plot of the Natural Logarithms of the Upkeep Expenditures versus the Home Values (LNUPKEEP versus VALUE)

F I G U R E 11.34  MINITAB Output of a Regression Analysis of the Upkeep Expenditure Data by Using the Model y* = β0 + β1x + ε, where y* = √y

Regression Analysis

The regression equation is
SRUPKEEP = 7.20 + 0.127 VALUE

Predictor      Coef        StDev       T        P
Constant       7.201       1.205       5.98     0.000
VALUE          0.127047    0.006577    19.32    0.000

S = 2.325    R-Sq = 90.8%    R-Sq(adj) = 90.5%

Analysis of Variance
Source        DF       SS        MS         F        P
Regression     1     2016.8    2016.8    373.17    0.000
Error         38      205.4       5.4
Total         39     2222.2

Fit       StDev Fit    95.0% CI              95.0% PI
35.151    0.474        (34.191, 36.111)      (30.347, 39.955)
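The summary quantities in Figure 11.34 can be cross-checked against one another from the ANOVA table in the usual way. The short Python sketch below is an editorial verification, not MINITAB output; every input is read off the figure.

    SSR, SSE, SST = 2016.8, 205.4, 2222.2   # sums of squares from Figure 11.34
    n, k = 40, 1                            # 40 homes, one independent variable

    MSE = SSE / (n - k - 1)                 # mean square error, about 5.4
    s = MSE ** 0.5                          # standard error, about 2.325
    r_sq = SSR / SST                        # about 0.908, matching R-Sq = 90.8%
    r_sq_adj = 1 - (SSE / (n - 2)) / (SST / (n - 1))   # about 0.905
    F = (SSR / k) / MSE                     # about 373, matching F = 373.17
    t_slope = 0.127047 / 0.006577           # about 19.32, the slope t statistic

    print(round(s, 3), round(r_sq, 3), round(r_sq_adj, 3),
          round(F, 1), round(t_slope, 2))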

34.201 .347.27 on page 497).235.955].35(c) looks straighter than the normal plot for the untransformed model (see Figure 11.201 35. Suppose that QHIC wishes to send an advertising brochure to any home that has a predicted upkeep expenditure of at least $500. Solving the prediction equation y* ˆ b0 b1x for x.127047(220) Residual Stem-and-leaf of RESI Leaf Unit = 0. Therefore.94.35 Chapter 11 Simple Linear Regression Analysis MINITAB Output of Residual Analysis for the Upkeep Expenditure Model y* B0 B1 x E where y* y . 2003 504 F I G U R E 11.323) .000.5 5 4 3 2 1 0 -1 -2 -3 -4 -5 100 200 VALUE (a) Residual plot versus x Normal Probability Plot of the Residuals 5 4 3 2 1 0 -1 -2 -3 -4 -5 -2 -1 0 Normal Score (c) Normal plot of the residuals 1 2 300 2 5 9 13 18 (7) 15 8 2 1 -4 -3 -2 -1 -0 0 1 2 3 4 40 741 9982 4421 94442 0246889 2333455 044579 7 7 (b) Stem-and-leaf display of the residuals residuals in Figure 11.10 Residual N = 40 . it follows that a point prediction of y* for such a home is y* ˆ 7. and noting that a predicted upkeep expenditure of $500 corresponds to a y* of 1500 22. Consider a home worth $220.995)2] [$920.127047 119. (39. which is [30. Third Edition 11.36068. Because the regression assumptions approximately hold for the transformed regression model.151 This point prediction is given at the bottom of the MINITAB output.347)2. we can use this model to make statistical inferences. 39.Bowerman−O’Connell: Business Statistics in Practice. $1599.60]. we also conclude that the normality assumption approximately holds for the transformed model. as is the 95 percent prediction interval for y*.36068 7. and note that the normal plot of these residuals in Figure 11. Simple Linear Regression Analysis Text © The McGraw−Hill Companies. Using the least squares point estimates on the MINITAB output in Figure11.59 and that a 95 percent prediction interval for this upkeep expenditure is [(30. ˆ it follows that QHIC should send the advertising brochure to any home that has a value of at least x y ˆ* b1 b0 22.3234 (or $119.35(b) looks reasonably bell-shaped and symmetric.151)2 $1. It follows that a point prediction of the upkeep expenditure for a home worth $220.000 is (35.

Recall that because there are many homes of a particular value in the metropolitan area, QHIC is interested in estimating the mean upkeep expenditure corresponding to this value. Consider all homes worth, for example, $220,000. The MINITAB output in Figure 11.34 tells us that a point estimate of the mean of the square roots of the upkeep expenditures for all such homes is 35.151 and that a 95 percent confidence interval for this mean is [34.191, 36.111]. Unfortunately, because it can be shown that the mean of the square root is not the square root of the mean, we cannot transform the results for the mean of the square roots back into a result for the mean of the original upkeep expenditures. This is a major drawback to transforming the dependent variable and one reason why many statisticians avoid transforming the dependent variable unless the regression assumptions are badly violated.

Also, note that the point prediction, 95 percent prediction interval, and value of x obtained here using the transformed model are not very different from the results obtained in Examples 11.5 (page 463) and 11.12 (page 481) using the untransformed model. Furthermore, if we reconsider the residual analysis of the original, untransformed QHIC model in Figures 11.25 (page 495) and 11.27 (page 497), we might conclude that the regression assumptions are not badly violated for the untransformed model. This implies that it might be reasonable to rely on the results obtained using the untransformed model, or to at least rely on the results for the mean upkeep expenditures obtained using the untransformed model. In Chapter 12 we discuss other remedies for violations of the regression assumptions that do not have some of the drawbacks of transforming the dependent variable. Some of these remedies involve transforming the independent variable, a procedure introduced in Exercise 11.85 of this section.

In this section we have concentrated on analyzing the residuals for the QHIC simple linear regression model. If we analyze the residuals in Table 11.4 (page 460) for the fuel consumption simple linear regression model (recall that the fuel consumption data are time series data), we conclude that the regression assumptions approximately hold for this model.

Exercises for Section 11.8

CONCEPTS

11.72  In a regression analysis, what variables should the residuals be plotted against? What types of patterns in residual plots indicate violations of the regression assumptions?

11.73  In regression analysis, how do we check the normality assumption?

11.74  What is one possible remedy for violations of the constant variance, correct functional form, and normality assumptions?

METHODS AND APPLICATIONS

11.75  THE FUEL CONSUMPTION CASE  FuelCon1
Recall that Table 11.4 gives the residuals from the simple linear regression model relating weekly fuel consumption to average hourly temperature. Figure 11.36(a) gives the Excel output of a plot of these residuals versus average hourly temperature. Describe the appearance of this plot. Does the plot indicate any violations of the regression assumptions?

11.76  THE FRESH DETERGENT CASE  Fresh
Figure 11.36(b) gives the MINITAB output of residual diagnostics that are obtained when the simple linear regression model is fit to the Fresh detergent demand data in Exercise 11.9 (page 455). Note that the I chart of the residuals is a plot of the residuals versus time, with control limits that are used to show unusually large residuals (such control limits are discussed in Chapter 14, but we ignore them here). Interpret the diagnostics and determine if they indicate any violations of the regression assumptions.

11.77  THE SERVICE TIME CASE  SrvcTime
Recall that Figure 11.14 on page 474 gives the MegaStat output of a simple linear regression analysis of the service time data in Exercise 11.7. The MegaStat output of the residuals given by this model is given in Figure 11.37, and MegaStat output of residual plots versus x and ŷ is given in Figure 11.38(a) and (b). Do the plots indicate any violations of the regression assumptions?

11.78  THE SERVICE TIME CASE  SrvcTime
Figure 11.37 gives the MegaStat output of the residuals from the simple linear regression model describing the service time data in Exercise 11.7. Do the residuals indicate any violations of the regression assumptions?