FITTING LINES TO DATA
    Least Squares
INTERPRETING THE FITTED LINE
    Interpreting the Intercept
    Interpreting the Slope
PROPERTIES OF RESIDUALS
    Standard Deviation of the Residuals
EXPLAINING VARIATION
    Using the R-Squared Statistic
CONDITIONS FOR SIMPLE REGRESSION
SUMMARY
19 Linear Patterns
Many factors affect the price of a commodity, but we can group these factors into two broad categories: fixed costs and variable costs. Fixed costs are present and of constant size regardless of the quantity; variable costs increase with the amount. As an example, let’s consider the price charged by a jewelry store for a diamond. The size of the diamond determines a variable cost. The larger the diamond, the higher the price – all other things held fixed. Some diamonds are more desirable than others because of a rare color or particular sparkle. We express a variable cost as a rate, for example as dollars per carat. (A carat is a unit of weight commonly used for gems. One carat is 0.2 grams.) Fixed costs are present regardless of the size of the diamond. Fixed costs include the vague category often called “overhead expenses,” such as the cost of lighting the store and maintaining a storefront where the diamonds are shown. These expenses also include the cost of maintaining a web site to advertise goods on-line. If we simply take the ratio of the cost of a diamond to its weight, say $2,500 for a 1-carat diamond, then we mix these costs together. The variable cost is not $2,500 per carat unless this jeweler has no fixed costs. We can separate fixed and variable costs by comparing the prices of diamonds of varying size. By studying the relationship between the price and weight in a diverse collection of diamonds, we will be able to separate fixed costs from variable costs and get a better understanding of what determines the cost of a gem. The technique that we will use to estimate fixed and variable costs is known as regression analysis. Regression analysis builds a description of the relationship (dependence, association) between two variables. In this chapter, we’ll focus on using regression as a descriptive tool and only hint at the use of regression for making inferences about populations.
4/21/2008 19 Linear Patterns

Fitting Lines to Data

The following scatterplot shows the price in dollars versus the weight in carats of 320 emerald-cut diamonds. We put the variable that we are trying to predict on the y-axis and call it the response. The associated variable on the x-axis is called an explanatory variable or predictor. Explanatory variables have many names: factors, covariates, or even independent variables. (The last name is traditional, but introduces the word “independent” in a way that has nothing to do with probability.) Unlike correlation (Chapter 6), it matters which variable is the response and which is the explanatory variable.

[Figure 19-1. Prices versus weights for emerald-cut diamonds. Price ($, using credit card) on the y-axis, Weight (carats) on the x-axis.]

association: direction • bends • variation • surprises

Using the terminology of Chapter 6, the scatterplot shows positive, linear association between weight and price. The correlation between weight and price is r = 0.66. The association is evident, but not terribly strong. There’s quite a bit of variation around the upward trend; diamonds of the same weight do not all cost the same amount. Other characteristics aside from weight influence the price, such as the cut, clarity, or color of the stone.

Least Squares

If the association between x and y is linear, then we can summarize the relationship between the variables with a line. The line omits details in order to show the overall trend. To identify the line, we use an intercept \(b_0\) and a slope \(b_1\). The equation of the resulting line would usually be written \(y = b_0 + b_1 x\) in algebra. Because we associate the symbol y with the observed response, we’ll write the equation of a line that describes data with a different symbol on the left side as

\[ \hat{y} = b_0 + b_1 x. \]

The “hat” or caret over y identifies \(\hat{y}\) as a fitted value, an estimate of y based on an equation fit to the data. It’s there to remind you that the data vary around the line. Expressed using the names of the variables, the equation of this line is

Estimated Price = \(b_0\) + \(b_1\) Weight

The word “Estimated” takes the place of the caret.

fitted value: Estimate of response based on fitting a line to data.
To measure how close the line comes to a point, we use the vertical deviation \(y - \hat{y}\). A single, straight line will not pass through every point in the scatterplot unless there are only 2 points or the points are perfectly aligned (r = 1 or r = –1). We use vertical deviations, rather than the perpendicular deviations used in geometry, because if we use the line to predict y from x, the deviation \(y - \hat{y}\) is the error we’d make. We want these deviations to be as close to zero as possible, and the best-fitting line makes these deviations as small as possible. These deviations are important in regression analysis and are called residuals. A residual is the vertical deviation (positive or negative) from the line,

\[ e = y - \hat{y} = y - b_0 - b_1 x \]

residual: Vertical distance of a point from a line.

Each observation defines a residual. The units of the residuals match the units of the response. In this example, the residuals are measured in dollars. The double-headed arrows in this drawing illustrate two residuals. Residuals are positive for points above the line (y1 in Figure 19-2) and negative for points below the line (y2).

[Figure 19-2. Two residuals: the point y1 lies above the line at x1 and the point y2 lies below the line at x2; the fitted values b0 + b1 x1 and b0 + b1 x2 lie on the line.]

Some of the residuals will be positive and others will be negative; the average of the residuals from a least squares regression is zero. To avoid canceling negative and positive residuals, we choose \(b_0\) and \(b_1\) so that the resulting line minimizes the sum of the squared residuals. This line is called the least squares regression line. (The formulas for \(b_0\) and \(b_1\) are shown in Under the Hood: The Least Squares Line.)

least squares regression: Picks the line that minimizes the sum of the squared residuals.

We used a software package to compute the least squares regression line. The following scatterplot adds the least squares regression line to the plot of the price on the weight for the 320 diamonds. The slope is \(b_1\) = 2,670 and the intercept is \(b_0\) = 43. Put together, the line fit to these diamonds is

Estimated Price = 43 + 2,670 Weight

Interpreting the Fitted Line

You should always look at a graph that shows the least squares regression line with the data.
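The least squares idea can be sketched in a few lines of Python. This is a minimal illustration, not the software package used for the diamonds: the function names and the five weight/price pairs are made up for the example, and the closed-form expressions for the slope and intercept are the standard least squares solution (the chapter defers its own formulas to Under the Hood).

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the line that minimizes the sum of squared
    vertical residuals. Uses the standard closed-form solution."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope: units of y per unit of x
    b0 = y_bar - b1 * x_bar   # intercept: units of y
    return b0, b1


def residuals(x, y, b0, b1):
    """Vertical deviations e = y - (b0 + b1 x), one per observation."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def sum_of_squares(values):
    return sum(v * v for v in values)


# Hypothetical data (NOT the 320 diamonds from the chapter).
weights = [0.30, 0.35, 0.40, 0.45, 0.50]          # carats
prices = [850.0, 1010.0, 1080.0, 1290.0, 1350.0]  # dollars

b0, b1 = least_squares_line(weights, prices)
e = residuals(weights, prices, b0, b1)
print("intercept:", b0, "slope:", b1)
print("sum of residuals:", sum(e))
print("sum of squared residuals:", sum_of_squares(e))
```

Nudging either estimate away from the least squares values (for example, `b1 + 50`) should only increase the sum of squared residuals, which is one way to convince yourself that the line is the minimizer.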
[Figure 19-3. Estimating prices using the least squares line. Price ($, using credit card) versus Weight (carats).]

The equation of this line estimates, or predicts, the price of a diamond of any weight. For example, if we set the weight to x = 0.4 carats, then the estimated price is (follow the blue arrows in Figure 19-3)

\[ \hat{y} = 43 + 2{,}670 \times 0.4 = \$1{,}111 \]

The price of one of the diamonds that weigh 0.4 carats is $1,009. Because this diamond costs less than the fitted value, the residual at this point is negative:

\[ e = y - \hat{y} = \$1{,}009 - \$1{,}111 = -\$102 \]

Another diamond that weighs 0.4 carats is priced at $1,251. It costs more than \(\hat{y}\) = $1,111, so its residual is positive:

\[ e = y - \hat{y} = \$1{,}251 - \$1{,}111 = \$140 \]

To estimate the price of a larger ½-carat diamond, set x = 0.5 carats. The estimated price is $267 higher than the price of a 0.4-carat diamond (follow the green arrows in Figure 19-3):

\[ \hat{y} = 43 + 2{,}670 \times 0.5 = \$1{,}378 \]

Interpreting the Intercept

The interpretation of the intercept and slope of a line fit to data is an essential part of regression modeling. The intercept \(b_0\) and the slope \(b_1\) are easy to interpret if you are familiar with the data and pay attention to measurement units.

The intercept has the measurement units of y. Because the response in this example measures price in dollars, the estimated intercept \(b_0\) is not just 43; it’s $43. To interpret \(b_0\), think of it as telling you how much of the response is always present, regardless of x. The intercept represents the portion of the price that is present regardless of the weight: fixed costs. In this example, a jeweler has costs regardless of the size of the gem: storage, labor, and other costs of running the business. This equation estimates that fixed costs make up $43 of the selling price of every diamond.

interpretations of b0:
1. Component of y that is present regardless of x.
2. Estimates average of the response when x = 0.
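The arithmetic for fitted values and residuals can be checked with a short sketch. The intercept (43) and slope (2,670) are the rounded estimates reported for the diamonds; the function names are hypothetical and chosen only for this illustration.

```python
B0 = 43.0     # intercept, in dollars (rounded estimate from the text)
B1 = 2670.0   # slope, in dollars per carat (rounded estimate from the text)


def estimated_price(weight_carats):
    """Fitted value: Estimated Price = b0 + b1 * Weight."""
    return B0 + B1 * weight_carats


def residual(actual_price, weight_carats):
    """Residual e = y - y_hat; negative when the diamond costs less than fitted."""
    return actual_price - estimated_price(weight_carats)


print(round(estimated_price(0.4)))   # fitted price of a 0.4-carat diamond
print(round(residual(1009, 0.4)))    # diamond priced below the line
print(round(residual(1251, 0.4)))    # diamond priced above the line
print(round(estimated_price(0.5)))   # fitted price of a 0.5-carat diamond
```

Keeping the units in the comments (dollars, dollars per carat) mirrors the advice in the text: the estimates are easy to interpret once the units are attached.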
It is common to see a second interpretation of an intercept. If we plug x = 0 into the equation for the line, we are left with \(b_0\):

\[ \hat{y} = b_0 + b_1 x \;\Rightarrow\; \hat{y} = b_0 \text{ if } x = 0 \]

Hence, the intercept is the fitted value if x = 0. For the diamonds, if we set the weight to 0, then

Estimated Price = 43 + 2,670 × 0 = $43

Interpreted literally, this equation estimates the cost of a weightless diamond to be $43. That’s a naïve way to think about \(b_0\) in this case. To see the problem, we’ve got to expand the plot to show \(b_0\). The x-axis in Figure 19-3 runs from 0.3 carats up to 0.5 carats, matching the range of weights. The vertical line at the left of the plot is not the real y-axis. To show the intercept, we’ve got to extend the x-axis to show x = 0. The odd appearance of the next scatterplot shows why software generally does not do this: too much white space.

[Figure 19-4. Scatterplot showing the intercept. Price ($, using credit card) versus Weight (carats), with the x-axis extended to 0 and the intercept b0 marked on the y-axis.]

The intercept is the point where the y-axis and the least squares line meet. This point lies far from the data. An extrapolation is an estimate based on conditions unlike those in the data. Saying anything about “weightless diamonds” lies outside what these data can tell us. Equations become less reliable when extended beyond the observations. Unless the range of the explanatory variable includes zero, \(b_0\) lies outside the data and is an extrapolation. That’s often the case.

extrapolation: An estimate outside the range of experience provided in the data.

Interpreting the Slope

The interpretation of the slope typically offers more insights than the intercept because the slope describes how differences in the explanatory variable associate with differences in the response. The slope has units of y divided by units of x. Once you attach units to \(b_1\), its meaning should be clear. The slope in this example converts carats to dollars: \(b_1\) = 2,670 dollars per carat. The slope estimates the variable costs of a diamond. The slope does not mean that a one-carat diamond costs $2,670, even on average. Estimates of price include the intercept (fixed costs), too.
The slope in a regression equation indicates how the average value of y changes as x changes. In essence, the slope in a regression equation compares averages. The slope $2,670/carat compares average prices of diamonds that differ by 1 carat. These data don’t include diamonds that weigh more than 0.5 carats, much less diamonds that differ by a full carat. That’s another extrapolation. It is more sensible to use the slope to compare the prices of diamonds that differ by, say, one tenth of a carat. For example, the average difference in price between diamonds that weigh 0.40 carats and 0.50 carats is

\[ b_1 \times (0.50 - 0.40) = \$2{,}670 \times 0.1 = \$267 \]

Only the difference in weight matters because the fixed costs present in both cancel. The difference in average prices between diamonds that weigh 0.40 versus 0.50 carats is the same as the difference in prices between diamonds that weigh 0.30 versus 0.40 carats.

It is tempting, but incorrect, to describe the slope as “the change in y caused by changing x by 1.” For instance, one might say “The price of a diamond goes up by $267 for each additional tenth of a carat increase in weight.” The problem with this language is that we cannot change the weight of a diamond to see how its price changes. Instead, we compared diamonds of different weights. Diamonds that have different weights can be different in other ways as well. We have already seen that other factors affect the price. Perhaps the heavier diamonds have nicer colors or better cuts. Such lurking variables would mean that some of the price increase that we have attributed to weight is due to these other factors. The problem of confounding (confusing the effects of explanatory variables as in a two-sample t-test, Chapter 18) happens in regression analysis as well.

Example 19.1 Estimating Consumption

Motivation (state the question)

Utilities in many communities rely on “meter readers” who visit homes to read the meters that record consumption of electricity and gas. Unless someone is home to let the meter reader come in, the utility has to estimate the amount of energy used. The utility in this example sells natural gas to homes in the Philadelphia area. Many of these are older homes with the gas meter located in the basement. We can estimate the use of gas in these homes using a regression equation.

Method (describe the data and select an approach)

Checklist: Identify x and y. Link b0 and b1 to problem. Describe data. Check for linear association.

The explanatory variable is the average number of degrees below 65° during the billing period, and the response is the number of hundred cubic feet of gas (CCF) consumed during the billing period (about a month). The explanatory variable is 0 if the average temperature is above 65° (assuming that the homeowner won’t need to heat in this case). The intercept estimates the base level of gas consumption for things unrelated to temperature (such as heating water), and the slope measures the amount of gas used on average per 1° decrease in temperature. For this experiment, the local utility has 4 years of data (n = 48) for an owner-occupied, detached home. Based on the scatterplot, the association appears linear.

Mechanics (do the analysis)

The fitted least squares line shown in the scatterplot tracks the pattern in the data very closely (r = 0.98). There’s relatively little variation around the fitted line. The equation of this line is

Estimated Gas (CCF) = 26.7 + 5.7 Degrees Below 65

The estimated intercept \(b_0\) = 26.7 CCF estimates the amount of gas used for things other than heating, and the slope \(b_1\) estimates that this homeowner’s use of gas for heating increases on average by about 5.7 CCF per 1° drop in average temperature.

Message (summarize the results)

The utility can accurately predict gas use for this home (and perhaps homes like this one) without reading the meter by using the temperature during the billing period. During the summer, the home uses about 27 hundred cubic feet of gas. As the weather gets colder, gas use rises about 6 hundred cubic feet for each degree below 65. For example, we expect this home to use about 27 + 6 × 10 = 87 CCF of gas during a billing period with average temperature 55°.

Are You There?

A manufacturing plant receives orders for customized mechanical parts. The orders vary in size, from about 25 to 150 units. After configuring the production line, a supervisor oversees the production. This scatterplot plots the production time (in hours) versus the number of units for 45 orders.
[Figure 19-5. Scatterplot of production time on order size.]

The least squares regression line shown in the scatterplot is

Estimated Production Time (Hours) = 2.1 + 0.031 × Number of Units

(a) Interpret the intercept of the estimated line. Is the intercept visible in Figure 19-5? [1]
(b) Interpret the slope of the estimated line. [2]
(c) Using the fitted line, estimate the amount of time needed for an order with 100 units. Is this estimate an extrapolation? [3]
(d) Based on the fitted line, how much more time does an order with 100 units require over an order with 50 units? [4]

[1] The intercept (2.1 hours) is best interpreted as the time required for all orders, regardless of size (e.g., to set up production). It is not visible; it lies farther to the left and is thus an extrapolation.
[2] Once production is running, an order takes about 1.9 minutes (60 × 0.031) per unit.
[3] An order for 100 units is estimated to take 2.1 + 0.031 × 100 = 5.2 hours. This is not an extrapolation because we have orders for less and more than 100 units.
[4] 50 more units would need about 0.031 × 50 = 1.55 additional hours.

Properties of Residuals

The slope and intercept describe how y is related to x. The residuals show what’s left over after we account for this relationship. If a regression equation works well, it should capture the underlying pattern. Only simple random variation that can be summarized in a histogram should be left in the residuals.

To see what is left after fitting the line, it is essential to plot the residuals. You can see the residuals in the original scatterplot, but a better plot makes it easier to identify problems by zooming in on these deviations around the least squares regression line. The explanatory variable remains on the x-axis, but the residuals go on the y-axis in place of the response y.
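The construction of a residual plot can be sketched as follows: keep x on the horizontal axis and swap the response for the residuals. The sketch below is an illustration under stated assumptions. The order sizes and times are made-up numbers (not the 45 orders in the example), and `fit_line` is a hypothetical helper that uses the standard least squares formulas. Refitting a line to the points of the residual plot gives a flat line at zero, which is the horizontal reference line drawn in a residual plot.

```python
def fit_line(x, y):
    """Least squares intercept and slope (standard closed-form solution)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical orders: number of units and production time in hours.
units = [25, 40, 60, 75, 90, 110, 130, 150]
hours = [2.9, 3.2, 4.1, 4.3, 5.0, 5.4, 6.3, 6.6]

b0, b1 = fit_line(units, hours)
resid = [h - (b0 + b1 * u) for u, h in zip(units, hours)]

# Points of the residual plot: x stays on the x-axis, e replaces y.
residual_plot_points = list(zip(units, resid))

# A line fit to the residual plot is flat at zero (up to rounding error).
r0, r1 = fit_line(units, resid)
print("intercept of line through residuals:", r0)
print("slope of line through residuals:", r1)
```

A plotting library could draw `residual_plot_points` directly; the list of pairs is kept plain here so the sketch has no dependencies.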
[Figure 19-6. Shearing produces a residual plot. Two panels: Price vs. Weight, showing Price ($, using credit card) versus Weight (carats); and Residual vs. Weight, showing Residual versus Weight (carats).]

Visually, the residual plot shears the original scatterplot, pulling up on the left and pushing down on the right until the regression line in the scatterplot becomes flat. The horizontal line at zero in the residual plot corresponds to the regression line in the scatterplot. By flattening the line, the plot of the residuals focuses our attention on the deviations around the line rather than the line itself.

Do you see a pattern in the residual plot shown in Figure 19-6? If the least squares line captures the underlying relationship, then a scatterplot of residuals versus x should have no pattern. It should stretch out horizontally, with consistent vertical scatter throughout. It should show no bends and ideally lack outliers. In other words, the residuals should show only simple variation that we can summarize in a histogram.

You can check for simplicity of the residuals using the visual test for simplicity (see Chapter 6). One of the following scatterplots is the residual plot from Figure 19-1. The other three scramble the weights so that residuals are randomly paired with the weights. Do you recognize the original plot of the residuals? If all of these plots look the same, then there’s no apparent pattern in the residual plot.
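The scrambling behind the visual test for simplicity can be sketched as below. This is one plausible way to build such a lineup, not necessarily how the chapter's figure was produced: the function name `lineup`, its parameters, and the small set of weights and residuals are assumptions made for the illustration.

```python
import random


def lineup(x, residuals, n_decoys=3, seed=None):
    """Build panels for the visual test: the real (x, e) pairing hidden
    among decoys in which the x-values are randomly re-paired with e.

    Returns (panels, position), where panels[position] is the real plot.
    """
    rng = random.Random(seed)
    panels = []
    for _ in range(n_decoys):
        shuffled_x = list(x)
        rng.shuffle(shuffled_x)                 # scramble the weights
        panels.append(list(zip(shuffled_x, residuals)))
    real = list(zip(x, residuals))
    position = rng.randrange(n_decoys + 1)      # hide the real plot
    panels.insert(position, real)
    return panels, position


# Hypothetical weights (carats) and residuals (dollars).
weights = [0.30, 0.33, 0.36, 0.40, 0.42, 0.45, 0.48, 0.50]
resid = [-20.0, 35.0, -60.0, 90.0, -140.0, 160.0, -210.0, 145.0]

panels, position = lineup(weights, resid, seed=1)
print("number of panels:", len(panels))
print("real plot is panel number:", position + 1)
```

If you cannot pick out `panels[position]` from the decoys when the panels are plotted side by side, the residual plot shows no apparent pattern.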
[Figure 19-7. Simplicity in residuals. Four panels, each showing residuals e versus Weight (carats).]

These plots appear similar. The bottom right plot shows the residuals on the weights. It does have a pattern, but it’s subtle: the residuals become more variable as the diamonds become larger. Smaller diamonds have more consistent prices than larger diamonds. This pattern is well hidden in the initial scatterplot of the data, but becomes more apparent in the scatterplot of residuals on weights. Just as a microscope reveals unseen organisms, a residual plot can reveal subtleties invisible to the naked eye. (We’ll deal with such problems in Chapter 23.)

Standard Deviation of the Residuals

A regression equation should capture the pattern that relates x to y and leave behind only simple variation in the residuals. If we decide that the residuals are “simple enough,” we can summarize them in a histogram.

[Figure 19-8. Summarizing residuals in a histogram. Residual plot versus Weight (carats) alongside a histogram of the residuals (Count).]

The histogram of the diamond residuals appears reasonably symmetric around 0 and bell-shaped, with a hint of skewness. If the residuals are nearly normal, we can summarize the residual variation with a mean and standard deviation. Because we fit this line using least squares, the mean of the residuals must be zero. The standard deviation is more interesting because it measures how much the residuals vary around the fitted line. The standard deviation of the residuals goes by many names, such as the standard error of the regression or the root mean squared error (RMSE). The formula used to compute the standard deviation of the residuals is almost the formula used to calculate other standard deviations:

\[ s_e = \sqrt{\frac{e_1^2 + e_2^2 + \cdots + e_n^2}{n - 2}} \]

We subtract 2 in the denominator because we use two estimates derived from the data, \(b_0\) and \(b_1\), to calculate each residual that contributes to \(s_e\). (See Chapter 15, Under the Hood: Student’s t and Degrees of Freedom.) If all of the data were exactly on a line, then \(s_e\) would be zero. Least squares makes \(s_e\) as small as possible, but it seldom can reduce it to zero.

For the diamonds, \(s_e\) = $169. Like the residuals, the units of \(s_e\) match those of the response (dollars in this case). Since these residuals have a bell-shaped distribution around the regression line, the Empirical Rule implies that the prices of about 2/3 of the diamonds are within $169 of the regression line and about 95% of the prices are within $338 of the regression. If a jeweler quotes a price for a 0.5-carat diamond that is $400 more than this line predicts, we ought to be surprised.

Explaining Variation

A regression line splits each value of the response into two parts, the fitted value and residual:

\[ y = \hat{y} + e \]

The fitted value \(\hat{y}\) represents that portion of y that is related to x, and the residual e represents the effects of other factors. As a percentage, how much of the variation of y belongs with each of these components?

The histograms in this figure show the variation in price (left) and the variation in the residuals (right).

[Figure 19-9. The prices among diamonds (left) vary more than the residuals (right). Price ($, using credit card) versus Weight (carats) and Residuals versus Weight (carats), each with a histogram.]

There is less variation in the residuals after subtracting away the line. How much less? What proportion of the variation remains in the residuals? If you had to put a number between 0% and 100% on the fraction of the variation left in the residuals, what would you say?

The correlation between x and y determines this percentage. In fact, the least squares regression line is the line determined by the correlation studied in Chapter 6. The sample correlation r is confined to the range \(-1 \le r \le 1\). If we square the correlation, we get a value between 0 and 1. The sign won’t matter. The squared correlation \(r^2\) is exactly the fraction of the variation accounted for by the least squares regression line, and \(1 - r^2\) is the fraction of variation left in the residuals. If \(r^2 = 1\) (or 100%), the line represents all of the variation and \(s_e = 0\). If \(r^2 = 0\), on the other hand, the regression line describes none of the variation in the data. In that case, its slope would be zero. For the diamonds, \(r^2 = 0.66^2 \approx 0.434\) and \(1 - r^2\) is 0.566. Because \(0 \le r^2 \le 1\), this summary is often described on a percentage scale. For example, one might say “The fitted line explains 43.4% of the variation in price.”

r-squared: Square of the correlation between x and y, the percentage of “explained” variation.

Using the R-Squared Statistic

\(r^2\) is a popular measure of a regression because of its intuitive interpretation as a percentage. All regression analyses include r-squared as part of the summary of the fitted equation. This lack of units makes it incomplete, however, as a summary statistic. \(r^2\) alone does not indicate the size of the typical residual. The standard deviation of the residuals \(s_e\) is useful along with \(r^2\) because \(s_e\) conveys these units. For the diamonds, since \(s_e\) = $169, we know that fitted prices frequently differ from actual prices by $100 or more. As another example, in Example 19.1 (household gas use) \(r^2\) = 0.955 and \(s_e\) = 16 CCF. The size of \(r^2\) tells us that the data stick close to the fitted line, but only in a relative sense. \(s_e\) tells us that, appealing to the Empirical Rule, about 2/3 of the data lie within about 16 CCF of the fitted line.

There’s no fast rule for how large \(r^2\) must be. The typical size of \(r^2\) varies greatly across applications. In macroeconomics, we will frequently see regression lines that have \(r^2\) larger than 90%. With medical studies and surveys, on the other hand, an \(r^2\) of 30% and less may indicate an important discovery.

tip: Along with the slope and intercept, you should always report both \(r^2\) and \(s_e\) so that others can judge how well your equation describes the data.

Conditions for Simple Regression

Even though we have not assumed that the data are a sample from a population, we need to check three conditions before we summarize the association between two variables with a line. Regression analysis with one explanatory variable and one response is often called “simple regression.” (Regression analysis that uses several explanatory variables at once is called multiple regression.) We can check two of these conditions by looking at scatterplots. Because these conditions are easily verified by looking at plots, once you see the plots, you should immediately know whether a line is a good summary of the association. The third condition requires some thinking.

Two conditions tell us whether the least squares line captures the pattern that relates x and y. The straight-enough condition is met if you think the pattern in the scatterplot looks like a line. If the pattern in the scatterplot of y on x does not look straight, stop. If the relationship appears to bend, for example, an alternative equation is needed (Chapter 20). Because residuals magnify deviations from the linear pattern, it’s important to examine the residuals as well. The residuals must meet the simple-enough condition. The plot of the residuals on x should have no pattern, allowing you to summarize them with a histogram. Outliers, bending patterns, or isolated clusters indicate problems worth investigating.

The no embarrassing lurking variable condition is met if you cannot easily think of another explanation for the pattern in the plot of y on x. We mentioned this issue when discussing the interpretation of the slope of the fitted line in the scatterplot of the prices of the diamonds versus their weights (Figure 19-3). If the larger diamonds are systematically different from the smaller diamonds in ways other than weight alone, then it may be the case that these other differences explain the increase in price better than weight. This issue is known as confounding in two-sample tests of the difference in means (Chapter 18). What’s embarrassing? Unless you always obtain data by running experiments (Chapter 18), you always have the potential for lurking variables. Just make sure there isn’t anything obvious that explains the relationship you are describing. The following example illustrates a relationship that is affected by a lurking variable.

Example 19.2 Lease Costs

Motivation (state the question)

When auto dealers lease cars, they include the cost of depreciation. They want to be sure that the price of the lease plus the resale value of the returned car earns a profit. How can a dealer anticipate the effect of age on the value of a used car? A manufacturer who leases thousands of cars can take the approach that we first used with the diamonds: group cars of similar ages and see how the price drops as cars age. A small dealer that needs to account for local conditions won’t have enough data. They need regression analysis.

Let’s help a BMW dealer in the Philadelphia area determine the cost due to depreciation. The dealer currently estimates that $4,000 is enough to cover the depreciation per year.

Method (describe the data and select an approach)

Checklist: Identify x and y. Link b0 and b1 to problem. Describe data. Check straight-enough condition.

We will fit a least squares regression with a linear equation to see how the age of a car is related to its resale value. Age in years is the explanatory variable and resale value in dollars is the response. The slope is the annual estimated decrease in the resale value of the car; we expect a negative slope. The intercept estimates the value of a new car.

We have prices of 218 used BMWs in the 3-series located in the Philadelphia region. We obtained them from web sites advertising certified used BMWs in 2006. (These same data appear in Chapter 1.) For each car, we have the price (in dollars) and the age in years. This plot shows the data.

Straight-enough. Seems okay. The data for one car at age 0 (4 cars from 2006) seems unusual, but we don’t have many cars in that model year.

✘ No lurking variables. There is an important lurking variable that is associated with both price and age: how far the car has been driven. Older cars are likely to have been driven further than newer models, and we can anticipate that the higher the mileage, the lower the price. Whatever estimate we get for \(b_1\), it will combine the measured effect of age with the unknown effect of mileage. The effect of mileage is mixed with the effect of age.

Mechanics (do the analysis)

Checklist: Interpret the slope and intercept within the context of the problem, assigning units to both. Don’t go on until you’re comfortable with the interpretation of these estimates. Use r2 and se to summarize the fit of your equation. Add units to se. Check the residuals.

The scatterplot shows a linear relationship. There are no extreme outliers or isolated clusters. After a little rounding of \(b_0\) and \(b_1\), the fitted equation (calculated by software) is

Estimated price = 39,850 - 2,900 Age

The slope \(b_1\) estimates how much, on average, the resale price falls per year: about $2,900 per year. The intercept estimates the price of used cars from this current model year to be about $39,850. We can use it to see if our fit is reasonable. The average selling price of new cars like these is $43,000, so the intercept suggests that a car depreciates about $3,000 when it is driven off the lot.

The regression describes \(r^2\) = 45% of the variation in prices, leaving the rest to other factors (such as different models and options). The residual standard deviation is \(s_e\) = $3,367. That’s a lot of residual variation. Our estimate of resale value could be off by thousands of dollars.

A plot of the residuals is an essential part of any regression analysis because it is the best check for additional patterns and interesting quirks in the data.

[Residual plot: Residual (dollars, from -10,000 to 15,000) versus Age (0 to 5 years).]

Simple enough condition: The residuals cluster into vertical groups for cars of each model year. There is roughly equal scatter at each age. These seem okay.

Message (summarize the results)

Our results indicate that used BMW cars (in the 3 series), on average, decline in value by about $2,900 per year. This estimate combines the effect of a car getting older along with the effects of other variables like mileage and damage that accumulate as a car gets older. Thus, our current lease pricing that charges $4,000 per year appears profitable. We should confirm that fees at the time the lease is signed are adequate to cover the estimated $3,000 depreciation that occurs when the lessee drives the car off the lot.

State the limitations. The model leaves more than half of the variation in prices unexplained, however, and we have not taken account of other factors that affect resale value (such as mileage or options). Also, our estimates of the depreciation should only be used for short-term leases. So long as leases last 4 years or less, the estimates should be fine. Longer leases would require extrapolation outside the data used to build our model.
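The lease arithmetic in Example 19.2 can be laid out as a short sketch. The intercept, slope, new-car price, and lease charge are the rounded figures quoted in the example; the variable and function names are hypothetical, and the sketch deliberately ignores the lurking variables (mileage, options) that the example warns about.

```python
B0 = 39850.0                     # intercept: estimated price at age 0, dollars
B1 = -2900.0                     # slope: dollars per year of age
NEW_CAR_PRICE = 43000.0          # average selling price of new cars, dollars
LEASE_CHARGE_PER_YEAR = 4000.0   # dealer's current depreciation charge, dollars


def estimated_resale(age_years):
    """Fitted resale value: Estimated price = 39,850 - 2,900 * Age."""
    return B0 + B1 * age_years


# Drop in value implied when the car is driven off the lot.
drive_off_drop = NEW_CAR_PRICE - B0

# Yearly cushion between the lease charge and the estimated depreciation.
yearly_margin = LEASE_CHARGE_PER_YEAR + B1

print("estimated resale at 4 years:", estimated_resale(4))
print("drive-off drop:", drive_off_drop)
print("yearly margin:", yearly_margin)
```

The drive-off drop computed this way is what the example rounds to "about $3,000". With a residual standard deviation of $3,367, any single car's resale value could still differ from `estimated_resale` by thousands of dollars.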
Summary
The least-squares regression line summarizes how the average value of the response (y) depends upon the value of an explanatory variable or predictor (x). The fitted value of the response is ŷ = b0 + b1 x. b0 is the intercept and b1 is the slope. The intercept has the same units as the response, and the slope has the units of the response divided by the units of the explanatory variable. The vertical deviations e = y - ŷ from the line are residuals. Least squares determines values for b0 and b1 that minimize the sum of the squared residuals. The r² statistic tells the percentage of variation in the response that is described by the equation, and the residual standard deviation se gives the scale of the unexplained variation. The linear equation should satisfy the straight-enough, simple-enough, and no embarrassing lurking variables conditions.

Key Terms
condition
    no embarrassing lurking variable
    simple-enough
    straight-enough
explanatory variable
extrapolation
fitted value
least squares regression
predictor
r-squared
residual
response
root mean squared error (RMSE)
simple regression
standard error of the regression

Formulas
Linear equation
\[ \hat{y} = b_0 + b_1 x \]
b0 is the intercept and b1 is the slope. This equation for a line is sometimes called “slope-intercept form”.

Fitted value
\[ \hat{y} = b_0 + b_1 x \]

Residual
\[ e = y - \hat{y} = y - b_0 - b_1 x \]

Standard deviation of the residuals
\[ s_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2}{n-2} \]

R-squared
Proportion of the variation in y that has been captured by the equation.
\[ r^2 = \operatorname{corr}(y, x)^2 \]
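The formulas for the residuals, se, and r² can be tried on a small example. The sketch below is only an illustration: the function name summarize_fit and the five data points are hypothetical and do not come from the chapter's data, and the coefficients b0 = 2.2 and b1 = 0.6 are the least squares values for these made-up points.

```python
import math

def summarize_fit(x, y, b0, b1):
    """Return (residuals, s_e, r_squared) for the line y-hat = b0 + b1 * x."""
    n = len(x)
    fitted = [b0 + b1 * xi for xi in x]
    # Residual: observed response minus fitted value.
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    # Residual standard deviation: sum of squared residuals over n - 2.
    s_e = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    # r-squared computed as the squared correlation between y and x.
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = sxy ** 2 / (sxx * syy)
    return residuals, s_e, r_squared

# Hypothetical data (not the diamond or BMW data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
residuals, s_e, r_squared = summarize_fit(x, y, b0=2.2, b1=0.6)
print("s_e =", round(s_e, 3), " r-squared =", round(r_squared, 3))
```

Note that se carries the units of the response, whereas r² has no units.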
Best Practices
• Always look at the scatterplot. Regression is reliable if you look at the plot of the response on the explanatory variable. It also helps to add the fitted line to this plot. Plot the residuals versus x to zoom in on deviations from the regression line.
• Know the substantive context of your model. Unless you do, there’s no way for you to decide whether the slope and intercept make sense. If you cannot interpret the slope and intercept, the model may have serious flaws. Perhaps there’s an important lurking factor or extreme outlier.
• Limit predictions to the range of observed conditions. When you extrapolate the equation outside the data, you’re making a bet that the equation keeps going and going. Without data, you won’t know if the predictions make sense. A linear equation often does a reasonable job over the range of observed x-values but fails to describe the relationship beyond this range.

Pitfalls
• Believing that changing x causes changes in y because the linear model describes the variation in y. A linear equation is closely related to a correlation, and correlation is not causation. Drawing a regression line on a scatterplot might suggest causation, but it doesn’t prove it.
• Forgetting about lurking variables. With a little imagination, you can think of other explanations for why heavier diamonds cost more than smaller diamonds. Perhaps heavier diamonds also have more desirable colors or better, more precise cuts. That’s what it means to have a lurking variable: perhaps it’s the lurking variable that produces the higher costs, not your choice of the explanatory variable. A plot will show you an outlier, but you’ve got to know the context to identify a lurking variable.
• Trusting summaries like r² without checking the plots. Although r² measures the strength of the linear equation’s ability to describe the response, a high r² does not demonstrate the appropriateness of the linear equation. For example, the fitted line in the following figure (which is related to Example 19.1) has r² = 0.90. The line describes most of the variation, but misses the effect of warmer weather and underestimates the slope. The relationship is not linear, a topic we will consider in Chapter 20.
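The pitfall of trusting r² without checking plots can be seen with a small numerical sketch. The data below are hypothetical (ten points on the curve y = x², not the gas consumption data), chosen only because a line fits them with a high r² even though the relationship is plainly curved. The sign pattern of the residuals is what a residual plot would reveal.

```python
# Hypothetical curved data: y = x^2 for x = 1..10 (an assumption for illustration).
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares slope and intercept.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# r-squared is high even though the pattern is not linear.
r_squared = sxy ** 2 / (sxx * syy)

# Residuals are positive at both ends and negative in the middle:
# a systematic bend that the single number r-squared hides.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
signs = ["+" if e > 0 else "-" for e in residuals]

print("r-squared =", round(r_squared, 3))
print("residual signs:", " ".join(signs))
```

A residual plot of these values would show a clear U shape, which is the kind of pattern the straight-enough condition is meant to catch.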
Figure 19-10. Gas consumption versus temperature is not linear.

A single outlier, or data that separate into two groups rather than a single cloud of points, can produce a large r² when, in fact, the linear equation is inappropriate. Conversely, a low r² value may be due to a single outlier. It may be that most of the data fall roughly along a straight line with the exception of an outlier.

About the Data
We obtained the data for emerald-cut diamonds from the web site of McGivern Diamonds in the fall of 2004. These data are not a representative sample of prices of diamonds; they are the population of diamonds (of these sizes and cut) offered for sale on a specific date. The data on prices of used BMW cars were similarly taken from listings of used cars available within 100 miles of Philadelphia during the fall of 2006.

Under the Hood: The Least Squares Line
The least squares line minimizes the sum
\[ S(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \]
Two equations, known as the normal equations, lead to formulas for the optimal values for b0 and b1 that determine the best-fitting line:
\[ \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0 \qquad \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)\, x_i = 0 \]
(These equations have nothing to do with a normal distribution. “Normal” in this sense means right angles.) After a bit of algebra, the normal equations give these formulas for the slope and intercept:
\[ b_1 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)} = r\,\frac{s_y}{s_x} \qquad \text{and} \qquad b_0 = \bar{y} - b_1 \bar{x} \]
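The slope and intercept formulas can be checked numerically. The following is a minimal sketch, assuming a small set of hypothetical data points; the function name least_squares_line is an illustration, not something defined in the chapter. It computes b1 and b0 from the formulas and then confirms that the two normal equations hold for the resulting residuals.

```python
def least_squares_line(x, y):
    """Return (b0, b1) from the closed-form least squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # The (n - 1) divisors in cov(x, y) and var(x) cancel in the ratio,
    # so sums of cross-products and squares are enough.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data (not the diamond or BMW data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)

# Left-hand sides of the two normal equations; both should be zero
# up to floating-point rounding.
residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
normal_eq_1 = sum(residuals)
normal_eq_2 = sum(e * xi for e, xi in zip(residuals, x))

print("b0 =", round(b0, 4), " b1 =", round(b1, 4))
print("normal equations:", round(normal_eq_1, 10), round(normal_eq_2, 10))
```

Any other choice of b0 and b1 would leave at least one of these two sums away from zero and would give a larger sum of squared residuals.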
The least squares line matches the line defined by the sample correlation r in Chapter 6. If you remember that the units of the slope are those of y divided by those of x, you can remember the formula for the slope.

The normal equations say two things about the residuals from a least squares regression. With ŷ = b0 + b1 x, the normal equations are
\[ \sum_{i=1}^{n} (y_i - \hat{y}_i) = 0 \qquad \sum_{i=1}^{n} (y_i - \hat{y}_i)\, x_i = 0 \]
Written in this way, the normal equations tell you that
1. The mean of the residuals is zero. The deviation y - ŷ is the residual. Since the sum of the residuals is zero, the average residual is zero as well.
2. The residuals are uncorrelated with the explanatory variable. The second equation is the covariance between x and the residuals. Because the covariance is zero, so is the correlation.

Software Hints
All computer packages that fit a regression summarize the fitted model with several tables. These tables may be laid out differently from one package to another, but all contain essentially the same information, and all include many more numbers than we need to use at this point. The output for regression from software includes a section that looks something like this.

Summary of Fit: Response = Price
  RSquare                   0.434303
  Root Mean Square Error    168.634
  Mean of Response          1146.653
  Observations              320

Parameter Estimates
  Term              Estimate     Std Error   t Ratio   Prob>|t|
  Intercept         43.419237    71.23376    0.61      0.5426
  Weight (carats)   2669.8544    170.8713    15.62     <.0001

The slope and intercept coefficients are given in the second table in a column labeled “Estimate.” Usually the slope is labeled with the name of the x-variable, and the intercept is labeled “Intercept” or “Constant.” In the printed output, these values are circled. The regression equation in this table is

Estimated Price = 43.419237 + 2669.8544 Weight

Statistics packages show more digits for the estimated slope and intercept than you need for interpretation. Ordinarily, you should round the reported numbers. We will learn about the other numbers in the regression table in the coming chapters. For now, you need to be able to find the coefficients b0 and b1 and to locate se and r².
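Reading the coefficients off the software output and rounding them can be sketched in a few lines. The estimates below are copied from the table above; the function name estimated_price, the choice of rounding (nearest dollar for the intercept, nearest 10 dollars per carat for the slope), and the 0.4-carat weight are assumptions made for illustration, with the weight assumed to lie within the range of the observed diamonds.

```python
# Estimates copied from the "Estimate" column of the regression table.
INTERCEPT = 43.419237   # b0, in dollars
SLOPE = 2669.8544       # b1, in dollars per carat

def estimated_price(weight_carats, b0=INTERCEPT, b1=SLOPE):
    """Fitted value from the regression equation: b0 + b1 * weight."""
    return b0 + b1 * weight_carats

# Rounded coefficients for interpretation (the rounding rule is an assumption).
b0_rounded = round(INTERCEPT)       # nearest dollar
b1_rounded = round(SLOPE, -1)       # nearest 10 dollars per carat

# Hypothetical weight, assumed to be inside the observed range.
weight = 0.4
full = estimated_price(weight)
rounded = estimated_price(weight, b0_rounded, b1_rounded)

print(f"Estimated Price = {b0_rounded} + {b1_rounded:.0f} Weight")
print(f"0.4 carat: {full:.2f} (full digits), {rounded:.2f} (rounded)")
```

The two fitted values differ by well under a dollar, which is tiny next to the root mean square error of about $169 reported in the same output. That is why rounding the coefficients loses nothing for interpretation.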
Excel
The right way to do regression with Excel starts with a picture of the data: the scatterplot of Y on X. We talked about how to scatterplot data back in Chapter 6. In the chart wizard, pick the icon that looks like a scatterplot, identify your columns, and add the plot to your spreadsheet. After you’ve done that, select the scatterplot and follow the menu commands Chart > Add Trendline… Pick the option for adding a line, and Excel will add the least squares line to the plot. If you double-click the line in the chart, Excel will show the equation and r² for the model. These options do not include se.

You can also use formulas in Excel to find the least squares regression. The formula LINEST does most of the work, but it’s easier to load and use the Analysis ToolPak that comes with Excel. Next, follow the menu commands Tools > Data Analysis… and fill in the dialog with the range for Y and X. (If you don’t see Data Analysis… in the Tools menu, there was a problem loading this add-in. Check your installation.) By default, Excel fills another sheet in the open workbook with a summary of the regression. The summary includes various plots if you select that option.

Minitab
The procedure resembles constructing a scatterplot of the data. Follow the menu sequence Graph > Scatterplot… and choose the icon that adds a line to the plot. After you pick variables for the response and explanatory variable, click OK. Minitab then draws the scatterplot with the least squares line added. The tabular summary of the fitted model appears in the scrolling window that records results of calculations in your current session.

JMP
Following the menu sequence Analyze > Fit Y by X opens the dialog that we used to construct a scatterplot in Chapter 6. Fill in variables for the response and explanatory variable, then click OK. In the window that shows the scatterplot, click on the red triangle above the scatterplot (near the words “Bivariate Fit of …”). In the pop-up menu, choose the item Fit Line. JMP adds the least squares line to the plot and appends a tabular summary of the model to the output window below the scatterplot.