# 21 The Simple Regression Model

4/21/08

**Contents**

- The Simple Regression Model
  - Linear Conditional Mean
  - Deviations from the Mean
  - Data Generating Process
- Conditions for the SRM
  - Modeling Process
- Inference in Regression
  - Standard Errors
  - Confidence Intervals
  - Hypothesis Tests
  - Interpreting Tests of the Slope
- Prediction Intervals
  - Reliability of Prediction Intervals
- Summary


The Capital Asset Pricing Model (CAPM) describes the relationship between returns on a speculative asset and returns on the stock market. According to this theory, the market rewards investors for taking unavoidable risks collectively known as market risk. Since investors cannot escape market risk, the market pays investors who are willing to take these risks on. Other risks, called idiosyncratic risk, are avoidable, and the CAPM promises no compensation for them. For instance, if you buy stock in a company that is pursuing an uncertain biotechnology strategy, the market will not compensate you for those unique risks.

We can formulate the CAPM as a simple regression. The response Rt is the percentage change in the value of an asset. The explanatory variable Mt is the contemporaneous percentage change in the stock market as a whole. The intercept in this regression is called alpha; the slope is called beta. Written using Greek letters for these terms, the equation associated with the CAPM is

Rt = α + β Mt + εt

The added term εt represents the effect of everything else on Rt. According to the CAPM, the mean of εt is zero and α = 0: on average, if the return on the market is zero, we expect the return on the stock to be zero as well. We can use regression analysis to test this theory. The scatterplot in Figure 21-1 shows monthly percentage changes in the price of stock in Berkshire-Hathaway, the company managed by the famous investor Warren Buffett.

Estimated % Change Berkshire-Hathaway = 1.32 + 0.737 % Change Market

Figure 21-1. Estimating the CAPM for Berkshire-Hathaway.

The graph plots percentage changes in Berkshire-Hathaway versus percentage changes in the value of the entire stock market. The data span October 1976 (the earliest public trading in Berkshire-Hathaway) through December 2007. The red line is the least squares regression line. Is the estimated intercept b0 = 1.32 large enough to suggest that α ≠ 0? Did Warren Buffett "beat the market" by a statistically significant amount? To answer this question, we need tools for inference: standard errors and either confidence intervals or hypothesis tests. We develop these methods for regression analysis in this chapter.
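To make the estimation concrete, here is a small sketch that fits the CAPM equation by least squares. The returns below are simulated for illustration (with parameters loosely matching the chapter's estimates); they are not the actual Berkshire-Hathaway data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated monthly percentage changes (illustrative only -- not the
# actual Berkshire-Hathaway data analyzed in the chapter).
n = 375                                   # roughly Oct 1976 through Dec 2007
market = rng.normal(1.0, 4.0, size=n)     # % change in the market
stock = 1.32 + 0.737 * market + rng.normal(0.0, 6.0, size=n)

# Least squares fit of R_t = alpha + beta * M_t + eps_t.
# np.polyfit returns the slope first for a degree-1 fit.
beta, alpha = np.polyfit(market, stock, 1)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```

With this many months of data, the fitted alpha and beta land close to the values used to simulate the returns, which is the point of the inference tools developed below: quantifying how close.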

## The Simple Regression Model

The simple regression model (SRM) combines an equation that relates two numerical variables with a description of the remaining variation. The SRM describes a population, not a sample from the population. Because we're describing the population, we use Greek letters and random variables.

### Linear Conditional Mean

The equation in the simple regression model specifies how the explanatory variable is related to the mean of the response. The average value of Y for each x is the *conditional mean* of Y given X: the average of one variable given that another variable takes on a specific value. The equation of the SRM states that averages of the response fall on a line:

µy|x = E(Y|X = x) = β0 + β1 x

Read the notation µy|x = E(Y|X = x) as the expected value, or average, of Y given that the explanatory variable has value x. For instance, E(Y|X = 5) is the average percentage change in Berkshire-Hathaway during months in which the market increases 5%. According to this model, the conditional means fall on a line with intercept β0 and slope β1. (Finance traditionally denotes β0 by α and β1 by β in the CAPM.)

This equation may give the impression that the simple regression model only describes linear patterns, but that's not true. The variables X and Y may involve transformations such as logs or reciprocals, as in Chapter 20. The SRM assumes that you have chosen variables that have a linear relationship.

### Deviations from the Mean

Deviations that separate observations from the conditional averages µy|x are called *errors*:

ε = y − µy|x

These aren't mistakes, just random variation around the conditional means. The usual symbol for the error is the Greek letter ε (epsilon), a reminder that we do not observe the errors: the errors are deviations from the population conditional mean, not an average in the data. Errors can be positive or negative, depending on whether data lie above the line (positive) or below the line (negative). Because µy|x is the conditional mean of Y in the population, the expected value of an error is zero: E(ε) = 0. On average, the deviation from the line is zero.

Because the errors are not observed, the SRM makes several assumptions about them:

1. Independence. The error for one observation is independent of the error for any other observation.
2. Equal variance. The errors have equal variance σε².
3. Normal. The errors are normally distributed.

### Data Generating Process

A statistical model describes an idealized sequence of steps that produce the data we observe. As an illustration, let y denote monthly sales of a company and x denote its spending on advertising (both in thousands of dollars). Suppose that if the company spends x thousand dollars on advertising, then the expected level of sales is

µy|x = 500 + 2x    (β0 = 500, β1 = 2)

Without advertising, sales average $500,000, and every dollar spent on advertising increases expected sales by $2. To specify the SRM, we need to choose values for β0, β1, and σε; we further set σε = 45.

The data generating process defined by this model begins by allowing the company to choose a value for the explanatory variable; the company is free to decide how much it wants to advertise. The SRM does not specify how this is done. Suppose the company spends x1 = 150 (thousand dollars) on advertising in the first month. The expected level of sales given this advertising is

µy|150 = β0 + β1(150) = 500 + 2(150) = $800 thousand

According to the SRM, all of the other factors that affect sales combine to produce a deviation from µy|150 that looks like a random choice from a normal distribution with mean 0 and SD σε, that is, ε ~ N(0, σε²). Let's denote the first error as ε1 and imagine that ε1 = −20. Sales during the first month are then

y1 = µy|150 + ε1 = 800 + (−20) = $780 thousand

That's the dot in Figure 21-2; it lies below the line because ε1 < 0. If the company only wants to estimate µy|150, here's what it could do: spend $150,000 for advertising month after month. The average level of sales would eventually settle down on µy|150 = $800 thousand. The company would eventually know µy|150, but it would not learn β0 or β1 this way.

Figure 21-2. The simple regression model assumes a normal distribution at each x: a normal distribution with mean 500 + 2x and standard deviation 45 at each value of x.

Let's follow the data generating process for a second month. Because the errors are independent of one another, we ignore ε1 and independently draw a second error from the same normal distribution, say ε2 = 50. In the next month, the company spends x2 = 100 (thousand dollars) on advertising. The same line determines the expected response; the equation for µy|x remains the same: expected sales are µy|100 = β0 + β1(100) = 700. Sales in the second month are then

y2 = µy|100 + ε2 = 700 + 50 = $750 thousand

This data point lies above the line because ε2 is positive. This process repeats each month: the company sets the amount of advertising, and the data generating process defined by the SRM determines sales. To reveal β0 or β1, the company must vary the explanatory variable x.

**Simple Regression Model.** The observed response y is linearly related to the explanatory variable x by the equation

y = β0 + β1 x + ε,  ε ~ N(0, σε²)

The observations

1. are independent of one another,
2. have equal variance σε² around the regression line, and
3. are normally distributed around the regression line.

Figure 21-3. If the SRM holds, then the company observes data like these. The observed data do not show the population regression line.

Recognize that the simple regression model presents a simplified view of reality. The SRM is a model, not the real thing. The closer data conform to this ideal scenario, the more reliable inferences become. Nonetheless, the SRM is often a reasonable description of data.
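The two months above can be simulated directly. The sketch below implements the chapter's data generating process with β0 = 500, β1 = 2, and σε = 45 (the errors here are drawn at random rather than fixed at −20 and 50):

```python
import numpy as np

rng = np.random.default_rng(1)

# Parameters of the illustrative SRM from the text (thousands of dollars)
beta0, beta1, sigma = 500.0, 2.0, 45.0

def one_month(x, rng):
    """Generate one month of sales for advertising spend x (thousands)."""
    mu = beta0 + beta1 * x            # conditional mean mu_{y|x}
    eps = rng.normal(0.0, sigma)      # error drawn from N(0, sigma^2)
    return mu + eps

# Spending x = 150 month after month: the average of the observed sales
# settles down near mu_{y|150} = 800, as the text describes.
sales = [one_month(150, rng) for _ in range(10_000)]
print(round(float(np.mean(sales))))   # close to 800
```

Note that repeating a single x pins down only the one conditional mean; recovering β0 and β1 separately requires varying x, just as the text says.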

**Are You There?** Which scatterplots show data that are consistent with the data generating process defined by the simple regression model?¹

[Four scatterplots, labeled (a) through (d).]

## Conditions for the SRM

We never know for certain whether the SRM describes the population. All we observe are a sample and the fitted least squares regression line. The best we can do is to check several conditions.

**Checklist for the Simple Regression Model**

- Is the pattern in the scatterplot of y on x straight-enough?
- Have we ruled out embarrassing lurking variables?
- Are the errors evidently independent?
- Are the variances of the residuals similar?
- Are the residuals nearly normal?

If the answer to each question is "yes," then the data match the SRM well enough to move on to inference. Let's check these conditions for the regression of Berkshire-Hathaway on the market (Figure 21-1).

**Straight-enough.** The scatterplot of y on x in Figure 21-1 seems straight enough. We should confirm this by plotting the residuals versus the explanatory variable. We checked these conditions in Chapter 19, but we can now refine what to look for in the residuals. Instead of checking for no pattern (that the residuals are "simple enough"), we have three specific conditions.

¹ All but (d). In (a) the data track along a line with negative slope. In (b) there's little evidence of a line, but that just means the slope is near zero; the association is not strong (r² = 0.24) but seems linear. (c) is an ideal example of a linear pattern with little error variation. (d) fails because the error variation appears to grow with the mean.

In general, the SRM makes no assumption about the distribution of the explanatory variable.

**No embarrassing lurking variables.** According to the CAPM, there aren't any lurking variables. Some in finance question the CAPM for just this reason, claiming that other variables predict stock performance.

**Similar variances.** To check this condition, start with the scatterplot of the residuals on x (Figure 21-4). This plot should look like a random swarm of bees, buzzing above and below the horizontal line at zero. Be alert for a fan-shaped pattern or a tendency for the variability to grow or shrink. There's no evident pattern in this case: the spread appears constant around the horizontal line at zero. The data are shifted to the right of the plot to accommodate the outliers at the left (months with large declines in the market: October 1987 and August 1998). These outliers do not indicate a problem.

Figure 21-4. Residuals from the regression of percentage changes.

**Evidently independent.** If the data are time series, we can check this assumption by looking at sequence plots of the residuals. With the fitted line removed, it is easier to see changes in variation. The timeplot in Figure 21-5 shows that the residuals vary around zero consistently over time, with no drifts. In this case, no plot shows dependence among the observations, with one exception noted below. These appear independent.

Figure 21-5. Timeplot of residuals from the CAPM regression for Berkshire-Hathaway.

The timeplot of the residuals, however, does show periods in which the residuals become more and less variable; the residuals seem to have lower variance after 2001. The effect is subtle, and gradual changes in variation will not cause us problems, but this aspect of time series is an important area of research in quantitative finance.

**Nearly normal.** To check this condition, we check that the residuals are nearly normal (Chapter 12). We don't observe the errors, so we substitute the residuals in their place. The check is done as before: inspect a histogram and normal quantile plot. Figure 21-6 summarizes the residuals from the regression of percentage changes in Berkshire-Hathaway on the market. The residuals track the diagonal reference line except near the arrow; at this location, they drift too far from the reference line. The distribution of the residuals also has a long right tail. In this example, the residuals are not nearly normal.

Figure 21-6. Histogram and normal quantile plot of residuals.

A normal model is often a good description for the unexplained variation because the errors represent the net effect of all other variables on the response, added together. Since sums of random effects tend to be normally distributed, a normal model is a start. It's only a start, however.

Fortunately, inferences about β0 and β1 work well even if the data are not normally distributed. As when estimating µ using the sample average, the justification comes from the Central Limit Theorem: confidence intervals for the slope and intercept are reliable even if the errors are not normally distributed. If the residuals are not normally distributed (as in this example), check the CLT condition for the residuals (Chapter 15): the sample size should be larger than 10 times the larger of the skewness K3 and kurtosis K4 of the residuals. Otherwise, inference is unreliable. In this example, K3 = 0.8 and K4 = 2.5, so we can rely on the CLT for making inferences about β0 and β1. (This issue is very important for prediction, however.)
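The CLT rule of thumb is easy to compute. This sketch builds simulated residuals with a long right tail (not the chapter's CAPM residuals) and applies the check, using the usual moment-based sample skewness and excess kurtosis:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated residuals with a long right tail (illustrative only).
r = rng.normal(0.0, 1.0, 300) + rng.exponential(1.0, 300)
r = r - r.mean()                   # residuals always average zero

s = r.std()
k3 = np.mean(r**3) / s**3          # sample skewness K3
k4 = np.mean(r**4) / s**4 - 3      # sample excess kurtosis K4
n = len(r)

# Rule of thumb from the text: n should exceed 10 times the larger
# of the skewness and kurtosis (in absolute value).
ok_for_clt = n > 10 * max(abs(k3), abs(k4))
print(f"K3 = {k3:.2f}, K4 = {k4:.2f}, n = {n}, CLT condition met: {ok_for_clt}")
```

For the chapter's data (K3 = 0.8, K4 = 2.5, n well above 25), the same arithmetic gives the conclusion stated in the text.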

### Modeling Process

Before you look at plots, think about two questions: (a) Does a linear relationship make sense? (b) Is the relationship free of major lurking variables? If either answer is "no," then find a remedy for the problem. You may need to transform a variable or find better data. If you think both answers are "yes," then follow this outline:

- Plot y versus x and verify that the association is straight enough.
- If the pattern is straight enough, fit the least squares regression line and obtain the residuals.
- Plot the residuals versus the explanatory variable. This plot should have no pattern. Curvature suggests that the pattern wasn't straight enough after all, and any "thickening" indicates different variances in the errors. Note the presence of outliers as well.
- If the data are measured over time, plot the residuals against time to check for dependence.
- Inspect the histogram and normal quantile plot of the residuals to check the nearly normal condition. If the residuals are not nearly normal, check the skewness and kurtosis.

It's a good idea to proceed in this order. Don't worry about properties of residuals until you get an equation that captures the pattern in the scatterplot. If you skip the initial check for straight enough, you are likely to find something unusual in the normal quantile plot of the residuals. You might conclude, "Ah-ha, the errors are not normally distributed." You'd be right, but for the wrong reason. Detecting a problem at this late stage offers little advice for how to fix it; unless you identify the source of the problem, you're not going to know what to do next. We'll have more to say about fixing problems in Chapter 22.

## Inference in Regression

Once our model passes this gauntlet of checks, we are ready for inference. Three parameters identify each instance of the simple regression model: β0, β1, and σε². We estimate these using the least squares regression line:

- b0 estimates the intercept β0,
- b1 estimates the slope β1,
- ŷ estimates the conditional mean µy|x,
- the residual e estimates the error ε, and
- se estimates σε = SD(ε).

Inference for these parameters proceeds as when testing or building confidence intervals for µ (Chapters 15-18). The estimated standard error is our ruler: we reject a null hypothesis if an estimate is too many standard errors from the hypothesized value, and a confidence interval consists of values within a certain number of standard errors of the estimate. Student's t-distribution determines how many standard errors are necessary for statistical significance.

### Standard Errors

Standard errors describe the sample-to-sample variability of b0 and b1. Each time we draw a sample from the population and fit a regression, we get different estimates. How different? That's the job of the standard error: to estimate the sample-to-sample variation. If the standard error of b1 is small, then not only are estimates from different samples similar, but they are also close to β1.

Recall the estimated standard error of the average of a sample of n observations y1, y2, ..., yn:

se(ȳ) = sy/√n,  where sy² = [(y1 − ȳ)² + (y2 − ȳ)² + ⋯ + (yn − ȳ)²]/(n − 1)

This formula uses the sample standard deviation sy in place of σy. The standard error of the slope is similar. (The exact formula is at the end of this chapter.)

se(b1) ≈ (se/√n) × 1/SD(X),  where se² = (e1² + e2² + ⋯ + en²)/(n − 2)

Three aspects of the data determine the standard error of the slope:

- the standard deviation of the residuals,
- the sample size, and
- the standard deviation of the explanatory variable.

The residual standard deviation, se, sits on top in the numerator, since more variation around the line increases the standard error: the more variable the data are around the regression line, the less precise the estimate of the slope. The sample size is in the denominator; larger samples decrease the standard error, so the larger the sample, the more precise the estimate of the slope becomes. To see why the standard deviation of the explanatory variable affects the standard error of the slope, consider the scatterplots that follow.

As you may suspect, software handles the details of calculating standard errors of the least squares estimators. Packages routinely summarize the results of a least squares regression in a table. Each row of Table 21-1 summarizes the estimate of a parameter in the CAPM regression for Berkshire-Hathaway: numbers in the row labeled "Intercept" describe the estimated intercept b0, and those in the next row describe the estimated slope b1.

| Term | Estimate | Std Error | t Statistic | p-value |
|------|----------|-----------|-------------|---------|
| Intercept (b0) | 1.321082 | 0.336367 | 3.93 | 0.0001 |
| Market (b1) | 0.736763 | 0.067973 | 10.84 | <.0001 |

Table 21-1. Regression coefficient estimates for the CAPM regression.
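The approximate slope formula is simple enough to compute by hand. This sketch simulates data from a known SRM (illustrative parameters, not the CAPM data) and applies the formula from the text:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data from a known SRM (illustrative values).
n = 200
x = rng.uniform(0, 10, n)
y = 3.0 + 1.5 * x + rng.normal(0.0, 2.0, n)

# Least squares estimates of the slope and intercept.
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()

# Residual SD and the standard error of the slope, as in the text:
#   se(b1) ~ (s_e / sqrt(n)) * (1 / SD(X))
e = y - (b0 + b1 * x)
s_e = np.sqrt((e ** 2).sum() / (n - 2))
se_b1 = (s_e / np.sqrt(n)) / x.std()
print(f"b1 = {b1:.3f}, se(b1) = {se_b1:.3f}")
```

Doubling the spread of x (or quadrupling n) halves se(b1), which is exactly the role the denominator plays in the formula.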

Figure 21-7. Which data tell you more about the slope? Each scatterplot shows a sample of 25 observations from the same population; the only difference is that the points on the left are spread out along the x-axis.

Since the data on the right are packed closely together, many different lines represent them nearly equally well: there's more variability among the plausible slopes, so these data provide less information about β1. The sample that is spread out along the x-axis produces a more accurate estimate of β1.

Figure 21-8. More variation in x leads to a better estimate of the slope. The sketched lines represent the data equally well.

### Confidence Intervals

If the errors are nearly normally distributed or satisfy the CLT condition, then the sampling distribution of b1 is approximately normal, centered on β1 with standard deviation estimated by se(b1). Since we substitute se for σε to calculate the standard error, we use a t-distribution: the sampling distribution of the ratio

t = (b1 − β1)/se(b1)

is Student's t with n − 2 degrees of freedom. Hence, the 95% confidence interval for β1 (beta for Berkshire-Hathaway) is

b1 ± t0.025,n−2 × se(b1) = 0.736763 ± 1.97 × 0.067973 ≈ [0.603, 0.871]

Because 1 lies outside the confidence interval, we conclude that β for Berkshire-Hathaway is statistically significantly less than 1. Returns on Berkshire-Hathaway attenuate returns on the market, reducing the volatility of the value of this asset.

The formula for the standard error of the intercept is similar and shown at the end of the chapter. For the intercept, the 95% confidence interval is

b0 ± t0.025,n−2 × se(b0) = 1.321082 ± 1.97 × 0.315335 ≈ [0.700, 1.942]

Since zero lies well outside the 95% confidence interval, α for Berkshire-Hathaway is statistically significantly larger than zero. Buffett's stock has averaged higher returns than predicted by the CAPM: he beat the market.

### Hypothesis Tests

Output from software that fits a least squares regression usually provides several redundant ways to test the hypotheses H0: β0 = 0 and H0: β1 = 0. Each test compares the estimate to zero. You can rely on confidence intervals and avoid hypothesis tests, but the output for regression summarizes tests of both hypotheses in the columns labeled "t Statistic" and "p-value" in Table 21-1.

Each t-statistic in Table 21-1 is the ratio of an estimate to its standard error; each counts the number of standard errors that separate the estimate from zero. For the intercept, the t-statistic is t = b0/se(b0) = 3.93: the estimated intercept lies about 3.93 standard errors above zero. The accompanying p-value converts the t-statistic into a probability. As with tests of the mean, these are "two-sided" tests: positive and negative deviations from 0 are evidence against H0, so the p-value is small whether b0, for instance, is far below or far above zero. In this example, the p-value = 0.0001, far less than the common threshold 0.05. We can reject H0: β0 = 0 if we accept a 5% chance of a Type I error. The test agrees with the confidence interval: the test rejects H0: β0 = 0, and 0 lies outside the 95% confidence interval.

The t-statistic and p-value in the next row of output test H0: β1 = 0. The reason that software automatically tests this null hypothesis is simple: if it is true, there is no linear association between y and x. The distribution of y would be the same regardless of the value of the explanatory variable x; we'd expect to find the same mean value for the response regardless of x, and in a graph the line is flat. For the CAPM regression, that would mean returns on Berkshire-Hathaway were unrelated to returns on the market. The t-statistic tells us that b1 is t = b1/se(b1) = 10.84 standard errors above zero. That's unlikely to happen by chance if the null hypothesis H0: β1 = 0 is true, so the p-value is tiny (less than 0.0001). In plain language, the t-statistic tells us that β for Berkshire-Hathaway is not zero. That's not a surprise: few financial experts would expect β for any stock to be zero.

These are equivalent inferences. Might a parameter in the population be zero? Not if:

1. the t-statistic is larger than 2 in absolute size,
2. the p-value is less than 0.05, or
3. zero lies outside the 95% confidence interval.

### Interpreting Tests of the Slope

Special language describes a regression model that rejects H0: β1 = 0. You'll sometimes hear that "the model explains statistically significant variation in the response" or that "the slope is significantly different from zero." These expressions mean that the 95% confidence interval for β1 does not include zero.
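The interval and test arithmetic above is just a few multiplications. This sketch reproduces it from the slope row of Table 21-1, using 1.97 as the t percentile, the value the text uses for n − 2 degrees of freedom:

```python
# Slope estimate and standard error from Table 21-1 (CAPM regression).
b1, se_b1 = 0.736763, 0.067973
t_crit = 1.97                     # t_{0.025, n-2} as used in the text

# 95% confidence interval for beta.
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(f"95% CI for beta: [{lo:.3f}, {hi:.3f}]")   # [0.603, 0.871]

# t-statistics for two hypotheses about beta.
t_zero = b1 / se_b1               # H0: beta = 0 -> about 10.8, reject
t_one = (b1 - 1) / se_b1          # H0: beta = 1 -> about -3.9, reject
print(f"t for beta=0: {t_zero:.2f}, t for beta=1: {t_one:.2f}")
```

Note that the same standard error answers both questions: software reports the test of β = 0 by default, but the test of β = 1 (the interesting one for the CAPM) is the same ratio with a different center.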

The first of these phrases comes from the connection between β1 and the correlation between x and y. If β1 = 0, then ρ = 0. By rejecting H0: β1 = 0, we've said that ρ ≠ 0, and hence ρ² > 0: the model explains statistically significant variation in the response.

*tip* Remember that if we do not reject H0: β1 = 0, we have not proven that β1 = 0. All we've said is that β1 might be zero, not that it is zero.

**Are You There?** The following results summarize a model for the relationship between total compensation of CEOs and net sales at 201 finance companies. The response and explanatory variable are measured on a log10 scale. Because both x and y are expressed as logs, the slope is the elasticity of compensation with respect to net sales: the average percentage change in compensation associated with a 1% increase in net sales. (See Chapter 20.)

Figure 21-9. Log of CEO compensation versus log of net sales in the finance industry.

r² = 0.403728, se = 0.388446

| Term | Estimate | Std Error | t Statistic | p-value |
|------|----------|-----------|-------------|---------|
| Intercept (b0) | 1.8653351 | 0.400834 | 4.61 | <.0001 |
| Log10 Net Sales (b1) | 0.5028293 | 0.043318 | 11.61 | <.0001 |

(a) Based on what is shown, check the conditions for the SRM: (1) straight-enough, (2) no lurking variables, (3) evidently independent, (4) similar variances, (5) nearly normal.²
(b) What does it mean that the t-statistic for b1 is bigger than 10?³
(c) Find the 95% confidence interval for the elasticity of compensation (the slope in this model) with respect to net sales.⁴
(d) A CEO claimed to her Board of Directors that the elasticity of compensation with respect to net sales is ½. Does this model agree?⁵

² The relationship seems straight-enough (with logs). The plots don't show a problem with independence, and the variation seems consistent. It's hard to judge normality without a quantile plot, but these residuals probably meet the CLT condition.
³ The estimate b1 is more than 10 standard errors away from zero. Zero will not be in the 95% confidence interval, and hence the slope is statistically significant.
⁴ The confidence interval for the slope is 0.503 ± 1.97 × 0.0433 ≈ [0.42 to 0.59].
⁵ ½ is a plausible value for the elasticity: 0.5 lies inside the confidence interval for β1.
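The check in (d) can be written out numerically. This sketch uses the slope estimate and standard error as read from the regression table above (the exact standard error depends on how the garbled original table is read, so treat the digits as illustrative):

```python
# Slope estimate and standard error from the compensation regression.
b1, se_b1 = 0.5028293, 0.043318
t_crit = 1.97                     # approximate t percentile for n = 201

lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
claim_plausible = lo <= 0.5 <= hi   # is the CEO's elasticity of 1/2 plausible?
print(f"95% CI for the elasticity: [{lo:.3f}, {hi:.3f}]")
print(f"elasticity = 0.5 is plausible: {claim_plausible}")
```

Because 0.5 falls inside the interval, the data do not contradict the CEO's claim, which is the logic of question (d).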

61 170. due presumably to other factors such as varying color and clarity.5426 <. The same formula gives a prediction. we could not predict the exact price of any diamond because of the error variation around the conditional mean.346437 ≈ \$1.000 to \$1. say. Let’s begin with the prediction itself. What range in prices can you anticipate? This plot shows the estimated least squares fit for the diamonds. r2 se Term Intercept Weight (carats) Estimate 43.5 = 1378. Summary of the estimated SRM for the diamonds in Figure 21-10.5 Price (\$. A range that quantifies the accuracy of a prediction has to account for this remaining variation. 1800 1700 1600 1500 1400 1300 1200 1100 1000 900 800 700 . That’s a wide interval.419237 +2669.62 p-value 0.23376 0. Let’s go back to the emerald diamonds considered in Chapter 19.0001 b0 b1 Table 21-2. The observed prices of ½ carat diamonds range from about \$1. the predicted price for a ½ carat diamond is ˆ = b0 + b1(½) = 43.8544 * 0. According to the SRM.419237 2669.3 .8544 0. using Credit Card) Weight (carats) Figure 21-10. = b0 + b1 x.4 . The following tables summarize the fit shown in Figure 21-10. Why is he an outlier?6 Prediction Intervals Regression is often used for predicting the response.378 y € 6 Buffett pays himself a small salary.8713 15. you’d like to know what you should expect to pay. If you are interested in buying a ½ carat diamond. Even if we knew β0 and β1 and hence µy|x. Plugging into the equation. 21-14 .634 Std Error t-statistic 71.4/21/08 21 The SRM (e) The outlier marked with an x at the right in the plots in Figure 21-9 is Warren Buffett. the price of each diamond is y = µy|x + ε.700.434303 168. he makes his money in stock invested in his firm. Prices and weights of emerald cut diamonds. We know how to compute fitted values of y for any value of x.

How accurate is $1,378 as a prediction of a randomly chosen ½ carat diamond? There's no reason to think that we should be able to predict the price of another ½ carat diamond any better than this line fits these diamonds. According to the SRM, the error variation around µy|x is normally distributed, so prices of 95% of diamonds fall within 1.96 standard deviations of the conditional mean µy|x for any weight. If the SRM holds,

P(µy|x − 1.96 σε ≤ ynew ≤ µy|x + 1.96 σε) = 0.95

We don't know µy|x or σε, but we have estimates of the unknown parameters. If we use ŷ and se in their place, then

P(ŷ − t0.025,n−2 se(ŷ) ≤ ynew ≤ ŷ + t0.025,n−2 se(ŷ)) ≈ 0.95

The standard error of the prediction is tedious to calculate, but as long as we are not extrapolating beyond observed conditions, we can use the handy approximations se(ŷ) ≈ se and t0.025,n−2 ≈ 2. Thus, the range [ŷ − 2se, ŷ + 2se] is an *approximate 95% prediction interval*. For the cost of a ½ carat diamond, this is

ŷ ± 2(168.634) = 1378.346 ± 337.268 ≈ [$1041, $1716]

We shouldn't be surprised if the price of the diamond were $1,200 or $1,600. This range is a *prediction interval* rather than a confidence interval because we're making a statement about the price of a single diamond. A confidence interval is a statement about a population parameter; a prediction interval is a statement about a specific, as yet unknown, observation. We're not guessing the value of a mythical parameter; we're predicting a future observation.

The prediction interval provides another way to think about se: the standard deviation of the residuals tells you how well you can predict new observations. If the data are nearly normal, then about 95% of the data lie within 2se of the fitted line, and the SD of the residuals, se = $168.63, gives us a good idea of the accuracy. We can think about the endpoints of this interval like this: if we arrive at the jeweler's with $1,716, then we have enough to buy all but the most expensive 2.5% of diamonds that weigh ½ carat. If we have $1,378, we can afford 50% of diamonds that weigh ½ carat; the rest would be too expensive.

### Reliability of Prediction Intervals

Prediction intervals are reliable within the range of observed data. Predictions that extrapolate beyond the data rely heavily on the assumed truth of the model. Often, a relationship that is linear over the range of the observed data changes when extrapolated. The next scatterplot shows prices of a larger collection of diamonds, including many that weigh more than ½ carat. The fitted line is the line used in Chapter 19 to describe smaller diamonds. It fits well for diamonds below ½ carat, but typically underpredicts prices for larger gems.
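The approximate prediction interval above can be checked numerically from the quantities in Table 21-2:

```python
# Fit from Table 21-2: predicted price = b0 + b1 * weight, with
# residual SD s_e, for emerald cut diamonds.
b0, b1 = 43.419237, 2669.8544
s_e = 168.634                     # residual SD from the regression

weight = 0.5                      # a half-carat diamond
y_hat = b0 + b1 * weight          # point prediction, about $1378

# Approximate 95% prediction interval: y_hat +/- 2 * s_e.
lo, hi = y_hat - 2 * s_e, y_hat + 2 * s_e
print(f"predicted price ${y_hat:.0f}, interval [${lo:.0f}, ${hi:.0f}]")
```

Remember the caveat that motivates this shortcut: se(ŷ) ≈ se only when the prediction is not an extrapolation, so the same arithmetic should not be trusted for, say, a 2 carat diamond far outside the observed weights.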

Figure 21-11. Prices rise faster than the model fit to small diamonds anticipates.

This fan-shaped pattern of increasing variation is a serious issue when using the regression model for prediction. We will deal with this issue in Chapter 22.

When the prediction is not an extrapolation, the reliability of a prediction interval depends on two assumptions: equal variance and normality. If the variance of the errors around the regression line increases with the size of the prediction (as in Figure 21-11), then the prediction interval will be too wide for small items and too narrow for large items. Similarly, if the errors are not normally distributed, the t-percentiles used to set the endpoints of the interval can be off target. Prediction intervals depend on the normality of each observation, not normality produced by averaging.

Example 21.1 Locating a Franchise Outlet

Motivation (state the question)

A common saying about real estate is that three things determine the value of a property: "location, location, location." The same goes for commercial properties. Consider where to locate a gasoline station. Have you noticed that prices seem higher at stations located near busy interstate highways? With so many cars passing by, a station can raise prices. Even if the high price deters some drivers, more than enough stop in to keep the business profitable. How does traffic volume affect sales? We will compare two sites. One is located on a highway that averages 40,000 "drive-bys" a day, and another gets 32,000. How much more gasoline can we expect to sell at the busier location?

Method (describe the data and select an approach)

Identify x and y. We will use regression, with y equal to the sales of gasoline per day (thousands of gallons) and x given by the average daily traffic volume (in thousands of cars).

Link b0 and b1 to the problem. The slope measures the sales per passing car. The intercept sets a baseline of gasoline sales that occur regardless of traffic intensity (probably due to local customers).

Describe data. Both averages were computed during a recent month at 80 franchise outlets in similar communities. All charge roughly the same price.

Check the straight-enough condition.

Mechanics (do the analysis)

These tables summarize the fit of the least squares regression equation.

r2    0.548596
se    1.505407
n     80

Term                      Estimate     Std Error    t Stat    p-value
Intercept b0              -1.338097    0.945844     -1.41     0.1611
Traffic Volume (000) b1    0.236729    0.024314      9.74     <.0001

A 95% confidence interval for 8,000 times the estimated slope will indicate how much more gasoline to expect to sell at the busier location. First, check the conditions for the SRM:

✓ Straight enough. The scatterplot suggests that the relationship is linear. This is confirmed in the plot of the residuals on the explanatory variable.

✓ No lurking variable. These stations are in similar areas with comparable prices. We should also check that they face similar competition.

✓ Evidently independent. While not random samples, nothing in the plots of the data suggests a problem with the assumption of independence.

✓ Similar variances. We expected more variation at the busier stations, but since these data are averages, that effect is not evident.

✓ Nearly normal. The histogram of the residuals is reasonably bell-shaped with no large outliers, and the points in the normal quantile plot stay near the diagonal.
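The least squares computations behind a summary table like this can be sketched in a few lines. The 80 stations' data are not reproduced in the text, so this hypothetical illustration generates synthetic data with roughly similar parameters; its printed estimates will be close to, but not identical to, the table above.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in for the 80-station data (the actual data are not
# listed in the text): x = traffic volume (000s of cars), y = daily
# sales (000s of gallons), with parameters near the fitted model.
n = 80
x = rng.uniform(20, 45, size=n)
y = -1.34 + 0.237 * x + rng.normal(0, 1.5, size=n)

# Least squares estimates and the summary statistics shown in the table
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))      # SD of the residuals
se_b1 = s_e / np.sqrt((n - 1) * x.var(ddof=1))   # standard error of the slope
t_b1 = b1 / se_b1                                # t-statistic for the slope
print(f"b0={b0:.3f}  b1={b1:.3f}  se={s_e:.3f}  se(b1)={se_b1:.4f}  t={t_b1:.2f}")
```

Statistical software performs exactly these calculations; the point of the sketch is that every entry in the table follows from the formulas of this chapter.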

Since the conditions for the SRM are met, we can move on to inference. The 95% confidence interval for β1 is

b1 ± t0.025,78 se(b1) = 0.236729 ± 1.99 × 0.024314 ≈ [0.188 to 0.285 gallons per passing car]

Hence, a difference of 8,000 in daily traffic volume is associated with a difference in average daily sales of 8000 × [0.188 to 0.285 gallons per car] ≈ 1,507 to 2,281 more gallons per day.

We can also use this equation for predictions. Since the residuals in this example are nearly normal, we can use the fit of this model to build prediction intervals. For example, we predict sales at a site with 40,000 drive-bys to be

ŷ = -1.338097 + 0.236729 × 40 ≈ 8.131 thousand gallons per day

The approximate 95% prediction interval for such a location is

ŷ ± 2 se = 8.131 ± 2 × 1.5054 ≈ [5.12 to 11.14] thousand gallons per day

The predicted sales for a location with 32,000 drive-bys is

ŷ = -1.338097 + 0.236729 × 32 ≈ 6.237 thousand gallons per day

The approximate prediction interval is 6.237 ± 2 × 1.5054 = [3.23 to 9.25] thousand gallons per day.

These prediction intervals overlap to a considerable extent, even though the confidence interval for the difference in sales based on the slope does not include zero. The reason for the overlap is that the prediction intervals contrast sales at two specific locations, whereas the confidence interval compares the average level of sales across all stations at such locations.

Message (summarize the results)

Based on sales at a sample of 80 stations in similar communities, we expect (with 95% confidence) that a station located at a site with 40,000 drive-bys will sell on average from 1,500 to 2,300 more gallons of gasoline daily than a location with 32,000 drive-bys.

It is good to mention possible lurking variables. Before we go further, we should check that the stations in our data face similar levels of competition and verify that all charge comparable prices. If prices differ, then the estimate from this model mixes the effect on sales of increasing traffic (positive effect) with increasing price (negative effect). If, for instance, stations at busy locations charge higher prices, this equation may underestimate the benefit of the busier location.
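The confidence interval for the slope uses the t-percentile for n - 2 = 78 degrees of freedom. A short check of that computation, using the estimates from the summary table (SciPy supplies the exact percentile; this is an illustration, not output from the text's software):

```python
from scipy import stats

# Estimates from the regression summary in the text
b1, se_b1, n = 0.236729, 0.024314, 80

t_crit = stats.t.ppf(0.975, df=n - 2)   # about 1.99 for 78 degrees of freedom
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(f"95% CI for slope: [{lo:.4f}, {hi:.4f}] gallons per passing car")
# matches the text's [0.188 to 0.285]

# Difference in expected daily sales at 40,000 vs 32,000 drive-bys
print(f"8,000 cars x CI: [{8 * lo:.2f}, {8 * hi:.2f}] thousand gallons per day")
```

Scaling both endpoints of the slope's interval by the 8,000-car difference gives the interval for the difference in average sales quoted in the Message.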
Averages are less variable than the individual cases.188 to 0.000 drive-bys.024314 ≈ [0. Before we go further.237 ± 2 × 1.12 to 11.78 se(b1) = 0. We can also use this equation for predictions.281 more gallons per day Message summarize the results Based on sales at a sample of 80 stations in similar communities.000 drive-bys to be ˆ = -1.236729 × 32 ≈ 6.236729 ± 1.

Summary

The simple regression model (SRM) provides an idealized description of the association between two numerical variables. This model has two components. The equation of the SRM describes the association between the explanatory variable x and the response y. This equation states that the conditional mean of Y given X = x is a line, µy|x = β0 + β1 x. The second component of the SRM describes the random variation around this pattern as a sample of independent, normally distributed errors with constant variance. The ε's are the errors around the ideal line and represent all of the other factors that influence the response that are not accounted for by our model.

The simple regression model provides a framework for inferences about the parameters β0 and β1 of the linear equation. Confidence intervals for β0 and β1 are centered on the least squares estimates b0 and b1. The 95% confidence intervals are b0 ± t se(b0) and b1 ± t se(b1). The standard summary of a regression includes a t-statistic and p-value for testing H0: β0 = 0 and H0: β1 = 0. A prediction interval measures the accuracy of predictions of new observations. Provided the SRM holds, the approximate 95% prediction interval for an observation at x is ŷnew ± 2 se.

Key Terms
conditional mean, 21-3
errors, 21-3
prediction interval, 21-15
simple regression model (SRM), 21-3
condition
  evidently independent, 21-7
  similar variances, 21-7
  nearly normal, 21-8

Best Practices

• Verify that your model makes sense, both visually and substantively. If you cannot interpret the slope, then what's the point of fitting a line? If the relationship between x and y isn't linear, there's no sense in summarizing it with a line. If the plot of y on x is not straight enough, then you know that you need some type of transformation. Both x and y may require transformations.

• Consider the effects of other possible explanatory variables. The single explanatory variable in the model may not be the only important influence on the response. If you can think of several others, you may need to use multiple regression (Chapter 23).

• Check the conditions, in the listed order. The farther up the chain you find a problem, the more likely you can fix it. If you find that the residuals are not normal, for instance, the real problem may lie earlier in the list, such as a relationship that is not straight enough.

• Use confidence intervals to express what you know about the slope and intercept. Confidence intervals convey uncertainty and show that we don't know things like the beta of a stock perfectly from data.

• Use rounding to suppress extraneous digits when presenting results. Nothing makes you look more out of touch than saying things like "the cost per carat is $2333.0355." The cost might be $2500 or $2900, and you're worried about $0.0032? Round the values!

• Check the assumption of normality very carefully before using prediction intervals. Prediction intervals rely on the shape of the normal distribution to set a range for the new value. Other inferences work well even if the data are not normally distributed.

• Be careful when extrapolating. It's tempting to think that because we have prediction intervals, they'll take care of all of our uncertainty so we don't have to worry about extrapolating. Wrong: the interval is only as good as the model. Prediction intervals presume that the SRM holds, both within the range of the data and outside.

Pitfalls

• Overreacting to residual plots. If you stare at a residual plot long enough, you'll see a pattern. Even samples from normal distributions have outliers and irregularities every now and then. Use the visual test for simplicity if you're not sure.

• Mistaking lots of data for unequal variances. If the data have more observations at some x's than others, it may seem as though the variance is larger where you have more data. That's because we see the range of the residuals when we look at a scatterplot, not the SD, and the range always grows with more data.

• Believing that r2 and se improve with a larger sample. Both r2 and se reflect the variation of the data around the regression line. More data provide a better estimate of this line, but even if we knew β0 and β1, there'd still be variation. Standard errors get smaller as n grows, but r2 doesn't head to 1 and se to zero.

• Confusing confidence intervals with prediction intervals. The sampling variation of the estimated parameters determines the width of confidence intervals. The SD of the unexplained (residual) variation determines the major component of the width of prediction intervals.

• Confusing a problem with the fit as a problem in the errors. If you model the data using a linear relationship when a curve is needed, you'll get residuals with all sorts of problems. Take a look at this scatterplot: it's clearly not straight enough, but we fit the line anyway and got the residuals shown at the right. The problem in the residuals is a consequence of using the wrong equation for the conditional average value of y.
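The pitfall about r2 and se can be seen in a small simulation. This sketch uses synthetic data (true error SD of 4, an assumption for illustration) and shows the standard error of the slope shrinking as n grows while se settles near the true error SD rather than heading to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
results = []
for n in (25, 400, 6400):
    x = rng.normal(10, 3, size=n)
    y = 5 + 2 * x + rng.normal(0, 4, size=n)   # true error SD is 4
    b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    s_e = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    se_b1 = s_e / np.sqrt((n - 1) * np.var(x, ddof=1))
    results.append((n, s_e, se_b1))
    print(f"n={n:5d}  s_e={s_e:.2f}  se(b1)={se_b1:.4f}")
# se(b1) shrinks as n grows, but s_e stays near the true error SD of 4.
```

More data pin down the line, not the scatter of individual observations around it.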

Figure 21-12. Fitting a line to a nonlinear pattern often produces residuals that are not nearly normal.

Formulas

Simple Regression Model. The conditional mean of the response given that the value of the explanatory variable is x is

µy|x = E(y|x) = β0 + β1 x

As a description of individual values of the response,

yi = β0 + β1 xi + εi

The error terms εi are assumed to
1. be independent of each other,
2. have equal standard deviation σε, and
3. be normally distributed.

Checklist of conditions for the simple regression model
✓ Straight enough
✓ No embarrassing lurking variable
✓ Evidently independent
✓ Similar variances
✓ Nearly normal

Standard error of the slope

se(b1) = sε / √((n − 1) sx²) ≈ (sε / sx) × (1/√n)

Standard error of the intercept

se(b0) = sε √(1/n + x̄² / ((n − 1) sx²)) ≈ (sε / √n) √(1 + x̄²/sx²)

If x̄ = 0, the formula reduces to sε/√n. The farther x̄ is from 0, the larger the standard error of b0 becomes. Large values of se(b0) warn that the intercept may be an extrapolation.

Standard error of prediction. When using a simple regression to predict the value of an independent observation for which x = xnew,

se(ŷnew) = se √(1 + 1/n + (xnew − x̄)² / ((n − 1) sx²)) ≈ se
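The approximations in these formulas can be checked numerically. A sketch on synthetic data (assumed for illustration, not from the text) compares the exact standard errors with their approximations:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(20, 5, size=n)               # synthetic explanatory variable
y = 3 + 0.5 * x + rng.normal(0, 2, size=n)  # synthetic response

xbar, sx = x.mean(), x.std(ddof=1)
b1 = np.cov(x, y)[0, 1] / sx**2
b0 = y.mean() - b1 * xbar
s_e = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

# Exact formulas from this section and their large-n approximations
se_b1 = s_e / np.sqrt((n - 1) * sx**2)
se_b1_approx = (s_e / sx) / np.sqrt(n)
se_b0 = s_e * np.sqrt(1 / n + xbar**2 / ((n - 1) * sx**2))
se_b0_approx = (s_e / np.sqrt(n)) * np.sqrt(1 + xbar**2 / sx**2)
print(se_b1, se_b1_approx)   # the two values nearly coincide
print(se_b0, se_b0_approx)
```

The only difference between the exact and approximate versions is replacing n − 1 by n, which matters little once n is moderately large.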

The approximation by se is accurate so long as xnew lies within the range of the observed data and n is moderately large (on the order of 40 or more).

About the Data

The stock returns are from the Center for Research in Security Prices (CRSP). For the total stock market, we used returns on the value-weighted market index. The shown percentage changes earned by Berkshire Hathaway and the market are 100 times the excess returns on each. The excess return is obtained by subtracting the cost of borrowing from the return. For the cost of borrowing, we use the rate of interest on 30-day Treasury Bills. If Berkshire Hathaway returned 3% in a month, and the cost of borrowing is ½%, then the excess return is 2.5%. The salaries of CEOs in the finance industry are from the Compustat database for 2003. The data on locating a gasoline station are from a consulting project performed for a national oil company that operates filling stations around the US.

Software Hints

It's best to begin a regression analysis with a scatterplot of Y on X, as emphasized in Chapter 19. For checking the SRM, however, we need to go further than just looking at that plot. We need to see plots of the residuals as well. Don't skip the scatterplot of Y on X to get to the residuals.

Excel
To inspect the residuals from a regression, follow the menu commands Tools > Data Analysis… > Regression. (If you don't see this option in your Tools menu, you will need to add these commands. See the Software Hints in Chapter 19.) In the dialog for the regression command, first pick the response and explanatory variable and then choose the options that produce residuals. Be sure to click the option to see the normal probability plot of the residuals.

Minitab
Use the menu sequence Stat > Regression… > Regression and select the response and explanatory variable. The output that summarizes the regression itself appears in the scrolling session window. To see plots of the residuals, click the Graphs button and pick the 4-in-1 option to see all of the residual plots together. The collection includes a scatterplot of the residuals on the fitted values and a normal probability plot of the residuals.

JMP
Follow the menu sequence Analyze > Fit Y by X to construct a scatterplot and add a regression line.

(Click on the red triangle above the scatterplot near the words "Bivariate Fit of …".) Once you've added the least squares fitted line, a button labeled Linear Fit appears below the scatterplot. Click on the red triangle in this field and choose the item Plot Residuals to see a plot of the residuals on the explanatory variable. To get the normal quantile plot of the residuals, you have to save them first as a column in the data table: click on the red triangle in the Linear Fit button and choose the item Save Residuals. Then follow the menu sequence Analyze > Distribution and choose the residuals to go into the histogram. Once you see the histogram, click on the red triangle immediately above the histogram and choose the item Normal Quantile Plot.