
24

Multiple Regression

Simple Models
Errors in the Simple Regression Model
An Embarrassing Lurking Factor
Multiple Regression
Partial and Marginal Slopes
Path Diagrams
R2 Increases with Added Variables
The Multiple Regression Model
Calibration Plot
The Residual Plot
Checking Normality
Inference in Multiple Regression
The F-Test in Multiple Regression
Steps in Building a Multiple Regression
Summary


24 Multiple Regression

Utilities that supply natural gas to residential customers have to anticipate how much fuel they will need to supply in the coming winter. Natural gas is difficult to store, so utilities contract with pipelines to deliver the gas as needed. The larger the supply that the utility locks in, the larger the cost for the contract. That's OK if the utility needs the fuel, but it's a waste if the contract reserves more gas than needed. On the other hand, if deliveries fall short, the utility can find itself in a tight spot when the winter turns cold. A shortage means cold houses or expensive purchases of natural gas on the spot market. Either way, the utility will have unhappy customers – they'll either be cold or shocked at surcharges on their gas bills. It makes sense, then, for the utility to anticipate how much gas its customers will need.

Let's focus on estimating the demand of a community of 100,000 residential customers in Michigan. According to the US Energy Information Administration, 62 million residential customers in the US consumed about 5 trillion cubic feet of gas in 2004. That works out to about 80 thousand cubic feet of natural gas per household. In industry parlance, that's 80 MCF per household. Should the utility lock in 80 MCF for every customer? Probably not. Does every residence use natural gas for heating, or only some of them? A home that uses gas for heat burns a lot more than one that uses gas only for cooking and heating water. And what about the weather? These 100,000 homes aren't just anywhere in the US; they're in Michigan, where it gets colder than many other parts of the country. You can be sure these homes need more heating than if they were in Florida.

Forecasters are calling for a typical winter. For the part of Michigan around Detroit, that means a winter with 6,500 heating degree days.¹ How much gas does the utility need for the winter? How much should they lock in with contracts? As we'll see, the answers to these questions don't have to be the same. To answer either, we better look at data.

Here’s how to compute the heating degree days. For a day with low temperature of 20 and a high of 60, the “average” temperature is (20+60)/2 = 40. Now subtract the average from 65. This day contributes 65 – 40 = 25 heating degree days to the year. If the average temperature is above 65, the day contributes nothing to the total. It’s as if you only need heat if the average temperature is below 65.
1
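A quick sketch of this bookkeeping in Python (the two days below are made-up examples, not data from the chapter):

```python
# Heating degree days for a single day: 65 minus the day's average temperature,
# counted only when that difference is positive.
def heating_degree_days(low, high):
    average = (low + high) / 2
    return max(65 - average, 0)

print(heating_degree_days(20, 60))   # 25.0 -- the cold day in the footnote
print(heating_degree_days(60, 80))   # 0 -- average is 70, so no heating needed
```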


Simple Models

The Simple Regression Model describes the variation in the response with one explanatory variable. Consider what influences a customer to purchase an item: its packaging, promotion, and price, as well as the customer's income, lifestyle, and mood. The list goes on and on. (Chapter 6 has a similar example.) It's likely, however, that the variation in the response depends on many factors; in a simple regression, randomly scattered errors contribute the rest of the variation.

Multiple regression amplifies the power of regression modeling by permitting us to consider several explanatory variables at once. It's not for convenience either: multiple regression is more than the sum of several simple regression models because it accounts for relationships among the explanatory variables.

Let's start with a simple regression of the amount of natural gas used in homes on the number of heating degree days over the year. The data are a sample of 512 homes from around the US that use natural gas. Heating degree days (HDD) measure how cold it gets. If H is the daily high temp and L the low temp, then HDD for this day is HDD = 65 – (L+H)/2, so long as this number is positive. If the average temp is above 65, this formula says you don't need any heating. Homes in colder climates use more fuel to heat. The following scatterplot shows the number of thousands of cubic feet (MCF) of natural gas used during a heating season versus heating degree days, in thousands (MHDD).

[Figure 24-1. Scatterplot of Natural Gas (MCF) versus Heating DD (000) for the 512 homes.]

The following tables summarize the estimated SRM. We have to first check the conditions of the model before relying on these portions of the summary; we need to verify the conditions of the SRM before using the portions colored orange.

Table 24-1. Simple regression of gas use on heating degree days.
R2   0.488654
se   29.64422
n    512

Term   Estimate   Std Error   t Ratio   Prob>|t|
b0     34.4272    2.9951      11.49     <.0001
b1     14.5147    0.6575      22.08     <.0001

The fitted line estimates the annual consumption of natural gas for households that experience x thousand heating degree days to be

Estimated Natural Gas (MCF) = 34.4272 + 14.5147 x
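The fit itself is ordinary least squares. Here is a minimal sketch in Python (numpy assumed available); the 512-home survey data are not reproduced in this chapter, so the simulated values below only illustrate the mechanics and will not reproduce Table 24-1:

```python
# Sketch: least squares fit of gas use on heating degree days, with simulated data.
import numpy as np

rng = np.random.default_rng(0)
mhdd = rng.uniform(0.5, 7.5, size=512)              # heating degree days, in thousands
gas = 34.4 + 14.5 * mhdd + rng.normal(0, 30, 512)   # MCF, with random error

b1, b0 = np.polyfit(mhdd, gas, 1)                   # slope, intercept
fitted = b0 + b1 * mhdd
r2 = np.corrcoef(gas, fitted)[0, 1] ** 2
se = np.sqrt(np.sum((gas - fitted) ** 2) / (len(gas) - 2))
print(f"Estimated Natural Gas (MCF) = {b0:.2f} + {b1:.2f} x   R2={r2:.3f}  se={se:.2f}")
```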

The intercept b0 implies that residences use about 34,000 cubic feet of natural gas, on average, regardless of the weather. All of these households use gas for heating water and some use gas for cooking. The slope b1 indicates that, on average, consumption increases by about 14,500 cubic feet of gas for each increase of 1 more MHDD.

A utility can use this equation to estimate the average gas consumption in residences. To capture the effect of the weather, the National Oceanic and Atmospheric Administration estimates 6,500 HDD for the winter in Michigan (compared to 700 in Florida). Plugging this value into the least squares line for x, the equation predicts annual gas consumption to be

34.4272 + 14.5147 × 6.5 ≈ 128.8 MCF

Once we verify the conditions of the SRM, we can add a range to convey our uncertainty.

Errors in the Simple Regression Model

The SRM says that the explanatory variable x affects the average value of the response y through a linear equation, written as

y = µy|x + ε,  with µy|x = β0 + β1x

The model has two main parts.
(1) Linear pattern. The average value of y depends on x through the line β0 + β1x; that's the only way that x affects the distribution of y.
(2) Random errors. The errors should resemble a simple random sample from a normal distribution. They must be independent of each other with equal variance: the value of x should not affect the variance of the ε's (as seen in the previous chapter). Normality matters most for prediction intervals; otherwise, with moderate sample sizes, we can rely on the CLT to justify confidence intervals.

The SRM is a strange model when you think about it. Only one variable systematically affects the response. That seems too simple. Think about what influences energy consumption in your home. The temperature matters, but climate is not the only thing. How about the size of the home? It takes more to heat a big house than a small one. Other things contribute as well: the number of windows, the amount of insulation, the type of construction, and so on. The people who live there also affect the energy consumption. How many live in the house? Do they leave windows open on a cold day? Where do they set the thermostat? A model that omits these variables treats their effects as random errors.

What is ε? The errors represent the accumulation of all of the other variables that affect the response that we've not accounted for. The errors in regression represent variables that influence y that are omitted from the model. The "real" equation for y looks more like

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ⋯

That’s an embarrassing omission. size. Size is related to fuel use as well.. That’s another reason to watch for a lack of normality in residuals. it’s not too surprising that we often find normally ! distributed residuals.g. 200 Natural Gas (MCF) 150 100 50 0 2 3 4 5 6 7 8 9 10 11 Number of Rooms Figure 24-2. 2 3 Others include options (e. If we omit an important variable whose effect stands out from the others. The SRM draws a boundary after x1 and bundles the rest into the error. Deviations from normality can suggest an important omitted variable. even if we are aware. we don’t observe them. (This is easier to collect in surveys than measuring the number of square feet. ε needn’t be normal.2 (b) What happens to the effects of these other explanatory variables if they are not used in the regression model?3 tip AYT An Embarrassing Lurking Factor The best way to identify omitted explanatory variables is to know the context of your model. then the Central Limit Theorem tells us that the sum of their effects is roughly normally distributed.) This scatterplot graphs gas consumption versus the number of rooms. A simple regression that describes the prices of cars offered for sale at a large dealer regresses Price (y) on Engine Power (x). convertible).. 24-5 . Unless these homes are the same size. The simple regression for gas use says that the only thing that systematically affects consumption is temperature. Variables that affect y that are not explicitly used in the model become part of the error term.g.7/27/07 24 Multiple Regression Either we are unaware of some of these x’s or. y = " 0 + "1 x1 + " 2 x 2 + " 3 x 3 + " 4 x 4 + " 5 x 5 + L 1444442444443 # = " 0 + "1 x1 + # If the omitted variables have comparable effects on y. size matters. We’ll use the number of rooms to measure the size of the houses. a sunroof or navigation system). (a) Name other explanatory variables that affect the price of the car. and styling (e. The size of the home doesn’t matter. however. just the climate. As a result.

45 0. The format should look familiar.0001 24-6 . HDD explains 49% of the variation in gas use. a multiple regression model using these two explains quite a bit less.0724 11. It’s another simple view.7396 <. and we’re going to have to figure out why.78 Number of Rooms b2 6.861 7. This equation does not represent as much variation in gas use as the equation using HDD as the explanatory variable. Multiple Regression Multiple regression moves the boundary between included and omitted explanatory variables to the right.9409 2.339 -0. and the number of rooms explains 21%. In fact.33 0. Accordingly. This regression is interesting.0229 <. We’ll move the boundary a little to the right and add a second explanatory variable: y = " 0 + "1 x1 + " 2 x 2 + " 3 x 3 + " 4 x 4 + " 5 x 5 + L 1444 4 24444 3 error = " 0 + "1 x1 + " 2 x 2 + # What do you think will happen? Let’s try to anticipate how well the model will do overall. Summary of the multiple regression.5457 27. se is larger at 37 MCF.943 512 Std Error t-statistic 5.4070 Std Error t-statistic 6.57 p-value 0. To combine these explanations for the variation in y. If the effects of temperature and size ! are unrelated.28 1.0001 <.882 Table 24-3. Here’s the summary of the multiple regression. R2 se n Term Estimate Intercept b0 -1. a model with both would explain 21 + 49 = 70% of the variation.8340 12. Number of Rooms Estimate 15. R2 0.0001 Table 24-2.90 Term b0 b1.2079 se 36.775 Heating Degree Days (000) b1 12. Simple regression of gas use on number of rooms. we need multiple regression. only one that considers the effect of size rather than temperature on energy consumption.657 19. compared to 30 MCF for the fit using HDD. but not what we need. The R2 of this fit is 21% compared to 49% for HDD.7/27/07 24 Multiple Regression The slope of the least squares line in the figure is positive: larger homes use more gas.99 p-value 0. 0.

We interpret R2 and se in Table 24-3 as in simple regression. R2 = 0.5457 indicates that the equation of the model represents about 55% of the variation in gas usage. That's less than we expected, but more than either simple regression. The estimate se = 27.943 MCF estimates the SD of the underlying model errors. Because multiple regression explains more variation, the residual SD is smaller than with simple regression. (The equation for se is in the Formulas section at the end of the chapter.)

The rest of Table 24-3 describes the equation, taken from the column labeled "Estimate":

Estimated Gas Use = -1.775 + 12.78 MHDD + 6.882 Number of Rooms

The equation has an intercept and two slopes. The slope for MHDD in this equation differs from the slope for MHDD in the simple regression (14.51). That's not a mistake: the slope of an explanatory variable in multiple regression estimates something different from the slope of the same variable in simple regression. Because it ignores other differences between homes, a slope in an SRM is called the marginal slope for y on x. Because multiple regression adjusts for other factors, the slope in a multiple regression is known as a partial slope.

Partial and Marginal Slopes

The slope 14.51 for MHDD in the simple regression (Table 24-1) estimates the average difference in gas consumption between homes in different climates. Homes in a climate with 3,000 HDD use 14.51 more MCF of gas, on average, than homes in a climate with 2,000 HDD. The partial slope 12.78 for MHDD in the multiple regression (Table 24-3) also estimates the difference in gas consumption between homes in different climates, but it limits the comparison to homes with the same number of rooms. Homes in a climate with 3,000 HDD use 12.78 more MCF of gas, on average, than homes with the same number of rooms in a climate with 2,000 HDD.

To appreciate why these estimates are different, consider a specific question. Suppose a customer calls the utility with a question about the added costs for heating a one-room addition to her home. The marginal slope for the number of rooms estimates the difference in gas use between homes that differ in size by one room. The marginal slope 12.41 MCF/Room (Table 24-2) indicates that, on average, homes with another room use 12,410 more cubic feet of gas; at $10 per MCF, an added room increases the annual heating bill by about $124. The partial slope 6.882 MCF/Room (Table 24-3) indicates that, on average, homes with another room use 6.88 more MCF of gas than homes with one fewer room in the same climate, costing about $69 annually. Which slope provides a better answer: the marginal or the partial slope for the number of rooms?

The reason for the difference between the estimated slopes is the association between the two explanatory variables. On average, homes with more rooms are in colder climates, as shown in the following plot. (The line is the least squares fit of MHDD on the number of rooms.)

[Figure 24-3. Scatterplot of Heating DD (000) versus the Number of Rooms, with the least squares fit of MHDD on the number of rooms.]

Homes with more rooms tend to be in colder climates. The marginal slope in the simple regression mixes the effect of size (number of rooms) with the effect of climate, ignoring that homes with more rooms tend to be in colder places. Multiple regression separates them. Simple regression compares the average consumption of smaller to larger homes; multiple regression adjusts for the association between the explanatory variables and compares the average consumption of smaller to larger homes in the same climate. That's why the partial slope is smaller. Unless the homeowner who called the utility moved her home to a colder climate when she added the room, she should use the partial slope!

The correlation between MHDD and the number of rooms is r = 0.33. Evidently, there's a bit of overlap between the two explanatory variables. Some of the variation explained by HDD is also explained by the number of rooms; similarly, some of the variation explained by the number of rooms is explained by HDD. If we add the R2s from the simple regressions, we double count the variation that is explained by both explanatory variables. Multiple regression counts it once, and the resulting R2 is smaller. Correlation between explanatory variables is known as collinearity, and collinearity between the explanatory variables explains why R2 does not go up as much as we expected.

Collinearity  Correlation between the explanatory variables in a multiple regression.

Path Diagrams

Path diagrams offer another way to appreciate the differences between marginal and partial slopes. A path diagram shows a regression model as a collection of nodes and directed edges. Nodes represent variables and directed edges show slopes. Arrows lead from the explanatory variables to the response, and the diagram also joins the explanatory variables to each other with a double-headed arrow that symbolizes the association between them. A note of warning in advance: these pictures of models often suggest that we've uncovered the cause of the variation. That's not true. Multiple regression, like simple regression, models association. Let's draw the path diagram of the multiple regression.

It’s important to keep units with the slopes. 0.882 MCF/Room Figure 24-4. homes in the colder climate average 1 MHDD × 0.78 MCF of natural gas. The slopes for this edge come from two simple regressions: one with rooms as the response (see Figure 24-3) and the other with MHDD as the response. particularly for the doubledheaded arrow between the explanatory variables. The arrow from HDD to the response indicates that a difference of 1 MHDD produces an average increase of 1 MHDD × 12. too. If you’ve got that “déjà vu all over again” feeling. The marginal slope is 14.51.4322 Number of Rooms As an example. A colder climate has an indirect effect.2516 Rooms/MHDD = 0.2516 Rooms/MHDD Number of Rooms 6. That’s exactly what we’ve calculated from the path diagram.7/27/07 24 Multiple Regression 0. homes in the colder climate use (on average) 12. Simple regression answers a simple question.78 MCF + 1. Following the path to the number of rooms.4322 MHDD/Room 1000 Heating Degree Days 12.78 MCF/MHDD Consumption of natural gas 0.882 MCF/Room ≈ 1. If we add this indirect effect to the direct effect. This difference also increases gas use. The partial slope for 24-9 .73 MCF = 14. On average how much more gas do homes in a colder climate use? The marginal slope shows that those in the colder climate use 14.2516 MHDD Estimated MHDD = 1. Following the path from number of rooms to consumption.2516 rooms converts into 0. look back at the summary of the simple regression of gas use on MHDD in Table 24-1. That’s the “direct effect” of colder weather: the furnace runs more.51 more MCF of natural gas.2516 Rooms more than those in the warmer climate. and larger homes require more heat. but let’s make sure we understand why that happens.3776 + 0. Path diagram of the multiple regression. Estimated Number of Rooms = 5.73 MCF of natural gas. consider the difference in consumption between homes in climates that differ by 1.78 MCF/MHDD = 12.2602 + 0.51 MCF more MCF of natural gas than those in the warmer climate. Neat.2516 Rooms × 6. Homes in colder climates tend to have more rooms.000 heating degree days.

The partial slope for HDD in the multiple regression answers a different question. It estimates the difference in gas use due to climate among homes of the same size – excluding the pathway through the number of rooms. In the language of path diagrams, the marginal slope combines a positive direct effect with a positive indirect effect. Multiple regression separates the marginal slope into two parts: the direct effect (the partial slope) and the indirect effect. The sum of the two paths reproduces the simple regression slope:

12.78 MCF + 0.2516 Rooms/MHDD × 6.882 MCF/Room ≈ 12.78 MCF + 1.73 MCF = 14.51 MCF
direct effect + indirect effect = marginal effect

Path diagrams help you realize something else about multiple regression. The marginal slope blends the direct effect of an explanatory variable with all of its indirect effects. It's fine for the marginal slope to add these effects; the problem is that we sometimes forget about indirect effects and interpret the marginal slope as though it represents the direct effect. Once you appreciate the role of indirect effects, you begin to think differently about the marginal slope in a simple regression.

Marginal = Partial  When would the marginal and partial slope be the same? They match if there are no indirect effects. This happens when the explanatory variables are uncorrelated: if there's no collinearity, the pathway for the indirect effect is broken, and the marginal and partial slopes agree.

AYT  A contractor rehabs suburban homes, replacing exterior windows and siding. The homes he's fixed vary in size. He's kept track of material costs for these jobs (basically, the costs of replacement windows and new vinyl siding). He fits two regressions: a simple regression of cost on the number of windows, and a multiple regression of cost on the number of windows and square feet of siding. Which should be larger: the marginal slope for the number of windows in the simple regression or the partial slope for the number of windows in the multiple regression?⁴

⁴ The marginal slope is larger. Homes with more windows are bigger, and repairs to larger homes usually require both more windows and more siding. Hence the marginal slope combines the cost of windows plus the cost of more siding. Multiple regression separates these costs.
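This arithmetic is simple enough to check directly. A tiny Python sketch using the estimates already reported in this chapter:

```python
# Path-diagram check: marginal slope = direct effect (partial slope) + indirect effect.
partial_hdd = 12.78        # MCF per MHDD, from the multiple regression (Table 24-3)
rooms_per_mhdd = 0.2516    # rooms per MHDD, from regressing rooms on MHDD
partial_rooms = 6.882      # MCF per room, from the multiple regression (Table 24-3)

indirect = rooms_per_mhdd * partial_rooms
marginal = partial_hdd + indirect
print(f"indirect = {indirect:.2f} MCF, marginal = {marginal:.2f} MCF")  # about 1.73 and 14.51
```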

R2 Increases with Added Variables

Is the multiple regression "better" than the initial simple regression that considers only the role of climate? The overall summary statistics look better: R2 is larger and se is smaller. R2, however, increases every time you add an explanatory variable to a regression. Not by much, perhaps, but it goes up. Add a column of random numbers and R2 goes up. The residual standard deviation se shares a touch of this perversion; it goes down too easily for its own good.

Under the Hood: Why R2 gets larger

How does software figure out the slopes in a multiple regression? We thought you'd never ask! The software does what it did before: minimize the sum of squared deviations, least squares. With one x, the software minimizes the sum of squares

min over b0, b1 of  Σi=1..n (yi − b0 − b1 x1,i)²

Look at what happens when we add x2. In a way, x2 has been there all along, but with its slope constrained to be zero. Simple regression constrains the slope of x2 to zero:

min over b0, b1 of  Σi=1..n (yi − b0 − b1 x1,i − 0 · x2,i)²

When we add x2 to the regression, the software is free to choose a slope for x2. A multiple regression with two explanatory variables has more choices. It gets to solve this problem:

min over b0, b1, b2 of  Σi=1..n (yi − b0 − b1 x1,i − b2 x2,i)²

Now that it can change b2, the software can find a smaller residual sum of squares than it gets by keeping the least squares estimates for b0 and b1 alone. This flexibility allows the fitting procedure to explain more variation and increase R2. That's why R2 goes up.

With many possible explanations for the variation in the response, we need to be choosy: we should only include those explanatory variables that add value. How do you tell if an explanatory variable adds value? R2 is not very useful unless it changes dramatically. For evaluating the changes brought by adding an explanatory variable, we need confidence intervals and tests – and that means checking the conditions for multiple regression before we accept the addition of an explanatory variable.
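A quick demonstration of the point (simulated data; numpy assumed):

```python
# Sketch: R2 cannot decrease when a column of pure noise is added, because least
# squares may set the extra slope to zero or do slightly better.
import numpy as np

def r_squared(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ b
    return 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(2)
n = 512
x1 = rng.uniform(0.5, 7.5, n)
y = 34 + 14.5 * x1 + rng.normal(0, 30, n)

X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x1, rng.normal(size=n)])   # add a random noise column
print(r_squared(X1, y), r_squared(X2, y))                    # second value is (slightly) larger
```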

The Multiple Regression Model

The Multiple Regression Model (MRM) resembles the SRM; only its equation has several explanatory variables rather than one. The equation for multiple regression with two explanatory variables describes how the conditional average of y given both x1 and x2 depends on the x's:

E(y | x1, x2) = µy|x1,x2 = β0 + β1x1 + β2x2

Given x1 and x2, y lies at its mean plus added random noise – the errors:

y = µy|x1,x2 + ε

According to this model, only the conditional means of y change with the explanatory variables x1 and x2.

Let's examine this equation carefully, because it embeds several assumptions. It implies that the average of y varies linearly with each explanatory variable, regardless of the other. The x's do not mediate each other's influence on y: differences in y associated with differences in x1 are the same regardless of the value of x2 (and vice versa). Here's an equation for the sales of a product marketed by a company:

Sales = β0 + β1 Advertising Spending + β2 Price Difference + ε

The price difference is the difference between the list price of the company's product and the list price of its main rival. The equation implies that the impact on sales, on average, of spending more for advertising is limitless: the more it advertises, the more it sells. That can be a hard assumption to swallow. The equation also implies that advertising has the same effect regardless of the difference in price; it does not matter which product costs more. That may be the case, but there are situations in which the effect of advertising depends on the difference in price – for instance, the effect of advertising might depend on whether the ad is promoting a difference in price. We've seen remedies for some of these problems. A log-log scale captures diminishing returns, with log sales regressed on log advertising, but that does nothing for the second problem. Special explanatory variables known as interactions allow the equation to capture synergies between the explanatory variables, but we'll save interactions for Chapter 26. For now, we have to hope that the effect of one explanatory variable does not depend on the value of the other.

As in the SRM, the MRM requires nothing of the explanatory variables themselves. Because we want to see how y varies with changes in the x's, all we need is variation in the x's. (It would not make sense to use a constant as an explanatory variable.) The rest of the assumptions describe the errors. Ideally, the errors are an iid sample from a normal distribution. These are the same three assumptions as in the SRM: the errors ε are
1. independent,
2. with equal variance σ², and ideally are
3. normally distributed.

Since the only difference between the SRM and the MRM is the equation, it should not come as a surprise that the check-list of conditions matches:
 Straight enough
 No embarrassing lurking variable
 Similar variances
 Nearly normal

The difference lies in the choice of plots used to verify these conditions. Simple regression is "simple" because there's one key plot: the scatterplot of y on x. Multiple regression offers more choices, but there is a natural sequence of plots that goes with each numerical summary. Indeed, most plots make multiple regression look like simple regression in one way or another. Two plots in particular convert multiple regression into a special simple regression, and you want to look at these before you dig into the output very far.

Calibration Plot

The summary of a regression usually begins with R2 and se, and the calibration plot belongs with these. This table repeats R2 and se for the two-predictor model for natural gas.

Table 24-4. Overall summary of the two-predictor model.
R2   0.5457
se   27.943

For the simple regression, the scatterplot of y on x shows R2: it's the square of the correlation between y and x. R2 in multiple regression is again the square of a correlation, namely the square of the correlation between y and the fitted values ŷ = b0 + b1x1 + b2x2. The calibration plot shows R2 for the multiple regression: rather than plot y on either x1 or x2, the calibration plot is a scatterplot of y versus the fitted value ŷ, with the fitted values on the horizontal axis. The calibration plot summarizes the fit of a multiple regression much as a scatterplot of y on x summarizes a simple regression. The tighter the data cluster along the diagonal in the calibration plot, the better the fit.

[Figure 24-5. Calibration plot for the two-explanatory-variable MRM: Natural Gas (MCF) versus Estimated Natural Gas (MCF).]

R2  The square of the correlation between y and ŷ; a calibration plot of y on ŷ visualizes the R2 of the model.
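A sketch of the calibration plot, continuing the simulated two-predictor fit above (matplotlib assumed available; `gas` and `fitted` are the arrays defined there):

```python
# Calibration plot: the response versus the fitted values, with the 45-degree
# diagonal for reference. Tight clustering along the diagonal means a better fit.
import matplotlib.pyplot as plt

plt.scatter(fitted, gas, s=10, alpha=0.5)
lims = [min(fitted.min(), gas.min()), max(fitted.max(), gas.max())]
plt.plot(lims, lims)                          # diagonal: response equals fitted value
plt.xlabel("Estimated Natural Gas (MCF)")
plt.ylabel("Natural Gas (MCF)")
plt.title("Calibration plot")
plt.show()
```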

The Residual Plot

The plot of residuals on x is very useful in simple regression because it zooms in on the deviations around the fitted line. The analogous plot is useful in multiple regression. All we do is "shear" the calibration plot, twisting it so that the regression line in the calibration plot becomes flat. In other words, we plot the residuals, y − ŷ, on the fitted values ŷ. This view of the fit shows se: if the residuals are nearly normal, then 95% of them lie within ±2 se of zero. Here's the plot of residuals on fitted values for the multiple regression of gas use on HDD and rooms. You can guess that se is about 30 MCF from this plot because all but about 10 cases (out of 512) lie within ±60 MCF of the horizontal line at zero. (The actual value of se is 27.943 MCF.)

[Figure 24-6. Scatterplot of residuals (MCF) on fitted values, Estimated Natural Gas (MCF).]

The residuals should suggest an angry swarm of bees with no real pattern. If you see a pattern in this plot, either a trend in the residuals or changing variation, the model does not meet the conditions of the MRM. Often, as we've seen in Chapter 23, data become more variable as they get larger. Since ŷ tracks the size of the predictions, this plot is the natural place to check for changes in variation; that's the most common use of this plot, checking the similar variances condition. In this example, the residuals hover around zero with no trend, but there might be a slight change in the variation. The residuals at the far left seem less variable than the rest, so the residuals suggest a lack of constant variation (heteroscedasticity, Chapter 23), but the effect is not severe.
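A sketch of this plot, again continuing the simulated fit (matplotlib assumed):

```python
# Residual plot: residuals on fitted values, with guide lines at 0 and +/- 2 se.
import matplotlib.pyplot as plt
import numpy as np

residuals = gas - fitted
se = np.sqrt(np.sum(residuals ** 2) / (len(gas) - 1 - 2))

plt.scatter(fitted, residuals, s=10, alpha=0.5)
for level in (0, 2 * se, -2 * se):
    plt.axhline(level, linestyle="--" if level else "-")
plt.xlabel("Estimated Natural Gas (MCF)")
plt.ylabel("Residual (MCF)")
plt.show()
```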

Checking Normality

The last condition to check is the nearly normal condition. We haven't discovered outliers or other problems in the other views, so chances are this one will be OK as well. Here's the normal quantile plot of the residuals.

[Figure 24-7. Histogram and normal quantile plot of residuals from the multiple regression.]

It's a good thing that we checked. The residuals are slightly skewed: they reach further out on the positive side (to +80) than on the negative side (–60). The effect is mild, however, and almost consistent with normality. We'll take this as nearly normal, but be careful about predicting the gas consumption of individual homes. If we have a large sample (n > 30), we're all set to go for inferences about the coefficients. (The skewness and the slight shift in variation in Figure 24-6 suggest you might have problems with prediction intervals.)

Inference in Multiple Regression

It's time for confidence intervals and tests. The layout of the estimates for multiple regression matches that for simple regression; only the table of estimates has one more row. Each row gives an estimate and its standard error, followed by a t-statistic and p-value.

Table 24-5. Summary of the multiple regression.
Term                           Estimate   Std Error   t-statistic   p-value
Intercept b0                   -1.775     5.339       -0.33         0.7396
Heating Degree Days (000) b1   12.78      0.657       19.45         <.0001
Number of Rooms b2             6.882      0.861       7.99          <.0001

The procedure for building confidence intervals is the same as in simple regression. As usual, we start with the standard error. The do-it-yourself 95% confidence intervals have the form

estimate ± 2 se(estimate)

This table compares confidence intervals for the partial slopes to those for the marginal slopes from the two simple regressions.

Table 24-6. Comparison of confidence intervals for marginal and partial slopes.
                        Marginal            Partial
Heating Degree Days     14.51 ± 2 × 0.66    12.78 ± 2 × 0.66
Number of Rooms         12.41 ± 2 × 1.07    6.88 ± 2 × 0.86

Both partial slopes are smaller than the marginal slopes, with about the same standard errors.

We interpret confidence intervals in multiple regression like those in simple regression. For example, let's revisit the homeowner who's added a one-room addition to her home. The partial slope for the number of rooms indicates that she can expect her annual consumption of natural gas to increase by about 6.88 − 2 × 0.86 = 5.16 to 6.88 + 2 × 0.86 = 8.60 thousand cubic feet. Similarly, because zero is not in the confidence interval for the partial slope of HDD, we believe that homes in colder climates use more gas than homes with the same number of rooms in warmer climates.

Tests in regression come in two forms: those for each explanatory variable and those for the model as a whole. These are equivalent in simple regression because there's only one explanatory variable. With several explanatory variables, we can look at them individually or collectively.

The two columns in Table 24-5 that follow the standard errors are derived from the estimates and standard errors. Each t-statistic is the ratio of the estimate to its standard error. The t-statistic tests the null hypothesis that the intercept or a specific slope is zero. We'll write this generic null hypothesis as H0: βj = 0. Roughly, a t-statistic counts the number of standard errors that separate the estimated slope from zero, the default hypothesized value:

tj = (bj − 0) / se(bj)

Values of tj larger than 2 or less than −2 are "far" from zero: if |tj| > 2, then the p-value < 0.05. You can estimate the p-value from the Empirical Rule. The p-values in the final column use Student's t-model (Chapter 17) to assign a probability to the t-statistic. This test works as in simple regression, with one twist.

tip  With several explanatory variables, a t-statistic adjusts for all of the other explanatory variables. The test of a slope in a multiple regression has an incremental interpretation related to the predictive accuracy of the fitted model.

Rejecting H0: βj = 0 means…
1. The regression with xj added explains statistically significantly more variation.
2. R2 rises by a statistically significant amount when xj is added to a model containing the other explanatory variables.

The t-statistics and p-values in Table 24-5 indicate that both partial slopes in the multiple regression differ significantly from zero. (The order of the variables listed in the table doesn't matter.) That is, by rejecting H0: β1 = 0, we see that a regression that includes HDD explains statistically significantly more variation in natural gas use than one without this explanatory variable: the addition of HDD significantly improves the fit beyond what is achieved using the number of rooms alone. Similarly, by rejecting H0: β2 = 0, we see that adding the number of rooms to the simple regression that has only HDD improves the R2 of the fit by a statistically significant amount. Agreeing with this, neither confidence interval in Table 24-6 contains zero. Had the absolute size of either t-statistic been small (|tj| < 2) or the p-value large (more than 0.05), the data could have been a sample from a population in which that partial slope is zero.
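The do-it-yourself calculation from the summary table, as a short sketch (using the rounded standard errors from Table 24-6):

```python
# t-statistic = estimate / SE, and approximate 95% interval = estimate +/- 2 SE.
rows = {                        # term: (estimate, standard error)
    "Heating Degree Days (000)": (12.78, 0.66),
    "Number of Rooms": (6.882, 0.86),
}
for term, (est, se_b) in rows.items():
    t = est / se_b
    lo, hi = est - 2 * se_b, est + 2 * se_b
    print(f"{term}: t = {t:.1f}, 95% CI roughly [{lo:.2f}, {hi:.2f}]")
```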

We’ll have more to say about the ANOVA table in Chapter 25. You can think of the F-test as a test of the size of R2. or positive. The F-Test in Multiple Regression Multiple regression adds a test that we don’t need in simple regression. The F-test in regression comes in handy because of the problem with R2. Because the p-value is larger than 0. In this case. It’s called the F-test. measures the explanatory value of all of the explanatory variables. The F-statistic cares. if you add enough explanatory variables. Her regression has an R2 of 98%. n t vs F The t-stat tests the effect of one explanatory variable. it tests the null hypothesis that the model explains nothing. (The F-statistic is not needed in simple regression because the t-statistic serves the same purpose. Suppose that a friend of yours who’s taking Stat tells you that she has built a great regression model. In other words. which is obtained using the F-statistic. negative. an F-statistic looks at them collectively. abbreviated ANOVA. taken collectively rather than separately. it only suggests that it might be zero. R2 = ˆ Variation in y = Variation in y ˆ #( y i= 1 n i " y) " y) 2 #( y i= 1 2 i 24-17 ! .7/27/07 24 Multiple Regression statistic for the test of H0: β0 = 0 is –0. What null hypothesis is being tested? For the F-statistic. R2 is the ratio of how much variation resides in the fitted values of your model compared to the variation in y. F = t2 and both produce the same p-value. R2 does not track the number of explanatory variables or the sample size. the null hypothesis is H0: β1 = β2 = 0. the F-stat tests the combination of them all. Unless you can reject this one.33. If she has n=50 cases and 45 explanatory variables. the estimate lies 1/3 of a standard error below zero. it gets larger whenever you add an explanatory variable. we cannot reject H0. What’s that mean? It does not mean β0 is zero. The p-value indicates that 74% of random samples from a population with β0 = 0 produce estimates this far (or farther) from zero. In fact. you ought to ask her a couple of questions: “How many explanatory variables are in your model?” and “How many observations do you have?” If her model obtains an R2 of 98% using two explanatory variables and n = 1000. Namely. and it usually appears in a portion of the output known as the “Analysis of Variance”. you should learn more about her model. In a simple regression the F-statistic is the square of the t-statistic for the one slope. the model is not so impressive.05. the null hypothesis states that your data is a sample from a population for which all the slopes are zero. the explanatory variables collectively explain nothing more than random variation.) The F-test. you can make the R2 as large as you want. Before you get impressed. t-statistics consider the explanatory variables oneat-a-time.

The more explanatory variables you have, the more variation gets tied up in the fitted values: as you add explanatory variables, R2 goes up. The F-statistic doesn't offer this free lunch; it "charges" the model for each explanatory variable. For a multiple regression with q explanatory variables, the F-statistic is

F = (Variation in ŷ per explanatory variable) / (Variation remaining per residual d.f.)
  = (R²/q) / ((1 − R²)/(n − q − 1))
  = R²/(1 − R²) × (n − q − 1)/q

If you have relatively few explanatory variables compared to the number of cases (q << n), then it's useful to approximate the F-statistic as

F ≈ R²/(1 − R²) × n/q = R²/(1 − R²) × (number of cases)/(number of explanatory variables)

The F-statistic is roughly the ratio of the variation that you have explained relative to what is left over, multiplied by the number of cases per explanatory variable. If an added explanatory variable contributes little, the top of this ratio – the explained variation per explanatory variable – gets smaller while the bottom stays about the same, so F goes down even though R2 goes up.

What's a big value for the F-statistic? As a conservative rule-of-thumb, anything above 4 is statistically significant (depending on n and q, smaller values may be as well). In practice, check your output for a p-value. For this model, R2 = 0.5457, so

F ≈ 0.5457/(1 − 0.5457) × 509/2 ≈ 1.20 × 254.5 ≈ 305

That's statistically significant. We soundly reject the null hypothesis in this example. There's no way these data are a sample from a population with both slopes equal to zero: you'd never get an R2 of 55% with 2 predictors and 512 observations unless some slope differs from zero.

The F-test is a good place to start when you're working with multiple regression. It's a gatekeeper: unless its p-value is less than 0.05, you shouldn't go further, and unless you reject its null hypothesis, you have little reason to look at the individual slopes. We were hasty and jumped into the partial slopes in this example before checking the F-statistic; ordinarily, you shouldn't peek at the slopes until you've verified that the F-test rejects its H0.

AYT  Our contractor (previous AYT) got excited about regression, and he particularly liked the way R2 went up with each added variable. He has data on costs at n = 49 homes and he used q = 26 explanatory variables. By the time he was done, R2 = 0.892. Are you impressed by his model?⁵

⁵ The overall F-statistic is F = R²/(1 − R²) × (n − 1 − q)/q = 0.892/(1 − 0.892) × (49 − 1 − 26)/26 ≈ 7. That's statistically significant, so his model does explain more than random variation, but he's going to have a hard time sorting out what those slopes mean.
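Both of these calculations are one line of arithmetic; a small sketch:

```python
# F-statistic from R2, the sample size, and the number of explanatory variables.
def f_statistic(r2, n, q):
    return (r2 / (1 - r2)) * ((n - 1 - q) / q)

print(round(f_statistic(0.5457, 512, 2)))   # about 306 for the gas regression
print(round(f_statistic(0.892, 49, 26)))    # about 7 for the contractor's model
```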

Steps in Building a Multiple Regression

Let's summarize the steps that we've considered and the order in which we've made them.
1) What problem are you trying to solve? Do these data help you? Until you know enough about the data and problem to answer these two questions, there's no need to fit a complicated model. As in simple regression, it pays to start with the big picture.
2) Check the scatterplots of the response versus the explanatory variables and also those that show relationships among the explanatory variables. Make sure that you understand the measurement units of the variables and identify any outliers.
3) If the scatterplots of y on the xj appear straight enough, fit the multiple regression. If not, find a transformation to straighten out a relationship that bends.
4) Find the residuals and fitted values from your regression.
5) Make scatterplots that show the overall model (y on ŷ) and the residuals (residuals on ŷ). The residual plot should look simple; the plot of residuals versus ŷ is the best place to identify changing variances.
6) Check whether the residuals are nearly normal. If not, be very cautious about using prediction intervals.
7) Check the F-statistic to see whether you can reject the "null model" – that is, conclude that some slope differs from zero. If not, go no further with tests.
8) Test and interpret the individual partial slopes.

4M Subprime Mortgages

Subprime mortgages dominated business news in 2007. A subprime mortgage is a loan made to a more risky borrower than most, and defaults on such mortgages brought down several hedge funds that year. As this example shows, there's a reason that banks and hedge funds plunged into the risky housing market: these loans earn more interest – so long as the borrower can keep paying. For this analysis, we've made you an analyst at a creditor who's considering moving into this market.

Motivation  State the business decision.

As a mortgage lender, my company would like to know what might be gained by moving into the subprime market. In particular, we'd like to know which characteristics of the borrower and loan affect the amount of interest we can earn on loans in this category.

The two explanatory variables in this analysis are common in this domain. The FICO score (named for its owner, the Fair-Isaac Company) is the most common commercial measure of the credit worthiness of a borrower. The loan-to-value ratio (LTV) captures the exposure of the lender to defaults; for example, if LTV = 0.80 (80%), then the mortgage covers 80% of the value of the property.

Method  Plan your analysis.

I'll use a multiple regression. The response is the annual percentage rate of interest earned by the loan (APR). I have two explanatory variables: the credit rating score of the borrower (FICO) and the ratio of the value of the loan to the value of the property (LTV). Relating the regression to the business decision: the partial slope for LTV controls for the exposure we're taking on the loan – the higher the LTV, the more risk we face if the borrower defaults – and the partial slope for FICO describes, for a given LTV, how much poor credit costs the borrower.

Describe the sample. We obtained data on 372 mortgages from a credit bureau that monitors the subprime market. These loans are an SRS of subprime mortgages within the geographic area where we operate.

Verify the big-picture conditions, but look at plots before you summarize.
 Straight enough. Scatterplots of APR on both explanatory variables seem reasonable, with no evident bending and no big outliers. For example, here's APR versus LTV; we've omitted the others to save space. [Scatterplot of APR versus LTV.]
 No embarrassing lurking factor. I can imagine a few other factors that may be omitted, such as more precise geographic information. Other aspects of the borrower (age, race) better not matter, or something illegal is going on.

The relationships are linear, and the plots indicate only moderate dependence; FICO and LTV are not dramatically correlated with each other. Use a plot to make sure the correlations make sense, then summarize them with a table of correlations.

        APR       LTV
LTV     -0.4265
FICO    -0.6696   0.4853

Mechanics  Check the additional conditions on the errors by examining the residuals.
 Similar variances. The plot of residuals on fitted values shows consistent variation over the range of fitted values.
 Nearly normal. There is one outlier and some skewness in the residuals; those features are more visible in the normal quantile plot, and the histogram and normal quantile plot confirm the skewness. The regression fit underestimates APR by up to 7%, but rarely overpredicts by more than 2%. Since I'm not building a prediction interval for individual loans, I'll rely on the CLT to produce normal sampling distributions for my estimates.

don’t forget to round to the relevant precision.7/27/07 24 Multiple Regression interval for individual loans.019±2(0. The fitted equation is Estimated APR ≈ -24. Term b0 LTV FICO Est SE t Stat p-value 23. Based on my analysis of the supplied sample of 372 mortgages.4619 1.0186 0.0001 -1. I can clearly reject H0 that both slopes are zero.46) in the interest rates. The data also suggest that some loans have much higher interest rates than what we would expect from this basic analysis. The SD of the residuals is se = 1. LTV=0.5).1. These are a small sample of only 372 of the thousands of mortgages in our area.85 <. risky borrowers with FICO score 500 pay on average between 4.001).. it is clear that characteristics of both the borrower (FICO score) and loan (loan to value ratio) affect interest rates for loans in the subprime market.0013) to 200 *( -0.7 .6 LTV .003 -0.04 0.0001 If there are no severe violations of the conditions. with a larger t-statistic and tighter confidence interval -0.650 36. Among mortgages that cover a fixed percentage of the value of the property (e.2% more than borrowers with FICO score 700.518 -3. Convert the slopes into meaningful comparisons.24%. which is highly statistically significant by the F-test. The effect of the borrowers credit rating is much stronger.0186 .8).0186 + 2 * 0.0013) Note important caveats that are relevant to the question at hand. some borrowers are more risky than the FICO score indicates.242 F = R /(1-R )*(n-1-2)/2 = 0.g.2 * 0. such as the skewness in this example.02 FICO The effect of LTV is poorly determined. Interpret slopes as partial slopes.0013 -13.6 ± 2(0. Evidently. 200 * (-0. These two factors alone describe about ½ of the variation in rates. Describe the estimated equation.46 <. I’ll rely on the CLT to produce normal sampling distributions for my estimates.4619) * (372-3)/2 ≈ 158 2 2 It explains about ½ of the variation (R2 = 0.4619/(1-0. This model explains real variation in APR. With all of these other details. summarize the overall fit of the model.24% to 3.577 0.691 0. 24-21 . Message This part comes from the overall model. with a wide confidence interval of -1. Here’s the summary of my fitted model R2 se 0. Build confidence intervals for the relevant parameters. I’ll round to 1 digit for LTV and show 4 digits for the FICO score.0.

Summary

The Multiple Regression Model (MRM) expands the Simple Regression Model by incorporating other explanatory variables in its equation. The slopes in the equation of an MRM are partial slopes that typically differ from the marginal slopes in a simple regression; collinearity, the correlation between predictors in a multiple regression, is what makes them differ. A path diagram is a useful figure for distinguishing the direct and indirect effects of explanatory variables. A calibration plot of y on ŷ shows the overall fit of the model, visualizing its R2; the plot of residuals on ŷ allows a check for similar variances. The F-statistic measures the overall significance of the fitted model, and individual t-statistics for each partial slope test the incremental improvement in R2 obtained by adding that variable to a model containing the other explanatory variables.

Index
calibration plot · collinearity · F-statistic · F-test · slope, marginal · slope, partial

Formulas

In each formula, q denotes the number of explanatory variables. For the examples in this chapter, q = 2.

se  Divide the sum of squared residuals by n minus the number of estimates in the equation, then take the square root. For a multiple regression with q = 2 explanatory variables (and estimates b0, b1, and b2),

se² = Σi=1..n ei² / (n − 1 − 2) = Σi=1..n (yi − b0 − b1 xi,1 − b2 xi,2)² / (n − 1 − 2)

F-statistic

F = (Variation in ŷ per predictor) / (Variation remaining per residual d.f.)
  = (R²/q) / ((1 − R²)/(n − 1 − q))
  = R²/(1 − R²) × (n − 1 − q)/q

Best Practices

• Know the context of your model. How are you supposed to guess what factors are missing from the model unless you know something about the problem and the data? It's important in simple regression, but even more important in multiple regression.

• Examine plots of the overall model and the coefficients before interpreting the output. It can be really, really tempting to dive right into the output rather than hold off to look at the plots. You did it in simple regression, and you need to do it even more in multiple regression. If you get hasty and skip the plots, you may fool yourself into thinking you've figured it all out, only to discover later that it was just an outlier.
• Check the overall F-statistic before digging into the t-statistics. If you look at enough t-statistics, you'll eventually find explanatory variables that are statistically significant. If you check the F-statistic first, you'll avoid the worst of these problems.
• Distinguish marginal from partial slopes. A marginal slope combines the direct and indirect effects of an explanatory variable; a partial slope avoids the indirect effects of the other variables in the model. Some would say that the partial slope "holds the other variables fixed," but that's too far from the truth. It is true in a certain mathematical sense, but we didn't hold anything fixed in our example; we just compared energy consumption among homes of differing size in different climates.
• Let your software compute prediction intervals in multiple regression. In general, prediction intervals have the same form in multiple regression as in simple regression, namely predicted value ± 2 se, but this only applies when not extrapolating – and extrapolation is harder to recognize in multiple regression. For instance, suppose we were to use our multiple regression to estimate gas consumption for 4-room homes in climates with 7,500 heating degree days. That's not an outlier on either variable alone, but it's an outlier when we combine the two! Outliers are more subtle when the x's are correlated. If you do extrapolate, you better let the software do the calculations.

[Figure 24-8. Scatterplot of Heating DD (000) versus Number of Rooms: a 4-room home in a climate with 7,500 heating degree days lies outside the observed combinations.]

Pitfalls

• Become impatient. Multiple regression takes time not only to learn, but also to do. Statistics rewards persistence; you'll find that you make better choices by being more patient.

it’s telling us that the partial slope might be positive or might be negative. El Anshasy. Sure. That does not mean that we’ve gotten them all. It might not. Just as we did not hold fixed any variable. If the confidence interval includes zero. • • About the Data The data on household energy consumption are from the Residential Energy Consumption Survey conducted by the US Department of Energy. G. we probably did not change any of the variables either. 24-24 .7/27/07 24 Multiple Regression • Think that you have all of the important variables. and Y. See the Software Hints in Chapter 19. No matter how many explanatory variables you have in the model. All it means if we cannot reject H0: β1 = 0 is that this slope might be zero. it virtually impossible to know whether you’ve found all of the relevant explanatory variables. Elliehausen. however. Think that an insignificant t-statistic implies an explanatory variable has no effect. Unless you got your data by running an experiment (and this does happen sometimes). you will need to add these commands. The menu sequence Minitab Stat > Regression… > Regression constructs a multiple regression if several variables are chosen as explanatory variables. we added a second variable to our model for energy consumption and it made a lot of sense.) Selecting several columns as X variables produces a multiple regression analysis. We’ve just compared averages. you cannot get causation from a regression model. Shimazaki of George Washington University and The Credit Research Center of Georgetown Univerity. The confidence interval tells you a bit more. The study of subprime mortgages is based on an analysis of these loans in a working paper “Mortgage Brokers and the Subprime Mortgage Market” produced by A. In most applications. Think that you’ve found causal effects. Just because we don’t know the direction (or sign) of the slope doesn’t mean it’s zero. Excel To fit a multiple regression. Software Hints The software commands for building a multiple regression are essentially those used for building a model with one explanatory variable. All you need to do is select several columns as explanatory variables rather than just one. follow the menu commands Tools > Data Analysis… > Regression (If you don’t see this option in your Tools menu.

24-25 . Click the Run Model button to obtain a summary of the least squares regression. The summary window combines the now familiar numerical summary statistics as well as several plots.7/27/07 24 Multiple Regression JMP The menu sequence Analyze > Fit Model constructs a multiple regression if two or more variables are entered as explanatory variables.
