
9/24/2021

Assessing the Accuracy of the Model

• Once we have rejected the null hypothesis in favor of the alternative hypothesis, it is natural to want to quantify the extent to which the model fits the data.

• The quality of a linear regression fit is typically assessed using two related quantities:
1. The residual standard error (RSE), and
2. The R² statistic.


1. Residual Standard Error


• Recall from the population line equation that associated with each observation is an error term ε.
• Due to the presence of these error terms, even if we knew the true regression line, we would not be able to perfectly predict Y from X.
• The RSE is an estimate of the standard deviation of ε.
• It is the average amount that the response will deviate from the true regression line.
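As a minimal sketch of the computation, the RSE can be estimated as √(RSS/(n−2)) after a least squares fit. The data below are synthetic (the Advertising data are not reproduced here), and the intercept, slope, and noise level are illustrative assumptions:

```python
import numpy as np

# Minimal sketch: estimating the RSE for a simple linear regression.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 300, n)           # hypothetical TV advertising budgets
eps = rng.normal(0, 3.26, n)         # the irreducible error term epsilon
y = 7.0 + 0.05 * x + eps             # hypothetical true regression line

# Least squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# RSE = sqrt(RSS / (n - 2)): an estimate of the standard deviation of epsilon
rss = np.sum((y - (b0 + b1 * x)) ** 2)
rse = np.sqrt(rss / (n - 2))
print(rse)  # should fall near the true sigma (3.26) for this simulation
```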



The least squares model for the regression of number of units sold on TV advertising budget:

• Residual standard error = 3.26.
• In other words, actual sales in each market deviate from the true regression line by approximately 3,260 units, on average.
• In the Advertising data set, the mean value of sales over all markets is approximately 14,000 units, and so the percentage error is 3,260/14,000 ≈ 23%.
• The RSE is considered a measure of the lack of fit of the model to the data.
• Since the RSE is measured in the units of Y, it is not always clear what constitutes a good RSE.


2. Coefficient of Determination: R²

• Some of the variation in Y can be explained by variation in the X's and some cannot.

• R² measures the proportion of variability in Y that can be explained using X:

R² = 1 − RSS / TSS,  where TSS = Σᵢ (Yᵢ − Ȳ)² is the total sum of squares.

• R² is always between 0 and 1. Zero means no variance has been explained; one means it has all been explained (a perfect fit to the data).
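The definition can be sketched directly in code. The data here are synthetic and the coefficients are illustrative assumptions, not the Advertising fit:

```python
import numpy as np

# Sketch: computing R^2 = 1 - RSS/TSS for a simple linear regression.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=100)  # assumed toy model

# Least squares fit
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss
print(r2)                             # always between 0 and 1
```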



• An R² statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression.

• A number near 0 indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error σ² is high, or both.

• R² = 0.612
• About 61% (just under two-thirds) of the variability in sales is explained by a linear regression on TV.

• In the simple linear regression setting, R² = r², where r is the sample correlation between X and Y.
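The identity R² = r² in the simple setting can be checked numerically; the data below are a synthetic sketch, not the Advertising data:

```python
import numpy as np

# Quick numerical check that R^2 equals the squared sample correlation r^2
# in simple linear regression (synthetic, assumed toy coefficients).
rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 1.0 + 0.8 * x + rng.normal(scale=0.5, size=50)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
rss = np.sum((y - (b0 + b1 * x)) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss

r = np.corrcoef(x, y)[0, 1]       # sample correlation between X and Y
print(r2, r ** 2)                 # the two values agree
```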


Multiple Linear Regression (MLR)


• Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor.

• How can we extend our analysis to accommodate additional predictors?

• One option is to run separate simple linear regressions for each predictor; this is not entirely satisfactory:
1. It is difficult to make a single prediction given the different predictors.
2. Each regression ignores the other predictors. What if the various predictors are correlated? That can lead to very misleading estimates of the individual predictors' effects on the response.



Multiple Linear Regression (2)

Population line:     Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + β_pX_p + ε

Least squares line:  Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + ⋯ + β̂_pX_p

• The parameters in the linear regression model are very easy to interpret.
• β₀ is the intercept (i.e., the average value of Y when all the X's are zero); βⱼ is the slope for the jth variable Xⱼ.
• βⱼ is the average increase in Y when Xⱼ is increased by one unit and all other X's are held constant.
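This interpretation can be illustrated with a least squares fit on synthetic data; the p = 2 predictors and the coefficient values (3, 2, −1) are assumptions for the sketch, not the Advertising model:

```python
import numpy as np

# Sketch of a multiple linear regression fit by least squares.
rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 2))                       # predictors X1, X2
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept beta_0
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# beta_hat[j] estimates the average change in Y per unit increase in X_j,
# holding the other predictors fixed.
print(np.round(beta_hat, 1))  # should land near the assumed [3, 2, -1]
```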


MLR on advertising data



Observations
• Simple and multiple regression coefficients can be quite different.
• Does it make sense for the multiple regression to suggest no
relationship between sales and newspaper while the simple
linear regression implies the opposite?

• Newspaper advertising acts as a surrogate for radio advertising; newspaper gets "credit" for the effect of radio on sales.
• Almost all the explaining that newspaper could do in the simple regression has already been done by TV and radio in the multiple regression!
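The surrogate effect can be reproduced in a small simulation; the variable names and coefficients below are assumptions for the sketch (x1 plays the role of radio, x2 of newspaper):

```python
import numpy as np

# Sketch of the surrogate effect: x2 is correlated with x1 but has no
# direct effect on y. Simple regression of y on x2 shows a large slope;
# the multiple regression assigns x2 a coefficient near zero.
rng = np.random.default_rng(4)
n = 2000
x1 = rng.normal(size=n)                              # true driver
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)        # correlated surrogate
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)   # x2 plays no role

# Simple regression of y on x2 alone: x2 "credits" itself with x1's effect
slope_simple = (np.sum((x2 - x2.mean()) * (y - y.mean()))
                / np.sum((x2 - x2.mean()) ** 2))

# Multiple regression on both predictors
A = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print(slope_simple)   # clearly nonzero (spurious)
print(beta_hat[2])    # near zero: x2's apparent effect disappears
```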


