
4. Residual analysis for simple linear regression

4.1. Residuals
(a) Observed error
\[ \varepsilon_i = y_i - E(y_i) \]
• Assumptions for regression model
o εi are independent normal random variables
o Mean 0
o Constant variance σ²
(b) Residual / errors of fit
• Residual is defined as
\[ e_i = y_i - \hat{y}_i \]
o Part of Y not explained by the model
• Sample mean of ei
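\[
\begin{aligned}
\bar{e} &= \frac{1}{n}\sum_{i=1}^{n} e_i = \frac{1}{n}\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) \\
&= \frac{1}{n}\sum_{i=1}^{n} \bigl(y_i - (\bar{y} - b_1\bar{x}) - b_1 x_i\bigr) \\
&= \frac{1}{n}\left[\sum_{i=1}^{n} (y_i - \bar{y}) + b_1\sum_{i=1}^{n} (\bar{x} - x_i)\right] \\
&= 0
\end{aligned}
\]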
(c) Standardized residual
• ei are not independent
o ∑ ei = 0 and ∑ xi ei = 0
• Sample variance of the n residuals
\[ \frac{\sum (e_i - \bar{e})^2}{n-2} = \frac{\sum e_i^2}{n-2} = \frac{SSE}{n-2} = MSE \]
o If the model is appropriate, MSE is an unbiased estimator of σ²
• Standardized residual is defined as
\[ \frac{e_i - \bar{e}}{\sqrt{MSE}} = \frac{e_i}{\sqrt{MSE}} \]
o Used at times in residual analysis
o Identifying outlying observations
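The quantities in (b) and (c) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the x and y values are made up (they are not from these notes), and the variable names are chosen here for readability. It checks the two constraints on the residuals, then forms MSE and the standardized residuals.

```python
import math

# Hypothetical data for illustration only (not from these notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates: b1 = S_XY / S_XX and b0 = y_bar - b1 * x_bar.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residuals e_i = y_i - yhat_i.
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# The two linear constraints that make the residuals dependent.
sum_e = sum(e)
sum_xe = sum(xi * ei for xi, ei in zip(x, e))

# MSE = SSE / (n - 2), then standardized residuals e_i / sqrt(MSE).
sse = sum(ei ** 2 for ei in e)
mse = sse / (n - 2)
standardized = [ei / math.sqrt(mse) for ei in e]

print(f"sum e = {sum_e:.2e}, sum x*e = {sum_xe:.2e}, MSE = {mse:.4f}")
print([round(r, 3) for r in standardized])
```

Both sums come out as zero up to floating-point rounding. Because the squared residuals add up to SSE, the squared standardized residuals always add up to n − 2, which gives a quick sanity check.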
(d) Studentized residuals
• Recall
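\[ \mathrm{Var}(y_i) = \mathrm{Var}(\beta_0 + \beta_1 x_i + \varepsilon_i) = \mathrm{Var}(\varepsilon_i) = \sigma^2 \]
o For mean response
\[ \mathrm{Var}(\hat{y}) = \sigma^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{XX}}\right) \]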
• We have (Exercise)
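\[
\begin{aligned}
\mathrm{Var}(e_i) &= \mathrm{Var}(y_i - \hat{y}_i) \\
&= \mathrm{Var}(y_i) + \mathrm{Var}(\hat{y}_i) - 2\,\mathrm{Cov}(y_i, \hat{y}_i) \\
&= \sigma^2\left[1 - \left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{XX}}\right)\right]
\end{aligned}
\]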
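One possible route through the exercise is sketched below. It assumes the model conditions of 4.1(a) (independent errors with common variance σ²) and uses the least-squares slope written as a weighted sum of the observations. The weights h_ij are notation introduced here for the sketch; they do not appear elsewhere in these notes.

```latex
\begin{aligned}
b_1 &= \frac{\sum_{j=1}^{n} (x_j - \bar{x})\, y_j}{S_{XX}},
\qquad
\hat{y}_i = \bar{y} + b_1 (x_i - \bar{x}) = \sum_{j=1}^{n} h_{ij}\, y_j,
\qquad
h_{ij} = \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{XX}} \\
\mathrm{Cov}(y_i, \hat{y}_i) &= h_{ii}\,\sigma^2
\quad \text{(only the } j = i \text{ term survives, by independence)} \\
\mathrm{Var}(\hat{y}_i) &= \sigma^2 \sum_{j=1}^{n} h_{ij}^2
= \sigma^2\left(\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{XX}}\right) = \sigma^2 h_{ii} \\
\mathrm{Var}(e_i) &= \sigma^2 + \sigma^2 h_{ii} - 2\sigma^2 h_{ii} = \sigma^2 (1 - h_{ii})
\end{aligned}
```

The cross term in the sum of squared weights vanishes because the deviations of the x values from their mean sum to zero.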

• Studentized residual is defined as
\[ \frac{e_i}{SD(e_i)} = \frac{e_i}{s\sqrt{1 - \left(\dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{S_{XX}}\right)}} \]
o Identifying outlying observations
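The difference between standardized and studentized residuals can be seen numerically. The sketch below is an illustration only: the data are hypothetical, with one x value placed far from the rest, s is taken to be the square root of MSE, and the name h for the term 1/n + (xi − x̄)²/S_XX is an assumption made here for brevity.

```python
import math

# Hypothetical data; the last x value lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [1.9, 3.2, 3.8, 5.1, 6.0, 12.5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))  # s = sqrt(MSE)

# h_i = 1/n + (x_i - x_bar)^2 / S_XX, the bracketed term in Var(e_i).
h = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

standardized = [ei / s for ei in e]
studentized = [ei / (s * math.sqrt(1 - hi)) for ei, hi in zip(e, h)]

for xi, r, t in zip(x, standardized, studentized):
    print(f"x = {xi:5.1f}  standardized = {r:7.3f}  studentized = {t:7.3f}")
```

Since 1 − h is at most 1, each studentized residual is at least as large in absolute value as the corresponding standardized residual, and the gap is widest for x values far from x̄.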

4.2. Residual analysis


• Check for departure from linear regression model with normal errors
o The regression function is not linear
o The error terms do not have constant variance
o The error terms are not independent
o The model fits all but one or a few outliers
o The error terms are not normally distributed
o One or several important independent variables have been omitted from the model

• Residual plots
o Residuals vs independent variable
o Residuals vs fitted values
o Residuals vs time
o Residuals vs omitted independent variable
o Box plot
o Normal probability plot

Example

Westwood company data


[Figure: five residual plots. Residual vs X (unstandardized residual against lot size); Residual vs predicted value (unstandardized residual against unstandardized predicted value); Time plot (unstandardized residual against production run); Box plot of the unstandardized residuals; Normal Q-Q plot (expected normal against observed value)]

(a) Nonlinearity
• Whether a linear regression function is appropriate for the data being analyzed can be studied from
o Residual vs independent variable
o Residual vs fitted values
o Scatter plot
• Linear model is appropriate
o Residuals fall within a horizontal band centered around 0
• Departure from the linear regression model
o Indication of the trend for a curvilinear regression function

[Figure: two schematic plots of residual e against X]

Example

Transit data
• A study of relation between amount of transit information and bus ridership in eight comparable
test cities
• 8 observations are collected
• Number of bus transit maps distributed free to residents of the city at the beginning of the test
(X)
• Increase during the test period in average daily bus ridership during nonpeak hours (Y)

(a) Simple linear regression model


• Y = β0 + β1 X + ε
• b0 = -1.82
o SE = 1.052
o | t | = |-1.727| < 2.447 = t0.025,6
o β0 is insignificant at 5% level of significance

• b1 = 0.0435
o SE = 0.007
o | t | = 6.484 > 2.447 = t0.025,6
o β1 is significant at 5% level of significance
• R² = 0.875
o The model fits the data well
• ANOVA table
  Source       Sum of Squares   df   Mean Square   F
  Regression   31.7637          1    31.7637       42.0388
  Residual     4.5335           6    0.7556
  Total        36.2972          7
o F-value = 42.04 > 5.99 = F0.05,1,6
ƒ p-value = 0.0006
o Significant linear trend at 5% level of significance
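The entries of the ANOVA table can be re-derived from the two sums of squares and their degrees of freedom. The sketch below uses only numbers quoted above; the variable names are illustrative.

```python
import math

# Values taken from the ANOVA table for the transit data.
ss_regression = 31.7637
ss_residual = 4.5335
df_regression = 1
df_residual = 6

ms_regression = ss_regression / df_regression
mse = ss_residual / df_residual            # residual mean square
f_value = ms_regression / mse              # F statistic
ss_total = ss_regression + ss_residual     # total sum of squares
r_squared = ss_regression / ss_total       # coefficient of determination
t_from_f = math.sqrt(f_value)              # |t| for b1 in simple regression

print(f"MSE = {mse:.4f}, F = {f_value:.2f}, R^2 = {r_squared:.3f}, |t| = {t_from_f:.3f}")
```

The square root of F reproduces the |t| reported for b1, as expected when there is a single regressor.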

• Scatter plot (observed and fitted Y against X)


o The linear model captures the increasing trend, but the relationship between X and Y is more than simply linear
[Figure: scatter plot of increase in ridership against maps distributed]

• Standardized residual against X and ŷ

o Lack of fit of the linear regression function


o Residuals depart from 0 in a systematic fashion
ƒ Negative for smaller ŷ (or X)
ƒ Positive for medium size ŷ (or X)
ƒ Negative for large ŷ (or X)

(b) Quadratic trend model
• Y = β0 + β1 X + β2 X² + ε
• The model captures the relationship well
[Figure: scatter plot of increase in ridership against maps distributed, quadratic fit]

• Standardized residual against predicted values of Y


o No particular pattern is observed
[Figure: standardized residual against unstandardized predicted value]

(b) Nonconstancy of error variance


• The linear model assumes that the error term (ε) has constant variance (σ²)
o Residuals vs independent variable
o Residuals vs fitted values

[Figure: two schematic plots of residual e against X. Left: constant variance. Right: error variance increases with X]

Example

Hormone data
• The data are from the results of two assay experiments for a certain hormone.
• In the experiment, the old (or reference, X) method is compared to a new (or test, Y) method.
• There are 85 measurements for each of the two methods.

• Y = β0 + β1 X + ε
• b0 = 0.08486
o SE = 0.51175
o | t | = 0.166 < 1.989 = t0.025,83
o β0 is insignificant at 5% level of significance
• b1 = 0.95201
o SE = 0.03177
o | t | = 29.970 > 1.989 = t0.025,83
o β1 is significant at 5% level of significance
• R² = 0.9154
o The linear trend model fits the data very well
• Scatter plot of observed and predicted values
o The linear model captures the increasing linear trend very well

• Standardized residual against predicted values


o The variance of the residuals is not constant
o The larger the fitted value (and hence the regressor variable), the more spread out the residuals are
o The relation between the test method and the reference method is positive
o Error variance is larger for larger hormone values than for smaller ones

(c) Outliers
• Outliers are extreme observations
• (Standardized) residual vs independent variable or fitted value
o Outliers are points lying far beyond the scatter of the remaining residuals
• Outliers can create great difficulty
o The model fit is distorted by the outlying cases
• Possible reasons
o Result from a mistake or other extraneous effect
ƒ Discard
ƒ Under least squares method, fitted line may be pulled disproportionately toward an
outlying observation
o Convey significant information
ƒ An interaction with another independent variable omitted from the model

Example

Consider 21 cases of X and Y values


[Figure: scatter plot of y against x for the 21 cases, points labeled by case number]

(a) Full data set


• Y = β0 + β1 X + ε
• b0 = 1.254
o SE = 0.395
o | t | = 3.17 > 2.093 = t0.025,19
o β0 is significant at 5% level of significance
• b1 = 0.0629
o SE = 0.0315
o | t | = 2.00 < 2.093 = t0.025,19
o β1 is insignificant at 5% level of significance
• R² = 0.1737
o The linear trend model fits the data poorly
• Studentized residual against predicted value
o Case 21 is an outlier
[Figure: studentized residual against predicted value of y, points labeled by case number; case 21 lies near -3]

(b) Reduced data set
• Case 21 removed
• 20 cases remain
• No outlying cases remain

[Figure: scatter plot of y against x for the remaining 20 cases, points labeled by case number]

• b0 = 0.967
o SE = 0.291
o | t | = 3.33 > 2.101 = t0.025,18
o β0 is significant at 5% level of significance
• b1 = 0.102
o SE = 0.024
o | t | = 4.20 > 2.101 = t0.025,18
o β1 is significant at 5% level of significance
• R² = 0.4952
o The linear trend model fits the data fairly well
• Including or excluding the outlier changes both the significance of the linear model and the goodness of fit
o Does an outlier always have such an effect?
• Studentized residual against predicted value
o No outliers remain

[Figure: studentized residual against predicted value of y for the reduced data set, points labeled by case number]
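The effect of including or excluding a single outlying case can be reproduced on a toy data set. The sketch below is an illustration under stated assumptions: the eleven (x, y) pairs are hypothetical and are not the 21 cases of this example, and `fit_line` is a helper written here for the purpose.

```python
def fit_line(x, y):
    """Return the least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

# Hypothetical data: ten well-behaved cases plus one outlying case at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 25]
y = [1.0, 1.3, 1.2, 1.5, 1.4, 1.7, 1.6, 1.9, 1.8, 2.1, 2.0]

b0_full, b1_full = fit_line(x, y)                  # all eleven cases
b0_reduced, b1_reduced = fit_line(x[:-1], y[:-1])  # outlying case removed

print(f"full data:    b0 = {b0_full:.3f}, b1 = {b1_full:.4f}")
print(f"reduced data: b0 = {b0_reduced:.3f}, b1 = {b1_reduced:.4f}")
```

In this toy case the single outlying point pulls the fitted slope well below the slope obtained from the other ten cases, the same direction of change seen between the full and reduced fits above.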

(d) Nonindependence
• The linear model assumes all error terms (and hence, all observations) are independent
• Whenever data are obtained in a time sequence, it is a good idea to prepare a time plot of the residuals
o Check for any correlation between the error terms over time
• Independent error terms
o Residuals fluctuate in a random pattern around the base line 0

• Lack of randomness
o Too much alternation or too little alternation
• Correlation between error terms
o Some effect connected with time (but not included in the regression model) was present
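"Too much" or "too little" alternation can be made concrete by counting how often consecutive residuals change sign. The sketch below is only an informal aid, not a formal test and not a method stated in these notes; the two residual sequences are hypothetical and `count_sign_changes` is a helper named here for illustration. For n independent, symmetric errors one would expect roughly (n − 1)/2 sign changes.

```python
def count_sign_changes(residuals):
    """Count sign changes between consecutive nonzero residuals (time order)."""
    signs = [1 if r > 0 else -1 for r in residuals if r != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Hypothetical residual sequences, listed in time order.
drifting = [-1.2, -0.8, -0.5, -0.1, 0.3, 0.7, 1.0, 0.6, 0.2, -0.4]
alternating = [0.9, -1.1, 0.8, -0.7, 1.2, -0.9, 0.6, -1.0, 0.7, -0.5]

for name, seq in [("drifting", drifting), ("alternating", alternating)]:
    changes = count_sign_changes(seq)
    print(f"{name}: {changes} sign changes out of {len(seq) - 1} possible")
```

The drifting sequence changes sign far less often than (n − 1)/2 and the alternating one far more often; either extreme suggests the error terms are not independent over time.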

[Figure: residuals plotted against time]

Example

Ice-cream data
• The data give the ice cream consumption over 30 four-week periods from 18 March 1950 to 11 July 1953. There are 30 observations on 3 variables.
• Period
o The week of the study
• Consumption (Y)
o The ice cream consumption (in pints per capita)
• Temp (X)
o The mean temperature (in degrees F)

• b0 = 0.2069
o SE= 0.0247
o | t | = 8.375 > 2.048 = t0.025,28
o β0 is significant at 5% level of significance
• b1 = 0.0031
o SE = 0.0005
o | t | = 6.502 > 2.048 = t0.025,28
o β1 is significant at 5% level of significance
• R² = 0.602
o The model fits quite well
• Scatter plot of observed and predicted values

• Standardized residual against time (period)
o Correlation between the error terms with time
o Too little alternation
o The points follow a systematic curve

(e) Nonnormality
• Significance tests (F or t-tests) are based on the normality assumption (for the error term)
o Small departures from normality create no serious problems
o Major departures are of concern
• Distribution plots
o Boxplot, histogram, stem-and-leaf plot
o Check whether such a plot shows any gross departure from normality
• Comparison of frequencies
o Actual frequencies of the residuals vs expected frequencies under normality
o Using standard normal distribution
ƒ ~68% of standardized residuals between -1 and +1
ƒ ~90% between -1.645 and +1.645
o Using t distribution when sample size is small
• Normal probability plot / Normal Q-Q plot
o (Standardized) Residual vs expected quantile under normality
o Expected quantile of the i-th smallest standardized residual is
\[ z\left(\frac{i - 0.375}{n + 0.25}\right) \]
ƒ z(q) = the (q × 100)th percentile of the standard normal distribution
o A nearly linear plot suggests agreement with normality
ƒ A substantial departure from linearity suggests a non-normal distribution
o Typical residual plots of Q-Q plot
ƒ Observed residual against expected value

• Significance test for normality
o Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test, Cramér-von Mises test
o Do not reject normality when p-value > significance level

Example

Market data
• The data give the Market rate (Y) and the account rate (X)
• 54 cases

• b0 = 0.848
o SE = 1.976
o | t | = 0.429 < 2.007 = t0.025,52
o β0 is insignificant at 5% level of significance
• b1 = 0.610
o SE = 0.143
o | t | = 4.263 > 2.007 = t0.025,52
o β1 is significant at 5% level of significance
• R² = 0.259
o The model does not fit well
• For the standardized residual
o Skewness = 1.034
ƒ Positively skewed
o Kurtosis = 0.376
ƒ A little leptokurtic
ƒ Heavier tails than normal
o 72.2% between −1 and +1
ƒ 68.3% based on standard normal
ƒ 67.8% based on t52
o 92.6% between −2 and +2
ƒ 95.5% based on standard normal
ƒ 94.9% based on t52
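The standard normal reference percentages used in this comparison can be reproduced with Python's standard library. This is a small sketch; `central_probability` is a helper named here for illustration, and the t52 figures are not recomputed because they need a t distribution, which the standard library does not provide.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def central_probability(k):
    """Return P(-k < Z < k) for a standard normal Z."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_1 = central_probability(1.0)
within_1645 = central_probability(1.645)
within_2 = central_probability(2.0)

# Observed proportions quoted above for the standardized residuals.
observed = {1.0: 0.722, 2.0: 0.926}

print(f"P(|Z| < 1)     = {within_1:.4f}  (observed {observed[1.0]:.3f})")
print(f"P(|Z| < 1.645) = {within_1645:.4f}")
print(f"P(|Z| < 2)     = {within_2:.4f}  (observed {observed[2.0]:.3f})")
```

More residuals than expected fall within ±1 and fewer than expected within ±2, which points away from a normal shape.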

• Histogram
o Positively skewed
o Heavier tails
[Figure: histogram of the residuals; vertical axis: percent, horizontal axis: residual]

• Normal Q-Q plot


o Order the residuals
o Calculate the quantiles from standard normal

x       y       ŷ       Res     STD Res   Order (i)   Expected z
6.42    -1.63   4.77    -6.40   -1.27     1           -2.27
6.78    -1.34   4.99    -6.33   -1.26     2           -1.88
18.45   5.86    12.11   -6.25   -1.24     3           -1.66
15.74   4.37    10.45   -6.08   -1.21     4           -1.50
32.58   14.73   20.73   -6.00   -1.19     5           -1.37
8.19    0.24    5.85    -5.61   -1.11     6           -1.26
12.02   3.11    8.18    -5.07   -1.01     7           -1.16

o The points do not fall on the diagonal line
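The "Expected z" column can be recomputed from the quantile formula z((i − 0.375)/(n + 0.25)) with n = 54. The sketch below relies on `statistics.NormalDist.inv_cdf` from the Python standard library; the function name `expected_normal_quantile` is chosen here for illustration.

```python
from statistics import NormalDist

def expected_normal_quantile(i, n):
    """Expected standard normal quantile of the i-th smallest of n residuals."""
    return NormalDist().inv_cdf((i - 0.375) / (n + 0.25))

n = 54  # number of cases in the market data
expected = [round(expected_normal_quantile(i, n), 2) for i in range(1, 8)]
print(expected)
```

The first seven values agree with the table to two decimals. Plotting the ordered residuals against these quantiles gives the normal Q-Q plot.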


[Figure: normal Q-Q plot of residual against normal quantiles]

• Normality tests

Test                 Statistic   P-value
Shapiro-Wilk         0.897964    0.0002
Kolmogorov-Smirnov   0.191956    <0.0100
Cramér-von Mises     0.329436    <0.0050
Anderson-Darling     1.879455    <0.0050

o All p-values < 0.05


o The distribution of the error term is significantly different from normal

(f) Omission of important independent variables


• Residuals should be plotted against variables omitted from the model that might have
important effects on the response
o e.g. Time variable
o Determine whether there are any other key independent variables that could provide
important additional descriptive and predictive power to the model

Example

Fat data
• The data come from a study investigating a new method of measuring body composition.
• The body fat percentage, age and gender are given for 18 adults aged between 23 and 61.
• There are 18 observations on three variables.
• Age (X)
o The age of the subject in (completed) years
• Percent (Y)
o The body fat percentage of the subject
• Gender
o The gender of the subject

• b0 = 3.221
o SE = 5.076
o | t | = 0.635 < 2.120 = t0.025,16
o β0 is insignificant at 5% level of significance
• b1 = 0.548
o SE = 0.106
o | t | = 5.191 > 2.120 = t0.025,16
o β1 is significant at 5% level of significance
• R² = 0.627
o The model fits quite well

• Scatter plot
o The regression model captures the linear relationship

• Standardized residual plot and Q-Q plot show that the model assumptions are valid

• Consider an extra effect


o Standardized residual against gender
o Residuals for males are negative
o Gender has a definite effect on body fat percentage
o The model is still appropriate with the omission of gender
o Inclusion of gender improves the model
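Checking residuals against an omitted categorical variable amounts to comparing the residuals group by group. The sketch below illustrates the idea on made-up numbers: the ages, genders and percentages are hypothetical (they are not the fat data), and the regression uses age only, as in the model above.

```python
from collections import defaultdict

# Hypothetical data for illustration only (not the fat data set).
age = [25, 35, 45, 55, 25, 35, 45, 55]
gender = ["F", "F", "F", "F", "M", "M", "M", "M"]
percent = [27.0, 33.5, 37.0, 43.5, 18.0, 22.5, 28.0, 32.5]
n = len(age)

# Fit percent on age only, ignoring gender.
x_bar = sum(age) / n
y_bar = sum(percent) / n
s_xx = sum((a - x_bar) ** 2 for a in age)
b1 = sum((a - x_bar) * (p - y_bar) for a, p in zip(age, percent)) / s_xx
b0 = y_bar - b1 * x_bar
residuals = [p - (b0 + b1 * a) for a, p in zip(age, percent)]

# Group the residuals by the omitted variable and compare their means.
by_gender = defaultdict(list)
for g, e in zip(gender, residuals):
    by_gender[g].append(e)
mean_residual = {g: sum(es) / len(es) for g, es in by_gender.items()}

print(mean_residual)
```

A clear difference between the group means of the residuals, as in this toy case, is the numerical counterpart of the pattern seen in the plot of residuals against gender.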

• A modified regression model with different gender effects
