Math 445 Chapter 8: A Closer Look at Assumptions for Simple Linear Regression

1. Linearity

2. Constant variance

3. Normality

4. Independence

Assumptions 1, 2 and 4 are the most important. Violation of 1 can bias estimates of means and

predictions. Violations of 2 and 4 can lead to under- or over-estimates of standard errors and misleading

inferences and confidence intervals. Violation of 3 is only a problem when sample sizes are small. An

exception is prediction intervals for an individual response which depend critically on the normality

assumption (confidence intervals for the mean response at a particular X are robust to normality because

of the Central Limit Theorem).

Assessing assumptions

Linearity and constant variance assumptions: assess through scatterplots, smoothing (loess, for example),

and residual plots
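A residual plot graphs the residuals from a least-squares fit against the fitted values. The following is a minimal sketch of how those two quantities are computed, in plain Python. The data values and the function name `fit_line` are made up for illustration and are not the ozone data.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical illustration data (not from the course data set).
x = [60, 65, 70, 75, 80, 85, 90]
y = [12, 15, 22, 30, 45, 70, 110]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Plot residuals against fitted values: curvature suggests nonlinearity,
# a funnel shape suggests nonconstant variance.
for f, r in zip(fitted, residuals):
    print(f"{f:8.2f} {r:8.2f}")
```

With these made-up values the residuals are positive at both ends and negative in the middle, which is the kind of curved pattern that signals a nonlinear relationship.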

Example: Ozone level and maximum temperature on 111 days at a location in New Jersey, summer 1973

[Figure: left panel, Ozone (ppb) versus Maximum temperature (F); right panel, Unstandardized Residual versus Unstandardized Predicted Value.]

• The relationship is not linear and the variance appears to increase as temperature increases. These

violations suggest transforming the response variable (transforming the explanatory variable will

not solve the nonconstant variance problem).

• When deciding whether to transform the response variable or the explanatory variable (or both),

sometimes it is helpful to look at histograms of each variable individually. If the distribution of

either variable is skewed, this suggests transforming that variable. In this example, the

distribution of ozone is skewed to the right while the distribution of temperature is roughly

symmetric.

• See Display 8.6 on p. 213 for suggested courses of action for other patterns.


• Recall the ladder of powers: the family of power transformations (Chapter 10 of DeVeaux,

Velleman and Bock). Examples:

2 represents squaring ($y^2$)
1 represents no transformation ($y$)
½ represents square root ($\sqrt{y}$)
0, by convention, represents $\log(y)$ (to any base)
-1/2 represents reciprocal square root ($-1/\sqrt{y}$) (the negative sign preserves the original order)
-1 represents reciprocal ($-1/y$)

For univariate data, powers less than 1 are often used for variables whose distribution is skewed to the

right; the stronger the skew, often the smaller the power needed (0 is smaller than ½, -1/2 is smaller than

0, etc.).
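The ladder of powers can be sketched as a single function. This is a minimal illustration, assuming strictly positive data; the function name `ladder_transform` is hypothetical, and base 10 is chosen for the log only because the ozone example uses Log10.

```python
import math

def ladder_transform(y, power):
    """Apply a ladder-of-powers transformation to a positive value y.

    power == 0 means log10 by convention; negative powers are negated
    so that the original ordering of the data is preserved.
    """
    if y <= 0:
        raise ValueError("this sketch assumes y > 0")
    if power == 0:
        return math.log10(y)
    value = y ** power
    return -value if power < 0 else value

# Walk down the ladder for three example values.
for p in (2, 1, 0.5, 0, -0.5, -1):
    print(p, [round(ladder_transform(y, p), 3) for y in (1, 4, 100)])
```

On every rung the transformed values stay in the same order as the original values, which is the purpose of the negative sign on the reciprocal rungs.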

Log transformation is generally the most interpretable, though other transformations are sometimes

interpretable in special situations (see bottom of p. 216; in particular, the inverse transformation is

interpretable for ratios, where miles per gallon, for example, becomes gallons per mile).

It is easy to try different transformations (in the SPSS Chart Editor, power transformations with
non-negative exponents can be applied to X, Y, or both).

A log transformation works well for the Ozone data, making the relationship more linear and the variance

more constant. There is one moderate outlier which we’ll address later.

[Figure: left panel, Log10(Ozone) versus Maximum temperature (F); right panel, Unstandardized Residual versus Unstandardized Predicted Value.]

Coefficients (a)

                             Unstandardized         Standardized                      95% Confidence
                             Coefficients           Coefficients                      Interval for B
Model                        B         Std. Error   Beta           t        Sig.      Lower Bound   Upper Bound
1  (Constant)                -.8028    .1976                       -4.062   .000      -1.1945       -.4111
   Maximum temperature (F)    .0294    .0025        .745           11.654   .000       .0244         .0344

a. Dependent Variable: Log10(Ozone)


Before proceeding to the interpretation of this model, we first address the other assumptions: normality

and independence. Normality is not crucial with larger sample sizes, but we should make sure that there

is not strong skewness or outliers. The assumption of a normal distribution at each value of X means that

the residuals $\varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i)$ are assumed to be $N(0, \sigma)$. Thus we can look at the distribution of
the observed residuals $\mathrm{res}_i = e_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$ with a histogram and/or normal probability plot.

[Figure: left panel, histogram (Frequency) of Unstandardized Residual; right panel, normal probability plot of the residuals against Observed Value.]

The residuals for the log(Ozone) model appear quite symmetrically distributed with only one mild outlier

on the negative end.
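The coordinates of a normal probability plot can be computed directly. This is a sketch using only the Python standard library; the residual values are hypothetical, the function name `normal_plot_points` is made up, and the plotting positions (i - 0.5)/n are one common convention, not necessarily the one SPSS uses.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each sorted residual with its expected standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n are an assumption of this sketch.
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(ordered, quantiles))

# Hypothetical residuals for illustration only.
example = [0.12, -0.30, 0.05, 0.41, -0.08, -0.95, 0.22, 0.01, -0.17]
for observed, expected in normal_plot_points(example):
    print(f"{observed:6.2f} {expected:6.2f}")
```

If the residuals are roughly normal, the printed pairs fall close to a straight line; a single low value such as the -0.95 here shows up as a point well off the line at the negative end.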

The assumption of independence of the residuals can only be judged from the sampling plan and/or from

plotting the residuals versus time order or other covariates that may have been measured. For example, if

these observations had come from two different locations, then the independence assumption would be

violated. We would want to examine a scatterplot with the points from the two locations identified to see

if the relationship were different at the two locations. We would also want to plot the residuals versus day

number to see if there were patterns in the residuals.

[Figure: Unstandardized Residual versus Sequence number.]


Interpretation of transformed model

$\hat{\mu}[\log(\text{Ozone} \mid \text{Temp})] = -0.8028 + 0.0294\,\text{Temp}$

If we transform back, by raising 10 to the power of each side, the left-hand side does not become the mean of Y because

the mean of the logged data is not the log of the mean of the raw data. However, if the transformation has

succeeded in making the distribution of the log(Y) values symmetric about their mean, then

$\text{Median}[\log(Y \mid X)] = \mu[\log(Y \mid X)]$

Medians can be transformed back: the median of the logged data is the log of the median of the original

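data. Therefore, we can say:

$\text{Estimated Median}(\text{Ozone} \mid \text{Temp}) = 10^{-0.8028 + 0.0294\,\text{Temp}} = (0.1575)\,10^{0.0294\,\text{Temp}}$

Note that

$\dfrac{\text{Estimated Median}(\text{Ozone} \mid \text{Temp}+1)}{\text{Estimated Median}(\text{Ozone} \mid \text{Temp})} = \dfrac{(0.1575)\,10^{0.0294(\text{Temp}+1)}}{(0.1575)\,10^{0.0294\,\text{Temp}}} = 10^{0.0294} = 1.070$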

This means that median ozone level is estimated to increase by a factor of 1.070, or 7.0%, for every one

degree increase in maximum temperature (95% confidence interval 5.8% to 8.2%, since $10^{0.0244} = 1.058$
and $10^{0.0344} = 1.082$).
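The back-transformation arithmetic can be checked with a few lines of Python. This is only a sketch: the numbers are copied from the Coefficients table, while the variable names and the function `estimated_median_ozone` are made up for illustration.

```python
# Coefficients and confidence bounds copied from the SPSS Coefficients table.
b0, b1 = -0.8028, 0.0294
b1_lower, b1_upper = 0.0244, 0.0344

def estimated_median_ozone(temp):
    """Back-transform the fitted log10 model to an estimated median in ppb."""
    return 10 ** (b0 + b1 * temp)

factor = 10 ** b1  # multiplicative change in the median per 1 degree F
print(round(10 ** b0, 4))   # multiplier in front of 10^(b1 * Temp)
print(round(factor, 3))
print(round(10 ** b1_lower, 3), round(10 ** b1_upper, 3))
# The ratio of estimated medians one degree apart equals the same factor.
print(round(estimated_median_ozone(81) / estimated_median_ozone(80), 3))
```

The ratio does not depend on the starting temperature, which is why the slope can be reported as a single percentage increase per degree.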

Model Summary

                              Adjusted    Std. Error of
Model   R         R Square    R Square    the Estimate
1       .745(a)   .555        .551        .25207

a. Predictors: (Constant), Maximum temperature (F)

ANOVA (b)

Model            Sum of Squares    df    Mean Square    F          Sig.
1  Regression     8.629              1   8.629          135.813    .000(a)
   Residual       6.926            109    .064
   Total         15.555            110

a. Predictors: (Constant), Maximum temperature (F)
b. Dependent Variable: Log10(Ozone)

Coefficients (a)

                             Unstandardized         Standardized                      95% Confidence
                             Coefficients           Coefficients                      Interval for B
Model                        B         Std. Error   Beta           t        Sig.      Lower Bound   Upper Bound
1  (Constant)                -.8028    .1976                       -4.062   .000      -1.1945       -.4111
   Maximum temperature (F)    .0294    .0025        .745           11.654   .000       .0244         .0344

a. Dependent Variable: Log10(Ozone)


The t-statistics and P-values (“Sig.”) reported in the Coefficients table are for testing the hypothesis

$H_0: \beta_0 = 0$ and the hypothesis $H_0: \beta_1 = 0$. The former is usually not of interest, but the latter is a test

of the equal-means model.

The ANOVA table is precisely analogous to the ANOVA table for comparing several groups. It

compares the linear regression model with 2 parameters for the means ($\beta_0$ and $\beta_1$), which is the full
model, to the equal-means model $\mu(Y \mid X) = \beta_0$, which is the reduced model.

• Total sum of squares = residual sum of squares for equal-means (reduced) model = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$.

• Residual sum of squares = residual sum of squares for full model = $\sum_{i=1}^{n} \mathrm{res}_i^2$.

• Mean square residual = $\frac{1}{n-2} \sum_{i=1}^{n} \mathrm{res}_i^2 = \hat{\sigma}^2$

The F-test is a test of the simple linear regression model versus the equal-means model. Since the only

difference between the two models is the parameter $\beta_1$, this is a two-sided test of the hypothesis
$H_0: \beta_1 = 0$. This is mathematically equivalent to the t-test of this hypothesis that is reported in the

regression coefficients table.
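The equivalence of the F-test and the t-test can be checked numerically: the F-statistic should equal the square of the slope's t-statistic. This sketch uses the rounded values from the ANOVA and Coefficients tables for Log10(Ozone), so agreement holds only up to rounding; the variable names are made up.

```python
# Values copied from the ANOVA and Coefficients tables for Log10(Ozone).
ss_regression, df_regression = 8.629, 1
ss_residual, df_residual = 6.926, 109
t_slope = 11.654

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual      # estimate of sigma^2
f_stat = ms_regression / ms_residual

print(round(ms_residual, 3))    # table reports .064
print(round(f_stat, 1))         # table reports 135.813 (computed from unrounded sums)
print(round(t_slope ** 2, 1))   # square of the slope t-statistic
```

Both statistics come out at about 135.8, and the small differences in later digits are due to the rounding of the table entries.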

The R-squared statistic, or coefficient of determination, gives us the percentage of the total variation in the
response, y, that is explained by the explanatory variable, x, which for our example yields:

$R^2 = \dfrac{\text{total sum of squares} - \text{residual sum of squares}}{\text{total sum of squares}} = \dfrac{15.555 - 6.926}{15.555} = 0.555$

The residual sum of squares is the deviation in y away from the regression model and hence the difference

of the total variation and the residual variation represents the reduction in the variation achieved by

modeling y in terms of x.

For linear regression, $R^2$ is identical to the square of the sample correlation coefficient for the response
and the explanatory variable. Hence, this quantity is a valid measure only if the assumptions are met (i.e.,
the data are random samples), and it should never be used to evaluate the adequacy of the linear model.
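The entries of the Model Summary table can be reproduced from the ANOVA sums of squares. This is a sketch using the rounded table values with n = 111 days. The adjusted R-squared and standard-error formulas are the standard ones for a single explanatory variable; they are not stated in these notes and are included here as an assumption.

```python
import math

# Values copied from the ANOVA table for Log10(Ozone); n = 111 days.
ss_total, ss_residual, n = 15.555, 6.926, 111

r_squared = (ss_total - ss_residual) / ss_total
r = math.sqrt(r_squared)  # magnitude of the sample correlation
# Adjusted R-squared for one explanatory variable (standard formula, an assumption here).
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)
# Standard error of the estimate = square root of the mean square residual.
std_error_estimate = math.sqrt(ss_residual / (n - 2))

print(round(r_squared, 3), round(r, 3))
print(round(adj_r_squared, 3), round(std_error_estimate, 5))
```

The four printed values match the R, R Square, Adjusted R Square, and Std. Error of the Estimate entries in the SPSS output.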


Case Study 8.2: Breakdown times for Insulating Fluid

ANOVA
Log(Time)

                  Sum of Squares    df    Mean Square    F         Sig.
Between Groups    196.477            6    32.746         13.004    .000
Within Groups     173.749           69     2.518
Total             370.226           75

ANOVA (b)

Model            Sum of Squares    df    Mean Square    F         Sig.
1  Regression    190.151             1   190.151        78.141    .000(a)
   Residual      180.075            74     2.433
   Total         370.226            75

a. Predictors: (Constant), Voltage (kV)
b. Dependent Variable: Log(Time)

Questions:

1) How much is the residual sum of squares lowered by going from the 2 parameter regression model

to the 7 parameter ‘separate means model’?

Between

Regression

Lack of fit

Within

Total
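The arithmetic behind the "Lack of fit" row can be sketched from the two ANOVA tables above. This assumes the worksheet intends the usual extra-sum-of-squares comparison of the regression model (reduced) against the separate-means model (full); the variable names are made up.

```python
# Values copied from the two ANOVA tables for Log(Time).
ss_within, df_within = 173.749, 69        # separate-means model (7 parameters)
ss_reg_resid, df_reg_resid = 180.075, 74  # regression model (2 parameters)

# The drop in residual sum of squares is the lack-of-fit sum of squares.
ss_lack_of_fit = ss_reg_resid - ss_within
df_lack_of_fit = df_reg_resid - df_within
ms_lack_of_fit = ss_lack_of_fit / df_lack_of_fit
ms_within = ss_within / df_within
f_lack_of_fit = ms_lack_of_fit / ms_within

print(round(ss_lack_of_fit, 3), df_lack_of_fit)
print(round(f_lack_of_fit, 3))
```

The same lack-of-fit sum of squares results from subtracting the regression sum of squares (190.151) from the between-groups sum of squares (196.477).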

