Chapter 7 The Simple Linear Regression Model

A common model for the relationship between two quantitative variables is the linear regression model. Don't be fooled by the "linear" part: as we'll see, linear regression models can often be used to model relationships which aren't linear.

Although we looked at the linear regression model last semester, we only looked at one part of it: the part that models the mean response Y as a linear function of X. We'll extend the model to describe the scatter of the individual data points around the line. The way we extend it makes the linear regression model exactly like the ANOVA model, except that the explanatory variable is quantitative instead of categorical. We assume that at each X, the distribution of Y values is normal with mean $\beta_0 + \beta_1 X$ and standard deviation $\sigma$.
$$\mu(Y \mid X) = \beta_0 + \beta_1 X$$

$$\sigma(Y \mid X) = \sigma$$

Data: $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$. The $Y_i$'s are assumed to be independent.

Least squares estimates of $\beta_0$ and $\beta_1$ are denoted by $\hat\beta_0$ and $\hat\beta_1$. The predicted or fitted value of Y for a particular X is:

$$\hat\mu(Y \mid X) = \hat\beta_0 + \hat\beta_1 X.$$

This is also denoted $\hat Y$ in many books. The fitted values for the data points are:

$$\hat Y_i = \mathrm{fit}_i = \hat\beta_0 + \hat\beta_1 X_i$$

and the residuals are:

$$\mathrm{res}_i = Y_i - \mathrm{fit}_i = Y_i - \hat Y_i.$$

The residuals are sometimes denoted $e_i$ in other texts. By modeling the distribution of data points around the line, we can make inferences from the sample data about the regression parameters.
Case Study 7.2: Meat Processing and pH

ANOVA (b)

Model 1       Sum of Squares   df   Mean Square   F         Sig.
Regression    3.00647           1   3.00647       444.306   .000 (a)
Residual       .05413           8    .00677
Total         3.06060           9

a. Predictors: (Constant), Log(hours)
b. Dependent Variable: pH

Coefficients (a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B         Std. Error          Beta                        t          Sig.
(Constant)    6.9836    .0485                                           143.897    .000
Log(hours)    -.7257    .0344               -.991                       -21.079    .000

a. Dependent Variable: pH

Hours   pH     Log(hours)   fit       res
1       7.02   0            6.9836     0.0364
1       6.93   0            6.9836    -0.0536
2       6.42   0.69         6.4806    -0.0606
2       6.51   0.69         6.4806     0.0294
4       6.07   1.39         5.9777     0.0923
4       5.99   1.39         5.9777     0.0123
6       5.59   1.79         5.6834    -0.0934
6       5.80   1.79         5.6834     0.1166
8       5.51   2.08         5.4747     0.0353
8       5.36   2.08         5.4747    -0.1147
Another (equivalent) way to write the linear regression model is

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where the $\varepsilon_i$'s are independent $N(0, \sigma)$ random variables, that is, normal with mean 0 and standard deviation $\sigma$.
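To make the error form of the model concrete, the following is a minimal simulation sketch. The parameter values (7.0, -0.7, 0.08), the X grid, and the function name `simulate_responses` are illustrative assumptions, not values estimated from any data set in these notes.

```python
import math
import random


def simulate_responses(x_values, beta0, beta1, sigma, seed=0):
    """Draw one Y per X from Y = beta0 + beta1*X + eps, with eps ~ N(0, sigma)."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]


# Illustrative (assumed) parameter values and a fixed grid of X values.
BETA0, BETA1, SIGMA = 7.0, -0.7, 0.08
x_values = [0.0, 0.5, 1.0, 1.5, 2.0] * 200
y_values = simulate_responses(x_values, BETA0, BETA1, SIGMA)

# The errors Y - (beta0 + beta1*X) should average near 0 with spread near sigma.
errors = [y - (BETA0 + BETA1 * x) for x, y in zip(x_values, y_values)]
mean_error = sum(errors) / len(errors)
sd_error = math.sqrt(
    sum((e - mean_error) ** 2 for e in errors) / (len(errors) - 1)
)
print(round(mean_error, 4), round(sd_error, 4))
```

The X values are held fixed while the Y values are random, which matches how the sampling distributions are described later in the chapter.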
Formulas for least squares estimators:

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2}$$

$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$$
Mean of residuals is 0 (always true for least squares)
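As a check on the least squares formulas, here is a short sketch that applies them to the steer data from Case Study 7.2 (hours and pH as listed in the data table, with natural logs of hours as X). The function and variable names are my own; the results should be close to the SPSS estimates 6.9836 and -.7257.

```python
import math

# Steer carcass data from the case study table.
hours = [1, 1, 2, 2, 4, 4, 6, 6, 8, 8]
ph = [7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36]
x = [math.log(h) for h in hours]  # natural log of hours


def least_squares(x, y):
    """Return (b0, b1) from the closed-form least squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(x, ph)

# Residuals res_i = Y_i - fit_i; their mean is 0 up to floating-point error.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, ph)]
mean_residual = sum(residuals) / len(residuals)

print(round(b0, 4), round(b1, 4), abs(mean_residual) < 1e-12)
```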
Estimate of $\sigma$ is

$$\hat\sigma = \sqrt{\frac{\text{sum of squared residuals}}{\text{degrees of freedom}}} = \sqrt{\frac{\sum_{i=1}^{n} \mathrm{res}_i^2}{n-2}}.$$

Degrees of freedom = n - (number of parameters in the model for the means) = n - 2 for simple linear regression.

The ANOVA table gives the sum of squared residuals and the mean square residual, which is $\hat\sigma^2 = 0.00677$, so $\hat\sigma = 0.0823$.
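The residual line of the ANOVA table can be reproduced from the raw data. This is a sketch under the same assumptions as the fit above (steer data from the case study, natural logs); variable names are my own.

```python
import math

hours = [1, 1, 2, 2, 4, 4, 6, 6, 8, 8]
ph = [7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36]
x = [math.log(h) for h in hours]
n = len(x)

# Least squares fit (closed-form formulas).
x_bar = sum(x) / n
y_bar = sum(ph) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, ph)) / sxx
b0 = y_bar - b1 * x_bar

# Sum of squared residuals, degrees of freedom, and the estimate of sigma.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, ph))
df_resid = n - 2                      # n minus the two mean parameters
mean_square_resid = sse / df_resid    # this is sigma-hat squared
sigma_hat = math.sqrt(mean_square_resid)

print(round(sse, 5), df_resid, round(mean_square_resid, 5), round(sigma_hat, 4))
```

The printed values should agree with the Residual row of the ANOVA table (.05413, 8, .00677) up to rounding.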
The standard errors of $\hat\beta_0$ and $\hat\beta_1$ represent the estimated standard deviations of the sampling distributions of $\hat\beta_0$ and $\hat\beta_1$. The sampling distributions refer to how the least squares estimates would vary from sample to sample. We view the $X_i$'s as fixed: they remain the same from sample to sample, while the $Y_i$'s are random.

$$SE(\hat\beta_1) = \hat\sigma \sqrt{\frac{1}{(n-1)s_X^2}}, \qquad SE(\hat\beta_0) = \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar X^2}{(n-1)s_X^2}}$$

where $s_X^2$ is the sample variance of the $X_i$'s.

Confidence intervals for slope and intercept are

$$\text{Estimate} \pm t_{df}(1 - \alpha/2)\, SE(\text{Estimate})$$
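The standard error and confidence interval formulas can be sketched for the steer data as follows. The critical value 2.306 for $t_8(.975)$ is taken from the worked example in these notes rather than computed; the helper name `confidence_interval` is my own.

```python
import math
import statistics

hours = [1, 1, 2, 2, 4, 4, 6, 6, 8, 8]
ph = [7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36]
x = [math.log(h) for h in hours]
n = len(x)

x_bar = statistics.mean(x)
y_bar = statistics.mean(ph)
s2_x = statistics.variance(x)   # sample variance of log(hours)
sxx = (n - 1) * s2_x            # (n - 1) * s_X^2

b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, ph)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, ph))
sigma_hat = math.sqrt(sse / (n - 2))

# Standard errors of the slope and intercept.
se_b1 = sigma_hat * math.sqrt(1 / sxx)
se_b0 = sigma_hat * math.sqrt(1 / n + x_bar ** 2 / sxx)

T_CRIT = 2.306  # t_8(.975), value quoted in the notes


def confidence_interval(estimate, se, t_crit=T_CRIT):
    """Estimate +/- t * SE(Estimate)."""
    half_width = t_crit * se
    return estimate - half_width, estimate + half_width


slope_ci = confidence_interval(b1, se_b1)
intercept_ci = confidence_interval(b0, se_b0)
print([round(v, 3) for v in slope_ci], [round(v, 3) for v in intercept_ci])
```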
Example: Steer carcass data
Predicted pH = 6.9836 - .7257 Log(Hours), where Log is the natural logarithm.

Inferences for slope: Mean pH is estimated to decrease by .7257 for every one-unit increase in Log(Hours). A one-unit increase in Log(Hours) is an increase in Hours by a factor of e ≈ 2.72. If we had used Log10(Hours) instead, the interpretation would be easier: the slope would represent the change in predicted pH for every 10-fold increase in time since slaughter.

A 95% confidence interval for $\beta_1$ is

$$-.7257 \pm t_8(.975)(.0344) = -.7257 \pm 2.306(.0344) = -.7257 \pm .0793 = -.805 \text{ to } -.646.$$

So we are 95% confident that the decrease in mean pH is between .646 and .805 for every 2.72-fold increase in time since slaughter. The confidence interval can also be obtained from SPSS by choosing Statistics in the Analyze…Regression…Linear window.
Coefficients (a)

              Unstandardized Coefficients   Standardized Coefficients                        95% Confidence Interval for B
Model 1       B        Std. Error           Beta                        t          Sig.     Lower Bound   Upper Bound
(Constant)    6.984    .049                                             143.897    .000     6.872         7.096
Log(hours)    -.726    .034                 -.991                       -21.079    .000     -.805         -.646

a. Dependent Variable: pH
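Because $\log_{10}(h) = \ln(h)/\ln(10)$, a slope on the natural-log scale can be converted to a change per k-fold increase in Hours by multiplying by $\ln(k)$. The sketch below applies this to the slope estimate and interval reported above; the function name `per_fold_change` is my own, and the doubling case is an added illustration rather than something computed in the notes.

```python
import math

# Slope and 95% interval on the natural-log scale, as reported above.
slope_ln = -0.7257
slope_ci_ln = (-0.805, -0.646)


def per_fold_change(slope, fold):
    """Change in mean response when X's original variable is multiplied by `fold`."""
    return slope * math.log(fold)


# Change in mean pH per 10-fold increase in hours (the Log10 slope).
per_tenfold = per_fold_change(slope_ln, 10)
per_tenfold_ci = tuple(per_fold_change(b, 10) for b in slope_ci_ln)

# Change in mean pH per doubling of hours.
per_doubling = per_fold_change(slope_ln, 2)

print(round(per_tenfold, 3), round(per_doubling, 3))
```

Multiplying by $\ln(e) = 1$ recovers the original slope, which is the "2.72-fold" interpretation used in the text.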
Inferences for intercept: The intercept $\beta_0$ represents the mean value of Y when X = 0. Usually, this is not particularly meaningful. It is usually more meaningful to estimate the mean value of Y at particular values of X which are meaningful and interesting, which is covered next.

Inferences for the mean response at a particular value of X: Inferences about the slope of the regression line tell us how big the change in the mean response (Y) is for a 1-unit increase in X. Sometimes, we are interested in a confidence interval for the mean response at a particular X, say $X_0$. According to the model, the true mean of Y at $X_0$ is $\mu(Y \mid X_0) = \beta_0 + \beta_1 X_0$. The estimate of this is $\hat\mu(Y \mid X_0) = \hat\beta_0 + \hat\beta_1 X_0$. The standard error of $\hat\mu(Y \mid X_0)$ is

$$SE[\hat\mu(Y \mid X_0)] = \hat\sigma \sqrt{\frac{1}{n} + \frac{(X_0 - \bar X)^2}{(n-1)s_X^2}}$$

Note that the standard error is bigger for values of $X_0$ further from $\bar X$ and is smallest at $X_0 = \bar X$.
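One way to see where this standard error comes from is sketched below. It uses the standard facts that, with the $X_i$'s fixed, $\bar Y$ and $\hat\beta_1$ are uncorrelated, $\mathrm{Var}(\bar Y) = \sigma^2/n$, and $\mathrm{Var}(\hat\beta_1) = \sigma^2 / [(n-1)s_X^2]$; these facts are not derived in the notes themselves.

```latex
\begin{align*}
\hat\mu(Y \mid X_0)
  &= \hat\beta_0 + \hat\beta_1 X_0
   = \bar Y + \hat\beta_1 (X_0 - \bar X)
   && \text{since } \hat\beta_0 = \bar Y - \hat\beta_1 \bar X \\
\operatorname{Var}\!\left[\hat\mu(Y \mid X_0)\right]
  &= \operatorname{Var}(\bar Y) + (X_0 - \bar X)^2 \operatorname{Var}(\hat\beta_1) \\
  &= \sigma^2 \left[ \frac{1}{n} + \frac{(X_0 - \bar X)^2}{(n-1)s_X^2} \right]
\end{align*}
```

Replacing $\sigma$ by $\hat\sigma$ and taking the square root gives the standard error above. The second term vanishes at $X_0 = \bar X$, which is why the standard error is smallest there.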
Steer data: What is the estimated mean pH for carcasses 3 hours old? Give a 95% confidence interval for the mean pH after 3 hours.
First, remember that the X variable in the regression model is log(Hours), so $X_0 = \log(3) = 1.0986$ (natural logarithm). Therefore, $\hat\mu(Y \mid X_0 = 1.0986) = 6.9836 - .7257(1.0986) = 6.186$. To calculate the standard error, we need to compute $\bar X$, the mean of log(Hours) for the 10 data points, and $s_X^2$, the sample variance of log(Hours). From SPSS,

Descriptive Statistics

                     N    Mean      Std. Deviation   Variance
LogTime              10   1.19013   .796480          .63438
Valid N (listwise)   10

Hence, $\bar X = 1.1901$ and $(n-1)s_X^2 = 9(.63438) = 5.709$.

$$SE[\hat\mu(Y \mid X_0 = 1.0986)] = 0.0823 \sqrt{\frac{1}{10} + \frac{(1.0986 - 1.1901)^2}{5.709}} = 0.0262$$

and a 95% confidence interval for the mean pH among all steers after 3 hours is

$$6.186 \pm t_8(.975)(.0262) = 6.186 \pm 2.306(.0262) = 6.186 \pm .0604 \approx 6.13 \text{ to } 6.25$$
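The same calculation can be sketched in code, working from the raw steer data instead of the rounded SPSS summaries. The critical value 2.306 is the $t_8(.975)$ quoted in the notes; the function name `mean_response_ci` is my own. Small differences from the hand calculation in the last digit are due to rounding.

```python
import math

hours = [1, 1, 2, 2, 4, 4, 6, 6, 8, 8]
ph = [7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36]
x = [math.log(h) for h in hours]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(ph) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * s_X^2
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, ph)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, ph))
sigma_hat = math.sqrt(sse / (n - 2))

T_CRIT = 2.306  # t_8(.975), value quoted in the notes


def mean_response_ci(hours_since_slaughter, t_crit=T_CRIT):
    """Return (estimate, SE, lower, upper) for the mean pH at the given time."""
    x0 = math.log(hours_since_slaughter)
    mu_hat = b0 + b1 * x0
    se = sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return mu_hat, se, mu_hat - t_crit * se, mu_hat + t_crit * se


mu_hat, se_mean, lower, upper = mean_response_ci(3)
print(round(mu_hat, 3), round(se_mean, 4), round(lower, 2), round(upper, 2))
```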
Simultaneous confidence intervals for the mean response at several values of X

If we want simultaneous confidence intervals at several different values of X, we can use Bonferroni if the number of values is small. We can compute simultaneous confidence intervals at every possible value of X using a Scheffe procedure. The result is a set of confidence bands for the regression line. We are 95% confident (or whatever the chosen confidence level) that the regression line lies entirely within the bands. Thus, we are 95% confident that the true means at all possible values of X are all within the confidence band limits. The formula for the simultaneous confidence bands is

$$\hat\beta_0 + \hat\beta_1 X \pm \sqrt{2 F_{2,n-2}(1 - \alpha)}\; SE[\hat\mu(Y \mid X)]$$

This is referred to as the Working-Hotelling procedure. In practice, you compute these limits at a large number of X values, then join the limits to make a smooth curve on the scatterplot. Some programs will do this automatically, but SPSS will not. It will, however, plot the individual confidence intervals for all X's using the t coefficient rather than the Scheffe coefficient.

Steer data: for simultaneous 95% confidence intervals, $F_{2,n-2}(1 - \alpha) = F_{2,8}(.95) = 4.46$. The confidence interval for the mean pH after 3 hours is therefore (see above):

$$6.186 \pm \sqrt{2(4.46)}\,(.0262) = 6.186 \pm 2.987(.0262) = 6.186 \pm .0782 = 6.11 \text{ to } 6.26$$
We could compute confidence intervals for any number of values of X.
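A sketch of computing the Working-Hotelling limits at several times follows. It uses the steer data from the case study and the value $F_{2,8}(.95) = 4.46$ quoted above; the chosen grid of hours and the function names are assumptions for illustration.

```python
import math

hours = [1, 1, 2, 2, 4, 4, 6, 6, 8, 8]
ph = [7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36]
x = [math.log(h) for h in hours]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(ph) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * s_X^2
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, ph)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, ph))
sigma_hat = math.sqrt(sse / (n - 2))

F_CRIT = 4.46  # F_{2,8}(.95), value quoted in the notes
wh_multiplier = math.sqrt(2 * F_CRIT)  # replaces the t coefficient


def working_hotelling_limits(hours_since_slaughter):
    """Simultaneous band limits for the mean pH at the given time."""
    x0 = math.log(hours_since_slaughter)
    mu_hat = b0 + b1 * x0
    se = sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return mu_hat - wh_multiplier * se, mu_hat + wh_multiplier * se


# Evaluate the band on a grid of times; joining these points traces the band.
band = {h: working_hotelling_limits(h) for h in (1, 2, 3, 4, 6, 8)}
for h, (lo, hi) in band.items():
    print(h, round(lo, 3), round(hi, 3))
```

The multiplier (about 2.987) is larger than the t coefficient 2.306, so the simultaneous band is wider than the individual intervals at every X.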
Prediction interval for a future response

The confidence interval above is for the mean pH of all steers 3 hours after slaughter. A 95% prediction interval for the pH of an individual steer 3 hours after slaughter is an interval in which you are 95% confident that the pH of a particular steer will lie 3 hours after slaughter. A confidence interval is for a mean; a prediction interval is for an individual.

The predicted value for a future response at $X = X_0$ is

$$\mathrm{Pred}(Y \mid X_0) = \hat\mu(Y \mid X_0) = \hat\beta_0 + \hat\beta_1 X_0$$

The standard error of prediction is

$$SE[\mathrm{Pred}(Y \mid X_0)] = \sqrt{\hat\sigma^2 + SE[\hat\mu(Y \mid X_0)]^2} = \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar X)^2}{(n-1)s_X^2}}$$

The standard error of prediction has two parts: the uncertainty due to estimating the mean response at $X_0$ and the uncertainty due to the fact that individual observations vary around that mean with standard deviation $\sigma$. Note that while the standard error of the mean response at $X_0$ goes to 0 as n increases, the standard error of prediction never goes to 0.

An individual 100(1 - α)% prediction interval for the response of an individual at $X_0$ is

$$\hat\beta_0 + \hat\beta_1 X_0 \pm t_{n-2}(1 - \alpha/2)\, SE[\mathrm{Pred}(Y \mid X_0)]$$

For the steer data, a 95% prediction interval for the pH of a particular steer 3 hours after slaughter is:

$$6.186 \pm 2.306(.0823)\sqrt{1 + \frac{1}{10} + \frac{(1.0986 - 1.1901)^2}{5.709}} = 6.186 \pm 2.306(.08637) = 6.186 \pm .1992 = 5.99 \text{ to } 6.39.$$

Simultaneous prediction intervals can be computed for several different X values using Bonferroni, but there is no analog to the Working-Hotelling Scheffe-based procedure for simultaneous prediction intervals at all possible values of X.
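The prediction interval for the steer data can be sketched as below, again working from the raw data and using the quoted $t_8(.975) = 2.306$. The function name `prediction_interval` is my own; results may differ from the hand calculation in the last digit because of rounding.

```python
import math

hours = [1, 1, 2, 2, 4, 4, 6, 6, 8, 8]
ph = [7.02, 6.93, 6.42, 6.51, 6.07, 5.99, 5.59, 5.80, 5.51, 5.36]
x = [math.log(h) for h in hours]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(ph) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (n - 1) * s_X^2
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, ph)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, ph))
sigma_hat = math.sqrt(sse / (n - 2))

T_CRIT = 2.306  # t_8(.975), value quoted in the notes


def prediction_interval(hours_since_slaughter, t_crit=T_CRIT):
    """Return (prediction, SE of prediction, lower, upper) for one new steer."""
    x0 = math.log(hours_since_slaughter)
    pred = b0 + b1 * x0
    se_mean = sigma_hat * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    # Two sources of uncertainty: individual scatter and the estimated mean.
    se_pred = math.sqrt(sigma_hat ** 2 + se_mean ** 2)
    return pred, se_pred, pred - t_crit * se_pred, pred + t_crit * se_pred


pred, se_pred, pi_lower, pi_upper = prediction_interval(3)
print(round(pred, 3), round(se_pred, 4), round(pi_lower, 2), round(pi_upper, 2))
```

Since `se_pred` is never smaller than `sigma_hat`, the interval stays wide no matter how large the sample is, which is the point made in the text.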
SPSS commands Analyze…Regression…Linear
Under Statistics button, you can choose to get confidence intervals for $\beta_0$ and $\beta_1$.

Under Save button:
• Unstandardized Predicted Values
• Unstandardized Residuals
• Prediction Intervals: Mean: this isn't a prediction interval; it's an individual confidence interval for the mean response at each X. SPSS does not compute the Working-Hotelling simultaneous confidence intervals.
• Prediction Intervals: Individual: this is a prediction interval for an individual response at each X.

To obtain predicted values, confidence intervals and prediction intervals for a value of X not in the data set, add a case to the data with the desired X value, but leave the value of Y blank (it should display a period, which indicates a missing value).

SPSS can plot the individual confidence intervals for mean response and the prediction intervals for an individual response. Create a scatterplot and double-click the plot to get into Chart Editor. Select one of the data points and click on the "Add fit line" icon. Under the "Fit line" tab you can select "Mean" or "Individual" confidence intervals. The first gives individual (not simultaneous) confidence intervals for the mean response at each X and the second gives prediction intervals.
Figure: 95% individual confidence intervals for the mean, 95% Working-Hotelling simultaneous confidence bands for the mean, and 95% individual prediction intervals for a single response (this graph is from S-Plus; SPSS will only do the first and last of the three).