# Single Variable Regression (Part II)

7. Residual Plots
After the curve is fit, it is important to check whether the fitted curve is reasonable. This is
done using residuals. The residual for a point is the difference between the observed value
and the predicted value, i.e., the residual from fitting a straight line is found as:

$$e_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_i)$$

There are several standard residual plots:
• plot of residuals vs predicted values;
• plot of residuals vs X;
• plot of residuals vs time ordering.
In all cases, the residual plots should show random scatter around zero with no obvious
pattern. Don’t plot residuals vs Y (the observed values): this leads to odd-looking plots
that are an artifact of the plot and don’t mean anything.
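
As a minimal sketch of the residual calculation described above, the following Python snippet fits a least-squares line to a small made-up data set and computes each residual as observed minus predicted. The data values and the function names `fit_line` and `residuals` are hypothetical and are not taken from these notes.

```python
def fit_line(x, y):
    """Return (b0, b1) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(x, y):
    """Residual = observed value minus predicted value, one per data point."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

res = residuals(x, y)
for xi, ri in zip(x, res):
    print(f"X = {xi:4.1f}   residual = {ri:+.3f}")
```

Plotting `res` against the predicted values, against `x`, or against the time order gives the three standard residual plots. Residuals from a least-squares line with an intercept always sum to zero, so the scatter is centred on zero by construction; what matters is whether any pattern remains.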

8. Probability Plots
The probability plot is a graphical technique for assessing whether or not a data set
follows a given distribution such as the normal distribution. The data are plotted against a
theoretical normal distribution in such a way that the points should form approximately a
straight line. Departures from this straight line indicate departures from the specified
distribution.

The points on this plot form a nearly linear pattern, which indicates that the normal distribution is a good model for this data set.

The normal probability plot is formed by:
• Vertical axis: ordered response values;
• Horizontal axis: normal order statistic medians.
The observations are plotted as a function of the corresponding normal order statistic medians. In addition, a straight line can be fit to the points and added as a reference line. The further the points vary from this line, the greater the indication of departures from normality.

The correlation coefficient of the points on the normal probability plot can be compared to a table of critical values to provide a formal test of the hypothesis that the data come from a normal distribution.

| n | α = 0.10 | α = 0.05 | α = 0.01 |
|----|-------|-------|-------|
| 4 | .8951 | .8734 | .8318 |
| 5 | .9033 | .8804 | .8320 |
| 10 | .9347 | .9180 | .8804 |
| 15 | .9506 | .9383 | .9110 |
| 20 | .9600 | .9503 | .9290 |
| 25 | .9662 | .9582 | .9408 |
| 30 | .9707 | .9639 | .9490 |
| 40 | .9767 | .9715 | .9597 |
| 50 | .9807 | .9764 | .9664 |
| 60 | .9835 | .9799 | .9710 |
| 75 | .9865 | .9835 | .9757 |

The normal probability plot is used to answer the following questions.
1. Are the data (meaning the residuals) normally distributed?
2. What is the nature of the departure from normality (data skewed, shorter than expected tails, longer than expected tails)?
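
The formal test described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of any particular package: the order statistic medians are approximated with Filliben's formula (a common choice that these notes do not specify), the function names are hypothetical, and the usual decision rule is assumed, namely that normality is rejected when the correlation falls below the critical value. The three α = 0.05 critical values are the ones listed in this section for n = 10, 20 and 30.

```python
import math
from statistics import NormalDist


def normal_order_statistic_medians(n):
    """Filliben's approximation to the order statistic medians, on the normal scale."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return [NormalDist().inv_cdf(p) for p in m]


def pearson_r(u, v):
    """Ordinary correlation coefficient of two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def probability_plot_correlation(data):
    """Correlation between the ordered data and the normal order statistic medians."""
    z = normal_order_statistic_medians(len(data))
    return pearson_r(z, sorted(data))


# alpha = 0.05 critical values from the table in this section (n -> critical value).
CRITICAL_05 = {10: 0.9180, 20: 0.9503, 30: 0.9639}

n = 20
z = normal_order_statistic_medians(n)
linear_sample = [50.0 + 10.0 * zi for zi in z]    # idealised normal-looking data (hypothetical)
skewed_sample = [math.exp(2.0 * zi) for zi in z]  # idealised strongly right-skewed data (hypothetical)

r_linear = probability_plot_correlation(linear_sample)
r_skewed = probability_plot_correlation(skewed_sample)

for name, r in [("linear", r_linear), ("skewed", r_skewed)]:
    verdict = "reject normality" if r < CRITICAL_05[n] else "no evidence against normality"
    print(f"{name}: r = {r:.4f} -> {verdict}")
```

With real residuals, pass them directly to `probability_plot_correlation` and look up the critical value for the nearest tabulated n.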

Typical Normal Probability Plot: Normally Distributed Data
The following normal probability plot is from a heat flow meter data set.

Figure: Normal Probability Plot

Conclusions
We can make the following conclusions from the above plot.
1. The normal probability plot shows a strongly linear pattern. There are only minor deviations from the line fit to the points on the probability plot.
2. The normal distribution appears to be a good model for these data.

Discussion
Visually, the probability plot shows a strongly linear pattern. This is verified by the correlation coefficient of 0.9989 of the line fit to the probability plot. The fact that the points in the lower and upper extremes of the plot do not deviate significantly from the straight-line pattern indicates that there are not any significant outliers (relative to a normal distribution).
In this case, we can quite reasonably conclude that the normal distribution provides an excellent model for the data. The intercept and slope of the fitted line give estimates of 9.26 and 0.023 for the location and scale parameters of the fitted normal distribution.

Typical Normal Probability Plot: Data Have Short Tails

The following is a normal probability plot for 500 random numbers generated from a Tukey-Lambda distribution with the shape parameter (λ) equal to 1.1.

Figure: Normal Probability Plot for Data with Short Tails

Conclusions
We can make the following conclusions from the above plot.
1. The normal probability plot shows a non-linear pattern.
2. The normal distribution is not a good model for these data.

Discussion
For data with short tails relative to the normal distribution, the non-linearity of the normal probability plot shows up in two ways. First, the middle of the data shows an S-like pattern. This is common for both short and long tails. Second, the first few and the last few points show a marked departure from the reference fitted line. In comparing this plot to the long tail example in the next section, the important difference is the direction of the departure from the fitted line for the first few and last few points. For short tails, the first few points show increasing departure from the fitted line above the line and last few points show increasing departure from the fitted line below the line. For long tails, this pattern is reversed.
In this case, we can reasonably conclude that the normal distribution does not provide an adequate fit for this data set. For probability plots that indicate short-tailed distributions, the next step might be to generate a Tukey Lambda PPCC plot. The Tukey Lambda PPCC plot can often be helpful in identifying an appropriate distributional family.
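
The direction rule for the end points can be checked numerically. The sketch below is an idealisation rather than a reproduction of the plots in these notes: instead of 500 random draws it uses the exact quantiles of each distribution at the plotting positions (i − 0.5)/n, which is an assumption made so that the result is deterministic. The function names are hypothetical. It fits a least-squares line to each probability plot and reports whether the first and last points lie above or below it.

```python
import math
from statistics import NormalDist


def tukey_lambda_quantile(p, lam):
    """Quantile function of the Tukey-Lambda distribution (lam != 0)."""
    return (p ** lam - (1.0 - p) ** lam) / lam


def double_exponential_quantile(p):
    """Quantile function of the standard double exponential (Laplace) distribution."""
    return math.log(2.0 * p) if p < 0.5 else -math.log(2.0 * (1.0 - p))


def end_departures(values, z):
    """Fit a least-squares line to (z, values); return the residuals
    (point minus line) of the first and the last point."""
    n = len(z)
    z_bar, v_bar = sum(z) / n, sum(values) / n
    slope = (sum((a - z_bar) * (b - v_bar) for a, b in zip(z, values))
             / sum((a - z_bar) ** 2 for a in z))
    intercept = v_bar - slope * z_bar
    first = values[0] - (intercept + slope * z[0])
    last = values[-1] - (intercept + slope * z[-1])
    return first, last


n = 500
p = [(i - 0.5) / n for i in range(1, n + 1)]   # simple plotting positions (assumption)
z = [NormalDist().inv_cdf(pi) for pi in p]     # horizontal axis of the plot

short_tailed = [tukey_lambda_quantile(pi, 1.1) for pi in p]
long_tailed = [double_exponential_quantile(pi) for pi in p]

short_first, short_last = end_departures(short_tailed, z)
long_first, long_last = end_departures(long_tailed, z)

print(f"short tails: first point {short_first:+.2f}, last point {short_last:+.2f}")
print(f"long tails:  first point {long_first:+.2f}, last point {long_last:+.2f}")
```

A positive value means the point lies above the fitted line. The short-tailed case gives a positive first departure and a negative last departure, and the long-tailed case gives the reverse.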

Typical Normal Probability Plot: Data Have Long Tails
The following is a normal probability plot of 500 numbers generated from a double exponential distribution. The double exponential distribution is symmetric, but relative to the normal it declines rapidly and has longer tails.

Figure: Normal Probability Plot for Data with Long Tails

Conclusions
We can make the following conclusions from the above plot.
1. The normal probability plot shows a reasonably linear pattern in the center of the data. However, the tails, particularly the lower tail, show departures from the fitted line.
2. A distribution other than the normal distribution would be a good model for these data.

Discussion
For data with long tails relative to the normal distribution, the non-linearity of the normal probability plot can show up in two ways. First, the middle of the data may show an S-like pattern. This is common for both short and long tails. In this particular case, the S pattern in the middle is fairly mild. Second, the first few and the last few points show marked departure from the reference fitted line. In the plot above, this is most noticeable for the first few data points. In comparing this plot to the short-tail example in the previous section, the important difference is the direction of the departure from the fitted line for the first few and the last few points. For long tails, the first few points show increasing departure from the fitted line below the line and last few points show increasing departure from the fitted line above the line. For short tails, this pattern is reversed.
In this case we can reasonably conclude that the normal distribution can be improved upon as a model for these data. For probability plots that indicate long-tailed distributions, the next step might be to generate a Tukey Lambda PPCC plot. The Tukey Lambda PPCC plot can often be helpful in identifying an appropriate distributional family.

Typical Normal Probability Plot: Data are Skewed Right

Figure: Normal Probability Plot for Data that are Skewed Right

Conclusions
We can make the following conclusions from the above plot.
1. The normal probability plot shows a strongly non-linear pattern. Specifically, it shows a quadratic pattern in which all the points are below a reference line drawn between the first and last points.
2. The normal distribution is not a good model for these data.
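
The "all points below the reference line" pattern for right-skewed data can be illustrated with an idealised sample. The sketch below assumes a lognormal shape (the exponential of the normal scores) purely as a convenient hypothetical example of right skew; it is not the data behind the plot in these notes, and the helper name `reference_line` is made up.

```python
import math
from statistics import NormalDist

n = 100
p = [(i - 0.5) / n for i in range(1, n + 1)]   # simple plotting positions (assumption)
z = [NormalDist().inv_cdf(pi) for pi in p]     # horizontal axis of the plot

# Idealised right-skewed sample: lognormal quantiles (hypothetical data).
skewed_right = [math.exp(zi) for zi in z]


def reference_line(z, values):
    """Return a function giving the height of the straight line drawn
    between the first and the last plotted points."""
    slope = (values[-1] - values[0]) / (z[-1] - z[0])
    return lambda zi: values[0] + slope * (zi - z[0])


line = reference_line(z, skewed_right)

# Compare every interior point with the reference line.
below = [v < line(zi) for zi, v in zip(z[1:-1], skewed_right[1:-1])]
print("all interior points below the reference line:", all(below))
```

Because the exponential function is convex, every interior point falls below the line joining the two end points, which is the bowed shape described in conclusion 1.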

Discussion
This quadratic pattern in the normal probability plot is the signature of a significantly right-skewed data set. Similarly, if all the points on the normal probability plot fell above the reference line connecting the first and last points, that would be the signature pattern for a significantly left-skewed data set.
In this case we can quite reasonably conclude that we need to model these data with a right-skewed distribution such as the Weibull or lognormal.

9. Example - Yield and fertilizer
We wish to investigate the relationship between yield (Liters) and fertilizer (kg/ha) for tomato plants. An experiment was conducted in the Schwarz household one summer on 11 plots of land where the amount of fertilizer was varied and the yield measured at the end of the season.
The amount of fertilizer applied to each plot was chosen between 5 and 18 kg/ha. While the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and lowest values), they represent commonly used amounts based on a preliminary survey of producers. The levels of fertilizer were randomly assigned to each plot. At the end of the experiment, the yields were measured and the following data were obtained.

Interest also lies in predicting the yield when 16 kg/ha are assigned.

In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.
The population consists of all possible field plots with all possible tomato plants of this type grown under all possible fertilizer levels between about 5 and 18 kg/ha.
If all of the population could be measured (which it can’t) you could find a relationship between the yield and the amount of fertilizer applied. This relationship would have the form:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where β0 and β1 represent the true population intercept and slope respectively. The term ε represents random variation that is always present, i.e. even if the same plot was grown twice in a row with the same amount of fertilizer, the yield would not be identical (why?).
The population parameters to be estimated are β0, the true average yield when the amount of fertilizer is 0, and β1, the true average change in yield per unit change in the amount of fertilizer. These are taken over all plants in all possible field plots of this type. The values of β0 and β1 are impossible to obtain as the entire population could never be measured.

KYPLOT Analysis
Here is the data entered into a KYPLOT data sheet. Note the scale of both variables (continuous). The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset.

Use the Statistics -> Regression Analysis -> Simple Regression platform to start the analysis. Specify the Y and X variables as needed, then click OK. A new spreadsheet will be created that contains the regression results.

At this stage, it would also be useful to draw a scatter plot of the data (refer to previous KYPLOT tutorials). The relationship looks approximately linear, there don’t appear to be any outliers or influential points, and the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.
The Fit menu item allows you to fit the least-squares line. The actual fitted line is drawn on the scatter plot, and the straight line equation coefficients (here called A1 for the intercept and A2 for the slope) of the fitted line are printed below the fit spreadsheet.

The estimated regression line is:

$$\hat{Y} = b_0 + b_1 X = 12.856 + 1.101\,X$$

In terms of estimates, b0 = 12.856 is the estimated intercept, and b1 = 1.101 is the estimated slope.
The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is increased by 1 kg/ha. NOTE that the slope is the CHANGE in Y when X increases by 1 unit - not the value of Y when X = 1.
The estimated intercept is the estimated yield when the amount of fertilizer is 0. In this case, the estimated yield when no fertilizer is added is 12.856 L. In this particular case the intercept has a meaningful interpretation, but I’d be worried about extrapolating outside the range of the observed X values.
Once again, these are the results from a single experiment. If another experiment was repeated, you would obtain different estimates (b0 and b1 would change). The sampling distribution over all possible experiments would describe the variation in b0 and b1 over all possible experiments. The standard deviation of b0 and b1 over all possible experiments is again referred to as the standard error of b0 and b1.
The formulae for the standard errors of b0 and b1 are messy, and hopeless to compute by hand. And just like inference for a mean or a proportion, we can obtain estimates of the standard error from KYPLOT (from the regression results sheet created earlier).
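
For readers who want to see what the software is doing, here is a minimal Python sketch of the least-squares estimates and their standard errors, using the standard textbook formulas for simple linear regression. The fertilizer and yield values are hypothetical placeholders (the experiment's data sheet is not reproduced here), so the output will not match the KYPLOT results quoted above; the function name `simple_regression` is also an assumption.

```python
import math


def simple_regression(x, y):
    """Least-squares estimates and standard errors for the line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # estimated slope
    b0 = y_bar - b1 * x_bar             # estimated intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                  # residual variance on n - 2 degrees of freedom
    se_b1 = math.sqrt(s2 / sxx)
    se_b0 = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))
    return {"b0": b0, "b1": b1, "se_b0": se_b0, "se_b1": se_b1}


# Hypothetical fertilizer (kg/ha) and yield (L) values, NOT the experiment's data.
fertilizer = [5, 7, 9, 11, 13, 15, 17]
yield_l = [18.0, 21.5, 22.0, 25.5, 27.0, 29.5, 31.0]

fit = simple_regression(fertilizer, yield_l)
for name, value in fit.items():
    print(f"{name} = {value:.4f}")
```

Repeating the calculation on a different sample from the same population would give different values of `b0` and `b1`; the standard errors estimate how much they would vary.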

The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an estimate of the standard deviation of b1 over all possible experiments. Normally, the intercept is of limited interest, but a standard error can also be found for it as shown in the above table.
Using exactly the same logic as when we found a confidence interval for the population mean, a confidence interval for the population slope (β1) is found (approximately) as b1 ± 2(estimated se). In the above example, an approximate confidence interval for β1 is found as
1.101 ± 2 × (0.132) = 1.101 ± .264 = (0.837 to 1.365) L/kg of fertilizer applied.
An “exact” confidence interval can be computed by KYPLOT as shown above. The “exact” confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small.
We interpret this interval as ‘being 95% confident that the true increase in yield when the amount of fertilizer is increased by one unit is somewhere between (.837 to 1.365) L/kg.’
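
The arithmetic behind the two intervals can be checked with a short Python sketch. The slope and standard error are the values quoted in these notes; the t critical value 2.262 (97.5th percentile with 9 degrees of freedom) is a standard table value hard-coded here as an assumption, to avoid depending on a statistics library.

```python
b1 = 1.101       # estimated slope (L per kg/ha), from the notes
se_b1 = 0.132    # estimated standard error of the slope, from the notes
n = 11           # number of plots, so n - 2 = 9 degrees of freedom

# Approximate 95% interval: estimate plus or minus 2 standard errors.
approx = (b1 - 2 * se_b1, b1 + 2 * se_b1)

# "Exact" 95% interval: replace 2 by the t critical value for 9 degrees of freedom.
T_CRIT_975_DF9 = 2.262   # standard t-table value (assumption: looked up, not computed)
exact = (b1 - T_CRIT_975_DF9 * se_b1, b1 + T_CRIT_975_DF9 * se_b1)

print(f"approximate: ({approx[0]:.3f}, {approx[1]:.3f}) L/kg")
print(f"exact:       ({exact[0]:.3f}, {exact[1]:.3f}) L/kg")
```

The exact interval comes out at roughly 0.80 to 1.40 L/kg, slightly wider than the approximate one because 2.262 is larger than 2.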

Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is computed using b1, but is a confidence interval for β1 - the population parameter that is unknown.
In linear regression problems, one hypothesis of interest is if the true slope is zero. This would correspond to no linear relationship between the response and predictor variable (why?). In many cases, a confidence interval tells the entire story.
KYPLOT produces a test of the hypothesis that each of the parameters (the slope and the intercept in the population) is zero. The output is reproduced again below:

The test of hypothesis about the intercept is not of interest (why?).
Let
• β1 be the true (unknown) slope.
• b1 be the estimated slope. In this case b1 = 1.1014.
The hypothesis testing proceeds as follows. Again note that we are interested in the population parameters and not the sample statistics:
1. Specify the null and alternate hypothesis:

$$H_0: \beta_1 = 0 \qquad H_A: \beta_1 \neq 0$$

Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test as we are interested in detecting differences from zero in either direction.
2. Find the test statistic and the p-value. The test statistic is computed as:

$$T = \frac{b_1 - 0}{se(b_1)} = \frac{1.1014}{0.132} \approx 8.34$$

In other words, the estimate is over 8 standard errors away from the hypothesized value! This will be compared to a t-distribution with n−2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).
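
The test statistic and p-value can be reproduced without a statistics package. The sketch below uses the slope and standard error quoted in these notes and obtains the two-sided p-value by numerically integrating the density of the t-distribution with Simpson's rule; the integration approach and the function names are illustrative assumptions, not what KYPLOT does internally.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=20000):
    """P(|T| >= |t_stat|), using Simpson's rule on [0, |t_stat|] (steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_area = total * h / 3      # P(0 <= T <= |t_stat|)
    return 1.0 - 2.0 * central_area


b1 = 1.1014      # estimated slope, from the notes
se_b1 = 0.132    # its estimated standard error, from the notes
df = 11 - 2      # n - 2 degrees of freedom

t_stat = (b1 - 0.0) / se_b1           # the hypothesized slope under the null is 0
p_value = two_sided_p_value(t_stat, df)

print(f"T = {t_stat:.2f} on {df} degrees of freedom")
print(f"two-sided p-value = {p_value:.2e}")
```

To test a slope other than zero, replace `0.0` in the numerator of `t_stat` with the hypothesized value.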

3. Conclusion. There is strong evidence that the true slope is not zero. This is not too surprising given that the 95% confidence intervals show that plausible values for the true slope are from about 0.8 to about 1.4.
It is possible to construct tests of the slope equal to some value other than 0. Most packages can’t do this. You would compute the T value as shown above, replacing the value 0 with the hypothesized value.
If the hypothesis is rejected, a natural question to ask is ‘well, what values of the parameter are plausible given this data’. This is exactly what a confidence interval tells you. Consequently, I usually prefer to find confidence intervals, rather than doing formal hypothesis testing.
What about making predictions for future yields when certain amounts of fertilizer are applied? For example, what would be the future yield when 16 kg/ha of fertilizer are applied? The predicted value is found by substituting the new X into the estimated regression line.
KYPLOT makes it easier to do such a prediction. If you scroll down in the “Regression Results” spreadsheet, you will find a table with 2 blank spaces to insert X values and predict Y values. If you insert 16, you get:

As noted earlier, there are two types of estimates of precision associated with predictions using the regression line. It is important to distinguish between them as these two intervals are the source of much confusion in regression problems.
First, the experimenter may be interested in predicting a single FUTURE individual value for a particular X. This would correspond to the predicted yield for a single future plot with 16 kg/ha of fertilizer added.

Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at a particular X. This would correspond to the average yield for all future plots when 16 kg/ha of fertilizer is added.
The prediction interval for an individual response is sometimes called a confidence interval for an individual response but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.
(To be continued).
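
The point prediction at 16 kg/ha, and the difference between the two kinds of interval, can be sketched in Python. The first part uses the fitted coefficients quoted in these notes. The second part uses the standard textbook standard-error formulas for a mean response and for a single future response, applied to hypothetical data because the experiment's data sheet is not reproduced here; the function name is also an assumption.

```python
import math

# Point prediction from the fitted line reported in the notes.
b0, b1 = 12.856, 1.10137
x_new = 16.0
y_hat = b0 + b1 * x_new     # predicted yield (L) when 16 kg/ha is applied
print(f"predicted yield at {x_new:.0f} kg/ha: {y_hat:.2f} L")


def interval_standard_errors(x, y, x_new):
    """Return (se_mean, se_individual) at x_new for a least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                                   # residual variance
    leverage = 1.0 / n + (x_new - x_bar) ** 2 / sxx
    se_mean = math.sqrt(s2 * leverage)                   # average of ALL future responses
    se_individual = math.sqrt(s2 * (1.0 + leverage))     # a single FUTURE response
    return se_mean, se_individual


# Hypothetical fertilizer (kg/ha) and yield (L) values, NOT the experiment's data.
fertilizer = [5, 7, 9, 11, 13, 15, 17]
yield_l = [18.0, 21.5, 22.0, 25.5, 27.0, 29.5, 31.0]

se_mean, se_individual = interval_standard_errors(fertilizer, yield_l, x_new)
print(f"se for the mean response:      {se_mean:.3f}")
print(f"se for an individual response: {se_individual:.3f}")
```

The standard error for an individual response is always the larger of the two, because it includes the plot-to-plot variation ε in addition to the uncertainty in the fitted line. This is why the prediction interval is wider than the confidence interval for the mean.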