
Single-variable regression

1. Introduction
Along with the Analysis of Variance, this is likely the most commonly used statistical
methodology in chemical engineering research. In virtually every issue of a chemical
engineering journal, you will find papers that use a regression analysis. There are HUNDREDS of
books written on regression analysis. Some of the better ones (in my opinion) are:
Draper and Smith. Applied Regression Analysis. Wiley.
Neter, Wasserman, and Kutner. Applied Linear Statistical Models. Irwin.
Kleinbaum, Kupper, and Miller. Applied Regression Analysis. Duxbury.
Zar. Biostatistics. Prentice Hall.
Consequently, this set of notes is VERY brief and makes no pretense to be a thorough review of
regression analysis. Please consult the above references for all the gory details.
It turns out that both Analysis of Variance and Regression are special cases of a more general
statistical methodology called General Linear Models which in turn are special cases of
Generalized Linear Models which in turn are special cases of Generalized Additive Models, which
in turn are special cases of .....
The key difference between a Regression analysis and an ANOVA is that the X variable is
nominal scaled in ANOVA, while in regression analysis the X variable is continuous scaled. This
implies that in ANOVA, the shape of the response profile was unspecified (the null hypothesis
was that all means were equal while the alternate was that at least one mean differs), while in
regression, the response profile is assumed to be a straight line.
Because both ANOVA and regression are from the same class of statistical models, many of the
assumptions are similar, the fitting methods are similar, and hypothesis testing and
inference are similar as well.

2. Equation for a line - getting notation straight
In order to use regression analysis effectively, it is important that you understand the concepts
of slopes and intercepts and how to determine these from data values.
This will be QUICKLY reviewed here in class.
In previous courses at high school or in linear algebra, the equation of a straight line was often
written y = mx + b where m is the slope and b is the intercept. In some popular spreadsheet
programs, the authors decided to write the equation of a line as y = a + bx. Now a is the
intercept, and b is the slope. Statisticians, for good reasons, have rationalized this notation and
usually write the equation of a line as y = β0 + β1x or as Y = b0 + b1X (the distinction between β0 and b0 will be made clearer in a few minutes). The use of the subscript 0 to represent the intercept and the subscript 1 to represent the coefficient for the X variable then readily extends to more complex cases. Review the definition of the intercept as the value of Y when X = 0, and of the slope as the change in Y per unit change in X.
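As a quick worked example (the numbers are invented for illustration): suppose a line passes through the points (X = 1, Y = 5) and (X = 3, Y = 11). The slope is the change in Y per unit change in X, b1 = (11 - 5)/(3 - 1) = 3, and the intercept follows from either point, b0 = 5 - 3(1) = 2, so the line is Y = 2 + 3X. As a check, X = 3 gives Y = 2 + 3(3) = 11.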


3. Populations and samples

All of statistics is about detecting signals in the face of noise and about estimating population parameters from samples. Regression is no different.

First consider the population. Conceptually, we can think of the large set of all units of interest. On each unit, there is conceptually both an X and a Y variable present. We wish to summarize the relationship between Y and X, and furthermore wish to make predictions of the Y value for future X values that may be observed from this population. If this were physics, we might conceive of a physical law between X and Y, e.g. F = ma or PV = nRT. However, in chemical engineering, the relationship between Y and X is often much more tenuous.

If you could draw a scatterplot of Y against X for ALL elements of the population, the points would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a straight line at any given X value. We denote this relationship as Y = β0 + β1X + ε, where β0 and β1 are now the POPULATION intercept and slope respectively. We say that E[Y] = β0 + β1X is the expected or average value of Y at X. The term ε represents the random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points in the population is constant over the entire regression line).

Of course, we can never measure all units of the population, so a sample must be taken in order to estimate the population slope, the population intercept, and the population standard deviation. Unlike a correlation analysis, it is NOT necessary to select a simple random sample from the entire population, and more elaborate schemes can be used. The bare minimum that must be achieved is that, for any individual X value found in the sample, the units in the population that share this X value must have been selected at random. [This is analogous to having different treatment groups corresponding to different values of X in ANOVA.] This is quite a relaxed assumption! For example, it allows us to deliberately choose values of X from the extremes and then, only at those X values, randomly select from the relevant subset of the population, rather than having to select at random from the population as a whole.

The correct definition of the population is important as part of any study.
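To make the population model concrete, here is a minimal simulation sketch of Y = β0 + β1X + ε with constant error standard deviation. The code, the parameter values (β0 = 2, β1 = 0.5, σ = 1), and the seed are invented for illustration and are not part of the notes.

import numpy as np

rng = np.random.default_rng(seed=1)

beta0, beta1, sigma = 2.0, 0.5, 1.0    # population intercept, slope, error sd (invented)
X = rng.uniform(0, 10, size=1000)      # X values for the units in the population
eps = rng.normal(0, sigma, size=1000)  # epsilon: random scatter about the line
Y = beta0 + beta1 * X + eps            # individual Y values fluctuate about E[Y]

# The expected (average) value of Y at any given X is the point on the population line:
print("E[Y] at X = 4:", beta0 + beta1 * 4)

A scatterplot of this simulated Y against X shows points fluctuating above and below the line, with roughly constant spread over the entire line.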

The response and predictor variables must be both interval or ratio scaled. In particular, using a numerical value to represent a category and then using this numerical value in a regression is not valid. For example, suppose that you code hair color as (1=red, 2=brown, and 3=black). Then using these values in a regression, either as a predictor variable or as a response variable, is not sensible.

Once the data points are selected, the estimation process can proceed, but not before assessing the assumptions!

4. Assumptions

4.1 Linearity

Regression analysis assumes that the relationship between Y and X is linear. Make a scatterplot between Y and X to assess this assumption. Perhaps a transformation is required (e.g. log(Y) vs log(X)). Some caution is required in transformations in dealing with the error structure, as you will see in later examples.

Plot the residuals vs the X values. If the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X is not linear. Or, fit a model that includes X and X² and test if the coefficient associated with X² is zero; unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit can be performed where the variation of the responses at the same X value is compared to the variation around the regression line.

4.2 Correct sampling scheme

The Y must be a random sample from the population of Y values for every X value in the sample. Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen. However, for a given X, the values from the population must be a simple random sample.

4.3 No outliers or influential points

All the points must belong to the relationship – there should be no unusual points. The scatterplot of Y vs X should be examined. If in doubt, fit the model with the points in and out of the fit and see if this makes a difference in the fit. Outliers can have a dramatic effect on the fitted line. For example, in the following graph, the single point is both an outlier and an influential point:

[Figure: a fitted line dragged by a single point that is both an outlier and an influential point.]
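Here is a hedged sketch of the "fit the model with the points in and out of the fit" check from section 4.3. The data values are invented so that the last point is unusual; the use of numpy.polyfit is my choice, not the notes' prescription.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])   # last X is far from the rest
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 2.0])    # ... and its Y is far off the trend

# np.polyfit with deg=1 returns (slope, intercept) for the least-squares line.
slope_all, intercept_all = np.polyfit(X, Y, deg=1)
slope_in, intercept_in = np.polyfit(X[:-1], Y[:-1], deg=1)

# A large change in the fitted slope signals an influential point.
print("slope with the unusual point:   ", round(slope_all, 3))
print("slope without the unusual point:", round(slope_in, 3))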

4.4 Equal variation along the line (homoskedasticity)

The variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over the entire line. This is assessed by looking at plots of the residuals against X to see if the scatter is roughly uniform around zero, with no increase and no decrease in spread over the entire line.

4.5 Independence

Each value of Y is independent of any other value of Y. The most common cases where this fails are time series data, where X is a time measurement. In these cases, time series analysis should be used. This assumption can be assessed by again looking at residual plots against time or other variables.

4.6 Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y's ignoring the Xs. The assumption only states that the residuals, i.e. the differences between the values of Y and the corresponding points on the line, must be normally distributed. This can be assessed by looking at normal probability plots of the residuals.
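The residual checks in sections 4.4 and 4.6 can be produced as in the following sketch: residuals vs X for constant spread, and a normal probability plot for normality of errors. The data, seed, and the choice of scipy/matplotlib are invented for illustration.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=3)
X = rng.uniform(0, 10, size=60)
Y = 2 + 0.5 * X + rng.normal(0, 1, size=60)   # simulated data

b1, b0 = np.polyfit(X, Y, deg=1)              # least-squares fit
residuals = Y - (b0 + b1 * X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(X, residuals)                     # look for patterns or fanning
ax1.axhline(0, color="grey")
ax1.set_xlabel("X")
ax1.set_ylabel("residual")
stats.probplot(residuals, plot=ax2)           # roughly straight => normality plausible
plt.tight_layout()
plt.show()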

e. if the value used for X is an actual measurement of the true underlying X then there is uncertainty in both the X and Y direction. This is called the Berkson case after Berkson who first examined this situation. Refer to the reference books listed at the start of this chapter for more details. we denote the sample intercept by b0 and the sample slope as b 1.e.least squares . In this case.least absolute deviations . Mathematically.fuzzy regression We typically use the principle of least squares. It turns out that there are two important cases.baeysian regression . the least squares line is the line that minimizes : Page 5 of 8 . 5. More alarmingly.This general problem is called the “error in variables” problem and has a long history in statistics. the estimates no longer tend to the true population values! This latter case of “error in variables” is very difficult to analyze properly and there are not universally accepted solutions.g. and b 1 is the estimated slope.least maximum absolute deviation . The least-squares line is the line that makes the sum of the squares of the deviations of the data points from the line in the vertical direction as small as possible. However. temperature as set by a thermostat) while the actual X that occurs would vary randomly around this target value. The most common cases are where the recorded X is a target value (e. then there is no bias in the estimates. i. Some of these methods are: . as the sample size increases. positive slopes are biased downwards.least median-of-squares . estimates of the slope are attenuated towards zero (i. If the value reported for X is a nominal value and the actual value of X varies randomly around this nominal value. negative slopes biased upwards). The equation of a particular sample of points is expressed : where b0 is the estimated intercept. The symbol we are referring to the estimated line and not to a line in the entire population. the estimate are no longer consistent. indicates that How is the best fitting line found when the points are scattered? Many methods have been proposed (and used) for curve fitting. Obtaining Estimates To distinguish between population parameters and sample estimates.

This formal definition of least squares is not that important; the concept as expressed in the previous paragraph is more important – in particular, it is the SQUARED deviation in the VERTICAL direction that is used. It is possible to write out computational formulae for the estimated intercept and slope, but in this age of computers, these are not important - let the computer do the dirty work.

The estimated intercept (b0) is the estimated value of Y when X = 0. In some cases, it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs year, it seems kind of silly to investigate income in year 0. In these cases, there is no clear interpretation of the intercept, and it merely serves as a placeholder for the line.

The estimated slope (b1) is the estimated change in Y per unit change in X. For every unit change in the horizontal direction, the fitted line increases by b1 units. If b1 is negative, the fitted line points downwards, and the "increase" in the line is negative, i.e. actually a decrease.

As with all estimates, a measure of precision can be obtained; as before, this is the standard error of each of the estimates. Again, there are computational formulae, but who cares - let the computer do the dirty work. This is usually automatically done by most computer packages. As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se.

Formal tests of hypotheses can also be done. Usually, these are only done on the slope parameter, as this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X (can you draw a scatterplot showing such a relationship?). More formally, the null hypothesis is

H0: β1 = 0

Notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic. The alternate hypothesis is typically chosen as

H1: β1 ≠ 0

The test statistic is found as

T = b1 / se(b1)

and is compared to a t-distribution with the appropriate degrees of freedom to obtain the p-value. This is usually automatically done by most computer packages. The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing this data if the hypothesis of no relationship were true. The p-value does not tell the whole story: statistical significance vs. engineering (non)significance must be determined and assessed.
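A minimal sketch of this slope test (invented data; scipy.stats.linregress is one convenient route - it reports the two-sided p-value for the null hypothesis of zero slope):

import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.3, 3.9, 5.2, 5.8, 7.1])

res = stats.linregress(X, Y)
print("slope:", res.slope, "se:", res.stderr)
print("t =", res.slope / res.stderr, "p-value:", res.pvalue)

# The approximate 95% confidence interval described above:
print("approx 95% CI for the slope:",
      res.slope - 2 * res.stderr, "to", res.slope + 2 * res.stderr)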

6. Obtaining Predictions

Once the best fitting line is found, it can be used to make predictions for new values of X. There are two types of predictions that are commonly made. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a SINGLE future individual value for a particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular X. In both cases, the estimate is found in the same manner - substitute the new value of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new "dummy" observation in the dataset with the value of Y missing, but the value of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X value allows the package to compute an estimate for this observation.

What differs between the two predictions are the estimates of uncertainty. In the first case, there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that the estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X. The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses because it must account for the uncertainty from the fitted line plus individual variation around the fitted line.

In the second case, only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values, while prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence interval for the slope. Many textbooks have the formulae for the se of the two types of predictions, but again, there is little to be gained by examining them. What is important is that you read the documentation of your software carefully to ensure that you understand exactly what interval is being given to you.
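As one illustration of the two intervals (invented data; the statsmodels get_prediction route is an assumed choice, and your package's names for these intervals may differ):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=4)
X = rng.uniform(0, 10, size=30)
Y = 2 + 0.5 * X + rng.normal(0, 1, size=30)

fit = sm.OLS(Y, sm.add_constant(X)).fit()

x_new = sm.add_constant(np.array([4.0, 8.0]), has_constant="add")
pred = fit.get_prediction(x_new)
frame = pred.summary_frame(alpha=0.05)

# mean_ci_* is the confidence interval for the mean at the new X;
# obs_ci_* is the (much wider) prediction interval for a single response.
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])

Note how the obs_ci (prediction) interval is much wider than the mean_ci interval at the same X, for exactly the reasons given above.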
