
LEAST SQUARES REGRESSION
Least squares regression (Intro)
Criteria for best fit
Least squares for a straight line
Error in linear regression
Linearization of Non-linear relationships
Matlab implementation

Linear least squares regression
• Where substantial error is associated with data, the best curve-fitting
strategy is to derive an approximating function that fits the shape or general
trend of the data without necessarily matching the individual points.

• One approach to do this is to visually inspect the plotted data and then sketch a "best" line through the points.

Linear least squares regression

Although such eyeball approaches have commonsense appeal and are valid for "back-of-the-envelope" calculations, they are deficient because they are arbitrary.

That is, unless the points define a perfect
straight line (in which case, interpolation
would be appropriate), different analysts
would draw different lines.

Linear least squares regression
• To remove this subjectivity, some criterion must be devised to establish a
basis for the fit.

• One way to do this is to derive a curve that minimizes the discrepancy
between the data points and the curve. To do this, we must first quantify the
discrepancy.

• The simplest example is fitting a straight line to a set of paired observations:

Eq. 1:  y = a0 + a1x + e

where a0 and a1 are coefficients representing the intercept and the slope, respectively, and e is the error, or residual.

Linear least squares regression

Rearranging Eq. 1, the residual can be expressed as

Eq. 2:  e = y − a0 − a1x

Thus, the residual is the discrepancy between the true value of y and the approximate value, a0 + a1x, predicted by the linear equation.

Criteria for best fit

One strategy for fitting a best line through the data would be to minimize the sum of the residual errors for all the available data, as in

Eq. 3:  Σ ei = Σ (yi − a0 − a1xi)

where n = total number of points and the summations run from i = 1 to n. This criterion is inadequate, however. Consider fitting a straight line to two points: the best fit is obviously the line connecting the points, yet any straight line passing through the midpoint of the connecting line (except a perfectly vertical line) also results in a minimum value of Eq. 3 equal to zero, because positive and negative errors cancel.

Criteria for best fit

One way to remove the effect of the signs might be to minimize the sum of the absolute values of the discrepancies, as in

Eq. 4:  Σ |ei| = Σ |yi − a0 − a1xi|

Figure (b) demonstrates why this criterion is also inadequate. For the four points shown, any straight line falling within the dashed lines will minimize the sum of the absolute values of the residuals.

Criteria for best fit

A third strategy for fitting a best line is the minimax criterion. In this technique, the line is chosen that minimizes the maximum distance that an individual point falls from the line. As depicted in Fig. (c), this strategy is ill-suited for regression because it gives undue influence to an outlier, that is, a single point with a large error.

Criteria for best fit

A strategy that overcomes the shortcomings of the aforementioned approaches is to minimize the sum of the squares of the residuals:

Eq. 5:  Sr = Σ ei² = Σ (yi − a0 − a1xi)²

This criterion, which is called least squares, has a number of advantages, including that it yields a unique line for a given set of data.
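To make Eqs. 2 and 5 concrete, here is a minimal MATLAB sketch that evaluates Sr for a candidate line; the data vectors and the coefficients are hypothetical placeholders for illustration, not values from the text:

% Hypothetical data and a candidate line, for illustration only
x  = [1 2 3 4 5];                  % independent variable
y  = [2.1 3.9 6.2 7.8 10.1];       % dependent variable
a0 = 0.1;  a1 = 2.0;               % candidate intercept and slope
e  = y - (a0 + a1*x);              % residuals, Eq. 2
Sr = sum(e.^2)                     % sum of the squared residuals, Eq. 5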

Least squares fit for a straight line

To determine values for a0 and a1, Eq. (5) is differentiated with respect to each unknown coefficient:

Eq. 6:  ∂Sr/∂a0 = −2 Σ (yi − a0 − a1xi)
        ∂Sr/∂a1 = −2 Σ [(yi − a0 − a1xi) xi]

All summations are from i = 1 to n. Setting these derivatives equal to zero will result in a minimum Sr. If this is done, the equations can be expressed as

0 = Σ yi − Σ a0 − Σ a1xi
0 = Σ xiyi − Σ a0xi − Σ a1xi²

Least squares fit for a straight line

Now, realizing that Σ a0 = n·a0, we can express the equations as a set of two simultaneous linear equations with two unknowns (a0 and a1):

Eq. 7:  n·a0 + (Σ xi) a1 = Σ yi
Eq. 8:  (Σ xi) a0 + (Σ xi²) a1 = Σ xiyi

They can be solved simultaneously for

Eq. 9:  a1 = (n Σ xiyi − Σ xi Σ yi) / (n Σ xi² − (Σ xi)²)

This result can then be used in conjunction with Eq. (7) to solve for

Eq. 10:  a0 = ȳ − a1x̄

where ȳ and x̄ are the means of y and x, respectively.
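Eqs. 9 and 10 translate directly into MATLAB. A minimal sketch, assuming x and y are data vectors of equal length:

% x and y: data vectors of equal length n (continuing the sketch above)
n   = length(x);
Sx  = sum(x);      Sy  = sum(y);
Sxy = sum(x.*y);   Sx2 = sum(x.^2);
a1  = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2);   % slope, Eq. 9
a0  = mean(y) - a1*mean(x);               % intercept, Eq. 10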

Example 1

Fit a straight line to the values in Table 1.

Table 1: Experimental data for force (N) and velocity (m/s) from a wind tunnel experiment.

Example 1

Table 2: Data and summations needed to compute the best-fit line for the data from Table 1.

The means can be computed as x̄ = Σ xi / n and ȳ = Σ yi / n.

Example 1

The slope and the intercept can then be calculated with Eqs. (9) and (10). Using force in place of y and velocity in place of x, the least-squares fit is F = a0 + a1v.


Quantification of Error of Linear Regression

Any line other than the one computed in Example 1 results in a larger sum of the squares of the residuals. Thus, the line is unique and, in terms of our chosen criterion, is a best line through the points. A number of additional properties of this fit can be elucidated by examining more closely the way in which the residuals were computed:

Eq. 11:  Sr = Σ ei² = Σ (yi − a0 − a1xi)²

Eq. 12:  St = Σ (yi − ȳ)²

where St is the total sum of the squares of the residuals between the data points and the mean.

Quantification of Error of Linear Regression

A standard deviation for the regression line can be determined as

Eq. 13:  sy/x = √( Sr / (n − 2) )

where sy/x is called the standard error of the estimate.

The residual in linear regression represents the vertical distance between a data point and the straight line.

Figure: Examples of linear regression with (a) small and (b) large residual errors.

Quantification of Error of Linear Regression

After performing the regression, we can compute Sr, the sum of the squares of the residuals around the regression line, with Eq. (11). This characterizes the residual error that remains after the regression; it is, therefore, sometimes called the unexplained sum of the squares. The difference between the two quantities, St − Sr, quantifies the improvement or error reduction due to describing the data in terms of a straight line rather than as an average value. Because the magnitude of this quantity is scale-dependent, the difference is normalized to St to yield

Eq. 14:  r² = (St − Sr) / St

where r² is called the coefficient of determination and r is the correlation coefficient (= √r²).

Quantification of Error of Linear Regression

• For a perfect fit, Sr = 0 and r² = 1, signifying that the line explains 100% of the variability of the data.

• For r² = 0, Sr = St and the fit represents no improvement.

An alternative formulation for r that is more convenient for computer implementation is

Eq. 15:  r = [n Σ xiyi − (Σ xi)(Σ yi)] / [√(n Σ xi² − (Σ xi)²) · √(n Σ yi² − (Σ yi)²)]
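Continuing the MATLAB sketch above (reusing n, Sx, Sy, Sxy, and Sx2), the goodness-of-fit statistics of Eqs. 11 to 15 could be computed as:

St  = sum((y - mean(y)).^2);    % total sum of squares around the mean, Eq. 12
Sr  = sum((y - a0 - a1*x).^2);  % unexplained (residual) sum of squares, Eq. 11
sy  = sqrt(St/(n-1));           % total standard deviation of y
syx = sqrt(Sr/(n-2));           % standard error of the estimate, Eq. 13
r2  = (St - Sr)/St;             % coefficient of determination, Eq. 14
% Eq. 15: formulation of r convenient for computer implementation
r   = (n*Sxy - Sx*Sy) / (sqrt(n*Sx2 - Sx^2) * sqrt(n*sum(y.^2) - Sy^2));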

Example 2

Compute the total standard deviation, the standard error of the estimate, and the correlation coefficient for the fit in Example 1.

Table 1: Experimental data for force (N) and velocity (m/s) from a wind tunnel experiment.

Example 2

Table 3: Data and summations needed to compute the goodness-of-fit statistics for the data from Table 2.


Example 2

The standard deviation sy and the standard error of the estimate sy/x can be computed from the summations in Table 3. Because sy/x < sy, the linear regression model has merit. The extent of the improvement is quantified by the coefficient of determination: these results indicate that 88.05% of the original uncertainty has been explained by the linear model (r² = 0.8805).

Non-linear curves

• Although the coefficient of determination provides a handy measure of goodness-of-fit, you should be careful not to ascribe more meaning to it than is warranted.

• Just because r² is close to 1 does not mean that the fit is necessarily good. For example, it is possible to obtain a relatively high value of r² when the underlying relationship between y and x is not even linear.

• A nice example was developed by Anscombe (1973): he came up with four data sets consisting of 11 data points each. Although their graphs are very different, as in Fig. 1, all have the same best-fit equation, y = 3 + 0.5x, and the same coefficient of determination, r² = 0.67!

• You should always inspect a plot of the data along with your regression curve.

Fig 1: Anscombe's four data sets along with the best-fit line y = 3 + 0.5x.

Linearization of Non-linear relationships

Linear regression provides a powerful technique for fitting a best line to data. However, it is predicated on the fact that the relationship between the dependent and independent variables is linear. This is not always the case, and the first step in any regression analysis should be to plot and visually inspect the data to ascertain whether a linear model applies.

Linearization of Non-linear relationships

Exponential model:

Eq. 16:  y = α1 e^(β1 x)

where α1 and β1 are constants. This model is used in many fields of engineering and science to characterize quantities that increase (positive β1) or decrease (negative β1) at a rate that is directly proportional to their own magnitude. For example, population growth or radioactive decay can exhibit such behavior.

Linearization of Non-linear relationships

Power equation:

Eq. 17:  y = α2 x^(β2)

where α2 and β2 are constant coefficients. This model has wide applicability in all fields of engineering and science. It is very frequently used to fit experimental data when the underlying model is not known.

Linearization of Non-linear relationships

Saturation-growth-rate equation:

Eq. 18:  y = α3 x / (β3 + x)

where α3 and β3 are constant coefficients. This model, which is particularly well-suited for characterizing population growth rate under limiting conditions, also represents a nonlinear relationship between y and x that levels off, or "saturates", as x increases. It has many applications, particularly in biologically related areas of both engineering and science.

Linearization of Non-linear relationships

Figure: (a) The exponential equation, (b) the power equation, and (c) the saturation-growth-rate equation. Parts (d), (e), and (f) are linearized versions of these equations that result from simple transformations.
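The simple transformations in question are the standard ones; they follow directly from Eqs. 16 to 18 by taking logarithms or reciprocals:

ln y = ln α1 + β1 x                  (linearized exponential model)
log10 y = log10 α2 + β2 log10 x      (linearized power equation)
1/y = 1/α3 + (β3/α3)(1/x)            (linearized saturation-growth-rate equation)

In each case the model is linear in the transformed variables, so a0 and a1 can be computed with Eqs. 9 and 10 and then back-transformed to recover the original coefficients.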

Example 3: Fitting Data with the Power Equation

Fit Eq. (17) to the data in Table 1 using a logarithmic transformation.

Table 1: Experimental data for force (N) and velocity (m/s) from a wind tunnel experiment.

Example 3: Fitting Data with the Power Equation

The data can be set up in tabular form and the necessary sums computed as in Table 4.

Table 4: Data and summations needed to fit the power model to the data from Table 1.

The means of the transformed variables can be computed as x̄ = Σ log10 vi / n and ȳ = Σ log10 Fi / n.

Example 3: Fitting Data with the Power Equation

The slope and the intercept can then be calculated with Eqs. (9) and (10), giving the least-squares fit of the transformed data, log10 F = a0 + a1 log10 v.


Example 3: Fitting Data with the Power Equation

Figure: Least-squares fit of a power model to the data from Table 1. (a) The fit of the transformed data. (b) The power equation fit along with the data.

Matlab implementation

The straight-line least-squares fit is easily packaged as an M-file. A simple example of the use of this M-file would be to fit the force-velocity data analyzed in Example 1.
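The M-file listing itself does not survive in this text, so the following is a minimal sketch of what such a function could look like; the name linregr, the slope-first coefficient ordering, and the r² output are assumptions patterned on the description above:

function [a, r2] = linregr(x, y)
% linregr: least-squares fit of a straight line (sketch; interface assumed)
%   a(1) = slope, a(2) = intercept; r2 = coefficient of determination
x = x(:); y = y(:);                            % force column vectors
n = length(x);
if length(y) ~= n, error('x and y must be the same length'), end
Sx = sum(x); Sy = sum(y); Sxy = sum(x.*y); Sx2 = sum(x.^2);
a(1) = (n*Sxy - Sx*Sy)/(n*Sx2 - Sx^2);         % slope, Eq. 9
a(2) = mean(y) - a(1)*mean(x);                 % intercept, Eq. 10
r2 = ((n*Sxy - Sx*Sy) / ...
      (sqrt(n*Sx2 - Sx^2)*sqrt(n*sum(y.^2) - Sy^2)))^2;  % r² via Eq. 15
end

With v and F holding the measured velocities and forces (vector names illustrative), a call might look like:

>> [a, r2] = linregr(v, F)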


Matlab implementation

It can just as easily be used to fit the power model (Example 3) by applying the log10 function to the data, as in the sketch below.
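Continuing with the assumed linregr interface, a hedged sketch of the log-transformed fit (v and F are again illustrative names for the measured data):

% Fit the power model F = alpha2 * v^beta2 by regressing the logs (Eq. 17)
[a, r2] = linregr(log10(v), log10(F));
beta2  = a(1);        % exponent: slope of the transformed fit
alpha2 = 10^a(2);     % coefficient: intercept back-transformed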


General Comments on Linear Regression

1. Each x has a fixed value; it is not random and is known without error.
2. The y values are independent random variables and all have the same variance.
3. The y values for a given x must be normally distributed.