

Regression is the fitting of a function to a set of observations. Usually there are two variables (more are possible), divided into two types: experimental variables and a response variable. The Y variable is the observed response to the X values, so the data are naturally paired points.

The X variable can be of two sorts: (1) it can take only values chosen by the experimenter, or (2) it can be observed, just like the Y values. The calculations for either situation are the same, but the interpretation can differ.

If the function is a linear function (all experimental variables are raised to the power 1), then the relationship is linear and this is called linear regression. If Y is the response variable, and there is one experimental variable, X, then the function has the familiar form:

yi = b0 + b1xi

with n observations (the index i goes from 1 to n). Here b1 is the slope and b0 is the intercept.

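There are also non-linear regression situations, such as polynomial regressions like: yi = b0 + b1xi + b2xi^2 + b3xi^3 + ... (the fitted curve is not a straight line in x, although the equation is still linear in the coefficients b0, b1, b2, ...). Linear regression is very useful and can describe the relationship among many variables. A fitted regression does not mean that X causes Y, only that X and Y are ASSOCIATED.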

Fitting the Linear Regression Line

You can fit a line by eye, using a plot of the data, a pencil, and a ruler. But is that the best fit possible? To answer this question, you need a criterion for determining what constitutes the "best" fit; the method used is called the LEAST SQUARES METHOD. After we define some terms, we can better describe what this method is. Using the linear function yi = b0 + b1xi (defined above), we can calculate both b1 and b0 as follows:

$$b_1 = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$

The x-bar and y-bar refer to the mean of the x and y values, respectively.
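
As a quick check of the slope formula, the following Python sketch computes b1 in two ways: from the raw sums (sum of x times y minus n times x-bar times y-bar, over sum of x squared minus n times x-bar squared) and from the equivalent deviations-from-the-mean form. The function names and the small data set are made up for illustration and are not part of the original notes.

```python
def slope_from_sums(xs, ys):
    """Slope b1 from raw sums: (sum(xy) - n*xbar*ybar) / (sum(x^2) - n*xbar^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    denominator = sum(x * x for x in xs) - n * x_bar ** 2
    return numerator / denominator


def slope_from_deviations(xs, ys):
    """Same slope, written with deviations of x and y from their means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical data, chosen only to keep the arithmetic easy to follow.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b1_sums = slope_from_sums(xs, ys)
b1_devs = slope_from_deviations(xs, ys)
print(b1_sums, b1_devs)  # both are approximately 0.6
```

The two forms agree up to floating-point rounding; the raw-sums form is convenient for hand calculation, while the deviations form makes the "corrected for the mean" idea explicit.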

What is the slope, then? The numerator of b1 is the sum of the products of the x and y values, corrected for their respective means; the denominator is the sum of the squared x values, corrected for the mean of x. The least-squares regression line always goes through (x-bar, y-bar), the point on the graph that represents the mean of both variables. We can therefore get the intercept from the equation of the line by substituting x-bar and y-bar for xi and yi and rearranging:

$$b_0 = \bar{y} - b_1\bar{x}$$
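
A minimal Python sketch of the full fit, assuming the intercept is obtained as b0 = y-bar minus b1 times x-bar (the result of substituting the means into the line and rearranging). The helper name `fit_line` and the data are hypothetical. The last lines check numerically that the fitted line passes through (x-bar, y-bar).

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    denominator = sum(x * x for x in xs) - n * x_bar ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / denominator
    b0 = y_bar - b1 * x_bar  # forces the line through (x_bar, y_bar)
    return b0, b1


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Gap between the line's value at x_bar and y_bar; it should be essentially zero.
on_line_gap = (b0 + b1 * x_bar) - y_bar
print(b0, b1, on_line_gap)  # roughly 2.2, 0.6 and 0.0
```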

The diagram below gives the data points in blue and the regression line in red. No data point falls exactly on the line; a point can fall on it, but only by chance. However, (x-bar, y-bar), the point defined by the means of the x and y values, is always on the line.

Now look at all of the arrows. They identify four points. The first to notice is xi, a data point chosen at random.

Follow the dashed vertical line up from xi until it reaches the regression line. That point is (xi, y-cap). Y-cap is the symbol for the PREDICTED y, given a particular xi and the linear equation estimated by the least-squares procedure. The predicted value for a given xi is obtained by substituting the value of xi into the fitted linear equation: y-cap = b0 + b1xi. If you follow the horizontal line over to the y-axis from (xi, y-cap), you come to y-cap on the axis.

The difference between yi and y-cap is the RESIDUAL for yi: Residual = yi - y-cap. The line segment between (xi, y-cap) and (xi, yi) is the vertical distance from the line to the data point, and this distance is the residual of yi. Note that the vertical distance is not the shortest distance between the line and the data point; the shortest distance would be measured perpendicular to the line.

Now we can go back to the least-squares criterion: the "best" line is the one that makes the sum of the squared residuals as small as possible. That sum is called the RESIDUAL SUM OF SQUARES:

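$$SS(\text{residual}) = \sum (y_i - \hat{y}_i)^2 = \sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i$$

In the first form, the second y term is y-cap, the predicted y, not y-bar, the average y. The residual SS gives us the ability to calculate the standard deviation of the residuals:

$$s_{Y|X} = \sqrt{\frac{SS(\text{residual})}{n - 2}}$$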
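
The residual calculations can be sketched in Python as follows. The sketch assumes the standard deviation of the residuals is the square root of the residual SS divided by n - 2, and it computes the residual SS both from the residuals themselves and from the shortcut sums. The data and the coefficients b0 = 2.2 and b1 = 0.6 (the least-squares fit for this made-up data set) are illustrative assumptions.

```python
import math

# Hypothetical data and its least-squares coefficients (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

# Predicted values (y-cap) and residuals (yi - y-cap).
y_hat = [b0 + b1 * x for x in xs]
residuals = [y - yh for y, yh in zip(ys, y_hat)]

# Residual SS from the definition: sum of squared residuals.
ss_res_definition = sum(r ** 2 for r in residuals)

# Residual SS from the shortcut sums; valid only when b0 and b1
# are the least-squares estimates for this data.
ss_res_shortcut = (
    sum(y * y for y in ys)
    - b0 * sum(ys)
    - b1 * sum(x * y for x, y in zip(xs, ys))
)

# Standard deviation of the residuals, using n - 2 in the denominator.
n = len(xs)
s_y_given_x = math.sqrt(ss_res_definition / (n - 2))

print(ss_res_definition, ss_res_shortcut, s_y_given_x)
```

For a least-squares line with an intercept, the residuals sum to zero (up to rounding), which is a handy sanity check on hand calculations.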

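About 95% of all residuals will be within 2 standard deviations of the line, provided the residuals are roughly normally distributed. Note the subscript of s in the standard deviation of the residuals: it is Y given X, meaning that x has been used to predict y.

Coefficient of Determination

We need to define a couple of terms before beginning this section. Both are sums of squares and they look similar:

$$SS(\text{total}) = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$$

$$SS(\text{regression}) = \sum (\hat{y}_i - \bar{y})^2$$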

SS (total) is the SS of the y data points corrected for the mean of the y's (y-bar).

SS (regression) is the SS of the predicted y's (y-caps) corrected for the mean of the y's (so the difference here is between the predicted y and the mean y). The relationship among the three SS for regression is: SS (total) = SS (regression) + SS (error), where SS (error) is the residual sum of squares.
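
The partition of the total SS can be checked numerically. The Python sketch below uses a hypothetical helper, `sums_of_squares`, and made-up data with its least-squares coefficients; it is an illustration of the identity, not a general proof.

```python
def sums_of_squares(xs, ys, b0, b1):
    """Return (SS total, SS regression, SS error) for the line y = b0 + b1*x."""
    n = len(ys)
    y_bar = sum(ys) / n
    y_hat = [b0 + b1 * x for x in xs]
    ss_total = sum((y - y_bar) ** 2 for y in ys)            # data vs. mean
    ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)  # prediction vs. mean
    ss_error = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # data vs. prediction
    return ss_total, ss_regression, ss_error


# Hypothetical data and its least-squares coefficients (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

ss_total, ss_regression, ss_error = sums_of_squares(xs, ys, b0, b1)
print(ss_total, ss_regression + ss_error)  # the two values agree up to rounding
```

The identity holds only when b0 and b1 are the least-squares estimates; for an arbitrary line the two sides generally differ.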

The ratio of SS (regression) to SS (total) is called the COEFFICIENT OF DETERMINATION and is represented by r^2:

$$r^2 = \frac{SS(\text{regression})}{SS(\text{total})}$$

This term is often expressed as a percentage, and it represents the proportion of the total variation in y explained by the regression line. If all of the points lie on the line, then it is 100%; if any point lies off the line (as all points do in the graphs above), then it will be less than 100%.
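
To illustrate both cases, the Python sketch below computes r^2 as SS (regression) divided by SS (total) for two made-up data sets: one with scatter around the line and one whose points lie exactly on a line. The function name `r_squared` is hypothetical, and the fit uses the deviations-from-the-mean form of the slope.

```python
def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Least-squares slope and intercept.
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b0 = y_bar - b1 * x_bar
    # Proportion of the total variation in y explained by the line.
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_regression = sum((b0 + b1 * x - y_bar) ** 2 for x in xs)
    return ss_regression / ss_total


# Hypothetical data with scatter around the line.
r2_scattered = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# Hypothetical data lying exactly on y = 1 + 2x.
r2_perfect = r_squared([1, 2, 3], [3, 5, 7])

print(f"{r2_scattered:.0%}", f"{r2_perfect:.0%}")
```

Note that the function divides by SS (total), so it is undefined when all y values are identical.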

The coefficient of determination can also be described as the ratio of the COVARIATION between x and y to the total variation in both x and y.

By: waqas mehmood (BS: acc&fin)
Dated: 28.10.2010