Ordinary least squares (OLS) regression is a statistical method of analysis that
estimates the relationship between one or more independent variables and a
dependent variable; the method estimates the relationship by minimizing the sum
of the squared differences between the observed values of the dependent variable
and the values predicted by a straight line. In this entry, OLS regression will
be discussed in the context of a bivariate model, that is, a model in which there is
only one independent variable ( X ) predicting a dependent variable ( Y ). However,
the logic of OLS regression is easily extended to the multivariate model in which
there are two or more independent variables.
Social scientists are often concerned with questions about the relationship between
two variables. These include the following: Among women, is there a relationship
between education and fertility? Do more-educated women have fewer children,
and less-educated women have more children? Among countries, is there a
relationship between gross national product (GNP) and life expectancy? Do
countries with higher levels of GNP have higher levels of life expectancy, and
countries with lower levels of GNP, lower levels of life expectancy? Among
countries, is there a positive relationship between employment opportunities and
net migration? Among people, is there a relationship between age and values of
baseline systolic blood pressure? (Lewis-Beck 1980; Vittinghoff et al. 2005).
The bivariate regression model may be written as

Y = a + bX + e

where a is the intercept and indicates where the straight line intersects the Y-axis
(the vertical axis); b is the slope and indicates the degree of steepness of the
straight line; and e represents the error.
The error term indicates that the relationship predicted in the equation is not
perfect. That is, the straight line does not perfectly predict Y. This lack of a perfect
prediction is common in the social sciences. For instance, in terms of the education
and fertility relationship mentioned above, we would not expect all women with
exactly sixteen years of education to have exactly one child, and women with
exactly four years of education to have exactly eight children. But we would
expect that a woman with a lot of education would have fewer children than a
woman with a little education. Stated in another way, the number of children born
to a woman is likely to be a linear function of her education, plus some error.
Actually, in low-fertility societies, Poisson and negative binomial regression
methods are preferred over ordinary least squares regression methods for the
prediction of fertility (Poston 2002; Poston and McKibben 2003).
We first introduce a note about the notation used in this entry. In the social
sciences we almost always undertake research with samples drawn from larger
populations, say, a 1 percent random sample of the U.S. population. Greek letters
like α and β are used to denote the parameters (i.e., the intercept and slope values)
representing the relationship between X and Y in the larger population, whereas
lowercase Roman letters like a and b will be used to denote the parameters in the
sample.
But given that we wish to use a straight line for relating variable Y, the dependent
variable, with variable X, the independent variable, there is a question about which
line to use. In any scatterplot of observations of X and Y values (see Figure 1),
there would be an infinite number of straight lines that might be used to represent
the relationship. Which line is the best line?
The chosen straight line needs to be the one that minimizes the amount of error
between the predicted values of Y and the actual values of Y. Specifically, for each
of the i th observations in the sample, if one were to square the difference between
the observed and predicted values of Y, and then sum these squared differences, the
best line would have the lowest sum of squared errors (SSE), represented as
follows:

SSE = Σ (Yi − Ŷi)²

where Ŷi is the predicted value of Y for the i th observation.
Ordinary least squares regression is a statistical method that produces the one
straight line that minimizes the total squared error.
Using the calculus, it may be shown that SSE is the lowest or the “least” amount
when the coefficients a and b are calculated with these formulas (Hamilton 1992,
p. 33):

b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²

a = Ȳ − bX̄
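As a minimal sketch, the least squares formulas for a and b may be computed directly; the data below are small illustrative numbers, not the datasets discussed in this entry.

```python
# A sketch of the OLS formulas for the intercept a and slope b,
# applied to small illustrative data.
def ols_fit(x, y):
    """Return the intercept a and slope b that minimize SSE."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # b = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
    # a = Ybar - b * Xbar
    a = mean_y - b * mean_x
    return a, b

a, b = ols_fit([2, 3, 5, 7, 9], [4, 5, 7, 10, 15])
```

Any other candidate line through the same points would produce a larger sum of squared errors than the (a, b) pair returned here.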
We now will apply the least squares principles. We are interested in the extent to
which there is a relationship among the counties of China between the fertility rate
(the dependent variable) and the level of illiteracy (the independent variable).
China had 2,372 counties in 1982. We hypothesize that counties with populations
that are heavily illiterate will have higher fertility rates than those with populations
with low levels of illiteracy.
The dependent variable, Y, is the general fertility rate, GFR, that is, the number of
children born in 1982 per 1,000 women in the age group fifteen to forty-nine. The
independent variable, X, is the percentage of the population in the county in 1981
aged twelve or more who are illiterate.
Equation (1) may be estimated using the least squares formulas for a and b in
equations (3) and (4). This produces the following:

Ŷ = 57.56 + 1.19X (5)

The OLS results in equation (5) indicate that the intercept value is 57.56, and the
slope value is 1.19. The intercept, or a, indicates the point where the regression
line “intercepts” the Y-axis. It tells the average value of Y when X = 0. Thus, in this
China dataset, the value of a indicates that a county with no illiterate persons in the
population would have an expected fertility rate of 57.6 children per 1,000 women
aged fifteen to forty-nine. The slope, or b, indicates that for every increase of 1
percent in a county’s illiteracy rate, the fertility rate increases, on average, by 1.19
children per 1,000 women aged fifteen to forty-nine.
We would probably want to interpret this b coefficient in the other direction; that
is, it makes more sense to say that if we reduce the county’s illiteracy rate by 1
percent, this would result in an average reduction of 1.2 children per 1,000 women
aged fifteen to forty-nine. This kind of interpretation would be consistent with a
policy intervention that a government might wish to use; that is, a lower illiteracy
rate would tend to result in a lower fertility rate.
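The fitted China-county equation can be sketched as a small prediction function; the illiteracy values plugged in below are illustrative, not counties from the actual dataset.

```python
# Sketch: the fitted equation for the China counties, GFR-hat = 57.56 + 1.19 * X,
# where X is a county's illiteracy rate in percent (example inputs are hypothetical).
def predicted_gfr(illiteracy_pct):
    return 57.56 + 1.19 * illiteracy_pct

low = predicted_gfr(10)   # a county with 10 percent illiteracy
high = predicted_gfr(11)  # one percentage point higher
# The difference high - low is the slope: 1.19 more children per 1,000 women.
```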
The regression line may be plotted in the above scatterplot, as shown in Figure 2.
It is noted that while in general the relationship between illiteracy and fertility is
linear, there is a lot of error in the prediction of county fertility with a knowledge
of county illiteracy. Whereas some counties lie right on or close to the regression
line, and therefore, their illiteracy rates perfectly or near perfectly predict their
fertility rates, the predictions for other counties are not as good.
One way to appraise the overall predictive efficiency of the OLS model is
to “eyeball” the relationship as we have done above. How well does the above OLS
equation correspond with variation in the fertility rates of the counties? As we
noted above, the relationship appears to be positive and linear. A more accurate
statistical approach to address the question of how well the data points fit the
regression line is with the coefficient of determination ( R 2).
We start by considering the problem of predicting Y, the fertility rate, when we
have no other knowledge about the observations (the counties). That is, if we only
know the values of Y for the observations, then the best prediction of Y, the fertility
rate, is the mean of Y. It is believed that Carl Friedrich Gauss (1777–1855) was the
first to demonstrate that lacking any other information about a variable’s value for
any one subject, the arithmetic mean is the most probable value (Gauss [1809]
2004, p. 244).
But if we guess the mean of Y for every case, we will have lots of poor predictions
and lots of error. When we have information about the values of X, predictive
efficiency may be improved, as long as X has a relationship with Y. “The question
then is, how much does this knowledge of X improve our prediction
of Y ?” (Lewis-Beck 1980, p. 20).
First, consider the sum of the squared differences of each observation’s value
on Y from the mean of Y. This is the total sum of squares (TSS) and represents the
total amount of statistical variation in Y, the dependent variable.
Values on X are then introduced for all the observations (the Chinese counties),
and the OLS regression equation is estimated. The regression line is plotted (as in
the scatterplot in Figure 2), and the actual values of Y for all the observations are
compared to their predicted values of Y. The sum of the squared differences
between the predicted values of Y and the mean of Y is the explained sum of
squares (ESS), sometimes referred to as the model sum of squares. This represents
the amount of the total variation in Y that is accounted for by X. The difference
between TSS and ESS is the amount of the variation in Y that is not explained
by X, known as the residual sum of squares (RSS).
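The TSS = ESS + RSS decomposition described above can be verified numerically. The sketch below uses small illustrative data (the 2,372 county values are not reproduced here) and fits the line with the standard least squares formulas.

```python
# Sketch of the sums-of-squares decomposition and R^2 on illustrative data.
x = [2, 3, 5, 7, 9]
y = [4, 5, 7, 10, 15]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Fit the OLS line.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x
y_hat = [a + b * xi for xi in x]

tss = sum((yi - mean_y) ** 2 for yi in y)              # total sum of squares
ess = sum((yh - mean_y) ** 2 for yh in y_hat)          # explained (model) sum of squares
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual sum of squares

r_squared = ess / tss  # equivalently 1 - rss / tss
```

For an OLS fit, TSS equals ESS plus RSS exactly, which is what makes R² interpretable as the proportion of variation in Y explained by X.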
Poston, Dudley L., Jr. 2002. The Statistical Modeling of the Fertility of Chinese
Women. Journal of Modern Applied Statistical Methods 1 (2): 387–396.
Poston, Dudley L., Jr., and Sherry L. McKibben. 2003. Zero-inflated Count
Regression Models to Estimate the Fertility of U.S. Women. Journal of Modern
Applied Statistical Methods 2 (2): 371–379.
Imagine you have some points, and want to have a line that best fits
them like this:
We can place the line "by eye": try to have the line as close as possible to
all points, and a similar number of points above and below the line.
But for better accuracy let's see how to calculate the line using Least
Squares Regression.
The Line
y = mx + b
Where:
y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis)
Steps
To find the line of best fit for N points:
Step 1: For each (x,y) point calculate x² and xy
Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy ( Σ
means "sum up" )
Step 3: Calculate the slope m:
m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)
Step 4: Calculate the intercept b:
b = (Σy − m Σx) / N
Step 5: Assemble the equation of a line:
y = mx + b
Done!
Example
Example: Sam found how many hours of sunshine vs how many ice
creams were sold at the shop from Monday to Friday:

  Hours of Sunshine (x)   Ice Creams Sold (y)
           2                      4
           3                      5
           5                      7
           7                     10
           9                     15
Let us find the best m (slope) and b (y-intercept) that suits that data

y = mx + b

    x      y      x²     xy
    2      4       4      8
    3      5       9     15
    5      7      25     35
    7     10      49     70
    9     15      81    135
  Σx=26  Σy=41  Σx²=168  Σxy=263

m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)
  = (5 × 263 − 26 × 41) / (5 × 168 − 26²)
  = 249/164
  = 1.5183...

b = (Σy − m Σx) / N
  = (41 − 1.5183 × 26) / 5
  = 0.3049...
y = mx + b
y = 1.518x + 0.305
Here is the fitted line evaluated at each x, with the error (predicted − actual):

    x      y    y = 1.518x + 0.305   error
    2      4          3.34           −0.66
    3      5          4.86           −0.14
    5      7          7.89            0.89
    7     10         10.93            0.93
    9     15         13.97           −1.03

Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:
Nice fit!
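Sam's worked example above can be checked in a few lines of code using the same summation formulas:

```python
# Checking Sam's worked example with the summation formulas from the steps above.
x = [2, 3, 5, 7, 9]    # hours of sunshine
y = [4, 5, 7, 10, 15]  # ice creams sold
N = len(x)

sum_x, sum_y = sum(x), sum(y)                  # 26, 41
sum_x2 = sum(xi ** 2 for xi in x)              # 168
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 263

m = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)  # 249/164 = 1.5183...
b = (sum_y - m * sum_x) / N                                   # 0.3049...
```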
Sam hears the weather forecast which says "we expect 8 hours of sun
tomorrow", so he uses the above equation to estimate that he will sell

y = 1.518 × 8 + 0.305 = 12.45 ice creams

Sam makes fresh waffle cone mixture for 14 ice creams just in case. Yum.
So, when we square each of those errors and add them all up, the total is
as small as possible.
You can imagine (but not accurately) each data point connected to a
straight bar by springs:
Boing!
Ordinary least squares regression (OLSR) is a generalized linear modeling
technique. It is used for estimating all unknown parameters in a linear
regression model, with the goal of minimizing the sum of the squares of the
differences between the observed values of the dependent variable and the values
predicted from the explanatory variables.
Ordinary least squares regression is also known as ordinary least squares or
least squared errors regression.
There is little extra to know beyond regression with one explanatory variable.
The main addition is the F-test for overall fit.
This requires the Data Analysis Add-in: see Excel 2007: Access and Activating
the Data Analysis Add-in
The data used are in carsdata.xls
We have regression with an intercept and the regressors HH SIZE and CUBED
HH SIZE
The population regression model is: y = β1 + β2 x2 + β3 x3 + u
It is assumed that the error u is independent with constant variance
(homoskedastic) - see EXCEL LIMITATIONS at the bottom.
The only change over one-variable regression is to include more than one column
in the Input X Range.
Note, however, that the regressors need to be in contiguous columns (here
columns B and C).
If this is not the case in the original data, then columns need to be copied to get
the regressors in contiguous columns.
Hitting OK, we obtain the regression output.
Explanation
R Square             0.802508    R²
Adjusted R Square    0.605016    Adjusted R², used if more than one x variable
The standard error here refers to the estimated standard deviation of the error
term u.
It is sometimes called the standard error of the regression. It equals
sqrt(SSE/(n-k)).
It is not to be confused with the standard error of y itself (from descriptive
statistics) or with the standard errors of the regression coefficients given below.
R2 = 0.8025 means that 80.25% of the variation of yi around ybar (its mean) is
explained by the regressors x2i and x3i.
              df      SS
Regression     2    1.6050
Residual       2    0.3950
Total          4    2.0000
The ANOVA (analysis of variance) table splits the sum of squares into its
components.
For example:
R² = 1 - Residual SS / Total SS (general formula for R²)
   = 1 - 0.3950 / 2.0 (from data in the ANOVA table)
   = 0.8025 (which equals the R² given in the Regression Statistics
table).
The column labeled F gives the overall F-test of H0: β2 = 0 and β3 = 0 versus Ha:
at least one of β2 and β3 does not equal zero.
Aside: Excel computes F as:
F = [Regression SS/(k-1)] / [Residual SS/(n-k)] = [1.6050/2] / [0.39498/2] = 4.0635.
The regression output of most interest is the following table of coefficients and
associated output:
Let βj denote the population coefficient of the jth regressor (intercept, HH SIZE
and CUBED HH SIZE).
Then:
Column "P-value" gives the p-value for the test of H0: βj = 0 against Ha: βj ≠ 0.
Columns "Lower 95%" and "Upper 95%" define a 95% confidence
interval for βj.
The 95% confidence interval for the slope coefficient β2 is, from the Excel output,
(-1.4823, 2.1552).
Suppose we wish to test H0: β2 = 1.0 against Ha: β2 ≠ 1.0. Then
t = (b2 - H0 value of β2) / (standard error of b2)
  = (0.33647 - 1.0) / 0.42270
  = -1.569.
We computed t = -1.569
The critical value is t_.025(2) = TINV(0.05,2) = 4.303. [Here n=5 and k=3
so n-k=2].
So do not reject the null hypothesis at level .05 since |t| = 1.569 < 4.303.
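The single-coefficient test just described amounts to a few lines of arithmetic:

```python
# Reproducing the t test of H0: beta2 = 1 from the coefficient output above.
b2, se_b2 = 0.33647, 0.42270
t = (b2 - 1.0) / se_b2  # -1.569...

critical = 4.303  # t_.025 with n - k = 2 degrees of freedom
reject = abs(t) > critical  # False: do not reject H0 at the .05 level
```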
OVERALL TEST OF SIGNIFICANCE OF THE REGRESSION
PARAMETERS
We test H0: β2 = 0 and β3 = 0 versus Ha: at least one of β2 and β3 does not equal
zero.
From the ANOVA table the F-test statistic is 4.0635 with p-value of 0.1975.
Since the p-value is not less than 0.05 we do not reject the null hypothesis that
the regression parameters are zero at significance level 0.05.
Conclude that the parameters are jointly statistically insignificant at
significance level 0.05.
Consider the case where x = 4, in which case CUBED HH SIZE = x^3 = 4^3 = 64.
EXCEL LIMITATIONS
Excel standard errors and t-statistics and p-values are based on the assumption
that the error is independent with constant variance (homoskedastic).
Excel does not provide alternatives, such as heteroskedastic-robust or
autocorrelation-robust standard errors, t-statistics, and p-values.
More specialized software such as STATA, EVIEWS, SAS, LIMDEP, PC-TSP, ...
is needed.
Suppose you want to predict the amount of ice cream sales you would make
based on the temperature of the day; you can then plot a regression line
through the data points.
This regression line is the best fit line, predicting the sale of ice creams with the
best possible accuracy. One of the methods to draw this line is the least
squares method.
Least squares is one of the methods to find the best fit line for a dataset using
linear regression. The most common application is to create a straight line
that minimizes the sum of the squared errors, that is, the differences between
the observed values and the values predicted by the model.
Least-squares problems fall into two categories: linear and nonlinear least
squares, depending on whether or not the residuals are linear in all unknowns.
Follow the below steps to find the best fit line for a set of ordered pairs (x1,y1),
(x2,y2)….
Step 2: Calculate the slope of the best fit line using the following formula
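The text breaks off before giving the slope formula. As a sketch, the same best fit line can also be obtained with NumPy's `polyfit` routine (degree 1), which implements linear least squares; the data below reuse the sunshine example from earlier in this document.

```python
# Fitting the least-squares line with numpy.polyfit (degree 1).
import numpy as np

x = np.array([2, 3, 5, 7, 9])
y = np.array([4, 5, 7, 10, 15])

# polyfit returns coefficients highest degree first: slope m, then intercept b.
m, b = np.polyfit(x, y, 1)
```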