
Regression Analysis

A statistical procedure used to find relationships among a set of variables.

In regression analysis, there is a dependent variable, which is the one you are trying to explain, and one or more independent variables that are related to it. You can express the relationship as a linear equation, such as:

y = a + bx

y is the dependent variable
x is the independent variable
a is a constant
b is the slope of the line

For every increase of 1 in x, y changes by an amount equal to b.
Some relationships are perfectly linear and fit this equation exactly. Your cell phone bill, for instance, may be:

Total Charges = Base Fee + .30 × (overage minutes)

If you know the base fee and the number of overage minutes, you can predict the total charges exactly.
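A minimal sketch of this exactly linear case in Python, assuming a hypothetical $40 base fee (the .30 per-minute overage rate is from the example above):

```python
# An exactly linear relationship: total charges on a cell phone bill.
# The $40 base fee is a hypothetical value chosen for illustration.
def total_charges(overage_minutes, base_fee=40.00, rate=0.30):
    """y = a + b*x: a is the base fee, b is the per-minute overage rate."""
    return base_fee + rate * overage_minutes

print(total_charges(0))    # 40.0 -- no overage, just the base fee
print(total_charges(100))  # 70.0 -- each overage minute adds exactly .30
```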

Other relationships may not be so exact. Weight, for instance, is to some degree a function of height, but there are variations that height does not explain. On average, you might have an equation like:

Weight = -222 + 5.7 × Height

If you take a sample of actual heights and weights, you might see something like the graph to the right.

The line in the graph shows the average relationship described by the equation. Often, none of the actual observations lie on the line. The difference between the line and any individual observation is the error.

The new equation is:

Weight = -222 + 5.7 × Height + e

This equation does not mean that people who are short enough will have a negative weight. The observations that contributed to this analysis were all for heights between 5′ and 6′4″. The model will likely provide a reasonable estimate for anyone in this height range. You cannot, however, extrapolate the results to heights outside of those observed. The regression results are only valid for the range of actual observations.
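As a quick plug-in check of the weight equation (assuming height is measured in inches and weight in pounds, which fits the coefficients; both test heights fall inside the observed 5′ to 6′4″ range):

```python
def predicted_weight(height_inches):
    """Average weight implied by Weight = -222 + 5.7 * Height.
    The error term e is omitted, so this is the mean prediction,
    not any individual's actual weight."""
    return -222 + 5.7 * height_inches

print(predicted_weight(60))  # 5'0"  -> 120.0 lbs on average
print(predicted_weight(76))  # 6'4"  -> 211.2 lbs on average
# Do not call this for heights outside the observed 60-76 inch range.
```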

Regression finds the line that best fits the observations. It does this by finding the line that results in the lowest sum of squared errors. Since the line describes the mean of the effects of the independent variables, by definition, the sum of the actual errors will be zero: if you add up all of the values of the dependent variable and all of the values predicted by the model, the sums are the same. That is, the sum of the negative errors (for points below the line) will exactly offset the sum of the positive errors (for points above the line). Summing just the errors wouldn't be useful, because the sum is always zero. So, instead, regression uses the sum of the squares of the errors. An Ordinary Least Squares (OLS) regression finds the line that results in the lowest sum of squared errors.
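Here is a sketch of what OLS does, using numpy on invented height/weight data; note that the fitted line's errors sum to (essentially) zero while the sum of squared errors is minimized:

```python
import numpy as np

# Invented sample of heights (inches) and weights (pounds).
heights = np.array([62.0, 64, 66, 68, 70, 72, 74, 76])
weights = np.array([118.0, 137, 140, 152, 165, 172, 187, 195])

# Design matrix with a column of ones for the intercept a.
X = np.column_stack([np.ones_like(heights), heights])

# np.linalg.lstsq solves for the (a, b) that minimize the sum of squared errors.
(a, b), *_ = np.linalg.lstsq(X, weights, rcond=None)

errors = weights - (a + b * heights)
print(a, b)                  # fitted intercept and slope
print(errors.sum())          # ~0: negative and positive errors offset exactly
print((errors ** 2).sum())   # the quantity OLS minimizes
```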

Multiple Regression
What if there are several factors affecting the dependent variable?
As an example, think of the price of a home as a dependent variable. Several factors contribute to the price of a home, among them square footage, the number of bedrooms, the number of bathrooms, the age of the home, whether or not it has a garage or a swimming pool, whether it has both central heat and air conditioning, how many fireplaces it has, and, of course, location.

The Multiple Regression Equation

Each of these factors has a separate relationship with the price of a home. The equation that describes a multiple regression relationship is:

y = a + b1x1 + b2x2 + b3x3 + ... + bnxn + e


This equation separates each individual independent
variable from the rest, allowing each to have its own
coefficient describing its relationship to the dependent
variable. If square footage is one of the independent
variables, and it has a coefficient of $50, then every
additional square foot of space adds $50, on average, to
the price of the home.
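To make that interpretation concrete, here is a tiny sketch using the $50-per-square-foot coefficient from the text (the intercept and bedroom coefficient are invented for illustration):

```python
# Hypothetical multiple-regression equation for home price.
def predicted_price(sqft, bedrooms, a=20_000, b_sqft=50, b_bed=8_000):
    return a + b_sqft * sqft + b_bed * bedrooms

# Adding one square foot changes the prediction by exactly b_sqft = $50,
# holding the other independent variables constant:
print(predicted_price(2001, 3) - predicted_price(2000, 3))  # 50
```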

How Do You Run a Regression?
In a multiple regression analysis of home prices, you take data from actual homes that have sold recently. You include the selling price, as well as the values for the independent variables (square footage, number of bedrooms, etc.). The multiple regression analysis finds the coefficients for each independent variable that together produce the line with the lowest sum of squared errors.
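In practice you hand that data to a statistics package. A minimal sketch with Python's statsmodels, using invented data in place of actual home sales:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Invented stand-ins for recently sold homes.
sqft = rng.uniform(1_000, 3_500, n)
bedrooms = rng.integers(2, 6, n).astype(float)
# Simulated selling prices: assumed effects plus random noise (the error term).
price = 20_000 + 50 * sqft + 8_000 * bedrooms + rng.normal(0, 15_000, n)

# Fit the multiple regression; add_constant supplies the intercept a.
X = sm.add_constant(np.column_stack([sqft, bedrooms]))
model = sm.OLS(price, X).fit()
print(model.params)  # estimates should land near 20000, 50, and 8000
```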

How Good is the Model?

One of the measures of how well the model explains the data is the R² value. Differences between observations that are not explained by the model remain in the error term. The R² value tells you what percent of those differences is explained by the model. An R² of .68 means that 68% of the variance in the observed values of the dependent variable is explained by the model, and 32% of those differences remain unexplained in the error term.
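R² follows directly from that definition: one minus the share of variation left in the error term. Continuing the statsmodels sketch above (reusing `model`, `X`, and `price`):

```python
errors = price - model.predict(X)

ss_total = ((price - price.mean()) ** 2).sum()  # total variation to explain
ss_error = (errors ** 2).sum()                  # variation left in the error term
r_squared = 1 - ss_error / ss_total

print(r_squared)       # computed from the definition
print(model.rsquared)  # statsmodels reports the same value
```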

Sometimes There's No Accounting for Taste
Some of the error is random, and no model will explain it. A prospective homebuyer might value a basement playroom more than other people would because it reminds her of her grandmother's house, where she played as a child. This can't be observed or measured, and these types of effects will vary randomly and unpredictably. Some variance will always remain in the error term. As long as it is random, it is of no concern.

p-values and Significance Levels

Each independent variable has another number attached to it in the regression results: its p-value, or significance level.
The p-value is a percentage. It tells you how likely it is that the coefficient for that independent variable emerged by chance rather than describing a real relationship.
A p-value of .05 means that there is only a 5% chance that a coefficient this large would have emerged by random variation if there were no real relationship.
It is generally accepted practice to consider variables with a p-value of less than .1 significant, though the only basis for this cutoff is convention.
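Continuing the earlier statsmodels sketch, the fitted model reports a p-value for each coefficient:

```python
# One p-value per coefficient: intercept, square footage, bedrooms.
print(model.pvalues)

# Flag the variables that clear the conventional .1 cutoff.
print(model.pvalues < 0.1)
```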

Significance Levels of F
There is also a significance level for the model as a whole. This is the Significance F value in Excel; other statistical programs call it by other names. It measures the likelihood that the model as a whole describes a relationship that emerged at random rather than a real relationship. As with the p-value, the lower the Significance F value, the stronger the evidence that the relationships in the model are real.
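In statsmodels, the same quantity is exposed as `f_pvalue` (a sketch, again reusing the fitted `model` from earlier):

```python
print(model.fvalue)    # the F statistic for the model as a whole
print(model.f_pvalue)  # its p-value -- what Excel labels "Significance F"
```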

Some Things to Watch Out For

Multicollinearity
Omitted Variables
Endogeneity
Others

Multicollinearity
Multicollinearity occurs when two or more of your independent variables are related to one another. The coefficient for each independent variable shows how much an increase of one in its value will change the dependent variable, holding all other independent variables constant. But what if you cannot hold them constant? If you have two houses that are exactly the same, and you add a bedroom to one of them, the value of that house may go up by, say, $10,000. But you have also added to its square footage. How much of that $10,000 is a result of the extra bedroom and how much is a result of the extra square footage? If the variables are very closely related, and/or if you have only a small number of observations, it can be difficult to separate these effects. Your regression gives you the coefficients that best describe your set of data, but the independent variables may not have good p-values if multicollinearity is present. Sometimes it may be appropriate to remove a variable that is related to others, but not always. In the home value example, both the number of bedrooms and the square footage are important on their own, in addition to whatever combined effects they may have, so removing either may be worse than leaving both in. Multicollinearity does not necessarily mean that the model as a whole is hurt, but it may mean that the model should not be used to draw conclusions about the relationship of individual independent variables with the dependent variable.
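One common diagnostic is the variance inflation factor (VIF), which measures how well each independent variable is predicted by the others; values much above 5-10 are usually read as a warning. A sketch using the design matrix `X` from the earlier fit:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Column 0 of X is the constant; columns 1 and 2 are sqft and bedrooms.
for idx, name in [(1, "sqft"), (2, "bedrooms")]:
    print(name, variance_inflation_factor(X, idx))
```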

Omitted Variables
If independent variables that have significant relationships with the dependent variable are left out of the model, the results will not be as good as if they were included. In the home value example, any real estate agent will tell you that location is the most important variable of all. But location is hard to measure. Locations are more or less desirable based on a number of factors. Some of them, like population density or crime rate, may be measurable factors that can be included. Others, like the perceived quality of the local schools, may be more difficult. You must also decide what level of specificity to use. Do you use the crime rate for the whole city, a quadrant of the city, the zip code, the street? Is the data even available at the level of specificity you want to use? These factors can lead to omitted variable bias: variance in the error term that is not random and that could be explained by an independent variable that is not in the model. Such bias can distort the coefficients on the other independent variables, as well as decreasing the R² and increasing the Significance F. Sometimes data just isn't available, and some variables aren't measurable. There are methods for reducing the bias from omitted variables, but it can't always be completely corrected.
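Omitted-variable bias is easy to demonstrate by simulation. In this invented setup, an unmeasured location-quality factor raises prices and is also correlated with square footage; dropping it pushes its effect into the square-footage coefficient:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500

location = rng.normal(0, 1, n)  # unmeasured desirability of the location
sqft = 2_000 + 300 * location + rng.normal(0, 200, n)  # correlated with location
price = 50 * sqft + 40_000 * location + rng.normal(0, 10_000, n)

full = sm.OLS(price, sm.add_constant(np.column_stack([sqft, location]))).fit()
omitted = sm.OLS(price, sm.add_constant(sqft)).fit()

print(full.params[1])     # ~50: the true sqft effect when location is included
print(omitted.params[1])  # biased well above 50: sqft absorbs the omitted effect
```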

Endogeneity
Regression measures the effect of changes in the independent variables on the dependent variable. Endogeneity occurs when that relationship is either backwards or circular, meaning that changes in the dependent variable cause changes in an independent variable. In the home value example, we discussed earlier that the perceived quality of the local schools might affect home values. But the perceived quality is likely also related to the actual quality, and the actual quality is at least partially a result of funding levels. Funding levels are often related to the property tax base, or the value of local homes. So good schools increase home values, but high home values also improve schools. This circular relationship, if it is strong, can bias the results of the regression. There are strategies for reducing the bias if removing the endogenous variable is not an option.

Others
There are several other types of biases that can
exist in a model for a variety of reasons. As with
the types already described, there are tests to
measure the levels of bias, and there are
strategies that can be used to reduce it.
Eventually, though, one may have to accept a
certain amount of bias in the final model,
especially when there are data limitations. In that
case, the best that can be done is to describe the
problem and the effects it might have when
presenting the model.

The 136 System Model Regression Equation

Local Revenue per Pupil =
  -236 (y-intercept)
  + .0041 × County-area Property per Pupil
  + .0032 × System Unshared Property per Pupil
  + .0202 × County-area Sales per Pupil
  + .0022 × System Unshared Sales per Pupil
  + .0471 × System State-shared Taxes per Pupil
  + 296 × [County-area Commercial, Industrial, Utility and Business Personal Property Assessment ÷ Total Assessment]
  + 327 × [System Commercial, Industrial, Utility and Business Personal Property Assessment ÷ Total Assessment]
  + .0209 × County-area Median Household Income
  - 795 × System Child Poverty Rate
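Applying the equation is just a plug-in calculation. A sketch (the function and all input values in the example call are hypothetical; the coefficients are those in the equation above, with the bracketed terms read as assessment-to-total-assessment ratios):

```python
def local_revenue_per_pupil(
    county_property_pp,    # County-area Property per Pupil ($)
    system_property_pp,    # System Unshared Property per Pupil ($)
    county_sales_pp,       # County-area Sales per Pupil ($)
    system_sales_pp,       # System Unshared Sales per Pupil ($)
    state_shared_pp,       # System State-shared Taxes per Pupil ($)
    county_cibp_ratio,     # County-area CIU & business personal / total assessment
    system_cibp_ratio,     # System CIU & business personal / total assessment
    median_income,         # County-area Median Household Income ($)
    child_poverty_rate,    # System Child Poverty Rate (as a fraction)
):
    return (
        -236
        + 0.0041 * county_property_pp
        + 0.0032 * system_property_pp
        + 0.0202 * county_sales_pp
        + 0.0022 * system_sales_pp
        + 0.0471 * state_shared_pp
        + 296 * county_cibp_ratio
        + 327 * system_cibp_ratio
        + 0.0209 * median_income
        - 795 * child_poverty_rate
    )

# Entirely hypothetical inputs, for illustration only:
print(local_revenue_per_pupil(300_000, 50_000, 40_000, 5_000, 2_000,
                              0.35, 0.30, 45_000, 0.20))
```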
