Lesson 29: Multiple Regression
11.556 RESEARCH METHODOLOGY
So far we have discussed regression analysis using only one
independent (explanatory) variable. At that level the regression
can also be estimated by hand. However, it is rare to have just one
explanatory variable, and the explanatory power of the estimated
equation can often be improved substantially by adding more
independent variables. For example, in our earlier example of
household consumption we can probably improve the explanatory
power of the equation by adding variables such as household size
and the age distribution of the household. However, with two or
more independent variables the computations become much more
complex, and solving for the parameters of the equation by hand is
not practical. Multiple regression analysis is therefore typically
carried out by computer, which makes complex calculations on
large volumes of data easy. Our emphasis when discussing multiple
regression will accordingly be on understanding and interpreting
computer output.
Multiple Regression equation
The general form of the multiple regression equation with k
independent variables is as follows:
\[ y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + e \]
The three-variable case, for example, is:
\[ y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + e \]
For the two-variable case the multiple regression equation is:
\[ y = a + b_1 x_1 + b_2 x_2 + e \]
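To make concrete what a computer package does when it estimates a, b1 and b2, here is a minimal sketch in plain Python. It solves the least-squares normal equations by Gauss-Jordan elimination. The function name fit_multiple_regression and the small data set are assumptions made up for illustration (the data are generated exactly from y = 1 + 2 x1 + 3 x2, with no error term); a statistical package would use more robust numerical methods.

```python
def fit_multiple_regression(xs, y):
    """Least-squares fit of y = a + b1*x1 + ... + bk*xk.

    xs: list of rows, each holding the k independent-variable values.
    y:  list of dependent-variable values.
    Returns the estimates [a, b1, ..., bk].
    """
    # Design matrix: a leading 1 in every row stands for the intercept a.
    X = [[1.0] + [float(v) for v in row] for row in xs]
    p = len(X[0])  # k + 1 parameters to estimate

    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]

    # The matrix is now diagonal, so each parameter is read off directly.
    return [A[i][p] / A[i][i] for i in range(p)]


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2 (illustration only).
xs = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]

a, b1, b2 = fit_multiple_regression(xs, y)
print(a, b1, b2)
```

Because the made-up data contain no error term, the fit should recover the intercept and slopes used to generate them, up to floating-point rounding.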
The standard error of estimate is:
\[ s_e = \sqrt{\frac{\sum (Y - \hat{y})^2}{n - k - 1}} \]
where
Y = sample values of the dependent variable
\(\hat{y}\) = corresponding estimated values from the regression
equation
n = number of data points in the sample
k = number of independent variables (3 in our example)
The denominator of this equation shows that in a regression
with k independent variables the standard error has n - k - 1
degrees of freedom. Beyond the k slope coefficients, one more
degree of freedom is lost because the intercept term a is also
estimated. Thus we have k + 1 parameters to estimate from the
sample data.
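The role of the n - k - 1 divisor can be illustrated with a short Python sketch. The function name standard_error_of_estimate and the six observed and fitted values are hypothetical, chosen only so that the arithmetic is easy to follow; they do not come from the lesson's example.

```python
import math


def standard_error_of_estimate(y, y_hat, k):
    """Square root of the squared residuals summed and divided by n - k - 1."""
    n = len(y)
    df = n - k - 1  # k slopes plus one intercept are estimated from the data
    if df <= 0:
        raise ValueError("need more data points than estimated parameters")
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return math.sqrt(sse / df)


# Hypothetical observed and fitted values for a regression with k = 2.
y = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
y_hat = [11.0, 12.0, 14.0, 20.0, 23.0, 30.0]

s_e = standard_error_of_estimate(y, y_hat, k=2)
print(s_e)
```

Here the squared residuals sum to 4 and the degrees of freedom are 6 - 2 - 1 = 3, so the result is the square root of 4/3.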
The standard error of the regression is also called the root mean
square error; it is the square root of the mean square error (MSE).
In our sample output it is indicated by s. The standard error of the
regression in our problem is 0.286.
We can also use the standard error of estimate and the t
distribution to form an approximate confidence interval around
our estimated value \(\hat{y}\). The t value at the 95% level of
confidence, given our n - k - 1 degrees of freedom, is 2.447. For
example, in our problem:
For values of
\(x_1\) = 4300 hours
\(x_2\) = 1500 hours
\(x_3\) = $75,000
Our estimate for y is $27,905,000 and our \(s_e\) is $286,000. If we
want to construct an approximate 95% confidence interval around
this estimate of $27,905,000, we can do it as follows:
\(\hat{y} \pm t \, s_e\)
Upper limit = $27,905,000 + 2.447($286,000) = $28,604,800
Lower limit = $27,905,000 - 2.447($286,000) = $27,205,200
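The arithmetic of this interval can be checked with a few lines of Python. The estimate, the standard error and the t value of 2.447 are taken from the example above; the variable names are only illustrative.

```python
# Figures from the example: estimate, standard error of estimate, t value.
y_hat = 27_905_000
s_e = 286_000
t = 2.447

margin = t * s_e          # half-width of the approximate interval
lower = y_hat - margin
upper = y_hat + margin

print(f"${lower:,.0f} to ${upper:,.0f}")
```

Rounded to the nearest hundred dollars, the two limits agree with the lower limit of $27,205,200 and an upper limit of $28,604,800.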
The standard error of the estimate measures the dispersion of the
data points around the regression plane. Smaller values of \(s_e\)
indicate a better regression. If adding another variable reduces
\(s_e\), we say that including that variable improves the fit of the
regression.
The Coefficient of Multiple Determination
In a multiple regression we measure the strength of the relationship
between the independent variables (three in our example) and the
dependent variable by the coefficient of multiple determination,
\(R^2\). It is defined as follows:
\(R^2\) is the proportion of the total variation in y that is explained
by the regression plane.
In our example \(R^2\) = 98.3%. This tells us that 98.3% of the
variation in unpaid taxes is explained by the three independent
variables. As we add more variables to a regression, the explanatory
power of the equation improves if \(R^2\) increases.
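The verbal definition can be written as a formula using the same symbols as the standard error of estimate. This is the standard textbook expression rather than something quoted from the lesson; \(\bar{Y}\), the sample mean of the dependent variable, is notation introduced here.

```latex
R^2 = \frac{\text{explained variation}}{\text{total variation}}
    = 1 - \frac{\sum (Y - \hat{y})^2}{\sum (Y - \bar{Y})^2}
```

The numerator of the last fraction is the same sum of squared residuals that appears under the square root in the standard error of estimate, so a smaller unexplained variation both lowers \(s_e\) and raises \(R^2\).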
Example
Pam Schneider owns and operates an accounting firm in Ithaca,
New York. Pam feels that it would be useful to be able to predict in
advance the number of rush income-tax returns during the
busy March 1 to April 15 period so that she can better plan her
personnel needs during this time. She has hypothesized that several
factors may be useful in her prediction. Data for these factors and
the number of rush returns for past years are as follows:
X1 = Economic index
X2 = Population within 1 mile of office
X3 = Average income in Ithaca
Y  = Number of rush returns, March 1 to April 15

X1     X2       X3       Y
 99    10188    21465    2306
106     8566    22228    1266
100    10557    27665    1422
129    10219    25200    1721
179     9662    26300    2544
a. Use the following Minitab output to determine the best fitting
regression equation for these data:
The regression equation is

Predictor    Coef       Stdev     T-ratio    P
Constant     -1275      2699      -0.47      0.719
X1           17.059     6.098      2.47      0.245
X2           0.5406     0.3144     1.72      0.335
X3           -0.1743    0.1005    -1.73      0.333

S = 396.1    R-sq = 87.2%
b. What percentage of the total variation in the number of rush
returns is explained by this equation?
c. For this year, the economic index is 169, the population within
1 mile of the office is 10,212, and the average income in
Ithaca is $26,925. How many rush returns should Pam expect
to process between March 1 and April 15?
Results
a. \(\hat{y} = -1275 + 17.059 X_1 + 0.5406 X_2 - 0.1743 X_3\)
b. \(R^2\) = 87.2%; 87.2% of the total variation in Y is explained by
the model.
c. \(\hat{y}\) = -1275 + 17.059(169) + 0.5406(10,212) - 0.1743(26,925)
= 2436 rush returns.
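These results can be cross-checked against the data table with a short Python sketch. It uses the coefficients from the Results (with 0.5406 for X2), the five rows of past data, and the values given in part c. The helper name predict is an assumption for illustration. Because the printed coefficients are rounded, the recomputed S and R-sq should be close to, but not necessarily identical with, the Minitab figures.

```python
import math

# (X1, X2, X3, Y) rows copied from the example's data table.
data = [
    (99, 10188, 21465, 2306),
    (106, 8566, 22228, 1266),
    (100, 10557, 27665, 1422),
    (129, 10219, 25200, 1721),
    (179, 9662, 26300, 2544),
]


def predict(x1, x2, x3):
    """Fitted equation with the coefficients reported in the Results."""
    return -1275 + 17.059 * x1 + 0.5406 * x2 - 0.1743 * x3


# Part c: forecast for this year's values of the three predictors.
forecast = predict(169, 10212, 26925)

# Unexplained (residual) and total variation over the five past years.
ys = [row[3] for row in data]
y_bar = sum(ys) / len(ys)
sse = sum((y - predict(x1, x2, x3)) ** 2 for x1, x2, x3, y in data)
sst = sum((y - y_bar) ** 2 for y in ys)

k = 3                                      # number of independent variables
s = math.sqrt(sse / (len(data) - k - 1))   # standard error of estimate
r_sq = 1 - sse / sst                       # coefficient of determination

print(round(forecast), round(s, 1), round(100 * r_sq, 1))
```

With only five data points and three independent variables there is a single degree of freedom left (5 - 3 - 1 = 1), which is worth keeping in mind when reading the large P values in the output.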
Exercises
Q1. Given the following set of data, use whatever computer
package is available to find the best-fitting regression
equation and answer the following:
a. What is the regression equation?
b. What is the standard error of estimate?
c. What is \(R^2\) for this regression?
d. What is the predicted value for Y when \(X_1\) = 5.8, \(X_2\) = 4.2,
and \(X_3\) = 5.1?
Y X1 X2 X3
64.7 3.5 5.3 8.5
80.9 7.4 1.6 2.6
24.6 2.5 6.3 4.5
43.9 3.7 9.4 8.8
77.7 5.5 1.4 3.6
20.6 8.3 9.2 2.5
66.9 6.7 2.5 2.7
34.3 1.2 2.2 1.3
Q2. Given the following set of data, use whatever computer package
is available to find the best-fitting regression equation and answer
the following:
a. What is the regression equation?
b. What is the standard error of estimate?
c. What is \(R^2\) for this regression?
e. Give an approximate 95 percent confidence interval for the
value of Y when the values of \(X_1\), \(X_2\), \(X_3\), and \(X_4\) are
52.4, 41.6, 35.8, and 3, respectively.
Q3. We are trying to predict the annual demand for widgets
(Demand) using the following independent variables:
Price = price of widgets (in $)
Income = consumer income (in $)
Sub = price of a substitute commodity (in $)
(Note: A substitute commodity is one that can be substituted for
another commodity. For example, margarine is a substitute
commodity for butter.)
Year Demand Price ($) Income Sub ($)
1982 40 9 400 10
1983 45 8 500 14
1984 50 9 600 12
1985 55 8 700 13
1986 60 7 800 11
1987 70 6 900 15
1988 65 6 1000 26
1989 65 8 1100 27
1990 75 5 1200 22
1991 75 5 1300 19
1992 80 5 1400 20
1993 100 3 1500 23
1994 90 4 1600 18
1995 95 3 1700 24
1996 85 4 1800 21
a. Using whatever computer package is available, determine the
best-fitting regression equation for these data.
b. Are the signs (+ or -) of the regression coefficients of the
independent variables as one would expect? Explain briefly.
c. State and interpret the coefficient of multiple determination
for this problem.
d. State and interpret the standard error of estimate for this
problem.
e. Using the equation, what would you predict for DEMAND if
the price of widgets was $6, consumer income was $1200 and
the price of the substitute commodity was $17?