Lesson 29: Multiple Regression

Copyright: Rai University
11.556 Research Methodology
So far we have discussed regression analysis using only one
independent (explanatory) variable. At that level the regression
can also be estimated manually. In practice, however, we rarely
have just one explanatory variable, and the explanatory power of
the estimated equation can often be improved substantially by
adding more independent variables. For example, in our earlier
example of household consumption we can probably improve the
explanatory power of the equation by adding variables such as
household size and the age distribution of the household. However,
when we use two or more independent variables the regression
becomes much more complex, and it is rarely practical to solve for
the parameters of the equation manually. Most multiple regression
analysis is therefore carried out by computers, which make it easy
to perform complex calculations on large volumes of data. Our
emphasis when discussing multiple regression will therefore be on
understanding and interpreting computer output.
Multiple Regression equation
The general form of the multiple regression equation is illustrated
by the three-variable case:

\[ y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + e \]

For the two-variable case the multiple regression equation is as
follows:

\[ y = a + b_1 x_1 + b_2 x_2 + e \]
The normal equations for this are as follows:

\[ \sum y = na + b_1 \sum x_1 + b_2 \sum x_2 \]

\[ \sum x_1 y = a \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2 \]

\[ \sum x_2 y = a \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2 \]

These can be solved to obtain the values of the parameters \( a \),
\( b_1 \) and \( b_2 \).
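As a minimal sketch of how the three normal equations for the
two-variable case can be solved, the Python snippet below builds the
required sums and applies Cramer's rule. The function names
(det3, fit_two_variables) and the small data set are hypothetical,
chosen so that y = 1 + 2x1 + 3x2 holds exactly. In practice a
statistical package does this work.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_two_variables(x1, x2, y):
    """Solve the normal equations for y = a + b1*x1 + b2*x2 by Cramer's rule."""
    n = len(y)
    s1, s2, sy = sum(x1), sum(x2), sum(y)
    s11 = sum(u * u for u in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * w for u, w in zip(x1, y))
    s2y = sum(v * w for v, w in zip(x2, y))

    # Left-hand coefficients and right-hand sides of the three normal equations.
    lhs = [[n, s1, s2], [s1, s11, s12], [s2, s12, s22]]
    rhs = [sy, s1y, s2y]

    d = det3(lhs)
    if d == 0:
        raise ValueError("normal equations have no unique solution")

    solution = []
    for col in range(3):
        m = [row[:] for row in lhs]
        for r in range(3):
            m[r][col] = rhs[r]  # replace one column by the right-hand side
        solution.append(det3(m) / d)
    return tuple(solution)  # (a, b1, b2)


# Hypothetical data built from y = 1 + 2*x1 + 3*x2 with no error term.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

a, b1, b2 = fit_two_variables(x1, x2, y)
print(a, b1, b2)
```

Because the hypothetical data contain no error term, the solution
recovers the intercept and the two slopes used to build them.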
So far we have referred to \( a \) as the y-intercept and to
\( b_1 \) and \( b_2 \) as the slopes of the multiple regression.
More precisely, they are the estimated regression coefficients. The
constant \( a \) is the value of \( y \) when both \( x_1 \) and
\( x_2 \) are zero. The coefficients \( b_1 \) and \( b_2 \)
describe how changes in \( x_1 \) and \( x_2 \) affect the value of
\( y \). Thus \( b_1 \) measures the effect on \( y \) of a one-unit
change in \( x_1 \), holding \( x_2 \) constant. Similarly,
\( b_2 \) measures the effect on \( y \) of a one-unit change in
\( x_2 \), holding \( x_1 \) constant.
Simple linear regression estimates a regression line between two
variables. In multiple regression with two independent variables
there is instead a regression plane among \( y \), \( x_1 \) and
\( x_2 \). This regression plane is determined in the same way as
the regression line, by minimizing the sum of squared deviations
of the data points from the regression plane. Each independent
variable accounts for some of the variation in the dependent
variable. This is shown in Figure 1 below.
Figure 1
The Computer and Multiple regression
In most managerial situations a manager deals with complex
problems requiring large samples and several independent
variables.
The generalized multiple regression model is specified for k
independent variables with n data points; that is, for each of the
k independent variables we have n data points. The regression
equation that we estimate is:

\[ y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k \]

This equation is estimated by the computer. We now look at how
a statistical package such as SPSS or Minitab handles the data.
An example will help make the process clearer:
Suppose the IRS in the US wishes to model the discovery of unpaid
taxes. It uses the following variables, of which the first three are
independent variables and the fourth is the dependent variable:
1. No. of hours of field audit (00s)
2. No. of computer hours (00s)
3. Reward to informants ($000s)
4. Actual unpaid taxes discovered ($ millions)
The data is shown in Table 1.
Table 1
Month   Field audit   Comp hours   Reward to informers   Actual unpaid taxes
        (00s hours)   (00s hours)  ($000s)               ($ millions)
Jan     45            16           71                    29
Feb     42            14           70                    24
March   44            15           72                    27
Apr     43            13           71                    25
May     46            13           75                    26
Jun     44            14           74                    28
July    45            16           76                    30
Aug     44            16           69                    28
Sept    43            15           74                    28
Oct     42            15           73                    27
A regression is now run in Minitab, and the sample output is
given in Table 2. We now have to interpret this output.
How do we interpret this output?
1. The regression equation is of the form:

\[ \hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 \]

From the numbers given in the coefficient column we can read the
estimating equation:

\[ \hat{y} = -45.8 + 0.597\,\text{Audit} + 1.18\,\text{Comp} + 0.405\,\text{Rewards} \]
How do we interpret this equation?
The interpretation is similar to that of the one-variable simple
linear regression case.
If we hold the number of field audit labour hours (\( x_1 \)) and
the number of computer hours (\( x_2 \)) constant and change
rewards to informants (\( x_3 \)) by one unit, then y will increase
by $405,000 for each additional $1,000 paid to informants.
Similarly, holding \( x_1 \) and \( x_3 \) constant, an additional
100 hours of computer time will increase recoveries by $1,177,000.
Similarly, holding \( x_2 \) and \( x_3 \) constant, an additional
100 hours of field audit will increase recoveries by $597,000.
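We can also use this equation to solve problems such as the
following. Suppose that in November the IRS plans to leave field
audit hours and computer hours at their October levels but to
increase rewards to $75,000. How much in recoveries can it expect
to make in November?
We can get a forecasted value by substituting into the equation:

\[ \hat{y} = -45.8 + 0.597(43) + 1.18(15) + 0.405(75) \]

Using the unrounded coefficients from the computer output (for
example 1.177 rather than 1.18 for computer hours), this gives
\( \hat{y} = 27.905 \), or approximately $28 million.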
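The substitution above can be sketched in a few lines of Python. The
function and variable names are hypothetical, and the coefficients
are the rounded values printed in the estimating equation, so the
result can differ in the second decimal place from the 27.905 quoted
in the text, which presumably reflects unrounded output.

```python
# Rounded coefficients as printed in the estimating equation.
INTERCEPT = -45.8
B_AUDIT = 0.597    # per 100 hours of field audit
B_COMP = 1.18      # per 100 computer hours
B_REWARDS = 0.405  # per $1,000 of rewards to informants


def forecast_recoveries(audit, comp, rewards):
    """Forecast unpaid taxes discovered, in $ millions."""
    return INTERCEPT + B_AUDIT * audit + B_COMP * comp + B_REWARDS * rewards


# Values substituted in the text: 43 (field audit), 15 (computer), 75 (rewards).
y_nov = forecast_recoveries(audit=43, comp=15, rewards=75)
print(f"Forecast recoveries: {y_nov:.3f} ($ millions)")
```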
Standard Error of The Regression
Now that we have our equation, we need a measure of the dispersion
of the actual observations around the estimated regression plane.
The less the dispersion around this regression plane, the more
accurate we can expect the estimation to be. We measure this
dispersion, or variation, by the standard error of the estimate
\( s_e \):

\[ s_e = \sqrt{\frac{\sum (y - \hat{y})^2}{n - k - 1}} \]

where
\( y \) = sample values of the dependent variable
\( \hat{y} \) = corresponding estimated values from the regression
equation
n = number of data points in the sample
k = number of independent variables (3 in our example)
The denominator of this equation shows that in a regression
with k independent variables, the standard error has n-k-1 degrees
of freedom. This is because one further degree of freedom is lost
in estimating the intercept term a, so that we have k+1 parameters
to estimate from the sample data.
The standard error of the regression is also called the root mean
square error, that is, the square root of the mean square error
(MSE). In our sample output it is indicated by s. The standard
error of the regression in our problem is 0.286.
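A minimal sketch of the standard error formula in Python follows.
The function name and the observed and fitted values are
hypothetical, chosen only to make the arithmetic easy to follow.
They are not the IRS data.

```python
import math


def standard_error(y, y_hat, k):
    """Standard error of the estimate with n - k - 1 degrees of freedom."""
    n = len(y)
    dof = n - k - 1
    if dof <= 0:
        raise ValueError("need more data points than parameters (n > k + 1)")
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))
    return math.sqrt(sse / dof)


# Hypothetical observed and fitted values; every residual is +1 or -1.
y_obs = [10.0, 12.0, 15.0, 11.0, 14.0, 13.0]
y_fit = [11.0, 11.0, 14.0, 12.0, 15.0, 12.0]

# n = 6 and k = 3, so SSE = 6 is divided by 6 - 3 - 1 = 2.
s_e = standard_error(y_obs, y_fit, k=3)
print(round(s_e, 4))
```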
We can also use the standard error of the estimate and the
t distribution to form an approximate confidence interval
around our estimated value. The t value at the 95% level of
confidence for our n-k-1 = 10-3-1 = 6 degrees of freedom is 2.447.
For example, in our problem, for values of
\( x_1 \) = 4,300 hours
\( x_2 \) = 1,500 hours
\( x_3 \) = $75,000
our estimate for y is $27,905,000 and our \( s_e \) is $286,000. If
we want to construct a 95% confidence interval around this estimate
of $27,905,000, we can do it as follows:

\[ \hat{y} \pm t\,s_e \]

$27,905,000 + 2.447($286,000) = $28,604,800 (upper limit)
$27,905,000 - 2.447($286,000) = $27,205,200 (lower limit)
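The interval arithmetic can be checked with a short Python sketch.
The variable names are hypothetical. The inputs are the estimate and
standard error from the text (in $ millions) and the tabulated t
value of 2.447 for 6 degrees of freedom.

```python
y_hat = 27.905   # estimated recoveries, $ millions
s_e = 0.286      # standard error of the estimate, $ millions
t_crit = 2.447   # t value, 95% confidence, 6 degrees of freedom

margin = t_crit * s_e
lower = y_hat - margin
upper = y_hat + margin

print(f"95% interval: {lower:.4f} to {upper:.4f} ($ millions)")
```

Rounded to the nearest hundred dollars, these limits correspond to
$27,205,200 and $28,604,800.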
The standard error of the estimate measures the dispersion of
data points around the regression plane. Smaller values of
\( s_e \) indicate a better regression. If the addition of another
variable reduces \( s_e \), we say that the inclusion of that
variable improves the fit of the regression.
The Coefficient of Multiple Determination
In a multiple regression we measure the strength of the relationship
between the independent variables (three in our example) and the
dependent variable by the coefficient of multiple determination,
\( R^2 \). This is defined as follows: \( R^2 \) is the proportion
of total variation in y that is explained by the regression plane.
In our example we have \( R^2 \) = 98.3%. This tells us that 98.3%
of the variation in unpaid taxes discovered is explained by the
three independent variables. As we add more variables to a
regression, the explanatory power of the equation improves if
\( R^2 \) increases.
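One common way to compute this proportion is one minus the ratio of
unexplained variation to total variation. The Python sketch below
illustrates that calculation. The function name and the small set of
observed and fitted values are hypothetical and are not taken from
the IRS example.

```python
def r_squared(y, y_hat):
    """Proportion of total variation in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    sst = sum((obs - mean_y) ** 2 for obs in y)              # total variation
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))  # unexplained
    return 1 - sse / sst


# Hypothetical observed and fitted values: SST = 20 and SSE = 4.
y_obs = [2.0, 4.0, 6.0, 8.0]
y_fit = [3.0, 3.0, 7.0, 7.0]

r2 = r_squared(y_obs, y_fit)
print(f"R-squared = {r2:.1%}")
```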
Example 2
Pam Schneider owns and operates an accounting firm in Ithaca,
New York. Pam feels that it would be useful to be able to predict in
advance the number of rush income-tax returns during the
busy March 1 to April 15 period so that she can better plan her
personnel needs during this time. She has hypothesized that several
factors may be useful in her prediction. Data for these factors and
the number of rush returns for past years are as follows:
X1               X2                  X3               Y
Economic index   Population within   Average income   Number of rush returns,
                 1 mile of office    in Ithaca        March 1 to April 15
99               10188               21465            2306
106              8566                22228            1266
100              10557               27665            1422
129              10219               25200            1721
179              9662                26300            2544

a. Use the following Minitab output to determine the best-fitting
regression equation for these data:

The regression equation is

Predictor   Coef      Stdev    T-ratio   P
Const       -1275     2699     -0.47     0.719
X1          17.059    6.098    2.47      0.245
X2          0.5406    0.3144   1.72      0.335
X3          -0.1743   0.1005   -1.73     0.333

S = 396.1   R-sq = 87.2%
b. What percentage of the total variation in the number of rush
returns is explained by this equation?
c. For this year, the economic index is 169, the population within
1 mile of the office is 10,212, and the average income in
Ithaca is $26,925. How many rush returns should Pam expect
to process between March 1 and April 15?
Results
a. \[ \hat{y} = -1275 + 17.059 X_1 + 0.5406 X_2 - 0.1743 X_3 \]
b. \( R^2 \) = 87.2%; 87.2% of the total variation in Y is explained
by the model.
c. \[ \hat{y} = -1275 + 17.059(169) + 0.5406(10{,}212) - 0.1743(26{,}925) = 2436 \text{ rush returns} \]
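The prediction can be verified with a short Python sketch. The
function name is hypothetical and the coefficients are those of the
regression equation given in the results.

```python
def predict_rush_returns(economic_index, population, avg_income):
    """Predicted number of rush returns from the fitted equation."""
    return (-1275
            + 17.059 * economic_index
            + 0.5406 * population
            - 0.1743 * avg_income)


# This year's values from part c of the example.
returns = predict_rush_returns(169, 10212, 26925)
print(round(returns))
```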
Exercises
Q1. Given the following set of data, use whatever computer
package is available to find the best-fitting regression
equation and answer the following:
a. What is the regression equation?
b. What is the standard error of estimate?
c. What is \( R^2 \) for this regression?
d. What is the predicted value for Y when \( X_1 = 5.8 \),
\( X_2 = 4.2 \), \( X_3 = 5.1 \)?
Y X1 X2 X3
64.7 3.5 5.3 8.5
80.9 7.4 1.6 2.6
24.6 2.5 6.3 4.5
43.9 3.7 9.4 8.8
77.7 5.5 1.4 3.6
20.6 8.3 9.2 2.5
66.9 6.7 2.5 2.7
34.3 1.2 2.2 1.3
Q2. Given the following set of data, use whatever computer package
is available to find the best-fitting regression equation and answer
the following:
a. What is the regression equation?
b. What is the standard error of estimate?
c. What is \( R^2 \) for this regression?
e. Give an approximate 95 percent confidence interval for the
value of Y when the values of \( X_1 \), \( X_2 \), \( X_3 \) and
\( X_4 \) are 52.4, 41.6, 35.8 and 3, respectively.
Q3. We are trying to predict the annual demand for widgets
(Demand) using the following independent variables:
Price = price of widgets (in $)
Income = consumer income (in $)
Sub = price of a substitute commodity (in $)
(Note: A substitute commodity is one that can be substituted for
another commodity. For example, margarine is a substitute
commodity for butter.)
Year Demand Price ($) Income Sub ($)
1982 40 9 400 10
1983 45 8 500 14
1984 50 9 600 12
1985 55 8 700 13
1986 60 7 800 11
1987 70 6 900 15
1988 65 6 1000 26
1989 65 8 1100 27
1990 75 5 1200 22
1991 75 5 1300 19
1992 80 5 1400 20
1993 100 3 1500 23
1994 90 4 1600 18
1995 95 3 1700 24
1996 85 4 1800 21

a. Using whatever computer package is available, determine the
best-fitting regression equation for these data.
b. Are the signs (+ or -) of the regression coefficients of the
independent variables as one would expect? Explain briefly.
c. State and interpret the coefficient of multiple determination
for this problem.
d. State and interpret the standard error of estimate for this
problem.
e. Using the equation, what would you predict for DEMAND if
the price of widgets was $6, consumer income was $1200 and
the price of the substitute commodity was $17?