
RESEARCH METHODOLOGY

LESSON 29: MULTIPLE REGRESSION


So far we have discussed regression analysis using only one independent (explanatory) variable. At that level the regression can still be estimated manually. It is rare, however, to have just one explanatory variable, and the explanatory power of the estimated equation can usually be improved substantially by adding more independent variables. In our earlier example of household consumption, for instance, we could probably improve the explanatory power of the equation by adding variables such as household size and the age distribution of the household. With two or more independent variables, though, the regression becomes much more complex, and it is not feasible to solve for the parameters of the equation manually. Multiple regression analysis is therefore almost always carried out by computer, which allows complex calculations on large volumes of data to be done easily. Our emphasis in discussing multiple regression will accordingly be on understanding and interpreting computer output.

The Multiple Regression Equation
The general form of the multiple regression equation is:

ŷ = a + b1x1 + b2x2 + b3x3 + ............ + bkxk

The three-variable case, for example, is:

ŷ = a + b1x1 + b2x2 + b3x3 + e

For the two-variable case the multiple regression equation is:

ŷ = a + b1x1 + b2x2 + e

The normal equations for this case are:

Σy   = na    + b1Σx1   + b2Σx2
Σx1y = aΣx1  + b1Σx1²  + b2Σx1x2
Σx2y = aΣx2  + b1Σx1x2 + b2Σx2²

These can be solved to obtain the values of the parameters a, b1 and b2. So far we have referred to a as the y-intercept and to b1 and b2 as the slopes of the multiple regression; together they are the estimated regression coefficients. The constant a is the value of ŷ when both x1 and x2 are zero. The coefficients b1 and b2 describe how changes in x1 and x2 affect the value of ŷ: b1 measures the effect on ŷ of a change in x1 holding x2 constant, and b2 measures the effect on ŷ of a change in x2 holding x1 constant. Whereas simple linear regression estimates a regression line between two variables, multiple regression estimates a regression plane among y, x1 and x2. This plane is determined in the same way as the regression line, by minimizing the sum of squared deviations of the data points from the plane. Each independent variable accounts for some of the variation in the dependent variable, as shown in Figure 1.

The Computer and Multiple Regression
A manager in any managerial situation deals with complex problems requiring large samples and several independent variables. The generalized multiple regression model is specified for k independent variables with n data points; that is, for each of the k independent variables we have n observations. The regression equation that we estimate is the general equation above, and it is estimated by the computer. We now look at how a statistical package such as SPSS or Minitab handles the data. An example will help make the process clearer. Suppose the IRS in the US wishes to model the discovery of unpaid taxes. It includes the following variables:
1. No. of hours of field audit ($00s)
2. No. of computer hours ($00s)
3. Rewards to informants ($000s)
4. Actual unpaid taxes discovered ($100000s)
Variable 4 is the dependent variable y; variables 1 to 3 are the independent variables x1, x2 and x3.
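To see how the normal equations determine a, b1 and b2, here is a minimal sketch in Python (numpy), using made-up illustrative data: it builds the 3×3 system from the sums above and confirms that solving it agrees with a direct least-squares fit.

```python
import numpy as np

# Illustrative (made-up) data for the two-variable case: y = a + b1*x1 + b2*x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 4.9, 9.2, 10.8, 14.1])
n = len(y)

# Build the normal equations:
#   sum(y)    = n*a       + b1*sum(x1)    + b2*sum(x2)
#   sum(x1*y) = a*sum(x1) + b1*sum(x1^2)  + b2*sum(x1*x2)
#   sum(x2*y) = a*sum(x2) + b1*sum(x1*x2) + b2*sum(x2^2)
A = np.array([
    [n,        x1.sum(),      x2.sum()],
    [x1.sum(), (x1**2).sum(), (x1*x2).sum()],
    [x2.sum(), (x1*x2).sum(), (x2**2).sum()],
])
rhs = np.array([y.sum(), (x1*y).sum(), (x2*y).sum()])
a, b1, b2 = np.linalg.solve(A, rhs)

# A least-squares fit on the design matrix gives the same answer
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a, b1, b2)
```

Solving the normal equations and minimizing the sum of squared deviations from the plane are the same calculation, which is why the two answers coincide.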

The data is shown in Table 1

11.556

Copy Right: Rai University


Month   Field audit   Comp hours   Reward to informers   Actual unpaid taxes
Jan     45            16           71                    29
Feb     42            14           70                    24
March   44            15           72                    27
Apr     43            13           71                    25
May     46            13           75                    26
Jun     44            14           74                    28
July    45            16           76                    30
Aug     44            16           69                    28
Sept    43            15           74                    28
Oct     42            15           73                    27
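As a cross-check on the printed results, the following sketch fits the Table 1 data with numpy's least squares. If the figures are transcribed correctly, the coefficients, standard error and R² should come out near the values quoted later in the lesson (roughly -45.8, 0.597, 1.18 and 0.405, with se about 0.286 and R² about 98.3%); the t value 2.447 is the tabled 95% point for 6 degrees of freedom.

```python
import numpy as np

# Table 1 data: field-audit hours and computer hours in hundreds,
# rewards in $000s, unpaid taxes discovered in the text's units
audit   = np.array([45, 42, 44, 43, 46, 44, 45, 44, 43, 42], dtype=float)
comp    = np.array([16, 14, 15, 13, 13, 14, 16, 16, 15, 15], dtype=float)
rewards = np.array([71, 70, 72, 71, 75, 74, 76, 69, 74, 73], dtype=float)
taxes   = np.array([29, 24, 27, 25, 26, 28, 30, 28, 28, 27], dtype=float)

n, k = len(taxes), 3
X = np.column_stack([np.ones(n), audit, comp, rewards])
coef, *_ = np.linalg.lstsq(X, taxes, rcond=None)  # a, b1, b2, b3

fitted = X @ coef
resid = taxes - fitted
se = np.sqrt((resid**2).sum() / (n - k - 1))      # standard error of estimate
r2 = 1 - (resid**2).sum() / ((taxes - taxes.mean())**2).sum()

# November forecast: field and computer hours at October levels, rewards 75
forecast = coef @ np.array([1.0, 43.0, 15.0, 75.0])

# Approximate 95% interval: t = 2.447 for n - k - 1 = 6 degrees of freedom
t = 2.447
interval = (forecast - t * se, forecast + t * se)

print("coefficients:", coef.round(3))  # text quotes roughly -45.8, 0.597, 1.18, 0.405
print("se:", round(se, 3), "R^2:", round(r2, 3), "95% interval:", interval)
```

This is a sketch, not the lesson's Minitab run; a statistics package would additionally report the standard errors, t-ratios and p-values of the individual coefficients.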

A regression is now run on Minitab, and the sample output is presented in Table 2. How do we interpret this output?

1. The regression equation is of the form:

ŷ = a + b1x1 + b2x2 + b3x3

From the coefficient column we can read the estimating equation:

ŷ = -45.8 + 0.597 Audit + 1.18 Comp + 0.405 Rewards

How do we interpret this equation? The interpretation is similar to that of the one-variable simple linear regression case. If we hold the number of field-audit labour hours and the number of computer hours constant and change rewards to informants by one unit, then ŷ changes by $405,000 for each additional $1,000 paid to informants. Similarly, holding x1 and x3 constant, an additional 100 hours of computer time increases ŷ by $1,177,000; and holding x2 and x3 constant, an additional 100 hours of field-audit time increases recoveries by $597,000.

We can also use this equation to solve problems such as the following. Suppose that in November the IRS plans to leave field hours and computer hours at their October levels but to increase rewards to informants to $75,000. How much in recoveries can it expect in November? We obtain a forecast by substituting into the equation:

ŷ = -45.8 + 0.597(43) + 1.18(15) + 0.405(75)
ŷ = 27.905, or approximately $28 million.

Standard Error of the Regression
Now that we have our equation, we need some measure of the dispersion of actual observations around the estimated regression plane. The smaller the dispersion around the plane, the more accurate we can expect the estimates to be. We measure this dispersion by the standard error of the estimate, se:

se = √( Σ(y - ŷ)² / (n - k - 1) )

where
y = sample values of the dependent variable
ŷ = corresponding estimated values from the regression equation
n = number of data points in the sample
k = number of independent variables (3 in our example)

The denominator of this expression shows that in a regression with k independent variables the standard error has n - k - 1 degrees of freedom: one further degree of freedom is lost in estimating the intercept term a, so in all we have k + 1 parameters to estimate from the sample data. The standard error of the regression is also called the root mean square error; its square is the mean square error (MSE). In the sample output it is indicated by s. The standard error of the regression in our problem is 0.286.

We can also use the standard error of estimate and the t distribution to form an approximate confidence interval around an estimated value. With n - k - 1 = 6 degrees of freedom, the t value at the 95% level of confidence is 2.447. For example, for

x1 = 4300 hours, x2 = 1500 hours, x3 = $75,000

our estimate of ŷ is $27,905,000 and se is $286,000. A 95% confidence interval around this estimate is:

ŷ ± t·se = $27,905,000 ± 2.447($286,000)
Upper limit = $28,604,800
Lower limit = $27,205,200

The standard error of the estimate measures the dispersion of the data points around the regression plane. Smaller values of se indicate a better regression: if the addition of another variable reduces se, we say that including that variable improves the fit of the regression.

The Coefficient of Multiple Determination
In a multiple regression we measure the strength of the relationship between the independent variables and the dependent variable by the coefficient of multiple determination, R². R² is the proportion of the total variation in y that is explained by the regression plane. In our example R² = 98.3%: that is, 98.3% of the variation in unpaid taxes is explained by the three independent variables. As we add more variables to a regression, the explanatory power of the equation improves if R² increases.

Example
Pam Schneider owns and operates an accounting firm in Ithaca, New York. Pam feels that it would be useful to be able to predict in advance the number of rush income-tax returns during the


busy March 1 to April 15 period so that she can better plan her personnel needs during this time. She has hypothesized that several factors may be useful in her prediction. Data for these factors and the number of rush returns for past years are as follows:
X1                X2                     X3                  Y
Economic index    Population within      Average income      Number of rush returns,
                  1 mile of office       in Ithaca           March 1 to April 15
99                10188                  21465               2306
106               8566                   22228               1266
100               10557                  27665               1422
129               10219                  25200               1721
179               9662                   26300               2544

a. Use the following Minitab output to determine the best-fitting regression equation for these data. The regression equation is:

Predictor   Coef      Stdev     T-ratio   P
Constant    -1275     2699      -0.47     0.719
X1          17.059    6.098     2.47      0.245
X2          0.5456    0.3144    1.72      0.335
X3          -0.1743   0.1005    -1.73     0.333

S = 396.1    R-sq = 87.2%

b. What percentage of the total variation in the number of rush returns is explained by this equation?
c. For this year the economic index is 169, the population within 1 mile of the office is 10,212, and the average income in Ithaca is $26,925. How many rush returns should Pam expect to process between March 1 and April 15?

Results
a. ŷ = -1275 + 17.059 X1 + 0.5406 X2 - 0.1743 X3
b. R² = 87.2%; 87.2% of the total variation in Y is explained by the model.
c. ŷ = -1275 + 17.059(169) + 0.5406(10,212) - 0.1743(26,925) = 2436 rush returns.

Exercises
Q1. Given the following set of data, use whatever computer package is available to find the best-fitting regression equation and answer the following:

Y      X1    X2    X3
64.7   3.5   5.3   8.5
80.9   7.4   1.6   2.6
24.6   2.5   6.3   4.5
43.9   3.7   9.4   8.8
77.7   5.5   1.4   3.6
20.6   8.3   9.2   2.5
66.9   6.7   2.5   2.7
34.3   1.2   2.2   1.3

a. What is the regression equation?
b. What is the standard error of estimate?
c. What is R² for this regression?
d. What is the predicted value for Y when X1 = 5.8, X2 = 4.2, X3 = 5.1?
e. Give an approximate 95 percent confidence interval for the value of Y when the values of X1, X2, X3, and X4 are 52.4, 41.6, 35.8, and 3, respectively.

Q3. We are trying to predict the annual demand for widgets (Demand) using the following independent variables:
Price = price of widgets (in $)
Income = consumer income (in $)
Sub = price of a substitute commodity (in $)
(Note: a substitute commodity is one that can be substituted for another commodity. For example, margarine is a substitute commodity for butter.)

Year   Demand   Price ($)   Income ($)   Sub ($)
1982   40       9           400          10
1983   45       8           500          14
1984   50       9           600          12
1985   55       8           700          13
1986   60       7           800          11
1987   70       6           900          15
1988   65       6           1000         26
1989   65       8           1100         27
1990   75       5           1200         22
1991   75       5           1300         19
1992   80       5           1400         20
1993   100      3           1500         23
1994   90       4           1600         18
1995   95       3           1700         24
1996   85       4           1800         21


a. Using whatever computer package is available, determine the best-fitting regression equation for these data.
b. Are the signs (+ or -) of the regression coefficients of the independent variables as one would expect? Explain briefly.
c. State and interpret the coefficient of multiple determination for this problem.
d. State and interpret the standard error of estimate for this problem.
e. Using the equation, what would you predict for Demand if the price of widgets were $6, consumer income were $1,200, and the price of the substitute commodity were $17?
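The exercises all say to use whatever computer package is available. As one illustration (a sketch, not the book's own solution), here is a small numpy helper, with the hypothetical name fit_multiple_regression, that returns the fitted coefficients (part a), the standard error of estimate (part b) and R² (part c). It is checked on made-up data generated from a known equation, and the worked Pam Schneider prediction is then reproduced from the printed coefficients.

```python
import numpy as np

def fit_multiple_regression(x_vars, y):
    """Ordinary least squares: returns (coefficients [a, b1, ..., bk], se, r2)."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y))] + [np.asarray(v, dtype=float) for v in x_vars])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    k = X.shape[1] - 1
    se = np.sqrt((resid**2).sum() / (len(y) - k - 1))
    r2 = 1 - (resid**2).sum() / ((y - y.mean())**2).sum()
    return coef, se, r2

# Check on made-up data generated from a known equation (no noise),
# so the fit recovers a = 2, b1 = 3, b2 = -1 and gives R^2 = 1
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([5.0, 3.0, 6.0, 2.0, 8.0, 1.0])
y = 2 + 3 * x1 - 1 * x2
coef, se, r2 = fit_multiple_regression([x1, x2], y)

# Reproducing the Pam Schneider prediction from the printed coefficients
pam = -1275 + 17.059 * 169 + 0.5406 * 10212 - 0.1743 * 26925
print(round(pam))  # → 2436 rush returns
```

Running the helper on the exercise data sets gives the regression equation, se and R² asked for; the confidence-interval parts additionally need the tabled t value for n - k - 1 degrees of freedom.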
