# Multiple Regression

09/10/2010

Data Analysis: Univariate, Bivariate, and Multivariate

Three Types of Analysis

We can classify analysis into three types:

1. Univariate, involving a single variable at a time,
2. Bivariate, involving two variables at a time, and
3. Multivariate, involving three or more variables simultaneously.

Revision: Application Areas: Correlation

1. Correlation and Regression are generally performed together. The application of correlation analysis is to measure the degree of association between two sets of quantitative data. The correlation coefficient measures this association. It has a value ranging from 0 (no correlation) to 1 (perfect positive correlation), or -1 (perfect negative correlation).

2. For example, how are sales of product A correlated with sales of product B? Or, how is advertising expenditure correlated with other promotional expenditure? Or, are daily ice cream sales correlated with daily maximum temperature?

3. Correlation does not necessarily mean there is a causal effect. Given any two strings of numbers, there will be some correlation between them. It does not imply that one variable is causing a change in the other, or is dependent upon the other.

4. Correlation is usually followed by regression analysis in many applications.
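The correlation coefficient described above can be computed from first principles. A minimal sketch, using hypothetical data chosen only to illustrate the two extremes of the -1 to +1 range:

```python
import math

# Pearson correlation coefficient from first principles.
# The example data below are hypothetical, for illustration only.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3], [2, 4, 6]))   # ≈ 1.0 (perfect positive correlation)
print(pearson_r([1, 2, 3], [6, 4, 2]))   # ≈ -1.0 (perfect negative correlation)
```

In practice, temperature and ice cream sales data would replace the illustrative lists.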

Application Areas: Regression

1. The main objective of regression analysis is to explain the variation in one variable (called the dependent variable), based on the variation in one or more other variables (called the independent variables).

2. Application example: 'explaining' variations in sales of a product based on advertising expenses, or number of salespeople, or number of sales offices, or on all of the above variables.

3. If only one independent variable is used to explain the variation in the dependent variable, it is called a simple (bivariate) regression model.

4. If multiple independent variables are used to explain the variation in a dependent variable, it is called a multiple regression model.

5. Even though the form of the regression equation could be either linear or nonlinear, we will limit our discussion to linear (straight line) models.

Purposes of Regression Analysis

1. To establish the relationship between a dependent variable (outcome) and a set of independent (explanatory) variables

2. To identify the relative importance of the different independent (explanatory) variables for the outcome

3. To make predictions

Requirements for Applying Multiple Regression Analysis

1. The variables used (independent and dependent) are assumed to be either interval scaled or ratio scaled.
2. Nominally scaled variables can be used as independent variables in a regression model, with dummy variable coding.
3. If the dependent variable happens to be a nominally scaled one, discriminant analysis should be used instead of regression.
4. In short: the dependent variable must be metric; the independent variables may be metric or dummy.
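The dummy-variable coding mentioned in point 2 can be sketched in a few lines. The "region" variable below is hypothetical; the general rule is that a nominal variable with k levels becomes k-1 indicator columns, with one level serving as the baseline:

```python
# Dummy-variable coding of a nominal predictor (hypothetical "region" variable).
# A k-level nominal variable becomes k-1 indicator (0/1) columns; the baseline
# level is represented by all zeros.
def dummy_code(values, baseline):
    levels = [v for v in sorted(set(values)) if v != baseline]
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]

regions = ["North", "South", "East", "North"]
# Columns are in sorted order: East, South; "North" is the baseline.
print(dummy_code(regions, baseline="North"))  # → [[0, 0], [0, 1], [1, 0], [0, 0]]
```

The resulting columns can then enter the regression alongside the metric predictors.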

The general regression model (linear, unstandardised) is of the type

Y = b0 + b1x1 + b2x2 + ... + bkxk

where Y is the dependent variable; x1, x2, ..., xk are the independent variables expected to be related to Y and expected to explain or predict Y; and b1, b2, ..., bk are the coefficients of the respective independent variables, which will be determined from the input data.

Steps of Regression Analysis

Step 1: Construct a regression model
Step 2: Estimate the regression and interpret the results
Step 3: Conduct diagnostic analysis of the results
Step 4: Change the original regression model if necessary
Step 5: Make predictions

DATA (INPUT / OUTPUT)

1. Input data on Y and each of the x variables is required to do a regression analysis. This data is fed into a computer package to perform the regression analysis.

2. The output consists of the 'b' coefficients for all the independent variables in the model. It also gives the results of a 't' test for the significance of each variable in the model, and the results of the 'F' test for the model as a whole.

3. Assuming the model is statistically significant at the desired confidence level (usually 90 or 95%), the coefficient of determination, or R², of the model is an important part of the output. The R² value is the percentage (or proportion) of the total variance in Y explained by all the independent variables in the regression equation.


Worked Example: Problem

A manufacturer and marketer of electric motors would like to build a regression model consisting of five or six independent variables to predict sales. Past data has been collected for 15 sales territories, on sales and six different independent variables. Build a regression model and recommend whether or not it should be used by the company.

The data are for a particular year, in the different sales territories in which the company operates, and the variables on which data are collected are as follows:

Dependent Variable
Y = sales in Rs. lakhs in the territory

Independent Variables
X1 = market potential in the territory (in Rs. lakhs)
X2 = No. of dealers of the company in the territory
X3 = No. of salespeople in the territory
X4 = Index of competitor activity in the territory on a 5-point scale (1 = low, 5 = high level of activity by competitors)
X5 = No. of service people in the territory
X6 = No. of existing customers in the territory

| Territory | SALES | POTENTL | DEALERS | PEOPLE | COMPET | SERVICE | CUSTOM |
|-----------|-------|---------|---------|--------|--------|---------|--------|
| 1         | 5     | 25      | 1       | 6      | 5      | 2       | 20     |
| 2         | 60    | 150     | 12      | 30     | 4      | 5       | 50     |
| 3         | 20    | 45      | 5       | 15     | 3      | 2       | 25     |
| 4         | 11    | 30      | 2       | 10     | 3      | 2       | 20     |
| 5         | 45    | 75      | 12      | 20     | 2      | 4       | 30     |
| 6         | 6     | 10      | 3       | 8      | 2      | 3       | 16     |
| 7         | 15    | 29      | 5       | 18     | 4      | 5       | 30     |
| 8         | 22    | 43      | 7       | 16     | 3      | 6       | 40     |
| 9         | 29    | 70      | 4       | 15     | 2      | 5       | 39     |
| 10        | 3     | 40      | 1       | 6      | 5      | 2       | 5      |
| 11        | 16    | 40      | 4       | 11     | 4      | 2       | 17     |
| 12        | 8     | 25      | 2       | 9      | 3      | 3       | 10     |
| 13        | 18    | 32      | 7       | 14     | 3      | 4       | 31     |
| 14        | 23    | 73      | 10      | 10     | 4      | 3       | 43     |
| 15        | 81    | 150     | 15      | 35     | 4      | 7       | 70     |

Regression

We will first run the regression model of the following form, by entering all six 'x' variables in the model:

Y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6 .......... Equation 1

and determine the values of b0, b1, b2, b3, b4, b5, and b6.

Regression Output:

MULTIPLE REGRESSION RESULTS: All independent variables were entered in one block

Dependent Variable:  SALES
Multiple R:          .988531605
Multiple R-Square:   .977194734
Adjusted R-Square:   .960090784
Number of cases:     15
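The fit reported above can be reproduced with ordinary least squares. This is a minimal numpy sketch, not the statistics package the lecture used, but the unstandardised 'b' coefficients and R-square come out the same for OLS on this data:

```python
import numpy as np

# Territory data from the worked example (15 territories).
sales   = [5, 60, 20, 11, 45, 6, 15, 22, 29, 3, 16, 8, 18, 23, 81]
potentl = [25, 150, 45, 30, 75, 10, 29, 43, 70, 40, 40, 25, 32, 73, 150]
dealers = [1, 12, 5, 2, 12, 3, 5, 7, 4, 1, 4, 2, 7, 10, 15]
people  = [6, 30, 15, 10, 20, 8, 18, 16, 15, 6, 11, 9, 14, 10, 35]
compet  = [5, 4, 3, 3, 2, 2, 4, 3, 2, 5, 4, 3, 3, 4, 4]
service = [2, 5, 2, 2, 4, 3, 5, 6, 5, 2, 2, 3, 4, 3, 7]
custom  = [20, 50, 25, 20, 30, 16, 30, 40, 39, 5, 17, 10, 31, 43, 70]

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(15), potentl, dealers, people, compet, service, custom])
y = np.array(sales, dtype=float)
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-square = 1 - (residual sum of squares / total sum of squares).
resid = y - X @ b
r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
print(np.round(b, 4))   # b0, b1, ..., b6
print(round(r2, 4))     # should be close to the reported .9772
```

This is only a check on the arithmetic; the package output additionally supplies the standard errors, t tests, and F test discussed below.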

The ANOVA Table

Analysis of Variance; Depen. Var: SALES (regdata1.sta)

| Effect   | Sums of Squares | df | Mean Squares | F        | p-level  |
|----------|-----------------|----|--------------|----------|----------|
| Regress. | 6609.484        | 6  | 1101.581     | 57.13269 | .000004  |
| Residual | 154.2498        | 8  | 19.281       |          |          |
| Total    | 6763.733        |    |              |          |          |

From the analysis of variance table, the last column indicates the p-level to be 0.000004. This means the model is statistically significant at a confidence level of (1 - 0.000004) × 100, or 99.9996 percent.
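The ANOVA arithmetic can be verified directly from the sums of squares: each mean square is the sum of squares divided by its degrees of freedom, and F is their ratio.

```python
# Verifying the ANOVA table: F = (SS_regression / df_reg) / (SS_residual / df_res)
ss_reg, df_reg = 6609.484, 6     # regression sum of squares, degrees of freedom
ss_res, df_res = 154.2498, 8     # residual sum of squares, degrees of freedom

ms_reg = ss_reg / df_reg         # mean square, regression ≈ 1101.58
ms_res = ss_res / df_res         # mean square, residual ≈ 19.281
f_stat = ms_reg / ms_res
print(round(f_stat, 2))          # ≈ 57.13, matching the reported F(6,8)
```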

Regression Summary for Dependent Variable: SALES
Multiple R = .98853160   R² = .97719473   Adjusted R² = .96009078
F(6,8) = 57.133   p < .00000   Std. Error of Estimate: 4.3910

| N = 15    | BETA     | St. Err. of BETA | B        | St. Err. of B | t(8)     | p-level |
|-----------|----------|------------------|----------|---------------|----------|---------|
| Intercept |          |                  | -3.1729  | 5.813394      | -.54581  | .600084 |
| POTENTL   | .439073  | .144411          | .22685   | .074611       | 3.04044  | .016052 |
| DEALERS   | .164315  | .126591          | .81938   | .631266       | 1.29800  | .230457 |
| PEOPLE    | .413967  | .158646          | 1.09104  | .418122       | 2.60937  | .031161 |
| COMPET    | -.084871 | .060074          | -1.89270 | 1.339712      | -1.41276 | .195427 |
| SERVICE   | -.040806 | .116511          | -.54925  | 1.568233      | -.35024  | .735204 |
| CUSTOM    | .050490  | .149302          | .06594   | .095002       | .33817   | .743935 |

Column 4 of the table, titled 'B', lists all the coefficients for the model. These are:

a (intercept) = -3.1729
b1 = .22685
b2 = .81938
b3 = 1.09104
b4 = -1.89270
b5 = -0.54925
b6 = 0.06594

Substituting these values of a, b1, b2, ..., b6 in Equation 1, we can write the equation (rounding off all coefficients to 2 decimals) as

Sales = -3.17 + .23 (potential) + .82 (dealers) + 1.09 (salespeople) - 1.89 (competitor activity) - 0.55 (service people) + 0.07 (existing customers)

[Y = a + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6]

The estimated increase in sales for every unit increase or decrease in an independent variable is given by the coefficient of that variable. For instance, if the number of salespeople is increased by 1, sales in Rs. lakhs are estimated to increase by 1.09, if all other variables are unchanged. Similarly, if 1 more dealer is added, sales are expected to increase by 0.82 (Rs. lakhs), other variables remaining unchanged.

The SERVICE variable does not make much intuitive sense. If we increase the number of service people, sales are estimated to decrease, according to the -0.55 coefficient of the variable "No. of Service People" (SERVICE). But looking at the individual variable 't' tests, we find that the coefficient of the variable SERVICE is statistically not significant (p-level 0.735204). Therefore, the coefficient for SERVICE should not be used in interpreting the regression, as it may lead to wrong conclusions.

Strictly speaking, only two variables, market potential (POTENTL) and No. of salespeople (PEOPLE), are statistically significant at the 90 percent confidence level, since their p-levels are less than 0.10. One should therefore only look at the relationship of sales with one or both of these variables.

Different modes of entering independent variables in the model:
- Enter
- Forward Stepwise Regression

The final unstandardised model is given by

Sales = -10.6164 + .2433 (POTENTL) + 1.4244 (PEOPLE) .......... Equation 3

Predictions: If the potential in a territory were Rs. 50 lakhs, and the territory had 6 salespeople, then expected sales, using the above equation, would be

= -10.6164 + .2433(50) + 1.4244(6) = Rs. 10.095 lakhs.

Similarly, we could use this model to make predictions regarding sales in any territory for which the market potential and number of salespeople are known.
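The prediction arithmetic for Equation 3 is easy to wrap in a small helper. A minimal sketch (the function name `predict_sales` is ours, not from the lecture):

```python
# Prediction from the reduced model, Equation 3:
# Sales = -10.6164 + .2433 * POTENTL + 1.4244 * PEOPLE
def predict_sales(potential, salespeople):
    return -10.6164 + 0.2433 * potential + 1.4244 * salespeople

# Worked example: potential = Rs. 50 lakhs, 6 salespeople.
print(round(predict_sales(50, 6), 3))  # → 10.095 (Rs. lakhs)
```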

Recommended Usage

1. It is recommended that for serious decision-making, there should be a-priori knowledge of the variables that are likely to affect Y, and only such variables should be used in the regression analysis.
2. For exploratory research, the hit-and-trial approach may be used.
3. It is also recommended that unless the model itself is significant at the desired confidence level (as evidenced by the F test results printed out for the model), the R² value should not be interpreted.

Multicollinearity and How to Tackle It

Multicollinearity: interrelationship among the various independent variables.

It is essential to verify whether the independent variables are highly correlated with each other. If they are, this may indicate that they are not truly independent of each other, and we may be able to use only one or two of them to predict the dependent variable. Independent variables which are highly correlated with each other should not be included in the model together.
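One simple screen for multicollinearity is the pairwise correlation matrix of the predictors. A sketch using three of the worked example's predictors; the 0.8 cutoff used here is a common rule of thumb, not a fixed threshold:

```python
import numpy as np

# Three predictors from the worked example's territory data.
predictors = {
    "POTENTL": [25, 150, 45, 30, 75, 10, 29, 43, 70, 40, 40, 25, 32, 73, 150],
    "DEALERS": [1, 12, 5, 2, 12, 3, 5, 7, 4, 1, 4, 2, 7, 10, 15],
    "PEOPLE":  [6, 30, 15, 10, 20, 8, 18, 16, 15, 6, 11, 9, 14, 10, 35],
}
names = list(predictors)
# np.corrcoef treats each row as one variable.
r = np.corrcoef([predictors[n] for n in names])

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        flag = "  <-- highly correlated" if abs(r[i, j]) > 0.8 else ""
        print(f"{names[i]} vs {names[j]}: r = {r[i, j]:.3f}{flag}")
```

Pairs flagged this way are candidates for dropping one of the two variables before re-estimating the model.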
