
Multiple Regression Analysis

R. Venkatesakumar
Department of Management Studies (SOM)
Pondicherry University

What is Multiple Regression Analysis?

• Multiple regression analysis is the most widely used dependence technique, with applications across all types of problems and all disciplines.
• Multiple regression analysis is a statistical technique that can be used to analyze the relationship between a single dependent (criterion) variable and several independent (predictor) variables.
Why do we use multiple regression analysis?

• When the research objective is to predict a statistical relationship or to explain underlying relationships among variables.
• It enables the researcher to use two or more metric independent variables to estimate the dependent variable.

When do you use Multiple Regression Analysis?

When the researcher has theoretical or conceptual justification for predicting or explaining the dependent variable with the set of independent variables.
An Illustration of Multiple Regression

  Credit Cards (Y)   Family Size (V1)   Family Income (V2)   Cars (V3)
         4                  2                  14                1
         6                  2                  16                2
         6                  4                  14                2
         7                  4                  17                1
         8                  5                  18                3
         7                  5                  21                2
         8                  6                  17                1
        10                  6                  25                2

Mean representation

• If the mean alone is used as a 'representative for the data', i.e. for the number of credit cards (Y):
  µ = Total / Number of observations = 56 / 8 = 7
• The total 'error' due to representing the data by its mean is 22:

      Y    Y - µ   (Y - µ)²
      4     -3        9
      6     -1        1
      6     -1        1
      7      0        0
      8      1        1
      7      0        0
      8      1        1
     10      3        9
  Total      0       22

• How can we reduce the 'error' and get a 'better representation' of the data? (A minimal check of this arithmetic is sketched below.)
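
A quick way to verify the numbers above; this is a hypothetical Python/NumPy sketch (the variable names are ours, the data are from the slide):

```python
import numpy as np

# Number of credit cards held by the eight families (data from the slide).
y = np.array([4, 6, 6, 7, 8, 7, 8, 10])

mu = y.mean()                      # 56 / 8 = 7.0
sse_mean = ((y - mu) ** 2).sum()   # total squared 'error' about the mean
print(mu, sse_mean)                # 7.0 22.0
```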

How to reduce ‘error’?

• The mean produces a lower total squared error than the median or the mode (if the variable is normally distributed, the three coincide and produce equal error terms).
• This error term plays a significant role in all statistical techniques: it is the variance.
• We need a new variable to reduce the error in the variable under consideration (the dependent variable).
• Variables that are related (correlated) to the variable considered (the Y / dependent variable) may be very useful.

Correlation among variables

                      Credit Cards (Y)   Family Size (V1)   Family Income (V2)   Cars (V3)
Credit Cards (Y)           1.0000
Family Size (V1)           0.8664             1.0000
Family Income (V2)         0.8290             0.6727             1.0000
Cars (V3)                  0.3419             0.1917             0.3008             1.0000

The variable with the highest correlation with Y (Family Size, 0.8664) is entered into the regression first.
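
The matrix can be reproduced with a short, hypothetical NumPy sketch:

```python
import numpy as np

y  = np.array([4, 6, 6, 7, 8, 7, 8, 10])         # credit cards (Y)
v1 = np.array([2, 2, 4, 4, 5, 5, 6, 6])          # family size (V1)
v2 = np.array([14, 16, 14, 17, 18, 21, 17, 25])  # family income (V2)
v3 = np.array([1, 2, 2, 1, 3, 2, 1, 2])          # number of cars (V3)

# Pearson correlation matrix; the first row reproduces the slide's Y column.
R = np.corrcoef([y, v1, v2, v3])
print(np.round(R[0], 4))   # [1.     0.8664 0.829  0.3419]
```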

Prediction accuracy of regression

Prediction using Ŷ = 2.871 + 0.971 V1:

  Credit Cards (Y)   Family Size (V1)      Ŷ       Y - Ŷ   (Y - Ŷ)²
         4                  2            4.814     -0.81     0.663
         6                  2            4.814      1.19     1.406
         6                  4            6.757     -0.76     0.573
         7                  4            6.757      0.24     0.059
         8                  5            7.729      0.27     0.074
         7                  5            7.729     -0.73     0.531
         8                  6            8.700     -0.70     0.490
        10                  6            8.700      1.30     1.690
                                         Total      0.00     5.486

Prediction error due to the regression (unexplained by the regression): 5.486

Mathematics of regression

  Total error (due to mean representation)     22
  Error in regression prediction               5.486
  Error reduced by regression analysis         16.514

Error-reduction ability of the regression = 16.514 / 22 = 0.7506

This equals the squared correlation, 0.8664 × 0.8664 = 0.7506: the coefficient of determination, R². (The fit is reproduced in the sketch below.)
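
A hypothetical NumPy sketch of the one-predictor fit and its R² (np.polyfit is used here for brevity; any least-squares routine gives the same numbers):

```python
import numpy as np

y  = np.array([4, 6, 6, 7, 8, 7, 8, 10])
v1 = np.array([2, 2, 4, 4, 5, 5, 6, 6])

b, a = np.polyfit(v1, y, 1)             # slope ≈ 0.9714, intercept ≈ 2.8714
sse = ((y - (a + b * v1)) ** 2).sum()   # unexplained error ≈ 5.486
sst = ((y - y.mean()) ** 2).sum()       # error about the mean = 22
print(round(1 - sse / sst, 4))          # 0.7506, i.e. R² = 0.8664²
```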

Excel / SPSS output

Regression Statistics
  Multiple R            0.8664
  R Square              0.7506
  Adjusted R Square     0.7091
  Standard Error        0.9562
  Observations          8

ANOVA
               df       SS         MS         F      Significance F
  Regression    1    16.5143    16.5143    18.0625      0.0054
  Residual      6     5.4857     0.9143
  Total         7    22.0000

                Coefficients   Std. Error   t Stat   P-value   Lower 95%   Upper 95%
  Intercept        2.8714        1.0286     2.7917    0.0315     0.3546      5.3883
  X Variable 1     0.9714        0.2286     4.2500    0.0054     0.4121      1.5307

Reducing the error further

• To reduce the error in the dependent variable further, we add one more independent variable.
• Choose the variable with the next highest correlation with Y (preferably one with little correlation with the independent variable already in the model).
• Correlation among three or more independent variables, referred to as multicollinearity, influences the further predictions.

Prediction using Ŷ = 0.482 + 0.63 V1 + 0.216 V2

  Credit Cards (Y)   Family Size (V1)   Family Income (V2)      Ŷ       Y - Ŷ   (Y - Ŷ)²
         4                  2                  14              4.77     -0.77    0.5868
         6                  2                  16              5.20      0.80    0.6432
         6                  4                  14              6.03     -0.03    0.0007
         7                  4                  17              6.67      0.33    0.1063
         8                  5                  18              7.52      0.48    0.2304
         7                  5                  21              8.17     -1.17    1.3642
         8                  6                  17              7.93      0.07    0.0044
        10                  6                  25              9.66      0.34    0.1142
                                                              Total      0.00    3.0501

  Total error (due to mean representation)     22
  Error in regression prediction               3.050
  Error reduced by regression analysis         18.950

Error-reduction ability of the regression = 18.950 / 22 = 0.8614 (the new R²; the two-predictor fit is sketched below)
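
A hypothetical NumPy sketch of the two-predictor fit:

```python
import numpy as np

y  = np.array([4, 6, 6, 7, 8, 7, 8, 10])
v1 = np.array([2, 2, 4, 4, 5, 5, 6, 6])
v2 = np.array([14, 16, 14, 17, 18, 21, 17, 25])

X = np.column_stack([np.ones(8), v1, v2])        # intercept, V1, V2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # ≈ [0.482, 0.632, 0.216]
sse = ((y - X @ coef) ** 2).sum()                # ≈ 3.050
print(np.round(coef, 3), round(1 - sse / 22, 4))  # coefficients, R² ≈ 0.8614
```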

7
The logic…

• The logic for entering a second variable is slightly different.
• Instead of the simple correlation, we must consider the partial correlation.
• The partial correlation measures the variation in the dependent variable that is not accounted for by the variable(s) already in the equation but can be accounted for by each of the additional variables.

Consider the correlation & partial correlation

SPSS output (VAR00001 = Y, VAR00002 = V1 Family Size, VAR00003 = V2 Family Income, VAR00004 = V3 Cars):

Control variable: none (zero-order Pearson correlations; df = 6)
                    VAR00001   VAR00003   VAR00004   VAR00002
  VAR00001            1.000      .829       .342       .866
    Sig. (2-tailed)     .        .011       .407       .005
  VAR00003             .829     1.000       .301       .673
    Sig. (2-tailed)    .011        .        .469       .068
  VAR00004             .342      .301      1.000       .192
    Sig. (2-tailed)    .407      .469        .         .649
  VAR00002             .866      .673       .192      1.000
    Sig. (2-tailed)    .005      .068       .649         .

Controlling for VAR00002 (Family Size): partial correlations (df = 5)
                    VAR00001   VAR00003   VAR00004
  VAR00001            1.000      .666       .359
    Sig. (2-tailed)     .        .102       .429
  VAR00003             .666     1.000       .237
    Sig. (2-tailed)    .102        .        .609
  VAR00004             .359      .237      1.000
    Sig. (2-tailed)    .429      .609        .

The logic…

• The highest partial correlation with Y is for V2 (0.666).
• The incremental variance accounted for by entering V2:

  Variation accounted for by the fitted model    0.7506
  Remaining variance                             0.2494
  Partial correlation                            0.666
  Partial correlation squared                    0.4436

  Error reduction = Remaining variance × Partial correlation²
                  = 0.2494 × (0.666 × 0.666) = 0.1106

  New R² ≈ 0.7506 + 0.1106 ≈ 0.861, the two-variable R² obtained above (up to rounding; see the sketch below).
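
A hypothetical sketch of that computation, using the zero-order correlations from the table above:

```python
import numpy as np

# Zero-order correlations: Y-V1, Y-V2, V1-V2.
r_y1, r_y2, r_12 = 0.8664, 0.8290, 0.6727

# Partial correlation of V2 with Y, removing V1's linear effect from both.
r_y2_1 = (r_y2 - r_y1 * r_12) / np.sqrt((1 - r_y1**2) * (1 - r_12**2))
print(round(r_y2_1, 3))                     # ≈ 0.666

# Incremental R² when V2 enters = remaining variance × squared partial r.
print(round((1 - r_y1**2) * r_y2_1**2, 4))  # ≈ 0.1107 (slide rounds to 0.1106)
```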

Stage 1

Multiple regression analysis: two possible objectives

• Prediction: attempts to predict the change in the dependent variable resulting from changes in multiple independent variables.
• Explanation: enables the researcher to explain the variate by assessing the relative contribution of each independent variable to the regression equation.

Stage 1

Two objectives associated with prediction

• Maximization of the overall predictive power of the independent variables in the variate.
• Comparison of competing models made up of two or more sets of independent variables, to assess the predictive power of each.

Stage 1

Explanation: three associated objectives

• Determination of the relative importance of each independent variable in the prediction of the dependent variable.
• Assessment of the nature of the relationships between the predictors and the dependent variable (i.e., linearity).
• Insight into the interrelationships among the independent variables and the dependent variable (i.e., correlations).
  – Multicollinearity can mask these relations.

Stage 1

Appropriate for statistical relationships, not functional relationships

• Statistical relationships assume that more than one value of the dependent variable will be observed for any value of the independent variables.
  – An average value is estimated, and error is expected in prediction.
• Functional relationships assume that a single value of the dependent variable will be observed for any value of the independent variables.
  – An exact estimate is made, with no error.

Stage 1

Selection of dependent and independent variables

The researcher should always consider three issues that can affect any decision about variables:
• The theory that supports using the variables.
• Measurement error, especially in the dependent variable (whether the variable is an accurate and consistent measure of the concept being studied).
• Specification error.

Stage 1

Specification error

• The inclusion of an independent variable must be guided by the theoretical foundation of the regression model and its managerial implications.
• A variable that reaches statistical significance by chance, but has no theoretical or managerial relationship with the dependent variable, is of no use to the researcher in explaining the phenomenon under observation.
• Researchers must be concerned with specification error: the inclusion of irrelevant variables or the omission of relevant variables.
• Aim for parsimony in the regression model: the fewest independent variables with the greatest contribution to the variance explained.

Stage 2: Research Design

Research design of a multiple regression analysis

• The sample size used will impact statistical power and generalizability.
• The power (probability of detecting statistically significant relationships) at specified significance levels is related to sample size.
• The generalizability of the results is directly affected by the ratio of observations to independent variables.

Sample size, variables in the model, and prediction

Minimum R² that can be found statistically significant with a power of 0.80, for varying numbers of independent variables and sample sizes:

Significance level α = 0.01        Number of independent variables
  Sample size       2       5      10      20
       20         0.45    0.56    0.71     NA
       50         0.23    0.29    0.36    0.49
      100         0.13    0.16    0.20    0.26
      250         0.05    0.07    0.08    0.11
      500         0.03    0.03    0.04    0.06
     1000         0.01    0.02    0.02    0.03

Significance level α = 0.05        Number of independent variables
  Sample size       2       5      10      20
       20         0.39    0.48    0.64     NA
       50         0.19    0.23    0.29    0.42
      100         0.10    0.12    0.15    0.21
      250         0.04    0.05    0.06    0.08
      500         0.03    0.04    0.05    0.09
     1000         0.01    0.01    0.02    0.02

Stage 2: Research Design

Small samples vs. large samples

• Small samples (fewer than 20 observations) will detect only very strong relationships with any degree of certainty.
• Large samples (1,000 or more observations) will find almost any relationship statistically significant, owing to the over-sensitivity of the test.
• The minimum ratio is 5 to 1 (i.e., 5 observations per independent variable in the variate).
• The desired level is 15 to 20 observations for each independent variable.
• If a stepwise method is employed, some references recommend 50 to 1.

Stage 2: Research Design

Random effects model

• Most regression models for survey data are random effects models.
• In a random effects model, the levels of the predictor are selected at random, and a portion of the random error comes from the sampling of the predictors.

Stage 2: Research Design

Transformations of the data

• Non-linear relationships:
  – Arithmetic transformations (e.g., square root or logarithm) and polynomials are most often used to represent non-linear relationships.
• Moderator effects:
  – Reflect the changing nature of one independent variable's relationship with the dependent variable as a function of another independent variable.
  – Represented as a compound (interaction) variable in the regression equation.
  – Moderators change the interpretation of the regression coefficients: to determine the total effect of an independent variable, the separate and moderated effects must be combined.
• Nonmetric variable inclusion:
  – Dichotomous variables, also known as dummy variables, may be used to replace nonmetric independent variables.
  – The resulting coefficients represent the differences in group means from the comparison group and are in the same units as the dependent variable.

Stage 3: Assumptions

Assumptions in multiple regression analysis

• Multiple regression analysis assumes the use of metric independent and dependent variables.
• A statistical linear relationship must exist between the independent and dependent variables.
• Implied relationship: regression attempts to predict one variable from a group of other variables.
  – When no relationship exists between the criterion variable and the predictor variables, there can be no prediction.
  – The researcher must have grounds to expect some relationship between the single criterion variable and the predictor group of variables.

Stage 3: Assumptions

3.1 Assessment of individual variables versus the variate

• Assumptions must be tested not only for each dependent and independent variable, but for the variate as well.
• Graphical analyses (i.e., partial regression plots, residual plots, and normal probability plots) are the most widely used methods of assessing assumptions for the variate.
• Remedies for problems found in the variate must be accomplished by modifying one or more independent variables.

Stage 3: Assumptions

3.2 Linearity of the phenomenon

• The regression coefficient is assumed to be constant across the range of values of the independent variables.
• Assessed with a residual plot.
• Any systematic pattern in the residuals indicates a nonlinear relationship that is not represented by the current model.
• In a multiple regression model, the residual plot reflects the combined effect of all the independent variables in the model.
• The relationship of a single independent variable to the dependent variable is portrayed by partial regression plots.


Stage 4: Estimating & Model Fit Assessment

Estimating the regression model and assessing overall fit

4.1 Variable selection

• Confirmatory specification
  – The analyst specifies the complete set of independent variables.
  – The analyst has total control over variable selection.
• Sequential search approaches
  – Sequential approaches estimate a regression equation with a set of variables, then add or delete variables until some overall criterion measure is achieved.
  – Variable entry may be done in a forward, backward, or stepwise manner.

Stage 4: Estimating & Model Fit Assessment

4.1 Variable selection (cont.)

• The forward method begins with no variables in the equation and then adds variables that satisfy the F-to-enter test.
  – The equation is estimated again, and the F-to-enter of the remaining variables is calculated.
  – This is repeated until the F-to-enter test finds no more variables to enter.
• Backward elimination begins with all variables in the regression equation and then eliminates variables that fail the F-to-remove test.
  – The same repetition of estimation is performed as with forward estimation.
• Stepwise estimation is a combination of the forward and backward methods (a forward-selection sketch follows this list).
  – It begins with no variables in the equation, as with forward estimation, and then adds variables that satisfy the F test.
  – The equation is estimated again, and additional variables that satisfy the F test are entered.
  – At each re-estimation stage, however, the variables already in the equation are also examined for removal by the appropriate F test.
  – This repetition continues until both F tests are not satisfied by any of the variables either in or out of the regression equation.
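
A minimal sketch of the forward step (the function name, structure, and the F-to-enter default of 4.0 are our assumptions; a full stepwise routine would also re-test entered variables against an F-to-remove threshold):

```python
import numpy as np

def forward_select(X, y, f_to_enter=4.0):
    """Forward-selection sketch: repeatedly add the candidate column whose
    entry yields the largest partial F, while that F exceeds f_to_enter."""
    n = len(y)
    selected, remaining = [], list(range(X.shape[1]))

    def sse(cols):  # residual sum of squares for an intercept + cols model
        Z = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return ((y - Z @ beta) ** 2).sum()

    sse_cur = sse(selected)
    while remaining:
        best, sse_best = min(((c, sse(selected + [c])) for c in remaining),
                             key=lambda t: t[1])
        df_resid = n - len(selected) - 2   # intercept + current vars + candidate
        F = (sse_cur - sse_best) / (sse_best / df_resid)
        if F < f_to_enter:
            break
        selected.append(best); remaining.remove(best); sse_cur = sse_best
    return selected
```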

Stage 4: Estimating & Model Fit Assessment

4.1 Variable selection (cont.)

• Combinatorial methods
  – The combinatorial approach estimates regression equations for all subset combinations of the independent variables.
  – The most common procedure is known as all-possible-subsets regression.
  – Combinatorial methods become impractical for very large sets of independent variables: for even 10 independent variables, one would have to estimate 2^10 = 1,024 regression equations (counted in the snippet below).
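
A two-line illustration of that count:

```python
from itertools import combinations

k = 10
models = [c for r in range(k + 1) for c in combinations(range(k), r)]
print(len(models))   # 1024 = 2**10 subsets (including the empty model)
```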

Stage 4: Estimating & Model Fit Assessment

4.2 Examine the overall model fit

• Examine the variate's ability to predict the criterion variable, and assess how well the independent variables predict the dependent variable.
• Several statistics exist for the evaluation of overall model fit.
• Coefficient of determination (R²)
  – The coefficient of determination measures the amount of variance in the dependent variable explained by the independent variable(s).
  – A value of one (1) means perfect explanation; it is not encountered in practice because of ever-present error.
  – A value of 0.91 means that 91% of the variance in the dependent variable is explained by the independent variables.
  – The amount of variation explained by the regression model should be more than the variation explained by the average; thus, R² should be greater than zero.

Stage 4: Estimating & Model Fit Assessment

4.2 Examine the overall model fit (cont.)

• R² is impacted by two facets of the data:
  – The number of independent variables relative to the sample size (see the sample size discussion earlier). For this reason, analysts should use the adjusted coefficient of determination, which adjusts for the inflation in R² caused by overfitting the data (see the sketch below).
  – The number of independent variables included in the analysis. As you increase the number of independent variables in the model, R² increases automatically, because the sum of squares explained by the regression approaches the total sum of squares about the average.
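
The adjustment itself is a one-line formula; a hypothetical sketch, checked against the one-predictor output shown earlier:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R², penalizing R² for k predictors estimated from n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# One-predictor credit card model: matches the SPSS/Excel output above.
print(round(adjusted_r2(0.750649, n=8, k=1), 4))   # 0.7091
```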

Stage 4: Estimating & Model Fit Assessment

4.2 Examine the overall model fit (cont.)

• Standard error of the estimate
  – The standard error of the estimate is another measure of the accuracy of our predictions. It is an estimate of the standard deviation of the actual dependent values around the regression line.
  – Since it measures variation about the regression line, the smaller the standard error, the better.
• F-test
  – The F-test reported with the R² is a significance test of the R². It indicates whether a significant amount of variance (significantly different from zero) was explained by the model. (Both statistics are computed in the sketch below.)
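
A hypothetical sketch computing both from the sums of squares in the ANOVA table above:

```python
import numpy as np

# Sums of squares from the one-predictor ANOVA table: n = 8 cases, k = 1.
ssr, sse, n, k = 16.5143, 5.4857, 8, 1

see = np.sqrt(sse / (n - k - 1))       # standard error of the estimate ≈ 0.9562
F = (ssr / k) / (sse / (n - k - 1))    # ≈ 18.06 (p ≈ 0.005 in the output)
print(round(see, 4), round(F, 2))
```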

Stage 4: Estimating & Model Fit Assessment

Further insight…

• As a further measure of the strength of the model fit, compare the standard error of the estimate in the model summary table with the standard deviation of the dependent variable reported in the descriptive statistics table.
• A standard error of the estimate lower than that standard deviation implies that the model is better than mean representation.

Stage 4: Estimating & Model Fit Assessment

4.3 Analyze the variate

• The variate is the linear combination of independent variables used to predict the dependent variable.
• Analysis of the variate relates the respective contribution of each independent variable in the variate to the regression model.
  – The researcher learns which independent variable contributes the most to the variance explained, and may make relative judgments among independent variables (using standardized coefficients only).
• Regression coefficients are tested for statistical significance.
  – The intercept (or constant term) should be tested for appropriateness in the predictive model. If the constant is not significantly different from zero, it cannot be used for predictive purposes.
  – The estimated coefficients should be tested to ensure that, across all possible samples, each coefficient would be different from zero.
  – The size of the sample impacts the stability of the regression coefficients: the larger the sample, the more generalizable the estimated coefficients.
  – An F-test may be used to test the appropriateness of the intercept and the regression coefficients.

Stage 4: Estimating & Model Fit Assessment

4.4 Examine the data for influential observations

• Influential observations, leverage points, and outliers all have an effect on the regression results.
• One of four conditions gives rise to influential observations:
  – an error in observation or data entry;
  – a valid but exceptional observation, explainable by an extraordinary situation;
  – an exceptional observation with no likely explanation; or
  – an observation that is ordinary in its individual characteristics, but exceptional in its combination of characteristics.

Stage 4: Estimating & Model Fit Assessment

To assess residuals

• A residual is the difference between the observed and model-predicted values of the dependent variable.
  – The residual for a given case is the observed value of the error term for that case.
• If the model is appropriate for the data, the residuals should follow a normal distribution.
• A histogram or P-P plot of the residuals helps check the assumption of normality of the error term (a plotting sketch follows this list).
  – The shape of the histogram should approximately follow the shape of the normal curve.
  – The P-P plotted residuals should follow the 45-degree line.
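
A hypothetical sketch using Matplotlib and SciPy; note that scipy.stats.probplot draws a normal probability (Q-Q style) plot rather than SPSS's P-P plot, but it serves the same normality check:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

y  = np.array([4, 6, 6, 7, 8, 7, 8, 10])
v1 = np.array([2, 2, 4, 4, 5, 5, 6, 6])
b, a = np.polyfit(v1, y, 1)
resid = y - (a + b * v1)                       # residuals of the fitted model

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(resid, bins=5)                        # should roughly track a normal curve
stats.probplot(resid, dist="norm", plot=ax2)   # points should hug the 45-degree line
plt.show()
```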

Stage 5: Interpreting Regression

Interpreting the regression variate

• When interpreting the regression variate, the researcher evaluates the estimated regression coefficients for their explanation of the dependent variable and evaluates the potential impact of omitted variables.
• Standardized regression coefficients (beta coefficients)
  – Regression coefficients must be standardized (i.e., computed on the same unit of measurement) in order to directly compare the contribution of each independent variable.
  – Beta coefficients (the term for standardized regression coefficients) enable the researcher to examine the relative strength of each variable in the equation (computed in the sketch below).
  – For prediction, regression coefficients are not standardized and therefore remain in their original units of measurement.
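
Betas can be obtained by rescaling the unstandardized slopes; a hypothetical sketch on the credit card data (beta_j = b_j · s_xj / s_y):

```python
import numpy as np

y  = np.array([4, 6, 6, 7, 8, 7, 8, 10.])
v1 = np.array([2, 2, 4, 4, 5, 5, 6, 6.])
v2 = np.array([14, 16, 14, 17, 18, 21, 17, 25.])

X = np.column_stack([np.ones(8), v1, v2])
b = np.linalg.lstsq(X, y, rcond=None)[0][1:]          # unstandardized slopes
beta = b * np.array([v1.std(), v2.std()]) / y.std()   # beta_j = b_j * s_xj / s_y
print(np.round(beta, 3))   # ≈ [0.564 0.45]: V1 contributes more than V2
```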

Stage 5: Interpreting Regression

Issues related to multicollinearity

• Collinearity (or multicollinearity) is the undesirable situation in which the correlations among the independent variables are strong.
• It is a data problem, not a problem of model specification.
• Assessing the degree of multicollinearity and its impact on the results:
  – It reduces the additional explanatory power gained from further variables.
  – Gauging the contribution of each independent variable is difficult, since the effects are mixed.
  – High collinearity can result in regression coefficients being incorrectly estimated, even with the wrong signs.

Stage 5: Interpreting Regression

Impact of multicollinearity

[The original slide contained only a figure illustrating the impact of multicollinearity.]

Stage 5: Interpreting Regression

The issue of multicollinearity: an example

Data:
  Dependent    V1     V2
      5         6     13
      3         8     13
      9         8     11
      9        10     11
     13        10      9
     11        12      9
     17        12      7
     15        14      7

Correlations (N = 8):
                Dependent      V1         V2
  Dependent       1           .823*     -.977**
  V1              .823*      1          -.913**
  V2             -.977**     -.913**    1
  * Significant at the 0.05 level (2-tailed); ** at the 0.01 level (2-tailed).

Regression models:
  Y = -4.75 + 1.5 V1
  Y = 29.75 - 1.95 V2
  Y = 44.75 - 0.75 V1 - 2.7 V2

V1 and the dependent variable are positively related, yet once V2 enters the model, V1's coefficient appears with a negative sign. (The sketch below reproduces this sign flip.)
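
A hypothetical NumPy sketch reproducing the flip:

```python
import numpy as np

y  = np.array([5, 3, 9, 9, 13, 11, 17, 15.])
v1 = np.array([6, 8, 8, 10, 10, 12, 12, 14.])
v2 = np.array([13, 13, 11, 11, 9, 9, 7, 7.])

print(np.polyfit(v1, y, 1))    # slope +1.5: Y rises with V1 on its own
X = np.column_stack([np.ones(8), v1, v2])
print(np.linalg.lstsq(X, y, rcond=None)[0])   # [44.75 -0.75 -2.7]: sign flips
```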

Stage 5: Interpreting Regression

Identifying & remedying multicollinearity

• Eigenvalues provide an indication of how many distinct dimensions there are among the independent variables.
  – When several eigenvalues are close to zero, the variables are highly intercorrelated, and small changes in the data values may lead to large changes in the estimates of the coefficients.
• Condition indices are the square roots of the ratios of the largest eigenvalue to each successive eigenvalue (see the sketch below).
  – A condition index greater than 15 indicates a possible problem, and an index greater than 30 suggests a serious problem with collinearity.
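
A hypothetical sketch of the computation (here on unit-scaled columns of the design matrix, one common convention):

```python
import numpy as np

def condition_indices(X):
    """Scale columns of the design matrix to unit length, then return
    sqrt(largest eigenvalue / each eigenvalue) of X'X."""
    Xs = X / np.linalg.norm(X, axis=0)
    eig = np.linalg.eigvalsh(Xs.T @ Xs)
    return np.sqrt(eig.max() / eig)

# Example with the collinear data above (intercept, V1, V2):
v1 = np.array([6, 8, 8, 10, 10, 12, 12, 14.])
v2 = np.array([13, 13, 11, 11, 9, 9, 7, 7.])
X = np.column_stack([np.ones(8), v1, v2])
print(condition_indices(X))   # largest index well above 30: serious collinearity
```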

Stage 5: Interpreting Regression

Identifying & remedying multicollinearity (cont.)

• The partial correlation is the correlation of each independent variable with the dependent variable after removing the linear effect of variables already in the model.
• Tolerance is a statistic used to determine how much the independent variables are linearly related to one another (multicollinear).
  – Tolerance is the proportion of a variable's variance not accounted for by the other independent variables in the model.
  – A variable with very low tolerance contributes little information to a model and can cause computational problems.

Stage 5: Interpreting Regression

Identifying & remedying multicollinearity (cont.)

• VIF, the variance inflation factor, is the reciprocal of the tolerance.
• As the variance inflation factor increases, so does the variance of the regression coefficient, making it an unstable estimate.
• Large VIF values are an indicator of multicollinearity. (A tolerance/VIF sketch follows.)
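
A hypothetical sketch computing both diagnostics directly from their definitions (the function name is ours):

```python
import numpy as np

def tolerance_and_vif(X):
    """For each predictor column of X (no intercept column): regress it on
    the remaining predictors; tolerance = 1 - R², and VIF = 1 / tolerance."""
    n, k = X.shape
    results = []
    for j in range(k):
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
        sse = ((X[:, j] - Z @ beta) ** 2).sum()
        sst = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        r2 = 1 - sse / sst                         # R² of predictor j on the others
        results.append((1 - r2, 1 / (1 - r2)))     # (tolerance, VIF)
    return results
```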

Stage 5: Interpreting Regression

Remedies

• Apply a stepwise procedure.
• Omit one or more highly correlated independent variables (but check whether the remaining variables still specify the needed model).
• Use simple correlations between the independent and dependent variables to understand the relationships.

Stage 6: Validation

Validation of the results

• Ensure that the regression model represents the general population (generalizability) and is appropriate for the applications in which it will be employed (transferability).
• Collection of additional samples or split samples.
• Calculation of the PRESS statistic:
  – Assesses the overall predictive accuracy of the regression by a series of iterations, in each of which one observation is omitted from the estimation of the regression model and then predicted with the estimated model. The squared prediction errors for the omitted observations are summed to provide an overall measure of predictive fit. (A leave-one-out sketch follows this list.)
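
A hypothetical leave-one-out sketch of PRESS (the function name is ours):

```python
import numpy as np

def press(X, y):
    """PRESS sketch: leave one observation out, refit, predict the omitted
    case, and sum the squared prediction errors across all n iterations."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(Z[keep], y[keep], rcond=None)
        total += float(y[i] - Z[i] @ beta) ** 2
    return total
```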

Stage 6: Validation

Validation of the results (cont.)

• Comparison of regression models
  – The adjusted R² is compared across the different estimated regression models to assess which model offers the best prediction.
• Applying an estimated regression model to a new set of independent variable values.

Thank you
