
Multiple Regression Analysis
Multiple Regression
Cautions About Linear Regression

• Correlation and regression describe only linear relations.
• Correlation and the least-squares regression line are not resistant to outliers.
• Predictions outside the range of the observed data are often inaccurate.
• The relationship between two variables is often influenced by lurking variables not included in our model.
General Principle of Data Analysis

• Plot your data: to understand the data, always start with a series of graphs.
• Interpret what you see: look for the overall pattern and deviations from that pattern.
• Numerical summary? Choose an appropriate measure to describe the pattern and deviations.
• Mathematical model? If the pattern is regular, summarize the data in a compact mathematical model.
Analysis of Two Quantitative Variables

• Plot your data: for two quantitative variables, use a scatterplot.
• Interpret what you see: describe the direction, form and strength of the relationship.
• Numerical summary? If the pattern is roughly linear, summarize with the correlation, means and standard deviations.
• Mathematical model? If the relationship is roughly linear, regression gives a compact model of the overall pattern.
Analysis of Three or More Quantitative Variables

• Plot your data: to examine relationships among all possible pairs, use a scatterplot matrix.
• Interpret what you see: describe the direction, form and strength of the relationships.
• Numerical summary? If the patterns are roughly linear, summarize with correlations, means and standard deviations.
• Mathematical model? Multiple regression gives a compact model of the relationship between the response variable and a set of predictors.
Multiple Regression

• Can we predict job performance (Y) from overall school achievement (X1) and IQ scores (X2)?
  - How much variance in Y is explained by X1 and X2 in combination?
  - How important is each predictor of job performance?
• Two kinds of research questions in Multiple Regression:
  - Is the model significant and important?
  - Are the individual predictors significant and important?
The Structural Model

$$Y = c + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p + e$$

Any dependent variable score Y is predicted according to:
• c – an intercept on the Y axis, plus
• b1X1 – a weighted effect of predictor X1
• b2X2 – a weighted effect of predictor X2
• bpXp – a weighted effect of predictor Xp
• e – error
The Structural Model

$$Y = c + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p + e$$

DATA = MODEL + RESIDUAL
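To make DATA = MODEL + RESIDUAL concrete, here is a minimal sketch of fitting the structural model by ordinary least squares in Python. This is not part of the original SPSS workflow; the simulated data and the names x1, x2 and y are illustrative only.

import numpy as np
import statsmodels.api as sm

# Simulated data: a criterion y and two predictors x1, x2 (illustrative only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 + 1.5 * x1 + 0.8 * x2 + rng.normal(size=50)  # Y = c + b1*X1 + b2*X2 + e

X = sm.add_constant(np.column_stack([x1, x2]))  # prepend the intercept column c
model = sm.OLS(y, X).fit()                      # least-squares estimates of c, b1, b2
print(model.params)                             # [c, b1, b2]

Here model.fittedvalues is the MODEL part and model.resid the RESIDUAL part, so the identity above can be checked directly against y.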


The Regression Plane – Two Predictors (3D space)
Unstandardized Partial Regression Coefficients – b

• Ŷ is calculated according to the Least Squares Criterion (LSC)
• solved by finding the set of weights (b) that minimises the errors of prediction (around the plane)
  - b1 indicates the change in Y given a unit change in X1, holding X2 … Xp constant
  - when standardised, it indicates the SD change in Y given an SD change in X, and is denoted by β
• c is the Y intercept
• Ŷ is therefore a weighted combination of the predictors (and intercept), called a linear composite (LC)
[Figure: bivariate regression]
[Figures: multiple regression]
Variance Explained – R²

R² is simply the r² representing the proportion of variance in Y which is explained by Ŷ – the linear composite:

$$R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}} = \frac{SS_{\hat{Y}}}{SS_Y} = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$

a ratio reflecting the proportion of variance captured by our model relative to the overall variance in our data.

R² = .50 means 50% of the variance in Y is explained by the combination of X1, X2 … Xp.
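As an illustration of the ratio, a minimal sketch in Python, assuming y holds the observed scores and yhat the predictions from the linear composite (both hypothetical arrays):

import numpy as np

def r_squared(y, yhat):
    # SS_regression / SS_total, exactly as in the formula above
    ss_regression = np.sum((yhat - np.mean(y)) ** 2)
    ss_total = np.sum((y - np.mean(y)) ** 2)
    return ss_regression / ss_total

Note this equals R² when yhat comes from a least-squares fit that includes an intercept, which is the setting of these slides.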
R² vs r²
Significance of the Model

• R² tells us how important the model is
• the model can also be tested for statistical significance
• the test is conducted on R, the multiple correlation coefficient, against df = p, N − p − 1:

$$F = \frac{MS_{\text{regression}}}{MS_{\text{residual}}} = \frac{(N - p - 1)\,R^2}{p\,(1 - R^2)}$$
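A minimal sketch of this F-test in Python; plugging in the values from the pet-food example later in these slides (R² ≈ .85, p = 2 predictors, N = 15) reproduces the reported F of about 34:

from scipy import stats

r2, p, n = 0.85, 2, 15                    # R^2, number of predictors, sample size
f = ((n - p - 1) * r2) / (p * (1 - r2))   # F = (N - p - 1) R^2 / (p (1 - R^2))
p_value = stats.f.sf(f, p, n - p - 1)     # upper-tail probability, df = (p, N - p - 1)
print(f, p_value)                         # f = 34.0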
Importance of Individual Predictors

• r – simple correlation coefficient
• b – partial regression coefficient
• β – standardized partial regression coefficient
• pr – partial correlation coefficient
• sr – semi-partial correlation coefficient
r – simple correlation coefficient

• indicates the importance of a predictor in terms of its direct relationship with the criterion
• not very useful in Multiple Regression, as it does not take into account inter-correlations with the other predictors
b – Partial Regression Coefficient

• an indication of the importance of a predictor in terms of the model (not the data)
• scale-bound, so magnitudes can't be compared across predictors
• significance can however be compared – each b is tested by dividing it by its standard error to give a t-value:

$$t = \frac{b}{SE_b}$$
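A minimal sketch of that test in Python; the coefficient, standard error and sample sizes below are hypothetical stand-ins:

from scipy import stats

b, se_b, n, p = 0.074, 0.023, 15, 2             # hypothetical values
t = b / se_b                                    # t = b / SE_b
p_value = 2 * stats.t.sf(abs(t), df=n - p - 1)  # two-tailed, df = N - p - 1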
β – standardized partial regression coefficient

• an indication of the importance of a predictor in terms of the model (not the data)
• standardized (scale-free), so you can compare magnitudes
• the test of significance is the same as for b
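Since β is simply b re-expressed in standard-deviation units, it can be obtained by rescaling. A minimal sketch, where the predictor x and criterion y are hypothetical arrays:

import numpy as np

def standardized_beta(b, x, y):
    # beta = b * (SD of predictor / SD of criterion)
    return b * np.std(x, ddof=1) / np.std(y, ddof=1)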


pr – Partial Correlation Coefficient
sr – Semi-partial Correlation Coefficient
Unique, Shared and Total Variance
Assumptions of Multiple Regression

• scale: predictor and criterion scores are measured using a continuous scale (interval or ratio)
• normality: variables are normally distributed
• linearity: there is a straight-line relationship between predictors and criterion
• predictors are not multicollinear or singular (extremely highly correlated)
Assumptions of Multiple Regression

• Residuals
  - normality: the arrays of Y values are normally distributed around Ŷ (assumption of normality in arrays)
  - homoscedasticity: the variance of the Y values is constant across the full range of Ŷ values (assumption of homogeneity of variance in arrays)
  - linearity: straight-line relationship between Ŷ and the residuals (with mean = 0 and slope = 0)
  - independence: the residuals are uncorrelated
Multicollinearity and Singularity

• occurs when predictors are highly correlated (> .90)
• causes unstable calculation of the regression weights (b)
• diagnosed with inter-correlations, tolerance and VIF

$$\text{Tolerance} = 1 - R_x^2$$

• where R²x is the overlap between a particular predictor and all the other predictors
• values below .10 are considered problematic

Variance Inflation Factor (VIF) = 1 / Tolerance

• values above 4 are considered problematic
• the best solution is to remove or combine collinear predictors (a sketch of the diagnostics follows)
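A minimal sketch of these diagnostics in Python, assuming X is a NumPy array with one column per predictor (hypothetical data):

import numpy as np
import statsmodels.api as sm

def tolerance_and_vif(X):
    # Regress each predictor on all the others;
    # tolerance = 1 - R^2_x, VIF = 1 / tolerance
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2_x = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
        out.append((1 - r2_x, 1 / (1 - r2_x)))
    return out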


Outliers – Extreme Cases

• distort the solution and inflate the standard error
• univariate outliers
  - cases beyond 3 SD on any variable
• multivariate outliers, described in terms of:
  - leverage (h) – distance of a case from the group centroid along the line/plane of best fit
  - discrepancy – extent to which a case deviates from the line/plane of best fit
  - influence – combined effect of leverage and discrepancy: the effect of the outlier on the solution
[Figure: a multivariate outlier with high discrepancy and high influence]

[Figure: a multivariate outlier with high discrepancy but low influence]
Multivariate Outliers – Testing

Leverage
• the leverage statistic (h) varies from 0 to 1; values > .50 are problematic
• Mahalanobis distance = h × (n − 1), distributed as chi-square and tested as such (df = p, criterion p < .001)

Discrepancy
• not directly tested

Influence
• assesses the change in the solution when a case is removed
• Cook's distance: values > 1 are problematic
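A minimal sketch of obtaining these quantities in Python from a fitted statsmodels OLS result; the model object and sample size n are assumed to exist:

def outlier_diagnostics(model, n):
    infl = model.get_influence()
    leverage = infl.hat_matrix_diag       # h: flag values > .50
    mahalanobis = leverage * (n - 1)      # the slides' conversion: h × (n − 1)
    cooks_d, _ = infl.cooks_distance      # flag values > 1
    return leverage, mahalanobis, cooks_d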
Working Example

A marketing manager of a large supermarket chain wanted to determine the effect of shelf space and price on the sales of pet food. A random sample of 15 equal-sized shops was selected, and the sales, shelf space in square metres and price per kilogram were recorded.

1. What contribution do both shelf space and price make to the prediction of sales of pet food?
2. Which is a better predictor of sales of pet food?
3. Do a residual analysis.

The data file can be found in Work17.sav.
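For readers who want to reproduce this analysis outside SPSS, a minimal sketch in Python. It assumes Work17.sav holds the variables named sales, space and price used in these slides, and that the pyreadstat package is installed (pandas uses it to read .sav files):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("Work17.sav")                          # requires pyreadstat
model = smf.ols("sales ~ space + price", data=df).fit()  # Enter-style fit of both predictors
print(model.summary())                                   # R Square, F-test, coefficients, t-values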


Using SPSS

Graphs → Scatter/Dot → Matrix Scatter
Using SPSS
Graph

[Scatterplot matrix of the Work17.sav variables]
Using SPSS
Multiple Linear Regression: Starting the Procedure

• In the menu, click on Analyze
• Point to Regression
• Point to Linear… and click
Using SPSS
Multiple Linear Regression: Selecting Variables

Choose the variables for analysis from the list in the variable box. To select multiple variables, hold down the Ctrl key and choose the variables that you want.
Using SPSS
Multiple Linear Regression: Selecting Variables

Move shelf space (space) and price per kg (price), which are already highlighted, to the box labeled Independent(s) by clicking the arrow.

Move sales of pet food (sales) to the box labeled Dependent by clicking the arrow.
Using SPSS
Multiple Linear Regression: Requesting Statistics

Request descriptive statistics by clicking the button labeled Statistics…
Using SPSS
Multiple Linear Regression: Requesting Statistics

Statistics for Model fit and Estimates for the regression coefficients are produced by default. Click the checkbox for Descriptives; also click the checkbox for Durbin-Watson under Residuals. Then click the Continue button.
Using SPSS
Multiple Linear Regression: Standardized Residual Plots

You can also request several different plots. Click the Plots… button. In the box labeled Standardized Residual Plots, first click the checkbox for Histogram, then click the box for Normal probability plot. Click the Continue button.
Using SPSS
Multiple Linear Regression: Enter Method

The independent variables can be entered into the analysis using five different methods.

Enter Method: a procedure for variable selection in which all variables in a block are entered in a single step.
Using SPSS
Multiple Linear Regression: Enter Method

Enter is the default method of variable entry. Click the OK button to run the Multiple Linear Regression procedure.
Using SPSS
Multiple Linear Regression Output: Descriptive Statistics

[Descriptive statistics table]


Using SPSS
Multiple Linear Regression Output: Correlations
Using SPSS
Multiple Linear Regression Enter Method Output: Variables Entered
Using SPSS
Multiple Linear Regression Enter Method Output: Model Summary

The Model Summary table reports R (the multiple correlation coefficient), R Square (the coefficient of determination), the standard error of the estimate (the standard deviation around the regression line) and the Durbin-Watson statistic.
Using SPSS
Multiple Linear Regression Enter Method Output: Model Summary

Independence: the Durbin-Watson statistic. The D-W statistic is defined as

$$D = \frac{\sum_{t=2}^{N} (e_t - e_{t-1})^2}{\sum_{t=1}^{N} e_t^2}$$

Another way to look at the Durbin-Watson statistic is

$$D = 2(1 - \rho)$$

where ρ is the correlation between consecutive errors.
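A minimal sketch of the computation in Python (statsmodels also ships this as statsmodels.stats.stattools.durbin_watson); resid stands for the residuals of a fitted model:

import numpy as np

def durbin_watson(resid):
    # D = sum of squared successive differences / sum of squared residuals
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

Values near 2 indicate uncorrelated consecutive errors (ρ ≈ 0 in the formula above).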


Using SPSS
Multiple Linear Regression Enter Method Output: ANOVA

[ANOVA table – the measures of variation]
Using SPSS
Multiple Linear Regression Enter Method Output: Coefficients

Regression equation:

$$\hat{y}_i = 10.50\,x_1 + 0.057\,x_2 + 2.029$$
Using SPSS
Multiple Linear Regression Enter Method Output: Residuals Statistics
Using SPSS
Multiple Linear Regression Enter Method Output: Residuals Histogram

Normality: normality of residuals is required only for valid hypothesis testing; that is, the normality assumption assures that the p-values for the t-tests and the F-test will be valid. Normality is not required in order to obtain unbiased estimates of the regression coefficients.
Using SPSS
Multiple Linear Regression Enter Method Output: Plot of Standardized Residuals

Normality: a standardized normal probability (P-P) plot is sensitive to non-normality in the middle range of the data rather than in the tails.
Using SPSS
Multiple Linear Regression Enter Method Output: Interpretation of Output

1. What contribution do both shelf space and price make to the prediction of sales of pet food?

Together, the two independent variables (shelf space and price) explain 85 per cent of the variance (R Square) in sales of pet food, which is highly significant, as indicated by the F-value of 34.08.
Using SPSS
Multiple Linear Regression Enter Method Output: Interpretation of Output

2. Which of the two variables is a better predictor of sales of pet food?

An examination of the t-values and Beta values indicates that price contributes more to the prediction of sales. You can therefore say that price significantly predicts sales of pet food, with t = 3.22, p < .05, whereas the shelf space allocated is not a significant predictor.
