
Multiple Regression Analysis
Multiple Regression
Cautions About Linear Regression

• Correlation and regression describe only linear relations.
• Correlation and the least-squares regression line are not resistant to outliers.
• Predictions outside the range of the observed data are often inaccurate.
• The relationship between two variables is often influenced by lurking variables not included in our model.
General Principle of Data Analysis

• Plot your data: to understand the data, always start with a series of graphs.
• Interpret what you see: look for the overall pattern and deviations from that pattern.
• Numerical summary? Choose an appropriate measure to describe the pattern and deviations.
• Mathematical model? If the pattern is regular, summarize the data in a compact mathematical model.
Analysis of Two Quantitative Variables

• Plot your data: for two quantitative variables, use a scatterplot.
• Interpret what you see: describe the direction, form and strength of the relationship.
• Numerical summary? If the pattern is roughly linear, summarize with the correlation, means and standard deviations.
• Mathematical model? If the relationship is roughly linear, regression gives a compact model of the overall pattern.
Analysis of Three or More Quantitative Variables

• Plot your data: to examine relationships among all possible pairs, use a scatterplot matrix.
• Interpret what you see: describe the direction, form and strength of the relationships.
• Numerical summary? If the patterns are roughly linear, summarize with correlations, means and standard deviations.
• Mathematical model? Multiple regression gives a compact model of the relationship between the response variable and a set of predictors.
Multiple Regression

• Can we predict job performance (Y) from overall school achievement (X1) and IQ scores (X2)?
  - How much variance in Y is explained by X1 and X2 in combination?
  - How important is each predictor of job performance?
• Two kinds of research questions in Multiple Regression:
  - Is the model significant and important?
  - Are the individual predictors significant and important?
The Structural Model

$$Y = c + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p + e$$

Any dependent variable score Y is predicted according to:
• c – an intercept on the Y axis, plus
• b1X1 – a weighted effect of predictor X1
• b2X2 – a weighted effect of predictor X2
• bpXp – a weighted effect of predictor Xp
• e – error
The Structural Model

$$Y = c + b_1 X_1 + b_2 X_2 + \ldots + b_p X_p + e$$

DATA = MODEL + RESIDUAL
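To make DATA = MODEL + RESIDUAL concrete, here is a minimal sketch of fitting the structural model by ordinary least squares in Python. This is not part of the original SPSS workflow; the simulated data and the names x1, x2 and y are illustrative only.

import numpy as np
import statsmodels.api as sm

# Simulated data: a criterion y and two predictors x1, x2 (illustrative only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 + 1.5 * x1 + 0.8 * x2 + rng.normal(size=50)  # Y = c + b1*X1 + b2*X2 + e

X = sm.add_constant(np.column_stack([x1, x2]))  # prepend the intercept column c
model = sm.OLS(y, X).fit()                      # least-squares estimates of c, b1, b2
print(model.params)                             # [c, b1, b2]

Here model.fittedvalues is the MODEL part and model.resid the RESIDUAL part, so the identity above can be checked directly against y.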


The Regression Plane – Two Predictors (3D space)
Unstandardized Partial Regression Coefficients – b

• Ŷ is calculated according to the Least Squares Criterion (LSC)
• solved by finding the set of weights (b) that minimises the errors of prediction (around the plane)
  - b1 indicates the change in Y given a unit change in X1, holding X2 … Xp constant
  - when standardised, it indicates the SD change in Y given an SD change in X, and is denoted by β
• c is the Y intercept
• Ŷ is therefore a weighted combination of the predictors (and intercept), called a linear composite (LC)
[Figure: bivariate regression]
[Figures: multiple regression]
Variance Explained – R²

R² is simply the r² representing the proportion of variance in Y which is explained by Ŷ – the linear composite:

$$R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}} = \frac{SS_{\hat{Y}}}{SS_Y} = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$

a ratio reflecting the proportion of variance captured by our model relative to the overall variance in our data.

R² = .50 means 50% of the variance in Y is explained by the combination of X1, X2 … Xp.
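As an illustration of the ratio, a minimal sketch in Python, assuming y holds the observed scores and yhat the predictions from the linear composite (both hypothetical arrays):

import numpy as np

def r_squared(y, yhat):
    # SS_regression / SS_total, exactly as in the formula above
    ss_regression = np.sum((yhat - np.mean(y)) ** 2)
    ss_total = np.sum((y - np.mean(y)) ** 2)
    return ss_regression / ss_total

Note this equals R² when yhat comes from a least-squares fit that includes an intercept, which is the setting of these slides.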
R² vs r²
Significance of the Model

• R² tells us how important the model is
• the model can also be tested for statistical significance
• the test is conducted on R, the multiple correlation coefficient, against df = p, N − p − 1:

$$F = \frac{MS_{\text{regression}}}{MS_{\text{residual}}} = \frac{(N - p - 1)\,R^2}{p\,(1 - R^2)}$$
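A minimal sketch of this F-test in Python; plugging in the values from the pet-food example later in these slides (R² ≈ .85, p = 2 predictors, N = 15) reproduces the reported F of about 34:

from scipy import stats

r2, p, n = 0.85, 2, 15                    # R^2, number of predictors, sample size
f = ((n - p - 1) * r2) / (p * (1 - r2))   # F = (N - p - 1) R^2 / (p (1 - R^2))
p_value = stats.f.sf(f, p, n - p - 1)     # upper-tail probability, df = (p, N - p - 1)
print(f, p_value)                         # f = 34.0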
Importance of Individual Predictors

• r – simple correlation coefficient
• b – partial regression coefficient
• β – standardized partial regression coefficient
• pr – partial correlation coefficient
• sr – semi-partial correlation coefficient
r – simple correlation coefficient

• indicates the importance of a predictor in terms of its direct relationship with the criterion
• not very useful in Multiple Regression, as it does not take into account inter-correlations with the other predictors
b – Partial Regression Coefficient

• an indication of the importance of a predictor in terms of the model (not the data)
• scale-bound, so magnitudes can't be compared across predictors
• significance can however be compared – each b is tested by dividing it by its standard error to give a t-value:

$$t = \frac{b}{SE_b}$$
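A minimal sketch of that test in Python; the coefficient, standard error and sample sizes below are hypothetical stand-ins:

from scipy import stats

b, se_b, n, p = 0.074, 0.023, 15, 2             # hypothetical values
t = b / se_b                                    # t = b / SE_b
p_value = 2 * stats.t.sf(abs(t), df=n - p - 1)  # two-tailed, df = N - p - 1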
β – standardized partial regression coefficient

• an indication of the importance of a predictor in terms of the model (not the data)
• standardized (scale-free), so you can compare magnitudes
• the test of significance is the same as for b
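Since β is simply b re-expressed in standard-deviation units, it can be obtained by rescaling. A minimal sketch, where the predictor x and criterion y are hypothetical arrays:

import numpy as np

def standardized_beta(b, x, y):
    # beta = b * (SD of predictor / SD of criterion)
    return b * np.std(x, ddof=1) / np.std(y, ddof=1)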


pr – Partial Correlation Coefficient
sr – Semi-partial Correlation Coefficient
Unique, Shared and Total Variance
Assumptions of Multiple Regression

• scale: predictor and criterion scores are measured using a continuous scale (interval or ratio)
• normality: variables are normally distributed
• linearity: there is a straight-line relationship between predictors and criterion
• predictors are not multicollinear or singular (extremely highly correlated)
Assumptions of Multiple Regression

• Residuals
  - normality: the arrays of Y values are normally distributed around Ŷ (assumption of normality in arrays)
  - homoscedasticity: the variance of the Y values is constant across the full range of Ŷ values (assumption of homogeneity of variance in arrays)
  - linearity: straight-line relationship between Ŷ and the residuals (with mean = 0 and slope = 0)
  - independence: the residuals are uncorrelated
Multicollinearity and Singularity

• occurs when predictors are highly correlated (> .90)
• causes unstable calculation of the regression weights (b)
• diagnosed with inter-correlations, tolerance and VIF

$$\text{Tolerance} = 1 - R_x^2$$

• where R²x is the overlap between a particular predictor and all the other predictors
• values below .10 are considered problematic

Variance Inflation Factor (VIF) = 1 / Tolerance

• values above 4 are considered problematic
• the best solution is to remove or combine collinear predictors (a sketch of the diagnostics follows)
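A minimal sketch of these diagnostics in Python, assuming X is a NumPy array with one column per predictor (hypothetical data):

import numpy as np
import statsmodels.api as sm

def tolerance_and_vif(X):
    # Regress each predictor on all the others;
    # tolerance = 1 - R^2_x, VIF = 1 / tolerance
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2_x = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
        out.append((1 - r2_x, 1 / (1 - r2_x)))
    return out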


Outliers – Extreme Cases

• distort the solution and inflate the standard error
• univariate outliers
  - cases beyond 3 SD on any variable
• multivariate outliers, described in terms of:
  - leverage (h) – distance of a case from the group centroid along the line/plane of best fit
  - discrepancy – extent to which a case deviates from the line/plane of best fit
  - influence – combined effect of leverage and discrepancy: the effect of the outlier on the solution
[Figure: a multivariate outlier with high discrepancy and high influence]

[Figure: a multivariate outlier with high discrepancy but low influence]
Multivariate Outliers – Testing

Leverage
• the leverage statistic (h) varies from 0 to 1; values > .50 are problematic
• Mahalanobis distance = h × (n − 1), distributed as chi-square and tested as such (df = p, criterion p < .001)

Discrepancy
• not directly tested

Influence
• assesses the change in the solution when a case is removed
• Cook's distance: values > 1 are problematic
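A minimal sketch of obtaining these quantities in Python from a fitted statsmodels OLS result; the model object and sample size n are assumed to exist:

def outlier_diagnostics(model, n):
    infl = model.get_influence()
    leverage = infl.hat_matrix_diag       # h: flag values > .50
    mahalanobis = leverage * (n - 1)      # the slides' conversion: h × (n − 1)
    cooks_d, _ = infl.cooks_distance      # flag values > 1
    return leverage, mahalanobis, cooks_d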
Working Example

A marketing manager of a large supermarket chain wanted to determine the effect of shelf space and price on the sales of pet food. A random sample of 15 equal-sized shops was selected, and the sales, shelf space in square metres and price per kilogram were recorded.

1. What contribution do both shelf space and price make to the prediction of sales of pet food?
2. Which is a better predictor of sales of pet food?
3. Do a residual analysis.

The data file can be found in Work17.sav.
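For readers who want to reproduce this analysis outside SPSS, a minimal sketch in Python. It assumes Work17.sav holds the variables named sales, space and price used in these slides, and that the pyreadstat package is installed (pandas uses it to read .sav files):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("Work17.sav")                          # requires pyreadstat
model = smf.ols("sales ~ space + price", data=df).fit()  # Enter-style fit of both predictors
print(model.summary())                                   # R Square, F-test, coefficients, t-values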


Using SPSS

Graphs → Scatter/Dot → Matrix Scatter
Using SPSS
Graph

[Scatterplot matrix of the Work17.sav variables]
Using SPSS
Multiple Linear Regression: Starting the Procedure

• In the menu, click on Analyze
• Point to Regression
• Point to Linear… and click
Using SPSS
Multiple Linear Regression: Selecting Variables

Choose the variables for analysis from the list in the variable box. To select multiple variables, hold down the Ctrl key and choose the variables that you want.
Using SPSS
Multiple Linear Regression: Selecting Variables

Move shelf space (space) and price per kg (price), which are already highlighted, to the box labeled Independent(s) by clicking the arrow.

Move sales of pet food (sales) to the box labeled Dependent by clicking the arrow.
Using SPSS
Multiple Linear Regression: Requesting Statistics

Request descriptive statistics by clicking the button labeled Statistics…
Using SPSS
Multiple Linear Regression: Requesting Statistics

Statistics for Model fit and Estimates for the regression coefficients are produced by default. Click the checkbox for Descriptives; also click the checkbox for Durbin-Watson under Residuals. Then click the Continue button.
Using SPSS
Multiple Linear Regression: Standardized Residual Plots

You can also request several different plots. Click the Plots… button. In the box labeled Standardized Residual Plots, first click the checkbox for Histogram, then click the box for Normal probability plot. Click the Continue button.
Using SPSS
Multiple Linear Regression: Enter Method

The independent variables can be entered into the analysis using five different methods.

Enter Method: a procedure for variable selection in which all variables in a block are entered in a single step.
Using SPSS
Multiple Linear Regression: Enter Method

Enter is the default method of variable entry. Click the OK button to run the Multiple Linear Regression procedure.
Using SPSS
Multiple Linear Regression Output: Descriptive Statistics

[Descriptive statistics table]


Using SPSS
Multiple Linear Regression Output: Correlations
Using SPSS
Multiple Linear Regression Enter Method Output: Variables Entered
Using SPSS
Multiple Linear Regression Enter Method Output: Model Summary

The Model Summary table reports R (the multiple correlation coefficient), R Square (the coefficient of determination), the standard error of the estimate (the standard deviation around the regression line) and the Durbin-Watson statistic.
Using SPSS
Multiple Linear Regression Enter Method Output: Model Summary

Independence: the Durbin-Watson statistic. The D-W statistic is defined as

$$D = \frac{\sum_{t=2}^{N} (e_t - e_{t-1})^2}{\sum_{t=1}^{N} e_t^2}$$

Another way to look at the Durbin-Watson statistic is

$$D = 2(1 - \rho)$$

where ρ is the correlation between consecutive errors.
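A minimal sketch of the computation in Python (statsmodels also ships this as statsmodels.stats.stattools.durbin_watson); resid stands for the residuals of a fitted model:

import numpy as np

def durbin_watson(resid):
    # D = sum of squared successive differences / sum of squared residuals
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

Values near 2 indicate uncorrelated consecutive errors (ρ ≈ 0 in the formula above).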


Using SPSS
Multiple Linear Regression Enter Method Output: ANOVA

[ANOVA table – the measures of variation]
Using SPSS
Multiple Linear Regression Enter Method Output: Coefficients

Regression equation:

$$\hat{y}_i = 10.50\,x_1 + 0.057\,x_2 + 2.029$$
Using SPSS
Multiple Linear Regression Enter Method Output: Residuals Statistics
Using SPSS
Multiple Linear Regression Enter Method Output: Residuals Histogram

Normality: normality of residuals is required only for valid hypothesis testing; that is, the normality assumption assures that the p-values for the t-tests and the F-test will be valid. Normality is not required in order to obtain unbiased estimates of the regression coefficients.
Using SPSS
Multiple Linear Regression Enter Method Output: Plot of Standardized Residuals

Normality: a standardized normal probability (P-P) plot is sensitive to non-normality in the middle range of the data rather than in the tails.
Using SPSS
Multiple Linear Regression Enter Method Output: Interpretation of Output

1. What contribution do both shelf space and price make to the prediction of sales of pet food?

Together, the two independent variables (shelf space and price) explain 85 per cent of the variance (R Square) in sales of pet food, which is highly significant, as indicated by the F-value of 34.08.
Using SPSS
Multiple Linear Regression Enter Method Output: Interpretation of Output

2. Which of the two variables is a better predictor of sales of pet food?

An examination of the t-values and Beta values indicates that price contributes more to the prediction of sales. You can therefore say that price significantly predicts sales of pet food, with t = 3.22, p < .05, whereas the shelf space allocated is not a significant predictor.
