Multiple Linear Regression
(Multiple Regression Analysis)
What is MLR?
Multiple Linear Regression is a statistical
method for estimating the relationship
between a dependent variable and two or
more independent (or predictor) variables.
Purposes:
◦ Prediction
◦ Explanation
◦ Theory building
Y = β0 + β1X1 + β2X2 + … + βnXn + ε
where β0 = y-intercept (constant), β1 … βn = regression coefficients, and ε = error term
Design Requirements
One dependent variable (criterion)
Two or more independent variables
(predictor or explanatory variables).
Sample size: >= 50 (at least 10 times as
many cases as independent variables)
Variables
Dependent variable should be measured on a
continuous scale (interval or ratio variable).
Examples: revision time (hours), intelligence
(IQ score), weight (kg), etc.
Two or more independent variables, which
can be either continuous (interval or ratio variables)
or categorical (ordinal or nominal variables)
Ordinal variables include Likert items
Nominal variables include gender (2 groups:
male and female), ethnicity (3 groups:
Caucasian, African American and Hispanic)
Variations
Total variation in Y can be partitioned into two parts:
◦ Variation predictable from the combination of independent variables
◦ Unpredictable (residual) variation
MLR Model: Basic Assumptions
Independence
Normality
Homoscedasticity
Linearity
No or Little Multicollinearity
Independence
Independence of observations.
Assumption: values of the residuals (errors) are
independent across all observations
Check using the Durbin-Watson statistic
The Durbin-Watson statistic lies in the range
0-4. An acceptable range is 1.50 - 2.50.
If the Durbin-Watson statistic is low (less than
1.50), this indicates the presence of positive
autocorrelation
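(The same check can be run outside SPSS; below is a minimal Python/statsmodels sketch on hypothetical data — the variables and values are made up for illustration.)

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Hypothetical data: 100 cases, two predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()   # ordinary least squares fit
dw = durbin_watson(model.resid)               # Durbin-Watson statistic on the residuals
print(f"Durbin-Watson = {dw:.2f}")            # roughly 1.50 - 2.50 is acceptable
```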
Normality of Residuals
- Check that the residuals (errors) are
approximately normally distributed
- Two common methods to check
(a) Histogram (with a superimposed normal curve)
(b) a Normal Q-Q Plot.
Residuals should follow a normal distribution
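(A minimal Python sketch of the two checks, refitting the same hypothetical model used in the Durbin-Watson sketch above.)

```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Same hypothetical model as in the earlier sketch
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)
model = sm.OLS(y, sm.add_constant(X)).fit()
resid = model.resid

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# (a) Histogram of residuals with a superimposed normal curve
axes[0].hist(resid, bins=20, density=True)
grid = np.linspace(resid.min(), resid.max(), 200)
axes[0].plot(grid, stats.norm.pdf(grid, resid.mean(), resid.std()))
axes[0].set_title("Histogram of residuals")
# (b) Normal Q-Q plot: points should fall close to the diagonal line
stats.probplot(resid, dist="norm", plot=axes[1])
plt.show()
```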
Homoscedasticity
Homoscedasticity, or homogeneity of
variances, is an assumption of equal or
similar variances of residuals across all levels
of the independent variables
If the assumption holds, a plot of the residuals against
the fitted values shows a random scatter of
points without any discernible pattern.
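(A residuals-versus-fitted sketch in Python, again on the same hypothetical model as above.)

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Same hypothetical model as in the earlier sketches
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)
model = sm.OLS(y, sm.add_constant(X)).fit()

# Homoscedasticity check: residuals against fitted values should show a random scatter
plt.scatter(model.fittedvalues, model.resid, alpha=0.7)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```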
Linear relationship
Relationship between dependent variable and
independent variables must be linear
To check this in SPSS:
◦ Open the Plots dialog box: click the Plots button.
◦ Set up partial plots: check the "Produce all partial plots" option.
◦ Click OK to run the regression analysis and generate the plots.
If the relationship is linear, the data points in
each partial plot will be randomly dispersed
around a straight line
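(In Python, statsmodels can draw comparable partial regression plots; a sketch on the same hypothetical model as above — this is not the SPSS procedure itself.)

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Same hypothetical model as in the earlier sketches
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)
model = sm.OLS(y, sm.add_constant(X)).fit()

# One partial regression (added-variable) plot per predictor;
# roughly linear bands of points support the linearity assumption
fig = sm.graphics.plot_partregress_grid(model)
fig.tight_layout()
plt.show()
```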
Multicollinearity
When two or more independent variables are highly
correlated, they explain much of the same information.
The model is then unable to determine which
variable is actually responsible for a change
in the dependent variable
Correlation Matrix: The first step is often to
look at the correlation matrix for all
independent variables. High correlation
coefficients (typically above 0.7 or 0.8)
between two or more predictors indicate
potential multicollinearity.
Multicollinearity
Detect multicollinearity through an
inspection of correlation coefficients and
Tolerance/VIF (Variance Inflation Factor):
Tolerance < 0.1 (VIF > 10): significant
multicollinearity
Tolerance < 0.25 (VIF > 4): possible
multicollinearity
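(Tolerance and VIF can also be computed directly; a minimal Python/statsmodels sketch with hypothetical, somewhat correlated predictors and made-up column names.)

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical, somewhat correlated predictors
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=100)
X_df = pd.DataFrame({"calorie_intake": x1, "people_who_read": x2})  # hypothetical names

X_const = add_constant(X_df)          # VIF should be computed with an intercept included
for i, name in enumerate(X_const.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X_const.values, i)
    # VIF > 10 / Tolerance < 0.1 signals serious multicollinearity
    print(f"{name}: VIF = {vif:.2f}, Tolerance = {1 / vif:.2f}")
```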
Simple vs. Multiple Regression
Simple regression:
◦ One dependent variable Y predicted from one independent variable X
◦ One regression coefficient
◦ R2: proportion of variation in dependent variable Y predictable from X
Multiple regression:
◦ One dependent variable Y predicted from a set of independent variables (X1, X2, …, Xk)
◦ One regression coefficient for each independent variable
◦ R2: proportion of variation in dependent variable Y predictable by the set of independent variables (X's)
MLR Equation
Ypred = a + b1X1 + b2X2 + … + bnXn
Ypred = the dependent variable, or the variable to be predicted
X = the independent or predictor variables
a = a constant (the intercept on the Y axis: the value of Ypred when all X = 0)
b = weights, or partial regression coefficients: the relative contribution of each IV to the DV when controlling for the effects of the other predictors
MLR Output
R2, adjusted R2, constant, b coefficient, beta, F-
test, t-test
R2 assesses the strength of the overall
relationship between the dependent variable and the set of predictors
The adjusted R2 adjusts for the inflation in R2
caused by the number of variables in the
equation
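(For illustration, the same output quantities can be read off a Python/statsmodels fit on hypothetical data, as in the earlier sketches.)

```python
import numpy as np
import statsmodels.api as sm

# Same hypothetical data as in the earlier sketches
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.summary())                                 # constant, b coefficients with t-tests, overall F-test
print("R2 =", round(results.rsquared, 3))                # strength of the overall relationship
print("Adjusted R2 =", round(results.rsquared_adj, 3))   # adjusted for the number of predictors
```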
B coefficient (Regression Weights)
The regression weights (regression coefficients)
measure the amount of increase or decrease in the
dependent variable for a one-unit difference in the
independent variable
Example: Y = 2 + bX with b = 4, i.e., Y = 2 + 4X. For every one-unit
increase in X, Y increases by 4:
if X = 1, Y = 6
if X = 2, Y = 10
X increases by 1, Y increases by 4
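(The arithmetic of this example, spelled out in a few lines of Python.)

```python
a, b = 2, 4                                # Y = 2 + 4X
for x in (1, 2, 3):
    print(f"X = {x}, Y = {a + b * x}")     # 6, 10, 14: each unit increase in X raises Y by 4
```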
Different Ways of Building
Regression Models
Simultaneous: all independent variables
entered together
Stepwise: independent variables entered
according to some order
◦ By size of their correlation with the dependent variable
◦ In order of significance
Hierarchical: independent variables entered in
stages
F-test
The F-test is used as a general indicator of
the probability that any of the predictor
variables contribute to the variance in the
dependent variable within the population.
The null hypothesis is that the predictors’
weights are all effectively equal to zero.
t-test
t-test is used to test the significance of each
predictor in the equation.
The null hypothesis is that a predictor’s
weight is effectively equal to zero when the
effects of the other predictors are taken into
account.
That is, the predictor does not contribute to the variance in
the dependent variable within the population.
Example
Examine the factors that predict the length of
hospitalization following spinal surgery in children
(dependent continuous variable).
The available variables in the dataset are
hematocrit, estimated blood loss, cell saver,
operating time, age at surgery, and packed red
blood cells.
Dependent and independent variables are
measured on continuous scales
◦ Select appropriate variables (theory-based and statistical
approach), and determine the effect of estimated blood loss
while controlling for hematocrit, packed red blood cells, age
at surgery, cell saver, and operating time (duration of surgery).
SPSS: 1) Analyze, 2) Regression, 3) Linear
SPSS screen and output (screenshots; the key tables are reproduced in the following slides)
Interpreting Your SPSS Multiple
Regression Output
First let’s look at the zero-order (pairwise) correlations
between Average Female Life Expectancy (Y), Daily
Calorie Intake (X1) and People who Read (X2). Note that
these are .776 for Y with X1, .869 for Y with X2, and .682
for X1 with X2
Correlations (Pearson r; N = 74 for all variables; all correlations significant at p < .001, 1-tailed)

                                   Average female      Daily calorie   People who
                                   life expectancy     intake          read (%)
Average female life expectancy     1.000               .776            .869
Daily calorie intake               .776                1.000           .682
People who read (%)                .869                .682            1.000

(r YX1 = .776, r YX2 = .869, r X1X2 = .682)
Examining the Regression Weights
Coefficients (dependent variable: Average female life expectancy)

                        B        Std. Error   Beta    t       Sig.   95% CI for B      Zero-order   Partial   Part   Tolerance   VIF
(Constant)              25.838   2.882                8.964   .000   20.090 - 31.585
People who read (%)     .315     .034         .636    9.202   .000   .247 - .383       .869         .738      .465   .535        1.868
Daily calorie intake    .007     .001         .342    4.949   .000   .004 - .010       .776         .506      .250   .535        1.868
Regression of female life expectancy on daily calorie intake and
percentage of people who read.
The standardized beta for daily calorie intake is .342; the beta for people
who read is much larger, .636.
For every one standard deviation increase in the percentage of people who read,
Y (female life expectancy) is predicted to increase by .636 standard deviations.
Both beta coefficients are significant at p < .001
R, R Square, and the SEE
Model Summary (predictors: (Constant), People who read (%), Daily calorie intake)

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .905   .818       .813                4.948                        .818              159.922    2     71    .000
R is .905, which is a very high correlation.
R2 indicates that 81.8% of the variation in female life
expectancy is explained by the two predictors
F -Test for the Significance of the
Regression Equation
ANOVA (dependent variable: Average female life expectancy; predictors: (Constant), People who read (%), Daily calorie intake)

Model        Sum of Squares   df   Mean Square   F         Sig.
Regression   7829.451         2    3914.726      159.922   .000
Residual     1738.008         71   24.479
Total        9567.459         73
Unstandardized equation: Y = 25.838 + .007 X1 + .315 X2 (standardized form: ZY = .342 ZX1 + .636 ZX2), where X1 = daily calorie intake and X2 = percentage of people who read.
The F value is very large and significant at p < .001, indicating that this
linear regression model provides a better fit to the data than a
model that contains no independent variables.
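(As a check, F and R2 can be recomputed from the sums of squares reported in the ANOVA table above — plain arithmetic, nothing assumed beyond the table values.)

```python
ss_regression, df_regression = 7829.451, 2
ss_residual, df_residual = 1738.008, 71

ms_regression = ss_regression / df_regression               # 3914.726
ms_residual = ss_residual / df_residual                     # about 24.479
f_stat = ms_regression / ms_residual                        # about 159.92
r_squared = ss_regression / (ss_regression + ss_residual)   # about .818
print(f"F = {f_stat:.3f}, R^2 = {r_squared:.3f}")
```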
Multicollinearity, cont’d
As a rule of thumb, bivariate zero-order
correlations between predictors should not
exceed .80
The best prediction occurs when the predictors are
moderately independent of each other, but each is
highly correlated with the dependent (criterion)
variable Y
Multicollinearity Issues in our Current
SPSS Problem
The correlation between Daily Calorie Intake (X1) and People who Read (X2) is .682. This is a
fairly high correlation for two predictors that are to be interpreted
independently.
Note also that the correlation of average female life expectancy with the
percentage of people who read is quite high, .869 (see the Correlations table shown earlier).
Multicollinearity Issues in our Current
SPSS Problem, cont’d
In the case of our two predictors,
there is some indication of
multicollinearity but not enough to
throw out one of the variables
Thank you