Understanding Multiple Linear Regression

Multiple linear regression allows predicting the value of a dependent variable based on the values of two or more independent variables. It is used for prediction, explanation, and theory building. The key assumptions of multiple linear regression are independence of observations, normality, homoscedasticity, linearity, and little to no multicollinearity between independent variables. Multiple linear regression analysis produces outputs including R-squared, regression coefficients, F-tests, and t-tests that are used to evaluate the significance and relative contribution of each independent variable to the model.

Multiple Linear Regression

(Multiple Regression Analysis)


What is MLR?
 Multiple Linear Regression is a statistical
method for estimating the relationship
between a dependent variable and two or
more independent (or predictor) variables.
 Purposes:

◦ Prediction
◦ Explanation
◦ Theory building
 Y = β0 + β1X1 + β2X2 + … + βnXn + ε
◦ β0 = y-intercept (constant)
◦ β1 … βn = regression coefficients
◦ ε = error term
Design Requirements

 One dependent variable (criterion)
 Two or more independent variables (predictor or explanatory variables)
 Sample size: at least 50 cases, with at least 10 times as many cases as independent variables


Variables
 Dependent variable should be measured on a continuous scale (interval or ratio variable). Examples: revision time (hours), intelligence (IQ score), weight (kg), etc.
 Two or more independent variables, which can be either continuous (interval or ratio) or categorical (ordinal or nominal)
 Ordinal variables include Likert items
 Nominal variables include gender (2 groups: male and female) and ethnicity (3 groups: Caucasian, African American, and Hispanic)
Variations
 Total variation in Y = variation predictable from the combination of independent variables + unpredictable (residual) variation
MLR Model: Basic Assumptions
 Independence
 Normality
 Homoscedasticity
 Linearity
 No or Little Multicollinearity
Independence
 Independence of observations: the assumption that the values of the residuals (errors) are independent across all observations
 Check using the Durbin-Watson statistic
 The Durbin-Watson statistic lies in the range 0 to 4; an acceptable range is 1.50 to 2.50
 A Durbin-Watson value below 1.50 indicates the presence of positive autocorrelation
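
As an illustration (not part of the original slides), here is a minimal Python sketch of the Durbin-Watson check using statsmodels; the data and variable names (x1, x2, y) are made up for demonstration:

  # Simulate a small dataset, fit an ordinary least squares (OLS) model,
  # and compute the Durbin-Watson statistic on its residuals.
  import numpy as np
  import statsmodels.api as sm
  from statsmodels.stats.stattools import durbin_watson

  rng = np.random.default_rng(42)
  n = 100
  x1 = rng.normal(size=n)                        # first predictor (hypothetical)
  x2 = rng.normal(size=n)                        # second predictor (hypothetical)
  y = 2.0 + 4.0 * x1 + 1.5 * x2 + rng.normal(scale=2.0, size=n)

  X = sm.add_constant(np.column_stack([x1, x2])) # add the intercept column
  results = sm.OLS(y, X).fit()

  dw = durbin_watson(results.resid)              # statistic lies between 0 and 4
  print(f"Durbin-Watson = {dw:.2f}")             # roughly 1.50 to 2.50 is acceptable
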
Normality of Residuals
 Check that the residuals (errors) are approximately normally distributed
 Two common methods to check:
◦ (a) Histogram (with a superimposed normal curve)
◦ (b) Normal Q-Q plot
 Residuals should follow a normal distribution
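
A quick sketch of both checks in Python, reusing the fitted results object from the Durbin-Watson example above (matplotlib and scipy assumed available):

  # Histogram and normal Q-Q plot of the model residuals
  import matplotlib.pyplot as plt
  from scipy import stats

  resid = results.resid
  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
  ax1.hist(resid, bins=20)                      # (a) histogram of residuals
  ax1.set_title("Histogram of residuals")
  stats.probplot(resid, dist="norm", plot=ax2)  # (b) normal Q-Q plot
  ax2.set_title("Normal Q-Q plot")
  plt.tight_layout()
  plt.show()
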
Homoscedasticity
 Homoscedasticity, or homogeneity of variances, is the assumption that the residuals have equal or similar variance across all levels of the independent variables
 Check: a plot of the residuals against the predicted values should show a random scatter of points without any discernible pattern
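
A minimal sketch of this residual plot, again reusing the fitted results object from the earlier example:

  # Residuals vs. predicted values: should look like random scatter around zero
  import matplotlib.pyplot as plt

  plt.scatter(results.fittedvalues, results.resid, s=15)
  plt.axhline(0, linestyle="--")
  plt.xlabel("Predicted values")
  plt.ylabel("Residuals")
  plt.title("Check for homoscedasticity")
  plt.show()
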


Linear relationship
 The relationship between the dependent variable and the independent variables must be linear
 To check in SPSS with partial regression plots:
◦ Open the Plots dialog box: click the Plots button
◦ Make sure the "Produce all partial plots" option is checked
◦ Click OK to run the regression analysis and generate the plots
 If the relationship is linear, the data points in each partial plot will be randomly dispersed around a straight line
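
The slide describes the SPSS procedure; a roughly equivalent check in Python is statsmodels' partial regression plot grid (a sketch, reusing the fitted results object from the earlier example):

  # Partial regression plots, one per predictor (analogous to SPSS partial plots)
  import matplotlib.pyplot as plt
  import statsmodels.api as sm

  fig = sm.graphics.plot_partregress_grid(results)
  fig.tight_layout()
  plt.show()
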
Multicollinearity
 When two or more independent variables are highly correlated, they explain much of the same information
 The model then cannot tell which variable is actually responsible for a change in the dependent variable
 Correlation matrix: the first step is often to look at the correlation matrix for all independent variables. High correlation coefficients (typically above 0.7 or 0.8) between two or more predictors indicate potential multicollinearity.
Multicollinearity
 Detect multicollinearity through an inspection of correlation coefficients and Tolerance/VIF (Variance Inflation Factor)
 Tolerance < 0.1 and VIF > 10 → significant multicollinearity
 Tolerance < 0.25 and VIF > 4 → possible multicollinearity
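
A sketch of the Tolerance/VIF calculation in Python, using the design matrix X (with its intercept column) from the earlier example:

  # Variance Inflation Factor (VIF) and Tolerance for each predictor
  from statsmodels.stats.outliers_influence import variance_inflation_factor

  # Column 0 is the intercept added by sm.add_constant(), so skip it
  for i in range(1, X.shape[1]):
      vif = variance_inflation_factor(X, i)
      print(f"Predictor {i}: VIF = {vif:.2f}, Tolerance = {1.0 / vif:.2f}")
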
Simple vs. Multiple Regression

Simple regression:
 One dependent variable Y predicted from one independent variable X
 One regression coefficient
 R2: proportion of variation in dependent variable Y predictable from X

Multiple regression:
 One dependent variable Y predicted from a set of independent variables (X1, X2 … Xk)
 One regression coefficient for each independent variable
 R2: proportion of variation in dependent variable Y predictable from the set of independent variables (X's)
MLR Equation
 Ypred = a + b1X1 + b2X2 + … + bnXn
 Ypred = the dependent variable, or the variable to be predicted
 X = the independent or predictor variables
 a = a constant (the Y intercept: the value of Y when all X = 0)
 b = weights, or partial regression coefficients
◦ the relative contribution of each IV to the DV when controlling for the effects of the other predictors
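
A sketch of how the constant a and the partial regression coefficients b can be read off the fitted model from the earlier example (hypothetical data):

  # Intercept (a) and partial regression coefficients (b1, b2) from the fit
  a, b1, b2 = results.params             # order matches the columns of X
  print(f"Ypred = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")

  # Predicted value for a new observation, e.g. X1 = 1.0 and X2 = 0.5
  y_pred = a + b1 * 1.0 + b2 * 0.5
  print(f"Predicted Y = {y_pred:.2f}")
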
MLR Output
 R2, adjusted R2, constant, b coefficients, betas, F-test, t-tests
 R2 assesses the strength of the overall relationship between the dependent variable and the set of predictors
 The adjusted R2 corrects for the inflation in R2 caused by the number of variables in the equation
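 For reference, adjusted R2 = 1 - (1 - R2)(n - 1) / (n - k - 1), where n is the sample size and k is the number of predictors; in the Python sketch above these values are available directly as results.rsquared and results.rsquared_adj.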
B coefficient (Regression Weights)
 The regression weights, or regression coefficients, measure the amount of increase or decrease in the dependent variable for a one-unit difference in the independent variable
 Y = 2 + (b) X with b = 4: Y = 2 + (4) X → for every unit change in X, Y increases by 4
◦ if X = 1, Y = 6
◦ if X = 2, Y = 10
◦ X increases by 1 → Y increases by 4
Different Ways of Building Regression Models
 Simultaneous: all independent variables entered together
 Stepwise: independent variables entered according to some order
◦ By size of correlation with the dependent variable
◦ In order of significance
 Hierarchical: independent variables entered in stages
F-test
 The F-test is used as a general indicator of the probability that any of the predictor variables contribute to the variance in the dependent variable within the population
 The null hypothesis is that the predictors' weights are all effectively equal to zero
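 In the Python sketch above, the overall F statistic and its p-value are available as results.fvalue and results.f_pvalue.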
t-test
 The t-test is used to test the significance of each predictor in the equation
 The null hypothesis is that a predictor's weight is effectively equal to zero when the effects of the other predictors are taken into account
◦ That is, the predictor does not contribute to the variance in the dependent variable within the population
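 In the Python sketch above, the per-predictor t statistics and p-values are available as results.tvalues and results.pvalues.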
Example
 Examine the factors that predict the length of hospitalization following spinal surgery in children (dependent continuous variable)
 The available variables in the dataset are hematocrit, estimated blood loss, cell saver, operating time, age at surgery, and packed red blood cells
 Dependent and independent variables are measured on continuous scales
◦ Select appropriate variables (theory-based and statistical approach), and determine the effect of estimated blood loss while controlling for hematocrit, packed red blood cells, age at surgery, cell saver, and operating time (duration of surgery)
SPSS: 1) Analyze, 2) Regression, 3) Linear
SPSS Screen [screenshot]
SPSS Output [screenshot]
SPSS Output [screenshot]
Interpreting Your SPSS Multiple Regression Output
 First let's look at the zero-order (pairwise) correlations between Average Female Life Expectancy (Y), Daily Calorie Intake (X1), and People who Read (X2). Note that these are .776 for Y with X1, .869 for Y with X2, and .682 for X1 with X2.
Correlations (Pearson r; N = 74; all Sig. (1-tailed) = .000)

                                   Average female     Daily calorie    People who
                                   life expectancy    intake           read (%)
Average female life expectancy     1.000              .776             .869
Daily calorie intake (X1)          .776               1.000            .682
People who read (%) (X2)           .869               .682             1.000

(r YX1 = .776, r YX2 = .869, r X1X2 = .682)
Examining the Regression Weights
Coefficients (Dependent Variable: Average female life expectancy)

Model                    B        Std. Error   Beta    t       Sig.    95% CI for B        Zero-order   Partial   Part    Tolerance   VIF
1 (Constant)             25.838   2.882                8.964   .000    20.090 to 31.585
  People who read (%)    .315     .034         .636    9.202   .000    .247 to .383        .869         .738      .465    .535        1.868
  Daily calorie intake   .007     .001         .342    4.949   .000    .004 to .010        .776         .506      .250    .535        1.868

Regression of female life expectancy on daily calorie intake and percentage of people who read:
 The standardized beta for daily calorie intake is .342; the beta for people who read is much larger, .636.
 For every one standard deviation increase in the percentage of people who read, female life expectancy is predicted to increase by .636 standard deviations (holding daily calorie intake constant).
 Both beta coefficients are significant at p < .001.
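 (If needed, standardized betas can be reproduced from the unstandardized weights as beta = b × SD(X) / SD(Y), or by refitting the model on z-scored variables.)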
R, R Square, and the SEE
Model Summary (Predictors: (Constant), People who read (%), Daily calorie intake)

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .905   .818       .813                4.948                        .818              159.922    2     71    .000

R is .905, which is a very high correlation.
R2 indicates that 81.8% of the variation in female life expectancy is explained by the two predictors.
F-Test for the Significance of the Regression Equation
ANOVA (Predictors: (Constant), People who read (%), Daily calorie intake; Dependent Variable: Average female life expectancy)

Model          Sum of Squares   df   Mean Square   F         Sig.
1 Regression   7829.451         2    3914.726      159.922   .000
  Residual     1738.008         71   24.479
  Total        9567.459         73

Unstandardized prediction equation: Ypred = 25.838 + .007 (daily calorie intake) + .315 (people who read %); in standardized form, Zy = .342 Zx1 + .636 Zx2.

The F value is very large and significant at p < .001, indicating that this linear regression model provides a better fit to the data than a model that contains no independent variables.
Multicollinearity, cont’d
 As a rule of thumb, bivariate zero-order
correlations between predictors should not
exceed .80
 The best prediction occurs when the predictors are
moderately independent of each other, but each is
highly correlated with the dependent (criterion)
variable Y
Multicollinearity Issues in our Current SPSS Problem
 The correlation between Daily Calorie Intake (X1) and People who Read (X2) is .682. This is a fairly high correlation for two predictors that are to be interpreted independently.
 The correlation between average female life expectancy and the percentage of people who read is quite high, .869.

(Correlation matrix repeated from the earlier slide: r YX1 = .776, r YX2 = .869, r X1X2 = .682; N = 74, all Sig. (1-tailed) = .000)
Multicollinearity Issues in our Current SPSS Problem, cont'd
 In the case of our two predictors, there is some indication of multicollinearity, but not enough to throw out one of the variables.
Thank you
