
Multiple Linear Regression Analysis

Introduction
Multiple regression is an extension of simple linear regression. It is used when we want to
predict the value of a variable based on the value of two or more other variables. The variable
we want to predict is called the dependent variable (predicted). The variables we are using
to predict the value of the dependent variable are called the independent variables (or
sometimes, the predictor, explanatory or regressor variables).
For example, you could use multiple regression to understand whether exam performance
can be predicted based on revision time, test anxiety, lecture attendance and gender.
Alternately, you could use multiple regression to understand whether daily cigarette
consumption can be predicted based on smoking duration, age when started smoking,
smoker type, income and gender.
Multiple regression also allows you to determine the overall fit (variance explained) of the
model and the relative contribution of each of the predictors to the total variance explained.
For example, you might want to know how much of the variation in exam performance can
be explained by revision time, test anxiety, lecture attendance and gender "as a whole", but
also the "relative contribution" of each independent variable in explaining the variance.
However, before we introduce you to this procedure, you need to understand the different
assumptions that your data must meet in order for multiple regression to give you a valid
result. We discuss these assumptions next.

SPSS Statistics

Assumptions
When you choose to analyse your data using multiple regression, part of the process involves
checking to make sure that the data you want to analyse can actually be analysed using
multiple regression. You need to do this because it is only appropriate to use multiple
regression if your data "passes" eight assumptions that are required for multiple regression
to give you a valid result.
Before we introduce you to these eight assumptions, do not be surprised if, when analysing
your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not
met). This is not uncommon when working with real-world data rather than textbook
examples, which often only show you how to carry out multiple regression when everything
goes well! However, don't worry. Even when your data fails certain assumptions, there is
often a solution to overcome this. First, let's take a look at these eight assumptions:
o Assumption #1: Your dependent variable should be measured on a continuous scale
(i.e., it is either an interval or ratio variable). Examples of variables that meet this
criterion include revision time (measured in hours), intelligence (measured using IQ
score), exam performance (measured from 0 to 100), weight (measured in kg), and so
forth. If your dependent variable was measured on an ordinal scale, you will need to
carry out ordinal regression rather than multiple regression. Examples of ordinal
variables include Likert items (e.g., a 7-point scale from "strongly agree" through to
"strongly disagree"), amongst other ways of ranking categories (e.g., a 3-point scale
explaining how much a customer liked a product, ranging from "Not very much" to "Yes,
a lot").
o Assumption #2: You have two or more independent variables, which can be
either continuous (i.e., an interval or ratio variable) or categorical (i.e.,
an ordinal or nominal variable).
o Assumption #3: You should have independence of observations (i.e., independence
of residuals), which you can easily check using the Durbin-Watson statistic.
o Assumption #4: There needs to be a linear relationship between (a) the dependent
variable and each of your independent variables, and (b) the dependent variable and the
independent variables collectively. If the relationships displayed in your scatterplots and
partial regression plots are not linear, you will have to either run a non-linear regression
analysis or "transform" your data.
o Assumption #5: Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move along the line.
o Assumption #6: Your data must not show multicollinearity, which occurs when you
have two or more independent variables that are highly correlated with each other. This
leads to problems with understanding which independent variable contributes to the
variance explained in the dependent variable, as well as technical issues in calculating a
multiple regression model. You can detect multicollinearity by inspecting the
Tolerance/VIF values.
o Assumption #7: There should be no significant outliers, high leverage
points or highly influential points. Outliers, leverage and influential points are different
terms used to represent observations in your data set that are in some way unusual when
you wish to perform a multiple regression analysis. However, all these points can have a
very negative effect on the regression equation that is used to predict the value of the
dependent variable based on the independent variables, which can reduce the predictive
accuracy of your results. You can detect outliers using "casewise diagnostics" and
"studentized deleted residuals".
o Assumption #8: Finally, you need to check that the residuals
(errors) are approximately normally distributed. Two common methods to check this
assumption include using: (a) a histogram (with a superimposed normal curve) and a
Normal P-P Plot; or (b) a Normal Q-Q Plot of the studentized residuals.
You can check assumptions #3, #4, #5, #6, #7 and #8 using SPSS Statistics. Assumptions
#1 and #2 should be checked first, before moving onto assumptions #3, #4, #5, #6, #7
and #8.
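If you also want to verify some of these checks programmatically rather than through the SPSS dialogs, the sketch below shows how assumption #3 (Durbin-Watson) and assumption #6 (Tolerance/VIF) could be computed in Python with statsmodels. The file name and variable names are hypothetical placeholders based on the exam-performance example above, not real data.

```python
# Minimal sketch of assumption checks #3 and #6 outside SPSS, using statsmodels.
# The data file and column names below are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("exam_data.csv")                       # hypothetical file
y = df["exam_performance"]                              # dependent variable
X = sm.add_constant(df[["revision_time", "test_anxiety", "lecture_attendance", "gender"]])

model = sm.OLS(y, X).fit()

# Assumption #3: independence of residuals. Values close to 2 suggest no
# first-order autocorrelation among the residuals.
print("Durbin-Watson:", round(durbin_watson(model.resid), 3))

# Assumption #6: multicollinearity. VIF = 1 / Tolerance; VIF values well above
# 10 (Tolerance below 0.1) are commonly treated as problematic.
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, "VIF =", round(variance_inflation_factor(X.values, i), 2))
```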

Multiple Linear Regression Example
• Multiple Regression - Example
• Data Checks and Descriptive Statistics
• SPSS Regression Dialogs
• SPSS Multiple Regression Output
• Multiple Regression Assumptions

Multiple Regression - Example


A scientist wants to know if and how health care costs can be predicted from several patient
characteristics. All data are in health-costs.sav as shown below.

The dependent variable is health care costs (in US dollars) declared over 2020 or “costs” for
short.

The independent variables are sex, age, drinking, smoking and exercise.
Our scientist thinks that each independent variable has a linear relation with health care costs. He
therefore decides to fit a multiple linear regression model. The final model will predict costs from
all independent variables simultaneously.

Data Checks and Descriptive Statistics

Before running multiple regression, first make sure that


1. the dependent variable is quantitative;

2. each independent variable is quantitative or dichotomous;

3. you have sufficient sample size.


A visual inspection of our data shows that requirements 1 and 2 are met: sex is a dichotomous
variable and all other relevant variables are quantitative. Regarding sample size, our data contain
525 cases, so this seems fine.

Let's now proceed with some quick data checks.


1. Run basic histograms over all variables. Check if their frequency distributions look plausible. Are
there any outliers? Should you specify any missing values?
2. Inspect a scatterplot for each independent variable (x-axis) versus the dependent variable (y-
axis). Do you see any curvilinear relations or anything unusual?

3. Run descriptive statistics over all variables. Inspect if any variables have missing values and, if so,
how many.
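If you prefer to run these quick checks outside SPSS, a rough Python sketch is shown below. The column names (sex, age, alco, cigs, exer, costs) are assumptions; the actual variable names in health-costs.sav may differ.

```python
# Quick data checks: histograms, scatterplots and descriptives.
# Reading .sav files with pandas requires the pyreadstat package.
import pandas as pd
import matplotlib.pyplot as plt

# keep numeric codes (e.g., sex 0/1) instead of value labels
df = pd.read_spss("health-costs.sav", convert_categoricals=False)

# 1. Basic histograms over all variables: do the distributions look plausible?
df.hist(figsize=(10, 6))
plt.tight_layout()
plt.show()

# 2. Scatterplot of each independent variable (x) versus the dependent variable (y).
for col in ["sex", "age", "alco", "cigs", "exer"]:      # assumed column names
    df.plot.scatter(x=col, y="costs", s=10, title=f"costs vs. {col}")
    plt.show()

# 3. Descriptive statistics and missing-value counts per variable.
print(df.describe())
print(df.isna().sum())
```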

SPSS Regression Dialogs

We'll first navigate to Analyze > Regression > Linear as shown below.

Next, we fill out the main dialog and subdialogs as shown below.

We'll select 95% confidence intervals for our b coefficients.
Some analysts report squared semipartial (or “part”) correlations as effect size measures for
individual predictors. But for now, let's skip them.
By selecting “Exclude cases listwise”, our regression analysis uses only cases without any
missing values on any of our regression variables. That's fine for our example data but this may be a
bad idea for other data files.
Clicking Paste generates the corresponding REGRESSION syntax, which we then run.
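The pasted syntax itself is not reproduced here. For readers who want to reproduce the analysis outside SPSS, the sketch below fits the same model in Python (statsmodels), using listwise deletion and 95% confidence intervals for the b-coefficients; the variable names are assumptions, not the actual names in health-costs.sav.

```python
# Fitting the multiple regression with listwise deletion ("Exclude cases listwise")
# and requesting 95% confidence intervals for the b-coefficients.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_spss("health-costs.sav", convert_categoricals=False)  # assumed variable names below
variables = ["costs", "sex", "age", "alco", "cigs", "exer"]
df = df.dropna(subset=variables)                                    # listwise deletion

model = smf.ols("costs ~ sex + age + alco + cigs + exer", data=df).fit()
print(model.summary())                  # b-coefficients, t, Sig., R-square, etc.
print(model.conf_int(alpha=0.05))       # 95% confidence intervals for B
```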

SPSS Multiple Regression Output


The first table we inspect is the Coefficients table shown below.

The b-coefficients dictate our regression model:


Costs′ = −3263.6 + 509.3 · Sex + 114.7 · Age + 50.4 · Alcohol + 139.4 · Cigarettes − 271.3 · Exercise

where Costs′ denotes predicted yearly health care costs in dollars.

Each b-coefficient indicates the average increase in costs associated with a 1-unit increase in a
predictor. For example, a 1-year increase in age results in an average $114.7 increase in costs. A
1-hour increase in exercise per week is associated with a -$271.3 increase (that is, a $271.3 decrease) in
yearly health costs.
Now, let's talk about sex: a 1-unit increase in sex results in an average $509.3 increase in costs. To
understand what this means, note that sex is coded 0 (female) and 1 (male) in our example
data. So for this variable, the only possible 1-unit increase is from female (0) to male (1). Therefore,
B = $509.3 simply means that the average yearly costs for males
are $509.3 higher than for females (everything else equal, that is). This hopefully clarifies how
dichotomous variables can be used in multiple regression.
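To see the model and the sex dummy in action, the sketch below plugs the reported b-coefficients into the prediction equation for a hypothetical patient. Only the coefficients come from the table; the patient values are invented for illustration.

```python
# Applying the reported b-coefficients by hand.
def predicted_costs(sex, age, alcohol, cigarettes, exercise):
    """Predicted yearly health care costs (USD) from the coefficients above."""
    return (-3263.6 + 509.3 * sex + 114.7 * age + 50.4 * alcohol
            + 139.4 * cigarettes - 271.3 * exercise)

# Hypothetical patient: 40 years old, 2 drinks/week, 10 cigarettes/day, 3 hours exercise/week.
male = predicted_costs(sex=1, age=40, alcohol=2, cigarettes=10, exercise=3)
female = predicted_costs(sex=0, age=40, alcohol=2, cigarettes=10, exercise=3)
print(male, female, male - female)   # the difference equals the sex coefficient, $509.3
```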
The “Sig.” column in our coefficients table contains the (2-tailed) p-value for each b-coefficient.
As a general guideline, a b-coefficient is statistically significant if its “Sig.” or p < 0.05. Therefore,
all b-coefficients in our table are highly statistically significant. Precisely, a p-value displayed as 0.000
means that if some b-coefficient is zero in the population (the null hypothesis), then there's a near-zero
probability of finding the observed sample b-coefficient or a more extreme one. We then conclude that the
population b-coefficient probably wasn't zero after all.
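The t and “Sig.” columns can be reproduced from B and its standard error. The sketch below checks the age row; the residual degrees of freedom (519 = 525 cases − 5 predictors − 1) follow from the sample size mentioned earlier.

```python
# Reproducing t = B / SE and the two-tailed p-value for the age coefficient.
from scipy import stats

b, se = 114.658, 12.452          # from the coefficients table
df_resid = 525 - 5 - 1           # n - number of predictors - 1

t = b / se                       # about 9.21, matching the "t" column
p = 2 * stats.t.sf(abs(t), df_resid)
print(round(t, 3), p)            # p is tiny, which SPSS displays as ".000"
```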
Now, our b-coefficients don't tell us the relative strengths of our predictors. This is because these have
different scales: is a cigarette per day more or less than an alcoholic beverage per week? One way to
deal with this is to compare the standardized regression coefficients or beta coefficients, often denoted
as β (the Greek letter “beta”).
Beta coefficients (standardized regression coefficients) are useful for comparing the relative
strengths of our predictors. The 3 strongest predictors in our coefficients table are:
• age (β = 0.322);
• cigarette consumption (β = 0.311);
• exercise (β = -0.281).
Beta coefficients are obtained by standardizing all regression variables into z-scores before computing
the b-coefficients. Standardizing variables applies a common scale to them: the resulting z-scores
always have a mean of 0 and a standard deviation of 1, regardless of whether they're computed over
years, cigarettes or alcoholic beverages. That's why b-coefficients computed over standardized
variables (beta coefficients) are comparable within and between regression models.
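A quick way to convince yourself of this is to standardize all variables and refit the model: the resulting slopes are the beta coefficients. A sketch is shown below, again with assumed variable names.

```python
# Beta coefficients = b-coefficients computed on z-scored variables.
import pandas as pd
import statsmodels.api as sm

df = pd.read_spss("health-costs.sav", convert_categoricals=False)  # assumed variable names
cols = ["costs", "sex", "age", "alco", "cigs", "exer"]
z = (df[cols] - df[cols].mean()) / df[cols].std()                   # z-scores: mean 0, SD 1

beta_model = sm.OLS(z["costs"], sm.add_constant(z.drop(columns="costs")),
                    missing="drop").fit()
print(beta_model.params)   # these slopes are the betas; the intercept is ~0

# Equivalently, beta = b * SD(predictor) / SD(dependent) for each predictor.
```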
Right, so our b-coefficients make up our multiple regression model. This tells us how to predict yearly
health care costs. What we don't know, however, is precisely how well our model predicts these
costs. We'll find the answer in the model summary table discussed below.

Coefficients(a) - Model 1

(Constant): B = -3263.586, Std. Error = 1059.163, t = -3.081, Sig. = .002, 95% CI for B [-5344.360, -1182.812]
Sex: B = 509.271, Std. Error = 139.565, Beta = .126, t = 3.649, Sig. = .000, 95% CI for B [235.089, 783.454], Tolerance = .975, VIF = 1.026
Age at Survey Completion (Years): B = 114.658, Std. Error = 12.452, Beta = .322, t = 9.208, Sig. = .000, 95% CI for B [90.196, 139.121], Tolerance = .954, VIF = 1.048
Average Consumption of Alcoholic Beverages per Week: B = 50.386, Std. Error = 10.275, Beta = .192, t = 4.904, Sig. = .000, 95% CI for B [30.201, 70.572], Tolerance = .756, VIF = 1.322
Average Consumption of Cigarettes per Day: B = 139.414, Std. Error = 17.384, Beta = .311, t = 8.020, Sig. = .000, 95% CI for B [105.263, 173.565], Tolerance = .773, VIF = 1.293
Average Hours of Exercise per Week: B = -271.270, Std. Error = 36.300, Beta = -.281, t = -7.473, Sig. = .000, 95% CI for B [-342.584, -199.956], Tolerance = .825, VIF = 1.212

a. Dependent Variable: Total Health Care Costs Declared over 2020

Model Summary(b) - Model 1

R = .629(a), R Square = .395, Adjusted R Square = .390, Std. Error of the Estimate = $1,577.10344, Durbin-Watson = 1.989

a. Predictors: (Constant), Average Hours of Exercise per Week, Sex, Age at Survey Completion (Years), Average Consumption of Cigarettes per Day, Average Consumption of Alcoholic Beverages per Week
b. Dependent Variable: Total Health Care Costs Declared over 2020

SPSS Regression Output II - Model Summary & ANOVA

The figure below shows the model summary and the ANOVA tables in the
regression output.

R denotes the multiple correlation coefficient. This is simply the Pearson correlation between the
actual scores and those predicted by our regression model.
R-square or R2 is simply the squared multiple correlation. It is also the proportion of variance in
the dependent variable accounted for by the entire regression model.
R-square computed on sample data tends to overestimate R-square for the entire population. We
therefore prefer to report adjusted R-square, which is a less biased estimator for the
population R-square. For our example, adjusted R-square = 0.390. By most standards, this is considered very high.
Finally, the p-value found in the ANOVA table applies to R and R-square. It evaluates the null
hypothesis that our entire regression model has a population R of zero. Since p < 0.05, we reject this
null hypothesis for our example data.
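The relation between R-square and adjusted R-square is easy to verify by hand, as sketched below; the small discrepancy with the table comes purely from rounding R-square to three decimals.

```python
# Adjusted R-square from R-square, sample size n and number of predictors k:
#   R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
r2, n, k = 0.395, 525, 5
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 3))   # 0.389; SPSS reports .390 because it uses the unrounded R-square
```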

Multiple Regression Assumptions


To check the remaining assumptions, we create and inspect two residual plots:
• a histogram of the (standardized) residuals, to check whether the residuals are roughly normally distributed;
• a scatterplot of residuals (y-axis) versus predicted values (x-axis), which may detect violations of both homoscedasticity and linearity.
The easy way to obtain these 2 residual plots is selecting them in the dialogs (shown below) and
rerunning the regression analysis.

Clicking Paste generates the corresponding syntax, which we run to obtain the residual plots.
The first plot, a histogram of the standardized residuals, is used to check normality. Most analysts would
conclude from it that the residuals are roughly normally distributed. If you're not
convinced, you could add the residuals as a new variable to the data via the SPSS regression dialogs.
Next, you could run a Shapiro-Wilk test or a Kolmogorov-Smirnov test on them.

Residual Plots II - Scatterplot

The residual scatterplot shown below is often used for checking a) the homoscedasticity and b) the
linearity assumptions. If both assumptions hold, this scatterplot shouldn't show any systematic
pattern whatsoever. That seems to be the case here.

Homoscedasticity implies that the variance of the residuals should be constant. This variance can be
estimated from how far the dots in our scatterplot lie apart vertically. Therefore, the height of our
scatterplot should neither increase nor decrease as we move from left to right. We don't see any such
pattern.

A common check for the linearity assumption is inspecting if the dots in this scatterplot show any
kind of curve. That's not the case here, so linearity also seems to hold.
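For completeness, this residuals-versus-predicted scatterplot can also be drawn outside SPSS, for example as sketched below (standardized residuals against standardized predicted values, with the usual assumed variable names).

```python
# Standardized residuals versus standardized predicted values.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pd.read_spss("health-costs.sav", convert_categoricals=False)  # assumed variable names
model = smf.ols("costs ~ sex + age + alco + cigs + exer", data=df).fit()

zpred = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()
zresid = (model.resid - model.resid.mean()) / model.resid.std()

plt.scatter(zpred, zresid, s=10)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted value")
plt.ylabel("Standardized residual")
plt.title("Residuals vs. predicted values")
plt.show()   # no funnel shape (homoscedasticity) and no curve (linearity) is what we want
```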

Example
A health researcher wants to be able to predict "VO2max", an indicator of fitness and health.
Normally, performing this procedure requires expensive laboratory equipment and
necessitates that an individual exercise to their maximum (i.e., until they can no longer continue
exercising due to physical exhaustion). This can put off those individuals who are not very
active/fit and those individuals who might be at higher risk of ill health (e.g., older unfit
subjects). For these reasons, it has been desirable to find a way of predicting an individual's
VO2max based on attributes that can be measured more easily and cheaply. To this end, a
researcher recruited 100 participants to perform a maximum VO2max test, but also recorded
their "age", "weight", "heart rate" and "gender". Heart rate is the average of the last 5 minutes
of a 20 minute, much easier, lower workload cycling test. The researcher's goal is to be able
to predict VO2max based on these four attributes: age, weight, heart rate and gender.

Setup in SPSS Statistics


In SPSS Statistics, we created six variables: (1) VO2max, which is the maximal aerobic
capacity; (2) age, which is the participant's age; (3) weight, which is the participant's weight
(technically, it is their 'mass'); (4) heart_rate, which is the participant's heart rate; (5) gender,
which is the participant's gender; and (6) caseno, which is the case number.

The caseno variable is used to make it easy for you to eliminate cases (e.g., "significant
outliers", "high leverage points" and "highly influential points") that you have identified when
checking for assumptions.

Test Procedure in SPSS Statistics


The steps below show you how to analyse your data using multiple regression in SPSS
Statistics when none of the eight assumptions in the previous section, Assumptions, have
been violated. At the end of these steps, we show you how to interpret the results from
your multiple regression.

1. Click Analyze > Regression > Linear... on the main menu, as shown below:

2. Transfer the dependent variable, VO2max, into the Dependent: box and the independent
variables, age, weight, heart_rate and gender, into the Independent(s): box, using
the buttons, as shown below (all other boxes can be ignored):

3. Click on the Statistics button. You will be presented with the Linear Regression:
Statistics dialogue box, as shown below:

4. In addition to the options that are selected by default, select Confidence intervals in the
Regression Coefficients area, leaving the Level(%): option at "95". You will end up with the
following screen:

5. Click on the Continue button. You will be returned to the Linear Regression dialogue box.

6. Click on the OK button. This will generate the output.
Interpreting and Reporting the Output of Multiple Regression
Analysis
SPSS Statistics will generate quite a few tables of output for a multiple regression analysis.
In this section, we show you only the three main tables required to understand your results
from the multiple regression procedure, assuming that no assumptions have been violated.
Determining how well the model fits
The first table of interest is the Model Summary table. This table provides the R, R2,
adjusted R2, and the standard error of the estimate, which can be used to determine how well
a regression model fits the data:

The "R" column represents the value of R, the multiple correlation coefficient. R can be
considered to be one measure of the quality of the prediction of the dependent variable; in
this case, VO2max . A value of 0.760, in this example, indicates a good level of prediction.

The "R Square" column represents the R2 value (also called the coefficient of determination),
which is the proportion of variance in the dependent variable that can be explained by the
independent variables (technically, it is the proportion of variation accounted for by the
regression model above and beyond the mean model). You can see from our value of 0.577
that our independent variables explain 57.7% of the variability of our dependent
variable, VO2max . However, you also need to be able to interpret "Adjusted R Square" (adj.

R2) to accurately report your data.


Statistical significance
The F-ratio in the ANOVA table (see below) tests whether the overall regression model is a
good fit for the data. The table shows that the independent variables statistically significantly
predict the dependent variable, F(4, 95) = 32.393, p < .0005 (i.e., the regression model is a
good fit of the data).
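As a quick sanity check, the reported F-ratio and its degrees of freedom can be turned into a p-value directly; the sketch below uses SciPy.

```python
# p-value for the overall model: P(F > 32.393) with (4, 95) degrees of freedom.
from scipy import stats

p = stats.f.sf(32.393, dfn=4, dfd=95)
print(p, p < 0.0005)    # prints a very small value and True
```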


Estimated model coefficients


The general form of the equation to predict VO2max from age, weight, heart_rate and gender is:

predicted VO2max = 87.83 - (0.165 x age) - (0.385 x weight) - (0.118 x heart_rate) + (13.208 x gender)

This is obtained from the Coefficients table, as shown below:


Unstandardized coefficients indicate how much the dependent variable varies with an
independent variable when all other independent variables are held constant. Consider the
effect of age in this example. The unstandardized coefficient, B1, for age is equal to -0.165

(see Coefficients table). This means that for each one year increase in age, there is a
decrease in VO2max of 0.165 ml/min/kg.
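The equation above can be applied directly to new cases. The sketch below does this for a hypothetical participant; gender is assumed to be coded 1 for male and 0 for female, and the participant's values are invented for illustration.

```python
# Applying the estimated VO2max equation to a hypothetical participant.
def predicted_vo2max(age, weight, heart_rate, gender):
    """Predicted VO2max (ml/min/kg) from the coefficients reported above."""
    return 87.83 - 0.165 * age - 0.385 * weight - 0.118 * heart_rate + 13.208 * gender

# Hypothetical 30-year-old male, 80 kg, average heart rate of 133 bpm in the cycling test.
print(round(predicted_vo2max(age=30, weight=80, heart_rate=133, gender=1), 2))  # about 49.59
```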
Statistical significance of the independent variables
You can test for the statistical significance of each of the independent variables. This tests
whether the unstandardized (or standardized) coefficients are equal to 0 (zero) in the
population. If p < .05, you can conclude that the coefficients are statistically significantly
different to 0 (zero). The t-value and corresponding p-value are located in the "t" and "Sig."
columns, respectively, as highlighted below:

You can see from the "Sig." column that all independent variable coefficients are statistically
significantly different from 0 (zero). Although the intercept, B0, is tested for statistical
significance, this is rarely an important or interesting finding.
