
Introduction to multiple regression | Dr Joanne Cummings

Introduction
Hello, my name is Dr Joanne Cummings. I’m a teaching associate within the School
of Psychological Sciences and Health. Today’s lecture is going to focus on an
introduction to multiple regression. You will already have had your lecture on simple
regression by Dr Mark Elliott, and multiple regression is an extension of simple
regression. This lecture will be made up of four parts. For the first part I just want to
go over what a multiple regression is, and the regression equation. In the second part, I will go on to show you how to run a multiple regression in SPSS, and within this section I will mention briefly the coding of dummy variables, which is sometimes required within multiple regression – but I will explain this in more depth when I come to that part of the lecture. The third section will focus on different types of
multiple regression that you can use, such as forced entry methods or statistical
methods, and again I will go into these in a bit more depth when I come to that part.
And in the last section, I will discuss assumptions and sample sizes that we need
when running multiple regression with participants.

Multiple regression theory


So, in this first part of the multiple regression lecture, I'm going to be talking
about the multiple regression theory, and how it extends from the simple linear
regression. I want to mention a little bit about shared variance between independent
variables when we are running multiple regression analyses, and I will then talk a
little bit around the multiple regression equation. So, I’m just going to start talking
about the multiple regression theory. Multiple regression is an extension of simple
regression. We are still interested in predicting a dependent variable, but now it’s
from more than one independent variable, and this typically improves prediction compared with using just one IV. So, multiple regression is a statistical technique that allows us to predict someone's score on one variable on the basis of their scores on several other variables. And like simple regression, multiple regression does not
imply causal relationships. In the second part of this lecture, I’m going to go through
how to run a multiple regression, and what we are interested in is predicting test
performance from the number of hours spent studying, and intelligence. So, in this
slide, it shows a number of diagrams. The first one shows the relationship between a
dependent variable and an independent variable. The independent variable is
referred to as X1, so we can see that X1 has some unique variance that overlaps with
the variation in Y – Y is the dependent variable. And this is referred to as a semi-
partial correlation. The second diagram shows the relationship between a second
independent variable and the dependent variable. X2 – so the second independent
variable – has some unique variance that overlaps with the variation in Y, just like
the first independent variable. With the third diagram, we can see that X1 and X2 –
so, both independent variables – have some shared variance that overlaps with the
variation in the dependent variable, and this is the purple area on the diagram. This
shared variance is accounted for in the analysis. The unshaded proportion of the variance in Y, the dependent variable, is the unexplained variance – we do not know what else is predicting it. So, just to recap: the equation for the regression line in simple regression has just one X, so Y = b nought plus b one X one, where b nought is the intercept (i.e. the point where the line crosses the Y axis on a scatterplot) and b one is the slope. The next slide illustrates this. In a multiple regression, the equation is extended, so it becomes Y = b nought plus b one X one plus b two X two, and so on. The subscripts indicate the number of predictors in the analysis.
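Written out symbolically – this is simply a rendering of the equations just described, nothing beyond them – the simple and multiple regression equations are:

```latex
% Simple regression (one predictor)
Y = b_0 + b_1 X_1
% Multiple regression (several predictors)
Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k
```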
So, in this part of the lecture, I've covered multiple regression theory; how it extends from simple linear
regression; spoke a little bit about shared variance; and mentioned the multiple
regression equation. In the next part of the lecture, I’m going to go through how to
run a multiple regression in SPSS.

Running a multiple regression in SPSS


Okay, so this is now part two of the lecture and I’m going to talk you through how to
run a multiple regression in SPSS. So, to run a multiple regression in SPSS, you
would open your data file, and in this example as I’ve already mentioned, we have
test result as the dependent variable, and study time and intelligence as the
independent variables. So, when you open your data file, you would go to the top of
the task bar and select analyse, scroll down to regression, and then select linear –
just as you would do with a simple linear regression. This opens a dialogue box, and
in this section, you would move test result over to the dependent box and move the
independent variables over to the independent box, and then you would click on
okay. This analysis provides a number of tables in your output. The first table
confirms what the predictors and the dependent variable in the model are. The next
table is called the model summary table and the r value is the correlation between all
predictors considered together, and the dependent variable. The r square value in
that table tells us the proportion of variance in the dependent variable that is
explained by the predictors considered together. So, in this example, study time and
intelligence together account for 62% of the variance in test result. We can say that
we have a large effect size. Our predictors account for a large proportion of the
variance in the dependent variable. Still looking at the model summary table, the
adjusted r square value is the result of adjusting the r square value downwards, based
on the number of participants in the sample, and the number of predictors in the
analysis. Here we have 40 participants and just two predictors, so there is not much
difference between r square and adjusted r square.
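As an aside (this formula is not on the slide, but it is the standard adjustment), the downward adjustment is usually computed as:

```latex
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
```

where n is the number of participants (40 in this example) and k is the number of predictors (2 here), which is why the adjustment is small in this case.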
Now we move onto the ANOVA table, and I would like to draw your attention to the sum of squares column. The
regression value is the variation accounted for by the regression line. The residual
value is the variation not accounted for (i.e. the error). And the total value is the total
variation in the dependent variable there is to be explained. So, staying on the
ANOVA table, this tells us whether the model is significant or not. The F value and the significance level are given, and here you can see that the significance level is less than .05.
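If you wanted to reproduce this kind of analysis outside SPSS, a minimal sketch using Python's statsmodels might look like the following (the file and column names are hypothetical; the lecture itself only uses the SPSS menus):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data file with one row per participant.
data = pd.read_csv("test_performance.csv")  # columns: test_result, study_time, intelligence

# Standard (forced entry) multiple regression: both predictors entered at once.
model = smf.ols("test_result ~ study_time + intelligence", data=data).fit()

print(model.summary())      # coefficients, t-tests and the overall F test
print(model.rsquared)       # proportion of variance explained (about .62 in the lecture's example)
print(model.rsquared_adj)   # R-square adjusted for sample size and number of predictors
```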
The next table to look at is the coefficient table, and first we will look at the
unstandardized coefficients column. These coefficients give us the information we
need to specify the regression equation. So, for the constant, this is the intercept,
and for the predictors, this is the slope of the line. The direction of the relationship
between the predictors and the dependent variable is also shown here, and you can
see from these values that the relationships are positive, so this means that as
intelligence increases, so does test result, and as study time increases, so does test
result. We can use these values to specify the regression equation. So, as I've mentioned earlier, the regression equation is Y = b nought plus b one X one plus b two X two. This becomes Y = 1.171 – that's the b nought value – plus .547 multiplied by X one plus .267 multiplied by X two. So, if we knew that someone studied for one hour and had an intelligence score of ten, they would be predicted to get 4.38 out of ten on the multiple-choice test, on average. So, the regression equation is Y = 1.171 plus .547 multiplied by one – because they've studied for one hour – plus .267 multiplied by ten – because that's the intelligence score – and this gives us a value of 4.38.
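As a quick check of that arithmetic, using the rounded coefficients quoted above:

```python
# Coefficients taken from the lecture's coefficients table (rounded).
b0, b1, b2 = 1.171, 0.547, 0.267   # intercept, study time slope, intelligence slope

study_time = 1       # hours studied
intelligence = 10    # intelligence score

predicted = b0 + b1 * study_time + b2 * intelligence
print(round(predicted, 2))  # 4.39 with these rounded coefficients; the lecture quotes 4.38
```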
The next column to look at in this table is the standardized coefficients, and these are
referred to as beta weights. They tell us what the regression coefficients are, when all
predictor variables have been standardized. We can, therefore, use these to judge the
relative importance of the predictors, and here we can say that study time is a more important predictor of test score than intelligence, as it has the higher beta value. Again, the relationships between the predictor variables and the dependent variable are positive. So, staying with the coefficients table, the next columns to pay attention to are the t and significance value columns. T-tests are applied to each standardized
regression coefficient to test whether the coefficient is different from zero. Here, both
p values are below .05, so they are both significant predictors of test result at the 95%
level. Study time is also significant at a more stringent threshold. So, overall, study time and intelligence together account for 62% of the variation in test result. Both
independent variables are statistically significant independent predictors, although
study time is a more important predictor than intelligence. I just want to mention
something about categorical predictors in multiple regression. In a multiple
regression, the dependent variable is measured either on an interval or ratio scale –
but not categorical. The predictor variables – or the independent variables – can also
be interval, ratio or ordinal, but a multiple regression can also handle categorical variables if they have only two levels. For example, if we use sex (male and female), there are two levels of the category sex. But what if we had a categorical variable with more than two levels? We would need to dummy code those variables so that the multiple regression can handle them. For example, gender identity – such as
masculine, feminine and androgynous – would need to be dummy coded as three
different variables, and these would be masculine/not masculine, feminine/not feminine, and androgynous/not androgynous. In our study that we've just been talking
about, the dependent variable was the test result, and the predictor variables were study time and intelligence. Suppose we had a new categorical predictor with more than two levels, such as ethnic group, coded as white – one, black – two, and Asian – three, and we wanted to see whether ethnic group had an effect on test result in addition to study time and intelligence. We would need to convert these ethnic groups into dummy coded variables: in one column of your SPSS datasheet, white would be coded as one and all other groups as zero; in the next column, black would be coded as one and all other groups as zero; and in the next column, Asian would be coded as one and all other groups as zero.
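If you were preparing the data outside SPSS, a minimal sketch of this dummy coding in Python's pandas (with a hypothetical column name) would be:

```python
import pandas as pd

# Hypothetical ethnic group column, as described above.
data = pd.DataFrame({"ethnic_group": ["white", "black", "asian", "white", "asian"]})

# One 0/1 indicator column per category, matching the columns described above.
dummies = pd.get_dummies(data["ethnic_group"], prefix="ethnic", dtype=int)
data = pd.concat([data, dummies], axis=1)
print(data)

# Note: when the dummies are entered as predictors, one category is normally left
# out as the reference group, otherwise the set is perfectly collinear.
```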
So, looking at the output in this diagram: if white, for example, was a significant
predictor of test result with a positive regression coefficient, then white participants
would have performed better on the test than the other ethnic groups. However, if
white – just for example – was a significant predictor of test result with a negative
regression coefficient, then this means that white participants performed worse on the test than the other ethnic groups. So, this is the end of part two of the multiple regression lecture. We have gone over how to run a multiple regression in
SPSS, and mentioned briefly how to code dummy variables. The next part of the
lecture, we’re going to be looking at different types of multiple regression.

Different types of multiple regression


So, this is part three of the lecture, and in this section, we’re going to be looking at
different types of multiple regression. So, broadly speaking, we have two main types
of multiple regression. One is referred to as the forced entry method, and you have a
standard method and a hierarchical method within this. The second broad category
is statistical methods, and within this, you can run a stepwise or a forwards or a
backwards regression. I will explain each in turn and show you how you would
choose an alternative type of multiple regression in SPSS. So, as I’ve mentioned,
there are two broad types of multiple regression – forced entry, and statistical
methods. With forced entry methods, variables that are included in the analysis are
determined by the researcher. And as mentioned, there are two types. So, a standard
enter method, and a hierarchical method. With statistical methods, on the other
hand, variables are included in the analysis on purely statistical grounds, and as mentioned, these include the stepwise, forwards and backwards methods. So, with the standard method, all
chosen predictors are entered into the regression model/equation at the same time,
and this is what we did earlier in part two of the lecture. This tells us the proportion of the variance (i.e. r square) in the dependent variable that is explained by the predictors, and whether that proportion is significantly different from zero. And we get
information on the relative importance of the independent variables and whether
they are statistically significant predictors of the dependent variable. With the
hierarchical method, all chosen predictors are entered into the regression model, but
at different steps. So, for example, we can enter a single variable at step one, then
another variable at step two, and if you had more than two independent variables,
they would be entered at step three, and so on. You can also enter a set of variables at step one and another set at step two – it doesn't have to be just one variable at each step. So, in our example, we could enter study time into the analysis at step one,
and intelligence at step two, if we had a rationale for doing so. Entering the variables
in at different steps would tell us, at step one, the proportion of the variance in the
dependent variable that study time accounted for on its own, and whether that was a
significant proportion of the variance. It’d also tell us the relative importance of
study time as a predictor and whether this is statistically significant. The results
would also tell us, at step two, the proportion of the variance in the dependent
variable that study time and intelligence account for altogether, and whether that is a
significant proportion of the variance. Therefore, by comparison with step one, we
can assess the change in the r square value due to intelligence. And we can determine
if there’s an increase in explained variance and whether this is statistically
significant. The relative importance of all the predictors included in the analysis, and
whether they are statistically significant predictors of the dependent variable, is
determined here.
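A minimal sketch of this hierarchical entry outside SPSS, again using statsmodels with the same hypothetical column names, would compare the two steps directly:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

data = pd.read_csv("test_performance.csv")  # hypothetical file: test_result, study_time, intelligence

# Step one: study time only; step two: add intelligence.
step1 = smf.ols("test_result ~ study_time", data=data).fit()
step2 = smf.ols("test_result ~ study_time + intelligence", data=data).fit()

r_square_change = step2.rsquared - step1.rsquared
print(r_square_change)         # extra variance explained by intelligence at step two
print(anova_lm(step1, step2))  # F test of whether that change is significant
```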
The next type of multiple regression method is the statistical type. With the forward method, the predictor with the largest zero-order correlation with the dependent variable is entered into the analysis first. The predictor with the next largest semi-partial correlation with the dependent variable is then
considered, and if it significantly increases the proportion of the variance explained,
then it is added to the analysis. This process continues until there are no more
predictors left to be entered that account for a significant increment in the r square value. The backwards method is the opposite of the forwards method. With this method, all
variables are entered into an analysis, and are systematically removed. The one with
the smallest semi-partial correlation is considered for removal first, and variables continue to be removed until removing another would significantly decrease the r square value. With the stepwise method,
variables are entered into the analysis the best first, then the next best, and so on.
But variables already in the analysis may become non-significant when new ones are
entered. In a stepwise multiple regression analysis, these will be removed. The
analysis terminates when no more variables are eligible for inclusion or removal.
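For illustration only, a rough sketch of forward entry is given below; SPSS implements this for you, and this sketch approximates the entry criterion with coefficient p-values rather than SPSS's exact F-to-enter rule:

```python
import statsmodels.formula.api as smf

def forward_select(data, outcome, candidates, alpha=0.05):
    """Greedily add the predictor with the smallest significant p-value at each step."""
    selected = []
    while True:
        remaining = [c for c in candidates if c not in selected]
        best_p, best_var = alpha, None
        for var in remaining:
            formula = f"{outcome} ~ {' + '.join(selected + [var])}"
            p_value = smf.ols(formula, data=data).fit().pvalues[var]
            if p_value < best_p:
                best_p, best_var = p_value, var
        if best_var is None:       # no remaining predictor enters significantly
            return selected
        selected.append(best_var)

# e.g. forward_select(data, "test_result", ["study_time", "intelligence"])
```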
So, which type of method is better – the forced entry method or the statistical method? Generally speaking, forced entry methods are used
when you have a theory or a rationale to guide which variables to include in the
analysis. This is the preferred option, as it implies you have a well-designed study,
and you know what you’re doing. Statistical methods are generally used in the
absence of theory to guide the selection of predictors. You might have a large number
of predictor variables and not know what to do with them, so you can hunt and seek
using this statistical method. However, statistical methods are potentially
problematic. There is a risk of a type-one error. Sooner or later, some variables will
be found to be significant due to chance alone. The larger the number of predictor
variables considered, the greater the risk. So, how would you select a different
method if you did not want to use the standard method? When your dataset is open,
just like we did before, you would go to the taskbar and click on analyse, and then
regression, and then linear. This would open the dialogue box and your dependent
variable would be moved over, just as we did before, to the dependent box. And your
independent variables would be moved over to the independent box. Where it says
method, the enter method is selected. This is the standard method, and it is the
default. However, if you wanted to use hierarchical or any of the statistical methods –
stepwise, forwards or backwards – you would click on the down arrow and a number
of options would appear, and you would select your method of choice. That’s the end
of part three of the lecture. We touched upon different types of multiple regression
that you can use, and why it would be better to use the forced entry method, such as
the standard method or the hierarchical method. In the last part of the lecture – part
four – I’m going to talk about assumptions when we are running multiple regression
analyses and mention a little bit of information in relation to the sample size that
would be required when conducting multiple regression.

Assumptions and sample sizes


So, now we’re onto part four of the lecture and I want to mention just some
information about assumptions that you need to consider when carrying out multiple
regression analyses, and just mention a little bit of information around the sample
size that would be required when conducting multiple regression, with multiple
predictors. So, with a multiple regression, the assumptions are the same as simple
regression. And Dr Mark Elliott, who delivered the simple regression lecture, will have covered these with you. So, analysis of residuals should indicate linearity, normality, homoscedasticity and independence. However, with multiple
regression, there is a further assumption. The predictor variables should be
independent, and that means not highly correlated with each other. When a
predictor variable or an independent variable in a multiple regression analysis is
strongly correlated with one or more of the other predictor variables, you get what is
called multicollinearity. Perfect collinearity is when a predictor is perfectly correlated
with one or more of the other predictors. This means that all of its variance is shared
with the other variables. This is very, very rare, and the analysis will simply not run if
this occurs. You can also get what is known as near collinearity, and this occurs when
the predictors are highly correlated. This is more common, but it may mean that the regression coefficients become unstable, so that small changes to the dataset can dramatically change the regression coefficients. For example, the regression coefficients can change sign – a positive relationship between a predictor variable and the dependent variable may become negative – and small coefficients can become large. So, they become
unstable. You can check for collinearity, however, in SPSS. You can request collinearity diagnostics: when the dialogue box is open, after you have clicked on analyse, regression, linear – as you would do to run your multiple regression – you would click on statistics, and this brings up another dialogue box. Within this, you would click on descriptives, and you will see an option on the right-hand side for collinearity diagnostics. You would tick this box and then press
continue. So, there are a number of warning signs that will tell you if you have a
problem with collinearity between your variables. So, if collinearity exists, the
correlations between the predictor variables will be high. Various criteria have been proposed, but if two variables are correlated above .70, it implies a lack of discriminant validity between the predictor variables; it means they're measuring the same thing, and you wouldn't want this in a multiple regression analysis. Another warning sign to look out for is when the F test for the overall regression model is significant, but the t-tests in the output tables show you that none of the
regression coefficients are significant. These are found in the coefficients table.
Another warning sign would be that the beta values blow up; they have a value greater than one. And you can see in the example that we did with the multiple regression
analysis that this is not a concern here (lecture slide). If you had collinearity, you
would also have small tolerance values near .01. Tolerance values refer to the
proportion of variance in a predictor that is unique to that predictor. So, if a
predictor variable has a tolerance value of .01, it means that 1% of its variance is
unique (i.e. not shared with any other variable) and 99% of its variance is shared
with the other predictor variables. Again, if you had collinearity, you would also get
large variance inflation factors, and variance inflation factors measure the impact of
collinearity on the regression coefficients. The general rule is that if a predictor has a
VIF – a variance inflation factor – greater than ten, it is of concern.
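Outside SPSS, a minimal sketch of computing tolerance and VIF for each predictor, using statsmodels and the same hypothetical column names, would be:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.read_csv("test_performance.csv")  # hypothetical file and column names
X = sm.add_constant(data[["study_time", "intelligence"]])

for i, name in enumerate(X.columns):
    if name == "const":
        continue  # skip the intercept term
    vif = variance_inflation_factor(X.values, i)
    print(name, "VIF =", round(vif, 2), "tolerance =", round(1 / vif, 3))
```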
The collinearity diagnostics also give us eigenvalues, condition index values and variance proportions. Eigenvalues near zero are a concern and a
possible indicator that you have collinearity with your variables. Condition index
values over 30 also indicate problems with collinearity. And variance proportions that show a single dimension accounting for a large proportion of the variance in more than one variable also indicate a problem. As you can see from the table with
dimension three, there is a high correlation between the variables, but this is the only
warning sign that is of concern. So, it is unlikely that collinearity is a significant
problem here. Just to say though that the indicators are just indicators. There are no
definite guidelines, other than when you get perfect collinearity. So, the advice would
be to assess the warning indicators; consider them all together; are all or most of
them pointing towards collinearity? It can be difficult if some warning indicators are
pointing to collinearity, and others are not. But experience helps, and the more you apply multiple regression to datasets, the better you will become at detecting collinearity issues if there are any. There are a few things that you can do if
collinearity is an issue. First, find out which variables are causing the problem. If two
variables are highly correlated, you could remove one of them, or you could combine
them into a single measure if it makes sense to do so (and then take the mean). Or,
you could increase the sample size. One reason why predictors might be correlated is that the measures of those variables are not very accurate; they have large confidence intervals. So, basically, the larger the sample, the smaller the confidence
interval. Now, I would just like to say a few things about sample sizes in multiple
regression. You can conduct a power analysis to determine the sample size required
for your analysis, but it is important to point out that the sample size required
depends on the effect size you expect (small, moderate or large), the power level,
alpha level, and the number of predictors that you have. There is a general rule of
thumb: the more participants the better, but 20 participants per predictor as a bare minimum is acceptable. Tabachnick and Fidell, who've written a book on statistics, recommend that, for testing an overall model, the number of participants should be greater than or equal to 50 plus eight times the number of predictors. So, if you had three predictors, eight times three would be 24, plus the minimum of 50, so you would require 74 participants. That's to test the overall model. If you're testing the significance of individual predictors, Tabachnick and Fidell recommend that you have a sample greater than or equal to 104 plus the number of predictors. So, if you had four predictors,
you would need 108 altogether. If you want to test both the overall model and the
individual predictors for statistical significance, then do both calculations, and
choose the biggest sample size that is required.
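A tiny sketch of those two rules of thumb, and of taking the larger of the two, is:

```python
def tf_minimum_n(num_predictors: int) -> int:
    """Tabachnick and Fidell rules of thumb, as described above."""
    overall_model = 50 + 8 * num_predictors   # testing the overall model
    individual = 104 + num_predictors         # testing individual predictors
    return max(overall_model, individual)     # testing both: choose the bigger requirement

print(tf_minimum_n(3))  # overall model needs 74, individual predictors need 107, so 107
```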
Summary
In this section, we have covered assumptions of multiple regression, and touched
briefly on sample size required when conducting multiple regression analyses. What
I would like you to do now is run a multiple regression on a different dataset. For
your task, I would like you to attempt to run a multiple regression and then answer
some of the questions. And you can find these in this section on MyPlace, underneath
the video recordings. The data is from a real study that tested if social cognitive
factors – or variables – such as intention, self-efficacy – which is the belief in your
ability to do something – and optimism, predicted levels of physical activity. The
levels of physical activity were measured in minutes of activity over the course of the week. Also try to write the results section for this, as well as answering
the questions. There is an example of how to report a multiple regression in this
section. So, if you can, do the task and attempt the questions. I've also provided some
references for you. If you have any questions, however, in relation to this lecture,
please do get in touch.
