
Gignac, G. E. (2019). How2statsbook (Online Edition 1). Perth, Australia: Author.

12
Bivariate Regression
Contents
Introduction: Bivariate Regression Equation
Slope
Intercept
Bivariate Regression in SPSS: Intercept and Slope
Scatter Plots
Regression Equation: Detailed
Regression, Error, and Residuals
Residuals: Illustration with SPSS
Assumptions
Pearson/Spearman Correlation Approach
Bivariate Regression and Linearity
A note on Regression and Causality
Advanced Topics
Why is it called the 'line of best fit'?
Bivariate Regression in SPSS: Intercept and Slope - Bootstrap
Breusch-Pagan and the Koenker test
Adjusting Beta-Weight Standard Errors
An Example of Heteroscedasticity
When p-values Contradict Each Other: Don't Despair
Practice Questions
References

Introduction: Bivariate Regression Equation


In a previous chapter, the association between years of education completed and
earnings per day was estimated at r = .337, based on a Pearson correlation. A Pearson
correlation is a very useful statistic to estimate the association between two variables measured
on a continuous scale. However, it is not quite sophisticated enough to make specific
predictions, which is often the purpose of conducting a statistical analysis. For example, suppose a person had completed 13 years of education; what would you predict their earnings per day to be? Bivariate regression can answer that question. It can do so because bivariate regression
is associated with results that can be used to build a regression equation. A regression equation
allows for the prediction of Y (a dependent variable) based on a particular value of X (the
independent variable).


In this example, I would like to predict a person’s earnings per day based on the years
of education they have completed. In regression, it is conventional to refer to the predicted
variable as Y and the predictor variable as X. Thus, in the context of regression, if someone were
to say that the X variable was not normally distributed, you would assume that they were
referring to the predictor variable. There is no logical reason for this; it is simply convention.
The bivariate regression equation consists of two primary elements: (1) a slope, and (2)
an intercept.

Slope
The slope is closely related to the Pearson correlation. Specifically, the slope represents
the amount of change in Y as a function of a unit increase in X. The value of a slope can either
be positive or negative, in the same way that a Pearson correlation can be positive or negative.
Consequently, based on the direction of the slope, the change in Y may be an increase or a
decrease. It is also important to know that the slope can either be unstandardized or
standardized. Again, there is a connection to the Pearson correlation here. The Pearson
correlation is a standardized representation of the association between two variables (i.e., it is
bounded by -1.0 to 1.0). By contrast, the covariance is an unstandardized
representation of the association between two variables (unbounded). Correspondingly, the
standardized slope is “essentially” bounded by -1.0 to 1.0, and the unstandardized slope is
unbounded. I wrote “essentially” because a standardized slope can actually be greater than
|1.0|, although it is very rare for it to take on such a value, and it will only do so in the case of a
multiple regression (see Chapter 14), rather than bivariate regression. I will treat the
unstandardized slope first, although the formula is the same for both the standardized and
unstandardized slopes. Typically, an unstandardized slope is represented as b. Below is the
formula for the unstandardized slope:
b = r(SDY / SDX)    (1)
where r = Pearson correlation, SDY = the standard deviation associated with the raw scores from
the Y variable, and SDX = the standard deviation associated with the raw scores from the X
variable. Thus, b can be obtained by multiplying a Pearson correlation by the ratio of the raw
score standard deviation of the predicted (dependent) variable to the raw score standard
deviation of the predictor (independent) variable. It is important that the SD from the Y
(predicted) variable is divided by the SD of the predictor (X) variable - and not the other way around.
In the current example, the Pearson correlation between years of education completed
and earnings per day was .337. Furthermore, the independent variable and dependent variable
standard deviations were: SDX = 3.065 and SDY = 69.188, respectively (Data File: education_earnings_N_40) (Watch Video 12.1: Descriptive Statistics for Bivariate Regression in SPSS).

Therefore, the unstandardized slope, in this example, was equal to 7.607:


b = r(SDY / SDX) = .337(69.188 / 3.065) = .337 * 22.574 = 7.607
Recall that the slope represents the amount of change in Y as a function of a one unit increase in X. The slope in this case is positive; therefore, a one unit increase in years of education completed was associated with an increase in per day earnings. Specifically, a one year increase in education was associated with an increase of $7.607 in earnings per day. On an annual basis, that works out to an increase of $2,776.56 for each extra year of education completed ($7.607 * 365). If a person's working life were 40 years, that would come out to approximately $111,062 for just one extra year of education completed (not including wage inflation). Recall that these data are based on the results of a real study (i.e., Zagorsky, 2003); consequently, the $111,062 value is a valid estimate (in the USA).
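For readers who like to verify such calculations outside of SPSS, here is a minimal sketch of the slope arithmetic in Python (Python is not part of the book's SPSS workflow; the variable names are mine, and the values are simply those reported above):

# Unstandardized slope from the Pearson correlation and the two standard deviations
r = 0.337        # Pearson correlation between education and earnings per day
sd_y = 69.188    # SD of earnings per day (the dependent/predicted variable)
sd_x = 3.065     # SD of years of education completed (the independent/predictor variable)

b = round(r * (sd_y / sd_x), 3)   # unstandardized slope, approximately 7.607
annual = b * 365                  # approximately 2776.56 extra dollars per year
career = annual * 40              # approximately 111,062 dollars across a 40-year working life
print(b, annual, career)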
The standardized slope in the context of bivariate regression is equal to the Pearson
correlation between the independent and dependent variables. The standardized slope is often
represented by the following symbol: β. Recall that the formula for the standardized slope is the
same as the formula for the unstandardized slope. It is also the case that, in the standardized case, the standard deviations are equal to 1 for both the independent and the dependent variable. That's what makes it standardized. Recall that z-scores have a mean of 0 and a standard
deviation of 1. If you transformed the raw years of education completed and earnings per day
scores into z-scores and performed the bivariate regression analysis on the z-scores, the Pearson
correlation and the standardized slope would be exactly the same. Here, for thoroughness, I
solve the slope equation in the standardized case:
β = r(SDY / SDX) = .337(1 / 1) = .337 * 1 = .337


The standardized slope has a similar interpretation to the unstandardized slope, except that it is expressed in standard deviation units. Specifically, a one standard deviation increase in X is associated with a change in Y equal to β standard deviations (an increase or a decrease, depending on the sign of the slope). In the unlikely case of a standardized slope equal to 1.0, it
would imply that a one standard deviation increase in X is associated with a one standard
deviation increase in Y. In this example, recall that the years of education completed standard
deviation was equal to 3.065 and the earnings per day standard deviation was equal to 69.188.
Therefore, a 3.065 increase in years of education completed would be expected to be associated
with, on average, a $23.316 increase in earnings per day (.337 * 69.188 = 23.316). On an annual
basis, that would work out to an earnings increase of $8,510.34. Across a 40 year work career,
the increase in earnings would amount to $340,413, on average.
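The claim that the standardized slope equals the Pearson correlation can also be checked numerically. Below is a small Python/NumPy sketch (not taken from the book) using arbitrary simulated data; with the real education and earnings scores the same equality would hold:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(13, 3, size=40)                 # made-up "education" scores
y = 14 + 7.6 * x + rng.normal(0, 60, size=40)  # made-up "earnings" scores

r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation

# z-score both variables (ddof=1 matches the sample standard deviation)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

beta = np.polyfit(zx, zy, 1)[0]                # slope from regressing zy on zx
print(round(r, 6), round(beta, 6))             # the two values are identical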

Intercept
The intercept may be regarded as the foundation of the regression equation, as it may
be considered the starting point of the equation. The intercept represents the predicted value
of Y when X is zero. In the current example, the intercept represents the expected amount of
earnings per day, if a person had 0 years of education completed. The intercept can also be
discussed in the context of the line of best fit, which I present further below in the context of
scatter plots.
In practice, the estimated intercept value can be absurd, that is, essentially impossible.
For example, if someone conducted a regression analysis with adult human weight as the
dependent variable and adult human height as the independent variable, it is very possible that
the regression analysis would estimate a negative intercept. In fact, if you do the Practice
Questions in this chapter, you will learn about a study where the intercept was -228 pounds (or
-103 kg). That’s right, negative body weight, which is impossible for a human to weigh, no matter
what diet the person may be following. Recall that the intercept represents the value of Y
(weight) when X (height) is zero. Although the total absence of height is also impossible, the regression analysis will nonetheless estimate an intercept value to get the regression equation solved. It's totally fine if an impossible intercept value is estimated from the regression analysis.
It is only occasionally the case that researchers are actually interested in interpreting the
intercept substantively. Typically, they just need it to make predictions through the regression
equation.
The intercept is often symbolized as a (or α) and may be formulated as:
a  Y  bX (2)

Thus, the intercept equals the mean of the Y variable scores minus the product of the slope and the mean of the X variable scores. Just like the slope can be expressed in unstandardized and standardized forms,
the intercept can also be expressed both ways. However, in standardized form, the intercept is


always equal to zero. This is true because the mean of z-scores is always zero. Stated
alternatively, if you solve the intercept equation above, the Y bar and X bar values will be zeroes
which will result in an intercept value of zero, no matter what the slope is estimated to be. In
the unstandardized case, however, the intercept can take on any value. Based on an unstandardized slope of 7.607, a years of education completed mean of 13.80, and an earnings per day mean of 118.986, the unstandardized intercept in this example works out to:

a = 118.986 - (7.607 * 13.80)
a = 118.986 - 104.977
a = 14.009
Recall that an intercept represents the expected value of Y (earnings per day) when X (years of
education completed) is zero. Thus, in this example, a person with zero years of education would
be expected to earn, on average, $14 per day. Stay in school!
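As a quick arithmetic check, the intercept calculation can be reproduced in a few lines of Python (a sketch only; the numbers are the slope and means reported above):

b = 7.607          # unstandardized slope (from above)
mean_x = 13.80     # mean years of education completed
mean_y = 118.986   # mean earnings per day

a = mean_y - b * mean_x
print(round(a, 3))  # approximately 14.009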

Bivariate Regression in SPSS: Intercept and Slope


To conduct a bivariate regression in SPSS, the ‘Linear’ menu option can be used (Watch
Video 12.2: Bivariate Regression in SPSS). If you do the analysis in SPSS correctly, you’ll get the
following key SPSS table entitled ‘Coefficients’:

Refer to the left side of the table labeled 'Unstandardized Coefficients'. You can see
that ‘education’ appears at the bottom of the table and that it was associated with an
unstandardized slope of 7.609, which is nearly identical to the estimate of 7.607 obtained by
hand above. The difference is due simply to rounding. SPSS also provided a standard error for
the unstandardized slope, which is used to test the unstandardized slope for statistical
significance. If you divide the slope point estimate by its standard error, you get the t-value:
7.609 / 3.447 = 2.207, which is associated with a p-value of .033. Because .033 is less than .05,
people interpret the unstandardized slope as statistically significant, i.e., sufficiently unlikely to
be due to chance fluctuations. Correspondingly, the normal theory 95% confidence interval associated with the slope ranged from .630 to 14.588. Thus, if this study were re-conducted a number
of times with different participants, we would expect an extra year of completed education to
be associated with somewhere between $.63 and $14.588 extra earnings per day, with 95%
confidence. That’s really quite a large range in slope estimates. However, such a large range in
the slope estimates should be expected, because the sample size upon which this analysis was
based was only 40. Greater sample sizes give us greater confidence.
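If you want to see where the t-value, p-value, and normal theory confidence interval come from, the following Python sketch reproduces them from the slope, its standard error, and the residual degrees of freedom (n - 2 = 38). This is not SPSS output, just the underlying arithmetic:

from scipy import stats

b, se, n = 7.609, 3.447, 40
df = n - 2                                 # residual degrees of freedom in bivariate regression

t_val = b / se                             # approximately 2.207
p_val = 2 * stats.t.sf(abs(t_val), df)     # two-tailed p, approximately .033

t_crit = stats.t.ppf(0.975, df)            # approximately 2.024
ci_lower = b - t_crit * se                 # approximately .63
ci_upper = b + t_crit * se                 # approximately 14.59
print(round(t_val, 3), round(p_val, 3), round(ci_lower, 2), round(ci_upper, 2))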

The ‘Coefficients’ table also includes a row labelled ‘(Constant)’. This is the word SPSS
uses to refer to the intercept. It has been estimated at 13.976, which is nearly identical to the
estimate of 14.009 obtained by hand. Again, the difference is due simply to rounding. SPSS also
provides a standard error for the intercept, which is used to test the intercept for statistical
significance. If you divide the intercept point estimate by its standard error, you get the t-value:
13.976 / 48.705 = .287, which is associated with a p value of .776. Consequently, the intercept
was not statistically significantly different from zero. Correspondingly, the 95% confidence interval for the intercept ranged from -84.623 to 112.575. Because the interval includes zero (one bound is negative and the other positive), the intercept is not statistically significantly different from zero.
Recall that the vast majority of researchers are not really interested in the intercept in a bivariate
regression. SPSS is just being thorough by reporting all of the statistics associated with the
intercept.

Scatter Plots
Cohen (1988) provided guidelines for interpreting the magnitude of a Pearson
correlation, which could also be applied in the bivariate regression case (but not the multiple
regression case). Thus, in this example, the standardized beta-weight of .34 would be considered
a medium effect. Another method that can be used to represent the nature and strength of an
effect between two continuously scored variables is a scatter plot, as introduced in the chapter
devoted to correlation.
The association between years of education completed and earnings (per day) is
depicted in the scatter plot presented in Figure C12.1 (Panel A). It can be observed that as years
of education values increase, so do values of earnings per day. For example, the individual with
the greatest years of education completed also earns the most amount of money (see value in
top right corner of the scatter plot). However, there is also a lot of “noise” amongst the values
in the scatter plot. For example, the person who earned the second most amount of money did
not have a particularly high amount of years of education completed (less than 15 years).
The scatter plot depicted in Panel B is exactly the same as the scatter plot depicted in Panel A. The exception is that the Panel B scatter plot includes a regression line. The regression
line is also known as the ‘line of best fit’. The regression line is a visual representation of the
slope. In this example, the regression line has an upward tilt, which reflects the fact that the
association between years of education completed and earnings is positive. Had the association
been negative, the slope would have had a downward tilt. In the complete absence of an
association between two variables, the regression line is completely level.
In addition to representing the slope visually, the regression line also reflects the
intercept value visually. Recall that the intercept is defined as the value of Y when X is zero. If
you follow the regression line to the point where it intersects with the Y-axis, you will get an
estimate of the intercept. Based on the scatter plot depicted in Panel B, the regression line

intersects with the Y-axis at somewhere between 0 and 50. However, it is important to keep in
mind that the X-axis has been restricted to a minimum of 5 years of education. As I show you in
the video, you can change the X-axis range to 0 to get a better indication of the intercept.
Ultimately, however, you can’t rely upon the regression line to give you an exact estimate of the
intercept, particularly when the sample size is small, and there are few observations near the
intersection of the X- and Y-axes. However, it will give you a sense of what the intercept is. Based
on the formula, the intercept was estimated at 14.009 (SPSS = 13.976) (Watch Video 12.3:
Scatter Plot in SPSS)

Figure C12.1. Scatterplots Depicting the Association Between Education and Earnings (Panel A: scatter plot only; Panel B: scatter plot with the line of best fit)
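A scatter plot with a superimposed regression line, like the one in Panel B, can also be produced outside of SPSS. The following Python/Matplotlib sketch assumes the 40 education and earnings scores have been loaded into two NumPy arrays (the array and function names are mine, not part of the data file):

import numpy as np
import matplotlib.pyplot as plt

def plot_with_line(education, earnings):
    slope, intercept = np.polyfit(education, earnings, 1)  # least-squares slope and intercept
    xs = np.linspace(0, education.max(), 100)              # extend the X-axis to 0 to "see" the intercept
    plt.scatter(education, earnings)
    plt.plot(xs, intercept + slope * xs)                   # line of best fit
    plt.xlabel("Years of education completed")
    plt.ylabel("Earnings per day ($)")
    plt.show()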

Regression Equation: Detailed


As mentioned above, one of the reasons a Pearson correlation analysis may be extended
to a bivariate regression analysis is to make predictions. In this example, one may wish to predict a person's earnings per day based on knowing their years of education completed. To
do so, one must create a regression equation. In the bivariate regression case, the regression
equation can be formulated as:

Ŷ = a + bX    (3)

The hat on top of the Y indicates that the value corresponds to a predicted value (derived from
the equation), rather than an observed value (as would exist within the raw data). The b
represents the slope, the X next to the b represents a raw score on the variable X, and a


represents the intercept. In plain words, the formula reads: Predicted Y value = the intercept +
slope * raw score from the X variable.
Theoretically, it is more accurate to include an error term at the end of the regression
equation:

Y = a + bX + ε    (4)

The ε represents an error term (the difference between the observed Y and the value given by a + bX), which would be expected in practice, because few, if any,
regression equations would predict dependent variable scores perfectly. Consequently, the
equation will make imperfect predictions, so there has to be a “residual” value to represent the
degree to which the prediction was off. This will become clearer when the residuals are
calculated below. However, for the purposes of predicting someone’s earnings per day, one
does not have to specify error. (Watch Video 12.4: What is the bivariate regression equation?).
Suppose you met a person at a party and they told you that they completed 17 years of
education. Most people would follow-up with questions about what topics they studied at
university, but not me. I would bet them $100 that I can guess how much they earn with 95%
confidence. To do so, I would plug in 17 into the regression equation formula, alongside the
unstandardized intercept, and unstandardized slope I estimated above with SPSS:
Ŷ = 13.976 + 7.609(17)
Ŷ = 143.329
The $143.329 I estimated is income per day, so to get the annual earnings estimate I
need to multiply by 365 days which gives $52,315.09. To obtain the normal theory 95% lower
bound and upper bound confidence intervals, I would need to apply the regression equation
two more times: Once with the lower bound estimates and once with the upper bound
estimates:
Ŷ(95% lower) = -84.623 + .630(17) = -73.91
Ŷ(95% upper) = 112.575 + 14.588(17) = 360.57
Based on the above, with 95% confidence, I would say that the individual at the party with 17 years of education earns somewhere between -$73.91 and $360.57 per day! That's a massive range. Nobody would be impressed by that. The fact that the 95% confidence interval around the point-estimate prediction of $143.33 per day is so wide underscores a very important issue to bear in mind when creating and using regression equations: error.
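The prediction itself is a one-line calculation. Below is a small Python sketch of the equation used above; the lower and upper values simply plug the lower-bound and upper-bound coefficient estimates into the same equation, mirroring the rough-and-ready approach taken in the text (this is not a formal prediction interval):

def predict(x, a=13.976, b=7.609):
    # Predicted earnings per day from years of education completed
    return a + b * x

point = predict(17)                        # approximately 143.33
lower = predict(17, a=-84.623, b=0.630)    # approximately -73.91
upper = predict(17, a=112.575, b=14.588)   # approximately 360.57
print(round(point, 2), round(lower, 2), round(upper, 2))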

Regression, Error, and Residuals


Although the application of regression can seem powerful, it should always be
acknowledged that regression estimates are associated with a particular amount of error. In
practice, the amount of error is usually very large. A method that can be used to estimate the


accuracy of a regression equation is ‘model R’. In the bivariate regression case, model R is equal
to the standardized slope (a.k.a., standardized beta weight). A method to evaluate the amount
of error associated with a regression equation is to subtract R2 from 1. Such a result is known as
the coefficient of alienation, which is in contrast to the coefficient of determination (i.e., R2).
The coefficient of alienation is simply:
1 - R²    (5)

In this example, the model R² was equal to .337², or .114. Thus, 11.4% of the variance in
earnings was accounted for by years of education completed. By contrast, the coefficient of alienation (1 - R²) was equal to 1 - .114 = .886, which implied that 88.6% of the variance in earnings was not accounted for by years of education completed. Because such a large percentage of the variance in earnings was not accounted for by years of education, it should come as no surprise that individual predictions of people's earnings would be associated with a substantial amount of error.
In regression, we refer to the difference between an observed (actual) dependent variable value and a predicted dependent variable value as a residual. As you will see in the demonstration below, it is literally the difference between the two values (observed - predicted). As model R² increases, the difference between predicted values and observed values becomes progressively smaller, i.e., better predictive capacity. However, in practice, model R² is usually much less than .50, so more than 50% of the variance in the dependent variable is associated with error, oftentimes substantially so!

Residuals: Illustration with SPSS


To estimate the residuals associated with a regression model, it is first necessary to
calculate the predicted values for all of the cases in the data file. In the example above, I
estimated the per day earnings of an individual at a party who mentioned that he completed 17
years of education. Based on the regression equation, I estimated the person’s per day earnings
at $143.329. Naturally, it is more efficient to calculate everyone’s predicted value with a
statistical program.
In SPSS, predicted values can be calculated efficiently by conducting a bivariate
regression analysis and saving the predicted values through the ‘Save’ utility (Watch Video 12.5:
Regression Residuals in SPSS). SPSS will automatically create a new variable called PRE_1 which
includes all of the cases predicted values. You’ll see that for case 1, a person who achieved 17
years of education, SPSS estimated their per day earnings at 143.34, which is almost identical to
the estimate I calculated by hand (difference due simply to rounding). However, case 1 actually
earned $35.08 per day in the past year. Consequently, the residual associated with case 1 is
equal to 35.08 - 143.329 = -108.25. SPSS can also calculate the residuals associated with each
case very efficiently. Re-run the bivariate regression again; however, this time select


‘Unstandardized residuals’ and ‘Standardized residuals’ within the ‘Save’ utility. You’ll see that
SPSS created two new variables called RES_1 and ZRE_1. The RES_1 values correspond to the raw score differences between the observed values and the predicted values. The ZRE_1 values were obtained by transforming the unstandardized residuals into z-scores.
In a basic bivariate regression analysis, SPSS reports an estimate of the error associated with the predictive capacity of the regression equation in the 'Model Summary' table. It refers to
the error estimate as the ‘Standard Error of the Estimate’, and it can be seen in the SPSS table
entitled ‘Model Summary’ that it was estimated at $65.99. The standard error of the estimate is
the standard deviation associated with the unstandardized residuals. Evidently, the amount of
error associated with the prediction equation is very large at $65.99!
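The quantities SPSS saves (PRE_1, RES_1, ZRE_1) and the standard error of the estimate can be reproduced with a short Python/NumPy sketch. The array names are placeholders for the 40 observed scores, and the formula uses n - 2 in the denominator, which is why the standard error of the estimate is only approximately the standard deviation of the residuals:

import numpy as np

def residual_summary(education, earnings, a=13.976, b=7.609):
    predicted = a + b * education                    # predicted values (PRE_1)
    residuals = earnings - predicted                 # unstandardized residuals (RES_1): observed minus predicted
    n = len(earnings)
    see = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # standard error of the estimate (about 65.99 here)
    z_resid = residuals / see                        # standardized residuals (approximately the ZRE_1 values)
    return predicted, residuals, z_resid, see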

SPSS also reports the Model R and the R2 (R Square) value in the ‘Model Summary’ table.
Again, in the context of a bivariate regression, Model R and R2 will always equal Pearson r and
Pearson r2. Furthermore, Pearson’s r and the standardized slope will always equal each other in
the bivariate regression case. Consequently, it is not really useful to examine and/or report such
values in a bivariate regression analysis. Model R and R2 become much more useful in the
context of multiple regression, which is an extension of bivariate regression. The ‘adjusted R
square’ value corresponded to .090, which implies that 9% of the variance in earnings was
accounted for by years of education completed, rather than 11.4%. The reason SPSS reports the adjusted value is that the sample R² is known to overestimate the corresponding population R².
Consequently, it is more justifiable to report the adjusted R2, but almost no one ever does in the
context of bivariate regression (only multiple regression).
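The adjusted R² can be reproduced from R², the sample size, and the number of predictors (k = 1 in the bivariate case). A minimal Python sketch of the standard adjustment formula:

def adjusted_r2(r2, n, k=1):
    # Shrinks R-squared to correct for capitalization on chance
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.337 ** 2, 40), 3))   # approximately .090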

Assumptions
The correct application of bivariate regression is based on the satisfaction of various
assumptions. The easy assumptions to evaluate include all of the assumptions associated with
Pearson correlation:
1: Random sampling
2: Independence of observations
3: Dependent variable measured on an interval/ratio scale
4: Linearity
As the above four assumptions have been treated in detail in the chapter devoted to correlation,
they have only been listed here. There are three additional assumptions associated with bivariate regression:

5: Normally distributed residuals


6: Independence of errors
7: Homoscedasticity
Curiously, many textbooks only introduce the last three assumptions listed above in the
context of multiple regression, which is an extension of bivariate regression. However, these
assumptions apply to bivariate regression, as well.

5: Normally Distributed Residuals


Most textbooks mention something about the assumption that the residuals associated
with a regression analysis should be normally distributed, but it is a rare occurrence to see a
published paper mention anything about it. I suspect this is because there is no commonly accepted statistical test to determine whether the distribution of residuals is sufficiently
normal or not. The distribution of standardized residuals for the education and income per day
example is depicted in Figure C12.2. The distribution was associated with skew of .730 and
kurtosis of -.100. Is that good? I think so, but, again, there are no published guidelines of which
I am aware. The main thing to note about a distribution of standardized residuals is whether
there are any particularly large (outlying) values.
How large is too large? The answer to such a question will depend on sample size. In
most cases, a standardized residual greater than |4.0| is very likely cause for concern, as it indicates that the regression equation predicted that case's observed dependent variable score very poorly.
Again, you need to keep sample size in mind here. Very large sample sizes (> 1000) will yield a
couple of standardized residuals as large as |4.0|, even when none of the cases were particularly
poorly predicted.
An alternative approach to evaluating the normality of residuals issue is to look for
“influential cases”. Such cases are those that have influenced the regression results very
substantially one way or the other (i.e., for better or for worse). Cook’s distance (or Cook’s D) is
a useful statistic to evaluate for this purpose and it is available in SPSS. As was the case with the
procedure to obtain residuals, use the ‘Save’ utility within the regression analysis option in SPSS.
Be sure the ‘Cooks’ option is selected within the ‘Distances’ section of the window. Once the
analysis is run, SPSS will produce a table entitled 'Residuals Statistics', as well as an additional column of data (named COO_1). It is often recommended that the maximum Cook's distance value to tolerate is 1. Values greater than 1 suggest an unusually large influence on the regression
results. As can be seen in the SPSS table entitled ‘Residuals Statistics’ below, the largest Cook’s
distance value was .40. Thus, arguably, there were no influential cases in the regression analysis,
which is a good thing (Watch 12.6: Cook’s Distance in SPSS).


Figure C12.2. Histogram of Standardized Residuals

If a value closer to 1 or larger is reported in the table, check the SPSS data file and
identify the case with the large Cook’s distance value. Try to understand why this case has been
associated with such a large influence on the results. You may consider deleting the case, if there
is a good reason to do so, and then re-running the regression analysis without the influential
case.
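Outside of SPSS, Cook's distance values can be obtained with the statsmodels library in Python. The sketch below assumes the education and earnings scores are available as arrays; the names are mine:

import statsmodels.api as sm

def cooks_distances(education, earnings):
    X = sm.add_constant(education)              # add the intercept term
    results = sm.OLS(earnings, X).fit()         # ordinary least squares regression
    cooks_d, _ = results.get_influence().cooks_distance
    return cooks_d                              # one value per case; values near or above 1 warrant scrutiny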

6: Independence of Errors
In practice, this assumption is a more common concern for researchers in economics, as
they often collect data across time (e.g., month to month; "time series data").

Independence of errors implies that there is no correlation in the size of the residuals across the
cases (i.e., the rows of data). Sometimes you can see that the residuals increase or decrease in
size across the cases/rows. The Durbin-Watson statistic can be used to evaluate this assumption. Theoretically, the Durbin-Watson statistic can range from 0 to 4, with
a value of 2 indicating the total absence of a serial correlation (which is a “good” thing for the
regression analysis). A value closer to 4 is indicative of a strong negative correlation. A value
closer to 0 is indicative of a strong positive correlation. When the option is selected, SPSS adds
a column of information at the end of the table entitled ‘Model Summary’:

Based on the education and earnings per day example, you can see that the Durbin-
Watson statistic was estimated at .100. SPSS does not test the Durbin-Watson statistic for
statistical significance. In this example, the Durbin-Watson statistic is very close to 0, which
would suggest a violation of the independence of errors assumption. However, in this example,
the data were not associated with a time-series nature, consequently, it is effectively impossible
to violate the independence of errors assumption. Instead, what happened is that I specifically
ordered the size of the residuals from smallest to largest simply to demonstrate the effect
(Watch 12.7: Durbin-Watson Statistic in SPSS). Personally, I would not even perform the Durbin-
Watson statistical analysis, in this sort of study. It has been performed simply for demonstration
purposes.
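For completeness, the Durbin-Watson statistic can also be computed outside of SPSS, for example from the saved residuals, as in the Python sketch below (the residuals are assumed to be supplied in the same case order as the data file):

from statsmodels.stats.stattools import durbin_watson

def check_independence(residuals):
    # Values near 2 indicate no serial correlation; values approaching 0 or 4
    # indicate strong positive or negative serial correlation, respectively.
    return durbin_watson(residuals)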

7. Homoscedasticity
Homoscedasticity is a bit of a mouthful to pronounce. It may seem like a rather
complicated statistical issue, however, it is not. This may become apparent when you realize
that homoscedasticity could be renamed homogeneity of variance, or homogeneity of variance
of residuals, specifically. Homoscedasticity implies that the regression equation is equally
predictive across all levels of the dependent variable. When this is true, the magnitude of the
residuals will be approximately equal across the continuum of the dependent variable. If the
regression equation “breaks down” at some point along the continuum, the residuals become
larger, which necessarily implies that the variance of the residuals becomes larger. Recall that
only one standard error of estimate is provided in a regression analysis. Consequently, the single
estimate has to be applicable across the continuum of the dependent variable. When the
assumption of homoscedasticity is violated, the single standard error of estimate cannot be
trusted to be accurate across the continuum of the dependent variable.


There are two broad approaches to the evaluation of the homoscedasticity assumption:
visually and statistically. Visually, one popular method is to plot the standardized residuals on
the Y-axis and the standardized predicted values on the X-axis in a scatter plot. There should be
an absence of any pattern in the residuals when the assumption of homoscedasticity is satisfied.
However, when a pattern emerges, then you have probably violated the assumption of
homoscedasticity. In this example, the assumption of homoscedasticity was likely satisfied, as
there was no discernible pattern in the scatter plot depicted in Figure C12.3 (Watch 12.8:
Evaluate Homoscedasticity in Scatter Plot).
In my experience, it is better to plot the absolute standardized residuals on the Y-axis,
rather than the non-absolute standardized residuals (i.e., negative and positive residuals).
Absolute, in this case, means that you need to strip away the negative signs associated with the
negative residuals. When the scatter plot “test” is applied with the absolute standardized
residuals, the pattern you are looking for is either a positive or negative correlation. Thus, the
magnitude of the variance in the residuals is either increasing (positive correlation) or
decreasing (negative correlation) when heteroscedasticity is present. By contrast, when
homoscedasticity is present, there should be no pattern in the residuals. As can be seen in Figure
C12.4, the corresponding scatter plot with the absolute standardized residuals on the Y-axis did
not reveal an obvious pattern, thus suggesting the absence of heteroscedasticity. However,
again, this is not exactly a proper, scientific approach to addressing the question of
homoscedasticity.

Figure C12.3. Residual Scatter Plot


Figure C12.4. Absolute Residual Scatter Plot

Pearson/Spearman Correlation Approach


Fortunately, homoscedasticity can be tested statistically. Perhaps the simplest method
to test the hypothesis of homoscedasticity is to run a Pearson (and/or Spearman) correlation
between the standardized predicted values and the absolute standardized residuals. If the
correlation is statistically significant, then the assumption of homoscedasticity has probably
been violated. In this example, the Pearson correlation between the standardized predicted
earnings per day scores and the absolute standardized residuals (ZRE_1_abs) was r = .178, p =
.272. The Spearman correlation was very similar at r = .156, p = .338. In this context, you would
probably be better off consulting the Spearman correlation, when you suspect the residuals are
not normally distributed. Either way, the absence of a statistically significant correlation in this
example suggests that the homoscedasticity assumption has been satisfied. In the Advanced
Topics section of this chapter, I illustrate the more popular Breusch-Pagan and Koenker tests of
homoscedasticity. I also illustrate an example where the homoscedasticity assumption has been
violated.
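The correlation "test" described above is easy to script. The Python sketch below assumes the standardized predicted values and standardized residuals have already been obtained (for example, the values SPSS saves); the function and argument names are placeholders:

import numpy as np
from scipy import stats

def homoscedasticity_correlations(z_predicted, z_residuals):
    abs_resid = np.abs(z_residuals)                        # strip the negative signs
    r, p_r = stats.pearsonr(z_predicted, abs_resid)        # Pearson version
    rho, p_rho = stats.spearmanr(z_predicted, abs_resid)   # Spearman version
    return (r, p_r), (rho, p_rho)                          # significant p-values suggest heteroscedasticity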

Bivariate Regression and Linearity


Many textbooks state that bivariate regression assumes linearity. This supposed
assumption implies that the association between the independent variable and the dependent
variable is constant across all levels of the independent variable. In my opinion, it is not quite
accurate to state that bivariate regression assumes linearity. It would be more accurate to state
that linear bivariate regression assumes linearity, which should be completely obvious. It is true that if there is a curvilinear effect within the data and linear bivariate regression is used to build a
regression equation to predict the dependent variable, then some substantial prediction errors

will be made at that point of the data in which the curvilinearity appears (and beyond). However,
it seems a bit obvious that linear bivariate regression will only work properly when the nature
of the association is linear. All is not lost, however, as curvilinear bivariate regression also exists
and should be considered for application when the association between the independent and
dependent variables is suspected not to be linear.
In practice, I suspect that the homoscedasticity assumption will be violated when the
association between the independent variable and the dependent variable is not strictly linear.
To evaluate the possibility that there may be a curvilinear element to the association between
the independent and dependent variable requires the addition of an extra independent variable
term to the model. To conduct the regression analysis requires knowledge of multiple
regression, as bivariate regression cannot accommodate the additional independent variable
term. Consequently, I describe curvilinear regression in the chapter devoted to multiple
regression.

A note on Regression and Causality


Bivariate regression is very closely related to Pearson correlation. Consequently, you
may not be surprised to learn that it also suffers from the same limitations, namely, “regression
does not imply causation".1 Consequently, a person who reads this chapter and decides to complete one extra year of education cannot be guaranteed to earn an extra $2,777 per year.
There are likely a number of other factors that account for the fact that there is an association
between years of education completed and earnings. In statistics, this is known as the “third
variable problem”. In the chapter on multiple regression, I examine the possibility that IQ may
be the driving force behind the effect between years of education completed and earnings.

1
Truth be told, no statistic implies causation. It’s the research design that may allow for causal
interpretations.

Advanced Topics

Why is it called the ‘line of best fit’?


In panel B of Figure C12.1, a straight, upward sloping line was included in the scatter plot. This line represents the slope, which, in this example, was equal to 7.609. Thus, a one unit increase in years of education completed was associated with a 7.609 unit (dollar) increase in per day
earnings. A positive slope implies an upward (left to right) line of best fit. A negative slope
implies a downward (left to right) line of best fit. The reason the line is known as the line of best
fit is that the line represents the smallest sum of the squared deviations between the line and
each of the observed values. So, if you took the difference between each observed value in the scatter plot and the corresponding value on the line of best fit, squared those differences, and summed them, you would get the smallest
sum possible, in comparison to any other straight line you could imagine. That’s why it is called
the line of best fit (Watch 12.9: Line of Best Fit in SPSS).
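The "smallest sum of squared deviations" property can be demonstrated directly: compute the sum of squared vertical deviations around the least-squares line and around any rival line, and the least-squares line always produces the smaller sum. A small Python sketch (with placeholder arrays) follows:

import numpy as np

def sum_sq_dev(x, y, intercept, slope):
    # Sum of squared vertical deviations between the observed values and a straight line
    return np.sum((y - (intercept + slope * x)) ** 2)

def compare_lines(x, y):
    slope, intercept = np.polyfit(x, y, 1)                # least-squares (best fitting) line
    best = sum_sq_dev(x, y, intercept, slope)
    rival = sum_sq_dev(x, y, intercept + 5, slope * 1.2)  # any other line you care to try
    print(best <= rival)                                  # True: the least-squares line wins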

Bivariate Regression in SPSS: Intercept and Slope - Bootstrap


In this example, the data were relatively normally distributed, as both the income and
years of education variables were associated with skew and kurtosis much less than |2.0| and
|9.0|, respectively. However, if one or both of the variables were associated with substantial
levels of skew and/kurtosis, one could easily overcome the problem by estimating the
confidence intervals via bootstrapping (assuming you have the SPSS bootstrap module). To
estimate the unstandardized bootstrapped 95% confidence intervals associated with the
unstandardized intercept and unstandardized regression weight (i.e., slope), one can use the
‘Bootstrap’ utility, if you have access to the Bootstrap module in SPSS (Watch 12.10:
Bootstrapped Bivariate Regression in SPSS).
As can be seen in the SPSS table below entitled ‘Bootstrap for Coefficients’, the
bootstrapped 95% confidence intervals associated with the unstandardized regression weight
corresponded to -.508 and 14.238. These values are similar to the normal theory 95% confidence
intervals reported above: .630 to 14.588. The confidence intervals are similar because the data
were roughly normally distributed. However, there is one key difference between the normal
theory and bootstrapped confidence intervals: the lower-bound estimate from the
bootstrapped analysis intersected with zero. Consequently, the bootstrapped results suggested
that the association was not statistically significant, p = .056.
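If you do not have access to the SPSS bootstrap module, the general idea is simple to script. The sketch below implements a basic percentile bootstrap of the unstandardized slope in Python/NumPy; note that SPSS offers other variants (e.g., bias-corrected and accelerated intervals), so the numbers will not match the output above exactly:

import numpy as np

def bootstrap_slope_ci(x, y, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)              # resample cases with replacement
        slopes[i] = np.polyfit(x[idx], y[idx], 1)[0]  # slope for the resampled data
    return np.percentile(slopes, [2.5, 97.5])         # percentile 95% confidence interval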


To obtain the confidence intervals for the standardized regression weight (i.e., β = .337),
the education and income per day variables need to be converted into z-scores first. Then, the z-
scores need to be entered into the bivariate regression analysis. Such a procedure is essentially
tricking SPSS into doing something it wasn’t designed to do, but the results will be accurate.2
As can be seen in the table below, the bootstrapped 95% confidence intervals associated
with the standardized regression weight correspond to -.033 and .611. Furthermore, the
bootstrapped p-value was estimated at .068, which would suggest the association between
education and earnings per day was not significant statistically.

Evidently, at least on the surface, the normal theory p-value of .033 is at odds with the
bootstrap estimated p-value of .068. That is, the normal theory p-value was statistically
significant (p< .05) and the bootstrapped p-value was not (p> .05). One of the reasons the p-
values were not exactly the same is that the earnings per day variable was associated with a
moderate level of skew (i.e., .911). Normal theory estimation assumes the data are associated
with normal distributions. However, as described further below, normal theory estimation is
relatively (but not perfectly) robust to violations of normality. In absolute numbers, the
difference between .068 and .033 is only .035. They are essentially telling the same story. That
is, the normal theory estimation approach implied that there was a 3.3% chance that, if the
null hypothesis were rejected, one would be fooling oneself. By contrast, the bootstrapped
approach implies that there was a 6.8% chance that, if we rejected the null hypothesis, we would

2
Note that estimating the normal theory 95% confidence intervals for the standardized beta weight
based on the z-scores would only produce an approximate estimate, consequently, it could only be
recommended with caution. See the chapter on correlation for a proper method to estimate normal
theory 95% confidence intervals for a Pearson correlation (which corresponds to a bivariate regression
standardized coefficient).

be fooling ourselves. It is for this reason that many researchers criticize the statistical convention that is fixated on observing p < .05, which is ultimately a fairly arbitrary value.
In the context of a real-life study, however, a decision would have to be made with
respect to whether the normal theory results should be reported or the bootstrapped results. I
fear all too often researchers pick and choose the results that coincide with their desires. It
should be kept in mind that a sample size of only 40 was used in this example. Therefore, the
results would not be expected to be particularly stable. Fortunately, the Zagorsky (2003) study
was actually based on a very large sample size of 5,000+ participants. Consequently, the point-
estimates reported in this example (b = 7.609; β = .337) should be considered trustworthy, as
they are essentially the same values reported in Zagorsky (2003).

Breusch-Pagan and the Koenker test


More sophisticated approaches to the testing of the homoscedasticity assumption
include the Breusch-Pagan and Koenker tests. The two tests are very similar to each other. In
fact, the Koenker test is simply a modification of the Breusch-Pagan test. The main difference
between the two tests is that the Koenker test was developed to be more powerful, as well as
more robust to violations of normality in the distribution of the error terms (Koenker, 1981;
Koenker & Bassett, 1982). As you will see in the additional example below, where
homoscedasticity was violated, the two tests can yield very different results.
Unfortunately, neither of the two tests can be applied in SPSS using the menu options.
Fortunately, however, Ahmad Daryanto has created an SPSS macro that can be used to test the
assumption of homoscedasticity statistically. To do so, you need to open SPSS as an
administrator, install the macro, and run the analysis through the added regression menu
option.3 This may sound difficult, but it is not, assuming you can run SPSS as an administrator
(Watch 12.11: Breusch-Pagan & Koenker Test in SPSS).
If you are successful at installing the SPSS macro and run the analysis with the education
and income data, you will see that the Breusch-Pagan and the Koenker test both suggest that
the assumption of homoscedasticity has been satisfied. Specifically, Breusch-Pagan, Lagrange-
Multiplier (LM) = .647, p = .421, and Koenker, Lagrange-Multiplier (LM) = .733, p = .392. Thus, as
the p-values were greater than .05, we can interpret the result as a failure to reject the null
hypothesis. In this case, the null hypothesis is the presence of homoscedasticity, and we do not "want" to reject it.

3
If you can’t run SPSS as administrator, I’m not sure there is an attractive option that will work
in SPSS consistently. However, you can rely upon the Pearson or Spearman correlation analysis
between standardized predicted values and absolute standardized residuals demonstrated in
the preceding section.
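If running the macro is not an option, a Breusch-Pagan-type Lagrange multiplier test is also available in Python via statsmodels, as sketched below. Depending on the implementation and options, the result may correspond to the original or to the studentized (Koenker) form, so expect values close to, but not necessarily identical to, the macro's output:

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

def breusch_pagan(education, earnings):
    X = sm.add_constant(education)
    results = sm.OLS(earnings, X).fit()
    lm, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, results.model.exog)
    return lm, lm_pvalue      # p < .05 suggests heteroscedasticity (rejects the null of homoscedasticity)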

It should be noted that these statistical tests of homoscedasticity are affected by sample
size. That is, a small sample size will make it difficult to obtain a p-value less than .05, all other
things equal. In this example, the sample size was only 40, consequently, although all of the
homoscedasticity tests suggested the presence of homoscedasticity, a larger sample size may have detected statistically significant heteroscedasticity.
Finally, I’ll note one more statistical test of homoscedasticity that some researchers
prefer. It is known as White’s test of homoscedasticity. If you’re interested in White’s test, there
is some syntax to conduct such an analysis here: http://www-
01.ibm.com/support/docview.wss?uid=swg21476748.
If your data do violate the assumption of homoscedasticity, you cannot trust the p-values
associated with the beta-weights. However, you can adjust the standard errors and p-values
with a procedure developed by Hayes and Cai (2007).

Adjusting Beta-Weight Standard Errors


The problem with the presence of heteroscedasticity in the data is that you can no
longer trust the standard errors associated with the beta weights. Fortunately, if the assumption
of homoscedasticity has been violated, it is possible to obtain adjusted standard errors that are
accurate. For example, Hayes and Cai (2007) provided a method and SPSS macro that provides
adjusted standard errors for regression analyses that are associated with statistically significant
heteroscedasticity. You can download the SPSS macro here: https://tinyurl.com/ya5939sy
Although the assumption of homoscedasticity was satisfied based on the years of
education and earnings study data, I have nonetheless applied the Hayes and Cai (2007) SPSS
macro for didactic purposes. In the video, you will see that, in addition to running the macro, I
ran the following syntax (Watch 12.12: Heteroscedastic Corrected Standard Errors in SPSS):

hcreg dv = earnings_per_day/ iv = education

which yielded the following SPSS output:


As can be seen above, the unstandardized beta weight was estimated at b = 7.61.
Furthermore, the adjusted standard error was estimated at 4.3445, t = 1.7515, and p = .0879.
Recall that, when I first ran the bivariate regression analysis without any consideration for
heteroscedasticity, the normal theory unstandardized beta weight was estimated at b = 7.61
(exactly the same, as expected), but the standard error was estimated at 3.447, which is smaller
than the adjusted standard error of 4.3445. Consequently, in this case, the t-value decreased in
magnitude to t = 1.7515 and the p-value increased in magnitude to .0879. Furthermore, the 95%
confidence intervals for the unstandardized beta weight corresponded to:
interval = 4.3445 * 1.96 = 8.515
95% lower-bound = 7.6094 - 8.515 = -.9056
95% upper-bound = 7.6094 + 8.515 = 16.12.
In practice, violating the assumption of homoscedasticity results in ordinary least squares standard errors that are both biased and inconsistent. This means that you don't know whether you have less or more statistical power in applying the OLS estimation approach in any particular case. Consequently, sometimes adjusting the standard errors via the Hayes and Cai (2007) macro will help you reject the null hypothesis and sometimes it will do the opposite. Ultimately, if you have violated the assumption of homoscedasticity, you should estimate the adjusted standard errors, irrespective of whether such a decision results in greater or lesser statistical
power. I’m not sure if there is any loss in power applying the Hayes and Cai (2007) adjusted
standard errors when the data are associated with perfect homoscedasticity. If there were no such loss, then it would make sense to use the adjusted standard errors in all cases.
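Heteroscedasticity-consistent standard errors of the kind discussed by Hayes and Cai (2007) are also implemented in statsmodels. The sketch below requests the HC3 estimator (one of several HC variants, and not necessarily the macro's default), again assuming the two score arrays are available:

import statsmodels.api as sm

def hc_regression(education, earnings):
    X = sm.add_constant(education)
    results = sm.OLS(earnings, X).fit(cov_type="HC3")   # heteroscedasticity-consistent standard errors
    return results.params, results.bse, results.pvalues, results.conf_int()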
It is important to note that Hayes and Cai's (2007) technique may be sensitive to non-normally distributed data. When the data are non-normally distributed, and
the results are associated with heteroscedasticity, the Wild bootstrap is an attractive approach
to the estimation of adjusted beta weight standard errors, p-values, and confidence intervals. I
demonstrate the application of the wild bootstrap in the chapter devoted to multiple regression
(Chapter 14).
Finally, it is important to acknowledge that heteroscedasticity is not simply a problem
to overcome with respect to estimating confidence intervals and p-values for a predictor in a


regression equation. The presence of heteroscedasticity may be telling you something


important about your study, just like heterogeneity of variance between two groups in the
between-subjects ANOVA case may be interesting theoretically. Consequently, you should
always entertain the question of why your data are associated with heteroscedasticity, if they
are. In many cases, the presence of heteroscedasticity may be suggestive of a curvilinear
association between the independent variable and the dependent variable. You can evaluate
such a possibility with curvilinear regression, which is covered in the chapter on multiple
regression (Chapter 14).

An Example of Heteroscedasticity
In the Foundations section of this chapter, the assumption of homoscedasticity was
satisfied for the years of education and earnings example. Consequently, the normal theory
bivariate regression results outputted by SPSS should be considered accurate. However, for the
purposes of illustration, I have re-simulated some data (N = 25) such that the assumption of
homoscedasticity has been violated (Data File: education_earnings_IQ_hetero).
Prior to conducting the regression analysis, I performed an examination of the
descriptive statistics using SPSS’s ‘Frequencies’ utility. As can be seen in the SPSS table below,
the mean income per day was equal to $84.60 and the mean education was 13.36 years. The
income per day variable was skewed positively (.577) to some degree, as expected.

Next, I conducted a regular linear regression analysis with the ‘Linear’ menu option in SPSS
(Watch 12.13: Heteroscedastic Data Example in SPSS). I report below only the key table with the
beta weight and standard error.


It can be seen in the SPSS table entitled ‘Coefficients’ that the unstandardized beta
weight was estimated at 9.95 with a corresponding t-value of 2.051, which was not statistically
significant, p = .052. A researcher may be disappointed to see a p-value so close to .05.
However, an examination of the standardized predicted value (X-axis) and absolute
standardized residual (Y-axis) scatter plot (see Figure C12.5) suggested that the standardized
residuals became progressively larger across the spectrum of the standardized predicted values
(i.e., positive association).

Figure C12.5. Homoscedasticity Scatter Plot

In fact, the Spearman correlation between the standardized predicted values and the
absolute standardized residuals was r = .56, p = .004, which suggested that the assumption of
homoscedasticity was violated. Furthermore, the Koenker test also supported the notion that
the assumption of homoscedasticity was violated: LM = 5.49, p = .019. In light of these results,
the regression was re-performed with the Hayes and Cai (2007) adjusted standard error macro
(see above for a description of the procedure to use the macro). Once the macro had been run, the
following syntax was run (Watch 12.14: Heteroscedastic Corrected Standard Errors in SPSS (Take
2)):

hcreg dv = income_per_day/ iv = education

which yielded the following results:

It can be seen that the unstandardized beta weight was exactly the same (b = 9.95), as
expected. However, the corrected standard error (SE) was somewhat larger (SE = 5.18) than the
original standard error estimate (SE = 4.85). Additionally, the t-value was smaller (t = 1.92), and
the p-value was larger at .067. Thus, the heteroscedasticity-adjusted standard error regression
analysis was consistent with the original analysis: there was no statistically significant
association between years of education and income per day in this sample (N = 25).
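
The same diagnostic-and-correction logic can be sketched outside SPSS. The code below is not
the Hayes and Cai (2007) macro itself; it uses statsmodels' Breusch-Pagan-type LM test and an
HC3 heteroscedasticity-consistent covariance estimator (one common choice of robust estimator),
and the file and column names are assumptions.

import numpy as np
import pandas as pd
from scipy.stats import spearmanr
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("education_earnings_IQ_hetero.csv")   # hypothetical file name
fit = smf.ols("income_per_day ~ education", data=df).fit()

# 1. Spearman check: predicted values vs. absolute residuals
#    (ranks are unaffected by standardization, so raw values suffice)
rho, p = spearmanr(fit.fittedvalues, np.abs(fit.resid))
print("Spearman rho:", round(rho, 2), "p:", round(p, 3))

# 2. Breusch-Pagan-type LM test (the studentized form is the Koenker test)
lm, lm_p, f_val, f_p = het_breuschpagan(fit.resid, fit.model.exog)
print("LM:", round(lm, 2), "p:", round(lm_p, 3))

# 3. Refit with heteroscedasticity-consistent (HC3) standard errors
robust = smf.ols("income_per_day ~ education", data=df).fit(cov_type="HC3")
print(robust.bse)
print(robust.pvalues)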

When p-values Contradict Each Other: Don’t Despair


The bivariate regression between years of education and earnings was examined with three
estimation techniques in this chapter. As the three approaches yielded different results (p-
values), the situation may seem overwhelming to a student: one approach (normal theory)
yielded a p-value less than .05, whereas two approaches (bootstrapped and heteroscedasticity
corrected) yielded p-values greater than .05.
However, as can be seen in Table C12.1, it should be appreciated that the results were
actually very similar. For example, the upper bound of the 95% confidence interval for the
unstandardized beta weight was estimated at 14.59, 14.24, and 16.12 across the three
estimation approaches. The lower bounds of the 95% confidence intervals across the three
estimation methods were also similar, in that they were all very close to zero.


A good question to ask is: which method yielded the most accurate estimate?
Unfortunately, there is no conclusive answer to such a question. Ideally, a researcher would
build a case for one method over another, with reference to analytical and empirical (simulation)
research. Based on a sample size of 40 and positively skewed data (skew = 2.8), Bishara and
Hittner (2012) found that the normal theory Pearson correlation (whose p-value is identical to
that of the bivariate regression slope) was associated with a type I error rate of .048, which is
very close to the nominal alpha level of .05. By contrast, the bootstrapped (bias-corrected and
accelerated) estimation approach was associated with a type I error rate of .080.4 Consequently,
I would report the normal theory results and reject the null hypothesis in this case. That is, the
slope between education and earnings was statistically significant. Overall, though, it should be
emphasized that the sample size of 40 was small and does not offer much opportunity to yield
research results with a high level of confidence. Importantly, the study upon which the data in
this example were simulated (i.e., Zagorsky, 2007) actually had a sample size greater than 5,000.
Therefore, the .337 effect between education and earnings was unambiguously statistically
significant in Zagorsky (2007). I used a sample size of 40 in this example to allow for larger (i.e.,
not all p < .001) and different p-values across the different estimation methods. In practice,
researchers all too often find themselves dealing with such small sample sizes.

Table C12.1: Unstandardized Beta Weight and 95% Confidence Intervals: Three Estimation
Approaches

                       L-B     P-E      U-B       p
Normal Theory          .60    7.61    14.59    .033
Bootstrapped          -.51    7.61    14.24    .068
H-C                   -.91    7.61    16.12    .088

Note. N = 40; H-C = Heteroscedasticity Corrected; L-B = lower-bound; P-E = point-estimate;
U-B = upper-bound.

4 Personally, I find this result surprising, given that bootstrapping is widely considered to be unaffected
by non-normality. However, one must accept the results as they are published, in the absence of any
contrary evidence.
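
For completeness, here is a sketch of how a bootstrap confidence interval for the slope could be
computed outside SPSS. Note that this is a simple percentile bootstrap, whereas the SPSS analysis
reported above used the bias-corrected and accelerated (BCa) interval, so the numbers would not
match Table C12.1 exactly; the file name and column names are assumptions.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("education_earnings.csv")   # hypothetical export of the N = 40 data file
rng = np.random.default_rng(0)

slopes = []
for _ in range(2000):
    boot = df.sample(n=len(df), replace=True, random_state=rng)   # resample cases with replacement
    slopes.append(smf.ols("income_per_day ~ education", data=boot).fit().params["education"])

lower, upper = np.percentile(slopes, [2.5, 97.5])
print(f"95% percentile bootstrap CI for b: [{lower:.2f}, {upper:.2f}]")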

Practice Questions
The questions below are all very practical. They require you to read a summary of a
research scenario (based on a genuine scientific publication). Then, after opening the data file,
you need to test the specified hypothesis. Naturally, as these are ‘real-world’ research scenarios,
you should concern yourself with the assumptions associated with the particular statistical
analyses you decide to employ.
How should you go about answering these questions, while dealing with assumptions?
Very good question. I have created a list of steps for you to follow to help you generate a
statistical analytic strategy that should lead you to a defensible answer in virtually all cases. The
steps are essentially a breakdown of all of the key pieces of information I described in the
Foundations section of this chapter (i.e., all the material preceding the 'Advanced' section). You
can watch and listen to me walk you through these important steps (Watch 12.15: Bivariate
Regression - Steps). Once you have reviewed the steps, have a go at answering the 12 specific
questions below. The research scenarios are all based on scientific publications, so, in addition
to practicing your statistical skills, I hope you learn some scientific “facts” along the way.

For each of the practice exercises below, answer the following specific questions (a brief sketch
of how these quantities can be computed appears after the list):

1. What is the model R?
2. What is the R2? What is the adjusted R2?
3. What is the standard error of the estimate?
4. What is the interpretation of the standard error of the estimate?
5. What is the intercept?
6. What is the interpretation of the intercept?
7. What is the unstandardized beta weight?
8. What is the interpretation of the unstandardized beta weight?
9. What is the standardized beta weight?
10. What is the interpretation of the standardized beta weight?
11. Was the beta-weight statistically significant?
12. What are the 95% CIs associated with the unstandardized beta-weight?
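
If you prefer to check your SPSS answers with another tool, the sketch below shows where each
of these quantities can be found in a Python (statsmodels) regression fit. The names 'outcome',
'predictor', and the file name are placeholders to be replaced by the variables in whichever
practice data file you open.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("practice_data.csv")                         # placeholder file name
fit = smf.ols("outcome ~ predictor", data=df).fit()

print("R:", np.sqrt(fit.rsquared))                            # Q1 (with one predictor, R = |r|)
print("R2:", fit.rsquared, "adjusted R2:", fit.rsquared_adj)  # Q2
print("SE of estimate:", np.sqrt(fit.mse_resid))              # Q3
print("intercept:", fit.params["Intercept"])                  # Q5
print("b (unstandardized):", fit.params["predictor"])         # Q7
beta = fit.params["predictor"] * df["predictor"].std() / df["outcome"].std()
print("standardized beta:", beta)                             # Q9
print("p-value:", fit.pvalues["predictor"])                   # Q11
print("95% CI for b:", fit.conf_int().loc["predictor"].values)  # Q12

# Predicted value for a new case, e.g. predictor = 13
print(fit.predict(pd.DataFrame({"predictor": [13]})))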

1: Can depression in women be predicted by hemoglobin concentration?


Approximately 10% of adult women have an iron deficiency (Looker et al., 1997). Iron
has been implicated in the production of energy in humans. Corwin et al. (2003) hypothesized
that hemoglobin concentration levels in the blood (iron in the diet increases hemoglobin) may
be able to predict self-reported depression in women. To test the hypothesis, they collected
data from 37 women. Specifically, they measured hemoglobin through a pin-prick blood test.
Additionally, the 37 women completed the 20-item CES-D, which is a measure of depression. I
have simulated some data to correspond very closely to the results reported by Corwin et al.
(2003). Conduct a bivariate regression analysis on the data (Data File: hemoglobin_depression)
(Watch 12.16: Bivariate Regression - Hemoglobin & Depression (Practice 1)).

2: Predicting actual height with self-reported height.


In my experience, people report themselves to be taller than they really are. This
phenomenon is particularly noteworthy on online dating sites! Stewart et al. (1987) collected
data on height and weight for 1,523 adults. Conduct a bivariate regression analysis with actual
height as the dependent variable and self-reported height as the independent variable. In
addition to the 12 questions listed at the beginning of this section, what would you predict
someone’s height to actually be, if they said they were 183 cm tall? (Data File:
self_actual_height) (Watch 12.17: Bivariate Regression - Self-Reported and Actual Height
(Practice 2)).

3: Do individual differences in 4-year-olds’ capacity to delay gratification predict academic achievement?


In a classic study in psychology, Shoda, Mischel, and Peake (1990) collected data on
individual differences in the capacity to delay gratification in 4-year-olds. Specifically, children
were given one serving of a preferred treat (marshmallow, Oreo cookie, or pretzel stick). After
showing the treat to the child, the experimenter said that the child could have the one treat
now, or, if the child could wait until the experimenter returned, the child could have two of the
treats. The amount of time the child could wait was considered his/her capacity to delay
gratification. The experimenter usually returned in 15 minutes. Fourteen years later, Shoda et
al. (1990) collected SAT (university entrance examination) data for the same, now adult,
participants. Conduct a bivariate regression analysis such that delay of gratification is the
predictor variable and SAT scores are the dependent variable. In addition to the 12 questions
listed at the beginning of this section, what would you predict a child’s SAT score to be in the
future, based on a delay of gratification score of 45 seconds? (Data File: delay_gratification)
(Watch 12.18: Bivariate Regression - Delay Of Gratification & SAT (Practice 3))

4: How well can a person’s weight be predicted by their height?


Lamont and Lundstrom (1977) collected data on the height and weight of 143 salespeople. Not
surprisingly, the association between adult height and adult weight is not perfect. Nonetheless,
answer all 12 of the key bivariate regression questions with the data. In addition to the 12
questions listed at the beginning of this section, if someone were 173.02 cm tall, what would
you predict their weight to be? (Data File: height_weight) (Watch: Height and Weight)


5: The alarming time costs of travelling to and from work.


For this practice question, the data were not derived from a sample; consequently, it is
not necessary to calculate confidence intervals or a p-value. Furthermore, the association
between the two variables of interest is perfect. Consequently, it is not necessary to run a full
bivariate regression analysis. Instead, create a scatter plot with the data. Use
'minutes_one_way' as the predictor variable (X-axis) and 'weeks_per_year' as the dependent
variable (Y-axis). What you will discover from this analysis is the number of work weeks
(i.e., 40-hour weeks) people spend travelling to and from work/school per year, based on the
number of minutes it takes them to get to work/school every day. I suspect you may be alarmed
at the amount of time people are spending in a car/bus/train. (Data File: time_travelling)
(Watch: Time Travel).
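
If you would like to verify what you see in the scatter plot, the underlying arithmetic can be
sketched as follows. Note that the figure of 240 commuting days per year is my own assumption
for illustration; the data file may be based on a different figure.

# Sketch of the conversion; 240 commuting days per year is an assumption
def work_weeks_per_year(minutes_one_way, commute_days=240):
    hours_per_year = minutes_one_way * 2 * commute_days / 60   # round trips, converted to hours
    return hours_per_year / 40                                  # expressed as 40-hour work weeks

for m in (15, 30, 60):
    print(m, "minutes one way ->", round(work_weeks_per_year(m), 1), "work weeks per year")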


References

Corwin, E. J., Murray-Kolb, L. E., & Beard, J. L. (2003). Low hemoglobin level is a risk factor for
postpartum depression. The Journal of Nutrition, 133(12), 4139-4142.
Lamont, L. M., & Lundstrom, W. J. (1977). Identifying successful industrial salesmen by
personality and personal characteristics. Journal of Marketing Research, 517-529.
Looker, A. C., Dallman, P. R., Carroll, M. D., Gunter, E. W., & Johnson, C. L. (1997). Prevalence
of iron deficiency in the United States. JAMA, 277(12), 973-976.
Shoda, Y., Mischel, W., & Peake, P. K. (1990). Predicting adolescent cognitive and self-regulatory
competencies from preschool delay of gratification: Identifying diagnostic conditions.
Developmental Psychology, 26(6), 978-986.
Stewart, A. W., Jackson, R. T., Ford, M. A., & Beaglehole, R. (1987). Underestimation of relative
weight by use of self-reported height and weight. American Journal of Epidemiology,
125(1), 122-126.
Zagorsky, J. L. (2007). Do you have to be smart to be rich? The impact of IQ on wealth, income
and financial distress. Intelligence, 35(5), 489-501.
