12
Bivariate Regression
Contents

Introduction: Bivariate Regression Equation
    Slope
    Intercept
    Bivariate Regression in SPSS: Intercept and Slope
Scatter Plots
Regression Equation: Detailed
    Regression, Error, and Residuals
    Residuals: Illustration with SPSS
Assumptions
    Pearson/Spearman Correlation Approach
    Bivariate Regression and Linearity
    A note on Regression and Causality
Advanced Topics
    Why is it called the 'line of best fit'?
    Bivariate Regression in SPSS: Intercept and Slope - Bootstrap
    Breusch-Pagan and the Koenker test
    Adjusting Beta-Weight Standard Errors
    An Example of Heteroscedasticity
    When p-values Contradict Each Other: Don't Despair
Practice Questions
References
Introduction: Bivariate Regression Equation
In this example, I would like to predict a person’s earnings per day based on the years
of education they have completed. In regression, it is conventional to refer to the predicted
variable as Y and the predictor variable as X. Thus, in the context of regression, if someone were
to say that the X variable was not normally distributed, you would assume that they were
referring to the predictor variable. There is no logical reason for this; it is simply convention.
The bivariate regression equation consists of two primary elements: (1) a slope, and (2)
an intercept.
Slope
The slope is closely related to the Pearson correlation. Specifically, the slope represents
the amount of change in Y as a function of a unit increase in X. The value of a slope can either
be positive or negative, in the same way that a Pearson correlation can be positive or negative.
Consequently, based on the direction of the slope, the change in Y may be an increase or a
decrease. It is also important to know that the slope can either be unstandardized or
standardized. Again, there is a connection to the Pearson correlation here. The Pearson correlation is a standardized representation of the association between two variables (i.e., it is bounded by -1.0 and 1.0). By contrast, the covariance is an unstandardized representation of the association between two variables (it is unbounded). Correspondingly, the standardized slope is "essentially" bounded by -1.0 and 1.0, and the unstandardized slope is unbounded. I wrote "essentially" because a standardized slope can actually exceed |1.0|, although such values are very rare, and they can occur only in multiple regression (see Chapter 14), not bivariate regression. I will treat the
unstandardized slope first, although the formula is the same for both the standardized and
unstandardized slopes. Typically, an unstandardized slope is represented as b. Below is the
formula for the unstandardized slope:
b = r × (SDY / SDX)   (1)
where r = Pearson correlation, SDY = the standard deviation associated with the raw scores from
the Y variable, and SDX = the standard deviation associated with the raw scores from the X
variable. Thus, b can be obtained by multiplying a Pearson correlation by the ratio of the raw
score standard deviation of the predicted (dependent) variable to the raw score standard
deviation of the predictor (independent) variable. It is important that the SD from the Y (predicted) variable is divided by the SD from the X (predictor) variable, and not the other way around.
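To make Equation 1 concrete, here is a minimal Python sketch of the slope calculation, using the correlation and standard deviations reported for the education and earnings example (the variable names are mine, not SPSS's):

```python
# Equation 1: the unstandardized slope is the Pearson correlation
# rescaled by the ratio of the raw-score standard deviations.
r = 0.337      # Pearson correlation between education and earnings per day
sd_y = 69.188  # SD of the predicted (Y) variable: earnings per day
sd_x = 3.065   # SD of the predictor (X) variable: years of education

b = r * (sd_y / sd_x)  # unstandardized slope
print(round(b, 3))     # ~7.607: one extra year of education predicts ~$7.61 more per day
```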
In the current example, the Pearson correlation between years of education completed
and earnings per day was .337. Furthermore, the independent variable and dependent variable
standard deviations were: SDX = 3.065 and SDY = 69.188, respectively (Data File:
The standardized slope has a similar interpretation to the unstandardized slope, except
that it is expressed in standard deviation units. Specifically, a one standard deviation increase in X is associated with an increase or decrease in Y equal to some proportion of a standard deviation, where that proportion is equal to the standardized slope. In the unlikely case of a standardized slope equal to 1.0, it
would imply that a one standard deviation increase in X is associated with a one standard
deviation increase in Y. In this example, recall that the years of education completed standard
deviation was equal to 3.065 and the earnings per day standard deviation was equal to 69.188.
Therefore, a 3.065 increase in years of education completed would be expected to be associated
with, on average, a $23.316 increase in earnings per day (.337 * 69.188 = 23.316). On an annual
basis, that would work out to an earnings increase of $8,510.34. Across a 40 year work career,
the increase in earnings would amount to $340,413, on average.
Intercept
The intercept may be regarded as the foundation of the regression equation, as it serves as the equation's starting point. The intercept represents the predicted value
of Y when X is zero. In the current example, the intercept represents the expected amount of
earnings per day, if a person had 0 years of education completed. The intercept can also be
discussed in the context of the line of best fit, which I present further below in the context of
scatter plots.
In practice, the estimated intercept value can be absurd, that is, essentially impossible.
For example, if someone conducted a regression analysis with adult human weight as the
dependent variable and adult human height as the independent variable, it is very possible that
the regression analysis would estimate a negative intercept. In fact, if you do the Practice
Questions in this chapter, you will learn about a study where the intercept was -228 pounds (or
-103 kg). That’s right, negative body weight, which is impossible for a human to weigh, no matter
what diet the person may be following. Recall that the intercept represents the value of Y
(weight) when X (height) is zero. Although the total absence of height is also impossible, the regression analysis will nonetheless estimate an intercept value so that the regression equation can be solved. It's totally fine if an impossible intercept value is estimated by the regression analysis.
It is only occasionally the case that researchers are actually interested in interpreting the
intercept substantively. Typically, they just need it to make predictions through the regression
equation.
The intercept is often symbolized as a (or α) and may be formulated as:

a = Ȳ − bX̄   (2)
Thus, the intercept equals the mean of the Y variable scores minus the product of the slope and the mean of the X variable scores. Just as the slope can be expressed in unstandardized and standardized forms, the intercept can also be expressed both ways. However, in standardized form, the intercept is
always equal to zero. This is true because the mean of z-scores is always zero. Stated alternatively, if you solve the intercept equation above with z-scores, the Ȳ and X̄ values will be zero, which will result in an intercept value of zero, no matter what the slope is estimated to be. In the unstandardized case, however, the intercept can take on any value. Based on an unstandardized slope of 7.607, a years of education completed mean of 13.80, and an earnings per day mean of 118.986, the unstandardized intercept in this example works out to:

a = 118.986 − 7.607 × 13.80
a = 118.986 − 104.977
a = 14.009
Recall that an intercept represents the expected value of Y (earnings per day) when X (years of
education completed) is zero. Thus, in this example, a person with zero years of education would
be expected to earn, on average, $14 per day. Stay in school!
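As a quick check on the hand calculation above, Equation 2 can be sketched in Python (again, the names are mine):

```python
# Equation 2: the intercept from the variable means and the slope.
mean_y = 118.986  # mean earnings per day
mean_x = 13.80    # mean years of education completed
b = 7.607         # unstandardized slope from Equation 1

a = mean_y - b * mean_x
print(round(a, 3))  # ~14.009: predicted earnings per day at 0 years of education
```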
Bivariate Regression in SPSS: Intercept and Slope

Refer to the left side of the SPSS table labeled 'Unstandardized Coefficients'. You can see
that ‘education’ appears at the bottom of the table and that it was associated with an
unstandardized slope of 7.609, which is nearly identical to the estimate of 7.607 obtained by
hand above. The difference is due simply to rounding. SPSS also provided a standard error for
the unstandardized slope, which is used to test the unstandardized slope for statistical
significance. If you divide the slope point estimate by its standard error, you get the t-value:
7.609 / 3.447 = 2.207, which is associated with a p-value of .033. Because .033 is less than .05, the unstandardized slope is interpreted as statistically significant, i.e., sufficiently unlikely to be due to chance fluctuations. Correspondingly, the normal theory 95% confidence interval associated with the slope ranged from .630 to 14.588. Thus, if this study were re-conducted a number of times with different participants, we would expect an extra year of completed education to be associated with somewhere between $.63 and $14.59 extra earnings per day, with 95% confidence. That's really quite a large range in slope estimates. However, such a large range in the slope estimates should be expected, because the sample size upon which this analysis was based was only 40. Greater sample sizes give us greater confidence.
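For readers who want to reproduce these numbers outside SPSS, here is a hedged sketch in Python of the t-value, two-tailed p-value, and normal theory 95% confidence interval, assuming df = N − 2 = 38 for a bivariate regression:

```python
from scipy import stats

b, se, n = 7.609, 3.447, 40  # slope, standard error, and sample size from the SPSS output
df = n - 2                   # degrees of freedom for a bivariate regression

t = b / se                               # ~2.207
p = 2 * stats.t.sf(abs(t), df)           # ~.033, two-tailed
t_crit = stats.t.ppf(0.975, df)          # ~2.024
ci = (b - t_crit * se, b + t_crit * se)  # ~(.63, 14.59)
print(round(t, 3), round(p, 3), tuple(round(v, 2) for v in ci))
```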
The ‘Coefficients’ table also includes a row labelled ‘(Constant)’. This is the word SPSS
uses to refer to the intercept. It has been estimated at 13.976, which is nearly identical to the
estimate of 14.009 obtained by hand. Again, the difference is due simply to rounding. SPSS also
provides a standard error for the intercept, which is used to test the intercept for statistical
significance. If you divide the intercept point estimate by its standard error, you get the t-value:
13.976 / 48.705 = .287, which is associated with a p-value of .776. Consequently, the intercept was not statistically significantly different from zero. Correspondingly, the 95% confidence interval for the intercept ranged from -84.623 to 112.575. Because the interval includes zero (one bound is negative and the other positive), the intercept is not statistically significantly different from zero.
Recall that the vast majority of researchers are not really interested in the intercept in a bivariate
regression. SPSS is just being thorough by reporting all of the statistics associated with the
intercept.
Scatter Plots
Cohen (1988) provided guidelines for interpreting the magnitude of a Pearson
correlation, which could also be applied in the bivariate regression case (but not the multiple
regression case). Thus, in this example, the standardized beta-weight of .34 would be considered
a medium effect. Another method that can be used to represent the nature and strength of an
effect between two continuously scored variables is a scatter plot, as introduced in the chapter
devoted to correlation.
The association between years of education completed and earnings (per day) is depicted in the scatter plot presented in Figure C12.1 (Panel A). It can be observed that as years of education values increase, so do values of earnings per day. For example, the individual with the greatest years of education completed also earns the most money (see the value in the top right corner of the scatter plot). However, there is also a lot of "noise" amongst the values in the scatter plot. For example, the person who earned the second-most money did not have a particularly high number of years of education completed (less than 15 years).
The scatter plot depicted in Panel B is exactly the same as the scatter plot depicted in Panel A. The exception is that the Panel B scatter plot includes a regression line. The regression
line is also known as the ‘line of best fit’. The regression line is a visual representation of the
slope. In this example, the regression line has an upward tilt, which reflects the fact that the
association between years of education completed and earnings is positive. Had the association
been negative, the slope would have had a downward tilt. In the complete absence of an
association between two variables, the regression line is completely level.
In addition to representing the slope visually, the regression line also reflects the
intercept value visually. Recall that the intercept is defined as the value of Y when X is zero. If
you follow the regression line to the point where it intersects with the Y-axis, you will get an
estimate of the intercept. Based on the scatter plot depicted in Panel B, the regression line
intersects with the Y-axis at somewhere between 0 and 50. However, it is important to keep in
mind that the X-axis has been restricted to a minimum of 5 years of education. As I show you in
the video, you can change the X-axis range to 0 to get a better indication of the intercept.
Ultimately, however, you can’t rely upon the regression line to give you an exact estimate of the
intercept, particularly when the sample size is small, and there are few observations near the
intersection of the X- and Y-axes. However, it will give you a sense of what the intercept is. Based
on the formula, the intercept was estimated at 14.009 (SPSS = 13.976). (Watch Video 12.3: Scatter Plot in SPSS.)
Figure C12.1. Scatterplots depicting the association between education and earnings (Panel A: scatter plot alone; Panel B: scatter plot with the regression line added).
Regression Equation: Detailed

The bivariate regression equation may be expressed formally as:

Ŷ = a + bX   (3)
The hat on top of the Y indicates that the value corresponds to a predicted value (derived from
the equation), rather than an observed value (as would exist within the raw data). The b
represents the slope, the X next to the b represents a raw score on the variable X, and a
represents the intercept. In plain words, the formula reads: Predicted Y value = the intercept +
slope * raw score from the X variable.
Theoretically, it is more accurate to include an error term at the end of the regression
equation:
Y = a + bX + ε   (4)
The ε represents an error margin, which would be expected in practice, because few, if any,
regression equations would predict dependent variable scores perfectly. Consequently, the
equation will make imperfect predictions, so there has to be a “residual” value to represent the
degree to which the prediction was off. This will become clearer when the residuals are
calculated below. However, for the purposes of predicting someone’s earnings per day, one
does not have to specify error. (Watch Video 12.4: What is the bivariate regression equation?).
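As a small illustration of Equation 3 in use, the following Python sketch wraps the SPSS estimates from above in a prediction function (the function name and defaults are mine):

```python
# Equation 3: Y-hat = a + bX, with the SPSS intercept and slope estimates.
def predict_earnings(years_education, a=13.976, b=7.609):
    """Predicted earnings per day for a given number of years of education."""
    return a + b * years_education

print(round(predict_earnings(17), 3))  # ~143.329 dollars per day
```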
Suppose you met a person at a party and they told you that they completed 17 years of
education. Most people would follow up with questions about what topics they studied at
university, but not me. I would bet them $100 that I can guess how much they earn with 95%
confidence. To do so, I would plug in 17 into the regression equation formula, alongside the
unstandardized intercept, and unstandardized slope I estimated above with SPSS:
Ŷ = 13.976 + 7.609(17)
Ŷ = 143.329
The $143.329 I estimated is income per day, so to get the annual earnings estimate I
need to multiply by 365 days, which gives $52,315.09. To obtain the normal theory 95% lower-bound and upper-bound confidence intervals, I would need to apply the regression equation two more times: once with the lower-bound estimates and once with the upper-bound estimates:
Ŷ(95% lower) = −84.623 + .630(17) = −73.91
Ŷ(95% upper) = 112.575 + 14.588(17) = 360.57
Based on the above, with 95% confidence, I would say that the individual at the party with 17 years of education has earnings somewhere between -$73.91 and $360.57 per day! That's a massive range; nobody would be impressed by that. The fact that the 95% confidence interval around the point-estimate prediction of $143.33 per day is so wide underscores a very important issue to bear in mind when creating and using regression equations: error.
Regression, Error, and Residuals

One statistic that reflects the predictive accuracy of a regression equation is 'model R'. In the bivariate regression case, model R is equal to the standardized slope (a.k.a., the standardized beta weight). A method to evaluate the amount of error associated with a regression equation is to subtract R² from 1. The result is known as the coefficient of alienation, in contrast to the coefficient of determination (i.e., R²). The coefficient of alienation is simply:
1 − R²   (5)

In this example, the model R was equal to .337, so R² = .337² = .114. Thus, 11.4% of the variance in earnings was accounted for by years of education completed. By contrast, the coefficient of alienation (1 − R²) was equal to 1 − .114 = .886, which implies that 88.6% of the variance in earnings was not accounted for by years of education completed. Because such a large percentage of the variance was unaccounted for, it should come as no surprise that individual predictions of people's earnings would be associated with a substantial amount of error.
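These two quantities are trivial to compute; a minimal sketch, using the model R from this example:

```python
# Model R equals the Pearson correlation in the bivariate case.
model_r = 0.337

r_squared = model_r ** 2    # coefficient of determination, ~.114
alienation = 1 - r_squared  # coefficient of alienation (Equation 5), ~.886
print(round(r_squared, 3), round(alienation, 3))
```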
In regression, we refer to the difference between an observed dependent variable value and a predicted dependent variable value as a residual. As you will see in the demonstration below, it is literally the difference between the two values (observed − predicted). As model R² increases, the difference between predicted values and observed values becomes progressively smaller, i.e., better predictive capacity. However, in practice, model R² is usually much less than .50, which implies that more than 50% of the variance in the dependent variable is unexplained error, oftentimes substantially more!
Residuals: Illustration with SPSS

Unstandardized and standardized residuals can be requested via the 'Save' utility within SPSS's regression procedure. You'll see that SPSS creates two new variables called RES_1 and ZRE_1. The RES_1 values correspond to the raw score differences between the observed values and the predicted values. The ZRE_1 values were obtained by transforming the unstandardized residuals into z-scores.
In a basic bivariate regression analysis, SPSS reports an estimate of the error associated with the predictive capacity of the regression equation in the 'Model Summary' table, where it is labelled the 'Standard Error of the Estimate'; in this example, it was estimated at $65.99. The standard error of the estimate is the standard deviation associated with the unstandardized residuals. Evidently, the amount of error associated with the prediction equation is very large at $65.99!
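If you wanted to verify the standard error of the estimate by hand, here is a hedged sketch, assuming hypothetical arrays of observed and predicted values and the N − 2 denominator used in bivariate regression:

```python
import numpy as np

def standard_error_of_estimate(observed, predicted):
    """SD of the unstandardized residuals, with an n - 2 denominator."""
    residuals = np.asarray(observed) - np.asarray(predicted)
    return np.sqrt(np.sum(residuals ** 2) / (len(residuals) - 2))

# Applied to the education/earnings data, this should come out near $65.99.
```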
SPSS also reports the Model R and the R2 (R Square) value in the ‘Model Summary’ table.
Again, in the context of a bivariate regression, Model R and R2 will always equal Pearson r and
Pearson r2. Furthermore, Pearson’s r and the standardized slope will always equal each other in
the bivariate regression case. Consequently, it is not really useful to examine and/or report such
values in a bivariate regression analysis. Model R and R2 become much more useful in the
context of multiple regression, which is an extension of bivariate regression. The 'adjusted R square' value corresponded to .090, which implies that 9% of the variance in earnings was accounted for by years of education completed, rather than 11.4%. The reason SPSS reports the adjusted value is that the sample R² is known to overestimate the population value. Consequently, it is more justifiable to report the adjusted R², but almost no one ever does in the context of bivariate regression (only multiple regression).
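The adjustment can be sketched as follows; I am assuming the standard shrinkage formula (with k = number of predictors) underlies the value SPSS reports:

```python
# Adjusted R^2: shrinks the sample R^2 toward its expected population value.
def adjusted_r_squared(r2, n, k=1):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r_squared(0.337 ** 2, n=40), 3))  # ~.090, matching SPSS
```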
Assumptions
The correct application of bivariate regression is based on the satisfaction of various
assumptions. The easy assumptions to evaluate include all of the assumptions associated with
Pearson correlation:
1: Random sampling
2: Independence of observations
3: Dependent variable measured on an interval/ratio scale
4: Linearity
As the above four assumptions have been treated in detail in the chapter devoted to correlation,
they have only been listed here. There are four additional assumptions associated with bivariate
regression:
5: Absence of Influential Cases

Cook's distance values can be used to identify cases that exert a disproportionate influence on the results. If a value close to 1 or larger is reported in the table, check the SPSS data file and identify the case with the large Cook's distance value. Try to understand why this case has been associated with such a large influence on the results. You may consider deleting the case, if there is a good reason to do so, and then re-running the regression analysis without the influential case.
6: Independence of Errors
In practice, this assumption is more commonly a concern for researchers in economics, as they often collect data across time (month to month; i.e., "time-series" data).
Independence of errors implies that there is no correlation in the size of the residuals across the
cases (i.e., the rows of data). Sometimes you can see that the residuals increase or decrease in
size across the cases/rows. Theoretically, the Durbin-Watson statistic can range from 0 to 4, with
a value of 2 indicating the total absence of a serial correlation (which is a “good” thing for the
regression analysis). A value closer to 4 is indicative of a strong negative correlation. A value
closer to 0 is indicative of a strong positive correlation. When the option is selected, SPSS adds
a column of information at the end of the table entitled ‘Model Summary’:
Based on the education and earnings per day example, you can see that the Durbin-Watson statistic was estimated at .100. SPSS does not test the Durbin-Watson statistic for statistical significance. In this example, the Durbin-Watson statistic is very close to 0, which would suggest a violation of the independence of errors assumption. However, the data in this example are not time-series in nature; consequently, it is effectively impossible to violate the independence of errors assumption. Instead, what happened is that I ordered the residuals from smallest to largest specifically to demonstrate the effect (Watch 12.7: Durbin-Watson Statistic in SPSS). Personally, I would not even perform the Durbin-Watson analysis in this sort of study. It has been performed simply for demonstration purposes.
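For the curious, the Durbin-Watson statistic is simple to compute directly from the residuals; a minimal Python sketch (statsmodels also provides this as statsmodels.stats.stattools.durbin_watson):

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals.
    Near 2: independent errors; near 0: strong positive serial correlation;
    near 4: strong negative serial correlation."""
    e = np.asarray(residuals)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```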
7: Homoscedasticity

Homoscedasticity is a bit of a mouthful to pronounce. It may seem like a rather complicated statistical issue; however, it is not. This may become apparent when you realize that homoscedasticity could be renamed homogeneity of variance (or, more specifically, homogeneity of variance of the residuals). Homoscedasticity implies that the regression equation is equally
predictive across all levels of the dependent variable. When this is true, the magnitude of the
residuals will be approximately equal across the continuum of the dependent variable. If the
regression equation “breaks down” at some point along the continuum, the residuals become
larger, which necessarily implies that the variance of the residuals becomes larger. Recall that
only one standard error of estimate is provided in a regression analysis. Consequently, the single
estimate has to be applicable across the continuum of the dependent variable. When the
assumption of homoscedasticity is violated, the single standard error of estimate cannot be
trusted to be accurate across the continuum of the dependent variable.
There are two broad approaches to the evaluation of the homoscedasticity assumption:
visually and statistically. Visually, one popular method is to plot the standardized residuals on
the Y-axis and the standardized predicted values on the X-axis in a scatter plot. There should be
an absence of any pattern in the residuals when the assumption of homoscedasticity is satisfied.
However, when a pattern emerges, then you have probably violated the assumption of
homoscedasticity. In this example, the assumption of homoscedasticity was likely satisfied, as
there was no discernible pattern in the scatter plot depicted in Figure C12.3 (Watch 12.8:
Evaluate Homoscedasticity in Scatter Plot).
In my experience, it is better to plot the absolute standardized residuals on the Y-axis,
rather than the non-absolute standardized residuals (i.e., negative and positive residuals).
Absolute, in this case, means that you need to strip away the negative signs associated with the
negative residuals. When the scatter plot “test” is applied with the absolute standardized
residuals, the pattern you are looking for is either a positive or negative correlation. Thus, the
magnitude of the variance in the residuals is either increasing (positive correlation) or
decreasing (negative correlation) when heteroscedasticity is present. By contrast, when
homoscedasticity is present, there should be no pattern in the residuals. As can be seen in Figure C12.4, the corresponding scatter plot with the absolute standardized residuals on the Y-axis did not reveal an obvious pattern, thus suggesting the absence of heteroscedasticity. However, again, this is not exactly a proper, scientific approach to addressing the question of homoscedasticity.
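The informal scatter plot "test" just described can also be expressed as a correlation; here is a hedged Python sketch, where zpr and zre stand in for the standardized predicted values and the ZRE_1-style standardized residuals saved from the regression:

```python
import numpy as np
from scipy.stats import spearmanr

def heteroscedasticity_check(zpr, zre):
    """Spearman correlation between predicted values and absolute residuals.
    A clearly positive or negative rho suggests heteroscedasticity."""
    rho, p = spearmanr(zpr, np.abs(zre))
    return rho, p
```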
If the association between the variables is curvilinear, poor predictions will be made at the point of the data at which the curvilinearity appears (and beyond). However, it seems a bit obvious that linear bivariate regression will only work properly when the nature of the association is linear. All is not lost, however, as curvilinear bivariate regression also exists and should be considered for application when the association between the independent and dependent variables is suspected not to be linear.
In practice, I suspect that the homoscedasticity assumption will be violated when the
association between the independent variable and the dependent variable is not strictly linear.
To evaluate the possibility that there may be a curvilinear element to the association between
the independent and dependent variable requires the addition of an extra independent variable
term to the model. To conduct the regression analysis requires knowledge of multiple
regression, as bivariate regression cannot accommodate the additional independent variable
term. Consequently, I describe curvilinear regression in the chapter devoted to multiple
regression.
1. Truth be told, no statistic implies causation. It's the research design that may allow for causal interpretations.
Advanced Topics

Bivariate Regression in SPSS: Intercept and Slope - Bootstrap
To obtain bootstrapped confidence intervals for the standardized regression weight (i.e., β = .337), the education and income per day variables need to be converted into z-scores first. Then, the z-scores need to be entered into the bivariate regression analysis. Such a procedure essentially tricks SPSS into doing something it wasn't designed to do, but the results will be accurate.2
As can be seen in the table below, the bootstrapped 95% confidence intervals associated
with the standardized regression weight correspond to -.033 and .611. Furthermore, the
bootstrapped p-value was estimated at .068, which would suggest the association between
education and earnings per day was not significant statistically.
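For readers outside SPSS, a pairs (case-resampling) bootstrap of the standardized slope can be sketched as follows; because the standardized slope equals Pearson's r in bivariate regression, each resample simply recomputes r. This is a percentile interval, so the exact bounds will differ somewhat from SPSS's bias-corrected output:

```python
import numpy as np

def bootstrap_beta_ci(x, y, n_boot=2000, seed=1):
    """95% percentile bootstrap CI for the standardized slope (Pearson's r)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    betas = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, len(x), len(x))  # resample cases with replacement
        betas[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    return np.percentile(betas, [2.5, 97.5])
```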
Evidently, at least on the surface, the normal theory p-value of .033 is at odds with the
bootstrap estimated p-value of .068. That is, the normal theory p-value was statistically
significant (p< .05) and the bootstrapped p-value was not (p> .05). One of the reasons the p-
values were not exactly the same is that the earnings per day variable was associated with a
moderate level of skew (i.e., .911). Normal theory estimation assumes the data are associated
with normal distributions. However, as described further below, normal theory estimation is
relatively (but not perfectly) robust to violations of normality. In absolute numbers, the
difference between .068 and .033 is only .035. They are essentially telling the same story. That
is, the normal theory estimation approach implied that there was a 3.3% chance that, if the null hypothesis were rejected, one would be fooling oneself. By contrast, the bootstrapped approach implied that there was a 6.8% chance that, if we rejected the null hypothesis, we would be fooling ourselves. It is for this reason that many researchers criticize the statistical convention that is fixated on observing p < .05, which is ultimately a fairly arbitrary threshold.

2. Note that estimating the normal theory 95% confidence intervals for the standardized beta weight based on the z-scores would only produce an approximate estimate; consequently, it can only be recommended with caution. See the chapter on correlation for a proper method to estimate normal theory 95% confidence intervals for a Pearson correlation (which corresponds to a bivariate regression standardized coefficient).
In the context of a real-life study, however, a decision would have to be made with
respect to whether the normal theory results should be reported or the bootstrapped results. I
fear all too often researchers pick and choose the results that coincide with their desires. It
should be kept in mind that a sample size of only 40 was used in this example. Therefore, the
results would not be expected to be particularly stable. Fortunately, the Zagorsky (2007) study was actually based on a very large sample size of 5,000+ participants. Consequently, the point estimates reported in this example (b = 7.609; β = .337) should be considered trustworthy, as they are essentially the same values reported in Zagorsky (2007).
3. If you can't run SPSS as administrator, I'm not sure there is an attractive option that will work in SPSS consistently. However, you can rely upon the Pearson or Spearman correlation analysis between standardized predicted values and absolute standardized residuals demonstrated in the preceding section.
Breusch-Pagan and the Koenker test
It should be noted that these statistical tests of homoscedasticity are affected by sample size. That is, a small sample size will make it difficult to obtain a p-value less than .05, all other things being equal. In this example, the sample size was only 40; consequently, although all of the homoscedasticity tests suggested the presence of homoscedasticity, a larger sample size may have detected statistically significant heteroscedasticity.
I'll also note one more statistical test of homoscedasticity that some researchers prefer: White's test. If you're interested in White's test, there is some syntax to conduct such an analysis here: http://www-01.ibm.com/support/docview.wss?uid=swg21476748.

Finally, if your data do violate the assumption of homoscedasticity, you cannot trust the p-values associated with the beta-weights. However, you can adjust the standard errors and p-values with a procedure developed by Hayes and Cai (2007).
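Hayes and Cai's macro is an SPSS tool, but heteroscedasticity-consistent standard errors of the same family are available elsewhere; here is a hedged Python sketch using statsmodels' HC3 estimator, where x and y are assumed to be your predictor and outcome arrays:

```python
import statsmodels.api as sm

def hc_regression(x, y):
    """OLS with heteroscedasticity-consistent (HC3) standard errors."""
    X = sm.add_constant(x)                   # adds the intercept term
    return sm.OLS(y, X).fit(cov_type="HC3")  # robust standard errors

# fit = hc_regression(x, y); inspect fit.bse, fit.pvalues, fit.conf_int()
```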
Adjusting Beta-Weight Standard Errors
When the Hayes and Cai (2007) macro was applied to the education and earnings data, the unstandardized beta weight was estimated at b = 7.61. Furthermore, the adjusted standard error was estimated at 4.3445, t = 1.7515, and p = .0879. Recall that, when I first ran the bivariate regression analysis without any consideration for heteroscedasticity, the normal theory unstandardized beta weight was estimated at b = 7.61 (exactly the same, as expected), but the standard error was estimated at 3.447, which is smaller than the adjusted standard error of 4.3445. Consequently, in this case, the t-value decreased in magnitude to t = 1.7515 and the p-value increased to .0879. Furthermore, the 95% confidence intervals for the unstandardized beta weight corresponded to:

interval = 4.3445 × 1.96 = 8.515
95% lower-bound = 7.6094 − 8.515 = −.9056
95% upper-bound = 7.6094 + 8.515 = 16.12
In practice, violating the assumption of homoscedasticity yields ordinary least squares standard errors that are biased, and the direction of the bias cannot be known in advance. This means that you don't know whether you have less or more statistical power in applying the OLS estimation approach in any particular case. Consequently, sometimes adjusting the standard errors via the Hayes and Cai (2007) macro will help you reject the null hypothesis, and sometimes it will do the opposite. Ultimately, if you have violated the assumption of homoscedasticity, you should estimate the adjusted standard errors, irrespective of whether such a decision results in greater or lesser statistical power. I'm not sure whether there is any loss in power in applying the Hayes and Cai (2007) adjusted standard errors when the data exhibit perfect homoscedasticity. If there were not, then it would make sense to use the adjusted standard errors in all cases.
It is important to note that Hayes and Cai's (2007) technique may be sensitive to non-normally distributed data. When the data are non-normally distributed and the results are associated with heteroscedasticity, the wild bootstrap is an attractive approach to the estimation of adjusted beta weight standard errors, p-values, and confidence intervals. I demonstrate the application of the wild bootstrap in the chapter devoted to multiple regression (Chapter 14).
Finally, it is important to acknowledge that heteroscedasticity is not simply a problem to overcome with respect to estimating confidence intervals and p-values for a predictor in a regression analysis; it can also be a substantively interesting phenomenon in its own right.
An Example of Heteroscedasticity
In the Foundations section of this chapter, the assumption of homoscedasticity was
satisfied for the years of education and earnings example. Consequently, the normal theory
bivariate regression results outputted by SPSS should be considered accurate. However, for the
purposes of illustration, I have re-simulated some data (N = 25) such that the assumption of
homoscedasticity has been violated (Data File: education_earnings_IQ_hetero).
Prior to conducting the regression analysis, I examined the descriptive statistics using SPSS's 'Frequencies' utility. As can be seen in the SPSS table below, the mean income per day was equal to $84.60 and the mean education was 13.36 years. The income per day variable was positively skewed (.577) to some degree, as expected.
Next, I conducted a regular linear regression analysis with the ‘Linear’ menu option in SPSS
(Watch 12.13: Heteroscedastic Data Example in SPSS). I report below only the key table with the
beta weight and standard error.
It can be seen in the SPSS table entitled 'Coefficients' that the unstandardized beta weight was estimated at 9.95 with a corresponding t-value of 2.051, which was not statistically significant, p = .052. A researcher may be disappointed to see a p-value so close to .05. However, an examination of the standardized predicted value (X-axis) by absolute standardized residual (Y-axis) scatter plot (see Figure C12.5) suggested that the standardized residuals became progressively larger across the spectrum of the standardized predicted values (i.e., a positive association).
In fact, the Spearman correlation between the standardized predicted values and the absolute standardized residuals was r = .56, p = .004, which suggested that the assumption of homoscedasticity was violated. Furthermore, the Koenker test also supported the notion that the assumption of homoscedasticity was violated, LM = 5.49, p = .019. In light of these results, the regression was re-performed with the Hayes and Cai (2007) adjusted standard error macro (see above for a description of the procedure to use the macro). Once the macro had been run, the following syntax was executed (Watch 12.14: Heteroscedastic Corrected Standard Errors in SPSS (Take 2)):
It can be seen that the unstandardized beta weight was exactly the same (b = 9.95), as expected. However, the corrected standard error was somewhat larger (SE = 5.18) than the original standard error estimate (SE = 4.85). Additionally, the t-value was smaller (t = 1.92), and the p-value was larger at .067. Thus, the heteroscedasticity-adjusted standard error regression analysis confirmed the original analysis: there was no statistically significant association between years of education and income per day in this sample (N = 25).
When p-values Contradict Each Other: Don't Despair
A good question to ask is: which method yielded the most accurate estimate? Unfortunately, there is no conclusive answer to such a question. Ideally, a researcher would build a case for one method over another, with reference to analytical and empirical (simulation) research. Based on a sample size of 40 and positively skewed data (skew = 2.8), Bishara and Hittner (2012) found that the normal theory Pearson correlation (identical to bivariate regression) was associated with a Type I error rate of .048, which is very close to the expected alpha level of .05. By contrast, the bootstrapped (bias-corrected, accelerated) estimation approach was associated with a Type I error rate of .080.4 Consequently, I would report the normal theory results and reject the null hypothesis, in this case. That is, the slope between education and earnings was statistically significant. Overall, though, it should be emphasized that the sample size of 40 was small and does not offer much opportunity to yield research results with a high level of confidence. Importantly, the study upon which the data were simulated in this example (i.e., Zagorsky, 2007) actually had a sample size greater than 5,000. Therefore, the .337 effect between education and earnings was unambiguously statistically significant in Zagorsky (2007). I used a sample size of 40 in this example to allow for larger (i.e., not all p < .001) and different p-values across the different estimation methods. In practice, researchers all too often find themselves dealing with such small sample sizes.
Table C12.1: Unstandardized Beta Weight and 95% Confidence Intervals: Three Estimation Approaches

Approach         L-B     P-E     U-B     p
Normal Theory    .60     7.61    14.59   .033
Bootstrapped     -.51    7.61    14.24   .068
H-C              -.91    7.61    16.12   .088

Note. N = 40; H-C = Heteroscedasticity Corrected; L-B = lower-bound; P-E = point-estimate; U-B = upper-bound.
4. Personally, I find this result surprising, given that bootstrapping is widely considered to be unaffected by non-normality. However, one must accept the results as they are published, in the absence of any contrary evidence.
Practice Questions
The questions below are all very practical. They require you to read a summary of a research scenario (based on a genuine scientific publication). Then, after opening the data file, you need to test the specified hypothesis. Naturally, as these are 'real-world' research scenarios, you should concern yourself with the assumptions associated with the particular statistical analyses you decide to employ.

How should you go about answering these questions while dealing with assumptions? Very good question. I have created a list of steps for you to follow to help you generate a statistical analytic strategy that should lead you to a defendable answer in virtually all cases. The steps are essentially a breakdown of all of the key pieces of information I described in the Foundations section of this chapter (i.e., all material preceding the 'Advanced' section). You can watch and listen to me walk you through these important steps (Watch 12.15: Bivariate Regression - Steps). Once you have reviewed the steps, have a go at answering the 12 specific questions below. The research scenarios are all based on scientific publications, so, in addition to practicing your statistical skills, I hope you learn some scientific "facts" along the way.
For each of the practice exercises below, answer the following specific questions:
I have simulated some data to correspond very closely to the results reported by Corwin et al. (2003). Conduct a bivariate regression analysis on the data (Data File: hemoglobin_depression) (Watch 12.16: Bivariate Regression - Hemoglobin & Depression (Practice 1)).
References

Bishara, A. J., & Hittner, J. B. (2012). Testing the significance of a correlation with nonnormal data: Comparison of Pearson, Spearman, transformation, and resampling approaches. Psychological Methods, 17(3), 399-417.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Corwin, E. J., Murray-Kolb, L. E., & Beard, J. L. (2003). Low hemoglobin level is a risk factor for postpartum depression. The Journal of Nutrition, 133(12), 4139-4142.

Hayes, A. F., & Cai, L. (2007). Using heteroskedasticity-consistent standard error estimators in OLS regression: An illustrative tutorial and free software. Behavior Research Methods, 39(4), 709-722.

Lamont, L. M., & Lundstrom, W. J. (1977). Identifying successful industrial salesmen by personality and personal characteristics. Journal of Marketing Research, 517-529.

Looker, A. C., Dallman, P. R., Carroll, M. D., Gunter, E. W., & Johnson, C. L. (1997). Prevalence of iron deficiency in the United States. JAMA, 277(12), 973-976.

Shoda, Y., Mischel, W., & Peake, P. K. (1990). Predicting adolescent cognitive and self-regulatory competencies from preschool delay of gratification: Identifying diagnostic conditions. Developmental Psychology, 26(6), 978-986.

Stewart, A. W., Jackson, R. T., Ford, M. A., & Beaglehole, R. (1987). Underestimation of relative weight by use of self-reported height and weight. American Journal of Epidemiology, 125(1), 122-126.

Zagorsky, J. L. (2007). Do you have to be smart to be rich? The impact of IQ on wealth, income and financial distress. Intelligence, 35(5), 489-501.