Making Predictions with Regression Analysis - Statistics By Jim about:reader?url=

Making Predictions with Regression Analysis - Statistics By Jim

If you were able to make predictions about something important to
you, you’d probably love that, right? It’s even better if you know that
your predictions are sound. In this post, I show how to use
regression analysis to make predictions and determine whether
they are both unbiased and precise.

You can use regression equations to make predictions. Regression
equations are a crucial part of the statistical output after you fit a
model. The coefficients in the equation define the relationship
between each independent variable and the dependent variable.
However, you can also enter values for the independent variables
into the equation to predict the mean value of the dependent
variable.
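
As a minimal sketch of entering values into a regression equation, the Python snippet below computes a predicted mean from an intercept and a list of coefficients. The function name and the coefficient values are hypothetical placeholders, not estimates from the model in this post.

```python
def predict_mean(intercept, coefficients, values):
    """Return intercept + sum(coefficient * value) for one set of inputs."""
    if len(coefficients) != len(values):
        raise ValueError("need exactly one value per coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, values))


# Hypothetical equation (made up for illustration): y = 10 + 2.0*x1 - 0.5*x2
predicted = predict_mean(10.0, [2.0, -0.5], [3.0, 4.0])
print(predicted)
```

The result is the predicted mean of the dependent variable for those input values, not a guaranteed value for any single observation.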

Related post: When Should I Use Regression Analysis?

The Regression Approach for Predictions

Using regression to make predictions doesn’t necessarily involve
predicting the future. Instead, you predict the mean of the
dependent variable given specific values of the independent
variable(s). For our example, we’ll use one independent variable to
predict the dependent variable. I measured both of these variables
at the same point in time.

Psychic predictions are things that just pop into mind and are not
often verified against reality. Unsurprisingly, predictions in the
regression context are more rigorous. We need to collect data for
relevant variables, formulate a model, and evaluate how well the
model fits the data.

The general procedure for using regression to make good
predictions is the following:

1. Research the subject-area so you can build on the work of others.
This research helps with the subsequent steps.

2. Collect data for the relevant variables.

3. Specify and assess your regression model.

4. If you have a model that adequately fits the data, use it to make
predictions.
While this process involves more work than the psychic approach, it
provides valuable benefits. With regression, we can evaluate the
bias and precision of our predictions:

Bias in a statistical model indicates that the predictions are
systematically too high or too low.

Precision represents how close the predictions are to the observed
values.

When we use regression to make predictions, our goal is to
produce predictions that are both correct on average and close to
the real values. In other words, we need predictions that are both
unbiased and precise.

Example Scenario for Regression Predictions

We’ll use a regression model to predict body fat percentage based
on body mass index (BMI). I collected these data for a study with
92 middle school girls. The variables we measured include height,
weight, and body fat measured by a Hologic DXA whole-body
system. I’ve calculated the BMI using the height and weight
measurements. DXA measurements of body fat percentage are
considered to be among the best.

You can download the CSV data file: Predict_BMI.

Why might we want to use BMI to predict body fat percentage? It’s
more expensive to obtain your body fat percentage through a direct
measure like DXA. If you can use your BMI to predict your body fat
percentage, that provides valuable information more easily and
cheaply. Let’s see if BMI can produce good predictions!

Finding a Good Regression Model for Predictions

We have the data. Now, we need to determine whether there is a
statistically significant relationship between the variables.
Relationships, or correlations between variables, are crucial if we
want to use the value of one variable to predict the value of
another. We also need to evaluate the suitability of the regression
model for making predictions.

We have only one independent variable (BMI), so we can use a
fitted line plot to display its relationship with body fat percentage.
The relationship between the variables is curvilinear. I’ll use a
polynomial term to fit the curvature. In this case, I’ll include a
quadratic (squared) term. The fitted line plot below suggests that
this model fits the data.
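
To illustrate what fitting a quadratic term involves, here is a minimal Python sketch using NumPy's `polyfit`. The data are synthetic and generated only for this example (they are not the Predict_BMI data), and the "true" coefficients used to generate them are arbitrary assumptions.

```python
import numpy as np


def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x**2; returns (b0, b1, b2)."""
    b2, b1, b0 = np.polyfit(x, y, deg=2)  # polyfit lists the highest power first
    return b0, b1, b2


# Synthetic data, made up for illustration (NOT the Predict_BMI data).
rng = np.random.default_rng(0)
bmi = np.linspace(15, 35, 92)
body_fat = -30 + 4.0 * bmi - 0.055 * bmi**2 + rng.normal(0, 3, bmi.size)

b0, b1, b2 = fit_quadratic(bmi, body_fat)
print(f"fitted: {b0:.2f} + {b1:.3f}*BMI + {b2:.4f}*BMI^2")
```

A negative coefficient on the squared term corresponds to a curve that flattens as the independent variable increases.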

Related post: Curve Fitting using Linear and Nonlinear Regression

This curvature is readily apparent because we have only one
independent variable and we can graph the relationship. If your
model has more than one independent variable, use separate
scatterplots to display the association between each independent
variable and the dependent variable so you can evaluate the nature
of each relationship.

Assess the residual plots

You should also assess the residual plots. If you see patterns in the
residual plots, you know that your model is incorrect and that you
need to reevaluate it. Non-random residuals indicate that the
predicted values are biased. You need to fix the model to produce
unbiased predictions.

Learn how to choose the correct regression model.

The residual plots below also confirm the unbiased fit because the
data points fall randomly around zero and follow a normal
distribution.
Interpret the regression output

In the statistical output below, the p-values indicate that both the
linear and squared terms are statistically significant. Based on all of
this information, we have a model that provides a statistically
significant and unbiased fit to these data. We have a valid
regression model. However, there are additional issues we must
consider before we can use this model to make predictions.

As an aside, the curved relationship is interesting. The flattening
curve indicates that higher BMI values are associated with smaller
increases in body fat percentage.

Other Considerations for Valid Predictions

Precision of the Predictions

Previously, we established that our regression model provides
unbiased predictions of the observed values. That’s good. However,
it doesn’t address the precision of those predictions. Precision
measures how close the predictions are to the observed values. We
want the predictions to be both unbiased and close to the actual
values. Predictions are precise when the observed values cluster
close to the predicted values.

Regression predictions are for the mean of the dependent variable.
If you think of any mean, you know that there is variation around
that mean. The same applies to the predicted mean of the
dependent variable. In the fitted line plot, the regression line is
nicely in the center of the data points. However, there is a spread of
data points around the line. We need to quantify that spread to
know how close the predictions are to the observed values. If the
spread is too large, the predictions won’t provide useful information.

Later, I’ll generate predictions and show you how to assess the
precision.

Related post: Understand Precision in Applied Regression to Avoid
Costly Mistakes

Goodness-of-Fit Measures

Goodness-of-fit measures, like R-squared, assess the scatter of the
data points around the fitted values. The R-squared for our model is
76.1%, which is good but not great. For a given dataset, higher
R-squared values represent predictions that are more precise.
However, R-squared doesn’t tell us directly how precise the
predictions are in the units of the dependent variable. We can use
the standard error of the regression (S) to assess the precision in
this manner. However, for this post, I’ll use prediction intervals
to evaluate precision.
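
For readers who want to see how R-squared and S relate to the scatter around the fitted values, the following Python sketch computes both from observed and fitted values using the standard definitions. The function name and the four data points are made up for illustration.

```python
import math


def r_squared_and_s(observed, fitted, n_params):
    """Return (R-squared, S) for observed values and model-fitted values.

    n_params counts every estimated coefficient, including the intercept.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    sse = sum((o - f) ** 2 for o, f in zip(observed, fitted))  # unexplained variation
    sst = sum((o - mean_obs) ** 2 for o in observed)           # total variation
    r_squared = 1 - sse / sst
    s = math.sqrt(sse / (n - n_params))  # in the units of the dependent variable
    return r_squared, s


# Tiny made-up example: four observations and their fitted values.
r2, s = r_squared_and_s([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8], n_params=2)
print(f"R-squared = {r2:.2f}, S = {s:.3f}")
```

R-squared is unitless, while S is expressed in the units of the dependent variable, which is why S speaks more directly to how far observations typically fall from the fitted values.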

Related post: Standard Error of the Regression vs. R-squared

New Observations versus Data Used to Fit the Model

R-squared and S indicate how well the model fits the observed
data. We need predictions for new observations that the analysis
did not use during the model estimation process. Assessing that
type of fit requires a different goodness-of-fit measure, the
predicted R-squared.

Predicted R-squared measures how well the model predicts the
value of new observations. Statistical software packages calculate
it by sequentially removing each observation, fitting the model, and
determining how well the model predicts the removed observations.
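
The leave-one-out procedure described above can be sketched in Python as follows. This is an illustrative implementation for a polynomial model, not the algorithm of any particular statistical package. The data are synthetic, and the function names are assumptions.

```python
import numpy as np


def predicted_r_squared(x, y, degree=2):
    """Leave-one-out predicted R-squared for a polynomial regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    press = 0.0  # prediction error sum of squares
    for i in range(x.size):
        keep = np.arange(x.size) != i
        coefs = np.polyfit(x[keep], y[keep], deg=degree)  # refit without point i
        press += (y[i] - np.polyval(coefs, x[i])) ** 2    # error on the held-out point
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - press / sst


def regular_r_squared(x, y, degree=2):
    """Ordinary R-squared for the same model, fit to all of the data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - np.polyval(np.polyfit(x, y, deg=degree), x)
    return 1 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)


# Synthetic data, made up for illustration (NOT the Predict_BMI data).
rng = np.random.default_rng(2)
x = np.linspace(15, 35, 92)
y = -30 + 4.0 * x - 0.055 * x**2 + rng.normal(0, 3, x.size)

print(f"R-squared:           {regular_r_squared(x, y):.3f}")
print(f"predicted R-squared: {predicted_r_squared(x, y):.3f}")
```

Because each held-out point is predicted by a model that never saw it, the predicted R-squared cannot exceed the regular R-squared. The size of the gap is what matters.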

If the predicted R-squared is much lower than the regular
R-squared, you know that your regression model doesn’t predict
new observations as well as it fits the current dataset. This situation
should make you wary of the predictions.

The statistical output below shows that the predicted R-squared
(74.14%) is nearly equal to the regular R-squared (76.06%) for our
model. We have reason to believe that the model predicts new
observations nearly as well as it fits the dataset.

Related post: How to Interpret Adjusted R-squared and Predicted
R-squared


Make Predictions Only Within the Range of the Data

Regression predictions are valid only for the range of data used to
estimate the model. The relationship between the independent
variables and the dependent variable can change outside of that
range. In other words, we don’t know whether the shape of the
curve changes. If it does, our predictions will be invalid.

The graph shows that the observed BMI values range from 15 to 35.
We should not make predictions outside of this range.
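
One way to enforce this rule in code is a small guard that rejects inputs outside the observed range before any prediction is made. The Python sketch below is a hypothetical helper. The listed BMI values are placeholders chosen only to span 15 to 35, not the actual sample.

```python
def check_within_range(value, observed_values):
    """Return value unchanged, or raise ValueError if it lies outside
    the minimum and maximum of the observed values."""
    low, high = min(observed_values), max(observed_values)
    if not low <= value <= high:
        raise ValueError(
            f"{value} is outside the observed range [{low}, {high}]; "
            "the model was not fit on data there"
        )
    return value


# Placeholder BMI values standing in for the sample (range 15 to 35).
observed_bmi = [15, 18, 22, 27, 31, 35]

check_within_range(18, observed_bmi)      # inside the range, so no error
try:
    check_within_range(40, observed_bmi)  # extrapolation beyond the data
except ValueError as err:
    print(err)
```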

Make Predictions Only for the Population You Sampled

The relationships that a regression model estimates might be valid
for only the specific population that you sampled. Our data were
collected from middle school girls who are 12-14 years old. The
relationship between BMI and body fat percentage might be
different for males and different age groups.

Using our Regression Model to Make Predictions

We have a valid regression model that appears to produce
unbiased predictions and can predict new observations nearly as
well as it predicts the data used to fit the model. Let’s go ahead and
use our model to make a prediction and assess the precision.

It is possible to use the regression equation and calculate the
predicted values ourselves. However, I’ll use statistical software to
do this for us. Not only is this approach easier and more accurate,
but I’ll also have it calculate the prediction intervals so we can
assess the precision.

I’ll use the software to predict the body fat percentage for a BMI of
18. The prediction output is below.

Interpreting the Regression Prediction Results

The output indicates that the mean value associated with a BMI of
18 is estimated to be ~23% body fat. Again, this mean applies to
the population of middle school girls. Let’s assess the precision
using the confidence interval (CI) and the prediction interval (PI).

The confidence interval is the range where the mean value for girls
with a BMI of 18 is likely to fall. We can be 95% confident that this
mean is between 22.1% and 23.9%. However, this confidence
interval does not help us evaluate the precision of individual
predictions.

A prediction interval is the range where a single new observation is
likely to fall. Narrower prediction intervals represent more precise
predictions. For an individual middle school girl with a BMI of 18,
we can be 95% confident that her body fat percentage is between
16% and 30%.

The prediction interval is always wider than the
confidence interval due to the greater uncertainty of predicting an
individual value rather than the mean.
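
To show why the prediction interval is the wider of the two, here is a Python sketch that computes both intervals for a quadratic model at a chosen value. It rests on several assumptions: the data are synthetic (not the Predict_BMI data), the function name is hypothetical, and a normal quantile stands in for the t quantile that statistical software would normally use, so both intervals come out slightly too narrow in small samples.

```python
import numpy as np
from statistics import NormalDist


def intervals_at(x, y, x0, level=0.95):
    """Fitted mean at x0 plus approximate CI and PI for a quadratic model."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x, x**2])     # design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    s2 = residuals @ residuals / (x.size - X.shape[1])  # S squared

    # Leverage of the new point, v' (X'X)^-1 v, computed via QR for stability.
    _, R = np.linalg.qr(X)
    v = np.array([1.0, x0, x0**2])
    z = np.linalg.solve(R.T, v)
    leverage = z @ z

    fit = float(v @ beta)
    crit = NormalDist().inv_cdf(0.5 + level / 2)   # normal approximation to t
    half_ci = crit * np.sqrt(s2 * leverage)        # uncertainty in the mean only
    half_pi = crit * np.sqrt(s2 * (1 + leverage))  # adds scatter of individuals
    return fit, (fit - half_ci, fit + half_ci), (fit - half_pi, fit + half_pi)


# Synthetic data, made up for illustration (NOT the Predict_BMI data).
rng = np.random.default_rng(3)
bmi = np.linspace(15, 35, 92)
body_fat = -30 + 4.0 * bmi - 0.055 * bmi**2 + rng.normal(0, 3, bmi.size)

fit, ci, pi = intervals_at(bmi, body_fat, x0=18)
print(f"fit = {fit:.1f}")
print(f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI = ({pi[0]:.1f}, {pi[1]:.1f})")
```

The only difference between the two half-widths is the extra `1 +` term, which accounts for the scatter of individual observations around the mean. That term is why the prediction interval cannot be narrower than the confidence interval.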

Is this prediction sufficiently precise? To make this determination,
we’ll need to use our subject-area knowledge in conjunction with
any specific requirements we have. I’m not a medical expert, but I’d
guess that the 14-point range of 16-30% is too imprecise to provide
meaningful information. If this is true, our regression model is too
imprecise to be useful.

Don’t Focus On Only the Fitted Values

As we saw in this post, using regression analysis to make
predictions is a multi-step process. After collecting the data, you
need to specify a valid model. The model must satisfy several
conditions before you make predictions. Finally, be sure to assess
the precision of the predictions. It’s all too easy to get lulled into a
false sense of security by focusing on only the fitted value and not
considering the prediction interval.

If you’re learning regression and like the approach I use in my blog,
check out my eBook!
