in Regression Analysis
Friday, July 22, 2022 8:35 PM
Does this graph display an actual relationship or is it an overfit model? This blog post shows you how to
make this determination.
Multiple linear regression can seduce you! Yep, you read it here first. It’s an incredibly tempting
statistical analysis that practically begs you to include additional independent variables in your model.
Every time you add a variable, the R-squared increases, which tempts you to add more. Some of the
independent variables will be statistically significant. Perhaps there is an actual relationship? Or is it just
a chance correlation?
You just pop the variables into the model as they occur to you or just because the data are readily
available. Higher-order polynomials curve your regression line any which way you want. But are you
fitting real relationships or just playing connect the dots? Meanwhile, the R-squared increases,
mischievously convincing you to include yet more variables!
In my post about interpreting R-squared, I show how evaluating how well a linear regression model fits
the data is not as intuitive as you may think. Now, I’ll explore reasons why you need to use adjusted R-
squared and predicted R-squared to help you specify a good regression model!
Some Problems with R-squared
Previously, I demonstrated that you cannot use R-squared to conclude whether your model is biased. To
check for this bias, you need to check your residual plots. Unfortunately, there are yet more problems
with R-squared that we need to address.
Problem 1: R-squared increases every time you add an independent variable to the model, or at the very
least stays the same. It never decreases, not even when the new variable is related to the response only
by chance. A regression model that contains more independent variables than another model can look
like it provides a better fit merely because it contains more variables.
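To see Problem 1 in action, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the simulated data, the helper names solve and r_squared, and the choice to fit by solving the normal equations. It fits a model with one real predictor, then keeps adding predictors that are pure random noise and records R-squared after each addition.

```python
import random

def solve(a, b):
    """Solve the square linear system a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta

def r_squared(predictors, y):
    """R-squared of a least squares fit of y on an intercept plus the given predictor columns."""
    n = len(y)
    cols = [[1.0] * n] + predictors
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(bj * cj[i] for bj, cj in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

# Simulated data (hypothetical): y depends on x only.
random.seed(0)
n = 30
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 + 1.5 * xi + random.gauss(0, 1) for xi in x]

predictors = [x]
history = [r_squared(predictors, y)]
for _ in range(5):
    # Each new column is random noise with no relationship to y.
    predictors = predictors + [[random.gauss(0, 1) for _ in range(n)]]
    history.append(r_squared(predictors, y))

for k, r2 in enumerate(history):
    print(f"{k} noise predictors: R-squared = {r2:.4f}")
```

The printed values should never go down from one line to the next, even though none of the added columns has anything to do with the response.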
Problem 2: When a model contains an excessive number of independent variables and polynomial
terms, it becomes overly customized to fit the peculiarities and random noise in your sample rather than
reflecting the entire population. Statisticians call this overfitting the model, and it produces deceptively
In this example, the researchers might want to include only three independent variables in their
regression model. My R-squared blog post shows how an under-specified model (too few terms) can
produce biased estimates. However, an overspecified model (too many terms) can reduce the model’s
precision. In other words, both the coefficient estimates and predicted values can have larger margins of
error around them. That’s why you don’t want to include too many terms in the regression model!
What Is the Predicted R-squared?
Use predicted R-squared to determine how well a regression model makes predictions. This statistic
helps you identify cases where the model provides a good fit for the existing data but isn’t as good at
making predictions. However, even if you aren’t using your model to make predictions, predicted R-
squared still offers valuable insights about your model.
Statistical software calculates predicted R-squared using the following procedure:
• It removes a data point from the dataset.
• It calculates the regression equation from the remaining data.
• It evaluates how well that model predicts the removed observation.
• It repeats this for every data point in the dataset.
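The steps above can be sketched in a few lines of plain Python. This is only an illustration for a model with a single predictor, not the code any particular statistics package uses. It assumes the commonly used definition, predicted R-squared = 1 - PRESS / total sum of squares, where PRESS is the sum of the squared leave-one-out prediction errors. The function names and the toy data are made up for the example.

```python
def fit_line(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def total_sum_of_squares(y):
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)

def r_squared(x, y):
    """Regular R-squared: every observation is used to fit the line."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - sse / total_sum_of_squares(y)

def predicted_r_squared(x, y):
    """Predicted R-squared: each observation is predicted by a line fit without it."""
    press = 0.0
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]          # remove one data point
        y_rest = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_rest, y_rest)   # refit on the remaining data
        press += (y[i] - (b0 + b1 * x[i])) ** 2  # error predicting the removed point
    return 1 - press / total_sum_of_squares(y)

# Toy data (hypothetical): a weak, noisy relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 1, 4, 1, 5, 9, 2, 6]

r2 = r_squared(x, y)
pred_r2 = predicted_r_squared(x, y)
print(f"R-squared = {r2:.3f}, predicted R-squared = {pred_r2:.3f}")
```

Because each point is predicted by a model that never saw it, the predicted value cannot exceed the regular R-squared. For this toy data it even comes out below zero.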
Predicted R-squared helps you determine whether you are overfitting a regression model. Again, an
overfit model includes an excessive number of terms, and it begins to fit the random noise in your
sample.
By its very definition, it is not possible to predict random noise. Consequently, if your model fits a lot of
random noise, the predicted R-squared value must fall. A predicted R-squared that is distinctly smaller
than R-squared is a warning sign that you are overfitting the model. Try reducing the number of terms.
If I had to name my favorite flavor of R-squared, it would be predicted R-squared!
Related post: Overfitting Regression Models: Problems, Detection, and Avoidance
Example of an Overfit Model and Predicted R-squared
You can try this example using this CSV data file: PresidentRanking.
These data come from an analysis I performed that assessed the relationship between the highest
approval rating that a U.S. President achieved and their rank by historians. I found no correlation
between these variables, as shown in the fitted line plot. It’s nearly a perfect example of no relationship
because it is a flat line with an R-squared of 0.7%!
Jaysoon says
July 19, 2022 at 11:09 am
Hi Jim,
Thank you for your excellent post and for continuing your informational blogs.
My question is a little bit off-topic.
What are your thoughts on a predicted R-squared that is reported as NA
because of a leverage of 1.000, in which case predicted R-squared and the PRESS statistic are not defined?
Jim Frost says
July 20, 2022 at 12:45 am
Hi Jaysoon,
I’m not really sure what is causing the NA value for predicted R-squared in your software–although I do
have some guesses.
Unlike the regular R-squared, predicted R-squared can be negative. I’m guessing NA refers to negative
values, but I’m not sure. At the statistical software company I used to work at, they changed how they
reported negative values. At first, the software reported the negative R-squared value. However, later
they changed it so it reported 0% for negative values. Perhaps NA refers to negative values? I really
don’t know.
What causes a negative value and how do you interpret it? Well, if 0% is really bad for prediction, then
negative values are even worse! They’re really, really bad! You probably have a very small total sample
size AND a large number of predictors given the sample size.
As for leverage points, removing them will change the estimated model greatly. And the predicted R-
squared process removes each point systematically, which should cause a large change and a low
predicted R-squared. That’s consistent with my theory that NA equates to a negative R-squared, which
wouldn’t be surprising with the presence of very influential leverage points.
But you’d really need to check with your software’s documentation about it to be sure.
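Separately from how any given package labels its output, the leave-one-out arithmetic offers one possible reason the statistic can be undefined when a leverage equals 1. A standard shortcut computes each leave-one-out prediction error as the ordinary residual divided by (1 - leverage), so a leverage of exactly 1 means dividing by zero. The sketch below is an assumption-laden illustration for a one-predictor model. The data and the helper names are hypothetical, and your software may handle this case differently.

```python
def leverages(x):
    """Leverage of each observation in a simple regression with an intercept:
    h_i = 1/n + (x_i - mean)^2 / Sxx."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

def press_residual(residual, leverage):
    """Leave-one-out prediction error via the shortcut residual / (1 - leverage).
    Returns None when the leverage is 1, because the quantity is not defined."""
    if abs(1.0 - leverage) < 1e-12:
        return None
    return residual / (1.0 - leverage)

# Hypothetical data: three observations share the same x value and one stands alone.
# Remove the lone point and the remaining x values have no spread, so no slope can be fit.
x = [1.0, 1.0, 1.0, 5.0]
h = leverages(x)
print("leverages:", [round(hi, 4) for hi in h])

# The fitted line passes exactly through a point with leverage 1, so its residual is 0.
print("PRESS residual for the last point:", press_residual(0.0, h[-1]))
```

If a single observation makes PRESS undefined, predicted R-squared is undefined too, which is consistent with what Jaysoon describes. Checking the software documentation is still the way to confirm what NA means there.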
Hannah says
May 26, 2022 at 11:04 am
Hi Jim,
I have run two multiple regressions. The second one, which has an additional variable, is significant, but
the first one is not. What would be the explanation for this? Would this be due to an overfit model?
Thank you!
Jim Frost says
June 2, 2022 at 11:16 pm
Hi Hannah,
I’m not exactly sure what you mean. You’ve run two regressions, and what becomes significant? The
entire regression model? The second variable?
Adele Manulat says
April 29, 2022 at 1:30 am
Jim Frost says
April 29, 2022 at 4:34 pm
Hi Adele,
Yes, that can be a confusing issue. When you add a variable that has a t-value that is ≥ 1 but < ~1.96,
that variable causes the adjusted R-squared to increase even though the p-value won't be statistically
significant. The statistical measures disagree. You're talking about doing the reverse by removing
variables, but the same principles apply. So, what to do? For starters, I suggest you read my post about
specifying the best model. In that article, I talk about how choosing your model is a mix of using
statistical measures and theory. Statistical measures don’t always agree (as you’re seeing) and you really
need to let theory be your guide.
With this in mind, if the IV, its coefficient sign, and its magnitude all make theoretical sense, I’d lean
towards leaving it in and explaining why in the writeup. On the other hand, if it doesn’t make theoretical
sense, there’s more reason to remove it. Also, consider the fact that generally it is better to leave in an
unnecessary variable than it is to remove a necessary one. So, when you’re not sure, err on the side of
leaving it in. If removing the variable makes the residual plots go from looking good to bad, you almost
definitely want to leave that variable in! Again, you’d need to explain the rationale in the writeup.
Unfortunately, it’s not possible for me to tell you the correct answer because that depends on very
subject-area specific knowledge, theory, and those other details I mention. But those are points I’d
consider.
Adele Manulat says
April 28, 2022 at 10:04 pm
Hello sir, what does it signify or imply when a nonsignificant independent variable has a t-statistic
greater than one?
I learned that removing an independent variable that has a t-statistic greater than 1 decreases the
adjusted R-squared. With this, is it okay to remove that variable and continue with the backward deletion
technique in order to arrive at the best model?
Adele Manulat says
Chandrakant Bhogayata says
April 17, 2022 at 1:44 am
You are right, Jim! t-value is 1.483 and it falls within that range.
Thanks a lot!
Jim Frost says
April 21, 2022 at 2:33 am
You’re very welcome!
Chandrakant Bhogayata says
February 22, 2022 at 11:34 pm
In my case, I performed a backward stepwise linear regression. In the final step, the adjusted R-squared
decreased even after a non-significant predictor was removed. Why?
Thank you very much, Jim!
Jim Frost says
February 23, 2022 at 10:18 pm
Hi,
For understanding when the adjusted R-squared will increase or decrease, the key value to know is the
t-value of 1. If the t-value for a variable is greater than 1, then adding it to the model causes adjusted R-
squared to increase. Alternatively, and for your case, if you remove a variable with a t-value > 1 the
adjusted R-squared will decrease.
The t-value required for statistical significance varies with the degrees of freedom, but for a two-sided
test at the 0.05 significance level it will always be at least 1.96. Consequently, there is a range of
t-values from 1.00 to 1.96 where the variable is not significant but removing it will still cause the
adjusted R-squared to decrease.
I think that is what happened in your case. If you check the t-value when the variable is in the model, I
bet it falls within that range.
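Here is a small numeric sketch of that t-value rule. It relies on two standard facts: for a single added term, the squared t-value equals the partial F-statistic, and adjusted R-squared goes up exactly when the mean squared error goes down. The sums of squares below are hypothetical numbers chosen for illustration, and the function names are made up.

```python
def adjusted_r_squared(sse, sst, n, p):
    """Adjusted R-squared for a model with p predictors plus an intercept."""
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

def t_squared_of_added_term(sse_reduced, sse_full, n, p_full):
    """Squared t-value of the one extra term in the full model (the partial F-statistic)."""
    return (sse_reduced - sse_full) / (sse_full / (n - p_full - 1))

# Hypothetical setup: 30 observations, total sum of squares 100,
# and a 2-predictor model with an error sum of squares of 40.
n, sst = 30, 100.0
sse_reduced = 40.0
adj_reduced = adjusted_r_squared(sse_reduced, sst, n, p=2)

# Case A: the third predictor lowers the error sum of squares to 38.0.
t2_a = t_squared_of_added_term(sse_reduced, 38.0, n, p_full=3)
adj_a = adjusted_r_squared(38.0, sst, n, p=3)

# Case B: the third predictor only lowers it to 39.0.
t2_b = t_squared_of_added_term(sse_reduced, 39.0, n, p_full=3)
adj_b = adjusted_r_squared(39.0, sst, n, p=3)

print(f"2 predictors:      adjusted R-squared = {adj_reduced:.4f}")
print(f"case A, |t| = {t2_a ** 0.5:.2f}: adjusted R-squared = {adj_a:.4f}")
print(f"case B, |t| = {t2_b ** 0.5:.2f}: adjusted R-squared = {adj_b:.4f}")
```

In case A the added term has a t-value above 1 but well short of significance, and adjusted R-squared still rises. In case B the t-value is below 1 and adjusted R-squared falls.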
Angel Lago says
July 15, 2021 at 10:27 pm
Hi Jim,
I have a question. I am trying to run a regression analysis test on my dataset (6 IVs and 1 DV). However,
VICTOR MELLO says
May 20, 2021 at 6:16 pm
Hello Jim. Great website, very clear and easy to follow. I have a question interpreting R2 when
comparing Multiple Linear Regressions with Linear Regressions. It would be great to have your thoughts
on it.
To illustrate, I am trying to find the correlation between a product Sales (Y) and its Prices (X). But not
only the company Prices, also the competition. When I run the Multiple Regression, I tend to get good
Adjusted R2s and Ps.
However, when I isolate those X’s one by one, the R2 tends to decrease. I’ve run a lot of samples, and
find myself going back to this trend. It confuses me since other websites suggest that the multiple
regression could be better in this case.
Do you have a recommendation on which alternative I should go for?
Thank you,
Victor Mello
Jim Frost says
May 20, 2021 at 6:39 pm
Hi Victor,
So, let’s start with simple regression and then move up to multiple regression. Let’s say you have sales
as the DV and in the first model you have Prices as the lone IV. You’ll get an r-squared. Suppose you use
that same dataset but add another IV, Competitor’s Prices, so you now have both IVs. R-squared will
never decrease when you add an IV. Theoretically, it could remain the same. However, in practice, it’ll
always go up by at least a trivial amount, but it can go up substantially.
The thing to remember is that R-squared NEVER goes down when you add an IV (assuming you’re using
the same dataset). Consequently, based on what you write, there is some sort of problem going on
because you’re describing a situation where a simple regression has a higher R-squared than a multiple
regression that has the original IV plus another one. That’s just not possible! I don’t know what the error
is, but there is some sort of serious error.
It IS possible for the adjusted R-squared to decline as you add an IV. I discuss the reasons for that in this
article. But, that doesn’t seem to be the situation to which you’re referring.
Beyond the R-squared issue, in a more general sense, if you have multiple significant IVs, it’s better to
David says
April 11, 2021 at 8:06 am
Hi Jim, thanks for the answer!
Gemechu Asfaw says
April 10, 2021 at 1:53 am
Can the correlation coefficient be positive and the regression coefficient negative at the same time? How
can we interpret the result?
David says
April 9, 2021 at 4:11 am
Hi Jim,
I was wondering about the interpretability of adjusted R-square when not used for model selection: in
particular, I was presented with a model and its adjusted R-square value, but not with its (non-adjusted)
R-square value.
In my understanding, adjusted R-square is a tool used to prevent over-fitting; so it is used when
comparing different model versions to each other: we test whether we should add or leave predictors
out.
Once a model is chosen, the adjusted R-square does not add any information anymore, instead one
should mention the R-square when presenting it.
Is my understanding correct?
Jim Frost says
April 10, 2021 at 12:38 am
Hi David,
Frequently, or almost exclusively, you’ll see adjusted R-squared advertised as the way to compare
regression models with different numbers of predictors. However, there is another purpose for adjusted
R-squared. Regular R-squared is a biased estimator for how much variance the model explains in the
population. It tends to be too high. Sample R-squared values tend to overestimate the population R-
squared. Adjusted R-squared counteracts that by shrinking down R-squared to a point where it’s not a
biased estimator. Consequently, that’s another reason to report adjusted R-squared even when model
selection is done. I almost never see that in practice but I actually think it’s a good idea.
hyudatan says
April 6, 2021 at 9:42 am
Could you please explain why the difference between the adjusted R-squared and predicted R-squared
is preferably less than 0.2?
Stan Alekman says
April 5, 2021 at 5:14 pm
Jim,
You describe R-squareds as explanatory statistics. I think of it as accounting terms: Proportion of
variation in response captured by the fitted model, which is therefore a relative measure of the
goodness of fit. I don’t know if we are saying anything different.
However RMSE is an absolute measure of the accuracy of a fitted model in the units of “Y”. The RMSE
gives the SD of the residuals. The RMSE estimates the concentration of the data around the fitted
equation. It can be used to compare models when the response variable does not change. I always
compare RMSE when terms are added or subtracted from a model.
Regards,
Stan Alekman
Jim Frost says
April 5, 2021 at 5:39 pm
Hi Stan,
Statisticians frequently use both the “explain” and “account for” wording interchangeably. The model
explains X% of the variability. The model accounts for X% of the variability.
R-squared is a relative measure. As you say, RMSE is an absolute measure. I often like using the similar
but subtly different, Standard Error of the Regression (S), which is also an absolute measure but it
adjusts for the number of terms, making it more akin to adjusted R-squared whereas RMSE is more
closely related to R-squared. I’ve written a post, Standard Error of the Regression vs. R-squared that
looks at the differences. In that post, I explain why I like S quite a bit!
One point to be aware of with the RMSE is that it always decreases as you add terms, similar to how R-
squared always increases. S uses the same adjustment as adjusted R-squared and can actually increase.
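A quick numeric sketch of that difference follows. It assumes RMSE is computed as the square root of SSE / n, which is the version that can only go down as terms are added, and S as the square root of SSE / (n - p - 1), where p is the number of predictors. The error sums of squares are hypothetical numbers, not output from a real model.

```python
import math

def rmse(sse, n):
    """Root mean squared error with no adjustment for the number of terms (assumed definition)."""
    return math.sqrt(sse / n)

def standard_error_of_regression(sse, n, p):
    """S: divides by the error degrees of freedom, n - p - 1, instead of n."""
    return math.sqrt(sse / (n - p - 1))

# Hypothetical case: 20 observations, and adding a third predictor
# barely reduces the error sum of squares (50.0 down to 49.5).
n = 20
rmse_before, rmse_after = rmse(50.0, n), rmse(49.5, n)
s_before = standard_error_of_regression(50.0, n, p=2)
s_after = standard_error_of_regression(49.5, n, p=3)

print(f"RMSE: {rmse_before:.4f} -> {rmse_after:.4f}")
print(f"S:    {s_before:.4f} -> {s_after:.4f}")
```

RMSE improves slightly simply because the error sum of squares shrank, while S gets worse because the small gain does not pay for the lost degree of freedom.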
Best wishes,
Jim
John Hogenbirk says
March 28, 2021 at 4:25 pm
Hi Jim,
I appreciate your easily understood explanations of all things statistical.
Jim Frost says
March 29, 2021 at 3:02 pm
Hi John,
Thanks for the question. I should make this more clear in the blog post itself!
I mean a difference of 10 percentage points. And that’s just a rule of thumb. But I don’t typically worry
much when it’s lower than that.
Aoun ALi says
January 25, 2021 at 9:48 am
Hi sir Jim,
I need your help in understanding the following :
R-squared value is 0.098
Adjusted R-squared value is 0.079
R is .312
where my significance level is .027
What does it mean?
Jim Frost says
March 26, 2021 at 2:44 am
Hi Aoun,
You have a low R-squared of 0.098 or 9.8%. That indicates that your model doesn’t explain much of the
variance in the dependent variable. However, if the p-value for your overall significance is 0.027 (I think
that’s what you’re saying but please correct me if that’s wrong), it is a significant model. It means that
you likely have at least one significant independent variable in your model. So, the model has a low
R-squared but with one or more significant IVs. That sounds confusing but I’ve written a post on exactly
that topic!
How to Interpret Regression Models that have Significant Variables but a Low R-squared
That post should answer your questions!
anniiez says
January 21, 2021 at 3:07 am
Can I ask a question? In my experiment, i.e. RSM CCD, I have an R-squared of 0.6304, the difference
between the predicted (0.5280) and adjusted (0.5853) R-squared is less than 0.2, the lack of fit is not
significant, and the model is significant. Can I use this model to predict the response?
Jim Frost says
January 22, 2021 at 12:38 am
Hi Anniiez,
There are several issues to consider. First, check those residual plots! If those look good, it’s a good sign
that your predictions won’t be biased. Even when you have a high R-squared, it’s possible to have biased
predictions, and checking the residual plots helps you avoid that.
The R-squared values relate to the precision of your predictions. There’s not a huge drop between your
R-square and predicted R-squared. So, that’s a good sign. However, the overall values for both are not
particularly high. High R-squared values aren’t important for all models, but they become important
when you want to make predictions. Lower R-squared values indicate that your predictions will be less
precise. It’s possible that your predictions will be so imprecise that they won’t be useful.
Here are a couple of posts I wrote that you should read to help you understand the issue of prediction
precision:
Prediction Precision
Low R-squared Values and Prediction Precision
I hope this information helps!
Ana Ferreira says
December 16, 2020 at 12:22 pm
What does it mean when the adjusted r-squared evidently increases between models, however, the
regression coefficient stays the same? Does this mean our explanatory variable is still a suppressor, or
due to the unchanged coefficient we cannot say this.
Jim Frost says
December 17, 2020 at 10:56 pm
Hi Ana,
It means that whatever variables you added, they are likely to be worthwhile additions to the model.
The model is explaining more of the variance in the dependent variable even when you account for
the fact that you’re using more variables. So, that’s a good sign!
The fact that the coefficient of the explanatory variable in question didn’t change is neither a good nor a
bad thing really. What it means is that the new variables in the model are probably not correlated with
your variable of interest (VOI). If they had been correlated with both the VOI and the dependent
variable, when they weren’t in the model their absence would’ve been causing omitted variable bias in
the VOI. Adding in those variables reduces that bias and causes the coefficients to change. However,
that doesn’t appear to be the case because the VOI’s coefficient didn’t change. Consequently, the
interpretation of the VOI doesn’t change. In other words, you’d continue to interpret it the same way in
the new model.
I hope that helps!
Jim Frost says
December 3, 2020 at 1:53 am
Hi Joseph,
There’s actually a good reason to know adjusted R-squared when you have only one variable. Typically,
you think of adjusted R-squared for helping you compare models with differing numbers of predictors.
However, there’s another use/interpretation of adjusted R-squared. It turns out regular R-squared is a
biased estimator. The R-squared in your statistical output tends to be higher than the correct population
value for R-squared. Adjusted R-squared shrinks down the regular R-squared to an unbiased value.
Adjusted R-squared doesn’t tend to be too high or too low on average. You can read more about that in
my post about Five Reasons Why Your R-squared Can Be Too High.
Bobby says
November 21, 2020 at 10:44 pm
Hi Jim! Thank you for the insightful article.
I just wanted to ask, there are times when an independent variable in a multiple regression model is not
95% statistically significant (e.g. a p-value of 0.07). However, upon removing this variable from the
model, the adjusted R squared value also decreases. Hence, my question is which provides a better
measure of what model to use? Should I be removing predictors based off their p-value? Or rather,
should I be adding and removing variables based on the adjusted R squared? Which one takes priority?
Thanks for your help!
Bobby
Aaron says
November 21, 2020 at 2:32 am
Hi Jim!
What is wrong with this kind of thinking: “I understand that R-squared is not a perfect measure of the
quality of a regression equation because it always increases when a variable is added to the equation.
Once we adjust for degrees of freedom by using Adjusted R-squared, though, it seems to me that the
higher the Adjusted R-squared, the better the equation.”
Jim Frost says
October 13, 2020 at 1:52 pm
Hi Amogh,
These two statistics are telling you different things. Adjusted R-squared includes a shrinkage factor to
counteract the fact that regular R-squared is a biased estimator. Sample R-squared values tend to be
higher than the true population value and adjusted R-squared corrects for that bias.
Predicted R-squared indicates how well a model without each observation would predict that
observation.
Because what they measure is so different, it’s not surprising that the results can be different. I find that
predicted R-squared tends to be more sensitive to models that are overly complicated. Overfitting is
when the model starts to fit the random noise in the data. Because random noise is, by definition, not
predictable, this problem shows up in the predicted R-squared. Adjusted R-squared is not designed to
detect that problem–hence it doesn’t show up there.
I hope that helps clarify it!
Kyle Seibenick says
September 29, 2020 at 9:09 pm
Hi Jim. Thanks for writing the regression ebook, this is a great refresher and enhancement of my skills.
I already saw this question and your response was “you don’t know of any easy way”. So I’ll ask – do you
know of a hard way or manual way to calc predicted r2 (or “PRESS” as I’m seeing in other places) in
Excel?
Tomingan says
August 16, 2020 at 6:05 am
Hi Jim
How can we conclusively tell that the number of IVs is optimal for a given DV? As you mentioned, the
more IVs we add, the more the R-squared will continue to increase. So when or where is the stopping
point? Is there any simple test that can be done? Please help, Jim. Thank you.
Jim Frost says
August 17, 2020 at 9:11 pm
Tomingan says
August 16, 2020 at 4:31 am
Hi Jim
I have these readings: r = .344, r squared = .118, and adj. r squared = .084, from 1 DV and 5 IVs. My initial
analysis is that there is a low positive correlation. About 11.8% of the DV is explained or supported by
the IVs. There is no telling whether 5 IVs is a sufficient number. I believe 5 IVs is not the optimum
number. What other testing can we do to identify the optimum number of IVs? Thank you Jim.
Tomingan says
August 16, 2020 at 3:38 am
Hi Jim
I value greatly your comment on the r and r squared as well as the adjusted r squared. Can you please
indicate the best reference for this please. Thank you
Maxim says
August 12, 2020 at 6:42 am
Hello Jim,
Thank you for your blog! It helps a lot in doing my research. Could you please provide any reference for
the predicted R squared? I have found a method for its calculation, but all I can reference so far is
various posts in the internet. Thank you!
Derek says
August 3, 2020 at 2:57 pm
Hey Jim,
I appreciate you sharing this article. I know that if an adjusted r-squared is 0.58, then the independent
variables in my model collectively account for 58% of the variability in the dependent variable around its
mean. I know that this is a basic question, but how would the interpretation differ if the predicted r-
squared
Jim Frost says
August 5, 2020 at 12:23 am
Hi Derek,
Typically, analysts use adjusted R-squared to compare models with different numbers of predictors, as I
show in the post. But, interestingly, it has its own unique interpretation. While regular R-squared is the
amount of variation the model accounts for in your sample, adjusted R-squared is an estimate for how
much your model accounts for in the population. I write about this interpretation in this post.
But, on to your question! For predicted R-squared, the interpretation is the amount of variability that
your model accounts for in new observations that were not used during the parameter estimation
process.
Simon McGree says
July 13, 2020 at 2:23 am
Hi Jim
Thank you for your earlier reply to my comment. Below is a summary of my analysis
I have 43 years of annual crop yield data and 360 climate indices (rain, maximum and minimum
temperature individual month and seasonal combinations). I use PCA to reduce the number of climate
variables and deal with multicollinearity. The scree plot shows no obvious elbow so I retain 32 PCs or
99.9% of the variance. Some of the variables have a weak relationship with sugarcane so it is possible
the first PCs have a weak relationship with sugarcane, another reason to perhaps retain more PCs. I then
examine the absolute value of the PC coefficients, I select the climate variable with the highest
coefficients to represent that PC.
I then use stepwise regression backward elimination. I stop at the highest R-sq predicted.
While my paper was under review, I received the following comment: “All data are used to screen for
hindcast skill, and hence there is potential for ‘artificial skill’. The authors indicate that they used
‘leave-one-out cross-validation’. However, they are using PCA, which utilises all data in the calculation
of the principal components. When this is done, statistical models have artificial skill in cross-validation
mode. Statistical models so derived will be useless in actual prediction; their apparent skill results from
the fact that the cross-validation is not truly on independent data because the entire sample was used to
screen the predictors from PCA.”
Appreciate your thoughts.
Rao says
June 27, 2020 at 2:00 am
Hi Jim,
Thanks for a great blog.
I’m curious why you say that adjusted R-square has no associated p-value.
The difference between R-square and adjusted R-square is in their degrees of freedom. I assume the
Jim Frost says
June 27, 2020 at 4:09 pm
Hi Rao,
I’ve never seen one developed or used. Usually, adjusted R-squared is used to compare models with
differing numbers of predictors. Even with regular R-squared, you don’t usually see it discussed in
relation to its p-value.
I don’t really know why. I’ve never seen a discussion of this issue. R-squared is a biased estimator
whereas adjusted R-squared is not. I agree with your reasoning, but I don’t have an answer for why it’s not
done.
Daisy says
June 18, 2020 at 6:28 am
Hi Jim, this is a hugely helpful website.
I am trying to calculate predicted R2 in Stata following mixed effects ML regression. Do you have any
syntax for how to create it?
Jim Frost says
June 18, 2020 at 12:14 pm
Hi Daisy, sorry, I’m not a Stata user so I don’t know what command you’d use.
Mukhtar says
June 9, 2020 at 8:46 am
Thanks for the comment and suggestions. I really appreciate your effort in educating masses through
your blog.
Jim Frost says
June 10, 2020 at 12:09 pm
You’re very welcome, Mukhtar!
Mukhtar says
June 5, 2020 at 11:54 pm
Hello Jim,
Is there any benchmark for the difference between the R-square, R-square (adj), and R-square (pred) values?
I have the following case and suspect the model is overfit.
Source         DF    Adj SS    Adj MS  F-Value  P-Value
Regression      5  0.174696  0.034939    14.81    0.000
  Vc            1  0.033814  0.033814    14.33    0.001
  ap*2          1  0.162968  0.162968    69.06    0.000
  fr*2          1  0.143943  0.143943    61.00    0.000
  A*2           1  0.015032  0.015032     6.37    0.020
  ap*fr         1  0.151329  0.151329    64.13    0.000
Error          21  0.049556  0.002360
  Lack-of-Fit   3  0.008540  0.002847     1.25    0.321
  Pure Error   18  0.041017  0.002279
Total          26  0.224253

Model Summary
        S    R-sq  R-sq(adj)  R-sq(pred)
0.0485781  77.90%     72.64%      62.38%

other models have the following results, with all p-values significant
        S    R-sq  R-sq(adj)  R-sq(pred)
0.0587089  81.32%     76.87%      69.63%

        S    R-sq  R-sq(adj)  R-sq(pred)
0.0058268  73.94%     69.21%      60.91%
Thanks……
Jim Frost says
June 8, 2020 at 3:50 pm
Hi Mukhtar,
There’s no standard guideline that I’m familiar with. But, I always start to worry when the difference is
greater than 10%. For overfitting, you also need to consider the number of observations per model
term. Given your output, I’d say you have some reason for concern about overfitting. I think reading the
other post will help you out.
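For anyone who wants to check output like Mukhtar's by hand, the sketch below rebuilds S, R-squared, and adjusted R-squared from the sums of squares and degrees of freedom in the first ANOVA table. The formulas are the standard ones and the variable names are made up for the example. Predicted R-squared can't be rebuilt this way because it needs the raw data, so the reported 62.38% is used as given. The gap is measured here between R-squared and predicted R-squared, which is one reading of the 10% rule of thumb.

```python
import math

# Values copied from the first ANOVA table above.
ss_error, df_error = 0.049556, 21
ss_total, df_total = 0.224253, 26
reported_pred_r2 = 62.38  # percent, taken from the Model Summary as reported

s = math.sqrt(ss_error / df_error)                            # standard error of the regression
r2 = 1 - ss_error / ss_total                                  # R-squared
adj_r2 = 1 - (ss_error / df_error) / (ss_total / df_total)    # adjusted R-squared

gap_points = r2 * 100 - reported_pred_r2      # R-squared minus predicted R-squared
obs_per_term = (df_total + 1) / 5             # 27 observations, 5 model terms

print(f"S = {s:.7f}")
print(f"R-sq = {r2 * 100:.2f}%, R-sq(adj) = {adj_r2 * 100:.2f}%")
print(f"R-sq minus R-sq(pred) = {gap_points:.2f} percentage points")
print(f"observations per term = {obs_per_term:.1f}")
```

The recomputed S, R-squared, and adjusted R-squared agree with the Model Summary, and the drop from R-squared to predicted R-squared is more than 10 percentage points.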
Parikshit says
May 28, 2020 at 12:22 pm
Thanks Jim.
Parikshit says
May 27, 2020 at 2:10 pm
Jim Frost says
May 27, 2020 at 4:59 pm
Hi,
For an R-squared to be statistically significant, the overall F-test for the model must be significant. To be
practically significant, that depends on the field of study.
Use predicted R-squared to assess prediction, not adjusted R-squared. There’s no exact guideline for
how close it must be. I start to worry when the difference is more than 0.1 (10%). However, you
probably should be assessing the precision of the prediction as I describe in this post about S vs. R-
squared.
Anne Wambui says
May 27, 2020 at 11:35 am
Hello, Thank you for the explanations. I have a question. I have used multiple regression to compare
three groups. When I removed one variable, the model was no longer significant for one group and other
independent variables became significant (they were not before), R-squared decreased significantly for
the second group, and one group had a slight decrease in R-squared. How can I interpret this, or what
does it mean for this factor?
Jim Frost says
May 27, 2020 at 5:08 pm
Hi Anne,
If you find that the significance of a predictor changes depending on specifically which variables are
included in the model, you might well have multicollinearity (correlated IVs). Read my post about
multicollinearity for more information.
Heidi says
May 26, 2020 at 12:27 pm
I completed a multiple regression analysis in Excel with three independent variables and the results show
an R-squared value of 0.11 but an adjusted R-squared of 0.98. How could these values be so different?
Also, Excel doesn’t give a predicted R-squared value. Is there another [easy] way to get it? The residuals
show values for the predicted [dependent variable] but that can’t be it.
Jim Frost says
May 26, 2020 at 9:38 pm
Hi Heidi,
I’m so glad my blog has been helpful!
For the first thing, it’s impossible for the R-squared value to be lower than the adjusted R-squared for
the same model. There’s something off there. I don’t think there’s any easy way to get predicted R-
squared with Excel.
It is possible to have a large difference between R-squared and adjusted R-squared. However, adjusted
R-squared will always be smaller than R-squared. If there is a large difference, it might indicate you have
too many predictors (IV) in your model. It comes down to the number of observations per term in your
model. To see how this works, look at my post about Five Reasons Why Your R-squared can be Too High.
In the first reason, you’ll read about adjusted R-squared and see a graph that shows how adjusted R-
squared decreases by the sample size per term.
Reply
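To see why adjusted R-squared cannot exceed R-squared, it helps to write out the usual formula for a model with a constant: adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of predictors. The sketch below is only an illustration. The function name `adjusted_r2` and the sample sizes are assumptions, because the comment above does not state how many observations were used.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with an intercept.

    r2: regular R-squared, n: observations, p: predictors (not counting the constant).
    """
    if n - p - 1 <= 0:
        raise ValueError("need more observations than model terms")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# R-squared of 0.11 with 3 predictors, for several hypothetical sample sizes.
for n in (6, 10, 30, 100):
    print(n, round(adjusted_r2(0.11, n, 3), 3))
```

Because (n - 1)/(n - p - 1) is at least 1 whenever p is at least 1, the adjusted value is never above the regular R-squared, and it turns negative when there are few observations per term.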
Julie Nielsen says
May 8, 2020 at 7:07 am
Hi Jim,
I have two models where I add time fixed effects and robust, clustered standard errors. When I add
FE and robust, clustered standard errors to my models, model 1’s R-squared increases while model
2’s R-squared becomes negative (from 0.301 to -0.385). If I look at the coefficients in the two models,
none of them seems to be significant, but I don’t understand how one of the R-squared values can become
negative?
Reply
ankita says
May 7, 2020 at 3:30 pm
Sir, I am getting a predicted R-squared value of zero. Is it normal? Please help me out.
Reply
Jim Frost says
May 7, 2020 at 3:38 pm
Hi Ankita,
Yes, it’s possible. Unlike regular R-squared, both adjusted and predicted R-squared can fall below 0%. In
terms of interpretation, just interpret it as if it were 0%. It’s not good. Usually when you get a negative
value, it means you have a very small sample size along with an overly complex model.
Reply
Jim Frost says
April 25, 2020 at 1:42 am
Hi Elizabeth,
It sounds like your four IVs explain a very low proportion of the variance in the DV. Are any of the p-
values for the coefficients statistically significant?
What’s the p-value associated with the F-statistic? You’ll usually only interpret the p-value for the F-
statistic rather than the F-value itself. You can read my post about the overall F-test for more
information.
Reply
Swapnil says
April 7, 2020 at 9:02 am
Hi Jim,
Can you please help me out with this data? Is it statistically significant or not?
Model Summary
      S    R-sq  R-sq(adj)  R-sq(pred)
52.0410  97.63%     93.49%      78.69%
Analysis of Variance
Source    DF  Adj SS  Adj MS  F-Value  P-Value
Model      7  446694   63813    23.56    0.004
  Linear   7  446694   63813    23.56    0.004
    A      1  150522  150522    55.58    0.002
    B      1     885     885     0.33    0.598
    C      1  138967  138967    51.31    0.002
    D      1  118108  118108    43.61    0.003
    E      1   19212   19212     7.09    0.056
    F      1    9624    9624     3.55    0.133
    G      1    9377    9377     3.46    0.136
Error      4   10833    2708
Total     11  457527
Reply
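As a cross-check on the output in the comment above, the summary statistics can be recomputed from the sums of squares and degrees of freedom in the Analysis of Variance table. This is a minimal sketch that only uses the numbers posted in the comment. The variable names are illustrative, and it does not reproduce R-sq(pred), which needs the raw data.

```python
# Values copied from the Analysis of Variance table in the comment above.
ss_model, df_model = 446694, 7
ss_error, df_error = 10833, 4
ss_total, df_total = 457527, 11

# R-squared: share of the total variation not left in the error term.
r2 = 1 - ss_error / ss_total
# Adjusted R-squared compares mean squares instead of raw sums of squares.
adj_r2 = 1 - (ss_error / df_error) / (ss_total / df_total)
# Overall F-value: model mean square divided by error mean square.
f_value = (ss_model / df_model) / (ss_error / df_error)
# S: standard error of the regression, the square root of the error mean square.
s = (ss_error / df_error) ** 0.5

print(f"R-sq = {r2:.2%}, R-sq(adj) = {adj_r2:.2%}, F = {f_value:.2f}, S = {s:.2f}")
```

The recomputed values agree with the posted output to the precision shown. Note that the table describes 12 observations (Total DF of 11) and 7 predictors, which leaves only 4 error degrees of freedom.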
Jim Frost says
April 7, 2020 at 11:53 pm
Hi Swapnil,
Please read my post about regression coefficients and p-values. That post will show you how to
Suku says
April 4, 2020 at 7:01 am
Hi Jim,
I need your help in understanding the following:
R-squared value is 0.018
Adjusted R-squared value is -0.024
R is 0.133
What does a negative adjusted R-squared value indicate about the relationship between 1 DV and 1 IV?
Thanks
Suku
Reply
Jim Frost says
April 5, 2020 at 6:53 pm
Hi Suku,
Just interpret the negative value as if it were zero. Your model does not explain variability in the DV.
Reply
Jim Frost says
March 29, 2020 at 7:49 pm
Hi Alfer,
I suspect that you’re referring to the increase in R-squared that occurs when you include
each predictor in the model last. That’s not exactly “splitting” the R-squared but I think it is what you’re
referring to. I’ve written a post that talks about this method as a way of determining the importance of
each predictor. I’d read that post to see if it answers your questions!
Reply
Jim Frost says
February 20, 2020 at 11:07 am
Hi Ida,
There’s no p-value for adjusted R-squared. Typically, you use it to compare models with different
numbers of predictors/IVs. It’s more for comparing models rather than determining statistical
significance. However, there is a p-value for the regular r-squared, although you might need to hunt for
it in the statistical output. The F-test of overall significance produces a p-value. When that p-value is less
than your significance level, you can reject the null hypothesis that R-squared equals zero.
I hope this helps!
Reply
Simon McGree says
February 16, 2020 at 7:16 pm
Hi Jim
I’m concerned I have over fitted my models but first let me give you a bit of background.
I have 43 years of annual sugarcane and sugar data. I have 852 climate indices (rain, maximum and
minimum temperature individual month and seasonal combinations). I use PCA to reduce the number of
climate variables and deal with multicollinearity. The scree plot shows no obvious elbow so I retain 25
PCs or 99.9% of the variance. Some of the variables have a weak relationship with sugarcane so it is
possible the first PCs have a weak relationship with sugarcane, another reason to perhaps retain more
PCs. I then examine the absolute value of the PC coefficients, I focus on the four climate variables with
the four highest coefficients. The representative variable for each coefficient that I take to the next
stage is the one that has the strongest correlation coefficient with sugarcane and sugar yield
respectively.
I then use stepwise regression backward elimination. I stop at the highest R-sq predicted. For the
sugarcane model I have an adjusted R squared of 79% and predicted R squared of 73% (DF = 8). For the
sugar model I have an adjusted R squared of 81% and predicted R squared of 73% (DF=11). How am I
doing? Appreciate your thoughts.
I repeated the above with 70% of the variance retained. The adjusted and predicted R-squared values are
much lower. It appears some key climate variables are lost by only retaining 70%.
Regards
Simon
Reply
Jim Frost says
Rei says
January 28, 2020 at 8:52 am
So I’ve got to do a paper using regression analysis. I use three models (linear, quadratic, and exponential)
for comparison. Each of them got:
Linear R2: 0.197 R2 adj: 0.875
Quadratic R2: 0.931 R2 adj: 0.794
Exponential R2: 0.919 R2 adj: 0.879
Which model should I choose?
Reply
Jim Frost says
January 28, 2020 at 11:15 pm
Hi Rei,
Choosing a model is more than just going by several R-squared statistics! Check graphs and theory. For
more information, read my post about choosing the correct model. It’s not even possible to say that any
of those three are the correct model with the information provided. And, if your model has curvature,
which seems likely, read my post about curve fitting, which describes different methods and how to
compare the resulting models.
Best of luck with your analysis!
Reply
MD VASEEM CHAVHAN says
January 25, 2020 at 4:35 am
Thanks for the explanation.
Please comment on the following model.
Model Summary
      S    R-sq  R-sq(adj)  R-sq(pred)
14.7955  99.33%     88.60%       0.00%
Reply
Jim Frost says
January 25, 2020 at 4:58 pm
Hi,
Your example closely matches the example that I use in the section of this post titled, “Example of an
Overfit Model and Predicted R-squared.” Read that section more closely. You have an overfit model.
I’ve also written a post about overfitting models that will help you understand.
Reply
Jim Frost says
January 14, 2020 at 11:05 am
Hi Mahesh,
I’m not really sure. I’m drawing a blank as to why the procedure would be able to calculate R-squared
and adjusted R-squared but not predicted R-squared. Is there a chance that you have only three
observations? I’m thinking of a scenario where you have enough degrees of freedom to fit the model
when you use all the observations but not enough for predicted R-squared where you’re systematically
fitting the model multiple times where each time one observation is removed. That would suggest you
have just barely enough degrees of freedom to begin with and you’re probably overfitting the model
anyway. But, when one observation is removed you no longer have a sufficient number of DF.
I’m not sure that’s what is happening but it’s one possible scenario.
Reply
rezvan says
October 25, 2019 at 3:54 pm
Hi Jim,
I am writing my paper about optimizing the leaching process of Cd by RSM using DX7. I obtained R2 =
0.79, adjusted R2 = 0.74, and predicted R2 = 0.59. The Box-Cox analysis in the software suggested that I
normalize the data by changing λ from 1 to 3. The results then change as follows: R2 = 0.85, adjusted R2 =
0.80, and predicted R2 = 0.71. The other statistics, such as the F-value and P-value, stay
approximately the same in terms of being significant or not significant. I am unsure whether I should apply
the transformation or not.
Thanks
Reply
Jim Frost says
October 25, 2019 at 4:11 pm
Hi,
Check the residual plots for the model that does not transform the data. If the residual plots look good,
you don’t need to transform the data. On the other hand, if you see a problem in the residual plots, such
Mike says
September 28, 2019 at 1:11 pm
Hi Jim.
Great Article. I would like some advice. I’m trying to build a linear regression model. I’ve determined
what the control variables are going to be based on prior knowledge and previous literature. I now need
to work out which of my 7 predictors to include in my final model with those control variables. In the
past I have decided on which predictors to include in the final model based on significance, adding those
with a p value <0.10. However, I've been speaking to a statistician, and instead they recommend
choosing the model with the best adjusted r2 value. I've seen lots of studies using my usual method for
variable selection, but I haven't come across any that selected variables based on adjusted r-squared
values. So, I'm just wondering whether you would recommend choosing the model with the highest
adjusted r-squared value, and whether you know of any papers that have selected variables for the final
model using this method? Looking forward to hearing from you.
Mike
Reply
Jim Frost says
September 29, 2019 at 2:08 pm
Hi Mike,
Choosing the correct model is almost as much of an art as it is a science. One thing I always highlight is
the need to incorporate your subject-area knowledge about the underlying process/research question.
Never go solely by statistical measures. I’d also add to that by saying, there’s no single statistical
measure that is best. In fact, the various measures can disagree. Adjusted R-squared is a good one to
keep an eye on, but it can lead you astray. For example, if you start to overfit your model, the adjusted R-
squared can look great, but your coefficients and their p-values are all messed up (technical term
there!). Chasing a high R-squared or adjusted R-squared can lead to problems.
Also, it’s important at least to pay attention to the p-values of the coefficients. If you include too many
variables that are not significant it reduces the precision of your model. Taken further, it can lead to the
overfitting I referred to before. However, if you have to choose between the possibility of leaving out an
important variable even though it’s not significant versus leaving it in even though you’re not sure, yes,
it’s generally better to include it. And, perhaps that’s the thinking behind the recommendation.
However, you shouldn’t take that too far!
I’d suggest reading my post about specifying the correct model. And, then for an illustration of how R-
squared and adjusted R-squared can lead you astray, read my posts about overfitting and data mining
Merpati says
May 27, 2019 at 12:05 am
Hello! I want to ask. All my R2, adj. R2 and predicted R2 got the value of 1.0000.
Is it acceptable? And if possible, could you help me interpret this information, because I am not so good
at statistical analysis. By the way, the results were from three independent variables (pressure, time and
temperature) with one dependent variable (antioxidant activity).
I hope that I can improve my understanding of this matter.
Thank you in advance ^^
Reply
Jim Frost says
May 27, 2019 at 9:44 pm
Hi,
Unfortunately, no, that’s not normal. Usually you only obtain an R-squared of 1 under several related
problematic circumstances.
If you fit a model that contains the same number of independent variables as observations, you’ll always
get an R-squared of 1 (or 100%).
If you overfit a model, which means too many terms for your number of observations, you can get the
same thing.
This can also happen with an automated procedure such as stepwise regression with a relatively small
dataset and lots of candidate predictors.
I’m not sure what is going on with your data. If it’s a physical process where the measurements are very
precise/accurate and there’s extremely low noise in the system, you can get R-squared values in the
90-99% range. Unless your software is rounding up, I’d be very skeptical. I’ve never seen a legitimate
100% in practice. 100% would indicate no random error in the model at all AND no measurement error
at all. That just doesn’t happen in the real world. I’m assuming this is real world data rather than generated
data.
Reply
Jeff says
May 24, 2019 at 7:07 pm
It’s incredible how clearly and simply you can explain difficult concepts. Thank you, really.
Reply
Ronnie says
Jim Frost says
April 16, 2019 at 11:51 am
Hi Ronnie,
I’m really happy to hear that you found my site to be helpful!
Regular R-squared should be greater than Predicted R-squared. The model can’t predict new
observations better than the data used to fit the model. You might be thinking of the test R-squared.
The test R-squared is generally lower than the Predicted R-squared. A test R-squared is based on
validation data. The software uses an existing model and a new dataset to see how well the model
predicts values that were not used to estimate the model.
To make good predictions, you want Predicted R-squared to be close to the regular R-squared. And, you
want the test R-squared to be close to the Predicted R-squared.
For your dataset, it appears like the regular R-squared and predicted R-squared are not that close. This
condition indicates that your model doesn’t predict new observations as well as it fits the data used to
fit the model. Chances are that your test R-squared would be even lower than the predicted R-squared.
I’m not knowledgeable about modeling spectral data, so I’m not sure how this fit compares to similar models
and industry standards. I’d recommend doing some research to see what sort of fit is typical for this type
of data and see how your model compares. Some study areas are inherently more or less predictable
than other areas. So, I can’t really say whether the fit you’ve obtained is “justifiable.” The basic question
you need to answer is whether the fit you obtain is representative of the study area and really the best
you can do given the nature of the data. Or, do you need to improve the model to obtain a better fit.
Those answers depend on subject-area knowledge.
Reply
Allan Paolo says
April 1, 2019 at 11:00 am
Hi again Jim!
I just want to take time to thank you. Thanks to this article (and to you of course) I was able to get my
master’s degree. Thanks a lot!
Reply
Jim Frost says
April 1, 2019 at 4:46 pm
Patrik Silva says
November 19, 2018 at 4:25 am
Dear Jim!
I would like to know if you can clarify some of these points for me:
In the section titled “What is the Predicted R-Squared”, I read this:
“Statistical software calculates predicted R-squared using the following procedure:
1 – It removes a data point from the dataset.
2 – Calculates the regression equation.
3 – Evaluates how well the model predicts the missing observation.
4 – And, repeats this for all data points in the dataset.”
a) Is this procedure the same as what is called LOOCV (Leave-One-Out Cross-Validation)?
b) Which values do we compare to R-squared? Do we need to record the R-squared each time that
we leave one out, until the last observation?
I want to understand this procedure to see which statistic it corresponds to in SPSS software.
Thank you in advance!
PS
Reply
Al says
October 6, 2018 at 4:22 pm
Hi Jim,
Very helpful post.
Regarding the issue of “how much of the variation in the y values does the regression model explain”
1. The adjusted-R-squared is the answer to this for multiple regression, yes?
2. Why don’t we also use adjusted-R-squared when answering the question for simple regression? (most
stats textbooks use R-squared for this)
In general, when comparing a regression model with one independent variable to a model with multiple
independent variables — do we compare them on adjusted-R-squared, or do we compare the adjusted-
R-square of the second model with the R-squared of the first?
Thanks,
Al
Reply
Jim Frost says
October 6, 2018 at 9:46 pm
Hi Al,
These are great questions! And, there is confusion in this area because many people don’t know exactly
what R-squared measures.
Let’s start with the easy part. When you’re comparing models with different numbers of independent
variables, use adjusted R-squared. Specifically, compare the adjusted R-squared from one model to the
adjusted R-squared values of the other models. Don’t use the regular R-squared for any of the models.
Now, onto which R-squared to report for what models. Typically, analysts will report the regular R-
Jonathan says
August 21, 2018 at 11:42 am
Hello,
I have a challenge here. I have an R-squared of 0.6596 and an adjusted R-squared of -0.3617! How can this be
interpreted? What can you say about this?
Thanks
Reply
Jim Frost says
August 23, 2018 at 2:19 am
Hi Jonathan,
Chances are that you are severely overfitting your model. You probably have very few observations per
model term. To learn more about this problem, read my post about overfitting!
Reply
Kripa says
August 7, 2018 at 9:35 pm
Hi Sir,
Can you help me to interpret R squared value of .166 and Adjusted R squared value of .158?
Reply
Jim Frost says
August 8, 2018 at 4:55 pm
Hi Kripa,
These blog posts should provide you with enough information so you know how to interpret these
values.
The R-squared value indicates that your model accounts for 16.6% of the variation in the dependent
variable around its mean. That’s usually considered a low amount. You typically interpret adjusted R-
squared in conjunction with the adjusted R-squared values from other models. Use adjusted R-squared
to compare the fit of models with a different number of independent variables.
Additionally, regular R-squared from a sample is biased. It tends to over-estimate the true R-squared for
the population. Adjusted R-squared corrects for most of that bias, so it is a better estimate of the population value.
I hope this helps!
Tejaswi Dalavi says
July 8, 2018 at 2:53 am
What is the exact difference between R-squared and adjusted R-squared? Which is better?
Reply
Jim Frost says
July 8, 2018 at 3:02 am
Hi Tejaswi, you’re in the right place to learn about the differences. This blog post describes adjusted R-
squared. In it, there’s a link to my blog post about the regular R-squared. Between the two posts, you’ll
know all about both types. Adjusted R-squared is the better of the two. Although, my favorite is actually
predicted R-squared.
Reply
Juan says
July 6, 2018 at 4:16 am
Dear Jim,
Thanks a lot for your response, it answered some questions that I had for quite some time without
finding a clear/understandable explanation. I will certainly continue to follow the blog, it is a very
valuable source of information, especially for us non-statisticians. I have already recommended it to my
colleagues and I’m sure they will agree with me.
Best regards,
Juan F.
Reply
Jim Frost says
July 6, 2018 at 10:58 am
Thanks so much, Juan. I appreciate that!
Reply
Juan says
July 3, 2018 at 4:06 am
Dear Jim,
Thanks for your explanation and fast response. Congratulations on such a good blog, it is very valuable
to be able to discuss / understand these topics in a more friendly manner.
With respect to my question, I still have a couple of doubts.
– I can understand that one could obtain a high R2 and R2 (adj) in a model with significant curvature but,
shouldn’t the R2(pred) be generally low?
– isn’t the prediction power of the regression covered by including in the regression equation the center
point?
Jim Frost says
July 3, 2018 at 2:39 pm
Hi Juan,
Yes, it’s definitely possible that Predicted R-squared would be affected by inadequately modeling the
curvature. However, the degree to which the lack-of-fit affects it depends on how inadequate the fit is
and the number of observations. So, I couldn’t tell you specifically for your case whether it would be low
or not. But, definitely the lack of fit would impact it to some degree.
Center points allow you to detect curvature but are not sufficient to model the curvature.
I would agree, as I mention in my previous response, that I would not use the model to make predictions
when you know that it inadequately fits curvature that is present in the data. In that sense, yes, it
doesn’t matter what Predicted R-squared is because you know the predictions are biased. As I
mentioned, high R-squared values of any type do not indicate that your model provides an unbiased fit.
That pattern that you describe is heteroscedasticity. In my post about it, I discuss other options for
resolving it. A Box-Cox transformation is a recognized way to fix this problem, but I usually save that for
the last solution I try. I prefer solutions that involve less data manipulation. I’m also a bit leery of how it
transformed away the curvature issue. However, I don’t have any specific reason to say that you
shouldn’t trust the model based on the limited information that I have. Just be sure to closely examine
the coefficients and be really certain that the signs and magnitudes fit with theory.
Also, be aware that the model fit statistics (the various R-squared values and S) apply to the transformed
response variable and not the response using natural units. That can make the model appear better than
it is. Although, they were high before the transformation, so no reason for concern.
Reply
Juan says
July 2, 2018 at 3:33 am
Dear Jim,
I recently started using Minitab for DoE. I work with an extraction process to evaluate the recovery
(Yield) of proteins. Evaluation of a half-factorial set of experiments with 5 variables DoE gave me a very
good regression model with R2(98.29%); adj-R2 (97.35%) and pred-R2 (95.57%). However I noticed that
my model indicates that the curvature is significant (P = 0.022). What is the effect of this curvature on
the predictive power of the model? in other words, is this model still good to make predictions? or is a
CCD required?
Reply
Sundar says
June 24, 2018 at 12:38 pm
Dear Jim,
As usual, brilliant post. However, I would like to know about “The adjusted R-squared value actually
decreases when the term doesn’t improve the model fit by a sufficient amount.” How does the adjusted
R-squared determine whether the addition of a variable has a positive or negative effect on the model?
Thanks
Reply
Jim Frost says
June 25, 2018 at 10:50 am
Hi Sundar, the adjusted R-squared value decreases when the absolute value of the t-value for the new term’s coefficient is less than 1.
Reply
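The t-value rule in the reply above can be checked in its simplest case. A model with only a constant has an adjusted R-squared of 0, so adding a single predictor raises adjusted R-squared exactly when the absolute t-value of its slope exceeds 1. The sketch below is only an illustration in plain Python. The function name and both small datasets are made up for the example.

```python
import math


def slope_t_and_adj_r2(x, y):
    """Fit y = b0 + b1*x by least squares; return (t-value of b1, adjusted R-squared)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    mse = sse / (n - 2)                      # error mean square
    t_value = b1 / math.sqrt(mse / sxx)      # slope divided by its standard error
    adj_r2 = 1 - mse / (sst / (n - 1))
    return t_value, adj_r2


x = [1, 2, 3, 4, 5, 6, 7, 8]
y_noise = [3, 5, 2, 6, 3, 5, 2, 5]                    # made-up data, almost no trend
y_trend = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.2, 7.9]    # made-up data, strong trend

for label, y in (("noise", y_noise), ("trend", y_trend)):
    t_value, adj_r2 = slope_t_and_adj_r2(x, y)
    print(label, "t =", round(t_value, 3), "adjusted R-sq =", round(adj_r2, 3))
```

For the noisy data the slope's t-value is below 1 in absolute value and adjusted R-squared falls below the constant-only value of 0. For the data with a clear trend, both conditions flip.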
Franklin Moormann says
April 10, 2018 at 1:25 am
When calculating predicted rsquared for a full dataset of 4000 data points, you would do all 4000 or a
random sample of those 4000 data points?
Reply
Jim Frost says
April 12, 2018 at 3:05 pm
The procedure always cycles through the complete dataset and systematically removes one data point
at a time to calculate predicted R-squared.
Reply
Jim Frost says
February 27, 2018 at 11:15 pm
Hi Emanuel,
Thanks so much! I’m glad you have found it to be helpful!
About predicted R-squared, which is really my favorite type of R-squared. Think about the error sum of
squares (SSE). This is where you take the squared differences between each observation and the fitted
value and sum them up across all observations. It’s also known as the residual sum of squares because
it’s the sum of the squared residuals. A small value produces a high R-squared.
For predicted R-squared, you use the predicted error sum of squares (PRESS), which is similar to the SSE.
To calculate PRESS, you remove a point, refit the model, and then use the model to predict the removed
observation. Then, you take the removed value and subtract the predicted value and then square this
difference. You repeat for all of the removed values. You end up with a squared difference for each
value when it is removed. You then sum those squared differences and you have PRESS. A low PRESS
value produces a high predicted R-squared. So, it’s fairly analogous to the SSE but the squared
differences are based on predicting the missing values versus values that were used to fit the model.
Regarding point 2, yes, you’re correct, when you have more data points, it’s harder to overfit your
model and, hence, you wouldn’t expect a much lower predicted R-squared. Imagine you have 1,000
data points that follow the same U-shaped pattern. In that case, you’d be really sure about that curved
relationship because such a large number of data points aren’t going to follow that curve by chance.
That’s why you wouldn’t expect the predicted R-squared to drop when you have many data points.
However, fewer data points can produce that pattern by chance. If you remove one, it changes that
relationship noticeably. You’re not really certain that the relationship really is that U-shape. Predicted R-
squared detects this uncertainty and that’s why it drops.
Overfitting depends on the number of observations per term in the model, as you can read about in my
post about overfitting. You’d need a very, very complex model to overfit a dataset with 1000
observations!
I hope this helps!
Reply
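The PRESS description above can be sketched directly: remove a point, refit, predict the removed point, square the miss, and sum. The code below is a minimal illustration for a straight-line model in plain Python. The function names and the two small datasets are assumptions made for the example, not taken from any particular software.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def r_squared(x, y):
    """Regular R-squared: 1 - SSE / total sum of squares."""
    b0, b1 = fit_line(x, y)
    ybar = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst


def predicted_r_squared(x, y):
    """Predicted R-squared: 1 - PRESS / total sum of squares."""
    press = 0.0
    for i in range(len(x)):
        # Refit without observation i, then predict the removed observation.
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    ybar = sum(y) / len(y)
    sst = sum((yi - ybar) ** 2 for yi in y)   # computed from all observations
    return 1 - press / sst


x = [1, 2, 3, 4, 5, 6, 7, 8]
y_trend = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.2, 7.9]   # made-up data, strong trend
y_noise = [3, 5, 2, 6, 3, 5, 2, 5]                   # made-up data, almost no trend

for label, y in (("trend", y_trend), ("noise", y_noise)):
    print(label, round(r_squared(x, y), 3), round(predicted_r_squared(x, y), 3))
```

PRESS is a sum of squared differences, so it can never be negative, and predicted R-squared comes out at or below the regular R-squared. For the noisy data it drops below zero, which is the situation described in the replies about negative values.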
MUHAMMAD K. N. says
January 10, 2018 at 7:07 pm
Jim Frost says
January 10, 2018 at 7:50 pm
Hi Muhammad! Unfortunately, that situation isn’t too uncommon and I’ve written a blog post that is
specifically about it:
Interpreting a Regression Model with a Low R-squared
A low R-square might or might not be a problem. If you have significant independent variables and your
main goal is to understand the relationships between the variables, a low R-squared is not necessarily a
problem.
However, if your main goal is to produce precise predictions, it can be a problem.
The blog post I recommend covers these scenarios and shows how it works. I think it’ll make your
situation more clear!
Reply
ALMAS KHURSHEED says
December 28, 2017 at 10:33 am
Hi sir,
I am very confused about how to write an interpretation statement for R2 if the value is 0.68.
Can you please help me out?
Thank you
Reply
Jim Frost says
December 28, 2017 at 10:58 am
Hi Almas, it means that the independent variables in your model collectively account for 68% of the
variability in the dependent variable around its mean. Click the link in the post to go to my post where I
talk about R-squared in more detail. I hope this helps!
Reply
Jim Frost says
November 7, 2017 at 11:46 am
Hi Allan, you’re very close! Think about how you usually calculate sums of squares. It’s the sum of the
squared deviations between the fitted values and the observations. PRESS is similar except it is the
sum of the squared deviations between the fitted value of each removed observation and the removed
observation. So, the procedure basically removes each observation and uses the model to predict that
observation and squares the difference between the two. It does that systematically for all observations
and sums those squared differences. For your 4th point, you never fit the model with all observations
when calculating predicted R-squared. Instead, there is always one removed observation and you’re
essentially seeing how well the model predicts each removed observation. I hope this makes it more
clear!
Reply
Franklin Moormann says
October 28, 2017 at 10:50 am
I’m not explaining well enough I believe. This is my formula results using junk data (with a rsquared
value of 0.2)
Predicted Rsquared = 1 – (PRESS / TSS) = 1 – (-1.04 / 67408.86) = 1.00
So as you can see something is definitely wrong.
Reply
Franklin Moormann says
October 26, 2017 at 5:43 pm
I have no clue how to do diagonal elements in C# so I guess I’m going to have to go through and
eliminate one observation at a time and then calculate the press and rss after each elimination. Since
I’m doing that, how would I calculate the press statistic instead of doing the diagonal matrix stuff?
Reply
Franklin Moormann says
October 26, 2017 at 7:15 pm
I found a workaround but I’m now getting a negative value for the press statistic so when I divide by the
total sum of squares it is returning 1 which I know isn’t correct
Reply
Jun Li says
July 4, 2018 at 10:37 am
Hello Jim,
I developed a nonlinear regression model in RStudio with R2 (0.904), R2(adj) (0.864), and R2 (predicted)
(0.919). Is it possible for the predicted R2 to be higher than the regular R2?
Hope for your reply.
Jun Li
Reply
Jim Frost says
July 5, 2018 at 3:13 pm
Hi Jun Li,
First we need to make sure we’re clear on some terminology. Did you develop a true nonlinear model or
is it a linear model that uses polynomials to model curvature? You can read about the differences in my
post: The Difference between Linear and Nonlinear Models.
It’s an important distinction because R-squared and its variants are not valid for nonlinear models. If you
are truly using a nonlinear model, I suppose it might be possible to obtain a Predicted R-squared that is
higher than R-squared. Maybe. But, you shouldn’t be using any of those R-squared values because they
are invalid. You can use another goodness-of-fit statistic, such as the standard error of the regression.
For linear models, you can’t obtain a predicted R-squared that is higher than R-squared. That scenario
would indicate that the model predicts new observations better than it predicts the values used during
the model fitting process. That makes no sense.
I hope this helps!
Tim says
October 26, 2017 at 2:25 am
Hi Jim,
I know the way how R-squared is calculated in logistic regression is different. I wonder what would you
do if a reviewer asks you to provide similar indicator.
Thanks!
Tim
Reply
Jim Frost says
October 26, 2017 at 1:37 pm
Franklin Moormann says
October 25, 2017 at 8:40 pm
I’m only supposed to remove one observation at a time to recalculate the prediction model but after
that, I’m supposed to use all original observations to run the calculations for press and tss?
Reply
Franklin Moormann says
October 25, 2017 at 6:31 pm
I’m trying to create my own formula to calculate predicted rsquared and this was the only information
that I found on how to do it. I believe the formula to do this is predicted r2 = 1 – (press / tss) so would
you systematically leave off one data point at a time and calculate the press statistic and tss statistic and
add those values to a final total and calculate predicted r2 at the end?
Reply
Jim Frost says
October 25, 2017 at 10:57 pm
Hi Franklin, here’s the predicted R-squared and PRESS formulas. The formulas don’t actually go through
and remove each observation one-at-a-time, but it is equivalent to that process.
Reply
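The formulas referred to in the reply above are not reproduced in this thread. A standard identity for least-squares models is that each leave-one-out prediction error equals the ordinary residual divided by (1 - h), where h is that observation's leverage (the corresponding diagonal element of the hat matrix). The sketch below checks this for a straight-line model, where the leverage is 1/n + (x - mean of x)^2 / Sxx. It is an illustration in plain Python with made-up data and hypothetical function names.

```python
def press_two_ways(x, y):
    """Return PRESS from the leverage shortcut and from explicit refitting."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar

    # Shortcut: one fit on all the data, residuals inflated by 1 / (1 - leverage).
    press_shortcut = 0.0
    for xi, yi in zip(x, y):
        residual = yi - (b0 + b1 * xi)
        leverage = 1 / n + (xi - xbar) ** 2 / sxx
        press_shortcut += (residual / (1 - leverage)) ** 2

    # Explicit version: drop each observation, refit, predict the dropped point.
    press_loo = 0.0
    for i in range(n):
        xr = x[:i] + x[i + 1:]
        yr = y[:i] + y[i + 1:]
        m = n - 1
        xm = sum(xr) / m
        ym = sum(yr) / m
        slope = sum((a - xm) * (b - ym) for a, b in zip(xr, yr)) / sum(
            (a - xm) ** 2 for a in xr
        )
        intercept = ym - slope * xm
        press_loo += (y[i] - (intercept + slope * x[i])) ** 2

    return press_shortcut, press_loo


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # made-up data
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
shortcut, loo = press_two_ways(x, y)
print(shortcut, loo)
```

The two totals agree up to floating-point rounding. Since each leverage lies between 0 and 1, every inflated residual is at least as large in magnitude as the ordinary one, so PRESS is at least as large as the error sum of squares.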
Duc-Anh Luong says
May 10, 2017 at 6:28 am
Hi Jim,
I have question about calculation of the predicted R squared in the linear regression.
(1). Is it true that in each time when we remove 1 data point, we have to fit model again and use this
model to predict the values of removed data point?
(2). Is it possible to get negative predicted R-squared?
Many thanks
Duc Anh
Reply
From <https://statisticsbyjim.com/regression/interpret-adjusted-r-squared-predicted-r-squared-regression/#more-938>