
Replacing the R² Coefficient in Model Analysis

Hugo Hernandez

ForsChem Research, 050030 Medellin, Colombia

hugo.hernandez@forschem.org

doi: 10.13140/RG.2.2.26570.13769

Abstract

The R² coefficient (a generalization of the determination coefficient defined in linear regression) has been widely used as a criterion for assessing and comparing the performance
of mathematical models with respect to a given set of experimental data. Unfortunately, the R²
coefficient can only be used to confidently compare linear models with different terms, fitted
by ordinary least-squares (OLS) regression, satisfying all assumptions of OLS regression, and
using the same experimental data set. In addition, the R² coefficient actually represents the
relative performance of a model compared to the best constant model for the data. A new
fitness coefficient (CF) is proposed as an alternative to R², where the performance of the model
is now relative to the corresponding measurement error in the data. A modeling selection
procedure is suggested where the best model maximizes the fitness coefficient and the
normality of the residuals, while minimizing the number of fitted parameters (parsimony
principle).

Keywords

Correlation, Fitness Coefficient, Heteroscedasticity, Mathematical Modeling, Model Analysis, Ordinary Least Squares, R² Coefficient, Regression Analysis, Uncertainty

1. Introduction

Mathematical models are valuable tools commonly used to describe, explain and even predict
the behavior of any particular system. Anyone working with mathematical models must take
into account two important principles of modeling:

Cite as: Hernandez, H. (2023). Replacing the R² Coefficient in Model Analysis. ForsChem Research
Reports, 8, 2023-10, 1 - 43. Publication Date: 18/07/2023.

1) No mathematical model is perfect, because it cannot consider all possible factors that may have an influence on the system, and also because there is always uncertainty in the information on the factors and their effects considered in the model [1,2].
2) There is not a single correct model but a wide range of possible models capable of representing the experimental data with more or less the same confidence [3].

We need, however, objective criteria to help us select the most adequate model for our
purposes, from the universe of possible models. Typically, the “goodness of fit” of a
mathematical model with respect to the experimental data available is the criterion used to
choose the best model. Several numerical “goodness of fit” criteria are available, including the
R² coefficient [4], the likelihood ratio (LR) [5], and the Akaike information criterion (AIC) [6], just to mention a few.

Amongst all “goodness of fit” criteria, the R² coefficient is probably the one most commonly used for model comparison, but it is also widely misused [7]. For this reason, the focus of the current report is the replacement of the R² coefficient as a criterion for model evaluation and comparison. First, we need to better understand the origin and nature of the R² coefficient along with its shortcomings (Section 2). Then, by understanding the true meaning of the R² coefficient (Section 3), we can derive an improved criterion correcting some of the issues with the R² coefficient (Section 4). Such criterion is denoted as the model fitness coefficient (CF). Numerical examples are presented to illustrate the advantages of using the model fitness coefficient for evaluating, comparing and selecting mathematical models.

2. About the R² Coefficient

2.1. Brief History

The origins of the R² coefficient can be traced back to the origins of “regression”. The term “regression” was first used by Francis Galton§ in 1885 [8]. Some years later, Galton devised a co-relation index (r) to quantify the degree of association between two random variables [9]. Based on Galton's ideas and works about “correlation” [10], Karl Pearson developed in 1895 the mathematical formula that is still most commonly used to measure linear correlation, the Pearson product-moment correlation coefficient [11]:

§
Galton originally used the term “reversion” and later he changed it into “regression”, to describe a somewhat negative concept, where a population “reverses” or “regresses” towards a mediocre mean.


r = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{\left(n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right)\left(n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right)}}
(2.1)

where x and y denote two variables observed simultaneously, x_i and y_i are the corresponding values for the i-th observation, and n is the number of observations.

Some authors suggest that r should be called the Galton-Pearson correlation coefficient [12].

Pearson’s correlation coefficient can be alternatively expressed as follows:

r = \frac{Cov(x,y)}{\sqrt{Var(x)\,Var(y)}}
(2.2)

where Var and Cov represent the variance and covariance operators, respectively.
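As a quick numerical illustration of Eq. (2.1) and (2.2), the following Python sketch (assuming NumPy is available; the data values are arbitrary and not taken from this report) computes the correlation coefficient from both expressions:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient, Eq. (2.1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) *
                  (n * np.sum(y**2) - np.sum(y)**2))
    return num / den

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Equivalent form based on the covariance and variances, Eq. (2.2)
r_sum = pearson_r(x, y)
r_cov = np.cov(x, y, ddof=1)[0, 1] / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))
print(r_sum, r_cov)  # both expressions give the same value
```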

The correlation coefficient describes the strength and direction of an association between
variables. However, a statistically significant correlation must not be confused with a relevant**
correlation [13]. Particularly with large datasets, very small (practically irrelevant) correlation
coefficients can be found to be statistically significant.

The square of the correlation coefficient has been referred to as the coefficient of determination (r²) [14]:

r^2 = \frac{\left(Cov(x,y)\right)^2}{Var(x)\,Var(y)}
(2.3)

Similarly, the complement of the coefficient of determination is the coefficient of non-determination (k²), given by [14]:

k^2 = 1 - r^2 = 1 - \frac{\left(Cov(x,y)\right)^2}{Var(x)\,Var(y)}
(2.4)

where k = \sqrt{1 - r^2} is known as the coefficient of alienation [4,14].

**
A significant correlation is a correlation that is unlikely to be caused by sampling error. A relevant correlation is a useful correlation from a practical point of view.


Notice that r² has been called the coefficient of determination because it can be interpreted as the percentage of variance in one variable predicted or explained by the other. That is, the degree to which some relation or phenomenon is present, also sometimes denoted as effect size [15].

Notice that the original notation of the coefficient of determination was the lower case r², but the one commonly used is the upper case R². There is actually a subtle difference between both terms, and it is that r² refers exclusively to the linear association between two variables (x and y), whereas R² describes the “goodness of fit” of a particular model, which may correspond to any arbitrary univariate or multivariate relation. Since both terms are denoted as “coefficient of determination”, misinterpretations quickly arise.

Nevertheless, the R² coefficient has been considered “one of the most widely used reliable statistical tool for testing goodness of fit of a model or comparing the performance of various models” [16]. Furthermore, R² has been for many years a common output in computer regression packages [17].

2.2. Definitions

While Eq. (2.3) is a straightforward definition of the determination coefficient (r²), it is possible to find in the literature several different definitions of the R² coefficient. Kvålseth [7] presents an excellent review of the most common definitions used for R²:

R_1^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
(2.5)
R_2^2 = \frac{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
(2.6)
R_3^2 = \frac{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
(2.7)
R_4^2 = 1 - \frac{\sum_{i=1}^{n}\left(e_i - \bar{e}\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
(2.8)
R_5^2 = R_{y \cdot \mathbf{x}}^2
(2.9)
R_6^2 = r_{y\hat{y}}^2
(2.10)


R_7^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} y_i^2}
(2.11)
R_8^2 = \frac{\sum_{i=1}^{n} \hat{y}_i^2}{\sum_{i=1}^{n} y_i^2}
(2.12)

where the circumflex accent ( ŷ ) indicates a value estimated from a deterministic [1] mathematical model, and the flat accent ( ȳ ) represents an average value over all observations. In addition, the difference between the actual observation of the response variable or regressand (y) and the model prediction (or estimation) of the behavior of the system (ŷ) under any particular set of conditions is known as the residual error of the model:

e_i = y_i - \hat{y}_i
(2.13)

On the other hand, r_yŷ is the simple linear correlation coefficient between the regressand and the model estimation, and R_y·x is the multiple correlation coefficient between the regressand (y) and the regressor vector (x), where

R_{y \cdot \mathbf{x}}^2 = \mathbf{r}_{y\mathbf{x}}^{T}\, \mathbf{R}_{\mathbf{xx}}^{-1}\, \mathbf{r}_{y\mathbf{x}}
(2.14)

r_yx is the vector of individual correlation coefficients between the regressand and each variable in the regressor, R_xx is the matrix of correlation coefficients between all elements in the regressor, and the superscripts T and -1 indicate the transpose of a vector and the inverse of a matrix, respectively.

The first 6 definitions (Eq. 2.5 to 2.10) are all equivalent for a simple linear model with intercept. However, for any other type of model, different R² values arise. The last two definitions (Eq. 2.11 and 2.12) are intended to be used for linear models without intercept. According to Kvålseth [7], the first definition (Eq. 2.5) is the best overall statistic for evaluating models in general, despite also having potential shortcomings when tacit assumptions are violated.
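The following Python sketch (a minimal illustration with arbitrary data, not part of the original report) evaluates a few of the definitions above for a linear model fitted with and without an intercept, showing that they only coincide in the first case:

```python
import numpy as np

def r2_variants(y, yhat):
    """A few of the R-squared definitions reviewed by Kvalseth: Eq. (2.5), (2.6), (2.10), (2.11)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    sse = np.sum((y - yhat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r1 = 1.0 - sse / sst                          # Eq. (2.5)
    r2 = np.sum((yhat - y.mean()) ** 2) / sst     # Eq. (2.6)
    r6 = np.corrcoef(y, yhat)[0, 1] ** 2          # Eq. (2.10)
    r7 = 1.0 - sse / np.sum(y ** 2)               # Eq. (2.11), intended for no-intercept models
    return r1, r2, r6, r7

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 3.1, 4.2, 4.8, 6.1, 6.8])

# OLS line with intercept: Eq. (2.5), (2.6) and (2.10) agree; Eq. (2.11) does not apply here
b, a = np.polyfit(x, y, 1)               # slope, intercept
print(r2_variants(y, a + b * x))

# Line forced through the origin: the definitions no longer agree with each other
b0 = np.sum(x * y) / np.sum(x ** 2)      # OLS slope without intercept
print(r2_variants(y, b0 * x))
```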

Eq. (2.5) can be expressed alternatively as follows:

R^2 = 1 - \frac{SSE}{SST}
(2.15)

where


SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n} e_i^2
(2.16)

is the sum of squared errors, and

SST = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2
(2.17)

is the sum of total squares or total variation.

Notice that parameter identification of regression models is commonly performed by minimizing the SSE, as given in Eq. (2.16). Such method is known as ordinary least squares (OLS) minimization [18], and it was originally developed by Gauss by the end of the 18th century [19].

For linear models that satisfy the standard assumptions underlying OLS, the R² coefficient has an attractive interpretation as the proportion of variation in the response variable that is explained by the model [20], since:

R^2 = \frac{SSR}{SST}
(2.18)

where SSR = \sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2 is the regression sum of squares [21].

As it was previously mentioned, Eq. (2.18) is only valid for linear models where the following relation holds [16]:

SST = SSR + SSE
(2.19)

It can also be shown that for simple linear models fitted by OLS, the R² coefficient becomes identical to the original coefficient of determination (r²), giving rise to the misinterpretation of R² as the coefficient of determination. Notice that in this case, R² is constrained to values between 0 and 1, since the correlation coefficient (r) can only take values between -1 and 1.

For non-linear models Eq. (2.15) can still be used, but now the R² coefficient may also take negative values (whenever SSE > SST), and of course, Eq. (2.18) does not apply.

For small samples, a correction or ‘adjustment’ was proposed to R² to account for lost degrees of freedom [22], resulting in the adjusted R² coefficient:


R_{adj}^2 = 1 - \frac{\left(1 - R^2\right)\left(n - 1\right)}{n - p}
(2.20)

where p is the number of parameters in the model fitted from the experimental data.

Interestingly, this adjusted coefficient also penalizes an unnecessary increase in the number of parameters in the model, satisfying the principle of parsimony [23] for mathematical modeling, and therefore it is also useful for large samples.

By using Eq. (2.15) to (2.17) in Eq. (2.20), the following expression is obtained:

R_{adj}^2 = 1 - \frac{SSE/(n - p)}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2/(n - 1)} = 1 - \frac{\hat{\sigma}_e^2}{\hat{\sigma}_y^2}
(2.21)

where σ̂_e² is an unbiased estimate of the error variance (assuming a zero average error, as in the case of OLS), and σ̂_y² is an unbiased estimate of the regressand variance.
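A minimal Python sketch of Eq. (2.20) and (2.21), using generic variable names (not taken from this report), may help clarify that the two expressions are equivalent:

```python
import numpy as np

def adjusted_r2(y, yhat, p):
    """Adjusted R-squared from the plain R-squared, Eq. (2.20); p = number of fitted parameters."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    n = len(y)
    r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

def adjusted_r2_from_variances(y, yhat, p):
    """Equivalent computation from the unbiased variance estimates, Eq. (2.21)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    n = len(y)
    var_e = np.sum((y - yhat) ** 2) / (n - p)   # unbiased residual error variance
    var_y = np.var(y, ddof=1)                   # unbiased regressand variance
    return 1.0 - var_e / var_y
```

Both functions return the same number for any data set and prediction vector, which is the point of Eq. (2.21).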

2.3. Criticism

Correlations and regressions are frequently misunderstood and misused. First of all, we must
be aware that an observed correlation does not imply a cause-and-effect relationship between
two variables [13]. In second place, correlations cannot be compared across samples: Two
correlations can be different because the variances in the samples are different, not because
the underlying relationship has changed [24]. In addition, the correlation coefficient is
sometimes criticized as having no obvious intrinsic interpretation, and therefore, the
coefficient of determination (interpreted as the proportion of variance in one variable that is
accounted by the other) is preferred [12,13]. However, since the coefficient of determination
depends on the correlation coefficient, it can also be influenced by the differences in variance
across samples [24,25], and for this reason the determination coefficient cannot be used either
to compare different samples. According to King [26] all of the criticisms of the correlation and
standardized regression coefficients apply equally to the determination coefficient.

Notice that the terms “coefficient of determination” and “R² coefficient” are commonly used as equivalents in the literature. However, we may notice that the coefficient of determination should be used only for the analysis of unplanned data (also known as observational or random data), whereas the R² coefficient is best suited for planned data (obtained from experimental
designs under controlled conditions). According to Box [27], regression analysis of unplanned
data is a technique which must be used with great care.


The R² coefficient has been widely used over several decades to compare the “goodness of fit” of competing regression models, in spite of repeated warnings about the potential dangers [20]. The R² coefficient has been under criticism over the last five decades owing to its diverse limitations, as well as to doubts about its applicability to nonlinear models [16]. The main concerns about using the R² coefficient for model comparison and selection include the following:

 The R² coefficient should not be used when OLS basic assumptions are violated [20]. These include the assumptions of normality of residuals, independence (no autocorrelation) of residuals, and homoscedasticity (constant variance) of residuals.
 The R² coefficient is an inadequate measure of the goodness of fit for nonlinear models††, or for linear models without intercept [25,28]. In addition, except for linear models with an intercept term, the several alternative R² statistics are not generally equivalent, and yield different values [7].
 The use of the R² coefficient is particularly inappropriate if the models are obtained by different transformations of the response scale [4,16,20].
 The R² coefficient cannot be compared across different samples. It is only useful for comparisons between models obtained from the same data sample [24,29].
 The R² coefficient cannot be used to compare models obtained by different fitting methods [4], such as weighted least-squares (WLS) regression [30]. In fact, R² is maximized by OLS minimization, and therefore, the criterion will always show preference for OLS regression over any other fitting method.
 Furthermore, a single way to measure variation is necessary for the comparison of two values of R² [4]. Since different fitting methods may measure variation differently, they cannot be directly compared by means of R².
 An increase in the number of replicates will tend to artificially reduce the value of R² [21,31].
 The R² coefficient cannot be used when the mean of the regressand is not stationary [32], as in, for example, dynamic systems.
 Usage of the R² coefficient in linear regressions with binary responses is misleading, since low values of R² are inevitable even if an important relation is present [33].
 A large R² does not guarantee a “good” model fit [25], while a small R² is not necessarily proof of a “bad” model fit [24,25,32].

††
Nonlinearity refers here to the parameters rather than to the independent variables, since nonlinear
functions of independent variables can be transformed into a new regressor variable, and incorporated
as a linear term in the model.


 Moreover, a perfect fit with zero model error‡‡ does not correspond to R² = 1 [21], since it is impossible for a fitted model to explain pure error [31].
 In analyzing two or more sets of data, predictions from a regression equation with a larger R² might not be more precise (and could even be less precise) than the predictions based on an equation with a smaller R² [34].
 Sole reliance on the R² coefficient may fail to reveal important data characteristics and model inadequacies [30]. The R² coefficient only partially measures the “usefulness” or the “goodness of fit” of a regression equation [34].

According to all these issues, the cases where the R² coefficient can be correctly used are limited to comparing different linear models with intercept (containing different terms) fitted by OLS, satisfying the assumptions of OLS, obtained from the same sample, using the same method of estimation of variation, and using the same scale of the response variable. Considering all these limitations, some statisticians even take the extreme position that one should never look at R² at all [17].

3. Adequate Interpretation of the R² Coefficient

Much of the confusion surrounding the interpretation and application of the R² coefficient results from the erroneous assumption that it is equivalent to the coefficient of determination (r²). While both of them are numerically equivalent for a simple linear model with non-zero intercept, they are conceptually different.

Let us consider the definition of R² given in Eq. (2.15). While SSE quantifies the deviation between the predictions of a particular model and the experimental data, SST actually quantifies the deviation from the experimental data of the predictions obtained by the “null” [4] or “constant” model:

\hat{y} = \beta_0
(3.1)

fitted by OLS. As a result,

\beta_0 = \bar{y}
(3.2)

and therefore, ŷ_i = ȳ.

‡‡
The R² coefficient takes into account model fitness error and pure error simultaneously in the term “model residual error”.


Thus, R² actually measures the goodness of fit in the sense of comparing a given model (regression model) with a reference model (best constant model) [4,34].

This way, a model with R² = 0 indicates that its performance is identical to that of the reference model. Large R² values only show that the given model improves on the reference model. On the other hand, negative values of R² (which are possible for R² according to Eq. 2.15, but impossible for the coefficient of determination according to Eq. 2.3) indicate that our model performs worse than the reference model. In any case, since we do not know how good the reference model is, we have no means of assessing the actual goodness of fit of our model. Therefore, R² does not provide any information about how good the model is in an absolute sense [4].

The typical interpretation of R² considers a value of zero as indicating a “bad” model, which is not necessarily true. Let us consider the example provided by McGuirk & Driscoll [32], and illustrated here in Figure 1.

The experimental data set is fitted by OLS linear regression resulting in an R² of practically 0. Now, a linear term (0.4x) is added to the original data, and the resulting data and OLS linear model are presented in Figure 2. The R² for the modified data is 0.941. Which model is better? Basically, we have the same data with a different inclination.

In principle, both models should present the same “goodness of fit”. However, the R² for the modified data seems better than that for the original data. What we are actually comparing with the R² coefficient is the fitted model against the reference best constant model for each data set. In the case of the original data, both the fitted model and the best constant model are equivalent, and for that reason R² ≈ 0. However, for the modified data, we can clearly see that the fitted model (red line) performs much better than the reference constant model (green line), as indicated by a large R² value. In both cases, R² does not measure the absolute “goodness of fit” of the model, only a “goodness of fit” relative to the reference best constant model.

As it can be seen in Figure 1, the constant model could be considered a satisfactory description
of the data or not, depending on our “tolerance” for the residual error. Such error “tolerance”
should be a required input for evaluating the “goodness of fit” of a model, as it will be
discussed in Section 4.


Figure 1. Illustrative example of a constant model. Blue dots: Experimental data. Red line: OLS
regression model. Adapted from Figure 3 in [32].

Figure 2. Experimental data from Figure 1 modified by a linear term. Blue dots: Modified data.
Red line: OLS regression model. Green line: Reference best constant model. Adapted from
Figure 3 in [32].

This alternative interpretation also clarifies the difficulties found with the calculation of R² in linear models passing through the origin (zero intercept) [4], since the reference model usually considered is not the best constant model (Eq. 3.1) but a pure (unbiased) random model (Eq. 3.3):

y = \varepsilon, \qquad E(\varepsilon) = 0
(3.3)

The fact that R² is a fitness measure relative to a reference model can also be evidenced by expressing it in terms of statistical likelihoods [35]:

R^2 = 1 - \left(\frac{L(M_{ref})}{L(M)}\right)^{2/n}
(3.4)


where L(M) is the statistical likelihood of the fitted model, and L(M_ref) is the statistical likelihood of the “null” or best constant model.

Now, if we have different linear models with intercept fitted by OLS from the same data sample, we can directly compare their R² values and identify the model with the lower residual error. However, the difference in R² values might not be significant, neither from a statistical nor from a practical point of view.

Let us now discuss the statistical significance of the R² coefficient. The statistical significance of R² is typically determined by testing the following set of hypotheses:

H_0: R^2 = 0 \qquad vs. \qquad H_1: R^2 > 0
(3.5)

Thus, a statistically significant value of R² simply indicates that the fitted model is significantly better than the corresponding reference model. It does not necessarily mean that the fitted model has any “practical significance”, regarded as the importance of a relationship [22]. Furthermore, as the “statistical significance” of any variable is greatly affected by the range it covers, there is a strong probability that the most important variables will be dubbed “not significant” by a standard regression analysis when they are carefully controlled within narrow intervals [27].

If we are interested in the statistical significance of the difference between two models (M_1 and M_2) we might perform the following test:

H_0: R^2(M_1) - R^2(M_2) = 0 \qquad vs. \qquad H_1: R^2(M_1) - R^2(M_2) \neq 0
(3.6)

While testing Eq. (3.5) for normal residual errors is possible using the F-test, a rigorous evaluation of Eq. (3.6) requires a new test based on the distribution of the difference between two R² variables.

Alternatively, and assuming that the same data set is considered, the following F-test can be used:

F = \frac{\left(1 - R^2(M_1)\right)/\left(n - p(M_1)\right)}{\left(1 - R^2(M_2)\right)/\left(n - p(M_2)\right)} = \frac{\hat{\sigma}_e^2(M_1)}{\hat{\sigma}_e^2(M_2)}
(3.7)
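As an illustration only, a possible implementation of the variance-ratio comparison behind Eq. (3.7) for two models fitted to the same data set could look as follows (assuming normally distributed residuals; NumPy and SciPy are used, and the function name is arbitrary):

```python
import numpy as np
from scipy import stats

def compare_models_f_test(y, yhat1, p1, yhat2, p2):
    """Ratio of the unbiased residual error variances of two models fitted
    to the same data, with a two-sided p-value from the F distribution."""
    y = np.asarray(y, float)
    n = len(y)
    s1 = np.sum((y - np.asarray(yhat1, float)) ** 2) / (n - p1)
    s2 = np.sum((y - np.asarray(yhat2, float)) ** 2) / (n - p2)
    F = s1 / s2
    p_lower = stats.f.cdf(F, n - p1, n - p2)
    p_value = 2 * min(p_lower, 1 - p_lower)
    return F, p_value
```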


Finally, let us recall that the coefficient of determination is interpreted as the percentage of
variance in one variable predicted or explained by the other, and it should be used only when
the regressors or explanatory variables are random [17]. On the other hand, the R² coefficient
is interpreted as a relative measure of the “goodness of fit” of a model, when compared to the
reference model (“null” or best constant model) for a particular data set.

4. Derivation of a Suitable Coefficient for Model Comparison

So far, we have presented plenty of arguments discouraging the use of the R² coefficient as a criterion for model comparison. However, it is still widely used because it provides some benefits over other available metrics, including its simplicity and its intuitive representation as a percentage. Thus, before deriving an alternative model fitness coefficient, let us summarize the desirable features that it should provide.

4.1. Requirements of a Criterion for Model Comparison

The following is a non-exhaustive list of recommended requirements of a good criterion for model comparison and selection, presented by Kvålseth [7]:
 It should possess utility as a measure of goodness of fit and have an intuitively reasonable interpretation.
 It should be dimensionless, and independent of the units of measurement of the model variables.
 The potential range of values should be well defined, with endpoints corresponding to perfect fit (1) and complete lack of fit (0). In this sense, statistical likelihood and other information criteria do not fulfill this requirement.
 It should be sufficiently general to be applicable to (a) any type of model, (b) whether the variables are random or not, and (c) regardless of the statistical properties of the model variables (including the residual error e).
 It should not be confined to any specific model-fitting technique.
 It should be compatible with other acceptable measures of fit (e.g. standard error of prediction and root mean squared residual).
 It should weight equally both positive and negative residuals.

In addition to Kvålseth’s requirements, we may also want a criterion that:


 Allows comparing models obtained from different data samples.
 Allows comparing models obtained from different transformations of the response
variable.
 Allows comparing models considering different response variables.


 Allows incorporating the principle of parsimony, penalizing overfitted models.


 Can be used for heteroscedastic residual errors.
 Can be easily determined or calculated.

4.2. Fitness Coefficient (CF)

Let us begin our discussion considering the adjusted R² coefficient defined in Eq. (2.21). This coefficient is used as a starting point as it already incorporates the principle of parsimony in its formulation. Notice that Eq. (2.21) involves the ratio between the residual error variances of two models: the fitted model (M) and the reference model (M_ref).

Assuming that the true variance of the residual error for each model is known, we can represent the true model residual error variance ratio§§ by the following function:

\phi(M) = \frac{\sigma_e^2(M)}{\sigma_e^2(M_{ref})}
(4.1)

Of course, since true variances are typically unknown, we must approximate the variance ratio using estimations of the variances:

\hat{\phi}(M) = \frac{\hat{\sigma}_e^2(M)}{\hat{\sigma}_e^2(M_{ref})}
(4.2)

Thus, the adjusted R² coefficient then becomes:

R_{adj}^2 = 1 - \hat{\phi}(M)
(4.3)

where the reference model is given by Eq. (3.1).

While choosing the best constant model as reference model seems a reasonable decision, it
does not provide any real information about the absolute “goodness of fit” of the model. In
the previous Section, it was mentioned that an error “tolerance” should be used for assessing
the fitness of a model. However, such error “tolerance” must not be subjectively determined
by the analyst, since an objective comparison of models would be impossible. Thus, an
objective method for determining the error “tolerance” is needed.

Let us now discuss the components of model error.

§§
From now on, it will be denoted as variance ratio for simplicity.


 First of all, we have the parametric error (inadequacy of parameter values), which is the
error component that is minimized during the parameter identification procedure.
 In second place, we have the structural error (inadequacy of model structure), caused
either by “unknown”, “latent” or “lurking” variables [27] which are not considered in
the model, or by inadequate transformations of the model variables. Decreasing this
type of error requires either good knowledge of the system, modeling skills,
perseverance, intuition, or all of them.
 Finally, we have the measurement error (also known as “pure” error), considering all
uncertainty introduced (and propagated) by all measuring systems *** used to obtain
the experimental data, and also including sampling error. This type of error cannot be
reduced by modeling; it is necessary to improve all measuring systems and
experimental setups before obtaining improved sets of experimental data.

Since measurement error is the only component of residual error that remains invariant when
the model changes, it can be used as error “tolerance”. Thus, if the residual model error is close
to the measurement error, we can be perfectly satisfied with the model. Any further decrease
of residual error typically results in undesirable model overfitting.

Determination of measurement error should be a mandatory step during any experimental data acquisition. The measurement error can be quantified by means of established and standardized procedures for determining measurement uncertainty [36].

Thus, the estimated variance ratio for a particular model (M) will be given by:

\hat{\phi}(M) = \frac{\hat{\sigma}_e^2(M)}{\hat{u}_y^2}
(4.4)

where û_y is the combined standard measurement uncertainty.

Notice that for the same measuring system:

\frac{\hat{\phi}(M_1)}{\hat{\phi}(M_2)} = \frac{\hat{\sigma}_e^2(M_1)}{\hat{\sigma}_e^2(M_2)}
(4.5)

We might propose an absolute coefficient (not relative to a reference model) using the estimated variance ratio of the model (φ̂(M)) as follows:

***
The measuring system is not limited to the sensor or instrument. It also includes methods and
procedures, materials, equipment, environmental conditions, and operators.


1 - \hat{\phi}(M) = 1 - \frac{\hat{\sigma}_e^2(M)}{\hat{u}_y^2}
(4.6)

If both random variables (residual error and measurement error) were normally distributed, the estimated variance ratio φ̂(M) would follow a Fisher-Snedecor (F) distribution. Just like the F distribution, the estimated variance ratio may take values between 0 and ∞. That means that the coefficient in Eq. (4.6) may take values between -∞ and 1. Therefore, Eq. (4.6) is a semi-bounded transformation of the estimated variance ratio. Unfortunately, this is an undesirable feature for a model selection criterion.

An alternative bounded transformation of the estimated variance ratio, proposed in this report, is the following hyperbolic sigmoid transformation:

H\!\left(\hat{\phi}(M)\right) = \frac{2}{1 + \hat{\phi}(M)}
(4.7)

Using this hyperbolic transformation, the resulting values are now bounded between 0 and 2. If H(φ̂(M)) = 1, the residual error of the model perfectly matches the measurement error (φ̂(M) = 1). If H(φ̂(M)) < 1, then the residual error is greater than the measurement error, indicating that there is room for improving the model (underfitting). If H(φ̂(M)) > 1, then the residual error is less than the measurement error, indicating either model overfitting or lack of data.

Let us now define the fitness coefficient (CF) of model M, in terms of the hyperbolic transformation of the variance ratio, as follows:

CF(M) = 1 - \left|H\!\left(\hat{\phi}(M)\right) - 1\right| = 1 - \frac{\left|\hat{\phi}(M) - 1\right|}{\hat{\phi}(M) + 1}
(4.8)

The fitness coefficient defined this way can only take values between 0 and 1, where a value of 1 is a perfect match with the measurement error, and a value of 0 is the worst possible scenario (representing either infinite residual error or maximum overfitting).
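A minimal Python sketch of Eq. (4.7) and (4.8), as reconstructed here (the function names are illustrative), shows how CF behaves as the variance ratio moves away from 1:

```python
def hyperbolic_transform(phi):
    """Bounded transformation of the variance ratio, Eq. (4.7): values in (0, 2)."""
    return 2.0 / (1.0 + phi)

def fitness_coefficient(phi):
    """Fitness coefficient CF from the estimated variance ratio, Eq. (4.8).

    Equivalently: 1 - |phi - 1| / (phi + 1)."""
    return 1.0 - abs(hyperbolic_transform(phi) - 1.0)

# phi = 1  -> CF = 1 (residual error matches the measurement error)
# phi -> 0 (overfitting) or phi -> infinity (underfitting) -> CF -> 0
for phi in (0.01, 0.5, 1.0, 2.0, 100.0):
    print(phi, round(fitness_coefficient(phi), 4))
```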


4.3. Homoscedastic Errors

So far, we have tacitly assumed that both the model residual error and the measurement error are homoscedastic, that is, that they have the exact same variance (or standard deviation) for all observation conditions.

For this particular case where variances are constant, Eq. (4.8) can be alternatively expressed as follows:

CF(M) = 1 - \frac{\left|\hat{\sigma}_e^2(M) - \hat{u}_y^2\right|}{\hat{\sigma}_e^2(M) + \hat{u}_y^2}
(4.9)

valid only for û_y ≠ 0†††. Of course, any real experimental system will always have a non-zero measurement uncertainty. If two models are compared without referring to any experimental set of observations, we may resort to the model similitude index described in Section 4.6.

The term σ̂_e²(M) considered in Eq. (4.9) is determined as follows:

\hat{\sigma}_e^2(M) = \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i(M)\right)^2}{n - p(M)}
(4.10)

where ŷ_i(M) is the prediction of the i-th observation obtained from model M, and p(M) is the total number of parameters in model M fitted from the experimental data used for the evaluation. The term p(M) should also account for the intercept if it is fitted from the data. If all model parameters were fitted from a data set different from the data set used for the evaluation of model residuals, then p(M) = 0.
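For the homoscedastic case, Eq. (4.9) and (4.10) can be evaluated directly from the residuals, as in the following sketch (the names are illustrative, and the measurement error variance u2 is assumed to be known from the measuring system):

```python
import numpy as np

def residual_variance(y, yhat, p):
    """Unbiased residual error variance of a model, Eq. (4.10)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return np.sum((y - yhat) ** 2) / (len(y) - p)

def cf_homoscedastic(y, yhat, p, u2):
    """Homoscedastic fitness coefficient, Eq. (4.9)."""
    s2 = residual_variance(y, yhat, p)
    return 1.0 - abs(s2 - u2) / (s2 + u2)

# Hypothetical example: residual variance reasonably close to the measurement variance
y = np.array([1.0, 2.1, 2.9, 4.2])
yhat = np.array([1.1, 2.0, 3.0, 4.0])
print(cf_homoscedastic(y, yhat, p=2, u2=0.02))
```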

4.4. Heteroscedastic Errors

In many situations we may find that the model residual error or the measurement error (or
both) is heteroscedastic, having a non-constant variance over the range of observation
conditions considered. In those cases, the fitness coefficient defined in Eq. (4.8) or (4.9) is no
longer valid.

†††
For the particular case of û_y = 0, we obtain an indetermination in the value of the variance ratio φ̂(M).


One possible solution is finding a suitable transformation of the regressand resulting in homoscedasticity for both the model residual error and the measurement error. However, this is a trial-and-error procedure with no guarantee of success.

If both random variables (model residual error and measurement error) can be described by a suitable heteroscedastic function [37], then we can define the following analogous heteroscedastic fitness coefficient:

CF_{het}(M) = 1 - \left|H\!\left(E\!\left(\hat{\phi}(M)\right)\right) - 1\right| = 1 - \frac{\left|E\!\left(\hat{\phi}(M)\right) - 1\right|}{E\!\left(\hat{\phi}(M)\right) + 1}
(4.11)

where E(φ̂(M)) represents the expected value of the variance ratio, determined as follows:

E\!\left(\hat{\phi}(M)\right) = \idotsint \frac{\hat{\sigma}_e^2(M, \mathbf{x})}{\hat{\sigma}_m^2(\mathbf{x})} \prod_j \left(f_j(x_j)\, dx_j\right)
(4.12)

σ̂_e²(M, x) and σ̂_m²(x) are the model residual variance and the measurement variance evaluated at each observation condition x, and f_j(x_j) is the probability density function of each variable in the regressor.

Notice that Eq. (4.12) allows considering a non-uniform occurrence of values in the regressor. However, for most practical purposes a uniform distribution of values can be assumed. In this case, the expected value of the variance ratio simply becomes the average variance ratio obtained from the available experimental data.

Since Eq. (4.12) can be greatly influenced by extreme values, a robust heteroscedastic fitness coefficient can be alternatively used:

CF_{robust}(M) = 1 - \frac{\left|Med\!\left(\hat{\phi}(M)\right) - 1\right|}{Med\!\left(\hat{\phi}(M)\right) + 1}
(4.13)

where Med(φ̂(M)) is the median of the heteroscedastic variance ratio values available.

Alternatively, it is possible to assess the model fitness considering the worst-case scenario fitness coefficient:

CF_{wc}(M) = 1 - \frac{\left|\max\!\left(\hat{\phi}(M)\right) - 1\right|}{\max\!\left(\hat{\phi}(M)\right) + 1}
(4.14)


If the variance ratio remains constant for all observation conditions, Eq. (4.12) to (4.14) are
identical, and also equivalent to the homoscedastic fitness coefficient (Eq. 4.8 or 4.9).
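A possible implementation of the heteroscedastic variants, assuming that per-condition residual and measurement variances are available as arrays and that a uniform occurrence of conditions applies (so the expectation in Eq. 4.12 reduces to a sample mean), is sketched below; the use of the maximum ratio for the worst-case coefficient follows the reconstruction of Eq. (4.14) given above and is an assumption:

```python
import numpy as np

def cf_from_ratio(phi):
    """CF as a function of an (aggregated) variance ratio, Eq. (4.8)."""
    return 1.0 - abs(phi - 1.0) / (phi + 1.0)

def cf_heteroscedastic(res_var, meas_var):
    """Heteroscedastic fitness coefficients from per-condition variances.

    res_var, meas_var: arrays of residual and measurement error variances
    evaluated at the same observation conditions."""
    phi = np.asarray(res_var, float) / np.asarray(meas_var, float)
    return {
        "expected": cf_from_ratio(phi.mean()),      # Eq. (4.11)-(4.12), uniform occurrence
        "robust":   cf_from_ratio(np.median(phi)),  # Eq. (4.13)
        "worst":    cf_from_ratio(phi.max()),       # Eq. (4.14), conservative (assumed)
    }
```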

4.5. Model Comparison and Selection

The following procedure summarizes the steps required to compare different models and
choose the one most suitable to our purposes.

4.5.1. Defining Variables

First of all, we must clearly identify the regressand or response variable of interest. Of course, we might be interested in multiple response variables.

In second place, we need to define which variables will be considered in the regressor, and the range of values of interest for each variable.

An important part of this stage is the calibration of all measuring systems for the regressand
and regressor variables. Failing to verify the quality of observations over the whole range of
values of interest may lead to erroneous conclusions.

4.5.2. Determining the Measurement Error Variance

Different observation conditions must be selected within the whole range of observation. In
the case of planned experiments, the observation conditions can be chosen from the possible
treatments considered.

Several repeated measurements of each response variable must be performed at each observation condition. Then, the measurement error variance at each observation condition (σ̂_m²(x_k)) can be determined as follows:

\hat{\sigma}_m^2(\mathbf{x}_k) = \frac{\sum_{r=1}^{n_k}\left(y_r(\mathbf{x}_k) - \bar{y}(\mathbf{x}_k)\right)^2}{n_k - 1}
(4.15)

where n_k is the number of repeated measurements at observation condition x_k, and \bar{y}(\mathbf{x}_k) = \frac{1}{n_k}\sum_{r=1}^{n_k} y_r(\mathbf{x}_k) is the average value measured at x_k.

In the case of scale transformations, the measurement error variance must be consequently
corrected:


\hat{\sigma}_{m,T}^2(\mathbf{x}_k) = \frac{\sum_{r=1}^{n_k}\left(T\!\left(y_r(\mathbf{x}_k)\right) - \overline{T(y)}(\mathbf{x}_k)\right)^2}{n_k - 1}
(4.16)

where T(y) represents a suitable transformation of y.

If the response variable is not directly measured but calculated from other measured variables,
we need to estimate a combined measurement error variance as follows:

\hat{\sigma}_m^2(y) = \sum_j \left(\frac{\partial y}{\partial x_j}\right)^2 \hat{\sigma}_m^2(x_j)
(4.17)

where ∂y/∂x_j is the sensitivity of the response variable to the regressor variable x_j at the observation conditions, and σ̂_m²(x_j) is the measurement error variance for the regressor variable x_j at the observation conditions considered.

The combined error variance can be alternatively estimated using analytical methods such as
the Change of Variable Theorem [38], Polynomial Chaos Expansion [39], or Variance Algebra
[40], or simulation algorithms such as the Monte Carlo Method [41].
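As an illustration of Eq. (4.17) and of the Monte Carlo alternative mentioned above, the following sketch (hypothetical function and values) propagates the measurement variances of two regressor variables to a calculated response:

```python
import numpy as np

def combined_variance_first_order(sensitivities, var_x):
    """First-order (sensitivity-based) propagation of measurement variances, Eq. (4.17)."""
    s = np.asarray(sensitivities, float)
    v = np.asarray(var_x, float)
    return np.sum(s ** 2 * v)

def combined_variance_monte_carlo(func, x_nominal, var_x, n=100_000, seed=0):
    """Monte Carlo estimate of the combined variance of y = func(x)."""
    rng = np.random.default_rng(seed)
    x_nominal = np.asarray(x_nominal, float)
    sd = np.sqrt(np.asarray(var_x, float))
    samples = rng.normal(x_nominal, sd, size=(n, len(x_nominal)))
    y = np.apply_along_axis(func, 1, samples)
    return np.var(y, ddof=1)

# Hypothetical response y = x1 * x2, nominal values (2, 3), small measurement variances
f = lambda x: x[0] * x[1]
print(combined_variance_first_order([3.0, 2.0], [0.01, 0.04]))   # dy/dx1 = x2, dy/dx2 = x1
print(combined_variance_monte_carlo(f, [2.0, 3.0], [0.01, 0.04]))
```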

Now, check for homoscedasticity using any suitable test [42]. If the measurement error can be considered homoscedastic, then determine the measurement variance as the pooled average of the variances from each k-th observation condition:

\hat{u}_y^2 = \frac{\sum_{k}\left(n_k - 1\right)\hat{\sigma}_m^2(\mathbf{x}_k)}{\sum_{k}\left(n_k - 1\right)}
(4.18)
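A short sketch of Eq. (4.15) and (4.18), assuming the replicates are stored as one array per observation condition (the names are illustrative):

```python
import numpy as np

def pooled_measurement_variance(replicates):
    """Pooled measurement error variance from repeated measurements, Eq. (4.15) and (4.18).

    replicates: list of 1-D arrays, one array per observation condition."""
    var_k = [np.var(r, ddof=1) for r in replicates]            # Eq. (4.15)
    dof_k = [len(r) - 1 for r in replicates]
    return np.sum(np.multiply(dof_k, var_k)) / np.sum(dof_k)   # Eq. (4.18)

# Hypothetical example: 3 observation conditions with 4 repeated measurements each
reps = [np.array([9.8, 10.1, 10.0, 9.9]),
        np.array([5.1, 4.8, 5.0, 5.2]),
        np.array([1.9, 2.2, 2.0, 2.1])]
print(pooled_measurement_variance(reps))
```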

For heteroscedastic error, find an approximate function g_m(x) representing the behavior of the measurement variance [37]:

\hat{\sigma}_m^2(\mathbf{x}) \approx g_m(\mathbf{x})
(4.19)

In some cases, we may find useful a simpler expression in terms of the measured value:

\hat{\sigma}_m^2(y) \approx h_m(y)
(4.20)


4.5.3. Acquiring Experimental Data

There are two main types of experimental data: Planned and Unplanned. In the case of planned
data, a suitable design of experiments is required which should preferably guarantee regressor
orthogonality and balance. For unplanned data, it is important to obtain a truly representative,
unbiased sample of observations.

Even after careful execution of experimental procedures, including calibration of measuring systems, errors are inevitable. Whenever a critical error is observed, such as deviating from the
specified procedures, evidencing an instrument failure, or detecting calculation or transcription
errors, action is needed. Defective data must be corrected or removed from the data set.
Outlier detection methods [43] might be used to help identify defective data. However, no
unjustified manipulation, addition, or omission of data is advisable.

Sample size is a common concern during experimentation. While increasing the number of
observations is always better from a modeling perspective, experimentation or observation
costs (and budget) usually determine the sample size in practice. This also applies to the data used for assessing the measurement variance.

4.5.4. Choosing the Model Structures

At this stage, the model structures of interest need to be identified. If a single model structure
is considered, we will only be able to evaluate the “goodness of fit” of the model with respect
to the experimental data.

While different model structures for a particular system can be obtained from the literature, or
from the knowledge of the governing principles, it is also common to propose model
structures from the observation of the data obtained [44].

4.5.5. Fitting Model Parameters

Each model structure needs to be fitted to the experimental data using any suitable method,
such as OLS [18], WLS [30], or any other [45]. Parameters could also be arbitrarily determined,
or obtained from the literature. In those cases, non-fitted parameters do not consume degrees
of freedom in the estimation of residual error variance.

In general, parameter estimation is a minimization problem of the form:

\boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta}} J\!\left(\mathbf{y}, \hat{\mathbf{y}}\right)
(4.21)


where θ is the vector of unknown model parameters to be fitted, and J(y, ŷ) is a penalty function determined from the vectors of observed (y) and predicted (ŷ) response variables. In the case of OLS, the penalty function is the SSE.
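The following sketch illustrates Eq. (4.21) for a hypothetical nonlinear model structure, using the SSE as penalty function and a generic numerical minimizer (SciPy is assumed to be available; the data and model are illustrative only):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data and model structure: y = theta0 * (1 - exp(-theta1 * x))
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
y = np.array([1.9, 3.2, 4.9, 6.1, 6.6])

def model(theta, x):
    return theta[0] * (1.0 - np.exp(-theta[1] * x))

def sse(theta):
    """OLS penalty function J = SSE, Eq. (2.16)."""
    return np.sum((y - model(theta, x)) ** 2)

result = minimize(sse, x0=[5.0, 0.5])   # generic minimization of Eq. (4.21)
theta_hat = result.x
print(theta_hat, sse(theta_hat))
```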

4.5.6. Determining Model Residuals and Residual Variance

Model residuals are obtained as the difference between experimental observations and their
corresponding model predictions:

e(\mathbf{x}_i) = y(\mathbf{x}_i) - \hat{y}(\mathbf{x}_i)
(4.22)

The variance of model residuals is determined using Eq. (4.10):

\hat{\sigma}_e^2(M) = \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i(M)\right)^2}{n - p(M)}
(4.10)

where n is the number of observations and p(M) is the total number of parameters present in model M and fitted from the experimental data set.

Again, we need to test for heteroscedasticity of the residual error. If the residual error is heteroscedastic, we must find an approximate function describing the behavior of the residual variance in terms of either the observation conditions (x) or the measured response variable (y):

\hat{\sigma}_e^2(M, \mathbf{x}) \approx g_e(M, \mathbf{x})
(4.23)

\hat{\sigma}_e^2(M, y) \approx h_e(M, y)
(4.24)

4.5.7. Calculating Fitting Coefficients and Statistical Significance

If both the measurement error and the model residual error are homoscedastic, proceed to determine the fitness coefficient of the model using Eq. (4.9):

CF(M) = 1 - \frac{\left|\hat{\sigma}_e^2(M) - \hat{u}_y^2\right|}{\hat{\sigma}_e^2(M) + \hat{u}_y^2}
(4.9)


When at least one of the random variables is heteroscedastic, and approximate models describing the variance are available, we can determine the variance ratio function in terms of the observation conditions as follows:

\hat{\phi}(M, \mathbf{x}) = \frac{g_e(M, \mathbf{x})}{g_m(\mathbf{x})}
(4.25)

or alternatively, in terms of the observed response variable:

\hat{\phi}(M, y) = \frac{h_e(M, y)}{h_m(y)}
(4.26)

The expected value of the variance ratio is then given by either Eq. (4.27) or Eq. (4.28):

E\!\left(\hat{\phi}(M)\right) = \idotsint \frac{g_e(M, \mathbf{x})}{g_m(\mathbf{x})} \prod_j \left(f_j(x_j)\, dx_j\right)
(4.27)

E\!\left(\hat{\phi}(M)\right) = \int \frac{h_e(M, y)}{h_m(y)}\, f(y)\, dy
(4.28)

Finally, determine the heteroscedastic fitness coefficient using Eq. (4.11):

CF_{het}(M) = 1 - \frac{\left|E\!\left(\hat{\phi}(M)\right) - 1\right|}{E\!\left(\hat{\phi}(M)\right) + 1}
(4.11)

For a robust estimation of the heteroscedastic fitness coefficient, Eq. (4.13) can be used. For a
conservative estimation of the heteroscedastic fitness coefficient, Eq. (4.14) can be used.

A model with a larger fitness coefficient will provide a more suitable fit to the experimental data than a model with a smaller fitness coefficient. However, there are no fixed thresholds determining a good or a bad model.

The maximum fitness that can be achieved by a model is CF = 1, indicating that it has already matched the measurement error. It is perfectly normal to find various different models with the maximum fitness. In that case, we may apply the principle of parsimony, as will be described in the next subsection.


Now, two models with close fitness values might also be considered to have practically the
same performance. Since we are using the same data set with the same measurement error,
then we can calculate the variance ratio between the two models from Eq. (4.5), resulting in:

\hat{\phi}(M_1, M_2) = \frac{\hat{\phi}(M_1)}{\hat{\phi}(M_2)} = \frac{\hat{\sigma}_e^2(M_1)}{\hat{\sigma}_e^2(M_2)}
(4.29)
for homoscedastic errors, or alternatively:

E\!\left(\hat{\phi}(M_1, M_2)\right) = \idotsint \frac{g_e(M_1, \mathbf{x})}{g_e(M_2, \mathbf{x})} \prod_j \left(f_j(x_j)\, dx_j\right) = \int \frac{h_e(M_1, y)}{h_e(M_2, y)}\, f(y)\, dy
(4.30)
for heteroscedastic errors.

If the residual errors can be considered normally distributed [46], or the sample size is large enough so that the Central Limit Theorem [47] applies, we can evaluate the statistical significance of the difference between the model performances using an F-test. Using this approach, we may conclude that the residual error of the model is not significantly different from the measurement error if:

F_{\alpha/2}\!\left(n - p(M),\, \nu_m\right) \le \hat{\phi}(M) \le F_{1-\alpha/2}\!\left(n - p(M),\, \nu_m\right)
(4.31)

where ν_m is the number of degrees of freedom used in the estimation of the measurement error variance.

In addition, in order to avoid the subjective determination of a significance level for this test, an
optimal significance level can be used [48].
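A possible implementation of this acceptance test, under the two-sided F-test reconstruction of Eq. (4.31) given above (the degrees of freedom and significance level in the example are illustrative), is:

```python
from scipy import stats

def phi_compatible_with_one(res_var, dof_res, meas_var, dof_meas, alpha=0.05):
    """Two-sided F-test of the variance ratio against 1 (a sketch of the
    acceptance criterion discussed for Eq. 4.31)."""
    phi = res_var / meas_var
    lo = stats.f.ppf(alpha / 2, dof_res, dof_meas)
    hi = stats.f.ppf(1 - alpha / 2, dof_res, dof_meas)
    return lo <= phi <= hi

# Example: residual variance 1.3 (20 d.o.f.) vs. measurement variance 1.0 (30 d.o.f.)
print(phi_compatible_with_one(1.3, 20, 1.0, 30))
```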

4.5.8. Tie-break: Applying the Principle of Parsimony and Normal Fit

As it was previously mentioned, different models may reach the maximum possible fit, or they
can obtain fitting coefficients that cannot be considered statistically different. In those cases
we need a different criterion for model selection.

In first place, by applying the principle of parsimony we may prefer simpler models, that is, those with fewer fitted parameters. For example, a constant model would be preferred over a linear model if they reach a similar fitness value. Similarly, a linear model would be preferred over a polynomial model. Now, since we may have models with “given” parameter values not fitted from the data, we would also prefer models with a simpler mathematical representation. However, in this case the principle of parsimony may become ambiguous and subjective. Furthermore, we may have several models with the same model structure but different parameter values achieving similar fitness values.


For that reason, an alternative approach proposed for choosing the best model in case of a tie is choosing the model having residuals with the largest normality value [46], or the lowest probability value in a normality test, or in general, showing a better fit to a normal probability model [1].

4.5.9. Comparing Models with Multiple Response Variables

When each model predicts multiple response variables measured experimentally, one fitness coefficient will be obtained for each response variable. The overall fitness coefficient (CF(M)) of model M can be determined in at least two possible ways:

 Geometric mean:

CF(M) = \left(\prod_{v=1}^{m} CF_v(M)\right)^{1/m}
(4.32)

 Minimum value:

CF(M) = \min_{v}\left(CF_v(M)\right)
(4.33)

where m is the number of response variables considered, and CF_v(M) is the fitness coefficient of model M for the response variable v.
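A minimal sketch of Eq. (4.32) and (4.33) (function and argument names are illustrative):

```python
import numpy as np

def overall_cf(cf_values, method="geometric"):
    """Overall fitness coefficient for multiple response variables:
    Eq. (4.32) (geometric mean) or Eq. (4.33) (minimum)."""
    cf = np.asarray(cf_values, float)
    if method == "geometric":
        return float(np.prod(cf) ** (1.0 / len(cf)))
    return float(cf.min())

print(overall_cf([0.95, 0.80, 0.99]))                 # geometric mean
print(overall_cf([0.95, 0.80, 0.99], method="min"))   # most conservative choice
```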

4.6. Model Similitude

Finally, let us consider the case where two models must be compared independently of any
experimental data. In this case, rather than considering a “fitness” coefficient, a “similitude”
coefficient is preferred. In a previous report [49], a similitude coefficient was introduced for
comparing two different probability density functions. Now, let us extend such concept for the
comparison of two mathematical models in general.

The similitude coefficient (S_y) between two models (M_1 and M_2) predicting a certain response variable (y) can be defined as:

S_y(M_1, M_2) = 1 - \frac{\idotsint \left|\hat{y}(M_1, \mathbf{x}) - \hat{y}(M_2, \mathbf{x})\right| \prod_j \left(f_j(x_j)\, dx_j\right)}{\idotsint \left(\left|\hat{y}(M_1, \mathbf{x})\right| + \left|\hat{y}(M_2, \mathbf{x})\right|\right) \prod_j \left(f_j(x_j)\, dx_j\right)}
(4.34)


Notice the structural similarity between the fitness coefficient (e.g. Eq. 4.9) and the similitude coefficient (Eq. 4.34). Also notice that, considering probability density functions as models and assuming a uniform distribution of values in the regressor, the original similitude coefficient is obtained.
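A Monte Carlo approximation of the similitude coefficient, under the reconstruction of Eq. (4.34) given above (the models, sampler and names are hypothetical), could be sketched as follows:

```python
import numpy as np

def similitude(model1, model2, sampler, n=100_000, seed=0):
    """Monte Carlo approximation of the similitude coefficient between two models
    over a distribution of regressor values."""
    rng = np.random.default_rng(seed)
    x = sampler(rng, n)                  # draws from the regressor distribution
    y1, y2 = model1(x), model2(x)
    num = np.mean(np.abs(y1 - y2))
    den = np.mean(np.abs(y1) + np.abs(y2))
    return 1.0 - num / den

# Example: two candidate models compared over x ~ Uniform(0, 10)
m1 = lambda x: 2.0 + 0.5 * x
m2 = lambda x: 1.8 + 0.55 * x
print(similitude(m1, m2, lambda rng, n: rng.uniform(0.0, 10.0, n)))
```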

5. Numerical Examples

5.1. McGuirk & Driscoll Example: Data with Different Inclination

As a first example, let us consider the data presented in Figure 1 and Figure 2, extracted from
Figure 3 presented by McGuirk & Driscoll [32]. The corresponding numerical values are
summarized in Table 1.
Table 1. Data with different inclination. Adapted from Figure 3 in [32].

x y(A) y(B) x y(A) y(B)
1 7.5 7.9 26 7.0 17.4
2 9.5 10.3 27 11.2 22.0
3 8.2 9.4 28 9.4 20.6
4 9.0 10.6 29 10.4 22.0
5 12.9 14.9 30 10.9 22.9
6 10.1 12.5 31 10.0 22.4
7 8.1 10.9 32 9.8 22.6
8 9.3 12.5 33 10.0 23.2
9 11.3 14.9 34 13.8 27.4
10 9.2 13.2 35 11.5 25.5
11 10.3 14.7 36 11.4 25.8
12 9.3 14.1 37 11.4 26.2
13 10.0 15.2 38 8.4 23.6
14 12.0 17.6 39 9.4 25.0
15 10.8 16.8 40 10.2 26.2
16 10.1 16.5 41 10.6 27.0
17 9.5 16.3 42 9.4 26.2
18 8.5 15.7 43 8.4 25.6
19 12.1 19.7 44 10.6 28.2
20 12.6 20.6 45 9.4 27.4
21 11.3 19.7 46 8.8 27.2
22 7.0 15.8 47 8.4 27.2
23 10.8 20.0 48 10.6 29.8
24 11.3 20.9 49 9.4 29.0
25 10.5 20.5 50 8.4 28.4

Data set A represents data with a perfectly horizontal trend, whereas data set B represents the same data with a slope of 0.4.


Unfortunately, we do not have information about the measurement error for this data. Thus, let us consider two possible scenarios:

 Low noise measurement: In this first scenario, let us assume that the measurement error is limited to rounding error. Since the data is presented with a resolution of 0.1, the corresponding rounding error variance (assuming a uniform truncation error) will be:

\hat{u}_y^2 = \frac{(0.1)^2}{12} \approx 8.3 \times 10^{-4}
(5.1)

 High noise measurement: In the second case, we will consider a measurement error variance of û_y² = 2, assuming that the data approximately represents a biased, white noise random variable.

Table 2 summarizes the linear models obtained by OLS minimization of each data set, along with their corresponding values of SSE, σ̂_e², σ̂_e, the normality value of the residuals [46], R², and the homoscedastic CF for both scenarios.

Table 2. Results summary for two data sets with different inclination
                          Data set A   Data set B
Intercept                 10           10
Slope                     0            0.4
SSE                       104.42       104.42
Residual error variance   2.1754       2.1754
Residual error std. dev.  1.4749       1.4749
Normality value           2.7338       2.7338
R²                        0            0.941
CF (Low noise)            0.0008       0.0008
CF (High noise)           0.958        0.958

While the R² value is greatly affected by the slope value, we can observe that all other performance coefficients are identical between both data sets, an expected result considering that their data structure is basically the same. In this sense, the CF fitness coefficient satisfactorily eliminates the artificial effect of the slope on the model performance.

On the other hand, we can appreciate large differences in fitness values depending on the level
of noise assumed. This example illustrates the importance of accurately determining the
measurement error of data sets. It also shows a different perception of the same model for the
same data, depending on our error tolerance. If the data set describes accurate measurements,
we then notice that the effect of “latent” variables is missing in the model. On the other hand,
if the data is noisy, we can be completely satisfied with the fitted model.
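The CF values reported in Table 2 can be checked with a few lines of Python, using the residual variance derived from the SSE and the measurement error variances assumed in the two scenarios described above:

```python
# Fitness coefficients for the example of Section 5.1 (values from Table 2)
sse, n, p = 104.42, 50, 2
res_var = sse / (n - p)                 # residual error variance, Eq. (4.10)

def cf(res_var, u2):                    # homoscedastic CF, Eq. (4.9)
    return 1.0 - abs(res_var - u2) / (res_var + u2)

u2_low = 0.1 ** 2 / 12                  # rounding error only, Eq. (5.1)
u2_high = 2.0                           # high-noise scenario assumed above
print(round(cf(res_var, u2_low), 4), round(cf(res_var, u2_high), 3))  # ~0.0008 and ~0.958
```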


5.2. Draper Example: Data with Different Sample Sizes

Draper [31] proposes an example where two data sets of the same system are obtained considering different sample sizes, resulting in the same fitted model but different R² values. In particular, while the first data set only contains one observation for each observation condition, the second data set considers several replications.

Draper does not provide numerical values for this example, so they are obtained as follows: An arbitrary linear model is defined (in this example, y = 10 - x). The observation conditions considered are x = 1, 2, ..., 6. At each observation condition x_i, 10 replicates are obtained using the following equation:

y_r(x_i) = 10 - x_i + \varepsilon_r
(5.2)

where r represents the replicate number, and ε_r is an unbiased normal random variable:

\varepsilon_r \sim N\!\left(0, \sigma_\varepsilon^2\right)
(5.3)

The values obtained are then truncated to two decimal positions, and are denoted as data set B. Now, data set A is defined as the truncated median value of the replicates for each x_i.

The numerical data is summarized in Table 3 and illustrated in Figure 3, with the corresponding
linear models fitted by OLS.

Table 3. Data with different sample sizes. Adapted from Figure 1 in [31].
X 1 2 3 4 5 6
Data set A y 8.79 7.98 7.46 6.15 4.95 3.88
y(1) 9.77 9.41 8.54 10.20 10.02 8.70
y(2) 8.04 8.08 8.87 8.37 6.88 7.06
y(3) 8.82 8.15 8.09 7.99 7.96 7.49
y(4) 10.06 7.51 6.29 7.45 6.54 7.46
y(5) 7.87 5.42 7.83 7.72 5.90 7.53
Data set B
y(6) 6.62 4.89 6.63 7.64 4.78 6.97
y(7) 6.32 4.77 5.39 5.98 4.98 5.10
y(8) 4.41 5.41 6.25 4.91 3.73 6.70
y(9) 4.83 3.68 4.50 4.05 5.62 4.46
y(10) 3.70 3.66 3.18 3.47 5.05 2.31


The determination of measurement error is done by estimating the measurement uncertainty from data set B. The sample variance is determined for each observation condition, and then the pooled variance is calculated under the homoscedastic assumption. Since the measurement error cannot be determined from data set A, we will consider two possible scenarios: a) the measurement error obtained from data set B is used (û_y²), and b) since data set A is actually the median of the replicates in data set B, the combined measurement error of the average value will be used as an estimate of the measurement error (û_y²/10).

Figure 3. Illustrative example of two data sets (Table 3) from the same system with different sample sizes. Left plot: Data set A. Right plot: Data set B. Blue dots: Experimental data. Red line: OLS regression model. Adapted from Figure 1 in [31].

Table 4 summarizes the fitted parameters obtained for each model, as well as the different
metrics for assessing error and fitness.

Table 4. Results summary for two data sets with different sample sizes
                          Data set A   Data set B
# Replicates              1            10
Intercept                 10.03        10.001
Slope                     -0.9986      -1.0001
SSE                       0.2881       46.09
Residual error variance   0.0720       0.7947
Residual error std. dev.  0.2684       0.8914
Normality value           0.2146       2.5365
R²                        0.9838       0.7916
CF (Scenario a)           0.1556       0.9643
CF (Scenario b)           0.9153       0.9643

This example was proposed by Draper to illustrate how the same model shows a decrease in R² just by increasing the sample size. Also, R² values cannot be directly compared for different


According to the fitness coefficient computed with the same measurement error in both cases (scenario a), we find a very low CF value for data set A as a result of "apparent overfitting" (σ̂²_ε < σ̂²_m). For the same measurement error, a representative sample should not behave like this; if it does, it is the result of sampling error caused by a deficient sample size. Now, since data set A was actually obtained as a central measure of the original replicates, it is more reasonable to compare its residual error with the measurement error of the average value, as indicated in scenario b. As a result, the difference in fitness between the two data sets decreases. In fact, we may check the statistical significance of the difference in variance ratios using:

$$F = \frac{\hat{\sigma}_{\varepsilon,A}^2 / \hat{\sigma}_{m,A}^2}{\hat{\sigma}_{\varepsilon,B}^2 / \hat{\sigma}_{m,B}^2} \qquad (5.4)$$

The resulting p-value for a lower-tail F-distribution is not significant (the residual errors of both models can be considered normal according to their Z-values), clearly indicating that the fitness coefficients of both models are not significantly different (for scenario b). Nevertheless, the higher Z-value of the residuals of model B confirms that it provides a more reliable fit.
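
A sketch of this significance check using scipy is given below. The form of Eq. (5.4) as reconstructed above, and the degrees of freedom in the usage example, are assumptions based on the sizes of the two data sets.

```python
from scipy import stats

def variance_ratio_comparison(ratio_A, ratio_B, df_A, df_B):
    """Compare two variance ratios; returns the statistic and its lower-tail F p-value."""
    F = ratio_A / ratio_B
    p_lower = stats.f.cdf(F, df_A, df_B)
    return F, p_lower

# Usage sketch, with the residual degrees of freedom of the two fits (6 - 2 and 60 - 2):
# F, p = variance_ratio_comparison(var_eA / var_mA, var_eB / var_mB, df_A=4, df_B=58)
```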

5.3. Anscombe Example: Different Data with Identical R² Values

Anscombe [44] proposed an interesting situation in which different data sets produce linear fits with exactly the same performance in terms of the R² coefficient. A fifth data set (data set D) with identical R² is included here, after a similar example proposed by Schober et al. [13]. The data set values are presented in Table 5 and illustrated in Figure 4.

Table 5. Different data sets with identical R² values. Obtained from Table 1 in [44] and adapted from Figure 3 in [13].
x (A–D)   y (A)   y (B)   y (C)   y (D)   x (E)   y (E)
   4       4.26    3.1     5.39    7.01     8      6.58
   5       5.68    4.74    5.73    5.71     8      5.76
   6       7.24    6.13    6.08    4.93     8      7.71
   7       4.82    7.26    6.42    4.92     8      8.84
   8       6.95    8.14    6.77    5.68     8      8.47
   9       8.81    8.77    7.11    6.97     8      7.04
  10       8.04    9.14    7.46    8.43     8      5.25
  11       8.33    9.26    7.81    9.62     8      5.56
  12      10.84    9.13    8.15   10.19     8      7.91
  13       7.58    8.74   12.74    9.97     8      6.89
  14       9.96    8.1     8.84    9.02    19     12.5


Figure 4. Illustrative example of five different data sets (Table 5) fitted by OLS linear regression and having identical R² values. Blue dots: Experimental data. Red line: OLS regression model. Data obtained from Table 1 in [44] (data sets A, B, C and E) and adapted from Figure 3 in [13] (data set D).

The main issue with these data sets is that no measurement error information is explicitly available. Again, let us consider two scenarios (a computational sketch follows this list):
• Low noise: Only the truncation error of the reported values is considered. In this case, σ̂²_m is very small (on the order of 10⁻⁵ or less).
• High noise: The replicates at x = 8 in data set E are used to estimate the measurement error of all data sets. In this case, σ̂²_m ≈ 1.53.
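
The sketch below shows how the two scenarios translate into fitness coefficients for data set A. The truncation-only measurement variance and the use of the data set E replicates as a pooled measurement variance are the assumptions stated above.

```python
import numpy as np

def cf(var_resid, var_meas):
    # Homoscedastic fitness coefficient (Eq. 4.9)
    return 1.0 - abs(var_resid - var_meas) / (var_resid + var_meas)

x   = np.array([4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype=float)
y_A = np.array([4.26, 5.68, 7.24, 4.82, 6.95, 8.81, 8.04, 8.33, 10.84, 7.58, 9.96])
y_E = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89])  # replicates at x = 8

# OLS fit of data set A and unbiased residual variance
slope, intercept = np.polyfit(x, y_A, 1)
resid = y_A - (intercept + slope * x)
var_resid = np.sum(resid**2) / (x.size - 2)

var_meas_low  = 0.01**2 / 12        # assumed truncation-only measurement variance
var_meas_high = y_E.var(ddof=1)     # measurement variance estimated from data set E replicates

print(cf(var_resid, var_meas_low))   # ~0: residual error far exceeds the assumed measurement error
print(cf(var_resid, var_meas_high))  # close to 1: residual error comparable to the measurement error
```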

Table 6 summarizes the performance of linear models fitted by OLS for all 5 data sets,
considering both scenarios.


Table 6. Results summary for different data sets with identical R² values
                      Data set A   Data set B   Data set C   Data set D   Data set E
Intercept                   3            3            3            3            3
Slope                       0.5          0.5          0.5          0.5          0.5
SSE                     13.7627      13.7763      13.7562      13.7855      13.7425
σ̂²_ε                     1.5292       1.5307       1.5285       1.5317       1.5269
σ̂_ε                      1.2366       1.2372       1.2363       1.2376       1.2357
Z-value (normality)      1.1387      -0.2137      -4.9852       1.2497       1.5013
R²                       0.6665       0.6662       0.6663       0.6663       0.6667
CF (Low noise)           0.0000       0.0000       0.0000       0.0000       0.0000
CF (High noise)          0.9993       0.9988       0.9995       0.9984       1.0000

First of all, notice again the importance of accurately determining the measurement error for a reliable assessment of the "goodness of fit" of a model. In this particular example, it is highly likely that data sets A and E involve large measurement errors, whereas data sets B, C and D were likely obtained with small measurement errors. In that sense, the linear models obtained for data sets A and E might be considered good models, whereas all other linear models are unsatisfactory.

Notice also that all models have the same fitness coefficient as long as the same measurement error is assumed. Since we do not have certainty about the true measurement error of the data, and assuming it was the same for all data sets, an alternative criterion is required for evaluating model performance. The normality of the residuals is a viable option: negative Z-values indicate poor performance, whereas larger positive values indicate good performance. In that sense, the linear models for data sets A, D and E can be considered satisfactory, whereas the linear models for data sets B and C are inadequate.

As a final remark for this example, we may conclude that normality Z-values are a good complement to fitness coefficients for assessing and comparing the performance of different models.

5.4. Comparison of Different Models in a Response Surface Design

The next example corresponds to Example 3 presented in [45]. The data, obtained from [50], correspond to a response surface methodology used for maximizing the conversion of refined sunflower oil into fatty acid methyl esters (biodiesel). The factors considered were the reaction temperature and the catalyst (sodium hydroxide) concentration. The experimental results obtained are summarized in Table 7.


Table 7. Biodiesel production optimization [50]


Run T (°C) C (wt%) Y0 (%)
1 25.0 0.50 86.0
2 65.0 0.50 98.1
3 25.0 1.50 99.7
4 65.0 1.50 100.0
5 45.0 1.00 97.7
6 45.0 1.00 97.8
7 45.0 1.00 97.6
8 45.0 1.00 98.0
9 45.0 1.71 100.0
10 73.3 1.00 99.7
11 16.7 1.00 96.6
12 45.0 0.29 89.0

Six different prediction models, all of them fitted by OLS, are considered (a fitting sketch in Python follows the list and its footnote). In the expressions below, the b's denote the corresponding OLS-fitted coefficients:

• Model 1: Best constant model (the sample mean of the observed conversions, ≈ 96.7%)
$$\hat{Y} = b_0 \qquad (5.5)$$
• Model 2: Linear model without interactions
$$\hat{Y} = b_0 + b_T T + b_C C \qquad (5.6)$$
• Model 3: Interaction-only model
$$\hat{Y} = b_0 + b_{TC}\,T C \qquad (5.7)$$
• Model 4: Full linear model (main effects and interaction)
$$\hat{Y} = b_0 + b_T T + b_C C + b_{TC}\,T C \qquad (5.8)$$
• Model 5: Full quadratic model
$$\hat{Y} = b_0 + b_T T + b_C C + b_{TC}\,T C + b_{TT} T^2 + b_{CC} C^2 \qquad (5.9)$$
• Model 6: Best relevant model ‡‡‡ (Model 5 with its single non-relevant term removed, leaving five fitted parameters)
$$\hat{Y} = b_0 + \sum_{k\,\in\,\text{relevant terms}} b_k z_k, \qquad z_k \in \{T,\, C,\, TC,\, T^2,\, C^2\} \qquad (5.10)$$

‡‡‡ The best relevant model considers only the terms whose absolute t-statistic values exceed a pre-defined threshold, irrespective of whether the terms are statistically significant or not.
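
A minimal fitting sketch for these models is given below, using the raw factor levels from Table 7. Whether the report used raw or coded (standardized) factors is not stated here, so the interaction-only and quadratic fits may differ slightly from Table 8; Model 6 is omitted because the retained subset of terms is not reproduced.

```python
import numpy as np

T = np.array([25.0, 65.0, 25.0, 65.0, 45.0, 45.0, 45.0, 45.0, 45.0, 73.3, 16.7, 45.0])
C = np.array([0.50, 0.50, 1.50, 1.50, 1.00, 1.00, 1.00, 1.00, 1.71, 1.00, 1.00, 0.29])
Y = np.array([86.0, 98.1, 99.7, 100.0, 97.7, 97.8, 97.6, 98.0, 100.0, 99.7, 96.6, 89.0])

one = np.ones_like(T)
designs = {
    "Model 1 (constant)":         np.column_stack([one]),
    "Model 2 (linear)":           np.column_stack([one, T, C]),
    "Model 3 (interaction only)": np.column_stack([one, T * C]),
    "Model 4 (full linear)":      np.column_stack([one, T, C, T * C]),
    "Model 5 (full quadratic)":   np.column_stack([one, T, C, T * C, T**2, C**2]),
}

sigma2_m = np.var(Y[4:8], ddof=1)   # measurement variance from the central-point replicates (runs 5-8)

for name, X in designs.items():
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)               # OLS fit
    resid = Y - X @ b
    var_resid = np.sum(resid**2) / (Y.size - X.shape[1])    # unbiased residual variance
    cf = 1.0 - abs(var_resid - sigma2_m) / (var_resid + sigma2_m)
    print(f"{name}: SSE = {np.sum(resid**2):.2f}, CF = {cf:.5f}")
```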


Assuming homoscedasticity, the measurement error variance is estimated from the replicates at the central point (runs 5–8), resulting in σ̂²_m ≈ 0.0292. Considering this value, the comparative performance of the models is summarized in Table 8.

Table 8. Results summary for different models in a Response Surface Design. Red value: Worst performance. Green value: Best performance.
                      Model 1    Model 2    Model 3    Model 4    Model 5    Model 6
SSE                    220.24      63.69     109.82      28.88       8.43       8.49
σ̂²_ε                    20.02       7.08      10.98       3.61       1.41       1.21
σ̂_ε                      4.47       2.66       3.31       1.90       1.19       1.10
R²                       0          0.7108     0.5014     0.8688     0.9617     0.9615
Adjusted R²              0          0.6465     0.4515     0.8197     0.9298     0.9395
Z-value (normality)    -4.6127      0.4333     0.5413     0.3226     0.5897     0.8336
CF                      0.00291     0.00821    0.00530    0.01603    0.04066    0.04699

If we had to choose a model based on the value of the R² coefficient (or the adjusted R²), we would conclude that model 5 (full quadratic) is the best model. Considering the parsimony principle, and taking into account the best CF value (or, equivalently, the lowest residual error variance), model 6 performs better. Let us recall that using R² (or adjusted R²) for comparing linear models with different terms fitted to the same data set is a valid procedure. Unfortunately, these coefficients do not provide a reliable assessment of the "absolute goodness of fit" of individual models. For example, models 5 and 6, with R² > 0.96, seem to offer an excellent fit. However, by observing the fitness coefficients we may conclude that none of these models is actually satisfactory (CF < 0.05). Of course, model 6 remains the best of all the models considered (largest CF and Z-values), but its fit is actually poor compared to the observed measurement error.

The main issue in this example is probably that the response variable considered (% conversion) is heteroscedastic and non-normal, particularly since values close to the 100% bound are observed. Unfortunately, the experimental data available cannot be used to obtain a reliable heteroscedastic measurement error model.

Two additional recommendations can be made here. First, consider a transformation of the response variable, such as the logit transformation

$$Y' = \ln\!\left(\frac{Y}{100\% - Y}\right) \qquad (5.11)$$

taking into account that values of exactly 100% are not viable under this transformation. Second, the measurement error should be obtained as a combined error, since the conversion is not directly measured but calculated from other experimental observations.
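
A small sketch of the suggested transformation, assuming the conversion Y is expressed in percent and lies strictly between 0% and 100%:

```python
import numpy as np

def logit_percent(y_percent):
    """Logit transformation of a percentage response (Eq. 5.11, as reconstructed above)."""
    y = np.asarray(y_percent, dtype=float)
    if np.any((y <= 0.0) | (y >= 100.0)):
        raise ValueError("The logit transformation requires 0 < Y < 100 %")
    return np.log(y / (100.0 - y))

def inverse_logit_percent(z):
    """Back-transformation from the logit scale to the original percentage scale."""
    return 100.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

print(logit_percent([86.0, 97.7, 99.7]))   # unbounded, transformed responses
```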


5.5. Heteroscedastic Data

Let us now consider Example 3 presented in [1], consisting of the calibration data of a high-performance liquid chromatography (HPLC) method for quantifying caffeine in coffee samples [51]. The experimental data are presented in Table 9 and illustrated in Figure 5.

Table 9. HPLC calibration data (peak areas) for caffeine [51]
Observation #   1 mg/L    10 mg/L    20 mg/L    30 mg/L    40 mg/L    50 mg/L
1               112150    1228710    2378700    3558660    4789280    5969690
2               103990    1215970    2372930    3577220    4826450    5960160
3               122080    1234480    2348690    3595400    4866850    5995480
4               113690    1237890    2449910    3656870    4894560    6070520
5               118310    1225250    2367160    3675230    4901490    6079670
6               108370    1226940    2452340    3781300    4917650    6104000
7               109000    1237890    2481440    3693390    4950550    6121290
8               120480    1251960    2458360    3708060    4988070    6125990
9                    -          -    2472870    3714540          -    6164030
10                   -          -          -    3706050          -    6200400

Figure 5. Calibration data for the determination of caffeine concentration in coffee samples
using a chromatographic method [51].

The linear model obtained from this data by OLS is [1]:

$$\hat{A} = b_0 + b_1 C \qquad (5.12)$$

with an R² value very close to 1, where $C$ is the concentration of caffeine in the sample (in mg/L), $A$ is the peak area obtained, and $b_0$, $b_1$ are the fitted calibration coefficients reported in [1].


Unfortunately, the residual error is heteroscedastic [1]. Table 10 shows the estimated measurement error variance (σ̂²_m) for each observation condition, along with the corresponding residual model error (σ̂²_ε), estimated as the average squared residual at each observation condition. The individual variance ratios F̂ = σ̂²_ε/σ̂²_m are also shown.

Table 10. Heteroscedastic measurement and residual error variances
Caffeine [mg/L]        σ̂²_m           σ̂²_ε           F̂ = σ̂²_ε/σ̂²_m
 1                     40619698       108948576      2.6822
10                    115956255       287559067      2.4799
20                   2722487550      2708915271      0.9950
30                   4925070862      4551101432      0.9241
40                   4127522250      3919865333      0.9497
50                   6634331690      6158629623      0.9283

From this information we find an average variance ratio E(F̂) ≈ 1.49, with a corresponding heteroscedastic fitness coefficient CF ≈ 0.80. The normality Z-value of the residuals was also evaluated for this model.

While this heteroscedastic fitness coefficient is considerably lower than the R² value of the model, by looking at Table 10 we can conclude that the fitness coefficient provides a more realistic assessment of the model performance. Notice that the residual error at low caffeine concentrations is more than two times larger than the measurement error determined from the experimental data. Nevertheless, the p-value for the corresponding F distribution is larger than the corresponding optimal significance level, indicating that the residual error of the model can already be considered within the experimental measurement error.
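
These values can be reproduced directly from Table 10. The sketch below assumes that E(F̂) is estimated as the simple average of the per-condition variance ratios, consistent with the description above.

```python
import numpy as np

var_meas  = np.array([40619698, 115956255, 2722487550, 4925070862, 4127522250, 6634331690], float)
var_resid = np.array([108948576, 287559067, 2708915271, 4551101432, 3919865333, 6158629623], float)

ratios = var_resid / var_meas            # per-concentration variance ratios (last column of Table 10)
expected_ratio = ratios.mean()           # simple average as the estimate of E(F)
cf_het = 1.0 - abs(expected_ratio - 1.0) / (expected_ratio + 1.0)

print(round(expected_ratio, 2))   # about 1.49
print(round(cf_het, 2))           # about 0.80
```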

5.6. Effect of Sampling Error

As a final example, let us consider a Monte Carlo simulation exercise where “experimental” data are generated from a pre-defined simple linear model by adding an unbiased normal random error with known variance σ².

Each data set comprises a fixed number of random observations, where the independent variable is uniformly generated within a given range and truncated after the first decimal digit. The measured response variable is also truncated after the first decimal digit. Thus, the measurement error variance combines the known noise variance and the truncation contribution:

$$\hat{\sigma}_m^2 = \sigma^2 + \frac{0.1^2}{12} \qquad (5.13)$$


Two models are considered for each data set: 1) the model obtained by OLS regression of the generated data, and 2) the source model used for generating the data. Each model is evaluated using the R² coefficient and the homoscedastic fitness coefficient CF.
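
A compact sketch of one such simulation is shown below. The sample size, the range of the independent variable, the number of generated data sets and the exact form of Eq. (5.13) are assumptions, since the specific settings are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_once(b0, b1, sigma, n=30, x_max=10.0):
    """Generate one truncated data set and return (R2, CF) for its OLS fit."""
    x = np.trunc(rng.uniform(0.0, x_max, n) * 10) / 10                 # x truncated to one decimal
    y = np.trunc((b0 + b1 * x + rng.normal(0.0, sigma, n)) * 10) / 10  # truncated response
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    sse = np.sum(resid**2)
    r2 = 1.0 - sse / np.sum((y - y.mean())**2)
    var_resid = sse / (n - 2)
    var_meas = sigma**2 + 0.1**2 / 12        # assumed measurement variance (Eq. 5.13 as reconstructed)
    cf = 1.0 - abs(var_resid - var_meas) / (var_resid + var_meas)
    return r2, cf

results = np.array([simulate_once(b0=10.0, b1=0.1, sigma=1.0) for _ in range(100)])
print("R2 min/average/max:", results[:, 0].min(), results[:, 0].mean(), results[:, 0].max())
print("CF min/average/max:", results[:, 1].min(), results[:, 1].mean(), results[:, 1].max())
```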

A sample of independently generated random data sets is used for the analysis of each particular set of conditions (original model parameters and normal error variance). The results obtained for the different sets of conditions are summarized in Table 11.

Table 11. Comparative model assessment after Monte Carlo data generation for different sets of conditions
Intercept  Slope  Meas. error  Model    R² (min / average / max)       CF (min / average / max)
    0       1        0.1       OLS      0.97229 / 0.98832 / 0.99462    0.72434 / 0.87984 / 0.99043
    0       1        0.1       Source   0.97226 / 0.98996 / 0.99548    0.69281 / 0.88697 / 0.99255
    0       1        1.0       OLS      0.69714 / 0.88163 / 0.96168    0.67022 / 0.84729 / 0.99132
    0       1        1.0       Source   0.70988 / 0.88519 / 0.97149    0.50467 / 0.88632 / 0.99776
   10       1        0.1       OLS      0.98124 / 0.98860 / 0.99742    0.59343 / 0.84530 / 0.99679
   10       1        0.1       Source   0.98292 / 0.98914 / 0.99830    0.40000 / 0.86184 / 0.99749
   10       1        1.0       OLS      0.81742 / 0.89701 / 0.94877    0.61262 / 0.84360 / 0.97338
   10       1        1.0       Source   0.80334 / 0.90952 / 0.96149    0.65501 / 0.84899 / 0.99900
   10       0.1      0.1       OLS      0.36041 / 0.50888 / 0.70833    0.67037 / 0.87285 / 0.99738
   10       0.1      0.1       Source   0.32139 / 0.51338 / 0.75272    0.66831 / 0.84797 / 0.98783
   10       0.1      1.0       OLS      0.00101 / 0.10315 / 0.33538    0.69609 / 0.87022 / 0.99475
   10       0.1      1.0       Source   0.00561 / 0.16822 / 0.44423    0.55565 / 0.87141 / 0.99595

At first sight, looking only at the first set of conditions, we might conclude that the CF coefficient is more sensitive to sampling error than the R² coefficient. However, for all other sets of conditions we notice that the R² coefficient is strongly affected by the measurement error and by the slope of the source model, whereas the behavior of the CF coefficient remains essentially unchanged. The intercept value does not have any effect on either of the two coefficients. We can also observe that the behavior of the OLS model is very similar to that of the source model.

While the high variability in the CF values is a clear drawback of the CF coefficient as an absolute measure of the goodness of fit, we may also notice that: i) such variability is induced by sampling error, and ii) low CF values are mainly caused by apparent overfitting.

Notice also that, above a certain minimum value of CF, all models have essentially the same performance. For this particular example, assuming one false positive in the set of simulated data sets (corresponding to the chosen significance level), we obtain a critical variance ratio and the corresponding minimum fitness value. Thus, models having CF values above this minimum are not statistically different.

6. Conclusion

The R² coefficient is perhaps the most commonly used criterion for evaluating the performance of mathematical models, as well as for comparing and selecting the best possible model for a given set of experimental data.

Different possible definitions of the R² coefficient have been proposed. The most commonly accepted definition is (Eq. 2.15):

$$R^2 = 1 - \frac{SSE}{SST} \qquad (2.15)$$

where

$$SSE = \sum_{i} (y_i - \hat{y}_i)^2 = \sum_{i} e_i^2 \qquad (2.16)$$

is the sum of squared errors, and

$$SST = \sum_{i} (y_i - \bar{y})^2 \qquad (2.17)$$

is the sum of total squares or total variation.
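
As a minimal reference implementation of these definitions:

```python
import numpy as np

def r_squared(y, y_hat):
    """R² coefficient according to Eqs. (2.15)-(2.17)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sse = np.sum((y - y_hat)**2)         # sum of squared errors (Eq. 2.16)
    sst = np.sum((y - y.mean())**2)      # total variation (Eq. 2.17)
    return 1.0 - sse / sst               # Eq. 2.15
```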

The R² coefficient is frequently referred to as the coefficient of determination (used to evaluate effects in random variables). However, the two are different concepts, and they are equivalent only for simple linear models with intercept fitted by ordinary least-squares (OLS) regression.

The R² coefficient must be used with caution for evaluating the "goodness of fit" of a model. Due to its implicit assumptions, its use should be limited to comparing different linear models with intercept (containing different terms) fitted by OLS, satisfying the assumptions of OLS, obtained from the same sample, using the same method of estimation of variation, and using the same scale of the response variable.

In addition, the R² coefficient does not provide an absolute assessment of the performance of a model. In fact, it is a relative index comparing the performance of a model with a reference "null" model, corresponding to the best constant model fitted by OLS to the experimental data.


An alternative to the R² coefficient has been proposed, with the purpose of satisfying the requirements of a good criterion for model comparison, including [7]:
• It should possess utility as a measure of goodness of fit and have an intuitively reasonable interpretation.
• It should be dimensionless, and independent of the units of measurement of the model variables.
• The potential range of values should be well defined, with endpoints corresponding to perfect fit (a value of 1) and complete lack of fit (a value of 0).
• It should be sufficiently general to be applicable to (a) any type of model, (b) variables that are random or not, and (c) any statistical properties of the model variables (including the residual error).
• It should not be confined to any specific model-fitting technique.
• It should be compatible with other acceptable measures of fit (e.g. standard error of prediction and root mean squared residual).
• It should weight positive and negative residuals equally.
• It should allow comparing models obtained from different data samples.
• It should allow comparing models obtained from different transformations of the response variable.
• It should allow comparing models considering different response variables.
• It should allow incorporating the principle of parsimony, penalizing overfitted models.
• It should be applicable to heteroscedastic residual errors.
• It should be easily determined or calculated.

The new criterion, denoted as the fitness coefficient CF, is defined as follows (a minimal implementation sketch follows these definitions):

• Homoscedastic fitness coefficient:

$$CF = 1 - \frac{\left|\hat{\sigma}_{\varepsilon}^2 - \hat{\sigma}_m^2\right|}{\hat{\sigma}_{\varepsilon}^2 + \hat{\sigma}_m^2} \qquad (4.9)$$

where σ̂²_ε is an unbiased estimate of the residual error variance of the model, and σ̂²_m is the estimated measurement variance of the response variable.
• Heteroscedastic fitness coefficient:

$$CF = 1 - \frac{\left|E(\hat{F}_R) - 1\right|}{E(\hat{F}_R) + 1} \qquad (4.11)$$

where E(F̂_R) represents the expected value of the variance ratio F̂_R = σ̂²_ε/σ̂²_m, determined by integrating the ratio of the condition-dependent residual and measurement error variances over their joint probability distribution (Eq. 4.12).
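
A direct implementation of the homoscedastic fitness coefficient, as reconstructed in Eq. (4.9), is straightforward:

```python
def fitness_coefficient(var_resid, var_meas):
    """Homoscedastic fitness coefficient CF (Eq. 4.9, as reconstructed above).
    CF = 1 when the residual variance equals the measurement variance, and it
    approaches 0 as the two variances differ by orders of magnitude."""
    return 1.0 - abs(var_resid - var_meas) / (var_resid + var_meas)

# Example: a residual variance ten times larger (or smaller) than the measurement
# variance gives CF = 1 - 9/11, i.e. about 0.18, regardless of the measurement units.
```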

Different examples are presented to illustrate and compare the use of the fitness coefficient CF as an alternative to the R² coefficient.

It is important to recall that the fitness coefficient is more sensitive to random error than R², and we must be aware that models having fitness values greater than a certain minimum CF_min are statistically identical:

$$CF_{min} = 1 - \frac{\left|F_{crit} - 1\right|}{F_{crit} + 1} \qquad (4.31)$$

where F_crit is a critical F-value obtained for the corresponding analysis conditions.
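
A sketch of how such a minimum fitness value could be obtained from an upper-tail critical F-value using scipy; the significance level and degrees of freedom shown are placeholders.

```python
from scipy import stats

def cf_min(alpha, df_resid, df_meas):
    """Minimum fitness value implied by a critical F-value (Eq. 4.31, as reconstructed)."""
    f_crit = stats.f.ppf(1.0 - alpha, df_resid, df_meas)    # upper-tail critical variance ratio
    return 1.0 - abs(f_crit - 1.0) / (f_crit + 1.0)

print(cf_min(alpha=0.01, df_resid=10, df_meas=3))   # placeholder degrees of freedom
```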

Typically, we may find that models with fitness values greater than this minimum can be considered "good" models. For model selection, additional criteria are needed, such as the number of fitted parameters or model simplicity (principle of parsimony); it is also highly advisable to test the normality of the residuals and to prefer models with higher normality values.

Finally, since the fitness coefficient depends on the measurement error variance, a careful
estimation of measurement error during experimental data acquisition is highly advisable for a
more reliable assessment of model performance.

Acknowledgment and Disclaimer

The author gratefully acknowledges Prof. Silvia Ochoa (Universidad de Antioquia) for reading
and revising the manuscript, as well as for helpful discussions on the topic.

This report provides data, information and conclusions obtained by the author(s) as a result of original
scientific research, based on the best scientific knowledge available to the author(s). The main purpose
of this publication is the open sharing of scientific knowledge. Any mistake, omission, error or inaccuracy
published, if any, is completely unintentional.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-
for-profit sectors.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC
4.0). Anyone is free to share (copy and redistribute the material in any medium or format) or adapt
(remix, transform, and build upon the material) this work under the following terms:


 Attribution: Appropriate credit must be given, providing a link to the license, and indicating if
changes were made. This can be done in any reasonable manner, but not in any way that
suggests endorsement by the licensor.
 NonCommercial: This material may not be used for commercial purposes.

References

[1] Hernandez, H. (2019). Goodness-of-fit of Randomistic Models. ForsChem Research Reports, 4,


2019-10, 1-27. doi: 10.13140/RG.2.2.35386.34248.
[2] Hernandez, H. (2023). Question Everything: Models vs. Reality. ForsChem Research Reports, 8,
2023-07, 1 - 11. doi: 10.13140/RG.2.2.35224.67845.
[3] Hernandez, H. (2020). Formulation and Testing of Scientific Hypotheses in the presence of
Uncertainty. ForsChem Research Reports, 5, 2020-01, 1-16. doi: 10.13140/RG.2.2.36317.97767.
[4] Anderson-Sprecher, R. (1994). Model comparisons and R2. The American Statistician, 48 (2), 113-
117. doi: 10.1080/00031305.1994.10476036.
[5] Azzalini, A. (1996). Statistical Inference Based on the likelihood. Chapman & Hall/CRC Press, Boca
Raton. ISBN: 9781032478012.
[6] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on
Automatic Control, 19 (6), 716-723. doi: 10.1109/TAC.1974.1100705.
[7] Kvålseth, T. O. (1985). Cautionary note about R2. The American Statistician, 39 (4), 279-285. doi:
10.1080/00031305.1985.10479448.
[8] Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the
Anthropological Institute of Great Britain and Ireland, 15, 246-263. doi: 10.2307/2841583.
[9] Galton, F. (1889). I. Co-relations and their measurement, chiefly from anthropometric data.
Proceedings of the Royal Society of London, 45 (273-279), 135-145. doi: 10.1098/rspl.1888.0082.
[10] Pearson, K. (1920). Notes on the history of correlation. Biometrika, 13 (1), 25-45. doi:
10.2307/2331722.
[11] Pearson, K. (1895). VII. Note on regression and inheritance in the case of two parents. Proceedings
of the Royal Society of London, 58 (347-352), 240-242. doi: 10.1098/rspl.1895.0041.
[12] Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient.
The American Statistician, 42 (1), 59-66. doi: 10.1080/00031305.1988.10475524.
[13] Schober, P., Boer, C., & Schwarte, L. A. (2018). Correlation coefficients: appropriate use and
interpretation. Anesthesia & Analgesia, 126 (5), 1763-1768. doi: 10.1213/ANE.0000000000002864.
[14] Guilford, J. P. (1936). Psychometric Methods. McGraw-Hill, New York. pp. 68-69.
https://archive.org/details/in.ernet.dli.2015.459761/page/n75/mode/2up.
[15] Ozer, D. J. (1985). Correlation and the coefficient of determination. Psychological Bulletin, 97 (2),
307. doi: 10.1037/0033-2909.97.2.307.
[16] Sapra, R. L. (2014). Using R2 with caution. Current Medicine Research and Practice, 4 (3), 130-134.
doi: 10.1016/j.cmrp.2014.06.002.
[17] Helland, I. S. (1987). On the interpretation and use of R2 in regression analysis. Biometrics, 61-69.
doi: 10.2307/2531949.


[18] Fomby, T.B., Johnson, S.R., Hill, R.C. (1984). Review of Ordinary Least Squares and Generalized
Least Squares. In: Advanced Econometric Methods. Springer, New York. doi: 10.1007/978-1-4419-
8746-4_2.
[19] Gauss, C. F., & Stewart, G. W. (1995). Theory of the combination of observations least subject to
errors, Part One, Part Two, Supplement. Society for Industrial and Applied Mathematics,
Philadelphia. doi: 10.1137/1.9781611971248.
[20] Scott, A., & Wild, C. (1991). Transformations and R2. The American Statistician, 45 (2), 127-129. doi:
10.1080/00031305.1991.10475785.
[21] Healy, M. J. R. (1984). The use of R2 as a measure of goodness of fit. Journal of the Royal
Statistical Society: Series A (General), 147 (4), 608-609. doi: 10.2307/2981848.
[22] Crocker, D. C. (1972). Some interpretations of the multiple correlation coefficient. The American
Statistician, 26 (2), 31-33. doi: 10.1080/00031305.1972.10477345.
[23] Sober, E. (1981). The principle of Parsimony. The British Journal for the Philosophy of Science, 32
(2), 145-156. doi: 10.1093/bjps/32.2.145.
[24] Achen, C. H. (1977). Measuring representation: Perils of the correlation coefficient. American
Journal of Political Science, 21 (4), 805-815. doi: 10.2307/2110737.
[25] Figueiredo Filho, D. B., Júnior, J. A. S., & Rocha, E. C. (2011). What is R2 all about? Leviathan (São
Paulo), 3, 60-68. doi: 10.11606/issn.2237-4485.lev.2011.132282.
[26] King, G. (1986). How not to lie with statistics: Avoiding common mistakes in quantitative political
science. American Journal of Political Science, 30 (3), 666-687. doi: 10.2307/2111095.
[27] Box, G. E. P. (1966). Use and abuse of regression. Technometrics, 8 (4), 625-629. doi:
10.1080/00401706.1966.10490407.
[28] Spiess, A. N., & Neumeyer, N. (2010). An evaluation of R2 as an inadequate measure for nonlinear
models in pharmacological and biochemical research: a Monte Carlo approach. BMC
pharmacology, 10 (1), 1-11. doi: 10.1186/1471-2210-10-6.
[29] Hagquist, C., & Stenbeck, M. (1998). Goodness of fit in regression analysis – R2 and G2
reconsidered. Quality and Quantity, 32 (3), 229-245. doi: 10.1023/A:1004328601205.
[30] Willett, J. B., & Singer, J. D. (1988). Another cautionary note about R2: Its use in weighted least-
squares regression analysis. The American Statistician, 42 (3), 236-238. doi:
10.1080/00031305.1988.10475573.
[31] Draper, N. R. (1984). The Box‐Wetz Criterion Versus R2. Journal of the Royal Statistical Society:
Series A (General), 147 (1), 100-103. doi: 10.2307/2981740.
[32] McGuirk, A. M., & Driscoll, P. (1995). The hot air in R2 and consistent measures of explained
variation. American Journal of Agricultural Economics, 77 (2), 319-328. doi: 10.2307/1243542.
[33] Cox, D. R., & Wermuth, N. (1992). A comment on the coefficient of determination for binary
responses. The American Statistician, 46 (1), 1-4. doi: 10.1080/00031305.1992.10475836.
[34] Barrett, J. P. (1974). The coefficient of determination—some limitations. The American
Statistician, 28 (1), 19-20. doi: 10.1080/00031305.1974.10479056.
[35] Nagelkerke, N. J. (1991). A note on a general definition of the coefficient of determination.
Biometrika, 78 (3), 691-692. doi: 10.1093/biomet/78.3.691.
[36] Kirkup, L., & Frenkel, R. B. (2006). An introduction to uncertainty in measurement: Using the GUM
(Guide to the expression of Uncertainty in Measurement). Cambridge University Press,
Cambridge. doi: 10.1017/CBO9780511755538.


[37] Hernandez, H. (2023). Heteroscedastic Regression Models. ForsChem Research Reports, 8, 2023-
08, 1 - 29. doi: 10.13140/RG.2.2.31538.58562.
[38] Hernandez, H. (2017). Multivariate Probability Theory: Determination of Probability Density
Functions. ForsChem Research Reports, 2, 2017-13, 1-13. doi: 10.13140/RG.2.2.28214.60481.
[39] Wiener, N. (1938). The Homogeneous Chaos. American Journal of Mathematics, 60 (4), 897-936.
doi: 10.2307/2371268.
[40] Hernandez, H. (2016). Modelling the effect of fluctuation in nonlinear systems using variance
algebra - Application to light scattering of ideal gases, ForsChem Research Reports, 1, 2016-1, 1-19.
doi: 10.13140/RG.2.2.36501.52969.
[41] Schwartz, L. M. (1975). Random error propagation by Monte Carlo simulation. Analytical
Chemistry, 47 (6), 963-964. doi: 10.1021/ac60356a027.
[42] Parra-Frutos, I. (2013). Testing homogeneity of variances with unequal sample sizes.
Computational Statistics, 28, 1269-1297. doi: 10.1007/s00180-012-0353-x.
[43] Hernandez, H. (2023). Towards a Robust and Unbiased Estimation of Standard Deviation.
ForsChem Research Reports, 8, 2023-03, 1 - 33. doi: 10.13140/RG.2.2.23633.81767.
[44] Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27 (1), 17-21. doi:
10.1080/00031305.1973.10478966.
[45] Hernandez, H. (2018). Statistical Modeling and Analysis of Experiments without ANOVA. ForsChem
Research Reports, 3, 2018-05, 1-27. doi: 10.13140/RG.2.2.21499.00803.
[46] Hernandez, H. (2021). Testing for Normality: What is the Best Method? ForsChem Research
Reports, 6, 2021-05, 1-38. doi: 10.13140/RG.2.2.13926.14406.
[47] Hernandez, H. (2019). Sums and Averages of Large Samples Using Standard Transformations: The
Central Limit Theorem and the Law of Large Numbers. ForsChem Research Reports, 4, 2019-01, 1-
14. doi: 10.13140/RG.2.2.32429.33767.
[48] Hernandez, H. (2021). Optimal Significance Level and Sample Size in Hypothesis Testing. 2. Tests of
Variances. ForsChem Research Reports, 6, 2021-07, 1-34. doi: 10.13140/RG.2.2.11266.20161.
[49] Hernandez, H. (2018). Parameter Identification using Standard Transformations: An Alternative
Hypothesis Testing Method. ForsChem Research Reports, 3, 2018-04, 1-44. doi:
10.13140/RG.2.2.14895.02728.
[50] Vicente, G., Coteron, A., Martinez, M., & Aracil, J. (1998). Application of the factorial design of
experiments and response surface methodology to optimize biodiesel production. Industrial
Crops and Products, 8 (1), 29-35. doi: 10.1016/S0926-6690(97)10003-6.
[51] Sanchez, J. (2018). Estimating detection limits in chromatography from calibration data: ordinary
least squares regression vs. weighted least squares. Separations, 5 (4), 49. doi:
10.3390/separations5040049.
