

See updates and corrections at http://www.stat.cmu.edu/~cshalizi/mreg/

Lecture 10: F-Tests, R^2, and Other Distractions


36-401, Section B, Fall 2015
1 October 2015

Contents

1 The F Test
  1.1 F test of β1 = 0 vs. β1 ≠ 0
  1.2 The Likelihood Ratio Test
2 What the F Test Really Tests
  2.1 The Essential Thing to Remember
3 R^2
  3.1 Theoretical R^2
  3.2 Distraction or Nuisance?
4 The Correlation Coefficient
5 More on the Likelihood Ratio Test
6 Concluding Comment
7 Further Reading

1 The F Test
The F distribution with a, b degrees of freedom is defined to be the distribution of the ratio
\[ \frac{\chi^2_a / a}{\chi^2_b / b} \]
when χ²_a and χ²_b are independent.
Since χ² distributions arise from sums of squared Gaussians, F-distributed random variables tend to arise when we are dealing with ratios of sums of squared Gaussians. The outstanding examples of this are ratios of variances.


1.1 F test of β1 = 0 vs. β1 ≠ 0


Let’s consider testing the null hypothesis β1 = 0 against the alternative β1 ≠ 0,
in the context of the Gaussian-noise simple linear regression model. That is,
we won’t question, in our mathematics, whether or not the assumptions of that
model hold, we’ll presume that they all do, and just ask how we can tell whether
β1 = 0.
We have said, ad nauseam, that under the unrestricted model,
\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2 \]
with
\[ \frac{n \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2} \]
This is true no matter what β1 is, so, in particular, it continues to hold when
β1 = 0 but we estimate the general model anyway.
The null model is that
\[ Y = \beta_0 + \epsilon \]
with ε ∼ N(0, σ²), independent of X and independently across measurements. It’s an exercise from 36-226 to show (really, remind!) yourself that, in the null model,
\[ \hat{\beta}_0 = \bar{y} \sim N(\beta_0, \sigma^2/n) \]
It is another exercise to show that the null model’s maximum-likelihood estimate of the noise variance is the sample variance of Y,
\[ \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2 = s^2_Y \]
and that
\[ \frac{n s^2_Y}{\sigma^2} \sim \chi^2_{n-1} \]
However, s²_Y is not independent of σ̂². What is statistically independent of σ̂² is the difference
\[ s^2_Y - \hat{\sigma}^2 \]
and
\[ \frac{n(s^2_Y - \hat{\sigma}^2)}{\sigma^2} \sim \chi^2_1 \]
I will not pretend to give a proper demonstration of this. Rather, to make it
plausible, I’ll note that s2Y − σ̂ 2 is the extra mean squared error which comes
from estimating only one coefficient rather than two, that each coefficient kills
one degree of freedom in the data, and the total squared error associated with
one degree of freedom, over the entire data set, should be about σ 2 χ21 .


Taking the previous paragraph on trust, then, let’s look at a ratio of variances:
\begin{align}
\frac{s^2_Y - \hat{\sigma}^2}{\hat{\sigma}^2} &= \frac{n(s^2_Y - \hat{\sigma}^2)}{n\hat{\sigma}^2} \tag{1}\\
&= \frac{n(s^2_Y - \hat{\sigma}^2)/\sigma^2}{n\hat{\sigma}^2/\sigma^2} \tag{2}\\
&= \frac{\chi^2_1}{\chi^2_{n-2}} \tag{3}
\end{align}
To get our F distribution, then, we need to use as our test statistic
\[ \frac{s^2_Y - \hat{\sigma}^2}{\hat{\sigma}^2} \cdot \frac{n-2}{1} = \left( \frac{s^2_Y}{\hat{\sigma}^2} - 1 \right)(n-2) \]
which will have an F_{1,n-2} distribution under the null hypothesis that β1 = 0. Note that the only random, data-dependent part of this is the ratio s²_Y/σ̂². We reject the null β1 = 0 when this is too large, compared to what’s expected under the F_{1,n-2} distribution. Again, this is the distribution of the test statistic under the null β1 = 0. The variance ratio will tend to be larger under the alternative, with its expected size growing with |β1|.
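
To make the algebra concrete, here is a small sketch (on simulated data of my own devising, not from the text) which computes the F statistic directly from the variance ratio and checks it against R's anova():

# Sketch: compute the F statistic "by hand" from the variance ratio and
# compare it to the value reported by anova().  Simulated data, for illustration.
set.seed(1)
n <- 50
x <- runif(n, -5, 5)
y <- 2 + 0*x + rnorm(n, sd=0.5)          # simulate under the null, beta_1 = 0
mdl <- lm(y ~ x)
sigma.sq.hat <- mean(residuals(mdl)^2)   # MLE of the noise variance (divide by n)
s.sq.y <- mean((y - mean(y))^2)          # MLE variance of Y itself
f.by.hand <- (s.sq.y/sigma.sq.hat - 1) * (n - 2)
f.from.anova <- anova(mdl)[1, "F value"]
c(by.hand = f.by.hand, from.anova = f.from.anova)   # the two should agree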

Running this F test in R The easiest way to run the F test for the regression
slope on a linear model in R is to invoke the anova function, like so:
anova(lm(y~x))

This will give you an analysis-of-variance table for the model. The actual
object the function returns is an anova object, which is a special type of data
frame. The columns record, respectively, degrees of freedom, sums of squares,
mean squares, the actual F statistic, and the p value of the F statistic. What
we’ll care about will be the first row of this table, which will give us the test
information for the slope on X.
To illustrate more concretely, let’s revisit our late friends in Chicago:
library(gamair); data(chicago)
death.temp.lm <- lm(death ~ tmpd, data=chicago)
anova(death.temp.lm)

## Analysis of Variance Table


##
## Response: death
## Df Sum Sq Mean Sq F value Pr(>F)
## tmpd 1 162473 162473 803.07 < 2.2e-16 ***
## Residuals 5112 1034236 202
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As with summary on lm, the stars are usually a distraction; see Lecture 8 for
how to turn them off.


# Simulate a Gaussian-noise simple linear regression model
# Inputs: x sequence; intercept; slope; noise variance; switch for whether to
#   return the simulated values, or the ratio s^2_Y/\hat{\sigma}^2
# Output: data frame, or the variance ratio
sim.gnslrm <- function(x, intercept, slope, sigma.sq, var.ratio=TRUE) {
  n <- length(x)
  y <- intercept + slope*x + rnorm(n, mean=0, sd=sqrt(sigma.sq))
  if (var.ratio) {
    mdl <- lm(y~x)
    hat.sigma.sq <- mean(residuals(mdl)^2)
    # R uses the n-1 denominator in var(), but we need the MLE
    s.sq.y <- var(y)*(n-1)/n
    return(s.sq.y/hat.sigma.sq)
  } else {
    return(data.frame(x=x, y=y))
  }
}

# Parameters
beta.0 <- 5
beta.1 <- 0 # We are simulating under the null!
sigma.sq <- 0.1
x.seq <- seq(from=-5, to=5, length.out=42)

Figure 1: Code setting up a simulation of a Gaussian-noise simple linear regression model, returning either the actual simulated data frame, or just the variance ratio s²_Y/σ̂².



# Run a bunch of simulations under the null and get all the F statistics
# Actual F statistic is in the 4th column of the output of anova()
f.stats <- replicate(1000, anova(lm(y~x, data= sim.gnslrm(x.seq, beta.0, beta.1,
sigma.sq, FALSE)))[1,4])
# Store histogram of the F statistics, but hold off on plotting it
null.hist <- hist(f.stats, breaks=50, plot=FALSE)
# Run a bunch of simulations under the alternative and get all the F statistics
alt.f <- replicate(1000, anova(lm(y~x, data=sim.gnslrm(x.seq, beta.0, -0.05,
sigma.sq, FALSE)))[1,4])
# Store a histogram of this, but again hold off on plotting
alt.hist <- hist(alt.f, breaks=50, plot=FALSE)
# Create an empty plot
plot(0, xlim=c(0,30), ylim=c(0,1.2), xlab="F", ylab="Density", type="n")
# Add the histogram of F under the alternative, then under the null
plot(alt.hist, freq=FALSE, add=TRUE, col="grey", border="grey")
plot(null.hist, freq=FALSE, add=TRUE)
# Finally, the theoretical F distribution
curve(df(x,1,length(x.seq)-2), add=TRUE, col="blue")

Figure 2: Comparing the actual distribution of F statistics when we simulate under the null model (black histogram) to the theoretical F_{1,n-2} distribution (blue curve), and to the distribution under the alternative β1 = −0.05.

# Take the simulated F statistics and convert to p-values


p.vals <- pf(f.stats, 1, length(x.seq)-2, lower.tail=FALSE)
alt.p <- pf(alt.f, 1, length(x.seq)-2, lower.tail=FALSE)
hist(alt.p, col="grey", freq=FALSE, xlab="p-value", main="", border="grey",
xlim=c(0,1))
plot(hist(p.vals, plot=FALSE), add=TRUE, freq=FALSE)

Figure 3: Distribution of p-values from repeated simulations, under the null hypothesis
(black) and the alternative (grey). Notice how the p-values under the null are uniformly
distributed, while under the alternative they are bunched up towards small values at
the left.


Assumptions   In deriving the F distribution, it is absolutely vital that all of the assumptions of the Gaussian-noise simple linear regression model hold: the true model must be linear, the noise around it must be Gaussian, the noise variance must be constant, and the noise must be independent of X and independent across measurements. The only hypothesis being tested is whether, maintaining all these assumptions, we must reject the flat model β1 = 0 in favor of a line at an angle. In particular, the test never doubts that the right model is a straight line.

The “general linear test”   As a preview of coming attractions, we can look at what happens when we compare a linear, Gaussian-noise model with p parameters to a restricted Gaussian-noise linear model with only q free parameters. Each model gives us an estimate of the noise variance, say σ̂²_A and σ̂²_0 (respectively); these are just the mean squared residuals in each model. It will not surprise you to learn that, under the null that the smaller, restricted model is true,
\[ \frac{n(\hat{\sigma}^2_0 - \hat{\sigma}^2_A)}{\sigma^2} \sim \chi^2_{p-q} \]
while
\[ \frac{n \hat{\sigma}^2_A}{\sigma^2} \sim \chi^2_{n-p} \]
The F statistic for testing the restriction of the full model to the sub-model is therefore
\[ \frac{\hat{\sigma}^2_0 - \hat{\sigma}^2_A}{\hat{\sigma}^2_A} \cdot \frac{n-p}{p-q} \]
and it has an F_{p-q,n-p} distribution.
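
In R, one convenient way to run this comparison of nested linear models is to hand both fits to anova(). A minimal sketch, with made-up data and predictors of my own invention:

# Sketch of the "general linear test": compare a restricted and a full
# nested linear model with anova().  Data simulated purely for illustration.
set.seed(2)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2*x1 + rnorm(n)            # only x1 actually matters here
restricted <- lm(y ~ x1)             # q = 2 coefficients
full <- lm(y ~ x1 + x2 + x3)         # p = 4 coefficients
# F test of the restriction, with p - q = 2 and n - p = 96 degrees of freedom
anova(restricted, full)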

ANOVA You will notice that I made no use of the ponderous machinery of
analysis of variance which the textbook wheels out in connection with the F
test. Despite (or because of) all its size and complexity, this is really just a
historical relic. In serious applied work from the modern (say, post-1985) era, I
have never seen any study where filling out an ANOVA table for a regression,
etc., was at all important.
There is more to be said for analysis of variance where the observations
are divided into discrete, categorical groups, and one wants to know about
differences between groups vs. variation within a group. In a few weeks, when
we see how to handle categorical predictor variables, it will turn out that this
useful form of ANOVA is actually a special case of linear regression.

1.2 The Likelihood Ratio Test


The F test is a special case of a much more general procedure, the likelihood
ratio test, which works as follows. We start with a general model, where the
parameter is a vector θ = (θ1, θ2, ..., θp). We contemplate a restriction, where θ = (θ1, θ2, ..., θq, 0, ..., 0), q < p. (See below on other possible restrictions.) The restricted sub-model is the null hypothesis, and the full model is the alternative.
Both the restricted model and the full model have maximum likelihood es-
timators; call these θ̂ and Θ̂, respectively. Let’s write L for the log-likelihood
function, so L(θ̂) is the maximized log-likelihood under the restricted null model,
and L(Θ̂) is the maximized log-likelihood under the unrestricted, alternative,
full model. Then
Λ ≡ L(Θ̂) − L(θ̂)
is the log of the likelihood ratio between the models (because log a/b = log a−
log b). Λ is the test statistic in the likelihood ratio test.¹
Under some “regularity” conditions, which I’ll sketch in a moment, there is
a simple asymptotic distribution for Λ under the null hypothesis. As n → ∞,
\[ 2\Lambda \sim \chi^2_{p-q} \]
Let me first try to give a little intuition, then hand-wave at the math, and then work things through for testing β1 = 0 vs. β1 ≠ 0.
The null model is, as I said, a restriction of the alternative model. Any
possible parameter value for the null model is also allowed for the alternative.
This means the parameter space for the null, say ω, is a strict subset of that for
the alternative, ω ⊂ Ω. The maximum of the likelihood over the larger space
must be at least as high as the maximum over the smaller space:
\[ L(\hat{\Theta}) = \max_{\theta \in \Omega} L(\theta) \geq \max_{\theta \in \omega} L(\theta) = L(\hat{\theta}) \]

Thus, Λ ≥ 0. What’s more surprising is that its distribution doesn’t change with n (asymptotically), and that it depends only on the difference in the number of free parameters. Because the MLE is consistent, under the null the estimates of θ_{q+1}, θ_{q+2}, ..., θ_p in Θ̂ all converge to zero, because those parameters are zero under the null. In fact, they get closer and closer to zero, but end up making larger and larger contributions to L, because L grows with n. The two effects cancel out, and each free parameter ends up contributing one χ²_1 term.
Why χ²? Well, for large n, θ̂ and Θ̂ both have Gaussian distributions around the true θ, and the contributions to the log-likelihood end up depending on the squares of the parameter estimates. Since the square of a Gaussian is proportional to a χ², it’s not surprising that we get something χ²-ish, though it is nice how everything cancels out. I defer a fuller explanation to the optional §5.
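
For nested linear models, a likelihood ratio test can be carried out directly in R with logLik() and the asymptotic χ² approximation. A minimal sketch on made-up data (the particular numbers are not from the text):

# Sketch of a likelihood ratio test for nested Gaussian linear models,
# using logLik() and the asymptotic chi-squared null distribution.
set.seed(3)
n <- 200
x <- runif(n, -5, 5)
y <- 5 + 0*x + rnorm(n, sd=1)           # simulate under the null (slope = 0)
null.model <- lm(y ~ 1)                  # restricted: intercept only
alt.model  <- lm(y ~ x)                  # unrestricted: intercept and slope
Lambda <- as.numeric(logLik(alt.model) - logLik(null.model))
df.diff <- length(coef(alt.model)) - length(coef(null.model))   # p - q = 1
pchisq(2*Lambda, df=df.diff, lower.tail=FALSE)                  # approximate p-value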

Sketch of the regularity conditions where the likelihood-ratio test has a χ² null   First, the MLE must be consistent for both models, and must have a Gaussian distribution around the true parameter (for large n). Second, the restricted model has to “lie in the interior” of the unrestricted, alternative model, and not on the boundary. That is, it must make sense in the alternative model for all the zeroed-out parameters to be either positive or negative. (This would be violated, for instance, if one of the parameters set to zero by the null were a variance.) And that’s mostly it. Again, see §5 for more mathematical details.

1 Some people, being a bit pedantic, call it the log-likelihood-ratio test.

Testing β1 = 0   What’s the log-likelihood at the MLE of the simple linear model? Dredging up the log-likelihood function from Lecture 6,
\[ L(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\hat{\sigma}^2 - \frac{1}{2\hat{\sigma}^2}\sum_{i=1}^n \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2 \tag{4} \]
But
\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2 \]
Substituting into Eq. 4,
\[ L(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\hat{\sigma}^2 - \frac{n\hat{\sigma}^2}{2\hat{\sigma}^2} = -\frac{n}{2}(1 + \log 2\pi) - \frac{n}{2}\log\hat{\sigma}^2 \]
So we get a constant which doesn’t depend on the parameters at all, and then something proportional to log σ̂².
The intercept-only model works similarly, only its estimate of the intercept is ȳ, and its noise variance, σ̂²_0, is just the sample variance of the y_i:
\[ L(\bar{y}, 0, s^2_Y) = -\frac{n}{2}(1 + \log 2\pi) - \frac{n}{2}\log s^2_Y \]
Putting these together,
\[ \Lambda = L(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2) - L(\bar{y}, 0, s^2_Y) = \frac{n}{2}\log\frac{s^2_Y}{\hat{\sigma}^2} \]
Thus, under the null hypothesis,
\[ 2\Lambda = n\log\frac{s^2_Y}{\hat{\sigma}^2} \sim \chi^2_1 \]
Figure 4 shows a simulation confirming this bit of theory.

Connection to F tests The ratio s2Y /σ̂ 2 is, of course, the F -statistic, up to
constants not depending on the data. Since, for this problem, the likelihood
ratio test and the F test use equivalent test statistics, if we fix the same size or
level α for the two tests, they will have exactly the same power. In fact, even
for more complicated linear models — the “general linear tests” of the textbook
— the F test is always equivalent to a likelihood ratio test, at least when the
presumptions of the former are met. The likelihood ratio test, however, applies
to problems which do not involve Gaussian-noise linear models, while the F test
is basically only good for them. If you can only remember one of the two tests,
remember the likelihood ratio test.
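
To see the equivalence concretely, note that both test statistics are increasing functions of the same variance ratio s²_Y/σ̂², so they order data sets identically and, with matched cutoffs, reject on exactly the same samples. A sketch with simulated data of my own:

# Sketch: the F statistic and 2*Lambda are both monotone in the variance
# ratio s^2_Y / sigma.hat^2, so the two tests agree once their rejection
# cutoffs are matched.  Simulated data, for illustration only.
set.seed(4)
n <- 42
one.ratio <- function() {
  x <- runif(n, -5, 5)
  y <- 5 + rnorm(n, sd=1)                # simulate under the null
  mdl <- lm(y ~ x)
  mean((y - mean(y))^2) / mean(residuals(mdl)^2)
}
ratios <- replicate(1000, one.ratio())
f.stats <- (ratios - 1) * (n - 2)        # F statistics
two.Lambda <- n * log(ratios)            # likelihood-ratio statistics
# Same ordering of the simulated data sets:
all(order(f.stats) == order(two.Lambda))
# Matched 5% cutoffs reject on exactly the same samples:
f.cut <- qf(0.95, 1, n-2)
all((f.stats > f.cut) == (two.Lambda > n*log(1 + f.cut/(n-2))))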



# Simulate from the model 1000 times, capturing the variance ratios
var.ratios <- replicate(1000, sim.gnslrm(x.seq,beta.0,beta.1,sigma.sq))
# Convert variance ratios into log likelihood-ratios
LLRs <- (length(x.seq)/2)*log(var.ratios)
# Create a histogram of 2*LLR
hist(2*LLRs, breaks=50, freq=FALSE, xlab=expression(2*Lambda), main="")
# Add the theoretical chi^2_1 distribution
curve(dchisq(x,df=1), col="blue", add=TRUE)

Figure 4: Comparison of log-likelihood ratios (black histogram) with theoretical χ²_1 distribution (blue). Note we are simulating under the null hypothesis β1 = 0. Can you add a histogram of the distribution under the alternative, and make histograms of p-values, as in Figures 2 and 3?


Other constraints   Setting p − q parameters to zero is really a special case of imposing p − q linearly independent constraints on the p parameters. For
instance, requiring θ2 = θ1 while θ3 = −2θ1 is just as much a two-parameter
restriction as fixing θ2 = θ3 = 0. This is because we could transform to a new set
of parameters, say ψ1 = θ1 , ψ2 = θ2 − θ1 , ψ3 = θ3 + 2θ1 , where the restrictions
are ψ2 = ψ3 = 0, and we can transform back to the θ parameters without loss
of information. So the theory of the likelihood ratio test applies whenever we
have linearly independent constraints.
More generally, that theory applies under the following (admittedly rather
complicated) conditions:

• Under the null model, θ must obey the equations f_1(θ) = 0, f_2(θ) = 0, ..., f_{p−q}(θ) = 0.

• Any θ which obeys those equations is in the null model.

• There is an invertible function g where, writing ψ = g(θ), in the null model ψ always has ψ_{q+1}, ..., ψ_p = 0, and under the alternative, ψ is unrestricted.
Basically, we need to be able to come up with a change of coordinates where the
restrictions amount to fixing some coordinates to zero, but leaving the others
alone.
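
As an illustrative sketch (with invented data), here is how one such linear constraint, β2 = β1 in a two-predictor regression, can be tested in R by re-parametrizing so that the constraint becomes “one coefficient is zero”, and then comparing the nested fits with anova():

# Sketch: testing the linear constraint beta_2 = beta_1 by re-parametrizing.
# Write beta_1*x1 + beta_2*x2 = beta_1*(x1 + x2) + (beta_2 - beta_1)*x2, so the
# constraint is just "the coefficient on x2 is zero".  Simulated data.
set.seed(5)
n <- 150
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2*x1 + 2*x2 + rnorm(n)           # the constraint holds in the truth
unrestricted <- lm(y ~ I(x1 + x2) + x2)    # same fit as lm(y ~ x1 + x2)
restricted   <- lm(y ~ I(x1 + x2))         # imposes beta_2 = beta_1
anova(restricted, unrestricted)            # F test of the one-constraint null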

2 What the F Test Really Tests


The textbook (§2.7–2.8) goes into great detail about an F test for whether
the simple linear regression model “explains” (really, predicts) a “significant”
amount of the variance in the response. What this really does is compare two
versions of the simple linear regression model. The null hypothesis is that all
of the assumptions of that model hold, and the slope, β1 , is exactly 0. (This is
sometimes called the “intercept-only” model, for obvious reasons.) The alter-
native is that all of the simple linear regression assumptions hold,² with β1 ≠ 0.
The alternative, non-zero-slope model will always fit the data better than the
null, intercept-only model (why?); the F test asks if the improvement in fit is
larger than we’d expect under the null.³
There are situations where it is useful to know about this precise quantity,
and so run an F test on the regression. It is hardly ever, however, a good
way to check whether the simple linear regression model is correctly specified,
because neither retaining nor rejecting the null gives us information about what
we really want to know.
2 To get an exact F distribution for the test statistic, we also need the Gaussian-noise assumptions, but under the weaker assumptions of uncorrelated noise, we’ll often approach an F distribution asymptotically.
3 This is also what the likelihood ratio test of §1.2 is asking, just with a different notion of measuring fit to the data (likelihood vs. squared error). Everything I’m about to say about F tests applies, suitably modified, to likelihood ratio tests.


Suppose first that we retain the null hypothesis, i.e., we do not find any sig-
nificant share of variance associated with the regression. This could be because
(i) there is no such variance — the intercept-only model is right; (ii) there is
some variance, but we were unlucky; (iii) the test doesn’t have enough power
to detect departures from the null. To expand on that last point, the power
to detect a non-zero slope is going to increase with the sample size n, decrease
with the noise level σ 2 , and increase with the magnitude of the slope |β1 |. As
σ 2 /n → 0, the test’s power to detect any departures from the null → 1. If we
have a very powerful test, then we can reliably detect departures from the null.
If we don’t find them, then, we can be pretty sure they’re not there. If we have
a low-power test, not detecting departures from the null tells us little.⁴ If σ² is
too big or n is too small, our test is inevitably low-powered. Without knowing
the power, retaining the null is ambiguous between “there’s no signal here” and
“we can’t tell if there’s a signal or not”. It would be more useful to look at
things like a confidence interval for the regression slope, or, if you must, for σ 2 .
Of course, there is also possibility (iv), that the real relationship is nonlinear,
but the best linear approximation to it has slope (nearly) zero, in which case
the F test will have no power to detect the nonlinearity.
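
For instance, a sketch of the more informative alternative, reusing the death.temp.lm fit from above: the slope’s confidence interval reports both the sign and the precision of the estimate, which a bare retain/reject decision does not.

# Sketch: a confidence interval for the slope is more informative than the
# bare F-test decision; this reuses death.temp.lm from earlier in the lecture.
confint(death.temp.lm, parm="tmpd", level=0.95)
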
Suppose instead that we reject the null, intercept-only hypothesis. This does
not mean that the simple linear model is right. It means that the latter model
predicts better than the intercept-only model — too much better to be due to
chance. The simple linear regression model can be absolute garbage, with every
single one of its assumptions flagrantly violated, and yet better than the model
which makes all those assumptions and thinks the optimal slope is zero.
Figure 5 provides simulation code for a simple set-up where the true regression function is nonlinear and the noise around it has non-constant variance. (Indeed, the regression curve is non-monotonic and the noise is multiplicative, not additive.) Still, because a tilted straight line is a much better fit than a flat line, the F test delivers incredibly small p-values — the largest, when I simulate drawing 200 points from the model, is around 10^{-35}, which is about the probability of drawing any particular molecule from 3 billion liters of water. This is the math’s way of looking at data like Figure 6 and saying “If you want to run a flat line through this, instead of one with a slope, you’re crazy”.⁵ This is, of course, true; it’s just not an answer to “Is the simple linear model right here?”

2.1 The Essential Thing to Remember


Neither the F test of β1 = 0 vs. β1 ≠ 0, nor the likelihood ratio test, nor the Wald/t test of the same hypothesis tells us anything about the correctness of the simple linear regression model. All these tests presume the simple linear regression model with Gaussian noise is true, and check a special case (flat line) against the general one (tilted line). They do not test linearity, constant variance, lack of correlation, or Gaussianity.

4 Refer back to the discussion of hypothesis testing in Lecture 8.
5 Similarly, when in §1.1 we’re told the p-value is ≤ 2.2 × 10^{-16}, that doesn’t mean that there’s overwhelming evidence for the simple linear model; it again means that it’d be really stupid to prefer a flat line to a tilted one.


# Simulate from a non-linear, non-constant-variance model
# Curve: Y = (X-1)^2 * U
# U ~ Unif(0.8, 1.2)
# X ~ Exp(0.5)
# Inputs: number of data points; whether to return data frame or F test of
#   a simple linear model
sim.non.slr <- function(n, do.test=FALSE) {
  x <- rexp(n, rate=0.5)
  y <- (x-1)^2 * runif(n, min=0.8, max=1.2)
  if (! do.test) {
    return(data.frame(x=x, y=y))
  } else {
    # Fit a linear model, run F test, return p-value
    return(anova(lm(y~x))[["Pr(>F)"]][1])
  }
}

Figure 5: Code to simulate a non-linear model with non-constant variance (in fact,
multiplicative rather than additive noise).




not.slr <- sim.non.slr(n=200)


plot(y~x, data=not.slr)
curve((x-1)^2, col="blue", add=TRUE)
abline(lm(y~x,data=not.slr), col="red")

Figure 6: 200 points drawn from the non-linear, heteroskedastic model defined in
Figure 5 (black dots); the true regression curve (blue curve); the least-squares estimate
of the simple linear regression (red line). Anyone who’s read Lecture 6 and looks at
this can realize the linear model is badly wrong here.



f.tests <- replicate(1e4, sim.non.slr(n=200, do.test=TRUE))


hist(log10(f.tests),breaks=30,freq=FALSE, main="",
xlab="log (base ten) of p value")

Figure 7: Distribution of p values from the F test for the simple linear regression model when the data come from the non-linear, heteroskedastic model of Figure 5, with sample size of n = 200. The p-values are all so small that rather than plotting them, I plot their logs in base 10, so the distribution is centered around a p-value of 10^{-60}, and the largest, least-significant p-values produced in ten thousand simulations were around 10^{-35}.


3 R^2

R^2 has several definitions which are equivalent when we estimate a linear model by least squares. The most basic one is the ratio of the sample variance of the fitted values to the sample variance of Y:
\[ R^2 \equiv \frac{s^2_{\hat{m}}}{s^2_Y} \tag{5} \]
Alternatively, it’s the ratio of the sample covariance between Y and the fitted values to the sample variance of Y:
\[ R^2 = \frac{c_{Y,\hat{m}}}{s^2_Y} \tag{6} \]
Let’s show that these are equal. Clearly, it’s enough to show that the sample variance of m̂ equals its covariance with Y. The key observations are that (i) each y_i = m̂(x_i) + e_i, while (ii) the sample covariance between the e_i and the m̂(x_i) is exactly zero. Thus
\[ c_{Y,\hat{m}} = c_{\hat{m}+e,\hat{m}} = s^2_{\hat{m}} + c_{e,\hat{m}} = s^2_{\hat{m}} \]
and we see that, for linear models estimated by least squares, Eqs. 5 and 6 do in fact always give the same result.
That said, what is s²_m̂? Since m̂(x_i) = β̂0 + β̂1 x_i,
\[ s^2_{\hat{m}} = s^2_{\hat{\beta}_0 + \hat{\beta}_1 X} = s^2_{\hat{\beta}_1 X} = \hat{\beta}_1^2 s^2_X \]
We thus get a third expression for R^2:
\[ R^2 = \hat{\beta}_1^2 \frac{s^2_X}{s^2_Y} \tag{7} \]
From this, we can derive yet a fourth expression:
\[ R^2 = \left( \frac{c_{XY}}{s_X s_Y} \right)^2 \tag{8} \]
which we can recognize as the squared correlation coefficient between X and Y (hence the square in R^2). A noteworthy feature of this equation is that it shows we get exactly the same R^2 whether we regress Y on X, or regress X on Y.
A final expression for R^2 is
\[ R^2 = \frac{s^2_Y - \hat{\sigma}^2}{s^2_Y} \tag{9} \]
Since σ̂² is the sample variance of the residuals, and the residuals are uncorrelated (in sample) with m̂, it’s not hard to show that the numerator is equal to s²_m̂.
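
A quick numerical check of these equivalences, as a sketch on simulated data (summary(mdl)$r.squared is R’s own computation of R^2):

# Sketch: check numerically that the several definitions of R^2 agree
# for a least-squares linear fit.  Data simulated for illustration.
set.seed(6)
n <- 100
x <- rnorm(n)
y <- 3 + 1.5*x + rnorm(n)
mdl <- lm(y ~ x)
m.hat <- fitted(mdl)
# var() and cov() use the n-1 denominator, but that cancels in every ratio
r2 <- c(var(m.hat)/var(y),                           # Eq. 5
        cov(y, m.hat)/var(y),                        # Eq. 6
        unname(coef(mdl)[2])^2 * var(x)/var(y),      # Eq. 7
        cor(x, y)^2,                                 # Eq. 8
        1 - mean(residuals(mdl)^2)/(var(y)*(n-1)/n), # Eq. 9, with MLE variances
        summary(mdl)$r.squared)                      # R's own value
round(r2, 10)                                        # all six should coincide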


“Adjusted” R^2   As you remember, σ̂² has a slight negative bias as an estimate of σ². One therefore sometimes sees an “adjusted” R^2, using (n/(n−2))σ̂² in place of σ̂², that being an unbiased estimate of σ².

Limits for R^2   From Eq. 7, it is clear that R^2 will be 0 when β̂1 = 0. On the other hand, if all the residuals are zero, then s²_Y = β̂1² s²_X and R^2 = 1. It is not too hard to show that R^2 can’t possibly be bigger than 1, so we have marked out the limits: a sample slope of 0 gives an R^2 of 0, the lowest possible, and all the data points falling exactly on a straight line gives an R^2 of 1, the largest possible.

3.1 Theoretical R^2

Suppose we knew the true coefficients. What would R^2 be? Using Eq. 5, we’d see
\begin{align}
R^2 &= \frac{\mathrm{Var}[m(X)]}{\mathrm{Var}[Y]} \tag{10}\\
&= \frac{\mathrm{Var}[\beta_0 + \beta_1 X]}{\mathrm{Var}[\beta_0 + \beta_1 X + \epsilon]} \tag{11}\\
&= \frac{\mathrm{Var}[\beta_1 X]}{\mathrm{Var}[\beta_1 X + \epsilon]} \tag{12}\\
&= \frac{\beta_1^2 \mathrm{Var}[X]}{\beta_1^2 \mathrm{Var}[X] + \sigma^2} \tag{13}
\end{align}
Since all our parameter estimates are consistent, and this formula is continuous in all the parameters, the R^2 we get from our estimates will converge on this limit.
As you will recall from Lecture 1, even if the linear model is totally wrong, our estimate of β1 will converge on Cov[X,Y]/Var[X]. Thus, whether or not the simple linear model applies, the limiting theoretical R^2 is given by Eq. 13, provided we interpret β1 appropriately.
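
To see Eq. 13 in action, here is a sketch with simulated data: the same, perfectly correct Gaussian-noise linear model is fit with X spread over narrower and wider ranges, and only the spread of X changes R^2.

# Sketch: a correct Gaussian-noise linear model gives very different R^2
# values depending only on Var[X].  Simulated data, for illustration.
set.seed(7)
r2.for.xsd <- function(sd.x, n=1000, beta1=1, sigma=1) {
  x <- rnorm(n, sd=sd.x)
  y <- 2 + beta1*x + rnorm(n, sd=sigma)
  summary(lm(y ~ x))$r.squared
}
sapply(c(0.1, 1, 10), r2.for.xsd)            # roughly 0.01, 0.5, 0.99
# Compare to the theoretical limit of Eq. 13, beta1^2 Var[X]/(beta1^2 Var[X] + sigma^2):
sapply(c(0.1, 1, 10)^2, function(v) v/(v + 1))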

3.2 Distraction or Nuisance?

Unfortunately, a lot of myths about R^2 have become endemic in the scientific community, and it is vital at this point to immunize you against them. [Marginal note: Greetings, Redditors! May I trouble you to read to the end before commenting? (To Audiendi: I don’t know who you are; I won’t try to find out; you wouldn’t be in trouble if I did.)]

1. The most fundamental is that R^2 does not measure goodness of fit.

(a) R^2 can be arbitrarily low when the model is completely correct. Look at Eq. 13. By making Var[X] small, or σ² large, we drive R^2 towards 0, even when every assumption of the simple linear regression model is correct in every particular.

(b) R^2 can be arbitrarily close to 1 when the model is totally wrong. For a demonstration, the R^2 of the linear model fitted to the simulation in §2 is 0.745 (there is a sketch reproducing this just after this list). There is, indeed, no limit to how high R^2 can get when the true model is nonlinear. All that’s needed is for the slope of the best linear approximation to be non-zero, and for Var[X] to get big.
2. R2 is also pretty useless as a measure of predictability.
(a) R2 says nothing about prediction error. Go back to Eq. 13, the ideal
case: even with σ 2 exactly the same, and no change in the coefficients,
R2 can be anywhere between 0 and 1 just by changing the range of X.
Mean squared error is a much better measure of how good predictions
are; better yet are estimates of out-of-sample error which we’ll cover
later in the course.
(b) R2 says nothing about interval forecasts. In particular, it gives us no
idea how big prediction intervals, or confidence intervals for m(x),
might be.
3. R2 cannot be compared across data sets: again, look at Eq. 13, and see
that exactly the same model can have radically different R2 values on
different data.

4. R2 cannot be compared between a model with untransformed Y and one


with transformed Y , or between different transformations of Y . More
exactly: it’s a free country and no one will stop you from doing that, but
it’s meaningless; R2 can easily go down when the model assumptions are
better fulfilled, etc.

5. The one situation where R2 can be compared is when different models are
fit to the same data set with the same, untransformed response variable.
Then increasing R2 is the same as decreasing in-sample MSE (by Eq. 9).
In that case, however, you might as well just compare the MSEs.
6. It is very common to say that R2 is “the fraction of variance explained” by
the regression. This goes along with calling R2 “the coefficient of determi-
nation”. These usages arise from Eq. 9, and have nothing to recommend
them. Eq. 8 shows that if we regressed X on Y , we’d get exactly the same
R2 . This in itself should be enough to show that a high R2 says nothing
about explaining one variable by another. It is also extremely easy to
devise situations where R^2 is high even though neither one could possibly explain the other.⁶ Unless you want to re-define the verb “to explain” in
6 Imagine, for example, regressing deaths in Chicago on the number of Chicagoans wearing

t-shirts each day. For that matter, imagine regressing the number of Chicagoans wearing
t-shirts on the number of deaths. For thousands of examples with even less to recommend
them as explanations, see http://www.tylervigen.com/spurious-correlations.


terms of R^2, there is no connection between it and anything which might be called a scientific explanation.⁷

Using adjusted R2 instead of R2 does absolutely nothing to fix any of this.
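
As a sketch of point 1(b), reusing the sim.non.slr() simulator from Figure 5 (the exact value changes with the random seed; the run reported above gave 0.745):

# Sketch for myth 1(b): a high R^2 from a badly mis-specified model,
# reusing sim.non.slr() from Figure 5.  The value wobbles from run to run.
set.seed(8)
wrong <- sim.non.slr(n=200)
summary(lm(y ~ x, data=wrong))$r.squared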


At this point, you might be wondering just what R2 is good for — what
job it does that isn’t better done by other tools. The only honest answer I
can give you is that I have never found a situation where it helped at all. If I
could design the regression curriculum from scratch, I would never mention it.
Unfortunately, it lives on as a historical relic, so you need to know what it is,
and what mis-understandings about it people suffer from.

4 The Correlation Coefficient


As you know, the correlation coefficient between X and Y is
\[ \rho_{XY} = \frac{\mathrm{Cov}[X,Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} \]
which lies between −1 and 1. It takes its extreme values when Y is a linear function of X.
Recall, from Lecture 1, that the slope of the ideal linear predictor β1 is
\[ \frac{\mathrm{Cov}[X,Y]}{\mathrm{Var}[X]} \]
so
\[ \rho_{XY} = \beta_1 \sqrt{\frac{\mathrm{Var}[X]}{\mathrm{Var}[Y]}} \]
It’s also straightforward to show (Exercise 1) that if we regress Y/√Var[Y], the standardized version of Y, on X/√Var[X], the standardized version of X, the regression coefficient we’d get would be ρXY.
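
A quick check of that last claim with simulated data, using sample standardizations (subtract the mean, divide by the standard deviation), so the slope matches the sample correlation:

# Sketch: regressing standardized Y on standardized X gives a slope equal to
# the (sample) correlation coefficient.  Simulated data, for illustration.
set.seed(9)
x <- rnorm(100); y <- 2 - 3*x + rnorm(100)
x.std <- (x - mean(x))/sd(x)            # sample-standardized X
y.std <- (y - mean(y))/sd(y)            # sample-standardized Y
c(slope = unname(coef(lm(y.std ~ x.std))[2]), correlation = cor(x, y))   # should match
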
In 1954, the great statistician John W. Tukey wrote (Tukey, 1954, p. 721)
Does anyone know when the correlation coefficient is useful, as
opposed to when it is used? If so, why not tell us?
Sixty years later, the world is still waiting for a good answer.⁸

7 Some people (e.g., Weisburd and Piquero 2008; Low-Décarie et al. 2014) have attempted to gather all the values of R^2 reported in scientific papers on, say, ecology or crime, to see if ecologists or criminologists have gotten better at explaining the phenomena they study. I hope it’s clear why these exercises are pointless.
8 To be scrupulously fair, Tukey did admit there was one clear case where correlation coefficients were useful; they are, as we have just seen, basically the slopes in simple linear regressions. But even so, as soon as we have multiple predictors (as we will in two weeks), regression will no longer match up with correlation. Also, covariances are useful, but why turn a covariance into a correlation?


5 More on the Likelihood Ratio Test


This section is optional, but strongly recommended.

We’re assuming that the true parameter value, call it θ, lies in the restricted
class of models ω. So there are q components to θ which matter, and the
other p − q are fixed by the constraints defining ω. To simplify the book-
keeping, let’s say those constraints are all that the extra parameters are zero,
so θ = (θ1 , θ2 , . . . θq , 0, . . . 0), with p − q zeroes at the end.
The restricted MLE θ̂ obeys the constraints, so
\[ \hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_q, 0, \ldots, 0) \tag{14} \]
The unrestricted MLE does not have to obey the constraints, so it’s
\[ \hat{\Theta} = (\hat{\Theta}_1, \hat{\Theta}_2, \ldots, \hat{\Theta}_q, \hat{\Theta}_{q+1}, \ldots, \hat{\Theta}_p) \tag{15} \]
Because both MLEs are consistent, we know that θ̂_i → θ_i and Θ̂_i → θ_i if 1 ≤ i ≤ q, and that Θ̂_i → 0 if q+1 ≤ i ≤ p.
Very roughly speaking, it’s the last extra terms which end up making L(Θ̂) larger than L(θ̂). Each of them tends towards a mean-zero Gaussian whose variance is O(1/n), but their impact on the log-likelihood depends on the square of their sizes, and the square of a mean-zero Gaussian has a χ² distribution with one degree of freedom. A whole bunch of factors cancel out, leaving us with a sum of p − q independent χ²_1 variables, which has a χ²_{p−q} distribution.
In slightly more detail, we know that L(Θ̂) ≥ L(θ̂), because the former is a maximum over a larger space than the latter. Let’s try to see how big the difference is by doing a Taylor expansion around Θ̂, which we’ll take out to second order.
\begin{align}
L(\hat{\theta}) &\approx L(\hat{\Theta}) + \sum_{i=1}^p (\hat{\Theta}_i - \hat{\theta}_i) \left.\frac{\partial L}{\partial \theta_i}\right|_{\hat{\Theta}} + \frac{1}{2} \sum_{i=1}^p \sum_{j=1}^p (\hat{\Theta}_i - \hat{\theta}_i) \left.\frac{\partial^2 L}{\partial \theta_i \partial \theta_j}\right|_{\hat{\Theta}} (\hat{\Theta}_j - \hat{\theta}_j) \notag \\
&= L(\hat{\Theta}) + \frac{1}{2} \sum_{i=1}^p \sum_{j=1}^p (\hat{\Theta}_i - \hat{\theta}_i) \left.\frac{\partial^2 L}{\partial \theta_i \partial \theta_j}\right|_{\hat{\Theta}} (\hat{\Theta}_j - \hat{\theta}_j) \tag{16}
\end{align}
All the first-order terms go away, because Θ̂ is a maximum of the likelihood, and so the first derivatives are all zero there. Now we’re left with the second-order terms. Writing all the partials out repeatedly gets tiresome, so abbreviate ∂²L/∂θ_i∂θ_j as L_{,ij}.
To simplify the book-keeping, suppose that the second-derivative matrix, or Hessian, is diagonal. (This should seem like a swindle, but we get the same conclusion without this supposition, only we need to use a lot more algebra — we diagonalize the Hessian by an orthogonal transformation.) That is, suppose L_{,ij} = 0 unless i = j. Now we can write
\begin{align}
L(\hat{\Theta}) - L(\hat{\theta}) &\approx -\frac{1}{2} \sum_{i=1}^p (\hat{\Theta}_i - \hat{\theta}_i)^2 L_{,ii} \tag{17}\\
2\left[ L(\hat{\Theta}) - L(\hat{\theta}) \right] &\approx -\sum_{i=1}^q (\hat{\Theta}_i - \hat{\theta}_i)^2 L_{,ii} - \sum_{i=q+1}^p \hat{\Theta}_i^2 L_{,ii} \tag{18}
\end{align}

At this point, we need a fact about the asymptotic distribution of maximum likelihood estimates: they’re generally Gaussian, centered around the true value, and with a shrinking variance that depends on the Hessian evaluated at the true parameter value; this is called the Fisher information, F or I. (Call it F.) If the Hessian is diagonal, then we can say that
\begin{align}
\hat{\Theta}_i &\rightsquigarrow N(\theta_i, 1/(nF_{ii})) \tag{19}\\
\hat{\theta}_i &\rightsquigarrow N(\theta_i, 1/(nF_{ii})), \quad 1 \le i \le q \tag{20}\\
\hat{\theta}_i &= 0, \quad q+1 \le i \le p \tag{21}
\end{align}
Also, (1/n) L_{,ii} → −F_{ii}.


Putting all this together, we see that each term in the second summation in Eq. 18 is (to abuse notation a little)
\[ -\frac{1}{n F_{ii}} \left( N(0,1) \right)^2 L_{,ii} \to \chi^2_1 \tag{22} \]
so the whole second summation has a χ²_{p−q} distribution. The first summation, meanwhile, goes to zero because Θ̂_i and θ̂_i are actually strongly correlated, so their difference is O(1/n), and their difference squared is O(1/n²). Since L_{,ii} is only O(n), that summation drops out.
A somewhat less hand-wavy version of the argument uses the fact that the MLE is really a vector, with a multivariate normal distribution which depends on the inverse of the Fisher information matrix:
\[ \hat{\Theta} \rightsquigarrow N\!\left(\theta, \tfrac{1}{n} F^{-1}\right) \tag{23} \]
Then, at the cost of more linear algebra, we don’t have to assume that the Hessian is diagonal.
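
As a small simulation sketch (with made-up data) checking the χ²_{p−q} claim for more than one constraint: fit an intercept-only null against a model with two irrelevant predictors, so p − q = 2, and compare the simulated 2Λ values to χ²_2 quantiles.

# Sketch: checking that 2*Lambda is approximately chi^2_{p-q} when the null
# imposes two constraints (both extra slopes are truly zero).  Simulated data.
set.seed(10)
one.2Lambda <- function(n=100) {
  x1 <- rnorm(n); x2 <- rnorm(n)
  y <- 7 + rnorm(n)                      # the null, intercept-only model is true
  2 * as.numeric(logLik(lm(y ~ x1 + x2)) - logLik(lm(y ~ 1)))
}
two.Lambdas <- replicate(2000, one.2Lambda())
probs <- c(0.5, 0.9, 0.95, 0.99)
# Simulated quantiles should be close to the chi^2_2 quantiles
rbind(simulated = quantile(two.Lambdas, probs), chi.squared = qchisq(probs, df=2))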

6 Concluding Comment
The tone I have taken when discussing F tests, R2 and correlation has been
dismissive. This is deliberate, because they are grossly abused and over-used in
current practice, especially by non-statisticians, and I want you to be too proud
(or too ashamed) to engage in those abuses. In a better world, we’d just skip
over them, but you will have to deal with colleagues, and bosses, who learned

their statistics in the bad old days, and so have to understand what they’re
doing wrong. (“Science advances funeral by funeral”.)
In all fairness, the people who came up with these tools were great scientists,
struggling with very hard problems when nothing was clear; they were inventing
all the tools and concepts we take for granted in a class like this. Anyone in this
class, me included, would be doing very well to come up with one idea over the
whole of our careers which is as good as R2 . But we best respect our ancestors,
and the tradition they left us, when we improve that tradition where we can.
Sometimes that means throwing out the broken bits.

7 Further Reading
Refer back to lecture 7 on diagnostics for ways of actually checking whether the
relationship between Y and X is linear (along with the other assumptions of the
model). We will come back to the topic of conducting formal tests of linearity,
or other parametric regression specifications, in 402.
Refer back to lecture 8, on parametric inference, for advice on when it is
actually interesting to test the hypothesis β1 = 0.
Full mathematical treatments of likelihood ratio tests can be found in many
textbooks, e.g., Schervish (1995) or Gouriéroux and Monfort (1989/1995, vol.
II). The original proof that it has a χ2p−q asymptotic distribution was given
by Wilks (1938). Vuong (1989) provides an interesting and valuable treatment
of what happens to the likelihood ratio test when neither the null nor the al-
ternative is strictly true, but we want to pick the one which is closer to the
truth; that paper also develops the theory when the null is not a restriction
of the alternative, but rather the two hypotheses come from distinct statistical
models.
People have been warning about the fallacy of R2 to measure goodness of
fit for a long time (Anderson and Shanteau, 1977; Birnbaum, 1973), apparently
without having much of an impact. (See Hamilton (1996) for a discussion of how
academic communities can keep on teaching erroneous ideas long after they’ve
been shown to be wrong, and some speculations about why this happens.)
That R2 has got nothing to do with explaining anything has also been pointed
out, time after time, for decades (Berk, 2004). A small demo of just how silly
“variance explained” can get, using the Chicago data, can be found at http://bactra.org/weblog/874.html. Just what it means to give a proper
scientific explanation, and what role statistical models might play in doing so,
is a topic full of debate, not to say confusion. Shmueli (2010) attempts to
relate some of these debates to the practical conduct of statistical modeling.
Personally, I have found Salmon (1984) very helpful in thinking about these
issues.


Exercises
To think through or to practice on, not to hand in.
1. Define Ỹ = Y/√Var[Y] and X̃ = X/√Var[X]. Show that the slope of the optimal linear predictor of Ỹ from X̃ is ρ_XY.
2. Work through the likelihood ratio test for testing regression through the
origin (β0 = 0) against the simple linear model (β0 ≠ 0); that is, write Λ
in terms of the sample statistics and simplify as much as possible. Under
the null hypothesis, 2Λ follows a χ2 distribution with a certain number of
degrees of freedom: how many?

References
Anderson, Norman H. and James Shanteau (1977). “Weak inference with
linear models.” Psychological Bulletin, 84: 1155–1170. doi:10.1037/0033-
2909.84.6.1155.
Berk, Richard A. (2004). Regression Analysis: A Constructive Critique. Thou-
sand Oaks, California: Sage.
Birnbaum, Michael H. (1973). “The Devil Rides Again: Correlation as an Index
of Fit.” Psychological Bulletin, 79: 239–242. doi:10.1037/h0033853.
Gouriéroux, Christian and Alain Monfort (1989/1995). Statistics and Econo-
metric Models. Themes in Modern Econometrics. Cambridge, England: Cam-
bridge University Press. Translated by Quang Vuong from Statistique et
modèles économétriques, Paris: Économica.
Hamilton, Richard F. (1996). The Social Misconstruction of Reality: Validity
and Verification in the Scholarly Community. New Haven: Yale University
Press.
Low-Décarie, Etienne, Corey Chivers and Monica Granados (2014). “Rising
complexity and falling explanatory power in ecology.” Frontiers in Ecology
and the Environment, 12: 412–418. doi:10.1890/130230.
Salmon, Wesley C. (1984). Scientific Explanation and the Causal Structure of
the World . Princeton: Princeton University Press.
Schervish, Mark J. (1995). Theory of Statistics. Berlin: Springer-Verlag.
Shmueli, Galit (2010). “To Explain or to Predict?” Statistical Science, 25:
289–310. doi:10.1214/10-STS330.

Tukey, John W. (1954). “Unsolved Problems of Experimental Statistics.” Journal of the American Statistical Association, 49: 706–731. URL http://www.jstor.org/pss/2281535.


Vuong, Quang H. (1989). “Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses.” Econometrica, 57: 307–333. URL http://www.jstor.org/pss/1912557.

Weisburd, David and Alex R. Piquero (2008). “How Well Do Criminologists Ex-
plain Crime? Statistical Modeling in Published Studies.” Crime and Justice,
37: 453–502. doi:10.1086/524284.
Wilks, S. S. (1938). “The Large Sample Distribution of the Likelihood Ra-
tio for Testing Composite Hypotheses.” Annals of Mathematical Statistics,
9: 60–62. URL http://projecteuclid.org/euclid.aoms/1177732360.
doi:10.1214/aoms/1177732360.
