Contents

1 The F Test
  1.1 F test of β1 = 0 vs. β1 ≠ 0
  1.2 The Likelihood Ratio Test

3 R2
  3.1 Theoretical R2
  3.2 Distraction or Nuisance?

6 Concluding Comment

7 Further Reading
1 The F Test
The F distribution with a, b degrees of freedom is defined to be the distribution
of the ratio
\[
\frac{\chi^2_a / a}{\chi^2_b / b}
\]
when $\chi^2_a$ and $\chi^2_b$ are independent.
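As a sanity check on this definition, we can build F-distributed draws directly from Gaussians. This sketch is in Python rather than the R used below, purely so that it is self-contained; the degrees of freedom, seed, and tolerance are arbitrary choices.

```python
import random

def f_sample(a, b, rng):
    # A chi^2_k variable is a sum of k independent squared standard Gaussians
    chi2_a = sum(rng.gauss(0, 1) ** 2 for _ in range(a))
    chi2_b = sum(rng.gauss(0, 1) ** 2 for _ in range(b))
    # An F(a, b) variable is the ratio of the two, each divided by its df
    return (chi2_a / a) / (chi2_b / b)

rng = random.Random(36401)  # fixed seed; any seed would do
a, b = 5, 10
draws = [f_sample(a, b, rng) for _ in range(100000)]
mean = sum(draws) / len(draws)
# For b > 2, the F(a, b) distribution has mean b/(b-2); here 10/8 = 1.25
assert abs(mean - b / (b - 2)) < 0.02
```

The simulated mean matches the theoretical mean of the F distribution, which is one quick way to convince yourself the construction is right.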
Since χ2 distributions arise from sums of Gaussians, F -distributed random
variables tend to arise when we are dealing with ratios of sums of Gaussians.
The outstanding examples of this are ratios of variances.
1.1 F test of β1 = 0 vs. β1 ≠ 0
with
\[
\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}
\]
This is true no matter what β1 is, so, in particular, it continues to hold when
β1 = 0 but we estimate the general model anyway.
The null model is that
\[
Y = \beta_0 + \epsilon
\]
with ε ∼ N(0, σ²), independent of X and independently across measurements.
It's an exercise from 36-226 to show (really, remind!) yourself that, in the null
model,
\[
\hat{\beta}_0 = \bar{y} \sim N(\beta_0, \sigma^2/n)
\]
It is another exercise to show that the null model's variance estimate is
\[
\hat{\sigma}^2_0 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 = s^2_Y
\]
and that
\[
\frac{n s^2_Y}{\sigma^2} \sim \chi^2_{n-1}
\]
However, s²Y is not independent of σ̂². What is statistically independent of
σ̂² is the difference s²Y − σ̂², and
\[
\frac{n(s^2_Y - \hat{\sigma}^2)}{\sigma^2} \sim \chi^2_1
\]
I will not pretend to give a proper demonstration of this. Rather, to make it
plausible, I’ll note that s2Y − σ̂ 2 is the extra mean squared error which comes
from estimating only one coefficient rather than two, that each coefficient kills
one degree of freedom in the data, and the total squared error associated with
one degree of freedom, over the entire data set, should be about σ 2 χ21 .
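One quick consistency check on that degrees-of-freedom accounting (a check, not a proof): since a $\chi^2_k$ variable has expectation $k$, the two distributional facts above imply

```latex
\mathbb{E}\left[\frac{n s^2_Y}{\sigma^2}\right] = n-1,
\qquad
\mathbb{E}\left[\frac{n\hat{\sigma}^2}{\sigma^2}\right] = n-2,
\qquad\text{so}\qquad
\mathbb{E}\left[\frac{n(s^2_Y - \hat{\sigma}^2)}{\sigma^2}\right] = (n-1)-(n-2) = 1 = \mathbb{E}\left[\chi^2_1\right]
```

so at least the expectations line up with the claimed $\chi^2_1$ distribution.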
Taking the previous paragraph on trust, then, let's look at a ratio of
variances:
\begin{align}
\frac{s^2_Y - \hat{\sigma}^2}{\hat{\sigma}^2} &= \frac{n(s^2_Y - \hat{\sigma}^2)}{n\hat{\sigma}^2} \tag{1} \\
&= \frac{n(s^2_Y - \hat{\sigma}^2)/\sigma^2}{n\hat{\sigma}^2/\sigma^2} \tag{2} \\
&= \frac{\chi^2_1}{\chi^2_{n-2}} \tag{3}
\end{align}
To get our F distribution, then, we need to use as our test statistic
\[
\frac{s^2_Y - \hat{\sigma}^2}{\hat{\sigma}^2} \, \frac{n-2}{1}
= \left( \frac{s^2_Y}{\hat{\sigma}^2} - 1 \right) (n-2)
\]
which will have an F1,n−2 distribution under the null hypothesis that β1 = 0.
Note that the only random, data-dependent part of this is the ratio s²Y/σ̂².
We reject the null β1 = 0 when this is too large, compared to what’s expected
under the F1,n−2 distribution. Again, this is the distribution of the test statistic
under the null β1 = 0. The variance ratio will tend to be larger under the
alternative, with its expected size growing with |β1 |.
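The claim that (s²Y/σ̂² − 1)(n − 2) has an F1,n−2 distribution under the null can also be checked by brute force. The sketch below is Python rather than R (the notes' own simulator, sim.gnslrm, is not reproduced here), and the parameter values, seed, and tolerance are all arbitrary; it fits the simple linear model by the usual least-squares formulas and compares the simulated mean of the statistic to the F1,n−2 mean, (n − 2)/(n − 4).

```python
import random

def f_stat_null(n, beta0, sigma, rng):
    # Simulate one data set from the null model: Y = beta0 + noise, beta1 = 0
    x = [i / (n - 1) for i in range(n)]
    y = [beta0 + rng.gauss(0, sigma) for _ in range(n)]
    xbar, ybar = sum(x) / n, sum(y) / n
    # Least-squares slope and intercept
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    # Sample variance of Y, and MLE of the noise variance (divide by n)
    s2Y = sum((yi - ybar) ** 2 for yi in y) / n
    s2hat = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / n
    return (s2Y / s2hat - 1) * (n - 2)

rng = random.Random(16)
n = 42
stats = [f_stat_null(n, 5.0, 0.1, rng) for _ in range(10000)]
mean = sum(stats) / len(stats)
# F(1, n-2) has mean (n-2)/(n-4) when n > 4; here 40/38, about 1.053
assert abs(mean - (n - 2) / (n - 4)) < 0.08
```

The statistic is always non-negative here, since adding a regressor can only shrink the in-sample mean squared error, which is another small check on the algebra above.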
Running this F test in R The easiest way to run the F test for the regression
slope on a linear model in R is to invoke the anova function, like so:
anova(lm(y~x))
This will give you an analysis-of-variance table for the model. The actual
object the function returns is an anova object, which is a special type of data
frame. The columns record, respectively, degrees of freedom, sums of squares,
mean squares, the actual F statistic, and the p value of the F statistic. What
we’ll care about will be the first row of this table, which will give us the test
information for the slope on X.
To illustrate more concretely, let’s revisit our late friends in Chicago:
library(gamair); data(chicago)
death.temp.lm <- lm(death ~ tmpd, data=chicago)
anova(death.temp.lm)
As with summary on lm, the stars are usually a distraction; see Lecture 8 for
how to turn them off.
# Parameters
beta.0 <- 5
beta.1 <- 0 # We are simulating under the null!
sigma.sq <- 0.1
x.seq <- seq(from=-5, to=5, length.out=42)
# Run a bunch of simulations under the null and get all the F statistics
# Actual F statistic is in the 4th column of the output of anova()
f.stats <- replicate(1000, anova(lm(y~x, data= sim.gnslrm(x.seq, beta.0, beta.1,
sigma.sq, FALSE)))[1,4])
# Store histogram of the F statistics, but hold off on plotting it
null.hist <- hist(f.stats, breaks=50, plot=FALSE)
# Run a bunch of simulations under the alternative and get all the F statistics
alt.f <- replicate(1000, anova(lm(y~x, data=sim.gnslrm(x.seq, beta.0, -0.05,
sigma.sq, FALSE)))[1,4])
# Store a histogram of this, but again hold off on plotting
alt.hist <- hist(alt.f, breaks=50, plot=FALSE)
# Create an empty plot
plot(0, xlim=c(0,30), ylim=c(0,1.2), xlab="F", ylab="Density", type="n")
# Add the histogram of F under the alternative, then under the null
plot(alt.hist, freq=FALSE, add=TRUE, col="grey", border="grey")
plot(null.hist, freq=FALSE, add=TRUE)
# Finally, the theoretical F distribution
curve(df(x,1,length(x.seq)-2), add=TRUE, col="blue")
Figure 2: Comparing the actual distribution of F statistics when we simulate under
the null model (black histogram) to the theoretical F1,n−2 distribution (blue curve), and
to the distribution under the alternative β1 = −0.05.
Figure 3: Distribution of p-values from repeated simulations, under the null hypothesis
(black) and the alternative (grey). Notice how the p-values under the null are uniformly
distributed, while under the alternative they are bunched up towards small values at
the left.
ANOVA You will notice that I made no use of the ponderous machinery of
analysis of variance which the textbook wheels out in connection with the F
test. Despite (or because) of all of its size and complexity, this is really just a
historical relic. In serious applied work from the modern (say, post-1985) era, I
have never seen any study where filling out an ANOVA table for a regression,
etc., was at all important.
There is more to be said for analysis of variance where the observations
are divided into discrete, categorical groups, and one wants to know about
differences between groups vs. variation within a group. In a few weeks, when
we see how to handle categorical predictor variables, it will turn out that this
useful form of ANOVA is actually a special case of linear regression.
1.2 The Likelihood Ratio Test

Under the null hypothesis, asymptotically,
\[
2\Lambda \sim \chi^2_{p-q}
\]
Let me first try to give a little intuition, then hand-wave at the math, and
then work things through for testing β1 = 0 vs. β1 ≠ 0.
The null model is, as I said, a restriction of the alternative model. Any
possible parameter value for the null model is also allowed for the alternative.
This means the parameter space for the null, say ω, is a strict subset of that for
the alternative, ω ⊂ Ω. The maximum of the likelihood over the larger space
must be at least as high as the maximum over the smaller space:
\[
\max_{\theta \in \Omega} L(\theta) \geq \max_{\theta \in \omega} L(\theta)
\]
would be violated, for instance, if one of the parameters set to zero by the null
were a variance.) And that’s mostly it. Again, see §5 for more mathematical
details.
But
\[
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2
\]
Substituting into Eq. 4,
\[
L(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2)
= -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\hat{\sigma}^2 - \frac{n\hat{\sigma}^2}{2\hat{\sigma}^2}
= -\frac{n}{2}(1 + \log 2\pi) - \frac{n}{2}\log\hat{\sigma}^2
\]
So we get a constant which doesn’t depend on the parameters at all, and then
something proportional to log σ̂ 2 .
The intercept-only model works similarly, only its estimate of the intercept
is ȳ, and its noise variance, σ̂₀², is just the sample variance of the yi:
\[
L(\bar{y}, 0, s^2_Y) = -\frac{n}{2}(1 + \log 2\pi) - \frac{n}{2}\log s^2_Y
\]
Putting these together,
\[
L(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2) - L(\bar{y}, 0, s^2_Y)
= \frac{n}{2}\log\frac{s^2_Y}{\hat{\sigma}^2}
\]
Thus, under the null hypothesis,
\[
n \log \frac{s^2_Y}{\hat{\sigma}^2} \sim \chi^2_1
\]
Figure 4 shows a simulation confirming this bit of theory.
Connection to F tests The ratio s2Y /σ̂ 2 is, of course, the F -statistic, up to
constants not depending on the data. Since, for this problem, the likelihood
ratio test and the F test use equivalent test statistics, if we fix the same size or
level α for the two tests, they will have exactly the same power. In fact, even
for more complicated linear models — the “general linear tests” of the textbook
— the F test is always equivalent to a likelihood ratio test, at least when the
presumptions of the former are met. The likelihood ratio test, however, applies
to problems which do not involve Gaussian-noise linear models, while the F test
is basically only good for them. If you can only remember one of the two tests,
remember the likelihood ratio test.
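To spell out why the two statistics are equivalent, write $r = s^2_Y/\hat{\sigma}^2 \geq 1$ for the variance ratio. Then, from the derivations above,

```latex
F = (r - 1)(n - 2), \qquad 2\Lambda = n \log r
```

Both are strictly increasing functions of $r$, so "reject when $F$ is large" and "reject when $2\Lambda$ is large" pick out exactly the same data sets once the two rejection thresholds are matched to the same level $\alpha$.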
[Figure 4: histogram of the simulated values of 2Λ (horizontal axis 2Λ, vertical axis Density), with the theoretical χ²₁ density curve overlaid.]
# Simulate from the model 1000 times, capturing the variance ratios
var.ratios <- replicate(1000, sim.gnslrm(x.seq,beta.0,beta.1,sigma.sq))
# Convert variance ratios into log likelihood-ratios
LLRs <- (length(x.seq)/2)*log(var.ratios)
# Create a histogram of 2*LLR
hist(2*LLRs, breaks=50, freq=FALSE, xlab=expression(2*Lambda), main="")
# Add the theoretical chi^2_1 distribution
curve(dchisq(x,df=1), col="blue", add=TRUE)
assumptions, but under the weaker assumptions of uncorrelated noise, we’ll often approach
an F distribution asymptotically.
3 This is also what the likelihood ratio test of §1.2 is asking, just with a different notion of
measuring fit to the data (likelihood vs. squared error). Everything I’m about to say about
F tests applies, suitably modified, to likelihood ratio tests.
Suppose first that we retain the null hypothesis, i.e., we do not find any sig-
nificant share of variance associated with the regression. This could be because
(i) there is no such variance — the intercept-only model is right; (ii) there is
some variance, but we were unlucky; (iii) the test doesn’t have enough power
to detect departures from the null. To expand on that last point, the power
to detect a non-zero slope is going to increase with the sample size n, decrease
with the noise level σ 2 , and increase with the magnitude of the slope |β1 |. As
σ 2 /n → 0, the test’s power to detect any departures from the null → 1. If we
have a very powerful test, then we can reliably detect departures from the null.
If we don’t find them, then, we can be pretty sure they’re not there. If we have
a low-power test, not detecting departures from the null tells us little.4 If σ2 is
too big or n is too small, our test is inevitably low-powered. Without knowing
the power, retaining the null is ambiguous between “there’s no signal here” and
“we can’t tell if there’s a signal or not”. It would be more useful to look at
things like a confidence interval for the regression slope, or, if you must, for σ 2 .
Of course, there is also possibility (iv), that the real relationship is nonlinear,
but the best linear approximation to it has slope (nearly) zero, in which case
the F test will have no power to detect the nonlinearity.
Suppose instead that we reject the null, intercept-only hypothesis. This does
not mean that the simple linear model is right. It means that the latter model
predicts better than the intercept-only model — too much better to be due to
chance. The simple linear regression model can be absolute garbage, with every
single one of its assumptions flagrantly violated, and yet better than the model
which makes all those assumptions and thinks the optimal slope is zero.
Figure 5 provides simulation code for a simple set up where the true regres-
sion function is nonlinear and the noise around it has non-constant variance.
(Indeed, the regression curve is non-monotonic and the noise is multiplicative, not
additive.) Still, because a tilted straight line is a much better fit than a flat
line, the F test delivers incredibly small p-values — the largest, when I simulate
drawing 200 points from the model, is around 10−35 , which is about the prob-
ability of drawing any particular molecule from 3 billion liters of water. This is
the math’s way of looking at data like Figure 6 and saying “If you want to run
a flat line through this, instead of one with a slope, you’re crazy”5 . This is, of
course, true; it’s just not an answer to “Is simple linear model right here?”
there’s overwhelming evidence for the simple linear model, it again means that it’d be really
stupid to prefer a flat line to a tilted one.
Figure 5: Code to simulate a non-linear model with non-constant variance (in fact,
multiplicative rather than additive noise).
line) against the general one (tilted line). They do not test linearity, constant
variance, lack of correlation, or Gaussianity.
Figure 6: 200 points drawn from the non-linear, heteroskedastic model defined in
Figure 5 (black dots); the true regression curve (blue curve); the least-squares estimate
of the simple linear regression (red line). Anyone who’s read Lecture 6 and looks at
this can realize the linear model is badly wrong here.
Figure 7: Distribution of p values from the F test for the simple linear regression
model when the data come from the non-linear, heteroskedastic model of Figure 5, with
sample size of n = 200. The p-values are all so small that rather than plotting them,
I plot their logs in base 10, so the distribution is centered around a p-value of 10−60 ,
and the largest, least-significant p-values produced in ten thousand simulations were
around 10−35 .
3 R2
R2 has several definitions which are equivalent when we estimate a linear model
by least squares. The most basic one is the ratio of the sample variance of the
fitted values to the sample variance of Y .
\[
R^2 \equiv \frac{s^2_{\hat{m}}}{s^2_Y} \tag{5}
\]
Alternatively, it's the ratio of the sample covariance between Y and the fitted
values to the sample variance of Y:
\[
R^2 = \frac{c_{Y,\hat{m}}}{s^2_Y} \tag{6}
\]
Let's show that these are equal. Clearly, it's enough to show that the sample
variance of m̂ equals its covariance with Y. The key observations are (i) that
each yi = m̂(xi) + ei, while (ii) the sample covariance between ei and m̂(xi)
is exactly zero. Thus
\[
c_{Y,\hat{m}} = c_{\hat{m}+e,\hat{m}} = s^2_{\hat{m}} + c_{e,\hat{m}} = s^2_{\hat{m}}
\]
and we see that, for linear models estimated by least squares, Eqs. 5 and 6 do
in fact always give the same result.
That said, what is s²m̂? Since m̂(xi) = β̂0 + β̂1 xi, its sample variance is β̂₁² s²X,
and so
\[
R^2 = \hat{\beta}_1^2 \frac{s^2_X}{s^2_Y} \tag{7}
\]
\[
R^2 = \frac{s^2_Y - \hat{\sigma}^2}{s^2_Y} \tag{9}
\]
Since σ̂² is the sample variance of the residuals, and the residuals are uncorrelated
(in sample) with m̂, it's not hard to show that the numerator is equal
to s²m̂.
Limits for R2 From Eq. 7, it is clear that R2 will be 0 when β̂1 = 0. On the
other hand, if all the residuals are zero, then s²Y = β̂₁² s²X and R2 = 1. It is not
too hard to show that R2 can't possibly be bigger than 1, so we have marked
out the limits: a sample slope of 0 gives an R2 of 0, the lowest possible, and all
the data points falling exactly on a straight line gives an R2 of 1, the largest
possible.
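Eqs. 5, 6, 7 and 9 are all supposed to coincide for least-squares fits, and this is easy to check numerically on one simulated data set. The sketch below is Python rather than the notes' R, with made-up parameter values; it fits the line with the textbook formulas and evaluates each definition of R² directly.

```python
import random

rng = random.Random(7)
n = 100
x = [rng.uniform(0, 10) for _ in range(n)]
y = [3.0 + 2.0 * xi + rng.gauss(0, 5.0) for xi in x]  # made-up true model

xbar, ybar = sum(x) / n, sum(y) / n
s2X = sum((xi - xbar) ** 2 for xi in x) / n
s2Y = sum((yi - ybar) ** 2 for yi in y) / n
cXY = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
b1 = cXY / s2X          # least-squares slope
b0 = ybar - b1 * xbar   # least-squares intercept
mhat = [b0 + b1 * xi for xi in x]               # fitted values
s2m = sum((mi - ybar) ** 2 for mi in mhat) / n  # their mean equals ybar
cYm = sum((yi - ybar) * (mi - ybar) for yi, mi in zip(y, mhat)) / n
s2res = sum((yi - mi) ** 2 for yi, mi in zip(y, mhat)) / n  # sigma-hat^2

r2 = [s2m / s2Y,               # Eq. 5: variance of fitted values / variance of Y
      cYm / s2Y,               # Eq. 6: cov(Y, fitted) / variance of Y
      b1 ** 2 * s2X / s2Y,     # Eq. 7
      (s2Y - s2res) / s2Y]     # Eq. 9
assert max(r2) - min(r2) < 1e-9  # all four agree up to rounding
```

All four numbers agree to floating-point precision, as the algebra above says they must.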
3.1 Theoretical R2
Suppose we knew the true coefficients. What would R2 be? Using Eq. 5, we’d
see
\begin{align}
R^2 &= \frac{\mathrm{Var}[m(X)]}{\mathrm{Var}[Y]} \tag{10} \\
&= \frac{\mathrm{Var}[\beta_0 + \beta_1 X]}{\mathrm{Var}[\beta_0 + \beta_1 X + \epsilon]} \tag{11} \\
&= \frac{\mathrm{Var}[\beta_1 X]}{\mathrm{Var}[\beta_1 X + \epsilon]} \tag{12} \\
&= \frac{\beta_1^2 \mathrm{Var}[X]}{\beta_1^2 \mathrm{Var}[X] + \sigma^2} \tag{13}
\end{align}
Since all our parameter estimates are consistent, and this formula is continuous
in all the parameters, the R2 we get from our estimate will converge on this
limit.
As you will recall from lecture 1, even if the linear model is totally wrong,
our estimate of β1 will converge on Cov [X, Y ] /Var [X]. Thus, whether or not
the simple linear model applies, the limiting theoretical R2 is given by Eq. 13,
provided we interpret β1 appropriately.
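To put made-up numbers into Eq. 13: take $\beta_1 = 1$ and $\sigma^2 = 1$. Then

```latex
\mathrm{Var}[X] = 1 \;\Rightarrow\; R^2 = \frac{1}{1+1} = 0.5,
\qquad
\mathrm{Var}[X] = 9 \;\Rightarrow\; R^2 = \frac{9}{9+1} = 0.9
```

Exactly the same line and exactly the same noise level, but spreading the X values out more pushes R² from 0.5 to 0.9.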
(b) R2 can be arbitrarily close to 1 when the model is totally wrong. For
a demonstration, the R2 of the linear model fitted to the simulation
in §2 is 0.745. There is, indeed, no limit to how high R2 can get when
the true model is nonlinear. All that’s needed is for the slope of the
best linear approximation to be non-zero, and for Var [X] to get big.
2. R2 is also pretty useless as a measure of predictability.
(a) R2 says nothing about prediction error. Go back to Eq. 13, the ideal
case: even with σ 2 exactly the same, and no change in the coefficients,
R2 can be anywhere between 0 and 1 just by changing the range of X.
Mean squared error is a much better measure of how good predictions
are; better yet are estimates of out-of-sample error which we’ll cover
later in the course.
(b) R2 says nothing about interval forecasts. In particular, it gives us no
idea how big prediction intervals, or confidence intervals for m(x),
might be.
3. R2 cannot be compared across data sets: again, look at Eq. 13, and see
that exactly the same model can have radically different R2 values on
different data.
5. The one situation where R2 can be compared is when different models are
fit to the same data set with the same, untransformed response variable.
Then increasing R2 is the same as decreasing in-sample MSE (by Eq. 9).
In that case, however, you might as well just compare the MSEs.
6. It is very common to say that R2 is “the fraction of variance explained” by
the regression. This goes along with calling R2 “the coefficient of determi-
nation”. These usages arise from Eq. 9, and have nothing to recommend
them. Eq. 8 shows that if we regressed X on Y , we’d get exactly the same
R2 . This in itself should be enough to show that a high R2 says nothing
about explaining one variable by another. It is also extremely easy to
devise situations where R2 is high even though neither one could possibly
explain the other6 . Unless you want to re-define the verb “to explain” in
6 Imagine, for example, regressing deaths in Chicago on the number of Chicagoans wearing
t-shirts each day. For that matter, imagine regressing the number of Chicagoans wearing
t-shirts on the number of deaths. For thousands of examples with even less to recommend
them as explanations, see http://www.tylervigen.com/spurious-correlations.
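The algebra behind that symmetry claim follows from Eq. 7 and the fact that the least-squares slope is $\hat{\beta}_1 = c_{XY}/s^2_X$:

```latex
R^2 = \hat{\beta}_1^2 \frac{s^2_X}{s^2_Y}
= \left(\frac{c_{XY}}{s^2_X}\right)^2 \frac{s^2_X}{s^2_Y}
= \frac{c_{XY}^2}{s^2_X s^2_Y}
```

which is completely unchanged if we swap the roles of X and Y.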
\[
\rho_{XY} = \frac{\mathrm{Cov}[X,Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}
\]
which lies between −1 and 1. It takes its extreme values when Y is a linear
function of X.
Recall, from lecture 1, that the slope of the ideal linear predictor β1 is
\[
\beta_1 = \frac{\mathrm{Cov}[X,Y]}{\mathrm{Var}[X]}
\]
so
\[
\rho_{XY} = \beta_1 \sqrt{\frac{\mathrm{Var}[X]}{\mathrm{Var}[Y]}}
\]
It's also straightforward to show (Exercise 1) that if we regress Y/√Var[Y], the
standardized version of Y, on X/√Var[X], the standardized version of X, the
regression coefficient we'd get would be ρXY.
In 1954, the great statistician John W. Tukey wrote (Tukey, 1954, p. 721)
Does anyone know when the correlation coefficient is useful, as
opposed to when it is used? If so, why not tell us?
Sixty years later, the world is still waiting for a good answer8 .
7 Some people (e.g., Weisburd and Piquero 2008; Low-Décarie et al. 2014) have attempted
to gather all the values of R2 reported in scientific papers on, say, ecology or crime, to see
if ecologists or criminologists have gotten better at explaining the phenomena they study. I
hope it’s clear why these exercises are pointless.
8 To be scrupulously fair, Tukey did admit there was one clear case where correlation
coefficients were useful; they are, as we have just seen, basically the slopes in simple linear
regressions. But even so, as soon as we have multiple predictors (as we will in two weeks),
regression will no longer match up with correlation. Also, covariances are useful, but why
turn a covariance into a correlation?
We’re assuming that the true parameter value, call it θ, lies in the restricted
class of models ω. So there are q components to θ which matter, and the
other p − q are fixed by the constraints defining ω. To simplify the book-
keeping, let’s say those constraints are all that the extra parameters are zero,
so θ = (θ1 , θ2 , . . . θq , 0, . . . 0), with p − q zeroes at the end.
The restricted MLE θ̂ obeys the constraints, so
\[
\hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_q, 0, \ldots, 0) \tag{14}
\]
The unrestricted MLE does not have to obey the constraints, so it's
\[
\hat{\Theta} = (\hat{\Theta}_1, \hat{\Theta}_2, \ldots, \hat{\Theta}_q, \hat{\Theta}_{q+1}, \ldots, \hat{\Theta}_p) \tag{15}
\]
\begin{align}
\hat{\Theta}_i &\sim N\!\left(\theta_i, -1/(nF_{ii})\right) \tag{19} \\
\hat{\theta}_i &\sim N\!\left(\theta_i, -1/(nF_{ii})\right), \quad 1 \le i \le q \tag{20} \\
\hat{\theta}_i &= 0, \quad q+1 \le i \le p \tag{21}
\end{align}
so the whole second summation has a χ²p−q distribution. The first summation,
meanwhile, goes to zero because Θ̂i and θ̂i are actually strongly correlated, so
their difference is O(1/n), and their difference squared is O(1/n²). Since L,ii is
only O(n), that summation drops out.
A somewhat less hand-wavy version of the argument uses the fact that the
MLE is really a vector, with a multivariate normal distribution which depends
on the inverse of the Fisher information matrix:
\[
\hat{\Theta} \sim N\!\left(\theta, (1/n)F^{-1}\right) \tag{23}
\]
Then, at the cost of more linear algebra, we don’t have to assume that the
Hessian is diagonal.
6 Concluding Comment
The tone I have taken when discussing F tests, R2 and correlation has been
dismissive. This is deliberate, because they are grossly abused and over-used in
current practice, especially by non-statisticians, and I want you to be too proud
(or too ashamed) to engage in those abuses. In a better world, we’d just skip
over them, but you will have to deal with colleagues, and bosses, who learned
their statistics in the bad old days, and so have to understand what they’re
doing wrong. (“Science advances funeral by funeral”.)
In all fairness, the people who came up with these tools were great scientists,
struggling with very hard problems when nothing was clear; they were inventing
all the tools and concepts we take for granted in a class like this. Anyone in this
class, me included, would be doing very well to come up with one idea over the
whole of our careers which is as good as R2 . But we best respect our ancestors,
and the tradition they left us, when we improve that tradition where we can.
Sometimes that means throwing out the broken bits.
7 Further Reading
Refer back to lecture 7 on diagnostics for ways of actually checking whether the
relationship between Y and X is linear (along with the other assumptions of the
model). We will come back to the topic of conducting formal tests of linearity,
or other parametric regression specifications, in 402.
Refer back to lecture 8, on parametric inference, for advice on when it is
actually interesting to test the hypothesis β1 = 0.
Full mathematical treatments of likelihood ratio tests can be found in many
textbooks, e.g., Schervish (1995) or Gouriéroux and Monfort (1989/1995, vol.
II). The original proof that it has a χ2p−q asymptotic distribution was given
by Wilks (1938). Vuong (1989) provides an interesting and valuable treatment
of what happens to the likelihood ratio test when neither the null nor the al-
ternative is strictly true, but we want to pick the one which is closer to the
truth; that paper also develops the theory when the null is not a restriction
of the alternative, but rather the two hypotheses come from distinct statistical
models.
People have been warning about the fallacy of R2 to measure goodness of
fit for a long time (Anderson and Shanteau, 1977; Birnbaum, 1973), apparently
without having much of an impact. (See Hamilton (1996) for a discussion of how
academic communities can keep on teaching erroneous ideas long after they’ve
been shown to be wrong, and some speculations about why this happens.)
That R2 has got nothing to do with explaining anything has also been pointed
out, time after time, for decades (Berk, 2004). A small demo of just how silly
“variance explained” can get, using the Chicago data, can be found at
http://bactra.org/weblog/874.html. Just what it means to give a proper
scientific explanation, and what role statistical models might play in doing so,
is a topic full of debate, not to say confusion. Shmueli (2010) attempts to
relate some of these debates to the practical conduct of statistical modeling.
Personally, I have found Salmon (1984) very helpful in thinking about these
issues.
Exercises
To think through or to practice on, not to hand in.
1. Define Ỹ = Y/√Var[Y] and X̃ = X/√Var[X]. Show that the slope of
the optimal linear predictor of Ỹ from X̃ is ρXY.
2. Work through the likelihood ratio test for testing regression through the
origin (β0 = 0) against the simple linear model (β0 ≠ 0); that is, write Λ
in terms of the sample statistics and simplify as much as possible. Under
the null hypothesis, 2Λ follows a χ2 distribution with a certain number of
degrees of freedom: how many?
References
Anderson, Norman H. and James Shanteau (1977). “Weak inference with
linear models.” Psychological Bulletin, 84: 1155–1170. doi:10.1037/0033-
2909.84.6.1155.
Berk, Richard A. (2004). Regression Analysis: A Constructive Critique. Thou-
sand Oaks, California: Sage.
Birnbaum, Michael H. (1973). “The Devil Rides Again: Correlation as an Index
of Fit.” Psychological Bulletin, 79: 239–242. doi:10.1037/h0033853.
Gouriéroux, Christian and Alain Monfort (1989/1995). Statistics and Econo-
metric Models. Themes in Modern Econometrics. Cambridge, England: Cam-
bridge University Press. Translated by Quang Vuong from Statistique et
modèles économétriques, Paris: Économica.
Hamilton, Richard F. (1996). The Social Misconstruction of Reality: Validity
and Verification in the Scholarly Community. New Haven: Yale University
Press.
Low-Décarie, Etienne, Corey Chivers and Monica Granados (2014). “Rising
complexity and falling explanatory power in ecology.” Frontiers in Ecology
and the Environment, 12: 412–418. doi:10.1890/130230.
Salmon, Wesley C. (1984). Scientific Explanation and the Causal Structure of
the World . Princeton: Princeton University Press.
Schervish, Mark J. (1995). Theory of Statistics. Berlin: Springer-Verlag.
Shmueli, Galit (2010). “To Explain or to Predict?” Statistical Science, 25:
289–310. doi:10.1214/10-STS330.
Vuong, Quang H. (1989). “Likelihood Ratio Tests for Model Selection and Non-
Nested Hypotheses.” Econometrica, 57: 307–333. URL http://www.jstor.org/pss/1912557.
Weisburd, David and Alex R. Piquero (2008). “How Well Do Criminologists Ex-
plain Crime? Statistical Modeling in Published Studies.” Crime and Justice,
37: 453–502. doi:10.1086/524284.
Wilks, S. S. (1938). “The Large Sample Distribution of the Likelihood Ra-
tio for Testing Composite Hypotheses.” Annals of Mathematical Statistics,
9: 60–62. URL http://projecteuclid.org/euclid.aoms/1177732360.
doi:10.1214/aoms/1177732360.